Category Archives: Tutorials

The Powers and Pitfalls of Power-Law Analyses

People love power-laws. In the 90s and early 2000s it seemed like they were found everywhere. Yet early power-law studies rarely subjected their data distributions to rigorous tests, which diminished the value of some of that work. Since an influential study by Aaron Clauset of CU Boulder, Cosma Shalizi of Carnegie Mellon, and Mark Newman of the University of Michigan, researchers have become aware that not all distributions that look power-law-like are actually power-laws.

But power-law analyses can be incredibly useful. In this post I first show what a power-law is, second present a case study where these analyses are appropriate, and third walk you through using these analyses to understand the distributions in your own data.

 

What is a power-law?

A power-law describes a distribution of something (wealth, connections in a network, sizes of cities) that follows what is known as the law of preferential attachment. In a power-law there will be many of the smallest objects and increasingly fewer of the larger ones, yet the largest objects disproportionately attract the highest quantities of stuff.

The world wide web follows a power-law. Many sites (like Simulating Complexity) get small amounts of traffic, but some sites (like Google, for example) get high amounts of traffic. Then, because they get more traffic, they attract even more visits. Cities also tend to follow power-law distributions, with many small towns and few very large cities, but those large cities seem to keep getting larger. Austin, TX, for example, gains 157.2 new citizens per day, making it the fastest-growing city in the United States. People are attracted to it because people keep moving there, which perpetuates the growth. Theoretically there should be a limit, though maybe the limit will be turning our planet into a Texas-themed Coruscant.

This is in direct contrast to log-normal distributions. Log-normal distributions follow the law of proportional effect: as something increases in size, it grows in predictable proportion to what came before. Larger things in log-normal distributions do not attract exponentially more things; they have a proportional amount of what came before. For example, experience and income should follow a log-normal distribution. As someone works in a job longer they should get promotions that reflect their experience. When we look at incomes across a region, a more log-normal distribution reflects greater equality, whereas a more power-law-like distribution reflects greater inequality. Modern incomes seem to follow log-normality up to a point, after which they follow a power-law, showing that the richest attract disproportionately more wealth, while below a certain threshold wealth is predictable.
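To see the difference, here is a quick base-R sketch (my own illustrative example, with arbitrary parameter values) that simulates a sample from each distribution and plots the rank-size curves on log-log axes, where a power-law tail shows up as a straight line:

# Illustrative only: simulate log-normal and power-law samples and
# compare their rank-size curves on log-log axes.
set.seed(42)
n     <- 10000
xmin  <- 1
alpha <- 2.5

lnorm_sample <- rlnorm(n, meanlog = 0, sdlog = 1)
# Inverse-transform sampling for a continuous power-law with exponent alpha:
pl_sample <- xmin * (1 - runif(n))^(-1 / (alpha - 1))

par(mfrow = c(1, 2))
plot(sort(lnorm_sample, decreasing = TRUE), log = "xy",
     main = "log-normal", xlab = "rank", ylab = "value")
plot(sort(pl_sample, decreasing = TRUE), log = "xy",
     main = "power-law", xlab = "rank", ylab = "value")

The log-normal curve bends over on these axes, while the power-law sample traces a straight line across its tail.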

If we analyze the distribution of modern incomes in a developing nation and see that they follow a power-law distribution, we will understand that there is a ‘rich get richer’ dynamic in that country, whereas if we see the incomes follow a log-normal distribution we would understand that that country had greater internal equality. We might want to know this to help influence policy.

When we analyze power-laws, however, we don’t want to just look at the graph that is created and say “Yeah, I think that looks like a power-law.” Early studies seemed to do just that. Thankfully Clauset et al. came up with rigorous methods to examine a distribution of data and see if it’s a power-law, or if it follows another distribution (such as log-normal). Below I show how to use these tools in R.

 

Power-law analyses and archaeology

So, if modern analyses of these distributions can tell us something about the equality (log-normal) or inequality (power-law) of a population, then these tools can be useful for examining the lifeways of past people. We might ask whether prehistoric cities also follow a power-law distribution, suggesting that the largest cities offered greater social (and potentially economic) benefits, much like modern cities. Or we might want to know whether societies in prehistory were more egalitarian or more hierarchical, looking at distributions of income and wealth (as archaeologists define them). Power-law analyses of distributions of artifacts or settlement sizes would enable us to understand the development of inequality in the past.

Clifford Brown et al. talked about these very issues in their chapter Poor Mayapan from the book The Ancient Maya of Mexico edited by Braswell. While they don’t use the statistical tools I present below, they do present good arguments for why and when power-law versus other types of distributions would occur, and I would recommend tracking down this book and reading it if you’re interested in using power-law analyses in archaeology. Specifically they suggest that power-law distributions would not occur randomly, so there is intentionality behind those power-law-like distributions.

I recently used power-law and log-normal analyses to try to understand the development of hierarchy in the American Southwest. The results of this study will be published in 2017 in American Antiquity. Briefly, I wanted to look at multiple types of evidence, including ceremonial structures, settlements, and simulation data, to understand the mechanisms that could have led to hierarchy and whether (and when) Ancestral Pueblo groups were more egalitarian or more hierarchical. Since I was comparing multiple different datasets, I needed a method to compare them quantitatively. Thus I turned to Clauset's methods.

Clauset's methods have since been implemented by Gillespie in the R package poweRlaw.

Below I will go over the poweRlaw package with a built-in dataset, the Moby Dick words dataset, which counts the frequency of different words. For example, there are many instances of the word "the" (19815, to be exact) but very few instances of other words, like "lamp" (34 occurrences), "choice" (5 occurrences), or "exquisite" (1 occurrence). (Side note: I randomly guessed at each of these words, assuming each would have fewer occurrences. My friend Simon DeDeo tells me that 'exquisite' in this case is a hapax legomenon, a term that has only one recorded use. Thanks Simon.) To see more go to http://roadtolarissa.com/whalewords/.

In my research I used other datasets that measured physical things (the size of roomblocks, kivas, and territories) so there’s a small mental leap for using a new dataset, but this should allow you to follow along.

 

The Tutorial

Open R.

Load the poweRlaw package

library("poweRlaw")

Add in the data

data("moby", package="poweRlaw")

This will load the data into your R session.

Side note:

If you are loading in your own data, you first load it in like you normally would, e.g.:

data <- read.csv("data.csv")

Then if you were subsetting your data you’d do something like this:

a <- subset(data, Temporal_Assignment != 'Pueblo III (A.D. 1140-1300)')

 

Next you have to decide if your data is discrete or continuous. What do I mean by this?

Discrete data can only take on particular values. In the case of the Moby Dick dataset, since we are counting physical words, this data is discrete. You can have 1 occurrence of exquisite and 34 occurrences of lamp. You can’t have 34.79 occurrences of it—it either exists or it doesn’t.

Continuous data is something that doesn't fit into simple discrete entities, but whose measurement can exist on a long spectrum. Height, for example, is continuous. Even if we bin peoples' heights into neat categories (e.g., 6 feet tall, or 1.83 meters), a person's height probably has some trailing digits, so they aren't exactly 6 feet, but maybe 6.000127 feet tall. If we are being precise in our measurements, that would be continuous data.

The data I used in my article on kiva, settlement, and territory sizes was continuous. This Moby Dick data is discrete.
The reason this matters is the poweRlaw package has two separate functions for continuous versus discrete data. These are:

conpl for continuous data, and

displ for discrete data

You can technically use either function and you won’t get an error from R, but the results will differ slightly, so it’s important to know which type of data you are using.

In the tutorial written here I will be using the displ function since the Moby dataset is discrete. Substitute in conpl for any continuous data.

So, to create the power-law object, we first fit displ to the data. So,

pl_a <- displ$new(moby)

We then want to estimate the xmin value. Power-laws are usually only power-law-like in their tails; the early part of the distribution is much more variable, so we find a minimum value below which we say "computer, just ignore that stuff."

However, first I like to look at what the xmin value currently is, just to see that the code is working. So:

pl_a$getXmin()

Then we estimate and set the xmin. This is the code that does that:

est <- estimate_xmin(pl_a)

We then update the power-law object with the new x-min value:

pl_a$setXmin(est)

We do a similar thing to estimate the exponent α of the power-law. The package refers to the parameters as pars, so:

pl_a$getPars()

estimate_pars(pl_a)

Then we also want to know how well our data fit a power-law. For this we estimate a p-value (explained in Clauset et al.). Here is the code to do that (and output those data):

booty <- bootstrap_p(pl_a)

This will take a little while, so sit back and drink a cup of coffee while R chunks for you.

Then look at the output:

booty

Alright, we don’t need the whole sim, but it’s good to have the goodness of fit (gof: 0.00825) and p value (p: 0.75), so this code below records those for you.

variables <- c("p", "gof")

bootyout <- booty[variables]

write.table(bootyout, file="/Volumes/file.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

 

Next, we need to see if our data better fits a log-normal distribution. Here we compare our dataset to a log-normal distribution, and then compare the p-values and perform a goodness-of-fit test. If you have continuous data you’d use conlnorm for a continuous log normal distribution. Since we are using discrete data with the Moby dataset we use the function dislnorm. Again, just make sure you know which type of data you’re using.

### Estimating a log normal fit

aa <- dislnorm$new(moby)

We then set the xmin in the log-normal dataset so that the two distributions are comparable.

aa$setXmin(pl_a$getXmin())

Then we estimate the slope as above

est2 <- estimate_pars(aa)

aa$setPars(est2$pars)

Now we compare our two distributions. Please note that it matters which order you put these in. Here I have the power-law value first with the log-normal value second. I discuss what ramifications this has below.

comp <- compare_distributions(pl_a, aa)

Then we actually print out the stats:

comp

And then I create a printable dataset that we can then look at later.

myvars <- c("test_statistic", "p_one_sided", "p_two_sided")

compout <- comp[myvars]

write.table(compout, file="/Volumes/file2.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

And now all we have left to do is graph it!

 

pdf(file='/Volumes/Power_Law.pdf', width=5.44, height=3.5, bg="white", paper="special", family="Helvetica", pointsize=8)

par(mar=c(4.1,4.5,0.5,1.2))

par(oma=c(0,0,0,0))

plot(pl_a, col='black', log='xy', xlab='', ylab='', xlim=c(1,400), ylim=c(0.01,1))

lines(pl_a, col=2, lty=3, lwd=2)

lines(aa, col=3, lty=2, lwd=1)

legend("bottomleft", cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c("powerlaw fit", "log normal fit"), lwd=1, yjust=0.5, xjust=0.5, bty="n")

text(x=70, y=1, cex=1, pos=4, labels=paste("Power law p-value: ", bootyout$p))

mtext("All regions, Size", side=1, line=3, cex=1.2)

mtext("Relative frequencies", side=2, line=3.2, cex=1.2)

box()

dev.off()

Now, how do you actually tell which is better, the log normal or power-law? Here is how I describe it in my upcoming article:

 

The alpha parameter reports the slope of the best-fit power-law line. The power-law probability reports the probability that the empirical data could have been generated by a power law; the closer that statistic is to 1, the more likely that is. We consider values below 0.1 as rejecting the hypothesis that the distribution was generated by a power law (Clauset et al. 2009:16). The test statistic indicates how closely the empirical data match the log normal. Negative values indicate log-normal distributions, and the higher the absolute value, the more confident the interpretation. However, it is possible to have a test statistic that indicates a log-normal distribution in addition to a power-law probability that indicates a power-law, so we employ the compare distributions test to compare the fit of the distribution to a power-law and to the log-normal distribution. Values below 0.4 indicate a better fit to the log-normal; those above 0.6 favor a power-law; intermediate values are ambiguous. Please note, though, that it depends on what order you put the two distributions in the R code: if you put log-normal in first in the above compare distributions code, then the above would be reversed—those below 0.4 would favor power-laws, while above 0.6 would favor log normality. I may be wrong, but as far as I can tell it doesn’t actually matter which order you put the two distributions in, as long as you know which one went first and interpret it accordingly.
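To make those rules of thumb concrete, here is one way you might encode them in R against the objects created above. This is a sketch only: it assumes the one-sided p-value from compare_distributions is the statistic being thresholded, with the power-law object passed first as in the code above.

# Rules of thumb from the text, encoded against bootyout and compout:
if (bootyout$p < 0.1) {
  message("Reject the hypothesis that a power-law generated the data")
}
if (compout$p_one_sided < 0.4) {
  message("Better fit: log-normal")
} else if (compout$p_one_sided > 0.6) {
  message("Better fit: power-law")
} else {
  message("Ambiguous: neither distribution is clearly preferred")
}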

 

So, there you have it! Now you can run a power-law analysis on many types of data distributions to examine if you have a rich-get-richer dynamic occurring! Special thanks to Aaron Clauset for answering my questions when I originally began pursuing this research.

 

Full code at the end:

 

library("poweRlaw")

data("moby", package="poweRlaw")

pl_a <- displ$new(moby)

pl_a$getXmin()

est <- estimate_xmin(pl_a)

pl_a$setXmin(est)

pl_a$getPars()

estimate_pars(pl_a)

 

 

booty <- bootstrap_p(pl_a)

variables <- c("p", "gof")

bootyout <- booty[variables]

#write.table(bootyout, file="/Volumes/file.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

 

### Estimating a log normal fit

aa <- dislnorm$new(moby)

aa$setXmin(pl_a$getXmin())

est2 <- estimate_pars(aa)

aa$setPars(est2$pars)

 

comp <- compare_distributions(pl_a, aa)

comp

 

myvars <- c("test_statistic", "p_one_sided", "p_two_sided")

compout <- comp[myvars]

write.table(compout, file="/Volumes/file2.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

 

pdf(file='/Volumes/Power_Law.pdf', width=5.44, height=3.5, bg="white", paper="special", family="Helvetica", pointsize=8)

par(mar=c(4.1,4.5,0.5,1.2))

par(oma=c(0,0,0,0))

plot(pl_a, col='black', log='xy', xlab='', ylab='', xlim=c(1,400), ylim=c(0.01,1))

lines(pl_a, col=2, lty=3, lwd=2)

lines(aa, col=3, lty=2, lwd=1)

legend("bottomleft", cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c("powerlaw fit", "log normal fit"), lwd=1, yjust=0.5, xjust=0.5, bty="n")

text(x=70, y=1, cex=1, pos=4, labels=paste("Power law p-value: ", bootyout$p))

mtext("All regions, Size", side=1, line=3, cex=1.2)

mtext("Relative frequencies", side=2, line=3.2, cex=1.2)

box()

dev.off()

Complex social dynamics in a few lines of code

To prove that there is a world beyond agents, turtles and all things ABM, we have created a neat little tutorial in system dynamics implemented in Python.

Delivered by Xavier Rubio-Campillo and Jonas Alcaina just a few days ago at the annual Digital Humanities conference (this year held in the most wonderful of all cities – Krakow), it is tailored to humanities students so it does not require any previous experience in coding.

System dynamics is a type of mathematical or equation-based modelling. Archaeologists (with a few noble exceptions) have so far shunned what is often perceived as 'pure math', mostly citing the 'too simplistic' argument, when trauma from an awful mathematics teacher was probably the real reason. However, in many cases an ABM is complete overkill when a simple system dynamics model would be well within one's abilities. So give it a go, if only to 'dewizardify'* the equations.

Follow this link for the zip file with the tutorial: https://zenodo.org/record/57660#.V4YIIKu7Ldk

*the term ‘dewizardify’ courtesy of SSI fellow Robert Davey (@froggleston)

Building a Schelling Segregation Model in R

Happy New Year! Last year, our good friend Shawn over at Electric Archaeology introduced us to an excellent animated, interactive representation of Thomas Schelling’s (1969) model of segregation called “Parable of the Polygons”. I’ve always liked Schelling’s model because I think it illustrates the concepts of self-organization and emergence, and is also easy to explain, so it works as a useful example of a complex system. In the model, individuals situated in a gridded space decide whether to stay put or move based on a preference for neighbours like them. The model demonstrates how features of segregated neighborhoods can emerge even when groups are relatively ‘tolerant’ in their preferences for neighbors.

Here, I’ve created a simple version of Schelling’s model using R (building on Marco Smolla’s excellent work on creating agent-based models in R). Schelling’s model is situated on a grid, and in its simplest form, the cells of the grid will be in one of three states: uninhabited, inhabited by a member of one group, or inhabited by a member of a second group. This could be represented as a matrix of numbers, with each element being either 0, 1, or 2. So we’ll bring together these components as follows:

number<-2000
group<-c(rep(0,(51*51)-number),rep(1,number/2),rep(2,number/2))
grid<-matrix(sample(group,2601,replace=F), ncol=51)

par(mfrow=c(1,2))
image(grid,col=c("black","red","green"),axes=F)
plot(runif(100,0,1),ylab="percent happy",xlab="time",col="white",ylim=c(0,1))

Here, we start with a 51 x 51 grid containing 2000 occupied cells. To create this, a vector called group is generated that contains 1000 1s, 1000 2s, and the remainder 0s. These are collated into a matrix called grid through random sampling of the group vector. Finally, the matrix is plotted as an image where occupied cells are colored green or red depending on their number, while unoccupied cells are colored black, like so:

[Figure: the initial grid, with occupied cells in red and green scattered across a black background]
The next step is to establish the common preference for like neighbors, and to set up a variable which tracks the overall happiness of the population.

alike_preference<-0.60
happiness_tracker<-c()

Finally, we'll need a function, which we will call get_neighbors, to establish who the neighbors are for a given patch. To do this, we'll feed the function a pair of xy-coordinates as a vector (e.g., c(2, 13)), and use a for-loop to pull each neighbor in the Moore neighborhood (the 8 surrounding patches) in order, counterclockwise from the right. Then we'll need to ensure that if a neighboring cell goes beyond the bounds of the grid (>51 or <1), we wrap around by grabbing the cell at the opposite end of the grid. This function returns eight pairs of coordinates as a matrix.

get_neighbors<-function(coords) {
  n<-c()
  for (i in c(1:8)) {
 
    if (i == 1) {
      x<-coords[1] + 1
      y<-coords[2]
    }

    if (i == 2) {
      x<-coords[1] + 1
      y<-coords[2] + 1
    }
  
    if (i == 3) {
      x<-coords[1]
      y<-coords[2] + 1
    }
    
    if (i == 4) {
      x<-coords[1] - 1
      y<-coords[2] + 1
    }
    
    if (i == 5) {
      x<-coords[1] - 1
      y<-coords[2]
    }
    
    if (i == 6) {
      x<-coords[1] - 1
      y<-coords[2] - 1
    }
   
    if (i == 7) {
      x<-coords[1]
      y<-coords[2] - 1
    }
    
    if (i == 8) {
      x<-coords[1] + 1
      y<-coords[2] - 1
    }
   
    if (x < 1) {
      x<-51
    }
    if (x > 51) {
      x<-1
    }
    if (y < 1) {
      y<-51
    }
    if (y > 51) {
      y<-1
    }
    n<-rbind(n,c(x,y))
  }
  n
}

Now to get into the program. We'll run the process 1000 times to get output, so the whole thing will be embedded in a for loop. Then, we'll set up some variables which keep track of happy versus unhappy cells:

for (t in c(1:1000)) {
happy_cells<-c()
unhappy_cells<-c()  

Each of these tracker vectors (happy_cells and unhappy_cells) will keep track of additional vectors that contain coordinates of cells that are happy or unhappy.

Next, we'll use two for loops to iterate through each row and column in the matrix. For each cell (here called current), we'll take its value (0, 1, or 2). If the cell is not empty (that is, does not have a value of 0), then we'll create variables that keep track of the number of like neighbors and the total number of neighbors (that is, neighboring cells which are inhabited), and then we'll use our get_neighbors function to generate a matrix of coordinates called neighbors. Then we'll use a for loop to iterate through each of those neighbors and compare their values to the value of the current patch. If it is a match, we add 1 to the number of like neighbors. If it is inhabited, we add 1 to the total number of neighbors (a variable called all_neighbors). Then we divide the number of like neighbors by the total number of neighbors, and compare that number to the like-neighbor preference to determine whether the current patch is happy or not (the is.nan function is used here to escape situations where a cell is completely isolated and division by 0 would occur). Happy patches are added to our happy_cells variable, while unhappy patches are added to our unhappy_cells variable, both as matrices of coordinates.

for (j in c(1:51)) {
  for (k in c(1:51)) {
    current<-c(j,k)
    value<-grid[j,k] 
    if (value > 0) {
      like_neighbors<-0
      all_neighbors<-0
      neighbors<-get_neighbors(current)
      for (i in c(1:nrow(neighbors))){
        x<-neighbors[i,1]
        y<-neighbors[i,2]
        if (grid[x,y] > 0) {
          all_neighbors<-all_neighbors + 1
        }
        if (grid[x,y] == value) {
          like_neighbors<-like_neighbors + 1
        }
      }
      if (is.nan(like_neighbors / all_neighbors)==FALSE) {
        if ((like_neighbors / all_neighbors) < alike_preference) {
            unhappy_cells<-rbind(unhappy_cells,c(current[1],current[2]))
        }
          else {
            happy_cells<-rbind(happy_cells,c(current[1],current[2]))
          }
        }
   
      else {
        happy_cells<-rbind(happy_cells,c(current[1],current[2]))
      }
    }
  }
}

Next, we’ll get our overall happiness by dividing the number of happy cells by the total number of occupied cells, and update our happiness tracker by appending that value to the end of the vector.

happiness_tracker<-append(happiness_tracker,length(happy_cells)/(length(happy_cells) + length(unhappy_cells)))

Next, we'll get our unhappy patches to move to unoccupied spaces. To do this, we'll randomly sample the unhappy cells so we're not introducing a spatial bias. Then, we'll iterate through that sample, calling each patch in that group a mover, and picking a random spot in the grid as a move_to. A while loop will continue to pick a new random move_to while the current move_to is inhabited. Once an uninhabited move_to has been found, the mover's value is applied to that patch and removed from the original mover patch.

rand<-sample(nrow(unhappy_cells))
for (i in rand) {
  mover<-unhappy_cells[i,]
  mover_val<-grid[mover[1],mover[2]]
  move_to<-c(sample(1:51,1),sample(1:51,1))
  move_to_val<-grid[move_to[1],move_to[2]]
  while (move_to_val > 0 ){
    move_to<-c(sample(1:51,1),sample(1:51,1))
    move_to_val<-grid[move_to[1],move_to[2]]
  }
  grid[mover[1],mover[2]]<-0
  grid[move_to[1],move_to[2]]<-mover_val
}

Finally, we’ll check the output.

par(mfrow=c(1,2))
image(grid,col=c("black","red","green"),axes=F)
plot(runif(100,0,1),ylab="percent happy",xlab="time",col="white",ylim=c(0,1))
lines(happiness_tracker,oma = c(0, 0, 2, 0),col="red")
}

With the for loop we created around the whole program, we get animation in our graphical display. Here's what we get when we set the alike_preference value to 70 percent:

[Animation: alike_preference = 0.70]

And here’s what happens when it’s set to 72 percent:

[Animation: alike_preference = 0.72]

Finally, here’s what happens when it’s set to 80:

[Animation: alike_preference = 0.80]

For comparison, check out this version in NetLogo or this HTML version.

Run, Python, Run! Part 2: Speeding up your code

In the previous blog post in this series (see also here) we described how to profile code in Python. But knowing which lines of code are slow is just the start; the next step is to make them faster. Code optimization is a grand topic in software development and much ink has been spent describing various methods. Hence, I picked the brain of a senior pythonista: my Southampton colleague and the local go-to person for all Python enquiries, Max Albert. We have divided the optimisation techniques into two sections: (predominantly Max's) thoughts on deep optimisation and some quick fixes we both came across during our time with Python. For those with a formal computer science education many of the points may sound trivial but if, like many of us, you've learned to code out of necessity and therefore in a rather haphazard way, this post may be of interest.

Deep Code Optimisation

1. 'Premature optimisation is the root of all evil' said Donald Knuth in 1974 and the statement still holds true 40 years later. The correct order of actions in any code development is: first make the code run, then make it do what you want it to do, then check that it's actually doing that under all conditions (test it), and only then profile it and start optimising. Otherwise you run the risk of making the code unnecessarily complicated and wasting time on optimising bits that you blindly guessed were slow but which actually take an insignificant amount of time.

2. There are no points for spending hours on developing DIY solutions. In fact it has been suggested that coding should be renamed ‘googling stack overflow’ (as a joke, but it wouldn’t be funny if it wasn’t so true). The chances are that whatever you want to do, someone has done it before, and has done it better. Google it. The chances are that one of the stack overflow gurus had nothing better to do in their spare time and developed an algorithm that will fit your needs just fine and run at turbospeed.

3. Think through what you're doing. Use a correct algorithm and correct data structures. Don't cling to that one algorithm you developed some time ago for one task and have been tweaking ever since to do a number of other tasks. Similarly, even if lists are your favourite, check whether a dictionary (or a set) wouldn't be more efficient, as in the sketch below.
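A classic instance of the data-structure point is membership testing; a minimal sketch (sizes are arbitrary):

# A list is scanned element by element (O(n)); a set uses hashing (O(1) on average).
big_list = list(range(1000000))
big_set = set(big_list)

print(999999 in big_list)  # slow: walks the whole list
print(999999 in big_set)   # fast: a single hash lookup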

4. Think with the program; that is, step through the code in the same order as it is going to be executed. If you don't have a beautiful mind and cannot easily follow computer logic, use a debugger. It will take you through the code step by step. It will show you what repeats and how many times, as well as when and where things are stored. Sometimes it's worth storing things in memory, sometimes it makes sense to recompute them; you can test what's quicker by profiling the code with different implementations.
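If you have never used a debugger, Python ships with one, pdb, in the standard library. A typical session looks something like this (commands annotated):

$ python -m pdb script_name.py
(Pdb) b 42     # set a breakpoint at line 42
(Pdb) c        # continue until the next breakpoint
(Pdb) n        # execute the next line
(Pdb) s        # step into a function call
(Pdb) p x      # print the value of variable x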

Quick fixes

There are a few things that one can do straight away and, apparently, they should increase performance instantaneously. I say ‘apparently’ because with each new version of Python the uberpythonistas make things better and more efficient so it is likely that some of these tricks are not as effective as they used to be in the version of Python (and its libraries) you are using at the moment. Either way, it’s always worth trying out alternatives and profiling them to find the fastest option.
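The standard-library timeit module is handy for exactly this kind of micro-comparison, since it runs a snippet many times so that tiny differences become measurable; a minimal sketch comparing a manual loop with a built-in:

import timeit

loop_version = """
total = 0
for i in range(1000):
    total += i
"""
builtin_version = "total = sum(range(1000))"

print(timeit.timeit(loop_version, number=10000))
print(timeit.timeit(builtin_version, number=10000))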

1. Remove any possible calculations from your 'if' statements and 'for' loops. If there is anything you can pre-calculate and attach to a variable (as in a = b - 45 / c, and then use only 'a' in the loop), DO IT! It may add extra lines of code but remember that whatever is inside the loop will be repeated on each iteration (and if you have 10 000 agents over 100 000 steps then that's a hell of a lot of looping).
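For instance (a sketch with made-up values standing in for whatever b and c are in your model):

b, c = 10.0, 4.0

# Slow: b - 45 / c is recomputed on every pass through the loop.
results = []
for i in range(100000):
    results.append(i * (b - 45 / c))

# Faster: hoist the loop-invariant expression out and reuse it.
a = b - 45 / c
results = []
for i in range(100000):
    results.append(i * a)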

2. Use as many built-in functions as possible. They are actually written in C (the fast language) so anything you write in Python is likely to be slower. A good example is NumPy, the 'numerical python' library, which has arrays and a wide range of operations you can run on them instead of building lists. See this little essay about why this is the case. A more advanced version of this approach is to try Cython, a Python extension which, with a few relatively simple changes, could boost your code to near-C speed.
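A small sketch of the point (assuming NumPy is installed):

import numpy as np

values = list(range(1000000))
arr = np.arange(1000000)

# Pure-Python loop: the interpreter executes one addition at a time.
total = 0
for v in values:
    total += v

# NumPy built-in: the same sum runs in compiled C code.
total = arr.sum()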

3. Try using list comprehensions instead of manually looping through lists. Sounding more scary than it actually is, a list comprehension is an easy, compact, and often faster way of doing calculations on lists. Check out the documentation here; the examples will teach you all you need to know in less than an hour. Even better, use the map function, as it's supposed to be the speediest one; check out the tutorial here.
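The three spellings side by side (Python 3 syntax):

numbers = range(10000)

# Manual loop:
squares = []
for n in numbers:
    squares.append(n * n)

# List comprehension: same result, more compact, often faster:
squares = [n * n for n in numbers]

# map: applies a function to every element (wrap in list() on Python 3):
squares = list(map(lambda n: n * n, numbers))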

4. Since we're on lists: you probably know about deep and shallow copying. If you don't, the gist is that when you assign a name to data, it is a reference to the data and not a copy. Try the following code:

a = [1, 2, 3]
b = a
print(a, b)
a.append(4)
print(a, b)

Whoa, right? Check out this fantastic talk by Ned Batchelder at PyCon2015 about why it is this way. To avoid potentially serious bugs you can copy the list:

list_2 = list_1[:]
Or:
list_2 = list(list_1)

There seems to be a bit of disagreement as to which method is faster (compare here & here) and I personally got mixed results depending on the length of the list and its components (floats, integers, etc.), so test the alternative implementations, as the gain from this little change may be considerable.

In general, deep copying is costly so there is a balance to strike here – go for safe but computationally expensive or spend some time ensuring that shallow copying does not produce any unintended behaviour.  
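One caveat worth knowing: both copying idioms above are shallow copies, which is fine for flat lists but still shares any nested objects. The standard-library copy module handles the fully deep case; a small demonstration:

import copy

nested = [[1, 2], [3, 4]]

shallow = nested[:]           # new outer list, same inner lists
shallow[0].append(99)
print(nested[0])              # [1, 2, 99] -- the inner list was shared!

deep = copy.deepcopy(nested)  # recursively copies every level
deep[1].append(99)
print(nested[1])              # [3, 4] -- untouched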

5. Division may potentially be expensive, but can be easily swapped with multiplication. For example, if you’re dividing by 2 – try multiplying by 0.5 instead. If you need to divide by funny numbers (as in ‘the odd ones’) try this trick: a = b * (1.0 / 7.0), it’s supposed to be quicker than a straightforward division. Again, try and time different implementations – depending on the version of Python, the number of operations and the type of data (integers, floats) the results may differ.
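As always, measure rather than trust folklore; a quick timeit comparison of the two spellings (results will vary with your Python version):

import timeit

print(timeit.timeit("b / 7.0", setup="b = 123.456", number=5000000))
print(timeit.timeit("b * (1.0 / 7.0)", setup="b = 123.456", number=5000000))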

6. Trying is cheaper, Ifing is expensive. From this fantastic guide to optimisation in python comes a simple rule that should speed up the defensive parts of the code.

If your code looks like this:

if somethingcrazy_happened:
    uhOhBetterDoSomething()
else:
    doWhatWeNormallyDo()

The following version is speedier:

try:
    doWhatWeNormallyDo()
except SomethingCrazy:
    uhOhBetterDoSomething()

To push your optimisation effort even further, there are quite a few optimisation tutorials online, with many more techniques.  I particularly recommend these three:

  1. Python wiki on optimisation – this is quite ‘dry’ so only do it if you’re happy with loads of python jargon.
  2. Dive into Python – nice tutorial with an example code, shows the scary truth that more often than not it’s difficult to predict which implementation will actually be quicker.
  3. A comprehensive Stack Overflow answer to the ultimate question 'how do I make my code go faster?'

Top image: Alan Cleaver on Flickr flickr.com/photos/alancleaver/2661425133/in/album-72157606825074174/

Agent-based modelling meets R

In the world of computational archaeology the technical hurdle is a significant deterrent for many. I quite often hear fellow archaeologists complain that 'an ABM would really complement my research, but I cannot code in Python and don't have the time to learn it'. The time needed to learn new software or a programming language never seems to be there, and many great ideas are put on the back burner indefinitely simply because they cannot be implemented in an environment one is already familiar with.

Now, thanks to our fellow blogger and complexity scientist Marco Smolla, you can learn how to build an agent-based model in one of the most popular scripting languages in archaeology: R!  Follow this link for a tutorial which will allow you to turn your data analysis environment into a simulation tool. Nice!

Spatially-explicit iterated games in NetLogo: an agent-based h/t to John Nash

From xkcd:

Admittedly, I’ve never seen A Beautiful Mind, although I’ve come across the clip parodied in the cartoon above which, as the New York Times also points out, is not a great example of John Nash’s contributions to game theory. But I’ve always liked the way game theory, like other forms of modeling, builds upward from simple premises, and Nash’s work features in a good chunk of what is familiar to me in game theory. So, in tribute to Professor Nash (and wishing to learn a little more about game theory), I’ve put together a little tutorial on building games into an agent-based model.

Game theory considers how strategic decisions are made; these are usually conceived of in terms of the relative payoffs for each choice within a realm of possible choices for a given decision. Many of the games studied by game theoreticians are models and have a limited number of options, but depending on how the payoffs are structured and what knowledge is attributed to the players, these elementary games may produce non-intuitive outcomes and insights into behavior.

This description is perhaps better served by an example: imagine you find yourself in the produce section at the grocery store, and you have to choose between buying a delicious pumpkin that requires two people to carry and buying asparagus that are less enjoyable. If you and an acquaintance work together, you can bring the pumpkin to the register and both benefit from eating it. On the other hand, both of you could just buy asparagus separately and get less enjoyment. But there’s also another option: one of you could decide to buy the gigantic pumpkin, while the other could take asparagus. This would leave one person with a decent amount of asparagus and the other without any pumpkin because they are unable to carry it. How does one decide which strategy to choose?

[Image: pumpkin or asparagus? The eternal struggle.]

On its face, it might seem like choosing to cooperate and taking the tasty pumpkin makes the most sense as this would provide the best payoff for everyone, but this is a risky strategy if you don’t know what the other person will do; you could be left trying to drag an immense pumpkin to the checkout all by yourself while the other person makes off with heaps of asparagus (the horror!). In that case, it might make the most sense to take the sure thing. This can be visualized as a decision matrix:

[Decision matrix 1: payoffs for the pumpkin/asparagus game]

For this game (which is really the famous Prisoner's Dilemma game), the asparagus-asparagus strategy represents a Nash equilibrium: a strategy under which both players would not benefit from changing strategies, even when the equilibrium strategies of the other players are known. If Player 1 initially chose pumpkin, but knew it would benefit the other player if they switched to asparagus, then Player 1 would rationally change their strategy to asparagus as well to avoid being left to drag the pumpkin alone. But if Player 1 initially chose asparagus, they would not benefit from changing strategies, no matter whether Player 2 chose asparagus or pumpkin; therefore, the rational choice would always be asparagus. In some games, there may be multiple Nash equilibria. For instance, we could change the values so that the benefit of defecting is the same no matter what the other player is doing, with a decision matrix that looks something like this:

[Decision matrix 2: the benefit of defecting is the same regardless of the other player's choice]

In this case, there is no rational reason for a player to switch strategies knowing what the other has chosen, so the game has two Nash equilibria: both players choosing the pumpkin, or both choosing the asparagus. Both of these games are kinds of non-cooperative games, which are a class of games in which the players make decisions without consulting one another (and the type where Nash made his most well-known contributions).

Games explored by game theorists are usually conceptual, such that players are not typically constrained by any physical distance between them. Agent-based models, on the other hand, have the ability to incorporate space explicitly within the framework of the model, making it possible to inhibit or enhance agent interactions based on proximity. While the incorporation of space is not necessary for all models or games, many real world processes and entities are affected by space,  so it might be useful (and dare I say fun?) to see how space might affect the outcomes of games. In that spirit, we’ll use NetLogo to build non-cooperative games within a spatial framework.

Building iterated games in an agent-based model

In this model, agents perform a random walk, selecting a direction at random and taking one step forward. Following each step, agents search for other agents within a given radius, search-radius. If any agents are present, one of them is chosen, and together they play a coordination game not unlike our produce aisle example above, receiving different payoffs depending on whether they choose to cooperate or defect. The model will be flexible enough so that the payoffs for both cooperating, both defecting, or losing and winning from a cooperate-defect scenario can be variables capable of being tuned by the user. We’ll call these, respectively, cooperate-cooperate, defect-defect, lone-cooperator, and lone-defector. These will be made into sliders at the front end of the model.

Before we do anything, we’ll want to figure out how we want agents to make decisions. We could assume that agents know the equilibrium strategies and will thus act accordingly. But what if our agents don’t know what the payoffs will be? In the absence of any information, agents could make their decisions randomly, but this isn’t likely to remain a reasonable course of action once the agent has played the game a few times (this concept probably has a name, but for now we’ll call it the “this ain’t my first rodeo” effect). An alternative is to let agents make decisions about whether to cooperate or not based on their past experiences with other agents. We can do this by asking each agent to keep track of the payoffs in a set of lists for each strategy. The relative numbers of choices of strategy in each time step can then be used to gauge how the population is making decisions.

globals [ decisions ]
turtles-own [ cooperate-history defect-history ]

Here, the cooperate-history and defect-history variables will be lists used to keep track of payoffs received when a particular strategy is employed. For example, if two agents cooperate and the cooperate-cooperate payoff is 5, then both agents will add 5 to their cooperate-history lists; however, if one agent defects and the other cooperates, and the lone-defector payoff is 6 and the lone-cooperator payoff is 0, then the defecting agent will add 6 to their defect-history and the cooperating agent will add 0 to their cooperate-history. The decisions variable will keep track of all agent strategy decisions for each time step, with 1 representing a cooperation and 0 representing a defection. We can use this as a gauge on population-level decision-making as the model progresses.

Now we can create the agents. What we need to consider carefully is how agents will make their initial decisions. We could start them out with no history whatsoever, and let them guess at first until they figure out a good strategy, but unless agents try both strategies at least once, they will unwittingly choose a random strategy and then stick with that strategy simply because the outcome of the other strategy is unknown. To get around this, we’ll seed the cooperate-history and defect-history of each agent with ten random outcomes from each of those choices.

to setup
clear-all
set decisions [0 0 0 0 0 1 1 1 1 1]
crt 100 [
setxy random-xcor random-ycor
set cooperate-history []
set defect-history []
repeat 10 [
set cooperate-history lput (one-of (list lone-cooperator cooperate-cooperate )) cooperate-history
set defect-history lput (one-of (list lone-defector defect-defect )) defect-history
]
]
reset-ticks
end

Here, we’ve taken the decisions variable and seeded it with an equal number of 0s and 1s, starting off with a mean value of 0.5; we do this so that the plot we create for this later has some values to use when the model is set up. Next, we create (crt) 100 agents, distribute them to random coordinates with the setxy command, and then give them cooperate-history  and defect-history lists with randomly chosen outcomes for cooperation (either lone-cooperator or cooperate-cooperate) and defection (either lone-defector or defect-defect) respectively.

Next, we create our go command, where the model is controlled. Here, we want to make our turtles move, and, if any agents are within the search-radius, they’ll play a coordination game. It should look something like this:

to go
set decisions []
ask turtles [
set heading random 360
fd 1
if any? other turtles in-radius search-radius [
check-game
]
]
tick
end

In this code, the decisions variable, which is a list, is cleared in order to get fresh decisions from the new time step. This is done by converting it to an empty list []. Next, each turtle is asked first to pick a random direction by using set heading random 360, and then asked to move forward (fd) one step. The agent then looks for other players. The any? command checks to see whether any of a specified group of agents meets a given criteria; in this case, we want to know whether there are any agents within the search-radius. If there are, we’ll run a routine called check-game.

The check-game routine has three objectives: first, it determines what the preferred choice of our focal agent is based on their cooperate-history and defect-history lists; second, it identifies the second agent playing the game and determines their preferred choice; and finally, it determines the payoffs for both agents based on the combination of their decisions. This routine is a bit long, so we'll break it up into a few parts, using ellipses (...) to indicate where code leaves off and picks up.


to check-game
let p1choice 0
ifelse sum cooperate-history > sum defect-history [
set p1choice 1
]
[
ifelse sum cooperate-history = sum defect-history [
set p1choice one-of [ 0 1 ]
]
[
set p1choice 0
]
]
...

This code first declares a local variable for the choice of the focal agent (hereafter referred to as Player 1), called p1choice, and sets it to a default of 0 (defect). Then, it runs through a series of checks to set p1choice. First, it uses the ifelse command to check whether the sum of the cooperate-history is greater than that of the defect-history, indicating this agent has historically received greater benefit from cooperating than defecting. If this is the case, it sets p1choice to 1 (cooperate). If this is not the case, then a nested ifelse checks whether the sum of the cooperate-history is equal to the sum of the defect-history. If this is true, then the agent randomly sets its p1choice to one-of two values (0 or 1); otherwise, the agent retains its original p1choice of 0.

Now that Player 1 has made a choice, we can determine who Player 2 is and make their choice:

...
let p2 one-of other turtles in-radius search-radius
let p2choice 0
ifelse [ sum cooperate-history ] of p2 > [ sum defect-history ] of p2 [
set p2choice 1
]
[
ifelse [ sum cooperate-history ] of p2 = [ sum defect-history ] of p2 [
set p2choice one-of [ 0 1 ]
]
[
set p2choice 0
]
]
...

Here, we set a local variable called p2 which holds the identity of Player 2, and associate that variable with an agent selected randomly from within the search-radius. We do this because the call of check-game is being made by a particular agent, and so any commands made to Player 2 are made through Player 1. So while the choice algorithm is effectively the same as that used for Player 1, variables held by Player 2 are referred to using the of command (for example, [ cooperate-history ] of p2).

Now that this is done, we can compare the two decisions and establish the payoffs to each player. We’ll begin with scenarios where Player 1 chooses to cooperate.

...
ifelse p1choice = 1 [
ifelse p2choice = 1 [
set cooperate-history lput cooperate-cooperate cooperate-history
ask p2 [
set cooperate-history lput cooperate-cooperate cooperate-history
]
]
[
set cooperate-history lput lone-cooperator cooperate-history
ask p2 [
set defect-history lput lone-defector defect-history
]
]
]
...

The first ifelse here is used to establish whether Player 1 has chosen to cooperate or to defect, with cooperation being p1choice = 1. We use a second ifelse to establish whether Player 2 has chosen to cooperate or to defect. In the first case, where Player 2 has chosen to cooperate (p2choice = 1), each player adds the cooperate-cooperate payoff to their respective cooperate-history lists. In the second case (following the open bracket [ indicating the alternative condition of the second ifelse statement), where Player 2 has chosen to defect, Player 1 adds the lone-cooperator payoff to their cooperate-history list, while Player 2 adds the lone-defector payoff to their defect-history.

Next, we’ll look at the alternative condition of that first ifelse statement, in which Player 1 chooses to defect.

...
[
ifelse p2choice = 1 [
set defect-history lput lone-defector defect-history
ask p2 [
set cooperate-history lput lone-cooperator cooperate-history
]
]
[
set defect-history lput defect-defect defect-history
ask p2 [
set defect-history lput defect-defect defect-history
]
]
]
...

Under this scenario, Player 1 has chosen to defect (p1choice = 0). Again, we’ll use ifelse to establish if Player 2 has chosen to cooperate or to defect. In the first case, where Player 2 has chosen to cooperate (p2choice = 1), then Player 1 adds the lone-defector payoff  to their defect-history while Player 2 adds the lone-cooperator payoff to their cooperate-history. In the second case, where Player 2 has also chosen to defect, both players add the defect-defect payoff to their respective defect-history lists.

Now that all potential outcomes from the game have been covered, we’ll finish up the check-game routine by first maintaining the cooperate-history and defect-history lists and then updating the decisions variable.

...
if length cooperate-history > 10 [
set cooperate-history remove-item 0 cooperate-history
]
if length defect-history > 10 [
set defect-history remove-item 0 defect-history
]
ask p2 [
if length cooperate-history > 10 [
set cooperate-history remove-item 0 cooperate-history
]
if length defect-history > 10 [
set defect-history remove-item 0 defect-history
]
]
set decisions lput p1choice decisions
set decisions lput p2choice decisions
end

The first part of this code asks agents to check the length of their respective cooperate-history and defect-history lists. We'll cap these lists at 10 interactions by asking agents whose lists are longer than 10 to drop the oldest item from those lists. We can do this using the remove-item command. Because we were updating the lists with lput, which adds items to the end of a list, the oldest item in each list is item 0, so that is the item which gets dropped. Finally, the global decisions list is updated with the choices from both players.

Exploring games in this model

We’ll use a plot to keep track of two variables within the model: 1) the mean of the decisions variable, which should show us the balance of strategies being employed, and 2) the proportion of agents likely to cooperate on their next move. We can do this by going to the interface tab, right-clicking on some open space, and selecting Plot. This brings up a dialog box which we can enter in the following:

  • Set the Y min value to 0 and Y max value to 1, as both of our measures will fall somewhere between 0 and 1 (make sure the Auto scale? box is ticked).
  • Under the Pen update commands for the default pen, enter this code to get the mean of the decisions variable : plot mean decisions
  • Click the Add pen button, and for this new pen (pen-1), enter this code under the Pen update to track the percentage of “cooperative” agents: plot ((count turtles with [ sum cooperate-history > sum defect-history ]) + round ( 0.5 * count turtles with [ sum cooperate-history = sum defect-history ] )) / count turtles
  • The examples below will plot mean decisions in black and percent cooperating in grey. You can change the colors of the pen to suit your needs.

First, let's start with something simple: a game where the payoffs for cooperate-cooperate and defect-defect are exactly the same (also called the "choosing sides" game). The decision matrix should look like this:

[Decision matrix: choosing-sides game with equal cooperate-cooperate and defect-defect payoffs]

This game has two Nash equilibria: cooperate-cooperate and defect-defect. To start, we’ll run the model with a search-radius of 5. Doing so produces two basic types of outcomes:

[Plots: two runs with search-radius 5, one fixing at cooperation and one at defection]

In one instance, the overall strategy trends toward 1 (cooperation), while in the other the overall strategy moves toward 0 (defection). Why does this happen? At the outset, each agent has a cooperate-history and defect-history seeded with 0’s and 4’s; the relative proportion of these for each agent will determine whether they will cooperate or defect on their next turn. At first, these will fluctuate around even, but over time, these random fluctuations reach a point where the proportions of cooperators versus defectors drives the system toward fixation at one end of the spectrum or the other.

Next, we'll try decreasing the search-radius to 1:

[Plots: two runs with search-radius 1, showing noisier trajectories]

The increased variability in the mean decisions is due to the fact that, because some agents may not have any other agents within the search-radius during a given time step, there will be fewer games being played overall, and the outcomes from those that are played can swing the mean  decisions to a greater degree. However, this also affects the time it takes to reach fixation, as the homogenizing effects of the process don’t spread as quickly when the number of potential opponents for any given agent is small.

We can also see how the Prisoner’s Dilemma plays out in our current model. As a reminder, the decision matrix looks like this:

[Decision matrix: the Prisoner's Dilemma]

Based on the outcomes of the previous simulation, and what we know about the Nash equilibrium of the Prisoner’s Dilemma (defect-defect), the outcomes are pretty predictable:

[Plots: Prisoner's Dilemma runs with search-radius 5 (left) and search-radius 1 (right)]

The plot on the left uses a search-radius of 5, while the plot on the right uses a search-radius of 1. As you can see, both move toward defection fairly rapidly; however, the time until defection becomes the sole strategy can change based on the search-radius, with smaller radii taking longer to reach fixation.

The last game we’ll examine probably has a name in the game theory literature, but I can’t find it. The decision matrix for this game looks a little something like this:

[Decision matrix: a game whose only Nash equilibrium is cooperate-cooperate]

This game, like the Prisoner’s Dilemma, only has one Nash equilibrium (in this case, cooperate-cooperate). However, when we run the scenario using the two search-radius settings we’ve used so far, we see some strange behavior:

[Plots: runs with search-radius 5 (left) and search-radius 1 (right), settling around 0.5]

In both scenarios, agents begin for the most part in a cooperative mood, which makes sense since this is the optimal strategy. Over time, however, they move toward the center, eventually settling into an uneasy equilibrium around 0.5, never reaching fixation. As earlier, the search-radius of 5 (left) is less variable in terms of its mean decision value than that with a search-radius of 1 (right). But the search-radius of 5 takes longer to reach the center equilibrium state than does the search-radius of 1.  Why might this be? We might get some insight by further increasing our search-radius to 10.

[Plot: a run with search-radius 10, hovering near 1]

In this setting, the mean decisions and percent cooperating both hover near 1, but don't quite reach it, persisting in this state for 5000 time steps. What on earth is causing this to happen? Turns out it's all because of this guy:

[Image: a lone defector amid cooperators. Jerk.]

Remember how the memories of all of our agents are seeded with random values from the relevant strategies? This means that some agents will start out more cooperative and some will start out more prone to defection. Under the decision-matrix used for this game, very few agents will be prone toward defection (in the last case, just one), but those that are will persist in defecting because nearly everyone they encounter will be cooperating with them. They are free to be jerks with absolute impunity.

In this scenario, if these defectors are interacting with a large number of potential players (search-radius = 10), then the ill-effects of their bad behavior get spread around, having little effect on others. This allows a few defectors  to persist within a general climate of cooperation. But if the search-radius is smaller, then there is a good chance that a defecting agent will interact with the same opponent multiple times in succession. This has the effect of wearing down the cooperating agents, eventually encouraging them to defect. As more and more are brought over to the dark side of defecting, the population eventually reaches a threshold at which the number of cooperating and defecting agents balances out, producing shifts around a mean decision of 0.5. Under such conditions, even though the optimal strategy is cooperation and agents generally start out cooperative, a few bad apples can spoil the party.

Going further

The way this model is built allows us to explore a multitude of non-cooperative games and the effects of interaction distances on them. But this is not the only way to build game theory ideas into an agent-based model. The NetLogo Models Library has several examples of Prisoner’s Dilemma, spatial and non-spatial, which can be used to assess different ways the game might play out.

This tutorial is meant to be a thought provoker on how game theory ideas might be incorporated into an agent-based model, rather than a thorough treatment of spatially-explicit games. Spatially-explicit and iterated  games have been explored using simulation in much greater detail elsewhere.  A good place to start is Robert Axelrod’s book The Evolution of Cooperation, as well as the work of Nowak and May. There are several articles in the Journal of Artificial Societies and Social Simulation dealing with this subject.

Finally, while this tutorial has focused on non-cooperative games, there are other kinds of games which can be explored. Cooperative games are aimed at the strategic formation of coalitions. Some games involve stochastic mechanisms for deciding strategies: mixed strategy games, for example, involve a probabilistic determination of strategies rather than a strict choice of the most optimal strategies.

Code for the model can be found here.

Featured image: A non-cooperating Super Mario

Run, Python, run! Part 1: Timing your code

Sooner or later everyone comes to the realisation that one's code is slow, too slow. Thankfully there were a lot of 'everyone' before us and they actually did something about it. This means that the tempting option of leaving one's laptop churning through the simulation and going on a month-long holiday is no longer defensible. This blogpost will sketch out what one can do instead. It draws from my own experience, so it's very mac-centric (windows users can just skip all the paragraphs on how to deal with the hell that apple sends your way) and applies to scripts written in Python only. For a tutorial on how to profile your code in NetLogo, check out Colin Wren's post here.

Also, if you’re a proper software developer stop reading now or you may get a cardiac arrest with my frivolous attitude to the comp science nomenclature. Having wasted hours on deciphering the cryptic jargon of online tutorials, stack overflow and various bits of documentation I make no apologies.

Finally, a word of warning: it may take up to a full day to set everything up. There will be cursing, there will be plenty of 'are you kidding me' and 'oh c'mon, why can't you just work' moments, so make sure you take frequent breaks, switch the computer off and on from time to time, and know that perseverance is key. Good luck!

The basics

The key to speeding up your code is to a) figure out which bits are actually slow and b) make them go faster. The first task is called ‘profiling’, the second ‘optimisation’ (so that you know what to google for). This post is about profiling, with another one on optimisation following soon.

A good place to start is to check how long your code takes to run overall. You can time it by typing in the terminal on Mac/the command line on Windows (don’t type the ‘$’, it’s just a marker for the command prompt):

$ time python script_name.py  

where script_name.py is the name of your file. Remember to either navigate to the folder containing the file by typing cd (meaning ‘change directory’) followed by the path to that folder, e.g. cd user/phd/python_scripts, or provide the full path, e.g.

$ time python user/phd/p_scripts/script_name.py

If you cannot figure out what the full path is (thank you, Apple, you really couldn’t make it any more complicated), drag and drop the file onto the terminal as if you were moving it from one folder to another and the full path will appear in a new line.

[Screenshot: dragging and dropping a file onto the Terminal window makes its full path appear]

Once it works, the time command produces a pretty straightforward output telling you how long it all took:

real 0m3.213s
user 0m0.486s
sys 0m0.127s

Now, add up the sys and user times; if their sum is much less than the real time (here 0.486 + 0.127 = 0.613s against 3.213s), then the main problem is that your computer was busy with other stuff and the code had to wait until those other tasks were completed. Yep, switching your Facebook off may actually speed up the code. Sad times.
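A quick alternative if you only care about one part of the script: time it from inside Python with the standard time module, for example:

import time

start = time.time()
# the chunk you want to time; a dummy workload for illustration
total = sum(i * i for i in range(10000000))
elapsed = time.time() - start
print('took %.3f seconds' % elapsed)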

Profiling the functions

So far so good, but the overall time tells you little about which bit of the code is slowing things down. To the rescue comes an armada of profiling tools. The first step into the world of profiling is to watch this highly enjoyable talk: Python profiling. The first half is really useful; the second half is about some hardcore business application, so you can skip it.

To sum it up, Python has a standard inbuilt tool called cProfile. You call it from the terminal with:

 $ python -m cProfile script_name.py

And usually it produces pages upon pages of not particularly useful output, along the lines of:

163884 function calls (161237 primitive calls) in 5.938 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.000    0.000    0.000    0.000 <string>:1(<module>)
1    0.000    0.000    0.000    0.000 <string>:1(ArgInfo)
1    0.000    0.000    0.000    0.000 <string>:1(ArgSpec)
1    0.000    0.000    0.000    0.000 <string>:1(Arguments)
1    0.000    0.000    0.000    0.000 <string>:1(Attribute)
1    0.000    0.000    0.000    0.000 <string>:1(DecimalTuple)
1    0.000    0.000    0.000    0.000 <string>:1(Match)
1    0.000    0.000    0.000    0.000 <string>:1(Mismatch)
1    0.000    0.000    0.000    0.000 <string>:1(ModuleInfo)
1    0.000    0.000    0.000    0.000 <string>:1(ParseResult)
1    0.000    0.000    0.000    0.000 <string>:1(SplitResult)

One needs some tools to trawl through this (for reference: tottime is the time spent in the function itself, while cumtime also includes all the functions it calls). To start with, it makes sense to stop viewing it in the terminal window. If you run this command:

$ python -m cProfile -o script_output script_name.py

it will save the data to a file, where ‘script_output’ is the name you want to assign to that file. To then see the data in a browser window, install the cprofilev tool. As usual, they tell you that one line of code:

$ pip install cprofilev 

in your terminal is enough to install it. Yeah, right, as if that was ever going to happen. To start with, make sure you’ve got the words sudo and pip at the beginning of the line. Using sudo means that you’re pretending to be the root administrator, i.e. the almighty ‘this is my bloody computer and if I want to I will break it’; you can also give yourself root admin rights by following these instructions, but Apple will drag you through pages and pages of warnings that will leave only the bravest of us actually daring to continue. So whenever I get an ‘access denied’ message I stick the sudo command in front and usually it does the trick:

$ sudo pip install cprofilev 

If the terminal spits out an error message about pip, it is likely that you don’t have it installed, so type in the terminal:

$ sudo easy_install pip 

and try again. This should be enough of sweat to make it work but if it keep on producing error messages, go to the ultimate authority of google.  If it did work (i.e. you run the  $ sudo pip install cprofilev and it didn’t show any errors) type in the terminal:

$ cprofilev /path/to/script_output

script_output is the name you assigned to the file created with cProfile four code-snippets ago (scroll up). The terminal will spit out this fantastically cryptic message:

cprofilev server listening on port 4000

This just means that you need to copy-paste this line into your browser (Safari, Firefox, IE):

http://localhost:4000/

and the data will appear as a nice webpage where you can click on the headings to sort it by the number of calls, the total time they took, etc. You can find a comprehensive description of how to use it here.
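If you’d rather skip the browser (and the whole pip saga), Python’s standard pstats module can read the same output file directly. Here’s a minimal sketch, assuming you saved the profile as script_output as above:

import pstats

# load the file produced by: python -m cProfile -o script_output script_name.py
stats = pstats.Stats('script_output')

# sort by cumulative time and print only the ten most expensive functions
stats.sort_stats('cumulative').print_stats(10)

Run it from the same folder as script_output and it prints a trimmed, sorted version of the table you saw earlier.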

Run, snake, run!

A fancier way of doing it is to use the ‘Run Snake Run‘ tool. You can try to install it from these files, or from the terminal:

$ sudo easy_install SquareMap RunSnakeRun

You need wxPython to run the tool; thankfully this one actually has a ‘click’ installation type (I almost forgot those exist!): you can get the file from here. If you get an error saying that the disc is damaged, it’s not. It’s (yet again…) Apple, who does not want you to install anything that is not ‘certified’ by them. Here are instructions on how to bypass it in the Mac internal settings.

If you’re a happy Windows or Linux user you’re good to go; if you have a Mac there is one more bug: you’ll probably get this error message:

OSError( """Unable to determine user's application-data directory""" )

This is because the software is not fully compatible with Mac OS X, but you can repair it by typing in the terminal:

$ mkdir ~/.config

Now, run this in your terminal:

$ runsnake script_output

where script_output is that file you created with cProfile, remember? The one you got with this line:

$ python -m cProfile -o script_output script_name.py

and you should now be able to get a nice visualisation of how much time each function consumes. It looks like this:

[Screenshot: the Run Snake Run window, with a sortable list of functions on the left and a colour-coded map of their run times on the right]

In the left-hand panel you can sort the functions by execution time, the number of calls, the combined time they took, etc., while the right-hand panel shows the same information in a more human-friendly format, i.e. in colour.

Line profiling

Run Snake Run is truly a fantastic tool: it gives you an idea of what eats the time and power, pointing to the functions that may be optimised for better performance. But it also floods you with loads of the ‘inner guts of Python’ (functions inside functions inside functions), hence finding out which bit of your code, i.e. the exact line, is the root of the problem is far from obvious. The line_profiler tool is great for that. There are some nice tutorials on how to use it here and here. To install it, try typing in the terminal:

$ sudo pip install line_profiler

This should do the trick; if not, download the files from here and try all the possible installation routes described here. Once it’s installed, in order to use it add the @profile decorator in front of the function you want to test, for example:

[Screenshot: a script with the @profile decorator added directly above the function definition]
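Something along these lines (a minimal sketch with an invented function; note that there is no import for @profile, as kernprof injects the decorator when it runs the script):

@profile  # provided by kernprof at runtime; running without kernprof raises a NameError
def count_occurrences(numbers):
    # count how often each number appears in the list
    counts = {}
    for n in numbers:
        counts[n] = counts.get(n, 0) + 1
    return counts

if __name__ == '__main__':
    import random
    data = [random.randint(0, 9) for _ in range(100000)]
    count_occurrences(data)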

and run from the terminal:

$ kernprof.py -l -v script_name.py

The -l flag makes the decorator (@profile) work, and the -v flag displays the timings as soon as the script finishes.

If it doesn’t work, make sure that you have kernprof.py in the same folder as the script you want to run (it’s in the line_profiler folder you downloaded earlier), or provide the full path to where it lives on your computer, for example:

$ /line_profiler-1.0/kernprof.py -l -v script_name.py

The output is pure joy and simplicity, and looks something like this:

[Screenshot: line_profiler’s line-by-line output]
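The format looks roughly like this (the numbers below are invented for illustration, arranged to match my description underneath):

Timer unit: 1e-06 s

Total time: 1.84002 s
File: script_name.py
Function: count_occurrences at line 16

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           @profile
    17                                           def count_occurrences(numbers):
    18         1          3.0      3.0      0.0      counts = {}
    19    100001     740000.0      7.4     40.2      for n in numbers:
    20    100000    1100013.0     11.0     59.8          counts[n] = counts.get(n, 0) + 1
    21         1          3.0      3.0      0.0      return counts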

Now we’re talking. It clearly shows that, in the case of my script, almost 60% of the time is spent on line 20, where I count how often each number appears in my list. If you need someone to spell out how to read the output, head to this tutorial.

If you want to go further, check out this tutorial on profiling how much memory the program uses and checking whether any of it is leaking. Or get on with it and switch to working on speeding up the slow bits of your code. The next tutorial will give you a few hints on how to achieve that.

Top image source: Alan Cleaver on Flickr flickr.com/photos/alancleaver/2661425133/in/album-72157606825074174/