Tag Archives: featured

The Powers and Pitfalls of Power-Law Analyses

December 8, 2016 stefanicrabtree Leave a comment

People love power-laws. In the 90s and early 2000s it seemed like they were found everywhere. Yet early power-law studies did not subject the data distributions to rigorous tests. This decreased the potential value of some of these studies. And since an influential study by Aaron Clauset of CU Boulder , Cosma Shalizi of Carnegie Mellon, and Mark Newman of the University of Michigan, researchers have become aware that not all distributions that look power-law like are actually power-laws.

But power-law analyses can be incredibly useful. In this post I show you first what a power-law is, second demonstrate an appropriate case-study to use these analyses in, and third walk you through how to use these analyses to understand distributions in your data.

What is a power-law?

A power-law describes a distribution of something—wealth, connections in a network, sizes of cities—that follow what is known as the law of preferential attachment. In power-laws there will be many of the smallest object, with increasingly fewer of the larger objects. However, the largest objects disproportionally get the highest quantities of stuff.

The world wide web follows a power-law. Many sites (like Simulating Complexity) get small amounts of traffic, but some sites (like Google, for example) get high amounts of traffic. Then, because they get more traffic, they attract even more visits to their sites. Cities also tend to follow power-law distributions, with many small towns, and few very large cities. But those large cities seem to keep getting larger. Austin, TX for example, has 157.2 new citizens per day, making this city the fastest growing city in the United States. People are attracted to it because people keep moving there, which perpetuates the growth. Theoretically there should be a limit, though maybe the limit will be turning our planet into a Texas-themed Coruscant.

This is in direct contrast to log-normal distributions. Log-normal distributions follow the law of proportional effect. This means that as something increases in size, it is predictably larger than what came before it. Larger things in log-normal distributions do not attract exponentially more things… they have a proportional amount of what came before. For example, experience and income should follow a log-normal distribution. As someone works in a job longer they should get promotions that reflect their experience. When we look at incomes of all people in a region we see that when incomes are more log-normally distributed these reflect greater equality, whereas when incomes are more power-law-like, inequality increases. Modern incomes seem to follow log-normality up to a point, after which they follow a power-law, showing that the richest attract that much more wealth, but under a certain threshold wealth is predictable.

If we analyze the distribution of modern incomes in a developing nation and see that they follow a power-law distribution, we will understand that there is a ‘rich get richer’ dynamic in that country, whereas if we see the incomes follow a log-normal distribution we would understand that that country had greater internal equality. We might want to know this to help influence policy.

When we analyze power-laws, however, we don’t want to just look at the graph that is created and say “Yeah, I think that looks like a power-law.” Early studies seemed to do just that. Thankfully Clauset et al. came up with rigorous methods to examine a distribution of data and see if it’s a power-law, or if it follows another distribution (such as log-normal). Below I show how to use these tools in R.

Power-law analyses and archaeology

So, if modern analyses of these distributions can tell us something about the equality (log-normal) or inequality (power-law) of a population, then these tools can be useful for examining the lifeways of past people. Questions we might be interested in asking are whether prehistoric cities also follow a power-law distribution, suggesting that the largest cities offered more social (and potentially economic) benefits similar to modern cities. Or we might want to understand whether societies in prehistory were more egalitarian or more hierarchical, thus looking at distributions of income and wealth (as archaeologists define them) to examine these. Power-law analyses of distributions of artifacts or settlement sizes would enable us to understand the development of inequality in the past.

Clifford Brown et al. talked about these very issues in their chapter Poor Mayapan from the book The Ancient Maya of Mexico edited by Braswell. While they don’t use the statistical tools I present below, they do present good arguments for why and when power-law versus other types of distributions would occur, and I would recommend tracking down this book and reading it if you’re interested in using power-law analyses in archaeology. Specifically they suggest that power-law distributions would not occur randomly, so there is intentionality behind those power-law-like distributions.

I recently used power-law and log-normal analyses to try to understand the development of hierarchy in the American Southwest. The results of this study will be published in 2017 in American Antiquity. Briefly, I wanted to look at multiple types of evidence, including ceremonial structures, settlements, and simulation data to understand the mechanisms that could have led to hierarchy and whether or not (and when) Ancestral Pueblo groups were more egalitarian or more hierarchical. Since I was comparing multiple different datasets, a method to quantitatively compare them was needed. Thus I turned to Clauset’s methods.

These had been updated by Gillespie in the R package poweRlaw.

Below I will go over the poweRlaw package with a built-in dataset, the Moby Dick words dataset. This dataset counts the frequency of different words. For example, there are many instances of the word “the” (19815, to be exact) but very few instances of other words, like “lamp” (34 occurrences) or “choice” (5 occurrences), or “exquisite” (1 occurrence). (Side note, I randomly guessed at each of these words, assuming each would have fewer occurrences. My friend Simon DeDeo tells me that ‘exquisite’ in this case is hapax legomenon, or a term that only has one recorded use. Thanks Simon.) To see more go to http://roadtolarissa.com/whalewords/.

In my research I used other datasets that measured physical things (the size of roomblocks, kivas, and territories) so there’s a small mental leap for using a new dataset, but this should allow you to follow along.

The Tutorial

Open R.

Load the poweRlaw package

library(“poweRlaw”)

Add in the data

data(“moby”, package=”poweRlaw”)

This will load the data into your R session.

Side note:

If you are loading in your own data, you first load it in like you normally would, e.g.:

data <- read.csv(“data.csv”)

Then if you were subsetting your data you’d do something like this:

a <- subset(data, Temporal_Assignment !=’Pueblo III (A.D. 1140-1300)’)

Next you have to decide if your data is discrete or continuous. What do I mean by this?

Discrete data can only take on particular values. In the case of the Moby Dick dataset, since we are counting physical words, this data is discrete. You can have 1 occurrence of exquisite and 34 occurrences of lamp. You can’t have 34.79 occurrences of it—it either exists or it doesn’t.

Continuous data is something that doesn’t fit into simple entities, but whose measurement can exist on a long spectrum. Height, for example, is continuous. Even if we bin peoples’ heights into neat categories (e.g., 6 feet tall, or 1.83 meters) the person’s height probably has some tailing digit, so they aren’t exactly 6 feet, but maybe 6.000127 feet tall. If we are being precise in our measurements, that would be continuous data.

The data I used in my article on kiva, settlement, and territory sizes was continuous. This Moby Dick data is discrete.
The reason this matters is the poweRlaw package has two separate functions for continuous versus discrete data. These are:

conpl for continuous data, and

displ for discrete data

You can technically use either function and you won’t get an error from R, but the results will differ slightly, so it’s important to know which type of data you are using.

In the tutorial written here I will be using the displ function since the Moby dataset is discrete. Substitute in conpl for any continuous data.

So, to create the powerlaw object first we fit the displ to it. So,

pl_a <- displ$new(moby)

We then want to estimate the x-min value. Powerlaws are usually only power-law-like in their tails… the early part of the distribution is much more variable, so we find a minimum value below which we say “computer, just ignore that stuff.”

However, first I like to look at what the x_min values are, just to see that the code is working. So:

pl_a$getXmin()

Then we estimate and set the x-mins

So this is the code that does that:

est <- estimate_xmin(a)

We then update the power-law object with the new x-min value:

pl_a$setXmin(est)

We do a similar thing to estimate the exponent α of the power law. This function is pars, so:

Pl_a$getPars()

estimate_pars(pl_a)

Then we also want to know how likely our data fits a power law. For this we estimate a p-value (explained in Clauset et al). Here is the code to do that (and output those data):

booty <- bootstrap_p(pl_a)

This will take a little while, so sit back and drink a cup of coffee while R chunks for you.

Then look at the output:

booty

Alright, we don’t need the whole sim, but it’s good to have the goodness of fit (gof: 0.00825) and p value (p: 0.75), so this code below records those for you.

variables <- c(“p”, “gof”)

bootyout <- booty[variables]

write.table(bootyout, file=”/Volumes/file.csv”, sep=’,’, append=F, row.names=FALSE, col.names=TRUE)

Next, we need to see if our data better fits a log-normal distribution. Here we compare our dataset to a log-normal distribution, and then compare the p-values and perform a goodness-of-fit test. If you have continuous data you’d use conlnorm for a continuous log normal distribution. Since we are using discrete data with the Moby dataset we use the function dislnorm. Again, just make sure you know which type of data you’re using.

### Estimating a log normal fit

aa <- dislnorm$new(moby)

We then set the xmin in the log-normal dataset so that the two distributions are comparable.

aa$setXmin(pl_a$getXmin())

Then we estimate the slope as above

est2 <-estimate_pars(aa)

aa$setPars(est2$pars)

Now we compare our two distributions. Please note that it matters which order you put these in. Here I have the power-law value first with the log-normal value second. I discuss what ramifications this has below.

comp <- compare_distributions(pl_a, aa)

Then we actually print out the stats:

comp

And then I create a printable dataset that we can then look at later.

myvars <- c(“test_statistic”, “p_one_sided”, “p_two_sided”)

compout <- comp[myvars]

write.table(compout, file=”/Volumes/file2.csv”, sep=’,’, append=F, row.names=FALSE, col.names=TRUE)

And now all we have left to do is graph it!

pdf(file=paste(‘/Volumes/Power_Law.pdf’, sep=”),width=5.44, height = 3.5, bg=”white”, paper=”special”, family=”Helvetica”, pointsize=8)

par(mar=c(4.1,4.5,0.5,1.2))

par(oma=c(0,0,0,0))

plot(pts_a, col=’black’, log=’xy’, xlab=”, ylab=”, xlim=c(1,400), ylim=c(0.01,1))

lines(pl_a, col=2, lty=3, lwd=2, xlab=”, ylab=”)

lines(aa, col=3, lty=2, lwd=1)

legend(“bottomleft”, cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c(“powerlaw fit”, “log normal fit”), lwd=1, yjust=0.5,xjust=0.5, bty=”n”)

text(x=70,y= 1,cex=1, pos=4, labels=paste(“Power law p-value: “,bootyout$p))

mtext(“All regions, Size”, side=1, line=3, cex=1.2)

mtext(“Relative frequencies”, side=2, line=3.2, cex=1.2)

legend=c(“powerlaw fit”, “log normal fit”)

box()

dev.off()

Now, how do you actually tell which is better, the log normal or power-law? Here is how I describe it in my upcoming article:

The alpha parameter reports the slope of the best-fit power-law line. The power-law probability reports the probability that the empirical data could have been generated by a power law; the closer that statistic is to 1, the more likely that is. We consider values below 0.1 as rejecting the hypothesis that the distribution was generated by a power law (Clauset et al. 2009:16). The test statistic indicates how closely the empirical data match the log normal. Negative values indicate log-normal distributions, and the higher the absolute value, the more confident the interpretation. However, it is possible to have a test statistic that indicates a log-normal distribution in addition to a power-law probability that indicates a power-law, so we employ the compare distributions test to compare the fit of the distribution to a power-law and to the log-normal distribution. Values below 0.4 indicate a better fit to the log-normal; those above 0.6 favor a power-law; intermediate values are ambiguous. Please note, though, that it depends on what order you put the two distributions in the R code: if you put log-normal in first in the above compare distributions code, then the above would be reversed—those below 0.4 would favor power-laws, while above 0.6 would favor log normality. I may be wrong, but as far as I can tell it doesn’t actually matter which order you put the two distributions in, as long as you know which one went first and interpret it accordingly.

So, there you have it! Now you can run a power-law analysis on many types of data distributions to examine if you have a rich-get-richer dynamic occurring! Special thanks to Aaron Clauset for answering my questions when I originally began pursuing this research.

Full code at the end:

library(“poweRlaw”)

data(“moby”, package=”poweRlaw”)

pl_a <- displ$new(moby)

pl_a$getXmin()

est <- estimate_xmin(a)

pl_a$setXmin(est)

Pl_a$getPars()

estimate_pars(pl_a)

booty <- bootstrap_p(pl_a)

variables <- c(“p”, “gof”)

bootyout <- booty[variables]

#write.table(bootyout, file=”/Volumes/file.csv”, sep=’,’, append=F, row.names=FALSE, col.names=TRUE)

### Estimating a log normal fit

aa <- dislnorm$new(moby)

aa$setXmin(pl_a$getXmin())

est2 <-estimate_pars(aa)

aa$setPars(est2$pars)

comp <- compare_distributions(pl_a, aa)

comp

myvars <- c(“test_statistic”, “p_one_sided”, “p_two_sided”)

compout <- comp[myvars]

write.table(compout, file=”/Volumes/file2.csv”, sep=’,’, append=F, row.names=FALSE, col.names=TRUE)

pdf(file=paste(‘/Volumes/Power_Law.pdf’, sep=”),width=5.44, height = 3.5, bg=”white”, paper=”special”, family=”Helvetica”, pointsize=8)

par(mar=c(4.1,4.5,0.5,1.2))

par(oma=c(0,0,0,0))

plot(pts_a, col=’black’, log=’xy’, xlab=”, ylab=”, xlim=c(1,400), ylim=c(0.01,1))

lines(pl_a, col=2, lty=3, lwd=2, xlab=”, ylab=”)

lines(aa, col=3, lty=2, lwd=1)

legend(“bottomleft”, cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c(“powerlaw fit”, “log normal fit”), lwd=1, yjust=0.5,xjust=0.5, bty=”n”)

text(x=70,y= 1,cex=1, pos=4, labels=paste(“Power law p-value: “,bootyout$p))

mtext(“All regions, Size”, side=1, line=3, cex=1.2)

mtext(“Relative frequencies”, side=2, line=3.2, cex=1.2)

legend=c(“powerlaw fit”, “log normal fit”)

box()

dev.off()

General, Tutorials

Building a Schelling Segregation Model in R

January 6, 2016 benjdavies Leave a comment

Happy New Year! Last year, our good friend Shawn over at Electric Archaeology introduced us to an excellent animated, interactive representation of Thomas Schelling’s (1969) model of segregation called “Parable of the Polygons”. I’ve always liked Schelling’s model because I think it illustrates the concepts of self-organization and emergence, and is also easy to explain, so it works as a useful example of a complex system. In the model, individuals situated in a gridded space decide whether to stay put or move based on a preference for neighbours like them. The model demonstrates how features of segregated neighborhoods can emerge even when groups are relatively ‘tolerant’ in their preferences for neighbors.

Here, I’ve created a simple version of Schelling’s model using R (building on Marco Smolla’s excellent work on creating agent-based models in R). Schelling’s model is situated on a grid, and in its simplest form, the cells of the grid will be in one of three states: uninhabited, inhabited by a member of one group, or inhabited by a member of a second group. This could be represented as a matrix of numbers, with each element being either 0, 1, or 2. So we’ll bring together these components as follows:

number<-2000
group<-c(rep(0,(51*51)-number),rep(1,number/2),rep(2,number/2))
grid<-matrix(sample(group,2601,replace=F), ncol=51)

par(mfrow=c(1,2))
image(grid,col=c("black","red","green"),axes=F)
plot(runif(100,0,1),ylab="percent happy",xlab="time",col="white",ylim=c(0,1))

Here, we start with a 51 x 51 grid of 2000 occupied cells. To create this, a vector called group is generated that contains 1000 1s, 1000 2s, and the remainder are 0s. These are collated into a matrix called grid through random sampling of the group vector. Finally, the matrix is plotted as an image where occupied cells are colored green or red depending on their number while unoccupied cells are colored black, like so:

The next step is to establish the common preference for like neighbors, and to setup a variable which tracks the overall happiness of the population.

alike_preference<-0.60
happiness_tracker<-c()

Finally, we’ll need a function which we’ll use later to establish who the neighbors are for a given patch, which we will call get_neighbors. To do this, we’ll feed the function a set of two xy-coordinates as a vector (e.g. 2 13 ), and using a for-loop, pull each neighbor with the Moore neighborhood (8 surrounding patches) in order, counterclockwise from the right. Then we’ll need to ensure that if a neighboring cell goes beyond the bounds of the grid (>51 or <1), we account for this by grabbing the cell at the opposite end of the grid. This function will return eight pairs of coordinates as a matrix.

get_neighbors<-function(coords) {
  n<-c()
  for (i in c(1:8)) {
 
    if (i == 1) {
      x<-coords[1] + 1
      y<-coords[2]
    }

    if (i == 2) {
      x<-coords[1] + 1
      y<-coords[2] + 1
    }
  
    if (i == 3) {
      x<-coords[1]
      y<-coords[2] + 1
    }
    
    if (i == 4) {
      x<-coords[1] - 1
      y<-coords[2] + 1
    }
    
    if (i == 5) {
      x<-coords[1] - 1
      y<-coords[2]
    }
    
    if (i == 6) {
      x<-coords[1] - 1
      y<-coords[2] - 1
    }
   
    if (i == 7) {
      x<-coords[1]
      y<-coords[2] - 1
    }
    
    if (i == 8) {
      x<-coords[1] + 1
      y<-coords[2] - 1
    }
   
    if (x < 1) {
      x<-51
    }
    if (x > 51) {
      x<-1
    }
    if (y < 1) {
      y<-51
    }
    if (y > 51) {
      y<-1
    }
    n<-rbind(n,c(x,y))
  }
  n
}

Now to get into the program. We’ll run the process 1000 times to get output, so the whole thing will will be embedded in a for loop. Then, we’ll set up some variables which keep track of happy versus unhappy cells:

for (t in c(1:1000)) {
happy_cells<-c()
unhappy_cells<-c()

Each of these tracker vectors (happy_cells and unhappy_cells) will keep track of additional vectors that contain coordinates of cells that are happy or unhappy.

Next, we’ll use two for loops to iterate through each row and column in the matrix. For each cell (here called current), we’ll take its value (0, 1, or 2). If the cell is not empty (that is, does not have a value of 0), then we’ll create variables that keep track of the number of like neighbors and the total number of neighbors (that is, neighboring cells which are inhabited), and then we’ll use our get_neighbors function to generate a vector called neighbors . Then we’ll use a for loop to iterate through each of those neighbors, and compare their values to the value of the current patch. If it is a match, we add 1 to the number of like neighbors. If it is inhabited, we add 1 to the total number of neighbors (a varable called all_neighbors). Then we divide the number of like neighbors by the total number of neighbors, and compare that number to the like-neighbor preference to determine whether the current patch is happy or not (The is.nan function is used here to escape situations where a cell is completely isolated and thus would involve division by 0). Happy patches are added to our happy patches variable, while unhappy patches are added to our unhappy patches variable, both as matrices of coordinates.

for (j in c(1:51)) {
  for (k in c(1:51)) {
    current<-c(j,k)
    value<-grid[j,k] 
    if (value > 0) {
      like_neighbors<-0
      all_neighbors<-0
      neighbors<-get_neighbors(current)
      for (i in c(1:nrow(neighbors))){
        x<-neighbors[i,1]
        y<-neighbors[i,2]
        if (grid[x,y] > 0) {
          all_neighbors<-all_neighbors + 1
        }
        if (grid[x,y] == value) {
          like_neighbors<-like_neighbors + 1
        }
      }
      if (is.nan(like_neighbors / all_neighbors)==FALSE) {
        if ((like_neighbors / all_neighbors) < alike_preference) {
            unhappy_cells<-rbind(unhappy_cells,c(current[1],current[2]))
        }
          else {
            happy_cells<-rbind(happy_cells,c(current[1],current[2]))
          }
        }
   
      else {
        happy_cells<-rbind(happy_cells,c(current[1],current[2]))
      }
    }
  }
}

Next, we’ll get our overall happiness by dividing the number of happy cells by the total number of occupied cells, and update our happiness tracker by appending that value to the end of the vector.

happiness_tracker<-append(happiness_tracker,length(happy_cells)/(length(happy_cells) + length(unhappy_cells)))

Next, we’ll get our unhappy patches to move to unoccupied spaces. To do this, we’ll randomly sample unhappy cells so we’re not introducing a spatial bias. Then, we’ll iterate through that sample, calling each patch in that group a mover, and picking a random spot in the grid as a moveto. A while loop will continue to pick a new random moveto while the current moveto is inhabited. Once an uninhabited moveto has been found, the mover’s value is applied to that patch, and removed from the original mover patch.

rand<-sample(nrow(unhappy_cells))
for (i in rand) {
  mover<-unhappy_cells[i,]
  mover_val<-grid[mover[1],mover[2]]
  move_to<-c(sample(1:51,1),sample(1:51,1))
  move_to_val<-grid[move_to[1],move_to[2]]
  while (move_to_val > 0 ){
    move_to<-c(sample(1:51,1),sample(1:51,1))
    move_to_val<-grid[move_to[1],move_to[2]]
  }
  grid[mover[1],mover[2]]<-0
  grid[move_to[1],move_to[2]]<-mover_val
}

Finally, we’ll check the output.

par(mfrow=c(1,2))
image(grid,col=c("black","red","green"),axes=F)
plot(runif(100,0,1),ylab="percent happy",xlab="time",col="white",ylim=c(0,1))
lines(happiness_tracker,oma = c(0, 0, 2, 0),col="red")
}

With the for loop we created around the whole program, we get animation in our graphical display. Here’s what we get when we set the alike_preference value to 70 percent:

And here’s what happens when it’s set to 72 percent:

Finally, here’s what happens when it’s set to 80:

For comparison, check out this version in NetLogo or this HTML version.

Tutorials

Turtles in Space! Integrating GIS and NetLogo

August 20, 2014 benjdavies 4 Comments

Classic agent-based models like Schelling’s model of segregation use very simple ideas about how the world works to explore how complex structures might emerge from simple behavioral rules. This is certainly useful when dealing with a general problem like segregation, but what if we have a specific case study to which we want to apply our model? The integration of models with case-specific datasets becomes particularly important when our models are to be used for policy and decision-making. In these instances, it is often necessary to have models that are capable of producing results that can be interpreted in real-world terms. For archaeology and other disciplines where the systems under study are cannot be confined to a laboratory, this often means being able to consider our data in spatial terms. If we want to build models that operate in realistic geographic settings (and many archaeological applications of agent-based models are aimed at this goal; see here or here for examples), we need to find a way of integrating geographic data with models. In this tutorial, I’ll talk about how to do this using NetLogo.

The NetLogo platform possesses several of the basic characteristics of a GIS, in the sense that it keeps track of spatial data in a systematic way, and can be used to create visualizations of spatial data. However, for most applications, the spatial data generated by NetLogo models are gross abstractions (e.g. these patches are “grass”, other patches are “not grass”). NetLogo has an in-built, gridded topology which can be manipulated by right clicking on the world window in the NetLogo interface and selecting “edit”. Here you can set your origin point, maximum and minimum coordinates, patch size, and whether or not the world “wraps” as a toroid or cylinder. You’ll want to consider these when you add in your data, as these will influence how your data is translated to the NetLogo world.

Importing GIS data using file input/output commands

One way to bring data into NetLogo would be to use the file input-output commands, which allow Netlogo to read in text files according to user-defined rules. An example of this can be found in the Code Examples section of the NetLogo model library (“File Input Example.nlogo”). If you open this model, there are three buttons: “load-patch-data”, “show-patch-data”, and “load-own-patch-data”. If you push “load-patch-data”, a window opens that looks like this:

Doesn’t seem like anything else happened, but when you push that button, some code runs in the background:

to load-patch-data
  ifelse ( file-exists? "File IO Patch Data.txt" ) [
    set patch-data []
    file-open "File IO Patch Data.txt"
    while [ not file-at-end? ][
       set patch-data sentence patch-data (list (list file-read file read file-read))
    ]
   user-message "File loading complete!"
   file-close
   ]
   [
     user-message "There is no File IO Patch Data.txt file in current directory!" 
   ]
end

What this does is first check whether the data file (“File IO Patch Data.txt”) exists, and if it didn’t it would send you a message saying the file isn’t in the current directory. NetLogo will only find a file if it is located in the same directory as the model itself, or if you specify another file location. But in this case, the file should be in the same directory, so there shouldn’t be any trouble. NetLogo then opens up an empty list called patch-data, and then reads in the data. The data comes in a sequence of numbers that can be divided into lots of three, with each representing the x and y coordinates of a patch, and then the color value to be added to that patch. Like so:

The model then uses a while loop with the conditional not file-at-end? to iterate through the entire file. On each go ’round the loop, NetLogo will use the file-read command three times to read three blocks of text (each being separated by a space in the data file), combine them into a single list of values (e.g. [ -17 17 9.9999 ]) and add them to the list called patch-data. Once the file reaches its end, the while loop stops, and all of the data has been loaded into NetLogo as sublists within the list variable patch-data. The file is then closed using file-close; this is important to do if you end up reading and writing using multiple files.

The next step is to push the button “show-patch-data”. Pressing this button runs this code:

to show-patch-data
  cp ct
  ifelse ( is-list? patch-data ) [ 
    foreach patch-data [ 
      ask patch first ? item 1 ? [ 
        set pcolor last ? 
      ]
    ]
  ]
  [ 
    user-message "You need to load in patch data first!"
  ]
end

This code first clears all existing entities from the model (cp = clear-patches, ct = clear-turtles). Then, as long as the patch-data list exists, it will read in each sublist using the foreach command. The foreach command is a loop which will iterate over each item in a list. If you want to refer to the item in question with your code, you can use the ?. The ? stands in for “an item in this list”. In this case, ? refers to each sublist within patch-data, and as the foreach loop continues through the list, each sublist becomes ? in turn. So, when the code reads the first sublist, which is [ -17 17 9.9999 ], it will ask for first ? item 1 ?, which translates to “(the first item in this sublist) (the second item in this sublist)” or -17 17. The first command refers to the first item in the sublist; you could also use the command item 0 to refer to this item. The second item in the sublist is item 1, and so on. Then it asks that patch (patch -17 17) to set pcolor last ?, which asks for the last item (which could also be referred to as item 2) in the sublist, which is 9.9999 (the numerical code for a shade of white). The list/sublist structure within a foreach loop can be a bit confusing; hopefully this diagram shows the relationships a little better:

Once each sublist has been read and a value assigned to the patches, every patch should have its appropriate color and the world looks like this:
Rather than code color variants, you might use this method to input geographic data (e.g. elevation, vegetation, etc.) into grid cells. Additionally, if you want your data to be point coverages, you could just express them in sets of two decimal coordinates and a value, and transfer them to agents using the create-turtles command. For example:

to show-point-data
  cp ct
  ifelse ( is-list? point-data ) [
    foreach point-data [ create-turtles 1 [
      setxy first ? item 1 ?  
      set pcolor last ?  
    ]
  ] 
  [
    user-message "You need to load in point data first!" 
  ]
end

However, if your data isn’t in this format (or you can’t convert it easily), another (perhaps easier) option is to use NetLogo’s GIS extension.

Using the GIS extension

The NetLogo GIS extension allows for standard GIS data formats to be added to and manipulated by a NetLogo model. This is really handy if you want to work directly with data downloaded from other sources or recorded using GPS in the field. This extension comes pre-packaged with NetLogo, so it isn’t necessary to download or configure yourself. All you need is to add this code to the top of your model:

extensions [ gis ]

The NetLogo GIS extension can handle both raster and vector (shapefile) data. There are examples of how the GIS code works in the Code Examples section of the NetLogo model library. The following worked example will use both types of data in a single NetLogo model.

Working with raster data

Raster (gridded) data can only be directly imported to NetLogo as ASCII (.asc) or ESRI grid (.grd) files. It is important that each file be accompanied by a projection file (.prj) file that is in the correct format.

Let’s say I’m interested in river discharge on coastal fish populations. To start, we’ll download some elevation/bathymetric data from the GEBCO world bathymetric dataset. I’ll pull a square from 150 E to 170 W, and from 60 S to 20 S (the area around New Zealand) and save it as “nz_area.asc”. Then we’ll drop the files (nz_area.asc and nz_area.prj, at the very least) into the directory where our model is stored.

The data itself, recorded at the scale of 30 arc seconds per pixel, gives us a grid of 4800 x 4800 pixels. To set up the world to accept this, you should edit the settings so that the location of the origin is in the bottom left corner, set the max-pxcor and max-pycor to 199, and choose a pixel size of 2 or so. This should make the map a reasonable size on your screen, but you can change this later on as you see fit.

Adding this code to your model should allow you to import the GIS data directly:

globals [ elevation ]

to load-gis
  set elevation gis:load-dataset "nz_area.asc"
  gis:set-world-envelope-ds gis:envelope-of elevation
end

The global variable elevation will be where the raster dataset is stored within the model. The gis:load-dataset command loads in the data, while the gis:set-world-envelope-ds sets the envelope of the world to match that of the GIS dataset. This latter command is very important if you’re working with more than one GIS layer and want to limit the boundaries of your world.

Sometimes, loading the data produces errors. Here’s a common one:

Extension exception: error parsing number
  error while observer running GIS:LOAD-DATASET
    called by procedure LOAD-RASTER
      called by Command Center

The problem here is that the data is not formatted exactly as we would like it to be (which happens quite often with data anyway). NetLogo’s GIS input is sensitive to a few issues:

It won’t handle commas(,) in place of decimals (.). Any commas or other punctuation should be avoided.
It doesn’t like certain NO_DATA values. You’re best off replacing these with an otherwise unused integer (like -9999).
Your header terms need to be separated from their values by single spaces, and not run into eachother. They should look something like this

ncols 4800
nrows 4800
xllcorner 150
yllcorner -60
cellsize 0.00833333333333333

To fix these issues, just open your file with a text editor, make the necessary changes, and then save your data and try again (you may have to check this a few times for weird carriage returns and formatting things like that in the data). If you follow these steps, you should be able to get your data into shape.

Other common errors might occur if you exceed the Java heap space or have incorrect projection data. If it’s the former, your dataset is too big to be used by NetLogo. You may be able to get around this by changing the Java heap size, but this only works up to a certain point. You could try loading it into QGIS or another GIS platform and either cutting down the area coverage or resampling it to a lower resolution to make files smaller, as long as this doesn’t adversely affect your model. If the issue is projection data, make sure you’ve got a projection file that is properly formatted for your data. In this case, the GEOGCS format looks like this:

GEOGCS[“GCS_WGS_1984”,DATUM[“D_WGS_1984”,SPHEROID[“WGS_1984”,6378137,298.257223563]],PRIMEM[“Greenwich”,0],UNIT[“Degree”,0.017453292519943295]]

Once we have data loaded in, we’ll want to visualize it. The first thing we can do is use the gis:paint command to color the patches to correspond with the raster data. Try entering this code into the command line:

gis:paint elevation 50

This should produce this picture in the world window.

However, we probably want to do more than just view the data. What would be better would be to allow agents to interact with it. There are two ways you could do this. The first is to ask agents to sample the raster wherever they are using the gis:raster-sample command:

turtles-own [ height_asl ]
...
to get-elevation
  ask turtles [
    set height_asl gis:raster-sample elevation self
  ]
end

Here, the agents will read the elevation data from wherever they stand and translate it into their own variable, called height_asl. Of course, calling up the raster data every time you want an agent to do something may be more coding than you want to use. You may want to get this out of the way up front and just bring the raster data into a patch variable. You can do this using the gis:apply-raster command:

patches-own [ elev ]
…
gis:apply-raster  elevation elev

Now, patches have an elevation value (called elev) drawn from the raster data. We can recolor our world based on that data using this code:

to recolor-patches
  ask patches [
    ifelse elev >= 0 [ set pcolor scale-color green elev 0 2000 ] [
      set pcolor blue
    ]
  ]
end

This produces a world where everything above sea-level is colored in a shade of green, while everything below sea-level is colored blue. It should look like this:

Working with shapefiles

Shapefiles (.shp), or vector data, can be imported into NetLogo fairly simply, but because the NetLogo world is gridded to begin with, the topological relationships between shapefiles and agents/patches can be a little fraught, and usually requires a good understanding of how lists work (see above).

We’ll use some polyline river data from NZ forest service for this example. Again, we’ll download the data and save the files in the same directory as our model. Now we need edit our globals to create a rivers variable, and then edit our load-gis procedure from the last section to load in the new dataset:

globals [ elevation rivers ]
...
to load-gis
  set elevation gis:load-dataset "nz_area.asc"
  gis:set-world-envelope-ds gis:envelope-of elevation
  gis:apply-raster elevation elev
  set rivers gis:load-dataset "nz-major-rivers.shp"
end

This is easy enough, and we can see the fruits of our labor by adding some code to the recolor-patches procedure from the last section:

to recolor-patches
ask patches [
    ifelse elev >= 0 [ set pcolor scale-color green elev 0 2000 ] [
      set pcolor blue
    ]
  ]
  gis:set-drawing-color blue + 2
  gis:draw rivers 1
end

Adding these last two lines allows us to see the overlaid river data:

Not a great visualization, but you get the idea. At this stage, agents can’t directly interact with the data. Whether or not an agent can interact with the data depends on its relationship to the shapefile, or, more importantly, its features and vertices. We can get the individual features of the shapefile (that is, each polyline representing a river section) by using the gis:feature-list-of command. Try entering this code into the command line:

gis:feature-list-of rivers

It should produce a list that looks something like this:

[{{gis:VectorFeature ["RPOLY":"0.0"] ["FNODE":"0.0"] ["RIVERS":"556.0"]["TNODE":"0.0"]["LPOLY":"0.0"]["LENGTH":"6935.921"]["RIVERS_ID":"0.0"]}} {{gis:VectorFeature...

Each set of double curly braces {{}} contains a vector feature, in this case a polyline representing a river segment approximately 6.935 kilometres long. We can also get at the river data in terms of its vertices by using the gis:vertex-list-of command on a particular feature (which could be accessed by treating gis:feature-list-of as a NetLogo list; e.g. gis:vertex-list-of (item 0 gis:feature-list-of rivers)). This is particularly useful when dealing with point data, as each point is its own vertex. A similar command is the gis:centroid-of, which would offer a single vertex at the center of a feature. This could be used to identify the geographic center of river segment. Finally, these vertices can be translated into lists of NetLogo coordinates using the gis:location-of command. For example:

gis:location-of(gis:centroid-of ( item 0 gis:feature-list-of rivers))

This would translate into “the NetLogo coordinates of the centroid of the first feature in the rivers dataset”, which are [100.3872222118864 77.92170204594207 ].

In order to get agents to be aware of vector data, we establish their topological relationship to the data. The basic relationships are established using a series of commands which return boolean values:

gis:intersects? determines whether the agent or patch intersects with the data
gis:contains? determines whether the agent or patch contains the data
gis:contained-by? determines whether the agent or patch is contained by the data

If we wanted to determine whether an agent was on a river, we could do this using something like this:

turtles-own [ height_asl wet? ]
...
to check-river
  ask turtles [
    ifelse gis:intersects? rivers self [
      set wet? true
    ]
    [
     set wet? false
   ]
  ]
end

However, you may find that your agents have a hard time locating rivers, even though by the look of the drawing they seem to be right on top of them. This is because both the river and the agent have very precise locations. It may be more useful to determine whether an agent is on a patch that intersects with a river. This would be:

turtles-own [ height_asl wet? ]
...
to check-river
  ask turtles [
    ifelse gis:intersects? rivers patch-here [
      set wet? true
    ]
    [
     set wet? false
   ]
  ]
end

Going further

There are additional commands in the GIS extension which are worth looking through, as they can come in handy for things like getting maximum and minimum values and so on. Also, getting GIS data into NetLogo is only half the fun: you can use the extension to extract data from your model. If you want to generate a dataset from NetLogo, you’ll need to combine the store-dataset command, which saves a generated dataset, with one of the following:

patch-dataset turns a patch set into a raster dataset
turtle-dataset turns an agent set into a point vector dataset
link-dataset turns a link set into a line vector dataset

If you can’t be bothered exporting data and doing spatial analysis in a separate program, you could use the NetLogo R extension to run spatial analysis using the spatstat package. Instructions on how to work with the R extension can be found in this earlier blog post, and information on the spatstat package is here.

I’d like to add a word of caution: working with real spatial relationships may seem appealing because of the potential to connect models to real world data, but it adds additional complexity to the model and reduces its generality. There are some issues which are inherent in the representation of geospatial data which are not often accounted for in an ABM setting. No spatial data is a perfect representation of reality, and certain trade-offs are always made. For example, depending on what your application is and how your model is scaled, you may need to consider how your spatial data is projected, and carefully calibrate agent behaviors to account for differences in projection. In the example above, the scale from east to west at the northern part of the map is quite different from that at the southern part of the map. Additionally, moving models into realistic spaces means outcomes usually become far more dependent on the particular conditions and data being used in the model, and less about the process under study. If your goal is hypothesis testing, then this might be appropriate, but it is certainly worth considering how well a process is understood and what assumptions are going into a model before undertaking site-specific modeling studies.

This latter concern has been voiced before, and should be of real importance to modelers in all disciplines. But this competes with our need for information which is useful for issues in the here and now. I’ll end with a quote from H. Randy Gimblett, in the introduction to an edited volume on the subject, which I think provides the counterpoint which should be considered in tandem with concerns of particularism in models:

It is without a doubt that individual-based modeling techniques coupled with GIS for examining human/wildlife/landscape interactions is a powerful tool for decisionmakers. Aside from representing individual behaviors and their associated environments, exploring both the social and ecological impacts in a dynamic system has tremendous potential for assisting decisionmakers in understanding and protecting extraordinary landscapes.

…agent-based simulations provide new and exciting ways for managers and researchers to explore and compare alternative scenarios before they are implemented and evaluate them in terms [of] consequences of policy actions and social, ecological, and economic impacts.

Featured image: “Space Turtle” by gabriela2006.deviantart.com

simulatingcomplexity

Tag Archives: featured

The Powers and Pitfalls of Power-Law Analyses

Building a Schelling Segregation Model in R

Turtles in Space! Integrating GIS and NetLogo

From the world of Complex Systems Simulation in Humanities