Tag Archives: tools

Software tools for ABMs

July 5, 2017 izaromanowska

A key consideration when embarking on an agent-based modelling project is ‘what are we going to write the model in?’. The investment of time and effort that goes into learning a new software tool or language is so considerable that in the vast majority of cases it is the model that has to be adjusted to the modeller’s skills and knowledge rather than the other way round.

Browsing through the OpenABM library, it is clear that NetLogo is the first choice in archaeology, the social sciences and ecology (51 results), with other platforms and languages trailing well behind (Java – 13 results, Repast – 5 results, Python – 5 results)*. But it goes without saying that there are more tools out there. A new paper published in Computer Science Review compares and contrasts 85 ABM platforms and tools.

It classifies each software package according to the ease of development (simple-moderate-hard) as well as its capabilities (light-weight to extreme-scale). It also sorts them according to their scope and possible subjects (purpose-specific, e.g., teaching, social science simulations, cloud computing, etc., or subject-specific, e.g., pedestrian simulation, political phenomena, artificial life) so that you have a handy list of software tools designed for different applications. This is, to the best of my knowledge, the first survey of this kind since this equally useful, but by now badly outdated, report from 2010.

Abar, Sameera, Georgios K. Theodoropoulos, Pierre Lemarinier, and Gregory M.P. O’Hare. 2017. “Agent Based Modelling and Simulation Tools: A Review of the State-of-Art Software.” Computer Science Review 24: 13–33. doi:10.1016/j.cosrev.2017.03.001.

* Note that the search terms might have influenced the numbers, e.g., if the simulation is concerned with pythons (the snakes) it would add to the count regardless of the language it was written in.

Image source: wikipedia.org

Case Studies, Tutorials

The Powers and Pitfalls of Power-Law Analyses

December 8, 2016 stefanicrabtree

People love power-laws. In the 90s and early 2000s it seemed like they were found everywhere. Yet early power-law studies did not subject the data distributions to rigorous tests. This decreased the potential value of some of these studies. And since an influential study by Aaron Clauset of CU Boulder, Cosma Shalizi of Carnegie Mellon, and Mark Newman of the University of Michigan, researchers have become aware that not all distributions that look power-law-like are actually power-laws.

But power-law analyses can be incredibly useful. In this post I show you first what a power-law is, second demonstrate an appropriate case-study to use these analyses in, and third walk you through how to use these analyses to understand distributions in your data.

What is a power-law?

A power-law describes a distribution of something (wealth, connections in a network, sizes of cities) that is often produced by a process known as preferential attachment. In power-laws there will be many of the smallest objects, with increasingly fewer of the larger objects. However, the largest objects disproportionately get the highest quantities of stuff.

The world wide web follows a power-law. Many sites (like Simulating Complexity) get small amounts of traffic, but some sites (like Google, for example) get high amounts of traffic. Then, because they get more traffic, they attract even more visits to their sites. Cities also tend to follow power-law distributions, with many small towns, and few very large cities. But those large cities seem to keep getting larger. Austin, TX for example, has 157.2 new citizens per day, making this city the fastest growing city in the United States. People are attracted to it because people keep moving there, which perpetuates the growth. Theoretically there should be a limit, though maybe the limit will be turning our planet into a Texas-themed Coruscant.
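To make the "more traffic attracts more traffic" idea concrete, here is a minimal sketch of a preferential-attachment process in base R. The number of visits, the chance of a new site appearing, and all variable names are illustrative assumptions, not values taken from any real web data.

```r
# Toy "rich get richer" simulation (a Simon/Yule-style process).
# All names and parameter values here are illustrative assumptions.
set.seed(42)

n_visits <- 5000   # number of visits to hand out
p_new    <- 0.1    # chance that a visit goes to a brand-new site

traffic <- 1       # start with one site that has one visit
for (i in seq_len(n_visits)) {
  if (runif(1) < p_new) {
    # A new site appears with a single visit
    traffic <- c(traffic, 1)
  } else {
    # An existing site is chosen with probability proportional
    # to the traffic it already has
    winner <- sample.int(length(traffic), 1, prob = traffic)
    traffic[winner] <- traffic[winner] + 1
  }
}

# Many small sites, a handful of very large ones
summary(traffic)
head(sort(traffic, decreasing = TRUE))
```

Sorting the result should show a few sites holding a large share of all visits while most sites stay tiny, which is the qualitative signature described above.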

This is in direct contrast to log-normal distributions. Log-normal distributions follow the law of proportional effect. This means that as something increases in size, it is predictably larger than what came before it. Larger things in log-normal distributions do not attract exponentially more things… they have a proportional amount of what came before. For example, experience and income should follow a log-normal distribution. As someone works in a job longer they should get promotions that reflect their experience. When we look at incomes of all people in a region we see that when incomes are more log-normally distributed these reflect greater equality, whereas when incomes are more power-law-like, inequality increases. Modern incomes seem to follow log-normality up to a point, after which they follow a power-law, showing that the richest attract that much more wealth, but under a certain threshold wealth is predictable.
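For readers who prefer to see the two shapes written down, these are the standard textbook densities for a continuous power law and a log-normal. The symbols (alpha for the exponent, x_min for the lower cutoff, mu and sigma for the mean and standard deviation of ln x) are conventional notation introduced here for illustration.

```latex
% Continuous power law above a lower cutoff x_min, with exponent alpha > 1
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\qquad x \ge x_{\min}

% Log-normal with parameters mu and sigma (mean and s.d. of ln x)
p(x) = \frac{1}{x \sigma \sqrt{2\pi}}
\exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right),
\qquad x > 0
```

On log-log axes the first is a straight line with slope -alpha, while the second curves downward, which is why the two can look similar over a limited range.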

If we analyze the distribution of modern incomes in a developing nation and see that they follow a power-law distribution, we will understand that there is a ‘rich get richer’ dynamic in that country, whereas if we see the incomes follow a log-normal distribution we would understand that that country had greater internal equality. We might want to know this to help influence policy.

When we analyze power-laws, however, we don’t want to just look at the graph that is created and say “Yeah, I think that looks like a power-law.” Early studies seemed to do just that. Thankfully Clauset et al. came up with rigorous methods to examine a distribution of data and see if it’s a power-law, or if it follows another distribution (such as log-normal). Below I show how to use these tools in R.

Power-law analyses and archaeology

So, if modern analyses of these distributions can tell us something about the equality (log-normal) or inequality (power-law) of a population, then these tools can be useful for examining the lifeways of past people. Questions we might be interested in asking are whether prehistoric cities also follow a power-law distribution, suggesting that the largest cities offered more social (and potentially economic) benefits similar to modern cities. Or we might want to understand whether societies in prehistory were more egalitarian or more hierarchical, thus looking at distributions of income and wealth (as archaeologists define them) to examine these. Power-law analyses of distributions of artifacts or settlement sizes would enable us to understand the development of inequality in the past.

Clifford Brown et al. talked about these very issues in their chapter Poor Mayapan from the book The Ancient Maya of Mexico edited by Braswell. While they don’t use the statistical tools I present below, they do present good arguments for why and when power-law versus other types of distributions would occur, and I would recommend tracking down this book and reading it if you’re interested in using power-law analyses in archaeology. Specifically they suggest that power-law distributions would not occur randomly, so there is intentionality behind those power-law-like distributions.

I recently used power-law and log-normal analyses to try to understand the development of hierarchy in the American Southwest. The results of this study will be published in 2017 in American Antiquity. Briefly, I wanted to look at multiple types of evidence, including ceremonial structures, settlements, and simulation data to understand the mechanisms that could have led to hierarchy and whether or not (and when) Ancestral Pueblo groups were more egalitarian or more hierarchical. Since I was comparing multiple different datasets, a method to quantitatively compare them was needed. Thus I turned to Clauset’s methods.

These had been updated by Gillespie in the R package poweRlaw.

Below I will go over the poweRlaw package with a built-in dataset, the Moby Dick words dataset. This dataset counts the frequency of different words. For example, there are many instances of the word “the” (19815, to be exact) but very few instances of other words, like “lamp” (34 occurrences) or “choice” (5 occurrences), or “exquisite” (1 occurrence). (Side note, I randomly guessed at each of these words, assuming each would have fewer occurrences. My friend Simon DeDeo tells me that ‘exquisite’ in this case is a hapax legomenon, or a term that only has one recorded use. Thanks Simon.) To see more go to http://roadtolarissa.com/whalewords/.

In my research I used other datasets that measured physical things (the size of roomblocks, kivas, and territories) so there’s a small mental leap for using a new dataset, but this should allow you to follow along.

The Tutorial

Open R.

Load the poweRlaw package

library("poweRlaw")

Add in the data

data("moby", package="poweRlaw")

This will load the data into your R session.

Side note:

If you are loading in your own data, you first load it in like you normally would, e.g.:

data <- read.csv("data.csv")

Then if you were subsetting your data you’d do something like this:

a <- subset(data, Temporal_Assignment != 'Pueblo III (A.D. 1140-1300)')
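As a rough sketch of what usually comes next with your own data, you would pull out the one numeric column you want to fit and drop anything a power-law fit cannot use (missing or non-positive values). The data frame below is made up, and the column name Size is a hypothetical stand-in for whatever your own file contains.

```r
# Hypothetical example data; 'Size' and 'Temporal_Assignment' stand in
# for whatever columns your own CSV contains.
data <- data.frame(
  Temporal_Assignment = c("Pueblo I", "Pueblo II", "Pueblo II",
                          "Pueblo III (A.D. 1140-1300)", "Pueblo II"),
  Size = c(12.5, NA, 40.2, 88.0, 0)
)

# Keep everything except one period, as in the subset() call above
a <- subset(data, Temporal_Assignment != "Pueblo III (A.D. 1140-1300)")

# Pull out a plain numeric vector, dropping missing and non-positive values
sizes <- a$Size
sizes <- sizes[!is.na(sizes) & sizes > 0]
sizes
```

The resulting vector is the kind of object you would pass to the fitting functions in place of moby.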

Next you have to decide if your data is discrete or continuous. What do I mean by this?

Discrete data can only take on particular values. In the case of the Moby Dick dataset, since we are counting physical words, this data is discrete. You can have 1 occurrence of exquisite and 34 occurrences of lamp. You can’t have 34.79 occurrences of it—it either exists or it doesn’t.

Continuous data is something that doesn’t fit into simple entities, but whose measurement can exist on a long spectrum. Height, for example, is continuous. Even if we bin peoples’ heights into neat categories (e.g., 6 feet tall, or 1.83 meters) the person’s height probably has some tailing digit, so they aren’t exactly 6 feet, but maybe 6.000127 feet tall. If we are being precise in our measurements, that would be continuous data.

The data I used in my article on kiva, settlement, and territory sizes was continuous. This Moby Dick data is discrete.

The reason this matters is that the poweRlaw package has two separate functions for continuous versus discrete data. These are:

conpl for continuous data, and

displ for discrete data

You can technically use either function and you won’t get an error from R, but the results will differ slightly, so it’s important to know which type of data you are using.

In the tutorial written here I will be using the displ function since the Moby dataset is discrete. Substitute in conpl for any continuous data.
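If you are unsure which kind of data you have, a quick check along these lines can help. This is only a sketch: looks_discrete is an illustrative helper written for this post, not a function from poweRlaw, and whole-number measurements could still be conceptually continuous.

```r
# Returns TRUE when every value is a positive whole number (counts),
# which suggests the discrete family; FALSE suggests the continuous one.
# 'looks_discrete' is an illustrative helper, not part of poweRlaw.
looks_discrete <- function(x) {
  all(is.finite(x)) && all(x > 0) && all(x == round(x))
}

word_counts <- c(1, 5, 34, 120)       # counts of things
room_areas  <- c(6.2, 11.75, 30.01)   # measured sizes

looks_discrete(word_counts)
looks_discrete(room_areas)
```

Treat the answer as a prompt to think about how the data were measured rather than as a rule.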

To create the power-law object, we first pass the data to displ:

pl_a <- displ$new(moby)

We then want to estimate the x-min value. Powerlaws are usually only power-law-like in their tails… the early part of the distribution is much more variable, so we find a minimum value below which we say “computer, just ignore that stuff.”

However, first I like to look at what the x_min values are, just to see that the code is working. So:

pl_a$getXmin()

Then we estimate the x-min. This is the code that does that:

est <- estimate_xmin(pl_a)

We then update the power-law object with the new x-min value:

pl_a$setXmin(est)

We do a similar thing to estimate the exponent α of the power law. The relevant functions are getPars and estimate_pars, so:

pl_a$getPars()

estimate_pars(pl_a)
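For intuition about what the x-min and α estimates are doing, here is a base R sketch of the approach described by Clauset et al.: for each candidate x-min, fit α to the tail by maximum likelihood, measure the Kolmogorov-Smirnov distance between the data and the fit, and keep the cutoff with the smallest distance. It assumes continuous data, where α has a closed form, and uses synthetic data with made-up parameter values. The package's own code handles the discrete case and many details differently.

```r
set.seed(1)

# Synthetic continuous power-law sample (inverse-transform sampling);
# true_alpha and true_xmin are illustrative values.
true_alpha <- 2.5
true_xmin  <- 2
x <- true_xmin * (1 - runif(5000))^(-1 / (true_alpha - 1))

# KS distance between the empirical and fitted tail CDFs for one cutoff
ks_for_xmin <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n      <- length(tail_x)
  alpha  <- 1 + n / sum(log(tail_x / xmin))      # closed-form MLE
  fitted <- 1 - (tail_x / xmin)^(-(alpha - 1))   # power-law CDF
  empirical <- (seq_len(n) - 1) / n              # empirical CDF
  c(xmin = xmin, alpha = alpha, ks = max(abs(empirical - fitted)))
}

# Candidate cutoffs: observed values that leave at least 10% of the
# data in the tail
candidates <- sort(unique(x))
candidates <- candidates[candidates <= quantile(x, 0.9)]

scan <- t(sapply(candidates, function(xm) ks_for_xmin(x, xm)))
best <- scan[which.min(scan[, "ks"]), ]
best
```

The row stored in best holds the chosen cutoff, the exponent fitted above it, and the corresponding KS distance, which is the "gof" idea that shows up in the bootstrap output later.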

Then we also want to know how plausible it is that our data follow a power law. For this we estimate a p-value (explained in Clauset et al.). Here is the code to do that (and output those data):

booty <- bootstrap_p(pl_a)

This will take a little while, so sit back and drink a cup of coffee while R chunks for you.

Then look at the output:

booty

Alright, we don’t need the whole sim, but it’s good to have the goodness of fit (gof: 0.00825) and p value (p: 0.75), so this code below records those for you.

variables <- c("p", "gof")

bootyout <- booty[variables]

write.table(bootyout, file="/Volumes/file.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)
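If you are curious what the bootstrap is doing while you drink that coffee, the following base R sketch shows the core idea under a big simplification: x-min is treated as fixed and known, and the data are continuous. The real procedure also re-estimates x-min on every synthetic dataset. All function names and parameter values here are illustrative assumptions.

```r
set.seed(11)

# Helpers for a continuous power law with a fixed, known xmin.
# Simplifying assumption: the full procedure also re-estimates xmin
# on every synthetic dataset.
rpl <- function(n, xmin, alpha) xmin * (1 - runif(n))^(-1 / (alpha - 1))
fit_alpha <- function(x, xmin) 1 + length(x) / sum(log(x / xmin))
ks_stat <- function(x, xmin, alpha) {
  x <- sort(x)
  n <- length(x)
  max(abs((seq_len(n) - 1) / n - (1 - (x / xmin)^(-(alpha - 1)))))
}

xmin     <- 1
observed <- rpl(500, xmin, 2.3)   # stand-in for real tail data
alpha    <- fit_alpha(observed, xmin)
gof      <- ks_stat(observed, xmin, alpha)

# Simulate from the fitted model, refit, and record each KS distance
n_sims  <- 200
sim_gof <- replicate(n_sims, {
  synthetic <- rpl(length(observed), xmin, alpha)
  ks_stat(synthetic, xmin, fit_alpha(synthetic, xmin))
})

# p-value: share of synthetic datasets that fit worse than the real one
p_value <- mean(sim_gof >= gof)
c(gof = gof, p = p_value)
```

A large p-value means the observed misfit is typical of data that really do come from a power law, which is why small values (below 0.1 in Clauset et al.) count against the power-law hypothesis.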

Next, we need to see if our data better fits a log-normal distribution. Here we compare our dataset to a log-normal distribution, and then compare the p-values and perform a goodness-of-fit test. If you have continuous data you’d use conlnorm for a continuous log normal distribution. Since we are using discrete data with the Moby dataset we use the function dislnorm. Again, just make sure you know which type of data you’re using.

### Estimating a log normal fit

aa <- dislnorm$new(moby)

We then set the xmin in the log-normal dataset so that the two distributions are comparable.

aa$setXmin(pl_a$getXmin())

Then we estimate the slope as above

est2 <-estimate_pars(aa)

aa$setPars(est2$pars)

Now we compare our two distributions. Please note that it matters which order you put these in. Here I have the power-law value first with the log-normal value second. I discuss what ramifications this has below.

comp <- compare_distributions(pl_a, aa)

Then we actually print out the stats:

comp

And then I create a printable dataset that we can then look at later.

myvars <- c("test_statistic", "p_one_sided", "p_two_sided")

compout <- comp[myvars]

write.table(compout, file="/Volumes/file2.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)
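To give a feel for where a comparison statistic of this kind comes from, here is a base R sketch of a Vuong-style likelihood-ratio test between a power law and a log-normal fitted to the same tail. It assumes continuous data and an illustrative cutoff, and the data are synthetic log-normal draws. It is meant as an illustration of the idea, not a reimplementation of compare_distributions.

```r
set.seed(3)

# Stand-in tail data: log-normal draws kept above an illustrative cutoff
xmin <- 1
x <- rlnorm(5000, meanlog = 0, sdlog = 1)
x <- x[x >= xmin]
n <- length(x)

# Pointwise log-likelihoods under a power law fitted by maximum likelihood
alpha <- 1 + n / sum(log(x / xmin))
ll_pl <- log(alpha - 1) - log(xmin) - alpha * log(x / xmin)

# Log-normal truncated at xmin, fitted numerically (sigma is kept
# positive by optimising over log(sigma))
ll_lnorm <- function(par, x, xmin) {
  mu <- par[1]
  sigma <- exp(par[2])
  dlnorm(x, mu, sigma, log = TRUE) -
    plnorm(xmin, mu, sigma, lower.tail = FALSE, log.p = TRUE)
}
fit <- optim(c(mean(log(x)), log(sd(log(x)))),
             function(par) -sum(ll_lnorm(par, x, xmin)))
ll_ln <- ll_lnorm(fit$par, x, xmin)

# Vuong-style statistic: positive favours the first model (power law),
# negative favours the second (log-normal)
d <- ll_pl - ll_ln
test_statistic <- sum(d) / (sd(d) * sqrt(n))
p_two_sided    <- 2 * pnorm(-abs(test_statistic))

c(test_statistic = test_statistic, p_two_sided = p_two_sided)
```

Swapping the two models only flips the sign of the statistic, which is why the order of the arguments matters for interpretation but not for the conclusion.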

And now all we have left to do is graph it!

pdf(file=paste('/Volumes/Power_Law.pdf', sep=''), width=5.44, height=3.5, bg="white", paper="special", family="Helvetica", pointsize=8)

par(mar=c(4.1,4.5,0.5,1.2))

par(oma=c(0,0,0,0))

# plot() on a poweRlaw object returns the plotted points; draw=FALSE keeps it from drawing them yet
pts_a <- plot(pl_a, draw=FALSE)

plot(pts_a, col='black', log='xy', xlab='', ylab='', xlim=c(1,400), ylim=c(0.01,1))

lines(pl_a, col=2, lty=3, lwd=2)

lines(aa, col=3, lty=2, lwd=1)

legend("bottomleft", cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c("powerlaw fit", "log normal fit"), lwd=1, yjust=0.5, xjust=0.5, bty="n")

text(x=70, y=1, cex=1, pos=4, labels=paste("Power law p-value: ", bootyout$p))

mtext("All regions, Size", side=1, line=3, cex=1.2)

mtext("Relative frequencies", side=2, line=3.2, cex=1.2)

box()

dev.off()

Now, how do you actually tell which is better, the log normal or power-law? Here is how I describe it in my upcoming article:

The alpha parameter reports the slope of the best-fit power-law line. The power-law probability reports the probability that the empirical data could have been generated by a power law; the closer that statistic is to 1, the more likely that is. We consider values below 0.1 as rejecting the hypothesis that the distribution was generated by a power law (Clauset et al. 2009:16). The test statistic indicates how closely the empirical data match the log normal. Negative values indicate log-normal distributions, and the higher the absolute value, the more confident the interpretation. However, it is possible to have a test statistic that indicates a log-normal distribution in addition to a power-law probability that indicates a power-law, so we employ the compare distributions test to compare the fit of the distribution to a power-law and to the log-normal distribution. Values below 0.4 indicate a better fit to the log-normal; those above 0.6 favor a power-law; intermediate values are ambiguous.

Please note, though, that it depends on what order you put the two distributions in the R code: if you put log-normal in first in the above compare distributions code, then the above would be reversed, so that values below 0.4 would favor power-laws, while those above 0.6 would favor log-normality. I may be wrong, but as far as I can tell it doesn’t actually matter which order you put the two distributions in, as long as you know which one went first and interpret it accordingly.

So, there you have it! Now you can run a power-law analysis on many types of data distributions to examine if you have a rich-get-richer dynamic occurring! Special thanks to Aaron Clauset for answering my questions when I originally began pursuing this research.

Full code at the end:

library("poweRlaw")
data("moby", package="poweRlaw")

### Fitting a power law
pl_a <- displ$new(moby)
pl_a$getXmin()
est <- estimate_xmin(pl_a)
pl_a$setXmin(est)
pl_a$getPars()
estimate_pars(pl_a)

### Bootstrapped p-value and goodness of fit
booty <- bootstrap_p(pl_a)
variables <- c("p", "gof")
bootyout <- booty[variables]
#write.table(bootyout, file="/Volumes/file.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

### Estimating a log normal fit
aa <- dislnorm$new(moby)
aa$setXmin(pl_a$getXmin())
est2 <- estimate_pars(aa)
aa$setPars(est2$pars)

### Comparing the two fits (power law first, log normal second)
comp <- compare_distributions(pl_a, aa)
comp
myvars <- c("test_statistic", "p_one_sided", "p_two_sided")
compout <- comp[myvars]
write.table(compout, file="/Volumes/file2.csv", sep=',', append=F, row.names=FALSE, col.names=TRUE)

### Plotting
pdf(file=paste('/Volumes/Power_Law.pdf', sep=''), width=5.44, height=3.5, bg="white", paper="special", family="Helvetica", pointsize=8)
par(mar=c(4.1,4.5,0.5,1.2))
par(oma=c(0,0,0,0))
# plot() on a poweRlaw object returns the plotted points; draw=FALSE keeps it from drawing them yet
pts_a <- plot(pl_a, draw=FALSE)
plot(pts_a, col='black', log='xy', xlab='', ylab='', xlim=c(1,400), ylim=c(0.01,1))
lines(pl_a, col=2, lty=3, lwd=2)
lines(aa, col=3, lty=2, lwd=1)
legend("bottomleft", cex=1, xpd=T, ncol=1, lty=c(3,2), col=c(2,3), legend=c("powerlaw fit", "log normal fit"), lwd=1, yjust=0.5, xjust=0.5, bty="n")
text(x=70, y=1, cex=1, pos=4, labels=paste("Power law p-value: ", bootyout$p))
mtext("All regions, Size", side=1, line=3, cex=1.2)
mtext("Relative frequencies", side=2, line=3.2, cex=1.2)
box()
dev.off()

Tutorials

Working with NetCDF files in an agent-based model: Skinning the model input data cat (UPDATED)

September 21, 2016 benjdavies

An older version of this tutorial used the now-deprecated ncdf package for R. This updated version makes use of the ncdf4 package, and fixes a few broken links while we’re at it.

You found it: the holy grail of palaeoenvironmental datasets. Some government agency or environmental science department put together some brilliant time series GIS package and you want to find a way to import it into your model. But oftentimes the data may be in a format which isn’t readable by your modeling software, or takes some finagling to get the data in there. NetCDF is one of the more notorious of these. A NetCDF file (which stands for Network Common Data Form) is a multidimensional array, where each layer represents the spatial gridded distribution of a different variable or set of variables, and sets of grids can be stacked into time slices. To make this a little more clear, here’s a diagram:

netcdf file structure — The basic structure of a NetCDF file

In this diagram, each table represents a gridded spatial coverage for a single variable. Three variables are represented this way, and these are stored together in a single time step. The actual structure of the file might be simpler (that is, it might consist of a single variable and/or single time step) or more complex (with many more variables or where each variable is actually a set of coverages representing a range of values for that variable; imagine water temperature readings taken at a series of depths). These chunks of data can then be accessed as combined spatial coverages over time. Folks who work with climate and earth systems tend to store their data this way. It’s also a convenient way to keep track of data obtained from satellite measurements over time. They’re great for managing lots of spatial data, but if you’ve never dealt with them before, they can be a bit of a bear to work with. ArcGIS and QGIS support them, but it can be difficult to work them into simulations without converting to a more benign data type like an ASCII file. In a previous post, we’ve discussed importing GIS data into a NetLogo model, but of course this depends on our ability to get the data into a model-readable format. The following tutorial is going to walk through the process of getting a NetCDF file, manipulating it in R, and then getting it into NetLogo.
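If it helps to see this structure in code, a NetCDF variable behaves much like a multidimensional array in R. The sketch below builds a toy array with made-up dimensions and labels to show how grids stack into time slices. It does not read a real file.

```r
# A toy stand-in for a NetCDF variable:
# 4 longitudes x 3 latitudes x 2 time steps.
# Real files are read with a NetCDF package; this only mimics the layout.
temperature <- array(
  data = seq_len(4 * 3 * 2),
  dim  = c(4, 3, 2),
  dimnames = list(
    longitude = paste0("lon", 1:4),
    latitude  = paste0("lat", 1:3),
    time      = c("t1", "t2")
  )
)

dim(temperature)                # size of each dimension
temperature[, , "t2"]           # the full grid for the second time step
temperature["lon1", "lat1", ]   # one cell followed through time
```

Leaving an index blank means "everything along this dimension", which is the same trick used later to pull out one month of snow cover.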

Step #1 – Locate the data

First let’s locate a useful NetCDF dataset and import it to R. As an example, we’ll use the Global Potential Vegetation Dataset from the UW-Madison Nelson Institute Sage Center for Sustainability and the Global Environment. As you can see, the data is also available as an ASCII file; this is useful because you can use this later to check that you’ve got the NetCDF working. Click on the appropriate link to download the Global Potential Veg Data NetCDF. The file is a tarball (extension .tar.gz), so you’ll need something to unzip it. If you’re not partial to a particular file compressor, try 7-Zip. Keep track of where the file is located on your local drive after downloading and unzipping.

Step #2 – Bring the data into R

R won’t read NetCDF files as is, so you’ll need to download a package that works with this kind of data. The ncdf4 package is one of a few different packages that work with these files, and we’ll use it for this tutorial. First, open the R console and go to Packages->Install Packages and download the ncdf4 package from your preferred mirror site. Then load the package by entering the following:

library(ncdf4)

Now, remembering where you saved your NetCDF file, you can bring it into R with the following command:

data <- nc_open(filename)

If you didn’t save the data file in your R working directory and want to navigate to the file, just replace filename with file.choose(). For now, we’ll use the 0.5 degree resolution vegetation data (vegtype_0.5.nc). Now if you type in data and press enter, you can check to see what the data variable holds. You should get something like this:

File C:\Users\me\Downloads\potveg_nc.tar\potveg_nc\vegtype_0.5.nc 
(NC_FORMAT_CLASSIC):

1 variables (excluding dimension variables):
  float vegtype[longitude,latitude,level,time]
    units:
    add_offset: 0
    scale_factor: 1
    missing_value: 8.99999982852418e+20

4 dimensions:
  longitude Size:720
    units: longitude
    add_offset: 0
    scale_factor: 1
  latitude Size:360
    units: latitude
    add_offset: 0
    scale_factor: 1
  level Size:1
    units: level/index
    add_offset: 0
    scale_factor: 1
  time Size:1 *** is unlimited ***
    units: year
    add_offset: 0
    scale_factor: 1

1 global attributes:
  title: Cover Types

This is telling you what your file is composed of. The first line tells you the name of the file. Beneath this are your variables. In this case, there is only one, vegtype, which according to the above uses a number just shy of nine hundred quintillion as a missing value (the computer will interpret any occurrences of this number as no data).

Next come your dimensions, giving the intervals of measurement. In this case, there are four dimensions: longitude, latitude, level, and time. Our file only has one time slice, meaning that it represents a single snapshot of data; if this number is larger, there will be more coverages included in your file over time. The coverage spans from 89.75 S to 89.75 N latitude in 0.5 degree increments, and 180 W to 180 E longitude by the same increments.

To access the vegtype data, we need to assign it to a local variable, which we will call veg:

ncvar_get(data,"vegtype") -> veg

The ncvar_get command extracts an identified variable (“vegtype”) from the NetCDF file (data) as a matrix. Then we assign it to the local variable veg. There are a number of other commands within the ncdf4 package which are useful for reading and writing NetCDF files, but these go beyond the scope of this blog entry. You can read more about them here.

Step #3 – Checking out the data

Now our data is available to us as a matrix. We can view it by entering the following:

image(veg)

upsidedown — R visualization of potential vegetation dataset (upside down)

Oops! Our output reads from bottom to top instead of top to bottom. No problem, we can just invert the latitude of the matrix like so:

image(veg, ylim=c(1,0))

rightsideup — R visualization of potential vegetation dataset (right side up)

However, this only changes the view; when we get the data into NetLogo later on, we’ll need to transpose it. But for now, let’s add some terrain colors. According to the readme file associated with the data, there are 15 different landcover types used here:

1. Tropical Evergreen Forest/Woodland
2. Tropical Deciduous Forest/Woodland
3. Temperate Broadleaf Evergreen Forest/Woodland
4. Temperate Needleleaf Evergreen Forest/Woodland
5. Temperate Deciduous Forest/Woodland
6. Boreal Evergreen Forest/Woodland
7. Boreal Deciduous Forest/Woodland
8. Evergreen/Deciduous Mixed Forest/Woodland
9. Savanna
10. Grassland/Steppe
11. Dense Shrubland
12. Open Shrubland
13. Tundra
14. Desert
15. Polar Desert/Rock/Ice

We could choose individual colors for each of these, but for the moment we’ll just use the in-built terrain color ramp:

image(veg,ylim=c(1,0),col=terrain.colors(15))

terrain cols — R visualization of potential vegetation dataset using terrain colors
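The upside-down plot earlier comes from how image() lays out a matrix: rows run along the x-axis and columns run up the y-axis, with column 1 at the bottom. The toy matrix below makes that visible. It assumes, as the upside-down plot suggests, that the file stores latitudes from north to south. The labels are made up for illustration.

```r
# A tiny 3 x 2 grid standing in for veg: rows are longitudes,
# columns are latitudes in the order they are stored in the file
grid <- matrix(1:6, nrow = 3,
               dimnames = list(c("west", "mid", "east"),
                               c("north", "south")))

# image(grid) would draw column 1 ("north") at the bottom of the plot,
# which is why the map appears upside down.

# Transposing gives one printed row per latitude with north first,
# the layout a row-by-row text raster expects
t(grid)

# Reversing the column order flips the data itself, rather than
# only the view as ylim = c(1, 0) does
grid[, ncol(grid):1]
```

This is the reason a transpose is all that is needed when the data are written out row by row in the next step.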

Step #4 – Exporting the data to NetLogo

Finally, we want to read our data into a modeling platform, in this case NetLogo, so let’s export it as a raster coverage we can work with. Before we do any file writing, we’ll need to coerce the matrix into a data frame and make sure we transpose it so that it doesn’t come out upside down again. To do this, we’ll use the following code:

veg2<-as.data.frame(t(veg))

The as.data.frame command does the coercing, while the t command does the transposing. Now we have to open up the file we’re going to write to:

fileCon<-file('vegcover.asc')

This establishes a connection to an open file which we’ve named vegcover.asc. Next, we’ll write the header data for an ASCII coverage. We can do this by adding lines to the file:

writeLines('ncols\t\t720\nnrows\t\t360\nxllcorner\t-179.75\nyllcorner\t-89.75\ncellsize\t0.5\nNODATA_value\t8.99999982852418e+20', fileCon)
close(fileCon)

This may look like a bunch of nonsense, but each \t is a tab, and each \n is a new line. The result is a header on our file which looks like this:

ncols         720
nrows         360
xllcorner     -179.75
yllcorner     -89.75
cellsize      0.5
NODATA_value  8.99999982852418e+20

Any program (whether a NetLogo model, GIS, or otherwise) that reads this file will look for this header first. The terms ncols and nrows define the number of columns and rows in the grid. The xllcorner and yllcorner define the lower left corner of the grid. The cellsize term describes how large each cell should be, and the NODATA_value is the same value from the original dataset which we used to define places where data is not available. Now we just need to enter our transposed data.

write.table(veg2,'vegcover.asc',append=TRUE,sep=" ",row.names=FALSE,col.names=FALSE)

This will take our data frame and write it to the file we just created, appending it after the header. It’s important that your separator be a space (sep=" ") in order to ensure that it is in a format NetLogo can read. Also make sure to get rid of any row and column names. Now we can read our file into NetLogo using the GIS extension (for an explanation of this, see here). Open a new NetLogo file, set the world window settings with the origin at the bottom left, a max-pxcor of 719 and a max-pycor of 359, and a patch size of 1. Save your NetLogo model in the same directory as the vegcover.asc file, and the following NetLogo code should do the trick:

extensions [ gis ]
globals [ vegcover ]
patches-own [ vegtype ]

to setup
  clear-all
  set vegcover gis:load-dataset "vegcover.asc"
  gis:set-world-envelope-ds gis:envelope-of vegcover
  ask patches [
    set pcolor white
    set vegtype gis:raster-sample vegcover self
  ]
  ask patches with [ vegtype <= 8 ] [ set pcolor scale-color green vegtype -5 10 ]
  ask patches with [ vegtype > 8 ] [ set pcolor scale-color pink vegtype 9 15 ]
end

This should produce a world in which patches have a variable called vegtype with values that correspond to the original dataset. Furthermore, patches are colored according to a set scheme where forested areas are on a scale of green, while non-forested areas are on a scale of pink. The result:

gis ncdf data read view — NetLogo visualization of potential vegetation dataset

If you’re truly curious as to whether this has worked as it should, you might download the ASCII version of the 0.5 degree data from the SAGE website, save it to the same directory, and replace vegcover.asc with the name of the ASCII file in the above NetLogo code to see if there is any difference.
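If you expect to export more than one grid, the header-and-data steps from this section can be wrapped into a single function. The sketch below is one way to do it in base R. The function name write_ascii_grid and its arguments are illustrative, and it assumes the matrix already has one row per latitude with north at the top.

```r
# Write a matrix as an ESRI ASCII grid. 'write_ascii_grid' and its
# arguments are illustrative names, not part of any package.
# Assumes 'grid' already has one row per latitude, north at the top.
write_ascii_grid <- function(grid, path, xllcorner, yllcorner,
                             cellsize, nodata) {
  header <- c(
    paste("ncols", ncol(grid)),
    paste("nrows", nrow(grid)),
    paste("xllcorner", xllcorner),
    paste("yllcorner", yllcorner),
    paste("cellsize", cellsize),
    paste("NODATA_value", nodata)
  )
  writeLines(header, path)
  write.table(grid, path, append = TRUE, sep = " ",
              row.names = FALSE, col.names = FALSE)
  invisible(path)
}

# Usage with a small made-up grid written to a temporary file
demo <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2, byrow = TRUE)
demo[is.na(demo)] <- -9999
out <- write_ascii_grid(demo, tempfile(fileext = ".asc"),
                        xllcorner = 0, yllcorner = 0,
                        cellsize = 0.5, nodata = -9999)
readLines(out)
```

Taking ncols and nrows from the matrix itself avoids the header and the data drifting out of sync when you switch to a grid of a different size.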

Going further

So far, this has been meant to provide a simple tutorial of how to get data from a NetCDF file into an ABM platform. If you’re only dealing with a single coverage, you might be more at home converting your file using QGIS or another standalone GIS. If you’re dealing with multiple time steps or variables from a large dataset, it might make sense to write an R script that will extract the data systematically using combinations of the commands above. However, you might also make use of the R NetLogo extension to query a NetCDF file on the fly. To proceed with this part of the tutorial, you’ll need to download the R extension and have it installed correctly.

First, let’s find a NetCDF file with a temporal component. In honor of the impending winter my Northern Hemisphere colleagues are about to endure, I’m going to use the Northern Hemisphere EASE-Grid Snow Cover and Sea Ice Extent dataset from NOAA, which gives monthly (derived from weekly) snow cover data from 1971 to 1995. Go to the website and download the Monthly Mean dataset and save the file ‘snowcover.mon.mean.nc’ to your local drive, keeping track of its location.

We’ll start a new NetLogo model, implement the R extension, and create two global variables and a patch variable:

extensions [ R ]
globals [ snowcover s ]
patches-own [ snow ]

The snowcover variable will be our dataset, while s will be a placeholder for monthly coverages. The patch variable snow will be the individual grid cell values from our data which will be updated monthly. Next, we’ll run a setup command which clears the model, loads the ncdf4 library, opens our NetCDF snowcover file, extracts our snowcover data, and resets our ticks counter. You may need to edit the code below so that it reflects the location of your NetCDF file.

to setup
clear-all
r:clear
r:eval "library(ncdf4)"
r:eval "data<-nc_open(\"C:/Users/me/Downloads/snowcover.mon.mean.nc\")"
r:eval "ncvar_get(data, \"snowcover\") -> snow"
reset-ticks
end

Now, we could automate the process of converting to ASCII and importing the GIS data here, but that’s likely to be a slow solution and generate a lot of file bloat. Alternatively, if our world window is scaled to the same size as the NetCDF grid (or to some easily computed fraction of it), we can simply import the raw data and transmit the values directly to patches (not unlike the File Input example here). To do this, right click on the world window and edit it so that the location of the origin is the bottom left, and that the max-pxcor is 359 and the max-pycor is 89 (this is 360 x 90, the same size as our Northern Hemisphere snowcover data). We’ll also make sure the world doesn’t wrap, and set the patch size to 3 to make sure it fits on our screen.

edit world window settings — NetLogo world window settings

Next, we’ll generate the transposed dataframe as in the above example, but this time for a single monthly coverage. Then we’ll import this data from R into the NetLogo placeholder variable s:

to go
  tick
  r:eval (word "snow2<-as.data.frame(t(snow[,," ticks "]))")
  set s r:get "snow2"
  ask patches [ get-snow ]
  if ticks >= 297 [ stop ]
end

Because our snowcover data has a time component, we need to tell it which month we want to use by inserting a value for the third axis. For example, if we wanted the value for row 1, column 1 in month 3, we would send R the phrase snow[1,1,3]. In this case, we want the entire coverage but for a single month, so we leave out the values for row and column and only feed R a value for the month. We use the word command here to concatenate the string which will serve as our R command, but which incorporates the current value from the NetLogo ticks counter to substitute for the month value. As the ticks counter increases, this will shift the data from one month to the next. The if ticks >= 297 [ stop ] command will ensure that the model only runs for as long as we have data for (which is 297 months). When we import this data frame from R into our NetLogo model, it will be imported as a set of nested lists, where each sublist represents a column from the data frame (from 1 to 360). If we enter s into the command line, it will look something like this:

[[1.0427087545394897E-5 1.0427087545394897E-5 1.0427087545394897E-5…
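To make the shape of that nested list concrete, here is a minimal Python sketch. Python is not part of the tutorial, and the tiny 3 x 2 x 2 array and the helper name month_as_columns are made up for illustration. It mimics slicing one month out of a [longitude, latitude, month] array, transposing it, and then reading the result column by column, which is roughly what as.data.frame(t(snow[,,ticks])) followed by r:get hands over to NetLogo.

```python
def month_as_columns(snow, month):
    """Return one month's coverage as a list of columns (one per longitude).

    snow is indexed [lon][lat][month]; month is 0-based here,
    whereas R (and the ticks counter in the tutorial) counts from 1.
    """
    # Slice out a single month: a matrix indexed [lon][lat].
    month_slice = [[cell[month] for cell in lon_row] for lon_row in snow]
    # Transpose so rows are latitudes and columns are longitudes (like R's t()).
    transposed = [list(row) for row in zip(*month_slice)]
    # A data frame is handed over column by column, so gather each column.
    n_cols = len(transposed[0])
    return [[row[c] for row in transposed] for c in range(n_cols)]


# Hypothetical toy data: the value encodes its own position as lon*100 + lat*10 + month.
snow = [[[lon * 100 + lat * 10 + m for m in range(2)]
         for lat in range(2)]
        for lon in range(3)]

columns = month_as_columns(snow, 1)
print(columns)
```

Each sublist holds all latitude values for one longitude, so the outer index plays the role of the column and the inner index the role of the row.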

What we’ll want to do is pull values from these lists which correspond with the patch coordinates. However, remember that our world originates in the bottom left and increases toward the top right, while our data originates in the top left and increases toward the bottom right. What we’ll need to do is flip the y-axis values we use to reflect this (note: originating the model in the top left would give our NetLogo world negative Y-values, which would likewise need to be converted). We can do this with the following:

to get-snow
  let x pxcor
  ; flip the y-axis: data row 0 is at the top, patch row 0 is at the bottom
  let y 89 - pycor
  set snow item y (item x s)
  set pcolor scale-color grey snow 0 100
end

What this does is create temporary x and y values from the patch coordinates, but inverts the y-axis value of the patch (so top left is now bottom left). Then the patch sets its snow value by pulling out the value that corresponds with the appropriate row (item y) from the list that corresponds with the appropriate column (item x s). Finally, it sets its color along a scale from 0 to 100. When we run this code, the result is a lovely visualization of the monthly changes in snow cover from the Northern Hemisphere.
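If the axis flip feels slippery, it can help to check it outside NetLogo. The following is only a sketch in Python under the same assumptions as the tutorial (a list of columns, with the first value in each column being the top row of the data); the function name snow_for_patch and the 2 x 2 toy grid are hypothetical.

```python
MAX_PYCOR = 89  # top row of the NetLogo world used in the tutorial


def snow_for_patch(columns, pxcor, pycor, max_pycor=MAX_PYCOR):
    """Look up the data value for a patch whose origin is the bottom left."""
    # The data counts rows from the top, the world counts them from the bottom,
    # so the row index is mirrored around the top edge.
    row = max_pycor - pycor
    return columns[pxcor][row]


# Toy 2 x 2 grid (max_pycor = 1): each column lists its top value first.
toy = [["top-left", "bottom-left"],
       ["top-right", "bottom-right"]]

print(snow_for_patch(toy, 0, 0, max_pycor=1))  # patch at the bottom left
print(snow_for_patch(toy, 1, 1, max_pycor=1))  # patch at the top right
```

The patch at the world's origin should pick up the last value of the first column, and the patch in the top right corner the first value of the last column.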

So there you have it; a couple of different ways to get NetCDF data into a model using R and NetLogo. Of course, if you’re going to all of this trouble to work with such extensive datasets, it may be worth your while to explore alternative platforms which can build in native NetCDF support. Or you might build a model in R entirely. But I reckon the language is largely inconsequential as long as the model is well thought out, and part of that is figuring out what kind of input data you need and how to get it into your model. With a bit of imagination, there are many, many ways to skin this cat.

Data references:

Ramankutty, N., and J.A. Foley (1999). Estimating historical changes in global land cover: croplands from 1700 to 1992, Global Biogeochemical Cycles 13(4), 997-1027.

Cavalieri, D. J., J. Crawford, M. Drinkwater, W. J. Emery, D. T. Eppler, L. D. Farmer, M. Goodberlet, R. Jentz, A. Milman, C. Morris, R. Onstott, A. Schweiger, R. Shuchman, K. Steffen, C. T. Swift, C. Wackerman, and R. L. Weaver. 1992. NASA sea ice validation program for the DMSP SSM/I: final report. NASA Technical Memorandum 104559. 126 pp.

Featured image: GEBCO global bathymetric dataset and OSCAR Global Currents dataset visualized using QGIS.

Tutorials

Working with NetCDF files in an agent-based model: Skinning the model input data cat

November 17, 2014 benjdavies Leave a comment

UPDATE 22 SEPT 2016: Please see the updated version of this tutorial here.

Step #1 – Locate the data

Step #2 – Bring the data into R

R won’t read NetCDF files as is, so you’ll need to download a package that works with this kind of data. The ncdf package is one of a few different packages that work with these files, and we’ll use it for this tutorial. First, open the R console and go to Packages->Install Packages and download the ncdf package from your preferred mirror site. Then load the package by entering the following:

library(ncdf)

Now, remembering where you saved your NetCDF file, you can bring it into R with the following command:

data <- open.ncdf(filename)

If you didn’t save the data file in your R working directory and want to navigate to the file, just replace filename with file.choose(). For now, we’ll use the 0.5 degree resolution vegetation data (vegtype_0.5.nc). Now if you type in data and press enter, you can check to see what the data variable holds. You should get something like this:

[1] "file C:\\Users\\me\\Downloads\\potveg_nc.tar\\vegtype_0.5.nc has 4 dimensions:"
[1] "longitude Size: 720"
[1] "latitude Size: 360"
[1] "level Size: 1"
[1] "time Size: 1"
[1] "------------------------"
[1] "file C:\\Users\\me\\Downloads\\potveg_nc.tar\\vegtype_0.5.nc has 1 variables:"
[1] "float vegtype[longitude,latitude,level,time] Longname:vegtype Missval:8.99999982852418e+20"

This is telling you what your file is composed of. The first line tells you the name of the file and how many dimensions it has. In this case, there are four dimensions: longitude, latitude, level, and time. Our file only has one time slice, meaning that it represents a single snapshot of data; if this number is larger, there will be more coverages included in your file over time. The coverage spans from 89.75 S to 89.75 N latitude in 0.5 degree increments, and 180 W to 180 E longitude by the same increments. Beneath the dashed line are your variables. In this case, there is only one, vegtype, which according to the above uses a number just shy of nine hundred quintillion as a missing value (the computer will interpret any occurrences of this number as no data). To access this coverage, we need to assign it to a local variable, which we will call veg:

get.var.ncdf(data, data$var[[1]]) -> veg

The get.var.ncdf command takes an identified variable (data$var[[1]]) and extracts it from the NetCDF file (data) as a matrix. In this instance, we only have one variable so get.var.ncdf would have identified it without the data$var[[1]], but if you had others, you would access them by replacing the 1 with the corresponding number from the list of variables above. Then we assign it to the local variable veg. There are a number of other commands within the ncdf package which are useful for reading and writing NetCDF files, but these go beyond the scope of this blog entry. You can read more about them here.
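The missing-value convention mentioned above is worth a quick illustration. This is a sketch in Python rather than R, and it does not describe how the ncdf package handles missing values internally; the helper name mask_missing and the toy grid are hypothetical. It simply shows the idea of a sentinel number standing in for "no data", compared with a small tolerance because the value is stored as a float.

```python
import math

MISSING = 8.99999982852418e+20  # the Missval reported for vegtype above


def mask_missing(grid, missing=MISSING, rel_tol=1e-6):
    """Return a copy of grid with sentinel cells replaced by None."""
    return [
        [None if math.isclose(value, missing, rel_tol=rel_tol) else value
         for value in row]
        for row in grid
    ]


# Toy 2 x 2 coverage: two cells with vegetation types, two with no data.
toy_grid = [[1.0, MISSING],
            [MISSING, 15.0]]

masked = mask_missing(toy_grid)
print(masked)
```

Whatever tool you use, it pays to confirm that the sentinel has been turned into a proper "no data" marker before doing any arithmetic on the coverage.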

Step #3 – Checking out the data

Now our data is available to us as a matrix. We can view it by entering the following:

image(veg)

Oops! Our output reads from bottom to top instead of top to bottom. No problem, we can just invert the latitude of the matrix like so:

image(veg, ylim=c(1,0))

Tropical Evergreen Forest/Woodland
Tropical Deciduous Forest/Woodland
Temperate Broadleaf Evergreen Forest/Woodland
Temperate Needleleaf Evergreen Forest/Woodland
Temperate Deciduous Forest/Woodland
Boreal Evergreen Forest/Woodland
Boreal Deciduous Forest/Woodland
Evergreen/Deciduous Mixed Forest/Woodland
Savanna
Grassland/Steppe
Dense Shrubland
Open Shrubland
Tundra
Desert
Polar Desert/Rock/Ice

We could choose individual colors for each of these, but for the moment we’ll just use the in-built terrain color ramp:

image(veg,ylim=c(1,0),col=terrain.colors(15))

Step #4 – Exporting the data to NetLogo

veg2<-as.data.frame(t(veg))

The as.data.frame command does the coercing, while the t command does the transposing. Now we have to open up the file we’re going to write to:

fileCon<-file('vegcover.asc')

This establishes a connection to an open file which we’ve named vegcover.asc. Next, we’ll write the header data for an ASCII coverage. We can do this by adding lines to the file:

writeLines('ncols\t\t720\nnrows\t\t360\nxllcorner\t-179.75\nyllcorner\t-89.75\ncellsize\t0.5\nNODATA_value\t8.99999982852418e+20', fileCon)
close(fileCon)

write.table(veg2,'vegcover.asc',append=TRUE,sep=" ",row.names=FALSE,col.names=FALSE)
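For comparison, the same two-part file layout (six header lines, then one line of space-separated values per row) can be sketched in Python. This is an illustration under assumptions, not a replacement for the R code above: the function name write_ascii_grid, the 2 x 3 toy grid and the -9999 no-data value are all made up for the example.

```python
import io


def write_ascii_grid(rows, xllcorner, yllcorner, cellsize, nodata, out):
    """Write a list of rows (top row first) as an ASCII grid to a text stream."""
    header = [
        f"ncols {len(rows[0])}",
        f"nrows {len(rows)}",
        f"xllcorner {xllcorner}",
        f"yllcorner {yllcorner}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ]
    # Each header entry must end up on its own line.
    out.write("\n".join(header) + "\n")
    # The data follow, one grid row per line, values separated by spaces.
    for row in rows:
        out.write(" ".join(str(value) for value in row) + "\n")


# Write a hypothetical 2 x 3 grid to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_ascii_grid([[1, 2, 3], [4, 5, 6]], -179.75, -89.75, 0.5, -9999, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```

Opening the resulting file in a text editor and checking that the header occupies exactly six lines is a cheap way to catch a stray escape character before the GIS import fails.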

extensions [ gis ]
globals [ vegcover ]
patches-own [ vegtype ]

Going further

We’ll start a new NetLogo model, implement the R extension, and create two global variables and a patch variable:

extensions [ R ]
globals [ snowcover s ]
patches-own [ snow ]

to setup
  clear-all
  r:clear
  r:eval "library(ncdf)"
  r:eval "data<-open.ncdf(\"C:/Users/me/Downloads/snowcover.mon.mean.nc\")"
  r:eval "get.var.ncdf(data, data$var[[1]]) -> snow"
  reset-ticks
end

Next, we’ll generate the transposed dataframe as in the above example, but this time for a single monthly coverage. Then we’ll import this data from R into the NetLogo placeholder variable s:

to go
  tick
  r:eval (word "snow2<-as.data.frame(t(snow[,," ticks "]))")
  set s r:get "snow2"
  ask patches [ get-snow ]
  if ticks >= 297 [ stop ]
end

[[1.0427087545394897E-5 1.0427087545394897E-5 1.0427087545394897E-5…

to get-snow
  let x pxcor
  ; flip the y-axis: data row 0 is at the top, patch row 0 is at the bottom
  let y 89 - pycor
  set snow item y (item x s)
  set pcolor scale-color grey snow 0 100
end

Data references:

Ramankutty, N., and J.A. Foley (1999). Estimating historical changes in global land cover: croplands from 1700 to 1992, Global Biogeochemical Cycles 13(4), 997-1027.

Featured image: GEBCO global bathymetric dataset and OSCAR Global Currents dataset visualized using QGIS.

Tutorials

“R you experienced?” Using the R extension for NetLogo

June 23, 2014 benjdavies 3 Comments

Have you been working on NetLogo models, but wish you could be making prettier plots? Wish you could be doing data analysis on the fly? Have mathematical calculations that are tripping up your code? Stef’s post the other day on the R turtle graphics reminded me about a tool I use all the time: the R-extension for NetLogo. Put together by Jan C. Thiele and Volker Grimm, the extension allows you to connect NetLogo directly to the beloved stats package, R. If you don’t have familiarity with R but wish you did, check out the resources on our tutorials page. I heartily recommend this online course (if you can’t wait for the course to start, you can watch the videos on YouTube here).

First things first, you’ll need to have NetLogo installed, and download and install the extension. Installing the extension is not as simple as dropping the files into the extension folder, but it’s worth the trouble. Step-by-step instructions are included in the download. Basically, the things to watch out for are making sure that your environment variables are set up correctly, and making sure that the Java and R you’re running have the same architecture (both 32-bit or both 64-bit). If you want to be using R to plot from NetLogo, you’ll need to have the JavaGD package installed as well.

With the R-extension properly installed, let’s try and add it to some existing models in the NetLogo library.

Example 1: Segregation Model

First we’ll use NetLogo’s Segregation Model, which is based on Thomas Schelling’s famous model of segregation. Agents in this model determine how happy they are based on a preference for neighbors of a similar type, and will try to move elsewhere if they are unhappy. What we want to do in this example is collect some information about the population and send it to R for plotting.

1) Add this code to the top of the code page.

extensions [ r ]

This will connect the model to the R extension. Mind that this code will not work if the R extension is not installed correctly (most likely, NetLogo will simply close when you try to run it or check it). If you haven’t used any extensions for NetLogo, it’s worth having a look at the NetLogo user manual.

2) Add some code to the setup procedure which will clear any existing data in R and send any plot commands to a separate window.

r:clear
r:setPlotDevice

3) Now we need something for R to do. Let’s say we want to evaluate the average distance from a turtle of one color to a turtle of the other color and plot those values as a histogram. This can be done with the following code:

to get-distance-to-other
  let d []
  ask turtles [
    set d lput (distance (min-one-of (turtles with [ color != [ color ] of myself ]) [ distance myself ])) d
  ]
  r:put "dist" d
  r:eval "hist(dist, xlab=\"Distance\",main=\"Distance to Nearest Neighbor of Other Color\")"
end

There are two basic commands used here that will send things from NetLogo to R. The r:eval command will send a command to R without expecting any value to return to NetLogo. The r:put command assigns a value (in this case, a list) to an R variable. Making use of NetLogo’s lists works well because they are transferable to R’s vector class. Here, we’ve created an empty list, d, and asked all the agents to calculate the distance to their nearest neighbor of a different color, and then add it to the end of the list using the lput command. Once this is done, we can convert that list to a vector called dist in R by using the r:put command. Then we just ask R to make a histogram from the vector values using r:eval, and add in our own nifty labels (Note: be sure to use \" when using quotations within an R command sent from NetLogo). Now you can add this command to a button, enter it into the command line, or embed it elsewhere in your code whenever you want this histogram to appear. Voilà.

NetLogo displays and R histograms for %-similar-wanted 40% (top) and 75% (bottom)
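If you want to sanity-check the numbers that end up in the histogram, the same calculation is easy to reproduce outside NetLogo. Here is a minimal sketch in Python; the function name distances_to_nearest_other and the three toy agents are hypothetical, and it uses plain straight-line distance, so it assumes a world that does not wrap (NetLogo's distance takes wrapping into account when it is switched on).

```python
import math


def distances_to_nearest_other(agents):
    """For each (x, y, color) agent, return the distance to the nearest
    agent of a different color. Assumes at least two colors are present."""
    result = []
    for x, y, color in agents:
        # Keep only the agents whose color differs from this agent's color.
        others = [(ox, oy) for ox, oy, ocolor in agents if ocolor != color]
        # Straight-line distance to the closest of them.
        result.append(min(math.hypot(ox - x, oy - y) for ox, oy in others))
    return result


# Three toy agents: one red, two green.
agents = [(0, 0, "red"), (3, 4, "green"), (0, 1, "green")]
dist = distances_to_nearest_other(agents)
print(dist)
```

The resulting list plays the same role as the NetLogo list d: one value per agent, ready to be binned into a histogram.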

Example 2: Random Walks

What if we want to bring values back from R? Another command, r:get, will do this. We’ll use the Random Walk example in the NetLogo library, which is just a model of an agent picking a random direction and taking a step (for an archaeological example of this, see Brantingham’s 2003 paper on raw material procurement). Let’s say that, rather than have the step length be the default value of 1, we want to pick step lengths from a Cauchy distribution. NetLogo’s native random number generators don’t include Cauchy distributions, but we can get them from R.

To do this, we edit the go procedure in the Random Walk example so that it looks like this:

to go
   ask turtles [
       rt random 360
       forward r:get "rcauchy(1)"
   ]
   tick
end

Here, we’ve simply replaced the step length of 1 with a random number R has drawn from a Cauchy distribution using the default parameters (x0 = 0, γ = 1). By using the r:get command, a value is returned which is then used to move the agent forward.

NetLogo displays of random walks using step length 1 (left) and Cauchy-distributed step-lengths from R (right)
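For the curious, a Cauchy draw is not hard to produce by hand. The sketch below, in Python, uses the standard inverse-CDF formula x0 + γ·tan(π(u − 0.5)) for a uniform u between 0 and 1; this is an illustration of the distribution, not a description of how R's rcauchy is implemented, and the function names cauchy_step and random_cauchy_step are made up for the example.

```python
import math
import random


def cauchy_step(u, x0=0.0, gamma=1.0):
    """Map a uniform value u in (0, 1) to a Cauchy-distributed value
    with location x0 and scale gamma (inverse-CDF method)."""
    return x0 + gamma * math.tan(math.pi * (u - 0.5))


def random_cauchy_step(rng, x0=0.0, gamma=1.0):
    """Draw one Cauchy-distributed step length using the generator rng."""
    return cauchy_step(rng.random(), x0, gamma)


rng = random.Random(42)  # seeded only so the example is repeatable
steps = [random_cauchy_step(rng) for _ in range(5)]
print(steps)
```

Note that a Cauchy draw can be negative, in which case forward moves the turtle backward along its heading, and that the heavy tails will occasionally produce very long steps.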

Going further…

Once you’re comfortable working between NetLogo and R, the sky’s the limit. Any package that works with your current version of R can be used by NetLogo. I’ve found many that are useful for data analysis during model-building. For example, I tend to use spatstat for spatial analysis, allowing me to run cluster analyses on agents and get output immediately. To add a package to your current project, make sure the package is installed on your system and then just add this code to your setup routine:

r:eval "library(package name)"

Hopefully what’s clear is that this is just the beginning: by adding R functionality to NetLogo, you’ve got a powerful set of tools at your disposal for both model-building and data analysis. Good times!

References

Thiele JC, Grimm V (2010). NetLogo meets R: Linking agent-based models with a toolbox for their analysis. Environmental Modelling and Software 25(8): 972 – 974. [DOI: 10.1016/j.envsoft.2010.02.008].

Featured image: some code used for a grand evil scheme of world domination (or possibly to parse CSV files), written using R-Studio.

General

How the Python Ate the Turtle

May 26, 2014 izaromanowska 7 Comments

In the first blogpost dealing with the modelling tools of the trade, Ben Davies argued in favour of NetLogo. I join the debate to argue for the simplest of commonly used programming languages – Python.

Like pretty much everyone coming from archaeology into modelling, I started with NetLogo. And I do appreciate its beauty, its simplicity, its user-friendliness and the underlying philosophy that the simpler the simulation the better.

The unbelievable speed with which you can teach a complete newbie (or yourself) to create a full-blown simulation from scratch makes it the perfect tool to introduce non-coding archaeologists to modelling. On top of that it gives you the feeling of familiarity with what you code (turtles ‘hatch’ or get ‘sprouted’, they ‘move-to’, they ‘die’ etc.) and immediate feedback on what’s happening as the simulation unravels on the screen. It is great for prototyping, playing with ideas and creating simple models.

In a nutshell, I love the turtles!

Sadly, our paths split at some point. I learnt the second best thing in the computing world (after this blog) – Python, and I have never looked back. Here are the four main reasons why:

A complete change of perspective

NetLogo’s simplicity is seductive yet it may be deceiving. The simplicity of use comes at the price of obscuring what exactly happens underneath the hood of your simulation. You told your turtles to ‘move-to’ but what exactly does it mean? Do the agents have a special variable ‘location’ or, perhaps, the grid on which they live is a dynamically updated matrix? It may not take much time to develop a model but it will take time to fully understand it. And it is just too tempting to simply trust the language and hope that it does what you think it does. Until, as it happened to several of my colleagues, you reimplement your model in another language and the results are completely different. Why? Because they thought the model was doing x and, in fact, it was doing y. Anyone who has ever used any stats package, GIS software or a database knows how surprisingly often this happens in the digital world.

Programming is, in fact, a series of very simple maths operations and the more you understand what combination of these calculations constitute your model, the better. Sometimes, it may even become obvious that you don’t need a simulation at all – you can use nothing more than a calculator and still solve the problem. As boring as it may sound, sooner or later we will have to switch the focus of modelling in archaeology from ‘making turtles do x’ to thinking in terms of mathematical operations.

This touches on the issue of testing. ‘Testing’ means you take a series of numbers, calculate outside the model how these numbers should change with every step of the simulation and compare them to the numbers that your model spat out. One should not underestimate the testing stage – it often takes as long as (or longer than) building the code. NetLogo lacks effective testing tools that would automate this process and it is difficult to ‘tease out’ the underlying calculations and do it in another environment (by hand, Excel, R etc). In practice, it is tempting to look at the screen and conclude ‘looks like it is doing what I think it should be doing’. If we want to avoid faulty modelling results that’s not good enough.
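To show what this kind of testing can look like in practice, here is a minimal sketch in Python. The toy logistic growth model, the function names step and run, and the parameter values are all hypothetical; the point is only the pattern of working out the first few values by hand and comparing them with what the code produces.

```python
import math


def step(population, growth_rate, capacity):
    """One tick of a toy logistic growth model (made up for this example)."""
    return population + growth_rate * population * (1 - population / capacity)


def run(population, growth_rate, capacity, ticks):
    """Run the model and return the population at every tick, start included."""
    history = [population]
    for _ in range(ticks):
        population = step(population, growth_rate, capacity)
        history.append(population)
    return history


# Values worked out by hand for growth_rate 0.5, capacity 100, starting at 10:
#   tick 1: 10   + 0.5 * 10   * (1 - 0.10)  = 14.5
#   tick 2: 14.5 + 0.5 * 14.5 * (1 - 0.145) = 20.69875
expected = [10, 14.5, 20.69875]

history = run(10, 0.5, 100, 2)
for got, want in zip(history, expected):
    # Compare with a tolerance, since floating-point results are rarely exact.
    assert math.isclose(got, want), f"model gave {got}, hand calculation gave {want}"
print("model matches the hand calculation:", history)
```

Once a check like this exists, it can be rerun after every change to the model, which is exactly the kind of automation that is hard to achieve by watching turtles on a screen.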

Computational modelling is not mainstream archaeology (yet; we are working on it 😉) and the chances are that many of the models will not be replicated any time soon or, indeed, at all. Therefore, taking responsibility for ensuring there are no bugs in your code is extremely important, even more so than in other disciplines. Using standard programming languages makes it much easier.

The speed

In a fairly dated review of simulation platforms, NetLogo didn’t do badly. It ran slower than platforms such as Repast or Mason, but it wasn’t at the very tail of the race.

However, compare it to a programming language (Python, Java, C etc.) and soon you’ll see the chasm in how much you can achieve in a given time. As my Python teacher said in the first lecture ‘Life is short and the PhD even shorter’. You can optimise NetLogo code, you can use lists, you can dump some of the heavy calculations onto its R extension to speed up your simulation, etc. Or you can code it in Python. Even a poor implementation in Python is likely to be faster than well-designed NetLogo code.

This looks like a mostly logistical issue but it cuts deeper into the philosophy of model building. Why is the speed important? Three reasons. 1. With more speed you can simply do more – you can try out a wider parameter space, test more scenarios, test different implementations etc. 2. You can run the cycle of model development several times, which is the best way to improve the quality of any model. 3. You will avoid the horror situations of realising there is a tiny bug shortly before a paper deadline/an important conference/thesis submission and not being able to do anything about it. I would argue that the initial time you spend learning a programming language other than NetLogo will be given back to you later down the line when you need to run your simulations.

Universality – one language to rule them all

A number of non-archaeologists blogged recently about how Python seems to be taking over the world of scientific computing with its simplicity, its vast libraries of extensions and the ever growing documentation and support base (see here and here). They pointed out that scientists slowly shift from “I need software x to do x’, software y to do y’ and software z to do z'” to “how can I do it all in Python?”. The same applies to archaeologists.

Hardly any simulation can go forward without some form of GIS manipulation and some sort of data analysis. Anyone who has had the pleasure of dealing with the GIS software that is currently the industry standard (you know which one I mean) knows that it involves long bouts of pulling one’s hair, exclaiming ‘are you kidding me?’ and swearing to all known forces in the universe you will never, never ever use it again. Achtung, achtung my friends, let me put a stop to everyone’s misery as there are fantastic alternatives! You can use one of the open source GIS packages (for example Grass GIS) or, even simpler, use ArcMap model builder, export the code in 3 clicks and use the script. Both solutions, however, involve at least some rudimentary knowledge of Python.

The same applies to data analysis (read: the stats) you will, most probably, run on your simulation’s results. The easy way is to use Excel, SPSS, MiniTab or another interface-based software package. Most people quickly realise that this is either a) a frustrating exercise and/or b) too limited to do what you actually need and/or c) ridiculously time consuming (remember those days you spent copy-pasting columns from one spreadsheet to another?). R solves most of these problems. Python solves them even better. Even if you’re a coding fanatic and love learning new programming languages (which begs the question: why do you use NetLogo in the first place?) there is what Yarkoni calls the “cognitive switch cost of reminding yourself say, (…) that you need to call len(array) instead of array.length to get the size of an array…” etc. Now that you can do pretty much everything in Python, instead of using the triad of ArcMap/Grass – NetLogo – R, you cut down significantly on the overhead of switching between them every three months and relearning the same commands over and over again. Not to mention you don’t need to convert the files from one format to another. Remember ‘life is short and the PhD…’?

The employability

Less than half of all PhD graduates will continue in Academia (if you want to get depressed in a matter of minutes, check the Nature report or the Economist’s analysis here and a fantastic article on ‘how academia resembles a drug gang’ here). I’m not saying you’re not going to be one of the lucky ones but stats are stats and nothing indicates that anthro-/archaeology has a particularly higher retention rate of PhDs compared to other disciplines.

The circle of academic-based simulation projects, non-humanities departments or commercial companies that are likely to be looking for someone with NetLogo coding skills is small. But what if you can code in Python (or Java, or C++)? The dream job (or ‘any job’) is much more likely to materialise.

In the current academic climate (around 1% or less of research funding spent on humanities AND social science in both US and Europe) it’s good to have a plan B (even though I hope none of us will need to execute it). An industry and academy approved and widely used programming language gives you a solid transferable skill, which NetLogo, no matter how well it actually works, doesn’t.

My final verdict: start off with the turtles, use them for prototyping, playing with ideas and teaching modelling newbies, but when you feel comfortable with ‘if loops’, ‘lists’ and ‘scheduling’ move on to Python. You won’t regret it.

Top Images: http://en.wikipedia.org/wiki/File:Snake_skeleton.jpg and http://en.wikipedia.org/wiki/File:Florida_Box_Turtle_Digon3_re-edited.jpg
