UPDATE 22 SEPT 2016: Please see the updated version of this tutorial here.
You found it: the holy grail of palaeoenvironmental datasets. Some government agency or environmental science department put together some brilliant time series GIS package and you want to find a way to import it into your model. But oftentimes the data may be in a format which isn’t readable by your modeling software, or takes some finagling to get the data in there. NetCDF is one of the more notorious of these. A NetCDF file (which stands for Network Common Data Form) is a multidimensional array, where each layer represents the spatial gridded distribution of a different variable or set of variables, and sets of grids can be stacked into time slices. To make this a little more clear, here’s a diagram:
In this diagram, each table represents a gridded spatial coverage for a single variable. Three variables are represented this way, and these are stored together in a single time step. The actual structure of the file might be simpler (that is, it might consist of a single variable and/or single time step) or more complex (with many more variables or where each variable is actually a set of coverages representing a range of values for that variable; imagine water temperature readings taken at a series of depths). These chunks of data can then be accessed as combined spatial coverages over time. Folks who work with climate and earth systems tend to store their data this way. It’s also a convenient way to keep track of data obtained from satellite measurements over time. They’re great for managing lots of spatial data, but if you’ve never dealt with them before, they can be a bit of a bear to work with. ArcGIS and QGIS support them, but it can be difficult to work them into simulations without converting to a more benign data type like an ASCII file. In a previous post, we’ve discussed importing GIS data into a NetLogo model, but of course this depends on our ability to get the data into a model-readable format. The following tutorial is going to walk through the process of getting a NetCDF file, manipulating it in R, and then getting it into NetLogo.
Step #1 – Locate the data
First let’s locate a useful NetCDF dataset and import it to R. As an example, we’ll use the Global Potential Vegetation Dataset from the UW-Madison Nelson Institute Sage Center for Sustainability and the Global Environment. As you can see, the data is also available as an ASCII file; this is useful because you can use this later to check that you’ve got the NetCDF working. Click on the appropriate link to download the Global Potential Veg Data NetCDF. The file is a tarball (extension .tar.gz), so you’ll need something to unzip it. If you’re not partial to a particular file compressor, try 7-Zip. Keep track of where the file is located on your local drive after downloading and unzipping.
Step #2- Bring the data into R
R won’t read NetCDF files as is, so you’ll need to download a package that works with this kind of data. The ncdf package is one of a few different packages that work with these files, and we’ll use it for this tutorial. First, open the R console and go to Packages->Install Packages and download the ncdf package from your preferred mirror site. Then load the package by entering the following:
library(ncdf) Now, remembering where you saved your NetCDF file, you can bring it into R with the following command:
data <- open.ncdf(filename) If you didn’t save the data file in your R working directory and want to navigate to the file, just replace filename with file.choose(). For now, we’ll use the 0.5 degree resolution vegetation data (vegtype_0.5.nc). Now if you type in data and press enter, you can check to see what the data variable holds. You should get something like this:
 "file C:\\Users\\me\\Downloads\\potveg_nc.tar\\vegtype_0.5.nc has 4 dimensions:"
 "longitude Size: 720"
 "latitude Size: 360"
 "level Size: 1"
 "time Size: 1"
 "file C:\\Users\\me\\Downloads\\potveg_nc.tar\\vegtype_0.5.nc has 1 variables:"
 "float vegtype[longitude,latitude,level,time] Longname:vegtype Missval:8.99999982852418e+20"
This is telling you what your file is composed of. The first line tells you the name of the file and how many dimensions it has. In this case, there are four dimensions: longitude, latitude, level, and time. Our file only has one time slice, meaning that it represents a single snapshot of data; if this number is larger, there will be more coverages included in your file over time. The coverage spans from 89.75 S to 89.75 N latitude in 0.5 degree increments, and 180 W to 180 E longitude by the same increments. Beneath the dashed line are your variables. In this case, there is only one, vegtype, which according to the above uses a number just shy of nine hundred quintillion as a missing value (the computer will interpret any occurences of this number as no data). To access this coverage, we need to assign it to a local variable, which we will call veg:
get.var.ncdf(data, data$var[]) -> veg
The get.var.ncdf command takes an identified variable (data$var[]) and extracts it from the NetCDF file (data) as a matrix. In this instance, we only have one variable so get.var.ncdf would have identified it without the data$var[], but if you had others, you would access them by replacing the 1 with the corresponding number from the list of variables above. Then we assign it to the local variable veg. There are a number of other commands within the ncdf package which are useful for reading and writing NetCDF files, but these go beyond the scope of this blog entry. You can read more about them here.
Step #3 – Checking out the data
Now our data is available to us as a matrix. We can view it by entering the following:
Oops! Our output reads from bottom to top instead of top to bottom. No problem, we can just invert the latitude of the matrix like so:
However, this only changes the view; when we get the data into NetLogo later on, we’ll need to transpose it. But for now, let’s add some terrain colors. According to the readme file associated with the data, there are 15 different landcover types used here:
- Tropical Evergreen Forest/Woodland
- Tropical Deciduous Forest/Woodland
- Temperate Broadleaf Evergreen Forest/Woodland
- Temperate Needleleaf Evergreen Forest/Woodland
- Temperate Deciduous Forest/Woodland
- Boreal Evergreen Forest/Woodland
- Boreal Deciduous Forest/Woodland
- Evergreen/Deciduous Mixed Forest/Woodland
- Dense Shrubland
- Open Shrubland
- Polar Desert/Rock/Ice
We could choose individual colors for each of these, but for the moment we’ll just use the in-built terrain color ramp:
Step #4 – Exporting the data to NetLogo
Finally, we want to read our data into a modeling platform, in this case NetLogo, so let’s export it as a raster coverage we can work with. Before we do any file writing, we’ll need to coerce the matrix into a data frame and make sure we transpose it so that it doesn’t come out upside down again. To do this, we’ll use the following code:
The as.data.frame command does the coercing, while the t command does the transposing. Now we have to open up the file we’re going to write to:
This establishes a connection to an open file which we’ve named vegcover.asc. Next, we’ll write the header data for an ASCII coverage. We can do this by adding lines to the file:
This may look like a bunch of nonsense, but each \t is a tab, and each \n is a new line. The result is a header on our file which looks like this: ncols 720 nrows 360 xllcorner -179.75 yllcorner -89.75 cellsize 0.5 NODATA_value 8.99999982852418e+20 Any program (whether a NetLogo model, GIS, or otherwise) that reads this file will look for this header first. The terms ncols and nrows define the number of columns and rows in the grid. The xllcorner and yllcorner define the lower left corner of the grid. The cellsize term describes how large each cell should be, and the NODATA_value is the same value from the original dataset which we used to define places where data is not available. Now just need to enter in our transposed data.
This will take our data frame and write it to the file we just created, appending it after the header. It’s important that your separator be a space (sep=” “) in order to assure that it is in a format NetLogo can read. Also make sure to get rid of any row and column names as well. Now we can read our file into NetLogo using the GIS extension (for an explanation of this, see here). Open a new NetLogo file, set the world window settings with the origin at the bottom left, a max-pxcor of 719 and and max-pycor of 359, and a patch size of 1. Save your NetLogo model in the same directory as the vegcover.asc file, and the following NetLogo code should do the trick:
extensions [ gis ]
globals [ vegcover ]
patches-own [ vegtype ]
set vegcover gis:load-dataset "vegcover.asc"
gis:set-world-envelope-ds gis:envelope-of vegcover
ask patches [
set pcolor white
set vegtype gis:raster-sample vegcover self
ask patches with [ vegtype <= 8 ] [
set pcolor scale-color green vegtype -5 10
ask patches with [ vegtype > 8 ] [
set pcolor scale-color pink vegtype 9 15
This should produce a world in which patches have a variable called vegtype with values that correspond to the original dataset. Furthermore, patches are colored according to a set scheme where forested areas are on a scale of green, while non-forested areas are on a scale of pink. The result:
If you’re truly curious as to whether this has worked as it should, you might download the ASCII version of the 0.5 degree data from the SAGE website, save it to the same directory, and replace vegcover.asc with the name of the ASCII file in the above NetLogo code to see if there is any difference.
So far, this has been meant to provide a simple tutorial of how to get data from a NetCDF file into an ABM platform. If you’re only dealing with a single coverage, you might be more at home converting your file using QGIS or another standalone GIS. If you’re dealing with multiple time steps or variables from a large dataset, it might make sense to write an R script that will extract the data systematically using combinations of the commands above. However, you might also make use of the R NetLogo extension to query a NetCDF file on the fly. To proceed with this part of the tutorial, you’ll need to download the R extension and have it installed correctly.
First, let’s find a NetCDF file with a temporal component. In honor of the impending winter my Northern Hemisphere colleagues are about to endure, I’m going to use the Northern Hemisphere EASE-Grid Snow Cover and Sea Ice Extent dataset from NOAA, which gives monthly (derived from weekly) snow cover data from 1971 to 1995. Go to the website and download the Monthly Mean dataset and save the file ‘snowcover.mon.mean.nc’ to your local drive, keeping track of the its location.
We’ll start a new NetLogo model, implement the R extension, and create two global variables and a patch variable:
extensions [ R ]
globals [ snowcover s ]
patches-own [ snow ]
The snowcover variable will be our dataset, while s will be a placeholder for monthly coverages. The patch variable snow will be the individual grid cell values from our data which will be updated monthly. Next, we’ll run a setup command which clears the model, installs the ncdf library, opens our NetCDF snowcover file, extracts our snowcover data, and resets our ticks counter. You may need to edit the code below so that it reflects the location of your NetCDF file.
r:eval "get.var.ncdf(data, data$var[]) -> snow"
Now, we could automate the process of converting to ASCII and importing the GIS data here, but that’s likely to be a slow solution and generate a lot of file bloat. Alternatively, if our world window is scaled to the same size as the NetCDF grid (or to some easily computed fraction of it), we can simply import the raw data and transmit the values directly to patches (not unlike the File Input example here). To do this, right click on the world window and edit it so that the location of the origin is the bottom left, and that the max-pxcor is 359 and the max-pycor is 89 (this is 360 x 90, the same size as our Northern Hemisphere snowcover data). We’ll also make sure the world doesn’t wrap, and set the patch size to 3 to make sure it fits on our screen.
Next, we’ll generate the transposed dataframe as in the above example, but this time for a single monthly coverage. Then we’ll import this data from R into the NetLogo placeholder variable s:
r:eval (word "snow2<-as.data.frame(t(snow[,," ticks "]))")
set s r:get "snow2"
ask patches [ get-snow ]
if ticks >= 297 [ stop ]
Because our snowcover data has a time component, we need to tell it which month we want to use by inserting a value for the third axis. For example, if we wanted the value for row 1, column 1 in month 3, we would send R the phrase snow[1,1,3]. In this case, we want the entire coverage but for a single month, so we leave our the values for row and column and only feed R a value for the month. We use the word command here to concatenate the string which will serve as our R command, but which incorporates the current value from the NetLogo ticks counter to substitute for the month value. As the ticks counter increases, this will shift the data from one month to the next. The if ticks >= 297 [ stop ] command will ensure that the model only runs for as long as we have data for (which is 297 months). When we import this data frame from R into our NetLogo model, it will be imported as a set nested lists, where each sublist represents a column from the data frame (from 1 to 360).If we enter s into the command line, it will look something like this:
[[1.0427087545394897E-5 1.0427087545394897E-5 1.0427087545394897E-5…
What we’ll want to do is pull values from these lists which correspond with the patch coordinates. However, remember that our world originates in the bottom left and increases toward the top right, while our data originates in the top left and increases toward the bottom right. What we’ll need to do is flip the y-axis values we use to reflect this (note: originating the model in the top left would give our NetLogo world negative Y-values, which would likewise need to be converted). We can do this with the following:
let x pxcor let y ((89 - pycor) / 89 ) * 89
set snow item y (item x s)
set pcolor scale-color grey snow 0 100
What this does is create temporary x and y values from the patch coordinates, but inverts the y-axis value of the patch (so top left is now bottom left). Then the patch sets its snow value by pulling out the value that corresponds with the appropriate row (item y) from the list the corresponds with the appropriate column (item x s). Finally, it sets is color along a scale from 0 to 100. When we run this code, the result is a lovely visualization of the monthly changes in snow cover from the Northern Hemisphere, like so:
So there you have it; a couple of different ways to get NetCDF data into a model using R and NetLogo. Of course, if you’re going to all of this trouble to work with such extensive datasets, it may be worth your while to explore alternative platforms which can build in native NetCDF support. Or you might build a model in R entirely. But I reckon the language is largely inconsequential as long as the model is well thought out, and part of that is figuring out what kind of input data you need and how to get it into your model. With a bit of imagination, there are many, many ways to skin this cat.
Ramankutty, N., and J.A. Foley (1999). Estimating historical changes in global land cover: croplands from 1700 to 1992, Global Biogeochemical Cycles 13(4), 997-1027.
Cavalieri, D. J., J. Crawford, M. Drinkwater, W. J. Emery, D. T. Eppler, L. D. Farmer, M. Goodberlet, R. Jentz, A. Milman, C. Morris, R. Onstott, A. Schweiger, R. Shuchman, K. Steffen, C. T. Swift, C. Wackerman, and R. L. Weaver. 1992. NASA sea ice validation program for the DMSP SSM/I: final report. NASA Technical Memorandum 104559. 126 pp.