Yesterday I got into a lively discussion on Facebook with a fellow archaeologist about how to graph data. Like many social scientists, my friend does not have formal training in making data visualization and was using Excel’s native graphing to make plots for their work. This is not inherently a problem, of course. Excel can make some accurate representations of data. But when one’s data is complicated (like most archaeological data is), something like Excel just can’t cut it.
I suggested to this friend to use R, since that would solve their woes. And once the script is written, it is incredibly easy to rerun the script if you find an error in your data, or you gather new data. My friend had never programmed in R, so asked for help.
It turns out I had written a script for another friend about six months ago who had wanted to graph some pXRF data in a way similar to the first friend. This second friend wrote to me this:
“The data that I’m trying to visualize was collected as follows. I ran two sets of ‘tests’ to determine how much of the variation I was seeing in readings for each element was due to slight differences in the composition of clay within a single sherd, and how much was due to minor inconsistencies in the detection abilities of the machine. First, I took 10 separate readings of a sherd, moving the sherd a little bit each time (to test the compositional variability of the clay paste). Then, I took 10 readings without moving the sherd at all (to test the reliability of the pXRF detector).
“The resulting dataset has 20 cases and 35 variables: the first variable identifies whether each case was a reading with replacement (testing clay variation) or without replacement (testing machine reliability); so 10 cases are denoted ‘with’ and 10 cases are denoted ‘without’. The remaining 34 variables are values of measured atomic abundance.
“I want to make a graph that has compares the mean abundance of each element (with 80%, 95%, and 99% confidence intervals) between the ‘with’ and ‘without’ cases. That would be a stupidly large graph, with paired observations for 34 variables.”
I asked my friend to draw me what she was expecting (since I’m a visual person) and she drew this very useful sketch, which helped me figure out how to write the code:
R (and Python) are great for doing just this kind of visualization. While I’m sure many of our readers are well-versed in these statistical packages, after the Facebook discussion yesterday it seems that posting the code I wrote for my friend above would be useful for many social scientists. I even had a few people request the code, so the code follows!
For this dataset I wrote a Violin plot since those at a glance show the median and interquartile range while also showing kernel density. This can be very useful for looking at variation.
Following is the code to produce this violin plot. You can copy and paste this into an R document and it should work, though you’ll want to rename the files, etc. to work with your data.
Happy plotting, simComp readers!
###First we load the data
camDat <- read.csv(“variability_R.csv”, header=T)
##Then we check to make sure camDat doesn’t look weird. Here I look at the first 5 lines of data
##It looks okay, but a common problem is putting a space in the name of a variable. Instead of doing that, you should always use an underscore. Why?
##Cause R uses periods for other specific things, and a name like K.12 can look confusing. K_12 is better practice, fyi.
##Now we subset the data into two types, WITH and WITHOUT. You will use those dataframes for all future analyses
camDatWith <- subset(camDat, Type==“WITH”)
camDatWithout <- subset(camDat, Type==“WITHOUT”)
### Here is where I play with multiple types of distributions. I’m only using the first two elements, in a with and without type. We can then generate a graph with all of your data in with and without type, but this will give us an idea of whether this graph is helpful.
##First is a violin plot, which is a combo boxplot and kernel density plot.
## For more info, go here: http://www.statmethods.net/graphs/boxplot.html
x1 <- camDatWith$Al_K12
x2 <- camDatWithout$Al_K12
x3 <- camDatWith$Ar_K12
x4 <- camDatWithout$Ar_K12
vioplot(x1, x2, x3, x4, names=c(“Al K12 With”, “Al K12 Without”, “Ar K12 With”, “Ar K12 Without”), col=“gold”)
title (“Violin Plots of Elemental Abundance”)
# Here is a standard boxplot
boxplot(x1, x2, x3, x4, names=c(“Al K12 With”, “Al K12 Without”, “Ar K12 With”, “Ar K12 Without”), col=“gold”)
title (“Boxplot of Elemental Abundance”)
# And here we have a boxplot with notches at the mean
boxplot(x1, x2, x3, x4, notch=TRUE, names=c(“Al K12 With”, “Al K12 Without”, “Ar K12 With”, “Ar K12 Without”), col=“gold”)
title (“Boxplot of Elemental Abundance”)