Analyzing Netlogo Data with R

Put this at the top of every file, unless you really really want to work with factors. This will save you a lot of confusion

options(stringsAsFactors = FALSE)

The main wiki page around these examples is at: http://wiki.tudelft.nl/bin/view/Education/SPM955xABMofCAS/LectureDataAnalysis

The code + data we will be using is at https://github.com/cbdavis/Demo-Analyzing-Netlogo-Data-with-R

The contents are:

Put all these into the same directory, open up RStudio and set your working directory to this directory containing the files. Click on “Session -> Set Working Directory -> Choose Directory” to do this.

Then install the necessary packages for R

Basic R examples

See http://www.statmethods.net/index.html for a good overview of R

RStudio has four panels, and in the default config these are:

Top Left - Code - here you can open and work on different script files

Top Right - Workspace/History

  • This shows all the variables and functions that are currently loaded in the workspace
  • You can click on the variables to see their values, which can be useful to inspect how the code is operating.
  • The history shows all of the commands that you have run.

Bottom Left - Console

  • commands can be run here
  • You can also find documentation for commands by typing in ?commandName to get help, i.e. ?sum

Bottom Right - Files/Plots/Packages/Help

  • Files - This shows everything in your current working directory
  • Plots - Once you start plotting, multiple plots will be stored here. There are arrows in this view that allow you to navigate between multiple plots
  • Packages - Shows all the packages installed and currently loaded
  • Help - Shows documentation for varous functions.

You can run lines of code by highlighting them, and then clicking on “Run” above.

You can run all the code at once by doing Code -> Run Region -> Run All

Introduction to operations

Add two numbers, this will only show the value in the console output

1 + 1
## [1] 2

Assign a variable, this way the answer is stored

a = 1 + 1

Same as above, just using “<-” instead of “=”

a <- 1 + 1

now add up two different variables

b = 3
c = a + b

make a vector, c() is the function for putting elements into a vector

d = c(3,2,1,4)

find the length

length(d)
## [1] 4

calculate the average and standard deviation

mean(d)
## [1] 2.5
sd(d)
## [1] 1.291

do a simple plot

plot(d)

plot of chunk unnamed-chunk-10

make another vector e, which is filled with random numbers ranging from 0 to 1, and contains the same number of elements as the d vector

e = runif(length(d), 0, 1)

combine these two vectors into a matrix, where d is the left column and e is the right column

f = cbind(d, e)

combine these two vectors into a matrix, where d is the top row and e is the bottom row

g = rbind(d, e)

See http://www.statmethods.net/advstats/matrix.html for more information about working with matrices

transpose the matrix that you just made above

t(g)
##      d       e
## [1,] 3 0.01398
## [2,] 2 0.47766
## [3,] 1 0.76745
## [4,] 4 0.71165

element-wise multiplication of two vectors

h = d * e

matrix multiplication

d %*% t(e)
##         [,1]   [,2]   [,3]   [,4]
## [1,] 0.04195 1.4330 2.3024 2.1349
## [2,] 0.02797 0.9553 1.5349 1.4233
## [3,] 0.01398 0.4777 0.7675 0.7116
## [4,] 0.05594 1.9106 3.0698 2.8466

Also, when you save R, it will request if you want to save the workspace This means that it will save all the variables currently loaded in the workspace

use rm(variableName) to remove variables from the workspace

Load in libraries needed for plotting

TODO: Make sure that you have these packages installed where you see “library(something)”

Load the ggplot2 library which is used for most of the visualizations here to understand why this library is cool, just do a google image search: https://www.google.com/search?q=ggplot2&tbm=isch

For documentation see: - http://docs.ggplot2.org/current/ - http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf - http://had.co.nz/ggplot2/book.pdf

Note that plot() is not the same as ggplot() - these are from two separate packages

library(ggplot2)

needed for reshaping data frames

library(reshape2)

used for querying data, performing aggregations, filtering, etc.

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns

MAKE SURE THAT THE WORKING DIRECTORY IS SET this line below sets the current working directory

setwd("/home/cbdavis/Desktop/svn/Demo-Analyzing-Netlogo-Data-with-R")

You can also do something like Session -> Set Working Directory -> To Source File Location

Make sure to specify the “Table” output for Netlogo

Read in the data. skip the first 6 lines, the line after that is the header, and the columns are separated by commas You can either specify the full path to the file, or make sure that the working directory for R points to the directory containing the file

myDataFrame = read.table("Wolf Sheep Predation experiment-table.csv", skip = 6, sep = ",", head=TRUE)

This gives you a quick summary of what’s in the data

This is especially important since it tells you what the column names are.

For example, if you see “newcomer.incumbent” then you can access this column using myDataFrame$newcomer.incumbent

Note: characters that are not letters or numbers (A-Z, a-z, 0-9) may be encoded as periods,

So a header that looks like “[run number]” will be accessible using myDataFrame$X.run.number.

What you’ll see here is that you’re working with a data frame.

It isn’t a matrix, but it’s an object containing several columns along with some additional properties.

summary(myDataFrame)
##  X.run.number.     grass.          sheep.reproduce initial.number.sheep
##  Min.   : 1.0   Length:19821       Min.   :4       Min.   : 50         
##  1st Qu.:11.0   Class :character   1st Qu.:4       1st Qu.:100         
##  Median :21.0   Mode  :character   Median :4       Median :150         
##  Mean   :20.7                      Mean   :4       Mean   :126         
##  3rd Qu.:31.0                      3rd Qu.:4       3rd Qu.:200         
##  Max.   :40.0                      Max.   :4       Max.   :200         
##  grass.regrowth.time sheep.gain.from.food show.energy.      
##  Min.   :10.0        Min.   :4            Length:19821      
##  1st Qu.:20.0        1st Qu.:4            Class :character  
##  Median :30.0        Median :4            Mode  :character  
##  Mean   :30.2        Mean   :4                              
##  3rd Qu.:40.0        3rd Qu.:4                              
##  Max.   :50.0        Max.   :4                              
##  initial.number.wolves wolf.reproduce wolf.gain.from.food    X.step.   
##  Min.   :50            Min.   :5      Min.   :20          Min.   :  0  
##  1st Qu.:50            1st Qu.:5      1st Qu.:20          1st Qu.:123  
##  Median :50            Median :5      Median :20          Median :247  
##  Mean   :50            Mean   :5      Mean   :20          Mean   :248  
##  3rd Qu.:50            3rd Qu.:5      3rd Qu.:20          3rd Qu.:373  
##  Max.   :50            Max.   :5      Max.   :20          Max.   :500  
##   count.sheep    count.wolves       grass     
##  Min.   :   0   Min.   :  0.0   Min.   : 420  
##  1st Qu.: 131   1st Qu.: 19.0   1st Qu.: 758  
##  Median : 157   Median : 45.0   Median : 909  
##  Mean   : 190   Mean   : 62.4   Mean   :1080  
##  3rd Qu.: 192   3rd Qu.: 78.0   3rd Qu.:1233  
##  Max.   :1048   Max.   :516.0   Max.   :2601

this will also show you the names of the column names

colnames = colnames(myDataFrame)

Don’t worry about what this means, it cleans up the column names. You can just copy/paste it and re-use it, just make sure that if your data frame isn’t called myDataFrame, then update that part.

Some colnames start with “X.”, get rid of this

colnames(myDataFrame) = gsub("X\\.", "", colnames(myDataFrame))

Get rid of periods at the start and end of the names

colnames(myDataFrame) = gsub("^\\.|\\.$", "", colnames(myDataFrame))

Convert all periods into underscores

colnames(myDataFrame) = gsub("\\.", "_", colnames(myDataFrame))

you can also just rename the columns yourself

colnames(myDataFrame)[1] = "runNumber"  # change "run_number" to "runNumber"
colnames(myDataFrame)[2] = "grass_is_on" # used to be ?grass in the data file
colnames(myDataFrame)[11] = "tick"  # change "step" to "tick"
colnames(myDataFrame)[14] = "count_grass" # there are two columns named grass, we need to distinguish them.

These are now the data columns that I can work with - myDataFrame$runNumber - myDataFrame$grass_is_on - myDataFrame$sheep_reproduce - myDataFrame$initial_number_sheep - myDataFrame$grass_regrowth_time - myDataFrame$sheep_gain_from_food - myDataFrame$show_energy - myDataFrame$initial_number_wolves - myDataFrame$wolf_reproduce - myDataFrame$wolf_gain_from_food - myDataFrame$tick - myDataFrame$count_sheep - myDataFrame$count_wolves - myDataFrame$count_grass

In the top right panel, you can now click on the “Environment” tab, then click on “myDataFrame” to bring up a table view of it.

find the maximum number of sheep encountered

max(myDataFrame$count_sheep)
## [1] 1048

which index has the maximum number of sheep?

indexWithMaxNumSheep = which.max(myDataFrame$count_sheep)
myDataFrame$count_sheep[indexWithMaxNumSheep]
## [1] 1048

Plot a sorted vector of the number of turtles over all time for all the simulations. This gives you an idea of how often you encounter high, low, medium values

plot(sort(myDataFrame$count_sheep))

plot of chunk unnamed-chunk-30

just give me a quick scatterplot

scatterplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + #use myDataFrame for the data, columns for x and y
  geom_point() + #we want to use points
  xlab("tick") +  #specify x and y labels
  ylab("number of sheep") + 
  ggtitle("Number of sheep over time") #give the plot a title
print(scatterplot) #display the scatterplot

plot of chunk unnamed-chunk-31

Something’s going on, but it’s useful to distinguish between the different runs, so color the dots by runNumber

scatterplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + #use myDataFrame for the data, columns for x and y
  geom_point(aes(colour = runNumber)) + #we want to use points, colored by runNumber
  xlab("tick") +  #specify x and y labels
  ylab("number of sheep") + 
  ggtitle("Number of sheep over time") #give the plot a title
print(scatterplot) #display the scatterplot

plot of chunk unnamed-chunk-32

Note: if you just do “ggplot(…)” instead of “something = ggplot(…)” then the image will be drawn automatically, but you won’t have a way to save it, except by clicking on the GUI for the image.

Now save the plot

ggsave(scatterplot, file="scatter.png") 
## Saving 7 x 5 in image

Do the same with lines. The only change from above is the addition of “group=runNumber” and geom_line is used instead of geom_point

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=runNumber)) + #use myDataFrame for the data, columns for x and y
  geom_line(aes(colour = runNumber)) + #we want to use points, colored by runNumber
  xlab("tick") +  #specify x and y labels
  ylab("number of sheep") + 
  ggtitle("Number of sheep over time") #give the plot a title

plot of chunk unnamed-chunk-34

You can navigate back and forth between different graphs by using the left/right arrows in the “Plots” window. To have multiple graph windows open, you need to tell R specifically to open a new window. If you’re using Windows, this will looks something like this:

windows()

put your code for plot 1 here

windows()

put your code for plot 2 here

For mac, you would use macintosh() instead of windows. For Unix/Linux, you would use X11()

See http://www.statmethods.net/graphs/creating.html for more info

Give me a heatmap, without the scatter plot. This can be useful if you have a HUGE number of points that are all in a sort of cloud

simpleHeatMapOfScatterPlot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + 
  stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) 

print(simpleHeatMapOfScatterPlot)

plot of chunk unnamed-chunk-35

Make a scatter plot

ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + geom_point()

plot of chunk unnamed-chunk-36

Make a scatter plot with density contours drawn on it

ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + geom_point() + geom_density2d()

plot of chunk unnamed-chunk-37

Instead of contours, show dots that are proportionally sized to the density

ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + stat_density2d(geom="point", aes(size = ..density..), contour = FALSE)

plot of chunk unnamed-chunk-38

This is how to create filenames automatically. You can use these in loops to create lots of graphs

fileprefix="histogram"
val = 3
filename = paste(fileprefix, val, ".png", sep="")

Actually, I’m in the mood for a histogram

simpleHistogram = ggplot(data=myDataFrame, aes(x=count_sheep)) + geom_histogram() 
print(simpleHistogram)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-40

Now just give me a boxplot If we just say “group=tick”, then we would have 500 boxes, which fills up the whole plot and is hard to read

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=tick)) + 
  geom_boxplot()

plot of chunk unnamed-chunk-41

“group=round(tick/25)” means that we group all the data into boxes 25 steps wide

boxplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) + 
  geom_boxplot()

print(boxplot)

plot of chunk unnamed-chunk-42

Show a matrix (i.e. a “facet grid”) of individual graphs where every single graph show the values encountered for a single permutation of grass_regrowth_time & initial_number_sheep values

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) + 
  geom_boxplot() + 
  facet_grid(grass_regrowth_time ~ initial_number_sheep)

plot of chunk unnamed-chunk-43

Same, but make the y scales independent per row to stretch things out a bit

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) + 
  geom_boxplot() + 
  facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")

plot of chunk unnamed-chunk-44

do the same, now just with lines

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + 
  geom_line() + 
  facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")

plot of chunk unnamed-chunk-45

This looks weird since there are two repetitions, we should separate these out

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=runNumber)) + 
  geom_line(aes(colour = runNumber)) + 
  facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")

plot of chunk unnamed-chunk-46

just show single graphs per runNumber

ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + 
  geom_point() + 
  facet_wrap(~runNumber)

plot of chunk unnamed-chunk-47

Now need to create a stacked area chart based on the columns below: - myDataFrame$count_sheep - myDataFrame$count_wolves - myDataFrame$count_grass

To do this, we need to stack up those columns into a single column.
We will use a column beside it to indicate what is being counted So instead of:

count_sheep count_wolves count_grass
50 10 100

We’ll have:

variable value
count_sheep 50
count_wolves 10
count_grass 100

The rest of the values in the columns will be duplicated for each of these new rows

See http://www.statmethods.net/management/reshape.html for what’s happening here

Note that “count_sheep”, “count_wolves”, “count_grass” are not in the list

data2 = melt(myDataFrame, id=c("runNumber", "grass_is_on", "sheep_reproduce", "initial_number_sheep", "grass_regrowth_time", "sheep_gain_from_food", "show_energy", "initial_number_wolves", "wolf_reproduce", "wolf_gain_from_food", "tick"))

Two new columns are introduced- “variable” and “value”:

variable value
count_sheep 50
count_wolves 10
count_grass 100
ggplot(data=data2, aes(x=tick, y=value)) + 
  geom_area(aes(fill=variable)) + 
  facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free")

plot of chunk unnamed-chunk-49

Querying data using SQLDF package

With this part of the tutorial, you’re using SQL (http://en.wikipedia.org/wiki/SQL) to run queries over your data

This is more advanced, but it will give you awesome superpowers

Whenever you are frustrated by something that is difficult to do in Excel, remember this

This visualization uses the sqldf package + a scatter plot: Time lapse of 860,000 photovoltaic systems installed across Germany https://www.youtube.com/watch?v=XpvQNn0n_Qw

The data I have lists the capacity per post code, but to make the visualization, I need to calculate the cumulative capacity up to the date the I am visualizing, for all of the distinct postcodes. This becomes awesomely easy when using sqldf.

Instructions for the package - http://code.google.com/p/sqldf/

The best place to start is via tutorials online. Just search for something like “sqlite tutorial queries” i.e. http://sqlite.awardspace.info/syntax/sqlitepg03.htm

Official documentation: http://www.sqlite.org/lang.html

This is way more than you need, and tutorials are easier to understand, but this shows you everything that you can do with the query language.

The uppercase terms below are some of the more popular commands that you may find useful. You can also use lowercase in the queries. Examples of their use are further below.

  • SELECT
  • FROM
  • WHERE
  • AS
  • DISTINCT
  • COUNT
  • ORDER BY
  • DESC
  • GROUP BY
  • BETWEEN
  • AND
  • OR
  • MAX
  • MIN
  • JOIN

NOTE - column names should not contain any punctionation, otherwise the queries may not work.

You need to change column names like “team.size” to something like “teamsize”

just get me one row

x = sqldf("SELECT * FROM myDataFrame LIMIT 1")
## Loading required package: tcltk

count the number of rows where the value for runNumber is equal to 1

sqldf("SELECT COUNT(*) FROM myDataFrame WHERE runNumber=1")
##   COUNT(*)
## 1      501

the same, but where runNumber < 10

sqldf("SELECT COUNT(*) FROM myDataFrame WHERE runNumber<10")
##   COUNT(*)
## 1     4290

find the average count_sheep for each runNumber (averaged over all ticks)

sqldf("SELECT AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber LIMIT 10")
##    avgSheep
## 1     241.1
## 2     237.0
## 3     195.9
## 4     193.0
## 5     169.2
## 6     176.2
## 7     159.9
## 8     170.4
## 9     152.5
## 10    150.8

same, but order the values for avgSheep descending

sqldf("SELECT AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber ORDER BY avgSheep DESC LIMIT 10")
##    avgSheep
## 1     483.3
## 2     444.9
## 3     246.6
## 4     241.1
## 5     237.0
## 6     236.7
## 7     209.7
## 8     209.5
## 9     199.8
## 10    196.8

same, but also give me the runNumber that corresponds to each value

x = sqldf("SELECT runNumber, AVG(count_wolves) AS avgWolves, AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber ORDER BY avgSheep DESC")

plot stuff

plot(x$avgWolves, x$count_sheep)

plot of chunk unnamed-chunk-56

find the distinct values for initial_number_sheep

sqldf("SELECT DISTINCT initial_number_sheep FROM myDataFrame")
##   initial_number_sheep
## 1                   50
## 2                  100
## 3                  150
## 4                  200

get me the distinct combinations of initial_number_sheep and grass_regrowth_time that were used

sqldf("SELECT DISTINCT initial_number_sheep, grass_regrowth_time FROM myDataFrame")
##    initial_number_sheep grass_regrowth_time
## 1                    50                  10
## 2                    50                  30
## 3                    50                  40
## 4                    50                  20
## 5                    50                  50
## 6                   100                  10
## 7                   100                  20
## 8                   100                  30
## 9                   100                  40
## 10                  100                  50
## 11                  150                  10
## 12                  150                  20
## 13                  150                  30
## 14                  150                  40
## 15                  150                  50
## 16                  200                  10
## 17                  200                  20
## 18                  200                  30
## 19                  200                  40
## 20                  200                  50

select a subset of the original data, and then run a query on that subset

dataSubSet = sqldf("SELECT * FROM myDataFrame WHERE runNumber<10")
sqldf("SELECT count(*) FROM dataSubSet")
##   count(*)
## 1     4290

Get all rows where 20 <= p <= 60. LIMIT is included since many rows match this

sqldf("SELECT count_sheep, count_wolves FROM myDataFrame WHERE count_sheep BETWEEN 20 AND 60 LIMIT 10")
##    count_sheep count_wolves
## 1           50           50
## 2           50           50
## 3           50           50
## 4           49           50
## 5           50           50
## 6           50           50
## 7           50           50
## 8           50           50
## 9           46           52
## 10          48           52

Get all data where two conditions are met

sqldf("SELECT count_sheep, count_wolves FROM myDataFrame WHERE count_wolves > count_sheep LIMIT 10")
##    count_sheep count_wolves
## 1           49           50
## 2           46           52
## 3           48           52
## 4           50           51
## 5           49           51
## 6           49           50
## 7           49           54
## 8           46           51
## 9           43           51
## 10          49           50

Get me the row with the maximum value for the maximum average component size

sqldf("SELECT count_wolves, count_sheep, MAX(count_sheep) FROM myDataFrame")
##   count_wolves count_sheep MAX(count_sheep)
## 1            0        1048             1048