Put this at the top of every file, unless you really really want to work with factors. This will save you a lot of confusion
options(stringsAsFactors = FALSE)
The main wiki page around these examples is at: http://wiki.tudelft.nl/bin/view/Education/SPM955xABMofCAS/LectureDataAnalysis
The code + data we will be using is at https://github.com/cbdavis/Demo-Analyzing-Netlogo-Data-with-R
The contents are:
Put all these into the same directory, open up RStudio and set your working directory to this directory containing the files. Click on “Session -> Set Working Directory -> Choose Directory” to do this.
Then install the necessary packages for R
See http://www.statmethods.net/index.html for a good overview of R
RStudio has four panels, and in the default config these are:
Top Left - Code - here you can open and work on different script files
Top Right - Workspace/History
Bottom Left - Console
Bottom Right - Files/Plots/Packages/Help
You can run lines of code by highlighting them, and then clicking on “Run” above.
You can run all the code at once by doing Code -> Run Region -> Run All
Add two numbers, this will only show the value in the console output
1 + 1
## [1] 2
Assign a variable, this way the answer is stored
a = 1 + 1
Same as above, just using “<-” instead of “=”
a <- 1 + 1
now add up two different variables
b = 3
c = a + b
make a vector, c() is the function for putting elements into a vector
d = c(3,2,1,4)
find the length
length(d)
## [1] 4
calculate the average and standard deviation
mean(d)
## [1] 2.5
sd(d)
## [1] 1.291
do a simple plot
plot(d)
make another vector e, which is filled with random numbers ranging from 0 to 1, and contains the same number of elements as the d vector
e = runif(length(d), 0, 1)
combine these two vectors into a matrix, where d is the left column and e is the right column
f = cbind(d, e)
combine these two vectors into a matrix, where d is the top row and e is the bottom row
g = rbind(d, e)
See http://www.statmethods.net/advstats/matrix.html for more information about working with matrices
transpose the matrix that you just made above
t(g)
## d e
## [1,] 3 0.01398
## [2,] 2 0.47766
## [3,] 1 0.76745
## [4,] 4 0.71165
element-wise multiplication of two vectors
h = d * e
matrix multiplication
d %*% t(e)
## [,1] [,2] [,3] [,4]
## [1,] 0.04195 1.4330 2.3024 2.1349
## [2,] 0.02797 0.9553 1.5349 1.4233
## [3,] 0.01398 0.4777 0.7675 0.7116
## [4,] 0.05594 1.9106 3.0698 2.8466
Also, when you save R, it will request if you want to save the workspace This means that it will save all the variables currently loaded in the workspace
use rm(variableName) to remove variables from the workspace
TODO: Make sure that you have these packages installed where you see “library(something)”
Load the ggplot2 library which is used for most of the visualizations here to understand why this library is cool, just do a google image search: https://www.google.com/search?q=ggplot2&tbm=isch
For documentation see: - http://docs.ggplot2.org/current/ - http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf - http://had.co.nz/ggplot2/book.pdf
Note that plot() is not the same as ggplot() - these are from two separate packages
library(ggplot2)
needed for reshaping data frames
library(reshape2)
used for querying data, performing aggregations, filtering, etc.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
MAKE SURE THAT THE WORKING DIRECTORY IS SET this line below sets the current working directory
setwd("/home/cbdavis/Desktop/svn/Demo-Analyzing-Netlogo-Data-with-R")
You can also do something like Session -> Set Working Directory -> To Source File Location
Make sure to specify the “Table” output for Netlogo
Read in the data. skip the first 6 lines, the line after that is the header, and the columns are separated by commas You can either specify the full path to the file, or make sure that the working directory for R points to the directory containing the file
myDataFrame = read.table("Wolf Sheep Predation experiment-table.csv", skip = 6, sep = ",", head=TRUE)
This gives you a quick summary of what’s in the data
This is especially important since it tells you what the column names are.
For example, if you see “newcomer.incumbent” then you can access this column using myDataFrame$newcomer.incumbent
Note: characters that are not letters or numbers (A-Z, a-z, 0-9) may be encoded as periods,
So a header that looks like “[run number]” will be accessible using myDataFrame$X.run.number.
What you’ll see here is that you’re working with a data frame.
It isn’t a matrix, but it’s an object containing several columns along with some additional properties.
summary(myDataFrame)
## X.run.number. grass. sheep.reproduce initial.number.sheep
## Min. : 1.0 Length:19821 Min. :4 Min. : 50
## 1st Qu.:11.0 Class :character 1st Qu.:4 1st Qu.:100
## Median :21.0 Mode :character Median :4 Median :150
## Mean :20.7 Mean :4 Mean :126
## 3rd Qu.:31.0 3rd Qu.:4 3rd Qu.:200
## Max. :40.0 Max. :4 Max. :200
## grass.regrowth.time sheep.gain.from.food show.energy.
## Min. :10.0 Min. :4 Length:19821
## 1st Qu.:20.0 1st Qu.:4 Class :character
## Median :30.0 Median :4 Mode :character
## Mean :30.2 Mean :4
## 3rd Qu.:40.0 3rd Qu.:4
## Max. :50.0 Max. :4
## initial.number.wolves wolf.reproduce wolf.gain.from.food X.step.
## Min. :50 Min. :5 Min. :20 Min. : 0
## 1st Qu.:50 1st Qu.:5 1st Qu.:20 1st Qu.:123
## Median :50 Median :5 Median :20 Median :247
## Mean :50 Mean :5 Mean :20 Mean :248
## 3rd Qu.:50 3rd Qu.:5 3rd Qu.:20 3rd Qu.:373
## Max. :50 Max. :5 Max. :20 Max. :500
## count.sheep count.wolves grass
## Min. : 0 Min. : 0.0 Min. : 420
## 1st Qu.: 131 1st Qu.: 19.0 1st Qu.: 758
## Median : 157 Median : 45.0 Median : 909
## Mean : 190 Mean : 62.4 Mean :1080
## 3rd Qu.: 192 3rd Qu.: 78.0 3rd Qu.:1233
## Max. :1048 Max. :516.0 Max. :2601
this will also show you the names of the column names
colnames = colnames(myDataFrame)
Don’t worry about what this means, it cleans up the column names. You can just copy/paste it and re-use it, just make sure that if your data frame isn’t called myDataFrame, then update that part.
Some colnames start with “X.”, get rid of this
colnames(myDataFrame) = gsub("X\\.", "", colnames(myDataFrame))
Get rid of periods at the start and end of the names
colnames(myDataFrame) = gsub("^\\.|\\.$", "", colnames(myDataFrame))
Convert all periods into underscores
colnames(myDataFrame) = gsub("\\.", "_", colnames(myDataFrame))
you can also just rename the columns yourself
colnames(myDataFrame)[1] = "runNumber" # change "run_number" to "runNumber"
colnames(myDataFrame)[2] = "grass_is_on" # used to be ?grass in the data file
colnames(myDataFrame)[11] = "tick" # change "step" to "tick"
colnames(myDataFrame)[14] = "count_grass" # there are two columns named grass, we need to distinguish them.
These are now the data columns that I can work with - myDataFrame$runNumber - myDataFrame$grass_is_on - myDataFrame$sheep_reproduce - myDataFrame$initial_number_sheep - myDataFrame$grass_regrowth_time - myDataFrame$sheep_gain_from_food - myDataFrame$show_energy - myDataFrame$initial_number_wolves - myDataFrame$wolf_reproduce - myDataFrame$wolf_gain_from_food - myDataFrame$tick - myDataFrame$count_sheep - myDataFrame$count_wolves - myDataFrame$count_grass
In the top right panel, you can now click on the “Environment” tab, then click on “myDataFrame” to bring up a table view of it.
find the maximum number of sheep encountered
max(myDataFrame$count_sheep)
## [1] 1048
which index has the maximum number of sheep?
indexWithMaxNumSheep = which.max(myDataFrame$count_sheep)
myDataFrame$count_sheep[indexWithMaxNumSheep]
## [1] 1048
Plot a sorted vector of the number of turtles over all time for all the simulations. This gives you an idea of how often you encounter high, low, medium values
plot(sort(myDataFrame$count_sheep))
just give me a quick scatterplot
scatterplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + #use myDataFrame for the data, columns for x and y
geom_point() + #we want to use points
xlab("tick") + #specify x and y labels
ylab("number of sheep") +
ggtitle("Number of sheep over time") #give the plot a title
print(scatterplot) #display the scatterplot
Something’s going on, but it’s useful to distinguish between the different runs, so color the dots by runNumber
scatterplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) + #use myDataFrame for the data, columns for x and y
geom_point(aes(colour = runNumber)) + #we want to use points, colored by runNumber
xlab("tick") + #specify x and y labels
ylab("number of sheep") +
ggtitle("Number of sheep over time") #give the plot a title
print(scatterplot) #display the scatterplot
Note: if you just do “ggplot(…)” instead of “something = ggplot(…)” then the image will be drawn automatically, but you won’t have a way to save it, except by clicking on the GUI for the image.
Now save the plot
ggsave(scatterplot, file="scatter.png")
## Saving 7 x 5 in image
Do the same with lines. The only change from above is the addition of “group=runNumber” and geom_line is used instead of geom_point
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=runNumber)) + #use myDataFrame for the data, columns for x and y
geom_line(aes(colour = runNumber)) + #we want to use points, colored by runNumber
xlab("tick") + #specify x and y labels
ylab("number of sheep") +
ggtitle("Number of sheep over time") #give the plot a title
You can navigate back and forth between different graphs by using the left/right arrows in the “Plots” window. To have multiple graph windows open, you need to tell R specifically to open a new window. If you’re using Windows, this will looks something like this:
windows()
put your code for plot 1 here
windows()
put your code for plot 2 here
For mac, you would use macintosh() instead of windows. For Unix/Linux, you would use X11()
See http://www.statmethods.net/graphs/creating.html for more info
Give me a heatmap, without the scatter plot. This can be useful if you have a HUGE number of points that are all in a sort of cloud
simpleHeatMapOfScatterPlot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE)
print(simpleHeatMapOfScatterPlot)
Make a scatter plot
ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + geom_point()
Make a scatter plot with density contours drawn on it
ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + geom_point() + geom_density2d()
Instead of contours, show dots that are proportionally sized to the density
ggplot(data=myDataFrame, aes(x=count_grass, y=count_sheep)) + stat_density2d(geom="point", aes(size = ..density..), contour = FALSE)
This is how to create filenames automatically. You can use these in loops to create lots of graphs
fileprefix="histogram"
val = 3
filename = paste(fileprefix, val, ".png", sep="")
Actually, I’m in the mood for a histogram
simpleHistogram = ggplot(data=myDataFrame, aes(x=count_sheep)) + geom_histogram()
print(simpleHistogram)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now just give me a boxplot If we just say “group=tick”, then we would have 500 boxes, which fills up the whole plot and is hard to read
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=tick)) +
geom_boxplot()
“group=round(tick/25)” means that we group all the data into boxes 25 steps wide
boxplot = ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) +
geom_boxplot()
print(boxplot)
Show a matrix (i.e. a “facet grid”) of individual graphs where every single graph show the values encountered for a single permutation of grass_regrowth_time & initial_number_sheep values
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) +
geom_boxplot() +
facet_grid(grass_regrowth_time ~ initial_number_sheep)
Same, but make the y scales independent per row to stretch things out a bit
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=round(tick/25))) +
geom_boxplot() +
facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")
do the same, now just with lines
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) +
geom_line() +
facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")
This looks weird since there are two repetitions, we should separate these out
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep, group=runNumber)) +
geom_line(aes(colour = runNumber)) +
facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free_y")
just show single graphs per runNumber
ggplot(data=myDataFrame, aes(x=tick, y=count_sheep)) +
geom_point() +
facet_wrap(~runNumber)
Now need to create a stacked area chart based on the columns below: - myDataFrame$count_sheep - myDataFrame$count_wolves - myDataFrame$count_grass
To do this, we need to stack up those columns into a single column.
We will use a column beside it to indicate what is being counted So instead of:
count_sheep | count_wolves | count_grass |
---|---|---|
50 | 10 | 100 |
We’ll have:
variable | value |
---|---|
count_sheep | 50 |
count_wolves | 10 |
count_grass | 100 |
The rest of the values in the columns will be duplicated for each of these new rows
See http://www.statmethods.net/management/reshape.html for what’s happening here
Note that “count_sheep”, “count_wolves”, “count_grass” are not in the list
data2 = melt(myDataFrame, id=c("runNumber", "grass_is_on", "sheep_reproduce", "initial_number_sheep", "grass_regrowth_time", "sheep_gain_from_food", "show_energy", "initial_number_wolves", "wolf_reproduce", "wolf_gain_from_food", "tick"))
Two new columns are introduced- “variable” and “value”:
variable | value |
---|---|
count_sheep | 50 |
count_wolves | 10 |
count_grass | 100 |
ggplot(data=data2, aes(x=tick, y=value)) +
geom_area(aes(fill=variable)) +
facet_grid(grass_regrowth_time ~ initial_number_sheep, scales="free")
With this part of the tutorial, you’re using SQL (http://en.wikipedia.org/wiki/SQL) to run queries over your data
This is more advanced, but it will give you awesome superpowers
Whenever you are frustrated by something that is difficult to do in Excel, remember this
This visualization uses the sqldf package + a scatter plot: Time lapse of 860,000 photovoltaic systems installed across Germany https://www.youtube.com/watch?v=XpvQNn0n_Qw
The data I have lists the capacity per post code, but to make the visualization, I need to calculate the cumulative capacity up to the date the I am visualizing, for all of the distinct postcodes. This becomes awesomely easy when using sqldf.
Instructions for the package - http://code.google.com/p/sqldf/
The best place to start is via tutorials online. Just search for something like “sqlite tutorial queries” i.e. http://sqlite.awardspace.info/syntax/sqlitepg03.htm
Official documentation: http://www.sqlite.org/lang.html
This is way more than you need, and tutorials are easier to understand, but this shows you everything that you can do with the query language.
The uppercase terms below are some of the more popular commands that you may find useful. You can also use lowercase in the queries. Examples of their use are further below.
NOTE - column names should not contain any punctionation, otherwise the queries may not work.
You need to change column names like “team.size” to something like “teamsize”
just get me one row
x = sqldf("SELECT * FROM myDataFrame LIMIT 1")
## Loading required package: tcltk
count the number of rows where the value for runNumber is equal to 1
sqldf("SELECT COUNT(*) FROM myDataFrame WHERE runNumber=1")
## COUNT(*)
## 1 501
the same, but where runNumber < 10
sqldf("SELECT COUNT(*) FROM myDataFrame WHERE runNumber<10")
## COUNT(*)
## 1 4290
find the average count_sheep for each runNumber (averaged over all ticks)
sqldf("SELECT AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber LIMIT 10")
## avgSheep
## 1 241.1
## 2 237.0
## 3 195.9
## 4 193.0
## 5 169.2
## 6 176.2
## 7 159.9
## 8 170.4
## 9 152.5
## 10 150.8
same, but order the values for avgSheep descending
sqldf("SELECT AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber ORDER BY avgSheep DESC LIMIT 10")
## avgSheep
## 1 483.3
## 2 444.9
## 3 246.6
## 4 241.1
## 5 237.0
## 6 236.7
## 7 209.7
## 8 209.5
## 9 199.8
## 10 196.8
same, but also give me the runNumber that corresponds to each value
x = sqldf("SELECT runNumber, AVG(count_wolves) AS avgWolves, AVG(count_sheep) AS avgSheep FROM myDataFrame GROUP BY runNumber ORDER BY avgSheep DESC")
plot stuff
plot(x$avgWolves, x$count_sheep)
find the distinct values for initial_number_sheep
sqldf("SELECT DISTINCT initial_number_sheep FROM myDataFrame")
## initial_number_sheep
## 1 50
## 2 100
## 3 150
## 4 200
get me the distinct combinations of initial_number_sheep and grass_regrowth_time that were used
sqldf("SELECT DISTINCT initial_number_sheep, grass_regrowth_time FROM myDataFrame")
## initial_number_sheep grass_regrowth_time
## 1 50 10
## 2 50 30
## 3 50 40
## 4 50 20
## 5 50 50
## 6 100 10
## 7 100 20
## 8 100 30
## 9 100 40
## 10 100 50
## 11 150 10
## 12 150 20
## 13 150 30
## 14 150 40
## 15 150 50
## 16 200 10
## 17 200 20
## 18 200 30
## 19 200 40
## 20 200 50
select a subset of the original data, and then run a query on that subset
dataSubSet = sqldf("SELECT * FROM myDataFrame WHERE runNumber<10")
sqldf("SELECT count(*) FROM dataSubSet")
## count(*)
## 1 4290
Get all rows where 20 <= p <= 60. LIMIT is included since many rows match this
sqldf("SELECT count_sheep, count_wolves FROM myDataFrame WHERE count_sheep BETWEEN 20 AND 60 LIMIT 10")
## count_sheep count_wolves
## 1 50 50
## 2 50 50
## 3 50 50
## 4 49 50
## 5 50 50
## 6 50 50
## 7 50 50
## 8 50 50
## 9 46 52
## 10 48 52
Get all data where two conditions are met
sqldf("SELECT count_sheep, count_wolves FROM myDataFrame WHERE count_wolves > count_sheep LIMIT 10")
## count_sheep count_wolves
## 1 49 50
## 2 46 52
## 3 48 52
## 4 50 51
## 5 49 51
## 6 49 50
## 7 49 54
## 8 46 51
## 9 43 51
## 10 49 50
Get me the row with the maximum value for the maximum average component size
sqldf("SELECT count_wolves, count_sheep, MAX(count_sheep) FROM myDataFrame")
## count_wolves count_sheep MAX(count_sheep)
## 1 0 1048 1048