What/Why?

R is a language and environment for statistical computing and graphics. In more practical terms:

  • It’s a digital Swiss Army knife
    • It allows you to do very diverse things with data.
  • It’s free - just download and get started, no license, no trial period
  • There’s a huge community behind it
    • For even the weirdest problems you run into, you can Google them and usually find a solution (probably on Stack Overflow).
  • You don’t have to write code that others have already created (stand on the shoulders of giants)
    • There are over 11,000 contributed packages covering a variety of topics. This means that for a lot of different applications and scientific domains, someone has already created something that you can re-use. This minimizes the amount of work you have to do, while allowing you to do much more sophisticated analysis.
    • Package management - you don’t have to copy/paste this code. It’s managed so that you just have to call the name of the software package. You can also have R update every single software package that you have installed, in order to get the latest bug fixes, enhancements, etc.
  • It lets you create repeatable workflows
    • Have new data? Just run the code again to generate plots + images for reports
  • Use of open source & open standards
    • Researchers can inspect the source code and report errors to the developers, meaning that the process of finding and fixing bugs in the software is more transparent and efficient.
    • Open standards allow for data portability between a variety of programs, so you’re not limited to just using proprietary programs. This can give you greater flexibility in the types of tools you use.
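
As a rough sketch of the package management point above: installing, updating and loading packages are each a single function call. The package name ggplot2 is only an example, and the two lines that need an internet connection are commented out so that the rest can be run as-is.

```r
# Install a contributed package from CRAN (needs internet, so commented out).
# "ggplot2" is just an example package name.
# install.packages("ggplot2")

# Update every installed package to its latest version (also needs internet).
# update.packages(ask = FALSE)

# Load an installed package by name. "stats" ships with R itself.
library(stats)

# List the names of all packages installed on this computer.
installed <- rownames(installed.packages())
"stats" %in% installed
```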

You’re already used to programming since spreadsheets are code:

=SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-29,1,2))/1000+SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,RIGHT(V$53,2)*8-7,1,2))/1000-SUM(OFFSET($data.$A$1,MATCH($T197,$data.$C$1:$C$1048576,0)-1,MATCH($AL$7&$AL$6,$data.$A$1:$AMJ$1,0)-37,1,2))/1000

This is a formula from an actual spreadsheet in use by a company. This code is a nightmare to maintain, and it’s very difficult to check whether there are any errors in it. Spreadsheets create a soup of data storage, processing and presentation (i.e. numbers, formulas and graphs). This hinders transparency, and as spreadsheets get more complex, it becomes very difficult to trace how the data analysis is actually done. With R, it’s possible to use code to read from data files and then automatically generate reports in the form of Word documents, PDFs, or HTML pages. Keeping your data, processing and presentation of results in separate components will help you to stay organized.

Setting up R and RStudio

You need to have both R and RStudio set up. You can use R without RStudio, but if you do, the interface will look like the one on the left image below, and this can be a bit intimidating. You’ll see this interface also referred to as the “R console”. RStudio runs on top of the R console and provides a more visual interface, in addition to greatly assisting you with various aspects of programming.

[Figure: the R console (left) vs. the R console running inside RStudio (right)]

University Computer

  • You do not have to install R or RStudio on the university computers.
  • On the university computers you should be able to find RStudio in the Start Menu by looking for Mathematics & Statistics -> R.
    • If you cannot find it there, then just search for “R studio” (make sure to include a space between “R” and “studio”).
    • If there are several versions, just pick the one with the highest version number.
    • In the example below, you can see that “R Studio 1.0.153” was the latest version that was found.
  • Don’t run “R for Windows”. This will give you a less user-friendly interface than what is available with RStudio.

Personal Computer

On your own computer you will have to install both R and RStudio

Open RStudio

Once you open up RStudio, you’ll see that it has four panels, which are:

Console - Bottom Left

  • R commands can be run directly here

Environment/History - Top Right

  • This shows all the variables and functions that are currently loaded in the workspace (the “Environment”)
  • You can click on the variables to see their values, which can be useful to inspect how the code is operating.
  • The History shows all of the commands that you have run.

R Script Files - Top Left

  • While you can run R code directly in the console, you can also save the code in a script file that allows you to rerun the code at a later date.
  • You can run lines of code by highlighting them, and then clicking on Run above.
  • You can run all the code at once by doing Code -> Run Region -> Run All

Files/Plots/Packages/Help - Bottom Right

  • Files - This shows everything in your current working directory
  • Plots - Once you start plotting, multiple plots will be stored here. There are arrows in this view that allow you to navigate between multiple plots
  • Packages - Shows all the packages installed and currently loaded
  • Help - Shows documentation for various functions.

How to read this tutorial

Running Code Examples

Everything shown in the large gray boxes below is code that you can run by copy/pasting into the Console tab in RStudio. For example:

print("hello world")
## [1] "hello world"

Gray bits of text like this usually refer to individual R commands, package names, or the exact name of things that you will see in the user interface.

Comments

You may also see comments included in the code, as indicated by text following the # sign

# This is a comment, R doesn't do anything with this

These comments can also be placed at the end of a line of R code

a <- "R will assign this text to a" # This is a comment, R doesn't do anything with this

For your own work, it is generally a good idea to include comments to help other people understand your code, and they can also be useful for yourself if you haven’t looked at a piece of code in a long time.

Exercises

For sections labelled Exercise we only show the results of the code and not the code itself. You should check to make sure that you get the same answers.

Modifying & Rerunning Commands in the Console

All of the commands that you use in the console are recorded in the History. You can also access previous commands by pressing the up key on your keyboard. When doing the exercises, it may take a few attempts for you to figure out the correct commands to use, and by using the up key you can easily recall your previous commands and modify them.

Saving code in an R Script

You can also collect these statements in a new R file (File -> New File -> R Script) which you can then run.

If you do create a new R file, you can run the code line by line by first placing your mouse cursor at the end of a line of code and then clicking on the Run button or using Ctrl + Enter. You can do the same if you also select several lines of code.

Additionally, it is possible to run all the lines of code at once via Ctrl + Alt + R. This will be useful once you work on your projects for this course.

Basic Operations

Using R as a calculator

You can use R just as you would use a calculator, for example, to add two numbers:

3 + 2
## [1] 5

Assigning values to variables

We can assign the results of the calculation 3 + 2 to a variable named a by using <-

a <- 3 + 2

If you look in the top right quadrant of RStudio, you’ll see that in the Environment tab, it shows that the variable a now has a value of 5. In simple terms, the environment shows all of the variables which have some value assigned to them. This means that you can reuse each of these variables later on for other calculations.

Note that in R you can also use = to assign the result. You will often see both forms used.

a = 3 + 2

Now print out the value of a:

a
## [1] 5

You can also do the same with the print function:

print(a)
## [1] 5

Although you assigned a number to the variable a, you can also assign to it different types of data like text:

a <- "this is some text"
a
## [1] "this is some text"

Valid Variable Names

In the examples above, we just used a variable simply named a. R actually allows you to create much more descriptive names for variables. A variable name must start with a letter, but after that you can use a mix of letters, numbers, as well as periods and underscores.

In short the requirements are:

  • First character must be: a-z or A-Z
  • All other characters can be: a-z, A-Z, 0-9, _, .

Valid variable names:

  • abc
  • ABC
  • a123
  • a_123
  • a_12.3
  • theta_a_b

Invalid variable names:

  • 2abc
  • _a
  • a-123

In general, it’s a good strategy to use descriptive “self-documenting” variable names like temperature_Groningen instead of a so that it’s easier for others (and you later on) to understand your code.
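
If you are unsure whether a name follows these rules, one possible check (a sketch, not the only way) is the base R function make.names(), which rewrites any text into a valid name. A candidate that comes back unchanged was already valid. The vector name candidates is just an illustration.

```r
candidates <- c("a_12.3", "theta_a_b", "2abc", "_a", "a-123")

# make.names() rewrites each string into a syntactically valid name.
# A string that comes back unchanged was already a valid name.
is_valid <- candidates == make.names(candidates)
is_valid

# Using an invalid name directly is a syntax error, for example:
# 2abc <- 1
```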

Exercise

Assign the result of five times three to a variable named b and print out the resulting value

print(b)
## [1] 15

Assign the result of b divided by 10 to a new variable c and print out the results using print(c)

print(c)
## [1] 1.5

Vectors

Creating Vectors

We can also assign a variable a list of values instead of just a single value. Vectors can be useful for representing information like hourly temperature readings over the course of a year.

Below we use the c() function which is used to concatenate or combine values together.

a <- c(3, 7, 1, 6)
a
## [1] 3 7 1 6

Note that when we print out the results, the commas are replaced by spaces.

Modifying Vectors

We can also use c() to add values to the beginning or end. Below we add 10 to the beginning and 21 to the end.

a <- c(10, a)
a
## [1] 10  3  7  1  6
a <- c(a, 21)
a
## [1] 10  3  7  1  6 21

We can also concatenate two vectors together:

b <- c(9, 2, 5)
c <- c(a, b)
c
## [1] 10  3  7  1  6 21  9  2  5

Note that here we have a variable named c and we also use the function c(). R is able to understand that when we use c with parentheses like c(a,b) we’re referring to the function c(), while when we mention it without parentheses like with c <- 3, we’re referring to the variable c and not to the function.

Operations on Vectors

With vectors you can do element-wise operations. Below we divide each element of c by 10.

c / 10
## [1] 1.0 0.3 0.7 0.1 0.6 2.1 0.9 0.2 0.5

We can divide one vector by another:

a <- c(10, 6)
b <- c(2, 5)
a / b
## [1] 5.0 1.2

This result is equivalent to c(10/2, 6/5), in other words, dividing the first element of a by the first element of b and then dividing the second element of a by the second element of b

If we divide vectors that are not of the same length, then we get a bit of a strange result:

a <- c(10, 6, 4)
b <- c(2, 5)
a / b 
## Warning in a/b: longer object length is not a multiple of shorter object
## length
## [1] 5.0 1.2 2.0

Even though R gives us a result, we get a warning that the vectors are not of the same length. What is happening here is that R will wrap around the shorter vector. Behind the scenes, it’s doing a calculation like c(10/2, 6/5, 4/2). Note that the 2 appears twice as it is the first element of b.

Above we divided the longer vector by the shorter vector. Below we divide the shorter vector by the longer vector:

b / a
## Warning in b/a: longer object length is not a multiple of shorter object
## length
## [1] 0.2000000 0.8333333 0.5000000

Here the calculation being performed is c(2/10, 5/6, 2/4). Note that in both cases the resulting vector has three elements, meaning that the result always has the same number of elements as the longest vector.
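
To make the wrap-around explicit, the sketch below stretches the shorter vector by hand with rep_len() and then divides. The names b_recycled, manual and automatic are only used for this illustration.

```r
a <- c(10, 6, 4)
b <- c(2, 5)

# Stretch b to the length of a by repeating its elements in order: 2 5 2
b_recycled <- rep_len(b, length(a))
b_recycled

# Dividing by the stretched vector gives the same numbers as a / b,
# but without the warning, because both vectors now have three elements.
manual <- a / b_recycled
automatic <- suppressWarnings(a / b)
identical(manual, automatic)
```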

Subsetting Vectors

Sometimes we don’t want to use all the values in a vector, but only want certain values. For the examples below we’ll work with a vector of ten random values:

x <- c(-1, 0, 0, -9, 1, 4, 8, -2, 3, 5)

Get the third element of the vector:

x[3]
## [1] 0

Note that in R, indices start at one, while in other languages they may start at zero. In other words, in Python, x[3] would give you the fourth element in the vector.

If you try to access elements in the vector at invalid locations, R will not generate an error, but will return numeric(0) or NA. In a later practical we’ll discuss what these mean, but for now you should keep in mind that if you see something like this, it may mean that you’re trying to access an element that doesn’t exist.

x[0] # there is no element at location 0
## numeric(0)
x[11] # only ten elements in the vector
## [1] NA

If you do run into issues like this, you can always use the length() function to see how many items are contained within a vector.

length(x)
## [1] 10
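
As a small sketch of how length() can be used to check a location before using it (the variable names i and i_is_valid are made up for this example):

```r
x <- c(-1, 0, 0, -9, 1, 4, 8, -2, 3, 5)
i <- 11

# TRUE only when i points at an existing element
i_is_valid <- i >= 1 && i <= length(x)
i_is_valid

# is.na() tells you whether the value you got back is NA
is.na(x[i])
```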

Return everything except for the seventh element:

x[-7]
## [1] -1  0  0 -9  1  4 -2  3  5

Get the fifth, sixth and seventh element:

x[5:7]
## [1] 1 4 8

Return all elements except for the fifth, sixth and seventh element:

x[-(5:7)]
## [1] -1  0  0 -9 -2  3  5

Return elements at locations 5 and 7:

x[c(5,7)]
## [1] 1 8

Note that while x[c(5:7)] will give you the same results as x[5:7] i.e. you don’t need to include the c(), you will get an error if instead of x[c(5,7)], you try x[5,7]

x[5,7]
## Error in x[5, 7]: incorrect number of dimensions

The reason is that this is the syntax that is used to access elements of a matrix, specifically the element at row 5 and column 7. We will discuss this later on in the practical.

Find all values equal to zero:

x[x == 0]
## [1] 0 0

Note that zero is listed twice as we have two zeros in the vector

Find all values less than four:

x[x < 4]
## [1] -1  0  0 -9  1 -2  3

Find all values in x which are in the set of numbers one through four:

x[x %in% 1:4]
## [1] 1 4 3

The values are returned in the order in which they are found, which is why the 4 appears before the 3
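
It can help to look at what the condition inside the square brackets produces on its own. The sketch below stores it in a variable (is_zero, a name chosen just for this example) and also shows which(), which returns positions rather than values.

```r
x <- c(-1, 0, 0, -9, 1, 4, 8, -2, 3, 5)

# The comparison itself returns one TRUE or FALSE per element of x
is_zero <- x == 0
is_zero

# Using that logical vector inside the brackets keeps only the TRUE positions
x[is_zero]

# which() returns the positions of the TRUE values: here 2 and 3
which(is_zero)

# Conditions can be combined with & (and) and | (or)
x[x > 0 & x < 4]
```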

Vectors and Statistical Functions

With vectors you can also perform operations like finding the sum, cumulative sum, mean, median and standard deviation:

sum(x) 
## [1] 9
cumsum(x) # cumulative sum
##  [1]  -1  -1  -1 -10  -9  -5   3   1   4   9
mean(x)   # average of all values
## [1] 0.9
median(x) 
## [1] 0.5
sd(x)     # standard deviation
## [1] 4.629615
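
To see what these functions calculate, here is a sketch that rebuilds the mean and the (sample) standard deviation from sum() and length(). The names n, mean_by_hand and sd_by_hand are only for illustration.

```r
x <- c(-1, 0, 0, -9, 1, 4, 8, -2, 3, 5)

n <- length(x)
mean_by_hand <- sum(x) / n

# Sample standard deviation: squared distances from the mean,
# summed, divided by n - 1, then square-rooted
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```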

Exercise

For this exercise, you will work with the vector y:

y = c(8, 3, 3, 6, 8, -4, -3, -2, -2, 10, 8, -4, 4, 0, -8, -6, 1, -8, -4, 4)

Find the average:

## [1] 0.7

Find the average and standard deviation for all values less than zero:

## [1] -4.555556
## [1] 2.297341

Create a vector containing the elements 4 2 7 1 and divide it by y.

##  [1]  0.5000000  0.6666667  2.3333333  0.1666667  0.5000000 -0.5000000
##  [7] -2.3333333 -0.5000000 -2.0000000  0.2000000  0.8750000 -0.2500000
## [13]  1.0000000        Inf -0.8750000 -0.1666667  4.0000000 -0.2500000
## [19] -1.7500000  0.2500000

Note that when printing the contents of a vector, R will wrap the results in order to fit the screen. If your results look different, then first check that the individual numbers are the same. For example, when you see [1], [7], [13], and [19], that means that the values next to them are those at position 1, 7, 13 and 19 in the vector.

In other words, if I print out the vector 1:8, then depending on the size of the screen, R might print it out like either of the two variants below:

[1] 1 2 3 4
[5] 5 6 7 8
[1] 1 2 3
[4] 4 5 6
[7] 7 8

Sequences

In R you will often be creating vectors that contain sequences of numbers. For many of the sequences you will need to create, R has several techniques that allow you to generate these without having to specify every element individually.

Based on what we showed previously, if you would like to create a vector for the sequence of integers from 3 to 7, you can do:

c(3, 4, 5, 6, 7)
## [1] 3 4 5 6 7

However, in this case, there’s no point in specifying the intermediate numbers since we’re just taking consecutive integers. What we want R to do is to start at a number and keep counting by one until we get to another number.

Using the : operator, we can shorten our code so that it looks like start_number:end_number

3:7
## [1] 3 4 5 6 7

You will also sometimes see people use c() to do this. This also gives the same result.

c(3:7)
## [1] 3 4 5 6 7

If the second number is smaller, then you will get a sequence that counts down.

7:3
## [1] 7 6 5 4 3

It’s possible to use this technique with real numbers and not just integers.

3.1459:10
## [1] 3.1459 4.1459 5.1459 6.1459 7.1459 8.1459 9.1459

Note that the final value is 9.1459 and not 10. This is because the next number in the sequence would be 10.1459 and this is greater than 10. In general, the final value in a sequence will be a value less than or equal to the end value you specified.

Yet another way to do this is via the seq() function

seq(3, 7, by = 1)
## [1] 3 4 5 6 7

What’s special about this is the by argument which allows us to specify by which value we would like to increment the sequence

seq(3, 7, by = 0.5)
## [1] 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

Again, the final value will be less than or equal to the end value we request:

seq(3, 7, by = 3.14159)
## [1] 3.00000 6.14159

The seq() function can also calculate the increment needed in order to have a sequence of a specified length with certain start and end values. In this case, we want a sequence with ten elements that starts with 3 and ends with 7.

seq(3, 7, length.out = 10)
##  [1] 3.000000 3.444444 3.888889 4.333333 4.777778 5.222222 5.666667
##  [8] 6.111111 6.555556 7.000000
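
The increment that seq() works out for length.out can be reproduced by hand: ten values have nine gaps between them. The sketch below uses made-up variable names (from, to, n, step, by_hand) to show this.

```r
from <- 3
to <- 7
n <- 10

# With n values there are n - 1 gaps between them
step <- (to - from) / (n - 1)
step

# Start value plus 0, 1, 2, ... steps
by_hand <- from + (0:(n - 1)) * step
all.equal(by_hand, seq(from, to, length.out = n))
```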

Repeating Sequences

Instead of creating sequences that increment by a fixed value, you may also want to create sequences of repeating elements. R allows you to do this with the rep() function.

Just like with the seq() function, there are different arguments we can use to get different behaviour. If we use the times argument, we can repeat a sequence multiple times like we would with c(1:3, 1:3, 1:3, 1:3)

rep(1:3, times=4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

If we want to repeat each element in a vector multiple times in a row, then we can use the each argument.

rep(1:3, each = 4)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

You can even call rep() multiple times to generate very complex sequences

rep(rep(1:3, times=2), each=2)
##  [1] 1 1 2 2 3 3 1 1 2 2 3 3
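
For this particular pattern, a single call to rep() with both arguments should give the same result. The variable names nested and combined are only for this sketch.

```r
nested <- rep(rep(1:3, times = 2), each = 2)

# With both arguments in one call, each is applied first and times second
combined <- rep(1:3, each = 2, times = 2)
combined

identical(nested, combined)
```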

Exercise

Create a sequence from -10 to 10, where each value is incremented by 0.5

##  [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0
## [12]  -4.5  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5
## [23]   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
## [34]   6.5   7.0   7.5   8.0   8.5   9.0   9.5  10.0

Create a sequence from -10 to 10, with only five values

## [1] -10  -5   0   5  10

Matrices

Initializing Matrices

A matrix can be thought of as tabular data contained in a set of rows and columns. As a simple example, here we initialize a matrix of zeros with three columns and two rows.

b <- matrix(0, ncol=3, nrow=2)
b
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0

Assigning Values to Matrix Elements

Then set the value at row 1 and column 2 to 3. The syntax we use here is of the form b[row_number, column_number]

b[1,2] <- 3
b
##      [,1] [,2] [,3]
## [1,]    0    3    0
## [2,]    0    0    0
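
The same syntax can also fill a whole row or column at once by leaving one of the two positions empty. This is a small sketch using a separate matrix named m so that b is left untouched.

```r
m <- matrix(0, ncol = 3, nrow = 2)

# Leave the column position empty to address the whole of row 2
m[2, ] <- c(7, 8, 9)

# Leave the row position empty to address the whole of column 1
m[, 1] <- 5
m
```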

Converting Vectors to Matrices

We can also convert a vector of values into a matrix. In this example, the vector consists of six elements, while the matrix consists of three columns and two rows. R will just take the values in the vector and arrange them from top to bottom in the columns, moving from left to right.

matrix(1:6, ncol=3, nrow=2)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

In the previous example, the vector had the same number of elements as the matrix. We can also create a matrix where the vector is repeated. Note how the values are repeated.

matrix(1:3, ncol=3, nrow=2)
##      [,1] [,2] [,3]
## [1,]    1    3    2
## [2,]    2    1    3

Note that you will get a warning if the length of the vector is not a sub-multiple or multiple of the number of rows in the matrix. R will still recycle the elements though.

matrix(1:5, ncol=3, nrow=2)
## Warning in matrix(1:5, ncol = 3, nrow = 2): data length [5] is not a sub-
## multiple or multiple of the number of rows [2]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    1

We can also control how R populates the matrix with vector values, by using the byrow=TRUE argument. By default it will place vector values in a matrix by going down the columns from left to right.

matrix(1:6, ncol=3, nrow=2, byrow=TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
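
One way to convince yourself of what byrow = TRUE does: filling a matrix by rows should give the same result as filling a matrix of the swapped shape by columns and then swapping rows and columns with t() (transpose). The names by_row and by_col_flipped are just for this sketch.

```r
by_row <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)

# Fill a 3 x 2 matrix column by column, then swap rows and columns with t()
by_col_flipped <- t(matrix(1:6, ncol = 2, nrow = 3))
by_col_flipped

all(by_row == by_col_flipped)
```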

Joining Matrices

You can append a matrix to the end of another matrix using rbind which means that you will bind the rows together.

b <- matrix(0, ncol=3, nrow=2)
c <- matrix(1, ncol=3, nrow=2)
d <- rbind(b,c)
d
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    1    1    1
## [4,]    1    1    1

You can also bind the columns together using cbind

cbind(b,c)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    1    1
## [2,]    0    0    0    1    1    1

Statistical Functions and Matrices

R also has functions that run operations such as mean and sum on matrix rows and columns:

rowMeans(d)
## [1] 0 0 1 1
rowSums(d)
## [1] 0 0 3 3
colMeans(d)
## [1] 0.5 0.5 0.5
colSums(d)
## [1] 2 2 2
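
These four functions are shortcuts. As a sketch of the more general approach, the base R function apply() runs any function over the rows (second argument 1) or columns (second argument 2) of a matrix, which is handy when no ready-made shortcut exists.

```r
b <- matrix(0, ncol = 3, nrow = 2)
c <- matrix(1, ncol = 3, nrow = 2)
d <- rbind(b, c)

# apply(matrix, 1, f) runs f on every row; apply(matrix, 2, f) on every column
apply(d, 1, mean)   # same as rowMeans(d)
apply(d, 2, sum)    # same as colSums(d)

# apply() also works for functions without a ready-made shortcut
apply(d, 2, sd)     # standard deviation of each column
```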

We can also use the same functions as we used for vectors. These will return a single value as they analyze all elements in the matrix.

mean(d)
## [1] 0.5
sd(d)
## [1] 0.522233
sum(d)
## [1] 6
median(d)
## [1] 0.5

Subsetting Matrices

Here we create a new matrix:

a <- matrix(1:12, nrow=3, ncol=4, byrow = TRUE)
a
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

We can subset elements of a matrix by using square brackets, similar to how we accessed elements of a vector. The main difference is that for matrices we need to specify two numbers: the first corresponds to the row number and the second to the column number.

Select values from row 2:

a[2,]
## [1] 5 6 7 8

Select values from column 2:

a[,2]
## [1]  2  6 10

Select element at row 1, column 2:

a[1,2]
## [1] 2

It’s still possible to access matrix elements by a single index, although this is a bit dangerous as it’s not always intuitive where the value is coming from.

a[10]
## [1] 4
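
The single index counts down the columns, one column after another. The sketch below shows how index 10 relates to row 1 and column 4 of this matrix; row_number, column_number and single_index are names made up for the example.

```r
a <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)

# Single indices count down the columns: 1-3 is column 1, 4-6 is column 2, ...
row_number <- 1
column_number <- 4
single_index <- (column_number - 1) * nrow(a) + row_number
single_index                 # 10

a[single_index] == a[row_number, column_number]

# arrayInd() does the reverse: from a single index to row and column
arrayInd(10, dim(a))
```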

Just like with vectors, we can still access elements that meet some condition

a[a > 5]
## [1]  9  6 10  7 11  8 12

Useful Functions

By using the summary() function we can get a quick overview of statistical properties of the columns of the matrix. This helps to give you an idea about the distribution of values by showing the min, max, median, etc values per column.

summary(a)
##        V1          V2           V3           V4    
##  Min.   :1   Min.   : 2   Min.   : 3   Min.   : 4  
##  1st Qu.:3   1st Qu.: 4   1st Qu.: 5   1st Qu.: 6  
##  Median :5   Median : 6   Median : 7   Median : 8  
##  Mean   :5   Mean   : 6   Mean   : 7   Mean   : 8  
##  3rd Qu.:7   3rd Qu.: 8   3rd Qu.: 9   3rd Qu.:10  
##  Max.   :9   Max.   :10   Max.   :11   Max.   :12

Other functions show you the number of rows, columns and dimensions of the matrix

nrow(a) # number of rows in a
## [1] 3
ncol(a) # number of columns in a
## [1] 4
dim(a) # dimension of a (number of rows and number of columns)
## [1] 3 4

Exercise

Find the average of all values in the second column of a

## [1] 6

Find the standard deviation of all values in a which are less than 6.5

## [1] 1.870829

Find the averages of the first and third rows of a. You should just need a single line of code for this.

## [1]  2.5 10.5

Data Frames

One issue with matrices is that they can only hold one type of data. This is not always ideal, as in your own research you will often use data that is a mix of both text and numbers. An example of this would be data containing the names of weather stations and then numerical information about temperature, humidity, etc.

The example below shows what happens when you try to combine matrices with numbers and text:

a <- matrix(1:12, nrow=4, ncol=3)
a
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
b <- matrix(c("this", "is", "some", "text"), nrow=4, ncol=1)
b
##      [,1]  
## [1,] "this"
## [2,] "is"  
## [3,] "some"
## [4,] "text"
cbind(a,b)
##      [,1] [,2] [,3] [,4]  
## [1,] "1"  "5"  "9"  "this"
## [2,] "2"  "6"  "10" "is"  
## [3,] "3"  "7"  "11" "some"
## [4,] "4"  "8"  "12" "text"

What’s happened is that by default R converted all the elements to text since it can’t combine numbers and text together. You can tell this since the values are surrounded by quotation marks like ", although there are some cases where R will not display these.
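
If the quotation marks are not shown, typeof() is another way to check what a matrix holds. This sketch uses smaller matrices with made-up names (numbers, words, combined) and also shows how to turn the text back into numbers with as.numeric().

```r
numbers <- matrix(1:4, nrow = 2, ncol = 2)
words <- matrix(c("some", "text"), nrow = 2, ncol = 1)
combined <- cbind(numbers, words)

typeof(numbers)    # "integer"
typeof(combined)   # "character"

# Arithmetic on the text version fails, so convert back to numbers first
# combined[, 1] + 1            # would give an error
as.numeric(combined[, 1]) + 1
```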

A way around this is to use data frames. A key difference from matrices is that data frames only require that you have a single data type per column. In the example below, the first two columns are numbers, while the last column is text. Note that when we specify x =, y = and z = below, we are actually creating a type of table where the columns will be labelled x, y and z.

a <- data.frame(x = c(1:3),
               y = c(4:6),
               z = c("a", "b", "c"))
a
##   x y z
## 1 1 4 a
## 2 2 5 b
## 3 3 6 c
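
To check which type of data each column holds, you can ask for the class of every column. This is a sketch on a copy of the same data under the made-up name example_df; stringsAsFactors = FALSE is set explicitly so that the text column stays text on older versions of R as well.

```r
example_df <- data.frame(x = 1:3,
                         y = 4:6,
                         z = c("a", "b", "c"),
                         stringsAsFactors = FALSE)

# One class per column
column_classes <- sapply(example_df, class)
column_classes

# str() prints a compact overview of the whole structure
str(example_df)
```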

Data frames are similar to matrices in that you can access elements based on their row and column indices. For example, to get the element in the 2nd row and third column:

a[2,3]
## [1] "b"

We can also get just the 2nd row:

a[2,]
##   x y z
## 2 2 5 b

Or just the 3rd column:

a[,3]
## [1] "a" "b" "c"

One of the nice things about data frames is that you can use the names of the columns (combined with the $ sign) to directly access the values in that column. So if we want to see the values of only the z column, we can use a$z

a$z
## [1] "a" "b" "c"

Multiple columns can be selected. Note that we have to include a comma to indicate that we want all rows

a[,c("x", "y")]
##   x y
## 1 1 4
## 2 2 5
## 3 3 6

Same, but rows two and three for columns x and y:

a[2:3,c("x", "y")]
##   x y
## 2 2 5
## 3 3 6
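
Rows can also be selected with a condition, just like the elements of a vector. The sketch below works on a copy of the same data named example_df (a name chosen for this example only).

```r
example_df <- data.frame(x = 1:3, y = 4:6, z = c("a", "b", "c"))

# Keep the rows where y is greater than 4; the empty column position means all columns
large_y <- example_df[example_df$y > 4, ]
large_y

# Combine a row condition with a selection of columns
small_x <- example_df[example_df$x < 3, c("x", "z")]
small_x
```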

You can also add a new column to an existing data frame. Here we add a new column t by using the syntax a$t

a$t <- c(10, 13, 17)
a
##   x y z  t
## 1 1 4 a 10
## 2 2 5 b 13
## 3 3 6 c 17

We can remove an existing column by assigning it a value of NULL

a$x <- NULL
a
##   y z  t
## 1 4 a 10
## 2 5 b 13
## 3 6 c 17

We’ll now look at the mtcars data set that is included with R. If you type ?mtcars in the console, you’ll see more documentation. Looking at the first few lines of the mtcars data frame, we see the following which shows data in several columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The head command just shows us the top few rows, and you can also use the tail command to look at the bottom rows.

tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

For specific columns, we can use the mean and sd functions to find the average and standard deviation.

mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
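
Since a column is just a vector, the subsetting techniques from the vectors section can be combined with these functions. As a sketch, here are the statistics for only the cars with four cylinders (mpg_four_cyl is a name made up for this example):

```r
# mpg values of the cars with four cylinders
mpg_four_cyl <- mtcars$mpg[mtcars$cyl == 4]

length(mpg_four_cyl)   # how many cars that is
mean(mpg_four_cyl)
sd(mpg_four_cyl)
```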

Just like with matrices, we can also run the summary() function to get an overview of the range and distribution of values in each column:

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

The same functions used to understand the size of matrices can be used for data frames too

nrow(mtcars) # number of rows
## [1] 32
ncol(mtcars) # number of columns
## [1] 11
dim(mtcars) # dimension (number of rows and number of columns)
## [1] 32 11

Exercise

Create a data frame with three columns apples, pears and oranges with the data values shown below:

##      apples pears oranges
## 1 Groningen    10       1
## 2 Amsterdam     9       2
## 3 Rotterdam     8       3
## 4   Utrecht     7       4

Find the averages of the pears and oranges columns

##   pears oranges 
##     8.5     2.5