Data structures

Until now all the variables we have used have contained a single piece of information, for example, a <- 4 makes a variable a containing a single number, 4. It’s very common to want to refer to collections of data. You can think, for example, of a bank statement that contains the list of expenses you had last month.

R has several build-in data structures that facilitate working with this common kind of data. In this beginners course we will four of the most used data structures: vector, list, matrix and data.frame. But keep in mind there are other built-in data structures.

Vectors

Vectors are the most basic data structure in R. They are one-dimensional arrays that can hold elements of the same data type.

# Create a numeric vector
num_vector <- c(1, 2, 3, 4, 5)

# Create a vector of strings
str_vector <- c("apple", "banana", "cherry")

The elements of a vectors can be accessed using square brackets, being the first element index 1:

# Access the third element
cat(num_vector[3])
3

As well as being able to select individual elements from a data structure, you can also grab sections of it at once. This process of asking for subsections of a data structure of called slicing.

# Access the third element
cat(num_vector[2:4])
2 3 4
# Access the third element
cat(num_vector[c(1,3,5)])
1 3 5

Addind elements in a vector can be done with the function append:

# Access the third element
str_vector <- append(str_vector, "orange")
cat(str_vector)
apple banana cherry orange

Elements can also be removed using a negative sign while indexing them, as in:

# Access the third element
str_vector <- str_vector[-2]
cat(str_vector)
apple cherry orange

Matrices

Matrices are two-dimensional arrays that contain elements of the same data type.

# Create a 3x3 matrix
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)

print(my_matrix)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Elements in a matrix can be accessed using row and column indices:

cat(my_matrix[2, 3])  # Returns the element in the 2nd row, 3rd column
8

Using the functions rbind and cbind we can add rows and columns, respectively, to a matrix.

my_matrix <- rbind(my_matrix, c(20,21,22))
print(my_matrix)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]   20   21   22
my_matrix <- cbind(my_matrix, 31:34)
print(my_matrix)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   31
[2,]    2    5    8   32
[3,]    3    6    9   33
[4,]   20   21   22   34

You can also delete rows and columns using the same method that we saw for vectors.

Exercise

Given the matrix below, and a column with the values c(1, 2, 1, 2, 1) and remove the rows 1 and 3.

# Create an all-zero 5x5 matrix
my_matrix <- matrix(0, nrow = 5, ncol = 5)
print(my_matrix)
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0
my_matrix <- matrix(0, nrow = 5, ncol = 5)

my_matrix <- rbind(my_matrix, c(1, 2, 1, 2, 1))
my_matrix <- my_matrix[,-c(1,3)]

print(my_matrix)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
[4,]    0    0    0
[5,]    0    0    0
[6,]    2    2    1

Lists

Lists are a bit more versatile data structures as they can contain elements of different data types, including other lists.

my_list <- list(
  numbers = 1:5,
  fruit = c("apple", "banana", "cherry"),
  nested_list = list("a", "b", "c")
)

You can access list elements by name or position:

cat(my_list$fruit)
apple banana cherry
cat(my_list[[2]])  
apple banana cherry

You can have as many items in a list as you like, even zero items as in:

my_empty_list <- list()
cat(my_empty_list)

We can add elements to a list just by indexing a new name. For instance:

my_list["new_element"] <- TRUE
print(my_list)
$numbers
[1] 1 2 3 4 5

$fruit
[1] "apple"  "banana" "cherry"

$nested_list
$nested_list[[1]]
[1] "a"

$nested_list[[2]]
[1] "b"

$nested_list[[3]]
[1] "c"


$new_element
[1] TRUE

We can remove any element of a list assigning them the value NULL.

my_list[c("nested_list","new_element")] <- NULL
print(my_list)
$numbers
[1] 1 2 3 4 5

$fruit
[1] "apple"  "banana" "cherry"

Data Frames

Data frames are table-like structures that can contain columns of different data types. They are one of the most commonly used data structures for data analysis in R. Note that all the columns in a data frame have the same number of elements.

df <- data.frame(
  name = c("Jean", "Thomas", "Daniel"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, FALSE)
)

You can access data frame columns using the $ operator or square brackets []:

cat(df$name)
Jean Thomas Daniel
cat(df[, "age"])
25 30 35

You can also access the a data frame by rows

print(df[2, ])
    name age is_student
2 Thomas  30      FALSE

Data frames support adding new columns by passing to an assignment operation a new column name:

df$height <- c(165, 180, 175)
print(df)
    name age is_student height
1   Jean  25       TRUE    165
2 Thomas  30      FALSE    180
3 Daniel  35      FALSE    175

And adding new rows using rbind:

df <- rbind(df, c("Patricia",18,FALSE,160) )
print(df)
      name age is_student height
1     Jean  25       TRUE    165
2   Thomas  30      FALSE    180
3   Daniel  35      FALSE    175
4 Patricia  18      FALSE    160
Exercise

Given the below dataframe, add some more items in it:

name,   age, student
Kate,    41,   Y
Sarah,   62,   N
Maddie,  33,   Y
James,   19,   Y

After, create another dataframe with only the name and student status for the first 3 person.

df <- data.frame(
  name = c("Jean", "Thomas", "Daniel"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, FALSE)
)

We can do this in different ways, using rbind and adding row by row, or creating another dataframe and binding both together.

df <- data.frame(
  name = c("Jean", "Thomas", "Daniel"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, FALSE)
)

df <- rbind(df, c("Kate",    41,   TRUE) )

newdata <- data.frame(
  name = c("Sarah", "Maddie", "James"),
  age = c(62, 33, 19),
  is_student = c(FALSE, TRUE, TRUE)
)

df <- rbind(df, newdata)

print(df)
    name age is_student
1   Jean  25       TRUE
2 Thomas  30      FALSE
3 Daniel  35      FALSE
4   Kate  41       TRUE
5  Sarah  62      FALSE
6 Maddie  33       TRUE
7  James  19       TRUE

Dataframes can index the rows and columns we want using square brakets [rows,columns].

df2 <- df[1:3,c(1,3)]

print(df2)
    name is_student
1   Jean       TRUE
2 Thomas      FALSE
3 Daniel      FALSE

Errors while working with data stuctures

It is very likely that indexing lists is the first time you will see a R error. Seing R errors (also sometimes called exceptions) is not a sign that you’re a bad programmer or that you’re doing something terrible. Even experienced programmers see R errors on their screen.

Error messages are in fact a very useful feedback mechanism for the programmer but that can be a bit daunting when you first see them. Let’s recreate a typical error message: a dataframe with three columns will not have a column at index 6 (the highest index in that case would be 3) and produce an error if we ask for it.

dataframe.R
df <- data.frame(
  name = c("Jean", "Thomas", "Daniel"),
  age = c(25, 30, 35),
  is_student = c(TRUE, FALSE, FALSE)
)

cat(df[,6])
Error in `[.data.frame`(df, , 6): undefined columns selected

Likewise, it will return an error if there’s no matching column name.

cat(df[,"Will"])
Error in `[.data.frame`(df, , "Will"): undefined columns selected

In this last case, has two parts to it. The first is the words before the colon which tells you where the error was found. The second part of that line is usually a slightly more descriptive message, in this case telling us that the specific problem was that the column selected is not known.

Take your time to read the error messages when they are printied to the screen, they will most likely help you solve the issue. If you think that you’ve fixed the problem but the error persists, make sure that you’ve saved the script file and rerun your code afterwards.

Exercise

What happens if you try to print df[12,6]in the above example?

When trying to access individual elements, R returns an empty object but not an error.

cat(df[12,6])