As you may remember from the Intermediate R workshop, R is great at representing and manipulating tabular data. In “traditional” R, this is handled by a data.frame, while in modern “tidyverse” R it is handled by a tibble.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
A tibble is a two-dimensional table of data, made up of rows and named columns.
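The call that builds the table is not shown here. A minimal sketch of how it could be written follows, assuming the tibble() constructor is used and taking the values from the printed output further down; the exact form of the original call is an assumption.

```r
library(tibble)

# Reconstructed from the printed output; the original call is not shown.
# Each argument name becomes a column name, each vector becomes that column.
census <- tibble(
  City = c("Paris", "Paris", "Paris", "Paris",
           "London", "London", "London", "London",
           "Rome", "Rome", "Rome", "Rome"),
  year = c(2001, 2008, 2009, 2010,
           2001, 2006, 2011, 2015,
           2001, 2006, 2009, 2012),
  pop = c(2.15, 2.21, 2.23, 2.24,
          7.32, 7.66, 8.17, 8.62,
          2.55, 2.63, 2.73, 2.63)
)
```

All three vectors must have the same length (12 here), since each one supplies one value per row.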
This has created a tibble that we have assigned to the variable census. The column names are the argument names (City, year and pop), while the data for each column is given by the corresponding vectors.
You can print a summary of the tibble via:
census
# A tibble: 12 × 3
City year pop
<chr> <dbl> <dbl>
1 Paris 2001 2.15
2 Paris 2008 2.21
3 Paris 2009 2.23
4 Paris 2010 2.24
5 London 2001 7.32
6 London 2006 7.66
7 London 2011 8.17
8 London 2015 8.62
9 Rome 2001 2.55
10 Rome 2006 2.63
11 Rome 2009 2.73
12 Rome 2012 2.63
Note that R defaults to interpreting numbers as floating point (dbl). While this is correct for the pop (population) column, it is the wrong choice for year, where an integer is a better fit. To force this, use as.integer to set the data type for the year column:
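The conversion code itself is not shown. One way to write it, assuming plain `$` assignment is used and that census holds the values printed above, is sketched here.

```r
library(tibble)

# Same data as printed above, with year still stored as double
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = c(2001, 2008, 2009, 2010,
           2001, 2006, 2011, 2015,
           2001, 2006, 2009, 2012),
  pop = c(2.15, 2.21, 2.23, 2.24,
          7.32, 7.66, 8.17, 8.62,
          2.55, 2.63, 2.73, 2.63)
)

# Overwrite the year column with an integer version of itself
census$year <- as.integer(census$year)

# Printing the tibble now shows <int> under the year column
census
```

The same change can be written in dplyr style with mutate(year = as.integer(year)).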
# A tibble: 12 × 3
City year pop
<chr> <int> <dbl>
1 Paris 2001 2.15
2 Paris 2008 2.21
3 Paris 2009 2.23
4 Paris 2010 2.24
5 London 2001 7.32
6 London 2006 7.66
7 London 2011 8.17
8 London 2015 8.62
9 Rome 2001 2.55
10 Rome 2006 2.63
11 Rome 2009 2.73
12 Rome 2012 2.63
Printing census now shows the year column with type <int>.
You access the contents of a tibble mostly by column, e.g.
census["City"]
# A tibble: 12 × 1
City
<chr>
1 Paris
2 Paris
3 Paris
4 Paris
5 London
6 London
7 London
8 London
9 Rome
10 Rome
11 Rome
12 Rome
will return a tibble of just a single column containing the City data.
You can also access the columns by their index, e.g.
census[1]
# A tibble: 12 × 1
City
<chr>
1 Paris
2 Paris
3 Paris
4 Paris
5 London
6 London
7 London
8 London
9 Rome
10 Rome
11 Rome
12 Rome
will return the first column, so is identical to census["City"].
You can also extract multiple columns by specifying them via c( ), e.g.
census[c("City", "year")]
# A tibble: 12 × 2
City year
<chr> <int>
1 Paris 2001
2 Paris 2008
3 Paris 2009
4 Paris 2010
5 London 2001
6 London 2006
7 London 2011
8 London 2015
9 Rome 2001
10 Rome 2006
11 Rome 2009
12 Rome 2012
will return a tibble with the City and year columns.
To access data by rows, you need to pass in the row index followed by a comma, e.g.
census[1, ]
# A tibble: 1 × 3
City year pop
<chr> <int> <dbl>
1 Paris 2001 2.15
will return a tibble containing just the first row of data.
You can use ranges to get several rows, e.g.
census[1:5, ]
# A tibble: 5 × 3
City year pop
<chr> <int> <dbl>
1 Paris 2001 2.15
2 Paris 2008 2.21
3 Paris 2009 2.23
4 Paris 2010 2.24
5 London 2001 7.32
would return the first five rows, while
census[seq(2, 10, 2), ]
# A tibble: 5 × 3
City year pop
<chr> <int> <dbl>
1 Paris 2008 2.21
2 Paris 2010 2.24
3 London 2006 7.66
4 London 2015 8.62
5 Rome 2006 2.63
would return the even rows from 2 to 10.
You can access specific rows and columns via [row, column], e.g.
census[1, 1]
# A tibble: 1 × 1
City
<chr>
1 Paris
returns a tibble containing just the first row and first column, while
census[seq(2, 10, 2), "year"]
# A tibble: 5 × 1
year
<int>
1 2008
2 2010
3 2006
4 2015
5 2006
would return the year column of the even rows from 2 to 10, and
census[5, 2:3]
# A tibble: 1 × 2
year pop
<int> <dbl>
1 2001 7.32
would return the second and third columns of the fifth row.
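Every single-bracket result above is still a tibble, even when it holds one cell. The sketch below contrasts this with a base data.frame, which drops a single cell to a bare value. The name census_df is introduced only for this illustration, and year is entered directly as integers.

```r
library(tibble)

census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = c(2001L, 2008L, 2009L, 2010L,
           2001L, 2006L, 2011L, 2015L,
           2001L, 2006L, 2009L, 2012L),
  pop = c(2.15, 2.21, 2.23, 2.24,
          7.32, 7.66, 8.17, 8.62,
          2.55, 2.63, 2.73, 2.63)
)

# Single-bracket indexing on a tibble keeps the tibble structure
cell <- census[1, 1]
is_tibble(cell)

# Hypothetical data.frame copy of the same data, for comparison only
census_df <- as.data.frame(census)

# On a data.frame, one cell comes back as a plain character value
class(census_df[1, 1])
```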
The indexing operations above all return a tibble that is a subset of the whole tibble. You can extract the data for a single column as a vector via [[ ]] or $, e.g. census[["City"]] or census$City,
and can then extract individual values from that vector via sub-indexing, e.g.
census$City[1]
[1] "Paris"
would return the City column data for the first row.
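The two extraction forms can be compared side by side. This is a small sketch assuming the census data printed earlier; the variable names cities and same are illustrative only.

```r
library(tibble)

census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = c(2001L, 2008L, 2009L, 2010L,
           2001L, 2006L, 2011L, 2015L,
           2001L, 2006L, 2009L, 2012L),
  pop = c(2.15, 2.21, 2.23, 2.24,
          7.32, 7.66, 8.17, 8.62,
          2.55, 2.63, 2.73, 2.63)
)

# [[ ]] and $ both pull the column out as a plain vector, not a tibble
cities <- census[["City"]]
same <- census$City
identical(cities, same)

# Sub-index the vector to get individual values
cities[1]

# Rows 5 to 8 are the London rows, so this gives the London populations
london_pop <- census[["pop"]][5:8]
london_pop
```

The [[ ]] form is handy when the column name is held in a variable, since `$` only accepts a literal name.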
Querying
We can start to ask questions of our data using the filter function.
census %>% filter(City == "Paris")
# A tibble: 4 × 3
City year pop
<chr> <int> <dbl>
1 Paris 2001 2.15
2 Paris 2008 2.21
3 Paris 2009 2.23
4 Paris 2010 2.24
Note that we didn’t need to put double quotes around City in the filter: it knows that this is a column name. As a reminder, the %>% operator passes the value on its left as the first argument to the function on its right.
This has returned a new tibble, which you can then access using the same methods as above, e.g.
(census %>% filter(City == "Paris"))["year"]
# A tibble: 4 × 1
year
<int>
1 2001
2 2008
3 2009
4 2010
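filter also accepts more than one condition; conditions separated by commas must all hold for a row to be kept. The sketch below assumes the same census data, and the name london_recent is illustrative only.

```r
library(tibble)
library(dplyr)

census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = c(2001L, 2008L, 2009L, 2010L,
           2001L, 2006L, 2011L, 2015L,
           2001L, 2006L, 2009L, 2012L),
  pop = c(2.15, 2.21, 2.23, 2.24,
          7.32, 7.66, 8.17, 8.62,
          2.55, 2.63, 2.73, 2.63)
)

# Keep only rows where both conditions are TRUE (comma means AND)
london_recent <- census %>% filter(City == "London", year > 2005)
london_recent

# The result is an ordinary tibble, so the indexing shown earlier still applies
london_recent[["pop"]]
```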
You can also test if the rows of a tibble match a condition, e.g.