Tibbles and filters

As you may remember from the Intermediate R workshop, R is great at representing and manipulating tabular data. In “traditional” R, tabular data is stored in a data.frame, while modern “tidyverse” R uses a tibble.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

A tibble is a two-dimensional table of data, made up of rows and named columns.

census <- tibble("City"=c("Paris", "Paris", "Paris", "Paris",
                          "London", "London", "London", "London",
                          "Rome", "Rome", "Rome", "Rome"),
                 "year"=c(2001, 2008, 2009, 2010,
                          2001, 2006, 2011, 2015,
                          2001, 2006, 2009, 2012),
                 "pop"=c(2.148, 2.211, 2.234, 2.244,
                         7.322, 7.657, 8.174, 8.615,
                         2.547, 2.627, 2.734, 2.627))

This has created a tibble that we have assigned to the variable census. The column names are the names given on the left of each = (City, year and pop), while the data for each column is the vector on the right, built with c( ).
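As a quick check of what was just built, the sketch below inspects the column names and dimensions of the tibble. It rebuilds the same data more compactly with rep( ), which is only a convenience for this example and not how the workshop defines it.

```r
library(tidyverse)

# Same data as above, with the repeated city names generated by rep()
census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = c(2001, 2008, 2009, 2010,
                          2001, 2006, 2011, 2015,
                          2001, 2006, 2009, 2012),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

names(census)  # the column names: "City" "year" "pop"
dim(census)    # rows then columns: 12 3
nrow(census)   # 12
ncol(census)   # 3
```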

You can print a summary of the tibble via:

census

# A tibble: 12 × 3
   City    year   pop
   <chr>  <dbl> <dbl>
 1 Paris   2001  2.15
 2 Paris   2008  2.21
 3 Paris   2009  2.23
 4 Paris   2010  2.24
 5 London  2001  7.32
 6 London  2006  7.66
 7 London  2011  8.17
 8 London  2015  8.62
 9 Rome    2001  2.55
10 Rome    2006  2.63
11 Rome    2009  2.73
12 Rome    2012  2.63

Note that R defaults to storing numbers as floating point (dbl). While this is correct for the pop (population) column, it is the wrong choice for year, which is better stored as an integer. To force this, wrap the year values in as.integer:

census <- tibble("City"=c("Paris", "Paris", "Paris", "Paris",
                          "London", "London", "London", "London",
                          "Rome", "Rome", "Rome", "Rome"),
                 "year"=as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 "pop"=c(2.148, 2.211, 2.234, 2.244,
                         7.322, 7.657, 8.174, 8.615,
                         2.547, 2.627, 2.734, 2.627))

census

# A tibble: 12 × 3
   City    year   pop
   <chr>  <int> <dbl>
 1 Paris   2001  2.15
 2 Paris   2008  2.21
 3 Paris   2009  2.23
 4 Paris   2010  2.24
 5 London  2001  7.32
 6 London  2006  7.66
 7 London  2011  8.17
 8 London  2015  8.62
 9 Rome    2001  2.55
10 Rome    2006  2.63
11 Rome    2009  2.73
12 Rome    2012  2.63

The year column is now reported as an integer (int).
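As an alternative to as.integer, R also accepts integer literals written with an L suffix. The short sketch below compares the two; the variable names (years_dbl, years_int, small) are made up for this illustration.

```r
library(tidyverse)

years_dbl <- c(2001, 2008)    # plain numbers are stored as double
years_int <- c(2001L, 2008L)  # the L suffix makes them integers

typeof(years_dbl)  # "double"
typeof(years_int)  # "integer"

# Converting the doubles gives exactly the same integer vector
identical(as.integer(years_dbl), years_int)  # TRUE

# A tibble built from the integer vector has an <int> column
small <- tibble(year = years_int)
small
```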

You access the contents of a tibble mostly by column, e.g.

census["City"]

# A tibble: 12 × 1
   City  
   <chr> 
 1 Paris 
 2 Paris 
 3 Paris 
 4 Paris 
 5 London
 6 London
 7 London
 8 London
 9 Rome  
10 Rome  
11 Rome  
12 Rome

will return a tibble of just a single column containing the City data.

You can also access the columns by their index, e.g.

census[1]

# A tibble: 12 × 1
   City  
   <chr> 
 1 Paris 
 2 Paris 
 3 Paris 
 4 Paris 
 5 London
 6 London
 7 London
 8 London
 9 Rome  
10 Rome  
11 Rome  
12 Rome

will return the first column, so is identical to census["City"].

You can also extract multiple columns by specifying them via c( ), e.g.

census[c("City", "year")]

# A tibble: 12 × 2
   City    year
   <chr>  <int>
 1 Paris   2001
 2 Paris   2008
 3 Paris   2009
 4 Paris   2010
 5 London  2001
 6 London  2006
 7 London  2011
 8 London  2015
 9 Rome    2001
10 Rome    2006
11 Rome    2009
12 Rome    2012

will return a tibble with the City and year columns.
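The same column selection can also be written with the select function from dplyr, which fits more naturally into a %>% pipeline. This is a sketch of an alternative, not a replacement for the indexing shown above; the census tibble is rebuilt here so the example runs on its own.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

# Keep only the City and year columns (column names are unquoted)
by_select <- census %>% select(City, year)
by_select

# A minus sign drops a column instead, leaving City and year
no_pop <- census %>% select(-pop)
names(no_pop)  # "City" "year"
```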

To access data by rows, you need to pass in the row index followed by a comma, e.g.

census[1, ]

# A tibble: 1 × 3
  City   year   pop
  <chr> <int> <dbl>
1 Paris  2001  2.15

will return a tibble containing just the first row of data.

You can use ranges to get several rows, e.g.

census[1:5, ]

# A tibble: 5 × 3
  City    year   pop
  <chr>  <int> <dbl>
1 Paris   2001  2.15
2 Paris   2008  2.21
3 Paris   2009  2.23
4 Paris   2010  2.24
5 London  2001  7.32

would return the first five rows, while

census[seq(2, 10, 2), ]

# A tibble: 5 × 3
  City    year   pop
  <chr>  <int> <dbl>
1 Paris   2008  2.21
2 Paris   2010  2.24
3 London  2006  7.66
4 London  2015  8.62
5 Rome    2006  2.63

would return the even rows from 2 to 10.

You can access specific rows and columns via [row, column], e.g.

census[1, 1]

# A tibble: 1 × 1
  City 
  <chr>
1 Paris

returns a tibble containing just the first row and first column, while

census[seq(2, 10, 2), "year"]

# A tibble: 5 × 1
   year
  <int>
1  2008
2  2010
3  2006
4  2015
5  2006

would return the year column of the even rows from 2 to 10, and

census[5, 2:3]

# A tibble: 1 × 2
   year   pop
  <int> <dbl>
1  2001  7.32

would return the second and third columns of the fifth row.
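One difference from a “traditional” data.frame is worth seeing in code: indexing a single column with [row, column] on a data.frame silently drops down to a plain vector, while a tibble always gives back a tibble. The small df and tb objects below are invented for this comparison.

```r
library(tidyverse)

# A tiny data.frame and the equivalent tibble
df <- data.frame(City = c("Paris", "London"),
                 year = c(2001L, 2001L),
                 stringsAsFactors = FALSE)
tb <- as_tibble(df)

# data.frame: a single column selected with [ , ] becomes a vector
class(df[, "City"])  # "character"

# tibble: the result is still a tibble with one column
class(tb[, "City"])  # "tbl_df" "tbl" "data.frame"
```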

The above operations all return a tibble that is a subset of the whole tibble. You can extract the data for a single column as a vector via [[ ]] or $, e.g.

census[[1]]

 [1] "Paris"  "Paris"  "Paris"  "Paris"  "London" "London" "London" "London"
 [9] "Rome"   "Rome"   "Rome"   "Rome"

census[["City"]]

 [1] "Paris"  "Paris"  "Paris"  "Paris"  "London" "London" "London" "London"
 [9] "Rome"   "Rome"   "Rome"   "Rome"

census$City

 [1] "Paris"  "Paris"  "Paris"  "Paris"  "London" "London" "London" "London"
 [9] "Rome"   "Rome"   "Rome"   "Rome"

and can then extract data from those vectors via sub-indexing, e.g.

census$City[1]

[1] "Paris"

would return the City column data for the first row.
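Within a pipeline, the pull function from dplyr does the same job as $. The sketch below also confirms that what comes back is an ordinary character vector rather than a list; the census tibble is rebuilt so the example is self-contained.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

# pull() extracts one column as a vector, like census$City
cities <- census %>% pull(City)

identical(cities, census$City)  # TRUE
is.character(cities)            # TRUE
is.list(cities)                 # FALSE
cities[1]                       # "Paris"
```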

Querying

We can start to ask questions of our data using the filter function.

census %>% filter(City=="Paris")

# A tibble: 4 × 3
  City   year   pop
  <chr> <int> <dbl>
1 Paris  2001  2.15
2 Paris  2008  2.21
3 Paris  2009  2.23
4 Paris  2010  2.24

(Note that we didn’t need to put double quotes around City in the filter - it knows that this is a column name. The %>% operator passes the tibble on its left to the function on its right as its first argument.)

This has returned a new tibble, which you can then access using the same methods as above, e.g.

(census %>% filter(City=="Paris"))["year"]

# A tibble: 4 × 1
   year
  <int>
1  2001
2  2008
3  2009
4  2010
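filter also accepts more than one condition. Conditions separated by commas must all be true, while | means “or” and %in% tests membership of a set. The following is a small sketch on the same census data, rebuilt here so it runs on its own.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

# Both conditions must hold: Paris rows from 2009 onwards (2 rows)
paris_recent <- census %>% filter(City == "Paris", year >= 2009)
paris_recent

# Either condition may hold: Paris rows, or any row with pop above 8 (6 rows)
paris_or_big <- census %>% filter(City == "Paris" | pop > 8)
paris_or_big

# Membership test: Paris or Rome, in 2001 only (2 rows)
two_cities <- census %>% filter(City %in% c("Paris", "Rome"), year == 2001)
two_cities
```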

You can also test if the rows of a tibble match a condition, e.g.

census["City"] == "Paris"

       City
 [1,]  TRUE
 [2,]  TRUE
 [3,]  TRUE
 [4,]  TRUE
 [5,] FALSE
 [6,] FALSE
 [7,] FALSE
 [8,] FALSE
 [9,] FALSE
[10,] FALSE
[11,] FALSE
[12,] FALSE

returns a one-column logical matrix holding TRUE or FALSE for each row, depending on whether the City value of that row is equal to Paris.
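If you run the same test on the column vector (via $) you get a plain logical vector instead, which can be used directly as a row index. The sketch below shows this; is_paris is a name chosen for this example.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

# One TRUE / FALSE per row, as a logical vector
is_paris <- census$City == "Paris"
is_paris

# TRUE counts as 1, so sum() gives the number of matching rows
sum(is_paris)  # 4

# Use the logical vector as the row index to keep only matching rows
census[is_paris, ]
```

The last line gives the same rows as census %>% filter(City=="Paris").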

Adding new columns

New columns can be added to a tibble simply by assigning to a new column name (as you would add an element to a named list):

census["continental"] <- census["City"] != "London"
census

# A tibble: 12 × 4
   City    year   pop continental
   <chr>  <int> <dbl> <lgl>      
 1 Paris   2001  2.15 TRUE       
 2 Paris   2008  2.21 TRUE       
 3 Paris   2009  2.23 TRUE       
 4 Paris   2010  2.24 TRUE       
 5 London  2001  7.32 FALSE      
 6 London  2006  7.66 FALSE      
 7 London  2011  8.17 FALSE      
 8 London  2015  8.62 FALSE      
 9 Rome    2001  2.55 TRUE       
10 Rome    2006  2.63 TRUE       
11 Rome    2009  2.73 TRUE       
12 Rome    2012  2.63 TRUE
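The tidyverse way of adding a column is the mutate function from dplyr. Unlike the assignment above, it returns a new tibble and leaves the original untouched unless you assign the result back. This is a sketch of that alternative; census2 is a name chosen for this example.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

# mutate() computes the new column from existing ones, row by row
census2 <- census %>% mutate(continental = City != "London")
census2

# The original tibble still has only its three columns
names(census)   # "City" "year" "pop"
names(census2)  # "City" "year" "pop" "continental"
```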

Exercise

Create the tibble containing the census data for the three cities.

Select the data for the year 2001. Which city had the smallest population that year?

Answer

Import the tidyverse and load up the data

library(tidyverse)

census <- tibble("City"=c("Paris", "Paris", "Paris", "Paris",
                          "London", "London", "London", "London",
                          "Rome", "Rome", "Rome", "Rome"),
                 "year"=as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 "pop"=c(2.148, 2.211, 2.234, 2.244,
                         7.322, 7.657, 8.174, 8.615,
                         2.547, 2.627, 2.734, 2.627))

We start by grabbing the data for the year we care about

census %>% filter(year==2001)

# A tibble: 3 × 3
  City    year   pop
  <chr>  <int> <dbl>
1 Paris   2001  2.15
2 London  2001  7.32
3 Rome    2001  2.55

We can see that the smallest population was in Paris that year, but let’s try to extract it using R.

pop <- (census %>% filter(year==2001))$pop
pop

[1] 2.148 7.322 2.547

The min function returns the minimum of a vector of numbers. If we run this on pop then we will get the smallest number.

min_pop <- min(pop)
min_pop

[1] 2.148

We can now use this minimum population to further filter the census data:

census %>% filter(year==2001) %>% filter(pop==min_pop)

# A tibble: 1 × 3
  City   year   pop
  <chr> <int> <dbl>
1 Paris  2001  2.15

Finally(!) we can extract the City column

(census %>% filter(year==2001) %>% filter(pop==min_pop))["City"]

# A tibble: 1 × 1
  City 
  <chr>
1 Paris

All of this could be combined into a single (dense) expression, e.g.

city <- (census %>%
           filter(year==2001) %>%
           filter(pop==min((census %>% filter(year==2001))["pop"]))
         )["City"]
city

# A tibble: 1 × 1
  City 
  <chr>
1 Paris
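For comparison, here is a shorter sketch of the same answer using slice_min and pull from dplyr. These functions are not covered in the text above, so treat this as one possible alternative rather than the expected solution.

```r
library(tidyverse)

census <- tibble(City = rep(c("Paris", "London", "Rome"), each = 4),
                 year = as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 pop  = c(2.148, 2.211, 2.234, 2.244,
                          7.322, 7.657, 8.174, 8.615,
                          2.547, 2.627, 2.734, 2.627))

city <- census %>%
  filter(year == 2001) %>%   # keep only the 2001 rows
  slice_min(pop, n = 1) %>%  # keep the row with the smallest pop
  pull(City)                 # extract the City value as a vector
city  # "Paris"
```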
