Reading from files

One of the most common situations is that you have some data file containing the data you want to read. Perhaps this is data you’ve produced yourself or maybe it’s from a collegue. In an ideal world the file will be perfectly formatted and will be trivial to import into R but since this is so often not the case, R provides a number of features to make your life easier.

A good documentation on reading and writing files is available in R for Data Science (2e) but first it’s worth noting the common formats that R can work with:

Comma-separated files (or tab-separated, space-separated etc)
Excel spreadsheets
HDF5 files
SQL databases

For this course we will focus on plain-text CSV files as they are perhaps the most common format. Imagine we have a CSV (comma-separated values) file. The example we will use today is available at city_pop.csv. Open that file in your browser and you will see;

This is an example CSV file
The text at the top here is not part of the data but instead is here
to describe the file. You'll see this quite often in real-world data.
A -1 signifies a missing value.

year;London;Paris;Rome
2001;7.322;2.148;2.547
2006;7.652;;2.627
2008;-1;2.211;
2009;-1;2.234;2.734
2011;8.174;;
2012;-1;2.244;2.627
2015;8.615;;

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We can use the tidyverse function read_csv to read the file and convert it to a tibble. The function read_csv is part of the readr package that is installed with the tidyverse.

Full documentation for this function can be found in the manual or, as with any R function, directly in the notebook by putting a ? before the name:

?read_csv

The first argument to the function is called file, the documentation for which begins:

Either a path to a file, a connection, or literal data (either a single string or a raw vector).

Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting 
with http://, https://, ftp://, or ftps:// will be automatically downloaded. 
Remote gz files can also be automatically downloaded and decompressed.

This means that we can take our URL and pass it directly (or via a variable) to the function:

city_pop_file <- "https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/city_pop.csv"
read_csv(city_pop_file)

Rows: 11 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): This is an example CSV file

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 11 × 1
   `This is an example CSV file`                                        
   <chr>                                                                
 1 The text at the top here is not part of the data but instead is here 
 2 to describe the file. You'll see this quite often in real-world data.
 3 A -1 signifies a missing value.                                      
 4 year;London;Paris;Rome                                               
 5 2001;7.322;2.148;2.547                                               
 6 2006;7.652;;2.627                                                    
 7 2008;-1;2.211;                                                       
 8 2009;-1;2.234;2.734                                                  
 9 2011;8.174;;                                                         
10 2012;-1;2.244;2.627                                                  
11 2015;8.615;;

We can see that by default it’s done a fairly bad job of parsing the file (this is mostly because I’ve construsted the city_pop.csv file to be as obtuse as possible). It’s making a lot of assumptions about the structure of the file but in general it’s taking quite a naïve approach.

The first thing we notice is that it’s treating the text at the top of the file as though it’s data. Checking the documentation we see that the simplest way to solve this is to use the skip argument to the function to which we give an integer giving the number of rows to skip:

read_csv(
    city_pop_file,
    skip=5
)

Rows: 7 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): year;London;Paris;Rome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 7 × 1
  `year;London;Paris;Rome`
  <chr>                   
1 2001;7.322;2.148;2.547  
2 2006;7.652;;2.627       
3 2008;-1;2.211;          
4 2009;-1;2.234;2.734     
5 2011;8.174;;            
6 2012;-1;2.244;2.627     
7 2015;8.615;;

The next most obvious problem is that it is not separating the columns at all. This is because read_csv is a special case of the more general read_delim that sets the separator (also called the delimiter) delim to a comma ,.

We can set the separator to ; by changing to read_delim and setting delim equal to ;

read_delim(
    city_pop_file,
    skip=5,
    delim=";"
)

Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): year, London, Paris, Rome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 7 × 4
   year London Paris  Rome
  <dbl>  <dbl> <dbl> <dbl>
1  2001   7.32  2.15  2.55
2  2006   7.65 NA     2.63
3  2008  -1     2.21 NA   
4  2009  -1     2.23  2.73
5  2011   8.17 NA    NA   
6  2012  -1     2.24  2.63
7  2015   8.62 NA    NA

Now it’s actually starting to look like a real table of data.

Reading the descriptive header of our data file we see that a value of -1 signifies a missing reading so we should mark those too. This can be done after the fact but it is simplest to do it at import-time using the na argument:

read_delim(    
    city_pop_file,
    skip=5,
    delim=";",
    na="-1"
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): year, London, Paris, Rome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 7 × 4
   year London Paris  Rome
  <dbl>  <dbl> <dbl> <dbl>
1  2001   7.32  2.15  2.55
2  2006   7.65 NA     2.63
3  2008  NA     2.21 NA   
4  2009  NA     2.23  2.73
5  2011   8.17 NA    NA   
6  2012  NA     2.24  2.63
7  2015   8.62 NA    NA

The next issue is that you can see that the year has been read in as a floating point (double) number, rather than as an integer. Each column is read using a parser, that converts the text data in the file into data of the appropriate type. R will guess which parser to use, with this helpfully reported to the R console:

Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): year, London, Paris, Rome

In this case, R has guessed that all of the columns contain floating point numbers, and so it has used the col_double() specification, which calls the parse_double() function to convert the text from those columns from the file into numbers.

The tidyverse supplies many parsers, e.g. parse_integer(), parse_date() etc. More detail about these parsers (including how to parse different date formats, different number formats etc.) can be found in the free online book R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund.

You can set the parser to use for a column by specifying the column types via the col_types argument. We want the year to be an integer, so we can write:

read_delim(    
    city_pop_file,
    skip=5,
    delim=";",
    na="-1",
    col_types=cols("year"=col_integer())
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

# A tibble: 7 × 4
   year London Paris  Rome
  <int>  <dbl> <dbl> <dbl>
1  2001   7.32  2.15  2.55
2  2006   7.65 NA     2.63
3  2008  NA     2.21 NA   
4  2009  NA     2.23  2.73
5  2011   8.17 NA    NA   
6  2012  NA     2.24  2.63
7  2015   8.62 NA    NA

Note that col_guess(), which guesses the right type of data, is used for any columns that you don’t specify.

year    London  Paris   Rome
<int>   <dbl>   <dbl>   <dbl>
2001    7.322   2.148   2.547
2006    7.652   NA      2.627
2008    NA      2.211   NA
2009    NA      2.234   2.734
2011    8.174   NA      NA
2012    NA      2.244   2.627
2015    8.615   NA      NA

Finally, we want to assign this tibble to a variable, called census;

census <- read_delim(
    city_pop_file,
    skip=5,
    delim=";",
    na="-1",
    col_types=cols("year"=col_integer())
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

That covers the basics of reading in data with R. For more information please read the excellent free online book R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund, check out the documentation for readr, or check out the really handy official readr cheat sheet.

Next, now that we have our dataset loaded, we will do something useful with our data and plot it.

Exercise

Read the file cetml1659on.txt into a tibble (this data is originally from the Met Office and there’s a description of the format there too). This contains some historical weather data for a location in the UK. Import that file as a tibble using read_table, making sure that you cover all the possible NA values.

How many years had a negative average temperature in January?

What was the average temperature in June over the years in the data set?

Answer (click to open)

Import the tidyverse

library(tidyverse)

Read in the file. As whitespace is the delimiter, we need to use read_table. Note that read_delim with delim=" " is the wrong choice as it will try to split on single whitespace characters. read_table is the right choice for multiple whitespace separators.

Note that we should read the DATE as an integer, as it is a year.

temperature <- read_table(
    "https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/cetml1659on.txt",
    skip=6,
    na=c("-99.99", "-99.9"),
    col_types=cols("DATE"=col_integer())
)
temperature

# A tibble: 362 × 14
    DATE   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1659     3     4     6     7    11    13    16    16    13    10     5     2
 2  1660     0     4     6     9    11    14    15    16    13    10     6     5
 3  1661     5     5     6     8    11    14    15    15    13    11     8     6
 4  1662     5     6     6     8    11    15    15    15    13    11     6     3
 5  1663     1     1     5     7    10    14    15    15    13    10     7     5
 6  1664     4     5     5     8    11    15    16    16    13     9     6     4
 7  1665     1     1     5     7    10    14    16    15    13     9     6     2
 8  1666     4     5     6     8    11    15    18    17    14    11     6     3
 9  1667     0     4     2     7    10    15    17    16    13     9     6     3
10  1668     5     5     5     8    10    14    16    16    14    10     6     5
# ℹ 352 more rows
# ℹ 1 more variable: YEAR <dbl>

How many years had a negative average temperature in January?

negative_jan <- temperature %>% filter(temperature["JAN"] < 0)

Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
ℹ Please use one dimensional logical vectors instead.

negative_jan

# A tibble: 20 × 14
    DATE   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1684  -3    -1     3     6.5  13    15    16    15.5  12    11     3     4  
 2  1695  -1     0.5   3.5   5.5   9    13    13.5  13    11.5   9     5.5   4  
 3  1709  -1.5   2     3     9    12    14.5  15.5  15.5  13.5  10     7.5   3.5
 4  1716  -2     3     4.5   9    10.5  14    15.5  15.5  12.5   9.5   5.5   3  
 5  1740  -2.8  -1.6   3.9   6.4   8.6  12.8  15.3  14.7  14     5.3   3.3   2.2
 6  1763  -0.8   4.9   5.4   8.9  10.2  14.6  15.3  15.3  13.1   8.3   5.8   6.2
 7  1776  -1.6   3.8   6.4   9.4  10.8  14.1  16.3  15.2  12.9  10.2   6.2   4.4
 8  1780  -0.9   2.1   7.9   6.3  12.8  14.2  16.8  17.6  15.6   9.1   4.4   3.2
 9  1784  -0.6   1.4   2.7   5.7  13.5  13.7  15.2  14    14.8   7.8   5.5   0.3
10  1795  -3.1   0.8   3.9   7.7  10.9  13.2  15.2  16.6  16    11.7   4.5   6.6
11  1814  -2.9   1.4   2.9   9.6   9.2  12.2  16    14.7  12.8   8.1   4.7   4.3
12  1820  -0.3   3.2   4.7   8.9  11.4  13.6  15.7  14.7  12.3   8.1   5.6   4.7
13  1823  -0.1   3.1   5     6.7  12.2  12.3  14.1  14.4  12.5   8.4   7.1   4.8
14  1830  -0.2   2.2   7.7   8.9  12    12.7  16.2  13.7  11.9  10.4   6.9   1.8
15  1838  -1.5   0.4   4.9   6.1  10.5  14.4  15.6  15.1  12.7   9.8   4.6   4  
16  1879  -0.7   3.1   4.7   5.7   8.9  12.9  13.6  14.5  12.6   8.9   4.1   0.7
17  1881  -1.5   3.2   5.3   7.3  11.8  13.7  16.2  13.9  12.7   7.3   8.9   3.9
18  1940  -1.4   2.6   6     8.7  12.5  16.4  15.1  15.6  12.8   9.6   6.9   3.8
19  1963  -2.1  -0.7   6     8.7  10.6  14.9  15.2  14.3  12.9  11.1   8.2   2.6
20  1979  -0.4   1.2   4.7   7.8  10    13.9  16.2  14.9  13.5  11.3   6.8   5.8
# ℹ 1 more variable: YEAR <dbl>

num_negative_jan <- as.numeric(count(negative_jan))
num_negative_jan

[1] 20

What was the average temperature in June over the years in the data set?

jun_average <- mean(temperature[["JUN"]])
jun_average

[1] 14.33757

(note that we have to get the column data, and not a 1-column tibble, hence why we use [[ ]])