One of the most common situations is that you have a file containing the data you want to read. Perhaps you produced it yourself, or perhaps it came from a colleague. In an ideal world the file would be perfectly formatted and trivial to import into R, but since this is so often not the case, R provides a number of features to make your life easier.
Good documentation on reading and writing files is available in R for Data Science (2e). R can work with a number of common file formats.
For this course we will focus on plain-text CSV (comma-separated values) files, as they are perhaps the most common format. The example we will use today is available at city_pop.csv. Open that file in your browser and you will see:
This is an example CSV file
The text at the top here is not part of the data but instead is here
to describe the file. You'll see this quite often in real-world data.
A -1 signifies a missing value.
year;London;Paris;Rome
2001;7.322;2.148;2.547
2006;7.652;;2.627
2008;-1;2.211;
2009;-1;2.234;2.734
2011;8.174;;
2012;-1;2.244;2.627
2015;8.615;;
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We can use the tidyverse function read_csv to read the file and convert it to a tibble. The function read_csv is part of the readr package that is installed with the tidyverse.
Full documentation for this function can be found in the manual or, as with any R function, directly in the notebook by putting a ? before the name:
?read_csv
The first argument to the function is called file, the documentation for which begins:
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting
with http://, https://, ftp://, or ftps:// will be automatically downloaded.
Remote gz files can also be automatically downloaded and decompressed.
This means that we can take our URL and pass it to the function, either directly or via a variable (called city_pop_file in the calls that follow):
read_csv(city_pop_file)
Rows: 11 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): This is an example CSV file
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 11 × 1
`This is an example CSV file`
<chr>
1 The text at the top here is not part of the data but instead is here
2 to describe the file. You'll see this quite often in real-world data.
3 A -1 signifies a missing value.
4 year;London;Paris;Rome
5 2001;7.322;2.148;2.547
6 2006;7.652;;2.627
7 2008;-1;2.211;
8 2009;-1;2.234;2.734
9 2011;8.174;;
10 2012;-1;2.244;2.627
11 2015;8.615;;
We can see that by default it’s done a fairly bad job of parsing the file (this is mostly because I’ve constructed the city_pop.csv file to be as obtuse as possible). It’s making a lot of assumptions about the structure of the file, but in general it’s taking quite a naïve approach.
The first thing we notice is that it’s treating the text at the top of the file as though it’s data. Checking the documentation, we see that the simplest fix is the skip argument, which takes an integer giving the number of lines to skip before reading starts:
read_csv(city_pop_file, skip = 5)
Rows: 7 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): year;London;Paris;Rome
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
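One way to work out how many lines to skip is to look at the raw text before parsing it. The sketch below is only an illustration: it writes a shortened, hypothetical copy of the file to a temporary location (the name example_file is made up here), prints the first few lines with read_lines, and derives the skip value from the position of the header row. This copy has four description lines and no blank line, so it ends up with skip = 4 rather than the skip = 5 used for the course file.

```r
library(readr)

# Hypothetical, shortened local copy of a file shaped like city_pop.csv.
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  "This is an example CSV file",
  "The text at the top here is not part of the data but instead is here",
  "to describe the file. You'll see this quite often in real-world data.",
  "A -1 signifies a missing value.",
  "year;London;Paris;Rome",
  "2001;7.322;2.148;2.547",
  "2006;7.652;;2.627"
), example_file)

# Look at the raw text before parsing anything.
first_lines <- read_lines(example_file, n_max = 6)
print(first_lines)

# The header row is the first line starting with "year";
# every line before it should be skipped.
header_line <- which(startsWith(first_lines, "year"))[1]
lines_to_skip <- header_line - 1

# With the description skipped, the header row becomes the (single) column name.
peek <- read_csv(example_file, skip = lines_to_skip, show_col_types = FALSE)
names(peek)
```

Deriving the value from the file, rather than hard-coding it, is a design choice that only pays off when the header row is easy to recognise; for a one-off import, counting the lines by eye is just as good.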
The next most obvious problem is that it is not separating the columns at all. This is because read_csv is a special case of the more general read_delim, with the separator (also called the delimiter, and set by the delim argument) fixed to a comma.
We can set the separator to ; by switching to read_delim and setting delim equal to ";":
read_delim(city_pop_file, skip = 5, delim = ";")
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): year, London, Paris, Rome
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
year London Paris Rome
<dbl> <dbl> <dbl> <dbl>
1 2001 7.32 2.15 2.55
2 2006 7.65 NA 2.63
3 2008 -1 2.21 NA
4 2009 -1 2.23 2.73
5 2011 8.17 NA NA
6 2012 -1 2.24 2.63
7 2015 8.62 NA NA
Now it’s actually starting to look like a real table of data.
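To experiment with the delimiter without downloading anything, the same idea can be tried on literal text wrapped in I(). This is a minimal sketch under stated assumptions: the two small strings below are made-up stand-ins, not the course file. It reads semicolon-separated text with read_delim, then checks that read_csv and read_delim with delim = "," agree on comma-separated text.

```r
library(readr)

# Made-up literal data; I() tells readr this is content, not a file path.
semicolon_text <- I("year;London;Paris;Rome\n2001;7.322;2.148;2.547\n2006;7.652;;2.627\n")
comma_text <- I("year,London\n2001,7.322\n2006,7.652\n")

# Semicolon-separated text needs an explicit delimiter.
semi <- read_delim(semicolon_text, delim = ";", show_col_types = FALSE)
print(semi)

# For comma-separated text, read_csv and read_delim(delim = ",")
# should produce the same columns.
via_csv <- read_csv(comma_text, show_col_types = FALSE)
via_delim <- read_delim(comma_text, delim = ",", show_col_types = FALSE)
identical(via_csv$London, via_delim$London)
```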
Reading the descriptive header of our data file, we see that a value of -1 signifies a missing reading, so we should mark those as missing too. This can be done after the fact, but it is simplest to do it at import time using the na argument:
read_delim(city_pop_file, skip = 5, delim = ";", na = "-1")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 7 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): year, London, Paris, Rome
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 4
year London Paris Rome
<dbl> <dbl> <dbl> <dbl>
1 2001 7.32 2.15 2.55
2 2006 7.65 NA 2.63
3 2008 NA 2.21 NA
4 2009 NA 2.23 2.73
5 2011 8.17 NA NA
6 2012 NA 2.24 2.63
7 2015 8.62 NA NA
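A possible explanation for the parsing warning above, offered as an assumption rather than something the source confirms, is that passing na = "-1" replaces readr's default of c("", "NA"), so empty fields are no longer listed as missing values. The sketch below uses made-up literal data to compare two approaches: listing both "" and "-1" at import time, and reading with the defaults and then replacing -1 afterwards with dplyr's na_if.

```r
library(readr)
library(dplyr)

# Made-up literal data containing both empty fields and -1 markers.
city_pop_text <- I(paste(
  "year;London;Paris;Rome",
  "2001;7.322;2.148;2.547",
  "2006;7.652;;2.627",
  "2008;-1;2.211;",
  "2009;-1;2.234;2.734",
  sep = "\n"
))

# Option 1: treat both empty fields and -1 as missing at import time.
at_import <- read_delim(city_pop_text, delim = ";",
                        na = c("", "-1"), show_col_types = FALSE)

# Option 2: read with the default na values, then replace -1 afterwards.
defaults <- read_delim(city_pop_text, delim = ";", show_col_types = FALSE)
after_fact <- mutate(defaults, across(c(London, Paris, Rome), ~ na_if(.x, -1)))

# No parsing problems are recorded for option 1, and both options agree.
nrow(problems(at_import))
identical(at_import$London, after_fact$London)
```

Doing it at import time keeps the cleaning in one place; the after-the-fact route is useful when the marker is only a missing value in some columns.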
The next issue is that the year has been read in as a floating-point (double) number rather than as an integer. Each column is read using a parser that converts the text in the file into data of the appropriate type. readr guesses which parser to use and helpfully reports its choice in the column specification printed to the R console:
dbl (4): year, London, Paris, Rome
In this case, readr has guessed that all of the columns contain floating-point numbers, so it has used the col_double() specification, which calls the parse_double() function to convert the text in those columns into numbers.
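The parsers can also be called directly on character vectors, which is a handy way to see what each one does. The following is a small sketch using a few values copied from the table above; guess_parser reports which parser readr would pick for a given vector.

```r
library(readr)

# Convert text to numbers with an explicit parser.
populations <- parse_double(c("7.322", "2.148"))
years <- parse_integer(c("2001", "2006"))

class(populations)
class(years)

# Ask readr which parser it would guess for whole-number text.
year_guess <- guess_parser(c("2001", "2006"))
year_guess
```

By default the guess for whole numbers is a double rather than an integer, which is consistent with the year column being reported as dbl.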
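You can set the parser to use for a column by specifying the column types via the col_types argument. We want the year to be an integer, so we can write:
read_delim(city_pop_file, skip = 5, delim = ";", na = "-1", col_types = cols("year" = col_integer()))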
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
# A tibble: 7 × 4
year London Paris Rome
<int> <dbl> <dbl> <dbl>
1 2001 7.32 2.15 2.55
2 2006 7.65 NA 2.63
3 2008 NA 2.21 NA
4 2009 NA 2.23 2.73
5 2011 8.17 NA NA
6 2012 NA 2.24 2.63
7 2015 8.62 NA NA
Note that col_guess(), which guesses the right type of data, is used for any columns that you don’t specify.
year London Paris Rome
<int> <dbl> <dbl> <dbl>
2001 7.322 2.148 2.547
2006 7.652 NA 2.627
2008 NA 2.211 NA
2009 NA 2.234 2.734
2011 8.174 NA NA
2012 NA 2.244 2.627
2015 8.615 NA NA
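There is more than one way to write a column specification. The sketch below, again on made-up literal data, shows three forms that should give the same column types: naming only the year column, naming it and spelling out the col_guess() default via .default, and the compact string form with one letter per column ("i" for integer, "d" for double).

```r
library(readr)

# Made-up literal data with the same column layout as the course file.
city_pop_text <- I("year;London;Paris;Rome\n2001;7.322;2.148;2.547\n2006;7.652;;2.627\n")

# Name one column and leave the rest to be guessed.
guessed <- read_delim(city_pop_text, delim = ";",
                      col_types = cols(year = col_integer()))

# The same thing with the default written out explicitly.
explicit <- read_delim(city_pop_text, delim = ";",
                       col_types = cols(year = col_integer(),
                                        .default = col_guess()))

# Compact string form: one letter per column, in order.
compact <- read_delim(city_pop_text, delim = ";", col_types = "iddd")

# Compare the resulting column classes.
sapply(guessed, class)
sapply(compact, class)
```

The compact form is short but depends on column order, so the named form is usually safer when the file layout might change.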
Finally, we want to assign this tibble to a variable called census:
census <- read_delim(city_pop_file, skip = 5, delim = ";", na = "-1", col_types = cols("year" = col_integer()))
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Now that we have our dataset loaded, the next step is to do something useful with it and plot it.
Exercise
Read the file cetml1659on.txt into a tibble (this data is originally from the Met Office and there’s a description of the format there too). This contains some historical weather data for a location in the UK. Import that file as a tibble using read_table, making sure that you cover all the possible NA values.
How many years had a negative average temperature in January?
What was the average temperature in June over the years in the data set?
Answer
Import the tidyverse
library(tidyverse)
Read in the file. As whitespace is the delimiter, we need to use read_table. Note that read_delim with delim=" " is the wrong choice as it will try to split on single whitespace characters. read_table is the right choice for multiple whitespace separators.
Note that we should read the DATE as an integer, as it is a year.
temperature <- read_table("https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/cetml1659on.txt", skip = 6, na = c("-99.99", "-99.9"), col_types = cols("DATE" = col_integer()))
temperature
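Once the tibble is loaded, the two questions can be answered with a single summarise call. The sketch below is hedged: it assumes the monthly columns are named JAN and JUN (only DATE is confirmed above), and it uses a tiny made-up tibble with invented values, not real Met Office readings, so that it runs on its own. With the real data, the tribble definition would be dropped and the temperature tibble from the import would be used instead.

```r
library(dplyr)
library(tibble)

# Made-up toy values, NOT real readings. JAN and JUN are assumed column names.
temperature <- tribble(
  ~DATE, ~JAN, ~JUN,
  1901L, -3.0, 14.5,
  1902L,  2.5, 15.0,
  1903L, -2.8, 13.5,
  1904L,  1.0,   NA
)

answers <- summarise(
  temperature,
  # Number of years whose January average was below zero.
  cold_januaries = sum(JAN < 0, na.rm = TRUE),
  # Mean June temperature, ignoring missing values.
  mean_june = mean(JUN, na.rm = TRUE)
)
answers
```

Setting na.rm = TRUE matters here: without it, a single missing June value would make the whole mean NA.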