The tidyverse is founded on the philosophy of tidy data. This philosophy says that, if you format your data in a “tidy manner”, then this simplifies all later data manipulation and visualisation. The visualisation and data analysis functions in the tidyverse are thus built on the assumption that the data they process is “tidy”.
Let’s look at how this applies to the Met Office data set we saw in the last section.
Each row in this data set corresponds to observations that are collected throughout the year;
DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC YEAR
1659 3.0 4.0 6.0 7.0 11.0 13.0 16.0 16.0 13.0 10.0 5.0 2.0 8.87
1660 0.0 4.0 6.0 9.0 11.0 14.0 15.0 16.0 13.0 10.0 6.0 5.0 9.10
1661 5.0 5.0 6.0 8.0 11.0 14.0 15.0 15.0 13.0 11.0 8.0 6.0 9.78
So, is this data tidy? Let’s look at the rules;
No, the month name is a variable that specifies in which month the observation was taken. It should be in its own column and not used as a column header,
No, a row contains 12 observations across a year, plus an average of the rows
No, the value of the month variable, e.g. for January 1659 is a column header and does not have its own cell. And the YEAR column is not a value that matches any variable, as it refers to an average of values that is a derived from the observations of the other cells.
To tidy the data we need to perform a series of steps. These will use functions from the tidyversedplyr and tidyr packages. These provide functions for tidying data, plus a grammar for data manipulation. Together, they will help you transform untidy data into tidy data.
select
The select function makes it easier to select (or deselect) one or more columns (variables) from a tibble. The general format is:
data %>%select(column_name) # selects a single columndata %>%select(c(column1, column2)) # selects multiple columndata %>%select(-column_name) # selects all columns except a single columndata %>%select(-c(column1, column2)) # selects except multiple columns
In our case, the YEAR column contains a derived value (the mean average) and should be removed as it is not a variable of an observation. We can do this using;
Next, we need to pivot (rotate) the table so that the observation for each month is on its own row in the table. We do this using the pivot_longer function. You need to pass to this function the names of the columns that contain observations. These columns will become rows, with the value going into a new column named according to values_to, and the old column name moving into a new column named according to names_to, e.g.
would move the values of the “JAN”, “FEB” and “MAR” columns into a new column called “temperature”, and the actual month names will be moved into a new column called “month”.
# A tibble: 4,344 × 3
DATE month temperature
<int> <chr> <dbl>
1 1659 JAN 3
2 1659 FEB 4
3 1659 MAR 6
4 1659 APR 7
5 1659 MAY 11
6 1659 JUN 13
7 1659 JUL 16
8 1659 AUG 16
9 1659 SEP 13
10 1659 OCT 10
# ℹ 4,334 more rows
Renaming columns
Our data is now “tidy”. Each variable (DATE, month, temperature) has its own column, each observation has its own row, and each value has its own cell.
However, “DATE” is a poor choice of name for the “year” variable. We can rename a column using the rename function. The general format is;
# A tibble: 4,344 × 3
year month temperature
<int> <chr> <dbl>
1 1659 JAN 3
2 1659 FEB 4
3 1659 MAR 6
4 1659 APR 7
5 1659 MAY 11
6 1659 JUN 13
7 1659 JUL 16
8 1659 AUG 16
9 1659 SEP 13
10 1659 OCT 10
# ℹ 4,334 more rows
Factors and mutate
The final problem to solve is that the value of the variable “month” for each observation is just a character string. As it is, R does not know that this is a categorical variable that can only take one of a number of valid values (e.g. in this case, only months of the year).
If there were any typos (e.g. “JULY” instead of “JUL”) then we would not see that error. Also, if we sorted by month then they would sort alphabetically (APR before JAN), instead of in the correct month order.
To solve this, we need to convert the month into a factor. A factor is used to represent categorical data. A factor must have a value that corresponds to one of its levels. For example, lets now create the levels for a factor to represent months of the year;
This list of strings sets the valid values of the factor, and the order in which they should be arranged.
To factor data, we use the factor function which is included in base R. The general syntax is;
factor(data, levels)
This would return the passed data factored into one of the levels specified in levels.
In our case, we will need to perform
factor(data, month_levels)
one each of the values of the month variable for each observation.
To do this, we need to edit our data to translate each month string into a month factor. Editing of a tibble is performed by the mutate function, e.g.
mutate(temperature=temperature+5)
would edit the “temperature” column variable, and add “5” to every single value in this column. In our case, we want to factor each value of the month column variable, so we need
mutate(month=factor(month, month_levels))
Putting this together with what we have before, and assigning the tibble to a new variable called historical_temperature, we now get;