Data analysis

Up to this point, we have been learning how to read data and make it tidy. This was a lot of work. The pay-off is that the analysis of the data will now be much easier and more straightforward.

This is intentional. Data cleaning is the messiest and most ambiguous part of data science. A truism is that data cleaning takes 80% of the time of any data science project. However, without this effort, data analysis and visualisation would be similarly messy and time-consuming. By cleaning first, we can perform data analysis and visualisation using clean, consistent, well-tested and tidy tools.

Analysis via summarise

You can perform summary analysis on data in a tibble using the summarise function from the dplyr package.

summarise will create a new tibble that is a summary of the input tibble, based on grouping and a summarising function. Summarising functions include mean, median, min, max and sd, among others.

For example, we can calculate the mean average temperature using:

historical_temperature %>% 
    summarise("average temperature"=mean(temperature, na.rm=TRUE))
# A tibble: 1 × 1
  `average temperature`
                  <dbl>
1                  9.25

Note that we used na.rm=TRUE to tell the function to ignore NA values.

This has created a new tibble, where the column called “average temperature” contains the mean average temperature.
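
A single call to summarise can also create several summary columns at once. The sketch below is a minimal illustration on a small, hypothetical tibble called toy (not the real temperature record); the column names mean_temperature, min_temperature, max_temperature and n_missing are chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data standing in for historical_temperature
toy <- tibble(
    year = c(2000, 2000, 2001, 2001),
    temperature = c(4.0, 6.0, NA, 10.0)
)

# Each named argument becomes one column of the one-row summary tibble
toy_summary <- toy %>%
    summarise(
        mean_temperature = mean(temperature, na.rm = TRUE),
        min_temperature = min(temperature, na.rm = TRUE),
        max_temperature = max(temperature, na.rm = TRUE),
        n_missing = sum(is.na(temperature))  # count the NA values we skipped
    )

print(toy_summary)
```

Counting the missing values alongside the other summaries is a cheap way to check how much data na.rm=TRUE has silently skipped.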

Grouping data

Each row of tidy data corresponds to a single observation. We can collect observations into groups using group_by. We can then feed these groups into summaries.

For example, we can group by year and summarise by the mean function to calculate the average temperature for each year:

historical_temperature %>% 
    group_by(year) %>%
    summarise("average temperature"=mean(temperature, na.rm=TRUE))
# A tibble: 362 × 2
    year `average temperature`
   <int>                 <dbl>
 1  1659                  8.83
 2  1660                  9.08
 3  1661                  9.75
 4  1662                  9.5 
 5  1663                  8.58
 6  1664                  9.33
 7  1665                  8.25
 8  1666                  9.83
 9  1667                  8.5 
10  1668                  9.5 
# ℹ 352 more rows

Or we could calculate the average temperature for each month via:

historical_temperature %>%
    group_by(month) %>%
    summarise("average temperature"=mean(temperature, na.rm=TRUE))
# A tibble: 12 × 2
   month `average temperature`
   <fct>                 <dbl>
 1 JAN                    3.28
 2 FEB                    3.89
 3 MAR                    5.35
 4 APR                    7.95
 5 MAY                   11.2 
 6 JUN                   14.3 
 7 JUL                   16.0 
 8 AUG                   15.6 
 9 SEP                   13.3 
10 OCT                    9.73
11 NOV                    6.08
12 DEC                    4.12
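
group_by is not limited to existing columns: it also accepts an expression that computes the grouping value. As a hedged sketch, the example below groups a small, hypothetical tibble (toy) by decade, where the decade column is an assumption introduced here and is derived from year using integer division.

```r
library(dplyr)

# Hypothetical toy data, not the real temperature record
toy <- tibble(
    year = c(1659, 1660, 1661, 1670, 1671),
    temperature = c(8.0, 9.0, 10.0, 9.0, 11.0)
)

# Compute the decade on the fly and use it as the grouping variable
by_decade <- toy %>%
    group_by(decade = (year %/% 10) * 10) %>%
    summarise(mean_temperature = mean(temperature, na.rm = TRUE))

print(by_decade)
```

The same pattern could be applied to historical_temperature to look at longer-term averages than a single year.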

Filtering data

We can also use the filter function, also from dplyr, to select observations (rows) before we group. For example, we could keep only the years in the 18th Century (year<1800 & year>=1700) and calculate the average monthly temperatures for that period via:

historical_temperature %>%
    filter(year<1800 & year>=1700) %>%
    group_by(month) %>%
    summarise("18th Century"=mean(temperature, na.rm=TRUE))
# A tibble: 12 × 2
   month `18th Century`
   <fct>          <dbl>
 1 JAN             2.89
 2 FEB             3.80
 3 MAR             5.04
 4 APR             7.88
 5 MAY            11.3 
 6 JUN            14.5 
 7 JUL            16.0 
 8 AUG            15.8 
 9 SEP            13.5 
10 OCT             9.40
11 NOV             5.84
12 DEC             3.89

We could then repeat this for the 21st Century…

historical_temperature %>%
    filter(year>=2000) %>%
    group_by(month) %>%
    summarise("21st Century"=mean(temperature, na.rm=TRUE))
# A tibble: 12 × 2
   month `21st Century`
   <fct>          <dbl>
 1 JAN             4.73
 2 FEB             4.93
 3 MAR             6.60
 4 APR             9.08
 5 MAY            12.0 
 6 JUN            14.9 
 7 JUL            16.8 
 8 AUG            16.4 
 9 SEP            14.3 
10 OCT            11.2 
11 NOV             7.44
12 DEC             5.12
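
There is more than one way to write a range condition in filter. The sketch below, on a hypothetical toy tibble, shows three spellings that should select the same rows: combining conditions with &, passing them as separate arguments (which filter combines with "and"), and using dplyr's between, which includes both end points.

```r
library(dplyr)

# Hypothetical toy data covering the edges of the 18th Century
toy <- tibble(
    year = c(1699, 1700, 1750, 1799, 1800),
    temperature = c(9.0, 8.5, 9.5, 10.0, 9.2)
)

# Conditions joined explicitly with &
with_and <- toy %>% filter(year < 1800 & year >= 1700)

# Comma-separated conditions are combined with "and"
with_comma <- toy %>% filter(year >= 1700, year < 1800)

# between() is inclusive at both ends, so the upper bound is 1799
with_between <- toy %>% filter(between(year, 1700, 1799))

print(with_and)
```

Which form to use is a matter of taste; the important detail is that between includes its upper bound, so the limit has to be 1799 rather than 1800 here.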

Exercise

Use filter, group_by and summarise to create tibbles that contain the average monthly temperatures for the 17th and 21st Centuries. Take the difference of these to calculate the change in average temperature for each month.

Next calculate the minimum and maximum monthly temperatures for the 17th and 21st Centuries. Again, calculate the change in minimum and maximum temperatures for each month.

Finally, what is the average increase in monthly temperatures between the 17th and 21st Centuries?

library(tidyverse)

Load the data…

temperature <- read_table(
    "https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/cetml1659on.txt",
    skip=6,
    na=c("-99.99", "-99.9"),
    col_types=cols("DATE"=col_integer())
)

Create the month levels

month_levels <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

Tidy the data…

historical_temperature <- temperature %>%
    select(-YEAR) %>%
    pivot_longer(all_of(month_levels),
                 names_to="month",
                 values_to="temperature") %>%
    rename(year=DATE) %>%
    mutate(month=factor(month, month_levels))

Calculate the mean monthly temperatures in the 17th Century

c17th <- historical_temperature %>%
     filter(year<1700 & year>=1600) %>%
     group_by(month) %>%
     summarise("temperature"=mean(temperature, na.rm=TRUE), .groups="drop")

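(The .groups="drop" argument makes summarise return an ungrouped tibble and suppresses the informational message about grouping that some newer versions of dplyr print. The argument is marked experimental, e.g. see this stackoverflow post.)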
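
The effect of .groups is easiest to see with two grouping variables. The following is a minimal sketch on a hypothetical toy tibble: by default summarise removes only the last grouping variable, whereas .groups="drop" removes all of them. group_vars, from dplyr, reports which grouping variables remain.

```r
library(dplyr)

# Hypothetical toy data with two candidate grouping columns
toy <- tibble(
    year = c(2000, 2000, 2001, 2001),
    month = c("JAN", "FEB", "JAN", "FEB"),
    temperature = c(4.0, 5.0, 6.0, 7.0)
)

# Default behaviour: the result is still grouped by year
# (dplyr may print a message saying so)
still_grouped <- toy %>%
    group_by(year, month) %>%
    summarise(temperature = mean(temperature))

# With .groups = "drop" the result has no grouping left
ungrouped <- toy %>%
    group_by(year, month) %>%
    summarise(temperature = mean(temperature), .groups = "drop")

print(group_vars(still_grouped))
print(group_vars(ungrouped))
```

Dropping the groups explicitly avoids surprises later, because any further summarise or mutate on a grouped tibble would still be applied per group.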

Calculate the mean monthly temperatures in the 21st Century

c21st <- historical_temperature %>%
     filter(year>=2000) %>%
     group_by(month) %>%
     summarise("temperature"=mean(temperature, na.rm=TRUE), .groups="drop")

Now add the difference to the c21st table and print it out

c21st["change"] <- c21st["temperature"] - c17th["temperature"]
c21st
# A tibble: 12 × 3
   month temperature change
   <fct>       <dbl>  <dbl>
 1 JAN          4.73  2.16 
 2 FEB          4.93  2.00 
 3 MAR          6.60  1.96 
 4 APR          9.08  1.80 
 5 MAY         12.0   1.33 
 6 JUN         14.9   0.818
 7 JUL         16.8   1.12 
 8 AUG         16.4   1.26 
 9 SEP         14.3   1.71 
10 OCT         11.2   1.91 
11 NOV          7.44  1.91 
12 DEC          5.12  1.76 

From this we can see that the warming is largest in the winter months (about 2 °C in January and February) and smallest in summer (under 1 °C in June).
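
Subtracting one column from another relies on both tables listing the months in the same row order. That holds here because both were grouped by the same factor, but a join makes the alignment explicit. The sketch below is one possible alternative, not the method used in this answer; early and late are hypothetical stand-ins for c17th and c21st, and the suffixes are names chosen for the example.

```r
library(dplyr)

# Hypothetical stand-ins for the c17th and c21st tables,
# deliberately listed in different row orders
early <- tibble(month = c("JAN", "FEB", "MAR"),
                temperature = c(2.5, 3.0, 4.5))
late <- tibble(month = c("MAR", "JAN", "FEB"),
               temperature = c(6.5, 4.0, 5.0))

# Match rows on month, then subtract the matched values
change <- late %>%
    inner_join(early, by = "month", suffix = c("_late", "_early")) %>%
    mutate(change = temperature_late - temperature_early) %>%
    select(month, temperature = temperature_late, change)

print(change)
```

Because the rows are matched on month, the result is correct even though the two toy tables are ordered differently.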

We will now repeat this for the maximum and minimum temperatures…

c17th_max <- historical_temperature %>%
     filter(year<1700 & year>=1600) %>%
     group_by(month) %>%
     summarise("temperature"=max(temperature, na.rm=TRUE), .groups="drop")
c21st_max <- historical_temperature %>%
     filter(year>=2000) %>%
     group_by(month) %>%
     summarise("temperature"=max(temperature, na.rm=TRUE), .groups="drop")
c21st_max["change"] <- c21st_max["temperature"] - c17th_max["temperature"]

c21st_max
# A tibble: 12 × 3
   month temperature change
   <fct>       <dbl>  <dbl>
 1 JAN           7    0.5  
 2 FEB           7    1    
 3 MAR           8.7  1.7  
 4 APR          11.8  2.3  
 5 MAY          13.4  0.400
 6 JUN          16.1 -1.90 
 7 JUL          19.7  1.7  
 8 AUG          18.3  1.3  
 9 SEP          16.8  1.8  
10 OCT          13.3  1.8  
11 NOV           9.6  1.6  
12 DEC           9.7  3.2  

c17th_min <- historical_temperature %>%
     filter(year<1700 & year>=1600) %>%
     group_by(month) %>%
     summarise("temperature"=min(temperature, na.rm=TRUE), .groups="drop")
c21st_min <- historical_temperature %>%
     filter(year>=2000) %>%
     group_by(month) %>%
     summarise("temperature"=min(temperature, na.rm=TRUE), .groups="drop")
c21st_min["change"] <- c21st_min["temperature"] - c17th_min["temperature"]

c21st_min
# A tibble: 12 × 3
   month temperature change
   <fct>       <dbl>  <dbl>
 1 JAN           1.4    4.4
 2 FEB           2.8    3.8
 3 MAR           2.7    1.7
 4 APR           7.2    1.7
 5 MAY          10.4    1.9
 6 JUN          13.5    2  
 7 JUL          15.2    1.7
 8 AUG          14.9    1.9
 9 SEP          12.6    2.1
10 OCT           9.2    2.7
11 NOV           5.2    2.2
12 DEC          -0.7   -0.2
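
The mean, minimum and maximum pipelines above differ only in the period and the summarising function, so the repetition could be folded into a small function. The sketch below is one way to do that under stated assumptions: monthly_stats is a hypothetical helper (not part of dplyr), it expects columns named year, month and temperature, and it is demonstrated on a toy tibble rather than the real data.

```r
library(dplyr)

# Hypothetical helper: monthly mean, min and max for from <= year < to
monthly_stats <- function(data, from, to) {
    data %>%
        filter(year >= from, year < to) %>%
        group_by(month) %>%
        summarise(
            mean_temperature = mean(temperature, na.rm = TRUE),
            min_temperature = min(temperature, na.rm = TRUE),
            max_temperature = max(temperature, na.rm = TRUE),
            .groups = "drop"
        )
}

# Hypothetical toy data with the same column names as historical_temperature
toy <- tibble(
    year = c(1690, 1691, 1690, 1691, 2000, 2001),
    month = c("JAN", "JAN", "FEB", "FEB", "JAN", "FEB"),
    temperature = c(1.0, 3.0, 2.0, NA, 5.0, 6.0)
)

stats_17th <- monthly_stats(toy, 1600, 1700)
print(stats_17th)
```

With the real data, a call such as monthly_stats(historical_temperature, 1600, 1700) would be expected to give all three 17th Century statistics in one table.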

Finally, we can get the average increase in mean monthly temperatures by calculating the mean of the change column in c21st:

mean(c21st[["change"]])
[1] 1.643698

Because we were working with tidy data, filtering and grouping the observations and then generating summary statistics was straightforward. This grammar (data is filtered, then grouped, then summarised) worked because the data was tidy. As we will see in the next section, a similar grammar for visualisation makes graph drawing equally logical.