Plotting

Plotting data

Plotting in modern R is usually done with ggplot2, which is installed as part of the tidyverse. There is an excellent cheat sheet and a free online book, ggplot2: Elegant Graphics for Data Analysis, by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen.

We cannot cover all of ggplot2 in this workshop, so you are strongly encouraged to read the book and cheat sheet. For today, we will give a quick overview of how to create some simple graphs.

First, we load the tidyverse as before:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next we will load some data to plot. We will use the climate data from the previous section (available from cetml1659on.txt).

Plotting works best with tidy data, so we will load and tidy the data as in the previous section:

temperature <- read_table(
    "https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/cetml1659on.txt",
    skip=6,
    na=c("-99.99", "-99.9"),
    col_types=cols("DATE"=col_integer())
)

month_levels <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

historical_temperature <- temperature %>%
    select(-YEAR) %>%
    pivot_longer(c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                   "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"),
                 names_to="month",
                 values_to="temperature") %>%
    rename(year=DATE) %>%
    mutate(month=factor(month, month_levels))
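As a reminder of what the tidying step does, here is a minimal sketch on a toy tibble. The name wide and its values are made up for illustration; only two month columns are used to keep the output short.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per year, one column per month
wide <- tibble(
    year = c(2000L, 2001L),
    JAN = c(4.9, 3.2),
    FEB = c(6.3, 4.4)
)

# Tidy form: one row per (year, month) observation
long <- wide %>%
    pivot_longer(c("JAN", "FEB"),
                 names_to = "month",
                 values_to = "temperature")

print(long)
```

Each row of long now holds a single observation, which is the shape ggplot expects.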

Next, we will use ggplot to draw a graph (we will explain how this works after drawing). You should see a graph similar to this:

ggplot(historical_temperature, aes(x = year, y = temperature)) + geom_point()
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

This command has drawn a scatter plot of the data contained in the tibble historical_temperature, putting the year column on the x-axis, and the temperature column on the y-axis.

ggplot2 follows a specific “grammar of graphics”. There are three key components:

  1. the data,
  2. a set of aesthetic mappings between variables in the data and visual properties of the graph, and
  3. one or more layers that describe how to render each observation.

These are entered via the general form

ggplot( data, aesthetic ) + layer1 + layer2 + layer3...

The data is a tibble containing tidy data with one observation per row, and one variable per column.

The aesthetic is a mapping specified via the aes function, which maps the variables to axes, colours or other graphical properties.

The layers (layer1, layer2 etc) are specific renderings of the data, e.g. geom_point() will draw points (scatter plot), geom_line() will draw lines (line graph) etc.
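The general form can be sketched on a small example. The tibble toy and its values are hypothetical, used here only so the snippet runs without downloading the climate data.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data: one observation per row, one variable per column
toy <- tibble(
    year = c(2000, 2001, 2002, 2003),
    temperature = c(9.5, 10.1, 9.8, 10.4)
)

# Data plus aesthetic mapping, with no layers yet (this draws empty axes)
base <- ggplot(toy, aes(x = year, y = temperature))

# Add one layer for a scatter plot...
scatter <- base + geom_point()

# ...or two layers to draw points joined by a line
both <- base + geom_point() + geom_line()

print(both)
```

Because the data and aesthetic are stored in base, the same starting point can be reused with different layers.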

Plotting the analysis

You can combine analysis with plotting, e.g. here we plot the average yearly temperature as a line graph.

ggplot(historical_temperature %>%
           group_by(year) %>%
           summarise(temperature=mean(temperature)), 
       aes(x = year, y = temperature)) +
    geom_line() + 
    geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

Here we’ve added two layers: geom_line() to draw a line graph, and geom_smooth() to add a smoothed trend line with a shaded confidence band.
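The "Removed 1 row" warnings most likely arise because mean() returns NA for any year that has a missing monthly value. The sketch below shows the effect on made-up data, along with the na.rm = TRUE argument that avoids it; the tibble toy is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one missing reading in 2001
toy <- tibble(
    year = c(2000, 2000, 2001, 2001),
    temperature = c(4.0, 6.0, 5.0, NA)
)

# Without na.rm, a single missing value makes the whole year's mean NA
with_na <- toy %>%
    group_by(year) %>%
    summarise(temperature = mean(temperature))

# With na.rm = TRUE, the mean is taken over the values that are present
without_na <- toy %>%
    group_by(year) %>%
    summarise(temperature = mean(temperature, na.rm = TRUE))

print(with_na)
print(without_na)
```

Whether to drop missing values is a judgement call: an average over an incomplete year is not directly comparable with one over a full year.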

We could do more. For example, here we add a new variable (column) to the data that is the decade in which the observation was taken:

historical_temperature["decade"] <- (historical_temperature["year"] %/% 10) * 10

(%/% means “integer division”, so 1655 %/% 10 equals 165, which becomes 1650 when multiplied by 10.)
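The same arithmetic can be checked directly at the console. The sketch below also shows an alternative way of adding the column with dplyr's mutate(), which some may find easier to read; the tibble toy is hypothetical.

```r
library(dplyr)
library(tibble)

# Integer division drops the remainder
1655 %/% 10          # 165
(1655 %/% 10) * 10   # 1650

# Hypothetical toy data
toy <- tibble(year = c(1659L, 1660L, 1701L, 2024L))

# Alternative to the bracket assignment: add the columns with mutate()
toy <- toy %>%
    mutate(decade = (year %/% 10) * 10,
           century = (year %/% 100) * 100)

print(toy)
```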

This enables us to draw a line graph of the average temperature each decade:

ggplot(historical_temperature %>%
           group_by(decade) %>%
           summarise(temperature=mean(temperature)), 
       aes(x = decade, y = temperature)) + 
       geom_line()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

A line chart is not a good choice for this plot, as it shows only the average and hides the spread of the underlying data. A box-and-whisker or violin plot would be better. To draw one, we need to specify the grouping in the aesthetic, e.g.

ggplot(historical_temperature, 
       aes(x = decade, y = temperature, group=decade)) + 
       geom_violin()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_ydensity()`).
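The box-and-whisker alternative works the same way, swapping geom_violin() for geom_boxplot(). This is a minimal sketch on made-up data; the tibble toy and its values are hypothetical.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data: three decades with five made-up readings each
toy <- tibble(
    decade = rep(c(1990, 2000, 2010), each = 5),
    temperature = c(8.9, 9.4, 9.8, 10.1, 10.6,
                    9.2, 9.7, 10.0, 10.5, 11.0,
                    9.5, 10.0, 10.4, 10.9, 11.3)
)

# group = decade gives one box per decade; without it, the continuous
# x variable would be treated as a single group
p <- ggplot(toy, aes(x = decade, y = temperature, group = decade)) +
    geom_boxplot()

print(p)
```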

Finally, aesthetics set globally in ggplot() can be overridden, or added to, in the individual layers. For example, here we overlay a smooth trend line on the violin plot:

ggplot(historical_temperature, 
       aes(x=year, y = temperature)) + 
       geom_violin(aes(group=decade)) + 
       geom_smooth()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_ydensity()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).

Note that the group aesthetic is set only in the violin layer, so geom_smooth() fits a single trend line through all of the data.
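Layer-level aesthetics work for other visual properties too. The sketch below, on hypothetical toy data, colours the points by decade while keeping a single trend line. It uses method = "lm" (a straight-line fit) rather than the default smoother, purely because the toy data set is so small.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data spanning two decades
toy <- tibble(
    year = c(1991, 1995, 1999, 2001, 2005, 2009),
    decade = c(1990, 1990, 1990, 2000, 2000, 2000),
    temperature = c(9.1, 9.6, 9.4, 10.0, 10.3, 10.1)
)

p <- ggplot(toy, aes(x = year, y = temperature)) +
    # colour is mapped in this layer only
    geom_point(aes(colour = factor(decade))) +
    # this layer inherits just x and y, so it draws one line for all points
    geom_smooth(method = "lm", formula = y ~ x)

print(p)
```

Had colour been set in the global aesthetic instead, geom_smooth() would have drawn a separate line for each decade.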

You can save your plots to a file using the ggsave function. This will save the last plot drawn to a file, with filename, size, format etc. all controlled via arguments to this function, e.g.

ggsave("violin.pdf", device="pdf", dpi="print")

would save the plot to a file called violin.pdf, in PDF format, using a resolution (dpi) that is suitable for printing.
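If you would rather not rely on "the last plot drawn", ggsave also accepts the plot explicitly, together with a size. This is a minimal sketch: the toy data, the file name toy_plot.pdf and the chosen dimensions are all hypothetical.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data and plot
toy <- tibble(x = 1:5, y = c(2, 4, 3, 5, 6))
p <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "toy_plot.pdf")

# Passing plot = p saves this specific plot rather than the last one drawn;
# width, height and units set the size of the page
ggsave(out_file, plot = p, width = 16, height = 10, units = "cm")
```

Storing the plot in a variable and passing it to ggsave makes scripts more robust, as the result no longer depends on which plot happened to be displayed most recently.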

Exercise

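Create a graph that shows the maximum temperature recorded for each month (that is, the highest value for that month across all years). Draw this as a line graph. Because month is a factor, geom_line() treats each month as its own group, so you may need to set group to a constant in the aesthetic (e.g. group="month" or group=1) so that the line layer can join the points together.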

Create a graph that shows the change in average temperature in December per decade. Draw this as a scatter plot, with a smooth trend line added.

Create a graph that shows the change in average temperature by century. Draw this as a line graph with a smooth trend line added. Note that you may need to increase the span value of geom_smooth() if there are too few data points to fit the default smoother.

First load the tidyverse, then read in all of the data and tidy it up…

library(tidyverse)

temperature <- read_table(
    "https://raw.githubusercontent.com/Bristol-Training/intro-data-analysis-r/refs/heads/main/data/cetml1659on.txt",
    skip=6,
    na=c("-99.99", "-99.9"),
    col_types=cols("DATE"=col_integer())
)

month_levels <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

historical_temperature <- temperature %>%
    select(-YEAR) %>%
    pivot_longer(c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                   "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"),
                 names_to="month",
                 values_to="temperature") %>%
    rename(year=DATE) %>%
    mutate(month=factor(month, month_levels))

Next, add in the “decade” and “century” variables (columns) as we will be using them later…

historical_temperature["decade"] <- (historical_temperature["year"] %/% 10) * 10
historical_temperature["century"] <- (historical_temperature["year"] %/% 100) * 100
historical_temperature
# A tibble: 4,344 × 5
    year month temperature decade century
   <int> <fct>       <dbl>  <dbl>   <dbl>
 1  1659 JAN             3   1650    1600
 2  1659 FEB             4   1650    1600
 3  1659 MAR             6   1650    1600
 4  1659 APR             7   1650    1600
 5  1659 MAY            11   1650    1600
 6  1659 JUN            13   1650    1600
 7  1659 JUL            16   1650    1600
 8  1659 AUG            16   1650    1600
 9  1659 SEP            13   1650    1600
10  1659 OCT            10   1650    1600
# ℹ 4,334 more rows

Graph that shows the maximum temperature recorded for each month…

ggplot(historical_temperature %>%
         group_by(month) %>%
         summarise(max_temp=max(temperature, na.rm=TRUE), .groups="drop"),
       # group is set to a constant string so that geom_line() joins all twelve points
       aes(x=month, y=max_temp, group="month")) + 
  geom_line()

Change in average temperature in December per decade.

ggplot( historical_temperature %>%
          filter(month=="DEC") %>%
          group_by(decade) %>%
          summarise(average=mean(temperature, na.rm=TRUE), .groups="drop"),
        aes(x=decade, y=average)
      ) + 
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Change in average temperature per century, plus a smooth trend line

ggplot( historical_temperature %>%
          group_by(century) %>%
          summarise(average=mean(temperature, na.rm=TRUE), .groups="drop"),
        aes(x=century, y=average)) + 
  geom_line() + 
  geom_smooth(span=2.0)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'