Tidyverse

R is an old programming language that was written as a re-implementation of the even older language, S. Over the years this has meant that R has gained many different layers and methods of doing things. This has created lots of inconsistencies and cruft, with R sometimes behaving in strange and unexpected ways that can be confusing for new users, and not suited to applications in modern data science.

The tidyverse is a coherent collection of modern R packages that solves this problem. It is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy. The packages were originally mostly developed by Hadley Wickham, but have been expanded by several contributers and has now developed into a thriving and highly supportive community.

There is lots of information about tidyverse on the web, e.g. R for Data Science (2e) is an online book by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund, that teaches the concepts of tidy data and shows how R with the tidyverse will help you create tidy data workflows. Modern R is the tidyverse, and it is strongly recommend that you use the tidyverse when you use R for data science.

Tidyverse philosophy

The goal of tydiverse is to help you create and work with tidy data. Tidy data is data where:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.

Installing and loading the tidyverse

You can install and use tidyverse by typing:

install.packages("tidyverse")

library(tidyverse)

This will download and then import tidyverse modules. Remember you only need to run install.packages("tidyverse") once. You should see something similar to this printed:

── Attaching core `tidyverse` packages ────────────────────────────────────────────────────────────────── `tidyverse` 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This shows that the nine core tidyverse modules have been loaded:

It also shows that the modern dplyr::filter and dplyr::lag functions replace the older stats::filter and stats::lag functions.

Exercise

Install and load the tidyverse. Use the links above for the nine modules to find out exactly what each package provides.

From the links in the page, the tidyverse packages are:

  • dplyr - provides a set of data manipulation functions arranged around a consistent grammar for data manipulation. This includes, for example, filtering, selecting, mutating and arranging data.
  • forcats - provides updated and more consistent tools for handling R factors (ways of representing catagorical data).
  • ggplot2 - a package for creating fantastic graphs and visualisations of tidy data. It is built on the concept of a grammar for graphics, which, once understood, provides a straightforward and consistent interface for creating sophisticated graphs and plots of tidy data.
  • lubridate - makes easier and more unintuitive to work with date-times.
  • purrr - provides a functional programming toolkit, e.g. enabling you to perform functions on selected data within a tibble via mapping.
  • readr - provides an updated version of R’s data reading functions, bringing greater consistency and more predictable behaviour.
  • stringr - provides a cohesive set of functions designed to make working with strings as easy as possible.
  • tibble - provides an updated version of a data.frame, the tibble, that is more consistent and better-suited to the needs of modern data science
  • tidyr - provides tools for cleaning and manipulating your data so that it becomes “tidy data”. The tidyverse is built on the concept of tidy data. Tidy data is where every column is a variable, every row is an observation and every cell has a single value. Typically, real-world data needs to be munged to become tidy, e.g. via pivoting, rectangling, nesting etc. This package provides the functions to do this efficiently.