Worksheet

We saw in Beginning R how to load a dataset with read.csv. This was a study about gene expression data of 42 ER- and ER+ breast cancer patients. The code was like:

breast <- read.csv("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            sep="\t",
            skip=99)

Using tidyverse, answer the below exercises.

Exercise 1

Read the dataset. You may want to look at the documentation for read_delim function. Remember specifying the separator character and number of lines you want to skip.

Answer 1

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

breast <- read_delim("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            delim="\t",
            skip=99)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 22284 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (2): ID_REF, IDENTIFIER
dbl (42): GSM512539, GSM512540, GSM512541, GSM512542, GSM512543, GSM512544, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The function read_delim is quite smart and if we don’t specify the delimiter character will figure it out, although it may not always work.

breast <- read_delim("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            skip=99)

Exercise 2

Calculate the average expresion levels for genes matching the name TP53.

Answer 2

First we can filter the rows that match the gene name and remove the non-numeric columns.

expr_tp53 <-
    breast %>%
        filter(IDENTIFIER=="TP53") %>%
        select(!c(ID_REF, IDENTIFIER)) 

expr_tp53

# A tibble: 2 × 42
  GSM512539 GSM512540 GSM512541 GSM512542 GSM512543 GSM512544 GSM512545
      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     309.      256.      214.      368       560.      262.      190. 
2       9.7      19.3       9.4       8.6      13.1       2.3       6.5
# ℹ 35 more variables: GSM512546 <dbl>, GSM512547 <dbl>, GSM512548 <dbl>,
#   GSM512549 <dbl>, GSM512550 <dbl>, GSM512551 <dbl>, GSM512552 <dbl>,
#   GSM512553 <dbl>, GSM512554 <dbl>, GSM512555 <dbl>, GSM512556 <dbl>,
#   GSM512575 <dbl>, GSM512576 <dbl>, GSM512577 <dbl>, GSM512578 <dbl>,
#   GSM512579 <dbl>, GSM512580 <dbl>, GSM512566 <dbl>, GSM512567 <dbl>,
#   GSM512568 <dbl>, GSM512569 <dbl>, GSM512570 <dbl>, GSM512571 <dbl>,
#   GSM512572 <dbl>, GSM512573 <dbl>, GSM512574 <dbl>, GSM512557 <dbl>, …

Now to calculate the mean of each one of the rows we can run

expr_mean_tp53 <-
    breast %>%
        filter(IDENTIFIER=="TP53") %>%
        select(!c(ID_REF, IDENTIFIER)) %>%
        rowMeans()

expr_mean_tp53

[1] 310.89524  17.15714

Exercise 3

Find the patient with the higher average expression levels accross the whole genome.

Answer 3

max_idx <-
    breast %>%
        select(!c(ID_REF, IDENTIFIER))  %>%
        colMeans(na.rm=TRUE) %>%
        which.max()

max_idx

GSM512562 
       39