Worksheet

In Intro to R part 1 we saw how to load a dataset with read.csv. The data came from a study of gene expression in 42 ER- and ER+ breast cancer patients. The code was:

breast <- read.csv("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            sep="\t",
            skip=99)

Using the tidyverse, answer the exercises below.

Exercise 1

Read the dataset. You may want to look at the documentation for the read_delim function. Remember to specify the separator character and the number of lines you want to skip.

library(tidyverse)
breast <- read_delim("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            delim="\t",
            skip=99)

If we don't specify the delimiter, read_delim tries to guess it from the contents of the file, although the guess may not always be correct.

breast <- read_delim("https://raw.githubusercontent.com/Bristol-Training/beginning-r/refs/heads/main/data/GDS3716.soft",
            skip=99)
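
To see the delimiter guessing in isolation, here is a minimal sketch on a small in-memory table. The toy text, its column names and the two leading metadata lines are made up for illustration and are not part of the GDS3716 file. It assumes the readr package (loaded as part of the tidyverse) is installed.

```r
library(readr)

# Hypothetical toy input: two metadata lines, then a tab-separated table.
toy_text <- "#meta1\n#meta2\nID_REF\tIDENTIFIER\tGSM1\n1\tTP53\t0.5\n2\tBRCA1\t1.2\n"

# I() tells readr the string is literal data rather than a file path.
# No delim argument: read_delim guesses the delimiter.
toy_guessed <- read_delim(I(toy_text), skip = 2, show_col_types = FALSE)

# Passing delim explicitly avoids relying on the guess.
toy_explicit <- read_delim(I(toy_text), delim = "\t", skip = 2, show_col_types = FALSE)

# Both calls should produce the same column names.
identical(names(toy_guessed), names(toy_explicit))
```

When the guess matters, for example in a script that will run unattended, passing delim explicitly is the safer choice.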

Exercise 2

Calculate the average expression level for each row whose gene name matches TP53.

First we can filter the rows that match the gene name and remove the non-numeric columns.

expr_tp53 <-
    breast %>%
        filter(IDENTIFIER=="TP53") %>%
        select(!c(ID_REF, IDENTIFIER)) 
expr_tp53

Now, to calculate the mean of each of these rows, we can run:

expr_mean_tp53 <-
    breast %>%
        filter(IDENTIFIER=="TP53") %>%
        select(!c(ID_REF, IDENTIFIER)) %>%
        rowMeans()
expr_mean_tp53
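
The same filter, select and rowMeans pipeline can be checked by hand on a small table. The sketch below uses a made-up tibble with the same layout as the dataset (two identifier columns followed by one numeric column per sample). The sample names and values are hypothetical, and it assumes the dplyr package (loaded as part of the tidyverse) is installed.

```r
library(dplyr)

# Hypothetical toy table: two identifier columns, then one column per sample.
toy <- tibble(
    ID_REF     = c(1, 2, 3),
    IDENTIFIER = c("TP53", "BRCA1", "TP53"),
    sample_a   = c(1, 5, 3),
    sample_b   = c(3, 7, 5)
)

toy_means <-
    toy %>%
        filter(IDENTIFIER == "TP53") %>%      # keep rows 1 and 3
        select(!c(ID_REF, IDENTIFIER)) %>%    # keep only the numeric columns
        rowMeans()                            # one mean per remaining row

toy_means
# (1 + 3) / 2 = 2 for the first TP53 row, (3 + 5) / 2 = 4 for the second
```

Because rowMeans returns one value per row, a gene that appears on several rows gives several means rather than a single number.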

Exercise 3

Find the patient with the highest average expression level across the whole genome.

max_idx <-
    breast %>%
        select(!c(ID_REF, IDENTIFIER))  %>%
        colMeans(na.rm=TRUE) %>%
        which.max()
max_idx
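
colMeans returns a named vector, so the result of which.max carries both the position and the name of the column with the largest mean. A minimal sketch with made-up per-sample means (the names and values are hypothetical) shows how to read each part:

```r
# Hypothetical per-sample means, named the way colMeans() names its result.
toy_col_means <- c(sample_a = 2.1, sample_b = 3.4, sample_c = 1.7)

toy_idx <- which.max(toy_col_means)

names(toy_idx)    # the sample identifier: "sample_b"
unname(toy_idx)   # its position among the numeric columns: 2
```

Note that the position counts only the columns that were kept by select, so it is offset from the column position in the original table by the two identifier columns that were dropped.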