Week 4 - Hands-On Examples

The R script is available here: link

Goals

Understand and use operators to filter data with precision.
Apply functions to perform basic calculations and data manipulation.

Import Data

A gene-level differential expression (DE) analysis was performed to compare SET1 samples to WT samples using data from read-counts.csv.

The analysis results are available via this link.

Download the result file and upload it to your data folder.
Import the data using the read_csv() function from the readr package (see the documentation with ?read_csv). Name the imported results de_res.

library(readr)
de_res <- read_csv(
  file = "../exos_data/toy_DEanalysis.csv",  # replace the path with your own
  col_names = TRUE
)

Rows: 45 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene_name
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercises

Check the structure of de_res using an appropriate R function. What are the dimensions?

str(de_res)

spc_tbl_ [45 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ gene_name     : chr [1:45] "HTB2" "HHF1" "HHT1" "POL30" ...
 $ baseMean      : num [1:45] 20259 9821 1539 1274 316 ...
 $ log2FoldChange: num [1:45] -0.3757 0.1789 0.0866 0.4165 0.2189 ...
 $ lfcSE         : num [1:45] 0.447 0.536 0.412 0.422 0.434 ...
 $ stat          : num [1:45] -0.841 0.334 0.21 0.988 0.505 ...
 $ pvalue        : num [1:45] 0.4 0.739 0.834 0.323 0.614 ...
 $ padj          : num [1:45] 0.891 0.906 0.915 0.891 0.891 ...
 - attr(*, "spec")=
  .. cols(
  ..   gene_name = col_character(),
  ..   baseMean = col_double(),
  ..   log2FoldChange = col_double(),
  ..   lfcSE = col_double(),
  ..   stat = col_double(),
  ..   pvalue = col_double(),
  ..   padj = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

The result is a data frame with 45 rows and 7 columns.
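
As a complement to str(), the dimensions can also be queried directly. The sketch below uses a small made-up data frame (toy, with hypothetical values) rather than de_res, so that it runs without the course file:

```r
# Toy data frame standing in for de_res (values are made up for illustration)
toy <- data.frame(
  gene_name = c("G1", "G2", "G3"),
  log2FoldChange = c(1.2, -0.8, 0.1)
)

dims <- dim(toy)     # integer vector: number of rows, then number of columns
n_rows <- nrow(toy)  # number of rows only
n_cols <- ncol(toy)  # number of columns only

print(dims)
```

The same three calls applied to de_res should agree with the dimensions reported by str().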

The result contains the following columns:

gene_name: gene name
baseMean: mean of normalized counts for all samples
log2FoldChange: log2 fold change
lfcSE: standard error of the log2 fold change estimate
stat: Wald statistic
pvalue: Wald test p-value
padj: adjusted p-values (Benjamini-Hochberg procedure)

Filter the rows where the gene has a log2 fold change (log2FoldChange) greater than 0.5.

de_res[de_res$log2FoldChange > 0.5, ]

# A tibble: 6 × 7
  gene_name baseMean log2FoldChange lfcSE  stat       pvalue        padj
  <chr>        <dbl>          <dbl> <dbl> <dbl>        <dbl>       <dbl>
1 SUT476        88.1          0.994 0.386  2.58 0.00999      0.0761     
2 CDC20        217.           0.789 0.491  1.61 0.108        0.579      
3 CLB6         111.           0.737 0.545  1.35 0.176        0.720      
4 LOH1          48.6          2.23  0.397  5.61 0.0000000206 0.000000927
5 SUT2873       26.0          1.43  0.369  3.87 0.000110     0.00165    
6 ACM1         140.           0.733 0.486  1.51 0.132        0.593
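
To see why this filter works, it may help to look at the comparison on its own. The sketch below uses a short hypothetical vector (lfc) instead of the real column: the comparison returns one TRUE or FALSE per element, and that logical vector is what selects the rows.

```r
# Hypothetical log2 fold changes, for illustration only
lfc <- c(0.994, -0.640, 0.789, 0.1)

is_up <- lfc > 0.5   # element-wise comparison gives a logical vector
print(is_up)

n_up <- sum(is_up)   # TRUE counts as 1, so sum() counts the matches
kept <- lfc[is_up]   # indexing with the logical vector keeps the TRUE positions
print(kept)
```

In de_res[condition, ], the logical vector plays the same role for rows, and the empty slot after the comma keeps all columns.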

Filter the rows where the gene has a log2 fold change smaller than -0.5.

de_res[de_res$log2FoldChange < -0.5, ]

# A tibble: 5 × 7
  gene_name baseMean log2FoldChange lfcSE  stat   pvalue    padj
  <chr>        <dbl>          <dbl> <dbl> <dbl>    <dbl>   <dbl>
1 APQ12      5423.           -0.640 0.237 -2.70 0.00684  0.0761 
2 FAR1       5927.           -1.51  0.675 -2.23 0.0254   0.163  
3 SUT24       156.           -0.844 0.328 -2.57 0.0101   0.0761 
4 PIR3        304.           -2.37  0.608 -3.89 0.000100 0.00165
5 TUB33         2.95         -0.872 0.554 -1.57 0.116    0.579

Filter the rows where the gene has a log2 fold change greater than 0.5 or smaller than -0.5.

de_res[de_res$log2FoldChange > 0.5 | de_res$log2FoldChange < -0.5, ]

# A tibble: 11 × 7
   gene_name baseMean log2FoldChange lfcSE  stat       pvalue        padj
   <chr>        <dbl>          <dbl> <dbl> <dbl>        <dbl>       <dbl>
 1 SUT476       88.1           0.994 0.386  2.58 0.00999      0.0761     
 2 APQ12      5423.           -0.640 0.237 -2.70 0.00684      0.0761     
 3 CDC20       217.            0.789 0.491  1.61 0.108        0.579      
 4 CLB6        111.            0.737 0.545  1.35 0.176        0.720      
 5 FAR1       5927.           -1.51  0.675 -2.23 0.0254       0.163      
 6 SUT24       156.           -0.844 0.328 -2.57 0.0101       0.0761     
 7 LOH1         48.6           2.23  0.397  5.61 0.0000000206 0.000000927
 8 PIR3        304.           -2.37  0.608 -3.89 0.000100     0.00165    
 9 TUB33         2.95         -0.872 0.554 -1.57 0.116        0.579      
10 SUT2873      26.0           1.43  0.369  3.87 0.000110     0.00165    
11 ACM1        140.            0.733 0.486  1.51 0.132        0.593

## Bonus: testing the absolute value of log2FoldChange simplifies the condition
abs(c(0.5, -0.5)) # how abs() works, ?abs

[1] 0.5 0.5

de_res[abs(de_res$log2FoldChange) > 0.5, ]

# A tibble: 11 × 7
   gene_name baseMean log2FoldChange lfcSE  stat       pvalue        padj
   <chr>        <dbl>          <dbl> <dbl> <dbl>        <dbl>       <dbl>
 1 SUT476       88.1           0.994 0.386  2.58 0.00999      0.0761     
 2 APQ12      5423.           -0.640 0.237 -2.70 0.00684      0.0761     
 3 CDC20       217.            0.789 0.491  1.61 0.108        0.579      
 4 CLB6        111.            0.737 0.545  1.35 0.176        0.720      
 5 FAR1       5927.           -1.51  0.675 -2.23 0.0254       0.163      
 6 SUT24       156.           -0.844 0.328 -2.57 0.0101       0.0761     
 7 LOH1         48.6           2.23  0.397  5.61 0.0000000206 0.000000927
 8 PIR3        304.           -2.37  0.608 -3.89 0.000100     0.00165    
 9 TUB33         2.95         -0.872 0.554 -1.57 0.116        0.579      
10 SUT2873      26.0           1.43  0.369  3.87 0.000110     0.00165    
11 ACM1        140.            0.733 0.486  1.51 0.132        0.593

Filter the rows where the gene has a log2 fold change greater than 0.5 and adjusted p-value (padj) smaller than 0.05.

de_res[de_res$log2FoldChange > 0.5 & de_res$padj < 0.05, ]

# A tibble: 2 × 7
  gene_name baseMean log2FoldChange lfcSE  stat       pvalue        padj
  <chr>        <dbl>          <dbl> <dbl> <dbl>        <dbl>       <dbl>
1 LOH1          48.6           2.23 0.397  5.61 0.0000000206 0.000000927
2 SUT2873       26.0           1.43 0.369  3.87 0.000110     0.00165
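
One caveat with this style of filtering: if a condition evaluates to NA (for example because padj is missing for a gene, which DE tools can report), bracket indexing returns a row filled with NA instead of dropping it. Whether the toy file contains any NA is not stated here, so the sketch below uses a made-up data frame (toy) to show the behaviour and two common ways around it:

```r
# Made-up example with one missing adjusted p-value
toy <- data.frame(
  gene_name = c("G1", "G2", "G3"),
  log2FoldChange = c(1.0, 2.0, 0.1),
  padj = c(0.01, NA, 0.5)
)

cond <- toy$log2FoldChange > 0.5 & toy$padj < 0.05  # TRUE, NA, FALSE

with_na <- toy[cond, ]          # the NA in cond yields an all-NA row
no_na_which <- toy[which(cond), ]  # which() keeps only positions that are TRUE
no_na_subset <- subset(toy, log2FoldChange > 0.5 & padj < 0.05)  # subset() drops NA too

print(with_na)
print(no_na_which)
```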

Stats Time!

Multiple Tests Correction

Why is multiple testing a problem?

When performing multiple statistical tests, the probability of making at least one Type I error (false positive) increases with the number of tests.

For instance, if we perform 100 independent tests with a significance level (\(\alpha\)) of 5%, the chance of incorrectly rejecting at least one null hypothesis is no longer 5%, but much higher. This is because the errors accumulate across the tests.

If we perform 100 independent tests simultaneously with \(\alpha\) set to 0.05, the probability of making at least one error is:

\[ \begin{aligned} P(\text{at least 1 significant result by chance}) &= 1 - P(\text{no significant result by chance}) \\ &= 1 - (1 - 0.05)^{100} \\ &\approx 0.994 \end{aligned} \]
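
This probability can be checked in R. The snippet below is a minimal sketch that assumes the tests are independent, as in the formula; the variable names (alpha, n_tests, fwer) are chosen here for illustration.

```r
alpha <- 0.05     # significance level of each individual test
n_tests <- 100    # number of independent tests

# Probability of at least one false positive among n_tests tests
fwer <- 1 - (1 - alpha)^n_tests
print(round(fwer, 3))

# The same formula is vectorised, so several test counts can be compared at once
n_grid <- c(1, 10, 100)
fwer_grid <- 1 - (1 - alpha)^n_grid
print(round(fwer_grid, 3))
```

With a single test the probability is simply alpha, and it climbs quickly as the number of tests grows.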

Multiple test correction

To address this issue and keep false positives under control, statistical corrections such as the Bonferroni correction or False Discovery Rate (FDR) adjustments are commonly used in multiple testing scenarios:

- Bonferroni correction: adjust the significance threshold (\(\alpha\)) to account for the number of tests (Ntest) being performed, i.e., \(\alpha_{adjusted}= \frac{\alpha}{\text{Ntest}}\)
- FDR (false discovery rate): control the expected proportion of false positives among all significant results, e.g., with the Benjamini-Hochberg (BH) procedure.
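
Both corrections are available in base R through p.adjust(). The sketch below applies them to a handful of made-up p-values (p is hypothetical and not taken from de_res) to show how the number of significant results changes:

```r
# Hypothetical raw p-values, for illustration only
p <- c(0.001, 0.012, 0.02, 0.045, 0.2)
alpha <- 0.05

p_bonf <- p.adjust(p, method = "bonferroni")  # each p multiplied by the number of tests, capped at 1
p_bh <- p.adjust(p, method = "BH")            # Benjamini-Hochberg adjusted p-values

n_raw <- sum(p < alpha)        # significant without any correction
n_bonf <- sum(p_bonf < alpha)  # significant after Bonferroni
n_bh <- sum(p_bh < alpha)      # significant after BH

print(data.frame(p, p_bonf, p_bh))
```

Comparing adjusted p-values to \(\alpha\) is equivalent to the Bonferroni rule of comparing raw p-values to \(\alpha\) divided by the number of tests. On these values Bonferroni is the stricter of the two.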

Extract results for these genes: RNR1, PIR3, SRP68.

de_res[de_res$gene_name %in% c("RNR1", "PIR3", "SRP68"), ]

# A tibble: 3 × 7
  gene_name baseMean log2FoldChange lfcSE   stat   pvalue    padj
  <chr>        <dbl>          <dbl> <dbl>  <dbl>    <dbl>   <dbl>
1 RNR1         1374.         -0.381 0.434 -0.879 0.379    0.891  
2 PIR3          304.         -2.37  0.608 -3.89  0.000100 0.00165
3 SRP68        1058.         -0.120 0.211 -0.569 0.570    0.891
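
The %in% operator is used here rather than == because it tests membership in a set. The sketch below, on small hypothetical vectors (genes, wanted), shows how == can silently give the wrong answer when the two sides have different lengths:

```r
genes <- c("RNR1", "PIR3", "SRP68", "HTB2")
wanted <- c("PIR3", "RNR1")

# %in% asks, for each gene, whether it appears anywhere in wanted
in_set <- genes %in% wanted
print(in_set)

# == compares position by position and recycles the shorter vector:
# RNR1 vs PIR3, PIR3 vs RNR1, SRP68 vs PIR3, HTB2 vs RNR1
eq <- genes == wanted
print(eq)
```

No warning is raised for == in this case because the longer length is a multiple of the shorter one, which makes the mistake easy to miss.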

Use ifelse() to categorize genes. Add a new column, gene_category, that assigns categories:

“up” if log2FoldChange > 0.5.
“down” if log2FoldChange < -0.5.
“neutral” otherwise.

de_res[["gene_category"]] <- ifelse(
  test = de_res[["log2FoldChange"]] > 0.5,
  yes = "up",
  no = ifelse(
    test = de_res[["log2FoldChange"]] < -0.5,
    yes = "down",
    no = "neutral"
  )
)
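
To check how the nested ifelse() treats each case, it can be tried on a short hypothetical vector that includes the boundary values 0.5 and -0.5. Because the comparisons are strict, both boundaries fall into "neutral":

```r
# Hypothetical log2 fold changes, including the two thresholds
lfc <- c(1.2, -0.8, 0.1, 0.5, -0.5)

category <- ifelse(
  test = lfc > 0.5,
  yes = "up",
  no = ifelse(
    test = lfc < -0.5,  # only evaluated as the fallback for values not "up"
    yes = "down",
    no = "neutral"
  )
)

print(category)
```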

Use table() to count the occurrences of each gene category. (?table)

table(de_res[["gene_category"]])


   down neutral      up 
      5      34       6

Note

Ensembl Database

Ensembl is a comprehensive genome database that provides detailed information on genes and their annotations across a wide range of species (human, mouse, zebrafish, etc.). It integrates genomic data with tools like BioMart, making it easy to query and extract information such as gene names, coordinates, functions, and orthologs for research purposes.

A yeast gene annotation file was obtained from the Ensembl database. This file can be downloaded here.

Import the data and add the annotation to the de_res data frame using the merge() function. (?merge)

annot <- read_csv(
  "../exos_data/yeast_gene_annot.csv" # replace the path with your own
)

Rows: 7127 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ensembl_id, gene_name, chromosome, description
dbl (2): start, end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

de_res <- merge(de_res, annot, by = "gene_name", all.x = TRUE)
head(de_res)

  gene_name  baseMean log2FoldChange     lfcSE         stat      pvalue
1      ACM1  140.4359    0.733261747 0.4864127  1.507488856 0.131685398
2     APQ12 5422.7091   -0.640132968 0.2366950 -2.704463027 0.006841488
3     CDC20  216.8074    0.788541627 0.4908836  1.606371984 0.108192203
4      CDC5 1282.0508    0.172085182 0.4798742  0.358604793 0.719890761
5      CLB1  927.0893   -0.145822741 0.6384089 -0.228415908 0.819322924
6      CLB2  255.6934   -0.001034076 0.5231743 -0.001976543 0.998422948
        padj gene_category ensembl_id chromosome  start    end
1 0.59258429            up    YPL267W        XVI  38169  38798
2 0.07611881          down    YIL040W         IX 277723 278139
3 0.57921173            up    YGL116W        VII 289809 291641
4 0.90627615       neutral    YMR001C       XIII 269019 271136
5 0.91496751       neutral    YGR108W        VII 703636 705051
6 0.99842295       neutral    YPR119W        XVI 771653 773128
                                                                                                                                                                                                                                                                                                                                                                                                                                                       description
1                                                                                                                                                                                       Pseudosubstrate inhibitor of the APC/C; suppresses APC/C [Cdh1]-mediated proteolysis of mitotic cyclins; associates with Cdh1p, Bmh1p and Bmh2p; cell cycle regulated protein; the anaphase-promoting complex/cyclosome is also known as APC/C [Source:SGD;Acc:S000006188]
2                     Nuclear envelope/ER integral membrane protein; interacts and functions with Brr6p and Brl1p in lipid homeostasis; mutants are defective in nuclear pore complex biogenesis, nuclear envelope morphology, mRNA export from the nucleus and are sensitive to sterol biosynthesis inhibitors and membrane fluidizing agents; exhibits synthetic lethal genetic interactions with genes involved in lipid metabolism [Source:SGD;Acc:S000001302]
3                                                                                                        Activator of anaphase-promoting complex/cyclosome (APC/C); APC/C is required for metaphase/anaphase transition; directs ubiquitination of mitotic cyclins, Pds1p, and other anaphase inhibitors; cell-cycle regulated; potential Cdc28p substrate; relative distribution to the nucleus increases upon DNA replication stress [Source:SGD;Acc:S000003084]
4 Polo-like kinase; controls targeting and activation of Rho1p at cell division site via Rho1p guanine nucleotide exchange factors; regulates Spc72p; also functions in adaptation to DNA damage during meiosis; regulates the shape of the nucleus and expansion of the nuclear envelope during mitosis; similar to Xenopus Plx1 and S. pombe Plo1p; human homologs PLK1, PLK3 can each complement yeast cdc5 thermosensitive mutants [Source:SGD;Acc:S000004603]
5                                                                                                                 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB1 has a paralog, CLB2, that arose from the whole genome duplication [Source:SGD;Acc:S000003340]
6                                                                                                                 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB2 has a paralog, CLB1, that arose from the whole genome duplication [Source:SGD;Acc:S000006323]
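
The role of all.x = TRUE in the merge() call is easier to see on a small example. The sketch below uses two made-up data frames (res and annot_toy, with hypothetical gene names and chromosome labels): with all.x = TRUE every row of the first table is kept, and genes without annotation get NA in the added columns.

```r
# Made-up results table: geneX has no entry in the annotation
res <- data.frame(
  gene_name = c("geneC", "geneA", "geneX"),
  log2FoldChange = c(-2.4, 0.7, 0.2)
)

# Made-up annotation table: geneB is not in the results
annot_toy <- data.frame(
  gene_name = c("geneA", "geneB", "geneC"),
  chromosome = c("chr1", "chr2", "chr3")
)

# Keep all rows of res, matched or not (a "left join")
left <- merge(res, annot_toy, by = "gene_name", all.x = TRUE)

# Default behaviour: keep only genes present in both tables
inner <- merge(res, annot_toy, by = "gene_name")

print(left)
print(inner)
```

Note that merge() may return the rows in a different order from the input, which is why head(de_res) above no longer starts with the same genes as before the merge.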

Bravo! 🎉 You’ve learned the basics of R and you’re already making great progress. Keep it up!