Week 4 - Hands-On Examples

week04
exercise

The R script is available here: link

Goals

  • Understand and use operators to filter data with precision.
  • Understand function’s basic structure, learn to use new functions, write functions to automate tasks.

Import Data

A gene-level differential expression (DE) analysis was performed to compare SET1 samples to WT samples using data from read-counts.csv.

The analysis results are available via this link.

  • Please donwload the result file and upload it to your data folder.
  • Import the data using the read.csv() function. (See the documentation with ?read.csv) Name the imported results de_res.
de_res <- read.csv(
  file = "../exos_data/toy_DEanalysis.csv",  # replace the path with your own
  header = TRUE
)

Exercises

  1. Check the structure of de_res using an appropriate R function. What are the dimensions?
str(de_res)
'data.frame':   45 obs. of  7 variables:
 $ gene_name     : chr  "HTB2" "HHF1" "HHT1" "POL30" ...
 $ baseMean      : num  20259 9821 1539 1274 316 ...
 $ log2FoldChange: num  -0.3757 0.1789 0.0866 0.4165 0.2189 ...
 $ lfcSE         : num  0.447 0.536 0.412 0.422 0.434 ...
 $ stat          : num  -0.841 0.334 0.21 0.988 0.505 ...
 $ pvalue        : num  0.4 0.739 0.834 0.323 0.614 ...
 $ padj          : num  0.891 0.906 0.915 0.891 0.891 ...

The result is a data frame with 45 rows and 7 columns.

The result contains following columns:

  • gene_name: gene name
  • baseMean: mean of normalized counts for all samples
  • log2FoldChange: log2 fold change
  • lfcSE: standard error
  • stat: Wald statistic
  • pvalue: Wald test p-value
  • padj: adjusted p-values (Benjamini-Hochberg procedure)
  1. Filter the rows where the gene has a log2 fold change (log2FoldChange) greater than 0.5.
de_res[de_res$log2FoldChange > 0.5, ]
   gene_name  baseMean log2FoldChange     lfcSE     stat       pvalue
9     SUT476  88.12355      0.9935943 0.3856808 2.576209 9.989029e-03
15     CDC20 216.80742      0.7885416 0.4908836 1.606372 1.081922e-01
16      CLB6 111.03713      0.7373706 0.5448055 1.353457 1.759098e-01
21      LOH1  48.61440      2.2259655 0.3969997 5.606971 2.058986e-08
29   SUT2873  26.00372      1.4253323 0.3685458 3.867449 1.099797e-04
40      ACM1 140.43590      0.7332617 0.4864127 1.507489 1.316854e-01
           padj
9  7.611881e-02
15 5.792117e-01
16 7.196311e-01
21 9.265435e-07
29 1.649695e-03
40 5.925843e-01
  1. Filter the rows where the gene has a log2 fold change smaller than -0.5.
de_res[de_res$log2FoldChange < -0.5, ]
   gene_name    baseMean log2FoldChange     lfcSE      stat       pvalue
10     APQ12 5422.709150     -0.6401330 0.2366950 -2.704463 0.0068414883
19      FAR1 5927.470166     -1.5095445 0.6754613 -2.234835 0.0254281962
20     SUT24  155.788397     -0.8436243 0.3281685 -2.570705 0.0101491743
22      PIR3  304.294496     -2.3659253 0.6082583 -3.889672 0.0001003798
27     TUB33    2.952205     -0.8718026 0.5544172 -1.572467 0.1158423457
          padj
10 0.076118807
19 0.163466976
20 0.076118807
22 0.001649695
27 0.579211728
  1. Filter the rows where the gene has a log2 fold change greater than 0.5 or smaller than -0.5.
de_res[de_res$log2FoldChange > 0.5 | de_res$log2FoldChange < -0.5, ]
   gene_name    baseMean log2FoldChange     lfcSE      stat       pvalue
9     SUT476   88.123550      0.9935943 0.3856808  2.576209 9.989029e-03
10     APQ12 5422.709150     -0.6401330 0.2366950 -2.704463 6.841488e-03
15     CDC20  216.807415      0.7885416 0.4908836  1.606372 1.081922e-01
16      CLB6  111.037134      0.7373706 0.5448055  1.353457 1.759098e-01
19      FAR1 5927.470166     -1.5095445 0.6754613 -2.234835 2.542820e-02
20     SUT24  155.788397     -0.8436243 0.3281685 -2.570705 1.014917e-02
21      LOH1   48.614399      2.2259655 0.3969997  5.606971 2.058986e-08
22      PIR3  304.294496     -2.3659253 0.6082583 -3.889672 1.003798e-04
27     TUB33    2.952205     -0.8718026 0.5544172 -1.572467 1.158423e-01
29   SUT2873   26.003715      1.4253323 0.3685458  3.867449 1.099797e-04
40      ACM1  140.435903      0.7332617 0.4864127  1.507489 1.316854e-01
           padj
9  7.611881e-02
10 7.611881e-02
15 5.792117e-01
16 7.196311e-01
19 1.634670e-01
20 7.611881e-02
21 9.265435e-07
22 1.649695e-03
27 5.792117e-01
29 1.649695e-03
40 5.925843e-01
## Bonus: we can test the absolute value of log2FoldChange to simplify condition
abs(c(0.5, -0.5)) # how abs() works
[1] 0.5 0.5
de_res[abs(de_res$log2FoldChange) > 0.5, ]
   gene_name    baseMean log2FoldChange     lfcSE      stat       pvalue
9     SUT476   88.123550      0.9935943 0.3856808  2.576209 9.989029e-03
10     APQ12 5422.709150     -0.6401330 0.2366950 -2.704463 6.841488e-03
15     CDC20  216.807415      0.7885416 0.4908836  1.606372 1.081922e-01
16      CLB6  111.037134      0.7373706 0.5448055  1.353457 1.759098e-01
19      FAR1 5927.470166     -1.5095445 0.6754613 -2.234835 2.542820e-02
20     SUT24  155.788397     -0.8436243 0.3281685 -2.570705 1.014917e-02
21      LOH1   48.614399      2.2259655 0.3969997  5.606971 2.058986e-08
22      PIR3  304.294496     -2.3659253 0.6082583 -3.889672 1.003798e-04
27     TUB33    2.952205     -0.8718026 0.5544172 -1.572467 1.158423e-01
29   SUT2873   26.003715      1.4253323 0.3685458  3.867449 1.099797e-04
40      ACM1  140.435903      0.7332617 0.4864127  1.507489 1.316854e-01
           padj
9  7.611881e-02
10 7.611881e-02
15 5.792117e-01
16 7.196311e-01
19 1.634670e-01
20 7.611881e-02
21 9.265435e-07
22 1.649695e-03
27 5.792117e-01
29 1.649695e-03
40 5.925843e-01
  1. Filter the rows where the gene has a log2 fold change greater than 0.5 and adjusted p-value (padj) smaller than 0.05.
de_res[de_res$log2FoldChange > 0.5 & de_res$padj < 0.05, ]
   gene_name baseMean log2FoldChange     lfcSE     stat       pvalue
21      LOH1 48.61440       2.225966 0.3969997 5.606971 2.058986e-08
29   SUT2873 26.00372       1.425332 0.3685458 3.867449 1.099797e-04
           padj
21 9.265435e-07
29 1.649695e-03
Stats Time!

Multiple Tests Correction

  • Why multiple testing is a problem?

    When performing multiple statistical tests, the probability of making at least one Type I error (false positive) increases with the number of tests.

    For instance, if we perform 100 independent tests with a significance level (\(\alpha\)) of 5%, the chance of incorrectly rejecting at least one null hypothesis is no longer 5%, but much higher. This is because the errors accumulate across the tests.

    If we do 100 tests simultaneously and set and use \(\alpha\) at 0.05, the probability to do at least one error is:

    \[ \begin{aligned} P(\text{at least 1 significant result by chance}) &= 1- P(\text{non significant results}) \\ &= 1 – (1 - 0.05)^{100} \\ &= 0.99 \end{aligned} \]

  • Multiple test correction

    To address this issue and control the overall Type I error rate, statistical corrections like the Bonferroni correction or False Discovery Rate (FDR) adjustments are commonly used in multiple testing scenarios.

    • Bonferroni correction: adjust the significance threshold (\(\alpha\)) to account for the number of tests (Ntest) being performed, i.e., \(\alpha_{adjusted}= \frac{\alpha}{\text{Ntest}}\)
    • \[ \begin{aligned} P(\text{at least 1 significant result by chance}) &= 1 – (1 - \frac{0.05}{100})^{100} \\ &= 0.049 \end{aligned} \]

    • FDR (False discovery rate): control the proportion of false positive amongst all significant results, e.g.: Benjamini-Hochberg (BH) procedure.
  1. Extract results for these genes: RNR1, PIR3, SRP68.
de_res[de_res$gene_name %in% c("RNR1", "PIR3", "SRP68"), ]
   gene_name  baseMean log2FoldChange     lfcSE       stat       pvalue
11      RNR1 1374.1805     -0.3810586 0.4335968 -0.8788317 0.3794925108
22      PIR3  304.2945     -2.3659253 0.6082583 -3.8896719 0.0001003798
42     SRP68 1058.4173     -0.1198641 0.2107691 -0.5686988 0.5695605340
          padj
11 0.891290926
22 0.001649695
42 0.891290926
  1. Use ifelse() to categorize genes. Add a new column, gene_category, that assigns categories:
  • “up” if log2FoldChange > 0.5.
  • “down” if log2FoldChange < -0.5.
  • “neutral” otherwise.
de_res[["gene_category"]] <- ifelse(
  test = de_res[["log2FoldChange"]] > 0.5,
  yes = "up",
  no = ifelse(
    test = de_res[["log2FoldChange"]] < -0.5,
    yes = "down",
    no = "neutral"
  )
)
  1. Use table() to count the occurrences of each gene category.
table(de_res[["gene_category"]])

   down neutral      up 
      5      34       6 

Bonus Questions

  1. Write a function to automate “de_res” filtering for genes with a p-value less than or equal to a custom cutoff.
filter_p <- function(cutoff = 0.05) {
  de_res[de_res$pvalue <= cutoff, ]
}
filter_p(0.01)
   gene_name   baseMean log2FoldChange     lfcSE      stat       pvalue
9     SUT476   88.12355      0.9935943 0.3856808  2.576209 9.989029e-03
10     APQ12 5422.70915     -0.6401330 0.2366950 -2.704463 6.841488e-03
21      LOH1   48.61440      2.2259655 0.3969997  5.606971 2.058986e-08
22      PIR3  304.29450     -2.3659253 0.6082583 -3.889672 1.003798e-04
29   SUT2873   26.00372      1.4253323 0.3685458  3.867449 1.099797e-04
           padj gene_category
9  7.611881e-02            up
10 7.611881e-02          down
21 9.265435e-07            up
22 1.649695e-03          down
29 1.649695e-03            up
filter_p(0.0001)
   gene_name baseMean log2FoldChange     lfcSE     stat       pvalue
21      LOH1  48.6144       2.225966 0.3969997 5.606971 2.058986e-08
           padj gene_category
21 9.265435e-07            up
  1. Based on the function created in question 9, modify the function to allow output ordered by any desired column in de_res.

Hints: You need an extra parameter to specify the wanted column and another parameter to fix the cutoff.

filter_col <- function(col = "pvalue", cutoff = 0.05) {
  de_res[de_res[[col]] <= cutoff, ]
}
filter_col(cutoff = 0.0001)
   gene_name baseMean log2FoldChange     lfcSE     stat       pvalue
21      LOH1  48.6144       2.225966 0.3969997 5.606971 2.058986e-08
           padj gene_category
21 9.265435e-07            up
filter_col(col = "log2FoldChange", cutoff = -1)
   gene_name  baseMean log2FoldChange     lfcSE      stat       pvalue
19      FAR1 5927.4702      -1.509545 0.6754613 -2.234835 0.0254281962
22      PIR3  304.2945      -2.365925 0.6082583 -3.889672 0.0001003798
          padj gene_category
19 0.163466976          down
22 0.001649695          down
Tip

Ensembl Data Base

Ensembl is a comprehensive genome database that provides detailed information on genes and their annotations across a wide range of species (humain, mouse, zebrafish, etc.). It integrates genomic data with tools like BioMart, making it easy to query and extract information such as gene names, coordinates, functions, and orthologs for research purposes.

  1. A yeast gene annotation file was obtained from the Ensembl data base. This file can be donwloaded here.

Import the data and add the annotation to the de_res data frame using merge() function.

annot <- read.csv(
  "../exos_data/yeast_gene_annot.csv", # replace the path by your own
  header = TRUE
)

de_res <- merge(de_res, annot, by = "gene_name", all.x = TRUE)
head(de_res)
  gene_name  baseMean log2FoldChange     lfcSE         stat      pvalue
1      ACM1  140.4359    0.733261747 0.4864127  1.507488856 0.131685398
2     APQ12 5422.7091   -0.640132968 0.2366950 -2.704463027 0.006841488
3     CDC20  216.8074    0.788541627 0.4908836  1.606371984 0.108192203
4      CDC5 1282.0508    0.172085182 0.4798742  0.358604793 0.719890761
5      CLB1  927.0893   -0.145822741 0.6384089 -0.228415908 0.819322924
6      CLB2  255.6934   -0.001034076 0.5231743 -0.001976543 0.998422948
        padj gene_category ensembl_id chromosome  start    end
1 0.59258429            up    YPL267W        XVI  38169  38798
2 0.07611881          down    YIL040W         IX 277723 278139
3 0.57921173            up    YGL116W        VII 289809 291641
4 0.90627615       neutral    YMR001C       XIII 269019 271136
5 0.91496751       neutral    YGR108W        VII 703636 705051
6 0.99842295       neutral    YPR119W        XVI 771653 773128
                                                                                                                                                                                                                                                                                                                                                                                                                                                       description
1                                                                                                                                                                                       Pseudosubstrate inhibitor of the APC/C; suppresses APC/C [Cdh1]-mediated proteolysis of mitotic cyclins; associates with Cdh1p, Bmh1p and Bmh2p; cell cycle regulated protein; the anaphase-promoting complex/cyclosome is also known as APC/C [Source:SGD;Acc:S000006188]
2                     Nuclear envelope/ER integral membrane protein; interacts and functions with Brr6p and Brl1p in lipid homeostasis; mutants are defective in nuclear pore complex biogenesis, nuclear envelope morphology, mRNA export from the nucleus and are sensitive to sterol biosynthesis inhibitors and membrane fluidizing agents; exhibits synthetic lethal genetic interactions with genes involved in lipid metabolism [Source:SGD;Acc:S000001302]
3                                                                                                        Activator of anaphase-promoting complex/cyclosome (APC/C); APC/C is required for metaphase/anaphase transition; directs ubiquitination of mitotic cyclins, Pds1p, and other anaphase inhibitors; cell-cycle regulated; potential Cdc28p substrate; relative distribution to the nucleus increases upon DNA replication stress [Source:SGD;Acc:S000003084]
4 Polo-like kinase; controls targeting and activation of Rho1p at cell division site via Rho1p guanine nucleotide exchange factors; regulates Spc72p; also functions in adaptation to DNA damage during meiosis; regulates the shape of the nucleus and expansion of the nuclear envelope during mitosis; similar to Xenopus Plx1 and S. pombe Plo1p; human homologs PLK1, PLK3 can each complement yeast cdc5 thermosensitive mutants [Source:SGD;Acc:S000004603]
5                                                                                                                 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB1 has a paralog, CLB2, that arose from the whole genome duplication [Source:SGD;Acc:S000003340]
6                                                                                                                 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB2 has a paralog, CLB1, that arose from the whole genome duplication [Source:SGD;Acc:S000006323]

Bravo! 🎉 You’ve learned the basics of R, and you’re already making great progress, keep it up!