<- read.csv(
de_res file = "../exos_data/toy_DEanalysis.csv", # replace the path with your own
header = TRUE
)
Week 4 - Hands-On Examples
The R script is available here: link
Goals
- Understand and use operators to filter data with precision.
- Understand function’s basic structure, learn to use new functions, write functions to automate tasks.
Import Data
A gene-level differential expression (DE) analysis was performed to compare SET1 samples to WT samples using data from read-counts.csv
.
The analysis results are available via this link.
- Please donwload the result file and upload it to your data folder.
- Import the data using the
read.csv()
function. (See the documentation with?read.csv
) Name the imported resultsde_res
.
Exercises
- Check the structure of
de_res
using an appropriate R function. What are the dimensions?
str(de_res)
'data.frame': 45 obs. of 7 variables:
$ gene_name : chr "HTB2" "HHF1" "HHT1" "POL30" ...
$ baseMean : num 20259 9821 1539 1274 316 ...
$ log2FoldChange: num -0.3757 0.1789 0.0866 0.4165 0.2189 ...
$ lfcSE : num 0.447 0.536 0.412 0.422 0.434 ...
$ stat : num -0.841 0.334 0.21 0.988 0.505 ...
$ pvalue : num 0.4 0.739 0.834 0.323 0.614 ...
$ padj : num 0.891 0.906 0.915 0.891 0.891 ...
The result is a data frame with 45 rows and 7 columns.
The result contains following columns:
gene_name
: gene namebaseMean
: mean of normalized counts for all sampleslog2FoldChange
: log2 fold changelfcSE
: standard errorstat
: Wald statisticpvalue
: Wald test p-valuepadj
: adjusted p-values (Benjamini-Hochberg procedure)
- Filter the rows where the gene has a log2 fold change (
log2FoldChange
) greater than 0.5.
$log2FoldChange > 0.5, ] de_res[de_res
gene_name baseMean log2FoldChange lfcSE stat pvalue
9 SUT476 88.12355 0.9935943 0.3856808 2.576209 9.989029e-03
15 CDC20 216.80742 0.7885416 0.4908836 1.606372 1.081922e-01
16 CLB6 111.03713 0.7373706 0.5448055 1.353457 1.759098e-01
21 LOH1 48.61440 2.2259655 0.3969997 5.606971 2.058986e-08
29 SUT2873 26.00372 1.4253323 0.3685458 3.867449 1.099797e-04
40 ACM1 140.43590 0.7332617 0.4864127 1.507489 1.316854e-01
padj
9 7.611881e-02
15 5.792117e-01
16 7.196311e-01
21 9.265435e-07
29 1.649695e-03
40 5.925843e-01
- Filter the rows where the gene has a log2 fold change smaller than -0.5.
$log2FoldChange < -0.5, ] de_res[de_res
gene_name baseMean log2FoldChange lfcSE stat pvalue
10 APQ12 5422.709150 -0.6401330 0.2366950 -2.704463 0.0068414883
19 FAR1 5927.470166 -1.5095445 0.6754613 -2.234835 0.0254281962
20 SUT24 155.788397 -0.8436243 0.3281685 -2.570705 0.0101491743
22 PIR3 304.294496 -2.3659253 0.6082583 -3.889672 0.0001003798
27 TUB33 2.952205 -0.8718026 0.5544172 -1.572467 0.1158423457
padj
10 0.076118807
19 0.163466976
20 0.076118807
22 0.001649695
27 0.579211728
- Filter the rows where the gene has a log2 fold change greater than 0.5 or smaller than -0.5.
$log2FoldChange > 0.5 | de_res$log2FoldChange < -0.5, ] de_res[de_res
gene_name baseMean log2FoldChange lfcSE stat pvalue
9 SUT476 88.123550 0.9935943 0.3856808 2.576209 9.989029e-03
10 APQ12 5422.709150 -0.6401330 0.2366950 -2.704463 6.841488e-03
15 CDC20 216.807415 0.7885416 0.4908836 1.606372 1.081922e-01
16 CLB6 111.037134 0.7373706 0.5448055 1.353457 1.759098e-01
19 FAR1 5927.470166 -1.5095445 0.6754613 -2.234835 2.542820e-02
20 SUT24 155.788397 -0.8436243 0.3281685 -2.570705 1.014917e-02
21 LOH1 48.614399 2.2259655 0.3969997 5.606971 2.058986e-08
22 PIR3 304.294496 -2.3659253 0.6082583 -3.889672 1.003798e-04
27 TUB33 2.952205 -0.8718026 0.5544172 -1.572467 1.158423e-01
29 SUT2873 26.003715 1.4253323 0.3685458 3.867449 1.099797e-04
40 ACM1 140.435903 0.7332617 0.4864127 1.507489 1.316854e-01
padj
9 7.611881e-02
10 7.611881e-02
15 5.792117e-01
16 7.196311e-01
19 1.634670e-01
20 7.611881e-02
21 9.265435e-07
22 1.649695e-03
27 5.792117e-01
29 1.649695e-03
40 5.925843e-01
## Bonus: we can test the absolute value of log2FoldChange to simplify condition
abs(c(0.5, -0.5)) # how abs() works
[1] 0.5 0.5
abs(de_res$log2FoldChange) > 0.5, ] de_res[
gene_name baseMean log2FoldChange lfcSE stat pvalue
9 SUT476 88.123550 0.9935943 0.3856808 2.576209 9.989029e-03
10 APQ12 5422.709150 -0.6401330 0.2366950 -2.704463 6.841488e-03
15 CDC20 216.807415 0.7885416 0.4908836 1.606372 1.081922e-01
16 CLB6 111.037134 0.7373706 0.5448055 1.353457 1.759098e-01
19 FAR1 5927.470166 -1.5095445 0.6754613 -2.234835 2.542820e-02
20 SUT24 155.788397 -0.8436243 0.3281685 -2.570705 1.014917e-02
21 LOH1 48.614399 2.2259655 0.3969997 5.606971 2.058986e-08
22 PIR3 304.294496 -2.3659253 0.6082583 -3.889672 1.003798e-04
27 TUB33 2.952205 -0.8718026 0.5544172 -1.572467 1.158423e-01
29 SUT2873 26.003715 1.4253323 0.3685458 3.867449 1.099797e-04
40 ACM1 140.435903 0.7332617 0.4864127 1.507489 1.316854e-01
padj
9 7.611881e-02
10 7.611881e-02
15 5.792117e-01
16 7.196311e-01
19 1.634670e-01
20 7.611881e-02
21 9.265435e-07
22 1.649695e-03
27 5.792117e-01
29 1.649695e-03
40 5.925843e-01
- Filter the rows where the gene has a log2 fold change greater than 0.5 and adjusted p-value (
padj
) smaller than 0.05.
$log2FoldChange > 0.5 & de_res$padj < 0.05, ] de_res[de_res
gene_name baseMean log2FoldChange lfcSE stat pvalue
21 LOH1 48.61440 2.225966 0.3969997 5.606971 2.058986e-08
29 SUT2873 26.00372 1.425332 0.3685458 3.867449 1.099797e-04
padj
21 9.265435e-07
29 1.649695e-03
Multiple Tests Correction
-
Why multiple testing is a problem?
When performing multiple statistical tests, the probability of making at least one Type I error (false positive) increases with the number of tests.
For instance, if we perform 100 independent tests with a significance level (\(\alpha\)) of 5%, the chance of incorrectly rejecting at least one null hypothesis is no longer 5%, but much higher. This is because the errors accumulate across the tests.
If we do 100 tests simultaneously and set and use \(\alpha\) at 0.05, the probability to do at least one error is:
\[ \begin{aligned} P(\text{at least 1 significant result by chance}) &= 1- P(\text{non significant results}) \\ &= 1 – (1 - 0.05)^{100} \\ &= 0.99 \end{aligned} \]
-
Multiple test correction
To address this issue and control the overall Type I error rate, statistical corrections like the Bonferroni correction or False Discovery Rate (FDR) adjustments are commonly used in multiple testing scenarios.
- Bonferroni correction: adjust the significance threshold (\(\alpha\)) to account for the number of tests (Ntest) being performed, i.e., \(\alpha_{adjusted}= \frac{\alpha}{\text{Ntest}}\)
- FDR (False discovery rate): control the proportion of false positive amongst all significant results, e.g.: Benjamini-Hochberg (BH) procedure.
\[ \begin{aligned} P(\text{at least 1 significant result by chance}) &= 1 – (1 - \frac{0.05}{100})^{100} \\ &= 0.049 \end{aligned} \]
- Extract results for these genes: RNR1, PIR3, SRP68.
$gene_name %in% c("RNR1", "PIR3", "SRP68"), ] de_res[de_res
gene_name baseMean log2FoldChange lfcSE stat pvalue
11 RNR1 1374.1805 -0.3810586 0.4335968 -0.8788317 0.3794925108
22 PIR3 304.2945 -2.3659253 0.6082583 -3.8896719 0.0001003798
42 SRP68 1058.4173 -0.1198641 0.2107691 -0.5686988 0.5695605340
padj
11 0.891290926
22 0.001649695
42 0.891290926
- Use
ifelse()
to categorize genes. Add a new column,gene_category
, that assigns categories:
- “up” if
log2FoldChange > 0.5
. - “down” if
log2FoldChange < -0.5
. - “neutral” otherwise.
"gene_category"]] <- ifelse(
de_res[[test = de_res[["log2FoldChange"]] > 0.5,
yes = "up",
no = ifelse(
test = de_res[["log2FoldChange"]] < -0.5,
yes = "down",
no = "neutral"
) )
- Use
table()
to count the occurrences of each gene category.
table(de_res[["gene_category"]])
down neutral up
5 34 6
Bonus Questions
- Write a function to automate “de_res” filtering for genes with a p-value less than or equal to a custom cutoff.
<- function(cutoff = 0.05) {
filter_p $pvalue <= cutoff, ]
de_res[de_res
}filter_p(0.01)
gene_name baseMean log2FoldChange lfcSE stat pvalue
9 SUT476 88.12355 0.9935943 0.3856808 2.576209 9.989029e-03
10 APQ12 5422.70915 -0.6401330 0.2366950 -2.704463 6.841488e-03
21 LOH1 48.61440 2.2259655 0.3969997 5.606971 2.058986e-08
22 PIR3 304.29450 -2.3659253 0.6082583 -3.889672 1.003798e-04
29 SUT2873 26.00372 1.4253323 0.3685458 3.867449 1.099797e-04
padj gene_category
9 7.611881e-02 up
10 7.611881e-02 down
21 9.265435e-07 up
22 1.649695e-03 down
29 1.649695e-03 up
filter_p(0.0001)
gene_name baseMean log2FoldChange lfcSE stat pvalue
21 LOH1 48.6144 2.225966 0.3969997 5.606971 2.058986e-08
padj gene_category
21 9.265435e-07 up
- Based on the function created in question 9, modify the function to allow output ordered by any desired column in
de_res
.
Hints: You need an extra parameter to specify the wanted column and another parameter to fix the cutoff.
<- function(col = "pvalue", cutoff = 0.05) {
filter_col <= cutoff, ]
de_res[de_res[[col]]
}filter_col(cutoff = 0.0001)
gene_name baseMean log2FoldChange lfcSE stat pvalue
21 LOH1 48.6144 2.225966 0.3969997 5.606971 2.058986e-08
padj gene_category
21 9.265435e-07 up
filter_col(col = "log2FoldChange", cutoff = -1)
gene_name baseMean log2FoldChange lfcSE stat pvalue
19 FAR1 5927.4702 -1.509545 0.6754613 -2.234835 0.0254281962
22 PIR3 304.2945 -2.365925 0.6082583 -3.889672 0.0001003798
padj gene_category
19 0.163466976 down
22 0.001649695 down
Ensembl Data Base
Ensembl is a comprehensive genome database that provides detailed information on genes and their annotations across a wide range of species (humain, mouse, zebrafish, etc.). It integrates genomic data with tools like BioMart, making it easy to query and extract information such as gene names, coordinates, functions, and orthologs for research purposes.
- A yeast gene annotation file was obtained from the Ensembl data base. This file can be donwloaded here.
Import the data and add the annotation to the de_res
data frame using merge()
function.
<- read.csv(
annot "../exos_data/yeast_gene_annot.csv", # replace the path by your own
header = TRUE
)
<- merge(de_res, annot, by = "gene_name", all.x = TRUE)
de_res head(de_res)
gene_name baseMean log2FoldChange lfcSE stat pvalue
1 ACM1 140.4359 0.733261747 0.4864127 1.507488856 0.131685398
2 APQ12 5422.7091 -0.640132968 0.2366950 -2.704463027 0.006841488
3 CDC20 216.8074 0.788541627 0.4908836 1.606371984 0.108192203
4 CDC5 1282.0508 0.172085182 0.4798742 0.358604793 0.719890761
5 CLB1 927.0893 -0.145822741 0.6384089 -0.228415908 0.819322924
6 CLB2 255.6934 -0.001034076 0.5231743 -0.001976543 0.998422948
padj gene_category ensembl_id chromosome start end
1 0.59258429 up YPL267W XVI 38169 38798
2 0.07611881 down YIL040W IX 277723 278139
3 0.57921173 up YGL116W VII 289809 291641
4 0.90627615 neutral YMR001C XIII 269019 271136
5 0.91496751 neutral YGR108W VII 703636 705051
6 0.99842295 neutral YPR119W XVI 771653 773128
description
1 Pseudosubstrate inhibitor of the APC/C; suppresses APC/C [Cdh1]-mediated proteolysis of mitotic cyclins; associates with Cdh1p, Bmh1p and Bmh2p; cell cycle regulated protein; the anaphase-promoting complex/cyclosome is also known as APC/C [Source:SGD;Acc:S000006188]
2 Nuclear envelope/ER integral membrane protein; interacts and functions with Brr6p and Brl1p in lipid homeostasis; mutants are defective in nuclear pore complex biogenesis, nuclear envelope morphology, mRNA export from the nucleus and are sensitive to sterol biosynthesis inhibitors and membrane fluidizing agents; exhibits synthetic lethal genetic interactions with genes involved in lipid metabolism [Source:SGD;Acc:S000001302]
3 Activator of anaphase-promoting complex/cyclosome (APC/C); APC/C is required for metaphase/anaphase transition; directs ubiquitination of mitotic cyclins, Pds1p, and other anaphase inhibitors; cell-cycle regulated; potential Cdc28p substrate; relative distribution to the nucleus increases upon DNA replication stress [Source:SGD;Acc:S000003084]
4 Polo-like kinase; controls targeting and activation of Rho1p at cell division site via Rho1p guanine nucleotide exchange factors; regulates Spc72p; also functions in adaptation to DNA damage during meiosis; regulates the shape of the nucleus and expansion of the nuclear envelope during mitosis; similar to Xenopus Plx1 and S. pombe Plo1p; human homologs PLK1, PLK3 can each complement yeast cdc5 thermosensitive mutants [Source:SGD;Acc:S000004603]
5 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB1 has a paralog, CLB2, that arose from the whole genome duplication [Source:SGD;Acc:S000003340]
6 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB2 has a paralog, CLB1, that arose from the whole genome duplication [Source:SGD;Acc:S000006323]