Donwload the result file and upload it to your data folder.
Import the data using the read_csv() function from the package readr. (See the documentation with ?read_csv) Name the imported results de_res.
library(readr)de_res <-read_csv(file ="../exos_data/toy_DEanalysis.csv", # replace the path with your owncol_names =TRUE)
Rows: 45 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene_name
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exercises
Check the structure of de_res using an appropriate R function. What are the dimensions?
When performing multiple statistical tests, the probability of making at least one Type I error (false positive) increases with the number of tests.
For instance, if we perform 100 independent tests with a significance level (\(\alpha\)) of 5%, the chance of incorrectly rejecting at least one null hypothesis is no longer 5%, but much higher. This is because the errors accumulate across the tests.
If we do 100 tests simultaneously and set and use \(\alpha\) at 0.05, the probability to do at least one error is:
\[
\begin{aligned}
P(\text{at least 1 significant result by chance}) &= 1- P(\text{non significant results}) \\
&= 1 – (1 - 0.05)^{100} \\
&= 0.99
\end{aligned}
\]
Multiple test correction
To address this issue and control the overall Type I error rate, statistical corrections like the Bonferroni correction or False Discovery Rate (FDR) adjustments are commonly used in multiple testing scenarios.
Bonferroni correction: adjust the significance threshold (\(\alpha\)) to account for the number of tests (Ntest) being performed, i.e., \(\alpha_{adjusted}= \frac{\alpha}{\text{Ntest}}\)
\[
\begin{aligned}
P(\text{at least 1 significant result by chance}) &= 1 – (1 - \frac{0.05}{100})^{100} \\
&= 0.049
\end{aligned}
\]
FDR (False discovery rate): control the proportion of false positive amongst all significant results, e.g.: Benjamini-Hochberg (BH) procedure.
Extract results for these genes: RNR1, PIR3, SRP68.
Use table() to count the occurrences of each gene category. (?table)
table(de_res[["gene_category"]])
down neutral up
5 34 6
Note
Ensembl Data Base
Ensembl is a comprehensive genome database that provides detailed information on genes and their annotations across a wide range of species (humain, mouse, zebrafish, etc.). It integrates genomic data with tools like BioMart, making it easy to query and extract information such as gene names, coordinates, functions, and orthologs for research purposes.
A yeast gene annotation file was obtained from the Ensembl data base. This file can be donwloaded here.
Import the data and add the annotation to the de_res data frame using merge() function. (?merge)
annot <-read_csv("../exos_data/yeast_gene_annot.csv"# replace the path by your own)
Rows: 7127 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ensembl_id, gene_name, chromosome, description
dbl (2): start, end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
de_res <-merge(de_res, annot, by ="gene_name", all.x =TRUE)head(de_res)
gene_name baseMean log2FoldChange lfcSE stat pvalue
1 ACM1 140.4359 0.733261747 0.4864127 1.507488856 0.131685398
2 APQ12 5422.7091 -0.640132968 0.2366950 -2.704463027 0.006841488
3 CDC20 216.8074 0.788541627 0.4908836 1.606371984 0.108192203
4 CDC5 1282.0508 0.172085182 0.4798742 0.358604793 0.719890761
5 CLB1 927.0893 -0.145822741 0.6384089 -0.228415908 0.819322924
6 CLB2 255.6934 -0.001034076 0.5231743 -0.001976543 0.998422948
padj gene_category ensembl_id chromosome start end
1 0.59258429 up YPL267W XVI 38169 38798
2 0.07611881 down YIL040W IX 277723 278139
3 0.57921173 up YGL116W VII 289809 291641
4 0.90627615 neutral YMR001C XIII 269019 271136
5 0.91496751 neutral YGR108W VII 703636 705051
6 0.99842295 neutral YPR119W XVI 771653 773128
description
1 Pseudosubstrate inhibitor of the APC/C; suppresses APC/C [Cdh1]-mediated proteolysis of mitotic cyclins; associates with Cdh1p, Bmh1p and Bmh2p; cell cycle regulated protein; the anaphase-promoting complex/cyclosome is also known as APC/C [Source:SGD;Acc:S000006188]
2 Nuclear envelope/ER integral membrane protein; interacts and functions with Brr6p and Brl1p in lipid homeostasis; mutants are defective in nuclear pore complex biogenesis, nuclear envelope morphology, mRNA export from the nucleus and are sensitive to sterol biosynthesis inhibitors and membrane fluidizing agents; exhibits synthetic lethal genetic interactions with genes involved in lipid metabolism [Source:SGD;Acc:S000001302]
3 Activator of anaphase-promoting complex/cyclosome (APC/C); APC/C is required for metaphase/anaphase transition; directs ubiquitination of mitotic cyclins, Pds1p, and other anaphase inhibitors; cell-cycle regulated; potential Cdc28p substrate; relative distribution to the nucleus increases upon DNA replication stress [Source:SGD;Acc:S000003084]
4 Polo-like kinase; controls targeting and activation of Rho1p at cell division site via Rho1p guanine nucleotide exchange factors; regulates Spc72p; also functions in adaptation to DNA damage during meiosis; regulates the shape of the nucleus and expansion of the nuclear envelope during mitosis; similar to Xenopus Plx1 and S. pombe Plo1p; human homologs PLK1, PLK3 can each complement yeast cdc5 thermosensitive mutants [Source:SGD;Acc:S000004603]
5 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB1 has a paralog, CLB2, that arose from the whole genome duplication [Source:SGD;Acc:S000003340]
6 B-type cyclin involved in cell cycle progression; activates Cdc28p to promote the transition from G2 to M phase; accumulates during G2 and M, then targeted via a destruction box motif for ubiquitin-mediated degradation by the proteasome; CLB2 has a paralog, CLB1, that arose from the whole genome duplication [Source:SGD;Acc:S000006323]
Bravo! 🎉 You’ve learned the basics of R, and you’re already making great progress, keep it up!