IOC-R Week 9
stringr
Where do we actually encounter strings (text) in our data?
What happens when these strings are messy, inconsistent, or need to be extracted?
stringr Package
Almost all functions start with str_.
str_length()
str_to_title(), str_to_upper(), str_to_lower()
str_detect(), str_count()
1: str_detect() and str_count() need a pattern, which is interpreted as a regular expression (regex). By default, the pattern is case-sensitive, but this can be changed.
Cheat sheet for stringr: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
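The functions listed above can be tried on a small vector. This is a minimal sketch: the input vector `genes` is an assumption chosen for illustration (it is not shown in the slides), and `regex(ignore_case = TRUE)` is one way stringr offers to switch off case sensitivity.

```r
library(stringr)

# Hypothetical input: gene symbols with inconsistent capitalisation
genes <- c("adcy3", "SOX5", "lep")

n_chars <- str_length(genes)    # number of characters per string: 5 4 3
title   <- str_to_title(genes)  # "Adcy3" "Sox5" "Lep"
upper   <- str_to_upper(genes)  # "ADCY3" "SOX5" "LEP"
lower   <- str_to_lower(genes)  # "adcy3" "sox5" "lep"

# Patterns are case-sensitive by default, so "sox" does not match "SOX5"
strict  <- str_detect(genes, "sox")                             # FALSE FALSE FALSE
# Wrapping the pattern in regex() with ignore_case = TRUE relaxes this
relaxed <- str_detect(genes, regex("sox", ignore_case = TRUE))  # FALSE TRUE FALSE
```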
stringr - Manage Strings
str_length(): count the number of characters in a string.
stringr - Mutate Strings
str_to_title(): convert string to title case.
[1] "Adcy3" "Sox5" "Lep"
str_to_upper(): convert string to upper case.
str_to_lower(): convert string to lower case.
stringr - Detect Strings
str_detect(): detect presence of a match, returns TRUE or FALSE.
str_count(): count number of matches.
[]: Matches any one of the characters inside.
^ or $: Matches the beginning or the end of a string.
pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System", "Inflammatory Response Pathway",
  "Cell Cycle Regulation", "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)
str_detect(pathways, pattern = "^In") # starts with "In"
[1] FALSE FALSE TRUE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE TRUE FALSE
|: Matches one pattern or another.
Cheat sheet for regular expressions: https://github.com/rstudio/cheatsheets/blob/main/regex.pdf
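The regex building blocks described above ([], ^, $ and |) can be sketched on the same `pathways` vector. The specific patterns are assumptions for illustration. The "System$" call happens to return the second logical vector printed in the slides, but the original call is not shown there, so treat it as a guess rather than the author's code.

```r
library(stringr)

pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System", "Inflammatory Response Pathway",
  "Cell Cycle Regulation", "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)

# $ anchors the pattern at the end of the string
ends_system <- str_detect(pathways, "System$")
# [] matches any one of the listed characters; ^ anchors at the start
starts_a_or_c <- str_detect(pathways, "^[AC]")
# | matches one pattern or the other
innate_or_adaptive <- str_detect(pathways, "Innate|Adaptive")
# str_count() returns the number of matches per string
n_immune <- str_count(pathways, "Immune")
```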
tidyverse
readr
read_delim(file, delim, ...) # the general function
read_csv()
read_csv2() # use ";" as separator and "," for decimal
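The three readers can be compared without any file on disk. This sketch assumes readr 2.0 or later, where wrapping a string in I() marks it as literal data rather than a path. The inline text and the column names `gene` and `gc_content` are made up for illustration.

```r
library(readr)

# Hypothetical inline data; I() tells readr this is content, not a file path
comma_text     <- I("gene,gc_content\nYBR024W,40.0\nYDL245C,41.9\n")
semicolon_text <- I("gene;gc_content\nYBR024W;40,0\nYDL245C;41,9\n")

# General function: the delimiter is given explicitly
via_delim <- read_delim(comma_text, delim = ",", show_col_types = FALSE)
# Shortcut for comma-separated values
via_csv   <- read_csv(comma_text, show_col_types = FALSE)
# Shortcut for ";" as separator and "," as decimal mark
via_csv2  <- read_csv2(semicolon_text, show_col_types = FALSE)
```

All three calls should yield the same two-row tibble, with `gc_content` parsed as a number.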
tidyr
The annotation data for yeast (Saccharomyces cerevisiae) genes was obtained from the Ensembl database.
# A tibble: 7,127 × 9
`Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
<chr> <chr> <dbl> <dbl>
1 YBR024W protein_coding 40.0 1
2 YDL245C protein_coding 41.9 1
3 YBR232C protein_coding 44.2 1
4 YDR320W-B protein_coding 34.8 1
5 YBR021W protein_coding 39.9 1
6 YGR014W protein_coding 43.8 1
7 tT(AGU)O2 tRNA 48.0 1
8 YKL119C protein_coding 36.3 1
9 YPR031W protein_coding 37.0 1
10 YKL066W protein_coding 38.7 1
# ℹ 7,117 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
# `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>
dplyr
count() occurrences.
dplyr
filter() rows based on values in columns.
# A tibble: 3 × 9
`Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
<chr> <chr> <dbl> <dbl>
1 SRG1 ncRNA 35.4 1
2 SCR1 ncRNA 55.0 1
3 RPR1 ncRNA 51.5 1
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
# `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>
dplyr
filter() rows based on values in columns.
# A tibble: 424 × 9
`Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
<chr> <chr> <dbl> <dbl>
1 tT(AGU)O2 tRNA 48.0 1
2 tR(ACG)E tRNA 58.9 1
3 Q0158 rRNA 21.2 1
4 tG(UCC)O tRNA 56.9 1
5 tY(GUA)F1 tRNA 52.8 1
6 tS(GCU)O tRNA 52.5 1
7 snR7-L snRNA 44.4 1
8 snR82 snoRNA 39.6 1
9 RDN58-2 rRNA 46.2 1
10 tI(AAU)E2 tRNA 58.1 1
# ℹ 414 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
# `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>
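The calls that produced the two filtered tibbles above are not shown in the slides. As a hedged sketch of how filter() is typically used, the example below builds a small tibble, `annot_small` (a hypothetical name), by hand from a few rows printed above and filters it by exact match, by pattern, and by two conditions at once.

```r
library(dplyr)
library(stringr)

# Hand-typed excerpt of rows printed in the slides (not the full annotation)
annot_small <- tibble(
  `Gene stable ID`    = c("YBR024W", "tT(AGU)O2", "Q0158", "snR82", "SRG1"),
  `Gene type`         = c("protein_coding", "tRNA", "rRNA", "snoRNA", "ncRNA"),
  `Gene % GC content` = c(40.0, 48.0, 21.2, 39.6, 35.4)
)

# Exact match on one value
only_ncrna <- annot_small |>
  filter(`Gene type` == "ncRNA")

# Pattern match: every gene type that contains "RNA"
any_rna <- annot_small |>
  filter(str_detect(`Gene type`, "RNA"))

# Several conditions separated by commas are combined with AND
rna_high_gc <- annot_small |>
  filter(str_detect(`Gene type`, "RNA"), `Gene % GC content` > 40)
```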
dplyr
arrange() rows based on values in columns.
annot |>
  count(`Gene type`, `Chromosome/scaffold name`) |>
  arrange(`Chromosome/scaffold name`, desc(n))
# A tibble: 86 × 3
`Gene type` `Chromosome/scaffold name` n
<chr> <chr> <int>
1 protein_coding I 117
2 tRNA I 4
3 transposable_element I 2
4 ncRNA I 1
5 pseudogene I 1
6 snoRNA I 1
7 protein_coding II 456
8 tRNA II 13
9 transposable_element II 6
10 snoRNA II 2
# ℹ 76 more rows
dplyr
select() columns (with helper functions).
# A tibble: 3 × 3
`Gene stable ID` `Gene % GC content` `Gene name`
<chr> <dbl> <chr>
1 YBR024W 40.0 SCO2
2 YDL245C 41.9 HXT15
3 YBR232C 44.2 <NA>
annot |>
  filter(stringr::str_detect(`Gene type`, "RNA")) |>
  select(contains(c("ID", "chrom", "descr"))) |>
  head(3)
# A tibble: 3 × 3
`Gene stable ID` `Chromosome/scaffold name` `Gene description`
<chr> <chr> <chr>
1 tT(AGU)O2 XV Threonine tRNA (tRNA-Thr), predic…
2 tR(ACG)E V Arginine tRNA (tRNA-Arg), predict…
3 Q0158 Mito Mitochondrial 21S rRNA; intron en…
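A few more selection helpers, sketched on a hand-typed excerpt of the rows shown above (`annot_small` is a hypothetical name). Note that contains() ignores case by default, which is why "chrom" matches `Chromosome/scaffold name` in the call above.

```r
library(dplyr)

# Hand-typed excerpt of rows printed in the slides
annot_small <- tibble(
  `Gene stable ID`    = c("YBR024W", "YDL245C", "YBR232C"),
  `Gene % GC content` = c(40.0, 41.9, 44.2),
  `Gene name`         = c("SCO2", "HXT15", NA)
)

by_prefix   <- annot_small |> select(starts_with("Gene"))  # all three columns
by_suffix   <- annot_small |> select(ends_with("name"))    # `Gene name`
by_contains <- annot_small |> select(contains("id"))       # `Gene stable ID` (case-insensitive)
dropped     <- annot_small |> select(!`Gene name`)         # everything except `Gene name`
```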
dplyr
mutate() columns.
annot |>
  mutate(gene_group = ifelse(
    test = `Gene type` == "protein_coding",
    yes = "protein_coding",
    no = "non protein_coding"
  )) |>
  select(1:2, gene_group)
# A tibble: 7,127 × 3
`Gene stable ID` `Gene type` gene_group
<chr> <chr> <chr>
1 YBR024W protein_coding protein_coding
2 YDL245C protein_coding protein_coding
3 YBR232C protein_coding protein_coding
4 YDR320W-B protein_coding protein_coding
5 YBR021W protein_coding protein_coding
6 YGR014W protein_coding protein_coding
7 tT(AGU)O2 tRNA non protein_coding
8 YKL119C protein_coding protein_coding
9 YPR031W protein_coding protein_coding
10 YKL066W protein_coding protein_coding
# ℹ 7,117 more rows
dplyr
summarize() data.
dplyr
group_by() data. (Use ungroup() to remove grouping.)
annot |>
  group_by(`Chromosome/scaffold name`) |>
  summarize(
    mean_gc = mean(`Gene % GC content`),
    max_gc = max(`Gene % GC content`),
    min_gc = min(`Gene % GC content`),
  )
# A tibble: 17 × 4
`Chromosome/scaffold name` mean_gc max_gc min_gc
<chr> <dbl> <dbl> <dbl>
1 I 41.9 60.3 25.8
2 II 40.5 58.3 20
3 III 41.8 66.7 26
4 IV 40.0 64.4 25.2
5 IX 41.2 58.3 30.2
6 Mito 26.4 44.9 8.18
7 V 41.5 65.3 27.6
8 VI 41.9 64.4 30.7
9 VII 40.6 64.4 27.4
10 VIII 40.9 64.4 30.1
11 X 41.2 65.3 31.7
12 XI 40.6 64.4 26
13 XII 41.0 64.4 26.0
14 XIII 40.3 64.4 27.6
15 XIV 40.8 66.7 28.7
16 XV 40.3 65.3 23
17 XVI 40.3 65.3 27.1
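The same grouping idea can be tried on a small, self-contained excerpt. This sketch groups by gene type instead of chromosome. `rna_small` is a hypothetical name, and its eight rows are typed by hand from the tRNA and rRNA rows printed earlier. It also shows where ungroup() matters: after a grouped mutate(), the grouping stays attached until it is removed.

```r
library(dplyr)

# Hand-typed excerpt: gene type and GC content of eight RNA genes from the slides
rna_small <- tibble(
  `Gene type`         = c("tRNA", "tRNA", "rRNA", "tRNA", "tRNA", "tRNA", "rRNA", "tRNA"),
  `Gene % GC content` = c(48.0, 58.9, 21.2, 56.9, 52.8, 52.5, 46.2, 58.1)
)

# One summary row per group; n() counts the rows in each group
gc_by_type <- rna_small |>
  group_by(`Gene type`) |>
  summarize(
    n_genes = n(),
    mean_gc = mean(`Gene % GC content`)
  )

# Grouped mutate(): centre GC content within each gene type, then drop the grouping
gc_centered <- rna_small |>
  group_by(`Gene type`) |>
  mutate(gc_centered = `Gene % GC content` - mean(`Gene % GC content`)) |>
  ungroup()
```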
ggplot2
Syntax: ggplot(data, aes(x, y)) + geom_xxx() + ...
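The syntax above can be filled in as follows. This is a minimal sketch: the data frame `gc_small` and its column names are hypothetical, with values typed by hand from the tRNA and rRNA rows printed earlier, and the choice of geoms is only one of many possibilities.

```r
library(ggplot2)

# Hand-typed excerpt: GC content of a few RNA genes from the slides
gc_small <- data.frame(
  gene_type  = c("tRNA", "tRNA", "rRNA", "tRNA", "tRNA", "tRNA", "rRNA", "tRNA"),
  gc_content = c(48.0, 58.9, 21.2, 56.9, 52.8, 52.5, 46.2, 58.1)
)

# data + aesthetic mapping, then one layer per geom, added with +
p <- ggplot(gc_small, aes(x = gene_type, y = gc_content)) +
  geom_boxplot() +
  geom_point() +
  labs(x = "Gene type", y = "GC content (%)")

# print(p) draws the plot on the active graphics device
```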
ggplot2
Task: Select a figure from a scientific paper and recreate it using ggplot2. Document your process in a Quarto script and present your work (15 min / person).
Your report should include: