String Tricks & Final Review

IOC-R Week 9

String Manipulation with {stringr}

Strings Everywhere in Biological Data!

Where do we actually encounter strings (text) in our data?

  • Strings stored in data columns, such as gene name, sample name, category.
  • Column or row names of a data frame, names of a named list or vector.
  • File names and paths: e.g., “GSE12345_raw.txt”, “data/results_2023-10.csv”
  • Figure labels: e.g., axis titles, plot title.

What happens when these strings are messy, inconsistent, or need to be extracted?

The {stringr} Package

library(tidyverse)
# library(stringr) # or load only this package if needed

Almost all functions start with str_.

  • Manage strings: str_length()
  • Mutate strings: str_to_title(), str_to_upper(), str_to_lower()
  • Detect strings (1): str_detect(), str_count()

(1) These functions need a pattern, which is interpreted as a regular expression (regex). By default, the pattern is case-sensitive, but this can be changed.

stringr - Manage Strings

  • str_length(): count the number of characters in a string.
seq1 <- "ATGCGTAGCTAGGCTATCCGA"

# using basic functions
length(unlist(strsplit(x = seq1, split = ""))) 
[1] 21
# use stringr's function
str_length(string = seq1)
[1] 21
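
Because str_length() is vectorised, it can measure a whole vector of names in one call. A minimal sketch, reusing gene symbols that appear later in this deck; the NA entry is added here only for illustration:

```r
library(stringr)

gene_names <- c("ADCY3", "SOX5", "LEP", NA)

# one length per element; a missing value stays missing
gene_lengths <- str_length(gene_names)
gene_lengths
# [1]  5  4  3 NA
```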

stringr - Mutate Strings

  • str_to_title(): convert string to title case.
human_gene <- c("ADCY3", "SOX5", "LEP")
mouse_gene <- str_to_title(human_gene)
mouse_gene
[1] "Adcy3" "Sox5"  "Lep"  
  • str_to_upper(): convert string to upper case.
str_to_upper(mouse_gene)
[1] "ADCY3" "SOX5"  "LEP"  
  • str_to_lower(): convert string to lower case.
str_to_lower(mouse_gene)
[1] "adcy3" "sox5"  "lep"  
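
One common use of the case functions is to harmonise names before comparing them. The sketch below assumes a hypothetical vector of gene symbols typed with inconsistent case; it is an illustration, not part of the course data:

```r
library(stringr)

# hypothetical gene symbols typed with inconsistent case
messy_genes <- c("adcy3", "Sox5", "LEP", "lep")
reference <- c("ADCY3", "SOX5", "LEP")

# compare after converting everything to upper case
clean_genes <- str_to_upper(messy_genes)
clean_genes %in% reference
# [1] TRUE TRUE TRUE TRUE

# duplicates become visible once the case is consistent
unique_genes <- unique(clean_genes)
```

Here "LEP" and "lep" collapse into a single entry after conversion, which a direct comparison of the raw strings would miss.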

stringr - Detect Strings

  • str_detect(): detect presence of a match, returns TRUE or FALSE.
seq1
[1] "ATGCGTAGCTAGGCTATCCGA"
str_detect(string = seq1, pattern = "CG")
[1] TRUE
str_detect(string = seq1, pattern = "c")
[1] FALSE
  • str_count(): count number of matches.
seq1
[1] "ATGCGTAGCTAGGCTATCCGA"
# how many C and G are present in the string?
str_count(string = seq1, pattern = "C")
[1] 5
str_count(string = seq1, pattern = "G")
[1] 6
str_count(string = seq1, pattern = "CG")
[1] 2
# using regular expression
str_count(string = seq1, pattern = "[CG]")
[1] 11
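
The note on case sensitivity can be made concrete with stringr's regex() helper, whose ignore_case argument switches case sensitivity off. A short sketch using the same sequence:

```r
library(stringr)

seq1 <- "ATGCGTAGCTAGGCTATCCGA"

# default: case-sensitive, so lower-case "c" is not found
str_detect(seq1, pattern = "c")
# [1] FALSE

# regex(ignore_case = TRUE) lets the same pattern match "C"
found_c <- str_detect(seq1, pattern = regex("c", ignore_case = TRUE))
found_c
# [1] TRUE

n_c <- str_count(seq1, pattern = regex("c", ignore_case = TRUE))
n_c
# [1] 5
```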

Regular Expression

  • []: Match any one of the characters inside.
str_count(string = seq1, pattern = "[CG]") # using regular expression
[1] 11
  • ^ or $: Matches the beginning or the end of a string.
pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System", "Inflammatory Response Pathway",
  "Cell Cycle Regulation", "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)
str_detect(pathways, pattern = "^In") # starts with "In"
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
str_detect(pathways, pattern = "em$") # ends with "em"
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE
  • |: Matches one pattern or another.
str_detect(pathways, pattern = "Response|Receptor")
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
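
These building blocks can be combined in one pattern. The sketch below adds parentheses for grouping, which are standard regex syntax but are not covered on the slides above, so treat them as an extra:

```r
library(stringr)

pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System",
  "Inflammatory Response Pathway", "Cell Cycle Regulation",
  "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)

# starts with "Innate" or "Adaptive": parentheses group the alternatives
is_immune_arm <- str_detect(pathways, pattern = "^(Innate|Adaptive)")
which(is_immune_arm)
# [1] 1 5

# ends with "Pathway" or "System"
ends_ok <- str_detect(pathways, pattern = "(Pathway|System)$")
ends_ok
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
```

Without the parentheses, "^Innate|Adaptive" would mean "starts with Innate" or "contains Adaptive anywhere", because ^ binds only to the first alternative.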

Refresher on {tidyverse}

Import and Export with {readr}

  • Read structured (CSV, TSV) and unstructured (TXT) files.
read_delim(file, delim, ...) # the general function
read_csv()
read_csv2() # use ";" as separator and "," for decimal
  • The file can be a path to a file (compressed or not) or a connection (a link), e.g.:
read_csv2(file = "https://inforbio.github.io/IOC/ioc_r/exos_data/data_anonym_struc1_noise.csv")
  • Observe your data before import: header, separator, decimal, NA strings, quotes, etc.
  • Write structured (CSV, TSV) and unstructured (TXT) files.
write_delim(x, file, delim, na, ...)
write_csv()
write_csv2()
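
A round trip is a quick way to check that the separator, decimal mark and NA string match between export and import. The sketch below uses a small hypothetical table and a temporary file; the column names and values are made up for illustration:

```r
library(readr)

# hypothetical table; the NA and the decimal values are for illustration
genes <- data.frame(
  gene = c("SCO2", "HXT15", NA),
  gc = c(40.1, 41.9, 44.2)
)

# write with ";" as separator and "," as decimal mark
tmp <- tempfile(fileext = ".csv")
write_csv2(genes, file = tmp, na = "NA")

# look at the raw lines before importing
read_lines(tmp)

# read the file back with the matching function
genes_back <- read_csv2(tmp, na = c("", "NA"), show_col_types = FALSE)
```

Reading the same file with read_csv() instead would not split the columns, since it expects "," as the separator.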

Reshape Data with {tidyr}

  • Reshape data to longer or wider format as needed.
data |> pivot_longer(cols = -1, names_to = "condition", values_to = "value")
data |> pivot_wider(names_from = gene, values_from = value)
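
To see what the two calls do, here is a sketch on a small hypothetical expression table (gene and condition names are invented):

```r
library(tidyr)

# hypothetical values: one row per gene, one column per condition
wide <- data.frame(
  gene = c("geneA", "geneB"),
  control = c(10, 20),
  treated = c(15, 5)
)

# longer: every column except the first becomes a (condition, value) pair
long <- wide |>
  pivot_longer(cols = -1, names_to = "condition", values_to = "value")
dim(long)
# [1] 4 3

# wider: one column per gene, one row per condition
by_condition <- long |>
  pivot_wider(names_from = gene, values_from = value)
names(by_condition)
# [1] "condition" "geneA"     "geneB"
```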

Remove NA with {tidyr}

  • Remove rows with missing values.
data |> drop_na() # check across all columns
data |> drop_na(col1, col2) # specify columns where to check NA
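
The difference between the two calls shows up as soon as the missing values sit in different columns. A sketch with a hypothetical sample table:

```r
library(tidyr)

# hypothetical table with missing values in two different columns
samples <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  weight = c(21.5, NA, 19.8, 22.1),
  group = c("ctrl", "ctrl", NA, "treated")
)

# an NA in any column removes the row
complete_rows <- samples |> drop_na()
nrow(complete_rows)
# [1] 2

# only rows with NA in `weight` are removed; the NA in `group` is kept
weight_known <- samples |> drop_na(weight)
nrow(weight_known)
# [1] 3
```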

Data Example

The gene annotation data for yeast (Saccharomyces cerevisiae) were obtained from the Ensembl database.

annot <- readr::read_csv("../exos_data/mart_export.txt.gz") # read a compressed file
annot
# A tibble: 7,127 × 9
   `Gene stable ID` `Gene type`    `Gene % GC content` `Transcript count`
   <chr>            <chr>                        <dbl>              <dbl>
 1 YBR024W          protein_coding                40.0                  1
 2 YDL245C          protein_coding                41.9                  1
 3 YBR232C          protein_coding                44.2                  1
 4 YDR320W-B        protein_coding                34.8                  1
 5 YBR021W          protein_coding                39.9                  1
 6 YGR014W          protein_coding                43.8                  1
 7 tT(AGU)O2        tRNA                          48.0                  1
 8 YKL119C          protein_coding                36.3                  1
 9 YPR031W          protein_coding                37.0                  1
10 YKL066W          protein_coding                38.7                  1
# ℹ 7,117 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>

Manipulate Data with {dplyr}

  • count() occurrences.
annot |> count(`Gene type`)
# A tibble: 8 × 2
  `Gene type`              n
  <chr>                <int>
1 ncRNA                   18
2 protein_coding        6600
3 pseudogene              12
4 rRNA                    24
5 snRNA                    6
6 snoRNA                  77
7 tRNA                   299
8 transposable_element    91
annot |>
  count(`Gene type`, `Chromosome/scaffold name`)
# A tibble: 86 × 3
   `Gene type`    `Chromosome/scaffold name`     n
   <chr>          <chr>                      <int>
 1 ncRNA          I                              1
 2 ncRNA          IX                             2
 3 ncRNA          Mito                           1
 4 ncRNA          V                              3
 5 ncRNA          VI                             4
 6 ncRNA          VII                            2
 7 ncRNA          VIII                           2
 8 ncRNA          X                              1
 9 ncRNA          XIII                           2
10 protein_coding I                            117
# ℹ 76 more rows
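
count() is a shortcut for grouping and counting rows. The sketch below uses a hypothetical mini annotation table (simplified column names, invented rows) to show the sort argument and the equivalent longer form:

```r
library(dplyr)

# hypothetical mini annotation table
toy_annot <- data.frame(
  gene_type = c("tRNA", "protein_coding", "protein_coding", "tRNA", "protein_coding"),
  chrom = c("I", "I", "II", "II", "II")
)

# sort = TRUE puts the most frequent category first
by_type <- toy_annot |> count(gene_type, sort = TRUE)

# the same table built by hand with group_by() + summarize()
by_type_manual <- toy_annot |>
  group_by(gene_type) |>
  summarize(n = n()) |>
  arrange(desc(n))
```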

Manipulate Data with {dplyr}

  • filter() rows based on values in columns.
annot |>
  filter(`Gene type` == "ncRNA" & `Chromosome/scaffold name` == "V")
# A tibble: 3 × 9
  `Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
  <chr>            <chr>                     <dbl>              <dbl>
1 SRG1             ncRNA                      35.4                  1
2 SCR1             ncRNA                      55.0                  1
3 RPR1             ncRNA                      51.5                  1
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>

Manipulate Data with {dplyr}

  • filter() rows by matching a pattern with str_detect().
annot |>
  filter(stringr::str_detect(`Gene type`, "RNA"))
# A tibble: 424 × 9
   `Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
   <chr>            <chr>                     <dbl>              <dbl>
 1 tT(AGU)O2        tRNA                       48.0                  1
 2 tR(ACG)E         tRNA                       58.9                  1
 3 Q0158            rRNA                       21.2                  1
 4 tG(UCC)O         tRNA                       56.9                  1
 5 tY(GUA)F1        tRNA                       52.8                  1
 6 tS(GCU)O         tRNA                       52.5                  1
 7 snR7-L           snRNA                      44.4                  1
 8 snR82            snoRNA                     39.6                  1
 9 RDN58-2          rRNA                       46.2                  1
10 tI(AAU)E2        tRNA                       58.1                  1
# ℹ 414 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>
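
A sketch of the same filtering ideas on a hypothetical five-row table with simplified column names, including the negate argument of str_detect() to keep the rows that do not match:

```r
library(dplyr)
library(stringr)

# hypothetical mini annotation table
toy_annot <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4", "g5"),
  gene_type = c("tRNA", "protein_coding", "snoRNA", "pseudogene", "ncRNA"),
  chrom = c("V", "V", "II", "V", "V")
)

# "RNA" anywhere in the type: tRNA, snoRNA and ncRNA all match
rna_genes <- toy_annot |> filter(str_detect(gene_type, "RNA"))
rna_genes$gene_id
# [1] "g1" "g3" "g5"

# conditions separated by commas are combined with AND, like `&`
rna_on_v <- toy_annot |> filter(str_detect(gene_type, "RNA"), chrom == "V")

# negate = TRUE keeps the rows that do NOT match
non_rna <- toy_annot |> filter(str_detect(gene_type, "RNA", negate = TRUE))
```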

Manipulate Data with {dplyr}

  • arrange() rows based on values in columns.
annot |>
  count(`Gene type`, `Chromosome/scaffold name`) |>
  arrange(`Chromosome/scaffold name`, desc(n))
# A tibble: 86 × 3
   `Gene type`          `Chromosome/scaffold name`     n
   <chr>                <chr>                      <int>
 1 protein_coding       I                            117
 2 tRNA                 I                              4
 3 transposable_element I                              2
 4 ncRNA                I                              1
 5 pseudogene           I                              1
 6 snoRNA               I                              1
 7 protein_coding       II                           456
 8 tRNA                 II                            13
 9 transposable_element II                             6
10 snoRNA               II                             2
# ℹ 76 more rows
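
Note that chromosome names are text, so arrange() sorts them alphabetically and IX ends up before V. One possible fix, sketched on hypothetical counts, is to turn the column into a factor whose levels follow the chromosome order; as.roman() comes from the utils package shipped with R:

```r
library(dplyr)

# hypothetical counts per chromosome
toy_counts <- data.frame(
  chrom = c("V", "IX", "II", "Mito", "X"),
  n = c(3, 2, 6, 1, 1)
)

# character columns sort alphabetically: IX comes before V
alphabetical <- toy_counts |> arrange(chrom) |> pull(chrom)
alphabetical
# [1] "II"   "IX"   "Mito" "V"    "X"

# a factor sorts by its level order instead
chrom_levels <- c(as.character(utils::as.roman(1:16)), "Mito")
by_position <- toy_counts |>
  mutate(chrom = factor(chrom, levels = chrom_levels)) |>
  arrange(chrom)
as.character(by_position$chrom)
# [1] "II"   "V"    "IX"   "X"    "Mito"
```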

Manipulate Data with {dplyr}

  • select() columns (with helper functions).
annot |> select(1, 3, 5) |> head(3)
# A tibble: 3 × 3
  `Gene stable ID` `Gene % GC content` `Gene name`
  <chr>                          <dbl> <chr>      
1 YBR024W                         40.0 SCO2       
2 YDL245C                         41.9 HXT15      
3 YBR232C                         44.2 <NA>       
annot |>
  filter(stringr::str_detect(`Gene type`, "RNA")) |>
  select(contains(c("ID", "chrom", "descr"))) |>
  head(3)
# A tibble: 3 × 3
  `Gene stable ID` `Chromosome/scaffold name` `Gene description`                
  <chr>            <chr>                      <chr>                             
1 tT(AGU)O2        XV                         Threonine tRNA (tRNA-Thr), predic…
2 tR(ACG)E         V                          Arginine tRNA (tRNA-Arg), predict…
3 Q0158            Mito                       Mitochondrial 21S rRNA; intron en…
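
The helper contains() ignores case by default, which is why "chrom" matches `Chromosome/scaffold name` above. A sketch with a hypothetical two-row table that mimics some of the column names, plus two other helpers:

```r
library(dplyr)

# hypothetical table with column names like those in the annotation file
toy_annot <- data.frame(
  `Gene stable ID` = c("g1", "g2"),
  `Gene type` = c("tRNA", "ncRNA"),
  `Chromosome/scaffold name` = c("V", "II"),
  `Gene start (bp)` = c(100, 500),
  `Gene end (bp)` = c(180, 900),
  check.names = FALSE
)

# case-insensitive by default
chrom_cols <- toy_annot |> select(contains("chrom")) |> names()
chrom_cols
# [1] "Chromosome/scaffold name"

# match on the beginning or the end of the column name
gene_cols <- toy_annot |> select(starts_with("Gene")) |> names()
bp_cols <- toy_annot |> select(ends_with("(bp)")) |> names()

# ignore.case = FALSE makes the match strict: nothing is selected here
strict <- toy_annot |> select(contains("chrom", ignore.case = FALSE)) |> names()
```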

Manipulate Data with {dplyr}

  • mutate() columns.
annot |>
  mutate(gene_group = ifelse(
    test = `Gene type` == "protein_coding",
    yes = "protein_coding",
    no = "non protein_coding"
  )) |>
  select(1:2, gene_group)
# A tibble: 7,127 × 3
   `Gene stable ID` `Gene type`    gene_group        
   <chr>            <chr>          <chr>             
 1 YBR024W          protein_coding protein_coding    
 2 YDL245C          protein_coding protein_coding    
 3 YBR232C          protein_coding protein_coding    
 4 YDR320W-B        protein_coding protein_coding    
 5 YBR021W          protein_coding protein_coding    
 6 YGR014W          protein_coding protein_coding    
 7 tT(AGU)O2        tRNA           non protein_coding
 8 YKL119C          protein_coding protein_coding    
 9 YPR031W          protein_coding protein_coding    
10 YKL066W          protein_coding protein_coding    
# ℹ 7,117 more rows
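
When more than two groups are needed, nested ifelse() calls get hard to read. A possible alternative is dplyr's case_when(), sketched here on a hypothetical table; the group labels are invented for illustration:

```r
library(dplyr)
library(stringr)

# hypothetical mini annotation table
toy_annot <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4"),
  gene_type = c("protein_coding", "tRNA", "pseudogene", "snoRNA")
)

# conditions are checked in order; the first TRUE one decides the label
grouped <- toy_annot |>
  mutate(gene_group = case_when(
    gene_type == "protein_coding" ~ "protein_coding",
    str_detect(gene_type, "RNA") ~ "RNA gene",
    TRUE ~ "other" # fallback for everything else
  ))
```

With this table, g2 and g4 are labelled "RNA gene" and g3 falls through to "other".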

Manipulate Data with {dplyr}

  • summarize() data.
annot |>
  summarize(
    mean_gc = mean(`Gene % GC content`),
    max_gc = max(`Gene % GC content`),
    min_gc = min(`Gene % GC content`),
  )
# A tibble: 1 × 3
  mean_gc max_gc min_gc
    <dbl>  <dbl>  <dbl>
1    40.6   66.7   8.18
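
The summary above works because the GC column has no missing values. If a column does contain NA, mean(), max() and min() return NA unless na.rm = TRUE is set. A sketch on hypothetical values:

```r
library(dplyr)

# hypothetical GC values with one missing entry
toy_gc <- data.frame(gc = c(40.0, 41.9, NA, 44.2))

gc_summary <- toy_gc |>
  summarize(
    mean_default = mean(gc),             # NA, because one value is missing
    mean_na_rm = mean(gc, na.rm = TRUE), # mean of the three known values
    n_missing = sum(is.na(gc))           # how many values were dropped
  )
gc_summary
```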

Manipulate Data with {dplyr}

  • group_by() data. (Use ungroup() to remove grouping.)
annot |>
  group_by(`Chromosome/scaffold name`) |>
  summarize(
    mean_gc = mean(`Gene % GC content`),
    max_gc = max(`Gene % GC content`),
    min_gc = min(`Gene % GC content`),
  )
# A tibble: 17 × 4
   `Chromosome/scaffold name` mean_gc max_gc min_gc
   <chr>                        <dbl>  <dbl>  <dbl>
 1 I                             41.9   60.3  25.8 
 2 II                            40.5   58.3  20   
 3 III                           41.8   66.7  26   
 4 IV                            40.0   64.4  25.2 
 5 IX                            41.2   58.3  30.2 
 6 Mito                          26.4   44.9   8.18
 7 V                             41.5   65.3  27.6 
 8 VI                            41.9   64.4  30.7 
 9 VII                           40.6   64.4  27.4 
10 VIII                          40.9   64.4  30.1 
11 X                             41.2   65.3  31.7 
12 XI                            40.6   64.4  26   
13 XII                           41.0   64.4  26.0 
14 XIII                          40.3   64.4  27.6 
15 XIV                           40.8   66.7  28.7 
16 XV                            40.3   65.3  23   
17 XVI                           40.3   65.3  27.1 
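
Grouping also affects mutate(): functions such as mean() are then computed within each group. The sketch below, on hypothetical values, centres GC content per chromosome and then calls ungroup() so that later steps work on the whole table again:

```r
library(dplyr)

# hypothetical GC values on two chromosomes
toy_annot <- data.frame(
  chrom = c("I", "I", "II", "II"),
  gc = c(40, 44, 30, 50)
)

centered <- toy_annot |>
  group_by(chrom) |>
  mutate(gc_centered = gc - mean(gc)) |> # mean computed within each chromosome
  ungroup()                              # remove the grouping afterwards

centered$gc_centered
# [1]  -2   2 -10  10
is_grouped_df(centered)
# [1] FALSE
```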

Visualisation with {ggplot2}

Syntax: ggplot(data, aes(x, y)) + geom_xxx() + ...

annot |>
  filter(`Gene type` == "protein_coding") |>
  ggplot(aes(x = `Chromosome/scaffold name`, y = `Gene % GC content`)) +
  geom_boxplot() +
  labs(title = str_to_title("distribution across chromosome")) +
  theme_light()
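
A self-contained sketch of the same plot on hypothetical GC values, with explicit axis labels. It also shows a caveat of str_to_title(): it lower-cases the rest of each word, so an acronym such as "GC" becomes "Gc", and the title is better typed by hand in that case:

```r
library(ggplot2)
library(stringr)

# hypothetical GC values for three chromosomes
toy_gc <- data.frame(
  chrom = rep(c("I", "II", "Mito"), each = 4),
  gc = c(40, 42, 41, 44, 38, 41, 43, 40, 22, 27, 25, 30)
)

# title case changes the acronym
auto_title <- str_to_title("GC content across chromosomes")
auto_title
# [1] "Gc Content Across Chromosomes"

# so the title is written manually here
manual_title <- "GC Content Across Chromosomes"

p <- ggplot(toy_gc, aes(x = chrom, y = gc)) +
  geom_boxplot() +
  labs(title = manual_title, x = "Chromosome", y = "GC content (%)") +
  theme_light()
```

Printing p (or calling print(p) in a script) draws the figure.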

Let’s Practice!

Today’s Goals

  • Know how to process strings in the data.
  • Get familiar with the main tidyverse packages for data manipulation and visualisation.

Final Project

Reproducing a Scientific Figure with {ggplot2}

Task: Select a figure from a scientific paper and recreate it using ggplot2. Document your process in a Quarto script and present your work (15 min / person).

Your report should include:

  • Figure Selection: Why did you choose this figure?
  • Data & Preprocessing: Source of data, cleaning, transformations.
  • Packages Used: List and explain key packages.
  • Challenges & Solutions: Issues faced and how you resolved them.
  • Plot Construction: Steps taken to build the visualization.
  • Final Comparison & Insights: How close is your result? What did you learn?