String Tricks & Final Review

IOC-R Week 9

String Manipulation with {`stringr`}

Strings Everywhere in Biological Data!

Where do we actually encounter strings (text) in our data?

Strings stored in data columns, such as gene name, sample name, category.
Column or row names of a data frame; names of a named list or vector.
File names and paths: e.g., “GSE12345_raw.txt”, “data/results_2023-10.csv”
Figure labels: e.g., axis titles, plot title.

What happens when these strings are messy, inconsistent, or need to be extracted?

The {`stringr`} Package

library(tidyverse)
# library(stringr) # or load only this package if needed

Almost all functions start with str_.

Manage strings: str_length()
Mutate strings: str_to_title(), str_to_upper(), str_to_lower()
Detect strings¹: str_detect(), str_count()

¹: These functions take a pattern, which is interpreted as a regular expression (regex). Matching is case-sensitive by default, but this can be changed.

Cheat sheet for stringr: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf

`stringr` - Manage Strings

str_length(): count the number of characters in a string.

seq1 <- "ATGCGTAGCTAGGCTATCCGA"

# using basic functions
length(unlist(strsplit(x = seq1, split = "")))

[1] 21

# use stringr's function
str_length(string = seq1)

[1] 21

`stringr` - Mutate Strings

str_to_title(): convert string to title case.

human_gene <- c("ADCY3", "SOX5", "LEP")
mouse_gene <- str_to_title(human_gene)
mouse_gene

[1] "Adcy3" "Sox5"  "Lep"

str_to_upper(): convert string to upper case.

str_to_upper(mouse_gene)

[1] "ADCY3" "SOX5"  "LEP"

str_to_lower(): convert string to lower case.

str_to_lower(mouse_gene)

[1] "adcy3" "sox5"  "lep"

`stringr` - Detect Strings

str_detect(): detect presence of a match, returns TRUE or FALSE.

seq1

[1] "ATGCGTAGCTAGGCTATCCGA"

str_detect(string = seq1, pattern = "CG")

[1] TRUE

str_detect(string = seq1, pattern = "c")

[1] FALSE
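
The lowercase pattern fails above because matching is case-sensitive by default. A minimal sketch of one way to change this, assuming the pattern is wrapped in stringr's regex() helper with ignore_case = TRUE (the names ci_match and ci_count are only illustrative):

```r
library(stringr)

seq1 <- "ATGCGTAGCTAGGCTATCCGA"

# Default behaviour: lowercase "c" does not match uppercase "C"
str_detect(string = seq1, pattern = "c")

# Wrapping the pattern in regex() with ignore_case = TRUE relaxes this
ci_match <- str_detect(string = seq1, pattern = regex("c", ignore_case = TRUE))
ci_count <- str_count(string = seq1, pattern = regex("c", ignore_case = TRUE))

ci_match # TRUE
ci_count # 5, the same as counting uppercase "C"
```

Converting the string first with str_to_upper() or str_to_lower() would be another option when the original case does not need to be kept.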

str_count(): count number of matches.

seq1

[1] "ATGCGTAGCTAGGCTATCCGA"

# how many C and G are present in the string?
str_count(string = seq1, pattern = "C")

[1] 5

str_count(string = seq1, pattern = "G")

[1] 6

str_count(string = seq1, pattern = "CG")

[1] 2

# using regular expression
str_count(string = seq1, pattern = "[CG]")

[1] 11
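
Combining str_count() and str_length() gives a quick GC content estimate. This is a small sketch, not a validated sequence-analysis tool; the second sequence and the variable names are made up for illustration.

```r
library(stringr)

# seq1 comes from the slides; "ATATAT" is a hypothetical extra sequence
seqs <- c("ATGCGTAGCTAGGCTATCCGA", "ATATAT")

# Both functions are vectorised: one result per sequence
gc_count <- str_count(string = seqs, pattern = "[CG]")
n_bases <- str_length(string = seqs)

# GC content in percent
gc_percent <- round(100 * gc_count / n_bases, digits = 1)
gc_percent # 52.4 and 0.0
```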

Regular Expression

[]: Match any one of the characters inside.

str_count(string = seq1, pattern = "[CG]") # using regular expression

[1] 11

^ or $: Match the beginning or the end of a string, respectively.

pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System", "Inflammatory Response Pathway",
  "Cell Cycle Regulation", "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)
str_detect(pathways, pattern = "^In") # starts with "In"

[1] FALSE FALSE  TRUE FALSE  TRUE FALSE

str_detect(pathways, pattern = "em$") # ends with "em"

[1] FALSE  TRUE FALSE FALSE  TRUE FALSE

|: Matches one pattern or another.

str_detect(pathways, pattern = "Response|Receptor")

[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
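
These building blocks can be combined. The sketch below groups alternatives with parentheses so that an anchor applies to each of them; the patterns are chosen only for illustration.

```r
library(stringr)

pathways <- c(
  "Adaptive Immune Response", "Cytokine Signaling in Immune System", "Inflammatory Response Pathway",
  "Cell Cycle Regulation", "Innate Immune System", "Toll-like Receptor Signaling Pathway"
)

# Starts with "Innate" or "Adaptive": ^ applies to the whole group
starts_immune <- str_detect(pathways, pattern = "^(Innate|Adaptive)")

# Ends with "Pathway" or "System": $ applies to the whole group
ends_ps <- str_detect(pathways, pattern = "(Pathway|System)$")

# Use the logical vector to keep the matching elements
pathways[starts_immune]
```

Without the parentheses, "^Innate|Adaptive" would mean "starts with Innate" or "contains Adaptive anywhere", which is a different question.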

Cheat sheet for regular expression: https://github.com/rstudio/cheatsheets/blob/main/regex.pdf

Refresher on {`tidyverse`}

Import and Export with {`readr`}

Read structured (CSV, TSV) and unstructured (TXT) files.

read_delim(file, delim, ...) # the general function
read_csv()
read_csv2() # use ";" as separator and "," for decimal

The file can be a path to a local file (compressed or not) or a URL, e.g.:

read_csv2(file = "https://inforbio.github.io/IOC/ioc_r/exos_data/data_anonym_struc1_noise.csv")

Observe your data before import: header, separator, decimal, NA strings, quotes, etc.

Write structured (CSV, TSV) and unstructured (TXT) files.

write_delim(x, file, delim, na, ...)
write_csv()
write_csv2()
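
A quick way to check that the separator and decimal settings are consistent is a write/read round trip. The sketch below uses a temporary file and a made-up table (the values are hypothetical, the gene names are reused from the earlier slides).

```r
library(readr)

# Hypothetical toy table with one missing value
expr <- data.frame(
  gene = c("ADCY3", "SOX5", "LEP"),
  value = c(1.5, NA, 3.25)
)

# Write with ";" as separator and "," as decimal mark
tmp <- tempfile(fileext = ".csv")
write_csv2(expr, file = tmp, na = "NA")

# Read it back with the matching function
expr_back <- read_csv2(tmp, show_col_types = FALSE)
unlink(tmp) # remove the temporary file

expr_back
```

Reading the same file with read_csv() instead would not split the columns as intended, which is why inspecting the file before import matters.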

Reshape Data with {`tidyr`}

Reshape data to longer or wider format as needed.

data |> pivot_longer(cols = -1, names_to = "condition", values_to = "value")
data |> pivot_wider(names_from = gene, values_from = value)
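
To make the two calls concrete, here is a small sketch on a made-up table; the object names (wide, long, by_gene) and the values are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per condition
wide <- tibble::tibble(
  gene = c("ADCY3", "SOX5"),
  ctrl = c(10, 20),
  treated = c(15, 5)
)

# Longer: every column except the first becomes a (condition, value) pair
long <- wide |>
  pivot_longer(cols = -1, names_to = "condition", values_to = "value")

# Wider again, this time with one column per gene
by_gene <- long |>
  pivot_wider(names_from = gene, values_from = value)

long    # 4 rows: gene, condition, value
by_gene # 2 rows: condition, ADCY3, SOX5
```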

Remove NA with {`tidyr`}

Remove rows with missing values.

data |> drop_na() # check across all columns
data |> drop_na(col1, col2) # specify columns where to check NA
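
The difference between the two calls is easiest to see on a tiny example. The table below is made up for illustration.

```r
library(tidyr)

# Hypothetical table with missing values in two different columns
df <- tibble::tibble(
  sample = c("s1", "s2", "s3"),
  col1 = c(1, NA, 3),
  col2 = c("a", "b", NA)
)

# Checks all columns: only s1 is complete
all_cols <- df |> drop_na()

# Checks col1 only: s3 is kept although col2 is missing
only_col1 <- df |> drop_na(col1)

all_cols$sample  # "s1"
only_col1$sample # "s1" "s3"
```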

Data Example

Gene annotation data for yeast (Saccharomyces cerevisiae) were obtained from the Ensembl database.

annot <- readr::read_csv("../exos_data/mart_export.txt.gz") # read a compressed file
annot

# A tibble: 7,127 × 9
   `Gene stable ID` `Gene type`    `Gene % GC content` `Transcript count`
   <chr>            <chr>                        <dbl>              <dbl>
 1 YBR024W          protein_coding                40.0                  1
 2 YDL245C          protein_coding                41.9                  1
 3 YBR232C          protein_coding                44.2                  1
 4 YDR320W-B        protein_coding                34.8                  1
 5 YBR021W          protein_coding                39.9                  1
 6 YGR014W          protein_coding                43.8                  1
 7 tT(AGU)O2        tRNA                          48.0                  1
 8 YKL119C          protein_coding                36.3                  1
 9 YPR031W          protein_coding                37.0                  1
10 YKL066W          protein_coding                38.7                  1
# ℹ 7,117 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>

Manipulate Data with {`dplyr`}

count() occurrences.

annot |> count(`Gene type`)

# A tibble: 8 × 2
  `Gene type`              n
  <chr>                <int>
1 ncRNA                   18
2 protein_coding        6600
3 pseudogene              12
4 rRNA                    24
5 snRNA                    6
6 snoRNA                  77
7 tRNA                   299
8 transposable_element    91

annot |>
  count(`Gene type`, `Chromosome/scaffold name`)

# A tibble: 86 × 3
   `Gene type`    `Chromosome/scaffold name`     n
   <chr>          <chr>                      <int>
 1 ncRNA          I                              1
 2 ncRNA          IX                             2
 3 ncRNA          Mito                           1
 4 ncRNA          V                              3
 5 ncRNA          VI                             4
 6 ncRNA          VII                            2
 7 ncRNA          VIII                           2
 8 ncRNA          X                              1
 9 ncRNA          XIII                           2
10 protein_coding I                            117
# ℹ 76 more rows

Manipulate Data with {`dplyr`}

filter() rows based on values in columns.

annot |>
  filter(`Gene type` == "ncRNA" & `Chromosome/scaffold name` == "V")

# A tibble: 3 × 9
  `Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
  <chr>            <chr>                     <dbl>              <dbl>
1 SRG1             ncRNA                      35.4                  1
2 SCR1             ncRNA                      55.0                  1
3 RPR1             ncRNA                      51.5                  1
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>

Manipulate Data with {`dplyr`}

filter() rows by matching a pattern with str_detect().

annot |>
  filter(stringr::str_detect(`Gene type`, "RNA"))

# A tibble: 424 × 9
   `Gene stable ID` `Gene type` `Gene % GC content` `Transcript count`
   <chr>            <chr>                     <dbl>              <dbl>
 1 tT(AGU)O2        tRNA                       48.0                  1
 2 tR(ACG)E         tRNA                       58.9                  1
 3 Q0158            rRNA                       21.2                  1
 4 tG(UCC)O         tRNA                       56.9                  1
 5 tY(GUA)F1        tRNA                       52.8                  1
 6 tS(GCU)O         tRNA                       52.5                  1
 7 snR7-L           snRNA                      44.4                  1
 8 snR82            snoRNA                     39.6                  1
 9 RDN58-2          rRNA                       46.2                  1
10 tI(AAU)E2        tRNA                       58.1                  1
# ℹ 414 more rows
# ℹ 5 more variables: `Gene name` <chr>, `Chromosome/scaffold name` <chr>,
#   `Gene start (bp)` <dbl>, `Gene end (bp)` <dbl>, `Gene description` <chr>

Manipulate Data with {`dplyr`}

arrange() rows based on values in columns.

annot |>
  count(`Gene type`, `Chromosome/scaffold name`) |>
  arrange(`Chromosome/scaffold name`, desc(n))

# A tibble: 86 × 3
   `Gene type`          `Chromosome/scaffold name`     n
   <chr>                <chr>                      <int>
 1 protein_coding       I                            117
 2 tRNA                 I                              4
 3 transposable_element I                              2
 4 ncRNA                I                              1
 5 pseudogene           I                              1
 6 snoRNA               I                              1
 7 protein_coding       II                           456
 8 tRNA                 II                            13
 9 transposable_element II                             6
10 snoRNA               II                             2
# ℹ 76 more rows

Manipulate Data with {`dplyr`}

select() columns (with helper functions).

annot |> select(1, 3, 5) |> head(3)

# A tibble: 3 × 3
  `Gene stable ID` `Gene % GC content` `Gene name`
  <chr>                          <dbl> <chr>      
1 YBR024W                         40.0 SCO2       
2 YDL245C                         41.9 HXT15      
3 YBR232C                         44.2 <NA>

annot |>
  filter(stringr::str_detect(`Gene type`, "RNA")) |>
  select(contains(c("ID", "chrom", "descr"))) |>
  head(3)

# A tibble: 3 × 3
  `Gene stable ID` `Chromosome/scaffold name` `Gene description`                
  <chr>            <chr>                      <chr>                             
1 tT(AGU)O2        XV                         Threonine tRNA (tRNA-Thr), predic…
2 tR(ACG)E         V                          Arginine tRNA (tRNA-Arg), predict…
3 Q0158            Mito                       Mitochondrial 21S rRNA; intron en…

Manipulate Data with {`dplyr`}

mutate() columns.

annot |>
  mutate(gene_group = ifelse(
    test = `Gene type` == "protein_coding",
    yes = "protein_coding",
    no = "non protein_coding"
  )) |>
  select(1:2, gene_group)

# A tibble: 7,127 × 3
   `Gene stable ID` `Gene type`    gene_group        
   <chr>            <chr>          <chr>             
 1 YBR024W          protein_coding protein_coding    
 2 YDL245C          protein_coding protein_coding    
 3 YBR232C          protein_coding protein_coding    
 4 YDR320W-B        protein_coding protein_coding    
 5 YBR021W          protein_coding protein_coding    
 6 YGR014W          protein_coding protein_coding    
 7 tT(AGU)O2        tRNA           non protein_coding
 8 YKL119C          protein_coding protein_coding    
 9 YPR031W          protein_coding protein_coding    
10 YKL066W          protein_coding protein_coding    
# ℹ 7,117 more rows
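
When more than two groups are needed, nested ifelse() calls get hard to read. One possible alternative is dplyr's case_when(), sketched here on a small stand-in table that reuses gene types from the count() output; the group labels "RNA gene" and "other" are made up.

```r
library(dplyr)
library(stringr)

# Stand-in for annot: only the `Gene type` column, a few values
gene_types <- tibble(
  `Gene type` = c("protein_coding", "tRNA", "snoRNA", "pseudogene", "transposable_element")
)

# Conditions are checked in order; the first one that is TRUE wins
grouped <- gene_types |>
  mutate(gene_group = case_when(
    `Gene type` == "protein_coding" ~ "protein_coding",
    str_detect(`Gene type`, "RNA") ~ "RNA gene",
    TRUE ~ "other" # fallback for everything else
  ))

grouped
```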

Manipulate Data with {`dplyr`}

summarize() data.

annot |>
  summarize(
    mean_gc = mean(`Gene % GC content`),
    max_gc = max(`Gene % GC content`),
    min_gc = min(`Gene % GC content`),
  )

# A tibble: 1 × 3
  mean_gc max_gc min_gc
    <dbl>  <dbl>  <dbl>
1    40.6   66.7   8.18

Manipulate Data with {`dplyr`}

group_by() data. (Use ungroup() to remove grouping.)

annot |>
  group_by(`Chromosome/scaffold name`) |>
  summarize(
    mean_gc = mean(`Gene % GC content`),
    max_gc = max(`Gene % GC content`),
    min_gc = min(`Gene % GC content`),
  )

# A tibble: 17 × 4
   `Chromosome/scaffold name` mean_gc max_gc min_gc
   <chr>                        <dbl>  <dbl>  <dbl>
 1 I                             41.9   60.3  25.8 
 2 II                            40.5   58.3  20   
 3 III                           41.8   66.7  26   
 4 IV                            40.0   64.4  25.2 
 5 IX                            41.2   58.3  30.2 
 6 Mito                          26.4   44.9   8.18
 7 V                             41.5   65.3  27.6 
 8 VI                            41.9   64.4  30.7 
 9 VII                           40.6   64.4  27.4 
10 VIII                          40.9   64.4  30.1 
11 X                             41.2   65.3  31.7 
12 XI                            40.6   64.4  26   
13 XII                           41.0   64.4  26.0 
14 XIII                          40.3   64.4  27.6 
15 XIV                           40.8   66.7  28.7 
16 XV                            40.3   65.3  23   
17 XVI                           40.3   65.3  27.1
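
With a single grouping variable, summarize() returns an ungrouped table, so ungroup() matters mostly after a grouped mutate(). A minimal sketch on made-up values (the table toy and its columns are hypothetical):

```r
library(dplyr)

# Hypothetical GC values for two chromosomes
toy <- tibble(
  chrom = c("I", "I", "Mito", "Mito"),
  gc_pct = c(40, 44, 20, 30)
)

# Centre each value on the mean of its own chromosome,
# then drop the grouping so later steps work on the whole table
centered <- toy |>
  group_by(chrom) |>
  mutate(gc_centered = gc_pct - mean(gc_pct)) |>
  ungroup()

centered$gc_centered # -2  2 -5  5
```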

Visualisation with {`ggplot2`}

Syntax: ggplot(data, aes(x, y)) + geom_xxx() + ...

annot |>
  filter(`Gene type` == "protein_coding") |>
  ggplot(aes(x = `Chromosome/scaffold name`, y = `Gene % GC content`)) +
  geom_boxplot() +
  labs(title = str_to_title("distribution across chromosome")) +
  theme_light()
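
Because the chromosome names are character strings, ggplot2 orders them alphabetically (I, II, III, IV, IX, ...). One possible fix, sketched below, is to convert the column to a factor with explicit levels. The toy table reuses a few mean GC values from the summary above; the names toy, chrom and chrom_levels are illustrative.

```r
library(ggplot2)

# Toy table: a few chromosomes with their mean GC content
toy <- data.frame(
  chrom = c("I", "II", "IX", "V", "X", "Mito"),
  gc = c(41.9, 40.5, 41.2, 41.5, 41.2, 26.4)
)

# Roman numerals I to XVI in numeric order, followed by "Mito"
chrom_levels <- c(as.character(utils::as.roman(1:16)), "Mito")
toy$chrom <- factor(toy$chrom, levels = chrom_levels)

# The x axis now follows the factor levels instead of the alphabet
p <- ggplot(toy, aes(x = chrom, y = gc)) +
  geom_col() +
  theme_light()
p
```

The same conversion could be done inside a mutate() step before the ggplot() call in the pipeline above.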

Let’s Practice!

Today’s Goals

Know how to process strings in the data.
Get familiar with the main tidyverse packages for data manipulation and visualisation.

Final Project

Reproducing a Scientific Figure with {`ggplot2`}

Task: Select a figure from a scientific paper and recreate it using ggplot2. Document your process in a Quarto script and present your work (15 min / person).

Your report should include:

Figure Selection: Why did you choose this figure?
Data & Preprocessing: Source of data, cleaning, transformations.
Packages Used: List and explain key packages.
Challenges & Solutions: Issues faced and how you resolved them.
Plot Construction: Steps taken to build the visualization.
Final Comparison & Insights: How close is your result? What did you learn?
