Tidy Your Data:
Exploring {tidyr} and {dplyr}

IOC-R Week 8

Recap apply() and lapply()

apply() and lapply()

apply(X, MARGIN, FUN)
  • X: a matrix or a data frame (coerced to a matrix)
  • MARGIN: 1 for rows and 2 for columns
  • FUN: the function to be applied
mat <- matrix(1:9, nrow = 3)
apply(mat, 1, sum)
[1] 12 15 18
lapply(X, FUN)
  • X: a list or a vector
  • FUN: the function to be applied
my_list <- list(1, 2:3, 4:6)
lapply(my_list, sum)
[[1]]
[1] 1

[[2]]
[1] 5

[[3]]
[1] 15
sapply(my_list, sum) # simplified output
[1]  1  5 15

The {tidyverse} Ecosystem

Why Do We Need {tidyverse}?

{tidyverse} provides a consistent and intuitive set of packages for data manipulation, visualization and analysis. The core packages include:

  • ggplot2: visualisation (see session 5);
  • readr: import data into R;
  • tidyr: restructure data;
  • dplyr: manipulate data;
  • stringr: manipulate strings (see session 9)
## install the core tidyverse packages
install.packages("tidyverse")


## load the core tidyverse packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Forward Chaining with Pipe |>

The pipe operator |> takes the output from one function and feed it to the first argument of the next function.

Compare the 3 ways to do calculate the square root of the mean of abolute values of x (\(\sqrt{\frac{1}{n} \sum_{i=1}^{n} |x_i|}\)).

# create a vector
x <- c(-3, -5, NA, 8)

# version 1: calculation with nested function
sqrt(mean(abs(x), na.rm = TRUE))
[1] 2.309401
# version 2: use variables to store intermediate results
abs_vals <- abs(x)
mean_val <- mean(abs_vals, na.rm = TRUE)
result <- sqrt(mean_val)
result
[1] 2.309401
# version 3: use pipe to execute code sequentially
abs(x) |>
  mean(na.rm = TRUE) |>
  sqrt()
[1] 2.309401
## version 3bis: or each function in one line
abs(x) |> mean(na.rm = TRUE) |> sqrt()
[1] 2.309401
  • |> native pipe operator, built into base R (version 4.1+)
  • %>% pipe from the {magrittr} package

What is a tibble?

A “modern” data frame, compatible with the data frame, but with some enhancements.

  • Data imported by {readr} is in tibble format.
readr::read_csv("../exos_data/read-counts.csv")
# A tibble: 45 × 41
   Feature    WT.1  WT.2  WT.3  WT.4  WT.5  WT.6  WT.7  WT.8  WT.9 WT.10 SET1.1
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 HTB2      20648   466  1783 25335 64252 24126  9067 19721 67353 28059  21214
 2 HHF1       7867   147   427  5178 27889  8547  3432  6935 34229 13913   9807
 3 HHT1       1481    37   187  1856  3952  1020   484  1409  4636  1870   1604
 4 POL30       743    27   370  4050   877   357   845  2110  1872   684   1075
 5 KCC4        185     6   200   669   166    68   360   595   438   204    209
 6 MCD1        457    16   117  1696   586   244   464  1227  1204   457    538
 7 MSA1-1of2     8     0     5    23    13    12    15     9    19    16      8
 8 HTB1       4209   105   493  4887 12369  3597  1769  3664 13649  5455   4560
 9 SUT476      100     8    10    28    23    31    47    42   122    72    143
10 APQ12      4644   490  4242  7891  6477  6754  6010  6494  7744  4722   4231
# ℹ 35 more rows
# ℹ 29 more variables: SET1.2 <dbl>, SET1.3 <dbl>, SET1.4 <dbl>, SET1.5 <dbl>,
#   SET1.6 <dbl>, SET1.7 <dbl>, SET1.8 <dbl>, SET1.9 <dbl>, SET1.10 <dbl>,
#   SET1.RRP6.1 <dbl>, SET1.RRP6.2 <dbl>, SET1.RRP6.3 <dbl>, SET1.RRP6.4 <dbl>,
#   SET1.RRP6.5 <dbl>, SET1.RRP6.6 <dbl>, SET1.RRP6.7 <dbl>, SET1.RRP6.8 <dbl>,
#   SET1.RRP6.9 <dbl>, SET1.RRP6.10 <dbl>, RRP6.1 <dbl>, RRP6.2 <dbl>,
#   RRP6.3 <dbl>, RRP6.4 <dbl>, RRP6.5 <dbl>, RRP6.6 <dbl>, RRP6.7 <dbl>, …

What is a tibble?

A “modern” data frame, compatible with the data frame, but with some enhancements.

  • Create a tibble.
  • tibble(col1 = 7:8, col2 = c(TRUE, FALSE))
    # A tibble: 2 × 2
       col1 col2 
      <int> <lgl>
    1     7 TRUE 
    2     8 FALSE
    • chr: character
    • dbl: double
    • int: integer
    • fct: factor
    • lgl: logical
    • dttm: date and time
  • Convert a data frame or a matrix into a tibble, with as_tibble() function.
iris_tbl <- as_tibble(iris)
iris_tbl |> head(3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 

tibbles don’t store row names. The numbers before the first column in the printed tibble are not row names or an index stored in the data. They are simply row numbers displayed for readability.

Restructure Data with {tidyr}

Cheat sheet: https://github.com/rstudio/cheatsheets/blob/master/tidyr.pdf

What is Tidy Data?

Each variable is a column, each observation is a row … —- Notion introduced by Hadley Wickham

not_tidy <- tibble(
  gene = paste0("gene", LETTERS[1:3]),
  control = c(NA, 20, 30),
  treatment = c(15, 25, 35)
)
not_tidy
# A tibble: 3 × 3
  gene  control treatment
  <chr>   <dbl>     <dbl>
1 geneA      NA        15
2 geneB      20        25
3 geneC      30        35
tidy_data <- tibble(
  gene = rep(paste0("gene", LETTERS[1:3]), 2),
  condition = rep(c("control", "treatment"), each = 3),
  expression_level = c(NA, 20, 30, 15, 25, 35)
)
tidy_data
# A tibble: 6 × 3
  gene  condition expression_level
  <chr> <chr>                <dbl>
1 geneA control                 NA
2 geneB control                 20
3 geneC control                 30
4 geneA treatment               15
5 geneB treatment               25
6 geneC treatment               35

{tidyr} - pivot_longer()

Pivot data into longer format by increasing the number of rows.

not_tidy
# A tibble: 3 × 3
  gene  control treatment
  <chr>   <dbl>     <dbl>
1 geneA      NA        15
2 geneB      20        25
3 geneC      30        35
not_tidy |>
  pivot_longer(
    cols = c(control, treatment),
    names_to = "condition",
    values_to = "expression_level"
  )
# A tibble: 6 × 3
  gene  condition expression_level
  <chr> <chr>                <dbl>
1 geneA control                 NA
2 geneA treatment               15
3 geneB control                 20
4 geneB treatment               25
5 geneC control                 30
6 geneC treatment               35

{tidyr} - pivot_wider()

Pivot data into wider format by increasing the number of columns. It’s the inverse transformation of pivot_longer().

# widens by condition
tidy_data |>
  pivot_wider(names_from = condition, values_from = expression_level)
# A tibble: 3 × 3
  gene  control treatment
  <chr>   <dbl>     <dbl>
1 geneA      NA        15
2 geneB      20        25
3 geneC      30        35
# widens by gene
tidy_data |>
  pivot_wider(names_from = gene, values_from = expression_level)
# A tibble: 2 × 4
  condition geneA geneB geneC
  <chr>     <dbl> <dbl> <dbl>
1 control      NA    20    30
2 treatment    15    25    35

{tidyr} - drop_na() in Table

By default keep only rows with no missing value across all columns.

tidy_data$description <- rep(
  c("growth regulation", "stress response", NA),
  times = 2
)
tidy_data
# A tibble: 6 × 4
  gene  condition expression_level description      
  <chr> <chr>                <dbl> <chr>            
1 geneA control                 NA growth regulation
2 geneB control                 20 stress response  
3 geneC control                 30 <NA>             
4 geneA treatment               15 growth regulation
5 geneB treatment               25 stress response  
6 geneC treatment               35 <NA>             
# drop rows with NA in any columns
tidy_data |> drop_na()
# A tibble: 3 × 4
  gene  condition expression_level description      
  <chr> <chr>                <dbl> <chr>            
1 geneB control                 20 stress response  
2 geneA treatment               15 growth regulation
3 geneB treatment               25 stress response  
# drop rows with NA in specified column(s)
tidy_data |> drop_na(description)
# A tibble: 4 × 4
  gene  condition expression_level description      
  <chr> <chr>                <dbl> <chr>            
1 geneA control                 NA growth regulation
2 geneB control                 20 stress response  
3 geneA treatment               15 growth regulation
4 geneB treatment               25 stress response  

In tidyverse, column names can be used as-is without quotes, if a column name starts with number, has spaces or special characters, you must use backticks (`column name`, `1col`)

Manipulate Data with {dplyr}

Cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

{dplyr} - select() Columns

?select

Select by column index.

colnames(iris_tbl)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
# select one column
iris_tbl |>
  select(1) |>
  head(3)
# A tibble: 3 × 1
  Sepal.Length
         <dbl>
1          5.1
2          4.9
3          4.7
# select mutiple columns
iris_tbl |>
  select(1, 3) |>
  head(3)
# A tibble: 3 × 2
  Sepal.Length Petal.Length
         <dbl>        <dbl>
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
# select a range of consecutive indices
iris_tbl |>
  select(1:4) |>
  head(3)
# A tibble: 3 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          5.1         3.5          1.4         0.2
2          4.9         3            1.4         0.2
3          4.7         3.2          1.3         0.2
# use ! or - operator to negate a selection
iris_tbl |>
  select(!1:4) |>
  head(3)
# A tibble: 3 × 1
  Species
  <fct>  
1 setosa 
2 setosa 
3 setosa 
# iris_tbl |> select(-c(1:4)) |> head(3) # idem

{dplyr} - select() Columns

Select by column name.

colnames(iris_tbl)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
# select one column
iris_tbl |>
  select(Species) |>
  head(3)
# A tibble: 3 × 1
  Species
  <fct>  
1 setosa 
2 setosa 
3 setosa 
# select mutiple columns
iris_tbl |>
  select(Species, Sepal.Length) |>
  head(3)
# A tibble: 3 × 2
  Species Sepal.Length
  <fct>          <dbl>
1 setosa           5.1
2 setosa           4.9
3 setosa           4.7
# select a range of consecutive variables
iris_tbl |>
  select(Sepal.Length:Petal.Length) |>
  head(3)
# A tibble: 3 × 3
  Sepal.Length Sepal.Width Petal.Length
         <dbl>       <dbl>        <dbl>
1          5.1         3.5          1.4
2          4.9         3            1.4
3          4.7         3.2          1.3
# use ! or - operator to negate a selection
iris_tbl |>
  select(!(Sepal.Length:Petal.Length)) |>
  head(3)
# A tibble: 3 × 2
  Petal.Width Species
        <dbl> <fct>  
1         0.2 setosa 
2         0.2 setosa 
3         0.2 setosa 
# iris_tbl |> # idem
#   select(-(Sepal.Length:Petal.Length)) |>
#   head(3)

{dplyr} - select() Columns

Select by using helper functions, by default case ignored when matching name.

colnames(iris_tbl)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
iris_tbl |>
  select(starts_with("s")) |>
  head(3)
# A tibble: 3 × 3
  Sepal.Length Sepal.Width Species
         <dbl>       <dbl> <fct>  
1          5.1         3.5 setosa 
2          4.9         3   setosa 
3          4.7         3.2 setosa 
iris_tbl |>
  select(ends_with("S")) |>
  head(3)
# A tibble: 3 × 1
  Species
  <fct>  
1 setosa 
2 setosa 
3 setosa 
iris_tbl |>
  select(contains("Len")) |>
  head(3)
# A tibble: 3 × 2
  Sepal.Length Petal.Length
         <dbl>        <dbl>
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
iris_tbl |>
  select(matches("len")) |>
  head(3)
# A tibble: 3 × 2
  Sepal.Length Petal.Length
         <dbl>        <dbl>
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
# do not ignore case
select(iris_tbl, matches("len", ignore.case = FALSE))
# A tibble: 150 × 0

{dplyr} - mutate() Columns

  • Use mutate() to add or modify columns
# create a new column
iris_tbl |> mutate(sepal_len_mm = Sepal.Length * 100) |> head(3)
# A tibble: 3 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_mm
         <dbl>       <dbl>        <dbl>       <dbl> <fct>          <dbl>
1          5.1         3.5          1.4         0.2 setosa           510
2          4.9         3            1.4         0.2 setosa           490
3          4.7         3.2          1.3         0.2 setosa           470
# remove the column
iris_tbl |> mutate(sepal_len_mm = NULL) |> head(3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
mutate(iris_tbl, sepal_len_mm = Sepal.Length * 100, petal_len_mm = Petal.Length * 100) |>
  head(2)
# A tibble: 2 × 7
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_mm
         <dbl>       <dbl>        <dbl>       <dbl> <fct>          <dbl>
1          5.1         3.5          1.4         0.2 setosa           510
2          4.9         3            1.4         0.2 setosa           490
# ℹ 1 more variable: petal_len_mm <dbl>

{dplyr} - rename() Columns

iris_tbl |> rename(espece = Species, petal_width = Petal.Width)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length petal_width espece
          <dbl>       <dbl>        <dbl>       <dbl> <fct> 
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ℹ 140 more rows

{dplyr} - filter() Rows

Filter rows based on column values.

iris_tbl |> filter(Sepal.Length > 6.8)
# A tibble: 17 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          7           3.2          4.7         1.4 versicolor
 2          6.9         3.1          4.9         1.5 versicolor
 3          7.1         3            5.9         2.1 virginica 
 4          7.6         3            6.6         2.1 virginica 
 5          7.3         2.9          6.3         1.8 virginica 
 6          7.2         3.6          6.1         2.5 virginica 
 7          7.7         3.8          6.7         2.2 virginica 
 8          7.7         2.6          6.9         2.3 virginica 
 9          6.9         3.2          5.7         2.3 virginica 
10          7.7         2.8          6.7         2   virginica 
11          7.2         3.2          6           1.8 virginica 
12          7.2         3            5.8         1.6 virginica 
13          7.4         2.8          6.1         1.9 virginica 
14          7.9         3.8          6.4         2   virginica 
15          7.7         3            6.1         2.3 virginica 
16          6.9         3.1          5.4         2.1 virginica 
17          6.9         3.1          5.1         2.3 virginica 

{dplyr} - filter() Rows

Filter rows based on column values.

# combine conditions
iris_tbl |> filter(Sepal.Length > 6.8 & Species == "versicolor")
# A tibble: 2 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          7           3.2          4.7         1.4 versicolor
2          6.9         3.1          4.9         1.5 versicolor
# chaining with other operation
iris_tbl |>
  filter(Sepal.Length > 6.8 & Species == "versicolor") |>
  select(contains(c("sepal", "speci")))
# A tibble: 2 × 3
  Sepal.Length Sepal.Width Species   
         <dbl>       <dbl> <fct>     
1          7           3.2 versicolor
2          6.9         3.1 versicolor

Extract rows which correspond to setosa having sepal length smaller than 4.5 cm or versicolor having petal width bigger than 1.5 cm.

{dplyr} - arrange() Rows

Order rows using column values.

# by default in ascending order
iris_tbl |> arrange(Sepal.Length) |> head(3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          4.3         3            1.1         0.1 setosa 
2          4.4         2.9          1.4         0.2 setosa 
3          4.4         3            1.3         0.2 setosa 
# by descending order
iris_tbl |> arrange(desc(Sepal.Length)) |> head(3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <fct>    
1          7.9         3.8          6.4         2   virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica

{dplyr} - slice() Rows

  • Use the function slice() to select rows.
# create a column containing row index
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

tail(iris_tbl, 4)
# A tibble: 4 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     <int>
1          6.3         2.5          5           1.9 virginica   147
2          6.5         3            5.2         2   virginica   148
3          6.2         3.4          5.4         2.3 virginica   149
4          5.9         3            5.1         1.8 virginica   150
# extract rows 5 and 3
iris_tbl |> slice(5, 3)
# A tibble: 2 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
1          5           3.6          1.4         0.2 setosa      5
2          4.7         3.2          1.3         0.2 setosa      3

{dplyr} - group_by Data

  • Use group_by to group data if we need operation based on defined group(s).
iris_tbl |> group_by(Species)
# A tibble: 150 × 6
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
 1          5.1         3.5          1.4         0.2 setosa      1
 2          4.9         3            1.4         0.2 setosa      2
 3          4.7         3.2          1.3         0.2 setosa      3
 4          4.6         3.1          1.5         0.2 setosa      4
 5          5           3.6          1.4         0.2 setosa      5
 6          5.4         3.9          1.7         0.4 setosa      6
 7          4.6         3.4          1.4         0.3 setosa      7
 8          5           3.4          1.5         0.2 setosa      8
 9          4.4         2.9          1.4         0.2 setosa      9
10          4.9         3.1          1.5         0.1 setosa     10
# ℹ 140 more rows

group_by does not change the actual data, it just adds a grouping structure to it.

{dplyr} - group_by Data

  • Use group_by to group data if we need operation based on defined group(s).
iris_tbl |> group_by(Species) |> slice(2) # the 2nd row of each species
# A tibble: 3 × 6
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
1          4.9         3            1.4         0.2 setosa         2
2          6.4         3.2          4.5         1.5 versicolor    52
3          5.8         2.7          5.1         1.9 virginica    102
iris_tbl |> group_by(Species) |> slice(5, 3) # the 5th and the 3rd rows of each species
# A tibble: 6 × 6
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
1          5           3.6          1.4         0.2 setosa         5
2          4.7         3.2          1.3         0.2 setosa         3
3          6.5         2.8          4.6         1.5 versicolor    55
4          6.9         3.1          4.9         1.5 versicolor    53
5          6.5         3            5.8         2.2 virginica    105
6          7.1         3            5.9         2.1 virginica    103

{dplyr} - ungroup() Data

  • Use ungroup() to remove grouping.
iris_grouped <- iris_tbl |> group_by(Species)
head(iris_grouped, 3)
# A tibble: 3 × 6
# Groups:   Species [1]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
1          5.1         3.5          1.4         0.2 setosa      1
2          4.9         3            1.4         0.2 setosa      2
3          4.7         3.2          1.3         0.2 setosa      3
iris_ungrouped <-  iris_grouped |> ungroup()
head(iris_ungrouped, 3)
# A tibble: 3 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
1          5.1         3.5          1.4         0.2 setosa      1
2          4.9         3            1.4         0.2 setosa      2
3          4.7         3.2          1.3         0.2 setosa      3

{dplyr} - count() Rows

  • Check unique values of one or more variables.
# count based on one variable
iris_tbl |> count(Species)
# A tibble: 3 × 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
# count for the entire data frame
iris_tbl |> count()
# A tibble: 1 × 1
      n
  <int>
1   150
# count for the "entire" data frame for each species
iris_tbl |> group_by(Species) |> count()
# A tibble: 3 × 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

{dplyr} - summarise() Function

# across all species
iris_tbl |>
  summarise(
    mean_sepal_len = mean(Sepal.Length, na.rm = TRUE),
    sd_sepal_len = sd(Sepal.Length, na.rm = TRUE),
    var_sepal_len = var(Sepal.Length, na.rm = TRUE)
  )
# A tibble: 1 × 3
  mean_sepal_len sd_sepal_len var_sepal_len
           <dbl>        <dbl>         <dbl>
1           5.84        0.828         0.686
# summarise by species
iris_tbl |>
  group_by(Species) |>
  summarise(
    mean_sepal_len = mean(Sepal.Length, na.rm = TRUE),
    sd_sepal_len = sd(Sepal.Length, na.rm = TRUE),
    var_sepal_len = var(Sepal.Length, na.rm = TRUE)
  )
# A tibble: 3 × 4
  Species    mean_sepal_len sd_sepal_len var_sepal_len
  <fct>               <dbl>        <dbl>         <dbl>
1 setosa               5.01        0.352         0.124
2 versicolor           5.94        0.516         0.266
3 virginica            6.59        0.636         0.404

Let’s Practice !

Today’s Goals

  • Clean, reshape, and manipulate real-world data using {dplyr} and {tidyr}
  • Integrate tidy data into downstream analysis (e.g., for visualization or statistical analysis)

Appendices

{dplyr} - pull() Column

Similar to $ operator, pull() extracts one column and return results in a vector.

# by default extract the last column
iris_tbl |>
  pull() |>
  head()
[1] 1 2 3 4 5 6
# extract by using explicite column name
iris_tbl |>
  pull(Species) |>
  head()
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
iris_tbl |>
  pull(Sepal.Width) |>
  head()
[1] 3.5 3.0 3.2 3.1 3.6 3.9

{dplyr} - if_else() Function

Similar to the ifelse() function, but allow to mange missing values.

x <- c(-2:2, NA)
ifelse(x < 0, "negative", "positive")
[1] "negative" "negative" "positive" "positive" "positive" NA        
if_else(x < 0, "negative", "positive", missing = "not available")
[1] "negative"      "negative"      "positive"      "positive"     
[5] "positive"      "not available"
iris_tbl |>
  mutate(sepal_len_cat = if_else(
    condition = Sepal.Length >= 7,
    true = "long",
    false = "normal",
    missing = "missing"
  )) |>
  select(Sepal.Length, Species, sepal_len_cat)
# A tibble: 150 × 3
   Sepal.Length Species sepal_len_cat
          <dbl> <fct>   <chr>        
 1          5.1 setosa  normal       
 2          4.9 setosa  normal       
 3          4.7 setosa  normal       
 4          4.6 setosa  normal       
 5          5   setosa  normal       
 6          5.4 setosa  normal       
 7          4.6 setosa  normal       
 8          5   setosa  normal       
 9          4.4 setosa  normal       
10          4.9 setosa  normal       
# ℹ 140 more rows

{dplyr} - slice() Rows

  • Use the slice_head() or slice_tail() to select the first or last rows.
iris_tbl |> slice_head(n = 3)
# A tibble: 3 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
1          5.1         3.5          1.4         0.2 setosa      1
2          4.9         3            1.4         0.2 setosa      2
3          4.7         3.2          1.3         0.2 setosa      3
iris_tbl |> slice_tail(n = 3)
# A tibble: 3 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     <int>
1          6.5         3            5.2         2   virginica   148
2          6.2         3.4          5.4         2.3 virginica   149
3          5.9         3            5.1         1.8 virginica   150

If data is a grouped data frame, slice_head() and slice_tail() will show the N first/last rows in each group.

iris_tbl |>
  group_by(Species) |>
  slice_head(n = 2) # the first 2 rows of each species
# A tibble: 6 × 6
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
1          5.1         3.5          1.4         0.2 setosa         1
2          4.9         3            1.4         0.2 setosa         2
3          7           3.2          4.7         1.4 versicolor    51
4          6.4         3.2          4.5         1.5 versicolor    52
5          6.3         3.3          6           2.5 virginica    101
6          5.8         2.7          5.1         1.9 virginica    102

{dplyr} - arrange() Rows

Order rows using column values.

# based on multiple variables
iris_tbl |> arrange(desc(Sepal.Length), Sepal.Width) |> head(5)
# A tibble: 5 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   index
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     <int>
1          7.9         3.8          6.4         2   virginica   132
2          7.7         2.6          6.9         2.3 virginica   119
3          7.7         2.8          6.7         2   virginica   123
4          7.7         3            6.1         2.3 virginica   136
5          7.7         3.8          6.7         2.2 virginica   118