[1] 12 15 18
{tidyr} and {dplyr}
IOC-R Week 8
apply() and lapply()
X: a matrix or a data frame (coerced to a matrix)
MARGIN: 1 for rows and 2 for columns
FUN: the function to be applied
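To make the three arguments concrete, here is a minimal sketch on a made-up 3 × 3 matrix. The objects m and the small list are illustrative assumptions, not taken from the course material.

```r
# A small matrix filled column by column: rows are (1,4,7), (2,5,8), (3,6,9)
m <- matrix(1:9, nrow = 3)

# MARGIN = 1 applies FUN to every row
row_sums <- apply(X = m, MARGIN = 1, FUN = sum)
row_sums  # 12 15 18

# MARGIN = 2 applies FUN to every column
col_means <- apply(X = m, MARGIN = 2, FUN = mean)

# lapply() works on the elements of a list (or vector) and always returns a list
lens <- lapply(list(a = 1:3, b = letters), length)
```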
{tidyverse} Ecosystem
What is {tidyverse}?
{tidyverse} provides a consistent and intuitive set of packages for data manipulation, visualization and analysis. The core packages include:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
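The conflict lines in this message mean that, once {dplyr} is attached, a bare filter() refers to the {dplyr} version. A short sketch of how either version can still be called explicitly with the package::function notation (the readings table is a made-up example):

```r
library(dplyr)

readings <- tibble(id = 1:4, value = c(10, 20, 30, 40))

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(readings, value > 15)

# stats::filter() is the masked base function: a linear filter on a numeric series
smoothed <- stats::filter(readings$value, filter = rep(1 / 2, 2))
```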
Pipe operator |>
The pipe operator |> takes the output of one function and feeds it to the first argument of the next function.
Compare the 3 ways to calculate the square root of the mean of the absolute values of x
(\(\sqrt{\frac{1}{n} \sum_{i=1}^{n} |x_i|}\)).
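The three styles can be sketched as follows. The vector x is an arbitrary example, and the native pipe requires R 4.1 or later.

```r
x <- c(-4, 9, -16, 25)

# 1. Nested calls: read from the inside out
res_nested <- sqrt(mean(abs(x)))

# 2. Intermediate variables: one step per line
x_abs <- abs(x)
x_mean <- mean(x_abs)
res_steps <- sqrt(x_mean)

# 3. Pipe: read from left to right, each result feeds the next call
res_piped <- x |> abs() |> mean() |> sqrt()
```

All three compute the same value; the piped version keeps the order of operations the same as the order of reading.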
|>: native pipe operator, built into base R (version 4.1+)
%>%: pipe from the {magrittr} package
What is a tibble?
A “modern” data frame, compatible with the base data frame, but with some enhancements.
Data imported with {readr} is in tibble format.
# A tibble: 45 × 41
Feature WT.1 WT.2 WT.3 WT.4 WT.5 WT.6 WT.7 WT.8 WT.9 WT.10 SET1.1
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 HTB2 20648 466 1783 25335 64252 24126 9067 19721 67353 28059 21214
2 HHF1 7867 147 427 5178 27889 8547 3432 6935 34229 13913 9807
3 HHT1 1481 37 187 1856 3952 1020 484 1409 4636 1870 1604
4 POL30 743 27 370 4050 877 357 845 2110 1872 684 1075
5 KCC4 185 6 200 669 166 68 360 595 438 204 209
6 MCD1 457 16 117 1696 586 244 464 1227 1204 457 538
7 MSA1-1of2 8 0 5 23 13 12 15 9 19 16 8
8 HTB1 4209 105 493 4887 12369 3597 1769 3664 13649 5455 4560
9 SUT476 100 8 10 28 23 31 47 42 122 72 143
10 APQ12 4644 490 4242 7891 6477 6754 6010 6494 7744 4722 4231
# ℹ 35 more rows
# ℹ 29 more variables: SET1.2 <dbl>, SET1.3 <dbl>, SET1.4 <dbl>, SET1.5 <dbl>,
# SET1.6 <dbl>, SET1.7 <dbl>, SET1.8 <dbl>, SET1.9 <dbl>, SET1.10 <dbl>,
# SET1.RRP6.1 <dbl>, SET1.RRP6.2 <dbl>, SET1.RRP6.3 <dbl>, SET1.RRP6.4 <dbl>,
# SET1.RRP6.5 <dbl>, SET1.RRP6.6 <dbl>, SET1.RRP6.7 <dbl>, SET1.RRP6.8 <dbl>,
# SET1.RRP6.9 <dbl>, SET1.RRP6.10 <dbl>, RRP6.1 <dbl>, RRP6.2 <dbl>,
# RRP6.3 <dbl>, RRP6.4 <dbl>, RRP6.5 <dbl>, RRP6.6 <dbl>, RRP6.7 <dbl>, …
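The file behind the table above is not named in this text. As a stand-in, the sketch below writes a tiny CSV (two rows and two count columns copied from the printout) to a temporary file and reads it back with read_csv(), to show that the result is a tibble.

```r
library(readr)

# Hypothetical stand-in file; the real course file is not shown here
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Feature,WT.1,WT.2",
    "HTB2,20648,466",
    "HHF1,7867,147"),
  csv_path
)

counts <- read_csv(csv_path, show_col_types = FALSE)
class(counts)  # includes "tbl_df", the tibble class
counts
```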
# A tibble: 2 × 2
col1 col2
<int> <lgl>
1 7 TRUE
2 8 FALSE
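The call that built this small tibble is not shown. A minimal sketch that would print the same table, assuming the values were typed in directly:

```r
library(tibble)

# 7:8 is an integer vector (<int>), c(TRUE, FALSE) is logical (<lgl>)
small_tbl <- tibble(
  col1 = 7:8,
  col2 = c(TRUE, FALSE)
)
small_tbl
```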
Column type abbreviations:
chr: character
dbl: double
int: integer
fct: factor
lgl: logical
dttm: date and time
Convert an existing data frame to a tibble with the as_tibble() function.
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
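A sketch of the conversion, assuming the table above comes from the built-in iris data frame and is stored under the name iris_tbl, which the later examples use:

```r
library(tibble)

class(iris)                  # "data.frame"
iris_tbl <- as_tibble(iris)  # same columns and rows, now a tibble
head(iris_tbl, 3)
```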
Tibbles don’t store row names. The numbers before the first column in the printed tibble are not row names or an index stored in the data. They are simply row numbers displayed for readability.
{tidyr}
Cheat sheet: https://github.com/rstudio/cheatsheets/blob/master/tidyr.pdf
Each variable is a column, each observation is a row … (a notion introduced by Hadley Wickham)
tidy_data <- tibble(
gene = rep(paste0("gene", LETTERS[1:3]), 2),
condition = rep(c("control", "treatment"), each = 3),
expression_level = c(NA, 20, 30, 15, 25, 35)
)
tidy_data
# A tibble: 6 × 3
gene condition expression_level
<chr> <chr> <dbl>
1 geneA control NA
2 geneB control 20
3 geneC control 30
4 geneA treatment 15
5 geneB treatment 25
6 geneC treatment 35
{tidyr} - pivot_longer()
Pivot data into longer format by increasing the number of rows.
not_tidy |>
pivot_longer(
cols = c(control, treatment),
names_to = "condition",
values_to = "expression_level"
)
# A tibble: 6 × 3
gene condition expression_level
<chr> <chr> <dbl>
1 geneA control NA
2 geneA treatment 15
3 geneB control 20
4 geneB treatment 25
5 geneC control 30
6 geneC treatment 35
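The pivot_longer() call above starts from a table named not_tidy whose definition does not appear in this text. A sketch under the assumption that it is the wide counterpart of tidy_data, with one row per gene and one column per condition:

```r
library(tibble)
library(tidyr)

# Assumed definition of not_tidy (not shown in the original)
not_tidy <- tibble(
  gene = c("geneA", "geneB", "geneC"),
  control = c(NA, 20, 30),
  treatment = c(15, 25, 35)
)

long <- not_tidy |>
  pivot_longer(
    cols = c(control, treatment),
    names_to = "condition",
    values_to = "expression_level"
  )
long
```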
{tidyr} - pivot_wider()
Pivot data into wider format by increasing the number of columns. It’s the inverse transformation of pivot_longer().
# widens by condition
tidy_data |>
pivot_wider(names_from = condition, values_from = expression_level)
# A tibble: 3 × 3
gene control treatment
<chr> <dbl> <dbl>
1 geneA NA 15
2 geneB 20 25
3 geneC 30 35
# A tibble: 2 × 4
condition geneA geneB geneC
<chr> <dbl> <dbl> <dbl>
1 control NA 20 30
2 treatment 15 25 35
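The call behind the second table is not shown. It can be reproduced by widening by gene instead of by condition; a sketch, reusing the tidy_data definition from earlier:

```r
library(tibble)
library(tidyr)

tidy_data <- tibble(
  gene = rep(paste0("gene", LETTERS[1:3]), 2),
  condition = rep(c("control", "treatment"), each = 3),
  expression_level = c(NA, 20, 30, 15, 25, 35)
)

# widens by gene: one column per gene, one row per condition
by_gene <- tidy_data |>
  pivot_wider(names_from = gene, values_from = expression_level)
by_gene
```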
{tidyr} - drop_na() in Table
By default, drop_na() keeps only the rows that have no missing value in any column.
# A tibble: 6 × 4
gene condition expression_level description
<chr> <chr> <dbl> <chr>
1 geneA control NA growth regulation
2 geneB control 20 stress response
3 geneC control 30 <NA>
4 geneA treatment 15 growth regulation
5 geneB treatment 25 stress response
6 geneC treatment 35 <NA>
# A tibble: 3 × 4
gene condition expression_level description
<chr> <chr> <dbl> <chr>
1 geneB control 20 stress response
2 geneA treatment 15 growth regulation
3 geneB treatment 25 stress response
# A tibble: 4 × 4
gene condition expression_level description
<chr> <chr> <dbl> <chr>
1 geneA control NA growth regulation
2 geneB control 20 stress response
3 geneA treatment 15 growth regulation
4 geneB treatment 25 stress response
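The code for these three tables is not shown. A sketch that would reproduce them, assuming the description column was added to tidy_data as below (the name annotated is illustrative):

```r
library(tibble)
library(dplyr)
library(tidyr)

tidy_data <- tibble(
  gene = rep(paste0("gene", LETTERS[1:3]), 2),
  condition = rep(c("control", "treatment"), each = 3),
  expression_level = c(NA, 20, 30, 15, 25, 35)
)

# Assumed way of adding the description column
annotated <- tidy_data |>
  mutate(description = rep(c("growth regulation", "stress response", NA), 2))

# Default: drop every row with an NA in any column
complete_rows <- annotated |> drop_na()

# Restrict the check to chosen columns: NAs elsewhere are kept
described_rows <- annotated |> drop_na(description)
```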
In the tidyverse, column names can be used as-is without quotes. If a column name starts with a number or contains spaces or special characters, you must wrap it in backticks (`column name`, `1col`).
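A small sketch of the backtick rule, using a made-up tibble whose column names are chosen only to illustrate the two problem cases:

```r
library(dplyr)

odd_names <- tibble(
  `1col` = 1:3,
  `column name` = c("a", "b", "c"),
  plain = c(TRUE, FALSE, TRUE)
)

# Regular names need no quoting; non-syntactic names need backticks
picked <- odd_names |> select(plain, `1col`, `column name`)
above_one <- odd_names |> filter(`1col` > 1)
```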
{dplyr}
Cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf
{dplyr} - select() Columns
Select by column index.
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# A tibble: 3 × 1
Sepal.Length
<dbl>
1 5.1
2 4.9
3 4.7
# A tibble: 3 × 2
Sepal.Length Petal.Length
<dbl> <dbl>
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
# A tibble: 3 × 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
# A tibble: 3 × 1
Species
<fct>
1 setosa
2 setosa
3 setosa
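The calls that produced these outputs are not shown. One plausible set, assuming iris_tbl is the built-in iris data converted with as_tibble():

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
names(iris_tbl)

first_col <- iris_tbl |> select(1)        # a single position
two_cols  <- iris_tbl |> select(c(1, 3))  # several positions
col_range <- iris_tbl |> select(1:4)      # a range of positions
last_col  <- iris_tbl |> select(5)        # the fifth (last) column

head(two_cols, 3)
```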
{dplyr} - select() Columns
Select by column name.
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# A tibble: 3 × 1
Species
<fct>
1 setosa
2 setosa
3 setosa
# A tibble: 3 × 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.1
2 setosa 4.9
3 setosa 4.7
# A tibble: 3 × 3
Sepal.Length Sepal.Width Petal.Length
<dbl> <dbl> <dbl>
1 5.1 3.5 1.4
2 4.9 3 1.4
3 4.7 3.2 1.3
# use ! or - operator to negate a selection
iris_tbl |>
select(!(Sepal.Length:Petal.Length)) |>
head(3)
# A tibble: 3 × 2
Petal.Width Species
<dbl> <fct>
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
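Only the negated selection is shown with its code. The earlier outputs on this slide could come from calls along these lines (a sketch, assuming iris_tbl is as_tibble(iris)):

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

one_col    <- iris_tbl |> select(Species)
reordered  <- iris_tbl |> select(Species, Sepal.Length)        # order follows the call
name_range <- iris_tbl |> select(Sepal.Length:Petal.Length)    # range between two names

head(reordered, 3)
```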
{dplyr} - select() Columns
Select columns with helper functions. By default, name matching ignores case.
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# A tibble: 3 × 3
Sepal.Length Sepal.Width Species
<dbl> <dbl> <fct>
1 5.1 3.5 setosa
2 4.9 3 setosa
3 4.7 3.2 setosa
# A tibble: 3 × 1
Species
<fct>
1 setosa
2 setosa
3 setosa
# A tibble: 3 × 2
Sepal.Length Petal.Length
<dbl> <dbl>
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
# A tibble: 3 × 2
Sepal.Length Petal.Length
<dbl> <dbl>
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
# A tibble: 150 × 0
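The helper calls themselves are not shown. The sketch below uses starts_with(), ends_with() and contains(), which would give the same column sets as the outputs above; the exact patterns are assumptions.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

s_start  <- iris_tbl |> select(starts_with("s"))   # matches "Sepal..." and "Species"
s_end    <- iris_tbl |> select(ends_with("s"))
has_len  <- iris_tbl |> select(contains("length"))

# With case-sensitive matching no column starts with a lowercase "s",
# so the result keeps 150 rows but has 0 columns
none <- iris_tbl |> select(starts_with("s", ignore.case = FALSE))
dim(none)
```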
{dplyr} - mutate() Columns
Use mutate() to add or modify columns.
# A tibble: 3 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_mm
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 510
2 4.9 3 1.4 0.2 setosa 490
3 4.7 3.2 1.3 0.2 setosa 470
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
# A tibble: 2 × 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_len_mm
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 510
2 4.9 3 1.4 0.2 setosa 490
# ℹ 1 more variable: petal_len_mm <dbl>
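The mutate() calls are not shown. In the printed values sepal_len_mm equals Sepal.Length × 100, so the sketch uses that factor to match the output; note that a true centimetre-to-millimetre conversion would multiply by 10. The factor for petal_len_mm is an assumption, since its values are not printed.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Add one column; the result must be assigned to be kept
with_mm <- iris_tbl |> mutate(sepal_len_mm = Sepal.Length * 100)
head(with_mm, 3)

# iris_tbl itself is unchanged: still 5 columns
head(iris_tbl, 3)

# Several columns can be created in one call
with_both <- iris_tbl |>
  mutate(
    sepal_len_mm = Sepal.Length * 100,
    petal_len_mm = Petal.Length * 100
  )
head(with_both, 2)
```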
{dplyr} - rename() Columns
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length petal_width espece
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
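A sketch of a rename() call that would give the column names printed above. The syntax is new_name = old_name.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

renamed <- iris_tbl |>
  rename(petal_width = Petal.Width, espece = Species)
names(renamed)
```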
{dplyr} - filter() Rows
Filter rows based on column values.
# A tibble: 17 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 7 3.2 4.7 1.4 versicolor
2 6.9 3.1 4.9 1.5 versicolor
3 7.1 3 5.9 2.1 virginica
4 7.6 3 6.6 2.1 virginica
5 7.3 2.9 6.3 1.8 virginica
6 7.2 3.6 6.1 2.5 virginica
7 7.7 3.8 6.7 2.2 virginica
8 7.7 2.6 6.9 2.3 virginica
9 6.9 3.2 5.7 2.3 virginica
10 7.7 2.8 6.7 2 virginica
11 7.2 3.2 6 1.8 virginica
12 7.2 3 5.8 1.6 virginica
13 7.4 2.8 6.1 1.9 virginica
14 7.9 3.8 6.4 2 virginica
15 7.7 3 6.1 2.3 virginica
16 6.9 3.1 5.4 2.1 virginica
17 6.9 3.1 5.1 2.3 virginica
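The filter condition is not shown. Given the 17 rows printed and the threshold used in the chained example of this deck, a condition such as Sepal.Length > 6.8 is a reasonable guess:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Keep only the rows where the condition is TRUE
long_sepals <- iris_tbl |> filter(Sepal.Length > 6.8)
nrow(long_sepals)  # 17
```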
{dplyr} - filter() Rows
Filter rows based on column values.
# A tibble: 2 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 7 3.2 4.7 1.4 versicolor
2 6.9 3.1 4.9 1.5 versicolor
# chaining with other operation
iris_tbl |>
filter(Sepal.Length > 6.8 & Species == "versicolor") |>
select(contains(c("sepal", "speci")))
# A tibble: 2 × 3
Sepal.Length Sepal.Width Species
<dbl> <dbl> <fct>
1 7 3.2 versicolor
2 6.9 3.1 versicolor
Extract the rows that correspond to setosa with a sepal length smaller than 4.5 cm, or to versicolor with a petal width larger than 1.5 cm.
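One possible solution, offered as a sketch: each species-specific condition is built with & and the two are combined with |. The parentheses make the grouping explicit.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

selected <- iris_tbl |>
  filter(
    (Species == "setosa" & Sepal.Length < 4.5) |
      (Species == "versicolor" & Petal.Width > 1.5)
  )
selected
```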
{dplyr} - arrange() Rows
Order rows using column values.
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.3 3 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3 1.3 0.2 setosa
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 7.9 3.8 6.4 2 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.7 2.6 6.9 2.3 virginica
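The two outputs look like an ascending and a descending sort on Sepal.Length. A sketch of calls that would produce them, assuming iris_tbl is as_tibble(iris):

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Ascending order is the default
shortest <- iris_tbl |> arrange(Sepal.Length) |> head(3)

# Wrap the column in desc() for descending order
longest <- iris_tbl |> arrange(desc(Sepal.Length)) |> head(3)
```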
{dplyr} - slice() Rows
Use slice() to select rows.
# create a column containing row index
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))
tail(iris_tbl, 4)
# A tibble: 4 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 6.3 2.5 5 1.9 virginica 147
2 6.5 3 5.2 2 virginica 148
3 6.2 3.4 5.4 2.3 virginica 149
4 5.9 3 5.1 1.8 virginica 150
# A tibble: 2 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5 3.6 1.4 0.2 setosa 5
2 4.7 3.2 1.3 0.2 setosa 3
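The slice() call for the last output is not shown. Since the rows with index 5 and 3 come back in that order, it was probably something like the following sketch:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

# Rows are returned in the order in which the positions are given
picked <- iris_tbl |> slice(c(5, 3))
picked
```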
{dplyr} - group_by() Data
Use group_by() to group data when an operation must be carried out per defined group(s).
# A tibble: 150 × 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
4 4.6 3.1 1.5 0.2 setosa 4
5 5 3.6 1.4 0.2 setosa 5
6 5.4 3.9 1.7 0.4 setosa 6
7 4.6 3.4 1.4 0.3 setosa 7
8 5 3.4 1.5 0.2 setosa 8
9 4.4 2.9 1.4 0.2 setosa 9
10 4.9 3.1 1.5 0.1 setosa 10
# ℹ 140 more rows
group_by() does not change the actual data; it just adds a grouping structure to it.
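A short sketch that checks this: after grouping, the dimensions and values are the same, and only the grouping metadata is new. The name by_species is illustrative.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

by_species <- iris_tbl |> group_by(Species)

group_vars(by_species)  # the grouping column(s)
n_groups(by_species)    # number of groups

# Same shape and same row order as before grouping
dim(by_species)
```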
{dplyr} - group_by() Data
Use group_by() to group data when an operation must be carried out per defined group(s).
# A tibble: 3 × 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 4.9 3 1.4 0.2 setosa 2
2 6.4 3.2 4.5 1.5 versicolor 52
3 5.8 2.7 5.1 1.9 virginica 102
# A tibble: 6 × 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5 3.6 1.4 0.2 setosa 5
2 4.7 3.2 1.3 0.2 setosa 3
3 6.5 2.8 4.6 1.5 versicolor 55
4 6.9 3.1 4.9 1.5 versicolor 53
5 6.5 3 5.8 2.2 virginica 105
6 7.1 3 5.9 2.1 virginica 103
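These outputs are consistent with slice() applied to grouped data, where positions are counted within each group. A sketch of calls that would reproduce them:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

by_species <- iris_tbl |> group_by(Species)

# Second row of each species
second_rows <- by_species |> slice(2)

# Fifth and third rows of each species, in that order
fifth_third <- by_species |> slice(c(5, 3))
```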
{dplyr} - ungroup() Data
Use ungroup() to remove grouping.
# A tibble: 3 × 6
# Groups: Species [1]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
# A tibble: 3 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
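A sketch of the comparison shown above: the first output keeps the "# Groups" header, the second no longer has it.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

by_species <- iris_tbl |> group_by(Species)

by_species |> head(3)               # still grouped
flat <- by_species |> ungroup()     # grouping removed
flat |> head(3)
```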
{dplyr} - count() Rows
# A tibble: 3 × 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
# A tibble: 1 × 1
n
<int>
1 150
# A tibble: 3 × 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
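The count() calls are not shown. A sketch of three calls that would match the three outputs, in order:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Count rows per value of a column
per_species <- iris_tbl |> count(Species)

# Without a column, count all rows
total <- iris_tbl |> count()

# On grouped data, count per group; the result stays grouped
grouped_counts <- iris_tbl |> group_by(Species) |> count()
```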
{dplyr} - summarise() Function
# across all species
iris_tbl |>
summarise(
mean_sepal_len = mean(Sepal.Length, na.rm = TRUE),
sd_sepal_len = sd(Sepal.Length, na.rm = TRUE),
var_sepal_len = var(Sepal.Length, na.rm = TRUE)
)
# A tibble: 1 × 3
mean_sepal_len sd_sepal_len var_sepal_len
<dbl> <dbl> <dbl>
1 5.84 0.828 0.686
# summarise by species
iris_tbl |>
group_by(Species) |>
summarise(
mean_sepal_len = mean(Sepal.Length, na.rm = TRUE),
sd_sepal_len = sd(Sepal.Length, na.rm = TRUE),
var_sepal_len = var(Sepal.Length, na.rm = TRUE)
)
# A tibble: 3 × 4
Species mean_sepal_len sd_sepal_len var_sepal_len
<fct> <dbl> <dbl> <dbl>
1 setosa 5.01 0.352 0.124
2 versicolor 5.94 0.516 0.266
3 virginica 6.59 0.636 0.404
{dplyr} and {tidyr}
{dplyr} - pull() Column
Similar to the $ operator, pull() extracts one column and returns the result as a vector.
[1] 1 2 3 4 5 6
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
[1] 3.5 3.0 3.2 3.1 3.6 3.9
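The pull() calls are not shown. A sketch that would print the same three vectors, assuming iris_tbl carries the index column added earlier; with no argument, pull() takes the last column.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

idx     <- iris_tbl |> pull() |> head()             # last column (index)
species <- iris_tbl |> pull(Species) |> head()      # by name
widths  <- iris_tbl |> pull(Sepal.Width) |> head()  # same as iris_tbl$Sepal.Width
```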
{dplyr} - if_else() Function
Similar to the ifelse() function, but it also lets you manage missing values.
iris_tbl |>
mutate(sepal_len_cat = if_else(
condition = Sepal.Length >= 7,
true = "long",
false = "normal",
missing = "missing"
)) |>
select(Sepal.Length, Species, sepal_len_cat)
# A tibble: 150 × 3
Sepal.Length Species sepal_len_cat
<dbl> <fct> <chr>
1 5.1 setosa normal
2 4.9 setosa normal
3 4.7 setosa normal
4 4.6 setosa normal
5 5 setosa normal
6 5.4 setosa normal
7 4.6 setosa normal
8 5 setosa normal
9 4.4 setosa normal
10 4.9 setosa normal
# ℹ 140 more rows
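Because iris contains no missing values, the missing argument has no visible effect in the output above. The difference from ifelse() shows up on a vector with an NA; the values below are made up for illustration.

```r
library(dplyr)

lengths_cm <- c(6.5, 7.2, NA)

# Base R: the NA is propagated
base_labels <- ifelse(lengths_cm >= 7, "long", "normal")

# dplyr: the NA can be replaced by a chosen label
dplyr_labels <- if_else(lengths_cm >= 7, "long", "normal", missing = "missing")
```

if_else() is also stricter than ifelse(): the true, false and missing values must have compatible types.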
{dplyr} - slice() Rows
Use slice_head() or slice_tail() to select the first or last rows.
# A tibble: 3 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
# A tibble: 3 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 6.5 3 5.2 2 virginica 148
2 6.2 3.4 5.4 2.3 virginica 149
3 5.9 3 5.1 1.8 virginica 150
If the data is a grouped data frame, slice_head() and slice_tail() return the first/last n rows of each group.
# A tibble: 6 × 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 2
3 7 3.2 4.7 1.4 versicolor 51
4 6.4 3.2 4.5 1.5 versicolor 52
5 6.3 3.3 6 2.5 virginica 101
6 5.8 2.7 5.1 1.9 virginica 102
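The calls for this slide are not shown. A sketch that would reproduce the three outputs, using the n argument to set how many rows to keep:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

first3 <- iris_tbl |> slice_head(n = 3)
last3  <- iris_tbl |> slice_tail(n = 3)

# On grouped data, n rows are taken from each group
first2_each <- iris_tbl |>
  group_by(Species) |>
  slice_head(n = 2)
```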
{dplyr} - arrange() Rows
Order rows using column values.
# A tibble: 5 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
<dbl> <dbl> <dbl> <dbl> <fct> <int>
1 7.9 3.8 6.4 2 virginica 132
2 7.7 2.6 6.9 2.3 virginica 119
3 7.7 2.8 6.7 2 virginica 123
4 7.7 3 6.1 2.3 virginica 136
5 7.7 3.8 6.7 2.2 virginica 118
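In this output, ties on Sepal.Length (7.7) are ordered by increasing Sepal.Width, which suggests a sort on two columns. A sketch of such a call:

```r
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_tbl <- mutate(iris_tbl, index = seq_len(nrow(iris_tbl)))

# Sort by Sepal.Length (descending), then break ties with Sepal.Width (ascending)
top5 <- iris_tbl |>
  arrange(desc(Sepal.Length), Sepal.Width) |>
  head(5)
top5
```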