IOC-R Week 3
The 4 data structures to store multiple values:
| | 1 dimension | 2 dimensions (row/column) |
---|---|---|
Same data type | vector | matrix |
Different data types | list | data frame |
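As a quick illustration of the table above, here is a minimal sketch that builds one object of each kind. The object names and values are made up for the example.

```r
# One dimension, same data type: a vector
expr_vec <- c(12.4, 11.3, 13.5)

# Two dimensions, same data type: a matrix
expr_mat <- matrix(1:6, nrow = 2, ncol = 3)

# One dimension, different data types: a list
gene_list <- list("geneA", 12.4, TRUE)

# Two dimensions, different data types across columns: a data frame
gene_df <- data.frame(
  gene_name = c("geneA", "geneB"),
  gene_expr = c(12.4, 11.3)
)

# A vector cannot mix types: values are coerced to one common type
class(c(1, "a", TRUE)) # "character"
```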
A data frame is a two-dimensional data structure that can store values of different data types, one type per column.
Use data.frame() to create a data frame, separating the columns with commas.
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes", # a single value is recycled for every row
  gene_expr = c(
    12.4, 11.3, 13.5, 10.2, 11.4,
    0.5, 1, 1.2, 1.4, 0.6
  ),
  status = rep( # repetition
    c("activated", "inhibited"),
    each = 5
  )
)
my_df
id gene_name detected gene_expr status
1 1 geneA yes 12.4 activated
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
4 4 geneD yes 10.2 activated
5 5 geneE yes 11.4 activated
6 6 geneF yes 0.5 inhibited
7 7 geneG yes 1.0 inhibited
8 8 geneH yes 1.2 inhibited
9 9 geneI yes 1.4 inhibited
10 10 geneJ yes 0.6 inhibited
[1] TRUE
'data.frame': 10 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ gene_name: chr "geneA" "geneB" "geneC" "geneD" ...
$ detected : chr "yes" "yes" "yes" "yes" ...
$ gene_expr: num 12.4 11.3 13.5 10.2 11.4 0.5 1 1.2 1.4 0.6
$ status : chr "activated" "activated" "activated" "activated" ...
id gene_name detected gene_expr status
1 1 geneA yes 12.4 activated
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
4 4 geneD yes 10.2 activated
5 5 geneE yes 11.4 activated
id gene_name detected gene_expr status
8 8 geneH yes 1.2 inhibited
9 9 geneI yes 1.4 inhibited
10 10 geneJ yes 0.6 inhibited
id gene_name detected gene_expr
Min. : 1.00 Length:10 Length:10 Min. : 0.50
1st Qu.: 3.25 Class :character Class :character 1st Qu.: 1.05
Median : 5.50 Mode :character Mode :character Median : 5.80
Mean : 5.50 Mean : 6.35
3rd Qu.: 7.75 3rd Qu.:11.38
Max. :10.00 Max. :13.50
status
Length:10
Class :character
Mode :character
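The outputs above are consistent with the usual inspection helpers. The exact calls are not shown in the text, so the sketch below is an assumption about how they could be reproduced (note that head() returns 6 rows by default, so n is set explicitly).

```r
# Recreate the data frame from the start of this section
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

is.data.frame(my_df) # TRUE
str(my_df)           # structure: type and first values of each column
head(my_df, n = 5)   # first 5 rows
tail(my_df, n = 3)   # last 3 rows
summary(my_df)       # per-column summary statistics
dim(my_df)           # number of rows and columns: 10 5
```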
Similar to matrix indexing, use the [idx, idx] syntax (row index first, column index second) to access rows and columns:
id gene_name detected gene_expr status
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
id gene_name detected gene_expr status
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
id gene_name detected gene_expr status
1 1 geneA yes 12.4 activated
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
4 4 geneD yes 10.2 activated
5 5 geneE yes 11.4 activated
6 6 geneF yes 0.5 inhibited
7 7 geneG yes 1.0 inhibited
8 8 geneH yes 1.2 inhibited
9 9 geneI yes 1.4 inhibited
10 10 geneJ yes 0.6 inhibited
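The calls behind the outputs above are not shown, so the following is only a sketch of row selections that would give the same results: rows 2 and 3 by numeric index, the same rows by logical index, and all rows when both indices are left empty.

```r
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

# Numeric index: rows 2 and 3, all columns
rows_by_position <- my_df[2:3, ]

# Logical index: TRUE for the rows to keep
rows_by_condition <- my_df[my_df$id %in% c(2, 3), ]

# Empty row and column index: everything is returned
all_rows <- my_df[, ]
```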
How to get the 2nd and 3rd columns?
gene_name detected
1 geneA yes
2 geneB yes
3 geneC yes
4 geneD yes
5 geneE yes
6 geneF yes
7 geneG yes
8 geneH yes
9 geneI yes
10 geneJ yes
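One possible answer to the question above, shown as a sketch: columns can be selected by position or by name, and both give the same data frame.

```r
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

# By position: leave the row index empty to keep all rows
cols_by_position <- my_df[, 2:3]

# By name: more robust if the column order changes later
cols_by_name <- my_df[, c("gene_name", "detected")]
```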
How to get “geneH” from the data frame?
Use $ or [[ ]] to get a column:
[1] "geneA" "geneB" "geneC" "geneD" "geneE" "geneF" "geneG" "geneH" "geneI"
[10] "geneJ"
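A short sketch of the column accessors, followed by a few equivalent ways to get "geneH". These are suggestions, not necessarily the answers used in the session.

```r
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

# Three ways to extract the gene_name column as a vector
my_df$gene_name
my_df[["gene_name"]]
my_df[[2]]

# "geneH" is in row 8 of the gene_name column
my_df$gene_name[8]    # "geneH"
my_df[8, "gene_name"] # "geneH"
my_df[8, 2]           # "geneH"
```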
# Keep rows where the value in the "status" column is "activated"
my_df[my_df$status == "activated", ]
id gene_name detected gene_expr status
1 1 geneA yes 12.4 activated
2 2 geneB yes 11.3 activated
3 3 geneC yes 13.5 activated
4 4 geneD yes 10.2 activated
5 5 geneE yes 11.4 activated
Don’t worry, we’ll go into more detail at the next session!
If you select only 1 row of a data frame:
'data.frame': 1 obs. of 5 variables:
$ id : int 2
$ gene_name: chr "geneB"
$ detected : chr "yes"
$ gene_expr: num 11.3
$ status : chr "activated"
id gene_name detected gene_expr status
"2" "geneB" "yes" "11.3" "activated"
To keep the output as a data.frame when you select only 1 column, specify drop = FALSE:
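The sketch below contrasts the two cases. Using unlist() for the one-row conversion is an assumption that matches the named character vector shown above; the original call is not visible in the text.

```r
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

# Selecting one row still returns a data frame
one_row <- my_df[2, ]
class(one_row)   # "data.frame"
unlist(one_row)  # flattened to a named character vector

# Selecting one column with [ , ] drops to a plain vector by default
one_col <- my_df[, 2]
class(one_col)   # "character"

# drop = FALSE keeps the data frame structure
one_col_df <- my_df[, 2, drop = FALSE]
class(one_col_df) # "data.frame"
```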
Modify existing column:
id gene_name detected gene_expr status
1 ID1 geneA yes 12.4 activated
2 ID2 geneB yes 11.3 activated
3 ID3 geneC yes 13.5 activated
4 ID4 geneD yes 10.2 activated
5 ID5 geneE yes 11.4 activated
6 ID6 geneF yes 0.5 inhibited
7 ID7 geneG yes 1.0 inhibited
8 ID8 geneH yes 1.2 inhibited
9 ID9 geneI yes 1.4 inhibited
10 ID10 geneJ yes 0.6 inhibited
id gene_name detected gene_expr status
1 1 geneA yes 12.4 activated
2 2 geneB yes 11.3 activated
3 ID3 geneC yes 13.5 activated
4 ID4 geneD yes 10.2 activated
5 ID5 geneE yes 11.4 activated
6 ID6 geneF yes 0.5 inhibited
7 ID7 geneG yes 1.0 inhibited
8 ID8 geneH yes 1.2 inhibited
9 ID9 geneI yes 1.4 inhibited
10 ID10 geneJ yes 0.6 inhibited
id gene_name detected gene_expr status tissue
1 1 geneA yes 12.4 activated liver
2 2 geneB yes 11.3 activated muscle
3 ID3 geneC yes 13.5 activated liver
4 ID4 geneD yes 10.2 activated muscle
5 ID5 geneE yes 11.4 activated liver
6 ID6 geneF yes 0.5 inhibited muscle
7 ID7 geneG yes 1.0 inhibited liver
8 ID8 geneH yes 1.2 inhibited muscle
9 ID9 geneI yes 1.4 inhibited liver
10 ID10 geneJ yes 0.6 inhibited muscle
id gene_name gene_expr status tissue
1 1 geneA 12.4 activated liver
2 2 geneB 11.3 activated muscle
3 ID3 geneC 13.5 activated liver
4 ID4 geneD 10.2 activated muscle
5 ID5 geneE 11.4 activated liver
6 ID6 geneF 0.5 inhibited muscle
7 ID7 geneG 1.0 inhibited liver
8 ID8 geneH 1.2 inhibited muscle
9 ID9 geneI 1.4 inhibited liver
10 ID10 geneJ 0.6 inhibited muscle
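The four outputs above show a column being modified, partly overwritten, a new column being added and a column being removed. The assignments below are one way to reproduce them; the exact code used in the session is an assumption.

```r
my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = rep(c("activated", "inhibited"), each = 5)
)

# Modify a whole column: id becomes a character column "ID1" ... "ID10"
my_df$id <- paste0("ID", my_df$id)

# Modify only some values: the numbers are coerced to "1" and "2"
my_df$id[1:2] <- c(1, 2)

# Add a new column by assigning to a name that does not exist yet
my_df$tissue <- rep(c("liver", "muscle"), times = 5)

# Remove a column by assigning NULL to it
my_df$detected <- NULL

names(my_df)
```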
If a column holds categorical data, recode it as a factor with the factor() function.
my_df[["status"]] <- factor(my_df[["status"]])
my_df$tissue <- factor(
  my_df[["tissue"]],
  levels = c("muscle", "liver") # specify the order of the levels
)
# Check again the data
str(my_df)
'data.frame': 10 obs. of 5 variables:
$ id : chr "1" "2" "ID3" "ID4" ...
$ gene_name: chr "geneA" "geneB" "geneC" "geneD" ...
$ gene_expr: num 12.4 11.3 13.5 10.2 11.4 0.5 1 1.2 1.4 0.6
$ status : Factor w/ 2 levels "activated","inhibited": 1 1 1 1 1 2 2 2 2 2
$ tissue : Factor w/ 2 levels "muscle","liver": 2 1 2 1 2 1 2 1 2 1
id gene_name gene_expr status tissue
Length:10 Length:10 Min. : 0.50 activated:5 muscle:5
Class :character Class :character 1st Qu.: 1.05 inhibited:5 liver :5
Mode :character Mode :character Median : 5.80
Mean : 6.35
3rd Qu.:11.38
Max. :13.50
Use cbind() to bind two or more data frames by columns.
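A minimal sketch of cbind() with two small, made-up data frames. Rows are matched by position only, so both inputs need the same number of rows in the same order.

```r
# Hypothetical example data: same genes, same order, in both data frames
expr_df <- data.frame(
  gene_name = c("geneA", "geneB", "geneC"),
  gene_expr = c(12.4, 11.3, 13.5)
)
annot_df <- data.frame(
  chromosome = c("chr1", "chr2", "chrX"),
  strand = c("+", "-", "+")
)

# Columns of annot_df are appended to the right of expr_df
combined <- cbind(expr_df, annot_df)
dim(combined) # 3 4
```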
Use merge() to join two data frames based on a common column.
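Unlike cbind(), merge() matches rows by the values of a key column, so row order does not matter. The data frames below are made up for illustration.

```r
# Hypothetical example data: the gene lists overlap only partly
expr_df <- data.frame(
  gene_name = c("geneA", "geneB", "geneC"),
  gene_expr = c(12.4, 11.3, 13.5)
)
annot_df <- data.frame(
  gene_name = c("geneC", "geneA", "geneD"),
  chromosome = c("chrX", "chr1", "chr4")
)

# Default: keep only the genes present in both data frames
merged <- merge(expr_df, annot_df, by = "gene_name")

# all.x = TRUE: keep every row of the first data frame, fill gaps with NA
merged_all_x <- merge(expr_df, annot_df, by = "gene_name", all.x = TRUE)
```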
A list is the most flexible data structure in R: each of its elements can be any data structure, including another list.
Use the list() function to create a list, separating the elements with commas.
[[1]]
[1] "geneA"
[[2]]
[1] 10.0 11.0 0.5
[[3]]
[1] TRUE TRUE FALSE
[1] 3
List of 3
$ : chr "geneA"
$ : num [1:3] 10 11 0.5
$ : logi [1:3] TRUE TRUE FALSE
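The three outputs above match a list printed, measured with length() and inspected with str(). The call that created the list is not shown, so the construction below is reconstructed from the printed values.

```r
# An unnamed list with three elements of different types and lengths
my_list <- list(
  "geneA",
  c(10, 11, 0.5),
  c(TRUE, TRUE, FALSE)
)

my_list
length(my_list) # 3: number of top-level elements
str(my_list)
```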
We can name the elements of a list:
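Names can be given afterwards with names() or directly when the list is created. The names used here are taken from the output shown later in this section.

```r
# Option 1: name the elements of an existing list
my_list <- list("geneA", c(10, 11, 0.5), c(TRUE, TRUE, FALSE))
names(my_list) <- c("gene_name", "counts", "expressed")

# Option 2: name the elements at creation
my_named_list <- list(
  gene_name = "geneA",
  counts = c(10, 11, 0.5),
  expressed = c(TRUE, TRUE, FALSE)
)

names(my_named_list)
```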
Use [ ] to subset a list with a numeric index, a name or a logical index; the result is still a list.
Use $ to access a named element.
Use [[ ]] to access an element by name or by index.
Once we have access to the element, we can extract values according to the data structure of that element.
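The difference between the operators can be sketched as follows, assuming the named list used in this section.

```r
my_list <- list(
  gene_name = "geneA",
  counts = c(10, 11, 0.5),
  expressed = c(TRUE, TRUE, FALSE)
)

# [ ] returns a (smaller) list
my_list[1:2]                  # numeric index
my_list["counts"]             # name
my_list[c(TRUE, FALSE, TRUE)] # logical index

# $ and [[ ]] return the element itself
my_list$counts
my_list[["counts"]]
my_list[[2]]

# The extracted element is a numeric vector, so vector indexing applies
my_list[["counts"]][3] # 0.5
```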
my_list <- list(
sample_info = data.frame(
id = paste0("sample", 1:3),
age = c(25, 27, 30)
),
family_sequenced = list(
sample1 = c("father", "mother"),
sample2 = c("father", "mother", "sister"),
sample3 = c("mother", "sister")
)
)
my_list
$sample_info
id age
1 sample1 25
2 sample2 27
3 sample3 30
$family_sequenced
$family_sequenced$sample1
[1] "father" "mother"
$family_sequenced$sample2
[1] "father" "mother" "sister"
$family_sequenced$sample3
[1] "mother" "sister"
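With a nested list like the one above, accessors are chained: first reach the element, then index it according to its own structure. A few illustrative extractions:

```r
my_list <- list(
  sample_info = data.frame(
    id = paste0("sample", 1:3),
    age = c(25, 27, 30)
  ),
  family_sequenced = list(
    sample1 = c("father", "mother"),
    sample2 = c("father", "mother", "sister"),
    sample3 = c("mother", "sister")
  )
)

# The element is a data frame: use data frame indexing
my_list$sample_info$age           # 25 27 30
my_list[["sample_info"]][2, "id"] # "sample2"

# The element is a list of vectors: go one level deeper, then index the vector
my_list$family_sequenced$sample2[3] # "sister"

# Number of sequenced family members per sample
lengths(my_list$family_sequenced)
```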
$gene_name
[1] 1 2 3 4 5 6 7 8 9 10
$counts
[1] 10.0 11.0 0.5
$expressed
[1] TRUE TRUE FALSE
$new_element
[1] 3 2 1
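The output above shows the gene_name element replaced and a new_element added. The calls are not shown in the text; the assignments below are an assumption that gives the same result.

```r
my_list <- list(
  gene_name = "geneA",
  counts = c(10, 11, 0.5),
  expressed = c(TRUE, TRUE, FALSE)
)

# Replace an existing element (its type and length may change)
my_list$gene_name <- 1:10

# Add a new element by assigning to a name that does not exist yet
my_list[["new_element"]] <- c(3, 2, 1)

names(my_list)
```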
Use c() to concatenate two or more lists.
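A small sketch with made-up lists. Note the difference with list(), which nests its arguments instead of concatenating them.

```r
list_a <- list(gene_name = "geneA", counts = c(10, 11, 0.5))
list_b <- list(expressed = c(TRUE, TRUE, FALSE))

# c() puts all elements at the same level
combined <- c(list_a, list_b)
length(combined) # 3
names(combined)

# list() keeps the two lists as separate, nested elements
nested <- list(list_a, list_b)
length(nested)   # 2
```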