More Complex Data Structures:
Data Frames and Lists

IOC-R Week 3

Last Week Review

What We’ve Learned So Far

  • Data types
  • Variables
  • Data structures:
    • vector
    • matrix

What is the output of the following code?

c(1, 3, 5)[3]
5:2
seq(1, 2, by = 0.5)
rep(c(1, 3), times = 2)
rep(c("case", "control"), each = 2)


mat <- matrix(1:6, nrow = 2)
mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
mat[1, ]
mat[, 2]
mat[2, 1]

Data Structures (Part 2)

Data Structures

The 4 data structures to store multiple values:

                      1 dimension   2 dimensions (row/column)
Same data type        vector        matrix
Different data types  list          data frame
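
As a rough illustration of the table above, the sketch below builds one object of each kind. The object names (v, m, l, d, mixed) are made up for this example and are not part of the course material.

```r
# One small example of each of the four data structures
v <- c(1.5, 2, 3)                  # vector: 1 dimension, one data type
m <- matrix(1:6, nrow = 2)         # matrix: 2 dimensions, one data type
l <- list("geneA", 1:3, TRUE)      # list: 1 dimension, mixed data types
d <- data.frame(                   # data frame: 2 dimensions, one type per column
  id = 1:2,
  name = c("a", "b")
)

# Check each structure
is.vector(v)
is.matrix(m)
is.list(l)
is.data.frame(d)

# Mixing types in a vector does not fail: everything is coerced to one type
mixed <- c(1, "a", TRUE)
mixed  # "1" "a" "TRUE"
```

The last two lines show why a list or a data frame is needed as soon as the values have different types: a vector silently converts them all to character.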

Data Frames

A two-dimensional data structure in which each column can hold a different data type (all values within one column share the same type).

Data Frames Creation

Use data.frame() to create a data frame; separate the columns with commas.

my_df <- data.frame(
  id = 1:10,
  gene_name = paste0("gene", LETTERS[1:10]),
  detected = "yes",
  gene_expr = c(
    12.4, 11.3, 13.5, 10.2, 11.4,
    0.5, 1, 1.2, 1.4, 0.6
  ),
  status = rep( # repetition
    c("activated", "inhibited"),
    each = 5
  )
)
my_df
   id gene_name detected gene_expr    status
1   1     geneA      yes      12.4 activated
2   2     geneB      yes      11.3 activated
3   3     geneC      yes      13.5 activated
4   4     geneD      yes      10.2 activated
5   5     geneE      yes      11.4 activated
6   6     geneF      yes       0.5 inhibited
7   7     geneG      yes       1.0 inhibited
8   8     geneH      yes       1.2 inhibited
9   9     geneI      yes       1.4 inhibited
10 10     geneJ      yes       0.6 inhibited
# Check the data structure
is.data.frame(my_df)
[1] TRUE
str(my_df)
'data.frame':   10 obs. of  5 variables:
 $ id       : int  1 2 3 4 5 6 7 8 9 10
 $ gene_name: chr  "geneA" "geneB" "geneC" "geneD" ...
 $ detected : chr  "yes" "yes" "yes" "yes" ...
 $ gene_expr: num  12.4 11.3 13.5 10.2 11.4 0.5 1 1.2 1.4 0.6
 $ status   : chr  "activated" "activated" "activated" "activated" ...
  • How many rows and columns?
  • What is the data type for each column?

Exploring the Data Frame

nrow(my_df)
[1] 10
ncol(my_df)
[1] 5
dim(my_df)
[1] 10  5
rownames(my_df)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
colnames(my_df)
[1] "id"        "gene_name" "detected"  "gene_expr" "status"   
head(my_df, n = 5)
  id gene_name detected gene_expr    status
1  1     geneA      yes      12.4 activated
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated
4  4     geneD      yes      10.2 activated
5  5     geneE      yes      11.4 activated
tail(my_df, n = 3)
   id gene_name detected gene_expr    status
8   8     geneH      yes       1.2 inhibited
9   9     geneI      yes       1.4 inhibited
10 10     geneJ      yes       0.6 inhibited
summary(my_df)
       id         gene_name           detected           gene_expr    
 Min.   : 1.00   Length:10          Length:10          Min.   : 0.50  
 1st Qu.: 3.25   Class :character   Class :character   1st Qu.: 1.05  
 Median : 5.50   Mode  :character   Mode  :character   Median : 5.80  
 Mean   : 5.50                                         Mean   : 6.35  
 3rd Qu.: 7.75                                         3rd Qu.:11.38  
 Max.   :10.00                                         Max.   :13.50  
    status         
 Length:10         
 Class :character  
 Mode  :character  
                   
                   
                   

Accessing Elements (1)

As with matrix indexing, use the [row_idx, col_idx] syntax to access elements with:

  • numeric index
  • row/column names
  • logical index
# Get 2nd and 3rd rows
my_df[2:3, ] 
  id gene_name detected gene_expr    status
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated
my_df[-c(1, 4:10), ] # remove the other rows
  id gene_name detected gene_expr    status
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated
my_df[c("2", "3"), ]
  id gene_name detected gene_expr    status
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated
my_df[c(FALSE, TRUE, TRUE, rep(FALSE, 7)), ]
  id gene_name detected gene_expr    status
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated

Accessing Elements (2)

my_df
   id gene_name detected gene_expr    status
1   1     geneA      yes      12.4 activated
2   2     geneB      yes      11.3 activated
3   3     geneC      yes      13.5 activated
4   4     geneD      yes      10.2 activated
5   5     geneE      yes      11.4 activated
6   6     geneF      yes       0.5 inhibited
7   7     geneG      yes       1.0 inhibited
8   8     geneH      yes       1.2 inhibited
9   9     geneI      yes       1.4 inhibited
10 10     geneJ      yes       0.6 inhibited

How to get the 2nd and 3rd columns?

my_df[, 2:3]
   gene_name detected
1      geneA      yes
2      geneB      yes
3      geneC      yes
4      geneD      yes
5      geneE      yes
6      geneF      yes
7      geneG      yes
8      geneH      yes
9      geneI      yes
10     geneJ      yes
my_df[, c("gene_name", "detected")] # idem
my_df[, c(FALSE, TRUE, TRUE, FALSE, FALSE)] # idem

How to get “geneH” from the data frame?

my_df[8, 2]
[1] "geneH"
my_df[8, "gene_name"]
[1] "geneH"

Accessing Elements (3)

  • Use the operator $ or [[ ]] to get a column:
my_df$gene_name
 [1] "geneA" "geneB" "geneC" "geneD" "geneE" "geneF" "geneG" "geneH" "geneI"
[10] "geneJ"
my_df[[2]] # idem, use numeric position
my_df[["gene_name"]] # idem, use column name
  • Subset the data frame based on some conditions:
# Keep rows where the value in the "status" column is "activated"
my_df[my_df$status == "activated", ]
  id gene_name detected gene_expr    status
1  1     geneA      yes      12.4 activated
2  2     geneB      yes      11.3 activated
3  3     geneC      yes      13.5 activated
4  4     geneD      yes      10.2 activated
5  5     geneE      yes      11.4 activated
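
Conditions can also be combined. The sketch below uses a smaller, made-up data frame (expr_df, with values borrowed from my_df) and the base R operators & (and) and | (or); the thresholds are arbitrary and only serve the illustration.

```r
# A small stand-in for my_df (hypothetical subset of its rows)
expr_df <- data.frame(
  gene_name = paste0("gene", LETTERS[1:4]),
  gene_expr = c(12.4, 10.2, 0.5, 1.4),
  status = c("activated", "activated", "inhibited", "inhibited")
)

# A condition returns one TRUE/FALSE per row
keep <- expr_df$status == "activated" & expr_df$gene_expr > 11
keep  # TRUE FALSE FALSE FALSE

# Use the logical vector as the row index
expr_df[keep, ]

# "or" condition, returning only the gene names
extreme <- expr_df[expr_df$gene_expr < 1 | expr_df$gene_expr > 12, "gene_name"]
extreme  # "geneA" "geneC"
```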

Don’t worry, we’ll go into more detail in the next session!

Be Careful of Data Structure

If you select only 1 row of a data frame:

str(my_df[2, ]) # still a data frame
'data.frame':   1 obs. of  5 variables:
 $ id       : int 2
 $ gene_name: chr "geneB"
 $ detected : chr "yes"
 $ gene_expr: num 11.3
 $ status   : chr "activated"
unlist(my_df[2, ]) # convert to a named vector (all values become character)
         id   gene_name    detected   gene_expr      status 
        "2"     "geneB"       "yes"      "11.3" "activated" 

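Because a vector can hold only one data type, unlist() on a row with mixed columns converts every value to character. A minimal sketch of the consequence, using a made-up one-row data frame (row_df) rather than my_df:

```r
# One row with an integer, a character and a numeric column
row_df <- data.frame(id = 2L, gene_name = "geneB", gene_expr = 11.3)

row_vec <- unlist(row_df)
class(row_vec)  # "character"

# The number is now text and must be converted back before any calculation
as.numeric(row_vec[["gene_expr"]])  # 11.3

# Taking the value straight from the data frame keeps its type
row_df[1, "gene_expr"]  # 11.3
```

If only one value is needed, indexing the data frame directly is usually simpler than converting the whole row.
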
If you select only 1 column of a data frame:

my_df[, 2]
 [1] "geneA" "geneB" "geneC" "geneD" "geneE" "geneF" "geneG" "geneH" "geneI"
[10] "geneJ"
is.vector(my_df[, 2]) # The output was "simplified" to a vector.
[1] TRUE

To keep the output as a data.frame when you select only 1 column, specify drop = FALSE:

my_df[, 2, drop = FALSE]
   gene_name
1      geneA
2      geneB
3      geneC
4      geneD
5      geneE
6      geneF
7      geneG
8      geneH
9      geneI
10     geneJ
str(my_df[, 2, drop = FALSE])
'data.frame':   10 obs. of  1 variable:
 $ gene_name: chr  "geneA" "geneB" "geneC" "geneD" ...

Data Frame Modification (1)

Modify an existing column:

  • The whole column
my_df[["id"]] <- paste0("ID", my_df[["id"]])
my_df
     id gene_name detected gene_expr    status
1   ID1     geneA      yes      12.4 activated
2   ID2     geneB      yes      11.3 activated
3   ID3     geneC      yes      13.5 activated
4   ID4     geneD      yes      10.2 activated
5   ID5     geneE      yes      11.4 activated
6   ID6     geneF      yes       0.5 inhibited
7   ID7     geneG      yes       1.0 inhibited
8   ID8     geneH      yes       1.2 inhibited
9   ID9     geneI      yes       1.4 inhibited
10 ID10     geneJ      yes       0.6 inhibited
my_df$id <- paste0("ID", 1:10) # idem
  • Modify some elements
my_df[["id"]][1:2] <- 1:2 
my_df
     id gene_name detected gene_expr    status
1     1     geneA      yes      12.4 activated
2     2     geneB      yes      11.3 activated
3   ID3     geneC      yes      13.5 activated
4   ID4     geneD      yes      10.2 activated
5   ID5     geneE      yes      11.4 activated
6   ID6     geneF      yes       0.5 inhibited
7   ID7     geneG      yes       1.0 inhibited
8   ID8     geneH      yes       1.2 inhibited
9   ID9     geneI      yes       1.4 inhibited
10 ID10     geneJ      yes       0.6 inhibited
my_df[1:2, "id"] <- 1:2 # idem
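
Note that in the output above the new ids are stored as text ("1", "2"), because the id column is already character. The sketch below reproduces this on plain vectors; the names ids and nums are invented for the example.

```r
# Assigning numbers into a character vector: the numbers become text
ids <- paste0("ID", 1:3)
ids[1:2] <- 1:2
ids         # "1" "2" "ID3"
class(ids)  # "character"

# Assigning text into a numeric vector: the whole vector becomes character
nums <- c(1.5, 2, 3)
nums[1] <- "low"
nums        # "low" "2" "3"
class(nums) # "character"
```

The same happens inside a data frame column, so it is worth running str() after modifying elements.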

Data Frame Modification (2)

  • Add new column
my_df[["tissue"]] <- rep(c("liver", "muscle"), times = 5)
my_df
     id gene_name detected gene_expr    status tissue
1     1     geneA      yes      12.4 activated  liver
2     2     geneB      yes      11.3 activated muscle
3   ID3     geneC      yes      13.5 activated  liver
4   ID4     geneD      yes      10.2 activated muscle
5   ID5     geneE      yes      11.4 activated  liver
6   ID6     geneF      yes       0.5 inhibited muscle
7   ID7     geneG      yes       1.0 inhibited  liver
8   ID8     geneH      yes       1.2 inhibited muscle
9   ID9     geneI      yes       1.4 inhibited  liver
10 ID10     geneJ      yes       0.6 inhibited muscle
my_df$tissue <- rep(c("liver", "muscle"), times = 5) # idem

Data Frame Modification (3)

  • Delete column(s)
my_df[["detected"]] <- NULL
my_df
     id gene_name gene_expr    status tissue
1     1     geneA      12.4 activated  liver
2     2     geneB      11.3 activated muscle
3   ID3     geneC      13.5 activated  liver
4   ID4     geneD      10.2 activated muscle
5   ID5     geneE      11.4 activated  liver
6   ID6     geneF       0.5 inhibited muscle
7   ID7     geneG       1.0 inhibited  liver
8   ID8     geneH       1.2 inhibited muscle
9   ID9     geneI       1.4 inhibited  liver
10 ID10     geneJ       0.6 inhibited muscle
my_df$detected <- NULL # idem

Recoding Data Type

If a column holds categorical data, recode it as a factor with the factor() function.

my_df[["status"]] <- factor(my_df[["status"]])
my_df$tissue <- factor(
  my_df[["tissue"]],
  levels = c("muscle", "liver") # specify levels' order
)

# Check the data again
str(my_df)
'data.frame':   10 obs. of  5 variables:
 $ id       : chr  "1" "2" "ID3" "ID4" ...
 $ gene_name: chr  "geneA" "geneB" "geneC" "geneD" ...
 $ gene_expr: num  12.4 11.3 13.5 10.2 11.4 0.5 1 1.2 1.4 0.6
 $ status   : Factor w/ 2 levels "activated","inhibited": 1 1 1 1 1 2 2 2 2 2
 $ tissue   : Factor w/ 2 levels "muscle","liver": 2 1 2 1 2 1 2 1 2 1
summary(my_df)
      id             gene_name           gene_expr           status     tissue 
 Length:10          Length:10          Min.   : 0.50   activated:5   muscle:5  
 Class :character   Class :character   1st Qu.: 1.05   inhibited:5   liver :5  
 Mode  :character   Mode  :character   Median : 5.80                           
                                       Mean   : 6.35                           
                                       3rd Qu.:11.38                           
                                       Max.   :13.50                           
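
To see what the levels argument changes, here is a small sketch on a stand-alone vector (tissue, f_default and f_custom are names chosen for this example). By default the levels are sorted alphabetically; levels = sets the order explicitly.

```r
tissue <- c("liver", "muscle", "liver")

# Default: levels in alphabetical order
f_default <- factor(tissue)
levels(f_default)  # "liver" "muscle"

# Custom order: "muscle" becomes the first level
f_custom <- factor(tissue, levels = c("muscle", "liver"))
levels(f_custom)   # "muscle" "liver"

# Internally a factor stores the position of each value among the levels
as.integer(f_custom)  # 2 1 2

# Count the observations per level
table(f_custom)
```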

Data Frame Concatenation

Use cbind() to bind two or more data frames by columns.

df1 <- data.frame(x = 1:3, y = 4:6)
df1
  x y
1 1 4
2 2 5
3 3 6
df2 <- data.frame(a = 7:9, b = 10:12)
df2
  a  b
1 7 10
2 8 11
3 9 12
cbind(df1, df2)
  x y a  b
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12

Use rbind() to bind two or more data frames by rows; the column names must be the same.

rbind(df1, df2)
Error in match.names(clabs, names(xi)): names do not match previous names
df3 <- data.frame(x = 7:9, y = 10:12)
rbind(df1, df3)
  x  y
1 1  4
2 2  5
3 3  6
4 7 10
5 8 11
6 9 12
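
A short sketch of two related points, using made-up data frames (df_a, df_b, df_c): rbind() matches data frame columns by name rather than by position, and a data frame with different column names can be renamed before binding, assuming its columns really hold the same kind of values.

```r
df_a <- data.frame(x = 1:2, y = 3:4)

# Same names, different column order: values still end up in the right column
df_b <- data.frame(y = 5:6, x = 7:8)
combined <- rbind(df_a, df_b)
combined
#   x y
# 1 1 3
# 2 2 4
# 3 7 5
# 4 8 6

# Different names: rename first, then bind
df_c <- data.frame(a = 9:10, b = 11:12)
colnames(df_c) <- colnames(df_a)
rbind(df_a, df_c)
```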

Join Data Frames

Use merge() to join two data frames based on a common column.

df1 <- data.frame(id = 1:3, x = letters[1:3])
df1
  id x
1  1 a
2  2 b
3  3 c
df2 <- data.frame(id = c(2, 4), y = LETTERS[c(2, 4)])
df2
  id y
1  2 B
2  4 D
# inner join
merge(x = df1, y = df2, by = "id") 
  id x y
1  2 b B
# left join
merge(x = df1, y = df2, by = "id", all.x = TRUE) 
  id x    y
1  1 a <NA>
2  2 b    B
3  3 c <NA>
# right join
merge(x = df1, y = df2, by = "id", all.y = TRUE) 
  id    x y
1  2    b B
2  4 <NA> D
# outer join
merge(x = df1, y = df2, by = "id", all = TRUE)
  id    x    y
1  1    a <NA>
2  2    b    B
3  3    c <NA>
4  4 <NA>    D
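
When the common column has a different name in each data frame, merge() accepts by.x and by.y instead of by. A minimal sketch with invented data frames (samples, results):

```r
samples <- data.frame(sample_id = 1:3, x = letters[1:3])
results <- data.frame(id = c(2, 4), y = LETTERS[c(2, 4)])

# Inner join: match samples$sample_id with results$id
merged <- merge(
  x = samples, y = results,
  by.x = "sample_id", by.y = "id"
)
merged
#   sample_id x y
# 1         2 b B
```

The key column in the result takes its name from the first data frame. The all.x, all.y and all arguments work the same way as shown above.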

Lists

The most flexible data structure in R: each element can be of any data type or data structure.

Lists Creation

Use the list() function to create a list; separate the elements with commas.

x <- "geneA"
y <- c(10, 11, 0.5)
z <- c(TRUE, TRUE, FALSE)
simple_list <- list(x, y, z)
simple_list
[[1]]
[1] "geneA"

[[2]]
[1] 10.0 11.0  0.5

[[3]]
[1]  TRUE  TRUE FALSE
# how many elements are in the list
length(simple_list)
[1] 3
# check the structure
str(simple_list)
List of 3
 $ : chr "geneA"
 $ : num [1:3] 10 11 0.5
 $ : logi [1:3] TRUE TRUE FALSE

We can name the elements of a list:

# name the elements during creation
list(gene_name = x, counts = y, expressed = z) 
$gene_name
[1] "geneA"

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE
# or name the elements afterward
names(simple_list) <- c("gene_name", "counts", "expressed")
simple_list
$gene_name
[1] "geneA"

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE

Subsetting a List

Use [ ] to subset a list with a numeric index, element names, or a logical index.

simple_list[c(1, 3)]
$gene_name
[1] "geneA"

$expressed
[1]  TRUE  TRUE FALSE
simple_list[c("gene_name", "expressed")]
$gene_name
[1] "geneA"

$expressed
[1]  TRUE  TRUE FALSE
simple_list[c(TRUE, FALSE, TRUE)]
$gene_name
[1] "geneA"

$expressed
[1]  TRUE  TRUE FALSE

What is the data structure after subsetting?

  • simple_list[c(1, 3)]
  • simple_list[1]
str(simple_list[c(1, 3)])
List of 2
 $ gene_name: chr "geneA"
 $ expressed: logi [1:3] TRUE TRUE FALSE
str(simple_list[1])
List of 1
 $ gene_name: chr "geneA"

Accessing Elements (1)

simple_list
$gene_name
[1] "geneA"

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE
  • Use $ to access named element:
simple_list$counts
[1] 10.0 11.0  0.5
  • Use [[ ]] for named or indexed element:
simple_list[["counts"]]
[1] 10.0 11.0  0.5
simple_list[[2]]
[1] 10.0 11.0  0.5
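
The difference between [ ] and [[ ]] matters as soon as the element is used in a calculation. In this sketch (gene is a made-up list), [ ] returns a smaller list, while [[ ]] returns the element itself.

```r
gene <- list(gene_name = "geneA", counts = c(10, 11, 0.5))

class(gene["counts"])    # "list": still a list, of length 1
class(gene[["counts"]])  # "numeric": the vector stored inside

# Functions that expect a vector need the element itself
mean(gene[["counts"]])

# mean(gene["counts"]) would return NA with a warning,
# because a list is not numeric
```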

Accessing Elements (2)

Once we have access to the element, we can extract values according to the data structure of that element.

my_list <- list(
  sample_info = data.frame(
    id = paste0("sample", 1:3),
    age = c(25, 27, 30)
  ),
  family_sequenced = list(
    sample1 = c("father", "mother"),
    sample2 = c("father", "mother", "sister"),
    sample3 = c("mother", "sister")
  )
)
my_list
$sample_info
       id age
1 sample1  25
2 sample2  27
3 sample3  30

$family_sequenced
$family_sequenced$sample1
[1] "father" "mother"

$family_sequenced$sample2
[1] "father" "mother" "sister"

$family_sequenced$sample3
[1] "mother" "sister"

How to extract the age of sample1?

my_list[["sample_info"]][1, "age"]
[1] 25

How to extract the sequenced family members of sample 2?

my_list[["family_sequenced"]][["sample2"]]
[1] "father" "mother" "sister"
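
The accessors can be chained one level further. The sketch below uses a reduced copy of the family_sequenced element (called family here, a name chosen for the example) and the base R function lengths() to count the members per sample.

```r
family <- list(
  sample1 = c("father", "mother"),
  sample2 = c("father", "mother", "sister")
)

# Third sequenced family member of sample2
family[["sample2"]][3]  # "sister"
family$sample2[3]       # idem

# Number of sequenced family members for each sample
n_members <- lengths(family)
n_members
# sample1 sample2
#       2       3

# Was a sister sequenced for sample1?
"sister" %in% family[["sample1"]]  # FALSE
```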

List Modification

simple_list
$gene_name
[1] "geneA"

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE
  • Modify an element
simple_list[[1]] <- 1:10
simple_list
$gene_name
 [1]  1  2  3  4  5  6  7  8  9 10

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE
  • Add an element
simple_list[["new_element"]] <- 3:1
simple_list
$gene_name
 [1]  1  2  3  4  5  6  7  8  9 10

$counts
[1] 10.0 11.0  0.5

$expressed
[1]  TRUE  TRUE FALSE

$new_element
[1] 3 2 1
  • Remove an element
simple_list[["expressed"]] <- NULL
simple_list
$gene_name
 [1]  1  2  3  4  5  6  7  8  9 10

$counts
[1] 10.0 11.0  0.5

$new_element
[1] 3 2 1

List Concatenation

Use c() to concatenate two or more lists.

list1 <- list(1:3, 4:6)
list2 <- list(letters[1:3], "A")

list_long <- c(list1, list2)
list_long
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] "a" "b" "c"

[[4]]
[1] "A"
str(list_long)
List of 4
 $ : int [1:3] 1 2 3
 $ : int [1:3] 4 5 6
 $ : chr [1:3] "a" "b" "c"
 $ : chr "A"
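
For comparison, wrapping the same two lists in list() instead of c() does not concatenate them; it creates a nested list with two elements. A short sketch (list_nested is a name invented here):

```r
list1 <- list(1:3, 4:6)
list2 <- list(letters[1:3], "A")

# c() puts all elements at the same level
length(c(list1, list2))  # 4

# list() keeps each input list as one element
list_nested <- list(list1, list2)
length(list_nested)      # 2

# Reaching "a" "b" "c" now takes two steps
list_nested[[2]][[1]]
```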

Let’s Practice!

Today’s Goals

  • Work with data frames and lists
  • Calculate fold change of gene expression between groups
  • Compare gene expression using the Wilcoxon test
  • Visualize differences with boxplots
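
As a warm-up for the first goals, here is one possible sketch, not the exercise solution. It reuses the expression values from my_df in a made-up data frame (expr_df), assumes the fold change is defined as the ratio of the group means, and uses the base R function wilcox.test().

```r
expr_df <- data.frame(
  gene_expr = c(12.4, 11.3, 13.5, 10.2, 11.4, 0.5, 1, 1.2, 1.4, 0.6),
  status = factor(rep(c("activated", "inhibited"), each = 5))
)

# Split the expression values by group with logical indexing
activated <- expr_df$gene_expr[expr_df$status == "activated"]
inhibited <- expr_df$gene_expr[expr_df$status == "inhibited"]

# Fold change (assumed here: ratio of the group means) and its log2
fold_change <- mean(activated) / mean(inhibited)
fold_change
log2(fold_change)

# Wilcoxon rank-sum test between the two groups
test_res <- wilcox.test(activated, inhibited)
test_res$p.value
```

The result of wilcox.test() is itself a list, so its elements (for example p.value) are accessed with $ or [[ ]] as shown in the Lists section.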