Week 6 - Hands-On Examples

week06
exercise

The R script is available here: link

Goals

  • Hands-on challenge: “Fix the Code”
  • Mini data analysis project
    • Data cleaning
    • Simple analysis
    • Visualisation

“Fix the Code” Challenge

  1. Fix the data frame error.
data <- data.frame(
  gene = c("BRCA1", "TP53", "MYC"),
  expression = c(10.5, 8.2, 12.7)
  condition = c("Tumor", "Normal", "Tumor")
)
Error in parse(text = input): <text>:4:3: unexpected symbol
3:   expression = c(10.5, 8.2, 12.7)
4:   condition
     ^
Correction
data <- data.frame(
  gene = c("BRCA1", "TP53", "MYC"),
  expression = c(10.5, 8.2, 12.7),
  condition = c("Tumor", "Normal", "Tumor")
)
  1. Fix error when using the mean() function.
counts <- data.frame(
  sample1 = c(1, 2, 3),
  sample2 = c(1, 1, 3),
  sample3 = c(0, 1, 2),
  row.names = paste0("gene", 1:3)
)
counts
      sample1 sample2 sample3
gene1       1       1       0
gene2       2       1       1
gene3       3       3       2
mean(counts["gene3", ])
Warning in mean.default(counts["gene3", ]): argument is not numeric or logical:
returning NA
[1] NA
Correction
mean(unlist(counts["gene3", ]))
[1] 2.666667
Correction
# or
mean(as.matrix(counts)["gene3", ])
[1] 2.666667
  1. Fix error in condition.
x <- 10
if (x = 5) {
  print("x is 5")
}
Error in parse(text = input): <text>:2:7: unexpected '='
1: x <- 10
2: if (x =
         ^
Correction
x <- 10
if (x == 5) {
  print("x is 5")
}
  1. Fix error in ggplot2. The goal is to show petal length with a boxplot for each species.
ggplot(iris, aes(x = Sepal.Length, y = Species))
geom_boxplot()
Error in ggplot(iris, aes(x = Sepal.Length, y = Species)): could not find function "ggplot"
Error in geom_boxplot(): could not find function "geom_boxplot"
Correction
library(ggplot2) # load the package before use
ggplot(iris, aes(x = Sepal.Length, y = Species)) + # need "+" between layers
  geom_boxplot()

  1. Fix error in ggplot2. The aim is to draw boxplot for each group.
# simulate data for two groups of samples.
set.seed(1)
data <- data.frame(
  gp = rep(1:2, each = 20),
  value = c(rnorm(20), rnorm(20, mean = 5))
)
str(data)
'data.frame':   40 obs. of  2 variables:
 $ gp   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ value: num  -0.626 0.184 -0.836 1.595 0.33 ...
# draw boxplot by group.
ggplot(data, aes(x = gp, y = value)) + 
  geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

Correction
ggplot(data, aes(x = factor(gp), y = value)) + 
  geom_boxplot()

  1. Fix code in data filtration.
# try to keep rows where the value is smaller than -0.5
data[data$value<-0.5, ]
[1] gp    value
<0 rows> (or 0-length row.names)
Correction
str(data) # the original data was changed!
'data.frame':   40 obs. of  2 variables:
 $ gp   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ value: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Correction
# rebuild the data frame
set.seed(1)
data <- data.frame(
  gp = rep(1:2, each = 20),
  value = c(rnorm(20), rnorm(20, mean = 5))
)

data[data$value < -0.5, ]
   gp      value
1   1 -0.6264538
3   1 -0.8356286
6   1 -0.8204684
13  1 -0.6212406
14  1 -2.2146999

Mini Data Project

A researcher has collected some gene expression data from 12 samples. However, some expression values are missing. Please help the researcher to clean the data and to performs some basic analyses.

# Simulated dataset with missing values
data <- data.frame(
  sample_id = paste0("sample", 1:12),
  expression = c(
    10.2, 15.2, NA, NA, 9.4, 18.1,
    8.9, 16.0, 10.5, 15.5, 11.5, 13.4
  ),
  sample_group = rep(c("Control", "Case"), times = 6)
)

# Show the dataset
data
   sample_id expression sample_group
1    sample1       10.2      Control
2    sample2       15.2         Case
3    sample3         NA      Control
4    sample4         NA         Case
5    sample5        9.4      Control
6    sample6       18.1         Case
7    sample7        8.9      Control
8    sample8       16.0         Case
9    sample9       10.5      Control
10  sample10       15.5         Case
11  sample11       11.5      Control
12  sample12       13.4         Case

Tasks:

  1. Find missing values. Which rows contain missing values? → Hint: Use is.na()
Correction
# Check which values are missing
is.na(data$expression)
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Correction
# Show only rows with missing values
data[is.na(data$expression), ]
  sample_id expression sample_group
3   sample3         NA      Control
4   sample4         NA         Case
Correction
## or find the index of NAs
which(is.na(data$expression))
[1] 3 4
Correction
data[which(is.na(data$expression)), ]
  sample_id expression sample_group
3   sample3         NA      Control
4   sample4         NA         Case
  1. Remove rows with missing values. Create a new dataset without missing values.
Correction
# Remove rows where expression is NA
data_clean <- data[!is.na(data$expression), ]

# Show the cleaned dataset
data_clean
   sample_id expression sample_group
1    sample1       10.2      Control
2    sample2       15.2         Case
5    sample5        9.4      Control
6    sample6       18.1         Case
7    sample7        8.9      Control
8    sample8       16.0         Case
9    sample9       10.5      Control
10  sample10       15.5         Case
11  sample11       11.5      Control
12  sample12       13.4         Case
  1. Basic summary statistics
  • What is the mean expression level (after removing missing values)?
  • What is the max and min expression?
Correction
# Compute summary statistics
mean_expression <- mean(data_clean$expression)
mean_expression
[1] 12.87
Correction
max_expression <- max(data_clean$expression)
max_expression
[1] 18.1
Correction
min_expression <- min(data_clean$expression)
min_expression
[1] 8.9
Correction
## or use the range() function for the min and max
range(data_clean$expression)
[1]  8.9 18.1
  1. Find the average expression (mean()) and the standard deviation (sd()) for each sample group (Control vs. Case)
Correction
# Compute mean and sd of expression for each sample group
expr_ctrl <- data_clean$expression[data_clean$sample_group == "Control"]
mean(expr_ctrl)
[1] 10.1
Correction
sd(expr_ctrl)
[1] 1.007472
Correction
expr_case <- data_clean$expression[data_clean$sample_group == "Case"]
mean(expr_case)
[1] 15.64
Correction
sd(expr_case)
[1] 1.689083
  1. Use data_clean to draw a graph you have already seen, e.g.: box plots, scatter plots, etc.
Correction
## a boxplot to show if "Case" has higher/lower expression than "Control."
ggplot(
  data_clean,
  aes(x = sample_group, y = expression, fill = sample_group)
) +
  geom_boxplot() +
  labs(
    title = "Expression Levels by Group",
    x = NULL,
    y = "Expression"
  ) +
  theme(legend.position = "none")

Correction
## Violin plot to show variability and density of expression levels
ggplot(
  data_clean,
  aes(x = sample_group, y = expression, fill = sample_group)
) +
  geom_violin() +
  labs(
    title = "Violin Plot - Expression Distribution per Group",
    x = NULL,
    y = "Expression"
  ) +
  theme(legend.position = "none")

Correction
## Spotting individual variations
ggplot(
  data_clean,
  aes(x = sample_id, y = expression, color = sample_group)
) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot - Expression per Sample",
    x = NULL,
    y = "Expression",
    color = NULL
  )

Correction
## better solution
ggplot(
  data_clean,
  aes(
    # recode variable as factor with the wanted order
    x = factor(sample_id, levels = sample_id),
    y = expression,
    color = sample_group
  )
) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot - Expression per Sample",
    subtitle = "Improved Version :)",
    x = NULL,
    y = "Expression",
    color = NULL
  ) +
  theme(
    # rotate x-axis' text to avoid overlap
    axis.text.x = element_text(angle = 30, hjust = 1)
  )

  1. To go futhur: Let’s visualise the average expression of each group using barplot, with the help of ChatGPT (or any other AI tool).

6a. Prepare data for barplot with error bars. We need to reorganize the data in a dataframe with 2 rows and 3 columns:

  • the column group contain the name of each group
  • the column mean contain the mean expression in of each group
  • the column mean contain the standard deviations
Correction
gg_data <- data.frame(
  group = c("Control", "Case"),
  mean_val = c(mean(expr_ctrl), mean(expr_case)),
  sd_val = c(sd(expr_ctrl), sd(expr_case))
)

gg_data
    group mean_val   sd_val
1 Control    10.10 1.007472
2    Case    15.64 1.689083

6b. Draw the bar plot:

  • Plot bars for mean expression (geom_bar(stat = "identity"))
  • Add error bars for standard deviation (geom_errorbar())
  • Change the aesthetic apsects as you want, e.g.: color, title, legend, etc.
Correction
p_bar <- ggplot(data = gg_data, aes(x = group, y = mean_val)) +
  geom_bar(stat = "identity", fill = "orange", width = 0.5) # bar plot
p_bar

Correction
p_bar_error <- p_bar +
  geom_errorbar( # add errorbar
    aes(
      ymin = mean_val - sd_val, 
      ymax = mean_val + sd_val
    ), 
    width = 0.2 # smaller width of error bar
  )

# change labels and theme
p_bar_error +
  labs(
    title = "Mean Gene Expression by Sample Group",
    subtitle = "The error bars represent the standard deviation.",
    x = NULL, y = "Mean Expression Level"
  ) +
  theme_minimal()

6c: What if we want to add the expression level of each sample to the bar plot?

Hint: Add another layer for drawing points (geom_point), using the data frame that contains the individual data (data_clean).

Correction
# change labels and theme
p_bar_error +
  labs(
    title = "Mean Gene Expression by Sample Group",
    subtitle = "The error bars represent the standard deviation.",
    x = NULL, y = "Mean Expression Level"
  ) +
  theme_minimal() +
  geom_point( # add the layer of points
    data = data_clean, # use the individual data
    aes(x = sample_group, y = expression), # map axes
    position = position_jitter( # add random noise to avoid overlap between points
      width = 0.1, height = 0, # allow the points to be spread across a small horizontal range (width = 0.1),
      # while keeping the value on y-axis fixed (height = 0)
      seed = 1
    ) 
  )


Good job! 👏👏 You’ve taken your first big steps into R, and you’re off to a great start, keep it up!