Week 6 - Hands-On Examples

week06

exercise

The R script is available here: link

Goals

Hands-on challenge: “Fix the Code”
Mini data analysis project
- Data cleaning
- Simple analysis
- Visualisation

“Fix the Code” Challenge

Fix the data frame error.

data <- data.frame(
  gene = c("BRCA1", "TP53", "MYC"),
  expression = c(10.5, 8.2, 12.7)
  condition = c("Tumor", "Normal", "Tumor")
)

Error in parse(text = input): <text>:4:3: unexpected symbol
3:   expression = c(10.5, 8.2, 12.7)
4:   condition
     ^

Correction

data <- data.frame(
  gene = c("BRCA1", "TP53", "MYC"),
  expression = c(10.5, 8.2, 12.7),
  condition = c("Tumor", "Normal", "Tumor")
)

Fix error when using the mean() function.

counts <- data.frame(
  sample1 = c(1, 2, 3),
  sample2 = c(1, 1, 3),
  sample3 = c(0, 1, 2),
  row.names = paste0("gene", 1:3)
)
counts

      sample1 sample2 sample3
gene1       1       1       0
gene2       2       1       1
gene3       3       3       2

mean(counts["gene3", ])

Warning in mean.default(counts["gene3", ]): argument is not numeric or logical:
returning NA

[1] NA

Correction

mean(unlist(counts["gene3", ]))

[1] 2.666667

Correction

# or
mean(as.matrix(counts)["gene3", ])

[1] 2.666667

Fix error in condition.

x <- 10
if (x = 5) {
  print("x is 5")
}

Error in parse(text = input): <text>:2:7: unexpected '='
1: x <- 10
2: if (x =
         ^

Correction

x <- 10
if (x == 5) {
  print("x is 5")
}

Fix error in ggplot2. The goal is to show petal length with a boxplot for each species.

ggplot(iris, aes(x = Sepal.Length, y = Species))
geom_boxplot()

Error in ggplot(iris, aes(x = Sepal.Length, y = Species)): could not find function "ggplot"

Error in geom_boxplot(): could not find function "geom_boxplot"

Correction

library(ggplot2) # load the package before use
ggplot(iris, aes(x = Sepal.Length, y = Species)) + # need "+" between layers
  geom_boxplot()

Fix error in ggplot2. The aim is to draw boxplot for each group.

# simulate data for two groups of samples.
set.seed(1)
data <- data.frame(
  gp = rep(1:2, each = 20),
  value = c(rnorm(20), rnorm(20, mean = 5))
)
str(data)

'data.frame':   40 obs. of  2 variables:
 $ gp   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ value: num  -0.626 0.184 -0.836 1.595 0.33 ...

# draw boxplot by group.
ggplot(data, aes(x = gp, y = value)) + 
  geom_boxplot()

Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

Correction

ggplot(data, aes(x = factor(gp), y = value)) + 
  geom_boxplot()

Fix code in data filtration.

# try to keep rows where the value is smaller than -0.5
data[data$value<-0.5, ]

[1] gp    value
<0 rows> (or 0-length row.names)

Correction

str(data) # the original data was changed!

'data.frame':   40 obs. of  2 variables:
 $ gp   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ value: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Correction

# rebuild the data frame
set.seed(1)
data <- data.frame(
  gp = rep(1:2, each = 20),
  value = c(rnorm(20), rnorm(20, mean = 5))
)

data[data$value < -0.5, ]

   gp      value
1   1 -0.6264538
3   1 -0.8356286
6   1 -0.8204684
13  1 -0.6212406
14  1 -2.2146999

Mini Data Project

A researcher has collected some gene expression data from 12 samples. However, some expression values are missing. Please help the researcher to clean the data and to performs some basic analyses.

# Simulated dataset with missing values
data <- data.frame(
  sample_id = paste0("sample", 1:12),
  expression = c(
    10.2, 15.2, NA, NA, 9.4, 18.1,
    8.9, 16.0, 10.5, 15.5, 11.5, 13.4
  ),
  sample_group = rep(c("Control", "Case"), times = 6)
)

# Show the dataset
data

   sample_id expression sample_group
1    sample1       10.2      Control
2    sample2       15.2         Case
3    sample3         NA      Control
4    sample4         NA         Case
5    sample5        9.4      Control
6    sample6       18.1         Case
7    sample7        8.9      Control
8    sample8       16.0         Case
9    sample9       10.5      Control
10  sample10       15.5         Case
11  sample11       11.5      Control
12  sample12       13.4         Case

Tasks:

Find missing values. Which rows contain missing values? → Hint: Use is.na()

Correction

# Check which values are missing
is.na(data$expression)

 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Correction

# Show only rows with missing values
data[is.na(data$expression), ]

  sample_id expression sample_group
3   sample3         NA      Control
4   sample4         NA         Case

Correction

## or find the index of NAs
which(is.na(data$expression))

[1] 3 4

Correction

data[which(is.na(data$expression)), ]

  sample_id expression sample_group
3   sample3         NA      Control
4   sample4         NA         Case

Remove rows with missing values. Create a new dataset without missing values.

Correction

# Remove rows where expression is NA
data_clean <- data[!is.na(data$expression), ]

# Show the cleaned dataset
data_clean

   sample_id expression sample_group
1    sample1       10.2      Control
2    sample2       15.2         Case
5    sample5        9.4      Control
6    sample6       18.1         Case
7    sample7        8.9      Control
8    sample8       16.0         Case
9    sample9       10.5      Control
10  sample10       15.5         Case
11  sample11       11.5      Control
12  sample12       13.4         Case

Basic summary statistics

What is the mean expression level (after removing missing values)?
What is the max and min expression?

Correction

# Compute summary statistics
mean_expression <- mean(data_clean$expression)
mean_expression

[1] 12.87

Correction

max_expression <- max(data_clean$expression)
max_expression

[1] 18.1

Correction

min_expression <- min(data_clean$expression)
min_expression

[1] 8.9

Correction

## or use the range() function for the min and max
range(data_clean$expression)

[1]  8.9 18.1

Find the average expression (mean()) and the standard deviation (sd()) for each sample group (Control vs. Case)

Correction

# Compute mean and sd of expression for each sample group
expr_ctrl <- data_clean$expression[data_clean$sample_group == "Control"]
mean(expr_ctrl)

[1] 10.1

Correction

sd(expr_ctrl)

[1] 1.007472

Correction

expr_case <- data_clean$expression[data_clean$sample_group == "Case"]
mean(expr_case)

[1] 15.64

Correction

sd(expr_case)

[1] 1.689083

Use data_clean to draw a graph you have already seen, e.g.: box plots, scatter plots, etc.

Correction

## a boxplot to show if "Case" has higher/lower expression than "Control."
ggplot(
  data_clean,
  aes(x = sample_group, y = expression, fill = sample_group)
) +
  geom_boxplot() +
  labs(
    title = "Expression Levels by Group",
    x = NULL,
    y = "Expression"
  ) +
  theme(legend.position = "none")

Correction

## Violin plot to show variability and density of expression levels
ggplot(
  data_clean,
  aes(x = sample_group, y = expression, fill = sample_group)
) +
  geom_violin() +
  labs(
    title = "Violin Plot - Expression Distribution per Group",
    x = NULL,
    y = "Expression"
  ) +
  theme(legend.position = "none")

Correction

## Spotting individual variations
ggplot(
  data_clean,
  aes(x = sample_id, y = expression, color = sample_group)
) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot - Expression per Sample",
    x = NULL,
    y = "Expression",
    color = NULL
  )

Correction

## better solution
ggplot(
  data_clean,
  aes(
    # recode variable as factor with the wanted order
    x = factor(sample_id, levels = sample_id),
    y = expression,
    color = sample_group
  )
) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot - Expression per Sample",
    subtitle = "Improved Version :)",
    x = NULL,
    y = "Expression",
    color = NULL
  ) +
  theme(
    # rotate x-axis' text to avoid overlap
    axis.text.x = element_text(angle = 30, hjust = 1)
  )

To go futhur: Let’s visualise the average expression of each group using barplot, with the help of ChatGPT (or any other AI tool).

6a. Prepare data for barplot with error bars. We need to reorganize the data in a dataframe with 2 rows and 3 columns:

the column group contain the name of each group
the column mean contain the mean expression in of each group
the column mean contain the standard deviations

Correction

gg_data <- data.frame(
  group = c("Control", "Case"),
  mean_val = c(mean(expr_ctrl), mean(expr_case)),
  sd_val = c(sd(expr_ctrl), sd(expr_case))
)

gg_data

    group mean_val   sd_val
1 Control    10.10 1.007472
2    Case    15.64 1.689083

6b. Draw the bar plot:

Plot bars for mean expression (geom_bar(stat = "identity"))
Add error bars for standard deviation (geom_errorbar())
Change the aesthetic apsects as you want, e.g.: color, title, legend, etc.

Correction

p_bar <- ggplot(data = gg_data, aes(x = group, y = mean_val)) +
  geom_bar(stat = "identity", fill = "orange", width = 0.5) # bar plot
p_bar

Correction

p_bar_error <- p_bar +
  geom_errorbar( # add errorbar
    aes(
      ymin = mean_val - sd_val, 
      ymax = mean_val + sd_val
    ), 
    width = 0.2 # smaller width of error bar
  )

# change labels and theme
p_bar_error +
  labs(
    title = "Mean Gene Expression by Sample Group",
    subtitle = "The error bars represent the standard deviation.",
    x = NULL, y = "Mean Expression Level"
  ) +
  theme_minimal()

6c: What if we want to add the expression level of each sample to the bar plot?

Hint: Add another layer for drawing points (geom_point), using the data frame that contains the individual data (data_clean).

Correction

# change labels and theme
p_bar_error +
  labs(
    title = "Mean Gene Expression by Sample Group",
    subtitle = "The error bars represent the standard deviation.",
    x = NULL, y = "Mean Expression Level"
  ) +
  theme_minimal() +
  geom_point( # add the layer of points
    data = data_clean, # use the individual data
    aes(x = sample_group, y = expression), # map axes
    position = position_jitter( # add random noise to avoid overlap between points
      width = 0.1, height = 0, # allow the points to be spread across a small horizontal range (width = 0.1),
      # while keeping the value on y-axis fixed (height = 0)
      seed = 1
    ) 
  )

Goals

“Fix the Code” Challenge

Mini Data Project

Good job! 👏👏 You’ve taken your first big steps into R, and you’re off to a great start, keep it up!