# install.packages("readr") # you only need to install it once
# install.packages("tidyr") # you only need to install it once
Week 5 - Hands-On Examples
The R script is available here: link
Goals
- Install new R packages
- Create basic plots with ggplot2
Install New Packages
We will use a couple of additional R packages for this training in sessions 8 and 9, for example:
- {tidyr}: provides functions that help you get to tidy data
- {readr}: provides fast and friendly way to read rectangular data
- Install the {
readr
} and {tidyr
} packages.
Check if you can find
{readr}
and{tidyr}
in the “Packages” panel in RStudio.Please use the following code to create a data frame of all installed packages in your system. Show a couple of first lines of the data frame.
<- as.data.frame(installed.packages()[, c(1, 3:4)])
my_pkgs head(my_pkgs)
Package Version Priority
askpass askpass 1.2.1 <NA>
backports backports 1.5.0 <NA>
base64enc base64enc 0.1-3 <NA>
bit bit 4.6.0 <NA>
bit64 bit64 4.6.0-1 <NA>
blob blob 1.2.4 <NA>
- According to
my_pkgs
, how many packages are installed in your system?
nrow(my_pkgs) # the number can be different
[1] 132
- Extract the rows for the “ggplot2”, “tidyr”, “readr” packages from
my_pkgs
.
# extract by rownames
c("ggplot2", "tidyr", "readr"), ] my_pkgs[
Package Version Priority
ggplot2 ggplot2 3.5.1 <NA>
tidyr tidyr 1.3.1 <NA>
readr readr 2.1.5 <NA>
## or
# extract by filtering the names in the "Package" column
$Package %in% c("ggplot2", "tidyr", "readr"), ] my_pkgs[my_pkgs
Package Version Priority
ggplot2 ggplot2 3.5.1 <NA>
readr readr 2.1.5 <NA>
tidyr tidyr 1.3.1 <NA>
Create a Basic Histogram
In the hands-on examples of session 1 and 2, we have drawn histograms with the basic R function hist()
.
Now let’s try with functions from {ggplot2
} package.
- Import the
read-counts.csv
file into RStudio and name the data ascounts
.
<- read.csv("../exos_data/read-counts.csv", header = TRUE)
counts dim(counts)
[1] 45 41
- Load the {
ggplot2
} package. Then create a histogram with the functiongeom_histogram()
for all genes of the sample “WT.2”. Have you noticed the message fromgeom_histogram()
about the bins?
library(ggplot2)
ggplot(data = counts, aes(x = WT.2)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- Create a new histogram for the sample “WT.2”, but this time use the log2 of the gene counts. Do you see the messages from
geom_histogram()
?
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
- Try 10 and 20 for the
bins
parameter ofgeom_histogram()
. Observe how the histogram changes.
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 10) +
labs(title = "10 bins")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 20) +
labs(title = "20 bins")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
- We will use 10 bins for the histogram. Change the color of the bars to blue.
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 10, color = "blue")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
- Based on the previous figure, fill the bars in orange.
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 10, color = "blue", fill = "orange")
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
- Based on the previous figure, change:
- x-axis title to be: “log2 gene expression”
- y-axis title to be: “Counts”
- plot title to show sample name and the number of genes used.
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 10, color = "blue", fill = "orange") +
labs(
x = "log2 gene expression",
y = "Counts",
title = paste0(nrow(counts), " genes of WT.2 sample")
)
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
- Based on the previous figure, use the theme
theme_bw()
.
ggplot(data = counts, aes(x = log2(WT.2))) +
geom_histogram(bins = 10, color = "blue", fill = "orange") +
labs(
x = "log2 gene expression",
y = "Counts",
title = paste0(nrow(counts), " genes of WT.2 sample")
+
) theme_bw()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
Create a Boxplot with ggplot
In the hands-on examples of session 3, we demonstrated that there is a significant difference in the expression of LOH1 between WT and SET1 samples.
This gene is upregulated in SET1 samples compared to WT samples, the log2 fold change is 2.85.
Now, let’s draw a more sophisticated boxplot with ggplot2.
- Prepare a data frame for the boxplot. We need:
- a column for the expression levels of different samples
- a column to indicate what is the sample group (WT or SET1)
<- data.frame(
gg_df expr_value = c(
unlist(counts[counts$Feature == "LOH1", paste0("WT.", 1:10)]),
unlist(counts[counts$Feature == "LOH1", paste0("SET1.", 1:10)])
),gp = rep(c("WT", "SET1"), each = 10)
) gg_df
expr_value gp
WT.1 10 WT
WT.2 2 WT
WT.3 14 WT
WT.4 19 WT
WT.5 35 WT
WT.6 17 WT
WT.7 6 WT
WT.8 3 WT
WT.9 31 WT
WT.10 13 WT
SET1.1 67 SET1
SET1.2 49 SET1
SET1.3 83 SET1
SET1.4 185 SET1
SET1.5 203 SET1
SET1.6 83 SET1
SET1.7 40 SET1
SET1.8 84 SET1
SET1.9 134 SET1
SET1.10 155 SET1
- Create a boxplot for the expression level by sample group.
<- ggplot(gg_df, aes(x = gp, y = expr_value)) +
p_base geom_boxplot()
p_base
- Modify labels:
- Remove x-axis title
- Change y-axis title to “Expression Level”
- Add a plot title “Expression of LOH1 in SET1 and WT samples”
- Add a subtitle “log2FoldChange = 2.85”
<- p_base + labs(
p_base x = NULL,
y = "Expression Level",
title = "Expression of LOH1 in SET1 and WT samples",
subtitle = "log2FoldChange = 2.85"
) p_base
- Add a layer of scatter plot over the boxplot.
- Color the points by sample group.
- Use the
alpha
parameter to let points be semi-transparent.
<- p_base +
p_base geom_point(
aes(color = gp),
alpha = 0.5
) p_base
- Change the theme to
theme_minimal()
.
<- p_base + theme_minimal()
p_base p_base
- With the
theme()
function:
- Move the plot title to the center.
- Hide the legend.
<- p_base + theme(
p_base plot.title = element_text(hjust = 0.5), # center plot title
legend.position = "none"
) p_base