R-fresh:
Revisiting the Essentials

IOC-R Week 6

R Projects

Organizing Your Work Like a Pro

Use R Projects to manage files and scripts in a structured way.

  • Keep data, scripts, and outputs in separate folders to avoid chaos.

  • Example:

my_project/
  ├── data/          # Raw data files (e.g., RNA-seq counts)
  ├── scripts/       # R scripts for preprocessing & analysis
  └── outputs/       # Output figures & tables
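As a minimal sketch, the same layout can be created from R. A temporary folder stands in for the project root here, and the file name counts.csv is hypothetical; inside a real R Project, paths are written relative to the project folder.

```r
# Assumption: a temporary folder stands in for the project root
project_root <- file.path(tempdir(), "my_project")

# Create the three sub-folders (recursive = TRUE also creates the root)
for (folder in c("data", "scripts", "outputs")) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Build paths with file.path() rather than pasting strings by hand
counts_path <- file.path(project_root, "data", "counts.csv")  # hypothetical file

list.files(project_root)  # lists the sub-folders that were created
```

Using file.path() keeps the path separators correct on Windows, macOS and Linux.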

Variables & Data Types

What’s in Your Data?

Data Type    Example in Biology
Numeric      Expression levels (25.3)
Character    Gene names ("TP53", "BRCA1")
Logical      Mutation status (TRUE for mutated, FALSE for WT)


expression <- 25.3   # Numeric
gene <- "TP53"       # Character
is_mutant <- TRUE    # Logical
x <- 1
y <- 2
total <- x + y
total
[1] 3
  • If we change x to 5, does the value stored in total change?
  • If total does not change, what do you need to do to update it?
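One way to check the two questions is to run the lines below. This is a small sketch that reuses the names x, y and total from the slide.

```r
x <- 1
y <- 2
total <- x + y   # total stores the value 3, not the formula x + y

x <- 5           # changing x afterwards...
total            # ...leaves total at 3

total <- x + y   # re-run the assignment to update total
total            # now 7
```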

Data Structures

Where Do You Store Your Data?

  • Vectors

    • 1D homogeneous data (all elements of the same type).
    • Use cases: A list of gene names, expression levels, or p-values.
    • Use [] for indexing
gene_names <- c("TP53", "BRCA1", "EGFR", "INS")
gene_names[1]  # First gene: "TP53"
[1] "TP53"
gene_names[c(2, 4)]  # Select multiple genes
[1] "BRCA1" "INS"  
# Factor: a categorical vector with a fixed set of levels
mut_gp <- factor(c("wt", "wt", "mut", "mut"))
mut_gp
[1] wt  wt  mut mut
Levels: mut wt
mut_gp[3]
[1] mut
Levels: mut wt
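A short sketch of what "homogeneous" means in practice: when types are mixed inside c(), R converts every element to a common type. The second part shows how the level order of a factor can be set explicitly; the choice of "wt" as the first level is an assumption made for illustration.

```r
# Mixing types: everything is converted to character
mixed <- c(25.3, "TP53", TRUE)
mixed          # "25.3" "TP53" "TRUE"
class(mixed)   # "character"

# By default, factor levels are sorted alphabetically (mut, wt).
# Assumption: we want "wt" to come first, e.g. as the reference group.
mut_gp <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))
levels(mut_gp) # "wt" "mut"
```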

  • Matrices:

    • 2D homogeneous data, a collection of equal-length vectors of the same data type.
    • Use case: a gene expression count table (rows = genes, columns = samples).
    • Use [, ] for indexing.
expr_matrix <- matrix(
  c(10, 12, 15, 20, 8, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list( # name the rows and columns   
    c("gene1", "gene2", "gene3"),
    c("sample1", "sample2")
  )
)
expr_matrix
      sample1 sample2
gene1      10      12
gene2      15      20
gene3       8      30
expr_matrix["gene1", ]  # Expression of Gene1 across samples
sample1 sample2 
     10      12 
expr_matrix[, "sample2"]  # All gene expressions in Sample2
gene1 gene2 gene3 
   12    20    30 
expr_matrix[2, 1]  # Specific value at row 2, column 1
[1] 15
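A small sketch of two things that often come up with [, ] indexing: selecting a single row returns a plain vector unless drop = FALSE is used, and row-wise summaries such as rowMeans() work directly on the matrix. The matrix is rebuilt here so the example runs on its own.

```r
expr_matrix <- matrix(
  c(10, 12, 15, 20, 8, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"), c("sample1", "sample2"))
)

one_row <- expr_matrix["gene1", ]                  # named numeric vector
one_row_mat <- expr_matrix["gene1", , drop = FALSE] # still a 1 x 2 matrix
dim(one_row_mat)                                    # 1 2

rowMeans(expr_matrix)  # mean expression per gene: 11.0 17.5 19.0
```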

  • Data Frames

    • 2D heterogeneous data, a collection of equal-length vectors of one or more data types.
    • Use [, ] for indexing
    • Use $ and [[ ]] to extract one column.
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1),
  Mutation = c(TRUE, FALSE, TRUE)
)
df
   Gene Expression Mutation
1  TP53       25.3     TRUE
2 BRCA1       12.5    FALSE
3  EGFR       30.1     TRUE
df$Gene  # Select column as a vector
[1] "TP53"  "BRCA1" "EGFR" 
df[, "Expression"]  # Same as df$Expression
[1] 25.3 12.5 30.1
df[[2]]  # Same as df$Expression
[1] 25.3 12.5 30.1
df[1, ]  # First row
  Gene Expression Mutation
1 TP53       25.3     TRUE
df[df$Expression > 30, ]  # Keep rows with expression greater than 30
  Gene Expression Mutation
3 EGFR       30.1     TRUE
df[which(df$Mutation), ]  # Filter mutated genes
  Gene Expression Mutation
1 TP53       25.3     TRUE
3 EGFR       30.1     TRUE
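The sketch below contrasts [[ ]] and [ ] on a data frame and combines two filters in one step. The variable col_name is introduced only for illustration.

```r
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1),
  Mutation = c(TRUE, FALSE, TRUE)
)

col_name <- "Expression"   # hypothetical: column name stored in a variable
as_vector <- df[[col_name]] # numeric vector, same as df$Expression
as_frame <- df[col_name]    # one-column data frame

# Rows that are mutated AND have expression above 30, Gene column only
hits <- df[df$Mutation & df$Expression > 30, "Gene"]
hits                        # "EGFR"
```

[[ ]] is handy when the column name is only known at run time, which $ cannot handle.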

  • Lists

    • Flexible structure, can hold anything (vectors, matrices, data frames or lists).
    • Use [] to subset a list.
    • Use [[ ]] or $ to select a specific component of a list.
bio_data <- list(
  genes = c("TP53", "BRCA1"),
  expression = matrix(c(20, 15, 30, 25), nrow = 2),
  metadata = data.frame(Sample = c("A", "B"), Condition = c("WT", "KO"))
)
bio_data
$genes
[1] "TP53"  "BRCA1"

$expression
     [,1] [,2]
[1,]   20   30
[2,]   15   25

$metadata
  Sample Condition
1      A        WT
2      B        KO
bio_data[1]  # A sub list
$genes
[1] "TP53"  "BRCA1"
bio_data[[1]]  # Get gene names
[1] "TP53"  "BRCA1"
bio_data$expression  # Get expression matrix
     [,1] [,2]
[1,]   20   30
[2,]   15   25
bio_data$metadata$Condition  # Get sample conditions
[1] "WT" "KO"
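To make the difference between [ ] and [[ ]] concrete, the sketch below checks the class of each result on a smaller list. The components n_samples and notes are made up for this example.

```r
bio_data <- list(
  genes = c("TP53", "BRCA1"),
  n_samples = 2              # hypothetical extra component
)

class(bio_data["genes"])     # "list": [ ] keeps the list wrapper
class(bio_data[["genes"]])   # "character": [[ ]] returns the component itself

# Components can be added by name with $
bio_data$notes <- "pilot run"  # hypothetical component
names(bio_data)                # "genes" "n_samples" "notes"
```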

Conditions & Operators

Filtering Data with Accuracy in R

Type         Operator                    Example
Comparison   ==, !=, >, <, >=, <=        p_value < 0.05
Logical      & (AND), | (OR), ! (NOT)    TRUE & FALSE


expression <- 25.3

if (expression > 20) {
  print("High expression")
} else {
  print("Low expression")
}
[1] "High expression"
ifelse(test = expression > 20, yes = "High expression", no = "Low expression")
[1] "High expression"
age <- c(18, 27, 20, 23, 22)
age[age < 20 | age > 25]
[1] 18 27
genes <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  expr = c(25.3, 12.5, 30.1),
  gene_family = c("A", "A", "B")
)
genes[genes$expr > 20, ]
  gene expr gene_family
1 TP53 25.3           A
3 EGFR 30.1           B
# subset(genes, expr > 20) # same result

subset(genes, expr > 20, select = c("gene", "expr")) # filter and select some columns
  gene expr
1 TP53 25.3
3 EGFR 30.1
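if () expects a single TRUE or FALSE, whereas ifelse() works element by element on a whole vector. The sketch below uses ifelse() to label every row and then combines two conditions with &. The column name level is introduced only for this example.

```r
genes <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  expr = c(25.3, 12.5, 30.1),
  gene_family = c("A", "A", "B")
)

# One label per row, computed in a single vectorised call
genes$level <- ifelse(genes$expr > 20, "High", "Low")
genes$level   # "High" "Low" "High"

# Both conditions must hold: high expression AND family A
high_in_a <- genes[genes$expr > 20 & genes$gene_family == "A", ]
high_in_a$gene  # "TP53"
```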

Functions

Automate Tasks in R

classify_expression <- function(expr, cutoff = 20) {
  message("Cutoff is ", cutoff)
  if (expr > cutoff) {
    return("High")
  } else {
    return("Low")
  }
}


classify_expression(expr = 25.3, cutoff = 20)
Cutoff is 20
[1] "High"
classify_expression(25.3)
Cutoff is 20
[1] "High"
classify_expression(25.3, 30)
Cutoff is 30
[1] "Low"
classify_expression(25.3, 10)
Cutoff is 10
[1] "High"
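The function above classifies one value at a time, because if () needs a single TRUE or FALSE (in R 4.2 and later, a longer condition is an error). One possible way to apply it to several values is vapply(), sketched below with the same logic but without the message() call. The named vector expr_values is made up from values used earlier in the deck.

```r
# Same logic as classify_expression() above, without the message
classify_expression <- function(expr, cutoff = 20) {
  if (expr > cutoff) "High" else "Low"
}

expr_values <- c(TP53 = 25.3, BRCA1 = 12.5, EGFR = 30.1)

# Call the function once per element; each call must return one string
labels <- vapply(expr_values, classify_expression,
                 FUN.VALUE = character(1), cutoff = 20)
labels  # TP53 "High", BRCA1 "Low", EGFR "High"
```

For a simple threshold like this one, ifelse(expr_values > 20, "High", "Low") gives the same labels without a loop.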

ggplot2: Visualizing Your Results

Basic Graphing

Syntax: ggplot(data, aes(x, y)) + geom_*()

library(ggplot2)
df <- data.frame(
  Expression = c(
    25.3, 12.5, 30.1, 27.8, 18.2, 35.6
  ),
  Condition = rep(c("WT", "KO"), 3)
)
df
  Expression Condition
1       25.3        WT
2       12.5        KO
3       30.1        WT
4       27.8        KO
5       18.2        WT
6       35.6        KO
ggplot(
  data = df,
  aes(x = Condition, y = Expression)
) +
  geom_boxplot()

You can also change aesthetics (color, fill, shape, alpha for transparency, etc.), as well as themes and scales.
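A sketch of a few such customizations on the same boxplot. The colors, the alpha value and the title are arbitrary choices made for illustration, not recommendations from the course.

```r
library(ggplot2)

df <- data.frame(
  Expression = c(25.3, 12.5, 30.1, 27.8, 18.2, 35.6),
  Condition = rep(c("WT", "KO"), 3)
)

p <- ggplot(df, aes(x = Condition, y = Expression, fill = Condition)) +
  geom_boxplot(alpha = 0.6) +                        # semi-transparent boxes
  scale_fill_manual(values = c(WT = "grey70", KO = "tomato")) +  # arbitrary colors
  labs(title = "Expression by condition",            # hypothetical title
       x = "Condition", y = "Expression") +
  theme_minimal()                                    # lighter built-in theme

p  # print the plot
```

Storing the plot in an object such as p also makes it possible to pass it explicitly to ggsave() through its plot argument.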

## save figure
ggsave(filename = "path/to/my_figure.png")

Data Import and Export

Text and Excel Files

  • Text format (.txt, .csv)
# use base R functions (from {utils}, loaded by default)
## Import
text_file <- read.table("path/to/file.txt")  # add header = TRUE if the first line holds column names
csv_file <- read.csv("path/to/file.csv")

## Export
write.table(
  x = df,
  file = "outputs/cleaned_gene_expression.txt"
)
write.csv(
  x = df,
  file = "outputs/cleaned_gene_expression.csv"
)

# use {readr}'s functions
## Import
readr::read_csv(
  file = "outputs/cleaned_gene_expression.csv"
)
## Export
readr::write_csv(
  x = df,
  file = "outputs/cleaned_gene_expression.csv"
)
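A minimal round-trip sketch with base R, written to a temporary file so it can be run anywhere. Note that write.csv() and write.table() also write row names by default; row.names = FALSE avoids an extra unnamed column when the file is read back.

```r
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1)
)

csv_path <- tempfile(fileext = ".csv")  # temporary file for the demo

write.csv(df, file = csv_path, row.names = FALSE)  # export without row names
df_back <- read.csv(csv_path)                      # import again

names(df_back)  # "Gene" "Expression"
```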
  • Excel format (.xlsx, .xls)
# use {readxl}'s functions
## Import
xlsx_file <- readxl::read_xlsx("path/to/file.xlsx")
xls_file <- readxl::read_xls("path/to/file.xls")

# use {xlsx}'s function
## Export
xlsx::write.xlsx(
  x = df,
  file = "outputs/cleaned_gene_expression.xlsx"
)

R Specific Formats

  • RDS (for preserving a single R object)
saveRDS(
  object = df,
  file = "outputs/cleaned_gene_expression.rds"
)

# load RDS data into environment
my_df <- readRDS(
  file = "outputs/cleaned_gene_expression.rds"
)
  • RData (for saving multiple R objects at once)
save(
  df,
  summary_stats,  # any other object that already exists in your environment
  file = "outputs/analysis_results.RData"
)

# Load all objects back into the environment
# with their original names
load("outputs/analysis_results.RData")


Both .rds and .RData files preserve data structures, such as column data types (numeric, character or factor).
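The sketch below illustrates this with a factor column: it survives an RDS round trip, while a CSV round trip returns plain text. The data frame and the temporary file paths are made up for the demonstration.

```r
df <- data.frame(
  Sample = c("A", "B", "C"),
  Condition = factor(c("WT", "KO", "WT"), levels = c("WT", "KO"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(from_rds$Condition)   # "factor"
levels(from_rds$Condition)  # "WT" "KO" (custom order kept)
class(from_csv$Condition)   # "character" (factor information is lost)
```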

Using AI to Solve Coding Problems in R

How Can AI Help Us?

  • Debugging: Identify and troubleshoot error/warning messages
  • Generate code snippets (filter data, visualize, define functions, etc.)
  • Explain unfamiliar code and documentation
  • Enhance code quality (readability, efficiency, optimization)

Different AI tools: ChatGPT, Gemini, Perplexity, Claude, Le Chat, DeepSeek, etc.

AI is a great assistant, but it can make mistakes. Always verify its outputs!

How to ask AI for help effectively?

✅ Be specific: Instead of “Why is my code not working?”, ask:

“I’m trying to filter a data frame in R where Expression > 10, but I get an error. Here’s my code: df[df$Expression > 10]. How can I fix it?”

✅ Provide context:

  • A quick mention of your background can guide the response (e.g., “I’m a biologist working with gene expression data”)
  • What is your goal?
  • What error message do you see?
  • What does your dataset look like?
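For reference, here is a sketch of what goes wrong in the example prompt above and one way to fix it. Without a comma inside [ ], the logical vector is used to pick columns rather than rows. The data frame is the small one from the ggplot2 section, and the threshold is set to 20 (an assumption) so that the filter visibly drops rows.

```r
df <- data.frame(
  Expression = c(25.3, 12.5, 30.1, 27.8, 18.2, 35.6),
  Condition = rep(c("WT", "KO"), 3)
)

# No comma: six TRUE/FALSE values are matched against two columns,
# so R stops with an error. tryCatch() captures the message as text.
result <- tryCatch(
  df[df$Expression > 20],
  error = function(e) conditionMessage(e)
)
result

# With the comma: the condition selects rows, all columns are kept
filtered <- df[df$Expression > 20, ]
nrow(filtered)  # 4
```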

Let’s Practice!

Today’s Goals

  • Hands-on challenge: “Fix the Code”
  • Mini data analysis project
    • Data cleaning
    • Simple analysis
    • Visualisation