R-fresh:
Revisiting the Essentials

IOC-R Week 6

R Projects

Organizing Your Work Like a Pro

Use R Projects to manage files and scripts in a structured way.

Keep data, scripts, and outputs in separate folders to avoid chaos.
Example:

my_project/
  ├── data/          # Raw data files (e.g., RNA-seq counts)
  ├── scripts/       # R scripts for preprocessing & analysis
  └── outputs/       # Output figures & tables
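
A minimal sketch of how this layout could be created from the R console. A temporary folder stands in for the project root, and the file name `counts.csv` is hypothetical; in a real R Project the paths would be relative to the project folder.

```r
# A temporary folder stands in for the project root (assumption for illustration)
project_root <- file.path(tempdir(), "my_project")

# Create the three sub-folders; recursive = TRUE also creates the root folder
for (folder in c("data", "scripts", "outputs")) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Build paths with file.path() instead of pasting strings together
counts_path <- file.path(project_root, "data", "counts.csv")  # hypothetical file name

list.files(project_root)  # lists the sub-folders that now exist
```

Using `file.path()` keeps the separators correct on Windows, macOS, and Linux.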

Variables & Data Types

What’s in Your Data?

Data Type	Example in Biology
Numeric	Expression levels (`25.3`)
Character	Gene names (`"TP53"`,`"BRCA1"`)
Logical	Mutation status (`TRUE` for mutated, `FALSE` for WT)

expression <- 25.3   # Numeric
gene <- "TP53"       # Character
is_mutant <- TRUE    # Logical

x <- 1
y <- 2
total <- x + y
total

[1] 3

If we change x to 5, does the value stored in total change?
If not, what do you need to do to update total?
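
One way to check this yourself in the console (a small sketch reusing the variables above):

```r
x <- 1
y <- 2
total <- x + y   # total stores the result (3), not the formula x + y

x <- 5           # changing x afterwards does not touch total
total            # still 3

total <- x + y   # re-run the assignment to update it
total            # now 7
```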

Data Structures

Where Do You Store Your Data?

Vectors
- 1D homogeneous data (all elements of the same type).
- Use cases: A list of gene names, expression levels, or p-values.
- Use [] for indexing

gene_names <- c("TP53", "BRCA1", "EGFR", "INS")
gene_names[1]  # First gene: "TP53"

[1] "TP53"

gene_names[c(2, 4)]  # Select multiple genes

[1] "BRCA1" "INS"

# factor
mut_gp <- factor(c("wt", "wt", "mut", "mut"))
mut_gp

[1] wt  wt  mut mut
Levels: mut wt

mut_gp[3]

[1] mut
Levels: mut wt
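
By default the levels are sorted alphabetically, which is why `mut` comes before `wt` above. A short sketch, assuming you want `wt` as the first (reference) level, sets the order explicitly with the `levels` argument:

```r
# Set the level order explicitly so that "wt" comes first
mut_gp <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))

levels(mut_gp)      # the level order: wt, then mut
table(mut_gp)       # number of observations per level
as.integer(mut_gp)  # the integer codes stored behind the labels
```

The level order matters later, for example for the order of groups on a plot axis.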

Where Do You Store Your Data?

Matrices:
- 2D homogeneous data: a collection of equal-length vectors of the same data type.
- Use case: gene expression count table (rows = genes, columns = samples).
- Use [, ] for indexing.

expr_matrix <- matrix(
  c(10, 12, 15, 20, 8, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list( # name the rows and columns   
    c("gene1", "gene2", "gene3"),
    c("sample1", "sample2")
  )
)

expr_matrix

      sample1 sample2
gene1      10      12
gene2      15      20
gene3       8      30

expr_matrix["gene1", ]  # Expression of Gene1 across samples

sample1 sample2 
     10      12

expr_matrix[, "sample2"]  # All gene expressions in Sample2

gene1 gene2 gene3 
   12    20    30

expr_matrix[2, 1]  # Specific value at row 2, column 1

[1] 15

Where Do You Store Your Data?

Data Frames
- 2D heterogeneous data, a collection of equal-length vectors of one or more data types.
- Use [, ] for indexing
- Use $ and [[ ]] to extract one column.

df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1),
  Mutation = c(TRUE, FALSE, TRUE)
)
df

   Gene Expression Mutation
1  TP53       25.3     TRUE
2 BRCA1       12.5    FALSE
3  EGFR       30.1     TRUE

df$Gene  # Select column as a vector

[1] "TP53"  "BRCA1" "EGFR"

df[, "Expression"]  # Same as df$Expression

[1] 25.3 12.5 30.1

df[[2]]  # Same as df$Expression

[1] 25.3 12.5 30.1

df[1, ]  # First row

  Gene Expression Mutation
1 TP53       25.3     TRUE

df[df$Expression > 30, ]  # Keep rows with expression greater than 30

  Gene Expression Mutation
3 EGFR       30.1     TRUE

df[which(df$Mutation), ]  # Filter mutated genes

  Gene Expression Mutation
1 TP53       25.3     TRUE
3 EGFR       30.1     TRUE

Where Do You Store Your Data?

Lists
- Flexible structure, can hold anything (vectors, matrices, data frames or lists).
- Use [] to subset a list.
- Use [[ ]] or $ to select a specific component of a list.

bio_data <- list(
  genes = c("TP53", "BRCA1"),
  expression = matrix(c(20, 15, 30, 25), nrow = 2),
  metadata = data.frame(Sample = c("A", "B"), Condition = c("WT", "KO"))
)

bio_data

$genes
[1] "TP53"  "BRCA1"

$expression
     [,1] [,2]
[1,]   20   30
[2,]   15   25

$metadata
  Sample Condition
1      A        WT
2      B        KO

bio_data[1]  # A sub-list (the result is still a list)

$genes
[1] "TP53"  "BRCA1"

bio_data[[1]]  # Get gene names

[1] "TP53"  "BRCA1"

bio_data$expression  # Get expression matrix

     [,1] [,2]
[1,]   20   30
[2,]   15   25

bio_data$metadata$Condition  # Get sample conditions

[1] "WT" "KO"

Conditions & Operators

Filtering Data with Accuracy in R

Type	Operator	Example
Comparison	==, !=, >, <, >=, <=	p_value < 0.05
Logical	`&` (AND), `|` (OR), `!` (NOT)	TRUE & FALSE

expression <- 25.3

if (expression > 20) {
  print("High expression")
} else {
  print("Low expression")
}

[1] "High expression"

ifelse(test = expression > 20, yes = "High expression", no = "Low expression")

[1] "High expression"
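
`if` tests a single value, whereas `ifelse()` works element by element on a whole vector. A minimal sketch, with a hypothetical vector `expression_levels`:

```r
# Hypothetical expression values for three genes
expression_levels <- c(25.3, 12.5, 30.1)

# ifelse() tests every element and returns a vector of the same length
labels <- ifelse(expression_levels > 20, "High expression", "Low expression")
labels

# if (expression_levels > 20) { ... } would fail here:
# since R 4.2.0, a condition of length greater than one is an error
```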

age <- c(18, 27, 20, 23, 22)
age[age < 20 | age > 25]

[1] 18 27

genes <- data.frame(
  gene = c("TP53", "BRCA1", "EGFR"),
  expr = c(25.3, 12.5, 30.1),
  gene_family = c("A", "A", "B")
)
genes[genes$expr > 20, ]

  gene expr gene_family
1 TP53 25.3           A
3 EGFR 30.1           B

# subset(genes, expr > 20)  # same result as the line above

subset(genes, expr > 20, select = c("gene", "expr")) # filter and select some columns

  gene expr
1 TP53 25.3
3 EGFR 30.1

Functions

Automate Tasks in R

classify_expression <- function(expr, cutoff = 20) {
  message("Cutoff is ", cutoff)
  if (expr > cutoff) {
    return("High")
  } else {
    return("Low")
  }
}

classify_expression(expr = 25.3, cutoff = 20)

Cutoff is 20

[1] "High"

classify_expression(25.3)

Cutoff is 20

[1] "High"

classify_expression(25.3, 30)

Cutoff is 30

[1] "Low"

classify_expression(25.3, 10)

Cutoff is 10

[1] "High"
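
The function above expects a single value, because `if` tests one condition at a time. To classify several genes, one option is to call it once per element with `vapply()`. This is a sketch: the named vector `expr_values` is made up for illustration, and `suppressMessages()` is used only to hide the repeated "Cutoff is" message.

```r
classify_expression <- function(expr, cutoff = 20) {
  message("Cutoff is ", cutoff)
  if (expr > cutoff) {
    return("High")
  } else {
    return("Low")
  }
}

# Illustrative named vector of expression values
expr_values <- c(TP53 = 25.3, BRCA1 = 12.5, EGFR = 30.1)

# vapply() applies the function to each element and checks
# that every result is a single character string
classes <- suppressMessages(
  vapply(expr_values, classify_expression,
         FUN.VALUE = character(1), cutoff = 20)
)
classes
```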

ggplot2: Visualizing Your Results

Basic Graphing

Syntax: ggplot(data, aes(x, y)) + geom_*()

library(ggplot2)
df <- data.frame(
  Expression = c(
    25.3, 12.5, 30.1, 27.8, 18.2, 35.6
  ),
  Condition = rep(c("WT", "KO"), 3)
)
df

  Expression Condition
1       25.3        WT
2       12.5        KO
3       30.1        WT
4       27.8        KO
5       18.2        WT
6       35.6        KO

ggplot(
  data = df,
  aes(x = Condition, y = Expression)
) +
  geom_boxplot()

You can change aesthetics (color, fill, shape, alpha for transparency, etc.), as well as the theme and scales.
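
A sketch of what such customization could look like on the same data. The colors, title, and theme are arbitrary choices made for illustration, not recommendations.

```r
library(ggplot2)

df <- data.frame(
  Expression = c(25.3, 12.5, 30.1, 27.8, 18.2, 35.6),
  Condition = rep(c("WT", "KO"), 3)
)

p <- ggplot(df, aes(x = Condition, y = Expression, fill = Condition)) +
  geom_boxplot(alpha = 0.6) +                        # semi-transparent boxes
  geom_jitter(width = 0.1, shape = 21, size = 2) +   # show the individual points
  scale_fill_manual(values = c(WT = "grey60", KO = "tomato")) +  # arbitrary colors
  labs(title = "Expression by condition", y = "Expression level") +
  theme_minimal()
p
```

Mapping `fill` inside `aes()` colors by group; setting `alpha` outside `aes()` applies one fixed value to all boxes.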

## save figure (by default, ggsave() saves the last plot displayed)
ggsave(filename = "path/to/my_figure.png")
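
To control which plot is saved and at what size, the plot and the dimensions can be passed explicitly. In this sketch the output goes to a temporary folder, and the size and resolution are example values.

```r
library(ggplot2)

df <- data.frame(
  Expression = c(25.3, 12.5, 30.1, 27.8, 18.2, 35.6),
  Condition = rep(c("WT", "KO"), 3)
)
p <- ggplot(df, aes(x = Condition, y = Expression)) +
  geom_boxplot()

# Temporary path used as a stand-in for your outputs/ folder
out_file <- file.path(tempdir(), "my_figure.png")

# Name the plot object and set the size explicitly (example values)
ggsave(filename = out_file, plot = p,
       width = 5, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```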

Data Import and Export

Text and Excel Files

Text format (.txt, .csv)

# use base R functions
## Import
text_file <- read.table("path/to/file.txt")
csv_file <- read.csv("path/to/file.csv")

## Export
write.table(
  x = df,
  file = "outputs/cleaned_gene_expression.txt"
)
write.csv(
  x = df,
  file = "outputs/cleaned_gene_expression.csv"
)

# use {readr}'s functions
## Import
readr::read_csv(
  file = "outputs/cleaned_gene_expression.csv"
)
## Export
readr::write_csv(
  x = df,
  file = "outputs/cleaned_gene_expression.csv"
)

Excel format (.xlsx, .xls)

# use {readxl}'s functions
## Import
xlsx_file <- readxl::read_xlsx("path/to/file.xlsx")
xls_file <- readxl::read_xls("path/to/file.xls")

# use {xlsx}'s function
## Export
xlsx::write.xlsx(
  x = df,
  file = "outputs/cleaned_gene_expression.xlsx"
)

Don’t forget to use arguments to specify whether your data has a header, row names, etc.
Cheat sheet for data import/export with {readr} and {readxl} in R: https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf
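
A minimal sketch of these arguments with base R, assuming a tab-separated file with a header line. The file is written to a temporary folder and its name is hypothetical.

```r
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1)
)
txt_path <- file.path(tempdir(), "gene_expression.txt")  # hypothetical file

# Export: tab-separated, no row names, no quotes around strings
write.table(df, file = txt_path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Import: state that the first line is a header and that fields are tab-separated
df_in <- read.table(txt_path, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
str(df_in)
```

Without `header = TRUE`, `read.table()` may treat the column names as a first row of data.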

R Specific Formats

RDS (for preserving a single R object)

saveRDS(
  object = df,
  file = "outputs/cleaned_gene_expression.rds"
)

# load RDS data into environment
my_df <- readRDS(
  file = "outputs/cleaned_gene_expression.rds"
)

RData (for saving multiple R objects at once)

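# summary_stats is an example second object (assumed here for illustration)
summary_stats <- summary(df$Expression)

save(
  df,
  summary_stats,
  file = "outputs/analysis_results.RData"
)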

# Load all objects back into the environment
# with their original names
load("outputs/analysis_results.RData")

Both .rds and .RData files preserve data structures, such as column data types (numeric, character, or factor).
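
A small round-trip sketch of what "preserve" means in practice, comparing RDS with CSV. The data frame and file names are made up for illustration and the files are written to a temporary folder.

```r
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Condition = factor(c("WT", "KO", "WT"), levels = c("WT", "KO"))
)
rds_path <- file.path(tempdir(), "example.rds")
csv_path <- file.path(tempdir(), "example.csv")

saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

class(from_rds$Condition)   # still a factor
levels(from_rds$Condition)  # level order WT, KO is kept
class(from_csv$Condition)   # character in R >= 4.0: a CSV stores only text
```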

Using AI to Solve Coding Problems in R

How Can AI Help Us?

Debugging: Identify and troubleshoot error/warning messages
Generate code snippets (filter data, visualize, define functions, etc.)
Explain unfamiliar code and documentation
Enhance code quality (readability, efficiency, optimization)
…

Different AI tools: ChatGPT, Gemini, Perplexity, Claude, Le Chat, DeepSeek, etc.

AI is a great assistant, but it can make mistakes. Always verify its outputs!

How to Ask AI for Help Effectively?

✅ Be specific: Instead of “Why is my code not working?”, ask:

“I’m trying to filter a data frame in R where Expression > 10, but I get an error. Here’s my code: df[df$Expression > 10]. How can I fix it?”

✅ Provide context:

A quick mention of your background can guide the response (e.g., “I’m a biologist working with gene expression data”)
What is your goal?
What error message do you see?
What does your dataset look like?
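
To describe your dataset in a question, a few base R functions give a compact summary you can paste. This is a sketch using the small example data frame from earlier; with real data, share only a few non-sensitive rows.

```r
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Expression = c(25.3, 12.5, 30.1),
  Mutation = c(TRUE, FALSE, TRUE)
)

str(df)            # column names and data types
head(df, 3)        # the first rows
dput(head(df, 3))  # R code that recreates these rows, ready to paste
```

The output of `dput()` lets the reader (or the AI tool) rebuild exactly the same object, including column types.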

Let’s Practice!

Today’s Goals

Hands-on challenge: “Fix the Code”
Mini data analysis project
- Data cleaning
- Simple analysis
- Visualization
