Smart Shortcuts:
Mastering the apply Family

IOC-R Week 7

Recap Functions

Functions

  • Functions = Reusable blocks of code.
  • Can take 0, 1 or more parameters (arguments) as input
  • Local variables (created inside the function) cannot be accessed outside the function

Syntax:

my_function <- function(arg1, ...) {
  # function body
  return(res)
}
  • How to use functions in R?
    • Use base R functions (e.g., sum(), mean(), log2())
    • For functions from additional packages: load the package first (library(package_name)), or use the function with its prefix (package_name::function_name())
    • Create custom functions for specific tasks

Custom Functions

  • Example: Create a score to rank the results of a differential expression analysis. The score should be a weighted sum of log2 fold change (log2FC) and p-value. We will use:
    • the absolute value of log2 fold change to avoid negative scores,
    • -log10(p-value) for better interpretability (so that lower p-values correspond to higher values).
## Define a function to calculate weighted score for genes
weighted_gene_score <- function(
  log2fc, p_value, fc_weight = 0.7, p_weight = 0.3
) {
  weighted_score <- fc_weight * abs(log2fc) + p_weight * (-log10(p_value))
  return(weighted_score)
}

Why does the following command return an error?

weighted_score
Error: object 'weighted_score' not found
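
The error occurs because weighted_score is a local variable: it exists only while weighted_gene_score() is running. A minimal sketch of this behavior, reusing the function defined above (local_visible and my_score are hypothetical names chosen for illustration, and the check assumes no object called weighted_score was created elsewhere in the session):

```r
weighted_gene_score <- function(
  log2fc, p_value, fc_weight = 0.7, p_weight = 0.3
) {
  weighted_score <- fc_weight * abs(log2fc) + p_weight * (-log10(p_value))
  return(weighted_score)
}

# The local variable is not visible outside the function
local_visible <- exists("weighted_score")
local_visible

# To keep the value, assign the function's return value to a variable
my_score <- weighted_gene_score(log2fc = 2.5, p_value = 0.0001)
my_score
```

Only the value passed to return() leaves the function; the name weighted_score does not.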

Custom Functions

## mock gene data
gene_data <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  log2FC = c(2.5, -1.8, 0.8, 1.6, -0.5), 
  p_value = c(0.0001, 0.03, 0.2, 0.0005, 0.05)
)
gene_data
   gene log2FC p_value
1 GeneA    2.5   1e-04
2 GeneB   -1.8   3e-02
3 GeneC    0.8   2e-01
4 GeneD    1.6   5e-04
5 GeneE   -0.5   5e-02
# score for geneA
weighted_gene_score(
  log2fc = gene_data$log2FC[1],
  p_value = gene_data$p_value[1]
)
[1] 2.95
# score for geneC
score_geneC <- weighted_gene_score(
  log2fc = gene_data$log2FC[3],
  p_value = gene_data$p_value[3]
)
score_geneC
[1] 0.769691
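
Because abs(), log10() and the arithmetic operators in R are vectorized, the same function can also score all genes in one call instead of one gene at a time. The sketch below repeats the definitions from this slide so that it runs on its own; the column name score and the variable ranked are hypothetical additions.

```r
weighted_gene_score <- function(
  log2fc, p_value, fc_weight = 0.7, p_weight = 0.3
) {
  fc_weight * abs(log2fc) + p_weight * (-log10(p_value))
}

gene_data <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  log2FC = c(2.5, -1.8, 0.8, 1.6, -0.5),
  p_value = c(0.0001, 0.03, 0.2, 0.0005, 0.05)
)

# Pass whole columns: the result is one score per gene
gene_data$score <- weighted_gene_score(gene_data$log2FC, gene_data$p_value)

# Rank genes from highest to lowest score
ranked <- gene_data[order(gene_data$score, decreasing = TRUE), ]
ranked
```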

Meet the apply Family Functions

Why Learn the apply Family?

What happens when we need to apply a function multiple times?

E.g.: calculate the median of each row of the following matrix.

mat <- matrix(1:100, nrow = 25)
dim(mat)
[1] 25  4
head(mat, 3)
     [,1] [,2] [,3] [,4]
[1,]    1   26   51   76
[2,]    2   27   52   77
[3,]    3   28   53   78
tail(mat, 3)
      [,1] [,2] [,3] [,4]
[23,]   23   48   73   98
[24,]   24   49   74   99
[25,]   25   50   75  100
median(mat[1, ])
[1] 38.5
median(mat[2, ])
[1] 39.5
median(mat[3, ])
[1] 40.5
median(mat[4, ])
...
median(mat[25, ])

Write nearly identical code 25 times? Or is there a more efficient way?
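
One way to avoid the repetition with tools already known is a for loop. A minimal sketch for comparison (row_medians is a hypothetical variable name):

```r
mat <- matrix(1:100, nrow = 25)

# Pre-allocate one slot per row, then fill it in a loop
row_medians <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  row_medians[i] <- median(mat[i, ])
}
head(row_medians, 3)
```

This works, but it needs a pre-allocated result vector and an explicit index for a task that is conceptually a single step.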

The apply() Function

Applies a function across rows or columns.

?apply
apply(X, MARGIN, FUN)
  • X: Matrix (or data frame)
  • MARGIN = 1: Apply function to rows; MARGIN = 2: Apply function to columns
  • FUN: The function to apply
dim(mat)
[1] 25  4
# median of each row
apply(X = mat, MARGIN = 1, FUN = median)
 [1] 38.5 39.5 40.5 41.5 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5
[16] 53.5 54.5 55.5 56.5 57.5 58.5 59.5 60.5 61.5 62.5
# median of each column
apply(X = mat, MARGIN = 2, FUN = median)
[1] 13 38 63 88
# sum of each column
apply(X = mat, MARGIN = 2, FUN = sum)
[1]  325  950 1575 2200
# compare result with the built-in function
colSums(mat)
[1]  325  950 1575 2200
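
apply() also accepts further arguments after FUN and passes them on to the function (see the ... argument in ?apply). A minimal sketch, assuming a small matrix with one missing value (mat_na and the result names are hypothetical and not part of the course data):

```r
mat_na <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)
mat_na

# Without extra arguments, a row containing NA gives NA
default_medians <- apply(mat_na, 1, median)
default_medians

# na.rm = TRUE is forwarded to median() for every row
clean_medians <- apply(mat_na, 1, median, na.rm = TRUE)
clean_medians
```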

The apply() Function

How to apply a more complex/custom function?

gene_expr <- matrix(c(1, 2, 3, 2, 2, 3, 3, 3, 3, 4, 5, 6), nrow = 3)
gene_expr
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    2    3    5
[3,]    3    3    3    6

E.g.: how many unique values does each row contain?

apply(
  X = gene_expr,
  MARGIN = 1,
  FUN = function(row_i) {
    return(length(unique(row_i)))
  }
)
[1] 4 3 2
# or define your function first, then pass it to "FUN"
unique_len <- function(row_i) length(unique(row_i))
apply(gene_expr, 1, FUN = unique_len)
[1] 4 3 2

How many values are greater than 2 in each column?

apply(
  X = gene_expr,
  MARGIN = 2,
  FUN = function(col_i) {
    sum(col_i > 2)
  }
)
[1] 1 1 3 3
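
When FUN returns more than one value per row or column, apply() collects the results as the columns of a matrix. A short sketch using range() on the gene_expr matrix from this slide (row_ranges is a hypothetical name):

```r
gene_expr <- matrix(c(1, 2, 3, 2, 2, 3, 3, 3, 3, 4, 5, 6), nrow = 3)

# range() returns two values (min, max) for each row
row_ranges <- apply(gene_expr, 1, range)
row_ranges
dim(row_ranges)
```

Each column of the result corresponds to one input row, so t(row_ranges) gives one row per gene if that layout is preferred.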

The lapply() Function

Applies a function to each element of a list or vector.

?lapply
lapply(X, FUN, ...)
  • X: A list or vector
  • FUN: The function to apply

# List of genes with their exon start positions
gene_exons <- list(
  gene1 = c(100, 200, 300),
  gene2 = c(50, 150),
  gene3 = c(10, 110, 210, 310),
  gene4 = c(500)
)

How many exons does each gene have?

lapply(X = gene_exons, FUN = length)
$gene1
[1] 3

$gene2
[1] 2

$gene3
[1] 4

$gene4
[1] 1

lapply() always returns a list, with one element per element of X.
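
FUN can be any function that accepts a single list element, including one whose result differs in length from gene to gene. A minimal sketch that computes the distance between consecutive exon start positions with diff() (exon_gaps is a hypothetical name):

```r
gene_exons <- list(
  gene1 = c(100, 200, 300),
  gene2 = c(50, 150),
  gene3 = c(10, 110, 210, 310),
  gene4 = c(500)
)

# One vector of gaps per gene; gene4 has a single exon, so no gaps
exon_gaps <- lapply(gene_exons, diff)
exon_gaps

# Number of gaps per gene
lengths(exon_gaps)
```

A list is a natural container here because the results do not all have the same length.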

sapply() – A Simpler lapply()

sapply() simplifies lapply()’s output: it returns a vector or matrix when possible, and a list otherwise.

?sapply
sapply(X, FUN, ...)


gene_exons # A list
$gene1
[1] 100 200 300

$gene2
[1]  50 150

$gene3
[1]  10 110 210 310

$gene4
[1] 500
lapply(gene_exons, length)  # Returns a list
$gene1
[1] 3

$gene2
[1] 2

$gene3
[1] 4

$gene4
[1] 1
sapply(gene_exons, length)  # Returns a vector
gene1 gene2 gene3 gene4 
    3     2     4     1 

sapply() – A Simpler lapply()

# Function to return exon count and whether it has more than 2 exons
exon_info <- function(exons) {
  count <- length(exons)
  more_than_2 <- ifelse(count > 2, "yes", "no")
  # Returns a character vector of length 2 (count is converted to a string)
  return(c(count, more_than_2))
}

lapply(gene_exons, exon_info)
$gene1
[1] "3"   "yes"

$gene2
[1] "2"  "no"

$gene3
[1] "4"   "yes"

$gene4
[1] "1"  "no"
sapply(gene_exons, exon_info)
     gene1 gene2 gene3 gene4
[1,] "3"   "2"   "4"   "1"  
[2,] "yes" "no"  "yes" "no" 

Results are simplified into a matrix, with one column per gene.

# Function that returns exon count + some info for each gene
exon_info2 <- function(exons) {
  count <- length(exons)
  if (count > 2) {
    return(c(count, "high exon number")) # Returns 2 elements
  } else {
    return(count)  # Returns 1 element if <= 2
  }
}

sapply(gene_exons, exon_info2)
$gene1
[1] "3"                "high exon number"

$gene2
[1] 2

$gene3
[1] "4"                "high exon number"

$gene4
[1] 1

The results have different lengths, so they cannot be simplified and are still returned as a list.
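
Because the shape of sapply()'s result depends on the data, base R also provides vapply(), which takes a template for the expected return value (FUN.VALUE) and stops with an error when a result does not match it. vapply() is not covered on these slides; the sketch below is only an illustration, and exon_counts and mismatch are hypothetical names.

```r
gene_exons <- list(
  gene1 = c(100, 200, 300),
  gene2 = c(50, 150),
  gene3 = c(10, 110, 210, 310),
  gene4 = c(500)
)

# Each result must be exactly one integer
exon_counts <- vapply(gene_exons, length, FUN.VALUE = integer(1))
exon_counts

exon_info2 <- function(exons) {
  count <- length(exons)
  if (count > 2) {
    return(c(count, "high exon number"))
  } else {
    return(count)
  }
}

# The inconsistent return values now raise an error instead of
# silently producing a list; tryCatch() captures it for display
mismatch <- tryCatch(
  vapply(gene_exons, exon_info2, FUN.VALUE = character(2)),
  error = function(e) "error"
)
mismatch
```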

Let’s Practice!

Today’s Goals

  • Use apply() for column-wise and row-wise operations (e.g., calculating the variance of each row or column)
  • Leverage lapply() for list-based computations (e.g., generating the same plot for each gene in a list)
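
As a possible starting point for the first goal, the variance of each row or column follows the same pattern as the median examples. A minimal sketch on the mat matrix used earlier (row_var and col_var are hypothetical names, not the exercise solution):

```r
mat <- matrix(1:100, nrow = 25)

# MARGIN = 1: one variance per row
row_var <- apply(mat, 1, var)
length(row_var)

# MARGIN = 2: one variance per column
col_var <- apply(mat, 2, var)
length(col_var)
```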