Data Foundations:
Vectors and Matrices

IOC-R Week 2

Variables and Data Types in R

How Do We Store Data in R?

  • A variable is like a box where you store data.
  • Each variable has a name and content (one or multiple values).
  • A variable is created at the moment you assign a value to it. Use <- for assignment.
x <- 1 # put a space on each side of the assignment sign
x # type variable name to print its value(s)
[1] 1
char_name <- "InforBio"
char_name
[1] "InforBio"
char_name <- InforBio
Error: object 'InforBio' not found

Check the “Environment” pane or type ls() in the console. Are the variables you just created there?

Variable Naming Convention

  • Choose a short and descriptive name
  • Use snake_case (lowercase letters, digits and underscores only)
  • Avoid spaces and special characters (such as ! and #)
  • Do not start a name with numbers
  • Avoid reserved keywords in R (e.g., function, if, TRUE)
  • Do not overwrite built-in functions (e.g., mean, sd)

Which are valid names?

foo
test
var
var2
exam_results
a_variable_with_a_name_super_long

day_1
day_one
day1
first_day_of_the_month
DayOne
dayOne
DAYONE
DAYone

How to know if a variable name was already used?

  • Type help(reserved) to check reserved words in R.
  • Check in “Environment” pane.
  • Type the first letters of a name and press the Tab key to trigger autocompletion
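
A name can also be checked from the console. The sketch below uses exists() to see whether a name is already bound to an object, and make.names() to test whether a string is a syntactically valid name. The helper is_valid_name() and the name my_unused_name_123 are hypothetical, introduced here only for illustration.

```r
# Is the name already bound to an object (including built-in functions)?
exists("mean")                 # TRUE: mean() is a built-in function
exists("my_unused_name_123")   # FALSE, assuming nothing with that name was created

# make.names() rewrites invalid names, so an unchanged result means the name is valid
is_valid_name <- function(name) make.names(name) == name   # hypothetical helper

is_valid_name("day_1")   # TRUE
is_valid_name("1day")    # FALSE: starts with a number
is_valid_name("if")      # FALSE: reserved word
```

Note that a name can be valid and still be a poor choice: mean is syntactically valid, but using it would mask the built-in function.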

Data Types

Examples: 1, 2.5, "A", "InforBio", "I love R", TRUE, FALSE, …


How Does R Understand and Store Information?

Main data types:

  • Numeric
    • Double: 2.5
    • Integer: 1L
  • Character: "A", "InforBio", "I love R"
  • Logical (boolean): TRUE, FALSE
  • Factor: for categorical data

We’ll cover factors next week.

Numeric

  • Double (default): used for numbers with or without decimal points.
a <- 3
is.numeric(a)
[1] TRUE
is.double(a)
[1] TRUE
typeof(a)
[1] "double"
  • If you explicitly want an integer (whole numbers), you can define it by appending an L to the number:
b <- 3L
is.numeric(b)
[1] TRUE
is.integer(b)
[1] TRUE
typeof(b)
[1] "integer"
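
A few more cases may help to tell the two numeric types apart. This is a minimal sketch using only base R functions; the values are arbitrary.

```r
# A colon sequence of whole numbers is stored as integer
typeof(1:3)          # "integer"

# A plain number without the L suffix is a double, even if it looks whole
is.integer(3)        # FALSE

# Division returns a double, even with integer inputs
typeof(4L / 2L)      # "double"

# as.integer() drops the decimal part (it does not round)
as.integer(3.9)      # 3
```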

Character

R stores text (strings) as character. Use quotation marks to indicate that a value is a character string.

# enclosed in either double quotes (") or single quotes (')
x <- "I love R"
x
[1] "I love R"
is.character(x)
[1] TRUE
"1" + "2"
Error in "1" + "2": non-numeric argument to binary operator
as.numeric("1") + as.numeric("2")  # convert to numeric
[1] 3
as.character(1) # convert to character
[1] "1"
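
The sketch below shows a few common base R helpers for character values, and what happens when text that is not a number is converted to numeric. The variable names are hypothetical.

```r
n_char <- nchar("InforBio")          # number of characters: 8
joined <- paste("I", "love", "R")    # join with spaces: "I love R"
upper  <- toupper("gene1")           # "GENE1"

# Text that is not a number becomes NA (R also emits a warning,
# silenced here with suppressWarnings())
not_a_number <- suppressWarnings(as.numeric("gene1"))
is.na(not_a_number)                  # TRUE
```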

Logical (1)

  • Only two possible values for logical data: TRUE or FALSE.

  • Can be abbreviated as T or F, but no other spelling works in code (e.g., True, true)

is.logical(TRUE)
[1] TRUE
is.logical(T)
[1] TRUE
is.logical(True)
Error: object 'True' not found
  • Can be obtained from logical statements, e.g.:
2 > 1
[1] TRUE

Logical (2)

Convert to other types:

as.numeric(TRUE)
[1] 1
as.numeric(FALSE)
[1] 0
TRUE + TRUE + FALSE
[1] 2
as.character(TRUE)
[1] "TRUE"
as.character(FALSE)
[1] "FALSE"
as.logical(1)
[1] TRUE
as.logical(-1)
[1] TRUE
as.logical(0)
[1] FALSE

When as.logical() is applied to numbers, any non-zero number is converted to TRUE.
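
Conversion also works from character strings, and logical values can be combined with operators. The following is a small sketch in base R; note that as.logical() accepts a few spellings inside a string that are not valid when typed directly as code.

```r
as.logical("TRUE")   # TRUE
as.logical("true")   # TRUE: this spelling is accepted inside a string
as.logical("yes")    # NA: not a recognised spelling

# Logical operators
TRUE & FALSE    # FALSE (and)
TRUE | FALSE    # TRUE  (or)
!TRUE           # FALSE (not)

# Because TRUE counts as 1, adding comparisons counts how many are true
(2 > 1) + (3 > 1) + (0 > 1)   # 2
```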

Data Structures

From Single Value to Multiple Values

When we store multiple values, we need a structure.

gene1 <- 10
gene2 <- 12
gene3 <- 9

How to put gene1, gene2 and gene3 together?


R provides 4 data structures to store multiple values:

                        1 dimension   2 dimensions (row/column)
Same data type          vector        matrix
Different data types    list          data frame
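
As a rough illustration of the four structures, the sketch below creates one of each. All object names and values are hypothetical; lists and data frames are only shown here for orientation.

```r
# 1 dimension, same type
expr_vec <- c(10, 12, 9)

# 2 dimensions, same type
expr_mat <- matrix(c(10, 12, 9, 11, 13, 8), nrow = 3)

# 1 dimension, mixed types
sample_info <- list(name = "sample1", n_genes = 3, treated = TRUE)

# 2 dimensions, each column may have its own type
expr_df <- data.frame(gene = c("gene1", "gene2", "gene3"),
                      expr = c(10, 12, 9))

is.vector(expr_vec)      # TRUE
is.matrix(expr_mat)      # TRUE
is.list(sample_info)     # TRUE
is.data.frame(expr_df)   # TRUE
```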

Vectors

The simplest data structure in R, for one-dimensional data of the same type.

Vector Creation (1)

Use the function c() to create a vector and use , to separate elements.

c(10, 12, 9) # Numeric vector
[1] 10 12  9
c(gene1, gene2, gene3)
[1] 10 12  9
gene_expr <- c(gene1, gene2, gene3) # store in a variable
gene_expr
[1] 10 12  9
c(gene_expr, 18)
[1] 10 12  9 18
  • Quickly create sequences of numbers
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
5:1
[1] 5 4 3 2 1
seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
c("gene1", "gene2", "gene3") # Character vector
[1] "gene1" "gene2" "gene3"
c(TRUE, FALSE, FALSE, TRUE, TRUE) # Logical vector
[1]  TRUE FALSE FALSE  TRUE  TRUE
  • Other tricks
paste0("gene", 1:3)
[1] "gene1" "gene2" "gene3"
rep(c(TRUE, FALSE), each = 2) # repetition
[1]  TRUE  TRUE FALSE FALSE

A single value (scalar) is treated as a vector of length 1.

is.vector(10)
[1] TRUE
length(10)
[1] 1
is.vector("gene1")
[1] TRUE
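
A few variations on the creation functions above, as a minimal sketch. The labels "ctrl" and "treated" are made up for illustration.

```r
# Ask for a number of evenly spaced values instead of a step size
seq(0, 1, length.out = 5)               # 0.00 0.25 0.50 0.75 1.00

# Integers from 1 to n
seq_len(4)                              # 1 2 3 4

# times repeats the whole vector, each repeats every element
rep(c("ctrl", "treated"), times = 2)    # "ctrl" "treated" "ctrl" "treated"
rep(c("ctrl", "treated"), each = 2)     # "ctrl" "ctrl" "treated" "treated"

# paste() with a custom separator
paste("gene", 1:3, sep = "_")           # "gene_1" "gene_2" "gene_3"
```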

Vector Creation (2)

When you combine different data types …

c(10, TRUE)
[1] 10  1
c(10, "gene1")
[1] "10"    "gene1"
c(TRUE, "gene1")
[1] "TRUE"  "gene1"
c(10, "gene1", TRUE)
[1] "10"    "gene1" "TRUE" 


R follows a hierarchy of data types for coercion:

logical (least inclusive) → numeric → character (most inclusive)
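
The hierarchy can be checked with typeof(). The sketch below also shows that converting back down the hierarchy loses whatever does not fit; the variable name mixed is hypothetical.

```r
typeof(c(TRUE, 2L))       # "integer": logical is promoted to a number
typeof(c(2L, 2.5))        # "double": integer is promoted to double
typeof(c(2.5, "gene1"))   # "character": everything becomes text

# Going back to numeric: text that is not a number becomes NA
mixed <- c(10, "gene1")
back_to_num <- suppressWarnings(as.numeric(mixed))
back_to_num               # 10 NA
```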

Vector Indexing (1)

  • Use [idx] to access element(s).

Note: Indexing starts at 1.

gene_expr
[1] 10 12  9
gene_expr[1]   # 1st element
[1] 10
gene_expr[c(2, 3)] # elements 2 and 3
[1] 12  9
gene_expr[2:3] # elements 2 and 3
[1] 12  9
gene_expr[-1]  # all elements except the 1st (gene_expr itself is unchanged)
[1] 12  9
  • Modify element(s).
gene_expr[1] <- 100
gene_expr
[1] 100  12   9
gene_expr[2:3] <- 8
gene_expr
[1] 100   8   8
gene_expr[2:3] <- c(0, 20)
gene_expr
[1] 100   0  20
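
Some edge cases of positional indexing are worth knowing. This sketch uses a fresh vector with the hypothetical name expr so that gene_expr is left untouched.

```r
expr <- c(10, 12, 9)

expr[5]              # NA: there is no 5th element
expr[-c(1, 2)]       # 9: everything except elements 1 and 2
expr[length(expr)]   # 9: the last element

# Assigning past the end extends the vector and fills the gap with NA
expr[5] <- 20
expr                 # 10 12  9 NA 20
```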

Vector Indexing (2)

  • Use [name] to access element(s) if the vector is named.
# name elements of the vector
names(gene_expr) <- c("gene1", "gene2", "gene3")
gene_expr
gene1 gene2 gene3 
  100     0    20 
gene_expr["gene1"] # the element named "gene1"
gene1 
  100 
gene_expr[c("gene1", "gene2")] # extract "gene1" and "gene2"
gene1 gene2 
  100     0 
  • Modify element(s).
gene_expr["gene1"] <- 2
gene_expr
gene1 gene2 gene3 
    2     0    20 
  • How to modify the expression value of “gene1” and “gene2” to 5?
  • How to change the expression value of “gene1” to 0 and “gene2” to 16?
gene_expr[c("gene1", "gene2")] <- 5
gene_expr
gene1 gene2 gene3 
    5     5    20 
gene_expr[c("gene1", "gene2")] <- c(0, 16)
gene_expr
gene1 gene2 gene3 
    0    16    20 
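
Names can also be set when the vector is created and queried afterwards. A minimal sketch, using a hypothetical vector expr with the same values as above:

```r
# Names given directly inside c()
expr <- c(gene1 = 0, gene2 = 16, gene3 = 20)

names(expr)               # "gene1" "gene2" "gene3"
expr["gene9"]             # NA: no element has that name
names(expr)[expr > 10]    # "gene2" "gene3": names of the values above 10
unname(expr["gene2"])     # 16, with the name stripped
```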

Vector Indexing (3)

  • Use a logical vector for indexing.
num_vec <- c(1, 2, 5, 4)
num_vec
[1] 1 2 5 4
logical_vec <- c(TRUE, TRUE, FALSE, FALSE)
logical_vec
[1]  TRUE  TRUE FALSE FALSE
num_vec[logical_vec]
[1] 1 2
# create a logical vector using a comparison operator
num_vec < 3
[1]  TRUE  TRUE FALSE FALSE
# then use it to extract values from the numeric vector
num_vec[num_vec < 3]
[1] 1 2

R uses == to test equality. E.g.:

1 == 2
[1] FALSE
c(1, 2, 3) == 2
[1] FALSE  TRUE FALSE

How to extract the value 5 from the num_vec?

num_vec[num_vec == 5]
[1] 5
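
Conditions can be combined, and which() returns positions rather than values. A short sketch on the same num_vec:

```r
num_vec <- c(1, 2, 5, 4)

which(num_vec > 3)                    # 3 4: positions of the matching elements
num_vec[num_vec > 1 & num_vec < 5]    # 2 4: both conditions must hold
num_vec[num_vec < 2 | num_vec > 4]    # 1 5: at least one condition holds
num_vec[num_vec != 5]                 # 1 2 4: everything that is not 5
```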

Vector Operations

# create a vector for gene expression
gene_expr <- c(15, 19, 14, 3, 10) 

# check the structure
is.numeric(gene_expr)
[1] TRUE
str(gene_expr)
 num [1:5] 15 19 14 3 10
# vector length
length(gene_expr)
[1] 5
# show the first/last elements
head(gene_expr)
[1] 15 19 14  3 10
tail(gene_expr)
[1] 15 19 14  3 10
# check whether each value is missing (NA)
is.na(gene_expr)
[1] FALSE FALSE FALSE FALSE FALSE
# Arithmetic operations
gene_expr + 1
[1] 16 20 15  4 11
gene_expr * 10
[1] 150 190 140  30 100
# Get some summary stats
sum(gene_expr)
[1] 61
mean(gene_expr)
[1] 12.2
median(gene_expr)
[1] 14
summary(gene_expr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    10.0    14.0    12.2    15.0    19.0 
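
Arithmetic also works element by element between two vectors of the same length, and missing values need explicit handling in summary functions. In this sketch the vectors control and with_na are hypothetical.

```r
gene_expr <- c(15, 19, 14, 3, 10)
control   <- c(5, 19, 7, 3, 20)   # made-up values for illustration

gene_expr / control    # 3.0 1.0 2.0 1.0 0.5
gene_expr - control    # 10 0 7 0 -10

# A single NA makes the mean NA unless it is removed
with_na <- c(15, NA, 14)
mean(with_na)                  # NA
mean(with_na, na.rm = TRUE)    # 14.5
sum(is.na(with_na))            # 1: number of missing values
```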

Matrices

A matrix is a two-dimensional data structure with rows and columns. All of its elements have the same data type.

Matrix Creation

  • Use the matrix() function to create a matrix.
my_mat1 <- matrix(1:6, nrow = 2)
my_mat2 <- matrix(1:6, nrow = 2, byrow = TRUE)

my_mat1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
my_mat2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
  • Combine vectors to create a matrix.
vec1 <- 1:3
vec2 <- 4:6

rbind(vec1, vec2)
     [,1] [,2] [,3]
vec1    1    2    3
vec2    4    5    6
cbind(vec1, vec2)
     vec1 vec2
[1,]    1    4
[2,]    2    5
[3,]    3    6
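
Row and column names can also be supplied at creation time through the dimnames argument of matrix(). The sketch below uses made-up gene and sample names, and also shows that mixing types in a matrix triggers the same coercion as in vectors.

```r
expr_mat <- matrix(
  c(10, 12, 9, 11, 13, 8),
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"),   # row names
                  c("sample1", "sample2"))        # column names
)
expr_mat

# A matrix holds a single type: numbers combined with text become text
mixed_mat <- cbind(1:2, c("a", "b"))
typeof(mixed_mat)   # "character"
```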


What is the data structure of each row/column of a matrix?

Matrix Indexing (1)

  • Use [row_idx,column_idx] to access element(s).
mat <- matrix(1:12, ncol = 4)
mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
dim(mat) # dimensions of the matrix
[1] 3 4
nrow(mat)
[1] 3
ncol(mat)
[1] 4
mat[1, 2] # element in the 1st row and 2nd column
[1] 4
mat[, 3] # all rows of the 3rd column
[1] 7 8 9
  • How to get all columns of the 2nd and the 3rd rows?
  • How to get the value 5 from the matrix?
mat[2:3, ]
     [,1] [,2] [,3] [,4]
[1,]    2    5    8   11
[2,]    3    6    9   12
mat[2, 2]
[1] 5
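
Extracting a single row or column returns a plain vector by default. The sketch below shows this, together with the drop = FALSE argument that keeps the result as a matrix; the names row1 and row1_mat are hypothetical.

```r
mat <- matrix(1:12, ncol = 4)

row1 <- mat[1, ]
is.matrix(row1)    # FALSE: a single row is returned as a vector
is.vector(row1)    # TRUE

# drop = FALSE keeps the two dimensions
row1_mat <- mat[1, , drop = FALSE]
dim(row1_mat)      # 1 4

mat[, -1]                    # all columns except the 1st
mat[nrow(mat), ncol(mat)]    # 12: element in the last row and last column
```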

Matrix Indexing (2)

  • Use [row_name,column_name] to access element(s) if names exist.
mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# add names to the columns and rows
rownames(mat) <- c("r1", "r2", "r3")
colnames(mat) <- paste0("c", 1:4)

mat
   c1 c2 c3 c4
r1  1  4  7 10
r2  2  5  8 11
r3  3  6  9 12
mat["r1", ] # all elements of the 1st row
c1 c2 c3 c4 
 1  4  7 10 

By using the names of rows and columns:

  • How to get the elements of the 2nd row in the 2nd and 3rd columns?
  • How to get the value 5 from the matrix?
mat["r2", c("c2", "c3")]
c2 c3 
 5  8 
mat["r2", "c2"]
[1] 5
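
Names and positions can be mixed in the same call, and the order of the names controls the order of the result. A small sketch that rebuilds the same named matrix so it can be run on its own:

```r
mat <- matrix(1:12, ncol = 4,
              dimnames = list(c("r1", "r2", "r3"), paste0("c", 1:4)))

mat["r3", 2]                # 6: row by name, column by position
mat[c("r3", "r1"), "c4"]    # r3 = 12, r1 = 10: rows come back in the order requested
"c2" %in% colnames(mat)     # TRUE: check that a column name exists before using it
```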

Matrix Indexing (3)

  • Use logical vector(s) for indexing.
mat[c(TRUE, TRUE, FALSE), ]
   c1 c2 c3 c4
r1  1  4  7 10
r2  2  5  8 11
mat[c(TRUE, TRUE, FALSE), c(FALSE, TRUE, TRUE, FALSE)]
   c2 c3
r1  4  7
r2  5  8

Using logical indexing, how to select the 2nd and 3rd rows and the 1st and 2nd columns of mat?

mat[c(FALSE, TRUE, TRUE), c(TRUE, TRUE, FALSE, FALSE)]
   c1 c2
r2  2  5
r3  3  6
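
In practice the logical vectors usually come from a comparison rather than being typed by hand. A minimal sketch on the same named matrix, rebuilt here so the block runs on its own:

```r
mat <- matrix(1:12, ncol = 4,
              dimnames = list(c("r1", "r2", "r3"), paste0("c", 1:4)))

# A condition on the whole matrix returns the matching values as a vector
mat[mat > 9]                 # 10 11 12

# Keep the rows whose value in column c1 is at least 2
mat[mat[, "c1"] >= 2, ]      # rows r2 and r3

# Keep the columns whose value in row r1 is above 5
mat[, mat["r1", ] > 5]       # columns c3 and c4
```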

Matrix Operations

  • Check the structure
is.matrix(mat)
[1] TRUE
str(mat)
 int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "r1" "r2" "r3"
  ..$ : chr [1:4] "c1" "c2" "c3" "c4"
  • Some maths
rowSums(mat)
r1 r2 r3 
22 26 30 
colSums(mat)
c1 c2 c3 c4 
 6 15 24 33 
colMeans(mat)
c1 c2 c3 c4 
 2  5  8 11 
rowMeans(mat)
 r1  r2  r3 
5.5 6.5 7.5 
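
Other base R operations follow the same row/column logic. The sketch below works on a separate matrix with the hypothetical name m, so that mat is not changed; apply() is shown as one general way to run a function over rows (1) or columns (2).

```r
m <- matrix(1:12, ncol = 4)

m * 2              # arithmetic is applied to every element
dim(t(m))          # 4 3: t() swaps rows and columns

apply(m, 1, max)   # 10 11 12: maximum of each row
apply(m, 2, max)   # 3 6 9 12: maximum of each column
```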
  • Modify elements
mat[1:2, 3] <- c(1, 2)
mat
   c1 c2 c3 c4
r1  1  4  1 10
r2  2  5  2 11
r3  3  6  9 12
mat[1:2, ] <- 10
mat
   c1 c2 c3 c4
r1 10 10 10 10
r2 10 10 10 10
r3  3  6  9 12
# replace values in 1st row
mat[1, ] <- c(0, 1, 2)
Error in mat[1, ] <- c(0, 1, 2): number of items to replace is not a multiple of replacement length

Value Replacement

  • In matrices
mat
   c1 c2 c3 c4
r1 10 10 10 10
r2 10 10 10 10
r3  3  6  9 12
# replace values in 1st row
mat[1, ] <- c(0, 1, 2)
Error in mat[1, ] <- c(0, 1, 2): number of items to replace is not a multiple of replacement length
  • In vectors
num_vec <- 1:10
num_vec
 [1]  1  2  3  4  5  6  7  8  9 10
num_vec[1:5] <- c(0, 1, 2)
Warning in num_vec[1:5] <- c(0, 1, 2): number of items to replace is not a
multiple of replacement length
num_vec
 [1]  0  1  2  0  1  6  7  8  9 10

When assigning new values, provide either:

  • A single value,
  • A vector with exactly as many elements as are being replaced, or
  • A vector whose length divides the number of elements being replaced evenly, so that R can recycle it. Not recommended!

Any other length gives a warning for vectors and an error for matrices, as shown above.
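
The three accepted cases can be sketched as follows. The matrix m is hypothetical, and rep_len() is used here as one way to make recycling explicit instead of relying on R to do it silently.

```r
num_vec <- 1:10

num_vec[1:4] <- 0                     # a single value fills all four positions
num_vec[1:4] <- c(9, 8, 7, 6)         # exact length: one value per position
num_vec[1:4] <- rep_len(c(0, 1), 4)   # recycling written out explicitly
num_vec                               # 0 1 0 1 5 6 7 8 9 10

m <- matrix(0, nrow = 2, ncol = 4)
m[1, ] <- c(1, 2, 3, 4)   # exact length: works
m[2, ] <- c(1, 2)         # length 2 divides 4: recycled without any message
m[2, ]                    # 1 2 1 2
```

The last assignment is the risky one: nothing tells you that the values were repeated, which is why relying on recycling is discouraged.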

Let’s Practice!

Today’s Goals

  • Get familiar with variables and data types
  • Get familiar with vectors and matrices manipulations
  • Simulate your own biological data and test the normality using the Shapiro-Wilk test
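
As a starting point for the last goal, here is one possible sketch, not the official solution to the exercise. It assumes the data are simulated from a normal distribution with rnorm(); the seed, sample size, mean and standard deviation are arbitrary choices, and sim_expr and result are hypothetical names.

```r
set.seed(42)                                # arbitrary seed, for reproducibility

# Simulate 30 expression values from a normal distribution (made-up parameters)
sim_expr <- rnorm(30, mean = 10, sd = 2)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
result <- shapiro.test(sim_expr)
result$p.value                              # a small p-value suggests non-normal data
```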