Data Foundations:
Vectors and Matrices

IOC-R Week 2

Variables and Data Types in R

How Do We Store Data in R?

A variable is like a box where you store data.
Each variable has a name and content (one or multiple values).
A variable is created at the moment you assign a value to it. Use <- for assignment.

x <- 1 # put a space on each side of the assignment sign
x # type variable name to print its value(s)

[1] 1

char_name <- "InforBio"
char_name

[1] "InforBio"

char_name <- InforBio # without quotes, R looks for an object named InforBio

Error: object 'InforBio' not found

Check the “Environment” pane or type ls() in the console: are the variables you just created there?

Variable Naming Convention

Choose a short and descriptive name
Use snake_case (lowercase letters and underscores only)
Avoid special characters (such as ! and #) and spaces
Do not start a name with numbers
Avoid reserved keywords in R (e.g., function, if, TRUE)
Do not overwrite built-in functions (e.g., mean, sd)

Which are valid names?

foo
test
var
var2
exam_results
a_variable_with_a_name_super_long

day_1
day_one
day1
first_day_of_the_month
DayOne
dayOne
DAYONE
DAYone
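
One way to check the candidates above is to let R decide. The sketch below relies on base R's make.names(), which rewrites a string into a syntactically valid name; a string that comes back unchanged was already valid. The last three candidates (2day, my var, if) are illustrative additions, not part of the list above.

```r
# candidate names; the last three are illustrative counter-examples
candidates <- c("day_1", "dayOne", "2day", "my var", "if")

# make.names() returns a syntactically valid version of each string
fixed <- make.names(candidates)

# a name is valid if make.names() did not have to change it
is_valid <- candidates == fixed
names(is_valid) <- candidates
is_valid
```

Being accepted by R is not the same as following the convention: dayOne is a valid name, but it is not snake_case.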

How do you know if a variable name is already in use?

Type help(reserved) to check reserved words in R.
Check in “Environment” pane.
Type the first letters of a name and press the Tab key to trigger autocompletion
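
The same check can be done from the console with exists(), which reports whether a name is already bound to an object. A minimal sketch; my_unused_name_42 is a hypothetical name assumed not to exist in your session.

```r
# exists() takes the name as a character string
exists("mean")               # TRUE: the name is taken by a built-in function
exists("my_unused_name_42")  # FALSE, assuming nothing with this name was created
```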

Data Types

Examples: 1, 2.5, "A", "InforBio", "I love R", TRUE, FALSE, …

How does R understand and store information?

Main data types:

Numeric
- Double: 2.5
- Integer: 1
Character: "A", "InforBio", "I love R"
Logical (boolean): TRUE, FALSE
Factor: for categorical data

We’ll cover factors next week.

Numeric

Double (default): used for numbers with or without a decimal point.

a <- 3
is.numeric(a)

[1] TRUE

is.double(a)

[1] TRUE

typeof(a)

[1] "double"

If you explicitly want an integer (a whole number), you can define it by appending an L to the number:

b <- 3L
is.numeric(b)

[1] TRUE

is.integer(b)

[1] TRUE

typeof(b)

[1] "integer"
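
A few more cases show when a value stays integer and when it becomes double. This is a short sketch using typeof() only; the numbers are arbitrary.

```r
typeof(1:3)       # "integer": sequences made with : are integers
typeof(3L + 1L)   # "integer": integer + integer stays integer
typeof(3L + 1)    # "double": mixing with a double gives a double
typeof(3L / 1L)   # "double": division always returns a double
as.integer(3.9)   # 3: conversion drops the decimal part, it does not round
```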

Character

R stores text (strings) as character. Use quotation marks to indicate a value is character.

# enclosed in either double quotes (") or single quotes (')
x <- "I love R"
x

[1] "I love R"

is.character(x)

[1] TRUE

"1" + "2"

Error in "1" + "2": non-numeric argument to binary operator

as.numeric("1") + as.numeric("2")  # convert to numeric

[1] 3

as.character(1) # convert to character

[1] "1"
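
Conversion only works when the text really looks like a number. The sketch below shows what happens otherwise; suppressWarnings() is used here only to hide the "NAs introduced by coercion" warning that R would normally print.

```r
# text that looks like a number converts cleanly
as.numeric("2.5")

# text that is not a number becomes NA (missing value)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)
```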

Logical (1)

Only two possible values for logical data: TRUE or FALSE.
Can be abbreviated as T or F, but no other spelling is accepted (e.g., True, true)

is.logical(TRUE)

[1] TRUE

is.logical(T)

[1] TRUE

is.logical(True)

Error: object 'True' not found

Can be obtained from logical statements, e.g.:

2 > 1

[1] TRUE
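
A caution about the T and F shorthand: unlike TRUE and FALSE, which are reserved words, T and F are ordinary variables and can be overwritten. The sketch below demonstrates this and then cleans up after itself.

```r
T <- 0      # allowed: T is an ordinary variable that normally holds TRUE
isTRUE(T)   # FALSE now, which can silently break code that relies on T
rm(T)       # delete our copy so that T refers to the built-in value again
isTRUE(T)   # TRUE
```

Writing TRUE and FALSE in full avoids this problem.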

Logical (2)

Convert to other types:

as.numeric(TRUE)

[1] 1

as.numeric(FALSE)

[1] 0

TRUE + TRUE + FALSE

[1] 2

as.character(TRUE)

[1] "TRUE"

as.character(FALSE)

[1] "FALSE"

as.logical(1)

[1] TRUE

as.logical(-1)

[1] TRUE

as.logical(0)

[1] FALSE

When as.logical() is applied to numbers, any non-zero number is converted to TRUE.

Data Structures

From Single Value to Multiple Values

When we store multiple values, we need a structure.

gene1 <- 10
gene2 <- 12
gene3 <- 9

How can we put gene1, gene2 and gene3 together?

R provides 4 data structures to store multiple values:

                      1 dimension   2 dimensions (rows/columns)
Same data type        vector        matrix
Different data types  list          data frame
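
As a quick preview of the table above, here is one way each structure might be created. The object names and values are illustrative only; this session covers vectors and matrices in detail.

```r
v  <- c(10, 12, 9)              # vector: 1 dimension, one data type
m  <- matrix(1:6, nrow = 2)     # matrix: 2 dimensions, one data type
l  <- list(10, "gene1", TRUE)   # list: 1 dimension, mixed data types
# data frame: 2 dimensions, each column can have its own data type
df <- data.frame(gene = c("gene1", "gene2"), expr = c(10, 12))

is.vector(v)
is.matrix(m)
is.list(l)
is.data.frame(df)
```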

Vectors

The simplest data structure in R, for one-dimensional data of the same type.

Vector Creation (1)

Use the function c() to create a vector and use , to separate elements.

c(10, 12, 9) # Numeric vector

[1] 10 12  9

c(gene1, gene2, gene3)

[1] 10 12  9

gene_expr <- c(gene1, gene2, gene3) # store in a variable
gene_expr

[1] 10 12  9

c(gene_expr, 18)

[1] 10 12  9 18

Quickly create sequences of numbers

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

5:1

[1] 5 4 3 2 1

seq(from = 1, to = 10, by = 2)

[1] 1 3 5 7 9

c("gene1", "gene2", "gene3") # Character vector

[1] "gene1" "gene2" "gene3"

c(TRUE, FALSE, FALSE, TRUE, TRUE) # Logical vector

[1]  TRUE FALSE FALSE  TRUE  TRUE

Other tricks

paste0("gene", 1:3)

[1] "gene1" "gene2" "gene3"

rep(c(TRUE, FALSE), each = 2) # repetition

[1]  TRUE  TRUE FALSE FALSE

A single value (scalar) is treated as a vector of length 1.

is.vector(10)

[1] TRUE

length(10)

[1] 1

is.vector("gene1")

[1] TRUE

Vector Creation (2)

When you combine different data types …

c(10, TRUE)

[1] 10  1

c(10, "gene1")

[1] "10"    "gene1"

c(TRUE, "gene1")

[1] "TRUE"  "gene1"

c(10, "gene1", TRUE)

[1] "10"    "gene1" "TRUE"

R follows a hierarchy of data types for coercion:

logical (least inclusive) → numeric → character (most inclusive)
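
typeof() can confirm which type wins in each combination. Note that "numeric" in the hierarchy above covers two types, with integer sitting below double.

```r
typeof(c(TRUE, 1L))      # "integer": logical is converted to integer
typeof(c(1L, 2.5))       # "double": integer is converted to double
typeof(c(TRUE, 10))      # "double": logical is converted to double
typeof(c(10, "gene1"))   # "character": everything is converted to text
```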

Vector Indexing (1)

Use [idx] to access element(s).

Note: indexing starts at 1.

gene_expr

[1] 10 12  9

gene_expr[1]   # 1st element

[1] 10

gene_expr[c(2, 3)] # elements 2 and 3

[1] 12  9

gene_expr[2:3] # elements 2 and 3

[1] 12  9

gene_expr[-1]  # all elements except the 1st (gene_expr itself is unchanged)

[1] 12  9

Modify element(s).

gene_expr[1] <- 100
gene_expr

[1] 100  12   9

gene_expr[2:3] <- 8
gene_expr

[1] 100   8   8

gene_expr[2:3] <- c(0, 20)
gene_expr

[1] 100   0  20
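
A few edge cases of positional indexing are worth trying once. The sketch below uses a fresh vector x (an illustrative name) so it does not depend on the current content of gene_expr.

```r
x <- c(10, 12, 9)

x[5]            # NA: an index past the end gives a missing value, not an error
x[-c(1, 2)]     # 9: several elements can be excluded at once
x[c(1, 1, 3)]   # 10 10 9: an index can be repeated
x               # 10 12 9: extracting never modifies the vector
```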

Vector Indexing (2)

Use [name] to access element(s) if the vector is named.

# name elements of the vector
names(gene_expr) <- c("gene1", "gene2", "gene3")
gene_expr

gene1 gene2 gene3 
  100     0    20

gene_expr["gene1"] # the element named "gene1"

gene1 
  100

gene_expr[c("gene1", "gene2")] # extract "gene1" and "gene2"

gene1 gene2 
  100     0

Modify element(s).

gene_expr["gene1"] <- 2
gene_expr

gene1 gene2 gene3 
    2     0    20

How would you set the expression values of “gene1” and “gene2” to 5?
How would you change the expression value of “gene1” to 0 and that of “gene2” to 16?

gene_expr[c("gene1", "gene2")] <- 5
gene_expr

gene1 gene2 gene3 
    5     5    20

gene_expr[c("gene1", "gene2")] <- c(0, 16)
gene_expr

gene1 gene2 gene3 
    0    16    20

Vector Indexing (3)

Use a logical vector for indexing.

num_vec <- c(1, 2, 5, 4)
num_vec

[1] 1 2 5 4

logical_vec <- c(TRUE, TRUE, FALSE, FALSE)
logical_vec

[1]  TRUE  TRUE FALSE FALSE

num_vec[logical_vec]

[1] 1 2

# create a logical vector using a comparison operator
num_vec < 3

[1]  TRUE  TRUE FALSE FALSE

# then use it to extract values from the numeric vector
num_vec[num_vec < 3]

[1] 1 2

R uses == to test equality, e.g.:

1 == 2

[1] FALSE

c(1, 2, 3) == 2

[1] FALSE  TRUE FALSE

How can you extract the value 5 from num_vec?

num_vec[num_vec == 5]

[1] 5
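
Conditions can also be combined or counted. A short sketch, reusing the values of num_vec from above: & means "and", | means "or", which() returns positions, and sum() counts TRUE values because TRUE is treated as 1.

```r
num_vec <- c(1, 2, 5, 4)

num_vec[num_vec > 1 & num_vec < 5]   # 2 4: both conditions must hold
num_vec[num_vec < 2 | num_vec > 4]   # 1 5: at least one condition must hold
which(num_vec == 5)                  # 3: position of the value 5
sum(num_vec < 3)                     # 2: how many values are below 3
```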

Vector Operations

# create a vector for gene expression
gene_expr <- c(15, 19, 14, 3, 10) 

# check the structure
is.numeric(gene_expr)

[1] TRUE

str(gene_expr)

 num [1:5] 15 19 14 3 10

# vector length
length(gene_expr)

[1] 5

# show the first/last elements (up to 6 by default)
head(gene_expr)

[1] 15 19 14  3 10

tail(gene_expr)

[1] 15 19 14  3 10

# check whether any values are missing
is.na(gene_expr)

[1] FALSE FALSE FALSE FALSE FALSE

# Arithmetic operations
gene_expr + 1

[1] 16 20 15  4 11

gene_expr * 10

[1] 150 190 140  30 100

# Get some summary stats
sum(gene_expr)

[1] 61

mean(gene_expr)

[1] 12.2

median(gene_expr)

[1] 14

summary(gene_expr)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    10.0    14.0    12.2    15.0    19.0
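
The missing-value check matters because a single NA propagates through most summary functions. A minimal sketch with an illustrative vector expr_na; na.rm = TRUE tells the function to drop missing values first.

```r
expr_na <- c(15, NA, 14)        # illustrative values with one missing

anyNA(expr_na)                  # TRUE: at least one value is missing
sum(is.na(expr_na))             # 1: number of missing values
mean(expr_na)                   # NA: the result is missing too
mean(expr_na, na.rm = TRUE)     # 14.5: mean of the remaining values
```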

Matrices

A matrix is a two-dimensional data structure with rows and columns. All its elements have the same data type.

Matrix Creation

Use the matrix() function to create a matrix.

my_mat1 <- matrix(1:6, nrow = 2)
my_mat2 <- matrix(1:6, nrow = 2, byrow = TRUE)

my_mat1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

my_mat2

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Combine vectors to create a matrix.

vec1 <- 1:3
vec2 <- 4:6

rbind(vec1, vec2)

     [,1] [,2] [,3]
vec1    1    2    3
vec2    4    5    6

cbind(vec1, vec2)

     vec1 vec2
[1,]    1    4
[2,]    2    5
[3,]    3    6

What is the data structure of each row/column of a matrix?
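
The question above can be explored directly in the console. In this sketch, m is an illustrative matrix; drop = FALSE is the standard argument of [ that keeps the result as a matrix.

```r
m <- matrix(1:6, nrow = 2)

row1 <- m[1, ]               # extract the 1st row
is.vector(row1)              # TRUE: a single row comes back as a vector
is.matrix(row1)              # FALSE

dim(m[1, , drop = FALSE])    # 1 3: with drop = FALSE it stays a 1 x 3 matrix
```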

Matrix Indexing (1)

Use [row_idx,column_idx] to access element(s).

mat <- matrix(1:12, ncol = 4)
mat

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

dim(mat) # dimensions of the matrix

[1] 3 4

nrow(mat)

[1] 3

ncol(mat)

[1] 4

mat[1, 2] # element in the 1st row and 2nd column

[1] 4

mat[, 3] # all rows of the 3rd column

[1] 7 8 9

How do you get all columns of the 2nd and 3rd rows?
How do you get the value 5 from the matrix?

mat[2:3, ]

     [,1] [,2] [,3] [,4]
[1,]    2    5    8   11
[2,]    3    6    9   12

mat[2, 2]

[1] 5

Matrix Indexing (2)

Use [row_name,column_name] to access element(s) if names exist.

mat

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

# add names to the columns and rows
rownames(mat) <- c("r1", "r2", "r3")
colnames(mat) <- paste0("c", 1:4)

mat

   c1 c2 c3 c4
r1  1  4  7 10
r2  2  5  8 11
r3  3  6  9 12

mat["r1", ] # all elements of the 1st row

c1 c2 c3 c4 
 1  4  7 10

By using the names of rows and columns:

How do you get the elements of the 2nd row in the 2nd and 3rd columns?
How do you get the value 5 from the matrix?

mat["r2", c("c2", "c3")]

c2 c3 
 5  8

mat["r2", "c2"]

[1] 5

Matrix Indexing (3)

Use logical vector(s) for indexing.

mat[c(TRUE, TRUE, FALSE), ]

   c1 c2 c3 c4
r1  1  4  7 10
r2  2  5  8 11

mat[c(TRUE, TRUE, FALSE), c(FALSE, TRUE, TRUE, FALSE)]

   c2 c3
r1  4  7
r2  5  8

Using logical indexing, how do you select the 2nd and 3rd rows and the 1st and 2nd columns of mat?

mat[c(FALSE, TRUE, TRUE), c(TRUE, TRUE, FALSE, FALSE)]

   c1 c2
r2  2  5
r3  3  6
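
As with vectors, the logical index is usually computed from a condition rather than typed by hand. A minimal sketch on a fresh matrix m (illustrative name) with the same values and names as mat:

```r
m <- matrix(1:12, ncol = 4,
            dimnames = list(c("r1", "r2", "r3"), paste0("c", 1:4)))

keep <- m[, "c1"] > 1   # one TRUE/FALSE per row, based on column c1
m[keep, ]               # rows r2 and r3, all columns

m[m > 10]               # 11 12: a condition on the whole matrix returns a vector
```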

Matrix Operations

Check the structure

is.matrix(mat)

[1] TRUE

str(mat)

 int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "r1" "r2" "r3"
  ..$ : chr [1:4] "c1" "c2" "c3" "c4"

Some maths

rowSums(mat)

r1 r2 r3 
22 26 30

colSums(mat)

c1 c2 c3 c4 
 6 15 24 33

colMeans(mat)

c1 c2 c3 c4 
 2  5  8 11

rowMeans(mat)

 r1  r2  r3 
5.5 6.5 7.5
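
For summaries that have no dedicated function, base R's apply() runs any function over the rows (second argument 1) or columns (second argument 2). A sketch on an illustrative matrix m with the same values as mat:

```r
m <- matrix(1:12, ncol = 4)

apply(m, 1, max)                      # 10 11 12: maximum of each row
apply(m, 2, max)                      # 3 6 9 12: maximum of each column
all(apply(m, 1, sum) == rowSums(m))   # TRUE: same result as rowSums()
```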

Modify elements

mat[1:2, 3] <- c(1, 2)
mat

   c1 c2 c3 c4
r1  1  4  1 10
r2  2  5  2 11
r3  3  6  9 12

mat[1:2, ] <- 10
mat

   c1 c2 c3 c4
r1 10 10 10 10
r2 10 10 10 10
r3  3  6  9 12

# replace values in 1st row
mat[1, ] <- c(0, 1, 2)

Error in mat[1, ] <- c(0, 1, 2): number of items to replace is not a multiple of replacement length

Value Replacement

In matrices

mat

   c1 c2 c3 c4
r1 10 10 10 10
r2 10 10 10 10
r3  3  6  9 12

# replace values in 1st row
mat[1, ] <- c(0, 1, 2)

Error in mat[1, ] <- c(0, 1, 2): number of items to replace is not a multiple of replacement length

In vectors

num_vec <- 1:10
num_vec

 [1]  1  2  3  4  5  6  7  8  9 10

num_vec[1:5] <- c(0, 1, 2)

Warning in num_vec[1:5] <- c(0, 1, 2): number of items to replace is not a
multiple of replacement length

num_vec

 [1]  0  1  2  0  1  6  7  8  9 10

When assigning new values, you must provide either:

A single value,
A vector with the exact number of elements to be replaced, or
A vector whose length divides the number of elements to be replaced (recycling). Not recommended!
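
When the replacement length does divide the number of elements, recycling happens silently, with no warning or error. A short sketch with illustrative objects x and m:

```r
x <- 1:6
x[1:4] <- c(0, 1)   # 4 elements, 2 values: recycled as 0 1 0 1
x                   # 0 1 0 1 5 6

m <- matrix(1:12, ncol = 4)
m[1, ] <- c(0, 1)   # 4 cells in the row, 2 values: recycled, no error
m[1, ]              # 0 1 0 1
```

Because nothing is reported, a replacement of the wrong length can go unnoticed, which is why recycling is best avoided here.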

Let’s Practice!

Today’s Goals

Get familiar with variables and data types
Get familiar with vectors and matrices manipulations
Simulate your own biological data and test the normality using the Shapiro-Wilk test
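
For the last goal, a possible starting point is sketched below. It assumes normally distributed values drawn with rnorm(); the sample size, mean and standard deviation are arbitrary choices, and sim_expr is an illustrative name.

```r
set.seed(42)                                   # make the simulation reproducible
sim_expr <- rnorm(n = 30, mean = 10, sd = 2)   # 30 simulated expression values

res <- shapiro.test(sim_expr)                  # Shapiro-Wilk normality test
res$p.value                                    # a small p-value suggests non-normal data
```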
