SC1

Statistical Computing 1

Fundamentals


This page provides a very brief introduction to general R programming, and is sufficient to perform very complicated computations. It is certainly not comprehensive, however, and it is expected that one will need to supplement the information here with other material, and some healthy experimentation.

Variables and values

A value can be assigned to a variable using the assignment operator <-. The variable is created if it doesn’t already exist.

x <- 123

Creating or assigning a variable a value does not produce any output. Providing the name of the variable as input does produce its value as output.

x
## [1] 123

One can reassign the value of x using <-. The type (e.g. numeric, string) of the variable need not be the same as before.

x <- "hello"
x
## [1] "hello"

One can use the operator = instead of <-, but this is not recommended. In RStudio Alt + - is a keyboard shortcut for writing the assignment operator.

A valid variable name consists of letters, numbers and the dot . or underscore _ characters. Visible variable names must start with a letter.

Arithmetic

As you would expect, one can add, subtract, multiply, divide and exponentiate. There are also operators for integer division and remainder (modulus).

Operator Meaning
+ add
- subtract
* multiply
/ divide
^ or ** exponentiate
%/% integer division
%% integer modulus
2^3
## [1] 8
7 %/% 3
## [1] 2
7 %% 3
## [1] 1

For convenience, these operators are used the same way we write them on paper. The alternative is to surround the operator with ticks.

`+`(2,3)

While this is natural, it is necessary in some circumstances to use parentheses or curly braces to ensure operations occur in the right order.

2 + 3 * 5
## [1] 17
(2 + 3) * 5
## [1] 25
{2 + 3} * 5
## [1] 25

Conditional statements

Commands can be executed conditional upon a logical value (i.e. TRUE or FALSE) by using an if/else statement.

if (TRUE) {
  print("this is printed")
} else {
  print("this is not printed")
}
## [1] "this is printed"
if (FALSE) {
  print("this is not printed")
} else {
  print("this is printed")
}
## [1] "this is printed"

It is not necessary to have an else statement.

if (TRUE) {
  print("this is printed")
}
## [1] "this is printed"
if (FALSE) {
  print("this is not printed")
}

One can nest if/else statements

if (FALSE) {
  print("this is not printed")
} else {
  if (TRUE) {
    print("this is printed")
  }
}
## [1] "this is printed"
if (FALSE) {
  print("this is not printed")
} else if (TRUE) {
  print("this is printed")
}
## [1] "this is printed"

The curly braces { and } are necessary when there are multiple statements to be executed in the conditional block. For example, in the following code “1” is not printed, but “2” is (the indentation is just misleading).

if (FALSE)
  print(1)
  print(2)
## [1] 2

The above snippet is equivalent to

if (FALSE) {
  print(1)
}
print(2)
## [1] 2

It is often considered good practice to use curly braces even for single-statement blocks, as it is easy to read and can be less likely to lead to bugs on modification.

The ifelse operator can also be useful in some situations. You can read about it after you are familiar with vectors in R.

Relational and logical operators

Conditional statements are often used when the logical value is computed dynamically.

x <- 5
if (x < 10) {
  print("x is less than 10")
} else {
  print("x is greater than or equal to 10")
}
## [1] "x is less than 10"

The relational operators are

Operator Meaning
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== equals
!= not equals

Logical operators are useful for combining logical values. These are

Operator Meaning
! not
& and
| or
&& short-circuit and
|| short-circuit or

A short-circuit && or || evaluates only those values from left to right that are required to determine if the result is TRUE or FALSE. For example, FALSE && x evaluates to FALSE and TRUE || x evaluates to TRUE irrespective of the value of x and so x need not be evaluated.

x <- TRUE
y <- FALSE

if (!y) {
  print("this is printed")
}
## [1] "this is printed"
if (x & y) {
  print("this is not printed")
}

if (x | y) {
  print("this is printed")
}
## [1] "this is printed"
if (x || NA) {
  print("this is printed even though NA is not a logical value")
}
## [1] "this is printed even though NA is not a logical value"

Functions

Functions are incredibly useful building blocks for writing understandable programs, enabling code re-use and avoiding repetitions. The syntax for creating a function is

miles2km <- function(x) {
  # 1 mile is 1.609344 km
  y <- x * 1.609344
  return(y)
}

What is 500 miles in kilometres?

miles2km(500)
## [1] 804.672

What is 200 miles in kilometers?

miles2km(200)
## [1] 321.8688

Once the miles2km function has been written once, it can be used many times.

Functions can have multiple arguments.

hypotenuse <- function(a, b) {
  return(sqrt(a^2 + b^2))
}

hypotenuse(3, 4)
## [1] 5

There are many functions and constants that have already been defined in R’s base package. These include the constant pi, trigonometric functions like sin, cos, tan, and functions like log, exp. The log function has a default second argument.

log(100) # outputs natural log of 100
## [1] 4.60517
log(100, 10) # outputs log of 100 to base 10
## [1] 2
log(100, base = 10) # outputs log of 100 to base 10
## [1] 2

There are advanced techniques one can use to define and use functions, which you can research on your own. For example, one can call functions using named arguments, which can help avoid mistakes.

foo <- function(a, b) {
  return(10*a + b)
}

foo(4, 2)
## [1] 42
foo(a=4, b=2)
## [1] 42
foo(b=2, a=4)
## [1] 42

While it is often not necessary to explicitly use return, it is usually considered good practice. An R function will return the last evaluated statement in the function.

foo <- function(x) {
  x + 1
}
foo(22)
## [1] 23

The return command stops execution of that function and returns.

foo <- function(x) {
  return(x)
  print("this is not printed")
}
foo(22)
## [1] 22

Only one value can be returned by an R function. However, this value can be a vector, a list, or even another function.

Getting help

One can access R’s help documentation by opening the Help window in RStudio. One often wants to access the documentation for a specific function. One can do this by typing “?” followed by the function name in the R console.

?log

One can perform a keyword search using “??” instead of “?”.

??logarithm

Of course, R is a popular language and there are also many resources online such as webpages, blogs and forums that can help solve problems.

Vectors

We have focused on single values so far, for simplicity. In fact, a single number in R is a vector of size 1.

One can combine two vectors using the c function.

x <- 1
y <- 2
z <- c(x, y)
z
## [1] 1 2
w <- c(z, 3)
w
## [1] 1 2 3

There are a few ways to quickly specify vectors.

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1
seq(1, 2, 0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
rep(42, 8)
## [1] 42 42 42 42 42 42 42 42

Many basic R functions operate element-wise on vectors.

a <- c(1, 3, 5, 7)
b <- c(2, 4, 6, 8)
a + b
## [1]  3  7 11 15
a * b
## [1]  2 12 30 56
a ^ b
## [1]       1      81   15625 5764801
log(a)
## [1] 0.000000 1.098612 1.609438 1.945910

Other functions make sense for vectors of length greater than 1.

x <- c(5, 2, 1, 6, 4, 1, 3)
sort(x)
## [1] 1 1 2 3 4 5 6
min(x)
## [1] 1
max(x)
## [1] 6
sum(x)
## [1] 22
mean(x)
## [1] 3.142857
var(x)
## [1] 3.809524

Vectors can be indexed using vectors of integers (although not necessarily with integer type).

x <- 1:6
x[3]
## [1] 3
x[c(2, 4)]
## [1] 2 4
x[c(3, 5, 3)]
## [1] 3 5 3
x[-c(2, 4)]
## [1] 1 3 5 6

They can also be indexed using vectors of logicals.

x <- 1:6
x[c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] 2 4
x[x %% 2 == 0]
## [1] 2 4 6

Values can be changed using assignment.

x <- 1:6
x[2] <- 42
x
## [1]  1 42  3  4  5  6
x[c(2, 3)] <- c(11, 12)
x
## [1]  1 11 12  4  5  6

Vector types

So far we have focused on variables that are classified as numeric by R. There are 6 atomic vector types, and each vector can only have one of these types.

A single logical value takes either the value TRUE or FALSE.

An integer is specified by appending “L” to the end of the input. It is not that common to create integer vectors in R functions: most of the time one just uses numeric (i.e. double-precision floating point) vectors even when all the elements are mathematically integers.

x <- 123L
typeof(x)
## [1] "integer"
x <- 123
typeof(x)
## [1] "double"

A complex number is specified using complex. One can call the following functions on a complex number: Re, Im, Mod, Arg, Conj.

x <- complex(real=1, imaginary=2)
x
## [1] 1+2i
Conj(x)
## [1] 1-2i

A variable of type character can be a string, not just a single character. In particular, a single string is a vector of length 1, not a vector of characters.

x <- "hello"
typeof(x)
## [1] "character"
x <- "hello world"
length(x)
## [1] 1
print(x)
## [1] "hello world"
x <- c("hello", "world")
length(x)
## [1] 2
print(x)
## [1] "hello" "world"

We do not discuss the raw type here: it is used to store raw bytes.

Matrices

Matrices can be created using the matrix command.

x <- matrix(1:6, 2, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

The dimension of a matrix is given by dim.

dim(x)
## [1] 2 3

They can be indexed in a similar way to vectors.

x[1,3]
## [1] 5
x[2,]
## [1] 2 4 6
x[2,c(TRUE, FALSE, TRUE)]
## [1] 2 6
x[,3]
## [1] 5 6

There are many built-in R functions for matrices. One important one is %*% for matrix multiplication (* is element-wise multiplication).

A <- matrix(c(4, 2, 2, 4), 2, 2)
B <- matrix(c(1, 2, 3, 4), 2, 2)
A * B
##      [,1] [,2]
## [1,]    4    6
## [2,]    4   16
A %*% B
##      [,1] [,2]
## [1,]    8   20
## [2,]   10   22
t(B) # transpose
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
eigen(B)
## eigen() decomposition
## $values
## [1]  5.3722813 -0.3722813
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.5657675 -0.9093767
## [2,] -0.8245648  0.4159736
svd(B) # singular value decomposition
## $d
## [1] 5.4649857 0.3659662
## 
## $u
##            [,1]       [,2]
## [1,] -0.5760484 -0.8174156
## [2,] -0.8174156  0.5760484
## 
## $v
##            [,1]       [,2]
## [1,] -0.4045536  0.9145143
## [2,] -0.9145143 -0.4045536
chol(A) # cholesky decomposition
##      [,1]     [,2]
## [1,]    2 1.000000
## [2,]    0 1.732051
diag(A)
## [1] 4 4

One can add a row (resp. column) to a matrix using the rbind (resp. cbind) function.

A <- diag(1:3)
A
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3
rbind(A, 4:6)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3
## [4,]    4    5    6
cbind(A, 4:6)
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    4
## [2,]    0    2    0    5
## [3,]    0    0    3    6

An R matrix is equivalent a 2-dimensional array. One can create higher dimensional arrays.

v <- array(1:24, c(2,3,4))
v[1,2,3]
## [1] 15

An R vector is similar to, but not exactly the same as, a matrix with one column.

matrix(1:2, 2, 1)
##      [,1]
## [1,]    1
## [2,]    2
1:2
## [1] 1 2

One can convert between the two using the as.vector and as.matrix commands.

Lists

Vectors and matrices must contain variables of the same type. Lists are more generic data structures, and are sometimes called generic vectors.

List elements can be given names, which can be used to index the list using $. Lists can also be indexed using numbers inside [[ and ]]. New elements can be added by name or by indexing

x <- list(numbers = 1:5, message = "hello")
x
## $numbers
## [1] 1 2 3 4 5
## 
## $message
## [1] "hello"
x$numbers
## [1] 1 2 3 4 5
x$message
## [1] "hello"
x[[1]]
## [1] 1 2 3 4 5
x[[2]]
## [1] "hello"
x$even.numbers <- x$numbers[(x$numbers %% 2) == 0]
x[[4]] <- c(1.1, 2.2)
names(x)[[4]] <- "floats"
x
## $numbers
## [1] 1 2 3 4 5
## 
## $message
## [1] "hello"
## 
## $even.numbers
## [1] 2 4
## 
## $floats
## [1] 1.1 2.2

Lists can contain lists or functions as elements, so they can be very rich data structures.

f <- list()
f$name <- "multiply"
f$arg.types <- list(x="numeric", y="numeric")
f$eval <- function(x, y) {
  return(x*y)
}

f$name
## [1] "multiply"
f$arg.types
## $x
## [1] "numeric"
## 
## $y
## [1] "numeric"
f$eval(6, 7)
## [1] 42

Basic scoping

R uses lexical scoping, which means that it resolves a variable name in a specific location in the code by considering where that variable is defined in the code. In particular, name resolution does not depend on the run time call stack of the program. We do not go into too many details here, except to show some examples that should help you become familiar with how scoping works in R.

A name introduced inside a function renders irrelevant any variables with the same name outside the function body.

x <- 1
foo <- function() {
  x <- 2
  x
}
foo()
## [1] 2
x <- 1
foo <- function(x) {
  x
}
foo(2)
## [1] 2

If a name cannot be resolved in the “local scope”, R will look one level up in the code.

x <- 1
y <- 2
foo <- function() {
  x <- 3
  c(x, y)
}
foo()
## [1] 3 2

The scoping rule determines which y is being referred to. The value associated with that y does depend on the execution context. In the below example, the name is resolved to the y in global scope, whose value changes between the two invocations of foo.

x <- 1
y <- 2
foo <- function() {
  x <- 3
  c(x, y)
}
foo()
## [1] 3 2
x <- 20
y <- 10
foo()
## [1]  3 10

To see what is meant by “one level up”, the following snippet may be helpful.

x <- 1
foo <- function() {
  x <- 2
  bar <- function() {
    x
  }
  bar()
}
foo()
## [1] 2

Lexical scoping is now very common, and R’s specific scoping rules are usually easy to work with.

A particular problem is when a programmer expects a variable to be defined locally within a function but it is not, and instead a global variable with the same name is used.

Curly braces { and } do not define a new scope in R.

Pass by value semantics

In R one can think of arguments to functions as if they are passed by value. That is, when a function is called with an argument x it is as if x is copied and then passed to the function. So any modifications of x in the function body do not affect the variable x from the caller’s perspective.

a <- c(1, 2)
foo <- function(x) {
  x[1] <- 3
  x
}
foo(a)
## [1] 3 2
a
## [1] 1 2

This is also true for assignments. If a variable y is assigned to take the same value that x takes, modifications of y do not affect x.

x <- c(1, 2)
y <- x

y[1] <- 3
y
## [1] 3 2
x
## [1] 1 2

In the above, we have said that it is as if the values are copied. There are performance enhancements to make this less computationally costly in some situations, and R actually uses a copy on modify strategy in the background. This is not necessary to understand to write working code, but knowledge of how R works internally can be used to improve performance.

Iteration: while and for loops

Iteration allows a program to execute a block of code many times. Iteration (or recursion) greatly increases the expressive power of a programming language.

One way to do this is using a while loop, which executes a block of code until some condition is met.

x <- 0
while (x < 3) {
  x <- x + 1
  print(x)
}
## [1] 1
## [1] 2
## [1] 3

Another way is to use a for loop, where one executes a block of code for each element in a vector.

vec <- c(1,4,9)
for (x in vec) {
  sqrt.x <- sqrt(x)
  print(sqrt.x)
}
## [1] 1
## [1] 2
## [1] 3

It is quite common for the vector in a for loop to be defined as 1:n for some positive integer n.

vec <- c(1,4,9)
n <- length(vec)
for (i in 1:n) {
  print(sqrt(vec[i]))
}
## [1] 1
## [1] 2
## [1] 3

Some people prefer to use loop replacements such as apply and lapply. We will consider this in more detail later.

When using loops, the commands break and next are useful. The former immediately exits the loop, while the latter immediately starts the next iteration of the loop.

x <- 1
while (TRUE) {
  x <- x+1
  if (x >= 5) {
    break
  }
}
x
## [1] 5
for (i in 1:3) {
  if (i == 2) {
    next
  }
  print(i)
}
## [1] 1
## [1] 3

Recursion

In some cases, defining a function recursively is natural, and R supports recursive functions. A classic example is a recursive function for computing a Fibonacci number.

fib <- function(n) {
  stopifnot(n %% 1 == 0 && n >= 0)
  if (n == 0 || n == 1) {
    return(n)
  }
  return(fib(n-2) + fib(n-1))
}

# compute and store the first 5 Fibonacci numbers
fib.first5 <- rep(0, 5)
for (i in 1:5) {
  fib.first5[i] <- fib(i)
}
fib.first5
## [1] 1 1 2 3 5

The recursive function/algorithm fib is very poor from a performance perspective. It is good to think about why, and how one might overcome the problem.

Errors, warnings and messages

You might have noticed the use of the stopifnot function in fib, which is used there to check that the input is a non-negative integer. It will raise an error if this check fails. Checks like this are useful for ensuring that functions are used only in appropriate situations.

In general, an error can be thrown using the stop command, which will stop execution and print the error message, unless the error is caught.

stop("informative error message")
## Error: informative error message

A warning can be printed using the warning command, and a message can be printed using the message command. Neither warnings and messages stop execution, and the difference between them is primarily semantic: warnings are often used to indicate something unexpected has occurred, but the function was probably able to deal with it appropriately. The messages from stop, warning and message are sent to standard error and not standard out, in contrast to print.

warning("informative warning message")
## Warning: informative warning message
message("informative message")
## informative message

Errors, warnings and messages are collectively known as conditions.

One can catch a condition using tryCatch. error, warning, message and finally are named arguments, the first three being functions that are called when the corresponding condition is raised. finally is just a block of code.

tryCatch({
  warning("be careful!")
}, error = function(e) {
  paste("there was an error:", e$message)
}, warning = function(w) {
  paste("there was a warning:", w$message)
}, message = function(m) {
  paste("there was a message:", m$message)
}, finally = {
  print("finished.")
})
## [1] "finished."
## [1] "there was a warning: be careful!"

If all you want to do is display when there is an error you can use try, which allows execution to continue and displays the error message.

try(stop("something is wrong"))
## Error in try(stop("something is wrong")) : something is wrong

Special values

R has some special values, including NULL, NA, Inf and NaN.

NULL is returned by functions with no return value, and can be thought of us an empty vector.

x <- NULL
x <- c(x, 1)
x <- c(x, 2)
x
## [1] 1 2

NA is “Not Available” and is used for missing values. NA is propagated where appropriate.

x <- c(1, NA)
y <- c(3, 4)
x + y
## [1]  4 NA

Inf represents positive infinity. -Inf represents minus infinity.

NaN is “Not a Number” and is the value of, e.g., 0/0, Inf - Inf or Inf/Inf.