Data
This page introduces a few basic ways to deal with data in R.
Creating synthetic data
R has many functions for simulating random variates from common distributions, as well as evaluating their density/mass functions, distribution functions and quantile functions. For example, the Uniform(0,1) distribution is served by the functions
runif(n)
: simulates n random variates,
dunif(x)
: evaluates the density at x,
punif(x)
: evaluates the distribution function at x,
qunif(u)
: evaluates the quantile function at u.
Distributions supported in this way can be found by typing ?distributions in the R console.
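The four functions are mutually consistent, which gives a quick sanity check: for a continuous distribution the quantile function inverts the distribution function, and the density is zero outside the support. A minimal illustration:

```r
# qunif inverts punif for Uniform(0,1)
punif(qunif(0.3))          # 0.3

# the Uniform(0,1) density is 1 on [0,1] and 0 outside
dunif(c(-0.5, 0.5, 1.5))   # 0 1 0

# set a seed first if reproducible simulations are needed
set.seed(1)
runif(3)
```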
We can simulate 2 sets of 100 standard normal random variates as follows.
n <- 100
x1s <- rnorm(n)
x2s <- rnorm(n)
Let us imagine that these are predictors in a linear regression model. We can simulate synthetic responses, and plot them alongside x1s and x2s.
beta.true <- c(3.2, 0.5, -0.2)
sigmaSq.true <- 2
ys <- beta.true[1] + beta.true[2]*x1s + beta.true[3]*x2s + rnorm(n, mean=0, sd=sqrt(sigmaSq.true))
plot(ys, pch=20, ylim=range(c(ys, x1s, x2s)))
points(x1s, pch=20, col="red")
points(x2s, pch=20, col="blue")
Creating a data frame
A fundamental data structure in R is the data frame. We can create a data frame to hold our synthetic data, and then print the first 10 elements.
my.data <- data.frame(x1=x1s, x2=x2s, y=ys)
head(my.data, 10)
## x1 x2 y
## 1 0.5855288 0.2239254 1.416963
## 2 0.7094660 -1.1562233 2.896070
## 3 -0.1093033 0.4224185 3.405256
## 4 -0.4534972 -1.3247553 4.734953
## 5 0.6058875 0.1410843 4.650432
## 6 -1.8179560 -0.5360480 2.547024
## 7 0.6300986 -0.3116061 1.114216
## 8 -0.2761841 1.5561096 3.663203
## 9 -0.2841597 -0.4480333 3.284853
## 10 -0.9193220 0.3211235 2.567596
Each row in a data frame corresponds to an observation of each of the relevant variables, in this case an observation of x1, x2 and y.
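Individual columns and rows can be extracted by name or by index. A few illustrative accesses, using the my.data frame created above:

```r
head(my.data$x1)      # the x1 column, as a numeric vector
my.data[1, ]          # the first observation, as a one-row data frame
head(my.data[, "y"])  # the y column, selected by name
nrow(my.data)         # the number of observations
```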
Saving and loading CSV data
A simple way to store small datasets is using a comma-separated values (CSV) format. In R, we can save our my.data dataset using the write.csv command.
write.csv(my.data, file="synthetic-data.csv", row.names=FALSE)
The filename is relative to the current working directory, which can be printed using the getwd command.
getwd()
## [1] "/home/runner/work/sc1/sc1/content/intro-r"
To change the working directory, one can use the setwd command.
To load data from a CSV file, one can use the read.csv command.
my.data <- read.csv(file="synthetic-data.csv")
head(my.data)
## x1 x2 y
## 1 0.5855288 0.2239254 1.416963
## 2 0.7094660 -1.1562233 2.896070
## 3 -0.1093033 0.4224185 3.405256
## 4 -0.4534972 -1.3247553 4.734953
## 5 0.6058875 0.1410843 4.650432
## 6 -1.8179560 -0.5360480 2.547024
Using a CSV file has the benefit that the data is in a standard format that can be easily parsed in any programming language that can read from a text file. It can also be viewed in a text editor or spreadsheet program.
Saving and loading in R’s data format
R has support for saving and loading using an R-specific binary format. In combination with compression (the default), this can substantially reduce large file sizes and speed up the saving/loading process. R data files are usually given the “.Rdata” extension.
save(my.data, file="synthetic-data.Rdata")
Any variable or R function can be saved, not just data frames. Multiple objects can be saved using a command such as
save(object1, object2, object3, file="filename.Rdata")
The save.image
command saves all of the variables in the workspace.
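For example (the filename here is arbitrary):

```r
# save every object in the current workspace to a single file
save.image(file="my-workspace.Rdata")
```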
Unlike write.csv and read.csv, the name of a saved variable is associated with the saved value. Loading the .Rdata file associates the same variable name with the same data.
load(file="synthetic-data.Rdata")
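Note that load also returns, invisibly, a character vector naming the objects it restored, which is useful for checking what a file contains. A small sketch using a temporary file (and the my.data frame from above):

```r
tmp.file <- tempfile(fileext=".Rdata")
save(my.data, file=tmp.file)

# load() restores my.data and returns the names of the restored objects
loaded.names <- load(file=tmp.file)
loaded.names   # "my.data"
```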
Performing a simple linear regression
With our loaded my.data, we can use some of R's built-in statistical methods to try to estimate a linear model for y given x1 and x2.
model <- lm(y ~ x1 + x2, my.data)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = my.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9243 -0.8558 0.0671 0.9305 3.6348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1507 0.1353 23.288 < 2e-16 ***
## x1 0.4641 0.1197 3.876 0.000193 ***
## x2 -0.3610 0.1320 -2.735 0.007408 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.321 on 97 degrees of freedom
## Multiple R-squared: 0.1746, Adjusted R-squared: 0.1576
## F-statistic: 10.26 on 2 and 97 DF, p-value: 9.069e-05
The linear model says that data point y[i] is a realization of a normal random variable
\[Y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \sigma Z^{(i)},\]
where \(Z^{(1)},\ldots,Z^{(n)}\) is a sequence of independent standard normal random variables.
In particular, the intercept \(\beta_0\) is included by default.
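If a model without an intercept is wanted, the intercept can be removed by adding 0 (or equivalently -1) to the formula. A quick sketch:

```r
# fit the same model but with no intercept term
model.no.intercept <- lm(y ~ 0 + x1 + x2, my.data)
names(model.no.intercept$coefficients)   # "x1" "x2"
```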
We can compare the “true” values of the coefficients with the fitted values.
beta.true
## [1] 3.2 0.5 -0.2
model$coefficients
## (Intercept) x1 x2
## 3.1507283 0.4641199 -0.3610236
We can similarly compare the “true” variance of the errors with the unbiased estimate.
sigmaSq.true
## [1] 2
summary(model)$sigma^2
## [1] 1.744297
Computing the estimates directly
The estimates obtained in the linear regression example are not difficult to compute directly; the lm function is nevertheless useful because it is easy to use and has been tested extensively.
First, we create the design matrix.
X <- cbind(rep(1, n), my.data$x1, my.data$x2)
We can check that this is the same as the design matrix used by lm, via the model.matrix function.
all(X == model.matrix(model))
## [1] TRUE
The maximum likelihood estimate of the coefficient vector \(\beta\) is
\[\hat{\beta}_{\rm ML} = (X^{\rm T}X)^{-1}X^{\rm T}y,\]
which is straightforward to compute.
beta.hat <- solve(t(X)%*%X, t(X)%*%my.data$y)
beta.hat
## [,1]
## [1,] 3.1507283
## [2,] 0.4641199
## [3,] -0.3610236
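Solving the normal equations explicitly can be numerically unstable when \(X^{\rm T}X\) is ill-conditioned; lm itself solves the least-squares problem via a QR decomposition of X. The same estimate can be computed with R's QR routines, using the X and my.data objects above:

```r
# least-squares coefficients via a QR decomposition of X,
# avoiding explicit formation of t(X) %*% X
beta.hat.qr <- qr.coef(qr(X), my.data$y)
beta.hat.qr
```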
A prediction of \(y\) given \(x = (1,x_1,x_2)\) is
\[f(x) := \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2,\]
and the residual sum of squares is
\[{\rm RSS} = \sum_{i=1}^n (y^{(i)} - f(x^{(i)}))^2.\]
The maximum likelihood estimate of \(\sigma^2\) is
\[\hat{\sigma}^2_{\rm ML} = \frac{1}{n} {\rm RSS}.\]
The ML estimate \(\hat{\sigma}^2_{\rm ML}\) is not what was reported above. Instead, what summary(model)$sigma^2 reports is the unbiased estimate
\[\hat{\sigma}^2 = \frac{1}{n-3} {\rm RSS}.\]
RSS <- sum((my.data$y - X%*%beta.hat)^2)
sigmaSq.MLE <- RSS / n
sigmaSq.hat <- RSS / (n-3)
sigmaSq.MLE
## [1] 1.691969
sigmaSq.hat
## [1] 1.744297
The estimator \(\hat{\sigma}^2\) is unbiased because \({\rm RSS}/\sigma^2\) has a \(\chi^2_{n-3}\) distribution, whose mean is \(n-3\).
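This claim can be checked by Monte Carlo: simulating many datasets from the model and computing \({\rm RSS}/\sigma^2\) for each, the sample mean should be close to \(n-3 = 97\). The seed, coefficients and design below are regenerated purely for illustration:

```r
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))   # fixed design with an intercept column
beta <- c(3.2, 0.5, -0.2)
sigmaSq <- 2

rss.over.sigmaSq <- replicate(1000, {
  y <- X %*% beta + rnorm(n, sd=sqrt(sigmaSq))
  beta.hat <- solve(t(X) %*% X, t(X) %*% y)
  sum((y - X %*% beta.hat)^2) / sigmaSq
})

mean(rss.over.sigmaSq)   # close to n - 3 = 97, the mean of a chi-squared_{97}
```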