Data
This page introduces a few basic ways to deal with data in R.
Creating synthetic data
R has many functions for simulating random variates from common distributions, as well as evaluating their density/mass functions, distribution functions and quantile functions. For example, the Uniform(0,1) distribution is served by the functions
runif(n)
: simulates n random variates,
dunif(x)
: evaluates the density at x,
punif(x)
: evaluates the distribution function at x,
qunif(u)
: evaluates the quantile function at u.
Distributions supported in this way can be found by typing ?distributions in the R console.
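The four functions are mutually consistent, which gives a quick sanity check: for a continuous distribution the quantile function inverts the distribution function, and the density is zero outside the support. A minimal illustration:

```r
# qunif inverts punif for Uniform(0,1)
punif(qunif(0.3))          # 0.3

# the Uniform(0,1) density is 1 on [0,1] and 0 outside
dunif(c(-0.5, 0.5, 1.5))   # 0 1 0

# set a seed first if reproducible simulations are needed
set.seed(1)
runif(3)
```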
We can simulate 2 sets of 100 standard normal random variates as follows.
n <- 100
x1s <- rnorm(n)
x2s <- rnorm(n)
Let us imagine that these are predictors in a linear regression model. We can simulate synthetic responses, and plot them alongside x1s and x2s.
beta.true <- c(3.2, 0.5, -0.2)
sigmaSq.true <- 2
ys <- beta.true[1] + beta.true[2]*x1s + beta.true[3]*x2s + rnorm(n, mean=0, sd=sqrt(sigmaSq.true))
plot(ys, pch=20, ylim=range(c(ys, x1s, x2s)))
points(x1s, pch=20, col="red")
points(x2s, pch=20, col="blue")
Creating a data frame
A fundamental data structure in R is the data frame. We can create a data frame to hold our synthetic data, and then print the first 10 elements.
my.data <- data.frame(x1=x1s, x2=x2s, y=ys)
head(my.data, 10)
## x1 x2 y
## 1 0.5855288 0.2239254 1.416963
## 2 0.7094660 -1.1562233 2.896070
## 3 -0.1093033 0.4224185 3.405256
## 4 -0.4534972 -1.3247553 4.734953
## 5 0.6058875 0.1410843 4.650432
## 6 -1.8179560 -0.5360480 2.547024
## 7 0.6300986 -0.3116061 1.114216
## 8 -0.2761841 1.5561096 3.663203
## 9 -0.2841597 -0.4480333 3.284853
## 10 -0.9193220 0.3211235 2.567596
Each row in a data frame corresponds to an observation of each of the relevant variables, in this case an observation of x1, x2 and y.
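Individual columns and rows can be extracted by name or by index. A few illustrative accesses, using the my.data frame created above:

```r
head(my.data$x1)      # the x1 column, as a numeric vector
my.data[1, ]          # the first observation, as a one-row data frame
head(my.data[, "y"])  # the y column, selected by name
nrow(my.data)         # the number of observations
```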
Saving and loading CSV data
A simple way to store small datasets is using a comma-separated values (CSV) format. In R, we can save our my.data dataset using the write.csv command.
write.csv(my.data, file="synthetic-data.csv", row.names=FALSE)
The filename is relative to the current working directory, which can be printed using the getwd command.
getwd()
## [1] "/home/runner/work/sc1/sc1/content/intro-r"
To change the working directory, one can use the setwd command.
To load data from a CSV file, one can use the read.csv command.
my.data <- read.csv(file="synthetic-data.csv")
head(my.data)
## x1 x2 y
## 1 0.5855288 0.2239254 1.416963
## 2 0.7094660 -1.1562233 2.896070
## 3 -0.1093033 0.4224185 3.405256
## 4 -0.4534972 -1.3247553 4.734953
## 5 0.6058875 0.1410843 4.650432
## 6 -1.8179560 -0.5360480 2.547024
Using a CSV file has the benefit that the data is in a standard format that can be easily parsed in any programming language that can read from a text file. It can also be viewed in a text editor or spreadsheet program.
Saving and loading in R’s data format
R has support for saving and loading using an R-specific binary format. In combination with compression (the default), this can substantially reduce large file sizes and speed up the saving/loading process. R data files are usually given the “.Rdata” extension.
save(my.data, file="synthetic-data.Rdata")
Any variable or R function can be saved, not just data frames. Multiple objects can be saved using a command such as
save(object1, object2, object3, file="filename.Rdata")
The save.image
command saves all of the variables in the workspace.
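For example (the filename here is arbitrary):

```r
# save every object in the current workspace to a single file
save.image(file="my-workspace.Rdata")
```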
Unlike write.csv and read.csv, the name of a saved variable is associated with the saved value. Loading the .Rdata file associates the same variable name with the same data.
load(file="synthetic-data.Rdata")
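Note that load also returns, invisibly, a character vector naming the objects it restored, which is useful for checking what a file contains. A small sketch using a temporary file (and the my.data frame from above):

```r
tmp.file <- tempfile(fileext=".Rdata")
save(my.data, file=tmp.file)

# load() restores my.data and returns the names of the restored objects
loaded.names <- load(file=tmp.file)
loaded.names   # "my.data"
```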
Performing a simple linear regression
With our loaded my.data, we can use some of R's built-in statistical methods to try to estimate a linear model for y given x1 and x2.
model <- lm(y ~ x1 + x2, my.data)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = my.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9243 -0.8558 0.0671 0.9305 3.6348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1507 0.1353 23.288 < 2e-16 ***
## x1 0.4641 0.1197 3.876 0.000193 ***
## x2 -0.3610 0.1320 -2.735 0.007408 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.321 on 97 degrees of freedom
## Multiple R-squared: 0.1746, Adjusted R-squared: 0.1576
## F-statistic: 10.26 on 2 and 97 DF, p-value: 9.069e-05
The linear model says that data point y[i] is a realization of a normal random variable
\[Y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \sigma Z^{(i)},\]
where \(Z^{(1)},\ldots,Z^{(n)}\) is a sequence of independent standard normal random variables.
In particular, the intercept \(\beta_0\) is included by default.
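If a model without an intercept is wanted, the intercept can be removed by adding 0 (or equivalently -1) to the formula. A quick sketch:

```r
# fit the same model but with no intercept term
model.no.intercept <- lm(y ~ 0 + x1 + x2, my.data)
names(model.no.intercept$coefficients)   # "x1" "x2"
```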
We can compare the “true” values of the coefficients with the fitted values.
beta.true
## [1] 3.2 0.5 -0.2
model$coefficients
## (Intercept) x1 x2
## 3.1507283 0.4641199 -0.3610236
We can similarly compare the “true” variance of the errors with the unbiased estimate.
sigmaSq.true
## [1] 2
summary(model)$sigma^2
## [1] 1.744297
Computing the estimates directly
The estimates obtained in the linear regression example are not difficult to compute directly; the lm function is nevertheless useful because it is easy to use and has been tested extensively.
First, we create the design matrix.
X <- cbind(rep(1, n), my.data$x1, my.data$x2)
We can check that this is the same as the design matrix used by lm, via the model.matrix function.
all(X == model.matrix(model))
## [1] TRUE
The maximum likelihood estimate of the coefficient vector \(\beta\) is
\[\hat{\beta}_{\rm ML} = (X^{\rm T}X)^{-1}X^{\rm T}y,\]
which is straightforward to compute.
beta.hat <- solve(t(X)%*%X, t(X)%*%my.data$y)
beta.hat
## [,1]
## [1,] 3.1507283
## [2,] 0.4641199
## [3,] -0.3610236
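Solving the normal equations explicitly can be numerically unstable when \(X^{\rm T}X\) is ill-conditioned; lm itself solves the least-squares problem via a QR decomposition of X. The same estimate can be computed with R's QR routines, using the X and my.data objects above:

```r
# least-squares coefficients via a QR decomposition of X,
# avoiding explicit formation of t(X) %*% X
beta.hat.qr <- qr.coef(qr(X), my.data$y)
beta.hat.qr
```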
A prediction of \(y\) given \(x = (1,x_1,x_2)\) is
\[f(x) := \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2,\]
and the residual sum of squares is
\[{\rm RSS} = \sum_{i=1}^n (y^{(i)} - f(x^{(i)}))^2.\]
The maximum likelihood estimate of \(\sigma^2\) is
\[\hat{\sigma}^2_{\rm ML} = \frac{1}{n} {\rm RSS}.\]
The ML estimate \(\hat{\sigma}^2_{\rm ML}\) is not what was reported above. Instead, what summary(model)$sigma^2 reports is the unbiased estimate
\[\hat{\sigma}^2 = \frac{1}{n-3} {\rm RSS}.\]
RSS <- sum((my.data$y - X%*%beta.hat)^2)
sigmaSq.MLE <- RSS / n
sigmaSq.hat <- RSS / (n-3)
sigmaSq.MLE
## [1] 1.691969
sigmaSq.hat
## [1] 1.744297
The estimator \(\hat{\sigma}^2\) is unbiased because \({\rm RSS}/\sigma^2\) has a \(\chi^2_{n-3}\) distribution, whose mean is \(n-3\).
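This claim can be checked by Monte Carlo: simulating many datasets from the model and computing \({\rm RSS}/\sigma^2\) for each, the sample mean should be close to \(n-3 = 97\). The seed, coefficients and design below are regenerated purely for illustration:

```r
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))   # fixed design with an intercept column
beta <- c(3.2, 0.5, -0.2)
sigmaSq <- 2

rss.over.sigmaSq <- replicate(1000, {
  y <- X %*% beta + rnorm(n, sd=sqrt(sigmaSq))
  beta.hat <- solve(t(X) %*% X, t(X) %*% y)
  sum((y - X %*% beta.hat)^2) / sigmaSq
})

mean(rss.over.sigmaSq)   # close to n - 3 = 97, the mean of a chi-squared_{97}
```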