SC1

Statistical Computing 1

4. Data transformation with dplyr


The previous two sections focussed on data visualization using ggplot2. ggplot2 assumes that your data is stored in a data.frame, so here we explain how to transform your data.frame to get it in the right format for plotting and modelling. In particular, the following section focuses on the dplyr package, which provides some convenient tools for manipulating data stored in a tabular format (i.e. data.frames). Below we demonstrate some of its most commonly used features, and we refer to the relevant chapter of “R for Data Science” for more details.

Basic dplyr functions

To illustrate the tools provided by dplyr, lets us consider again the UK electricity demand data set we used before:

library(qgam)
data(UKload)
head(UKload)
##     NetDemand       wM   wM_s95       Posan      Dow      Trend NetDemand.48
## 25      38353 6.046364 5.558800 0.001369941   samedi 1293879600        38353
## 73      41192 2.803969 3.230582 0.004109824 dimanche 1293966000        38353
## 121     43442 2.097259 1.858198 0.006849706    lundi 1294052400        41192
## 169     50736 3.444187 2.310408 0.009589588    mardi 1294138800        43442
## 217     50438 5.958674 4.724961 0.012329471 mercredi 1294225200        50736
## 265     50064 4.124248 4.589470 0.015069353    jeudi 1294311600        50438
##     Holy Year                Date
## 25     1 2011 2011-01-01 12:00:00
## 73     0 2011 2011-01-02 12:00:00
## 121    0 2011 2011-01-03 12:00:00
## 169    0 2011 2011-01-04 12:00:00
## 217    0 2011 2011-01-05 12:00:00
## 265    0 2011 2011-01-06 12:00:00

One of the simplest functions provided by dplyr is select, which allows you to select one or more columns of a data.frame. For example:

library(dplyr)
library(magrittr)
library(GGally)
UKload %>% select(NetDemand, wM, Posan) %>% 
           ggpairs()

where GGally::ggpairs is the ggplot2 version of graphics::pairs, so the code above is similar to:

pairs( UKload[ , c("NetDemand", "wM", "Posan")] )

We can also remove variables, for instance:

UKload %<>% select(-Date)

removes the Date variable from the data set (notice that we are over-writing UKload by using the assignment pipe %<>%).

Another useful function is filter, which allows you to subset your data using logical conditions, for instance:

coldMondays <- UKload %>% filter(wM < 5 & Dow == "lundi") %T>% 
                          {print(head(.), digits = 2)}
##   NetDemand  wM wM_s95  Posan   Dow   Trend NetDemand.48 Holy Year
## 1     43442 2.1    1.9 0.0068 lundi 1.3e+09        41192    0 2011
## 2     51282 2.9    1.5 0.0836 lundi 1.3e+09        44845    0 2011
## 3     50163 5.0    4.3 0.1411 lundi 1.3e+09        42856    0 2011
## 4     48403 4.5    4.8 0.1603 lundi 1.3e+09        40709    0 2011
## 5     50950 3.9    1.8 0.9658 lundi 1.3e+09        43257    0 2011
## 6     48653 3.7    1.4 0.0424 lundi 1.3e+09        42238    0 2012

Here we are selecting only the Mondays (“Lundi” in French) where the temperature is below 5 degrees Celsius. We are also printing the first and last few rows of the data by exploiting the %T>% pipe (by the way, why do we need the curly brackets {} around print?). Warning: notice that the following would not work:

coldMondays <- UKload %>% filter(wM < 5 && Dow == "lundi") %T>% 
                          {print(head(.), digits = 2)}
## Error in `filter()`:
## ℹ In argument: `wM < 5 && Dow == "lundi"`.
## Caused by error in `wM < 5 && Dow == "lundi"`:
## ! 'length = 2008' in coercion to 'logical(1)'

because we need to use &, which performs elementwise comparisons, not &&, which is typically used for programming control-flow (e.g. in if-else statements). Same for | and ||.

You might also find useful the arrange function, which allows you to sort the rows of the data using one of more variable:

UKload %<>% arrange(Dow, desc(wM)) %T>% 
            {print(head(.), digits = 2)} %T>% 
            {print(tail(.), digits = 2)}
##   NetDemand wM wM_s95 Posan      Dow   Trend NetDemand.48 Holy Year
## 1     34061 25     21  0.63 dimanche 1.3e+09        34780    0 2012
## 2     33680 24     19  0.51 dimanche 1.4e+09        34566    0 2013
## 3     33811 23     20  0.53 dimanche 1.4e+09        34890    0 2013
## 4     33127 23     18  0.40 dimanche 1.3e+09        34191    1 2012
## 5     34237 22     17  0.75 dimanche 1.3e+09        34608    0 2011
## 6     34039 22     17  0.48 dimanche 1.3e+09        35170    0 2011
##      NetDemand    wM wM_s95 Posan      Dow   Trend NetDemand.48 Holy Year
## 2003     49958  1.46  1.032 0.144 vendredi 1.4e+09        49316    0 2013
## 2004     50771  1.42  1.425 0.075 vendredi 1.3e+09        51130    0 2011
## 2005     50054  1.29  1.873 0.056 vendredi 1.3e+09        50110    0 2011
## 2006     51593  1.06  0.548 0.111 vendredi 1.3e+09        51827    0 2012
## 2007     50188  0.32 -1.520 0.092 vendredi 1.3e+09        50334    0 2012
## 2008     52233 -1.43  0.026 0.048 vendredi 1.4e+09        51702    0 2013

Above we are modifying the UKload dataframe by sorting its rows by day of the week (Dow) and by descending temperature (wM) (see, ?desc). More precisely, arrange takes a set of column names or expressions and uses the first to order the rows, the second to break the ties in the first, the third to break the ties in the second and so on. It is slightly annoying that the Dow factor does not have the order we would have liked (e.g. Monday to Sunday), instead it is alphabetical:

levels(UKload$Dow) # sun, thur, mon, tue, wed, sat, frid
## [1] "dimanche" "jeudi"    "lundi"    "mardi"    "mercredi" "samedi"   "vendredi"

This gives us the opportunity to illustrate the mutate function, which allows us to modify one or more variables of the data.frame:

UKload %<>% mutate(Dow = factor(Dow, levels(Dow)[c(3, 4, 5, 2, 7, 6, 1)])) %>% 
            arrange(Dow) %T>% 
            {print(head(.), digits = 2)} %T>% 
            {print(tail(.), digits = 2)}
##   NetDemand wM wM_s95 Posan   Dow   Trend NetDemand.48 Holy Year
## 1     42185 27     22  0.49 lundi 1.3e+09        34039    0 2011
## 2     40404 24     20  0.54 lundi 1.4e+09        33811    0 2013
## 3     40663 23     19  0.55 lundi 1.4e+09        33097    0 2013
## 4     39546 23     19  0.58 lundi 1.3e+09        32422    0 2011
## 5     40259 23     18  0.41 lundi 1.3e+09        33127    1 2012
## 6     40316 22     18  0.51 lundi 1.3e+09        33166    0 2011
##      NetDemand    wM wM_s95 Posan      Dow   Trend NetDemand.48 Holy Year
## 2003     45234  2.19  -0.23 0.116 dimanche 1.3e+09        45395    0 2012
## 2004     43164  2.18   2.13 0.034 dimanche 1.4e+09        43509    0 2013
## 2005     43395  1.97   1.51 0.045 dimanche 1.5e+09        43380    0 2016
## 2006     45359  1.55   0.39 0.097 dimanche 1.3e+09        47236    0 2012
## 2007     45487  0.42   0.59 0.226 dimanche 1.4e+09        45425    0 2013
## 2008     46566 -0.26  -0.13 0.053 dimanche 1.4e+09        46997    0 2013

Above we are mutating the Dow factor variable by rearranging the order of its levels, in fact:

levels(UKload$Dow) # mon, tue, wed, thur, fri, sat, sun
## [1] "lundi"    "mardi"    "mercredi" "jeudi"    "vendredi" "samedi"   "dimanche"

so that the rows of UKload are now in order we want (Monday to Sunday).

Notice that the functions described so far have a similar set of arguments, for instance:

args(arrange)
## function (.data, ..., .by_group = FALSE) 
## NULL
args(select)
## function (.data, ...) 
## NULL
args(mutate)
## function (.data, ...) 
## NULL

Hence, the first argument is always a data.frame and the ... contains a variable number of arguments which determine what needs to be done with the data (e.g. in select(UKload, wM, Posan) both wM and Posan end up in the ...). The fact that the first argument is a data.frame is important, because it makes so that pipes work smoothly with dplyr.

Grouping and summarizing data.frames

The dplyr package offers also some very convenient tools for grouping and summarizing data frames. To illustrate these, let us load a fresh version of our electricity demand data:

data(UKload)

The summarise function allows you to reduce a data frame to a set of scalar variables, for instance:

UKload %>% summarise(maxDem = max(NetDemand), 
                     meanTemp = mean(wM), 
                     nHoly = sum(Holy == "1"))
##   maxDem meanTemp nHoly
## 1  52596 12.20725    56

Above we are reducing the whole data set to a vector containing the maximum electricity demand, the mean temperature and the total number of holidays, calculated across the whole data.frame. Using summarise in this simple way can be handy sometimes, but we could do the same thing quite easily in base R. It is more interesting to use summarise in conjunction group_by, as the following example illustrates.

Recall that the UKload data set contains daily electricity demand observations (NetDemand), representing the total demand in the UK between 11:30am and 12am (minus embedded production, e.g. from solar panels). Now, suppose that we want to model the total demand during a week, using some of the other variables. To do this it is useful to exploit the group_by function, which takes as input a data.frame and groups its rows using one or more variables. For example:

library(lubridate)
UKweek <- UKload %>% mutate(wk = week(Date)) %>% 
                     group_by(Year, wk)
UKweek
## # A tibble: 2,008 × 11
## # Groups:   Year, wk [291]
##    NetDemand    wM wM_s95   Posan Dow           Trend NetDemand.48 Holy   Year
##        <dbl> <dbl>  <dbl>   <dbl> <fct>         <dbl>        <dbl> <fct> <int>
##  1     38353  6.05   5.56 0.00137 samedi   1293879600        38353 1      2011
##  2     41192  2.80   3.23 0.00411 dimanche 1293966000        38353 0      2011
##  3     43442  2.10   1.86 0.00685 lundi    1294052400        41192 0      2011
##  4     50736  3.44   2.31 0.00959 mardi    1294138800        43442 0      2011
##  5     50438  5.96   4.72 0.0123  mercredi 1294225200        50736 0      2011
##  6     50064  4.12   4.59 0.0151  jeudi    1294311600        50438 0      2011
##  7     51698  3.08   2.45 0.0178  vendredi 1294398000        50064 0      2011
##  8     43988  7.08   7.25 0.0205  samedi   1294484400        51698 0      2011
##  9     43340  4.89   3.53 0.0233  dimanche 1294570800        43988 0      2011
## 10     50645  6.33   3.47 0.0260  lundi    1294657200        43340 0      2011
## # ℹ 1,998 more rows
## # ℹ 2 more variables: Date <dttm>, wk <dbl>

Here we are using mutate and lubridate::week to create a new variable, wk \(\in \{1, 2, \dots, 53\}\), indicating the week to which each observation belongs and then we are grouping the data by year and week. As you can see, the output of group_by is not just a data.frame but a tibble, which is a “tidy” version of a data.frame. Without going too much into details, tibbles inherits the data.frame class:

class(UKweek)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

so we can use them pretty much as if they were just data.frames. Notice that the structure of the UKweek tibble is printed nicely on the console, in fact we can see its size, the class of each variable (e.g. wM is a double <dbl> and Dow is a factor <fct>) and we can also see that it has been grouped by year and week (Groups: Year, wk [1]).

The fact that the UKweek has been grouped by week and year, makes so that if we perform dplyr-based operations to it, these will be applied by group. For instance, here:

UKweek %<>% summarise(TotDemand = sum(NetDemand),
                      tempMax = max(wM), 
                      tempMin = min(wM),
                      Posan = mean(Posan), 
                      nHoly = factor(sum(Holy == "1")))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
UKweek
## # A tibble: 291 × 7
## # Groups:   Year [6]
##     Year    wk TotDemand tempMax tempMin   Posan nHoly
##    <int> <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <fct>
##  1  2011     1    325923    6.05   2.10  0.00959 1    
##  2  2011     2    332107   12.5    4.89  0.0288  0    
##  3  2011     3    329419   12.2    1.29  0.0479  0    
##  4  2011     4    340075    7.63   1.42  0.0671  0    
##  5  2011     5    339415   11.3    0.968 0.0863  0    
##  6  2011     6    319443   11.5    7.05  0.105   0    
##  7  2011     7    320058    8.85   6.01  0.125   0    
##  8  2011     8    323699   13.0    4.46  0.144   0    
##  9  2011     9    321129    8.28   3.07  0.163   0    
## 10  2011    10    314010   10.8    4.83  0.182   0    
## # ℹ 281 more rows

we are calculating the weekly total demand, max and min temperature, the mean position along the year (Posan) and the total number of holidays (ranging from 0 to 7). Now we have a weekly demand data set but, before starting modelling it, we discard the last two weeks and the first week of the year by doing:

UKweek %<>% filter(wk < 52 & wk > 1)

The reason for this is that the Christmas and New Year period is quite special in terms of electricity demand dynamics, hence we prefer to discard it here.

A simple GAM model for weekly total demand might be \(D_w \sim N(\mu_w, \sigma)\) where \[ \mu_w = \mathbb{E}(D_w) = \psi_{N_w} + f_1(T^{max}_w) + f_2(T^{min}_w) + f_3(\text{Posan}_w), \] \(\psi_{N_t}\) is a parametric effect, whose value depends on the number of holidays taking place during the \(w\)-th week, while \(f_1, f_2\) and \(f_3\) are smooth effects (see a previous section for an intro to GAMs, but a deep understanding of GAMs is unnecessary here, as we just want to illustrate the utility of dplyr in day-to-day modelling). We fit the model as follows:

library(mgcViz)
fit <- gamV(TotDemand ~ nHoly + s(tempMax) + s(tempMin) + s(Posan), data = UKweek)

and we then plot the effects:

print(plot(fit, allTerms = TRUE), pages = 1)

We can see a strong heating effect depending on the minimal weekly temperature, but little cooling effect (recall that this is UK data!). As expected consumption is higher in the winter than in the summer, and weeks containing holidays have a lower total consumption.

This section showed how to use the basic tools provided by dplyr to transform data.frames. To appreciate the practical utility of such tools, you are encouraged to try to transform UKload into UKweek using only the tools provided by base R (that is, without using group_by and summarise). You are also encouraged to issue a pull request including your solutions, as it would be cool to see how base R solutions to this problem look, relative to the dplyr-based solution above.