Introduction to dplyr

Alex Sanchez ()
Francesc Carmona ()
GME Department. Universitat de Barcelona

March 2020

Readme

Disclaimer and acknowledgements

The dplyr package (1)

The dplyr package (2)

dplyr Grammar

Some of the key “verbs” provided by the dplyr package are

The dplyr package as a number of its own data types that it takes advantage of.

select() (1)

For the examples in this document we will be using a dataset containing air pollution and temperature data for the city of Chicago in the U.S.

library(dplyr)
chicago <- readRDS("chicago.rds")
dim(chicago)
## [1] 6940    8
str(chicago)
## 'data.frame':    6940 obs. of  8 variables:
##  $ city      : chr  "chic" "chic" "chic" "chic" ...
##  $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
##  $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
##  $ date      : Date, format: "1987-01-01" "1987-01-02" ...
##  $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pm10tmean2: num  34 NA 34.2 47 NA ...
##  $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
##  $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

The select() function can be used to select columns of a data frame that you want to focus on.

Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.

names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
subset <- select(chicago, city:dptp)
head(subset)
##   city tmpd   dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125

select() (2)

Note that the : normally cannot be used with names or strings, but inside the select() function you can use it to specify a range of variable names.

You can also omit variables using the select() function by using the negative sign.

select(chicago, -(city:dptp))

The select() function also allows a special syntax that allows you to specify variable names based on patterns.

subset <- select(chicago, ends_with("2"))
head(subset)
##   pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1         NA   34.00000 4.250000  19.98810
## 2         NA         NA 3.304348  23.19099
## 3         NA   34.16667 3.333333  23.81548
## 4         NA   47.00000 4.375000  30.43452
## 5         NA         NA 4.750000  30.33333
## 6         NA   48.00000 5.833333  25.77233
subset <- select(chicago, matches("^p.*2$"))
head(subset)
##   pm25tmean2 pm10tmean2
## 1         NA   34.00000
## 2         NA         NA
## 3         NA   34.16667
## 4         NA   47.00000
## 5         NA         NA
## 6         NA   48.00000

filter()

chic.f <- filter(chicago, pm25tmean2 > 30)
dim(chic.f)
## [1] 194   8
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
dim(select(chic.f, date, tmpd, pm25tmean2))
## [1] 17  3

arrange()

Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation:

chicago.ord <- arrange(chicago, date)
head(select(chicago.ord, date, pm25tmean2), 3) # the first few rows
##         date pm25tmean2
## 1 1987-01-01         NA
## 2 1987-01-02         NA
## 3 1987-01-03         NA
tail(select(chicago.ord, date, pm25tmean2), 3) # the last few rows
##            date pm25tmean2
## 6938 2005-12-29    7.45000
## 6939 2005-12-30   15.05714
## 6940 2005-12-31   15.00000
chicago.desc <- arrange(chicago, desc(date))   # descending order

rename()

Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.

names(chicago.desc)[1:5]
## [1] "city"       "tmpd"       "dptp"       "date"       "pm25tmean2"
# names of the first five variables
chicago.desc <- rename(chicago.desc, dewpoint = dptp, pm25 = pm25tmean2)
names(chicago.desc)[1:5]
## [1] "city"     "tmpd"     "dewpoint" "date"     "pm25"

mutate()

Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

chicago.desc <- mutate(chicago.desc, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago.desc[,6:9])
##   pm10tmean2  o3tmean2 no2tmean2 pm25detrend
## 1       23.5  2.531250  13.25000   -1.230958
## 2       19.2  3.034420  22.80556   -1.173815
## 3       23.5  6.794837  19.97222   -8.780958
## 4       27.5  3.260417  19.28563    1.519042
## 5       27.0  4.468750  23.50000    7.329042
## 6        8.5 14.041667  16.81944   -7.830958

Here we detrend the PM10 and ozone (O3) variables.

head(transmute(chicago.desc, 
               pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
               o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))

group_by()

First, we can create a year variable using as.POSIXlt().

chicago.desc <- mutate(chicago.desc, year = as.POSIXlt(date)$year + 1900)

Now we can create a separate data frame that splits the original data frame by year.

years <- group_by(chicago.desc, year)

Finally, we compute summary statistics for each year in the data frame with the summarize() function.

summarize(years, pm25 = mean(pm25, na.rm = TRUE), 
                 o3 = max(o3tmean2, na.rm = TRUE), 
                 no2 = median(no2tmean2, na.rm = TRUE))

summarize() returns a data frame with year as the first column, and then the annual averages of pm25, o3, and no2.

The pipeline operator %>%

third(second(first(x)))
first(x) %>% second %>% third

Take the example that we just did in the last section where we computed the mean of pm25, o3 and no2 by year.

There we had to

That can be done with the following sequence in a single R expression.

mutate(chicago.desc, year = as.POSIXlt(date)$year + 1900) %>%    
       group_by(year) %>% 
       summarize(pm25 = mean(pm25, na.rm = TRUE),
                 o3 = mean(o3tmean2, na.rm = TRUE), 
                 no2 = mean(no2tmean2, na.rm = TRUE))
The keyboard shortcut for %>% is

Tibbles

library(tidyverse)
as_tibble(iris)
tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

Tibbles (2)

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
## # A tibble: 2 × 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

Printing tibbles

nycflights13::flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
nycflights13::flights %>% 
  print(n = 10, width = Inf)
nycflights13::flights %>% 
  View()

Subsetting tibbles

# Extract by name
df$x
df[["x"]]
# Extract by position
df[[1]]
df %>% .$x
df %>% .[["x"]]

Resources