dplyr
Alex Sanchez (asanchez@ub.edu)
Francesc Carmona (fcarmona@ub.edu)
GME Department. Universitat de Barcelona
March 2020
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License http://creativecommons.org/licenses/by-nc-sa/4.0/
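You are free to: share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material).
Under the following conditions: attribution, non-commercial use only, and share-alike (adaptations must be distributed under the same license).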
These slides have been prepared based on multiple sources: websites, blogs, and courses. While it is hard to cite them all, I wish to acknowledge those sources that have been particularly useful.
Base R already provides tools such as the subset() function and the [ and $ operators to extract subsets of data frames.
The dplyr package is designed to mitigate many of the difficulties of manipulating data frames with those tools, and to provide a highly optimized set of routines specifically for dealing with data frames.
The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his plyr package.
The dplyr package does not provide any “new” functionality to R per se, in the sense that everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.
One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames.
The dplyr functions are very fast, as many key operations are coded in C++.
dplyr Grammar
Some of the key “verbs” provided by the dplyr package are
select: return a subset of the columns of a data frame, using a flexible notation
filter: extract a subset of rows from a data frame based on logical conditions
arrange: reorder rows of a data frame
rename: rename variables in a data frame
mutate: add new variables/columns or transform existing variables
summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
%>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
The dplyr package has a number of its own data types that it takes advantage of.
select() (1)
For the examples in this document we will be using a dataset containing air pollution and temperature data for the city of Chicago in the U.S.
## [1] 6940 8
## 'data.frame': 6940 obs. of 8 variables:
## $ city : chr "chic" "chic" "chic" "chic" ...
## $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
## $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
## $ date : Date, format: "1987-01-01" "1987-01-02" ...
## $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
## $ pm10tmean2: num 34 NA 34.2 47 NA ...
## $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
## $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
The select() function can be used to select columns of a data frame that you want to focus on.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.
## [1] "city" "tmpd" "dptp"
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
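The two approaches just described (numerical indices and a range of names) can be sketched as follows. The Chicago data file is not included here, so this sketch uses chicago_toy, a small hypothetical data frame with the same column names; its values are made up.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  city = "chic",
  tmpd = c(31.5, 33.0, 29.0),
  dptp = c(31.5, 29.9, 28.6),
  date = as.Date(c("1987-01-01", "1987-01-02", "1987-01-03")),
  pm25tmean2 = c(NA, 12.0, 35.2),
  pm10tmean2 = c(34.0, NA, 47.0)
)

# First three columns, selected by numerical index
by_index <- select(chicago_toy, 1:3)

# The same three columns, selected by a range of names
by_name <- select(chicago_toy, city:dptp)

names(by_index)
names(by_name)
```

Both calls should return the city, tmpd and dptp columns.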
select() (2)
Note that the : operator normally cannot be used with names or strings, but inside the select() function you can use it to specify a range of variable names.
You can also omit variables using the select() function by using the negative sign.
The select() function also provides a special syntax for specifying variable names based on patterns.
## pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 NA 34.00000 4.250000 19.98810
## 2 NA NA 3.304348 23.19099
## 3 NA 34.16667 3.333333 23.81548
## 4 NA 47.00000 4.375000 30.43452
## 5 NA NA 4.750000 30.33333
## 6 NA 48.00000 5.833333 25.77233
## pm25tmean2 pm10tmean2
## 1 NA 34.00000
## 2 NA NA
## 3 NA 34.16667
## 4 NA 47.00000
## 5 NA NA
## 6 NA 48.00000
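The code that produced the outputs above is not shown. A minimal sketch of the three ideas (dropping columns with a negative sign, and matching names by suffix or prefix) might look like this. The data frame chicago_toy is a made-up stand-in, and the patterns "2" and "pm" are guesses that happen to be consistent with the column names printed above.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  city = "chic",
  tmpd = c(31.5, 33.0),
  dptp = c(31.5, 29.9),
  date = as.Date(c("1987-01-01", "1987-01-02")),
  pm25tmean2 = c(NA, 12.0),
  pm10tmean2 = c(34.0, NA),
  o3tmean2 = c(4.25, 3.30),
  no2tmean2 = c(20.0, 23.2)
)

# Omit the columns from city to dptp with the negative sign
dropped <- select(chicago_toy, -(city:dptp))

# Keep the columns whose names end with "2"
ends2 <- select(chicago_toy, ends_with("2"))

# Keep the columns whose names start with "pm"
pm_cols <- select(chicago_toy, starts_with("pm"))

names(dropped)
names(ends2)
names(pm_cols)
```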
filter()
The filter() function is used to extract subsets of rows from a data frame.
It is similar to the existing subset() function in R but is quite a bit faster.
## [1] 194 8
## [1] 17 3
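The filtering code itself is not shown above. As a sketch under stated assumptions, the following filters a made-up data frame (chicago_toy) first on one logical condition and then on two, keeping three columns in the second case. The thresholds 30 and 80 are arbitrary values chosen for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  city = "chic",
  tmpd = c(31.5, 33.0, 85.0, 82.0, 60.0),
  date = as.Date(c("1987-01-01", "1987-01-02", "2005-07-15",
                   "2005-07-16", "2005-10-01")),
  pm25tmean2 = c(NA, 12.0, 35.2, 31.0, 40.5)
)

# Rows where PM2.5 exceeds 30 (rows with NA in the condition are dropped)
high_pm <- filter(chicago_toy, pm25tmean2 > 30)

# Rows where PM2.5 exceeds 30 and temperature exceeds 80
hot_high_pm <- filter(chicago_toy, pm25tmean2 > 30 & tmpd > 80)
hot_high_pm <- select(hot_high_pm, date, tmpd, pm25tmean2)

dim(high_pm)
dim(hot_high_pm)
```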
arrange()
The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.
The arrange() function simplifies the process of reordering quite a bit.
Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation:
chicago.ord <- arrange(chicago, date)
head(select(chicago.ord, date, pm25tmean2), 3) # the first few rows
## date pm25tmean2
## 1 1987-01-01 NA
## 2 1987-01-02 NA
## 3 1987-01-03 NA
tail(select(chicago.ord, date, pm25tmean2), 3) # the last few rows
## date pm25tmean2
## 6938 2005-12-29 7.45000
## 6939 2005-12-30 15.05714
## 6940 2005-12-31 15.00000
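Rows can also be sorted in descending order with desc(). The chicago.desc object used in the following sections appears to hold the data sorted from newest to oldest, but the code that created it is not shown, so the call below is an assumption about how it was built. The sketch uses a made-up data frame, chicago_toy.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  date = as.Date(c("1987-01-02", "2005-12-31", "1987-01-01")),
  pm25tmean2 = c(NA, 15.0, NA)
)

# Ascending order: the oldest observation comes first
oldest_first <- arrange(chicago_toy, date)

# Descending order: the most recent observation comes first
newest_first <- arrange(chicago_toy, desc(date))

oldest_first
newest_first
```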
rename()
Renaming a variable in a data frame in R is surprisingly hard to do!
The rename() function is designed to make this process easier.
## [1] "city" "tmpd" "dptp" "date" "pm25tmean2"
# names of the first five variables
chicago.desc <- rename(chicago.desc, dewpoint = dptp, pm25 = pm25tmean2)
names(chicago.desc)[1:5]
## [1] "city" "tmpd" "dewpoint" "date" "pm25"
mutate()
The mutate() function exists to compute transformations of variables in a data frame.
When you want to create new variables derived from existing ones, mutate() provides a clean interface for doing that.
Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.
chicago.desc <- mutate(chicago.desc, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago.desc[,6:9])
## pm10tmean2 o3tmean2 no2tmean2 pm25detrend
## 1 23.5 2.531250 13.25000 -1.230958
## 2 19.2 3.034420 22.80556 -1.173815
## 3 23.5 6.794837 19.97222 -8.780958
## 4 27.5 3.260417 19.28563 1.519042
## 5 27.0 4.468750 23.50000 7.329042
## 6 8.5 14.041667 16.81944 -7.830958
There is also the related transmute() function, which does the same thing as mutate() but then drops all non-transformed variables.
Here we detrend the PM10 and ozone (O3) variables.
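The detrending code is not shown, so here is a minimal sketch on a made-up data frame (chicago_toy). The output column names pm10detrend and o3detrend are illustrative choices, not taken from the original.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  city = "chic",
  pm10tmean2 = c(30, 40, 50),
  o3tmean2 = c(2, 4, 6)
)

# Subtract each variable's mean; only the new columns are kept
detrended <- transmute(
  chicago_toy,
  pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
  o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)
)

detrended
```

Note that city, pm10tmean2 and o3tmean2 no longer appear in the result.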
group_by()
The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable.
The general operation here is a combination of splitting a data frame into separate pieces defined by a variable or group of variables (group_by()), and then applying a summary function across those subsets (summarize()).
First, we can create a year variable using as.POSIXlt().
Now we can create a separate data frame that splits the original data frame by year.
Finally, we compute summary statistics for each year in the data frame with the summarize() function.
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
summarize() returns a data frame with year as the first column, followed by the annual mean of pm25, the annual maximum of o3, and the annual median of no2.
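The three steps (create the year variable, group by it, summarize) can be sketched end to end as follows. The data frame chicago_toy is made up for illustration. The + 1900 is needed because as.POSIXlt() stores the year as years since 1900.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  date = as.Date(c("1987-03-01", "1987-09-01", "1988-03-01", "1988-09-01")),
  pm25 = c(10, 20, NA, 30),
  o3tmean2 = c(4, 8, 6, 2),
  no2tmean2 = c(20, 30, 25, 35)
)

# Step 1: create the year variable (as.POSIXlt() counts years from 1900)
chicago_toy <- mutate(chicago_toy, year = as.POSIXlt(date)$year + 1900)

# Step 2: split the data frame by year
years <- group_by(chicago_toy, year)

# Step 3: one row of summary statistics per year
by_year <- summarize(years,
                     pm25 = mean(pm25, na.rm = TRUE),
                     o3 = max(o3tmean2, na.rm = TRUE),
                     no2 = median(no2tmean2, na.rm = TRUE))

by_year
```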
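%>%
The pipeline operator %>% is very handy for stringing together multiple dplyr functions in a sequence of operations.
The %>% operator allows you to string operations in a left-to-right fashion, rather than nesting function calls inside one another.
Take the example from the last section, where we computed summaries of pm25, o3 and no2 by year.
There we had to create a year variable, split the data frame by that variable, and compute summary statistics within each year.
That can be done with the following sequence in a single R expression (here taking the mean of each pollutant).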
mutate(chicago.desc, year = as.POSIXlt(date)$year + 1900) %>%
group_by(year) %>%
summarize(pm25 = mean(pm25, na.rm = TRUE),
o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE))
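To see that the pipeline is only a different way of writing the same computation, this sketch compares a nested call with its piped equivalent on a made-up data frame (chicago_toy, hypothetical values).

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data; the values are made up.
chicago_toy <- data.frame(
  date = as.Date(c("1987-03-01", "1987-09-01", "1988-03-01", "1988-09-01")),
  pm25 = c(10, 20, NA, 30)
)

# Nested form: read from the inside out
nested <- summarize(
  group_by(mutate(chicago_toy, year = as.POSIXlt(date)$year + 1900), year),
  pm25 = mean(pm25, na.rm = TRUE)
)

# Piped form: read from left to right, top to bottom
piped <- chicago_toy %>%
  mutate(year = as.POSIXlt(date)$year + 1900) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE))

all.equal(as.data.frame(nested), as.data.frame(piped))
```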
Functions from the tidyverse packages produce data in the tibble format, an alternative to the classic R data.frame.
A tibble can be created with tibble().
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
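The call that produced this tibble is not shown. The following is an assumption that is consistent with the printed values: a length-one input is recycled, and a column can refer to columns defined before it in the same call.

```r
library(tibble)

# Assumed reconstruction: y is recycled to length 5, z is built from x and y
tb <- tibble(
  x = 1:5,
  y = 1,
  z = x^2 + y
)

tb
```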
If you are already familiar with data.frame(), note that tibble() does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
Another way to create a tibble is with tribble(), short for transposed tibble.
tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas.
## # A tibble: 2 × 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
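A tribble() call matching the two-row tibble above might look like this. The exact call is an assumption, reconstructed from the printed values.

```r
library(tibble)

# Column headings start with ~; each following line is one row of data
tr <- tribble(
  ~x,  ~y, ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

tr
```

Laying the data out row by row keeps small tables readable in source code.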
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns.
The default print behaviour can also be controlled by setting options().
To use $ and [[ in a pipe, you’ll need to use the special placeholder . (a single dot).
Use as.data.frame() to turn a tibble back to a data.frame.
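These points can be sketched on a small tibble. The tibble df and its contents are made up for illustration. The magrittr package is loaded here only to make the %>% operator available.

```r
library(tibble)
library(magrittr)  # provides the %>% operator

# Hypothetical example tibble
df <- tibble(x = 1:3, y = c("a", "b", "c"))

# Print only the first two rows, showing all columns
print(df, n = 2, width = Inf)

# Extract a single column by name
by_dollar <- df$x
by_brackets <- df[["x"]]

# Inside a pipe, the placeholder . stands for the piped-in object
piped_dollar <- df %>% .$x
piped_brackets <- df %>% .[["x"]]

# Convert back to a classic data.frame
old_df <- as.data.frame(df)
class(old_df)
```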