Introduction to dplyr

February 26, 2014 | Alan

dplyr is a new package for R written by Hadley Wickham and Romain Francois, and the successor to Hadley Wickham’s earlier package plyr. It focuses on working with tabular data structures and aims to provide a set of tools for the most common data manipulation tasks, with a consistent API that can be used for in-memory data as well as data stored elsewhere. It also boasts massive performance gains over plyr, with key pieces of code written in C++ via Rcpp.

Learning Outcomes

By reading this post you will learn how to:

  • use the 5 basic data manipulation verbs in dplyr.
  • work with local data using dplyr and vastly outperform plyr.
  • connect to the database in the XAP and work with huge datasets from R.
  • translate from R to SQL using dplyr.
  • use some of dplyr’s two-table verbs.
  • add new columns and filter data using some basic window functions.

Data Manipulation Verbs

dplyr provides 5 basic data manipulation verbs which correspond to common tasks you might want to perform on a table of data.

  • filter() – Filters the rows of data based on given criteria.
  • select() – Selects specific columns of data.
  • arrange() – Orders data by values in a specific column or columns.
  • mutate() – Adds new columns to data.
  • summarise() – Collapses a table of data using aggregate functions.

When combined with the function group_by(), which splits data into groups so that operations can be performed ‘by group’, these 5 verbs can accomplish a wide range of data manipulation tasks.
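As a quick taste of how these pieces fit together, here is a minimal sketch using the built-in mtcars dataset (nothing beyond dplyr itself is needed):

library(dplyr)

# split the cars into groups by cylinder count
by_cyl <- group_by(mtcars, cyl)

# count the cars and average the fuel efficiency within each group
cyl_summary <- summarise(by_cyl, n = n(), avg_mpg = mean(mpg))

# order the groups by average fuel efficiency, highest first
arrange(cyl_summary, desc(avg_mpg))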

A tutorial on how to use each of these functions is given in one of the vignettes of the dplyr package, which can be found online here. I highly recommend reading this (at least up to the ‘Chaining’ section) before continuing with the rest of this post.

If you’ve used plyr before then you probably recognise some of these functions and, in fact, when working with in-memory data, plyr can achieve most of the same functionality as these 5 functions plus group_by(). The advantage of dplyr, aside from the speed gains, is that its functions have a much more consistent interface, which not only makes the package easier to learn but also makes it easier to turn your thoughts on what you would like to do into code that actually accomplishes the task.

Performance

We can show how dplyr works with in-memory data by comparing its performance with plyr on a typical data manipulation task. As many of the functions in these packages share names, we need to be careful when loading both in the same session, since functions from packages loaded later will mask functions of the same name in packages loaded earlier. It is recommended that plyr be loaded first and then dplyr; we then know that dplyr functions will be used by default and that, if we want a plyr function that shares a name with a dplyr function, we must make this explicit using the syntax plyr::function(). If you’ve loaded either of these packages while reading through the vignette linked above then the following code will ensure that the packages are loaded in the correct order in your session (despite any warnings it may give).

try(detach("package:plyr", unload = TRUE), silent = TRUE)
try(detach("package:dplyr", unload = TRUE), silent = TRUE)
library(plyr)
library(dplyr)

The hflights package, which is used in the dplyr vignettes, contains a dataset with information on all flights departing Houston in 2011.

library(hflights)
data(hflights)
nrow(hflights)
## [1] 227496
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

dplyr can work directly with data frames but also provides a convenient wrapper, tbl_df(), which, when printed, shows only the first few rows and as many columns as will fit on screen. There are also wrappers for data tables (tbl_dt()) and remote data sources (tbl()).

tbl_df(hflights)
## Source: local data frame [227,496 x 21]
## 
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
## 5430 2011     1          7         5    1359    1509            AA
## 5431 2011     1          8         6    1355    1454            AA
## 5432 2011     1          9         7    1443    1554            AA
## 5433 2011     1         10         1    1443    1553            AA
## ..    ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Suppose we wanted to find the unique routes travelled on each day along with the number of flights for that route on that day. We might then want to order each day by the number of flights on each route. In plyr we could accomplish this as follows:

plyr_routes <- ddply(hflights, .(Month, DayofMonth, Origin, Dest), plyr::summarise, flights = length(Year))
head(plyr::arrange(plyr_routes, Month, DayofMonth, desc(flights)))
##   Month DayofMonth Origin Dest flights
## 1     1          1    IAH  ORD      14
## 2     1          1    HOU  DAL      12
## 3     1          1    IAH  LAX      12
## 4     1          1    IAH  EWR      11
## 5     1          1    IAH  PHX      11
## 6     1          1    IAH  CLT      10

We can accomplish the same with dplyr by using some of the data manipulation verbs.

(dplyr_routes <- arrange(
                   summarise(
                     group_by(hflights, Month, DayofMonth, Origin, Dest),
                     flights = n()),
                   Month, DayofMonth, desc(flights))
                 )
## Source: local data frame [49,702 x 5]
## Groups: Month, DayofMonth, Origin
## 
##    Month DayofMonth Origin Dest flights
## 1      1          1    IAH  ORD      14
## 2      1          1    HOU  DAL      12
## 3      1          1    IAH  LAX      12
## 4      1          1    IAH  EWR      11
## 5      1          1    IAH  PHX      11
## 6      1          1    IAH  CLT      10
## 7      1          1    IAH  DEN      10
## 8      1          1    IAH  DFW      10
## 9      1          1    IAH  MCI      10
## 10     1          1    IAH  ATL       9
## ..   ...        ...    ...  ...     ...

Alternatively, we can give this a more logical flow by using the chaining operator from dplyr. Currently (at the time of writing) dplyr v0.1.3 is installed in the XAP, which uses %.% as its chaining operator, but dplyr v0.2 imports the more general (not to mention easier to type!) %>% from the magrittr package. %.% will probably be deprecated in a later version of dplyr and I’d rather not get into the habit of using it, so for the moment:

if(!exists("%>%")) `%>%` <- `%.%`

Of course, if this does create a new %>% function it won’t be the more general one from magrittr, but in most cases it will behave in the same way.

%>% can be read as ‘then’ to get an idea of what it is doing. More formally, x %>% f(y) is equivalent to f(x, y). The following achieves exactly the same as the above.

dplyr_routes <- hflights %>%
                  group_by(Month, DayofMonth, Origin, Dest) %>%
                  summarise(flights = n()) %>% 
                  arrange(Month, DayofMonth, desc(flights))
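
As a quick sanity check of the x %>% f(y) = f(x, y) equivalence described above, the chained and nested forms below are the same computation and should be identical:

# the chained form is just a rewriting of the nested call; this should print TRUE
identical(
  hflights %>% select(Dest, DepDelay) %>% head(),
  head(select(hflights, Dest, DepDelay))
)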

If you ran the last few chunks of code then the speed improvement should already be obvious, but we can get some exact timings to quantify it.

system.time(
  plyr::arrange(
    ddply(hflights, .(Month, DayofMonth, Origin, Dest), plyr::summarise, flights = length(Year)),
    Month, DayofMonth, desc(flights))
)
##    user  system elapsed 
##   22.03    0.03   22.12

system.time(
  hflights %>%
    group_by(Month, DayofMonth, Origin, Dest) %>%
    summarise(flights = n()) %>% 
    arrange(Month, DayofMonth, desc(flights))
)
##    user  system elapsed 
##    0.15    0.00    0.15
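
These timings come from a single run; if you want more stable numbers you could repeat each expression a few times, for example with the microbenchmark package (an optional extra, not used for the timings above):

# optional: repeat each expression for more stable timings
# the plyr version is slow, so keep the number of repetitions small
library(microbenchmark)

microbenchmark(
  plyr  = ddply(hflights, .(Month, DayofMonth, Origin, Dest),
                plyr::summarise, flights = length(Year)),
  dplyr = hflights %>%
            group_by(Month, DayofMonth, Origin, Dest) %>%
            summarise(flights = n()),
  times = 3
)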

Connecting to the Database

As well as working with data in memory, dplyr can also work with data stored in databases. This is useful when you want to work with a large dataset that will not fit into memory, where you would normally be forced to work with samples or aggregates. Details of how to connect to a local database are given in the vignettes for dplyr and can be found online here. To show this in action within the XAP we will be working with prescribing data for England from August 2010 to February 2014, which can be downloaded from http://www.hscic.gov.uk/ as individual csv files for each month. These have been uploaded to the XAP and combined into one 66GB table, far larger than the memory available in a single R session within the platform.

In order to connect to the PostgreSQL database from the R Console we need to define the correct src, which requires values for each of the arguments of the src_postgres() function: dbname, host, port, user and password. For convenience we can use the supplied xap.dplyr() function to define the appropriate src without having to look up these arguments. Using this we can create a tbl from the database by supplying a SQL query, as follows.

presc <- tbl(xap.dplyr(), sql("SELECT * FROM aridhia_presc_eng_sandbox.prescriptions"))

We need to specify the schema here. If you haven’t created your own schemas then all your data tables will be in the default schema for the workspace, which is the name of the workspace suffixed with ‘_sandbox’.
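For reference, outside the XAP the same src could be built directly with src_postgres(); the connection details below are placeholders rather than real credentials:

# hypothetical connection details -- substitute your own database's values
# xap.dplyr() simply fills these in for the workspace database
my_src <- src_postgres(dbname   = "my_database",
                       host     = "localhost",
                       port     = 5432,
                       user     = "analyst",
                       password = "secret")

# equivalent to the tbl defined above, just with the details spelled out
tbl(my_src, sql("SELECT * FROM my_schema.prescriptions"))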

As with a tbl stored in memory, printing will only return the first 10 rows.

presc
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From: <derived table> [?? x 10]
## 
##    sha pct practice        bnf_code                                 bnf_name
## 1  Q64 11E   L81025 0106040M0AAAAAA Macrogol_Co Oral Pdr Sach S/F 
## 2  Q64 11E   L81025 0201010F0AAADAD Digoxin_Tab 62.5mcg 
## 3  Q64 11E   L81025 0204000R0AAAHAH Propranolol HCl_Tab 10mg 
## 4  Q64 11E   L81025 0206010K0BNACAE Isib 60 XL_Tab 60mg 
## 5  Q64 11E   L81025 0206020C0AAAXAX Diltiazem HCl_Cap 200mg M/R 
## 6  Q64 11E   L81025 0206020C0BIABAV Slozem 180_Cap 180mg 
## 7  Q64 11E   L81025 0206020L0AAAAAA Lercanidipine HCl_Tab 10mg 
## 8  Q64 11E   L81025 0301020Q0AAAAAA Tiotropium_Pdr For Inh Cap 18mcg + Dev 
## 9  Q64 11E   L81025 0302000K0AAALAL Budesonide/Formoterol Inh B/A 100/6(120D
## 10 Q64 11E   L81027 21010900620     Microlet Lancets 0.5mm/28 Gauge 
## .. ... ...      ...             ...                                      ...
## Variables not shown: items (chr), nic (chr), act_cost (chr), quantity (chr),
##  period (int)

Working with Remote Data

The 5 data manipulation verbs introduced earlier will also work with remote data sources.

presc2 <- filter(presc, period > 201300)
presc3 <- mutate(presc2, percentage_discount = 100 * (as.numeric(nic) - as.numeric(act_cost))/as.numeric(nic))
presc4 <- select(presc3, practice, bnf_code, nic, percentage_discount)
presc4
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From:  [?? x 4]
## Filter: period > 201300 
##  
##    practice        bnf_code         nic percentage_discount
## 1    Y04405 130201000BBCJA8 00000003.97            7.304786
## 2    Y04405 1304000V0CGABCH 00000009.96            7.429719
## 3    E83654 0204000E0AAAGAG 00000005.74            6.794425
## 4    E83654 0204000T0AAABAB 00000015.21            6.837607
## 5    E83654 0205051L0AAABAB 00000012.10            7.190083
## 6    E83654 0205051R0AAADAD 00000035.84            5.831473
## 7    E83654 0301020Q0AAABAB 00000402.00            7.649254
## 8    E83654 0302000C0BPABBF 00000014.84            7.547170
## 9    E83654 0402010ADAAAAAA 00000288.12            7.666944
## 10   E84006 0703022M0BBABAB 00000006.01            7.653910
## ..      ...             ...         ...                 ...

In contrast to the tbl outputs for in-memory data, these outputs don’t contain information on the number of rows in the data. This is because, in order to save time when printing tbls, queries are only run as far as is necessary to obtain the first 10 rows. If we want to know the number of rows we can always use summarise().

summarise(presc, count = n())
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From:  [?? x 1]
##  
##        count
## 1  430081606
## ..       ...

Clearly we want to avoid executing queries over 430 million rows as much as possible! Fortunately, since one of the aims of dplyr is to allow you to work interactively with datasets of this magnitude, dplyr does its best to help with this. For example, in the code chunk where we defined presc4, the first three lines don’t interact with the data at all. Instead, the input R code is translated into a SQL query which is only executed when we actually ask to see presc4. We can see what this query is by asking for the query component of presc4.

presc4$query
## <Query> SELECT "practice", "bnf_code", "nic", 100.0 * (CAST("nic" AS NUMERIC) - CAST("act_cost" AS NUMERIC)) / CAST("nic" AS NUMERIC) AS "percentage_discount"
## FROM (SELECT * FROM aridhia_presc_eng_sandbox.prescriptions) AS "_W1"
## WHERE "period" > 201300.0
## <PostgreSQLConnection:(20173,0)>

dplyr tries to hide the fact that you are working with remote data as much as possible but, once you see how it works ‘under the hood’, it becomes obvious that you cannot expect arbitrary pieces of R code to be handled correctly. The translate_sql() function can be used to see how a particular piece of R code will be translated to SQL. Along with the 5 basic data manipulation verbs and the group_by() function, dplyr knows how to translate at least the following R code correctly:

  • basic math operators: +, -, *, /, %%, ^
  • math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
  • logical comparisons: <, <=, !=, >=, >, ==, %in%
  • boolean operations: &, &&, |, ||, !, xor
  • basic aggregations: mean, sum, min, max, sd, var

Additionally, as we have seen above, type conversions such as as.numeric() are also translated correctly for certain types.
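
You can check these translations for yourself by calling translate_sql() on small expressions; the SQL shown in the comments is indicative and may vary slightly between dplyr versions:

translate_sql(nic + 1)           # "nic" + 1.0
translate_sql(mean(act_cost))    # AVG("act_cost")
translate_sql(as.numeric(nic))   # CAST("nic" AS NUMERIC)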

If there is no known translation for a piece of R code then it will just be copied as is. Notably, this means we can use PostgreSQL’s pattern matching functionality by using ‘LIKE’ as if it were an in-line R function.

prescy <- filter(presc, practice %like% "Y%")
prescy$query
## <Query> SELECT "sha", "pct", "practice", "bnf_code", "bnf_name", "items", "nic", "act_cost", "quantity", "period"
## FROM (select * from aridhia_presc_eng_sandbox.prescriptions) AS "_W1"
## WHERE "practice" LIKE 'Y%'
## <PostgreSQLConnection:(23202,0)>

Antidepressants in England

A large number of data manipulation tasks can be accomplished with what has been covered so far but working through this example will outline some additional functionality dplyr offers.

The pct column in the prescription data we’ve been looking at shows which primary care trust (PCT) a practice belongs to. However, these administrative bodies were abolished at the end of March 2013 and replaced by clinical commissioning groups (CCGs), which cover different areas. In order to conduct some meaningful analysis on the prescription data we would ideally want data on which practice belongs to which CCG, in the same structure as the rest of the prescribing data. We have a table mapping practice to CCG within our workspace, so a join would seem the obvious way to do this:

presc_ccg <- tbl(xap.dplyr(), sql("SELECT * FROM aridhia_presc_eng_sandbox.prescriptions p
JOIN aridhia_presc_eng_sandbox.practice_ccgs c
ON p.practice = c.practice"))

We could edit the SQL query used to define our tbl in the first place, but we can also perform this join from within R.

presc <- tbl(xap.dplyr(), sql("SELECT * FROM aridhia_presc_eng_sandbox.prescriptions"))
ccgs <- tbl(xap.dplyr(), sql("SELECT * FROM aridhia_presc_eng_sandbox.practice_ccgs"))
presc_ccg <- inner_join(presc, ccgs, by = "practice")

dplyr currently supports left, inner, semi and anti joins, but note that there is only a single ‘by’ argument in each of these, so the tables we are joining need to share a column name (although I believe this is set to change).

Both of these methods achieve the same result and, in this case, neither is particularly complicated, but it’s extremely useful to be able to join tables from within R even after performing several other data manipulation steps.
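
The other two-table verbs work in the same way; for example, a semi join would keep only the prescriptions whose practice appears in the CCG table, while an anti join would return those with no match. A quick sketch using the same presc and ccgs tbls as above:

# prescriptions from practices that have a CCG mapping (no columns added from ccgs)
matched   <- semi_join(presc, ccgs, by = "practice")

# prescriptions from practices with no CCG mapping -- handy for spotting gaps in the lookup table
unmatched <- anti_join(presc, ccgs, by = "practice")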

We could now attempt to analyse the prescription of antidepressants month by month in each of these CCGs, for which we’ll need to know which BNF codes correspond to antidepressants. The BNF is the British National Formulary, a pharmaceutical guide, and the first four characters of a BNF code correspond to the chapter and section a drug is found in within the BNF; for example, chapter 4 section 3 covers antidepressants, so the BNF codes for antidepressants all begin ‘0403’. We can now filter the data to include only what we’re interested in and group it in preparation for aggregating.

presc_antidepress <- presc_ccg %>% filter(period > 201303, bnf_code %like% "0403%") %>% group_by(period, ccg_code)

Now, by summarising, we can find how much each CCG spends on antidepressant prescriptions each month.

ccg_antidepress <- presc_antidepress %>% summarise(cost = sum(as.numeric(act_cost)))
ccg_antidepress
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From:  [?? x 3]
## Arrange: 
## Grouped by: period 
##  
##    period ccg_code      cost
## 1  201304      08F  71129.15
## 2  201304      11T  97803.21
## 3  201304      02X 173372.31
## 4  201304      06L 225449.58
## 5  201304      03K  80164.10
## 6  201304      01A 178558.13
## 7  201304      00F 140595.29
## 8  201304      10W  39542.01
## 9  201304      13P 291136.55
## 10 201304      11N 312105.39
## ..    ...      ...       ...

We could now use some window functions to see how the cost of prescribing antidepressants changes from month to month in each CCG. The output of window functions depends upon the grouping and ordering; we no longer want the data to be grouped by period, so we will regroup and arrange in preparation for applying some window functions.

ccg_antidepress <- ccg_antidepress %>% regroup(list(quote(ccg_code))) %>% arrange(period)

We can now use the lag function to find the difference between consecutive rows in the data. Note that lag (and other window functions) will just be translated as-is to PostgreSQL functions of the same name when working with database tbls, so be sure to check the arguments of the PostgreSQL equivalents.

month_change <- mutate(ccg_antidepress, percentage_increase = 100 * (cost - lag(cost)) / cost)
month_change
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From:  [?? x 4]
## Arrange: period 
## Grouped by: ccg_code 
##  
##    period ccg_code     cost percentage_increase
## 1  201304      00C 70816.21                  NA
## 2  201305      00C 67437.79           -5.009684
## 3  201306      00C 55263.21          -22.030172
## 4  201307      00C 57417.78            3.752444
## 5  201308      00C 55662.24           -3.153915
## 6  201309      00C 52683.60           -5.653828
## 7  201310      00C 55769.64            5.533548
## 8  201311      00C 53837.77           -3.588317
## 9  201312      00C 57780.06            6.822925
## 10 201401      00C 51219.34          -12.809068
## ..    ...      ...      ...                 ...

filter() can also be used in conjunction with window functions. For example, we could find the 3 most expensive months for each CCG.

top_months <- ccg_antidepress %>% arrange(desc(cost)) %>% filter(row_number(desc(cost)) <= 3)
top_months
## Source: postgres 8.2.15 [aridhia_presc_eng@192.168.200.55:5432/aridhia_presc_eng]
## From:  [?? x 3]
## Filter: row_number(desc(cost)) <= 3 
## Arrange: period 
## Grouped by: ccg_code 
##  
##    period ccg_code      cost
## 1  201304      00C  70816.21
## 2  201305      00C  67437.79
## 3  201312      00C  57780.06
## 4  201304      00D 195637.34
## 5  201305      00D 190720.93
## 6  201312      00D 172805.95
## 7  201304      00F 140595.29
## 8  201305      00F 139654.76
## 9  201307      00F 125778.67
## 10 201304      00G  78955.72
## ..    ...      ...       ...

Finally, now that our data has been summarised to a more manageable size, we can execute the entire query and bring the data into our R session as a tbl_df, using the collect() function, to continue with more advanced analysis and visualisation.
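
A minimal sketch of that last step, assuming we want the top_months result locally:

# run the full query on the database and pull the result into memory as a tbl_df
top_months_local <- collect(top_months)

# from here on we are back to ordinary in-memory tools
dim(top_months_local)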

