February 26, 2014 | Alan
ggplot2 is a data exploration and visualisation package for R, developed by Hadley Wickham. Its principles were defined in The Grammar of Graphics (Wilkinson, 2005), which described the theoretical division of graphs into semantic components. This approach of handling the elements of a graph separately and building the features up in a series of layers allows for a great deal of versatility and control.
This post covers the following:
- An explanation of the ggplot2 library
- How to define a qplot()
- An understanding of the concept of aesthetics inside the ggplot() object
- An understanding of layers and geoms
- An understanding of applying facets, statistics, and scales to a plot.
We introduce these concepts as we work through an example.
In the following worked example, we are going to use the dataset CO2, which comes pre-packaged with R. The CO2 uptake of six grass plants from Quebec and six grass plants from Mississippi was measured at several levels of ambient CO2 concentration. To assess the effects of temperature, half of the plants of each type were chilled overnight before the experiment was conducted. The dataset contains the following columns:
- Plant, which gives a unique identifier for each plant.
- Type is a factor, with levels Quebec and Mississippi giving the origin of the plant.
- Treatment is a factor, with levels nonchilled and chilled.
- conc is a numeric vector of ambient carbon dioxide concentrations (mL/L).
- uptake is a numeric vector of carbon dioxide uptake rates (umol/m^2 sec).
Note that, for convenience, when we refer to “Quebec” and “Mississippi”, we are referring to the results for the plants from those places, not the places themselves!
In the ggplot2 library, there are two types of plot that can be produced: a qplot() and a ggplot().
We look at both in this blog post.
Setting up the environment
Load the relevant library and dataset. Check that column names align with the Data Source description:
library(ggplot2)
library(plyr)
data(CO2)
names(CO2)
## [1] "Plant"     "Type"      "Treatment" "conc"      "uptake"
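Before plotting, it can help to confirm the shape of the data. The following is a minimal sketch using only base R; the variable names n_rows and design are illustrative and not part of the original analysis.

```r
data(CO2)

# Overall structure: column types and the first few values of each column
str(CO2)

# Number of observations in the dataset
n_rows <- nrow(CO2)

# Cross-tabulate origin against treatment to see how the design is balanced
design <- table(CO2$Type, CO2$Treatment)
design
```

Each combination of Type and Treatment should hold the same number of observations, which is what makes the faceted comparisons later in this post straightforward to read.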
Introducing the qplot function
qplot(), an abbreviation of “quick plot”, is a function belonging to the ggplot2 library that is used for producing simple plots. qplot() is similar to the plot() function in base R, but has slightly more functionality. Like plot(), qplot() takes x and y arguments, but it also accepts a data argument: the data frame that houses the x and y coordinates. It is good practice to include the data argument, for example:
qplot(x, y, data = )
Let’s use qplot() to plot the uptake of CO2 against concentration level for plants across all data:
qplot(CO2$conc, CO2$uptake, data = CO2)
As we can see, there is a large spread of data for each concentration level, but we can split the data further inside the qplot() function. To do this we use the shape and colour arguments to split the data by Treatment and Plant:
qplot(CO2$conc, CO2$uptake, data = CO2, shape = CO2$Treatment, colour = CO2$Plant)
Notice the difference in shape of the coloured points that identify Treatment. The default style of qplot() is a dot plot, but by defining the geom argument of qplot() we can change the style. Applying “point” and “line” to geom draws the points and joins them with lines:
qplot(CO2$conc, CO2$uptake, data = CO2, shape = CO2$Treatment, colour = CO2$Plant, geom = c("point", "line"))
Let’s also look at regional differences (Type) in CO2 uptake. To do this we use the facets argument to compare Quebec with Mississippi:
qplot(CO2$conc, CO2$uptake, data = transform(CO2, fct = CO2$Type), shape = CO2$Treatment, colour = CO2$Plant, geom = c("point", "line"), facets = ~fct)
Based on this plot, we can deduce the following:
- CO2 uptake looks like it has a logarithmic relationship when plotted against increasing concentration levels.
- Quebec has a higher CO2 uptake than Mississippi across all concentration levels.
- The variance in CO2 uptake is greater in Mississippi compared to Quebec.
The main arguments to the qplot() function are summarised in the following syntax:
qplot(x, y, data=, color=, shape=, fill=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
- x is a vector of x coordinates.
- y is a vector of y coordinates.
- data is the data frame housing x and y.
- color, shape, size and fill are used to split data by features – legends are drawn automatically when any of these arguments are called.
- alpha is used to define the transparency of overlapping elements – ranges from 0 (clear) to 1 (opaque).
- geom is used to define the type of plot to draw.
- method and formula are used to define regression lines and the model to use.
- facets splits the data into multiple graphs on one page.
- xlim and ylim are used to define the x and y axis limits.
- xlab, ylab, main and sub are used for labelling purposes – the x axis, y axis, title, and subtitle respectively.
Note that if you get into the habit of using the data argument, you can save yourself some typing, since you no longer need to prefix variables with their source data frame. The last plot can also be generated with this code:
qplot(conc, uptake, data = CO2, shape = Treatment, colour = Plant, geom = c("point", "line"), facets = ~Type)
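The labelling arguments described above can be sketched in the same style. This is an illustrative example rather than part of the original analysis: the object name labelled and the label text are assumptions, and recent ggplot2 releases mark qplot() as deprecated, so a warning may be printed.

```r
library(ggplot2)

# Same data as above, with axis labels and a title supplied through
# the xlab, ylab and main arguments. I() tells qplot() to treat the
# alpha value as a constant rather than a variable to be mapped.
labelled <- qplot(conc, uptake, data = CO2,
                  colour = Type, alpha = I(0.6),
                  xlab = "Ambient CO2 concentration (mL/L)",
                  ylab = "CO2 uptake (umol/m^2 sec)",
                  main = "CO2 uptake by plant origin")
labelled
```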
Introducing the ggplot function
The “gg” in ggplot() is an abbreviation of The Grammar of Graphics, a book by Leland Wilkinson, from whose principles ggplot2 is derived. The Grammar of Graphics describes the method of breaking graphs into elements and building up each element in a series of layers to control visual representation.
ggplot2: elegant graphics for data analysis by Hadley Wickham summarises the Grammar of Graphics nicely:
“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.”
A ggplot is made up of two main components – a ggplot() object and at least one geom layer.
The ggplot() object
The ggplot() object acts as a storage facility for the data. It is here that we define the data frame that houses the x and y coordinates, the x and y coordinate values themselves, and instructions on how to split the data. There are three ways to initialise a ggplot() object:
p <- ggplot()
p <- ggplot(data_frame)
p <- ggplot(data_frame, aes(x, y, ))
With the version of ggplot2 used for this post, displaying the object p generated in the code chunk above would result in Error: No layers in plot. This is because you always need at least one layer for a ggplot.
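Each of the three initialisation styles can be completed into a working plot by adding a layer. The sketch below uses the CO2 data from this post; the object names p1, p2 and p3 are illustrative. All three should draw the same dot plot, and they differ only in where the data frame and the aesthetic mapping are declared.

```r
library(ggplot2)

# 1. Empty object: data and mapping are supplied by the layer itself
p1 <- ggplot() +
  geom_point(data = CO2, aes(x = conc, y = uptake))

# 2. Data in the object, mapping in the layer
p2 <- ggplot(CO2) +
  geom_point(aes(x = conc, y = uptake))

# 3. Data and mapping both in the object; layers inherit them
p3 <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point()
```

The third form is usually the most convenient when every layer shares the same data and mapping, which is the case for most of the examples in this post.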
Mapping aesthetics to data
The aes() aesthetic mapping function lives inside a ggplot() object and is where we specify the set of plot attributes that remain constant throughout the subsequent layers (unless overwritten – more on this later).
We can consider the relationship between the aes() and geoms components as follows:
- The aes() function is the “how” – how data is stored, how data is split.
- geoms is the “what” – what the data looks like. These are geometrical objects stored in subsequent layers.
We use the + operator to construct a plot. By appending layers we can connect the “how” (aesthetics) to the “what” (geometric objects). Adding geometric, scale, facet and statistic layers to a ggplot() object is how we control virtually every visual aspect of the plot from the data contained in the object.
Adding a geometric object layer
A geometric object is used to define the style of the plot. Common geometric objects include:
- geom_point(), which is used to draw a dot plot
- geom_line(), used to draw a line plot
- geom_bar(), used to draw a bar chart.
A single plot can have numerous geom layers, and it is also possible to overlay results from multiple data frames in one plot. The aesthetic mapping that was defined in the ggplot() object can be overwritten inside a geom function (more on this later!).
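As a minimal sketch of overwriting an aesthetic inside a geom, assume we colour the points by Type but want the connecting lines drawn in a single neutral colour. The choice of "grey50" and the grouping by Plant are illustrative assumptions, not part of the original analysis.

```r
library(ggplot2)

p <- ggplot(CO2, aes(conc, uptake, colour = Type)) +
  # inherits x, y and colour from the ggplot() object
  geom_point() +
  # adds a grouping by Plant and replaces the mapped colour with a fixed grey
  geom_line(aes(group = Plant), colour = "grey50")
p
```

The point layer keeps the colour mapping from the object, while the line layer ignores it because a fixed colour is supplied directly to geom_line().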
Let’s apply the material covered so far for ggplot() to a worked example. To start, we:
- create a ggplot() object
- define the data frame and the aesthetic mapping
- add a single layer.
Notice that we don’t need the whole title data_frame_name$column_name for our x and y values, because we are already directing to the CO2 data frame in the object definition:
p <- ggplot(CO2, aes(conc, uptake)) + geom_point()
p
Now, let’s investigate changing the aesthetic map attributes and the effect this has on the appearance of a plot:
- define the data frame of interest
- set the x and y coordinate values inside the aes() function (here y is the log of uptake), along with colouring by Type
p <- ggplot(CO2, aes(x = conc, y = log(uptake), colour = Type)) + geom_point()
p
As we can see, the data has now been split by the column Type, and a colour-coded legend has automatically been generated. Data can be further split by using the shape, size and fill arguments. Offline, you should explore the different outputs from changing attributes of the aes() function and how this changes the content which is displayed.
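One such variation, sketched here as an illustration, maps a second column to the shape aesthetic so that each point carries both its origin and its treatment. The fixed point size of 3 is an arbitrary choice made for legibility.

```r
library(ggplot2)

# colour distinguishes the origin, shape distinguishes the treatment;
# size is set outside aes(), so it is a constant rather than a mapping
p <- ggplot(CO2, aes(x = conc, y = uptake, colour = Type, shape = Treatment)) +
  geom_point(size = 3)
p
```

A legend is generated for each mapped aesthetic, so this plot should show one legend for Type and another for Treatment.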
Appending a facet layer to a ggplot generates the same plot for different subsets of data. Let’s reproduce the last plot in the qplot() example using ggplot(). To do this we:
- Define the ggplot() object
- insert the data frame,
- set the x and y coordinate values in the aes() function,
- define the split by columns Plant and Treatment.
- Append layers, geom_point() and geom_line(), to mirror the previous plot.
- Add a facet_grid() to compare subsets of Type.
p <- ggplot(CO2, aes(conc, uptake, colour = Plant, shape = Treatment)) + geom_point() + geom_line()
Now that we have our structure, we can append a facet_grid() layer to compare subsets of Type:
q <- p + facet_grid(~Type)
q
The advantage of ggplot() over qplot() is its extreme flexibility – ggplot() was designed to handle much more complexity. The “shape” identifier for Treatment in the previous plot requires very good eyesight! Let’s use facet_grid() to split the data further, setting the facet arguments by the columns Type and Treatment:
r <- p + facet_grid(Type ~ Treatment)
r
From this plot we can deduce that for each region, overall, non-chilled plants saw a higher CO2 uptake than chilled. Now let us strengthen our deductions of the data by looking at some cold hard facts – statistics time!
Exploratory data analysis can be done using the base packages in R, the results of which can be added to a ggplot() in the guise of a geom layer. Let’s look at the relationship between concentration of CO2 and uptake of CO2. To do this we remove the aesthetic split by Plant and the geom_line() layer, and use stat_summary() to compute the mean uptake at each concentration level:
p <- ggplot(CO2, aes(conc, uptake)) +
  geom_point() +
  facet_grid(Treatment ~ Type) +
  # fun replaces the older fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, colour = "red", geom = "line")
p
Let’s annotate this plot with the mean values we have overlaid onto it. To do this we must first work out what these mean values are. Use the ddply() function in the plyr package to get a new data frame of the mean values of each group. To reflect how the data is split in the plot, let’s group our data by Type, Treatment and conc:
# group_means <- ddply(CO2, c('Type','Treatment', 'conc'),
#                      function(hi) mean(hi$uptake))
group_means <- ddply(CO2, .(Type, Treatment, conc), summarise, means = mean(uptake))
head(group_means, 10)
##      Type  Treatment conc means
## 1  Quebec nonchilled   95 15.27
## 2  Quebec nonchilled  175 30.03
## 3  Quebec nonchilled  250 37.40
## 4  Quebec nonchilled  350 40.37
## 5  Quebec nonchilled  500 39.60
## 6  Quebec nonchilled  675 41.50
## 7  Quebec nonchilled 1000 43.17
## 8  Quebec    chilled   95 12.87
## 9  Quebec    chilled  175 24.13
## 10 Quebec    chilled  250 34.47
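The same group means can also be computed without plyr. The following is a sketch using aggregate() from base R; the name base_means is illustrative, and the sorting step is only there to match the row order of the ddply() output.

```r
# Mean uptake for every Type / Treatment / conc combination, base R only
base_means <- aggregate(uptake ~ Type + Treatment + conc, data = CO2, FUN = mean)

# Rename the summary column so it matches the ddply() result
names(base_means)[names(base_means) == "uptake"] <- "means"

# Sort by the grouping columns (factors sort by level order, not alphabetically)
base_means <- base_means[order(base_means$Type, base_means$Treatment, base_means$conc), ]
head(base_means, 10)
```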
The means column is the resulting mean of each group. Now use geom_text() to annotate our plot with the values in group_means$means. As mentioned above, multiple data frames can be overlaid onto one plot. For this new layer, set a new aesthetic mapping inside the geom_text() function. The text label, along with the x and y coordinates, must be set. The syntax for this is as follows:
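p <- ggplot(CO2, aes(conc, uptake)) +
  geom_point() +
  facet_grid(Treatment ~ Type) +
  # one stat_summary() layer per geom; fun replaces the older fun.y (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, colour = "red", geom = "line") +
  stat_summary(fun = mean, colour = "red", geom = "point") +
  # labels come from group_means, offset slightly so they sit beside the red points
  geom_text(data = group_means,
            aes(x = conc + 50, y = means - 4, label = round(means, 0)),
            colour = "red", inherit.aes = FALSE, parse = FALSE)
p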
Scales are used in ggplot() to control the axes. A scale covers everything from setting limits, through defining labels, to setting the granularity of the breaks in the data. For example, when setting limits, use:
- scale_x_continuous() to change the x axis limits
- scale_y_continuous() to change the y axis limits
The following code shows how to set y axis limits:
q <- p + scale_y_continuous(limits = c(0, 75))
q
Change the position of the tick marks (breaks) on the x axis:
r <- q + scale_x_continuous(breaks = seq(0, 1000, by = 100))
r
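Scales can also carry the axis titles. The sketch below builds a fresh plot, stored under the illustrative name s, and combines name, breaks and limits in one place. The label text and the break spacing are assumptions chosen for this example.

```r
library(ggplot2)

s <- ggplot(CO2, aes(conc, uptake)) +
  geom_point() +
  # name sets the axis title; breaks sets where the tick marks fall
  scale_x_continuous(name = "Ambient CO2 concentration (mL/L)",
                     breaks = seq(0, 1000, by = 250)) +
  # limits sets the range of the axis
  scale_y_continuous(name = "CO2 uptake (umol/m^2 sec)",
                     limits = c(0, 50))
s
```

Note that limits set through a scale drop any observations that fall outside the range, so it is worth choosing limits that cover all of the data, as they do here.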
We can now make observations about the average behaviour for each concentration level across region and treatment condition:
- On average, plants in Quebec have a higher CO2 uptake than those in Mississippi for both chilled and non-chilled conditions.
- On average, plants under non-chilled conditions have a higher CO2 uptake than chilled for both regions.
- On average, plants in Quebec under non-chilled conditions have the highest CO2 uptake across all concentration levels.
- On average, plants in Mississippi under chilled conditions have the lowest CO2 uptake across all concentration levels.
You can find more about ggplot in the following posts:
- The application of ggplot2 to the construction of run charts can be found in Visualising Features of A&E Waiting Time Data Using Run Charts.