Visualising air pollution data over time

June 6, 2014 | Alan

The health effects of air pollution have been reported in research studies over the past 30 years. These effects include respiratory diseases such as asthma, cardiovascular diseases, changes in lung function, and death. The Department for Environment, Food and Rural Affairs (DEFRA) hosts an online historical air quality dataset for the UK. Visualising such data over time could help to describe and predict changes in air quality based on environmental parameters. This post describes using the AnalytiXagility platform to visualise air quality data over time, alongside the pollutant levels recommended by the World Health Organisation.


Data source

We downloaded air quality data for the Glasgow Kerbside monitoring site, located on the pavement of Hope Street adjacent to Glasgow Central Station. The nearest road is subject to frequent congestion during peak traffic flow periods, and the surrounding area is built up. The dataset covers daily mean values for 01/01/2013 – 31/12/2013. The URL of the open data source is:

We cleaned this up and stored it as air_quality_data_glw_cleaned in the AnalytiXagility platform. The accompanying metadata is stored as a text file, air_quality_data_glw.txt.

Learning outcomes

The packages explored are:

  • ggplot2
  • zoo
  • dplyr
  • XML


Step 1 – Set up environment

Load relevant libraries:

library(ggplot2)
library(zoo)
library(dplyr)
library(XML)

Import relevant data into workspace:

air_qual_gla <- xap.read_table('air_quality_data_glw_cleaned')

Step 2 – Data prep

This dataset includes daily mean pollutant values that were gathered from the Glasgow Kerbside location in 2013. Use as.Date() together with the base R format() function to extract the month number from each date:

air_qual_gla$month <- as.numeric(format(as.Date(air_qual_gla$date), "%m"))
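As a minimal sketch of this step, the snippet below applies the same as.Date() and format() calls to a small made-up vector of dates. The dates are illustrative only and are not taken from the DEFRA dataset. The second part assumes, hypothetically, that dates arrive as dd/mm/yyyy strings, in which case as.Date() needs an explicit format argument.

```r
# Hypothetical ISO-style dates; the real values live in air_qual_gla$date
iso_dates <- c("2013-01-15", "2013-04-02", "2013-12-31")

# Parse to Date, format as a two-digit month, then convert to a number
iso_month <- as.numeric(format(as.Date(iso_dates), "%m"))
print(iso_month)

# Assumption: if dates are stored as dd/mm/yyyy, tell as.Date() the layout
uk_dates <- c("15/01/2013", "31/12/2013")
uk_month <- as.numeric(format(as.Date(uk_dates, format = "%d/%m/%Y"), "%m"))
print(uk_month)
```

Checking a handful of parsed dates by eye before summarising is a cheap way to catch a day/month mix-up.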

This dataset contains information on a range of pollutants, but we are going to focus on particulate matter (PM) levels. PM10 refers to particles with a diameter of 10 micrometres or less, such as smoke, dirt and dust. PM2.5 refers to finer particles with a diameter of 2.5 micrometres or less, such as those in vehicle exhaust emissions.

Use the group_by() and summarise() functions from the dplyr package to calculate the mean values of PM10 and PM2.5 per month:

# %>% is the pipe operator in current dplyr releases (it replaces the older %.%)
pm10_ave_month <- air_qual_gla %>% group_by(month) %>% summarise(ave = mean(pm10_particulate_matter_.hourly_measured., na.rm = TRUE))
pm2.5_ave_month <- air_qual_gla %>% group_by(month) %>% summarise(ave = mean(pm2_5_particulate_matter_.hourly_measured., na.rm = TRUE))
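To make the grouping step concrete, here is a small sketch on made-up readings. The toy data frame and its pm10 column are hypothetical stand-ins for the platform table. It uses %>%, the pipe operator in current dplyr releases.

```r
library(dplyr)

# Made-up daily readings for two months (illustrative values only)
toy <- data.frame(
  month = c(1, 1, 1, 2, 2),
  pm10  = c(20, 30, NA, 10, 14)
)

# Group by month, then average within each group, skipping missing values
toy_ave <- toy %>%
  group_by(month) %>%
  summarise(ave = mean(pm10, na.rm = TRUE))

print(toy_ave)
```

With na.rm = TRUE the missing January reading is dropped rather than turning the whole monthly mean into NA.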

Append identifiers to each summarised data frame and define line type variables:

pm10_ave_month$type <- rep("PM 10", nrow(pm10_ave_month))
pm2.5_ave_month$type <- rep("PM 2.5", nrow(pm2.5_ave_month))

pm10_ave_month$line <- rep(1, nrow(pm10_ave_month))
pm2.5_ave_month$line <- rep(1, nrow(pm2.5_ave_month))

Step 3 – Scrape web data from World Health Organisation air quality website

The World Health Organisation (WHO) recommends target pollutant levels, which are categorised as 24-hour mean and annual mean levels. We used the htmlTreeParse() function from the XML package to parse the page into a tree, then extracted the recommended levels for PM10 and PM2.5:

url <- ""
html <- htmlTreeParse(url, useInternalNodes = TRUE)
# The positions 35 and 36 depend on the page layout at the time of scraping
PM2.5_recom <- getNodeSet(html, "//span")[[35]]
PM10_recom <- getNodeSet(html, "//span")[[36]]

Use xmlValue() to convert data class XMLInternalTextNode to character:

PM2.5_annual_mean_str <- xmlValue(PM2.5_recom[[3]])
PM10_annual_mean_str <- xmlValue(PM10_recom[[3]])
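The same calls can be tried offline on an inline HTML fragment, as in the sketch below. The fragment and the numbers in it are hypothetical placeholders, not a copy of the WHO page. It also shows, as one possible alternative to positional indexing, an XPath that selects a span by its text.

```r
library(XML)

# Hypothetical HTML standing in for the scraped page
snippet <- paste0(
  "<html><body>",
  "<span>PM2.5: 10 ug/m3 annual mean</span>",
  "<span>PM10: 20 ug/m3 annual mean</span>",
  "</body></html>"
)

# asText = TRUE parses the string itself rather than treating it as a file or URL
doc <- htmlTreeParse(snippet, asText = TRUE, useInternalNodes = TRUE)

# Positional lookup, as in the post: take the second <span> on the page
spans <- getNodeSet(doc, "//span")
by_position <- xmlValue(spans[[2]])

# Alternative sketch: select the <span> whose text mentions "PM10"
by_text <- xmlValue(getNodeSet(doc, "//span[contains(., 'PM10')]")[[1]])

print(by_position)
print(by_text)
```

Selecting by text is less likely to break silently if spans are added to or removed from the page, although it still depends on the wording staying the same.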

Use gsub() (base R) to extract the numerical value from the character string. The backreference "\\1" keeps only the digits captured by the first group:

PM2.5_annual_mean_num <- as.numeric(gsub("([0-9]+).*$", "\\1", PM2.5_annual_mean_str))
PM10_annual_mean_num <- as.numeric(gsub("([0-9]+).*$", "\\1", PM10_annual_mean_str))
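As a worked example of the extraction, the sketch below runs the pattern on a made-up string. The strings are hypothetical and only mimic the kind of text a scraped node might hold. The second part shows an alternative based on regexpr() and regmatches() for text that does not begin with the number.

```r
# Hypothetical scraped text that starts with the value of interest
example_str <- "20 ug/m3 annual mean"

# "([0-9]+)" captures the leading digits and ".*$" matches the rest,
# so replacing the whole match with "\\1" leaves only the digits
example_num <- as.numeric(gsub("([0-9]+).*$", "\\1", example_str))
print(example_num)

# If text precedes the number, gsub() would keep that prefix;
# regexpr() + regmatches() returns just the first run of digits instead
prefixed_str <- "Annual mean: 20 ug/m3"
prefixed_num <- as.numeric(regmatches(prefixed_str, regexpr("[0-9]+", prefixed_str)))
print(prefixed_num)
```

Both patterns match whole digits only, so a guideline value with a decimal point would need a pattern such as "[0-9]+\\.?[0-9]*".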

Step 4 – Bind extracted annual mean values to summarised tables for PM10 and PM2.5

Use rbind() (base R) to append the extracted PM2.5_annual_mean_num and PM10_annual_mean_num values to the summarised data frames pm2.5_ave_month and pm10_ave_month:

to_bind <- data.frame(1:12, rep(PM2.5_annual_mean_num, nrow(pm2.5_ave_month)), 
    rep("PM 2.5", nrow(pm2.5_ave_month)), rep(3, nrow(pm2.5_ave_month)))
colnames(to_bind) <- colnames(pm2.5_ave_month)
all_pm2.5 <- rbind(pm2.5_ave_month, to_bind)

to_bind <- data.frame(1:12, rep(PM10_annual_mean_num, nrow(pm10_ave_month)), 
    rep("PM 10", nrow(pm10_ave_month)), rep(3, nrow(pm10_ave_month)))
colnames(to_bind) <- colnames(pm10_ave_month)
all_pm10 <- rbind(pm10_ave_month, to_bind)

Bind all_pm2.5 and all_pm10 into one data frame:

all_pm <- rbind(all_pm10, all_pm2.5)
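The result is a long-format table in which measured values and reference values share the same columns and are told apart by the line column (1 for measured, 3 for the recommendation). A minimal sketch with made-up numbers for three months:

```r
# Made-up monthly means (illustrative values only)
measured <- data.frame(month = 1:3, ave = c(25, 18, 22), type = "PM 10", line = 1)

# One reference row per month, repeating a hypothetical recommended level
reference <- data.frame(month = 1:3, ave = 20, type = "PM 10", line = 3)

# Stack the two so a single plotting layer can draw both series
combined <- rbind(measured, reference)
print(combined)
```

Keeping both series in one data frame means the plot needs only one geom_line() call, with linetype mapped to the line column.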

Step 5 – Plot results

Select a colour palette that can be identified by people with colour blindness:

cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#0072B2", "#D55E00", "#CC79A7")

Create a ggplot object, add the monthly PM data as a line layer, and overlay the annual mean recommendations scraped from the WHO website:

p <- ggplot() + geom_line(data = all_pm, aes(x = month, y = ave, colour = type, linetype = as.factor(line)), size = 2) + 
    scale_colour_manual(name = "Colour identifier", values = c(cbPalette[1], cbPalette[2])) + 
    scale_linetype_manual(name = "Line type identifier", labels = c("PM level", "WHO recommendation"), values = c("solid", "dotted")) + 
    # month is numeric, so label it on a continuous scale with one break per month
    scale_x_continuous(breaks = 1:12, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) + 
    geom_vline(xintercept = 4, linetype = "dashed") + 
    geom_vline(xintercept = 10, linetype = "dashed") + 
    xlab("Month") + 
    ylab("PM level (µg/m³)") + 
    ggtitle("Pollution (particulate matter) levels over time") + 
    theme_bw() + 
    theme(plot.title = element_text(face = "bold", size = 14))



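[Figure: Pollution (particulate matter) levels over time. Monthly mean PM10 and PM2.5 levels (solid lines) with WHO annual mean recommendations (dotted lines).]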
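For readers without access to the platform table, the plotting approach can be tried on a toy data frame. The sketch below uses made-up values for three months and a hypothetical reference level; it is not the data behind the figure.

```r
library(ggplot2)

# Made-up long-format data: three measured months plus a flat reference level
toy_pm <- data.frame(
  month = rep(1:3, times = 2),
  ave   = c(25, 18, 22, 20, 20, 20),
  type  = "PM 10",
  line  = rep(c(1, 3), each = 3)
)

p_toy <- ggplot(toy_pm, aes(x = month, y = ave, colour = type,
                            linetype = as.factor(line))) +
  geom_line() +
  scale_colour_manual(name = "Colour identifier", values = "#E69F00") +
  scale_linetype_manual(name = "Line type identifier",
                        labels = c("PM level", "Reference level"),
                        values = c("solid", "dotted")) +
  # month is numeric, so the axis labels go on a continuous scale
  scale_x_continuous(breaks = 1:3, labels = c("Jan", "Feb", "Mar")) +
  theme_bw()

print(p_toy)
```

Mapping linetype to as.factor(line) is what splits the data into a measured series and a reference series within a single layer.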

What Hope is There for Hope Street?

First Bus fits all new buses with low-emission, environmentally friendly engines. Given this, we would expect to see a gradual downward trend in PM2.5 levels over the year, but the data suggest that other variables contribute to pollutant levels. What seasonal behaviour could account for the changes in pollutant levels? And how can we remedy this?

Further reading

