An Introduction to R Parallelisation Support in AnalytiXagility

Today I’m going to introduce you to the new R parallelisation capability that has been included in the latest version of AnalytiXagility and how we can use it to dramatically improve the performance of the R programs we write. This is an important upgrade if you need to run intensive jobs at speed, because it gives you the opportunity to really make use of AnalytiXagility’s high-performance capabilities.

With ever more complicated and time-consuming computations needing to be carried out in healthcare, there are a number of avenues we can take to speed these up. This is particularly important for time-sensitive clinical data analysis, such as the traumatic brain injury data that we are working on in the CHART-ADAPT project. In this project parallelisation is being used to cut the data processing time from hours to minutes, helping clinicians make evidence-based treatment decisions at the patient’s bedside: a clear benefit of fast, efficient data processing.

With the majority of modern computers coming with multicore processors, data scientists working with R on certain operating systems can take advantage of the extra cores by using R’s parallel library to speed up their analysis. However, the fork-based functions in that library, such as mclapply, aren’t currently available on Windows machines, or offline, which excludes a lot of users.

Aridhia recognised these issues and wanted to open up this capability to a much wider population, so the development team took up the challenge of making it available within the AnalytiXagility platform so that anyone could get access to rapid analysis, no matter which operating system they are using. In the early part of this year we spent time modifying the mechanism by which R connects to the database so that our users can more easily create new connections on multiple threads, with the ultimate aim of enabling parallelisation.

By default R does not take advantage of all the CPU cores available on a user’s computer, but R’s parallel package gives us access to these additional cores. This package was introduced in R version 2.14.0, building on the work done in the multicore and snow libraries, and provides an easy-to-use interface for parallelising computations.
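As a quick check of how many cores R can see, the parallel package provides detectCores(). The snippet below is a minimal sketch; the variable name n_cores is only illustrative, and the number reported is the count of logical cores, which may be higher than the number a shared server actually lets you use.

```r
library(parallel)

# Number of logical cores reported by the operating system.
# detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

cat("Cores detected:", n_cores, "\n")
```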

The ‘parallel’ Package

By using functions from the parallel library such as mclapply (a parallelised version of lapply) we can run computations across multiple cores. When mclapply and the other fork-based functions run, they look at the mc.cores option, which is initialised from the MC_CORES environment variable, to decide how many cores to use. If no value is set, they use 2 by default (i.e. the minimum number of cores parallelisation requires). mclapply also accepts an optional argument, mc.cores, that allows us to specify the number of cores manually if required. For example:

mclapply(c(1,2,3,4,5,6,7,8,9,10), identity, mc.cores = 4)

This sets the number of cores used in the computation to 4. Note that if you set it to 1, mclapply simply runs lapply, as you would expect. For details of the other arguments that mclapply accepts, such as mc.allow.recursive or mc.cleanup, see the ‘parallel’ package documentation listed under Find Out More. In the latest version of AnalytiXagility we have set MC_CORES to the number of cores on the system, although this can be changed manually for an individual R server.
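To see the effect of the core count for yourself, the following sketch compares lapply with mclapply on a deliberately slow function. The function slow_square and the variable names are hypothetical, made up for illustration, and the timings you get will depend on your machine, so none are quoted here.

```r
library(parallel)

# mclapply's default is mc.cores = getOption("mc.cores", 2L); the
# "mc.cores" option is initialised from the MC_CORES environment
# variable when the parallel package is loaded.
default_cores <- getOption("mc.cores", 2L)

# A deliberately slow function (hypothetical stand-in for real work).
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial_time <- system.time(
  serial_res <- lapply(1:8, slow_square)
)["elapsed"]

parallel_time <- system.time(
  parallel_res <- mclapply(1:8, slow_square, mc.cores = cores)
)["elapsed"]

# Both approaches should produce the same list of results.
identical(serial_res, parallel_res)
```

Printing serial_time and parallel_time afterwards shows how much wall-clock time the extra core saved on your system.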

Database Connections

Parallelising database work in AnalytiXagility is now even easier: a new function, xap.db.connect(), allows us to easily spawn new individual database connections. The following example uses this function together with the mclapply function demonstrated above:

library(parallel)

# Each task opens its own database connection, so the forked
# worker processes never share a single connection.
query_db <- function(n) {
  con <- xap.db.connect()
  dbGetQuery(con, "select 1 as value")$value
}

mclapply(1:10, query_db)
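When many queries run in parallel, one of them may fail. With mclapply’s default scheduling, an error in one task can also affect the other results computed by the same worker, so it can help to catch errors inside the function itself. The sketch below illustrates the idea outside the platform: fake_query is a hypothetical stand-in for a function that would call xap.db.connect() and dbGetQuery(), and safe_query is likewise an illustrative name.

```r
library(parallel)

# Hypothetical stand-in for a per-task database query; inside
# AnalytiXagility this is where xap.db.connect() and dbGetQuery()
# would be called.
fake_query <- function(n) {
  if (n == 3) stop("simulated query failure")
  n * 10
}

# Catch the error inside the worker so that one failed task
# returns NA instead of disturbing its neighbours.
safe_query <- function(n) {
  tryCatch(fake_query(n), error = function(e) NA_real_)
}

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

results <- mclapply(1:5, safe_query, mc.cores = cores)

# Flatten the list into a numeric vector; failed tasks show up as NA.
values <- unlist(results)
print(values)
```

Returning NA keeps the output the same length as the input, which makes it easy to see which tasks need to be retried.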

Find Out More

Next week we’ll show you how to get started with parallelisation in your AnalytiXagility workspace. In the meantime, here are some resources on R’s parallelisation capabilities that I have come across and recommend reading:

Package ‘parallel’ – https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
Introduction to parallel computing in R by Clint Leach – http://michaeljkoontz.weebly.com/uploads/1/9/9/4/19940979/parallel.pdf

Allister Antosik
