A recent proof of concept aimed at providing insight into the potential power of federated analysis of genomics data saw Aridhia team up with several organisations which form part of the Global Alliance for Genomics and Health. Using AnalytiXagility we demonstrated how multiple sites from across the world can collaborate and enrich their information by:
Have a look at the video we made of the process, then I’ll walk you through the concept and steps we took.
As the volume of genomic data and the rate at which it is collected continue to increase, along with the number of methods available for extracting clinically meaningful or actionable insights, the need to share knowledge and data in this domain becomes ever more important.
The raw data produced by genomic sequencers and the intermediate results produced within a bioinformatics pipeline both require significant storage space, while the pipelines themselves typically require significant computation time to run. Clearly, replicating this work should be avoided where possible. Additionally, with a rapidly expanding toolset for analysing sequencing data, simply sharing experimental results is not enough. It is essential that researchers have access to the data at the level of detail they require, so that bioinformatics pipelines can be updated as the technology improves and re-run to obtain new or better results.
As with any other study in the clinical domain, sample size remains a key issue. The ability to combine data from different studies worldwide could greatly enhance the power and reliability of individual studies, as well as motivate new, collaborative studies.
Federated analysis could be the answer to some of these issues.
The GA4GH, or Global Alliance for Genomics and Health, was established to promote this kind of sharing of human genetic data across multiple sites. The Alliance brings together almost 400 expert organisations from across the world to address some of the sharing and analysis issues the genomics community faces.
To address these issues GA4GH has developed a web API to allow remote sites to exchange genomic data. The API consists of a series of schemas that define the objects that it can receive and the objects it can send as a response. This allows standardisation in the way that genomics information is shared across multiple sites. That is, different remote sites can be asked the same question, about a particular genomic region or feature for example, and each will give their answer in the same standardised format.
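To make "the same question in the same standardised format" concrete, the sketch below builds the JSON body for a variants search request. The field names (variantSetId, referenceName, start, end, pageSize) follow the camelCase style of the GA4GH schemas as best I recall them; the variant set ID and region are invented, and the authoritative request shape should be checked against the project documentation.

```python
import json

def build_variants_search(variant_set_id, reference_name, start, end, page_size=100):
    """Build the JSON body for a POST to <server>/variants/search."""
    return {
        "variantSetId": variant_set_id,   # which variant set to query
        "referenceName": reference_name,  # e.g. chromosome "21"
        "start": start,                   # 0-based start of the region
        "end": end,                       # end of the region (exclusive)
        "pageSize": page_size,            # results returned per page
    }

# The same body can be POSTed to every participating server; each replies
# with its variants in the same standardised format.
body = build_variants_search("my-variant-set", "21", 5000000, 5100000)
print(json.dumps(body))
```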
As a broad overview, the API currently has support for the sharing and organising of the following objects:
More details on the GA4GH API can be found in the project documentation.
A proof of concept project was co-ordinated by Aridhia to assess the usage of the GA4GH API and provide some insight into the potential power of federated analysis across several participating sites. The primary objectives were:
Aridhia collaborated with five sites that set up GA4GH servers:
To distribute consistent instructions to the participating sites more easily, Aridhia forked the GA4GH repository and made a few changes that allow users to set up the server easily using Vagrant and Docker. This is currently held in GitHub. Instructions were also provided on how to populate each server with a set of sample data.
Data from the 1000 Genomes Project were used for the PoC. A subset of the 1000 Genomes variant data was distributed across the different sites so that each site contained a distinct population. For this proof of concept we used only the variant sets on chromosome 21, so each participating site held a variant set for each individual it was provided with. This differs slightly from how the 1000 Genomes Project variant set itself is distributed: a single set showing all variants found in the entire population, with additional genotype fields in the VCF giving the genotype of each genome in the population.
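To make that distinction concrete, here is a toy Python sketch (sample IDs, positions and site names are invented for illustration, not taken from the PoC) of deriving per-site variant sets from a population-style VCF that carries one genotype per genome: each site keeps only the variants its own genomes actually carry.

```python
# A population VCF stores every variant found in the whole population, with
# a genotype for each genome; for a per-site distribution, each site keeps
# only the variants carried by its own genomes.
population_vcf = [
    # (chrom, pos, ref, alt, {sample_id: (allele1, allele2)}); 0=ref, 1=alt
    ("21", 9411239, "G", "A", {"sampleA": (0, 1), "sampleB": (0, 0)}),
    ("21", 9411410, "C", "T", {"sampleA": (0, 0), "sampleB": (1, 1)}),
]
site_samples = {"site_1": ["sampleA"], "site_2": ["sampleB"]}

per_site = {site: [] for site in site_samples}
for chrom, pos, ref, alt, calls in population_vcf:
    for site, samples in site_samples.items():
        # keep the record only if a genome at this site carries the alt allele
        if any(1 in calls[s] for s in samples):
            per_site[site].append((chrom, pos, ref, alt))

print(per_site)
```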
The GA4GH reference server hosts a lot of the data from the 1000 Genomes Project, including the population variant set, annotations on that variant set performed by VEP, feature sets containing information on various genomic features (genes, transcripts etc.), phenotypic associations and the reads for each of the genomes in the 1000 Genomes Project.
We queried each of these servers from an AnalytiXagility workspace using the built-in R capability. There was already a Python client for the API but nothing had been written for R. To make working with the API from R easier for the proof of concept, and to allow other R users to easily make calls to the GA4GH API, I have written an R package which simply wraps up each of the API’s operations into functions. You can find the package Rga4gh on CRAN. It’s very much a work in progress, but if you are an R user it should make getting started with the API easier. Please use and give feedback!
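The Rga4gh package is written in R, but the wrapper pattern it uses is simple enough to sketch in a few lines of Python: each GA4GH search operation becomes a plain function that POSTs a JSON body to the corresponding endpoint. The server URL and endpoint paths below are placeholders for illustration, not real PoC sites.

```python
import json
import urllib.request

def make_search(base_url, operation):
    """Return a function that POSTs a JSON body to base_url/operation."""
    url = f"{base_url.rstrip('/')}/{operation}"

    def search(body):
        req = urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:  # network call; not run here
            return json.loads(resp.read())

    search.url = url  # expose the target endpoint for inspection
    return search

# One wrapper function per API operation, all sharing the same plumbing.
search_variants = make_search("http://example-site:8000", "variants/search")
search_variantsets = make_search("http://example-site:8000", "variantsets/search")
```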
Using functions from this package, each site was queried to find all the variant sets it contained. Variants from these sets were pulled into the platform and aggregated to find the allele frequencies for each variant found in that population. These aggregated sets were then joined to create a dataset containing all the variants found in each of the distinct populations, so that allele frequencies in different populations could be compared. We also performed the same queries on the reference server to pull all the variants found in the 1000 Genomes Project; allele frequencies had already been derived in that dataset, so they did not need to be computed by aggregation. Finally, we searched the variant annotation set on the reference server and used it to annotate the variant sets at each of the participating sites.
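The aggregate-then-join step can be sketched as follows (the genotypes and positions are toy values, and the PoC itself did this in R inside the workspace): count alternate alleles per site to obtain an allele frequency, then join the per-site frequencies by variant so populations can be compared side by side.

```python
def allele_frequency(genotypes):
    """genotypes: list of (allele1, allele2) per genome; 0 = ref, 1 = alt."""
    alleles = [a for gt in genotypes for a in gt]
    return alleles.count(1) / len(alleles)

# Per-site calls for the same variant, keyed by (chrom, pos, ref, alt).
site_a = {("21", 5030088, "A", "G"): [(0, 1), (1, 1), (0, 0)]}
site_b = {("21", 5030088, "A", "G"): [(0, 0), (0, 1), (0, 0), (0, 1)]}

# Join the aggregated frequencies by variant across sites.
joined = {}
for name, site in (("site_a", site_a), ("site_b", site_b)):
    for variant, gts in site.items():
        joined.setdefault(variant, {})[name] = allele_frequency(gts)

print(joined)
# site_a: 3 alt alleles out of 6 -> 0.5; site_b: 2 out of 8 -> 0.25
```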
With the aggregated and joined dataset it is possible, for example, to:
We achieved the goals set out at the start of this proof of concept, but there is much more that could be done to showcase the concept and take this work forward into genuine applications. Indeed, motivating ideas for further projects and identifying areas that need additional work or thought is one of the intended outcomes of a PoC. There are several ideas I would like to continue working on to build upon what has already been done in this project, for example:
This work goes beyond a limited proof of concept – it forms an integral part of the services that we are developing with the Stratified Medicine Scotland Innovation Centre for its Precision Medicine Ecosystem for Scotland initiative. The Ecosystem brings together capability from across Scotland to make precision medicine operational within healthcare, through ease of access to discovery, development and delivery at a population level. Federated analysis is set to be a key service available through the Ecosystem, aimed at delivering innovation in the clinical application of genomics. More information on this will be available in the coming months.
Finally, a big thank you to all the sites that participated in this proof of concept project. Projects like this need early adopters to identify pain points in the process, and domain experts to validate the project's utility and drive effective development. A few sites that were unable to take part in this round are interested in another, so we hope to provide further validation involving an even greater number of organisations in the near future – please get in touch if you might be interested in taking part.