November 15, 2016 | Harry
In a recent proof of concept aimed at demonstrating the potential power of federated analysis of genomics data, Aridhia teamed up with several organisations from the Global Alliance for Genomics and Health. Using AnalytiXagility, we demonstrated how multiple sites from across the world can collaborate and enrich their information by:
- distributing data across several different sites
- pulling data from each site and joining it together
- finding the variants that are common across these sites
- deriving aggregates or running other analysis on this data
- enriching that data with additional information
Have a look at the video we made of the process, then I’ll walk you through the concept and steps we took.
Overcoming information challenges in genomic research
As the volume of genomic data and the rate at which it is collected continue to increase, along with the number of methods available for extracting clinically meaningful or actionable insights, the need to share knowledge and data in this domain becomes ever more important.
The raw data produced by genomic sequencers and the intermediate results produced within a bioinformatics pipeline both require significant storage, while the pipelines themselves typically require significant computation time to run. Clearly, replication should be avoided where possible. Additionally, with a rapidly expanding toolset for analysing sequencing data, simply sharing experimental results is not enough. Researchers need access to the data at the level of detail they require, so that bioinformatics pipelines can be updated as the technology improves and re-run to obtain new or better results.
As with any other study in the clinical domain, sample size remains a key issue. The ability to combine data from different studies worldwide could greatly enhance the power and reliability of individual studies as well as motivating new, collaborative studies.
Federated analysis could be the answer to some of these issues.
The Global Alliance for Genomics and Health
The GA4GH, or Global Alliance for Genomics and Health, has been established to promote the required sharing of human genetic data across multiple sites. The Alliance brings together almost 400 expert organisations from across the world in a bid to address some of the sharing and analysis issues which the genomics community faces.
To address these issues, GA4GH has developed a web API that allows remote sites to exchange genomic data. The API consists of a series of schemas that define the objects it can receive and the objects it can send as a response. This standardises the way genomic information is shared across multiple sites: different remote sites can be asked the same question, about a particular genomic region or feature for example, and each will give its answer in the same standardised format.
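To make that standardisation concrete, here is a minimal sketch of what such an exchange looks like. The endpoint path, field names and paging token follow the GA4GH variants schemas as they stood at the time (a JSON body POSTed to `<server>/variants/search`); the identifiers and coordinates are invented for illustration, so treat this as a sketch rather than a definitive reference.

```python
import json

# A search request is a JSON document POSTed to an endpoint such as
# <server>/variants/search. Every site accepts the same request shape...
request_body = {
    "variantSetId": "example-variant-set",  # illustrative identifier
    "referenceName": "21",                  # chromosome 21
    "start": 5030000,                       # 0-based, half-open interval
    "end": 5031000,
    "pageSize": 100,
}

# ...and returns variants in the same shape, so responses from
# different sites can be compared or combined directly.
example_response = json.loads("""{
  "variants": [
    {"id": "v1", "referenceName": "21", "start": 5030087,
     "referenceBases": "A", "alternateBases": ["G"]}
  ],
  "nextPageToken": null
}""")

for v in example_response["variants"]:
    print(v["referenceName"], v["start"], v["referenceBases"], "->", v["alternateBases"])
```

Because every server answers in this shape, the client code that parses one site's response works unchanged against every other site.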
As a broad overview, the API currently has support for the sharing and organising of the following objects:
- Variants – genetic differences between a sample and reference sequence. The data model used is very similar to VCF.
- Reads – the genetic data generated from DNA sequencing. The data model for reads is very similar to SAM or BAM format.
- References – e.g. the reference assemblies provided by GRC.
- Sequence Annotations – annotations on a sequence describing genomic features such as genes or coding sequences.
- Allele Annotations – annotations to describe, classify and understand the potential impact of individual variants. These are often generated from algorithms implemented within software such as VEP and SnpEff.
- RNA Quantifications – quantifications for genomic features produced from RNA reads. For example, gene expression levels obtained from an RNA-Seq experiment.
- Genotype to Phenotype Associations.
More details on the GA4GH API can be found in the project documentation.
Proof of Concept
A proof of concept project was co-ordinated by Aridhia to assess the usage of the GA4GH API and provide some insight into the potential power of federated analysis across several participating sites. The primary objectives were:
- To increase pragmatic understanding of federation strategies through a set of simulated test cases.
- To document the experience for global community benefit.
- To provide feedback on how the Genomics API model could be extended to capture the requirements of clinical transactions.
Aridhia collaborated with the following sites, each of which set up a GA4GH server:
- Queens University Belfast
- Royal Hospital of Melbourne University, BioGrid
- EMC R&D
- UCSC (Reference Server)
To more easily distribute consistent instructions to the participating sites, Aridhia forked the GA4GH repository and made a couple of changes to allow users to set up the server easily using Vagrant and Docker. This fork is currently held on GitHub. Instructions were also provided on how to populate each server with a set of sample data.
Data from the 1000 Genomes Project were used for the PoC. A subset of the 1000 Genomes variant data was distributed across the different sites so that each site held a distinct population. For this proof of concept we used only the variants on chromosome 21, so each participating site held one variant set for each individual it was provided with. This differs slightly from how the 1000 Genomes Project distributes its variant data: a single set showing all variants found in the entire population, with additional genotype fields in the VCF giving the genotype of each genome in the population.
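The redistribution described above can be sketched as follows: a population-style VCF record carries one genotype column per sample, and for the PoC each individual's calls were instead held as a per-individual variant set. The sample names and coordinates below are invented for illustration.

```python
# A population-style record: one genotype per sample, "0|1" meaning one
# reference and one alternate allele on the two phased haplotypes.
population_record = {
    "chrom": "21", "pos": 5030087, "ref": "A", "alt": "G",
    "genotypes": {"HG00096": "0|1", "HG00097": "0|0", "HG00099": "1|1"},
}

def per_individual_sets(record):
    """Emit the record only into the variant sets of individuals that carry it."""
    sets = {}
    for sample, gt in record["genotypes"].items():
        if any(allele == "1" for allele in gt.split("|")):
            sets.setdefault(sample, []).append(
                {k: record[k] for k in ("chrom", "pos", "ref", "alt")}
            )
    return sets

sets = per_individual_sets(population_record)
print(sorted(sets))  # individuals carrying the alternate allele
```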
The GA4GH reference server hosts a lot of the data from the 1000 Genomes Project, including the population variant set, annotations on that variant set performed by VEP, feature sets containing information on various genomic features (genes, transcripts etc.), phenotypic associations and the reads for each of the genomes in the 1000 Genomes Project.
We queried each of these servers from an AnalytiXagility workspace using the built-in R capability. There was already a Python client for the API, but nothing had been written for R. To make working with the API from R easier for the proof of concept, and to allow other R users to make calls to the GA4GH API easily, I have written an R package that simply wraps each of the API's operations in a function. You can find the package, Rga4gh, on CRAN. It's very much a work in progress, but if you are an R user it should make getting started with the API easier. Please use it and give feedback!
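Conceptually, such wrapper functions are thin clients: they POST the search body, follow the `nextPageToken` until the server has no more pages, and return the accumulated results. The sketch below (in Python, with a local stub standing in for the HTTP call, so all names here are hypothetical) shows the paging loop the wrappers hide from the analyst.

```python
# Hypothetical sketch of what an API-wrapping client function does.
PAGE = 2
STORE = [{"id": f"v{i}"} for i in range(5)]  # stub server's variant store

def stub_post_variants_search(body):
    """Stand-in for POSTing to /variants/search; serves PAGE results at a time."""
    start = int(body.get("pageToken") or 0)
    chunk = STORE[start:start + PAGE]
    next_token = str(start + PAGE) if start + PAGE < len(STORE) else None
    return {"variants": chunk, "nextPageToken": next_token}

def search_all_variants(body, post=stub_post_variants_search):
    """Follow nextPageToken until exhausted, returning every variant."""
    results, token = [], None
    while True:
        resp = post({**body, "pageToken": token})
        results.extend(resp["variants"])
        token = resp["nextPageToken"]
        if token is None:
            return results

variants = search_all_variants({"variantSetId": "demo", "referenceName": "21"})
print(len(variants))  # 5
```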
Using functions from this package, each site was queried to find all the variant sets it contained. Variants from these sets were pulled into the platform and aggregated to find the population allele frequencies for each of the variants found in that population. Each of these aggregated sets was then joined to create a dataset containing all the variants found in each of the distinct populations, so that allele frequencies in different populations could be compared. We also performed the same queries on the reference server to pull all the variants found in the 1000 Genomes Project. Allele frequencies had already been derived in this dataset, so they did not need to be derived by aggregation. Finally, we also searched the variant annotation set on the reference server and used this to annotate the variant sets at each of the participating sites.
With the aggregated and joined dataset it is possible, for example, to:
- Find variants which are common in a particular site’s population but rare in the wider (1000 Genomes) population, or vice versa.
- Combine populations from different sites which have a variant in common and use this as a cohort to conduct further analysis on this variant with a larger population sample than would be possible at a single site.
- Use annotations from the reference server to filter the variant sets at the remote sites to variants of interest without having to repeat analysis on these variants.
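The first of these comparisons can be sketched in a few lines: flag variants whose frequency at a site differs sharply from the wider 1000 Genomes frequency. The frequencies and the 5% thresholds below are invented for illustration.

```python
# Per-variant allele frequencies at one site and in 1000 Genomes (invented).
site_freq = {"21:5030087": 0.40, "21:5030200": 0.02, "21:5030900": 0.35}
kg_freq   = {"21:5030087": 0.38, "21:5030200": 0.03, "21:5030900": 0.01}

COMMON, RARE = 0.05, 0.05  # "common" >= 5% locally, "rare" < 5% globally

of_interest = [v for v in site_freq
               if site_freq[v] >= COMMON and kg_freq.get(v, 0.0) < RARE]
print(of_interest)  # common at this site but rare in 1000 Genomes
```

Swapping the direction of the two comparisons gives the "vice versa" case: variants rare locally but common in the wider population.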
We achieved the goals set out at the start of this proof of concept, but there is much more that could be done to showcase the concept and take this work forward into genuine applications. Indeed, motivating ideas for further projects and identifying areas that need additional work or thought is one of the intended outcomes of a PoC. There are several ideas I would like to continue working on to build upon what has already been done in this project, for example:
- We only looked at querying a few of the schemas of the GA4GH API in this project. It would be interesting to explore how the API can be used to query specific sections of Reads (BAM/SAM) data or to map genotypic to phenotypic variation.
- In the video demonstration at the start of this post, most of the analysis is done directly from an R console. Some of this analysis could be made more interactive by placing a simple UI on top of the API querying functions. We could demonstrate this by creating an AnalytiXagility mini-app.
- This PoC analysis was shown as a standalone piece; however, it would be more compelling to show how the API could be integrated with a bioinformatics pipeline.
- In order to present the results of analysis involving the API in a more exciting way we could develop some eye-catching visualisations along the same lines as the ones shown in my earlier blog posts, Beauty in Simplicity – Visualising Large Scale Genomic Data and RNA-Seq: Creating Simple Outputs from Complex Genetic Data.
This work goes beyond a limited proof of concept: it forms an integral part of the services that we are developing with the Stratified Medicine Scotland Innovation Centre for its Precision Medicine Ecosystem for Scotland initiative. The Ecosystem brings together capability from across Scotland to make precision medicine operational within healthcare, through ease of access to discovery, development and delivery at a population level. Federated analysis is set to be a key service available through the Ecosystem, aimed at delivering innovation in the clinical application of genomics. More information on this will be available in the coming months.
Finally, a big thank you to all the sites who participated in this proof of concept project. It’s essential to have early adopters for projects like this to identify pain points in the process and domain experts to validate the utility of the project and to drive effective development. A few sites which were unable to take part in this test are interested in another round, so we hope to provide further validation involving an even greater number of organisations in the near future – please get in touch if you might be interested in taking part.