Enabling large-scale genomic data analysis: the GA4GH federated analysis proof of concept

In 2015 the Global Alliance for Genomics and Health (GA4GH) began laying the groundwork for a proof of concept (PoC), coordinated by Aridhia and utilising our AnalytiXagility platform, to trial federated genomic analysis in a bid to address the computational challenges facing the community.

The GA4GH was established to promote the sharing of human genetic data across multiple sites. It brings together almost 400 expert organisations from across the world to address some of the sharing and analysis issues which the genomics community faces.

In late 2016 we were able to demonstrate the potential power of federated analysis of genomics data, showing how multiple sites from across the world can collaborate and enrich their information (see the linked blog for details and a video).

This paper details the approach taken by the collaborators involved in this test, including:

  • SMS-IC
  • Queen's University Belfast
  • Royal Hospital of Melbourne University, BioGrid
  • EMC R&D
  • UCSC

Genomics – the biggest big data science

In July 2015 four of the world’s leading researchers from within the genomics community published a paper which suggested that “…between 100 million and as many as 2 billion human genomes could be sequenced by 2025, representing four to five orders of magnitude growth in ten years…”. On that projection genomics would become the biggest of all the big data domains, reaching exabase scale within the next decade.

A common concern within the genomics community, and one which is shared by the authors of the aforementioned paper, is the availability of sufficient data in any one site to come to any valid scientific conclusion and move beyond that into the application of the science in healthcare delivery. As comparable sequencing and variant calling technologies become available that allow consistent analysis, new models of interaction are required which facilitate the collection of vast amounts of data from multiple sites into secure repositories to enable collaborative analysis on a global scale. At its heart this is about clinical communities pooling their knowledge of relevant variants.
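The idea of sites pooling their knowledge of variants can be illustrated with a small sketch. It is not the method used in the PoC and does not reflect the AnalytiXagility platform or any GA4GH API. It simply assumes that each site computes per-variant carrier counts locally and shares only those summaries, which a coordinator then combines. The function names, the variant identifiers (`var_A`, `var_B`) and the patient records are all hypothetical.

```python
from collections import Counter

def local_variant_counts(site_records):
    """Run inside one site: count carriers per variant.

    site_records maps a (hypothetical) patient id to the list of variant
    ids observed for that patient. Only the counts and the cohort size
    leave the site, never the patient-level records.
    """
    counts = Counter()
    for variants in site_records.values():
        counts.update(set(variants))  # count each patient once per variant
    return counts, len(site_records)

def pool_counts(site_summaries):
    """Run at the coordinator: combine per-site summaries into pooled
    carrier frequencies across all participating sites."""
    total = Counter()
    cohort_size = 0
    for counts, size in site_summaries:
        total.update(counts)
        cohort_size += size
    frequencies = {variant: n / cohort_size for variant, n in total.items()}
    return frequencies, cohort_size

# Hypothetical data held separately at two sites.
site_a = {"p1": ["var_A", "var_B"], "p2": ["var_B"]}
site_b = {"p3": ["var_B"], "p4": [], "p5": ["var_A"]}

summaries = [local_variant_counts(site_a), local_variant_counts(site_b)]
frequencies, cohort_size = pool_counts(summaries)
print(cohort_size, frequencies)
```

In this sketch neither site can say much on its own from two or three patients, while the pooled summary covers all five, which mirrors the concern above about having too little data in any one site.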

Sequencing data is being created independently by multiple projects across the world that aim to map genetic variation (such as the US Precision Medicine Initiative), and it is growing at an exponential rate. Our ability to adequately store, share and analyse this data is therefore becoming an increasingly urgent issue, one which requires early and detailed consideration of the infrastructure needed to support future growth in the domain.
