Last month, four of the world’s leading researchers from within the genomics community published a paper suggesting that “…between 100 million and as many as 2 billion human genomes could be sequenced by 2025, representing four to five orders of magnitude growth in ten years…”. On that trajectory, genomics would become the biggest of all the big data domains, reaching exabase scale within the next decade.
A common concern within the genomics community is the availability of sufficient data in any one site to come to any valid scientific conclusion and move beyond that into the application of the science in healthcare delivery. As comparable sequencing and variant calling technologies become available that allow consistent analysis, new models of interaction are required which facilitate the collection of vast amounts of data from multiple sites into secure repositories to enable collaborative analysis on a global scale.
The Global Alliance for Genomics and Health (GA4GH) has been established to promote the required sharing of human genetic data across multiple sites. The Alliance brings together almost 400 expert organisations from across the world to address some of the sharing and analysis issues which the genomics community faces. The GA4GH has now laid the groundwork for a limited three-month proof of concept (PoC), coordinated by Aridhia, that will trial federated genomic analysis as a way of addressing the computational challenges facing the healthcare industry.
Federated analysis describes the ability to access data for distributed analysis without physically sharing it, and therefore provides an ideal foundation for a globally fragmented and distributed genomics community which stores data in isolated databases.
While some sharing models rely on pooling data, whether in a private, open or commercial framework, federated models aim to respect important local legal, privacy and consent arrangements by allowing relevant data to remain in local storage, reducing the need for data to travel. Researchers are then able to gain access to a larger ‘virtual’ dataset comprising information (if not data) from multiple sites, upon which analyses can be run simultaneously, thereby increasing research efficiency.
Federated analysis therefore promises researchers access to larger sample sizes, facilitating large-scale data comparison to get better insight and drive improvement.
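To make the idea concrete, here is a minimal sketch of one federated query, written in Python. It is an illustration only and does not use the GA4GH Genomics API: the `Site` class, the `summarise` method and the variant identifier are all hypothetical names invented for this example. Each site computes an aggregate over its own genotypes and releases only that summary; the coordinator combines the summaries into a pooled allele frequency without ever seeing a per-sample record.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SiteSummary:
    """Aggregate counts a site releases; contains no per-sample records."""
    site: str
    alt_alleles: int    # alternate alleles observed at the variant
    total_alleles: int  # alleles genotyped (two per diploid sample)


class Site:
    """Hypothetical data-holding site. Genotypes stay inside this object."""

    def __init__(self, name: str, genotypes: Dict[str, List[int]]):
        self.name = name
        # variant id -> alternate-allele count (0, 1 or 2) for each local sample
        self._genotypes = genotypes

    def summarise(self, variant_id: str) -> SiteSummary:
        """Run the analysis locally and return only the aggregate."""
        calls = self._genotypes.get(variant_id, [])
        return SiteSummary(self.name, sum(calls), 2 * len(calls))


def federated_allele_frequency(sites: List[Site], variant_id: str) -> Optional[float]:
    """Combine per-site summaries into one frequency over the 'virtual' dataset."""
    summaries = [site.summarise(variant_id) for site in sites]
    alt = sum(s.alt_alleles for s in summaries)
    total = sum(s.total_alleles for s in summaries)
    return alt / total if total else None


# Two illustrative sites with made-up genotypes for a made-up variant.
site_a = Site("site_a", {"variant_1": [0, 1, 2, 0]})
site_b = Site("site_b", {"variant_1": [1, 1]})
pooled = federated_allele_frequency([site_a, site_b], "variant_1")
print(pooled)
```

In a real deployment the call to `summarise` would be a request to a remote service under that site's own governance rules, and a site might refuse to answer when its local sample is small enough to risk re-identification. The sketch leaves both of those concerns out.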
There is some concern that federated models are too rigid or expensive to implement and don’t fulfil the basic criteria of (a) encouraging data owners to participate and (b) being useful for analyst end users. In commercial settings, such as flight or hotel APIs in travel booking, extensive federated searches exist, so there may be lessons to learn from the success factors in those settings. The Genomics API is defined as:
With a reference implementation available, these efforts are sufficiently developed to allow a proof of concept in which a number of sites implement a first level of the API for a limited period of time and document the experience. This should improve communication and understanding of how federated analyses might work at a pragmatic level, as well as providing useful feedback on the APIs themselves.
The primary objectives of this PoC are to increase pragmatic understanding of federation strategies through a set of simulated test cases and to document the experience for the benefit of the community. As an additional goal, it would be useful to provide some feedback on how the Genomics API model could be extended to capture the requirements of clinical transactions too.
We are excited to be working with groups around the world, including the Stratified Medicine Scotland Innovation Centre, to run this PoC later in the summer. You can download the Federated Analysis paper to read more about the proposal and rationale, but please get in touch if you have any specific questions.
Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell et al. Big Data: Astronomical or Genomical? PLoS Biol. 2015 Jul 7;13(7):e1002195. doi: 10.1371/journal.pbio.1002195. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195