December 7, 2020 | Rodrigo
A data repository holding diverse data, like a hospital Digital Research Environment (DRE), or ecosystems like the AD Workbench or the ICODA COVID-19 Workbench, will contain diverse datasets from similar domains. This also applies to gateways providing aggregated metadata and brokering access to data. For an end-user undertaking meta-analysis or otherwise combining data from these sources, it can be a challenge to compare datasets.
There are many efforts to reduce the friction in data sharing and broaden access to data that can contribute to improving health while preserving data governance. Some of the barriers to the effective use of data include the lack of specific information to assess:
- the source of the data and how it was collected
- the type and constraints on the data
- whether the data discloses personally identifiable information
- what semantic grounding the data has.
Additionally, users of data (the Demand side) and data contributors (the Supply side) do not have a shared frame of reference, or enough accurate and useful information, to understand the disclosure risk inherent in a given dataset.
These problems become ever more apparent when deciding whether datasets should be harmonised. It is not enough to compare metadata: statistical profiling is required, and researcher judgment is often needed to decide whether harmonised fields are comparable enough. This is further complicated when there is a lack of information available on input datasets. There are therefore a few problems to work through to make progress on the objective of reducing friction and building trust in the sharing of data.
The process of data harmonisation is typically time-consuming and error-prone. There are challenges in transforming two or more columns into a common frame of reference ("mapping"), constrained by differences between data storage systems and the types they associate with fields. Knowledge built up over previous harmonisations of many datasets is rarely, if ever, used to improve this process. For example, reference ranges for laboratory data are often generic and hardly ever differentiate based on real-world diverse cohorts.
Implementations of these requirements would need to be transparently verifiable, or at least explainable. A data owner would not want their data to be transformed by an opaque process, even one claiming guarantees such as accurate profiling or reduced disclosure risk. For this reason, implementations of solutions should be open source.
The open source Federated Data Sharing Common API, developed by Aridhia and the Alzheimer’s Disease Data Initiative (ADDI) and supported by Health Data Research UK (HDR UK), aims to facilitate collaboration and trusted data sharing networks between trusted research environments and data repositories. The API provides a set of endpoints required to provide a ‘common’ API to organisations wishing to participate in data sharing or federated analysis. This API is either deployed or implemented locally at the participating organisation and linked to data within their own repositories.
Currently, the Common API defines metadata standards and selection interfaces that can facilitate data understanding and sharing:
- using the Metadata API, a user can obtain data catalogue and dictionary data as well as automatically align data at the level of type and semantics (via controlled vocabularies)
- using the Selection API, a user can retrieve selections of data and analyse it.
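As a sketch of how a client might address such a deployment, the snippet below builds the request URLs for a catalogue and a data dictionary. The base URL and dataset identifier are placeholders, and the endpoint paths are illustrative rather than quoted from the Common API specification:

```python
# Hypothetical Common API client sketch; host, paths and response shape
# are illustrative, not taken from the published specification.
import json
import urllib.request

BASE = "https://example.org/federated-data-sharing"  # placeholder host

def catalogue_url(base: str) -> str:
    # Metadata API: list the data catalogue entries available at this node
    return f"{base}/metadata/catalogue"

def dictionary_url(base: str, dataset_id: str) -> str:
    # Metadata API: field-level dictionary (types, vocabularies) for a dataset
    return f"{base}/metadata/dictionaries/{dataset_id}"

def fetch(url: str) -> dict:
    # A real client would add whatever authentication the host requires
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

A user could then compare the dictionaries returned by several participating organisations to align fields by type and controlled vocabulary before requesting any data.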
An extension of the Common API is proposed that addresses most of the requirements to improve the use and harmonisation of data. We invite the community of healthcare and research organisations to adopt and enhance the API to tackle these challenges.
For the data insights API, the aim is to provide a production-quality reference implementation within acceptable constraints that can be run on and integrated with diverse systems and data repositories. Some effort should be made to review other approaches to trusted and federated data sharing to encapsulate innovation and useful techniques while accelerating the development of this capability.
Areas of functionality should include:
- standardised statistical profiling of data
- disclosure risk assessments
- providing synthetic previews of data
- tools for harmonisation.
A standard set of profiling functions should be provided that helps users understand the data by summarising its values. These should be delivered as data structures with field labels and numerical summaries that map to common visualisations.
Basic shape information for the data structure should be provided, as well as type detection (flagging mismatches with the given metadata), the number of unique values, and empty and/or null values. For a given field and column of data, it should be possible to generate a type-specific statistical profile.
- For a categorical field, this could be a frequency distribution of values. For a continuous field, this could be a histogram.
- For numeric and date fields, give the minimum and maximum values. For numeric fields, the quartiles would provide additional information.
- It may be possible to provide an estimation of the ‘curve’ of a numeric field or closeness to a well-known distribution.
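A minimal sketch of such a type-specific profile, assuming a column arrives as a plain Python list, might look as follows; a production implementation would instead work against the repository's own storage layer and type system:

```python
# Sketch of a type-specific column profile (illustrative, stdlib only).
from collections import Counter
from statistics import quantiles

def profile_column(values):
    """Return a summary dict whose shape depends on the detected type."""
    non_null = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(set(non_null)),
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        # Numeric field: range plus quartiles, ready to map onto a box plot
        q1, q2, q3 = quantiles(non_null, n=4)
        profile.update(type="numeric", min=min(non_null), max=max(non_null),
                       quartiles=(q1, q2, q3))
    else:
        # Categorical field: frequency distribution, ready for a bar chart
        profile.update(type="categorical",
                       frequencies=dict(Counter(non_null)))
    return profile
```

Because the output is a plain data structure rather than a rendered chart, the same profile can be returned over an API and mapped to whichever visualisation the client prefers.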
The Selection API defines a /profile endpoint intended to provide column-wise statistical analysis of selected data frames; however, this is not yet implemented. Alignment between API implementations would ensure users obtain a consistent set of metrics for multiple datasets across multiple data repositories.
Disclosure risk, or statistical data protection, aims to identify the risks when data containing personal information is published and ultimately prevent statistical disclosure of this information. Information can be provided to users about the qualities of a given data frame using a range of approaches:
- identifying direct identifiers or key variables that should not appear in a dataset, or sensitive variables that might also be removed.
- k-anonymity: A dataset is said to be k-anonymous if every combination of values for key variable columns in the dataset appears at least for k different records.
- l-diversity: A dataset is said to be l-diverse if every column classified as sensitive has at least l distinct values.
- reidentification_risk: A reidentification_risk function could calculate the probability of re-identification for each record in an anonymised dataset.
- outlier_detection: An outlier_detection function determines which records are outliers (values below the q-th quantile or above the (1-q)-th quantile, for a given quantile level q) for each of the continuous key attributes in the dataset.
- suda: The Special Uniques Detection Algorithm (SUDA) measures how ‘risky’ a record is by identifying the minimal sample uniques (MSUs) in a record. A record is considered a special unique if it is a sample unique on the complete set of quasi-identifiers and possesses at least one MSU.
- attribute_classifier: An attribute_classifier function could classify key variables in a dataset as categorical (e.g. gender, country) or continuous (e.g. age, weight).
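The k-anonymity and l-diversity measures above can be sketched over rows held as plain dicts; the column names in the usage example are illustrative:

```python
# Sketches of the k-anonymity and l-diversity checks described above.
from collections import Counter

def k_anonymity(rows, key_vars):
    """Smallest equivalence-class size over the key (quasi-identifier) columns.

    The dataset is k-anonymous for any k up to this value."""
    classes = Counter(tuple(row[c] for c in key_vars) for row in rows)
    return min(classes.values())

def l_diversity(rows, sensitive_var):
    """Number of distinct values taken by a sensitive column."""
    return len({row[sensitive_var] for row in rows})
```

A stricter l-diversity check would measure the diversity of the sensitive column within each k-anonymous equivalence class rather than over the whole column, but the simple form above matches the definition used in this article.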
The examples above are based on prototype work undertaken at Aridhia; however, these could be expanded and included in the Common API to risk-assess any datasets made available to others, for example as a precursor to any /selection API calls. For the Aridhia DRE, we envisage a number of gateways in the flow of data where this information could be provided to researchers and data custodians: in FAIR, where data selections are made, to inform the decision to deliver the data; and when results are Airlocked out of a Workspace, to inform output checkers.
Synthetic Previews of Data
The main case for synthetic data is where data was collected for a clinical research purpose (with ethical and data governance controls) but individual patient data is included and cannot be shared due to consent at the individual level; that is, creating a new dataset that has almost the same statistical properties as the original but does not include the original patient records in a discernible way.
Alternatively, one could consider de-identification to provide users with a pseudonymised version of the data. Still, recent developments suggest that generating synthetic datasets from source clinical datasets is increasingly feasible and could provide broader access to data with fewer governance constraints.
However, the use of synthetic data may not be an option if the data owner is concerned about statistical analysis of their data (for example, performance analysis of a hospital), so it should not be assumed that making synthetic data available is a general solution to making any data available.
The Selection API defines a /preview endpoint intended to provide either a real or synthetic preview of the data; however, this is not yet implemented.
Unlike most other proposed solutions in this article, adding synthetic previews introduces a heavier computational burden as the whole data set needs to be considered, not just an individual column or row. Current developments focus on a machine learning/deep learning approach to create similar data within some agreed tolerance. It will be necessary to use profiling and disclosure risk functions mentioned above to evaluate the fidelity, utility and privacy characteristics of the data. Alternatively, it might be possible to generate synthetic analogues of datasets with sufficient fidelity that avoid the concerns of disclosure.
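One minimal form of the fidelity evaluation mentioned above is to compare summary statistics of each real column against its synthetic counterpart; the metric and tolerance below are illustrative choices, not a standard:

```python
# Illustrative fidelity check for one numeric column of a synthetic preview.
from statistics import mean, stdev

def close_enough(real, synthetic, tol=0.1):
    """True if mean and standard deviation agree within a relative tolerance."""
    def rel_diff(a, b):
        # Relative difference, guarded against division by zero
        return abs(a - b) / max(abs(a), abs(b), 1e-9)
    return (rel_diff(mean(real), mean(synthetic)) <= tol
            and rel_diff(stdev(real), stdev(synthetic)) <= tol)
```

A fuller evaluation would also compare joint distributions across columns, and run the disclosure risk functions above against the synthetic output to confirm no original records are discernible.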
Tools for Harmonisation
It is possible to define a set of functions that combine and transform subsets of columns from different datasets into a cohesive dataset on which users can base their analysis. The rules that allow this basic level of harmonisation are typically:
- Simple mapping of columns of different names into a new column (possibly with one of the original names) and no other change.
- Applying a set of functions to each source column and combining into a new column (e.g. F/C conversions of temperature). A function might be a filter combining multiple columns in a source data frame and applying a function to the result.
- Converting categorical values based on a semantic rule (postcodes to deprivation index or specific diagnoses to more general ones).
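The three rule types above can be sketched as column-wise operations over rows held as dicts; the function names and rule representation are illustrative, not a proposed standard:

```python
# Sketches of the three harmonisation rule types (illustrative only).
def rename(rows, old, new):
    """Rule 1: map a column to a new name, values unchanged."""
    return [{**{k: v for k, v in r.items() if k != old}, new: r[old]}
            for r in rows]

def transform(rows, col, fn):
    """Rule 2: apply a function to each value in a column."""
    return [{**r, col: fn(r[col])} for r in rows]

def recode(rows, col, mapping):
    """Rule 3: convert categorical values by a semantic lookup table."""
    return [{**r, col: mapping.get(r[col], r[col])} for r in rows]

def f_to_c(f):
    # Example rule-2 function: Fahrenheit to Celsius
    return (f - 32) * 5 / 9
```

Because each rule is a small, composable function, a sequence of them can itself be recorded as the harmonisation code that, as described below, becomes part of the resulting dataset's metadata.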
To guide this effort, the approach of building up common data elements within a research community shows some promise. A domain-specific language could be developed to define these rules in a convenient way that would get broad adoption. A library could analyse data and propose candidate harmonisation rules based on common heuristics or other more or less simple algorithms. The resulting code to harmonise data becomes part of the metadata of the resulting data set and could be made available for reuse in its own right.
There is value and important context in harmonisation work, and for this reason the use of a harmonisation should be attributed to its author. Data owners may expect to place some constraints on what harmonisation they permit, as they know the original context of data collection, so it may be necessary to include draft harmonisation rules in a data access request.
Call for collaboration
Aridhia is interested in developing this capability and will work with partners on the solutions mentioned in this article to establish an open-source, collaborative software development workstream with a roadmap that reflects these outcomes. We also invite others not only to partner with us to enhance the API, but to adopt and improve it so as to integrate with others in a network of data sharing.