July 13, 2020 | Gary
July marks a milestone in Aridhia’s history as we celebrate the first release of our FAIR Data Services focusing on data discovery.
In the development of our FAIR Data Services there is much to consider, but first and foremost is the “F” of the FAIR Data Principles, or how to make data Findable. To be Findable:
- F1. (meta)data are assigned a globally unique and persistent identifier
- F2. Data are described with rich metadata
- F3. Metadata clearly and explicitly include the identifier of the data it describes
- F4. (meta)data are registered or indexed in a searchable resource. This is the key to findability, ensuring that the dataset and its associated assets can be discovered.
Although the FAIR Data Principles were published in 2016, their acceptance and implementation in research ecosystems were initially slow. However, there is now substantial demand for making data available and shareable to improve research quality, and hence patient outcomes, at a faster pace.
Outside of research circles it might be surprising to learn just how much time is dedicated to simply finding and organising data for analysis. Data scientists reported that this accounts for up to 80% of their working time. Similarly, by not knowing what data already exists, researchers can spend their time duplicating existing work. Lifting that burden allows for faster discovery, more efficient analysis and greater innovation.
The challenge is how to enable effective data discovery and help users understand if a particular dataset is relevant to answering their research question. As the FAIR Data Principles outline, there are two parts to addressing this.
Describing your Data
A dataset should be described with clear, concise, and exhaustive metadata. The quantity and quality of added metadata have a direct impact on the overall findability of a dataset and on what can be indexed by search resources. A well-described dataset should include:
- A catalogue entry describing the dataset at the instance level, with details such as name, description, author details, license, version and publisher. Keywords should allow the dataset to be classified and must be indexed by the search resource.
- One or more data dictionaries describing the dataset at the field level. As a minimum, this should include each field’s name, label, type and description.
- Any other associated resources, such as attachments that enhance the quantity and quality of the metadata, e.g. PDF or JSON files that provide additional information or context.
Curating well-described datasets sets an example: it can encourage other users to add more metadata and defines the standard expected of all users in the research ecosystem. Where possible, dataset descriptions should align with metadata standards, such as the Data Catalog Vocabulary (DCAT) for catalogue entries, to facilitate data sharing and interoperability between data discovery ecosystems. Metadata should also be available in a machine-readable format (e.g. JSON).
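As a rough illustration of machine-readable metadata, the sketch below writes a catalogue entry and a one-field data dictionary as JSON, using property names from DCAT (version 3) and Dublin Core Terms. Every value, the `REQUIRED_TERMS` list and the `missing_terms` helper are assumptions made for this example; they do not describe how FAIR Data Services stores or validates metadata.

```python
import json

# Hypothetical catalogue entry; all values are illustrative only.
catalogue_entry = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example asthma cohort",
    "dct:description": "Illustrative description of the dataset.",
    "dct:creator": "Example Author",
    "dct:publisher": "Example Research Group",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dcat:version": "1.0",
    "dcat:keyword": ["asthma", "respiratory"],
}

# Field-level data dictionary with the minimum set named above:
# name, label, type and description.
data_dictionary = [
    {
        "name": "score",
        "label": "Symptom score",
        "type": "integer",
        "description": "Illustrative symptom score per patient.",
    },
]

# Assumed minimum set of catalogue terms for this sketch.
REQUIRED_TERMS = ("dct:title", "dct:description", "dct:publisher",
                  "dct:license", "dcat:keyword")


def missing_terms(entry):
    """Return the required terms that are absent or empty in an entry."""
    return [term for term in REQUIRED_TERMS if not entry.get(term)]


# Serialise both parts into one machine-readable document.
serialised = json.dumps(
    {"catalogue": catalogue_entry, "dictionary": data_dictionary}, indent=2
)
print(serialised)
print("Missing terms:", missing_terms(catalogue_entry))
```

A check like `missing_terms` is one simple way to nudge contributors towards complete descriptions before an entry is published.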
Alongside user-added metadata, auto-generated metadata should also allow the user to obtain a greater understanding of the dataset:
- Dataset previews: displaying a subset of the dataset (e.g. limited to a number of rows) where users can easily understand the structure and content of the dataset. A preview may show the underlying dataset or synthetically generated data.
- Dataset profiles: statistical profiles of the data for numeric fields. For example, minimum and maximum values, averages and distributions.
- Dataset provenance: display the lineage of a dataset such as the transformations performed on the dataset from original creation to its current state.
- Persistent identifier: at the user’s request, a persistent identifier such as a Digital Object Identifier (DOI) should be generated and assigned to the data, resolving to the dataset’s landing page. Not only does a persistent identifier remove ambiguity about which data is being referenced, it can also be shared within the community to ensure findability for the lifecycle of the dataset. In cases where the data and metadata exist in separate files, persistent identifiers can provide explicit linkage between the two, satisfying the F3 principle.
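To make the preview and profile ideas concrete, here is a minimal sketch that produces a row-limited preview and simple statistics for numeric fields. The sample rows and the `preview` and `profile` helpers are hypothetical and only illustrate the idea; a real service would likely add distributions and handle much larger data.

```python
from statistics import mean

# Hypothetical dataset rows, invented for this example.
rows = [
    {"patient": "a", "score": 12, "age": 34},
    {"patient": "b", "score": 7, "age": 51},
    {"patient": "c", "score": 15, "age": 47},
]


def preview(rows, limit=2):
    """Return the first `limit` rows so users can see structure and content."""
    return rows[:limit]


def is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def profile(rows):
    """Compute min, max and mean for every field that holds numeric values."""
    stats = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        values = [row[field] for row in rows if is_number(row.get(field))]
        if values:
            stats[field] = {
                "min": min(values),
                "max": max(values),
                "mean": mean(values),
            }
    return stats


print(preview(rows))
print(profile(rows))
```

Non-numeric fields such as `patient` are skipped by the profile, which matches the description of statistical profiles applying to numeric fields.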
Searching for Data
A dataset may have exhaustive, high-quality metadata, but without a search resource it will never be found. Effective metadata and search capabilities play an equally important part in enabling data discovery.
Searching for data must suit a variety of users based on their “search experience”. The ability to use simple search criteria is typically balanced against the accuracy required of the search engine (i.e. precision vs recall). While complex search criteria may return more relevant results, nowadays there are various user-friendly smart search capabilities that can in part replace the need for complex queries to return a user’s expected results. For example:
- Filters and faceted search
- Relevant dataset suggestions
- Search helpers and auto-complete
- Real-time optimised search results where past searches and user audit feed back into search to improve search quality.
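The first and third of these capabilities can be sketched over a small in-memory catalogue. The datasets and the `apply_filters`, `facet_counts` and `autocomplete` helpers below are invented for illustration, and a production search engine would compute these from its index rather than by scanning a list.

```python
from collections import Counter

# Hypothetical catalogue entries, invented for this example.
datasets = [
    {"title": "Asthma cohort", "publisher": "Org A",
     "keywords": ["asthma", "respiratory"]},
    {"title": "Bronchitis registry", "publisher": "Org B",
     "keywords": ["bronchitis", "respiratory"]},
    {"title": "Diabetes survey", "publisher": "Org A",
     "keywords": ["diabetes"]},
]


def apply_filters(datasets, **filters):
    """Keep datasets matching every filter; list fields match by membership."""
    def matches(dataset, field, wanted):
        value = dataset.get(field)
        if isinstance(value, list):
            return wanted in value
        return value == wanted

    return [d for d in datasets
            if all(matches(d, f, w) for f, w in filters.items())]


def facet_counts(datasets, field):
    """Count how many datasets fall under each value of a field."""
    counts = Counter()
    for dataset in datasets:
        value = dataset.get(field)
        counts.update(value if isinstance(value, list) else [value])
    return dict(counts)


def autocomplete(datasets, prefix):
    """Suggest known keywords that start with what the user has typed."""
    keywords = {k for d in datasets for k in d["keywords"]}
    return sorted(k for k in keywords if k.startswith(prefix.lower()))


respiratory = apply_filters(datasets, keywords="respiratory")
print([d["title"] for d in respiratory])
print(facet_counts(respiratory, "publisher"))
print(autocomplete(datasets, "b"))
```

Showing facet counts next to each filter lets a user narrow results without writing a query at all.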
Search should not only enable datasets to become ‘Findable’; the search itself should be ‘Reusable’, so that searches can be saved and re-run at a later date. Furthermore, URLs of searches should be copiable so that searches can be shared within the community. Options for this are built into the user interface of FAIR Data Services.
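One common way to make a search shareable is to encode its criteria in the URL query string, so that opening the link re-runs the same search. The sketch below shows the idea with Python's standard library; the base URL and parameter names are hypothetical and are not the actual URL scheme of FAIR Data Services.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical search endpoint, used only for illustration.
BASE_URL = "https://fair.example.org/search"


def search_to_url(query, filters):
    """Encode a free-text query and its filters as a shareable URL."""
    params = [("q", query)] + sorted(filters.items())
    return f"{BASE_URL}?{urlencode(params)}"


def url_to_search(url):
    """Recover the query and filters from a shared URL."""
    params = parse_qs(urlsplit(url).query)
    query = params.pop("q", [""])[0]
    filters = {name: values[0] for name, values in params.items()}
    return query, filters


url = search_to_url("asthma", {"publisher": "Org A"})
print(url)
print(url_to_search(url))
```

Because the URL carries the full search state, saving a search can be as simple as storing this string.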
Despite the many user-friendly approaches to search, ultimately a search engine is only as good as the data it indexes. While there is some variability in approaches to indexing, certain use cases should be the foundation of searching across datasets and metadata. Users should be able to find data via:
- Catalogue entry information (e.g. title, description, source, publisher, etc)
- Data dictionaries (e.g. the fields of the dataset)
- Tags/keywords that are used to classify datasets
- Persistent identifiers such as DOIs
- Data type
- Dataset values (e.g. find a dataset with field ‘score’ > 10)
- Semantics and ontologies (e.g. searching for asthma returns results with bronchitis)
Search should understand semantics, which can be difficult in an ecosystem like biomedicine with so many different ontological systems. What is required is a harmonisation of these semantic repositories.
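Two of the use cases above, searching by dataset values and by related terms, can be sketched as follows. The `RELATED_TERMS` mapping is a hypothetical stand-in for a real ontology lookup, and the datasets and helper functions are invented for this example.

```python
# Hypothetical stand-in for an ontology: each term maps to related terms.
RELATED_TERMS = {"asthma": {"bronchitis"}}

# Hypothetical indexed datasets with keywords and a few rows of values.
datasets = [
    {"title": "Asthma cohort", "keywords": ["asthma"],
     "rows": [{"score": 12}, {"score": 4}]},
    {"title": "Bronchitis registry", "keywords": ["bronchitis"],
     "rows": [{"score": 9}]},
    {"title": "Diabetes survey", "keywords": ["diabetes"],
     "rows": [{"score": 15}]},
]


def expand(term):
    """Return the search term together with any related terms."""
    term = term.lower()
    return {term} | RELATED_TERMS.get(term, set())


def search_keywords(datasets, term):
    """Find datasets tagged with the term or one of its related terms."""
    terms = expand(term)
    return [d["title"] for d in datasets if terms & set(d["keywords"])]


def search_values(datasets, field, threshold):
    """Find datasets with at least one row where `field` exceeds `threshold`."""
    def exceeds(row):
        value = row.get(field)
        return isinstance(value, (int, float)) and value > threshold

    return [d["title"] for d in datasets if any(exceeds(r) for r in d["rows"])]


print(search_keywords(datasets, "asthma"))
print(search_values(datasets, "score", 10))
```

Expanding the query before matching is only one option; related terms could equally be added to each dataset at indexing time, which trades index size for faster queries.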
How we address Findability
The first release of Aridhia FAIR Data Services addresses much of the above by giving researchers and innovators the ability to discover and understand data through dataset search, classification and efficient browsing of the metadata held in dataset catalogues, dictionaries and associated attached assets.
Specifically, this release comprises the following features:
- Role-based Access Control
- Built on Standards
- Integration with Aridhia Workspaces
- Privacy by Design
Throughout the rest of 2020, more features will be rolled out as they are completed. This includes:
- Further integration with Aridhia Workspaces that allows users to transfer data directly from FAIR Data Services for secure and compliant data analytical capabilities.
- The ability to optionally de-identify data en route to your Workspace to satisfy compliance and governance requirements. Approved administrators may then re-identify data.
- Exploration of datasets made possible via statistical profiles and row-level previews of data (either synthetic or real).
- Querying and filtering of datasets so that derived subsets and combinations of data can be easily generated.
- Auditing of actions to show how data is being used on the service. This will include things like DOI citation tracking to show where and by whom the data is being utilised.
- Federated data sharing that connects you with other compliant data silos or FAIR services conforming to a defined API.