Blogs & News

Findability: The release of Aridhia FAIR Data Services

July marks a milestone in Aridhia’s history as we celebrate the first release of our FAIR Data Services focusing on data discovery.

In the development of our FAIR Data Services there is much to consider, but first and foremost is the “F” of the FAIR Data Principles, or how to make data Findable. To be Findable:

  • F1. (meta)data are assigned a globally unique and persistent identifier
  • F2. Data are described with rich metadata
  • F3. Metadata clearly and explicitly include the identifier of the data it describes
  • F4. (meta)data are registered or indexed in a searchable resource. This is the key to findability, ensuring that a dataset and its associated assets can be discovered.

Despite the FAIR Data Principles being published in 2016, their acceptance and implementation into research ecosystems was initially slow. However, there is now a substantial demand for making data available and shareable to improve research quality and hence patient outcomes at a faster pace.

Outside of research circles it might be surprising to learn just how much time is dedicated to simply finding and organising data for analysis. Data scientists reported that this accounts for up to 80% of their working time. Similarly, by not knowing what data already exists, researchers can spend their time duplicating existing work. Lifting that burden allows for faster discovery, more efficient analysis and greater innovation.

The challenge is how to enable effective data discovery and help users understand if a particular dataset is relevant to answering their research question. As the FAIR Data Principles outline, there are two parts to addressing this.

Describing your Data

A dataset should be described with clear, concise, and exhaustive metadata. The quantity and quality of added metadata have a direct impact on the overall findability of a dataset and on what can be indexed by search resources. A well-described dataset should include:

  • A catalogue entry describing the dataset at the instance-level such as name, description, author details, license, version and publisher. Keywords should allow the dataset to be classified and must be indexed by the search resource.
  • One or more data dictionaries describing the dataset at the field-level. As a minimum, this should include the field’s name, label, type and description.
  • Any other associated resources, such as attachments that enhance the quantity and quality of the metadata, e.g. PDF or JSON files that provide additional information or context.
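As an illustration of the field-level descriptions a data dictionary might carry, a single entry could look like the following. The field names, types and constraints here are hypothetical examples, not the FAIR Data Services schema:

```json
{
  "field_name": "systolic_bp",
  "label": "Systolic blood pressure",
  "type": "integer",
  "units": "mmHg",
  "description": "Systolic blood pressure at the baseline visit.",
  "constraints": { "minimum": 0, "maximum": 300 }
}
```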

Leading by example and curating well-described datasets can encourage other users to add more metadata, and sets the standard of what is expected of all users in the research ecosystem. Where possible, dataset descriptions should align with metadata standards, such as the Data Catalog Vocabulary (DCAT) for catalogue entries, to facilitate data sharing and interoperability between data discovery ecosystems. Metadata should also be available in a machine-readable format (e.g. JSON).
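To make the DCAT alignment concrete, a minimal catalogue entry expressed as JSON-LD might look like this. All values (titles, publisher, identifier) are illustrative placeholders, not a real record:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@type": "dcat:Dataset",
  "dct:title": "Example Cohort Study",
  "dct:description": "Illustrative catalogue entry for a clinical cohort dataset.",
  "dct:publisher": "Example Institute",
  "dct:license": "https://creativecommons.org/licenses/by/4.0/",
  "dcat:keyword": ["cohort", "respiratory"]
}
```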

Alongside user-added metadata, auto-generated metadata should also help the user gain a greater understanding of the dataset:

  • Dataset previews: displaying a subset of the dataset (e.g. limited to a number of rows) where users can easily understand the structure and content of the dataset. A preview may show the underlying dataset or synthetically generated data.
  • Dataset profiles: statistical profiles of the data for numeric fields. For example, minimum and maximum values, averages and distributions.
  • Dataset provenance: display the lineage of a dataset such as the transformations performed on the dataset from original creation to its current state.
  • Persistent identifier: at the user’s request, a persistent identifier such as a Digital Object Identifier (DOI) should be generated and assigned to the data, resolving to the dataset’s landing page. Not only does a persistent identifier remove ambiguity about the referenced data; identifiers can also be shared within the community to ensure findability throughout the lifecycle of the dataset. In cases where the data and metadata exist in separate files, persistent identifiers can provide an explicit link between the two, satisfying the F3 principle.
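The dataset-profile idea above can be sketched in a few lines of Python. The field and values are illustrative; a production service would compute profiles over the stored data rather than an in-memory list:

```python
from statistics import mean, quantiles

def profile_numeric_field(values):
    """Return a simple statistical profile for a numeric field."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        # Quartile boundaries give a coarse view of the distribution.
        "quartiles": quantiles(values, n=4),
    }

# Hypothetical values for a numeric field called "score".
profile = profile_numeric_field([4, 8, 15, 16, 23, 42])
```

Surfacing a profile like this next to the catalogue entry lets a researcher judge a dataset's range and shape without downloading it.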

Searching for Data

A dataset may have exhaustive, high-quality metadata, but without a search resource it will never be found. Effective metadata and search capabilities each play an equally important part in enabling data discovery.

Searching for data must suit a variety of users based on their “search experience”. The ability to use simple search criteria is typically balanced against the accuracy required of the search engine (i.e. precision vs recall). While complex search criteria may return more relevant results, there are now various user-friendly smart search capabilities that can in part replace the need for complex queries to return a user’s expected results. For example:

  • Filters and faceted search
  • Relevant dataset suggestions
  • Search helpers and auto-complete
  • Real-time optimised search results, where past searches and user audit data feed back into the engine to improve result quality.
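Filters and faceted search can be illustrated with a small sketch. The catalogue entries and the `tags` facet are hypothetical, and a real service would compute facet counts inside its search engine rather than in application code:

```python
from collections import Counter

# Hypothetical catalogue entries; titles and tags are illustrative only.
datasets = [
    {"title": "Asthma cohort", "tags": ["respiratory", "cohort"]},
    {"title": "COPD registry", "tags": ["respiratory", "registry"]},
    {"title": "Cardio trial", "tags": ["cardiology", "trial"]},
]

def facet_counts(entries, facet):
    """Count how many datasets fall under each value of a facet (e.g. tags)."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.get(facet, []))
    return counts

def filter_by_tag(entries, tag):
    """Narrow the result set to datasets carrying the selected facet value."""
    return [e for e in entries if tag in e.get("tags", [])]
```

Showing the counts next to each facet value (e.g. “respiratory (2)”) lets users narrow results without composing a query by hand.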

Search should not only enable datasets to become ‘Findable’; the search itself should be ‘Reusable’, allowing searches to be saved and re-run at a later date. Furthermore, URLs of searches should be copiable so that searches can be shared within the community. Options for this are built intuitively into the user interface of FAIR Data Services.

Despite the many user-friendly approaches to search, ultimately a search engine is only as good as the data it indexes. While approaches to indexing vary, certain use cases should form the foundation of searching across datasets and metadata. Users should be able to find data via:

  • Catalogue entry information (e.g. title, description, source, publisher, etc)
  • Data dictionaries (e.g. the fields of the dataset)
  • Tags/keywords that are used to classify datasets
  • Persistent identifier such as DOIs
  • Data type
  • Dataset values (e.g. find a dataset with field ‘score’ > 10)
  • Semantics and ontologies (e.g. searching for asthma returns results with bronchitis)

Searching for data should also understand semantics, which can be difficult in a field like biomedicine with so many different ontological systems. What is required is a harmonisation of these semantic repositories.
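The semantic-expansion idea (a search for asthma also surfacing bronchitis) can be sketched with a toy synonym map standing in for a harmonised ontology service. The terms and relations here are illustrative, not drawn from any real ontology:

```python
# Toy stand-in for a harmonised ontology service; relations are illustrative.
RELATED_TERMS = {
    "asthma": {"asthma", "bronchitis", "respiratory disease"},
}

def expand_query(term):
    """Expand a search term with ontology-related terms (falls back to the term itself)."""
    return RELATED_TERMS.get(term.lower(), {term.lower()})

def semantic_search(entries, term):
    """Match catalogue entries whose tags overlap the expanded term set."""
    expanded = expand_query(term)
    return [e for e in entries if expanded & {t.lower() for t in e["tags"]}]

# Hypothetical catalogue entries.
catalogue = [
    {"title": "Bronchitis study", "tags": ["bronchitis"]},
    {"title": "Diabetes cohort", "tags": ["diabetes"]},
]
```

In practice the expansion step would query curated vocabularies (e.g. clinical terminologies) rather than a hard-coded map, which is exactly why harmonising those repositories matters.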

How we address Findability

The first release of Aridhia FAIR Data Services addresses much of the above, giving researchers and innovators the ability to discover and understand data through dataset search, classification and efficient metadata browsing, with datasets described via catalogues, data dictionaries and associated attachments.

Specifically, this release comprises the following features:

Data Discovery
  • Search for datasets relevant to your research project using text-based simple or complex search queries.
Metadata Browsing
  • Understand existing datasets by viewing metadata including catalogue and field-level descriptions.
  • Download machine-readable dataset metadata.
Metadata Management
  • Upload your dataset metadata and associated attachments (e.g. PDF or JSON files) to be discovered by others.
Role-based Access Control
  • Self-service signup with role-based user permissions. This includes read only and edit/update roles.
Built on Standards
  • Uses the Data Catalog Vocabulary (DCAT) for dataset instance-level descriptions.
Integration with Aridhia Workspaces
  • Single Sign On (SSO) between FAIR and Workspace services.
  • Consistent Aridhia DRE user interface.
Privacy by Design
  • Secure data access and management via MFA, RBAC, encryption and secure key management.
  • ISO 27001 accredited.
Cloud-native Service
  • Developed and hosted on the cloud.
  • Integrates with and improves on cloud technologies.

Throughout the rest of 2020, more features will be rolled out as they are completed. This includes:

  • Further integration with Aridhia Workspaces, allowing users to transfer data directly from FAIR Data Services for secure and compliant data analysis.
  • The ability to optionally de-identify data en route to your Workspace to satisfy compliance and governance requirements. Approved administrators may then re-identify data.
  • Exploration of datasets made possible via statistical profiles and row-level previews of data (either synthetic or real).
  • Querying and filtering of datasets so that derived subsets and combinations of data can be easily generated.
  • Auditing actions that show how data is being used on the service. This will include things like DOI citation tracking to show where and by whom the data is being used.
  • Federated data sharing that connects you with other compliant data silos or FAIR services conforming to a defined API.

For more information about the service, view our Aridhia FAIR Data Services web page. Alternatively feel free to contact us.



Andrew joined Aridhia in January 2018 to support the Enablement Team. He studied Ecology and Animal Behaviour at the University of St Andrews before working in various sales and marketing positions for technology companies. Outside of sales/marketing, Andrew also provides client support for the likes of Great Ormond Street Children’s Hospital and the European Prevention of Alzheimer’s Dementia Consortium (EPAD).