AI Meets FAIR: Enabling Semantic Search in Trusted Research Environments

Last year we blogged about possible ways of integrating AI services with the DRE. One of the options was introducing AI-backed vector search for FAIR Data Services, our metadata catalogue.

This blog explains what vector search is, details the changes we have made to integrate it with FAIR, and describes how hub owners can enable and optimise this service.

What is vector search?

FAIR’s standard Azure Cognitive Search (ACS) returns results based on string similarity. For example, to return an Amyotrophic Lateral Sclerosis (ALS) dataset the user most likely has to enter a search term which contains either ‘Amyotrophic Lateral Sclerosis’ or ‘ALS’.

The standard search does support Lucene syntax, so users familiar with it can add boolean operators or various wildcard and fuzzy match options to their search. It still fundamentally depends on string matches to return results, however. This is fine for users searching for specific conditions like ALS, but it is insufficient for users with more general search requirements.

A user may be interested in a variety of conditions that fall under the umbrella of Motor Neurone Disease (MND), like Progressive Bulbar Palsy (PBP) or Primary Lateral Sclerosis (PLS). Using the standard search, it would not be possible to submit a simple search using the term ‘MND’ that returns results containing datasets for all of these conditions. In fact, it may not return any results if none of the datasets contain the specific term ‘MND’.
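The limitation can be illustrated with a minimal sketch. The dataset titles and the `string_search` helper below are invented for illustration, and the matching is a deliberately simplified whole-word match, not how FAIR or Azure Cognitive Search is implemented.

```python
import re

# Hypothetical dataset titles, invented for illustration only.
datasets = [
    "Amyotrophic Lateral Sclerosis (ALS) cohort",
    "Progressive Bulbar Palsy (PBP) registry",
    "Primary Lateral Sclerosis (PLS) study",
]


def tokens(text):
    """Split text into a set of lower-case alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def string_search(term, titles):
    """Return the titles that contain every word of the search term."""
    wanted = tokens(term)
    return [title for title in titles if wanted <= tokens(title)]


als_hits = string_search("ALS", datasets)  # matches only the ALS dataset
mnd_hits = string_search("MND", datasets)  # no title contains the word 'MND'
```

Here a search for 'ALS' finds the ALS dataset, but a search for 'MND' returns nothing, even though all three datasets describe forms of MND.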

Vector search presents a solution to this problem as it matches on semantic similarity, not string similarity. This means that a user could submit a search for ‘MND’ and reasonably expect to see results for ALS, PBP and PLS.

How does vector search work?

A full explanation of how vector search works can be found here. In brief, each piece of indexed metadata is transformed into a list of numbers called a vector.

This vector is a coordinate used to position the metadata on a multi-dimensional map. Semantically similar values should generate vectors that are close together on this map. To extend our example above, we would expect the vectors for ALS and PBP to be closer together than the vectors for ALS and measles.

A vector is then generated for every search term users enter. This vector is used to position the search term on the same map and identify semantic similarities with the indexed metadata. To reuse the example above, the vector for the search term ‘MND’ should be closer on our multi-dimensional map to the vector for ALS than to the vector for measles.
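As a rough sketch of the idea, the snippet below compares made-up three-dimensional vectors using cosine similarity, one common way of measuring how close two vectors are. The numbers are invented for illustration: real embedding vectors have hundreds or thousands of dimensions and come from an embedding model, and we are only assuming cosine similarity here, not stating which measure FAIR uses.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for indexed metadata (illustrative only).
vectors = {
    "ALS": [0.9, 0.8, 0.1],
    "PBP": [0.8, 0.9, 0.2],
    "measles": [0.1, 0.2, 0.9],
}

# Made-up vector standing in for the search term 'MND'.
query = [0.85, 0.85, 0.15]

# Rank the indexed items from most to least similar to the query.
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
```

With these toy values, ALS and PBP both score far higher against the query than measles does, so measles ends up last in `ranked`.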

This process requires indexed metadata to be sent to Microsoft’s OpenAI service, and while this is always hosted in the same region as the DRE hub, it is not under Aridhia’s control. We understand that not all data owners will be comfortable with this, and therefore vector search is currently disabled by default on all hubs. Vector search cannot be enabled on a dataset-by-dataset basis; it is either on or off at the hub level.

Enabling Vector Search

Given this sensitivity, enabling vector search on a hub is a multi-step process:

Contact the Aridhia service desk: They can enable the vector search service on your hub. This is a necessary first step.
Update search configuration: The default FAIR search configuration contains a pre-set weighting for vector search, and for all required metadata elements. These pre-set weights are retained by custom search configurations, but can be modified. Instructions for applying a weight to specific catalogue fields and adding them to the vector index are available here.
Add vector search permission to required roles: None of the standard roles in FAIR have vector search enabled by default. To use vector search, users need to be given a role that has the new vector search permissions enabled.

If you would like more information on this process, please contact your Customer Success Manager.

Optimising Vector Search

The introduction of vector search has also required us to introduce new configuration options for FAIR search, to help hub owners with different types and volumes of metadata optimise search for their own purposes. Hub owners now have the following options when managing their search configuration.

Vector vs String

This allows the hub owner to determine the relative weight given to results returned due to string similarity from standard search versus semantic similarity from vector search. The default FAIR search configuration gives more weight to results returned by string similarity. This means that when results containing both types of match are returned, those based on string similarity will be ranked highest.
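One way to picture this weighting is as a simple weighted sum of two scores. The sketch below is an assumption made for illustration: the `blended_score` function, the weight values and the scores are all hypothetical, and FAIR's actual scoring formula is not described here.

```python
def blended_score(string_score, vector_score, string_weight=0.7, vector_weight=0.3):
    """Combine a string-match score and a vector-match score (hypothetical formula)."""
    return string_weight * string_score + vector_weight * vector_score


# Two invented results with equally strong matches of different kinds.
results = {
    "Dataset A (string match)": {"string": 0.8, "vector": 0.0},
    "Dataset B (semantic match)": {"string": 0.0, "vector": 0.8},
}


def rank(results, string_weight, vector_weight):
    """Order result names from highest to lowest blended score."""
    return sorted(
        results,
        key=lambda name: blended_score(
            results[name]["string"],
            results[name]["vector"],
            string_weight,
            vector_weight,
        ),
        reverse=True,
    )


string_first = rank(results, string_weight=0.7, vector_weight=0.3)
vector_first = rank(results, string_weight=0.3, vector_weight=0.7)
```

With the string-heavy weights, Dataset A is ranked first. Swapping the weights puts Dataset B first, which is the kind of trade-off this setting lets a hub owner control.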

Field Weights

The hub owner can also modify the weight given to different metadata elements. As above, where a particular metadata element is given more weight than others, search matches based on it will be ranked highest in the user’s results.

For example, in our default configuration, matches on a dataset title are given greater weight than those returned by matches on the dataset description.

This means that when a user searches for ‘Alzheimer’s’, datasets where the term appears in the title will be ranked higher than those where it only appears in the description. Our current default configuration gives the highest weight to the dataset title, followed by the catalogue description, dictionaries and lookups.
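The effect of field weights can be sketched as follows. The weight values, field names and datasets are invented for illustration, and only the relative ordering of title above description is taken from the default configuration described above. The scoring is a simplified substring match, not FAIR's actual implementation.

```python
# Hypothetical weights: only the ordering (title > description > others)
# reflects the text above; the numbers themselves are invented.
FIELD_WEIGHTS = {
    "title": 4.0,
    "description": 2.0,
    "dictionaries": 1.5,
    "lookups": 1.0,
}

# Invented catalogue entries for illustration.
catalogue = [
    {
        "title": "Dementia imaging cohort",
        "description": "Includes participants with Alzheimer's disease.",
    },
    {
        "title": "Alzheimer's biomarker study",
        "description": "Longitudinal plasma samples.",
    },
]


def weighted_score(term, dataset, weights):
    """Sum the weights of every field whose text contains the search term."""
    needle = term.lower()
    return sum(
        weight
        for field, weight in weights.items()
        if needle in dataset.get(field, "").lower()
    )


ranked = sorted(
    catalogue,
    key=lambda dataset: weighted_score("alzheimer", dataset, FIELD_WEIGHTS),
    reverse=True,
)
```

The dataset with the term in its title scores 4.0 and is ranked above the one where it appears only in the description, which scores 2.0.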

In addition to the above, hub owners can decide what catalogue fields they want to index for search, and what weight to give them. Full details of this are available in our Knowledge Base.

If you would like to know more about vector search in general, or about how to enable it within a DRE hub, please get in touch.

Ross Stiven