Last year we blogged about possible ways of integrating AI services with the DRE. One of the options was introducing AI-backed vector search for FAIR Data Services, our metadata catalogue.
This blog explains what vector search is, details the changes we have made to integrate it with FAIR, and describes how hub owners can enable and optimise this service.
FAIR’s standard Azure Cognitive Search (ACS) returns results based on string similarity. For example, to return an Amyotrophic Lateral Sclerosis (ALS) dataset the user most likely has to enter a search term which contains either ‘Amyotrophic Lateral Sclerosis’ or ‘ALS’.
The standard search does support Lucene syntax, so users familiar with it can add Boolean operators or various wildcard and fuzzy match options to their search, but it still fundamentally depends on string matches to return results. This is fine for users searching for specific conditions like ALS, but it is insufficient for users with more general search requirements.
A user may be interested in a variety of conditions that fall under the umbrella of Motor Neurone Disease (MND), like Progressive Bulbar Palsy (PBP) or Primary Lateral Sclerosis (PLS). Using the standard search, it would not be possible to submit a simple search using the term ‘MND’ that returns results containing datasets for all of these conditions. In fact, it may not return any results if none of the datasets contain the specific term ‘MND’.
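The limitation can be illustrated with a small sketch. This is not how Azure Cognitive Search is implemented; it is a simplified whole-word matcher written for this post, and the dataset titles are hypothetical.

```python
import re

# Hypothetical dataset titles, invented purely for illustration.
DATASETS = [
    "Amyotrophic Lateral Sclerosis (ALS) clinical cohort",
    "Progressive Bulbar Palsy (PBP) registry",
    "Primary Lateral Sclerosis (PLS) imaging study",
]


def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def string_search(term, titles):
    """Return the titles that contain every word of the search term."""
    wanted = tokens(term)
    return [title for title in titles if wanted <= tokens(title)]


als_hits = string_search("ALS", DATASETS)  # finds only the ALS dataset
mnd_hits = string_search("MND", DATASETS)  # finds nothing at all

print(als_hits)
print(mnd_hits)
```

All three datasets are relevant to a user interested in MND, but because none of the titles contains the literal term, the string-based search comes back empty.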
Vector search presents a solution to this problem as it matches on semantic similarity, not string similarity. This means that a user could submit a search for ‘MND’ and reasonably expect results for ALS, PBP and PLS to be returned.
A full explanation of how vector search works can be found here, but in brief, each piece of indexed metadata is transformed into a list of numbers called a vector.
This vector is a coordinate used to position the metadata on a multi-dimensional map. Semantically similar values should generate vectors that are close together on this map. To extend our example above, we would expect the vectors for ALS and PBP to be closer together than the vectors for ALS and measles.
A vector value is then generated for every search term users enter. This can then be used to position the search term on the vector map and identify semantic similarities with the indexed metadata. To reuse the example above, the vector for the search term ‘MND’ should be closer on our multi-dimensional map to the vector for ALS than the vector for measles.
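To make the idea of "closeness" concrete, here is a minimal sketch using cosine similarity, one common way of comparing vectors. The three-dimensional vectors below are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the source does not state which similarity measure FAIR uses.

```python
import math

# Toy vectors, made up for this example. They are not real embeddings.
VECTORS = {
    "ALS": [0.90, 0.80, 0.10],
    "PBP": [0.85, 0.75, 0.20],
    "measles": [0.10, 0.20, 0.95],
}

# Hypothetical vector for the search term 'MND'.
QUERY_MND = [0.88, 0.82, 0.15]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rank the indexed terms by similarity to the query, most similar first.
ranked = sorted(
    VECTORS,
    key=lambda name: cosine_similarity(QUERY_MND, VECTORS[name]),
    reverse=True,
)

for name in ranked:
    print(name, round(cosine_similarity(QUERY_MND, VECTORS[name]), 3))
```

With these toy values, ALS and PBP score far higher against the 'MND' query than measles does, which is the behaviour the map analogy describes.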
This process requires indexed metadata to be sent to Microsoft’s OpenAI service, and while this is always hosted in the same region as the DRE hub, it is not under Aridhia’s control. We understand that not all data owners will be comfortable with this, and therefore vector search is currently disabled by default on all hubs. Vector search cannot be enabled on a dataset-by-dataset basis; it is either on or off at the hub level.
Given this sensitivity, enabling vector search on a hub is a multi-step process:
If you would like more information on this process, please contact your Customer Success Manager.
The introduction of vector search has also required us to introduce new configuration options for FAIR search, to help hub owners with different types and volumes of metadata optimise search for their own purposes. Hub owners now have the following options when managing their search configuration.
The first option allows the hub owner to determine the relative weight given to results returned due to string similarity from standard search versus semantic similarity from vector search. The default FAIR search configuration gives more weight to results returned by string similarity. This means that when results containing both types of match are returned, those based on string similarity will be ranked highest.
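The effect of weighting the two search types can be sketched as follows. This assumes a simple weighted sum of two scores, which may not be how FAIR or Azure actually combine results; the weights, scores and dataset names are all hypothetical.

```python
# Hypothetical weights: string similarity counts for more than semantic
# similarity, mirroring the default described above. The real values and
# the real scoring formula are not given in this post.
STRING_WEIGHT = 0.7
VECTOR_WEIGHT = 0.3


def hybrid_score(string_score, vector_score):
    """Combine the two scores as a weighted sum (an assumed formula)."""
    return STRING_WEIGHT * string_score + VECTOR_WEIGHT * vector_score


# (string_score, vector_score) pairs for two invented results that match
# equally strongly, one by string and one by meaning.
results = {
    "string-match dataset": (0.8, 0.0),
    "semantic-match dataset": (0.0, 0.8),
}

ranked = sorted(
    results,
    key=lambda name: hybrid_score(*results[name]),
    reverse=True,
)

print(ranked)
```

Swapping the two weights would reverse the order, which is the kind of adjustment a hub owner might make if their users mostly search by broad topic.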
The hub owner can also modify the weight given to different metadata elements. As above, where a particular metadata element is given more weight than others, search matches based on it will be ranked highest in the user’s results.
For example, in our default configuration, matches on a dataset title are given greater weight than those returned by matches on the dataset description.
This means that when a user searches for ‘Alzheimer’s’, datasets where the term appears in the title will be ranked higher than those where it only appears in the description. Our current default configuration gives the highest weight to the dataset title, followed by the catalogue description, dictionaries and lookups.
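The ranking behaviour described here can be sketched with a toy scorer. The numeric weights are assumptions chosen only to preserve the ordering given above (title highest, then description, dictionaries and lookups); the records are invented, and the real FAIR scoring is not shown in this post.

```python
# Hypothetical field weights. Only the ordering reflects the text above.
FIELD_WEIGHTS = {
    "title": 4,
    "description": 3,
    "dictionaries": 2,
    "lookups": 1,
}

# Invented catalogue records for illustration.
RECORDS = [
    {
        "title": "Dementia biomarker study",
        "description": "Includes participants with Alzheimer's disease.",
    },
    {
        "title": "Alzheimer's Disease Imaging Cohort",
        "description": "MRI scans collected over five years.",
    },
]


def field_score(term, record):
    """Sum the weights of every field that contains the search term."""
    needle = term.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if needle in record.get(field, "").lower()
    )


ranked = sorted(
    RECORDS,
    key=lambda record: field_score("Alzheimer's", record),
    reverse=True,
)

for record in ranked:
    print(field_score("Alzheimer's", record), record["title"])
```

The imaging cohort ranks first because the term appears in its title, even though both records mention it somewhere.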
In addition to the above, hub owners can decide which catalogue fields they want to index for search, and what weight to give them. Full details of this are available in our Knowledge Base.
If you would like to know more about vector search in general, or about how to enable it within a DRE hub, please get in touch.
August 28, 2025
Ross joined the Aridhia Product Team in January 2022. He is the Product Owner for FAIR Data Services, and Aridhia's open source federation project. He works with our customers to understand their needs, and with our Development Team to introduce new features and improve our products. Outside of work, he likes to go hill walking and is slowly working his way through Scotland's Munros.