February 8, 2023 | Amanda
Please indulge me in a brief moment of self-pity about a member of my graduate committee who told me that my thesis, which employed geospatial analyses and custom algorithms to create a decision support tool, was a ‘glorified database’ and not ‘real’ science. You’ll note my gratuitous use of air quotes as an indication that I am still mildly irritated years later. Happily, I graduated soon after that with great committee reviews, but that advisor was a bit of a Luddite and needed explanations from the other members. The great value of that irritation to me is that our conversation and our defense of my work were among the first times I was acutely aware that both the tool and its skillful and ethical application are critical parts of data science success. Said another way, a scalpel is not a solution without a trained surgeon, and a surgeon is limited by access to the proper tools. And that is a valuable lesson.
Conversely (and moving us closer to the point of this discussion), even a great surgeon cannot fix a case of adult-onset diabetes with a scalpel. But some cases are not as obvious as a lifestyle change versus an operation, so it follows that one quality of a great surgeon is the experience to know when to employ a scalpel and when to abstain. Similarly, perhaps that is why we sometimes try to solve a health data problem with a technology solution – the difference between a data problem and a tech problem can be murky. So, let’s start by trying to build some clearer fences between data problems and tech problems, and then we’ll circle back to the cases without obvious guardrails.
Some examples of clear challenges with data – adding more memory or computational power will not fix these:
• No repeated measures, lack of longitudinality needed for desired analysis
• Inconsistent variable collection – not enough n to provide statistical power
• No quantitative measures, lack of units, inability to convert between calculations
• Lack of metadata like case report forms or data dictionaries with domains of values
Some examples of clear technology challenges – the highest quality data may still withhold insights if the right technology is not available:
• Big data needs big compute resources (e.g., GPU needs for machine learning/AI)
• Security and privacy concerns, especially for health data
• Access to the right software for analysis and package management for reproducible models
When It’s Both
Sometimes the challenges are rooted in a bit of both worlds, admittedly. For example, data cannot be reusable without the hard work of defining metadata, but the process can be made simpler and more standardized by technology applications such as Aridhia’s FAIR Data Services. And posting study data to a website is not enough to meet the accessibility sniff test according to FAIR data standards, but I can find and request access to a study or even aggregate multiple studies according to each of their governance policies with the right data platform.
Aridhia DRE takes this a step further and seamlessly integrates the data request process into the metadata exploration workflow and provides these data in a private Workspace for me, so that I can take advantage of analysis tools that I may not have on my personal or work computer. In these instances, metadata and data quality still matter, but technology can reduce manual labor, reduce inefficiency, and increase data FAIRness.
However, many of us fruitlessly try to solve a data problem with a technology solution and find ourselves in a quagmire. In these instances, setting expectations becomes crucial to success. I still hear rumblings of the logical fallacy that if one has poor-quality data, but lots of it, a powerful enough computer and some AI algorithms can provide magical insights. At best, we may learn about bias or decide that a new, well-defined study is warranted for collecting better data.
And when our data lacks statistical power – great quality but not enough of it – we can employ technology solutions to help with semantics and aggregation. But this will never be a replacement for the human investment in manual (or computer-assisted but human-supervised) mapping and cleaning. Data harmonization tools that modify raw data or automatically exclude records can introduce bias. They are great for combing the internet for buying patterns but won’t work for regulated health data. If you use them, temper your expectations with the knowledge that your findings will not meet submission standards for regulatory agencies and would likely be unsuitable for publication that guides clinical practice. However, automated harmonization may save you enough time to make it a worthwhile test of whether the aggregated data sets are valuable enough to invest in a more careful and defensible curation plan.
Even ChatGPT Can’t Help You Here
And the biggest quagmire of all? In my opinion, that must be cultural issues with sharing data! These issues will not be solved with the latest technology. I hear folks intermingling the definitions of interfaces, federation, and APIs as a panacea that cures all data sharing maladies. The thinking here goes something like this: If you do not have access to data, you can just [fill in interface]. But if a person or organization is unwilling to transfer data – or indeed to share at all – there is a limit to what interfaces can do.
If you are running a study across multiple collection sites with different governance or legal restrictions, then a federated approach is a fantastic idea. This allows you to share data models and keep data semantically interoperable and available for analysis across sites. If you are sharing data in a single direction such as sending a list of mutations from a genomics platform to a clinical trial database with a patient identifier in common, then an interface makes good sense.
But if you are unable to access record-level data because the data controller will not share it, then sending an analysis request has limits and needs deliberate consideration of the cost-benefit analysis. And, as mentioned with harmonization, you will be unable to submit all raw data for a publication or regulatory submission. Moreover, without access to raw data, you will be at risk of unnoticed semantic mismatches and the hazards associated with curation differences when combining it with your own data. And importantly, remember that analysis of summary-level data will have its limits from a modelling perspective. It might be possible to see that 199 patients had cardiac arrhythmia, and 72 patients had a diagnosis of Parkinson’s, but not how many had both.
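To make the arithmetic behind that last point concrete, here is a minimal sketch in Python. It assumes a cohort size that I have made up for illustration (the two counts above are all we are given), and the function name `overlap_bounds` is hypothetical. With only the two summary counts, the best we can do is bound how many patients could have both conditions.

```python
def overlap_bounds(count_a: int, count_b: int, total: int) -> tuple[int, int]:
    """Return the (min, max) number of patients who could have both conditions.

    Only the two summary counts and the cohort size are known, so the true
    overlap can be any integer in this range.
    """
    if min(count_a, count_b) < 0 or max(count_a, count_b) > total:
        raise ValueError("counts must be between 0 and the cohort size")
    # If the two groups together exceed the cohort, some patients must be in both.
    lower = max(0, count_a + count_b - total)
    # The overlap can never be larger than the smaller group.
    upper = min(count_a, count_b)
    return lower, upper


# ASSUMED_TOTAL is a hypothetical cohort size, not a figure from any real study.
ASSUMED_TOTAL = 1000
low, high = overlap_bounds(199, 72, ASSUMED_TOTAL)
print(f"Patients with both conditions: somewhere between {low} and {high}")
```

Under that assumed cohort size, anything from no overlap at all to every Parkinson’s patient also having an arrhythmia is consistent with the summary counts. Only record-level data can tell you which it is.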
Know the limits of interfacing and make sure you do not set expectations for users that it cannot meet. It may be worth it to you if complex interfacing leads to the data controller becoming more comfortable with sharing over time, but sending federated compute to summary-level data is never the same experience as full, transparent access to record-level data with rich associated metadata. Your time and energy may be better spent working on the underlying cultural challenges instead. In other words, you and your scalpel may be wasting time trying to help a patient with a lifestyle issue while neglecting a backlog of patients who need surgery.