An Experiment in Using Offline LLMs Within a Secure Data Environment – A Breast Cancer Case Study

Introduction

In health research, data privacy and security are paramount. Researchers work with sensitive datasets, such as patient records, where any breach could have significant legal and ethical implications. Yet the need for advanced analytical tools like large language models (LLMs) is growing. These tools simplify complex tasks, such as exploratory data analysis (EDA), making them accessible to researchers with varying technical expertise.

This is where a Trusted Research Environment (TRE) shines. By integrating offline LLM frameworks like Ollama, researchers can leverage the power of AI without the fear of data or prompt leakage. I’m using an open-source model, Mistral-7B, which balances performance against resource consumption (no expensive GPU required). All prompts and data stay within the TRE, which supports compliance with data protection regulations such as GDPR.

At Aridhia, our DRE/TRE platform is designed to support secure, compliant, and collaborative data analysis. It empowers teams to harness cutting-edge AI capabilities while safeguarding sensitive health data. This was covered more extensively in a previous blog post.

In this first post, we’ll explore how AI-driven coding assistance can guide less experienced R users through some simple EDA, demonstrating these capabilities using the publicly available Breast Cancer Wisconsin dataset. We’ll do a more detailed EDA in a later post in this series.

Background Setup for the Demo

Dataset Upload

For this demo, I’ve downloaded the Breast Cancer Wisconsin dataset as a CSV file. The dataset is publicly available and can be accessed from here. After securely uploading it into my TRE workspace via the inbound airlock, the dataset is ready for analysis.

Workspace/TRE Environment

This demo is conducted entirely within a Workspace/TRE, which provides a secure, controlled environment for analysing sensitive data. To facilitate AI-driven insights, I’ve installed the Ollama LLM framework along with the Mistral-7B model (I’m using the Q4_K_M version of the model that uses up to 7GB of RAM). These tools were also uploaded securely through the workspace inbound airlock.

With this setup, I can confidently use advanced AI capabilities without risking data exposure, making it ideal for privacy-sensitive applications like healthcare research.

Visual Insights through AI-Powered EDA

Let’s walk through a practical EDA example, showcasing how an AI coding assistant can guide researchers unfamiliar with R through each step. I have the LLM prompt open in one terminal, and an R console for running commands open in another.

Step 1: Load the Dataset

Prompt: “I’ve uploaded a CSV file named `wdbc.csv` into my TRE workspace. Can you help me load it into R and take a quick look at the data?”

Here’s the response from my LLM:

# Load the data
bc_data <- read.csv("~/files/wdbc.csv", header = TRUE)

# View the first few rows
head(bc_data)

Output

You might notice the odd “X” column of missing values at the end of the data. As it turns out, the Kaggle copy of the dataset has a formatting error in the CSV file: the header row contains one more column name than there are data columns. It’s always worth previewing the data to check that it has loaded correctly. Next, we’ll fix this.
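To see why the stray column appears, here is a minimal sketch that reproduces the problem on a tiny, made-up CSV. The three column names and the values are illustrative only, not taken from the real file, and the sketch assumes the cause is a trailing comma in the header row.

```r
# A tiny made-up CSV whose header ends with a trailing comma,
# i.e. one more column name than there are data columns
csv_text <- "id,diagnosis,radius_mean,\n1,M,17.99\n2,B,13.54"

# read.csv() pads the short data rows with NA and names the blank header "X"
demo_data <- read.csv(text = csv_text, header = TRUE)

print(names(demo_data))
print(demo_data)

# Quick checks that are worth running after any import:
# the dimensions, and the number of missing values per column
print(dim(demo_data))
print(colSums(is.na(demo_data)))
```

A column whose missing-value count equals the number of rows is a strong hint that something went wrong at import time rather than in the data itself.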

Step 2: Fix Empty Column

Prompt: “It looks like there’s a column with completely missing values. Can you help me identify and remove it?”

# Display the structure of each column within the dataset
str(bc_data)

# Find the indices of columns with only NA values
empty_columns <- which(sapply(bc_data, function(x) all(is.na(x))))

# Print the names and indices of empty columns
cat("Empty columns:\n")
print(colnames(bc_data[empty_columns]))

# To remove any completely empty columns we'll use the dplyr library
install.packages("dplyr")
library(dplyr)

# Create a clean version of the dataset without empty columns
# (all_of() makes it explicit that the column names come from an external vector)
bc_data <- bc_data %>% select(-all_of(names(empty_columns)))
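
If installing packages inside the workspace is inconvenient, the same clean-up can be sketched in base R. This is an alternative to the LLM’s suggestion rather than part of its response, and the toy data frame below is invented purely for illustration.

```r
# Toy data frame standing in for bc_data (values are made up)
toy_data <- data.frame(
  id          = c(1, 2, 3),
  radius_mean = c(17.99, 13.54, NA),  # partly missing: should be kept
  X           = c(NA, NA, NA)         # entirely missing: should be dropped
)

# TRUE for columns in which every value is NA
is_empty <- vapply(toy_data, function(x) all(is.na(x)), logical(1))

# Keep only the columns that contain at least one value;
# drop = FALSE keeps the result a data frame even if one column remains
clean_data <- toy_data[, !is_empty, drop = FALSE]

print(names(clean_data))
```

Note that a column with only some missing values is left alone; only columns that are empty from top to bottom are removed.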

Step 3: Generate a Histogram

Prompt: “I want to see how tumour sizes are distributed. Can you help me create a histogram for the `radius_mean` column?”

# load the ggplot2 library for making plots
install.packages("ggplot2")
library(ggplot2)

ggplot(bc_data, aes(x = radius_mean)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Radius Mean", x = "Radius Mean", y = "Frequency") +
  theme_minimal()

Output
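
A histogram is just a picture of counts per bin, so it can be helpful to look at those counts as numbers too. The sketch below does this in base R on a handful of made-up radius values (not taken from the dataset). ggplot2 may position its bin edges differently, so the counts need not match the plot bar for bar.

```r
# Made-up radius values for illustration (not taken from the dataset)
radius_mean <- c(9.5, 11.2, 11.8, 12.4, 12.9, 13.1, 14.7, 17.9, 20.6)

# Bin edges one unit apart, covering the full range of the data
bin_edges <- seq(floor(min(radius_mean)), ceiling(max(radius_mean)), by = 1)

# Assign each value to a bin [lower, upper) and count the values per bin;
# include.lowest = TRUE makes sure a value equal to the last edge is not lost
bins <- cut(radius_mean, breaks = bin_edges, right = FALSE, include.lowest = TRUE)
bin_counts <- table(bins)

print(bin_counts)
```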

Step 4: Create a Scatter Plot

Prompt: “How can I visualise the relationship between `radius_mean` and `texture_mean`, and show different tumour types in different colours?”

ggplot(bc_data, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
  geom_point(alpha = 0.7) +
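  # Name each colour by diagnosis code ("M" = malignant, "B" = benign) so that
  # the legend labels cannot be swapped by R's alphabetical ordering
  scale_color_manual(values = c(M = "red", B = "green"),
                     breaks = c("M", "B"),
                     labels = c("Malignant", "Benign")) +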
  labs(title = "Scatter Plot: Radius Mean vs Texture Mean",
       x = "Radius Mean", y = "Texture Mean", color = "Tumour Type") +
  theme_minimal()

Output
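
When colours and legend labels are supplied by hand, it is worth checking which code each one lands on, because R orders category values alphabetically by default. The sketch below uses a made-up miniature of the `diagnosis` column and assumes the codes are "M" (malignant) and "B" (benign); the helper vectors `tumour_colours` and `tumour_labels` are hypothetical names introduced for illustration.

```r
# Made-up miniature of the diagnosis column (codes assumed to be "M" and "B")
diagnosis <- c("M", "B", "B", "M", "B")

# R sorts factor levels alphabetically, so "B" comes before "M"
diagnosis_levels <- levels(factor(diagnosis))
print(diagnosis_levels)

# Named vectors tie each colour and label to a specific code,
# so the result no longer depends on the order of the levels
tumour_colours <- c(M = "red", B = "green")
tumour_labels  <- c(M = "Malignant", B = "Benign")

# Look up by name: each code gets the label and colour intended for it
print(tumour_labels[diagnosis_levels])
print(tumour_colours[diagnosis_levels])
```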

Step 5: Create a Correlation Heatmap

Prompt: “Can you help me create a correlation heatmap for the first 10 features to see which ones are strongly related?”

# Install the reshape2 library
install.packages("reshape2")
library(reshape2)

# Calculate correlation matrix and transform the data for plotting using the reshape2 library
cor_matrix <- round(cor(bc_data %>% select(radius_mean, texture_mean, perimeter_mean,
                                           area_mean, smoothness_mean, compactness_mean,
                                           concavity_mean, concave.points_mean,
                                           symmetry_mean, fractal_dimension_mean)), 2)

melted_cor_matrix <- reshape2::melt(cor_matrix)

# Create the heatmap
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), color = "black", size = 4) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab") +
  labs(title = "Correlation Heatmap of Selected Features") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output
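
A heatmap shows the pattern at a glance, but a short table of the strongest pairs can be easier to act on. The sketch below is one way to extract it in base R. The toy data frame, the 0.8 threshold, and the variable names are all assumptions made for illustration; with the real data you would pass in the `cor_matrix` computed above instead.

```r
# Toy data (made up): perimeter is derived from radius, texture is unrelated
toy <- data.frame(
  radius    = c(10, 12, 14, 16, 18, 20),
  perimeter = 2 * pi * c(10, 12, 14, 16, 18, 20),
  texture   = c(21, 15, 24, 17, 22, 16)
)

cor_matrix <- round(cor(toy), 2)

# The matrix is symmetric, so take the upper triangle to list each pair once
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

pairs <- data.frame(
  feature_1   = rownames(cor_matrix)[idx[, "row"]],
  feature_2   = colnames(cor_matrix)[idx[, "col"]],
  correlation = cor_matrix[idx]
)

# Keep pairs whose absolute correlation reaches the (arbitrary) threshold
strong_pairs <- pairs[abs(pairs$correlation) >= 0.8, ]

print(strong_pairs)
```

Using the absolute value means strongly negative relationships are reported alongside strongly positive ones.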

Here’s what this all looks like inside the TRE workspace, with my LLM prompt in one terminal window, the R console in another (I could just as easily have used an R development environment), and the resulting heatmap plot.

DRE Screenshot

The Power of LLMs in a Secure Data Environment

The ability to run powerful LLMs like Mistral-7B within a TRE ensures that researchers can access cutting-edge tools without compromising data security. Because the model runs entirely offline, with no calls to external services, researchers can confidently explore and analyse sensitive data while maintaining strict privacy standards.

At Aridhia, we provide a secure, scalable platform that not only meets regulatory requirements but also enhances productivity by integrating advanced AI tools into the research workflow. Whether you’re performing EDA, conducting federated learning, or managing complex multi-site collaborations, our DRE/TRE offers the flexibility and security you need to succeed.

Conclusion

AI tools like Ollama provide invaluable assistance for researchers, especially those with limited coding expertise. Because they run entirely within a Trusted Research Environment, sensitive health data remains protected while researchers still get actionable insights. Whether you’re performing EDA or diving deeper into modelling, the integration of AI into your TRE workflow is a game-changer.

Stay tuned for the next post, where we’ll explore Retrieval-Augmented Generation (RAG) and show how to index and query your own datasets entirely offline.