November 7, 2022 | Scott
An Aridhia DRE Workspace is a Trusted Research Environment – providing a safe-haven for clinical researchers, bioinformaticians and pharmacologists to analyse and develop models on sensitive data with the confidence that the data and models developed are secure and protected.
Vital to collaborative development is the ability for individuals to modify and fix their own versions of code without disrupting other team members. When code is committed and a submission is made, whether for a research paper, as a thesis, or for approval from a regulatory body, reproducibility of outcomes is essential. It must be possible to re-run the exact code on the exact data and the exact compute infrastructure, and to verify that the results and outcomes are unchanged. Such transparent reproducibility is only possible with version control software.
Outside of a workspace, Git, the open-source version control software, together with GitHub for hosting, has become the standard for collaborative development, not only for code but also for API standards and, significantly for health care data science, ontologies.
Inside a workspace, where data is highly sensitive, security is of paramount importance, and access to online repositories is typically prohibited, users still require Git-like abilities: to create repositories, clone them, import external repositories (with appropriate review and security controls), and push reviewed and approved code to repositories or other workspaces as part of the outbound airlock process.
A paper by the Turing Institute sets out best-practice recommendations for code development and publishing within a Trusted Research Environment.
Key to the conclusions within the paper is the requirement to version code, data, and compute within the scope of a Trusted Research Environment. Outcomes must be digitally reproducible and the paper recommends the use of Conda for package management and Git for code, model, and (where appropriate) data versioning.
Aridhia DRE Workspaces provide support for the use of Anaconda, and Aridhia has now released an integration of Gitea. Users of a workspace now have access to their own secure version control system. This is locked entirely within the scope of a workspace, ensuring that data and code cannot leak to other workspaces or leave the workspace environment without going through the accepted airlock release checks and balances. Based on extensive discussion with our data science, modelling and governance communities, we think this strikes the right balance between freedom to operate for workspace owners and the rigorous security and information governance requirements of data controllers.
The following diagram illustrates a typical flow for development and review purposes.
The above workspace may be a primary workspace for development, or a workspace used for regulatory review purposes.
1. Data from FAIR or elsewhere is brought into the workspace via the inbound Airlock. Users may also bring in code from areas that are not normally accessible from within the workspace by uploading it directly.
2. These files are then moved from the Airlock to the shared /files area of the workspace.
At this point, all users in the workspace can modify these files. There is no versioning in place. We recommend creating a folder structure that fits the development or review process.
In this example, we have docs, code and data folders within /files. It is recommended that these folders be used for sharing and reviewing the latest ‘working’ versions of each type of file once a new master version is available. Files within this space are accessible from workspace tooling such as Data Table Analytics modules, RStudio or Jupyter notebook or any other custom applications or R Shiny applications that have been provisioned in the workspace.
3. Users who need to work on their own versions of code or documents can do so within a Linux VM, where each user of the workspace has their own personal storage area. Using Gitea via the command line or web UI, users can create repos with initial versions of files, clone repos to their personal area, and work on their own branches before submitting pull requests to bring those versions of the files back to the master branch.
4. When development is complete and the master branch is updated, files can be moved to the /files directory to ensure that the latest version is available to all users and tools within the space, while the version history is managed by Gitea. At any time, users can recall previous versions of files and, when required, move them to the shared files area.
5. When files are ready to leave a workspace, either for a direct download or transfer to another workspace for further development, files are transferred to the Airlock and approval is requested.
6. A workspace manager reviews the content and can accept or reject the airlock request. If the request is approved, the user is notified; only at that point can the files leave the workspace.
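Steps 3 and 4 above can be sketched from the command line. The following is a minimal illustration in Python, assuming the standard git client is installed on the workspace VM. The repository URL, folder name, branch name and commit message are hypothetical placeholders, and the pull request itself would still be opened in the Gitea web UI.

```python
import shlex
import subprocess
from typing import List

Command = List[str]


def start_work(repo_url: str, clone_dir: str, branch: str) -> List[Command]:
    """Commands to clone a repo into a personal area and start a branch."""
    return [
        ["git", "clone", repo_url, clone_dir],
        # "switch -c" needs git 2.23 or later; "checkout -b" is the older form.
        ["git", "-C", clone_dir, "switch", "-c", branch],
    ]


def publish_work(clone_dir: str, branch: str, message: str) -> List[Command]:
    """Commands to commit local edits and push the branch for review."""
    in_clone = ["git", "-C", clone_dir]
    return [
        in_clone + ["add", "--all"],
        in_clone + ["commit", "-m", message],
        in_clone + ["push", "--set-upstream", "origin", branch],
    ]


def run_all(commands: List[Command]) -> None:
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values: substitute the URL shown by your workspace's Gitea.
    url = "http://gitea.example/alice/dosing-model.git"
    plan = start_work(url, "dosing-model", "fix-missing-data")
    plan += publish_work("dosing-model", "fix-missing-data", "Handle missing doses")
    for command in plan:
        print(shlex.join(command))  # dry run: print instead of executing
```

The two functions are kept separate because the actual editing happens between them: call `run_all(start_work(...))`, change the files, then call `run_all(publish_work(...))`. Once the pull request is merged into master, the updated files can be copied to the shared /files area as described in step 4.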
Further documentation on using Gitea within a workspace can be found here.
For research reproducibility and regulatory review, versioning of files within a workspace is crucial. On many occasions regulatory reviewers will request both the datasets on which pivotal analyses have been completed and the scripts and code used to specify the models and/or simulations that support key decisions and recommendations by sponsors. The Gitea versioning can be used to create transparency about how the modelling evolves, the exclusion of certain data, the handling of missing data, etc., all of which come under regulatory scrutiny. Use cases such as the support and defense of dosing recommendations, labelling statements (package insert or drug monograph) and the suitability of clinical endpoints are all key decisions where sponsors negotiate with regulators, who may challenge the fidelity of these decisions against the analyses used to support them.
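One way to make that traceability checkable is to record a checksum for every file in a release, so that a reviewer can confirm the datasets and scripts they receive are byte-for-byte the versions that were analysed. The sketch below is an illustration only, not part of the Aridhia or Gitea tooling; the function names are hypothetical, and it assumes the release is a folder of files on disk.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file under root (relative path) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        # Skip git's internal bookkeeping if root happens to be a clone.
        if path.is_file() and ".git" not in path.parts
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or modified between manifests."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))
```

A manifest built when the analysis is run can be stored alongside the code in version control, then compared with one built from the files a reviewer receives; an empty result from `changed_files` indicates that nothing differs.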
This latest update to Workspaces provides further capabilities to ensure files relevant to the use cases outlined above are versioned appropriately, with full traceability of all modifications using familiar tools and interfaces. Aridhia will further enhance this capability over time, to simplify adoption and enhance collaboration opportunities.