User Guide

Loading data into a workspace

AnalytiXagility has been designed to handle the import and processing of numerous types of data from multiple sources into a secure, multi-tenanted environment.

  • Users can securely upload CSV data – using the web interface and via the SFTP protocol – into any workspace they are a member of, where it is then stored as data tables.
  • Mapping of CSV fields to data table columns takes place automatically, however it is also possible to provide definition files that specifically inform the platform as to how the data should be loaded.
  • Aridhia can also provide custom file transfer mechanisms (e.g. from genomics labs) where required.
  • Multiple users within one workspace can be granted permissions to upload data.
 

Uploading data using the web interface

drop down menu, grabbed on 17.01.17, using XAP version 1.19.1 2740-8759a, from https://edcvaatest04.aridhiatest.net/#/workspaces/1150

Users can upload tabular data formatted as .csv files from the web interface. To do so navigate to the desired workspace you want to upload data into. Once in the workspace, navigate to the ‘Add’ dropdown menu and select ‘Upload Data’. In the ‘Upload Data’ panel browse select the .csv file to upload and click ‘Upload File’. The upload process is split into five sections:

 

Select files

Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

You also have the option to provide an accompanying table definition file (TDF) which describes the fields and data within the CSV file. This is not a mandatory requirement during the upload process, however if a TDF is not provided, a new TDF will be created based on user input gathered during the upload process, which can then be downloaded.

You can also provide an authorisation reference for the dataset if applicable. For example, a name of the project or study which this dataset has been approved for use in, or the name of the data owner who has provided consent for this data to be uploaded.

You should also confirm that the dataset does not contain any personally identifiable data, or indicate that the data has de-identified by the user prior to upload.

 

Describe your dataset

The next screen requires you to enter a title for your dataset and optionally a basic description of the dataset being uploaded. Additionally, you can provide a web URL which may provide the location of that dataset within a web-available repository or scientific journal article relating to that dataset.
Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

 

Parse your data

The next screen allows you to configure how your data should be processed when it is uploaded into the workspace.
Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

The following settings can be configured in this screen:

  • Data table name: The name of the database table that your data will be loaded into.
  • Delimiter: The character used to separate the columns in your CSV file.
  • Include header row: Determines if the first row in your CSV file contains column headers.
  • Text qualifier: The character used to surround text within each column in the CSV file.
  • Null qualifier: This character in the CSV file will be replaced with a database Null value when it is loaded into the workspace database.
  • Encoding: The character encoding set to use when processing the data. A number of common options are provided, with the default set to UTF-8.

Clicking on a column header in the grid allows you to change the name of the column that will be created in the workspace database. Similarly, clicking on the data type directly below the column header allows you to change the type of the column that will be created.

 

Describe your fields

Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

Users can alter the label of each column by altering text in the ‘Label’ field and can provide a description of the data captured within that column by altering text in the ‘Description’ field. This will ultimately provide a metadata description of each field should you wish to share the resulting TDF with another individual.

 

Review

Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

Pressing ‘Upload’ will upload data from your .csv file to the workspace. When the upload is completed a note will appear in the Summary tab notifying you if upload was completed successfully. You can also select ‘Download TDF’ if you wish to share the TDF which has been cumulatively generated during the upload process. Please note that the TDF will not be saved in the workspace, therefore if you wish to keep it, you should select ‘Download TDF’ if you plan to use it in future.
Uploading data using the web interface, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

 

De-identification

If the de-identification service is enabled in your workspace, you will be given the option to de-identify the data before it is uploaded. For instructions on uploading data via the web interface see the Uploading data using the web interface section. In the Authorisation and Privacy section there is an option to select This data needs to be de-identified before it is uploaded to my workspace, which will provide you with options for configuring how you would like each field in your dataset to be processed through the de-identification service.
Deidentification - Select files, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

You should navigate through the Describe your dataset and Parse your data sections before reaching the Describe your fields section. Now you can select which columns you wish to de-identify by selecting an option from the dropdown menu.
Deidentification - Describe your fields, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

Once you have specified how you wish to de-identify each field, click on Next to progress to the Review section.
Deidentification - Review, grabbed on 10.03.17, using XAP version 1.19.2 2878-c35a5ee1a, from https://analytixagility.aridhia.net/

You will then be informed that you have specified for your data to be de-identified via the Data Staging Area (where the de-identification service is typically deployed) and that the data will be added to the workspace upon selecting ‘Upload’.

 

Uploading data using SFTP

This section provides guidance on how to use an FTP client to securely copy data from your machine into your AnalytiXagility workspace via SFTP.

Before you can upload a data file:

  • You must have FTP client software which supports the sFTP protocol and AES-256 cypher strength (most recent clients offer this support as standard) such as WinSCP, FileZilla and the PuTTY PSFTP program.
  • You must have the domain name of the AnalytiXagility sFTP server to upload data to which will be supplied by email or can be requested from the service desk.
 

Connecting to the SFTP using WinSCP

The steps below give an example of how data may be uploaded using the SFTP service using the popular SFTP client WinSCP.

  • Download and install the WinSCP client – https://winscp.net/eng/download.php.
  • Start the WinSCP client.
  • Insert your SFTP credentials
    • host name: xxxxxxx@host.com
    • username: your AnalytiXagility username
    • password: your password
    • click ‘Login’
  • Select the files to upload from your local machine and drag them to the upload folder, this example uses the Iris dataset.
  • Drag and drop iris.csv to data folder.SFTP using WinSCP, grabbed on 17.01.17, using XAP version 1.19.1 2740-8759a, from https://edcvaatest04.aridhiatest.net/#
  • Once the data file has been copied in full to the data folder, the platform will automatically start to load the data into the workspace.
  • Once the platform has completed processing the data, the following files are created in the upload folder.
    • A log file reporting the success or failure of the upload, giving detailed information on the process of how data was loaded in case of the successful upload, or why the upload failed. The resulting file has a name in the following format:
      <filename>_<success/fail>_<year>
      <month>
      <day>_<hours>
      <mins>
      <secs>.txt

      For example:

      iris_success_20160306_121535.txt
      iris_fail_20160307_142743.txt

The time needed to upload and process the data is dependent on the size of the data file and the speed of the network connection. If your SFTP client supports it, enabling compression can reduce the transfer time.

 

Preparing CSV files for uploading with SFTP

To load data into the platform, you must supply CSV files containing the data you wish to upload. The platform will attempt to determine the following from the files:

  • the character encoding
  • the character used as the column delimiter
  • the names of the columns
  • the character used as the text-qualifier

There are limitations on what the platform can automatically determine, and sometimes you need to help the process along by supplying some of this information in a Table Definition file, as described below in Using Table Definition files with SFTP uploads. To get started, upload your CSV file and let the platform attempt to identify the contents. By modifying the contents of the Table Definition file returned by the platform, you can quickly define a correct Table Definition file.

The rest of this section describes:

  • the file name that should be used
  • the contents of the CSV file
 

The file name

For a CSV file to be successfully uploaded, the file name must:

  • start with an alphabetic character (a-z/A-Z)
  • only contain alphanumeric characters (a-z/A-Z/0-9)
  • have the extension .csv

Data from the CSV file is loaded into a table in your workspace. The name of the table is based on the name of the CSV file, with the following changes:

  • the leading and trailing spaces are removed from the name
  • the name is trimmed to a maximum of 60 characters

Alternatively, the name of the table can be specified in a Table Definition file supplied with the CSV file.

For best results, the table name and file name should:

  • contain only lowercase letters
  • not include any punctuation
  • not include any special characters
 

Contents of the csv file

The platform can automatically process CSV files with the following format characteristics:

  • The first row is a header row defining the column names.
  • Values can be separated by the following common characters:
    • comma (,)
    • colon (:)
    • pipe (|)
    • space ( )
    • tab ( )
  • The text-qualifier character either a double-quote (“) or a single-quote (‘).
  • Line endings are either LF or CRLF (and the line endings are consistent throughout the file).
  • Files are encoded using UTF-8 or ISO-8859-1.

If your file format does not match these characteristics, then you need to prepare and submit a Table Definition file along with the CSV file. Table Definition files allow detailed control over the format of your CSV files, see Using Table Definition files with SFTP uploads for more details.

 

Using table definition files with SFTP uploads

The platform attempts to automatically interpret CSV files, determining details like the character encoding, column delimiter, text-qualifier, the names of the table and columns from the data in the CSV file. Sometimes, the values for these properties fall outside the range that the platform can automatically determine and in such cases, the details need to be supplied in a Table Definition file sent along with the CSV file. Please note that if a Table Definition file is to be supplied, it should be uploaded before the associated CSV file.

If supplied, the Table Definition file must have the same name as the CSV file, but with the extension “.xml”. For example, if you supply a data file called myresearchdata.csv, the corresponding Table Definition file must be called myresearchdata.xml.

 

The format of the table definition file

Table Definition files are XML documents containing the following elements:

<?xml version="1.0" encoding="utf-8"?>
<TableDefinition TableName='the filename or what the user specifies'Action='create or append'>
<Columns>
<Column Name="column-name" Type="##PostgreSQL data-type##" />...</Columns>
<Format 
Delimiter='column-delimiter'
TextQualifier='text-qualifier'
NullQualifier='replace these values with null'
Encoding='UTF-8'
Header='first line of csv is the header'
HeaderCase='change the column-header casing'
DateFormat='the format of all dates in the csv'
/>
<Success RemovefromUpload='delete source files on SFTP'/>
<Fail RemovefromUpload='delete source files on SFTP'/>
</TableDefinition>

Please note that:

  • A PostgreSQL data type definition cannot contain column constraints or default values. See http://www.postgresql.org/docs/8.2/static/datatype.html for more details about PostgreSQL data types. Please note that the platform is built on the Greenplum database, which in turn is based on PostgreSQL 8.2.
  • Where columns are supplied they must match the number of columns in the accompanying CSV file.
  • The file is case-sensitive, so ensure that all attributes and elements follow the casing set out in the templates outlined here.
  • A limited set of XML Named Character References are supported for the Table Definition file: T
 

The table definition element

Providing the table definition element is mandatory; any missing or blank attributes will be handled as follows:

TableName

The destination table for the accompanying data file that will be loaded into the platform. If not supplied or blank, use the file name.

Action

Specifies whether the data is loaded into a brand-new table (create) or inserted at the end of an existing table (append). If not supplied or blank, “create” is assumed.

For create:

  • If a table of the same name already exists, it is renamed with a timestamp before the new table is created

For append:

  • If a table with the name does not exist, then a new one is created.
  • If data being appended is narrower than the destination table, then all missing values are set to null.
  • If data being appended is wider than the destination table, then the surplus columns are quietly omitted.
 

Data transformation during upload processes

When your .csv file is loaded to the platform it is stored in the AnalytiXagility platform as a table. Please note that when loading without supplying table definition file, all special characters [, . ? , “ $ % ^ ( ! # )] in the file name will be transformed to underscore “_” and capital letters will become lowercase, e.g. if you upload file Aridhia-1.csv table, it would be saved as aridhia_1.

This also applies to table column headings, e.g. Column.1 would change to column_1

Here is the list of all changes which are made by the platform when uploading via SFTP.

  • Within the file name, the following special characters are replaced by an underscore (_):
    – caret (^)
    – full stop (.)
    – single quote (‘)
    – double quote (“)
    – space ( )
    – dash (-)
  • An empty value with no quotations is translated to NULL.
  • An empty value with quotations is translated to an empty string.
  • All columns in the table are nullable.
  • Missing fields at the end of rows are assumed to be NULL.
  • If a column name is not supplied, the column name is given a unique name: column1, column2.
  • Column names:
    – Will be trimmed to remove any leading or trailing space characters.
    – Will be trimmed to 60 characters.
    – Will be made unique by appending a number to the name.
  • Within a column name, the following special characters are replaced by an underscore (_):
    – caret (^)
    – full stop ()
    – single quote (‘)
    – double quote (“)
    – space ( )
  • The platform does not support the data type “timestamp with timezone”. Any field of this type is treated as a “timestamp without time zone”.

We could otherwise use a Table Definition file to specify table names and column headers. Please refer to The Table Definition Element section above.

 

To contact support, send an email to the Aridhia service desk: servicedesk@aridhia.com.