
Showcase: Using the ENA data submission toolbox for SARS-CoV-2 data

Introduction

The COVID-19 pandemic has highlighted the importance of open and FAIR (Findable, Accessible, Interoperable, Reusable) data. Rapid access to diverse SARS-CoV-2 genome sequences is crucial for developing tests, vaccines, treatments, and policies, as well as for detecting and monitoring new variants. The European Nucleotide Archive (ENA) and the Global Initiative on Sharing All Influenza Data (GISAID) are the primary repositories for SARS-CoV-2 data, with GISAID imposing stricter limitations on data reuse.

Despite an increase in the number of submissions to both repositories, ENA's requirement of command line knowledge for bulk submissions was found to be a significant hindrance. To address this, a new set of tools was developed to simplify the submission of SARS-CoV-2 sequences to ENA. The work included streamlining the submission process and integrating it with platforms like Galaxy, which provide a user-friendly interface and additional tools for preprocessing and analysis.

To read in more detail how ELIXIR, in particular the Belgian and German nodes, provided tooling for SARS-CoV-2 sequence submissions to ENA, please visit the Covid-19 section on the RDM Guide.

Who is the showcase intended for?

The ENA data submission toolbox is mainly aimed at researchers, institutions, and support facilities that need to submit data to the European Nucleotide Archive (ENA). The toolbox offers solutions both for users with little to no programming skills and for those with advanced bioinformatics knowledge.

What is the showcase?

The components described below allow for a single-step submission process, a graphical user interface, secure credential handling, tabular-formatted metadata and the possibility to remove human reads prior to submission.

Overview of ENA upload toolbox

Figure 1. Overview of the ENA data submission toolbox components.

To improve the submission of SARS-CoV-2 nucleotide sequences to the European Nucleotide Archive (ENA), we have collaboratively developed and compiled a command line tool, a set of Galaxy tools, and Galaxy workflows for cleaning, assembling, and submitting SARS-CoV-2 sequences. Using Galaxy offers numerous advantages, such as a user-friendly graphical interface and access to a wide range of tools and workflows for preprocessing, downstream analysis, and visualization of sequences, including those specific to SARS-CoV-2 (Maier et al., 2021). Additionally, Galaxy provides a platform for sharing data and metadata, facilitating international collaboration, integrating with other public resources, and enabling the publication of FAIR data and analysis workflows.

Human Reads Cleaning Tool

To comply with Europe’s General Data Protection Regulation (GDPR), human genetic information must be removed from raw data before submission to ENA. We have integrated Metagen-FastQC into Galaxy for this purpose. A series of steps was wrapped into a single Galaxy tool, allowing users to filter human reads out of their raw data.
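The individual steps performed by Metagen-FastQC are not detailed here, so the following Python sketch only illustrates the final filtering idea. It assumes that an earlier alignment against a human reference has already produced a set of read identifiers flagged as human; the function names and the example reads are hypothetical and are not part of the Galaxy tool.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group an iterable of text lines into 4-line FASTQ records."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            yield (buffer[0], buffer[1], buffer[2], buffer[3])
            buffer = []


def read_id(header: str) -> str:
    """Return the read identifier: the header without '@', up to the first space."""
    return header[1:].split()[0]


def remove_human_reads(
    records: Iterable[FastqRecord], human_ids: Set[str]
) -> List[FastqRecord]:
    """Keep only the records whose identifier was NOT flagged as human."""
    return [record for record in records if read_id(record[0]) not in human_ids]


# Hypothetical input: two reads, one of which was flagged as human upstream.
fastq_text = "@read1 lane1\nACGT\n+\nIIII\n@read2 lane1\nTTTT\n+\nIIII\n"
human_ids = {"read2"}

kept = remove_human_reads(parse_fastq(fastq_text.splitlines()), human_ids)
for record in kept:
    print("\n".join(record))
```

In this sketch only `read1` is written out. A real cleaning step would also need to handle paired-end files so that both mates of a flagged pair are dropped together.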

ENA upload CLI

Submitting raw reads to ENA can be done via the ENA website, Webin-CLI, or programmatically using curl commands. Programmatic submissions are preferable for bulk uploads but require bioinformatics expertise to generate XML metadata files and upload data via FTP. To simplify this, a Python command line interface (CLI), ENA upload CLI, was created. This CLI eases submission for bioinformaticians by converting user-friendly TSV files or Excel templates into the required XML files. It manages FTPS uploads, validates metadata before submission, and allows for adding, modifying, canceling, and releasing the study, sample, experiment, and run ENA metadata objects without the need to log in to the Webin portal.
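To give an idea of what the TSV-to-XML conversion involves, here is a minimal Python sketch that turns a tab-separated sample table into an ENA-style SAMPLE_SET document. It is not the implementation used by ENA upload CLI: the column names (`alias`, `title`, `taxon_id`, `scientific_name`) and the function name are assumptions made for this illustration, and the real templates define their own columns and checklist fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical core columns for this sketch; the real templates differ.
CORE_COLUMNS = ("alias", "title", "taxon_id", "scientific_name")


def samples_tsv_to_xml(tsv_text: str) -> str:
    """Convert a sample TSV into a SAMPLE_SET XML string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")

    # Simple client-side check: fail early if a core column is absent.
    missing = [c for c in CORE_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    sample_set = ET.Element("SAMPLE_SET")
    for row in reader:
        sample = ET.SubElement(sample_set, "SAMPLE", alias=row["alias"])
        ET.SubElement(sample, "TITLE").text = row["title"]
        name = ET.SubElement(sample, "SAMPLE_NAME")
        ET.SubElement(name, "TAXON_ID").text = row["taxon_id"]
        ET.SubElement(name, "SCIENTIFIC_NAME").text = row["scientific_name"]

        # Every remaining non-empty column becomes a TAG/VALUE attribute.
        attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
        for column, value in row.items():
            if column is None or column in CORE_COLUMNS or not value:
                continue
            attribute = ET.SubElement(attributes, "SAMPLE_ATTRIBUTE")
            ET.SubElement(attribute, "TAG").text = column
            ET.SubElement(attribute, "VALUE").text = value

    return ET.tostring(sample_set, encoding="unicode")


# Hypothetical one-row sample table (2697049 is the NCBI taxon ID of SARS-CoV-2).
example_tsv = (
    "alias\ttitle\ttaxon_id\tscientific_name\tcollection date\n"
    "s1\tSwab 1\t2697049\t"
    "Severe acute respiratory syndrome coronavirus 2\t2021-01-15\n"
)
xml_text = samples_tsv_to_xml(example_tsv)
print(xml_text)
```

The sketch covers only sample objects. Study, experiment, and run objects would need their own XML documents, and a real submission additionally requires checklist validation and the upload of the data files.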

The templates are made available through a GitHub repository.

The Python package ENA upload CLI is available on PyPI and Bioconda.

The Galaxy ENA upload tool

To make the process accessible to researchers with limited bioinformatics expertise, we wrapped the ENA upload tool as a Galaxy tool. It is part of the Intergalactic Utilities Commission (IUC) list of curated Galaxy tools, allowing instance administrators to install the tool from the Galaxy ToolShed at ENA upload Galaxy tool.

The Galaxy ENA consensus submission tool

To facilitate the submission of genome assemblies, we created ENA Webin CLI, a Galaxy wrapper around ENA's Webin-CLI. The wrapper simplifies the creation of the mandatory manifest file required for submission by allowing users to interactively fill in the assembly metadata and submit SARS-CoV-2 consensus data to ENA.
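For readers who prefer to see what such a manifest looks like, the Python sketch below assembles one from a dictionary of assembly metadata. It is an illustration only and not the code of the Galaxy wrapper. The field names follow our understanding of the genome assembly manifest and should be checked against the current ENA documentation; all values, including the accessions, the program name, and the file name, are hypothetical placeholders.

```python
from typing import Dict

# Assumed set of mandatory manifest fields; verify against the ENA documentation.
REQUIRED_FIELDS = (
    "STUDY",
    "SAMPLE",
    "ASSEMBLYNAME",
    "ASSEMBLY_TYPE",
    "COVERAGE",
    "PROGRAM",
    "PLATFORM",
    "FASTA",
)


def build_manifest(metadata: Dict[str, str]) -> str:
    """Return manifest text with one 'FIELD<tab>value' pair per line."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError(f"missing manifest fields: {missing}")

    # Required fields first, in a fixed order, then any extra fields.
    lines = [f"{field}\t{metadata[field]}" for field in REQUIRED_FIELDS]
    lines += [
        f"{field}\t{value}"
        for field, value in metadata.items()
        if field not in REQUIRED_FIELDS
    ]
    return "\n".join(lines) + "\n"


# Hypothetical placeholder values, not real accessions or tools.
manifest = build_manifest(
    {
        "STUDY": "PRJEB00000",
        "SAMPLE": "ERS0000000",
        "ASSEMBLYNAME": "example_consensus_1",
        "ASSEMBLY_TYPE": "COVID-19 outbreak",
        "COVERAGE": "100",
        "PROGRAM": "example-assembler 1.0",
        "PLATFORM": "ILLUMINA",
        "FASTA": "consensus.fasta.gz",
    }
)
print(manifest)
```

Checking for missing fields before writing the file mirrors what the interactive form achieves: the user cannot reach the submission step with an incomplete manifest.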

Galaxy Docker container

If you cannot use the Galaxy instances at useGalaxy.eu, useGalaxy.be, or useGalaxy.org.au, possibly due to GDPR, a ready-to-use Docker container is available. The container includes additional tools and has been available in the SARS-CoV-2 GitHub repository since February 2020, as part of an effort published by Maier et al. (2021) to provide publicly accessible infrastructure and workflows for SARS-CoV-2 data analyses. The repository featured workflows for genomics, cheminformatics, and proteomics analyses. This centralization of workflows made it easy for Galaxy administrators to deploy and install tools and workflows together with the other submission tools, creating a one-stop shop for processing and submitting SARS-CoV-2 sequencing data.

The Docker container allows for the local deployment of a fully functional Galaxy instance, which ensures that data remain on-premises until submission. More information on how to obtain and deploy the container can be found in the GitHub repository.

What can you use the ENA data submission toolbox for?

The work described in this showcase can be used as is, or its modules can be adapted and integrated into new workflows.

Direct use for SARS-CoV-2 sequence submission

The tools and workflows mentioned here are available at the useGalaxy.be instance. The submission of raw reads can also be readily performed through useGalaxy.eu and useGalaxy.org.au.

The Galaxy Docker container can be used to deploy an instance containing all the tools described. Instructions on using the components are described in the ELIXIR-Belgium RDM Guide.

Repurposing of the showcase for other organisms

Since the beginning of the response to SARS-CoV-2, the tools described here have been expanded to allow the submission of data for any of the available ENA Sample checklists, and thus for any organism supported by ENA. The tools can be reused in multiple contexts: they can be integrated into different types of data brokering systems or used as is. A general-purpose container for deploying a Galaxy instance capable of submitting both raw sequences and genome assemblies for all ENA Sample checklists is available. The Galaxy ENA upload tool is available at useGalaxy.be, useGalaxy.eu and useGalaxy.org.au, giving researchers direct access without the need to set up their own infrastructure. More advanced bioinformatics groups can, if necessary, integrate the ENA upload CLI into their existing workflows. The metadata templates are available through a GitHub repository and can also be integrated into other workflows. Further instructions on accessing and using the different components of the toolbox can be found in our documentation.

Because the tools can be used to submit any sample type to the ENA and can be integrated in existing sequence analysis workflows, the potential user community is large and relates to the user community of ENA itself.

Publications

  • Roncoroni, M., Droesbeke, B., Eguinoa, I., De Ruyck, K., D’Anna, F., Yusuf, D., Grüning, B., Backofen, R., & Coppens, F. (2021). A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive. Bioinformatics, 37(21), 3983–3985. https://doi.org/10.1093/bioinformatics/btab421

  • Baker, D., van den Beek, M., Blankenberg, D., Bouvier, D., Chilton, J., et al. (2020). No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLOS Pathogens, 16(8), e1008643. https://doi.org/10.1371/journal.ppat.1008643

  • Maier, W., Bray, S., van den Beek, M., et al. (2021). Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology, 39, 1178–1179. https://doi.org/10.1038/s41587-021-01069-1

Acknowledgments

We thank Ulvi Talas and Heleri Inno (University of Tartu) for their valuable feedback during the SARS-CoV-2 response.

Support

This work was supported by Fonds Wetenschappelijk Onderzoek [I002819N] and Sonderforschungsbereich/TRR [167/2 Z01].

More information

| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| ENA upload CLI | Command line tool (CLI) allowing easy submission of data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or an Excel spreadsheet. The tool allows programmatic submission of all ENA objects (study, sample, run and experiment) without the need to log in to the Webin interface. This also includes client-side validation using ENA checklists and releasing the ENA objects. | SARS-CoV-2 sequencing ... | |
| ENA upload Galaxy tool | Galaxy tool wrapper of the ENA upload CLI to submit experimental data and respective metadata to the European Nucleotide Archive (ENA). | SARS-CoV-2 sequencing ... | |
| ENA Webin CLI | Galaxy wrapper to submit consensus sequences to ENA in an interactive way. The tool has the Webin-CLI script of ENA at its core and supports all sample checklists. | | |
| European Nucleotide Archive (ENA) | Provides a record of the nucleotide sequencing information. It includes raw sequencing data, sequence assembly information and functional annotation. | Pathogen characterisation; Human biomolecular data; An automated SARS-CoV-...; SARS-CoV-2 sequencing ...; Linked pathogen and ho... | Tool info; Standards/Databases; Training |
| Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Human biomolecular data; Human clinical and hea... | Tool info; Training |
| Metagen-FastQC | Cleans metagenomic reads to remove adapters, low-quality bases and host (e.g. human) contamination. | SARS-CoV-2 sequencing ... | |
| Webin-CLI | Command line application to submit assemblies and transcriptomes to ENA. | | Training |