Introduction
Data analysis for pathogen characterisation allows us to understand the evolution of pathogens and the relationships among different strains, and provides insights into host-pathogen interactions and drug resistance. The tasks can involve processing data collected from a diverse spectrum of sources, including both clinical and environmental samples. As in every data analysis procedure, the general workflow involves:
- Preprocessing: Includes the initial steps required to prepare data, genomic or otherwise, for further analysis.
- Analysis: The core stage, where the actual detection and characterisation of pathogens occur. This stage employs many techniques for pathogen characterisation, such as Next-Generation Sequencing (NGS).
- Postprocessing: Includes interpreting and validating the data obtained from the analysis stage, as well as integrating it into broader contexts. This is often followed by reporting and communication, and by archiving and data management.
Each stage is crucial for the accurate and comprehensive characterisation of pathogens, from the initial handling of samples to the final reporting and data management, and will be detailed below. Scalable and reproducible data analysis activities enable rapid surveillance of infectious epidemics of emerging and re-emerging pathogens in foodborne, hospital, and local community outbreaks. Ensuring reproducibility is critical for the usability of the analysis results. Following community-recognised best practices and the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) is fundamental for guaranteeing the trustworthiness of the results and enabling collaboration and sharing of information.
General considerations
When analysing pathogen data involved in a health emergency or epidemic outbreak, consider the following:
- Define the pathogen and specific aspects to be investigated, e.g. genomic features of interest
- Collect suitable reference data about the pathogen of interest, preferably from community-accepted repositories, e.g. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID). The right reference should be chosen taking into account mutation features, time of isolation, classification, phenotype, and genomic structure.
- Before analysing the data, define which specific aspect of the pathogen’s variability will be investigated. For example, if your aim is to describe the whole variability along the genome, the data should be compared with the whole reference genome.
- Define the type of data you are using, e.g. DNA-seq or RNA-seq for viral genome characterisation
- Select the tools best suited for the analysis of your data
- Estimate the computing resources needed
- Define which computing infrastructure is most suitable, e.g. cluster or cloud
- Ensure that you follow the FAIR principles when handling data
- Guarantee findability of the data and tools for all collaborators for reproducibility by providing your:
  - Code
  - Execution environment
  - Workflows
  - Data analysis execution, including parameters used
- Accompany all of the above with documentation that lists all parameters and other relevant information needed to reproduce the findings
Existing approaches
- Containers and environments: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability, being able to encapsulate both the software and its dependencies.
- Web-based code collaboration platform: Consider using a centralised location for software developers to store, manage, collaborate, and share their code. For instance, GitHub, GitLab, or Bitbucket.
- Workflow management systems: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are Nextflow and Snakemake.
- Workflow platforms: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as Galaxy, may offer multiple interfaces, e.g. web, GUI, and APIs.
- Reference databases: Collect suitable reference data about the pathogens to be investigated. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID) are examples of genomic databases to which researchers submit their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the Pathogens Portal, e.g. the Swedish Pathogens Portal (see the Swedish Pathogens Portal showcase).
- Workflow registries: Register workflows in platforms, such as WorkflowHub, that facilitate sharing, versioning, and authorship attribution of the pipelines.
For more general information and solutions on data analysis, you may have a look at the content available on the [RDMkit data analysis page](https://rdmkit.elixir-europe.org/data_analysis#what-are-the-best-practices-for-data-analysis). While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types.
Preprocessing
Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for the cleaning of the data, with the identification of inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data in a format compatible with the expected inputs of the chosen analysis pipelines.
Considerations
Some typical considerations involved in this step:
- Data cleaning: Finds and corrects errors in the data. For example, eliminating duplicates, removing genomic reads that are too short, and trimming information that is not useful, such as contaminating host data.
- Quality control checks: Should be conducted at each step to ensure that the data is suitable for the intended analysis.
- Exclusion of low-quality samples: Samples with low quality scores should be marked and removed. In genomics studies, samples with missing values, low sequencing depth, or contamination might be removed.
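The cleaning steps listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for dedicated tools: it assumes unwrapped four-line FASTQ records with Phred+33 quality encoding, and the names `parse_fastq`, `mean_phred`, and `clean_reads`, as well as the threshold values, are hypothetical choices made for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str  # quality string, assumed to be Phred+33 encoded


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from well-formed, unwrapped four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        sequence = next(it)
        next(it)  # the '+' separator line carries no information here
        quality = next(it)
        yield Read(header[1:], sequence, quality)


def mean_phred(quality: str) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - 33 for char in quality) / len(quality)


def clean_reads(reads: Iterable[Read], min_length: int = 50,
                min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Drop reads that are too short, of low mean quality, or exact duplicates."""
    seen = set()
    for read in reads:
        if len(read.sequence) < min_length:
            continue  # too short
        if mean_phred(read.quality) < min_mean_quality:
            continue  # low mean quality
        if read.sequence in seen:
            continue  # exact sequence duplicate
        seen.add(read.sequence)
        yield read


# Made-up example records: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
example = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read2", "ACGTACGTAC", "+", "IIIIIIIIII",  # duplicate of read1
    "@read3", "ACG", "+", "III",                # too short
    "@read4", "TTGCATGCAA", "+", "##########",  # low quality
]
kept = list(clean_reads(parse_fastq(example), min_length=5, min_mean_quality=20))
print([read.name for read in kept])  # prints ['read1']
```

In practice, thresholds depend on the sequencing technology and the pathogen, and removal of host contamination requires mapping against a host reference, which this sketch does not attempt.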
Existing approaches
Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly. Some common approaches in genomics studies include:
- Raw sequences quality check: FASTQC
- Trimming out adapters and low-quality sequences: Trimmomatic
- Quality checks: further information can be found on the Quality control - Pathogen characterisation page.
Analysis
The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. While genomics approaches are of common interest, analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance.
Considerations
- The computational resources: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources.
- The location of your data: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs.
- Document the steps: Report every step of the data analysis process, including the software versions, parameters, computing environment, and reference genome used, as well as any “manual” data curation steps. More information on recording provenance can be found on the Provenance pages.
- Collaborative analysis: It is important that partners have access to the data, tools, and workflows. It is crucial that systems are in place to track changes to the tools and workflows used, and that the history of modifications is accessible to all collaborators.
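As a rough illustration of documenting an analysis step, the sketch below collects the software version, parameters, a checksum of the reference sequence, and the computing environment into a JSON-serialisable record. The function name `build_provenance_record`, the record fields, and the tool name `example-aligner` are all hypothetical; workflow management systems and provenance standards offer more complete solutions.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def build_provenance_record(step, tool, tool_version, parameters,
                            reference_bytes, manual_notes=""):
    """Collect the details needed to re-run one analysis step (hypothetical schema)."""
    return {
        "step": step,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": parameters,
        # The checksum identifies exactly which reference sequence was used.
        "reference_sha256": hashlib.sha256(reference_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "manual_curation": manual_notes,
    }


# All values below are placeholders for illustration only.
record = build_provenance_record(
    step="read alignment",
    tool="example-aligner",
    tool_version="0.0.0",
    parameters={"threads": 4, "min_mapping_quality": 20},
    reference_bytes=b">example_reference\nACGTACGT\n",
    manual_notes="none",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such records alongside the results, under version control, gives collaborators a traceable history of how each output was produced.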
Existing approaches
There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed. Here are some solutions:
- Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. EuroHPC or making use of public Galaxy servers such as Galaxy Europe.
- Genomic analysis: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance profiles in bacteria (AMR). Examples of tools that allow us to take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are:
  - Sequence Alignment: Bowtie2, BWA and SAMtools
  - Genome Assembly: Canu, Velvet and SPAdes
  - Phylogenetic Analysis: ClustalW, MUSCLE, MAFFT, RAxML and IQtree
  - Molecular Clock: MrBayes, BEAST and BEAUti
  - Variant calling: Dragen-GATK, freebayes and VarScan
  - Annotation: ANNOVAR, SnpEff, VEP and dbNSFP
  - All-in-one Bioinformatic Tools: SNippy
- Metagenomics analysis: Sequencing all genetic material in a sample can provide comprehensive data about the composition of the microbial community. In the context of infectious diseases, it can aid in identifying multiple pathogens simultaneously in clinical, as well as environmental, samples. Examples of tools in this type of analysis, from the tool table at the end of this page, are MEGAHIT and QIIME 2.
- Proteomics analysis: Proteomics, primarily utilising mass spectrometry techniques, offers powerful tools for examining proteins and their interplay. This can provide valuable insights into irregularities associated with infectious diseases and potentially uncover mechanisms of drug resistance. Examples of tools in this type of analysis, from the tool table at the end of this page, are MAXQUANT, OMSSA, X! Tandem, PepArMl and apex.
- Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) produced by specific pathogens in biological samples, comparing them across different conditions or groups of samples. Examples of tools in this type of analysis, from the tool table at the end of this page, are MetaboAnalyst, xcms, Mzmine and Chenomx.
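To make the idea of comparing a sample against a reference genome concrete, here is a toy Python sketch that lists substitutions between two sequences that have already been aligned to the same length. It is not how variant callers such as freebayes or VarScan work (those evaluate read-level evidence and base qualities); the function name `list_substitutions` and the example sequences are hypothetical.

```python
def list_substitutions(reference: str, sample: str):
    """Return (reference_position, ref_base, sample_base) for each substitution.

    Both inputs are assumed to be rows of the same alignment, with '-' for
    gaps and 'N' for unknown bases. Positions are 1-based and counted along
    the ungapped reference, so insertions in the sample do not shift them.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    skipped = {"-", "N"}
    substitutions = []
    reference_position = 0
    for ref_base, sample_base in zip(reference.upper(), sample.upper()):
        if ref_base != "-":
            reference_position += 1
        if ref_base in skipped or sample_base in skipped:
            continue  # gaps and unknown bases are not reported as substitutions
        if ref_base != sample_base:
            substitutions.append((reference_position, ref_base, sample_base))
    return substitutions


# Made-up aligned sequences: one insertion in the sample and one unknown base.
reference = "ACGT-ACGT"
sample = "ACCTTACNA"
print(list_substitutions(reference, sample))  # prints [(3, 'G', 'C'), (8, 'T', 'A')]
```

Reporting positions in reference coordinates keeps the output comparable across samples aligned to the same reference.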
Postprocessing
In pathogen characterisation, the postprocessing steps are crucial to evaluate and interpret the results. These steps are important to identify strain relationships and specific molecular variation patterns linked to particular phenotypes of pathogens (e.g. drug resistance, virulence, and transmission rate). Such results must be biologically meaningful and reproducible, also taking into account clinical aspects and treatment implications.
Considerations
Some considerations about postprocessing steps in pathogen characterisation include:
- Interpretation: It is important to interpret the results in a biologically meaningful context. This should cover the following aspects: reporting the variability of specific pathogens; finding new strains that could become concerning; and identifying specific genes or mutations associated with pathogenic variation.
- Transformation: Consider having postprocessing steps to ensure that outputs are transformed or converted into interoperable and open formats. This ensures that subsequent pipelines and collaborators can readily make use of the results.
- Visualisation: To support interpretation in clinical practice, it is important to visualise the results clearly, so that they are understandable to all professionals involved.
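As one possible illustration of the transformation step, the sketch below writes a list of substitutions to tab-separated values (TSV), a plain-text format that most downstream tools and collaborators can read. The column names, the function name `variants_to_tsv`, and the sample identifier are assumptions made for this example, not an established schema.

```python
import csv
import io


def variants_to_tsv(variants, sample_id):
    """Serialise (position, reference, alternate) tuples as a TSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Hypothetical column names; a community standard such as VCF may be preferable.
    writer.writerow(["sample_id", "position", "reference", "alternate"])
    for position, reference, alternate in variants:
        writer.writerow([sample_id, position, reference, alternate])
    return buffer.getvalue()


# Made-up result values for illustration.
table = variants_to_tsv([(3, "G", "C"), (8, "T", "A")], sample_id="sampleA")
print(table)
```

Where a community format exists for the data type, converting to that format rather than to an ad hoc table usually improves interoperability further.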
Existing approaches
- Spatial-temporal analysis and visualisation: By combining phylogenetic, spatial distribution, and molecular clock analyses, this approach aids in designing strategies to control and prevent the spread of infectious diseases, as well as in the development of effective treatments and vaccines.
  - Spatial distribution of strain: Nextstrain
- Drug resistance characterisation: Genomic analysis can be used to characterise pathogens for specific resistance against drugs and help develop strategies to fight the spread of drug-resistant strains.
  - Antimicrobial resistance (AMR): ResFinder and Pathogenwatch
  - Viral drug resistance: Stanford HIV Drug Resistance Database (HIVDB)
- Interaction analysis and functional enrichment analysis: Placing the identified protein interactions and regulatory networks in the context of the affected biological pathways allows for a better understanding of disease mechanisms and potential drug targets.
  - Network analysis: Cytoscape and CellDesigner
  - Gene enrichment analysis: Enrichr, GO and g:Profiler
  - Interaction Databases: BioGRID and IntAct
- Integrative diagrams:
  - A disease map can be used to represent a conceptual model of the molecular mechanisms of a disease. An example is the COVID19 Disease Map.
Data analysis of wastewater surveillance for infectious diseases
Wastewater surveillance has emerged as a valuable tool for monitoring infectious diseases, providing a non-invasive method to track the spread of pathogens within communities. This approach has gained significant attention during the COVID-19 pandemic, particularly for detecting and analysing SARS-CoV-2 variants. By analysing wastewater samples, researchers can identify the presence and prevalence of infectious agents, offering insights into public health trends. Here we focus on the analysis of wastewater with an emphasis on SARS-CoV-2.
Considerations
Even though the considerations for this specific field are very similar to the ones described in the previous sections, some approaches are used specifically in the context of wastewater surveillance.
Existing approaches
Several tools and workflows have been developed or adapted for the analysis of wastewater data, especially in the context of SARS-CoV-2 surveillance:
- Specific Tools for SARS-CoV-2: Certain tools (such as Freyja, COJAC, and Lineagespot) are specifically designed for analysing SARS-CoV-2 data, providing capabilities such as variant detection and lineage tracking.
- Repurposed Tools: Originally developed for other types of genomic data, tools like Kallisto or Kraken 2 have been successfully applied to wastewater data analysis, offering high performance in read alignment and taxonomic classification.
- In addition, several bioinformatics protocols and solutions can be used in the context of wastewater next-generation sequencing (NGS) data analysis:
  - PiGx SARS-CoV-2 Wastewater Sequencing Pipeline: provides a comprehensive solution for sequencing and analysing SARS-CoV-2 in wastewater.
  - Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples (medRxiv preprint), with COWWID, a GitHub repository from the CBG-ETHZ group offering tools for detecting SARS-CoV-2 variants in Switzerland
  - CDC Module 2.7: Wastewater based variant tracking for SARS-CoV-2
  - The Public Health Alliance for Genomic Epidemiology GitHub organization makes available a mapping to the European Nucleotide Archive (ENA): SARS-CoV-2 Contextual Data Specification
  - PHES-ODM as an open data model for wastewater surveillance
  - Viral Lineage Quantification (VLQ), a Kallisto-based approach: lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques, with the corresponding repository at VLQ
  - Performance benchmark of tools such as Kraken 2, Kallisto, and Freyja, implemented in C-WAP
  - Wastewater quality control workflow in GalaxyTrakr (SSquAWK4). Further quality control aspects are discussed in the Quality Control - Pathogen Characterisation page
  - ECDC Guidance document for representative and targeted genomic SARS-CoV-2 monitoring
More information
Links to RDMkit
RDMkit is the Research Data Management toolkit for Life Sciences, describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable).
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
ANNOVAR | ANNOVAR is an efficient software tool to utilize up-to-date information to functionally annotate genetic variants detected from diverse genomes. | Human biomolecular data | Tool info |
apex | Absolute protein expression (APEX) Quantitative Proteomics Tool is a free and open source Java implementation of the APEX technique for the quantitation of proteins based on standard LC-MS/MS proteomics data. | | Tool info |
BEAST | BEAST is a cross-platform program for Bayesian phylogenetic analysis, estimating rooted, time-measured phylogenies using strict or relaxed molecular clock models. It uses Markov chain Monte Carlo (MCMC) to average over tree space and includes a graphical user interface for setting up analyses and tools for result analysis. | | Tool info |
BEAUti | BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files. | ||
BioGRID | BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions | Human biomolecular data | Tool info Standards/Databases |
Bitbucket | Git based code hosting and collaboration tool, built for teams. | Human biomolecular data | Standards/Databases |
Bowtie2 | Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. | Human biomolecular data | Tool info Training |
BWA | BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. | Human biomolecular data | Tool info Training |
C-WAP | CFSAN Wastewater Analysis Pipeline to estimate the percentage of SARS-CoV-2 variants in a sample. | ||
Canu | Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing. | Human biomolecular data | Tool info |
CellDesigner | CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks. | ||
Chenomx | A commercial software package for NMR spectral processing that offers a semi-automated tool for spectral deconvolution, enabling interactive fitting of metabolite peaks to reference spectra and quantifying their concentrations. | ||
ClustalW | ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments. | Human biomolecular data | Tool info Training |
COJAC | The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. | | Tool info |
COVID19 Disease Map | The COVID-19 Disease Map is an assembly of molecular interaction diagrams, established based on literature evidence. | ||
Cytoscape | Cytoscape provides a solid platform for network visualization and analysis | Human biomolecular data | Tool info Training |
dbNSFP | A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs | Human biomolecular data | Tool info |
Dragen-GATK | DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure. | Human biomolecular data | |
Enrichr | Functional Enrichment Analysis and Network Construction | | Tool info |
EuroHPC | EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe. | ||
European Nucleotide Archive (ENA) | Provides a record of the nucleotide sequencing information. It includes raw sequencing data, sequence assembly information and functional annotation. | Human clinical and hea... Pathogen characterisation Human biomolecular data An automated SARS-CoV-... Using the ENA data sub... SARS-CoV-2 sequencing ... Linked pathogen and ho... | Tool info Standards/Databases Training |
FASTQC | A quality control tool for high throughput sequence data. | Human biomolecular data Pathogen characterisation | Tool info Training |
freebayes | freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. | Human biomolecular data | Tool info Training |
Freyja | Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). | | Tool info |
g:Profiler | g:GOSt performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on input gene list. | | Tool info Training |
Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Human biomolecular data Human clinical and hea... Using the ENA data sub... | Tool info Training |
Galaxy Europe | The European Galaxy server. Provides access to thousands of tools for scalable and reproducible analysis. | An automated SARS-CoV-... | Training |
GitHub | GitHub is a platform for hosting Git repositories, used for sharing code, as well as for sharing of small data. | Human biomolecular data An automated pipeline ... | Standards/Databases Training |
GitLab | GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider. | Human biomolecular data | Standards/Databases Training |
Global Initiative on Sharing All Influenza Data (GISAID) | A web-based platform for sharing viral sequence data, initially for influenza data, and now for other pathogens (including SARS-CoV-2). | Human clinical and hea... Pathogen characterisation | Standards/Databases |
GO | GO (Gene Ontology) can be used to perform enrichment analysis on gene sets. | Human biomolecular data | Tool info Training |
IntAct | IntAct (Molecular Interaction Database) Website | Human biomolecular data | Tool info Standards/Databases Training |
IQtree | IQ-TREE is designed to efficiently handle large phylogenomic datasets, utilize multicore and distributed parallel computing for faster analysis, and automatically resume interrupted analyses through checkpointing. | | Tool info |
Kallisto | Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. | | Tool info |
Kraken 2 | A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. | ||
Lineagespot | Lineagespot is a framework written in R that aims to identify SARS-CoV-2 related mutations based on a single variant file or a list of variant files. | | Tool info |
MAFFT | MAFFT is a multiple sequence alignment program | Human biomolecular data | Tool info |
MAXQUANT | MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at high-resolution MS data. | | Tool info Training |
MEGAHIT | MEGAHIT is an ultra-fast and memory-efficient NGS assembler optimized for metagenomes. | | Tool info |
MetaboAnalyst | MetaboAnalyst is a comprehensive platform dedicated for metabolomics data analysis via user-friendly, web-based interface. | Human biomolecular data | Tool info Training |
MrBayes | MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters. | | Tool info |
MUSCLE | MUSCLE is widely-used software for making multiple alignments of biological sequences. | Human biomolecular data | Tool info Training |
Mzmine | MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data. | Human biomolecular data | Tool info |
Nextflow | Nextflow is a framework for data analysis workflow execution | Human biomolecular data | Tool info Training |
Nextstrain | Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. | | Tool info Training |
OMSSA | OMSSA (Open Mass Spectrometry Search Algorithm) is a tool to identify peptides in tandem mass spectrometry (MS/MS) data. The OMSSA algorithm uses a classic probability score to compute specificity. See also The NCBI C++ Toolkit and The NCBI C++ Toolkit Book. | | Tool info |
OpenMS | OpenMS is an open-source software C++ library for LC-MS data management and analyses. | Human biomolecular data | Tool info Training |
Pathogens Portal | The Pathogens Portal, launched in July 2023, is an invaluable resource for researchers, clinicians, and policymakers who need access to the latest and most comprehensive datasets on pathogens. The portal is a collaborative effort between the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and partners. | Linked pathogen and ho... The Swedish Pathogens ... | Standards/Databases Training |
Pathogenwatch | Pathogenwatch provides species and taxonomy prediction for over 60,000 variants of bacteria, viruses, and fungi. | ||
PepArMl | A Meta-Search Peptide Identification Platform for Tandem Mass Spectra | | Tool info |
PHES-ODM | A data model to improve wastewater surveillance through interoperable data. | ||
PiGx SARS-CoV-2 Wastewater Sequencing Pipeline | PiGx SARS-CoV-2 is a pipeline for analysing data from sequenced wastewater samples and identifying given lineages of SARS-CoV-2. | ||
QIIME 2 | QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. | | Tool info Training |
RAxML | A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies | | Tool info |
ReAdW | Convert Thermo Finnigan RAW mass spectrometry files to the mzXML format. | | Tool info |
ResFinder | ResFinder identifies acquired genes and/or finds chromosomal mutations mediating antimicrobial resistance in total or partial DNA sequence of bacteria. | | Tool info |
SAMtools | SAMtools is a suite of programs for interacting with high-throughput sequencing data. | Human biomolecular data Pathogen characterisation | Tool info Training |
SARS-CoV-2 Contextual Data Specification | A SARS-CoV-2 Contextual Data Specification from PHA4GE. | ||
Snakemake | Snakemake is a framework for data analysis workflow execution | Human biomolecular data Human clinical and hea... | Tool info Training |
SNippy | Rapid haploid variant calling and core genome alignment. | | Tool info |
SnpEff | Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins. | Human biomolecular data | Tool info Training |
SPAdes | SPAdes is an assembly toolkit containing various assembly pipelines. | Human biomolecular data | Tool info Training |
Stanford HIV Drug Resistance Database (HIVDB) | A curated database containing nearly all published HIV RT and protease sequences: a resource designed for researchers studying evolutionary and drug-related variation in the molecular targets of anti-HIV therapy. | ||
Swedish Pathogens Portal | The Swedish Pathogens Portal was previously known as the Swedish COVID-19 Data Portal. It is the Swedish national node of the Pathogens Portal, aimed at facilitating the sharing of data related to pathogens and pandemic preparedness. | The Swedish Pathogens ... | Standards/Databases |
Trimmomatic | Trimmomatic is a tool used for the removal of adapter sequences, low-quality reads, and sequences with ambiguous bases from NGS data. | Human biomolecular data | Tool info Training |
VarScan | Variant calling and somatic mutation/CNV detection for next-generation sequencing data | Human biomolecular data | Tool info |
Velvet | Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. | | Tool info Training |
VEP | VEP (Variant Effect Predictor) predicts the functional effects of genomic variants. | Human biomolecular data | Tool info Training |
VLQ | A pipeline for lineage abundance estimation from wastewater sequencing data. | ||
WorkflowHub | A registry for describing, sharing and publishing scientific computational workflows. | An automated SARS-CoV-... | Tool info Standards/Databases Training |
X! Tandem | X! Tandem open source is software that can match tandem mass spectra with peptide sequences, in a process that has come to be known as protein identification. | ||
xcms | Framework for processing and visualization of chromatographically separated and single-spectra mass spectral data. | | Tool info Training |