Pathogen characterisation

Introduction

Data analysis for pathogen characterisation allows us to understand the evolution of pathogens and the relationships among different strains, and provides insights into host-pathogen interactions and drug resistance. The tasks can involve processing data collected from a diverse spectrum of sources, including both clinical and environmental samples. As in every data analysis procedure, the general workflow involves:

Preprocessing: Includes the initial steps required to prepare data, genomic or otherwise, for further analysis.
Analysis: Is the core stage where the actual detection and characterization of pathogens occur. This stage employs many techniques for pathogen characterization, such as Next-Generation Sequencing (NGS).
Postprocessing: Includes interpreting and validating the data obtained from the analysis stage, as well as integrating it into broader contexts. Moreover, this is often followed by reporting and communication, and archiving and data management.

Each stage is crucial for the accurate and comprehensive characterisation of pathogens, from the initial handling of samples to the final reporting and data management, and will be detailed below. Scalable and reproducible data analysis activities enable rapid surveillance of infectious epidemics of emerging and re-emerging pathogens in foodborne, hospital settings, and local community outbreaks. Ensuring reproducibility is critical for the usability of the analysis results. Following community-recognised best practices and the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) is fundamental for guaranteeing the trustworthiness of the results and enabling collaboration and sharing of information.

General considerations

General considerations when analysing pathogen data involved in a health emergency or epidemic outbreak are:

Define the pathogen and specific aspects to be investigated, e.g. genomic features of interest
Collect suitable reference data about the pathogen of interest, preferably from community-accepted repositories, e.g. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID). It is worth noting that the right reference should be chosen taking into account mutation features, time of isolation, classification, phenotype, and genomic structure.
Before analysing the data, define which specific aspect of the pathogen’s variability will be investigated. For example, if your aim is to describe the whole variability along the genome, the data should be compared with the whole reference genome.
Define the type of data you are using, e.g. DNA-seq or RNA-seq for viral genome characterisation
Select the tools best suited for the analysis of your data
Estimate the computing resources needed
Define which computing infrastructure is most suitable, e.g. cluster or cloud
Ensure that you follow the FAIR principles when handling data
Guarantee findability of the data and tools for all collaborators for reproducibility by providing your:
- Code
- Execution environment
- Workflows
- Data analysis execution, including parameters used
- Accompanied by documentation that lists all parameters and other relevant information to reproduce the findings
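
The items listed above can be captured in a small machine-readable run record that travels with the results. The following is a minimal sketch, not a prescribed format: the function names, the record layout, and the example workflow, parameter, and tool names are all hypothetical and chosen only for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(workflow_name, workflow_version, parameters, tool_versions):
    """Collect the details a collaborator needs to re-run an analysis."""
    return {
        "workflow": {"name": workflow_name, "version": workflow_version},
        "parameters": dict(parameters),
        "tool_versions": dict(tool_versions),
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_run_record(record, path):
    """Store the record as JSON next to the analysis outputs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


# Hypothetical example values, for illustration only.
record = build_run_record(
    workflow_name="example-variant-calling",
    workflow_version="0.1.0",
    parameters={"min_depth": 10, "min_allele_frequency": 0.75},
    tool_versions={"example-aligner": "1.2.3"},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Workflow management systems and container images usually record much of this automatically. A hand-written record like this one is mainly useful for ad hoc scripts.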

Existing approaches

Containers and environments: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability, being able to encapsulate both the software and its dependencies.
Web-based code collaboration platform: Consider using a centralised location for software developers to store, manage, collaborate, and share their code. For instance, GitHub, GitLab, or Bitbucket.
Workflow management systems: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are Nextflow and Snakemake.
Workflow platforms: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as Galaxy, may offer multiple interfaces, e.g. web, GUI, and APIs.
Reference databases: Collect suitable reference data about the pathogens to be investigated. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID) are examples of genomic databases to which researchers share their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the Pathogens Portal, e.g. see the Swedish Pathogens Portal and the Swedish Pathogens Portal showcase.
Workflow registries: Register workflows in platforms, such as WorkflowHub, that facilitate sharing, versioning, and authorship attribution of the pipelines.

For more general information and solutions on data analysis, you may have a look at the content available on the RDMkit data analysis page. While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types.

Preprocessing

Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for the cleaning of the data, with the identification of inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data in a format compatible with the expected inputs of the chosen analysis pipelines.

General Considerations

Some typical considerations involved in this step:

Data cleaning: Finds and corrects errors in the data. Examples include eliminating duplicates, removing genomic reads that are too short, and trimming uninformative content such as contaminating host data.
Quality control checks: Should be conducted at each step to ensure that the data is suitable for the intended analysis.
Exclusion of low-quality samples: Samples with low-quality scores should be marked and removed. In genomics studies, samples with missing values, low sequencing depth, and contaminations might be removed.
Host read removal: Depending on the sample type, the host, and the analysis to be performed, it is strongly recommended to remove host reads that could affect the analysis.
Origin of the reads:
- Short reads or long reads: Short reads, generated by platforms like Illumina, offer high accuracy with low error rates but can be challenging for assembling repetitive regions. Long reads, produced by technologies such as Oxford Nanopore or PacBio, enable more comprehensive genome assemblies and structural variant detection, though they may have higher error rates.
- Sequencing platform: Different platforms vary in read quality, length, and throughput. For example, Illumina MiSeq may experience a significant decrease in R2 read quality when performing 300-cycle paired-end sequencing.
- Enrichment of the library: Depending on the study’s objective, different enrichment methods can be used, such as PCR amplification with specific primers (amplicon sequencing), probe-based hybridization capture (target enrichment), or amplification-free sequencing. Some enrichment methods may present specific quality issues that need to be addressed.
- fast5 or fastq files: ONT sequencing data can be stored in different formats, each requiring specific software for analysis
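
As an illustration of the data cleaning step described above, the sketch below removes reads that are too short and exact duplicate sequences from FASTQ data. It assumes the common four-lines-per-record FASTQ layout. The function names and the `min_length` threshold are illustrative choices, not taken from any of the tools named on this page.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def clean_reads(records, min_length=50):
    """Drop reads shorter than min_length and exact duplicate sequences."""
    seen = set()
    for header, sequence, quality in records:
        if len(sequence) < min_length:
            continue
        if sequence in seen:
            continue
        seen.add(sequence)
        yield header, sequence, quality


# Toy input: r2 is too short and r3 duplicates r1.
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nACGTACGT\n+\nIIIIIIII\n"
)
for header, sequence, quality in clean_reads(read_fastq(example), min_length=5):
    print(header, sequence)
```

Keeping every seen sequence in memory is only practical for small datasets. Dedicated tools use more efficient strategies and also handle paired-end reads.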

Existing approaches

Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly. Some common approaches in genomics studies include:

Short reads:
- Raw sequences quality check: FASTQC
- Trimming out adapters and low-quality sequences: Trimmomatic, fastp, Cutadapt, SOAPnuke
Long reads:
- Basecalling software: Guppy, Dorado
- Raw sequences quality check: NanoPlot, nanoq, PycoQC, LongQC, MinIONQC
- Trimming out adapters and low-quality sequences: Porechop, Nanofilt
Quality checks: further information can be found on the Quality control - Pathogen characterisation page.
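
To make the idea of trimming low-quality sequence concrete, here is a minimal sketch of 3' end quality trimming. It assumes Phred+33 encoded quality strings and uses a simple "cut trailing bases below a threshold" rule. This is a simplification for illustration and does not reproduce the algorithm of any tool listed above. The function names and the default threshold are assumptions.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to a list of Phred scores (Phred+33 by default)."""
    return [ord(character) - offset for character in quality]


def trim_three_prime(sequence, quality, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def mean_quality(quality, offset=33):
    """Arithmetic mean of the Phred scores (a rough summary, not an error rate)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
trimmed_sequence, trimmed_quality = trim_three_prime("ACGTAC", "IIII##")
print(trimmed_sequence, mean_quality(trimmed_quality))
```

Production trimmers typically use sliding windows or error-probability-based criteria and also remove adapters, so a rule like this one is best seen as a teaching aid.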

Analysis

The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. While genomics approaches are of common interest, analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance.

Considerations

The computational resources: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources.
The location of your data: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs.
Document the steps: Report every step of the data analysis process, including the software versions, parameters, and computing environment employed, the reference genome used, and any “manual” data curation steps. More information on recording provenance can be found on the Provenance pages.
Collaborative analysis: it is important that partners have access to the data, tools, and workflows. It is crucial that systems are in place to track changes to the tools and workflows used, and that the history of modifications is accessible to all collaborators.
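
The impact of the distance between data storage and computing can be estimated with simple arithmetic before choosing an infrastructure. The sketch below is a back-of-the-envelope calculation. The function name, the decimal unit convention (1 GB = 8000 megabits), and the `efficiency` factor for protocol overhead are assumptions made for illustration.

```python
def transfer_time_hours(data_gigabytes, bandwidth_megabits_per_second, efficiency=0.8):
    """Estimate the hours needed to move a dataset over a network link.

    efficiency is the assumed fraction of the nominal bandwidth that is
    actually achieved (protocol overhead, shared links, and so on).
    """
    if bandwidth_megabits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    data_megabits = data_gigabytes * 8 * 1000  # 1 GB = 1000 MB = 8000 megabits
    seconds = data_megabits / (bandwidth_megabits_per_second * efficiency)
    return seconds / 3600


# Hypothetical sequencing run of 450 GB over a 1 Gbit/s link.
print(f"{transfer_time_hours(450, 1000):.2f} hours")
```

If the estimate is of the same order as the analysis run time itself, moving the computation to the data is often worth considering.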

Existing approaches

There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed. Here are some solutions:

Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. EuroHPC or making use of public Galaxy servers such as Galaxy Europe.
Genomic analysis: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance profiles in bacteria (AMR). Examples of tools that allow us to take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are:
- Short reads:
  - All-in-one Bioinformatic Tools: SNippy, Dragen-GATK
  - Sequence Alignment: Bowtie2, BWA and SAMtools
  - Genome Assembly: Canu, Velvet and SPAdes
  - Phylogenetic Analysis: ClustalW, MUSCLE, MAFFT, RAxML, FastTree and IQtree
  - Molecular Clock: MrBayes, BEAST and BEAUti
  - Variant calling: Dragen-GATK, freebayes, Bcftools and iVar
  - Variant annotation: ANNOVAR, SnpEff, VEP and dbNSFP
  - Genome annotation: Prokka, Bakta or DFAST
  - Consensus genome generation: Bcftools, SAMtools
- Long reads:
  - All-in-one Bioinformatic Tools: artic
  - Sequence Alignment: Minimap2
  - Genome Assembly: Canu, Flye, Raven, Miniasm, Dragonflye
  - Assembly Polishing: Racon
  - Phylogenetic Analysis: ClustalW, MUSCLE, MAFFT, RAxML, FastTree, IQtree
  - Molecular Clock: MrBayes, BEAST and BEAUti
  - Variant Calling: Medaka, Nanopolish, NanoCaller
  - Variant Annotation: SnpEff
  - Genome annotation: Prokka, Bakta or DFAST
- Hybrid:
  - Genome Assembly: Dragonflye, Unicycler
  - Polishing: Pilon, Polypolish
Expression analysis (fungal genomes):
- Sequence Alignment: STAR, HISAT2, Kallisto
- Quantification: Salmon, RSEM
- Removal of ribosomal RNA: SortMeRNA
- Differential expression analysis: DESeq2
Metagenomics analysis: Sequencing all genetic material in a sample can provide comprehensive data about the composition of the microbial community. In the context of infectious diseases, it can aid in identifying multiple pathogens simultaneously in clinical, as well as environmental samples. Examples of tools in this type of analysis are:
- 16S rRNA sequencing: QIIME 2
- Shotgun sequencing: SPAdes, and MEGAHIT
Assigning taxonomic labels: Kraken 2, Kaiju, DIAMOND, MEGAN, MetaPhlAn, Centrifuge, mOTUs, KrakenUniq, KMCP, Ganon, Clark
Proteomics analysis: Proteomics, primarily utilising mass spectrometry techniques, offers powerful tools for examining proteins and their interplay. This can provide valuable insights into irregularities associated with infectious diseases and potentially uncover mechanisms of drug resistance. Examples of tools in this type of analysis are:
- Mass Spectrometry Data Extraction Software: ReAdW
- Search Algorithms: X! Tandem, OMSSA and MAXQUANT
- Statistical Validation: PepArMl
- Quantitative Tools: apex and MAXQUANT
Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) produced by specific pathogens in biological samples, comparing them across different conditions or groups of samples. Examples of tools in this type of analysis are:
- Mass Spectrometry Software: xcms and MetaboAnalyst
- NMR Spectroscopy Software: Chenomx
- Data Processing: xcms, Mzmine and OpenMS

Postprocessing

In pathogen characterisation, the postprocessing steps are crucial to evaluate and interpret the results. These steps are important to identify strain relationships and specific molecular variation patterns linked to peculiar phenotypes of pathogens (e.g. drug resistance, virulence, and transmission rate). Such results must be biologically meaningful and reproducible, considering also the clinical aspects and treatment implications.

Considerations

Some considerations about postprocessing steps in pathogen characterization include:

Interpretation: It is important to interpret the results in a biologically meaningful context. This includes reporting the variability of specific pathogens, detecting new strains that could become concerning, and identifying specific genes or mutations associated with pathogenic variation.
Transformation: Consider having postprocessing steps to ensure that outputs are transformed or converted into interoperable and open formats. This ensures that subsequent pipelines and collaborators can readily make use of the results.
Visualisation: To support clear interpretation in clinical practice, it is important to visualise the results in a way that is understandable to all professionals involved.
Quality metrics: It is important to obtain different quality metrics to evaluate the accuracy or quality of the data generated during the analysis. Depending on the analysis performed, the metrics and tools may vary significantly.

Existing approaches

Spatial-temporal analysis and visualisation: using a combined approach of phylogenetic, spatial distribution, and molecular clock, this approach aids in designing strategies to control and prevent the spread of infectious diseases, as well as in the development of effective treatments, and vaccines.
- Spatial distribution of strain: Nextstrain
Lineage, subtyping or clade assignment:
- Pangolin
- Nextclade
- IRMA
- FluServer
- HIV subtyping tools: CoMet, rega/scueal or Genome Detective
- HCV-GLUE
Drug resistance and virulence factor characterisation: genomic analysis can be used to characterise pathogens for specific resistance against drugs and help develop strategies to fight the spread of drug-resistant strains.
- Antimicrobial resistance (AMR) and virulence factors: ResFinder, AMRfinderplus, ARIBA, ABRicate and Pathogenwatch
- Viral drug resistance: Stanford HIV Drug Resistance Database (HIVDB) and FluServer
- Hepatitis C and B resistance profilers: Geno2pheno, HCV-GLUE
Interaction analysis and functional enrichment analysis: placing the identified protein interactions and regulatory networks in the context of the affected biological pathways allows for a better understanding of disease mechanisms and potential drug targets.
- Network analysis: Cytoscape and CellDesigner
- Gene enrichment analysis: Enrichr, GO and g:Profiler
- Interaction Databases: BioGRID and IntAct
- Integrative diagrams:
  - A disease map can be used to represent a conceptual model of the molecular mechanisms of a disease. An example is the COVID19 Disease Map.
Analysis metrics:
- Genome alignment metrics: SAMtools, Picard
- de novo assembly metrics: Quast
- Expression analysis quality metrics: RSeQC, Qualimap, dupRadar, preseq
General results aggregation:
- General metrics: MultiQC
- Antibiotic Resistance Genes: hAMRonization
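
As an example of a de novo assembly metric, the sketch below computes N50, i.e. the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches at least half of the total assembly length. The function names are illustrative. In practice a dedicated tool would report this and many other metrics.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    return 0


def assembly_summary(contig_lengths):
    """Collect a few basic assembly statistics in one dictionary."""
    lengths = list(contig_lengths)
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest_contig": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Toy contig lengths, for illustration only.
print(assembly_summary([100, 80, 50, 30, 20, 10]))
```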

Data analysis of wastewater surveillance for infectious diseases

Wastewater surveillance has emerged as a valuable tool for monitoring infectious diseases, providing a non-invasive method to track the spread of pathogens within communities. This approach has gained significant attention during the COVID-19 pandemic, particularly for detecting and analysing SARS-CoV-2 variants. By analysing wastewater samples, researchers can identify the presence and prevalence of infectious agents, offering insights into public health trends. Here we focus on the analysis of wastewater with an emphasis on SARS-CoV-2.

Considerations

Although the considerations for this field are very similar to those described in the previous sections, some approaches are specific to wastewater surveillance.

Existing approaches

Several tools and workflows have been developed or adapted for the analysis of wastewater data, especially in the context of SARS-CoV-2 surveillance:

Specific Tools for SARS-CoV-2: Certain tools (such as Freyja, COJAC, and Lineagespot) are specifically designed for analysing SARS-CoV-2 data, providing capabilities such as variant detection and lineage tracking.
Repurposed Tools: Originally developed for other types of genomic data, tools like Kallisto or Kraken 2 have been successfully applied to wastewater data analysis, offering high performance in read alignment and taxonomic classification.
In addition, here are several bioinformatics protocols and solutions that could be used in the context of wastewater next-generation sequencing (NGS) data analysis.
- PiGx SARS-CoV-2 Wastewater Sequencing Pipeline: provides a comprehensive solution for sequencing and analysing SARS-CoV-2 in wastewater.
- Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples (medRxiv preprint) and COWWID, a GitHub repository from the CBG-ETHZ group offering tools for detecting SARS-CoV-2 variants in Switzerland
- CDC Module 2.7: Wastewater based variant tracking for SARS-CoV-2
- The Public Health Alliance for Genomic Epidemiology GitHub organization makes available a mapping to the European Nucleotide Archive (ENA): SARS-CoV-2 Contextual Data Specification
- PHES-ODM as an open data model for wastewater surveillance
- Viral Lineage Quantification (VLQ), Kallisto-Approach: Baaijens et al., 2022 and corresponding repository at VLQ
- Performance benchmark of tools (Kayikcioglu et al., 2023), evaluating tools like Kraken2, Kallisto, Freyja, implemented in C-WAP
- Wastewater quality control workflow in GalaxyTrakr (SSquAWK4). Further quality control aspects are discussed in the Quality Control - Pathogen Characterisation page
- ECDC Guidance document for representative and targeted genomic SARS-CoV-2 monitoring
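
To give an intuition for variant detection in mixed wastewater samples, the sketch below estimates how prevalent a lineage is from read counts at its signature mutations. This is a deliberately naive illustration and not the method implemented by Freyja, COJAC, Lineagespot, or VLQ. The function name, the input layout, the `min_depth` threshold, and the mutation labels are all hypothetical.

```python
def lineage_frequency(signature_mutations, site_counts, min_depth=10):
    """Depth-weighted mean allele frequency over a lineage's signature mutations.

    site_counts maps a mutation label to (alternate_reads, total_reads).
    Sites with coverage below min_depth are ignored. Returns None when no
    signature site is sufficiently covered.
    """
    alternate_total = 0
    depth_total = 0
    for mutation in signature_mutations:
        alternate_reads, depth = site_counts.get(mutation, (0, 0))
        if depth < min_depth:
            continue
        alternate_total += alternate_reads
        depth_total += depth
    if depth_total == 0:
        return None
    return alternate_total / depth_total


# Hypothetical mutation labels and read counts, for illustration only.
counts = {"mutA": (30, 100), "mutB": (50, 100), "mutC": (1, 5)}
print(lineage_frequency(["mutA", "mutB", "mutC"], counts))
```

Because many mutations are shared between lineages, a per-lineage average like this can over- or underestimate abundance. The dedicated tools listed above address this with more elaborate statistical models.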

Data analysis of bacterial outbreaks

General considerations

An outbreak of an infectious disease is defined as an increase in the number of cases of a disease above what is typically expected in a specific area or among a particular population over a certain period of time. Outbreaks can occur in localised regions, such as a community, or they can spread across wider areas. They can involve new infections or the re-emergence of previously controlled diseases. Common examples include outbreaks of influenza, SARS-CoV-2, or foodborne illnesses such as those caused by Listeria monocytogenes, among others.

To determine whether an outbreak is caused by the same pathogen, it is necessary to isolate and characterise the pathogen using phenotypic or genotypic techniques, a process known as typification or microbial typing (Shelby R. et al., 2021). Microbial typing is used to differentiate between strains or species of microorganisms for various purposes, such as epidemiological studies, infection tracking, and outbreak investigations to pinpoint the source of foodborne outbreaks. It can also be used to identify which microorganisms are most virulent and cause serious diseases, are resistant to antimicrobial drugs, or are able to survive and multiply.

Techniques for microbial typing include traditional methods, such as serotyping, as well as fragment-based and sequence-based methods.

Pulsed-field gel electrophoresis (PFGE) was long considered the gold standard for bacterial typing and has been widely used for tracking outbreaks of foodborne pathogens, such as E. coli and Salmonella. However, technologies like Whole Genome Sequencing (WGS) have replaced PFGE due to their higher precision and resolution.

Different levels of sequence information can be associated with different taxonomic levels. A single locus (e.g. 16S rRNA) is enough to distinguish from phylum to genus and, in some cases, species, while resolution below the species level requires incorporating more genes, such as the 7 loci from MLST or the 53 loci from ribosomal MLST. The highest resolution comes from whole genome sequencing (WGS).

One of the significant advantages of whole genome sequencing (WGS) is its application in comparative genomics, enabling the determination of the phylogenetic relationships among a group of bacterial strains. There are several methods to assess the similarity between different genomes, e.g. gene-by-gene approaches such as whole genome Multi-Locus Sequence Typing (wgMLST) and core genome Multi-Locus Sequence Typing (cgMLST), or Single Nucleotide Polymorphism (SNP) approaches (Uelze et al., 2020).

wgMLST analyzes the entire genome, considering all the genes present in the genome. This method allows for a comprehensive understanding of the genetic variation within and between species.

cgMLST focuses specifically on the set of genes that are present in all strains of a species, the core genome. This excludes accessory genes (e.g. virulence genes), which are not necessary for the survival or reproduction of the organism under standard conditions and may vary among strains.
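
The difference between considering all loci and only core loci can be sketched with toy allele profiles. In the example below, the "core" is simply the set of loci called in every sample of the dataset, whereas real cgMLST schemes define the core loci per species in advance. The sample names, locus names, allele numbers, and function names are hypothetical.

```python
from itertools import combinations


def core_loci(profiles):
    """Return the loci that have an allele call in every sample."""
    locus_sets = [
        {locus for locus, allele in calls.items() if allele is not None}
        for calls in profiles.values()
    ]
    return set.intersection(*locus_sets) if locus_sets else set()


def allele_distance(calls_a, calls_b, loci):
    """Count the loci at which two samples carry different alleles."""
    return sum(1 for locus in loci if calls_a[locus] != calls_b[locus])


def distance_matrix(profiles):
    """Pairwise allele distances restricted to the core loci of the dataset."""
    loci = core_loci(profiles)
    return {
        (first, second): allele_distance(profiles[first], profiles[second], loci)
        for first, second in combinations(sorted(profiles), 2)
    }


# Toy allele profiles: "acc" is an accessory locus missing from some samples.
profiles = {
    "s1": {"l1": 1, "l2": 1, "l3": 2, "acc": 5},
    "s2": {"l1": 1, "l2": 2, "l3": 2, "acc": None},
    "s3": {"l1": 3, "l2": 2, "l3": 2},
}
print(sorted(core_loci(profiles)))
print(distance_matrix(profiles))
```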

Existing approaches

There are different tools available for bacterial outbreak analysis. Here they are classified based on the stage of the analysis.

Starting with sequencing reads (.fastq files):
- Pre-processing: During this step we aim to obtain a set of good quality reads that can be used for further analysis. We will inspect the quality of the raw reads, discard all the reads that don’t meet the quality standards and remove adapters, primers or unwanted sequences that may interfere with later steps of the analysis. Tools: FASTQC, fastp
- Bacterial genome identification: Identifying the organisms that are present in our samples is crucial. Even though the microbiology lab could’ve sent information about the organisms present in the samples, there could be contaminations that would lead to mapping, variant calling and annotation errors. Tools: KmerFinder
- De-novo assembly: If we want to make an in-depth analysis of the samples, we cannot work with a mess of unordered pieces of DNA. The reads can be aligned and merged (assembled) into a single sequence which we will use as a representation of the original DNA from the organism in the sample. Tools: Unicycler
Once we have the assembly or reconstructed DNA strand, we can proceed with further analysis:
- Genomic features annotation: Using the assembled sequence, the aim of this step is to automatically detect genes or subproducts of genes that may have functional properties in a non-specific way. These features can be relevant to characterize the organism in our samples as they may include strain-specific genes. Tools: Prokka, Bakta, DFAST
- AMR/virulence characterization: In contrast to the previous step, the aim here is to perform a specific search for relevant genes, including virulence factors and antibiotic-resistance genes. The main difference is that the search is performed within domain-specific databases:
  - Tools: ARIBA, AMRfinderplus
  - Databases: VFDB for virulence factors, CARD for antibiotic resistance genes, NCBI RefGene Pathogen Database: VF and AMR
- Plasmid identification: Sometimes the previously mentioned genes are found inside plasmids, which might restrain the previous tools from finding them. Not only that, but plasmids are quite relevant as they can be the source of horizontal gene transfer. These tools are used to find and characterize these plasmids: ARIBA, PlasmidID
- MLST characterization: MLST (Multi-Locus Sequence Typing) is an unambiguous procedure for characterizing isolates of bacterial species using the sequences of internal fragments of (usually) seven housekeeping genes. The aim is to obtain a profile based on the alleles for each gene, which can be used to establish relationships between the samples and the possible source of the outbreak. Tools: ARIBA used along with the PubMLST database.
  - Note: Different species may have different MLST schemes, so it is crucial to use the appropriate schema for each one.
- SNP calling and Core-SNPs matrix: To find relationships between the samples and a possible source, phylogenetic distance can be a good measure. This distance is calculated by counting all the different pairwise SNP positions among samples. However, since the number of SNPs increases exponentially with the number of samples, a good approach is to use only the SNPs that are common to all samples (Core SNPs). These values are represented in a distance matrix, which is then used to create a phylogenetic tree. Tools: SNippy
- Phylogenetic tree: Using the previously obtained distance matrix, it is possible to create a phylogenetic tree, which is a visual representation of the genetic relationships between the samples, based on clustering algorithms. This is useful to group similar samples together and infer potential outbreak sources.
  Tools: IQtree
- wgMLST and cgMLST: These are both gene-by-gene approaches. Genome assembly data is aligned to a scheme consisting of a set of loci and their corresponding allele sequences to perform allele calling. Each isolate is then characterized by its allele profile, and comparing multiple samples generates an allele distance matrix. Tools: chewBBACA
  - Schema resources:
- Pathogen-specific typing tools: MLST is not always enough to characterize a certain species. Some species have their unique typing methods. Here are species-specific tools:
  - Escherichia / Shigella: ECTyper, ShigaTyper or ShigEiFinder
  - Haemophilus: hicap or HpsuisSero
  - Klebsiella: Kleborate
  - Legionella: legsta
  - Listeria: LisSero
  - Mycobacterium: TBProfiler and MTBseq
  - Neisseria: meningotype or ngmaster
  - Pseudomonas: pasty
  - Salmonella: SeqSero2 and SISTR
  - Staphylococcus: AgrVATE, spaTyper and sccmec
  - Streptococcus: emmtyper, pbptyper or SsuisSero
  - Source: Bactopia Merlin
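
The core SNP distance matrix described in the "SNP calling and Core-SNPs matrix" step above can be illustrated with a small sketch. It assumes that the samples are already available as aligned sequences of equal length, and it treats a position as a core SNP when every sample has an unambiguous base (A, C, G, or T) there and at least two samples differ. The function names and the toy alignment are assumptions. This is not the implementation used by any of the listed tools.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def core_snp_positions(alignment):
    """Return the 0-based variable positions at which all samples have A, C, G, or T."""
    sequences = list(alignment.values())
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("all aligned sequences must have the same length")
    positions = []
    for index in range(length):
        column = {sequence[index].upper() for sequence in sequences}
        if column <= VALID_BASES and len(column) > 1:
            positions.append(index)
    return positions


def snp_distance_matrix(alignment):
    """Pairwise number of differing bases, counted over core SNP positions only."""
    positions = core_snp_positions(alignment)
    return {
        (first, second): sum(
            1
            for index in positions
            if alignment[first][index].upper() != alignment[second][index].upper()
        )
        for first, second in combinations(sorted(alignment), 2)
    }


# Toy alignment: position 4 contains an N and is therefore excluded.
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "AGGTNC"}
print(core_snp_positions(alignment))
print(snp_distance_matrix(alignment))
```

Restricting the comparison to positions covered in all samples keeps distances comparable across pairs, at the cost of discarding variation in regions that are missing or ambiguous in any sample.
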
Available pipelines: These pipelines synthesize some of the previous steps into a single workflow, making bioinformatics analysis easier:
- Bacterial assembly and annotation (short reads, long reads, hybrid):
  - nf-core/bacass
- Screening of functional genes:
  - nf-core/funcscan
- Taxonomy classification (short and long metagenomic reads):
  - nf-core/taxprofiler
- Complete analysis of bacterial genomes:
  - Bactopia

Viral outbreak

General considerations

The analysis of outbreaks in which a virus is suspected to be the causative agent constitutes an important tool for disease control.

Proper identification of the virus is needed to find out whether measures for its prevention and containment have already been described, so that the spread of the infection can be managed and new cases avoided. In certain scenarios, molecular diagnostic methods, such as PCR, fail to identify the virus, and other protocols, such as deep sequencing (e.g. metagenomics), are needed to correctly identify the pathogen.

Further analyses that can be performed are phylogenetic analysis and genome characterisation. The former is useful to identify the links between different samples from the same outbreak, track the spread of the infection, and locate its potential origin, in case additional control measures are needed. The latter is based on genome analysis to find specific biological characteristics of the virus (e.g. mutations related to antiviral resistance or high pathogenicity) that can help to identify potentially effective treatments or indicate the need for more stringent control measures.

Existing approaches

There are different tools available for viral outbreak analysis. Here they are classified based on the stage of the analysis.

Pre-processing: Data preprocessing is a crucial initial step prior to data analysis that involves reads quality check, low-quality cleaning and trimming, removal of adapters, primers or unwanted sequences from raw sequencing reads. Some of the tools to perform this process are:
- Short-reads: FASTQC, fastp, Cutadapt or iVar
- Long-reads: NanoPlot and Nanofilt
Viral species identification: Accurately identifying the viral species present in samples is essential for responding to viral outbreaks, especially when the causative agent is unknown. This step involves classifying the sample’s reads into various viral taxonomic species, using tools such as: Kraken 2 / Krona Tools or DIAMOND / MEGAN
Reference-genome mapping: This approach is particularly useful when the viral species causing the outbreak is known, a reference genome is available, and the viral genome does not exhibit significant variability. It involves aligning sequencing reads to the reference genome, which is a crucial step for accurate variant calling and in-depth genetic analysis. Widely used tools are:
- Short-reads: Bowtie2 and BWA
- Long-reads: Minimap2
Variant calling: This step involves identifying genetic variations, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants within a sample by comparing its sequencing data to a reference genome. Some useful tools are: iVar, Bcftools and LoFreq*
Variant annotation: Variant annotation is a step following variant calling that involves assigning biological and clinical significance to identified genetic variants. This step typically involves mapping variants to genes, determining their potential impact on protein function, and assessing their association with known phenotypes, diseases, or traits. Available tools are: SnpEff and SnpSift
Consensus genome reconstruction: Once we know the differences (variants) between the sample and the reference genome, we can change the positions in the reference genome to obtain a consensus genome .fasta file that can serve as the sample’s assembled genome. Tools: Bcftools
de novo assembly: When the viral reference genome is not known or is highly variable, de novo assembly is a better approach than reference genome mapping. De novo assembly reconstructs the reads of a viral genome into longer contiguous sequences (contigs) and scaffolds without the need for a reference genome. These can be compared to a database (using any of the tools for taxonomic classification mentioned above) to identify the pathogen. Some of the widely used tools are:
- Short-reads: SPAdes and Unicycler
- Long-reads: Canu, Flye, Raven, Miniasm and Dragonflye
- Hybrid: Dragonflye and Unicycler
Phylogeny: In the context of addressing a viral outbreak, phylogenetic studies can help in tracing the origins and transmission pathways of the infection. Generating phylogenetic trees visually represents the connections between organisms, illustrating how they have diverged over time. The process of performing phylogenetic analysis involves several steps:
- Sequence alignment of the generated genomes. Tools: MAFFT, MUSCLE or ClustalW
- Core genome SNPs: The core genome consists of the genes that are shared among all the samples/strains in the analysis, providing a stable genetic framework for comparative analysis. Tools: SNippy
- Phylogenetic trees: With the multiple sequence alignment or the core SNPs, researchers can perform phylogenetic analysis. The choice of algorithm and evolutionary model will influence the results, with commonly used tools including: IQtree and Nextstrain
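The steps above can be sketched on a toy alignment in Python. The code keeps only the variable columns in which every sample has an unambiguous base (loosely mirroring the idea of core SNPs) and counts pairwise SNP differences, a simple distance that is often inspected alongside a tree. It does not infer a phylogeny and is not a substitute for the tools named above; the function names and the dictionary-based alignment format are assumptions of this example.

```python
from itertools import combinations

def variable_sites(alignment):
    """Columns where samples differ and all bases are unambiguous (A, C, G, T).

    alignment: dict mapping sample name to an aligned sequence.
    """
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(len(seqs[0])):
        bases = {s[col] for s in seqs}
        # Skip columns containing gaps or ambiguous bases in any sample.
        if bases <= set("ACGT") and len(bases) > 1:
            sites.append(col)
    return sites

def snp_distances(alignment):
    """Pairwise number of differing bases over the variable sites."""
    sites = variable_sites(alignment)
    return {
        (a, b): sum(alignment[a][i] != alignment[b][i] for i in sites)
        for a, b in combinations(alignment, 2)
    }

alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACCTAC-A",
    "s4": "ACCTNCGA",
}
print(variable_sites(alignment))  # [2, 7]
for pair, distance in snp_distances(alignment).items():
    print(pair, distance)
```

In this toy data set, samples s3 and s4 are identical over the retained sites, which would be compatible with a close epidemiological link, although such a conclusion needs more evidence than a SNP count.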
Lineage/clade/type: When the virus causing the outbreak has been typed with different lineages, clades, or types/subtypes, it is essential to characterize this aspect of the samples. The information from the phylogenetic tree, combined with lineage data, can help determine the connections between samples involved in the outbreak. Tools: Pangolin, Nextclade or HIV subtyping tools such as CoMet or rega/scueal
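A very reduced picture of lineage assignment is matching the mutations found in a sample against sets of lineage-defining mutations, as in the Python sketch below. Dedicated tools use more sophisticated approaches than this. The lineage names, the mutation labels and the function `assign_lineage` are entirely hypothetical and serve only as an illustration.

```python
def assign_lineage(sample_mutations, lineage_definitions, min_fraction=0.8):
    """Pick the lineage whose defining mutations are best covered by the sample.

    sample_mutations: set of mutation labels observed in the sample.
    lineage_definitions: dict mapping lineage name to a non-empty set of
    defining mutations.
    Returns (lineage or None, fraction of defining mutations found).
    """
    best = (0.0, 0, None)  # (fraction matched, definition size, lineage name)
    for name, defining in lineage_definitions.items():
        fraction = len(defining & sample_mutations) / len(defining)
        # On equal fractions, prefer the more specific (larger) definition.
        best = max(best, (fraction, len(defining), name), key=lambda t: t[:2])
    fraction, _, name = best
    return (name if fraction >= min_fraction else None), fraction

# Hypothetical lineage definitions, not taken from any real nomenclature.
definitions = {
    "X.1": {"A10G", "C25T"},
    "X.1.1": {"A10G", "C25T", "G40A"},
    "Y.2": {"T5C", "G90T"},
}
print(assign_lineage({"A10G", "C25T", "G40A", "T77C"}, definitions))  # ('X.1.1', 1.0)
print(assign_lineage({"T5C"}, definitions))  # (None, 0.5)
```

Returning `None` below the threshold keeps poorly supported samples unassigned instead of forcing them into the nearest lineage.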
Available pipelines:
- nf-core/viralrecon: is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples.
- IRMA: was designed for the robust assembly, variant calling, and phasing of highly variable RNA viruses. Currently, IRMA is deployed with modules for influenza, ebolavirus, and coronavirus.

More information

Links to RDMkit

RDMkit is the Research Data Management toolkit for Life Sciences, describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable).

Data Analysis

Training

SARS-CoV-2 data analysis

SARS-CoV-2, viruses and bacteria data analysis

Pathway analysis with the MINERVA Platform

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
ABRicate	ABRicate is used for mass screening of contigs for antimicrobial resistance or virulence genes.		Tool info Training
AgrVATE	AgrVATE is a tool for rapid identification of Staphylococcus aureus agr locus type and also reports possible variants in the agr operon.		Tool info
AMRfinderplus	NCBI Antimicrobial Resistance Gene Finder (AMRFinderPlus)		Tool info
ANNOVAR	ANNOVAR is an efficient software tool to utilize up-to-date information to functionally annotate genetic variants detected from diverse genomes.	Human biomolecular data	Tool info
apex	The Absolute Protein Expression (APEX) Quantitative Proteomics Tool is a free and open source Java implementation of the APEX technique for the quantitation of proteins based on standard LC-MS/MS proteomics data.		Tool info
ARIBA	ARIBA (Antimicrobial Resistance Identification By Assembly) is a tool that identifies antimicrobial resistance genes from sequencing reads using local assemblies.
artic	artic is a pipeline and set of accompanying tools for working with viral nanopore sequencing data, generated from tiling amplicon schemes.		Tool info Standards/Databases Training
Bakta	Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.		Tool info Training
Bcftools	Bcftools is a set of tools for working with variant calls in the VCF format.	Human biomolecular data	Tool info Training
BEAST	BEAST is a cross-platform program for Bayesian phylogenetic analysis, estimating rooted, time-measured phylogenies using strict or relaxed molecular clock models. It uses Markov chain Monte Carlo (MCMC) to average over tree space and includes a graphical user interface for setting up analyses and tools for result analysis.		Tool info
BEAUti	BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files.
BioGRID	BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions	Human biomolecular data	Tool info Standards/Databases
Bitbucket	Git based code hosting and collaboration tool, built for teams.	Human biomolecular data	Standards/Databases
Bowtie2	Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.	Human biomolecular data	Tool info Training
BWA	BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome.	Human biomolecular data	Tool info Training
C-WAP	CFSAN Wastewater Analysis Pipeline to estimate the percentage of SARS-CoV-2 variants in a sample.
Canu	Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing.	Human biomolecular data	Tool info
CellDesigner	CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks.
Centrifuge	Tool to classify the taxonomic origin of a read in pathogen sequencing data	Human biomolecular data	Tool info
Chenomx	A commercial software package for NMR spectral processing that offers a semi-automated tool for spectral deconvolution, enabling interactive fitting of metabolite peaks to reference spectra and quantifying their concentrations.
chewBBACA	chewBBACA is a software suite for the creation and evaluation of core genome and whole genome MultiLocus Sequence Typing (cg/wgMLST) schemas and results.		Tool info
Clark	Clark is a fast, versatile and accurate tool for sequence classification.		Tool info Training
ClustalW	ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments.	Human biomolecular data	Tool info Training
COJAC	The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons.		Tool info
CoMet	A workflow using contig coverage and composition for binning a metagenomic sample with high precision		Tool info
COVID19 Disease Map	The COVID-19 Disease Map is an assembly of molecular interaction diagrams, established based on literature evidence.
Cutadapt	Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.		Tool info Training
Cytoscape	Cytoscape provides a solid platform for network visualization and analysis	Human biomolecular data	Tool info Training
dbNSFP	A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs	Human biomolecular data	Tool info
DESeq2	Differential gene expression analysis based on the negative binomial distribution	Human biomolecular data	Tool info Training
DFAST	DFAST is a flexible and customizable pipeline for prokaryotic genome annotation as well as data submission to the INSDC		Tool info
DIAMOND	DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data.		Tool info Training
Dorado	Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads.		Tool info
Dragen-GATK	DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure.	Human biomolecular data
Dragonflye	Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy.
dupRadar	dupRadar is used for the assessment of duplication rates in RNA-Seq datasets.		Tool info
ECTyper	ECTyper is a standalone versatile serotyping module for Escherichia coli. It supports both fasta (assembled) and fastq (raw reads) file formats.		Tool info
emmtyper	emmtyper is a command line tool for emm-typing of Streptococcus pyogenes using a de novo or complete assembly.		Tool info
Enrichr	Functional Enrichment Analysis and Network Construction		Tool info
EuroHPC	EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe.
European Nucleotide Archive (ENA)	Provides a record of the nucleotide sequencing information. It includes raw sequencing data, sequence assembly information and functional annotation.	Human clinical and hea... Pathogen characterisation Human biomolecular data An automated SARS-CoV-... Using the ENA data sub... SARS-CoV-2 sequencing ... Linked pathogen and ho...	Tool info Standards/Databases Training
fastp	A tool designed to provide ultrafast all-in-one preprocessing and quality control for FastQ data.		Tool info Training
FASTQC	A quality control tool for high throughput sequence data.	Human biomolecular data Pathogen characterisation	Tool info Training
FastTree	FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences		Tool info Training
FluSurver	FluSurver highlights phenotypically or epidemiologically interesting candidate mutations for further research. Its predictions should ideally be combined with experimental testing and verification of any predicted phenotypes.
Flye	Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.	Human biomolecular data	Tool info Training
freebayes	freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment.	Human biomolecular data	Tool info Training
Freyja	Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference).		Tool info
g:Profiler	g:GOSt performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on input gene list.		Tool info Training
Galaxy	Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.	Human biomolecular data General guidelines Human clinical and hea... Using the ENA data sub...	Tool info Training
Galaxy Europe	The European Galaxy server. Provides access to thousands of tools for scalable and reproducible analysis.	An automated SARS-CoV-...	Training
Ganon	ganon2 classifies DNA sequences against large sets of genomic reference sequences efficiently.		Tool info
Geno2pheno	Estimating phenotypic drug resistance from HIV-1 genotypes associated with resistance to PRO and RT inhibitors		Tool info
Genome Detective	Genome Detective offers intuitive Bio-Informatics applications for the analysis of microbial molecular sequence data.
GitHub	GitHub is a Git-based code hosting platform, used for sharing code, as well as for sharing of small data.	Human biomolecular data An automated pipeline ...	Standards/Databases Standards/Databases Training
GitLab	GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider.	Human biomolecular data	Standards/Databases Training
Global Initiative on Sharing All Influenza Data (GISAID)	A web-based platform for sharing viral sequence data, initially for influenza data, and now for other pathogens (including SARS-CoV-2).	Human clinical and hea... Pathogen characterisation	Standards/Databases
GO	GO (Gene Ontology) is used to perform enrichment analysis on gene sets.	Human biomolecular data	Tool info Training
Guppy	Guppy is a bioinformatics toolkit that enables real-time basecalling and several post-processing features that works on Oxford Nanopore Technologies™ sequencing platforms.		Tool info
hAMRonization	hAMRonization is a software tool to harmonize and standardize antimicrobial resistance (AMR) data generated by various bioinformatics tools.		Tool info
HCV-GLUE	HCV-GLUE is a bioinformatics resource for HCV sequence data.
hicap	hicap performs in silico typing of the H. influenzae cap locus, which is categorised into 6 different groups based on serology (a-f).
HISAT2	HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome).	Human biomolecular data	Tool info Training
IntAct	IntAct (Molecular Interaction Database) Website	Human biomolecular data	Tool info Standards/Databases Training
IQtree	IQ-TREE is designed to efficiently handle large phylogenomic datasets, utilize multicore and distributed parallel computing for faster analysis, and automatically resume interrupted analyses through checkpointing.		Tool info Training
IRMA	IRMA (Iterative Refinement Meta-Assembler) was designed for the robust assembly, variant calling, and phasing of highly variable RNA viruses.		Tool info
iVar	iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.		Tool info Training
Kaiju	Kaiju is a program for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA.		Tool info
Kallisto	Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.		Tool info
Kleborate	Kleborate was primarily developed to screen genome assemblies of Klebsiella pneumoniae and the Klebsiella pneumoniae species complex (KpSC).		Tool info
KMCP	KMCP is a tool for accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping.
KmerFinder	KmerFinder is a bioinformatics tool designed for the rapid identification of bacterial species and strains from whole genome sequencing (WGS) data.		Tool info
Kraken 2	A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.
KrakenUniq	KrakenUniq addresses false-positive identifications, a significant problem in metagenomics classification, by counting the unique k-mers found for each taxon.		Tool info
Krona Tools	Krona Tools is a set of scripts to create Krona charts from several Bioinformatics tools as well as from text and XML files.		Tool info
legsta	In silico Legionella pneumophila Sequence Based Typing (SBT).		Tool info
Lineagespot	Lineagespot is a framework written in R, and aims to identify SARS-CoV-2 related mutations based on a single (or a list) of variant(s) file(s).		Tool info
LisSero	In silico serogroup typing prediction for Listeria monocytogenes
LoFreq*	LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data		Training
LongQC	LongQC is a tool for the data quality control of the PacBio and ONT long reads, and it has two functionalities: sample qc and platform qc.
MAFFT	MAFFT is a multiple sequence alignment program	Human biomolecular data	Tool info Training
MAXQUANT	MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at high-resolution MS data.		Tool info Training
Medaka	medaka is a tool to create consensus sequences and variant calls from nanopore sequencing data.		Tool info Standards/Databases Training
MEGAHIT	MEGAHIT is an ultra-fast and memory-efficient NGS assembler optimized for metagenomes.		Tool info Training
MEGAN	MEGAN is used for the interactive exploration and analysis of large-scale microbiome sequencing data.		Tool info
meningotype	In silico typing of Neisseria meningitidis contigs.		Tool info
MetaboAnalyst	MetaboAnalyst is a comprehensive platform dedicated for metabolomics data analysis via user-friendly, web-based interface.	Human biomolecular data	Tool info Training
MetaPhlAn	MetaPhlAn is a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data.		Tool info Training
Miniasm	Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format.		Tool info Training
Minimap2	Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.		Tool info Training
MinIONQC	Fast and effective quality control for MinION and PromethION sequencing data
mOTUs	The mOTU profiler is a computational tool that estimates relative taxonomic abundance of known and currently unknown microbial community members using metagenomic shotgun sequencing data.		Tool info Training
MrBayes	MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.		Tool info
MTBseq	MTBseq is an automated pipeline for mapping, variant calling and detection of resistance mediating and phylogenetic variants from Illumina whole genome sequence data of Mycobacterium tuberculosis complex isolates.
MultiQC	MultiQC searches a given directory for analysis logs and compiles a HTML report.	Pathogen characterisation	Tool info Training
MUSCLE	MUSCLE is widely-used software for making multiple alignments of biological sequences.	Human biomolecular data	Tool info Training
Mzmine	MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data.	Human biomolecular data	Tool info
NanoCaller	NanoCaller is a computational method that integrates long reads in deep convolutional neural network for the detection of SNPs/indels from long-read sequencing data.		Tool info
Nanofilt	Filtering and trimming of long read sequencing data.
NanoPlot	NanoPlot is a plotting tool for long read sequencing data and alignments.		Tool info Training
Nanopolish	Software package for signal-level analysis of Oxford Nanopore sequencing data.		Tool info
nanoq	nanoq is an ultra-fast tool for quality control and summary reports of nanopore reads.		Training
Nextclade	Nextclade is a tool for viral genome clade assignment, mutation calling, and sequence quality checks.		Tool info Training
Nextflow	Nextflow is a framework for data analysis workflow execution	Human biomolecular data	Tool info Training
Nextstrain	Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.		Tool info Standards/Databases Training
ngmaster	In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) and Neisseria gonorrhoeae sequence typing for antimicrobial resistance (NG-STAR).
OMSSA	OMSSA (Open Mass Spectrometry Search Algorithm) is a tool to identify peptides in tandem mass spectrometry (MS/MS) data. The OMSSA algorithm uses a classic probability score to compute specificity. See also The NCBI C++ Toolkit and The NCBI C++ Toolkit Book.		Tool info
OpenMS	OpenMS is an open-source software C++ library for LC-MS data management and analyses.	Human biomolecular data	Tool info Training
Pangolin	Pangolin is a tool for the Phylogenetic Assignment of Named Global Outbreak LINeages		Tool info Training
pasty	A tool for in silico serogrouping of Pseudomonas aeruginosa isolates.
Pathogens Portal	The Pathogens Portal, launched in July 2023, is an invaluable resource for researchers, clinicians, and policymakers who need access to the latest and most comprehensive datasets on pathogens. The portal is a collaborative effort between the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and partners.	Linked pathogen and ho... The Swedish Pathogens ...	Standards/Databases Training
Pathogenwatch	Pathogenwatch provides species and taxonomy prediction for over 60,000 variants of bacteria, viruses, and fungi.
pbptyper	In silico Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies
PepArML	A Meta-Search Peptide Identification Platform for Tandem Mass Spectra		Tool info
PHES-ODM	A data model to improve wastewater surveillance through interoperable data.
Picard	Picard is a suite of tools that provides quality control and processing of NGS data, including duplicate read removal, format conversion, and alignment.	Human biomolecular data	Tool info
PiGx SARS-CoV-2 Wastewater Sequencing Pipeline	PiGx SARS-CoV-2 is a pipeline for analysing data from sequenced wastewater samples and identifying given lineages of SARS-CoV-2.
Pilon	Pilon is a software tool which can be used to automatically improve draft assemblies and find variation among strains, including large event detection.		Tool info Training
PlasmidID	PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that analyzes and gives graphic solution for plasmid identification.		Tool info
Polypolish	Polypolish is a tool for polishing genome assemblies with short reads.		Tool info Training
Porechop	Porechop is a tool for finding and removing adapters from Oxford Nanopore reads.
preseq	The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment.		Tool info
Prokka	Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.		Tool info Training
PycoQC	PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore technologies sequencing data		Tool info Training
QIIME 2	QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency.		Tool info Training
Qualimap	Qualimap is a quality control tool that assesses the quality of the sequencing data at different stages of the analysis pipeline, including read mapping, coverage, and expression analysis.	Human biomolecular data	Tool info Training
Quast	QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.		Tool info Training
Racon	Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step.		Tool info Training
Raven	Raven is a de novo genome assembler for long uncorrected reads.		Tool info
RAxML	A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies		Tool info Training
ReAdW	Converts Thermo Finnigan RAW mass spectrometry files to the mzXML format.		Tool info
ResFinder	ResFinder identifies acquired genes and/or finds chromosomal mutations mediating antimicrobial resistance in total or partial DNA sequence of bacteria.		Tool info
RSEM	RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.		Tool info
RSeQC	RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data.		Tool info Training
Salmon	Salmon is a wicked-fast program to produce highly accurate, transcript-level quantification estimates from RNA-seq data.		Tool info Training
SAMtools	SAMtools is a suite of programs for interacting with high-throughput sequencing data.	Human biomolecular data Pathogen characterisation	Tool info Training
SARS-CoV-2 Contextual Data Specification	A SARS-CoV-2 Contextual Data Specification from PHA4GE.
sccmec	sccmec is a tool for typing SCCmec cassettes in assemblies.
SeqSero2	Salmonella serotype prediction from genome sequencing data.		Tool info
ShigaTyper	ShigaTyper is a quick and easy tool designed to determine Shigella serotype using Illumina (single or paired-end) or Oxford Nanopore reads with low computation requirement.
ShigEiFinder	ShigEiFinder is a tool used to differentiate Shigella/EIEC using cluster-specific genes and to identify the serotype using O-antigen/H-antigen genes.		Tool info
SISTR	Salmonella serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.		Tool info
Snakemake	Snakemake is a framework for data analysis workflow execution	Human biomolecular data General guidelines Human clinical and hea...	Tool info Training
SNippy	Rapid haploid variant calling and core genome alignment.		Tool info Training
SnpEff	Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins.	Human biomolecular data	Tool info Training
SnpSift	SnpSift annotates genomic variants using databases, filters, and manipulates genomic annotated variants.		Tool info
SOAPnuke	A novel analysis tool developed for quality control and preprocessing of FASTQ and SAM/BAM data.		Tool info
SortMeRNA	SortMeRNA is a local sequence alignment tool for filtering, mapping and clustering.		Tool info Training
SPAdes	SPAdes is an assembly toolkit containing various assembly pipelines.	Human biomolecular data	Tool info Training
spaTyper	Given a fasta file or multiple fasta files, identifies the repeats and the order and generates a spa type.
SsuisSero	This pipeline is designed to rapidly infer Streptococcus suis serotype from Oxford Nanopore data by first assembling a draft genome using Flye, followed by genome polishing with racon and medaka.
Stanford HIV Drug Resistance Database (HIVDB)	A curated database containing nearly all published HIV RT and protease sequences: a resource designed for researchers studying evolutionary and drug-related variation in the molecular targets of anti-HIV therapy.
STAR	Spliced Transcripts Alignment to a Reference	Human biomolecular data	Tool info Training
Swedish Pathogens Portal	The Swedish Pathogens Portal was previously known as the Swedish COVID-19 Data Portal. It is the Swedish national node of the Pathogens Portal, aimed at facilitating the sharing of data related to pathogens and pandemic preparedness.	The Swedish Pathogens ...	Standards/Databases
TBProfiler	TBProfiler can rapidly and accurately predict anti-TB drug resistance profiles across large numbers of samples with WGS data.
Trimmomatic	Trimmomatic is a tool used for the removal of adapter sequences, low-quality reads, and sequences with ambiguous bases from NGS data.	Human biomolecular data	Tool info Training
Unicycler	Unicycler is an assembly pipeline for bacterial genomes. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a short-read-first hybrid assembly.		Tool info Training
Velvet	Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments.		Tool info Training
VEP	VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.	Human biomolecular data	Tool info Training
VLQ	A pipeline for lineage abundance estimation from wastewater sequencing data.
WorkflowHub	A registry for describing, sharing and publishing scientific computational workflows.	An automated SARS-CoV-...	Tool info Standards/Databases Training
X! Tandem	X! Tandem open source is software that can match tandem mass spectra with peptide sequences, in a process that has come to be known as protein identification.
xcms	Framework for processing and visualization of chromatographically separated and single-spectra mass spectral data.		Tool info Training