Introduction
Data analysis for pathogen characterization allows us to understand the evolution of pathogens and the relationships among different strains, and provides insights into host-pathogen interactions and drug resistance. The tasks can involve processing data collected from a diverse spectrum of sources, including both clinical and environmental samples. As in every data analysis procedure, the general workflow involves:
- Preprocessing: Includes the initial steps required to prepare data, genomic or otherwise, for further analysis.
- Analysis: The core stage, where the actual detection and characterization of pathogens occur. This stage employs many techniques for pathogen characterization, such as Next-Generation Sequencing (NGS).
- Postprocessing: Includes interpreting and validating the data obtained from the analysis stage, as well as integrating it into broader contexts. This is often followed by reporting and communication, and by archiving and data management.
Each stage is crucial for the accurate and comprehensive characterisation of pathogens, from the initial handling of samples to the final reporting and data management, and will be detailed below. Scalable and reproducible data analysis activities enable rapid surveillance of infectious epidemics of emerging and re-emerging pathogens in foodborne, hospital, and local community outbreaks. Ensuring reproducibility is critical for the usability of the analysis results. Following community-recognised best practices and the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) is fundamental for guaranteeing the trustworthiness of the results and enabling collaboration and sharing of information.
General considerations
When analysing pathogen data involved in a health emergency or epidemic outbreak, consider the following:
- Define the pathogen and specific aspects to be investigated, e.g. genomic features of interest
- Collect suitable reference data about the pathogen of interest, preferably from community-accepted repositories, e.g. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID). The reference should be chosen taking into account mutation features, time of isolation, classification, phenotype, and genomic structure.
- Before analysing the data, define which specific aspect of the pathogen’s variability will be investigated. For example, if your aim is to describe the whole variability along the genome, the data should be compared with the whole reference genome.
- Define the type of data you are using, e.g. DNA or RNAseq for viral genome characterisation
- Select the tools best suited for the analysis of your data
- Estimate the computing resources needed
- Define which computing infrastructure is most suitable, e.g. cluster or cloud
- Follow the FAIR principles when handling data
- Guarantee findability of the data and tools for all collaborators for reproducibility by providing your:
- Code
- Execution environment
- Workflows
- Data analysis execution, including parameters used
- Accompanied by documentation that lists all parameters and other relevant information to reproduce the findings
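The last points above (recording the execution, its parameters, and the environment) can be illustrated with a minimal sketch. It assumes a simple JSON "run record" written next to the results; the function names, the workflow name, the parameter names, and the tool version shown are hypothetical placeholders and not part of any specific standard or tool.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(workflow_name, parameters, tool_versions):
    """Collect the details needed to re-run an analysis into one dictionary."""
    return {
        "workflow": workflow_name,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": dict(parameters),
        "tool_versions": dict(tool_versions),
    }


def write_run_record(record, path):
    """Write the record as JSON so collaborators can read it without extra tools."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


# Hypothetical example values, for illustration only.
record = build_run_record(
    workflow_name="example-variant-calling",
    parameters={"min_base_quality": 20, "reference": "reference.fasta"},
    tool_versions={"example-aligner": "0.0.0"},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

In practice, workflow management systems and registries usually capture much of this automatically; a hand-written record like this is mainly useful for ad hoc or manual steps.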
Existing approaches
- Containers and environments: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability, being able to encapsulate both the software and its dependencies.
- Web-based code collaboration platform: Consider using a centralised location for software developers to store, manage, collaborate, and share their code. For instance, GitHub, GitLab, or Bitbucket.
- Workflow management systems: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are Nextflow and Snakemake.
- Workflow platforms: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as Galaxy, may offer multiple interfaces, e.g. web, GUI, and APIs.
- Reference databases: Collect the suitable reference data about pathogens to be investigated. European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID) are examples of genomic databases to which researchers share their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the Pathogens Portal, e.g. see the Swedish Pathogens Portal showcase.
- Workflow registries: Register workflows in platforms, such as WorkflowHub, that facilitate sharing, versioning, and authorship attribution of the pipelines.
For more general information and solutions on data analysis, you may have a look at the content available on the RDMkit data analysis page. While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types.
Preprocessing
Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for the cleaning of the data, with the identification of inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data in a format compatible with the expected inputs of the chosen analysis pipelines.
General Considerations
Some typical considerations involved in this step:
- Data cleaning: Finds and corrects errors in the data. For example, eliminating duplicates, removing genomic reads that are too short, and trimming information that is not useful, such as contaminating host data.
- Quality control checks: Should be conducted at each step to ensure that the data is suitable for the intended analysis.
- Exclusion of low-quality samples: Samples with low-quality scores should be marked and removed. In genomics studies, samples with missing values, low sequencing depth, or contamination might be removed.
- Host read removal: Depending on the sample type, the host, and the analysis to be performed, it is strongly recommended to remove host reads that can affect the analysis.
- Origin of the reads:
- Short reads or long reads: Short reads, generated by platforms like Illumina, offer high accuracy with low error rates but can be challenging for assembling repetitive regions. Long reads, produced by technologies such as Oxford Nanopore or PacBio, enable more comprehensive genome assemblies and structural variant detection, though they may have higher error rates.
- Sequencing platform: Different platforms vary in read quality, length, and throughput. For example, Illumina MiSeq may experience a significant decrease in R2 read quality when performing 300-cycle paired-end sequencing.
- Enrichment of the library: Depending on the study’s objective, different enrichment methods can be used, such as PCR amplification with specific primers (amplicon sequencing), probe-based hybridization capture (target enrichment), or amplification-free sequencing. Some enrichment methods may present specific quality issues that need to be addressed.
- fast5 or fastq files: ONT sequencing data can be stored in different formats, each requiring specific software for analysis
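The data cleaning considerations above (removing reads that are too short, low-quality reads, and duplicates) can be sketched in a few lines. This is a simplified illustration, not a replacement for dedicated tools: it assumes four-line FASTQ records with Phred+33 quality encoding, treats only exact sequence matches as duplicates, and uses arbitrary example thresholds.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_length=50, min_mean_quality=20.0):
    """Drop short reads, low-quality reads, and exact duplicate sequences."""
    seen = set()
    for header, sequence, quality in records:
        if len(sequence) < min_length:
            continue
        if mean_phred(quality) < min_mean_quality:
            continue
        if sequence in seen:
            continue
        seen.add(sequence)
        yield header, sequence, quality


# Tiny made-up example: read2 is too short, read3 duplicates read1,
# and read4 has very low base qualities.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACG\n+\nIII\n"
    "@read3\nACGTACGT\n+\nIIIIIIII\n"
    "@read4\nTTTTGGGG\n+\n!!!!!!!!\n"
)
kept = list(filter_reads(read_fastq(example), min_length=5, min_mean_quality=20))
print([header for header, _, _ in kept])
```

Suitable thresholds depend on the sequencing platform and the downstream analysis, so the values used here should not be taken as recommendations.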
Existing approaches
Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly. Some common approaches in genomics studies include:
- Short reads:
- Raw sequences quality check: FASTQC
- Trimming out adapters and low-quality sequences: Trimmomatic, fastp, Cutadapt, SOAPnuke
- Long reads:
- Quality checks: further information can be found on the Quality control - Pathogen characterisation page.
Analysis
The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. While genomics approaches are of common interest, analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance.
Considerations
- The computational resources: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources.
- The location of your data: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs.
- Document the steps: Report every step of the data analysis process, including the software versions employed, the parameters used, the computing environment, the reference genome, and any “manual” data curation steps. More information on recording provenance can be found on the Provenance pages.
- Collaborative analysis: it is important that partners have access to the data, tools, and workflows. It is crucial that systems are in place to track changes to the tools and workflows used, and that the history of modifications is accessible to all collaborators.
Existing approaches
There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed. Here are some solutions:
- Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. EuroHPC or making use of public Galaxy servers such as Galaxy Europe.
- Genomic analysis: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance profiles in bacteria (AMR). Examples of tools that allow us to take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are:
- Short reads:
- All-in-one Bioinformatic Tools: SNippy, Dragen-GATK
- Sequence Alignment: Bowtie2, BWA and SAMtools
- Genome Assembly: Canu, Velvet and SPAdes
- Phylogenetic Analysis: ClustalW, MUSCLE, MAFFT, RAxML, FastTree and IQtree
- Molecular Clock: MrBayes, BEAST and BEAUti
- Variant calling: Dragen-GATK, freebayes, Bcftools and iVar
- Variant annotation: ANNOVAR, SnpEff, VEP and dbNSFP
- Genome annotation: Prokka, Bakta or DFAST
- Consensus genome generation: Bcftools, SAMtools
- Long reads:
- All-in-one Bioinformatic Tools: artic
- Sequence Alignment: Minimap2
- Genome Assembly: Canu, Flye, Raven, Miniasm, Dragonflye
- Assembly Polishing: Racon
- Phylogenetic Analysis: ClustalW, MUSCLE, MAFFT, RAxML, FastTree, IQtree
- Molecular Clock: MrBayes, BEAST and BEAUti
- Variant Calling: Medaka, Nanopolish, NanoCaller
- Variant Annotation: SnpEff
- Genome annotation: Prokka, Bakta or DFAST
- Hybrid:
- Genome Assembly: Dragonflye, Unicycler
- Polishing: Pilon, Polypolish
- Short reads:
- Expression analysis (fungal genomes):
- Metagenomics analysis: Sequencing all genetic material in a sample can provide comprehensive data about the composition of the microbial community. In the context of infectious diseases, it can aid in identifying multiple pathogens simultaneously in clinical, as well as environmental samples. Examples of tools in this type of analysis are:
- Assigning taxonomic labels: Kraken 2, Kaiju, DIAMOND, MEGAN, MetaPhlAn, Centrifuge, mOTUs, KrakenUniq, KMCP, Ganon, Clark
- Proteomics analysis: Proteomics, primarily utilising mass spectrometry techniques, offers powerful tools for examining proteins and their interplay. This can provide valuable insights into irregularities associated with infectious diseases and potentially uncover mechanisms of drug resistance. Examples of tools in this type of analysis are:
- Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) produced by specific pathogens in biological samples, comparing them across different conditions or groups of samples. Examples of tools in this type of analysis are:
Postprocessing
In pathogen characterisation, the postprocessing steps are crucial to evaluate and interpret the results. These steps are important to identify strain relationships and specific molecular variation patterns linked to particular phenotypes of pathogens (e.g. drug resistance, virulence, and transmission rate). Such results must be biologically meaningful and reproducible, and should also take into account the clinical aspects and treatment implications.
Considerations
Some considerations about postprocessing steps in pathogen characterization include:
- Interpretation: It is important to interpret the results in a biologically meaningful context. This should cover the following aspects: reporting the variability of specific pathogens; detecting new strains that could become concerning; and identifying specific genes or mutations associated with pathogenic variation.
- Transformation: Consider having postprocessing steps to ensure that outputs are transformed or converted into interoperable and open formats. This ensures that subsequent pipelines and collaborators can readily make use of the results.
- Visualisation: To support interpretation in clinical practice, it is important to visualise the results clearly, so that they are understandable to all professionals involved.
- Quality metrics: It is important to obtain different quality metrics to evaluate the accuracy or quality of the data generated during the analysis. Depending on the analysis performed, the metrics and tools may vary significantly.
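As an illustration of the quality metrics point above, the sketch below computes a few commonly reported contiguity metrics for a genome assembly, including N50 (the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length). Choosing assembly contiguity as the example is an assumption made here for illustration; the relevant metrics depend on the analysis performed, and the contig lengths shown are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def assembly_summary(contig_lengths):
    """Collect a few basic contiguity metrics into one dictionary."""
    lengths = list(contig_lengths)
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Made-up contig lengths, for illustration only.
summary = assembly_summary([100, 200, 300, 400, 500])
print(summary)
```

Metrics like these are typically collected per sample and then aggregated across a run, so that outliers can be spotted before interpretation.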
Existing approaches
- Spatial-temporal analysis and visualisation: By combining phylogenetic, spatial distribution, and molecular clock analyses, this approach aids in designing strategies to control and prevent the spread of infectious diseases, as well as in the development of effective treatments and vaccines.
- Spatial distribution of strain: Nextstrain
- Lineage, subtyping or clade assignment:
- Pangolin
- Nextclade
- IRMA
- FluServer
- HIV subtyping tools: CoMet, rega/scueal or Genome Detective
- HCV-GLUE
- Drug resistance and virulence factor characterisation: genomic analysis can be used to characterise pathogens for specific resistance against drugs and help develop strategies to fight the spread of drug-resistant strains.
- Antimicrobial resistance (AMR) and virulence factors: ResFinder, AMRfinderplus, ARIBA, ABRicate and Pathogenwatch
- Viral drug resistance: Stanford HIV Drug Resistance Database (HIVDB) and FluServer
- Hepatitis C and B resistance profilers: Geno2pheno, HCV-GLUE
- Interaction analysis and functional enrichment analysis: placing the identified protein interactions and regulatory networks in the context of the affected biological pathways allows for a better understanding of disease mechanisms and potential drug targets.
- Network analysis: Cytoscape and CellDesigner
- Gene enrichment analysis: Enrichr, GO and g:Profiler
- Interaction Databases: BioGRID and IntAct
- Integrative diagrams:
- A disease map can be used to represent a conceptual model of the molecular mechanisms of a disease. An example is the COVID19 Disease Map.
- Analysis metrics:
- General results aggregation:
- General metrics: MultiQC
- Antibiotic Resistance Genes: hAMRonization
Data analysis of wastewater surveillance for infectious diseases
Wastewater surveillance has emerged as a valuable tool for monitoring infectious diseases, providing a non-invasive method to track the spread of pathogens within communities. This approach has gained significant attention during the COVID-19 pandemic, particularly for detecting and analysing SARS-CoV-2 variants. By analysing wastewater samples, researchers can identify the presence and prevalence of infectious agents, offering insights into public health trends. Here we focus on the analysis of wastewater with an emphasis on SARS-CoV-2.
Considerations
Even though the considerations for this specific field are very similar to the ones described in the previous sections, some approaches are specific to the context of wastewater surveillance.
Existing approaches
Several tools and workflows have been developed or adapted for the analysis of wastewater data, especially in the context of SARS-CoV-2 surveillance:
- Specific Tools for SARS-CoV-2: Certain tools (such as Freyja, COJAC, and Lineagespot) are specifically designed for analysing SARS-CoV-2 data, providing capabilities such as variant detection and lineage tracking.
- Repurposed Tools: Originally developed for other types of genomic data, tools like Kallisto or Kraken 2 have been successfully applied to wastewater data analysis, offering high performance in read alignment and taxonomic classification.
- In addition, here are several bioinformatics protocols and solutions that can be used for wastewater next-generation sequencing (NGS) data analysis:
- PiGx SARS-CoV-2 Wastewater Sequencing Pipeline: provides a comprehensive solution for sequencing and analysing SARS-CoV-2 in wastewater.
- Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples (medRxiv) and COWWID, a GitHub repository from the CBG-ETHZ group offering tools for detecting SARS-CoV-2 variants in Switzerland
- CDC Module 2.7: Wastewater based variant tracking for SARS-CoV-2
- The Public Health Alliance for Genomic Epidemiology GitHub organization makes available a mapping to the European Nucleotide Archive (ENA): SARS-CoV-2 Contextual Data Specification
- PHES-ODM as an open data model for wastewater surveillance
- Viral Lineage Quantification (VLQ), Kallisto-Approach: Baaijens et al., 2022 and corresponding repository at VLQ
- Performance benchmark of tools (Kayikcioglu et al., 2023), evaluating tools like Kraken2, Kallisto, Freyja, implemented in C-WAP
- Wastewater quality control workflow in GalaxyTrakr (SSquAWK4). Further quality control aspects are discussed in the Quality Control - Pathogen Characterisation page
- ECDC Guidance document for representative and targeted genomic SARS-CoV-2 monitoring
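To give an intuition for how lineage tracking from a mixed wastewater sample can work, the sketch below estimates a rough relative abundance for each lineage from the observed frequencies of mutations that are unique to that lineage. This is a deliberately crude heuristic written for illustration; it is not the method implemented by Freyja, COJAC, Lineagespot, or any other tool listed above. All lineage names, mutation names, and frequencies are hypothetical.

```python
from statistics import mean


def unique_signatures(lineage_mutations):
    """For each lineage, keep only the mutations not shared with any other lineage."""
    unique = {}
    for lineage, mutations in lineage_mutations.items():
        others = set()
        for other, other_mutations in lineage_mutations.items():
            if other != lineage:
                others |= set(other_mutations)
        unique[lineage] = set(mutations) - others
    return unique


def estimate_abundance(observed_frequencies, lineage_mutations):
    """Average the observed frequency of each lineage's unique mutations.

    Mutations missing from the observations count as frequency 0.0.
    Lineages without any unique mutation are skipped, because this simple
    heuristic cannot separate them from the other lineages.
    """
    estimates = {}
    for lineage, mutations in unique_signatures(lineage_mutations).items():
        if not mutations:
            continue
        estimates[lineage] = mean(
            observed_frequencies.get(mutation, 0.0) for mutation in sorted(mutations)
        )
    return estimates


# Hypothetical lineage definitions and observed allele frequencies.
definitions = {
    "lineage_1": {"mutA", "mutB", "mutShared"},
    "lineage_2": {"mutC", "mutShared"},
}
observed = {"mutA": 0.30, "mutB": 0.20, "mutC": 0.70, "mutShared": 0.95}
abundance = estimate_abundance(observed, definitions)
print(abundance)
```

Dedicated tools model sequencing depth, shared mutations, and uncertainty explicitly, which is why they should be preferred for real surveillance data.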
Data analysis of bacterial outbreaks
General considerations
An outbreak of an infectious disease is defined as an increase in the number of cases of a disease above what is typically expected in a specific area or among a particular population over a certain period of time. Outbreaks can occur in localized regions, such as a community, or they can spread across wider areas. They can involve new infections or the re-emergence of previously controlled diseases. Common examples include outbreaks of influenza, SARS-CoV-2, or foodborne illnesses such as those caused by Listeria monocytogenes.
To determine whether an outbreak is caused by the same pathogen, it is necessary to isolate and characterize the pathogen using phenotypic or genotypic techniques, a process known as microbial typing (Shelby R. et al., 2021). It is used to differentiate between strains or species of microorganisms for various purposes, such as epidemiological studies, infection tracking, and outbreak investigations to pinpoint the source of foodborne outbreaks. It can also be used to identify which microorganisms are most virulent and cause serious diseases, are resistant to antimicrobial drugs, or are able to survive and multiply.
Techniques for microbial typing include traditional methods, like serotyping, as well as fragment-based methods and sequence-based methods.
Pulsed-field gel electrophoresis (PFGE) has been the gold standard for bacterial typing and has been widely used for tracking outbreaks of foodborne pathogens, such as E. coli and Salmonella. However, technologies like Whole Genome Sequencing (WGS) have replaced PFGE due to their higher precision and resolution.
Different levels of sequence information can be associated with different taxonomic levels. A single locus (e.g. 16S rRNA) is enough to distinguish from phylum to genus and, in some cases, species, while subspeciation requires incorporating more genes, such as the 7 loci from MLST or the 53 loci from ribosomal MLST. The highest resolution comes from using the whole genome sequence (WGS).
One of the significant advantages of whole genome sequencing (WGS) is its application in comparative genomics, enabling the determination of the phylogenetic relationships among a group of bacterial strains. There are several methods to assess the similarity between different genomes, e.g. gene-by-gene approaches (Whole Genome Multi-Locus Sequence Typing, wgMLST, and Core Genome Multi-Locus Sequence Typing, cgMLST) or Single Nucleotide Polymorphism (SNP) approaches (Uelze et al., 2020).
wgMLST analyzes the entire genome, considering all the genes present in the genome. This method allows for a comprehensive understanding of the genetic variation within and between species.
cgMLST focuses specifically on the set of genes that are present in all strains of a species, the core genome. This excludes accessory genes, i.e. genes not necessary for organism survival or reproduction under standard conditions (e.g. virulence genes), which may vary among strains.
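The gene-by-gene idea can be illustrated with a minimal sketch that compares allele profiles. It assumes each isolate has already been assigned an allele number per locus (with None marking a locus that could not be called), restricts the comparison to loci called in every isolate as a stand-in for a core genome scheme, and counts differing alleles between pairs of isolates. Real cgMLST schemes define the core loci in advance; the locus names, isolate names, and allele numbers below are hypothetical.

```python
def core_loci(profiles):
    """Return the loci that have a called allele in every isolate."""
    loci = None
    for alleles in profiles.values():
        called = {locus for locus, allele in alleles.items() if allele is not None}
        loci = called if loci is None else loci & called
    return sorted(loci or [])


def allele_distance(profile_a, profile_b, loci):
    """Count the loci at which two isolates carry different alleles."""
    return sum(1 for locus in loci if profile_a[locus] != profile_b[locus])


def distance_matrix(profiles):
    """Pairwise allele distances over the shared (core) loci."""
    loci = core_loci(profiles)
    names = sorted(profiles)
    return {
        a: {b: allele_distance(profiles[a], profiles[b], loci) for b in names}
        for a in names
    }


# Hypothetical allele profiles; locusD is not called in every isolate,
# so it is excluded from the comparison.
profiles = {
    "isolate_1": {"locusA": 1, "locusB": 4, "locusC": 2, "locusD": 7},
    "isolate_2": {"locusA": 1, "locusB": 4, "locusC": 3, "locusD": None},
    "isolate_3": {"locusA": 2, "locusB": 5, "locusC": 3},
}
matrix = distance_matrix(profiles)
for name, row in matrix.items():
    print(name, row)
```

A small allele distance between two isolates suggests they may be closely related, but thresholds for defining an outbreak cluster are species-specific and are not addressed by this sketch.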
Existing approaches
There are different tools available for bacterial outbreak analysis. Here they are classified based on the stage of the analysis.
- Starting with sequencing reads .fastq files:
- Pre-processing: During this step we aim to obtain a set of good quality reads that can be used for further analysis. We will inspect the quality of the raw reads, discard all the reads that don’t meet the quality standards and remove adapters, primers or unwanted sequences that may interfere with later steps of the analysis. Tools: FASTQC, fastp
- Bacterial genome identification: Identifying the organisms that are present in our samples is crucial. Even though the microbiology lab could’ve sent information about the organisms present in the samples, there could be contaminations that would lead to mapping, variant calling and annotation errors. Tools: KmerFinder
- De-novo assembly: If we want to make an in-depth analysis of the samples, we cannot work with a mess of unordered pieces of DNA. The reads can be aligned and merged (assembled) into a single sequence which we will use as a representation of the original DNA from the organism in the sample. Tools: Unicycler
- Once we have the assembly or reconstructed DNA strand, we can proceed with further analysis:
- Genomic features annotation: Using the assembled sequence, the aim of this step is to automatically detect genes or subproducts of genes that may have functional properties in a non-specific way. These features can be relevant to characterize the organism in our samples as they may include strain-specific genes. Tools: Prokka, Bakta, DFAST
- AMR/virulence characterization: Contrary to the previous step, here the aim is to do a specific search for relevant genes, including virulence factors and antibiotic-resistance genes. The main difference is that the search is performed within domain-specific databases:
- Plasmid identification: Sometimes the previously mentioned genes are found inside plasmids, which might prevent the previous tools from finding them. In addition, plasmids are relevant because they can be the source of horizontal gene transfer. These tools are used to find and characterize these plasmids: ARIBA, PlasmidID
- MLST characterization: MLST (Multi-Locus Sequence Typing) is an unambiguous procedure for characterizing isolates of bacterial species using the sequences of internal fragments of (usually) seven housekeeping genes. The aim is to obtain a profile based on the alleles for each gene, which can be used to establish relationships between the samples and the possible source of the outbreak. Tools: ARIBA used along with the PubMLST database.
- Note: Different species may have different MLST schemes, so it is crucial to use the appropriate schema for each one.
- SNP calling and Core-SNPs matrix: To find relationships between the samples and a possible source, phylogenetic distance can be a good measure. This distance is calculated by counting all the pairwise SNP positions that differ among samples. However, since the number of SNPs can grow rapidly with the number of samples, a good approach is to use only the SNPs at positions that are common to all samples (Core SNPs). These values are represented in a distance matrix, which is then used to create a phylogenetic tree. Tools: SNippy
- Phylogenetic tree: Using the previously obtained distance matrix, it is possible to create a phylogenetic tree, which is a visual representation of the genetic relationships between the samples, based on clustering algorithms. This is useful to group similar samples together and infer potential outbreak sources. Tools: IQtree
- wgMLST and cgMLST: These are both gene-by-gene approaches. Genome assembly data is aligned to a scheme consisting of a set of loci and their corresponding allele sequences to perform allele calling. Each isolate is then characterized by its allele profile, and comparing multiple samples generates an allele distance matrix. Tools: chewBBACA
- Schema resources:
- Pathogen-specific typing tools: MLST is not always enough to characterize a certain species. Some species have their unique typing methods. Here are species-specific tools:
- Escherichia / Shigella: ECTyper, ShigaTyper or ShigEiFinder
- Haemophilus: hicap or HpsuisSero
- Klebsiella: Kleborate
- Legionella: legsta
- Listeria: LisSero
- Mycobacterium: TBProfiler and MTBseq
- Neisseria: meningotype or ngmaster
- Pseudomonas: pasty
- Salmonella: SeqSero2 and SISTR
- Staphylococcus: AgrVATE, spaTyper and sccmec
- Streptococcus: emmtyper, pbptyper or SsuisSero
- Source: Bactopia Merlin
- Available pipelines: These pipelines synthesize some of the previous steps into a single workflow, making bioinformatics analysis easier:
- Bacterial assembly and annotation (short reads, long reads, hybrid):
- Screening of functional genes:
- Taxonomy classification (short and long metagenomic reads):
- Complete analysis of bacterial genomes:
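The core-SNP distance matrix described in the list above can be sketched as follows. The example assumes the samples are already available as equal-length aligned sequences (for instance, after mapping to a common reference), treats a position as "core" only when every sample has an unambiguous base there, and counts pairwise differences at the variable core positions. It is an illustration of the idea and not how Snippy or any other listed tool is implemented; the sample names and sequences are made up.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def core_snp_positions(alignment):
    """Return positions that are unambiguous in all samples and variable among them."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("All aligned sequences must have the same length")
    positions = []
    for index in range(length):
        column = {sequence[index].upper() for sequence in sequences}
        if column <= VALID_BASES and len(column) > 1:
            positions.append(index)
    return positions


def snp_distance_matrix(alignment):
    """Pairwise number of differing bases, counted over core SNP positions only."""
    positions = core_snp_positions(alignment)
    names = sorted(alignment)
    matrix = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        distance = sum(
            1 for i in positions if alignment[a][i].upper() != alignment[b][i].upper()
        )
        matrix[a][b] = distance
        matrix[b][a] = distance
    return matrix


# Made-up alignment: position 5 contains an N in sample_3 and is therefore
# excluded from the core, even though it would otherwise look variable.
alignment = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTTC",
    "sample_3": "ACCTANGTTC",
}
print(core_snp_positions(alignment))
print(snp_distance_matrix(alignment))
```

The resulting matrix is the kind of input that distance-based clustering or tree-building steps work from; tree inference itself is left to dedicated phylogenetic tools.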
Viral outbreak
General considerations
The analysis of outbreaks in which a virus is the suspected cause is an important tool for disease control.
A proper identification of the virus is needed to find out whether measures for its prevention and containment have already been described, so that the spread of the infection can be managed and new potential cases avoided. In certain scenarios, molecular diagnostic methods, such as PCR, do not achieve viral identification, and other protocols, such as deep sequencing (e.g. metagenomics), are needed to correctly identify the pathogen.
Further analyses that can be performed are phylogenetic analysis and genome characterisation. The former is useful to identify the links between different samples from the same outbreak, to track the spread of the infection, and to locate its potential origin in case additional control measures are needed. The latter is based on genome analysis and aims to find specific biological characteristics of the virus (e.g. mutations related to antiviral resistance or high pathogenicity) that can help to identify potentially effective treatments or the need for more extreme control measures.
Existing approaches
There are different tools available for viral outbreak analysis. Here they are classified based on the stage of the analysis.
- Pre-processing: Data preprocessing is a crucial initial step prior to data analysis that involves reads quality check, low-quality cleaning and trimming, removal of adapters, primers or unwanted sequences from raw sequencing reads. Some of the tools to perform this process are:
- Viral species identification: Accurately identifying the viral species present in samples is essential for responding to viral outbreaks, especially when the causative agent is unknown. This step involves classifying the sample’s reads into various viral taxonomic species, using tools such as: Kraken 2 / Krona Tools or DIAMOND / MEGAN
- Reference-genome mapping: This approach is particularly useful when the viral species causing the outbreak is known, a reference genome is available, and the viral genome does not exhibit significant variability. It involves aligning sequencing reads to the reference genome, which is a crucial step for accurate variant calling and in-depth genetic analysis. Widely used tools are:
- Variant calling: This step involves identifying genetic variations, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants within a sample by comparing its sequencing data to a reference genome. Some useful tools are: iVar, Bcftools and LoFreq*
- Variant annotation: Variant annotation is a step following variant calling that involves assigning biological and clinical significance to identified genetic variants. This step typically involves mapping variants to genes, determining their potential impact on protein function, and assessing their association with known phenotypes, diseases, or traits. Available tools are: SnpEff and SnpSift
- Consensus genome reconstruction: Once we know the differences (variants) between the sample and the reference genome, we can change the positions in the reference genome to obtain a consensus genome .fasta file that can serve as the sample’s assembled genome. Tools: Bcftools
- de novo assembly: When the viral reference genome is not known or is highly variable, de novo assembly is a better approach than reference genome mapping. de novo assembly assembles the reads into longer contiguous sequences (contigs) and scaffolds without the need for a reference genome. These can be compared to a database (using any of the tools for taxonomic classification mentioned above) to identify the pathogen. Some of the widely used tools are:
- Short-reads: SPAdes and Unicycler
- Long-reads: Canu, Flye, Raven, Miniasm and Dragonflye
- Hybrid: Dragonflye and Unicycler
- Phylogeny: In the context of addressing a viral outbreak, phylogenetic studies can help in tracing the origins and transmission pathways of the infection. Generating phylogenetic trees visually represents the connections between organisms, illustrating how they have diverged over time. The process of performing phylogenetic analysis involves several steps:
- Sequence alignment of the generated genomes. Tools: MAFFT, MUSCLE or ClustalW
- Core genome SNPs: The core genome consists of the genes that are shared among all the samples/strains in the analysis, providing a stable genetic framework for comparative analysis. Tools: SNippy
- Phylogenetic trees: With the multiple sequence alignment or the core SNPs, researchers can perform phylogenetic analysis. The choice of algorithm and evolutionary model will influence the results, with commonly used tools including: IQtree and Nextstrain
- Lineage/clade/type: When the virus causing the outbreak has been classified into different lineages, clades, or types/subtypes, it is essential to characterize this aspect of the samples. The information from the phylogenetic tree, combined with lineage data, can help determine the connections between samples involved in the outbreak. Tools: Pangolin, Nextclade or HIV subtyping tools such as CoMet or rega/scueal
- Available pipelines:
- nf-core/viralrecon: A bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples.
- IRMA: Designed for the robust assembly, variant calling, and phasing of highly variable RNA viruses. Currently, IRMA is deployed with modules for influenza, ebolavirus, and coronavirus.
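The consensus genome reconstruction step described in the list above (changing the variant positions in the reference to obtain a sample-specific sequence) can be sketched as follows. The example is restricted to single-nucleotide substitutions given as 1-based positions, ignores insertions, deletions, and coverage masking, and is not how Bcftools implements its consensus step. The reference sequence, variants, and sample name are made up.

```python
def apply_snvs(reference, variants):
    """Apply single-nucleotide variants to a reference sequence.

    variants is an iterable of (position, ref_base, alt_base) tuples using
    1-based positions. The expected reference base is checked first so that
    a variant list built against a different reference is detected early.
    """
    bases = list(reference)
    for position, ref_base, alt_base in variants:
        index = position - 1
        if bases[index] != ref_base:
            raise ValueError(
                f"Reference mismatch at position {position}: "
                f"expected {ref_base}, found {bases[index]}"
            )
        bases[index] = alt_base
    return "".join(bases)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Made-up reference and variants, for illustration only.
reference = "ACGTACGTAC"
variants = [(3, "G", "T"), (8, "T", "C")]
consensus = apply_snvs(reference, variants)
print(to_fasta("sample_1", consensus), end="")
```

In a real workflow, positions with insufficient coverage are usually masked and indels must be handled as well, which is one reason to rely on established tools for this step.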
Related pages
More information
Links to RDMkit
RDMkit is the Research Data Management toolkit for Life Sciences describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable)
Training
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
ABRicate | ABRicate is used for mass screening of contigs for antimicrobial resistance or virulence genes. | Tool info | |
AgrVATE | AgrVATE is a tool for rapid identification of Staphylococcus aureus agr locus type and also reports possible variants in the agr operon. | Tool info | |
AMRfinderplus | NCBI Antimicrobial Resistance Gene Finder (AMRFinderPlus) | Tool info | |
ANNOVAR | ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes. | Human biomolecular data | Tool info |
apex | The Absolute Protein Expression (APEX) Quantitative Proteomics Tool is a free and open source Java implementation of the APEX technique for the quantitation of proteins based on standard LC-MS/MS proteomics data. | Tool info | |
ARIBA | ARIBA (Antimicrobial Resistance Identification By Assembly) identifies antimicrobial resistance genes by running local assemblies. | ||
artic | artic is a pipeline and set of accompanying tools for working with viral nanopore sequencing data, generated from tiling amplicon schemes. | Tool info | |
Bakta | Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. | Tool info | |
Bcftools | Bcftools is a set of tools for working with variant calls in the VCF format. | Human biomolecular data | Tool info Training |
BEAST | BEAST is a cross-platform program for Bayesian phylogenetic analysis, estimating rooted, time-measured phylogenies using strict or relaxed molecular clock models. It uses Markov chain Monte Carlo (MCMC) to average over tree space and includes a graphical user interface for setting up analyses and tools for result analysis. | Tool info | |
BEAUti | BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files. | ||
BioGRID | BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions | Human biomolecular data | Tool info Standards/Databases |
Bitbucket | Git based code hosting and collaboration tool, built for teams. | Human biomolecular data | Standards/Databases |
Bowtie2 | Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. | Human biomolecular data | Tool info Training |
BWA | BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. | Human biomolecular data | Tool info Training |
C-WAP | CFSAN Wastewater Analysis Pipeline to estimate the percentage of SARS-CoV-2 variants in a sample. | ||
Canu | Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing. | Human biomolecular data | Tool info |
CellDesigner | CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks. | ||
Centrifuge | Tool to classify the taxonomic origin of a read in pathogen sequencing data | Human biomolecular data | Tool info |
Chenomx | A commercial software package for NMR spectral processing that offers a semi-automated tool for spectral deconvolution, enabling interactive fitting of metabolite peaks to reference spectra and quantifying their concentrations. | ||
chewBBACA | chewBBACA is a software suite for the creation and evaluation of core genome and whole genome MultiLocus Sequence Typing (cg/wgMLST) schemas and results. | ||
Clark | Clark is a fast, versatile and accurate sequence classification system | ||
ClustalW | ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments. | Human biomolecular data | Tool info Training |
COJAC | The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. | Tool info | |
CoMet | A workflow using contig coverage and composition for binning a metagenomic sample with high precision | ||
COVID19 Disease Map | The COVID-19 Disease Map is an assembly of molecular interaction diagrams, established based on literature evidence. | ||
Cutadapt | Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. | Tool info | |
Cytoscape | Cytoscape provides a solid platform for network visualization and analysis | Human biomolecular data | Tool info Training |
dbNSFP | A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs | Human biomolecular data | Tool info |
DESeq2 | Differential gene expression analysis based on the negative binomial distribution | Human biomolecular data | Tool info Training |
DFAST | DFAST is a flexible and customizable pipeline for prokaryotic genome annotation as well as data submission to the INSDC | Tool info | |
DIAMOND | DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. | Tool info | |
Dorado | Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads. | Tool info | |
Dragen-GATK | DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure. | Human biomolecular data | |
Dragonflye | Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. | ||
dupRadar | dupRadar is used for the assessment of duplication rates in RNA-Seq datasets. | Tool info | |
ECTyper | ECTyper is a standalone versatile serotyping module for Escherichia coli. It supports both fasta (assembled) and fastq (raw reads) file formats. | Tool info | |
emmtyper | emmtyper is a command line tool for emm-typing of Streptococcus pyogenes using a de novo or complete assembly. | ||
Enrichr | Functional Enrichment Analysis and Network Construction | Tool info | |
EuroHPC | EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe. | ||
European Nucleotide Archive (ENA) | Provides a record of the nucleotide sequencing information. It includes raw sequencing data, sequence assembly information and functional annotation. | Human clinical and hea... Pathogen characterisation Human biomolecular data An automated SARS-CoV-... Using the ENA data sub... SARS-CoV-2 sequencing ... Linked pathogen and ho... | Tool info Standards/Databases Training |
fastp | A tool designed to provide ultrafast all-in-one preprocessing and quality control for FastQ data. | Tool info | |
FASTQC | A quality control tool for high throughput sequence data. | Human biomolecular data Pathogen characterisation | Tool info Training |
FastTree | FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences | Tool info | |
FluSurver | The main application scenario for FluSurver is to highlight phenotypically or epidemiologically interesting candidate mutations for further research. It should ideally be combined with experimental testing and verification of any predicted phenotypes. | ||
Flye | Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. | Human biomolecular data | Tool info Training |
freebayes | freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. | Human biomolecular data | Tool info Training |
Freyja | Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). | Tool info | |
g:Profiler | g:GOSt performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on an input gene list. | Tool info Training | |
Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Human biomolecular data General guidelines Human clinical and hea... Using the ENA data sub... | Tool info Training |
Galaxy Europe | The European Galaxy server. Provides access to thousands of tools for scalable and reproducible analysis. | An automated SARS-CoV-... | Training |
Ganon | ganon2 classifies DNA sequences against large sets of genomic reference sequences efficiently. | Tool info | |
Geno2pheno | Estimating phenotypic drug resistance from HIV-1 genotypes associated with resistance to PRO and RT inhibitors | Tool info | |
Genome Detective | Genome Detective offers intuitive Bio-Informatics applications for the analysis of microbial molecular sequence data. | ||
GitHub | GitHub is a versioning system, used for sharing code, as well as for sharing of small data. | Human biomolecular data An automated pipeline ... | Standards/Databases Standards/Databases Training |
GitLab | GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider. | Human biomolecular data | Standards/Databases Training |
Global Initiative on Sharing All Influenza Data (GISAID) | A web-based platform for sharing viral sequence data, initially for influenza data, and now for other pathogens (including SARS-CoV-2). | Human clinical and hea... Pathogen characterisation | Standards/Databases |
GO | GO (Gene Ontology) is used to perform enrichment analysis on gene sets. | Human biomolecular data | Tool info Training |
Guppy | Guppy is a bioinformatics toolkit that provides real-time basecalling and several post-processing features for Oxford Nanopore Technologies™ sequencing platforms. | ||
hAMRonization | hAMRonization is a software tool to harmonize and standardize antimicrobial resistance (AMR) data generated by various bioinformatics tools. | Tool info | |
HCV-GLUE | HCV-GLUE is a bioinformatics resource for HCV sequence data. | ||
hicap | The cap locus of H. influenzae is categorised into 6 different groups based on serology (a-f). | ||
HISAT2 | HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome). | Human biomolecular data | Tool info Training |
IntAct | IntAct (Molecular Interaction Database) Website | Human biomolecular data | Tool info Standards/Databases Training |
IQtree | IQ-TREE is designed to efficiently handle large phylogenomic datasets, utilize multicore and distributed parallel computing for faster analysis, and automatically resume interrupted analyses through checkpointing. | Tool info Training | |
IRMA | IRMA (Iterative Refinement Meta-Assembler) was designed for the robust assembly, variant calling, and phasing of highly variable RNA viruses. | ||
iVar | iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing. | ||
Kaiju | Kaiju is a program for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. | Tool info | |
Kallisto | Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. | Tool info | |
Kleborate | Kleborate was primarily developed to screen genome assemblies of Klebsiella pneumoniae and the Klebsiella pneumoniae species complex (KpSC). | Tool info | |
KMCP | KMCP performs accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping | ||
KmerFinder | KmerFinder is a bioinformatics tool designed for the rapid identification of bacterial species and strains from whole genome sequencing (WGS) data. | Tool info | |
Kraken 2 | A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. | ||
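KrakenUniq | False-positive identifications are a significant problem in metagenomics classification. KrakenUniq addresses this by combining k-mer-based classification with a count of the unique k-mers found for each taxon. | Tool info | |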
Krona Tools | Krona Tools is a set of scripts to create Krona charts from several Bioinformatics tools as well as from text and XML files. | Tool info | |
legsta | In silico Legionella pneumophila Sequence Based Typing (SBT). | Tool info | |
Lineagespot | Lineagespot is a framework written in R, and aims to identify SARS-CoV-2 related mutations based on a single (or a list) of variant(s) file(s). | Tool info | |
LisSero | In silico serogroup typing prediction for Listeria monocytogenes | ||
LoFreq* | LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data | ||
LongQC | LongQC is a tool for the data quality control of the PacBio and ONT long reads, and it has two functionalities: sample qc and platform qc. | ||
MAFFT | MAFFT is a multiple sequence alignment program | Human biomolecular data | Tool info Training |
MAXQUANT | MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at high-resolution MS data. | Tool info Training | |
Medaka | medaka is a tool to create consensus sequences and variant calls from nanopore sequencing data. | Tool info | |
MEGAHIT | MEGAHIT is an ultra-fast and memory-efficient NGS assembler optimized for metagenomes. | Tool info Training | |
MEGAN | MEGAN is used for the interactive exploration and analysis of large-scale microbiome sequencing data. | Tool info | |
meningotype | In silico typing of Neisseria meningitidis contigs. | Tool info | |
MetaboAnalyst | MetaboAnalyst is a comprehensive platform dedicated for metabolomics data analysis via user-friendly, web-based interface. | Human biomolecular data | Tool info Training |
MetaPhlAn | MetaPhlAn is a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. | Tool info | |
Miniasm | Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. | Tool info | |
Minimap2 | Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. | Tool info | |
MinIONQC | Fast and effective quality control for MinION and PromethION sequencing data | ||
mOTUs | The mOTU profiler is a computational tool that estimates relative taxonomic abundance of known and currently unknown microbial community members using metagenomic shotgun sequencing data. | Tool info | |
MrBayes | MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters. | Tool info | |
MTBseq | MTBseq is an automated pipeline for mapping, variant calling and detection of resistance mediating and phylogenetic variants from Illumina whole genome sequence data of Mycobacterium tuberculosis complex isolates. | ||
MultiQC | MultiQC searches a given directory for analysis logs and compiles a HTML report. | Pathogen characterisation | Tool info Training |
MUSCLE | MUSCLE is widely-used software for making multiple alignments of biological sequences. | Human biomolecular data | Tool info Training |
Mzmine | MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data. | Human biomolecular data | Tool info |
NanoCaller | NanoCaller is a computational method that integrates long reads in deep convolutional neural network for the detection of SNPs/indels from long-read sequencing data. | Tool info | |
Nanofilt | Filtering and trimming of long read sequencing data. | ||
NanoPlot | NanoPlot is a plotting tool for long read sequencing data and alignments. | Tool info | |
Nanopolish | Software package for signal-level analysis of Oxford Nanopore sequencing data. | Tool info | |
nanoq | nanoq provides ultra-fast quality control and summary reports for nanopore reads | ||
Nextclade | Nextclade is a tool for viral genome clade assignment, mutation calling, and sequence quality checks. | Tool info | |
Nextflow | Nextflow is a framework for data analysis workflow execution | Human biomolecular data | Tool info Training |
Nextstrain | Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. | Tool info Training | |
ngmaster | In silico multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) and Neisseria gonorrhoeae sequence typing for antimicrobial resistance (NG-STAR). | ||
OMSSA | OMSSA (Open Mass Spectrometry Search Algorithm) is a tool to identify peptides in tandem mass spectrometry (MS/MS) data. The OMSSA algorithm uses a classic probability score to compute specificity. See also The NCBI C++ Toolkit and The NCBI C++ Toolkit Book. | Tool info | |
OpenMS | OpenMS is an open-source software C++ library for LC-MS data management and analyses. | Human biomolecular data | Tool info Training |
Pangolin | Pangolin is a tool for the Phylogenetic Assignment of Named Global Outbreak LINeages | ||
pasty | A tool for in silico serogrouping of Pseudomonas aeruginosa isolates. | ||
Pathogens Portal | The Pathogens Portal, launched in July 2023, is an invaluable resource for researchers, clinicians, and policymakers who need access to the latest and most comprehensive datasets on pathogens. The portal is a collaborative effort between the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and partners. | Linked pathogen and ho... The Swedish Pathogens ... | Standards/Databases Training |
Pathogenwatch | Pathogenwatch provides species and taxonomy prediction for over 60,000 variants of bacteria, viruses, and fungi. | ||
pbptyper | In silico Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies | ||
PepArML | A Meta-Search Peptide Identification Platform for Tandem Mass Spectra | Tool info | |
PHES-ODM | A data model to improve wastewater surveillance through interoperable data. | ||
Picard | Picard is a suite of tools that provides quality control and processing of NGS data, including duplicate read removal, format conversion, and alignment. | Human biomolecular data | Tool info |
PiGx SARS-CoV-2 Wastewater Sequencing Pipeline | PiGx SARS-CoV-2 is a pipeline for analysing data from sequenced wastewater samples and identifying given lineages of SARS-CoV-2. | ||
Pilon | Pilon is a software tool which can be used to automatically improve draft assemblies and find variation among strains, including large event detection. | ||
PlasmidID | PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that analyzes the data and provides a graphical solution for plasmid identification. | Tool info | |
Polypolish | Polypolish is a tool for polishing genome assemblies with short reads. | Tool info | |
Porechop | Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. | ||
preseq | The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. | Tool info | |
Prokka | Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. | Tool info | |
PycoQC | PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data | Tool info | |
QIIME 2 | QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. | Tool info Training | |
Qualimap | Qualimap is a quality control tool that assesses the quality of the sequencing data at different stages of the analysis pipeline, including read mapping, coverage, and expression analysis. | Human biomolecular data | Tool info Training |
Quast | QUAST stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics. | Tool info | |
Racon | Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. | Tool info | |
Raven | Raven is a de novo genome assembler for long uncorrected reads. | Tool info | |
RAxML | A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies | Tool info Training | |
ReAdW | Converts Thermo Finnigan RAW mass spectrometry files to the mzXML format. | Tool info | |
ResFinder | ResFinder identifies acquired genes and/or finds chromosomal mutations mediating antimicrobial resistance in total or partial DNA sequence of bacteria. | Tool info | |
RSEM | RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. | Tool info | |
RSeQC | RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. | Tool info | |
Salmon | Salmon is a wicked-fast program to produce highly accurate, transcript-level quantification estimates from RNA-seq data | Tool info | |
SAMtools | SAMtools is a suite of programs for interacting with high-throughput sequencing data. | Human biomolecular data Pathogen characterisation | Tool info Training |
SARS-CoV-2 Contextual Data Specification | A SARS-CoV-2 Contextual Data Specification from PHA4GE. | ||
sccmec | sccmec is a tool for typing SCCmec cassettes in assemblies. | ||
SeqSero2 | Salmonella serotype prediction from genome sequencing data. | Tool info | |
ShigaTyper | ShigaTyper is a quick and easy tool designed to determine Shigella serotype using Illumina (single or paired-end) or Oxford Nanopore reads with low computation requirement. | ||
ShigEiFinder | This tool is used to identify and differentiate Shigella/EIEC using cluster-specific genes and to identify the serotype using O-antigen/H-antigen genes. | Tool info | |
SISTR | Salmonella serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST. | Tool info | |
Snakemake | Snakemake is a framework for data analysis workflow execution | Human biomolecular data General guidelines Human clinical and hea... | Tool info Training |
SNippy | Rapid haploid variant calling and core genome alignment. | Tool info Training | |
SnpEff | Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins. | Human biomolecular data | Tool info Training |
SnpSift | SnpSift annotates genomic variants using databases, filters, and manipulates genomic annotated variants. | Tool info | |
SOAPnuke | A novel analysis tool developed for quality control and preprocessing of FASTQ and SAM/BAM data. | Tool info | |
SortMeRNA | SortMeRNA is a local sequence alignment tool for filtering, mapping and clustering. | Tool info | |
SPAdes | SPAdes is an assembly toolkit containing various assembly pipelines. | Human biomolecular data | Tool info Training |
spaTyper | Given a fasta file or multiple fasta files, identifies the repeats and the order and generates a spa type. | ||
SsuisSero | This pipeline is designed to rapidly infer Streptococcus suis serotype from Oxford Nanopore data by first assembling a draft genome using Flye, followed by genome polishing with racon and medaka. | ||
Stanford HIV Drug Resistance Database (HIVDB) | A curated database containing nearly all published HIV RT and protease sequences: a resource designed for researchers studying evolutionary and drug-related variation in the molecular targets of anti-HIV therapy. | ||
STAR | Spliced Transcripts Alignment to a Reference | Human biomolecular data | Tool info Training |
Swedish Pathogens Portal | The Swedish Pathogens Portal was previously known as the Swedish COVID-19 Data Portal. It is the Swedish national node of the Pathogens Portal, aimed at facilitating the sharing of data related to pathogens and pandemic preparedness. | The Swedish Pathogens ... | Standards/Databases |
TBProfiler | TBProfiler can rapidly and accurately predict anti-TB drug resistance profiles across large numbers of samples with WGS data. | ||
Trimmomatic | Trimmomatic is a tool used for the removal of adapter sequences, low-quality reads, and sequences with ambiguous bases from NGS data. | Human biomolecular data | Tool info Training |
Unicycler | Unicycler is an assembly pipeline for bacterial genomes. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a short-read-first hybrid assembly. | Tool info | |
Velvet | Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. | Tool info Training | |
VEP | VEP (Variant Effect Predictor) predicts the functional effects of genomic variants. | Human biomolecular data | Tool info Training |
VLQ | A pipeline for lineage abundance estimation from wastewater sequencing data. | ||
WorkflowHub | A registry for describing, sharing and publishing scientific computational workflows. | An automated SARS-CoV-... | Tool info Standards/Databases Training |
X! Tandem | X! Tandem open source is software that can match tandem mass spectra with peptide sequences, in a process that has come to be known as protein identification. | ||
xcms | Framework for processing and visualization of chromatographically separated and single-spectra mass spectral data. | Tool info Training |