Introduction
Analysing human biomolecular data of infectious diseases means exploring the collected data to understand what a dataset shows, and identifying relationships between variables using mathematical formulas or models. Throughout, it is important to follow best practices, in particular the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), so that information can be collected and shared effectively.
General considerations
Some considerations for analysing human biomolecular data of infectious diseases are:
- Select the tools best suited for the analysis of your data.
- Document the exact steps used for data analysis.
- Choose between several computing infrastructure types, e.g. cluster, cloud.
- Take into account the computing resources needed.
- Identify which type of data you are using, e.g. DNAseq, ATACseq, CNV.
- Plan how to integrate different types of data (e.g. RNAseq and DNAseq).
- Ensure that you follow the FAIR principles.
- Guarantee access to the data and tools for all collaborators, for reproducibility, by:
  - Providing your code
  - Providing your execution environment
  - Providing your workflows
  - Providing your data analysis execution
For solutions to some of the considerations above, see the information available on the RDMkit website.
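As an illustration of documenting the exact steps of an analysis, the following minimal Python sketch stores the tool, its version, the parameters, and a checksum of each input file in a JSON record. The function names, the record fields, and the example tool and version string are assumptions made for this example; they are not part of any standard or of the RDMkit guidance.

```python
import hashlib
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(step, tool, tool_version, parameters, input_paths):
    """Collect the information needed to repeat one analysis step."""
    return {
        "step": step,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": parameters,
        "inputs": {str(p): sha256_of(p) for p in input_paths},
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_run_record(record, out_path):
    """Store the record as JSON, for example next to the analysis outputs."""
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))


# Demo with a throwaway input file so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as workdir:
    reads = Path(workdir) / "reads.fastq"
    reads.write_text("@read1\nACGT\n+\nIIII\n")
    record = build_run_record(
        step="alignment",
        tool="bwa",             # hypothetical choice of tool
        tool_version="0.7.17",  # hypothetical version string
        parameters={"threads": 4},
        input_paths=[reads],
    )
    write_run_record(record, Path(workdir) / "alignment.run.json")
```

Keeping one such record per step, under version control with the code, is one possible way to make the analysis easier for collaborators to repeat.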
Existing approaches
The following general approaches can help you build and improve your data analysis pipeline or protocol:
- Container environments: As an alternative to package management systems you can consider container environments like Docker or Singularity.
- Web-based code hosting platforms: Provide a centralised location for software developers to store, manage, collaborate on, and share their code. You can use GitHub (widely used), GitLab or Bitbucket.
- Workflow platforms: Allow you to manage your data and provide an interface (web, GUI, APIs) to run complex pipelines and review their results. For instance: Galaxy and Arvados (CWL-based, open source).
- Workflow runners: Allow you to take a workflow written in a proprietary or standardised format (such as the CWL standard) and execute it locally or on a remote computing infrastructure. For instance: toil-cwl-runner, the reference CWL runner (cwltool), Nextflow, Snakemake, Cromwell.
- Integration pipelines: These pipelines are used to integrate different types of data, such as genomic, proteomic, and metabolomic data. They involve steps such as data preprocessing, data integration, and functional analysis. Tools like Omicsgenerator can be used for data integration.
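To give an idea of what a workflow runner does with a workflow description, the toy Python sketch below orders hypothetical analysis steps so that each one runs only after the steps it depends on. It uses the standard library module `graphlib` (Python 3.9 or later). The step names and actions are placeholders, and this is not how Nextflow, Snakemake, cwltool or any other runner is actually implemented.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it needs.
# Step names are hypothetical placeholders, not commands of any real tool.
workflow = {
    "quality_control": set(),
    "alignment": {"quality_control"},
    "variant_calling": {"alignment"},
    "annotation": {"variant_calling"},
    "report": {"quality_control", "annotation"},
}


def execution_order(steps):
    """Return one valid order in which the steps can be run."""
    return list(TopologicalSorter(steps).static_order())


def run_workflow(steps, actions):
    """Call each step's action once all of its dependencies have run."""
    log = []
    for name in execution_order(steps):
        log.append(actions[name]())
    return log


# Placeholder actions that only report which step ran.
actions = {name: (lambda name=name: f"ran {name}") for name in workflow}

order = execution_order(workflow)
print(order)
# → ['quality_control', 'alignment', 'variant_calling', 'annotation', 'report']
```

Real workflow runners add much more on top of this ordering, such as software environments, job scheduling, and resuming failed runs.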
Preprocessing
Data preprocessing is the phase of the project in which data is cleaned, transformed, and converted into the format required for analysis. The goal of preprocessing is to ensure that the data is of high quality and suitable for the intended analysis. The steps involved depend on the type of data and the analysis being performed.
Preprocessing is a critical step in the analysis of human biomolecular data of infectious diseases because it can greatly affect the accuracy and reliability of the results. Data that is of high quality and suitable for analysis lets researchers obtain more accurate and meaningful insights.
Considerations
Here are some common considerations involved in data preprocessing:
- Data cleaning: This step involves identifying and correcting errors or inconsistencies in the data. Examples of data cleaning include removing duplicates, correcting typos or misspellings, and identifying and handling missing data.
- Data transformation: This step involves transforming the data to make it suitable for analysis. Examples of data transformation include converting data types (e.g., from categorical to numerical), scaling data, and normalising data.
- Remove low-quality samples: Samples that have low sequencing depth, a high number of missing values, or poor alignment quality can be removed from the dataset to ensure that the remaining samples are of high quality.
- Identify and remove outliers: Outliers are data points that fall outside the expected range of values and can skew the analysis results. Outliers can be identified using statistical methods and removed from the dataset.
- Check for batch effects: Batch effects are systematic differences in the data that arise from technical or experimental factors rather than from biology. Batch effects can be identified using statistical methods and then corrected, or accounted for in the analysis model.
- Data normalisation: Normalisation is a common preprocessing step that aims to remove systematic biases in the data. Normalisation methods should be evaluated to ensure that they are effective and do not introduce additional biases.
- Perform quality control checks at each preprocessing step: Quality control checks should be performed at each preprocessing step to ensure that the data is of high quality and suitable for the intended analysis.
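The sample filtering and outlier steps above can be sketched as follows. This is a minimal Python example under stated assumptions: the per-sample summary, the field names `depth` and `missing_fraction`, and the thresholds are all invented for illustration and are not recommended values. Outliers are flagged with Tukey's interquartile range rule, which is only one of several possible statistical methods, and they are flagged for review instead of being deleted automatically.

```python
import statistics

# Hypothetical per-sample QC summary. Field names and thresholds are
# illustrative assumptions, not recommended values.
samples = [
    {"id": "S1", "depth": 32.0, "missing_fraction": 0.01},
    {"id": "S2", "depth": 30.0, "missing_fraction": 0.02},
    {"id": "S3", "depth": 28.0, "missing_fraction": 0.00},
    {"id": "S4", "depth": 31.0, "missing_fraction": 0.03},
    {"id": "S5", "depth": 29.0, "missing_fraction": 0.01},
    {"id": "S6", "depth": 8.0, "missing_fraction": 0.02},   # too shallow
    {"id": "S7", "depth": 95.0, "missing_fraction": 0.02},  # unusually deep
    {"id": "S8", "depth": 30.0, "missing_fraction": 0.20},  # too many gaps
]


def passes_qc(sample, min_depth=20.0, max_missing=0.05):
    """Keep a sample only if it is deep enough and has few missing values."""
    return (sample["depth"] >= min_depth
            and sample["missing_fraction"] <= max_missing)


def flag_depth_outliers(kept_samples, k=1.5):
    """Return IDs of samples whose depth lies beyond k * IQR from the quartiles."""
    depths = [s["depth"] for s in kept_samples]
    q1, _median, q3 = statistics.quantiles(depths, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [s["id"] for s in kept_samples if not low <= s["depth"] <= high]


kept = [s for s in samples if passes_qc(s)]
flagged = flag_depth_outliers(kept)

print([s["id"] for s in kept])  # → ['S1', 'S2', 'S3', 'S4', 'S5', 'S7']
print(flagged)                  # → ['S7']
```

Recording which samples were removed or flagged, and why, keeps this step compatible with the documentation and quality control considerations above.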
Existing approaches
Preprocessing can be done using state-of-the-art bioinformatics tools and/or programming languages that offer functions and packages for processing this kind of data. For example, Python, R (for instance through RStudio) or the command line are different ways of performing all the steps needed for the desired preprocessing pipeline or protocol.
For quality control protocols, see the Quality control - Human biomolecular data page.
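As a small illustration of preprocessing in Python, the sketch below scales raw counts to counts per million (CPM), log-transforms them, and compares a per-sample summary value between batches. It is a simplified example under assumptions: the gene names, counts and batch labels are invented, CPM is only one of many normalisation methods, and a difference between batch means is just a first hint of a batch effect, not a substitute for dedicated statistical methods.

```python
import math
import statistics


def counts_per_million(counts):
    """Scale the raw counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: value * 1_000_000 / total for gene, value in counts.items()}


def log2_cpm(counts, pseudocount=1.0):
    """Log-transform CPM values; the pseudocount avoids log2(0)."""
    cpm = counts_per_million(counts)
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


def batch_means(value_by_sample, batch_of):
    """Average a per-sample summary value within each batch."""
    grouped = {}
    for sample, value in value_by_sample.items():
        grouped.setdefault(batch_of[sample], []).append(value)
    return {batch: statistics.fmean(values) for batch, values in grouped.items()}


# Invented raw counts and batch labels, for illustration only.
raw_counts = {
    "S1": {"geneA": 10, "geneB": 30, "geneC": 60},
    "S2": {"geneA": 20, "geneB": 50, "geneC": 130},
    "S3": {"geneA": 5, "geneB": 5, "geneC": 990},
}
batch_of = {"S1": "run1", "S2": "run1", "S3": "run2"}

# Summarise each sample by its median log2 CPM, then compare batches.
median_log_cpm = {
    sample: statistics.median(log2_cpm(counts).values())
    for sample, counts in raw_counts.items()
}
print(batch_means(median_log_cpm, batch_of))
```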
Analysis
The analysis of human biomolecular data involves the use of various techniques and approaches to extract meaningful information from biological samples such as DNA, RNA, proteins, and metabolites.
This stage relies on the previous stages (collection, processing), which provide the accurate and trustworthy data on which new knowledge is built.
Considerations
- The location of your data: Proximity to computing resources is crucial due to its impact on data transfer across infrastructures. It is worthwhile to compare the cost of transferring large data volumes versus the transfer of virtual machine images for analysis purposes.
- Analysis of the data: Before analysing the data, evaluate the computing environment and decide between the available types of computing infrastructure, such as clusters or clouds. It is also important to select a suitable work environment, such as the command line or a web portal, based on your requirements and expertise.
- Best tools: You need to select the tools best suited for the analysis of your data.
- Document the steps: Document the data analysis process accurately, including the precise steps taken, the software versions and parameters used, and the computing environment. Note that "manual" manipulation of the data can complicate this documentation.
- Collaborative analysis: When engaging in collaborative data analysis, it is crucial to ensure that all collaborators have access to the data and tools required. This can be facilitated by establishing virtual research environments that provide a shared platform for seamless collaboration.
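The comparison between moving the data and moving a virtual machine image can be approximated with simple arithmetic. The Python sketch below estimates transfer time from size and bandwidth. The sizes, the bandwidth and the assumed `efficiency` factor (the fraction of nominal bandwidth actually achieved) are hypothetical values chosen for illustration, and monetary costs such as egress fees are not modelled.

```python
def transfer_hours(size_gb, bandwidth_mbit_s, efficiency=0.8):
    """Estimate the hours needed to move size_gb gigabytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved; 0.8 is an illustrative guess, not a measurement.
    """
    if bandwidth_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    size_mbit = size_gb * 8 * 1000  # 1 GB = 1000 MB = 8000 Mbit
    seconds = size_mbit / (bandwidth_mbit_s * efficiency)
    return seconds / 3600


# Hypothetical scenario: a 2 TB sequencing dataset versus a 10 GB VM image,
# both sent over a 1 Gbit/s link.
dataset_hours = transfer_hours(size_gb=2000, bandwidth_mbit_s=1000)
image_hours = transfer_hours(size_gb=10, bandwidth_mbit_s=1000)

print(f"dataset: {dataset_hours:.1f} h, VM image: {image_hours * 60:.1f} min")
```

Under these assumed numbers the dataset takes hours and the image takes minutes, which is why bringing the analysis environment to the data is often worth considering.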
Existing approaches
There are several types of analysis that can be performed on human biomolecular data, depending on the specific research question and type of data being analysed. Here are some common types of analysis:
- Gene expression analysis: This involves measuring the expression levels of genes in a biological sample and comparing them across different conditions or groups of samples. This can be done using techniques such as microarray analysis or RNA sequencing.
- Genomic analysis: This involves the interpretation of genetic information encoded in DNA sequences. DNA data analysis can be used for a wide range of applications, such as identifying genetic variants associated with disease, studying the evolution of species, and understanding the molecular mechanisms underlying biological processes.
  - Sequence alignment: Bowtie2 and BWA
  - Structural variant detection: Delly, Lumpy, Manta and GRIDSS
  - Genome assembly: Canu, Flye, wtdbg2 and SPAdes
  - Phylogenetic analysis: ClustalW, MUSCLE, MAFFT and PhyML
  - Variant calling: Dragen-GATK, DeepVariant, FreeBayes and VarScan
  - Annotation: ANNOVAR, SnpEff, VEP and dbNSFP
- Epigenetic analysis: This involves measuring changes in DNA methylation, histone modifications, or other epigenetic marks in different samples or conditions. This can help to understand how gene expression is regulated and identify potential biomarkers or therapeutic targets.
  - DNA methylation analysis: Bismark, MethylKit and methylPipe
  - Histone modification analysis: MACS and SICER2
- Protein-protein interaction analysis: This involves identifying proteins that interact with each other and exploring the functional consequences of these interactions. This can help to identify new targets for drug development and understand disease mechanisms.
- Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) in biological samples and comparing them across different conditions or groups of samples. This can help to identify biomarkers of disease or drug response.
  - Data processing: XCMS, Mzmine and OpenMS
  - Statistical analysis: MetaboAnalyst and MetSign
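To make the idea of comparing expression levels across conditions concrete, the Python sketch below computes a log2 fold change per gene between two groups of samples. The gene names and values are invented, and the pseudocount is an assumption used to avoid division by zero. This is only the descriptive core of the comparison: dedicated packages such as DESeq2 or edgeR additionally model count variability and test for statistical significance.

```python
import math
import statistics


def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of mean expression in treated versus control samples."""
    mean_treated = statistics.fmean(treated)
    mean_control = statistics.fmean(control)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Invented normalised expression values, for illustration only.
expression = {
    "geneA": {"infected": [7, 7, 7], "control": [1, 1, 1]},
    "geneB": {"infected": [3, 3, 3], "control": [3, 3, 3]},
    "geneC": {"infected": [0, 0, 0], "control": [3, 3, 3]},
}

fold_changes = {
    gene: log2_fold_change(groups["infected"], groups["control"])
    for gene, groups in expression.items()
}

# Rank genes by the size of the change, regardless of direction.
ranked = sorted(fold_changes, key=lambda gene: abs(fold_changes[gene]), reverse=True)
print(fold_changes)  # → {'geneA': 2.0, 'geneB': 0.0, 'geneC': -2.0}
```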
Postprocessing
Postprocessing refers to the steps taken after the initial analysis to refine and interpret the results. These steps are important because they can help to identify biological patterns and relationships that were not apparent in the initial analysis, and to ensure that the results are biologically meaningful and reproducible.
Considerations
Some considerations to take into account when performing postprocessing on human biomolecular data include:
- Interpretation: Once the results have been generated, it is important to interpret them in a biologically meaningful context. This can include identifying enriched pathways or gene sets, performing network analysis, or annotating the results.
- Visualisation: It is important to visualise the results in a clear and informative way. This can help to identify patterns and relationships in the data, and to communicate the results to others in an accessible form.
Existing approaches
- Functional Enrichment Analysis: These analyses are used to identify enriched pathways and biological functions associated with differentially expressed biomolecules. They involve steps such as gene ontology analysis, pathway analysis, and network analysis. Tools and resources such as GSEA, GO, KEGG, DAVID and Cytoscape can be used for functional enrichment analysis and for annotating the results.
- Visualisation:
  - Generate plots and heatmaps using tools such as ggplot2 or matplotlib
  - Visualise data in a genomic context using tools such as IGV or UCSC Genome Browser
All these workflows and tools can be adapted and customised based on the specific type of data being analysed and the research question being addressed.
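One common form of functional enrichment analysis is an over-representation test: given a list of selected genes, it asks whether a pathway or gene set appears in that list more often than expected by chance. The Python sketch below computes the corresponding one-sided hypergeometric p-value using only the standard library. The function name and the example numbers are assumptions for illustration; GSEA uses a different, rank-based approach, and in practice p-values from many gene sets need correction for multiple testing.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background
    annotated:  background genes that belong to the gene set
    selected:   genes in the list of interest (e.g. differentially expressed)
    overlap:    selected genes that belong to the gene set
    """
    if not (0 <= annotated <= population and 0 <= selected <= population):
        raise ValueError("annotated and selected must lie within the population")
    if not 0 <= overlap <= min(annotated, selected):
        raise ValueError("overlap cannot exceed annotated or selected")
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, min(annotated, selected) + 1)
    )
    return tail / total


# Hypothetical example: 10 background genes, 4 of them in the pathway,
# 3 genes selected, 2 of which are in the pathway.
p_value = enrichment_p_value(population=10, annotated=4, selected=3, overlap=2)
print(round(p_value, 4))  # → 0.3333
```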
More information
Links to RDMkit
RDMkit is the Research Data Management toolkit for Life Sciences, describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable).
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
ANNOVAR | ANNOVAR is an efficient software tool to utilize update-to-date information to functionally annotate genetic variants detected from diverse genomes. | | Tool info |
Arvados | With Arvados, bioinformaticians run and scale compute-intensive workflows, developers create biomedical applications, and IT administrators manage large compute and storage resources. | | |
BioGRID | BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions | | Tool info Standards/Databases |
Bismark | Bismark is a program to map bisulfite treated sequencing reads to a genome of interest and perform methylation calls in a single step. | | Tool info Training |
Bitbucket | Git based code hosting and collaboration tool, built for teams. | | Standards/Databases |
Bowtie2 | Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. | | Tool info Training |
BWA | BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. | | Tool info Training |
Canu | Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing. | | Tool info |
ClustalW | ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments. | | Tool info Training |
Cromwell | Cromwell is a Workflow Management System geared towards scientific workflows. | | |
cwltool | Reference implementation to provide comprehensive validation of CWL files as well as provide other tools related to working with CWL. | | |
Cytoscape | Cytoscape provides a solid platform for network visualization and analysis | | Tool info Training |
DAVID | The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. | | Tool info Training |
dbNSFP | A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs | | Tool info |
DeepVariant | DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file. | | Tool info |
Delly | Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data. | | Tool info |
DESeq2 | Differential gene expression analysis based on the negative binomial distribution | | Tool info Training |
Docker | Docker is a software for the execution of applications in virtualized environments called containers. It is linked to DockerHub, a library for sharing container images | | Standards/Databases Standards/Databases Training |
Dragen-GATK | DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure. | | |
EdgeR | Empirical Analysis of Digital Gene Expression Data in R | | Tool info Training |
Flye | Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. | | Tool info Training |
FreeBayes | FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. | | Tool info Training |
Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Human clinical and hea... Using the ENA data sub... | Tool info Training |
GeneMANIA | GeneMANIA helps you predict the function of your favourite genes and gene sets. | | Tool info Training |
ggplot2 | ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. | | Tool info Training |
GitHub | GitHub is a versioning system, used for sharing code, as well as for sharing of small data. | An automated pipeline ... | Standards/Databases Standards/Databases Training |
GitLab | GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider. | | Standards/Databases Training |
GO | GO is to perform enrichment analysis on gene sets. | | Tool info Training |
GRIDSS | GRIDSS is a module software suite containing tools useful for the detection of genomic rearrangements. | | Tool info |
GSEA | Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states | | Tool info Training |
HISAT2 | HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome). | | Tool info Training |
IGV | The Integrative Genomics Viewer (IGV) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. | | Tool info Training |
IntAct | IntAct (Molecular Interaction Database) Website | | Tool info Standards/Databases Training |
KEGG | A set of annotation maps for Kyoto encyclopedia of genes and genomes (KEGG) | | Tool info Training |
Lumpy | A probabilistic framework for structural variant discovery. | | Tool info |
MACS | Model-based Analysis of ChIP-Seq (MACS), for identifying transcript factor binding sites. | | Tool info Training |
MAFFT | MAFFT is a multiple sequence alignment program | | Tool info |
Manta | Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. | | Tool info |
matplotlib | Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. | Socioeconomic data | Tool info Training |
MetaboAnalyst | MetaboAnalyst is a comprehensive platform dedicated for metabolomics data analysis via user-friendly, web-based interface. | | Tool info Training |
MethylKit | methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. | | Tool info |
methylPipe | Base resolution DNA methylation data analysis | | Tool info |
MetSign | A computational platform for high-resolution mass spectrometry-based metabolomics | | |
MUSCLE | MUSCLE is widely-used software for making multiple alignments of biological sequences. | | Tool info Training |
Mzmine | MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data. | | Tool info |
Nextflow | Nextflow is a framework for data analysis workflow execution | | Tool info Training |
Omicsgenerator | Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. | | |
OpenMS | OpenMS is an open-source software C++ library for LC-MS data management and analyses. | | Tool info Training |
PhyML | PhyML is a software package that uses modern statistical approaches to analyse alignments of nucleotide or amino acid sequences in a phylogenetic framework. | | Tool info |
SICER2 | Redesigned and improved ChIP-seq broad peak calling tool SICER | | |
Singularity | Singularity is a widely-adopted container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF) | | Training |
Snakemake | Snakemake is a framework for data analysis workflow execution | Human clinical and hea... | Tool info Training |
SnpEff | Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins. | | Tool info Training |
SPAdes | SPAdes is an assembly toolkit containing various assembly pipelines. | | Tool info Training |
STAR | Spliced Transcripts Alignment to a Reference | | Tool info Training |
toil-cwl-runner | The toil-cwl-runner command provides cwl-parsing functionality using cwltool, and leverages the job-scheduling and batch system support of Toil. | | |
UCSC Genome Browser | An online tool for analyzing and visualizing genomic data. It allows users to add and share annotations. | An automated SARS-CoV-... | Tool info Standards/Databases |
VarScan | Variant calling and somatic mutation/CNV detection for next-generation sequencing data | | Tool info |
VEP | VEP (Variant Effect Predictor) predicts the functional effects of genomic variants. | | Tool info Training |
wtdbg2 | Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). | | Tool info |
XCMS | Metabolomic and lipidomic platform | | Tool info Training |