Human biomolecular data

Introduction

Analysing human biomolecular data of infectious diseases involves exploring the collected data to understand what a dataset shows and identifying relationships between variables using mathematical formulas or models. It is crucial to follow best practices, and especially the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), so that information can be collected and shared as effectively as possible.

General considerations

Some considerations for analysing human biomolecular data of infectious diseases are:

Select the tools best suited for the analysis of your data.
Document the exact steps used for data analysis.
Choose between several computing infrastructure types, e.g. cluster, cloud.
Take into account the computing resources needed.
Identify which type of data you are using, e.g. DNAseq, ATACseq, CNV.
Plan how to integrate different types of data (e.g. RNAseq and DNAseq).
Follow the FAIR principles.
Guarantee access to the data and tools for all collaborators, and support reproducibility by:
- Providing your code
- Providing your execution environment
- Providing your workflows
- Providing your data analysis execution

When looking for solutions to some of the considerations above, see the information available on the RDMkit website.

Existing approaches

Below are some general approaches that can help you build and improve your data analysis pipeline or protocol:

Container environments: As an alternative to package management systems, you can consider container environments like Docker or Singularity.
Web-based code hosting platforms: Provide a centralised location for software developers to store, manage, collaborate on, and share their code. You can use GitHub (widely used), GitLab or Bitbucket.
Workflow platforms: Allow you to manage your data and provide an interface (web, GUI, APIs) to run complex pipelines and review their results. Examples include Galaxy and Arvados (CWL-based, open source).
Workflow runners: Allow you to take a workflow written in a proprietary or standardised format (such as the CWL standard) and execute it locally or on a remote computing infrastructure. Examples include toil-cwl-runner, the reference CWL runner (cwltool), Nextflow, Snakemake and Cromwell.
Integration pipelines: These pipelines are used to integrate different types of data, such as genomic, proteomic, and metabolomic data. They involve steps such as data preprocessing, data integration, and functional analysis. Tools like Omicsgenerator can be used for data integration.
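
As a rough illustration of the data integration step, the sketch below aligns two data types on shared sample identifiers before any joint analysis. It is a minimal sketch and not how any of the tools above work internally: the function name `integrate_by_sample`, the layer names, and the sample and feature identifiers are all hypothetical.

```python
def integrate_by_sample(*layers):
    """Merge named omics layers, keeping only samples present in every layer.

    Each layer is a (name, data) pair, where data maps a sample ID to a
    dictionary of measurements for that sample.
    """
    shared = set.intersection(*(set(data) for _, data in layers))
    return {
        sample: {name: data[sample] for name, data in layers}
        for sample in sorted(shared)
    }


# Hypothetical toy data: sample S2 has no proteomics measurement.
transcriptomics = {"S1": {"geneA": 8.1}, "S2": {"geneA": 7.4}, "S3": {"geneA": 9.0}}
proteomics = {"S1": {"protA": 1.2}, "S3": {"protA": 0.7}}

merged = integrate_by_sample(("rna", transcriptomics), ("protein", proteomics))
for sample, values in merged.items():
    print(sample, values)
```

Keeping only the samples shared by all layers is one possible design choice; depending on the study, it may be preferable to keep all samples and handle the missing layers explicitly.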

Preprocessing

Data preprocessing is the phase of the project in which data is converted into the desired format and prepared for analysis. It is a crucial step that involves cleaning, transforming, and preparing the data. The goal of preprocessing is to ensure that the data is of high quality and is suitable for the intended analysis. Preprocessing can involve a range of steps, depending on the type of data and the analysis being performed.

Preprocessing is critical in the analysis of human biomolecular data of infectious diseases because it can greatly affect the accuracy and reliability of the results. By ensuring that the data is of high quality and suitable for analysis, preprocessing helps researchers obtain more accurate and meaningful insights from this data.

Considerations

Here are some common considerations involved in data preprocessing:

Data cleaning: This step involves identifying and correcting errors or inconsistencies in the data. Examples of data cleaning include removing duplicates, correcting typos or misspellings, and identifying and handling missing data.
Data transformation: This step involves transforming the data to make it suitable for analysis. Examples of data transformation include converting data types (e.g., from categorical to numerical), scaling data, and normalising data.
Remove low-quality samples: Samples that have low sequencing depth, a high number of missing values, or poor alignment quality can be removed from the dataset to ensure that the remaining samples are of high quality.
Identify and remove outliers: Outliers are data points that fall outside the expected range of values and can skew the analysis results. Outliers can be identified using statistical methods and removed from the dataset.
Check for batch effects: Batch effects are systematic differences in the data that arise from technical or experimental factors. Batch effects can be identified using statistical methods and then corrected or accounted for in the analysis.
Data normalisation: Normalisation is a common preprocessing step that aims to remove systematic biases in the data. Normalisation methods should be evaluated to ensure that they are effective and do not introduce additional biases.
Perform quality control checks at each preprocessing step: Quality control checks should be performed at each preprocessing step to ensure that the data is of high quality and suitable for the intended analysis.
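
Two of the considerations above, removing low-quality samples and identifying outliers, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the thresholds (`min_depth`, `max_missing_fraction`, `k`), the field names and the sample identifiers are hypothetical and need to be chosen for each dataset, and the interquartile range rule is only one of many statistical methods for flagging outliers.

```python
import statistics


def filter_low_quality(samples, min_depth=1_000_000, max_missing_fraction=0.2):
    """Keep samples with enough reads and an acceptable share of missing values."""
    return {
        sample_id: qc
        for sample_id, qc in samples.items()
        if qc["depth"] >= min_depth and qc["missing_fraction"] <= max_missing_fraction
    }


def flag_outliers_iqr(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical per-sample quality metrics.
samples = {
    "S1": {"depth": 5_200_000, "missing_fraction": 0.02},
    "S2": {"depth": 300_000, "missing_fraction": 0.01},    # too shallow
    "S3": {"depth": 4_800_000, "missing_fraction": 0.45},  # too many missing values
    "S4": {"depth": 6_100_000, "missing_fraction": 0.05},
}
kept = filter_low_quality(samples)
print(sorted(kept))

# Hypothetical measurements; the last value is far from the others.
measurements = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 12.0]
print(flag_outliers_iqr(measurements))
```

The second function only flags candidate outliers and leaves the decision to remove them to the analyst, so that the choice can be documented.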

Existing approaches

Preprocessing can be done using state-of-the-art bioinformatics tools and/or programming languages that offer functions and packages for processing this kind of data. For example, Python, R (for instance through RStudio) or command-line tools each let you carry out the steps required for your preprocessing pipeline or protocol.
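
As an example of what such a step can look like in Python, the sketch below applies a simple library-size normalisation (counts per million) followed by a log2 transformation to the raw counts of one sample. It is a minimal sketch for illustration only: the gene names and counts are hypothetical, the pseudocount of 1 is an assumption, and counts per million is just one of several normalisation methods, not necessarily the most suitable one for a given analysis.

```python
import math


def counts_per_million(counts):
    """Scale the raw counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_transform(values, pseudocount=1.0):
    """Apply log2(value + pseudocount) so that zero counts stay defined."""
    return {gene: math.log2(value + pseudocount) for gene, value in values.items()}


# Hypothetical raw read counts for a single sample.
raw_counts = {"geneA": 150, "geneB": 50, "geneC": 0, "geneD": 800}

cpm = counts_per_million(raw_counts)
log_cpm = log2_transform(cpm)
for gene in raw_counts:
    print(f"{gene}: CPM = {cpm[gene]:.1f}, log2(CPM + 1) = {log_cpm[gene]:.2f}")
```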

When looking for quality control protocols, see the Quality control - Human biomolecular data page.

Analysis

The analysis of human biomolecular data involves the use of various techniques and approaches to extract meaningful information from biological samples such as DNA, RNA, proteins, and metabolites.

This stage relies on the previous stages (collection, processing), which lay the foundations for generating new knowledge by providing accurate and trustworthy data.

Considerations

The location of your data: Proximity to computing resources is crucial due to its impact on data transfer across infrastructures. It is worthwhile to compare the cost of transferring large data volumes versus the transfer of virtual machine images for analysis purposes.
Analysis of the data: Before analysing the data, evaluate the computing environment and decide between the available types of computing infrastructure, such as clusters or clouds. It is also important to select a suitable work environment, such as the command line or a web portal, based on your requirements and expertise.
Best tools: You need to select the tools best suited for the analysis of your data.
Document the steps: Accurate documentation of the data analysis process is essential, covering the precise steps taken, the software versions and parameters used, and the computing environment. Note that “manual” manipulation of the data can complicate this documentation.
Collaborative analysis: When engaging in collaborative data analysis, it is crucial to ensure that all collaborators have access to the data and tools required. This can be facilitated by establishing virtual research environments that provide a shared platform for seamless collaboration.
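
One way to document an analysis step is to write a small machine-readable record next to its output. The sketch below is a minimal example and not a standard provenance format: the record fields, the tool name `example-aligner`, its version and its parameters are all hypothetical placeholders.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def build_provenance_record(step_name, tool, tool_version, parameters, input_bytes):
    """Collect what was run, with which settings, on which input, and where."""
    return {
        "step": step_name,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": parameters,
        # A checksum ties the record to the exact input that was analysed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_provenance_record(
    step_name="alignment",
    tool="example-aligner",  # hypothetical tool name
    tool_version="0.0.0",    # hypothetical version
    parameters={"threads": 4, "min_mapq": 20},  # hypothetical parameters
    input_bytes=b"@read1\nACGT\n+\nFFFF\n",     # stands in for the input file content
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Workflow runners and container environments can capture much of this information automatically; a record like this is mainly useful for steps that run outside such systems.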

Existing approaches

There are several types of analysis that can be performed on human biomolecular data, depending on the specific research question and type of data being analysed. Here are some common types of analysis:

Gene expression analysis: This involves measuring the expression levels of genes in a biological sample and comparing them across different conditions or groups of samples. This can be done using techniques such as microarray analysis or RNA sequencing.
- Sequence alignment: STAR and HISAT2
- Gene expression analysis: DESeq2 and edgeR
Genomic analysis: This involves the interpretation of genetic information encoded in DNA sequences. DNA data analysis can be used for a wide range of applications, such as identifying genetic variants associated with disease, studying the evolution of species, and understanding the molecular mechanisms underlying biological processes.
- Sequence alignment: Bowtie2 and BWA
- Structural variant detection: Delly, Lumpy, Manta and GRIDSS
- Genome assembly: Canu, Flye, wtdbg2 and SPAdes
- Phylogenetic analysis: ClustalW, MUSCLE, MAFFT and PhyML
- Variant calling: Dragen-GATK, DeepVariant, freebayes and VarScan
- Annotation: ANNOVAR, SnpEff, VEP and dbNSFP
Epigenetic analysis: This involves measuring changes in DNA methylation, histone modifications, or other epigenetic marks in different samples or conditions. This can help to understand how gene expression is regulated and identify potential biomarkers or therapeutic targets.
- DNA methylation analysis: Bismark, methylKit and methylPipe
- Histone modification analysis: MACS and SICER2
Protein-protein interaction analysis: This involves identifying proteins that interact with each other and exploring the functional consequences of these interactions. This can help to identify new targets for drug development and understand disease mechanisms.
- Interaction databases: BioGRID and IntAct
- Network analysis: Cytoscape and GeneMANIA
Metabolomics analysis: This involves measuring the levels of small molecules (metabolites) in biological samples and comparing them across different conditions or groups of samples. This can help to identify biomarkers of disease or drug response.
- Data processing: XCMS Online, MZmine and OpenMS
- Statistical analysis: MetaboAnalyst and MetSign
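
The idea of comparing expression levels across conditions, described under gene expression analysis above, can be illustrated with a log2 fold change. This is a deliberately simplified sketch: the gene names, values and pseudocount are hypothetical, the input is assumed to be already normalised, and no statistical test is performed, so it does not replace the statistical modelling done by dedicated packages such as DESeq2 or edgeR.

```python
import math
import statistics


def log2_fold_change(condition, reference, pseudocount=1.0):
    """log2 ratio of mean expression in `condition` versus `reference`."""
    mean_condition = statistics.mean(condition)
    mean_reference = statistics.mean(reference)
    return math.log2((mean_condition + pseudocount) / (mean_reference + pseudocount))


# Hypothetical normalised expression values for three replicates per group.
expression = {
    "geneA": {"control": [10, 12, 11], "infected": [43, 47, 45]},
    "geneB": {"control": [100, 98, 102], "infected": [99, 101, 100]},
}

fold_changes = {
    gene: log2_fold_change(groups["infected"], groups["control"])
    for gene, groups in expression.items()
}
for gene, lfc in fold_changes.items():
    print(f"{gene}: log2 fold change = {lfc:.2f}")
```

A positive value means higher expression in the infected group, a value near zero means no change between the groups.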

Postprocessing

Postprocessing refers to the steps taken after the initial analysis to refine and interpret the results. These steps are important because they can help to identify biological patterns and relationships that were not apparent in the initial analysis, and to ensure that the results are biologically meaningful and reproducible.

Considerations

Some considerations to take into account when performing postprocessing on human biomolecular data include:

Interpretation: Once the results have been generated, it is important to interpret them in a biologically meaningful context. This can include identifying enriched pathways or gene sets, performing network analysis, or annotating the results.
Visualisation: It is important to visualise the results in a clear and informative way. This can help to identify patterns and relationships in the data, and to communicate the results to others.

Existing approaches

Functional enrichment analysis: These analyses identify enriched pathways and biological functions associated with differentially expressed biomolecules. They involve steps such as gene ontology analysis, pathway analysis, and network analysis. Tools and resources like GSEA, GO, KEGG, DAVID and Cytoscape can be used for functional enrichment analysis and to annotate the results.
Visualisation:
- Generate plots and heatmaps using tools such as ggplot2 or matplotlib
- Visualise data in a genomic context using tools such as IGV or UCSC Genome Browser

All these workflows and tools can be adapted and customised based on the specific type of data being analysed and the research question being addressed.
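
The over-representation idea behind functional enrichment analysis can be sketched with a hypergeometric test: given a list of selected genes, how likely is it to see at least the observed number of pathway members by chance? The sketch below is a minimal illustration and does not reproduce the method of any specific tool named above. All numbers are hypothetical, and in practice the p-values of many pathways would also need correction for multiple testing.

```python
from math import comb


def enrichment_p_value(population, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `pathway_size` belong to the pathway."""
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 measured genes, a pathway of 100 genes,
# 200 differentially expressed genes, 10 of which are in the pathway.
p_value = enrichment_p_value(population=20000, pathway_size=100, selected=200, overlap=10)
print(f"p-value for over-representation: {p_value:.3g}")
```

With these numbers about one pathway gene would be expected by chance among the selected genes, so an overlap of ten gives a small p-value.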

More information

Links to RDMkit

RDMkit is the Research Data Management toolkit for Life Sciences, describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable).

Human Data

Sensitive Data

Biomolecular simulation data

Data analysis

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
ANNOVAR	ANNOVAR is an efficient software tool to utilize up-to-date information to functionally annotate genetic variants detected from diverse genomes.	Pathogen characterisation	Tool info
Arvados	With Arvados, bioinformaticians run and scale compute-intensive workflows, developers create biomedical applications, and IT administrators manage large compute and storage resources.
BioGRID	BioGRID is a comprehensive biomedical repository for curated protein, genetic and chemical interactions	Pathogen characterisation	Tool info Standards/Databases
Bismark	Bismark is a program to map bisulfite treated sequencing reads to a genome of interest and perform methylation calls in a single step.		Tool info Training
Bitbucket	Git based code hosting and collaboration tool, built for teams.	Pathogen characterisation	Standards/Databases
Bowtie2	Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.	Pathogen characterisation	Tool info Training
BWA	BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome.	Pathogen characterisation	Tool info Training
Canu	Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing.	Pathogen characterisation	Tool info
ClustalW	ClustalW is a progressive multiple sequence alignment tool to align a set of sequences by repeatedly aligning pairs of sequences and previously generated alignments.	Pathogen characterisation	Tool info Training
Cromwell	Cromwell is a Workflow Management System geared towards scientific workflows.
cwltool	Reference implementation to provide comprehensive validation of CWL files as well as provide other tools related to working with CWL.
Cytoscape	Cytoscape provides a solid platform for network visualization and analysis	Pathogen characterisation	Tool info Training
DAVID	The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.		Tool info Training
dbNSFP	A comprehensive database of transcript-specific functional predictions and annotations for human non-synonymous and splice-site SNVs	Pathogen characterisation	Tool info
DeepVariant	DeepVariant is a deep learning-based variant caller that takes aligned reads (in BAM or CRAM format), produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and finally reports the results in a standard VCF or gVCF file.		Tool info
Delly	Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data.		Tool info
DESeq2	Differential gene expression analysis based on the negative binomial distribution	Pathogen characterisation	Tool info Training
Docker	Docker is a software for the execution of applications in virtualized environments called containers. It is linked to DockerHub, a library for sharing container images		Standards/Databases Standards/Databases Training
Dragen-GATK	DRAGEN-GATK Best Practices contains open-source workflows that are compatible between Illumina's platforms and mainstream infrastructure.	Pathogen characterisation
edgeR	Empirical Analysis of Digital Gene Expression Data in R		Tool info Training
Flye	Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.	Pathogen characterisation	Tool info Training
freebayes	freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment.	Pathogen characterisation	Tool info Training
Galaxy	Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.	Pathogen characterisation General guidelines Human clinical and hea... Using the ENA data sub...	Tool info Training
GeneMANIA	GeneMANIA helps you predict the function of your favourite genes and gene sets.		Tool info Training
ggplot2	ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.		Tool info Training
GitHub	GitHub is a versioning system, used for sharing code, as well as for sharing of small data.	Pathogen characterisation An automated pipeline ...	Standards/Databases Standards/Databases Training
GitLab	GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider.	Pathogen characterisation	Standards/Databases Training
GO	GO (Gene Ontology) is used to perform enrichment analysis on gene sets.	Pathogen characterisation	Tool info Training
GRIDSS	GRIDSS is a modular software suite containing tools useful for the detection of genomic rearrangements.		Tool info
GSEA	Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states		Tool info Training
HISAT2	HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome).	Pathogen characterisation	Tool info Training
IGV	The Integrative Genomics Viewer (IGV) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.		Tool info Training
IntAct	IntAct (Molecular Interaction Database) Website	Pathogen characterisation	Tool info Standards/Databases Training
KEGG	A set of annotation maps for Kyoto encyclopedia of genes and genomes (KEGG)		Tool info Training
Lumpy	A probabilistic framework for structural variant discovery.		Tool info
MACS	Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites.		Tool info Training
MAFFT	MAFFT is a multiple sequence alignment program	Pathogen characterisation	Tool info Training
Manta	Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads.		Tool info
matplotlib	Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.	Socioeconomic data	Tool info Training
MetaboAnalyst	MetaboAnalyst is a comprehensive platform dedicated for metabolomics data analysis via user-friendly, web-based interface.	Pathogen characterisation	Tool info Training
methylKit	methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing.		Tool info
methylPipe	Base resolution DNA methylation data analysis		Tool info
MetSign	A computational platform for high-resolution mass spectrometry-based metabolomics
MUSCLE	MUSCLE is widely-used software for making multiple alignments of biological sequences.	Pathogen characterisation	Tool info Training
MZmine	MZmine 3 is an open-source software for mass-spectrometry data processing, with the main focus on LC-MS data.	Pathogen characterisation	Tool info
Nextflow	Nextflow is a framework for data analysis workflow execution	Pathogen characterisation	Tool info Training
Omicsgenerator	Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network.
OpenMS	OpenMS is an open-source software C++ library for LC-MS data management and analyses.	Pathogen characterisation	Tool info Training
PhyML	PhyML is a software package that uses modern statistical approaches to analyse alignments of nucleotide or amino acid sequences in a phylogenetic framework.		Tool info
SICER2	Redesigned and improved ChIP-seq broad peak calling tool SICER
Singularity	Singularity is a widely-adopted container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF)		Training
Snakemake	Snakemake is a framework for data analysis workflow execution	Pathogen characterisation General guidelines Human clinical and hea...	Tool info Training
SnpEff	Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins.	Pathogen characterisation	Tool info Training
SPAdes	SPAdes is an assembly toolkit containing various assembly pipelines.	Pathogen characterisation	Tool info Training
STAR	Spliced Transcripts Alignment to a Reference	Pathogen characterisation	Tool info Training
toil-cwl-runner	The toil-cwl-runner command provides cwl-parsing functionality using cwltool, and leverages the job-scheduling and batch system support of Toil.
UCSC Genome Browser	An online tool for analyzing and visualizing genomic data. It allows users to add and share annotations.	An automated SARS-CoV-...	Tool info Standards/Databases Training
VarScan	Variant calling and somatic mutation/CNV detection for next-generation sequencing data		Tool info Training
VEP	VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.	Pathogen characterisation	Tool info Training
wtdbg2	Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT).		Tool info
XCMS Online	A systems biology tool for analyzing metabolomic data. It automatically superimposes raw metabolomic data onto metabolic pathways and integrates it with transcriptomic and proteomic data.		Tool info
