Introduction
This documentation is currently incomplete and a work in progress.
Welcome to the OmicsPipelines suite of applications. OmicsPipelines lets you set up and run genomics and proteomics post-processing workflows in the cloud.
What is a workflow?
At its most basic level, a workflow is a set of pre-determined steps that accomplish some task.
In the context of processing raw biological datasets (.fastq files for genomics, .raw files for proteomics), this means performing a series of steps that turn the massive raw files produced by a sequencing machine or mass-spectrometry device into data that humans can interpret. For example, one step could read a FASTQ file and run quality-control checks using tools such as samtools, and the following step could filter the samtools output and write the results to a CSV file.
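As a rough illustration, the sketch below shows what such a step might look like as a standalone Python script. It is not OmicsPipelines code: the samtools stats invocation, the example.bam input (samtools stats operates on aligned SAM/BAM/CRAM files, so an aligned file stands in for the raw FASTQ here), and the output file name are all illustrative assumptions.

```python
# Conceptual sketch of a single workflow step as a standalone script.
# Assumes samtools is on PATH; example.bam and qc_summary.csv are
# illustrative file names, not part of OmicsPipelines.
import csv
import subprocess

def quality_control_step(bam_path: str, csv_path: str) -> None:
    # Run samtools stats and capture its text report.
    result = subprocess.run(
        ["samtools", "stats", bam_path],
        capture_output=True, text=True, check=True,
    )

    # Keep only the "SN" summary-number lines, e.g.
    # "SN<TAB>raw total sequences:<TAB>1000000"
    rows = []
    for line in result.stdout.splitlines():
        if line.startswith("SN\t"):
            _, metric, value = line.split("\t")[:3]
            rows.append((metric.rstrip(":"), value))

    # Write the filtered metrics to a CSV for downstream steps.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "value"])
        writer.writerows(rows)

if __name__ == "__main__":
    quality_control_step("example.bam", "qc_summary.csv")
```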
Why use workflows?
Historically, biological post-processing meant running a single script containing every step a bioinformatician wanted to run to turn a biological dataset into comprehensible data.
However, this approach has many disadvantages, including poor portability (i.e. reusing a single workflow across different HPC and cloud environments), difficulty resuming failed tasks, and difficulty parallelizing processing across many input files.
Workflows aim to solve this by splitting processing into discrete steps that can be run independently and connected by their inputs and outputs, forming a directed acyclic graph (DAG). Workflows are typically run by workflow orchestration engines, described in the next section.
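To make the DAG idea concrete, here is a minimal Python sketch (not OmicsPipelines code) that models a three-step workflow as a dependency graph and runs the steps in a valid order; the step names are hypothetical.

```python
# Minimal sketch of a workflow as a directed acyclic graph of steps.
# Step names and the dependency layout are illustrative only.
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
workflow = {
    "quality_control": set(),
    "filter_results":  {"quality_control"},
    "write_csv":       {"filter_results"},
}

def run_step(name: str) -> None:
    # A real engine would launch a container or batch job here.
    print(f"running step: {name}")

# graphlib resolves a valid execution order from the dependencies.
for step in TopologicalSorter(workflow).static_order():
    run_step(step)
```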
How are workflows run?
Workflow orchestration engines such as Nextflow, Cromwell, and Snakemake were developed to run workflows, and they have largely shaped the modern concept of what a workflow is.
These programs are glorified taskmasters that coordinate running the steps described in a workflow, concurrently or otherwise. They use standardized languages and formats to describe workflows and workflow steps (such as WDL and CWL for Cromwell, Nextflow's own DSL, and Python-based Snakefiles for Snakemake) and are able to leverage HPC and cloud environments with minimal or no changes to the actual workflow file.
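As a loose illustration of the "describe the step, let the engine run it" idea, the following Python sketch models a declarative step description. It is not the syntax of any real engine, and every name in it is hypothetical.

```python
# Conceptual sketch of a declarative step description, loosely inspired by
# what engines like Nextflow or Cromwell capture; not any engine's real syntax.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    command: str                      # the command the engine will execute
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    container: str = "ubuntu:22.04"   # image the step runs in, anywhere

# Because the step only declares *what* to run, the engine is free to
# execute it on a laptop, an HPC scheduler, or a cloud batch service.
qc = Step(
    name="quality_control",
    command="samtools stats {input} > {output}",
    inputs=["example.bam"],
    outputs=["qc_report.txt"],
)
```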
Additionally, many of these workflow orchestration engines provide convenience features such as caching of task inputs and outputs, so that an entire workflow does not have to be re-run when a single task fails or a downstream task changes.
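The sketch below illustrates the caching idea in plain Python, assuming a simple content hash over a step's input file; real engines track inputs, parameters, and code versions in more sophisticated ways, and the file names here are hypothetical.

```python
# Sketch of input/output caching: skip a step when its inputs have not
# changed since the last run. cache.json and the step names are illustrative.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("cache.json")

def fingerprint(path: str) -> str:
    # Hash the file contents so any change invalidates the cache entry.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def run_cached(step_name: str, input_path: str, run) -> None:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    digest = fingerprint(input_path)
    if cache.get(step_name) == digest:
        print(f"{step_name}: inputs unchanged, skipping")
        return
    run(input_path)                      # re-run only when inputs changed
    cache[step_name] = digest
    CACHE_FILE.write_text(json.dumps(cache))
```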
Finally, task parallelization scales almost automatically with workflow orchestration engines. A user could theoretically run a thousand input files in approximately the same amount of time it takes to run a single file, although in practice this is limited by the cost and compute constraints of supporting a thousand concurrent virtual machines.
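For a sense of how one step definition fans out over many inputs, here is a small Python sketch that uses a local process pool as a stand-in for a fleet of virtual machines; the per-sample function and file names are hypothetical.

```python
# Sketch of fanning one step out across many input files in parallel.
# In a real engine each task would typically run on its own VM or node.
from concurrent.futures import ProcessPoolExecutor

def process_sample(fastq_path: str) -> str:
    # Placeholder for a real per-sample step (alignment, QC, ...).
    return f"processed {fastq_path}"

if __name__ == "__main__":
    inputs = [f"sample_{i}.fastq" for i in range(1000)]
    # The same step definition scales from one file to a thousand;
    # only the available compute limits how many run at once.
    with ProcessPoolExecutor() as pool:
        for result in pool.map(process_sample, inputs):
            print(result)
```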