
Introduction

This documentation is incomplete and a work in progress.

Welcome to OmicsPipelines, a suite of applications that lets you set up and run genomics and proteomics post-processing workflows in the cloud.

What is a workflow?

At its most basic level, a workflow is a set of pre-determined steps that accomplish some task.

In the context of processing raw biological datasets (.fastq files for genomics and .raw files for proteomics), this means performing a series of steps to turn the massive raw files produced by a sequencing machine or mass-spectrometry device into data that humans can interpret. For example, one step could read a FASTQ file and run quality control using tools such as samtools, and the following step could filter the samtools output and write the results to a CSV file.
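As a rough sketch of the two steps just described, here are plain Python functions standing in for real tools such as samtools (the record layout, quality threshold, and function names are hypothetical, chosen only to illustrate one step feeding the next):

```python
# Step 1 (stand-in for a QC tool): keep reads whose mean base quality
# passes a threshold.
def quality_control(fastq_records, min_mean_quality=30):
    return [
        r for r in fastq_records
        if sum(r["quals"]) / len(r["quals"]) >= min_mean_quality
    ]

# Step 2: flatten the filtered records into CSV-style rows (id, read length).
def write_csv_rows(records):
    return ["%s,%d" % (r["id"], len(r["seq"])) for r in records]

reads = [
    {"id": "read1", "seq": "ACGT", "quals": [40, 40, 38, 39]},
    {"id": "read2", "seq": "ACGTAC", "quals": [10, 12, 9, 11, 10, 8]},
]

# The output of step 1 is the input of step 2 - the essence of a workflow.
rows = write_csv_rows(quality_control(reads))
```

In a real workflow each step would be a separate program reading and writing files, but the shape is the same: a chain of steps where each consumes the previous step's output.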

Why use workflows?

Historically, biological post-processing involved running a single script containing all the steps a bioinformatician wanted to run to turn a raw dataset into comprehensible data.

However, this approach has many disadvantages, including challenges with portability (i.e. reusing a single workflow across different HPC and cloud environments), difficulty resuming failed tasks, and difficulty parallelizing processing across many input files.

Workflows aim to solve this by splitting processing into discrete steps that can be run independently and connected as a directed acyclic graph of dependencies. Workflows are typically run by workflow orchestration engines, described in the next section.

How are workflows run?

Workflow orchestration engines such as Nextflow, Cromwell, and Snakemake were developed to run workflows, and they largely define the modern concept of what a workflow is.

These programs are glorified task masters that coordinate running the steps described in a workflow, concurrently or otherwise. They use standardized languages/formats to describe workflows and workflow steps (such as WDL or CWL for Cromwell, Nextflow's own Groovy-based DSL, and a Python-based format for Snakemake) and are able to leverage HPC and cloud environments with minimal or no changes to the actual workflow file.
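For illustration, a minimal WDL task of the kind an engine like Cromwell runs might look like the following (the task name, command, and Docker image are hypothetical, not taken from any OmicsPipelines workflow):

```wdl
version 1.0

# Hypothetical task: count the reads in a FASTQ file.
task count_reads {
  input {
    File fastq
  }
  command <<<
    # Each FASTQ record spans four lines.
    echo $(( $(wc -l < ~{fastq}) / 4 )) > count.txt
  >>>
  output {
    Int n_reads = read_int("count.txt")
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}
```

Because the task declares its inputs, outputs, and runtime environment explicitly, the engine can schedule it on a laptop, an HPC cluster, or a cloud VM without changes to the task itself.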

Additionally, many of these workflow orchestration engines provide convenience features, such as caching of workflow inputs and outputs, so that an entire workflow doesn't have to be re-run when a single task fails or a downstream task changes.

Finally, workflow orchestration engines make it easy to scale task parallelization automatically. A user could theoretically process a thousand input files in approximately the same amount of time it takes to process a single file, although in practice this is limited by the computational constraints of supporting a thousand concurrent virtual machines.
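The scaling idea can be sketched with Python's standard library standing in for the orchestration engine: independent inputs are "scattered" across workers, so wall-clock time approaches that of a single run (the sample names and `process_sample` function are made up for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample_id):
    """Stand-in for one full workflow run over a single input file."""
    return "%s.done" % sample_id

samples = ["sample%d" % i for i in range(1, 6)]

# The engine runs independent inputs concurrently, up to the limits of
# the available compute; pool.map preserves input order in its results.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(process_sample, samples))
```

A real engine does the same scattering across VMs or cluster nodes rather than threads, which is where the practical limits on concurrency come from.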

Where does OmicsPipelines fit in?

While workflow orchestration engines have made running workflows significantly easier and more portable, running, sharing, following the progress of, and organizing the inputs and outputs of workflows is still an enormous challenge, especially for a user who may not be familiar with the command line. A number of platforms, such as the Broad Institute's Terra and DNAnexus, address some of these needs, but they come with drawbacks of their own.

Other products versus OmicsPipelines

OmicsPipelines, Terra, and DNAnexus differ across the following features:

  • Self-hosted
  • Visualize running workflows
  • Single-form setup for workflows
  • Free [1]
  • Open source
  • Runs on AWS
  • Runs on Google Cloud
  • Runs on Azure
  • Available pre-validated workflows
  • Safe for PHI [2][3]
  • Bring your own workflow [4]
  • Self-contained data management platform [4]

How does OmicsPipelines work?

OmicsPipelines is self-hosted - you provide the cloud account and the computing resources, and OmicsPipelines provides the rest. You receive the bill for running your workflows directly from your chosen cloud provider.

OmicsPipelines consists of two applications:

  • the desktop application - sets up the appropriate cloud infrastructure to run and securely access the dashboard in your own cloud account in either Google Cloud Platform or Amazon Web Services.
  • the dashboard - allows you to run, monitor, and access the outputs of your workflows.

OmicsPipelines supports two types of workflows, genomics and proteomics. We provide opinionated and validated workflows that have also been pre-configured to make it easy to get started by just filling out a simple form. Future versions of OmicsPipelines will allow you to bring your own workflows.

OmicsPipelines leverages the workflow engine Cromwell and a wrapper around it written by the team at ENCODE. The workflows that OmicsPipelines provides are written in WDL and can be found on the MoTrPAC GitHub organization.

Get started

  • Get started by downloading the infrastructure launcher application.

  • Check out the docs for the dashboard.

  • Check out the docs for the desktop application.

  • Check out the docs for each workflow that you can run with OmicsPipelines.

Footnotes

  1. Does not consider costs from your chosen cloud provider.

  2. When following best practices suggested by OmicsPipelines and using a cloud account with a PHI-safe agreement.

  3. Requires an agreement to be signed between your lab/organization/company and the platform.

  4. Feature planned for upcoming versions.