
Understanding the "Cloud"

OmicsPipelines is a cloud-based platform for running bioinformatics pipelines. It is designed to be easy to use and to provide a consistent interface for running pipelines. Because it runs in the cloud, it requires an account with a cloud provider. OmicsPipelines currently supports running pipelines on AWS and Google Cloud.

What is the cloud?

The cloud is a network of remote servers that can be accessed via the internet rather than a local server or a personal computer, and it is used to store, manage, and process data.

Cloud computing enables users to have on-demand access to shared computing resources, such as software and information, and to pay only for what they use.

The cloud is made up of various services, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). These can include storage, databases, software, analytics, and other services. The primary advantages of cloud computing are scalability, flexibility, and cost-effectiveness.

What is cloud infrastructure?

Cloud infrastructure refers to the physical and virtual resources required to create, set up, and operate applications and services on a cloud computing platform.

Examples include servers, storage, networking, and data centers. Cloud infrastructure lets organizations access computing resources on demand, pay only for what they use, and scale resources as necessary.

Cloud providers offer many different services, and choosing which ones to use and how to configure them can be confusing. Choosing the wrong services or configuring them incorrectly can produce a sprawling set of services that developers, statisticians, and administrators must navigate or, worse still, leave your account open to attack. OmicsPipelines provides an opinionated configuration of these services and tries to make it easy to set up.

Warning

While OmicsPipelines' configurations have been designed with security in mind, and have been battle-tested, it is still possible to make mistakes. It is important to understand the services that OmicsPipelines uses, to understand the security implications of using them, and to ensure that you are following best practices for security. OmicsPipelines is not responsible for any security breaches that may occur as a result of using the application.

Why use the cloud?

The cloud is a great way to run bioinformatics pipelines. Because it is a network of remote servers, it is easy to scale up and down as needed, and to share resources with other users. It is also highly cost-effective.

For example, a single low-to-mid-tier server can cost upwards of $4,000 before accounting for maintenance and electricity, and it becomes outdated within 3-4 years. Due to economies of scale, the equivalent virtual machine in Google Cloud costs ~$850 to run 24/7, before applying long-term usage and other pricing discounts that Google Cloud offers. Even this is an overestimate, because virtual machines are not billed when they are not in use.
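The effect of paying only for running time can be sketched with a little arithmetic. The snippet below is a rough illustration, not a pricing calculator: it assumes cost scales linearly with the fraction of time the VM is running and ignores any other charges. The function name `billed_cost` and the utilization levels are hypothetical and chosen only for illustration.

```python
def billed_cost(always_on_cost: float, utilization: float) -> float:
    """Scale an always-on VM cost by the fraction of time the VM is running.

    Assumption: cost is proportional to running time; other charges are ignored.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return always_on_cost * utilization


ALWAYS_ON_COST = 850.0  # the ~$850 figure quoted above for running 24/7

# Hypothetical utilization levels, not measurements.
for utilization in (1.0, 0.5, 0.25):
    cost = billed_cost(ALWAYS_ON_COST, utilization)
    print(f"{utilization:.0%} utilization -> ${cost:,.2f}")
```

Under these assumptions, a VM that only runs a quarter of the time would cost roughly a quarter of the always-on figure.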

Furthermore, running a modern pipeline on a single machine can take days or weeks, whereas parallelizing the task over multiple servers in the cloud can cut that time to hours.
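To make the speed-up concrete, here is an idealized estimate of wall-clock time when independent tasks are spread over several machines. It is a simplification that assumes every task takes the same time and ignores start-up, data transfer, and scheduling overhead. The function name and the sample counts are hypothetical.

```python
import math


def ideal_wall_clock_hours(num_tasks: int, hours_per_task: float, num_workers: int) -> float:
    """Estimate wall-clock hours when equal-length tasks run in parallel.

    Tasks are processed in "rounds": each round runs up to num_workers tasks
    at once, so the total time is the number of rounds times the task length.
    """
    if num_tasks < 0 or num_workers < 1:
        raise ValueError("need num_tasks >= 0 and num_workers >= 1")
    rounds = math.ceil(num_tasks / num_workers)
    return rounds * hours_per_task


# Hypothetical workload: 96 samples, each taking 2 hours to process.
single_machine = ideal_wall_clock_hours(96, 2.0, num_workers=1)
many_machines = ideal_wall_clock_hours(96, 2.0, num_workers=48)
print(f"1 machine:   {single_machine:.0f} hours ({single_machine / 24:.0f} days)")
print(f"48 machines: {many_machines:.0f} hours")
```

In this hypothetical case the single machine needs 192 hours (8 days), while 48 machines finish in 4 hours. Real pipelines will fall short of this ideal because some steps cannot be parallelized.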

What does the infrastructure application do?

The infrastructure application provides a number of features to make it easy to run pipelines. These include automatic infrastructure provisioning, secure data transfer, secure connections to the cloud, and a simple interface for monitoring all the infrastructure in use (including non-OmicsPipelines infrastructure).

What exactly is OmicsPipelines doing in my cloud account?

OmicsPipelines uses a number of different services in your cloud account. These include:

  • virtual machines (VMs) for running pipelines
  • cloud object storage for storing pipeline data
  • networking for connecting the VMs to the storage

Depending on which cloud provider you are using, OmicsPipelines may also use:

  • billing data to show you how much you are spending
  • networking gateway for encrypting traffic to/from the VMs
  • logging utilities for storing logs
  • code repositories for storing code related to provisioning virtual machines
  • container registries for storing container images

What is a virtual machine?

A virtual machine (VM) is a software implementation of a computer that executes programs like a physical computer.

A VM runs on top of a physical computer, and is isolated from other VMs. This allows multiple VMs to run on the same physical computer, and for each VM to have its own operating system and applications. This is known as virtualization.

This ability to run multiple VMs on a single physical computer is what makes cloud computing possible. Cloud providers offer a number of different VMs, each with different specifications, such as the amount of memory, the amount of storage, and the number of CPUs. These VMs are available on demand, and, as they are defined in software, can be scaled up or down as needed, similar to how a computer can be upgraded.

How does OmicsPipelines use VMs?

The bulk of OmicsPipelines' work is done using virtual machines. These are used to run the pipelines, cache metadata about the pipelines you run, and serve you the OmicsPipelines dashboard. Additionally, the pipelines themselves launch virtual machines that are pre-configured with the software for each step. All of these virtual machines are configured to use the appropriate cloud provider's storage solution (S3 for AWS and GCS for Google Cloud) for storing pipeline data and metadata.
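As a small illustration of the provider-specific storage choice, the sketch below builds an object storage location from a provider name. It is not OmicsPipelines' actual code. The `s3://` and `gs://` URI schemes are the ones used by the AWS and Google Cloud command-line tools; the provider keys (`"aws"`, `"gcp"`), the function name, and the bucket and object names are hypothetical.

```python
# URI scheme for each provider's object storage (S3 for AWS, GCS for Google Cloud).
STORAGE_SCHEMES = {"aws": "s3", "gcp": "gs"}


def storage_uri(provider: str, bucket: str, key: str) -> str:
    """Build an object storage URI for the given provider, bucket, and object key."""
    try:
        scheme = STORAGE_SCHEMES[provider.lower()]
    except KeyError:
        raise ValueError(f"unsupported provider: {provider!r}") from None
    # Strip any leading slash so the key joins cleanly after the bucket name.
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


# Hypothetical bucket and object names.
print(storage_uri("aws", "example-bucket", "runs/run-001/output.vcf"))
print(storage_uri("gcp", "example-bucket", "runs/run-001/output.vcf"))
```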

What is cloud object storage?

Cloud object storage is a service that provides persistent remote data storage. It holds data that is not needed immediately but may be needed in the future, and OmicsPipelines uses it to store pipeline data and metadata. Think of it like a Google Drive that is accessible to both you and your cloud resources (such as virtual machines).

Cloud object storage has a basic hierarchy of buckets, folders, and objects. Buckets are the top-level containers for objects (think of a bucket as a Shared Drive in Google Drive or a physical disk drive in your computer). Folders organize objects within a bucket, although in most cloud object storage solutions folders are not first-class objects and are instead just a naming convention. Objects are the actual data or files stored in cloud object storage.
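The idea that folders are only a naming convention can be shown with a toy model in plain Python. This is not a real cloud SDK: the bucket is just a dictionary mapping object keys to contents, and the key names and helper functions are hypothetical. A "folder" appears only because several keys share the same prefix.

```python
# A bucket modeled as a flat mapping from object key to object contents.
# Contents are empty placeholders; only the keys matter here.
bucket = {
    "runs/run-001/input.fastq": b"",
    "runs/run-001/output.vcf": b"",
    "runs/run-002/input.fastq": b"",
    "reference/genome.fa": b"",
}


def list_objects(bucket: dict, prefix: str = "") -> list:
    """Return every object key that starts with the given prefix."""
    return sorted(key for key in bucket if key.startswith(prefix))


def list_folders(bucket: dict, prefix: str = "", delimiter: str = "/") -> list:
    """Return the apparent sub-folders directly under a prefix.

    No folder is stored anywhere: each one is inferred from the part of a key
    between the prefix and the next delimiter.
    """
    folders = set()
    for key in bucket:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if delimiter in rest:
                folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(folders)


print(list_folders(bucket))                   # apparent top-level folders
print(list_folders(bucket, "runs/"))          # apparent folders under runs/
print(list_objects(bucket, "runs/run-001/"))  # objects "inside" one folder
```

Deleting the last object under a prefix makes the corresponding "folder" disappear in this model, which is one way to see that the folder was never a separate entity.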