motrpac-rna-seq-pipeline

MoTrPAC RNA-SEQ Pipeline

Overview
Quick Start
Prerequisites
GCP Set-up
Software / Dockerfiles
Configuration Files
Run the Pipeline
Pipeline Outputs
Monitoring and Job Management
WDL Workflow Structure
Local Development and Testing
Troubleshooting
Citations and References
Contributing and Support
Version Information

Overview

This repo contains the RNA-seq data processing pipeline, implemented in Workflow Description Language (WDL) and based on the harmonized RNA-SEQ MOP. The pipeline uses Caper, a Python wrapper package for the workflow management system Cromwell. All data were processed on the Google Cloud Platform (GCP).

Supported Organisms and Genome Builds

The pipeline supports the following organisms and genome versions:

Rat: rn6 (Rnor_6.0), rn7 (mRatBN7.2)
Human: gencode_v39 (GRCh38)

Pipeline Tools

The pipeline uses:

STAR aligner for read alignment
RSEM for transcript quantification (TPM, FPKM, counts)
featureCounts for gene-level read quantification
Quality control tools including FastQC, MultiQC, Picard, and Bowtie2 (for contamination assessment)

Pipeline Outputs

The pipeline generates:

Gene expression quantification (counts, TPM, FPKM) from RSEM
Gene counts from featureCounts
Comprehensive QC metrics for outlier detection and covariate adjustment
MultiQC reports for both pre- and post-alignment QC

Quick Start

For experienced users, here’s the essential workflow:

# 1. Clone the repository
git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline

# 2. Install Python dependencies
pip3 install -r scripts/requirements.txt

# 3. Generate input JSON configuration
python3 scripts/make_json_rnaseq.py \
  -g gs://your-bucket/fastq_raw \
  -o ./input_json \
  -r batch1_qc_metrics \
  -a human \
  -v gencode_v39 \
  -n 1 \
  -p your-gcp-project \
  -d us-docker.pkg.dev/motrpac-portal/rnaseq

# 4. Submit the pipeline
caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json

# 5. Monitor pipeline status
caper list

Prerequisites

Required Accounts

Google Cloud Platform (GCP) account with billing enabled
GCP service account with appropriate permissions
GCP Storage bucket for pipeline inputs and outputs

Required Software (Local Machine)

Google Cloud SDK
Python >= 3.6.9
Git

Python Dependencies

Install required Python packages:

pip3 install -r scripts/requirements.txt

The main dependencies include:

gcsfs - for accessing Google Cloud Storage
numpy - for data processing

GCP Permissions and APIs

Ensure the following APIs are enabled in your GCP project:

Compute Engine API
Cloud Storage API
Google Cloud Batch API (for workflow execution)

GCP Set-up

The WDL/Cromwell framework is optimized to run pipelines in high-performance computing environments. The MoTrPAC Bioinformatics Center runs pipelines on Google Cloud Platform (GCP). We used a number of fantastic tools developed by our colleagues from the ENCODE project to run pipelines on GCP (and other HPC platforms).

A brief summary of the steps to set up a VM to run the MoTrPAC pipelines on GCP (for details, please check the caper repo):

Create a GCP account.
Enable cloud APIs.
Install the Google Cloud SDK (Software Development Kit) on your local machine.
Create a service account and download the key file to your local computer (e.g. service-account-191919.json).
Create a bucket for pipeline inputs and outputs (e.g. gs://pipelines/). Note: a GCP bucket is similar to a folder on your computer or a storage unit, but it is stored on Google’s servers in the cloud instead of on your local computer.
Set up a VM on GCP: create a Virtual Machine (VM) instance from where the pipelines will be run. We recommend the script available in the caper repo. For that, clone the repo on your local machine and run the following command:

bash create_instance.sh [INSTANCE_NAME] [PROJECT_ID] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]

# Example for the pipeline:
bash create_instance.sh pipeline-instance your-gcp-project-name service-account-191919.json gs://pipelines/results/

Finally, clone the repo on your VM instance:

git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline

Software / Dockerfiles

Several tools are required to run the RNA-seq pipeline. All of them are pre-installed in Docker containers, which are publicly available in the Artifact Registry.

Available Docker Images

The pipeline uses the following containerized tools (all available at us-docker.pkg.dev/motrpac-portal/rnaseq):

fastqc:latest - FastQC for quality control
umi_attach:latest - UMI attachment utility
cutadapt:latest - Adapter trimming
multiqc:latest - Aggregate QC reporting
star:latest - STAR aligner
feature_counts:latest - featureCounts from Subread
rsem:latest - RSEM quantification
bowtie:latest - Bowtie2 aligner (for contamination screening)
picard:latest - Picard tools (MarkDuplicates, CollectRnaSeqMetrics)
umi_dup:latest - UMI-based duplication assessment
samtools:latest - SAMtools utilities
collect_qc:latest - Custom QC metrics collection
merge_results:latest - Result merging across samples

Building and Updating Containers

To find the specific versions of the tools used by the pipeline, check the Dockerfiles matching dockerfiles/*.Dockerfile.

To build and push updated containers:

# Build all dockerfiles
bash scripts/build_dockerfiles.sh

# Push to Artifact Registry (requires appropriate permissions)
bash scripts/push_dockerfiles.sh

Configuration Files

An input configuration file (in JSON format) is required to process the data through the pipeline. This configuration file contains several key-value pairs that specify the inputs and outputs of the workflow, the location of the input files, default pipeline parameters, docker containers, the execution environment, and other parameters needed for execution.

Generating Configuration Files

The recommended way to generate the configuration files is to run the make_json_rnaseq.py script.

Usage:

python3 scripts/make_json_rnaseq.py \
  -g GCP_PATH \
  -o OUTPUT_PATH \
  -r OUTPUT_REPORT_NAME \
  -a {rat,human} \
  -v {rn6,rn7,gencode_v39} \
  -n NUM_CHUNKS \
  -p PROJECT \
  [-d DOCKER_REPO] [-i] [-u]

Options:

  -g GCP_PATH                GCS path to directory containing FASTQ files
  -o OUTPUT_PATH             Local path where JSON files will be written
  -r OUTPUT_REPORT_NAME      Name for the output QC metrics report
  -a {rat,human}             Organism
  -v {rn6,rn7,gencode_v39}   Genome build version
  -n NUM_CHUNKS              Number of batches to split samples into
  -p PROJECT                 GCP project name
  -d DOCKER_REPO             Docker repository prefix (optional)
  -i                         Include index files (for UMI processing)
  -u                         Include undetermined reads (optional)

Complete Example:

python3 scripts/make_json_rnaseq.py \
  -g gs://motrpac-bucket/rna-seq/human/batch7_20220316/fastq_raw \
  -o ./input_json \
  -r batch7_qc_metrics.csv \
  -a human \
  -v gencode_v39 \
  -n 1 \
  -p motrpac-portal \
  -d us-docker.pkg.dev/motrpac-portal/rnaseq \
  -i

This will create JSON configuration file(s) (e.g., set1_rnaseq.json, set2_rnaseq.json, etc.) in the specified output directory.
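As a rough illustration of what the -n option implies, the sketch below splits a list of sample IDs into near-equal batches and names one JSON file per batch. It is not taken from make_json_rnaseq.py: the function name split_into_batches and the contiguous, sorted chunking strategy are assumptions made for the example, and the real script may group samples differently.

```python
import math


def split_into_batches(sample_ids, num_chunks):
    """Split sample IDs into at most num_chunks contiguous batches.

    Hypothetical helper: the chunking rule here (sorted, near-equal,
    contiguous) is an assumption, not the script's documented behavior.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    samples = sorted(sample_ids)
    size = math.ceil(len(samples) / num_chunks)
    batches = {}
    for i in range(num_chunks):
        members = samples[i * size:(i + 1) * size]
        if members:  # skip empty trailing batches
            batches[f"set{i + 1}_rnaseq.json"] = members
    return batches


batches = split_into_batches(["S1", "S2", "S3", "S4", "S5"], 2)
for name, members in batches.items():
    print(name, members)
```

Each resulting file would then be submitted as its own workflow, which keeps a single failed batch from blocking the others.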

Organism Reference Files

The make_json_rnaseq.py script automatically selects the appropriate reference files based on the organism and version:

Rat (rn6):

STAR index: gs://omicspipelines-public-resources/rnaseq/references/rat/Rnor6_v96_star_index.tar.gz
GTF: gs://omicspipelines-public-resources/rnaseq/references/rat/Rattus_norvegicus.Rnor_6.0.96.gtf
RSEM reference: gs://omicspipelines-public-resources/rnaseq/references/rat/rn6_rsem_reference.tar.gz

Rat (rn7):

STAR index: gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/rn7_v108_star_index.tar.gz
GTF: gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/Rattus_norvegicus.mRatBN7.2.108.gtf
RSEM reference: gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/rn7_rsem_reference.tar.gz

Human (gencode_v39):

STAR index: gs://omicspipelines-public-resources/rnaseq/references/human/hg38_v39_star_index.tar.gz
GTF: gs://omicspipelines-public-resources/rnaseq/references/human/GRCh38.v39.primary_assembly.annotation.gtf
RSEM reference: gs://omicspipelines-public-resources/rnaseq/references/human/hg38_rsem_reference.tar.gz

For more details, see the scripts documentation.
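The selection logic can be pictured as a lookup keyed by organism and genome version. The sketch below uses the reference paths listed above; the names REFERENCES and select_references are hypothetical and do not come from make_json_rnaseq.py.

```python
REFERENCE_ROOT = "gs://omicspipelines-public-resources/rnaseq/references"

# (organism, version) -> reference files, copied from the lists above.
REFERENCES = {
    ("rat", "rn6"): {
        "star_index": f"{REFERENCE_ROOT}/rat/Rnor6_v96_star_index.tar.gz",
        "gtf": f"{REFERENCE_ROOT}/rat/Rattus_norvegicus.Rnor_6.0.96.gtf",
        "rsem_reference": f"{REFERENCE_ROOT}/rat/rn6_rsem_reference.tar.gz",
    },
    ("rat", "rn7"): {
        "star_index": f"{REFERENCE_ROOT}/rat/rn7/rn7_v108_star_index.tar.gz",
        "gtf": f"{REFERENCE_ROOT}/rat/rn7/Rattus_norvegicus.mRatBN7.2.108.gtf",
        "rsem_reference": f"{REFERENCE_ROOT}/rat/rn7/rn7_rsem_reference.tar.gz",
    },
    ("human", "gencode_v39"): {
        "star_index": f"{REFERENCE_ROOT}/human/hg38_v39_star_index.tar.gz",
        "gtf": f"{REFERENCE_ROOT}/human/GRCh38.v39.primary_assembly.annotation.gtf",
        "rsem_reference": f"{REFERENCE_ROOT}/human/hg38_rsem_reference.tar.gz",
    },
}


def select_references(organism, version):
    """Return the reference files for a supported organism/version pair."""
    try:
        return REFERENCES[(organism, version)]
    except KeyError:
        valid = ", ".join(f"{o}/{v}" for o, v in sorted(REFERENCES))
        raise ValueError(
            f"Unsupported combination {organism}/{version}; expected one of: {valid}"
        ) from None


human_refs = select_references("human", "gencode_v39")
print(human_refs["gtf"])
```

Keying on the pair rather than on the version alone makes mismatches such as human with rn6 fail early instead of silently picking the wrong genome.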

Run the Pipeline

Connect to the VM and submit the job using the command below:

caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json

To check workflow status, run caper list on the VM instance where the job was submitted and confirm that each workflow shows Succeeded.
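When many workflows are in flight, it can help to filter the listing programmatically. The sketch below assumes the listing has been captured as tab-separated text with a header row containing id and status columns; that layout is an assumption about the output, so adjust the column names to what your Caper version prints. The function name unfinished_workflows is hypothetical.

```python
import csv
import io


def unfinished_workflows(listing_text):
    """Return (id, status) pairs for workflows that are not 'Succeeded'.

    Assumes tab-separated text with 'id' and 'status' header columns.
    """
    reader = csv.DictReader(io.StringIO(listing_text), delimiter="\t")
    return [
        (row["id"], row["status"])
        for row in reader
        if row["status"] != "Succeeded"
    ]


# Illustrative listing; the workflow IDs are made up.
sample = (
    "id\tstatus\tname\n"
    "wf-001\tSucceeded\trnaseq_pipeline\n"
    "wf-002\tFailed\trnaseq_pipeline\n"
    "wf-003\tRunning\trnaseq_pipeline\n"
)
pending = unfinished_workflows(sample)
print(pending)
```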

Pipeline Outputs

The pipeline generates the following main output files:

Quantification Files

RSEM Gene Expression Quantification
- *_rsem_genes_count.txt - Raw gene-level counts
- *_rsem_genes_tpm.txt - Transcripts Per Million (TPM) normalized expression
- *_rsem_genes_fpkm.txt - Fragments Per Kilobase of transcript per Million mapped reads (FPKM) normalized expression

featureCounts Gene Quantification
- *_feature_counts.txt - Gene-level raw counts from featureCounts
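For readers comparing the three RSEM tables, the sketch below shows how TPM and FPKM relate to counts under the textbook definitions. It is a simplification: RSEM computes these values from expected counts and effective lengths, whereas this example uses plain counts and lengths, and the function names are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts) / 1e6
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]


# Toy data: three genes with made-up counts and lengths.
counts = [10, 20, 30]
lengths_bp = [1000, 2000, 1000]
tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)
print(tpm_values)
print(fpkm_values)
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.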

Quality Control Files

QC Metrics Report
- *_qc_report.csv - Comprehensive QC metrics per sample including:
  - Read alignment statistics
  - rRNA, globin, and PhiX contamination rates
  - PCR duplication rates
  - Strand specificity
  - 5’ to 3’ coverage bias
  - Percentage of reads mapping to coding/intronic/intergenic regions
  - Chromosome mapping percentages
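One way to use the QC report for outlier detection is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below is only an illustration: the column names sample_id and pct_rRNA, the threshold of 3, and the function name flag_outliers are assumptions, so substitute the actual column names from your *_qc_report.csv.

```python
import csv
import io
import statistics


def flag_outliers(rows, metric, threshold=3.0):
    """Return sample IDs whose metric deviates strongly from the median."""
    values = [float(row[metric]) for row in rows]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread, nothing to flag
    # 1.4826 rescales the MAD to match a standard deviation for normal data.
    return [
        row["sample_id"]
        for row, v in zip(rows, values)
        if abs(v - median) / (1.4826 * mad) > threshold
    ]


# Made-up QC values standing in for a real report.
qc_csv = """sample_id,pct_rRNA
S1,1.0
S2,1.2
S3,0.9
S4,1.1
S5,15.0
"""
rows = list(csv.DictReader(io.StringIO(qc_csv)))
outliers = flag_outliers(rows, "pct_rRNA")
print(outliers)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme sample would otherwise inflate the spread and hide itself.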

Additional Outputs (Per Sample)

The pipeline also generates intermediate outputs for each sample (stored in Cromwell execution directories):

FastQC reports (pre- and post-trimming)
MultiQC consolidated reports
STAR alignment BAM files
Trimmed FASTQ files
Picard metrics files

Retrieving Outputs

Final merged outputs are written to the GCS bucket specified during pipeline submission. Individual sample outputs are organized in the Cromwell execution directory structure.

Monitoring and Job Management

Checking Pipeline Status

# List all workflows
caper list

# Check detailed status of a specific workflow
caper metadata [WORKFLOW_ID]

Monitoring Running Jobs

# View workflows currently running
caper list | grep Running

# Check logs for a specific workflow
caper debug [WORKFLOW_ID]

Managing Workflows

# Abort a running workflow
caper abort [WORKFLOW_ID]

# Check troubleshooting information
caper troubleshoot [WORKFLOW_ID]

Retrieving Results

Successful pipeline runs will write outputs to your specified GCS bucket. Intermediate files and execution logs are stored in:

cromwell-executions/ - Contains all task execution outputs and logs
cromwell-workflow-logs/ - Contains workflow-level logs

To copy results from GCS to your local machine:

gsutil -m cp -r gs://your-bucket/results/workflow_id/* ./local_results/

WDL Workflow Structure

The pipeline is organized as a modular WDL workflow with the following structure:

Main Workflow

wdl/rnaseq_pipeline_scatter.wdl - Main workflow that orchestrates all tasks using a scatter-gather pattern to process multiple samples in parallel
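For readers new to the scatter-gather pattern, the sketch below mimics it in plain Python: each sample is processed independently (scatter) and the per-sample results are then combined into one table (gather). The function process_sample is a stand-in for the per-sample task chain and returns a placeholder value; nothing here reflects the actual WDL code.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    """Stand-in for per-sample work (trimming, alignment, quantification)."""
    return {"sample": sample_id, "value": len(sample_id)}  # placeholder result


def scatter_gather(sample_ids, max_workers=4):
    # Scatter: run the per-sample work independently and in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_sample = list(pool.map(process_sample, sample_ids))
    # Gather: merge the per-sample results into a single table.
    return {result["sample"]: result["value"] for result in per_sample}


merged = scatter_gather(["A", "BB", "CCC"])
print(merged)
```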

Task Modules

The pipeline consists of the following task modules (in wdl/ directory):

Pre-alignment QC and Processing:

fastqc/ - Quality control with FastQC (pre- and post-trimming)
attach_umi/ - Attach UMI indices to read names
cutadapt/ - Adapter trimming
multiqc/ - Aggregate QC reporting

Alignment and Quantification:

star_align/ - Alignment with STAR
rsem_exp/ - RSEM quantification
feature_counts/ - featureCounts quantification

Contamination Screening:

bowtie2_align/ - Bowtie2 alignment to globin, rRNA, and PhiX references

Post-alignment QC:

mark_duplicates/ - PCR duplicate marking with Picard
collect_rnaseq_metrics/ - RNA-seq QC metrics with Picard
umi_dup/ - UMI-based duplication assessment
compute_mapped/ - Chromosome mapping statistics
collect_qc_metrics/ - Consolidated QC metrics collection

Results Aggregation:

merge_results/ - Merge quantification and QC data across all samples

Reference Building Workflows

The repository also includes workflows for building reference files:

wdl/star_ref/ - Build STAR genome indices
wdl/rsem_index/ - Build RSEM reference indices
wdl/bowtie2_index/ - Build Bowtie2 indices

Local Development and Testing

Setup for Development

Use the provided setup scripts to configure your development environment:

# Set up VM for pipeline execution
bash scripts/setup/setup_vm.sh

# Set up local development environment
bash scripts/setup/setup_develop.sh

Validating JSON Files

Before submitting workflows, validate your JSON configuration files:

python3 scripts/validate_jsons.py input_json/set1_rnaseq.json
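The kinds of checks worth running before submission can be sketched as follows. This is not the logic of scripts/validate_jsons.py: the function names are hypothetical, and the workflow name rnaseq_pipeline is inferred from the Cromwell execution path used elsewhere in this README. Cromwell expects input keys of the form workflow_name.input_name, so the sketch checks that prefix and looks for incomplete gs:// paths.

```python
def iter_strings(value):
    """Yield every string inside a value that may be a nested list."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def check_config(config, workflow_name="rnaseq_pipeline"):
    """Return a list of human-readable problems found in an inputs dict."""
    problems = []
    prefix = workflow_name + "."
    for key, value in config.items():
        if not key.startswith(prefix):
            problems.append(f"{key}: expected key to start with '{prefix}'")
        for text in iter_strings(value):
            if text.startswith("gs://"):
                bucket, _, obj = text[len("gs://"):].partition("/")
                if not bucket or not obj:
                    problems.append(f"{key}: incomplete GCS path '{text}'")
    return problems


# Illustrative config; the input names are made up.
example = {
    "rnaseq_pipeline.fastq1": ["gs://bucket/a_R1.fastq.gz", "gs://bucket"],
    "star_ramGB": 40,
}
problems = check_config(example)
for problem in problems:
    print(problem)
```

To run it on a real file, load the file with json.load and pass the resulting dict to check_config.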

Testing with Prototype Examples

The prototype/ directory contains example configuration files and submission scripts:

# Example submission script for generic use
bash prototype/submit_rnaseq_generic.sh

# Example JSON configurations in prototype/input_json/

The examples/ directory contains additional JSON examples for individual tasks and different organism configurations.

Building Docker Images Locally

# Build all docker images
bash scripts/build_dockerfiles.sh

# Push to your container registry (configure registry URL first)
bash scripts/push_dockerfiles.sh

Troubleshooting

Common Issues

1. Pipeline Fails During Submission

Verify JSON configuration is valid using scripts/validate_jsons.py
Ensure all required input files exist in the specified GCS paths
Check that service account has permissions to access GCS buckets

2. Tasks Fail with “Out of Memory” Errors

Increase *_ramGB parameters in your JSON configuration
Default memory allocations are in scripts/make_json_rnaseq.py

3. Tasks Fail with “Out of Disk Space” Errors

Increase *_disk parameters in your JSON configuration
Ensure your GCS bucket has sufficient quota

4. Cannot Find Output Files

Check workflow succeeded: caper list
Outputs are in the GCS bucket specified in your configuration
Check Cromwell execution logs in cromwell-executions/

5. Docker Image Pull Failures

Verify you have access to the Artifact Registry
Check that docker image names/tags are correct in JSON
Ensure Compute Engine service account has Artifact Registry Reader role
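For issues 2 and 3, the resource parameters can be raised in bulk rather than edited by hand. The sketch below scales every numeric value whose key ends with a given suffix; the key names in the example are hypothetical, and only the *_ramGB and *_disk suffixes come from this README.

```python
import math


def scale_resources(config, suffix, factor):
    """Return a copy of config with matching numeric values scaled up."""
    updated = dict(config)
    for key, value in config.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key.endswith(suffix) and is_number:
            updated[key] = math.ceil(value * factor)  # round up to whole units
    return updated


# Illustrative config; the parameter names are made up.
config = {
    "rnaseq_pipeline.star_ramGB": 40,
    "rnaseq_pipeline.star_disk": 200,
    "rnaseq_pipeline.organism": "rat",
}
bumped = scale_resources(config, "_ramGB", 1.5)
print(bumped)
```

Write the result back with json.dump and resubmit the workflow; the original dict is left unchanged so the previous settings remain available for comparison.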

Accessing Logs

Workflow-level logs:

# View workflow metadata
caper metadata [WORKFLOW_ID]

# Check troubleshooting info
caper troubleshoot [WORKFLOW_ID]

Task-level logs: Navigate to the Cromwell execution directory:

cd cromwell-executions/rnaseq_pipeline/[WORKFLOW_ID]/
# Find specific task directories and check stderr/stdout logs
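Finding the failing task by hand can be tedious in a large run. Cromwell writes each call's exit code to a file named rc next to stderr and stdout inside the call's execution directory, so a short script can list the stderr files of calls that exited with a nonzero code. The sketch below assumes the execution directory is available on a local or mounted filesystem; the function name failed_task_logs and the demo directory layout are illustrative.

```python
import tempfile
from pathlib import Path


def failed_task_logs(workflow_dir):
    """Return (exit_code, stderr_path) for calls with a nonzero exit code."""
    failures = []
    for rc_file in sorted(Path(workflow_dir).rglob("execution/rc")):
        code = rc_file.read_text().strip()
        if code != "0":
            failures.append((code, rc_file.with_name("stderr")))
    return failures


# Demo on a throwaway directory that mimics the call/shard/execution layout.
with tempfile.TemporaryDirectory() as tmp:
    for task, code in [("call-fastqc", "0"), ("call-star_align", "137")]:
        execution = Path(tmp) / task / "shard-0" / "execution"
        execution.mkdir(parents=True)
        (execution / "rc").write_text(code + "\n")
        (execution / "stderr").write_text("log output\n")
    failures = [
        (code, path.relative_to(tmp).as_posix())
        for code, path in failed_task_logs(tmp)
    ]

print(failures)
```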

GCP Console:

Navigate to Life Sciences API in GCP Console
View operation logs and details for each task execution

Getting Help

If issues persist:

Check the Cromwell documentation: https://cromwell.readthedocs.io/
Review the Caper documentation: https://github.com/MoTrPAC/caper/
Open an issue on the GitHub repository with:
- Workflow ID
- Error messages from logs
- JSON configuration (with sensitive data removed)

Citations and References

Pipeline Documentation

RNA-SEQ MOP - MoTrPAC RNA-seq Method of Procedure

Workflow Management

Cromwell - Workflow management system
Caper - Cromwell wrapper for easy workflow execution
WDL - Workflow Description Language specification

Analysis Tools

STAR - Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013.
RSEM - Li B and Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011.
featureCounts - Liao Y, et al. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014.
FastQC - Quality control for high throughput sequence data
MultiQC - Ewels P, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016.
Cutadapt - Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011.
Bowtie2 - Langmead B and Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012.
Picard Tools - Broad Institute toolkit for SAM/BAM file manipulation

Infrastructure

ENCODE-DCC - Tools and pipelines from the ENCODE Project Consortium

Reference Genomes

Rat rn6: Ensembl Rnor_6.0 release 96
Rat rn7: Ensembl mRatBN7.2 release 108
Human: GENCODE v39 (GRCh38)

Contributing and Support

Reporting Issues

If you encounter bugs or have feature requests, please open an issue on the GitHub repository.

When reporting issues, please include:

Description of the problem
Steps to reproduce
Expected vs. actual behavior
Workflow ID (if applicable)
Relevant error messages or logs
JSON configuration (remove sensitive information)

Contact

For questions or support related to the MoTrPAC RNA-seq pipeline:

Open an issue on GitHub: https://github.com/MoTrPAC/motrpac-rna-seq-pipeline/issues
Contact the MoTrPAC Bioinformatics Center

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request with a clear description of the changes

Related Repositories

MoTrPAC Data Hub - Access to MoTrPAC datasets
MoTrPAC GitHub Organization - Other MoTrPAC analysis pipelines and tools

Version Information

Current Version

This pipeline is actively maintained and updated. Check the releases page for version history and changelogs.

Citing This Pipeline

If you use this pipeline in your research, please cite:

Compatibility Notes

WDL Version: 1.0
Cromwell Version: Compatible with Cromwell 50+
Python Version: Requires Python >= 3.6.9
GCP: Designed for Google Cloud Platform (adaptable to other backends with Cromwell configuration)

Change History

Significant changes fall into the following categories; see the repository’s commit history for details:

Reference genome updates (rn6 → rn7, GENCODE versions)
Tool version updates (see dockerfiles for current versions)
Workflow optimizations and bug fixes
