motrpac-rna-seq-pipeline

MoTrPAC RNA-seq Pipeline

DOI

Table of Contents

Overview

This repo contains the RNA-seq data processing pipeline, implemented in the Workflow Description Language (WDL) and based on the harmonized RNA-seq MOP. The pipeline uses Caper, a Python wrapper for the Cromwell workflow management system. All data were processed on the Google Cloud Platform (GCP).

Supported Organisms and Genome Builds

The pipeline supports the following organisms and genome versions:

Pipeline Tools

The pipeline uses:

Pipeline Outputs

The pipeline generates:

Quick Start

For experienced users, here’s the essential workflow:

# 1. Clone the repository
git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline

# 2. Install Python dependencies
pip3 install -r scripts/requirements.txt

# 3. Generate input JSON configuration
python3 scripts/make_json_rnaseq.py \
  -g gs://your-bucket/fastq_raw \
  -o ./input_json \
  -r batch1_qc_metrics \
  -a human \
  -v gencode_v39 \
  -n 1 \
  -p your-gcp-project \
  -d us-docker.pkg.dev/motrpac-portal/rnaseq

# 4. Submit the pipeline
caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json

# 5. Monitor pipeline status
caper list

Prerequisites

Required Accounts

Required Software (Local Machine)

Python Dependencies

Install required Python packages:

pip3 install -r scripts/requirements.txt

The main dependencies include:

GCP Permissions and APIs

Ensure the following APIs are enabled in your GCP project:

GCP Set-up

The WDL/Cromwell framework is well suited to running pipelines in high-performance computing environments. The MoTrPAC Bioinformatics Center runs its pipelines on Google Cloud Platform (GCP), using a number of excellent tools developed by our colleagues on the ENCODE project to run pipelines on GCP (and other HPC platforms).

A brief summary of the steps to set up a VM to run the MoTrPAC pipelines on GCP (for details, please check the Caper repo):

# Create a VM instance
bash create_instance.sh [INSTANCE_NAME] [PROJECT_ID] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]

# Example for the pipeline:
bash create_instance.sh pipeline-instance your-gcp-project-name service-account-191919.json gs://pipelines/results/

# Clone the repository on the VM
git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline

Software / Dockerfiles

Several tools are required to run the RNA-seq pipeline. All of them are pre-installed in Docker containers that are publicly available in the Artifact Registry.

Available Docker Images

The pipeline uses the following containerized tools (all available at us-docker.pkg.dev/motrpac-portal/rnaseq):

Building and Updating Containers

To find out which specific versions of the tools the pipeline uses, check the Dockerfiles in dockerfiles/ (*.Dockerfile).

To build and push updated containers:

# Build all dockerfiles
bash scripts/build_dockerfiles.sh

# Push to Artifact Registry (requires appropriate permissions)
bash scripts/push_dockerfiles.sh

Configuration Files

An input configuration file (in JSON format) is required to process data through the pipeline. This file contains key-value pairs that specify the workflow inputs and outputs, the locations of the input files, default pipeline parameters, Docker containers, the execution environment, and other parameters required for execution.

Generating Configuration Files

The recommended way to generate configuration files is to run the make_json_rnaseq.py script.

Usage:

python3 scripts/make_json_rnaseq.py \
  -g GCP_PATH \               # GCS path to directory containing FASTQ files
  -o OUTPUT_PATH \            # Local path where JSON files will be written
  -r OUTPUT_REPORT_NAME \     # Name for the output QC metrics report
  -a {rat,human} \            # Organism
  -v {rn6,rn7,gencode_v39} \  # Genome build version
  -n NUM_CHUNKS \             # Number of batches to split samples into
  -p PROJECT \                # GCP project name
  -d DOCKER_REPO \            # Docker repository prefix (optional)
  -i \                        # Include index files (for UMI processing)
  -u                          # Include undetermined reads (optional)

Complete Example:

python3 scripts/make_json_rnaseq.py \
  -g gs://motrpac-bucket/rna-seq/human/batch7_20220316/fastq_raw \
  -o ./input_json \
  -r batch7_qc_metrics.csv \
  -a human \
  -v gencode_v39 \
  -n 1 \
  -p motrpac-portal \
  -d us-docker.pkg.dev/motrpac-portal/rnaseq \
  -i

This will create JSON configuration file(s) (e.g., set1_rnaseq.json, set2_rnaseq.json, etc.) in the specified output directory.
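As a quick sanity check before submission, the generated files can be loaded with Python's json module. A minimal sketch, assuming a made-up example config — the key names (e.g. "rnaseq_pipeline.fastq_r1") are placeholders; the real keys are defined by make_json_rnaseq.py and the WDL inputs:

```python
import json
from pathlib import Path

# Hypothetical example config; real keys come from make_json_rnaseq.py
example = {
    "rnaseq_pipeline.fastq_r1": ["gs://bucket/sample1_R1.fastq.gz"],
    "rnaseq_pipeline.fastq_r2": ["gs://bucket/sample1_R2.fastq.gz"],
}
out_dir = Path("input_json")
out_dir.mkdir(exist_ok=True)
(out_dir / "set1_rnaseq.json").write_text(json.dumps(example, indent=2))

# Sanity-check every generated config: valid JSON, non-empty, GCS paths only
for cfg in sorted(out_dir.glob("set*_rnaseq.json")):
    inputs = json.loads(cfg.read_text())  # raises on malformed JSON
    assert inputs, f"{cfg} is empty"
    for key, value in inputs.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, str) and v.endswith(".fastq.gz"):
                assert v.startswith("gs://"), f"{key}: {v} is not a GCS path"
    print(cfg.name, "OK:", len(inputs), "keys")
```

For a stricter check of the actual schema, use scripts/validate_jsons.py (see Local Development and Testing below).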

Organism Reference Files

The make_json_rnaseq.py script automatically selects the appropriate reference files based on the organism and version:

Rat (rn6):

Rat (rn7):

Human (gencode_v39):

For more details, see the scripts documentation.

Run the Pipeline

Connect to the VM and submit the job with the command below:

caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json

Check the status of workflows by running caper list on the VM instance that is running the job; workflows that completed successfully are reported as Succeeded.

Pipeline Outputs

The pipeline generates the following main output files:

Quantification Files

  1. RSEM Gene Expression Quantification
    • *_rsem_genes_count.txt - Raw gene-level counts
    • *_rsem_genes_tpm.txt - Transcripts Per Million (TPM) normalized expression
    • *_rsem_genes_fpkm.txt - Fragments Per Kilobase Million (FPKM) normalized expression
  2. featureCounts Gene Quantification
    • *_feature_counts.txt - Gene-level raw counts from featureCounts
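As an illustration of how the quantification files can be consumed downstream, the sketch below parses a small, made-up counts matrix with Python's csv module. The layout (a gene_id column followed by one column per sample) is an assumption — check the header of your own *_rsem_genes_count.txt files first:

```python
import csv
import io

# Made-up counts matrix standing in for *_rsem_genes_count.txt;
# the real column layout may differ.
counts_txt = """gene_id\tsample_A\tsample_B
ENSG00000000003\t120\t98
ENSG00000000005\t0\t2
"""

reader = csv.DictReader(io.StringIO(counts_txt), delimiter="\t")
rows = list(reader)
samples = [c for c in reader.fieldnames if c != "gene_id"]

# Library size (total assigned counts) per sample
totals = {s: sum(float(r[s]) for r in rows) for s in samples}
print(totals)  # {'sample_A': 120.0, 'sample_B': 100.0}
```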

Quality Control Files

  1. QC Metrics Report
    • *_qc_report.csv - Comprehensive QC metrics per sample including:
      • Read alignment statistics
      • rRNA, globin, and PhiX contamination rates
      • PCR duplication rates
      • Strand specificity
      • 5’ to 3’ coverage bias
      • Percentage of reads mapping to coding/intronic/intergenic regions
      • Chromosome mapping percentages
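The QC report is a per-sample CSV, so it is straightforward to screen programmatically. A sketch on made-up data — the column names (pct_duplication etc.) and the 30% threshold are assumptions; use the metric names from the header of your own *_qc_report.csv:

```python
import csv
import io

# Hypothetical QC report; real metric names may differ.
qc_csv = """sample,pct_duplication,pct_rRNA,pct_globin
S1,12.5,1.8,0.4
S2,41.0,2.2,0.3
"""

# Flag samples whose PCR duplication rate exceeds an example threshold
flagged = [
    row["sample"]
    for row in csv.DictReader(io.StringIO(qc_csv))
    if float(row["pct_duplication"]) > 30.0
]
print(flagged)  # ['S2']
```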

Additional Outputs (Per Sample)

The pipeline also generates intermediate outputs for each sample (stored in Cromwell execution directories):

Retrieving Outputs

Final merged outputs are written to the GCS bucket specified during pipeline submission. Individual sample outputs are organized in the Cromwell execution directory structure.

Monitoring and Job Management

Checking Pipeline Status

# List all workflows
caper list

# Check detailed status of a specific workflow
caper metadata [WORKFLOW_ID]
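caper metadata prints Cromwell's workflow metadata as JSON, which can be parsed to locate failing tasks. A sketch on a trimmed-down, made-up metadata document — the task name rnaseq_pipeline.star_align and the bucket path are placeholders, while status, calls, executionStatus, and stderr are standard Cromwell metadata fields:

```python
import json

# Trimmed-down stand-in for `caper metadata [WORKFLOW_ID]` output;
# real metadata has many more fields.
metadata = json.loads("""{
  "id": "12345678-aaaa-bbbb-cccc-1234567890ab",
  "status": "Failed",
  "calls": {
    "rnaseq_pipeline.star_align": [
      {"executionStatus": "Failed",
       "stderr": "gs://bucket/star_align/shard-0/stderr"}
    ]
  }
}""")

print(metadata["status"])
# Collect stderr paths of failed task attempts to inspect first
failed = [
    attempt["stderr"]
    for attempts in metadata.get("calls", {}).values()
    for attempt in attempts
    if attempt.get("executionStatus") == "Failed" and "stderr" in attempt
]
print(failed)
```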

Monitoring Running Jobs

# View workflows currently running
caper list | grep Running

# Check logs for a specific workflow
caper debug [WORKFLOW_ID]

Managing Workflows

# Abort a running workflow
caper abort [WORKFLOW_ID]

# Check troubleshooting information
caper troubleshoot [WORKFLOW_ID]

Retrieving Results

Successful pipeline runs will write outputs to your specified GCS bucket. Intermediate files and execution logs are stored in:

To copy results from GCS to your local machine:

gsutil -m cp -r gs://your-bucket/results/workflow_id/* ./local_results/

WDL Workflow Structure

The pipeline is organized as a modular WDL workflow with the following structure:

Main Workflow

Task Modules

The pipeline consists of the following task modules (in wdl/ directory):

Pre-alignment QC and Processing:

Alignment and Quantification:

Contamination Screening:

Post-alignment QC:

Results Aggregation:

Reference Building Workflows

The repository also includes workflows for building reference files:

Local Development and Testing

Setup for Development

Use the provided setup scripts to configure your development environment:

# Set up VM for pipeline execution
bash scripts/setup/setup_vm.sh

# Set up local development environment
bash scripts/setup/setup_develop.sh

Validating JSON Files

Before submitting workflows, validate your JSON configuration files:

python3 scripts/validate_jsons.py input_json/set1_rnaseq.json

Testing with Prototype Examples

The prototype/ directory contains example configuration files and submission scripts:

# Example submission script for generic use
bash prototype/submit_rnaseq_generic.sh

# Example JSON configurations in prototype/input_json/

The examples/ directory contains additional JSON examples for individual tasks and different organism configurations.

Building Docker Images Locally

# Build all docker images
bash scripts/build_dockerfiles.sh

# Push to your container registry (configure registry URL first)
bash scripts/push_dockerfiles.sh

Troubleshooting

Common Issues

1. Pipeline Fails During Submission

2. Tasks Fail with “Out of Memory” Errors

3. Tasks Fail with “Out of Disk Space” Errors

4. Cannot Find Output Files

5. Docker Image Pull Failures

Accessing Logs

Workflow-level logs:

# View workflow metadata
caper metadata [WORKFLOW_ID]

# Check troubleshooting info
caper troubleshoot [WORKFLOW_ID]

Task-level logs: Navigate to the Cromwell execution directory:

cd cromwell-executions/rnaseq_pipeline/[WORKFLOW_ID]/
# Find specific task directories and check stderr/stdout logs
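The task-level stderr files can also be swept programmatically. The sketch below builds a tiny mock of the Cromwell execution layout (the call-star_align directory name is a placeholder) and then scans every stderr file for error text; on a real VM you would skip the mock and scan the actual cromwell-executions/ tree:

```python
from pathlib import Path

# Mock of the Cromwell execution layout, for illustration only
root = Path("cromwell-executions/rnaseq_pipeline/wf-id/call-star_align/execution")
root.mkdir(parents=True, exist_ok=True)
(root / "stderr").write_text("EXITING because of FATAL ERROR: out of memory\n")

# Find every task stderr and surface the ones containing error text
hits = []
for stderr in Path("cromwell-executions").rglob("stderr"):
    text = stderr.read_text()
    if "ERROR" in text:
        hits.append(stderr)
        print(stderr, "->", text.strip().splitlines()[-1])
```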

GCP Console:

Getting Help

If issues persist:

  1. Check the Cromwell documentation: https://cromwell.readthedocs.io/
  2. Review the Caper documentation: https://github.com/MoTrPAC/caper/
  3. Open an issue on the GitHub repository with:
    • Workflow ID
    • Error messages from logs
    • JSON configuration (with sensitive data removed)

Citations and References

Pipeline Documentation

Workflow Management

Analysis Tools

Infrastructure

Reference Genomes

Contributing and Support

Reporting Issues

If you encounter bugs or have feature requests, please open an issue on the GitHub repository.

When reporting issues, please include:

Contact

For questions or support related to the MoTrPAC RNA-seq pipeline:

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request with a clear description of the changes

Version Information

Current Version

This pipeline is actively maintained and updated. Check the releases page for version history and changelogs.

Citing This Pipeline

If you use this pipeline in your research, please cite:

DOI

Compatibility Notes

Change History

Major updates and significant changes are documented in the repository’s commit history; check it for detailed changelogs.