This repo contains the RNA-seq data processing pipeline, implemented in Workflow Description Language (WDL) and based on the harmonized RNA-seq MOP. The pipeline uses caper, a Python wrapper package for the workflow management system Cromwell. All data were processed on the Google Cloud Platform (GCP).
The pipeline supports the following organisms and genome versions:
Rat: rn6, rn7
Human: gencode_v39
The pipeline uses:
The pipeline generates:
For experienced users, here’s the essential workflow:
# 1. Clone the repository
git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline
# 2. Install Python dependencies
pip3 install -r scripts/requirements.txt
# 3. Generate input JSON configuration
python3 scripts/make_json_rnaseq.py \
-g gs://your-bucket/fastq_raw \
-o ./input_json \
-r batch1_qc_metrics \
-a human \
-v gencode_v39 \
-n 1 \
-p your-gcp-project \
-d us-docker.pkg.dev/motrpac-portal/rnaseq
# 4. Submit the pipeline
caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json
# 5. Monitor pipeline status
caper list
Install required Python packages:
pip3 install -r scripts/requirements.txt
The main dependencies include:
gcsfs - for accessing Google Cloud Storage
numpy - for data processing
Ensure the following APIs are enabled in your GCP project:
The WDL/Cromwell framework is optimized to run pipelines in high-performance computing environments. The MoTrPAC Bioinformatics Center runs pipelines on Google Cloud Platform (GCP). We used a number of fantastic tools developed by our colleagues from the ENCODE project to run pipelines on GCP (and other HPC platforms).
A brief summary of the steps to set up a VM to run the MoTrPAC pipelines on GCP (for details, please check the caper repo):
# Usage (the key file is a service account JSON key, e.g. service-account-191919.json)
bash create_instance.sh [INSTANCE_NAME] [PROJECT_ID] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]
# Example for the pipeline:
./create_instance.sh pipeline-instance your-gcp-project-name service-account-191919.json gs://pipelines/results/
git clone https://github.com/MoTrPAC/motrpac-rna-seq-pipeline
Several tools are required to run the RNA-seq pipeline. All of them are pre-installed in Docker containers, which are publicly available in the Artifact Registry.
The pipeline uses the following containerized tools (all available at us-docker.pkg.dev/motrpac-portal/rnaseq):
fastqc:latest - FastQC for quality control
umi_attach:latest - UMI attachment utility
cutadapt:latest - Adapter trimming
multiqc:latest - Aggregate QC reporting
star:latest - STAR aligner
feature_counts:latest - featureCounts from Subread
rsem:latest - RSEM quantification
bowtie:latest - Bowtie2 aligner (for contamination screening)
picard:latest - Picard tools (MarkDuplicates, CollectRnaSeqMetrics)
umi_dup:latest - UMI-based duplication assessment
samtools:latest - SAMtools utilities
collect_qc:latest - Custom QC metrics collection
merge_results:latest - Result merging across samples
To find out more about the specific versions of the tools used to run the pipeline, check the dockerfiles/*.Dockerfile files.
To build and push updated containers:
# Build all dockerfiles
bash scripts/build_dockerfiles.sh
# Push to Artifact Registry (requires appropriate permissions)
bash scripts/push_dockerfiles.sh
An input configuration file (in JSON format) is required to process the data through the pipeline. This configuration file contains several key-value pairs that specify the inputs and outputs of the workflow, the location of the input files, default pipeline parameters, docker containers, the execution environment, and other parameters needed for execution.
The optimal way to generate the configuration files is to run the make_json_rnaseq.py script.
Usage:
python3 scripts/make_json_rnaseq.py -g GCP_PATH -o OUTPUT_PATH -r OUTPUT_REPORT_NAME -a {rat,human} -v {rn6,rn7,gencode_v39} -n NUM_CHUNKS -p PROJECT [-d DOCKER_REPO] [-i] [-u]
Options:
-g GCP_PATH                 GCS path to the directory containing FASTQ files
-o OUTPUT_PATH              Local path where JSON files will be written
-r OUTPUT_REPORT_NAME       Name for the output QC metrics report
-a {rat,human}              Organism
-v {rn6,rn7,gencode_v39}    Genome build version
-n NUM_CHUNKS               Number of batches to split samples into
-p PROJECT                  GCP project name
-d DOCKER_REPO              Docker repository prefix (optional)
-i                          Include index files (for UMI processing)
-u                          Include undetermined reads (optional)
Complete Example:
python3 scripts/make_json_rnaseq.py \
-g gs://motrpac-bucket/rna-seq/human/batch7_20220316/fastq_raw \
-o ./input_json \
-r batch7_qc_metrics.csv \
-a human \
-v gencode_v39 \
-n 1 \
-p motrpac-portal \
-d us-docker.pkg.dev/motrpac-portal/rnaseq \
-i
This will create JSON configuration file(s) (e.g., set1_rnaseq.json, set2_rnaseq.json, etc.) in the specified output directory.
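How the script assigns samples to each file is not described here. As a rough illustration only, the sketch below shows one way a sample list could be divided into NUM_CHUNKS batches and matched to the setN_rnaseq.json naming shown above. The functions split_into_chunks and chunk_filenames are hypothetical and are not taken from make_json_rnaseq.py.

```python
def split_into_chunks(samples, num_chunks):
    """Split samples into num_chunks contiguous batches of nearly equal size.

    Hypothetical helper for illustration; the real script may batch differently.
    If num_chunks exceeds the number of samples, trailing batches are empty.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(samples), num_chunks)
    chunks = []
    start = 0
    for index in range(num_chunks):
        # The first `extra` batches each take one additional sample.
        size = base + (1 if index < extra else 0)
        chunks.append(samples[start:start + size])
        start += size
    return chunks


def chunk_filenames(num_chunks):
    """Return file names following the set1_rnaseq.json, set2_rnaseq.json pattern."""
    return [f"set{number}_rnaseq.json" for number in range(1, num_chunks + 1)]
```

With -n 1, as in the example above, every sample would end up in a single set1_rnaseq.json under this scheme.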
The make_json_rnaseq.py script automatically selects the appropriate reference files based on the organism and version:
Rat (rn6):
gs://omicspipelines-public-resources/rnaseq/references/rat/Rnor6_v96_star_index.tar.gz
gs://omicspipelines-public-resources/rnaseq/references/rat/Rattus_norvegicus.Rnor_6.0.96.gtf
gs://omicspipelines-public-resources/rnaseq/references/rat/rn6_rsem_reference.tar.gz
Rat (rn7):
gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/rn7_v108_star_index.tar.gz
gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/Rattus_norvegicus.mRatBN7.2.108.gtf
gs://omicspipelines-public-resources/rnaseq/references/rat/rn7/rn7_rsem_reference.tar.gz
Human (gencode_v39):
gs://omicspipelines-public-resources/rnaseq/references/human/hg38_v39_star_index.tar.gz
gs://omicspipelines-public-resources/rnaseq/references/human/GRCh38.v39.primary_assembly.annotation.gtf
gs://omicspipelines-public-resources/rnaseq/references/human/hg38_rsem_reference.tar.gz
For more details, see the scripts documentation.
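The selection described above amounts to a lookup keyed on organism and genome version. The sketch below is one way to express it, using the reference paths listed in this section. The dictionary keys (star_index, gtf, rsem_reference) and the function name select_references are assumptions made for illustration, not names taken from make_json_rnaseq.py.

```python
BASE = "gs://omicspipelines-public-resources/rnaseq/references"

# Reference files per (organism, version); paths are those listed in this README.
# The key names below are hypothetical labels, not the script's own.
REFERENCES = {
    ("rat", "rn6"): {
        "star_index": f"{BASE}/rat/Rnor6_v96_star_index.tar.gz",
        "gtf": f"{BASE}/rat/Rattus_norvegicus.Rnor_6.0.96.gtf",
        "rsem_reference": f"{BASE}/rat/rn6_rsem_reference.tar.gz",
    },
    ("rat", "rn7"): {
        "star_index": f"{BASE}/rat/rn7/rn7_v108_star_index.tar.gz",
        "gtf": f"{BASE}/rat/rn7/Rattus_norvegicus.mRatBN7.2.108.gtf",
        "rsem_reference": f"{BASE}/rat/rn7/rn7_rsem_reference.tar.gz",
    },
    ("human", "gencode_v39"): {
        "star_index": f"{BASE}/human/hg38_v39_star_index.tar.gz",
        "gtf": f"{BASE}/human/GRCh38.v39.primary_assembly.annotation.gtf",
        "rsem_reference": f"{BASE}/human/hg38_rsem_reference.tar.gz",
    },
}


def select_references(organism, version):
    """Return the reference paths for a supported organism/version pair."""
    try:
        return REFERENCES[(organism, version)]
    except KeyError:
        supported = ", ".join(f"{o}/{v}" for o, v in sorted(REFERENCES))
        raise ValueError(
            f"Unsupported combination {organism}/{version}; supported: {supported}"
        ) from None
```

Keying on the pair rather than on the version alone makes mismatched inputs, such as human with rn6, fail early with a readable message.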
Connect to the VM and submit the job with the following command:
caper submit wdl/rnaseq_pipeline_scatter.wdl -i input_json/set1_rnaseq.json
Check the status of the workflows by running caper list on the VM instance that is running the job, and confirm that each one shows Succeeded.
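When many workflows are submitted, scanning the listing by eye gets tedious. The sketch below tallies workflow statuses from text captured from caper list. It assumes that each workflow row contains its Cromwell status as a separate whitespace-delimited token; the exact column layout of caper list is not assumed, and the function names are hypothetical.

```python
from collections import Counter

# Single-word workflow statuses reported by Cromwell.
CROMWELL_STATUSES = {
    "Submitted", "Running", "Aborting", "Aborted", "Failed", "Succeeded",
}


def count_statuses(listing):
    """Count workflow statuses in the text output of a workflow listing.

    Assumption: each row holds at most one status token; only the first
    matching token per line is counted.
    """
    counts = Counter()
    for line in listing.splitlines():
        for token in line.split():
            if token in CROMWELL_STATUSES:
                counts[token] += 1
                break
    return counts


def all_succeeded(listing):
    """Return True only if at least one workflow is listed and all succeeded."""
    counts = count_statuses(listing)
    return bool(counts) and set(counts) == {"Succeeded"}
```

The listing text can be captured, for example, with subprocess.run(["caper", "list"], capture_output=True, text=True).stdout and passed to all_succeeded.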
The pipeline generates the following main output files:
*_rsem_genes_count.txt - Raw gene-level counts
*_rsem_genes_tpm.txt - Transcripts Per Million (TPM) normalized expression
*_rsem_genes_fpkm.txt - Fragments Per Kilobase Million (FPKM) normalized expression
*_feature_counts.txt - Gene-level raw counts from featureCounts
*_qc_report.csv - Comprehensive QC metrics per sample including:
The pipeline also generates intermediate outputs for each sample (stored in Cromwell execution directories):
Final merged outputs are written to the GCS bucket specified during pipeline submission. Individual sample outputs are organized in the Cromwell execution directory structure.
# List all workflows
caper list
# Check detailed status of a specific workflow
caper metadata [WORKFLOW_ID]
# View workflows currently running
caper list | grep Running
# Check logs for a specific workflow
caper debug [WORKFLOW_ID]
# Abort a running workflow
caper abort [WORKFLOW_ID]
# Check troubleshooting information
caper troubleshoot [WORKFLOW_ID]
Successful pipeline runs will write outputs to your specified GCS bucket. Intermediate files and execution logs are stored in:
cromwell-executions/ - Contains all task execution outputs and logs
cromwell-workflow-logs/ - Contains workflow-level logs
To copy results from GCS to your local machine:
gsutil -m cp -r gs://your-bucket/results/workflow_id/* ./local_results/
The pipeline is organized as a modular WDL workflow with the following structure:
wdl/rnaseq_pipeline_scatter.wdl - Main workflow that orchestrates all tasks using a scatter-gather pattern to process multiple samples in parallel
The pipeline consists of the following task modules (in wdl/ directory):
Pre-alignment QC and Processing:
fastqc/ - Quality control with FastQC (pre- and post-trimming)
attach_umi/ - Attach UMI indices to read names
cutadapt/ - Adapter trimming
multiqc/ - Aggregate QC reporting
Alignment and Quantification:
star_align/ - Alignment with STAR
rsem_exp/ - RSEM quantification
feature_counts/ - featureCounts quantification
Contamination Screening:
bowtie2_align/ - Bowtie2 alignment to globin, rRNA, and PhiX references
Post-alignment QC:
mark_duplicates/ - PCR duplicate marking with Picard
collect_rnaseq_metrics/ - RNA-seq QC metrics with Picard
umi_dup/ - UMI-based duplication assessment
compute_mapped/ - Chromosome mapping statistics
collect_qc_metrics/ - Consolidated QC metrics collection
Results Aggregation:
merge_results/ - Merge quantification and QC data across all samples
The repository also includes workflows for building reference files:
wdl/star_ref/ - Build STAR genome indices
wdl/rsem_index/ - Build RSEM reference indices
wdl/bowtie2_index/ - Build Bowtie2 indices
Use the provided setup scripts to configure your development environment:
# Set up VM for pipeline execution
bash scripts/setup/setup_vm.sh
# Set up local development environment
bash scripts/setup/setup_develop.sh
Before submitting workflows, validate your JSON configuration files:
python3 scripts/validate_jsons.py input_json/set1_rnaseq.json
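For a quick look at a configuration before submitting, the sketch below loads a JSON file, lists every gs:// path it references, and pulls out the resource settings whose keys end in _ramGB or _disk (the suffixes named in the troubleshooting notes). It is a hypothetical helper that assumes the resource keys sit at the top level of the JSON object; it does not reproduce the checks performed by scripts/validate_jsons.py.

```python
import json

# Key suffixes for memory and disk settings, as named in this README.
RESOURCE_SUFFIXES = ("_ramGB", "_disk")


def load_config(path):
    """Load a configuration file and require a JSON object at the top level."""
    with open(path) as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return config


def gcs_paths(value):
    """Recursively yield every string value that starts with gs://."""
    if isinstance(value, str):
        if value.startswith("gs://"):
            yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from gcs_paths(item)
    elif isinstance(value, list):
        for item in value:
            yield from gcs_paths(item)


def resource_parameters(config):
    """Return the top-level entries whose keys end in _ramGB or _disk."""
    return {key: val for key, val in config.items() if key.endswith(RESOURCE_SUFFIXES)}
```

For example, sorted(gcs_paths(load_config("input_json/set1_rnaseq.json"))) gives a list of inputs that can be checked against the bucket contents.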
The prototype/ directory contains example configuration files and submission scripts:
# Example submission script for generic use
bash prototype/submit_rnaseq_generic.sh
# Example JSON configurations in prototype/input_json/
The examples/ directory contains additional JSON examples for individual tasks and different organism configurations.
# Build all docker images
bash scripts/build_dockerfiles.sh
# Push to your container registry (configure registry URL first)
bash scripts/push_dockerfiles.sh
1. Pipeline Fails During Submission
Validate the JSON configuration with scripts/validate_jsons.py
2. Tasks Fail with “Out of Memory” Errors
Check the *_ramGB parameters in your JSON configuration
The configuration is generated by scripts/make_json_rnaseq.py
3. Tasks Fail with “Out of Disk Space” Errors
Check the *_disk parameters in your JSON configuration
4. Cannot Find Output Files
Check workflow status with caper list
Look in cromwell-executions/
5. Docker Image Pull Failures
Workflow-level logs:
# View workflow metadata
caper metadata [WORKFLOW_ID]
# Check troubleshooting info
caper troubleshoot [WORKFLOW_ID]
Task-level logs: Navigate to the Cromwell execution directory:
cd cromwell-executions/rnaseq_pipeline/[WORKFLOW_ID]/
# Find specific task directories and check stderr/stdout logs
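Walking the execution directory by hand is slow when a workflow has many tasks and shards. The sketch below collects every non-empty stderr file under a directory and returns the last lines of each. It assumes the execution directory is available on a local filesystem and that task logs are files named stderr; the function names are hypothetical.

```python
from pathlib import Path


def find_nonempty_stderr(root):
    """Return sorted paths of all non-empty files named 'stderr' under root."""
    return sorted(
        path
        for path in Path(root).rglob("stderr")
        if path.is_file() and path.stat().st_size > 0
    )


def tail(path, count=20):
    """Return the last `count` lines of a text file as a list of strings."""
    lines = Path(path).read_text(errors="replace").splitlines()
    return lines[-count:]


def summarize_failures(root, count=20):
    """Map each non-empty stderr file to its last `count` lines."""
    return {str(path): tail(path, count) for path in find_nonempty_stderr(root)}
```

Calling summarize_failures("cromwell-executions/rnaseq_pipeline/[WORKFLOW_ID]") with the real workflow ID filled in gives a first pass over which tasks wrote errors.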
GCP Console:
If issues persist:
If you encounter bugs or have feature requests, please open an issue on the GitHub repository.
When reporting issues, please include:
For questions or support related to the MoTrPAC RNA-seq pipeline:
Contributions are welcome! Please:
This pipeline is actively maintained and updated. Check the releases page for version history and changelogs.
If you use this pipeline in your research, please cite:
Major updates and changes are documented in the repository’s commit history. For significant changes:
Check the commit history for detailed changes.