This repository contains a robust, cluster-ready Snakemake pipeline for analyzing alternative splicing from high-throughput bulk RNA-seq data. The pipeline is designed to be modular, scalable, and easily adaptable to a variety of experimental designs and computational environments.
git clone https://github.com/stasrira/snakemake_alternative_splicing.git
cd snakemake_alternative_splicing
Place your input data (FASTQ files) into a fastq_raw sub-folder (customizable through the data_path_required_folders variable) of the main data location. That location is referenced by the data_path variable.
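As a sketch of the expected layout (the paths and sample name below are placeholders, not pipeline defaults; fastq_raw and the .fastq.gz extension are the documented defaults):

```shell
# Choose the main data location; the pipeline reads it from data_path.
export data_path=/tmp/as_pipeline_demo_data

# fastq_raw is the default required sub-folder (see data_path_required_folders).
mkdir -p "$data_path/fastq_raw"

# Raw FASTQ files go into fastq_raw; raw_data_file_ext defaults to ".fastq.gz".
# touch stands in for copying real data in this sketch.
touch "$data_path/fastq_raw/sample_01.fastq.gz"
```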
Create an .env file in the config directory to specify your environment variables. You can use the provided example file as a template:
cp config/.env_example config/.env
Edit config/.env to set the variables for your environment as needed.
See the example file here: config/.env_example
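A minimal config/.env might look like the following. All paths are placeholders for illustration; the variable names (conda_base, data_path, conda_envs_path) come from the configuration lists below:

```shell
# config/.env (example values; adjust for your system)
conda_base=/shared/envs/conda_base        # pre-created base environment
data_path=/data/projects/my_rnaseq_run    # location containing fastq_raw
conda_envs_path=/shared/envs/as_pipeline  # where per-user envs are created
```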
Important:
The base conda environment (conda_base) must be pre-created using the provided environment YAML file:
conda env create -p /desired/path/to/conda_base -f conda_envs/snakemake_base.yml
/desired/path/to/conda_base should be accessible to all users who intend to run the pipeline. The location for additional environments is set by the conda_envs_path variable. For multi-user setups:
- The conda_base environment must be accessible (read/execute) by all users (e.g., placed in a shared location).
- The conda_envs_path location must also be writable; each user will get their own set of environments there.

All references (except those noted below) used in this implementation are identical to the ones used in https://github.com/yongchao/motrpac_rnaseq
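One way to grant that access is sketched below; the paths are temporary stand-ins for your real shared locations, and depending on your site you may prefer group-based permissions instead of world-readable ones:

```shell
# Demo on temporary paths; substitute your real shared locations.
base=/tmp/shared_conda_base_demo
envs=/tmp/shared_conda_envs_demo
mkdir -p "$base" "$envs"

# conda_base: read/execute for everyone (read-only use of the environment).
chmod -R a+rX "$base"

# conda_envs_path: also writable, so each user can create their own envs there.
chmod a+rwx "$envs"
```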
Alternative Splicing related references:
rMATS
SUPPA
All arguments to the pipeline are passed as environment variables. These variables are categorized into two sets:
Variables for the alter_splicing.smk file. These control the execution of the main Snakemake workflow file.
- conda_base: (required) path to the base conda environment containing the required version of the Snakemake installation for this pipeline
- data_path: (required) path to the folder containing data that will be processed
- conda_envs_path: directory for additional conda environments (default: sibling to conda_base); each user will get their own environments here
- snakemake_file: path to the main Snakemake file (default: alter_splicing.smk)
- cluster_config: path to the cluster config file (default: config/cluster_cfg.yaml)
- dry_run: True/False; run a dry run (default: False)
- cluster_run: True/False; use cluster execution (default: True)
- jobs: number of jobs to run in parallel (default: 400)
- cores_local: number of local cores (default: 40)
- latency_wait: wait time in seconds for cluster jobs (default: 15)
- rerun_incomplete: True/False; rerun incomplete jobs (default: True)
- keep_going: True/False; continue on error (default: True)
- verbose: True/False; verbose output (default: True)
- ml_purge: True/False; purge modules before starting (default: True)
- help: True/False; show extended help (default: False)
- debug: True/False; enable bash and Snakemake debug mode (default: False)
- additional_args: any additional arguments for Snakemake
Variables for the run_pipeline.sh script. To view them at runtime, run:

./run_pipeline.sh -h
The following variables can be configured for the pipeline, and their default values are listed below.
- help: if True, displays help instructions and prints all config variables with default values (default: False)
- data_path: path to the data directory being processed
- data_path_required_folders: list of required folders in data_path (default: ['fastq_raw'])
- raw_data_file_ext: file extension for raw data (default: ".fastq.gz")
- samplesToIgnore: comma-delimited list of samples to be ignored; can also be provided as a text file listing one sample per row (default: no samples are ignored)
- ref_data_path: path to the main location of reference data
- motrpac_ref_data: name of the motrpac reference data directory under ref_data_path (default: motrpac_refdata)
- motrpac_ref_data_genom_dir: name of the genome directory under motrpac_ref_data (default: hg38_gencode_v30)
- alt_sl_ref_data: name of the alternative splicing reference data directory under ref_data_path (default: AS)
- alt_sl_ref_data_genom_dir: name of the genome directory under alt_sl_ref_data (default: if no value is provided, the value of motrpac_ref_data_genom_dir is used)
- run_alternative_splicing: if True, runs alternative splicing steps (default: True)
- run_rsem: if True, runs RSEM steps (default: True)
- run_rmats_turbo: if True, uses rmats_turbo instead of the standard rMATS tool (default: False)
- run_rmats_novel: if True, includes --novelSS in the rMATS command (default: False)
- run_leafcutter: if True, enables LeafCutter-related steps (default: False)
- run_spladder: if True, runs SplAdder-related steps (default: True)
- run_djexpress: if True, runs DJExpress-related steps (default: True)
- run_spladder_long_execution_events: if True, executes SplAdder long-execution events (default: False)
- run_ngscheckmate: if True, includes NGSCheckMate rules (default: True)
- run_feature_counts: if True, runs featureCounts and includes the results in MultiQC (default: False)
- run_telescope: if True, runs Telescope-related rules (default: True)
- create_rds_for_rscripts: if True, an RDS file will be created for testing rules that use R scripts (default: False)
- leafcutter_library_strandedness: used for the intronMotifBAM2junc and leafcutter_cluster_regtools rules; possible values: XS, FR (default: XS)
- leafcutter_contrast_db_mapping_code: mapping code used to retrieve the contrast value from the DB (default: "primary_contrast")
- leafcutter_contrast_group_count_min: minimum number of items in a LeafCutter contrast group (default: 4)
- sample_validate_min_reads_count: integer; minimum number of reads a BAM file must contain to be considered valid (default: 50000)
- sample_validate_outlier_threshold: negative float; z-score threshold on the read count of a BAM file for it to be considered valid (default: -2.0)
- spladder_chunk_size: chunk size parameter for SplAdder runs (default: 2)
- spladder_events_ase_edge_limit: used by the spladder_call_events and spladder_call_events_long_execution rules to set the ASE edge limit (default: 500)
- spladder_events_ase_edge_limit_decrease_step: value by which ase_edge_limit is decreased on rerun attempts of the spladder_call_events rule (default: 100)
- rmats_contrast_file: expected location of the rMATS contrast metadata file (default: metadata/rmats_contrast.tsv)
- rmats_default_contrast: default contrast value for rMATS (default: b1)
- rmats_min_b1_qty: minimum quantity for the b1 group in rMATS (default: 2)
- rmats_min_b2_qty: minimum quantity for the b2 group in rMATS (default: 2)
- debug: if True, temp objects will not be deleted (default: False)
- print_prerun_info: if True (default), the collected prerun info is printed to stdout; set to False to suppress printing, which is required to print the DAG diagram
- pipeline_info_file: filename for pipeline info output (default: "pipeline_info.txt")
- pipeline_warning_file: filename for pipeline warning output (default: "pipeline_warning.txt")

To view the complete list dynamically at runtime, set the help variable to True and run the pipeline as shown below:
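For example, samplesToIgnore accepts either form described above (the sample names and file path here are placeholders):

```shell
# Option 1: comma-delimited list of sample names
export samplesToIgnore="sample_01,sample_02"

# Option 2: a text file with one sample name per row
printf 'sample_01\nsample_02\n' > /tmp/samples_to_ignore.txt
export samplesToIgnore=/tmp/samples_to_ignore.txt
```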
export help=True
./run_pipeline.sh
This will display the help information and terminate the pipeline execution. Set help=False or unset the variable to enable normal operation afterward.
./run_pipeline.sh
- run_pipeline.sh: main script to launch the pipeline.
- alter_splicing.smk: main Snakemake workflow file.
- config/config.yaml: contains default values and descriptions for pipeline configuration variables.
- conda_envs/snakemake_base.yml: base environment YAML for Snakemake; must be used to pre-create conda_base.
- README.md: this file.

# Pre-create the shared base environment
conda env create -p /shared/path/to/conda_base -f conda_envs/snakemake_base.yml
# Set up your variables
export conda_base="/shared/path/to/conda_base"
export data_path="/path/to/your/data"
# set dry_run to True to view the list of steps to be executed without the actual run
export dry_run=True
# Run the pipeline
./run_pipeline.sh
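To run locally instead of on a cluster, the same pattern applies with the cluster_run and cores_local variables from the list above (the core count here is illustrative):

```shell
# Disable cluster submission and limit local parallelism,
# then launch with ./run_pipeline.sh as usual.
export cluster_run=False
export cores_local=8
```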
Notes:
- The shared base environment (conda_base) must be pre-created from conda_envs/snakemake_base.yml and be readable/executable by all users; each user's additional environments are created under conda_envs_path.
- To skip the NGSCheckMate steps, set: export run_ngscheckmate=False
- Make sure the cluster configuration file (config/cluster_cfg.yaml) matches your HPC environment.