This repository contains a robust, cluster-ready Snakemake pipeline for analyzing alternative splicing from high-throughput bulk RNA-seq data. The pipeline is designed to be modular, scalable, and easily adaptable to a variety of experimental designs and computational environments.
git clone https://github.com/stasrira/snakemake_alternative_splicing.git
cd snakemake_alternative_splicing
Place your input data (FASTQ files) into a fastq_raw sub-folder (customizable through the data_path_required_folders variable) of the main data location. That location is referenced by the data_path variable.
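As a sketch of the expected layout (the paths and sample name below are placeholders, not pipeline defaults; fastq_raw and the .fastq.gz extension are the documented defaults):

```shell
# Choose the main data location; the pipeline reads it from data_path.
export data_path=/tmp/as_pipeline_demo_data

# fastq_raw is the default required sub-folder (see data_path_required_folders).
mkdir -p "$data_path/fastq_raw"

# Raw FASTQ files go into fastq_raw; raw_data_file_ext defaults to ".fastq.gz".
# touch stands in for copying real data in this sketch.
touch "$data_path/fastq_raw/sample_01.fastq.gz"
```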
Create an .env file in the config directory to specify your environment variables. You can use the provided example file as a template:
cp config/.env_example config/.env
Edit config/.env to set the variables for your environment as needed.
See the example file here: config/.env_example
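A minimal config/.env might look like the following. All paths are placeholders for illustration; the variable names (conda_base, data_path, conda_envs_path) come from the configuration lists below:

```shell
# config/.env (example values; adjust for your system)
conda_base=/shared/envs/conda_base        # pre-created base environment
data_path=/data/projects/my_rnaseq_run    # location containing fastq_raw
conda_envs_path=/shared/envs/as_pipeline  # where per-user envs are created
```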
Important:
The base conda environment (conda_base) must be pre-created using the provided environment YAML file:
conda env create -p /desired/path/to/conda_base -f conda_envs/snakemake_base.yml
/desired/path/to/conda_base should be accessible to all users who intend to run the pipeline. The location for additional environments is set by the conda_envs_path variable. For multi-user setups:
- The conda_base environment must be accessible (read/execute) by all users (e.g., placed in a shared location).
- The conda_envs_path location must also be writable; each user will get their own set of environments there.

All references (except those noted below) used in this implementation are identical to the ones used in https://github.com/yongchao/motrpac_rnaseq
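One way to grant that access is sketched below; the paths are temporary stand-ins for your real shared locations, and depending on your site you may prefer group-based permissions instead of world-readable ones:

```shell
# Demo on temporary paths; substitute your real shared locations.
base=/tmp/shared_conda_base_demo
envs=/tmp/shared_conda_envs_demo
mkdir -p "$base" "$envs"

# conda_base: read/execute for everyone (read-only use of the environment).
chmod -R a+rX "$base"

# conda_envs_path: also writable, so each user can create their own envs there.
chmod a+rwx "$envs"
```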
Alternative Splicing related references:
rMATS
SUPPA
All arguments to the pipeline are passed as environment variables. These variables are categorized into two sets:
Variables for the alter_splicing.smk file. These control the execution of the main Snakemake workflow file.
- conda_base: (required) path to the base conda environment containing the required version of the Snakemake installation for this pipeline
- data_path: (required) path to the folder containing data that will be processed
- conda_envs_path: directory for additional conda environments (default: sibling to conda_base); each user will get their own environments here
- snakemake_file: path to the main Snakemake file (default: alter_splicing.smk)
- cluster_config: path to the cluster config file (default: config/cluster_cfg.yaml)
- dry_run: True/False; run a dry run (default: False)
- cluster_run: True/False; use cluster execution (default: True)
- jobs: number of jobs to run in parallel (default: 400)
- cores_local: number of local cores (default: 40)
- latency_wait: wait time in seconds for cluster jobs (default: 15)
- rerun_incomplete: True/False; rerun incomplete jobs (default: True)
- keep_going: True/False; continue on error (default: True)
- verbose: True/False; verbose output (default: True)
- ml_purge: True/False; purge modules before starting (default: True)
- help: True/False; show extended help (default: False)
- debug: True/False; enable bash and Snakemake debug mode (default: False)
- additional_args: any additional arguments for Snakemake
Variables for the run_pipeline.sh script. To view them at runtime, run:

./run_pipeline.sh -h
The following variables can be configured for the pipeline, and their default values are listed below.
- help: if True, displays help instructions and prints all config variables with default values (default: False)
- data_path: path to the data directory being processed
- data_path_required_folders: list of required folders in data_path (default: ['fastq_raw'])
- raw_data_file_ext: file extension for raw data (default: ".fastq.gz")
- samplesToIgnore: comma-delimited list of samples to be ignored; can also be provided as a text file listing one sample per row (default: no samples are ignored)
- ref_data_path: path to the main location of reference data
- motrpac_ref_data: name of the motrpac reference data directory under ref_data_path (default: motrpac_refdata)
- motrpac_ref_data_genom_dir: name of the genome directory under motrpac_ref_data (default: hg38_gencode_v30)
- alt_sl_ref_data: name of the alternative splicing reference data directory under ref_data_path (default: AS)
- alt_sl_ref_data_genom_dir: name of the genome directory under alt_sl_ref_data (default: if no value is provided, the value of motrpac_ref_data_genom_dir is used)
- run_alternative_splicing: if True, runs alternative splicing steps (default: True)
- run_rsem: if True, runs RSEM steps (default: True)
- run_rmats_turbo: if True, uses rmats_turbo instead of the standard rMATS tool (default: False)
- run_rmats_novel: if True, includes --novelSS in the rMATS command (default: False)
- run_leafcutter: if True, enables LeafCutter-related steps (default: False)
- run_spladder: if True, runs SplAdder-related steps (default: True)
- run_djexpress: if True, runs DJExpress-related steps (default: True)
- run_spladder_long_execution_events: if True, executes SplAdder long-execution events (default: False)
- run_ngscheckmate: if True, includes NGSCheckMate rules (default: True)
- run_feature_counts: if True, runs featureCounts and includes the results in MultiQC (default: False)
- run_telescope: if True, runs Telescope-related rules (default: True)
- create_rds_for_rscripts: if True, an RDS file will be created for testing rules that use R scripts (default: False)
- leafcutter_library_strandedness: used for the intronMotifBAM2junc and leafcutter_cluster_regtools rules; possible values: XS, FR (default: XS)
- leafcutter_contrast_db_mapping_code: mapping code used to retrieve the contrast value from the DB (default: "primary_contrast")
- leafcutter_contrast_group_count_min: minimum number of items in a LeafCutter contrast group (default: 4)
- sample_validate_min_reads_count: integer; minimum number of reads a BAM file must contain to be considered valid (default: 50000)
- sample_validate_outlier_threshold: negative float; z-score threshold on the read count of a BAM file for it to be considered valid (default: -2.0)
- spladder_chunk_size: chunk size parameter for SplAdder runs (default: 2)
- spladder_events_ase_edge_limit: used by the spladder_call_events and spladder_call_events_long_execution rules to set the ASE edge limit (default: 500)
- spladder_events_ase_edge_limit_decrease_step: value by which ase_edge_limit is decreased on rerun attempts of the spladder_call_events rule (default: 100)
- rmats_contrast_file: expected location of the rMATS contrast metadata file (default: metadata/rmats_contrast.tsv)
- rmats_default_contrast: default contrast value for rMATS (default: b1)
- rmats_min_b1_qty: minimum quantity for the b1 group in rMATS (default: 2)
- rmats_min_b2_qty: minimum quantity for the b2 group in rMATS (default: 2)
- debug: if True, temp objects will not be deleted (default: False)
- print_prerun_info: if True (default), the collected prerun info is printed to stdout; set to False to suppress printing, which is required to print the DAG diagram
- pipeline_info_file: filename for pipeline info output (default: "pipeline_info.txt")
- pipeline_warning_file: filename for pipeline warning output (default: "pipeline_warning.txt")

To view the complete list dynamically at runtime, set the help variable to True and run the pipeline as shown below:
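For example, samplesToIgnore accepts either form described above (the sample names and file path here are placeholders):

```shell
# Option 1: comma-delimited list of sample names
export samplesToIgnore="sample_01,sample_02"

# Option 2: a text file with one sample name per row
printf 'sample_01\nsample_02\n' > /tmp/samples_to_ignore.txt
export samplesToIgnore=/tmp/samples_to_ignore.txt
```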
export help=True
./run_pipeline.sh
This will display the help information and terminate the pipeline execution. Set help=False or unset the variable to enable normal operation afterward.
./run_pipeline.sh
- run_pipeline.sh: main script to launch the pipeline.
- alter_splicing.smk: main Snakemake workflow file.
- config/config.yaml: contains default values and descriptions for pipeline configuration variables.
- conda_envs/snakemake_base.yml: base environment YAML for Snakemake; must be used to pre-create conda_base.
- README.md: this file.

# Pre-create the shared base environment
conda env create -p /shared/path/to/conda_base -f conda_envs/snakemake_base.yml
# Set up your variables
export conda_base="/shared/path/to/conda_base"
export data_path="/path/to/your/data"
# set dry_run to True to view the list of steps to be executed without the actual run
export dry_run=True
# Run the pipeline
./run_pipeline.sh
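To run locally instead of on a cluster, the same pattern applies with the cluster_run and cores_local variables from the list above (the core count here is illustrative):

```shell
# Disable cluster submission and limit local parallelism,
# then launch with ./run_pipeline.sh as usual.
export cluster_run=False
export cores_local=8
```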
Notes:
- The shared base environment (conda_base) must be pre-created from conda_envs/snakemake_base.yml and be readable/executable by all users; each user's additional environments are created under conda_envs_path.
- To skip the NGSCheckMate steps, set: export run_ngscheckmate=False
- Make sure the cluster configuration file (config/cluster_cfg.yaml) matches your HPC environment.