RNA-seq

This workflow processes RNA sequencing data, aligning reads to a reference genome, quantifying gene expression, and generating quality control reports.

Workflow Inputs

Workflow Name (Optional): A name to associate with your workflow for easy identification.
FASTQ Folder: The cloud storage location (e.g., gs://bucket/folder) containing the FASTQ files for your samples. The folder should contain paired-end reads with filenames ending in "_R1.fastq.gz" and "_R2.fastq.gz".
UMI Read Index Files (Optional): Enable this option and provide the cloud storage location of UMI index files if your data includes them. These files should have filenames ending in "_I1.fastq.gz".
Reference Files: The workflow uses pre-defined reference files for alignment and quantification. You can override any of the defaults listed below with your own files.
STAR Index: The cloud storage location of the STAR index file for read alignment.
GTF File: The cloud storage location of the GTF annotation file for gene quantification.
RSEM Reference: The cloud storage location of the RSEM reference file for transcript quantification.
Globin Genome Index: The cloud storage location of the Bowtie2 index for globin sequences, used to measure globin contamination.
rRNA Genome Index: The cloud storage location of the Bowtie2 index for rRNA sequences, used to quantify rRNA reads.
PhiX Genome Index: The cloud storage location of the Bowtie2 index for the PhiX sequence, used to measure PhiX contamination.
RefFlat File: The cloud storage location of the RefFlat file generated from the GTF file using gtfToGenePred.
Output Report Name: The prefix used for the final merged report files.
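
The filename conventions above can be checked before launching a run. The following is a minimal Python sketch, not part of the workflow itself, that groups a list of filenames by sample using the documented suffixes and reports samples with a missing mate. The function names and the example filenames are hypothetical.

```python
from collections import defaultdict

# Suffixes documented for this workflow's inputs.
SUFFIXES = {
    "_R1.fastq.gz": "R1",
    "_R2.fastq.gz": "R2",
    "_I1.fastq.gz": "I1",  # optional UMI index read
}


def group_fastqs(filenames):
    """Group FASTQ filenames by sample name, keyed by read type (R1/R2/I1)."""
    samples = defaultdict(dict)
    for name in filenames:
        for suffix, read_type in SUFFIXES.items():
            if name.endswith(suffix):
                sample = name[: -len(suffix)]
                samples[sample][read_type] = name
                break  # files with other suffixes are silently ignored
    return dict(samples)


def incomplete_samples(samples, require_umi=False):
    """Return sorted names of samples missing R1, R2, or (optionally) I1."""
    required = {"R1", "R2"}
    if require_umi:
        required.add("I1")
    return sorted(s for s, reads in samples.items() if not required.issubset(reads))


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "sampleB_R1.fastq.gz"]
    print(incomplete_samples(group_fastqs(files)))
```

Set require_umi=True when the UMI Read Index Files option is enabled, so that samples without an "_I1.fastq.gz" file are also reported.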

Workflow Steps

The RNA-seq pipeline consists of 16 steps, two of which (Attach UMI and UMI Duplication) run only when UMI index files are provided:

1. Pre-Trim FASTQC:

Quality control analysis of raw reads before adapter trimming.

Additional configurable options:
pretrim_fastqc_ncpu: Number of CPU cores allocated for pre-trim FASTQC analysis.
pretrim_fastqc_ramGB: Amount of RAM (in GB) allocated for pre-trim FASTQC.
pretrim_fastqc_disk: Disk space (in GB) allocated for pre-trim FASTQC outputs.
fastqc_docker: Docker image used for running FASTQC (shared by the pre- and post-trim steps).

2. Attach UMI (Optional):

Appends UMI information to read names if UMI index files are provided.

Additional configurable options:
attach_umi_ncpu: Number of CPU cores used for attaching UMI information to read names.
attach_umi_ramGB: RAM allocated for the UMI attachment process.
attach_umi_disk: Disk space designated for UMI attachment outputs.
attach_umi_docker: Docker image used for the UMI attachment step.
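
To illustrate what attaching a UMI to a read name means, here is a minimal Python sketch. It is not the workflow's implementation: the underscore separator, the placement of the UMI before the header's comment field, and the function names are all assumptions made for the example.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as lists of four lines (newlines stripped)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") for line in record]


def attach_umi(read_lines, index_lines, sep="_"):
    """Yield FASTQ records (as text) with the UMI appended to each read ID.

    read_lines / index_lines: text lines from a read FASTQ and the matching
    I1 FASTQ, with records in the same order.
    sep: separator between read ID and UMI (an assumption, not documented).
    """
    for read, idx in zip(fastq_records(read_lines), fastq_records(index_lines)):
        header, seq, plus, qual = read
        umi = idx[1]  # the index read's sequence line is the UMI
        # Keep any comment field (text after the first space) intact.
        read_id, _, comment = header.partition(" ")
        new_header = f"{read_id}{sep}{umi}"
        if comment:
            new_header += f" {comment}"
        yield "\n".join([new_header, seq, plus, qual]) + "\n"
```

With gzipped inputs, the two line iterables can be obtained by opening each file with gzip.open(path, "rt") from the Python standard library.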

3. Cutadapt:

Trims adapter sequences from reads and removes low-quality reads.

Additional configurable options:
minimumLength: Minimum length of reads after adapter trimming. Reads shorter than this will be discarded.
index_adapter: Sequence of the adapter used for indexing.
univ_adapter: Sequence of the universal adapter used (if applicable).
cutadapt_ncpu: Number of CPU cores used for adapter trimming with Cutadapt.
cutadapt_ramGB: RAM allocated for Cutadapt execution.
cutadapt_disk: Disk space designated for Cutadapt outputs.
cutadapt_docker: Docker image used for running Cutadapt.
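
As a rough guide to how these options could translate into a Cutadapt call, the Python sketch below assembles a paired-end command line. The flags -a, -A, --minimum-length, --cores, -o and -p are standard Cutadapt options, but which adapter the workflow applies to which read is an assumption here, and any quality-filtering options the workflow passes are not shown. The function name, file names and adapter value in the example are hypothetical.

```python
import shlex


def build_cutadapt_command(r1, r2, out_r1, out_r2,
                           index_adapter, univ_adapter,
                           minimum_length, ncpu):
    """Return a paired-end Cutadapt command as an argument list."""
    return [
        "cutadapt",
        "-a", index_adapter,   # 3' adapter on read 1 (assumed mapping)
        "-A", univ_adapter,    # 3' adapter on read 2 (assumed mapping)
        "--minimum-length", str(minimum_length),  # discard shorter reads
        "--cores", str(ncpu),
        "-o", out_r1,          # trimmed read 1 output
        "-p", out_r2,          # trimmed read 2 output
        r1, r2,
    ]


if __name__ == "__main__":
    # Illustrative values only; these are not the workflow's defaults.
    cmd = build_cutadapt_command(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
        index_adapter="AGATCGGAAGAGC", univ_adapter="AGATCGGAAGAGC",
        minimum_length=20, ncpu=4,
    )
    print(shlex.join(cmd))
```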

4. Post-Trim FASTQC:

Quality control analysis of reads after adapter trimming.

Additional configurable options:
posttrim_fastqc_ncpu: Number of CPU cores allocated for post-trim FASTQC analysis.
posttrim_fastqc_ramGB: Amount of RAM (in GB) allocated for post-trim FASTQC.
posttrim_fastqc_disk: Disk space (in GB) allocated for post-trim FASTQC outputs.
fastqc_docker: Docker image used for running FASTQC (shared by the pre- and post-trim steps).

5. MultiQC:

Generates a consolidated report combining results from pre- and post-trim FASTQC and Cutadapt.

Additional configurable options:
multiqc_ncpu: Number of CPU cores used for generating the MultiQC report.
multiqc_ramGB: RAM allocated for MultiQC report generation.
multiqc_disk: Disk space allocated for MultiQC report outputs.
multiqc_docker: Docker image used for running MultiQC.

6. STAR Alignment:

Aligns reads to the reference genome using STAR.

Additional configurable options:
star_ncpu: Number of CPU cores dedicated to STAR alignment.
star_ramGB: RAM allocated for STAR alignment execution.
star_disk: Disk space designated for STAR alignment outputs.
star_docker: Docker image used for running STAR.

7. FeatureCounts:

Quantifies gene expression levels based on read counts.

Additional configurable options:
feature_counts_ncpu: Number of CPU cores used for gene expression quantification with FeatureCounts.
feature_counts_ramGB: RAM allocated for FeatureCounts execution.
feature_counts_disk: Disk space allocated for FeatureCounts outputs.
feature_counts_docker: Docker image used for running FeatureCounts.

8. RSEM Quantification:

Quantifies gene and transcript expression levels, including FPKMs and TPMs.

Additional configurable options:
rsem_ncpu: Number of CPU cores dedicated to RSEM quantification.
rsem_ramGB: RAM allocated for RSEM execution.
rsem_disk: Disk space designated for RSEM outputs.
rsem_docker: Docker image used for running RSEM.
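
For readers unfamiliar with the two normalized units, the Python sketch below applies the standard textbook definitions of FPKM and TPM to a list of per-gene counts and lengths. It does not reproduce RSEM, which estimates expected counts with an EM algorithm and uses effective transcript lengths; the function names and input values are hypothetical.

```python
def fpkm(counts, lengths):
    """FPKM: fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]


def tpm(counts, lengths):
    """TPM: length-normalized rates rescaled so that they sum to one million."""
    rates = [c / length for c, length in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


if __name__ == "__main__":
    # Hypothetical counts and gene lengths (in bases) for two genes.
    counts = [10, 20]
    lengths = [1000, 2000]
    print(fpkm(counts, lengths))
    print(tpm(counts, lengths))
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than FPKM values, whose total varies with the sample's length distribution.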

9. Contamination Estimation:

Uses Bowtie2 to estimate the level of globin, rRNA, and PhiX contamination in the samples.

Additional configurable options:
bowtie2_{globin,rrna,phix}_ncpu: Number of CPU cores for each Bowtie2 contamination estimation step.
bowtie2_{globin,rrna,phix}_ramGB: RAM allocated for each Bowtie2 step.
bowtie2_{globin,rrna,phix}_disk: Disk space for each Bowtie2 step outputs.
bowtie_docker: Docker image used for running Bowtie2.
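
The brace notation above is shorthand for one option per contamination target. A short Python sketch that expands it into the nine individual option names follows; the helper name is hypothetical.

```python
from itertools import product

TARGETS = ("globin", "rrna", "phix")
RESOURCES = ("ncpu", "ramGB", "disk")


def bowtie2_option_names():
    """Expand bowtie2_{globin,rrna,phix}_{ncpu,ramGB,disk} into full names."""
    return [f"bowtie2_{target}_{resource}"
            for target, resource in product(TARGETS, RESOURCES)]


if __name__ == "__main__":
    for name in bowtie2_option_names():
        print(name)
```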

10. Mark Duplicates:

Identifies and marks duplicate reads resulting from PCR amplification.

Additional configurable options:
markdup_ncpu: Number of CPU cores used for marking duplicate reads with Picard MarkDuplicates.
markdup_ramGB: RAM allocated for Picard MarkDuplicates execution.
markdup_disk: Disk space for Picard MarkDuplicates outputs.

11. Collect RNA-seq Metrics:

Collects various RNA-seq quality control metrics using Picard tools.

Additional configurable options:
rnaqc_ncpu: Number of CPU cores used for collecting RNAseq metrics with Picard CollectRnaSeqMetrics.
rnaqc_ramGB: RAM allocated for Picard CollectRnaSeqMetrics execution.
rnaqc_disk: Disk space for Picard CollectRnaSeqMetrics outputs.

12. UMI Duplication (Optional):

Estimates PCR duplication rates from UMI information if provided.

Additional configurable options:
umi_dup_ncpu: Number of CPU cores used for UMI-based duplication estimation.
umi_dup_ramGB: RAM allocated for UMI duplication estimation.
umi_dup_disk: Disk space designated for UMI duplication estimation outputs.
umi_dup_docker: Docker image used for UMI duplication estimation.
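
The idea behind UMI-based duplication estimates can be sketched in a few lines of Python. This example assumes that reads sharing the same mapping position, strand and UMI are PCR duplicates of one original molecule; the workflow's actual method is not described here and may differ, and the function name and input tuples are hypothetical.

```python
def umi_duplication_rate(reads):
    """Estimate the PCR duplication rate from mapped reads with UMIs.

    reads: iterable of (chrom, position, strand, umi) tuples.
    Returns the fraction of reads that duplicate another read's key.
    """
    total = 0
    unique = set()
    for read in reads:
        total += 1
        unique.add(tuple(read))
    if total == 0:
        return 0.0
    return 1.0 - len(unique) / total


if __name__ == "__main__":
    # Hypothetical reads: the first two share position, strand and UMI.
    reads = [
        ("chr1", 100, "+", "ACGT"),
        ("chr1", 100, "+", "ACGT"),
        ("chr1", 100, "+", "TTAA"),
        ("chr2", 500, "-", "ACGT"),
    ]
    print(umi_duplication_rate(reads))
```

Unlike position-only duplicate marking, this keeps reads that map to the same position but carry different UMIs, which matters for highly expressed genes where identical positions occur by chance.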

13. SAMTools Mapped:

Calculates the percentage of reads mapped to different genomic regions.

Additional configurable options:
mapped_ncpu: Number of CPU cores used for calculating mapped reads with SAMTools.
mapped_ramGB: RAM allocated for SAMTools Mapped execution.
mapped_disk: Disk space for SAMTools Mapped outputs.
samtools_docker: Docker image used for running SAMTools.

14. MultiQC Post-Alignment:

Generates a consolidated report combining results from STAR, Picard tools, and other post-alignment steps.

Additional configurable options:
mqc_postalign_ncpu: Number of CPU cores used for generating the post-alignment MultiQC report.
mqc_postalign_ramGB: RAM allocated for post-alignment MultiQC report generation.
mqc_postalign_disk: Disk space allocated for post-alignment MultiQC report outputs.

15. RNAseq QC Report:

Creates a comprehensive QC report for each sample using MultiQC reports and other log files.

Additional configurable options:
collect_qc_ncpu: Number of CPU cores used for generating the final RNAseq QC report.
collect_qc_ramGB: RAM allocated for final QC report generation.
collect_qc_disk: Disk space allocated for final QC report outputs.
collect_qc_docker: Docker image used for running the final QC report.

16. Merge Results:

Merges quantification outputs from RSEM and FeatureCounts, along with QC reports, into final combined files.

Additional configurable options:
merge_results_ncpu: Number of CPU cores used for merging results.
merge_results_ramGB: RAM allocated for merging results.
merge_results_disk: Disk space allocated for merging results.
merge_results_docker: Docker image used for merging results.

Workflow Outputs

RSEM Gene Counts: A table containing RSEM-estimated (expected) read counts for each gene across all samples.
RSEM Gene TPMs: A table containing TPM (Transcripts Per Million) values for each gene across all samples.
RSEM Gene FPKMs: A table containing FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) values for each gene across all samples.
FeatureCounts File: A table containing raw read counts for each gene across all samples.
QC Report File: A combined QC report summarizing results for all samples.

Additional Notes

Each step in the workflow has configurable options for adjusting computational resources (CPU, RAM, disk) and specifying docker images.
The workflow is designed to be modular, allowing for customization and extension as needed.
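
The per-step resource options follow a common naming pattern: a step prefix followed by _ncpu, _ramGB or _disk. As a sketch of how a set of overrides might be assembled, the Python example below builds them into a JSON document. The option names come from the step descriptions above; the values, the helper name, and the flat JSON layout are assumptions, since the format in which the workflow accepts these options is not specified here.

```python
import json


def resource_overrides(step, ncpu=None, ram_gb=None, disk_gb=None):
    """Build {option_name: value} for one step, skipping unset resources.

    step: option prefix as used in this document, e.g. "star" or "rsem".
    """
    values = {
        f"{step}_ncpu": ncpu,
        f"{step}_ramGB": ram_gb,
        f"{step}_disk": disk_gb,
    }
    return {name: value for name, value in values.items() if value is not None}


if __name__ == "__main__":
    # Illustrative values only; these are not the workflow's defaults.
    inputs = {}
    inputs.update(resource_overrides("star", ncpu=16, ram_gb=64, disk_gb=200))
    inputs.update(resource_overrides("rsem", ncpu=8))
    print(json.dumps(inputs, indent=2))
```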
