RNA-seq

This workflow processes RNA sequencing data, aligning reads to a reference genome, quantifying gene expression, and generating quality control reports.

Workflow Inputs

  • Workflow Name (Optional): A name to associate with your workflow for easy identification.
  • FASTQ Folder: The cloud storage location (e.g., gs://bucket/folder) containing the FASTQ files for your samples. The folder should contain paired-end reads with filenames ending in "_R1.fastq.gz" and "_R2.fastq.gz".
  • UMI Read Index Files (Optional): Enable this option and provide the cloud storage location of UMI index files if your data includes them. These files should have filenames ending in "_I1.fastq.gz".
  • Reference Files: The workflow uses pre-defined reference files for alignment and quantification. You have the option to override these defaults with your own files.
  • STAR Index: The cloud storage location of the STAR index file for read alignment.
  • GTF File: The cloud storage location of the GTF annotation file for gene quantification.
  • RSEM Reference: The cloud storage location of the RSEM reference file for transcript quantification.
  • Globin Genome Index: The cloud storage location of the Bowtie2 index for globin sequences, used to measure globin contamination.
  • rRNA Genome Index: The cloud storage location of the Bowtie2 index for rRNA sequences, used to quantify rRNA reads.
  • PhiX Genome Index: The cloud storage location of the Bowtie2 index for the PhiX sequence, used to measure PhiX contamination.
  • RefFlat File: The cloud storage location of the RefFlat file generated from the GTF file using gtfToGenePred.
  • Output Report Name: The prefix used for the final merged report files.
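
The filename convention described for the FASTQ Folder and UMI Read Index Files can be checked before launching a run. The following is a minimal sketch, not part of the workflow itself: `pair_fastqs` is a hypothetical helper, and it assumes the sample name is everything before the "_R1.fastq.gz", "_R2.fastq.gz", or "_I1.fastq.gz" suffix.

```python
from collections import defaultdict

# Suffixes taken from the input descriptions above.
SUFFIXES = {"R1": "_R1.fastq.gz", "R2": "_R2.fastq.gz", "I1": "_I1.fastq.gz"}


def pair_fastqs(filenames):
    """Group FASTQ filenames by sample prefix (hypothetical helper).

    Returns (samples, incomplete), where samples maps a sample name to its
    files and incomplete lists samples missing R1 or R2.
    """
    samples = defaultdict(dict)
    for name in filenames:
        for read, suffix in SUFFIXES.items():
            if name.endswith(suffix):
                samples[name[: -len(suffix)]][read] = name
    # A paired-end sample needs both mates; the I1 file is optional.
    incomplete = sorted(
        sample for sample, reads in samples.items()
        if not {"R1", "R2"}.issubset(reads)
    )
    return dict(samples), incomplete
```

Given a listing of the folder's filenames, any sample reported in `incomplete` is missing one mate and would likely need to be fixed before the run.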

Workflow Steps

The RNA-seq pipeline consists of 16 steps, each performing a specific analysis:

1. Pre-Trim FASTQC:

Quality control analysis of the raw reads before adapter trimming.

  • Additional configurable options:
  • pretrim_fastqc_ncpu: Number of CPU cores allocated for pre-trim FASTQC analysis.
  • pretrim_fastqc_ramGB: Amount of RAM (in GB) dedicated to pre-trim FASTQC.
  • pretrim_fastqc_disk: Disk space (in GB) allocated for pre-trim FASTQC outputs.
  • fastqc_docker: Specify the Docker image used for running FASTQC.

2. Attach UMI (Optional):

Appends UMI information to read names if UMI index files are provided.

  • Additional configurable options:
  • attach_umi_ncpu: Number of CPU cores used for attaching UMI information to read names.
  • attach_umi_ramGB: RAM allocated for the UMI attachment process.
  • attach_umi_disk: Disk space designated for UMI attachment outputs.
  • attach_umi_docker: Docker image used for the UMI attachment step.
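
To illustrate what attaching a UMI to a read name involves, here is a minimal sketch. It is not the workflow's implementation: the function names are hypothetical, and the "_" separator and the placement of the UMI before the header comment are assumptions made for illustration.

```python
def attach_umi(read_header, umi_sequence, sep="_"):
    """Append a UMI to the read ID, keeping any comment after the first space.

    The separator is an assumed convention, not taken from the workflow.
    """
    read_id, _, comment = read_header.partition(" ")
    tagged = f"{read_id}{sep}{umi_sequence}"
    return f"{tagged} {comment}" if comment else tagged


def attach_umis(read_records, index_records):
    """Yield FASTQ records with the UMI from the matching index record attached.

    Both inputs are iterables of (header, sequence, plus, quality) tuples
    and are assumed to list reads in the same order.
    """
    for (header, seq, plus, qual), (_, umi, _, _) in zip(read_records, index_records):
        yield attach_umi(header, umi), seq, plus, qual
```

For example, `attach_umi("@r1 1:N:0:ACGT", "AACCGGTT")` returns `"@r1_AACCGGTT 1:N:0:ACGT"`, so downstream tools can recover the UMI from the read name alone.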

3. Cutadapt:

Trims adapter sequences from reads and removes low-quality reads.

  • Additional configurable options:
  • minimumLength: Minimum length of reads after adapter trimming. Reads shorter than this will be discarded.
  • index_adapter: Sequence of the adapter used for indexing.
  • univ_adapter: Sequence of the universal adapter used (if applicable).
  • cutadapt_ncpu: Number of CPU cores used for adapter trimming with Cutadapt.
  • cutadapt_ramGB: RAM allocated for Cutadapt execution.
  • cutadapt_disk: Disk space designated for Cutadapt outputs.
  • cutadapt_docker: Docker image used for running Cutadapt.
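
The interaction between adapter trimming and minimumLength can be sketched as follows. This is a simplified illustration, not Cutadapt itself: it only handles exact adapter matches, whereas Cutadapt also tolerates mismatches and partial adapters at the read end. The function names are hypothetical.

```python
def trim_adapter(sequence, quality, adapter):
    """Remove the first exact adapter occurrence and everything after it."""
    pos = sequence.find(adapter)
    if pos == -1:
        return sequence, quality
    return sequence[:pos], quality[:pos]


def trim_and_filter(reads, adapter, minimum_length):
    """Yield trimmed (sequence, quality) pairs at least minimum_length long.

    reads is an iterable of (sequence, quality) pairs; reads that end up
    shorter than minimum_length after trimming are discarded.
    """
    for sequence, quality in reads:
        sequence, quality = trim_adapter(sequence, quality, adapter)
        if len(sequence) >= minimum_length:
            yield sequence, quality
```

A read that consists mostly of adapter is trimmed to a few bases and then dropped, which is why a very high minimumLength can remove a large fraction of reads from short-insert libraries.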

4. Post-Trim FASTQC:

Quality control analysis of reads after adapter trimming.

  • Additional configurable options:
  • posttrim_fastqc_ncpu: Number of CPU cores allocated for post-trim FASTQC analysis.
  • posttrim_fastqc_ramGB: Amount of RAM (in GB) dedicated to post-trim FASTQC.
  • posttrim_fastqc_disk: Disk space (in GB) allocated for post-trim FASTQC outputs.
  • fastqc_docker: Specify the Docker image used for running FASTQC.

5. MultiQC:

Generates a consolidated report combining results from pre- and post-trim FASTQC and Cutadapt.

  • Additional configurable options:
  • multiqc_ncpu: Number of CPU cores used for generating the MultiQC report.
  • multiqc_ramGB: RAM allocated for MultiQC report generation.
  • multiqc_disk: Disk space allocated for MultiQC report outputs.
  • multiqc_docker: Docker image used for running MultiQC.

6. STAR Alignment:

Aligns reads to the reference genome using STAR.

  • Additional configurable options:
  • star_ncpu: Number of CPU cores dedicated to STAR alignment.
  • star_ramGB: RAM allocated for STAR alignment execution.
  • star_disk: Disk space designated for STAR alignment outputs.
  • star_docker: Docker image used for running STAR.

7. FeatureCounts:

Quantifies gene expression levels based on read counts.

  • Additional configurable options:
  • feature_counts_ncpu: Number of CPU cores used for gene expression quantification with FeatureCounts.
  • feature_counts_ramGB: RAM allocated for FeatureCounts execution.
  • feature_counts_disk: Disk space allocated for FeatureCounts outputs.
  • feature_counts_docker: Docker image used for running FeatureCounts.

8. RSEM Quantification:

Quantifies gene and transcript expression levels, including FPKMs and TPMs.

  • Additional configurable options:
  • rsem_ncpu: Number of CPU cores dedicated to RSEM quantification.
  • rsem_ramGB: RAM allocated for RSEM execution.
  • rsem_disk: Disk space designated for RSEM outputs.
  • rsem_docker: Docker image used for running RSEM.
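
The relationship between counts, FPKM, and TPM can be made concrete with the standard normalization formulas. The sketch below shows only that final normalization step, under the assumption that per-gene counts and lengths are already known. RSEM itself estimates expected counts and effective lengths with an EM algorithm, which is not reproduced here.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [1e9 * c / (l * total) for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [1e6 * r / total_rate for r in rates]
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM values, whose sum depends on the length distribution of the expressed genes.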

9. Contamination Estimation:

Uses Bowtie2 to estimate the level of globin, rRNA, and PhiX contamination in the samples.

  • Additional configurable options:
  • bowtie2_{globin,rrna,phix}_ncpu: Number of CPU cores for each Bowtie2 contamination estimation step.
  • bowtie2_{globin,rrna,phix}_ramGB: RAM allocated for each Bowtie2 step.
  • bowtie2_{globin,rrna,phix}_disk: Disk space for each Bowtie2 step outputs.
  • bowtie_docker: Docker image used for running Bowtie2.

10. Mark Duplicates:

Identifies and marks duplicate reads resulting from PCR amplification.

  • Additional configurable options:
  • markdup_ncpu: Number of CPU cores used for marking duplicate reads with Picard MarkDuplicates.
  • markdup_ramGB: RAM allocated for Picard MarkDuplicates execution.
  • markdup_disk: Disk space for Picard MarkDuplicates outputs.

11. Collect RNA-seq Metrics:

Collects various RNA-seq quality control metrics using Picard tools.

  • Additional configurable options:
  • rnaqc_ncpu: Number of CPU cores used for collecting RNA-seq metrics with Picard CollectRnaSeqMetrics.
  • rnaqc_ramGB: RAM allocated for Picard CollectRnaSeqMetrics execution.
  • rnaqc_disk: Disk space for Picard CollectRnaSeqMetrics outputs.

12. UMI Duplication (Optional):

Estimates PCR duplication rates from UMI information if provided.

  • Additional configurable options:
  • umi_dup_ncpu: Number of CPU cores used for UMI-based duplication estimation.
  • umi_dup_ramGB: RAM allocated for UMI duplication estimation.
  • umi_dup_disk: Disk space designated for UMI duplication estimation outputs.
  • umi_dup_docker: Docker image used for UMI duplication estimation.
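
One common way to estimate a UMI-based duplication rate is to treat reads that share the same mapping position, strand, and UMI as copies of one original molecule. The sketch below follows that idea. It is an assumption about the general approach, not a description of how this workflow computes the value, and the function name is hypothetical.

```python
def umi_duplication_rate(alignments):
    """Estimate the fraction of mapped reads that are PCR duplicates.

    alignments is an iterable of (chrom, position, strand, umi) tuples,
    one per mapped read. Reads with identical tuples are counted as
    duplicates of a single molecule.
    """
    total = 0
    unique = set()
    for key in alignments:
        total += 1
        unique.add(key)
    if total == 0:
        return 0.0
    return 1.0 - len(unique) / total
```

Two reads at the same position with different UMIs count as distinct molecules here, which is the main advantage over position-only duplicate marking for highly expressed genes.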

13. SAMTools Mapped:

Calculates the percentage of reads mapped to different genomic regions.

  • Additional configurable options:
  • mapped_ncpu: Number of CPU cores used for calculating mapped reads with SAMTools.
  • mapped_ramGB: RAM allocated for SAMTools Mapped execution.
  • mapped_disk: Disk space for SAMTools Mapped outputs.
  • samtools_docker: Docker image used for running SAMTools.
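
The source does not state which SAMtools subcommand this step runs. As one possible illustration, the sketch below assumes the per-reference counts come from `samtools idxstats`, whose output is tab-separated with the columns reference name, sequence length, mapped read count, and unmapped read count. The function name is hypothetical.

```python
def percent_mapped_by_reference(idxstats_text):
    """Return the percentage of mapped reads on each reference sequence.

    idxstats_text is the text output of `samtools idxstats`. The final "*"
    row, which holds reads with no reference, is skipped.
    """
    counts = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name != "*":
            counts[name] = int(mapped)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Percentages for individual references such as chrM, chrX, or chrY can then be read directly from the returned dictionary.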

14. MultiQC Post-Alignment:

Generates a consolidated report combining results from STAR, Picard tools, and other post-alignment steps.

  • Additional configurable options:
  • mqc_postalign_ncpu: Number of CPU cores used for generating the post-alignment MultiQC report.
  • mqc_postalign_ramGB: RAM allocated for post-alignment MultiQC report generation.
  • mqc_postalign_disk: Disk space allocated for post-alignment MultiQC report outputs.

15. RNAseq QC Report:

Creates a comprehensive QC report for each sample using MultiQC reports and other log files.

  • Additional configurable options:
  • collect_qc_ncpu: Number of CPU cores used for generating the final RNA-seq QC report.
  • collect_qc_ramGB: RAM allocated for final QC report generation.
  • collect_qc_disk: Disk space allocated for final QC report outputs.
  • collect_qc_docker: Docker image used for running the final QC report.

16. Merge Results:

Merges quantification outputs from RSEM and FeatureCounts, along with QC reports, into final combined files.

  • Additional configurable options:
  • merge_results_ncpu: Number of CPU cores used for merging results.
  • merge_results_ramGB: RAM allocated for merging results.
  • merge_results_disk: Disk space allocated for merging results.
  • merge_results_docker: Docker image used for merging results.
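
Conceptually, merging combines one table per sample into a single table with one row per gene and one column per sample. The sketch below shows that reshaping with the standard library only. The function name, the "gene_id" column header, and the choice to fill missing genes with 0 are assumptions made for illustration, not the workflow's actual output format.

```python
import csv
import io


def merge_gene_tables(per_sample, missing="0"):
    """Merge per-sample gene values into one tab-separated table.

    per_sample maps sample name -> {gene_id: value}. Genes absent from a
    sample are filled with the `missing` placeholder.
    """
    genes = sorted({gene for table in per_sample.values() for gene in table})
    samples = sorted(per_sample)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", *samples])
    for gene in genes:
        writer.writerow([gene, *(per_sample[s].get(gene, missing) for s in samples)])
    return buffer.getvalue()
```

The same function could be applied separately to counts, TPMs, and FPKMs to produce one merged file per measure.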

Workflow Outputs

  • RSEM Gene Counts: A table containing RSEM's estimated (expected) read counts for each gene across all samples.
  • RSEM Gene TPMs: A table containing TPM (Transcripts Per Million) values for each gene across all samples.
  • RSEM Gene FPKMs: A table containing FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) values for each gene across all samples.
  • FeatureCounts File: A table containing raw read counts for each gene across all samples.
  • QC Report File: A combined QC report summarizing results for all samples.

Additional Notes

  • Each step in the workflow has configurable options for adjusting computational resources (CPU, RAM, disk) and specifying docker images.
  • The workflow is designed to be modular, allowing for customization and extension as needed.