motrpac-atac-seq-pipeline

MoTrPAC ATAC-Seq QC and Analysis Pipeline

Overview

This repository provides MoTrPAC-specific supplements to the ENCODE ATAC-seq pipeline.

ATAC-Seq Method

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a technique used to assess genome-wide chromatin accessibility. The method uses hyperactive Tn5 transposase to preferentially insert sequencing adapters into accessible regions of chromatin, allowing identification of open chromatin regions, transcription factor binding sites, and nucleosome positioning.

Pipeline Implementation

MoTrPAC uses ENCODE ATAC-seq pipeline version 1.7.0 for consistency within the consortium and reproducibility outside of the consortium. This repository provides:

Automated configuration file generation for MoTrPAC samples
Quality control and flagging criteria
Post-processing scripts for count matrix generation
Organism-specific genome databases (human hg38, rat rn6, and rat rn8)

Important References

Documentation Purpose

This documentation is intended to help individuals who are:

Preparing ATAC-seq data for submission to the BIC
Processing pilot samples with the full pipeline
Running the pipeline on local compute environments with Conda

For additional details not directly related to running the ENCODE ATAC-seq pipeline or processing the results, see the most recent version of the MoTrPAC ATAC-seq QC and Analysis Pipeline MOP.

NOTE: For users working on the cloud or in other environments, follow ENCODE’s documentation as necessary. Post-processing scripts are intended to be useful to all users, regardless of environment.

MoTrPAC ATAC-Seq QC and Analysis Pipeline

1. Prerequisites

Required Software

bcl2fastq v2.20.0 - For demultiplexing and converting BCL to FASTQ (if processing raw sequencing data)
- Download from Illumina
Conda - For managing software dependencies
Git - For cloning repositories
Python >= 3.6 - Required for pipeline and post-processing scripts

Required Accounts (for cloud execution)

Google Cloud Platform (GCP) account with billing enabled (if running on GCP)
GCP service account with Google Cloud Life Sciences API permissions

Supported Organisms

Human: hg38 (GRCh38)
Rat: rn6 (Rnor_6.0) with custom MoTrPAC reference; rn8 (GRCr8) with Ensembl chr-prefixed naming

Study-Specific Information

PASS Study (rat samples):

Run pipeline with biological replicates
Generates condition-specific peak sets (protocol/timepoint/tissue/sex)
Uses automated grouping of replicates

CASS Study (human samples):

Run pipeline with singletons
Individual sample peak calling

2. Prepare ATAC-seq data for submission to the BIC

2.1 Clone this repository

This documentation assumes you clone it in a folder called ATAC_PIPELINE in your home directory. ~/ATAC_PIPELINE is also the recommended destination folder for when you clone ENCODE’s repository later.

cd ~
mkdir ATAC_PIPELINE
cd ATAC_PIPELINE
git clone https://github.com/MoTrPAC/motrpac-atac-seq-pipeline.git

2.2 Generate and format FASTQs

Each GET site (Stanford and MSSM) is responsible for sequencing the library and obtaining the demultiplexed FASTQ files for each sample. If sequencing is performed with NovaSeq, raw data is output as BCL files, which must be demultiplexed and converted to FASTQ files with bcl2fastq (version 2.20.0). bcl2fastq v2.20.0 can be downloaded directly from Illumina here.

Prepare a sample sheet for demultiplexing. Find an example here.

Important notes:

The sample sheet must not include the Adapter or AdapterRead2 settings. This will prevent bcl2fastq from automatically performing adapter trimming, which provides us with FASTQ files that include the fullest form of the raw data. Adapter trimming is performed downstream by the pipeline.
Sample_Name and Sample_ID should correspond to vial labels; FASTQ files must follow the naming convention ${viallabel}_R?.fastq.gz before submission to the BIC

src/bcl2fastq.sh provides code both to run bcl2fastq and rename files. It can be run as follows:

1. Define the following paths:

bclfolder: Path to sequencing output directory, e.g /lab/data/NOVASEQ_BATCH1/181205_NB551514_0071_AHFHLGAFXY
samplesheet: Path to the sample sheet, e.g. ${bclfolder}/SampleSheet.csv
outdir: Path to root output folder, e.g. /lab/data/NOVASEQ_BATCH1

2. Load the correct version of bcl2fastq

If applicable, load the correct version of bcl2fastq. For example, on Stanford SCG, run:

module load bcl2fastq2/2.20.0.422

3. Run the bcl2fastq script:

bash ~/ATAC_PIPELINE/motrpac-atac-seq-pipeline/src/bcl2fastq.sh ${bclfolder} ${samplesheet} ${outdir}

This makes two new directories:

${outdir}/bcl2fastq: Outputs of bcl2fastq
${outdir}/fastq_raw: Merged and re-named FASTQ files, ready for submission to the BIC

Alternative approach:

Alternatively, run the bcl2fastq command independently, and use your own scripts to merge and rename FASTQ files before submission to the BIC:

bcl2fastq \
     --sample-sheet /path/to/SampleSheet.csv \
     --runfolder-dir $seqDir \
     --output-dir $outDir

This command will generate two FASTQ files (one for each read in the pair) per sample per lane, e.g. ${viallabel}_L${lane}_R{1,2}_001.fastq.gz.

2.3 Collect additional documents

The following documents must be prepared for data submission:

laneBarcode HTML report: Found in ${outdir}/bcl2fastq/Reports/html/*/all/all/all/laneBarcode.html. This report must be included in the BIC data submission.
sample_metadata_YYYYMMDD.csv: See this table for a list of metrics that must be included in this file.
file_manifest_YYYYMMDD.csv: See the GET CAS-to-BIC Data Transfer Guidelines for details about the format of this document.

2.4 Submit data

Refer to the GET CAS-to-BIC Data Transfer Guidelines for details about the directory structure for ATAC-seq data submissions.

Required files:

file_manifest_YYYYMMDD.csv
sample_metadata_YYYYMMDD.csv
readme_YYYYMMDD.txt
laneBarcode.html
fastq_raw/*.fastq.gz

2.5 (GET Sites only) Download pipeline outputs FROM BIC

After the BIC has finished running the ENCODE ATAC-seq pipeline on a batch of submitted data, use extract_atac_from_gcp_pass.sh to download the important subset of outputs from GCP.

The script takes three arguments: the number of cores, the download directory, and the CROO output path. Run the script as follows:

bash src/extract_atac_from_gcp_pass.sh ${NUM_CORES} ${DOWNLOAD_DIR} ${CROO_OUTPUT_PATH}

3. Install and test ENCODE ATAC-seq pipeline and dependencies

NOTE: All steps in this section must only be performed once. After dependencies are installed and genome databases are built, skip to section 4.

The ENCODE pipeline supports many cloud platforms and cluster engines. It also supports docker, singularity, and Conda to resolve complicated software dependencies for the pipeline. There are special instructions for two major Stanford HPC servers (SCG4 and Sherlock).

While the BIC runs this pipeline on Google Cloud Platform, this documentation is tailored for consortium users who use non-cloud computing environments, including clusters and personal computers. Therefore, this documentation describes the Conda implementation. Refer to ENCODE’s documentation for alternatives.

Quick start (GCP only)

If you have a fresh VM in Google Cloud Platform and a service account key with Google Cloud Life Sciences API permissions, the easiest way to set up a clean server is by using the server setup script in the src directory.

This script will install all dependencies, run a database in a Docker container, and perform the rest of the steps described in this section except step 3.5.

First, create a JSON file with your service account key somewhere on the VM. Then, use the following commands to clone this repository and run the server setup script.

git clone https://github.com/MoTrPAC/motrpac-atac-seq-pipeline.git
cd motrpac-atac-seq-pipeline
bash src/server_setup.sh us-central1 my-gcp-project gs://my_bucket/outputs gs://my_bucket/loc /path/to/key.json

3.1 Clone the ENCODE repository

Clone the v1.7.0 ENCODE repository and this repository in a folder in your home directory:

cd ~/ATAC_PIPELINE
git clone --single-branch --branch v1.7.0 https://github.com/ENCODE-DCC/atac-seq-pipeline.git

IMPORTANT: Required modification to atac.wdl

The following change needs to be made to ~/ATAC_PIPELINE/atac-seq-pipeline/atac.wdl.

At the end of ~/ATAC_PIPELINE/atac-seq-pipeline/atac.wdl, find this block of code:

task raise_exception {
    String msg
    command {
        echo -e "\n* Error: ${msg}\n" >&2
        exit 2
    }
    output {
        String error_msg = '${msg}'
    }
    runtime {
        maxRetries: 0
    }
}

Replace the runtime parameters in the raise_exception task with these:

    runtime {
        maxRetries: 0
        cpu: 1
        memory: '2 GB'
        time: 1
        disks: 'local-disk 10 SSD'
    }

Alternatively, you can apply the patch file provided in this repository:

cd ~/ATAC_PIPELINE/atac-seq-pipeline
git apply ~/ATAC_PIPELINE/motrpac-atac-seq-pipeline/patches/fix__add_missing_runtime_attributes_to_atac-seq_v1_7_0.patch

If you do not make this change, you will get the following error when you try to run the pipeline:

Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!

* Found failures JSON object.
[
    {
        "causedBy": [
            {
                "causedBy": [],
                "message": "Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!"
            },
            {
                "causedBy": [],
                "message": "Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!"
            }
        ],
        "message": "Runtime validation failed"
    }
]

3.2 (Optional) Install the `Conda` environment with all software dependencies

If you did not use the server setup script, you will need to install the Conda environment with all software.

Install conda by following these instructions.

TIP: Perform Step 5 of the ENCODE instructions in a screen or tmux session, as it can take some time.

3.3 Initialize `Caper`

Installing the Conda environment also installs Caper. Make sure it works:

conda activate encode-atac-seq-pipeline
caper

If you see an error like caper: command not found, then add the following line to the bottom of ~/.bashrc and re-login:

export PATH=$PATH:~/.local/bin

Choose a platform from the following table and initialize Caper. This will create a default Caper configuration file ~/.caper/default.conf, which will have only required parameters for each platform. There are special platforms for Stanford Sherlock/SCG users.

caper init [PLATFORM]

Table 3.1. Supported platforms

Platform	Description
sherlock	Stanford Sherlock cluster (SLURM)
scg	Stanford SCG cluster (SLURM)
gcp	Google Cloud Platform
aws	Amazon Web Service
local	General local computer
sge	HPC with Sun GridEngine
pbs	HPC with PBS cluster engine
slurm	HPC with SLURM cluster engine

Edit ~/.caper/default.conf according to your chosen platform. Find instruction for each item in the following table.

IMPORTANT: ONCE YOU HAVE INITIALIZED THE CONFIGURATION FILE ~/.caper/default.conf WITH YOUR CHOSEN PLATFORM, THEN IT WILL HAVE ONLY REQUIRED PARAMETERS FOR THE CHOSEN PLATFORM. DO NOT LEAVE ANY PARAMETERS UNDEFINED OR CAPER WILL NOT WORK CORRECTLY.

Table 3.2. Caper configuration parameters

Parameter	Description
tmp-dir	IMPORTANT: A directory to store all cached files for inter-storage file transfer. DO NOT USE `/tmp`.
slurm-partition	SLURM partition. Define only if required by a cluster. You must define it for Stanford Sherlock.
slurm-account	SLURM partition. Define only if required by a cluster. You must define it for Stanford SCG.
sge-pe	Parallel environment of SGE. Find one with `$ qconf -spl` or ask your admin to add one if not exists.
aws-batch-arn	ARN for AWS Batch.
aws-region	AWS region (e.g. us-west-1)
out-s3-bucket	Output bucket path for AWS. This should start with `s3://`.
gcp-prj	Google Cloud Platform Project
out-gcs-bucket	Output bucket path for Google Cloud Platform. This should start with `gs://`.

An important optional parameter is db. If you would like to enable call-caching (i.e. re-use outputs from previous workflows, which is particularly useful if a workflow fails halfway through a pipeline), add the following lines to ~/.caper/default.conf:

db=file
java-heap-run=4G

3.4 Run a test sample

Follow these platform-specific instructions to run a test sample. Use the following variable assignments:

# the next variable is only needed if using conda to manage Python packages
PIPELINE_CONDA_ENV=encode-atac-seq-pipeline
WDL=~/ATAC_PIPELINE/atac-seq-pipeline/atac.wdl
INPUT_JSON=https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled_caper.json

NOTE: Caper writes all outputs to the current working directory, so first cd to the desired output directory before using caper run or caper server.

On Stanford SCG/SLURM

Here is an example of how the test workflow is run on Stanford SCG (SLURM):

# only needed if using conda to manage Python packages
conda activate ${PIPELINE_CONDA_ENV}
JOB_NAME=encode_test
sbatch -A ${ACCOUNT} -J ${JOB_NAME} --export=ALL --mem 2G -t 4-0 --wrap "caper run ${WDL} -i ${INPUT_JSON}"

3.5 Install genome databases

3.5.1 Install the hg38 genome database

Specify a destination directory and install the ENCODE hg38 reference with the following command.

IMPORTANT: We recommend not to run this installer on a login node of your cluster. It will take >8GB memory and >2h time.

# only needed if using conda to manage Python packages
conda activate encode-atac-seq-pipeline

outdir=/path/to/reference/genome/hg38
bash ~/ATAC_PIPELINE/atac-seq-pipeline/scripts/download_genome_data.sh hg38 ${outdir}

3.5.2 Install the custom rn6 genome database

Find this section in ~/ATAC_PIPELINE/atac-seq-pipeline/scripts/build_genome_data.sh:

...

elif [[ "${GENOME}" == "YOUR_OWN_GENOME" ]]; then
  # Perl style regular expression to keep regular chromosomes only.
  # this reg-ex will be applied to peaks after blacklist filtering (b-filt) with "grep -P".
  # so that b-filt peak file (.bfilt.*Peak.gz) will only have chromosomes matching with this pattern
  # this reg-ex will work even without a blacklist.
  # you will still be able to find a .bfilt. peak file
  # use ".*", which means ALL CHARACTERS, if you want to keep all chromosomes
  # use "chr[\dXY]+" to allow chr[NUMBERS], chrX and chrY only
  # this is important to make your final output peak file (bigBed) work with genome browsers
  REGEX_BFILT_PEAK_CHR_NAME=".*"
  # REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"

  # mitochondrial chromosome name (e.g. chrM, MT)
  MITO_CHR_NAME="chrM"
  # URL for your reference FASTA (fasta, fasta.gz, fa, fa.gz, 2bit)
  REF_FA="https://some.where.com/your.genome.fa.gz"
  # 3-col blacklist BED file to filter out overlapping peaks from b-filt peak file (.bfilt.*Peak.gz file).
  # leave it empty if you don't have one
  BLACKLIST=
fi
...

Above it, add this block:

...
elif [[ "${GENOME}" == "motrpac_rn6" ]]; then
  REGEX_BFILT_PEAK_CHR_NAME=".*"
  MITO_CHR_NAME="chrM"
  REF_FA="http://mitra.stanford.edu/montgomery/projects/motrpac/atac/SCG/motrpac_references/rn6_release96/Rattus_norvegicus.Rnor_6.0.dna.toplevel.standardized.fa.gz"
  TSS="http://mitra.stanford.edu/montgomery/projects/motrpac/atac/SCG/motrpac_references/rn6_release96/Rattus_norvegicus.Rnor_6.0.96_protein_coding.tss.bed.gz"
  BLACKLIST=
...

NOTE: The TSS reference file was generated from the Ensembl GTF using this script.

Now run the script to build the custom genome database. Specify a destination directory and install the MoTrPAC rn6 reference with the following command.

IMPORTANT: We recommend not to run this installer on a login node of your cluster. It will take >8GB memory and >2h time.

# only needed if using conda to manage Python packages
conda activate encode-atac-seq-pipeline

outdir=/path/to/reference/genome/motrpac_rn6
bash ~/ATAC_PIPELINE/atac-seq-pipeline/scripts/build_genome_data.sh motrpac_rn6 ${outdir}

4. Run the ENCODE ATAC-seq pipeline

MoTrPAC will run the ENCODE pipeline both with singletons for human samples and replicates for rat samples. In both cases, many iterations of the pipeline will need to be run for each batch of sequencing data. This repository provides scripts to automate this process, for both rat and human samples.

Running the pipeline with replicates outputs all of the same per-sample information generated by running the pipeline with a single sample but improves power for peak calling and outputs a higher-confidence peak set called using all replicates. This generates a single peak set for every exercise protocol/timepoint/tissue/sex combination in the PASS study, which will be useful for downstream analyses.

4.1 Generate configuration files

A configuration (config) file in JSON format that specifies input parameters is required to run the pipeline. Find comprehensive documentation of definable parameters here.

For PASS (rat) samples running on GCP, use src/process_batches_gcp.sh to automatically generate config files grouped by condition (protocol/timepoint/tissue/sex). This script calls src/make_json_replicates_gcp.py, which pulls sample metadata from GCS, merges with phenotype data and BIC label data, and writes one JSON config per condition group.

bash src/process_batches_gcp.sh resources/batches_combined.tsv config_output/

Use --per-batch to generate separate config files per sequencing batch instead of combining across batches:

bash src/process_batches_gcp.sh resources/batches_combined.tsv config_output/ --per-batch

The script outputs config JSON files and a config_summary.tsv with FASTQ counts, vial labels, and PIDs per config. Base pipeline parameters (genome TSV, pipeline type, etc.) are defined in examples/base.json.

The -r argument expects a reference standards metadata file (resources/Stanford_StandardReferenceMaterial.txt). Samples that have no match in the phenotype data are checked against this file by MTP_RefLabel to identify reference standards and generate their configs separately. The file can be obtained from: gs://motrpac-portal-transfer-stanford/atac-seq/rat/batch1_20191025/metadata/Stanford_StandardReferenceMaterial.txt

For singletons (CASS/human), see docs/single_config.md.

4.2 Run the pipeline

Actually running the pipeline is straightforward. However, the command is different depending on the environment in which you set up the pipeline. Refer back to environment-specific instructions here.

An atac directory containing all of the pipeline outputs is created in the output directory (note the default output directory is the current working directory). One arbitrarily-named subdirectory for each config file (assuming the command is run in a loop for several samples) is written in atac.

For GCP

You can use the submit.sh script to submit a batch of pipelines to the GCP job queue. ${JSON_DIR} is the path to all of the config files, generated in Step 4.1. Workflow IDs are written to wfids.json for use in downstream steps.

JSON_DIR=/path/to/generated/json/files
bash src/submit.sh atac-seq-pipeline/atac.wdl ${JSON_DIR} wfids.json 4

The fourth argument controls the number of jobs submitted in parallel.

For Stanford SCG/SLURM

Here is an example of bash script that submits a batch of pipelines to the Stanford SCG job queue. ${JSON_DIR} is the path to all of the config files, generated in Step 4.1:

# if using conda to manage Python packages
conda activate encode-atac-seq-pipeline

ATACSRC=~/ATAC_PIPELINE/atac-seq-pipeline
OUTDIR=/path/to/output/directory
cd ${OUTDIR}

for json in $(ls ${JSON_DIR}); do

  INPUT_JSON=${JSON_DIR}/${json}
  JOB_NAME=$(basename ${INPUT_JSON} | sed "s/\.json.*//")

  sbatch -A ${ACCOUNT} -J ${JOB_NAME} --export=ALL --mem 2G -t 4-0 --wrap "caper run ${ATACSRC}/atac.wdl -i ${INPUT_JSON}"
  sleep 60 # necessary to prevent a collision error
done

4.3 (Optional) Monitoring execution status

If you ran the pipeline on GCP, you can monitor the status of your jobs by using the get_execution_status.py script. This script will print the status of all jobs in the output directory. The argument to the script is the path of the file that you generated in Step 4.2. For example:

python src/get_execution_status.py /path/to/output/file.json

5. Organize outputs

5.1 Mount the output directory

If running on GCP, mount the Cromwell output directory to your local machine using gcsfuse. If running on SCG, you can skip this step.

sudo mkdir -p /mnt/atac_outputs && sudo chown $USER /mnt/atac_outputs

gcsfuse --implicit-dirs --dir-mode 777 --file-mode 777 \
    --only-dir results/atac-seq_pipeline/rn8/atac \
    omicspipelines-get /mnt/atac_outputs

Adjust --only-dir and the bucket name to match your output path.

5.2 Collect important outputs with `croo`

croo is a tool ENCODE developed to simplify the pipeline outputs. Use run_croo.sh to run croo on all samples in a batch using the wfids.json file generated by submit.sh. See Table 5.1 for a description of outputs generated by this process.

bash src/run_croo.sh wfids.json /mnt/atac_outputs gs://my-bucket/results/atac-seq_pipeline/rn8/pass1b-06/croo_output

If there were resubmissions due to failures, run croo in order (original wfids first, then resubmit wfids) so that successful reruns overwrite partial outputs:

bash src/run_croo.sh wfids.json           /mnt/atac_outputs ${GCS_CROO_OUT}
bash src/run_croo.sh wfids_resubmit.json  /mnt/atac_outputs ${GCS_CROO_OUT}

Croo outputs one directory per sample label directly under ${GCS_CROO_OUT}/.

Table 5.1. Important files in croo-organized ENCODE ATAC-seq pipeline output.

Subdirectory or file	Description
`qc/*`	Components of the merged QC spreadsheet (see Step 5.3)
`signal//fc.signal.bigwig`	MACS2 peak-calling signal (fold-change), useful for visualizing “read pileups” in a genome browser
`signal//pval.signal.bigwig`	MACS2 peak-calling signal (P-value), useful for visualizing “read pileups” in a genome browser. P-value track is more dramatic than the fold-change track
`align//.trim.merged.bam`	Unfiltered BAM files
`align//.trim.merged.nodup.no_chrM_MT.bam`	Filtered BAM files, used as input for peak calling
`align//.tagAlign.gz`	tagAlign files from filtered BAMs
`peak/overlap_reproducibility/overlap.optimal_peak.narrowPeak.hammock.gz`	Hammock file of `overlap` peaks, optimized for viewing peaks in a genome browser
`peak/overlap_reproducibility/overlap.optimal_peak.narrowPeak.gz`	BED file of `overlap` peaks. Generally, use this as your final peak set
`peak/overlap_reproducibility/overlap.optimal_peak.narrowPeak.bb`	bigBed file of `overlap` peaks, useful for visualizing peaks in a genome browser
`peak/idr_reproducibility/idr.optimal_peak.narrowPeak.gz`	`IDR` peaks. More conservative than `overlap` peaks

ENCODE recommends using the overlap peak sets when one prefers a low false negative rate but potentially higher false positives; they recommend using the IDR peaks when one prefers low false positive rates.

5.3 Generate a spreadsheet of QC metrics for all samples with `qc2tsv`

This is most useful if you ran the pipeline for multiple samples. Step 5.2 generates a qc/qc.json file for each pipeline run. After installing qc2tsv within the encode-atac-seq-pipeline Conda environment (pip install qc2tsv), run the following command to compile a spreadsheet with QC from all samples:

qc2tsv $(gsutil ls "gs://my-bucket/results/atac-seq_pipeline/rn8/pass1b-06/croo_output/*/qc/qc.json") \
    --collapse-header > atac_qc_pass1b-06.tsv

Table 5.2 provides definitions for a limited number of metrics included in the JSON QC reports. The full JSON report includes >100 metrics per sample; some lines are duplicates, and many metrics are irrelevant for running the pipeline with a single biological replicate.

Table 5.2. Description of relevant QC metrics.

Metric	Definition/Notes
replication.reproducibility.overlap.N_opt	Number of optimal overlap_reproducibility peaks
replication.reproducibility.overlap.opt_set	Peak set corresponding to optimal overlap_reproducibility peaks
replication.reproducibility.idr.N_opt	Number of optimal idr_reproducibility peaks
replication.reproducibility.idr.opt_set	Peak set corresponding to optimal idr_reproducibility peaks
replication.num_peaks.num_peaks	Number of peaks called in each replicate
peak_enrich.frac_reads_in_peaks.macs2.frip	Replicate-level FRiP in raw MACS2 peaks
peak_enrich.frac_reads_in_peaks.overlap.{opt_set}.frip	Many FRiP values are reported. In order to get the FRiP corresponding to the overlap_reproducibility peak set, you need to cross-reference the `replication.reproducibility.overlap.opt_set` metric with these column names to extract the appropriate FRiP. For example, if `replication.reproducibility.overlap.opt_set` is `pooled-pr1_vs_pooled-pr2`, then you need to extract the FRiP value from the `peak_enrich.frac_reads_in_peaks.overlap.pooled-pr1_vs_pooled-pr2.frip` column. See post-processing scripts to see how to do this in an automated way
peak_enrich.frac_reads_in_peaks.idr.{opt_set}.frip	Cross-reference with `replication.reproducibility.idr.opt_set`. See `peak_enrich.frac_reads_in_peaks.overlap.{opt_set}.frip`
align.samstat.total_reads	Total number of alignments* (including multimappers)
align.samstat.pct_mapped_reads	Percent of reads that mapped
align.samstat.pct_properly_paired_reads	Percent of reads that are properly paired
align.dup.pct_duplicate_reads	Fraction (not percent) of read pairs that are duplicates after filtering alignments for quality
align.frac_mito.frac_mito_reads	Fraction of reads that align to chrM after filtering alignments for quality and removing duplicates
align.nodup_samstat.total_reads	Number of alignments* after applying all filters
align.frag_len_stat.frac_reads_in_nfr	Fraction of reads in nucleosome-free-region. Should be a value greater than 0.4
align.frag_len_stat.nfr_over_mono_nuc_reads	Reads in nucleosome-free-region versus reads in mononucleosomal peak. Should be a value greater than 2.5
align.frag_len_stat.nfr_peak_exists	Does a nucleosome-free-peak exist? Should be `true`
align.frag_len_stat.mono_nuc_peak_exists	Does a mononucleosomal-peak exist? Should be `true`
align.frag_len_stat.di_nuc_peak_exists	Does a dinucleosomal-peak exist? Ideally `true`, but not condemnable if `false`
lib_complexity.lib_complexity.NRF	Non-redundant fraction. Measure of library complexity, i.e. degree of duplicates. Ideally >0.9
lib_complexity.lib_complexity.PBC1	PCR bottlenecking coefficient 1. Measure of library complexity. Ideally >0.9
lib_complexity.lib_complexity.PBC2	PCR bottlenecking coefficient 2. Measure of library complexity. Ideally >3
align_enrich.tss_enrich.tss_enrich	TSS enrichment

*Note: Alignments are per read, so for PE reads, there are two alignments per fragment if each PE read aligns once.

6. Flag problematic samples

The following metrics are not strictly exclusion criteria for MoTrPAC samples, but samples should be flagged if any of these conditions are met. Some of these metrics reflect the ENCODE ATAC-seq data standards.

Table 6.1 Criteria to flag problematic samples.

Description	In terms of Table 5.2 metrics	Comments
< 50 million filtered, non-duplicated, non-mitochondrial paired-end reads in the filtered BAM file (i.e. 25M pairs)	align.nodup_samstat.total_reads < 50M	This is the most stringent criterion and may be relaxed
Alignment rate < 80%	align.samstat.pct_mapped_reads < 80%
Fraction of reads in `overlap` peaks < 0.1	peak_enrich.frac_reads_in_peaks.overlap.{opt_set}.frip < 0.1	This is more relaxed than the ENCODE recommendation. Note that replicate-level FRiP in raw peaks can be assessed with peak_enrich.frac_reads_in_peaks.macs2.frip
Number of peaks in `overlap` peak set < 80,000	replication.reproducibility.overlap.N_opt < 80000	This is more relaxed than the ENCODE recommendation
A nucleosome-free region is not present	align.frag_len_stat.nfr_peak_exists == false	This should be enforced more strictly
A mononucleosome peak is not present	align.frag_len_stat.mono_nuc_peak_exists == false	This should be enforced more strictly
TSS enrichment < ?	align_enrich.tss_enrich.tss_enrich	This cutoff needs to be evaluated retrospectively. We will probably have tissue-specific recommendations

7. Post-processing scripts

Use the post-processing wrapper scripts to generate the QC report and to generate the final count matrix.

Wrapper scripts:

atac-post-process-pass-wrapper.sh: wrapper script to generate QC report and final count matrix for PASS (rat) samples
atac-post-process-human-wrapper.sh: wrapper script to generate QC report and final count matrix for CASS (human) samples

IMPORTANT: For each of these wrappers, make sure you fill in the appropriate variables at the top of the script before running.

7.1. Individual scripts

run_croo.sh: batch-run croo for all samples using wfids.json; outputs one label directory per sample directly under the specified GCS path
extract_atac_from_gcp_pass.sh: copy tagAlign, signal, peak, and QC files from croo output to a structured output directory for PASS (rat) samples
extract_atac_from_gcp_human.sh: equivalent script for CASS (human) samples
encode_rep_names_from_croo.py: generate rep-to-viallabel mapping from croo filetables; takes croo output path and GCS output path (no wfids file needed)
align_stats.sh: calculate % of primary alignments mapping to chrX, chrY, chrM, autosomes, and contigs; reads BAM paths from croo filetables and downloads BAMs one at a time to minimise disk usage
merge_atac_qc.R: merge wet lab metadata, ENCODE QC (qc2tsv output), rep-to-viallabel map, and alignment stats into a single QC table for PASS samples
merge_atac_qc_human.R: equivalent script for CASS (human) samples
encode_to_count_matrix_all_tissues_pass.sh: generate a peak × sample counts matrix across all tissues for a given phase

8. Troubleshooting

Common Issues

1. Error: Task raise_exception has an invalid runtime attribute memory = !! NOT FOUND !!

Cause: Missing runtime parameters in the raise_exception task in atac.wdl
Solution: Follow the instructions in section 3.1 to add the required runtime parameters

2. Error: caper: command not found

Cause: Caper is not in your PATH
Solution: Add export PATH=$PATH:~/.local/bin to ~/.bashrc and re-login

3. Undefined parameters in ~/.caper/default.conf

Cause: Required platform-specific parameters are not defined
Solution: Ensure all required parameters for your platform are defined. Do not leave any undefined.

4. Pipeline runs slowly or fails with insufficient resources

Cause: Default resource allocations may be insufficient for some samples
Solution: Edit the JSON configuration file to increase memory, CPU, or disk allocations for specific tasks

5. Call-caching not working

Cause: Database not enabled in Caper configuration
Solution: Add db=file and java-heap-run=4G to ~/.caper/default.conf

6. Cannot access outputs on GCP

Cause: Output directory not mounted
Solution: Use gcsfuse to mount the GCS bucket (see section 5.1)

7. Low number of peaks or FRiP

Cause: Poor sample quality, low sequencing depth, or experimental issues
Solution: Review QC metrics in Table 6.1 and flag problematic samples

Getting Help

If issues persist:

Check the ENCODE ATAC-seq pipeline documentation
Review the Caper documentation
Consult the MoTrPAC ATAC-seq MOP
Open an issue on this GitHub repository with:
- Description of the problem
- Error messages
- Configuration file (with sensitive data removed)
- QC metrics if available

9. Citations and References

Pipeline and Tools

ENCODE ATAC-seq pipeline: https://github.com/ENCODE-DCC/atac-seq-pipeline
Caper: https://github.com/MoTrPAC/caper - Cromwell wrapper for workflow execution
Cromwell: https://cromwell.readthedocs.io/ - Workflow management system
Croo: ENCODE tool for organizing pipeline outputs
qc2tsv: Tool for converting JSON QC to TSV format

ATAC-seq Methodology

Buenrostro JD, et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods. 2013.
Buenrostro JD, et al. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology. 2015.

Data Standards

MoTrPAC Documentation

Software Tools

bcl2fastq v2.20.0: Illumina
MACS2: Zhang Y, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biology. 2008.
Bowtie2: Langmead B and Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012.
SAMtools: Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009.

10. Contributing and Support

Reporting Issues

If you encounter bugs or have feature requests, please open an issue on the GitHub repository.

When reporting issues, please include:

Description of the problem
Steps to reproduce
Expected vs. actual behavior
Error messages or logs
Configuration file (remove sensitive information)
QC metrics if available
Platform/environment information

Contact

For questions or support related to the MoTrPAC ATAC-seq pipeline:

Open an issue on GitHub: https://github.com/MoTrPAC/motrpac-atac-seq-pipeline/issues
Contact the MoTrPAC Bioinformatics Center

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Test your changes thoroughly
Submit a pull request with a clear description of the changes

MoTrPAC Data Hub - Access to MoTrPAC datasets
MoTrPAC GitHub Organization - Other MoTrPAC analysis pipelines and tools
ENCODE ATAC-seq pipeline - Upstream ENCODE pipeline

11. Version Information

Current Version

This pipeline uses ENCODE ATAC-seq pipeline version 1.7.0 for consistency within the consortium and reproducibility outside of the consortium.

Citing This Pipeline

If you use this pipeline in your research, please cite:

Software Versions

ENCODE ATAC-seq pipeline: v1.7.0
bcl2fastq: v2.20.0 (for demultiplexing)
Python: >= 3.6
Other tool versions are managed by the ENCODE pipeline conda environment

Compatibility Notes

Platform: Supports local, SLURM, SGE, PBS, GCP, and AWS
Conda environment: encode-atac-seq-pipeline
WDL: Uses Workflow Description Language
Cromwell: Compatible with Cromwell workflow engine via Caper

Genome Databases

Human: hg38 (GRCh38) - ENCODE standard reference
Rat rn6: motrpac_rn6 (Rnor_6.0) - Custom MoTrPAC reference based on Ensembl release 96
Rat rn8: GRCr8 with Ensembl chr-prefixed chromosome naming; genome TSV at gs://omicspipelines-public-resources/atac-seq/rat/rn8/motrpac_rn8.tsv

Change History

Major updates and changes are documented in the repository’s commit history. Check the commit history for detailed changes.

Known Limitations

Pipeline requires modification to atac.wdl (raise_exception task) for proper execution
Custom rn6 genome database requires manual configuration in build_genome_data.sh
TSS enrichment cutoffs are tissue-specific and need retrospective evaluation
Some QC thresholds are more relaxed than ENCODE standards to accommodate MoTrPAC sample characteristics