Analyze genome tiles — analyze_tile • MotrpacRatTraining6mo

A wrapper function that analyzes a tile of loci in the genome to obtain correlated clusters.

Usage

analyze_tile(
  tile_name,
  tile_l,
  M,
  min_cor = 0.7,
  inflations = c(3, 2.5, 2, 1.5),
  plotcorr = FALSE
)

Arguments

tile_name: A character. The name of the current window for analysis.
tile_l: A named list. tile_name must be an item in the list. Each entry contains the M matrix row indices of the loci of the window.
M: A numeric matrix. Contains the "M" methylation values. Rows are loci, columns are samples.
min_cor: A number. The minimal correlation to be considered for connecting a pair of loci.
inflations: A numeric vector. Internal parameters of the MCL algorithm, see Details.
plotcorr: A logical. TRUE: plot the correlation matrix of the window (useful for analysis of specific windows).

Value

A character vector. Names are loci (M matrix rows), entries are the names of the loci clusters.

Details

This function implements clustering analysis of a specific window in the genome. We observed that often adjacent loci in RRBS data may manifest low correlations. This function can be used for identified a set of highly correlated loci within a window, which can then be used to extract new genomic features for downstream analysis. RRBS data have two columns per sample: "Un" (unmethylated) and "Me"(methylated). This function takes the "M matrix" which contains the integrated values of these columns (see examples). This matrix is used for computing the correlations of the loci, and then we use the Markov Clustering algorithm (MCL) for identifying homogeneous clusters within a window.

MCL (Markov clustering) details from https://www.micans.org/mcl/intro.html. MCL: Expansion coincides with taking the power of a stochastic matrix using the normal matrix product (i.e. matrix squaring). Inflation corresponds with taking the Hadamard power of a matrix (taking powers entrywise), followed by a scaling step, such that the resulting matrix is stochastic again, i.e. the matrix elements (on each column) correspond to probability values. Inflation parameter: strengthen intra-region connections and promote cluster homogeneity.

Examples

if (FALSE) { # \dontrun{
# Raw data by tissue are available as RData files in Google Cloud Storage with the following URLs:   
# Brown adipose:
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_BAT_RAW_DATA.rda
# Heart: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_HEART_RAW_DATA.rda
# Hippocampus: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_HIPPOC_RAW_DATA.rda
# Kidney: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_KIDNEY_RAW_DATA.rda
# Lung: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_LUNG_RAW_DATA.rda
# Liver: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_LIVER_RAW_DATA.rda
# Gastrocnemius: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_SKMGN_RAW_DATA.rda
# White adipose: 
# https://storage.googleapis.com/motrpac-rat-training-6mo-extdata/METHYL_WATSC_RAW_DATA.rda

# load_methyl_raw_data() can be used to download data for a tissue, e.g.,
# download the gastrocnemius data and load the data object into this session:
yall = load_methyl_raw_data("SKM-GN")

# remove control samples
is_sample = grepl("^9",colnames(yall),perl=TRUE)
yall = yall[,is_sample]

# Filtering unassembled chromosomes
keep <- rep(TRUE, nrow(yall))
Chr <- as.character(yall$genes$Chr)
keep[ grep("random",Chr) ] <- FALSE
keep[ grep("chrUn",Chr) ] <- FALSE
# remove non-chr ones
keep[!grepl("chr",Chr) ] <- FALSE
# remove M chromosome (otherwise we get error when assigning annotation below)
keep[Chr=="chrM"] <- FALSE
# keep.lib.sizes=FALSE causes the library sizes to be recomputed
yall <- yall[keep,, keep.lib.sizes=FALSE]
# rat genome chromosomes:
ChrNames <- paste0("chr",c(1:20,"X","Y"))
yall$genes$Chr <- factor(yall$genes$Chr, levels=ChrNames)
o <- order(yall$genes$Chr, yall$genes$Locus)
yall <- yall[o,]
print(paste("Counts matrix dim before low counts filter:"))
print(dim(yall))
gc()

# Remove low count features
# The analysis needs to be restricted to CpG sites that have enough coverage for the
# methylation level to be measurable in a meaningful way at that site. 
# As a conservative rule of thumb, we require a site to have a total count
# (both methylated and unmethylated) of at least 8 in every sample.
Coverage <- yall$counts[,grepl("-Me",colnames(yall$counts))] + 
     yall$counts[, grepl("-Un",colnames(yall$counts))]
min_count_coverage = 10
min_per_samples = 1
# see description of min_count_coverage and min_per_samples above
keep <- rowSums(Coverage >= min_count_coverage) >= min_per_samples*ncol(Coverage)
# filter the data
yall <- yall[keep,, keep.lib.sizes=FALSE]
print(paste("Counts matrix dim after low counts filter:"))
print(dim(yall))

# get locations, gene ids, etc
TSS <- edgeR::nearestTSS(yall$genes$Chr, yall$genes$Locus, species="Rn")
yall$genes$EntrezID <- TSS$gene_id
yall$genes$Symbol <- TSS$symbol
yall$genes$Strand <- TSS$strand
yall$genes$Distance <- TSS$distance
yall$genes$Width <- TSS$width

# Now we are ready to cluster the genome
wsize = 500
# step 1: define the genome-level tiles
chrs = as.character(yall$genes$Chr)
pos = round(yall$genes$Locus/wsize)
tiles = paste(chrs,pos,sep="-")
names(tiles) = rownames(yall$counts)
# step 2: add the M matrix for correlations
Me <- yall$counts[, grepl("-Me",colnames(yall$counts))]
Un <- yall$counts[, grepl("-Un",colnames(yall$counts))]
M <- log2(Me + 2) - log2(Un + 2)
# step 3: analyze each tile
tile_l = split(1:nrow(M),tiles)
tile_l = tile_l[unique(tiles)] # keep the genome order
# parallel comp
if(require("parallel")){
  tile_clusters = data.table::copy(tiles)
  for(chr in unique(chrs)){
    print(chr)
    curr_chr_inds = grepl(paste0(chr,"-"),tiles)
    chr_tiles = unique(tiles[curr_chr_inds])
    chr_tile_l = tile_l[chr_tiles]
    print(system.time({
      chr_tile_clusters = parallel::mclapply(chr_tiles,
                                             analyze_tile,
                                             tile_l = chr_tile_l,
                                             M = M, 
                                             mc.cores = 5)
      chr_tile_clusters = unlist(chr_tile_clusters)
      tile_clusters[names(chr_tile_clusters)] = chr_tile_clusters}))
    gc()
  }
}
yall = merge_sites_by_clusters(yall,tile_clusters)

# Data normalization
# A key difference between BS-seq and other sequencing data is that the pair of libraries 
# holding the methylated and unmethylated reads for a particular sample are treated as a unit.
# To ensure that the methylated and unmethylated reads for the same sample are treated on the
# same scale, we need to set the library sizes to be equal for each pair of libraries. 
# We set the library sizes for each sample to be the average of the total read counts for 
# the methylated and unmethylated libraries.
# Other normalization methods developed for RNA-seq data, 
# such as TMM, are not required for BS-seq data.
TotalLibSize <- yall$samples$lib.size[grepl("-Me",colnames(yall$counts))] + 
  yall$samples$lib.size[grepl("-Un",colnames(yall$counts))]
yall$samples$lib.size <- rep(TotalLibSize, each=2)
} # }