Skip to contents

Perform Bayesian graphical clustering of the differential analysis results using repfdr

Usage

bayesian_graphical_clustering(
  zscores,
  min_analyte_posterior_thr = 0.5,
  min_prior_for_config = 0.001,
  df = 20,
  naive_edge_sets = TRUE
)

Arguments

zscores

A numeric matrix. Rows are analytes, columns are conditions (e.g., male week 1). Entries are z-scores.

min_analyte_posterior_thr

A number. The minimal value of sum of posteriors for including an analyte.

min_prior_for_config

A number. The minimal prior probability on the configuration priors.

df

Integer. Parameter for repfdr::ztobins(). Degrees of freedom for fitting the estimated density f(z).

naive_edge_sets

A logical. TRUE: edge sets are extracted by taking simple intersection of nodes. FALSE: edge sets are extracted from the repfdr solution.

Value

A named list with three items:

node_sets

named list of node sets

edge_sets

named list of edge sets

repfdr_results

repfdr EM solution

Details

This function is highly specialized for the MoTrPAC data. It assumes that the input zscore matrix has eight columns. Four columns for male z-scores (weeks 1, 2, 4, 8), and four for females. No NA values in the input matrix. The function first runs the Bayesian clustering using repfdr (see repfdr_wrapper()) and then extracts the node and edge sets of the graphical solution.

A node is a time-point specific state. It can represent null features (z-score around zero, no effect), up-regulated features (positive z-score), or down regulated features (negative z-scores). An edge represents interaction among two adjacent time point. For example, it can represent features that are up-regulated in week 1 in both males and females, and null in both male and females by week 2. min_analyte_posterior_thr is used to make sure that features without any reasonable fit in the clustering solution will be discarded, whereas min_prior_for_config removes clusters with extremely low prior (i.e., do not have any associated features).

Formally, let zscores be the timewise nxm matrix of z-scores over n analytes and m conditions, where m is even and has the same time points for males and females. Each z-score corresponds to the effect of training compared to untrained baseline. We assume that z-scores have a latent configuration in -1,0,1, with -1 denotes down-regulation, 0 denotes null (no change), and 1 denotes up-regulation. Let h denote the latent configuration of an analyte over all m conditions. We use the expectation-maximization (EM) process of the repfdr algorithm (Heller and Yekutieli, 2014) to estimate the prior π(h) and posterior Pr(h|z_i), for every analyte i. Once the algorithm converges, we discard configurations h with π(h) < min_analyte_posterior_thr and normalize Pr(h|zi) to sum to 1 (i.e., all posteriors given the same zi). The new posteriors that can be interpreted as a soft clustering solution, where the greater the value is, the more likely it is for analyte i to participate in cluster h.

Given the posteriors Pr(h|zi), we assign analytes to nine possible states in each time point (male or females x time points 1w, 2w,4w, and 8w). Analyte i belongs to a state if the sum of posteriors that are consistent with that state is > min_analyte_posterior_thr. For every pair of states from adjacent time points j and j+1 we define their edge set as the set of analytes whose differential expression pattern is consistent with the two nodes.. Thus, the node and edge sets explained above together define a tree structure that represent different differential patterns over sex and time.

Naming format for the output: Node set names: {week}_F{differential regulation status}_M{differential regulation status}. Example 1: 1w_F1_M0 means up-regulation in females week 1 and null (zero effect) in males in week 1. Example 2: 1w_F1_M-1 means up-regulation in females week 1 and down-regulation in males in week 1. The edge set object is a named list of string vectors. The name of an edge is [node_id]---[node_id].

Examples

if (FALSE) {
### Example 1: Simulate data with a single cluster
zcolnames = c(
  paste("female",c("1w","2w","4w","8w"),sep="_"),
  paste("male",c("1w","2w","4w","8w"),sep="_")
)
zscores = matrix(rnorm(80000), ncol=8, dimnames = list(1:10000,zcolnames))
# now add a cluster with a strong signal 
zscores[1:500,1:4] = zscores[1:500,1:4] + 5 

# Run the clustering solution wrapper.
# When the data are "clean" (e.g., a mixture of two gaussians), we do not
# need a high df in the two-groups model estimation
# (default is 20, consider at least 10 when analyzing real data)
clustering_sol = bayesian_graphical_clustering(zscores, df=5)
# check if the clustering solution correctly assigns the first 500 rows 
# (with high prob) to the right nodes
length(intersect(1:500,clustering_sol$node_sets$`1w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`2w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`4w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`8w_F1_M0`))/500 > 0.95

# examine the node set sizes
sapply(clustering_sol$node_sets,length)
# examine the edge set sizes
sapply(clustering_sol$edge_sets,length)

# extract the top full trajectories in the data
# these should be the clusters with at least 10 features
min_cluster_size = 10
get_trajectory_sizes_from_edge_sets(clustering_sol$edge_sets,min_size = min_cluster_size)

### Example 2: real data
data(REPFDR_INPUTS, package="MotrpacRatTraining6moData")
zscores = REPFDR_INPUTS$zs_smoothed
rat_data_clustering_sol = bayesian_graphical_clustering(zscores)
# extract the largest trajectories
get_trajectory_sizes_from_edge_sets(rat_data_clustering_sol$edge_sets, min_size = 50)
# plot the top trajectories of the muscle tissues, color edges by tissue
get_tree_plot_for_tissue(tissues = c("SKM-GN","HEART","SKM-VL"), 
                         omes = "TRNSCRPT",
                         node_sets = rat_data_clustering_sol$node_sets,
                         edge_sets = rat_data_clustering_sol$edge_sets,
                         min_size = 20,
                         parallel_edges_by_tissue = TRUE,
                         max_trajectories = 3)
}