Bayesian graphical clustering
Source:R/bayesian_graphical_clustering.R
bayesian_graphical_clustering.Rd
Perform Bayesian graphical clustering of the differential analysis results using repfdr
Usage
bayesian_graphical_clustering(
zscores,
min_analyte_posterior_thr = 0.5,
min_prior_for_config = 0.001,
df = 20,
naive_edge_sets = TRUE
)
Arguments
- zscores
A numeric matrix. Rows are analytes, columns are conditions (e.g., male week 1). Entries are z-scores.
- min_analyte_posterior_thr
A number. The minimal value of sum of posteriors for including an analyte.
- min_prior_for_config
A number. The minimal prior probability on the configuration priors.
- df
Integer. Parameter for
repfdr::ztobins()
. Degrees of freedom for fitting the estimated density f(z).- naive_edge_sets
A logical. TRUE: edge sets are extracted by taking simple intersection of nodes. FALSE: edge sets are extracted from the repfdr solution.
Value
A named list with three items:
node_sets
named list of node sets
edge_sets
named list of edge sets
repfdr_results
repfdr EM solution
Details
This function is highly specialized for the MoTrPAC data. It assumes that the input zscore matrix has eight columns.
Four columns for male z-scores (weeks 1, 2, 4, 8), and four for females. No NA values in the input matrix.
The function first runs the Bayesian clustering using repfdr
(see repfdr_wrapper()
) and then extracts the node and
edge sets of the graphical solution.
A node is a time-point specific state. It can represent null features (z-score around zero, no effect), up-regulated
features (positive z-score), or down regulated features (negative z-scores).
An edge represents interaction among two adjacent time point. For example, it can represent features that are up-regulated
in week 1 in both males and females, and null in both male and females by week 2.
min_analyte_posterior_thr
is used to make sure that features without any reasonable fit in the clustering solution will be discarded,
whereas min_prior_for_config
removes clusters with extremely low prior (i.e., do not have any associated features).
Formally, let zscores be the timewise nxm matrix of z-scores over n analytes and m conditions, where m is even and has the same time points for males and females. Each z-score corresponds to the effect of training compared to untrained baseline. We assume that z-scores have a latent configuration in -1,0,1, with -1 denotes down-regulation, 0 denotes null (no change), and 1 denotes up-regulation. Let h denote the latent configuration of an analyte over all m conditions. We use the expectation-maximization (EM) process of the repfdr algorithm (Heller and Yekutieli, 2014) to estimate the prior π(h) and posterior Pr(h|z_i), for every analyte i. Once the algorithm converges, we discard configurations h with π(h) < min_analyte_posterior_thr and normalize Pr(h|zi) to sum to 1 (i.e., all posteriors given the same zi). The new posteriors that can be interpreted as a soft clustering solution, where the greater the value is, the more likely it is for analyte i to participate in cluster h.
Given the posteriors Pr(h|zi), we assign analytes to nine possible states in each time point (male or females x time points 1w, 2w,4w, and 8w). Analyte i belongs to a state if the sum of posteriors that are consistent with that state is > min_analyte_posterior_thr. For every pair of states from adjacent time points j and j+1 we define their edge set as the set of analytes whose differential expression pattern is consistent with the two nodes.. Thus, the node and edge sets explained above together define a tree structure that represent different differential patterns over sex and time.
Naming format for the output:
Node set names: {week}_F{differential regulation status}_M{differential regulation status}
.
Example 1: 1w_F1_M0
means up-regulation in females week 1 and null (zero effect) in males in week 1.
Example 2: 1w_F1_M-1
means up-regulation in females week 1 and down-regulation in males in week 1.
The edge set object is a named list of string vectors. The name of an edge is [node_id]---[node_id]
.
Examples
if (FALSE) { # \dontrun{
### Example 1: Simulate data with a single cluster
zcolnames = c(
paste("female",c("1w","2w","4w","8w"),sep="_"),
paste("male",c("1w","2w","4w","8w"),sep="_")
)
zscores = matrix(rnorm(80000), ncol=8, dimnames = list(1:10000,zcolnames))
# now add a cluster with a strong signal
zscores[1:500,1:4] = zscores[1:500,1:4] + 5
# Run the clustering solution wrapper.
# When the data are "clean" (e.g., a mixture of two gaussians), we do not
# need a high df in the two-groups model estimation
# (default is 20, consider at least 10 when analyzing real data)
clustering_sol = bayesian_graphical_clustering(zscores, df=5)
# check if the clustering solution correctly assigns the first 500 rows
# (with high prob) to the right nodes
length(intersect(1:500,clustering_sol$node_sets$`1w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`2w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`4w_F1_M0`))/500 > 0.95
length(intersect(1:500,clustering_sol$node_sets$`8w_F1_M0`))/500 > 0.95
# examine the node set sizes
sapply(clustering_sol$node_sets,length)
# examine the edge set sizes
sapply(clustering_sol$edge_sets,length)
# extract the top full trajectories in the data
# these should be the clusters with at least 10 features
min_cluster_size = 10
get_trajectory_sizes_from_edge_sets(clustering_sol$edge_sets,min_size = min_cluster_size)
### Example 2: real data
data(REPFDR_INPUTS, package="MotrpacRatTraining6moData")
zscores = REPFDR_INPUTS$zs_smoothed
rat_data_clustering_sol = bayesian_graphical_clustering(zscores)
# extract the largest trajectories
get_trajectory_sizes_from_edge_sets(rat_data_clustering_sol$edge_sets, min_size = 50)
# plot the top trajectories of the muscle tissues, color edges by tissue
get_tree_plot_for_tissue(tissues = c("SKM-GN","HEART","SKM-VL"),
omes = "TRNSCRPT",
node_sets = rat_data_clustering_sol$node_sets,
edge_sets = rat_data_clustering_sol$edge_sets,
min_size = 20,
parallel_edges_by_tissue = TRUE,
max_trajectories = 3)
} # }