Compositional Data Analysis

Compositional data analysis was introduced by Aitchison in 1982[1]. We also make use of material from Boogaart (2013)[2].

x <- FIBER_TYPES %>% 
  # Relevel fiber types for ilr later
  mutate(type = factor(type, levels = rev(levels(type)))) %>%
  arrange(type) %>% 
  nest(data = c(everything(), -muscle), .by = muscle) %>% 
  deframe() %>% 
  map(.f = function(xi) {
    xi %>% 
      # total fiber count is only included for illustrative purposes. It is not
      # necessary
      pivot_wider(id_cols = c(pid, age, sex, group, total_fiber_count),
                  names_from = type,
                  names_prefix = "t",
                  values_from = fiber_count) %>% 
      as.data.frame() %>% 
      `rownames<-`(.$pid)
  })
# x is a named list of data.frames - one per muscle

map(x, ~ head(.x))
#> $LG
#>               pid age    sex group total_fiber_count tIIb tIIx tIIa  tI
#> 10043799 10043799  6M Female   SED              8356 4905 2165  613 673
#> 10044337 10044337  6M Female   SED              7016 3290 2375  776 575
#> 10044922 10044922  6M Female   SED              8217 4186 2238 1019 774
#> 10045813 10045813  6M Female   SED              7495 3883 1997  741 874
#> 10046038 10046038  6M Female   SED              7970 4212 2377  832 549
#> 10046380 10046380  6M Female   SED              6701 3993 1706  732 270
#> 
#> $MG
#>               pid age    sex group total_fiber_count tIIb tIIx tIIa  tI
#> 10043799 10043799  6M Female   SED              5461 2613 1506  573 769
#> 10044337 10044337  6M Female   SED              5789 2767 1648  694 680
#> 10044922 10044922  6M Female   SED              5632 2442 1794  827 569
#> 10045813 10045813  6M Female   SED              5596 3046 1414  585 551
#> 10046038 10046038  6M Female   SED              5974 2558 2056  744 616
#> 10046380 10046380  6M Female   SED              7167 3627 2029  711 800
#> 
#> $PL
#>               pid age    sex group total_fiber_count tIIb tIIx tIIa  tI
#> 10043799 10043799  6M Female   SED              1667  515  565  441 146
#> 10044337 10044337  6M Female   SED              1152  405  426  258  63
#> 10044922 10044922  6M Female   SED              2021  522  678  595 226
#> 10045813 10045813  6M Female   SED              2080  535  798  576 171
#> 10046038 10046038  6M Female   SED              2175  612  849  557 157
#> 10046380 10046380  6M Female   SED              2014  449  798  514 253
#> 
#> $SOL
#>               pid age    sex group total_fiber_count tIIa   tI
#> 10043799 10043799  6M Female   SED              1954  100 1854
#> 10044337 10044337  6M Female   SED              1810  122 1688
#> 10044922 10044922  6M Female   SED              1708  272 1436
#> 10045813 10045813  6M Female   SED              1810  192 1618
#> 10046038 10046038  6M Female   SED              1699   89 1610
#> 10046380 10046380  6M Female   SED              1849  132 1717

We will look at the first 6 rows of the fiber count matrix for each muscle. The fiber counts are elements of the positive orthant \(\mathbb{P}^4\) (LG, MG, PL) or \(\mathbb{P}^2\) (SOL). The orthant is “the analogue in n-dimensional Euclidean space of a quadrant in the plane or an octant in three dimensions. In two dimensions, there are four orthants (called quadrants)” (Wikipedia - Orthant).

Y <- map(x, .f = function(yi) {
  yi %>% 
    dplyr::select(starts_with("tI")) %>% 
    as.matrix()
})

map(Y, head)
#> $LG
#>          tIIb tIIx tIIa  tI
#> 10043799 4905 2165  613 673
#> 10044337 3290 2375  776 575
#> 10044922 4186 2238 1019 774
#> 10045813 3883 1997  741 874
#> 10046038 4212 2377  832 549
#> 10046380 3993 1706  732 270
#> 
#> $MG
#>          tIIb tIIx tIIa  tI
#> 10043799 2613 1506  573 769
#> 10044337 2767 1648  694 680
#> 10044922 2442 1794  827 569
#> 10045813 3046 1414  585 551
#> 10046038 2558 2056  744 616
#> 10046380 3627 2029  711 800
#> 
#> $PL
#>          tIIb tIIx tIIa  tI
#> 10043799  515  565  441 146
#> 10044337  405  426  258  63
#> 10044922  522  678  595 226
#> 10045813  535  798  576 171
#> 10046038  612  849  557 157
#> 10046380  449  798  514 253
#> 
#> $SOL
#>          tIIa   tI
#> 10043799  100 1854
#> 10044337  122 1688
#> 10044922  272 1436
#> 10045813  192 1618
#> 10046038   89 1610
#> 10046380  132 1717

If we close the compositions (scale the fiber counts in each row so they sum to 1, rather than their total fiber count), we end up with fiber type proportions. These are elements of the positive simplex \(\mathbb{S}^4\) (LG, MG, PL) or \(\mathbb{S}^2\) (SOL). Note that this closure operation is not required by the transformation in the next code chunk. These proportions are just for investigative purposes. We will use the compositions::clo function.

Y_clo <- map(Y, clo) # close the compositions (convert to proportions)

map(Y_clo, ~ round(head(.x), digits = 3))
#> $LG
#>           tIIb  tIIx  tIIa    tI
#> 10043799 0.587 0.259 0.073 0.081
#> 10044337 0.469 0.339 0.111 0.082
#> 10044922 0.509 0.272 0.124 0.094
#> 10045813 0.518 0.266 0.099 0.117
#> 10046038 0.528 0.298 0.104 0.069
#> 10046380 0.596 0.255 0.109 0.040
#> 
#> $MG
#>           tIIb  tIIx  tIIa    tI
#> 10043799 0.478 0.276 0.105 0.141
#> 10044337 0.478 0.285 0.120 0.117
#> 10044922 0.434 0.319 0.147 0.101
#> 10045813 0.544 0.253 0.105 0.098
#> 10046038 0.428 0.344 0.125 0.103
#> 10046380 0.506 0.283 0.099 0.112
#> 
#> $PL
#>           tIIb  tIIx  tIIa    tI
#> 10043799 0.309 0.339 0.265 0.088
#> 10044337 0.352 0.370 0.224 0.055
#> 10044922 0.258 0.335 0.294 0.112
#> 10045813 0.257 0.384 0.277 0.082
#> 10046038 0.281 0.390 0.256 0.072
#> 10046380 0.223 0.396 0.255 0.126
#> 
#> $SOL
#>           tIIa    tI
#> 10043799 0.051 0.949
#> 10044337 0.067 0.933
#> 10044922 0.159 0.841
#> 10045813 0.106 0.894
#> 10046038 0.052 0.948
#> 10046380 0.071 0.929

Now, we use the compositions::ilr function to perform the isometric log-ratio (ilr) transformation and convert the proportions to elements in \(\mathbb{R}^3\) (LG, MG, PL) or \(\mathbb{R}\) (SOL) according to the sequential binary partition (SBP) {I ||| IIa || IIx | IIb}. That is, we can represent the data in 3-dimensional real space or as values on the real number line. These coordinates are called balances, which are defined as “the natural logarithm of the ratio of geometric means of the parts in each group, normalised by a coefficient to guarantee unit length of the vectors of the basis”[3].

The following balances are generated

  • \(b1 = \frac{1}{\sqrt{2}} log \left( \frac{I}{g(IIa, IIx, IIb)} \right)\) or \(b1 = log \left( \frac{I}{IIa} \right)\) (SOL)
  • \(b2 = \sqrt{\frac{2}{3}} log \left( \frac{IIa}{g(IIx, IIb)} \right)\)
  • \(b3 = log \left( \frac{IIx}{IIb} \right)\)

where \(g(\cdot)\) is the geometric mean (the \(n\)th root of the product of \(n\) values). This is also referred to as the “center” of the group of fiber types.

Y_ilr <- map(Y_clo, function(y) {
  nc <- ncol(y)
  
  # Reverse the columns of the basis V to create the above SBPs
  V <- ilrBase(D = nc)[, (nc - 1):1]
  y_ilr <- ilr(y, V = V)
  
  # Remove the "rmult" class, or rstandard will not work for SOL
  class(y_ilr) <- NULL
  
  y_ilr <- as.matrix(y_ilr)
  
  return(y_ilr)
})
# List of matrices containing ilr-transformed coordinates

map(Y_ilr, ~ round(head(.x), digits = 3))
#> $LG
#>            [,1]   [,2]   [,3]
#> 10043799 -0.884 -1.364 -0.578
#> 10044337 -1.000 -1.046 -0.230
#> 10044922 -0.873 -0.898 -0.443
#> 10045813 -0.621 -1.081 -0.470
#> 10046038 -1.131 -1.091 -0.405
#> 10046380 -1.598 -1.038 -0.601
#> 
#> $MG
#>            [,1]   [,2]   [,3]
#> 10043799 -0.462 -1.014 -0.390
#> 10044337 -0.667 -0.918 -0.366
#> 10044922 -0.860 -0.758 -0.218
#> 10045813 -0.783 -1.034 -0.543
#> 10046038 -0.813 -0.919 -0.154
#> 10046380 -0.671 -1.093 -0.411
#> 
#> $PL
#>            [,1]   [,2]  [,3]
#> 10043799 -1.074 -0.164 0.066
#> 10044337 -1.496 -0.389 0.036
#> 10044922 -0.838  0.000 0.185
#> 10045813 -1.125 -0.103 0.283
#> 10046038 -1.246 -0.211 0.231
#> 10046380 -0.702 -0.124 0.407
#> 
#> $SOL
#>           [,1]
#> 10043799 2.065
#> 10044337 1.858
#> 10044922 1.176
#> 10045813 1.507
#> 10046038 2.047
#> 10046380 1.814

Now that we have multivariate Normal data, we can use multivariate multiple regression. Here, each response is a column of the matrices shown above. We will include categorical predictors age, sex, group, and their interactions to start.

# Multivariate multiple regression models for each muscle
fit.count <- map2(.x = x, .y = Y_ilr, .f = function(x, y) {
  lm(y ~ age * sex * group, data = x)
})
# map(fit.count, summary) # lm model summary for each ilr column and muscle

For each model, we have measurements from 48 animals (6 biological replicates per group). We lose 1 degree of freedom to estimate the intercept (mean of SED 6M Female group), 3 degrees of freedom to estimate each of the main effects, another 3 to estimate the 2-way interactions, and 1 to estimate the 3-way interaction. Therefore, we have 48 - 8 = 40 residual degrees of freedom.

We will check the model diagnostic plots. First, we will check for homoscedasticity.

We will first look at boxplots of the standardized residuals to check for Normality of the residuals for each response (fiber type). We are using an rstandard.mlm method from https://stackoverflow.com/a/39768104 included in this package to calculate the standardized residuals.

Since I prefer the ggplot2 package to base plotting, we will transform the data to a longer format and then stack the data for all four muscles.

# Standardized residuals and fitted values
resid_list <- map(fit.count[1:3], rstandard.mlm)
fitted_list <- map(fit.count[1:3], fitted)

# Prepare data for plotting
plot_df <- list("resid" = resid_list,
                "fitted" = fitted_list) %>% 
  list_transpose() %>% 
  map(.f = function(.x) {
    map(names(.x), .f = function(name_i) {
      .x[[name_i]] %>%
        as.data.frame() %>%
        rownames_to_column("pid") %>%
        pivot_longer(cols = -pid,
                     names_to = "balance",
                     values_to = name_i)
    }) %>% 
      purrr::reduce(.f = left_join, by = c("pid", "balance")) %>%
      mutate(across(.cols = c(fitted, resid), 
                    .f = ~ signif(as.numeric(.x), 3)))
  }) %>% 
  enframe(name = "muscle") %>% 
  unnest(value) %>% 
  mutate(balance = sub("V", "b", balance),
         balance = ifelse(!grepl("b", balance), "b1", balance))
# Plot residuals vs. fitted for each muscle and fiber type
ggplot(plot_df, aes(x = fitted, y = resid)) +
  geom_point(shape = 21, size = 2) +
  stat_smooth(method = "loess", formula = y ~ x,
              method.args = list(degree = 2),
              se = FALSE, col = 2) +
  # Empirical rule: ~99.7% of data should be within 3 SD of the mean
  geom_hline(yintercept = c(-3, +3), lty = "dashed") +
  facet_wrap(~ muscle + balance, scales = "free", ncol = 3) +
  labs(x = "Fitted Values", y = "Standardized Residuals",
       title = "Residuals vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.6)))

There is an outlier in the MG \(b1\) panel, and the MG \(b3\) panel seems to display non-constant variance of the residuals. We will check the scale-location plots.

# Plot scale vs. location plot for each muscle and fiber type
ggplot(plot_df, aes(x = fitted, y = sqrt(abs(resid)))) +
  geom_point(shape = 21, size = 2) +
  stat_smooth(method = "loess", formula = y ~ x,
              method.args = list(degree = 2),
              se = TRUE, col = 2) +
  facet_wrap(~ muscle + balance, scales = "free", ncol = 3) +
  labs(x = "Fitted", y = "sqrt(|Standardized Residuals|)",
       title = "Scale-Location") +
  coord_cartesian(ylim = c(0, NA), expand = 5e-3) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.6)))

There is increasing variance in the \(b3\) panels of the MG and PL. Keep in mind that there are only 6 samples per group.

We will check the Normality of the residuals with Q-Q plots.

ggplot(plot_df, aes(sample = resid)) +
  geom_qq(shape = 21, size = 2) +
  geom_qq_line(lty = 2, col = 2, linewidth = 1) +
  facet_wrap(~ muscle + balance, scales = "free", ncol = 3) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles",
       title = "Normal Q-Q Plot")

There are a few outliers, though they should not affect the results too much. We will continue with hypothesis testing.

Hypothesis Testing

## Estimated marginal means
FIBER_COUNT_EMM <- map(fit.count, function(mod_i) {
  by <- c("age", "sex")
  mult.name <- NULL
  
  if ("mlm" %in% class(mod_i)) {
    mult.name <- "balance"
    by <- c(by, mult.name)
  }
  
  emmeans(object = mod_i, specs = "group", by = by, 
          mult.name = mult.name, infer = TRUE)
})

The values in the emmean column are the estimated marginal means of the ilr coordinates (balances) for each sequential binary partition (SBP):

  • \(b1\): \(\frac{1}{\sqrt2} log \left( \frac{I}{g(IIa, IIx, IIb)} \right)\) or \(\frac{1}{\sqrt2} log \left( \frac{I}{IIa} \right)\) (SOL)
  • \(b2\): \(\sqrt{\frac{2}{3}} log \left( \frac{IIa}{g(IIx, IIb)} \right)\)
  • \(b3\): \(log \left( \frac{IIx}{IIb} \right)\)

The P-values are testing whether the means of the groups are different from 0, which is not of interest, in this case. We want to know if there are differences between the 8W and SED groups.

# Extract model info
model_df <- fit.count %>% 
  map_chr(.f = ~ paste(deparse(.x[["call"]]), collapse = "")) %>% 
  enframe(name = "muscle", value = "model") %>% 
  mutate(model = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", model, perl = TRUE),
         model_type = "mlm",
         formula = sub(".*formula = ([^,]+),.*", "\\1", model),
         formula = sub("y", "ilr(cbind(tI, tIIa, tIIx, tIIb))", formula),
         formula = ifelse(muscle == "SOL", 
                          sub(", tIIx, tIIb", "", formula), 
                          formula),
         muscle = factor(muscle, levels = c("LG", "MG", "PL", "SOL"))) %>% 
  dplyr::select(-model)

We will proceed with specific comparisons of the subcompositions.

## Trained vs. SED comparisons by age, sex, muscle, and fiber type
FIBER_COUNT_STATS <- map(FIBER_COUNT_EMM, function(emm_i) {
  contrast(object = emm_i, method = "trt.vs.ctrl", infer = TRUE) %>% 
    as.data.frame()
}) %>%
  enframe(name = "muscle") %>% 
  unnest(value) %>%
  # Adjust p-values across balances within each combination of age, sex, and
  # muscle
  group_by(age, sex, muscle) %>% 
  mutate(p.adj = p.adjust(p.value, method = "holm"),
         signif = p.adj < 0.05) %>% 
  ungroup() %>% 
  pivot_longer(cols = estimate,
               names_to = "estimate_type",
               values_to = "estimate", 
               values_drop_na = TRUE) %>% 
  pivot_longer(cols = contains(".ratio"), 
               names_to = "statistic_type", 
               values_to = "statistic", 
               values_drop_na = TRUE) %>% 
  mutate(statistic_type = sub("\\.ratio", "", statistic_type),
         estimate_type = ifelse(estimate_type == "estimate", 
                                "difference of means", estimate_type),
         muscle = factor(muscle, levels = c("LG", "MG", "PL", "SOL")),
         response = "Fiber Count",
         balance = ifelse(is.na(balance), 1, balance)) %>% 
  left_join(model_df, by = "muscle") %>% 
  # Reorder columns
  dplyr::select(response, age, sex, muscle, balance, contrast, estimate_type,
                estimate, SE, lower.CL, upper.CL, statistic_type, statistic, df,
                p.value, p.adj, everything()) %>%
  arrange(age, sex, muscle, balance) %>% 
  as.data.frame()

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] emmeans_1.10.1                        
#> [2] ggplot2_3.5.1                         
#> [3] tibble_3.2.1                          
#> [4] purrr_1.0.2                           
#> [5] tidyr_1.3.1                           
#> [6] dplyr_1.1.4                           
#> [7] compositions_2.0-8                    
#> [8] MotrpacRatTrainingPhysiologyData_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5      beeswarm_0.4.0    tensorA_0.36.2.1  xfun_0.43        
#>  [5] bslib_0.7.0       htmlwidgets_1.6.4 rstatix_0.7.2     lattice_0.22-6   
#>  [9] vctrs_0.6.5       tools_4.4.0       generics_0.1.3    fansi_1.0.6      
#> [13] highr_0.10        DEoptimR_1.1-3    pkgconfig_2.0.3   Matrix_1.7-0     
#> [17] desc_1.4.3        lifecycle_1.0.4   farver_2.1.1      compiler_4.4.0   
#> [21] textshaping_0.3.7 munsell_0.5.1     carData_3.0-5     vipor_0.4.7      
#> [25] htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.8        pillar_1.9.0     
#> [29] pkgdown_2.0.9     car_3.1-2         ggpubr_0.6.0      jquerylib_0.1.4  
#> [33] MASS_7.3-60.2     cachem_1.0.8      abind_1.4-5       nlme_3.1-164     
#> [37] robustbase_0.99-2 tidyselect_1.2.1  digest_0.6.35     mvtnorm_1.2-4    
#> [41] labeling_0.4.3    splines_4.4.0     fastmap_1.1.1     grid_4.4.0       
#> [45] colorspace_2.1-0  cli_3.6.2         magrittr_2.0.3    utf8_1.2.4       
#> [49] broom_1.0.5       withr_3.0.0       scales_1.3.0      backports_1.4.1  
#> [53] ggbeeswarm_0.7.2  estimability_1.5  rmarkdown_2.26    ggsignif_0.6.4   
#> [57] ragg_1.3.0        memoise_2.0.1     evaluate_0.23     knitr_1.46       
#> [61] mgcv_1.9-1        rlang_1.1.3       Rcpp_1.0.12       glue_1.7.0       
#> [65] bayesm_3.1-6      jsonlite_1.8.8    R6_2.5.1          systemfonts_1.0.6
#> [69] fs_1.6.4

References

1. Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139–177. http://www.jstor.org/stable/2345821
2. Boogaart, K. G. van den, & Tolosana-Delgado, R. (2013). Analyzing compositional data with r. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36809-7
3. Pawlowsky-Glahn, V., & Egozcue, J. J. (2016). Exploring compositional data with the CoDa-Dendrogram. Austrian Journal of Statistics, 40(1&2), 103–113.