1. Introduction
The human microbiome is increasingly recognized as an important component of human health. Studies show links between the composition and function of the human gut microbiome and many health outcomes, including inflammatory and autoimmune conditions, obesity, infection, and neurological outcomes [
1,
2,
3,
4]. Animal studies have shown prospective changes in the microbiome from different exposures, and changes in physiology and health status after changes to the microbiome. While some clinical trials have been done, many human microbiome studies have been observational.
In observational studies, the microbiome is typically characterized in a few key ways. The first is by measuring and comparing within individual diversity, or α-diversity. These measures, adopted from the field of ecology, measure the number of different taxa present, and the evenness of abundance among those taxa within a single sample [
5,
6,
7]. That diversity level can then be compared across individuals. However, α-diversity cannot be directly translated to health status, thus its meaningful utility is limited. A second way to characterize the microbiome is by assessing community composition, or β-diversity. This is typically done by measuring the similarity or dissimilarity of the composition of one sample compared to another, using the number of different taxa [
8], the abundance of each taxa [
9], and sometimes the phylogenetic relationships between taxa [
10]. Researchers can compare groups, for instance exposed and unexposed groups, to see if the samples within one group are more similar than samples across groups (i.e., controls are more similar to other controls than to the exposed group). We use these measures of diversity to try to gain an understanding of the effect of or on the microbiome as a whole, but changes in diversity can only indicate a general difference without indicating how specifically the microbiome is different, or what the important players within the microbiome are.
To determine which specific taxa contribute to differences in diversity, researchers can also assess each taxa individually by measuring the amount of variability each one contributes, or by assessing trends in the presence/absence or abundance of individual taxa. While combinations of these strategies are often used, these scenarios must be adjusted for multiple comparisons across hundreds or thousands of taxa, which limit the ability to identify statistically significant associations. Furthermore, the microbiome is an ecosystem of bacterial communities with complex interactions and associations, and using these modeling strategies to assess them individually does not account for their intricate correlations.
Interest in microbiome research has grown rapidly over the last fifteen years, yet the complexity of the data, e.g., zero inflation, variation across individuals, correlated taxa, etc., continues to be a challenge for researchers. There has been a push for new statistical methodologies, including machine learning methods, and new microbial data applications of existing statistical methods, in an effort to improve the accuracy of findings from microbiome analyses [
11]. Some of these methods include random forest [
12], negative binomials [
13], and clustering [
14], to name a few. Similarly, there has been a push in the field of environmental epidemiology to develop new strategies to model the health effects of multiple co-exposures to improve the accuracy of chemical exposure studies. Some of the newly developed methods allow for analysis of overall mixture effects, and indicate the importance of each chemical within that mixture. One such method is weighted quantile sum (WQS) regression [
15]. WQS regression uses an empirically weighted index of many correlated chemical exposure measurements, and models the mixture effect of the whole index, while also providing weights for each component within the mixture to indicate the relative importance. WQS regression also allows for the inclusion of covariates, to reduce the effects of confounding. Applying this method to analysis of microbiome data allows for the evaluation of the overall mixture effect of the microbiome, and simultaneously identifies the most important individual taxa in the mixture while accounting for a correlated data structure.
The goal of this study is to demonstrate the novel application of WQS regression to assess covariate-adjusted associations between health exposures and microbiome sequencing abundance data as a mixture of potentially correlated bacterial taxa. Our analysis adjusted and combined WQS regression with random subset selection [
16] and repeated holdout [
17] (WQS
RSRH) frameworks, and applied them to publicly available Human Microbiome Project (HMP) 16S amplicon sequencing data. We further demonstrate the utility of the method using data from the Growth and Obesity Cohort Study (GOCS) in Chile.
The random subset extension of WQS is used in cases where the number of components in the WQS index is greater than the number of observations, and uses random subsets of components to calculate the WQS index. The repeated holdout extension of WQS allows for more robust estimates by using different observations in the training and validation sets of the data over multiple iterations of WQS analysis. We illustrate the utility of WQSRSRH and estimate specificity (correctly determining OTUs were not associated with the outcome) and sensitivity (correctly identifying associated OTUs) of the method. This methodological application allows for more comprehensive investigation of the association between the gut microbiome and many health exposures and outcomes by assessing the microbiome as a mixture.
4. Discussion
This simulation study demonstrated the novel use of the WQS analysis framework in microbiome data analysis. The WQSRSRH model was able to detect a significant association in the correct direction between the test variable and the microbiome, in a dataset of 210 microbiome samples. With a WQS equi-weighted cut-point (1/868), average sensitivity and specificity across 30 random holdout models were 75% and 70%, respectively. In this simulation, we also demonstrated that the signal function based on the exp(t) improved specificity but was not different from less severe signal functions in assessing sensitivity. This method has potential for broad applications within microbiome research. WQSRSRH can be used to assess associations between exposures of interest and the microbiome, as well as associations between the microbiome and health outcomes. Compared to standard methods of microbiome analysis, WQSRSRH performed similarly or better than all other tested methods at identifying an overall association in the correct direction, and in sensitivity and specificity at correctly identifying the 20 OTUs with an association to the test variable. In further demonstration of the method with real data, the method was adjustable to accommodate the different composition of the dataset, including the use of ASV data instead of OTUs. Furthermore, the WQSRSRH model found a negative association between the gut microbiome and BMI, and identified important bacterial taxa consistent with previously published studies.
Our simulated variable was associated with the abundance of several OTUs across all participants. The abundance of those 20 OTUs contribute to the calculation of α-diversity, however, because α-diversity evaluates the association of single sample composition, and WQSRSRH evaluates the OTU combination association across the population, it is not surprising that WQSRSRH was able to detect an association with the test variable while α-diversity analysis was not.
Alternatively, β-diversity directly compares composition of each sample to all others. There are many different methods to calculate similarity and dissimilarity distance for β-diversity analysis. In this analysis, we saw similar null results using both the Bray–Curtis and Aitchison distances. If the OTUs that we used to simulate an association were not major contributors to the overall composition of samples, PERMANOVA would not have found significant variance by the test variable. Moreover, if the 20 OTUs that were selected for association, were overwhelmed or drowned out by the richness and abundance of the other OTUs in each sample being compared, PERMANOVA would not have detected a significant amount of variance associated with the test variable. Likewise, SIMPER identifies which OTUs contributed the most to the variance detected between levels of exposure, so again if the OTUs with a simulated association were not major contributors to the composition of many of the samples, they would likely not have contributed much variance. It is noteworthy that SIMPER detected the same OTUs of the 20 simulated associations for both the test and control variables. It indicates that SIMPER is really constructed to identify the OTUs contributing most to the composition overall, those that are most abundant, and not necessarily the OTUs most associated with an exposure.
WQSRSRH in contrast does not compare samples directly to each other, it evaluates the combination of all the OTUs across all samples, and weights the association of each OTU within the combination. This allows for identification of important OTUs even when their relative contribution to the composition of an individual sample may be small. It also allows for the identification of important OTUs across samples with very different composition. For instance, in observational studies, microbial composition of samples from different individuals can be difficult to compare to each other because there may be limited overlap in OTU composition, thus PERMANOVA and SIMPER analysis of β-diversity can fail to identify important associations. However, using WQSRSRH, OTUs are evaluated in combination across all samples, so two samples with completely different composition can both contribute heavily weighted OTUs to the combination, i.e., associated OTUs can be identified even when they are only in some of the study samples. WQS evaluates the association of the combination of OTUs, and indicates which are the most associated with exposure, without having to do direct sample comparisons, or relying only on the most abundant OTUs. The signal function in the weighted average across the random subsets further enhances the impact of random sets with important OTUs.
The Random Forest analysis performed similarly to WQSRSRH in sensitivity and specificity when using the simulated variable as the response. The test model also performed much better than the control model, indicating that it is better suited than methods like SIMPER to pick out the most associated OTUs. It is worth noting that when creating the simulated variable, all OTUs that were not assigned an association were assumed to have an association of 0. However, it is likely that some of those OTUs were correlated with some of the 20 OTUs that were assigned an association. These correlations likely account for some of the variation we see in sensitivity and specificity calculations across WQSRSRH, Random Forest, and SIMPER.
While Random Forest performed well in this application, there are some potential advantages of using WQSRSRH instead. WQSRSRH simultaneously identifies the most important OTUs and estimates an overall mixture effect (or association in this case), instead of just identifying the importance of OTUs as the Random Forest does. In a situation like the one demonstrated in this simulation analysis, where there is an underlying association that is not detected by α and β-diversity, WQSRSRH provides an additional measurement of association with the overall microbial composition that Random Forest does not. Additionally, incorporation and interpretation of covariates is simpler in WQSRSRH models, as they are modeled as they would be in traditional regression methods instead of being included as a potential predictor along with the OTUs in a Random Forest model.
In the demonstration with data from GOCS, the WQS
RSRH method is adaptable to different datasets, and performed well in identifying bacteria related to high BMI. We were able to adjust the parameters of the model in several ways to accommodate the different datasets, and provide a demonstration of the WQS
RSRH method using ASV data as opposed to the OTU data that was used in the simulation study. We set the ranking mechanism to split the ASV abundance into three levels rather than four as shown in the simulation analysis. This adjustment was made to accommodate the smaller sample size and fewer microbiome mixture components (ASVs) in the GOCS dataset. The ranking levels can be adjusted to any number as appropriate for the dataset in use. Moreover, if the ranking is split once at zero, the microbiome mixture can be analyzed with presence/absence data rather than abundance. We also adjusted the number and size of the random subsets used to calculate the weighted index. The size of the random subsets should correspond to the number of observations in the dataset. The number of random subsets used is relatively arbitrary, but the larger the number of subsets, the more robust the estimate. We were also able to specify the data subsets to use in each repeated holdout iteration to ensure the variability of the categorical outcome in each subset. Although the WQS
RSRH estimate was not statistically significantly associated with BMI in this cohort, evidence of the trend in the negative direction was strengthened by 93% of the repeated holdout iterations producing a negative estimate. Analysis of association with α-diversity also found no association with BMI in this cohort. The bacterial taxa identified as heavily weighted within the negative association were consistent with bacterial genera negatively associated with obesity in other studies, and were also identified by the Random Forest and SIMPER methods. Although different species within the same genera may associate differently with obesity [
40], several other studies have found a negative association between obesity and
Bacteroides [
41],
Ruminococcus [
42], and
Clostridium [
28] genera. While the Random Forest and SIMPER methods identified additional genera in association with BMI, it is important to note that these methods consider associations between each ASV and BMI individually, while WQS
RSRH is considering which ASVs are the most important in the gut microbiome mixture.
There are many potential advantages of using WQSRSRH for microbiome analysis, however, we are not suggesting that WQSRSRH performs statistically differently than other methods that were compared. WQSRSRH works with both continuous and categorical variables, and allows for the adjustment of covariates in an interpretable fashion. It accounts for the correlated nature of the taxa within the microbiome, and gives an overall effect estimate and the weight of importance when all taxa in the index are considered together. WQSRSRH allows for analysis of associations in positive and negative directions separately, and allows flexibility in choosing the signal function in the weighting step. It accommodates samples from populations with widely varied microbial composition, identifying associations with OTUs present in a relatively small proportion of the population. WQSRSRH also gives robust estimates over many repetitions of the analysis. WQSRSRH could be used in a broad range of health research, as well as in a drug discovery framework. It could identify groups of bacteria that are associated together with an outcome of interest, which could be targeted together in develo** probiotics. When the analyst is interested in determining a small subset of OTUs associated with the outcome, as in a drug discovery framework, the bottom 90–95% of the weights can be set to zero to test for significance in the top 5–10%. This may lead to better identification of bacteria to include in probiotics that will be successful within the gut microbiome ecosystem.
While this demonstration of the application of WQS to microbiome data establishes a novel analytical method with potential for broad use, it does have some limitations. The first is that, while the overall WQSRSRH estimate identifies the direction of association between the variable of interest and the abundance of taxa within the microbiome, it is not directly translatable to a value measure (i.e., a measureable unit of some health outcome). However, the current standard microbiome analyses using α-diversity, β-diversity, and principal components analysis suffer from a similar limitation. Another limiting consideration is that the sensitivity and specificity of this WQSRSRH method application depends on the number of taxa in the data set, and the cutoff point chosen to identify the taxa of most importance. Of course, in an analysis of microbiome data with a real variable with unknown association, we would not know the truth to determine a good cutoff for both sensitivity and specificity. As demonstrated with the GOCS data, the equi-weighted cut point can be used as a default with unknown sensitivity and specificity. However, relevant cut points could be determined by investigating the significance of the index across repeated holdout data sets. The use of 16S rRNA sequencing with taxonomic assignment down to the genus level is also somewhat controversial among microbiome researchers. However, because the primary purpose of this analysis was to demonstrate the analytical method, the choice of grou** results at the genus level rather than the family level is not a major limitation. Lastly, in calculating the α and β-diversity for comparison to WQSRSRH, we used the same data set, which was limited to OTUs detected in at least 10% of samples, and converted to relative abundance. These processing steps affected the diversity calculations that rely on singletons. While this allows us to do a direct comparison with the WQSRSRH method, these calculated values are not generalizable outside this study.
This simulation study is the first step in exploring the use of WQS methods on the analysis of microbiome data. We plan to apply WQS methods to other forms of microbiome data beyond 16s rRNA amplicon sequencing, including metagenomic sequence data. Further development of the general WQS method is still underway, including the use of stratification, and Bayesian statistical applications, thus as those methods continue to develop, we will test their application on microbiome data.