1. Introduction
Managers of natural resources need accurate information describing the resources they manage to make informed decisions [
1,
2,
3]. Owing to high acquisition costs, managers typically employ sampling and estimation techniques to describe various aspects of a given population. Additionally, to increase estimation accuracy, ancillary data, such as remotely sensed data, can be used to improve population estimates through specification and application of models that characterize how the resources of interest vary as a function of the ancillary data (e.g., [
4,
5,
6,
7]). Whether or not such models are employed or if simple expansion-based techniques are used to estimate population parameters, sample design plays an important role in estimation accuracy [
7,
8].
Within the context of estimating forest characteristics from remotely sensed data, less emphasis is generally placed on rigorous sample design to train models than when validating models [
9]. In large part, this discrepancy stems from the conditions required for design versus model-based inference [
4,
5]. As Gregoire [
8] describes, “in the design-based framework, the population is regarded as fixed whereas the sample is regarded as a realization of a stochastic process”. Conversely, the model-based framework assumes “that the values $y_i, \ldots, y_n$ are regarded as realizations of random variables $Y_i, \ldots, Y_n$ and hence the population is a realization of a random process”. The primary difference is that “in the model-based approach [inference] stems from the model, not from the sampling design”. While some rely on the availability of models as justification for deviating from probabilistic designs when drawing inferences, it is critical to remember that with natural systems, we seldom understand, or can collect, all the information needed to build models that completely describe the complexities of those systems. This can lead to model misspecification, localized deviations in relationships, and more generally to models that do not describe a specific population well. Therefore, when calibrating models, sample designs that include randomness in the selection process are critical to developing models that can be used to estimate population and subpopulation parameters. Moreover, probabilistic designs that ensure that samples are widely distributed across feature space can reduce the variability of population estimates.
However, many mapping projects employ non-probabilistic sample designs when calibrating predictive models [
9] or use designs that may not fully capture the spread of predictor variables and that may be unbalanced with regard to the population. Furthermore, some authors have argued that sample units should only contain pure, homogeneous examples of a given class [
10,
11,
12,
13,
14,
15] and have developed sampling protocols to ensure that outcome. Compared to probabilistic sampling designs, samples of homogeneous units can be expected to yield lower estimated error rates when calibrating and validating models. However, such metrics can only be attributed to the class of units targeted for sampling and typically cannot be generalized to the rest of the population [
16,
17].
Stehman [
16,
18,
19] describes multiple probabilistic sampling designs and outlines their associated strengths and weaknesses as they relate to map accuracy. Tillé and Wilhelm [
20], Grafström et al. [
6], and Stevens and Olsen [
21] further describe the importance of probabilistic designs while highlighting concepts of balance and spread as they relate to the precision of estimators. Balance and spread in this case refer to characteristics of a sample with regard to auxiliary variables. Specifically, a balanced sample is one for which the estimated means or totals of the auxiliary variables equal the actual population means or totals of those variables [
20]. A sample that is well spread in auxiliary space means that the sample units are widely dispersed across the auxiliary values of that population [
21]. For population estimates, balance is appealing where there are potentially strong relationships between response and auxiliary variables, because a sample that provides an accurate estimate of the mean or total of an auxiliary variable can then be expected to provide an accurate estimate of the true mean or total of the response variable. Likewise, a sample that is well spread across auxiliary variables has the advantage of being balanced [
22] while also capturing the variability in auxiliary variables. While often described for expansion-based estimators, these arguments are also important in the context of sampling to develop models to support estimation [
9,
20,
23,
24,
25]. Nevertheless, many researchers disregard these warnings and build models derived from unbalanced, poorly spread, and/or non-probabilistic samples to estimate characteristics of a given population.
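The notions of balance and spread described above can be illustrated with a minimal sketch. It assumes a single auxiliary variable, an equal-probability design (inclusion probability n/N for every unit), and a Voronoi-style spread measure in the spirit of the B statistic; the function names and the toy population are hypothetical and are not taken from the study.

```python
def balance_gap(pop_x, sample_idx):
    """Sample mean of the auxiliary variable minus its population mean.

    Under an equal-probability design, a value near 0 indicates a sample
    that is (approximately) balanced on this auxiliary variable.
    """
    pop_mean = sum(pop_x) / len(pop_x)
    samp_mean = sum(pop_x[i] for i in sample_idx) / len(sample_idx)
    return samp_mean - pop_mean


def spread_b(pop_x, sample_idx):
    """Voronoi-style spread measure in one auxiliary dimension (assumed form).

    Every population unit is assigned to its nearest sampled unit; v_i is the
    sum of inclusion probabilities (n/N each) assigned to sampled unit i.
    The measure is the mean of (v_i - 1)^2; smaller means better spread.
    """
    n, N = len(sample_idx), len(pop_x)
    pi = n / N
    v = {i: 0.0 for i in sample_idx}
    for x in pop_x:
        nearest = min(sample_idx, key=lambda i: abs(pop_x[i] - x))
        v[nearest] += pi
    return sum((vi - 1.0) ** 2 for vi in v.values()) / n


# Hypothetical population: one auxiliary variable taking the values 0..99.
pop_x = list(range(100))
even_sample = list(range(5, 100, 10))   # units 5, 15, ..., 95
clustered_sample = list(range(10))      # units 0..9, all at the low end

for name, s in [("even", even_sample), ("clustered", clustered_sample)]:
    print(name, balance_gap(pop_x, s), spread_b(pop_x, s))
```

In this toy case the evenly spread sample has both a smaller balance gap and a smaller spread measure than the clustered sample, which is the qualitative behavior the definitions lead one to expect.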
Reasons for deviating from a probabilistic study design typically stem from logistical constraints and cost of implementation. However, using predicted values derived from models developed from non-probabilistic designs to describe complex natural systems such as a forested landscape can be highly questionable [
26,
27,
28]. Questions pertaining to how sample units should be spread across multiple dimensions of feature and geographical space, and the subsequent impacts on estimator accuracy and inference, are at the forefront of understanding how resource assessments and maps can be used to inform decision making. In this study, we investigate these questions through a series of simulations that compare estimates of population totals and pixel values obtained under varying sampling designs. We also evaluate the relative impact of sampling design on estimators derived without use of ancillary data and from various linear, non-linear, and non-parametric models. Specifically, in this study, we: (1) evaluate the relative accuracy of alternative model-based estimators of unit values and population totals across populations where the form and/or strength of associations vary, (2) evaluate the impact of probabilistic designs that spread and balance samples over geographic or feature space on the accuracy of those estimators, (3) demonstrate the potential impact of using a probabilistic sampling design taken from a subset of geographic space on the performance of those estimators, and (4) assess the impact of misspecified models on the accuracy of estimation. We anticipate that within our simulations, samples drawn from probabilistic designs that are spread and balanced across predictor variable space will produce better estimates of the population and subpopulations.
2. Theoretical Background
In this study we assess expansion and model-based estimation procedures applied to populations constructed according to distinct functional relationships between response and predictor variables but tempered by normally distributed errors of varying magnitudes. While fundamentally different, expansion and model-based estimation techniques can be used to estimate population parameters [
29]. For expansion-based methods, parameter estimators are derived by weighting observed response values according to the relative frequency with which the corresponding sample units are selected under the design. In contrast, model-based estimators of population parameters are derived from a model of how the response variables change across the population. In this approach, the emphasis is on estimating model parameters, which can then be indirectly used to estimate either parameters of a given realization of the model, or the model-expectations thereof [
29]. Contrary to the expansion-based methodology, the model-based approach is not necessarily tied to a specific population but instead assumes that a given population is a realization of some random process, typically involving known or mapped auxiliary variables. The emphasis of the model-based approach is often on the relationships between a variable of interest (response) and these other variable(s) (predictors). Given those relationships, the forms of deviations around them, and the predictor variable values within a specific population, one can then estimate model and population parameters.
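The contrast between the two approaches can be sketched in a few lines. The example below is illustrative only: the expansion estimator is written in the familiar inverse-inclusion-probability (Horvitz-Thompson style) form, the model-based estimator is a simple one-predictor least-squares line, and the population, sample, and function names are hypothetical assumptions rather than the procedures used in this study.

```python
def expansion_total(y_sample, incl_prob):
    """Expansion estimate of a total: each observed response is weighted by
    the inverse of its inclusion probability under the design."""
    return sum(y / p for y, p in zip(y_sample, incl_prob))


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (x must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def model_total(x_sample, y_sample, x_pop):
    """Model-based estimate of a total: calibrate on the sample, predict for
    every population unit from its known auxiliary value, then sum."""
    b0, b1 = fit_line(x_sample, y_sample)
    return sum(b0 + b1 * xi for xi in x_pop)


# Hypothetical population of 10 units with a noise-free linear relationship.
x_pop = list(range(10))
y_pop = [2 * x + 1 for x in x_pop]        # true total is 100
idx = [0, 3, 9]                           # sampled units
x_s = [x_pop[i] for i in idx]
y_s = [y_pop[i] for i in idx]
pi = [len(idx) / len(x_pop)] * len(idx)   # equal inclusion probabilities

print("expansion:", expansion_total(y_s, pi))
print("model-based:", model_total(x_s, y_s, x_pop))
```

Because the relationship here is exactly linear, the model-based estimate recovers the true total from any sample with at least two distinct predictor values, whereas the expansion estimate depends on which units happened to be drawn.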
While models do not necessarily need to be tied to a given population, it is the case that within natural systems the underlying shapes and forms of the relationships between response and predictor variables vary and may be unknown and noisy. Therefore, model development is often based on a specific population (e.g., [
30,
31]), making models uniquely calibrated to a given time and place. Though uniquely calibrated, different modeling techniques make different assumptions with regard to the relationships between response and predictor variables. For many natural systems, these assumptions may be difficult to meet and can lead to a model that is misspecified. To assess the impact of potential model misspecification on estimating population and subpopulation parameters, we use simulated landscapes with known functional relationships and introduced error (
Figure 1). The functional shapes of relationships evaluated in our simulations include linear, quadratic, and nonlinear distributions with normally distributed errors and various amounts of introduced noise (
Section 3.1). Additionally, estimation techniques are evaluated under various sample designs to assess the impact of spread and balance on parameter estimation (
Section 3.2). Expansion and model-based estimators were chosen based on their frequency of application, underlying assumptions, and flexibility in capturing various relationships among response and predictor variables. Modeling techniques evaluated include linear regression (LN), neural networks (NNs), support vector machines (SVMs), random forests (RF), and generalized additive models (GAMs).
In terms of complexity, expansion estimators are the simplest, followed by model-based estimators LN, NNs, SVMs, RF, and GAMs. Expansion estimators are commonly limited to a singular, population-wide estimate of mean or total for a population of interest [
7]. Conversely, linear regression models can estimate the mean response value of pixels for given values of predictor variables but are obviously limited in their ability to describe nonlinear relationships and can produce estimates outside the range of values observed in a sample [
32]. NNs, a machine learning technique, can represent nonlinear associations by introducing activation functions and weighting schemes of linear coefficients [
33]. SVMs use kernel functions to identify subsets of the data within multidimensional feature space that describe the relationship between response and predictor variables [
33]. RFs allow for discontinuities in the response function across feature space by implementing classification and regression trees (CART) and incorporating bagging and randomness into model training. RFs ultimately rely on ensembles of CART models to capture relationships between response and predictors but can yield only estimates within the range of observed response values [
34]. GAMs are based on a flexible modeling technique that assumes additivity in predictor variable effects and that response values change smoothly over feature space. GAMs allow for nonlinear, nonparametric relationships between response and predictor variables [
35] and can explicitly account for non-Gaussian error distributions.
Of particular interest with regard to expansion and model-based estimators is that an expansion estimator can be thought of as a model-based estimator with only an intercept parameter if an equal probability design is used and if the presumed model incorporates an independently and identically distributed error structure. That is, if the same sample is used by an expansion estimator and to calibrate an intercept-only model, the population estimates from the expansion and model-based estimators will be the same. This breaks down if unequal probability designs are used, because information on sampling intensities under a design is not used in the model-based approach. Furthermore, if relationships between the response and potential predictors cannot be identified, then the only viable model is one describing variation around the mean, which suggests use of simple estimators like the sample mean. Owing to this, within our simulation, we predict that as the amount of model error (noise) introduced into the relationship between response and predictors is increased, model and expansion-based estimates will converge. Conversely, as the amount of noise introduced between response and predictor decreases, the estimated values from the model-based estimators will be less variable than the expansion-based estimates. Similarly, within our simulations we anticipate that when underlying model relationships are misspecified or calibrated in the absence of important correlated data, model-based estimates will converge with expansion-based estimates. Moreover, we predict that samples drawn from probabilistic designs that enhance spread and balance across predictor variable space will produce better estimates of the population and subpopulations.
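A small numerical sketch makes the equivalence, and its breakdown, concrete. It assumes that the least-squares fit of an intercept-only model is the sample mean, and it uses a hypothetical population size, sample, and set of inclusion probabilities chosen purely for illustration.

```python
def expansion_total(y_sample, incl_prob):
    """Expansion estimate: responses weighted by inverse inclusion probability."""
    return sum(y / p for y, p in zip(y_sample, incl_prob))


def intercept_only_total(y_sample, pop_size):
    """Model-based estimate from an intercept-only model.

    The least-squares intercept is the sample mean, so every population unit
    is predicted to have that value and the total is N times the mean.
    The inclusion probabilities play no role here.
    """
    intercept = sum(y_sample) / len(y_sample)
    return pop_size * intercept


N = 100                      # hypothetical population size
y_s = [4.0, 6.0, 11.0]       # hypothetical observed responses (n = 3)

equal_pi = [len(y_s) / N] * len(y_s)   # equal-probability design
unequal_pi = [0.01, 0.03, 0.05]        # unequal-probability design

print(expansion_total(y_s, equal_pi), intercept_only_total(y_s, N))
print(expansion_total(y_s, unequal_pi), intercept_only_total(y_s, N))
```

Under the equal-probability design the two estimates coincide (up to floating-point rounding); under the unequal-probability design the expansion estimate changes while the intercept-only estimate does not, because the model ignores the sampling intensities.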
5. Discussion
While sample design affected the performance of all estimators, the estimation approach had a larger impact on cell and population estimates. As expected, in all instances evaluated, regardless of the underlying relationship between response and predictor variables, model-based estimators produced better pixel-level estimates than expansion estimators (
Figure 6,
Table 1). Additionally, when sample units were spread and balanced in feature space, model-based estimators of population totals had relatively low bias and remained less variable than expansion estimators (
Figure 7). This finding suggests that models derived from samples balanced and spread in feature space produce low-bias estimates with less variability than models derived from samples drawn under RSNR, SRS, and SYS designs.
Likewise, expansion estimators derived from samples spread and balanced across feature space (GRTS) produced estimates similar to model-based estimates of population totals, with less variability than SYS, SRS, and RSNR expansion estimators. This supports the idea, presented by others, that spreading and balancing sample observations in feature space can substantially improve population estimates [
6,
20,
46]. Within our simulations, the variability in estimated totals was, on average, lower for model-based estimators and for expansion estimators derived from samples spread and balanced in feature space. In all instances, the increased precision (reduction in RMSE) gained by using a model-based estimator over expansion estimators alone outweighed the impacts of the bias introduced by model-based estimators, even when models were misspecified.
Within our simulations, GAM and RF were the two best modeling techniques when all NAIP bands were used to calibrate models. Once calibrated, these estimators were able to identify and track the underlying transformations of NAIP cell values better than the other evaluated estimators. Additionally, when models were misspecified (e.g., a linear model applied to a nonlinear response), raster cell estimates approached expansion-based estimates. Furthermore, when only the first three NAIP bands were used to calibrate models, model-based cell estimates also approached expansion-based estimates. These findings suggest that using probability sampling, ancillary data, and models calibrated from probability samples improves pixel-level value estimates, even if the model is misspecified and especially if sample units are spread across the values of the ancillary data (feature space). This finding is especially relevant today with the availability of remotely sensed imagery and the correlations between land cover (e.g., forest ecosystems) attributes and spectral and textural image metrics (e.g., [
30,
31]). While it is tempting to view our results as confirmation of the robustness of GAM and RF modeling techniques, it is important to recognize that within our simulation, randomness played a critical role in all sample designs [
20] and that error and SC1 transformations were consistent across the spatial domain. Foody et al. [
11] point out that other relationships and error structures may favor different modeling techniques; for example, non-Gaussian error distributions could advantage the GAM approach. However, on average, when samples were spread and balanced in feature space, pixel-level and population total estimates were as good as or better than those from samples that were not spread or balanced in feature space. This result is most likely because spreading and balancing sample units across feature space more consistently allows for the detection of changes in complex relationships between response and predictors than samples that are not spread and balanced.
While we did not directly evaluate sample size in our comparisons, we assume that the strength of the relationship between response and predictor variables, represented in our study as added random noise, would have impacts on estimation similar to sample size. Specifically, we would expect that an increase in sample size would have a similar impact on estimation accuracy as a decrease in noise introduced into the response surfaces. This would suggest that as sample size increases, estimates should become more precise and measures of spread, such as the B statistic [
21,
22], will become smaller. Future investigations should look at quantifying tradeoffs in sample size, spread, and the strength of the relationship between response and predictor variables on estimation for modeling techniques such as LN, GAM, RF, SVM, and NN.
These simulated findings have practical relevance for natural resource managers who are interested in inventorying, monitoring, and managing natural resources. Specifically, improvements in estimating characteristics of a natural resource for a given population (e.g., basal area in a forest) can be gained by incorporating models into the estimation process [
4,
30,
31]. Moreover, indirect estimates of population subdomains (e.g., stands) can be made from model-based estimates [
26]. In the case where observational units (e.g., pixels) cover a smaller spatial domain than the subdomain of interest (e.g., stands), those pixel estimates can be aggregated to the spatial extent of the stand. Conversely, when pixels have an extent larger than the stand of interest, model estimates can be attributed to the entire stand. In instances where pixel estimates partially cover the extent of the stand, estimates can be weighted and attributed to the stand based on the amount of overlapping area between pixels and the stand. However, models have estimation error, which should be incorporated into the standard errors of the estimates of the domain and subdomain characteristics [
26,
47,
48].
In our simulation, we used iteration and sampling to quantify estimation error. Our top performing estimator (GAM) on average had the least amount of bias and the smallest RMSE across iterations. However, there were instances (iterations) within our simulations when the GAM model had, relatively speaking, large RMSE. These instances identify cases in which the sample drawn from the population was not well balanced or spread. While the GRTS sample design minimized the occurrence of samples that were not well balanced or spread, it did not eliminate those types of samples. This means that, while rare, some samples may have a disproportionately large number of extreme occurrences within feature space, which could adversely affect the calibration of a given model and model estimates. In this situation, bagging or boosting could be used to iteratively draw random subsets from a given sample, calibrate multiple models, average model estimates, and empirically estimate standard error to reduce the impact of extreme observations [
49,
50]. Applying such methods should affect the variation in RMSE values in a way similar to what is displayed in
Figure 9 for the RF modeling technique, which uses bagging and model averaging [
34].
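The bagging idea outlined above can be sketched as follows. This is a simplified illustration, not the procedure used in our simulations: it assumes bootstrap resampling with replacement, a one-predictor least-squares line as the base model, and a population total as the quantity of interest; all names, the seed, and the toy data are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (x must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def bagged_total(x_s, y_s, x_pop, n_boot=200, seed=0):
    """Average of population-total estimates over bootstrap resamples.

    Returns the bagged estimate and the standard deviation of the resampled
    totals, used here as an empirical standard error.
    """
    rng = random.Random(seed)
    n = len(x_s)
    totals = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        xb = [x_s[i] for i in idx]
        yb = [y_s[i] for i in idx]
        if len(set(xb)) < 2:
            continue                                 # slope undefined, skip
        b0, b1 = fit_line(xb, yb)
        totals.append(sum(b0 + b1 * x for x in x_pop))
    return statistics.mean(totals), statistics.stdev(totals)


# Hypothetical population with a noise-free linear relationship (total = 100).
x_pop = list(range(10))
x_s = [0, 3, 5, 9]
y_s = [2 * x + 1 for x in x_s]

estimate, std_err = bagged_total(x_s, y_s, x_pop)
print(estimate, std_err)
```

With noise-free data every resample yields essentially the same total, so the empirical standard error is close to zero; with noisy or poorly spread samples the spread of the resampled totals gives a rough indication of how sensitive the estimate is to individual observations.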
For forest managers, this suggests that estimates of forest characteristics such as species composition, basal area, and tree counts can be improved for a given area by relating field measurements to remotely sensed imagery such as NAIP (e.g., [
4,
30]). Moreover, with the relative abundance of free remotely sensed data and newer software designed to facilitate these types of analyses (e.g., [
51]), managers can estimate characteristics of the forests they manage with a greater level of detail and accuracy than was previously possible.
6. Conclusions
In this study we compared multiple estimators for a variety of response and predictor variable relationships, Gaussian error distributions, amounts of noise, and sample designs. Our findings indicated that balance and spread are important aspects of a sample, that estimates of pixel-level values and population totals can be improved by incorporating models, and that spreading samples within feature space can improve estimates. Likewise, for those same samples, when the relationship between response and predictor variables is misspecified, is missing key information, or is extremely noisy, expansion and model-based estimates of the population converge, suggesting that with regard to estimation, nothing is lost by using ancillary data. Moreover, for mapping endeavors attempting to spatially depict various characteristics of a landscape, samples used to calibrate predictive models should aim to spread and balance observational units across feature space.