# ----------------------------------------------------------------------------
# Electronic Appendix A
# Examples of using R functions to calculate RD for generalized linear fixed-effect models.
# Revised on Dec 11, 2024
# Kohji Yamamura (2016)
# Estimation of the predictive ability of ecological models
# Communications in Statistics - Simulation and Computation, 45:2122-2144
# DOI: 10.1080/03610918.2014.889161
# RDcompare.txt should be placed in the current directory. Otherwise, the path in the following source() call should be changed to the directory containing the function.
source("RDcompare.txt")
# ----------------------------------------------------------------------------
## Example 1: Calculation of RD for log(x+0.5) ##
# The expectation of a variable such as an insect population is determined by various factors in a multiplicative manner. In such cases, the distribution of the expectation follows a lognormal distribution, due to the central limit theorem, irrespective of the original distributions. Hence, a logarithmic Poisson GLMM will usually be appropriate for describing the observed variable. A linear model for log(x+0.5) can be used as an approximation to the logarithmic Poisson GLMM.
#
# Data from Yamamura (2009, Generalized linear models and model selection. Shokubutsu Boeki (Plant Protection), 63: 324-329).
cat(file="MothData.txt", "Trap Month y
1 5 8
1 6 16
1 7 55
1 8 341
2 5 16
2 6 48
2 7 112
2 8 874
")
MothData <- read.table("MothData.txt", header=TRUE)
MothData$Trap <- factor(MothData$Trap)
MothData$Trap <- relevel(MothData$Trap, ref="1")
#
source("RDcompare.txt")
RDcompare(log(y+0.5) ~ Trap*Month, data=MothData)
#
TrueModel <- glm(log(y+0.5) ~ Trap+Month+Trap:Month, data=MothData)
TestModel <- glm(log(y+0.5) ~ Trap+Month, data=MothData)
RDcompare.cal(TestModel, TrueModel)
#
# If we want to include "Month" in all models for biological reasons, we should use the following code. Notice that the compulsory factors should also be included in the model formula.
RDcompare(log(y+0.5) ~ Trap*Month, Compulsory="Month", data=MothData)
#
# If the number of observations is not sufficiently large compared with the number of explanatory variables, we cannot obtain a reliable estimate of the variance from the saturated model. In such cases, we should estimate the variance from a simpler model that contains a smaller number of explanatory variables, to increase the degrees of freedom for the estimation of the variance. We must carefully adopt an appropriate set of explanatory variables that does not distort the estimate of the variance. This choice can be performed automatically by using the RDcompareLasso function instead of the RDcompare function and by specifying the required degrees of freedom with the 'df.error' option.
source("RDcompareLasso.txt")
RDcompareLasso(log(y+0.5) ~ Trap*Month, data=MothData, df.error=5)
# If we have a reliable estimate of the variance beforehand, we can specify the variance by using the 'phi' option.
source("RDcompare.txt")
RDcompare(log(y+0.5) ~ Trap*Month, phi=0.1684716, data=MothData)
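#
# A minimal sketch of where such a 'phi' value might come from: one natural candidate is the residual variance (dispersion) of the saturated Gaussian model for log(y+0.5). Whether the value 0.1684716 above was obtained in exactly this way is an assumption; the code below is only an illustration using base R.
SaturatedModel <- glm(log(y+0.5) ~ Trap*Month, data=MothData)
summary(SaturatedModel)$dispersion   # residual variance of the saturated model; a candidate value for 'phi'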
source("RDcompare.txt") RDcompare(y ~ Trap*Month, family="poisson", data=MothData) # ---------------------------------------------------------------------------- ## Example 3: Quasi-Poisson model for a clustered population. ## # We can use a logarithmic quasi-Poisson model instead of a logarithmic Poisson GLMM, if the multiplicative population consists of clusters of fixed mean (Yamamura 2000, Population Ecology 42:161-169). However, we encounter problems, as we will explain below, when we estimate models other than the saturated model. source("RDcompare.txt") RDcompare(y ~ Trap*Month, family="quasipoisson", data=MothData) # This is equivalent to the Poisson regression with a given phi. source("RDcompare.txt") RDcompare(y ~ Trap*Month, family="poisson", phi=12.43926, data=MothData) # ---------------------------------------------------------------------------- ## Example 4: Negative binomial model (RDcompareNB function) ## # A negative binomial model derives from a Poisson distribution whose expectation follows a gamma distribution. A negative binomial model (NB model) can be used as an approximation to the logarithmic Poisson GLMM except for a small difference in the intercept (the difference given by log(CV^2+1)/2), because the lognormal distribution of a fixed CV^2 (in GLMM) can be approximated by a gamma distribution of a fixed CV^2 (in NB model) if CV^2 << 1. However, we should notice that the parameters of simpler candidate models (other than the saturated model) are different between GLMM and NB model. A GLMM assumes a linear model in the logarithmic scale for the expectation that was evaluated at the 'logarithmic' scale. In contrast, a NB model (and also quasi-Poisson GLM) assumes a linear model in the logarithmic scale for the expectation that was evaluated at the 'arithmetic' scale. Hence, the parameters in NB model are considerably different from that of GLMM, in simpler candidate models in which several parameters are dropped from the true model. source("RDcompare.txt") RDcompareNB(y ~ Trap*Month, data=MothData, ShowModel=5) # ---------------------------------------------------------------------------- ## Example 5: Binomial GLM. ## # Aphid data from Crawley (2007, The R Book, New York: Wiley). # An experiment was performed to examine whether the existence of aphids reduces the number of leaves later attacked by hole-making insects. The work was carried out on two different trees. # r: number of leaves with holes. # n: total number of examined leaves. cat(file="Aphid.txt", "Tree Aphid r n B None 35 1785 B With 23 1169 A None 146 1788 A With 30 363 ") Aphid <- read.table("Aphid.txt", header=TRUE) source("RDcompare.txt") RDcompare(cbind(r,n-r) ~ Tree+Aphid, family="binomial", data=Aphid, ShowModel=4) # For binary variables, we should use RSD instead of RD, since we usually want to predict the probability of response rather than the response of each item. The difference between RD and RSD is clearly shown by the following alternative program for the same data. RSD becomes different from RD if we use this type of program of Bernoulli form. 
cat(file="Aphid_Bernoulli.txt", "Tree Aphid r freq B None 1 35 B None 0 1750 B With 1 23 B With 0 1146 A None 1 146 A None 0 1642 A With 1 30 A With 0 333 ") Aphid_Bernoulli <- read.table("Aphid_Bernoulli.txt", header=TRUE) source("RDcompare.txt") RDcompare(r ~ Tree+Aphid, family="binomial", weights=freq, data=Aphid_Bernoulli) # ---------------------------------------------------------------------------- ## Example 6: Calculation by 'RD after Lasso'. # Dataset: McDonald GC, Schwing RC (1973) Instabilities of regression estimates relating air pollution to mortality. Technometrics, 15:463-481 # This dataset includes 15 explanatory variables, and hence the calculation of RDcompare requires too much time. In such cases, we should calculate 'RD after Lasso' by using RDcompareLasso function. pollution <- read.table("https://www4.stat.ncsu.edu/~boos/var.select/pollution.data.txt", header=TRUE) source("RDcompareLasso.txt") library(grpreg) # The following calculation using RDcompare requires 94 seconds. RDcompare(y ~ ., data=pollution, max.ranking = 20, ShowModel = 5) # The results are as follows. # RD ranking for the hierarchical family of models # RD RSD k Model 1 0.6636175 0.8648014 7 y ~ 1 + x1 + x2 + x3 + x6 + x9 + x14 2 0.6623753 0.8631827 8 y ~ 1 + x1 + x2 + x3 + x5 + x6 + x9 + x14 3 0.6602734 0.8604435 8 y ~ 1 + x1 + x2 + x3 + x6 + x8 + x9 + x14 4 0.6564475 0.8554577 6 y ~ 1 + x1 + x2 + x6 + x9 + x14 5 0.6561624 0.8550862 9 y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x9 + x14 6 0.6560376 0.8549235 7 y ~ 1 + x1 + x2 + x5 + x6 + x9 + x14 7 0.6551305 0.8537414 9 y ~ 1 + x1 + x2 + x3 + x5 + x6 + x8 + x9 + x14 8 0.6551299 0.8537407 8 y ~ 1 + x1 + x2 + x3 + x7 + x8 + x9 + x14 # Small differences in RD (e.g., less than 1%) should be ignored since the predictive ability of models is almost the same. If the difference in RD is sufficiently small, we should rather adopt the most tractable model that has a smaller number of parameters. # In this case, the best model is y ~ 1 + x1 + x2 + x3 + x6 + x9 + x14 with RD=0.6636175 with k=7, while the 4th best model is y ~ 1 + x1 + x2 + x6 + x9 + x14 with RD=0.6564475 with k=6. The difference in RD is smaller than 1%. Therefore, we should adopt the 4th best model. That is, we need not measure the average temperature in July (x3) for saving the cost in the future prediction activity. RDcompareLasso(y ~ ., data = pollution, max.variable = 10, max.ranking = 20, ShowModel = 5) # The results are as follows. # RD ranking for the hierarchical family of models # RD RSD k Model 1 0.6636175 0.8814370 7 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x3 2 0.6623753 0.8797871 8 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x3 + x5 3 0.6602734 0.8769952 8 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x8 + x3 4 0.6564475 0.8719135 6 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 5 0.6560376 0.8713691 7 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x5 6 0.6551305 0.8701642 9 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x8 + x3 + x5 7 0.6551299 0.8701635 8 Lasso.Y ~ 1 + x9 + x14 + x1 + x2 + x8 + x7 + x3 8 0.6535472 0.8680612 8 Lasso.Y ~ 1 + x9 + x6 + x14 + x1 + x2 + x7 + x3 # The max.variable specifies the maximum number of variables that are used in the models for evaluateing RD. The calculation time increases with increasing the max.variable, while we may erroneously drop important variables if we use a small amount of max.variable. The default setting is max.variable=10. # We can compulsorily include important variables without relying on the selection. 
# We can compulsorily include important variables without relying on the selection. For example, if we know that the variables x6 and x9 are important for theoretical reasons, we should specify Compulsory="x6+x9" as follows. This also further shortens the CPU time.
RDcompareLasso(y ~ ., data = pollution, Compulsory="x6+x9", max.variable = 10, max.ranking = 20, ShowModel = 5)
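#
# If we want to confirm the saving in CPU time on our own machine, the call can be wrapped in base R's system.time(); the timing figures will differ between computers, so this is only a sketch.
system.time(
  RDcompareLasso(y ~ ., data = pollution, Compulsory="x6+x9", max.variable = 10, max.ranking = 20, ShowModel = 5)
)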