I recently began using model selection methods and AIC to analyze my data at the strong suggestion of one of my dissertation committee members. As I learn the methodology, I am also being asked to justify my interpretations to other members of my committee. Switching over has definitely been a productive learning experience. I thought I would share some of the questions I’ve gotten and how I responded. My responses are derived from my understanding of the readings and from discussions with others. I am airing them publicly for several reasons. First, others out there may be asking these same questions, and perhaps this will pop up in a search and be helpful. Second, I want those who know more to correct me (constructively!) and help me gain a deeper understanding. And in keeping with the first reason, having others’ comments and corrections next to my own statements will hopefully keep any gross falsehoods on my part from leading too many readers astray.
What criteria allow you to conclude that X3 has predictive value when the wAIC for model 1 is only a bit stronger than what I assume is the null model?
The criterion I used here was a difference in AIC score of 3 or more between models. The absolute AIC score is not meaningful, but the differences in scores between models can be used as a rough guideline according to Burnham and Anderson in their 2002 book (Model Selection and Multimodel Inference, pp. 70 and 446). They state that “models within 1–2 of the best model have substantial support”. [Although I have not been able to find a theoretical justification for these rough cut-offs!] Some authors have used a cut-off of 2 and others of 3. I decided to use 3 to be more inclusive of alternative models, i.e., more conservative about declaring that a single model could stand alone.
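To make the bookkeeping concrete, here is a minimal Python sketch of how ΔAICc values and a cut-off of 3 identify the plausible set. The AICc scores and model names below are invented for illustration, not my actual results:

```python
# Hypothetical AICc scores for a set of candidate models (illustration only).
aicc = {
    "m1 (X3)":       101.2,
    "m2 (null)":     102.5,
    "m3 (X1)":       103.9,
    "m4 (X2)":       104.0,
    "m5 (X1+X2+X3)": 110.7,
}

best = min(aicc.values())
# Delta-AICc: each model's score minus the best score; the best model has delta = 0.
delta = {name: round(score - best, 1) for name, score in aicc.items()}

# With a cut-off of 3, any model within 3 units of the best is kept as plausible.
plausible = [name for name, d in delta.items() if d <= 3]
print(delta)
print(plausible)
```

The absolute scores never matter here; only the differences from the best model are used.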
When differences (ΔAICc) are within 3 units, those models are plausible too, so to get an estimate of the effect of the variables included across all plausible models you can do model averaging. Model averaging weights each variable’s estimate by the AIC weights of the models it appears in and adjusts the estimate accordingly. For example, in the model set below, the estimate for “X3” is adjusted downward in the averaged model. However, X3 still has a positive effect on the response variable Y, while X1 and X2 have estimates close to zero. Notice also that there is more uncertainty around the estimate for X3 after averaging. One of the things this suggests to me is that there are relevant X-factors that were not measured and included as potential predictive variables. I have nothing to back this up, but the fact that the second-best model is the “null” suggests to me that a lot of the variation might be due to the random factor (not shown in the models below).
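The arithmetic behind the downward adjustment can be sketched in a few lines. The weights and coefficients here are hypothetical stand-ins, not the values from my model set; models that do not include X3 (such as the null model) contribute an estimate of zero, which is what pulls the averaged estimate down:

```python
# Hypothetical Akaike weights for four plausible models (must sum to 1).
weights = [0.44, 0.30, 0.15, 0.11]

# Hypothetical estimates of the X3 coefficient in each model;
# 0.0 where the model does not include X3 (e.g., the null model).
x3_estimates = [0.80, 0.00, 0.75, 0.00]

# Model-averaged estimate: weight each model's estimate and sum over the set.
x3_averaged = sum(w * b for w, b in zip(weights, x3_estimates))
print(x3_averaged)  # smaller than the top model's estimate of 0.80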
After I compared models 1–4 and noticed they were all within 3 ΔAICc of the top model, I built model 5 to see if it would rise to the top. Apparently the penalty for having three times as many parameters trumped any combined power to predict the response variable. All of these models also contain site as a random intercept, so the variation in Y seemed to be due primarily to differences in site and X3.
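The penalty at work here is visible in the formula itself: AIC = 2k − 2 ln(L), plus a small-sample correction term for AICc. A sketch with invented log-likelihoods, parameter counts, and sample size shows how a bigger model can fit slightly better yet still score worse:

```python
def aicc(log_lik, k, n):
    """AICc = 2k - 2*ln(L) + 2k(k+1)/(n - k - 1); lower is better."""
    aic = 2 * k - 2 * log_lik
    return aic + (2 * k * (k + 1)) / (n - k - 1)

n = 40  # hypothetical number of observations

# Model with X3 only: intercept + X3 slope + random-intercept variance -> k = 3.
small = aicc(log_lik=-48.0, k=3, n=n)

# Combined model (X1 + X2 + X3): slightly better likelihood, but k jumps to 5.
big = aicc(log_lik=-46.5, k=5, n=n)

print(small, big)  # the combined model scores worse despite the better fit
```

The improvement in log-likelihood (1.5 units) buys back 3 AIC units, but the two extra parameters cost 4 units plus a larger small-sample correction, so the combined model loses.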
What I finally decided to do in this case was to use a cut-off of 2 ΔAICc, not average the models, and interpret the results based on the relative weights of the models. … this suggests that X3 may have some effect, but the results are inconclusive.
This is again an example of the general question on what you can say about your results when your wAIC value for your ‘best model’ is not 0.95, but only 0.88 or 0.63, or 0.44.
The weights are calculated from the differences in AIC score, but a weight is also affected by the number of models in the set. Since the weights must sum to 1, each model you add “claims” some of that weight. So if you have a lot of models in your set, the top model can have an AIC weight far below, say, 0.95, but it is the combined evidence of the score differences, AIC weights, and confidence intervals that is used to determine whether competing models are also plausible.
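The weight formula makes this dilution easy to see: w_i = exp(−Δ_i/2) / Σ_j exp(−Δ_j/2). A small sketch (with made-up AIC scores) shows the same top model, with the same gap to the runner-up, losing weight simply because more models join the set:

```python
import math

def akaike_weights(aic_scores):
    """Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)."""
    best = min(aic_scores)
    rel_lik = [math.exp(-(a - best) / 2) for a in aic_scores]
    total = sum(rel_lik)
    return [r / total for r in rel_lik]

# Same best model, same 2-unit gap to the runner-up, different set sizes.
w_two  = akaike_weights([100.0, 102.0])
w_five = akaike_weights([100.0, 102.0, 103.0, 103.5, 104.0])

print(w_two[0], w_five[0])  # the top model's weight shrinks in the larger set
```

Nothing about the top model changed between the two sets; only the number of competitors claiming a share of the total weight did.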
The weights are also useful in model averaging for adjusting the relative contribution of model parameters. In fact, Burnham and Anderson have stressed that model averaging should be done over all hypothesized reasonable models rather than using a cut-off in the score differences. If model averaging is done, then parameters that have little impact will have a low estimate (relative to the other parameters in the averaged model, if the estimates are standardized), and one can see this at a glance. However, standardizing the estimates for different variables may be difficult or even impossible. I think their main point is that the primary benefit of model averaging is to develop a more predictive model.
The best model is not necessarily a good model – it’s just the best of the ones that you elected to include.
Absolutely correct. However, if you populate the model set with models that can’t predict the response variable, then the “null model” (intercept-only model) will be the “best” model. If the null model is the best of the set, this doesn’t necessarily mean “no difference”; it means that IF there is a difference, it is not explained by the variables in your alternative hypotheses. Additionally, you can use confidence intervals or SEs for the estimates in your best models to see how confident you can be that the included variables explain the variation in the response variable.
How did you choose which models to include?
Personally, I chose to include models that I reasonably thought might be meaningful AND where the independent variable was something I had manipulated in the experiment. Choosing your models is really not different from developing alternative models to test in a study.
For standard stats, if you test 20 hypotheses, you need to adjust your p-value for multiple tests. Is there an analog with wAICs where when you have so many ‘sets’ that you need to account for the fact that some wAIC of 0.8 or whatever really aren’t meaningful?
When I did the experiment, the plan was to use null-hypothesis testing, which meant I would have to run a long series of “does such-and-such have an effect? yes/no” tests.
In that case it might make sense to correct for multiple tests, as the probability of getting some “yes” answers increases with the number of times you ask. However, (I think) it is more important to use a correction when asking the question multiple times using different explanatory variables (X1, X2, …) for the same response variable (Y). In my case, I was asking whether the stimulus alone affected the outcome and used no other explanatory variables, to avoid “hunting” for a significant result. Therefore, I did not originally incorporate a multiple-test correction in my stats methods.
On the surface, by using model selection it does seem that I am asking many times whether variables X1, X2, X3… can predict variable Y. The big difference is that it is not a “game” of probability. If X1 has any power to explain variation in Y, then you get an estimate of the magnitude of that effect along with some measure of its precision. Rather than just a yes/no answer that “X1 has an effect,” you get a measure of the magnitude of the effect, which is not subject to probability.
You do alter your “chances” of getting models with high AIC weights by using fewer models in a set, since each additional model can “claim” some of the total weight of 1 and dilute the pool. However, the weight is not the only selection criterion, and there is no cut-off below which models are obviously meaningless. That the number of models can affect the spread of weight doesn’t change the chances that your pet alternative hypothesis ends up being the “best” model. If none of the alternative models have predictive power, then it’s the “null model” that ends up with the greatest support. This result is not a matter of probability but a matter of developing reasonable, meaningful models.
One of the benefits I’ve seen from using model selection rather than null-hypothesis testing is that it has allowed me to understand what I’m observing in greater detail in a relatively painless way. I know there are more complicated stats (beyond t-tests and non-parametric equivalents) that allow you to see these patterns, but I never felt comfortable with those methods. I did not doubt their validity! I simply felt overwhelmed by the assumptions, the need to correct for multiple tests, and basically the logistics of performing a complex MANOVA or, worse yet, finding a legitimate non-parametric way to ask the questions I wanted to ask. Fortunately, I had already analyzed my data using simple tests and had significant p-values for many of my comparisons. This allowed me to see that model selection was also detecting these differences, so the methods were “in agreement”.

Model selection allowed me to explore beyond the dichotomous treatment of my data in a way that was more transparent to me. More importantly, it allowed me to get an idea of the impact of different variables rather than a yes/no answer. Again, I’m not saying it can’t be done with “p-value” methods, but I have found model selection more approachable, and it helps that there are reasoned arguments for preferring model selection methods over dichotomous p-value assessments. I don’t think for one second (nor have I seen others state) that a shift to model selection as a superior method invalidates experimental results analyzed and evaluated using p-values as criteria. In my opinion, if anything, using p-values as criteria rather than model selection has perhaps resulted in more type II errors.