I recently began using model selection methods and AIC to analyze my data as per the strong suggestion from one of my dissertation committee members. As I learn about this methodology, I am also asked to justify my interpretations to other members of my committee. Switching over definitely has been a very productive learning experience. I thought I would share some of the questions I’ve gotten and how I responded. My responses are derived from my understanding of readings and discussions with others. I publicly air my responses for several reasons. First, there may be others out there asking these same questions and perhaps this will pop up in a search and be helpful. Second, I want others that know more to correct me (constructively!) and help me gain a deeper understanding. In keeping with the first reason, having others’ comments and corrections next to my own statements will hopefully not lead too many astray should there be gross falsehoods in my statements.

**What criteria allow you to conclude that X_{3} has predictive value when the wAIC for model 1 is only a bit stronger than what I assume is the null model?**

The criterion I used here was a difference in AIC score of 3 or more between models. The absolute AIC score is not meaningful, but the differences in scores between models can be used as a rough guideline, according to Burnham and Anderson in their 2002 book (Model Selection and Multimodel Inference, pp. 70 and 446). They state that “models within 1-2 of the best model have substantial support”. [Although, I have not been able to find a theoretical justification for these rough cut-offs!] Some authors have used a cut-off of 2 and others of 3. I decided to use 3 to be more inclusive of alternative models, or put another way, more conservative about when a model was ok to stand alone.
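To make the "difference in score" idea concrete, here is a minimal sketch of how AICc and ΔAICc could be computed from a model's log-likelihood. The model names, log-likelihoods, parameter counts and sample size are made up for illustration (they are not my data), and the cut-off of 2 is just one of the rough guidelines mentioned above.

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC for k estimated parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


def delta_scores(scores):
    """Difference between each model's score and the best (lowest) score."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# Hypothetical fits: model name -> (log-likelihood, number of parameters).
n_obs = 40
fits = {"X3": (-50.0, 4), "null": (-52.0, 3), "X1": (-51.9, 4)}

scores = {name: aicc(ll, k, n_obs) for name, (ll, k) in fits.items()}
deltas = delta_scores(scores)

# Models within the chosen cut-off of the top model are treated as plausible.
cutoff = 2.0
plausible = sorted(name for name, d in deltas.items() if d <= cutoff)
print(deltas)
print(plausible)
```

Only the differences matter here: adding the same constant to every score leaves the deltas, and so the set of plausible models, unchanged.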

When differences (ΔAICc) are within 3 units, that means those models are plausible too, so in order to get an estimate of the effect of the variables included within all plausible models you can do model averaging. Model averaging weights the variables in the model by the AIC weight and adjusts the estimate accordingly. For example, in the model set below, the estimate for “X_{3}” is adjusted down when you look at the averaged model. However, X_{3} still has a positive effect on the response variable Y while X_{1} and X_{2} have estimates close to zero. Notice also that there is more uncertainty around the estimate for X_{3} after averaging. One of the things this suggests to me is that there are relevant X-factors that were not measured and included as potential predictive variables. I have nothing to back this up but the fact that the second-best model is the “null” suggests to me that a lot of the variation might be due to the random factor (not shown in the models below).
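Here is a rough sketch of what model averaging does to a single coefficient, as I understand it. It assumes the convention where a model that omits the variable contributes an estimate of zero, and it uses the unconditional standard error from Burnham and Anderson (2002). All weights, estimates and standard errors are hypothetical numbers chosen for illustration, not values from my models.

```python
import math


def model_average(estimates, std_errors, weights):
    """Weighted average of a coefficient and its unconditional SE.

    estimates, std_errors: one value per model (0.0 where the model omits
    the variable, which is an assumed convention, not the only one).
    weights: Akaike weights for the same models, summing to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1")
    avg = sum(w * b for w, b in zip(weights, estimates))
    # Each model's variance is inflated by how far its estimate sits from the average.
    se = sum(
        w * math.sqrt(s ** 2 + (b - avg) ** 2)
        for w, b, s in zip(weights, estimates, std_errors)
    )
    return avg, se


# Hypothetical three-model set: the top model and the third model contain X3,
# the second (null) model does not.
weights = [0.57, 0.27, 0.16]
x3_estimates = [0.8, 0.0, 0.7]
x3_std_errors = [0.30, 0.0, 0.32]

avg, se = model_average(x3_estimates, x3_std_errors, weights)
print(round(avg, 3), round(se, 3))
```

With these made-up numbers the averaged estimate is pulled below the top model's 0.8 and the standard error is larger than 0.30, which is the same pattern I describe above. Other numbers could behave differently.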

After I compared models 1-4 and noticed they were all within 3 ΔAICc of the top model I decided to make model number 5 to see if it would rise to the top. Apparently the penalty of having three times the number of parameters trumped any combined power to predict the response variable. All of these models also contain site as a random intercept so the variation in y seemed to be due primarily to differences in site and X_{3}.

What I finally decided to do in this case was to use a cut-off of 2 ΔAICc, not average the models and interpret the results based on the relative weights of the models. … *this suggests that X_{3} may have some effect but results are inconclusive*.

**This is again an example of the general question on what you can say about your results when your wAIC value for your ‘best model’ is not 0.95, but only 0.88 or 0.63, or 0.44.**

The weights are calculated using the differences in the AIC score, but the weight is also affected by the number of models you have in a set. Since the weights must add to 1, each model you add “claims” some of that weight. So if you have a lot of models in your set, the top model can have an AIC weight that is far below, say, 0.95. It’s the combined evidence of the difference in score, the AIC weight and the CI that is used to determine whether competing models are also plausible.
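For reference, the usual definition of the Akaike weight, as I understand it from Burnham and Anderson, is written out below, followed by a small worked example of how adding models dilutes the top model's weight. The Δ values are hypothetical and chosen only to keep the arithmetic easy.

```latex
w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta_r / 2)},
\qquad \Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}

% Hypothetical set of R = 3 models with \Delta = 0, 2, 4:
w_1 = \frac{1}{1 + e^{-1} + e^{-2}} \approx \frac{1}{1.503} \approx 0.665

% Same top model after adding two more hypothetical models
% with \Delta = 2 and \Delta = 4 (R = 5):
w_1 = \frac{1}{1 + 2e^{-1} + 2e^{-2}} \approx \frac{1}{2.006} \approx 0.498
```

The top model's score did not change between the two sets; only the number of competitors sharing the total weight of 1 did.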

The weights are also useful in model averaging for adjusting the relative contribution of model parameters. Actually, Burnham and Anderson have stressed that model averaging should be done with all hypothesized reasonable models instead of using a cutoff in the score differences. If model averaging is done then parameters that have little impact will have a low estimate (relative to the other parameters in the averaged model if estimates are standardized) and one can see this at a glance. However, standardizing the estimates for different variables may be difficult to impossible. I think their main point is that the primary benefit of model averaging is to develop a more predictive model.

**The best model is not necessarily a good model – it’s just the best out of the ones that you elected to include.**

Absolutely correct. However, if you populate the model set with models that can’t predict the response variable then the “null model” (intercept-only model) will be the “best” model. If the null model is the best of the set, this doesn’t necessarily mean “no difference”; it means that IF there is a difference it is not explained by the variables in your alternative hypotheses. Additionally, you can use confidence intervals or SE for model estimates in your best models to see how confident you can be that the variables included can explain the variation in the response variable.

**How did you choose which models to include?**

Personally, I chose to include models that I reasonably thought might be meaningful AND where the independent variable was something I had manipulated in the experiment. Choosing your models is really not different from developing alternative models to test in a study.

**For standard stats, if you test 20 hypotheses, you need to adjust your p-value for multiple tests. Is there an analog with wAICs where when you have so many ‘sets’ that you need to account for the fact that some wAIC of 0.8 or whatever really aren’t meaningful?**

When I did the experiment the plan was that I was going to use null-hypothesis testing and that meant I would have to do a long series of “*does such-and-such have an effect? yes/no*”.

In that case it might make sense to correct for multiple tests, as the probability of getting some “yes” answers can increase with the number of times the question is asked. However, (I think) it is more important to use a correction when asking the question multiple times using different explanatory variables (X_{1}, X_{2}…) for the same response variable (Y). In my case, I was asking only whether stimulus affected the outcome; I did not bring in other explanatory variables to keep “hunting” for a significant result. Therefore, I originally did not incorporate a multiple-test correction in my stats methods.

On the surface, using model selection does seem like asking many times whether variable X_{1}, X_{2}, X_{3}… can predict variable Y. The big difference is that it is not a “game” of probability. If X_{1} has any power to explain variation in Y then you get an estimate of the magnitude of that effect with some measure of precision. Rather than just get a yes/no answer that “X_{1} has an effect”, you get a measure of the magnitude of the effect, which is not subject to probability.

You do alter your “chances” of getting models with high AIC weights by using fewer models in a set, since each additional model can “claim” some of the total weight of 1 and dilute the pool. However, the weight is not the only selection criterion, and there is no cut-off below which models are obviously meaningless. That the number of models *can* affect the spread of weight doesn’t change the chances that your pet alternative hypothesis ends up being the “best” model. If none of the alternative models have predictive power then it’s the “null model” that ends up with the greatest support. This result is not a matter of probability but a matter of developing reasonable, meaningful models.
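A small sketch of the point about dilution: adding models to a set lowers every weight, but it does not change which model ranks first or the ratio of weights between any two models (sometimes called the evidence ratio), provided the top model stays the same. The model names and ΔAICc values here are hypothetical.

```python
import math


def akaike_weights(deltas):
    """Akaike weights from a dict of model name -> delta AICc."""
    rel = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}


# Hypothetical model sets sharing the same top model.
small_set = {"X3": 0.0, "null": 1.5}
large_set = {**small_set, "X1": 3.0, "X2": 3.5, "X1+X2+X3": 5.0}

w_small = akaike_weights(small_set)
w_large = akaike_weights(large_set)

# The top model's weight drops as more models share the total of 1...
print(round(w_small["X3"], 3), round(w_large["X3"], 3))

# ...but the weight ratio between two models depends only on their deltas.
ratio_small = w_small["X3"] / w_small["null"]
ratio_large = w_large["X3"] / w_large["null"]
print(round(ratio_small, 3), round(ratio_large, 3))
```

This is why I read a low weight in a large model set differently from the same weight in a small one.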

Personal take:

One of the benefits I’ve seen from using model selection rather than null-hypothesis testing is that it has allowed me to understand what I’m observing in greater detail in a relatively painless way. I know there are more complicated stats (beyond t-tests and non-parametric equivalents) that allow you to see these patterns but I never felt comfortable with the methods. I did not doubt their validity! I simply felt overwhelmed by the assumptions and the need for correcting for multiple tests and basically the logistics of performing a complex MANOVA or worse yet, finding a legitimate non-parametric way to ask the questions I wanted to ask. Fortunately, I had analyzed my data using simple tests and had significant p-values for many of my comparisons. This allowed me to see that model selection was also detecting these differences so the methods were “in agreement”. Model selection allowed me to explore beyond the dichotomous treatment of my data in a way that was more transparent to me. More importantly, it allowed me to get an idea about the impact of different variables rather than a yes/no answer. Again, I’m not saying it can’t be done with “p-value” methods but I have found that for me model selection is more approachable and it helps that there are reasoned arguments to prefer model selection methods over dichotomous p-value assessments. I don’t think for one second (nor have I seen others state) that a shift to model selection as a superior method invalidates experimental results analyzed and evaluated using p-values as criteria. In my opinion, if anything, perhaps using p-values as criteria rather than model selection has resulted in more type II errors.

Thank you for your wonderful explanations!

but I still have a question..

It happens to me also that the null model is the second best, with a difference in the AIC to the best model smaller than 2. I was wondering how to interpret this and I did not find an explanation in “Model selection and multimodel inference”, but I found your post: “this suggests that X3 may have some effect but results are inconclusive.” I am very thankful for this, and I would like to read more about it, could you please tell me where? I guess I am not on the right chapter of the book. Thanks!

By: Nora on March 26, 2012 at 1:14 pm

Hi Nora,

I would recommend reading “Model Based Inference in the Life Sciences: A Primer on Evidence” by Anderson. It won’t answer all your questions but it will provide some explanations that help you think about interpretations. Personally, in a situation like yours, I would look at several things: the residual deviance of the model (look for a “pretending variable”), the coefficient estimate and the 95% CI (or the SE) around that estimate in your second best model.

If the second best model is within a few deviance points from the best model then it is likely the additional variable is not explaining much if anything at all and is then a “pretending variable”. This concept is discussed in the book I mention above.
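As a rough way to picture the “pretending variable” check, here is a sketch that compares two nested models by their deviance (taken here as -2 × log-likelihood) and number of parameters. The function names, the threshold of 1 deviance point per extra parameter, and the example numbers are all my own assumptions for illustration; the book does not give this as a fixed rule.

```python
def aic_gap(dev_simple, k_simple, dev_bigger, k_bigger):
    """AIC(bigger model) minus AIC(simpler model), with deviance = -2 * log-likelihood."""
    return (dev_bigger + 2 * k_bigger) - (dev_simple + 2 * k_simple)


def looks_like_pretending_variable(dev_simple, k_simple, dev_bigger, k_bigger,
                                   min_gain_per_param=1.0):
    """Flag a bigger model whose extra parameters barely reduce the deviance.

    min_gain_per_param is an arbitrary, hypothetical threshold.
    """
    extra = k_bigger - k_simple
    if extra <= 0:
        return False
    deviance_gain = dev_simple - dev_bigger
    return deviance_gain < min_gain_per_param * extra


# Hypothetical case 1: one extra variable, deviance hardly moves.
print(aic_gap(104.0, 3, 103.8, 4))                         # close to 2
print(looks_like_pretending_variable(104.0, 3, 103.8, 4))  # flagged

# Hypothetical case 2: one extra variable, deviance drops a lot.
print(aic_gap(104.0, 3, 96.0, 4))                          # clearly negative
print(looks_like_pretending_variable(104.0, 3, 96.0, 4))   # not flagged
```

The idea is that a model can sit within about 2 AIC units of another purely because of the penalty for one extra parameter, without that parameter explaining anything.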

If the coefficient is too small (you must back transform your estimate in order to interpret it directly) to be biologically meaningful then you would not have high confidence in that predictor variable being relevant. For example, say that the effect on a response variable (the coefficient) is 2. You’ll intuit that it is biologically meaningful if the response variable is “# of offspring” since having 2 more babies due to the predictor variable (relative to those NOT experiencing the predictor variable) could be a very big deal. However, if the response variable is say, the “# of times an animal blinks in an hour” it is less convincing that the value is biologically meaningful.

Now looking at the SE or 95% CI you can then evaluate how much confidence you have in whether your predictor variable can be trusted. If the CI around your estimate of 2 offspring is wide and includes zero, I would be dubious about making a conclusive statement regarding the impact of the predictor. I consider several possible interpretations: 1) the predictor variable is acting as a proxy for the real variable of impact and there is not a perfect correlation between those variables so we get “noise”; 2) the strength of the effect of that variable might be real but not very strong so you need a larger sample size to cut through the noise of other interfering variables you did not measure and account for in models; 3) it was a fluke that you got a pattern at all (a few outliers?) and the predictor variable really has no impact on your response variable. My takeaway from this situation is that the experiment needs repeating with a larger sample and/or I need to redesign my experiment. This of course would be worth the trouble only if you have a strong intuition that there really is something biologically meaningful going on. (There’s the art to the science part).
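The interval check described above can be sketched in a few lines. This assumes an approximate (Wald-type) 95% interval of estimate ± 1.96 × SE on the scale the model was fitted on, and the standard errors are invented to go with the “2 offspring” example.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval: estimate +/- z * SE."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def spans_zero(interval):
    """True if zero lies inside the interval (endpoints included)."""
    low, high = interval
    return low <= 0.0 <= high


# Hypothetical: effect of 2 offspring with a large SE, then with a small SE.
wide = wald_interval(2.0, 1.2)
narrow = wald_interval(2.0, 0.5)

print(wide, spans_zero(wide))      # interval includes zero
print(narrow, spans_zero(narrow))  # interval excludes zero
```

If the model uses a link function, the check belongs on the link scale, before back-transforming the estimate.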

These conclusions only inform how to proceed experimentally. I’m not recommending these as arguments in a paper to “rescue” a poorly supported model.

You can always model average the top models (within a certain delta if you like) and then look at the estimates and CI to make conclusions.

Also, don’t forget to look at the AIC weights. If the models within several deltas are weighted equally then there may be more explanatory power in the predictor variables in the various models. By this same logic, I also consider there to be less support for a model that is within a few deltas but has very little weight. Whether this is correct or not, I can’t say for sure. But I try to use all the quantitative evidence I get from model selection to interpret the results and establish how much confidence I have in those interpretations.

I hope this helps and hope you find the book very informative- I did!

By: tiglesias on March 26, 2012 at 8:27 pm

Good article. Well-written! Thanks.

By: Bruce Robertson on June 27, 2012 at 2:15 pm

Thanks, Bruce!

By: tiglesias on June 27, 2012 at 3:39 pm