The Wild Animal Modeling Wiki
Care with estimated breeding values
Predictions of breeding values from animal models have been used to make a range of inferences on parameters of evolutionary importance, including, selection with respect to genetic variation at the individual level, population trends in mean breeding value over time (microevolution), and genetically-based differences among populations. However, despite the promising name of the technique with which these predictions are made (best linear unbiased prediction, or BLUP), applications of predicted breeding values can be very seriously biased, and are often far from the best option.One simple example of very serious bias was described by Erik Postma (E. Postma, 2006. Implications of the difference between true and predicted breeding values for the study of natural selection and micro-evolution. Journal of Evolutionary Biology 19: 309-320.). If predictions of breeding values are obtained from a model that does not account for (potential) environmentally-induced trends in phenotype over time, trends in predicted breeding values will mimic the phenotypic trend, even in the complete absence of a genetic trend. We can verify this by simulation:
est.without.cohort<-1:100 est.with.cohort<-1:100 for(rep in 1:100){ random.effects<-cbind(1:100,rnorm(100,0,1)) dogs<-as.data.frame(cbind(1:1000,expand.grid(1:200,1:5)[,2], ceiling(runif(1000,0,100)),NA,rnorm(1000,0,1))) names(dogs)<-cbind("ID","COHORT","LITTER","LITTER.value","ENV") dogs$LITTER.value<-random.effects[match(dogs$LITTER,random.effects[,1]),2] dogs$PHEN<-dogs$COHORT*0.3+dogs$LITTER.value+dogs$ENV dogs$COHORT<-as.factor(dogs$COHORT) dogs$LITTER<-as.factor(dogs$LITTER) library(asreml) model1<-asreml(PHEN~1, random=~LITTER, data=dogs) blups1<-cbind(1:100,summary(model1)$coef.random[,1]) dogs$blups.model1<-blups1[match(as.character(dogs$LITTER), as.character(blups1[,1])),2] model2<-asreml(PHEN~COHORT, random=~LITTER, data=dogs) blups1<-cbind(1:100,summary(model2)$coef.random[,1]) dogs$blups.model2<-blups1[match(as.character(dogs$LITTER), as.character(blups1[,1])),2] est.without.cohort[rep]<-summary(lm(blups.model1~ as.numeric(COHORT),data=dogs))$coeff[2] est.with.cohort[rep]<-summary(lm(blups.model2~ as.numeric(COHORT),data=dogs))$coeff[2] # this line is because the code for associating blups with their # effects is only right when all effects occur in the data # this will make the "In cbind(..." warnings ignorable if(length(summary(model1)$coef.random[,1])!=100| length(summary(model2)$coef.random[,1])!=100) rep<-rep-1 } > mean(est.without.cohort) [1] 0.02618011 > sd(est.without.cohort) [1] 0.02118003 > > > mean(est.with.cohort) [1] -7.418099e-06 > sd(est.with.cohort) [1] 0.02072310 >
Here the pedigree is defined by litters, rather than the inverse of the A matrix typical to animal models, and some phenotype of dogs is being tested for evolution. Even though there is not change in the population, we pick it up if we fail to model temporal variation at the level of the phenotype.
We now see how to get unbiased results from this statistical procedure of regressing predicted breeding values on time to make inferences about microevolution. However, a nasty statistical issue remains, one that often comes up when we do statistics on statistics. If we want to test the statistical significance of the trend in breeding values over time, we could (and have) naively fit regressions of breeding values on time. However, this approach treats every predicted breeding value as (a) known without error and (b) independent. In fact, in the sorts of pedigrees that we deal with in the analysis of data from natural populations, breeding values are predicted with a lot of error, and relatedness (the very factor that we rely on to run our models) means that individual predictions of breeding values are non-independent, both across and within cohorts. Jarrod Hadfield (J.D. Hadfield, A.J. Wilson, D. Garant, B.C. Sheldon, L.E.B. Kruuk. 2010. The Misuse of BLUP in Ecology and Evolution. American Naturalist 175: 116-125.) provided the solution, which is to account for the full uncertainty and non-independence by integrating the regression over the joint posterior distribution of the breeding values. The posterior distribution of a animal model is generally computationally expensive to obtain, and so the simulation approach we implemented above is out of the question. But we can make one iteration of the model, and analyze it with both ways to see the difference:
library(asreml) library(MCMCglmm) ## simulate the data (this time ten years) random.effects<-cbind(1:200,rnorm(200,0,1)) dogs<-as.data.frame(cbind(1:2000,expand.grid(1:200,1:10) [,2],rep(ceiling(runif(200,0,20)),10),NA,rnorm(1000,0,1))) names(dogs)<-cbind("ID","COHORT","LITTER","LITTER.value","ENV") for(x in 1:2000) dogs$LITTER[x]<-paste(dogs$COHORT[x],dogs$LITTER[x],sep="") dogs$LITTER.value<-random.effects[match(as.numeric(as.factor(dogs$LITTER)), random.effects[,1]),2] dogs$PHEN<-dogs$COHORT*0.3+dogs$LITTER.value+dogs$ENV dogs$COHORT<-as.factor(dogs$COHORT) dogs$LITTER<-as.factor(dogs$LITTER) ## fit the models model2<-asreml(PHEN~COHORT, random=~LITTER, data=dogs) model.MCMCglmm<-MCMCglmm(PHEN~COHORT, random=~LITTER, data=dogs,pr=TRUE) ## estimate the change in breeding values and evaluate statistical confidence ## 1) blup regression blups1<-summary(model2)$coef.random[,1] dogs$blups.model2<-blups1[ match(paste("LITTER_",as.character(dogs$LITTER),sep=""), names(blups1))] summary(lm(blups.model2~as.numeric(COHORT),data=dogs)) ## 2) integration over the posterior distribution animal.indexes<-match(paste("LITTER.",as.numeric(as.character(dogs$LITTER)), sep=""),names(model.MCMCglmm$Sol[1,][])) post<-1:1000 for(i in 1:1000){ post[i]<-summary(lm(model.MCMCglmm$Sol[i,][animal.indexes]~ as.numeric(as.character(dogs$COHORT))))$coef[2] } table(post>0) ## note this is one tailed ## get something more comparable to the SE reported above se.range<-HPDinterval(as.mcmc(post),0.5) mode<-posterior.mode(as.mcmc(post)) mean(abs(se.range[2]-mode),abs(mode-se.range[1])) ## and we can get the output: # > ## estimate the change in breeding values and evaluate statistical confidence # > ## 1) blup regression # > blups1<-summary(model2)$coef.random[,1] # > dogs$blups.model2<-blups1[ # + match(paste("LITTER_",as.character(dogs$LITTER),sep=""), # + names(blups1))] # > summary(lm(blups.model2~as.numeric(COHORT),data=dogs)) # # Call: # lm(formula = blups.model2 ~ as.numeric(COHORT), data = dogs) # # Residuals: # Min 1Q Median 3Q Max # -2.31619 -0.56308 0.04386 0.63695 1.79261 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -0.016829 0.042875 -0.393 0.695 # as.numeric(COHORT) 0.004905 0.006910 0.710 0.478 # # Residual standard error: 0.8876 on 1998 degrees of freedom # Multiple R-squared: 0.0002522, Adjusted R-squared: -0.0002482 # F-statistic: 0.504 on 1 and 1998 DF, p-value: 0.4778 # # > # > ## 2) integration over the posterior distribution # > # animal.indexes<-match(paste("LITTER.", # as.numeric(as.character(dogs$LITTER)),sep=""), # names(model.MCMCglmm$Sol[1,][])) # > post<-1:1000 # > for(i in 1:1000){ # + post[i]<-summary(lm(model.MCMCglmm$Sol[i,] # [animal.indexes]~as.numeric(as.character(dogs$COHORT))))$coef[2] # + } # > table(post>0) ## note this is one tailed # # FALSE TRUE # 410 590 # > # > ## get something more comparable to the SE reported above # > se.range<-HPDinterval(as.mcmc(post),0.5) # > mode<-posterior.mode(as.mcmc(post)) # > mean(abs(se.range[2]-mode),abs(mode-se.range[1])) # [1] 0.01473613
So the test for a trend in breeding values is anti-conservative when uncertainty and non-independence at the level of individual predicted breeding values is not taken into account. Both estimates were non-significant, as they should have been, in this particular run, because we fit the right model and so the slope of the regression of estimated breeding values on cohort was very small in both analyses. But note how the standard error appeared to be less than half what it should have been in the analysis that didn't account for the uncertainty and the non-independence in the 'best' predictions of the breeding values.
Hadfield et al. 2010 also provides detailed discussion of (a) why uses of blups to test whether or not breeding values covary with fitness and whether or not different groups of individuals differ genetically (b) solutions to these problems as well.