August 1, 2017

Model Similarity

Question: How similar is the model of the NY Yankees baseball team to the model of our home team, the Baltimore Orioles?

  • Common answer: The Kullback-Leibler Divergence.
  • Different answer: The number of Baltimore at-bat half-innings we would need to simulate to correctly reject the Yankees' model 95 percent of the time.

Baseball Model

  • Markov chain fitted to data from the 2011 Baltimore Orioles, batting at home (Marchi & Albert, 2013).
  • Joint work: Rebeca Berger, American University, Class of 2017.
# simulate 5 half-innings from the fitted Orioles (BAL) model
sim.baseball(5, BAL, seed=1)
## [1] "0|0X|0XX|XXX"           "0|1|1X|23X|3X|0XX|XXX" 
## [3] "0|0X|0XX|XXX"           "0|1|1X|2XX|XXX"        
## [5] "0|1|2|2X|2X|1X|1XX|XXX"
# log-likelihood of each simulated half-inning under the BAL model
likes.baseball(sim.baseball(5, BAL, seed=1), BAL)
## [1]  -1.157964 -11.320255  -1.157964  -5.652618 -14.439420
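The sketch below illustrates one way likes.baseball could be computed under such a model; half.inning.loglik and P are hypothetical names, and the body is an assumption rather than the original code.

# Hypothetical sketch: the log-likelihood of one simulated half-inning under
# a fitted Markov chain is the sum of the log transition probabilities along
# its base-out state sequence.  'P' is assumed to be a row-stochastic
# transition matrix whose dimnames are the state labels.
half.inning.loglik <- function(half.inning, P) {
  states <- strsplit(half.inning, "|", fixed = TRUE)[[1]]
  from   <- states[-length(states)]
  to     <- states[-1]
  sum(log(P[cbind(from, to)]))
}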

New Models: t(5) and N(0,1)

Running: t(5) and N(0,1)

# draw 200,000 samples from the t-distribution with 5 degrees of freedom
samples.t5 <- sim.t(200000, df=5, seed=7); head(samples.t5, 5)
## [1]  4.4328664  0.5426187 -0.1550494 -0.1240639  2.1740723
likes.t5.t5     <- likes.t(samples.t5, df=5)    # log-likelihood under t(5)
likes.t5.normal <- likes.t(samples.t5, df=Inf)  # log-likelihood under N(0,1)
head(like.ratios(likes.t5.t5, likes.t5.normal), 5)
## [1]  4.98941721 -0.07411863 -0.05205052 -0.05120605  0.31733761
  • A positive log-likelihood-ratio selects the correct model.
  • Adding the per-sample ratios gives the log-likelihood-ratio of an ensemble of samples; a hedged sketch of these helpers appears below.
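A minimal sketch of how these helpers could be implemented (assumed, not the original code); it relies on dt() accepting df = Inf, which returns the standard Normal density.

# Assumed implementations mirroring the calls above.
sim.t <- function(n, df, seed) { set.seed(seed); rt(n, df = df) }
likes.t <- function(x, df) dt(x, df = df, log = TRUE)   # df = Inf gives N(0,1)
like.ratios <- function(lik.model, lik.alternative) lik.model - lik.alternative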

The Bootstrap Matrix

  • Each element of the \(i^{th}\) column contains the sum of \(i\) independent log-likelihood-ratios.
  • The rows are independent.
bootstrap.matrix(likes.t5.t5, likes.t5.normal,
                 max.samples=5, bootstrap.rows=7, seed=5)
##             [,1]        [,2]        [,3]       [,4]       [,5]
## [1,] -0.06198447  0.80318724  0.04472667  0.2339618 -0.3907877
## [2,] -0.09664454  0.08452911 -0.18877694 -0.2913085  5.2309346
## [3,] -0.05415492 -0.16289951 -0.23643338  1.6543063  1.2374765
## [4,] -0.08426828  0.76323568  0.49300372 -0.2023973 -0.3325500
## [5,] -0.05264315 -0.14691627 -0.20688222  2.0281977  0.1084606
## [6,] -0.08581603  1.06363358 -0.02712145 -0.2683411 -0.3006797
## [7,] -0.05088923 -0.15060100 -0.24338174 -0.2727284 -0.4206254
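One implementation consistent with the description above (an assumption, not the original code) resamples the per-sample log-likelihood-ratios with replacement and takes cumulative sums along each row.

# Assumed sketch: row j holds cumulative sums of resampled ratios, so
# column i contains a sum of i independent log-likelihood-ratios.
bootstrap.matrix <- function(likes.model, likes.alternative,
                             max.samples, bootstrap.rows, seed) {
  set.seed(seed)
  ratios <- likes.model - likes.alternative
  t(replicate(bootstrap.rows,
              cumsum(sample(ratios, max.samples, replace = TRUE))))
}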

Proportion Correct in Each Column

Each proportion is a multiple of 1/(number of bootstrap rows), so the curve moves in discrete steps and is poorly suited for regression; a sketch of the calculation follows.
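# Assumed helper (not from the slides): the fraction of bootstrap ensembles
# in each column that select the correct model, i.e. have a positive summed
# log-likelihood-ratio.
proportion.correct <- function(boot) colMeans(boot > 0)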

\(5^{th}\) Percentile of Log-Likelihood-Ratio

- The "proportion correct" crosses 0.95 at the same location where the \(5^{th}\) percentile (not discrete) crosses zero.

Region of Interest

  • We restrict the \(5^{th}\) percentiles to a region of interest, then use regression to compute where they cross zero, as sketched below.
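A hedged sketch of that step (the region-of-interest rule and the helper name are assumptions, not the original code): keep the columns whose \(5^{th}\) percentile is closest to zero, fit a line through them with lm(), and solve for the zero crossing.

# Assumed sketch: fit a line to the 5th percentiles near zero and return the
# (fractional) number of samples at which the fitted line crosses zero.
crossing.estimate <- function(q05, n.keep = 50) {
  n    <- seq_along(q05)
  keep <- order(abs(q05))[seq_len(n.keep)]   # hypothetical region-of-interest rule
  fit  <- lm(q05[keep] ~ n[keep])
  unname(-coef(fit)[1] / coef(fit)[2])
}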

Computing the Number of Samples

samples.needed(likes.t5.t5, likes.t5.normal,
               max.samples=250, bootstrap.rows=1000, seed=5,
               confidence.level=0.95)
## [1] 126.3708
repeated.estimates(reps=6, seed=1)
## [1] 139.4967 129.8585 125.4115 126.1486 137.8517 125.1627
  • Occasionally the method fails and produces nonsensical results, but such estimates are easy to spot; a summary usage sketch follows.
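A usage sketch (an assumption about how the summary in the conclusions could be produced, not a new result): with 100 repeated estimates, report the mean plus or minus two standard errors.

# Assumed usage; any nonsensical estimates would be screened out first.
est <- repeated.estimates(reps=100, seed=1)
mean(est)
2 * sd(est) / sqrt(length(est))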

Density of Estimates

Conclusions

  • We found little, if any, improvement from the quantreg::rq() function over the two-step approach quantile() \(\rightarrow\) lm(); having the result of the intermediate step available made the procedure more efficient.
  • Based on a distribution derived from 100 estimates, it takes approximately \(128 \pm 1\) samples (mean \(\pm\) 2 standard errors) to correctly reject the standard Normal distribution using samples from the t-distribution with 5 degrees of freedom.
  • The above result required minutes of computation, not hours, and was robust.

Why (or Why Not) Samples Needed?

  • Advantages: provides a measure of similarity between models that may be easier to interpret than the Kullback-Leibler Divergence.

  • Disadvantages: harder and less natural to compute than the Kullback-Leibler divergence.

Future Work

I am interested in neuroscience models where the likelihood must be approximated with sequential Monte Carlo techniques.

Questions?

Links & Acknowledgments