August 1, 2017

Model Similarity

Question: How similar is the model of the NY Yankees baseball team to the model of our home team, the Baltimore Orioles?

  • Common answer: The Kullback-Leibler Divergence.
  • Different answer: The number of Baltimore at-bat half-innings we would need to simulate to correctly reject the Yankees' model 95 percent of the time.

Baseball Model

  • Markov chain fitted to data from the 2011 Baltimore Orioles, batting at home (Marchi & Albert, 2013).
  • Joint work: Rebeca Berger, American University, Class of 2017.
# simulate 5 half-innings from the fitted Orioles (BAL) model
sim.baseball(5, BAL, seed=1)
## [1] "0|0X|0XX|XXX"           "0|1|1X|23X|3X|0XX|XXX" 
## [3] "0|0X|0XX|XXX"           "0|1|1X|2XX|XXX"        
## [5] "0|1|2|2X|2X|1X|1XX|XXX"
# log-likelihood of each simulated half-inning under the BAL model
likes.baseball(sim.baseball(5, BAL, seed=1), BAL)
## [1]  -1.157964 -11.320255  -1.157964  -5.652618 -14.439420
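The sketch below illustrates one way likes.baseball could be computed under such a model; half.inning.loglik and P are hypothetical names, and the body is an assumption rather than the original code.

# Hypothetical sketch: the log-likelihood of one simulated half-inning under
# a fitted Markov chain is the sum of the log transition probabilities along
# its base-out state sequence.  'P' is assumed to be a row-stochastic
# transition matrix whose dimnames are the state labels.
half.inning.loglik <- function(half.inning, P) {
  states <- strsplit(half.inning, "|", fixed = TRUE)[[1]]
  from   <- states[-length(states)]
  to     <- states[-1]
  sum(log(P[cbind(from, to)]))
}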

New Models: t(5) and N(0,1)

Running: t(5) and N(0,1)

# draw 200,000 samples from the t-distribution with 5 degrees of freedom
samples.t5 <- sim.t(200000, df=5, seed=7); head(samples.t5, 5)
## [1]  4.4328664  0.5426187 -0.1550494 -0.1240639  2.1740723
likes.t5.t5     <- likes.t(samples.t5, df=5)    # log-likelihood under t(5)
likes.t5.normal <- likes.t(samples.t5, df=Inf)  # log-likelihood under N(0,1)
head(like.ratios(likes.t5.t5, likes.t5.normal), 5)
## [1]  4.98941721 -0.07411863 -0.05205052 -0.05120605  0.31733761
  • A positive log-likelihood-ratio selects the correct model.
  • Adding the per-sample ratios gives the log-likelihood-ratio of an ensemble of samples; a hedged sketch of these helpers appears below.
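A minimal sketch of how these helpers could be implemented (assumed, not the original code); it relies on dt() accepting df = Inf, which returns the standard Normal density.

# Assumed implementations mirroring the calls above.
sim.t <- function(n, df, seed) { set.seed(seed); rt(n, df = df) }
likes.t <- function(x, df) dt(x, df = df, log = TRUE)   # df = Inf gives N(0,1)
like.ratios <- function(lik.model, lik.alternative) lik.model - lik.alternative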

The Bootstrap Matrix

  • Each element of the \(i^{th}\) column contains the sum of \(i\) independent log-likelihood-ratios.
  • The rows are independent.
bootstrap.matrix(likes.t5.t5, likes.t5.normal,
                 max.samples=5, bootstrap.rows=7, seed=5)
##             [,1]        [,2]        [,3]       [,4]       [,5]
## [1,] -0.06198447  0.80318724  0.04472667  0.2339618 -0.3907877
## [2,] -0.09664454  0.08452911 -0.18877694 -0.2913085  5.2309346
## [3,] -0.05415492 -0.16289951 -0.23643338  1.6543063  1.2374765
## [4,] -0.08426828  0.76323568  0.49300372 -0.2023973 -0.3325500
## [5,] -0.05264315 -0.14691627 -0.20688222  2.0281977  0.1084606
## [6,] -0.08581603  1.06363358 -0.02712145 -0.2683411 -0.3006797
## [7,] -0.05088923 -0.15060100 -0.24338174 -0.2727284 -0.4206254
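One implementation consistent with the description above (an assumption, not the original code) resamples the per-sample log-likelihood-ratios with replacement and takes cumulative sums along each row.

# Assumed sketch: row j holds cumulative sums of resampled ratios, so
# column i contains a sum of i independent log-likelihood-ratios.
bootstrap.matrix <- function(likes.model, likes.alternative,
                             max.samples, bootstrap.rows, seed) {
  set.seed(seed)
  ratios <- likes.model - likes.alternative
  t(replicate(bootstrap.rows,
              cumsum(sample(ratios, max.samples, replace = TRUE))))
}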

Proportion Correct in Each Column

Each proportion is a multiple of 1/(number of bootstrap rows), so the curve moves in discrete steps and is poorly suited for regression; a sketch of the calculation follows.
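# Assumed helper (not from the slides): the fraction of bootstrap ensembles
# in each column that select the correct model, i.e. have a positive summed
# log-likelihood-ratio.
proportion.correct <- function(boot) colMeans(boot > 0)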

\(5^{th}\) Percentile of Log-Likelihood-Ratio

- The "proportion correct" crosses 0.95 at the same location where the \(5^{th}\) percentile (not discrete) crosses zero.

Region of Interest

  • We restrict the \(5^{th}\) percentiles to a region of interest, then use regression to compute where they cross zero, as sketched below.
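A hedged sketch of that step (the region-of-interest rule and the helper name are assumptions, not the original code): keep the columns whose \(5^{th}\) percentile is closest to zero, fit a line through them with lm(), and solve for the zero crossing.

# Assumed sketch: fit a line to the 5th percentiles near zero and return the
# (fractional) number of samples at which the fitted line crosses zero.
crossing.estimate <- function(q05, n.keep = 50) {
  n    <- seq_along(q05)
  keep <- order(abs(q05))[seq_len(n.keep)]   # hypothetical region-of-interest rule
  fit  <- lm(q05[keep] ~ n[keep])
  unname(-coef(fit)[1] / coef(fit)[2])
}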

Computing the Number of Samples

samples.needed(likes.t5.t5, likes.t5.normal,
               max.samples=250, bootstrap.rows=1000, seed=5,
               confidence.level=0.95)
## [1] 126.3708
repeated.estimates(reps=6, seed=1)
## [1] 139.4967 129.8585 125.4115 126.1486 137.8517 125.1627
  • Occasionally the method fails and produces nonsensical results, but such estimates are easy to spot; a summary usage sketch follows.
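A usage sketch (an assumption about how the summary in the conclusions could be produced, not a new result): with 100 repeated estimates, report the mean plus or minus two standard errors.

# Assumed usage; any nonsensical estimates would be screened out first.
est <- repeated.estimates(reps=100, seed=1)
mean(est)
2 * sd(est) / sqrt(length(est))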

Density of Estimates

Conclusions

  • We found little, if any, improvement from the quantreg::rq() function over the two-step approach quantile() \(\rightarrow\) lm(); having the result of the intermediate step available made the procedure more efficient.
  • Based on a distribution derived from 100 estimates, it takes approximately \(128 \pm 1\) samples (mean \(\pm\) 2 standard errors) to correctly reject the standard Normal distribution using samples from the t-distribution with 5 degrees of freedom.
  • The above result required minutes of computation, not hours, and was robust.

Why (or Why Not) Samples Needed?

  • Advantages: provides a measure of similarity between models that may be easier to interpret than the Kullback-Leibler Divergence.

  • Disadvantages: harder and less natural to compute than the Kullback-Leibler divergence.

Future Work

I am interested in neuroscience models where the likelihood must be approximated with sequential Monte Carlo techniques.

Questions?

Links & Acknowledgments