The Basic Basics:

Bayesian Inference
on a Binomial Proportion

Rob Mealey

"And you are?"

By a show of hands:

  1. Stats?
  2. Probability?
  3. Equations?
  4. Bayes?
  5. Code?

A Chickensh** Disclaimer

Princess Bride

I'll do my best
not to be tedious
about it...

"Tell em what you’re gonna tell em..."

Fast But Useless Introductions™:
Probability

Probability allows us to quantify our uncertainty...

Probabilities:

  1. Point probabilities: $80\%$ chance of rain
  2. Probability range: e.g., between a $70\%$ and $90\%$ chance of rain
  3. Probability distribution:

    • A model of uncertainty around some value or process
    • A function: $y = f(x)$
    • Describes the likelihood of specific values of some random variable.

Rules:

> Princess Bride

What does normal mean, anyway?

$$ p(x | \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{{-\frac{1}{2}} (\frac{x-\mu}{\sigma} )^2} $$
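In R, that density is just dnorm; a minimal sketch of the standard normal ($\mu = 0$, $\sigma = 1$), plotted over a few standard deviations (the range is an arbitrary choice):

###################################################
### The normal density, plotted
###################################################
x <- seq(-4, 4, 0.01)
plot(x, dnorm(x, mean = 0, sd = 1), type = 'l',
	xlab = 'x', ylab = 'p(x)')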

Fast But Useless Introductions™: Bayes Theorem

You need the joint, man...

and the conditional...

This equivalence...

$$ p(A \text{ and } B) = p(B \text{ and } A)$$

and this definition...

$$ \Rightarrow p(A,B) = p(A|B) p(B) $$ $$ \Rightarrow p(B,A) = p(B|A) p(A) $$

$$ p(A \text{ and } B) = p(A \text{ given } B) p(B) $$ $$ p(A \text{ and } B) = p(B \text{ given } A) p(A) $$

implies this relationship...

$$ \Rightarrow p(A|B) p(B) = p(B|A) p(A) $$

$$ p(A \text{ given } B) p(B) = p(B \text{ given } A) p(A) $$

Which gives us Bayes Theorem.

$$ \Rightarrow p(A|B) = \frac{p(B|A) p(A)}{p(B)} $$

$$ p(A \text{ given } B) = \frac{p(B \text{ given } A) p(A)}{p(B)} $$

Works both ways, obviously.

$$ p(B|A) = \frac{p(A|B) p(B)}{p(A)} $$

$$ p(B \text{ given } A) = \frac{p(A \text{ given } B) p(B)}{p(A)} $$

A very accurate test for a very rare disease:

Test $98\%$ accurate
$0.2\%$ of population have disease

$p(\text{positive}|\text{disease}) = 0.98$

$p(\text{disease}) = 0.002$

$p(\text{positive}) = p(\text{positive}|\text{disease}) p(\text{disease}) + p(\text{positive}|\text{no disease}) p(\text{no disease})$

$p(\text{disease}|\text{positive}) = \frac{p(\text{positive}|\text{disease}) p(\text{disease})}{p(\text{positive})} = \frac{0.98 \times 0.002}{0.98 \times 0.002 + 0.02 \times 0.998} = \frac{0.00196}{0.02192} = 0.0894 $

$p(\text{disease}|\text{positive}) = 0.0894 $

$8.94\%$ chance of having the disease
given an initial positive test
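A quick check of that arithmetic in R, using nothing but the formula above:

###################################################
### Bayes theorem check: rare disease example
###################################################
p_pos_given_disease <- 0.98   # test sensitivity
p_disease <- 0.002            # prevalence
p_pos_given_healthy <- 0.02   # false positive rate

# Total probability of a positive test
p_pos <- p_pos_given_disease * p_disease +
	p_pos_given_healthy * (1 - p_disease)

# Bayes theorem
p_pos_given_disease * p_disease / p_pos   # 0.0894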

The version we will use...

$$ p( \theta | Y ) = \frac{p( Y | \theta ) \times p( \theta )}{p( Y )} $$

$$ p(\text{parameter} | \text{data} ) = \frac{p( \text{data} | \text{parameter} ) \times p( \text{parameter} )}{p( \text{data} )} $$

Up to a multiplicative constant...

$$ p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta ) $$

$$ p(\text{parameter} | \text{data} ) \text{ is proportional to } p( \text{data} | \text{parameter} ) \times p( \text{parameter} ) $$

The Prior times the Likelihood
is Proportional to the Posterior.

So you want to analyze some data...

How it begins:

What A Classical Data Analysis Gives Us:

  1. Point estimates

  2. p values: the probability of observing data at least as extreme as our sample under the hypothesis that the true value of the parameter is 0...

  3. Confidence interval: not a probability range...

    An interval containing all the candidate values of our parameter of interest for which, given our sample, we would not reject a null hypothesis that the candidate value is the true one...

All implicitly conditional on our assumptions

(and maybe lots of other assumptions)

Advantages of Classical Data Analysis

Princess Bride

There are reasons...

Disadvantages of Classical Data Analysis

Princess Bride

Princess Bride

Inflexible

Princess Bride

Asymptotic justifications...

What A Bayesian Data Analysis Gives Us:

Probability distributions!

Explicitly conditional on our assumptions!

Advantages of Bayesian Data Analysis

Princess Bride

It's all probability...

Disadvantages of Bayesian Data Analysis

More time
feeling dumb.

Fisher

More difficult to scale.

Fisher

Few widely adopted off-the-shelf solutions.

Fisher

And now for the main event...

Our Data:
Observations of some process
with just two outcomes

Fisher

Lots of real-world processes
can be modeled like this.

Free Throws
Obama vs. McCain

$Z$ is every occurrence of $z_i$,
whether or not they've already "happened"
...
whether or not they'll all ever happen...

$$ Z = z_0,z_1,... $$

The outcome of any $z_i$ can only be either $A$ or $B$.

i.e. A free throw can either be made or missed...

$$ z_i \in \{A, B\} \quad \forall i $$

Preggers

In order to model
a process mathematically,
we have to define
one outcome as...

a success...
Preggers

and the other as a failure.

Preggers

$$ y_i = \begin{cases} 1 & \text{if } z_i = A \\ 0 & \text{if } z_i = B \end{cases} $$

$\theta$ equals the proportion
of the $N$ cases of $Z$
equal to $A$...

$$ \theta = \frac{\sum_{i=1}^{N} y_i}{N} = \frac{Y}{N} $$

The Binomial Distribution:

$$ p(Y) = {N \choose Y} \theta^Y (1-\theta)^{(N-Y)} $$

$$ p(y_i) = \theta^{y_i} (1-\theta)^{(1-y_i)} $$
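A quick sketch of the binomial pmf using R's built-in dbinom; $N = 20$ is an arbitrary choice here, and the three values of $\theta$ match the plots that follow:

###################################################
### Binomial pmf for N = 20 at several thetas
###################################################
N <- 20
Y <- 0:N
for (theta in c(0.2, 0.5, 0.8)) {
	plot(Y, dbinom(Y, size=N, prob=theta), type='h',
		xlab='Y', ylab='p(Y)', main=paste('theta =', theta))
}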

$$\theta = 0.2$$

$$\theta = 0.5$$

$$\theta = 0.8$$

Playing make-believe...

Our population...

The Prior:

$$p( \theta )$$

$$p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$

Beta Distribution:

$$ p(\theta) \propto \theta^{(\alpha - 1)} (1 - \theta)^{(\beta - 1)} $$

Fisher
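A minimal sketch for exploring Beta shapes with R's dbeta; the $(\alpha, \beta)$ pairs below are arbitrary examples, not ones used later:

###################################################
### Beta densities for a few (alpha, beta) pairs
###################################################
theta <- seq(0.005, 0.995, 0.005)
shapes <- list(c(1, 1), c(2, 2), c(5, 1), c(0.5, 0.5))
for (s in shapes) {
	plot(theta, dbeta(theta, s[1], s[2]), type='l',
		xlab='theta', ylab='p(theta)',
		main=paste('Beta(', s[1], ',', s[2], ')'))
}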

A Visit with Conjugacy...

$$ p(\theta) \propto \theta^{(\alpha - 1)} (1 - \theta)^{(\beta - 1)} $$$$ p(Y|\theta) \propto \theta^Y (1-\theta)^{(N-Y)} $$

$$ p(Y|\theta) \times p(\theta) \propto $$$$ \theta^{(Y + \alpha - 1)} (1-\theta)^{(N - Y + \beta - 1)} $$

$$ \text{mean}\left[\frac{Y}{N}\right] = \frac{\theta \times N}{N} = \theta $$$$ \text{mean}[\theta] = \frac{\alpha}{\alpha + \beta} $$

$$ m = \frac{\alpha}{\alpha + \beta} $$$$ n = \alpha + \beta $$

$$ \alpha = m \times n $$$$ \beta = (1 - m) \times n $$

###################################################
### Function: Prior Plot Values
###################################################
Prior <- function(m, n, a=n*m, b=n*(1-m)){
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a,b)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p(\theta) \propto \theta^{(nm) - 1} (1-\theta)^{(n(1-m)) - 1} $$

The Likelihood:

$$ p( Y | \theta ) $$

$$ p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$
###################################################
### Function: Likelihood Plot Values
###################################################
Likelihood <- function(N, Y){
	# As a function of theta, the binomial likelihood is
	# proportional to a Beta(Y + 1, N - Y + 1) density
	a <- Y + 1
	b <- N - Y + 1
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a,b)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p( Y | \theta ) \propto \theta^{Y} (1-\theta)^{(N - Y)} $$

The Posterior:

$$p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$

$$p( \theta | Y ) \propto [\theta^{Y} (1-\theta)^{(N - Y)}] \times [\theta^{(nm - 1)} (1-\theta)^{(n(1-m) - 1)}] $$

$$p( \theta | Y ) \propto \theta^{(Y + nm - 1)} (1-\theta)^{(N - Y + n(1-m) - 1)}$$
###################################################
### Function: Posterior Plot Values
###################################################
Posterior <- function(m, n, N, Y, a_in=n*m, b_in=n*(1-m)){
	# Conjugate update: a Beta(a, b) prior and binomial data
	# give a Beta(Y + a, N - Y + b) posterior
	a_out <- Y + a_in
	b_out <- N - Y + b_in
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a_out,b_out)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p( \theta | Y ) \propto \theta^{(Y + nm - 1)} (1-\theta)^{(N - Y + n(1-m) - 1)} $$


$$ \beta - 1 = N - Y + n(1-m) - 1$$ $$\rightarrow \beta = N - Y + n(1-m) $$

$$ \alpha - 1 = Y + nm - 1$$ $$\rightarrow \alpha = Y + nm $$

###################################################
### Function: Mean of Posterior Beta
###################################################
MeanOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a
	b_out <- N - Y + b
	# Mean of a Beta(a, b) is a / (a + b)
	E_posterior <- a_out / (a_out + b_out)
	return(E_posterior)
}

###################################################
### Function: Mode of Posterior Beta
###################################################
ModeOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a
	b_out <- N - Y + b
	# Mode of a Beta(a, b) is (a - 1) / (a + b - 2), for a, b > 1
	mode_posterior <- (a_out - 1)/(a_out + b_out - 2)
	return(mode_posterior)
}

###################################################
### Function: Standard Deviation of Posterior Beta
###################################################
StdDevOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a
	b_out <- N - Y + b
	# Std dev of a Beta(a, b)
	sigma_posterior <- sqrt((a_out*b_out)/(((a_out+b_out)^2)*(a_out+b_out+1)))
	return(sigma_posterior)
}
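A quick sanity check with the sample used below ($N = 500$, $Y = 26$) and the uniform prior ($m = 0.5$, $n = 2$, i.e. Beta(1, 1)); with a flat prior the posterior mode is just the sample proportion:

###################################################
### Example: posterior summaries, uniform prior
###################################################
MeanOfPosterior(m=0.5, n=2, N=500, Y=26)     # 27/502 ~ 0.0538
ModeOfPosterior(m=0.5, n=2, N=500, Y=26)     # 26/500 = 0.052
StdDevOfPosterior(m=0.5, n=2, N=500, Y=26)   # ~ 0.0101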

Our population...

Our sample...

$N_{samp} = 500 $
$Y_{samp} = 26 $

$ p(Y_{samp}|\theta, N) \propto \theta^{26} (1-\theta)^{(500 - 26)}$
$p(Y_{samp}|\theta) \propto \theta^{26} (1-\theta)^{474}$

$(\theta_{samp} = \frac{Y_{samp}}{N_{samp}} = \frac{26}{500} = 0.052)$

Testing out different priors...

So how'd we do?

Results...

| Prior | Mean | Mode | Std Dev | Alpha | Beta |
|---|---|---|---|---|---|
| Uniform | 0.0520 | 0.0502 | 0.0099 | 26 | 474 |
| Bimodal | 0.0511 | 0.0493 | 0.0098 | 25.5 | 473.5 |
| Weak High | 0.0669 | 0.0652 | 0.0111 | 34 | 474 |
| Weak Equal | 0.0591 | 0.0573 | 0.0104 | 30 | 478 |
| Weak Low | 0.0512 | 0.0494 | 0.0098 | 26 | 482 |
| Strong High | 0.1923 | 0.1913 | 0.0161 | 115 | 483 |
| Strong Equal | 0.1254 | 0.1242 | 0.0135 | 75 | 523 |
| Strong Low | 0.0585 | 0.0570 | 0.0096 | 35 | 563 |

$$\theta_{true} = 0.042$$

Our population...

$$\theta_{mle} = \frac{Y_{samp}}{N_{samp}} = \frac{26}{500} = 0.052$$

A p value...

And for my next trick...

Obama vs. McCain

Using only these models...

#### 2008 Election ####
# 2008 parameters
elecDay <- '2008-11-04'
cutOff <- '2008-08-01'
elecPassed <- TRUE
elecType <- 'President'
candidates <- list('d'='Obama','r'='McCain')
# Priors: mean 0.5 with prior sample size 2, i.e. Beta(1, 1)
d_m <- 0.5
r_m <- 0.5
init_n <- 2

Two binomial proportion models, one for each candidate.

The prior for each update after the first is the posterior from the previous update, as sketched below.
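A minimal sketch of that sequential updating under the Beta-binomial setup above; the poll counts here are made up for illustration, not actual 2008 polling data:

###################################################
### Sketch: sequential Beta-binomial poll updates
###################################################
# Start from the Beta(1, 1) prior above (m = 0.5, n = 2)
a <- init_n * d_m
b <- init_n * (1 - d_m)

# Hypothetical polls: N respondents, Y supporting the candidate
polls <- data.frame(N = c(600, 850, 720), Y = c(310, 420, 390))

for (i in seq_len(nrow(polls))) {
	# Yesterday's posterior is today's prior
	a <- a + polls$Y[i]
	b <- b + polls$N[i] - polls$Y[i]
}

a / (a + b)   # posterior mean support after all polls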

The Data...

Anyone affiliated with Real Clear Politics?

18 states with full polling data still 'available'...

RCP Predictions

| State | Obama | McCain | Predicted Outcome | Predicted Winner | Forecast Error |
|---|---|---|---|---|---|
| FL | 49.00 | 47.20 | 1.80 | Obama | 1.00 |
| IN | 46.40 | 47.80 | -1.40 | McCain | 2.50 |
| MO | 47.80 | 48.50 | -0.70 | McCain | 0.60 |
| NC | 48.00 | 48.40 | -0.40 | McCain | 0.70 |
| NH | 52.80 | 42.20 | 10.60 | Obama | 1.00 |
| OH | 48.80 | 46.30 | 2.50 | Obama | 2.10 |
| VA | 50.20 | 45.80 | 4.40 | Obama | 1.90 |

RCP Predictions

My Predictions

| State | Obama | McCain | Predicted Outcome | Predicted Winner | Forecast Error |
|---|---|---|---|---|---|
| FL | 47.40 | 46.50 | 0.90 | Obama | 1.90 |
| IN | 46.50 | 47.40 | -0.90 | McCain | 2.00 |
| MO | 47.40 | 47.70 | -0.30 | McCain | 0.20 |
| NC | 47.40 | 47.30 | 0.10 | Obama | 0.20 |
| NH | 50.70 | 42.60 | 8.10 | Obama | 1.50 |
| OH | 47.80 | 45.30 | 2.50 | Obama | 2.10 |
| VA | 49.40 | 45.50 | 3.90 | Obama | 2.40 |

My Predictions

Actual Outcomes

| State | Obama | McCain | Outcome | Winner |
|---|---|---|---|---|
| FL | 51.00 | 48.20 | 2.80 | Obama |
| IN | 50.00 | 48.90 | 1.10 | Obama |
| MO | 49.30 | 49.40 | -0.10 | McCain |
| NC | 49.70 | 49.40 | 0.30 | Obama |
| NH | 54.10 | 44.50 | 9.60 | Obama |
| OH | 51.50 | 46.90 | 4.60 | Obama |
| VA | 52.60 | 46.30 | 6.30 | Obama |

Actual Outcomes

Performance...

| | RMSE | RMSE, Diff < 2% | Number Correctly Predicted |
|---|---|---|---|
| My Predictions | 4.27 | 1.17 | 17 |
| RealClearPolitics Predictions | 3.23 | 1.54 | 16 |
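For reference, the RMSE above is just the root mean squared difference between predicted and actual margins; a minimal sketch (the margin vectors are placeholders, not the actual data objects in the election code):

###################################################
### Sketch: forecast RMSE on predicted margins
###################################################
rmse <- function(predicted, actual){
	sqrt(mean((predicted - actual)^2))
}

# e.g. rmse(predicted_margins, actual_margins)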

So what about...

RCP Predictions

| State | Obama | Romney | Outcome | Winner |
|---|---|---|---|---|
| CO | 48.70 | 45.80 | 2.90 | Obama |
| FL | 49.30 | 46.10 | 3.20 | Obama |
| MO | 43.00 | 50.30 | -7.30 | Romney |
| NC | 47.80 | 46.70 | 1.10 | Obama |
| NH | 48.70 | 45.70 | 3.00 | Obama |
| VA | 48.70 | 45.00 | 3.70 | Obama |

RCP Predictions

My Predictions

| State | Obama | Romney | Outcome | Winner |
|---|---|---|---|---|
| CO | 47.90 | 46.10 | 1.80 | Obama |
| FL | 48.10 | 46.30 | 1.80 | Obama |
| MO | 42.70 | 49.30 | -6.60 | Romney |
| NC | 46.60 | 47.20 | -0.60 | Romney |
| NH | 49.00 | 44.80 | 4.20 | Obama |
| VA | 47.80 | 45.50 | 2.30 | Obama |

My Predictions

And if we have time...

A Tiny Bit of Context: The Holy War

"SUBJECTIVE!"
"UNINTUITIVE!"
"ASYMPTOTIC!"
"INTRACTABLE!"

Pierre-Simon Laplace:

"I'm pretty famous."

Laplace

"The theory of probabilities is at bottom nothing but common sense reduced to calculus;
it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which of times they are unable to account."

What's the collective noun
for a group of statisticians?

A "quarrel".

In the Bayesian corner...

Harold Jeffreys

"Oh.
Oh dear.
Well, then.
I do believe...
I definitely don't..."

Jeffreys

"The scientific method is neither deduction from a set of axioms nor a way of making plausible guesses, as Bertrand Russell said; but that it is a matter of successive approximation to probability distributions."

Jeffreys

And the winner...

R. A. Fisher:

Fisher

"Clever.
Now I will
ruin you."

Fisher

"Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first."

Fisher

Manhattan Project:
Human "Computers" at Work...

But the thing is...

Fisher

Enough of that...

All the code for both the simulation and the electoral model is up on GitHub...

Simulation Code

Election Code

These slides will be hosted at:
www.obscureanalytics.com

Questions?

Fisher