The Basic Basics:
Bayesian Inference
on a Binomial Proportion

Rob Mealey

By a show of hands:

Stats?
Probability?
Equations?
Bayes?
Code?

I'll do my best
not to be tedious
about it...

"Tell em what you’re gonna tell em..."

A Conceptual Introduction to Bayesian Data Analysis
A Few Equations...Ok, a lot of equations.
A Little Code
Pretty Pictures
Maybe A Little History

Fast But Useless Introductions™:
Probability

Probability allows us to quantify our uncertainty...

Probabilities:

Point probabilities: $80\%$ chance of rain
Probability range
Probability distribution:
- A model of uncertainty around some value or process
- A function: $y = f(x)$
- Describes the likelihood of specific values of some random variable.

Rules:

The probability of all possible values of a random variable always sums (or integrates) to 1.
Joint: probability of A and B $$ p(A,B) = p(B,A) = p(A) p(B) $$
Marginal: probability of A $$ p(A) = p(A)p(B) + p(A)p(\text{not }B)$$
Conditional: probability of A given x $$ p(A|B) = p(A \text{ and }B) p(B)$$

What does normal mean, anyway?

$$ p(x | \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{{-\frac{1}{2}} (\frac{x-\mu}{\sigma} )^2} $$

Fast But Useless Introductions™: Bayes Theorem

You need the joint, man...

and the conditional...

This equivalence...

$$ p(A \text{ and } B) = p(B \text{ and } A)$$

and this definition...

$$ \Rightarrow p(A,B) = p(A|B) p(B) $$ $$ \Rightarrow p(B,A) = p(B|A) p(A) $$

$$ p(A \text{ and } B) = p(A \text{ given } B) p(B) $$ $$ p(A \text{ and } B) = p(B \text{ given } A) p(A) $$

implies this relationship...

$$ \Rightarrow p(A|B) p(B) = p(B|A) p(A) $$

$$ p(A \text{ given } B) p(B) = p(B \text{ given } A) p(A) $$

Which gives us Bayes Theorem.

$$ \Rightarrow p(A|B) = \frac{p(B|A) p(A)}{p(B)} $$

$$ p(A \text{ given } B) = \frac{p(B \text{ given } A) p(A)}{p(B)} $$

Works both ways, obviously.

$$ p(B|A) = \frac{p(A|B) p(A)}{p(A)} $$

$$ p(B \text{ given } A) = \frac{p(A \text{ given } B) p(A)}{p(A)} $$

A very accurate test for a very rare disease:

Test $98\%$ accurate
$0.2\%$ of population have disease

$p(\text{positive}|\text{disease}) = 0.98$

$p(\text{disease}) = 0.002$

$p(\text{positive}) = p(\text{positive}|\text{disease}) p(\text{disease}) + p(\text{positive}|\text{no disease}) p(\text{no disease})$

$p(\text{disease}|\text{positive}) = \frac{p(\text{positive}|\text{disease}) p(\text{disease})}{p(\text{positive})} = \frac{0.98 * 0.002}{0.98 * 0.002 + 0.02 * 0.998} = \frac{0.000196}{0.020156} = 0.00894 $

$p(\text{disease}|\text{positive}) = 0.00894 $

$0.89\%$ chance of having disease
given an initial positive test

The version we will use...

$$ p( \theta | Y ) = \frac{p( Y | \theta ) \times p( \theta )}{p( Y )} $$

$$ p(\text{parameter} | \text{data} ) = \frac{p( \text{data} | \text{parameter} ) \times p( \text{parameter} )}{p( \text{data} )} $$

Up to a multiplicative constant...

$$ p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta ) $$

$$ p(\text{parameter} | \text{data} ) \text{ is proportional to } p( \text{data} | \text{parameter} ) \times p( \text{parameter} ) $$

The Prior times the Likelihood
is Proportional to the Posterior.

How it begins:

Data:

$ Y = y_1, y_2, y_3, ..., y_n $ $ X = x_1, x_2, x_3, ..., x_n $
Quantities of Interest:

$\mu_y $

$ \sigma_y $

$ \rho_{x,y} $
Assumptions!

$ y = f(x) + \epsilon $

$ \epsilon \overset{iid}{\sim} WN(0,\sigma^2) $

What A Classical Data Analysis Gives Us:

Point estimates
p values: the probability of observing our sample data under the hypothesis that the true value of the parameter is 0...
Confidence interval: not a probability range...

An interval that contains all the estimates for our parameter of interest for which we do not reject a null hypothesis of equality with our "true" parameter of interest...

All implicitly conditional on our assumptions

(and maybe lots of other assumptions)

Advantages of Classical Data Analysis

There are reasons...

Easier to DO (and scale).
Pre-packaged, widely adopted, off-the-shelf solutions.
Often computationally simpler.
Easier to black-box it.

Disadvantages of Classical Data Analysis

What A Bayesian Data Analysis Gives Us:

Probability distributions!

Explicitly conditional on our assumptions!

Advantages of Bayesian Data Analysis

It's all probability...

Easier to understand what comes out.
Easier to update and include disparate data sources.
Flexible.

Disadvantages of Bayesian Data Analysis

Few widely adopted off-the-shelf solutions.

Our Data:
Observations of some process
with just two outcomes

Lots of real-world processes
can be modeled like this.

$Z$ is every occurences of $z_i$,
whether or not they've already "happened"...
whether or not they'll all ever happen...

$$ Z = z_0,z_1,... $$

The outcome of any $z_i$ can only be either $A$ or $B$.

i.e. A free throw can either be made or missed...

$$ z_i \in (A, B) \forall i $$

In order to model
a process mathematically,
we have to define
one outcome as...

$$ y = \begin{cases} 1 & \text{if } z==A \\ 0 & \text{if } z==B \end{cases} $$

$\theta$ equals the proportion
of the $N$ cases of $Z$
equal to $A$...

$$ \theta = \frac{\Sigma_{i=1}^{N} y_i}{N} = \frac{Y}{N} $$

The Binomial Distribution:

$$ p(Y) = {N \choose Y} \theta^Y (1-\theta)^{(N-Y)} $$

$$ p(y_i) = \theta^{y_i} (1-\theta)^{(1-y_i)} $$

We don't know $\theta$...

## Generate Theta: Unknown Probability of Success
## Using built-in R pseudo-random number generator
theta_true <- runif(1,0,1)

We don't know even know $N$...

## N is random integer between 100,000 and 400,000
N = sample(seq(100000,400000),1)

Our Dataset:

## A is successes proportion of N
A = round(theta_true*N)
## B is the rest of N
B = N - A
## Zpop is randomly shuffled A successes and B failures
Zpop <- sample(c(rep(1,A),rep(0,B)))

The Prior:

$$p( \theta )$$

$$p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$

Beta Distribution:

\[ p(\theta) = \theta^{(\alpha - 1)} (1 - \theta)^{(\beta - 1)} \]

$$ p(\theta) = \theta^{(\alpha - 1)} (1 - \theta)^{(\beta - 1)} $$$$ p(Y|\theta) = \theta^Y (1-\theta)^{(N-Y)} $$

$$ p(Y|\theta) \times p(\theta) = $$$$\theta^{(Y + \alpha -1)} (1-\theta)^{(N - Y + \beta - 1)} $$

$$ mean[p(Y|\theta)] = \theta = \frac{\theta \times N}{N} $$$$ mean[p(\theta)] = \frac{\alpha}{\alpha + \beta} $$

$$ m = \frac{\alpha}{\alpha + \beta} $$$$ n = \alpha + \beta $$

$$ \alpha = m \times n $$$$ \beta = (1 - m) \times n $$

### Function: Prior Plot Values
###################################################
### Function: Prior Plot Values
###################################################
Prior <- function(m, n, a=n*m, b=n*(1-m)){
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a,b)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p(\theta) = \theta^{(nm) - 1} (1-\theta)^{(n(1-m)) - 1} $$

The Likelihood:

$$ p( Y | \theta ) $$

$$ p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$

###################################################
### Function: Likelihood Plot Values
###################################################
Likelihood <- function(N, Y){
	a <- Y + 1
	b <- N - Y + 1
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a,b)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p( Y | \theta ) = \theta^{(Y)} (1-\theta)^{(N - Y)} $$

The Posterior:

$$p( \theta | Y ) \propto p( Y | \theta ) \times p( \theta )$$

$$p( \theta | Y ) \propto [\theta^{(Y)} (1-\theta)^{(N - Y)}] \times [\theta^{(nm)} (1-\theta)^{(n(1-m))}] $$

$$p( \theta | Y ) \propto [\theta^{(Y + nm)} (1-\theta)^{(N - Y + n(1-m))}]$$

###################################################
### Function: Posterior Plot Values
###################################################
Posterior <- function(m, n, N, Y, a_in=n*m, b_in=n*(1-m)){
	a_out <- Y + a_in - 1
	b_out <- N - Y + b_in - 1
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a_out,b_out)
	return(data.frame('x'=dom, 'y'=val))
}

$$ p( \theta | Y ) \propto \theta^{(Y + nm)} (1-\theta)^{(N - Y + n(1-m))} $$

###################################################
### Function: Posterior Plot Values
###################################################
Posterior <- function(m, n, N, Y, a_in=n*m, b_in=n*(1-m)){
	a_out <- Y + a_in - 1
	b_out <- N - Y + b_in - 1
	dom <- seq(0,1,0.005)
	val <- dbeta(dom,a_out,b_out)
	return(data.frame('x'=dom, 'y'=val))
}

$$ \beta + 1 = N - Y + n(1-m)$$ $$\rightarrow \beta = N - Y + n(1-m) - 1 $$

$$ \alpha + 1 = (Y + nm)$$ $$\rightarrow \alpha = Y + nm - 1 $$

###################################################
### Function: Mean of Posterior Beta
###################################################
MeanOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a - 1
	b_out <- N - Y + b - 1
	E_posterior <- a_out / (a_out + b_out)
	return(E_posterior)
}

###################################################
### Function: Mode of Posterior Beta
###################################################
ModeOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a - 1
	b_out <- N - Y + b - 1
	mode_posterior <- (a_out - 1)/(a_out + b_out - 2)
	return(mode_posterior)
}

###################################################
### Function: Standard Deviation of Posterior Beta
###################################################
StdDevOfPosterior <- function(m, n, N, Y, a=n*m, b=n*(1-m)){
	a_out <- Y + a - 1
	b_out <- N - Y + b - 1
	sigma_posterior <- sqrt((a_out*b_out)/(((a_out+b_out)^2)*(a_out+b_out+1)))
	return(sigma_posterior)
}

Our sample...

$N_{samp} = 500 $
$Y_{samp} = 26 $

$ p(Y_{samp}|\theta, N) = \theta^{26} (1-\theta)^{(500 - 26)}$
$p(Y_{samp}|\theta) = \theta^{26} (1-\theta)^{474}$

$(\theta_{samp} = \frac{Y_{samp}}{n} = \frac{26}{474} = 0.052)$

Results...

Prior	Mean	Mode	Std Dev	Alpha	Beta
Uniform	0.0520	0.0502	0.0099	26	474
Bimodal	0.0511	0.0493	0.0098	25.5	473.5
Weak High	0.0669	0.0652	0.0111	34	474
Weak Equal	0.0591	0.0573	0.0104	30	478
Weak Low	0.0512	0.0494	0.0098	26	482
Strong High	0.1923	0.1913	0.0161	115	483
Strong Equal	0.1254	0.1242	0.0135	75	523
Strong Low	0.0585	0.0570	0.0096	35	563

$$\theta_{true} = 0.042$$

$$\theta_{mle} = \frac{Y_{samp}}{n} = \frac{26}{474} = 0.052$$

Using only these models...

#### 2008 Election ####
# 2008 Parameters
elecDay <- '2008-11-04'
cutOff <- '2008-08-01'
elecPassed = T
elecType <- 'President'
candidates <- list('d'='Obama','r'='McCain')
d_m = 0.5
r_m = 0.5
init_n = 2

Two binomial proportion models, one for each candidate.

The prior for each update after the first is the posterior of the last model.

Any one affiliated with Real Clear Politics?

18 states with full polling data still 'available'...

RCP Predictions

	Obama	McCain	Predicted Outcome	Predicted Winner	Forecast Error
FL	49.00	47.20	1.80	Obama	1.00
IN	46.40	47.80	-1.40	McCain	2.50
MO	47.80	48.50	-0.70	McCain	0.60
NC	48.00	48.40	-0.40	McCain	0.70
NH	52.80	42.20	10.60	Obama	1.00
OH	48.80	46.30	2.50	Obama	2.10
VA	50.20	45.80	4.40	Obama	1.90

My Predictions

	Obama	McCain	Predicted Outcome	Predicted Winner	Forecast Error
FL	47.40	46.50	0.90	Obama	1.90
IN	46.50	47.40	-0.90	McCain	2.00
MO	47.40	47.70	-0.30	McCain	0.20
NC	47.40	47.30	0.10	Obama	0.20
NH	50.70	42.60	8.10	Obama	1.50
OH	47.80	45.30	2.50	Obama	2.10
VA	49.40	45.50	3.90	Obama	2.40

Actual Outcomes

	Obama	McCain	Outcome	Winner
FL	51.00	48.20	2.80	Obama
IN	50.00	48.90	1.10	Obama
MO	49.30	49.40	-0.10	McCain
NC	49.70	49.40	0.30	Obama
NH	54.10	44.50	9.60	Obama
OH	51.50	46.90	4.60	Obama
VA	52.60	46.30	6.30	Obama

Performance...

	RMSE	RMSE, Diff < 2%	Number Correctly Predicted
My Predictions	4.27	1.17	17.00
RealClearPolitics Predictions	3.23	1.54	16.00

RCP Predictions

	Obama	Romney	Outcome	Winner
CO	48.70	45.80	2.90	Obama
FL	49.30	46.10	3.20	Obama
MO	43.00	50.30	-7.30	Romney
NC	47.80	46.70	1.10	Obama
NH	48.70	45.70	3.00	Obama
VA	48.70	45.00	3.70	Obama

My Predictions

	Obama	Romney	Outcome	Winner
CO	47.90	46.10	1.80	Obama
FL	48.10	46.30	1.80	Obama
MO	42.70	49.30	-6.60	Romney
NC	46.60	47.20	-0.60	Romney
NH	49.00	44.80	4.20	Obama
VA	47.80	45.50	2.30	Obama

A Tiny Bit of Context: The Holy War

"SUBJECTIVE!"
"UNINTUITIVE!"
"ASYMPTOTIC!"
"INTRACTABLE!"

Pierre LaPlace:

"I'm pretty famous."

"The theory of probabilities is at bottom nothing but common sense reduced to calculus;
it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which of times they are unable to account."

What's the collective noun
for a group of statisticians?

Harold Jeffreys

"Oh.
Oh dear.
Well, then.
I do believe...
I definitely don't..."

"The scientific method is neither deduction from a set of axioms nor a way of making plausible guesses, as Bertrand Russell said; but that it is a matter of successive approximation to probability distributions."

R. A. Fisher:

"Clever.
Now I will
ruin you."

"Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first."

The Basic Basics: Bayesian Inference on a Binomial Proportion