Bayesian inference is the use of Bayes' theorem to develop probability distributions for various phenomena; in particular, it can be used to estimate the parameters of probability distributions. This article has a companion widget that you should open in a new window or tab.

Perhaps the simplest non-trivial example of a probability distribution is the Bernoulli distribution of a Bernoulli process. Suppose we have an event with two outcomes, success and failure, such as flipping a coin (heads and tails respectively). Suppose that the event is a success with probability $$p$$ and a failure with probability $$1-p$$.

Now suppose that we have a Bernoulli process but we do not yet know the probability $$p$$ of success. How can we, from observing outcomes of the process, determine the probability $$p$$? Using a frequentist approach, we could observe $$N$$ events with say $$s$$ successes, and compute an estimate $$\hat{p} = \frac{s}{N}$$. Then we can compute a confidence interval and decide if the estimate is sufficiently accurate for our purposes.
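The frequentist estimate is easy to sketch in code. Here is a minimal version using the standard normal (Wald) approximation for the confidence interval; the function name is made up for this article:

```python
import math

def frequentist_estimate(successes, n, z=1.96):
    """Point estimate and approximate 95% Wald confidence interval
    for the success probability of a Bernoulli process."""
    p_hat = successes / n
    # Half-width of the interval from the normal approximation.
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - half_width, p_hat + half_width)

# 60 successes in 100 trials.
p_hat, (low, high) = frequentist_estimate(60, 100)
```

The Wald interval is only one of several choices here, but it suffices to contrast with the Bayesian machinery below.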

Bayesian inference takes a different approach. First we assume that we have some initial belief called a prior distribution. For example, perhaps we are flipping a standard coin and belief that it is likely fair. In this case we would chose a probability distribution on $$[0, 1]$$ for the value of $$p$$ that is heavily concentrated around $$p=0.5$$. If we just had some generic Bernoulli process we might chose a flat distribution in which every value of $$p$$ is just as likely.

Once we have chosen a prior distribution $$Pr(p)$$, we then observe the process, and use Bayes' theorem to update our probability distribution appropriately, for either a success $$s$$ or a failure $$f$$: $Pr(p \, | \, s) = \frac{Pr(s \, | \, p) \, Pr(p) }{Pr(s)}$ or $Pr(p \, | \, f) = \frac{Pr(f \, | \, p) \, Pr(p) }{Pr(f)}$ The resulting probability distribution is called the posterior distribution. Now we can observe more events and continue updating our posterior distribution, using the previous posterior as the prior in Bayes' theorem. In general this involves integration to compute $$Pr(s)$$ or $$Pr(f)$$; however, there is a nice trick that eliminates this overhead.
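To make the update concrete, here is a minimal sketch that approximates the prior on a grid of candidate $$p$$ values and applies Bayes' theorem numerically; the helper name is made up for illustration:

```python
def update_on_grid(prior, grid, success):
    """One Bayesian update over a discretized prior on [0, 1]."""
    # Likelihood of the observation under each candidate p:
    # Pr(s | p) = p for a success, Pr(f | p) = 1 - p for a failure.
    likelihood = [p if success else (1 - p) for p in grid]
    unnormalized = [lk * pr for lk, pr in zip(likelihood, prior)]
    # Pr(s) (or Pr(f)) is the normalizing constant; this sum is the
    # numerical integration step that conjugate priors let us skip.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

grid = [i / 100 for i in range(101)]        # candidate values of p
posterior = [1 / len(grid)] * len(grid)     # flat prior
for outcome in [True, True, False, True]:   # three successes, one failure
    posterior = update_on_grid(posterior, grid, outcome)
```

Each observation reuses the previous posterior as the new prior, exactly as described above.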

## Conjugate Distributions

Let's dig into the details a bit more. While we could in principle use any probability distribution as the prior, there is a computationally convenient choice called a conjugate prior. For a Bernoulli process, we can use the Beta distribution, a two-shape-parameter family of distributions of the following form: $Pr(p; \alpha, \beta) = \frac{p^{\alpha - 1} \left(1-p\right)^{\beta - 1}}{B(\alpha, \beta)}$ The normalizing constant $$B(\alpha, \beta)$$ can be written in terms of the Gamma function (or factorials, for integer parameters): $B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$

Source: Wikipedia

Why choose the Beta distribution? It turns out that if we start with a Beta distribution as the prior $$Pr(p; \alpha, \beta)$$, observe a success, and use Bayes' theorem, then the posterior distribution is $$Pr(p; \alpha + 1, \beta)$$. Similarly, if we observe a failure, the posterior is $$Pr(p; \alpha, \beta + 1)$$. So we always stay within the family of Beta distributions (hence the name conjugate prior). Computationally and algebraically, if we choose a prior in the Beta family, we only have to track $$\alpha$$ and $$\beta$$, adding one to $$\alpha$$ on a success or to $$\beta$$ on a failure. This saves us from having to integrate after each observation (either algebraically or computationally).
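The conjugate update collapses to bookkeeping on two counters. A minimal sketch (the class name is made up for this article):

```python
class BetaPosterior:
    """Tracks the Beta posterior for a Bernoulli process.
    The starting parameters are the prior's; each observation bumps one count."""

    def __init__(self, alpha=1.0, beta=1.0):  # uniform prior by default
        self.alpha = alpha
        self.beta = beta

    def observe(self, success):
        if success:
            self.alpha += 1  # posterior after a success: Beta(alpha + 1, beta)
        else:
            self.beta += 1   # posterior after a failure: Beta(alpha, beta + 1)

post = BetaPosterior()
for outcome in [True, False, True, True]:
    post.observe(outcome)
# Three successes and one failure from the uniform prior give Beta(4, 2).
```

No integration appears anywhere: the normalizing constant is implicit in the Beta family.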

## Prior Selection

We still have the issue of prior selection. (If you didn't open the companion widget, now is a good time.) Within the Beta family there are many reasonable choices for the prior:

• The uniform distribution, $$Pr(p; 1, 1)$$
• The Jeffreys prior, $$Pr(p; 0.5, 0.5)$$
• Haldane's prior, $$Pr(p; 0, 0)$$
• The strong-half prior, $$Pr(p; 100, 100)$$ (made up for this article)
Using the widget you can click on each and see these initial distributions. The strong-half prior puts most of the probability near $$p=1/2$$; the other priors are attempts at uninformative priors that do not assume much about the distribution. See here for more detailed information on the Jeffreys prior.

## Simulating a Process and Estimating Parameters

Using the companion widget, you can simulate coin flips for a coin with any bias, and see the posterior distribution evolve as the observations pile up. You can also see various estimates of the parameter $$p$$ from the posterior, such as the following, which can easily be computed from the parameters $$\alpha$$ and $$\beta$$:

• The mean: $$\hat{p} = \frac{\alpha}{\alpha + \beta}$$
• The mode, also known as the maximum a posteriori (MAP) estimate: $$\hat{p} = \frac{\alpha - 1}{\alpha + \beta - 2}$$ (for $$\alpha, \beta > 1$$)
• The median: $$\hat{p} \approx \frac{\alpha - 1/3}{\alpha + \beta - 2/3}$$ (a good approximation for $$\alpha, \beta > 1$$)
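The three estimates above are one-liners given the posterior parameters. A sketch (the function name is made up for this article; the mode and median formulas assume $$\alpha, \beta > 1$$):

```python
def beta_point_estimates(alpha, beta):
    """Mean, mode, and approximate median of a Beta(alpha, beta) posterior.
    The mode and the median approximation assume alpha, beta > 1."""
    mean = alpha / (alpha + beta)
    mode = (alpha - 1) / (alpha + beta - 2)
    median = (alpha - 1/3) / (alpha + beta - 2/3)  # approximation
    return mean, mode, median

# For a Beta(4, 2) posterior (e.g. three successes, one failure on a uniform prior):
mean, mode, median = beta_point_estimates(4, 2)
```

Note how the three estimates disagree for small counts but would converge as $$\alpha$$ and $$\beta$$ grow.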

Incidentally, for the Beta distribution the variance can also be computed from $$\alpha$$ and $$\beta$$: $\sigma^2 = \frac{\alpha \beta}{ (\alpha + \beta)^2 (\alpha + \beta + 1)}$ Notice that with Haldane's prior, $$\alpha$$ and $$\beta$$ simply count the observed successes and failures, so the mean estimate agrees with the frequentist estimate $$\hat{p} = \frac{s}{N}$$! Similarly, the mode agrees with the frequentist estimate when we use the uniform prior.

Ultimately, for a large data set the prior has an insignificant impact on the estimate, a consequence of the Bernstein-von Mises theorem. You can see this from the formulas for the estimates above: in the limit of infinitely many data points, the initial $$\alpha$$ and $$\beta$$ contributed by the prior are washed out, and the various estimates converge to the same value. However, for small data sets the prior can greatly impact the estimate. For example, if you start with the strong-half prior and set the true value of $$p$$ to something other than one-half, it takes many observations for the posterior to adjust. (Note that in the widget you can adjust the prior on the fly.)
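This wash-out effect is easy to reproduce outside the widget. Here is a small simulation sketch comparing the strong-half prior against the uniform prior on the same simulated coin; the function name and the choice of 10,000 flips are just for illustration:

```python
import random

def posterior_mean(alpha, beta, flips):
    """Mean of the Beta posterior after conjugate updates on a list of flips."""
    for flip in flips:
        if flip:
            alpha += 1
        else:
            beta += 1
    return alpha / (alpha + beta)

random.seed(0)
true_p = 0.8
flips = [random.random() < true_p for _ in range(10_000)]

# With many observations, both priors land near the true bias.
strong = posterior_mean(100, 100, flips)   # strong-half prior
uniform = posterior_mean(1, 1, flips)      # uniform prior

# With only the first 20 flips, the strong-half prior still dominates
# and drags the estimate well below the true value of 0.8.
strong_small = posterior_mean(100, 100, flips[:20])
```

With the full data set the two posterior means are nearly indistinguishable; with 20 flips the strong-half estimate is still stuck near one-half.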