If we toss a ‘fair’ coin, one for which heads and tails are equally likely, a large number of times, we expect approximately equal numbers of heads and tails. But what is ‘approximate’ here? How large a deviation from equal values might raise suspicion that the coin is biased? Surely, 12 heads and 8 tails in 20 tosses would not raise any eyebrows; but 18 heads and 2 tails might.
We will consider the more general case where we do not know the odds for heads and tails. After all, no coin is perfect, so we cannot be sure that it is fair. Suppose we toss the coin $n$ times and get $h$ heads. We denote the unknown probability of heads by $p$. We pose the following question:
- How many times do we need to toss a coin to get an accurate estimate of the odds $p$ of getting heads? How big does $n$ have to be?
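For readers who like to experiment, here is a minimal Python sketch of the problem (the true odds and the sample sizes are arbitrary choices for illustration): it tosses a simulated coin $n$ times and reports the estimate $h/n$.

```python
import random

rng = random.Random(42)   # fixed seed so the sketch is reproducible

def estimate_odds(p_true, n):
    """Toss a coin with P(heads) = p_true a total of n times; return h/n."""
    h = sum(rng.random() < p_true for _ in range(n))
    return h / n

p_true = 0.5              # the unknown odds, chosen here for illustration
for n in (20, 1_000, 1_000_000):
    print(f"n = {n:>9,}   estimate = {estimate_odds(p_true, n):.4f}")
```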
Conditional Probabilities
The probability of two events $A$ and $B$ both occurring (denoted $P(A \wedge B)$) can be expressed in terms of conditional probabilities in two different ways:
- The probability of $A$ multiplied by the probability of $B$ given that $A$ holds.
- The probability of $B$ multiplied by the probability of $A$ given that $B$ holds.
In symbolic form, this is
$$ P(A \wedge B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B) $$
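Here is a quick numerical check of this identity, using a fair die and two dependent events chosen purely for illustration ($A$: the roll is even, $B$: the roll exceeds 3):

```python
from fractions import Fraction

outcomes = range(1, 7)                        # a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}       # event A: roll is even
B = {x for x in outcomes if x > 3}            # event B: roll exceeds 3

def P(event):
    return Fraction(len(event), 6)

P_AB = P(A & B)                               # P(A and B) = 1/3
P_B_given_A = Fraction(len(A & B), len(A))    # P(B | A) = 2/3
P_A_given_B = Fraction(len(A & B), len(B))    # P(A | B) = 2/3

assert P_AB == P_B_given_A * P(A) == P_A_given_B * P(B)
print(P_AB, P_B_given_A * P(A), P_A_given_B * P(B))   # all equal 1/3
```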
Now let us be specific and consider the event $A$ to be “the occurrence of $h$ heads in $n$ tosses” and $B$ to be “the probability of heads is $p$”. Then
$$ P(p \mid h)\,P(h) = P(h \mid p)\,P(p) \qquad (1) $$
Note that $h$ is discrete ($h \in \{0, 1, \dots, n\}$) whereas $p$ is continuous ($0 \le p \le 1$), so the probability that $p$ lies in an interval $(p, p + \mathrm{d}p)$ is $P(p)\,\mathrm{d}p$.
To answer the question posed above, we wish to estimate the first term in (1), that is, $P(p \mid h)$, the probability of a coin having odds $p$ of heads, given the experimental result of $h$ heads in $n$ tosses. The equation has four factors, so we need to know or to estimate the remaining three. Let us consider these terms in turn, from right to left:
$P(p)$: The probability of odds $p$ in the absence of any further information or data. The probability $P(p)$ is called the prior estimate.

$P(h \mid p)$: The conditional probability of $h$ heads for a coin with given odds $p$. This conditional probability is called the likelihood.

$P(h)$: The probability of $h$ heads in $n$ tosses (without further qualification). $P(h)$ can be partitioned into a sum or integral of mutually exclusive and exhaustive cases:
$$ P(h) = \int_0^1 P(h \mid p)\,P(p)\,\mathrm{d}p $$
If we know the two factors on the right-hand side of (1) we can evaluate this integral.

$P(p \mid h)$: The conditional probability of the odds being $p$ given that $h$ heads have shown up in an experiment. This is called the posterior probability, and is what we seek.
We can now write an expression for the posterior probability:
$$ P(p \mid h) = \frac{P(h \mid p)\,P(p)}{P(h)} \qquad (2) $$
This result is known as Bayes’ Theorem. It is often expressed in the form
$$ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} $$
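The structure of Bayes’ Theorem is easy to see numerically. The sketch below (the values of $n$, $h$ and the grid resolution are arbitrary choices) discretises $p$ on a grid, multiplies a uniform prior by the binomial likelihood discussed in the next section, and divides by the evidence $P(h)$:

```python
import numpy as np
from scipy.stats import binom

# Illustrative experiment: h heads in n tosses (arbitrary values).
n, h = 50, 20

p = np.linspace(0.0, 1.0, 10_001)        # grid of candidate values of p
dp = p[1] - p[0]

prior = np.ones_like(p)                  # uniform prior P(p) = 1 on [0, 1]
likelihood = binom.pmf(h, n, p)          # P(h | p): the binomial likelihood
evidence = np.sum(likelihood * prior) * dp      # P(h), by a Riemann sum
posterior = likelihood * prior / evidence       # equation (2)

print(f"posterior mode ≈ {p[np.argmax(posterior)]:.3f}")     # close to h/n = 0.4
print(f"posterior mean ≈ {np.sum(p * posterior) * dp:.3f}")
```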
There is a vast literature on Bayes’ Theorem, the many controversies that have surrounded it and its numerous applications. For an elementary account of this history, see McGrayne (2011).
Estimating the Terms
The prior probability depends on the information available before tossing the coin. In the absence of any a priori data, we may assume a uniform distribution $P(p) = 1$ for $0 \le p \le 1$. The conditional probability $P(h \mid p)$ is given by the familiar binomial distribution
$$ P(h \mid p) = \binom{n}{h} p^h (1-p)^{n-h} $$
This comes from the chance of $h$ heads [factor $p^h$] and the chance of $n-h$ tails [factor $(1-p)^{n-h}$], in any order [factor $\binom{n}{h}$, or n-choose-h].
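As a sanity check, the formula matches a standard library implementation; in the sketch below (the values of $n$, $h$ and $p$ are arbitrary) the direct computation and scipy's binomial pmf agree up to rounding.

```python
from math import comb
from scipy.stats import binom

n, h, p = 50, 20, 0.4                          # arbitrary illustrative values
manual = comb(n, h) * p**h * (1 - p)**(n - h)  # C(n, h) p^h (1-p)^(n-h)
library = binom.pmf(h, n, p)                   # the same quantity from scipy

print(manual, library)
```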
The integral is a standard beta function which can be expressed as a ratio of factorials:
$$ P(h) = \int_0^1 \binom{n}{h} p^h (1-p)^{n-h}\,\mathrm{d}p = \binom{n}{h}\,\frac{h!\,(n-h)!}{(n+1)!} = \frac{1}{n+1} $$
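This closed form is easy to confirm numerically; the sketch below (again with arbitrary $n$ and $h$) integrates the likelihood over the uniform prior with scipy's quadrature routine and compares the result with $1/(n+1)$.

```python
from math import comb
from scipy.integrate import quad

n, h = 50, 20
# Integrate C(n,h) p^h (1-p)^(n-h) over the uniform prior on [0, 1].
integral, _ = quad(lambda p: comb(n, h) * p**h * (1 - p)**(n - h), 0.0, 1.0)

print(integral, 1 / (n + 1))   # both are 1/51 ≈ 0.0196
```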
We can now write the desired probability density function (2) in final form:
$$ P(p \mid h) = \frac{(n+1)!}{h!\,(n-h)!}\, p^h (1-p)^{n-h} $$
This looks like a binomial distribution, but here $h$ is fixed and $p$ is the random variable. It is a beta distribution, conjugate to the binomial distribution.
The figure below shows the posterior probability for $h = 20$ and $n = 50$. It peaks at $p = h/n = 0.4$, the mode; the mean is $(h+1)/(n+2) = 21/52 \approx 0.404$.
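The quoted mode and mean follow from the beta form; here is a short check using scipy's beta distribution (with $h = 20$ and $n = 50$ as in the figure):

```python
from scipy.stats import beta

n, h = 50, 20
posterior = beta(h + 1, n - h + 1)    # Beta(a, b) with a = h+1, b = n-h+1

mode = h / n                          # (a-1)/(a+b-2) reduces to h/n
mean = posterior.mean()               # a/(a+b) = (h+1)/(n+2)

print(f"mode = {mode:.3f}")           # 0.400
print(f"mean = {mean:.4f}")           # 21/52 ≈ 0.4038
```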
A Limiting Case
Before getting to the odds, we look briefly at a limiting case. Suppose we “know” a priori that the coin is fair (this is unrealistic but instructive). Then we must choose $P(p) = \delta(p - \tfrac{1}{2})$. The integral in the denominator of (2) is then $P(h) = P(h \mid \tfrac{1}{2})$, so
$$ P(p \mid h) = \delta(p - \tfrac{1}{2}) $$
that is, the posterior probability is identical to the prior. Since we are certain from the outset, no amount of additional data can sway our conviction. But this never happens: no coin, however carefully minted, is guaranteed to be completely fair. In reality, we should consider a prior peaking sharply at $p = \tfrac{1}{2}$ rather than a delta function. New information can then result in a change in the expected value of $p$.
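To illustrate the contrast, the sketch below updates two priors with the same data: a uniform prior and a sharply peaked one, modelled here, purely for convenience, as a symmetric beta distribution with a large (arbitrarily chosen) parameter.

```python
import numpy as np
from scipy.stats import binom, beta

n, h = 50, 20                             # the same illustrative data as before
p = np.linspace(0.0, 1.0, 10_001)
dp = p[1] - p[0]
likelihood = binom.pmf(h, n, p)

def posterior_mean(prior):
    """Grid-based Bayes update, followed by the posterior mean."""
    post = likelihood * prior
    post /= np.sum(post) * dp
    return np.sum(p * post) * dp

flat_prior = np.ones_like(p)              # uniform: no prior opinion
peaked_prior = beta(500, 500).pdf(p)      # sharply peaked near p = 1/2

print(f"posterior mean, uniform prior: {posterior_mean(flat_prior):.4f}")   # ≈ 0.404
print(f"posterior mean, peaked prior:  {posterior_mean(peaked_prior):.4f}") # stays near 0.5
```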
How Many Tosses?
The question raised above was how many tosses are needed to estimate the odds. Of course, this depends on the level of precision required. The posterior distribution for $p$ given $h$ is
$$ P(p \mid h) = \frac{(n+1)!}{h!\,(n-h)!}\, p^h (1-p)^{n-h} $$
This is a standard beta distribution. The expected value of $p$ is
$$ \mathrm{E}[p] = \frac{h+1}{n+2} $$
and its variance is
$$ \mathrm{Var}[p] = \frac{(h+1)(n-h+1)}{(n+2)^2 (n+3)} $$
For large $h$ and $n$ we can write
$$ \mathrm{E}[p] \approx \frac{h}{n} = \hat{p}\,, \qquad \mathrm{Var}[p] \approx \frac{\hat{p}(1-\hat{p})}{n} $$
The quantity $\sigma = \sqrt{\hat{p}(1-\hat{p})/n}$ is called the standard error.
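These formulas, and the quality of the large-$n$ approximations, can be checked against scipy's exact beta moments (the values of $h$ and $n$ are again illustrative):

```python
from scipy.stats import beta

n, h = 50, 20
post = beta(h + 1, n - h + 1)             # exact posterior distribution

exact_mean, exact_var = post.mean(), post.var()
approx_mean = h / n                       # large-n approximation of the mean
approx_var = approx_mean * (1 - approx_mean) / n   # large-n approximation of the variance

print(f"mean: exact {exact_mean:.5f}, approximate {approx_mean:.5f}")
print(f"var : exact {exact_var:.6f}, approximate {approx_var:.6f}")
```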
We expect $\hat{p}$ to be close to $p$, but we must be more specific. It is common to choose a confidence interval $(\hat{p} - k\sigma, \hat{p} + k\sigma)$ with $k = 2$. For a normal distribution, this corresponds to 95% confidence: $p$ will be within this interval 95% of the time. We also specify a level of precision: let us require that $\hat{p}$ differs from $p$ by at most $\varepsilon$. To ensure this, we need
$$ 2\sigma = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le \varepsilon \quad\Longrightarrow\quad n \ge \frac{4\,\hat{p}(1-\hat{p})}{\varepsilon^2} $$
Suppose the coin is approximately fair. Then $\hat{p}(1-\hat{p}) \approx \tfrac{1}{4}$, so $\sigma \approx 1/(2\sqrt{n})$. If the confidence interval comprises values within two standard errors ($k = 2$) and we require an accuracy of three significant figures ($\varepsilon = 10^{-3}$) then
$$ n \ge \frac{1}{\varepsilon^2} = 10^6 $$
This is amazing: we need on the order of a million tosses to have confidence in the estimated value of $p$ to three significant figures.
If we ask for six-figure accuracy, we need a trillion tosses!
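The arithmetic behind these estimates fits in a few lines; the sketch below (an illustrative calculation for a roughly fair coin, $\hat{p} = 0.5$) evaluates $n \ge 4\hat{p}(1-\hat{p})/\varepsilon^2$ for three- and six-figure accuracy.

```python
def required_tosses(p_hat, eps, k=2):
    """Number of tosses needed so that k standard errors do not exceed eps."""
    return k**2 * p_hat * (1 - p_hat) / eps**2

for eps in (1e-3, 1e-6):                  # three- and six-figure accuracy
    print(f"eps = {eps:g}:  n ≈ {required_tosses(0.5, eps):,.0f}")
# prints roughly 1,000,000 and 1,000,000,000,000
```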
Sources
Sharon Bertsch McGrayne, 2011: The Theory That Would Not Die. Yale Univ. Press, 336pp.