Tuesday, April 22, 2014

Bayes' Theorem, Part 2


As implied in Part One, this article series is supposed to be an easy introduction to Bayes' Theorem for non-experts (by a non-expert), not a thinly veiled job application directed at government agencies that don't officially exist.


Review: Testing Positive for a Rare Disease Doesn't Mean You're Sick

The previous article in this series illustrated a surprising fact about disease screening: if the disease you're testing for is sufficiently rare, then a positive diagnosis is probably wrong. This seemingly WTF outcome is an instance of the false positive paradox. It arises when the event of interest (in this case, being diseased) is so statistically rare that true positives are drowned out by a background of false positives.

Bayes' Theorem allows us to analyze this paradox, as shown below. But first, we need to define true and false positives and negatives.

False Positives and False Negatives

No classification test is perfect. Any real-world diagnostic test will sometimes mistakenly report disease in a healthy person. This type of error is defined as a false positive. If you test for the disease in a large number of people who are known to be healthy, a certain percentage of the test results will be false positives. This percentage is called the false positive rate of the test. It's the probability of getting a positive result if you test a healthy person.

The other type of classification error is the false negative -- for example, a clean bill of health mistakenly issued to someone who's actually sick. If you run your test on a large number of people known to be sick, the test will fail to detect disease in some percentage of them. This percentage is known as the test's false negative rate.

The lower the false positive rate and false negative rate, the better the test. Both rates are independent of population size and disease prevalence.

But now we get to the root of the false positive paradox: if the disease is rare enough, then the vast majority of people you test will be healthy. This unavoidable testing of crowds of healthy people represents plenty of opportunities to get false positives. These false positives drown out the relatively faint true positive signal coming from the few sick people in the population. And if the test obtains each true positive at the cost of many false positives, any given positive result is probably a false one. It's intuitive by this point that the error rate of a screening process depends not only on the accuracy of the test itself, but also on the rarity of what you're screening for.
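
If you like to see the head-counting spelled out, here's a minimal Python sketch using the dancing-plague numbers assumed in Part One (one-in-1000 prevalence, 99% sensitivity, 95% specificity); the population size is arbitrary:

    # Head-count sketch of the false positive paradox, using the made-up
    # dancing-plague numbers from Part One (not real data).
    population = 1_000_000            # arbitrary number of people screened
    prevalence = 1 / 1000             # fraction who are actually sick
    sensitivity = 0.99                # P(positive | sick), the true positive rate
    false_positive_rate = 0.05        # P(positive | healthy)

    sick = population * prevalence                      # 1,000 people
    healthy = population - sick                         # 999,000 people

    true_positives = sick * sensitivity                 # ~990 correct alarms
    false_positives = healthy * false_positive_rate     # ~49,950 false alarms

    print(true_positives, false_positives)
    print(true_positives / (true_positives + false_positives))   # ~0.019

The false alarms from the 999,000 healthy people swamp the roughly 990 true alarms.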

For a more rigorous understanding, we need to derive Bayes' Theorem. To do that, we need some basic probability theory.

Probability Basics

The probability that some event \(A\) will occur is written \(P(A)\). All probabilities are limited to values between 0 ("impossible") and 1 ("guaranteed"). If we let \(H\) stand for the event that a fair coin lands heads up, then \(P(H) = 0.5\), or 50%. If \(X\) stands for rolling a "20" on a 20-sided die, then \(P(X) = 1/20\), or 5%.

If two events \(A\) and \(B\) cannot occur at the same time, they are said to be mutually exclusive, and the probability that either \(A\) or \(B\) occurs is just \(P(A) + P(B)\). Rolling a 19 and rolling a 20 on a 20-sided die are mutually exclusive events, so the probability of rolling 19 or 20 is \(1/20 + 1/20 = 1/10\).

The opposite of an event, or its complement, is denoted with a tilde (\(\text{~}\)) before the letter. The probability that \(A\) will not occur is written \(P(\text{~}A)\). For any event, \(P(A) + P(\text{~}A) = 1\), which just says that either \(A\) happens or it doesn't. If there are only two possible outcomes, such as heads/tails, sick/healthy, or guilty/innocent, then \(A\) is called a binary event. Heads and tails, sickness and health, and guilt and innocence are all mutually exclusive binary events.
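
None of this needs a computer, but if you want to poke at the arithmetic, here's a tiny sketch of the same examples (exact fractions, nothing new):

    from fractions import Fraction

    # Mutually exclusive events on a 20-sided die: probabilities simply add.
    p_19 = Fraction(1, 20)
    p_20 = Fraction(1, 20)
    print(p_19 + p_20)        # 1/10

    # Complement rule: P(A) + P(~A) = 1.
    p_heads = Fraction(1, 2)
    print(1 - p_heads)        # 1/2, the probability of tails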

Conditional Probability

So far, we've considered probabilities of single events occurring in theoretical isolation: a single coin flip, a single die roll. Now, consider the probability of an event \(A\) given that some other event \(B\) has occurred. This new probability is read as "the probability of A given B" or "the probability of A conditional on B." Because this new probability quantifies the occurrence of A under the condition that B has definitely occurred, it is known as a conditional probability. Standard notation for conditional probability is:
\[ \begin{equation} P(A|B) \end{equation} \]
The vertical bar stands for the word "given." \(P(A|B)\) means "the probability of \(A\) given \(B\)."

It's really important to recognize right away that \(P(A|B)\) is not the same as \(P(B|A)\). To see why, dream up two related real-world events and think about their conditional probabilities:
  • probability that a road is wet given that it's raining: \(P(\text{wet road} ~ | ~ \text{raining})\)
  • probability that it's raining given that the road is wet: \(P(\text{raining} ~ | ~ \text{wet road})\)
The road will certainly get wet if it rains, but many things besides rain could result in a wet road (use your imagination). Therefore,
\[ \begin{equation} P(\text{wet road} ~ | ~ \text{raining})  >  P(\text{raining} ~ | ~ \text{wet road}). \end{equation} \]
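
If the asymmetry still feels slippery, here's a tiny sketch built on an invented joint distribution for rain and wet roads (the numbers are made up purely for illustration):

    # Invented joint probabilities for one randomly chosen moment on one road.
    p_rain_and_wet   = 0.10    # raining and road wet
    p_rain_and_dry   = 0.00    # raining but road dry (assume impossible)
    p_norain_and_wet = 0.15    # wet from sprinklers, street cleaning, etc.
    p_norain_and_dry = 0.75

    p_rain = p_rain_and_wet + p_rain_and_dry        # 0.10
    p_wet  = p_rain_and_wet + p_norain_and_wet      # 0.25

    print(p_rain_and_wet / p_rain)    # P(wet road | raining)  = 1.0
    print(p_rain_and_wet / p_wet)     # P(raining | wet road)  = 0.4

Same joint event in the numerator, different conditioning event in the denominator, very different answers.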

Bayes' Theorem Converts between P(A|B) and P(B|A)

Okay, so \(P(A|B)\) does not equal \(P(B|A)\), but how are they related? If we know one quantity, how do we get the other? This section title blatantly gave away the answer.

To derive Bayes' Theorem, consider events \(A\) and \(B\) that have nonzero probabilities \(P(A)\) and \(P(B)\). Let's say that \(B\) has just occurred. What is the probability that \(A\) occurs given the occurrence of \(B\)? In symbols, what is \(P(A|B)\)?

Well, if \(A\) then occurs, it will be true that both \(A\) and \(B\) have occurred. The occurrence of both \(A\) and \(B\) is itself an event; let's call it \(AB\), with probability \(P(AB)\). Now, note that \(P(B)\) will always be greater than or equal to \(P(AB)\), because the "\(A\)" in "\(AB\)" represents an added criterion for event completion. The chance of both \(A\) and \(B\) occurring can't exceed the chance of just \(B\) occurring (unless, of course, \(A\) is guaranteed to occur whenever \(B\) does).

The value of \(P(AB)\) itself isn't as interesting as the ratio of \(P(AB)\) to \(P(B)\), and here's why. This ratio compares the probability of both \(A\) and \(B\) to the probability of \(B\) just by itself: it tells us what fraction of the occurrences of \(B\) are also occurrences of \(A\). You should be able to convince yourself that this fraction is none other than the conditional probability \(P(A|B)\):
\[  \begin{equation} P(A|B) = \frac{P(AB)}{P(B)}.  \end{equation} \]
Rearranging gives
\[  \begin{equation} P(AB) = P(A|B)P(B). \label{whatstheuse} \end{equation} \]
Similarly,
\[ \begin{align} P(B|A) &= \frac{P(BA)}{P(A)} \\
P(BA) &= P(B|A)P(A). \end{align} \]
Since the order of \(A\) and \(B\) doesn't affect the probability of both occurring, we have \(P(AB) = P(BA)\), so
\[ \begin{equation} P(A|B)P(B) = P(B|A)P(A). \end{equation} \]
This leads to Bayes' Theorem:
\[  \begin{equation}P(A|B) = \frac{P(B|A)P(A)}{P(B)} \label{existentialangst} \end{equation} \]
There we have it: to convert from \(P(B|A)\) to \(P(A|B)\), multiply \(P(B|A)\) by the ratio \(P(A)/P(B)\). Let's see what this looks like in the disease example.
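
But first, a quick sanity check of that conversion in code, reusing the invented rain numbers from above:

    # Convert P(B|A) into P(A|B) with Bayes' Theorem, using the made-up
    # rain/wet-road numbers from the earlier sketch.
    p_wet         = 0.25    # P(A): road is wet
    p_rain        = 0.10    # P(B): it's raining
    p_rain_if_wet = 0.40    # P(B|A): raining given a wet road

    p_wet_if_rain = p_rain_if_wet * p_wet / p_rain
    print(p_wet_if_rain)    # 1.0, i.e., P(wet road | raining)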

Back to the Disease Example

Let the symbols \(+\) and \(-\) stand for a positive and negative diagnosis, and let \(D\) stand for the event that disease is present. Since the test must return either \(+\) or \(-\) every time we test someone, \(P(+) + P(-) = 1\). And since disease must either be present or absent, \(P(D) + P(\text{~}D) = 1\).

Now, to determine precisely how much a positive diagnosis should worry us, we care about the probability of disease given a positive diagnosis. This is just \(P(D|+)\). By Bayes' Theorem (Equation \(\ref{existentialangst}\)), we need to calculate
\[  \begin{equation} P(D|+) = \frac{P(+|D)P(D)}{P(+)}  \end{equation} \]
Consider each term on the right side:

\(P(+|D)\) is just the probability of getting a positive diagnosis given the presence of disease, i.e., the probability that the test works as advertised as a disease detector. This is the definition of the true positive rate, AKA the sensitivity, a very commonly quoted test metric.

\(P(D)\) is the probability of disease in a person randomly selected from our population. In other words, \(P(D)\) is the disease prevalence (e.g., 15 per 10,000 people).

What about the denominator, \(P(+)\)? It's the probability of getting a positive diagnosis in a randomly selected person. A positive diagnosis can be either 1. a true positive, or 2. a false positive.
  1. A true positive is defined by the occurrence of both \(D\) and \(+\). Equation \(\ref{whatstheuse}\) says that the probability of both \(D\) and \(+\) is \(P(+|D)P(D)\). \(P(+|D)\) is the true positive rate, and \(P(D)\) is the disease prevalence.
  2. A false positive is defined by the occurrence of both \(\text{~}D\) and \(+\). The probability of this is \(P(+|\text{~}D)P(\text{~}D)\). \(P(+|\text{~}D)\) is the false positive rate (after which the paradox is named), and \(P(\text{~}D) = 1 - P(D)\).
Since true and false positives are mutually exclusive events, their probabilities add up to give the probability of any positive outcome, whether true or false. Thus,
\[  \begin{equation} P(+) = P(+|D)P(D) + P(+|\text{~}D)P(\text{~}D). \end{equation} \]

Bayes' Theorem for the disease-screening example now looks like this:
\[  \begin{equation} P(D|+) = \frac{P(+|D)P(D)}{P(+|D)P(D) + P(+|\text{~}D)P(\text{~}D)} \label{thehorror} \end{equation} \]

Plugging in Example Numbers

Part One of this series gave concrete numbers for a hypothetical outbreak of dancing plague. Let's insert those numbers into our newly minted Equation \(\ref{thehorror}\) to calculate the value of a positive diagnosis.
  • Dancing plague was assumed to affect one in 1000 people, so \(P(D) = 1/1000\). Since each person either has or does not have disease, \(P(D) + P(\text{~}D) = 1\).
  • Test sensitivity was 99%. Sensitivity is synonymous with the true positive rate, so this tells us that \(P(+|D) = 0.99\). And since the test must return either \(+\) or \(-\) when disease is present, \(P(-|D) = 1 - 0.99 = 0.01\).
  • Test specificity was 95%. Specificity is synonymous with the true negative rate, so \(P(-|\text{~}D) = 0.95\). Then \(P(+|\text{~}D) = 1 - 0.95 = 0.05\).
Inserting these numbers into Equation \(\ref{thehorror}\) gives
\[ \begin{align}
P(D|+) &= \frac{0.99 \cdot \frac{1}{1000}}{0.99 \cdot \frac{1}{1000} + 0.05 \cdot (1 - \frac{1}{1000})} \label{whyareyouevenwritingthis} \\
&= \frac{\frac{0.99}{1000}}{\frac{0.99}{1000} + 0.05 \cdot \frac{999}{1000}} \\
&= 0.01943 \\ &\simeq 1.9 \%
\end{align} \]
As expected, this is the same answer we got in Part One through a less rigorous approach.
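
For anyone who'd rather let a machine do the plugging-in, here's the same calculation as a short sketch:

    # P(D|+) via Bayes' Theorem, using the dancing-plague numbers above.
    prevalence  = 1 / 1000          # P(D)
    sensitivity = 0.99              # P(+|D), the true positive rate
    specificity = 0.95              # P(-|~D), the true negative rate
    fp_rate     = 1 - specificity   # P(+|~D), the false positive rate

    p_positive = sensitivity * prevalence + fp_rate * (1 - prevalence)
    p_disease_given_positive = sensitivity * prevalence / p_positive

    print(round(p_disease_given_positive, 5))    # 0.01943, about 1.9%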

Final Remarks

In this excessively long article, we derived Bayes' Theorem and used it to confirm our earlier reasoning in Part One that a positive diagnosis of dancing plague has only a 2% chance of being correct. This low number is an example of the false positive paradox, and Equation \(\ref{whyareyouevenwritingthis}\) reveals its origin.

The form of Equation \(\ref{whyareyouevenwritingthis}\) is [something] divided by [something + other thing], or \(\frac{t}{t + f}\). If \(f\) is small compared to \(t\), then \(\frac{t}{t+f} \simeq \frac{t}{t} = 1\), which means that the probability of disease given a positive test result is close to 100%. But if \(f\) becomes much larger than \(t\), then \(\frac{t}{t+f}\) becomes much less than 1. Looking at Equations \(\ref{thehorror}\) and \(\ref{whyareyouevenwritingthis}\), you can see that \(t\) matches up with the term \(P(+|D)P(D)\), the probability of getting a true positive, and \(f\) matches up with \(P(+|\text{~}D)P(\text{~}D)\), the probability of getting a false positive. In our example, \(t \simeq 0.001 \) and \(f \simeq 0.05 \). Thus, the chance of getting a false positive is 50 times higher than the chance of getting a true positive. That's why someone who tests positive probably has nothing to worry about, other than the social stigma of getting tested in the first place.
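
In code, the \(t\)-versus-\(f\) comparison for our example is just a couple of lines (same numbers as above):

    # True-positive term t and false-positive term f from the final equation.
    prevalence, sensitivity, fp_rate = 1 / 1000, 0.99, 0.05

    t = sensitivity * prevalence      # ~0.001, probability of a true positive
    f = fp_rate * (1 - prevalence)    # ~0.050, probability of a false positive

    print(f / t)          # ~50: false positives are ~50x more likely
    print(t / (t + f))    # ~0.019: P(D|+), about 2%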

Cliffhanger Ending

In my next post, I'll explain the Bayesian methodology I used in the course of my involvement with the series To Catch A Killer. Essentially, the above analysis can be adapted to homicide investigations by replacing rare-disease prevalence with the homicide rate for a specific time, place, and demographic, and by treating the presence or absence of forensic evidence as positive or negative diagnostic test outcomes.

2 comments:

  1. The last paragraph is why I love having things explained by you. It's like… yep thorough awesome explanation BUT WAIT here's an intuitive approach that no-one else teaches and that makes the whole thing way easier to understand and remember.

    1. Thanks for the comment, and I'm glad you liked that last paragraph! Also, have a great trip :)
