Beware Simpson’s Paradox

By Bill Schmarzo, April 21, 2014

When I first heard of “Simpson’s Paradox,” I immediately thought of my favorite TV show, The Simpsons. I thought the paradox had to be about how Homer Simpson deals with the challenges he faces in every episode of the show. How does Homer deal with Sideshow Bob (“I snuck into America amidst a bunch of undocumented Canadian comedy writers for The Jimmy Kimmel Show… whatever that is.”) and his constant attempts to kill Bart? How does Homer deal with Arnold Schwarzenegger…er, uh… Rainier Wolfcastle and his constant attempts to make even gorier, more meaningless movies? How does Homer deal with Kang and Kodos (“Don’t blame me, I voted for Kodos!”) and their constant attempts to conquer Earth?

Nope. Wrong. Simpson’s Paradox refers to a statistical situation in which a trend or relationship that is observed within each of several groups disappears, or even reverses, when the groups are combined. It is in a sense an arithmetic trick: the combined figure is a weighted average of the group figures, and unequal weights can reverse a meaningful relationship. As data scientists, we need to be aware of the counterintuitive analytic results that Simpson’s Paradox can produce. Let’s review a couple of real-world examples.

Simple Simpson’s Paradox Example

Below is a classic example of Simpson’s Paradox from the world of business intelligence (BI). It occurs when a higher-level, aggregated dashboard view of the business hides the real story buried in the details.

Figure 1: Simpson’s Paradox Dashboard

In Figure 1, aggregated Regional Sales look like they are on plan. But at the next level of detail (at the Territory level), we can see that Territory A is doing very poorly (25% of plan) and Territory B is doing exceptionally well (400% of plan). This is a major challenge in the world of BI. It’s easy to feel content with what the BI reports and dashboards are telling us without having an easy, automated way to delve into the minutiae to find those areas of the business that are significantly over- or underperforming.
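The arithmetic behind this kind of dashboard can be sketched in a few lines of Python. The plan and actual figures below are hypothetical, chosen only to reproduce the 25% and 400% attainment levels described above; they are not the values behind Figure 1.

```python
# Hypothetical plan and actual sales per territory (illustrative numbers only,
# chosen to match the 25% and 400% of plan described in the text).
territories = {
    "Territory A": {"plan": 800, "actual": 200},
    "Territory B": {"plan": 200, "actual": 800},
}

def attainment(plan, actual):
    """Return actual sales as a percentage of plan."""
    return 100.0 * actual / plan

# Region-level view: sum plan and actual across territories first.
region_plan = sum(t["plan"] for t in territories.values())
region_actual = sum(t["actual"] for t in territories.values())
region_pct = attainment(region_plan, region_actual)

# Territory-level view: compute attainment for each territory separately.
territory_pct = {
    name: attainment(t["plan"], t["actual"]) for name, t in territories.items()
}

print(f"Region: {region_pct:.0f}% of plan")
for name, pct in territory_pct.items():
    print(f"{name}: {pct:.0f}% of plan")
```

With these assumed numbers the region lands exactly on plan even though neither territory is anywhere near it, which is why a drill-down step matters.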

Simpson’s Paradox and “Lurking” Variables

The next example deals with “lurking” variables[1]. The data for this example are taken from a 1996 study by Appleton, French, and Vanderpump on the effects of smoking. The study catalogued women from an earlier survey by the age groups used in that survey and by whether or not they were smokers. It then measured the deaths of smokers and nonsmokers over a 20-year period. The overall counts of the study are as follows (see Figure 2).

Figure 2: Mortality rates for smokers and nonsmokers

In Figure 3, the mortality rates are shown as a bar chart.

Figure 3: Overall Mortality Rates

What is striking is that this chart suggests that nonsmokers actually have higher mortality rates[2] (a larger percentage of them die) than smokers, certainly a surprising result and contrary to current medical teachings. But the numbers tell a much different story when mortality is examined by age group (see Figure 4).

Figure 4: Mortality rates by age

Now we see that smokers have higher mortality rates for virtually every age group. This is a classic example of the Simpson’s Paradox phenomenon; it shows that a trend present within multiple groups can reverse when the groups are combined. The phenomenon is well known to statisticians, but counter-intuitive to many analysts.
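The reversal can be checked with a short Python sketch. The counts below are an assumption: they are the figures commonly reproduced from the Appleton, French, and Vanderpump study in statistics teaching materials, not values read from the charts in this post, so treat them as illustrative.

```python
# (dead, alive) counts after 20 years, by age group at the original survey.
# Assumption: counts as commonly reproduced from Appleton, French, and
# Vanderpump (1996); they are not taken from the figures in this post.
counts = {
    "18-24": {"smoker": (2, 53), "nonsmoker": (1, 61)},
    "25-34": {"smoker": (3, 121), "nonsmoker": (5, 152)},
    "35-44": {"smoker": (14, 95), "nonsmoker": (7, 114)},
    "45-54": {"smoker": (27, 103), "nonsmoker": (12, 66)},
    "55-64": {"smoker": (51, 64), "nonsmoker": (40, 81)},
    "65-74": {"smoker": (29, 7), "nonsmoker": (101, 28)},
    "75+": {"smoker": (13, 0), "nonsmoker": (64, 0)},
}

def mortality(dead, alive):
    """Share of a group that died during the follow-up period."""
    return dead / (dead + alive)

def overall(status):
    """Mortality for one smoking status with all age groups pooled."""
    dead = sum(group[status][0] for group in counts.values())
    alive = sum(group[status][1] for group in counts.values())
    return mortality(dead, alive)

# Age groups in which smokers die at a strictly higher rate than nonsmokers.
smokers_worse = [
    age for age, group in counts.items()
    if mortality(*group["smoker"]) > mortality(*group["nonsmoker"])
]

print(f"Pooled: smokers {overall('smoker'):.1%}, nonsmokers {overall('nonsmoker'):.1%}")
print(f"Smokers fare worse in {len(smokers_worse)} of {len(counts)} age groups")
```

Under these assumed counts the pooled comparison favors smokers, while most individual age groups point the other way.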

Simpson’s Paradox requires two things to occur. First, the variable being reviewed is influenced by a “lurking” variable. In our example, age is the lurking variable, with the population grouped into a discrete number of age subcategories. Second, the subgroups have differing sizes, so each population being compared is weighted toward different subgroups. If both of these conditions are met, they combine to obscure the real relationships in the data.

In this example, the age distributions are substantially different for smokers and nonsmokers[3]. In particular, the nonsmoking population is older on average. Twenty-seven percent of nonsmokers are in the two oldest age groups, compared to approximately eight percent of smokers. Mortality rates in those oldest age groups are very high, approaching 100%, for smokers and nonsmokers alike, so the greater proportion of older nonsmokers pushes up the overall average for nonsmokers. Viewing the data by age leads to a more plausible theory, one that comports with long-standing medical teaching: long-term smoking shortened lifespans, thereby affecting the age distributions in the study’s population (see Figure 5).
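To see how the weighting works, here is a minimal Python sketch that splits each population into just two strata and rebuilds the overall rate as a weighted average. The stratum counts are an assumption: they are collapsed from the counts commonly reproduced from the Appleton, French, and Vanderpump study, not read from the charts in this post, and the `decompose` helper is purely illustrative.

```python
# (deaths, total) per stratum. Assumption: collapsed from the commonly
# reproduced Appleton, French, and Vanderpump (1996) counts, not from
# the figures in this post.
strata = {
    "smoker": {"under 65": (97, 533), "65 and over": (42, 49)},
    "nonsmoker": {"under 65": (65, 539), "65 and over": (165, 193)},
}

def decompose(status):
    """Return per-stratum (weight, rate) pairs and the weighted overall rate."""
    total = sum(n for _, n in strata[status].values())
    parts = {
        # weight = stratum's share of the population, rate = stratum mortality
        name: (n / total, deaths / n)
        for name, (deaths, n) in strata[status].items()
    }
    overall = sum(weight * rate for weight, rate in parts.values())
    return parts, overall

for status in strata:
    parts, overall = decompose(status)
    for name, (weight, rate) in parts.items():
        print(f"{status:9s} {name:11s} weight={weight:.1%} rate={rate:.1%}")
    print(f"{status:9s} overall rate={overall:.1%}")
```

In this sketch smokers have the higher rate inside each stratum, yet nonsmokers carry far more weight in the high-mortality stratum, and that difference in weights is what flips the overall comparison.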

Figure 5: Distribution of Age by Smoking Status

Simpson’s Paradox and Hidden Correlations

Our next example exposes the dangers of hidden correlations using the traditional price/demand curve.

Price/demand curves display the relationship between the price of a product and the amount that is sold at that price; the responsiveness of quantity to price is the price elasticity of demand. In the figure below, each dot represents the sales for a certain product during a given week. The x-axis represents the price and the y-axis the quantity sold (logarithmic values are used because we wish to analyze changes in percentages rather than in absolute dollar or purchased amounts). The blue line represents a fitted regression model, the slope of which can be interpreted as the price elasticity (see Figure 6).

Figure 6: Price and quantity relationship

Economic theory would lead us to expect the regression line to slope downward, meaning that as the price increases, the quantity sold should decrease. This is the expected behavior for most products and has been the basis of economic theory. Yet oddly enough, in this example, the regression line slopes upward. The implication that the product sells more when its price increases flies in the face of both common sense and common economic theory. What is happening here?

A different view of the data, displayed in Figure 7, yields a useful insight. Here, the red line represents the average price and the blue line represents the corresponding quantity sold at each point in time. The price and quantity lines move in an inverse relationship, as we expect; in fact, the implied elasticity here is actually quite large. What is causing the apparent disconnect between the two charts?

Figure 7: Price and quantity over time

Notice that there is an overall downward trend in price (the red line). This suggests that it might be useful to break apart the data by time period (see Figure 8). Doing so reveals a surprising result: although the relationship between price and quantity is positive in the aggregated dataset, it is the expected negative one within each time period. This is the source of the apparent paradox: the relationship between price and quantity sold is negative within any given time period, as intuition suggests, but reverses when the data are aggregated.
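The same effect is easy to reproduce on a toy dataset. The sketch below uses hypothetical log-price and log-quantity values (not the data behind the figures) in which the within-period elasticity is fixed at -2, while both the price level and baseline demand drift downward from one period to the next. The `ols_slope` helper is a plain least-squares slope written out by hand for illustration.

```python
# Hypothetical (log price, log quantity) observations for three time periods.
# Within each period quantity falls as price rises (slope of -2), while both
# the price level and baseline demand drift downward across periods.
periods = {
    "period 1": [(1.9, 5.2), (2.0, 5.0), (2.1, 4.8)],
    "period 2": [(1.4, 4.2), (1.5, 4.0), (1.6, 3.8)],
    "period 3": [(0.9, 3.2), (1.0, 3.0), (1.1, 2.8)],
}

def ols_slope(points):
    """Ordinary least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Elasticity estimated separately within each period.
within = {name: ols_slope(points) for name, points in periods.items()}

# Elasticity estimated on the pooled data, ignoring the time period.
pooled = ols_slope([point for points in periods.values() for point in points])

for name, slope in within.items():
    print(f"{name}: slope = {slope:.2f}")
print(f"pooled: slope = {pooled:.2f}")
```

Every within-period slope is negative, but the pooled slope comes out positive because the period-to-period drift dominates the fit. Fitting by period, or adding the period as a control in the regression, is one way to recover the expected sign.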

Figure 8: Price and quantity plotted by periods of time

Summary: Beware Simpson’s Paradox

While Simpson’s Paradox might not be as exciting as an episode of The Simpsons, it is an important consideration as you analyze your data looking for trends and propensities. It supports the data science adage that one should never be satisfied with one’s analytic models, and should constantly look for opportunities to disprove what those models say. Besides, this gives me a valid reason to keep watching The Simpsons. You just never know what you might learn from Homer! D’oh!


[1] Note: many of these examples come from a marvelous blog titled “Simpson’s Paradox: A Cautionary Tale in Advanced Analytics” written by Steve Berman, Leandro DalleMule, Michael Greene, and John Lucker.

[2] Mortality rate is a measure of the number of deaths in a population, scaled to the size of that population, per unit of time. Mortality rate is typically expressed in units of deaths per 1,000 individuals per year; thus, a mortality rate of 9.5 (out of 1,000) in a population of 1,000 would mean 9.5 deaths per year in that entire population, or 0.95% of the total.

[3] Note: the percentages in the nonsmokers’ chart total 102%, not 100%. I couldn’t find the raw data to get the correct percentage distribution, but the point is still the same.
