This time, rather than do some light statistical analysis, we’re actually teaching a concept from statistics—and we have an example that should make it more familiar. Though, dear heavens, we still miss sports.
Imagine you’re collecting a whole bunch of measurements of the same variable, like the height of everyone attending a pre-COVID basketball game. It’s a ton of random people, and all different kinds of people, so when you look at the crowd, you’ll see mostly average-height people, then you’ll notice someone super-tall, then you’ll bump into someone a head shorter than you. Sound about right?
At that moment, you are (literally) seeing what a statistician would call a “normal distribution.” The popularly-invoked phrase is “the bell curve” because of how a normal distribution is shaped on a graph:
Don’t get nervous! This is where things get interesting—because the bell curve, as a mental image, will help you to understand certain kinds of data sets and make safer assumptions about them. So picture the crowd of people again, look at the bell curve again, and we’ll tell you what you’re seeing:
The curve-shaped line (and the blue shade underneath) represents how many people there are of each height. The shortest people are on the left side, tallest on the right, average in the middle. The shape of the curve tells you that there are very few people at either extreme, then more and more people as you approach the average.
Average is the vertical line right down the middle. In theory, the line cuts straight through the “peak” of a (perfect) bell curve because, whatever the average height of the audience, people of “perfectly average” height will be the biggest group. The further you move from average height (in either direction), the fewer people you’ll have in that group.
This is cool because it means that, in these situations, we don’t have to take averages with such a huge grain of salt. We’re often wary of simple (mean) averages because they’re just one way of measuring center—and even then, seeing the center doesn’t show you the shape of things. The usefulness of the bell curve (when applicable) is that you CAN anticipate the shape of things.
Start from the middle and work outwards. You’ll notice that there are a few other vertical lines… but why? The lines are all 1 standard deviation apart (loosely, the standard deviation is the “typical distance from average”), and that’s the last puzzle piece here. If you know the average and the standard deviation, you know roughly how many people to expect in each group no matter how you slice and dice.
The percentages represent the proportion of people in each shaded area. You can follow the graph as you please, but as an important starting example: roughly two thirds (68%) of the audience will be within 1 standard deviation of “average.” Back at the arena: if the average height is 5’8″ and the standard deviation is 3 inches, you know that about 2/3 of the audience is between 5’5″ and 5’11″.
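If you’d rather check that arithmetic than take our word for it, here is a minimal Python sketch. It assumes the audience heights follow a perfect bell curve with the arena example’s numbers (average 68 inches, standard deviation 3 inches). The helper name share_within is made up for illustration; NormalDist comes from Python’s standard statistics module.

```python
from statistics import NormalDist

# Assumption: a perfect bell curve using the arena example's figures,
# i.e. average 5'8" (68 inches) and a standard deviation of 3 inches.
heights = NormalDist(mu=68, sigma=3)

def share_within(k):
    """Fraction of a normal population within k standard deviations of average."""
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    return heights.cdf(high) - heights.cdf(low)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```

The first line of output is the “roughly two thirds” figure: about 68% of the audience lands between 65 and 71 inches, which is 5’5″ to 5’11″.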
The bell curve can also put individuals in clearer context. For instance, when you look at the audience, you know that anyone standing 6’2″ is 2 standard deviations above average, which puts them around the 98th percentile (i.e. they’re taller than nearly 98% of other people).
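The same idea works for one person at a time. This is a small sketch, again assuming a perfect bell curve with an average of 68 inches and a standard deviation of 3 inches; the function names are ours, invented for the example.

```python
from statistics import NormalDist

# Assumption: same arena figures as the example (average 68 in, SD 3 in).
heights = NormalDist(mu=68, sigma=3)

def standard_deviations_from_average(height_in):
    """How many standard deviations a height sits above (+) or below (-) average."""
    return (height_in - heights.mean) / heights.stdev

def percentile_of(height_in):
    """Percent of the (assumed normal) audience shorter than this height."""
    return heights.cdf(height_in) * 100

six_two = 6 * 12 + 2  # 6'2" expressed in inches
print(standard_deviations_from_average(six_two))  # 2.0
print(f"{percentile_of(six_two):.1f}")            # 97.7
```

So a 6’2″ fan is taller than a little under 98% of the crowd, as long as the crowd really is bell-shaped.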
Disclaimer: this cannot be applied to everything. Not all types of data will form normal distributions—and bell-curve logic only applies when the data is actually shaped that way.
Audience height is a good example for two reasons. First, it’s a pretty stable trait; there are known exceptions in either direction (can someone Photoshop a Yao Lannister/Tyrion Ming for us?), but the overwhelming majority of people fall within a predictable height range. Second, and more importantly: it’s a “random” variable in that people have no direct control over it.
If you can say something similar about the variable you’re examining, picturing the bell curve might help; otherwise, it’s a fool’s errand. The best way to be sure: graph the data!
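Graphing doesn’t have to be fancy. Here is a rough sketch that prints a sideways histogram as plain text. The sample is simulated (2,000 made-up heights drawn from a bell curve with an average of 68 inches and a standard deviation of 3), purely to show what bell-shaped data looks like; swap in your own measurements. The function text_histogram is a hypothetical helper, not part of any library.

```python
import random
from collections import Counter

def text_histogram(values, bin_width=1.0, max_bar=50):
    """Return one line per bin: the bin's lower edge, then a bar of '#' marks."""
    counts = Counter(int(v // bin_width) for v in values)
    tallest = max(counts.values())
    lines = []
    for b in range(min(counts), max(counts) + 1):
        # Scale each bar so the fullest bin is max_bar characters wide.
        bar = "#" * round(counts[b] * max_bar / tallest)
        lines.append(f"{b * bin_width:6.1f} | {bar}")
    return lines

# Simulated stand-in data (an assumption, not real measurements).
rng = random.Random(0)
sample = [rng.gauss(68, 3) for _ in range(2000)]
print("\n".join(text_histogram(sample)))
```

If the bars bulge in the middle and taper off about evenly on both sides, bell-curve thinking is probably reasonable. If they pile up at one end, show two humps, or trail off in a long tail, it’s time to put the bell curve away.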