4. Chapter 1: Looking at Data – Distributions#
Before examining the distributions of our dataset (usually our sample dataset), we first need to understand what we can do with the dataset. The initial step involves calculating some values or creating graphs to describe our data. We use statistical tools and ideas to help us examine data and describe their main features. This examination is called exploratory data analysis. In some textbooks, it is referred to as descriptive statistics. However, statistics goes far beyond simply describing data: this is a statistics class, not a drawing class!
Beyond description, we can use the data to make generalizations about the population, study the causal effects of variables of interest, and make predictions about future data points. These activities fall under inferential statistics, which we will spend more time on throughout the course.
For this course, our primary focus will be learning statistical procedures to make generalizations about the population from the sample. However, in this chapter, we will focus on describing our datasets using calculated values (statistics) and visual representations (graphs). Hence the title: Looking at Data – Distributions.[1]
Before we dive into the types of graphs, let’s first look at the types of data we can obtain from a particular variable. Broadly, data can be categorized into two main types: quantitative and categorical.
Categorical variables take categories as values, as the name suggests. Each category is called a level:
If the levels do not have a natural order, the variable is a nominal categorical variable (e.g., fruit categories like “Apple” or “Orange”).
If the levels have a natural order, the variable is an ordinal categorical variable (e.g., academic letter grades like “A,” “B,” “C”).
Quantitative variables take numbers as values. The magnitude and differences between numbers have quantitative meanings:
If the values can be any number within an interval on the real line (continuous values), it is a continuous quantitative variable (e.g., income measured in dollars).
If the values have jumps (discrete values), it is a discrete quantitative variable (e.g., the number of cars per household).
A dataset typically contains information on a number of cases. Cases are the objects or subjects in a study. For each case, we have measurements for different types of variables, such as height, gender, and age. Additionally, there is often a label, which is a special variable used to identify cases in the dataset (e.g., a Vehicle Identification Number (VIN) to identify a specific car).
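To make these types concrete, here is a small, purely hypothetical dataset sketched in Python with pandas; the column names and values are invented for illustration.

```python
# Hypothetical cases (cars) with a label plus categorical and
# quantitative variables, stored in a pandas DataFrame.
import pandas as pd

cars = pd.DataFrame({
    "vin": ["VIN0001", "VIN0002", "VIN0003"],    # label identifying each case
    "color": ["Red", "Blue", "Red"],             # nominal categorical
    "condition": ["Good", "Excellent", "Fair"],  # ordinal categorical
    "price": [15300.50, 42999.99, 8250.00],      # continuous quantitative (dollars)
    "num_doors": [4, 2, 4],                      # discrete quantitative
})
print(cars)
```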
4.1. Displaying Distributions with Graphs#
Two frequently used graphs to describe categorical variables are bar graphs and pie charts.
A bar graph shows the count for each level of the categorical variable.
A pie chart shows the proportion of each level relative to the total.
Both are easy to understand, and you can refer to Wikipedia or your textbook for examples. For a visual representation, you can explore this website.
Now let’s turn our focus to graphs for quantitative variables: stemplots and histograms.
A stemplot provides a quick visual representation of the shape of a distribution while also including the actual numerical values. Stemplots work best when the number of observations is small, and all values are greater than zero.
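Below is a minimal sketch of how one might print a stem-and-leaf display in Python. It assumes positive two-digit values, with the tens digit as the stem and the ones digit as the leaf; the heights are hypothetical.

```python
# A toy stem-and-leaf display for positive two-digit integers.
from collections import defaultdict

def stemplot(values):
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # tens digit = stem, ones digit = leaf
    for stem in sorted(stems):
        print(stem, "|", "".join(str(leaf) for leaf in stems[stem]))

stemplot([60, 62, 66, 70, 71, 72, 75, 77])  # hypothetical heights in inches
# 6 | 026
# 7 | 01257
```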
For a large number of observations, a histogram is often a better choice. A histogram divides the range of values of a variable into classes or bins and displays the count or percentage of observations that fall into each class. You can choose a convenient number of classes based on your data and the level of detail you wish to represent.
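As a sketch, a histogram can be drawn with matplotlib; the simulated data and the choice of 10 classes below are illustrative, not prescribed.

```python
# Histogram of simulated heights; the bin count sets the level of detail.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(70, 3, size=200)  # 200 simulated heights in inches

plt.hist(heights, bins=10, edgecolor="black")
plt.xlabel("Height (inches)")
plt.ylabel("Count")
plt.show()
```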
Histograms and bar charts may look similar at first glance, but there are key differences:
The horizontal axis of a bar chart does not require a measurement scale; it simply identifies categories, which is why there are blank spaces between the bars.
In contrast, a histogram has no spaces between bars, as it represents data over a continuous interval, covering the entire range.
We can also describe a distribution with a density curve, a smoothed version of a density histogram based on relative frequency. The rationale behind using a smoothed curve is that it can often be parameterized by simple mathematical functions, making it easier to model the distribution mathematically.
The purpose of these graphs is to help us better understand our datasets. After graphing, we can examine the overall pattern and identify striking deviations from that pattern in the graphs or distributions. We describe the overall pattern of a distribution using its shape, center, and spread.
An important kind of deviation is an outlier: an individual value that falls outside the overall pattern and warrants further investigation to uncover its cause. Later, we will learn how to summarize these characteristics with numerical measures to describe the pattern more precisely.
For describing the shape of a distribution, we have the following:
Tails of the distribution: Extreme values are in the tails. High values are in the upper/right tail, and low values are in the lower/left tail.
Modes:
A distribution with one major peak is called unimodal.
A distribution with two major peaks is called bimodal.
A distribution with three major peaks is called trimodal.
Symmetry and Skewness:
A distribution is symmetric if the patterns of smaller and larger values around the midpoint are mirror images.
A distribution is skewed if it is not symmetric:
Skewed to the right: The right tail (larger values) is longer than the left tail.
Skewed to the left: The left tail (smaller values) is longer than the right tail.
Of course, there are many other types of graphs we can create to visualize datasets. For example, a time plot displays each observation against the time at which it was measured, making it easier to observe trends or patterns over time. As the number of data points increases, the utility of such graphs becomes even more apparent for understanding temporal relationships.
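A minimal time plot sketch, with invented monthly measurements, might look like this:

```python
# Simulated monthly measurements plotted against time to reveal a trend.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

months = pd.date_range("2024-01-01", periods=12, freq="MS")  # month starts
values = np.linspace(50, 60, 12) + np.random.default_rng(1).normal(0, 1, 12)

plt.plot(months, values, marker="o")
plt.xlabel("Month")
plt.ylabel("Measurement")
plt.show()
```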
4.2. Describing Distributions with Numbers#
When you have a quantitative variable, like the heights of Purdue students, you want to summarize its distribution. A distribution can be described by its shape, its center, and its spread. This section focuses on the numerical ways to measure center and spread, and how to handle outliers.
The Mean \(\bar{x}\)
Definition: The mean (arithmetic average) is found by summing all the values and then dividing by the number of observations. Mathematically,
\[\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}\]
Illustration (Purdue Students’ Heights): Suppose we sample 5 Purdue students with heights 66, 70, 71, 72, and 75 inches. The mean would be \(\bar{x} = (66 + 70 + 71 + 72 + 75)/5 = 70.8\) inches.
Key Points:
The mean is the “balance point” of the distribution.
Not a resistant (or robust) measure: a very tall or very short outlier in the data can pull the mean up or down.
The Median \(M\)
Definition: The median is the midpoint of the distribution when the data are ordered from smallest to largest.
If \(n\) (the number of data points) is odd, the median is the single middle value.
If \(n\) is even, the median is the average of the two middle values.
Illustration (Purdue Students’ Heights): From the same sample [66, 70, 71, 72, 75], the ordered data are 66, 70, 71, 72, 75. Since \(n=5\) is odd, the median is the 3rd value \(\rightarrow\) 71 inches.
Key Points:
The median is more resistant to outliers because it depends on the order of the data rather than the actual numeric magnitude of extreme observations.
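Both illustrations can be checked with numpy:

```python
import numpy as np

heights = np.array([66, 70, 71, 72, 75])  # the five sampled heights
print(np.mean(heights))    # 70.8
print(np.median(heights))  # 71.0 (the 3rd of the 5 ordered values)
```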
Comparing Mean and Median
If the distribution is symmetric (like a “bell” shape for heights), the mean and median tend to be close.
If the distribution is skewed (long tail on one side), the mean is pulled toward the tail more than the median.
A single measure of center (mean or median) doesn’t capture how spread out the data are. We also need measures of variability.
Quartiles and the Interquartile Range (IQR)
Quartiles:
First Quartile \(Q_1\): The median of the lower half of the data (25th percentile).
Third Quartile \(Q_3\): The median of the upper half of the data (75th percentile).
IQR = \(Q_3 - Q_1\)
Definition: The IQR measures the range of the middle 50% of the data.
Resistant measure of spread because quartiles (like the median) are not heavily influenced by extreme values.
Illustration (Purdue Students’ Heights): If we have these 8 sorted heights in inches: 60, 62, 66, 70, 71, 72, 75, 77,
The median \(M\) is between the 4th and 5th values \(\rightarrow \frac{70 + 71}{2} = 70.5\).
\(Q_1\) is the median of the lower half \([60, 62, 66, 70]\) \(\rightarrow\) between 62 and 66 \(\rightarrow\) 64.
\(Q_3\) is the median of the upper half \([71, 72, 75, 77]\) \(\rightarrow\) between 72 and 75 \(\rightarrow\) 73.5.
IQR = \(Q_3 - Q_1 = 73.5 - 64 = 9.5\).
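A sketch of this computation, following the textbook’s median-of-each-half method (note that np.percentile interpolates by default and can give slightly different quartiles):

```python
import numpy as np

heights = np.sort([60, 62, 66, 70, 71, 72, 75, 77])
lower, upper = heights[:4], heights[4:]  # split around the median (n = 8, even)
q1, q3 = np.median(lower), np.median(upper)
print(q1, q3, q3 - q1)                   # 64.0 73.5 9.5
```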
The Five-Number Summary
Definition: A quick numerical snapshot of a distribution made up of:
\[\text{Minimum}, \quad Q_1, \quad M, \quad Q_3, \quad \text{Maximum}\]
Boxplots
Definition: A boxplot (sometimes called a “box-and-whisker plot”) graphically shows the five-number summary.
The “box” covers \(Q_1\) to \(Q_3\).
A line inside the box marks the median \(M\). The mean can be represented using either a cross or an asterisk.
“Whiskers” extend out to the minimum and maximum (or to the most extreme points that aren’t flagged as outliers in a modified boxplot).
Use: Quickly visualize center, spread (the length of the box), and potential outliers.
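A boxplot sketch in matplotlib, reusing the eight hypothetical heights; showmeans adds a marker for the mean:

```python
import matplotlib.pyplot as plt

heights = [60, 62, 66, 70, 71, 72, 75, 77]
plt.boxplot(heights, showmeans=True)  # box: Q1 to Q3; line: median; marker: mean
plt.ylabel("Height (inches)")
plt.show()
```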
Outliers and the 1.5 \(\times\) IQR Rule
Definition: A value is called a suspected outlier if it falls more than 1.5 \(\times\) IQR above \(Q_3\) or below \(Q_1\).
Lower Bound = \(Q_1 - 1.5 \times \mathrm{IQR}\).
Upper Bound = \(Q_3 + 1.5 \times \mathrm{IQR}\).
Illustration: If \(Q_1=64\), \(Q_3=73.5\), and \(\mathrm{IQR}=9.5\),
\(1.5 \times \mathrm{IQR} = 1.5 \times 9.5 = 14.25\).
Lower Bound: \(64 - 14.25 = 49.75\).
Upper Bound: \(73.5 + 14.25 = 87.75\).
Any height below 49.75 or above 87.75 would be a flagged outlier.
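The same arithmetic in code:

```python
q1, q3 = 64.0, 73.5
iqr = q3 - q1                          # 9.5
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # 49.75 87.75
```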
In general, we can calculate the \(P\)th percentile by following these steps:
Find the Position: Use the following formula to determine the position in the sorted dataset:
\[\text{Position} = \frac{P}{100} \times (n + 1)\]
where:
\(P\) is the desired percentile (for example, \(25\)th percentile for \(Q_1\), \(50\)th percentile for the median).
\(n\) is the number of data points in the dataset.
Whole Number Position:
If the position is a whole number, use the corresponding data point at that position directly.
Fractional Position:
If the position is a fraction (that is, not a whole number), find the two adjacent data points in the sorted dataset.
Use the values at these adjacent positions and take their average to calculate the percentile[2].
Example:
To find the median (50th percentile) in a dataset with \(n = 8\) data points, use the formula:
\[\text{Position} = \frac{50}{100} \times (8 + 1) = 4.5\]
Since \(4.5\) is a fraction, find the 4th and 5th data points in the sorted dataset and take their average. For this textbook, the median is calculated as:
\[M = \frac{x_{(4)} + x_{(5)}}{2}\]
This method can be applied to any percentile by changing the value of \(P\). For example, to find the 25th percentile, use \(P = 25\).
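Here is a small sketch of this averaging rule (most software interpolates instead, so built-in percentile functions can give different answers):

```python
def percentile(data, p):
    """Position = p/100 * (n + 1); average the two neighbors
    when the position is fractional (the textbook's rule)."""
    data = sorted(data)
    pos = p / 100 * (len(data) + 1)
    if pos == int(pos):
        return data[int(pos) - 1]
    k = int(pos)                        # floor of the fractional position
    return (data[k - 1] + data[k]) / 2

heights = [60, 62, 66, 70, 71, 72, 75, 77]
print(percentile(heights, 50))  # (70 + 71) / 2 = 70.5
print(percentile(heights, 25))  # (62 + 66) / 2 = 64.0
```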
While the five-number summary is quite resistant to outliers, many statistical methods rely on the mean and standard deviation.
Variance \(s^2\)
Definition: The average of the squared deviations from the mean:
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\]
Intuition: Each data point’s “deviation” from \(\bar{x}\) is squared and then averaged (using \(n-1\) in the denominator for degrees of freedom).
Standard Deviation \(s\)
Definition: The square root of the variance:
\[s = \sqrt{s^2}\]
Interpretation: Measures how far (on average) data points lie from their mean. A large \(s\) indicates the data are more spread out.
Key Points:
\(s\) is always \(\ge 0\); \(s=0\) only if all data points are identical.
\(s\) is not resistant to outliers, because outliers can dramatically increase the average squared deviation.
\(s\) is most informative when the distribution is reasonably symmetric and has no extreme outliers.
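With numpy, passing ddof=1 gives the \(n-1\) denominator used above:

```python
import numpy as np

heights = np.array([66, 70, 71, 72, 75])
print(np.var(heights, ddof=1))  # sample variance s^2 = 10.7
print(np.std(heights, ddof=1))  # sample standard deviation s ≈ 3.27
```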
Resistant vs. Non-Resistant Measures
Resistant (Robust) measures: The median, quartiles, IQR. They are not strongly affected by a few extreme values.
Non-Resistant measures: The mean and standard deviation can shift substantially if there are outliers or skewness.
Example: If 3 or 4 extremely tall Purdue basketball players (say 7-footers) happen to be in your sample, the mean (and standard deviation) of heights will jump notably. The median or IQR might change only a little.
Choosing a Summary: Five-Number Summary vs. \(\bar{x}\) and \(s\)
Five-Number Summary (Min, \(Q_1\), Median, \(Q_3\), Max) is best when:
The distribution is skewed or has strong outliers.
You want a quick, robust snapshot of center and spread.
Mean and Standard Deviation (\(\bar{x}\), \(s\)) are best when:
The distribution is fairly symmetric with no major outliers.
You plan to use statistical methods that assume normality or revolve around the mean.
Reminder: Always plot your data (histogram, stemplot, or boxplot) to see shape, outliers, or clusters. A single numeric summary never tells the full story (e.g., multiple modes or gaps).
Sometimes, we change the units of our measurements. For example, if your heights are measured in inches and you convert them to centimeters using the formula \(\text{cm} = 2.54 \times \text{inches}\), the new mean in centimeters becomes \(2.54 \times \bar{x}\). This kind of conversion is known as a linear transformation, represented by \(x_{\text{new}} = a + b \times x\), which shifts the data by \(a\) and/or scales it by \(b\). Specifically:
Adding \(a\) to each observation shifts measures of center (mean, median) by \(a\) but does not affect measures of spread (IQR, \(s\)).
Multiplying each observation by \(b\) multiplies the measures of center by \(b\) and the measures of spread by \(|b|\).
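A quick numerical check of these rules, with \(a = 0\) and \(b = 2.54\) as in the inches-to-centimeters example:

```python
import numpy as np

inches = np.array([66, 70, 71, 72, 75])
cm = 2.54 * inches  # x_new = a + b * x with a = 0, b = 2.54
print(np.mean(inches), np.mean(cm))                # 70.8 and 2.54 * 70.8 = 179.832
print(np.std(inches, ddof=1), np.std(cm, ddof=1))  # s also scales by 2.54
```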
4.3. Density Curves and Normal Distributions#
For histograms, the y-axis typically shows frequency or relative frequency. If we divide the relative frequency by the corresponding bin or class width, we obtain densities. Connecting these densities with a smooth line or curve creates what is called a density curve.
The main reason for dividing by bin width is to ensure the total area under the density curve is 1, allowing it to represent the probability distribution function of our random variable. A smooth curve is often easier to parameterize with just a few parameters compared to an irregular shape.
This density curve describes the overall pattern of a distribution, where the area under the curve and above any range of values equals the proportion of all observations in that range. Consequently, the probability that our random variable falls within a particular range corresponds to this area or proportion.
Mentally, you can consider the density curve to be the theoretical distribution of the random variable, whereas the histogram is the empirical distribution derived from sample data. As we gather more and more data, our histogram will more closely approximate the density curve.

Fig. 4.1 Histogram and Density Curve#
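As a sketch of this relationship, the following simulates data, draws a density-scaled histogram (total area 1), and overlays the theoretical density curve; the parameters are invented.

```python
# density=True divides counts by n and by bin width, giving densities.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(70, 3, size=1000)

plt.hist(data, bins=30, density=True, alpha=0.5)
xs = np.linspace(58, 82, 200)
plt.plot(xs, norm.pdf(xs, loc=70, scale=3))  # the density curve being approximated
plt.show()
```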
One famous type of density curve is the normal density curve. It is used in many real-world applications (such as modeling heights or the birthweights of babies). Normal distributions are described by bell-shaped, symmetric, unimodal density curves. This density curve has only two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). Depending on their values, the shape of the distribution may vary, but it always retains the aforementioned bell-shaped and symmetric features.
Here are five properties of normal densities:
Symmetry: The normal distribution is symmetric around its mean (\(\mu\)). The left half of the curve is a mirror image of the right half.
Mean, Median, and Mode are Equal: In a normal distribution, the mean, median, and mode are identical and located at the same point (\(\mu\)).
Bell-Shaped Curve: The graph of the normal distribution is a smooth, bell-shaped curve. Most data points are concentrated around the mean, with frequencies decreasing as you move away from the center.
Defined by Two Parameters: The normal distribution is completely determined by:
Mean (\(\mu\)): Determines the center of the distribution.
Variance (\(\sigma^2\)): Controls the spread of the distribution.
Asymptotic Behavior: The tails of the normal distribution extend infinitely in both directions but never touch the horizontal axis (asymptotic to the \(x\)-axis).
We start with the simplest normal distribution, the standard normal distribution, because any normal distribution can be obtained from it by shifting and scaling.
A standard normal distribution has a mean of 0 and a variance of 1. We denote a standard normal random variable by \(Z \sim \mathcal{N}(0, 1)\).
The PDF (probability density function) of the standard normal distribution is:
\[\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\]
Let \(Z \sim \mathcal{N}(0,1)\) be a standard normal random variable. Then
\[X = \mu + \sigma Z\]
is said to have a Normal distribution with mean \(\mu\) and variance \(\sigma^2\). We denote this by \(X \sim \mathcal{N}(\mu, \sigma^2)\). To transform \(X \sim \mathcal{N}(\mu, \sigma^2)\) to a standard normal distribution, we use:
\[Z = \frac{X - \mu}{\sigma}\]
The PDF of \(X\) can be expressed as:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\]
Remember, densities are quite useful because they tell us the probability of statements like \(\mathbb{P}(X < x)\) or \(\mathbb{P}(\text{Height} < 70)\). If our variable of interest is a standard Normal variable (i.e., a variable with a standard Normal density), then probabilities such as \(\mathbb{P}(Z < z)\) or \(\mathbb{P}(Z < 1)\) can be found directly using a z-table.
Theoretically, we could create a similar table for each Normal distribution to tell us these probabilities, but that’s not practical. Instead, we keep a single z-table for the standard Normal distribution. If our variable of interest is Normal (but not standard Normal), we can still find probabilities by standardizing:
\[Z = \frac{X - \mu}{\sigma}\]
So, if \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(Z \sim \mathcal{N}(0, 1)\), and we can use the z-table for calculations and finding the probabilities.
Since we have this relationship, the only thing left is to become familiar with the areas under the standard Normal density and how to use the z-table.
For any Normal density, there is the 68–95–99.7 empirical rule that we can use for quick approximations.
Definition 4.1 (The 68-95-99.7 Rule)
In the Normal distribution with mean \(\mu\) and standard deviation \(\sigma\):
Approximately \(68\%\) of the observations fall within \(\sigma\) of the mean \(\mu\).
Approximately \(95\%\) of the observations fall within \(2\sigma\) of the mean \(\mu\).
Approximately \(99.7\%\) of the observations fall within \(3\sigma\) of the mean \(\mu\).
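A quick software check of the rule, using the standard Normal CDF in place of a z-table:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # ≈ 0.683, 0.954, 0.997
```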
Now, you should be able to use a z-table to handle several types of probability statements, such as:
Given \(X \sim \mathcal{N}(4, 1.5)\):
\(P(X < 3)\)
\(P(X > 4.5)\)
\(P(3 < X < 4.56)\)
and so on.
Backwards Normal Problems (Inverse Normal Calculations): finding \(x_0\) such that
\(P(X < x_0) = 0.95\)
\(P(X > x_0) = 0.9236\)
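As a sketch, scipy.stats.norm can stand in for the z-table here. Following the \(\mathcal{N}(\mu, \sigma^2)\) notation above, we read the 1.5 in \(\mathcal{N}(4, 1.5)\) as the variance; if it is meant as the standard deviation, use scale=1.5 instead.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 4, sqrt(1.5)  # treating 1.5 as the variance

# Forward problems
print(norm.cdf(3, loc=mu, scale=sigma))        # P(X < 3)
print(1 - norm.cdf(4.5, loc=mu, scale=sigma))  # P(X > 4.5)
print(norm.cdf(4.56, loc=mu, scale=sigma)
      - norm.cdf(3, loc=mu, scale=sigma))      # P(3 < X < 4.56)

# Backward (inverse) problems
print(norm.ppf(0.95, loc=mu, scale=sigma))        # x0 with P(X < x0) = 0.95
print(norm.ppf(1 - 0.9236, loc=mu, scale=sigma))  # x0 with P(X > x0) = 0.9236
```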
One last thing to mention is how to use Normal Quantile Plots to check Normality for your datasets. Here are the steps to construct such a plot:
Definition: A Normal quantile plot (sometimes called a normal probability plot) is a diagnostic tool. You:
Sort the data from smallest to largest and rank them: \(i = 1\) is the smallest value, \(i = n\) the largest.
Assign each data value a percentile (like 5%, 10%, etc.), for example via \(\frac{i}{n}\).
Convert each percentile to the corresponding z-score from the standard Normal distribution.
Plot each data value against its matched z-score (z-score on the x-axis, data value on the y-axis).
Interpretation:
If the resulting plot is roughly a straight line, then the data follow a Normal pattern.
Systematic curvature in the plot indicates non-Normal data.
Outliers appear as points that deviate noticeably from the pattern.
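In practice, scipy’s probplot carries out all four steps at once; the simulated data below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(70, 3, size=50)  # simulated heights

stats.probplot(data, dist="norm", plot=plt)  # z-scores on x, ordered data on y
plt.title("Normal quantile plot")
plt.show()
```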