Effect Size Theory
The Problem with P-Values & The Meta-Analytic Solution
1. Interpreting Effect Sizes (Cohen’s \(d\))
Effect sizes are abstract numbers. What do they look like in the real world?
The Overlap Visualization
- \(d = 0.2\) (Small): 85% Overlap. Hard to see with the naked eye. (The "teenager" growth spurt: roughly the height difference between 15- and 16-year-old girls.)
- \(d = 0.5\) (Medium): 67% Overlap. Visible. (Height difference between 14 and 18 year old girls).
- \(d = 0.8\) (Large): 53% Overlap. Obvious. (Height difference between 13 and 18 year old girls).
Cohen’s Rule of Thumb: 0.2 is subtle, 0.5 is visible to the careful observer, 0.8 is obvious to anyone.
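These overlap percentages can be reproduced from the standard normal CDF. Below is a minimal sketch, assuming two equal-variance normal distributions and Cohen's U1 definition of non-overlap (the figures quoted above are \(1 - U_1\)):

```python
from scipy.stats import norm

def pct_overlap_u1(d):
    """Percent overlap of two equal-variance normal curves,
    defined as 1 - Cohen's U1 (the proportion of non-overlap)."""
    d = abs(d)
    p = norm.cdf(d / 2)      # area of one curve beyond the midpoint between the two means
    u1 = (2 * p - 1) / p     # Cohen's U1
    return 100 * (1 - u1)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {pct_overlap_u1(d):.0f}% overlap")
# d = 0.2: 85% overlap
# d = 0.5: 67% overlap
# d = 0.8: 53% overlap
```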
2. Interpreting Effect Sizes
What does a \(\delta\) (or \(d\)) of 0.5 mean?
- It represents the improvement of the Treatment group over the Control group, expressed in standard deviation units: at \(d = 0.5\), the Treatment mean sits half a standard deviation above what is observed in the absence of treatment.
- Overlap: It tells us how much the two distributions overlap.
Cohen’s (1988) Rule of Thumb
In the behavioral sciences, we use these benchmarks:
- \(\delta = 0.2\): Small Effect
- \(\delta = 0.5\): Medium Effect
- \(\delta = 0.8\): Large Effect
Visualizing \(\delta\):
- If \(d = 0.0\): 50% of Tx group scores above the Ctrl mean (no effect).
- If \(d = 0.5\): 69% of Tx group scores above the Ctrl mean.
- If \(d = 1.0\): 84% of Tx group scores above the Ctrl mean.
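Under the same equal-variance normal assumption, these percentages are Cohen's U3, i.e. \(\Phi(d)\), the standard normal CDF evaluated at \(d\). A quick check:

```python
from scipy.stats import norm

# Cohen's U3: proportion of the Tx distribution scoring above the Ctrl mean
for d in (0.0, 0.5, 1.0):
    print(f"d = {d}: {norm.cdf(d):.0%} of the Tx group scores above the Ctrl mean")
# d = 0.0: 50% ...   d = 0.5: 69% ...   d = 1.0: 84% ...
```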
The Problem with P-Values (Schmidt 1992)
In 1992, Frank Schmidt published a landmark paper arguing that our reliance on significance testing in individual studies creates the illusion of conflicting findings and retards the accumulation of knowledge.
The Narrative Review Trap (Vote Counting)
Imagine you are reviewing 21 studies on the validity of an employment test. You look at the \(p\)-values to decide if the test works.
Each study has a sample size of \(N = 68\), and the true population correlation is \(\rho = 0.22\).
What happens when we analyze them individually?
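Here is a minimal simulation sketch of this setup (the random seed, and therefore the specific \(r\) and \(p\) values it prints, are illustrative assumptions, not the numbers behind the original table):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)   # illustrative seed
rho, n, k = 0.22, 68, 21         # true correlation, per-study N, number of studies
cov = [[1, rho], [rho, 1]]

for study in range(1, k + 1):
    # draw one study's worth of bivariate-normal data and test the correlation
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r, p = pearsonr(x, y)
    print(f"Study {study:2d}: r = {r:5.2f}, p = {p:.3f} ({'Sig' if p < 0.05 else 'NS'})")
```

Every study samples from the same population, yet the printed \(r\) values scatter widely and typically only around half of them cross \(p < .05\).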
The “Conflicting” Results
If you look at the table above, you will see a mess:
- Some studies show a strong effect (\(r = 0.38\), Sig).
- Some studies show no effect (\(r = 0.04\), NS).
- Some are barely significant, some are barely not.
A Narrative Reviewer would look at this and say:

> "The literature is mixed. The test works in some settings but not others. We need to investigate moderator variables to explain these inconsistencies."
They would be WRONG.
The Meta-Analytic Reality
In this simulation, all 21 studies were drawn from the EXACT SAME population. There are no moderators. There are no “different settings”. The differences are 100% Sampling Error.
Visualizing Sampling Error
Let’s visualize the “Sampling Distribution” of these correlations using a Forest Plot.
The Confidence Intervals (grey bars) for almost all studies overlap with the True Effect (Orange Dashed Line). The “conflicting” results are an illusion created by dichotomizing p-values.
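The confidence intervals themselves can be computed with the Fisher z transformation. A sketch, applied to the two "conflicting" correlations quoted above (each with \(N = 68\)):

```python
import numpy as np

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via the Fisher z transformation."""
    z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

for r in (0.38, 0.04):
    lo, hi = fisher_ci(r, 68)
    print(f"r = {r:.2f}: 95% CI [{lo:.2f}, {hi:.2f}], covers rho = 0.22: {lo <= 0.22 <= hi}")
# Both intervals contain 0.22: the "strong" and the "null" study are statistically compatible.
```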
Interactive Lab: Power and Sample Size
Why did so many fail to find the effect? Low Power. At \(N=68\), you only have about 50% power to detect \(r=0.22\).
Change the sample size below to see how \(N\) cures the “mixed results”.
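A rough sketch of that power calculation, using the Fisher z approximation with a two-tailed \(\alpha = .05\) (the exact figure depends on the test and tail choice, but it lands in the same ballpark as the ~50% quoted above):

```python
import numpy as np
from scipy.stats import norm

def power_for_r(rho, n, alpha=0.05):
    """Approximate power to detect a true correlation rho with sample size n
    (two-tailed test on the Fisher z scale)."""
    z_rho, se = np.arctanh(rho), 1 / np.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - z_rho / se) + norm.cdf(-z_crit - z_rho / se)

for n in (68, 150, 300, 500):
    print(f"N = {n:3d}: power = {power_for_r(0.22, n):.2f}")
```

As \(N\) grows, power approaches 1 and the "mixed results" disappear: nearly every study detects the effect.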
The “Dirty Secret” of Research
Most contradictions in the literature are not real. They are statistical artifacts caused by looking at p-values in underpowered studies.
- Study A: \(p = 0.04\) (Significant!) -> “Treatment works!”
- Study B: \(p = 0.06\) (Not Significant) -> “Treatment failed.”
In reality, the effect sizes might be almost identical (e.g., \(d = 0.41\) vs \(d = 0.39\)).
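To see how nearly identical effects can straddle the threshold, here is a hedged illustration using a normal approximation to the two-sample test; the group size of 50 per group is a hypothetical choice, not taken from the original studies:

```python
import numpy as np
from scipy.stats import norm

def approx_p(d, n_per_group):
    """Approximate two-tailed p-value for a standardized mean difference d
    (normal approximation to the two-sample test)."""
    z = d * np.sqrt(n_per_group / 2)
    return 2 * (1 - norm.cdf(abs(z)))

for d in (0.41, 0.39):
    print(f"d = {d}: p = {approx_p(d, 50):.3f}")
# With ~50 per group, d = 0.41 lands just under .05 and d = 0.39 just over.
```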
Meta-analysis moves us away from “Vote Counting” (Significant/Not Significant) and focuses on the precision of the estimate.