Effect Size Theory

The Problem with P-Values & The Meta-Analytic Solution

Author

A. C. Del Re

1. Interpreting Effect Sizes (Cohen’s \(d\))

Effect sizes are abstract numbers. What do they look like in the real world?

The Overlap Visualization

  • \(d = 0.2\) (Small): 85% Overlap. Hard to see with the naked eye. (The “Teenager” growth spurt).
  • \(d = 0.5\) (Medium): 67% Overlap. Visible. (Height difference between 14- and 18-year-old girls).
  • \(d = 0.8\) (Large): 53% Overlap. Obvious. (Height difference between 13- and 18-year-old girls).

Cohen’s Rule of Thumb: 0.2 is subtle, 0.5 is visible to the careful observer, 0.8 is obvious to anyone.
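Those overlap figures follow from Cohen's \(U_1\) (the proportion of non-overlap between two normal distributions with equal variance); the overlap reported above is \(1 - U_1\). A minimal Python sketch, offered as an illustration rather than part of the original materials:

```python
from scipy.stats import norm

def cohen_overlap(d):
    """Overlap (1 - Cohen's U1) between two equal-variance normal
    distributions whose means differ by d standard deviations."""
    u2 = norm.cdf(abs(d) / 2)      # Cohen's U2
    u1 = (2 * u2 - 1) / u2         # Cohen's U1: proportion of non-overlap
    return 1 - u1                  # proportion of overlap

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: overlap = {cohen_overlap(d):.0%}")
# d = 0.2: overlap = 85%
# d = 0.5: overlap = 67%
# d = 0.8: overlap = 53%
```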

2. Interpreting Effect Sizes

What does a \(\delta\) (or \(d\)) of 0.5 mean?

  • It represents the standardized difference between the group means: a \(d\) of 0.5 means the Treatment mean sits half a standard deviation above the Control mean, i.e., the improvement beyond what is observed in the absence of treatment.
  • Overlap: It tells us how much the two distributions overlap.

Cohen’s (1988) Rule of Thumb

In the behavioral sciences, we use these benchmarks:

  • \(\delta = 0.2\): Small Effect
  • \(\delta = 0.5\): Medium Effect
  • \(\delta = 0.8\): Large Effect

Visualizing \(\delta\):

  • If \(d = 0.0\): 50% of the Tx group scores above the Ctrl mean (no effect).
  • If \(d = 0.5\): 69% of the Tx group scores above the Ctrl mean.
  • If \(d = 1.0\): 84% of the Tx group scores above the Ctrl mean.
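These percentages are Cohen's \(U_3\): the proportion of the Treatment distribution that lies above the Control mean, which equals \(\Phi(d)\) when both groups are normal with equal variance. A short illustrative sketch (not part of the original materials):

```python
from scipy.stats import norm

# Cohen's U3: proportion of the Tx distribution above the Ctrl mean,
# assuming both groups are normal with a common standard deviation.
for d in (0.0, 0.5, 1.0):
    print(f"d = {d}: {norm.cdf(d):.0%} of Tx scores lie above the Ctrl mean")
```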

The Problem with P-Values (Schmidt 1992)

Historical Context

In 1992, Frank Schmidt published a landmark paper…

The Narrative Review Trap (Vote Counting)

Imagine you are reviewing 21 studies on the validity of an employment test. You look at the \(p\)-values to decide if the test works.

You collect 21 studies. Each has a sample size of \(N = 68\). The true population correlation is \(\rho = 0.22\).

What happens when we analyze them individually?
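The table of results is not reproduced in this text version, but the pattern is easy to recreate. The sketch below (illustrative; the random seed is arbitrary) draws 21 independent samples of \(N = 68\) from a bivariate normal population with \(\rho = 0.22\) and tests each observed correlation at \(\alpha = .05\):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)       # arbitrary seed, for reproducibility
rho, n, n_studies = 0.22, 68, 21
cov = [[1, rho], [rho, 1]]            # population correlation matrix

for study in range(1, n_studies + 1):
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r, p = pearsonr(x, y)
    verdict = "Sig" if p < 0.05 else "NS"
    print(f"Study {study:2d}: r = {r:+.2f}, p = {p:.3f}  ({verdict})")
```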

The “Conflicting” Results

If you look at the table above, you will see a mess:

  • Some studies show a strong effect (\(r = 0.38\), Sig).
  • Some studies show no effect (\(r = 0.04\), NS).
  • Some are barely significant, some are barely not.

A Narrative Reviewer would look at this and say: “The literature is mixed. The test works in some settings but not others. We need to investigate moderator variables to explain these inconsistencies.”

They would be WRONG.

The Meta-Analytic Reality

In this simulation, all 21 studies were drawn from the EXACT SAME population. There are no moderators. There are no “different settings”. The differences are 100% Sampling Error.

Visualizing Sampling Error

Let’s visualize the “Sampling Distribution” of these correlations using a Forest Plot.
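The plot itself is not embedded in this text version. As a sketch of how it can be drawn (illustrative code, not the original figure), each study's 95% confidence interval is built on the Fisher \(z\) scale, \(z = \operatorname{arctanh}(r)\) with standard error \(1/\sqrt{N - 3}\), and back-transformed with \(\tanh\):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Re-create 21 simulated correlations (N = 68, rho = 0.22), as above.
rng = np.random.default_rng(42)
rho, n, n_studies = 0.22, 68, 21
cov = [[1, rho], [rho, 1]]
rs = np.array([pearsonr(*rng.multivariate_normal([0, 0], cov, size=n).T)[0]
               for _ in range(n_studies)])

# 95% confidence intervals on the Fisher z scale, back-transformed to r.
se = 1 / np.sqrt(n - 3)
z = np.arctanh(rs)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

ys = np.arange(n_studies)
plt.hlines(ys, lo, hi, color="grey")        # confidence intervals
plt.plot(rs, ys, "ko")                      # observed correlations
plt.axvline(rho, color="orange", ls="--")   # true effect
plt.xlabel("Correlation (r)")
plt.ylabel("Study")
plt.show()
```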

Major Insight

The Confidence Intervals (grey bars) for almost all studies overlap with the True Effect (Orange Dashed Line). The “conflicting” results are an illusion created by dichotomizing p-values.

Interactive Lab: Power and Sample Size

Why did so many studies fail to find the effect? Low power. At \(N = 68\), you have only about 50% power to detect \(r = 0.22\).

Change the sample size below to see how \(N\) cures the “mixed results”.
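The interactive control is not available in this text version, but the underlying power calculation can be approximated with the Fisher \(z\) transformation: under \(H_1\), \(\operatorname{arctanh}(r)\) is roughly normal with mean \(\operatorname{arctanh}(\rho)\) and standard error \(1/\sqrt{N - 3}\). A hedged sketch (the exact figure depends on the test used and whether it is one- or two-tailed, which is why it lands in the neighborhood of the ~50% quoted above rather than exactly on it):

```python
import numpy as np
from scipy.stats import norm

def power_r(rho, n, alpha=0.05):
    """Approximate two-tailed power to detect a population correlation rho
    with a sample of size n, using the Fisher z approximation."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = np.arctanh(rho) * np.sqrt(n - 3)   # mean of the test statistic under H1
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

for n in (68, 150, 300, 600):
    print(f"N = {n:3d}: power ≈ {power_r(0.22, n):.2f}")
```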

The “Dirty Secret” of Research

Most contradictions in the literature are not real. They are statistical artifacts caused by looking at p-values in underpowered studies.

  • Study A: \(p = 0.04\) (Significant!) -> “Treatment works!”
  • Study B: \(p = 0.06\) (Not Significant) -> “Treatment failed.”

In reality, the effect sizes might be almost identical (e.g., \(d = 0.41\) vs \(d = 0.39\)).
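To make the arithmetic concrete, here is an illustrative calculation; the figure of 48 participants per group is an assumption chosen for the example, not a number from the text. With that group size, \(d = 0.41\) and \(d = 0.39\) fall on opposite sides of \(p = .05\) even though the effects are practically identical:

```python
from math import sqrt
from scipy.stats import t

n = 48                                # assumed participants per group (illustrative)
for d in (0.41, 0.39):
    t_stat = d * sqrt(n / 2)          # two-sample t with equal group sizes
    p = 2 * t.sf(t_stat, df=2 * n - 2)
    print(f"d = {d}: t = {t_stat:.2f}, p = {p:.3f}")
```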

The Meta-Analytic Insight

Meta-analysis moves us away from “Vote Counting” (Significant vs. Not Significant) and toward the magnitude and precision of the estimated effect.
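As a bridge to the next section, here is a minimal sketch (an illustration, not the procedure this course prescribes) of pooling correlations with fixed-effect inverse-variance weights on the Fisher \(z\) scale; the pooled estimate comes with a much narrower confidence interval than any single study:

```python
import numpy as np

def pool_correlations(rs, ns):
    """Fixed-effect inverse-variance pooling of correlations on the Fisher z
    scale; returns the pooled r and its 95% confidence interval."""
    z = np.arctanh(rs)
    w = np.asarray(ns) - 3                    # 1 / Var(Fisher z) = N - 3
    z_bar = np.sum(w * z) / np.sum(w)
    se = 1 / np.sqrt(np.sum(w))
    return np.tanh(z_bar), np.tanh(z_bar - 1.96 * se), np.tanh(z_bar + 1.96 * se)

# Hypothetical study correlations (each study with N = 68)
rs = [0.38, 0.04, 0.22, 0.31, 0.15]
r_pooled, lo, hi = pool_correlations(rs, [68] * len(rs))
print(f"Pooled r = {r_pooled:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```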


