The Science of Growth: How to Design Marketing Experiments That Don’t Lie

Most experiments don’t fail because the ideas are bad—they fail because the design is. Peeking at results and “calling it” early, underpowered samples, mushy metrics, or shipping five changes at once can all produce confident-sounding but wrong conclusions. Peeking alone inflates false-positive risk unless your method explicitly supports continuous looks.

Common Pitfalls (and what to do instead)

1) Stopping early with fixed-sample stats. Traditional p-values assume you look once, at the end. If you peek as data arrives, Type-I error balloons.
Fix: Either (a) commit to a fixed sample and don’t look, or (b) use sequential/always-valid approaches that allow continuous monitoring without breaking inference.

2) Too-small samples (underpowered tests). Want to detect a small lift? You need big n. 80% power at 5% alpha remains a sensible default; plan it before launch.

3) Multiple changes at once. If Variant B bundles six edits, you won’t know which one moved the metric. Use single-change tests or a proper multivariate/DOE design with the sample it requires.

4) Broken randomization (SRM). A 50/50 test that keeps landing at 55/45 with real traffic is a Sample Ratio Mismatch (SRM): a data-quality red flag, not “chance.” Investigate or invalidate; a quick check is sketched after this list.

5) Measuring the wrong thing. Vanity metrics (pageviews, followers) feel good but rarely guide decisions. Anchor on a North Star/output metric (revenue per visitor, paid activation, retained weekly active users) with guardrails for performance and trust.
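
Pitfall 4 in practice: a minimal SRM check, assuming you log assignment counts per arm. It runs a chi-square goodness-of-fit test against the planned split; the counts and the p < 0.001 alarm threshold below are illustrative choices, not the only reasonable ones.

```python
# Minimal SRM check: chi-square goodness-of-fit against the planned split.
# Counts and the p < 0.001 alarm threshold are illustrative choices.
from scipy.stats import chisquare

control, treatment = 55_000, 45_000         # observed assignment counts
expected = [(control + treatment) / 2] * 2  # planned 50/50 allocation

stat, p = chisquare([control, treatment], f_exp=expected)

if p < 0.001:
    print(f"SRM detected (p = {p:.2e}): investigate logging, bots, redirects.")
else:
    print(f"Allocation consistent with 50/50 (p = {p:.3f}).")
```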

The Scientific Method for Marketers

  • Hypothesis: “Because [insight], changing [thing] for [audience] will increase [metric] by [size] within [time].”
  • Control & treatment(s): Isolate a single meaningful change when possible; otherwise ensure your design can attribute effects.
  • Randomize + SRM checks: Monitor allocation continuously; fail fast on data quality.
  • Success & guardrails: Define the decision metric plus guardrails (latency, errors/crashes, unsubscribes/churn). Many “wins” are regressions without them.
  • Pre-register the analysis: Alpha/power, fixed vs sequential policy, exact metric formulas, handling of missing data and concurrent tests.
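
One way to make pre-registration concrete is a small plan file committed to version control and frozen at launch. A minimal sketch; every field name and value here is an example choice, not a required schema:

```python
# Illustrative pre-registration record; commit before launch and freeze.
# Field names and values are example choices, not a required schema.
PREREG = {
    "hypothesis": "Because mobile checkout friction drives drop-off, "
                  "shortening the form will lift revenue/visitor by 2% in 4 weeks",
    "decision_metric": "revenue_per_visitor",    # exact formula lives in the repo
    "guardrails": ["p95_latency_ms", "error_rate", "unsubscribe_rate"],
    "alpha": 0.05,
    "power": 0.80,
    "mde_relative": 0.02,
    "stats_path": "fixed_horizon",               # or "sequential" + stopping rules
    "missing_data": "exclude sessions without an exposure event",
    "concurrent_tests": "no audience overlap with the pricing test",
}
```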

Sample Size & Significance—Made Simple

  • Power (80%) = the chance you’ll detect your minimum detectable effect (MDE) if it’s real.
  • Alpha (5%) = the false-alarm rate you’re willing to accept.
  • Smaller MDEs need larger samples; so does higher power. Use a planner before launch; a sizing sketch follows.
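
For a two-proportion test, the standard normal-approximation formula gets you a plan in a few lines. A minimal sketch, assuming a 4% baseline conversion rate, a 10% relative MDE, and 5,000 visitors/day; all three numbers are illustrative:

```python
# Normal-approximation sample size for a two-proportion test.
# Baseline rate, MDE, and traffic below are illustrative assumptions.
import math
from scipy.stats import norm

def n_per_arm(p_base, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed in EACH arm to detect a relative lift of mde_rel."""
    p_treat = p_base * (1 + mde_rel)              # expected treatment rate
    z_alpha = norm.ppf(1 - alpha / 2)             # two-sided test
    z_power = norm.ppf(power)
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * var_sum / (p_treat - p_base) ** 2)

n = n_per_arm(p_base=0.04, mde_rel=0.10)          # 4% baseline, +10% relative MDE
print(f"~{n:,} visitors per arm")                 # ~39k per arm with these inputs
print(f"~{math.ceil(2 * n / 5_000)} days at 5,000 visitors/day")
```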

Plan & peek the right way: If your culture needs fast reads, adopt sequential / always-valid tests. They let you monitor continuously while keeping error rates honest.
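
If you want to see why naive peeking is dangerous before adopting sequential methods, a quick A/A simulation makes it vivid: with no real effect, “calling it” at the first p < 0.05 across five interim looks pushes the false-positive rate well past the nominal 5%. The sample sizes, look schedule, and trial count below are arbitrary choices:

```python
# A/A simulation: no true effect, yet stopping at the first p < 0.05
# across five interim looks inflates false positives well past 5%.
# Sample sizes, look schedule, and trial count are arbitrary choices.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
trials = 2_000
looks = [400, 800, 1_200, 1_600, 2_000]    # interim analyses by n per arm

false_positives = 0
for _ in range(trials):
    a = rng.normal(size=looks[-1])         # both arms draw from the same distribution
    b = rng.normal(size=looks[-1])
    if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
        false_positives += 1

print(f"False-positive rate with peeking: {false_positives / trials:.1%}")
# A single final look would give ~5%; five looks classically lands near 14%.
```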

Get more signal with the same traffic: Variance-reduction techniques (e.g., CUPED) use pre-experiment behavior to cut noise, letting you detect smaller lifts sooner.
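
A minimal CUPED sketch on simulated data, using each unit’s pre-experiment value of the metric as the covariate (the classic choice); the distributions and coefficients here are made up:

```python
# CUPED on simulated data: adjust the in-experiment metric using each unit's
# pre-experiment value. The distributions and coefficients are made up.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pre = rng.gamma(shape=2.0, scale=10.0, size=n)   # pre-experiment behavior
post = 0.8 * pre + rng.normal(0.0, 5.0, size=n)  # correlated in-experiment metric

theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())   # same mean, lower variance

print(f"variance before: {post.var():.1f}  after CUPED: {post_cuped.var():.1f}")
# The decision metric's mean is untouched; the variance drop is what lets
# smaller lifts reach significance with the same traffic.
```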

Tools you’ll actually use: Evan Miller’s sample size calculator for planning, and sequential procedures when you must look early.

Choosing the Right Metrics

  • North Star / output metrics reflect delivered value (e.g., revenue per visitor, activated paid accounts, retained weekly actives).
  • Input metrics (CTR, signup starts) are levers and diagnostics—not the win condition.
  • Guardrail metrics (latency, error/crash rate, churn) prevent Pyrrhic wins. Many “improvements” degrade reliability or trust.
  • Multiple simultaneous tests? Use false-discovery control (e.g., Benjamini–Hochberg FDR) so a spray-and-pray roadmap doesn’t mint fake wins.
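
A minimal sketch of the Benjamini–Hochberg step-up procedure, applied to a batch of p-values from concurrent tests; the p-values are invented for illustration:

```python
# Benjamini-Hochberg step-up procedure over a batch of experiment p-values.
# The p-values below are invented for illustration.
import numpy as np

def bh_reject(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m     # rank k gets threshold q*k/m
    passing = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passing.any():
        k = np.nonzero(passing)[0].max()         # largest rank under its threshold
        reject[order[: k + 1]] = True            # reject ranks 1..k (step-up)
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.740]
print(bh_reject(pvals))   # only the two smallest p-values survive at q = 0.05
```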

The 5-Step Experiment Design Checklist

  1. Hypothesis & MDE: Audience, behavior, metric, expected lift, timeframe.
  2. Stats path: Fixed-horizon (commit to n) or sequential/always-valid (with explicit stopping rules).
  3. Power up: Compute n for 80–90% power at 5% alpha; estimate runtime from real traffic.
  4. Guardrails & SRM: Instrument performance/trust guardrails and automatic SRM alarms.
  5. QA & isolate: Test one meaningful change; lock allocation; monitor SRM in the first hours.

What TechnicalFoundry Pods add

Rigor at speed. We bring statistical QA to your roadmap: power analysis, sequential testing when speed matters, CUPED/sensitivity boosts for lean traffic, SRM/guardrail automation, and decision memos that tie results to revenue and retention. Your creativity, minus the lies.

Ready to Scale Your Marketing Engineering?

Get dedicated engineering pods for your marketing team. No hiring headaches, no bottlenecks.

View Our Plans