Month: June 2026

  • AI Broke Your A/B Tests: 3 Proven Shifts to Rebuild a Resilient Experimentation Program


    I’ve watched a once-reliable A/B testing playbook buckle under the weight of generative AI. Traffic patterns aren’t stable, LLMs update behind the scenes, prompts evolve weekly, and personalization reshapes cohorts mid-flight. The result is non-stationary data, diluted statistical power, and “wins” that don’t replicate in production. If your experimentation program feels slower, noisier, and less trustworthy, you’re not imagining it—and you’re not alone.

    Learn why running more tests isn’t the answer to AI-driven noise, and three ways mature teams are reshaping their experimentation programs.

    First, I’ve shifted from test volume to an evaluation stack, which I call eval-driven development. Instead of defaulting to production A/B tests, we front-load learning with offline evaluations (golden sets, synthetic scenarios), automated regression tests on prompts and policies, and pre-production canaries. We size experiments against a clear minimum detectable effect (MDE), use sequential or Bayesian methods so we can monitor results continuously and stop early without inflating false positives when conditions drift, and reserve full A/B runs for hypotheses with sufficient power and operational readiness. This layered approach accelerates decisions, reduces traffic waste, and restores trust in effect sizes.
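    To make the sizing step concrete, here is a minimal sketch of a per-arm sample size calculation for a conversion-rate test, using the standard normal approximation for a two-sided two-proportion z-test. The function names (`sample_size_per_arm`, `is_feasible`) and the default alpha and power are illustrative assumptions, not part of any particular tool or of the original post.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of
    `mde_abs` over `baseline_rate` with a two-sided two-proportion z-test.

    Assumes equal allocation and a fixed-horizon test (no peeking).
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


def is_feasible(baseline_rate, mde_abs, weekly_users_per_arm, max_weeks):
    """Hypothetical backlog filter: can this test reach its required
    sample size within the time we are willing to run it?"""
    needed = sample_size_per_arm(baseline_rate, mde_abs)
    return weekly_users_per_arm * max_weeks >= needed


if __name__ == "__main__":
    # Example: 10% baseline conversion, 1 percentage point absolute MDE.
    print(sample_size_per_arm(0.10, 0.01))
    print(is_feasible(0.10, 0.01, weekly_users_per_arm=5000, max_weeks=4))
```

    Halving the MDE roughly quadruples the required sample, which is why low-traffic AI surfaces often cannot support a full A/B run and are better served by offline evals or canaries.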

    Second, I’ve re-anchored our metrics and governance for AI-era reliability. We define a driver tree that links value creation to guardrail metrics such as latency, hallucination rate, cost per request, safety incidents, and user trust proxies. Persistent holdouts and long-lived control cohorts protect against platform-wide regressions, while anomaly detection highlights model or data shifts before they corrupt reads. Strong instrumentation—behavioral analytics, consistent event semantics, and product telemetry wired into Amplitude analytics—keeps our feedback loop tight and auditable.

    Third, we rebuilt rollout mechanics to make delivery experimentation-native. Feature flags, progressive delivery, and targeted canaries let us test safely in production while gating exposure by segment, risk, or policy. Shadow mode and offline replay provide signal before real users see risk. Multi-armed bandits help with exploration when goals are clear and guardrails are enforced, but we resist over-rotating to bandits when measurement is fragile. Tightly integrating experiments into CI/CD and observability shortens the cycle from hypothesis to validated outcome.
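    As an illustration of gating exposure by percentage and segment, the sketch below shows one common way to do deterministic bucketing with a hash. It is a simplified stand-in for a real feature-flag service: the names `bucket` and `is_exposed`, the flag key, and the segment labels are all hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> float:
    """Map a (flag, user) pair to a stable value in [0, 1).

    Salting with the flag key keeps bucket assignments independent
    across flags, so the same users are not always first in line.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # First 4 bytes as an unsigned int, scaled into [0, 1).
    return int.from_bytes(digest[:4], "big") / 2 ** 32


def is_exposed(user_id, flag_key, rollout_pct, segment=None, allowed_segments=None):
    """Return True if the user should see the flagged behavior.

    `rollout_pct` is a percentage from 0 to 100. If `allowed_segments`
    is given, users outside those segments are never exposed.
    """
    if allowed_segments is not None and segment not in allowed_segments:
        return False
    return bucket(user_id, flag_key) < rollout_pct / 100.0


if __name__ == "__main__":
    # Canary: 5% of internal and beta users only.
    for uid in ("u-1", "u-2", "u-3"):
        print(uid, is_exposed(uid, "new-prompt-v2", 5,
                              segment="beta",
                              allowed_segments={"internal", "beta"}))
```

    Because the bucket value is stable, raising the rollout from 5% to 50% only adds users; nobody who already saw the treatment is flipped back, which keeps cohorts clean for later analysis.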

    In practice, here’s how I operationalize this shift. In 30 days, I audit the backlog, kill or consolidate tests that can’t meet MDE, and establish a minimal evaluation harness for prompts, policies, and safety checks. By 60 days, guardrail metrics are live with persistent holdouts and feature flags across AI surfaces. By 90 days, the team runs a balanced portfolio: offline evals for fast iteration, canaries for risk, and selective A/B testing for strategic bets—supported by continuous discovery to keep hypotheses grounded in real customer needs.

    AI didn’t eliminate the need for experimentation; it raised the bar for rigor. By moving from volume to validity, from vanity lifts to guardrailed outcomes, and from monolithic launches to progressive delivery, I’ve seen experimentation regain its edge—fewer false positives, faster cycles, and clearer signal on what truly drives impact. That’s how we turn a brittle testing culture into a resilient, learning system built for LLMs and beyond.


    Inspired by this post on Amplitude – Perspectives.


  • How I Build High-Impact Experimentation Programs with Amplitude: Proven Practices at Scale


    I build experimentation programs to drive measurable outcomes, not just dashboards. In my product leadership work, I’ve seen how the right operating model turns experimentation into a reliable growth engine—especially when paired with the analytical depth of Amplitude. My goal is to help teams move from ad-hoc tests to a disciplined system that compounds learning and impact.

    Rigor starts with clarity. I translate strategic goals into testable hypotheses using driver trees, then structure A/B testing with a defined minimum detectable effect (MDE), guardrail metrics, and pre-registered decision criteria. This reduces p-hacking, shortens debate cycles, and makes outcomes auditable. I’m equally deliberate about risk: we monitor sample ratio mismatch (SRM), use feature flags for safe rollouts, and align on outcome-based rather than output-based OKRs so we celebrate business impact, not vanity wins.
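    For readers who have not run a sample ratio mismatch check, here is a minimal sketch using a chi-square goodness-of-fit test with one degree of freedom. The function names and the 0.001 alert threshold are assumptions for illustration (a strict threshold is a common convention because SRM checks are run often), not a setting taken from Amplitude.

```python
from math import erfc, sqrt


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """P-value for the observed split against the intended allocation.

    Uses a chi-square goodness-of-fit test with 1 degree of freedom;
    for 1 df the survival function is erfc(sqrt(chi2 / 2)).
    """
    total = control_n + treatment_n
    if total == 0 or not 0 < expected_control_share < 1:
        raise ValueError("need at least one user and a share in (0, 1)")

    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    return erfc(sqrt(chi2 / 2))


def has_srm(control_n, treatment_n, expected_control_share=0.5, threshold=0.001):
    """Flag the experiment if the split is very unlikely under the design."""
    return srm_p_value(control_n, treatment_n, expected_control_share) < threshold


if __name__ == "__main__":
    print(has_srm(5000, 5050))    # small imbalance, plausible by chance
    print(has_srm(50000, 48000))  # large imbalance, investigate before reading results
```

    A flagged SRM usually points to an assignment or logging bug, so the readout should be paused until the cause is found rather than adjusted for after the fact.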

    Amplitude analytics is my backbone for behavioral analytics at every step. I instrument clean event taxonomies, build funnels and cohorts to analyze activation and retention, and centralize experiment readouts in a unified analytics platform. This lets product trios quickly see how treatments shift behavior, where friction hides, and which moments matter most for product-led growth. The result is a trusted, shared source of truth that accelerates continuous discovery.

    At enterprise scale, governance matters as much as math. I often point to lessons inspired by Peacock’s experimentation program: standard naming conventions, centralized QA, CI/CD integration, and an active community of practice. Those practices keep velocity high without sacrificing validity, and they make wins repeatable across teams and surfaces.

    Operationally, I anchor the program in clear roles (data, engineering, design, product), templates for hypotheses and readouts, and a tight feedback loop from deploy to decision. With Amplitude, solutions engineering partnerships, and disciplined experiment hygiene, teams learn faster, ship safer, and build products customers love. That’s how experimentation becomes a strategic capability—not a side project.


    Inspired by this post on Amplitude – Perspectives.

