Tag: feature flags

Feature Management as a Product Development Discipline
Feature management gives product teams a controlled way to decide how new capabilities reach users after the underlying code has been deployed. This separation can make releases more gradual, observable, and reversible.

Amplitude – Perspectives presents feature management as a contributor to innovative product development and introduces the topic through insights from guest Chris Condo, identified by the publication as a Forrester Principal Analyst. Because the available source is only a brief summary, the practical guidance below explains the established discipline without attributing unreported claims to Condo or the publication.

Feature management extends beyond feature flags

A feature flag is a technical mechanism that can turn behavior on or off without requiring a fresh deployment. Feature management is the broader operating practice around that mechanism: defining the intended audience, controlling exposure, observing results, assigning ownership, and deciding whether to expand, revise, or remove a feature.

The distinction matters because a flag alone does not create a sound product decision. Teams still need an explicit hypothesis, release criteria, relevant evidence, and a person accountable for the outcome. Without those elements, flags can become permanent switches that add complexity while providing little learning value.

Controlled exposure changes the release decision

A conventional launch can bundle several decisions into one moment: deploy the code, make it available to everyone, announce it, and accept the operational consequences. Feature management allows teams to separate those decisions. Code may be deployed while access remains limited, then exposure can expand as confidence grows.

Common approaches include enabling a capability for internal users, a defined customer segment, or a limited share of eligible traffic. The appropriate sequence depends on the feature’s risk, the quality of available signals, and the team’s ability to respond when something goes wrong. A narrow rollout is useful only when the organization is prepared to monitor it and act on what it learns.

Key takeaways for product teams
- Define the customer problem and expected outcome before configuring a rollout.
- Treat deployment, release, and promotion as related but separate decisions.
- Set expansion, pause, and rollback criteria before exposing the feature.
- Combine behavioral evidence with customer feedback and operational signals.
- Assign an owner and a removal date for every temporary flag.
A practical operating loop

Feature management works best as a repeatable decision loop rather than a collection of launch-day controls. A lightweight process can keep product, engineering, design, data, and go-to-market participants aligned:
1. Frame the decision. State what the team expects to improve and which users should benefit.
2. Choose the exposure plan. Identify eligible users, exclusions, rollout stages, and safeguards.
3. Prepare observation. Confirm that product, reliability, and support signals can reveal both value and harm.
4. Review the evidence. Decide whether to expand access, hold the rollout, change the experience, or withdraw it.
5. Close the loop. Remove obsolete flags, document the decision, and carry the learning into future product work.
This process should remain proportional to the risk. A minor interface adjustment may need little ceremony, while a change affecting permissions, billing, privacy, or a critical workflow warrants stronger controls and broader review.

The discipline has costs as well as benefits

Controlled releases can reduce exposure to problems and improve learning, but they also create operational obligations. Multiple feature states increase testing demands. Targeting rules can make customer support harder when users see different experiences. Long-lived flags can complicate the codebase, and poorly designed experiments can produce misleading signals.

Governance therefore belongs inside the practice, not around it. Teams need naming conventions, access controls, auditability, flag inventories, cleanup expectations, and clear decision rights. Product leaders should also distinguish experimentation from risk control: a rollout designed to detect failures is not automatically a valid test of customer value.

The most useful next step is modest: select one meaningful upcoming release, define its exposure and decision criteria in advance, and use the resulting evidence to refine a repeatable feature-management approach.

Inspired by this post on Amplitude – Perspectives.
July 14, 2026
A Practical Framework for Measuring New Feature Success
A feature is not successful merely because it shipped. Its value depends on whether the intended users encounter it, adopt it, gain a better experience, and produce an outcome that matters to the product or business.

The brief from Amplitude – Best Practices frames this challenge through three questions: Are people using the feature? Is it improving the user experience? Is it affecting the company’s bottom line? Because the supplied source does not provide its promised seven-step method, the framework below uses those questions as a starting point and applies established product measurement principles without attributing unsupported details to the source.

Start with the decision, not the dashboard

Before selecting metrics, the product team should identify the decision that the evidence will inform. The question might be whether to expand the rollout, improve discoverability, revise the interaction, continue investing, or reconsider the feature altogether.

This decision-first approach prevents a common measurement problem: collecting large volumes of activity data without knowing what result would change the roadmap. A useful success definition names the target user, the behavior expected to change, the intended user benefit, and the product or business outcome that benefit should support. It should also specify a reasonable evaluation window without treating an arbitrary deadline as proof of success or failure.

Build the measurement chain before release

Feature measurement works best as a connected chain rather than a single headline metric. The chain begins with eligibility: which users could reasonably benefit from the feature? It then tracks exposure, meaningful use, repeated use where appropriate, and a downstream outcome.

That distinction matters because an eligible user who never sees a feature represents a different problem from a user who sees it and declines to engage. Likewise, an initial click is not necessarily evidence that the feature delivered value. The analytics plan should define events consistently, distinguish accidental interaction from meaningful completion, and preserve enough context to compare relevant user groups.

Teams should also record a baseline when one is available. If the feature is intended to improve an existing workflow, measuring the old experience creates a reference point. Feature flags or controlled experiments can strengthen the comparison, but they do not replace a clear hypothesis or reliable instrumentation.

Read adoption, experience, and outcomes separately

Adoption shows reach and relevance

Adoption analysis asks how many eligible users discovered the feature, how many completed its meaningful action, and whether use continued when repetition is part of the value proposition. Weak adoption can indicate poor discoverability, limited relevance, unclear positioning, or friction in the first-use experience. Analytics can reveal where behavior changes, but qualitative research is often needed to explain why.

Experience measures whether use was worthwhile

Usage alone cannot establish that the experience improved. The team should examine the outcome the feature was designed to influence, such as completing a task with less friction, reaching a useful result, or avoiding an undesirable path. Relevant guardrails should also be monitored so that a gain in one area does not conceal deterioration elsewhere.

Business impact requires a credible connection

The source explicitly raises the question of bottom-line impact, but the supplied material reports no result or measurement method. In practice, a product team should state the expected causal path instead of assuming that feature use automatically creates commercial value. A business metric may sit downstream of several influences, so correlation should be treated as a signal to investigate rather than conclusive proof.

Key takeaways
- Define the product decision that measurement will support before choosing metrics.
- Separate eligibility, exposure, meaningful adoption, repeat behavior, and downstream outcomes.
- Evaluate user benefit independently from raw activity or click volume.
- Use baselines, comparison groups, and guardrails where the product context permits.
- Combine behavioral evidence with qualitative research before assigning a cause.
Turn the evidence into a product decision

The final review should distinguish among several possibilities: the feature creates value and merits expansion; the concept is useful but its discovery or execution needs work; the evidence is inconclusive; or the expected outcome is not materializing. Writing down that judgment, its supporting evidence, and the next test makes measurement part of product management rather than a post-launch reporting exercise.

A disciplined team does not wait for a dashboard to declare victory. It defines what success would change, gathers evidence suited to that decision, and uses the result to make the next investment more deliberate.

Inspired by this post on Amplitude – Best Practices.
July 14, 2026
Migrate Analytics Platforms Without Chaos: 7 Proven Lessons to Plan, Move, and Land Cleanly

I’ve led and rescued more analytics migrations than I can count, and I know the pressure: every event, dashboard, and decision pipeline depends on getting it right. Migrating analytics platforms doesn't have to be painful. Get seven lessons from Human37 and Amplitude to help your team plan, migrate, and land cleanly.

Here’s how I approach this work so teams keep momentum, regain trust in their numbers, and accelerate product-led growth on a unified analytics platform—without the rework and stakeholder fatigue that typically follow.

Lesson 1 — Start with outcomes, not events. Before moving a single event, I align leaders on the questions we must answer and the decisions we must speed up: activation, retention, and expansion. I map those goals to a simple driver tree, then back into the behavioral analytics we need. This trims noise, tightens scope, and ensures Amplitude analytics (or any destination) is instrumented for decisions, not vanity metrics.

Lesson 2 — Audit and map your data with rigor. I inventory current events, properties, IDs, and sources, then define a target schema with clear naming conventions, ownership, and versioning. Data governance and privacy-by-design are non-negotiable: we separate PII, document consent paths, and remove legacy debris. This step prevents schema drift and makes platform scalability sustainable.

Lesson 3 — De-risk the cutover with a phased plan. Rather than a big-bang switch, I dual-run critical flows, compare telemetry, and use feature flags to roll forward (and back) safely. Observability and anomaly detection are my guardrails: I monitor volume, cardinality, and event timeliness to spot regressions early—long before executives notice broken charts.

Lesson 4 — Treat instrumentation like product code. I wire schema checks into CI/CD, enforce typed analytics wrappers, and validate payloads pre-merge. With docs-as-code, the tracking plan stays current and reviewable. This keeps quality high at scale and avoids the slow death of broken funnels caused by well-meaning quick fixes.

Lesson 5 — Enable the people, not just the platform. Tools don’t create insight—teams do. I run hands-on enablement with product tours and in-app guides tailored to each role, establish communities of practice, and publish short playbooks for common questions (activation analysis, cohort retention, and journey mapping). When customer success and growth marketers can self-serve, adoption sticks.

Lesson 6 — Land cleanly with fast, visible wins. Within the first two weeks post-cutover, I showcase analyses that matter: retention analysis by use-case, friction points via session replay and heatmaps, and conversion lift by segment. These quick proofs build confidence, reinforce the value proposition, and keep stakeholders engaged through the longer tail of hardening.

Lesson 7 — Govern and evolve continuously. After go-live, I schedule schema reviews, backlog grooming, and QBRs to prune events and refine definitions. Ownership is explicit, and changes flow through the same review process as code. This keeps the unified analytics platform trustworthy as the product (and org) changes.

I’ve seen this playbook turn skepticism into momentum. In one migration I inherited mid-flight, we refocused on decisions, tightened governance, and phased the rollout; the team moved from fire drills to confident launches—and stakeholders finally believed the numbers again.

If your team is staring down a migration, anchor on outcomes, automate quality, and invest in enablement. With disciplined execution readiness and the lessons I’ve applied alongside partners like Human37 and platforms like Amplitude, you can move fast, reduce risk, and land cleanly—without the chaos.

Inspired by this post on Amplitude – Perspectives.

June 22, 2026
How Agentic Analytics Reshapes Product Development Roadmaps
Agentic, analytics-driven product development changes the role of product data. Instead of waiting for teams to interpret dashboards and debate a backlog, an agent can help detect behavioral friction, estimate opportunities, propose interventions, and monitor whether a release improves the intended outcome.

The practical payoff is not an automatically generated roadmap. It is a tighter decision system in which evidence, experiments, delivery controls, and human judgment reinforce one another. The two source articles approach that system from complementary angles: one describes the operating loop around Amplitude Wave, while the other emphasizes the engineering and organizational foundations required to make agentic recommendations dependable.

The product agent is a decision loop, not a smarter dashboard

Traditional analytics tools help teams inspect funnels, cohorts, journeys, activation, and retention. The article about Amplitude Wave describes a more proactive model: an agent continuously scans behavioral data for friction, proposes a next-best improvement, supports validation through A/B testing, and uses feature flags to control rollout. After launch, the loop continues by monitoring activation, retention, and downstream revenue rather than treating deployment as the finish line.

The companion article makes a similar distinction between reporting and agency. It presents agentic systems as capable of proposing, testing, and learning, provided that recommendations remain connected to rigorous behavioral analytics. Synthesized together, the sources describe four linked functions: observation identifies where behavior diverges from an intended journey; prioritization weighs the size, risk, and confidence of an opportunity; experimentation tests whether a proposed change causes improvement; and monitoring determines whether to expand, revise, or retire that change.

This framing matters because an agent that only generates feature ideas adds another opinion to roadmap planning. An agent that connects ideas to observed behavior, controlled tests, and post-release measurement can instead reduce the distance between a weak signal and a defensible product decision.

Reliable recommendations depend on an analytics and evaluation stack

Both sources put instrumentation ahead of automation. The Wave article calls for clearly defined events, models that connect those events to user and account journeys, explicit success metrics, and governance around data quality and privacy. Without that foundation, an agent can produce confident explanations from incomplete or misleading evidence.

The second article extends the foundation into three technical capabilities. It advocates a unified analytics platform that brings quantitative behavior together with qualitative context, evaluation harnesses that test prompts, policies, and models for regressions, and a retrieval-first pipeline that grounds an agent in trusted organizational information. These layers address different failure modes: analytics establishes what users did, retrieval supplies relevant business context, and evaluations test whether the agent behaves reliably as its components change.

Interoperability broadens the evidence available to the system. The Wave article points to CRM integration, session replay, and support systems as useful connections for relating product behavior to customer value and go-to-market effects. CI/CD, experimentation tools, and feature flags then connect analysis to controlled delivery. The resulting architecture is less a standalone AI feature than a chain of evidence and controls spanning discovery, development, release, and measurement.

That chain also establishes a sensible boundary for automation. Behavioral correlations may justify investigation, but they do not by themselves establish causality. A/B testing can provide stronger causal evidence when it is appropriate and well designed; qualitative context can explain why a pattern may be occurring; and human review can catch strategic, ethical, or operational considerations that product telemetry does not represent.

Roadmaps become portfolios of measurable opportunities

When agents can surface evidence-backed opportunities, roadmap discussions can move away from ranking requested features in isolation. The unit of planning becomes an outcome-linked opportunity: a behavioral problem, the users or accounts affected, the metric expected to move, the evidence supporting the hypothesis, and the safest way to test it.

This does not eliminate product strategy. It makes strategy more explicit. Teams still decide which customers and outcomes matter, what constraints apply, and which trade-offs are acceptable. The agent can help maintain a current view of behavioral evidence and shorten the analysis cycle, but it cannot derive organizational priorities from telemetry alone.

The sources also connect this operating model to empowered product teams, product trios, continuous discovery, and outcomes-versus-output OKRs. In that environment, an agent is best treated as a participant in the discovery and delivery workflow: it can surface anomalies, assemble relevant context, suggest hypotheses, and track results, while the team remains accountable for framing the problem and authorizing consequential decisions.

The Wave article illustrates the intended scale of intervention with an onboarding example. It reports that an agent identified drop-off around a confusing configuration step; targeted in-app guidance and tooltips were then released behind feature flags, followed by a material improvement in activation with limited engineering effort. The report is a useful illustration of the loop, but it provides no numerical effect size or independent validation. It therefore supports the workflow concept more strongly than any general claim about expected results.

Governance determines how much autonomy an agent earns

Automation should expand according to demonstrated reliability and the reversibility of the action. Early implementations can begin in an advisory role, identifying friction and preparing evidence for a team to review. A later stage can allow the agent to configure draft experiments or recommend feature-flag settings. Direct changes to production warrant a higher threshold because errors can affect customers, revenue, privacy, and trust.

The Wave article explicitly calls for policies governing data use, review thresholds for automated changes, privacy-by-design, and human checkpoints for high-impact decisions. The engineering-focused article complements those controls with eval-driven development, including tests intended to detect reliability and safety regressions across prompts, policies, and models. Together, these ideas suggest that autonomy should be earned through observable performance rather than granted because an agent appears persuasive.

A practical adoption sequence follows from the synthesis. First, define the outcome and the decisions the agent may inform. Next, verify event quality and journey models before asking the system to prioritize opportunities. Then connect recommendations to a controlled experimentation and release process. Finally, evaluate both product impact and agent behavior, expanding permissions only when the evidence supports it. This sequence keeps the initial scope narrow while creating a path toward a more capable product-development system.

Key takeaways
- An agentic product workflow should connect behavioral observation, opportunity prioritization, experimentation, controlled delivery, and post-release measurement.
- High-quality event data is necessary but insufficient; grounded retrieval, qualitative context, and evaluation harnesses make recommendations more dependable.
- Roadmaps become more evidence-driven when teams plan around measurable opportunities rather than treating feature requests as predetermined commitments.
- Human judgment remains essential for strategy, causal interpretation, risk assessment, and high-impact release decisions.
- Agent autonomy should increase only as evaluations, governance controls, and observed performance justify broader permissions.
The near-term opportunity is to build a disciplined learning loop before pursuing full autonomy. Organizations that make their data trustworthy, their outcomes explicit, and their release controls measurable will be better positioned to let product agents take on more consequential work without weakening accountability.

References
- Shivam.Consulting Blog — Inside Amplitude Wave: The Proactive AI Product Agent That Reveals What to Build Next
- Shivam.Consulting Blog — Why Agentic, Data-Driven Product Development Excites Me—and How It Redefines Roadmaps
June 10, 2026

How to Build a Resilient Experimentation Program at Scale

Your teams are running more experiments, but decisions are not getting easier. Results arrive late, apparent wins fail to repeat, and every readout starts a new argument about the data.

The fix is not another testing tool or a higher experiment count. You need an operating system that protects validity when traffic, products, models, and customer behavior change underneath you. That system starts before exposure, routes each question to the right evaluation method, and ends with a decision your team can execute.

Give every experiment a decision contract

An experiment should begin with a decision, not a feature. Ask what you will do if the result is positive, negative, inconclusive, or unsafe. If the answer is the same in every case, the test is not worth running.

Turn the proposed test into a short decision contract before engineering begins. Record:

The customer problem: the friction or unmet need you observed.
The causal hypothesis: the product change, the behavior it should alter, and why.
The eligible population: who can enter the experiment and who must be excluded.
The primary outcome: the one metric that determines whether the hypothesis worked.
The guardrails: the measures that can block a rollout even when the primary outcome improves.
The decision thresholds: the minimum effect worth acting on and the conditions for shipping, iterating, stopping, or rolling back.

A driver tree helps you connect the primary metric to the business outcome without pretending that one experiment can prove the entire chain. If the goal is retention, for example, the immediate experiment may be designed to change activation behavior. The contract should distinguish that leading behavior from the longer-term outcome.

Set the minimum detectable effect and guardrails before reading results. The minimum detectable effect is not the smallest movement your analytics can display. It is the smallest improvement that would justify the cost, risk, and complexity of the change. If your available population cannot reliably detect that effect, narrow the question, combine low-traffic variants, choose a more sensitive proximal metric, or do not run the test.

Pre-committing to the metric, stopping rule, exclusions, and decision criteria also limits convenient reinterpretation. Teams can still investigate unexpected patterns, but those findings should become new hypotheses rather than retroactive proof that the original bet won.

Match the question to the cheapest reliable evidence

Production A/B testing is only one layer of experimentation. It is often the slowest and most expensive layer because it consumes customer attention, operational capacity, and statistical power. Use it when real behavior is necessary to resolve a meaningful decision.

Evidence layer	Best question	Move forward when
Offline evaluation	Does the output meet a defined quality, policy, or safety standard?	The candidate passes the agreed evaluation set and regression checks.
Replay or shadow mode	How would the change behave on realistic inputs without affecting users?	Failure patterns, cost, and latency remain inside the operating limits.
Targeted canary	Is the change safe and observable under live conditions?	Telemetry is healthy and no guardrail triggers a rollback.
Controlled A/B test	Does the change cause a valuable shift in user behavior?	The result meets the pre-registered decision criteria.
Progressive rollout	Does the effect and reliability persist as exposure expands?	Segment-level outcomes and operational measures remain acceptable.

This layered model becomes essential for AI products. Prompts, retrieval logic, policies, model versions, and traffic composition can all change the experience. A single production metric cannot tell you whether a decline came from product value, output quality, latency, cost, safety, or an upstream model shift.

Build an evaluation stack for prompts, policies, regressions, canaries, and selective A/B tests. A candidate should earn broader exposure by passing the cheaper layers first. This reduces traffic waste and gives the team diagnostic evidence when a live result moves unexpectedly.

Do not use a multi-armed bandit simply because it can direct more traffic toward a leading variant. Bandits are useful when the objective is clear, feedback is timely, and guardrails are dependable. They are a poor substitute for stable measurement or causal understanding. If you need to estimate an effect, learn about segments, or detect delayed harm, retain a controlled comparison.

Engineer trustworthy measurement and reversible delivery

An experimentation program is only as resilient as its event pipeline. A mathematically correct analysis built on shifting event definitions is still wrong. Treat instrumentation as a product interface with owners, documentation, versioning, tests, and observability.

Before exposure begins, verify that assignment, exposure, outcome, and guardrail events share consistent identities and timestamps. Confirm that users enter only the experiments for which they are eligible. Check that retries, duplicate events, delayed ingestion, and cross-device behavior cannot silently change the denominator.

Naming conventions, schema versioning, lineage, anomaly detection, and pipeline observability are not analytics housekeeping. They let teams move without sacrificing the meaning of their measurements. Assign an owner to every critical event and make schema changes visible to the teams whose experiments depend on them.

During the run, monitor data quality separately from product performance. Sample ratio mismatch, assignment failures, missing exposure events, sharp volume changes, and implausible segment movements should pause interpretation. Do not explain these signals away because the headline result looks attractive.

Delivery must be reversible as well as measurable. Put material treatments behind feature flags. Start with a targeted canary, watch operational and customer guardrails, and expand exposure in stages. Define who can stop the rollout and make sure that person has both the telemetry and access required to act.

For broad platform or AI changes, maintain a persistent holdout when feasible. A long-lived control gives you a reference point for cumulative effects that short experiments miss, including changes in retention, trust, support burden, and cost. Protect the holdout from accidental contamination and document every change that affects its interpretation.

Scale the program around decisions, not test volume

A central experimentation team cannot design and analyze every test at scale. Product teams need autonomy inside a governed system. Centralize the parts where inconsistency creates shared risk: assignment services, metric definitions, event standards, quality checks, templates, and audit records. Let teams own hypotheses, customer context, treatment design, and decisions inside those guardrails.

Use a lightweight review based on risk. A reversible interface change with a proven metric can follow a standard path. A pricing change, safety policy, ranking system, or shared AI capability deserves stronger review, tighter exposure controls, and a clearer rollback plan. Governance should become more demanding as the blast radius grows.

Maintain a portfolio view rather than a leaderboard of teams by test count. For each active experiment, track the decision it supports, expected value, detectable effect, traffic requirement, risk class, owner, and current evidence layer. This reveals when several teams are competing for the same population, when a strategic question is underpowered, and when multiple small tests should become one coherent learning plan.

Reset a brittle program over 90 days

You can make the operating model concrete without attempting a platform-wide rebuild:

By day 30: audit the backlog and current tests. Stop or consolidate experiments that cannot meet their minimum detectable effect. Identify unreliable events, missing owners, conflicting metric definitions, and launches without explicit decision criteria. For AI surfaces, establish a minimal offline evaluation harness for prompts, policies, quality, and safety.
By day 60: publish standard hypothesis and readout templates. Put high-risk changes behind feature flags, make guardrails visible, and introduce canary exposure. Establish persistent holdouts where broad or cumulative effects matter. Add alerts for instrumentation drift and operational regressions.
By day 90: manage a balanced portfolio across offline evaluations, replay or shadow tests, canaries, controlled experiments, and progressive rollouts. Review program health through decision speed, valid learning, repeatability, and detected harm rather than the number of tests launched.

Create a community of practice alongside these controls. Regularly examine inconclusive results, failed replications, instrumentation incidents, and stopped rollouts. These cases expose weaknesses in the system more reliably than a gallery of wins. The goal is not to eliminate failure. It is to make failure informative, contained, and cheap.

Key takeaways

Start with the decision the experiment must support, then pre-register the hypothesis, primary metric, guardrails, detectable effect, and stopping rule.
Use offline evaluations, replay, shadow mode, and canaries to eliminate weak or unsafe candidates before consuming production traffic.
Treat event semantics, assignment, exposure, lineage, and anomaly detection as production infrastructure.
Pair controlled measurement with feature flags, progressive exposure, explicit rollback authority, and persistent holdouts where cumulative effects matter.
Judge the program by trustworthy decisions and reusable learning, not experiment volume or the percentage of positive results.

Choose one upcoming decision with meaningful customer or operational risk. Write its decision contract, identify the cheapest evidence layer that could disprove it, and verify the rollback path before anyone builds the treatment. That single discipline is a practical starting point for a program that can keep learning as your product and organization change.

References

June 1, 2026

An AI Operating Model That Measures Outcomes, Not Activity

Your AI team is shipping, dashboards are filling up, and executives are still asking the uncomfortable question: what changed for the customer or the business?

The answer is rarely another model metric. You need an operating model that connects AI quality to customer behavior, workflow performance, commercial results, and risk. When that chain is visible, you can decide what to scale, what to repair, and what to stop.

Key takeaways

Give every AI initiative an outcome contract that names the target behavior, business result, guardrails, and decision owner.
Measure four linked layers: AI quality, user behavior, workflow results, and business outcomes.
Preserve the context behind each interaction so you can compare outcomes by customer, workflow, model version, and acquisition path.
Run one recurring evidence review where teams make explicit scale, fix, hold, or stop decisions.
Use the first 90 days to prove a reusable learning system, not merely a functioning AI experience.

Start each initiative with an outcome contract

A feature brief tells a team what to build. An outcome contract tells it why the work exists, how evidence will be interpreted, and who can act on that evidence. It is the smallest practical unit of an outcome-led AI portfolio.

Write the contract before choosing a model or polishing a prompt. Keep it to one page and require six fields:

Target workflow: Name the repeated job being changed, such as resolving a support request or preparing a sales follow-up.
Target user behavior: Describe what a person should do differently. Adoption alone is weak; successful completion, accepted recommendations, or reduced rework is stronger.
Business outcome: Connect the behavior to retention, expansion, qualified demand, service capacity, or another commercial result.
Quality floor: Define the task-level evaluation the AI must pass before exposure expands.
Guardrails: Name the safety, privacy, latency, reliability, and cost conditions that must remain acceptable.
Decision rule: State what evidence will trigger a scale, fix, hold, or stop decision, and name the person accountable for making it.

A driver tree makes the logic inspectable. Start with the business result, work backward to the customer behavior that can influence it, then identify the product and AI capabilities that can change that behavior. This prevents a model improvement from being mistaken for business progress.

The contract also gives empowered teams useful boundaries. Leaders align the portfolio around outcomes and constraints; teams retain room to change prompts, retrieval methods, interaction design, or even the proposed solution. That is the practical connection between AI strategy, continuous discovery, evaluation, delivery, and value capture.

Build one scorecard across four layers

AI outcome analytics is not a single north-star metric. It is a chain of evidence. If you measure only the beginning of the chain, you learn whether the system produced an answer. If you measure only the end, you may see revenue move without knowing why.

Measurement layer	Question it answers	Useful examples	Typical decision
AI quality	Did the system perform the intended task?	Task success, groundedness, safety failures, response variance	Change prompts, context, retrieval, model, or fallback
User behavior	Did a person trust and use the result?	Acceptance, correction, abandonment, repeat use, human escalation	Change the interaction, explanation, or moment of assistance
Workflow outcome	Did the job become meaningfully better?	Successful completion, rework, cycle time, resolution quality	Expand, narrow, or redesign the workflow
Business and risk outcome	Did the change create durable value within constraints?	Retention, expansion, qualified leads, cost per successful outcome, incidents	Scale, repackage, hold, or stop

Read the layers from left to right. Good AI quality with weak behavior usually points to product design, trust, or workflow placement. Strong usage with no workflow improvement may indicate novelty rather than value. Workflow gains with poor economics mean the experience works but the architecture or packaging does not.

Use the workflow attempt as the basic unit of analysis whenever possible. A generic session can contain several unrelated intentions. A workflow attempt lets you connect the user request, retrieved context, model and prompt version, response, correction, completion, and downstream result.

Persist the properties needed to reconstruct that journey. Customer segment, acquisition context, workflow type, entitlement, experiment group, model version, retrieval version, and human-handoff status often matter more than another page-view event. Carrying critical context across visits lets you trace behavior from early exploration to conversion and expansion instead of losing the causal story at signup.

Keep the event taxonomy small enough to govern. Instrument decisions and state changes, not every interface movement. For each event, document its owner, trigger, required properties, prohibited sensitive data, and validation method. A dashboard built on ambiguous events creates confidence without clarity.

Run a weekly loop from evidence to decision

Analytics creates value only when it changes a decision. Give each AI initiative a recurring evidence review attended by the product trio and the engineering, data, risk, operations, or go-to-market partners needed for that workflow.

Check the contract. Reconfirm the target workflow, primary outcome, evaluation floor, and guardrails. If the goal has changed, update the contract before interpreting the data.
Inspect the scorecard. Review AI quality, behavior, workflow, business, risk, and cost in that order. Look for breaks in the chain rather than averaging them into one health score.
Segment the result. Compare the cohorts that could conceal a failure: new and experienced users, customer tiers, workflow types, channels, experiment groups, and system versions.
Review failure cases. Sample unsuccessful attempts and classify the reason: missing context, poor retrieval, incorrect generation, confusing interaction, policy restriction, latency, or a problem outside the AI system.
Make one portfolio decision. Choose scale, fix, hold, or stop. Record the evidence, owner, next test, and condition for revisiting the decision.

Do not let offline evaluations and online analytics compete. Offline evaluations test whether a candidate change can handle representative tasks and known edge cases. Online measures show whether the released experience changes real behavior under real conditions. A candidate should clear the evaluation floor before broader exposure, then earn expansion through customer and business evidence.

When you run an experiment, agree on the hypothesis, primary outcome, guardrails, minimum detectable effect, and stopping rule before looking at results. Feature flags and progressive rollout keep the decision reversible. If the result is ambiguous, improve the test or narrow the population; do not promote the most flattering proxy.

This rhythm makes learning rate operational. The useful question is not how many experiments ran. It is how many consequential uncertainties were resolved and converted into a product, portfolio, or go-to-market decision. Testable decisions, behavioral analytics, and guarded rollouts make speed credible because the evidence can survive scrutiny.

Assign decision rights before the dashboard turns red

AI products cross boundaries that ordinary feature teams can often ignore. Product owns the customer and business outcome. Engineering owns service behavior and remediation. Data or AI teams own evaluation integrity and model observability. Risk, security, legal, and operations own constraints that cannot be traded away informally.

Write those responsibilities into the operating model. For each risk tier, specify who can approve an initial release, expand exposure, pause the system, change a model or retrieval source, accept a temporary exception, and communicate an incident. A named decision owner is more useful than a large committee with shared accountability.

Governance should begin during discovery. The team can then choose acceptable data, design consent, build traceability, create fallbacks, and define escalation paths before those choices become expensive. Model cards, data records, evaluation results, release history, and incident decisions should form one audit trail rather than separate compliance paperwork.

The same principle applies to commercial decisions. Product, finance, sales, and customer success need a shared definition of value. Measure inference and support costs against successful workflow outcomes, not raw requests or tokens. Packaging can then reflect delivered value while protecting margins and avoiding incentives for wasteful usage.

Use simple decision tests:

Scale when the primary outcome improves, quality and safety floors hold, economics remain acceptable, and the result repeats in the intended cohorts.
Fix when the chain reveals a local weakness, such as adequate AI quality but low acceptance, or strong adoption but excessive rework.
Hold when the evidence is inconclusive, the measurement is unreliable, or a guardrail is close enough to its limit that broader exposure would create avoidable risk.
Stop when only proxy metrics improve, the target workflow does not change, or the value depends on manual intervention that cannot be sustained.

Use 90 days to prove the operating system

Your first 90 days should produce more than a working use case. They should leave behind a repeatable contract, event model, evaluation set, rollout path, governance record, and decision cadence that the next team can reuse.

Weeks 1–2: choose the workflow. Audit available content and data, map the highest-value repeatable workflows, and select one where behavior and business impact can be observed. Write the outcome contract and assign decision rights.
Weeks 3–4: define the evidence. Build the driver tree, establish the four-layer scorecard, create representative offline evaluations, classify risk, and document the release and stop conditions.
Weeks 5–8: build and instrument. Create the retrieval and prompt baseline, capture lineage and version context, validate events, implement observability, and test graceful fallbacks. Rehearse how the team will diagnose a failed attempt.
Weeks 9–12: release and learn. Ship behind a feature flag, begin with limited exposure, compare behavior and outcomes, inspect failure cohorts, and make explicit scale, fix, hold, or stop decisions.

At the end, ask for three forms of proof. Can the team explain which customer behavior changed? Can it connect that behavior to a workflow and business result without hand-waving? Can another team reuse the operating artifacts without rebuilding them from scratch?

If any answer is no, keep the rollout narrow and repair the system of learning. If all three are yes, fund the next workflow using the same operating model. The goal is not a larger collection of AI features. It is an organization that can turn uncertain AI capabilities into measurable outcomes, repeatedly and responsibly.

References

May 28, 2026

Analytics-Led Growth Engineering: A Practical Operating Model
Your team has dashboards, event data, and a backlog of growth ideas. Yet decisions still come down to whoever has the strongest opinion, and experiment results rarely change the roadmap.

The missing piece is usually not another analytics tool. It is an operating model that connects user behavior to a decision, a controlled release, and a measurable business result. Here is how to build one.

Start with a growth constraint, not a dashboard

Analytics-led growth begins with a constraint you want to remove. A broad instruction such as improve onboarding gives your team too much room to produce activity without progress. Frame the problem as a break in the user journey instead: qualified users reach the setup screen but fail to complete the action associated with first value.

Connect that problem to your North Star metric through a driver tree. If the North Star depends on retained active accounts, its drivers might include the number of activated accounts, how frequently they return, and how deeply they use the product. Each driver can then be decomposed into observable behaviors.

This prevents a common mistake: optimizing the easiest metric to move rather than the metric that matters. More tooltip clicks are not useful if they do not increase successful setup. Higher setup completion is still questionable if those users never return.

Before opening your analytics platform, write down four things: the user segment, the behavior that is breaking, the outcome it should influence, and the decision you will make if the signal changes. If you cannot name the decision, you are probably requesting a report rather than investigating a growth opportunity.

Build an evidence chain you can trust

A growth team needs to trace the path from exposure to durable value. That requires more than counting page views. Instrument the events that represent intent, progress, successful value delivery, and return behavior.

For every important event, define who triggered it, what object it affected, where it occurred, and whether it represents an attempt or a successful outcome. A generic event such as integration clicked cannot tell you whether the connection worked. Separate the attempt, completion, failure, and first successful use.

Then inspect the journey through three complementary views. Funnel analysis shows where users stop progressing. Cohorts reveal whether the problem is concentrated among particular acquisition channels, plans, roles, or use cases. Retention analysis tests whether an apparent activation gain survives after the initial session.

Behavior alone will not explain motivation. Pair the quantitative signal with customer interviews, support conversations, or session-level evidence. If a funnel shows that users abandon a configuration step, qualitative evidence can distinguish confusing language from missing permissions, weak intent, or a technical failure.

Treat instrumentation defects as product defects. An event that fires twice, changes meaning, or omits a critical property can send engineering effort toward the wrong problem. Assign an owner to each decision-critical event and verify it across the full journey before using it to approve a rollout. Reliable behavioral analytics, cohorting, and funnel analysis are the foundation of this operating model, not a reporting layer added after release.

Turn every growth idea into an experiment contract

An experiment should begin with a falsifiable claim. Use this structure: for a defined user segment, changing a specific part of the experience should change a target behavior because it removes an identified barrier.

Complete the contract before implementation. Name the primary success metric, the guardrails that must not deteriorate, the expected direction of change, and the minimum detectable effect. The MDE forces a useful product decision: what is the smallest improvement that would justify shipping and maintaining this change?

Power considerations belong in planning, not in the explanation written after results arrive. If the eligible audience cannot produce a credible read on the effect that matters, change the experiment. You can target a higher-signal segment, test a stronger intervention, choose a more responsive leading indicator, or treat the release as a qualitative learning exercise rather than claiming a statistical win.

Pre-commit to the decision rules as well. A positive primary metric with damaged guardrails should not become an automatic launch. A neutral result can still eliminate a weak theory. A surprising segment difference should become a new hypothesis, not an invitation to search repeatedly for a favorable slice of the data.

This discipline changes backlog quality. Ideas compete on the strength of their evidence, the importance of the driver they address, and the clarity of the learning they can produce. The roadmap becomes a portfolio of testable growth mechanisms rather than a list of requested features.

Use staged releases to separate learning from risk

Feature flags let you control exposure without tying every decision to a new deployment. Start with internal validation, expose the change to an eligible cohort, watch technical and user guardrails, and widen access only when the evidence supports it.

Keep three decisions distinct. The first is whether the change works as designed. The second is whether it improves the intended user behavior. The third is whether that behavior produces a lasting outcome. Passing the first decision does not answer the other two.

Onboarding illustrates the difference. A clearer tooltip may increase interaction with a setup control. An in-app guide may increase completion of the setup flow. Neither result proves that users reached value or formed a durable habit. Follow the exposed cohort through the activation event and into retention before declaring the intervention successful.

Small, reversible changes are especially useful here. Progressive disclosure, revised UX writing, a better default, or guidance at a predictable stall point can isolate a mechanism more clearly than a full onboarding redesign. When several elements change together, you may see movement without learning what caused it.

Make the product trio accountable for learning

Growth engineering is not an analytics team handing insights to a delivery team. Product, engineering, and design should jointly own the hypothesis, the intervention, the instrumentation, and the interpretation.

Product connects the opportunity to the growth model and defines the decision. Design identifies the user friction and shapes the smallest credible intervention. Engineering validates event behavior, controls exposure, and protects reliability. All three inspect the outcome together.

Close each experiment with a short decision record. Capture what you believed, what changed, which users were exposed, what happened to the primary metric and guardrails, what you decided, and which assumption changed. Record neutral and negative results as carefully as wins. Otherwise, old ideas return with new wording and consume another cycle.

Leaders should review the quality of this learning system, not just the number of tests shipped. Notice whether teams are testing consequential hypotheses, whether events remain trustworthy, whether results lead to explicit decisions, and whether short-term activation gains are being checked against retention. Experiment volume without decision quality is another output metric.

Key takeaways
- Define the broken user behavior and the decision it affects before opening a dashboard.
- Connect activation, depth, and frequency to your North Star through a driver tree.
- Specify the hypothesis, primary metric, guardrails, MDE, and decision rules before implementation.
- Use feature flags and staged exposure to manage risk while preserving a valid learning loop.
- Validate leading indicators against retention, and store every result in a reusable decision record.
Choose one important journey this week and trace it from first intent to retained value. If the events, ownership, or decision rules break anywhere along that path, fix that link before adding another growth experiment. Compounding growth begins with compounding clarity.

References
- Amplitude — Inside Growth Engineering at Amplitude: My Playbook to Accelerate Product-Led Growth with Analytics
May 27, 2026
Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.

Learn how to track your web vitals and page rankings faster with Amplitude AI Agents and improve your site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.

Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.

My practical playbook is straightforward: 1) Establish a performance baseline for Core Web Vitals on key templates and critical user paths. 2) Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction. 3) Connect those vitals to downstream behaviors—scroll depth, engagement, and conversion—so we prioritize fixes that move business outcomes, not just lab scores. 4) Use feature flags and A/B testing to ship improvements safely and quantify uplift. 5) Close the loop with Agent Analytics to keep learnings visible and actionable.

Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading UX for speed—or vice versa.

The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.

If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.

Inspired by this post on Amplitude – Best Practices.

May 20, 2026
Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments — the Black Friday sales, the sporting events, the production incidents — and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business. Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform. We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer. While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice. We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product. “That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.” This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition — they reserve our ambition for the nuanced scaling challenges that matter. The source of truth is the hard part. You can scale stateless web traffic by adding machines, add queue consumers, and add cache. Those are real problems — just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears. That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day. Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it. Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows. Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform — from vector search in our AI features to realtime reporting — and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work. Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster. More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident. We’ve used this pattern for years to make large search migrations safer and more incremental — a core playbook in our platform scalability and SRE practices. Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants. We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks. The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed. Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. AI providers are not commodity storage systems, and we do not design as if they are. That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point. For enterprise customers, this matters because AI support volume can spike just like human support volume — and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform. Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce. Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect. “That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.” Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb. We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer — a habit that anchors our incident management, observability, and FinOps practices. Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents. “That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.” Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully? This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric — it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability. “Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.” The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time. Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime — and we do this in practice, not just as a goal. This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience. “Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.” Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers. What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale — it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer. Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system — and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

Inspired by this post on The Intercom Blog.

May 19, 2026
I Pointed a “Ralph Wiggum” AI Loop at My Product for a Week—The Data That Stopped Chaos

I spent a week pointing a "Ralph Wiggum loop" at my product to see how far an agentic AI could take pragmatic, everyday improvements without human micromanagement. It was equal parts exhilarating and nerve-wracking. The short version: the loop moved fast and broke assumptions, but Amplitude analytics kept it from going off the rails—and turned chaos into controlled acceleration.

By "Ralph Wiggum loop," I mean a deliberately naive, endlessly curious cycle: try something small, ship it behind a flag, watch the data, then try again. It is the product equivalent of a fearless intern who experiments constantly. That energy is invaluable for discovery, but it absolutely demands strong guardrails and a clear definition of success.

Before I started, I framed the outcomes I cared about: user activation within the first session, reduction in time-to-value, and early retention indicators. I set baselines and a minimum detectable effect (MDE) for A/B testing so the loop could distinguish noise from signal. I also documented a driver tree of behaviors we wanted to influence and ensured every event was cleanly instrumented in Amplitude analytics to support reliable behavioral analytics.

The guardrails mattered most. I put every change behind feature flags with instant rollback. I defined "off the rails" conditions upfront, including regression thresholds for activation and retention analysis, and enabled anomaly detection to surface unexpected spikes or drops. Session replay was ready to diagnose confusion fast, and I kept a daily evaluation cadence so the loop never ran unattended for long.

Day by day, the loop proposed micro-experiments: onboarding copy variants, tooltip timing, in-app guide sequencing, and subtle changes to progressive disclosure. Each iteration shipped behind a flag to a small cohort. I watched leading indicators in real time, then zoomed out to cohort views to guard against short-term gains that might erode longer-term value. When something looked promising, we expanded exposure methodically; when something looked risky, we paused immediately.

We had a pivotal moment where the loop suggested a bolder call-to-action that spiked activation. On the surface, it looked like a win. Amplitude cohorts told a fuller story: downstream engagement softened, and anomaly detection flagged a pattern that hinted at premature conversion rather than genuine intent. A quick rollback through feature flags saved the week—and reminded me why eval-driven development should be the default for agentic AI workflows.

The most surprising part was how quickly the loop unlocked small compounding gains once the measurement scaffolding was in place. With a unified analytics platform and crisp guardrails, the system became a safe sandbox where the AI could explore aggressively while we stayed anchored to outcomes. The combination of behavioral analytics, A/B testing discipline, and daily human review turned raw speed into durable learning.

My takeaways are direct. Agentic AI can accelerate discovery, but only if you define stop conditions and wire strict feedback loops into your stack. Measurement is product strategy here—without it, you get noisy activity instead of progress. Invest in instrumentation first, treat feature flags as non-negotiable, and let anomaly detection and session replay be your early warning system. Most of all, tie every experiment to activation, engagement, or retention, not vanity metrics.

If you’re considering your own week with a "Ralph Wiggum loop," start painfully small, constrain the blast radius, and insist on decision-quality data. Do that, and you’ll turn a chaotic agent into a compounding engine for product discovery—one that moves fast, learns faster, and stays on track.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of of agentic and data driven product development." That vision resonates deeply with how I lead teams—anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators—activation, engagement depth, and retention—then instrument product telemetry with Amplitude analytics and behavioral analytics to surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action—grounded in a retrieval-first pipeline and governed by eval-driven development. For product managers, this demands fluency with LLMs for product managers and practical prompt engineering, plus rigorous AI Strategy around data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

Cross-functional cadence is everything. I partner closely with Principal AI Engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and outcomes vs output OKRs to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
How to Run AI-Assisted Feature Launches That Drive Growth
You are days from releasing a feature. Engineering needs a rollout decision. Go-to-market teams need a clear promise. Support needs to know what could go wrong. Leadership wants to know whether the release changed customer behavior. Dropping an AI bot into the launch channel will not resolve those tensions. If the metrics, authority, and escalation rules are vague, the bot will only answer ambiguity faster.

The useful model is a closed loop: define the behavior you want to change, instrument exposure and value, operate the rollout from one shared channel, let agents handle repeatable retrieval and synthesis, and reserve consequential decisions for accountable people. Done well, AI reduces the coordination tax around a launch while making the growth decision more disciplined.

Define the growth decision before you automate the launch

A feature being available is an output. A customer reaching value is an outcome. Your launch plan has to connect the two before anyone writes an agent prompt or schedules a readout.

A durable growth plan translates the product North Star into activation and retention signals, then defines the minimum detectable effect before experimentation. The North Star provides direction, but it is often too distant to diagnose a new feature. A launch needs an earlier behavioral signal that can tell you whether eligible users encountered the feature, understood it, and reached its intended value.

Write a short launch contract with these fields:
1. Target user and moment: Name the user or account segment, the situation that makes the feature relevant, and any eligibility rules. A feature intended for a new administrator solving an initial setup problem should not be evaluated across every user in the product.
2. Behavioral hypothesis: State the current behavior, the desired behavior, and why the feature should cause the change. If the causal link cannot be written plainly, the team is not ready to interpret the launch data.
3. Measurement chain: Instrument eligibility, actual exposure, meaningful engagement, the activation action, and the downstream value event. If you record engagement but not exposure, low adoption could mean either that users ignored the feature or that they never saw it.
4. Primary signal: Choose the behavior closest to customer value that can mature within the launch window. Do not promote every available metric to equal status. That turns a decision into a search for whichever chart looks most favorable.
5. Guardrails: Name the operational and customer signals that can stop a rollout, such as degraded performance, errors, support burden, privacy concerns, or a harmful shift in another important behavior. Define the actual acceptable bounds in your internal contract before launch; do not negotiate them after a concerning result appears.
6. Minimum detectable effect: Decide what change would be large enough to matter to the product and business. This keeps the team from celebrating meaningless movement or waiting indefinitely for certainty that the planned test cannot provide.
7. Decision rule and authority: Specify what evidence permits a ramp, what requires a hold, what triggers investigation, and who can pause or roll back the feature. An agent may assemble the evidence, but it should not invent the rule during the incident.
The contract should also distinguish a growth signal from a health signal. Activation, conversion, or repeated use may tell you whether the feature is producing value. Latency, error rates, complaints, and anomalous segment behavior tell you whether it is safe to continue. A healthy system with an immature growth signal may justify holding the rollout. Broken instrumentation or a material guardrail breach calls for a different response.

This distinction prevents a common category error: treating an inconclusive experiment as a failed feature, or treating early adoption as proof of durable value. The launch decision should always answer the same question: given trustworthy exposure data, the primary signal, and the guardrails, should you ramp, hold, investigate, or roll back?

Turn the launch channel into a decision system

A launch channel becomes useful when it preserves context and decisions, not merely conversation. A practical setup is one channel named #launch-[feature], with its scope, service expectations, success metrics, dashboards, and rollout plan pinned. Product, engineering, data, support, and go-to-market stakeholders can then work from the same operational record.

Set up the channel before rollout begins:
1. Pin the launch contract: Include the hypothesis, eligible population, event definitions, primary signal, guardrails, rollout stages, owners, and links to live dashboards. A screenshot becomes stale; a governed dashboard remains inspectable.
2. Create stable work lanes: Use separate parent threads for metrics, incidents, enablement, and customer feedback. This gives each agent and human responder a predictable place to work without fragmenting the overall launch record.
3. Publish response expectations: State which questions the agent can answer immediately, which require a human owner, and how urgent operational issues are escalated. The agent should never make an urgent request look handled merely because it produced a fluent reply.
4. Keep a decision ledger: For every ramp, hold, pause, or rollback, record the timestamp, evidence considered, decision, rationale, approver, and next review point. This matters later when a stakeholder asks why exposure changed or when the team compares the result with the original hypothesis.
5. Require channel-visible handoffs: If a question moves to a data, engineering, or privacy owner, the agent should post the handoff and preserve the relevant query, definitions, filters, and context. Do not let direct messages become a shadow operating system.
Give every automated data answer a consistent shape:
- The direct answer, including the population and time window.
- The metric definition and denominator.
- The relevant cohort, segment, environment, and experiment variant.
- A link to the approved underlying data or dashboard.
- An as-of timestamp so readers know how fresh the result is.
- Any missing data, definition conflict, or limitation that changes interpretation.
- The named human owner when judgment or investigation is required.
An activation rate without its denominator, an environment, or a timestamp is not decision-grade evidence. A polished answer should not receive more trust than the data lineage beneath it. Make uncertainty visible instead of prompting the agent to conceal it behind a confident summary.

Give agents narrow jobs and humans explicit authority

The safest launch architecture separates three jobs: retrieving data, operating rollout controls, and interpreting evidence. Combining them in one broadly empowered agent creates unnecessary risk. It also makes failures harder to diagnose because you cannot tell whether the problem came from a bad query, a bad recommendation, or an unauthorized action.

Use a data agent for retrieval and first-pass synthesis

Connect the data agent only to approved sources and metric definitions. It can answer repeatable questions such as activation by cohort, conversion by segment, latency by region, exposure by variant, or the movement of a named guardrail. It should provide citations and timestamps, then route questions requiring nuance to an owner while keeping the context in the thread.

Write the escalation boundary into its operating instructions. Escalate when metric definitions conflict, required data is unavailable, a query touches restricted information, the request asks for a causal conclusion that descriptive data cannot establish, or the answer would materially change rollout. The best response in those cases is not a guess. It is a precise statement of what is missing and who must resolve it.

Keep the feature-flag agent read-only by default

A flag agent can safely expose status by environment, current rollout allocation, and change history. That alone removes many repetitive questions. Write access is different: an incorrect production change can expose an unready experience, expand an incident, or remove access unexpectedly.

When you permit flag mutations, require an explicit sequence:
1. The requester names the feature, environment, target population, requested action, and reason.
2. The agent shows the current flag state and summarizes the evidence relevant to the request.
3. The authorized approver confirms the exact change. Approval cannot be inferred from an emoji, an ambiguous reply, or the agent’s own recommendation.
4. The integration performs only the approved action through constrained permissions.
5. The agent posts the resulting state, timestamp, requester, approver, rationale, and change-history link.
Do not give the agent a broad production credential merely because the chat interface is convenient. Restrict its access by environment and role, preserve an audit trail, and keep a manual rollback path available to the responsible engineer.

Use a readout agent to maintain the launch narrative

Scheduled summaries prevent the team from rebuilding the same analysis for every stakeholder. A useful default is to publish readouts at T+1 hour, T+24 hours, and T+7 days, while adapting the questions to the product’s actual usage cycle:
- T+1 hour: Confirm that exposure is occurring as intended. Check instrumentation, operational performance, obvious anomalies, and incident status. This checkpoint is primarily about measurement and safety, not declaring growth success.
- T+24 hours: Review adoption and activation by the planned cohorts, early conversion movement where applicable, support themes, and any uneven behavior across important segments.
- T+7 days: Evaluate experiment results that have had time to mature, retention or repeated-value signals when the product cycle makes them observable, significant outliers, and the follow-up work needed to harden or revise the experience.
These checkpoints are operating cadences, not guarantees of statistical maturity. A feature used on a longer cycle may not produce a meaningful retention signal by the final checkpoint. The readout should say that plainly instead of treating missing maturity as neutral evidence.

Every readout should end with a decision or an explicit statement that no decision is yet warranted. It should also name the evidence still needed, the owner, and the next review point. A summary that lists charts but does not clarify the decision state creates more reading without reducing uncertainty.

Make the accountability map visible
- Product owns the behavioral hypothesis, the primary growth signal, and the recommendation to ramp, hold, or change direction.
- Engineering owns operational health, flag implementation, incident response, and safe rollback execution.
- Data owns metric definitions, instrumentation validity, experiment design, and interpretation limits.
- Support and go-to-market owners contribute customer feedback, readiness concerns, enablement status, and communication needs.
- Agents retrieve, summarize, route, and perform narrowly preauthorized steps. They do not approve their own consequential recommendations.
The governance layer is part of the product, not a final compliance check. Apply role-based access, protect personally identifiable information, require source citations, and retain transparent logs. Then monitor response accuracy, deflection, and time-to-answer through Agent Analytics. Deflection alone is a poor success metric: a confidently wrong response may reduce human questions while increasing decision risk. Review incorrect answers, unnecessary escalations, missed escalations, and stale data as carefully as response speed.

Run the rollout as a sequence of evidence gates

A feature flag is not merely a switch. It lets you separate deployment from exposure and turn a large release decision into a sequence of smaller, inspectable decisions. The appropriate rollout stages depend on the feature’s operational, privacy, and customer risk, so define them in advance rather than copying a universal percentage.

Use this operating sequence:
1. Preflight the measurement: Verify eligibility, exposure, activation, value, and guardrail events in the intended environment. Confirm that dashboards use the launch contract’s definitions and that the agent can retrieve the same governed numbers.
2. Release to the defined cohort: Use the flag to control who can receive the experience. Confirm actual exposure before interpreting engagement. Eligibility and exposure are different facts.
3. Inspect evidence at the scheduled gates: Start with instrumentation and safety, then move to activation, conversion, retention, and other downstream value signals as they become observable. Review the preselected segments before exploring unexpected cuts of the data.
4. Choose a named decision state: Ramp, hold, investigate, pause, or roll back. Record the evidence and rationale. Avoid vague states such as looking good, because they do not tell engineering what to do or stakeholders what has been decided.
5. Feed the learning back into the journey: Update onboarding, in-product guidance, targeting, positioning, or the feature itself based on the observed friction. A winning test becomes a growth mechanism only when the trigger, experience, and value-producing behavior can be repeated reliably.
Use a clear decision ladder:
- Ramp when measurement is trustworthy, guardrails remain inside the pre-agreed bounds, the evidence meets the decision rule, and customer-facing teams are ready for broader exposure.
- Hold when the system is healthy but the outcome has not matured enough to support a decision. State what evidence is pending and when it can reasonably be reviewed.
- Investigate when an anomaly, segment divergence, definition conflict, or instrumentation gap makes the aggregate result unreliable.
- Pause when continued exposure could obscure an incident, contaminate the test, or expand a customer problem while the team diagnoses it.
- Roll back when a material operational, privacy, safety, or customer guardrail crosses the boundary defined in the launch contract. Do not wait for the primary growth metric to mature before acting on a serious downside.
If the feature itself uses AI, measure the product experience separately from the operational agents supporting its launch. AI can provide intelligent nudges, next-best actions, and adaptive experiences while applying privacy-by-design and strong data governance. That creates at least four distinct questions: Was the user eligible? Was the AI experience delivered? Did the user engage with it? Did that engagement lead to value without violating a guardrail?

Logging only the final conversion makes those questions impossible to separate. A delivery problem, poor recommendation, confusing interaction, and weak value proposition can all produce the same downstream result. Preserve the path from eligibility through value, including the experience variant the user received. If targeting or adaptive behavior changes during the test, log the change and account for it in the interpretation.

Do not confuse high initial use with a durable growth loop. Novelty can produce engagement without retained value. Look for the sequence that matters to your product: activation, repeated value, and then the relevant expansion, collaboration, or retention behavior. If the product has no natural invitation or sharing mechanic, do not force a viral story onto it. Build the loop around the behavior customers already have a reason to repeat.

Key takeaways
- Start with a launch contract that names the user, behavioral hypothesis, measurement chain, primary signal, guardrails, minimum detectable effect, decision rules, and accountable owners.
- Use one launch channel as the shared operational record, but separate metrics, incidents, enablement, and feedback into stable threads.
- Split agent responsibilities across data retrieval, flag operations, and scheduled readouts. Keep consequential actions approval-gated and auditable.
- Treat T+1 hour, T+24 hours, and T+7 days as decision checkpoints, not automatic declarations of success.
- Use feature flags to move through evidence gates. Ramp, hold, investigate, pause, or roll back according to rules written before the data arrives.
- Measure AI-powered experiences from eligibility through delivery, engagement, value, and guardrails so you can diagnose why growth did or did not move.
For your next launch, begin with a narrow operating slice: a completed launch contract, a structured channel, a data agent limited to approved queries, read-only flag visibility, and scheduled readouts. Review every wrong answer, escalation, and decision after the rollout. Expand the agent’s authority only when the evidence shows that the control system is trustworthy.

References
- Amplitude – My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts
- Amplitude – How I Orchestrate Growth & AI at Amplitude to Ignite Viral Product Engagement
May 11, 2026

Tag: feature flags

Feature management extends beyond feature flags

Controlled exposure changes the release decision

Key takeaways for product teams

A practical operating loop

The discipline has costs as well as benefits

Start with the decision, not the dashboard

Build the measurement chain before release

Read adoption, experience, and outcomes separately

Adoption shows reach and relevance

Experience measures whether use was worthwhile

Business impact requires a credible connection

Key takeaways

Turn the evidence into a product decision

The product agent is a decision loop, not a smarter dashboard

Reliable recommendations depend on an analytics and evaluation stack

Roadmaps become portfolios of measurable opportunities

Governance determines how much autonomy an agent earns

Key takeaways

References

Give every experiment a decision contract

Match the question to the cheapest reliable evidence

Engineer trustworthy measurement and reversible delivery

Scale the program around decisions, not test volume

Reset a brittle program over 90 days

Key takeaways

References

Key takeaways

Start each initiative with an outcome contract

Build one scorecard across four layers

Run a weekly loop from evidence to decision

Assign decision rights before the dashboard turns red

Use 90 days to prove the operating system

References

Start with a growth constraint, not a dashboard

Build an evidence chain you can trust

Turn every growth idea into an experiment contract

Use staged releases to separate learning from risk

Make the product trio accountable for learning

Key takeaways

References

Define the growth decision before you automate the launch

Turn the launch channel into a decision system

Give agents narrow jobs and humans explicit authority

Use a data agent for retrieval and first-pass synthesis

Keep the feature-flag agent read-only by default

Use a readout agent to maintain the launch narrative

Make the accountability map visible

Run the rollout as a sequence of evidence gates

Key takeaways

References