Tag: eval-driven development

A Reliable Amplitude AI Workflow for Product Decisions
You ask Amplitude AI why activation fell. It returns a convincing explanation, a few plausible segments, and a recommendation your team could act on. The problem is that you still don’t know whether the answer reflects your product data, an ambiguous metric, or a reasonable-sounding guess.

You don’t fix that uncertainty with a longer prompt. You fix it with a controlled workflow: define the decision, provide only the context needed to analyze it, let AI run a bounded sequence of checks, and require evidence before accepting a conclusion. The result is an analysis another product manager can inspect, reproduce, and turn into action.

Start with a decision contract, not an open-ended question

A request such as “analyze our onboarding” leaves too many choices to the model. It must decide what onboarding means, which users count, what success looks like, which period matters, and whether the goal is diagnosis or opportunity discovery. A polished answer can hide those unresolved choices.

Write a short decision contract before opening the analysis. It should contain five elements:
- Decision: State what someone will decide after reading the result. For example: decide which activation bottleneck the onboarding team should investigate next.
- Population: Name the eligible users, accounts, plan types, platforms, markets, or acquisition channels.
- Metric: Supply the exact event or formula, its time window, and any exclusions.
- Evidence bar: Specify what the answer must show, such as the supporting events, segments, funnel steps, or behavioral trend.
- Output: Ask for a conclusion, competing explanations, uncertainties, and the next analysis or product action.
A useful objective is narrow enough to fit in one sentence. Your quality rubric can be slightly longer: require every conclusion to identify the relevant metric, population, comparison, and evidence. This intent-first, evaluation-driven approach keeps the analysis tied to a product decision instead of rewarding whatever answer sounds most complete.

Constraints belong in the contract too. If the team cannot change pricing, instrumentation, or a particular onboarding step, say so. If a result must remain descriptive because the analysis cannot establish causality, require that distinction. AI is more useful when it knows which doors are closed.
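
One way to keep the contract honest is to store it as structured data and refuse to start an analysis while an element is blank. The sketch below is illustrative only: `DecisionContract`, its field names, and the sample values are hypothetical and not part of any Amplitude API.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DecisionContract:
    """Hypothetical container for the five contract elements plus constraints."""
    decision: str
    population: str
    metric: str
    evidence_bar: str
    output: str
    constraints: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of required elements that are empty or whitespace."""
        required = [f.name for f in fields(self) if f.name != "constraints"]
        return [name for name in required if not getattr(self, name).strip()]


contract = DecisionContract(
    decision="Decide which activation bottleneck the onboarding team investigates next.",
    population="New self-serve accounts on web",  # illustrative value
    metric="",  # left blank on purpose to show the check
    evidence_bar="Supporting events, segments, and funnel steps",
    output="Conclusion, competing explanations, uncertainties, next action",
    constraints=["Pricing cannot change", "Findings stay descriptive, not causal"],
)

print(contract.missing_elements())  # ['metric']
```

A non-empty result is a signal to finish the contract before asking any question, rather than letting the model fill the gap.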

Build a compact context packet Amplitude AI can actually use

Amplitude AI can only interpret behavior through the data model it receives. If two teams use different definitions of an activated account, or an event changed meaning after an instrumentation update, the model can produce a coherent answer to the wrong question.

Create a reusable context packet for each important product area. Keep it short enough to review, but precise enough to remove semantic guesswork. Include:
- Metric definitions: Write the numerator, denominator, qualifying window, and exclusions for activation, retention, conversion, or any other decision metric.
- Event taxonomy: List the events and properties relevant to the question, including known aliases or deprecated events that should not be used.
- Segment definitions: Explain how key cohorts are formed and which properties distinguish users from accounts.
- Known data limitations: Flag missing platforms, delayed events, identity-resolution issues, tracking changes, and periods that should not be compared.
- Recent product context: Include only releases, experiments, or journey changes that could plausibly affect the behavior under review.
Use retrieval before expansion. Start with the smallest relevant set of definitions and observations. Add more context only when the analysis reaches a question that requires it. Dumping an entire analytics catalog into the prompt makes it harder to see which definitions shaped the answer and gives irrelevant details more chances to distract the model.

Examples can stabilize recurring work, but choose them carefully. One to three strong examples are enough to demonstrate the expected structure, evidence standard, and level of uncertainty. Remove old conclusions and stale numbers before reuse. You want the model to copy the analytical pattern, not inherit a previous answer.

Version this packet alongside the workflow. When an event definition, segment, or guardrail changes, record the change and rerun the analyses that depend on it. That turns context management from prompt housekeeping into part of your analytics governance.
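
A lightweight way to version the packet is to fingerprint its content and record that fingerprint with every analysis. The sketch below assumes the packet is kept as a plain dictionary; the function names, event names, and metric wording are hypothetical.

```python
import hashlib
import json


def packet_fingerprint(packet: dict) -> str:
    """Stable short hash of a context packet; key order does not matter."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def analyses_to_rerun(packet: dict, recorded: dict[str, str]) -> list[str]:
    """Return analyses whose recorded fingerprint no longer matches the packet."""
    current = packet_fingerprint(packet)
    return sorted(name for name, seen in recorded.items() if seen != current)


packet_v1 = {
    "metric_definitions": {"activation": "hypothetical: 3 key actions within 7 days"},
    "deprecated_events": ["signup_v1"],
}
# Each analysis stores the fingerprint of the packet it was run against.
recorded = {"activation_drop_review": packet_fingerprint(packet_v1)}

# A definition change produces a new fingerprint, which flags dependent analyses.
packet_v2 = {
    **packet_v1,
    "metric_definitions": {"activation": "hypothetical: 3 key actions within 14 days"},
}
print(analyses_to_rerun(packet_v1, recorded))  # []
print(analyses_to_rerun(packet_v2, recorded))  # ['activation_drop_review']
```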

Run a bounded analysis loop, then challenge the result

Move from observation to explanation in explicit steps

Don’t ask for a diagnosis in a single jump. A reliable workflow separates what happened from why it may have happened. Use a fixed sequence:
1. Establish the baseline. Confirm the metric definition, eligible population, comparison, and direction of change.
2. Locate the difference. Break the result down by the segments most relevant to the decision. Avoid exploring every available property.
3. Inspect the journey. Examine funnel steps, behavioral paths, retention patterns, or other views that can show where behavior diverges.
4. Generate competing hypotheses. Ask for more than one plausible explanation and require supporting and contradicting evidence for each.
5. Choose the next best analysis. Run the segment drill-down, funnel attribution, or anomaly check most likely to separate the leading explanations.
6. Apply a stop rule. End when the evidence is sufficient for the stated decision, when the remaining uncertainty requires new instrumentation, or when another analysis would not change the next action.
The stop rule matters. Without one, an agentic workflow can keep generating cuts of the data that add activity without increasing confidence. Before each tool call, require the system to state what question the analysis will answer and how each possible result would change its next step.
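
The stop rule can be sketched as a loop that requires each planned check to declare its question and what each result would change. Everything here is an assumption for illustration: `PlannedCheck`, the `max_checks` budget, and the lambdas standing in for real tool calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlannedCheck:
    question: str            # what the analysis will answer
    if_supports: str         # next step if the result supports the hypothesis
    if_contradicts: str      # next step if it does not
    run: Callable[[], bool]  # hypothetical tool call; True means "supports"


def run_bounded_loop(checks: list[PlannedCheck], max_checks: int = 3) -> list[str]:
    """Run checks in order; stop when a result could not change the next action."""
    trace = []
    for check in checks[:max_checks]:
        if check.if_supports == check.if_contradicts:
            # Stop rule: either outcome leads to the same action, so do not run it.
            trace.append(f"STOP before '{check.question}': result would not change the next action")
            break
        supported = check.run()
        next_step = check.if_supports if supported else check.if_contradicts
        trace.append(f"RAN '{check.question}' -> {next_step}")
    return trace


checks = [
    PlannedCheck("Did activation fall for the eligible population?",
                 "segment the drop", "close the question", lambda: True),
    PlannedCheck("Is the drop concentrated on one platform?",
                 "inspect that platform's funnel", "inspect the overall funnel", lambda: False),
    PlannedCheck("Does plan type differ?",
                 "write up the finding", "write up the finding", lambda: True),
]
for line in run_bounded_loop(checks):
    print(line)
```

The third check is never executed because both of its outcomes lead to the same action, which is the situation the stop rule is meant to catch.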

If you expose Amplitude actions through MCP or another callable interface, keep each tool narrow and observable. A call should have explicit inputs, a recognizable output shape, and an error state the workflow can surface. Log the question, parameters, returned evidence, and the interpretation built from it. Tool access makes iteration faster; it does not remove the need for an audit trail.

Put every conclusion through a verification gate

Before a finding reaches a stakeholder, check it against a simple evidence ledger. For each important claim, record:
- the event, metric, segment, funnel step, or trend that supports it;
- the population and comparison to which it applies;
- whether it is an observation, interpretation, or causal hypothesis;
- the strongest alternative explanation;
- the assumptions or data limitations that could change the conclusion;
- the next check required if confidence is still too low for the decision.
Then try to disprove the preferred answer. Ask whether the pattern survives a relevant segment change, whether a tracking change could explain it, and whether the same evidence also supports a competing hypothesis. This adversarial pass is often more valuable than asking the model to make its first response more detailed.
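
The ledger is easier to enforce when each claim is a record that a simple gate can reject. This is a minimal sketch under stated assumptions: the `LedgerEntry` fields mirror the list above, and `gate_failures` and the sample claim are hypothetical.

```python
from dataclasses import dataclass

CLAIM_TYPES = {"observation", "interpretation", "causal hypothesis"}


@dataclass
class LedgerEntry:
    claim: str
    support: str          # event, metric, segment, funnel step, or trend
    scope: str            # population and comparison
    claim_type: str       # one of CLAIM_TYPES
    alternative: str      # strongest alternative explanation
    limitations: str      # assumptions or data limitations
    next_check: str = ""  # required only when confidence is still too low


def gate_failures(entry: LedgerEntry, confident_enough: bool) -> list[str]:
    """Return the reasons a claim should not reach a stakeholder yet."""
    problems = []
    for name in ("support", "scope", "alternative", "limitations"):
        if not getattr(entry, name).strip():
            problems.append(f"missing {name}")
    if entry.claim_type not in CLAIM_TYPES:
        problems.append("unknown claim type")
    if not confident_enough and not entry.next_check.strip():
        problems.append("missing next check")
    return problems


entry = LedgerEntry(
    claim="Activation fell mainly at the workspace-setup step",  # illustrative
    support="Funnel step 3 conversion trend",
    scope="New web accounts, this month vs. last month",
    claim_type="observation",
    alternative="",  # no competing explanation recorded yet
    limitations="Tracking changed mid-period",
)
print(gate_failures(entry, confident_enough=False))
# ['missing alternative', 'missing next check']
```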

Turn repeated checks into an evaluation set. Save representative questions, approved metric definitions, required evidence fields, and known failure cases. Rerun them when prompts, context, instrumentation, or model versions change. Review failures by category: wrong scope, wrong metric, unsupported inference, missed uncertainty, or unusable recommendation. That gives your team a regression signal instead of a vague impression that the workflow still works.
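
A regression signal can be as simple as comparing failure counts by category between two runs of the same evaluation set. The sketch below assumes each graded result is a dictionary with a `failure` field that is either `None` or one of the categories above; the case names are hypothetical.

```python
from collections import Counter

FAILURE_CATEGORIES = (
    "wrong scope", "wrong metric", "unsupported inference",
    "missed uncertainty", "unusable recommendation",
)


def failure_counts(results: list[dict]) -> Counter:
    """Count failures per category; a result with failure=None passed."""
    counts = Counter()
    for result in results:
        category = result["failure"]
        if category is None:
            continue
        if category not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {category}")
        counts[category] += 1
    return counts


def regressions(before: list[dict], after: list[dict]) -> dict[str, int]:
    """Categories whose failure count increased between two runs of the same set."""
    old, new = failure_counts(before), failure_counts(after)
    return {c: new[c] - old[c] for c in FAILURE_CATEGORIES if new[c] > old[c]}


baseline_run = [
    {"case": "activation_drop", "failure": None},
    {"case": "retention_by_plan", "failure": "wrong metric"},
    {"case": "checkout_funnel", "failure": None},
]
after_prompt_change = [
    {"case": "activation_drop", "failure": "unsupported inference"},
    {"case": "retention_by_plan", "failure": None},
    {"case": "checkout_funnel", "failure": None},
]
print(regressions(baseline_run, after_prompt_change))  # {'unsupported inference': 1}
```

Here the overall pass count is unchanged, yet the category view shows that the prompt change fixed one failure mode and introduced another.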

Hand stakeholders a decision artifact, not an AI transcript

The output should make the next decision easier. A long transcript of prompts, tool calls, and exploratory branches shifts the work of interpretation onto the reader. Keep the trace for auditability, but present a concise decision artifact with six fields:
- Decision: The choice this analysis informs.
- Finding: The clearest supported behavioral observation.
- Evidence: The exact events, segments, funnel steps, or trends behind the finding.
- Uncertainty: What remains unknown and what the analysis cannot establish.
- Recommendation: The next analysis, discovery activity, experiment, or product change justified by the evidence.
- Owner: The person responsible for the next step and the condition that triggers a follow-up.
Keep human judgment at the decision boundary. Amplitude AI can retrieve definitions, propose analyses, call tools, compare patterns, and draft the artifact. A product leader should still decide whether the evidence is strong enough, whether the recommendation fits current constraints, and whether the cost of being wrong is acceptable.

That division of labor also clarifies accountability. If the AI workflow produces an unsupported inference, improve the context, tool contract, or evaluation. If the evidence is sound but the organization chooses a different path, record the strategic reason. Don’t let an AI-generated recommendation blur the difference between analytical output and an accountable product decision.

Key takeaways
- Begin with the decision, population, metric, evidence bar, and required output.
- Give Amplitude AI a small, versioned context packet instead of an unfiltered analytics catalog.
- Separate baseline measurement, segmentation, journey analysis, hypothesis generation, and the next tool call.
- Require evidence, alternatives, assumptions, and a stop rule before accepting a conclusion.
- Save recurring checks as evaluations and rerun them when data, prompts, tools, or models change.
- Deliver a decision artifact with a named owner while keeping the analytical trace available for review.
Start with one recurring product question this week. Write its decision contract, assemble the minimum context packet, and define the verification gate before asking Amplitude AI to analyze anything. Once that workflow survives review, save it as the template for the next question.

References
- Shivam.Consulting Blog — Decode How Amplitude AI Thinks: Proven Workflows to Get Actionable, High-Accuracy Results
June 2, 2026

How to Build a Resilient Experimentation Program at Scale

Your teams are running more experiments, but decisions are not getting easier. Results arrive late, apparent wins fail to repeat, and every readout starts a new argument about the data.

The fix is not another testing tool or a higher experiment count. You need an operating system that protects validity when traffic, products, models, and customer behavior change underneath you. That system starts before exposure, routes each question to the right evaluation method, and ends with a decision your team can execute.

Give every experiment a decision contract

An experiment should begin with a decision, not a feature. Ask what you will do if the result is positive, negative, inconclusive, or unsafe. If the answer is the same in every case, the test is not worth running.

Turn the proposed test into a short decision contract before engineering begins. Record:

- The customer problem: the friction or unmet need you observed.
- The causal hypothesis: the product change, the behavior it should alter, and why.
- The eligible population: who can enter the experiment and who must be excluded.
- The primary outcome: the one metric that determines whether the hypothesis worked.
- The guardrails: the measures that can block a rollout even when the primary outcome improves.
- The decision thresholds: the minimum effect worth acting on and the conditions for shipping, iterating, stopping, or rolling back.

A driver tree helps you connect the primary metric to the business outcome without pretending that one experiment can prove the entire chain. If the goal is retention, for example, the immediate experiment may be designed to change activation behavior. The contract should distinguish that leading behavior from the longer-term outcome.

Set the minimum detectable effect and guardrails before reading results. The minimum detectable effect is not the smallest movement your analytics can display. It is the smallest improvement that would justify the cost, risk, and complexity of the change. If your available population cannot reliably detect that effect, narrow the question, combine low-traffic variants, choose a more sensitive proximal metric, or do not run the test.
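
To check whether your population can detect the effect, a rough sample-size estimate is usually enough. The sketch below uses the standard normal approximation for comparing two conversion rates with a two-sided test; the 5% significance level, 80% power, and the example rates are assumptions, not recommendations.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline: float, mde_abs: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative numbers: 20% baseline activation, 2-point minimum worthwhile lift.
needed = users_per_arm(baseline=0.20, mde_abs=0.02)
available_per_arm = 4_000  # hypothetical eligible users per arm in the test window
print(needed, available_per_arm >= needed)  # roughly 6,500 needed, so False here
```

Halving the minimum detectable effect roughly quadruples the required population, which is why the threshold belongs in the contract rather than in the readout.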

Pre-committing to the metric, stopping rule, exclusions, and decision criteria also limits convenient reinterpretation. Teams can still investigate unexpected patterns, but those findings should become new hypotheses rather than retroactive proof that the original bet won.

Match the question to the cheapest reliable evidence

Production A/B testing is only one layer of experimentation. It is often the slowest and most expensive layer because it consumes customer attention, operational capacity, and statistical power. Use it when real behavior is necessary to resolve a meaningful decision.

| Evidence layer | Best question | Move forward when |
| --- | --- | --- |
| Offline evaluation | Does the output meet a defined quality, policy, or safety standard? | The candidate passes the agreed evaluation set and regression checks. |
| Replay or shadow mode | How would the change behave on realistic inputs without affecting users? | Failure patterns, cost, and latency remain inside the operating limits. |
| Targeted canary | Is the change safe and observable under live conditions? | Telemetry is healthy and no guardrail triggers a rollback. |
| Controlled A/B test | Does the change cause a valuable shift in user behavior? | The result meets the pre-registered decision criteria. |
| Progressive rollout | Do the effect and reliability persist as exposure expands? | Segment-level outcomes and operational measures remain acceptable. |

This layered model becomes essential for AI products. Prompts, retrieval logic, policies, model versions, and traffic composition can all change the experience. A single production metric cannot tell you whether a decline came from product value, output quality, latency, cost, safety, or an upstream model shift.

Build an evaluation stack for prompts, policies, regressions, canaries, and selective A/B tests. A candidate should earn broader exposure by passing the cheaper layers first. This reduces traffic waste and gives the team diagnostic evidence when a live result moves unexpectedly.
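
The rule that a candidate earns exposure by clearing cheaper layers first can be expressed as an ordered gate. This is a sketch only: the layer names come from the table above, while `next_layer` and the way passes are recorded are hypothetical.

```python
from typing import Optional

# Evidence layers ordered from cheapest to most expensive.
LAYERS = [
    "offline evaluation",
    "replay or shadow mode",
    "targeted canary",
    "controlled A/B test",
    "progressive rollout",
]


def next_layer(passed: set[str]) -> Optional[str]:
    """Return the first layer not yet passed, or None when all are cleared.

    A pass at a later layer does not count while an earlier layer is missing.
    """
    for layer in LAYERS:
        if layer not in passed:
            return layer
    return None


print(next_layer(set()))                                    # offline evaluation
print(next_layer({"offline evaluation"}))                   # replay or shadow mode
print(next_layer({"offline evaluation", "targeted canary"}))  # replay or shadow mode
```

In practice some changes may legitimately skip a layer; if so, record that as an explicit waiver rather than an implicit pass.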

Do not use a multi-armed bandit simply because it can direct more traffic toward a leading variant. Bandits are useful when the objective is clear, feedback is timely, and guardrails are dependable. They are a poor substitute for stable measurement or causal understanding. If you need to estimate an effect, learn about segments, or detect delayed harm, retain a controlled comparison.

Engineer trustworthy measurement and reversible delivery

An experimentation program is only as resilient as its event pipeline. A mathematically correct analysis built on shifting event definitions is still wrong. Treat instrumentation as a product interface with owners, documentation, versioning, tests, and observability.

Before exposure begins, verify that assignment, exposure, outcome, and guardrail events share consistent identities and timestamps. Confirm that users enter only the experiments for which they are eligible. Check that retries, duplicate events, delayed ingestion, and cross-device behavior cannot silently change the denominator.

Naming conventions, schema versioning, lineage, anomaly detection, and pipeline observability are not analytics housekeeping. They let teams move without sacrificing the meaning of their measurements. Assign an owner to every critical event and make schema changes visible to the teams whose experiments depend on them.

During the run, monitor data quality separately from product performance. Sample ratio mismatch, assignment failures, missing exposure events, sharp volume changes, and implausible segment movements should pause interpretation. Do not explain these signals away because the headline result looks attractive.
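
Sample ratio mismatch is one of the few data-quality signals you can test directly. The sketch below applies a chi-square goodness-of-fit test to a two-arm experiment using only the standard library; the 0.001 alert threshold and the user counts are assumptions for illustration.

```python
from math import erfc, sqrt


def srm_p_value(control: int, treatment: int,
                expected_control_share: float = 0.5) -> float:
    """P-value that the observed split is consistent with the planned split."""
    total = control + treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level; choose your own before the run

balanced = srm_p_value(50_000, 50_000)
skewed = srm_p_value(50_000, 51_500)
print(balanced >= SRM_THRESHOLD)  # True: no mismatch detected
print(skewed < SRM_THRESHOLD)     # True: pause interpretation and investigate
```

A 50,000 versus 51,500 split looks harmless on a dashboard, but at this volume it is very unlikely under a true 50/50 assignment.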

Delivery must be reversible as well as measurable. Put material treatments behind feature flags. Start with a targeted canary, watch operational and customer guardrails, and expand exposure in stages. Define who can stop the rollout and make sure that person has both the telemetry and access required to act.
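
Staged exposure is simpler to reason about when assignment is deterministic. The sketch below hashes the flag name and user ID into a stable bucket, so raising the percentage only adds users and setting it to zero acts as a rollback. The function and flag names are hypothetical and not taken from any particular feature-flag product.

```python
import hashlib


def in_rollout(user_id: str, flag: str, exposure_pct: float) -> bool:
    """Return True if the user falls inside the current exposure percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket 0..9999
    return bucket < exposure_pct * 100


users = [f"user-{i}" for i in range(2_000)]
canary = {u for u in users if in_rollout(u, "new-onboarding", 5)}
expanded = {u for u in users if in_rollout(u, "new-onboarding", 25)}

print(canary <= expanded)  # True: everyone in the canary stays exposed after expansion
print(any(in_rollout(u, "new-onboarding", 0) for u in users))  # False: 0% is a full rollback
```

Including the flag name in the hash keeps the same users from landing in the early stage of every rollout.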

For broad platform or AI changes, maintain a persistent holdout when feasible. A long-lived control gives you a reference point for cumulative effects that short experiments miss, including changes in retention, trust, support burden, and cost. Protect the holdout from accidental contamination and document every change that affects its interpretation.

Scale the program around decisions, not test volume

A central experimentation team cannot design and analyze every test at scale. Product teams need autonomy inside a governed system. Centralize the parts where inconsistency creates shared risk: assignment services, metric definitions, event standards, quality checks, templates, and audit records. Let teams own hypotheses, customer context, treatment design, and decisions inside those guardrails.

Use a lightweight review based on risk. A reversible interface change with a proven metric can follow a standard path. A pricing change, safety policy, ranking system, or shared AI capability deserves stronger review, tighter exposure controls, and a clearer rollback plan. Governance should become more demanding as the blast radius grows.

Maintain a portfolio view rather than a leaderboard of teams by test count. For each active experiment, track the decision it supports, expected value, detectable effect, traffic requirement, risk class, owner, and current evidence layer. This reveals when several teams are competing for the same population, when a strategic question is underpowered, and when multiple small tests should become one coherent learning plan.

Reset a brittle program over 90 days

You can make the operating model concrete without attempting a platform-wide rebuild:

- By day 30: audit the backlog and current tests. Stop or consolidate experiments that cannot meet their minimum detectable effect. Identify unreliable events, missing owners, conflicting metric definitions, and launches without explicit decision criteria. For AI surfaces, establish a minimal offline evaluation harness for prompts, policies, quality, and safety.
- By day 60: publish standard hypothesis and readout templates. Put high-risk changes behind feature flags, make guardrails visible, and introduce canary exposure. Establish persistent holdouts where broad or cumulative effects matter. Add alerts for instrumentation drift and operational regressions.
- By day 90: manage a balanced portfolio across offline evaluations, replay or shadow tests, canaries, controlled experiments, and progressive rollouts. Review program health through decision speed, valid learning, repeatability, and detected harm rather than the number of tests launched.

Create a community of practice alongside these controls. Regularly examine inconclusive results, failed replications, instrumentation incidents, and stopped rollouts. These cases expose weaknesses in the system more reliably than a gallery of wins. The goal is not to eliminate failure. It is to make failure informative, contained, and cheap.

Key takeaways

- Start with the decision the experiment must support, then pre-register the hypothesis, primary metric, guardrails, detectable effect, and stopping rule.
- Use offline evaluations, replay, shadow mode, and canaries to eliminate weak or unsafe candidates before consuming production traffic.
- Treat event semantics, assignment, exposure, lineage, and anomaly detection as production infrastructure.
- Pair controlled measurement with feature flags, progressive exposure, explicit rollback authority, and persistent holdouts where cumulative effects matter.
- Judge the program by trustworthy decisions and reusable learning, not experiment volume or the percentage of positive results.

Choose one upcoming decision with meaningful customer or operational risk. Write its decision contract, identify the cheapest evidence layer that could disprove it, and verify the rollback path before anyone builds the treatment. That single discipline is a practical starting point for a program that can keep learning as your product and organization change.

June 1, 2026

An AI Operating Model That Measures Outcomes, Not Activity

Your AI team is shipping, dashboards are filling up, and executives are still asking the uncomfortable question: what changed for the customer or the business?

The answer is rarely another model metric. You need an operating model that connects AI quality to customer behavior, workflow performance, commercial results, and risk. When that chain is visible, you can decide what to scale, what to repair, and what to stop.

Key takeaways

- Give every AI initiative an outcome contract that names the target behavior, business result, guardrails, and decision owner.
- Measure four linked layers: AI quality, user behavior, workflow results, and business outcomes.
- Preserve the context behind each interaction so you can compare outcomes by customer, workflow, model version, and acquisition path.
- Run one recurring evidence review where teams make explicit scale, fix, hold, or stop decisions.
- Use the first 90 days to prove a reusable learning system, not merely a functioning AI experience.

Start each initiative with an outcome contract

A feature brief tells a team what to build. An outcome contract tells it why the work exists, how evidence will be interpreted, and who can act on that evidence. It is the smallest practical unit of an outcome-led AI portfolio.

Write the contract before choosing a model or polishing a prompt. Keep it to one page and require six fields:

- Target workflow: Name the repeated job being changed, such as resolving a support request or preparing a sales follow-up.
- Target user behavior: Describe what a person should do differently. Adoption alone is weak; successful completion, accepted recommendations, or reduced rework is stronger.
- Business outcome: Connect the behavior to retention, expansion, qualified demand, service capacity, or another commercial result.
- Quality floor: Define the task-level evaluation the AI must pass before exposure expands.
- Guardrails: Name the safety, privacy, latency, reliability, and cost conditions that must remain acceptable.
- Decision rule: State what evidence will trigger a scale, fix, hold, or stop decision, and name the person accountable for making it.

A driver tree makes the logic inspectable. Start with the business result, work backward to the customer behavior that can influence it, then identify the product and AI capabilities that can change that behavior. This prevents a model improvement from being mistaken for business progress.

The contract also gives empowered teams useful boundaries. Leaders align the portfolio around outcomes and constraints; teams retain room to change prompts, retrieval methods, interaction design, or even the proposed solution. That is the practical connection between AI strategy, continuous discovery, evaluation, delivery, and value capture.

Build one scorecard across four layers

AI outcome analytics is not a single north-star metric. It is a chain of evidence. If you measure only the beginning of the chain, you learn whether the system produced an answer. If you measure only the end, you may see revenue move without knowing why.

| Measurement layer | Question it answers | Useful examples | Typical decision |
| --- | --- | --- | --- |
| AI quality | Did the system perform the intended task? | Task success, groundedness, safety failures, response variance | Change prompts, context, retrieval, model, or fallback |
| User behavior | Did a person trust and use the result? | Acceptance, correction, abandonment, repeat use, human escalation | Change the interaction, explanation, or moment of assistance |
| Workflow outcome | Did the job become meaningfully better? | Successful completion, rework, cycle time, resolution quality | Expand, narrow, or redesign the workflow |
| Business and risk outcome | Did the change create durable value within constraints? | Retention, expansion, qualified leads, cost per successful outcome, incidents | Scale, repackage, hold, or stop |

Read the layers in order, from AI quality through to business and risk outcome. Good AI quality with weak behavior usually points to product design, trust, or workflow placement. Strong usage with no workflow improvement may indicate novelty rather than value. Workflow gains with poor economics mean the experience works but the architecture or packaging does not.
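
Reading the chain amounts to finding the first layer that is not healthy and taking the decision that belongs to it. The sketch below is one hypothetical way to encode that; the layer names and typical decisions come from the scorecard, while `first_break` and the boolean health flags are assumptions. How you judge a layer "healthy" is left to your own thresholds.

```python
from typing import Optional

# Layers in chain order, paired with the typical decision from the scorecard.
CHAIN = [
    ("AI quality", "change prompts, context, retrieval, model, or fallback"),
    ("User behavior", "change the interaction, explanation, or moment of assistance"),
    ("Workflow outcome", "expand, narrow, or redesign the workflow"),
    ("Business and risk outcome", "scale, repackage, hold, or stop"),
]


def first_break(healthy: dict[str, bool]) -> Optional[tuple[str, str]]:
    """Return the earliest unhealthy layer and its typical decision.

    A layer missing from the input is treated as unhealthy, because an
    unmeasured layer cannot support the layers after it.
    """
    for layer, decision in CHAIN:
        if not healthy.get(layer, False):
            return layer, decision
    return None


scorecard = {
    "AI quality": True,
    "User behavior": False,   # good answers, but people are not accepting them
    "Workflow outcome": False,
    "Business and risk outcome": False,
}
print(first_break(scorecard))
# ('User behavior', 'change the interaction, explanation, or moment of assistance')
```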

Use the workflow attempt as the basic unit of analysis whenever possible. A generic session can contain several unrelated intentions. A workflow attempt lets you connect the user request, retrieved context, model and prompt version, response, correction, completion, and downstream result.

Persist the properties needed to reconstruct that journey. Customer segment, acquisition context, workflow type, entitlement, experiment group, model version, retrieval version, and human-handoff status often matter more than another page-view event. Carrying critical context across visits lets you trace behavior from early exploration to conversion and expansion instead of losing the causal story at signup.

Keep the event taxonomy small enough to govern. Instrument decisions and state changes, not every interface movement. For each event, document its owner, trigger, required properties, prohibited sensitive data, and validation method. A dashboard built on ambiguous events creates confidence without clarity.
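
Documenting each event as data makes the taxonomy checkable at ingestion time. The sketch below is illustrative: the event name, owner, and property names are hypothetical, and `validate_event` only checks for required and prohibited properties.

```python
# Hypothetical event documentation: owner, trigger, required and prohibited properties.
EVENT_SPECS = {
    "workflow_attempt_completed": {
        "owner": "support-analytics",
        "trigger": "attempt reaches a terminal state",
        "required": {"workflow_type", "model_version", "retrieval_version",
                     "experiment_group", "human_handoff"},
        "prohibited": {"email", "message_text"},
    },
}


def validate_event(name: str, properties: dict) -> list[str]:
    """Return the problems found in one event payload; an empty list means valid."""
    spec = EVENT_SPECS.get(name)
    if spec is None:
        return [f"undocumented event: {name}"]
    present = set(properties)
    problems = [f"missing property: {p}" for p in sorted(spec["required"] - present)]
    problems += [f"prohibited property: {p}" for p in sorted(spec["prohibited"] & present)]
    return problems


payload = {
    "workflow_type": "billing_update",
    "model_version": "m-2",
    "experiment_group": "treatment",
    "human_handoff": False,
    "email": "person@example.com",  # sensitive data that should never be sent
}
print(validate_event("workflow_attempt_completed", payload))
# ['missing property: retrieval_version', 'prohibited property: email']
```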

Run a weekly loop from evidence to decision

Analytics creates value only when it changes a decision. Give each AI initiative a recurring evidence review attended by the product trio and the engineering, data, risk, operations, or go-to-market partners needed for that workflow.

1. Check the contract. Reconfirm the target workflow, primary outcome, evaluation floor, and guardrails. If the goal has changed, update the contract before interpreting the data.
2. Inspect the scorecard. Review AI quality, behavior, workflow, business, risk, and cost in that order. Look for breaks in the chain rather than averaging them into one health score.
3. Segment the result. Compare the cohorts that could conceal a failure: new and experienced users, customer tiers, workflow types, channels, experiment groups, and system versions.
4. Review failure cases. Sample unsuccessful attempts and classify the reason: missing context, poor retrieval, incorrect generation, confusing interaction, policy restriction, latency, or a problem outside the AI system.
5. Make one portfolio decision. Choose scale, fix, hold, or stop. Record the evidence, owner, next test, and condition for revisiting the decision.

Do not let offline evaluations and online analytics compete. Offline evaluations test whether a candidate change can handle representative tasks and known edge cases. Online measures show whether the released experience changes real behavior under real conditions. A candidate should clear the evaluation floor before broader exposure, then earn expansion through customer and business evidence.

When you run an experiment, agree on the hypothesis, primary outcome, guardrails, minimum detectable effect, and stopping rule before looking at results. Feature flags and progressive rollout keep the decision reversible. If the result is ambiguous, improve the test or narrow the population; do not promote the most flattering proxy.

This rhythm makes learning rate operational. The useful question is not how many experiments ran. It is how many consequential uncertainties were resolved and converted into a product, portfolio, or go-to-market decision. Testable decisions, behavioral analytics, and guarded rollouts make speed credible because the evidence can survive scrutiny.

Assign decision rights before the dashboard turns red

AI products cross boundaries that ordinary feature teams can often ignore. Product owns the customer and business outcome. Engineering owns service behavior and remediation. Data or AI teams own evaluation integrity and model observability. Risk, security, legal, and operations own constraints that cannot be traded away informally.

Write those responsibilities into the operating model. For each risk tier, specify who can approve an initial release, expand exposure, pause the system, change a model or retrieval source, accept a temporary exception, and communicate an incident. A named decision owner is more useful than a large committee with shared accountability.

Governance should begin during discovery. The team can then choose acceptable data, design consent, build traceability, create fallbacks, and define escalation paths before those choices become expensive. Model cards, data records, evaluation results, release history, and incident decisions should form one audit trail rather than separate compliance paperwork.

The same principle applies to commercial decisions. Product, finance, sales, and customer success need a shared definition of value. Measure inference and support costs against successful workflow outcomes, not raw requests or tokens. Packaging can then reflect delivered value while protecting margins and avoiding incentives for wasteful usage.

Use simple decision tests:

- Scale when the primary outcome improves, quality and safety floors hold, economics remain acceptable, and the result repeats in the intended cohorts.
- Fix when the chain reveals a local weakness, such as adequate AI quality but low acceptance, or strong adoption but excessive rework.
- Hold when the evidence is inconclusive, the measurement is unreliable, or a guardrail is close enough to its limit that broader exposure would create avoidable risk.
- Stop when only proxy metrics improve, the target workflow does not change, or the value depends on manual intervention that cannot be sustained.
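
The four tests can be written down as a small decision function so that every review applies them the same way. This is a sketch under stated assumptions: the `Evidence` fields are hypothetical flags a reviewer would set, and the order in which the tests are checked (hold, then stop, then fix, then scale) is an assumption, since the tests themselves do not specify a precedence.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    primary_outcome_improved: bool
    floors_hold: bool            # quality and safety floors
    economics_acceptable: bool
    repeats_in_cohorts: bool
    measurement_reliable: bool
    guardrail_near_limit: bool
    local_weakness: bool         # e.g. adequate AI quality but low acceptance
    only_proxies_improved: bool


def portfolio_decision(e: Evidence) -> str:
    """Map reviewed evidence to scale, fix, hold, or stop (assumed precedence)."""
    if not e.measurement_reliable or e.guardrail_near_limit:
        return "hold"
    if e.only_proxies_improved:
        return "stop"
    if e.local_weakness:
        return "fix"
    if (e.primary_outcome_improved and e.floors_hold
            and e.economics_acceptable and e.repeats_in_cohorts):
        return "scale"
    return "hold"  # nothing decisive: treat the evidence as inconclusive


strong = Evidence(True, True, True, True, True, False, False, False)
low_acceptance = Evidence(False, True, True, False, True, False, True, False)
print(portfolio_decision(strong))          # scale
print(portfolio_decision(low_acceptance))  # fix
```

The function does not replace the named decision owner; it only makes the reasoning behind a recorded decision easier to audit.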

Use 90 days to prove the operating system

Your first 90 days should produce more than a working use case. They should leave behind a repeatable contract, event model, evaluation set, rollout path, governance record, and decision cadence that the next team can reuse.

- Weeks 1–2: choose the workflow. Audit available content and data, map the highest-value repeatable workflows, and select one where behavior and business impact can be observed. Write the outcome contract and assign decision rights.
- Weeks 3–4: define the evidence. Build the driver tree, establish the four-layer scorecard, create representative offline evaluations, classify risk, and document the release and stop conditions.
- Weeks 5–8: build and instrument. Create the retrieval and prompt baseline, capture lineage and version context, validate events, implement observability, and test graceful fallbacks. Rehearse how the team will diagnose a failed attempt.
- Weeks 9–12: release and learn. Ship behind a feature flag, begin with limited exposure, compare behavior and outcomes, inspect failure cohorts, and make explicit scale, fix, hold, or stop decisions.

At the end, ask for three forms of proof. Can the team explain which customer behavior changed? Can it connect that behavior to a workflow and business result without hand-waving? Can another team reuse the operating artifacts without rebuilding them from scratch?

If any answer is no, keep the rollout narrow and repair the system of learning. If all three are yes, fund the next workflow using the same operating model. The goal is not a larger collection of AI features. It is an organization that can turn uncertain AI capabilities into measurable outcomes, repeatedly and responsibly.

May 28, 2026

AI-Ready Customer Support: An Operating Model That Works
You may already have an AI agent, a vendor shortlist, or pressure to automate more tickets. But if your policies conflict, ownership is unclear, and agents routinely rely on tribal knowledge, adding AI will expose those weaknesses at customer speed.

The practical goal is not to make every ticket autonomous. It is to build a support operation in which AI can resolve the right issues, recognize when it lacks authority or information, and help your team improve the system after every failure.

Key takeaways
- Start with a bounded customer problem, not a general mandate to automate support.
- Treat knowledge as a controlled production input with owners, audience rules, and review triggers.
- Define acceptable outcomes, prohibited actions, and escalation conditions before configuring the agent.
- Preserve a middle path where a human can unblock the AI without taking over the entire conversation.
- Expand automation only when evaluation results, live outcomes, and operational ownership support it.

Start with the queue, not the model

An AI-ready operation begins with a resolvable job. “Handle customer support” is too broad. “Help authenticated customers update their billing details under the current policy” is something you can document, test, monitor, and constrain.

Choose an initial queue where demand is meaningful, the desired outcome is clear, and the governing policy is reasonably stable. Avoid starting with cases that depend on negotiation, undocumented exceptions, or several teams making judgment calls behind the scenes. Those cases may become suitable later, but they are poor places to learn basic operational control.

Review a representative slice of conversations from that queue. For each one, record the customer’s intent, the information required, the systems touched, the policy applied, the final outcome, and any human judgment that changed the path. This turns a pile of tickets into a resolution map.

Pay special attention to cases that look identical at first but require different actions. A refund request may depend on plan type, purchase date, account state, or a regulatory restriction. These branches are where a fluent answer can still be operationally wrong.

You also need to decide where AI will sit. In most established operations, the safer path is to work through the support systems, queues, and reporting practices your team already uses. Replacing the help desk and automating the work at the same time creates two migrations and makes failures harder to diagnose.

Turn knowledge into a controlled production input

Your help center is only one part of the answer set. Reliable support may also depend on internal runbooks, policy clarifications, troubleshooting steps, approved reply snippets, product limitations, escalation instructions, and information held by product or customer success teams.

Bring those materials into a governed knowledge inventory. Every record should answer seven operational questions:
May 28, 2026
How to Design a Dependable CLI Agent Users Can Trust
Your CLI agent can look impressive in a controlled demo and still feel unsafe in a real repository. The moment it can edit files, invoke tools, or use credentials, users need to understand what it will do before they let it proceed.

The dependable design is rarely the one with the most capabilities. It is the one with the smallest clear promise, predictable execution, visible controls, and evidence that it succeeds repeatedly.

Define the boundary before you define the features

Start by writing an operating contract for the agent. This is a product decision, not a prompt-writing exercise. A useful contract answers five questions:
- What job does the agent complete?
- Which resources and tools may it use?
- What must it never do?
- Which actions require explicit approval?
- What observable result counts as success?
Keep the job narrow enough to explain in one sentence. If the description needs a collection of exceptions, the interface is already carrying too much ambiguity. Split the work into a clearly named subcommand or make the advanced behavior opt-in.

Treat every flag, tool, and permission as an increase in blast radius. A new option does not merely add flexibility. It creates another state the agent can misunderstand, another path you must test, and another behavior the user must learn. Reducing the surface area can improve repeatability and trust because both the agent and the user have fewer possible paths to reason about.

When reviewing a proposed capability, ask whether it makes the mental model smaller. If it does not, remove it, defer it, or isolate it behind progressive disclosure. Safe, fast defaults should handle the common case without demanding that a new user understand the entire system.

Design one boring, observable execution path

A dependable run should feel like a transaction with recognizable stages. The model can help interpret intent, but it should not invent the execution contract as it goes.
- Capture intent: Ask only for information required to resolve the task. If a missing choice would materially change the result, stop and ask.
- Retrieve context: Fetch the smallest relevant set of files, facts, or records. More context can introduce conflicting instructions and distract the agent from the requested change.
- Show the plan: Present a compact description of the intended actions, affected targets, and likely side effects.
- Preview when useful: Provide a dry run for operations whose effects the user should inspect before execution.
- Execute through narrow tools: Give each tool a deterministic input and output contract. Reject malformed responses instead of guessing what they meant.
- Verify the result: Check the resulting state and tell the user what changed, what did not, and whether any step failed.
The agent should stop when the requested scope changes, required context is unavailable, or a tool returns an unexpected result. A visible stop is easier to recover from than confident improvisation.
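
One way to picture a narrow tool contract is a parser that accepts a tool reply only when it matches the declared shape exactly. This is a minimal sketch under stated assumptions: the rename tool, the `RenameResult` fields, and the `MalformedToolResponse` error are hypothetical names chosen for illustration, not part of any particular agent framework.

```python
from dataclasses import dataclass


class MalformedToolResponse(ValueError):
    """Raised when a tool reply does not match its declared contract."""


@dataclass(frozen=True)
class RenameResult:
    # Hypothetical output contract for a hypothetical "rename file" tool.
    old_path: str
    new_path: str
    changed: bool


def parse_rename_result(raw: object) -> RenameResult:
    """Accept only the exact fields and types the contract names."""
    expected = {"old_path": str, "new_path": str, "changed": bool}
    if not isinstance(raw, dict) or set(raw) != set(expected):
        raise MalformedToolResponse(f"unexpected reply shape: {raw!r}")
    for key, kind in expected.items():
        if not isinstance(raw[key], kind):
            raise MalformedToolResponse(f"{key} must be {kind.__name__}")
    return RenameResult(**raw)
```

The parser raises instead of repairing a bad reply, so a malformed response becomes a visible stop rather than a guess.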

Favor idempotent operations wherever you can. Repeating an idempotent action produces the intended state without duplicating or compounding its effects. That property matters in a CLI because interrupted runs and retries are normal operating conditions. Test the second run as deliberately as the first.
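
As a small illustration of idempotency, the sketch below adds a line to a file only when it is missing, then runs the same operation twice. The `ensure_line` helper and the `.gitignore` example are assumptions made for this sketch.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the intended state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".gitignore"
    first = ensure_line(target, "dist/")   # first run changes the file
    second = ensure_line(target, "dist/")  # the retry leaves it untouched
    final_text = target.read_text()
```

Returning whether anything changed also gives the verification step something concrete to report to the user.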

Put human control at the blast-radius boundary

Do not ask for approval at every step. Constant prompts train users to approve without reading. Place confirmation gates where the consequence or scope changes.
- Read-only work: Make inspection and planning the default where possible.
- Scoped writes: Request access only to the specific project, service, or resource needed for the task.
- Destructive actions: Require a separate confirmation that names the target and explains the consequence.
- Credentials: Use narrowly scoped, time-bounded access rather than broad credentials that persist beyond the run.
- Expanded capability: Let users opt into advanced tools instead of quietly enabling them for every session.
A confirmation message should help the user make a decision. Replace a generic question such as “Continue?” with a concrete statement of what will be changed and whether it can be undone.
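
A confirmation of that kind might be built from a structured description of the action instead of a free-form string. The sketch below is one possible shape. The `PlannedAction` fields and the rule that the user must type the target name are assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    verb: str            # e.g. "delete"
    target: str          # the concrete resource the action touches
    reversible: bool     # whether the change can be undone afterwards
    undo_hint: str = ""  # how to undo it, when that is possible


def confirmation_prompt(action: PlannedAction) -> str:
    """Name the target and the consequence instead of asking 'Continue?'."""
    if action.reversible:
        consequence = f"This can be undone: {action.undo_hint}."
    else:
        consequence = "This cannot be undone."
    return (f"About to {action.verb} {action.target}. {consequence} "
            "Type the target name to confirm.")


def is_confirmed(action: PlannedAction, typed: str) -> bool:
    """Require the exact target name so approval cannot be a reflexive 'y'."""
    return typed.strip() == action.target
```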

Reversibility should shape the underlying implementation as well. Prefer changes that can be represented as a patch, show the proposed difference before applying it, and preserve enough information to explain how to undo the operation. When reversal is impossible, make that fact visible before execution.

Use a simple review question for each workflow: can a user predict the maximum consequence of saying yes? If the answer is unclear, the permission boundary is too broad or the confirmation arrives too late.

Prove reliability before expanding the roadmap

Do not use capability count as the measure of progress. Before adding a feature, define the task it should complete, the success threshold it must meet, and the smallest interface needed to test it. This turns roadmap discussions into observable product decisions.

Evaluate at least three outcomes: task completion, time to first successful result, and stability when the same operation is run again. A capability that succeeds once but behaves differently on a retry is not ready merely because the first demonstration worked.
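
Those three outcomes can be computed from plain run records. The following is a minimal sketch. The `Run` fields, including the `state_hash` fingerprint used to compare reruns, are hypothetical and would need to match whatever your own telemetry captures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    task_id: str
    attempt: int     # 1 for the first run, 2 for the retry, and so on
    succeeded: bool
    seconds: float   # wall-clock duration of this attempt
    state_hash: str  # fingerprint of the resulting state


def completion_rate(runs: list[Run]) -> float:
    """Share of tasks with at least one successful attempt."""
    tasks = {r.task_id for r in runs}
    done = {r.task_id for r in runs if r.succeeded}
    return len(done) / len(tasks) if tasks else 0.0


def time_to_first_success(runs: list[Run], task_id: str) -> Optional[float]:
    """Total seconds spent on a task up to and including its first success."""
    elapsed = 0.0
    attempts = sorted((r for r in runs if r.task_id == task_id), key=lambda r: r.attempt)
    for r in attempts:
        elapsed += r.seconds
        if r.succeeded:
            return elapsed
    return None  # the task never succeeded


def rerun_stable(runs: list[Run], task_id: str) -> bool:
    """True when every successful attempt left the same resulting state."""
    hashes = {r.state_hash for r in runs if r.task_id == task_id and r.succeeded}
    return len(hashes) == 1
```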

Instrument each run with Agent Analytics. Capture the input, tools selected, duration, outcome, and error pattern. Review those signals to find where the agent asks unnecessary questions, repeats tool calls, loses users, or encounters the same failure. The response may be a smaller prompt, a tighter tool contract, a safer default, or the removal of a confusing option.

Documentation belongs in this reliability loop. Keep runnable examples alongside the code and make them reflect the golden path. Treat any mismatch between documented behavior and actual behavior as a product defect. If the workflow cannot be explained and demonstrated simply, it is not yet a dependable workflow.

Use these evaluations as promotion gates. Add power only after the current path is measurable, understandable, and stable. That discipline earns you the right to expand without turning the CLI into a collection of loosely related agent behaviors.

Key takeaways
- Write the agent’s operating contract before choosing its tools or refining its prompt.
- Keep the default workflow narrow, safe, fast, and explainable in one sentence.
- Retrieve minimal context, show a compact plan, execute through deterministic contracts, and verify the result.
- Place explicit approval at destructive, irreversible, or scope-expanding boundaries.
- Measure completion, time to first success, and rerun stability before adding another capability.
- Use run telemetry and executable documentation to decide what to simplify next.
Choose one golden-path task and write its operating contract now. Then run it twice: once normally and once as a retry. Every surprise you find is a reliability requirement to resolve before you broaden the agent’s reach.

References
- Shivam.Consulting Blog — The Counterintuitive Playbook for CLI Agents: Why Ruthless Subtraction Beats Feature Creep
May 27, 2026
Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

How does it handle your real-world setup?

Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
- Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
- Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
- Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
- Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
- Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
- Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.

What does it feel like to interact with the Agent?

Two AI Agents can post the same quantitative scores, such as resolution rate and containment rate, and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
- Is the tone natural and on-brand, or does it feel robotic and generic?
- Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
- When it doesn’t know the answer, does it handle that gracefully?
- When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and that you can keep improving over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
- How does customer feedback influence the product roadmap, and can they show you examples?
- If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
- What kind of support will you get post-launch?
- Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.

Inspired by this post on The Intercom Blog.
May 22, 2026

How to Build an AI-Native Product Discovery Workflow

Your discovery stack may already hold interview transcripts, support conversations, behavioral analytics, experiment results, and roadmap assumptions. Yet the decision in a product review can still depend on whoever read the most material or built the most persuasive deck.

If adding an LLM only gives you faster summaries, the workflow is not AI-native. An AI-native discovery workflow shortens the distance from evidence to a decision while making every important claim easier to inspect. AI retrieves, structures, compares, and challenges the evidence. You remain accountable for what the evidence means and what the product team does next.

Key takeaways

- Begin every AI-assisted discovery run with an outcome, a metric, defined context, and a decision that someone needs to make.
- Preserve raw evidence and give each observation a stable identifier before asking AI to synthesize it.
- Break the workflow into bounded jobs such as retrieval, extraction, clustering, contradiction detection, and decision-brief drafting.
- Evaluate citation accuracy, evidence fidelity, counterevidence, abstention, and access controls before the output enters a roadmap discussion.
- Measure whether the workflow improves decision quality and product outcomes, not merely whether the model produces polished prose.

Frame the decision before you involve the model

Most weak discovery prompts fail before the model sees them. “Analyze the interviews,” “summarize the feedback,” and “find insights” are activities, not decisions. They give the model no principled way to distinguish useful evidence from interesting noise.

Write a short decision contract first. A useful contract specifies the outcome and metric, the context and constraints, and the decision and deliverable. Those fields turn an open-ended request into a bounded discovery task.

- Outcome and metric: Name the user or business outcome, then define the behavior or measure that represents it. Activation, funnel conversion, and retention are not interchangeable. Include the event definition and observation window used by your analytics system.
- Context and constraints: State the relevant cohort, product surface, timeframe, market, known exclusions, and data-access limits. New self-serve accounts on the web can exhibit a different pattern from established accounts or customers using another surface.
- Decision and deliverable: Say what someone will do with the answer. Ask for a ranked opportunity brief, an interview plan, a set of competing explanations, or experiment candidates only when that format supports a real pending decision.

Reusable decision prompt: Help me decide [decision]. The outcome is [outcome], measured as [metric definition]. Limit the analysis to [cohort, surface, timeframe, and constraints]. Retrieve evidence from [approved repositories]. Return [deliverable]. For every material claim, include the evidence identifier, any conflicting evidence, the affected segment, and what is still unknown. If the available evidence cannot support a recommendation, say so and specify what is missing.
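
If you keep the reusable prompt in code, a small renderer can refuse to run when a contract field is blank. This is a sketch only: the `$placeholder` names and the `render_decision_prompt` helper are assumptions of this example, and the wording simply mirrors the prompt above.

```python
from string import Template

# Wording follows the reusable decision prompt; the $placeholders are this sketch's own naming.
DECISION_PROMPT = Template(
    "Help me decide $decision. The outcome is $outcome, measured as $metric. "
    "Limit the analysis to $scope. Retrieve evidence from $repositories. "
    "Return $deliverable. For every material claim, include the evidence identifier, "
    "any conflicting evidence, the affected segment, and what is still unknown. "
    "If the available evidence cannot support a recommendation, say so and "
    "specify what is missing."
)

REQUIRED_FIELDS = ("decision", "outcome", "metric", "scope", "repositories", "deliverable")


def render_decision_prompt(**fields: str) -> str:
    """Fill the template, refusing to run with any contract field left blank."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name, "").strip()]
    if missing:
        raise ValueError("decision contract is incomplete: " + ", ".join(missing))
    return DECISION_PROMPT.substitute(fields)
```

Failing on a missing field keeps an unframed question from reaching the model in the first place.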

The last sentence matters. An AI system should be allowed to return insufficient evidence. If every run must end with a recommendation, the workflow rewards plausible completion instead of honest discovery.

Keep the outcome separate from the proposed solution. “Improve activation” is an outcome. “Validate an onboarding checklist” is already a solution choice. When you embed the solution in the prompt, AI tends to organize the available evidence around that choice instead of testing whether another opportunity matters more.

Use evidence-strength labels that a reviewer can verify rather than asking the model for an unsupported confidence percentage:

- Sufficient: Direct evidence applies to the target context, and no material contradiction remains unresolved.
- Mixed: Direct evidence and meaningful counterevidence both exist, or the pattern changes by segment.
- Insufficient: Evidence is missing, indirect, stale for the decision, or outside the target context.
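
Because the labels depend on facts a reviewer can check, they can be assigned by a simple rule instead of a model's self-assessment. The sketch below is one possible encoding. The three inputs are assumptions about what your evidence record can count, and `direct_in_context` is meant to count only evidence that is direct, current for the decision, and inside the target context.

```python
from enum import Enum


class Strength(Enum):
    SUFFICIENT = "sufficient"
    MIXED = "mixed"
    INSUFFICIENT = "insufficient"


def label_strength(direct_in_context: int,
                   unresolved_contradictions: int,
                   varies_by_segment: bool) -> Strength:
    """Map reviewer-checkable facts about the evidence to a strength label."""
    if direct_in_context == 0:
        # Nothing direct, current, and in context: missing, indirect, stale, or out of scope.
        return Strength.INSUFFICIENT
    if unresolved_contradictions > 0 or varies_by_segment:
        return Strength.MIXED
    return Strength.SUFFICIENT
```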

Build a traceable evidence pipeline, not a transcript pile

AI cannot make discovery evidence traceable if the underlying repository has already flattened observations, interpretations, and decisions into the same notes. Preserve those layers separately. My rule is simple: automate the movement and inspection of evidence before automating judgment.

| Layer | What it contains | Control that matters |
| --- | --- | --- |
| Raw evidence | Interview recordings or transcripts, support records, session evidence, and analytics query results | Keep the original record intact, access-controlled, and addressable by a stable locator |
| Evidence units | Atomic observations with metadata | Separate exact customer language, observed behavior, and analyst interpretation |
| Opportunities | Candidate needs, frictions, or desired outcomes | Attach supporting evidence, counterevidence, affected segments, and unresolved questions |
| Decisions | Choices made, rejected alternatives, assumptions, and rationale | Name the decision owner and preserve the evidence available at the time |
| Learning | Experiment results and later customer or behavioral evidence | Update the opportunity without erasing the earlier reasoning |

Each evidence unit should carry enough metadata to survive outside its original document:

- A stable evidence identifier.
- The collection date and an exact locator such as a transcript timestamp or saved analytics query.
- The relevant user segment, product surface, and journey stage.
- The raw observation, kept separate from the interpretation proposed by a person or model.
- The access, retention, and sensitivity classification.
- The opportunity, assumption, or outcome to which the evidence may relate.

This structure prevents a common failure: a model paraphrases an interview, a later summary compresses that paraphrase, and the roadmap eventually treats the compressed interpretation as a customer fact. A reviewer should always be able to move from a claim to the evidence unit and then to the original record.
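
An evidence unit with that metadata could be modeled as a small immutable record, with a helper that walks from a claim's citations back to original-record locators. This is a sketch under assumptions: the field names and the `trace_claim` helper are illustrative, not a required schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EvidenceUnit:
    evidence_id: str               # stable identifier
    collected_on: date             # collection date
    locator: str                   # e.g. a transcript timestamp or saved query name
    segment: str
    surface: str
    journey_stage: str
    observation: str               # raw observation, kept verbatim
    interpretation: Optional[str]  # proposed by a person or model, stored separately
    sensitivity: str               # access, retention, and sensitivity class
    relates_to: str                # opportunity, assumption, or outcome


def trace_claim(cited_ids: list[str], store: dict[str, EvidenceUnit]) -> list[tuple[str, str]]:
    """Walk from a claim's citations to evidence units and on to their locators."""
    unknown = [eid for eid in cited_ids if eid not in store]
    if unknown:
        raise KeyError(f"claim cites evidence that does not exist: {unknown}")
    return [(eid, store[eid].locator) for eid in cited_ids]
```

Keeping `observation` and `interpretation` as separate fields is what stops a paraphrase from being promoted to a customer fact later on.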

Apply data-governance rules before ingestion. If customer conversations contain personal, confidential, or contract-restricted information, do not copy them into an AI system until its access, retention, redaction, and model-training terms match your commitments. A more convenient synthesis workflow is not worth an unauthorized disclosure.

Retrieve the smallest useful context

Once the evidence corpus no longer fits sensibly into a prompt, use a retrieval-first pipeline with modular prompts and observable traces. Retrieval-augmented generation should select evidence relevant to the decision contract, rather than asking a general agent to reason over everything the company knows.

RAG is a grounding mechanism, not a truth guarantee. A fluent answer does not prove that the retriever found the decisive interview, the correct event definition, or the evidence that contradicts the dominant pattern. Configure retrieval to look for both support and contradiction, preserve evidence identifiers, respect access controls, and return no result when the available context does not meet the task.

An opportunity solution tree can provide the shared view above this pipeline: the desired outcome connects to opportunities, solution candidates, and tests. Treat the tree as a navigable representation of current thinking. Every important node should still resolve to evidence and assumptions beneath it.

Give AI a chain of bounded jobs

A single agent asked to interview customers, interpret feedback, size opportunities, choose a solution, and write a roadmap has too many ways to hide a weak inference. Break the work into stages with explicit inputs and review gates:

- Prepare: Give AI the outcome, assumptions, and learning gaps. Let it draft non-leading interview questions. A human checks whether the guide is testing an assumption or merely inviting agreement.
- Convert: Extract atomic observations from approved records. Require exact locators and label customer language, observed behavior, and interpretation separately.
- Synthesize: Cluster candidate opportunities without erasing segment differences. Request supporting evidence, counterevidence, and unrepresented cohorts for every cluster.
- Connect: Use behavioral analytics to examine whether the observed pattern appears in the target cohort. Interviews can expose mechanisms and unmet needs; they should not be treated as a substitute for measuring prevalence.
- Challenge: Ask for rival explanations, evidence that would reverse the conclusion, and assumptions that remain untested. This stage should consume the evidence record, not just the previous summary.
- Draft: Produce a decision brief containing the pending decision, options, evidence, contradictions, unknowns, and proposed next test. A named human accepts, revises, or rejects it.
- Learn: Attach experiment and outcome evidence to the same opportunity record. Preserve what the team believed before the test so later reviewers can inspect how the decision changed.

Pass structured artifacts between stages. If each stage receives only prose copied from the previous chat, unsupported claims can become progressively harder to distinguish from evidence.

Buy workflow plumbing; own the decision logic

You do not need to build every repository, connector, permission system, visualization, and observability screen. Licensing purpose-built opportunity-tree infrastructure can be the sensible choice when your differentiated work is the learning system rather than the canvas or collaboration layer.

Keep ownership of the parts that encode how your company makes product decisions: the decision contract, evidence schema, opportunity taxonomy, prompt modules, evaluation cases, escalation rules, and approval gates. Before choosing a platform, ask:

- Can you export the raw evidence, metadata, opportunity structure, prompts, and run traces?
- Can access rules follow the evidence through retrieval and generation?
- Can the system connect to your approved analytics and customer-evidence repositories without repeated manual copying?
- Can you evaluate a prompt or retrieval change against representative past cases?
- Can a reviewer inspect why a claim appeared and what evidence was omitted?
- Would building this capability improve the customer outcome, or merely recreate commodity workflow infrastructure?

Evaluate the workflow before it shapes the roadmap

Start evals before AI-generated conclusions become routine inputs to product reviews. The evaluation set should represent the cases the workflow will actually encounter: a clear pattern, conflicting evidence, insufficient evidence, cohort-specific behavior, stale material, duplicated records, and content the requesting user is not allowed to retrieve.

For synthesis and decision-support tasks, evaluate behavior that a reviewer can observe:

- Citation validity: Every material claim points to a real, accessible evidence identifier.
- Evidence fidelity: Quotations and behavioral facts remain faithful to the underlying record; interpretations are labeled as interpretations.
- Retrieval coverage: The output includes the evidence required to assess the target opportunity, not merely the easiest matching passages.
- Contradiction handling: Material counterevidence and segment differences are visible rather than buried.
- Abstention: The system returns insufficient evidence when the decision cannot be supported.
- Decision fit: The deliverable answers the stated decision instead of drifting into a generic summary or unrelated recommendation.
- Policy compliance: Restricted evidence stays outside unauthorized retrieval, traces, and generated output.

A strict release gate is useful here. Fail the output if it invents an evidence identifier, turns an interpretation into a quotation, ignores a material contradiction, or exposes restricted content. Those are not cosmetic defects that a polished paragraph can offset.
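
The four hard failures can be checked mechanically once claims carry their citations. The sketch below is a simplified illustration. The `Claim` shape, the verbatim-substring test for quotations, and the idea that reviewers supply a set of material counterevidence identifiers are all assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]
    quotation: str = ""  # text presented as a direct quote, if any


def gate_failures(claims: list[Claim],
                  records: dict[str, str],          # evidence_id -> raw record text
                  restricted_ids: set[str],
                  material_counter_ids: set[str]) -> list[str]:
    """Return every hard failure; an empty list means the output may proceed."""
    failures: list[str] = []
    cited: set[str] = set()
    for claim in claims:
        for eid in claim.evidence_ids:
            cited.add(eid)
            if eid not in records:
                failures.append(f"invented evidence identifier: {eid}")
            elif eid in restricted_ids:
                failures.append(f"restricted evidence exposed: {eid}")
        # A quotation must appear verbatim in at least one cited record.
        if claim.quotation and not any(
            claim.quotation in records.get(eid, "") for eid in claim.evidence_ids
        ):
            failures.append(f"quotation not found verbatim: {claim.quotation!r}")
    for eid in sorted(material_counter_ids - cited):
        failures.append(f"material contradiction ignored: {eid}")
    return failures
```

Any non-empty result fails the output outright, which matches the point that polish cannot offset these defects.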

Treat the prompt, retrieval configuration, model choice, taxonomy, and evaluation set as versioned artifacts. This is the practical value of eval-driven development and early observability: when behavior changes, you can identify the change that caused it and rerun representative cases before wider use.

For each production run, retain the decision contract, evidence identifiers retrieved, prompt and retrieval versions, generated output, reviewer edits, final decision, and later outcome. That trace lets you distinguish a retrieval failure from a synthesis failure, a weak decision contract, or a reasonable decision invalidated by new evidence.
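
To show how a retained trace separates failure types, here is a deliberately small sketch. It assumes a reviewer can name the evidence identifiers that were decisive in hindsight. The `RunTrace` fields and the `diagnose` rule are hypothetical simplifications of the fuller trace described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTrace:
    contract_id: str
    prompt_version: str
    retrieval_version: str
    retrieved_ids: frozenset  # evidence identifiers the retriever returned
    cited_ids: frozenset      # evidence identifiers the output actually used


def diagnose(trace: RunTrace, decisive_ids: frozenset) -> str:
    """Classify a bad run by where the decisive evidence dropped out."""
    if decisive_ids - trace.retrieved_ids:
        return "retrieval failure"   # the decisive evidence was never fetched
    if decisive_ids - trace.cited_ids:
        return "synthesis failure"   # it was fetched but left out of the output
    return "evidence handled"        # look at the contract or at newer evidence
```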

Model-quality checks are only one layer. Also baseline and monitor the discovery workflow itself:

- Time from a framed question to a reviewable decision brief.
- The share of material claims with inspectable evidence.
- Reviewer corrections to quotations, segments, event definitions, and interpretations.
- Decisions reopened because relevant evidence was missing or misread.
- Movement in the outcome and metric named in the original decision contract.

Do not set improvement targets until you have a baseline for the existing process. A system can make synthesis faster while increasing correction work or encouraging premature decisions. The end-to-end measure tells you whether the saved time is real.

Turn the workflow into a product operating system

AI-native discovery changes the product team’s operating model only when ownership remains explicit. The product manager or product trio owns the outcome, assumptions, and decision. Research and design judgment protects interview quality and interpretive nuance. Data and engineering ownership protects event definitions, retrieval reliability, instrumentation, and access controls. AI produces candidate artifacts. The decision owner approves the action.

Review by exception instead of rereading every generated sentence. Inspect claims marked mixed or insufficient, new opportunity clusters, segment differences, material contradictions, changed event definitions, and outputs that differ from earlier runs. This focuses human attention where judgment is most valuable without treating the model as an authority.

Roll out the workflow through one recurring, reversible discovery decision:

- Choose a decision for which customer evidence and behavioral data already exist, such as prioritizing an onboarding friction or investigating a repeated support issue.
- Baseline the current path from question to decision, including reviewer corrections and missing-evidence failures.
- Create the decision contract, evidence schema, and access rules before connecting an agent.
- Build the evaluation set from previous clear, contradictory, insufficient, segment-specific, and restricted cases.
- Run the AI workflow in shadow mode beside the existing process. Compare claims, omissions, reviewer effort, and the resulting decision without allowing the generated output to act automatically.
- Promote bounded jobs only after they pass their gates. Evidence extraction may be ready before opportunity ranking, and opportunity ranking may be ready before solution recommendations.
- Expand to another workflow only when the traces are stable, reviewers understand escalation paths, and the first use case is improving the decision process rather than merely generating more material.

At your next discovery review, do not ask what AI found. Bring one decision contract, require every consequential claim to resolve to evidence, and make the unresolved assumption visible. That is a small enough change to start immediately and a strong enough foundation for everything you automate later.

May 19, 2026

How to Prove the ROI of an AI Product Before You Scale It

Your AI product is getting used. The demos land well, task completion is improving, and internal enthusiasm is high. Then the CFO asks a harder question: what changed in the business because this product exists?

You cannot answer that question with prompt volume, response quality, adoption, or tickets touched. You need a measurement system that separates activity from incremental value, counts the full operating cost, and makes risk visible before a rollout gets larger. Here is how to build one.

Start with the decision your ROI model must support

ROI is not a retrospective slide assembled after launch. It is a decision rule. Before development begins, decide what evidence would justify launching, scaling, redesigning, rolling back, or retiring the capability.

That distinction changes the conversation. Instead of asking whether the agent is accurate enough or popular enough, you ask whether a measurable change in customer behavior produces a measurable business result without crossing an unacceptable risk threshold.

Build a driver tree with four levels:

Company outcome: revenue growth, lower cost to serve, or reduced business risk.
Customer outcome: the user completes a valuable job, reaches value sooner, or resolves a problem without unnecessary effort.
Product behavior: the AI capability changes conversion, expansion, self-service completion, containment, handle time, or escalation.
Controllable lever: the team changes the workflow, model behavior, conversation design, human review, or product guidance.

The chain matters because a model metric is rarely a business metric. Better answer quality may improve task completion, which may improve trial-to-paid conversion. The ROI case depends on the full chain, not the first link.

Value path	Business outcome	Leading evidence	Guardrails
Revenue	Higher conversion, average order value, or expansion	Time-to-first-value and self-service completion	Errors, complaints, and policy violations
Cost	Lower cost to serve	Containment, deflection, and reduced handle time	Escalations, false resolution, and downstream customer harm
Risk	Lower frequency or impact of harmful failures	Human-review events and detected violations	False positives, false negatives, hallucinations, and security breaches

Choose one primary value path for the investment case. Revenue, cost, and risk can all appear on the scorecard, but declaring all three as primary makes it too easy to rescue a weak result with whichever metric moved after launch.

A support agent, for example, may appear successful because it contains more conversations. But containment is only valuable if customers actually resolve their problems. A conversation that never reaches a human can reduce measured support volume while increasing complaints or churn risk. This is why revenue, cost, and risk measures must be evaluated together.

Write the measurement contract before you build the dashboard

A measurement contract is a short agreement among product, data, finance, and the operational team affected by the AI workflow. It prevents the definitions, cost boundaries, and success thresholds from changing after results arrive.

Your contract should answer these questions:

Who is eligible? Define the users, accounts, tasks, channels, and exclusions. Do not mix workflows with materially different economics.
What is the intervention? Name the AI capability and the version being evaluated. A model, prompt, retrieval pipeline, policy, or escalation change can alter the result.
What is the primary outcome? Select the business metric that determines whether the hypothesis passed.
What are the leading indicators? Use measures such as time-to-first-value, containment, and self-service completion to diagnose movement before lagging results mature.
What are the guardrails? Predefine acceptable limits for errors, hallucinations, false positives, false negatives, escalations, complaints, security events, and policy violations.
What is the baseline? Freeze the comparison period or control group before exposing the eligible population to the capability.
How will incrementality be proven? Specify the experiment, holdout, assignment unit, and minimum detectable effect.
What costs count? Agree on model or API consumption, labeling, evaluation, human review, and ongoing oversight before calculating value.
What action follows each result? Record the thresholds for launch, scale, redesign, rollback, and retirement.
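
The contract questions above can be kept as a structured record so that unanswered items stay visible. This is a minimal sketch: the class name, field names, and example values are assumptions chosen for illustration, with one field per question.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MeasurementContract:
    """One field per contract question; None means 'not yet agreed'."""
    eligibility: Optional[str] = None
    intervention: Optional[str] = None
    primary_outcome: Optional[str] = None
    leading_indicators: Optional[list[str]] = None
    guardrails: Optional[dict[str, float]] = None
    baseline: Optional[str] = None
    incrementality_design: Optional[str] = None
    cost_boundary: Optional[list[str]] = None
    decision_thresholds: Optional[dict[str, str]] = None


def missing_fields(contract: MeasurementContract) -> list[str]:
    """Return the names of contract questions that are still unanswered."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Illustrative draft: only two of the nine questions have been answered.
draft = MeasurementContract(
    eligibility="trial accounts on the web app",
    primary_outcome="trial-to-paid conversion",
)
print(missing_fields(draft))
```

A non-empty result from `missing_fields` is a reasonable signal that the dashboard work should wait.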

The contract should distinguish an outcome OKR (objective and key result) from an output OKR. Shipping the agent, generating responses, and increasing feature use are outputs. Improving conversion, lowering verified cost to serve, or reducing harmful failures are outcomes. Outputs can explain what happened, but they cannot establish value on their own.

Instrument the complete journey, not just the conversation

An AI log tells you what the model did. An ROI dataset must also tell you what the user did next.

Connect the journey from eligibility to business outcome:

The user or account became eligible for the capability.
The AI experience was offered, viewed, and engaged.
A task was attempted, completed, abandoned, or repeated.
A response was accepted, corrected, regenerated, or sent for human review.
The interaction was contained, escalated, or handed to another workflow.
The downstream conversion, expansion, support, retention, or complaint event occurred.
The associated model cost, labeling work, and human-oversight cost were recorded.

Carry a stable user or account identifier, experiment assignment, agent version, and journey identifier across those events. Without that connective tissue, the team may have an impressive agent dashboard and no defensible way to attribute a business outcome to the experience.
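
A simple audit can catch events that drop one of those identifiers before the data reaches an ROI analysis. The sketch below assumes events arrive as dictionaries, and the key names (`account_id`, `experiment_arm`, `agent_version`, `journey_id`) are hypothetical stand-ins for whatever your tracking plan uses.

```python
# Hypothetical identifier names; substitute the keys from your own tracking plan.
REQUIRED_KEYS = {"account_id", "experiment_arm", "agent_version", "journey_id"}


def events_missing_keys(events: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (event index, missing identifiers) for every incomplete event."""
    problems = []
    for index, event in enumerate(events):
        missing = sorted(k for k in REQUIRED_KEYS if event.get(k) in (None, ""))
        if missing:
            problems.append((index, missing))
    return problems


events = [
    {"name": "eligible", "account_id": "a1", "experiment_arm": "treatment",
     "agent_version": "v3", "journey_id": "j1"},
    {"name": "task_completed", "account_id": "a1", "experiment_arm": "treatment",
     "agent_version": "v3"},
    {"name": "conversion", "account_id": "a1", "journey_id": "j1"},
]
print(events_missing_keys(events))
```

Here the downstream conversion event has lost its experiment assignment and agent version, which is exactly the gap that makes attribution indefensible.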

Use behavioral analytics and session replay to understand why a metric moved. Use journey mapping and retention analysis to locate the friction worth solving in the first place. Product tours and in-app guidance can then help eligible users reach a validated workflow. This creates a closed loop from journey friction to experiment and measurable outcome, instead of a collection of disconnected AI metrics.

Calculate economic value without turning activity into savings

Start with net business value:

Net business value = incremental revenue + cost avoided – total operating cost – quantified risk loss

If finance requires an ROI percentage, divide net business value by the agreed investment base. Keep both the numerator and denominator visible. A percentage without its cost boundary is easy to inflate and hard to audit.
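
As a sketch, the formula and the optional ROI percentage can be written so that every term stays a named input. The function names and the example figures are illustrative assumptions, not benchmarks.

```python
def net_business_value(
    incremental_revenue: float,
    cost_avoided: float,
    total_operating_cost: float,
    quantified_risk_loss: float = 0.0,
) -> float:
    """Net business value as defined above; all inputs share one currency and period."""
    return incremental_revenue + cost_avoided - total_operating_cost - quantified_risk_loss


def roi_percentage(net_value: float, investment_base: float) -> float:
    """ROI as a percentage of the agreed investment base."""
    if investment_base <= 0:
        raise ValueError("investment base must be positive")
    return 100.0 * net_value / investment_base


# Illustrative figures only.
value = net_business_value(
    incremental_revenue=120_000,
    cost_avoided=30_000,
    total_operating_cost=90_000,
    quantified_risk_loss=10_000,
)
print(value, roi_percentage(value, investment_base=100_000))
```

Reporting `value` and the investment base alongside the percentage keeps the numerator and denominator auditable.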

Count only incremental revenue

Do not credit the AI product with every transaction it touched. Credit it with the difference between the exposed population and the valid control or holdout.

A practical revenue calculation is:

Incremental revenue = eligible volume × measured outcome lift × value per additional outcome

The measured outcome might be trial-to-paid conversion, self-service upsell, average order value, or expansion. Use the same eligibility definition, attribution window, and revenue treatment for the intervention and control. If the AI experience merely appears somewhere in a successful journey, that is influenced revenue, not proof of incremental revenue.
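
A minimal sketch of that calculation follows, assuming the lift is the absolute difference in outcome rate between the exposed group and the control. The counts and the value per outcome are made-up illustration values.

```python
def outcome_rate(outcomes: int, eligible: int) -> float:
    """Share of eligible users who reached the outcome."""
    if eligible <= 0:
        raise ValueError("eligible population must be positive")
    return outcomes / eligible


def incremental_revenue(
    eligible_volume: int,
    treated_rate: float,
    control_rate: float,
    value_per_outcome: float,
) -> float:
    """Credit only the lift over control, never every transaction touched."""
    lift = treated_rate - control_rate
    return eligible_volume * lift * value_per_outcome


# Illustrative experiment: 12% conversion when exposed, 10% in the holdout.
treated = outcome_rate(outcomes=600, eligible=5_000)
control = outcome_rate(outcomes=500, eligible=5_000)
print(incremental_revenue(10_000, treated, control, value_per_outcome=500.0))
```

Crediting every one of the converting exposed users would overstate the result several times over, because most of them would have converted without the AI experience.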

Separate capacity from cashable savings

Cost claims require more care than a deflection count. A contained interaction may create capacity without reducing expenditure. That capacity can still be valuable, but it should not be presented as cash savings unless spending actually changes.

Capacity created: employees have time available for other work, but the existing cost base remains.
Variable cost avoided: the company no longer incurs a cost that would have grown with each additional interaction.
Cashable savings: an approved budget, vendor charge, or staffing requirement is actually reduced.

Report these separately. Otherwise, the same saved minute can be counted once as employee capacity and again as reduced spend.

Validate that a deflected task was resolved, not abandoned or displaced to another channel. Then calculate avoided cost from the incremental lift in verified resolution, not the total number of conversations the agent handled.
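
The sketch below shows one way to separate verified resolution from raw containment and then price only the lift over control. The conversation fields (`contained`, `recontacted`, `moved_to_other_channel`) and the figures are assumptions for illustration.

```python
def is_verified_resolution(conversation: dict) -> bool:
    """Contained, with no recontact and no move to another channel."""
    return (
        conversation["contained"]
        and not conversation["recontacted"]
        and not conversation["moved_to_other_channel"]
    )


def verified_resolution_rate(conversations: list[dict]) -> float:
    if not conversations:
        raise ValueError("need at least one conversation")
    verified = sum(1 for c in conversations if is_verified_resolution(c))
    return verified / len(conversations)


def avoided_cost(eligible_volume, treated, control, variable_cost_per_interaction):
    """Avoided cost from the incremental lift in verified resolution."""
    lift = verified_resolution_rate(treated) - verified_resolution_rate(control)
    return eligible_volume * lift * variable_cost_per_interaction


def conv(contained, recontacted=False, moved=False):
    return {"contained": contained, "recontacted": recontacted,
            "moved_to_other_channel": moved}


# Illustrative samples: raw containment in the treated group is 3 of 4,
# but only 2 of 4 conversations count as verified resolutions.
treated = [conv(True), conv(True), conv(True, recontacted=True), conv(False)]
control = [conv(True), conv(False), conv(False), conv(True, moved=True)]
print(avoided_cost(1_000, treated, control, variable_cost_per_interaction=4.0))
```

Whether the resulting figure is capacity, variable cost avoided, or cashable savings is still a separate reporting decision.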

Include the operating costs that make the agent dependable

Model or API cost is only one part of the investment. Include labeling, evaluation, human review, and operational oversight. If a safer workflow requires more review, that review is part of the product’s economics, not an external inconvenience to exclude from the model.

Segment cost by agent, workflow, and outcome. Cost per response is useful for infrastructure management, but cost per verified successful outcome is the better economic unit. A cheap response that triggers retries, escalations, or corrections may be more expensive than a higher-cost response that completes the job.
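
To illustrate why the economic unit matters, this sketch compares two hypothetical workflows. All figures and cost categories are invented for the example.

```python
def cost_per_verified_outcome(cost_lines: dict[str, float], verified_outcomes: int) -> float:
    """Total operating cost divided by verified successful outcomes."""
    if verified_outcomes <= 0:
        return float("inf")  # no verified outcomes: the unit cost is unbounded
    return sum(cost_lines.values()) / verified_outcomes


# Workflow A: cheap responses, heavy human correction, fewer verified outcomes.
workflow_a = {"model": 50.0, "human_review": 400.0}
# Workflow B: costlier responses that usually complete the job.
workflow_b = {"model": 200.0, "human_review": 50.0}

print(cost_per_verified_outcome(workflow_a, verified_outcomes=300))
print(cost_per_verified_outcome(workflow_b, verified_outcomes=500))
```

In this made-up case, workflow B costs four times as much per response on the model line yet is cheaper per verified outcome.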

Do not bury risk inside an average ROI number

Risk adjustment should make uncertainty visible, not create false precision. Use three layers:

Hard guardrails: security and policy conditions that trigger containment or rollback regardless of financial upside.
Observed risk indicators: error, hallucination, escalation, complaint, false-positive, and false-negative rates tracked by workflow and cohort.
Financial adjustment: expected loss deducted from net value only when the probability and impact assumptions are credible enough for finance and risk owners to accept.

Do not let a low-frequency, high-consequence failure disappear inside a high average success rate. If the downside cannot be defensibly monetized, keep it as an explicit decision constraint rather than assigning it a convenient dollar value.

Prove incrementality before claiming impact

The strongest ROI calculation still fails if the attribution is weak. A before-and-after improvement may come from seasonality, pricing, traffic quality, a support policy change, or another product release. The AI capability needs a counterfactual: what would have happened to comparable eligible users without it?

Use an A/B test or holdout whenever the product and risk profile allow it. Make these choices before launch:

Assignment unit: Randomize at the level where the outcome occurs. If expansion is measured per account, account-level assignment can prevent users in the same customer organization from receiving conflicting experiences.
Primary outcome: Pick the metric that determines success and keep diagnostic metrics secondary.
Minimum detectable effect: Precompute the smallest lift worth detecting based on the baseline, available population, and business value. If the experiment cannot detect a decision-relevant change, extending the metric list will not fix it.
Guardrails: Test quality, escalation, complaints, security, and policy outcomes alongside the primary metric.
Analysis population: For a product-level ROI claim, analyze all eligible users according to their assigned experience (an intent-to-treat analysis). Looking only at people who voluntarily used the agent introduces selection bias.
Measurement horizon: Keep the holdout long enough to observe the outcome named in the contract. Leading indicators can guide iteration, but they should not be substituted for retention, churn, net revenue retention, or other lagging outcomes.
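
For the minimum detectable effect, a rough planning check is to estimate the sample needed per arm before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions in a two-sided test. The function name and the defaults (5% significance, 80% power) are assumptions, and a real design may need adjustments for clustering or unequal arms.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users per arm to detect an absolute lift of `mde_abs`."""
    if not 0 < baseline_rate < 1 or not 0 < baseline_rate + mde_abs < 1:
        raise ValueError("rates must fall strictly between 0 and 1")
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative: 10% baseline conversion, 2-point absolute lift worth detecting.
needed = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
print(needed)
```

If the eligible population over the measurement horizon is smaller than twice this number, the experiment is unlikely to detect the decision-relevant change.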

If randomization is not practical, use a fixed holdout or a frozen comparison period and document the limitations. A weaker design can still inform a decision, but the ROI claim should carry less confidence. Do not quietly promote correlation to causation because the rollout has executive attention.

Interpret the result as a system. Suppose self-service completion rises but the business outcome does not. The agent may be solving a low-value task, attracting users who would have converted anyway, or shifting effort to a later step. If conversion improves while complaints or policy violations cross the guardrail, the value hypothesis may be valid but the implementation is not ready to scale.

This is eval-driven development applied to product economics: define acceptable behavior and business success, measure both under controlled conditions, diagnose the failures, and repeat the test after a meaningful change.

Turn ROI into a portfolio operating system

A one-time business case goes stale as models, prompts, traffic, user behavior, and operating costs change. Maintain an Agent Analytics view for every production capability.

Each agent scorecard should show:

The primary business outcome and current experiment result.
Leading journey metrics from eligibility through verified completion.
Revenue contribution, cost avoided, and total operating cost using the agreed definitions.
Quality and risk guardrails, including escalations and human-review events.
Performance by relevant customer, task, and journey cohort.
The agent, model, policy, and workflow version associated with the result.
The current decision status: exploring, launching, scaling, redesigning, contained, or retiring.

Use the dashboard to make portfolio decisions, not merely to report trends:

Scale when the primary outcome clears the precommitted threshold, guardrails hold, net value is positive, and the result remains credible across the cohorts that matter.
Redesign when leading indicators improve but the business outcome does not, or when human review and escalation erase the economic gain.
Contain or roll back when a hard security, policy, or customer-harm threshold is breached, even if average financial performance is positive.
Retire when controlled measurement shows no decision-relevant incrementality or when dependable operation costs more than the value created.
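
Those four rules can be sketched as a single precommitted function. This is a simplification under stated assumptions: the scorecard fields are hypothetical, the thresholds come from the measurement contract, and real portfolios will need more nuance where redesign and retirement overlap.

```python
from dataclasses import dataclass


@dataclass
class AgentScorecard:
    hard_guardrail_breached: bool
    primary_lift: float
    precommitted_lift: float
    leading_indicators_improved: bool
    net_value: float
    credible_across_key_cohorts: bool


def portfolio_decision(card: AgentScorecard) -> str:
    # Hard guardrails are checked first so financial upside cannot override them.
    if card.hard_guardrail_breached:
        return "contain_or_roll_back"
    outcome_cleared = card.primary_lift >= card.precommitted_lift
    if outcome_cleared and card.net_value > 0 and card.credible_across_key_cohorts:
        return "scale"
    # Some signal of value exists, but the outcome or the economics fall short.
    if card.leading_indicators_improved or outcome_cleared:
        return "redesign"
    return "retire"


# Illustrative card: profitable on average, but a hard threshold was breached.
card = AgentScorecard(True, 0.03, 0.02, True, 40_000.0, True)
print(portfolio_decision(card))
```

The ordering of the checks is the design choice worth keeping: the guardrail test runs before any financial comparison.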

Review operational signals with frontline teams because they can explain patterns hidden by aggregate metrics. Review portfolio value in quarterly business reviews (QBRs) with product, data, finance, and risk owners so investment follows evidence rather than novelty.

Only accelerate adoption after the workflow has demonstrated unit value. In-app guides, product tours, and lifecycle nudges can bring more eligible users into a validated flow. Measure whether those interventions increase the business outcome, not merely clicks or agent sessions. Scaling exposure to an unproven workflow scales its cost and risk as readily as its potential benefit.

Key takeaways

Treat ROI as a precommitted decision rule for launch, scale, redesign, rollback, or retirement.
Connect model behavior to customer behavior and then to revenue, cost, or risk through a driver tree.
Freeze the baseline, cost boundary, guardrails, attribution method, and success thresholds before results arrive.
Credit only incremental revenue and verified avoided cost. Keep created capacity separate from cashable savings.
Include model consumption, labeling, evaluation, human review, and oversight in the operating cost.
Use controlled experiments or holdouts, with a decision-relevant minimum detectable effect, to separate causal impact from correlation.
Keep severe risk conditions as explicit constraints when they cannot be responsibly converted into a financial estimate.
Scale adoption only after the AI workflow has shown positive unit value under acceptable risk.

Pick one high-friction customer journey and complete its measurement contract before the next roadmap review. If the team cannot name the baseline, control, primary outcome, cost boundary, guardrails, and decision thresholds, the capability is still an exploration. Label it honestly, instrument it properly, and earn the right to make an ROI claim.

May 15, 2026

AI-Enabled Enzymatic Recycling: A Product Leader’s Playbook

You have an AI-enabled materials proposal in front of you, a promising set of enzyme candidates, and a difficult decision: fund another round of discovery or start building toward industrial scale. The candidate sequences may be impressive, but they are not yet the product.

Your decision should turn on whether the full system can repeatedly transform a defined waste stream into usable monomers at an economically viable cost. That framing connects model performance, laboratory evidence, process engineering, and commercial reality before an exciting demonstration becomes a stranded pilot.

Define the product around recovered monomers

Only 10% of manufactured plastic gets recycled. That low rate is not merely a sorting or consumer-behavior problem. Traditional recycling commonly shortens polymer chains instead of restoring their original molecular building blocks, so the resulting material can lose quality and move toward downcycling.

Enzymatic recycling changes the intended output. An engineered enzyme can deconstruct a polymer into its original monomers, which can then become inputs for new, high-quality plastic. The difference is fundamental: the product is not processed waste or a smaller plastic fragment. It is recovered molecular feedstock.

This distinction gives you a better product boundary. A generated protein sequence is a feature. An enzyme that shows activity in one assay is a technical result. The product is a repeatable monomer-recovery system with a defined input, output, operating envelope, and cost structure.

Before approving a roadmap, require the team to define five contracts:

Input contract: Which polymer, packaging format, mixture, and contamination profile will the process accept? “Mixed plastic” is not a specification. Name the included materials and the variation the system must tolerate.
Transformation contract: Which polymer bonds must the enzyme break, and what conversion and selectivity must the reaction demonstrate?
Output contract: Which monomers will be recovered, what downstream use must they support, and how will the team determine that the output is suitable for that use?
Operating contract: What reaction conditions, throughput, energy consumption, and process controls must hold outside a small laboratory assay?
Economic contract: Which cost per ton must the integrated process approach, and which assumptions currently separate measured economics from projected economics?

Selectivity is especially important. An enzyme can target a particular plastic within a mixed waste stream, potentially reducing the need to treat every input as chemically identical. But selectivity does not make an undefined waste stream manageable. The process still needs to know which target material is present, whether the enzyme can reach it, and how the desired products will be recovered.

Write the product brief in one sentence: For this defined feedstock, transform this polymer into these monomers, within this operating envelope, output specification, and cost boundary. If a number is unknown, leave a visible blank and assign an experiment to fill it. Do not hide the uncertainty inside a broad ambition such as “make plastic circular.”

Build the AI as a closed learning system

AI changes the economics of searching enzyme-design space. Protein language models can generate candidates, multi-step agents can coordinate specialized tasks, and computational evaluations can eliminate weak options before scarce laboratory capacity is used. Advances in protein structure prediction have expanded what can be explored, but prediction does not remove the need for physical validation.

The useful architecture is therefore not a model that emits sequences. It is a closed loop in which every physical result makes the next design round better. Rhea’s Factory combines protein language models, an agentic pipeline, domain constraints, and proprietary wet-lab feedback. The product lesson is broader than any one implementation: generation, evaluation, experimentation, and learning need to operate as one traceable system.

Encode the objective. Convert the product contract into machine-readable constraints: target polymer, desired products, acceptable operating conditions, and the metrics that will decide whether a candidate advances.
Generate candidates. Explore multiple plausible designs rather than optimizing immediately around the first promising family.
Apply computational gates. Reject candidates that violate explicit constraints, preserve the reasons for rejection, and rank the remaining candidates for laboratory use.
Run controlled wet-lab experiments. Test candidates under recorded conditions and capture successes, failures, and inconclusive results.
Update domain predictions. Use the measured outcomes to improve ranking and candidate selection for the next round.
Feed process evidence back into discovery. When a candidate struggles under reactor or feedstock conditions, turn that failure into a new design constraint instead of treating it as a separate engineering problem.

Agentic AI is valuable here because the workflow is multi-step, not because an agent should make every decision autonomously. At each handoff, define the required input, expected output, validator, and failure behavior. A generation step should not advance an incomplete candidate. A computational score should not be presented as a laboratory observation. A promising assay should not silently become a scale claim.
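
A handoff with an explicit validator and failure behavior might look like the sketch below. It is not a description of any real pipeline: the gate names, candidate fields, and placeholder values are all hypothetical.

```python
from typing import Callable, Optional

# A validator returns None when the handoff is acceptable, or a reason string.
Validator = Callable[[dict], Optional[str]]


def has_required_fields(candidate: dict) -> Optional[str]:
    missing = [k for k in ("sequence", "target_polymer") if not candidate.get(k)]
    return f"missing fields: {missing}" if missing else None


def score_is_labeled_computational(candidate: dict) -> Optional[str]:
    if candidate.get("evidence_type") != "computational":
        return "score must be tagged as computational, not as a lab observation"
    return None


def run_handoffs(candidate: dict, gates: list[tuple[str, Validator]]) -> dict:
    """Stop at the first failed gate and keep the reason for rejection."""
    for gate_name, validator in gates:
        reason = validator(candidate)
        if reason is not None:
            return {"status": "rejected", "gate": gate_name, "reason": reason}
    return {"status": "advanced", "gate": None, "reason": None}


gates = [
    ("generation", has_required_fields),
    ("computational_scoring", score_is_labeled_computational),
]
# Placeholder values only; no real sequence or polymer is implied.
incomplete = {"sequence": "", "target_polymer": "polymer_A", "evidence_type": "computational"}
complete = {"sequence": "SEQ_PLACEHOLDER", "target_polymer": "polymer_A", "evidence_type": "computational"}
print(run_handoffs(incomplete, gates))
print(run_handoffs(complete, gates))
```

Returning the gate and the reason, rather than a bare failure, preserves the rejection record that later design rounds can learn from.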

Exploration also needs an explicit lane. Higher model-sampling temperatures can produce more unusual enzyme candidates and reach beyond the safest local variations. Controlled model “hallucination” can be useful during candidate exploration when downstream guardrails prevent novelty from being mistaken for evidence.

Separate the candidate portfolio into three buckets: improvements near known winners, adjacent designs that test a clear hypothesis, and high-variance exploration. Give each bucket a deliberate laboratory budget. Raise sampling temperature only in the exploratory lane, and never allow generated assay values, reaction outcomes, or scale results into the measured-data record.

The durable advantage sits in the feedback data. In a narrow, high-signal domain, even hundreds of relevant proprietary laboratory observations can support a useful domain prediction model. That is not a general claim that small datasets are always sufficient. It means contextual quality can matter more than indiscriminate volume when the problem, assay, and outcomes are tightly defined.

For every experiment, preserve enough context to make the result reusable:

The enzyme identity, sequence version, and design lineage.
The target polymer, material format, mixture, and relevant contamination profile.
The assay and protocol version used for the test.
The reaction conditions and duration.
The measured conversion, selectivity, yield, and uncertainty available from the experiment.
The full result, including failure, no-result, and inconclusive outcomes.
The relationship between the candidate, computational evaluations, physical test, and model or data release.
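
The context listed above maps naturally onto a record type, with a store that refuses anything that is not a physical observation. The schema below is a hedged sketch: the field names, the `source` tag, and the example values are assumptions, not a published data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_RESULT = "no_result"
    INCONCLUSIVE = "inconclusive"


@dataclass
class ExperimentRecord:
    enzyme_id: str
    sequence_version: str
    design_lineage: str
    target_polymer: str
    material_format: str
    contamination_profile: str
    protocol_version: str
    reaction_conditions: dict
    duration_hours: float
    outcome: Outcome
    conversion: Optional[float]  # None when the run produced no usable value
    selectivity: Optional[float]
    monomer_yield: Optional[float]
    model_release: str
    source: str  # "wet_lab" or "generated"


class MeasuredDataRecord:
    """Append-only store that accepts physical observations only."""

    def __init__(self) -> None:
        self._records: list[ExperimentRecord] = []

    def add(self, record: ExperimentRecord) -> None:
        if record.source != "wet_lab":
            raise ValueError("generated values may not enter the measured-data record")
        self._records.append(record)

    def __len__(self) -> int:
        return len(self._records)


# Illustrative failed run: it is stored with the same context as a success.
failed_run = ExperimentRecord(
    "enz-001", "v2", "parent:enz-000", "polymer_A", "film", "none recorded",
    "assay-protocol-3", {"temperature_c": 40}, 24.0, Outcome.FAILURE,
    None, None, None, "model-release-1", "wet_lab",
)
store = MeasuredDataRecord()
store.add(failed_run)
print(len(store))
```

Storing failures with full context, while rejecting generated values outright, keeps the record usable as training signal.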

A spreadsheet of winning sequences is not a data moat. A traceable record of why candidates were proposed, how they were tested, what failed, and how each result changed the next decision can become one.

Use stage gates that end in physical evidence

AI product teams often gravitate toward a model leaderboard because it creates a clean sense of progress. Enzymatic recycling does not have one adequate master score. A candidate can look structurally plausible and fail in the lab. It can perform in a controlled assay and miss the required throughput. It can convert the polymer and still lose economically once the rest of the process is counted.

Use a hierarchy of evidence that moves from design compliance to laboratory performance, operating fit, and scale economics:

Gate	Decision question	Required evidence	Red flag
Design compliance	Does the candidate satisfy the stated target and pipeline constraints?	Deterministic checks, recorded constraint evaluations, and candidate provenance	A candidate advances mainly because it appears novel
Wet-lab performance	Does the enzyme convert the target with the required selectivity under defined conditions?	Repeatable measured observations, including negative and inconclusive runs	Only the best run is retained or shared
Operating fit	Does useful performance hold within the intended controlled, low-temperature process and throughput requirements?	Process measurements tied to reaction conditions, conversion, yield, throughput, and energy use	Activity is reported without the process context needed to interpret it
Scale economics	Can the integrated system move toward cost parity with inexpensive oil-based plastic?	A cost and energy model tied to measured inputs, with assumptions and sensitivities exposed	Commercial viability is inferred from enzyme activity alone

Set pass, hold, and stop conditions before seeing the result. Otherwise, an interesting candidate will repeatedly earn one more experiment while the commercial requirement drifts. Relative improvement is useful for learning, but an enzyme that is twice as good as an unusable baseline may still be unusable. Every relative metric should sit beside the absolute requirement it is meant to approach.
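
A minimal sketch of precommitted gate conditions, with the relative metric reported beside the absolute requirement, could look like this. The thresholds and conversion figures are invented for illustration.

```python
def gate_decision(measured: float, pass_at: float, stop_below: float) -> str:
    """Apply thresholds that were fixed before the result was seen."""
    if stop_below > pass_at:
        raise ValueError("stop threshold cannot exceed pass threshold")
    if measured >= pass_at:
        return "pass"
    if measured < stop_below:
        return "stop"
    return "hold"


def improvement_report(candidate: float, baseline: float, requirement: float) -> dict:
    """Show relative progress next to the absolute requirement."""
    return {
        "relative_improvement": candidate / baseline,
        "meets_absolute_requirement": candidate >= requirement,
    }


# Illustrative: twice as good as the baseline, still far from the requirement.
report = improvement_report(candidate=0.2, baseline=0.1, requirement=0.9)
print(report)
print(gate_decision(measured=0.2, pass_at=0.9, stop_below=0.3))
```

The report makes the gap explicit: a doubling over baseline that still fails the absolute bar does not earn another automatic round.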

Keep conversion, selectivity, yield, throughput, and energy per ton separate. Combining them too early into a single score can conceal the actual tradeoff. A team should be able to show why it is advancing a faster candidate with lower selectivity, or a more selective candidate with a different operating burden, without claiming that the candidates are equivalent.

Three common metric substitutions deserve direct scrutiny:

Low reaction temperature is not automatically low total energy. Count the energy demands of the complete process rather than the enzyme reaction in isolation.
Polymer conversion is not automatically usable monomer recovery. Measure whether the desired output can be recovered to the specification required downstream.
Bench performance is not automatically scaled performance. Treat increasing process scale as a new evidence gate, not a routine deployment step.

My rule is simple: model output can earn laboratory time; only measured process evidence can earn scale capital.

Plan the roadmap backward from cost parity

The commercial benchmark is unforgiving. Enzymatic recycling ultimately has to compete with inexpensive oil-based plastic production. A greener reaction that cannot approach a viable delivered cost will remain dependent on special conditions rather than becoming a broadly adopted circular process.

Build the economic model while discovery is still underway. At minimum, separate these cost lines:

Feedstock acquisition, sorting, and rejected material.
Preparation required before the enzyme can act on the target polymer.
Enzyme production, delivery, useful lifetime, and replacement.
Reactor capacity, reaction time, process control, and energy.
Monomer recovery and purification.
Waste handling, downtime, and variability in plant utilization.

Do not wait for perfect values. Use ranges, label each input as measured or assumed, and run sensitivity analysis. The purpose is to identify which uncertain variable can kill the business case. If enzyme lifetime dominates cost, another candidate-generation run may be rational. If purification dominates, generating thousands of additional sequences may be a distraction from the real constraint.
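
A one-at-a-time sensitivity pass is enough to start. The sketch below assumes a simple additive cost-per-ton model. The cost lines, ranges, and measured-or-assumed labels are placeholders rather than real process data.

```python
def sensitivity_ranking(cost_lines: dict[str, dict]) -> tuple[float, list[tuple[str, float, str]]]:
    """Return the base cost per ton and each line's swing, largest first.

    In an additive model, moving one line from its low to its high value
    while holding the others at base changes the total by (high - low).
    """
    base_total = sum(line["base"] for line in cost_lines.values())
    swings = [
        (name, line["high"] - line["low"], line["source"])
        for name, line in cost_lines.items()
    ]
    swings.sort(key=lambda item: item[1], reverse=True)
    return base_total, swings


# Placeholder cost-per-ton ranges; "source" marks measured versus assumed inputs.
cost_lines = {
    "feedstock": {"low": 80, "base": 100, "high": 130, "source": "measured"},
    "enzyme": {"low": 50, "base": 120, "high": 300, "source": "assumed"},
    "purification": {"low": 90, "base": 110, "high": 150, "source": "assumed"},
    "energy": {"low": 40, "base": 50, "high": 70, "source": "measured"},
}
base_total, ranking = sensitivity_ranking(cost_lines)
print(base_total)
for name, swing, source in ranking:
    print(name, swing, source)
```

When the largest swing belongs to an assumed input, the next experiment should usually aim at measuring that input.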

Pair every scientific milestone with an industrial question:

Discovery gate: Is activity and selectivity reproducible enough to justify process work?
Process gate: Does the candidate perform inside the intended operating envelope rather than only under a convenient assay condition?
Feedstock gate: Does performance survive representative material formats and mixtures, including difficult packaging such as clamshells?
Demonstration gate: Can the system sustain the required material flow, output quality, and energy profile at a scale that tests the major engineering assumptions?
Commercial gate: Does the cost case remain credible when feedstock composition, utilization, throughput, and other sensitive inputs move away from the preferred case?

A planned 5,000-ton demonstration plant in California illustrates why demonstration capacity belongs on the product roadmap. A plant is not simply a larger laboratory. It tests whether biology, equipment, controls, feedstock variability, and recovery operations behave as an integrated product.

Before committing meaningful scale capital, ask six kill questions:

Which assumption has the largest effect on delivered cost per ton?
Which inputs are measured, and which still come from a design estimate?
At what physical scale was each important input measured?
What fails first when the feedstock mix changes?
If enzyme performance improves as planned, which downstream step becomes the bottleneck?
Which observed result will stop, narrow, or materially redesign the program?

Expansion into additional plastics should follow the same discipline. Enzyme selectivity creates a plausible path toward enzyme blends for mixed streams, and new plastic types and mixed-plastic blends remain important development directions. Treat each added polymer as a new product vertical with its own input contract, assays, process interactions, recovery requirements, and economics. A new enzyme is not automatically a low-cost extension of the first process.

Key takeaways for your next roadmap review

Define success as repeatable recovery of specified monomers, not the generation of novel enzyme sequences.
Run discovery as a closed loop connecting product constraints, AI generation, computational gates, wet-lab measurements, and process feedback.
Treat proprietary experimental context—including failures—as the data asset; candidate count alone is not a defensible moat.
Use separate gates for design compliance, laboratory performance, operating fit, and scale economics.
Work backward from cost parity and direct the next experiment toward the assumption that most threatens the integrated business case.

For your next review, ask the team to bring one page containing the input and output contracts, a diagram of the learning loop, the current stage-gate thresholds, the experimental data schema, and a cost sensitivity model with measured and assumed inputs clearly separated. Every roadmap item should change one of those artifacts or produce evidence for a named decision.

If the team cannot fill those fields yet, that is the immediate product work. The first defensible milestone is one traceable loop from a defined industrial problem through candidate generation, laboratory measurement, and an updated cost model. Repeat that loop with increasing realism before increasing capital exposure. That is how you determine whether programmable biology is becoming an industrial recycling product rather than remaining an impressive AI demonstration.

References

Product Talk — How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

May 14, 2026

No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.
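To make signals like these concrete, here is a minimal sketch of how they could be computed from conversation summaries. The `Conversation` record and the metric definitions are assumptions for illustration, not the Agent Analytics schema or the definitions the team used. In particular, escalation rate is only a rough stand-in for escalation quality, which would need human or rubric-based grading.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conversation:
    """Hypothetical per-conversation summary; not an Agent Analytics schema."""
    agent_turns: int
    clarifying_questions: int
    resolved: bool
    escalated: bool

def behavior_signals(conversations):
    """Aggregate illustrative personality signals over a batch of conversations."""
    if not conversations:
        raise ValueError("need at least one conversation")
    total_turns = sum(c.agent_turns for c in conversations)
    # Conversations the agent finished without a human handoff.
    self_resolved = [c for c in conversations if c.resolved and not c.escalated]
    return {
        # Share of agent turns that were clarifying questions.
        "clarification_question_rate": (
            sum(c.clarifying_questions for c in conversations) / total_turns
            if total_turns else 0.0
        ),
        # Proxy for resolution-path efficiency: fewer turns is better.
        "avg_turns_to_resolution": (
            sum(c.agent_turns for c in self_resolved) / len(self_resolved)
            if self_resolved else None
        ),
        "escalation_rate": sum(c.escalated for c in conversations) / len(conversations),
    }

sample = [
    Conversation(agent_turns=4, clarifying_questions=1, resolved=True, escalated=False),
    Conversation(agent_turns=6, clarifying_questions=3, resolved=False, escalated=True),
    Conversation(agent_turns=2, clarifying_questions=0, resolved=True, escalated=False),
]
signals = behavior_signals(sample)
```

Computing the same dictionary for each prompt variant gives an A/B comparison a fixed set of numbers to compare.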

To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.
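A decision point of this kind can be pictured as a small rule: act when the required context is present, ask one concise question when it is not, and hand off once a question budget is spent. The function below is a hypothetical sketch of that idea. The field names, the `max_questions` budget, and the three outcomes are assumptions, not the thresholds actually used.

```python
def next_step(required_fields, known_context, questions_asked, max_questions=2):
    """Return ("act" | "ask" | "escalate", field_to_ask_about_or_None)."""
    # Context may have been filled by retrieval or by earlier user answers.
    missing = [f for f in required_fields if not known_context.get(f)]
    if not missing:
        return ("act", None)       # enough context: proceed autonomously
    if questions_asked >= max_questions:
        return ("escalate", None)  # avoid interrogating the user
    return ("ask", missing[0])     # one question at a time

decision = next_step(["account_id", "date_range"], {"account_id": "a1"}, questions_asked=0)
```

Better retrieval shrinks `missing` before the rule runs, so fewer conversations ever reach the "ask" branch.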

To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.

Inspired by this post on Amplitude – Best Practices.

May 13, 2026

I Pointed a “Ralph Wiggum” AI Loop at My Product for a Week—The Data That Stopped Chaos

I spent a week pointing a "Ralph Wiggum loop" at my product to see how far an agentic AI could take pragmatic, everyday improvements without human micromanagement. It was equal parts exhilarating and nerve-wracking. The short version: the loop moved fast and broke assumptions, but Amplitude analytics kept it from going off the rails—and turned chaos into controlled acceleration.

By "Ralph Wiggum loop," I mean a deliberately naive, endlessly curious cycle: try something small, ship it behind a flag, watch the data, then try again. It is the product equivalent of a fearless intern who experiments constantly. That energy is invaluable for discovery, but it absolutely demands strong guardrails and a clear definition of success.

Before I started, I framed the outcomes I cared about: user activation within the first session, reduction in time-to-value, and early retention indicators. I set baselines and a minimum detectable effect (MDE) for A/B testing so the loop could distinguish noise from signal. I also documented a driver tree of behaviors we wanted to influence and ensured every event was cleanly instrumented in Amplitude analytics to support reliable behavioral analytics.
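An MDE only helps if the cohort is large enough to detect it. As a rough illustration, the sketch below uses the standard normal-approximation formula for comparing two proportions to estimate users per arm. The 5% significance level, 80% power, and the example baseline are assumptions, and this is not how Amplitude computes sample sizes.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm to detect an absolute lift of mde_abs (two-sided test)."""
    treated_rate = baseline_rate + mde_abs
    if not (0 < baseline_rate < 1 and 0 < treated_rate < 1) or mde_abs == 0:
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Hypothetical example: 30% baseline activation, 3-point absolute MDE.
users_needed = sample_size_per_arm(0.30, 0.03)
```

Halving the MDE roughly quadruples the required sample, which is one reason small cohorts behind a flag are better suited to catching regressions than to confirming subtle wins.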

The guardrails mattered most. I put every change behind feature flags with instant rollback. I defined "off the rails" conditions upfront, including regression thresholds for activation and retention analysis, and enabled anomaly detection to surface unexpected spikes or drops. Session replay was ready to diagnose confusion fast, and I kept a daily evaluation cadence so the loop never ran unattended for long.
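Writing the "off the rails" conditions down as data makes the daily check mechanical. The sketch below assumes each guardrail is a baseline plus a maximum tolerated relative drop. The metric names and thresholds are hypothetical, and the actual rollback would go through whatever feature-flag tool is in use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    metric: str
    baseline: float
    max_relative_drop: float  # 0.05 tolerates a 5% relative drop

def breached_guardrails(guardrails, observed):
    """Return the metrics that fell below their floor or are missing."""
    breaches = []
    for g in guardrails:
        value = observed.get(g.metric)
        floor = g.baseline * (1 - g.max_relative_drop)
        # Missing data counts as a breach: an unmeasured change is not safe.
        if value is None or value < floor:
            breaches.append(g.metric)
    return breaches

def should_roll_back(guardrails, observed):
    return bool(breached_guardrails(guardrails, observed))

# Hypothetical thresholds and readings.
guardrails = [
    Guardrail("activation_rate", baseline=0.40, max_relative_drop=0.05),
    Guardrail("day7_retention", baseline=0.25, max_relative_drop=0.10),
]
observed = {"activation_rate": 0.37, "day7_retention": 0.24}
breaches = breached_guardrails(guardrails, observed)
```

Treating a missing metric as a breach is a deliberate choice here: a loop that cannot be measured should pause rather than keep shipping.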

Day by day, the loop proposed micro-experiments: onboarding copy variants, tooltip timing, in-app guide sequencing, and subtle changes to progressive disclosure. Each iteration shipped behind a flag to a small cohort. I watched leading indicators in real time, then zoomed out to cohort views to guard against short-term gains that might erode longer-term value. When something looked promising, we expanded exposure methodically; when something looked risky, we paused immediately.

We had a pivotal moment where the loop suggested a bolder call-to-action that spiked activation. On the surface, it looked like a win. Amplitude cohorts told a fuller story: downstream engagement softened, and anomaly detection flagged a pattern that hinted at premature conversion rather than genuine intent. A quick rollback through feature flags saved the week—and reminded me why eval-driven development should be the default for agentic AI workflows.

The most surprising part was how quickly the loop unlocked small compounding gains once the measurement scaffolding was in place. With a unified analytics platform and crisp guardrails, the system became a safe sandbox where the AI could explore aggressively while we stayed anchored to outcomes. The combination of behavioral analytics, A/B testing discipline, and daily human review turned raw speed into durable learning.

My takeaways are direct. Agentic AI can accelerate discovery, but only if you define stop conditions and wire strict feedback loops into your stack. Measurement is product strategy here—without it, you get noisy activity instead of progress. Invest in instrumentation first, treat feature flags as non-negotiable, and let anomaly detection and session replay be your early warning system. Most of all, tie every experiment to activation, engagement, or retention, not vanity metrics.

If you’re considering your own week with a "Ralph Wiggum loop," start painfully small, constrain the blast radius, and insist on decision-quality data. Do that, and you’ll turn a chaotic agent into a compounding engine for product discovery—one that moves fast, learns faster, and stays on track.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026

From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of agentic and data driven product development." That vision resonates deeply with how I lead teams: anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators—activation, engagement depth, and retention—then instrument product telemetry with Amplitude analytics and behavioral analytics to surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action, grounded in a retrieval-first pipeline and governed by eval-driven development. This demands that product managers become fluent in LLMs and practical prompt engineering, and that AI strategy rigorously covers data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

Cross-functional cadence is everything. I partner closely with principal AI engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and OKRs that reward outcomes rather than output to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
