Category: AI Strategy

From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

But AI has expanded what’s just now possible—for our products, and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

A clear, color‑coded map from desired outcome to opportunities, solutions, and assumption tests—showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.

Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.

Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.

That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.

On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.
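To make the no-data-loss constraint concrete, here is a minimal sketch of a node and a check that flags anything that vanished during an update. The `Node` fields and the `lost_items` helper are hypothetical names for illustration, not the service's actual schema, and the sketch treats any disappeared reference as something to review rather than deciding whether the removal was intended.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One opportunity in the tree (field names are illustrative assumptions)."""
    name: str
    sources: set[str] = field(default_factory=set)   # supporting evidence from interviews
    children: set[str] = field(default_factory=set)  # names of sub-opportunities


def lost_items(before: dict[str, Node], after: dict[str, Node]) -> dict[str, set[str]]:
    """Return every source or child reference present before the update but missing after."""
    def collect(tree: dict[str, Node], attr: str) -> set[str]:
        items: set[str] = set()
        for node in tree.values():
            items |= getattr(node, attr)
        return items

    return {
        "sources": collect(before, "sources") - collect(after, "sources"),
        "children": collect(before, "children") - collect(after, "children"),
    }


before = {"A": Node("A", {"1", "2", "3"}, {"x", "y", "z"})}
# A careless update that silently drops source 3 and child z.
after = {"A": Node("A", {"1", "2"}, {"x", "y"})}

report = lost_items(before, after)
print(report)
```

The check works on the whole tree rather than node by node, so evidence that moves between opportunities is not reported as lost.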

In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our alpha limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.

Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.

At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.

Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.

When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
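The example can be reproduced with a naive before/after comparison. This is a sketch only: the dictionary layout and the `naive_diff` function are assumptions made for illustration, not the deterministic code the pipeline used.

```python
# Original and final trees from the example above (layout is illustrative).
original = {
    "A": {"sources": {1, 2, 3}, "children": {"x", "y", "z"}},
    "C": {"sources": {4, 5}, "children": {"t", "u", "v"}},
}
final = {
    "A": {"sources": {1, 2}, "children": {"x"}},
    "C": {"sources": {3, 4, 5}, "children": {"t", "u", "v", "y", "z"}},
}


def naive_diff(before, after):
    """Compare two trees node by node, reporting only what was added or removed."""
    empty = {"sources": set(), "children": set()}
    changes = []
    for name in sorted(before.keys() | after.keys()):
        old = before.get(name, empty)
        new = after.get(name, empty)
        for key in ("sources", "children"):
            removed = old[key] - new[key]
            added = new[key] - old[key]
            if removed:
                changes.append((name, key, "removed", sorted(removed)))
            if added:
                changes.append((name, key, "added", sorted(added)))
    return changes


diff = naive_diff(original, final)
for change in diff:
    print(change)
```

Every line of the diff mentions only A or C. B, the node that explains the movement, never appears because it exists in neither the original nor the final tree.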

See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.

That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.

For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set didn’t actually generate the recommendation.
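The second class of error can be detected by replaying the change set against the original tree and comparing the result with the tree the model recommended. The sketch below assumes a hypothetical operation format (`action`, `node`, `new_node`, `keep`, `delete`, and so on) and supports only split and merge; the real action language had more moves.

```python
import copy


def apply_change_set(tree, change_set):
    """Replay each operation on a deep copy of the tree and return the result."""
    tree = copy.deepcopy(tree)
    for op in change_set:
        if op["action"] == "split":
            source_node = tree[op["node"]]
            moved_sources = set(op["move_sources"])
            moved_children = set(op["move_children"])
            source_node["sources"] -= moved_sources
            source_node["children"] -= moved_children
            tree[op["new_node"]] = {"sources": moved_sources, "children": moved_children}
        elif op["action"] == "merge":
            removed = tree.pop(op["delete"])
            kept = tree[op["keep"]]
            kept["sources"] |= removed["sources"]
            kept["children"] |= removed["children"]
        else:
            raise ValueError(f"unsupported action: {op['action']}")
    return tree


original = {
    "A": {"sources": {1, 2, 3}, "children": {"x", "y", "z"}},
    "C": {"sources": {4, 5}, "children": {"t", "u", "v"}},
}
recommended = {
    "A": {"sources": {1, 2}, "children": {"x"}},
    "C": {"sources": {3, 4, 5}, "children": {"t", "u", "v", "y", "z"}},
}
good_steps = [
    {"action": "split", "node": "A", "new_node": "B",
     "move_sources": [3], "move_children": ["y", "z"]},
    {"action": "merge", "keep": "C", "delete": "B"},
]
# Same moves, but the split forgets to hand child z to B.
bad_steps = [
    {"action": "split", "node": "A", "new_node": "B",
     "move_sources": [3], "move_children": ["y"]},
    {"action": "merge", "keep": "C", "delete": "B"},
]

good_matches = apply_change_set(original, good_steps) == recommended
bad_matches = apply_change_set(original, bad_steps) == recommended
print(good_matches, bad_matches)
```

A mismatch here is exactly the discrepancy described above: the steps and the recommended tree disagree, so a reviewer accepting the steps one by one would end up somewhere else.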

Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
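The parent-child merge rule can be written as a small guard. This is a sketch under assumptions: `children_of` is a hypothetical parent-to-children mapping, and the rule that a child may never absorb its parent is one reading of the example above, not a confirmed description of the service's rules.

```python
def check_merge(children_of, keep, delete):
    """Return (ok, message) for a directional merge of `delete` into `keep`."""
    if delete in children_of.get(keep, []):
        # The parent survives, so its other children keep their parent.
        return True, f"{keep} absorbs its child {delete}"
    if keep in children_of.get(delete, []):
        # The parent would be deleted, leaving its other children without one.
        orphans = [child for child in children_of[delete] if child != keep]
        return False, f"deleting parent {delete} would orphan {orphans}"
    return True, "nodes are not parent and child"


children_of = {"A": ["B", "C", "D"]}

allowed = check_merge(children_of, keep="A", delete="B")
rejected = check_merge(children_of, keep="B", delete="A")
print(allowed)
print(rejected)
```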

Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.

Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, that meant roughly a quarter to two-fifths of all runs still failed, assuming errors land independently across calls. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
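The compounding is easy to check. The calculation below is a back-of-the-envelope sketch that assumes each LLM call is one "instance" and that errors are independent across calls; neither assumption is stated in the original account.

```python
def run_failure_rate(per_call_error: float, calls_per_run: int) -> float:
    """Probability that at least one call in a run produces a discrepancy."""
    return 1 - (1 - per_call_error) ** calls_per_run


low = run_failure_rate(1 / 40, 10)   # shortest runs
high = run_failure_rate(1 / 40, 20)  # longest runs
print(f"{low:.1%} to {high:.1%} of runs fail")
```

Under those assumptions, a 1-in-40 per-call rate leaves somewhere between about 22% and 40% of runs with at least one bad change set.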

Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
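Detection of the "delete a node, then use it later" conflict might look like the sketch below. The `uses` and `deletes` keys are hypothetical annotations on each operation, introduced here for illustration. Note that the function only reports the conflict; it makes no attempt to guess what the model meant.

```python
def find_use_after_delete(change_set):
    """Report every step that references a node an earlier step already deleted."""
    deleted_at = {}
    problems = []
    for step, op in enumerate(change_set, start=1):
        for name in op.get("uses", []):
            if name in deleted_at:
                problems.append(
                    f"step {step}: {op['action']} uses {name}, deleted at step {deleted_at[name]}"
                )
        for name in op.get("deletes", []):
            deleted_at.setdefault(name, step)
    return problems


# "Delete A, then merge A with B" from the example above.
steps = [
    {"action": "delete", "uses": ["A"], "deletes": ["A"]},
    {"action": "merge", "uses": ["A", "B"], "deletes": ["A"]},
]

problems = find_use_after_delete(steps)
print(problems)
```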

A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code. I knew exactly when a change set failed, and why. No amount of prompt tuning alone was fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses those instructions to repair the change set and calls the tool again. Iterate until success or a max number of turns.
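The shape of that loop can be sketched in a few lines. Everything here is a stand-in: `fake_model` replaces a real LLM call with a hardcoded first attempt and fix, `validate` checks a single toy rule, and the harness calls the validator directly rather than letting the model invoke it as a tool.

```python
def repair_loop(propose, validate, max_turns=3):
    """Ask for a change set, validate it, and feed failures back until it passes."""
    feedback = None
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)
        ok, feedback = validate(change_set)
        if ok:
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")


def fake_model(feedback):
    """Stand-in for the model: wrong on the first try, corrected once it sees feedback."""
    if feedback is None:
        return [{"action": "merge", "keep": "B", "delete": "B"}]
    return [{"action": "merge", "keep": "C", "delete": "B"}]


def validate(change_set):
    """Toy validator: a node cannot be merged into itself."""
    for step, op in enumerate(change_set, start=1):
        if op["action"] == "merge" and op["keep"] == op["delete"]:
            return False, f"step {step}: cannot merge {op['keep']} into itself"
    return True, "ok"


result, turns = repair_loop(fake_model, validate)
print(turns, result)
```

The turn cap matters as much as the validator: without it, a change set the model cannot repair would loop indefinitely.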

I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.

Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

Third, I learned when I can and can’t trust Claude to write code for me. Since Opus 4.6 came out, I gave Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I simply trusted that Claude got it right, when in fact it didn’t. If you don’t have the right infrastructure—planning, testing, code review—this can be disastrous. I’ll be investing more here and sharing what I learn.

Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.

Inspired by this post on Product Talk.

May 13, 2026

AI-Assisted Product Strategy: A Practical Operating System

You can get an AI model to produce a roadmap in minutes. That is precisely the problem. A polished roadmap can hide weak evidence, unresolved trade-offs, and a strategy that never made a real choice.

The useful question is not whether AI can do product management work. It is where AI should accelerate the path from evidence to decision, where human judgment must remain explicit, and how you will know the resulting strategy is working. The operating system below gives you that separation.

Key takeaways

Give AI a defined role in the decision process. It can extract, organize, challenge, and draft; the product leader still owns choices, trade-offs, and commitments.
Build a strategy chain from customer problem to business result before asking AI for initiatives. Otherwise, the model will fill strategic gaps with plausible language.
Ground every workflow in canonical product context, and require every important claim to point back to evidence.
Use AI to shorten discovery synthesis, not to turn a limited set of interviews or support conversations into false market certainty.
Carry the same strategic hypothesis through the roadmap, experiment, launch, and learning review. Changing the success definition between those stages makes measurement meaningless.

Start with decision architecture, not a better prompt

Most weak AI-assisted strategy work begins with an underspecified request: analyze this feedback, prioritize these ideas, or build a roadmap. The model responds by making silent assumptions about the customer, the business objective, and the meaning of priority. Its output may read well while answering a question nobody deliberately chose.

Write a decision brief before opening the model. This is not a conventional product requirements document. It is a compact contract defining the decision AI is helping you make.

Decision: State the choice in one sentence. For example, decide which onboarding opportunity deserves discovery capacity in the next planning cycle.
Target customer and context: Name the segment, job, and situation. Feedback from an administrator configuring an account should not be blended with feedback from an end user completing a daily task.
Desired outcome: Identify the customer behavior you want to change and the business result it is expected to influence.
Evidence in scope: List the interviews, behavioral data, support conversations, journey maps, and prior experiments the model may use.
Constraints: Include privacy requirements, technical dependencies, commercial commitments, capacity limits, and non-goals.
Decision owner: Name the person accountable for accepting the trade-off. An AI-generated recommendation does not distribute accountability.

Build a strategy chain the model can inspect

Your strategy should form a traceable chain:

Choose the customer and job that matter.
Define the value proposition, including what must match the market and what should be meaningfully different.
Name the customer outcome and business outcome.
Break that outcome into drivers the product can influence.
Select an opportunity supported by evidence.
Form a testable product bet.
Decide what evidence would justify continuing, changing, or stopping.

A driver tree makes this chain concrete. It creates a visible connection between roadmap work and measures such as activation, retention, expansion, and Net Revenue Retention. AI is useful here as a critic. Ask it to identify unsupported jumps, duplicated drivers, initiatives disguised as outcomes, and metrics the proposed product change cannot plausibly affect.
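One of those checks, finding initiatives with no path to the outcome, is mechanical enough to sketch. The `parent_of` mapping, the initiative names, and the driver names below are invented for illustration.

```python
def unmapped_initiatives(parent_of, initiatives, outcome):
    """Return initiatives whose driver has no path up the tree to the outcome."""
    def reaches_outcome(node):
        seen = set()
        while node is not None and node not in seen:
            if node == outcome:
                return True
            seen.add(node)
            node = parent_of.get(node)
        return False

    return sorted(
        name for name, driver in initiatives.items() if not reaches_outcome(driver)
    )


# Hypothetical driver tree: child driver -> parent driver (or the outcome itself).
parent_of = {"activation": "retention", "setup completion": "activation"}
# Hypothetical roadmap items: initiative -> driver it claims to move.
initiatives = {
    "guided setup checklist": "setup completion",
    "dark mode": "visual polish",
}

orphans = unmapped_initiatives(parent_of, initiatives, outcome="retention")
print(orphans)
```

A check like this catches unsupported jumps in structure only. Whether the claimed driver can plausibly be moved by the initiative still needs a human or a separate challenge pass.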

Keep outputs and outcomes separate. Shipping an AI onboarding assistant is an output. Changing a defined activation behavior for a defined customer segment is an outcome. The model can help rewrite output-oriented objectives, but it cannot choose a credible target without baseline data, business context, and an accountable owner.

Force a distinction between fact, inference, and assumption

Require the model to label every material statement as one of three things:

Observed: Directly supported by a supplied interview, event, support conversation, or experiment.
Inferred: A reasonable interpretation that combines observations but is not explicitly stated by the customer or proven by the data.
Assumed: Necessary for the recommendation to work but not yet supported by the supplied evidence.

This simple classification prevents an attractive narrative from laundering assumptions into facts. It also improves discovery planning: the most consequential assumption with the weakest evidence becomes a candidate for the next test.

A useful instruction is: Use only the supplied material. For every recommendation, show the observations that support it, the inference connecting those observations to the recommendation, the assumptions that remain, and the evidence that could disprove it. If support is missing, say that it is missing.
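If the model returns its statements in a structured form, the labels can be audited automatically. The `Claim` record and `audit` function below are a hypothetical sketch of such a check, not a prescribed schema.

```python
from dataclasses import dataclass, field

LABELS = {"observed", "inferred", "assumed"}


@dataclass
class Claim:
    """One material statement from the model's output (fields are illustrative)."""
    text: str
    label: str
    evidence: list[str] = field(default_factory=list)  # ids of supplied material


def audit(claims):
    """Flag unknown labels and 'observed' claims that cite no supplied evidence."""
    issues = []
    for claim in claims:
        if claim.label not in LABELS:
            issues.append(f"unknown label: {claim.text}")
        elif claim.label == "observed" and not claim.evidence:
            issues.append(f"observed claim lacks evidence: {claim.text}")
    return issues


claims = [
    Claim("Admins abandon setup at the permissions step", "observed", ["interview-3"]),
    Claim("Customers want faster onboarding", "observed"),
    Claim("A checklist would reduce abandonment", "assumed"),
]

issues = audit(claims)
print(issues)
```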

Build a controlled workflow from context to decision record

AI assistance becomes reliable when it is a workflow rather than a chat session. A chat encourages improvisation: context changes, instructions disappear, and nobody can reconstruct why an answer looked different the next time. A workflow gives each pass a defined input, output, and approval gate.

Ground the model in canonical product context

Start with a retrieval-first set of canonical documents. At minimum, that context should include the current vision, product strategy, target segments, value proposition, OKRs, metric definitions, analytics dashboards, relevant discovery evidence, decision history, and definition-of-done checks.

Canonical does not mean comprehensive. More context can make conflicts harder to notice. Give each item an owner, a freshness indicator, and an authority level. If an old positioning document conflicts with the approved strategy, the workflow should identify the conflict rather than silently averaging the two.

Include exclusions as well. Tell the model which documents are historical, which metrics are deprecated, which segments are out of scope, and which proposals have already been rejected. Without those boundaries, previously abandoned ideas can return as apparently new recommendations.

Separate extraction, synthesis, challenge, and approval

Extract: Pull observations, customer language, events, metrics, decisions, and unresolved questions from the supplied material. Preserve links to the original evidence.
Synthesize: Group related observations and propose opportunity statements. Keep contradictory evidence visible.
Challenge: Look for alternative explanations, missing segments, weak causal claims, metric gaming, dependencies, and reasons the recommendation could fail.
Decide: Have the accountable product leader and relevant partners accept, modify, or reject the recommendation. Record the trade-off explicitly.
Publish: Store the decision, evidence, owner, expected outcome, guardrails, and next review trigger in the system the team already uses.

Do not combine these passes into one request for a final answer. Extraction should not quietly prioritize. Synthesis should not hide inconvenient evidence. A challenge pass should test a proposed direction without changing the original evidence set. The human approval gate should be visible, not implied by the fact that somebody copied the output into a roadmap.

Raw interviews, support threads, CRM records, and analytics exports can contain personal or confidential data. Do not paste them into an unapproved model. Minimize the data, remove identifiers that are not needed for the decision, use the governed environment approved by your organization, and retain only what the workflow requires. Privacy-by-design belongs at intake because redacting an output does not undo an inappropriate disclosure in the input.

For recurring workflows, add acceptance criteria and evaluation cases. A discovery synthesis evaluation might check whether every theme retains evidence links, whether contradictions survive summarization, and whether unsupported market-size claims are rejected. A strategy evaluation might check whether every initiative maps to an outcome driver and whether an output has been mislabeled as an objective. Re-run those checks when the model, prompt, context set, or output schema changes.
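A discovery synthesis evaluation of that kind could start as small as the sketch below. The theme format, the evidence ids, and the `evaluate_synthesis` function are assumptions for illustration; a real evaluation would run against saved cases.

```python
def evaluate_synthesis(themes, known_evidence_ids, contradiction_ids):
    """Check that themes keep valid evidence links and that contradictions survive."""
    failures = []
    cited = set()
    for theme in themes:
        links = set(theme.get("evidence", []))
        if not links:
            failures.append(f"theme '{theme['name']}' has no evidence links")
        unknown = links - known_evidence_ids
        if unknown:
            failures.append(f"theme '{theme['name']}' cites unknown evidence {sorted(unknown)}")
        cited |= links
    dropped = contradiction_ids - cited
    if dropped:
        failures.append(f"contradicting evidence dropped: {sorted(dropped)}")
    return failures


known = {"int-1", "int-2", "int-3", "ticket-9"}
contradictions = {"int-3"}  # evidence known to cut against the main pattern
themes = [
    {"name": "setup friction", "evidence": ["int-1", "int-2"]},
    {"name": "pricing confusion", "evidence": []},
]

failures = evaluate_synthesis(themes, known, contradictions)
print(failures)
```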

Use AI in discovery without laundering uncertainty

Discovery generates exactly the kind of material language models handle well: interview transcripts, support conversations, journey notes, behavioral patterns, and open-ended hypotheses. AI can reduce the time between collecting this material and discussing it. It cannot make a biased sample representative or turn a correlation into a cause.

Run synthesis as part of a weekly learning cadence that combines customer evidence with journey and behavioral analysis. Waiting for a large quarterly research readout increases the distance between observation and decision. Treating every new conversation as a roadmap mandate creates the opposite problem. A regular review gives the team a stable point at which evidence can accumulate, conflict, and change an existing belief.

A cluster is a lead, not a finding

Theme clustering is useful for navigation. It is not proof of importance. A frequent topic in support data may reflect product friction, a noisy customer segment, a documentation gap, or a recent incident. The model sees only the supplied dataset, not the market outside it.

Require each proposed opportunity to include:

The affected segment and the context in which the problem occurs.
The job the customer is trying to complete.
Links to supporting observations, including direct customer language where it preserves important nuance.
The observed count within the supplied dataset, clearly distinguished from prevalence in the customer base or market.
Behavioral evidence that supports or challenges the qualitative pattern.
The outcome driver the opportunity could influence.
Contradictory evidence and plausible alternative explanations.
The unanswered question that creates the greatest decision risk.
The next piece of evidence that would materially change the decision.

Then place the opportunity in an opportunity solution tree. Keep the opportunity separate from candidate solutions. If the branch says customers need an AI assistant, it has already collapsed a customer problem into a preferred implementation. Rewrite it in terms of the customer’s obstacle or desired progress, then generate multiple ways to address it.

At the weekly review, ask four practical questions: What did the team observe? Which belief changed? Which important assumption remains weakly supported? What evidence should be collected next? AI can prepare the evidence packet and show deltas from the prior review. The product trio should decide what the evidence means and whether it changes the opportunity being pursued.

Connect roadmap, experiment, launch, and learning

A strategy loses integrity when each delivery stage invents its own explanation. The roadmap promises one outcome, the experiment measures another, the launch emphasizes a feature, and the retrospective celebrates shipping. AI can help maintain the thread, but only if the same hypothesis and metric definitions travel with the work.

Decision layer	Useful AI assistance	Required human judgment	Artifact to preserve
Strategy	Check the chain from customer value to business result and expose unsupported jumps	Choose the segment, differentiation, outcome, and trade-offs	Strategy brief and driver tree
Discovery	Extract observations, cluster themes, retain contradictions, and draft opportunities	Interpret evidence and choose the next uncertainty to reduce	Evidence-linked opportunity record
Roadmap	Map candidate initiatives to drivers, surface dependencies, and prepare option comparisons	Allocate capacity and accept opportunity cost	Prioritization decision record
Experiment	Draft hypotheses, instrumentation, guardrails, edge cases, and analysis checks	Approve the test design, statistical assumptions, and decision rule	Experiment brief
Launch	Adapt release notes, in-product guidance, support material, and segment messaging	Approve claims, rollout risk, positioning, and readiness	Launch plan and approved message set
Learning	Summarize funnels, cohorts, retention patterns, qualitative feedback, and anomalies	Decide whether to continue, revise, expand, or stop	Learning review and updated decision

Make the roadmap show its reasoning

Ask AI to produce roadmap options, not a single supposedly objective ranking. Each option should show the outcome driver it targets, evidence strength, important dependencies, unresolved risk, stakeholder impact, and the work displaced by choosing it. A priority score can organize inputs, but it cannot resolve a strategic disagreement about which customer or outcome matters most.

Every roadmap item should answer: Why this customer problem, why now, what behavior should change, which business result should follow, and what observation would make the team reconsider? If the answer is merely that customers requested it or a competitor has it, the strategy is incomplete.

Make experiments decision-ready before they run

An AI-drafted experiment brief should contain a falsifiable hypothesis, eligible population, primary metric, guardrail metrics, instrumentation plan, exposure logic, expected mechanism, known confounders, and decision rule. For A/B testing, define the minimum detectable effect before the test runs, not after the results arrive. The value must be tied to a practically meaningful change and checked against baseline behavior and available traffic; a model cannot infer those constraints from a feature description.
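To see how the minimum detectable effect interacts with baseline behavior and traffic, a rough sample-size estimate helps. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the 20% baseline, 2-point effect, 5% significance level, and 80% power are placeholder values, not recommendations.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Placeholder inputs: 20% baseline conversion, 2-point absolute lift.
n_two_points = sample_size_per_variant(baseline=0.20, mde=0.02)
# Halving the detectable effect roughly quadruples the required sample.
n_one_point = sample_size_per_variant(baseline=0.20, mde=0.01)
print(n_two_points, n_one_point)
```

If the required sample exceeds the traffic the eligible population can supply in a reasonable window, the effect size, the population, or the decision rule has to change before the experiment starts.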

Instrumentation deserves its own review. Specify the event, properties, eligibility conditions, trigger, and expected sequence in the funnel. Use behavioral analytics to check that exposure and activation are measured consistently across variants. Feature flags can separate deployment from release, support a controlled ramp, and limit exposure while the team checks behavior.

For an AI-powered product experience, add eval-driven checks alongside product metrics. Define the behavior the model should exhibit, edge cases it must handle, unacceptable outputs, privacy constraints, and regression cases. Product success cannot compensate for a model behavior that violates an explicit safety or trust requirement.

Keep launch language tied to the original value proposition

AI can adapt UX copy, product tours, tooltips, release notes, in-app guides, and support macros for different segments. Give every channel the same approved value proposition, capability boundaries, terminology, and claims. Otherwise, speed creates message drift: the release note promises an outcome the interface does not support, while the support macro describes a different workflow again.

After release, bring the original decision brief into the learning review. Examine the target cohort, funnel behavior, activation, retention, qualitative feedback, and guardrails. Do not ask only whether the feature was adopted. Ask whether the intended customer behavior changed, whether the assumed mechanism appears credible, and whether the business outcome remains a reasonable consequence.

Scale the workflow only when another person can audit it

Before expanding AI assistance across the product organization, hand one completed decision package to a colleague who was not part of the workflow. They should be able to identify the governing strategy, trace each important claim to evidence, see which assumptions remain open, understand the trade-off, and find the metric that will trigger the next decision.

If they cannot, do not solve the problem with a longer prompt. Repair the missing artifact, unclear ownership, broken evidence link, or inconsistent metric definition. That is where strategic reliability lives.

Start with one decision entering your next weekly discovery review. Build its evidence set, label observations and assumptions, run separate synthesis and challenge passes, and publish the human decision with its reversal signal. Once that chain survives review, reuse the workflow. The goal is not more AI-generated product work. It is a shorter, more inspectable path from customer evidence to a measurable strategic choice.

May 12, 2026

What the Intercom-to-Fin Rebrand Teaches Product Leaders
If you are deciding whether an AI product should become your company name, you probably do not have a naming problem. You have a portfolio commitment problem. The rename will make your bet visible, but it will also force you to explain what existing customers still own, what will keep improving, and what now defines the company’s future.

The Intercom-to-Fin move offers a clean way to think about that decision. The company is now named Fin, while Intercom remains its customer service software platform; Intercom 2 has also launched as a complete rebuild with continued investment behind it. The growth brand moves up to the corporate level without erasing the durable product brand beneath it. That is the strategic work of this rebrand.

The decisive choice is what you do not rename

The most important word in this rebrand is not Fin. It is “remains.” Intercom remains a product, a customer commitment, and a place where the company can keep creating value. Fin becomes the corporate identity and the clearest expression of the next growth thesis.

Changing the company name while retaining the established product name is not an incomplete rebrand. It is deliberate brand architecture. The two names answer different customer questions:
- The company brand answers: What future is this organization building toward?
- The product brand answers: What can I buy, operate, renew, and rely on today?
- The category brand answers: What new capability should I understand, budget for, and compare with alternatives?
Those answers do not always belong under one name. Forcing them together can make the new strategy sound smaller than it is or make the established product appear to be on its way out. Keeping Intercom as the platform avoids turning corporate ambition into accidental product deprecation.

Before approving a similar rename, write a transition contract. This is not a legal document. It is a short internal statement that every product, sales, marketing, support, recruiting, finance, and communications leader can use without improvising. It should answer:
1. Exactly which entity is being renamed?
2. Which products keep their current names?
3. What changes for an existing customer because of the rename?
4. What explicitly does not change?
5. Where will investment increase, continue, or decline?
6. How should someone describe the relationship between the company and each product?

If the answers vary by executive, your organization is not ready to communicate the rename. Customers will encounter every inconsistency as a separate strategic story.

A new category needs a clean place in the buyer’s mind

Established brands are efficient because buyers use them as shorthand. The same shorthand becomes restrictive when a company wants to define a substantially different category. People do not continuously reassess every vendor from first principles. They attach new information to what they already believe.

That is why a legacy name can create friction even when it has strong awareness and customer trust. The problem is not that buyers dislike the old brand. The problem is that they already know where to file it. Every pitch for the new category begins with a correction: the company you associate with one product is now asking you to understand it as something else.

Fin had time to develop as a distinct service-agent identity before becoming the company name. The business introduced Fin three years before the corporate rename and deliberately led with that name while keeping Intercom in the background. That sequence matters. It allowed the category proposition to earn meaning before the corporate identity was placed behind it.

You should look for the same underlying evidence before elevating a product brand:
- Prospects ask for the new product or category by name instead of treating it as another feature of the established platform.
- The product has a distinct job, competitive set, buying conversation, and roadmap.
- Your largest resource-allocation decisions increasingly revolve around the new category.
- The existing company name repeatedly requires explanation before buyers understand the new proposition.
- The legacy product can remain a coherent, investable business under its own name.
- Leadership is willing to keep prioritizing the category when it competes with comfortable, near-term work elsewhere in the portfolio.

Wait if the new product still depends almost entirely on legacy demand, if “AI” is the only thing making it sound like a new category, or if leaders cannot explain the future of the existing portfolio. A corporate rename should settle a strategic truth that is already visible in the product and resource decisions. It cannot manufacture that truth.

Test the strategy before you test the name

Name preference is the least important question at the start. A memorable name cannot rescue an unstable thesis, and a room full of favorable reactions cannot prove that the proposed architecture makes sense. Test the decisions the name is meant to encode.

Strategic permanence

Ask whether the new identity can survive normal product evolution. A company named after a feature will eventually outgrow its name. A company named for a durable category, customer outcome, or long-term platform has more room to expand.

Pressure-test the choice against plausible roadmap changes. If the current interface changes, the underlying models improve, or the product expands into adjacent workflows, does the name still represent the company? If one disappointing planning cycle would make leadership retreat to the old story, the corporate rename is premature.

Customer comprehension

Do not ask customers whether the new brand “makes sense.” That question invites politeness. Show them the proposed naming hierarchy without an explanation and ask them to describe:
- What the company does.
- What they can buy.
- What happened to the existing product.
- Which name they expect to see in the application, documentation, support experience, and commercial relationship.
- Whether the new offering feels like a feature, a product, a platform, or a category.

The vocabulary in their answers matters more than a preference score. If customers merge the company and product into one ambiguous object, the hierarchy needs work. If established customers assume their product is being replaced, the continuity story is too weak. If prospects still describe the company only through the old category, the new position has not yet become legible.

Portfolio durability

Every product affected by the rename needs a stated fate: promoted, retained, integrated, or retired. Silence creates its own answer, and customers usually interpret it as declining commitment.

The Intercom-to-Fin architecture avoids that ambiguity. The corporate brand follows the AI growth engine, while the established platform receives a rebuilt product and continued investment. You can apply the same discipline by requiring a roadmap, owner, customer promise, and success measure for every brand that survives the transition.

Operating commitment

A company name is a resource-allocation claim. Check whether hiring plans, executive attention, roadmap capacity, sales enablement, partner priorities, and operating metrics already support the future implied by the name.

This is where weak rebrands reveal themselves. The homepage changes, but planning continues to favor the old center of gravity. Sales compensation rewards the previous motion. Product teams keep describing the AI offer as an add-on. Recruiting language promises one future while internal goals fund another. If those contradictions remain, the market will believe the operating behavior rather than the new identity.

Turn the rebrand into an operating model

A corporate rename touches more than brand assets. It changes the nouns people use to make product, commercial, and technical decisions. Treat it as a cross-functional migration with a defined architecture, owners, dependencies, and observable failure modes.

Before launch, remove internal ambiguity

Start with an inventory of named objects. Separate the corporate brand, legal entity, product names, application name, AI agent, domains, documentation, status pages, integrations, partner listings, support channels, and customer-facing team names. They may not all change together, and some should not change at all.

Create a controlled vocabulary for each object. Record the approved name, a plain-language definition, the transition phrase, phrases to avoid, and the person responsible for exceptions. Then apply it to roadmap documents, release notes, sales materials, onboarding, job descriptions, support macros, analytics labels, and executive reporting. This prevents each function from inventing a slightly different portfolio.
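One way to make the controlled vocabulary enforceable is to store each entry as a small record and check copy against it. The following is a minimal sketch, not a prescribed tool: the `VocabularyEntry` fields mirror the items named in the paragraph above, while `find_violations`, the sample product name, the avoided phrases, and the owner label are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyEntry:
    approved_name: str
    definition: str
    transition_phrase: str
    phrases_to_avoid: tuple
    exception_owner: str


# Hypothetical entry; the product name and phrases are invented for illustration.
VOCABULARY = [
    VocabularyEntry(
        approved_name="Example Platform",
        definition="The established product that customers buy, operate, and renew.",
        transition_phrase="Example Platform, from ExampleCo",
        phrases_to_avoid=("legacy platform", "old product"),
        exception_owner="brand architecture lead",
    )
]


def find_violations(text, entries):
    """Return (avoided phrase, approved name) pairs that appear in the text."""
    lowered = text.lower()
    return [
        (phrase, entry.approved_name)
        for entry in entries
        for phrase in entry.phrases_to_avoid
        if phrase.lower() in lowered
    ]


# Flag a release note that uses a phrase the vocabulary asks people to avoid.
violations = find_violations("Our legacy platform keeps shipping.", VOCABULARY)
print(violations)
```

A check like this could run against release notes, support macros, or sales material before they ship, with any hits routed to the exception owner recorded on the entry.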

Keep the public brand change separate from legal and payment instructions. A new display name does not automatically mean that the contracting entity, tax information, or bank details changed. Telling customers to update those records without confirmation can create payment failures, procurement delays, and fraud risk. Legal and finance owners should identify any real operational changes and communicate them through established, verifiable channels.

Build the customer FAQ from actual consequences, not brand language. Cover logins, existing contracts, invoices, data handling, support access, integrations, domains, saved links, product roadmaps, and administrative work. For every item, say whether action is required. “No action required” is useful only when you have verified it across the relevant systems.

At launch, separate ambition from continuity

Lead with the scope of the change. Say which name belongs to the company, which belongs to the existing product, and how the new category fits. Then explain why the corporate identity is changing. Follow that with a precise account of what existing customers should expect.

Do not rely on “nothing changes” as reassurance. It is usually too broad to be credible, especially when a new product strategy and increased investment are central to the story. Name the stable elements instead: the product that remains, the workflows that continue, the commitments that persist, and any interfaces or commercial records that stay the same.

Use the same architecture everywhere a customer can encounter the company. A clear launch page cannot compensate for an application header, help center, invoice, partner marketplace entry, or sales deck that implies a different relationship. Transitional wording can help connect the names, but it should have an exit condition rather than becoming permanent clutter.

After launch, measure the translation tax

Launch reach tells you that people saw the rename. It does not tell you that they understood it. Establish a pre-launch baseline where possible, then monitor evidence of confusion:
- Support conversations asking whether the existing product is being discontinued or replaced.
- Sales calls in which representatives must repeatedly correct the company-product relationship.
- Documentation searches that mix old and new names in ways your information architecture does not handle.
- Broken redirects, failed bookmarks, authentication problems, or integration errors caused by changed domains or labels.
- Procurement and accounts-payable questions about the company name, contracting entity, or invoice sender.
- Prospect descriptions of the category after encountering the new positioning.
- Retention, adoption, and expansion for the established product, tracked separately from awareness of the new corporate brand.

Review the language in those interactions, not just their volume. The words customers use will show whether the new mental model has formed. Retire transition copy only when support, sales, search, and customer interviews indicate that people can move between the names without assistance.

Key takeaways for your own portfolio decision
- The Fin corporate name expresses the future growth bet; retaining Intercom protects a valuable product identity and signals continued commitment.
- A corporate rename is a brand-architecture and resource-allocation decision, not a cosmetic marketing project.
- Elevate a product name only when the category, roadmap, buying conversation, and operating priorities already support it.
- Tell customers exactly what is renamed, what remains, what changes, and whether they need to act.
- Validate comprehension with unscripted customer explanations, not name-preference questions.
- Measure confusion across support, sales, documentation, procurement, integrations, and product health after launch.

If this decision is in front of you, bring a one-page transition contract to your next portfolio review. Ask product, sales, support, legal, finance, and recruiting to describe the company and its products using the same nouns. If they cannot, keep working on the architecture. If they can, and your resource allocation already matches the story, the rename can do its real job: make the strategy easier for the market to understand.

References
- Intercom – Today Intercom becomes Fin
May 12, 2026

From Internal FinOps Agents to Customer-Embedded Optimization

Your cloud-cost agent can identify the line item that moved and still fail to change a single decision. The gap appears after the diagnosis: the recommendation arrives without the product, pricing, ownership, and risk context needed to act.

If you are taking an internal FinOps capability into the customer experience, design for a closed decision loop. The goal is not autonomous cost cutting. It is a governed system that connects spend to customer value, recommends the next move, and proves whether the move worked.

Design a decision loop, not another cost dashboard

Start by naming the decision your product will improve. A broad promise such as “optimize cloud spend” gives the agent no useful boundary. A better contract is: detect a material change in workload cost, identify the most plausible driver, propose one permitted response, route it to the right owner, and verify the effect.

Draw the product boundary around an outcome

The operating loop is simple to describe: observe, explain, propose, authorize, execute, and verify. A dashboard normally stops at observe or explain. An agentic FinOps workflow carries evidence into a recommendation and then closes the loop with an approved action and post-action telemetry.

Agentic does not mean unrestricted. It means the agent can select the next permitted step based on context. Deterministic services should still perform calculations, enforce policies, check permissions, and execute infrastructure changes. Use the model where interpretation is valuable: reconciling signals, building a driver narrative, identifying missing context, explaining tradeoffs, and routing a decision.

That distinction matters in FinOps. A model should not improvise a billing calculation, invent a price, or bypass a commitment policy. If a calculation has one correct result, compute it in code and give the result to the agent as evidence.

Build four layers with explicit responsibilities

- Evidence layer: Billing exports, usage metering, observability, product telemetry, pricing logic, feature flags, deployment activity, environment metadata, customer segmentation, and ownership records.
- Reasoning layer: Driver trees, anomaly triage, competing explanations, confidence evidence, and recommendation selection.
- Action layer: Policy checks, approval routing, change preparation, execution, rollback, and escalation.
- Learning layer: Post-action telemetry, realized outcomes, agent evaluations, customer feedback, and recurring patterns that belong in the product roadmap.

A retrieval-first pipeline that combines billing, usage, observability, product, and go-to-market context is more useful than a large prompt containing a monthly cost export. Retrieve the records needed for the current decision and preserve their lineage. Every recommendation should reveal which records were used, when they were updated, which pricing assumptions applied, and what the agent could not retrieve.

Customer-facing retrieval adds another non-negotiable boundary: tenant isolation must be enforced before context reaches the model. Do not rely on a prompt to prevent cross-customer disclosure. Access control belongs in the retrieval and service layers, with the resulting access decision recorded in the audit trail.
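As a rough illustration of enforcing the tenant boundary in the retrieval layer, the sketch below filters records in code before any context is assembled for a model, reports which sources could not be retrieved, and emits an audit entry. The record shape, the function name `retrieve_for_decision`, and the sample data are assumptions made for this example, not a description of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvidenceRecord:
    record_id: str
    tenant_id: str
    source: str          # e.g. "billing", "usage", "observability"
    updated_at: datetime
    payload: dict


@dataclass
class RetrievalResult:
    records: list          # lineage: the exact records handed to the model
    missing_sources: list  # what the agent could not retrieve
    audit_entry: dict      # access decision for the audit trail


def retrieve_for_decision(store, tenant_id, needed_sources):
    # The tenant filter runs here, in code, before anything reaches a prompt.
    allowed = [
        r for r in store
        if r.tenant_id == tenant_id and r.source in needed_sources
    ]
    found = {r.source for r in allowed}
    missing = sorted(set(needed_sources) - found)
    audit = {
        "tenant_id": tenant_id,
        "requested_sources": sorted(needed_sources),
        "returned_record_ids": [r.record_id for r in allowed],
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    return RetrievalResult(allowed, missing, audit)


# Invented sample data: two tenants share one store.
ts = datetime(2026, 5, 1, tzinfo=timezone.utc)
store = [
    EvidenceRecord("bill-1", "tenant-a", "billing", ts, {"cost": 120.0}),
    EvidenceRecord("bill-9", "tenant-b", "billing", ts, {"cost": 75.0}),
    EvidenceRecord("use-1", "tenant-a", "usage", ts, {"hours": 40}),
]
result = retrieve_for_decision(store, "tenant-a", ["billing", "usage", "pricing"])
print([r.record_id for r in result.records], result.missing_sources)
```

In a real system the filter would more likely live in the query or the service's authorization layer than in a list comprehension. The point of the sketch is only that the model never receives records it would have to be told to ignore.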

Start with one anomaly and one reversible response

Your first release does not need to optimize every cloud service. A practical thin slice is anomaly detection plus one high-leverage remediation path. For example, the agent might detect a change in non-production workload cost, connect it to a schedule change, prepare a schedule correction, request approval from the workload owner, and monitor the next usage window.

Choose a first action that is bounded and reversible. A scheduling correction is easier to inspect and undo than a long-term financial commitment or a production capacity change. The purpose of the thin slice is to prove the whole operating loop, not merely the anomaly model.

Make every recommendation safe enough to act on

A recommendation without an execution envelope is an opinion. It may be correct, but the recipient still has to reconstruct the evidence, find the owner, assess the downside, and decide how to validate it. That is where apparently intelligent systems create more work than they remove.

Use a recommendation contract

Treat every agent recommendation as a structured product object. At minimum, require these fields:

- Decision: The exact choice the recipient is being asked to make.
- Scope: The account, workload, service, environment, and time window affected.
- Owner: The person or role accountable for the workload and the person authorized to approve the action.
- Evidence: Links to the billing, usage, observability, deployment, and product records that support the diagnosis, including their freshness.
- Driver path: The causal chain the agent believes explains the change, plus material alternative explanations it considered.
- Proposed action: The change, its expected mechanism, and any assumptions behind an estimated effect. If the effect cannot be estimated reliably, say that it is unknown.
- Confidence and unknowns: Evaluation-backed confidence evidence, missing context, and conditions that would invalidate the recommendation.
- Execution envelope: Policy checks, blast radius, approver, expiration, rollback procedure, and escalation path.
- Verification plan: The telemetry, observation window, success condition, and stop condition used after the action.

The expiration field is easy to overlook. Cloud state changes quickly enough that an old recommendation can remain plausible after its evidence has gone stale. Expire the recommendation when its pricing, topology, deployment, or usage assumptions are no longer current. Force a fresh retrieval before execution.
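The contract and its expiration rule can be sketched as a typed object with two checks. This is a minimal illustration under assumptions: the field names are a condensed version of the fields listed above, and `max_evidence_age` is a hypothetical way to express expiration as a freshness window. A real system might instead expire on topology or pricing change events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Recommendation:
    decision: str
    scope: str
    owner: str
    approver: str
    evidence_links: list
    driver_path: str
    proposed_action: str
    unknowns: list
    rollback: str
    verification_plan: str
    evidence_retrieved_at: datetime
    max_evidence_age: timedelta  # hypothetical expiration rule

    def is_expired(self, now):
        return now - self.evidence_retrieved_at > self.max_evidence_age

    def missing_fields(self):
        required = [
            "decision", "scope", "owner", "approver", "evidence_links",
            "driver_path", "proposed_action", "rollback", "verification_plan",
        ]
        # Empty strings and empty lists both count as missing.
        return [name for name in required if not getattr(self, name)]


def ready_to_execute(rec, now):
    """Complete and still fresh; otherwise force a new retrieval first."""
    return not rec.missing_fields() and not rec.is_expired(now)


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
rec = Recommendation(
    decision="Restore the overnight shutdown schedule for staging?",
    scope="staging / batch workers / last 7 days",
    owner="platform team",
    approver="workload owner",
    evidence_links=["billing:bill-1", "deploy:deploy-7"],
    driver_path="schedule removed in a deployment, so instances ran overnight",
    proposed_action="re-apply the previous schedule",
    unknowns=["whether overnight runs were intentional"],
    rollback="remove the schedule again",
    verification_plan="compare the next usage window with the prior baseline",
    evidence_retrieved_at=now - timedelta(minutes=30),
    max_evidence_age=timedelta(hours=1),
)
print(ready_to_execute(rec, now), ready_to_execute(rec, now + timedelta(hours=2)))
```

Keeping `unknowns` out of the required list is deliberate in this sketch: an empty list of unknowns is a legitimate value, whereas an empty rollback procedure is not.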

Grant autonomy by action class

Do not give an agent one global autonomy setting. Earn autonomy independently for each action class:

- Observe: Detect and organize a possible anomaly.
- Explain: Build a driver tree and expose supporting evidence without proposing a change.
- Recommend: Propose an action while a human retains approval and execution.
- Prepare: Generate a change plan or dry run, but require an authorized owner to apply it.
- Execute within policy: Apply a reversible, bounded action only when the policy engine, permissions, evidence freshness, and rollback checks all pass.

Purchasing a cloud commitment or altering production resources can create real financial or availability exposure. Keep finance and service owners in the approval path until confidence evidence and post-action telemetry demonstrate reliable performance for that specific intervention. Good results on anomaly explanations do not establish that the same agent is safe to execute infrastructure changes.
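A per-action-class ceiling can be expressed as a small lookup plus a gate. The sketch below is illustrative only: the action-class names, the ceilings assigned to them, and the check names are hypothetical, and a production gate would read them from a policy engine rather than a module-level dictionary.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    OBSERVE = 1
    EXPLAIN = 2
    RECOMMEND = 3
    PREPARE = 4
    EXECUTE_WITHIN_POLICY = 5


# Hypothetical ceilings: each action class earns its own level.
AUTONOMY_CEILING = {
    "nonprod_schedule_correction": AutonomyLevel.EXECUTE_WITHIN_POLICY,
    "production_capacity_change": AutonomyLevel.PREPARE,
    "commitment_purchase": AutonomyLevel.RECOMMEND,
}

REQUIRED_CHECKS = (
    "policy_passed",
    "permission_granted",
    "evidence_fresh",
    "rollback_verified",
)


def may_execute(action_class, checks):
    # Unknown action classes default to the lowest level.
    ceiling = AUTONOMY_CEILING.get(action_class, AutonomyLevel.OBSERVE)
    if ceiling < AutonomyLevel.EXECUTE_WITHIN_POLICY:
        return False
    # Every check must be present and true; a missing check counts as a failure.
    return all(checks.get(name, False) for name in REQUIRED_CHECKS)


all_green = {name: True for name in REQUIRED_CHECKS}
print(may_execute("nonprod_schedule_correction", all_green))
print(may_execute("commitment_purchase", all_green))
```

Raising a ceiling then becomes an explicit, reviewable change to one entry, made when post-action evidence for that specific intervention supports it.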

Governance should be visible in the product, not left in a policy document. Show the approver which data was accessed, which rules passed, who changed the recommendation, what action ran, and what happened afterward. Privacy-by-design, data controls, and transparent decision logs are part of the user experience when the system influences money and production infrastructure.

Evaluate the decision loop, not the prose

A polished explanation is not evidence of a useful agent. Build evaluations around the failure modes that can block or distort a decision:

- Did the recommendation use the correct customer, workload, environment, price, and time window?
- Can each material claim be traced to an underlying record?
- Does the driver path match known cases, including cases with several plausible causes?
- Does the agent abstain when ownership, telemetry, or pricing context is missing?
- Did approval routing and policy enforcement behave correctly?
- Can the recipient perform the proposed action without reconstructing missing steps?
- Did post-action telemetry confirm the expected direction of change without creating an unacceptable operational tradeoff?

Put retrieval changes, prompts, policies, and tools through the same delivery discipline as application code. Eval-driven development, CI/CD, and a weekly shipping cadence make regressions visible before a persuasive but poorly grounded recommendation reaches an operator or customer.
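Some of these questions can be turned into automated checks that run in CI against labeled cases. The sketch below covers three of them (scope, traceability, and abstention) and assumes a simple dictionary shape for both the labeled case and the agent output; that shape and the function name `evaluate_case` are invented for illustration.

```python
def evaluate_case(case, output):
    """Score one agent output against one labeled evaluation case."""
    known_records = set(case["available_record_ids"])
    return {
        # Scope: every expected scope field must match exactly.
        "scope_correct": all(
            output["scope"].get(key) == value
            for key, value in case["expected_scope"].items()
        ),
        # Traceability: each claim cites at least one record, and only known ones.
        "claims_traceable": all(
            bool(claim["record_ids"]) and set(claim["record_ids"]) <= known_records
            for claim in output["claims"]
        ),
        # Abstention: the agent abstains exactly when the case requires it.
        "abstention_correct": output["abstained"] == case["must_abstain"],
    }


# Invented example: the second claim cites nothing, so traceability fails.
case = {
    "expected_scope": {"customer": "tenant-a", "environment": "staging"},
    "available_record_ids": ["bill-1", "deploy-7"],
    "must_abstain": False,
}
output = {
    "scope": {"customer": "tenant-a", "environment": "staging", "workload": "batch"},
    "claims": [
        {"text": "The schedule was removed in a deployment.", "record_ids": ["deploy-7"]},
        {"text": "Overnight usage doubled.", "record_ids": []},
    ],
    "abstained": False,
}
scores = evaluate_case(case, output)
print(scores)
```

Checks such as approval routing or post-action telemetry need real system state and are better covered by integration tests than by a pure function like this one.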

Embed the capability with customers before scaling it

The first customer version should not be a general-purpose cost chatbot. It should be a narrow, product-assisted engineering motion in which a Forward Deployed Engineer, or FDE, helps the customer connect product usage, cloud architecture, and cost-to-value.

Choose a small pod and customers that can teach you

A sensible starting shape is one FDE pod focused on two or three high-potential customers. High potential should not mean merely the largest cloud bill. Select customers where the team can access the necessary evidence, an accountable sponsor can authorize changes, the problem is likely to recur, and the customer agrees to clear data and governance boundaries.

- Evidence readiness: Billing, metering, observability, pricing, and deployment context can be joined without weeks of manual reconciliation.
- Decision access: An engineering, product, or finance owner can approve an intervention and explain the operational constraints.
- Learning value: The problem represents a pattern that may apply beyond one account.
- Measurability: The customer and FDE can agree on a cost-to-value measure before making a change.
- Governance fit: Data access, retention, tenant isolation, approvals, and audit expectations are explicit.

If any of these conditions is absent, the engagement may still be commercially important, but it is a weak environment for deciding whether the agentic product works. Separate account urgency from product-learning quality.

Run a customer optimization loop that produces reusable knowledge

1. Define the value unit. Agree on what an active workload or valuable unit of product usage means. Total spend alone cannot distinguish efficient growth from contraction.
2. Establish the baseline. Record current cost per active workload, time-to-first-value, relevant deployment behavior, and the constraints the customer will not trade away.
3. Build the driver tree. Connect the spend change to services, environments, releases, product behavior, and customer usage. Surface gaps instead of filling them with assumptions.
4. Select one intervention. Prefer the smallest action that can test the diagnosis. Document the expected mechanism, approver, risk, and rollback before execution.
5. Verify the outcome. Compare post-action telemetry with the agreed baseline. Record savings, unit-economics movement, performance effects, adoption effects, and unintended consequences separately.
6. Codify the pattern. Capture the inputs, decision rule, action, exceptions, safeguards, and evidence required to repeat the intervention.
7. Send a weekly learning packet to product. Include successful patterns, failed diagnoses, missing platform capabilities, customer language, and recommendations that still depend on FDE judgment.

Within a quarter, this loop should make it possible to distinguish interventions that can be automated, patterns that should become native product features, and problems that still require deeper solutions engineering. The point is not to eliminate the FDE. It is to reserve that scarce judgment for cases where ambiguity and customer context remain material.

Make the commercial incentive legible

Customer-embedded optimization creates an obvious trust question for a consumption business: does the vendor want the customer to spend less or consume more? The clean answer is to optimize cost-to-value rather than either number in isolation.

A customer’s total cloud cost can rise while cost per active workload improves because valuable usage is growing. Total cost can also fall because the customer is using less of the product, which is not an optimization success. Label the outcome precisely: lower total spend, lower unit cost, avoided waste, shifted commitment, higher useful consumption, or reduced operational risk. Do not collapse these different effects into a generic savings claim.
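A short worked example makes the difference between total cost and unit cost concrete. The figures below are invented, and the "usage contracted" label is an addition for this sketch rather than one of the outcome labels named above.

```python
def cost_per_active_workload(total_cost, active_workloads):
    if active_workloads <= 0:
        raise ValueError("active_workloads must be positive")
    return total_cost / active_workloads


def label_outcome(before_cost, before_workloads, after_cost, after_workloads):
    """Label each effect separately instead of reporting one savings number."""
    unit_before = cost_per_active_workload(before_cost, before_workloads)
    unit_after = cost_per_active_workload(after_cost, after_workloads)
    labels = []
    if after_cost < before_cost:
        labels.append("lower total spend")
    if unit_after < unit_before:
        labels.append("lower unit cost")
    if after_workloads > before_workloads:
        labels.append("higher useful consumption")
    if after_workloads < before_workloads:
        labels.append("usage contracted")  # assumption: not an optimization success
    return labels


# Spend rises from 10,000 to 12,000, but unit cost falls from 100 to 80.
growth = label_outcome(10_000, 100, 12_000, 150)
# Spend falls from 10,000 to 8,000, but unit cost rises because usage shrank.
contraction = label_outcome(10_000, 100, 8_000, 70)
print(growth)
print(contraction)
```

The second case is the one a generic savings claim would misreport: total spend fell, yet each remaining workload became more expensive.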

The FDE is also a trust boundary. The role should explain the recommendation, expose assumptions, and represent the customer’s constraints. It should not become a human interface for repetitive exports and one-off queries that the platform ought to handle.

Turn field work into a roadmap, not permanent custom service

A strong FDE can make a weak product look successful by solving every gap manually. That is useful for an individual customer and dangerous for product strategy. You need an explicit test for moving work from the field into an agent workflow or native platform capability.

Apply a productization test to every recurring intervention

- Can the same signal be retrieved reliably across the intended customer segment?
- Can the decision logic be expressed without undocumented customer-specific knowledge?
- Can the action be bounded by a stable policy, approval path, and rollback procedure?
- Can the outcome be measured with telemetry that exists before and after the change?
- Do the likely exceptions fit a review workflow, or do they fundamentally change the decision?

If the signal, decision, action, and measurement are repeatable, make the pattern a native feature or automated playbook. If the evidence is repeatable but judgment varies, keep an agentic workflow with human review. If the action carries high financial or availability risk, keep the FDE and accountable owner in the loop. If the pattern is a one-off, document it but resist turning it into product scope.
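The routing rules in the paragraph above can be written as a small function so that every recurring intervention is classified the same way. This is a sketch under one stated assumption: the text does not say which rule wins when several apply, so the precedence here (one-off first, then risk, then repeatability) and the final fallback are choices made for illustration.

```python
def productization_route(
    recurring,
    high_risk,
    signal_repeatable,
    decision_repeatable,
    action_bounded,
    outcome_measurable,
):
    # Precedence is an assumption; the source lists the rules without ordering them.
    if not recurring:
        return "document only"
    if high_risk:
        return "keep FDE and accountable owner in the loop"
    if signal_repeatable and decision_repeatable and action_bounded and outcome_measurable:
        return "native feature or automated playbook"
    if signal_repeatable and not decision_repeatable:
        return "agentic workflow with human review"
    # Fallback added for this sketch: not yet repeatable enough to classify.
    return "keep in the field and revisit"


route = productization_route(
    recurring=True,
    high_risk=False,
    signal_repeatable=True,
    decision_repeatable=False,
    action_bounded=True,
    outcome_measurable=True,
)
print(route)
```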

Use a scorecard that reveals where the loop is breaking

Dimension	Measure	Decision it informs
Insight speed	Time-to-insight from a material spend change	Is the system finding the issue early enough to change an engineering decision?
Action quality	Recommendations with evidence, an owner, a permitted action, and a verification plan	Is the agent producing executable decisions or polished commentary?
Economics	Realized savings per recommendation and cost per active workload	Did the intervention improve spend or unit economics for the intended value unit?
Reliability	Post-action effects, abstentions, rollbacks, and policy failures by action class	Which interventions have earned more autonomy, and which need tighter controls?
Customer outcome	Time-to-first-value and net revenue retention (NRR) movement on FDE-supported accounts	Is the motion improving adoption and durable account value? NRR is directional evidence, not proof of causation.
Product leverage	Recurring field patterns converted into features, guardrails, or in-product guidance	Is customer work compounding into a scalable product?

Recommendation volume, prompt length, and agent activity are operating diagnostics, not business outcomes. A quiet system that changes a few high-value decisions can be more useful than an active system that produces hundreds of unactioned findings.

Make build versus buy a component decision

Do not treat the choice as one monolithic platform decision. Separate commodity capabilities from the context and workflow that create differentiation. Evaluate billing ingestion, normalization, anomaly detection, the context model, pricing logic, recommendation policy, approval routing, execution, and agent analytics independently.

- Does the capability require knowledge of your architecture, pricing model, feature flags, customer usage, or deployment behavior?
- Can an external component preserve evidence lineage, tenant isolation, and decision logs at the level your customers require?
- Is the capability a generic input to the product, or is it where your product makes a differentiated decision?
- Can your team evaluate and operate the component continuously, including regressions after model, prompt, policy, or data changes?
- Will the component reduce time-to-value without trapping critical customer and pricing context in an opaque workflow?

Unique architecture, pricing, and growth loops can justify building the context and decision layers. But weak tagging, unclear ownership, and missing observability undermine either path. Fix those foundations before expecting an in-house or purchased agent to produce precise optimization decisions.

Give the core product to a product trio spanning product management, engineering, and FinOps. Bring FDE, customer success, SRE, finance, and security into discovery and evaluation where their decisions are affected. Field requests should enter the roadmap with evidence of recurrence, strategic importance, or platform leverage rather than becoming an informal side door to custom development.

Key takeaways

- Define the product as observe, explain, propose, authorize, execute, and verify. Diagnosis alone is not an agentic outcome.
- Retrieve billing, usage, observability, pricing, product, and ownership context for each decision, with lineage and tenant boundaries enforced outside the prompt.
- Represent every recommendation as a governed contract containing evidence, owner, action, risk, approval, rollback, expiration, and verification.
- Grant autonomy by action class. Keep humans in the loop for commitments and production changes until that intervention has reliable post-action evidence.
- Start customer delivery with one FDE pod and two or three customers that offer evidence access, decision access, measurable value, and reusable learning.
- Measure time-to-insight, realized outcomes, unit economics, reliability, customer value, and productized patterns instead of counting recommendations.

This week, choose one recurring cost anomaly and map the complete path from underlying records to a verified action. Name the owner, approval rule, rollback, and success telemetry before improving the prompt. Do not add a second workflow until the first can explain what changed, why the action was allowed, and whether it improved customer cost-to-value.

May 11, 2026

How to Run AI-Assisted Feature Launches That Drive Growth

You are days from releasing a feature. Engineering needs a rollout decision. Go-to-market teams need a clear promise. Support needs to know what could go wrong. Leadership wants to know whether the release changed customer behavior. Dropping an AI bot into the launch channel will not resolve those tensions. If the metrics, authority, and escalation rules are vague, the bot will only answer ambiguity faster.

The useful model is a closed loop: define the behavior you want to change, instrument exposure and value, operate the rollout from one shared channel, let agents handle repeatable retrieval and synthesis, and reserve consequential decisions for accountable people. Done well, AI reduces the coordination tax around a launch while making the growth decision more disciplined.

Define the growth decision before you automate the launch

A feature being available is an output. A customer reaching value is an outcome. Your launch plan has to connect the two before anyone writes an agent prompt or schedules a readout.

A durable growth plan translates the product North Star into activation and retention signals, then defines the minimum detectable effect before experimentation. The North Star provides direction, but it is often too distant to diagnose a new feature. A launch needs an earlier behavioral signal that can tell you whether eligible users encountered the feature, understood it, and reached its intended value.
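To see why the minimum detectable effect has to be chosen before the experiment, it helps to compute how many users each arm needs to detect it. The sketch below uses the standard normal-approximation formula for comparing two proportions; the article does not prescribe a method, so the formula choice, the default significance level and power, and the example rates are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute lift in a conversion-style rate.

    Two-sided test, normal approximation, unpooled variance.
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    if minimum_detectable_effect == 0:
        raise ValueError("minimum_detectable_effect must be non-zero")
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must fall strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical launch: 10% baseline activation, 2-point absolute lift matters.
needed = sample_size_per_arm(0.10, 0.02)
# Halving the effect you care about roughly quadruples the required sample.
needed_for_smaller_effect = sample_size_per_arm(0.10, 0.01)
print(needed, needed_for_smaller_effect)
```

If the eligible population cannot reach the required size within the launch window, the planned test cannot answer the question, and either the effect threshold or the primary signal needs to change.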

Write a short launch contract with these fields:
1. Target user and moment: Name the user or account segment, the situation that makes the feature relevant, and any eligibility rules. A feature intended for a new administrator solving an initial setup problem should not be evaluated across every user in the product.
2. Behavioral hypothesis: State the current behavior, the desired behavior, and why the feature should cause the change. If the causal link cannot be written plainly, the team is not ready to interpret the launch data.
3. Measurement chain: Instrument eligibility, actual exposure, meaningful engagement, the activation action, and the downstream value event. If you record engagement but not exposure, low adoption could mean either that users ignored the feature or that they never saw it.
4. Primary signal: Choose the behavior closest to customer value that can mature within the launch window. Do not promote every available metric to equal status. That turns a decision into a search for whichever chart looks most favorable.
5. Guardrails: Name the operational and customer signals that can stop a rollout, such as degraded performance, errors, support burden, privacy concerns, or a harmful shift in another important behavior. Define the actual acceptable bounds in your internal contract before launch; do not negotiate them after a concerning result appears.
6. Minimum detectable effect: Decide what change would be large enough to matter to the product and business. This keeps the team from celebrating meaningless movement or waiting indefinitely for certainty that the planned test cannot provide.
7. Decision rule and authority: Specify what evidence permits a ramp, what requires a hold, what triggers investigation, and who can pause or roll back the feature. An agent may assemble the evidence, but it should not invent the rule during the incident.
The contract should also distinguish a growth signal from a health signal. Activation, conversion, or repeated use may tell you whether the feature is producing value. Latency, error rates, complaints, and anomalous segment behavior tell you whether it is safe to continue. A healthy system with an immature growth signal may justify holding the rollout. Broken instrumentation or a material guardrail breach calls for a different response.
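One way to keep the contract honest is to store it as a structured record instead of free text. The sketch below is illustrative only: the class name, field names, and the `missing_fields` helper are assumptions, not part of any tool mentioned here. It simply mirrors the seven fields above and reports which ones are still empty.

```python
from dataclasses import dataclass, fields


@dataclass
class LaunchContract:
    """Hypothetical record mirroring the launch contract fields."""
    target_user_and_moment: str
    behavioral_hypothesis: str
    measurement_chain: list[str]       # e.g. eligibility -> exposure -> ... -> value
    primary_signal: str
    guardrails: dict[str, float]       # signal name -> acceptable bound (assumed shape)
    minimum_detectable_effect: float   # change in the primary signal that would matter
    decision_rule: str
    decision_authority: str

    def missing_fields(self) -> list[str]:
        """Return names of fields left empty so gaps surface before launch."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


contract = LaunchContract(
    target_user_and_moment="new administrator during initial setup",
    behavioral_hypothesis="",  # not yet written plainly
    measurement_chain=["eligibility", "exposure", "engagement", "activation", "value"],
    primary_signal="setup_completed",
    guardrails={"error_rate": 0.02},
    minimum_detectable_effect=0.03,
    decision_rule="ramp only if the primary signal meets the agreed rule",
    decision_authority="product lead",
)
print(contract.missing_fields())
```

An empty hypothesis shows up as a named gap, which is the point: the team sees what is unwritten before the data arrives.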

This distinction prevents a common category error: treating an inconclusive experiment as a failed feature, or treating early adoption as proof of durable value. The launch decision should always answer the same question: given trustworthy exposure data, the primary signal, and the guardrails, should you ramp, hold, investigate, or roll back?

Turn the launch channel into a decision system

A launch channel becomes useful when it preserves context and decisions, not merely conversation. A practical setup is one channel named #launch-[feature], with its scope, service expectations, success metrics, dashboards, and rollout plan pinned. Product, engineering, data, support, and go-to-market stakeholders can then work from the same operational record.

Set up the channel before rollout begins:
1. Pin the launch contract: Include the hypothesis, eligible population, event definitions, primary signal, guardrails, rollout stages, owners, and links to live dashboards. A screenshot becomes stale; a governed dashboard remains inspectable.
2. Create stable work lanes: Use separate parent threads for metrics, incidents, enablement, and customer feedback. This gives each agent and human responder a predictable place to work without fragmenting the overall launch record.
3. Publish response expectations: State which questions the agent can answer immediately, which require a human owner, and how urgent operational issues are escalated. The agent should never make an urgent request look handled merely because it produced a fluent reply.
4. Keep a decision ledger: For every ramp, hold, pause, or rollback, record the timestamp, evidence considered, decision, rationale, approver, and next review point. This matters later when a stakeholder asks why exposure changed or when the team compares the result with the original hypothesis.
5. Require channel-visible handoffs: If a question moves to a data, engineering, or privacy owner, the agent should post the handoff and preserve the relevant query, definitions, filters, and context. Do not let direct messages become a shadow operating system.
Give every automated data answer a consistent shape:
- The direct answer, including the population and time window.
- The metric definition and denominator.
- The relevant cohort, segment, environment, and experiment variant.
- A link to the approved underlying data or dashboard.
- An as-of timestamp so readers know how fresh the result is.
- Any missing data, definition conflict, or limitation that changes interpretation.
- The named human owner when judgment or investigation is required.
An activation rate without its denominator, an environment, or a timestamp is not decision-grade evidence. A polished answer should not receive more trust than the data lineage beneath it. Make uncertainty visible instead of prompting the agent to conceal it behind a confident summary.
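A minimal sketch of that check, assuming the agent's answer arrives as a plain dictionary. The key names and the choice of which fields count as required are my assumptions for illustration; adapt them to your own answer format.

```python
# Hypothetical field names for the answer shape described above.
REQUIRED_FIELDS = (
    "answer",
    "population",
    "time_window",
    "metric_definition",
    "denominator",
    "environment",
    "source_link",
    "as_of",
)


def missing_evidence_fields(answer: dict) -> list[str]:
    """List required fields that are absent or empty in an agent answer."""
    return [name for name in REQUIRED_FIELDS if not answer.get(name)]


def is_decision_grade(answer: dict) -> bool:
    """An answer is decision-grade only when nothing required is missing."""
    return not missing_evidence_fields(answer)


draft = {
    "answer": "Activation rate is 31%",
    "population": "new administrators",
    "time_window": "first 24 hours after exposure",
    "metric_definition": "completed setup / exposed users",
    "environment": "production",
    "source_link": "https://example.invalid/dashboard",  # placeholder URL
}
print(missing_evidence_fields(draft))
```

Rejecting the draft for a missing denominator and timestamp is less convenient than accepting a fluent sentence, but it keeps trust tied to lineage.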

Give agents narrow jobs and humans explicit authority

The safest launch architecture separates three jobs: retrieving data, operating rollout controls, and interpreting evidence. Combining them in one broadly empowered agent creates unnecessary risk. It also makes failures harder to diagnose because you cannot tell whether the problem came from a bad query, a bad recommendation, or an unauthorized action.

Use a data agent for retrieval and first-pass synthesis

Connect the data agent only to approved sources and metric definitions. It can answer repeatable questions such as activation by cohort, conversion by segment, latency by region, exposure by variant, or the movement of a named guardrail. It should provide citations and timestamps, then route questions requiring nuance to an owner while keeping the context in the thread.

Write the escalation boundary into its operating instructions. Escalate when metric definitions conflict, required data is unavailable, a query touches restricted information, the request asks for a causal conclusion that descriptive data cannot establish, or the answer would materially change rollout. The best response in those cases is not a guess. It is a precise statement of what is missing and who must resolve it.

Keep the feature-flag agent read-only by default

A flag agent can safely expose status by environment, current rollout allocation, and change history. That alone removes many repetitive questions. Write access is different: an incorrect production change can expose an unready experience, expand an incident, or remove access unexpectedly.

When you permit flag mutations, require an explicit sequence:
1. The requester names the feature, environment, target population, requested action, and reason.
2. The agent shows the current flag state and summarizes the evidence relevant to the request.
3. The authorized approver confirms the exact change. Approval cannot be inferred from an emoji, an ambiguous reply, or the agent’s own recommendation.
4. The integration performs only the approved action through constrained permissions.
5. The agent posts the resulting state, timestamp, requester, approver, rationale, and change-history link.
Do not give the agent a broad production credential merely because the chat interface is convenient. Restrict its access by environment and role, preserve an audit trail, and keep a manual rollback path available to the responsible engineer.
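The approval step is the one most worth encoding. The sketch below assumes a hypothetical request and approval shape; none of these names come from a real feature-flag product. It accepts an approval only when an authorized person confirmed the exact change that was requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagChangeRequest:
    """Hypothetical request record; field names are illustrative."""
    feature: str
    environment: str
    target_population: str
    action: str
    reason: str
    requester: str


def is_explicit_approval(request: FlagChangeRequest, approval: dict,
                         authorized_approvers: set[str]) -> bool:
    """True only if an authorized approver confirmed this exact change."""
    confirmed = approval.get("confirmed_change") or {}
    exact_match = all(
        confirmed.get(key) == getattr(request, key)
        for key in ("feature", "environment", "target_population", "action")
    )
    return approval.get("approver") in authorized_approvers and exact_match


request = FlagChangeRequest(
    feature="guided-setup", environment="production",
    target_population="new administrators", action="ramp to next stage",
    reason="guardrails within bounds at T+24 hours", requester="pm-lead",
)
emoji_reply = {"approver": "eng-oncall", "reaction": "thumbs_up"}
explicit_reply = {
    "approver": "eng-oncall",
    "confirmed_change": {
        "feature": "guided-setup", "environment": "production",
        "target_population": "new administrators", "action": "ramp to next stage",
    },
}
```

With these inputs, the emoji reply fails the check and the explicit confirmation passes, provided `eng-oncall` is in the authorized set.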

Use a readout agent to maintain the launch narrative

Scheduled summaries prevent the team from rebuilding the same analysis for every stakeholder. A useful default is to publish readouts at T+1 hour, T+24 hours, and T+7 days, while adapting the questions to the product’s actual usage cycle:
- T+1 hour: Confirm that exposure is occurring as intended. Check instrumentation, operational performance, obvious anomalies, and incident status. This checkpoint is primarily about measurement and safety, not declaring growth success.
- T+24 hours: Review adoption and activation by the planned cohorts, early conversion movement where applicable, support themes, and any uneven behavior across important segments.
- T+7 days: Evaluate experiment results that have had time to mature, retention or repeated-value signals when the product cycle makes them observable, significant outliers, and the follow-up work needed to harden or revise the experience.
These checkpoints are operating cadences, not guarantees of statistical maturity. A feature used on a longer cycle may not produce a meaningful retention signal by the final checkpoint. The readout should say that plainly instead of treating missing maturity as neutral evidence.
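If you schedule the readouts programmatically, the cadence reduces to three offsets from the launch time. This is a small sketch with assumed names (`READOUT_OFFSETS`, `readout_schedule`); adjust the offsets when your product's usage cycle is longer.

```python
from datetime import datetime, timedelta, timezone

# Default checkpoints from the cadence above; change them for longer cycles.
READOUT_OFFSETS = {
    "T+1 hour": timedelta(hours=1),
    "T+24 hours": timedelta(hours=24),
    "T+7 days": timedelta(days=7),
}


def readout_schedule(launch_time: datetime) -> dict[str, datetime]:
    """Return the timestamp at which each scheduled readout is due."""
    return {label: launch_time + offset for label, offset in READOUT_OFFSETS.items()}


launch = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
for label, due in readout_schedule(launch).items():
    print(label, due.isoformat())
```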

Every readout should end with a decision or an explicit statement that no decision is yet warranted. It should also name the evidence still needed, the owner, and the next review point. A summary that lists charts but does not clarify the decision state creates more reading without reducing uncertainty.

Make the accountability map visible
- Product owns the behavioral hypothesis, the primary growth signal, and the recommendation to ramp, hold, or change direction.
- Engineering owns operational health, flag implementation, incident response, and safe rollback execution.
- Data owns metric definitions, instrumentation validity, experiment design, and interpretation limits.
- Support and go-to-market owners contribute customer feedback, readiness concerns, enablement status, and communication needs.
- Agents retrieve, summarize, route, and perform narrowly preauthorized steps. They do not approve their own consequential recommendations.
The governance layer is part of the product, not a final compliance check. Apply role-based access, protect personally identifiable information, require source citations, and retain transparent logs. Then monitor response accuracy, deflection, and time-to-answer through Agent Analytics. Deflection alone is a poor success metric: a confidently wrong response may reduce human questions while increasing decision risk. Review incorrect answers, unnecessary escalations, missed escalations, and stale data as carefully as response speed.

Run the rollout as a sequence of evidence gates

A feature flag is not merely a switch. It lets you separate deployment from exposure and turn a large release decision into a sequence of smaller, inspectable decisions. The appropriate rollout stages depend on the feature’s operational, privacy, and customer risk, so define them in advance rather than copying a universal percentage.

Use this operating sequence:
1. Preflight the measurement: Verify eligibility, exposure, activation, value, and guardrail events in the intended environment. Confirm that dashboards use the launch contract’s definitions and that the agent can retrieve the same governed numbers.
2. Release to the defined cohort: Use the flag to control who can receive the experience. Confirm actual exposure before interpreting engagement. Eligibility and exposure are different facts.
3. Inspect evidence at the scheduled gates: Start with instrumentation and safety, then move to activation, conversion, retention, and other downstream value signals as they become observable. Review the preselected segments before exploring unexpected cuts of the data.
4. Choose a named decision state: Ramp, hold, investigate, pause, or roll back. Record the evidence and rationale. Avoid vague states such as looking good, because they do not tell engineering what to do or stakeholders what has been decided.
5. Feed the learning back into the journey: Update onboarding, in-product guidance, targeting, positioning, or the feature itself based on the observed friction. A winning test becomes a growth mechanism only when the trigger, experience, and value-producing behavior can be repeated reliably.
Use a clear decision ladder:
- Ramp when measurement is trustworthy, guardrails remain inside the pre-agreed bounds, the evidence meets the decision rule, and customer-facing teams are ready for broader exposure.
- Hold when the system is healthy but the outcome has not matured enough to support a decision. State what evidence is pending and when it can reasonably be reviewed.
- Investigate when an anomaly, segment divergence, definition conflict, or instrumentation gap makes the aggregate result unreliable.
- Pause when continued exposure could obscure an incident, contaminate the test, or expand a customer problem while the team diagnoses it.
- Roll back when a material operational, privacy, safety, or customer guardrail crosses the boundary defined in the launch contract. Do not wait for the primary growth metric to mature before acting on a serious downside.
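
The ladder can be sketched as a small function that returns exactly one named state. The input flags and the order of precedence (safety first, then evidence quality, then growth) are my assumptions about how to encode the rules above, and the fallback to hold when a matured result does not meet the rule is also an assumption. The real thresholds belong in your launch contract.

```python
def decision_state(*, guardrail_breached: bool, exposure_risky: bool,
                   evidence_unreliable: bool, outcome_matured: bool,
                   rule_met: bool, teams_ready: bool) -> str:
    """Map pre-agreed conditions to one named decision state."""
    if guardrail_breached:
        return "roll back"      # act on a serious downside without waiting
    if exposure_risky:
        return "pause"          # continued exposure could expand the problem
    if evidence_unreliable:
        return "investigate"    # anomaly, divergence, or instrumentation gap
    if outcome_matured and rule_met and teams_ready:
        return "ramp"
    return "hold"               # healthy, but not enough evidence to ramp


print(decision_state(guardrail_breached=False, exposure_risky=False,
                     evidence_unreliable=False, outcome_matured=False,
                     rule_met=False, teams_ready=True))
```
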
If the feature itself uses AI, measure the product experience separately from the operational agents supporting its launch. AI can provide intelligent nudges, next-best actions, and adaptive experiences while applying privacy-by-design and strong data governance. That creates at least four distinct questions: Was the user eligible? Was the AI experience delivered? Did the user engage with it? Did that engagement lead to value without violating a guardrail?

Logging only the final conversion makes those questions impossible to separate. A delivery problem, poor recommendation, confusing interaction, and weak value proposition can all produce the same downstream result. Preserve the path from eligibility through value, including the experience variant the user received. If targeting or adaptive behavior changes during the test, log the change and account for it in the interpretation.
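To make the path inspectable, count distinct users at each stage per variant instead of storing only the end result. The sketch assumes events arrive as `(user_id, variant, stage)` tuples with the four stage names below; both the tuple shape and the names are illustrative.

```python
# Hypothetical stage names matching the four questions above.
STAGES = ["eligible", "delivered", "engaged", "value"]


def funnel_by_variant(events) -> dict[str, dict[str, int]]:
    """Count distinct users reaching each stage, split by experience variant."""
    reached: dict[tuple[str, str], set[str]] = {}
    for user_id, variant, stage in events:
        reached.setdefault((variant, stage), set()).add(user_id)
    variants = sorted({variant for variant, _ in reached})
    return {
        variant: {stage: len(reached.get((variant, stage), set())) for stage in STAGES}
        for variant in variants
    }


events = [
    ("u1", "A", "eligible"), ("u1", "A", "delivered"), ("u1", "A", "engaged"),
    ("u2", "A", "eligible"), ("u2", "A", "eligible"),   # duplicate event, one user
    ("u3", "B", "eligible"), ("u3", "B", "delivered"),
]
print(funnel_by_variant(events))
```

In this toy data, variant A loses a user between eligibility and delivery, which points at a delivery problem and not a weak value proposition.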

Do not confuse high initial use with a durable growth loop. Novelty can produce engagement without retained value. Look for the sequence that matters to your product: activation, repeated value, and then the relevant expansion, collaboration, or retention behavior. If the product has no natural invitation or sharing mechanic, do not force a viral story onto it. Build the loop around the behavior customers already have a reason to repeat.

Key takeaways
- Start with a launch contract that names the user, behavioral hypothesis, measurement chain, primary signal, guardrails, minimum detectable effect, decision rules, and accountable owners.
- Use one launch channel as the shared operational record, but separate metrics, incidents, enablement, and feedback into stable threads.
- Split agent responsibilities across data retrieval, flag operations, and scheduled readouts. Keep consequential actions approval-gated and auditable.
- Treat T+1 hour, T+24 hours, and T+7 days as decision checkpoints, not automatic declarations of success.
- Use feature flags to move through evidence gates. Ramp, hold, investigate, pause, or roll back according to rules written before the data arrives.
- Measure AI-powered experiences from eligibility through delivery, engagement, value, and guardrails so you can diagnose why growth did or did not move.
For your next launch, begin with a narrow operating slice: a completed launch contract, a structured channel, a data agent limited to approved queries, read-only flag visibility, and scheduled readouts. Review every wrong answer, escalation, and decision after the rollout. Expand the agent’s authority only when the evidence shows that the control system is trustworthy.

References
- Amplitude – My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts
- Amplitude – How I Orchestrate Growth & AI at Amplitude to Ignite Viral Product Engagement
May 11, 2026

How to Build a SaaS Retention and Expansion System

Your team can explain churn after it happens. The harder problem is seeing a customer change direction early enough to do something useful, then knowing whether the intervention actually changed the outcome.

You do not solve that problem with another health dashboard. You solve it with a closed-loop operating system: define how customers progress toward value, detect when that progression changes, choose the right intervention, and measure the incremental result. Built well, the same system protects retention and identifies credible expansion opportunities.

Treat retention and expansion as one value-progression system

Retention and expansion are often split across teams, tools, and meetings. Customer Success monitors renewal risk. Product watches activation and feature adoption. Sales looks for additional revenue. Support handles whatever breaks. Marketing runs lifecycle campaigns. Each function can be busy while the customer still receives a fragmented experience.

The better organizing principle is customer value progression. A retained customer continues receiving enough value to justify the relationship. An expanding customer is ready to receive that value across more users, workflows, usage, or capabilities. The two outcomes sit on the same path.

That changes the question from “Which accounts might churn?” to “What value state is this account in, what evidence supports that assessment, and what should happen next?”

1. Define the state. Translate product, support, CRM, and commercial signals into a recognizable customer condition.
2. Make a decision. Select an intervention, assign a human owner, or deliberately take no action.
3. Act in context. Use the channel and message appropriate to the customer’s current job, friction, and relationship.
4. Observe the response. Track whether behavior, value attainment, or commercial outcomes changed.
5. Learn and revise. Keep playbooks that produce incremental value, change weak ones, and retire harmful or noisy ones.

This loop is the system. A prediction model, lifecycle tool, or customer-success platform is only one component inside it.

Key takeaways

- Model movement toward and away from value, not churn as a single binary event.
- Keep the account state, its underlying drivers, and the recommended action visible together.
- Use automated journeys for clear, low-complexity situations and human help when diagnosis or commercial context matters.
- Separate risk recovery from expansion outreach, even when both use the same underlying data.
- Measure incremental outcomes with an eligible comparison group or holdout whenever possible.
- Start with one segment and one customer state before adding more data, models, and playbooks.

Instrument customer states, not a pile of events

A login is not value. A feature click is not adoption. A support ticket is not necessarily risk. Raw events become useful only when you interpret them in the context of a customer journey.

Begin with a small set of decisions your system must support. Common starting use cases include an activation funnel, onboarding drop-off, and adoption of the product’s core capability. A lightweight tracking plan, consistent event names, and explicit initial use cases give Product, Data, Growth, and Customer Success a shared language for those decisions.

Define customer states before designing a score. The exact evidence will differ by product, segment, pricing model, and maturity, but the state taxonomy can remain understandable:

Customer state	Evidence to define for your product	Decision the state should enable
Onboarding stalled	A required setup or first-value milestone was started but not completed, or progress stopped relative to the expected journey	Remove a specific blocker before sending broader education
Activated but shallow	The account reached initial value, but usage remains concentrated in one person, workflow, or capability	Help the account repeat and distribute the successful behavior
Healthy and deepening	Core outcomes recur, usage is stable or growing, and value is spreading through the intended scope	Reinforce success and watch for an adjacent need
Contracting	Relevant usage, active participation, or workflow breadth is declining relative to the account’s own baseline	Diagnose whether the cause is friction, seasonality, organizational change, or reduced need
Expansion ready	The current scope is producing value and the account has an evidenced adjacent need, capacity constraint, or unserved group	Offer a relevant next step without disrupting existing value

Do not assign universal activity thresholds merely because they are easy to query. The same number of weekly users can mean strong adoption for a small account and serious contraction for a larger one. Compare an account with its expected journey, purchased scope, peer segment, and prior behavior.
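
A tiny sketch of the baseline comparison, assuming you already have a per-account baseline for weekly active users. The function name and the example numbers are hypothetical.

```python
from typing import Optional


def relative_change(current: float, baseline: float) -> Optional[float]:
    """Change relative to the account's own baseline; None when no baseline exists."""
    if baseline <= 0:
        return None  # missing baseline is not the same as zero change
    return (current - baseline) / baseline


# The same 20 weekly users, read against two different account baselines.
small_account = relative_change(current=20, baseline=16)
large_account = relative_change(current=20, baseline=80)
print(small_account, large_account)
```

Here the small account is up 25% and the large account is down 75%, even though a fixed threshold of 20 users would treat them identically.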

Your data model also needs to distinguish a person from an account. A power user can make an account look healthy while every other intended user disengages. Conversely, a stable automated workflow may create value without frequent logins. Track the unit at which value is delivered, then roll that evidence up to the commercial account.

For each meaningful behavioral event, capture enough context to reconstruct what happened: account identity, user identity where relevant, event name, timestamp, source, product object or workflow, plan or entitlement context, and outcome. Resolve duplicate identities before calculating breadth or frequency. Missing data must remain distinguishable from negative behavior; an integration outage is not customer disengagement.

Behavior alone is incomplete. Useful retention systems can combine product usage, CRM context, support interactions, billing health, and qualitative session evidence. Each signal should have an owner, a freshness expectation, and a clear meaning. If nobody can explain how a field affects a decision, it does not yet belong in the model.

Turn signals into explainable risk and opportunity decisions

A single health score is convenient for sorting accounts. It is poor guidance for action. Two accounts can receive the same score for completely different reasons: one failed to finish onboarding, while another lost active users after months of successful use. They should not receive the same message or playbook.

Keep a compact score if it helps prioritize work, but expose the dimensions beneath it:

- Value attainment: Has the account completed the behaviors associated with its intended outcome?
- Depth: Is the core workflow repeated enough to become part of normal work?
- Breadth: Is value distributed across the intended users, teams, use cases, or product areas?
- Trajectory: Is relevant behavior growing, stable, stalled, or declining against an appropriate baseline?
- Friction: Are unresolved issues, repeated failures, poor outcomes, or setup barriers preventing progress?
- Commercial health: Is the account approaching a renewal, reducing scope, encountering billing trouble, or operating near a legitimate capacity boundary?

Every flagged account should carry reason codes in plain language. A useful record says that core workflow usage declined from the account baseline, active participation narrowed, the change began after an unresolved issue, and the evidence was refreshed recently. A label such as “health score: 42” does not tell an owner what to do.

Also show what would disconfirm the assessment. If a supposed contraction signal is seasonal, expected, or caused by a tracking change, the owner needs a way to correct it. That feedback should improve the rule or model instead of disappearing into private notes.

My default is to begin with transparent rules and cohort comparisons. Add machine learning when the volume, complexity, and demonstrated lift justify it. A black-box score creates false precision if Product cannot trace it to behavior and Customer Success does not trust it enough to act. Clear drivers, cohort-level analysis, and explainable scoring are operational requirements, not cosmetic reporting features.

AI is useful for classifying issue themes, summarizing account context, detecting unusual changes, ranking eligible accounts, and recommending a playbook. It should not silently make ambiguous commercial commitments or send sensitive outreach to a strategically important account without the controls your business requires. Preserve the underlying evidence, model or rule version, chosen action, human override, and eventual outcome so the decision can be audited.

Apply the same discipline to governance. Limit access to account data by role, record consequential changes, define how customer data may be used, and evaluate retention tooling for privacy, implementation burden, and maintainability as well as predictive performance. A model that cannot be governed will eventually become difficult to trust or operate.

Match each customer state to a bounded playbook

A signal without an intervention is reporting. An intervention without eligibility rules is noise. Build a small library of bounded playbooks, each designed for one customer condition and one desired state change.

Every playbook should specify:

- The eligible segment and state.
- The evidence that triggers entry.
- Conditions that suppress outreach, such as an unresolved incident, a recent human conversation, an opt-out, or an active commercial negotiation.
- The customer problem and value hypothesis.
- The channel, message, and accountable owner.
- The action you want the customer to take.
- The success event and business outcome.
- The guardrails that reveal annoyance, added support burden, or unintended contraction.
- The exit condition, expiration rule, and fallback if the customer does not respond.
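
The entry and suppression parts of that template are the easiest to encode. The sketch below assumes accounts and playbooks are plain dictionaries with the hypothetical keys shown. It returns both the decision and the reason so the outcome stays explainable.

```python
# Hypothetical suppression flags mirroring the conditions listed above.
SUPPRESSIONS = (
    "unresolved_incident",
    "recent_human_conversation",
    "opted_out",
    "active_negotiation",
)


def entry_decision(account: dict, playbook: dict) -> tuple[bool, str]:
    """Decide whether an account may enter a playbook, with a plain-language reason."""
    if account.get("segment") != playbook["segment"]:
        return False, "not eligible: segment mismatch"
    if account.get("state") != playbook["state"]:
        return False, "not eligible: state mismatch"
    for flag in SUPPRESSIONS:
        if account.get(flag):
            return False, f"suppressed: {flag}"
    return True, "enter"


playbook = {"segment": "mid-market", "state": "onboarding_stalled"}
account = {"segment": "mid-market", "state": "onboarding_stalled",
           "unresolved_incident": True}
print(entry_decision(account, playbook))
```

The account matches the playbook but is held back because an incident is still open, which is the behavior you want before any outreach goes out.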

That template forces useful distinctions between common plays:

- Onboarding rescue. Identify the missing value milestone and address that obstacle directly. Use an in-product guide for a clear, contextual step. Route technical ambiguity or multi-step setup to a person who can diagnose it.
- Shallow-adoption expansion. Help an already successful user repeat the core workflow or bring the right colleagues into it. Do not pitch additional commercial scope before the existing scope is working.
- Friction recovery. Connect repeated errors, unresolved issues, or failed outcomes to the affected workflow. Fixing the underlying problem takes priority over a generic educational campaign.
- Contraction diagnosis. Ask why behavior changed before prescribing a solution. Declining activity may reflect product friction, a completed project, seasonality, team turnover, or a genuine loss of need.
- Consultative expansion. Trigger outreach after demonstrated success and an evidenced adjacent need. Frame the next step around the customer’s outcome, not an arbitrary quota or a feature list.

Channel choice matters. In-app guidance works when the next step is clear and the customer is already in the relevant context. Lifecycle messaging can reinforce an understood behavior. Customer Success or Sales should handle relationship-heavy and commercial situations. Support is especially valuable when the opportunity requires product depth, diagnosis, or credibility earned through solving a real problem.

AI automation can give support teams capacity for that higher-context work, but capacity alone does not create a consultative motion. One AI-enabled support transformation started with a small volunteer cohort inside an organization of more than 100 people and grew to roughly 16 participants across regions within a year. Early use cases focused on trial guidance, optimization for mature customers, and accounts that appeared ready for broader adoption.

The implementation lesson is more important than the org chart: protect core support quality, recruit people who want to test the motion, and train for curiosity, commercial awareness, and broader customer context. Product knowledge is necessary, but consultative work also requires the restraint to ask another question before recommending an answer.

Keep automation reversible. If the account’s state changes, a human begins working the case, or new evidence contradicts the trigger, stop the sequence. A retention system should respond to current customer reality, not continue executing an outdated classification.

Prove incremental impact and build an operating rhythm

The easiest measurement mistake is comparing customers who accepted help with customers who ignored it. In a six-month comparison, accounts that engaged with proactive support grew roughly twice as fast in both usage and expansion as accounts that were contacted but did not respond. That is a meaningful operational signal, but it is not the same as randomized causal proof: customers who engage may already be more motivated, better staffed, or more likely to grow.

When the stakes and volume permit, define the eligible population first and assign eligible accounts to treatment and holdout groups. Randomize at the account level when account-level outcomes and cross-user spillover matter. Measure all assigned accounts in their assigned group, including customers who never engage with the intervention. That estimates the effect of offering the playbook, not merely the characteristics of people who accepted it.
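
Here is a minimal sketch of account-level assignment and an intent-to-treat comparison. It uses deterministic hash bucketing as a stand-in for randomization, which is my assumption and not a method described above. The function names, the `holdout_share` parameter, and the record shape are also hypothetical.

```python
import hashlib


def assign_group(account_id: str, experiment: str, holdout_share: float = 0.5) -> str:
    """Assign an account to a stable group by hashing its ID with the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{account_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "holdout" if bucket < holdout_share else "treatment"


def intent_to_treat_effect(accounts: list[dict]) -> float:
    """Difference in mean outcome between assigned groups, ignoring engagement."""
    def mean_outcome(group: str) -> float:
        values = [a["outcome"] for a in accounts if a["group"] == group]
        return sum(values) / len(values)
    return mean_outcome("treatment") - mean_outcome("holdout")


# Every assigned account is counted, including treated accounts that never engaged.
accounts = [
    {"group": "treatment", "engaged": True, "outcome": 1},
    {"group": "treatment", "engaged": True, "outcome": 0},
    {"group": "treatment", "engaged": False, "outcome": 1},
    {"group": "treatment", "engaged": False, "outcome": 0},
    {"group": "holdout", "engaged": False, "outcome": 0},
    {"group": "holdout", "engaged": False, "outcome": 0},
    {"group": "holdout", "engaged": False, "outcome": 1},
    {"group": "holdout", "engaged": False, "outcome": 0},
]
print(intent_to_treat_effect(accounts))
```

Note that `engaged` is recorded but deliberately unused in the comparison. Filtering on it would reintroduce the self-selection problem described above.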

Before launch, document:

- The customer state and segment being tested.
- The intervention unit: user, workspace, account, or another value-bearing entity.
- The primary outcome the playbook is meant to change.
- The observation window, chosen to match the expected behavior and commercial cycle.
- The minimum detectable effect (MDE) that would make the effort worth acting on.
- Leading indicators that show whether customers moved through the intended mechanism.
- Guardrails that would stop or narrow the rollout.
- The decision rule for scaling, revising, or retiring the playbook.
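
The MDE also tells you whether the test is feasible at your account volume. The sketch below uses the standard normal-approximation sample size for comparing two proportions. The 5% two-sided significance level and 80% power defaults are conventional assumptions on my part, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def accounts_per_group(baseline_rate: float, mde: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate accounts needed per group to detect an absolute lift of `mde`."""
    p1, p2 = baseline_rate, baseline_rate + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Smaller effects need far more accounts than larger ones.
print(accounts_per_group(baseline_rate=0.20, mde=0.05))
print(accounts_per_group(baseline_rate=0.20, mde=0.02))
```

If the required number exceeds the eligible population in your observation window, the planned test cannot deliver the certainty you want, and it is better to know that before launch.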

If random assignment is not practical, use the strongest comparison your context allows. At minimum, compare accounts that were eligible at the same time and stratify by segment, starting health, lifecycle stage, and prior trajectory. Label the result as observational. Do not turn a directional association into a causal revenue claim.

Use a measurement stack rather than one success metric:

- Mechanism metrics: completion of the missing milestone, restored core behavior, increased workflow breadth, or resolution of the triggering friction.
- Intervention metrics: eligibility, delivery, response, acceptance, completion, time to action, and exit reason.
- Commercial outcomes: renewal, churn, contraction, expansion, and Net Revenue Retention (NRR).
- Guardrails: opt-outs, complaints, avoidable support demand, negative product outcomes, and harm to other customer journeys.

A common NRR calculation is starting recurring revenue plus expansion, minus contraction and churn, divided by starting recurring revenue. Document your exact definition and keep it stable. Report gross retention, contraction, and expansion beside NRR because strong expansion can conceal losses elsewhere in the customer base.
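
As a worked example of that calculation, here is a short sketch with made-up figures. The gross retention formula shown (starting revenue minus contraction and churn, divided by starting revenue) is a common definition I am assuming; confirm it against your own documented definitions.

```python
def net_revenue_retention(starting: float, expansion: float,
                          contraction: float, churn: float) -> float:
    """NRR = (starting + expansion - contraction - churn) / starting."""
    if starting <= 0:
        raise ValueError("starting recurring revenue must be positive")
    return (starting + expansion - contraction - churn) / starting


def gross_revenue_retention(starting: float, contraction: float, churn: float) -> float:
    """Assumed definition: retention before counting any expansion."""
    if starting <= 0:
        raise ValueError("starting recurring revenue must be positive")
    return (starting - contraction - churn) / starting


# Illustrative numbers only.
nrr = net_revenue_retention(starting=100_000, expansion=30_000,
                            contraction=10_000, churn=15_000)
grr = gross_revenue_retention(starting=100_000, contraction=10_000, churn=15_000)
print(nrr, grr)
```

With these figures NRR is 1.05 while gross retention is 0.75, so a healthy-looking headline number sits on top of a quarter of the starting revenue being lost.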

The operating review should end in decisions, not dashboard commentary. Inspect data quality first. Then review movement between customer states, playbook reach and outcomes, experiment evidence, guardrail breaches, and customer feedback. For every change, record an owner, the rule or playbook being changed, the expected effect, and when the evidence will be reviewed.

Ownership must follow the loop. Product can define value milestones and product interventions. Data can maintain instrumentation and analytical quality. Support and Customer Success can diagnose context and execute human plays. Growth can operate scaled journeys. Revenue Operations can maintain CRM and commercial definitions. One accountable leader still needs to own whether the complete system produces better customer and business outcomes.

Do not begin by buying a prediction platform or modeling every possible customer state. Choose one segment where a meaningful signal appears early enough to act. Define the state, instrument the evidence, create one bounded playbook, and preserve a credible comparison group. Add complexity only after that loop changes an outcome you care about. That is how retention stops being a renewal rescue exercise and becomes a product operating capability.


May 8, 2026

Outcome-Based Pricing That Delivers: Pay $10 Only for Qualified Leads with Fin for Sales

Our outcome-based pricing model hinges on one principle: you pay when Fin delivers value.

As Fin takes on new roles, that principle doesn’t change, but the definition of value does.

Fin for Sales qualifies leads, engages prospects, and routes high-intent buyers to your sales team. The value it creates isn’t a resolved query, but a pipeline of qualified opportunities. So we price accordingly: $10 per qualified lead. And you, the customer, define what “qualified” means, not Fin.

This is the first outcome-based pricing model for an AI Agent for sales. Here’s why I believe it’s the right approach and how I’ve seen it change the way teams think about SaaS pricing and ROI.

Over the years, I’ve learned that the fastest way to earn trust with sales and finance leaders is to align pricing with outcomes they actually report on. The core finding from our research was unambiguous: zero buyers preferred paying for activity. They wanted to pay for results.

That insight shaped how we priced Fin for its service role: $0.99 per resolution, where a resolution means the customer’s issue is fully solved without human intervention. More recently, we broadened that model from resolutions to outcomes, reflecting the wider ways Fin delivers value across complex workflows. We believe pricing should be aligned with value delivery, and the vendor should carry the risk when the product doesn’t perform. In sales, the best unit of value is pipeline.

Most sales teams today are overwhelmed by leads. Early in my career, I watched reps spend hours chasing form fills that looked promising but went nowhere. That experience cemented a lesson I still use: volume is vanity; qualification is sanity.

What makes the difference is getting the right opportunities to your sales team promptly. When a prospect visits your site, engages with Fin, answers qualifying questions, and is routed to a sales rep, Fin has determined that the opportunity is worth your team’s time. That determination is the value it delivers.

Charging per conversation would penalize businesses for every curious visitor who asks a question but isn’t a buyer. And charging per token, well, that’s always been a model that protects the vendor, not the customer.

We needed a metric that captures the actual value Fin creates in a sales context: qualified leads.

The purest version of outcome-based pricing for Fin’s sales role would be a percentage of closed revenue. Fin qualifies the lead, a rep closes the deal, and we take a cut. On paper, it looks elegant; in practice, I found it breaks down for two reasons that matter to operators.

First, attribution. Between the moment Fin qualifies a lead and the moment a deal closes, dozens of things can impact the final result. The quality of human-led demos can differ, products can have outages, prospects’ budgets can get cut. Tying Fin’s price to the final outcome holds it accountable for variables entirely outside its control.

Second, measurement. To track closed revenue, we’d need deep integration into every customer’s CRM, tracking each opportunity from qualification through to close. That’s a significant implementation burden that slows time to value, which is the opposite of what we want.

So we asked: what’s the most honest proxy for the value Fin delivers, where Fin is clearly the one creating it?

A qualified lead is that proxy. It represents the moment Fin has done its job. It has engaged the prospect, gathered the relevant information, evaluated them against your criteria, and determined they’re qualified. Everything up to that point is Fin’s work. Everything after it is the rep’s. At $10 per qualified lead, the pricing reflects this boundary.

There are two key components to how this pricing model works.

First, the customer defines success. With Fin’s sales role, the customer sets their own qualification criteria based on their business context. A company with high average contract values might set a lower bar because they can’t afford to miss anyone. A company where rep time is scarce and deal sizes are smaller might set a much higher bar, filtering aggressively to only surface the most promising prospects. The criteria flex to match the business.

Second, the economics are different by design. As a Customer Agent, Fin can switch between roles like sales and service. So if you’ve deployed Fin for Sales, it can still handle support queries, such as a prospect asking a product question. Those queries are charged at $1 per resolution, consistent with our service pricing. Disqualifications, where Fin determines a prospect doesn’t meet the criteria, are also $1. The $10 price point for qualified leads reflects the higher value of pipeline creation compared to issue resolution.
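To make the arithmetic concrete, here is a minimal sketch of how a month of usage adds up at the prices quoted above. The volumes and the `monthly_cost` helper are hypothetical, chosen only for illustration; this is not Intercom's billing logic.

```python
# Prices quoted in the text (USD). All volumes below are made-up examples.
PRICE_QUALIFIED_LEAD = 10.00
PRICE_RESOLUTION = 1.00
PRICE_DISQUALIFICATION = 1.00


def monthly_cost(qualified_leads: int, resolutions: int, disqualifications: int) -> float:
    """Sum the three billable outcomes described in the text."""
    return (
        qualified_leads * PRICE_QUALIFIED_LEAD
        + resolutions * PRICE_RESOLUTION
        + disqualifications * PRICE_DISQUALIFICATION
    )


# Hypothetical month: 120 qualified leads, 300 support resolutions, 400 disqualifications.
total = monthly_cost(120, 300, 400)
print(f"Total: ${total:,.2f}")                                   # Total: $1,900.00
print(f"All-in cost per qualified lead: ${total / 120:.2f}")     # $15.83
```

The all-in figure is the one to compare against your current cost of producing a qualified lead, because resolutions and disqualifications are billed alongside the leads themselves.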

The ROI speaks for itself. Early customers are reporting significant returns using Fin for Sales. One shared a perspective that mirrors what I hear in executive QBRs:

“I would say it’s at least 10 times the value. You’re now giving the business exactly what it needs as opposed to just activity. We say this expression in sales leadership all the time – ‘I don’t pay my sales team for activity. I pay them for results.’ I want my AI engine to be the same way.”

When you compare the cost of a qualified lead from Fin against the fully loaded cost of an SDR (salary, benefits, tooling, ramp time), the economics are compelling. For many businesses, particularly those that never had SDRs in the first place, Fin for Sales isn’t just replacing headcount; it is creating an entirely new capability that wasn’t economically viable before.

This pricing model came from extensive customer research—qualitative interviews and quantitative studies—exploring how buyers want to pay for AI in a sales context. We tested multiple concepts: per-conversation, per-token, per-seat, revenue share, and per-qualified-lead. The research consistently pointed to outcome-aligned pricing as the preferred model, with the qualified lead emerging as the metric that best balances value alignment, measurability, and practical implementation.

Outcome-based pricing is still rare in AI, but we think that will change. For Sales Agents, we’re the first to do it. Transparency is part of the model. If you understand why we price the way we do, you can evaluate whether it works for your business.

Inspired by this post on The Intercom Blog.

May 8, 2026
4 Costly Agent Analytics Myths—And the Data-Backed Metrics I Rely on Instead

In my work with product, operations, and support leaders, I’m often asked to help make sense of Agent Analytics—what to track, how to attribute outcomes, and where to invest. After reviewing countless dashboards and running experiments across human agents and AI agents, I’ve learned that some of the most common measurement beliefs are precisely the ones that lead teams astray.
Below, I unpack four pervasive myths I encounter and share the data-centered practices I use to replace them. My goal is simple: help you upgrade the way you measure performance so you can improve customer outcomes, accelerate learning, and scale impact with confidence.
Myth 1: “Lower average handle time (AHT) means higher performance.” AHT is useful but incomplete. When teams optimize solely for speed, they often push complexity into repeat contacts, reopens, or escalations. In the data, that shows up as a weak or negative relationship between lower AHT and durable outcomes like first contact resolution (FCR), customer effort, or revenue per conversation.
Reality and what I measure instead: I right-size speed by pairing AHT with intent-level resolution and recontact rate. For simple intents (password reset, billing address update), shorter is usually better. For complex intents (tiered troubleshooting, multi-step verification), “right-speeding” wins—slightly longer interactions that prevent rework. Practically, that means segmenting by intent complexity using behavioral analytics, tracking weighted “intent resolution rate,” and monitoring repeat-contact windows (24–168 hours) to catch downstream pain.
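As a rough illustration of the repeat-contact windows mentioned above, the sketch below computes a recontact rate for a 24-hour and a 168-hour window. The tuple layout, the sample contacts, and the `recontact_rate` helper are my assumptions for the example, not a schema from any particular analytics tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recontact_rate(contacts, window_hours):
    """Share of contacts followed by another contact from the same customer
    on the same intent within `window_hours`."""
    by_key = defaultdict(list)
    for customer_id, intent, ts in contacts:
        by_key[(customer_id, intent)].append(ts)

    total = 0
    recontacted = 0
    window = timedelta(hours=window_hours)
    for timestamps in by_key.values():
        timestamps.sort()
        for i, ts in enumerate(timestamps):
            total += 1
            # Only the next contact matters: if it is outside the window, later ones are too.
            if i + 1 < len(timestamps) and timestamps[i + 1] - ts <= window:
                recontacted += 1
    return recontacted / total if total else 0.0


# Hypothetical contact log: (customer_id, intent, timestamp).
contacts = [
    ("c1", "password_reset", datetime(2026, 5, 1, 9, 0)),
    ("c1", "password_reset", datetime(2026, 5, 1, 20, 0)),   # 11 hours later
    ("c2", "billing_update", datetime(2026, 5, 1, 10, 0)),
    ("c3", "troubleshooting", datetime(2026, 5, 1, 8, 0)),
    ("c3", "troubleshooting", datetime(2026, 5, 4, 8, 0)),   # 72 hours later
]

for hours in (24, 168):
    print(hours, recontact_rate(contacts, hours))   # 24 0.2, then 168 0.4
```

The troubleshooting recontact is invisible at 24 hours and appears only in the 168-hour window, which is why I look at more than one window for complex intents.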
Myth 2: “AI agent containment tells the whole story.” A high containment rate can mask failure modes such as unresolved intent, silent abandonment, or low-quality handoffs that frustrate customers and spike human workload later.
Reality and what I measure instead: I break containment into three parts for voice and chat flows: (1) intent resolution without escalation, (2) graceful handoff quality when escalation is necessary, and (3) post-handoff efficiency and satisfaction. For voice AI agent experiences, I also track escalation clarity (did the transcript summarize history and intent?), time-to-human, and customer satisfaction on the combined interaction. This gives a fuller view of how well the customer support AI strategy is working and avoids over-crediting automation for partial wins.
Myth 3: “Quality is subjective, so it can’t be measured at scale.” Teams often default to sporadic QA because they assume it can’t be standardized across channels or agent types. The result is noisy feedback loops and stalled coaching.
Reality and what I measure instead: Quality becomes measurable when it’s grounded in observable behaviors linked to outcomes. I use a rubric anchored in behavioral analytics (e.g., verified customer need, correct resolution path, policy compliance, empathy markers) and validate it via correlation with FCR, recontact, and retention analysis. To scale, I combine calibrated human reviews with AI-assisted scoring, check inter-rater reliability weekly, and use driver trees to connect quality levers to business results. This creates a consistent, coachable signal for both human agents and AI flows.
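One common way to check inter-rater reliability is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The text does not say which statistic to use, so treat kappa as my choice for this sketch; the labels and scores below are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label for everything
    return (observed - expected) / (1 - expected)


# Hypothetical weekly calibration sample: human reviewer vs. AI-assisted scorer.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
ai = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "fail"]

print(round(cohens_kappa(human, ai), 2))   # 0.58
```

Here the two raters agree on 8 of 10 items, but kappa is only about 0.58 because roughly half of that agreement would be expected by chance.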
Myth 4: “If the dashboard is green after launch, we’ve won.” Early wins can reflect novelty effects, cherry-picked routing, or short-term incentives that don’t persist. Declaring victory too soon locks in fragile gains and hides regressions across cohorts.
Reality and what I measure instead: I treat go-live as the start of learning. I use A/B testing with a clear minimum detectable effect (MDE), stagger ramps, and hold out stable control cohorts for at least one full demand cycle. I track outcome OKRs rather than output OKRs, focusing on intent resolution, customer effort, and revenue and customer health over vanity metrics. I also monitor seasonality and channel mix shifts inside a unified analytics platform to ensure improvements generalize beyond the first week.
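To show what a "clear MDE" implies for traffic, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The baseline rate, the 3-point MDE, and the defaults of 5% significance and 80% power are assumptions for illustration; your experimentation tool may use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate sessions needed per arm to detect an absolute lift of
    `mde_abs` over `baseline_rate` with a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical: 70% intent resolution today, and a 3-point lift is the smallest worth acting on.
n_per_arm = sample_size_per_arm(0.70, 0.03)
print(n_per_arm)
```

Cutting the MDE shrinks the denominator quadratically, so a 1-point MDE needs roughly nine times the traffic of a 3-point one. That is usually the argument for agreeing on the MDE before launch rather than after.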
How I operationalize this day to day: (1) define intents and complexity upfront, (2) unify journey data across channels, (3) instrument resolution and recontact rigorously, (4) apply driver trees to isolate what actually moves outcomes, and (5) iterate via disciplined experiments rather than sweeping changes. This approach aligns product and operations, speeds up coaching, and ensures AI investments compound rather than decay.
If you’re rethinking your Agent Analytics stack, start by replacing each myth with a sharper metric: pair AHT with intent-level resolution, pair containment with handoff quality and satisfaction, pair QA with outcome-linked rubrics, and pair green dashboards with robust experiments. The payoff is a measurement system that earns trust, guides better decisions, and consistently improves customer and business results.

Inspired by this post on Pendo – Best Practices.

May 7, 2026

How to Evaluate a Shopify-Native AI Shopping Agent

You’ve probably been asked a deceptively simple question: should your Shopify store add an AI shopping agent? The hard part isn’t installing another chat widget. It is deciding whether the agent can help an uncertain shopper choose correctly without recommending unavailable products, misreading policy, or making a costly order change.

Treat this as a commerce-system decision, not a chatbot decision. A useful agent must connect conversation, live product data, cart behavior, checkout, and post-purchase service. The evaluation framework below will help you separate a persuasive demo from a system you can trust with customers and revenue.

The native test: can the agent read, reason, and act?

“Shopify-native” should describe an architecture, not a distribution channel. Being listed in an app marketplace or embedded in a storefront does not make an agent native. The meaningful test is whether Shopify remains the operational source of truth while the agent uses its data and APIs in the customer’s current context.

A concrete implementation shows how high that bar can be: a Shopify connection can expose products, variants, content, live inventory, order data, policies, and transactional APIs to the same customer-facing agent. That combination matters because a correct product description is still a bad answer if the relevant size is unavailable, and an accurate return policy is insufficient if the customer must start over somewhere else to use it.

I would evaluate an agent at four capability levels. A product that stops at the first level may still be useful, but it should not be presented internally as an autonomous commerce agent.

Capability	What the agent needs	What you should ask it to prove
Answer	Product content, store policies, and current catalog facts	Answer a precise product or policy question and identify the relevant item, variant, or rule
Recommend	Catalog relationships, inventory, conversation context, and the shopper’s constraints	Turn an ambiguous request into a short, reasoned shortlist instead of returning generic search results
Transact	Cart or order APIs, authentication, permissions, and confirmation controls	Update a test cart or prepare an order change while showing exactly what will happen before execution
Recover	Shared state across shopping, service, and human escalation	Resolve a support interruption and resume the customer’s original shopping task without asking for the same context again

Freshness is part of correctness. During a test, change the availability of a variant in Shopify and repeat the same shopping request. The agent should stop recommending that variant once the change is reflected in the connected system. Run a similar test with a policy update. A polished answer based on yesterday’s state is not a small quality defect; it can create a promise your operations team must later unwind.

Actions deserve an even stricter test. Ask the vendor to demonstrate the complete chain: customer identification, authorization, interpretation of the request, action preview, explicit confirmation, API execution, and a visible result. If any step is simulated, ask which one. A fast setup can reduce implementation effort, but it does not prove that the agent is accurate, observable, or safe.

Design the shopping dialogue around decisions, not keywords

Traditional ecommerce search works best when the customer already knows the product vocabulary. A shopping agent earns its place when the request is incomplete: a gift for a partner, a mattress for a particular sleep preference, or shoes that must work across road and trail conditions. The agent’s first job is not to produce an answer. It is to discover which answer would be useful.

A strong product-discovery dialogue follows a repeatable decision sequence:

Restate the customer’s job in plain language so a misunderstanding becomes visible early.
Identify hard constraints first, such as an in-stock variant, required use case, compatibility, budget boundary, or delivery requirement.
Ask only for information that could change the recommendation. A question that does not alter ranking or eligibility is conversational overhead.
Present a small shortlist and tie each option to the constraints the customer supplied.
Explain the meaningful tradeoff between the options instead of declaring a universal winner.
Offer the next useful action: compare details, select a variant, update the cart, or continue narrowing the choice.

This sequence turns conversation design into a product requirement. For every recommended item, your evaluation record should capture the customer’s stated need, the product facts used, the reason the item fits, the tradeoff disclosed, current variant availability, and the next question or action. That record gives your team something concrete to inspect when a recommendation is challenged.
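The evaluation record described above can be sketched as a small data structure. The field names and the `problems` check are hypothetical; adapt them to whatever your review process actually inspects.

```python
from dataclasses import dataclass


@dataclass
class RecommendationRecord:
    """One reviewed recommendation, mirroring the fields listed in the text."""
    stated_need: str
    product_facts_used: list
    fit_reason: str
    tradeoff_disclosed: str
    variant_in_stock: bool
    next_step: str

    def problems(self) -> list:
        """Return the reasons this recommendation would fail review."""
        issues = []
        if not self.product_facts_used:
            issues.append("no product facts cited")
        if not self.fit_reason.strip():
            issues.append("no fit reason given")
        if not self.tradeoff_disclosed.strip():
            issues.append("no tradeoff disclosed")
        if not self.variant_in_stock:
            issues.append("recommended variant is unavailable")
        return issues


# Hypothetical record from a test conversation.
record = RecommendationRecord(
    stated_need="shoes for mixed road and trail runs",
    product_facts_used=["outsole: hybrid lug", "weight: 280 g"],
    fit_reason="hybrid outsole handles both surfaces",
    tradeoff_disclosed="",
    variant_in_stock=False,
    next_step="select a size",
)
print(record.problems())
```

A record like this turns "the recommendation felt wrong" into a specific, reviewable failure such as a missing tradeoff or an unavailable variant.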

Test whether the reasoning is responsive rather than decorative. Change one important answer while holding the rest of the conversation constant. If the customer switches from occasional use to daily use, removes a budget constraint, or requires an available color, the shortlist should change when that information is material. If the products remain identical regardless of the customer’s answers, the experience is probably search with conversational packaging.

The agent should also know when not to narrow further. Once the customer has enough information to choose, another question adds friction. Conversely, confidence should not be manufactured when catalog data cannot resolve the request. A safe response identifies the missing fact, asks for clarification, or hands the conversation to a person with the constraints already summarized.

Product cards can accelerate the final step, but the interface should preserve the reasoning that produced them. An image, name, and price answer “what is this?” The conversation must also answer “why does this fit me?” That is the difference between displaying inventory and assisting a decision.

Make shopping and support one customer state machine

A shopper does not experience your sales and support departments as separate funnels. The same person can compare products, ask about shipping, check an existing order, correct a variant, and return to buying in one session. Routing each intent to a separate tool forces the customer to reconstruct context at every boundary.

Model the journey as one state machine: discover, decide, transact, service, and resume. The agent can move between those states, but it should retain the customer’s goal, constraints, products considered, cart state, relevant order, completed actions, and unresolved question. That shared state is more important than whether the organization labels the current message “sales” or “support.”
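A minimal sketch of that idea follows: the states come from the text, while the `Session` fields, method names, and the sample SKU are hypothetical. The point it illustrates is that a support interruption changes the state but leaves the shopping context intact.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    DISCOVER = "discover"
    DECIDE = "decide"
    TRANSACT = "transact"
    SERVICE = "service"
    RESUME = "resume"


@dataclass
class Session:
    """Shared customer context that survives every state change."""
    state: State = State.DISCOVER
    goal: str = ""
    constraints: dict = field(default_factory=dict)
    products_considered: list = field(default_factory=list)
    return_to: Optional[State] = None

    def start_service(self) -> None:
        """Interrupt shopping for a support request, remembering where we were."""
        if self.state is not State.SERVICE:
            self.return_to = self.state
        self.state = State.SERVICE

    def resume(self) -> State:
        """Mark the session as resuming and report which shopping state to restore."""
        self.state = State.RESUME
        return self.return_to or State.DISCOVER


session = Session(goal="trail running shoes", constraints={"size": "42", "budget": 150})
session.state = State.DECIDE
session.products_considered.append("hypothetical-sku-123")

session.start_service()          # customer asks about an existing order
next_state = session.resume()    # support question resolved
print(next_state, session.constraints)
```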

This is where a connected agent can do more than answer FAQs. Current Shopify-oriented implementations can handle tracking, returns, exchanges, refunds, order changes, shipping questions, and subscription updates through connected procedures and APIs. Each additional action increases usefulness, but it also increases the consequence of a misunderstanding.

Use different controls for different action classes:

Read-only actions, such as showing order status, do not change commercial state but should still require appropriate customer identification.
Reversible shopping actions, such as adding or removing a cart item, should be immediately visible and easy for the customer to undo.
Financially consequential actions, including refunds, paid order changes, and subscription updates, should require authentication, an exact action summary, explicit confirmation, and a durable result or receipt.
Ambiguous or unsupported actions should stop safely and transfer to a person. The agent must not treat conversational enthusiasm, silence, or an inferred preference as consent.
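The four action classes above can be expressed as a simple gate that runs before any tool call. This is a sketch under assumptions: the control names, the `decide` function, and its return strings are hypothetical, and undo visibility or receipts would be handled by the interface and the executing system rather than by this check.

```python
from enum import Enum


class ActionClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    FINANCIAL = "financial"
    UNSUPPORTED = "unsupported"


# Preconditions per class, following the list above.
REQUIRED_CONTROLS = {
    ActionClass.READ_ONLY: {"identified"},
    ActionClass.REVERSIBLE: set(),
    ActionClass.FINANCIAL: {"authenticated", "previewed", "confirmed"},
}


def decide(action_class: ActionClass, satisfied: set) -> str:
    """Return 'execute', 'handoff', or the controls still missing."""
    if action_class is ActionClass.UNSUPPORTED:
        return "handoff"  # never execute, regardless of what the customer said
    missing = REQUIRED_CONTROLS[action_class] - satisfied
    if missing:
        return "ask_for:" + ",".join(sorted(missing))
    return "execute"


print(decide(ActionClass.READ_ONLY, {"identified"}))                   # execute
print(decide(ActionClass.FINANCIAL, {"authenticated", "previewed"}))   # ask_for:confirmed
print(decide(ActionClass.UNSUPPORTED, {"authenticated", "confirmed"})) # handoff
```

Because confirmation is an explicit entry in `satisfied`, enthusiasm or silence in the conversation cannot satisfy it by accident.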

That last distinction protects both the customer and the business. A mistaken recommendation can usually be reconsidered. An executed refund or subscription change creates financial and operational consequences. If the system cannot preview and verify the exact action, keep that workflow read-only and let a trained person execute it.

The transition back to shopping also needs deliberate design. After resolving a delivery problem or order correction, the agent should restore the prior context and offer a relevant path forward. It should not force an upsell into every service interaction. The next best action after a serious order problem is often confirmation that the problem is resolved. Commercial momentum comes from reducing friction, not from ignoring the customer’s immediate priority.

When escalation is necessary, pass a structured handoff rather than a transcript dump. Include the detected intent, verified identity state, constraints already collected, products or orders involved, actions attempted, results returned by Shopify, and the unresolved decision. A human agent should be able to continue with the next question, not repeat the first one.

Measure incremental commerce value and operational risk

Chat conversion is an attractive metric and an easy one to misread. People who open a shopping conversation may already have higher intent than people who do not. Comparing those groups directly can credit the agent for demand it did not create.

Ninja Transfers reported that 10% of its conversations converted to orders with values 20% above the store’s average order value. That is a useful customer result, but it is a vendor-supplied case from one merchant, not a universal benchmark or proof that the agent caused the full difference. Your business case should depend on your own incremental test.

Where traffic permits, randomize eligible storefront sessions between an agent experience and the existing experience. Measure the result across all eligible sessions, not only the visitors who choose to chat. That intention-to-treat view reduces self-selection bias and answers the executive question: what changed when the store made the agent available?
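The difference between the two comparisons is easy to see with a toy dataset. The session rows below are invented for illustration, and the field names are hypothetical.

```python
# Each row is one eligible storefront session. Numbers are made up.
sessions = [
    {"arm": "agent", "chatted": True, "revenue": 100.0},
    {"arm": "agent", "chatted": True, "revenue": 0.0},
    {"arm": "agent", "chatted": False, "revenue": 0.0},
    {"arm": "agent", "chatted": False, "revenue": 20.0},
    {"arm": "control", "chatted": False, "revenue": 80.0},
    {"arm": "control", "chatted": False, "revenue": 0.0},
    {"arm": "control", "chatted": False, "revenue": 0.0},
    {"arm": "control", "chatted": False, "revenue": 20.0},
]


def revenue_per_session(rows):
    rows = list(rows)
    return sum(r["revenue"] for r in rows) / len(rows) if rows else 0.0


agent = [s for s in sessions if s["arm"] == "agent"]
control = [s for s in sessions if s["arm"] == "control"]

# Intention-to-treat: every eligible session in each arm counts.
itt_lift = revenue_per_session(agent) - revenue_per_session(control)

# Naive view: chatters vs. non-chatters inside the agent arm (self-selected groups).
naive_gap = revenue_per_session(s for s in agent if s["chatted"]) - revenue_per_session(
    s for s in agent if not s["chatted"]
)

print(itt_lift, naive_gap)   # 5.0 40.0
```

In this toy example the naive gap is eight times the intention-to-treat lift, because high-intent shoppers chose to chat. The 5.0 is the number that answers what changed when the agent was made available.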

Use a balanced scorecard rather than a single conversion target:

Business outcomes: completed orders per eligible session, revenue per eligible session, average order value, checkout completion, and assisted revenue.
Decision outcomes: recommendation engagement, product-detail visits after a recommendation, add-to-cart actions, and successful comparison flows.
Downstream quality: cancellations, returns, exchanges, and contacts caused by a poor recommendation or incorrect expectation.
Service outcomes: successful action completion, repeat contact for the same problem, human escalation, and time to a confirmed resolution.
Agent quality: use of current catalog facts, in-stock recommendation rate, policy accuracy, clarification behavior, safe refusal, and correct tool execution.
Risk outcomes: unauthorized or incorrect actions, failed confirmations, customer complaints, policy exceptions, and cases requiring operational repair.

Conversion and average order value belong beside returns and cancellations. An agent can raise the initial basket by recommending a more expensive option while reducing customer fit. Without the downstream view, the dashboard rewards the sale and hides the repair.

Your event model should make the journey reconstructable. Useful events include agent opened, intent classified, clarification answered, recommendation shown, product selected, cart changed, checkout started, order completed, support action requested, confirmation received, action completed, and human handoff. Join these events through an appropriately governed session or order identifier so the team can inspect both funnel movement and individual failure paths.

Build the evaluation before the vendor demo

Create scenarios from your real catalog shape, policies, and failure modes. For each scenario, write the expected outcome and the failures that would make deployment unacceptable. Include:

An ambiguous shopping request that requires clarification.
Two products that appear similar but differ on a constraint customers care about.
An unavailable variant that would otherwise be the best match.
A question whose answer depends on store policy rather than product copy.
A conversation that moves from shopping to order support and back again.
A request that sounds actionable but lacks required authentication or confirmation.
An unsupported request that should trigger a safe handoff.
A catalog or policy change made after the agent’s initial synchronization.

Run the same scenarios repeatedly and record the underlying catalog state each time. You are testing consistency, grounding, and recovery, not literary elegance. A shorter answer that uses the correct variant and policy is more valuable than a fluent answer that improvises.

Expand autonomy in the order of consequence

A staged rollout lets evidence determine how much authority the agent receives:

Evaluate offline with approved scenarios and representative catalog data.
Launch read-only product discovery and policy answers with an obvious human fallback.
Add visible, reversible cart actions after recommendation quality is stable.
Introduce authenticated order-support workflows with previews and confirmations.
Enable financially consequential actions only after tool execution, auditability, exception handling, and operational ownership have been tested end to end.

Define the owner of each failure before launch. Product should own the intended customer behavior and success measures. Commerce or operations should own policy and workflow correctness. Engineering should own integration reliability and observability. Customer support should own escalation quality and emerging failure patterns. The exact reporting lines can vary; an unowned failure queue cannot.

Review both aggregates and conversation-level traces after release. Aggregate metrics tell you whether the experience is moving the business. Traces tell you why. A small cluster of incorrect variant recommendations or failed order actions can disappear inside a healthy conversion average while creating disproportionate customer harm.

Key takeaways

A Shopify-native agent should use live commerce data and governed APIs; storefront placement alone is not enough.
The agent’s product-discovery job is to uncover decision criteria, apply hard constraints, explain tradeoffs, and lead to the next useful action.
Shopping and support should share customer context, but the agent’s permissions must become stricter as actions become harder to reverse.
Conversation conversion is a diagnostic metric, not causal proof. Measure incremental results across eligible traffic whenever possible.
Pair conversion and average order value with returns, cancellations, incorrect actions, and operational repair costs.
Begin with read-only assistance and expand autonomy only after each workflow proves accurate, observable, recoverable, and properly owned.

Before you approve a purchase, bring a vague shopping request, a policy edge case, an unavailable variant, a mixed sales-and-support conversation, and a consequential order action into a live test store. If the agent cannot show where its answer came from, what it will change, and how it fails safely, you have found the next product requirement, not a detail to defer until launch.

References

Intercom – Fin for Ecommerce: The Shopify-native AI Agent transforming product discovery and sales

May 7, 2026

How to Link AI Evals to Retention Without Chasing Proxies

Your AI activation rate is rising. More users are reaching the agent, completing setup, or trying the workflow. Yet the retention curve is flat. That usually means you know who touched the product, but not who received enough value to return.

A higher aggregate eval score will not resolve that gap. You need to identify an AI quality signal that appears early, connect it to later behavior, and determine whether changing that signal can change retention. The result should influence onboarding, roadmap priorities, customer success, and model releases, not just add another chart to an eval dashboard.

Start with the retention decision, not the eval dashboard

The wrong opening question is: Which evals can the team measure? Start with: What must a user experience early enough that returning becomes the rational next action?

That framing forces you to define retention before searching for a predictive signal. A login is rarely enough. Choose a return behavior that represents recurring value: running another workflow, completing another meaningful task, or bringing the agent into an ongoing process. Then make five decisions explicit:

Define the eligible population. Decide whether you are studying newly activated users, newly activated tenants, or another clearly bounded cohort.
Choose the unit of analysis. Use the user when value is individual. Use the tenant or account when adoption and renewal depend on a shared workflow.
Name the retained behavior. It should represent renewed value, not passive presence.
Select the retention window. Weekly and monthly cohorts answer different product questions, so do not switch between them after seeing the result.
Close the observation period before the retention outcome begins. Otherwise, later behavior can leak into the feature that supposedly predicts it.
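One way to keep those five decisions explicit is to write them down as a specification that computes the windows for you. The sketch below is an assumption about how that could look; the class, field names, and example values are hypothetical. Intervals are half-open, so the observation period ends exactly where the retention window may begin.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionSpec:
    population: str              # e.g. "newly activated tenants"
    unit: str                    # "user" or "tenant"
    retained_event: str          # the behavior that counts as renewed value
    observation_days: int        # window in which eval features are computed
    retention_window_days: int   # window in which the retained event must occur
    gap_days: int = 0            # optional buffer between the two windows

    def windows(self, activation: date):
        """Return (observation_end, retention_start, retention_end).

        Observation is [activation, observation_end) and retention is
        [retention_start, retention_end), so the two can never overlap."""
        observation_end = activation + timedelta(days=self.observation_days)
        retention_start = observation_end + timedelta(days=self.gap_days)
        retention_end = retention_start + timedelta(days=self.retention_window_days)
        return observation_end, retention_start, retention_end


spec = RetentionSpec(
    population="newly activated tenants",
    unit="tenant",
    retained_event="workflow_run_completed",
    observation_days=7,
    retention_window_days=28,
)
print(spec.windows(date(2026, 1, 1)))
# (datetime.date(2026, 1, 8), datetime.date(2026, 1, 8), datetime.date(2026, 2, 5))
```

Freezing the dataclass is deliberate: it makes switching the window after seeing the result a visible change to the spec rather than a quiet edit.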

This distinction matters when activation improves but retention does not. Activation proves that a user crossed a product milestone. It does not prove that the AI produced a trustworthy, complete, safe, and usable outcome. Your eval candidates should measure that missing experience.

Eval family	Question it should answer	When it deserves product attention
Semantic accuracy	Did the output correctly address the intended task?	Incorrect results prevent completion or make the user unwilling to rely on the agent again.
Containment	Did the agent complete the eligible workflow without an avoidable human escalation?	Escalation prevents the workflow from delivering repeatable automation.
Safety	Did the interaction remain within the product’s acceptable risk boundaries?	A regression creates unacceptable exposure, even if another engagement metric improves.
Latency	Did the result arrive fast enough for the user’s workflow?	Delay causes abandonment, repeated attempts, or a return to the previous process.
UX friction	Could the user reach a good outcome without unnecessary setup, retries, or corrections?	Users fail before they have a fair chance to experience the agent’s value.

Shortlist three to five candidates tied to these user outcomes. A long eval inventory makes analysis look comprehensive while weakening the decision. You are not trying to find every quality problem. You are looking for an early signal that is measurable, related to meaningful retention, and alterable through a product intervention.

Build an identity and time contract before modeling anything

The hardest part is usually not the statistical model. It is joining AI interactions to product behavior without duplicating records, losing users, or assigning an outcome to the wrong account. Evals often live in notebooks or model-observability systems while retention events live in product analytics. A plausible-looking join can still be wrong.

Create a data contract that covers both systems. At minimum, it should specify:

Stable user and tenant identifiers, including the rule used when a user belongs to more than one tenant.
The timestamp that determines whether an interaction belongs inside the observation period.
The model and workflow version associated with the interaction.
The conditions that make an interaction eligible for each eval.
The grain of the analysis table, such as one row per user-day or tenant-day.
The treatment of missing data, especially the difference between no eligible interaction and an evaluated failure.

That last distinction is easy to miss. A user who never invoked an eligible workflow did not fail the accuracy eval. Combining non-use and poor quality into the same value hides whether the retention problem comes from discovery, setup, or AI performance.

Compute daily per-user and per-tenant features rather than joining every raw interaction directly to every product event. Each feature should retain its denominator or exposure count. A pass rate without the number of eligible interactions can make sparse use look equivalent to sustained use.
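The sketch below shows one way to build such a feature while keeping the denominator and separating "no eligible interaction" from "evaluated and failed". The input layout and the `daily_pass_rates` helper are hypothetical.

```python
from collections import defaultdict


def daily_pass_rates(interactions):
    """Aggregate raw interactions to one row per (user, day).

    pass_rate is None when nothing was eligible, and 0.0 only when eligible
    interactions were evaluated and all of them failed."""
    counts = defaultdict(lambda: {"eligible": 0, "passed": 0})
    for row in interactions:
        bucket = counts[(row["user"], row["day"])]  # row exists even with no eligible use
        if row["eligible"]:
            bucket["eligible"] += 1
            bucket["passed"] += int(row["passed"])

    features = {}
    for key, c in counts.items():
        rate = c["passed"] / c["eligible"] if c["eligible"] else None
        features[key] = {"eligible": c["eligible"], "pass_rate": rate}
    return features


# Hypothetical raw interactions for one day.
interactions = [
    {"user": "u1", "day": "2026-01-02", "eligible": True, "passed": True},
    {"user": "u1", "day": "2026-01-02", "eligible": True, "passed": False},
    {"user": "u2", "day": "2026-01-02", "eligible": False, "passed": False},
    {"user": "u3", "day": "2026-01-02", "eligible": True, "passed": False},
]

for key, value in sorted(daily_pass_rates(interactions).items()):
    print(key, value)
```

Here u2 and u3 would both look like "0" in a naive average, yet u2 never reached an eligible workflow while u3 experienced a failure. Those are different retention problems.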

Keep the definition of each feature readable. Containment, for example, needs an explicit eligible-workflow denominator and an explicit rule for what counts as avoidable escalation. UX friction needs named events, such as a retry or correction, rather than an opaque composite score. If a product manager cannot explain how the feature changes, the team will struggle to turn it into a roadmap decision.

Watch for many-to-many joins. One AI interaction may generate several product events, and one product session may contain several AI interactions. Joining both raw tables can multiply rows and inflate success or failure counts. Aggregate each side to the agreed grain first, then join the resulting features to the retention cohort.
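A tiny example makes the row multiplication visible. The session ID and rows are invented; the pattern is what matters.

```python
# One session with two AI interactions and three product events.
ai_interactions = [("s1", True), ("s1", False)]                      # (session, passed)
product_events = [("s1", "export"), ("s1", "share"), ("s1", "save")]  # (session, event)

# Raw join on session: 2 x 3 = 6 rows, so the single pass is counted three times.
raw_join = [(a, e) for a in ai_interactions for e in product_events if a[0] == e[0]]
inflated_passes = sum(1 for a, _ in raw_join if a[1])

# Aggregate each side to the agreed grain (here: session) first, then join.
ai_by_session = {}
for session, passed in ai_interactions:
    agg = ai_by_session.setdefault(session, {"interactions": 0, "passes": 0})
    agg["interactions"] += 1
    agg["passes"] += int(passed)

events_by_session = {}
for session, _event in product_events:
    events_by_session[session] = events_by_session.get(session, 0) + 1

joined = {
    session: {**agg, "events": events_by_session.get(session, 0)}
    for session, agg in ai_by_session.items()
}

print(len(raw_join), inflated_passes)   # 6 3
print(joined)   # {'s1': {'interactions': 2, 'passes': 1, 'events': 3}}
```

A quick guard in any pipeline is to assert that the row count after the join equals the row count of the cohort table at the agreed grain.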

Versioning also matters. If a model or workflow changes during the observation period, an account-level average can blend materially different experiences. Preserve the version so you can distinguish a real quality improvement from a change in traffic or segment mix.

Find a threshold that survives segment and leakage checks

Once the dataset is reliable, begin with cohort analysis rather than a complex predictive model. Compare retention among users who reached different levels of each candidate signal. You are looking for a separation that is large enough to matter, stable enough to repeat, and reachable through product changes.

Use this sequence:

Plot weekly or monthly retention against each early eval feature.
Use a driver tree to show where the feature sits between acquisition, activation, AI quality, repeat behavior, and the final retention outcome.
Fit a simple logistic model that controls for plan type, segment, region, and acquisition channel.
Repeat the analysis inside important segments instead of relying only on the blended population.
Check whether the threshold remains directionally useful when you vary the observation definition without allowing it to overlap the outcome.

The controls are not statistical decoration. Higher-plan customers may have better implementation support. One region may contain a different account mix. A high-intent acquisition channel may produce both better agent usage and better retention. Without those checks, customer composition can masquerade as model improvement.
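To see how customer mix can create the gap, here is a sketch with fabricated cohort counts. The plan names and numbers are invented for illustration. Retention is identical inside each plan, yet the blended comparison favors users who crossed the threshold, because more of them are on the higher-retaining plan.

```python
from collections import defaultdict

def cohort(plan, crossed, retained, total):
    """Build `total` hypothetical users, the first `retained` of whom are retained."""
    return [(plan, crossed, i < retained) for i in range(total)]

users = (
    cohort("enterprise", True, 6, 8) + cohort("enterprise", False, 3, 4)
    + cohort("starter", True, 1, 4) + cohort("starter", False, 2, 8)
)

def retention_by(rows, key):
    """Return {group: (retained, total, rate)} for the grouping function."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        group = key(row)
        totals[group][0] += int(row[2])
        totals[group][1] += 1
    return {g: (r, n, r / n) for g, (r, n) in totals.items()}

blended = retention_by(users, key=lambda u: u[1])          # crossed vs. not
by_plan = retention_by(users, key=lambda u: (u[0], u[1]))  # same split within each plan
```

A stratified table like `by_plan` is a cheap first check before fitting the logistic model. It does not replace the model when several controls interact.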

In one product context, users who crossed a specific eval threshold early showed three times higher retention than peers who did not. That is evidence that an eval can become a commercially useful leading indicator. It is not a universal benchmark. Your threshold, effect size, eligible population, and retention behavior will depend on your product.

Do not choose the threshold merely because it creates the largest visual gap. Prefer a boundary that has enough eligible users on both sides, persists across relevant segments, and corresponds to an experience the product can influence. A dramatic ratio from a small cohort is a hypothesis, not a roadmap mandate.
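One way to make the cohort-size requirement mechanical is to reject any candidate threshold that leaves too few users on either side. This sketch uses invented scores and an assumed floor of three users per side; a real floor should come from your own sample-size reasoning.

```python
# Hypothetical users: (early_eval_pass_rate, retained).
users = [(0.2, False), (0.3, False), (0.4, True), (0.5, False),
         (0.6, True), (0.7, True), (0.8, True), (0.9, True)]

MIN_COHORT = 3  # assumed floor for illustration only

def evaluate_threshold(rows, threshold, min_cohort=MIN_COHORT):
    """Compare retention above and below a threshold.

    Returns None when either side has fewer than `min_cohort` users.
    """
    above = [retained for score, retained in rows if score >= threshold]
    below = [retained for score, retained in rows if score < threshold]
    if len(above) < min_cohort or len(below) < min_cohort:
        return None
    rate_above = sum(above) / len(above)
    rate_below = sum(below) / len(below)
    return {
        "n_above": len(above), "n_below": len(below),
        "rate_above": rate_above, "rate_below": rate_below,
        "gap": rate_above - rate_below,
    }

results = {t: evaluate_threshold(users, t) for t in (0.3, 0.6, 0.9)}
```

Here the extreme thresholds are discarded because one side is nearly empty, and only the middle candidate is reported. Segment persistence and product reachability still need a human check.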

Run an explicit leakage review before presenting the result. Common forms include an eval feature calculated after the retention window begins, an account-health field that already contains renewal information, or a usage feature whose value can only rise when the user returns. Leakage can make a weak signal look uncannily predictive.
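Part of that review can be automated. The sketch below assumes inclusive date windows and a hand-maintained list of outcome-derived field names. Both are assumptions, and the field names are hypothetical. It catches only the two mechanical cases; a feature that can only rise when the user returns still needs manual inspection.

```python
from datetime import date

def leakage_issues(feature_window, retention_window, feature_fields, forbidden_fields):
    """Return human-readable leakage problems; an empty list means these checks passed."""
    issues = []
    _, feature_end = feature_window
    retention_start, _ = retention_window
    # Windows are treated as inclusive, so sharing a day counts as overlap.
    if feature_end >= retention_start:
        issues.append("feature window overlaps the retention window")
    leaked = sorted(set(feature_fields) & set(forbidden_fields))
    if leaked:
        issues.append("outcome-derived fields used as features: " + ", ".join(leaked))
    return issues

issues = leakage_issues(
    feature_window=(date(2026, 1, 1), date(2026, 1, 14)),
    retention_window=(date(2026, 1, 10), date(2026, 2, 10)),
    feature_fields=["pass_rate", "renewal_flag"],
    forbidden_fields=["renewal_flag", "days_active_in_outcome_window"],
)
```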

The decision artifact should show the cohort definition, feature window, retention window, cohort sizes, effect estimate, control variables, and segment sensitivity together. If the threshold only works for a particular plan or acquisition channel, say so. A narrow, honest signal is more actionable than a broad result that disappears when the mix changes.

Use experiments to separate a predictor from a product lever

A predictive eval signal is not automatically causal. Sophisticated users may configure the agent better, choose easier workflows, or persist through early friction. Their higher eval scores and higher retention may share the same cause. Improving the score will not necessarily reproduce their behavior.

Convert the signal into a testable product intervention:

Choose an intervention that can move the signal during the early observation period. Depending on the failure, that could be an in-app guide, a product tour, a setup change, or a model change behind a feature flag.
Keep the threshold definition fixed for the experiment. Redefining success after seeing the result turns the test into another exploratory analysis.
Predefine the retained behavior, retention window, target population, and second-order guardrails.
Use a minimum detectable effect calculation to determine whether the experiment can answer the question with the available population.
Run an A/B test where randomization is practical. Measure whether the intervention moves the eval signal and whether that movement is followed by the intended retention lift.
Inspect results by the same segments used in the observational analysis. A blended win can hide a regression for a strategically important group.

This creates a necessary chain of evidence: the intervention changed the early experience, the early eval feature moved, and the retention outcome moved in the expected direction. If retention improves without movement in the eval, your intervention may work through another mechanism. If the eval improves without retention, the signal is not yet a proven growth lever.
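For the minimum detectable effect step, a rough planning calculation is often enough to tell whether the experiment is feasible. This sketch uses the standard normal-approximation formula for a two-sided comparison of two proportions. The baseline rate and lift passed in are placeholders, not figures from any real product.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline_rate, minimum_detectable_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided, two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Placeholder inputs: 30% baseline retention, 5-point absolute lift.
needed = users_per_arm(baseline_rate=0.30, minimum_detectable_lift=0.05)
```

If the eligible population in the early observation period is well below the result, either accept a larger minimum detectable effect, extend the enrollment period, or treat the test as directional.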

Treat safety differently from an ordinary optimization metric. A retention increase cannot compensate for an unacceptable safety regression. Use risk scoring to gate exposure, keep model changes behind feature flags until the required evals pass, and monitor anomalies in both the score and its eligible volume. A stable percentage on a collapsing sample is not stability.
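A sketch of monitoring the rate and its denominator together. The tolerance values are assumptions chosen for the example, and the baseline is assumed to have at least one eligible interaction.

```python
def check_metric(baseline, current, max_rate_drop=0.05, max_volume_drop=0.5):
    """Flag a pass-rate regression or a collapse in eligible volume.

    `baseline` and `current` are (passed, eligible) tuples.
    """
    alerts = []
    base_passed, base_eligible = baseline
    cur_passed, cur_eligible = current
    if cur_eligible < base_eligible * (1 - max_volume_drop):
        alerts.append("eligible volume collapsed")
    if cur_eligible == 0:
        alerts.append("no eligible interactions: rate undefined")
        return alerts
    if base_passed / base_eligible - cur_passed / cur_eligible > max_rate_drop:
        alerts.append("pass rate regressed")
    return alerts

# Same 90% pass rate, but on a tenth of the eligible volume.
alerts = check_metric(baseline=(900, 1000), current=(90, 100))
```

The rate alone would report no change here. The volume check is what surfaces the problem.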

Track support tickets, NPS, and Net Revenue Retention alongside the primary retention result. These measures operate on different timelines, but they help catch proxy optimization. An intervention that pushes users across an eval threshold while increasing support burden or degrading customer sentiment has not produced a clean product win.

Separate the user-level and release-level uses of the signal. A user-level signal can trigger onboarding or customer-success help when a new account has not reached the value threshold. A release-level eval can prevent a model change from expanding when quality falls. Combining both into one vague health score makes ownership and response unclear.

Put the winning signal into the product operating system

The analysis matters only when it changes what happens next. Give the signal a definition, an owner, an intervention, and a response to regression.

For onboarding, guide new users toward the workflow conditions associated with crossing the threshold. Do not merely show them where the AI button is.
For customer success, add the signal to a health score only when the team has a specific action to take. A warning without a playbook creates dashboard noise.
For roadmap planning, require proposed work to identify which eval feature it should move, why that feature connects to retention, and how the effect will be tested.
For model releases, keep exposure controlled with feature flags until the relevant eval improves without violating safety or experience guardrails.
For monitoring, use anomaly detection on the eval value, eligible interaction volume, and important segments so a blended average does not conceal a regression.

This operating model also clarifies ownership. Product owns the intervention and decision. Data science owns the validity of the feature and analysis. Engineering owns reliable instrumentation and release controls. Customer success owns the response when an account-level signal indicates missed value. Those responsibilities can be distributed differently in your organization, but none should be implicit.

Key takeaways

Define the retained behavior, population, unit, and time window before selecting an eval.
Shortlist three to five eval candidates that describe real user value: accuracy, containment, safety, latency, or UX friction.
Aggregate reliable daily features with stable user and tenant identifiers before joining them to product cohorts.
Use cohort analysis, driver trees, and simple controlled models to find a predictive threshold, then check sample size, segment mix, and label leakage.
Use an A/B test to learn whether a product intervention can move both the eval signal and retention.
Operationalize a validated signal through onboarding, customer success, release gates, feature flags, and anomaly detection.

At your next product review, bring a short decision sheet with the retained behavior, observation window, no more than five candidate evals, join keys, and the first intervention you can test. If the team cannot fill in those fields, fix the analytics contract first. If it can, run the smallest credible experiment and let retained behavior, not a prettier eval dashboard, decide the roadmap.

References

Amplitude — The Surprising Eval Signal That Tripled Retention: How I Connected AI Evals to Product KPIs

May 7, 2026

Amplitude MCP: Evidence-Grounded AI Workflows for Product Teams

An AI assistant can produce a convincing roadmap recommendation or code patch before you have established what users actually did. That speed feels productive until a confident answer turns an instrumentation gap, a rare edge case, or a coincidental sequence into a product decision.

Amplitude MCP is most useful when it reverses that order. The assistant retrieves behavioral evidence first, labels what is observed versus inferred, proposes a bounded action, and defines how the result will be verified. You still make the decision and own the release, but you spend less time moving context between analytics, product documents, Session Replay, and the development environment.

Key takeaways

Treat Amplitude MCP as an evidence-retrieval layer, not an automated decision-maker. Access to analytics does not make every conclusion valid.
Require every response to separate observed behavior, inferred explanations, proposed actions, and verified outcomes.
Use aggregate analytics to establish prevalence and affected segments, Session Replay to understand the journey, and code-level tests to validate a technical explanation.
End product workflows with a decision brief and engineering workflows with a reproducible test, a controlled release plan, and post-release behavioral verification.
Begin with a narrow, high-value workflow. Apply least-privilege access, redact sensitive data, and evaluate retrieval accuracy, analytical discipline, latency, and business usefulness before expanding.

Create an evidence contract before asking for a recommendation

An MCP connection can make evidence accessible, but it cannot decide whether your event taxonomy is reliable, whether a cohort is appropriate, or whether a pattern is causal. Amplitude MCP can let an assistant request behavioral context such as funnels, cohorts, segments, and user journeys as needed. Your workflow still has to constrain what is retrieved and how it may be interpreted.

The practical control is an evidence contract: a short specification for the question, the permitted data, the expected output, and the point at which the assistant must stop. Write it before asking for a recommendation. Otherwise, the assistant can silently change the population, comparison, or definition while producing an answer that sounds coherent.

Decision: State the exact choice the analysis is meant to inform. “Improve onboarding” is a theme; “decide which onboarding step needs further investigation” is a decision.
Population: Name the relevant segment, account type, lifecycle stage, product surface, or release exposure. Do not let the assistant substitute all users because that query is easier.
Behavior definition: Specify the events or funnel that represent the outcome. If activation, retention, or failure has no agreed event definition, resolve that ambiguity before interpreting results.
Comparison: Define the cohort, release, segment, or other baseline against which a difference should be assessed.
Permitted evidence: List the analytics views, event paths, Session Replays, error details, and code context the assistant may use.
Required traceability: Make the assistant identify the query, event definition, segment, and replay behind each material observation.
Abstention rule: Require the assistant to say when missing instrumentation, insufficient data, or conflicting evidence prevents a conclusion.
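The contract can also live as a small structured object so that a blank field blocks the request. This is a sketch under my own assumptions: the class, its methods, and the sample values are hypothetical and are not part of Amplitude MCP or any assistant API.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EvidenceContract:
    decision: str
    population: str
    behavior_definition: str
    comparison: str
    permitted_evidence: str
    required_traceability: str
    abstention_rule: str

    def missing(self):
        """Names of fields left blank; a contract with gaps should not be sent."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def to_prompt(self):
        """Render the contract as labeled lines, refusing if any field is blank."""
        gaps = self.missing()
        if gaps:
            raise ValueError("incomplete evidence contract: " + ", ".join(gaps))
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )

# Illustrative values only.
contract = EvidenceContract(
    decision="Decide which onboarding step needs further investigation",
    population="SMB accounts",
    behavior_definition="Agreed activation funnel and its event definitions",
    comparison="Previous release cohort",
    permitted_evidence="Funnel results, event paths, Session Replays for the segment",
    required_traceability="Cite the query or replay behind each observation",
    abstention_rule="State what is missing when evidence is insufficient",
)
prompt = contract.to_prompt()
```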

A reusable prompt can be direct: “Analyze [outcome] for [segment] using [funnel, cohort, or event path]. Use [comparison] as the baseline. For every conclusion, identify the supporting query or replay. Return observed facts, data limitations, hypotheses, next retrievals, recommended action, and a verification plan. If the evidence is insufficient, state what is missing instead of filling the gap.”

The labels matter. Without them, a behavioral sequence can become a supposed root cause within one paragraph. Use the following distinction in product investigations, incident work, and roadmap analysis:

Layer	What belongs here	What must support it
Observed	An event pattern, funnel difference, cohort trend, replayed interaction, error, or test result	A traceable query, event timeline, replay, log, or test output
Inferred	A plausible explanation for the observed behavior	Supporting and conflicting evidence, plus assumptions that remain unverified
Proposed	An instrumentation change, discovery step, experiment, code change, or rollout action	A stated rationale, expected effect, risk, and owner
Verified	A conclusion that the intervention produced the intended result without an unacceptable regression	Post-change tests and behavioral evidence using definitions consistent with the original investigation
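If the assistant returns its answer as structured data, the four layers can be enforced instead of merely requested. A minimal sketch, assuming a hypothetical claim format in which every claim lists its supporting references. The reference strings are made up.

```python
from dataclasses import dataclass, field
from enum import Enum

class Layer(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"
    PROPOSED = "proposed"
    VERIFIED = "verified"

@dataclass
class Claim:
    layer: Layer
    text: str
    support: list = field(default_factory=list)  # query IDs, replay refs, test output

def unsupported(claims):
    """Return claims that carry no supporting reference at all."""
    return [c for c in claims if not c.support]

claims = [
    Claim(Layer.OBSERVED, "Fewer SMB users completed step 3", ["query:funnel-smb"]),
    Claim(Layer.INFERRED, "The copy on step 3 is confusing"),
]
flagged = unsupported(claims)
```

A reviewer can then send back anything in `flagged` before the discussion moves on to proposed actions.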

This structure does more than improve prompt quality. It makes reviews faster. A product manager can challenge the population, an analyst can challenge the event definition, and an engineer can challenge the technical hypothesis without reopening the entire conversation.

Turn product questions into bounded analytics tasks

Broad questions invite broad stories. “Why is activation down?” asks the assistant to choose the definition, locate a pattern, infer a cause, and recommend a solution in one leap. Break that work into retrieval, interpretation, and decision stages instead.

Find an activation blocker without inventing causality

Suppose you need to determine which onboarding step deserves attention for an SMB segment. Behavioral analytics can locate where journeys diverge, while Session Replay can show what happened around that point. Neither alone proves why the behavior occurred.

Define activation. Name the event or event sequence that represents the outcome. If stakeholders use different definitions, surface that disagreement rather than averaging it away.
Fix the population and comparison. Specify the SMB segment and the cohort, release, or successful journey against which it should be compared.
Retrieve the funnel or event path. Ask for the event definitions as well as the result. An unexplained event name is not enough to support a decision.
Locate the observed divergence. Identify where completion or progression differs. Call it a divergence, not a cause or even a blocker yet.
Inspect contrasting journeys. Review unsuccessful and successful Session Replays around the same step. Capture UI state, preceding actions, environment details, errors, and unexpected loops.
Generate competing hypotheses. Include product friction, technical failure, user intent, and instrumentation error where each is plausible. Ask what evidence would weaken each explanation.
Choose the next action that matches the evidence. That may be additional instrumentation, customer discovery, a controlled experiment, a targeted technical investigation, or a product change. The assistant should not default to shipping.
Write the decision record. Preserve the query, segment, replay references, observed facts, unresolved uncertainty, chosen action, and verification signal.

Do not let the assistant jump from “fewer users completed this step” to “the copy is confusing.” The first statement may be observable. The second is a hypothesis that needs corroboration. This distinction is the difference between faster analysis and faster rationalization.

Use behavioral context to sharpen roadmap decisions

Behavioral evidence can show whether a problem appears in real journeys, which segments encounter it, and how the surrounding path differs. It does not determine strategic importance, implementation cost, contractual commitments, regulatory exposure, or the opportunity cost of displacing other work. Those remain product leadership inputs.

Ask the assistant to produce an opportunity brief rather than a priority score. The brief should contain:

The outcome and user segment under consideration
The observed behavior and the exact analytics definition behind it
The prevalence and journey context the available evidence can support, without pretending that frequency equals severity
Successful paths or unaffected segments that provide counterevidence
Known data-quality limitations
Competing explanations and what would distinguish them
The smallest useful discovery, instrumentation, experiment, or delivery step
The signal that would cause you to continue, revise, or stop

This format is particularly useful for activation and retention work because it prevents a familiar category error: an analytics pattern describes behavior, while a roadmap decision combines that behavior with strategy, feasibility, risk, and judgment. Amplitude MCP can improve the behavioral part of the decision without pretending to own the whole decision.

Close the engineering loop from customer signal to verified fix

Code generation is only the middle of a debugging workflow. The more important sequence is evidence, reproduction, hypothesis, failing test, bounded change, controlled release, and verification. Amplitude MCP helps connect the customer side of that sequence to Claude or Cursor, but a plausible diff is not a completed investigation.

From a customer report to a reproducible failure

A support ticket usually contains a symptom. Turn it into an evidence packet before asking the coding assistant for a fix.

Establish impact. Use behavioral analytics to find affected segments, related anomalies, and comparable successful journeys. This tells you whether you are investigating an isolated path or a broader degradation.
Reconstruct the experience. Use Session Replay to capture the sequence of actions, UI state, environment, and the moment the behavior diverged. Preserve timestamps for relevant console errors or API failures.
State expected versus actual behavior. Do not make the coding assistant infer the product requirement from the failure.
Provide constraints. Include known dependencies, release exposure, rate limits, feature-flag state, and any code areas that must not change.
Ask for hypotheses before a patch. Require a list of candidate causes, supporting evidence, contradictory evidence, and missing instrumentation.
Request the smallest failing test. Whenever feasible, reproduce the failure in a test before accepting a code change. If urgent containment is necessary, record it separately from the durable fix.
Validate locally and through CI/CD. A generated test or patch still needs human review and the normal engineering checks.
Release behind a feature flag where appropriate. Limit exposure while verifying the behavior in production.
Verify with the original signals. Re-run the relevant analytics, inspect post-change replays, and monitor related behavioral and performance indicators before increasing exposure.

This workflow can turn a replayed customer problem into reproduction steps, a root-cause hypothesis, a minimal failing test, and a controlled verification plan. The human owner still decides whether the evidence is sufficient, whether the patch is safe, and whether the rollout should continue.

A useful debugging prompt is: “Reconstruct the observed sequence from this replay and event timeline. Separate facts from suspected causes. Identify missing instrumentation. Propose the smallest failing test and the narrowest relevant patch surface. State what post-release evidence would confirm or falsify the fix.”

A passing test proves that the code behaves under the conditions represented by that test. It does not prove that the affected customer journey is repaired. That is why the workflow returns to behavioral evidence after deployment.

From a code symptom back to customer impact

Sometimes the investigation begins with a flaky test, a suspicious diff, or a performance regression. In that direction, the assistant first maps possible failure modes and critical code paths. Amplitude then helps answer whether real users reach those paths, under which conditions, and with what observable consequences.

Give the assistant the test failure, diff, or performance symptom and ask it to enumerate the affected code paths.
Translate those paths into observable events, screens, releases, or journey conditions. If no observable signal exists, add instrumentation before making a product-impact claim.
Retrieve matching behavioral patterns and inspect replays that support and contradict the suspected failure.
Separate technical correctness from operational priority. A real defect may have limited observed reach; a common path may still be functioning correctly.
Implement and test the narrowest justified change.
After release, monitor the original journey, relevant errors, and performance measures such as Web Vitals before ramping the flag.

Frequency must not become the only severity test. Security, privacy, data-integrity, and irreversible-loss risks can demand action even when behavioral analytics shows few affected sessions. Use analytics to understand exposure, not to override the appropriate risk process.

Scale only after retrieval and governance earn trust

The strongest rollout begins with one recurring question, not unrestricted access to every project and replay. Activation blockers and bug triage are good candidates because the input, evidence, decision, and verification artifacts can all be made explicit. Start with a high-value, lower-risk dataset and expand only after the workflow performs reliably.

Make access narrower than the assistant’s capability

Session Replay and event data can contain sensitive customer context. An MCP connection does not remove the obligations attached to that data. Apply the same access rules inside the AI workflow that apply in the analytics product, then reduce exposure further where the task does not require it.

Begin with read-only retrieval for the selected workflow.
Limit access to the relevant projects, datasets, and replay permissions supported by your access model.
Redact sensitive fields before the data reaches either replay or the assistant.
Send the minimum context necessary for the task. Prefer event identifiers, stack traces, test cases, and bounded timelines over raw personally identifiable information.
Keep analytics retrieval, code modification, and deployment authority separate. Successful retrieval is not a reason to grant release permissions.
Preserve the query and evidence references behind material decisions so a reviewer can reconstruct what the assistant saw.
Treat a replay link as governed customer data, not as a generic attachment that can be copied into any conversation.
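For the minimum-context rule, an allowlist is usually safer than trying to enumerate every sensitive field. The field names below are assumptions for illustration. Your event schema will differ.

```python
# Assumed allowlist of structured, non-identifying fields.
ALLOWED_FIELDS = {"event_id", "event_type", "timestamp", "error_code"}

def minimal_context(event: dict) -> dict:
    """Keep only allowlisted fields so unlisted data never reaches the assistant."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

raw_event = {
    "event_id": "e1",
    "event_type": "checkout_failed",
    "timestamp": "2026-05-06T10:15:00Z",
    "error_code": 500,
    "email": "person@example.com",  # dropped: not on the allowlist
    "free_text_note": "call me",    # dropped: not on the allowlist
}
safe_event = minimal_context(raw_event)
```

A new field added to the schema is excluded by default, which is the behavior you want when the upstream data changes without notice.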

These controls reflect a practical privacy-by-design rule: include only the information needed to reach the fix and favor structured technical artifacts over raw PII. If the workflow cannot answer a question within those boundaries, the correct result may be escalation to an authorized person rather than broader automated access.

Evaluate the workflow, not just the prose

A polished response is a weak success criterion. Build an evaluation set from representative work and include cases where the answer is easy, ambiguous, unsupported by current instrumentation, and blocked by permissions. The assistant should succeed by reaching the right conclusion or by refusing to overstate what the evidence supports.

Retrieval correctness: Did it use the intended project, event definitions, segment, comparison, and available time scope?
Traceability: Can a reviewer follow every material observation back to a query, replay, error, or test?
Analytical discipline: Did it distinguish behavioral association from cause and identify counterevidence?
Action quality: Is the proposed next step bounded, testable, and proportionate to the evidence?
Abstention quality: Did it stop when data was missing, permissions were insufficient, or the available evidence conflicted?
Latency: Did the workflow reduce time spent finding and transferring context without adding review overhead elsewhere?
Business usefulness: Did the evidence improve the decision, reproduction, or verification outcome rather than merely shorten the response?
Governance: Did retrieval stay within approved access and data-handling boundaries?

Classify failures by layer. A wrong segment is a retrieval failure. An unsupported causal claim is an interpretation failure. An oversized code rewrite is an action failure. Exposure of unnecessary customer data is a governance failure. That classification tells you whether to change permissions, analytics definitions, prompts, review rules, or the underlying product instrumentation.
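A simple tally makes that classification usable across an evaluation run. The failure labels here are hypothetical names for the four examples above, and anything unrecognized is kept visible as unclassified.

```python
from collections import Counter

# Hypothetical failure labels mapped to the layer that owns the fix.
FAILURE_LAYERS = {
    "wrong_segment": "retrieval",
    "unsupported_causal_claim": "interpretation",
    "oversized_code_rewrite": "action",
    "unnecessary_customer_data_exposed": "governance",
}

def tally_failures(observed_failures):
    """Count failures per layer so the team knows which control to change."""
    return Counter(FAILURE_LAYERS.get(f, "unclassified") for f in observed_failures)

tally = tally_failures([
    "wrong_segment",
    "wrong_segment",
    "unsupported_causal_claim",
    "timeout",  # not in the mapping, so it stays visible as unclassified
])
```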

Use a narrow adoption sequence

Choose one repeated workflow with a visible evidence trail, such as activation analysis or production bug triage.
Record how the workflow operates without MCP, including where context is lost and which handoffs cause rework.
Define the evidence contract, approved access, expected artifact, and human decision gate.
Run representative cases and record retrieval, interpretation, action, and governance failures.
Standardize the prompts, evidence packet, and review checklist only after the failure patterns are understood.
Measure time-to-insight, decision usefulness, and engineering outcomes without assuming that faster responses mean better decisions.
Expand to retention analysis, roadmap shaping, or experiment generation only when the narrow workflow remains traceable and safe.

For incident and engineering use cases, preserve root causes and guardrails as docs-as-code so the next investigation can retrieve known failure patterns instead of rediscovering them. Watch change lead time and deployment frequency alongside stability; speed that produces more regressions is not an improvement.

Start with one decision your team faces repeatedly. Define what the assistant may observe, how it must label inference, who approves the action, and what evidence will verify the result. If it cannot show that chain, it is not ready to influence the decision. If it can, Amplitude MCP becomes more than a convenient connector: it becomes part of a disciplined evidence loop between product behavior and execution.

May 6, 2026

Taste vs. Evidence in the AI Era: What Product Leaders Must Invest In Now

I just finished listening to "Taste – All Things Product Podcast with Teresa Torres & Petra Wille," and as a product leader shipping AI-powered capabilities at HighLevel, Inc., I wanted to pressure-test the sudden obsession with "taste."

If you're curious, you can listen to this episode on Spotify or Apple Podcasts.

The core question landed perfectly for our moment: Is "taste" the must-have skill of the AI era — or just the latest tech buzzword in a world where AI is eating through design, delivery, and discovery?

Teresa pushes back hard, highlighting how slippery the term can be. "It's just this month's flavor of founder mode." She points out that "taste" is rarely defined, can't be easily taught, and too often becomes shorthand for "my preference trumps yours." Just as importantly, "It's not about your taste. It's about your customer's taste."

Petra adds needed nuance from years in the craft: pattern-recognition is real, and some people do develop sharper product sense over time. As she put it, "I am a strong believer that you develop product sense and taste over time. It's never finished."

Both threads lead back to familiar roots in product: product sense, founder mode, and the enduring myth of the lone visionary. They even grapple with the big question on everyone’s mind—Will AI Eat Taste Too?—and where that leaves product teams navigating GenAI, LLMs for product managers, and evolving product strategy.

Here’s my take. "Taste" can be useful as a personal north star, but it is not a decision system. In my teams, we bias toward evidence: continuous discovery, customer interviews, discovery synthesis with opportunity solution trees, and tight collaboration in product trios. Opinion can start the conversation, but evidence should end it.

Practically, that means investing in the skills that compound:

Discovery skills: understanding customers and matching solutions to real needs.
Human-to-human interaction skills.
Learning to collaborate with AI effectively.
Critical thinking and judgment grounded in evidence.

On AI collaboration specifically, we treat GenAI as a force multiplier, not a decider. We prototype with AI to explore breadth, then narrow with qualitative and quantitative signals, ablation-style experiments, and clear success criteria. The bar I hold myself to is simple: taste without evidence is just opinion.

Three lines I underlined from the conversation:

"It's just this month's flavor of founder mode." — Teresa Torres

"It's not about your taste. It's about your customer's taste." — Teresa Torres

"I am a strong believer that you develop product sense and taste over time. It's never finished." — Petra Wille

If you want to go deeper, these references are helpful for sharpening judgment without falling into the "great man" theory trap.

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Founder mode

Marty Cagan: Founder-Style Leadership

Vercel/v0 CEO Guillermo Rauch on building taste: from Lenny Rachitsky’s LinkedIn post

Continuous discovery (read Teresa’s Everyone Can Do Continuous Discovery—Even You! Here’s How)

The "great man" theory

Steve Jobs and the myth of the lone product visionary

Have thoughts on this episode? Leave a comment below and share how your team balances product sense with evidence in the age of AI.

Inspired by this post on Product Talk.

May 5, 2026

Category: AI Strategy
