
Value-Based Pricing and Packaging: A Practical Playbook

If your pricing discussion keeps bouncing between competitor screenshots, delivery costs, and whatever Sales thinks the market will accept, you are not yet deciding a price. You are mixing four separate decisions: the pricing model, the pricing metric, the package, and the amount charged.

Separate those decisions and make them in the right order. You will get a pricing system that customers can understand, Finance can model, Sales can explain, and Product can improve as real behavior replaces assumptions.

Find the value, then choose a metric that tracks it

Value-based pricing does not mean charging the highest number a customer will tolerate. It means connecting what the customer pays to a result the customer cares about. Your costs still determine whether the offer is sustainable, but they do not explain why the buyer should purchase it.

Start by keeping four commonly confused decisions separate:

Decision	Question it answers	Example output
Pricing model	What overall structure determines how the customer pays?	Fixed fee, access-based, usage-based, or outcome-based
Pricing metric	What unit causes value and charges to scale?	Account, seat, transaction, workflow, or verified outcome
Packaging	Which capabilities, limits, and service levels belong together?	Plans, allowances, add-ons, commitments, and overages
Price	How much will you charge for the package or metric?	List price, contracted rate, and discount guardrails

Define value in the buyer’s language

Your first customer conversations should not begin with a proposed price. Begin with the decision the buyer is trying to make and the change the buyer expects after adopting the product. Ask for recent, concrete examples rather than opinions about a hypothetical offer.

What event made this problem important enough to address?
What happens if the buyer leaves the problem unsolved?
Who experiences the problem, and who controls the budget?
What observable change would count as success?
How does the buyer prove that change internally?
What alternatives compete for the same budget, including manual work and doing nothing?
What causes value to grow: more users, more activity, more completed work, better results, or lower risk?

Turn the answers into one working statement: For [buyer], the product creates value when [observable result] improves [business or operational consequence], compared with [current alternative]. This is not positioning copy. It is a testable value hypothesis that will guide the metric and package.

If different segments complete that sentence differently, do not average the answers into a vague promise. That is evidence that the segments may need different packages, metrics, or sales motions. A support leader buying fewer escalations and an operations leader buying more throughput may use the same product while evaluating its value in different ways.

Turn value into a billable unit

The pricing metric is the bridge between the value hypothesis and the invoice. For an AI support agent, for example, the pricing model can be outcome-based, charging only for results, while the metric is an outcome counted when the agent resolves a customer query without further help. The principle is attractive because payment moves with delivered value. The definition is difficult because every ambiguous edge case can become an invoice dispute.

Write the metric specification before selecting the price. It should define:

The event that starts the measurement.
The event that qualifies the unit as complete.
Any quality threshold required before it is billable.
Exclusions, such as tests, spam, duplicates, abandoned work, or activity outside the contracted scope.
Attribution when a human, an automation, and an AI system all contribute.
How reversals, reopened work, refunds, and corrections affect the count.
What the customer can see before the count appears on an invoice.
Which record resolves a disagreement between product analytics and billing.

My rule is simple: if a buyer cannot understand what will be counted and predict the direction of the next bill, the metric is not ready. Evaluate every candidate against six tests:

Value alignment: Does an increase in the unit normally mean the customer received more value?
Predictability: Can the customer forecast the unit well enough to plan a budget?
Auditability: Can both sides inspect the same underlying events?
Controllability: Can the customer influence usage or set limits without abandoning the product?
Operational feasibility: Can your product, data, billing, and support systems calculate the unit consistently?
Economic alignment: Does revenue scale sensibly relative to the cost and risk of delivering the value?

A value-based design does not always require a literal outcome metric. A proxy can be the better choice when it is closely related to value and much easier to forecast and audit. Raw activity is a poor proxy when it can grow without improving the customer’s result. A seat is a poor proxy when adding users does not increase value. An outcome is a poor metric when success cannot be defined consistently. Choose the least complicated unit that preserves alignment.

Before charging anyone, run the proposed rules against beta or historical events. Generate shadow invoices, inspect unusually high and low accounts, and reconcile the count from the raw event through the customer-facing bill. This exposes definitional and data problems while they are still product problems rather than financial disputes.
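
A minimal sketch of that shadow-invoice check might look like the following. The event fields (`account`, `resolved`, `is_test`, `reopened`), the counting rule, and the unit price are all hypothetical stand-ins for whatever your own metric specification defines.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Field names are illustrative assumptions, not a real billing schema.
    account: str
    resolved: bool   # the unit reached its qualifying event
    is_test: bool    # exclusion: test traffic
    reopened: bool   # reversal: the work was reopened afterwards


def is_billable(event: Event) -> bool:
    """Apply the written metric rules to one raw event."""
    return event.resolved and not event.is_test and not event.reopened


def shadow_invoice(events: list[Event], unit_price: float) -> dict[str, float]:
    """Compute a would-be charge per account without billing anyone."""
    counts = Counter(e.account for e in events if is_billable(e))
    return {account: round(n * unit_price, 2) for account, n in counts.items()}


# Hypothetical historical events replayed through the proposed rules.
events = [
    Event("acme", resolved=True, is_test=False, reopened=False),
    Event("acme", resolved=True, is_test=True, reopened=False),
    Event("acme", resolved=True, is_test=False, reopened=True),
    Event("birch", resolved=True, is_test=False, reopened=False),
    Event("birch", resolved=True, is_test=False, reopened=False),
]
invoices = shadow_invoice(events, unit_price=1.00)
print(invoices)
```

Keeping the rule in one small function such as `is_billable` means the same definition can be shown to the customer, replayed against history, and used by billing.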

Make packaging do the segmentation work

Pricing determines how revenue scales. Packaging determines which customers select which offer. A package is therefore not a decorative feature table. It is a mechanism for matching different value patterns, operating needs, and willingness to pay without creating a custom product for every account.

Segment customers by how they receive value. Company size may matter, but workflow complexity, risk, required integrations, volume, and the cost of failure can be more revealing.
Identify the minimum complete experience. Every package should let its intended customer reach the core outcome; a deliberately crippled entry plan teaches the market that the product does not work.
Place differentiators where their value is concentrated. Advanced governance, analytics, automation, integrations, service levels, and support may matter much more to one segment than another.
Choose the relationship between access and consumption. Decide what is included, what is metered, whether unused commitments expire, how overages work, and whether customers can set caps or alerts.
Test whether buyers can self-select. Show realistic scenarios, ask which package they would choose, and then ask them to explain why. Their explanation is more diagnostic than the selected tier.

Choose modular, bundled, or hybrid architecture deliberately

Modular pricing works best when capabilities have distinct buyers, adoption paths, and measurable outcomes. It lets a customer buy one job without funding unrelated functionality. Its weakness appears as the portfolio expands: each additional module adds another decision, metric, contract term, and sales explanation.

Bundling works better when capabilities reinforce one workflow or when customers experience the combined result rather than the individual components. It reduces buying friction, but it can hide which capability creates value and can force smaller customers to pay for breadth they do not need.

A hybrid can separate platform access from variable value: a base package covers the shared product, an included allowance makes the initial bill predictable, and overages or commitments let revenue grow with delivered value. Use that structure only when each component answers a different commercial question. Adding a platform fee, several meters, tier thresholds, credits, and add-ons without a clear role for each one creates a billing puzzle, not a pricing strategy.
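
As a rough illustration of that hybrid structure, the sketch below computes one period's bill from a base fee, an included allowance, and an overage rate. The function name, parameters, and figures are assumptions chosen for the example, not recommended values.

```python
def hybrid_bill(units: int, base_fee: float, included_units: int, overage_rate: float) -> float:
    """Base fee covers platform access and the allowance; only extra units are metered."""
    overage_units = max(0, units - included_units)
    return base_fee + overage_units * overage_rate


# Hypothetical plan: 500 base fee, 1,000 included units, 0.50 per extra unit.
within_allowance = hybrid_bill(800, base_fee=500.0, included_units=1000, overage_rate=0.50)
over_allowance = hybrid_bill(1400, base_fee=500.0, included_units=1000, overage_rate=0.50)
print(within_allowance, over_allowance)
```

If a customer cannot reproduce this arithmetic from the published plan, the structure has probably grown more components than it has commercial questions to answer.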

Look for these packaging failure signals:

Customers repeatedly need capabilities scattered across several tiers.
The entry package cannot produce the outcome used to sell it.
The highest tier is simply every leftover feature rather than an offer for a distinct need.
Two packages attract the same customer for reasons your sales team cannot explain consistently.
The economically best package for you is visibly wrong for the customer.
Customers need a spreadsheet or a salesperson to estimate a normal bill.
Every new capability becomes a new add-on because the portfolio has no shared packaging logic.

Do not ask customers whether they like the package names or feature list. Give them a buying situation, expected volume, required controls, and a budget constraint. Ask them to choose, identify what feels unnecessary, and state what is missing. You are testing whether the architecture supports a decision, not whether the page looks polished.

Measure willingness to pay only after the offer is clear

Quantitative pricing work becomes useful only after buyers understand the model, metric, and package. Otherwise, a survey can produce a precise answer to a question the market would never ask. Use qualitative discovery to establish the buyer’s language and mental model, then carry that exact framing into willingness-to-pay testing.

Methods such as Gabor-Granger and Van Westendorp answer different questions. Gabor-Granger-style testing helps estimate purchase willingness across proposed price points. Van Westendorp-style questions help expose perceived price boundaries, including where an offer begins to feel implausibly cheap or prohibitively expensive. Neither method discovers the value metric for you, and neither produces a universally correct price.

A defensible survey sequence looks like this:

Describe the customer problem and product outcome without promotional language.
State exactly how charging works.
Define the billable unit, including the success condition.
Show what the package contains and what it excludes.
Give the respondent a realistic usage or outcome scenario.
Ask about willingness to purchase at a specific price or across a controlled sequence of prices.
Capture the respondent’s role, segment, buying authority, expected volume, and current alternative so the results can be interpreted rather than merely averaged.

A demand curve is more useful than a single average. In one outcome-priced case, stated purchase willingness moved from 69% at $0.86 per outcome to 39% at $1.42. Those figures are not benchmarks for another product. They demonstrate why the decision is strategic: moving along the curve changes expected adoption as well as revenue captured from each unit.

Multiplying each price by the share willing to buy at that price gives a survey-based revenue index, and its peak is easy to find. That peak is not automatically your final recommendation. It does not, by itself, include realized discounts, differences in unit volume, cost to serve, retention, expansion, sales effort, or the value of establishing market share.
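
To make the arithmetic concrete, here is a small sketch that computes price multiplied by stated willingness for the two survey points quoted above. The name `revenue_index` is an illustrative choice, and a real curve would have more points and be split by segment.

```python
def revenue_index(price: float, share_willing: float) -> float:
    """Survey-based revenue per eligible respondent, per unit, at one price point."""
    return price * share_willing


# The two (price, stated willingness) points from the outcome-priced case above.
survey_points = [(0.86, 0.69), (1.42, 0.39)]

index_by_price = {price: round(revenue_index(price, share), 4) for price, share in survey_points}
peak_price = max(index_by_price, key=index_by_price.get)
print(index_by_price, peak_price)
```

On these two points alone the lower price has the higher index, which says nothing yet about discounts, volume, cost to serve, or retention.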

Decide what the price is meant to accomplish before interpreting the curve:

If the priority is adoption, you may accept less revenue per unit to reach more qualified customers.
If the priority is near-term revenue, you may choose a higher point while accepting a lower attach rate.
If the product requires substantial support or delivery cost, margin may eliminate prices that look attractive in a demand survey.
If the category is unfamiliar, simplicity and predictability may be more important than extracting the theoretical maximum.
If the product is part of a broader platform, the effect on cross-sell, retention, and portfolio coherence may matter more than stand-alone revenue.

Treat willingness-to-pay results as stated intent, not observed buying behavior. Segment the curve before using it. A blended result can conceal a high-value segment with strong demand and another segment that should not be targeted at all. It can also overstate confidence when respondents use the product but do not own the budget.

Convert the demand curve into a commercial model

The survey narrows the plausible range. The commercial model tells you whether an option can survive contact with actual customers, contracts, usage, discounts, and delivery costs. This is where a promising price becomes an operating plan.

Set a candidate list price. Choose a point that reflects the demand curve and the strategic objective, not just the highest theoretical revenue index.
Estimate realized price. Apply expected discounts, negotiated rates, credits, promotions, and channel effects. A list price that relies on constant exceptions is not the real price.
Project units by segment. Use beta or observed usage to estimate outcomes, transactions, seats, or another billable quantity. Preserve the distribution instead of relying only on the mean.
Model attach rate. Estimate what share of eligible customers will buy in conservative, base, and upside cases. Connect each case to an explicit assumption rather than a general level of optimism.
Calculate customer and portfolio revenue. For a metered product, combine realized unit price with expected annual units. Then roll the result across eligible customers and segments.
Include delivery economics. Subtract variable delivery costs and account for service obligations that grow with usage. For AI products, inspect how model, infrastructure, support, and exception-handling costs behave at both low and high volume.
Connect the recommendation to the operating plan. Show the implications for customer count, adoption, annual recurring revenue, gross margin, expansion, and any dependencies on the rest of the portfolio.
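
The steps above can be sketched as a very small model. Every number here is a hypothetical placeholder, and the sketch uses a single average unit volume per case, which is a simplification: a fuller model would carry the distribution of units by segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    attach_rate: float          # share of eligible customers who buy
    units_per_customer: float   # expected annual billable units per buyer


def project(case: Case, eligible_customers: int, list_price: float,
            discount: float, unit_cost: float) -> tuple[float, float]:
    """Return (annual revenue, annual gross margin) for one case."""
    realized_price = list_price * (1 - discount)
    units = eligible_customers * case.attach_rate * case.units_per_customer
    revenue = units * realized_price
    gross_margin = units * (realized_price - unit_cost)
    return revenue, gross_margin


# Hypothetical conservative, base, and upside assumptions.
cases = [
    Case("conservative", attach_rate=0.10, units_per_customer=2000),
    Case("base", attach_rate=0.20, units_per_customer=3000),
    Case("upside", attach_rate=0.30, units_per_customer=4000),
]
results = {
    c.name: project(c, eligible_customers=1000, list_price=1.00, discount=0.20, unit_cost=0.30)
    for c in cases
}
for name, (revenue, margin) in results.items():
    print(f"{name}: revenue={revenue:,.0f} gross_margin={margin:,.0f}")
```

Writing each case as named assumptions makes it harder to present a general level of optimism as a forecast.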

Stress-test the assumptions that can break the plan

A single base case hides the shape of the risk. Change one major assumption at a time so decision-makers can see what the recommendation depends on.

Discount sensitivity: What happens if realized price is materially below list price?
Volume sensitivity: What happens when customers generate far fewer or far more units than the average?
Attach sensitivity: How much adoption is required before the product covers its fixed investment?
Cost sensitivity: Does high usage improve gross profit, or does the delivery cost scale almost as quickly as revenue?
Concentration risk: Does the forecast depend on a small number of unusually large customers?
Invoice volatility: Can normal changes in behavior create bills that customers will perceive as unpredictable?
Metric leakage: Are valuable events going unbilled, or are low-quality events being counted as successful outcomes?

Inspect account-level scenarios, not just portfolio totals. A model can produce acceptable average revenue while creating obviously unreasonable bills for a small customer, a seasonal customer, or a high-volume account. Those tails often become the discount exceptions, support escalations, and renewal problems that the average concealed.
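
A one-at-a-time sensitivity pass can be sketched in a few lines. The assumption names and values below are hypothetical, and the profit formula is deliberately simple: it ignores fixed costs and treats every customer as average.

```python
def gross_profit(a: dict[str, float]) -> float:
    """Annual gross profit from a flat set of assumptions."""
    realized_price = a["list_price"] * (1 - a["discount"])
    units = a["customers"] * a["units_per_customer"]
    return units * (realized_price - a["unit_cost"])


def one_at_a_time(base: dict[str, float], stresses: dict[str, float]) -> dict[str, float]:
    """Change one assumption at a time and report the change in gross profit."""
    baseline = gross_profit(base)
    return {
        name: gross_profit({**base, name: stressed}) - baseline
        for name, stressed in stresses.items()
    }


# Hypothetical base case and stressed values.
base = {"list_price": 1.00, "discount": 0.20, "customers": 200,
        "units_per_customer": 3000, "unit_cost": 0.30}
stresses = {"discount": 0.40, "units_per_customer": 1500, "unit_cost": 0.50}

deltas = one_at_a_time(base, stresses)
for name, delta in deltas.items():
    print(f"{name}: {delta:+,.0f}")
```

Sorting the deltas by size gives decision-makers a quick view of which assumption the recommendation depends on most.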

Make the recommendation easy to challenge

The approval memo should contain the decision and the logic required to dispute it. Include:

The buyer, value hypothesis, model, metric, and metric definition.
The proposed packages and the segment each package is designed to serve.
The willingness-to-pay range and how it changes by segment.
The recommended list price, expected realized price, and discount guardrails.
Conservative, base, and upside forecasts for adoption, revenue, and margin.
The most sensitive assumptions and the evidence supporting them.
Alternatives considered, why they were rejected, and what evidence would reopen them.
Operational dependencies across Product, Research, Data, Finance, Engineering, Sales, Customer Success, Support, and billing.

Cross-functional review is not ceremonial. Finance can expose a margin or forecasting problem. Engineering can show that the proposed event cannot be measured reliably. Sales can identify a model buyers cannot procure. Support can anticipate disputes. Product can determine whether the metric rewards the behavior the product is supposed to create. Resolve those conflicts before the price becomes a public promise.

Launch pricing as a controlled learning system

Approval is the end of price design and the start of price operations. Customers experience pricing through entitlements, usage counters, contracts, invoices, renewal conversations, and support responses. A sensible strategy can fail if those surfaces disagree.

Complete the billing path before charging

Write a billing specification that maps raw events to billable units and contract terms.
Verify entitlements, included allowances, overages, caps, credits, and exception handling.
Run parallel or shadow invoices and reconcile them from event log to customer-facing total.
Give customers a usage view that uses the same definitions and timing as billing.
Enable Sales with qualification rules, scenario-based pricing examples, and clear discount authority.
Prepare Customer Success and Support to explain the metric, diagnose discrepancies, and escalate genuine billing errors.
Instrument proof of value next to proof of usage so the commercial conversation is not reduced to a meter.
Communicate the effective date, affected products, counting rules, package changes, and available customer controls in plain language.

Do not alter existing charges on the assumption that a product announcement overrides a contract. Review contractual commitments, renewal timing, migration rules, and customer communications before changing what an existing customer pays. An informal migration can create financial disputes and destroy trust even when the new model is better designed.

Use behavior to diagnose the next problem

Instrument the system from the first launch cohort. Review both commercial performance and customer experience:

Eligibility, attach rate, and package selection by segment.
List price, realized price, discount frequency, and exception rates.
The full distribution of billable units per customer, not just the average.
Revenue and gross margin by segment, package, and usage band.
Invoice variance and how accurately customers forecast their charges.
Billing questions, disputes, credits, and metric-definition escalations.
Activation, continued usage, achieved outcomes, expansion, contraction, renewal, and churn.
Sales-cycle friction caused by the model, procurement requirements, or package complexity.

Use each signal to choose the next investigation. Low attach can point to weak qualification, unclear value, the wrong package, or the wrong price. Strong attach followed by low activity can indicate an onboarding or product-value problem. High activity with poor margin calls for an economics or discount review. Frequent disputes usually justify inspecting the metric definition, event quality, and customer visibility. These patterns are diagnostic prompts, not causal proof; pair the numbers with targeted customer and GTM conversations.

Review the architecture, not only the number, when the product expands. Modular outcome pricing can work cleanly while each capability has a distinct result. As a platform adds capabilities, buyers may face several meters, overlapping modules, and an invoice they cannot predict. That is a signal to reconsider how access, bundles, allowances, and outcomes fit together, not merely to adjust every component independently.

Reopen the pricing system when customers cannot forecast bills, new capabilities do not fit an existing package, discount exceptions become routine, sales explanations diverge, gross margin behaves differently from the model, or the value customers receive is no longer represented by the metric. Pricing should be treated as a living system informed by research, customer behavior, and go-to-market learning, not a launch artifact that becomes untouchable.

Key takeaways

Make four decisions separately: pricing model, pricing metric, package, and price.
Define value using an observable customer result before asking what anyone will pay.
Choose a metric that aligns with value but remains predictable, auditable, operationally feasible, and economically sound.
Design packages around distinct value patterns and buying needs, not an arbitrary progression of feature counts.
Use willingness-to-pay work to build a demand curve, then combine it with usage, attach, discounts, and margin in a commercial model.
Validate the complete billing path before launch and use observed behavior to improve the system afterward.

If your team is stuck debating the number, stop the meeting and complete six lines first: buyer, customer outcome, billable unit, measurement proof, package boundary, and commercial assumptions. Any line you cannot defend is the next research or modeling task. Put a price on the page only after those six lines tell one coherent story.

References

Intercom – Inside My Pricing Playbook: Building Value-Based Packaging That Balances Growth and Profit

May 20, 2026

How to Build an AI-Native Product Discovery Workflow

Your discovery stack may already hold interview transcripts, support conversations, behavioral analytics, experiment results, and roadmap assumptions. Yet the decision in a product review can still depend on whoever read the most material or built the most persuasive deck.

If adding an LLM only gives you faster summaries, the workflow is not AI-native. An AI-native discovery workflow shortens the distance from evidence to a decision while making every important claim easier to inspect. AI retrieves, structures, compares, and challenges the evidence. You remain accountable for what the evidence means and what the product team does next.

Key takeaways

Begin every AI-assisted discovery run with an outcome, a metric, defined context, and a decision that someone needs to make.
Preserve raw evidence and give each observation a stable identifier before asking AI to synthesize it.
Break the workflow into bounded jobs such as retrieval, extraction, clustering, contradiction detection, and decision-brief drafting.
Evaluate citation accuracy, evidence fidelity, counterevidence, abstention, and access controls before the output enters a roadmap discussion.
Measure whether the workflow improves decision quality and product outcomes, not merely whether the model produces polished prose.

Frame the decision before you involve the model

Most weak discovery prompts fail before the model sees them. “Analyze the interviews,” “summarize the feedback,” and “find insights” are activities, not decisions. They give the model no principled way to distinguish useful evidence from interesting noise.

Write a short decision contract first. A useful contract specifies the outcome and metric, the context and constraints, and the decision and deliverable. Those fields turn an open-ended request into a bounded discovery task.

Outcome and metric: Name the user or business outcome, then define the behavior or measure that represents it. Activation, funnel conversion, and retention are not interchangeable. Include the event definition and observation window used by your analytics system.
Context and constraints: State the relevant cohort, product surface, timeframe, market, known exclusions, and data-access limits. New self-serve accounts on the web can exhibit a different pattern from established accounts or customers using another surface.
Decision and deliverable: Say what someone will do with the answer. Ask for a ranked opportunity brief, an interview plan, a set of competing explanations, or experiment candidates only when that format supports a real pending decision.

Reusable decision prompt: Help me decide [decision]. The outcome is [outcome], measured as [metric definition]. Limit the analysis to [cohort, surface, timeframe, and constraints]. Retrieve evidence from [approved repositories]. Return [deliverable]. For every material claim, include the evidence identifier, any conflicting evidence, the affected segment, and what is still unknown. If the available evidence cannot support a recommendation, say so and specify what is missing.

The last sentence matters. An AI system should be allowed to return “insufficient evidence.” If every run must end with a recommendation, the workflow rewards plausible completion instead of honest discovery.

Keep the outcome separate from the proposed solution. “Improve activation” is an outcome. “Validate an onboarding checklist” is already a solution choice. When you embed the solution in the prompt, AI tends to organize the available evidence around that choice instead of testing whether another opportunity matters more.

Use evidence-strength labels that a reviewer can verify rather than asking the model for an unsupported confidence percentage:

Sufficient: Direct evidence applies to the target context, and no material contradiction remains unresolved.
Mixed: Direct evidence and meaningful counterevidence both exist, or the pattern changes by segment.
Insufficient: Evidence is missing, indirect, stale for the decision, or outside the target context.
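
One way to keep these labels verifiable is to derive them from checks a reviewer answers explicitly. The sketch below is an assumption about how that could be encoded; the three boolean inputs are hypothetical and would be filled in by a person or a reviewed extraction step, not guessed by the model.

```python
from enum import Enum


class Strength(Enum):
    SUFFICIENT = "sufficient"
    MIXED = "mixed"
    INSUFFICIENT = "insufficient"


def label_evidence(direct_in_context: bool,
                   unresolved_contradiction: bool,
                   varies_by_segment: bool) -> Strength:
    """Map reviewer-checkable facts to one of the three labels above."""
    if not direct_in_context:
        # Missing, indirect, stale, or out-of-context evidence.
        return Strength.INSUFFICIENT
    if unresolved_contradiction or varies_by_segment:
        return Strength.MIXED
    return Strength.SUFFICIENT


print(label_evidence(True, False, False).value)
print(label_evidence(True, True, False).value)
print(label_evidence(False, False, False).value)
```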

Build a traceable evidence pipeline, not a transcript pile

AI cannot make discovery evidence traceable if the underlying repository has already flattened observations, interpretations, and decisions into the same notes. Preserve those layers separately. My rule is simple: automate the movement and inspection of evidence before automating judgment.

Layer	What it contains	Control that matters
Raw evidence	Interview recordings or transcripts, support records, session evidence, and analytics query results	Keep the original record intact, access-controlled, and addressable by a stable locator
Evidence units	Atomic observations with metadata	Separate exact customer language, observed behavior, and analyst interpretation
Opportunities	Candidate needs, frictions, or desired outcomes	Attach supporting evidence, counterevidence, affected segments, and unresolved questions
Decisions	Choices made, rejected alternatives, assumptions, and rationale	Name the decision owner and preserve the evidence available at the time
Learning	Experiment results and later customer or behavioral evidence	Update the opportunity without erasing the earlier reasoning

Each evidence unit should carry enough metadata to survive outside its original document:

A stable evidence identifier.
The collection date and an exact locator such as a transcript timestamp or saved analytics query.
The relevant user segment, product surface, and journey stage.
The raw observation, kept separate from the interpretation proposed by a person or model.
The access, retention, and sensitivity classification.
The opportunity, assumption, or outcome to which the evidence may relate.
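
The metadata list above could be represented as a small record type. This is only a sketch: the field names, the identifier format, and the example values are invented for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EvidenceUnit:
    evidence_id: str                 # stable identifier
    collected_on: date               # collection date
    locator: str                     # e.g. transcript timestamp or saved query
    segment: str
    surface: str
    journey_stage: str
    raw_observation: str             # exact language or observed behavior
    interpretation: Optional[str]    # kept separate from the raw observation
    sensitivity: str                 # access, retention, and sensitivity class
    related_to: tuple[str, ...] = () # opportunities, assumptions, or outcomes


# Hypothetical example record.
unit = EvidenceUnit(
    evidence_id="EV-0001",
    collected_on=date(2026, 1, 15),
    locator="interview-12 @ 00:14:32",
    segment="new self-serve accounts",
    surface="web",
    journey_stage="onboarding",
    raw_observation="I could not tell whether the import had finished.",
    interpretation="Import status is unclear during onboarding.",
    sensitivity="internal",
    related_to=("opportunity: import feedback",),
)
print(unit.evidence_id, unit.locator)
```

Making the record immutable is a design choice in this sketch: a later reinterpretation becomes a new record rather than an edit to the observation.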

This structure prevents a common failure: a model paraphrases an interview, a later summary compresses that paraphrase, and the roadmap eventually treats the compressed interpretation as a customer fact. A reviewer should always be able to move from a claim to the evidence unit and then to the original record.

Apply data-governance rules before ingestion. If customer conversations contain personal, confidential, or contract-restricted information, do not copy them into an AI system until its access, retention, redaction, and model-training terms match your commitments. A more convenient synthesis workflow is not worth an unauthorized disclosure.

Retrieve the smallest useful context

Once the evidence corpus no longer fits sensibly into a prompt, use a retrieval-first pipeline with modular prompts and observable traces. Retrieval-augmented generation should select evidence relevant to the decision contract, rather than asking a general agent to reason over everything the company knows.

RAG is a grounding mechanism, not a truth guarantee. A fluent answer does not prove that the retriever found the decisive interview, the correct event definition, or the evidence that contradicts the dominant pattern. Configure retrieval to look for both support and contradiction, preserve evidence identifiers, respect access controls, and return no result when the available context does not meet the task.

An opportunity solution tree can provide the shared view above this pipeline: the desired outcome connects to opportunities, solution candidates, and tests. Treat the tree as a navigable representation of current thinking. Every important node should still resolve to evidence and assumptions beneath it.

Give AI a chain of bounded jobs

A single agent asked to interview customers, interpret feedback, size opportunities, choose a solution, and write a roadmap has too many ways to hide a weak inference. Break the work into stages with explicit inputs and review gates:

Prepare: Give AI the outcome, assumptions, and learning gaps. Let it draft non-leading interview questions. A human checks whether the guide is testing an assumption or merely inviting agreement.
Convert: Extract atomic observations from approved records. Require exact locators and label customer language, observed behavior, and interpretation separately.
Synthesize: Cluster candidate opportunities without erasing segment differences. Request supporting evidence, counterevidence, and unrepresented cohorts for every cluster.
Connect: Use behavioral analytics to examine whether the observed pattern appears in the target cohort. Interviews can expose mechanisms and unmet needs; they should not be treated as a substitute for measuring prevalence.
Challenge: Ask for rival explanations, evidence that would reverse the conclusion, and assumptions that remain untested. This stage should consume the evidence record, not just the previous summary.
Draft: Produce a decision brief containing the pending decision, options, evidence, contradictions, unknowns, and proposed next test. A named human accepts, revises, or rejects it.
Learn: Attach experiment and outcome evidence to the same opportunity record. Preserve what the team believed before the test so later reviewers can inspect how the decision changed.

Pass structured artifacts between stages. If each stage receives only prose copied from the previous chat, unsupported claims can become progressively harder to distinguish from evidence.

Buy workflow plumbing; own the decision logic

You do not need to build every repository, connector, permission system, visualization, and observability screen. Licensing purpose-built opportunity-tree infrastructure can be the sensible choice when your differentiated work is the learning system rather than the canvas or collaboration layer.

Keep ownership of the parts that encode how your company makes product decisions: the decision contract, evidence schema, opportunity taxonomy, prompt modules, evaluation cases, escalation rules, and approval gates. Before choosing a platform, ask:

Can you export the raw evidence, metadata, opportunity structure, prompts, and run traces?
Can access rules follow the evidence through retrieval and generation?
Can the system connect to your approved analytics and customer-evidence repositories without repeated manual copying?
Can you evaluate a prompt or retrieval change against representative past cases?
Can a reviewer inspect why a claim appeared and what evidence was omitted?
Would building this capability improve the customer outcome, or merely recreate commodity workflow infrastructure?

Evaluate the workflow before it shapes the roadmap

Start evals before AI-generated conclusions become routine inputs to product reviews. The evaluation set should represent the cases the workflow will actually encounter: a clear pattern, conflicting evidence, insufficient evidence, cohort-specific behavior, stale material, duplicated records, and content the requesting user is not allowed to retrieve.

For synthesis and decision-support tasks, evaluate behavior that a reviewer can observe:

Citation validity: Every material claim points to a real, accessible evidence identifier.
Evidence fidelity: Quotations and behavioral facts remain faithful to the underlying record; interpretations are labeled as interpretations.
Retrieval coverage: The output includes the evidence required to assess the target opportunity, not merely the easiest matching passages.
Contradiction handling: Material counterevidence and segment differences are visible rather than buried.
Abstention: The system returns insufficient evidence when the decision cannot be supported.
Decision fit: The deliverable answers the stated decision instead of drifting into a generic summary or unrelated recommendation.
Policy compliance: Restricted evidence stays outside unauthorized retrieval, traces, and generated output.

A strict release gate is useful here. Fail the output if it invents an evidence identifier, turns an interpretation into a quotation, ignores a material contradiction, or exposes restricted content. Those are not cosmetic defects that a polished paragraph can offset.
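The gate can be written as a small check that returns blocking defects instead of a quality score. What follows is a minimal sketch, not a reference implementation: the `Claim` record, its field names, and the `release_gate` function are hypothetical, and it assumes an upstream step has already decided whether a quotation matches its source record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]  # identifiers the output cites for this claim
    is_quotation: bool             # presented to the reader as a direct quotation
    source_is_verbatim: bool       # assumed to be set by an upstream fidelity check


def release_gate(claims, known_ids, restricted_ids, unaddressed_contradictions):
    """Return blocking defects; an empty list means the output may be released."""
    defects = []
    for claim in claims:
        for eid in claim.evidence_ids:
            if eid not in known_ids:
                defects.append(f"invented evidence id: {eid}")
            elif eid in restricted_ids:
                defects.append(f"restricted evidence exposed: {eid}")
        if claim.is_quotation and not claim.source_is_verbatim:
            defects.append(f"interpretation presented as quotation: {claim.text!r}")
    # Contradictions the reviewer marked as material but the output never mentions.
    for contradiction_id in unaddressed_contradictions:
        defects.append(f"material contradiction ignored: {contradiction_id}")
    return defects
```

Returning a list rather than a boolean keeps the failure reason visible to the reviewer, which matters more here than a pass rate.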

Treat the prompt, retrieval configuration, model choice, taxonomy, and evaluation set as versioned artifacts. This is the practical value of eval-driven development and early observability: when behavior changes, you can identify the change that caused it and rerun representative cases before wider use.

For each production run, retain the decision contract, evidence identifiers retrieved, prompt and retrieval versions, generated output, reviewer edits, final decision, and later outcome. That trace lets you distinguish a retrieval failure from a synthesis failure, a weak decision contract, or a reasonable decision invalidated by new evidence.

Model-quality checks are only one layer. Also baseline and monitor the discovery workflow itself:

Time from a framed question to a reviewable decision brief.
The share of material claims with inspectable evidence.
Reviewer corrections to quotations, segments, event definitions, and interpretations.
Decisions reopened because relevant evidence was missing or misread.
Movement in the outcome and metric named in the original decision contract.

Do not set improvement targets until you have a baseline for the existing process. A system can make synthesis faster while increasing correction work or encouraging premature decisions. The end-to-end measure tells you whether the saved time is real.

Turn the workflow into a product operating system

AI-native discovery changes the product team’s operating model only when ownership remains explicit. The product manager or product trio owns the outcome, assumptions, and decision. Research and design judgment protects interview quality and interpretive nuance. Data and engineering ownership protects event definitions, retrieval reliability, instrumentation, and access controls. AI produces candidate artifacts. The decision owner approves the action.

Review by exception instead of rereading every generated sentence. Inspect claims marked mixed or insufficient, new opportunity clusters, segment differences, material contradictions, changed event definitions, and outputs that differ from earlier runs. This focuses human attention where judgment is most valuable without treating the model as an authority.

Roll out the workflow through one recurring, reversible discovery decision:

Choose a decision for which customer evidence and behavioral data already exist, such as prioritizing an onboarding friction or investigating a repeated support issue.
Baseline the current path from question to decision, including reviewer corrections and missing-evidence failures.
Create the decision contract, evidence schema, and access rules before connecting an agent.
Build the evaluation set from previous clear, contradictory, insufficient, segment-specific, and restricted cases.
Run the AI workflow in shadow mode beside the existing process. Compare claims, omissions, reviewer effort, and the resulting decision without allowing the generated output to act automatically.
Promote bounded jobs only after they pass their gates. Evidence extraction may be ready before opportunity ranking, and opportunity ranking may be ready before solution recommendations.
Expand to another workflow only when the traces are stable, reviewers understand escalation paths, and the first use case is improving the decision process rather than merely generating more material.

At your next discovery review, do not ask what AI found. Bring one decision contract, require every consequential claim to resolve to evidence, and make the unresolved assumption visible. That is a small enough change to start immediately and a strong enough foundation for everything you automate later.

References

May 19, 2026

Level Up: May 26 Claude Code Show & Tell + Final Product Discovery Fundamentals Cohort

I’m excited to share two opportunities this season to uplevel your craft, connect with peers, and leave with practical, repeatable techniques you can apply immediately to your product work.

We will be doing another round of Claude Code: Show and Tell on May 26th at 9am PDT. These community-driven sessions are hands-on and fast-paced—we swap proven workflows, compare prompts, and pressure-test approaches together. You’ll see how product teams are operationalizing AI workflows in real contexts and walk away with ideas you can adapt for your own roadmap and experimentation pipeline. Invites will go out to Supporting Members and CDH Members tomorrow. If you'd like to join us, keep an eye on your inbox for the invite.

I love these Show & Tell sessions because they translate tacit knowledge into clear, reusable playbooks. Whether you’re refining evaluation loops for LLMs, streamlining discovery synthesis, or standardizing prompts for consistency, the shared rigor and camaraderie make it a high-signal hour for any product leader invested in AI workflows.

I also want to share that I'll be teaching our June 4th – July 9th cohort of Product Discovery Fundamentals. This is the last time I'll be teaching this cohort in its current format. If you've been thinking of enrolling in this program, and want to take it with me, this is your last chance. Register here.

Across this cohort, we’ll practice continuous discovery habits—framing opportunities, tightening assumptions, running lean experiments, and aligning product trios on evidence-backed decisions. If you want a rigorous, repeatable system for turning customer insight into confident prioritization and compelling product strategy, I’d be thrilled to have you in the room.

Inspired by this post on Product Talk.

May 18, 2026

Behavioral Customer Data for Proactive SaaS Retention

Your cancellation dashboard can tell you who has already left. It cannot tell you which accounts are failing to reach value, why their behavior changed, or what your team should do while the relationship is still recoverable.

That is the real purpose of behavioral customer data. You are not trying to produce a more sophisticated churn report. You are building an operating system that turns observable behavior into a reason, an owner, and a timely response.

Start with the retention decision, not the dashboard

A risk score has no operational value if nobody knows what to do when it changes. Before choosing events, dashboards, or models, write down the retention decisions your data must support.

For every proposed signal, define a decision contract:
- Trigger: What behavior changed, started, stopped, failed, or never happened?
- Interpretation: What customer state might that behavior indicate?
- Owner: Should product, customer success, support, solutions engineering, or billing respond?
- Intervention: What is the smallest useful action that could remove the obstacle?
- Success signal: Which subsequent behavior would show that the customer is back on a value path?
- Expiration rule: When should the alert or intervention stop so the customer is not repeatedly contacted?

This contract prevents a common failure: treating all declining activity as the same problem. A customer who cannot finish an integration needs a different response from an activated customer whose core usage suddenly drops. A payment problem is different again. Combining them into one generic churn-risk label hides the information required to help.
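One way to keep the contract honest is to store it as a structured record and refuse to automate a signal that leaves any part blank. This is a sketch under stated assumptions: the `DecisionContract` class and the example values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DecisionContract:
    trigger: str
    interpretation: str
    owner: str
    intervention: str
    success_signal: str
    expiration_rule: str

    def missing_fields(self):
        """Names of blank fields; a contract with gaps is not ready to automate."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical example for a blocked integration.
integration_blocked = DecisionContract(
    trigger="connection attempted but never completed",
    interpretation="setup is blocked before first value",
    owner="solutions engineering",
    intervention="send provider-specific setup guidance",
    success_signal="connection completed",
    expiration_rule="stop after the connection completes or the alert window ends",
)
```

A payment failure or a usage drop would get its own contract with a different owner and intervention, which is the point of not collapsing them into one label.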

The signal also needs to match the product’s natural rhythm. Daily inactivity can matter in a daily workflow, but the same rule will create false alarms for a workflow used weekly or at the end of a reporting cycle. Compare behavior with the expected use pattern for the account’s persona, plan, lifecycle stage, and use case.
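A cadence-aware rule can be sketched in a few lines. The use-case names, expected gaps, and tolerance below are placeholders I chose for illustration; your own values should come from observed behavior for each persona, plan, lifecycle stage, and use case.

```python
from datetime import datetime, timedelta

# Hypothetical expected gaps between core-workflow completions, per use case.
EXPECTED_GAP = {
    "daily_ops": timedelta(days=1),
    "weekly_review": timedelta(days=7),
    "month_end_reporting": timedelta(days=31),
}


def is_overdue(last_completed, now, use_case, tolerance=2.0):
    """True when the gap since the last core action exceeds the expected gap times a tolerance."""
    allowed = EXPECTED_GAP[use_case] * tolerance
    return (now - last_completed) > allowed


now = datetime(2026, 5, 18)
last_completed = datetime(2026, 5, 13)  # five days of silence
daily_flag = is_overdue(last_completed, now, "daily_ops")       # beyond a two-day allowance
weekly_flag = is_overdue(last_completed, now, "weekly_review")  # within a fourteen-day allowance
```

The same five days of inactivity raises a flag for the daily workflow and none for the weekly one.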

I would design backward from a small set of decisions rather than forward from every event that happens to be available. The most useful leading indicators usually describe activation, time-to-first-value, depth of feature adoption, usage momentum, friction, and expansion intent. Each tells you something different about whether value is beginning, recurring, weakening, or growing.

Instrument the path from first value to recurring value

Measure value at the account level

In B2B SaaS, the person clicking is not always the entity that retains. Users perform actions, while the account usually owns the subscription. Your model therefore needs both a reliable user identity and an account identity, plus a record of which users belonged to which account when the behavior occurred.

This distinction matters when roles differ. An administrator may configure the product once, an operator may use the core workflow repeatedly, and an executive may only view outcomes. A login-frequency rule applied equally to all three will misclassify healthy behavior as disengagement. Define the value-producing behavior for each relevant persona, then roll those behaviors into an account-level state.
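The roll-up from persona behavior to account state might look like the sketch below. The persona names, event names, and the rule that only the operator persona is required are assumptions for illustration.

```python
# Hypothetical value-producing behaviors per persona.
VALUE_EVENTS = {
    "admin": {"integration_connected"},
    "operator": {"workflow_completed"},
    "executive": {"report_viewed"},
}


def account_value_state(events, required_personas=("operator",)):
    """Summarize one account's window of (persona, event_name) pairs.

    A persona counts as active only if it performed one of its own
    value-producing events; logins and other activity are ignored.
    """
    active = {
        persona
        for persona, name in events
        if name in VALUE_EVENTS.get(persona, set())
    }
    missing = [p for p in required_personas if p not in active]
    return {
        "active_personas": sorted(active),
        "missing_personas": missing,
        "healthy": not missing,
    }
```

An administrator who has not logged in recently does not make this account unhealthy, while an account with frequent logins and no completed workflow does.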

Map the customer journey around observable value states:
- Setup: The account has supplied the prerequisites required to attempt the core workflow.
- Activation: The account has completed a meaningful milestone that indicates initial value, not merely finished an onboarding screen.
- Recurring value: The core workflow is being completed at a cadence consistent with the use case.
- Adoption depth: The account is using the capabilities required to obtain more complete or durable value.
- Friction: Attempts, errors, failed integrations, or support interactions indicate that progress is being blocked.
- Expansion intent: Behavior indicates a new use case, broader adoption, or interest in a relevant upgrade path.

Your activation milestone is the pivotal definition. It should represent the earliest behavior that credibly demonstrates value. Completing profile fields or dismissing a tour may be easy to measure, but neither proves that the customer accomplished the job for which the product was purchased.

Do not force one milestone across materially different use cases. If a plan, persona, or workflow changes the way value is produced, define the appropriate milestone for that segment. You can still report a common activation outcome while preserving the underlying reason an account qualified.

Use a minimal tracking contract

Once the value path is clear, instrument attempts, completions, failures, and meaningful outcomes along that path. A useful event contract includes:
- A stable event name with a documented business meaning.
- The user and account identifiers required for identity resolution.
- The time the behavior actually occurred, not only the time it reached the analytics system.
- The persona, plan, lifecycle stage, and use case needed for segmentation.
- The product object or workflow involved.
- A normalized outcome or error category when the action can fail.
- The event owner and the process for approving semantic changes.

For an integration workflow, for example, separate connection attempted, connection completed, and connection failed. Attach the provider and a controlled error category. Do not attach credentials, tokens, raw request bodies, or unrestricted personal information. Those fields create security and privacy exposure without improving the retention decision.
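A tracking contract is easier to enforce when a validator rejects events at ingestion. The sketch below assumes events arrive as plain dictionaries; the field names, prohibited keys, and error categories are hypothetical and would be replaced by your own taxonomy.

```python
REQUIRED_FIELDS = {"event_name", "user_id", "account_id", "occurred_at", "provider", "outcome"}
PROHIBITED_FIELDS = {"credentials", "token", "request_body"}
# Hypothetical controlled vocabulary for failures.
ERROR_CATEGORIES = {"auth_failed", "timeout", "permission_denied", "unknown"}


def validate_event(event):
    """Return a list of contract violations; an empty list means the event is accepted."""
    present = set(event)
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - present)]
    problems += [f"prohibited field: {name}" for name in sorted(PROHIBITED_FIELDS & present)]
    # A failure without a controlled category cannot be segmented or routed later.
    if event.get("outcome") == "failed" and event.get("error_category") not in ERROR_CATEGORIES:
        problems.append("failed event needs a controlled error_category")
    return problems
```

Rejecting a prohibited field outright, rather than silently dropping it, surfaces the instrumentation mistake to the team that owns the event.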

The foundation is a clean event taxonomy, dependable identity resolution, and privacy-by-design. Capture only what the decision requires. If support sentiment is useful, prefer a governed derived category over copying unrestricted support conversations into an analytics platform. Keep sensitive material in the controlled system that already owns it.

Before using any event in a risk score, ask product, data, and customer success to reconstruct the same account timeline. Check for duplicate events, delayed delivery, internal or test traffic, users mapped to the wrong account, plan changes that were not propagated, and renamed events with conflicting meanings. If those teams see different stories, automation will only distribute the disagreement faster.

It is also safer to trigger interventions from a derived account state than directly from a raw event. A raw event says that something happened. An account state says whether activation is incomplete, recurring value has weakened, an integration is blocked, or a commercial issue is unresolved. That state can carry a reason code, observation time, data-quality status, and expiration rule into the product, lifecycle messaging, or customer success workflow.
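To make the difference concrete, here is a sketch that derives one account state from raw integration events. The state name, reason code, and fourteen-day expiry are assumptions, and the rule is deliberately simple: it treats any completed connection as resolving earlier failures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AccountState:
    account_id: str
    state: str
    reason_code: str
    observed_at: datetime
    data_quality_ok: bool
    expires_at: datetime

    def is_actionable(self, now):
        """Downstream tools act only on trusted, unexpired states."""
        return self.data_quality_ok and now < self.expires_at


def derive_integration_state(account_id, events, ttl=timedelta(days=14)):
    """events: list of (event_name, occurred_at) pairs, assumed already deduplicated."""
    names = {name for name, _ in events}
    if "connection_failed" in names and "connection_completed" not in names:
        last_failure = max(ts for name, ts in events if name == "connection_failed")
        return AccountState(
            account_id=account_id,
            state="integration_blocked",
            reason_code="connection_failed_no_success",
            observed_at=last_failure,
            data_quality_ok=True,
            expires_at=last_failure + ttl,
        )
    return None  # no blocked state to report
```

Lifecycle messaging and customer success tooling can then subscribe to `integration_blocked` and its reason code without each one reinterpreting raw events.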

Build a risk score people can challenge and act on

You do not need a black-box model to begin. A transparent rule set is often more useful because product and customer success can inspect the evidence, dispute a weak assumption, and choose the correct response.

A practical account score can combine several distinct dimensions:
May 18, 2026

How to Validate Behavioral Heatmap Accuracy Before You Act

Your heatmap puts a bright cluster on the primary call to action, and the next step seems obvious: move the button, rewrite the copy, or prioritize a mobile redesign. Pause before turning that picture into a roadmap decision. A heatmap can look coherent while representing the wrong interface state, assigning clicks to the wrong element, or combining users whose layouts are materially different.

Behavioral heatmap accuracy is not about whether the colors look plausible. It is about whether each recorded interaction appears on the interface the user actually encountered, within the correct context, and supports the conclusion you want to draw. You need to validate that chain before you act on the pattern.

Treat accuracy as a chain, not a single metric

There is no single accuracy score that makes a heatmap trustworthy. Four separate conditions have to hold:
- Capture fidelity: The background image represents the relevant product state. The release, page structure, loaded content, navigation, overlays, and experiment variant should match what generated the interactions.
- Placement fidelity: A click is attached to the intended interface element after responsive reflow, personalization, localization, and other layout changes. A precise coordinate on the wrong screenshot is still wrong.
- Population fidelity: The map contains the users, devices, variants, and product states relevant to your decision. An aggregate can be mathematically correct while describing an interface that no individual user experienced.
- Inference fidelity: The visualization can support the claim being made. A click establishes an interaction, not the user’s motivation. Scroll depth establishes reach, not attention, comprehension, or persuasion.

Reliable screenshot capture, selector-based placement, automatic device detection, and clearer scrollmaps address important failure modes in this chain. They reduce ambiguity, but they do not eliminate the need to inspect your product states, filters, selectors, and supporting evidence.

The weakest link determines whether the map is useful. Perfect element placement cannot rescue a screenshot from an old release. Clean device segmentation cannot justify a claim about user intent. Before discussing what the hot area means, establish what was captured, where it was placed, and whose behavior was included.

Run a validation pass before reading the colors

Use the same validation sequence whenever a heatmap is about to influence an experiment, design change, or roadmap priority. This turns accuracy from a vague feeling into a reviewable process.
1. Write down the decision first. Be specific: move the primary action, remove a section, change the activation path, or investigate a mobile interaction. This tells you which page states and elements require the strongest validation.
2. Freeze the analysis scope. Record the screen or template, analysis window, release, experiment variant, device class, and user segment. If the interface changed during the selected window, split the data or identify the limitation rather than treating the period as one stable experience.
3. Build a state matrix. List only the states that materially alter the interface: desktop and mobile layouts, relevant locales, personalized variants, authenticated and unauthenticated views, expanded and collapsed components, or overlays that cover the underlying page. You do not need every possible segment. You need every state capable of moving, replacing, hiding, or duplicating the elements involved in your decision.
4. Compare the screenshot with each relevant state. Check the order and size of major sections, sticky navigation, banners, modals, lazy-loaded content, and conditional components. If the displayed background is stale or combines interactions from incompatible layouts, stop interpreting the map and repair the capture or filtering first.
5. Test element placement. In a controlled recorded session, interact with the target and with nearby controls that could be confused with it. Repeat the check on the layouts that move the element. The target’s hotspot should remain attached to the target rather than to an old coordinate. Exclude the controlled session from normal analysis when your tooling allows it.
6. Inspect critical selectors. Ask engineering to confirm that each selector identifies the intended component across the templates and states in scope. Pay particular attention to repeated cards, reused button components, translated labels, and responsive navigation. If adjacent actions collapse into one hotspot, the map is not suitable for deciding between those actions.
7. Reconcile the picture with events and replay. Apply equivalent page, date, device, user, and variant filters before comparing evidence. Exact numerical agreement is only a reasonable expectation when the systems use the same interaction definition and filters. Otherwise, document why their coverage differs and investigate unexplained gaps.
8. Assign a confidence grade. Mark the map as decision-grade, directional, or invalid. Decision-grade means the relevant states and placements were verified. Directional means a pattern is visible but a known limitation prevents a precise conclusion. Invalid means the visual representation is wrong for the proposed decision.

For a critical call to action, treat any reproducible placement error as a blocker. A hotspot that sometimes lands on a neighboring control can reverse the apparent preference between the two controls. Fix the representation before discussing design implications.
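The grading step can be recorded as a small function so that every review applies the same rule. This is a sketch: the parameter names are hypothetical, and treating unverified states or placements as directional rather than invalid is my assumption, not a rule stated above.

```python
def grade_heatmap(
    states_verified,
    placement_verified,
    known_limitations=(),
    placement_error_reproduced=False,
    wrong_state_captured=False,
):
    """Return 'decision-grade', 'directional', or 'invalid' for one heatmap and one decision."""
    # A wrong representation blocks the decision regardless of other checks.
    if placement_error_reproduced or wrong_state_captured:
        return "invalid"
    if states_verified and placement_verified and not known_limitations:
        return "decision-grade"
    # Assumption: anything unverified or limited can only suggest a direction.
    return "directional"
```

Storing the grade next to the screenshot, segment label, and release makes it harder for a directional map to be quoted later as decision-grade evidence.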

Split heatmaps when the interface or interaction model changes

Segmentation is not merely an analytical refinement. It is part of measurement accuracy. Mobile and desktop users may see different navigation, stacking order, content length, control size, and interaction affordances. Combining them can create a vivid composite that corresponds to neither experience.

Use a simple rule: split the map whenever a cohort can encounter different geometry, different elements, or a different way of interacting. Check these questions before aggregating:
- Does the same element exist in every included state?
- Does it keep the same purpose and selector?
- Does responsive behavior move it relative to neighboring elements?
- Does a variant, locale, or personalized state change the surrounding content?
- Are touch and pointer interactions being interpreted in a comparable way?
- Did a release alter the template during the selected analysis window?

If any answer exposes a material difference, inspect separate maps first. You can compare the resulting patterns afterward, but you should not use the blended view as the primary evidence.

Scrollmaps need the same discipline. The same depth percentage can correspond to different content when a mobile page stacks sections that sit side by side on desktop. Compare scroll behavior within consistent layouts, then map each depth region to the actual value proposition, trust element, form, or call to action shown there. Scroll reach tells you that a region was scrolled into view during the session; it does not prove that the person read or understood it.

Match the decision to what the evidence can prove

Even a technically accurate heatmap is an observation layer. It can show where interactions accumulated or how far sessions progressed. It cannot, by itself, tell you why the behavior occurred or whether a proposed design change will improve an outcome.

Use an evidence ladder instead of promoting every hotspot directly into the backlog:
- Heatmaps locate the pattern. They help you identify concentrated clicks, neglected controls, competing actions, and sections reached by fewer sessions.
- Event data measures the associated behavior. Use it to determine whether the interaction registered, where it sits in the funnel, and whether it connects to the micro-conversion or product outcome you care about.
- Session replay supplies sequence and context. Inspect what happened immediately before and after the interaction, including overlays, loading states, repeated attempts, navigation changes, and other conditions that an aggregate view hides.
- A controlled experiment evaluates the proposed change. When the claim is that a different placement, label, or layout will improve an outcome, compare that change against a baseline rather than treating the heatmap as causal proof.

The combination also helps you diagnose apparent contradictions. A strong hotspot with no corresponding outcome event may indicate a broken interaction, incomplete instrumentation, or an action whose result is unclear. Low interaction on content that few sessions reach is first a placement or journey question, not automatically a copy problem. High scroll reach with low interaction means the region was available to users, but it does not establish that they noticed or rejected its message. A hotspot outside the visible target is a measurement defect, not a behavioral insight.

Translate each finding into the next appropriate action:
- If the screenshot, selector, or segment is wrong, create an instrumentation or analytics repair.
- If the behavior is verified but its explanation is uncertain, create a discovery question and inspect relevant replays.
- If the behavior is verified and tied to an outcome gap, define a hypothesis and an A/B test.
- If the evidence reveals a reproducible interaction defect, prioritize the defect without disguising it as a preference experiment.

This language matters in product reviews. Say that you observed a pattern, verified its representation, formed a hypothesis, and selected the next test. Do not say that users prefer, understand, ignore, or want something unless your evidence can support that stronger claim.
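The routing from finding to next action can be sketched as a simple decision function. The order of precedence and the fallback when nothing applies are my assumptions; the four outcomes mirror the list above.

```python
def next_action(representation_ok, reproducible_defect, explanation_known, outcome_gap):
    """Map a heatmap finding to the next appropriate action."""
    # Measurement problems come first: nothing else can be trusted until they are fixed.
    if not representation_ok:
        return "instrumentation or analytics repair"
    if reproducible_defect:
        return "prioritize the defect"
    if not explanation_known:
        return "discovery question and replay review"
    if outcome_gap:
        return "hypothesis and A/B test"
    # Assumed fallback: a verified, explained pattern with no outcome gap needs no work item.
    return "keep observing"
```

Checking the representation before anything else keeps a measurement defect from being filed as a design hypothesis.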

Key takeaways
- A heatmap is decision-grade only when the captured state, element placement, population, and proposed inference all align.
- Validate the critical target and its neighboring controls across every layout that can move or replace them.
- Split device classes, variants, releases, locales, or personalized states when they produce materially different interfaces.
- Read scroll depth as reach and click concentration as interaction. Neither measure establishes attention, intent, or causality.
- Pair heatmaps with event data and session replay, then use a controlled experiment when your decision depends on predicted impact.

At your next heatmap review, do not begin with the hottest color. Begin with the screenshot, segment label, release, and one critical interaction traced from capture through outcome. If that path survives validation, turn the pattern into a hypothesis or product action. If it does not, fix the measurement before it becomes roadmap evidence.

References
- Amplitude – Amplitude Heatmaps Rebuilt: Rock-Solid Screenshots, Precise Placement, Smarter Scrollmaps

May 15, 2026

Governed AI Analytics in Financial Services: A Playbook

You have a credible AI analytics use case, product teams want access, and risk leaders want proof that the system will not expose sensitive data or influence the wrong decision. The mistake is to settle that tension with a broad choice between “innovation” and “control.” That choice is too vague to operate.

Start with a narrower question: what decision may this system influence, using which data, under whose authority, with what evidence afterward? Once those boundaries are explicit, you can give teams meaningful speed without asking compliance to accept an invisible risk.

Classify the decision before you assess the AI

Many AI reviews begin with the model: where it is hosted, how it was trained, or whether it can explain an answer. Those questions matter, but they do not establish the business risk. The same model can summarize an approved dashboard, flag an unusual transaction pattern, or help determine an outcome that affects a customer. Those are not equivalent uses.

Classify each use case by consequence, reversibility, and action authority. Consequence asks what happens if the output is wrong. Reversibility asks whether a person can correct the result before harm occurs. Action authority asks whether the system informs a person, recommends an action, or executes one.

Use case pattern	Permitted role for AI	Control that matters most	Boundary to make explicit
Descriptive analysis	Summarize approved metrics or behavioral patterns	Data permissions and traceable metric definitions	The output cannot create a new customer-level action
Investigative signal	Surface anomalies or suspicious patterns for review	Analyst validation, evidence capture, and disposition logging	A signal is not a finding or a verdict
Product recommendation	Suggest an intervention, workflow, or experiment	Human approval and outcome monitoring	The recommendation cannot bypass existing approval paths
Customer-affecting decision	Support a formally governed decision process	Documented oversight, explainability, and accountable human authority	The final authority and escalation path must be unambiguous

This classification prevents two common errors. The first is applying the heaviest possible review to every analytical assistant, which sends teams into unofficial tools and manual workarounds. The second is treating every output as “just an insight” even when a downstream workflow turns it into a customer action.

Trace the output one step beyond the interface. If an anomaly score enters a case-management queue, changes account handling, or triggers outreach, govern that downstream effect as part of the use case. A recommendation does not become low risk merely because a person clicks the final button.

Before development begins, write an allowed-action statement and a prohibited-action statement. For example: “The system may prioritize patterns for analyst investigation. It may not label a customer, close a case, or initiate an external action.” That pair of sentences is more operationally useful than calling the project “medium risk.”

Risk and compliance leaders still need to map the use case to the organization’s actual legal and regulatory obligations. A product risk classification is an operating tool, not a legal conclusion. When a use case could affect access, eligibility, pricing, fraud treatment, or another consequential outcome, obtain the appropriate compliance and legal review before activation.

Turn governance principles into an enforceable contract

Principles such as fairness, privacy, transparency, and human oversight do not control a production workflow by themselves. Each principle needs an owner, an enforcement point, and evidence that the control operated. I treat that combination as the governance contract for the use case.

Define the data boundary

List the approved data domains, fields, purposes, environments, and user groups. Do not stop at “customer data” or “analytics data.” Those labels are too broad to enforce. State which attributes the system can retrieve, which identifiers it can display, whether results may be exported, and where generated outputs may be stored.

Purpose: the business question the data may be used to answer.
Permitted inputs: the approved events, attributes, aggregates, and reference data.
Prohibited inputs: data classes that the workflow must never retrieve or infer.
Permitted users: roles allowed to query, review, approve, or export results.
Output handling: where results may be displayed, retained, shared, or reused.
Failure behavior: what the system does when permission, provenance, or confidence is insufficient.

Enforce that boundary with role-based access controls and granular permissions at retrieval time. Filtering an answer after a model has received restricted data is not equivalent to preventing access. The model, retrieval layer, analytics service, export path, and destination workflow all need to respect the same user identity and policy context.
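The difference between preventing access and filtering afterward shows up clearly in code. The sketch below checks purpose, role, and requested fields before any record is read. The `DataBoundary` class, the `retrieve` function, and the example field names are hypothetical, and a real deployment would enforce the same policy in the data platform rather than in application code alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBoundary:
    purpose: str
    permitted_fields: frozenset
    permitted_roles: frozenset


class AccessDenied(Exception):
    """Raised before any data is read when a request falls outside the boundary."""


def retrieve(records, requested_fields, role, purpose, boundary):
    """Return only the requested fields, or fail closed without touching the records."""
    if purpose != boundary.purpose:
        raise AccessDenied(f"purpose not approved: {purpose}")
    if role not in boundary.permitted_roles:
        raise AccessDenied(f"role not permitted: {role}")
    blocked = set(requested_fields) - boundary.permitted_fields
    if blocked:
        raise AccessDenied(f"fields outside boundary: {sorted(blocked)}")
    # Project to the requested fields so unrequested attributes never reach the model.
    return [{name: record[name] for name in requested_fields} for record in records]
```

Failing closed matches the failure behavior in the boundary definition: when permission is insufficient, the model receives nothing rather than a redacted version of restricted data.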

Assign decision rights to named roles

A committee can set policy, but it cannot own every operational decision. Give each use case an accountable product owner, a data owner, a control owner, and a business reviewer. Clarify who can approve launch, who can change the data scope, who reviews exceptions, and who has authority to stop the workflow.

The product owner defines the user problem, allowed action, prohibited action, and business outcome.
The data owner approves the data purpose, quality expectations, permissions, and reuse limits.
The risk or compliance owner maps policy obligations to testable controls and reviews material exceptions.
The platform or security owner implements identity, access, isolation, logging, and change controls.
The business reviewer accepts, rejects, or escalates outputs and records why.

Keep the decision rights close to the workflow. If a reviewer sees an unsupported conclusion, that person needs a clear way to reject it, preserve the evidence, and route the issue. If every exception disappears into a general governance inbox, the formal control will be bypassed when operational pressure rises.

Design the audit record before launch

An audit trail should reconstruct what happened without relying on someone’s memory. Capture the requesting identity and role, the approved purpose, the data and metric definitions used, the system configuration, the generated result, any human review, the resulting action, and later corrections or overrides.

Logging creates its own data risk. Prompts, retrieved context, generated explanations, and reviewer notes can contain sensitive information. Protect the audit store with appropriate access, retention, and segregation rather than treating logs as harmless operational exhaust. Where policy permits, record protected references to sensitive records instead of duplicating raw payloads.

A practical platform evaluation should test whether the system combines strong data governance, auditable AI behavior, secure scale, and a direct connection to product outcomes. A policy document that cannot be enforced in the workflow is not enough, and a platform control without an accountable operating process is not enough either.

Put controls inside the workflows people actually use

Governance fails when it exists as a review ceremony around the product rather than a behavior inside it. Analysts should not have to remember a separate policy every time they ask a question. The approved data scope, identity context, review step, and evidence capture should travel with the task.

Behavioral analytics: govern the meaning as well as the data

Behavioral analytics can reveal how customers move through onboarding, self-service, support, payments, and other product journeys. The danger is not limited to unauthorized access. An AI system can also combine valid events into a misleading interpretation of customer intent.

Start the workflow with curated event definitions and approved business metrics. Require the output to expose the cohort definition, time context, filters, exclusions, and comparison used. The analyst should be able to inspect the path from a narrative claim back to the underlying measure before sharing it.

Separate observation from inference in the interface. “Users in this cohort abandoned the flow after this step” is an observation tied to event data. “They abandoned because they distrusted the process” is a hypothesis. Labeling those differently prevents fluent language from turning a plausible explanation into an unsupported fact.

Anomaly detection: route a signal into investigation, not judgment

An anomaly means a pattern differs from an expected baseline. It does not establish fraud, customer intent, system abuse, or operational error. Treat anomaly detection as a prioritization mechanism unless a separately governed process establishes something more.

Give the reviewer the observed deviation, relevant context, the comparison baseline, and links to permitted evidence. Capture the reviewer’s disposition: confirmed issue, expected behavior, insufficient evidence, data-quality problem, or escalation. That disposition is both an audit artifact and a feedback signal for improving the workflow.

Watch the operational burden as closely as the detection capability. A flood of weak signals can make the nominal control less safe because reviewers rush, defer, or stop trusting the queue. Monitor false positives, unresolved escalations, overrides, and the reasons analysts reject outputs. When those indicators deteriorate, reduce scope or pause automated routing while the cause is investigated.
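The dispositions and the pause rule described above can be sketched as a small health check. This is a minimal example under stated assumptions: the threshold values are placeholders, and counting "expected behavior" dispositions as false positives is one possible definition rather than the only one.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Disposition(Enum):
    CONFIRMED_ISSUE = "confirmed_issue"
    EXPECTED_BEHAVIOR = "expected_behavior"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"
    DATA_QUALITY_PROBLEM = "data_quality_problem"
    ESCALATION = "escalation"


def queue_health(
    dispositions: Iterable[Disposition],
    open_escalations: int,
    max_false_positive_rate: float = 0.5,   # hypothetical threshold
    max_open_escalations: int = 20,         # hypothetical threshold
) -> Dict[str, object]:
    """Summarize reviewer dispositions and decide whether to pause routing."""
    counts = Counter(dispositions)
    reviewed = sum(counts.values())
    # Assumption: a signal dismissed as expected behavior is a false positive.
    fp_rate = counts[Disposition.EXPECTED_BEHAVIOR] / reviewed if reviewed else 0.0
    pause = fp_rate > max_false_positive_rate or open_escalations > max_open_escalations
    return {
        "reviewed": reviewed,
        "false_positive_rate": fp_rate,
        "pause_routing": pause,
    }
```

The check returns the inputs to the decision as well as the decision itself, so the reason for a pause stays inspectable.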

Self-service analysis: give teams a governed lane

Product managers and analysts need enough freedom to explore without sending every question through a central approval queue. Create a governed workspace containing approved metrics, documented data products, role-aware access, and restricted export paths. Let people iterate freely inside that lane, and require a new review whenever a change touches data scope, decision authority, or external activation.

Make the boundary visible. Users should know when an answer is based on incomplete data, when a metric is not approved for customer-level decisions, and when an output cannot be exported. A silent denial encourages workarounds; a clear denial that identifies the policy boundary gives the user a legitimate next step.

Do not give an analytics assistant write access to operational systems merely because the integration is convenient. Insight generation and action execution are separate privileges. Connect them only when the action, reviewer, failure mode, and rollback path have been governed explicitly.

Pilot with evidence, not a polished demonstration

A convincing demo proves that the happy path works. A governed pilot must also prove that the system refuses the wrong request, exposes enough evidence for review, and leaves a usable record when something goes wrong.

Choose a narrow workflow with an identifiable user, a bounded data set, a reviewable output, and a business outcome you already understand. Avoid beginning with an enterprise-wide assistant or an autonomous action layer. Broad scope makes it difficult to distinguish model behavior, data problems, permission failures, and process gaps.

Write the decision contract. Record the user, purpose, permitted inputs, allowed action, prohibited action, reviewer, and stop authority.
Configure the smallest useful data boundary. Include only the fields and metrics needed for the chosen workflow.
Test legitimate work. Confirm that authorized users can produce an insight, inspect its basis, and complete the intended review.
Test prohibited work. Attempt access with the wrong role, request excluded attributes, try an unauthorized export, and ask the system to take a prohibited action.
Test ambiguity and failure. Use incomplete context, conflicting metric definitions, missing permissions, and unavailable dependencies. Confirm that the system fails visibly and safely.
Reconstruct the event. Use the audit record to determine who requested the output, what information was used, what was generated, who reviewed it, and what happened next.
Change the system deliberately. Update a relevant configuration or model component and confirm that approval, documentation, testing, and monitoring follow the change.
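The decision contract from the first step can be captured as a small structure that a pilot checklist can test against. The sketch below is illustrative: the class, field names, and helper functions are assumptions, and a real contract would likely carry more detail per field.

```python
from dataclasses import dataclass, fields
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DecisionContract:
    user: str
    purpose: str
    permitted_inputs: Tuple[str, ...]
    allowed_action: str
    prohibited_action: str
    reviewer: str
    stop_authority: str


def unnamed_elements(contract: DecisionContract) -> List[str]:
    """Return the contract elements that are still blank."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


def request_is_in_scope(contract: DecisionContract, requested_fields: Iterable[str]) -> bool:
    """True only when every requested field sits inside the data boundary."""
    return set(requested_fields) <= set(contract.permitted_inputs)
```

A prohibited-work test can then assert that `request_is_in_scope` is false for an excluded attribute, and a readiness review can require `unnamed_elements` to return an empty list.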

Do not accept screenshots as evidence for controls that operate behind the interface. Ask the vendor or internal platform team to demonstrate a denied request, a permission change, a reviewer override, an exported audit record, and the behavior after a governed configuration change. The test should follow your use case and identities, not a generic demonstration tenant.

Measure value and control health together. If the system produces faster insights but increases unreviewed actions, weakens attribution, or creates an investigation backlog, it has not delivered a durable improvement.

Dimension	Question	Useful signals
Business value	Does the workflow improve a real product, growth, risk, or operational decision?	Time to a validated insight, useful investigations completed, issues resolved, and attributable product outcomes
Analytical quality	Can a reviewer verify the conclusion?	Accepted and rejected outputs, unsupported claims, metric-definition errors, and missing context
Control effectiveness	Did policy operate as designed?	Prohibited requests blocked, required reviews completed, permission exceptions, and audit-record completeness
Operational health	Can people sustain the workflow?	False-positive burden, unresolved escalations, overrides, rework, and reviewer backlog
Change safety	Do updates preserve the approved boundary?	Documented changes, completed regression checks, new failure patterns, and monitored post-change behavior

Set release gates in binary language. The use case has a named accountable owner or it does not. Permissions have been tested with unauthorized identities or they have not. High-impact outputs receive the required review or they do not. Audit evidence can reconstruct an event or it cannot. Ambiguous gates become exceptions as soon as delivery pressure appears.
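One way to keep gates binary is to treat an undecided gate as a failure to release. The following sketch assumes a hypothetical mapping of gate names to `True`, `False`, or `None` (not yet evaluated); the gate names are examples only.

```python
from typing import Dict, List, Optional


def release_decision(gates: Dict[str, Optional[bool]]) -> Dict[str, object]:
    """Release only when every gate is explicitly True.

    None means the gate has not been evaluated, which blocks release
    in the same way as an explicit failure.
    """
    failed: List[str] = [name for name, passed in gates.items() if passed is False]
    undecided: List[str] = [name for name, passed in gates.items() if passed is None]
    return {
        "release": not failed and not undecided,
        "failed": failed,
        "undecided": undecided,
    }


example_gates = {
    "named_accountable_owner": True,
    "permissions_tested_with_unauthorized_identities": True,
    "high_impact_outputs_reviewed": None,    # not yet evaluated
    "audit_reconstructs_event": True,
}
```

With `example_gates`, the result blocks release and names the undecided gate instead of letting it pass as a soft exception.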

When the pilot is stable, reuse the control components rather than copying the entire use case. Standard identity propagation, data classification, audit schemas, reviewer workflows, and change gates can form a shared control plane. Each new use case still needs its own purpose, decision boundary, outcome measure, and risk assessment.

Key takeaways

Govern the decision the AI can influence, not just the model that produces the output.
Write both an allowed-action statement and a prohibited-action statement before development begins.
Enforce data permissions before retrieval and carry the user’s identity through analysis, export, and downstream action.
Treat human review as an operational workflow with evidence, dispositions, escalations, and stop authority.
Keep observations, hypotheses, recommendations, and customer-affecting decisions visibly distinct.
Test denial, ambiguity, change, and audit reconstruction alongside the happy path.
Track business value, analytical quality, control effectiveness, and operational burden on the same scorecard.

Your next move is not to draft an enterprise AI policy. Pick one live analytics workflow and write its decision contract on a single page. If you cannot name the allowed action, prohibited action, data boundary, reviewer, audit evidence, and stop authority, the workflow is not ready to scale. If you can, you have the foundation for AI analytics that product teams can use and risk leaders can defend.

May 15, 2026

How to Prove the ROI of an AI Product Before You Scale It

Your AI product is getting used. The demos land well, task completion is improving, and internal enthusiasm is high. Then the CFO asks a harder question: what changed in the business because this product exists?

You cannot answer that question with prompt volume, response quality, adoption, or tickets touched. You need a measurement system that separates activity from incremental value, counts the full operating cost, and makes risk visible before a rollout gets larger. Here is how to build one.

Start with the decision your ROI model must support

ROI is not a retrospective slide assembled after launch. It is a decision rule. Before development begins, decide what evidence would justify launching, scaling, redesigning, rolling back, or retiring the capability.

That distinction changes the conversation. Instead of asking whether the agent is accurate enough or popular enough, you ask whether a measurable change in customer behavior produces a measurable business result without crossing an unacceptable risk threshold.

Build a driver tree with four levels:

Company outcome: revenue growth, lower cost to serve, or reduced business risk.
Customer outcome: the user completes a valuable job, reaches value sooner, or resolves a problem without unnecessary effort.
Product behavior: the AI capability changes conversion, expansion, self-service completion, containment, handle time, or escalation.
Controllable lever: the team changes the workflow, model behavior, conversation design, human review, or product guidance.

The chain matters because a model metric is rarely a business metric. Better answer quality may improve task completion, which may improve trial-to-paid conversion. The ROI case depends on the full chain, not the first link.

Value path	Business outcome	Leading evidence	Guardrails
Revenue	Higher conversion, average order value, or expansion	Time-to-first-value and self-service completion	Errors, complaints, and policy violations
Cost	Lower cost to serve	Containment, deflection, and reduced handle time	Escalations, false resolution, and downstream customer harm
Risk	Lower frequency or impact of harmful failures	Human-review events and detected violations	False positives, false negatives, hallucinations, and security breaches

Choose one primary value path for the investment case. Revenue, cost, and risk can all appear on the scorecard, but declaring all three as primary makes it too easy to rescue a weak result with whichever metric moved after launch.

A support agent, for example, may appear successful because it contains more conversations. But containment is only valuable if customers actually resolve their problems. A conversation that never reaches a human can reduce measured support volume while increasing complaints or churn risk. This is why revenue, cost, and risk measures must be evaluated together.

Write the measurement contract before you build the dashboard

A measurement contract is a short agreement among product, data, finance, and the operational team affected by the AI workflow. It prevents the definitions, cost boundaries, and success thresholds from changing after results arrive.

Your contract should answer these questions:

Who is eligible? Define the users, accounts, tasks, channels, and exclusions. Do not mix workflows with materially different economics.
What is the intervention? Name the AI capability and the version being evaluated. A model, prompt, retrieval pipeline, policy, or escalation change can alter the result.
What is the primary outcome? Select the business metric that determines whether the hypothesis passed.
What are the leading indicators? Use measures such as time-to-first-value, containment, and self-service completion to diagnose movement before lagging results mature.
What are the guardrails? Predefine acceptable limits for errors, hallucinations, false positives, false negatives, escalations, complaints, security events, and policy violations.
What is the baseline? Freeze the comparison period or control group before exposing the eligible population to the capability.
How will incrementality be proven? Specify the experiment, holdout, assignment unit, and minimum detectable effect.
What costs count? Agree on model or API consumption, labeling, evaluation, human review, and ongoing oversight before calculating value.
What action follows each result? Record the thresholds for launch, scale, redesign, rollback, and retirement.

The contract should distinguish an outcome OKR from an output OKR. Shipping the agent, generating responses, and increasing feature use are outputs. Improving conversion, lowering verified cost to serve, or reducing harmful failures are outcomes. Outputs can explain what happened, but they cannot establish value on their own.

Instrument the complete journey, not just the conversation

An AI log tells you what the model did. An ROI dataset must also tell you what the user did next.

Connect the journey from eligibility to business outcome:

The user or account became eligible for the capability.
The AI experience was offered, viewed, and engaged.
A task was attempted, completed, abandoned, or repeated.
A response was accepted, corrected, regenerated, or sent for human review.
The interaction was contained, escalated, or handed to another workflow.
The downstream conversion, expansion, support, retention, or complaint event occurred.
The associated model cost, labeling work, and human-oversight cost were recorded.

Carry a stable user or account identifier, experiment assignment, agent version, and journey identifier across those events. Without that connective tissue, the team may have an impressive agent dashboard and no defensible way to attribute a business outcome to the experience.
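To show why those identifiers matter, the sketch below joins journey events by account and experiment assignment and computes an outcome rate per arm. The event names, the `JourneyEvent` fields, and the arm labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class JourneyEvent:
    account_id: str
    journey_id: str
    assignment: str       # e.g. "treatment" or "control"
    agent_version: str
    event: str            # e.g. "eligible", "task_completed", "converted"


def outcome_rate_by_assignment(
    events: Iterable[JourneyEvent], outcome_event: str
) -> Dict[str, float]:
    """Share of eligible accounts in each arm that reached the outcome.

    The denominator is every eligible account, not only accounts that
    engaged with the agent, so voluntary usage does not bias the rate.
    """
    eligible: Dict[str, Set[str]] = {}
    achieved: Dict[str, Set[str]] = {}
    for e in events:
        if e.event == "eligible":
            eligible.setdefault(e.assignment, set()).add(e.account_id)
        elif e.event == outcome_event:
            achieved.setdefault(e.assignment, set()).add(e.account_id)
    return {
        arm: len(achieved.get(arm, set()) & accounts) / len(accounts)
        for arm, accounts in eligible.items()
    }
```

If the assignment or account identifier is missing from any event type, this join cannot be computed, which is the attribution gap the paragraph above warns about.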

Use behavioral analytics and session replay to understand why a metric moved. Use journey mapping and retention analysis to locate the friction worth solving in the first place. Product tours and in-app guidance can then help eligible users reach a validated workflow. This creates a closed loop from journey friction to experiment and measurable outcome, instead of a collection of disconnected AI metrics.

Calculate economic value without turning activity into savings

Start with net business value:

Net business value = incremental revenue + cost avoided – total operating cost – quantified risk loss

If finance requires an ROI percentage, divide net business value by the agreed investment base. Keep both the numerator and denominator visible. A percentage without its cost boundary is easy to inflate and hard to audit.
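The net value formula and the ROI percentage can be written as two small functions. The figures in the example are invented purely to show the arithmetic; the function and parameter names are assumptions.

```python
def net_business_value(
    incremental_revenue: float,
    cost_avoided: float,
    total_operating_cost: float,
    quantified_risk_loss: float = 0.0,
) -> float:
    """Net business value as defined above."""
    return incremental_revenue + cost_avoided - total_operating_cost - quantified_risk_loss


def roi_percentage(net_value: float, investment_base: float) -> float:
    """ROI as a percentage of the agreed investment base."""
    if investment_base <= 0:
        raise ValueError("investment_base must be positive")
    return 100.0 * net_value / investment_base


# Hypothetical figures for illustration only.
example_net_value = net_business_value(
    incremental_revenue=120_000.0,
    cost_avoided=30_000.0,
    total_operating_cost=90_000.0,
    quantified_risk_loss=10_000.0,
)
example_roi = roi_percentage(example_net_value, investment_base=100_000.0)
```

With these inputs the net value is 50,000 and the ROI is 50 percent. Reporting both numbers together keeps the numerator and the denominator visible.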

Count only incremental revenue

Do not credit the AI product with every transaction it touched. Credit it with the difference between the exposed population and the valid control or holdout.

A practical revenue calculation is:

Incremental revenue = eligible volume × measured outcome lift × value per additional outcome

The measured outcome might be trial-to-paid conversion, self-service upsell, average order value, or expansion. Use the same eligibility definition, attribution window, and revenue treatment for the intervention and control. If the AI experience merely appears somewhere in a successful journey, that is influenced revenue, not proof of incremental revenue.
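A minimal sketch of that calculation follows, using the difference in outcome rates between the exposed group and the control as the measured lift. The counts and the value per outcome are invented for illustration, and the function name is an assumption.

```python
def incremental_revenue(
    eligible_volume: float,
    treatment_outcomes: int,
    treatment_size: int,
    control_outcomes: int,
    control_size: int,
    value_per_outcome: float,
) -> float:
    """Eligible volume x measured outcome lift x value per additional outcome.

    Lift is the absolute difference in outcome rate between the exposed
    group and the control, both measured on the same eligibility definition.
    """
    if treatment_size <= 0 or control_size <= 0:
        raise ValueError("group sizes must be positive")
    lift = treatment_outcomes / treatment_size - control_outcomes / control_size
    return eligible_volume * lift * value_per_outcome


# Hypothetical example: 6% vs 5% conversion, 10,000 eligible users, 200 per conversion.
example = incremental_revenue(10_000, 300, 5_000, 250, 5_000, 200.0)
```

The example works out to roughly 20,000. Note that the 300 conversions in the exposed group are not the credited figure; only the one-point difference against the control is.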

Separate capacity from cashable savings

Cost claims require more care than a deflection count. A contained interaction may create capacity without reducing expenditure. That capacity can still be valuable, but it should not be presented as cash savings unless spending actually changes.

Capacity created: employees have time available for other work, but the existing cost base remains.
Variable cost avoided: the company no longer incurs a cost that would have grown with each additional interaction.
Cashable savings: an approved budget, vendor charge, or staffing requirement is actually reduced.

Report these separately. Otherwise, the same saved minute can be counted once as employee capacity and again as reduced spend.

Validate that a deflected task was resolved, not abandoned or displaced to another channel. Then calculate avoided cost from the incremental lift in verified resolution, not the total number of conversations the agent handled.
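The validation step above can be sketched as a filter that counts a contained conversation only when it was resolved and did not resurface elsewhere. The dictionary keys and the recontact flag are assumptions about how the data might be recorded.

```python
from typing import Dict, List


def verified_resolution_rate(conversations: List[Dict[str, bool]]) -> float:
    """Share of conversations that were contained, resolved, and not
    displaced to another channel."""
    if not conversations:
        raise ValueError("at least one conversation is required")
    verified = sum(
        1
        for c in conversations
        if c["contained"] and c["resolved"] and not c["recontacted_elsewhere"]
    )
    return verified / len(conversations)


def avoided_variable_cost(
    eligible_volume: float,
    treated: List[Dict[str, bool]],
    control: List[Dict[str, bool]],
    variable_cost_per_interaction: float,
) -> float:
    """Avoided cost from the incremental lift in verified resolution,
    not from the total number of conversations the agent handled."""
    lift = verified_resolution_rate(treated) - verified_resolution_rate(control)
    return eligible_volume * lift * variable_cost_per_interaction
```

This figure belongs in the variable-cost-avoided line only. Whether any of it becomes cashable savings depends on a budget or staffing change that the calculation cannot observe.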

Include the operating costs that make the agent dependable

Model or API cost is only one part of the investment. Include labeling, evaluation, human review, and operational oversight. If a safer workflow requires more review, that review is part of the product’s economics, not an external inconvenience to exclude from the model.

Segment cost by agent, workflow, and outcome. Cost per response is useful for infrastructure management, but cost per verified successful outcome is the better economic unit. A cheap response that triggers retries, escalations, or corrections may be more expensive than a higher-cost response that completes the job.
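The comparison in the last sentence can be made concrete with a small calculation. All figures below are hypothetical, and the `WorkflowCosts` structure is an assumption about how costs might be grouped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowCosts:
    responses: int
    model_cost_per_response: float
    human_interventions: int        # retries, escalations, corrections, reviews
    cost_per_intervention: float
    verified_outcomes: int

    @property
    def total_cost(self) -> float:
        return (
            self.responses * self.model_cost_per_response
            + self.human_interventions * self.cost_per_intervention
        )

    @property
    def cost_per_verified_outcome(self) -> float:
        if self.verified_outcomes <= 0:
            raise ValueError("no verified outcomes to divide by")
        return self.total_cost / self.verified_outcomes


# Hypothetical workflows: the cheaper model triggers far more human work.
cheap = WorkflowCosts(1000, 0.01, 200, 5.0, 500)
premium = WorkflowCosts(1000, 0.05, 20, 5.0, 800)
```

In this invented case the cheaper response costs about 2.02 per verified outcome and the more expensive response about 0.19, even though its per-response price is five times higher.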

Do not bury risk inside an average ROI number

Risk adjustment should make uncertainty visible, not create false precision. Use three layers:

Hard guardrails: security and policy conditions that trigger containment or rollback regardless of financial upside.
Observed risk indicators: error, hallucination, escalation, complaint, false-positive, and false-negative rates tracked by workflow and cohort.
Financial adjustment: expected loss deducted from net value only when the probability and impact assumptions are credible enough for finance and risk owners to accept.

Do not let a low-frequency, high-consequence failure disappear inside a high average success rate. If the downside cannot be defensibly monetized, keep it as an explicit decision constraint rather than assigning it a convenient dollar value.

Prove incrementality before claiming impact

The strongest ROI calculation still fails if the attribution is weak. A before-and-after improvement may come from seasonality, pricing, traffic quality, a support policy change, or another product release. The AI capability needs a counterfactual: what would have happened to comparable eligible users without it?

Use an A/B test or holdout whenever the product and risk profile allow it. Make these choices before launch:

Assignment unit: Randomize at the level where the outcome occurs. If expansion is measured per account, account-level assignment can prevent users in the same customer organization from receiving conflicting experiences.
Primary outcome: Pick the metric that determines success and keep diagnostic metrics secondary.
Minimum detectable effect: Precompute the smallest lift worth detecting based on the baseline, available population, and business value. If the experiment cannot detect a decision-relevant change, extending the metric list will not fix it.
Guardrails: Test quality, escalation, complaints, security, and policy outcomes alongside the primary metric.
Analysis population: For a product-level ROI claim, analyze eligible users according to their assigned experience. Looking only at people who voluntarily used the agent introduces selection bias.
Measurement horizon: Keep the holdout long enough to observe the outcome named in the contract. Leading indicators can guide iteration, but they should not be substituted for retention, churn, net revenue retention, or other lagging outcomes.
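For a conversion-style outcome, the minimum detectable effect choice can be checked before launch with a standard two-proportion sample size approximation. The sketch below uses only the Python standard library. The default significance level and power are common conventions rather than values from this playbook, and the approximation assumes independent units, so with account-level assignment the unit counted is the account.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    min_detectable_lift: float,   # absolute lift, e.g. 0.01 for one point
    alpha: float = 0.05,          # two-sided significance level (assumed)
    power: float = 0.80,          # assumed target power
) -> int:
    """Approximate units needed in each arm to detect the given lift."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    if not (0 < p1 < 1 and 0 < p2 < 1) or min_detectable_lift == 0:
        raise ValueError("rates must be in (0, 1) and the lift non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def can_detect(available_per_arm: int, baseline_rate: float, min_detectable_lift: float) -> bool:
    """True when the available population is large enough for the chosen lift."""
    return available_per_arm >= sample_size_per_arm(baseline_rate, min_detectable_lift)
```

If `can_detect` is false for the smallest lift that would change the decision, the options are a longer horizon, a larger eligible population, or a weaker claim. Adding more metrics does not help.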

If randomization is not practical, use a fixed holdout or a frozen comparison period and document the limitations. A weaker design can still inform a decision, but the ROI claim should carry less confidence. Do not quietly promote correlation to causation because the rollout has executive attention.

Interpret the result as a system. Suppose self-service completion rises but the business outcome does not. The agent may be solving a low-value task, attracting users who would have converted anyway, or shifting effort to a later step. If conversion improves while complaints or policy violations cross the guardrail, the value hypothesis may be valid but the implementation is not ready to scale.

This is eval-driven development applied to product economics: define acceptable behavior and business success, measure both under controlled conditions, diagnose the failures, and repeat the test after a meaningful change.

Turn ROI into a portfolio operating system

A one-time business case goes stale as models, prompts, traffic, user behavior, and operating costs change. Maintain an Agent Analytics view for every production capability.

Each agent scorecard should show:

The primary business outcome and current experiment result.
Leading journey metrics from eligibility through verified completion.
Revenue contribution, cost avoided, and total operating cost using the agreed definitions.
Quality and risk guardrails, including escalations and human-review events.
Performance by relevant customer, task, and journey cohort.
The agent, model, policy, and workflow version associated with the result.
The current decision status: exploring, launching, scaling, redesigning, contained, or retiring.

Use the dashboard to make portfolio decisions, not merely to report trends:

Scale when the primary outcome clears the precommitted threshold, guardrails hold, net value is positive, and the result remains credible across the cohorts that matter.
Redesign when leading indicators improve but the business outcome does not, or when human review and escalation erase the economic gain.
Contain or roll back when a hard security, policy, or customer-harm threshold is breached, even if average financial performance is positive.
Retire when controlled measurement shows no decision-relevant incrementality or when dependable operation costs more than the value created.
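These four rules can be encoded as a single function so the scorecard status is derived from evidence instead of being set by hand. This is one possible encoding under stated assumptions: the parameter names are hypothetical, and the order of the checks is a simplification of the rules above.

```python
def portfolio_decision(
    *,
    hard_guardrail_breached: bool,
    primary_lift: float,
    scale_threshold: float,            # precommitted threshold for the primary outcome
    net_value: float,
    credible_across_cohorts: bool,
    leading_indicators_improved: bool,
) -> str:
    """Map scorecard evidence to a portfolio status."""
    # A hard security, policy, or customer-harm breach overrides financial results.
    if hard_guardrail_breached:
        return "contain_or_roll_back"
    clears_threshold = primary_lift >= scale_threshold
    if clears_threshold and net_value > 0 and credible_across_cohorts:
        return "scale"
    # Some signal exists, but the outcome or the economics do not hold yet.
    if leading_indicators_improved or clears_threshold:
        return "redesign"
    return "retire"
```

Checking the guardrail first reflects the rule that a breach triggers containment even when average financial performance is positive.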

Review operational signals with frontline teams because they can explain patterns hidden by aggregate metrics. Review portfolio value in QBRs with product, data, finance, and risk owners so investment follows evidence rather than novelty.

Only accelerate adoption after the workflow has demonstrated unit value. In-app guides, product tours, and lifecycle nudges can bring more eligible users into a validated flow. Measure whether those interventions increase the business outcome, not merely clicks or agent sessions. Scaling exposure to an unproven workflow scales its cost and risk as readily as its potential benefit.

Key takeaways

Treat ROI as a precommitted decision rule for launch, scale, redesign, rollback, or retirement.
Connect model behavior to customer behavior and then to revenue, cost, or risk through a driver tree.
Freeze the baseline, cost boundary, guardrails, attribution method, and success thresholds before results arrive.
Credit only incremental revenue and verified avoided cost. Keep created capacity separate from cashable savings.
Include model consumption, labeling, evaluation, human review, and oversight in the operating cost.
Use controlled experiments or holdouts, with a decision-relevant minimum detectable effect, to separate causal impact from correlation.
Keep severe risk conditions as explicit constraints when they cannot be responsibly converted into a financial estimate.
Scale adoption only after the AI workflow has shown positive unit value under acceptable risk.

Pick one high-friction customer journey and complete its measurement contract before the next roadmap review. If the team cannot name the baseline, control, primary outcome, cost boundary, guardrails, and decision thresholds, the capability is still an exploration. Label it honestly, instrument it properly, and earn the right to make an ROI claim.

May 15, 2026

How to Deploy an Operator AI Agent in Customer Operations

Your support team probably does not need another chatbot that summarizes a ticket on command. It needs help with the operational work surrounding every ticket: finding why escalations changed, keeping knowledge accurate, correcting broken automations, coordinating incident communication, and showing human reps what deserves attention next.

An operator AI agent can take on that work, but only if you design it as an operating system for customer operations rather than a conversational layer over support APIs. The useful version closes the loop from signal to diagnosis to tested change. The dangerous version produces plausible commentary and receives permission to act before it has earned trust.

Define the job as a closed loop, not a chat box

A customer-facing AI agent handles an individual customer’s request. An operator agent works on the system around those requests: conversations, help content, automation configuration, performance data, incident workflows, and the human queue.

That distinction changes the product requirement. The agent is not complete when it answers a question such as why escalations increased. It is complete when it can investigate the increase, identify a supported cause, determine which operational object needs attention, prepare a change, test that change where possible, and route it to the right person for approval.

Observe: Detect a question, anomaly, scheduled task, failed conversation, release brief, or incident.
Diagnose: Select the relevant metrics and attributes, inspect representative conversations, and separate recurring patterns from isolated cases.
Locate the control point: Determine whether the problem sits in knowledge, guidance, a procedure, a data connector, an automation rule, or a human workflow.
Propose: Produce a concrete artifact such as an article diff, configuration change, procedure, incident audience, or prioritized queue.
Verify: Run a simulation or another appropriate check and expose failures, edge cases, and remaining uncertainty.
Act and learn: Apply an approved change, record what happened, and monitor the affected outcome for regression.

Consider the prompt, “Why did escalations rise last week?” A reporting copilot returns a chart. A useful operator identifies which escalation definition applies, segments the change, reads relevant conversations, finds the repeated cause, checks whether the corresponding help content or automation is deficient, and prepares the smallest defensible correction. That progression from an operational question to an actionable proposal is already possible across analysis, knowledge maintenance, automation building, and human support workflows.

Write the acceptance criteria around that complete handoff. Require the evidence used, the proposed artifact, the scope of impact, the verification result, the named reviewer, and any action the agent is forbidden to take. If the output still leaves an operations manager rebuilding the context manually, you have a chat assistant, not an operator.

Build reliability below the model and price that work honestly

A foundation model with API access can make a persuasive prototype. It can query ticket data, summarize conversations, and write a report that appears coherent. The hard part begins when different workspaces use different fields, configurations, workflows, permissions, and definitions of success.

The model should not have to rediscover your operating rules on every run. Encode those rules in purpose-built tools and reusable skills. A tool performs one bounded operation, such as retrieving a conversation, searching knowledge, or running a defined report. A skill coordinates several tools to complete a business job, such as debugging a failed resolution or rolling a policy change through the help center.

Operator’s production architecture is described as having more than 50 tools and 10 multi-step skills. Those counts are not targets to copy. They illustrate how quickly the hidden surface area grows once an agent must do dependable operational work instead of demonstrating a few API calls.

System layer	Job it must perform	Failure you should test for	Control to add
Semantic retrieval	Find content by meaning, not only exact words	Irrelevant or incomplete evidence produces a confident diagnosis	Evaluate retrieval against real support questions and known content gaps
Attribute awareness	Know which metrics, fields, and custom attributes are populated and meaningful	The agent invents a pattern from sparse or unused fields	Expose field definitions, coverage, allowed joins, and missing-data signals
Atomic tools	Perform narrow reads or writes predictably	A broad API wrapper allows an unintended query or change	Use typed inputs, constrained scopes, explicit permissions, and structured results
Domain skills	Chain tools according to a repeatable customer-operations method	The same request follows a different process on each run	Define required steps, exit conditions, evidence, and escalation paths
Review interface	Turn reasoning into charts, diffs, tests, and proposals	A reviewer approves a wall of prose without understanding the change	Render the decision in the format appropriate to the object being changed

Semantic retrieval and attribute awareness deserve particular attention. Retrieval grounds the agent in the content that can actually answer the question. Attribute awareness stops it from treating every available field as equally meaningful. A custom field that exists but is almost never populated should not become the foundation of an operational recommendation.

Give every tool a contract before the model can call it:

The business purpose and the questions it is allowed to answer.
The read and write permissions it requires.
The preconditions that must be true before it runs.
The evidence and identifiers it must return.
Its behavior when data is missing, ambiguous, stale, or inconsistent.
The audit event, approval requirement, and rollback path for a write.
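The contract items above can be held in a structure that is validated before a tool is registered. The sketch below is illustrative: the `ToolContract` fields and the rule that write tools must state an audit event, an approval requirement, and a rollback path are assumptions drawn from the list, not a specific framework's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToolContract:
    name: str
    business_purpose: str
    read_scopes: Tuple[str, ...]
    write_scopes: Tuple[str, ...] = ()
    preconditions: Tuple[str, ...] = ()
    returned_evidence: Tuple[str, ...] = ()    # evidence and identifiers returned
    on_bad_data: str = ""                      # behavior for missing, ambiguous, stale, or inconsistent data
    audit_event: Optional[str] = None
    approval_requirement: Optional[str] = None
    rollback_path: Optional[str] = None


def contract_gaps(contract: ToolContract) -> List[str]:
    """Return the contract items that are still undefined."""
    gaps = []
    if not contract.business_purpose:
        gaps.append("business_purpose")
    if not contract.returned_evidence:
        gaps.append("returned_evidence")
    if not contract.on_bad_data:
        gaps.append("on_bad_data")
    if contract.write_scopes:
        # Write tools must state how they are audited, approved, and reversed.
        for item in ("audit_event", "approval_requirement", "rollback_path"):
            if not getattr(contract, item):
                gaps.append(item)
    return gaps
```

A registration step could refuse any tool for which `contract_gaps` is non-empty, so the model never gains access to a tool with an undefined contract.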

Evaluate build versus buy beyond the demonstration

A proof of concept establishes that a model can produce a plausible answer with your data. It does not establish that the answer is grounded, that the proposed action is safe, or that the system will behave consistently as configurations change.

For a build decision, include retrieval tuning, permission design, tenant isolation, tool maintenance, skill development, evaluation data, observability, proposal interfaces, audit history, rollback behavior, and on-call ownership. Also ask who will update the agent when a support object, metric definition, product policy, or API changes. If these responsibilities do not have durable owners, the internal agent will age like any other unsupported operations system.

For a buy decision, ask the vendor to demonstrate your difficult cases rather than its preferred prompts. Use a conversation with conflicting evidence, an unused custom attribute, an outdated localized article, a misconfigured rule, and a proposed write with a wide blast radius. Inspect the evidence, tool trace, permissions, diff, test result, and audit record. The quality of the generated prose is one of the least informative parts of that evaluation.

Put a proposal boundary around every material action

Moving from analysis to live changes is a different class of production problem. A wrong summary wastes time. A wrong configuration can degrade customer outcomes across every conversation that matches it. An incorrect outbound message cannot be recalled after customers have read it.

Give the agent autonomy according to consequence, not according to how confident its language sounds:

Read: Search content, inspect conversations, calculate approved metrics, and assemble evidence. Run these tasks autonomously within access controls and log every operation.
Recommend: Explain a root cause or rank an opportunity. Attach the underlying conversations, segments, fields, and assumptions so a person can challenge the conclusion.
Prepare: Draft an article, procedure, rule, connector configuration, customer response, or queue. Save it as a proposal with no production effect.
Change: Publish, configure, send, or otherwise alter the live operation only after the required reviewer sees the exact scope and explicitly approves it.
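The four tiers can be sketched as an ordered enumeration with a single check at the point of execution. The names and the rule that an approval must match the exact proposed scope are assumptions for illustration.

```python
from enum import IntEnum
from typing import FrozenSet, Optional


class Autonomy(IntEnum):
    READ = 1
    RECOMMEND = 2
    PREPARE = 3
    CHANGE = 4


def may_execute(
    level: Autonomy,
    proposed_scope: FrozenSet[str],
    approved_scope: Optional[FrozenSet[str]] = None,
    approver: Optional[str] = None,
) -> bool:
    """Allow read, recommend, and prepare work without approval.

    A change runs only when a named approver has approved exactly the
    scope that is about to be altered.
    """
    if level < Autonomy.CHANGE:
        return True
    return bool(approver) and approved_scope is not None and approved_scope == proposed_scope
```

Comparing scopes for equality means an approval granted for one set of objects cannot be reused after the proposal grows, which is the situation a reviewer is least likely to notice in prose.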

A proposal is a structured change object, not a paragraph asking for trust. Production-grade operator systems can present reviewable diffs before applying changes, allowing the reviewer to accept, reject, or refine the work. The same principle should govern any operator implementation.

Your review screen should answer six questions without forcing the approver into another tool:

What object will change?
What exact fields, passages, rules, or recipients are affected?
What evidence connects the observed problem to this change?
What test ran, and which cases failed or remained untested?
Who must approve, and which permission will execute the action?
How can the change be reversed, and what cannot be reversed?
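
As a rough sketch, the six questions can become required fields on the proposal object, so an incomplete proposal is detectable before it reaches a reviewer. The Proposal class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Proposal:
    target_object: str = ""                                   # what object will change
    affected_scope: List[str] = field(default_factory=list)   # exact fields, rules, recipients
    evidence_links: List[str] = field(default_factory=list)   # problem-to-change evidence
    test_summary: str = ""                                     # what ran, what failed or was untested
    approver: str = ""
    executing_permission: str = ""
    rollback_plan: str = ""
    irreversible_effects: str = ""                             # state explicitly, even if "none"


def unanswered(proposal: Proposal) -> List[str]:
    """List the fields a reviewer would still have to look up elsewhere."""
    return [f.name for f in fields(proposal) if not getattr(proposal, f.name)]
```

A review screen could refuse to show an Approve control while unanswered() returns anything.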

Customer outreach needs the strictest treatment because sending is effectively irreversible. Do not approve a batch from a conversational summary that hides the audience. The safe alternative is a preview containing the resolved customer list, inclusion logic, exclusions, exact message variants, delivery channel, and approver. Start by allowing the agent to prepare that package while a person performs the send.

Simulation also needs a visible place in the proposal. If the agent modifies an automation procedure, show which representative conversations were tested, the expected outcome for each, the observed outcome, and why any mismatch occurred. An overall pass label is not enough to reveal an important edge case.
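
One hedged way to keep an overall pass label from hiding an edge case is to store results per conversation and flag any mismatch that lacks an explanation. The SimulatedCase structure below is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimulatedCase:
    conversation_id: str
    expected: str
    observed: str
    mismatch_reason: str = ""

    @property
    def matched(self) -> bool:
        return self.expected == self.observed


def needs_explanation(cases: List[SimulatedCase]) -> List[str]:
    """Return IDs of mismatched cases that still lack a stated reason."""
    return [c.conversation_id for c in cases if not c.matched and not c.mismatch_reason]
```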

Human approval is not a permanent substitute for system quality. If reviewers routinely accept proposals without inspecting them, the control has become ceremonial. Track corrections, rejections, rollbacks, and the evidence reviewers open. Use those signals to improve the relevant retrieval rule, tool, skill, or interface.

Roll out workflows in increasing order of consequence

Choose the first workflow by its operating characteristics. A strong starting candidate recurs frequently, consumes expert attention, has accessible evidence, produces a clear artifact, and has a named reviewer. It should also allow the agent to be useful before it receives broad write permission.

A practical rollout sequence looks like this:

Recurring operations analyst. Give the agent one standing question, such as what changed in escalations or automation performance. Define the metric, comparison period, relevant segments, evidence requirements, and report destination. Require links to representative conversations and allow the conclusion that no action is warranted. Compare its reasoning with an experienced operator’s review until the failure modes are understood.
Knowledge steward. Feed it a release brief or policy change. Ask it to find affected help content, identify missing coverage, and prepare article diffs in the required voice and format. Include localized variants where they exist. The reviewer should validate product behavior, instructions, links, policy language, and whether the proposed set of pages is complete before publishing.
Automation maintainer. Start with known failed conversations. Ask the agent to distinguish a content gap from a rule, procedure, guidance, or connector problem; prepare the smallest correction; define triggers and edge cases; and simulate the result. Do not grant live configuration access until the tool trace and tests make the diagnosis reproducible.
Human-operations coordinator. Use the agent to assemble an incident audience, draft targeted responses, prepare coaching evidence, or prioritize a rep’s queue. These workflows can save substantial coordination time, but they touch customer communication and employee decisions. Begin in preparation mode, expose the selection logic, and expand autonomy only after identity, permission, review, and audit controls have been exercised.

This sequence is a risk ordering, not a universal maturity model. A read-only weekly analysis is easier to inspect and reverse than an outbound incident campaign. A knowledge proposal has a reviewable artifact. A live automation change affects future conversations, while customer communication may create an immediate and irreversible consequence. Move forward when the evidence and controls for the next class of action are ready, not merely because the previous feature launched.

Measure the completed loop, not chat activity

Prompt counts and conversation volume tell you that people opened the product. They do not tell you that customer operations improved. Build the scorecard around the operational loop:

Diagnostic quality: Whether the proposed root cause survives expert review, whether its evidence supports the conclusion, and how often factual correction is required.
Operational throughput: Time from a detected signal to a reviewed proposal and from an approved proposal to a verified change.
Artifact quality: Acceptance, revision, rejection, and rollback patterns for knowledge, automation, configuration, and communication proposals.
Customer outcome: Resolution, escalation, repeat contact, and sentiment for the affected topic after the change, interpreted alongside volume and case mix.
Safety: Permission denials, attempted out-of-scope actions, failed simulations, unauthorized writes, rollbacks, and missing audit events.
Human leverage: Expert time spent collecting evidence, recreating context, drafting the artifact, and reviewing the final proposal.

Do not make automation rate the only goal. A higher rate can coexist with poor resolutions or avoidable escalations. Treat it as one diagnostic measure and pair it with customer outcomes, correction rates, and topic-level regressions.

Create an evaluation set from real operating conditions: known content gaps, misconfigured rules, legitimate escalations, sparse attributes, conflicting evidence, localized content, and incidents with precise audience criteria. Give each case an expected outcome, required evidence, allowed tools, and forbidden action. Re-run the set when the model, retrieval system, tool, skill, permissions, or support configuration changes.
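
A minimal sketch of one evaluation case and its check might look like the following. EvalCase, AgentRun, and the failure labels are illustrative assumptions, not a standard harness.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    expected_outcome: str
    required_evidence: FrozenSet[str]
    allowed_tools: FrozenSet[str]
    forbidden_action: str


@dataclass(frozen=True)
class AgentRun:
    outcome: str
    evidence_cited: FrozenSet[str]
    tools_used: FrozenSet[str]
    actions_taken: FrozenSet[str]


def failures(case: EvalCase, run: AgentRun) -> List[str]:
    """Compare one agent run against one case; an empty list means the case passed."""
    problems = []
    if run.outcome != case.expected_outcome:
        problems.append("wrong outcome")
    if not case.required_evidence <= run.evidence_cited:
        problems.append("missing evidence")
    if not run.tools_used <= case.allowed_tools:
        problems.append("disallowed tool")
    if case.forbidden_action in run.actions_taken:
        problems.append("forbidden action")
    return problems
```

Because a run can reach the expected outcome while still using a disallowed tool or skipping required evidence, the check reports each failure type separately.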

Scheduled work is where the leverage begins to compound. An operator can run recurring analysis and deliver the resulting report without waiting for a manager to remember the question. Keep an owner on every scheduled job, however. That owner should know where failures appear, when the task last completed, which data it used, and how to pause it.

Key takeaways

An operator agent improves the system around customer conversations; it is not simply another customer-facing bot.
The product boundary should cover observation, diagnosis, proposal, verification, approval, action, and monitoring.
Reliable behavior comes from grounded retrieval, attribute awareness, bounded tools, encoded domain skills, and structured review surfaces.
Grant autonomy by consequence: broad freedom to inspect approved data, tighter controls to prepare changes, and explicit approval for production writes.
Roll out recurring analysis before knowledge changes, automation configuration, and customer communication unless your own risk profile clearly supports another order.
Measure supported diagnoses, accepted artifacts, customer outcomes, human time, and safety events rather than prompt volume alone.

Your next step is to choose one recurring operational question and write down the evidence it requires, the artifact a good answer should produce, the person who will review it, and the actions the agent must not take. Once that loop works reliably, add one downstream proposal. That is a much stronger foundation for an operator agent than beginning with an open-ended prompt and a broad API key.

May 14, 2026

Our Operating Model Is the Product—Why We Built Product Partners to Accelerate Outcomes

I’ve learned that customers don’t just buy features—they buy the way we discover, decide, build, ship, and support. In other words, the operating model is the product. That realization has shaped how my team and I at HighLevel translate product strategy into tangible, repeatable outcomes that show up in quality, reliability, onboarding, and consultative support every single day.

We created Product Partners to codify that operating model and scale it with discipline. It’s a blueprint and operating rhythm that unifies product strategy with go-to-market strategy, customer success, and solutions engineering—so empowered product teams can move faster without sacrificing clarity, governance, or customer trust.

First, we anchored on continuous discovery. Product trios work shoulder-to-shoulder with customer-facing teams to run customer interviews, journey mapping, and A/B testing, then validate insights with session replay and behavioral analytics. We use driver trees and opportunity solution trees to connect problems to outcomes, ensuring prioritization is evidence-based and aligned to product-market fit—not just output.

Second, we elevated delivery excellence. Our practices emphasize CI/CD, feature flags, observability, SRE-informed incident management, and DORA metrics to shorten feedback loops while raising the bar on stability. Privacy-by-design, data governance, and regulatory compliance are built into our workflows, and we make deliberate build-versus-buy decisions to protect platform scalability and long-term velocity.

Third, we integrated go-to-market alignment from day one. Solutions engineering and customer success shape requirements early, so launches include in-app guides, product tours, onboarding paths, and consultative support that accelerate user activation. We write OKRs around outcomes rather than output and review them in our stakeholder management rituals, ensuring sales-led and product-led growth motions reinforce each other instead of competing for focus.

Finally, we closed the loop with a unified analytics platform. Activation, retention analysis, and Net Revenue Retention (NRR) sit alongside qualitative signals from customer interviews and support. This single source of truth helps us refine product positioning, sharpen value propositions, and improve roadmapping and sprint planning with clear, testable hypotheses.

What does this mean for our partners and customers? Faster time-to-value, fewer handoffs, clearer expectations, and a shared lens on the metrics that matter. Product Partners isn’t a side program; it’s how we operationalize trust—through transparency, consistent rituals, and a bias toward learning that compounds.

If this resonates, you’ll feel it in how we discover, build, and support together. I’ll continue to share our playbooks—covering continuous discovery, onboarding, and outcome-based planning—so we can keep raising the standard for product management leadership and product-led growth, one operating rhythm at a time.

Inspired by this post on Product School.

May 14, 2026

AI-Enabled Enzymatic Recycling: A Product Leader’s Playbook

You have an AI-enabled materials proposal in front of you, a promising set of enzyme candidates, and a difficult decision: fund another round of discovery or start building toward industrial scale. The candidate sequences may be impressive, but they are not yet the product.

Your decision should turn on whether the full system can repeatedly transform a defined waste stream into usable monomers at an economically viable cost. That framing connects model performance, laboratory evidence, process engineering, and commercial reality before an exciting demonstration becomes a stranded pilot.

Define the product around recovered monomers

Only about 10% of manufactured plastic gets recycled. That low rate is not merely a sorting or consumer-behavior problem. Traditional recycling commonly shortens polymer chains instead of restoring their original molecular building blocks, so the resulting material can lose quality and move toward downcycling.

Enzymatic recycling changes the intended output. An engineered enzyme can deconstruct a polymer into its original monomers, which can then become inputs for new, high-quality plastic. The difference is fundamental: the product is not processed waste or a smaller plastic fragment. It is recovered molecular feedstock.

This distinction gives you a better product boundary. A generated protein sequence is a feature. An enzyme that shows activity in one assay is a technical result. The product is a repeatable monomer-recovery system with a defined input, output, operating envelope, and cost structure.

Before approving a roadmap, require the team to define five contracts:

Input contract: Which polymer, packaging format, mixture, and contamination profile will the process accept? “Mixed plastic” is not a specification. Name the included materials and the variation the system must tolerate.
Transformation contract: Which polymer bonds must the enzyme break, and what conversion and selectivity must the reaction demonstrate?
Output contract: Which monomers will be recovered, what downstream use must they support, and how will the team determine that the output is suitable for that use?
Operating contract: What reaction conditions, throughput, energy consumption, and process controls must hold outside a small laboratory assay?
Economic contract: Which cost per ton must the integrated process approach, and which assumptions currently separate measured economics from projected economics?

Selectivity is especially important. An enzyme can target a particular plastic within a mixed waste stream, potentially reducing the need to treat every input as chemically identical. But selectivity does not make an undefined waste stream manageable. The process still needs to know which target material is present, whether the enzyme can reach it, and how the desired products will be recovered.

Write the product brief in one sentence: For this defined feedstock, transform this polymer into these monomers, within this operating envelope, output specification, and cost boundary. If a number is unknown, leave a visible blank and assign an experiment to fill it. Do not hide the uncertainty inside a broad ambition such as “make plastic circular.”
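
The one-sentence brief with visible blanks can be sketched as a small data structure. The ProductBrief class, its field names, and the "[UNKNOWN]" marker are illustrative assumptions; the example values in the usage note are placeholders, not real specifications.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProductBrief:
    feedstock: Optional[str] = None
    polymer: Optional[str] = None
    monomers: Optional[str] = None
    operating_envelope: Optional[str] = None
    output_specification: Optional[str] = None
    cost_boundary: Optional[str] = None

    def sentence(self) -> str:
        """Render the brief, leaving a visible blank for every unknown."""
        def show(value: Optional[str]) -> str:
            return value if value is not None else "[UNKNOWN]"
        return (
            f"For {show(self.feedstock)}, transform {show(self.polymer)} "
            f"into {show(self.monomers)}, within {show(self.operating_envelope)}, "
            f"{show(self.output_specification)}, and {show(self.cost_boundary)}."
        )

    def open_questions(self) -> List[str]:
        """Each returned field name should have an experiment assigned to it."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

For example, ProductBrief(feedstock="feedstock A", polymer="polymer P") renders four visible blanks, and open_questions() names the four fields that still need an experiment.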

Build the AI as a closed learning system

AI changes the economics of searching enzyme-design space. Protein language models can generate candidates, multi-step agents can coordinate specialized tasks, and computational evaluations can eliminate weak options before scarce laboratory capacity is used. Advances in protein structure prediction have expanded what can be explored, but prediction does not remove the need for physical validation.

The useful architecture is therefore not a model that emits sequences. It is a closed loop in which every physical result makes the next design round better. Rhea’s Factory combines protein language models, an agentic pipeline, domain constraints, and proprietary wet-lab feedback. The product lesson is broader than any one implementation: generation, evaluation, experimentation, and learning need to operate as one traceable system.

Encode the objective. Convert the product contract into machine-readable constraints: target polymer, desired products, acceptable operating conditions, and the metrics that will decide whether a candidate advances.
Generate candidates. Explore multiple plausible designs rather than optimizing immediately around the first promising family.
Apply computational gates. Reject candidates that violate explicit constraints, preserve the reasons for rejection, and rank the remaining candidates for laboratory use.
Run controlled wet-lab experiments. Test candidates under recorded conditions and capture successes, failures, and inconclusive results.
Update domain predictions. Use the measured outcomes to improve ranking and candidate selection for the next round.
Feed process evidence back into discovery. When a candidate struggles under reactor or feedstock conditions, turn that failure into a new design constraint instead of treating it as a separate engineering problem.
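
The computational-gate step in this loop can be sketched as follows. This is a simplified illustration under stated assumptions: the Candidate fields, the sequence-length constraint, and the scoring are hypothetical stand-ins for whatever constraints a real pipeline encodes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    length: int
    predicted_score: float  # computational estimate, never a lab observation


# A gate returns None on pass, or a human-readable rejection reason.
Gate = Callable[[Candidate], Optional[str]]


def max_length_gate(limit: int) -> Gate:
    def gate(c: Candidate) -> Optional[str]:
        return f"length {c.length} exceeds {limit}" if c.length > limit else None
    return gate


def apply_gates(
    candidates: List[Candidate], gates: List[Gate]
) -> Tuple[List[Candidate], Dict[str, List[str]]]:
    """Reject violators, keep the reasons, and rank survivors for lab use."""
    passed: List[Candidate] = []
    rejected: Dict[str, List[str]] = {}
    for c in candidates:
        reasons = []
        for gate in gates:
            reason = gate(c)
            if reason is not None:
                reasons.append(reason)
        if reasons:
            rejected[c.candidate_id] = reasons
        else:
            passed.append(c)
    passed.sort(key=lambda c: c.predicted_score, reverse=True)
    return passed, rejected
```

Returning the rejection reasons alongside the ranked survivors keeps the record of why a candidate never reached the lab.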

Agentic AI is valuable here because the workflow is multi-step, not because an agent should make every decision autonomously. At each handoff, define the required input, expected output, validator, and failure behavior. A generation step should not advance an incomplete candidate. A computational score should not be presented as a laboratory observation. A promising assay should not silently become a scale claim.

Exploration also needs an explicit lane. Higher model-sampling temperatures can produce more unusual enzyme candidates and reach beyond the safest local variations. Controlled model “hallucination” can be useful during candidate exploration when downstream guardrails prevent novelty from being mistaken for evidence.

Separate the candidate portfolio into three buckets: improvements near known winners, adjacent designs that test a clear hypothesis, and high-variance exploration. Give each bucket a deliberate laboratory budget. Raise sampling temperature only in the exploratory lane, and never allow generated assay values, reaction outcomes, or scale results into the measured-data record.
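
A deliberate budget per bucket can be made explicit in a few lines. The percentages and temperature values below are invented placeholders, not recommendations; the sketch only shows that the split is written down and that the higher temperature is confined to the exploratory lane.

```python
from typing import Dict

# Hypothetical split of laboratory capacity, in percent.
LANE_PERCENT: Dict[str, int] = {
    "near_winner": 50,
    "adjacent_hypothesis": 30,
    "exploration": 20,
}

BASE_TEMPERATURE = 0.8         # placeholder value
EXPLORATION_TEMPERATURE = 1.3  # placeholder value


def allocate_slots(total_slots: int, percents: Dict[str, int]) -> Dict[str, int]:
    """Split lab slots by percent; leftovers go to the largest remainders."""
    if sum(percents.values()) != 100:
        raise ValueError("lane percentages must sum to 100")
    slots = {lane: total_slots * pct // 100 for lane, pct in percents.items()}
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(
        percents, key=lambda lane: (total_slots * percents[lane]) % 100, reverse=True
    )
    for lane in by_remainder[:leftover]:
        slots[lane] += 1
    return slots


def sampling_temperature(lane: str) -> float:
    """Raise temperature only in the exploratory lane."""
    return EXPLORATION_TEMPERATURE if lane == "exploration" else BASE_TEMPERATURE
```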

The durable advantage sits in the feedback data. In a narrow, high-signal domain, even hundreds of relevant proprietary laboratory observations can support a useful domain prediction model. That is not a general claim that small datasets are always sufficient. It means contextual quality can matter more than indiscriminate volume when the problem, assay, and outcomes are tightly defined.

For every experiment, preserve enough context to make the result reusable:

The enzyme identity, sequence version, and design lineage.
The target polymer, material format, mixture, and relevant contamination profile.
The assay and protocol version used for the test.
The reaction conditions and duration.
The measured conversion, selectivity, yield, and uncertainty available from the experiment.
The full result, including failure, no-result, and inconclusive outcomes.
The relationship between the candidate, computational evaluations, physical test, and model or data release.
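
The list above maps naturally onto a record schema. The following is a minimal sketch with hypothetical field names; a real schema would carry structured reaction conditions and units rather than free text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_RESULT = "no_result"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class ExperimentRecord:
    enzyme_id: str
    sequence_version: str
    design_lineage: str
    target_polymer: str
    material_format: str
    contamination_profile: str
    protocol_version: str
    reaction_conditions: str
    duration_hours: float
    outcome: Outcome
    computational_eval_id: str  # links the physical test to its in-silico scores
    model_release: str
    conversion: Optional[float] = None      # fraction in [0, 1]; None if not measured
    selectivity: Optional[float] = None
    yield_fraction: Optional[float] = None
    uncertainty: Optional[float] = None


def validation_errors(record: ExperimentRecord) -> List[str]:
    """Check context completeness; failed runs without measurements are still valid."""
    errors = []
    for name, value in vars(record).items():
        if isinstance(value, str) and not value.strip():
            errors.append(f"{name} is empty")
    for name in ("conversion", "selectivity", "yield_fraction"):
        value = getattr(record, name)
        if value is not None and not 0.0 <= value <= 1.0:
            errors.append(f"{name} outside [0, 1]")
    return errors
```

The validator accepts a failed or inconclusive run with no measurements, because those outcomes belong in the record too. It rejects only missing context or impossible values.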

A spreadsheet of winning sequences is not a data moat. A traceable record of why candidates were proposed, how they were tested, what failed, and how each result changed the next decision can become one.

Use stage gates that end in physical evidence

AI product teams often gravitate toward a model leaderboard because it creates a clean sense of progress. Enzymatic recycling does not have one adequate master score. A candidate can look structurally plausible and fail in the lab. It can perform in a controlled assay and miss the required throughput. It can convert the polymer and still lose economically once the rest of the process is counted.

Use a hierarchy of evidence that moves from design compliance to laboratory performance, operating fit, and scale economics:

Gate	Decision question	Required evidence	Red flag
Design compliance	Does the candidate satisfy the stated target and pipeline constraints?	Deterministic checks, recorded constraint evaluations, and candidate provenance	A candidate advances mainly because it appears novel
Wet-lab performance	Does the enzyme convert the target with the required selectivity under defined conditions?	Repeatable measured observations, including negative and inconclusive runs	Only the best run is retained or shared
Operating fit	Does useful performance hold within the intended controlled, low-temperature process and throughput requirements?	Process measurements tied to reaction conditions, conversion, yield, throughput, and energy use	Activity is reported without the process context needed to interpret it
Scale economics	Can the integrated system move toward cost parity with inexpensive oil-based plastic?	A cost and energy model tied to measured inputs, with assumptions and sensitivities exposed	Commercial viability is inferred from enzyme activity alone

Set pass, hold, and stop conditions before seeing the result. Otherwise, an interesting candidate will repeatedly earn one more experiment while the commercial requirement drifts. Relative improvement is useful for learning, but an enzyme that is twice as good as an unusable baseline may still be unusable. Every relative metric should sit beside the absolute requirement it is meant to approach.
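
A short sketch shows why the relative number should sit beside the absolute one. The GateThreshold structure and every numeric value here are invented for illustration and are not measured thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThreshold:
    metric: str
    pass_at: float     # absolute requirement, fixed before the experiment
    stop_below: float  # below this, the program stops or is redesigned


def decide(threshold: GateThreshold, measured: float) -> str:
    """Return 'pass', 'hold', or 'stop' from the pre-registered thresholds."""
    if measured >= threshold.pass_at:
        return "pass"
    if measured < threshold.stop_below:
        return "stop"
    return "hold"


def relative_improvement(measured: float, baseline: float) -> float:
    """Ratio to baseline; useful for learning, not for gating."""
    return measured / baseline
```

With a hypothetical threshold of GateThreshold("conversion", pass_at=0.90, stop_below=0.30), a candidate that doubles a 0.10 baseline to 0.20 shows a relative improvement of 2.0 and still returns "stop".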

Keep conversion, selectivity, yield, throughput, and energy per ton separate. Combining them too early into a single score can conceal the actual tradeoff. A team should be able to show why it is advancing a faster candidate with lower selectivity, or a more selective candidate with a different operating burden, without claiming that the candidates are equivalent.

Three common metric substitutions deserve direct scrutiny:

Low reaction temperature is not automatically low total energy. Count the energy demands of the complete process rather than the enzyme reaction in isolation.
Polymer conversion is not automatically usable monomer recovery. Measure whether the desired output can be recovered to the specification required downstream.
Bench performance is not automatically scaled performance. Treat increasing process scale as a new evidence gate, not a routine deployment step.

My rule is simple: model output can earn laboratory time; only measured process evidence can earn scale capital.

Plan the roadmap backward from cost parity

The commercial benchmark is unforgiving. Enzymatic recycling ultimately has to compete with inexpensive oil-based plastic production. A greener reaction that cannot approach a viable delivered cost will remain dependent on special conditions rather than becoming a broadly adopted circular process.

Build the economic model while discovery is still underway. At minimum, separate these cost lines:

Feedstock acquisition, sorting, and rejected material.
Preparation required before the enzyme can act on the target polymer.
Enzyme production, delivery, useful lifetime, and replacement.
Reactor capacity, reaction time, process control, and energy.
Monomer recovery and purification.
Waste handling, downtime, and variability in plant utilization.

Do not wait for perfect values. Use ranges, label each input as measured or assumed, and run sensitivity analysis. The purpose is to identify which uncertain variable can kill the business case. If enzyme lifetime dominates cost, another candidate-generation run may be rational. If purification dominates, generating thousands of additional sequences may be a distraction from the real constraint.
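
A one-at-a-time sensitivity pass over labeled ranges can be sketched in a few lines. The additive cost model is a deliberate simplification, and every number is an invented placeholder rather than data; real cost lines interact, so treat this only as the shape of the analysis.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CostInput:
    low: float
    base: float
    high: float
    measured: bool  # False means the range is still an assumption


def total_cost(values: Dict[str, float]) -> float:
    """Simplified model: cost per ton is the sum of the cost lines."""
    return sum(values.values())


def sensitivity(inputs: Dict[str, CostInput]) -> List[Tuple[str, float, bool]]:
    """Swing in total cost when one input moves low to high, largest first."""
    base = {name: inp.base for name, inp in inputs.items()}
    rows = []
    for name, inp in inputs.items():
        low_total = total_cost({**base, name: inp.low})
        high_total = total_cost({**base, name: inp.high})
        rows.append((name, high_total - low_total, inp.measured))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder ranges in arbitrary cost units per ton.
example_inputs = {
    "enzyme": CostInput(100.0, 200.0, 500.0, measured=False),
    "purification": CostInput(150.0, 180.0, 220.0, measured=True),
    "energy": CostInput(50.0, 60.0, 70.0, measured=True),
}
```

In this invented example the enzyme line has the widest swing and is still an assumption, which is the pattern that would justify spending the next experiment on enzyme lifetime rather than on more sequences.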

Pair every scientific milestone with an industrial question:

Discovery gate: Are activity and selectivity reproducible enough to justify process work?
Process gate: Does the candidate perform inside the intended operating envelope rather than only under a convenient assay condition?
Feedstock gate: Does performance survive representative material formats and mixtures, including difficult packaging such as clamshells?
Demonstration gate: Can the system sustain the required material flow, output quality, and energy profile at a scale that tests the major engineering assumptions?
Commercial gate: Does the cost case remain credible when feedstock composition, utilization, throughput, and other sensitive inputs move away from the preferred case?

A planned 5,000-ton demonstration plant in California illustrates why demonstration capacity belongs on the product roadmap. A plant is not simply a larger laboratory. It tests whether biology, equipment, controls, feedstock variability, and recovery operations behave as an integrated product.

Before committing meaningful scale capital, ask six kill questions:

Which assumption has the largest effect on delivered cost per ton?
Which inputs are measured, and which still come from a design estimate?
At what physical scale was each important input measured?
What fails first when the feedstock mix changes?
If enzyme performance improves as planned, which downstream step becomes the bottleneck?
Which observed result will stop, narrow, or materially redesign the program?

Expansion into additional plastics should follow the same discipline. Enzyme selectivity creates a plausible path toward enzyme blends for mixed streams, and new plastic types and mixed-plastic blends remain important development directions. Treat each added polymer as a new product vertical with its own input contract, assays, process interactions, recovery requirements, and economics. A new enzyme is not automatically a low-cost extension of the first process.

Key takeaways for your next roadmap review

Define success as repeatable recovery of specified monomers, not the generation of novel enzyme sequences.
Run discovery as a closed loop connecting product constraints, AI generation, computational gates, wet-lab measurements, and process feedback.
Treat proprietary experimental context—including failures—as the data asset; candidate count alone is not a defensible moat.
Use separate gates for design compliance, laboratory performance, operating fit, and scale economics.
Work backward from cost parity and direct the next experiment toward the assumption that most threatens the integrated business case.

For your next review, ask the team to bring one page containing the input and output contracts, a diagram of the learning loop, the current stage-gate thresholds, the experimental data schema, and a cost sensitivity model with measured and assumed inputs clearly separated. Every roadmap item should change one of those artifacts or produce evidence for a named decision.

If the team cannot fill those fields yet, that is the immediate product work. The first defensible milestone is one traceable loop from a defined industrial problem through candidate generation, laboratory measurement, and an updated cost model. Repeat that loop with increasing realism before increasing capital exposure. That is how you determine whether programmable biology is becoming an industrial recycling product rather than remaining an impressive AI demonstration.

References

Product Talk — How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

May 14, 2026

No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.

To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.

To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.

Inspired by this post on Amplitude – Best Practices.

May 13, 2026

Knowledge Management for AI Sales Agents: A Practical System

Your AI sales agent answers the pricing question, then recommends the wrong plan. It identifies a promising buyer, then sends the conversation to the wrong queue. If the underlying facts, decision rules, or routing policy are missing, another prompt adjustment cannot fix the problem.

You need a knowledge operating system, not a larger folder of sales collateral. The goal is to give the agent the smallest reliable path from a buyer’s question to an accurate answer, an appropriate recommendation, useful qualification, and the correct next step.

Key takeaways

Separate product facts, decision guidance, and execution policy. Each solves a different part of the sales conversation.
Turn long documents into focused, approved knowledge records with an owner, scope, effective date, and explicit boundaries.
Launch a complete sales motion for a narrow set of buyer intents before trying to document everything.
Test recommendations, qualification, routing, and escalation behavior, not just whether the agent can repeat a correct sentence.
Convert unanswered, incorrect, and disengaged conversations into a managed improvement queue.

Design for answering, recommending, qualifying, and routing

A conventional knowledge base helps someone find information. An AI sales agent has a harder job: it must interpret a buyer’s situation and decide what to do with the information it retrieves.

A language model does not inherently know your current plans, qualification criteria, commercial boundaries, or customer-specific use cases. That context is unique to your business and must be made explicit. Fluency cannot compensate for a missing policy.

I find it useful to divide sales knowledge into three layers:

Knowledge layer	What it contains	What the agent should do with it	Typical failure when it is missing
Product facts	Pricing, plan structure, capabilities, limitations, availability, and supported use cases	Give a direct, accurate answer	The agent guesses, gives a vague response, or repeats obsolete information
Decision guidance	Plan-fit logic, relevant constraints, case studies, approved comparisons, and the context behind product facts	Explain which option fits and why	The answer is technically correct but does not help the buyer decide
Execution policy	Qualification questions, required fields, routing conditions, escalation rules, and actions the agent may take	Advance the conversation within defined authority	The agent collects irrelevant details, makes an unsupported commitment, or routes the buyer incorrectly

This distinction exposes why uploading product pages is not enough. Public pages and product documentation are useful starting points, but a capable inbound motion also needs FAQs, pricing explanations, case studies, competitive material, qualification criteria, and internal sales guidance.

Facts answer, “What does the product do?” Decision guidance answers, “Is this appropriate for my situation?” Execution policy answers, “What should happen next?” Audit your knowledge against all three questions.

Set a hard boundary around commercial exceptions. The agent should not infer an unlisted discount, invent a contractual commitment, or turn an internal hypothesis into a buyer-facing claim. It should state the approved terms, gather the information required by policy, and route the exception to an authorized person. A plausible but unauthorized promise can create financial and legal exposure.

Turn scattered documents into governed knowledge

Build sales-ready knowledge records

A long document can be correct and still be poor input for an agent. Pricing may be buried below an obsolete introduction. A feature table may omit the condition that changes plan fit. A battlecard may combine approved facts with a rep’s unverified notes.

Convert those documents into focused records. Each record should cover one buyer intent or one tightly related decision. Use a consistent template:

Buyer intent: The question or decision this record addresses, including common alternative phrasings.
Approved answer: The direct response the agent may give as written, without adding caveats.
Decision context: Why the fact matters and when it changes the recommendation.
Constraints and exceptions: What the answer does not cover, including conditions that require clarification.
Next question or action: The appropriate follow-up, qualification step, route, or escalation.
Scope: The plans, markets, customer types, channels, or agents allowed to use the record.
Evidence location: The canonical product, pricing, or policy record from which the answer was derived.
Owner and approver: The people accountable for accuracy and authorization.
Lifecycle metadata: Effective date, review status, and whether the record replaces an earlier version.

For example, a record about plan fit should not stop after naming a plan. It should state the relevant requirement, identify the condition that changes the answer, give the agent an approved follow-up question, and define where to route a buyer whose situation falls outside the standard policy. The recommendation then becomes reproducible rather than improvised.
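One way to make the template concrete is to hold each record in a small typed structure and refuse to use it until it is complete and approved. The sketch below is a minimal illustration, not a prescribed schema: the class name, the field names, the "draft"/"approved" status values, and every sample value are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class KnowledgeRecord:
    # Field names mirror the template above; the schema itself is hypothetical.
    buyer_intent: str
    approved_answer: str
    decision_context: str
    constraints_and_exceptions: str
    next_action: str
    scope: List[str]
    evidence_location: str
    owner: str
    approver: str
    effective_date: Optional[date] = None
    review_status: str = "draft"      # assumed values: "draft" or "approved"
    replaces: Optional[str] = None    # id of the record this one supersedes
    alternative_phrasings: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of required template fields that are still empty."""
        required = [
            "buyer_intent", "approved_answer", "decision_context",
            "constraints_and_exceptions", "next_action", "scope",
            "evidence_location", "owner", "approver", "effective_date",
        ]
        return [name for name in required if not getattr(self, name)]

    def is_usable(self) -> bool:
        """Usable only when approved and every required field is filled."""
        return self.review_status == "approved" and not self.missing_fields()


# Illustrative values only; none of this describes a real product or policy.
record = KnowledgeRecord(
    buyer_intent="Which plan fits a team that needs single sign-on?",
    approved_answer="<approved wording from the canonical pricing record>",
    decision_context="A single sign-on requirement changes the plan fit.",
    constraints_and_exceptions="Does not cover custom contracts.",
    next_action="Ask how many users need access, then route per policy.",
    scope=["web chat"],
    evidence_location="<canonical pricing record>",
    owner="<named knowledge owner>",
    approver="",                      # approver not yet assigned
    effective_date=date(2026, 1, 1),
    review_status="approved",
    alternative_phrasings=["Do you support SSO?"],
)

print(record.missing_fields())  # ['approver']
print(record.is_usable())       # False
```

The record above is marked approved but has no named approver, so the check keeps it out of the agent's usable knowledge until the gap is closed.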

Keep buyer language in the record. Prospects rarely use your internal taxonomy, and the same intent may appear as a product question, an outcome question, or a comparison. Alternative phrasing helps the retrieval layer recognize that these expressions belong to the same approved answer.

Create an authority hierarchy

A centralized repository is valuable only if the agent can distinguish current authority from historical residue. Define the hierarchy before connecting more content:

Designate one canonical record for each product fact, commercial rule, or routing policy.
Make approved sales explanations point back to that record rather than becoming independent versions of the truth.
Treat scripts and examples as phrasing aids unless they are explicitly approved to carry facts.
Keep drafts, call notes, chat fragments, and retired material outside the agent’s usable knowledge until they are reviewed.

Do not let the agent reconcile conflicting records by choosing the newest upload or blending the language. When two approved items disagree, the safe behavior is to withhold the disputed claim, follow the defined escalation path, and send the conflict to its owner.
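The conflict rule can be expressed as a small guard that runs before the agent states a claim. This is a sketch under assumptions: the `ApprovedClaim` structure, the `resolve` function, the action names, and the sample values are all hypothetical and stand in for whatever your knowledge store provides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ApprovedClaim:
    record_id: str
    topic: str
    value: str
    owner: str


def resolve(topic: str, claims: List[ApprovedClaim]) -> Dict[str, object]:
    """Answer only when every approved record on the topic agrees."""
    matching = [c for c in claims if c.topic == topic]
    if not matching:
        return {"action": "escalate", "reason": "no approved record"}
    if len({c.value for c in matching}) == 1:
        return {"action": "answer", "value": matching[0].value}
    # Disagreement: never pick the newest upload or blend the wording.
    return {
        "action": "withhold_and_escalate",
        "conflicting_records": sorted(c.record_id for c in matching),
        "notify_owners": sorted({c.owner for c in matching}),
    }


# Made-up records for illustration.
claims = [
    ApprovedClaim("rec-1", "trial_length", "value A", "owner-1"),
    ApprovedClaim("rec-2", "trial_length", "value B", "owner-2"),
    ApprovedClaim("rec-3", "support_hours", "value C", "owner-1"),
]

print(resolve("trial_length", claims)["action"])   # withhold_and_escalate
print(resolve("support_hours", claims)["action"])  # answer
```

The function deliberately has no tie-breaking logic. Adding one would reintroduce the behavior the hierarchy is meant to prevent.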

Ownership should also be specific. A knowledge owner maintains the record. A domain approver authorizes sensitive claims. An operations owner monitors how the agent uses the record in conversations. One person may hold more than one role, but every role needs a name rather than a department-shaped placeholder.

Target knowledge by audience and action

Internal knowledge is not automatically buyer-facing knowledge. A qualification score may guide routing without being disclosed. A battlecard may help frame an approved comparison without exposing internal commentary. A security question may require an authorized answer rather than the agent’s summary of a sales note.

Mark each record as buyer-answerable, decision-only, action-only, or restricted. Then expose only the appropriate material to each agent, channel, market, and sales motion. This kind of centralization and content targeting reduces duplication while keeping internal policy separate from the words a prospect sees.
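The four labels can act as a filter between the repository and each agent. The sketch below assumes a hypothetical `Record` structure with a set of channels. It separates what the agent may quote from what it may only use internally, and restricted material appears in neither list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class Exposure(Enum):
    BUYER_ANSWERABLE = "buyer-answerable"
    DECISION_ONLY = "decision-only"
    ACTION_ONLY = "action-only"
    RESTRICTED = "restricted"


@dataclass
class Record:
    record_id: str
    exposure: Exposure
    channels: Set[str]


def quotable(records: List[Record], channel: str) -> List[str]:
    """Records whose wording may be shown to a buyer on this channel."""
    return [
        r.record_id for r in records
        if r.exposure is Exposure.BUYER_ANSWERABLE and channel in r.channels
    ]


def internal_inputs(records: List[Record], channel: str) -> List[str]:
    """Records that may guide decisions or actions but are never quoted."""
    allowed = {Exposure.DECISION_ONLY, Exposure.ACTION_ONLY}
    return [
        r.record_id for r in records
        if r.exposure in allowed and channel in r.channels
    ]


# Made-up records for illustration.
records = [
    Record("pricing-faq", Exposure.BUYER_ANSWERABLE, {"web chat", "email"}),
    Record("qualification-score", Exposure.DECISION_ONLY, {"web chat"}),
    Record("routing-rules", Exposure.ACTION_ONLY, {"web chat"}),
    Record("rep-call-notes", Exposure.RESTRICTED, {"web chat"}),
]

print(quotable(records, "web chat"))         # ['pricing-faq']
print(internal_inputs(records, "web chat"))  # ['qualification-score', 'routing-rules']
```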

Launch the smallest complete sales motion

Trying to clean every sales document before launch creates a long project with no conversational evidence. Launching with disconnected FAQs creates a different failure: the agent answers isolated questions but cannot move the buyer forward.

The better unit of scope is a complete motion for a bounded set of intents. For each selected intent, the agent needs an answer, the relevant fit logic, the next qualification question, a route or resolution, and an escalation path.

Prioritize by demand and consequence

Start with questions that appear repeatedly, delay buyers, consume sales time, signal meaningful intent, or cause material damage when answered incorrectly. Pricing, plan differences, core capabilities, common use cases, qualification, and routing are natural candidates when they dominate your actual inbound conversations.

Two simple calculations help quantify repetitive work: team time reclaimed = average response composition time × question frequency, and buyer wait avoided = number of prospects asking × average response time. These calculations are useful for prioritizing knowledge work, but neither should be presented as revenue without downstream evidence.

Add consequence to the ranking. A frequent low-risk question may save time, but an infrequent error involving price, eligibility, security, or a contractual promise may deserve earlier treatment. Frequency tells you where the volume is. Consequence tells you where control matters.
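The two calculations and the consequence adjustment can be combined into a simple ranking. The sketch below uses invented numbers and an assumed consequence scale of 1 (low) to 3 (high). Sorting by consequence first and time reclaimed second is one possible rule, not the only defensible one.

```python
from dataclasses import dataclass
from typing import List


def team_time_reclaimed(avg_composition_minutes: int, question_frequency: int) -> int:
    """Average response composition time x question frequency."""
    return avg_composition_minutes * question_frequency


def buyer_wait_avoided(prospects_asking: int, avg_response_minutes: int) -> int:
    """Number of prospects asking x average response time."""
    return prospects_asking * avg_response_minutes


@dataclass
class Intent:
    name: str
    avg_composition_minutes: int
    question_frequency: int   # questions per period, e.g. per week
    consequence: int          # assumed scale: 1 low, 2 medium, 3 high


def rank(intents: List[Intent]) -> List[Intent]:
    """Highest consequence first; ties broken by team time reclaimed."""
    return sorted(
        intents,
        key=lambda i: (
            -i.consequence,
            -team_time_reclaimed(i.avg_composition_minutes, i.question_frequency),
        ),
    )


# Invented numbers for illustration.
intents = [
    Intent("supported integrations", 6, 40, consequence=1),
    Intent("discount eligibility", 10, 5, consequence=3),
]

print(team_time_reclaimed(6, 40))            # 240
print(buyer_wait_avoided(40, 30))            # 1200
print([i.name for i in rank(intents)])       # ['discount eligibility', 'supported integrations']
```

The frequent question reclaims far more minutes, yet the rare discount question ranks first because an error there carries more consequence.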

Your initial release is ready when the selected motion has:

Approved answers for the recurring and commercially important questions in scope.
Plan-fit or use-case guidance where the buyer needs a recommendation rather than a fact.
Explicit qualification fields and follow-up questions.
Routing rules for the standard paths.
A visible no-answer and human-escalation path.
Owners and lifecycle metadata for every active record.
A representative evaluation set based on real buyer language.

Test the conversation, not the sentence

A retrieval test can show that the right paragraph was found. It cannot show that the agent asked the necessary follow-up, respected a restriction, or routed the lead correctly. Evaluate the complete interaction.

Your test set should include direct questions, paraphrases, multi-part questions, ambiguous requests, outdated assumptions, missing qualification details, requests for exceptions, and scenarios that should be escalated. For each case, check:

Is every factual claim aligned with approved knowledge?
Does the response answer the buyer’s actual question before adding detail?
Does the recommendation apply the right conditions rather than matching a keyword?
Does the agent ask only for information required by the qualification policy?
Does it avoid unsupported commitments and internal-only language?
Does the final route or escalation match the execution rule?

Record pass or fail at the behavior level and attach a reason to every failure. Do not allow a strong average score to conceal a severe pricing or policy error. High-consequence failures should block that behavior from release until the knowledge or policy is corrected.
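A release gate along these lines can be sketched in a few lines. The structure and names (`CheckResult`, `release_decision`, the behavior labels) are assumptions for illustration. The point is that the pass rate is reported next to, and never instead of, the list of blocked behaviors.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    case_id: str
    behavior: str            # e.g. "pricing", "routing", "qualification"
    passed: bool
    high_consequence: bool
    reason: str = ""         # required whenever passed is False


def release_decision(results: List[CheckResult]) -> Dict[str, object]:
    """Report the pass rate and the behaviors blocked from release."""
    for r in results:
        if not r.passed and not r.reason:
            raise ValueError(f"failure {r.case_id} has no recorded reason")
    blocked = sorted(
        {r.behavior for r in results if not r.passed and r.high_consequence}
    )
    pass_rate = sum(r.passed for r in results) / len(results)
    return {"pass_rate": pass_rate, "blocked_behaviors": blocked}


# Invented evaluation results: nine passes and one severe pricing failure.
results = [
    CheckResult(f"case-{n}", "routing", True, False) for n in range(9)
]
results.append(
    CheckResult("case-9", "pricing", False, True, reason="quoted a retired price")
)

print(release_decision(results))
# {'pass_rate': 0.9, 'blocked_behaviors': ['pricing']}
```

A 90 percent pass rate would look healthy on a dashboard, but the pricing behavior stays blocked until its failure is fixed.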

Use a controlled launch with an obvious human path. The point is to begin collecting real conversational evidence early, not to claim autonomy before the boundaries are reliable. Fast deployment and continuous iteration work when the feedback loop is designed before traffic arrives.

Run the knowledge flywheel from real conversations

Once the agent is live, conversation failures become your most useful knowledge backlog. Do not place every poor result under a generic label such as “bad answer.” Classify the mechanism so the right owner can fix it.

Coverage gap: No approved record addresses the buyer’s intent.
Retrieval failure: The right knowledge exists, but the wrong record was selected or the right one was missed.
Freshness failure: The agent used information that should have been replaced or retired.
Guidance failure: The fact was correct, but the recommendation ignored relevant context.
Qualification failure: The agent skipped a required question, collected unnecessary information, or misread the answer.
Routing failure: The collected information was correct, but the next action did not follow policy.
Boundary failure: The agent disclosed restricted material or made an unauthorized claim.
Conversation failure: The content was accurate, but the response was unclear, repetitive, or poorly sequenced.

Turn each confirmed failure into a work item containing the conversation, intent, root cause, affected knowledge record, accountable owner, proposed change, and regression test. A change is not complete when the wording is edited. It is complete when the test passes, the approved version is published, and conflicting text is retired.
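The work item and its completion rule translate directly into a small structure. This is a sketch with hypothetical names and sample values: the failure types come from the list above, and the three completion flags come from the definition of done in the preceding paragraph.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    COVERAGE_GAP = "coverage gap"
    RETRIEVAL = "retrieval failure"
    FRESHNESS = "freshness failure"
    GUIDANCE = "guidance failure"
    QUALIFICATION = "qualification failure"
    ROUTING = "routing failure"
    BOUNDARY = "boundary failure"
    CONVERSATION = "conversation failure"


@dataclass
class WorkItem:
    conversation_id: str
    intent: str
    root_cause: FailureType
    record_id: str
    owner: str
    proposed_change: str
    regression_test: str
    test_passed: bool = False
    approved_version_published: bool = False
    conflicting_text_retired: bool = False

    def is_complete(self) -> bool:
        """Editing the wording is not enough; all three conditions must hold."""
        return (
            self.test_passed
            and self.approved_version_published
            and self.conflicting_text_retired
        )


# Made-up work item for illustration.
item = WorkItem(
    conversation_id="conv-001",
    intent="plan comparison",
    root_cause=FailureType.FRESHNESS,
    record_id="rec-plan-fit",
    owner="<named knowledge owner>",
    proposed_change="Replace the retired plan description.",
    regression_test="eval-plan-comparison",
)

item.test_passed = True
item.approved_version_published = True
print(item.is_complete())  # False, the old text is still active

item.conflicting_text_retired = True
print(item.is_complete())  # True
```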

Track a small set of operational signals that lead to decisions:

Signal	What it tells you	What to do with it
Coverage	Which eligible buyer intents have an approved answer and action path	Add knowledge where demand and consequence justify it
Reviewed correctness	Whether sampled claims match the approved record	Repair facts, retrieval, or response generation
Knowledge conflicts	Where active records disagree or overlap ambiguously	Resolve authority and retire obsolete material
Qualification completion	Whether required information was collected for eligible conversations	Improve questions, field definitions, or sequencing
Routing compliance	Whether the next action matched the approved rule	Correct policy logic or integrations
Buyer progression	Whether the conversation reached its intended next step	Inspect guidance and friction, then validate changes against downstream outcomes
Content health	Which active records lack an owner, approval, scope, or lifecycle status	Repair governance before stale content becomes a live failure

Review these signals by intent and sales path. A global average can look acceptable while one plan, market, or routing branch fails repeatedly. Conversion can be a useful downstream outcome, but it is not proof that a knowledge change caused the result. Traffic mix, offer changes, seasonality, and human follow-up can also move it. Use controlled comparisons where practical and pair outcome data with conversation-level review.
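A short worked example shows how a global average hides a failing branch. The data below is invented, and the segment names and the `compliance_by_segment` helper are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def compliance_by_segment(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of conversations per segment whose next action matched the rule."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for segment, compliant in outcomes:
        total[segment] += 1
        passed[segment] += int(compliant)
    return {segment: passed[segment] / total[segment] for segment in total}


# Invented outcomes: (intent or sales path, routing matched the approved rule).
outcomes = (
    [("pricing question", True)] * 90
    + [("enterprise routing", True)] * 4
    + [("enterprise routing", False)] * 6
)

overall = sum(ok for _, ok in outcomes) / len(outcomes)
by_segment = compliance_by_segment(outcomes)

print(overall)     # 0.94
print(by_segment)  # {'pricing question': 1.0, 'enterprise routing': 0.4}
```

The overall figure of 94 percent looks acceptable, while the enterprise routing branch complies in only 4 of 10 conversations.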

Reserve a recurring weekly block for unanswered questions, disengaged prospects, high-consequence errors, and unresolved conflicts. Process pricing, product, and policy changes as immediate knowledge events rather than waiting for the review block. This is where knowledge management becomes an operating responsibility instead of a cleanup project.

The compounding effect comes from the loop: approved knowledge improves conversations; conversations reveal gaps; each repaired gap becomes a reusable capability. That is why small, well-chosen content improvements can have effects beyond the original conversation.

Start with your recent inbound conversations. Choose the most repeated unanswered question, the most consequential incorrect answer, and one routing failure. Convert each into an owned knowledge record, add a regression test, and release only the behavior that passes. That small loop is the foundation of a sales agent you can trust with progressively more of the funnel.

References

Intercom – The Ultimate Knowledge Management Playbook to Supercharge Your AI Sales Agent

May 13, 2026
