Month: May 2026

Value-Based Pricing and Packaging: A Practical Playbook

If your pricing discussion keeps bouncing between competitor screenshots, delivery costs, and whatever Sales thinks the market will accept, you are not yet deciding a price. You are mixing four separate decisions: the pricing model, the pricing metric, the package, and the amount charged.

Separate those decisions and make them in the right order. You will get a pricing system that customers can understand, Finance can model, Sales can explain, and Product can improve as real behavior replaces assumptions.

Find the value, then choose a metric that tracks it

Value-based pricing does not mean charging the highest number a customer will tolerate. It means connecting what the customer pays to a result the customer cares about. Your costs still determine whether the offer is sustainable, but they do not explain why the buyer should purchase it.

Start by keeping four commonly confused decisions separate:

Decision	Question it answers	Example output
Pricing model	What overall structure determines how the customer pays?	Fixed fee, access-based, usage-based, or outcome-based
Pricing metric	What unit causes value and charges to scale?	Account, seat, transaction, workflow, or verified outcome
Packaging	Which capabilities, limits, and service levels belong together?	Plans, allowances, add-ons, commitments, and overages
Price	How much will you charge for the package or metric?	List price, contracted rate, and discount guardrails

Define value in the buyer’s language

Your first customer conversations should not begin with a proposed price. Begin with the decision the buyer is trying to make and the change the buyer expects after adopting the product. Ask for recent, concrete examples rather than opinions about a hypothetical offer.

What event made this problem important enough to address?
What happens if the buyer leaves the problem unsolved?
Who experiences the problem, and who controls the budget?
What observable change would count as success?
How does the buyer prove that change internally?
What alternatives compete for the same budget, including manual work and doing nothing?
What causes value to grow: more users, more activity, more completed work, better results, or lower risk?

Turn the answers into one working statement: For [buyer], the product creates value when [observable result] improves [business or operational consequence], compared with [current alternative]. This is not positioning copy. It is a testable value hypothesis that will guide the metric and package.

If different segments complete that sentence differently, do not average the answers into a vague promise. That is evidence that the segments may need different packages, metrics, or sales motions. A support leader buying fewer escalations and an operations leader buying more throughput may use the same product while evaluating its value in different ways.

Turn value into a billable unit

The pricing metric is the bridge between the value hypothesis and the invoice. For an AI support agent, for example, the model can be outcome-based, charging only for results, while the metric is the resolved query: an outcome counted when the agent resolves a customer query without further help. The principle is attractive because payment moves with delivered value. The definition is difficult because every ambiguous edge case can become an invoice dispute.

Write the metric specification before selecting the price. It should define:

The event that starts the measurement.
The event that qualifies the unit as complete.
Any quality threshold required before it is billable.
Exclusions, such as tests, spam, duplicates, abandoned work, or activity outside the contracted scope.
Attribution when a human, an automation, and an AI system all contribute.
How reversals, reopened work, refunds, and corrections affect the count.
What the customer can see before the count appears on an invoice.
Which record resolves a disagreement between product analytics and billing.
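
One way to make the specification concrete is to write it as a rule that returns both a decision and a reason for every event. The sketch below is a minimal illustration, not a billing implementation: the Event fields, the exclusion tags, the 0.8 quality threshold, and the choice to treat human-assisted work as non-billable are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical exclusion tags; a real contract would define its own list.
EXCLUDED_TAGS = {"test", "spam", "duplicate", "abandoned", "out_of_scope"}


@dataclass(frozen=True)
class Event:
    """One candidate unit, e.g. a support query handled by an AI agent."""
    event_id: str
    started: bool                  # the event that starts measurement occurred
    completed: bool                # the qualifying completion event occurred
    quality_score: float           # hypothetical 0..1 quality signal
    tags: frozenset = frozenset()  # labels used to apply exclusions
    human_assisted: bool = False   # attribution: a human contributed
    reopened: bool = False         # reversal after completion


def billable_status(event: Event, min_quality: float = 0.8) -> tuple[bool, str]:
    """Return (is_billable, reason) so each decision can be shown to the customer."""
    if not event.started:
        return False, "measurement never started"
    excluded = event.tags & EXCLUDED_TAGS
    if excluded:
        return False, "excluded: " + ", ".join(sorted(excluded))
    if not event.completed:
        return False, "not completed"
    if event.quality_score < min_quality:
        return False, "below quality threshold"
    if event.human_assisted:
        # Assumed attribution rule: only unassisted outcomes are billable.
        return False, "human contributed to the result"
    if event.reopened:
        return False, "reopened after completion"
    return True, "billable"
```

Returning the reason alongside the decision is deliberate: the same string can feed the customer-facing usage view and the dispute process, so product analytics and billing read from one record.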

My rule is simple: if a buyer cannot understand what will be counted and predict the direction of the next bill, the metric is not ready. Evaluate every candidate against six tests:

Value alignment: Does an increase in the unit normally mean the customer received more value?
Predictability: Can the customer forecast the unit well enough to plan a budget?
Auditability: Can both sides inspect the same underlying events?
Controllability: Can the customer influence usage or set limits without abandoning the product?
Operational feasibility: Can your product, data, billing, and support systems calculate the unit consistently?
Economic alignment: Does revenue scale sensibly relative to the cost and risk of delivering the value?
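
The six tests can be recorded as a simple scorecard. The sketch below assumes a 1 to 5 rating per test and a pass floor of 3; both the scale and the floor are illustrative choices, not part of the method described above.

```python
TESTS = (
    "value_alignment",
    "predictability",
    "auditability",
    "controllability",
    "operational_feasibility",
    "economic_alignment",
)


def evaluate_metric(name, scores, floor=3):
    """scores maps each test to a 1..5 rating (assumed scale).

    A candidate is 'ready' only if no single test falls below the floor;
    ratings are deliberately not averaged.
    """
    missing = [t for t in TESTS if t not in scores]
    if missing:
        raise ValueError(f"{name}: missing scores for {missing}")
    failing = [t for t in TESTS if scores[t] < floor]
    weakest = min(TESTS, key=lambda t: scores[t])
    return {"metric": name, "weakest": weakest,
            "failing_tests": failing, "ready": not failing}
```

The scorecard avoids an average on purpose: a seat metric that scores well on predictability and auditability but poorly on value alignment should fail, not pass on the strength of its other ratings.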

A value-based design does not always require a literal outcome metric. A proxy can be the better choice when it is closely related to value and much easier to forecast and audit. Raw activity is a poor proxy when it can grow without improving the customer’s result. A seat is a poor proxy when adding users does not increase value. An outcome is a poor metric when success cannot be defined consistently. Choose the least complicated unit that preserves alignment.

Before charging anyone, run the proposed rules against beta or historical events. Generate shadow invoices, inspect unusually high and low accounts, and reconcile the count from the raw event through the customer-facing bill. This exposes definitional and data problems while they are still product problems rather than financial disputes.
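
A shadow-invoice run can be sketched in a few lines. The example below assumes historical events have already been classified as billable or not, and it uses hypothetical ratios (four times and one quarter of the median) to flag accounts for inspection; real thresholds would come from your own data.

```python
from collections import Counter
from statistics import median


def shadow_invoices(events, unit_price):
    """events: iterable of (account_id, is_billable) pairs from historical data.

    Returns account_id -> shadow charge. Accounts with zero billable units
    are kept so that quiet accounts are inspected too.
    """
    events = list(events)
    counts = Counter(acct for acct, billable in events if billable)
    accounts = {acct for acct, _ in events}
    return {acct: round(counts[acct] * unit_price, 2) for acct in accounts}


def flag_outliers(invoices, low=0.25, high=4.0):
    """Flag accounts whose shadow charge is far from the median charge."""
    mid = median(invoices.values())
    flags = {}
    for acct, amount in invoices.items():
        if mid > 0 and amount > high * mid:
            flags[acct] = "unusually high"
        elif amount < low * mid:
            flags[acct] = "unusually low"
    return flags
```

The flagged accounts are the ones to reconcile by hand from raw event to customer-facing total, since they are the most likely to expose a definitional or data problem.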

Make packaging do the segmentation work

Pricing determines how revenue scales. Packaging determines which customers select which offer. A package is therefore not a decorative feature table. It is a mechanism for matching different value patterns, operating needs, and willingness to pay without creating a custom product for every account.

Segment customers by how they receive value. Company size may matter, but workflow complexity, risk, required integrations, volume, and the cost of failure can be more revealing.
Identify the minimum complete experience. Every package should let its intended customer reach the core outcome; a deliberately crippled entry plan teaches the market that the product does not work.
Place differentiators where their value is concentrated. Advanced governance, analytics, automation, integrations, service levels, and support may matter much more to one segment than another.
Choose the relationship between access and consumption. Decide what is included, what is metered, whether unused commitments expire, how overages work, and whether customers can set caps or alerts.
Test whether buyers can self-select. Show realistic scenarios, ask which package they would choose, and then ask them to explain why. Their explanation is more diagnostic than the selected tier.

Choose modular, bundled, or hybrid architecture deliberately

Modular pricing works best when capabilities have distinct buyers, adoption paths, and measurable outcomes. It lets a customer buy one job without funding unrelated functionality. Its weakness appears as the portfolio expands: each additional module adds another decision, metric, contract term, and sales explanation.

Bundling works better when capabilities reinforce one workflow or when customers experience the combined result rather than the individual components. It reduces buying friction, but it can hide which capability creates value and can force smaller customers to pay for breadth they do not need.

A hybrid can separate platform access from variable value: a base package covers the shared product, an included allowance makes the initial bill predictable, and overages or commitments let revenue grow with delivered value. Use that structure only when each component answers a different commercial question. Adding a platform fee, several meters, tier thresholds, credits, and add-ons without a clear role for each one creates a billing puzzle, not a pricing strategy.
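
The hybrid structure can be expressed as a small calculation. This is a sketch with hypothetical parameter names and an optional customer-set cap; it assumes a single meter and ignores commitments, credits, and tiered rates.

```python
def hybrid_bill(units, base_fee, included_units, overage_rate, cap=None):
    """Compute one period's charge for a base-plus-allowance-plus-overage plan.

    base_fee covers platform access and the included allowance;
    overage_rate applies only to units beyond the allowance;
    cap, if set, is the maximum total the customer has agreed to.
    """
    if units < 0:
        raise ValueError("units must be non-negative")
    overage_units = max(0, units - included_units)
    total = base_fee + overage_units * overage_rate
    if cap is not None:
        total = min(total, cap)
    return {"base": base_fee, "overage_units": overage_units,
            "total": round(total, 2)}
```

For example, hybrid_bill(1200, 500.0, 1000, 0.5) charges the base fee plus 200 overage units, for a total of 600.0. Each argument answers one commercial question, which is a quick check that no component is there without a role.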

Look for these packaging failure signals:

Customers repeatedly need capabilities scattered across several tiers.
The entry package cannot produce the outcome used to sell it.
The highest tier is simply every leftover feature rather than an offer for a distinct need.
Two packages attract the same customer for reasons your sales team cannot explain consistently.
The economically best package for you is visibly wrong for the customer.
Customers need a spreadsheet or a salesperson to estimate a normal bill.
Every new capability becomes a new add-on because the portfolio has no shared packaging logic.

Do not ask customers whether they like the package names or feature list. Give them a buying situation, expected volume, required controls, and a budget constraint. Ask them to choose, identify what feels unnecessary, and state what is missing. You are testing whether the architecture supports a decision, not whether the page looks polished.

Measure willingness to pay only after the offer is clear

Quantitative pricing work becomes useful only after buyers understand the model, metric, and package. Otherwise, a survey can produce a precise answer to a question the market would never ask. Use qualitative discovery to establish the buyer’s language and mental model, then carry that exact framing into willingness-to-pay testing.

Methods such as Gabor-Granger and Van Westendorp answer different questions. Gabor-Granger-style testing helps estimate purchase willingness across proposed price points. Van Westendorp-style questions help expose perceived price boundaries, including where an offer begins to feel implausibly cheap or prohibitively expensive. Neither method discovers the value metric for you, and neither produces a universally correct price.

A defensible survey sequence looks like this:

Describe the customer problem and product outcome without promotional language.
State exactly how charging works.
Define the billable unit, including the success condition.
Show what the package contains and what it excludes.
Give the respondent a realistic usage or outcome scenario.
Ask about willingness to purchase at a specific price or across a controlled sequence of prices.
Capture the respondent’s role, segment, buying authority, expected volume, and current alternative so the results can be interpreted rather than merely averaged.

A demand curve is more useful than a single average. In one outcome-priced case, stated purchase willingness moved from 69% at $0.86 per outcome to 39% at $1.42. Those figures are not benchmarks for another product. They demonstrate why the decision is strategic: moving along the curve changes expected adoption as well as revenue captured from each unit.

Multiplying each tested price by the share of respondents willing to buy at that price gives a simple revenue index, and its peak marks the survey-based revenue maximum. That point is not automatically your final recommendation. The index does not, by itself, include realized discounts, differences in unit volume, cost to serve, retention, expansion, sales effort, or the value of establishing market share.
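
The price-times-share arithmetic can be sketched using only the two survey points quoted above. The function name revenue_index is illustrative, and a real curve would contain more price points and be computed per segment.

```python
def revenue_index(curve):
    """curve: list of (price, share_willing_to_buy) pairs.

    Returns (price, share, price * share) rows. The index is expected revenue
    per surveyed buyer for one unit, before discounts, volume, or cost.
    """
    return [(price, share, round(price * share, 4)) for price, share in curve]


# The two points reported in the text; no other points are assumed.
curve = [(0.86, 0.69), (1.42, 0.39)]
rows = revenue_index(curve)
best = max(rows, key=lambda row: row[2])

for price, share, index in rows:
    print(f"${price:.2f}  willing={share:.0%}  index={index:.4f}")
# 0.86 * 0.69 = 0.5934 and 1.42 * 0.39 = 0.5538
```

Between these two points the lower price has the higher index (0.5934 versus 0.5538), which shows how a 65% higher price can still yield less survey-implied revenue once adoption falls.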

Decide what the price is meant to accomplish before interpreting the curve:

If the priority is adoption, you may accept less revenue per unit to reach more qualified customers.
If the priority is near-term revenue, you may choose a higher point while accepting a lower attach rate.
If the product requires substantial support or delivery cost, margin may eliminate prices that look attractive in a demand survey.
If the category is unfamiliar, simplicity and predictability may be more important than extracting the theoretical maximum.
If the product is part of a broader platform, the effect on cross-sell, retention, and portfolio coherence may matter more than stand-alone revenue.

Treat willingness-to-pay results as stated intent, not observed buying behavior. Segment the curve before using it. A blended result can conceal a high-value segment with strong demand and another segment that should not be targeted at all. It can also overstate confidence when respondents use the product but do not own the budget.

Convert the demand curve into a commercial model

The survey narrows the plausible range. The commercial model tells you whether an option can survive contact with actual customers, contracts, usage, discounts, and delivery costs. This is where a promising price becomes an operating plan.

Set a candidate list price. Choose a point that reflects the demand curve and the strategic objective, not just the highest theoretical revenue index.
Estimate realized price. Apply expected discounts, negotiated rates, credits, promotions, and channel effects. A list price that relies on constant exceptions is not the real price.
Project units by segment. Use beta or observed usage to estimate outcomes, transactions, seats, or another billable quantity. Preserve the distribution instead of relying only on the mean.
Model attach rate. Estimate what share of eligible customers will buy in conservative, base, and upside cases. Connect each case to an explicit assumption rather than a general level of optimism.
Calculate customer and portfolio revenue. For a metered product, combine realized unit price with expected annual units. Then roll the result across eligible customers and segments.
Include delivery economics. Subtract variable delivery costs and account for service obligations that grow with usage. For AI products, inspect how model, infrastructure, support, and exception-handling costs behave at both low and high volume.
Connect the recommendation to the operating plan. Show the implications for customer count, adoption, annual recurring revenue, gross margin, expansion, and any dependencies on the rest of the portfolio.
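
The steps above can be sketched as a small model. Every field name and number a caller supplies is an assumption for illustration; the sketch uses one expected unit volume per segment, so preserving the usage distribution means running it once per usage band and not once per segment average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    eligible_customers: int
    attach_rate: float         # share of eligible customers expected to buy
    units_per_customer: float  # expected annual billable units per buyer
    discount: float            # expected average discount off list (0..1)
    unit_cost: float           # variable delivery cost per billable unit


def model_segment(seg, list_price):
    """Annual revenue and gross profit for one segment at a candidate list price."""
    realized_price = list_price * (1 - seg.discount)
    buyers = seg.eligible_customers * seg.attach_rate
    units = buyers * seg.units_per_customer
    revenue = units * realized_price
    gross_profit = revenue - units * seg.unit_cost
    return {"segment": seg.name, "realized_price": realized_price,
            "buyers": buyers, "units": units,
            "revenue": revenue, "gross_profit": gross_profit}


def model_portfolio(segments, list_price):
    """Roll segment results up to portfolio revenue and gross margin."""
    rows = [model_segment(s, list_price) for s in segments]
    revenue = sum(r["revenue"] for r in rows)
    gross_profit = sum(r["gross_profit"] for r in rows)
    margin = gross_profit / revenue if revenue else 0.0
    return {"segments": rows, "revenue": revenue,
            "gross_profit": gross_profit, "gross_margin": margin}
```

Running the model for conservative, base, and upside attach rates, each tied to a written assumption, gives the three cases described above without changing the code.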

Stress-test the assumptions that can break the plan

A single base case hides the shape of the risk. Change one major assumption at a time so decision-makers can see what the recommendation depends on.

Discount sensitivity: What happens if realized price is materially below list price?
Volume sensitivity: What happens when customers generate far fewer or far more units than the average?
Attach sensitivity: How much adoption is required before the product covers its fixed investment?
Cost sensitivity: Does high usage improve gross profit, or does the delivery cost scale almost as quickly as revenue?
Concentration risk: Does the forecast depend on a small number of unusually large customers?
Invoice volatility: Can normal changes in behavior create bills that customers will perceive as unpredictable?
Metric leakage: Are valuable events going unbilled, or are low-quality events being counted as successful outcomes?
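
A one-at-a-time sensitivity pass can be sketched as follows. The profit formula and the assumption names are simplified placeholders; the point is the mechanics of changing a single input while holding the rest at the base case.

```python
def annual_profit(a):
    """Gross profit after fixed investment for a dict of assumptions (illustrative)."""
    realized_price = a["list_price"] * (1 - a["discount"])
    units = a["customers"] * a["attach_rate"] * a["units_per_customer"]
    return units * (realized_price - a["unit_cost"]) - a["fixed_cost"]


def one_at_a_time(base, shocks):
    """shocks maps an assumption name to one alternative value.

    Each scenario changes exactly one assumption. Returns the change in
    profit versus the base case, largest absolute impact first.
    """
    base_profit = annual_profit(base)
    deltas = {}
    for name, value in shocks.items():
        scenario = {**base, name: value}
        deltas[name] = annual_profit(scenario) - base_profit
    return dict(sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True))
```

Sorting by absolute impact puts the assumption the recommendation depends on most at the top, which is the one that most needs supporting evidence in the approval memo.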

Inspect account-level scenarios, not just portfolio totals. A model can produce acceptable average revenue while creating obviously unreasonable bills for a small customer, a seasonal customer, or a high-volume account. Those tails often become the discount exceptions, support escalations, and renewal problems that the average concealed.

Make the recommendation easy to challenge

The approval memo should contain the decision and the logic required to dispute it. Include:

The buyer, value hypothesis, model, metric, and metric definition.
The proposed packages and the segment each package is designed to serve.
The willingness-to-pay range and how it changes by segment.
The recommended list price, expected realized price, and discount guardrails.
Conservative, base, and upside forecasts for adoption, revenue, and margin.
The most sensitive assumptions and the evidence supporting them.
Alternatives considered, why they were rejected, and what evidence would reopen them.
Operational dependencies across Product, Research, Data, Finance, Engineering, Sales, Customer Success, Support, and billing.

Cross-functional review is not ceremonial. Finance can expose a margin or forecasting problem. Engineering can show that the proposed event cannot be measured reliably. Sales can identify a model buyers cannot procure. Support can anticipate disputes. Product can determine whether the metric rewards the behavior the product is supposed to create. Resolve those conflicts before the price becomes a public promise.

Launch pricing as a controlled learning system

Approval is the end of price design and the start of price operations. Customers experience pricing through entitlements, usage counters, contracts, invoices, renewal conversations, and support responses. A sensible strategy can fail if those surfaces disagree.

Complete the billing path before charging

Write a billing specification that maps raw events to billable units and contract terms.
Verify entitlements, included allowances, overages, caps, credits, and exception handling.
Run parallel or shadow invoices and reconcile them from event log to customer-facing total.
Give customers a usage view that uses the same definitions and timing as billing.
Enable Sales with qualification rules, scenario-based pricing examples, and clear discount authority.
Prepare Customer Success and Support to explain the metric, diagnose discrepancies, and escalate genuine billing errors.
Instrument proof of value next to proof of usage so the commercial conversation is not reduced to a meter.
Communicate the effective date, affected products, counting rules, package changes, and available customer controls in plain language.

Do not alter existing charges on the assumption that a product announcement overrides a contract. Review contractual commitments, renewal timing, migration rules, and customer communications before changing what an existing customer pays. An informal migration can create financial disputes and destroy trust even when the new model is better designed.

Use behavior to diagnose the next problem

Instrument the system from the first launch cohort. Review both commercial performance and customer experience:

Eligibility, attach rate, and package selection by segment.
List price, realized price, discount frequency, and exception rates.
The full distribution of billable units per customer, not just the average.
Revenue and gross margin by segment, package, and usage band.
Invoice variance and how accurately customers forecast their charges.
Billing questions, disputes, credits, and metric-definition escalations.
Activation, continued usage, achieved outcomes, expansion, contraction, renewal, and churn.
Sales-cycle friction caused by the model, procurement requirements, or package complexity.

Use each signal to choose the next investigation. Low attach can point to weak qualification, unclear value, the wrong package, or the wrong price. Strong attach followed by low activity can indicate an onboarding or product-value problem. High activity with poor margin calls for an economics or discount review. Frequent disputes usually justify inspecting the metric definition, event quality, and customer visibility. These patterns are diagnostic prompts, not causal proof; pair the numbers with targeted customer and GTM conversations.
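
The signal-to-investigation mapping in the paragraph above can be written down as explicit rules. The thresholds below are hypothetical placeholders that would need calibrating against your own launch cohort, and the output is a list of prompts, not a diagnosis.

```python
# Hypothetical thresholds; calibrate them against your own launch cohort.
THRESHOLDS = {"attach": 0.10, "active": 0.50, "margin": 0.60, "disputes": 2.0}


def diagnostic_prompts(attach_rate, active_share, gross_margin, disputes_per_100):
    """Map launch-cohort signals to the next investigation to run."""
    prompts = []
    if attach_rate < THRESHOLDS["attach"]:
        prompts.append("low attach: check qualification, value clarity, package fit, price")
    elif active_share < THRESHOLDS["active"]:
        prompts.append("strong attach, low activity: check onboarding and product value")
    if active_share >= THRESHOLDS["active"] and gross_margin < THRESHOLDS["margin"]:
        prompts.append("high activity, poor margin: review economics and discounts")
    if disputes_per_100 > THRESHOLDS["disputes"]:
        prompts.append("frequent disputes: inspect metric definition, event quality, visibility")
    return prompts
```

Running the rules per segment, not on blended totals, keeps a healthy segment from masking one that needs attention.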

Review the architecture, not only the number, when the product expands. Modular outcome pricing can work cleanly while each capability has a distinct result. As a platform adds capabilities, buyers may face several meters, overlapping modules, and an invoice they cannot predict. That is a signal to reconsider how access, bundles, allowances, and outcomes fit together, not merely to adjust every component independently.

Reopen the pricing system when customers cannot forecast bills, new capabilities do not fit an existing package, discount exceptions become routine, sales explanations diverge, gross margin behaves differently from the model, or the value customers receive is no longer represented by the metric. Pricing should be treated as a living system informed by research, customer behavior, and go-to-market learning, not a launch artifact that becomes untouchable.

Key takeaways

Make four decisions separately: pricing model, pricing metric, package, and price.
Define value using an observable customer result before asking what anyone will pay.
Choose a metric that aligns with value but remains predictable, auditable, operationally feasible, and economically sound.
Design packages around distinct value patterns and buying needs, not an arbitrary progression of feature counts.
Use willingness-to-pay work to build a demand curve, then combine it with usage, attach, discounts, and margin in a commercial model.
Validate the complete billing path before launch and use observed behavior to improve the system afterward.

If your team is stuck debating the number, stop the meeting and complete six lines first: buyer, customer outcome, billable unit, measurement proof, package boundary, and commercial assumptions. Any line you cannot defend is the next research or modeling task. Put a price on the page only after those six lines tell one coherent story.

References

Intercom – Inside My Pricing Playbook: Building Value-Based Packaging That Balances Growth and Profit

May 20, 2026

Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments (the Black Friday sales, the sporting events, the production incidents), and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business.

Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.

We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.

While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice.

We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns.
We want that energy spent on the parts of the system that are specific to our customers and our product.

“That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”

This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition; they reserve our ambition for the nuanced scaling challenges that matter.

The source of truth is the hard part. You can scale stateless web traffic by adding machines, queue consumers, and cache. Those are real problems, just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.

That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.

Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it.
For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it.

Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.

Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform, from vector search in our AI features to realtime reporting, and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work.

Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.

More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident.
We’ve used this pattern for years to make large search migrations safer and more incremental. It is a core playbook in our platform scalability and SRE practices.

Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers.

For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.

We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily on correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks.

The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed.

Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome.
AI providers are not commodity storage systems, and we do not design as if they are. That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point.

For enterprise customers, this matters because AI support volume can spike just like human support volume, and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.

Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce.

Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.

“That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”

Sometimes, these events teach us something specific.
In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.

We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer, a habit that anchors our incident management, observability, and FinOps practices.

Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.

“That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”

Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?

This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric; it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release.
Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.

“Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”

The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.

Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime, and we do this in practice, not just as a goal.

This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.

“Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”

Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.

What this means for customers is simple: be skeptical of vague scale claims.
The question isn’t whether a vendor says they can scale; it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer.

Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system, and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

Inspired by this post on The Intercom Blog.

May 19, 2026

How to Build an AI-Native Product Discovery Workflow

Your discovery stack may already hold interview transcripts, support conversations, behavioral analytics, experiment results, and roadmap assumptions. Yet the decision in a product review can still depend on whoever read the most material or built the most persuasive deck.

If adding an LLM only gives you faster summaries, the workflow is not AI-native. An AI-native discovery workflow shortens the distance from evidence to a decision while making every important claim easier to inspect. AI retrieves, structures, compares, and challenges the evidence. You remain accountable for what the evidence means and what the product team does next.

Key takeaways

Begin every AI-assisted discovery run with an outcome, a metric, defined context, and a decision that someone needs to make.
Preserve raw evidence and give each observation a stable identifier before asking AI to synthesize it.
Break the workflow into bounded jobs such as retrieval, extraction, clustering, contradiction detection, and decision-brief drafting.
Evaluate citation accuracy, evidence fidelity, counterevidence, abstention, and access controls before the output enters a roadmap discussion.
Measure whether the workflow improves decision quality and product outcomes, not merely whether the model produces polished prose.

Frame the decision before you involve the model

Most weak discovery prompts fail before the model sees them. “Analyze the interviews,” “summarize the feedback,” and “find insights” are activities, not decisions. They give the model no principled way to distinguish useful evidence from interesting noise.

Write a short decision contract first. A useful contract specifies the outcome and metric, the context and constraints, and the decision and deliverable. Those fields turn an open-ended request into a bounded discovery task.

Outcome and metric: Name the user or business outcome, then define the behavior or measure that represents it. Activation, funnel conversion, and retention are not interchangeable. Include the event definition and observation window used by your analytics system.
Context and constraints: State the relevant cohort, product surface, timeframe, market, known exclusions, and data-access limits. New self-serve accounts on the web can exhibit a different pattern from established accounts or customers using another surface.
Decision and deliverable: Say what someone will do with the answer. Ask for a ranked opportunity brief, an interview plan, a set of competing explanations, or experiment candidates only when that format supports a real pending decision.

Reusable decision prompt: Help me decide [decision]. The outcome is [outcome], measured as [metric definition]. Limit the analysis to [cohort, surface, timeframe, and constraints]. Retrieve evidence from [approved repositories]. Return [deliverable]. For every material claim, include the evidence identifier, any conflicting evidence, the affected segment, and what is still unknown. If the available evidence cannot support a recommendation, say so and specify what is missing.
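One way to keep the contract from eroding is to hold its fields in a small structure and refuse to render the prompt while any field is blank. The sketch below is illustrative only: the `DecisionContract` class, its field names, and the example values are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DecisionContract:
    """Hypothetical container for the bracketed fields of the reusable prompt."""
    decision: str
    outcome: str
    metric_definition: str
    scope: str          # cohort, surface, timeframe, and constraints
    repositories: str   # approved evidence sources
    deliverable: str


PROMPT_TEMPLATE = (
    "Help me decide {decision}. The outcome is {outcome}, measured as "
    "{metric_definition}. Limit the analysis to {scope}. Retrieve evidence "
    "from {repositories}. Return {deliverable}. For every material claim, "
    "include the evidence identifier, any conflicting evidence, the affected "
    "segment, and what is still unknown. If the available evidence cannot "
    "support a recommendation, say so and specify what is missing."
)


def render_prompt(contract: DecisionContract) -> str:
    """Fill the template; refuse to run while any field is blank."""
    values = {f.name: getattr(contract, f.name).strip() for f in fields(contract)}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"decision contract is incomplete: {missing}")
    return PROMPT_TEMPLATE.format(**values)


# Illustrative values only; substitute your own definitions.
example = DecisionContract(
    decision="which onboarding friction to investigate first",
    outcome="activation",
    metric_definition="a completed core workflow inside the agreed observation window",
    scope="new self-serve accounts on the web",
    repositories="the approved interview and support repositories",
    deliverable="a ranked opportunity brief",
)
prompt = render_prompt(example)
```

Rejecting a blank field at render time is a deliberate choice in this sketch: an incomplete contract fails loudly instead of producing a vague prompt.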

The last sentence matters. An AI system should be allowed to answer “insufficient evidence.” If every run must end with a recommendation, the workflow rewards plausible completion instead of honest discovery.

Keep the outcome separate from the proposed solution. “Improve activation” is an outcome. “Validate an onboarding checklist” is already a solution choice. When you embed the solution in the prompt, AI tends to organize the available evidence around that choice instead of testing whether another opportunity matters more.

Use evidence-strength labels that a reviewer can verify rather than asking the model for an unsupported confidence percentage:

Sufficient: Direct evidence applies to the target context, and no material contradiction remains unresolved.
Mixed: Direct evidence and meaningful counterevidence both exist, or the pattern changes by segment.
Insufficient: Evidence is missing, indirect, stale for the decision, or outside the target context.
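The three labels can be derived from facts a reviewer is able to check. The sketch below is one possible encoding under stated assumptions: the function name, its parameters, and the decision to treat stale evidence as insufficient regardless of volume are illustrative choices.

```python
from enum import Enum


class Strength(Enum):
    SUFFICIENT = "sufficient"
    MIXED = "mixed"
    INSUFFICIENT = "insufficient"


def label_strength(direct_in_context: int,
                   unresolved_contradictions: int,
                   varies_by_segment: bool,
                   stale: bool = False) -> Strength:
    """Map reviewer-verifiable counts to a label (hypothetical rule order)."""
    # No direct evidence for the target context, or stale evidence: abstain.
    if direct_in_context == 0 or stale:
        return Strength.INSUFFICIENT
    # Direct evidence exists, but so does counterevidence or a segment split.
    if unresolved_contradictions > 0 or varies_by_segment:
        return Strength.MIXED
    return Strength.SUFFICIENT
```

Because every input is a count or a flag, a reviewer can dispute the label by pointing at a specific evidence unit.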

Build a traceable evidence pipeline, not a transcript pile

AI cannot make discovery evidence traceable if the underlying repository has already flattened observations, interpretations, and decisions into the same notes. Preserve those layers separately. My rule is simple: automate the movement and inspection of evidence before automating judgment.

Layer	What it contains	Control that matters
Raw evidence	Interview recordings or transcripts, support records, session evidence, and analytics query results	Keep the original record intact, access-controlled, and addressable by a stable locator
Evidence units	Atomic observations with metadata	Separate exact customer language, observed behavior, and analyst interpretation
Opportunities	Candidate needs, frictions, or desired outcomes	Attach supporting evidence, counterevidence, affected segments, and unresolved questions
Decisions	Choices made, rejected alternatives, assumptions, and rationale	Name the decision owner and preserve the evidence available at the time
Learning	Experiment results and later customer or behavioral evidence	Update the opportunity without erasing the earlier reasoning

Each evidence unit should carry enough metadata to survive outside its original document:

A stable evidence identifier.
The collection date and an exact locator such as a transcript timestamp or saved analytics query.
The relevant user segment, product surface, and journey stage.
The raw observation, kept separate from the interpretation proposed by a person or model.
The access, retention, and sensitivity classification.
The opportunity, assumption, or outcome to which the evidence may relate.

This structure prevents a common failure: a model paraphrases an interview, a later summary compresses that paraphrase, and the roadmap eventually treats the compressed interpretation as a customer fact. A reviewer should always be able to move from a claim to the evidence unit and then to the original record.
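A minimal sketch of that metadata and of the claim-to-record path follows. The class names, field names, and sample values are hypothetical; a real schema would follow your own repository and classification scheme.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EvidenceUnit:
    evidence_id: str               # stable identifier
    collected_on: date
    locator: str                   # e.g. transcript timestamp or saved query
    segment: str
    surface: str
    journey_stage: str
    observation: str               # raw observation, kept as recorded
    interpretation: Optional[str]  # proposed by a person or model, stored separately
    sensitivity: str               # access, retention, and sensitivity class
    relates_to: str                # opportunity, assumption, or outcome


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]


def trace(claim: Claim, index: dict[str, EvidenceUnit]) -> list[str]:
    """Resolve a claim to original-record locators; fail on unknown identifiers."""
    unknown = [eid for eid in claim.evidence_ids if eid not in index]
    if unknown:
        raise KeyError(f"claim cites unknown evidence: {unknown}")
    return [index[eid].locator for eid in claim.evidence_ids]


# Illustrative record; every value here is made up for the example.
unit = EvidenceUnit(
    evidence_id="EV-001",
    collected_on=date(2026, 5, 4),
    locator="interview-12 @ 00:14:32",
    segment="new self-serve accounts",
    surface="web",
    journey_stage="setup",
    observation="I could not find where to import my contacts.",
    interpretation="Import entry point may be hard to discover.",
    sensitivity="internal",
    relates_to="setup friction",
)
index = {unit.evidence_id: unit}
locators = trace(Claim("Import is hard to find.", ("EV-001",)), index)
```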

Apply data-governance rules before ingestion. If customer conversations contain personal, confidential, or contract-restricted information, do not copy them into an AI system until its access, retention, redaction, and model-training terms match your commitments. A more convenient synthesis workflow is not worth an unauthorized disclosure.

Retrieve the smallest useful context

Once the evidence corpus no longer fits sensibly into a prompt, use a retrieval-first pipeline with modular prompts and observable traces. Retrieval-augmented generation should select evidence relevant to the decision contract, rather than asking a general agent to reason over everything the company knows.

RAG is a grounding mechanism, not a truth guarantee. A fluent answer does not prove that the retriever found the decisive interview, the correct event definition, or the evidence that contradicts the dominant pattern. Configure retrieval to look for both support and contradiction, preserve evidence identifiers, respect access controls, and return no result when the available context does not meet the task.
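The retrieval rules in that paragraph can be shown with a toy in-memory retriever. This is a sketch under strong assumptions: it does no relevance ranking, the `stance` label is assumed to be assigned by an earlier stage, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Passage:
    evidence_id: str
    text: str
    cohort: str
    allowed_roles: frozenset
    stance: str  # "supports" or "contradicts"; assumed to be labeled upstream


def retrieve(passages: list, cohort: str, role: str, limit: int = 3) -> Optional[dict]:
    """Return evidence identifiers for both sides, or None when nothing qualifies."""
    # Apply access control and the contract's scope before any selection.
    visible = [p for p in passages if role in p.allowed_roles and p.cohort == cohort]
    if not visible:
        return None  # an explicit "no result", not a best-effort guess
    return {
        "supports": [p.evidence_id for p in visible if p.stance == "supports"][:limit],
        "contradicts": [p.evidence_id for p in visible if p.stance == "contradicts"][:limit],
    }


corpus = [
    Passage("EV-1", "Could not finish setup.", "self-serve", frozenset({"pm"}), "supports"),
    Passage("EV-2", "Setup took two minutes.", "self-serve", frozenset({"pm", "research"}), "contradicts"),
    Passage("EV-3", "Setup needed IT help.", "enterprise", frozenset({"pm"}), "supports"),
]
result = retrieve(corpus, cohort="self-serve", role="pm")
```

Filtering by role before selection means restricted passages never reach the prompt or the trace for an unauthorized requester.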

An opportunity solution tree can provide the shared view above this pipeline: the desired outcome connects to opportunities, solution candidates, and tests. Treat the tree as a navigable representation of current thinking. Every important node should still resolve to evidence and assumptions beneath it.

Give AI a chain of bounded jobs

A single agent asked to interview customers, interpret feedback, size opportunities, choose a solution, and write a roadmap has too many ways to hide a weak inference. Break the work into stages with explicit inputs and review gates:

Prepare: Give AI the outcome, assumptions, and learning gaps. Let it draft non-leading interview questions. A human checks whether the guide is testing an assumption or merely inviting agreement.
Convert: Extract atomic observations from approved records. Require exact locators and label customer language, observed behavior, and interpretation separately.
Synthesize: Cluster candidate opportunities without erasing segment differences. Request supporting evidence, counterevidence, and unrepresented cohorts for every cluster.
Connect: Use behavioral analytics to examine whether the observed pattern appears in the target cohort. Interviews can expose mechanisms and unmet needs; they should not be treated as a substitute for measuring prevalence.
Challenge: Ask for rival explanations, evidence that would reverse the conclusion, and assumptions that remain untested. This stage should consume the evidence record, not just the previous summary.
Draft: Produce a decision brief containing the pending decision, options, evidence, contradictions, unknowns, and proposed next test. A named human accepts, revises, or rejects it.
Learn: Attach experiment and outcome evidence to the same opportunity record. Preserve what the team believed before the test so later reviewers can inspect how the decision changed.

Pass structured artifacts between stages. If each stage receives only prose copied from the previous chat, unsupported claims can become progressively harder to distinguish from evidence.
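As a sketch of what a structured hand-off could look like, the example below passes typed records from the Convert stage to the Synthesize stage and checks them at a review gate. The record shapes and the gate rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:      # output of the Convert stage
    evidence_id: str
    kind: str           # "quote", "behavior", or "interpretation"
    text: str


@dataclass(frozen=True)
class Cluster:          # output of the Synthesize stage
    name: str
    supporting: tuple[str, ...]
    counter: tuple[str, ...]


def review_gate(clusters: list, observations: list) -> list:
    """List problems a reviewer must resolve before the next stage runs."""
    known = {o.evidence_id for o in observations}
    problems = []
    for cluster in clusters:
        if not cluster.supporting:
            problems.append(f"{cluster.name}: no supporting evidence")
        for eid in cluster.supporting + cluster.counter:
            if eid not in known:
                problems.append(f"{cluster.name}: unknown evidence id {eid}")
    return problems


observations = [
    Observation("EV-1", "quote", "I could not find the import button."),
    Observation("EV-2", "behavior", "Import started but not completed."),
]
clusters = [
    Cluster("setup friction", ("EV-1",), ("EV-2",)),
    Cluster("pricing confusion", (), ("EV-9",)),
]
problems = review_gate(clusters, observations)
```

Because each cluster carries identifiers instead of prose, a claim without evidence shows up as a gate failure and cannot blend into the summary.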

Buy workflow plumbing; own the decision logic

You do not need to build every repository, connector, permission system, visualization, and observability screen. Licensing purpose-built opportunity-tree infrastructure can be the sensible choice when your differentiated work is the learning system rather than the canvas or collaboration layer.

Keep ownership of the parts that encode how your company makes product decisions: the decision contract, evidence schema, opportunity taxonomy, prompt modules, evaluation cases, escalation rules, and approval gates. Before choosing a platform, ask:

Can you export the raw evidence, metadata, opportunity structure, prompts, and run traces?
Can access rules follow the evidence through retrieval and generation?
Can the system connect to your approved analytics and customer-evidence repositories without repeated manual copying?
Can you evaluate a prompt or retrieval change against representative past cases?
Can a reviewer inspect why a claim appeared and what evidence was omitted?
Would building this capability improve the customer outcome, or merely recreate commodity workflow infrastructure?

Evaluate the workflow before it shapes the roadmap

Start evals before AI-generated conclusions become routine inputs to product reviews. The evaluation set should represent the cases the workflow will actually encounter: a clear pattern, conflicting evidence, insufficient evidence, cohort-specific behavior, stale material, duplicated records, and content the requesting user is not allowed to retrieve.

For synthesis and decision-support tasks, evaluate behavior that a reviewer can observe:

Citation validity: Every material claim points to a real, accessible evidence identifier.
Evidence fidelity: Quotations and behavioral facts remain faithful to the underlying record; interpretations are labeled as interpretations.
Retrieval coverage: The output includes the evidence required to assess the target opportunity, not merely the easiest matching passages.
Contradiction handling: Material counterevidence and segment differences are visible rather than buried.
Abstention: The system returns insufficient evidence when the decision cannot be supported.
Decision fit: The deliverable answers the stated decision instead of drifting into a generic summary or unrelated recommendation.
Policy compliance: Restricted evidence stays outside unauthorized retrieval, traces, and generated output.

A strict release gate is useful here. Fail the output if it invents an evidence identifier, turns an interpretation into a quotation, ignores a material contradiction, or exposes restricted content. Those are not cosmetic defects that a polished paragraph can offset.
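The four failure conditions can be expressed as a small check that returns reasons instead of a score. The sketch below assumes hypothetical record shapes: `evidence` maps an identifier to a `(kind, text)` pair, and `required_counterevidence` is the set of material contradictions a reviewer has already flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputClaim:
    text: str
    evidence_id: str
    presented_as: str   # "quote" or "interpretation"


def release_gate(claims: list, evidence: dict,
                 restricted_ids: set, required_counterevidence: set) -> list:
    """Return failure reasons; an empty list means the output may proceed."""
    failures = []
    cited = {c.evidence_id for c in claims}
    for claim in claims:
        record = evidence.get(claim.evidence_id)
        if record is None:
            failures.append(f"invented evidence identifier: {claim.evidence_id}")
            continue
        kind, text = record
        # A quotation must come from a quote record and appear in it verbatim.
        if claim.presented_as == "quote" and (kind != "quote" or claim.text not in text):
            failures.append(f"interpretation presented as quotation: {claim.evidence_id}")
        if claim.evidence_id in restricted_ids:
            failures.append(f"restricted content exposed: {claim.evidence_id}")
    for eid in sorted(required_counterevidence - cited):
        failures.append(f"material contradiction ignored: {eid}")
    return failures


evidence = {
    "EV-1": ("quote", "I could not find the import button."),
    "EV-2": ("interpretation", "Users seem confused by setup."),
    "EV-3": ("quote", "Setup was easy for us."),
}
```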

Treat the prompt, retrieval configuration, model choice, taxonomy, and evaluation set as versioned artifacts. This is the practical value of eval-driven development and early observability: when behavior changes, you can identify the change that caused it and rerun representative cases before wider use.
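A small sketch of what versioning buys you: if each run records its configuration, a behavior change can be narrowed to the fields that differ. The `RunConfig` fields and version strings below are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    prompt_version: str
    retrieval_version: str
    model: str
    taxonomy_version: str
    eval_set_version: str


def changed_fields(before: RunConfig, after: RunConfig) -> list:
    """Name the versioned artifacts that differ between two runs."""
    old, new = asdict(before), asdict(after)
    return sorted(name for name in old if old[name] != new[name])


# Placeholder version labels, not real model or prompt names.
baseline = RunConfig("prompt-v3", "retrieval-v2", "model-a", "taxonomy-v1", "evals-v4")
candidate = replace(baseline, model="model-b", prompt_version="prompt-v4")
diff = changed_fields(baseline, candidate)
```

A non-empty result is a reasonable trigger for rerunning the representative evaluation cases before the candidate configuration sees wider use.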

For each production run, retain the decision contract, evidence identifiers retrieved, prompt and retrieval versions, generated output, reviewer edits, final decision, and later outcome. That trace lets you distinguish a retrieval failure from a synthesis failure, a weak decision contract, or a reasonable decision invalidated by new evidence.
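To illustrate how a trace separates the two failure types, the sketch below checks where a piece of decisive evidence dropped out. It assumes a reviewer has identified that evidence after the fact; the `RunTrace` shape is hypothetical and covers only two of the retained fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTrace:
    retrieved_ids: frozenset   # evidence identifiers the retriever returned
    cited_ids: frozenset       # evidence identifiers the generated output cited


def locate_failure(trace: RunTrace, decisive_id: str) -> str:
    """Say which stage lost a piece of evidence the reviewer considers decisive."""
    if decisive_id not in trace.retrieved_ids:
        return "retrieval"     # the model never saw it
    if decisive_id not in trace.cited_ids:
        return "synthesis"     # the model saw it and left it out
    return "none"              # present in both; look at the contract or new evidence


run = RunTrace(retrieved_ids=frozenset({"EV-1", "EV-2"}), cited_ids=frozenset({"EV-1"}))
```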

Model-quality checks are only one layer. Also baseline and monitor the discovery workflow itself:

Time from a framed question to a reviewable decision brief.
The share of material claims with inspectable evidence.
Reviewer corrections to quotations, segments, event definitions, and interpretations.
Decisions reopened because relevant evidence was missing or misread.
Movement in the outcome and metric named in the original decision contract.

Do not set improvement targets until you have a baseline for the existing process. A system can make synthesis faster while increasing correction work or encouraging premature decisions. The end-to-end measure tells you whether the saved time is real.

Turn the workflow into a product operating system

AI-native discovery changes the product team’s operating model only when ownership remains explicit. The product manager or product trio owns the outcome, assumptions, and decision. Research and design judgment protects interview quality and interpretive nuance. Data and engineering ownership protects event definitions, retrieval reliability, instrumentation, and access controls. AI produces candidate artifacts. The decision owner approves the action.

Review by exception instead of rereading every generated sentence. Inspect claims marked mixed or insufficient, new opportunity clusters, segment differences, material contradictions, changed event definitions, and outputs that differ from earlier runs. This focuses human attention where judgment is most valuable without treating the model as an authority.
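Review by exception can be reduced to a filter that explains why an item was flagged. The sketch below is illustrative: the dictionary keys and reason strings are assumptions, and it covers only some of the triggers listed above.

```python
def review_reasons(item: dict, earlier: dict) -> list:
    """Return why a generated item needs human review; an empty list means skip.

    `item` has the keys cluster, strength, and contradictions (a list).
    `earlier` maps a cluster name to its strength label from the previous run.
    """
    reasons = []
    if item["strength"] in ("mixed", "insufficient"):
        reasons.append(f"strength is {item['strength']}")
    if item["cluster"] not in earlier:
        reasons.append("new opportunity cluster")
    elif earlier[item["cluster"]] != item["strength"]:
        reasons.append("differs from earlier run")
    if item["contradictions"]:
        reasons.append("material contradiction")
    return reasons


earlier_run = {"setup friction": "sufficient"}
```

Returning reasons, not a boolean, lets the reviewer see at a glance what kind of judgment the item needs.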

Roll out the workflow through one recurring, reversible discovery decision:

Choose a decision for which customer evidence and behavioral data already exist, such as prioritizing an onboarding friction or investigating a repeated support issue.
Baseline the current path from question to decision, including reviewer corrections and missing-evidence failures.
Create the decision contract, evidence schema, and access rules before connecting an agent.
Build the evaluation set from previous clear, contradictory, insufficient, segment-specific, and restricted cases.
Run the AI workflow in shadow mode beside the existing process. Compare claims, omissions, reviewer effort, and the resulting decision without allowing the generated output to act automatically.
Promote bounded jobs only after they pass their gates. Evidence extraction may be ready before opportunity ranking, and opportunity ranking may be ready before solution recommendations.
Expand to another workflow only when the traces are stable, reviewers understand escalation paths, and the first use case is improving the decision process rather than merely generating more material.

At your next discovery review, do not ask what AI found. Bring one decision contract, require every consequential claim to resolve to evidence, and make the unresolved assumption visible. That is a small enough change to start immediately and a strong enough foundation for everything you automate later.


May 19, 2026

Level Up: May 26 Claude Code Show & Tell + Final Product Discovery Fundamentals Cohort

I’m excited to share two opportunities this season to uplevel your craft, connect with peers, and leave with practical, repeatable techniques you can apply immediately to your product work.

We will be doing another round of Claude Code: Show and Tell on May 26th at 9am PDT. These community-driven sessions are hands-on and fast-paced—we swap proven workflows, compare prompts, and pressure-test approaches together. You’ll see how product teams are operationalizing AI workflows in real contexts and walk away with ideas you can adapt for your own roadmap and experimentation pipeline. Invites will go out to Supporting Members and CDH Members tomorrow. If you'd like to join us, keep an eye on your inbox for the invite.

I love these Show & Tell sessions because they translate tacit knowledge into clear, reusable playbooks. Whether you’re refining evaluation loops for LLMs, streamlining discovery synthesis, or standardizing prompts for consistency, the shared rigor and camaraderie make it a high-signal hour for any product leader invested in AI workflows.

I also want to share that I'll be teaching our June 4th – July 9th cohort of Product Discovery Fundamentals. This is the last time I'll be teaching this cohort in its current format. If you've been thinking of enrolling in this program, and want to take it with me, this is your last chance. Register here.

Across this cohort, we’ll practice continuous discovery habits—framing opportunities, tightening assumptions, running lean experiments, and aligning product trios on evidence-backed decisions. If you want a rigorous, repeatable system for turning customer insight into confident prioritization and compelling product strategy, I’d be thrilled to have you in the room.

Inspired by this post on Product Talk.

May 18, 2026
Behavioral Customer Data for Proactive SaaS Retention
Your cancellation dashboard can tell you who has already left. It cannot tell you which accounts are failing to reach value, why their behavior changed, or what your team should do while the relationship is still recoverable.

That is the real purpose of behavioral customer data. You are not trying to produce a more sophisticated churn report. You are building an operating system that turns observable behavior into a reason, an owner, and a timely response.

Start with the retention decision, not the dashboard

A risk score has no operational value if nobody knows what to do when it changes. Before choosing events, dashboards, or models, write down the retention decisions your data must support.

For every proposed signal, define a decision contract:
- Trigger: What behavior changed, started, stopped, failed, or never happened?
- Interpretation: What customer state might that behavior indicate?
- Owner: Should product, customer success, support, solutions engineering, or billing respond?
- Intervention: What is the smallest useful action that could remove the obstacle?
- Success signal: Which subsequent behavior would show that the customer is back on a value path?
- Expiration rule: When should the alert or intervention stop so the customer is not repeatedly contacted?
This contract prevents a common failure: treating all declining activity as the same problem. A customer who cannot finish an integration needs a different response from an activated customer whose core usage suddenly drops. A payment problem is different again. Combining them into one generic churn-risk label hides the information required to help.
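One way to make the contract enforceable is to store it as data and let the expiration rule decide whether an alert is still live. The sketch below uses hypothetical field names and a made-up 14-day expiry purely as an example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SignalContract:
    trigger: str
    interpretation: str
    owner: str
    intervention: str
    success_signal: str
    expires_after: timedelta


def is_active(contract: SignalContract, triggered_at: datetime,
              now: datetime, success_seen: bool) -> bool:
    """An alert stops once the success signal appears or the expiry passes."""
    if success_seen:
        return False
    return now - triggered_at < contract.expires_after


# Illustrative contract; the values are assumptions, not recommendations.
blocked_integration = SignalContract(
    trigger="integration connection failed",
    interpretation="setup is blocked before first value",
    owner="solutions engineering",
    intervention="send the provider-specific setup guide",
    success_signal="connection completed",
    expires_after=timedelta(days=14),
)
triggered = datetime(2026, 5, 1)
```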

The signal also needs to match the product’s natural rhythm. Daily inactivity can matter in a daily workflow, but the same rule will create false alarms for a workflow used weekly or at the end of a reporting cycle. Compare behavior with the expected use pattern for the account’s persona, plan, lifecycle stage, and use case.
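A cadence-aware inactivity rule might look like the sketch below. The expected gaps and the tolerance multiplier are placeholder values; in practice they would come from the observed pattern for each persona, plan, and use case.

```python
from datetime import date

# Placeholder expected gaps between core actions, in days, per cadence.
EXPECTED_GAP_DAYS = {"daily": 1, "weekly": 7, "monthly": 31}


def is_unusually_inactive(last_core_action: date, today: date,
                          cadence: str, tolerance: float = 2.0) -> bool:
    """Flag inactivity only when the gap exceeds the expected rhythm."""
    gap_days = (today - last_core_action).days
    return gap_days > EXPECTED_GAP_DAYS[cadence] * tolerance


last_seen = date(2026, 5, 1)
today = date(2026, 5, 6)   # a five-day gap
```

With these placeholder settings, the same five-day gap is flagged for a daily workflow and ignored for a weekly one.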

I would design backward from a small set of decisions rather than forward from every event that happens to be available. The most useful leading indicators usually describe activation, time-to-first-value, depth of feature adoption, usage momentum, friction, and expansion intent. Each tells you something different about whether value is beginning, recurring, weakening, or growing.

Instrument the path from first value to recurring value

Measure value at the account level

In B2B SaaS, the person clicking is not always the entity that retains. Users perform actions, while the account usually owns the subscription. Your model therefore needs both a reliable user identity and an account identity, plus a record of which users belonged to which account when the behavior occurred.

This distinction matters when roles differ. An administrator may configure the product once, an operator may use the core workflow repeatedly, and an executive may only view outcomes. A login-frequency rule applied equally to all three will misclassify healthy behavior as disengagement. Define the value-producing behavior for each relevant persona, then roll those behaviors into an account-level state.
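The roll-up can be sketched as a lookup from persona to value-producing behavior. The persona names and event names below are hypothetical, and the example uses a static membership map, so it ignores the question of which users belonged to the account at the time of each event.

```python
# Hypothetical mapping from persona to the behavior that shows value for it.
VALUE_BEHAVIOR = {
    "administrator": "workspace_configured",
    "operator": "core_workflow_completed",
    "executive": "outcome_report_viewed",
}


def personas_missing_value(members: dict, events: list) -> list:
    """Return personas on the account that have not shown their value behavior.

    `members` maps user_id to persona; `events` is a list of
    (user_id, event_name) pairs for one account.
    """
    reached = set()
    for user_id, event_name in events:
        persona = members.get(user_id)
        if persona is not None and VALUE_BEHAVIOR.get(persona) == event_name:
            reached.add(persona)
    return sorted(set(members.values()) - reached)


members = {"u1": "administrator", "u2": "operator", "u3": "executive"}
events = [
    ("u1", "workspace_configured"),
    ("u2", "core_workflow_completed"),
    ("u3", "login"),                     # a login alone is not the executive's value behavior
    ("u9", "core_workflow_completed"),   # unknown user, ignored
]
missing = personas_missing_value(members, events)
```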

Map the customer journey around observable value states:
- Setup: The account has supplied the prerequisites required to attempt the core workflow.
- Activation: The account has completed a meaningful milestone that indicates initial value, not merely finished an onboarding screen.
- Recurring value: The core workflow is being completed at a cadence consistent with the use case.
- Adoption depth: The account is using the capabilities required to obtain more complete or durable value.
- Friction: Attempts, errors, failed integrations, or support interactions indicate that progress is being blocked.
- Expansion intent: Behavior indicates a new use case, broader adoption, or interest in a relevant upgrade path.
Your activation milestone is the pivotal definition. It should represent the earliest behavior that credibly demonstrates value. Completing profile fields or dismissing a tour may be easy to measure, but neither proves that the customer accomplished the job for which the product was purchased.

Do not force one milestone across materially different use cases. If a plan, persona, or workflow changes the way value is produced, define the appropriate milestone for that segment. You can still report a common activation outcome while preserving the underlying reason an account qualified.

Use a minimal tracking contract

Once the value path is clear, instrument attempts, completions, failures, and meaningful outcomes along that path. A useful event contract includes:
- A stable event name with a documented business meaning.
- The user and account identifiers required for identity resolution.
- The time the behavior actually occurred, not only the time it reached the analytics system.
- The persona, plan, lifecycle stage, and use case needed for segmentation.
- The product object or workflow involved.
- A normalized outcome or error category when the action can fail.
- The event owner and the process for approving semantic changes.
For an integration workflow, for example, separate connection attempted, connection completed, and connection failed. Attach the provider and a controlled error category. Do not attach credentials, tokens, raw request bodies, or unrestricted personal information. Those fields create security and privacy exposure without improving the retention decision.
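A minimal validator for the integration example could enforce both halves of that rule: required context must be present, and sensitive fields must be absent. The field names, error categories, and provider value are assumptions made for illustration.

```python
ALLOWED_EVENTS = {"connection_attempted", "connection_completed", "connection_failed"}
ERROR_CATEGORIES = {"authentication", "timeout", "permission", "unknown"}
REQUIRED_FIELDS = {"event", "user_id", "account_id", "occurred_at", "provider"}
FORBIDDEN_FIELDS = {"credentials", "token", "request_body"}


def validate_event(event: dict) -> list:
    """Return contract violations for one event; an empty list means it passes."""
    problems = []
    present = set(event)
    for name in sorted(REQUIRED_FIELDS - present):
        problems.append(f"missing field: {name}")
    for name in sorted(FORBIDDEN_FIELDS & present):
        problems.append(f"forbidden field: {name}")
    if event.get("event") not in ALLOWED_EVENTS:
        problems.append("unknown event name")
    # A failure is only useful for segmentation if its category is controlled.
    if (event.get("event") == "connection_failed"
            and event.get("error_category") not in ERROR_CATEGORIES):
        problems.append("connection_failed needs a controlled error_category")
    return problems


valid = {
    "event": "connection_failed",
    "user_id": "u1",
    "account_id": "a1",
    "occurred_at": "2026-05-18T09:30:00Z",
    "provider": "example-crm",
    "error_category": "timeout",
}
```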

The foundation is a clean event taxonomy, dependable identity resolution, and privacy-by-design. Capture only what the decision requires. If support sentiment is useful, prefer a governed derived category over copying unrestricted support conversations into an analytics platform. Keep sensitive material in the controlled system that already owns it.

Before using any event in a risk score, ask product, data, and customer success to reconstruct the same account timeline. Check for duplicate events, delayed delivery, internal or test traffic, users mapped to the wrong account, plan changes that were not propagated, and renamed events with conflicting meanings. If those teams see different stories, automation will only distribute the disagreement faster.
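Some of those checks can be automated before the teams compare timelines. The sketch below covers three of them (duplicates, delayed delivery, and internal traffic) and assumes events carry an `event_id` and epoch-second timestamps; those field names and the delay threshold are hypothetical.

```python
def audit_timeline(events: list, internal_accounts: set, max_delay_seconds: int) -> dict:
    """Group suspicious event identifiers by the kind of data-quality problem."""
    seen_ids = set()
    report = {"duplicate": [], "late": [], "internal": []}
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            report["duplicate"].append(event_id)
            continue
        seen_ids.add(event_id)
        # Compare when the behavior occurred with when analytics received it.
        if event["received_at"] - event["occurred_at"] > max_delay_seconds:
            report["late"].append(event_id)
        if event["account_id"] in internal_accounts:
            report["internal"].append(event_id)
    return report


timeline = [
    {"event_id": "e1", "account_id": "a1", "occurred_at": 1000, "received_at": 1005},
    {"event_id": "e1", "account_id": "a1", "occurred_at": 1000, "received_at": 1005},
    {"event_id": "e2", "account_id": "a1", "occurred_at": 1000, "received_at": 9000},
    {"event_id": "e3", "account_id": "internal-qa", "occurred_at": 2000, "received_at": 2001},
]
report = audit_timeline(timeline, internal_accounts={"internal-qa"}, max_delay_seconds=3600)
```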

It is also safer to trigger interventions from a derived account state than directly from a raw event. A raw event says that something happened. An account state says whether activation is incomplete, recurring value has weakened, an integration is blocked, or a commercial issue is unresolved. That state can carry a reason code, observation time, data-quality status, and expiration rule into the product, lifecycle messaging, or customer success workflow.
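As a sketch of a derived state, the function below turns a handful of account facts into one state with the accompanying fields. The fact names, the precedence of the checks, and the seven-day expiry are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AccountState:
    state: str
    reason_code: str
    observed_at: datetime
    data_quality: str
    expires_at: datetime


def derive_state(facts: dict, observed_at: datetime) -> AccountState:
    """Collapse account facts into a single state (hypothetical precedence)."""
    if facts["payment_failed"]:
        state, reason = "commercial_issue_unresolved", "payment_failed"
    elif facts["integration_failures"] > 0 and not facts["integration_connected"]:
        state, reason = "integration_blocked", "connection_failed"
    elif not facts["activated"]:
        state, reason = "activation_incomplete", "milestone_not_reached"
    elif facts["core_actions_recent"] < facts["core_actions_expected"]:
        state, reason = "recurring_value_weakened", "below_expected_cadence"
    else:
        state, reason = "healthy", "none"
    quality = "ok" if facts["identity_resolved"] else "needs_review"
    # Placeholder expiry so downstream workflows stop acting on old states.
    return AccountState(state, reason, observed_at, quality, observed_at + timedelta(days=7))


facts = {
    "payment_failed": False,
    "integration_failures": 2,
    "integration_connected": False,
    "activated": False,
    "core_actions_recent": 0,
    "core_actions_expected": 4,
    "identity_resolved": True,
}
current = derive_state(facts, datetime(2026, 5, 18, 9, 0))
```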

Build a risk score people can challenge and act on

You do not need a black-box model to begin. A transparent rule set is often more useful because product and customer success can inspect the evidence, dispute a weak assumption, and choose the correct response.
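A transparent rule set can be as small as a list of named rules whose reasons travel with the score. The sketch below is illustrative: the rule names, weights, thresholds, and account fields are placeholders that a team would replace and dispute.

```python
# Each rule is (reason, weight, predicate). Weights and thresholds are placeholders.
RULES = [
    ("activation incomplete", 3, lambda a: not a["activated"]),
    ("core usage more than halved", 2,
     lambda a: a["core_actions_now"] * 2 < a["core_actions_before"]),
    ("integration blocked", 2, lambda a: a["integration_failures"] > 0),
    ("payment failed", 1, lambda a: a["payment_failed"]),
]


def score_account(account: dict) -> tuple:
    """Return (score, reasons) so reviewers can see which rules fired."""
    fired = [(reason, weight) for reason, weight, predicate in RULES if predicate(account)]
    return sum(weight for _, weight in fired), [reason for reason, _ in fired]


account = {
    "activated": True,
    "core_actions_now": 4,
    "core_actions_before": 10,
    "integration_failures": 0,
    "payment_failed": False,
}
score, reasons = score_account(account)
```

Keeping the reasons next to the number is the point of the design: the owner can route a payment failure and a usage drop to different responses even when the totals match.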

A practical account score can combine several distinct dimensions:
May 18, 2026
Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

There’s a question that runs underneath every AI Agent evaluation: what can it do?

Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

“Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, and sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, and variations that have never been written down because they’ve always lived in someone’s head.

Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

Most companies have the first two, some have the third, and fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding: walking users through processes step by step rather than executing changes on their behalf. It worked. It resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI.” They had more standardized operations: actions that were already well defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior; it was triggering actions that were already tightly controlled elsewhere.

Readiness enables capability, not the other way around. That reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

The more useful frame isn’t “are we ready?” It’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.

Inspired by this post on The Intercom Blog.

May 18, 2026

How to Validate Behavioral Heatmap Accuracy Before You Act

Your heatmap puts a bright cluster on the primary call to action, and the next step seems obvious: move the button, rewrite the copy, or prioritize a mobile redesign. Pause before turning that picture into a roadmap decision. A heatmap can look coherent while representing the wrong interface state, assigning clicks to the wrong element, or combining users whose layouts are materially different.

Behavioral heatmap accuracy is not about whether the colors look plausible. It is about whether each recorded interaction appears on the interface the user actually encountered, within the correct context, and supports the conclusion you want to draw. You need to validate that chain before you act on the pattern.

Treat accuracy as a chain, not a single metric

There is no single accuracy score that makes a heatmap trustworthy. Four separate conditions have to hold:
- Capture fidelity: The background image represents the relevant product state. The release, page structure, loaded content, navigation, overlays, and experiment variant should match what generated the interactions.
- Placement fidelity: A click is attached to the intended interface element after responsive reflow, personalization, localization, and other layout changes. A precise coordinate on the wrong screenshot is still wrong.
- Population fidelity: The map contains the users, devices, variants, and product states relevant to your decision. An aggregate can be mathematically correct while describing an interface that no individual user experienced.
- Inference fidelity: The visualization can support the claim being made. A click establishes an interaction, not the user’s motivation. Scroll depth establishes reach, not attention, comprehension, or persuasion.

Reliable screenshot capture, selector-based placement, automatic device detection, and clearer scrollmaps address important failure modes in this chain. They reduce ambiguity, but they do not eliminate the need to inspect your product states, filters, selectors, and supporting evidence.

The weakest link determines whether the map is useful. Perfect element placement cannot rescue a screenshot from an old release. Clean device segmentation cannot justify a claim about user intent. Before discussing what the hot area means, establish what was captured, where it was placed, and whose behavior was included.

Run a validation pass before reading the colors

Use the same validation sequence whenever a heatmap is about to influence an experiment, design change, or roadmap priority. This turns accuracy from a vague feeling into a reviewable process.
1. Write down the decision first. Be specific: move the primary action, remove a section, change the activation path, or investigate a mobile interaction. This tells you which page states and elements require the strongest validation.
2. Freeze the analysis scope. Record the screen or template, analysis window, release, experiment variant, device class, and user segment. If the interface changed during the selected window, split the data or identify the limitation rather than treating the period as one stable experience.
3. Build a state matrix. List only the states that materially alter the interface: desktop and mobile layouts, relevant locales, personalized variants, authenticated and unauthenticated views, expanded and collapsed components, or overlays that cover the underlying page. You do not need every possible segment. You need every state capable of moving, replacing, hiding, or duplicating the elements involved in your decision.
4. Compare the screenshot with each relevant state. Check the order and size of major sections, sticky navigation, banners, modals, lazy-loaded content, and conditional components. If the displayed background is stale or combines interactions from incompatible layouts, stop interpreting the map and repair the capture or filtering first.
5. Test element placement. In a controlled recorded session, interact with the target and with nearby controls that could be confused with it. Repeat the check on the layouts that move the element. The target’s hotspot should remain attached to the target rather than to an old coordinate. Exclude the controlled session from normal analysis when your tooling allows it.
6. Inspect critical selectors. Ask engineering to confirm that each selector identifies the intended component across the templates and states in scope. Pay particular attention to repeated cards, reused button components, translated labels, and responsive navigation. If adjacent actions collapse into one hotspot, the map is not suitable for deciding between those actions.
7. Reconcile the picture with events and replay. Apply equivalent page, date, device, user, and variant filters before comparing evidence. Exact numerical agreement is only a reasonable expectation when the systems use the same interaction definition and filters. Otherwise, document why their coverage differs and investigate unexplained gaps.
8. Assign a confidence grade. Mark the map as decision-grade, directional, or invalid. Decision-grade means the relevant states and placements were verified. Directional means a pattern is visible but a known limitation prevents a precise conclusion. Invalid means the visual representation is wrong for the proposed decision.

For a critical call to action, treat any reproducible placement error as a blocker. A hotspot that sometimes lands on a neighboring control can reverse the apparent preference between the two controls. Fix the representation before discussing design implications.
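
The grading step can be written down as a small rule so that reviewers apply it the same way each time. The following is a minimal Python sketch, assuming each of the four fidelity conditions is recorded as one of three hypothetical outcomes. The names `Check`, `ValidationResult`, and `grade` are illustrative and do not come from any heatmap tool.

```python
from dataclasses import dataclass
from enum import Enum


class Check(Enum):
    PASS = "pass"        # verified for every state in scope
    LIMITED = "limited"  # a known limitation remains, but the pattern is visible
    FAIL = "fail"        # the representation is wrong for the proposed decision


@dataclass(frozen=True)
class ValidationResult:
    capture: Check
    placement: Check
    population: Check
    inference: Check


def grade(result: ValidationResult) -> str:
    """Return the confidence grade; the weakest check decides the outcome."""
    checks = [result.capture, result.placement, result.population, result.inference]
    if Check.FAIL in checks:
        return "invalid"
    if Check.LIMITED in checks:
        return "directional"
    return "decision-grade"
```

Because the weakest link decides, a single failed check makes the map invalid no matter how clean the other three are.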

Split heatmaps when the interface or interaction model changes

Segmentation is not merely an analytical refinement. It is part of measurement accuracy. Mobile and desktop users may see different navigation, stacking order, content length, control size, and interaction affordances. Combining them can create a vivid composite that corresponds to neither experience.

Use a simple rule: split the map whenever a cohort can encounter different geometry, different elements, or a different way of interacting. Check these questions before aggregating:
- Does the same element exist in every included state?
- Does it keep the same purpose and selector?
- Does responsive behavior move it relative to neighboring elements?
- Does a variant, locale, or personalized state change the surrounding content?
- Are touch and pointer interactions being interpreted in a comparable way?
- Did a release alter the template during the selected analysis window?

If any answer exposes a material difference, inspect separate maps first. You can compare the resulting patterns afterward, but you should not use the blended view as the primary evidence.
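
If you keep a state matrix, the split rule can be checked in code before anyone opens the blended map. This is a rough Python sketch under stated assumptions: `InterfaceState` and its fields are hypothetical descriptors you would fill in by hand or from your own tooling, and the check is deliberately conservative in treating any difference as material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceState:
    """Hypothetical description of how one cohort sees the target element."""
    label: str
    element_present: bool
    selector: str
    layout_signature: str  # assumed identifier for the geometry around the target
    input_mode: str        # "touch" or "pointer"
    release: str


def needs_split(states: list[InterfaceState]) -> bool:
    """True when the included states differ in a way that could distort a blended map."""
    # The element must exist in every included state.
    if any(not s.element_present for s in states):
        return True
    # Any disagreement on selector, geometry, input mode, or release forces a split.
    compared = ("selector", "layout_signature", "input_mode", "release")
    return any(len({getattr(s, name) for s in states}) > 1 for name in compared)
```

A stricter or looser rule may suit your product. The point is to make the aggregation decision explicit rather than accidental.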

Scrollmaps need the same discipline. The same depth percentage can correspond to different content when a mobile page stacks sections that sit side by side on desktop. Compare scroll behavior within consistent layouts, then map each depth region to the actual value proposition, trust element, form, or call to action shown there. Scroll reach tells you that a region became reachable within the journey; it does not prove that the person read or understood it.

Match the decision to what the evidence can prove

Even a technically accurate heatmap is an observation layer. It can show where interactions accumulated or how far sessions progressed. It cannot, by itself, tell you why the behavior occurred or whether a proposed design change will improve an outcome.

Use an evidence ladder instead of promoting every hotspot directly into the backlog:
- Heatmaps locate the pattern. They help you identify concentrated clicks, neglected controls, competing actions, and sections reached by fewer sessions.
- Event data measures the associated behavior. Use it to determine whether the interaction registered, where it sits in the funnel, and whether it connects to the micro-conversion or product outcome you care about.
- Session replay supplies sequence and context. Inspect what happened immediately before and after the interaction, including overlays, loading states, repeated attempts, navigation changes, and other conditions that an aggregate view hides.
- A controlled experiment evaluates the proposed change. When the claim is that a different placement, label, or layout will improve an outcome, compare that change against a baseline rather than treating the heatmap as causal proof.

The combination also helps you diagnose apparent contradictions. A strong hotspot with no corresponding outcome event may indicate a broken interaction, incomplete instrumentation, or an action whose result is unclear. Low interaction on content that few sessions reach is first a placement or journey question, not automatically a copy problem. High scroll reach with low interaction means the region was available to users, but it does not establish that they noticed or rejected its message. A hotspot outside the visible target is a measurement defect, not a behavioral insight.

Translate each finding into the next appropriate action:
- If the screenshot, selector, or segment is wrong, create an instrumentation or analytics repair.
- If the behavior is verified but its explanation is uncertain, create a discovery question and inspect relevant replays.
- If the behavior is verified and tied to an outcome gap, define a hypothesis and an A/B test.
- If the evidence reveals a reproducible interaction defect, prioritize the defect without disguising it as a preference experiment.

This language matters in product reviews. Say that you observed a pattern, verified its representation, formed a hypothesis, and selected the next test. Do not say that users prefer, understand, ignore, or want something unless your evidence can support that stronger claim.

Key takeaways
- A heatmap is decision-grade only when the captured state, element placement, population, and proposed inference all align.
- Validate the critical target and its neighboring controls across every layout that can move or replace them.
- Split device classes, variants, releases, locales, or personalized states when they produce materially different interfaces.
- Read scroll depth as reach and click concentration as interaction. Neither measure establishes attention, intent, or causality.
- Pair heatmaps with event data and session replay, then use a controlled experiment when your decision depends on predicted impact.

At your next heatmap review, do not begin with the hottest color. Begin with the screenshot, segment label, release, and one critical interaction traced from capture through outcome. If that path survives validation, turn the pattern into a hypothesis or product action. If it does not, fix the measurement before it becomes roadmap evidence.

References
- Amplitude – Amplitude Heatmaps Rebuilt: Rock-Solid Screenshots, Precise Placement, Smarter Scrollmaps

May 15, 2026

Governed AI Analytics in Financial Services: A Playbook

You have a credible AI analytics use case, product teams want access, and risk leaders want proof that the system will not expose sensitive data or influence the wrong decision. The mistake is to settle that tension with a broad choice between “innovation” and “control.” That choice is too vague to operate.

Start with a narrower question: what decision may this system influence, using which data, under whose authority, with what evidence afterward? Once those boundaries are explicit, you can give teams meaningful speed without asking compliance to accept an invisible risk.

Classify the decision before you assess the AI

Many AI reviews begin with the model: where it is hosted, how it was trained, or whether it can explain an answer. Those questions matter, but they do not establish the business risk. The same model can summarize an approved dashboard, flag an unusual transaction pattern, or help determine an outcome that affects a customer. Those are not equivalent uses.

Classify each use case by consequence, reversibility, and action authority. Consequence asks what happens if the output is wrong. Reversibility asks whether a person can correct the result before harm occurs. Action authority asks whether the system informs a person, recommends an action, or executes one.

Use case pattern	Permitted role for AI	Control that matters most	Boundary to make explicit
Descriptive analysis	Summarize approved metrics or behavioral patterns	Data permissions and traceable metric definitions	The output cannot create a new customer-level action
Investigative signal	Surface anomalies or suspicious patterns for review	Analyst validation, evidence capture, and disposition logging	A signal is not a finding or a verdict
Product recommendation	Suggest an intervention, workflow, or experiment	Human approval and outcome monitoring	The recommendation cannot bypass existing approval paths
Customer-affecting decision	Support a formally governed decision process	Documented oversight, explainability, and accountable human authority	The final authority and escalation path must be unambiguous

This classification prevents two common errors. The first is applying the heaviest possible review to every analytical assistant, which sends teams into unofficial tools and manual workarounds. The second is treating every output as “just an insight” even when a downstream workflow turns it into a customer action.

Trace the output one step beyond the interface. If an anomaly score enters a case-management queue, changes account handling, or triggers outreach, govern that downstream effect as part of the use case. A recommendation does not become low risk merely because a person clicks the final button.

Before development begins, write an allowed-action statement and a prohibited-action statement. For example: “The system may prioritize patterns for analyst investigation. It may not label a customer, close a case, or initiate an external action.” That pair of sentences is more operationally useful than calling the project “medium risk.”
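
The two statements become more useful when they can be checked by the workflow itself. Here is a minimal Python sketch, assuming actions are identified by short string names. `ActionPolicy` and the action names are hypothetical and chosen only to mirror the example sentences above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPolicy:
    allowed: frozenset[str]
    prohibited: frozenset[str]

    def __post_init__(self) -> None:
        # An action cannot appear in both statements.
        overlap = self.allowed & self.prohibited
        if overlap:
            raise ValueError(f"actions both allowed and prohibited: {sorted(overlap)}")

    def permits(self, action: str) -> bool:
        """Deny by default: only explicitly allowed actions pass."""
        return action in self.allowed and action not in self.prohibited


# Hypothetical action names mirroring the example statements.
policy = ActionPolicy(
    allowed=frozenset({"prioritize_pattern_for_investigation"}),
    prohibited=frozenset({"label_customer", "close_case", "initiate_external_action"}),
)
```

Denying anything that is not explicitly allowed is a design choice in this sketch. It keeps an unlisted action from slipping through just because nobody thought to prohibit it.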

Risk and compliance leaders still need to map the use case to the organization’s actual legal and regulatory obligations. A product risk classification is an operating tool, not a legal conclusion. When a use case could affect access, eligibility, pricing, fraud treatment, or another consequential outcome, obtain the appropriate compliance and legal review before activation.

Turn governance principles into an enforceable contract

Principles such as fairness, privacy, transparency, and human oversight do not control a production workflow by themselves. Each principle needs an owner, an enforcement point, and evidence that the control operated. I treat that combination as the governance contract for the use case.

Define the data boundary

List the approved data domains, fields, purposes, environments, and user groups. Do not stop at “customer data” or “analytics data.” Those labels are too broad to enforce. State which attributes the system can retrieve, which identifiers it can display, whether results may be exported, and where generated outputs may be stored.

Purpose: the business question the data may be used to answer.
Permitted inputs: the approved events, attributes, aggregates, and reference data.
Prohibited inputs: data classes that the workflow must never retrieve or infer.
Permitted users: roles allowed to query, review, approve, or export results.
Output handling: where results may be displayed, retained, shared, or reused.
Failure behavior: what the system does when permission, provenance, or confidence is insufficient.

Enforce that boundary with role-based access controls and granular permissions at retrieval time. Filtering an answer after a model has received restricted data is not equivalent to preventing access. The model, retrieval layer, analytics service, export path, and destination workflow all need to respect the same user identity and policy context.
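
To make the difference between pre-retrieval enforcement and post-hoc filtering concrete, consider a check that runs before any data is fetched. This is an illustrative Python sketch only. `DataBoundary`, `AccessDenied`, and `authorize_retrieval` are assumed names, and a real deployment would delegate this to the platform's own access-control layer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataBoundary:
    purpose: str
    permitted_fields: frozenset[str]
    prohibited_fields: frozenset[str]
    permitted_roles: frozenset[str]


class AccessDenied(Exception):
    """Raised before any data is fetched, with a reason the user can act on."""


def authorize_retrieval(
    boundary: DataBoundary, role: str, purpose: str, requested_fields: Iterable[str]
) -> frozenset[str]:
    """Return the fields that may be retrieved, or raise AccessDenied."""
    requested = frozenset(requested_fields)
    if role not in boundary.permitted_roles:
        raise AccessDenied(f"role '{role}' is not permitted for this workflow")
    if purpose != boundary.purpose:
        raise AccessDenied("request purpose does not match the approved purpose")
    blocked = requested & boundary.prohibited_fields
    if blocked:
        raise AccessDenied(f"prohibited fields requested: {sorted(blocked)}")
    unapproved = requested - boundary.permitted_fields
    if unapproved:
        raise AccessDenied(f"fields outside the approved boundary: {sorted(unapproved)}")
    return requested
```

The retrieval call would run only on the returned field set, so restricted data never reaches the model in the first place.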

Assign decision rights to named roles

A committee can set policy, but it cannot own every operational decision. Give each use case an accountable product owner, a data owner, a control owner, and a business reviewer. Clarify who can approve launch, who can change the data scope, who reviews exceptions, and who has authority to stop the workflow.

The product owner defines the user problem, allowed action, prohibited action, and business outcome.
The data owner approves the data purpose, quality expectations, permissions, and reuse limits.
The risk or compliance owner maps policy obligations to testable controls and reviews material exceptions.
The platform or security owner implements identity, access, isolation, logging, and change controls.
The business reviewer accepts, rejects, or escalates outputs and records why.

Keep the decision rights close to the workflow. If a reviewer sees an unsupported conclusion, that person needs a clear way to reject it, preserve the evidence, and route the issue. If every exception disappears into a general governance inbox, the formal control will be bypassed when operational pressure rises.

Design the audit record before launch

An audit trail should reconstruct what happened without relying on someone’s memory. Capture the requesting identity and role, the approved purpose, the data and metric definitions used, the system configuration, the generated result, any human review, the resulting action, and later corrections or overrides.

Logging creates its own data risk. Prompts, retrieved context, generated explanations, and reviewer notes can contain sensitive information. Protect the audit store with appropriate access, retention, and segregation rather than treating logs as harmless operational exhaust. Where policy permits, record protected references to sensitive records instead of duplicating raw payloads.
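
One way to picture such a record is as a fixed schema that stores references rather than raw content. The Python sketch below is an assumption-laden illustration: the field names are hypothetical, and a SHA-256 digest stands in for the protected reference. A production system would more likely use a keyed reference or a pointer into an access-controlled store, since a plain digest of low-entropy data can be guessed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def reference_for(payload: str) -> str:
    """Stand-in for a protected reference: a digest instead of the raw payload."""
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    requester_id: str
    requester_role: str
    approved_purpose: str
    metric_definitions: tuple[str, ...]
    system_config_version: str
    result_ref: str                       # reference to the generated result
    reviewer_id: Optional[str] = None
    disposition: Optional[str] = None
    resulting_action: Optional[str] = None
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record for an append-only audit store."""
        return json.dumps(asdict(self), sort_keys=True)
```

Later corrections or overrides would be written as new records that point at the original, so the history stays reconstructable.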

A practical platform evaluation should test whether the system combines strong data governance, auditable AI behavior, secure scale, and a direct connection to product outcomes. A policy document that cannot be enforced in the workflow is not enough, and a platform control without an accountable operating process is not enough either.

Put controls inside the workflows people actually use

Governance fails when it exists as a review ceremony around the product rather than a behavior inside it. Analysts should not have to remember a separate policy every time they ask a question. The approved data scope, identity context, review step, and evidence capture should travel with the task.

Behavioral analytics: govern the meaning as well as the data

Behavioral analytics can reveal how customers move through onboarding, self-service, support, payments, and other product journeys. The danger is not limited to unauthorized access. An AI system can also combine valid events into a misleading interpretation of customer intent.

Start the workflow with curated event definitions and approved business metrics. Require the output to expose the cohort definition, time context, filters, exclusions, and comparison used. The analyst should be able to inspect the path from a narrative claim back to the underlying measure before sharing it.

Separate observation from inference in the interface. “Users in this cohort abandoned the flow after this step” is an observation tied to event data. “They abandoned because they distrusted the process” is a hypothesis. Labeling those differently prevents fluent language from turning a plausible explanation into an unsupported fact.

Anomaly detection: route a signal into investigation, not judgment

An anomaly means a pattern differs from an expected baseline. It does not establish fraud, customer intent, system abuse, or operational error. Treat anomaly detection as a prioritization mechanism unless a separately governed process establishes something more.

Give the reviewer the observed deviation, relevant context, the comparison baseline, and links to permitted evidence. Capture the reviewer’s disposition: confirmed issue, expected behavior, insufficient evidence, data-quality problem, or escalation. That disposition is both an audit artifact and a feedback signal for improving the workflow.

Watch the operational burden as closely as the detection capability. A flood of weak signals can make the nominal control less safe because reviewers rush, defer, or stop trusting the queue. Monitor false positives, unresolved escalations, overrides, and the reasons analysts reject outputs. When those indicators deteriorate, reduce scope or pause automated routing while the cause is investigated.
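
Reviewer dispositions can feed a simple health check on the queue. The following Python sketch assumes dispositions are logged as the five labels named above. Treating the share of "expected behavior" outcomes as a proxy for false-positive burden, and the 0.8 pause threshold, are illustrative assumptions rather than recommended values.

```python
from collections import Counter

# The five reviewer dispositions, written as hypothetical string labels.
DISPOSITIONS = frozenset({
    "confirmed_issue",
    "expected_behavior",
    "insufficient_evidence",
    "data_quality_problem",
    "escalation",
})


def disposition_shares(dispositions: list[str]) -> dict[str, float]:
    """Share of reviewed signals that ended in each disposition."""
    unknown = set(dispositions) - DISPOSITIONS
    if unknown:
        raise ValueError(f"unknown dispositions: {sorted(unknown)}")
    total = len(dispositions)
    counts = Counter(dispositions)
    return {
        name: (counts[name] / total if total else 0.0)
        for name in sorted(DISPOSITIONS)
    }


def should_pause_routing(shares: dict[str, float], max_expected_share: float = 0.8) -> bool:
    """Assumed rule: pause when too many signals turn out to be expected behavior."""
    return shares["expected_behavior"] > max_expected_share
```

Whatever threshold you choose, agree on it before launch so that a deteriorating queue triggers a decision instead of a debate.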

Self-service analysis: give teams a governed lane

Product managers and analysts need enough freedom to explore without sending every question through a central approval queue. Create a governed workspace containing approved metrics, documented data products, role-aware access, and restricted export paths. Let people iterate freely inside that lane while changes to data scope, decision authority, or external activation trigger a new review.

Make the boundary visible. Users should know when an answer is based on incomplete data, when a metric is not approved for customer-level decisions, and when an output cannot be exported. A silent denial encourages workarounds; a clear denial that identifies the policy boundary gives the user a legitimate next step.

Do not give an analytics assistant write access to operational systems merely because the integration is convenient. Insight generation and action execution are separate privileges. Connect them only when the action, reviewer, failure mode, and rollback path have been governed explicitly.

Pilot with evidence, not a polished demonstration

A convincing demo proves that the happy path works. A governed pilot must also prove that the system refuses the wrong request, exposes enough evidence for review, and leaves a usable record when something goes wrong.

Choose a narrow workflow with an identifiable user, a bounded data set, a reviewable output, and a business outcome you already understand. Avoid beginning with an enterprise-wide assistant or an autonomous action layer. Broad scope makes it difficult to distinguish model behavior, data problems, permission failures, and process gaps.

Write the decision contract. Record the user, purpose, permitted inputs, allowed action, prohibited action, reviewer, and stop authority.
Configure the smallest useful data boundary. Include only the fields and metrics needed for the chosen workflow.
Test legitimate work. Confirm that authorized users can produce an insight, inspect its basis, and complete the intended review.
Test prohibited work. Attempt access with the wrong role, request excluded attributes, try an unauthorized export, and ask the system to take a prohibited action.
Test ambiguity and failure. Use incomplete context, conflicting metric definitions, missing permissions, and unavailable dependencies. Confirm that the system fails visibly and safely.
Reconstruct the event. Use the audit record to determine who requested the output, what information was used, what was generated, who reviewed it, and what happened next.
Change the system deliberately. Update a relevant configuration or model component and confirm that approval, documentation, testing, and monitoring follow the change.

Do not accept screenshots as evidence for controls that operate behind the interface. Ask the vendor or internal platform team to demonstrate a denied request, a permission change, a reviewer override, an exported audit record, and the behavior after a governed configuration change. The test should follow your use case and identities, not a generic demonstration tenant.

Measure value and control health together. If the system produces faster insights but increases unreviewed actions, weakens attribution, or creates an investigation backlog, it has not delivered a durable improvement.

Dimension	Question	Useful signals
Business value	Does the workflow improve a real product, growth, risk, or operational decision?	Time to a validated insight, useful investigations completed, issues resolved, and attributable product outcomes
Analytical quality	Can a reviewer verify the conclusion?	Accepted and rejected outputs, unsupported claims, metric-definition errors, and missing context
Control effectiveness	Did policy operate as designed?	Prohibited requests blocked, required reviews completed, permission exceptions, and audit-record completeness
Operational health	Can people sustain the workflow?	False-positive burden, unresolved escalations, overrides, rework, and reviewer backlog
Change safety	Do updates preserve the approved boundary?	Documented changes, completed regression checks, new failure patterns, and monitored post-change behavior

Set release gates in binary language. The use case has a named accountable owner or it does not. Permissions have been tested with unauthorized identities or they have not. High-impact outputs receive the required review or they do not. Audit evidence can reconstruct an event or it cannot. Ambiguous gates become exceptions as soon as delivery pressure appears.
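
Binary gates translate directly into a checklist that either passes or names what is missing. This is a minimal Python sketch. The gate keys are hypothetical labels for the four gates described above, and anything other than an explicit `True` counts as not met.

```python
# Hypothetical keys for the four release gates described in the text.
REQUIRED_GATES = (
    "named_accountable_owner",
    "permissions_tested_with_unauthorized_identities",
    "high_impact_outputs_reviewed",
    "audit_record_reconstructs_event",
)


def release_decision(gates: dict[str, object]) -> tuple[bool, list[str]]:
    """Return (ready, unmet_gates); a gate passes only if it is exactly True."""
    unmet = [name for name in REQUIRED_GATES if gates.get(name) is not True]
    return (not unmet, unmet)
```

A value such as "in progress" or a missing key fails the gate, which is the behavior you want when delivery pressure invites ambiguity.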

When the pilot is stable, reuse the control components rather than copying the entire use case. Standard identity propagation, data classification, audit schemas, reviewer workflows, and change gates can form a shared control plane. Each new use case still needs its own purpose, decision boundary, outcome measure, and risk assessment.

Key takeaways

Govern the decision the AI can influence, not just the model that produces the output.
Write both an allowed-action statement and a prohibited-action statement before development begins.
Enforce data permissions before retrieval and carry the user’s identity through analysis, export, and downstream action.
Treat human review as an operational workflow with evidence, dispositions, escalations, and stop authority.
Keep observations, hypotheses, recommendations, and customer-affecting decisions visibly distinct.
Test denial, ambiguity, change, and audit reconstruction alongside the happy path.
Track business value, analytical quality, control effectiveness, and operational burden on the same scorecard.

Your next move is not to draft an enterprise AI policy. Pick one live analytics workflow and write its decision contract on a single page. If you cannot name the allowed action, prohibited action, data boundary, reviewer, audit evidence, and stop authority, the workflow is not ready to scale. If you can, you have the foundation for AI analytics that product teams can use and risk leaders can defend.

References

Amplitude – Financial Services AI

May 15, 2026

How to Prove the ROI of an AI Product Before You Scale It

Your AI product is getting used. The demos land well, task completion is improving, and internal enthusiasm is high. Then the CFO asks a harder question: what changed in the business because this product exists?

You cannot answer that question with prompt volume, response quality, adoption, or tickets touched. You need a measurement system that separates activity from incremental value, counts the full operating cost, and makes risk visible before a rollout gets larger. Here is how to build one.

Start with the decision your ROI model must support

ROI is not a retrospective slide assembled after launch. It is a decision rule. Before development begins, decide what evidence would justify launching, scaling, redesigning, rolling back, or retiring the capability.

That distinction changes the conversation. Instead of asking whether the agent is accurate enough or popular enough, you ask whether a measurable change in customer behavior produces a measurable business result without crossing an unacceptable risk threshold.

Build a driver tree with four levels:

Company outcome: revenue growth, lower cost to serve, or reduced business risk.
Customer outcome: the user completes a valuable job, reaches value sooner, or resolves a problem without unnecessary effort.
Product behavior: the AI capability changes conversion, expansion, self-service completion, containment, handle time, or escalation.
Controllable lever: the team changes the workflow, model behavior, conversation design, human review, or product guidance.

The chain matters because a model metric is rarely a business metric. Better answer quality may improve task completion, which may improve trial-to-paid conversion. The ROI case depends on the full chain, not the first link.

Value path	Business outcome	Leading evidence	Guardrails
Revenue	Higher conversion, average order value, or expansion	Time-to-first-value and self-service completion	Errors, complaints, and policy violations
Cost	Lower cost to serve	Containment, deflection, and reduced handle time	Escalations, false resolution, and downstream customer harm
Risk	Lower frequency or impact of harmful failures	Human-review events and detected violations	False positives, false negatives, hallucinations, and security breaches

Choose one primary value path for the investment case. Revenue, cost, and risk can all appear on the scorecard, but declaring all three as primary makes it too easy to rescue a weak result with whichever metric moved after launch.

A support agent, for example, may appear successful because it contains more conversations. But containment is only valuable if customers actually resolve their problems. A conversation that never reaches a human can reduce measured support volume while increasing complaints or churn risk. This is why revenue, cost, and risk measures must be evaluated together.

Write the measurement contract before you build the dashboard

A measurement contract is a short agreement among product, data, finance, and the operational team affected by the AI workflow. It prevents the definitions, cost boundaries, and success thresholds from changing after results arrive.

Your contract should answer these questions:

Who is eligible? Define the users, accounts, tasks, channels, and exclusions. Do not mix workflows with materially different economics.
What is the intervention? Name the AI capability and the version being evaluated. A model, prompt, retrieval pipeline, policy, or escalation change can alter the result.
What is the primary outcome? Select the business metric that determines whether the hypothesis passed.
What are the leading indicators? Use measures such as time-to-first-value, containment, and self-service completion to diagnose movement before lagging results mature.
What are the guardrails? Predefine acceptable limits for errors, hallucinations, false positives, false negatives, escalations, complaints, security events, and policy violations.
What is the baseline? Freeze the comparison period or control group before exposing the eligible population to the capability.
How will incrementality be proven? Specify the experiment, holdout, assignment unit, and minimum detectable effect.
What costs count? Agree on model or API consumption, labeling, evaluation, human review, and ongoing oversight before calculating value.
What action follows each result? Record the thresholds for launch, scale, redesign, rollback, and retirement.
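
One way to keep the contract from staying a slide is to store it as a structured record and refuse to start measurement while any answer is blank. The sketch below is a minimal illustration, not a prescribed schema: the class name, the field names, and the `missing` helper are assumptions chosen to mirror the nine questions above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MeasurementContract:
    # Hypothetical field names, one per question in the contract.
    eligibility: str = ""
    intervention: str = ""
    primary_outcome: str = ""
    leading_indicators: list = field(default_factory=list)
    guardrails: dict = field(default_factory=dict)
    baseline: str = ""
    incrementality_design: str = ""
    cost_boundary: list = field(default_factory=list)
    decision_thresholds: dict = field(default_factory=dict)

    def missing(self) -> list:
        """Return the names of questions that are still unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative draft: only two of the nine questions have answers so far.
draft = MeasurementContract(
    eligibility="trial accounts on the self-serve plan",
    primary_outcome="trial-to-paid conversion",
)
unanswered = draft.missing()
```

A draft that still returns a non-empty `missing()` list is an exploration, not yet an ROI claim.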

The contract should distinguish an outcome OKR from an output OKR. Shipping the agent, generating responses, and increasing feature use are outputs. Improving conversion, lowering verified cost to serve, or reducing harmful failures are outcomes. Outputs can explain what happened, but they cannot establish value on their own.

Instrument the complete journey, not just the conversation

An AI log tells you what the model did. An ROI dataset must also tell you what the user did next.

Connect the journey from eligibility to business outcome:

The user or account became eligible for the capability.
The AI experience was offered, viewed, or engaged with.
A task was attempted, completed, abandoned, or repeated.
A response was accepted, corrected, regenerated, or sent for human review.
The interaction was contained, escalated, or handed to another workflow.
The downstream conversion, expansion, support, retention, or complaint event occurred.
The associated model cost, labeling work, and human-oversight cost were recorded.

Carry a stable user or account identifier, experiment assignment, agent version, and journey identifier across those events. Without that connective tissue, the team may have an impressive agent dashboard and no defensible way to attribute a business outcome to the experience.
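
As a rough sketch of what that connective tissue looks like in data, the example below attaches the same identifiers to every event and then computes the outcome rate per experiment arm over everyone who became eligible. The event names, the `JourneyEvent` record, and the helper function are illustrative assumptions, not a required schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyEvent:
    account_id: str     # stable identifier carried across every event
    journey_id: str
    assignment: str     # "treatment" or "control"
    agent_version: str
    event: str          # hypothetical names: "eligible", "converted", ...


def outcome_rate_by_arm(events, outcome_event):
    """Share of eligible accounts in each arm that reached the outcome."""
    eligible = defaultdict(set)
    reached = defaultdict(set)
    for e in events:
        if e.event == "eligible":
            eligible[e.assignment].add(e.account_id)
        elif e.event == outcome_event:
            reached[e.assignment].add(e.account_id)
    # Only accounts that were eligible count toward the numerator.
    return {
        arm: len(reached[arm] & accounts) / len(accounts)
        for arm, accounts in eligible.items()
    }


events = [
    JourneyEvent("a1", "j1", "treatment", "v3", "eligible"),
    JourneyEvent("a1", "j1", "treatment", "v3", "converted"),
    JourneyEvent("a2", "j2", "treatment", "v3", "eligible"),
    JourneyEvent("a3", "j3", "control", "v3", "eligible"),
]
rates = outcome_rate_by_arm(events, "converted")
```

Because the denominator is every eligible account in the arm, not only the accounts that engaged with the agent, the comparison stays tied to the eligibility definition in the contract.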

Use behavioral analytics and session replay to understand why a metric moved. Use journey mapping and retention analysis to locate the friction worth solving in the first place. Product tours and in-app guidance can then help eligible users reach a validated workflow. This creates a closed loop from journey friction to experiment and measurable outcome, instead of a collection of disconnected AI metrics.

Calculate economic value without turning activity into savings

Start with net business value:

Net business value = incremental revenue + cost avoided – total operating cost – quantified risk loss

If finance requires an ROI percentage, divide net business value by the agreed investment base. Keep both the numerator and denominator visible. A percentage without its cost boundary is easy to inflate and hard to audit.
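
The arithmetic is simple enough to keep in one small function that returns the numerator and the denominator next to the percentage. The figures below are made-up illustration values, and the names are assumptions rather than a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueInputs:
    incremental_revenue: float
    cost_avoided: float
    total_operating_cost: float
    quantified_risk_loss: float
    investment_base: float  # the base agreed with finance


def net_business_value(v: ValueInputs) -> float:
    return (v.incremental_revenue + v.cost_avoided
            - v.total_operating_cost - v.quantified_risk_loss)


def roi_report(v: ValueInputs) -> dict:
    """Return the ROI percentage together with its numerator and denominator."""
    if v.investment_base <= 0:
        raise ValueError("investment base must be positive")
    net = net_business_value(v)
    return {
        "net_business_value": net,
        "investment_base": v.investment_base,
        "roi_pct": 100.0 * net / v.investment_base,
    }


# Illustrative numbers only.
report = roi_report(ValueInputs(
    incremental_revenue=120_000.0,
    cost_avoided=40_000.0,
    total_operating_cost=90_000.0,
    quantified_risk_loss=10_000.0,
    investment_base=150_000.0,
))
```

With these inputs the net value is 60,000 against a base of 150,000, or 40 percent. Changing the investment base changes the percentage without changing the value created, which is why both figures belong in the report.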

Count only incremental revenue

Do not credit the AI product with every transaction it touched. Credit it with the difference between the exposed population and the valid control or holdout.

A practical revenue calculation is:

Incremental revenue = eligible volume × measured outcome lift × value per additional outcome

The measured outcome might be trial-to-paid conversion, self-service upsell, average order value, or expansion. Use the same eligibility definition, attribution window, and revenue treatment for the intervention and control. If the AI experience merely appears somewhere in a successful journey, that is influenced revenue, not proof of incremental revenue.
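
To make the gap between influenced and incremental revenue concrete, here is a small sketch with invented numbers. It assumes the lift is the difference in outcome rate between the exposed group and the control; the function names are hypothetical.

```python
def incremental_revenue(eligible_volume: int,
                        treated_rate: float,
                        control_rate: float,
                        value_per_outcome: float) -> float:
    """Credit only the lift over the control, applied to the eligible volume."""
    lift = treated_rate - control_rate
    return eligible_volume * lift * value_per_outcome


def influenced_revenue(eligible_volume: int,
                       treated_rate: float,
                       value_per_outcome: float) -> float:
    """Every outcome the AI experience touched; not proof of incrementality."""
    return eligible_volume * treated_rate * value_per_outcome


# Illustrative inputs: 20,000 eligible trials, 12% vs. 10% conversion,
# 300 in revenue per additional conversion.
incremental = incremental_revenue(20_000, 0.12, 0.10, 300.0)
influenced = influenced_revenue(20_000, 0.12, 300.0)
```

In this example the influenced figure is six times the incremental one (720,000 against 120,000), although the AI experience changed only two percentage points of conversion.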

Separate capacity from cashable savings

Cost claims require more care than a deflection count. A contained interaction may create capacity without reducing expenditure. That capacity can still be valuable, but it should not be presented as cash savings unless spending actually changes.

Capacity created: employees have time available for other work, but the existing cost base remains.
Variable cost avoided: the company no longer incurs a cost that would have grown with each additional interaction.
Cashable savings: an approved budget, vendor charge, or staffing requirement is actually reduced.

Report these separately. Otherwise, the same saved minute can be counted once as employee capacity and again as reduced spend.

Validate that a deflected task was resolved, not abandoned or displaced to another channel. Then calculate avoided cost from the incremental lift in verified resolution, not the total number of conversations the agent handled.
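
A minimal sketch of that calculation follows. It assumes you can measure a verified-resolution rate in both the exposed group and the control, and that a variable cost per contact has been agreed; all names and numbers are illustrative.

```python
def verified_avoided_cost(eligible_contacts: int,
                          treated_verified_rate: float,
                          control_verified_rate: float,
                          variable_cost_per_contact: float) -> float:
    """Avoided cost from the incremental lift in verified resolution."""
    lift = treated_verified_rate - control_verified_rate
    # A negative lift creates no avoided cost; it is a finding to investigate.
    return max(lift, 0.0) * eligible_contacts * variable_cost_per_contact


def naive_deflection_savings(conversations_handled: int,
                             variable_cost_per_contact: float) -> float:
    """The overstated figure: every handled conversation counted as a saving."""
    return conversations_handled * variable_cost_per_contact


# Illustrative inputs: 50,000 eligible contacts, 40% vs. 34% verified
# self-service resolution, 4.00 variable cost per assisted contact.
avoided = verified_avoided_cost(50_000, 0.40, 0.34, 4.0)
naive = naive_deflection_savings(20_000, 4.0)
```

Here the verified figure is about 12,000 while the naive deflection count would claim 80,000. Neither number is cashable savings until a budget, vendor charge, or staffing requirement actually changes.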

Include the operating costs that make the agent dependable

Model or API cost is only one part of the investment. Include labeling, evaluation, human review, and operational oversight. If a safer workflow requires more review, that review is part of the product’s economics, not an external inconvenience to exclude from the model.

Segment cost by agent, workflow, and outcome. Cost per response is useful for infrastructure management, but cost per verified successful outcome is the better economic unit. A cheap response that triggers retries, escalations, or corrections may be more expensive than a higher-cost response that completes the job.
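
The comparison can be sketched in a few lines. The two workflows and their numbers below are invented for illustration; the point is only that retries and review change the ranking once the unit is a verified outcome.

```python
def cost_per_verified_outcome(total_cost: float, verified_outcomes: int) -> float:
    """Total cost divided by outcomes that were verified as successful."""
    if verified_outcomes == 0:
        return float("inf")
    return total_cost / verified_outcomes


# Hypothetical workflows: the cheaper model needs more responses (retries)
# and more human review to reach fewer verified outcomes.
workflows = {
    "cheap_model": {"responses": 1000, "cost_per_response": 0.02,
                    "review_cost": 30.0, "verified_outcomes": 200},
    "stronger_model": {"responses": 400, "cost_per_response": 0.05,
                       "review_cost": 5.0, "verified_outcomes": 250},
}

unit_cost = {}
for name, w in workflows.items():
    total = w["responses"] * w["cost_per_response"] + w["review_cost"]
    unit_cost[name] = cost_per_verified_outcome(total, w["verified_outcomes"])
```

In this example the model with the lower cost per response ends up at roughly 0.25 per verified outcome, against roughly 0.10 for the model with the higher cost per response.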

Do not bury risk inside an average ROI number

Risk adjustment should make uncertainty visible, not create false precision. Use three layers:

Hard guardrails: security and policy conditions that trigger containment or rollback regardless of financial upside.
Observed risk indicators: error, hallucination, escalation, complaint, false-positive, and false-negative rates tracked by workflow and cohort.
Financial adjustment: expected loss deducted from net value only when the probability and impact assumptions are credible enough for finance and risk owners to accept.

Do not let a low-frequency, high-consequence failure disappear inside a high average success rate. If the downside cannot be defensibly monetized, keep it as an explicit decision constraint rather than assigning it a convenient dollar value.
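
The ordering of the three layers matters more than the arithmetic. The sketch below is one possible encoding, with hypothetical status labels: a hard guardrail breach ends the evaluation before any financial number is produced, and an unmonetized risk stays visible as a constraint instead of silently becoming zero.

```python
from typing import Optional


def risk_adjusted_view(net_value_before_risk: float,
                       hard_guardrail_breached: bool,
                       expected_loss: Optional[float] = None) -> dict:
    """Apply the three risk layers in order of precedence."""
    # Layer 1: hard guardrails override any financial upside.
    if hard_guardrail_breached:
        return {"status": "contain_or_roll_back", "net_value": None}
    # Layers 2 and 3: deduct expected loss only when owners accepted an estimate.
    if expected_loss is None:
        return {
            "status": "unmonetized_risk_constraint",
            "net_value": net_value_before_risk,
        }
    return {
        "status": "risk_adjusted",
        "net_value": net_value_before_risk - expected_loss,
    }


breached = risk_adjusted_view(500_000.0, hard_guardrail_breached=True)
unpriced = risk_adjusted_view(60_000.0, hard_guardrail_breached=False)
priced = risk_adjusted_view(60_000.0, False, expected_loss=15_000.0)
```

Passing `None` for the expected loss is deliberately different from passing zero: it records that nobody has agreed on a defensible estimate.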

Prove incrementality before claiming impact

The strongest ROI calculation still fails if the attribution is weak. A before-and-after improvement may come from seasonality, pricing, traffic quality, a support policy change, or another product release. The AI capability needs a counterfactual: what would have happened to comparable eligible users without it?

Use an A/B test or holdout whenever the product and risk profile allow it. Make these choices before launch:

Assignment unit: Randomize at the level where the outcome occurs. If expansion is measured per account, account-level assignment can prevent users in the same customer organization from receiving conflicting experiences.
Primary outcome: Pick the metric that determines success and keep diagnostic metrics secondary.
Minimum detectable effect: Precompute the smallest lift worth detecting based on the baseline, available population, and business value. If the experiment cannot detect a decision-relevant change, extending the metric list will not fix it.
Guardrails: Test quality, escalation, complaints, security, and policy outcomes alongside the primary metric.
Analysis population: For a product-level ROI claim, analyze eligible users according to their assigned experience (an intention-to-treat analysis). Looking only at people who voluntarily used the agent introduces selection bias.
Measurement horizon: Keep the holdout long enough to observe the outcome named in the contract. Leading indicators can guide iteration, but they should not be substituted for retention, churn, Net Revenue Retention, or other lagging outcomes.
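
Two of these choices lend themselves to a quick calculation before launch. The sketch below assumes a binary primary outcome (such as conversion) compared with a two-sided two-proportion z-test, which is a common approximation and not the only valid design, and it assumes deterministic hash-based assignment at the account level. Function names, the experiment label, and the default alpha and power are illustrative.

```python
import hashlib
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate accounts needed per arm to detect an absolute lift."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def assign_account(account_id: str, experiment: str,
                   treatment_pct: int = 50) -> str:
    """Stable assignment: every user in an account gets the same experience."""
    key = f"{experiment}:{account_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"


# Illustrative inputs: 10% baseline conversion, two-point absolute lift.
needed = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
needed_for_smaller_lift = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.01)
arm = assign_account("acct-001", "support-agent-holdout")
```

Halving the detectable lift roughly quadruples the required population. If the eligible population cannot supply that many accounts within the measurement horizon, the experiment cannot answer the question the contract asks.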

If randomization is not practical, use a fixed holdout or a frozen comparison period and document the limitations. A weaker design can still inform a decision, but the ROI claim should carry less confidence. Do not quietly promote correlation to causation because the rollout has executive attention.

Interpret the result as a system. Suppose self-service completion rises but the business outcome does not. The agent may be solving a low-value task, attracting users who would have converted anyway, or shifting effort to a later step. If conversion improves while complaints or policy violations cross the guardrail, the value hypothesis may be valid but the implementation is not ready to scale.

This is eval-driven development applied to product economics: define acceptable behavior and business success, measure both under controlled conditions, diagnose the failures, and repeat the test after a meaningful change.

Turn ROI into a portfolio operating system

A one-time business case goes stale as models, prompts, traffic, user behavior, and operating costs change. Maintain an Agent Analytics view for every production capability.

Each agent scorecard should show:

The primary business outcome and current experiment result.
Leading journey metrics from eligibility through verified completion.
Revenue contribution, cost avoided, and total operating cost using the agreed definitions.
Quality and risk guardrails, including escalations and human-review events.
Performance by relevant customer, task, and journey cohort.
The agent, model, policy, and workflow version associated with the result.
The current decision status: exploring, launching, scaling, redesigning, contained, or retiring.

Use the dashboard to make portfolio decisions, not merely to report trends:

Scale when the primary outcome clears the precommitted threshold, guardrails hold, net value is positive, and the result remains credible across the cohorts that matter.
Redesign when leading indicators improve but the business outcome does not, or when human review and escalation erase the economic gain.
Contain or roll back when a hard security, policy, or customer-harm threshold is breached, even if average financial performance is positive.
Retire when controlled measurement shows no decision-relevant incrementality or when dependable operation costs more than the value created.
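
These rules can be written down as an explicit function so that the decision is the same no matter who reads the dashboard. The version below is a simplified sketch under stated assumptions: the inputs are booleans and one number supplied by the team, and the tie-breaks (for example, treating a cleared outcome with non-positive net value as a redesign case) are illustrative choices, not rules from this playbook.

```python
def portfolio_decision(hard_guardrail_breached: bool,
                       primary_outcome_cleared: bool,
                       leading_indicators_improved: bool,
                       net_value: float,
                       credible_across_cohorts: bool) -> str:
    # A hard security, policy, or customer-harm breach overrides everything.
    if hard_guardrail_breached:
        return "contain_or_roll_back"
    if primary_outcome_cleared:
        if net_value <= 0:
            # Review and escalation cost erased the gain (assumed tie-break).
            return "redesign"
        if not credible_across_cohorts:
            return "keep_exploring"
        return "scale"
    if leading_indicators_improved:
        return "redesign"
    return "retire"


decisions = {
    "breach": portfolio_decision(True, True, True, 100.0, True),
    "clean_win": portfolio_decision(False, True, True, 100.0, True),
    "activity_only": portfolio_decision(False, False, True, -20.0, True),
    "no_signal": portfolio_decision(False, False, False, -20.0, True),
}
```

Checking the guardrail first means a positive average financial result can never override a breach.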

Review operational signals with frontline teams because they can explain patterns hidden by aggregate metrics. Review portfolio value in QBRs with product, data, finance, and risk owners so investment follows evidence rather than novelty.

Only accelerate adoption after the workflow has demonstrated unit value. In-app guides, product tours, and lifecycle nudges can bring more eligible users into a validated flow. Measure whether those interventions increase the business outcome, not merely clicks or agent sessions. Scaling exposure to an unproven workflow scales its cost and risk as readily as its potential benefit.

Key takeaways

Treat ROI as a precommitted decision rule for launch, scale, redesign, rollback, or retirement.
Connect model behavior to customer behavior and then to revenue, cost, or risk through a driver tree.
Freeze the baseline, cost boundary, guardrails, attribution method, and success thresholds before results arrive.
Credit only incremental revenue and verified avoided cost. Keep created capacity separate from cashable savings.
Include model consumption, labeling, evaluation, human review, and oversight in the operating cost.
Use controlled experiments or holdouts, with a decision-relevant minimum detectable effect, to separate causal impact from correlation.
Keep severe risk conditions as explicit constraints when they cannot be responsibly converted into a financial estimate.
Scale adoption only after the AI workflow has shown positive unit value under acceptable risk.

Pick one high-friction customer journey and complete its measurement contract before the next roadmap review. If the team cannot name the baseline, control, primary outcome, cost boundary, guardrails, and decision thresholds, the capability is still an exploration. Label it honestly, instrument it properly, and earn the right to make an ROI claim.

May 15, 2026

How to Deploy an Operator AI Agent in Customer Operations

Your support team probably does not need another chatbot that summarizes a ticket on command. It needs help with the operational work surrounding every ticket: finding why escalations changed, keeping knowledge accurate, correcting broken automations, coordinating incident communication, and showing human reps what deserves attention next.

An operator AI agent can take on that work, but only if you design it as an operating system for customer operations rather than a conversational layer over support APIs. The useful version closes the loop from signal to diagnosis to tested change. The dangerous version produces plausible commentary and receives permission to act before it has earned trust.

Define the job as a closed loop, not a chat box

A customer-facing AI agent handles an individual customer’s request. An operator agent works on the system around those requests: conversations, help content, automation configuration, performance data, incident workflows, and the human queue.

That distinction changes the product requirement. The agent is not complete when it answers a question such as why escalations increased. It is complete when it can investigate the increase, identify a supported cause, determine which operational object needs attention, prepare a change, test that change where possible, and route it to the right person for approval.

Observe: Detect a question, anomaly, scheduled task, failed conversation, release brief, or incident.
Diagnose: Select the relevant metrics and attributes, inspect representative conversations, and separate recurring patterns from isolated cases.
Locate the control point: Determine whether the problem sits in knowledge, guidance, a procedure, a data connector, an automation rule, or a human workflow.
Propose: Produce a concrete artifact such as an article diff, configuration change, procedure, incident audience, or prioritized queue.
Verify: Run a simulation or another appropriate check and expose failures, edge cases, and remaining uncertainty.
Act and learn: Apply an approved change, record what happened, and monitor the affected outcome for regression.

Consider the prompt, “Why did escalations rise last week?” A reporting copilot returns a chart. A useful operator identifies which escalation definition applies, segments the change, reads relevant conversations, finds the repeated cause, checks whether the corresponding help content or automation is deficient, and prepares the smallest defensible correction. That progression from an operational question to an actionable proposal is already possible across analysis, knowledge maintenance, automation building, and human support workflows.

Write the acceptance criteria around that complete handoff. Require the evidence used, the proposed artifact, the scope of impact, the verification result, the named reviewer, and any action the agent is forbidden to take. If the output still leaves an operations manager rebuilding the context manually, you have a chat assistant, not an operator.

Build reliability below the model and price that work honestly

A foundation model with API access can make a persuasive prototype. It can query ticket data, summarize conversations, and write a report that appears coherent. The hard part begins when different workspaces use different fields, configurations, workflows, permissions, and definitions of success.

The model should not have to rediscover your operating rules on every run. Encode those rules in purpose-built tools and reusable skills. A tool performs one bounded operation, such as retrieving a conversation, searching knowledge, or running a defined report. A skill coordinates several tools to complete a business job, such as debugging a failed resolution or rolling a policy change through the help center.

Operator’s production architecture is described as having more than 50 tools and 10 multi-step skills. Those counts are not targets to copy. They illustrate how quickly the hidden surface area grows once an agent must do dependable operational work instead of demonstrating a few API calls.

System layer	Job it must perform	Failure you should test for	Control to add
Semantic retrieval	Find content by meaning, not only exact words	Irrelevant or incomplete evidence produces a confident diagnosis	Evaluate retrieval against real support questions and known content gaps
Attribute awareness	Know which metrics, fields, and custom attributes are populated and meaningful	The agent invents a pattern from sparse or unused fields	Expose field definitions, coverage, allowed joins, and missing-data signals
Atomic tools	Perform narrow reads or writes predictably	A broad API wrapper allows an unintended query or change	Use typed inputs, constrained scopes, explicit permissions, and structured results
Domain skills	Chain tools according to a repeatable customer-operations method	The same request follows a different process on each run	Define required steps, exit conditions, evidence, and escalation paths
Review interface	Turn reasoning into charts, diffs, tests, and proposals	A reviewer approves a wall of prose without understanding the change	Render the decision in the format appropriate to the object being changed

Semantic retrieval and attribute awareness deserve particular attention. Retrieval grounds the agent in the content that can actually answer the question. Attribute awareness stops it from treating every available field as equally meaningful. A custom field that exists but is almost never populated should not become the foundation of an operational recommendation.

Give every tool a contract before the model can call it:

The business purpose and the questions it is allowed to answer.
The read and write permissions it requires.
The preconditions that must be true before it runs.
The evidence and identifiers it must return.
Its behavior when data is missing, ambiguous, stale, or inconsistent.
The audit event, approval requirement, and rollback path for a write.
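
A contract like this can be checked mechanically before a tool is registered. The sketch below is an assumption about how you might represent it; the field names, the permission strings, and the `contract_gaps` helper are hypothetical and do not describe any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    purpose: str                      # business purpose and allowed questions
    permissions: frozenset            # e.g. {"conversations:read"}
    preconditions: tuple = ()
    returns: tuple = ()               # evidence and identifiers it must return
    on_bad_data: str = "refuse_and_report"
    writes: bool = False
    audit_event: str = ""
    approval_required: bool = False
    rollback_path: str = ""


def contract_gaps(c: ToolContract) -> list:
    """List the contract items missing before the model may call the tool."""
    gaps = [n for n in ("purpose", "permissions", "returns", "on_bad_data")
            if not getattr(c, n)]
    if c.writes:
        gaps += [n for n in ("audit_event", "rollback_path")
                 if not getattr(c, n)]
        if not c.approval_required:
            gaps.append("approval_required")
    return gaps


read_tool = ToolContract(
    name="get_conversation",
    purpose="Retrieve one conversation by identifier",
    permissions=frozenset({"conversations:read"}),
    returns=("conversation_id", "transcript"),
)
unsafe_write = ToolContract(
    name="publish_article",
    purpose="Publish an approved article diff",
    permissions=frozenset({"articles:write"}),
    returns=("article_id", "revision_id"),
    writes=True,
)
```

In this sketch the read tool passes, while the write tool is rejected until it names an audit event, an approval requirement, and a rollback path.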

Evaluate build versus buy beyond the demonstration

A proof of concept establishes that a model can produce a plausible answer with your data. It does not establish that the answer is grounded, that the proposed action is safe, or that the system will behave consistently as configurations change.

For a build decision, include retrieval tuning, permission design, tenant isolation, tool maintenance, skill development, evaluation data, observability, proposal interfaces, audit history, rollback behavior, and on-call ownership. Also ask who will update the agent when a support object, metric definition, product policy, or API changes. If these responsibilities do not have durable owners, the internal agent will age like any other unsupported operations system.

For a buy decision, ask the vendor to demonstrate your difficult cases rather than its preferred prompts. Use a conversation with conflicting evidence, an unused custom attribute, an outdated localized article, a misconfigured rule, and a proposed write with a wide blast radius. Inspect the evidence, tool trace, permissions, diff, test result, and audit record. The quality of the generated prose is one of the least informative parts of that evaluation.

Put a proposal boundary around every material action

Moving from analysis to live changes is a different class of production problem. A wrong summary wastes time. A wrong configuration can degrade customer outcomes across every conversation that matches it. An incorrect outbound message cannot be recalled after customers have read it.

I would give the agent autonomy according to consequence, not according to how confident its language sounds:

Read: Search content, inspect conversations, calculate approved metrics, and assemble evidence. Run these tasks autonomously within access controls and log every operation.
Recommend: Explain a root cause or rank an opportunity. Attach the underlying conversations, segments, fields, and assumptions so a person can challenge the conclusion.
Prepare: Draft an article, procedure, rule, connector configuration, customer response, or queue. Save it as a proposal with no production effect.
Change: Publish, configure, send, or otherwise alter the live operation only after the required reviewer sees the exact scope and explicitly approves it.
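
The four levels translate into a small authorization check. This is a minimal sketch, assuming a single named approver is enough for a change; the enum and function names are illustrative, and a real system would also verify that the approver saw the exact scope.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    READ = 1
    RECOMMEND = 2
    PREPARE = 3
    CHANGE = 4


def may_execute(tier: Tier, within_access_controls: bool,
                approved_by: Optional[str] = None) -> bool:
    """Gate by consequence: only live changes need explicit approval."""
    if not within_access_controls:
        return False
    if tier == Tier.CHANGE:
        return bool(approved_by)
    return True


can_read = may_execute(Tier.READ, within_access_controls=True)
can_prepare = may_execute(Tier.PREPARE, within_access_controls=True)
can_change_unapproved = may_execute(Tier.CHANGE, within_access_controls=True)
can_change_approved = may_execute(Tier.CHANGE, True, approved_by="ops-lead")
```

Nothing in the check refers to how confident the agent's output sounds. The gate depends only on the class of action and on who approved it.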

A proposal is a structured change object, not a paragraph asking for trust. Production-grade operator systems can present reviewable diffs before applying changes, allowing the reviewer to accept, reject, or refine the work. The same principle should govern any operator implementation.

Your review screen should answer six questions without forcing the approver into another tool:

What object will change?
What exact fields, passages, rules, or recipients are affected?
What evidence connects the observed problem to this change?
What test ran, and which cases failed or remained untested?
Who must approve, and which permission will execute the action?
How can the change be reversed, and what cannot be reversed?
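
One way to enforce those six questions is to make the proposal a typed object that reports which of them it cannot yet answer. The structure below is a hypothetical sketch; field names are assumptions, and `None` is used to distinguish "not stated" from "stated as empty".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    target_object: str = ""
    affected_items: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    tests_run: list = field(default_factory=list)
    failed_or_untested: Optional[list] = None    # None means "not stated"
    approver: str = ""
    executing_permission: str = ""
    rollback_plan: str = ""
    irreversible_effects: Optional[list] = None  # None means "not stated"

    def unanswered(self) -> list:
        """Return the review questions this proposal cannot answer yet."""
        checks = {
            "object": bool(self.target_object),
            "scope": bool(self.affected_items),
            "evidence": bool(self.evidence),
            "tests": bool(self.tests_run) and self.failed_or_untested is not None,
            "approval": bool(self.approver and self.executing_permission),
            "reversal": bool(self.rollback_plan)
                        and self.irreversible_effects is not None,
        }
        return [question for question, ok in checks.items() if not ok]


draft = Proposal(target_object="help article: refund policy",
                 affected_items=["section 2, paragraph 1"],
                 evidence=["conversation 1", "conversation 2"])
open_questions = draft.unanswered()
```

A review screen can refuse to show an approve button while `unanswered()` returns anything.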

Customer outreach needs the strictest treatment because sending is effectively irreversible. Do not approve a batch from a conversational summary that hides the audience. The safe alternative is a preview containing the resolved customer list, inclusion logic, exclusions, exact message variants, delivery channel, and approver. Start by allowing the agent to prepare that package while a person performs the send.

Simulation also needs a visible place in the proposal. If the agent modifies an automation procedure, show which representative conversations were tested, the expected outcome for each, the observed outcome, and why any mismatch occurred. An overall pass label is not enough to reveal an important edge case.

Human approval is not a permanent substitute for system quality. If reviewers routinely accept proposals without inspecting them, the control has become ceremonial. Track corrections, rejections, rollbacks, and the evidence reviewers open. Use those signals to improve the relevant retrieval rule, tool, skill, or interface.

Roll out workflows in increasing order of consequence

Choose the first workflow by its operating characteristics. A strong starting candidate recurs frequently, consumes expert attention, has accessible evidence, produces a clear artifact, and has a named reviewer. It should also allow the agent to be useful before it receives broad write permission.

A practical rollout sequence looks like this:

Recurring operations analyst. Give the agent one standing question, such as what changed in escalations or automation performance. Define the metric, comparison period, relevant segments, evidence requirements, and report destination. Require links to representative conversations and allow the conclusion that no action is warranted. Compare its reasoning with an experienced operator’s review until the failure modes are understood.
Knowledge steward. Feed it a release brief or policy change. Ask it to find affected help content, identify missing coverage, and prepare article diffs in the required voice and format. Include localized variants where they exist. The reviewer should validate product behavior, instructions, links, policy language, and whether the proposed set of pages is complete before publishing.
Automation maintainer. Start with known failed conversations. Ask the agent to distinguish a content gap from a rule, procedure, guidance, or connector problem; prepare the smallest correction; define triggers and edge cases; and simulate the result. Do not grant live configuration access until the tool trace and tests make the diagnosis reproducible.
Human-operations coordinator. Use the agent to assemble an incident audience, draft targeted responses, prepare coaching evidence, or prioritize a rep’s queue. These workflows can save substantial coordination time, but they touch customer communication and employee decisions. Begin in preparation mode, expose the selection logic, and expand autonomy only after identity, permission, review, and audit controls have been exercised.

This sequence is a risk ordering, not a universal maturity model. A read-only weekly analysis is easier to inspect and reverse than an outbound incident campaign. A knowledge proposal has a reviewable artifact. A live automation change affects future conversations, while customer communication may create an immediate and irreversible consequence. Move forward when the evidence and controls for the next class of action are ready, not merely because the previous feature launched.

Measure the completed loop, not chat activity

Prompt counts and conversation volume tell you that people opened the product. They do not tell you that customer operations improved. Build the scorecard around the operational loop:

Diagnostic quality: Whether the proposed root cause survives expert review, whether its evidence supports the conclusion, and how often factual correction is required.
Operational throughput: Time from a detected signal to a reviewed proposal and from an approved proposal to a verified change.
Artifact quality: Acceptance, revision, rejection, and rollback patterns for knowledge, automation, configuration, and communication proposals.
Customer outcome: Resolution, escalation, repeat contact, and sentiment for the affected topic after the change, interpreted alongside volume and case mix.
Safety: Permission denials, attempted out-of-scope actions, failed simulations, unauthorized writes, rollbacks, and missing audit events.
Human leverage: Expert time spent collecting evidence, recreating context, drafting the artifact, and reviewing the final proposal.

Do not make automation rate the only goal. A higher rate can coexist with poor resolutions or avoidable escalations. Treat it as one diagnostic measure and pair it with customer outcomes, correction rates, and topic-level regressions.

Create an evaluation set from real operating conditions: known content gaps, misconfigured rules, legitimate escalations, sparse attributes, conflicting evidence, localized content, and incidents with precise audience criteria. Give each case an expected outcome, required evidence, allowed tools, and forbidden action. Re-run the set when the model, retrieval system, tool, skill, permissions, or support configuration changes.
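
A regression case of this kind can be graded without reading any prose. The sketch below assumes each agent run can be reduced to an outcome label, the evidence it cited, the tools it called, and the actions it took; the record types, labels, and tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    name: str
    expected_outcome: str
    required_evidence: frozenset
    allowed_tools: frozenset
    forbidden_actions: frozenset


@dataclass(frozen=True)
class RunResult:
    outcome: str
    evidence: frozenset
    tools_called: tuple
    actions_taken: tuple


def grade(case: EvalCase, run: RunResult) -> list:
    """Return the list of failures; an empty list means the case passed."""
    failures = []
    if run.outcome != case.expected_outcome:
        failures.append("wrong_outcome")
    if not case.required_evidence <= run.evidence:
        failures.append("missing_evidence")
    if set(run.tools_called) - case.allowed_tools:
        failures.append("disallowed_tool")
    if set(run.actions_taken) & case.forbidden_actions:
        failures.append("forbidden_action")
    return failures


case = EvalCase(
    name="sparse custom attribute",
    expected_outcome="no_action_warranted",
    required_evidence=frozenset({"field_coverage_report"}),
    allowed_tools=frozenset({"run_report", "search_knowledge"}),
    forbidden_actions=frozenset({"publish_article", "send_message"}),
)
good_run = RunResult("no_action_warranted",
                     frozenset({"field_coverage_report"}),
                     ("run_report",), ())
bad_run = RunResult("content_gap", frozenset(),
                    ("run_report", "update_rule"), ("publish_article",))
```

A run that reaches the right conclusion through a disallowed tool or a forbidden action still fails, which keeps the set useful as a safety check and not only as an accuracy check.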

Scheduled work is where the leverage begins to compound. An operator can run recurring analysis and deliver the resulting report without waiting for a manager to remember the question. Keep an owner on every scheduled job, however. That owner should know where failures appear, when the task last completed, which data it used, and how to pause it.

Key takeaways

An operator agent improves the system around customer conversations; it is not simply another customer-facing bot.
The product boundary should cover observation, diagnosis, proposal, verification, approval, action, and monitoring.
Reliable behavior comes from grounded retrieval, attribute awareness, bounded tools, encoded domain skills, and structured review surfaces.
Grant autonomy by consequence: broad freedom to inspect approved data, tighter controls to prepare changes, and explicit approval for production writes.
Roll out recurring analysis before knowledge changes, automation configuration, and customer communication unless your own risk profile clearly supports another order.
Measure supported diagnoses, accepted artifacts, customer outcomes, human time, and safety events rather than prompt volume alone.

Your next step is to choose one recurring operational question and write down the evidence it requires, the artifact a good answer should produce, the person who will review it, and the actions the agent must not take. Once that loop works reliably, add one downstream proposal. That is a much stronger foundation for an operator agent than beginning with an open-ended prompt and a broad API key.

May 14, 2026

Our Operating Model Is the Product—Why We Built Product Partners to Accelerate Outcomes

I’ve learned that customers don’t just buy features—they buy the way we discover, decide, build, ship, and support. In other words, the operating model is the product. That realization has shaped how my team and I at HighLevel translate product strategy into tangible, repeatable outcomes that show up in quality, reliability, onboarding, and consultative support every single day.

We created Product Partners to codify that operating model and scale it with discipline. It’s a blueprint and operating rhythm that unifies product strategy with go-to-market strategy, customer success, and solutions engineering—so empowered product teams can move faster without sacrificing clarity, governance, or customer trust.

First, we anchored on continuous discovery. Product trios work shoulder-to-shoulder with customer-facing teams to run customer interviews, journey mapping, and A/B testing, then validate insights with session replay and behavioral analytics. We use driver trees and opportunity solution trees to connect problems to outcomes, ensuring prioritization is evidence-based and aligned to product-market fit—not just output.

Second, we elevated delivery excellence. Our practices emphasize CI/CD, feature flags, observability, SRE-informed incident management, and DORA metrics to shorten feedback loops while raising the bar on stability. Privacy-by-design, data governance, and regulatory compliance are built into our workflows, and we make deliberate build vs buy decisions to protect platform scalability and long-term velocity.

Third, we integrated go-to-market alignment from day one. Solutions engineering and customer success shape requirements early, so launches include in-app guides, product tours, onboarding paths, and consultative support that accelerate user activation. We tie outcomes vs output OKRs to stakeholder management rituals, ensuring sales-led and product-led growth motions reinforce each other instead of competing for focus.

Finally, we closed the loop with a unified analytics platform. Activation, retention analysis, and Net Revenue Retention (NRR) sit alongside qualitative signals from customer interviews and support. This single source of truth helps us refine product positioning, sharpen value propositions, and improve roadmapping and sprint planning with clear, testable hypotheses.

What does this mean for our partners and customers? Faster time-to-value, fewer handoffs, clearer expectations, and a shared lens on the metrics that matter. Product Partners isn’t a side program; it’s how we operationalize trust—through transparency, consistent rituals, and a bias toward learning that compounds.

If this resonates, you’ll feel it in how we discover, build, and support together. I’ll continue to share our playbooks—covering continuous discovery, onboarding, and outcome-based planning—so we can keep raising the standard for product management leadership and product-led growth, one operating rhythm at a time.

Inspired by this post on Product School.

May 14, 2026

AI-Enabled Enzymatic Recycling: A Product Leader’s Playbook

You have an AI-enabled materials proposal in front of you, a promising set of enzyme candidates, and a difficult decision: fund another round of discovery or start building toward industrial scale. The candidate sequences may be impressive, but they are not yet the product.

Your decision should turn on whether the full system can repeatedly transform a defined waste stream into usable monomers at an economically viable cost. That framing connects model performance, laboratory evidence, process engineering, and commercial reality before an exciting demonstration becomes a stranded pilot.

Define the product around recovered monomers

Only 10% of manufactured plastic gets recycled. That low rate is not merely a sorting or consumer-behavior problem. Traditional recycling commonly shortens polymer chains instead of restoring their original molecular building blocks, so the resulting material can lose quality and move toward downcycling.

Enzymatic recycling changes the intended output. An engineered enzyme can deconstruct a polymer into its original monomers, which can then become inputs for new, high-quality plastic. The difference is fundamental: the product is not processed waste or a smaller plastic fragment. It is recovered molecular feedstock.

This distinction gives you a better product boundary. A generated protein sequence is a feature. An enzyme that shows activity in one assay is a technical result. The product is a repeatable monomer-recovery system with a defined input, output, operating envelope, and cost structure.

Before approving a roadmap, require the team to define five contracts:

Input contract: Which polymer, packaging format, mixture, and contamination profile will the process accept? “Mixed plastic” is not a specification. Name the included materials and the variation the system must tolerate.
Transformation contract: Which polymer bonds must the enzyme break, and what conversion and selectivity must the reaction demonstrate?
Output contract: Which monomers will be recovered, what downstream use must they support, and how will the team determine that the output is suitable for that use?
Operating contract: What reaction conditions, throughput, energy consumption, and process controls must hold outside a small laboratory assay?
Economic contract: Which cost per ton must the integrated process approach, and which assumptions currently separate measured economics from projected economics?

Selectivity is especially important. An enzyme can target a particular plastic within a mixed waste stream, potentially reducing the need to treat every input as chemically identical. But selectivity does not make an undefined waste stream manageable. The process still needs to know which target material is present, whether the enzyme can reach it, and how the desired products will be recovered.

Write the product brief in one sentence: For this defined feedstock, transform this polymer into these monomers, within this operating envelope, output specification, and cost boundary. If a number is unknown, leave a visible blank and assign an experiment to fill it. Do not hide the uncertainty inside a broad ambition such as “make plastic circular.”
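A minimal sketch of that brief as a data structure can keep the blanks visible rather than implied. The class name, field names, and example values below are illustrative assumptions, not part of any real program; the only point is that an unknown stays an explicit blank with an experiment owed against it.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProductBrief:
    """One-sentence product brief; None marks a visible blank."""
    feedstock: Optional[str] = None
    polymer: Optional[str] = None
    monomers: Optional[str] = None
    operating_envelope: Optional[str] = None
    output_spec: Optional[str] = None
    cost_boundary: Optional[str] = None

    def blanks(self) -> list[str]:
        """Names of the fields that still need an experiment assigned."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def sentence(self) -> str:
        """Render the brief, showing every unknown as an explicit blank."""
        v = {f.name: getattr(self, f.name) or "____" for f in fields(self)}
        return (
            f"For {v['feedstock']}, transform {v['polymer']} into "
            f"{v['monomers']}, within {v['operating_envelope']}, "
            f"{v['output_spec']}, and {v['cost_boundary']}."
        )


# Hypothetical example values; three fields are deliberately left blank.
brief = ProductBrief(
    feedstock="sorted PET clamshells",
    polymer="PET",
    monomers="terephthalic acid and ethylene glycol",
)
print(brief.sentence())
print("Experiments needed for:", brief.blanks())
```

If the list of blanks is not shrinking from one roadmap review to the next, the program is accumulating candidates rather than answers.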

Build the AI as a closed learning system

AI changes the economics of searching enzyme-design space. Protein language models can generate candidates, multi-step agents can coordinate specialized tasks, and computational evaluations can eliminate weak options before scarce laboratory capacity is used. Advances in protein structure prediction have expanded what can be explored, but prediction does not remove the need for physical validation.

The useful architecture is therefore not a model that emits sequences. It is a closed loop in which every physical result makes the next design round better. Rhea’s Factory combines protein language models, an agentic pipeline, domain constraints, and proprietary wet-lab feedback. The product lesson is broader than any one implementation: generation, evaluation, experimentation, and learning need to operate as one traceable system.

Encode the objective. Convert the product contract into machine-readable constraints: target polymer, desired products, acceptable operating conditions, and the metrics that will decide whether a candidate advances.
Generate candidates. Explore multiple plausible designs rather than optimizing immediately around the first promising family.
Apply computational gates. Reject candidates that violate explicit constraints, preserve the reasons for rejection, and rank the remaining candidates for laboratory use.
Run controlled wet-lab experiments. Test candidates under recorded conditions and capture successes, failures, and inconclusive results.
Update domain predictions. Use the measured outcomes to improve ranking and candidate selection for the next round.
Feed process evidence back into discovery. When a candidate struggles under reactor or feedstock conditions, turn that failure into a new design constraint instead of treating it as a separate engineering problem.

Agentic AI is valuable here because the workflow is multi-step, not because an agent should make every decision autonomously. At each handoff, define the required input, expected output, validator, and failure behavior. A generation step should not advance an incomplete candidate. A computational score should not be presented as a laboratory observation. A promising assay should not silently become a scale claim.
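One way to picture a handoff with a validator and a failure behavior is the sketch below. It assumes each gate is a plain function that returns a rejection reason or nothing; the gate names, the score threshold, and the placeholder sequences are hypothetical and do not describe any specific pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Candidate:
    candidate_id: str
    sequence: str
    predicted_score: Optional[float] = None  # computational estimate, never a lab value
    rejections: list[str] = field(default_factory=list)


# A gate returns None to pass, or a reason string to reject.
Gate = Callable[[Candidate], Optional[str]]

MIN_SCORE = 0.6  # hypothetical threshold, fixed before ranking


def is_complete(c: Candidate) -> Optional[str]:
    return None if c.sequence else "incomplete candidate: no sequence"


def has_score(c: Candidate) -> Optional[str]:
    return None if c.predicted_score is not None else "no computational score recorded"


def meets_min_score(c: Candidate) -> Optional[str]:
    if c.predicted_score >= MIN_SCORE:
        return None
    return f"predicted score {c.predicted_score} below {MIN_SCORE}"


def run_gates(candidates: list[Candidate], gates: list[Gate]):
    """Stop each candidate at its first failed gate and keep the reason."""
    advanced, rejected = [], []
    for c in candidates:
        for gate in gates:
            reason = gate(c)
            if reason is not None:
                c.rejections.append(reason)
                break
        (rejected if c.rejections else advanced).append(c)
    # Rank survivors for scarce laboratory capacity, best predicted first.
    advanced.sort(key=lambda c: c.predicted_score, reverse=True)
    return advanced, rejected


# Placeholder sequences and scores, for illustration only.
pool = [
    Candidate("c1", "MKTAYI", 0.91),
    Candidate("c2", "", 0.88),
    Candidate("c3", "MSTNPK", None),
    Candidate("c4", "MGSSHH", 0.42),
    Candidate("c5", "MLLAVT", 0.73),
]
advanced, rejected = run_gates(pool, [is_complete, has_score, meets_min_score])
print("to lab:", [c.candidate_id for c in advanced])
for c in rejected:
    print(c.candidate_id, c.rejections)
```

Gate order matters in this sketch: the completeness and score-present checks run before the threshold check, so a missing value is recorded as missing instead of being compared against a number.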

Exploration also needs an explicit lane. Higher model-sampling temperatures can produce more unusual enzyme candidates and reach beyond the safest local variations. Controlled model “hallucination” can be useful during candidate exploration when downstream guardrails prevent novelty from being mistaken for evidence.

Separate the candidate portfolio into three buckets: improvements near known winners, adjacent designs that test a clear hypothesis, and high-variance exploration. Give each bucket a deliberate laboratory budget. Raise sampling temperature only in the exploratory lane, and never allow generated assay values, reaction outcomes, or scale results into the measured-data record.
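The bucket rule can be written down as a small allocation check. This is a sketch under stated assumptions: the lane names, budget shares, temperature values, and the 96-slot round are invented for illustration, and the only behavior it enforces is that shares sum to the whole budget and that a raised sampling temperature appears in the exploratory lane alone.

```python
BASELINE_TEMPERATURE = 1.0  # hypothetical default sampling temperature
EXPLORATORY_LANE = "high_variance"

# Hypothetical budget shares (percent of lab slots) and temperatures.
LANES = {
    "near_known_winners": {"share_pct": 50, "temperature": 1.0},
    "adjacent_hypothesis": {"share_pct": 30, "temperature": 1.0},
    "high_variance": {"share_pct": 20, "temperature": 1.4},
}


def validate(lanes: dict) -> None:
    """Reject a plan whose shares do not sum to 100 or whose temperature leaks."""
    if sum(lane["share_pct"] for lane in lanes.values()) != 100:
        raise ValueError("budget shares must sum to 100 percent")
    for name, lane in lanes.items():
        if name != EXPLORATORY_LANE and lane["temperature"] > BASELINE_TEMPERATURE:
            raise ValueError(f"raised temperature outside exploratory lane: {name}")


def allocate(total_slots: int, lanes: dict) -> dict:
    """Split lab slots by share, giving leftovers to the largest remainders."""
    validate(lanes)
    slots, remainders = {}, {}
    for name, lane in lanes.items():
        slots[name], remainders[name] = divmod(total_slots * lane["share_pct"], 100)
    leftover = total_slots - sum(slots.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        slots[name] += 1
    return slots


plan = allocate(96, LANES)
print(plan)
```

Integer percentages are used here so the split is exact and every laboratory slot is assigned to exactly one lane.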

The durable advantage sits in the feedback data. In a narrow, high-signal domain, even hundreds of relevant proprietary laboratory observations can support a useful domain prediction model. That is not a general claim that small datasets are always sufficient. It means contextual quality can matter more than indiscriminate volume when the problem, assay, and outcomes are tightly defined.

For every experiment, preserve enough context to make the result reusable:

The enzyme identity, sequence version, and design lineage.
The target polymer, material format, mixture, and relevant contamination profile.
The assay and protocol version used for the test.
The reaction conditions and duration.
The measured conversion, selectivity, yield, and uncertainty available from the experiment.
The full result, including failure, no-result, and inconclusive outcomes.
The relationship between the candidate, computational evaluations, physical test, and model or data release.

A spreadsheet of winning sequences is not a data moat. A traceable record of why candidates were proposed, how they were tested, what failed, and how each result changed the next decision can become one.
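A record along these lines is one possible shape for that context. The field names, units, and the rule that only measured data may enter the record are assumptions made for illustration; a real schema would follow the team's own assays and protocols.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_RESULT = "no_result"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class ExperimentRecord:
    enzyme_id: str
    sequence_version: str
    parent_design: Optional[str]      # design lineage
    target_polymer: str
    material_format: str
    contamination_profile: str
    protocol_version: str
    temperature_c: float
    duration_h: float
    outcome: Outcome
    conversion: Optional[float] = None
    selectivity: Optional[float] = None
    yield_fraction: Optional[float] = None
    uncertainty: Optional[float] = None
    model_release: Optional[str] = None  # model or data release that proposed it
    source: str = "measured"

    def __post_init__(self):
        # Generated or predicted values never enter the measured-data record.
        if self.source != "measured":
            raise ValueError("only measured data belongs in the lab record")
        if self.outcome is Outcome.SUCCESS and self.conversion is None:
            raise ValueError("a success needs a measured conversion")

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=lambda o: o.value)


# Hypothetical failed run: kept in full, with no invented measurements.
record = ExperimentRecord(
    enzyme_id="enz-0042", sequence_version="v3", parent_design="enz-0017",
    target_polymer="PET", material_format="clamshell flake",
    contamination_profile="label adhesive residue", protocol_version="assay-2.1",
    temperature_c=50.0, duration_h=24.0, outcome=Outcome.FAILURE,
    model_release="ranker-2026-05",
)
print(record.to_json())
```

The failed run is stored with the same context as a success, which is what lets the next ranking round learn from it.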

Use stage gates that end in physical evidence

AI product teams often gravitate toward a model leaderboard because it creates a clean sense of progress. Enzymatic recycling does not have one adequate master score. A candidate can look structurally plausible and fail in the lab. It can perform in a controlled assay and miss the required throughput. It can convert the polymer and still lose economically once the rest of the process is counted.

Use a hierarchy of evidence that moves from design compliance to laboratory performance, operating fit, and scale economics:

Gate	Decision question	Required evidence	Red flag
Design compliance	Does the candidate satisfy the stated target and pipeline constraints?	Deterministic checks, recorded constraint evaluations, and candidate provenance	A candidate advances mainly because it appears novel
Wet-lab performance	Does the enzyme convert the target with the required selectivity under defined conditions?	Repeatable measured observations, including negative and inconclusive runs	Only the best run is retained or shared
Operating fit	Does useful performance hold within the intended controlled, low-temperature process and throughput requirements?	Process measurements tied to reaction conditions, conversion, yield, throughput, and energy use	Activity is reported without the process context needed to interpret it
Scale economics	Can the integrated system move toward cost parity with inexpensive oil-based plastic?	A cost and energy model tied to measured inputs, with assumptions and sensitivities exposed	Commercial viability is inferred from enzyme activity alone

Set pass, hold, and stop conditions before seeing the result. Otherwise, an interesting candidate will repeatedly earn one more experiment while the commercial requirement drifts. Relative improvement is useful for learning, but an enzyme that is twice as good as an unusable baseline may still be unusable. Every relative metric should sit beside the absolute requirement it is meant to approach.

Keep conversion, selectivity, yield, throughput, and energy per ton separate. Combining them too early into a single score can conceal the actual tradeoff. A team should be able to show why it is advancing a faster candidate with lower selectivity, or a more selective candidate with a different operating burden, without claiming that the candidates are equivalent.
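Pre-registered pass, hold, and stop conditions can be sketched as follows. The thresholds and measurements are hypothetical numbers, every metric is assumed to be "higher is better", and a missing measurement is treated as a hold. Each metric is judged on its own absolute requirement; nothing is blended into a composite score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    pass_at: float     # absolute requirement, set before the experiment
    stop_below: float  # floor under which the candidate is stopped


def decide(measured: dict, thresholds: list[Threshold]) -> str:
    """Return 'stop', 'pass', or 'hold' from per-metric absolute thresholds."""
    values = [measured.get(t.metric) for t in thresholds]
    pairs = list(zip(values, thresholds))
    if any(v is not None and v < t.stop_below for v, t in pairs):
        return "stop"
    if all(v is not None and v >= t.pass_at for v, t in pairs):
        return "pass"
    return "hold"  # missing data or between the floor and the requirement


def relative_improvement(measured: float, baseline: float) -> float:
    """Learning signal only; it never replaces the absolute requirement."""
    return measured / baseline


# Hypothetical thresholds for one gate.
GATE = [
    Threshold("conversion", pass_at=0.90, stop_below=0.30),
    Threshold("selectivity", pass_at=0.95, stop_below=0.50),
]

run = {"conversion": 0.20, "selectivity": 0.97}
print(relative_improvement(run["conversion"], baseline=0.10))  # doubled
print(decide(run, GATE))  # still below the absolute floor
```

In the example, the candidate doubles the baseline conversion and is still stopped, because the relative gain sits beside an absolute floor it has not reached.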

Three common metric substitutions deserve direct scrutiny:

Low reaction temperature is not automatically low total energy. Count the energy demands of the complete process rather than the enzyme reaction in isolation.
Polymer conversion is not automatically usable monomer recovery. Measure whether the desired output can be recovered to the specification required downstream.
Bench performance is not automatically scaled performance. Treat increasing process scale as a new evidence gate, not a routine deployment step.

My rule is simple: model output can earn laboratory time; only measured process evidence can earn scale capital.

Plan the roadmap backward from cost parity

The commercial benchmark is unforgiving. Enzymatic recycling ultimately has to compete with inexpensive oil-based plastic production. A greener reaction that cannot approach a viable delivered cost will remain dependent on special conditions rather than becoming a broadly adopted circular process.

Build the economic model while discovery is still underway. At minimum, separate these cost lines:

Feedstock acquisition, sorting, and rejected material.
Preparation required before the enzyme can act on the target polymer.
Enzyme production, delivery, useful lifetime, and replacement.
Reactor capacity, reaction time, process control, and energy.
Monomer recovery and purification.
Waste handling, downtime, and variability in plant utilization.

Do not wait for perfect values. Use ranges, label each input as measured or assumed, and run sensitivity analysis. The purpose is to identify which uncertain variable can kill the business case. If enzyme lifetime dominates cost, another candidate-generation run may be rational. If purification dominates, generating thousands of additional sequences may be a distraction from the real constraint.
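A one-at-a-time sensitivity pass is enough to start. The sketch below assumes a deliberately simple cost function and invented ranges in arbitrary currency units per ton; the input names, the formula, and every number are hypothetical. Each input carries a low, base, and high value plus a measured or assumed label, and the output ranks inputs by how far they can move delivered cost.

```python
# name: (low, base, high, status), hypothetical values per ton of feedstock
INPUTS = {
    "feedstock": (100, 150, 200, "assumed"),
    "pretreatment": (40, 50, 60, "measured"),
    "enzyme_cost_per_kg": (20, 25, 30, "assumed"),
    "enzyme_kg_per_ton": (1.5, 2, 3, "measured"),
    "enzyme_reuse_cycles": (1, 5, 10, "assumed"),
    "reactor_energy": (30, 40, 50, "measured"),
    "purification": (80, 120, 250, "assumed"),
}


def cost_per_ton(x: dict) -> float:
    """Toy integrated cost: enzyme cost is spread over its reuse cycles."""
    enzyme = x["enzyme_cost_per_kg"] * x["enzyme_kg_per_ton"] / x["enzyme_reuse_cycles"]
    return (x["feedstock"] + x["pretreatment"] + enzyme
            + x["reactor_energy"] + x["purification"])


def sensitivity(inputs: dict, model) -> list[tuple]:
    """Vary one input across its range, holding the others at base."""
    base = {name: spec[1] for name, spec in inputs.items()}
    rows = []
    for name, (low, _, high, status) in inputs.items():
        swing = abs(model({**base, name: high}) - model({**base, name: low}))
        rows.append((name, swing, status))
    return sorted(rows, key=lambda row: row[1], reverse=True)


base_case = {name: spec[1] for name, spec in INPUTS.items()}
print("base cost per ton:", cost_per_ton(base_case))
for name, swing, status in sensitivity(INPUTS, cost_per_ton):
    print(f"{name:22s} swing={swing:7.1f}  [{status}]")
```

With these made-up ranges, purification produces the largest swing and is still an assumed input, so the next experiment would target recovery rather than another generation run. A one-at-a-time sweep ignores interactions between inputs, so treat it as a first ranking, not a full uncertainty analysis.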

Pair every scientific milestone with an industrial question:

Discovery gate: Are activity and selectivity reproducible enough to justify process work?
Process gate: Does the candidate perform inside the intended operating envelope rather than only under a convenient assay condition?
Feedstock gate: Does performance survive representative material formats and mixtures, including difficult packaging such as clamshells?
Demonstration gate: Can the system sustain the required material flow, output quality, and energy profile at a scale that tests the major engineering assumptions?
Commercial gate: Does the cost case remain credible when feedstock composition, utilization, throughput, and other sensitive inputs move away from the preferred case?

A planned 5,000-ton demonstration plant in California illustrates why demonstration capacity belongs on the product roadmap. A plant is not simply a larger laboratory. It tests whether biology, equipment, controls, feedstock variability, and recovery operations behave as an integrated product.

Before committing meaningful scale capital, ask six kill questions:

Which assumption has the largest effect on delivered cost per ton?
Which inputs are measured, and which still come from a design estimate?
At what physical scale was each important input measured?
What fails first when the feedstock mix changes?
If enzyme performance improves as planned, which downstream step becomes the bottleneck?
Which observed result will stop, narrow, or materially redesign the program?

Expansion into additional plastics should follow the same discipline. Enzyme selectivity creates a plausible path toward enzyme blends for mixed streams, and new plastic types and mixed-plastic blends remain important development directions. Treat each added polymer as a new product vertical with its own input contract, assays, process interactions, recovery requirements, and economics. A new enzyme is not automatically a low-cost extension of the first process.

Key takeaways for your next roadmap review

Define success as repeatable recovery of specified monomers, not the generation of novel enzyme sequences.
Run discovery as a closed loop connecting product constraints, AI generation, computational gates, wet-lab measurements, and process feedback.
Treat proprietary experimental context—including failures—as the data asset; candidate count alone is not a defensible moat.
Use separate gates for design compliance, laboratory performance, operating fit, and scale economics.
Work backward from cost parity and direct the next experiment toward the assumption that most threatens the integrated business case.

For your next review, ask the team to bring one page containing the input and output contracts, a diagram of the learning loop, the current stage-gate thresholds, the experimental data schema, and a cost sensitivity model with measured and assumed inputs clearly separated. Every roadmap item should change one of those artifacts or produce evidence for a named decision.

If the team cannot fill those fields yet, that is the immediate product work. The first defensible milestone is one traceable loop from a defined industrial problem through candidate generation, laboratory measurement, and an updated cost model. Repeat that loop with increasing realism before increasing capital exposure. That is how you determine whether programmable biology is becoming an industrial recycling product rather than remaining an impressive AI demonstration.

References

Product Talk — How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

May 14, 2026
