Tag: retrieval-first pipeline

Engineering MCP Agents as a Reliable Product Platform
Model Context Protocol adoption becomes consequential when an agent can retrieve organizational knowledge, select tools, and change a system of record. At that point, the engineering challenge is no longer simply connecting a model to an API. It is operating a product platform whose context, permissions, decisions, and side effects must remain dependable.

The source article’s experience with workflows spanning Miro, Jira, and Confluence points to a coherent platform model: retrieval determines what the agent knows, tool contracts constrain what it can do, evaluation tests its behavior, and observability makes failures diagnosable. Product strategy and interaction design then determine whether that machinery improves work users already perform.

Key takeaways
- Treat retrieval, tool schemas, prompts, policies, and telemetry as platform components with explicit owners and versioning.
- Prove one frequent, measurable workflow before expanding the agent’s tool and use-case surface.
- Combine least-privilege access with visible tool rationale, consent controls, audit records, and safe recovery paths.
- Evaluate the complete chain from retrieved context to downstream action, not just the quality of generated text.
- Govern the tool catalog and delivery pipeline continuously so that extensibility does not become uncontrolled operational risk.

The platform boundary extends beyond the MCP connection

MCP provides a practical interface through which models can reach data, tools, and actions, according to the source article. The protocol connection is therefore an enabling layer, not the whole agent platform. A production workflow also depends on source authority, identity and permission checks, context selection, tool arbitration, execution controls, user-facing recovery states, and evidence that the result was useful.

This broader boundary changes how teams should decompose the system. Retrieval is a managed context service rather than an incidental prompt-building step. Tools are governed capabilities rather than a loose collection of endpoints. Prompts and policies are deployable artifacts rather than text copied into application code. Traces and evaluations are part of the control plane because they reveal whether the other layers continue to work together.

The source recommends starting with authoritative content, normalizing it with docs-as-code discipline, attaching metadata that supports permission-aware filtering, and selecting the smallest high-signal context needed for a task. The engineering implication is important: access control must shape retrieval before information reaches the model. Filtering only when an action is attempted would leave the reasoning process exposed to context the user or agent may not be entitled to use.
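A minimal sketch of permission-aware context selection follows, assuming each chunk carries a set of groups allowed to read it and a relevance score from an upstream ranker. The `Chunk` shape, the group names, and the `select_context` helper are hypothetical illustrations, not part of MCP or of the source article's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrievable unit of content with access metadata (hypothetical shape)."""
    text: str
    source: str
    allowed_groups: frozenset
    score: float  # relevance score from an upstream ranker (assumed to exist)


def select_context(chunks, user_groups, limit=3):
    """Drop chunks the user may not read, then keep the highest-scoring few."""
    permitted = [c for c in chunks if c.allowed_groups & set(user_groups)]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:limit]


# Hypothetical corpus: the HR chunk scores highest but is not readable
# by this user, so it never reaches the model.
chunks = [
    Chunk("Roadmap draft", "confluence", frozenset({"product"}), 0.91),
    Chunk("Salary bands", "hr-wiki", frozenset({"hr"}), 0.95),
    Chunk("Sprint notes", "jira", frozenset({"product", "engineering"}), 0.72),
]
context = select_context(chunks, user_groups={"engineering", "product"})
print([c.source for c in context])
```

The filter runs before ranking and truncation, so a restricted chunk cannot crowd out permitted content or leak into the prompt.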

Context quality also affects more than answer accuracy. The source links focused retrieval to lower hallucination risk, more accurate tool calls, and lower cost. That makes retrieval performance a shared dependency for safety, reliability, latency, and economics. It deserves its own contracts, tests, freshness expectations, and failure modes.

A golden path turns architecture into an operating contract

The source describes an initial workflow that summarized a Miro board into action items and wrote them to Jira. It reports that variants involving Confluence summaries, epic splitting, and backlog grooming followed only after the original path reached its reliability targets. This is less a recommendation for those particular products than a useful sequencing principle for agent platform engineering.

A narrowly defined workflow exposes the entire contract between context and consequence. The team must decide which content is authoritative, what the model may infer, which tool is appropriate, what inputs the tool accepts, what the user should review, how a partial failure is handled, and how success is measured. A broad assistant can conceal these questions behind plausible conversation; a golden path forces explicit answers.

The right first workflow is therefore not merely technically convenient. It should be frequent enough to matter, have an observable completion state, and carry side effects that can be bounded. The source frames outcomes such as time saved during backlog grooming, better meeting notes in Confluence, and fewer context switches across Miro boards as more useful roadmap anchors than novel model capabilities. It also recommends comparing task success, completion time, user edits, detected defects, and downstream business effects rather than relying on engagement alone.

Those measures form a practical evidence chain. Evaluation results show whether the system behaves as designed; workflow measures show whether users can complete the task; business measures show whether the completed task creates value. Keeping the levels distinct prevents a technically impressive agent from being mistaken for a successful product.

Safety depends on controlling actions and explaining them

Tool access creates a sharper risk boundary than text generation because an incorrect decision can alter a ticket, document, or other shared record. The source’s proposed response combines least-privilege scopes, a human-readable rationale for each call, and an audit trail. It also calls for proposed inputs and expected side effects to be visible when the agent is about to use a tool.

These controls address different failure classes. Narrow scopes limit the maximum effect of a bad decision. Input previews help users catch incorrect parameters before execution. Rationale makes the selection inspectable. Audit records support diagnosis and accountability afterward. None substitutes for the others, and a confirmation dialog alone does not make an overprivileged tool safe.
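One way to keep these controls separate in code is sketched below. It assumes a proposal object that carries the rationale and expected effect, a set of granted scopes, and an in-memory audit log. All names here (`ToolCallProposal`, `AuditLog`, `authorize`, the `jira:write` scope) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCallProposal:
    """What the user sees before a tool runs (hypothetical shape)."""
    tool: str
    required_scope: str
    inputs: dict
    rationale: str
    expected_effect: str


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, proposal, outcome):
        """Append one audit entry describing the proposal and its outcome."""
        self.entries.append({
            "tool": proposal.tool,
            "inputs": proposal.inputs,
            "rationale": proposal.rationale,
            "outcome": outcome,
        })


def authorize(proposal, granted_scopes, user_approved, log):
    """Check scope first, then consent; every decision is audited."""
    if proposal.required_scope not in granted_scopes:
        log.record(proposal, "denied: scope not granted")
        return False
    if not user_approved:
        log.record(proposal, "held: awaiting user consent")
        return False
    log.record(proposal, "approved")
    return True


proposal = ToolCallProposal(
    tool="jira.create_issue",  # hypothetical tool name
    required_scope="jira:write",
    inputs={"summary": "Follow up on board notes"},
    rationale="The board lists an unassigned action item.",
    expected_effect="Creates one new issue",
)
log = AuditLog()
print(authorize(proposal, {"jira:read"}, True, log))                 # consent cannot widen scope
print(authorize(proposal, {"jira:read", "jira:write"}, True, log))
```

The scope check comes before the consent check on purpose: user approval never compensates for a missing permission.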

Recovery behavior belongs in the same design. The source recommends retrying suitable failures with backoff, falling back to read-only behavior, or requesting consent or missing context. A robust platform should distinguish failures that are safe to retry from failures that require a different plan. It should also preserve an understandable state when a multi-step workflow completes only partially, so the user knows what changed and what did not.
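The distinction between retryable and non-retryable failures can be sketched as follows. The exception classes, the `call_with_recovery` helper, and the returned status dictionary are assumptions made for this example; a real integration would map its own error codes onto the two classes.

```python
import time


class RetryableError(Exception):
    """Transient failure, such as a timeout (hypothetical classification)."""


class PermanentError(Exception):
    """Failure that retrying will not fix, such as a denied permission."""


def call_with_recovery(action, fallback, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff, else fall back."""
    for attempt in range(attempts):
        try:
            return {"status": "completed", "result": action()}
        except RetryableError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
        except PermanentError as exc:
            # A different plan is needed: degrade to the read-only path.
            return {"status": "fallback", "result": fallback(), "reason": str(exc)}
    return {"status": "fallback", "result": fallback(), "reason": "retries exhausted"}


def denied_write():
    raise PermanentError("permission denied")


outcome = call_with_recovery(denied_write, fallback=lambda: "read-only summary")
print(outcome["status"], "-", outcome["reason"])
```

Returning an explicit status and reason, rather than raising, gives the interface what it needs to tell the user what changed and what did not.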

Transparency need not mean exposing raw internal reasoning. The useful product surface is operational evidence: the sources used, the selected capability, the intended inputs, the expected effect, and the resulting status. The source suggests a reveal panel containing retrieved sources, candidate tools, and confidence signals for power users. More generally, the amount of review should follow the consequence of the action: low-risk retrieval can remain lightweight, while consequential writes warrant clearer inspection and consent.

Evaluation, observability, and delivery form one reliability loop

The source outlines offline tests for intent classification and tool selection, online shadow evaluations for live drift, and regression checks after deployment. It also recommends traces that capture prompts, retrieved chunks, tool inputs, tool outputs, latency, and error codes. Together, these practices connect a visible failure to the component and version that produced it.

Evaluation without observability can show that quality declined without explaining why. Observability without evaluation can produce detailed traces without deciding whether the behavior was acceptable. A mature loop needs both: test cases encode desired behavior, traces expose actual behavior, and production outcomes reveal gaps in the test set.

The delivery process must preserve that connection. The source treats prompts, tool schemas, and guardrails as versioned artifacts deployed behind feature flags, with canary releases, controlled comparisons, and rollback capability. This approach makes a behavioral change attributable. If tool selection deteriorates after a prompt revision or a schema update breaks an integration, operators can identify the change and contain its reach.
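As an illustration of attributable changes, the sketch below assumes every trace records the prompt and tool-schema versions that produced it, so error rates can be grouped by version. The `Trace` fields, version labels, and tool name are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trace:
    """One tool call, tagged with the artifact versions behind it (assumed shape)."""
    prompt_version: str
    tool_schema_version: str
    tool: str
    latency_ms: int
    error_code: Optional[str] = None


def error_rate_by_version(traces, attribute):
    """Group traces by a version attribute and return the error rate per version."""
    totals, errors = Counter(), Counter()
    for trace in traces:
        version = getattr(trace, attribute)
        totals[version] += 1
        if trace.error_code is not None:
            errors[version] += 1
    return {version: errors[version] / totals[version] for version in totals}


traces = [
    Trace("prompt-v1", "schema-v3", "jira.create_issue", 820),
    Trace("prompt-v2", "schema-v3", "jira.create_issue", 790, "INVALID_INPUT"),
    Trace("prompt-v2", "schema-v3", "jira.create_issue", 805),
    Trace("prompt-v1", "schema-v3", "jira.create_issue", 760),
]
print(error_rate_by_version(traces, "prompt_version"))
```

Because the schema version is constant across these traces while the prompt version varies, the comparison points at the prompt revision rather than the integration.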

Latency should be governed in the same loop because an accurate workflow can still fail as a product experience. The source reports using task-specific latency budgets, caching stable retrieval results, parallelizing safe calls, prefetching likely session context, and providing progress when work exceeds the expected budget. These techniques should remain subordinate to correctness: parallel execution is appropriate only when calls are independent, while caching must respect freshness and permission boundaries.
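The caching constraint can be made concrete with a small sketch, assuming the cache key includes the caller's permission groups and each entry expires after a fixed time-to-live. The `RetrievalCache` class and its parameters are hypothetical.

```python
import time


class RetrievalCache:
    """Cache retrieval results per (query, permission set) with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    @staticmethod
    def _key(query, user_groups):
        # Permissions are part of the key, so users with different
        # access never share a cached result.
        return (query, frozenset(user_groups))

    def put(self, query, user_groups, value):
        self._store[self._key(query, user_groups)] = (self._clock(), value)

    def get(self, query, user_groups):
        key = self._key(query, user_groups)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # stale: force a fresh retrieval
            return None
        return value


cache = RetrievalCache(ttl_seconds=300)
cache.put("open epics", {"product"}, ["EPIC-1", "EPIC-2"])
print(cache.get("open epics", {"product"}))
print(cache.get("open epics", {"contractor"}))  # different permissions: cache miss
```

A miss for a different permission set costs one extra retrieval, which is the safer trade compared with serving context the caller was never entitled to see.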

The source also assigns prompts a user-experience role, combining plain-language intent, domain constraints, and explicit tool contracts while using examples, tooltips, and in-product guidance to help users frame requests. This connects conversation design to reliability. Better instructions can reduce ambiguity before the platform has to resolve it through additional model turns or risky assumptions.

Scale requires governance of tools, teams, and ownership

MCP’s extensibility can turn into tool sprawl if every integration is added without lifecycle management. The source recommends a curated catalog recording each tool’s owner, scope, schema version, and deprecation policy. It also describes schema linting in continuous integration, backward-compatible changes, and quarterly retirement of unused tools. These are conventional platform disciplines applied to an agent’s capability surface.

A catalog is valuable because an agent reasons over descriptions and schemas while operators depend on stable implementation contracts. Poorly differentiated tools can make selection ambiguous; unannounced schema changes can invalidate prompts and evaluations; ownerless tools can remain available after their data or permission assumptions have changed. Governance should therefore assess semantic clarity as well as API validity.
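A catalog check of this kind might look like the sketch below, which could run in continuous integration. It assumes the catalog is a mapping from tool name to a metadata dictionary; the field names mirror the ones the source lists, while the `lint_catalog` function, the duplicate-description rule, and the sample entries are illustrative assumptions.

```python
REQUIRED_FIELDS = ("owner", "scope", "schema_version", "deprecation_policy")


def lint_catalog(catalog):
    """Return a list of human-readable problems found in the tool catalog."""
    problems = []
    seen_descriptions = {}
    for name, entry in sorted(catalog.items()):
        # Lifecycle metadata: every tool needs an accountable owner and policy.
        for field_name in REQUIRED_FIELDS:
            if not entry.get(field_name):
                problems.append(f"{name}: missing {field_name}")
        # Semantic clarity: identical descriptions make tool selection ambiguous.
        description = entry.get("description", "").strip().lower()
        if description in seen_descriptions:
            problems.append(
                f"{name}: description duplicates {seen_descriptions[description]}"
            )
        elif description:
            seen_descriptions[description] = name
    return problems


catalog = {
    "jira.create_issue": {
        "owner": "platform-team", "scope": "jira:write",
        "schema_version": "1.2.0", "deprecation_policy": "90-day notice",
        "description": "Create a Jira issue",
    },
    "jira.new_ticket": {
        "owner": "", "scope": "jira:write",
        "schema_version": "0.9.0", "deprecation_policy": "90-day notice",
        "description": "create a Jira issue",
    },
}
for problem in lint_catalog(catalog):
    print(problem)
```

An exact-match comparison is a deliberately crude proxy for semantic overlap; it catches copy-pasted descriptions, while near-duplicates still need human review.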

Organizational design matters for the same reason. The source describes an empowered trio consisting of a product manager responsible for outcomes and risk posture, a forward-deployed engineer focused on schemas and scalability, and a designer responsible for conversational flows and recovery states. It also favors weekly evaluation reviews over demonstration-led progress. The underlying principle is shared ownership: platform reliability cannot be delegated entirely to model engineering when the decisive questions span product value, system behavior, permissions, and user comprehension.

The source’s proposed 30-day starter sequence moves from selecting one workflow and defining permissions, measures, and evaluations; through retrieval and a minimal tool set; to an instrumented internal pilot; and finally to hardening and a limited beta. The schedule is reported as a blueprint rather than independent proof of how long every implementation should take. Its more transferable lesson is dependency order: define the outcome and risk boundary before multiplying capabilities.

As agents begin coordinating across products, the durable advantage will come from platforms that preserve this discipline across every new connection. MCP can make capabilities composable, but dependable composition will still depend on controlled context, explicit authority, observable execution, and evidence that the workflow improves real work.

References
- Shivam.Consulting Blog — Mastering MCP: Battle-tested Playbooks from Miro, Atlassian, and What I’ve Learned
June 8, 2026

Prompt Engineering for Amplitude Global Agent That Holds Up

You ask Amplitude Global Agent why activation fell. It returns a plausible explanation, but you still can’t tell which events it examined, whether the comparison was valid, or what your product team should do next.

The fix is to treat the prompt as an analysis specification. Define the decision, provide the relevant analytics context, constrain unsupported conclusions, and make the agent show its work. You will get an answer that is easier to verify and more useful in a product review.

Start with the decision, not a broad request for insights

Requests such as “analyze activation” leave several decisions unresolved. The agent must guess what activation means, which users belong in the analysis, which period matters, and what kind of answer you expect. Even a polished response may answer the wrong question.

Before writing the prompt, complete this sentence: “After reading the answer, we need to decide whether to…” Your ending might be “change the onboarding sequence,” “investigate a recent release,” or “prioritize one segment for discovery.” That decision gives the analysis a destination.

Then assign a role that matches the work. “You are a product analyst investigating activation performance” is more useful than “You are a helpful assistant.” Add the audience as well. An executive needs the size and business relevance of a change; a product trio also needs the affected steps, segments, and follow-up questions.

A strong opening contains three elements:

Role: the analytical perspective the agent should take.
Decision: what the team will choose or investigate after reading the result.
Success criteria: what the answer must establish before it is useful.

For example: “You are a product analyst helping the onboarding team decide whether to redesign a weak activation step. Identify the largest meaningful drop-off, show which defined segment is most affected, and separate measured findings from possible explanations.”

Give the agent a compact analytics contract

The most reliable prompt names the data the agent may use. Include the relevant event names, property names, segment definitions, filters, and timeframe. If activation has an internal definition, write it out rather than relying on the agent to infer it.

This is a retrieval-first approach: put authoritative definitions, dashboard context, and prior query logic into the request before asking for interpretation. Concrete grounding reduces room for invented assumptions and makes repeated analyses easier to compare. A structured prompt can also specify the role, business objective, allowed data, and output fields.

Prompt element	What to provide	What it prevents
Metric definition	The exact event sequence or outcome that counts	A different interpretation of activation or retention
Population	Included users or accounts and explicit exclusions	Comparisons across unlike populations
Segments	Named properties and the values to compare	Arbitrary segmentation
Timeframe	The analysis period and comparison period	Hidden or inconsistent date choices
Evidence boundary	The events, properties, definitions, and dashboards allowed	Unsupported claims presented as measured facts
Output contract	Required sections, fields, ordering, and length	A long narrative that cannot be reviewed quickly

Do not dump every available definition into the context. Include only what the question requires. More context is useful when it resolves ambiguity; irrelevant context competes for attention and makes the prompt harder for a teammate to audit.

Use a reusable prompt that exposes uncertainty

You can adapt the following structure for activation, retention, anomaly investigation, or another behavioral analysis:

Role and audience: “Act as a product analyst. Write for the product manager and analytics lead responsible for [area].”
Decision: “Help us decide whether to [decision].”
Question: “Determine [specific analytical question].”
Definitions: “For this analysis, [metric] means [explicit event or outcome definition].”
Data context: “Use these events: [names]. Use these properties: [names]. Compare these segments: [definitions]. Analyze [timeframe] against [comparison period]. Apply [filters and exclusions].”
Constraints: “Use only the supplied Amplitude analytics events, properties, and definitions. Do not treat an unmeasured explanation as a finding.”
Output: “Return the metric result, segment comparison, timeframe, evidence, interpretation, confidence or limitation, and recommended next check.”
Fallback: “If the available data cannot answer the question, state what is missing and provide the smallest follow-up query needed.”

The fallback matters. Without it, the agent has an incentive to complete the requested narrative even when the evidence is incomplete. A useful failure is specific: it identifies a missing event, undefined property, absent comparison, or ambiguous metric. Your team can fix that. A confident guess is harder to detect.

Ask for measured findings, interpretations, and recommendations as separate fields. A measured drop-off is evidence. A claim that users were confused is an interpretation unless the supplied data establishes it. A recommendation to inspect session replay or conduct customer interviews is a next step, not proof of the cause. Keeping those layers separate makes the result safer to use in prioritization.
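The template above can be kept as a small function so that a missing definition fails loudly instead of being silently omitted. This is a sketch under stated assumptions: the `build_prompt` helper, its field names, and the example values are hypothetical, and the result is plain text to paste into the agent rather than a call to any Amplitude API.

```python
REQUIRED_INPUTS = (
    "area", "decision", "question", "metric", "definition", "events",
    "properties", "segments", "timeframe", "comparison", "filters",
)

TEMPLATE = "\n".join([
    "Act as a product analyst. Write for the product manager and analytics "
    "lead responsible for {area}.",
    "Help us decide whether to {decision}.",
    "Determine {question}.",
    "For this analysis, {metric} means {definition}.",
    "Use these events: {events}. Use these properties: {properties}. "
    "Compare these segments: {segments}. Analyze {timeframe} against "
    "{comparison}. Apply {filters}.",
    "Use only the supplied Amplitude analytics events, properties, and "
    "definitions. Do not treat an unmeasured explanation as a finding.",
    "Return the metric result, segment comparison, timeframe, evidence, "
    "interpretation, confidence or limitation, and recommended next check.",
    "If the available data cannot answer the question, state what is missing "
    "and provide the smallest follow-up query needed.",
])


def build_prompt(**inputs):
    """Fill the template, refusing to proceed when any input is missing."""
    missing = [name for name in REQUIRED_INPUTS if not inputs.get(name)]
    if missing:
        raise ValueError("missing prompt inputs: " + ", ".join(missing))
    return TEMPLATE.format(**inputs)


# Hypothetical values for an activation analysis.
prompt = build_prompt(
    area="onboarding",
    decision="redesign the workspace setup step",
    question="which onboarding step has the largest drop-off",
    metric="activation",
    definition="signup followed by first_project_created within 7 days",
    events="signup, workspace_setup, first_project_created",
    properties="plan_type, signup_source",
    segments="plan_type = free versus plan_type = trial",
    timeframe="the last 28 days",
    comparison="the previous 28 days",
    filters="exclude internal test accounts",
)
print(prompt)
```

Storing the required inputs next to the wording keeps the template and its context requirements together as one asset.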

Turn prompt quality into a small product evaluation

Do not judge a prompt by whether one response sounds intelligent. Save the prompt version, input context, and output. Then test it against a question whose answer your team already knows. This gives you a reference point for accuracy before you use the template on an ambiguous problem.

Score each version on three dimensions:

Accuracy: Did the answer use the supplied definitions, filters, segments, and timeframe correctly?
Clarity: Can a reviewer distinguish evidence, interpretation, limitations, and next steps?
Actionability: Does the result support the stated decision or name the next query required?

Change one meaningful element at a time. You might compare a broad objective with a decision-specific objective, a narrative response with a fixed output contract, or an unrestricted answer with an explicit evidence boundary. Run the same test question through each variant. Otherwise, you will not know which change improved the result.
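A small scoring sketch for such a comparison follows. It assumes reviewers rate each variant from 1 to 5 on the three dimensions; the scale, the variant names, and the numbers are invented for illustration.

```python
DIMENSIONS = ("accuracy", "clarity", "actionability")


def average_scores(reviews):
    """Average each dimension across reviewers for one prompt variant."""
    return {d: sum(r[d] for r in reviews) / len(reviews) for d in DIMENSIONS}


def rank_variants(variants):
    """Rank variants by the mean of their per-dimension averages, best first."""
    ranked = []
    for name, reviews in variants.items():
        per_dimension = average_scores(reviews)
        ranked.append((name, sum(per_dimension.values()) / len(DIMENSIONS)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Two variants that differ in exactly one element: the objective.
variants = {
    "broad objective": [
        {"accuracy": 3, "clarity": 2, "actionability": 2},
        {"accuracy": 3, "clarity": 3, "actionability": 2},
    ],
    "decision-specific objective": [
        {"accuracy": 4, "clarity": 4, "actionability": 5},
        {"accuracy": 4, "clarity": 3, "actionability": 4},
    ],
}
for name, score in rank_variants(variants):
    print(f"{name}: {score:.2f}")
```

Keeping the per-dimension averages alongside the overall rank helps when a variant gains clarity but loses accuracy, a trade the single number would hide.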

Commit to two or three prompt iterations for one critical workflow. Review the failures, tighten the ambiguous instruction, and keep the better-performing version. Within a sprint, that process can produce a reusable template for a recurring analysis such as activation, retention, or anomaly detection.

Store winning prompts with their required inputs and known limitations. A template without those notes invites cargo-cult use: teammates copy the wording but omit the definitions that made it work. Treat the prompt, context requirements, evaluation question, and scoring criteria as one asset.

Key takeaways

State the product decision before requesting analysis.
Define the metric, population, segments, filters, and timeframe explicitly.
Restrict conclusions to the analytics evidence you supplied.
Separate measured findings from interpretations and recommended actions.
Require a specific fallback when the data is insufficient.
Version and score prompts for accuracy, clarity, and actionability.

Start with the recurring Amplitude question that currently creates the most debate. Write its decision, definitions, evidence boundary, and output contract. Run two or three scored iterations, then give the winning template to another product manager. If they can obtain a defensible answer without you translating the prompt, it is ready to become part of the team’s operating system.

References

Amplitude — Prompt Like a Pro: Three Battle-Tested Tips for Amplitude Global Agent Success

May 26, 2026

Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

How does it handle your real-world setup?

Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
- Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
- Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
- Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
- Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
- Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
- Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.

What does it feel like to interact with the Agent?

Two AI Agents can post the same quantitative scores—resolution rates, containment rate, and more—and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
- Is the tone natural and on-brand, or does it feel robotic and generic?
- Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
- When it doesn’t know the answer, does it handle that gracefully?
- When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and that you can keep improving over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
- How does customer feedback influence the product roadmap, and can they show you examples?
- If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
- What kind of support will you get post-launch?
- Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.

Inspired by this post on The Intercom Blog.
May 22, 2026

5 Proven Agent Skills I Use to Automate Weekly Product Reviews with Claude, Cursor, and Codex

Weekly product reviews are where strategy meets execution, and over the past year I’ve turned them into a high-signal, low-friction ritual by leaning on agentic AI. As VP of Product Management at HighLevel, Inc., I’ve standardized a set of agent skills that compress preparation time, surface the right insights, and keep PMs, engineers, and designers focused on decisions—not document wrangling.

"Learn how our teams use agent skills with claude, cursor and codex to run product reviews as PMs, engineers, and designers. Here are 5 killer use cases for builder."

Below, I walk through the five skills I rely on most in our weekly cadence—each one mapped to a clear product management outcome. They’re simple to set up, easy to govern, and aligned with core practices like continuous discovery, product roadmapping and sprint planning, and eval-driven development.

Skill 1 — Backlog triage with signal extraction: I point an agent at fresh tickets, customer notes, and experiment results to cluster themes, tag impact, and flag regressions. Using a retrieval-first pipeline and Agent Analytics, the assistant ranks items by value, effort, and risk so our meeting starts with a prioritized, explainable shortlist instead of a raw queue.

Skill 2 — PRD and spec synthesizer: Ahead of the review, an agent drafts a one-page PRD update from design diffs, git history, and decision logs. With Claude Code and Cursor, it highlights interface changes, acceptance criteria, and open questions, linking back to sources. The result is a crisp, auditable brief that keeps product trios aligned without re-litigating context.

Skill 3 — Experiment and metrics analyzer: An analytics agent pulls A/B testing readouts, checks minimum detectable effect assumptions, and annotates anomalies. It turns raw telemetry into a narrative: what moved, by how much, and whether we trust it. This makes our discussion about tradeoffs, not spreadsheets, and speeds commitments on next steps.

Skill 4 — Voice-of-customer synthesizer: The assistant clusters interviews, support threads, and NPS verbatims into jobs-to-be-done and pain themes. It proposes opportunity solution tree updates and calls out places where our roadmap diverges from customer signal. That keeps continuous discovery alive in the room—even when time is tight.

Skill 5 — Roadmap and sprint planning co-pilot: After decisions, an agent converts outcomes into scoped backlog items, engineering tasks, and stakeholder updates. It drafts sprint goals, flags dependency risks, and aligns work to objectives. Because it’s grounded in the meeting record, it preserves intent while removing ambiguity.

Under the hood, prompt engineering patterns and guardrails keep these workflows predictable: a retrieval-first pipeline for context, eval-driven development for quality checks, and role-specific prompts for PMs, engineers, and designers. With Claude Code I generate structured diffs and test scaffolds; with Cursor I accelerate code-review summaries; and with Codex I bootstrap utility scripts that keep the loop tight between insights and implementation.

The payoff is tangible: higher decision velocity, fewer meetings to “re-clarify,” and clearer accountability across the product organization. Just as important, governance and privacy-by-design are built in—every agent logs rationale, cites sources, and respects data boundaries—so leaders can scale AI workflows confidently.

If you’re looking to level up your product reviews, start with these five skills, measure impact with Agent Analytics, and iterate. Small automations compound quickly, and the more consistently you run them, the more your team’s attention shifts from preparing content to making better product decisions.

Inspired by this post on Amplitude – Perspectives.

May 4, 2026

How to Build a Reliable WhatsApp AI Ordering Agent

You are not really deciding whether an LLM can chat about a menu. You are deciding whether it can turn a messy WhatsApp exchange into a correct, payable order without making the customer or venue staff repair its work.

That distinction changes the product. The hard parts are structured order state, deterministic commerce operations, response time, failure recovery, and venue-specific evaluation. Get those right and WhatsApp can become a genuine ordering channel. Get them wrong and you have a fluent chatbot sitting in front of an unreliable transaction.

Key takeaways

Define success as a confirmed, recoverable order in the system of record, not a conversation that sounded helpful.
Let the model interpret customer language, but keep menu data, prices, modifiers, delivery eligibility, payment state, and order commits behind deterministic tools.
Store the current order as structured state outside the transcript. A conversation is evidence of intent, not an order ledger.
Measure useful response time across the complete WhatsApp-to-POS path, then remove tool round trips and parallelize safe read operations.
Make item identification accuracy the primary trust metric, supported by guardrails for modifiers, payments, duplicate submissions, handoffs, and latency.
Evaluate every venue against its real menu and rules, then turn recurring configuration, tests, and operating procedures into reusable templates.

Define the product around a completed order

WhatsApp is the interface, not the product boundary. The product boundary should run from the customer’s first request to an order state that the venue can fulfill and the customer can verify.

A useful benchmark is the end-to-end flow implemented by AITropos: recommendations, item modifiers, delivery-zone checks, payment links, and status updates inside WhatsApp. Covering the whole journey matters because every missing step creates a handoff. A bot that recommends a meal but cannot resolve its required modifiers is a discovery feature. A bot that drafts an order but cannot verify submission is an assistant. Neither is yet an autonomous ordering agent.

Write an order contract before choosing models or orchestration frameworks. The contract is the minimum structured state required to fulfill, charge for, recover, and audit an order. It will usually include:

The venue and the applicable menu version.
Canonical item identifiers, quantities, and customer-facing item names.
Required and optional modifier selections, represented by identifiers rather than prose alone.
Fulfillment method, such as pickup or delivery.
The validated delivery result when delivery is requested.
A system-generated quote, including the values the customer must approve before payment or submission.
Payment-link and payment states, without treating a generated link as proof of payment.
Customer confirmation state, POS submission state, and the resulting order identifier.
The current owner of the interaction: agent, venue staff, or a defined recovery process.
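
The contract above can be sketched as a small data structure. This is a minimal illustration, not the AITropos schema: every class, field, and state name below is an assumption, and payment policy is left out because it varies by venue.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class PaymentState(Enum):
    NOT_REQUESTED = "not_requested"
    LINK_CREATED = "link_created"  # a link exists; this is not proof of payment
    PAID = "paid"
    FAILED = "failed"


class SubmissionState(Enum):
    NOT_SUBMITTED = "not_submitted"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    UNRESOLVED = "unresolved"


class Owner(Enum):
    AGENT = "agent"
    VENUE_STAFF = "venue_staff"
    RECOVERY = "recovery_process"


@dataclass
class OrderLine:
    item_id: str       # canonical menu identifier
    display_name: str  # customer-facing name
    quantity: int
    modifier_ids: list[str] = field(default_factory=list)


@dataclass
class OrderContract:
    venue_id: str
    menu_version: str
    lines: list[OrderLine]
    fulfillment: str                           # "pickup" or "delivery"
    version: int = 1                           # bumped on every validated change
    delivery_validated: Optional[bool] = None  # None until a tool has answered
    quote_total_minor: Optional[int] = None    # from the commerce system, in minor units
    payment_state: PaymentState = PaymentState.NOT_REQUESTED
    confirmed_version: Optional[int] = None    # the version the customer approved
    submission_state: SubmissionState = SubmissionState.NOT_SUBMITTED
    pos_order_id: Optional[str] = None
    owner: Owner = Owner.AGENT

    def blockers(self) -> list[str]:
        """List the reasons this order must not be submitted yet."""
        reasons = []
        if not self.lines:
            reasons.append("order has no lines")
        if self.fulfillment == "delivery" and self.delivery_validated is not True:
            reasons.append("delivery not validated")
        if self.quote_total_minor is None:
            reasons.append("no authoritative quote")
        if self.confirmed_version != self.version:
            reasons.append("customer has not confirmed the current version")
        return reasons
```

A non-empty result from blockers() gives the agent a concrete reason to ask a question, call a tool, or hand the order to a person instead of submitting.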

The contract gives product, engineering, operations, and venue teams the same definition of done. It also exposes where autonomy is not yet safe. If the integration cannot validate a delivery zone, for example, the agent should collect the address and hand the order to a person. It should not infer eligibility from a conversational guess.

Order stage	The agent’s job	Condition before proceeding
Discover	Map natural language to menu candidates and explain relevant options.	One supported item is identified, or the agent asks a specific clarifying question.
Configure	Capture quantity, required modifiers, exclusions, and additions.	Every required choice is present and valid for that item.
Fulfillment	Resolve pickup or delivery and call the applicable eligibility checks.	The requested fulfillment method is supported for this order.
Quote and payment	Retrieve the authoritative quote and create the approved payment flow.	Prices and payment state come from the commerce system, not generated text.
Commit	Present the structured summary and submit the confirmed order once.	The customer has confirmed the current version and the POS returns a result.
Status and recovery	Report system-backed status or transfer the interaction with its context intact.	The response is tied to an order identifier or an explicit handoff owner.

Pay particular attention to the acceptance boundary. A friendly message such as “your order is being prepared” is an operational commitment. It must appear only after the system of record has accepted the order. If submission times out or returns an ambiguous result, the safe response is that confirmation is still pending, followed by a status check or human recovery. Guessing success can create duplicate orders, missed orders, and payment disputes.
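
One way to enforce that boundary is to derive the customer-facing message from the submission outcome rather than from the model's wording. The sketch below assumes a three-valued outcome; the enum, function name, and message text are illustrative only.

```python
from enum import Enum
from typing import Optional


class SubmissionOutcome(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    UNRESOLVED = "unresolved"  # timeout, network error, or ambiguous response


PENDING = "We are still confirming your order with the venue and will update you shortly."


def customer_message(outcome: SubmissionOutcome, pos_order_id: Optional[str] = None) -> str:
    if outcome is SubmissionOutcome.ACCEPTED and pos_order_id:
        return f"Your order is confirmed. Reference: {pos_order_id}."
    if outcome is SubmissionOutcome.REJECTED:
        return "The venue could not accept this order, so it has not been placed."
    # UNRESOLVED, or ACCEPTED without an identifier: never claim success.
    return PENDING
```

An accepted result without an order identifier is deliberately treated as pending, because there is nothing the customer or staff could use to verify it.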

You can still launch with partial automation, but name it accurately. Menu search, order drafting, and staff-assisted submission can deliver value while the integrations mature. The mistake is allowing the customer to believe the order was accepted when the product has only generated a plausible summary.

Keep the order deterministic even when the conversation is not

Customers do not speak in schemas. They change quantities, refer to items by incomplete names, add a second request before answering the first question, and revise earlier choices. Your architecture has to translate that non-deterministic conversation into structured, POS-compatible data without losing which version the customer actually approved.

My rule is simple: the model may interpret intent and propose an order-state change, but deterministic services must validate and commit it. The transcript should never be the only place where the current order exists.

A reliable turn can follow this sequence:

Load the current structured order, venue configuration, and relevant menu context.
Interpret the latest message as a proposed change: add, remove, replace, modify, confirm, cancel, pay, or request status.
Resolve referenced items and modifiers to canonical identifiers.
Call read-only tools for availability, configuration, fulfillment rules, or quotes as needed.
Validate the proposed change against required modifiers and venue rules.
Write a new order-state version and generate the next response from that validated state.
Use a separate, idempotent write operation when the customer confirms submission.

This design makes corrections much safer. If the customer says, “Make the second one large and remove the fries,” the agent should apply a state delta to the identified lines, validate the revised configuration, and show the updated summary. It should not regenerate the entire order from memory and hope that unrelated details remain intact.
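
A state delta of that kind can be sketched as follows. The modifier identifiers and the required-group rule are hypothetical; a real venue would load them from its menu configuration.

```python
from dataclasses import dataclass, replace

# Hypothetical venue rule: for each item, exactly one option is required from each group.
REQUIRED_GROUPS = {"burger": [frozenset({"size-regular", "size-large"})]}


@dataclass(frozen=True)
class Line:
    item_id: str
    quantity: int
    modifiers: frozenset = frozenset()


@dataclass(frozen=True)
class OrderState:
    version: int
    lines: tuple


def validation_errors(line: Line) -> list[str]:
    errors = []
    for group in REQUIRED_GROUPS.get(line.item_id, []):
        if len(line.modifiers & group) != 1:
            errors.append(f"{line.item_id}: choose exactly one of {sorted(group)}")
    return errors


def apply_delta(state, line_index, add=frozenset(), remove=frozenset()):
    """Return a new order version with only the targeted line changed."""
    if not 0 <= line_index < len(state.lines):
        raise IndexError("no such order line")
    old = state.lines[line_index]
    new = replace(old, modifiers=(old.modifiers - frozenset(remove)) | frozenset(add))
    errors = validation_errors(new)
    if errors:
        raise ValueError("; ".join(errors))
    lines = state.lines[:line_index] + (new,) + state.lines[line_index + 1:]
    return OrderState(version=state.version + 1, lines=lines)
```

For “make the second one large and remove the fries,” the call would target index 1, add size-large, and remove size-regular and side-fries. The first line is carried over untouched, and the previous version stays available for audit.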

Tool contracts should be narrow and explicit. Menu search should return canonical candidates and the information needed to distinguish them. Item detail should return valid modifier groups. A quote tool should return authoritative values. A payment tool should return a system-created link or a structured error. An order-submission tool should return an accepted identifier, a definite rejection, or an unresolved state that triggers recovery.

Do not let the model invent a price, payment URL, availability claim, delivery decision, or order status. These are business facts with financial and operational consequences. The response composer can explain them in natural language, but the underlying values must come from an approved system.

Separate reads from writes in the architecture. Independent menu and item lookups can often run in parallel. Writes should be serialized against a known order-state version. Every commit operation should accept an idempotency key so a retry cannot create a second order. If the state changed after the customer saw the summary, require confirmation of the new version rather than silently committing it.
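
The version check and idempotency key can be sketched in a few lines. The submit_to_pos adapter and the key format are assumptions, and the in-memory dictionary stands in for a durable store. Locking around concurrent commits is omitted.

```python
class StaleVersionError(Exception):
    """The order changed after the customer saw the summary."""


class OrderCommitter:
    def __init__(self, submit_to_pos):
        # submit_to_pos is a hypothetical adapter: (payload, idempotency_key) -> POS order id.
        self._submit = submit_to_pos
        self._accepted = {}  # idempotency key -> POS order id; use a durable store in practice

    def commit(self, order_id, current_version, confirmed_version, payload):
        if confirmed_version != current_version:
            raise StaleVersionError("confirm the new version before submitting")
        key = f"{order_id}:v{confirmed_version}"
        if key in self._accepted:
            return self._accepted[key]  # retry: return the first result, do not resubmit
        # The key is also passed downstream so the POS side can deduplicate
        # a retry that follows a timeout, when no result was recorded here.
        pos_order_id = self._submit(payload, key)
        self._accepted[key] = pos_order_id
        return pos_order_id
```

Deriving the key from the order and its confirmed version means a retry of the same confirmation maps to the same key, while a changed order needs a fresh confirmation and therefore a new key.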

The same discipline applies to human handoff. Transfer the structured cart, unresolved question, relevant tool results, and submission state along with the transcript. A handoff that forces staff to reread the entire conversation and reconstruct the order is not graceful degradation; it is deferred manual work.

Choose the orchestration pattern from the service objective, not from architectural fashion. Under tight response constraints, AITropos chose direct tool calls instead of MCP or a multi-stage pipeline to reduce orchestration overhead. That is not a universal argument against MCP. It is a reason to benchmark the actual path. Compare end-to-end latency, traceability, schema governance, failure isolation, and engineering cost using representative ordering turns. If an abstraction adds useful control, keep it. If it only adds another round trip, remove it.

Manage latency as part of the customer experience

The model’s inference time is only one part of latency. From the customer’s perspective, the clock starts when the message is sent and stops when a useful next action arrives. Context retrieval, menu search, validation, payment calls, POS submission, message delivery, retries, and overloaded queues all sit inside that interval.

Instrument the complete path before optimizing it. Capture timestamps for message receipt, context assembly, model execution, every tool call, state validation, response creation, and outbound delivery. Report median and tail latency by turn type. A single average can hide a checkout path that is consistently slower than menu questions.

At minimum, separate these turn classes:

Menu discovery and recommendation.
Item identification and configuration.
Cart edits and corrections.
Delivery or fulfillment validation.
Quote and payment-link creation.
Order confirmation and POS submission.
Order-status retrieval.
Human escalation and recovery.
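
Reporting by turn class needs very little code once the timestamps exist. The sketch below assumes each sample is a pair of turn-class label and end-to-end latency in milliseconds; the labels and field names are illustrative.

```python
from collections import defaultdict
from statistics import median, quantiles


def latency_report(samples):
    """samples: iterable of (turn_class, end_to_end_latency_ms) pairs."""
    by_class = defaultdict(list)
    for turn_class, latency_ms in samples:
        by_class[turn_class].append(latency_ms)

    report = {}
    for turn_class, values in by_class.items():
        if len(values) > 1:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            p95 = quantiles(values, n=20, method="inclusive")[-1]
        else:
            p95 = values[0]
        report[turn_class] = {
            "count": len(values),
            "median_ms": median(values),
            "p95_ms": p95,
        }
    return report
```

Keeping median and tail side by side for each class is what exposes a slow checkout path that a single blended average would hide.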

Set a service objective for each class from observed channel behavior and the operational risk of delay. There is no useful universal number. A status lookup and a multi-item order edit do different work. What matters is that the team can see which component consumes the budget and what happens when that component times out.

Optimize in the order that removes uncertainty as well as delay:

Remove unnecessary model and tool round trips. Load the active order and venue configuration before asking the model what to do.
Parallelize independent read operations, such as resolving multiple products mentioned in one message.
Prefetch likely item context so the agent does not discover basic menu facts one call at a time.
Inject only the context needed for the current turn. An oversized prompt shifts latency from tool calls to model processing rather than eliminating it.
Keep deterministic validation outside the model when a rule or schema check can answer immediately.
Give every external dependency a timeout, an observable error state, and a safe recovery path.
Use concise responses that advance the order. Extra prose increases reading time and can obscure the decision you need from the customer.
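
Two of those steps, parallel reads and per-dependency timeouts, combine naturally. This is a minimal sketch using asyncio; the search callable and fake_search are hypothetical stand-ins for a read-only menu tool.

```python
import asyncio


async def resolve_items(terms, search, timeout_s=0.5):
    """Run one read-only search per term concurrently, each with its own timeout."""

    async def one(term):
        try:
            return await asyncio.wait_for(search(term), timeout=timeout_s)
        except asyncio.TimeoutError:
            # An observable error state instead of a hung turn or a guessed result.
            return {"term": term, "status": "timeout", "candidates": []}

    # gather preserves input order, so results line up with the terms.
    return await asyncio.gather(*(one(term) for term in terms))


async def fake_search(term):
    # Hypothetical stand-in for a menu-search tool; "slow" simulates a stuck dependency.
    await asyncio.sleep(1.0 if term == "slow" else 0.01)
    return {"term": term, "status": "ok", "candidates": [f"sku-{term}"]}
```

With this shape, one stuck lookup costs at most the timeout, and the agent can still act on the items that resolved while asking about the one that did not.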

A useful implementation pattern is already visible in production: multiple product searches run in parallel, product context is prefetched, and smaller, faster components prepare the relevant context for each turn. The product lesson is not to create a swarm of agents. It is to move predictable preparation out of the critical reasoning loop while preserving one coherent order state.

Watch the failure mode on the other side of aggressive optimization. Cached menu metadata can reduce retrieval work, but stale availability or price data can create a wrong commitment. Define which fields are stable enough to cache, how they are invalidated, and which values must be retrieved at quote or submission time. Speed is valuable only when the answer remains authoritative.
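
A field-level freshness policy makes that decision explicit. The field names and time budgets below are placeholders, not recommendations; the point of the sketch is that unclassified fields default to a live read.

```python
# Hypothetical per-field freshness budgets in seconds; 0 means "always fetch live".
MAX_AGE_S = {
    "description": 3600,
    "modifier_groups": 300,
    "price": 0,
    "availability": 0,
}


def may_use_cached(field_name: str, age_s: float) -> bool:
    """Decide whether a cached menu field is fresh enough to use."""
    # Unknown fields default to 0, so they are read live until someone classifies them.
    return age_s < MAX_AGE_S.get(field_name, 0)
```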

When a slow operation cannot be avoided, use an honest progress message and preserve the pending state. Do not fill the wait with repeated acknowledgements that imply completion. If the customer sends another message while the tool is running, the state machine should know whether to queue the change, cancel the pending operation, or ask the customer to wait for its result.

Evaluate each venue, then template what repeats

Make item accuracy precise enough to govern decisions

Item identification accuracy deserves to be the primary trust metric. If the agent resolves the wrong item, every later component can behave perfectly and still produce the wrong order. AITropos treats order item identification accuracy as its most important KPI, giving model, prompt, retrieval, and fallback decisions a common objective.

Define the metric before building a dashboard. I would count an attempted line item as correct only when the canonical item, quantity, and required modifier interpretation match the customer’s resolved intent. A necessary clarification is not automatically an error; it should count against a separate clarification-burden metric. Otherwise, the team may improve apparent accuracy by asking the customer to confirm every obvious detail.
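
As a rough sketch of that definition, the function below scores attempted line items and reports clarification burden separately. The record shape is an assumption: each attempt carries a predicted and an expected tuple of item identifier, quantity, and a frozenset of modifier identifiers.

```python
def score_line_items(attempts):
    """attempts: dicts with 'predicted' and 'expected' as (item_id, quantity, modifiers)
    tuples, plus 'clarifications', the number of questions asked for that line."""
    if not attempts:
        raise ValueError("no attempts to score")
    correct = sum(1 for a in attempts if a["predicted"] == a["expected"])
    clarifications = sum(a.get("clarifications", 0) for a in attempts)
    return {
        "item_accuracy": correct / len(attempts),
        "clarifications_per_line": clarifications / len(attempts),
    }
```

Reading the two numbers together shows whether an accuracy gain was earned by better resolution or bought with extra questions.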

Do not let the primary KPI hide transaction failures. Pair it with guardrails for:

Unsupported substitutions or invented items.
Missing and invalid required modifiers.
Customer corrections after the agent presents a summary.
Quote, payment-link, and POS tool failures.
False confirmations, unresolved submissions, and duplicate commits.
Order completion and abandonment by journey stage.
Human handoff rate, reason, and time to recovery.
End-to-end latency by turn class and venue.

Link corrections back to the original decision. If the customer changes an item because the agent misunderstood it, label the item-resolution turn rather than treating the correction as an unrelated edit. That is how production behavior becomes useful evaluation data instead of a collection of support anecdotes.

Simulate failures before customers encounter them

A venue-specific evaluation suite should use that venue’s menu identifiers, modifiers, availability behavior, delivery rules, payment flow, and POS adapter. A generic restaurant benchmark can test language understanding, but it cannot tell you whether the agent knows that a particular size requires a particular modifier or that two similar menu names map to different SKUs.

Build test families for:

Incomplete names, colloquial references, and ambiguous matches.
Several products requested in one message.
Required modifiers, exclusions, additions, and invalid combinations.
Quantity changes, replacements, removals, and cancellation.
Unavailable items and acceptable alternatives.
Pickup, delivery, and addresses that cannot be validated.
Quote changes before confirmation.
Payment failure, delayed payment state, and an abandoned payment flow.
Tool timeouts, malformed tool results, retries, and uncertain POS submission.
Interrupted conversations that resume with an existing cart.
Requests that require staff judgment rather than autonomous execution.

Generate the expected structured order independently from the agent being tested. Otherwise, the same model can reproduce its own misunderstanding in both the answer and the grade. Keep a small, human-reviewed set of critical conversations alongside the larger generated suite, and add every material production failure to the permanent regression set.

Scale matters when menus contain many combinations. Before each new venue goes live, AITropos runs thousands of simulated customer conversations overnight. The number alone is not the release gate. Coverage, a trustworthy expected answer, and clear failure categories are what make simulation useful.

Simulation also cannot reproduce every production condition. Follow it with a staff sandbox and a controlled production phase. Use only redacted, properly authorized customer conversations in evaluation systems, and retain no more personal data than the test requires.

I would treat any path that invents a price or payment state, falsely confirms an order, or can duplicate a commit as release-blocking. Other thresholds should reflect the venue’s menu complexity, existing human baseline, handoff capacity, and the cost of a wrong order. Record those thresholds before the final test run so launch pressure cannot redefine success afterward.

Roll out autonomy in observable stages

Start with a venue that is operationally manageable but representative enough to expose real modifiers, fulfillment rules, and integration behavior. An unusually simple pilot may produce a clean demo while postponing the problems that determine whether the product can scale.

Configuration: ingest and normalize the menu, map canonical identifiers, mark required modifiers, connect fulfillment and payment rules, and produce a completeness report. No customer-facing ordering is enabled.
Sandbox: let venue staff run realistic conversations while write tools remain disabled or point to a test environment.
Approval mode: allow the agent to prepare a structured order, but require a person to approve the commit. Measure how often the person changes it and why.
Constrained production: enable autonomous submission for the supported venue, fulfillment modes, and order types, with a staffed handoff path and rapid rollback.
Expansion: widen scope only after production traces confirm the accuracy, latency, recovery, and operational workload expected by the release criteria.

For every stage, decide who can pause the agent, how staff take over an active conversation, how the customer learns that a person has taken over, and how an uncertain submission is reconciled before another order is created. These are product requirements, not post-launch operating notes.

Once one venue works, resist copying its prompt and integrations into a new branch. Make venue differences configuration wherever possible: normalized menu schemas, modifier patterns, fulfillment policies, tool mappings, escalation contacts, evaluation packs, and dashboard dimensions. Keep truly distinct behavior explicit rather than burying it in prompt prose.

The scalability payoff can be substantial. AITropos reduced new-venue onboarding from three months to a few weeks, and it is using domain templates to shorten it further. Track your own onboarding work by category: configuration, data cleanup, integration, prompt or policy changes, evaluation, venue training, and launch support. If every venue still requires bespoke code and a rewritten conversation flow, the product has not yet separated its platform from its implementations.

Your next step should be concrete. Choose one representative venue and create three artifacts: the canonical order contract, a failure-and-recovery matrix for every tool, and a venue-specific evaluation set built from redacted, authorized scenarios. If those artifacts cannot show what happens when item resolution, a modifier, delivery validation, payment, or POS submission fails, the agent is not ready to accept orders. Once those states are explicit, model and architecture choices become testable decisions rather than matters of confidence.

References

Shivam.Consulting Blog – Inside AITropos: Lightning-Fast AI Employees for Hospitality That Take Orders on WhatsApp

April 30, 2026

AI Product Data Security: A Practical Playbook for PMs

Your AI feature is ready to move beyond the prototype, but one question can still stop the release: exactly which customer data leaves your boundary, where is it copied, and who can retrieve it later? If the answer is scattered across architecture diagrams, vendor settings, and assumptions, you do not yet have a security decision.

You can resolve that uncertainty without turning every experiment into a committee exercise. Map the data path, assign the capability a risk lane, minimize what the model receives, and automate the controls that follow from the classification. The result is a release process that is both faster and easier to defend.

Start with the data path, not the model

The first security question is not what the model knows. It is what your product sends, retrieves, transforms, stores, logs, and displays. A provider can have a strong security posture while your implementation still exposes data through an overbroad retrieval query, a debug log, or an incorrectly scoped support tool.

Draw the complete path for one user request. Do not use a generic platform diagram. Follow the actual capability from the moment a user or system creates an input until every resulting copy has expired or been deleted.

Identify the original input, including form fields, uploaded files, messages, system-generated events, and API payloads.
List the context added by your application, such as account attributes, conversation history, analytics, retrieved documents, feature configuration, or tool results.
Mark every transformation before the model call: filtering, redaction, tokenization, summarization, chunking, or schema conversion.
Name the service that receives each payload, including gateways, model providers, observability tools, evaluation systems, queues, and caches.
Trace the response through validation, tool execution, display, analytics, support access, and downstream storage.
Record when each copy expires, how deletion propagates, and who can access it while it exists.

For every step, capture six fields: data class, system owner, access scope, external recipient, retention rule, and failure consequence. If any field is unknown, label it unknown. An explicit unknown is useful discovery work; an undocumented assumption is hidden risk.
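
The six fields fit a simple record, and defaulting each one to unknown makes the gaps countable. This is a sketch; the class and function names are illustrative.

```python
from dataclasses import dataclass, fields

UNKNOWN = "unknown"


@dataclass
class DataFlowStep:
    name: str
    data_class: str = UNKNOWN
    system_owner: str = UNKNOWN
    access_scope: str = UNKNOWN
    external_recipient: str = UNKNOWN
    retention_rule: str = UNKNOWN
    failure_consequence: str = UNKNOWN


def open_unknowns(steps):
    """Return (step name, field name) pairs that still need discovery work."""
    return [
        (step.name, f.name)
        for step in steps
        for f in fields(step)
        if getattr(step, f.name) == UNKNOWN
    ]
```

Each pair returned by open_unknowns() is a candidate backlog item with an obvious question attached.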

Do not stop at obvious records such as customer PII and payment identifiers. Prompts, retrieved context, user-linked analytics, internal roadmaps, feature flags, configuration values, embeddings, vector stores, and evaluation datasets can also reveal confidential facts or inferred identity. Treat them as product data with owners and controls, not harmless implementation residue.

Use a completion test that exposes weak assumptions

Your map is ready for a decision when someone outside the feature team can answer these questions from it:

What is the most sensitive field the capability can receive?
Which fields cross the company boundary, and which named service receives them?
Can one customer ever retrieve another customer’s data?
Are raw prompts, completions, retrieved passages, or tool results logged?
Which identities can inspect those logs or replay a request?
What happens to derived data when the original record is deleted or its permissions change?
Which control contains the incident if the model, retrieval layer, or tool call behaves unexpectedly?

If the team can answer these questions only by asking several vendors or searching production settings, treat the release decision as still open. The missing work is not paperwork. It is part of the product’s operating design.

Turn the risk assessment into a release lane

A risk score is useful only when it changes what the team must do. Avoid a long questionnaire that ends with an ambiguous rating. Use a small number of lanes, give each lane an observable entry condition, and attach default release controls.

Risk lane	Typical signals	Default release posture
Low	Internal capability; synthetic or public inputs; no sensitive context; no consequential external action	Approved provider, least-privilege credentials, basic access tests, and confirmation that secrets are not entering prompts or logs
Elevated	Customer-facing capability; authenticated user context; behavioral telemetry; stored prompts or outputs; retrieval from private content	Data minimization, pre-call redaction, permission-aware retrieval, explicit retention, adversarial evaluations, runtime monitoring, and a named incident owner
High	Regulated-data adjacent; payment identifiers; broad confidential retrieval; sensitive identity data; or authority to perform a consequential action	Early involvement from security, legal, privacy, and data teams; documented threat model; human approval where an action warrants it; verified containment; and release evidence reviewed before exposure

These lanes are an operating model, not a compliance determination. Applicable controls depend on the actual data, customer contracts, geography, industry, and use case. Security and legal specialists should make those determinations when the capability creates legal, regulatory, or material customer exposure.

Classify the capability, not the entire product. A writing assistant that uses text supplied for a single request may sit in a different lane from an account assistant that searches every customer conversation and updates CRM records, even when both use the same model.

Score the capability across these dimensions:

Data sensitivity: public, internal, confidential, personal, payment-related, or regulated-data adjacent.
Audience: constrained employee group, all employees, authenticated customers, or public users.
Retrieval reach: one supplied record, an authorized account subset, or a broad internal corpus.
Action authority: produces a suggestion, drafts a change, or executes an external action.
Persistence: ephemeral processing, structured event storage, or retained raw inputs and outputs.
Third-party exposure: stays inside your controlled environment or passes through one or more providers and subprocessors.

Use the highest-risk dimension to set the initial lane. Lower it only after a design change removes the exposure. A promise to be careful is not a mitigating control; scoped retrieval, enforced redaction, disabled raw logging, and restricted tool permissions are.
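
The highest-dimension rule is easy to encode. In the sketch below, the value-to-lane mapping is an illustrative assumption that each team should set with its security owners, and treating a missing answer as High is a design choice of this example rather than a rule from the text.

```python
from enum import IntEnum


class Lane(IntEnum):
    LOW = 1
    ELEVATED = 2
    HIGH = 3


# Illustrative mapping only; not a compliance determination.
DIMENSION_LANES = {
    "data_sensitivity": {
        "public": Lane.LOW, "internal": Lane.LOW, "confidential": Lane.ELEVATED,
        "personal": Lane.ELEVATED, "payment": Lane.HIGH, "regulated_adjacent": Lane.HIGH,
    },
    "audience": {
        "employee_group": Lane.LOW, "all_employees": Lane.LOW,
        "authenticated_customers": Lane.ELEVATED, "public": Lane.ELEVATED,
    },
    "retrieval_reach": {
        "supplied_record": Lane.LOW, "account_subset": Lane.ELEVATED, "broad_corpus": Lane.HIGH,
    },
    "action_authority": {"suggests": Lane.LOW, "drafts": Lane.ELEVATED, "executes": Lane.HIGH},
    "persistence": {
        "ephemeral": Lane.LOW, "structured_events": Lane.LOW, "raw_retained": Lane.ELEVATED,
    },
    "third_party": {"internal_only": Lane.LOW, "external_providers": Lane.ELEVATED},
}


def initial_lane(capability: dict) -> Lane:
    """Set the lane from the highest-risk dimension."""
    lanes = []
    for dimension, mapping in DIMENSION_LANES.items():
        value = capability.get(dimension)
        # A missing or unrecognized answer counts as HIGH until someone resolves it.
        lanes.append(mapping.get(value, Lane.HIGH))
    return max(lanes)
```

Because the function takes the maximum, lowering the lane requires changing an input, which mirrors the requirement that only a design change can remove an exposure.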

Reclassify when the feature changes its data, audience, retrieval reach, retention, provider, or ability to act. A seemingly small roadmap addition, such as remembering past conversations or connecting a second data source, can change the security posture more than a model upgrade does.

Design the system to disclose less data

The most reliable way to protect data is to keep unnecessary data out of the AI path. Encryption and contractual terms matter, but they do not make an irrelevant customer field necessary. Start with the user outcome and ask which minimum facts the model needs to produce it.

Minimize before you redact

Redaction is a valuable deterministic safeguard, but it should not carry the whole design. Free-form text can contain names, secrets, identifiers, and confidential business information in formats your rules do not recognize. Reduce the payload first, then redact the smaller payload that remains.

Replace a full customer object with the few fields required for the task.
Use a temporary account token when the model does not need a person’s name, email address, or payment identifier.
Convert long interaction histories into purpose-specific structured fields when the task does not require the original prose.
Exclude internal notes, disabled fields, hidden metadata, and unrelated attachments by default.
Log structured events such as policy result, model identifier, latency, and request status when raw prompt text is not required.

Separate identity from content wherever the workflow allows it. The application can retain the relationship between a temporary token and an account while the model processes only the content needed for the task. Access to the token map should remain narrower than access to routine AI telemetry.
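
Allowlisting fields and swapping identity for a token can be sketched together. The field names and token format are hypothetical, and token expiry is omitted for brevity.

```python
import secrets

# Hypothetical allowlist for one task; everything else in the customer object is dropped.
ALLOWED_FIELDS = ("plan", "open_ticket_count")


class TokenMap:
    """Maps accounts to opaque tokens. Access to this object should be tightly restricted."""

    def __init__(self):
        self._token_by_account = {}
        self._account_by_token = {}

    def token_for(self, account_id: str) -> str:
        if account_id not in self._token_by_account:
            token = "acct_" + secrets.token_hex(8)
            self._token_by_account[account_id] = token
            self._account_by_token[token] = account_id
        return self._token_by_account[account_id]

    def account_for(self, token: str) -> str:
        return self._account_by_token[token]


def minimize(customer: dict, tokens: TokenMap) -> dict:
    """Build the model payload from an allowlist, never from the full object."""
    payload = {"account_token": tokens.token_for(customer["account_id"])}
    payload.update({k: customer[k] for k in ALLOWED_FIELDS if k in customer})
    return payload
```

An allowlist fails closed: a field added to the customer object later stays out of the prompt until someone decides the task needs it.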

Make retrieval permission-aware

A retrieval-first architecture can keep the raw corpus inside your controlled boundary while selecting only relevant context for a request. It is not automatically private. If an external model receives the selected passages, those passages still cross the boundary and still require minimization, redaction, approved-provider controls, and a clear retention policy.

Apply authorization when the request is made, not only when content is indexed. The retrieval layer should constrain results by tenant, user, role, and current document permissions before any text becomes model context. Do not index content that the eventual searcher could never be allowed to read unless the architecture has another enforceable isolation boundary.
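
A minimal sketch of request-time filtering follows. The document model and the word-overlap ranking are placeholders for a real index; what matters is that the tenant and role filter runs before any ranking or prompt assembly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    tenant_id: str
    allowed_roles: frozenset
    text: str


def retrieve(query, documents, tenant_id, roles, limit=3):
    """Filter by current permissions first, then rank what remains."""
    roles = frozenset(roles)
    permitted = [
        d for d in documents
        if d.tenant_id == tenant_id and d.allowed_roles & roles
    ]
    # Naive word-overlap score, standing in for a real similarity search.
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in permitted]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:limit]]
```

Passing the current permissions on each call, rather than trusting flags stored at indexing time, is what lets a revoked role take effect on the next request.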

Treat embeddings and vector-store metadata as sensitive derived data. A vector is not a magic anonymizer, and metadata can disclose document names, account relationships, categories, or activity patterns even when full text is elsewhere. Your deletion and permission-change process must reach the index, cached results, evaluation copies, and any stored citations, not just the primary database.

Retrieved content is also untrusted input. A malicious or compromised document can contain instructions intended to change model behavior. Keep system instructions separate, restrict available tools, validate tool arguments, and enforce authorization in application code. The model should never be the component that decides whether a user may access a record or perform an action.

Place deterministic controls on both sides of the call

Before the call: validate the request schema, remove disallowed fields, redact known sensitive patterns, apply allow and deny policies, and constrain retrieval.
After the call: validate output structure, block disallowed sensitive patterns, verify any cited record belongs to the authorized scope, and check tool arguments before execution.
During operation: monitor unusual prompt, output, retrieval, and access patterns without creating a second uncontrolled store of raw content.

An output filter cannot undo data already disclosed to an external provider. Use post-call checks to protect users and downstream systems, but use pre-call minimization and access enforcement to prevent the disclosure itself.
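
Pattern-based redaction can serve both sides of the call: redact() before the request and contains_sensitive() as a post-call check. The two patterns below are deliberately simple illustrations and will miss many real formats, which is why the smaller payload matters more than the regular expressions.

```python
import re

# Illustrative patterns only. Free-form text will contain formats these miss.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits with optional separators


def redact(text: str) -> str:
    """Replace known sensitive patterns before the text becomes model input."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD_LIKE.sub("[CARD]", text)


def contains_sensitive(text: str) -> bool:
    """Post-call check: flag output that still carries a known pattern."""
    return bool(EMAIL.search(text) or CARD_LIKE.search(text))
```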

Make vendor approval specific to the intended use

Do not approve an AI vendor in the abstract. Approve a defined service, account configuration, data class, region, retention posture, and use case. A provider suitable for public-content summarization may not be suitable for customer conversations or payment-related identifiers.

Ask questions that produce enforceable answers rather than broad assurances:

Training and service improvement: Can prompts, files, retrieved passages, outputs, feedback, or metadata be used to train models or improve services? Is the restriction a default, a setting, or a contractual term?
Retention: How long does each data type remain in primary systems, safety systems, failure logs, backups, and support tooling? What initiates deletion, and what exceptions apply?
Human access: Under what conditions can provider personnel inspect customer content, and how is that access authorized, logged, and reviewed?
Security controls: Is data encrypted in transit and at rest? What key-management options, private networking, scoped credentials, access logs, and administrative controls are available?
Location and subprocessors: Which regions process and store the data? Where can support access occur? Which subprocessors participate in the path?
Assurance evidence: Which services and controls are covered by SOC 2, ISO 27001, or HIPAA-related commitments where relevant to the use case?
Response: How will the provider communicate a security incident, policy change, model change, or subprocessor change that affects your approved use?

An audit or certification is useful evidence about a defined scope. It is not proof that your architecture, settings, or use case is safe. Confirm that the service named in the evidence is the service your product will actually call, and that your configuration does not bypass the controls you evaluated.

Keep a short decision record with the approved purpose, permitted and prohibited data, named endpoints or services, required account settings, retention terms, region, responsible owner, and review triggers. Reopen the decision when the purpose, data class, provider terms, model path, subprocessor chain, or architecture changes.

A shared catalog of approved providers and patterns also reduces shadow AI. Make the approved route easier to use by supplying scoped credentials, reference architectures, redaction utilities, retrieval patterns, and clear examples of prohibited inputs. Governance works better when the safe path is a usable product for internal teams.

Put the controls into delivery and incident response

A policy that depends on every engineer remembering every rule will drift. Store the capability’s classification, required controls, approved provider configuration, and decision owner alongside the delivery artifacts. Version changes so the team can see when a new data source or retention behavior altered the release posture.

Translate the release lane into automated checks wherever the control can be tested:

Scan prompts, templates, configuration, and code for exposed secrets and unapproved endpoints.
Unit-test redaction and tokenization against representative allowed and disallowed inputs.
Integration-test tenant boundaries, role permissions, retrieval filters, and deletion propagation.
Run evaluations that attempt to elicit restricted data, override instructions, retrieve unauthorized records, or trigger tools outside the allowed scope.
Validate the selected provider, model path, region, logging setting, and retention configuration against the approval record.
Block release when required evidence, monitoring, rollback controls, or an incident owner is missing.
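
The last two checks can be sketched as a function a pipeline step could call. The setting names, evidence keys, and record shapes are assumptions made for illustration.

```python
REQUIRED_EVIDENCE = ("monitoring", "rollback_control", "incident_owner")
CHECKED_SETTINGS = ("provider", "model_path", "region", "raw_logging", "retention_days")


def release_blockers(approval: dict, deployed: dict, evidence: dict) -> list[str]:
    """Compare deployed configuration with the approval record and required evidence."""
    blockers = []
    for setting in CHECKED_SETTINGS:
        if setting not in approval:
            blockers.append(f"approval record does not cover {setting}")
        elif deployed.get(setting) != approval[setting]:
            blockers.append(
                f"{setting}: deployed {deployed.get(setting)!r}, approved {approval[setting]!r}"
            )
    for item in REQUIRED_EVIDENCE:
        if not evidence.get(item):
            blockers.append(f"missing {item}")
    return blockers
```

A pipeline would fail the release whenever the returned list is non-empty, and the messages tell the team which record to fix.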

Evaluation data needs the same scrutiny as production data. Remove unnecessary identities, restrict access, define retention, and avoid copying raw customer interactions merely because an evaluation system is internal. A test corpus can become a long-lived data store if nobody owns its lifecycle.

Monitor security-relevant events rather than indiscriminately recording content. Useful signals include blocked sensitive-data patterns, denied cross-scope retrieval, calls to unapproved services, unusual access behavior, unexpected changes in model or endpoint usage, and failed retention or deletion jobs. Structured metadata often provides the operational signal you need without preserving every prompt and completion.

Prepare containment before the first customer request

Your incident runbook should name the people and mechanisms needed to contain the feature. Depending on the incident, that can include disabling the affected path with a feature flag, revoking or rotating credentials, restricting retrieval, stopping unsafe logging, locating downstream copies, and contacting the provider.

Do not improvise evidence deletion or customer notification during an incident. Security, privacy, and legal owners should determine preservation, notification, and regulatory obligations based on the specific exposure. The product runbook should make those owners reachable and give them an accurate data-flow record, timestamps, affected systems, and containment status.

After containment, update the control that failed: the architecture, automated check, provider setting, policy, runbook, or team guidance. A review that ends with a reminder to be more careful leaves the same mechanism in place.

Key takeaways

- Map every copy of the data, including retrieved passages, logs, embeddings, evaluations, caches, and tool results.
- Classify individual capabilities by their highest-risk dimension, then attach mandatory controls to the lane.
- Minimize fields before redaction, enforce permissions outside the model, and treat derived stores as sensitive.
- Approve vendors for a named use, configuration, data class, region, and retention posture rather than issuing blanket approval.
- Put redaction, access, retrieval, configuration, evaluation, and release checks into CI/CD.
- Design containment and ownership before launch so an incident does not begin with a search for the right people and switches.

Pick one AI capability currently approaching release and produce its request-to-deletion data map. Assign its lane, turn every unknown into an owned backlog item, and automate the first control the team is still checking by hand. That is how security becomes part of product delivery instead of a negotiation at the end.

References

Shivam.Consulting Blog – AI Data Security for Product Teams: Protect Sensitive Product Data Without Slowing Innovation

April 27, 2026

Cracking the Hardest Percentages: Turn Complex Support into Scalable, Trust-Building Automation

I’ve learned that the smallest slice of your support queue often dictates the majority of your operating cost, customer memory, and automation ceiling. In product reviews and CX ops deep-dives, I see the same pattern: the “easy” tickets pad your resolution counts, but the complex, multi-step queries quietly own your handle time and your brand trust. If you care about compounding impact, your customer support AI strategy has to target that hardest percentage first.

Complex queries are a small percentage of your queue, but they consume a disproportionate share of your team’s time.

Take a typical queue: password resets outnumber refund disputes ten to one, but a reset takes five minutes and a dispute takes thirty. The “rare” query accounts for over a third of total handling time. The same pattern holds for account investigations, subscription changes, and billing disputes.
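
The arithmetic behind that claim can be checked directly. The sketch below uses only the illustrative numbers from the paragraph above (ten resets per dispute, five and thirty minutes respectively); the function name is an assumption.

```python
def share_of_handling_time(volumes: dict[str, int], minutes: dict[str, float]) -> dict[str, float]:
    """Return each query type's fraction of total handling time."""
    totals = {kind: volumes[kind] * minutes[kind] for kind in volumes}
    grand_total = sum(totals.values())
    return {kind: total / grand_total for kind, total in totals.items()}


shares = share_of_handling_time(
    volumes={"password_reset": 10, "refund_dispute": 1},
    minutes={"password_reset": 5, "refund_dispute": 30},
)
print(f"{shares['refund_dispute']:.1%}")  # 30 of 80 minutes, printed as 37.5%
```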

How you handle complex queries is also what customers actually remember about their support experience. When someone is dealing with a damaged order or a billing dispute, the stakes are higher, and a fast, good resolution is what separates a forgettable interaction from one that builds lasting trust.

Most AI Agents automate the easy, informational queries well. The question for your automation rate is whether they can handle the hard ones. That’s where agentic AI and robust AI workflows make or break your outcomes.

We’ve gotten really good at informational queries – the hard part is what comes next. I’ve seen teams invest deeply here, and for good reason: it lifts containment quickly and cheaply. But to break through the plateau, you have to execute actions across systems, not just answer with text.

We’ve invested deeply in informational Q&A. We built Apex, a specialized customer service model trained on billions of support interactions, as Fin’s core answering engine. Beneath that sits a custom retrieval model, a purpose-built reranker, and a unified RAG pipeline, all trained specifically for customer service. Fin resolves issues at a higher rate than general-purpose frontier models, with fewer hallucinations and at lower cost.

But informational Q&A only covers queries where text is the answer. Most Agents can handle that. Far fewer let you configure complex, multi-step actions without a forward-deployed engineer setting it up for you, which creates a gap.

Every query your team handles falls into one of three categories:

Informational: “Can you ship transatlantic by priority next day?” Answered with text from your knowledge base.

Personalized: “Where is my order?” Requires data unique to that user.

Action-led: “My order arrived damaged, I need a refund.” Requires doing something: checking a return window, cross-referencing transaction data, making a judgment call – reading from multiple systems and acting across them.

These complex queries, the ones that require multi-step processes across systems, aren’t edge cases; they’re the reason your support team exists. This is the gap Fin Procedures was built to close.

It works in practice, and the trajectory matters for product strategy and ops planning.

Procedures is live, it’s scaling, and the results are clear. Since launching in managed availability, Procedures has handled over 1.5 million conversations, and volume is doubling month over month across hundreds of apps in fintech, e-commerce, gaming, healthcare, and SaaS.

When customers hit complex, multi-step queries, the experience is dramatically better when Fin can do the work end-to-end. We tested this with a randomized 5% holdout – conversations where Procedures would normally run, but didn’t. CSAT was 28.93% higher when Procedures ran, a statistically significant result.

A product, not a services engagement. I’ve sat through too many “automation” projects that were really solutions engineering gigs: workshops, custom scripts, then a queue of change requests when policies shift. It’s fragile and slow.

The B2B AI industry has a consultingware problem. It’s not databases being forked anymore, it’s prompts. The economics of maintaining bespoke setups per customer don’t work. Either the application falls behind new models, or the vendor changes the model and quality degrades invisibly.

In my view, an agentic AI platform should be a product your team owns end to end: a natural language editor – literally paste your existing SOPs – branching logic, data connectors, and AI-powered simulations for testing. Your CX ops team configures this, iterates on it, owns it. If you need help, a forward-deployed team can assist, but they’re optional, not a dependency. You always have control.

And because it’s a unified product, improvement compounds. When the vendor optimizes a prompt, every customer’s Procedures get better. When they upgrade the model, they can A/B test across the entire customer base and know it’s better before rolling out. You can’t do that when every customer has a bespoke prompt. The consulting model isn’t just expensive, it’s structurally unable to compound.

Today, Fin Procedures is available to every Intercom customer – no waitlist or managed rollout, ready for all 8,000+ customers.

We’re iterating fast based on real customer feedback. Here’s what’s landed since the last major update, and why it matters for reliability and governance:

AI-powered Procedure review: Flags broken logic, missing references, and unreachable conditions before you deploy.

Procedure failure reporting: A new reporting dimension that lets you drill into conversations where Procedures failed, so you can diagnose and fix.

Version history with rollback: Track every change, compare versions, roll back if needed.

Data connector health monitoring: See at a glance if your integrations are healthy, degraded, or failing.

Optional data connector parameters: Fin only asks customers for information when it’s actually needed, instead of prompting for every field.

Email Simulation support: Test how your Procedures behave across chat and email before going live.

Agent in the Loop (Beta) unlocks the next tranche of automation. Even with Procedures, two things hold teams back from automating their most complex queries: missing integrations and policies that require a human sign-off on sensitive decisions.

“Agent in the Loop” is built for both. Need Fin to check your internal admin tools but haven’t built a data connector yet? Put a human checkpoint at that step. Fin handles the conversation, gathers context, and pauses, surfacing a structured summary for a human agent to verify or act, then resumes. You get automation on the 80% that doesn’t need the integration.

For compliance – identity verification, high-value refunds – Fin does the legwork, a human makes the final call and then hands it back to Fin. This works natively in the Intercom Inbox and via Slack. Some competitors don’t have an inbox-native variant at all, meaning humans need to leave their primary workspace to review AI actions.

Procedures are also built to let you collaborate with all your teammates – both human agents and AI Agents. Fin can work with them directly inside a Procedure, using APIs and webhooks to loop in another teammate mid-flow, hand off context, and pick back up once they’re done.

Making it easier, faster. Procedures is already self-serve, but the next step is making Procedure creation, testing, and maintenance significantly more streamlined, with less manual editing and more AI-assisted building and debugging. There’s a lot coming in this space over the next few months – and it aligns perfectly with a retrieval-first pipeline and stronger governance at scale.

The hardest percentages matter the most. The biggest unlock for your automation rate won’t be answering more FAQs, it will be handling the complex, multi-step queries that consume your team’s time and define what customers remember about their experience with you.

That means working with an Agent that goes beyond answering questions and executes processes. A product your team owns and configures, not a service you buy and hope gets maintained. And a platform where every improvement compounds across every customer. That’s Procedures. Available now, for everyone.

Inspired by this post on The Intercom Blog.

April 14, 2026

How We Taught Agentic AI to Speak Product Analytics—and Unlocked Actionable Insights

I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.

The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.

We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.
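
To make the idea of a catalog entry concrete, here is a minimal sketch of one record and a lookup that fails loudly instead of guessing. The field names and the example metric are invented for illustration and are not taken from our actual catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    formula: str
    cohort: str
    window_days: int
    drivers: tuple[str, ...]
    owner: str
    version: int


# Illustrative entry only; the values are placeholders.
CATALOG = {
    "activation_rate": MetricDefinition(
        name="activation_rate",
        formula="activated_users / signed_up_users",
        cohort="new signups",
        window_days=7,
        drivers=("onboarding_completion", "first_project_created"),
        owner="growth-analytics",
        version=3,
    ),
}


def lookup(metric_name: str) -> MetricDefinition:
    """Return the canonical definition, or raise rather than let the agent improvise."""
    if metric_name not in CATALOG:
        raise KeyError(f"No catalog entry for metric '{metric_name}'")
    return CATALOG[metric_name]
```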

Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check the minimum detectable effect (MDE) for an A/B test, and summarize findings with references.
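
The retrieve-then-generate order can be sketched in a few lines. This is a toy illustration, not our production retrieval: word overlap stands in for a real ranking model, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    text: str


def retrieve(question: str, corpus: list[Artifact], top_k: int = 2) -> list[Artifact]:
    """Rank artifacts by words shared with the question (toy scoring)."""
    query_terms = set(question.lower().split())
    scored = [
        (len(query_terms & set(artifact.text.lower().split())), artifact)
        for artifact in corpus
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [artifact for score, artifact in ranked[:top_k] if score > 0]


def build_grounded_prompt(question: str, retrieved: list[Artifact]) -> str:
    """Place retrieved text, tagged with its ID, ahead of the question."""
    if not retrieved:
        return f"No supporting artifacts were found. Ask a clarifying question about: {question}"
    context = "\n".join(f"[{a.artifact_id}] {a.text}" for a in retrieved)
    return f"Answer using only the sources below and cite their IDs.\n{context}\n\nQuestion: {question}"
```

Tagging each passage with its ID is what allows the final answer to cite artifacts rather than merely sound plausible.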

Architecturally, we treated this as “Agent Analytics”: an orchestrator that selects tools based on intent—querying Amplitude analytics or Pendo for behavioral paths and funnels, hitting our warehouse for cohort tables, or pulling experiment metadata and anomaly detection alerts. Tool use is permission-aware, auditable, and designed to fail safe. The agent’s outputs include citations back to the exact definitions, dashboards, and SQL used, so reviewers can validate and iterate.
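
A simplified view of permission-aware, auditable tool selection that fails safe might look like the following. The intent labels and tool names are placeholders rather than our real routing table.

```python
from typing import Optional

# Hypothetical mapping from a classified intent to a tool.
TOOL_FOR_INTENT = {
    "funnel": "behavioral_analytics",
    "cohort": "warehouse",
    "experiment": "experiment_metadata",
}


def select_tool(intent: str, granted_tools: set[str], audit_log: list[dict]) -> Optional[str]:
    """Return the tool for an intent only if the caller may use it; otherwise return None."""
    tool = TOOL_FOR_INTENT.get(intent)
    allowed = tool is not None and tool in granted_tools
    # Every decision is recorded, including refusals.
    audit_log.append({"intent": intent, "tool": tool, "allowed": allowed})
    return tool if allowed else None
```

Returning `None` for an unknown intent or a missing grant means the default outcome is "do nothing", which is the fail-safe behavior described above.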

Quality came from eval-driven development, not intuition. We built a gold set of representative product questions (activation inflections, retention analysis by segment, funnel drop-offs after feature launches) and scored the agent on faithfulness to definitions, numerical accuracy, latency, and actionability. We incorporated regression checks to catch drifts after schema changes, and we tuned prompts to reduce overconfident answers and push for clarifying questions when context was missing.
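
As a rough sketch of how a gold-set case can be scored, the example below checks two of the dimensions mentioned above: numerical accuracy and faithfulness to the cited definition. The structure, tolerance, and names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldCase:
    question: str
    expected_value: float
    required_citation: str


def score_case(case: GoldCase, answer_value: float, citations: list[str],
               tolerance: float = 0.01) -> dict[str, bool]:
    """Score one answer on numerical accuracy and on citing the required definition."""
    return {
        "numerically_accurate": abs(answer_value - case.expected_value) <= tolerance,
        "faithful": case.required_citation in citations,
    }


def pass_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of cases in which every check passed."""
    if not results:
        return 0.0
    return sum(all(result.values()) for result in results) / len(results)
```

Rerunning the same cases after a schema change and comparing pass rates is one simple form of the regression check described above.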

Safety and reliability were non-negotiable. We layered AI risk management with role-based access, guardrails that block destructive queries, and risk scoring for unfamiliar joins or sudden spikes in metric deltas. The agent logs every step—what it retrieved, which tools it called, and why—so analysts can replay and refine the chain of thought with transparent provenance.
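
A guardrail against destructive queries can be as simple in principle as the keyword filter sketched below. This is a coarse illustration under stated assumptions, not our actual guardrail: it splits naively on semicolons, and a production setup would more likely combine read-only credentials with a real SQL parser.

```python
import re

# Statements treated as destructive in this sketch; the real list depends on the warehouse.
DESTRUCTIVE = re.compile(r"^\s*(drop|delete|truncate|alter|update|insert)\b", re.IGNORECASE)


def is_query_allowed(sql: str) -> bool:
    """Allow a query only if no statement in it starts with a destructive keyword."""
    statements = [part for part in sql.split(";") if part.strip()]
    # An empty query is rejected so that the default outcome is refusal.
    return bool(statements) and not any(DESTRUCTIVE.match(part) for part in statements)
```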

The payoff: product teams now self-serve nuanced questions in minutes instead of days, and our analysts spend more time on discovery than report wrangling. Retention analysis improved as the agent standardized cohort logic; conversion investigations accelerated thanks to consistent funnel definitions; and cross-functional decisions aligned around the same driver trees and shared language. Most importantly, the agent turned ambiguous asks into structured analyses that stand up to scrutiny.

For fellow product leaders, my lesson is simple: start with semantics, not models. A crisp ontology, disciplined taxonomy, and clear ownership will outperform a flashy stack riddled with ambiguity. Avoid technology FOMO; favor retrieval-first grounding, small sharp tools, and continuous discovery with your product trios. When your organization speaks a common analytics language, agents can finally think with you, not just for you.

Next, we’re extending the agent’s planning skills to recommend experiment designs, estimate power and minimum detectable effect (MDE), and propose driver-tree-informed bet sizing. We’re also tightening feedback loops so every accepted answer, edit, or override strengthens the retrieval corpus and evaluations. The vision: a calm, reliable layer that makes rigorous product analytics feel conversational and helps teams move from questions to confident action.

Inspired by this post on Amplitude – Best Practices.

April 13, 2026

Never Stop Disrupting: Why the Fin API Platform Signals a New Era for Agentic AI

Disruption is the only sustainable strategy in product. When a platform meaningfully changes how we build and operate, I pay attention—not just as a product leader, but as someone accountable for turning AI Strategy into durable competitive differentiation. That’s why the launch of the Fin API platform stands out: it’s a concrete step toward agentic AI at enterprise scale.

Today, I’m diving into what this launch includes, why it matters for product strategy, and how I’d navigate the build vs buy decision in this new landscape. My goal is to translate the announcement into actionable guidance for product teams, CX leaders, and forward-deployed engineers who are building the next generation of customer support and product-led experiences.

Fin is a customer agent platform that currently resolves over 2M customer issues a week, and that volume is growing rapidly. Brands large and small rely on it across many verticals, from Atlassian and Riot Games to smaller upstarts like Mercury and Polymarket. It runs on a family of models trained by its AI group. Last week, they announced Apex, billed as the world’s first specialized customer service LLM. In production tests over the last 6 months, it reportedly beat every frontier model it was compared against, including those from Anthropic and OpenAI, on resolution rate, latency, hallucination rate, and cost.

With this launch, teams can access the platform’s core capabilities and underlying models directly via API, with contracts starting at $250k per year, and usage rates that are by far the cheapest in the industry for each of the model’s subcategories. For leaders evaluating total cost of ownership, this is a meaningful data point: it shifts the economics of scaled automation from experimental to operational.

Why now? Because builders want options. I hear from teams daily that want to design their own agents, tune prompts and policies, and integrate with bespoke CRMs, data lakes, and product surfaces. The Fin announcement meets that demand with three clear build-paths, each mapping to a different operating model and maturity stage.

First, for the vast majority of companies, the Fin Agent Platform is the pragmatic starting point. Fin reports ~8k companies on it today. It addresses 99% of customer needs out of the box—without exhausting consulting engagements—while delivering top-tier resolution rates. If your priority is time-to-value, governance, and platform scalability, this route de-risks implementation and accelerates outcomes.

Second, for teams that need custom surfaces or channels, the Fin Agent API lets you present Fin in unique contexts. You get the Fin platform’s orchestration and controls, but you’re free to bypass the default messenger, email, voice, or any prebuilt channel and embed the agent natively in your product. I see this as the sweet spot for product-led growth motions where conversation design and UX writing are strategic levers.

Third, for companies building hyper-specific agents—think service plus in-product actions—the new API access to Apex and the broader collection of models is the obvious move. Unlike generalized models, these are purpose-trained for customer service scenarios and operational policies. If you have strong in-house solutions engineering, a retrieval-first pipeline, and eval-driven development in place, this path maximizes control without reinventing the model layer.

This also opens the door for vertical specialists. Fin-like businesses focused on deep domains can emerge quickly—Fin for dentists? Why not? Fin for car dealerships? Sure. I expect startups and modern CX providers (including players like Decagon and Sierra) to carve out niches where domain data, workflows, and compliance are the real moats. That’s where differentiated AI beats generic capability.

There’s a defensive reason to pay attention here. The software landscape is shifting fast: the moat is no longer feature parity; it is the quality of your agents and the data flywheels powering them. Building software is simply easier now, and I’ve watched engineering teams more than double measurable productivity as they adopt AI-assisted development. The implication is clear: the interface-and-features era is giving way to an agents-and-outcomes era.

Serious software companies must evolve from being a features company to an agents company—and build those agents on differentiated AI. More value will accrue at the model and orchestration layers, where safety, latency, cost, and resolution quality are won. That puts a premium on prompt engineering discipline, policy routing, continuous discovery of edge cases, and rigorous offline/online evals to keep hallucination rates low while maintaining speed.

How would I choose among the three build-paths? If you’re early or resource-constrained, start with the Fin Agent Platform to validate outcomes and align stakeholders. If you need branded experiences and tighter product integration, use the Fin Agent API to control surfaces without owning the heavy lifting. If you have strong ML ops and a mature customer support AI strategy, go model-level with Apex and companions, layering in your own guardrails, context window management, and test harnesses. In each case, balance velocity, control, and risk. Your build vs buy decision should be grounded in clear metrics and an explicit product strategy.

Where does this lead? We’ll see more companies expose specialized model families with clearer economics and stronger governance. For now, I’m excited to see what teams build with the Fin API platform—and how they turn agentic AI into measurable improvements in resolution rate, CSAT, cost-to-serve, and ultimately, customer loyalty.

Inspired by this post on The Intercom Blog.

April 3, 2026

How to Build Scalable, AI-Ready Product Documentation

Your AI assistant gives a confident but outdated setup answer. Search returns three pages with slightly different instructions. Support knows the real workaround, but the documentation owner does not know the product changed. This is usually described as an AI problem. It is more often a knowledge-system problem.

You do not need a second documentation estate written for machines. You need one governed source of product truth that a customer can follow, a support engineer can trust, and an AI system can retrieve without reconstructing the answer from conflicting fragments.

Key takeaways

- Organize documentation around the questions and tasks users bring to it, not only around your product navigation or internal team structure.
- Give every important section a clear answer, scope, procedure, expected result, and permanent link so it remains useful when retrieved on its own.
- Control terminology, versions, ownership, and deprecation explicitly. An AI assistant cannot reliably resolve contradictions that your organization has left unresolved.
- Put documentation changes through version control, review, automated checks, and release gates so the published truth keeps pace with the product.
- Measure successful task completion and grounded answer quality, not page views alone. Use failures to decide whether to fix the content, retrieval layer, assistant behavior, or product itself.

Start with an answer contract, not a page inventory

A documentation redesign often begins with a list of existing pages. That tells you what you publish, but not what customers need to accomplish. It also preserves accidental boundaries: a feature may have five pages because five teams touched it, while the customer still sees one task.

Begin with an intent register for one product area. Capture the questions that appear during activation, onboarding, routine use, escalation, and renewal. Include the language people actually use in search queries and support requests, even when it differs from your preferred product terminology.

For each intent, record:

- The user’s question in their own language.
- The task they are trying to complete or the decision they need to make.
- The relevant audience or role, such as administrator, developer, or analyst.
- The product version, plan, permission, integration, or prerequisite that changes the answer.
- The canonical page and section that should answer the question.
- The person accountable for keeping that answer current.
- The consequence of a wrong or missing answer, such as failed activation, an unnecessary escalation, or use of a deprecated workflow.
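
The fields listed above map naturally onto a small record type. The sketch below is one possible shape, with hypothetical field names, plus a helper that surfaces intents lacking a canonical answer or an owner.

```python
from dataclasses import dataclass


@dataclass
class IntentRecord:
    question: str              # the user's own wording
    task: str
    audience: str
    conditions: list[str]      # version, plan, permission, integration, or prerequisite
    canonical_section: str     # empty when no answer exists yet
    owner: str                 # empty when nobody is accountable yet
    consequence_if_wrong: str


def gaps(register: list[IntentRecord]) -> list[str]:
    """Return the questions that have no canonical section or no accountable owner."""
    return [r.question for r in register if not r.canonical_section or not r.owner]
```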

This register exposes three different problems that page counts conceal. Some important questions have no answer. Some have several competing answers. Others have an answer that exists but cannot stand on its own because the conditions or expected result appear somewhere else.

Turn each priority intent into an answer contract. A complete unit should state what the user can accomplish, when the instructions apply, what must already be true, what to do, what success looks like, and where to go next. If any of those elements are missing, a human has to infer them and an AI system may invent the bridge.

The opening of a page should therefore name the job, not advertise the feature. “Configure routing for inbound leads” gives the reader a destination. “About lead routing” merely names a subject. This small distinction also gives retrieval systems a stronger match between a real question and the section intended to answer it.

Build retrieval units that still make sense alone

A person may enter through a search result, while an AI application may retrieve only a passage from the middle of a page. In both cases, the selected section has to survive separation from the surrounding document.

That does not mean chopping every page into tiny fragments. Atomic content is complete enough to answer one intent and bounded enough to avoid unrelated material. A fragment that says “click Save” without naming the object, required permission, or expected result is short, but it is not atomic.

Use a repeatable section pattern

For a task-oriented section, use this sequence:

1. Write a heading that reflects the question or task.
2. Give the direct answer or outcome before background material.
3. State who the instructions are for and when they apply.
4. List permissions, inputs, and prerequisites before the procedure.
5. Use numbered steps with one observable action in each step.
6. State the expected result and how the reader can verify it.
7. Separate exceptions, limitations, and failure states from the main path.
8. Link to the next likely task rather than a generic documentation landing page.

Keep interface labels, API parameters, status values, and error messages verbatim. If the product displays “Connection expired,” do not rewrite it as “Your integration is no longer active.” The second phrase may read naturally, but it weakens exact search, obscures the product state, and makes support instructions harder to match.

Examples should expose inputs, outputs, and constraints. A useful example says which role is acting, what value is supplied, what the system returns, and which condition would make the result different. A screenshot without that context is evidence of appearance, not a durable explanation of behavior.

Make boundaries and links dependable

Use one primary topic per page, semantic H1-H3 hierarchy, descriptive slugs, and stable section anchors. These practices make pages easier to scan and create smaller, linkable units that retrieval systems can identify precisely.

A stable anchor is part of the content contract. If an implementation guide links directly to the authentication prerequisite, changing that anchor silently breaks more than navigation. It breaks the path by which customers, support macros, release notes, and AI responses reach the authoritative answer.

Do not copy the same procedure into several pages to make each page self-contained. Keep one canonical procedure and give adjacent pages enough context to explain why the reader needs it, followed by a precise link. Duplication feels convenient at publication time and becomes a contradiction risk at the next product change.

Control vocabulary without ignoring customer language

Choose one canonical term for each product concept across the interface, API, documentation, and support material. Put accepted synonyms and older names in a glossary or metadata field so search can recognize them, but keep the explanation anchored to the current term.

This is the difference between supporting natural language and allowing synonym sprawl. “Workspace,” “account,” “tenant,” and “organization” may sound interchangeable inside a company. If they represent different objects in the product, casual substitution creates false equivalence. If they represent the same object, choosing one term removes needless translation work for every reader and retrieval pipeline.

Protect the current truth with metadata and delivery controls

Good prose cannot compensate for missing scope. Two instructions can each be correct for a different version, role, or integration and still produce a wrong answer when retrieved together. Metadata makes those boundaries explicit before retrieval begins.

Define a required metadata contract for every governed page or content unit. At minimum, include:

- A stable content ID and canonical URL.
- A descriptive title and short task-oriented description.
- The product area and content type.
- The intended audience or role.
- The applicable version or version status.
- The lifecycle state, such as current or deprecated.
- The accountable owner.
- The last-updated or last-reviewed date.

Use the fields as controls, not decoration. Audience metadata should allow an assistant to distinguish administrator instructions from end-user instructions. Version metadata should prevent a current answer from silently incorporating an obsolete step. Ownership should route a failed evaluation to someone who can resolve it.
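
Treating metadata as a control means filtering on it before any passage reaches the assistant. A minimal sketch, assuming a hypothetical `ContentUnit` record that carries a subset of the fields above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentUnit:
    content_id: str
    audience: str
    version: str
    lifecycle: str  # "current" or "deprecated"
    owner: str


def eligible_units(units: list[ContentUnit], audience: str, version: str) -> list[ContentUnit]:
    """Keep only current units that match the requesting audience and product version."""
    return [
        unit for unit in units
        if unit.lifecycle == "current" and unit.audience == audience and unit.version == version
    ]
```

Because the filter runs before retrieval ranking, a deprecated or wrong-audience step cannot be blended into a current answer, however well it matches the query text.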

Deprecation needs more than a warning banner. State what is deprecated, which users or versions are affected, what replaces it, and how to move forward. Preserve old URLs with redirects when a current replacement exists. Removing the old page without a forward path turns bookmarks and deep links into dead ends; leaving it searchable without a clear status lets obsolete guidance continue to circulate.

Ship documentation as part of the product change

Scalability depends on the delivery system behind the content. Version control, peer review, and CI/CD give documentation the same traceability and release discipline used for software changes.

For each product change, the release workflow should answer:

- Which user intents and canonical sections are affected?
- Do interface labels, parameters, permissions, errors, examples, or screenshots change?
- Does the change introduce a new term or alter an existing definition?
- Do version boundaries, redirects, or deprecation notices need updating?
- Which retrieval evaluations must pass before release?
- Who approves the content and owns follow-up corrections?

Automate the checks that have unambiguous pass or fail conditions: broken links, missing required metadata, duplicate IDs, invalid internal references, and orphaned pages. Use human review for semantic accuracy, task completeness, terminology, and whether an image still reflects the current workflow. Automation can detect that a screenshot file exists; it cannot reliably decide that the image teaches the correct behavior.
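
Three of those pass-or-fail checks (duplicate IDs, invalid internal references, and orphaned pages) can be sketched over a simple page list. The page shape and the `root_id` default are assumptions made for this example.

```python
from collections import Counter


def check_docs(pages: list[dict], root_id: str = "home") -> dict[str, list[str]]:
    """Run pass/fail checks over pages shaped like {"id": str, "links": [ids]}."""
    ids = [page["id"] for page in pages]
    known = set(ids)
    linked = {target for page in pages for target in page["links"]}
    return {
        "duplicate_ids": sorted(i for i, count in Counter(ids).items() if count > 1),
        "broken_links": sorted(linked - known),
        # Orphans are pages that nothing links to; the entry page is exempt.
        "orphaned_pages": sorted(known - linked - {root_id}),
    }
```

A CI job could fail the build whenever any of the three lists is non-empty.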

Set update expectations according to consequence. Instructions tied to a product release need to be correct when the change reaches users. A deprecated workflow needs a forward path before the old path disappears. Lower-risk explanatory material can follow a review schedule. One blanket service level treats cosmetic drift and activation-breaking errors as if they carry the same cost.

Measure answer quality, then migrate in risk order

Page views tell you that someone arrived. They do not tell you whether the person completed the task or whether an AI answer was accurate, grounded, and current. Pair human behavior with retrieval evaluations so each signal leads to a plausible corrective action.

Signal	What it can reveal	Likely action
Repeated searches or rapid returns to results	The answer is hard to find, uses mismatched language, or does not resolve the intent	Improve the title, intent mapping, vocabulary, or section completeness
Low task completion after reading	The procedure may omit prerequisites, verification, or a failure path	Test the instructions against the actual workflow and repair the answer contract
Support escalation after a documentation visit	The content may be incomplete, untrusted, outdated, or describing product friction	Inspect the escalation reason before assuming more content is the solution
Low answer accuracy or grounding	The wrong passage was retrieved, the selected passage conflicts with another, or the assistant exceeded the evidence	Separate retrieval, content, and answer-generation failures
Current and deprecated guidance in one answer	Version metadata, lifecycle labels, or retrieval filters are insufficient	Strengthen version boundaries and remove obsolete material from current-answer paths
High response latency	The retrieval or answer path may be doing unnecessary work	Inspect the pipeline without trading away accuracy or grounding

Build the evaluation set from the same intent register used to design the documentation. For each test question, define the expected canonical page or section, the claims a correct answer must contain, the audience and version it applies to, and any deprecated claim that must not appear. Include questions that should not be answered when the documentation lacks enough evidence. A reliable assistant must be able to stop at the boundary of the known answer.
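
One way to make that test definition concrete is a small record per question plus a grader. The sketch below is hypothetical throughout: the field names, the `AssistantResult` shape, and the substring matching are assumptions made for illustration. A real grader would likely need a stronger claim check than lowercase substring search.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_section: Optional[str]  # None means the assistant should decline
    required_claims: Tuple[str, ...] = ()
    forbidden_claims: Tuple[str, ...] = ()  # deprecated claims that must not appear
    audience: str = "all"
    version: str = "current"


@dataclass(frozen=True)
class AssistantResult:
    answer: str
    cited_section: Optional[str]
    declined: bool = False


def grade(case, result):
    """Return the reasons a result fails; an empty list means it passes."""
    if case.expected_section is None:
        # The documentation lacks evidence, so declining is the correct outcome.
        return [] if result.declined else ["answered without supporting evidence"]
    if result.declined:
        return ["declined an answerable question"]
    failures = []
    if result.cited_section != case.expected_section:
        failures.append(
            f"cited {result.cited_section}, expected {case.expected_section}"
        )
    text = result.answer.lower()
    for claim in case.required_claims:
        if claim.lower() not in text:
            failures.append(f"missing claim: {claim}")
    for claim in case.forbidden_claims:
        if claim.lower() in text:
            failures.append(f"deprecated claim present: {claim}")
    return failures
```

Treating "should decline" as a first-class expected outcome keeps the boundary of the known answer inside the test suite rather than leaving it to chance.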

When a test fails, classify the failure before editing anything:

If retrieval selected the wrong section, inspect information architecture, headings, metadata, vocabulary, and chunk boundaries.
If retrieval selected the correct section but the answer distorted it, inspect the assistant’s instructions and answer-generation behavior.
If two selected sections disagree, resolve the underlying ownership, versioning, or duplication problem.
If no section answers the question, add the missing knowledge or make the limitation explicit.
If the answer is correct but users still fail, inspect the procedure and the product experience. Documentation should not be used to disguise avoidable product friction.
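
The triage rules above can be written as a single decision function so that every failed test is routed to one responsible layer. This is a minimal sketch: the flag names are hypothetical, and the order in which the checks run is an assumption (content problems are ruled out before retrieval and generation are blamed).

```python
from enum import Enum


class Layer(Enum):
    CONTENT_GAP = "add the missing knowledge or state the limitation"
    CONTENT_CONFLICT = "resolve ownership, versioning, or duplication"
    RETRIEVAL = "inspect architecture, headings, metadata, vocabulary, chunking"
    GENERATION = "inspect assistant instructions and answer generation"
    PRODUCT = "inspect the procedure and the product experience"
    NONE = "no failure observed"


def classify_failure(
    *,
    answer_exists: bool,
    sections_conflict: bool,
    correct_section_retrieved: bool,
    answer_faithful: bool,
    user_succeeded: bool,
) -> Layer:
    """Map observations about one failed test to the layer that should be repaired."""
    if not answer_exists:
        return Layer.CONTENT_GAP
    if sections_conflict:
        return Layer.CONTENT_CONFLICT
    if not correct_section_retrieved:
        return Layer.RETRIEVAL
    if not answer_faithful:
        return Layer.GENERATION
    if not user_succeeded:
        return Layer.PRODUCT
    return Layer.NONE
```

Recording the returned layer alongside each failure makes it possible to see, over time, which layer produces most of the defects.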

You do not need to rebuild the entire knowledge base before learning whether this operating model works. Migrate in this order:

Choose one product area with meaningful activation, support, or deprecation risk.
Collect its real user intents and map each one to an accountable answer.
Resolve duplicate, contradictory, and missing guidance before changing the retrieval system.
Restructure priority answers into self-contained, linkable sections.
Add the required metadata, ownership, version, and lifecycle controls.
Put those sections through the product release workflow and automated checks.
Run human task checks and retrieval evaluations, classify the failures, and repair the responsible layer.
Expand only after the pattern is repeatable for another product area.

Your first useful deliverable is not an AI documentation strategy deck. It is one high-value customer question with one canonical, current, owned answer that survives retrieval and changes alongside the product.

Start with the question that creates the most expensive ambiguity today. Make its answer complete, linkable, versioned, testable, and part of the release path. That single vertical slice will show you where the larger system actually needs work.

March 20, 2026

Agentic Architecture Demystified: How Modern AI Systems Plan, Learn, and Execute at Scale

In my role leading product teams at HighLevel, I’m often asked to explain what’s really happening behind the scenes of today’s AI products. The short answer is that modern systems are built on agentic architecture: not a single model, but a coordinated loop of planning, tool use, memory, and evaluation. Once you see that pattern, the design decisions snap into focus and the roadmap becomes far easier to prioritize.

At its core, agentic AI treats the model as a reasoning engine embedded within an AI workflow. The agent interprets intent, plans steps, calls the right tools and APIs, grounds itself in trusted data, and then evaluates outcomes before deciding to continue or stop. This loop creates reliability, reduces hallucinations, and enables the system to operate in real-world, multi-step scenarios.

Here’s the practical lifecycle I rely on. A user provides intent (a goal or request). We run a retrieval-first pipeline to ground the model in accurate, current data. Prompt engineering structures the task and primes the agent with constraints and success criteria while keeping the context window within budget. The agent generates a plan, executes steps by calling tools or services, evaluates intermediate results, reflects or revises as needed, and only then returns a final answer with clear citations or evidence.
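
To make the lifecycle concrete, here is a minimal sketch of the loop with every capability passed in as a plain function. Nothing here is a real framework API: `run_agent`, `AgentRun`, and the four callables are hypothetical names, and a real system would call a model and real tools where these stubs sit.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentRun:
    answer: str
    evidence: List[str]  # what the answer was grounded in
    attempts: int


def run_agent(
    intent: str,
    retrieve: Callable[[str], List[str]],
    plan: Callable[[str, List[str], str], List[str]],
    execute: Callable[[str], str],
    evaluate: Callable[[str, List[str]], Tuple[bool, str]],
    max_attempts: int = 3,
) -> AgentRun:
    """Retrieve first, then plan, act, and evaluate until accepted or out of attempts."""
    evidence = retrieve(intent)  # ground before any planning happens
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        steps = plan(intent, evidence, feedback)
        results = [execute(step) for step in steps]
        accepted, feedback = evaluate(intent, results)
        if accepted:
            return AgentRun("; ".join(results), evidence, attempt)
    # Bounded retries are the termination condition; fail loudly instead of guessing.
    raise RuntimeError(f"no acceptable result after {max_attempts} attempts: {feedback}")
```

The evaluator's feedback flows back into the next plan, which is the "reflect or revise" step, and the evidence travels with the answer so that citations are available at the end.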

For more complex work, I orchestrate multiple specialized agents—commonly a planner, a solver, and a critic—coordinated by a lightweight controller. This multi-agent pattern reduces single-agent blind spots, encourages self-checking, and mirrors how empowered product teams collaborate. Whether it’s conversation design for support flows or a voice AI agent driving hands-free tasks, orchestration is the difference between a clever demo and a dependable product.

Memory is the second pillar. Short-term working context sits in the prompt, while long-term memory lives in vector stores or databases to track past interactions, preferences, and outcomes. Retrieval augments the model with the right facts at the right time, and tight context window management ensures the agent stays focused on signal, not noise. The result is faster responses, lower costs, and far better accuracy.
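
A simple way to picture context window management is a budgeted packer that keeps only the most relevant retrieved snippets. This is a sketch under loud assumptions: relevance scores are supplied by the caller, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
from typing import Callable, List, Tuple


def pack_context(
    snippets: List[Tuple[float, str]],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Greedily keep the highest-scoring snippets that still fit the budget."""
    chosen = []
    remaining = budget_tokens
    for _score, text in sorted(snippets, key=lambda item: item[0], reverse=True):
        cost = count_tokens(text)
        if cost <= remaining:
            chosen.append(text)
            remaining -= cost
    return chosen
```

With a budget of 10 "tokens", two five-word snippets scored 0.9 and 0.7 would be kept and a seven-word snippet scored 0.2 would be dropped. Swapping in a real tokenizer only requires passing a different `count_tokens`.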

Reliability is earned through eval-driven development and robust AI risk management. I define offline and online evaluations, guardrails, and human-in-the-loop checkpoints before scaling traffic. These evaluations become living, automated tests that protect against regressions as prompts, models, and tools evolve. The payoff is real: fewer escalations, higher trust, and measurable improvements to quality over time.

From a product strategy perspective, I resist over-engineering. Start with a simple retrieval-first pipeline and a single agent; prove value; then layer in multi-agent orchestration only where it moves key metrics. Instrument everything—latency, cost, grounding coverage, and outcome quality—and build Agent Analytics dashboards so teams can diagnose issues and iterate with confidence.
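
As a rough sketch of what "instrument everything" could feed into a dashboard, the snippet below aggregates per-run records into the four measures named above. The record fields are hypothetical, and defining grounding coverage as the share of runs that cited at least one source is my assumption, not a standard definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    latency_ms: float
    cost_usd: float
    cited_sources: int  # number of sources the final answer cited
    succeeded: bool     # whether the outcome met its success criteria


def summarize(records: List[RunRecord]) -> Dict[str, float]:
    """Aggregate run records into latency, cost, grounding, and outcome metrics."""
    if not records:
        raise ValueError("at least one run record is required")
    total = len(records)
    return {
        "mean_latency_ms": mean(r.latency_ms for r in records),
        "total_cost_usd": sum(r.cost_usd for r in records),
        "grounding_coverage": sum(1 for r in records if r.cited_sources > 0) / total,
        "success_rate": sum(1 for r in records if r.succeeded) / total,
    }
```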

If you’re looking for a practical playbook, here’s mine: clarify the user intent and success criteria; design the tools the agent can call; ground with authoritative data; write prompts that constrain scope and define termination conditions; add reflection and automated evaluations; and ship behind feature flags for safe, staged rollout. Each step compounds reliability without killing velocity.
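
For the last step of that playbook, a staged rollout can be as simple as deterministic percentage bucketing. This is a generic sketch rather than any particular feature-flag product's API; `in_rollout` and the flag name in the usage note are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash flag and user together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the flag and the user, a user admitted at 10% stays admitted when `"agent-v2"` moves to 50%, so each stage widens the audience without reshuffling it.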

The diagram and the video above bring these patterns to life. If you watch closely, you’ll see the same loop—plan, retrieve, act, evaluate—show up in every effective implementation, regardless of domain. That repetition isn’t accidental; it’s the backbone of agentic architecture and a blueprint you can adapt to your own stack.

Ultimately, what matters is outcomes. When we build around agentic AI, we create systems that are explainable to stakeholders, maintainable by engineers, and genuinely helpful to customers. That’s how we move past hype to durable impact—shipping AI products that plan, learn, and execute at scale.

Inspired by this post on Product School.

March 16, 2026

Prevent Strategy Drift: AI that flags ‘merge conflicts’ in product plans before a quarter derails

"What if an AI could spot the moment two product teams start pulling in opposite directions — before it derails a quarter?" That question hooked me, because I’ve lived through the costly fallout of subtle misalignments that only surface at the end of a sprint—or worse, during quarterly business reviews.

I recently dug into an episode of Just Now Possible featuring Matthias and Charlotte Kleverud, co-founders of Momental. Their vision for "GitHub for product management" hits a nerve in the best possible way: find "merge conflicts" in strategy, not code, and do it early enough to save execution time, trust, and outcomes.

Here’s the core: Momental ingests documents, meeting transcripts, and voice recordings across an organization, then uses AI agents to map them into a structured context layer—a set of interconnected trees covering goals, decisions, learnings, and who's doing what. When it finds a conflict—say, one team betting on retention while another is prioritizing conversion—it surfaces the misalignment for humans to resolve, just like a merge conflict in code. That framing is both familiar (for anyone who’s shipped software) and powerful (for anyone who’s scaled product strategy across multiple teams).

Their journey tracks with what many of us have learned the hard way. "Starting in 2022 with DaVinci 002 and learning that the market wasn't ready for AI-assisted product thinking" pushed them toward experiments with agent teams. "The origin story: building a team of AI agents in 2024, only to discover agents hit the same alignment problems as humans" is exactly the kind of meta-lesson I’d expect when you scale autonomy without shared context. The breakthrough was an "OODA-loop-driven document processing agent" that continuously curates a living knowledge graph rather than relying on static prompts or brittle pipelines.

One model that stood out was "The product chain: signals → learnings → decisions → principles, and how AI maps it." That is the backbone of healthy product thinking. When this chain is explicit and inspectable, you can trace why a team chose Path A over Path B—and detect when new signals should invalidate old decisions. I’ve seen this accelerate continuous discovery and improve executive decision hygiene.
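
To show why an explicit chain helps, here is a toy sketch of tracing from a signal to the decisions that rest on it. It is my own illustration and says nothing about how Momental stores or maps this chain; the class names and example identifiers are hypothetical, and principles are left out for brevity (they would hang off decisions in the same way).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Learning:
    name: str
    signals: FrozenSet[str]  # signals this learning was drawn from


@dataclass(frozen=True)
class Decision:
    name: str
    learnings: FrozenSet[str]  # learnings this decision relies on


def decisions_to_revisit(
    invalid_signal: str, learnings: List[Learning], decisions: List[Decision]
) -> List[str]:
    """List decisions that depend on any learning drawn from the invalidated signal."""
    affected = {l.name for l in learnings if invalid_signal in l.signals}
    return sorted(d.name for d in decisions if d.learnings & affected)
```

If a hypothetical `churn-survey-q1` signal turns out to be flawed, this returns only the decisions built on learnings that cite it, which is the inspectability the chain is meant to provide.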

I also appreciated the organizational modeling: "Three trees that model an organization: the product tree (OKRs to epics), the wisdom tree (decisions and their reasoning), and the people/time tree." This maps cleanly to how we run quarterly planning at scale: tying outcomes to work, preserving rationale, and grounding ownership and timelines. With that structure, the episode’s segment on "How conflicts are detected, auto-resolved, or escalated to humans with merge options" describes a pragmatic workflow, not a theoretical AI demo.

On the technical front, they’re blunt about limits: "Why traditional chunking and RAG breaks down at scale and what Momental does instead." Anyone who’s tried to stitch strategy from ad hoc notes knows that naive retrieval won’t cut it. You need durable context boundaries, rich metadata, and graph-aware reasoning. Which brings me to one of my non-negotiables: "Why metadata—who said it, when, and in what context—is critical to preventing hallucinations." In my world, we treat provenance like test coverage—you can’t ship without it.
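
Here is how I would sketch the provenance rule in code: a statement only becomes citable evidence when it records who said it, when, and in what context. The `Statement` shape and helper names are hypothetical and are not drawn from Momental's product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    text: str
    speaker: Optional[str] = None
    said_on: Optional[date] = None
    context: Optional[str] = None  # for example, the meeting or document it came from

    def has_provenance(self) -> bool:
        return all((self.speaker, self.said_on, self.context))


def citable(statements: List[Statement]) -> List[Statement]:
    """Keep only statements that can be traced to a speaker, date, and context."""
    return [s for s in statements if s.has_provenance()]


def cite(statement: Statement) -> str:
    """Render a statement with its provenance, refusing untraceable input."""
    if not statement.has_provenance():
        raise ValueError("statement lacks provenance and cannot be cited")
    return (
        f'"{statement.text}" '
        f"({statement.speaker}, {statement.said_on.isoformat()}, {statement.context})"
    )
```

Filtering at this boundary means an untraceable claim never reaches the model as evidence, which is the same gate I expect from test coverage before a release.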

Process-wise, the product philosophy resonated: "How a document processing agent uses OODA-loop thinking to extract and connect context across documents" reinforces the need for short feedback cycles, explicit hypotheses, and continuous refactoring of knowledge. Pair that with "The self-improving agent: collecting user feedback weekly and rewriting its own prompts" and you’ve got a blueprint for eval-driven development that keeps the system honest over time.

Their UI choices also mirror a pattern I’ve adopted: "Moving from chat-first to UI-first to proactive agents as an AI product design pattern." Chat can feel magical, but alignment work benefits from concrete artifacts—trees, timelines, driver trees, and opportunity solution trees—so people can reason together. Then, let proactive agents watch for drift and nudge teams before the cost of change spikes.

Two broader themes are worth calling out. First, "Specialized tools win" when the problem is deep, cross-functional context like product strategy. General-purpose chatbots struggle here; domain-specific models with strong information architecture have the edge. Second, product culture matters: "Discovery Versus Vibe Coding" is not just a catchy contrast—it’s a reminder that disciplined discovery beats intuition theater when stakes are high.

As for the roadmap, I’m encouraged by their "Design partner strategy and what's next for Momental's public launch." Early design partners are where you validate signal quality, precision of conflict detection, and the ergonomics of human-in-the-loop resolution. I’m especially curious how this intersects with LLMs for product managers, outcomes vs output OKRs, and product roadmapping and sprint planning in large portfolios.

Finally, a nod to the broader ecosystem. The conversation touched on "Claude Code" and a shift "Beyond documents and vectors" that many of us are living through—toward retrieval-first pipelines that respect context windows, stronger governance, and measurable improvements in decision quality. If you care about AI Strategy for empowered product teams, this is a space to watch—and to pilot.

Bottom line: If you’ve ever wished you could prevent strategy drift before it shows up in your dashboards, this "GitHub for product management" approach is worth your attention. Make the chain of signals, learnings, decisions, and principles explicit. Keep humans in the loop for the hard calls. And let proactive, agentic AI do what it does best: flag misalignments early, so your teams can move fast together.

Inspired by this post on Product Talk.

March 5, 2026
