How to Design, Launch, and Govern an AI Agent Product

Your AI agent demo works. Now the harder questions arrive: Which actions can it take, how will anyone know it helped, and who owns a bad decision? If those answers are deferred until launch, you do not yet have a product ready to scale. You have a capability looking for permission.

Your job as a product leader is to turn uncertain model behavior into a dependable operating system for one valuable task. That means designing the job, the workflow, the controls, the measurement, and the adoption path together. Model quality matters, but it cannot compensate for an undefined outcome, excessive access, weak tools, or a launch that asks users to trust what they cannot inspect or reverse.

Start with an operating contract, not an agent persona

Names such as sales agent, support copilot, or operations assistant are too broad to guide product decisions. They hide disagreements about what the system can see, what it can change, when it should stop, and what success means. Treating an agent as a product line with a narrow job, grounded data, tool access, and guardrails forces those disagreements into the open while they are still inexpensive to resolve.

Write an operating contract before debating models or interfaces. It should answer the following questions in language that product, engineering, operations, security, and the domain owner can all review:

Who is the user? Name the role performing the job, not a market segment. An account administrator and a support specialist may need different evidence, permissions, and explanations even when they use the same underlying model.
What event starts the job? Specify the observable trigger: a customer request arrives, a record enters an exception state, or a user asks for a particular action. A generic invitation to chat is not a job boundary.
What outcome counts as done? Define a state outside the conversation. The answer might be an approved response, a correctly updated record, a validated recommendation, or a complete handoff. A fluent message is output, not necessarily an outcome.
What evidence may the agent use? List permitted systems, required records, freshness requirements, and data the agent must not retrieve. If the task requires an authoritative record, make its absence a stop condition rather than an invitation to infer.
Which tools may it call? Separate read, draft, and write permissions. An agent that can inspect a record does not automatically need permission to change it, and permission to draft an action does not imply permission to execute it.
What constraints must always hold? Capture business rules, policy boundaries, approval requirements, and prohibited actions. Enforce these constraints in tool and application layers, not only in natural-language instructions.
When must it stop or escalate? Missing required evidence, conflicting records, unsupported requests, tool failures, and policy exceptions should lead to a defined fallback. The agent should not improvise its way around a boundary.
Who remains accountable? Name the owner who approves the contract, reviews failures, and decides whether autonomy can expand. Accountability cannot be assigned to the agent itself.

A compact job statement makes the contract easier to test:

When [trigger] occurs, help [user] achieve [observable outcome] using [approved evidence and tools]. If [stop condition] occurs, hand off to [role] with [required context].

For example, a support agent might retrieve an approved knowledge record and relevant account facts, prepare a response, and stop when identity, policy, or account data is unresolved. Its handoff would include the customer’s request, the evidence retrieved, the steps attempted, and the exact question requiring a specialist. That is a testable product definition. “Build a support agent” is not.
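One way to keep the contract reviewable is to hold it as structured data and render the job statement from it. The sketch below is a minimal illustration in Python. The `OperatingContract` class, its field names, the tool names, and the sample values are all hypothetical, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    """Hypothetical container for the contract questions listed above."""
    user_role: str
    trigger: str
    observable_outcome: str
    approved_evidence: tuple[str, ...]
    read_tools: tuple[str, ...]
    draft_tools: tuple[str, ...]
    write_tools: tuple[str, ...]  # empty means this release cannot change state
    stop_conditions: tuple[str, ...]
    handoff_role: str
    handoff_context: tuple[str, ...]
    accountable_owner: str

    def job_statement(self) -> str:
        """Render the compact job statement from the structured fields."""
        tools = self.read_tools + self.draft_tools + self.write_tools
        return (
            f"When {self.trigger} occurs, help {self.user_role} achieve "
            f"{self.observable_outcome} using "
            f"{', '.join(self.approved_evidence + tools)}. "
            f"If {' or '.join(self.stop_conditions)} occurs, hand off to "
            f"{self.handoff_role} with {', '.join(self.handoff_context)}."
        )


# Illustrative values only; they mirror the support example in the text.
support_contract = OperatingContract(
    user_role="a support specialist",
    trigger="a customer request",
    observable_outcome="an approved response",
    approved_evidence=("the approved knowledge record", "relevant account facts"),
    read_tools=("get_account",),
    draft_tools=("draft_reply",),
    write_tools=(),
    stop_conditions=("unresolved identity", "a policy conflict", "missing account data"),
    handoff_role="a specialist",
    handoff_context=("the request", "the evidence retrieved", "the steps attempted"),
    accountable_owner="support operations lead",
)

print(support_contract.job_statement())
```

Keeping read, draft, and write tools in separate fields makes the negative scope visible: an empty `write_tools` tuple states plainly that this release cannot change a record.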

Add a negative scope as well. State what the agent will not do in the current release, even if the model appears capable of doing it. This keeps a successful pilot from quietly becoming authorization for unrelated work.

The final test is simple: can two reviewers inspect the same run and agree whether the job was completed within the contract? If they need to debate whether the answer merely sounded reasonable, the definition of done is still too vague.

Build deterministic edges around the model

A dependable agent is a workflow, not a long prompt. The model interprets language and chooses among bounded options; the surrounding system controls identity, data access, tool execution, validation, state, and recovery. Retrieval, context management, reliable tools, and clear state often matter more than moving to a larger model.

Design the successful path and the failure path as an explicit sequence:

Retrieve authorized evidence. Fetch only the records relevant to the job. Preserve record identifiers, versions, and freshness so the result can be inspected later.
Construct minimal task state. Carry the user’s identity, requested outcome, validated facts, previous tool results, pending approvals, and unresolved questions. Do not treat an ever-growing chat transcript as the system of record.
Choose from allowed actions. Give the model a constrained set of tools and make unavailable actions genuinely unavailable. A prompt that says “do not call a privileged endpoint” is not access control.
Validate tool inputs. Use typed schemas, required fields, enumerated values where appropriate, and server-side authorization. Reject malformed or unauthorized calls before they reach the underlying system.
Validate the resulting state. Check deterministic business rules after execution. A successful API response only proves that the call ran; it does not prove that the user’s job was completed correctly.
Finish, recover, or hand off. Return an accepted outcome, retry only when retrying is safe, or create the handoff package specified in the operating contract.
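The middle of that sequence can be sketched in a few lines. The example below is a minimal illustration, assuming a single hypothetical `update_status` tool, an in-memory `records` dictionary standing in for the system of record, and an invented set of allowed status values. The point is that the allowlist, the schema check, and the post-execution check all run in ordinary code outside the model.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {"open", "pending", "resolved"}  # assumed enumerated values


class Rejected(Exception):
    """Raised when a call is refused before it reaches the underlying system."""


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def validate_update_status(args: dict) -> None:
    """Schema check: exact fields, correct types, enumerated status."""
    if set(args) != {"record_id", "status"}:
        raise Rejected("missing or unexpected fields")
    if not isinstance(args["record_id"], str) or not args["record_id"]:
        raise Rejected("record_id must be a non-empty string")
    if not isinstance(args["status"], str) or args["status"] not in ALLOWED_STATUSES:
        raise Rejected("status is not an allowed value")


def run_step(call: ToolCall, granted_tools: set, records: dict) -> str:
    """Execute one model-proposed call inside deterministic edges."""
    if call.name not in granted_tools or call.name != "update_status":
        raise Rejected(f"tool not available: {call.name}")
    validate_update_status(call.args)
    record_id, status = call.args["record_id"], call.args["status"]
    if record_id not in records:  # stand-in for a server-side scope check
        raise Rejected("record is outside the caller's scope")
    records[record_id]["status"] = status  # the actual write
    # A successful write is not proof of the outcome; check the resulting state.
    if records[record_id]["status"] != status:
        return "handoff"
    return "completed"


records = {"rec-1": {"status": "open"}}
call = ToolCall("update_status", {"record_id": "rec-1", "status": "resolved"})
print(run_step(call, {"update_status"}, records))
```

A call to any tool outside `granted_tools` raises `Rejected` before any record is touched, which is the behavior a prompt instruction alone cannot guarantee.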

Tool quality deserves product attention. Each consequential tool should expose the smallest permission needed, return machine-readable errors, support a preview when possible, and make repeated requests safe where the underlying operation permits it. Reversible operations need a tested undo path. Irreversible operations need tighter authorization and should not be made safe merely by adding another sentence to the prompt.
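As a rough illustration of those tool properties, the sketch below wraps a hypothetical note-creation tool with a preview, an idempotency key, and an undo path. The `NoteTool` class and its method names are invented for this example, and the in-memory dictionaries stand in for a real backing store.

```python
import uuid


class NoteTool:
    """Hypothetical tool wrapper: preview, idempotent commit, and undo."""

    def __init__(self) -> None:
        self.notes: dict[str, dict] = {}    # note_id -> note
        self._by_key: dict[str, str] = {}   # idempotency key -> note_id

    def preview(self, record_id: str, text: str) -> dict:
        """Describe the change without committing it."""
        return {"action": "add_note", "record_id": record_id, "text": text}

    def commit(self, idempotency_key: str, record_id: str, text: str) -> str:
        """Create the note once; a retry with the same key returns the same id."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        note_id = uuid.uuid4().hex
        self.notes[note_id] = {"record_id": record_id, "text": text}
        self._by_key[idempotency_key] = note_id
        return note_id

    def undo(self, note_id: str) -> bool:
        """Remove the note; returns False if it was already gone."""
        return self.notes.pop(note_id, None) is not None


tool = NoteTool()
first = tool.commit("key-1", "rec-1", "Called the customer")
retry = tool.commit("key-1", "rec-1", "Called the customer")
print(first == retry, len(tool.notes))
```

Because the retry reuses the key, a timeout followed by a second attempt does not create a duplicate note, and the undo path can be exercised in tests before the tool is exposed to an agent.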

Context also needs a budget based on relevance, not on the maximum number of tokens the model accepts. Rank evidence by authority and usefulness. Remove unrelated history. Distinguish verified records from user claims and model-generated summaries. When two authoritative records conflict, preserve the conflict and route it through the stop condition instead of blending them into a plausible answer.

Build the evaluation set before the launch plan

Your evaluation set is the executable version of the operating contract. It should represent the situations that matter to the job, including conditions in which the correct behavior is to refuse, ask for information, or escalate.

Scenario class	What the evaluation should verify
Normal path	The agent retrieves the required evidence, selects the correct tool, satisfies the acceptance criteria, and records a complete result.
Ambiguous request	The agent asks for the missing fact or offers bounded choices instead of assuming the user’s intent.
Missing or stale evidence	The workflow stops, refreshes through an approved path, or escalates according to the contract.
Tool failure	The agent does not claim success, duplicate a consequential action, or lose the task state needed for recovery.
Policy boundary	The prohibited call is blocked by the system, the response explains the available path, and the event is auditable.
Human handoff	The receiving person gets the request, relevant evidence, attempted actions, unresolved issue, and recommended next step.

Score the dimensions separately. A single average can hide the failure that matters most.

Outcome correctness: Did the external result meet the job’s acceptance criteria?
Grounding: Did the response use the required evidence without inventing unsupported facts?
Tool behavior: Were the correct tool, arguments, order, and authorization used?
Policy compliance: Did every prohibited or approval-gated action remain inside its boundary?
Recovery: Did the workflow handle missing data, timeouts, and partial failures without misrepresenting the result?
Handoff quality: Could the receiving person continue without reconstructing the entire run?
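A small sketch shows why the dimensions should be gated separately. The pass rates and release bars below are invented for illustration, and the `failing_dimensions` helper is hypothetical; real values would come from your own evaluation runs.

```python
from statistics import mean

# Invented pass rates for illustration only.
scores = {
    "outcome_correctness": 0.98,
    "grounding": 0.96,
    "tool_behavior": 0.97,
    "policy_compliance": 0.70,
    "recovery": 0.95,
    "handoff_quality": 0.96,
}

# Assumed release bars; the policy bar is deliberately stricter.
thresholds = {
    "outcome_correctness": 0.90,
    "grounding": 0.90,
    "tool_behavior": 0.90,
    "policy_compliance": 0.99,
    "recovery": 0.90,
    "handoff_quality": 0.90,
}


def failing_dimensions(scores: dict, thresholds: dict) -> list:
    """Return every dimension below its own bar; a missing score counts as failing."""
    return sorted(
        name for name, bar in thresholds.items()
        if scores.get(name, 0.0) < bar
    )


print(round(mean(scores.values()), 2))          # the blended average
print(failing_dimensions(scores, thresholds))   # the dimensions that block release
```

With these sample numbers the blended average is about 0.92, which would clear a single 0.90 bar, while the per-dimension check still reports `policy_compliance` as failing.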

Use deterministic assertions wherever the expected state can be checked directly. Use domain review for judgment that depends on policy or professional context. Model-based evaluators can help classify or prioritize a larger sample, but they should not become the only judge of a high-consequence action.

Run scripted evaluations whenever the model, prompt, retrieval logic, tool schema, policy, or orchestration changes. Sample live runs after release to find failure patterns the fixed set does not yet represent, subject to your data-access and retention rules. Add confirmed failures back into the regression set. That is how eval-driven development turns observed behavior into a tighter product.

Select the model after this evaluation loop exists. Compare candidates on the acceptance criteria, latency, operating cost, and operational constraints of the job. The right model is the least complex option that clears the required bar with the complete workflow around it. A model swap should be one testable hypothesis among retrieval, context, tool, state, and prompt changes, not the automatic response to erratic behavior.

Govern autonomy at the action boundary

Governance becomes practical when you classify what the agent may do, not how intelligent it appears. The important distinction is the consequence of the next action: whether it changes state, whether the change can be reversed, and who bears the cost of an error.

Action class	Typical behavior	Default product control
Advise	Summarizes evidence or recommends a next step without changing system state.	Show the supporting evidence and let the user ignore, revise, or escalate the recommendation.
Draft	Creates an editable response, plan, or proposed update that has not been sent or committed.	Require review before external effect. Capture material edits and rejection reasons as feedback.
Execute a reversible action	Changes a record or starts a bounded workflow with a reliable recovery path.	Begin with a preview and explicit approval. Enforce scope in the API, record the action, and make undo visible.
Execute a consequential action	Creates an irreversible, financial, regulatory, security, or substantial customer impact.	Keep a qualified human decision-maker in the path unless the organization has explicitly approved a narrower control model. The agent can assemble evidence and prepare the action without owning the decision.

Do not borrow one accuracy threshold for all four classes. A summarization defect and an unauthorized payment are not interchangeable errors. Set release criteria by action class, and report prohibited-action failures separately rather than averaging them together with low-consequence quality issues.
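The default controls in the table can be expressed as a check that runs in the application layer before anything takes effect. This is a minimal sketch under stated assumptions: the `ActionClass` enum, the `authorize` function, and the `approver_is_qualified` flag are hypothetical names, and a real system would resolve approver identity and qualification from its own access model.

```python
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    ADVISE = "advise"
    DRAFT = "draft"
    EXECUTE_REVERSIBLE = "execute_reversible"
    EXECUTE_CONSEQUENTIAL = "execute_consequential"


# Classes whose output needs a person's approval before any external effect.
NEEDS_APPROVAL = {
    ActionClass.DRAFT,
    ActionClass.EXECUTE_REVERSIBLE,
    ActionClass.EXECUTE_CONSEQUENTIAL,
}


def authorize(action: ActionClass, approver: Optional[str],
              approver_is_qualified: bool = False) -> None:
    """Raise PermissionError unless the default control for the class is met."""
    if action in NEEDS_APPROVAL and not approver:
        raise PermissionError(f"{action.value} requires explicit approval")
    if action is ActionClass.EXECUTE_CONSEQUENTIAL and not approver_is_qualified:
        raise PermissionError("consequential actions need a qualified decision-maker")


authorize(ActionClass.ADVISE, approver=None)                 # allowed: no state change
authorize(ActionClass.EXECUTE_REVERSIBLE, approver="j.doe")  # allowed: explicit approval
```

Keeping the class on every request also makes it straightforward to report prohibited-action failures per class instead of folding them into one quality number.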

Human review only reduces risk when the reviewer can make an informed decision. A confirmation button attached to a vague summary creates approval theater. The review interface should show:

The exact action that will occur and the system it will affect.
The evidence used, including record identifiers or other traceable references.
Any missing, stale, or conflicting information.
The expected side effects and whether the action can be reversed.
Clear options to approve, edit, reject, or escalate.

For a handoff, replace “approve” with a receiving workflow. The person taking over needs a concise task summary, the user’s original intent, the evidence already checked, tool results, the reason automation stopped, and the next decision. Measuring whether that package is usable is more valuable than celebrating a low handoff rate.

Enforcement belongs at the tool boundary. Authenticate the user and agent, authorize each operation, validate inputs, limit accessible records, and block disallowed transitions on the server. Natural-language instructions can guide behavior, but they are not a substitute for permissions, policy checks, or transaction controls.

Keep an audit record proportionate to the risk. For a consequential run, that commonly includes the requesting identity, agent and configuration version, evidence identifiers, tool calls and results, approval decision, final state, and any reversal or escalation. Do not log raw prompts, private records, or retrieved content by default merely because they may be useful later. Decide what is necessary, who can access it, and how long it should be retained as part of AI risk management and data governance.

Assign human ownership across the operating system. Product owns the target outcome and adoption decision. A domain owner approves acceptance criteria and policy interpretation. Engineering owns tool reliability and recovery. Security and privacy owners approve data and access controls. Operations owns monitoring, handoffs, and incident response. One person may cover more than one role, but no responsibility should disappear into the phrase “the agent decided.”

Governance review should be triggered by meaningful change, not only by a launch meeting. Revisit the contract when you change the model, retrieval source, tool schema, permission, policy, action class, or target user. Review it again when live behavior reveals a new failure mode. That keeps governance attached to the product lifecycle instead of turning it into a document that goes stale after approval.

Instrument the outcome funnel, then earn adoption

An agent does not succeed because users open it or send messages. It succeeds when eligible users complete a valuable job, accept the result, and return when the job recurs. Behavioral instrumentation becomes useful when agent interactions are connected to activation, retention, cost, and risk.

Measure the entire path from opportunity to outcome

Start the funnel before the conversation. If you count only people who already opened the agent, you cannot distinguish poor discovery from poor execution. Define an eligible opportunity for the specific job, then instrument the path through completion.

agent_opportunity_detected: The product can identify that the target job is present for an eligible user.
agent_offer_exposed: The relevant entry point or contextual suggestion is shown.
agent_invoked: The user starts the workflow or an authorized trigger starts it on the user’s behalf.
agent_action_proposed: The workflow produces a recommendation, draft, or preview inside the operating contract.
agent_approval_resolved: The proposed action is approved, edited, rejected, or escalated where review applies.
agent_task_completed: The external acceptance criteria are satisfied and the final state is recorded.
agent_outcome_reversed: The result is undone, reopened, corrected, or otherwise found not to be durable.

The names are less important than consistent semantics. Record the job type, user role, action class, model and workflow version, tool result, and final disposition. Use identifiers and controlled classifications where possible instead of copying sensitive prompt or retrieved content into analytics.

Metric	Useful definition	Common misreading
Activation	Eligible users who complete their first accepted valuable outcome divided by eligible users exposed, for a named cohort and measurement window.	Counting a first prompt or first response as activation even when no job was completed.
Task completion	Eligible initiated tasks that meet the external acceptance criteria divided by eligible initiated tasks.	Using a model’s claim of completion or a successful API call as proof of success.
Containment	Eligible tasks completed without human takeover divided by eligible tasks started, paired with quality and later correction signals.	Rewarding fewer handoffs even when the agent should have escalated.
Time to value	Elapsed time from the eligible trigger to an accepted outcome, including waiting for review when review is part of the workflow.	Measuring response latency while ignoring the rest of the job.
Acceptance and editing	Results accepted as presented, accepted after a material edit, rejected, or escalated. Define “material” for the job.	Treating any click on approve as equal, regardless of the correction required before approval.
Handoff quality	Handoffs containing the required context and accepted as usable by the receiving role divided by all handoffs.	Viewing every handoff as failure instead of distinguishing correct escalation from avoidable escalation.
Cost per successful outcome	Variable model, tool, infrastructure, and human-review costs divided by accepted completed outcomes.	Optimizing token cost while ignoring rework, review time, or failed attempts.
Risk signals	Blocked prohibited calls, unauthorized attempts, reversals, policy escalations, and incidents, reported as counts and against the relevant opportunity denominator.	Combining materially different events into one average quality score.
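Several of these definitions reduce to simple ratios once the denominators are fixed. The sketch below computes four of them from invented sample data. The field names (`met_criteria`, `human_takeover`, `cost`) are hypothetical, and the cohort and measurement window are assumed to have been applied before the data reaches this code.

```python
from typing import Optional

# Invented sample data. Each dict is one eligible task that a user initiated.
exposed_users = {"u1", "u2", "u3", "u4"}
tasks = [
    {"user": "u1", "met_criteria": True, "human_takeover": False, "cost": 0.50},
    {"user": "u1", "met_criteria": False, "human_takeover": False, "cost": 0.25},
    {"user": "u2", "met_criteria": True, "human_takeover": True, "cost": 1.00},
]


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of dividing by an empty denominator."""
    return numerator / denominator if denominator else None


completed = [t for t in tasks if t["met_criteria"]]
activated_users = {t["user"] for t in completed} & exposed_users

# Users with a first accepted outcome, over eligible users exposed.
activation = ratio(len(activated_users), len(exposed_users))
# Tasks meeting the external acceptance criteria, over tasks initiated.
task_completion = ratio(len(completed), len(tasks))
# Tasks completed without human takeover, over tasks started.
containment = ratio(sum(not t["human_takeover"] for t in completed), len(tasks))
# All variable cost, including failed attempts, over accepted outcomes.
cost_per_outcome = ratio(sum(t["cost"] for t in tasks), len(completed))

print(activation, task_completion, containment, cost_per_outcome)
```

Note that the failed attempt still contributes to the cost numerator, and the two users who were exposed but never started a task still count in the activation denominator.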

Segment these metrics by job, user role, action class, workflow version, tool, and risk class. An overall completion rate can improve while a high-consequence segment gets worse. Version-level segmentation also tells you whether a prompt, retrieval, model, or interface change actually altered behavior.

Pair leading signals with durable outcomes. Edits, rejection, undo, escalation, and approval time can expose friction quickly. Repeated successful use, lower rework, and movement in the target business outcome tell you whether the product is creating lasting value. An increase in escalation is not automatically bad: it may mean the control became easier to use. Inspect whether the escalation was correct and whether the receiving person could act on it.

Let evidence earn each expansion of autonomy

Adoption is a behavior-change problem. Users need to notice the agent at the moment the job occurs, understand its boundary, inspect its work, and recover when it is wrong. A generic product tour may create awareness, but it does not establish trust in a consequential workflow.

Move through deployment modes according to evidence rather than a predetermined calendar:

Shadow mode: Run the workflow without exposing a result or changing state. Compare its proposed outcome with the accepted human outcome and use disagreements to improve the contract and evaluations.
Assisted mode: Let the user request a recommendation or editable draft. Make the evidence and limitations visible, and collect structured edit and rejection reasons.
Approved execution: Show the exact proposed change and require explicit confirmation before the tool commits it. Test authorization, audit, recovery, and handoff paths under live operating conditions.
Bounded autonomy: Allow execution only for the job, users, data, conditions, and limits approved in the operating contract. Continue monitoring outcomes and preserve a kill switch, rollback path, and accountable operator.

Advancement should depend on the evaluation suite, live outcome quality, tool reliability, policy compliance, recovery readiness, and the receiving team’s ability to handle escalations. If the evidence is mixed, narrow the action class or eligible population. Do not compensate for unresolved risk by making the prompt longer.
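The advancement rule can be written down as a small gate so that it is applied the same way at every review. This is a sketch under assumptions: the mode identifiers and evidence flag names are hypothetical labels for the items named above, and how each flag is computed is left to your own evaluation and monitoring.

```python
MODES = ["shadow", "assisted", "approved_execution", "bounded_autonomy"]

# One flag per evidence source named in the text.
REQUIRED_EVIDENCE = (
    "evaluation_suite",
    "live_outcome_quality",
    "tool_reliability",
    "policy_compliance",
    "recovery_readiness",
    "escalation_capacity",
)


def next_mode(current: str, evidence: dict) -> str:
    """Advance one mode only when every evidence flag is True; otherwise stay."""
    if current not in MODES:
        raise ValueError(f"unknown mode: {current}")
    ready = all(evidence.get(name) is True for name in REQUIRED_EVIDENCE)
    index = MODES.index(current)
    if ready and index < len(MODES) - 1:
        return MODES[index + 1]
    return current


all_clear = {name: True for name in REQUIRED_EVIDENCE}
mixed = {**all_clear, "recovery_readiness": False}

print(next_mode("assisted", all_clear))  # advances one step
print(next_mode("assisted", mixed))      # stays in place
```

A missing flag is treated the same as a failing one, so an unmeasured dimension cannot quietly authorize more autonomy.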

The interface should answer the user’s practical questions before asking for trust:

Why is the agent appearing at this moment?
What task can it complete, and what remains the user’s responsibility?
Which records or evidence will it use?
What will change if the user approves?
Can the result be edited or undone?
Where does the task go if the agent cannot complete it?

Surface the agent inside the existing workflow when the eligible job appears. State the action in task language, such as “prepare this response” or “verify and update this record,” rather than “ask AI anything.” Keep preview, edit, reject, undo, and escalation controls visible at the decision point. Contextual guidance is most useful when it removes a known piece of friction, not when it explains AI in general.

Use experiments for choices that are safe to vary: entry-point placement, explanation copy, prompt starters, preview layout, or the order of optional steps. Do not A/B test away required approvals, access controls, or safety boundaries. Time-to-value, task completion, edits, undo patterns, and escalation requests provide a more useful adoption picture than raw message volume.

Define activation as the first accepted outcome, not the first interaction. For a drafting workflow, that may be the first reviewed artifact that is actually used. For an operations workflow, it may be the first verified state change. The exact event should match the operating contract, and retention should measure return when the same job recurs rather than habitual chatting that produces no business result.

Key takeaways: use this launch gate

Before exposing an agent to production data or expanding its autonomy, require a clear yes to each question:

Can the job be stated with one user, one trigger, one observable outcome, and explicit stop conditions?
Are read, draft, and write permissions separated and enforced outside the prompt?
Does the evaluation set cover ambiguity, missing evidence, tool failure, policy boundaries, and handoff behavior?
Can every consequential tool validate authorization, return a clear result, and recover safely where recovery is possible?
Is the action classified by consequence and reversibility, with an appropriate approval path?
Can a reviewer see the evidence, proposed effect, missing information, and recovery option before approving?
Is there a named owner for outcomes, policy interpretation, monitoring, escalation, and incident response?
Can analytics connect an eligible opportunity to an accepted outcome, later correction, cost, and risk?
Can the product be narrowed, paused, or rolled back without waiting for a new model release?

A no does not have to stop all learning. It should stop the unsafe action. Move the pilot to shadow, advisory, or draft mode while the missing control is built.

For your next roadmap review, bring four artifacts instead of another open-ended demo: the operating contract, the evaluation matrix, the action classification, and the instrumented outcome funnel. Ship the smallest permissioned workflow that can prove value. Let observed outcomes, not confidence in the demo, earn the next level of autonomy.

January 4, 2026

How to Govern AI Agents With Product Analytics That Drives Action

Your dashboard can show growing AI agent usage while the product itself gets worse. Users may invoke the agent, wait for an answer, rewrite it, repeat the task manually, or discover too late that an action needs to be undone. An invocation count records activity. It does not tell you whether the agent was useful, safe, or worthy of more authority.

If you own an agent roadmap, the practical question is not whether the model can complete an impressive demo. It is whether you can see what the agent did, limit what it was allowed to do, connect its behavior to a user or business outcome, and stop or reverse a bad release. Product analytics should be the control system that helps you answer those questions.

Key takeaways

Define the agent’s job, eligible users, data boundary, action boundary, target outcome, and failure conditions before choosing dashboard metrics.
Join product behavior, agent decisions, tool activity, and business outcomes with shared run and workflow identifiers. A model trace or product funnel on its own is incomplete.
Treat permissions as product logic. Read access, recommendations, reversible actions, and high-consequence actions need different controls and evidence.
Version prompts, retrieval sources, models, tools, policies, and event schemas together so that a change in performance can be traced to a release.
Use quality, safety, experience, business, and operational gates to decide whether an agent should expand, remain constrained, be revised, or be retired.

Define the outcome and authority before the events

Teams often start by instrumenting what is easiest to count: conversations, messages, tool calls, and thumbs-up feedback. That produces a busy dashboard without a decision model. Start one level earlier. What job is the agent responsible for, and what evidence would justify giving it more reach or authority?

Write a one-page agent contract

An agent contract is a product artifact, not a legal document. It creates a stable reference for instrumentation, evaluation, access control, and rollout decisions. Write down:

Job: the decision or task the agent helps complete. Avoid broad mandates such as “improve support” or “assist product managers.”
Eligible workflow: the exact point at which the agent may appear or run. Eligibility must be measurable even when the user never invokes the agent.
Eligible users and accounts: the roles, segments, or environments included in the release, plus explicit exclusions.
Inputs: the approved resources, fields, retrieval collections, and user-provided context the agent may inspect.
Outputs: whether the agent answers, recommends, drafts, updates a system, contacts someone, or triggers another workflow.
Human checkpoints: the actions that require review, the person authorized to review them, and what that person must be shown.
Target outcome: the user or business result, its denominator, its measurement window, and the system that records it.
Known failure states: unsupported answers, irrelevant retrieval, repeated retries, blocked tools, abandoned approvals, incorrect actions, and failed handoffs.
Stop condition: the quality, risk, reliability, or outcome signal that pauses the rollout and identifies who owns the decision.

The eligibility definition matters more than it appears. If you count only people who chose to use the agent, your dashboard excludes people who ignored it, did not notice it, distrusted it, or could not access it. Record the eligible population first. That gives adoption, completion, and outcome metrics a defensible denominator.

Keep the first contract narrow. A practical starting footprint is one valuable question, a small team, and one assistant. Narrow scope is not merely easier to ship. It makes failures interpretable and limits the consequences of a bad policy, prompt, connector, or event definition.

Translate authority into enforceable policy

I use a strict definition of governance: the agent has a bounded objective, a known identity, limited data access, limited tools, recorded policy decisions, an escalation route, and a named owner. A policy page that the runtime cannot enforce is guidance, not governance.

Authority level	What the agent may do	Evidence to retain	Default release control
Retrieve	Read approved analytics, records, or knowledge without changing a system	Resource identifiers, applied scope, retrieval status, policy version, and references used	Pre-approved resources with least-privilege access and data minimization
Recommend	Explain, summarize, rank, draft, or propose an action	Agent version, supporting references, presentation status, and user response	The user decides whether to accept, edit, reject, or escalate
Act reversibly	Create a note or make another bounded change that can be reliably undone	Tool, target, before-and-after state, approval, execution result, and reversal path	Explicit approval during the bounded rollout, followed by evidence-based expansion
Act with high consequence	Send an external communication, alter access or entitlements, disclose sensitive data, or perform a hard-to-reverse operation	Everything above, plus approver identity, policy result, purpose, and incident linkage	A human makes the consequential decision; eligibility and tool scope remain narrow

Technical reversibility is not the same as consequence reversibility. A database field may be restored while a customer message, exposed record, or lost trust cannot be recalled. Classify authority by the real-world consequence, not by whether an API offers an undo method.
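To make that distinction concrete, here is a minimal classification sketch. The function name, its parameters, and the returned labels are hypothetical; the only point it illustrates is that an API-level undo is necessary but not sufficient for the reversible tier.

```python
def authority_level(changes_state: bool, proposes_action: bool,
                    api_has_undo: bool, consequence_recoverable: bool) -> str:
    """Classify by real-world consequence, not only by technical reversibility."""
    if not changes_state:
        return "recommend" if proposes_action else "retrieve"
    if api_has_undo and consequence_recoverable:
        return "act_reversibly"
    return "act_with_high_consequence"


# An internal note: the API can delete it and nothing has left the system.
print(authority_level(True, False, api_has_undo=True, consequence_recoverable=True))

# A customer message: the record can be deleted, but the message was already read.
print(authority_level(True, False, api_has_undo=True, consequence_recoverable=False))
```

The `consequence_recoverable` input is a human judgment recorded by the domain owner, not something the tool can report about itself.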

Model Context Protocol can make the policy surface clearer because it separates read-only resources from bounded tools and gives agents a standard way to discover them. That interface is useful, but the protocol does not decide who should access a resource, which fields are permitted, or whether an action needs approval. Authentication, authorization, redaction, policy enforcement, retention, and audit logging still belong in your architecture.

Apply controls before the model call and again before every tool execution. Prompts, retrieved context, logs, and third-party services can all become paths for sensitive-data leakage. Redact data the task does not require, keep secrets outside prompts, use scoped credentials, validate structured tool inputs, and record blocked requests as carefully as successful ones. A denied request is evidence that your policy worked, but repeated denials may also reveal a broken workflow, an overly broad prompt, or an attempted attack.
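Two of those controls are easy to sketch: minimizing a record before it reaches the model, and recording every tool decision whether it was allowed or denied. The function names, field names, and scope strings below are hypothetical, and the list used as an audit log stands in for an access-controlled store.

```python
def minimize(record: dict, required_fields: set) -> dict:
    """Keep only the fields the task needs before anything reaches the model."""
    return {key: value for key, value in record.items() if key in required_fields}


def check_tool_request(tool: str, scope_required: str,
                       scopes_granted: set, audit_log: list) -> bool:
    """Allow or deny a tool request and record the decision either way."""
    allowed = scope_required in scopes_granted
    audit_log.append({
        "tool": tool,
        "scope_required": scope_required,
        "decision": "allowed" if allowed else "denied",
    })
    return allowed


account = {"id": "acct-1", "plan": "team", "email": "user@example.com", "api_key": "secret"}
context = minimize(account, {"id", "plan"})

audit_log: list = []
check_tool_request("read_account", "accounts:read", {"accounts:read"}, audit_log)
check_tool_request("change_plan", "accounts:write", {"accounts:read"}, audit_log)

print(context)
print([entry["decision"] for entry in audit_log])
```

Because denials land in the same log as approvals, a run of repeated denials for one tool can be reviewed as a signal of a broken workflow or an attempted attack.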

Build telemetry that joins agent decisions to user outcomes

Product analytics and AI observability answer different halves of the same question. A trace can show which context was retrieved, which policy ran, and which tool was called. Product analytics can show what the user did before and after the interaction, which cohort they belonged to, and whether the workflow reached its intended result. Neither view alone proves that the agent created value.

Join them with two identifiers. An agent run identifier follows one execution from trigger to final status. A workflow identifier connects that execution to the broader task, including manual steps, retries, handoffs, and the eventual business outcome. A user may start several runs inside one workflow, so treating every run as an independent success will inflate apparent demand and hide rework.
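The effect of the join is easiest to see with a tiny example. The data below is invented, and the field names (`run_id`, `workflow_id`, `final_status`) are assumptions about how the two identifiers might be recorded.

```python
from collections import defaultdict

# Invented sample: three runs that belong to two workflows.
runs = [
    {"run_id": "r1", "workflow_id": "w1", "final_status": "rejected"},
    {"run_id": "r2", "workflow_id": "w1", "final_status": "accepted"},
    {"run_id": "r3", "workflow_id": "w2", "final_status": "accepted"},
]
# Outcome recorded by the source system for each workflow.
workflow_outcomes = {"w1": "achieved", "w2": "not_achieved"}

runs_by_workflow = defaultdict(list)
for run in runs:
    runs_by_workflow[run["workflow_id"]].append(run)

run_level_acceptance = sum(r["final_status"] == "accepted" for r in runs) / len(runs)
workflows_with_retries = sorted(w for w, rs in runs_by_workflow.items() if len(rs) > 1)
workflow_success = (
    sum(workflow_outcomes.get(w) == "achieved" for w in runs_by_workflow)
    / len(runs_by_workflow)
)

print(run_level_acceptance, workflow_success, workflows_with_retries)
```

In this sample, two of three runs were accepted, yet only one of two workflows reached its outcome, and one workflow needed a retry. Counting runs alone would overstate both demand and success.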

Use a minimum viable event contract

The following event model is deliberately small. Adapt the names to your analytics conventions, but preserve the states and identifiers.

Suggested event	Required properties	Decision it supports
agent_eligible	Workflow identifier, use case, surface, cohort, eligibility reason, and policy version	Who could have used the agent, including people who did not invoke it?
agent_run_started	Run identifier, workflow identifier, agent version, entry point, and initiating actor type	Where is the agent being invoked, and how often do workflows require retries?
agent_answer_presented	Run identifier, answer status, retrieval status, reference status, latency band, and fallback status	Did the user receive a grounded answer, a fallback, or no usable response?
agent_action_requested	Run identifier, tool, target type, authority level, required scope, approval requirement, and policy result	What is the agent attempting, and where are requests blocked or escalated?
agent_action_finished	Run identifier, tool, execution status, error class, approver state, reversibility state, and duration band	Did an approved action actually complete, fail, time out, or require recovery?
agent_handoff_started	Run identifier, workflow identifier, handoff reason, destination, context-transfer status, and user choice	Why did automation stop, and could the receiving person continue without reconstructing the task?
agent_run_outcome	Run identifier, workflow identifier, completion state, user response, correction state, and failure taxonomy	Was the output accepted, edited, rejected, abandoned, retried, or escalated?
workflow_outcome	Workflow identifier, outcome name, outcome state, measurement window, and source system	Did the underlying product or business result occur?

Put the agent, model, prompt, retrieval, tool, policy, and event-schema versions on the relevant records. Without version lineage, a quality shift produces debate instead of diagnosis. You will know that performance changed but not whether the cause was a prompt edit, a new model, a retrieval update, a permission change, a tool release, or broken instrumentation.

Do not make raw prompts and complete responses the default payload in a general-purpose analytics tool. They can contain personal data, secrets, customer content, or retrieved text that the analytics audience should not see. Send structured classifications and reference identifiers to product analytics. Keep any detailed trace required for investigation in an access-controlled store with explicit retention rules.

Use enumerated properties for states such as accepted, edited, rejected, blocked, failed, and handed off. Free-text status fields fragment quickly and make reliable cohorts impossible. Preserve a limited diagnostic field only where someone owns its review and classification.
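A small sketch of what enumeration can look like in code, assuming a hypothetical `RunState` type and a parser that refuses anything outside the list rather than storing it as free text:

```python
from enum import Enum


class RunState(Enum):
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"
    BLOCKED = "blocked"
    FAILED = "failed"
    HANDED_OFF = "handed_off"


def parse_state(raw):
    """Normalize a raw status string and reject values outside the enumeration."""
    normalized = raw.strip().lower().replace(" ", "_")
    return RunState(normalized)  # raises ValueError for unknown values


print(parse_state(" Handed off "))
```

An unknown value such as "mostly fine" raises an error at write time, which is where you want to discover that a status vocabulary has drifted.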

Measure a stack, not a vanity metric

A useful scorecard separates five layers. Each layer answers a different management question:

Reach and adoption: Of eligible workflows, where was the agent offered and invoked? This shows discoverability and voluntary use, not value.
Task experience: Of started workflows, how many completed, retried, fell back, transferred to a person, or were abandoned? Segment edits and overrides instead of treating every acceptance as equally successful.
Agent quality: Was the answer supported by approved context, relevant to the request, structurally valid, and consistent with the task-specific evaluation criteria?
Governance and safety: Which tool requests were allowed, denied, escalated, or attempted outside the approved scope? Which redaction, moderation, or policy checks failed?
Business outcome: Did the downstream result move for the eligible workflow and intended cohort? Examples include completed onboarding, resolved cases, qualified leads, retained users, or a shorter cycle, depending on the contract.

Always display the numerator and denominator behind a rate. A falling handoff rate may look positive until you discover that completions also fell. A high acceptance rate may hide repeated runs if the dashboard counts only the final answer. A rising task outcome may reflect a changing user mix rather than the agent. Cohort, version, eligibility, and workflow-level views prevent those misreadings.
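A minimal sketch of a rate that cannot be displayed without its counts. The `Rate` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rate:
    name: str
    numerator: int
    denominator: int

    def value(self) -> Optional[float]:
        """Return the ratio, or None when there is nothing to divide by."""
        if self.denominator == 0:
            return None
        return self.numerator / self.denominator

    def display(self) -> str:
        """Render the rate together with the counts behind it."""
        ratio = self.value()
        shown = "n/a" if ratio is None else f"{ratio:.1%}"
        return f"{self.name}: {shown} ({self.numerator} of {self.denominator})"


before = Rate("handoff rate", 30, 200)
after = Rate("handoff rate", 10, 100)
print(before.display())
print(after.display())
```

In this made-up comparison the handoff rate improves, but the denominator has halved. The counts prompt the question the percentage alone would hide.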

Behavioral analytics can establish association and expose where to investigate. It does not automatically establish causality. When the decision requires a causal claim, use a controlled experiment only after both variants meet the same safety and access requirements. Prompts, decision rules, and handoff designs can be tested across appropriate user cohorts; known unsafe behavior, privacy controls, and access boundaries are not experiment variants.

Turn analytics into release gates, not retrospective reporting

A governed agent release includes more than a prompt. It includes the model configuration, instructions, retrieval sources, tool definitions, permission scopes, policy rules, user disclosures, approval flow, handoff design, and telemetry. Change any of those and you have changed the product behavior.

That is why evaluation belongs in delivery, not in a quarterly review. Task-specific test sets, reference answers, error classifications, and pass-or-block thresholds can gate model and prompt changes in CI/CD. Production analytics then checks whether the behavior generalizes to real workflows without weakening the controls established before launch.

Use a staged promotion path

Validate the interface. Enumerate the resources, tools, schemas, scopes, and denial behavior. Run harmless requests and confirm that unavailable capabilities remain unavailable.
Run task evaluations. Test representative requests, known failure cases, adversarial inputs, missing context, malformed tool arguments, and handoff conditions. Classify failures by consequence rather than relying on one blended quality score.
Exercise the workflow without autonomous consequence. Use dry runs or recommendation-only behavior. Confirm telemetry, references, approvals, fallback, escalation, and rollback before enabling writes.
Release to a bounded eligible cohort. Keep tool scopes narrow and consequential actions under human control. Compare observed behavior with the contract, not with the enthusiasm generated by the demo.
Experiment inside the approved boundary. Test prompt, retrieval, interaction, and handoff variants only after they independently satisfy the safety gate. Analyze results by workflow and version.
Promote or constrain deliberately. Expand access or authority only when the relevant gates pass. A failed safety gate can restrict a release even when adoption or the business metric improves.

Pre-commit the gates

Choose thresholds and blocking conditions before reading the launch results. If the team sets them afterward, a promising outcome can quietly lower the quality bar, while a favored feature can turn every failure into an exception.

Gate	Evidence	Blocking condition	Typical response
Quality	Task evaluations, grounded-answer checks, correction categories, and unsupported-output reviews	A consequential failure class exceeds the pre-agreed tolerance or lacks a reliable detector	Revise instructions, retrieval, output constraints, or task scope
Safety and governance	Policy decisions, unauthorized tool attempts, redaction results, approval records, and incidents	An unresolved high-severity policy or data-control failure remains possible	Disable the affected tool or cohort, rotate credentials where needed, and follow the incident runbook
User experience	Completion, edits, rejection, fallback, abandonment, retries, and handoff continuity by cohort	The agent adds work, obscures control, or fails to transfer usable context	Simplify the interaction, improve disclosure, or return the step to a human workflow
Business outcome	The contract’s downstream metric for eligible workflows, with an appropriate comparison	Usage grows without a credible improvement in the intended outcome	Revisit the job, target cohort, workflow placement, or value hypothesis
Operations	Tool errors, latency, timeouts, dependency health, fallback success, and rollback readiness	The workflow cannot meet its reliability requirement or cannot fail safely	Reduce dependency surface, improve fallback, or pause promotion

Do not average these gates into a single agent score. A composite score can let strong adoption cancel a serious security failure or let low latency hide poor answer quality. Keep each gate visible, assign its owner, and specify which failures block promotion without negotiation.
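The difference between gating and averaging can be sketched in a few lines. The `GateResult` shape, the owner names, and the "hold" outcome for non-blocking failures are assumptions for illustration; your own release process defines the real rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    gate: str
    owner: str
    passed: bool
    blocking: bool  # pre-committed: does a failure stop promotion outright?


def promotion_decision(results):
    """Evaluate each gate on its own; no score is averaged across gates."""
    blockers = [r.gate for r in results if r.blocking and not r.passed]
    if blockers:
        return ("block", blockers)
    open_items = [r.gate for r in results if not r.passed]
    if open_items:
        return ("hold", open_items)
    return ("promote", [])


results = [
    GateResult("quality", "product", passed=True, blocking=True),
    GateResult("safety_and_governance", "security", passed=False, blocking=True),
    GateResult("business_outcome", "product", passed=True, blocking=False),
]
print(promotion_decision(results))
```

Two passing gates do not offset the failing one. The decision names the gate that blocked, which also tells you whose review is needed.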

Release decisions should also be reversible. Keep prior prompt, policy, retrieval, and tool configurations identifiable. Define how the runtime disables a tool, narrows a cohort, returns to recommendation-only behavior, or routes directly to a person. A rollback plan that depends on diagnosing the root cause first is too slow for a live incident.

Make the dashboard an operating system for the product team

The best agent dashboard does not attempt to show every event. It puts the release decision in view. Organize it in the order the team should reason:

Outcome: eligible workflows, target business result, comparison group where appropriate, and results by cohort and release version.
Journey: eligible, offered, invoked, answer presented, action proposed, approved, executed, handed off, and completed.
Quality and trust: grounded status, acceptance, substantive edits, rejection, retries, corrections, fallback, and qualitative feedback categories.
Governance and operations: allowed and denied tools, approval states, out-of-scope attempts, redaction failures, incidents, errors, latency, and dependency health.

Every panel should filter by agent version, policy version, tool, entry point, cohort, and workflow outcome. A top-line average is useful for orientation, but releases fail in slices: a user role with missing permissions, a workflow with poor retrieval, a new policy that blocks a required tool, or a handoff destination that cannot use the transferred context.

Run a decision review, not a dashboard tour

A regular review with the product trio can use behavioral telemetry, user feedback, and business outcomes to refine prompts, retrieval, and decision logic. Bring security, legal, analytics, operations, or domain owners into decisions that cross their boundaries. The meeting should answer:

Which intended outcome moved, for which eligible cohort, and under which release version?
Where did users retry, edit, reject, abandon, or request a person, and what does the failure taxonomy show?
Which permissions were never needed, and which denied requests reveal either a valid attack defense or a mismatch between the job and the available tools?
Did the agent reduce user work, or did it move that work into reviewing, correcting, approving, and recovering?
Are outcomes consistent across important roles and workflow entry points, or is the top-line result hiding a weak segment?
What changed since the prior release across the model, prompt, retrieval corpus, tools, policies, user experience, and instrumentation?
Should the team expand, hold, revise, restrict, roll back, or retire the current behavior?

Record the decision beside the release lineage: the hypothesis, eligible scope, versions, expected outcome, gates, observed evidence, known risks, owner, and next review condition. This turns governance into an operating history. It also prevents the same debate from restarting when a metric moves or a stakeholder changes.

Ownership must be explicit. Product owns the job, intended outcome, and promotion decision. Engineering owns runtime reliability, tool boundaries, traceability, and rollback mechanics. Design owns disclosure, user control, approval clarity, correction, and handoff. Data or analytics owns event integrity and metric definitions. Security and legal own the policies and incident requirements within their mandates. Shared input is valuable; shared accountability without a decision owner is not.

Start with one consequential workflow. Write its contract, add the eligibility event and shared identifiers, classify every available tool by authority, pre-commit the release gates, and review the first bounded cohort against the business outcome. Do not broaden the agent until you can explain why it ran, what it was permitted to see and do, what the user did next, whether the workflow improved, and how you would stop it safely.

January 3, 2026

10 AI Business Models You Need Now: Proven Playbooks Turning Algorithms into Revenue

I’ve spent the past few product cycles re-architecting roadmaps around one simple reality: AI is no longer just a feature—it’s a business model. The companies winning market share are those that treat models, data, and workflows as monetizable assets with defensible moats, not science projects.

AI business models are rewriting value creation. Learn how smart teams turn algorithms into profit engines, reshaping entire industries.

From my seat in product leadership, I evaluate AI bets through three lenses: durable value (moat and differentiation), measurable outcomes (clear ROI), and unit economics (gross margins under real-world load). With that frame, here are ten AI business models I see performing now—and how I decide when to invest.

1) API-first Model-as-a-Service. I monetize foundation or specialized models via an API, priced by tokens, requests, or time-in-context. Success hinges on latency, accuracy, and “context window management” that balances quality with cost. This is where “consumption SaaS pricing” shines and where disciplined rate-limiting, observability, and SLAs build trust.

2) Vertical AI copilots. I package domain-specific expertise (legal, healthcare, finance, field service) into workflow-native assistants that surface next-best actions. Because these copilots live where work happens, I price on outcomes—time saved, revenue recovered, or risk reduced—aligning value with customer metrics and accelerating product adoption.

3) Agentic AI automation. When autonomous agents handle multi-step tasks across tools, I lean toward per-outcome or per-job pricing. Reliability is the moat, so I invest early in eval-driven development, robust guardrails, and human-in-the-loop QA. This model compounds fast once agents can execute end-to-end workflows with transparent audit trails.

4) Copilot add-ons inside existing SaaS. I’ve seen “AI Assist” tiers deliver immediate ARPU lift and retention gains. The playbook: start with high-frequency, high-friction jobs (drafts, summaries, enrichment), then expand to proactive suggestions. This aligns tightly with product strategy and lets me stage value without overhauling the core experience.

5) Insights-as-a-Service via data network effects. I transform exhaust data into benchmarking, predictions, and prescriptive recommendations while honoring privacy-by-design and data governance. The more customers I onboard, the stronger the patterns and the higher the switching costs. Pricing ties to seats plus an outcome or value metric.

6) Retrieval-first pipeline for enterprise knowledge. I land with high-accuracy answers over customer data (search, summarize, cite), then expand into workflow automations. This “retrieval-first pipeline” reduces hallucinations, boosts trust, and creates defensibility through connectors, semantic indexing, and continuous relevance tuning, which makes it a strong fit for product managers who prioritize reliability.

7) Open source monetization. When I bet on openness, I monetize hosting, support, enterprise controls, and compliance features. The advantage is developer love and rapid iteration; the moat is operational excellence at scale, plus integrations customers rely on. This model converts community momentum into predictable revenue.

8) Marketplaces for prompts, skills, and agents. I create a platform for third-party extensions and charge a take rate on usage. The flywheel spins when developers see distribution, customers see breadth, and I enforce strong quality bars. The roadmap focuses on governance, discovery, and safe execution policies.

9) Solutions with forward-deployed engineers. For complex rollouts, I pair product with specialized implementation to guarantee outcomes. Revenue blends software and services, accelerating time-to-value and informing the roadmap with real-world constraints. Over time, learnings fold back into scalable, self-serve capabilities.

10) AI risk, security, and compliance tooling. As AI scales, so does the need for policy enforcement, monitoring, and auditability. I monetize via platform subscriptions that address model provenance, data leakage prevention, red teaming, and reporting. Strong “AI risk management” is now a purchasing requirement, not a nice-to-have.

How do I choose among these models? I start with the customer’s biggest workflow pain, map it to the fastest path to measurable outcomes, and align pricing with value creation. Then I build defensibility through data advantage, distribution, and governance. If a model deepens trust, improves margins, and compounds learning, it earns a place on the roadmap.

Inspired by this post on Product School.

December 24, 2025

How to Design Multi-Agent Fintech Support That Finishes Work

Your support prototype can explain what happens after a customer reports a stolen card. The harder product decision is whether you can trust it to carry that case from the first message to a verified outcome without losing state, skipping an approval, duplicating an action, or going silent while work remains open.

You will not solve that problem by adding a larger prompt or more conversational agents. You need an operating model for cases that span people, policies, systems, and days. The model below gives you a practical way to define the work, divide agent responsibilities, control execution, and measure whether the customer's problem was actually resolved.

Define the case before you define the agents

A stolen-card request exposes the central mistake in support automation. Freezing the card is visible, immediate, and easy to demonstrate. The less visible work may include dispute intake, fraud investigation, merchant communication, customer outreach, approvals, and follow-up. If your scope ends when the chat ends, you have automated the tip of the workflow while leaving its operational burden intact.

Start with a case contract. This is the shared definition of what entered the system, what outcome is owed, which actions are permitted, and what evidence will prove completion. Define it before deciding how many agents you need.

Customer outcome: State the result in operational terms. "Card secured and required follow-up completed" is more useful than "customer helped."
Entry conditions: Record the signals that create the case, including the customer request, the affected product, and any authentication or evidence requirements imposed by your policy.
Required work: Enumerate the actions, investigations, notices, approvals, and follow-ups that may sit below the initial request.
Allowed actions: Specify which tools may be called, which fields may be changed, and which financial or account actions require approval.
State and owner: Give every open case a current state and an accountable role. "The agents are working on it" is not a state.
Waiting conditions: Name the external event that can unblock the case, such as a customer reply, a system response, a timer, or a human decision.
Terminal conditions: Define resolved, declined, cancelled, transferred, and incomplete outcomes separately. Each one should require evidence and a reason code.

The strongest procedure starts as a workflow map owned by the people who understand disputes, fraud, operations, and compliance. Those subject-matter experts can maintain agent procedures in natural language, but natural language should not mean unmanaged prose. Give each procedure an owner, version, effective date, test cases, and approval history. A policy change should produce a traceable procedure change, not an invisible prompt edit.

Test your case contract with an awkward question: could the system truthfully tell the customer that the case is resolved while a mandatory downstream task is still pending? If the answer is yes, your terminal condition is wrong. Fix that before tuning response quality.

Split responsibilities at operational handoffs

A multi-agent design earns its complexity only when the separation makes ownership clearer. Creating several agents with overlapping prompts usually produces more routing ambiguity, not more capability. Divide the system where the nature of the work, permissions, or waiting behavior changes.

A useful pattern separates inbound, back-office, and outbound responsibilities while keeping procedures, skills, and guardrails on a shared foundation.

Agent role	What it owns	Typical handoff signal	Boundary to enforce
Inbound	Understands the request, gathers required details, performs permitted immediate actions, and creates or updates the case	The case has enough validated information to begin operational work	It cannot imply resolution merely because the conversation was handled
Back office	Executes system work, coordinates investigation steps, records evidence, and manages pending operational tasks	More information, an approval, or customer communication is required	It cannot invent missing evidence or bypass a policy gate to keep the case moving
Outbound	Requests missing information, communicates status or decisions, and follows up until a defined terminal condition is reached	The required response arrives, a timer fires, or the outreach policy is exhausted	It cannot decide that silence means success unless the procedure explicitly defines that outcome

The handoff should be a structured state transition, not an open-ended conversation between agents. Pass a compact case record containing the case identifier, current state, completed actions, evidence references, pending requirement, next allowed actions, applicable procedure version, and relevant deadline or timer. That record prevents the next agent from reconstructing the truth from a transcript.
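As a sketch, the compact case record could be a typed structure that refuses to travel incomplete. The `CaseHandoff` name, field types, and sample values are assumptions; the fields themselves follow the list in the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CaseHandoff:
    case_id: str
    current_state: str
    completed_actions: Tuple[str, ...]
    evidence_refs: Tuple[str, ...]
    pending_requirement: str
    next_allowed_actions: Tuple[str, ...]
    procedure_version: str
    deadline: Optional[str] = None  # ISO 8601 timestamp, if a timer applies

    def problems(self):
        """List reasons this record is not safe to hand to the next agent."""
        found = []
        for name in ("case_id", "current_state", "pending_requirement", "procedure_version"):
            if not getattr(self, name):
                found.append(f"{name} is empty")
        if not self.next_allowed_actions:
            found.append("next_allowed_actions is empty")
        return found


handoff = CaseHandoff(
    case_id="c1",
    current_state="waiting_on_customer",
    completed_actions=("freeze_card",),
    evidence_refs=("evidence/freeze-confirmation",),
    pending_requirement="customer confirms disputed transactions",
    next_allowed_actions=(),
    procedure_version="stolen-card-v4",
)
print(handoff.problems())
```

The sample record is rejected because it names no next allowed action, so the receiving agent would have to guess.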

Keep skills modular as well. "Send a status request," "retrieve transaction details," and "submit an approved case update" are easier to authorize, test, and audit than one broad tool called "handle dispute." Each skill should declare its required inputs, permitted states, side effects, expected result, and failure behavior.

Do not use separate agents simply to mirror your organization chart. Use them when different stages need different permissions, context, completion rules, or escalation paths. If two proposed agents can perform the same actions in the same states under the same controls, they probably belong together.

Let a state machine control long-running work

The language model can interpret a message and propose the next step. It should not be the sole authority on what state the case is in or which actions are legal from that state. A state-machine orchestrator can manage turns, triggers, and skill selection across an asynchronous case while the model handles the language inside those boundaries.

For an illustrative stolen-card workflow, your states might include:

Report received.
Immediate protection pending.
Immediate protection confirmed.
Required information under review.
Investigation or dispute work in progress.
Waiting on the customer, a merchant, an internal system, or a human approver.
Decision ready.
Required communication pending.
Resolved, transferred, declined, cancelled, or closed incomplete with a recorded reason.

Adapt the states to your product, operating procedure, and regulatory obligations. The value is not in these labels. It is in making every transition explicit. For each transition, specify the triggering event, required preconditions, allowed skill, expected side effect, accountable role, failure path, timer behavior, and evidence written back to the case.

Then scope skills deterministically for each turn. An agent handling a customer reply while the case is waiting for information may be allowed to validate the reply, attach evidence, request a missing item, or resume the workflow. It should not be able to perform unrelated account actions simply because those tools exist elsewhere in the platform. This per-state allow-list reduces the number of unsafe choices the model can make.
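A per-state allow-list can be as plain as a lookup applied before the model sees any tools. The state and skill names below are hypothetical, and unknown states deliberately receive no skills.

```python
# Hypothetical states and skills; unknown states fall through to an empty set.
ALLOWED_SKILLS_BY_STATE = {
    "waiting_on_customer": {"validate_reply", "attach_evidence",
                            "request_missing_item", "resume_workflow"},
    "immediate_protection_pending": {"freeze_card"},
}


def skills_for_turn(state, platform_skills):
    """Return only the skills the current state permits; unknown states get none."""
    allowed = ALLOWED_SKILLS_BY_STATE.get(state, set())
    return sorted(allowed & set(platform_skills))


platform_skills = ["freeze_card", "issue_refund", "validate_reply",
                   "attach_evidence", "request_missing_item", "resume_workflow"]
print(skills_for_turn("waiting_on_customer", platform_skills))
```

The refund skill exists on the platform but never reaches the model in this state, so no prompt wording is needed to discourage its use.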

Async triggers deserve the same design care as messages. A customer reply, API status change, timer expiry, failed tool call, and human approval are all events that can create a new turn. Store them durably and process them against the current case version. Otherwise a delayed event can act on stale state after the case has already moved forward.
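One common way to guard against stale events is a version check on the case, sketched below. The `Case` and `Event` shapes and the `observed_version` field are assumptions; a production system would do the comparison atomically in its datastore.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    state: str
    version: int = 1


@dataclass(frozen=True)
class Event:
    case_id: str
    kind: str
    observed_version: int  # case version when the event was created


def apply_event(case, event, next_state):
    """Advance the case only if the event was produced against its current version."""
    if event.case_id != case.case_id or event.observed_version != case.version:
        return False  # stale or misrouted: park for reconciliation instead of acting
    case.state = next_state
    case.version += 1
    return True


case = Case("c1", "waiting_on_customer")
reply = Event("c1", "customer_reply", observed_version=1)
late_timer = Event("c1", "timer_expired", observed_version=1)

print(apply_event(case, reply, "required_information_under_review"))
print(apply_event(case, late_timer, "closed_incomplete"))
```

The timer was scheduled while the case was still waiting. By the time it fires, the customer has replied, so the timer event is refused instead of closing an active case.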

Financial actions also need protection from retries. A timeout does not prove that a tool failed; the action may have succeeded while the response was lost. Use an idempotency key where the receiving system supports one, record the attempted operation before retrying, and reconcile uncertain outcomes. Blindly repeating a freeze, refund, fee adjustment, or dispute submission can create customer harm and financial exposure.
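The sketch below simulates a lost response to show why the same key must be reused on retry. `FakeLedger` is a stand-in for a receiving system that deduplicates by idempotency key, and the key format is made up; it does not cover reconciliation of outcomes that remain uncertain.

```python
class FakeLedger:
    """Stand-in for a receiving system that deduplicates by idempotency key."""

    def __init__(self):
        self.refunds = {}

    def refund(self, idempotency_key, amount_cents):
        # A repeated key returns the original result instead of paying twice.
        self.refunds.setdefault(idempotency_key, amount_cents)
        return self.refunds[idempotency_key]


def refund_with_retry(ledger, idempotency_key, amount_cents, attempts, lose_first_response):
    """Record each attempt before calling, then retry once with the same key."""
    for attempt in (1, 2):
        attempts.append((idempotency_key, attempt))
        result = ledger.refund(idempotency_key, amount_cents)
        if attempt == 1 and lose_first_response:
            continue  # simulate a timeout after the refund was already applied
        return result
    return None


ledger = FakeLedger()
attempts = []
refund_with_retry(ledger, "case-123:refund:1", 5000, attempts, lose_first_response=True)
print(len(attempts), sum(ledger.refunds.values()))
```

Two attempts were made and logged, but only one refund exists. Generating a fresh key for the retry would have paid the customer twice.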

Outbound completion needs its own rule. The customer may never send a final message, so "the conversation ended" cannot define success. A defensible terminal condition can require that the necessary notice was sent, mandatory actions are complete, no unresolved task remains, and any follow-up timer has reached the outcome defined by policy. Silence may end an outreach attempt; it does not automatically prove the underlying case was resolved.
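A terminal condition like this can be expressed as an explicit checklist that returns what is still unmet. The case fields and timer values below are hypothetical; your procedure defines the real ones.

```python
def unmet_terminal_conditions(case):
    """Return the names of conditions that still block a 'resolved' state."""
    checks = {
        "notice_sent": case["notice_sent"],
        "mandatory_actions_complete": all(case["mandatory_actions"].values()),
        "no_open_tasks": not case["open_tasks"],
        "follow_up_timer_settled": case["follow_up_timer"]
        in ("not_required", "policy_outcome_reached"),
    }
    return [name for name, ok in checks.items() if not ok]


case = {
    "notice_sent": True,
    "mandatory_actions": {"freeze_card": True, "dispute_intake": False},
    "open_tasks": [],
    "follow_up_timer": "running",
}
print(unmet_terminal_conditions(case))
```

The customer has gone quiet and the notice was sent, yet the case cannot be marked resolved: dispute intake is incomplete and the follow-up timer has not reached its policy outcome.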

Finally, write an audit record for every transition. Capture the prior state, event, procedure version, allowed skills, selected action, tool result, guardrail result, human decision if present, and resulting state. A transcript tells you what was said. A transition log tells you why the system acted.
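As a sketch, each transition can be serialized as one line in an append-only log. The `TransitionRecord` type and sample values are assumptions; the fields mirror the list in the paragraph above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TransitionRecord:
    case_id: str
    prior_state: str
    event: str
    procedure_version: str
    allowed_skills: Tuple[str, ...]
    selected_action: str
    tool_result: str
    guardrail_result: str
    human_decision: Optional[str]
    resulting_state: str


def to_log_line(record):
    """Serialize one transition as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = TransitionRecord(
    case_id="c1",
    prior_state="immediate_protection_pending",
    event="customer_report",
    procedure_version="stolen-card-v4",
    allowed_skills=("freeze_card",),
    selected_action="freeze_card",
    tool_result="success",
    guardrail_result="passed",
    human_decision=None,
    resulting_state="immediate_protection_confirmed",
)
print(to_log_line(record))
```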

Make compliance and human review part of execution

Do not reduce compliance to a paragraph at the end of the system prompt. High-stakes rules need controls at the point where the system interprets information, chooses an action, changes a case, or communicates a decision.

Use three complementary layers:

Deterministic controls: Enforce permissions, required fields, state preconditions, transaction limits defined by your policy, and mandatory approvals in code or workflow configuration.
Classification guardrails: Detect whether an input, proposed action, or outgoing message belongs to a risk category that must be blocked, revised, or reviewed.
Human decisions: Route policy exceptions, consequential approvals, conflicting evidence, ambiguous cases, and unsupported operations to an accountable person.

For critical regulatory checks, treat guardrails as classification problems and prioritize recall when missing a risky case is more costly than sending an extra case to review. That choice has an operational consequence: more false positives can increase manual workload and delay customers. Product, operations, risk, and compliance owners should agree on that trade-off for each guardrail rather than applying one global threshold.
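To make the trade-off concrete, the sketch below picks the highest threshold that still meets a recall target and reports how many safe cases would be sent to review. The scores and labels are invented, and `pick_threshold` is a hypothetical helper, not a recommended calibration method.

```python
def pick_threshold(scored, min_recall):
    """Return (threshold, recall, false_positives) for the highest threshold
    that reaches min_recall, or None if no threshold does.

    scored is a list of (score, is_risky) pairs from a labeled review set.
    """
    positives = sum(1 for _, risky in scored if risky)
    if positives == 0:
        return None
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        flagged = [(s, risky) for s, risky in scored if s >= threshold]
        true_pos = sum(1 for _, risky in flagged if risky)
        recall = true_pos / positives
        if recall >= min_recall:
            return threshold, recall, len(flagged) - true_pos
    return None


scored = [(0.9, True), (0.7, True), (0.6, False),
          (0.4, True), (0.3, False), (0.1, False)]
strict = pick_threshold(scored, min_recall=1.0)
relaxed = pick_threshold(scored, min_recall=0.6)
print(strict)
print(relaxed)
```

On this toy data, catching every risky case lowers the threshold to 0.4 and sends one safe case to review. Accepting a recall of about two thirds removes that review but misses a risky case, which is the decision the owners need to make per guardrail.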

Every classifier needs a defined consequence. A positive result might block an action, remove a skill from the current turn, require human approval, or permit the workflow to continue with additional logging. A score without an execution rule is only dashboard data.
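A minimal sketch of binding each guardrail to an execution rule, so a classifier cannot be deployed as a score alone. The guardrail names and the `Consequence` values are hypothetical.

```python
from enum import Enum


class Consequence(Enum):
    BLOCK_ACTION = "block_action"
    REMOVE_SKILL = "remove_skill"
    REQUIRE_APPROVAL = "require_approval"
    CONTINUE_WITH_LOGGING = "continue_with_logging"


# Hypothetical guardrail names mapped to the rule that fires on a positive result.
EXECUTION_RULES = {
    "possible_vulnerable_customer": Consequence.REQUIRE_APPROVAL,
    "unsupported_financial_promise": Consequence.BLOCK_ACTION,
}


def consequence_for(guardrail, positive):
    """Return the execution rule for a classifier result, or None if negative."""
    if guardrail not in EXECUTION_RULES:
        raise KeyError(f"guardrail has no execution rule: {guardrail}")
    return EXECUTION_RULES[guardrail] if positive else None


print(consequence_for("unsupported_financial_promise", positive=True))
```

Looking up a guardrail with no rule fails loudly, even for a negative result, so an unmapped classifier is caught before it runs in production.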

Customer-specific policies matter in a platform serving more than one fintech. The system may share an architecture while each customer requires its own procedures and guardrails. Resolve the applicable policy set from trusted configuration before the model acts, attach the policy version to the case, and prevent cross-customer retrieval or tool access. Do not ask a model to infer which client's rules should apply from conversational context.
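The sketch below resolves the policy set from configuration keyed by an authenticated tenant identifier, never from message text. The tenant identifiers, version strings, and procedure names are invented for illustration.

```python
# Hypothetical trusted configuration, loaded outside the model's control.
POLICY_SETS = {
    "tenant_a": {"policy_version": "a-policy-7", "procedures": ("stolen_card_v4",)},
    "tenant_b": {"policy_version": "b-policy-2", "procedures": ("stolen_card_v2",)},
}


def resolve_policy(authenticated_tenant_id):
    """Look up the policy set for a tenant established by authentication."""
    if authenticated_tenant_id not in POLICY_SETS:
        raise LookupError(f"no policy set for tenant: {authenticated_tenant_id}")
    return POLICY_SETS[authenticated_tenant_id]


def open_case(case_id, authenticated_tenant_id):
    """Create a case record with the tenant and policy version attached."""
    policy = resolve_policy(authenticated_tenant_id)
    return {
        "case_id": case_id,
        "tenant_id": authenticated_tenant_id,
        "policy_version": policy["policy_version"],
        "procedures": policy["procedures"],
    }


print(open_case("c1", "tenant_b"))
```

A tenant without a configured policy set cannot open a case at all, which is safer than falling back to another customer's rules.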

Human escalation should be a first-class tool call, not a side-channel message. The request should contain the exact decision needed, current state, relevant evidence, attempted actions, available options, policy context, risk of delay, and response deadline. The human's answer should return as a recorded workflow event so the orchestrator can validate it and resume from the correct state.
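One way to treat the human's answer as a workflow event is to validate it against the request before the orchestrator resumes. The types and field names below are assumptions and carry only a subset of the request fields described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EscalationRequest:
    case_id: str
    decision_needed: str
    current_state: str
    evidence_refs: Tuple[str, ...]
    options: Tuple[str, ...]
    response_deadline: str  # ISO 8601 timestamp


@dataclass(frozen=True)
class HumanDecision:
    case_id: str
    chosen_option: str
    decided_by: str


def validate_decision(request, decision) -> Optional[str]:
    """Return a rejection reason, or None when the decision can resume the case."""
    if decision.case_id != request.case_id:
        return "decision refers to a different case"
    if decision.chosen_option not in request.options:
        return "chosen option was not offered"
    if not decision.decided_by:
        return "no accountable person recorded"
    return None


request = EscalationRequest(
    case_id="c1",
    decision_needed="approve fee adjustment",
    current_state="waiting_on_approver",
    evidence_refs=("evidence/fee-history",),
    options=("approve", "decline", "request_more_information"),
    response_deadline="2026-01-15T17:00:00Z",
)
print(validate_decision(request, HumanDecision("c1", "approve", "ops-lead-17")))
print(validate_decision(request, HumanDecision("c1", "approve_partial", "ops-lead-17")))
```

A free-form reply such as "approve_partial" is rejected because it was never one of the offered options, so the case stays pending instead of advancing on an ambiguous answer.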

This pattern is especially important when an API is missing. A person may complete the task in an internal system, but the agent must not assume it happened. Require a structured confirmation and evidence before advancing the case. If that evidence never arrives, keep the case visibly pending or escalate it according to the procedure.

Because these workflows can affect money, account access, customer rights, and regulatory obligations, your AI design cannot substitute for review by qualified legal, compliance, risk, and operations owners. Let those owners approve the policies, controls, escalation criteria, and customer communications before live execution. Begin with read-only or reversible capabilities where possible, and do not grant autonomous financial actions until the failure and recovery paths have been tested.

Measure verified resolution and improve from failures

A conversational system can produce polished replies while leaving cases unfinished. That is why containment or deflection cannot be your sole success metric. The primary question is whether the case reached the correct terminal state with the required evidence, policy checks, and customer communication.

Build a metric hierarchy that separates outcomes from diagnostics:

Case outcome: Track the share of eligible cases reaching a verified terminal state, along with cases reopened, transferred, or found incomplete during review.
Customer experience: Track customer satisfaction and whether the customer must contact support again because ownership or status was unclear.
Operational performance: Track time to resolution, first-contact resolution where that metric is genuinely applicable, deflection, escalation rate, waiting time by state, and human work by escalation reason.
Risk performance: Track critical guardrail misses, false-positive reviews, unauthorized action attempts, procedure deviations, and cases advanced without required evidence.
Agent-stage performance: Track routing accuracy, skill success, handoff completeness, tool failures, timer outcomes, and terminal-state correctness for each role.
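For the case-outcome layer, a minimal sketch of counting verified resolution might look like this. The case fields are hypothetical, and the function returns counts rather than a single percentage so the denominator stays visible.

```python
def case_outcome_counts(cases):
    """Count eligible cases by verified outcome, keeping the denominator visible."""
    eligible = [c for c in cases if c["eligible"]]
    verified = [c for c in eligible if c["terminal_state"] and c["evidence_verified"]]
    return {
        "eligible": len(eligible),
        "verified_terminal": len(verified),
        "reopened_after_verification": sum(1 for c in verified if c["reopened"]),
        "closed_without_evidence": sum(
            1 for c in eligible if c["terminal_state"] and not c["evidence_verified"]
        ),
    }


cases = [
    {"eligible": True, "terminal_state": "resolved", "evidence_verified": True, "reopened": False},
    {"eligible": True, "terminal_state": "resolved", "evidence_verified": False, "reopened": False},
    {"eligible": True, "terminal_state": None, "evidence_verified": False, "reopened": False},
    {"eligible": False, "terminal_state": "resolved", "evidence_verified": True, "reopened": False},
    {"eligible": True, "terminal_state": "declined", "evidence_verified": True, "reopened": True},
]
print(case_outcome_counts(cases))
```

Three of the four eligible cases are marked closed, but only two are verified, and one of those was reopened. A containment metric would report all three as handled.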

Be careful with first-contact resolution in workflows that are supposed to run asynchronously. A fraud investigation may remain open after a perfectly handled first interaction. Optimizing the agent to close the contact can therefore conflict with the real outcome. Use time to verified resolution and unresolved-work visibility alongside conversation metrics.

Evaluation should inspect both language and execution. A useful case-level rubric asks whether the system understood the request, selected an allowed skill, used the correct procedure version, obtained required evidence, respected guardrails, preserved context at handoffs, communicated accurately, and entered the right terminal state.

An automated evaluation pipeline can flag cases for human review and turn reviewed failures into labeled data. Do not sample only obviously failed conversations. Include high-risk classifications, recently changed procedures, new skills, long-running cases, human escalations, unusual state transitions, tool errors, and a baseline sample of apparently successful cases. Otherwise your evaluation set will miss failures that look normal in aggregate metrics.

Give every reviewed failure a place in a product backlog. The fix may belong to the procedure, state machine, skill contract, integration, guardrail, escalation path, or model behavior. "The agent made a mistake" is too broad to assign. A stable failure taxonomy tells you which layer should change and which regression tests must be added before release.

A sensible implementation sequence is:

Choose one bounded journey with a meaningful operational tail and a clearly accountable owner.
Map the full case, including hidden back-office steps, waiting states, approvals, exceptions, communications, and terminal conditions.
Define the case schema, events, state transitions, evidence requirements, and audit record.
Assign inbound, back-office, and outbound responsibilities only where permissions or completion rules differ.
Expose narrow modular skills and apply a deterministic allow-list in every state.
Add compliance classifiers, hard controls, and human decision gates before enabling consequential actions.
Run historical, synthetic, or controlled cases through the workflow and evaluate the complete case, not just the generated messages.
Release gradually, monitor state-level failures, and feed reviewed cases back into procedures, controls, and regression evaluations.

Key takeaways

Scope the customer's complete case before choosing the number of agents.
Separate agents at real permission, workflow, or completion boundaries.
Let the model interpret language, but let explicit state and policy control execution.
Treat human review as a structured workflow event with an owner and deadline.
Define "done" with evidence; a finished chat is not a finished case.
Optimize for verified resolution, policy adherence, and safe recovery rather than response quality alone.

At your next design review, put one real support case on the page and ask four questions: where can it wait, what event unblocks it, who approves a risky action, and what evidence proves completion? If your team cannot answer all four from the workflow, the system is not ready to act. Once those answers are explicit, agent boundaries become an engineering decision instead of a bet on autonomous behavior.

References

Shivam.Consulting Blog — Beyond the Support Iceberg: Gradient Labs' Multi-Agent Breakthrough That Actually Gets Work Done

December 18, 2025

2026 Support Capacity Playbook: Bold AI Automation, Smarter Staffing, Zero‑Surprise SLAs

Capacity planning has always been a high-stakes exercise in customer service, and when you miss, the signal shows up fast in backlogs and SLAs. I’ve lived that pressure across multiple cycles, and 2026 will reward teams that plan differently. AI fundamentally changes capacity planning because it changes the work. It resolves the bulk of your volume, speeds up execution, and elevates the complexity and value of what humans handle. The consequence is simple: planning models must evolve.

This is the final installment in my 2026 customer service planning series, and I’m focusing on the tension every leader feels right now: be ambitious about automation, but avoid the trap of understaffing if your assumptions don’t hold. My goal is to share how AI changes the logic of capacity planning, what I’ve learned implementing these practices with my team and with customers, and the common traps to avoid.

Traditional planning rests on relatively stable assumptions: volume grows predictably, work types stay consistent, handle times don’t swing dramatically, and productivity improves slowly with better tools and training. In an AI-first model, none of that is guaranteed, and the fundamentals flip. The mix of work changes as AI absorbs a growing share of simpler conversations, leaving humans with deeper, more time-consuming issues that demand human-to-human connection. Demand can actually increase when you remove friction, so AI can both resolve more and attract more volume. Human time splits differently as teammates solve customer problems and also review AI behavior, give feedback, improve content, and support system-level work. Performance becomes dynamic, not fixed: automation rate isn’t a one-time number, and it can rise with care and fall with neglect. If you plan for 2026 using a pre-AI model (assuming similar productivity, similar work mix, and a linear relationship between volume and headcount), you will underestimate what it now takes to run a high-performing support organization.

There are many metrics you can track, but the one to put at the center is automation rate (AI Agent involvement rate × AI Agent resolution rate). This single construct tells me what share of total volume AI actually resolves, how much work remains for humans, how much additional demand humans can absorb, and how ambitious I can be with headcount. Early in the journey, I prioritize raising involvement, getting the AI involved in more conversations. Once involvement is high, I shift to resolution on the hardest remaining work, where each additional 1% of automation can represent several people’s worth of capacity. In my 2026 plans, automation rate sits alongside projected inbound volume, average “output” per person for the more complex work that remains, and occupancy (how much time is allocated to customer-facing interactions versus operational and strategic work). Together, those inputs give a realistic picture of how many people you need and where they should spend their time.

First, plan boldly on automation, but match it with investment. I do not cap automation assumptions at 40–50% “because AI is new.” Many teams are already modeling 60%, 70%, even 80%+ for 2026, when they invest in AI ownership and content. The investment is non-negotiable: named ownership for AI performance (AI ops, knowledge management, conversation design), clear automation targets by work type (e.g., informational vs. personalized vs. actions vs. deep troubleshooting), realistic expectations for what’s easy to automate and what’s not, and a concrete plan to raise automation over time in monthly or quarterly steps rather than a single jump.

To decide where to invest first, I dig into the data. I start with the biggest volume drivers, separate content-led issues from those dependent on data or complex procedures, assume higher resolution potential for content-led topics once the knowledge base is in shape, and set more modest initial resolution expectations for system-dependent flows. Then I stair-step improvements as the systems, data contracts, and workflows mature. In short, bold automation goals only work when paired with the team structure, content, and systems required to reach them, and the discipline to iterate.

Second, expect human “output” per person to go down. That’s a mindset shift. Historically, we assumed individual productivity would stay flat or tick up as tools improved. In an AI-first model, humans handle fewer conversations but more complex, cross-functional issues, and create more value despite lower case counts. I model a lower “cases closed per person” than prior-year baselines, explicitly assume the remaining work is more complex and time-consuming, and redefine productivity to include system-level work like AI Agent improvements, content updates, and policy or workflow change management. I also report “capacity created” from automation alongside human outputs, so leadership sees the full picture.

Third, rethink occupancy: more time off the queues, on higher-value work. Traditional occupancy splits time between inbox and training, meetings, and breaks. Now there’s an expanding “out-of-inbox” portfolio that directly affects AI performance and overall capacity: reviewing AI-handled conversations, improving AI Agent triaging and handovers, contributing to content and procedures, feeding insights to product and engineering, and supporting system changes that reduce future volume. I set lower inbox occupancy targets than before and make the rationale explicit. People aren’t working less; they’re working differently. In planning, I assume more time spent on improvement and system work, make it visible (for example, X% in inbox and Y% on AI and system improvement), and treat this as critical, not a “nice to have.” If you don’t proactively allocate it, it won’t happen, and your automation and performance targets will suffer.

Fourth, work with the finance team early, and treat your plan as a set of assumptions. Capacity planning with AI is a set of bets across automation rate, human output, demand growth, occupancy, and where surplus capacity (if any) goes. I bring finance in early, show that the plan is dynamic and directly tied to AI performance, and label every lever as an assumption with ranges. I commit to a quarterly review cadence with finance to compare assumptions versus reality and adjust headcount, targets, and investment as needed. The risks are real: if automation grows slower than expected and you stop backfilling too early, you’ll be understaffed for months. Hiring and onboarding take time, so course-correcting late creates strain. If you do produce surplus capacity, have a clear strategy to reallocate those teammates to higher-value work (improving systems, feeding insights back to product, supporting new channels, and driving proactive CX) rather than defaulting to reductions. I also set explicit guardrails: if automation rate misses by five points for two consecutive months, we pause planned reductions and revisit hiring gates. If it over-performs, we shift people into backlog eradication, content upgrades, or proactive outreach, so we bank compounding value.

To set your team up for success in 2026, anchor your plan on automation rate, be honest that humans will handle fewer but harder conversations, and protect time for system improvements. Partner early and often with finance, avoid shrinking too fast, and design a plan for surplus capacity so you’re never caught flat-footed. If AI is going to handle the majority of your customer conversations, your plan has to be designed to help it do that well and to keep your team set up for meaningful, sustainable work. A 2026 plan built on adaptable assumptions, not fixed predictions, will hold up as your work, your systems, and your customers’ expectations continue to change.
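
The arithmetic behind this plan can be sketched in a few lines. The automation-rate formula and the guardrail (a five-point miss for two consecutive months) come from the text above; the headcount formula is an assumption made for illustration, in which per-person output at full inbox time is scaled by inbox occupancy. All function and parameter names are hypothetical.

```python
import math


def automation_rate(involvement_rate: float, resolution_rate: float) -> float:
    """Share of total volume the AI resolves (both inputs are 0..1)."""
    return involvement_rate * resolution_rate


def required_headcount(
    inbound_volume: float,
    involvement_rate: float,
    resolution_rate: float,
    cases_per_person: float,   # assumed output at 100% inbox time
    inbox_occupancy: float,    # share of time spent in the inbox, 0..1
) -> int:
    """Assumed model: human volume divided by occupancy-adjusted output."""
    human_volume = inbound_volume * (
        1 - automation_rate(involvement_rate, resolution_rate)
    )
    effective_output = cases_per_person * inbox_occupancy
    return math.ceil(human_volume / effective_output)


def should_pause_reductions(planned, actual, miss_points=5.0, months=2):
    """True when each of the last `months` months missed plan by >= miss_points.

    `planned` and `actual` are monthly automation rates in percentage points.
    """
    recent = list(zip(planned, actual))[-months:]
    return len(recent) == months and all(
        p - a >= miss_points for p, a in recent
    )


# Illustrative numbers only: 20,000 monthly conversations, 90% involvement,
# 70% resolution, 400 cases per person at full inbox time, 60% occupancy.
headcount = required_headcount(20_000, 0.9, 0.7, 400, 0.6)
```

Treating every argument as a range rather than a point estimate, and rerunning the calculation at each quarterly review, is one way to keep the plan framed as assumptions.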
If you’d like future editions like this, subscribe and stay close—I’ll keep sharing what’s working, what isn’t, and how to tune your customer support AI strategy in real time.

Inspired by this post on The Intercom Blog.

December 16, 2025

From No-Code Hack to 10,000 Weekly Calls: Inside Perk’s Voice AI That Actually Works

I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.

What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?

Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.

What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.

From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.

They built a working prototype without writing a single line of backend code.

They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.

They created two eval systems: one for call success classification, another for conversational behavior.

They scaled from five calls a day to tens of thousands per week while maintaining quality.

This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.

Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.

What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Their choice of Make.com for prototyping, and the fact that they shipped to production without touching backend code, underscores how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is exactly how you harden agent behavior for production.
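
To make the idea of structured conversation stages concrete, here is a minimal sketch. It is not Perk's implementation: the stage names follow the three stages named above, while the instructions, the `done_signal` labels, and the `advance` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    instructions: str           # short, stage-specific prompt
    done_signal: str            # hypothetical label emitted when the goal is met
    next_stage: Optional[str]   # None marks the end of the call


STAGES = {
    "ivr": Stage(
        "Navigate the phone menu until a person answers.",
        "reached_human",
        "booking_confirmation",
    ),
    "booking_confirmation": Stage(
        "Confirm the booking reference with the hotel.",
        "booking_confirmed",
        "payment",
    ),
    "payment": Stage(
        "Ask whether the virtual card payment went through.",
        "payment_verified",
        None,
    ),
}


def advance(current: str, signal: str) -> Optional[str]:
    """Move to the next stage only when the current stage's goal is met."""
    stage = STAGES[current]
    return stage.next_stage if signal == stage.done_signal else current
```

Because each stage carries only its own instructions, a failure can be traced to one stage and evaluated there, rather than to a single monolithic prompt.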

Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.

The challenge of prompt engineering for voice—numbers, booking references, and text-to-speech markup—was non-trivial. Expanding to German revealed that prompts in native language improve results. And, as often happens with operations-heavy rollouts, this project uncovered other operational problems they didn't know existed—valuable signal for the roadmap.

Resources & Links:

Perk

Make.com – No-code automation platform used for the prototype

Twilio – Voice/telephony provider

Eleven Labs – Text-to-speech provider (used in early experiments)

Chapters: 00:00 Introduction to the Team; 01:54 Understanding PERK's Mission; 02:59 Challenges in Travel Booking; 07:27 AI Solutions for Customer Care; 09:52 Prototyping with AI and Voice; 17:00 Implementing AI in Production; 25:51 Learning Through Trial and Error; 26:40 Prompting Challenges and Solutions; 27:58 Iterating on Prompts and Evaluations; 30:08 Scaling and Production Challenges; 32:43 Advanced Evaluation Techniques; 35:32 Real-World Applications and Success; 49:07 Future Directions and Expansion; 53:53 Conclusion and Team Reflections.

My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into deterministic stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language—your accuracy will thank you.

If you’re exploring agentic AI for operations, this is the blueprint: tight scoping, Make.com for speed, Twilio for reliability, structured prompts for control, and an eval-driven loop to scale quality with confidence.

Inspired by this post on Product Talk.

December 4, 2025

Own Your AI: 4 Essential Roles to Supercharge Support and Prevent Performance Drift by 2026

AI doesn’t fail because the model is bad; it fails because ownership is missing.

When someone truly owns your AI, everything changes. Resolution and automation rates climb, the system self-improves, and the customer experience transforms in ways a dashboard alone will never show you.

This is part three of our five-part series on customer service planning for 2026. We’ll be sharing all five editions on our blog and on LinkedIn.

Last week, we introduced the four roles that make AI actually work in a support organization. These roles are already showing up inside the teams who are scaling AI the fastest, and this week, we get closer to the ground.

Here’s what these roles look like in practice — what they do, how they work, and why your AI performance will inevitably drift without them.

AI operations lead — owns AI performance, every day. I think of this person as the air-traffic controller for our AI Agent. I treat the AI as a living system that needs ongoing supervision, evaluation, and tuning. This role is accountable for what leaders care about most: quality, reliability, and continuous improvement.

The AI ops lead sees the whole picture: conversation quality, missing knowledge, flawed assumptions, unexpected failures, new opportunities for automation, and the subtle signals that the system is beginning to drift. In practice, that vigilance is the difference between steady gains and slow decline.

Day-to-day, here’s what I expect from this role.

1. Reviews AI conversations and surfaces performance patterns. The AI ops lead monitors the AI Agent’s behavior — the tone shift after a product launch, a sudden dip in resolution for a specific intent, or conversation clusters revealing new customer behavior. They scan for anomalies, trends, and early warnings, with an emphasis on what’s happening right now, not last week. Without this intentional ownership, I’ve watched a 2% dip turn into a 10% drop in days.

2. Prioritizes fixes and improvements. Once patterns emerge, they triage fixes like a product team handles bugs. Missing or incorrect content? They route it to the knowledge manager. Behavioral issues? They adjust guidance and guardrails. Action or system issues? They partner with the automation specialist. This connective tissue turns individual fixes into compounding improvements.

3. Defines and maintains AI guardrails. Leaders everywhere worry about AI doing things it shouldn’t. This role answers that fear by establishing clarification logic, escalation rules, “never answer” policies, and safety boundaries. The goal is predictable behavior that protects customer trust — an essential pillar of any AI Strategy and AI risk management practice.

4. Aligns reporting with leadership. The AI ops lead reports on resolution rate, CX Score, CSAT, automation coverage, and hours saved, making the economic impact visible. That visibility is a foundational step in any credible customer support AI strategy.
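
The triage step described in item 2 can be sketched as a small routing table. The three failure layers and their owners follow the text; the identifiers and the `route` function are hypothetical.

```python
from enum import Enum


class FailureLayer(Enum):
    CONTENT = "content"     # missing or incorrect knowledge
    BEHAVIOR = "behavior"   # guidance or guardrail problem
    ACTION = "action"       # backend action or system problem


OWNER_BY_LAYER = {
    FailureLayer.CONTENT: "knowledge_manager",
    FailureLayer.BEHAVIOR: "ai_ops_lead",
    FailureLayer.ACTION: "automation_specialist",
}


def route(issues):
    """Group reviewed issues into one queue per owner.

    `issues` is an iterable of (description, FailureLayer) pairs.
    """
    queues = {owner: [] for owner in OWNER_BY_LAYER.values()}
    for description, layer in issues:
        queues[OWNER_BY_LAYER[layer]].append(description)
    return queues
```

Keeping the layer explicit on every reviewed issue is what lets a pattern ("resolution dipped for one intent") become an assignable fix.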

Why this role exists now. AI systems are dynamic and require constant tuning. A small dip in quality quickly becomes an operational issue, and no existing role naturally owns that. When someone does, teams feel the benefit almost immediately.

Knowledge manager — builds and maintains the structured knowledge AI depends on. I hear the same thing from leaders again and again: AI is only as good as the content you give it. This role is rapidly evolving from classic knowledge management into knowledge strategy — part content designer, part systems thinker, part information architect. Their job is to build the knowledge scaffolding that lets AI answer accurately, consistently, and safely.

Here’s how the knowledge manager creates leverage.

1. Writes, maintains, and improves support knowledge — continuously. After every product change, they update articles, remove duplication, resolve contradictions, and pay down “knowledge debt” that quietly erodes accuracy. The upkeep is shaped by AI performance; when patterns expose gaps, they fix the source.

2. Structures knowledge for AI, not for browsing. Traditional help centers are for humans skimming pages. AI needs clean intent signals, crisp formatting, and clearly structured language. The knowledge manager designs that structure as intentionally as the content itself.

3. Works hand-in-hand with AI ops. Many performance issues stem from missing or unclear knowledge. When the AI ops lead surfaces recurring misunderstandings or low-resolution categories, the knowledge manager resolves the root cause at the source.

4. Ensures accuracy and compliance at scale. As AI handles more sensitive situations, the knowledge manager safeguards correctness, currency, and compliance — critical for data governance and regulatory alignment.

5. Develops a cross-functional knowledge strategy. The role creates a canonical, cross-functional source of truth that product, engineering, product marketing, go-to-market, and support (AI and human) can all rely on.

Why this role exists now. This is one of the highest-leverage positions in an AI-first support org. Teams like Rocket Money and Anthropic are hiring knowledge managers because AI accuracy depends on the quality of knowledge feeding it. Without this role, resolution rate caps out early and never climbs.

Conversation designer — designs how the AI speaks, clarifies, and interacts. AI isn’t just a tool customers use; it’s a representative they interact with. Tone, clarity, pacing, and conversational structure matter, especially in voice. Every word affects perceived expertise, trustworthiness, and brand. The conversation designer ensures the AI feels human-friendly without pretending to be human — the sweet spot that builds trust without misleading customers.

In my experience, staffing conversation design early accelerates results. It changes not only how we tune AI, but how we understand the end-to-end customer experience.

Here’s what great conversation design looks like.

1. Shapes the AI’s tone, voice, and communication style. This role refines phrasing, tunes politeness, adjusts how confusion is handled, and shapes micro-interactions that determine whether customers feel cared for or dismissed. On voice channels, natural cadence is make-or-break.

2. Designs flows for high-value conversations. They design how the AI clarifies intent, branches, communicates uncertainty, verifies details, escalates, hands off, and returns to the main thread without feeling mechanical — treating customer experience as a product with language as the interface.

3. Translates procedures and complex workflows into natural language and logic. As AI runs structured procedures and actions, this role becomes a conversational system architect, translating SOPs into conditional logic with exceptions and fallbacks. For example, in Intercom, our conversation designer uses Simulations to run simulated conversations to see where the AI Agent gets confused, over-confident, or awkward, and refine flows until the interaction feels effortless end-to-end.

4. Ensures transitions to humans feel smooth and respectful. Handoffs should provide clear context to the human agent and maintain continuity so customers never feel dropped.

Why this role exists now. As AI becomes the primary interface, conversation design directly influences trust, brand perception, and operational outcomes. It’s a core competency for any program that teaches generative AI and LLMs to product managers.

Support automation specialist — builds the backend actions that allow AI to do real work. If the conversation designer shapes expression, this role shapes capability. They transform AI from an answering machine into an outcome engine by bridging AI and the systems it must safely and deterministically act on.

Support teams increasingly expect AI to do what a human would do: refund a charge, adjust a subscription, verify an identity, update an account setting, or pull relevant data. That expectation creates a new technical role at the edge of support, ops, and engineering.

What I rely on this specialist to deliver.

1. Creates and maintains backend workflows the AI executes. This includes building and maintaining Fin Tasks; Fin Procedures with embedded steps; action flows that call internal and external APIs; and automations that span billing systems, user identity layers, CRM objects, subscription entitlements, refund tools, and more. They ensure the AI can act compliantly and predictably: these are the playbooks that turn intent into action.

2. Owns the integrations required for advanced automation. Many problems require data elsewhere — billing platforms, internal databases, systems of record. The specialist ensures the AI can retrieve, validate, and use that information safely, often partnering closely on CRM integration and internal services.

3. Partners closely with product and engineering. Some workflows require new endpoints, permission layers, safety gates, or deterministic fallbacks. This role drives those changes across the stack.

4. Ensures reliability and safety at every step. Guardrails, validation logic, exception handling, safe execution paths — all are essential. They confirm that the AI has access to the correct data, the action matches policy, edge cases are accounted for, risky flows have deterministic constraints, and every action is auditable and reversible.
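
As a rough illustration of item 4, the sketch below wraps each action in a policy check, records every outcome in an audit log, and keeps an undo path. The `Action` and `AuditedRunner` names are hypothetical and are not part of any product mentioned here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    name: str
    execute: Callable[[], None]
    undo: Callable[[], None]          # every action ships with a way back
    policy_check: Callable[[], bool]  # deterministic gate, not a model call


@dataclass
class AuditedRunner:
    log: list = field(default_factory=list)
    _undo_stack: list = field(default_factory=list)

    def run(self, action: Action) -> bool:
        """Execute only if policy allows; record the outcome either way."""
        if not action.policy_check():
            self.log.append((action.name, "blocked_by_policy"))
            return False
        action.execute()
        self._undo_stack.append(action)
        self.log.append((action.name, "executed"))
        return True

    def rollback_last(self) -> None:
        """Reverse the most recent executed action and record it."""
        action = self._undo_stack.pop()
        action.undo()
        self.log.append((action.name, "reverted"))
```

In a real system the log would be written to durable storage and the policy check would consult the system of record; the in-memory versions here only show the shape.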

Why this role exists now. Customers don’t want answers; they want outcomes. AI can now deliver those outcomes, but only with the right backend scaffolding. This role modernizes operational architecture and unlocks end-to-end automation.

How these roles work together — the new operating loop. These roles aren’t silos; they’re interdependent parts of one system. The AI ops lead identifies patterns and performance gaps. The knowledge manager resolves inaccuracies or missing content. The conversation designer improves clarity, tone, and flow. The automation specialist expands the system’s ability to take action. Each improvement compounds the next, moving you from early automation to transformational resolution rates through continuous refinement.

This loop is what separates teams that plateau early from teams that scale AI into a reliable, high-performing system — the essence of a durable AI Strategy.

How to get started (even if you can’t hire all four roles today). Most teams phase into this model: assign partial ownership, formalize responsibilities, then specialize as AI volume grows. Here’s the progression I recommend.

Phase 1: Assign ownership. Give each role’s core responsibilities to someone who can devote five to ten hours weekly. Early on, support ops, enablement, senior ICs, and technically inclined teammates can anchor the work.

Phase 2: Formalize the responsibilities. As AI resolves more queries, optimization becomes core operational work. Formalizing ownership prevents performance drift and knowledge debt.

Phase 3: Specialize and hire. Once AI handles 50–70% of incoming volume, these responsibilities become full-time roles. Investing in specialization becomes essential infrastructure for the next scale stage.

The bottom line. AI changes the shape of your support team. These four roles — AI operations lead, knowledge manager, conversation designer, and support automation specialist — form the backbone of the AI-first support organization. They bring order to a constantly changing environment and enable AI to deliver the outcomes leaders and customers expect heading into 2026.

Next week, we’ll continue the 2026 planning series with a deep dive into org design models for AI-first support teams — how to structure people, workflows, and accountability in a world where AI resolves most conversations before a human ever sees them.

Inspired by this post on The Intercom Blog.

December 2, 2025

Unlock AI Product Roadmaps: Essential Tools Every PM Needs to Prioritize and Ship Faster

In my role leading product teams, the AI product roadmap isn’t just a plan—it’s the operating system for how we discover value, prioritize with rigor, and ship with confidence. The pace has changed, the stakes are higher, and the best product managers are now orchestrating AI capabilities, data, and customer insight in near-real time.

Master the evolving art of the AI product roadmap. Prioritize smarter, turn data into direction and insight into action, only much faster.

When I say “AI product roadmap,” I’m talking about a living system that blends strategy, discovery, and delivery. It’s less about dates and more about outcomes, risk reduction, and sequencing learning. In practice, that means combining AI Strategy with product roadmapping and sprint planning, then validating each bet with real customer signals.

For prioritization, I anchor on outcome-based rather than output-based OKRs and connect them to measurable signals across the funnel. Continuous discovery keeps insights flowing, while a unified approach to analytics and retention analysis tells me where the lift is. This lets me rank initiatives not just by impact and effort, but by how quickly we can learn, iterate, and compound value.

On discovery, product trios are non-negotiable. We prototype early with generative AI and LLMs to accelerate concept validation and reduce ambiguity. When customers can co-create through in-app guides or lightweight product tours, we turn vague needs into crisp problem statements and testable hypotheses far faster.

On delivery, I pair tight feedback loops with experimentation. A deliberate cadence of A/B testing and strong instrumentation ensures we’re learning every sprint, not just launching. The goal is to de-risk decisions quickly, keep momentum high, and translate signals into roadmap movement without thrash.

Under the hood, the AI stack matters. I rely on a retrieval-first pipeline to ground models in trusted data, and I’m intentional about privacy-by-design and data governance from day one. As agentic AI patterns emerge, I put evaluation workflows in place so we can ship confidently—and safely—without slowing down innovation.

Finally, alignment is the multiplier. Clear narrative roadmaps tied to customer outcomes help stakeholders see trade-offs, while crisp interfaces with go-to-market and CRM integration close the loop from roadmap to revenue. When everyone can trace a line from AI strategy to shipped value, prioritization becomes easier and trust grows.

If you’re feeling the acceleration, you’re not alone. With the right AI product toolbox—rooted in discovery, grounded in data, and delivered through tight feedback loops—you can move faster, learn smarter, and build products your customers can’t live without.

Inspired by this post on Product School.

December 1, 2025

How We Built an AI Sleep Coach: CBTI, Voice AI, and a Product Playbook for Better Rest

What if your morning started with a helpful check-in from a voice AI that actually improves your sleep—using the same core principles that typically cost thousands of dollars and come with year-and-a-half waitlists? That idea energizes me as a product leader, because it blends clinical-grade outcomes with consumer-grade accessibility. Recently, I dug into how the team at Rest built an AI sleep coach inspired by Cognitive Behavioral Therapy for Insomnia (CBTI), and why their method offers a repeatable blueprint for complex, personal AI products.

The origin story is a classic product discovery moment. Rest’s team noticed that a meaningful slice of users in their podcast app were using audio to fall asleep. Although it represented only about 10% of users, that group showed a high willingness to pay. That signal pushed them to explore a dedicated sleep solution, moving from a general audio app to a targeted sleep experience—and eventually toward an AI-powered coach as LLMs matured.

Through jobs-to-be-done research, they identified a clear, underserved segment: “DIY sleep hackers.” These are motivated users who want agency, structure, and results without navigating clinical systems. Choosing CBTI (a clinically proven approach with 80% efficacy) gave the product a strong evidence-based foundation while remaining accessible as a wellness tool. It’s the kind of strategic choice I look for: credible, measurable, and aligned with user motivation.

The product evolution moved in smart, incremental steps. Rest started with a basic text chatbot before graduating to a voice-first experience—using Vapi for voice and OpenAI for reasoning. Voice changed the relationship dynamic: it increased intimacy, lowered friction for daily check-ins, and made behavioral coaching feel human without pretending to be. The team built a memory system that tracks context (like traveling or having a dog) with time-based relevance, which keeps conversations fresh, respectful, and genuinely personalized.
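
A memory system with time-based relevance could look something like the sketch below. This is an assumption about the general shape, not Rest's implementation: the `Memory` type, the `relevant_for` window, and the example durations are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Memory:
    fact: str
    recorded_on: date
    relevant_for: Optional[timedelta] = None  # None means the fact does not expire


def active_memories(memories, today: date):
    """Return the facts that are still relevant on `today`."""
    return [
        m.fact
        for m in memories
        if m.relevant_for is None or today <= m.recorded_on + m.relevant_for
    ]


memories = [
    Memory("is traveling this week", date(2025, 1, 1), timedelta(days=7)),
    Memory("has a dog", date(2025, 1, 1)),
]
```

A short-lived fact such as travel drops out of context after its window, while a durable fact such as having a dog stays available to the coach.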

Daily engagement is driven by dynamic agendas that adapt based on sleep data, the user’s stage in the program, and their recent compliance. I love this mechanic: it operationalizes behavior change by sequencing the right intervention at the right time. In parallel, they developed text via OpenAI Assistants while building voice with Vapi, which let them ship value while learning in two modes. They also moved from massive system prompts to RAG for general sleep knowledge, keeping personal user context in the prompt—reducing brittleness while improving scalability.
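
The split described above, with general knowledge retrieved on demand and personal context kept in the prompt, can be sketched as follows. The retriever is a deliberately toy word-overlap ranker standing in for a real vector search, and the snippet texts, function names, and prompt layout are all hypothetical.

```python
def retrieve(query: str, knowledge: list, k: int = 2) -> list:
    """Toy retriever: rank snippets by lowercase words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge,
        key=lambda snippet: len(query_words & set(snippet.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(user_context: str, question: str, knowledge: list) -> str:
    """Personal context goes in directly; general knowledge is retrieved."""
    snippets = retrieve(question, knowledge, k=1)
    return "\n".join(
        [
            "User context: " + user_context,
            "Relevant knowledge: " + " ".join(snippets),
            "Question: " + question,
        ]
    )


KNOWLEDGE = [
    "Snippet about caffeine timing and sleep",
    "Snippet about keeping a consistent wake time",
]
```

Only the snippets relevant to the current question enter the prompt, which is what keeps the system prompt from growing with the knowledge base.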

Because sleep sits close to healthcare, the team drew a firm line between wellness and medical positioning. They implemented clear guardrails: no diagnosis, no medication advice, and strong boundaries on scope. Weekly error analyses with domain experts (sleep therapists) tightened quality and tone, and they adopted LLM-powered evals to enforce safety boundaries. For observability and evaluations, they leveraged Langfuse, and they experimented with Hamming for voice testing to refine the experience end-to-end.

Under the hood, this is a great example of “one bite of the apple at a time” product building in AI. Start with a simple interface, anchor on an evidence-based method, layer personalization with memory, formalize program structure with dynamic agendas, and shift to RAG when general knowledge outgrows prompt engineering. As a product leader, I see strong echoes of agentic patterns here—goal-oriented orchestration, stateful memory, and adaptive planning—shipped in pragmatic increments rather than as a monolithic platform rewrite.

A few takeaways I’m applying with my teams: First, segment deeply and pick a high-intent niche (those “DIY sleep hackers” were the right beachhead). Second, let modality fit the job—voice is not a gimmick when it boosts compliance and empathy. Third, design safety and scope from day one if you’re anywhere near health. Finally, invest early in evals and observability so you can improve with confidence, not hope.

If you want to explore the full conversation and product decisions, you can listen here: Spotify | Apple Podcasts.

Resources & Links:

Rest – AI sleep coach app

Vapi – Voice agent platform Rest uses

Langfuse – Observability and evals platform

Hamming – Voice testing platform

AI Evals Maven Course by Hamel Husain and Shreya Shankar

Bottom line: Rest demonstrates how to take a clinically grounded method like CBTI, translate it into a daily voice-first experience, and ship it with rigor. If you’re building in AI, this is a model worth studying—practical, safe, and deeply user-centered.

Inspired by this post on Product Talk.

November 20, 2025
Taming 1,000+ Vendor Emails: How Xelix’s AI Helpdesk Delivers Fast, Confident Answers

Chaos in vendor communications is a problem I see across finance operations: sprawling accounts payable inboxes, slow response times, and missed context. That’s why this build caught my attention—not just because it’s GenAI, but because it’s a disciplined product strategy that converts email overload into measurable outcomes.

Accounts payable inboxes can see 1,000+ vendor emails a day. Xelix’s new Helpdesk turns that chaos into structured tickets, enriched with ERP data, and pre-drafted replies—complete with confidence scores.

I dug into the end-to-end approach with the team (Claire Smid, AI Engineer; Emilija Gransaull, Back-End Tech Lead; and Talal A., Product Manager, all at Xelix), focusing on how they scoped the problem, iterated fast, and de-risked AI in production.

Their product thesis is refreshingly pragmatic. They prototyped with “daily slices” (Carpaccio-style) and built a retrieval-first pipeline that matches vendors, links invoices, and drafts accurate responses—before a human ever clicks “send.” That framing matters: enrichment and matching take center stage, with the model amplifying precision instead of improvising.
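To show what "retrieval-first" means in practice, here is a minimal sketch: match the sender to a vendor, link the invoice, and only then draft a reply from system-of-record data. Everything here is hypothetical, including the `VENDORS` and `INVOICES` tables, the 0.8 threshold, and the use of string similarity as the matcher; Xelix's actual pipeline is not described at this level of detail.

```python
from difflib import SequenceMatcher

# Hypothetical ERP extract used only for illustration.
VENDORS = {"V001": "Acme Industrial Supplies Ltd", "V002": "Northwind Traders"}
INVOICES = {"INV-1042": {"vendor_id": "V001", "status": "scheduled for payment"}}


def match_vendor(sender_name):
    """Return (vendor_id, score) for the closest vendor name by string similarity."""
    scored = [
        (vid, SequenceMatcher(None, sender_name.lower(), name.lower()).ratio())
        for vid, name in VENDORS.items()
    ]
    return max(scored, key=lambda pair: pair[1])


def draft_reply(sender_name, invoice_id, threshold=0.8):
    """Draft from ERP data only when the vendor and invoice both check out."""
    vendor_id, score = match_vendor(sender_name)
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["vendor_id"] != vendor_id or score < threshold:
        # Weak match or missing data: surface the score and hand off to a human.
        return {"draft": None, "confidence": round(score, 2), "needs_review": True}
    text = f"Invoice {invoice_id} is {invoice['status']}."
    return {"draft": text, "confidence": round(score, 2), "needs_review": False}


print(draft_reply("Acme Industrial Supplies", "INV-1042"))
print(draft_reply("Unknown Co", "INV-1042"))
```

Note that the draft text comes entirely from the matched record; in a fuller version a model would phrase the reply, but the facts would still come from retrieval.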

We unpacked the tricky bits that make or break an AI helpdesk at scale: vendor identity matching, Outlook threading, UX pivots from “inbox clone” to ticket-first views, and the metrics that prove real impact (handling time, stickiness, auto-closed spam). The pipeline architecture and email processing choices were grounded in operational realities, not just AI aspirations.

Several takeaways are worth pinning to any AI product roadmap. “Start narrow to win: pick high-volume, high-cost requests (invoice status & reminders).” “Enrichment > magic: accurate replies come from great retrieval/matching, not just a bigger LLM.” “Design for adoption: familiar inbox view helps onboarding, but a ticket-first UI unlocks AI features.” These are the kinds of decisions that drive adoption, trust, and ROI.

Data enrichment challenges dominated early learning curves: stitching ERP context into tickets, handling vendor identification at scale, managing email thread continuity, and calibrating response generation for accuracy. On the generation side, the team emphasized precision over verbosity, with clean responses that reflect system-of-record truth, then instrumented the experience with production telemetry so they could evaluate system performance.

Trust was treated as a product feature. “Measure outcomes, not vibes: track ‘messages sent from Helpdesk’, % auto-resolved.” And critically, “Confidence builds trust: show match quality and response confidence so humans know when to edit.” By surfacing match quality and confidence scores, they shortened coaching loops and made human-in-the-loop supervision feel natural, not burdensome.
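The two outcome metrics quoted above are simple to compute once tickets carry the right flags. A small sketch, assuming hypothetical `sent_from_helpdesk` and `auto_resolved` fields on each ticket record:

```python
def helpdesk_outcomes(tickets):
    """Count replies sent from the helpdesk and the share of auto-resolved tickets."""
    total = len(tickets)
    sent = sum(1 for t in tickets if t["sent_from_helpdesk"])
    auto = sum(1 for t in tickets if t["auto_resolved"])
    return {
        "messages_sent_from_helpdesk": sent,
        # Guard against an empty reporting window.
        "pct_auto_resolved": 100 * auto / total if total else 0.0,
    }


sample = [
    {"sent_from_helpdesk": True, "auto_resolved": False},
    {"sent_from_helpdesk": True, "auto_resolved": False},
    {"sent_from_helpdesk": False, "auto_resolved": True},
    {"sent_from_helpdesk": False, "auto_resolved": False},
]
print(helpdesk_outcomes(sample))
```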

What’s next is equally compelling: “targeted generation, multiple specialized responders, and more agentic routing.” That direction aligns with agentic AI patterns I recommend for operations-heavy workflows—route first, retrieve deeply, then generate with intent. It’s a scalable path from assistive AI to autonomous resolution while maintaining governance and auditability.

If you want a quick map of the journey, the conversation flowed as follows:

00:00 Meet the Team: Claire, Emilija, and Talal
00:36 Introduction to Xelix and Its Products
01:08 Understanding Accounts Payable Teams
01:37 Help Desk Product Overview
03:11 Challenges Faced by Accounts Payable Teams
04:03 AI Integration in Help Desk
05:47 Automating Reconciliation Requests
07:45 Development Methodology: Carpaccio
09:11 Prototyping and Beta Testing
12:00 Manual Tagging and Data Collection
16:39 Focusing on High-Impact Use Cases
18:55 User Experience and Interface Design
24:56 Pipeline Architecture and Email Processing
28:21 Data Enrichment Challenges
29:04 Handling Vendor Identification
33:33 Email Thread Management
36:15 Generating Accurate Responses
40:48 Evaluating System Performance
49:20 Future Developments and Goals

My takeaway for product leaders: when the domain is high-volume and rules-heavy (like AP), retrieval-first beats model-first. Start with the narrowest, costliest intents; prove lift with “messages sent from Helpdesk” and “% auto-resolved”; then graduate UX from familiar to AI-native (ticket-first) once trust is earned. That’s how you turn vendor chaos into answers—reliably, scalably, and fast.

Inspired by this post on Product Talk.

November 13, 2025
AI Won’t Replace Engineers—Engineers Using AI Will: A Practical Playbook for Your Next Move

Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.

I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.

Today’s generative AI assistants, such as Claude Code and ChatGPT connectors, excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.

What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.

Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, a stable or lower change failure rate, and faster MTTR. Pair that with outcome-focused OKRs (outcomes over output) to avoid gaming the system: shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
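If you do not already have these numbers, the four DORA metrics can be approximated from a deployment log. The sketch below assumes a hypothetical record shape (`commit_at`, `deployed_at`, `failed`, `restored_at`) and a non-empty list of deployments; real pipelines will differ.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_summary(deployments, window_days):
    """Summarize the four DORA metrics for a non-empty list of deployment records."""
    lead_times = [d["deployed_at"] - d["commit_at"] for d in deployments]
    failures = [d for d in deployments if d["failed"]]
    restore_times = [d["restored_at"] - d["deployed_at"] for d in failures]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": (
            sum(rt.total_seconds() for rt in restore_times) / len(restore_times) / 3600
            if restore_times
            else 0.0
        ),
    }


t0 = datetime(2025, 11, 1, 9, 0)
log = [
    {"commit_at": t0, "deployed_at": t0 + timedelta(hours=4),
     "failed": False, "restored_at": None},
    {"commit_at": t0 + timedelta(days=1), "deployed_at": t0 + timedelta(days=1, hours=8),
     "failed": True, "restored_at": t0 + timedelta(days=1, hours=10)},
]
print(dora_summary(log, window_days=2))
```

Run it on the weeks before and after an AI pilot and compare; the trend matters more than any single value.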

Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.

If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.

How to upskill fast: build an AI product toolbox and ship small. Prototype generative AI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Use CustomGPT workflows for internal enablement, and in-app guides or product tours to onboard users safely.
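Choosing an MDE up front tells you how much traffic a test needs. Here is a rough sketch using the standard normal-approximation formula for comparing two proportions; the function name and the default alpha and power are my assumptions, and a proper experimentation tool should be preferred for real decisions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users per variant to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Example: 20% baseline conversion, looking for a 2-point absolute lift.
print(sample_size_per_arm(baseline=0.20, mde=0.02))
```

With these inputs the answer lands at roughly 6,500 users per variant, and halving the MDE roughly quadruples the requirement, which is why the MDE should be agreed before the test starts.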

Here’s a pragmatic 90-day plan. Week 0–2: audit your top 10 engineering tasks by time spent; identify 3 that are ripe for AI augmentation. Week 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Week 7–10: productionize the wins; document runbooks; add incident management paths. Week 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.

AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.

Inspired by this post on Product School.

November 12, 2025
From Sketch to Clickable Demo: My AI Prototyping Playbook to Build Apps in Hours

I’ve spent much of my career compressing the distance between a napkin sketch and something real customers can touch. At HighLevel, my product teams use generative AI to validate ideas faster, reduce risk earlier, and win stakeholder trust with evidence instead of slides. The goal isn’t to be flashy—it’s to be precise, testable, and repeatable.

Today, you can build it before you pitch it. AI prototyping can turn ideas into clickable demos in hours. Here are some tools to try and steps to follow.

I start every AI prototyping sprint by sharpening the problem statement and the outcome we care about. That means being explicit about the target user, jobs-to-be-done, and the riskiest assumptions. I define a minimum detectable effect (MDE) and tie it to outcome-focused OKRs (outcomes over output) so everyone aligns on what “good” looks like before we touch a tool.

From there, I move from sketch to interface. I capture a rough flow (whiteboard, tablet, or even paper) and generate UI variations with my AI product toolbox: tools that translate structure into components and screens. I’ll iterate on information hierarchy and copy until the narrative supports the core job, borrowing techniques from UX writing. For product managers leaning into LLMs, this phase is about speed to feedback, not perfection.

Next, I wire data and logic. I connect a lightweight backend or spreadsheet, stitch in a CRM integration if needed, and add LLM calls through a ChatGPT connector or Claude Code. If the concept benefits from multi-step autonomy, I introduce agentic AI to orchestrate tasks across APIs. CustomGPT workflows help me encapsulate business rules so the demo behaves consistently in user paths we care about.

Governance is not optional at this stage. I apply privacy-by-design defaults, document data governance decisions, and run a quick AI risk management pass: input validation, prompt safety, rate limits, and fallback responses. This keeps the prototype credible and stops misleading results from skewing how stakeholders read it.
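Three of the items in that risk pass (input validation, rate limits, and fallback responses) fit in a thin wrapper around the model call. This is a sketch under my own assumptions: `MAX_INPUT_CHARS`, the `FALLBACK` text, and the `model` callable are hypothetical stand-ins for whatever LLM client the prototype uses.

```python
import time

MAX_INPUT_CHARS = 2000  # hypothetical limit for a prototype
FALLBACK = "Sorry, I can't help with that right now."


class RateLimiter:
    """Allow at most `limit` calls per `window` seconds (sliding window)."""

    def __init__(self, limit, window):
        self.limit, self.window, self.calls = limit, window, []

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True


def guarded_call(prompt, model, limiter, now=None):
    """Validate input, enforce the rate limit, and fall back on any model error."""
    if not prompt.strip() or len(prompt) > MAX_INPUT_CHARS:
        return FALLBACK
    if not limiter.allow(now):
        return FALLBACK
    try:
        return model(prompt)
    except Exception:
        return FALLBACK


limiter = RateLimiter(limit=1, window=60)
echo = lambda p: p.upper()  # stand-in for a real LLM call
print(guarded_call("hi", echo, limiter, now=0.0))
print(guarded_call("hi again", echo, limiter, now=1.0))
```

The second call is rejected by the limiter and returns the fallback, so a demo degrades gracefully instead of erroring in front of stakeholders.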

With a click-through in hand, I instrument the experience so learning compounds. I drop in Amplitude analytics to track activation, task completion, and drop-off, and set up simple A/B testing when there’s a meaningful design or copy choice. This makes the prototype a learning vehicle, not just a demo.

Then I get it in front of users—fast. Five targeted conversations will beat fifty internal opinions. I run structured product discovery interviews, observe time-to-value, and capture objections. This is where empowered product teams shine: we make changes in real time, re-run the flow, and document what moves the needle for product-led growth.

When speed matters, I use a four-hour cadence: Hour 1 for problem framing and MDE; Hour 2 for sketch-to-UI generation; Hour 3 for data wiring and AI logic; Hour 4 for instrumentation and user walkthroughs. By the end, we have a clickable demo, preliminary analytics, and a clear decision on whether to advance, pivot, or park.

Finally, I translate insights into a concise artifact: the hypothesis we tested, the signal we observed, the trade-offs we made, and the next steps for roadmapping and sprint planning. The point is not to be right on the first try; it’s to learn precisely, cheaply, and quickly enough to invest with conviction.

If you adopt this approach, you’ll find that stakeholder management becomes easier, team energy rises, and your roadmap earns credibility. Build it before you pitch it, and let real interactions—not wishful thinking—do the heavy lifting.

Inspired by this post on Product School.

November 10, 2025