Tag: Agent Analytics

How Deep AI Transforms Support Into Proactive, Omnichannel CX—No Extra Headcount Needed

For years, I chased the elusive goal of delivering a perfect customer experience. Today, with AI embedded in our support operations, that standard is finally within reach—and it’s reshaping how we prioritize, design, and scale service.

In “The 2026 Customer Service Transformation Report,” teams report early, tangible wins from AI: faster responses, higher efficiency, and consistent coverage across languages and time zones. Those gains create the capacity we’ve always needed. The more we push the technology, the more quality improvements we unlock.

This marks a fundamental shift. As AI takes on more, our focus can finally move from firefighting to crafting the customer experience. When the AI is working, the measure of success becomes how well it’s working—across accuracy, tone, resolution, and end-to-end journey quality.

I’ve seen this transformation firsthand. Mature AI deployment gives my team “breathing room,” so we can design for consistently excellent outcomes rather than obsess over deflection. That means widening access to support, removing friction on the path to resolution, and anticipating customer needs before they escalate.

In our own support organization, we opened support to trial customers, accelerated first response times, and added consultative sessions during onboarding. We absorbed a 300% increase in total demand without adding headcount—made possible by deep integration of an AI Agent and a disciplined AI strategy.

Figure: Teams with mature customer service deployments are nearly three times likelier to say they always meet increasing expectations (27% vs 9% at initial rollout).

Across the industry, the pattern is similar. When teams initially deploy AI, only 9% say they can always meet customer expectations. That number triples as teams reach a mature level of deployment. Even as expectations rise, the organizations that deeply integrate AI—complete with clear ownership, robust instrumentation, and continuous improvement loops—are the ones most likely to meet (and exceed) the bar.

Looking ahead to 2026, I expect omnichannel consistency to become a key differentiator. The data shows planned investment is distributed nearly equally across chat, email, and social messaging (36% each), closely followed by phone/voice (31%). The question is no longer “Which channel should we optimize?” but “How do we deliver a consistent, AI-powered experience everywhere our customers are?”

Teams that solve for omnichannel consistency will bridge the long-standing gap between what customers expect and what support can deliver. Every touchpoint becomes an opportunity to exceed expectations and build durable trust.

Consider Clay, a team that scaled support without sacrificing quality. Support is one of their main growth drivers, and as their customer base expanded, ticket volume surged. Early on, they concentrated much of their effort in Slack, cultivating close, transparent community relationships. But relying on a single channel created friction as they grew; customers wanted the flexibility of email and in-app chat, and Clay needed to deliver the same high standard everywhere.

Figure: Where AI investment is headed for customer service in 2026: chat, social, and email lead at 36%, with phone/voice close behind at 31%.

By unifying their support experience with an AI Agent, Clay brought consistency across channels. Today, AI is involved in 90% of all queries and handles half of Clay’s total volume, upwards of 7,000 queries a month. First response times improved significantly, freeing the team to focus on proactive, high-impact work.

That work includes identifying content gaps for education and content marketing, reaching customers before they need to ask for help, and surfacing feature requests and recurring challenges to product teams. Clay proves that when support is truly great, it becomes a competitive edge.

So how do you build a superior customer experience with an AI Agent? Here are five principles I use when scaling toward mature deployment.

1) Treat customer experience like a product. Treating support as a product means designing, building, and managing the support experience with the same rigor as your core product. You define goals (faster onboarding, higher CSAT or CX Score, lower churn). You map flows (AI starts the conversation, human handovers, proactive nudges). You instrument the journey (track handoffs, drop-offs, success states). You run tests and ship improvements (tone tweaks, fallback paths, training updates). You own the outcomes (gather feedback, measure performance, use insights to continuously improve the system).

2) Lead with AI, back with humans. AI isn’t replacing the human touch. It’s redefining when, where, and how it’s most valuable. In a scaled model, AI is the first responder and the end point for most conversations. Humans step in where they add the most value—particularly during high-stakes issues—and those handoffs should feel seamless. Meanwhile, your team focuses on improving AI performance and optimizing the end-to-end journey.

3) Be proactive. Use AI to anticipate needs, guide customers before problems arise, and nudge them toward successful outcomes. This is where customer support AI strategy shines—moving from reactive triage to journey orchestration that protects momentum and builds trust.

4) Build for trust. Many customers still carry the legacy of clunky chatbots that delivered vague answers and dead ends. You earn trust by showing that your system works. Don’t hide your AI Agent behind layers of “choose an option.” Get customers to the AI quickly, demonstrate real problem-solving, and ensure that when a human is needed, they join with full context to resolve complex issues efficiently.

5) Make it feel personal. Your AI Agent represents your brand. The way it speaks, follows policies, and responds matters. Use tone control, fallback logic, and language preferences to align the experience to your standards. Consistency builds trust; personality builds connection and loyalty.

Perfect really is possible. With deep AI implementation, you can scale comprehensive, fast, and personal support across channels—so customers feel supported not just when they reach out, but throughout their journey. That’s the promise of modern AI workflows in support, and it’s what will separate leaders from laggards in the years ahead.

Inspired by this post on The Intercom Blog.

February 20, 2026

Implementing AI Agents That Scale: My Playbook for One‑Person Departments with Amplitude

Over the past few years, I’ve led cross-functional teams to deploy agentic AI in production, and I’ve learned that success rarely hinges on the model alone. It comes from methodically designing the right workflows, instrumenting every step, and building a feedback loop that compounds. Learn how companies like Replit are consolidating workflows, creating one-person departments, and building systems for scale with Amplitude.

When I talk about AI agents, I’m describing software that behaves like a focused teammate—owning a clear job to be done end-to-end. In practice, that means consolidating fragmented tasks into a single accountable “one-person department,” then giving it the context, tools, and analytics to perform reliably. This is how agentic AI moves beyond demos into durable business impact.

I start with outcomes, not algorithms. I map a driver tree from business goals (e.g., lower response time, higher activation, better retention) to the specific moments an agent can influence. This outcome-first alignment keeps scope tight, informs guardrails, and grounds the value proposition in measurable change instead of vanity metrics.

Next, I define the workflow the agent will fully own. I look for high-volume, rules-adjacent processes—think lead qualification, support triage, or billing inquiries—where clear decision criteria already exist but human time is the bottleneck. I document triggers, inputs, decision points, and handoffs, then design the ideal-state flow the agent will run autonomously, with transparent escalation paths to humans.

On architecture, I favor a retrieval-first pipeline to keep responses accurate and current. I scope the knowledge base, implement context window management, and standardize tools the agent can call (search, CRM actions, ticket updates). For teams new to this, I coach “LLMs for product managers” fundamentals so we make sensible trade-offs between speed and reliability rather than chasing model-of-the-week headlines.

Instrumentation is where the system becomes self-improving. I use Amplitude analytics and an Agent Analytics schema to track intent detection, tool usage, resolution rate, time-to-resolution, deflection, and escalation causes. A unified analytics platform lets me connect agent outcomes to core product metrics—activation, retention, and conversion—so we can see the real revenue and experience impact, not just local efficiency gains.
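
To make the tracked quantities concrete, here is a minimal in-memory sketch of a per-conversation record and three of the measures named above. It is plain Python, not the Amplitude API; the class name, field names, intents, and sample values are all hypothetical and only illustrate the shape of the data.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class AgentConversation:
    """Hypothetical per-conversation record; field names are illustrative."""
    intent: str
    tools_used: tuple
    resolved: bool
    seconds_to_resolution: Optional[float] = None
    escalation_cause: Optional[str] = None


# Invented sample data, not real measurements.
conversations = [
    AgentConversation("billing_question", ("search",), True, 42.0),
    AgentConversation("billing_question", ("search", "crm_lookup"), True, 95.0),
    AgentConversation("refund_request", ("search",), False,
                      escalation_cause="needs_approval"),
    AgentConversation("login_issue", (), False,
                      escalation_cause="missing_context"),
]

resolved = [c for c in conversations if c.resolved]
resolution_rate = len(resolved) / len(conversations)
median_ttr = median(c.seconds_to_resolution for c in resolved)
escalation_causes = Counter(
    c.escalation_cause for c in conversations if c.escalation_cause
)

print(f"resolution rate: {resolution_rate:.0%}")
print(f"median time to resolution: {median_ttr}s")
print(escalation_causes.most_common())
```

Keeping the escalation cause as its own field, rather than burying it in free text, is what lets the breakdown be joined to product metrics later.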

To validate impact, I run A/B testing when traffic allows, setting a minimum detectable effect (MDE) upfront to avoid inconclusive reads. In lower-volume scenarios, I lean on eval-driven development: curated test sets for edge cases, scenario-based regression suites, and error taxonomies that accelerate iteration. Feature flags let us stage capabilities safely (shadow mode, assistive, autonomous) while we monitor deltas before full rollout.
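
As a rough illustration of why the MDE has to be fixed upfront, the sketch below estimates the per-arm sample size for a two-sided comparison of two rates using the standard normal-approximation formula. The baseline rate, MDE values, significance level, and power are placeholder assumptions, not figures from any real experiment.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for comparing two proportions.

    baseline: control rate (for example, current resolution rate), assumed.
    mde_abs:  minimum detectable effect as an absolute difference in rates.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (mde_abs ** 2)
    return math.ceil(n)


small_effect = sample_size_per_arm(baseline=0.60, mde_abs=0.02)
large_effect = sample_size_per_arm(baseline=0.60, mde_abs=0.05)
print(small_effect, large_effect)
```

The required sample grows roughly with the inverse square of the MDE, which is why low-volume workflows often cannot support a conclusive test for small effects and are better served by curated evaluation sets.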

Reliability and trust are designed in from the start. I apply AI risk management practices—privacy-by-design, data governance, and policy-aligned prompt templates—paired with observability to trace decisions. Clear escalation policies, incident management runbooks, and human-in-the-loop checkpoints ensure the agent fails safe, not silently.

Shipping cadence matters. I use CI/CD to increase deployment frequency, keep prompts and tools versioned, and gate risky changes with targeted rollouts. As patterns stabilize, we scale horizontally to new use cases, sharing core capabilities (retrieval, analytics, guardrails) as a platform. This is how “one-person departments” multiply without multiplying overhead.

Change management closes the loop. I partner with product trios and frontline teams to co-design prompts, set acceptance criteria, and define what “good” looks like in plain language. In-app guides and product tours introduce the agent’s role and limits, and structured feedback channels feed directly into our discovery and iteration rhythm.

The throughline of this playbook is simple: treat agents like real teammates with a job description, operating procedures, and performance reviews. With disciplined workflow design, a retrieval-first pipeline, and outcome-level instrumentation in Amplitude, agentic AI stops being a science project and starts compounding into durable product-led growth.

Inspired by this post on Amplitude – Perspectives.

February 18, 2026

An End-to-End AI Product Workflow From Discovery to Deployment

You have customer interviews, an AI prototype, and a launch request. What you may not have is a defensible chain connecting them. The prototype can look convincing while the team still disagrees about the customer problem, the acceptable failure rate, the limits of automation, and what should happen when the model or a connected tool fails.

A durable AI product workflow makes those decisions explicit. It connects customer evidence to a bounded opportunity, the opportunity to an interaction model, that model to an evaluation contract, and the contract to a guarded production release. You should be able to trace every automated action backward to a customer need and forward to a metric, an owner, and a recovery path.

Turn interviews into an opportunity map, not a feature request

AI products often go wrong before anyone writes a prompt. A customer describes a slow or frustrating task, someone proposes an assistant, and the proposed interface quietly becomes the problem definition. The team then tests whether it can build the assistant instead of whether solving that part of the workflow changes the customer’s outcome.

Start by defining the discovery boundary. Name the user, the workflow, the outcome the user is trying to reach, and the part of that outcome your product could reasonably influence. Keep interviews in the same outcome or product space when you synthesize them. A small batch of three interviews can be enough to produce a useful first draft, but it is not a universal saturation threshold or proof that you understand the market.

The sequence of synthesis matters. Analyze each interview on its own before looking for patterns across interviews. That preserves the situation, sequence, and meaning around each customer’s comments. If you combine transcripts immediately, repeated vocabulary can appear more important than the underlying context, and an unusual but consequential problem can disappear into the average.

Write the outcome anchor. State whose behavior or result should change. Avoid a feature-shaped outcome such as “increase use of the AI assistant.” A better outcome describes progress in the customer’s work.
Create one snapshot per interview. Capture the customer’s goal, the relevant sequence of events, key moments, obstacles, current workaround, and evidence supporting each inferred opportunity.
Separate observation from interpretation. Preserve what happened and what the customer said separately from the team’s explanation of why it happened. Label uncertainty instead of filling gaps with generated prose.
Synthesize across snapshots. Look for shared opportunities, meaningful differences, dependencies, and contradictions. Similar wording does not automatically mean the same need.
Organize opportunities before proposing solutions. Build an opportunity solution tree or equivalent map that connects the product outcome to customer opportunities. Keep solution ideas outside the opportunity labels.
Review the generated structure as a team. Ask what was merged incorrectly, what was missed, what lacks evidence, and which branch reflects a solution disguised as a need.
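
One way to keep observation and interpretation separate is to make the separation part of the snapshot's data structure. The sketch below is a hypothetical shape, not a prescribed format: the class names, fields, and the sample interview content are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Something the customer did or said, kept close to the source."""
    quote_or_event: str


@dataclass
class Interpretation:
    """The team's explanation; it must point at supporting observations."""
    claim: str
    evidence: list  # indexes into the snapshot's observations
    uncertain: bool = False


@dataclass
class InterviewSnapshot:
    customer_goal: str
    observations: list = field(default_factory=list)
    interpretations: list = field(default_factory=list)

    def unsupported_claims(self) -> list:
        """Claims with no evidence, or evidence that does not exist."""
        valid = range(len(self.observations))
        return [
            i.claim for i in self.interpretations
            if not i.evidence or any(e not in valid for e in i.evidence)
        ]


# Invented example content.
snapshot = InterviewSnapshot(
    customer_goal="Close the monthly books without manual re-keying",
    observations=[
        Observation("Exports invoices to a spreadsheet every Friday"),
        Observation("Said: 'I never trust the totals until I recheck them'"),
    ],
    interpretations=[
        Interpretation("Lacks confidence in synced totals", evidence=[1]),
        Interpretation("Wants an AI assistant", evidence=[], uncertain=True),
    ],
)
print(snapshot.unsupported_claims())
```

A claim that cannot cite an observation is a candidate for the team review, where it is either backed with evidence or removed from the map.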

AI is useful here as a first-pass analyst, not as an authority. It can extract moments, propose opportunity statements, and suggest a hierarchy. Human reviewers contribute product context, recognize important exceptions, and challenge confident-looking inferences. The strongest practical model is an AI-generated draft that the team refines.

Your exit gate for discovery is not a polished tree. It is agreement on a selected opportunity, the evidence behind it, the customer outcome it should influence, and the opportunities deliberately excluded from the current scope. If the team cannot explain those choices without mentioning a model or interface, it is not ready to prototype.

Choose assistance or autonomy before choosing the architecture

The next decision is not which model to use. It is what responsibility the product will accept. An LLM can generate or classify content. An agent wraps model behavior in a workflow that plans, uses tools, retains relevant state, and attempts to complete an outcome. That difference changes the customer promise, the evaluation plan, the permission model, and the consequences of failure.

Decision	Copilot	Agent
Best task shape	High-context work that benefits from judgment, nuance, or brand voice	Bounded, tool-heavy work with a verifiable completion state
Customer promise	Drafts, explains, recommends, or accelerates	Completes an agreed task within a defined scope
Human role	Reviews and commits the result	Sets policy, handles exceptions, and approves sensitive actions
Default permissions	Read, retrieve, and propose	Narrowly scoped tool access, including only the writes required for the task
Primary proof	Useful, grounded output that improves the user’s work	End-to-end task success without unacceptable actions or loops
Failure consequence	A poor suggestion reaches the reviewer	A poor decision can propagate into another system

When the task still depends on tacit knowledge or subjective review, start with a copilot. When it is bounded, tool-heavy, and objectively checkable, consider an agent. The safer product progression is to start assistive and grant autonomy only after success is measurable. Autonomy should be earned capability by capability, not declared at the product level.

You can make that progression concrete without redesigning the entire experience. Let the product draft first. Then let it recommend a plan and show the evidence behind the recommendation. Next, allow reversible actions through a narrow tool whitelist. Keep approval immediately before actions that affect customers, money, permissions, or durable data. Expand the scope only when production evidence supports the previous boundary.

Once the responsibility is clear, define the architecture around it:

Authoritative context: retrieve relevant product, account, policy, or workflow information before asking the model to decide. A retrieval-first pipeline reduces dependence on whatever happens to be encoded in model weights.
Explicit scope: state the role, allowed objectives, prohibited actions, and conditions that require escalation.
Controlled tools: expose only the operations needed for the selected job. Apply explicit limits (such as per-call amounts or call counts) and validate tool inputs outside the model.
Deliberate memory: separate temporary working state, durable customer facts, and governing policy. Do not treat the entire conversation history as an undifferentiated memory store.
Visible checkpoints: show the user what will happen, what data will be used, and which action requires approval.
Traceable execution: record retrieval results, model and prompt versions, tool calls, approvals, guardrail events, and final task status.
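
The controlled-tools and visible-checkpoint ideas can be sketched as a small authorization layer that runs in ordinary code, outside the model. Everything here is hypothetical: the tool names, the credit cap, and the rule that every write needs approval are assumptions chosen to illustrate the pattern, and a real product would set its own boundary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    writes: bool                      # True if the tool changes durable data
    validate: Callable[[dict], bool]  # input check run outside the model


# Hypothetical tools for a billing-inquiry agent.
ALLOWED_TOOLS = {
    "lookup_invoice": ToolSpec(
        "lookup_invoice", writes=False,
        validate=lambda a: isinstance(a.get("invoice_id"), str),
    ),
    "issue_credit": ToolSpec(
        "issue_credit", writes=True,
        validate=lambda a: isinstance(a.get("amount"), (int, float))
        and 0 < a["amount"] <= 50,  # illustrative cap, not a recommendation
    ),
}


def authorize(tool_name: str, args: dict, approved: bool = False) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed call."""
    spec = ALLOWED_TOOLS.get(tool_name)
    if spec is None or not spec.validate(args):
        return "deny"  # unknown tool or invalid input never reaches the tool
    if spec.writes and not approved:
        return "needs_approval"
    return "allow"


print(authorize("lookup_invoice", {"invoice_id": "INV-1"}))
print(authorize("issue_credit", {"amount": 20}))
print(authorize("delete_account", {}))
```

Because the check is deterministic, it can be unit tested like any other code, independently of model behavior.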

This architecture is more durable than a large prompt because each component has a distinct failure mode and owner. Retrieval can be evaluated for evidence quality. Tools can be tested deterministically. Policy can be reviewed independently. The model remains important, but it no longer carries responsibilities that ordinary software can enforce more reliably.

The exit gate is a written responsibility boundary. The team should be able to say what the product may read, what it may write, what it must never do, when a person intervenes, and how successful completion is verified. If any answer is “the model will decide,” the boundary is still incomplete.

Write the evaluation contract before optimizing the prompt

A compelling demo proves that a path can work. It does not establish how often it works, which inputs break it, whether its evidence is trustworthy, or whether it completes the customer’s job at an acceptable cost. Prompt iteration without an evaluation contract tends to optimize whatever the last reviewer noticed.

Write the contract in product language. For each target task, define the eligible input, the expected outcome, the evidence the product may use, allowed actions, prohibited outcomes, completion criteria, escalation conditions, and fallback. Add latency and cost limits chosen for your product economics. There is no universal threshold that makes an AI workflow production-ready; the important discipline is setting the threshold before seeing launch results.

Build the evaluation set from discovery evidence. Include representative customer inputs, important workflow variations, ambiguous cases, missing context, conflicting instructions, tool failures, and requests the product must refuse or escalate. Remove or protect sensitive data according to your governance rules. Every case should identify the acceptable outcome, not merely an ideal sentence, because multiple responses may solve the same job.
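
A minimal sketch of what one evaluation case and its grading might look like follows. The structure, outcome labels, and the latency and cost limits are assumptions for illustration; the point is that each case lists acceptable outcomes rather than one ideal sentence, and that limits are passed in as values fixed before launch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    acceptable_outcomes: frozenset   # any of these solves the job
    prohibited_outcomes: frozenset   # never acceptable


@dataclass(frozen=True)
class RunResult:
    case_id: str
    outcome: str
    latency_s: float
    cost_usd: float


def grade(case: EvalCase, result: RunResult,
          max_latency_s: float, max_cost_usd: float) -> str:
    """Grade one run against the contract."""
    if result.outcome in case.prohibited_outcomes:
        return "prohibited"
    if result.outcome not in case.acceptable_outcomes:
        return "fail"
    if result.latency_s > max_latency_s or result.cost_usd > max_cost_usd:
        return "over_budget"
    return "pass"


# Invented case: an ambiguous request the product should not act on blindly.
case = EvalCase(
    case_id="refund-ambiguous-01",
    input_text="I was charged twice, fix it",
    acceptable_outcomes=frozenset({"ask_which_charge", "escalate_to_human"}),
    prohibited_outcomes=frozenset({"refund_without_verification"}),
)
result = RunResult("refund-ambiguous-01", "ask_which_charge", 2.1, 0.004)
verdict = grade(case, result, max_latency_s=5.0, max_cost_usd=0.01)
print(verdict)
```

Prohibited outcomes are checked first so that a fast, cheap run that does the wrong thing is never reported as merely a failure.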

For copilots, measure the quality of assistance

Time to first token: how long the user waits before the response begins.
Response latency: how long the useful result takes to complete.
Groundedness: whether material claims are supported by the authoritative context supplied to the model.
User satisfaction: whether the assistance was useful in the actual workflow, not merely fluent.
Task impact: whether the user completes the selected job faster, with less effort, or with fewer corrections, using the outcome defined during discovery.

For agents, measure the whole execution

Task success rate: successfully completed eligible tasks divided by all eligible attempts. Define completion in the customer’s system of record where possible.
Steps per task: the number of model and tool steps required to finish. A rising count can expose inefficient planning or repeated work.
Tool error rate: failed, rejected, or malformed tool calls relative to attempted calls.
Loop detection: executions stopped because the agent repeated actions or failed to make progress.
Guardrail triggers: attempts blocked or redirected by policy. A trigger is diagnostic evidence, not automatically a success or a failure.
Human escalation: tasks handed to a person because the agent lacked permission, confidence, context, or a valid recovery path.
Cost per successful task: total execution cost divided by successful completions. Cost per request can hide expensive retries and failed runs.
Containment rate: eligible tasks completed within the automated workflow without human handling. Publish the eligibility and escalation rules with the metric so teams do not improve containment by narrowing the denominator invisibly.
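
The ratio definitions above can be sketched directly from per-run records. The record shape and the sample runs are invented for illustration; the formulas follow the definitions in the list, with cost per request included only to show how it can diverge from cost per successful task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskRun:
    eligible: bool      # met the published eligibility rules
    succeeded: bool     # completion verified in the system of record
    escalated: bool     # handed to a person at any point
    tool_calls: int
    tool_errors: int
    cost_usd: float


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of dividing by zero."""
    return numerator / denominator if denominator else None


def agent_metrics(runs: list) -> dict:
    eligible = [r for r in runs if r.eligible]
    successes = [r for r in eligible if r.succeeded]
    contained = [r for r in successes if not r.escalated]
    total_cost = sum(r.cost_usd for r in eligible)  # includes failed runs
    return {
        "task_success_rate": ratio(len(successes), len(eligible)),
        "tool_error_rate": ratio(sum(r.tool_errors for r in eligible),
                                 sum(r.tool_calls for r in eligible)),
        "containment_rate": ratio(len(contained), len(eligible)),
        "cost_per_request": ratio(total_cost, len(eligible)),
        "cost_per_successful_task": ratio(total_cost, len(successes)),
    }


# Invented sample runs; the last one is ineligible and excluded everywhere.
runs = [
    TaskRun(True, True, False, tool_calls=4, tool_errors=0, cost_usd=0.10),
    TaskRun(True, True, False, tool_calls=6, tool_errors=1, cost_usd=0.20),
    TaskRun(True, False, True, tool_calls=10, tool_errors=3, cost_usd=0.50),
    TaskRun(True, False, False, tool_calls=5, tool_errors=1, cost_usd=0.20),
    TaskRun(False, False, True, tool_calls=0, tool_errors=0, cost_usd=0.00),
]
metrics = agent_metrics(runs)
for name, value in metrics.items():
    print(name, value)
```

In this sample the failed runs are the expensive ones, so cost per successful task comes out at about twice the cost per request.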

These agent analytics complement rather than replace end-to-end task success. A fast response can still be wrong. A low tool error rate can coexist with a bad plan. High containment can be harmful if the agent completes the wrong task. Choose one outcome metric, pair it with quality and safety constraints, and retain the diagnostic metrics needed to find the cause of failure.

Route failures to the component that can fix them. Unsupported claims point first to retrieval and grounding. Correct plans with failed actions point to tool integration. Repeated steps point to orchestration or stopping logic. Frequent, legitimate escalations may mean the autonomy boundary is too broad. High model scores with low customer satisfaction should send the team back to the opportunity definition or user experience.

The exit gate is a versioned evaluation suite with release criteria, prohibited outcomes, an approved cost ceiling, and named escalation rules. Run it against every material change to the model, prompt, retrieval configuration, tool contract, or policy. Treat prompts and evaluation cases as product assets under version control, not as text pasted into a dashboard.

Release through gates and design the failure path

Deployment is where an AI capability becomes a product promise. The team now has to manage model variability, external tool behavior, changing knowledge, permissions, cost, and customer expectations at the same time. A launch plan that covers only the happy path is unfinished.

Put the capability behind a feature flag. Separate deployment from exposure so the team can stop new executions without waiting for a code release.
Open a gated beta around one bounded job. Limit the eligible users, tool permissions, data scope, and advertised promise. Make it clear whether the product recommends an action or performs it.
Use a canary for broader production traffic. Expand exposure gradually while comparing task success, guardrail events, tool errors, latency, escalation, and cost per successful task with the release criteria.
Change one material layer at a time when practical. Simultaneous changes to the model, prompt, retrieval index, tools, and policy make regressions difficult to attribute.
Expand only after the previous boundary is stable. More users, more tools, and more autonomy are separate risk decisions. Do not bundle them into one rollout.
Keep rollback and fallback distinct. Rollback restores a known model, prompt, policy, or tool version. Fallback gives the customer a safe alternative when the AI path is unavailable.
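
A hedged sketch of two of these controls follows: a feature flag that doubles as a kill switch for a percentage-based canary, and a gate that allows expansion only when every release criterion is met. The function names, metric names, and thresholds are hypothetical; a real system would typically use a flag service rather than hand-rolled bucketing.

```python
import hashlib


def in_canary(user_id: str, percent: int, flag_enabled: bool) -> bool:
    """Deterministically bucket a user; the flag is the kill switch."""
    if not flag_enabled:
        return False
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def may_expand(observed: dict, criteria: dict) -> bool:
    """Expand exposure only if every release criterion is met.

    criteria maps a metric name to ("min" or "max", threshold).
    """
    for metric, (kind, threshold) in criteria.items():
        value = observed.get(metric)
        if value is None:
            return False  # a missing metric blocks expansion
        if kind == "min" and value < threshold:
            return False
        if kind == "max" and value > threshold:
            return False
    return True


# Placeholder thresholds and observations, not recommendations.
criteria = {
    "task_success_rate": ("min", 0.90),
    "tool_error_rate": ("max", 0.05),
    "cost_per_successful_task": ("max", 0.40),
}
observed = {
    "task_success_rate": 0.93,
    "tool_error_rate": 0.07,
    "cost_per_successful_task": 0.31,
}
print(may_expand(observed, criteria))
```

Hash-based bucketing keeps a given user in the same cohort across sessions, which makes before-and-after comparisons easier to interpret.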

Feature flags, gated betas, canary rollouts, incident paths, and rehearsed fallbacks are ordinary operational controls, but they carry unusual weight in AI products because model and tool behavior can drift independently of an application release.

Design specific degraded states before launch:

Model unavailable: preserve the user’s work, explain that automation is unavailable, and offer the established manual path.
Retrieval unavailable or evidence missing: do not silently generate an ungrounded answer. Ask for the missing context, provide a limited response, or escalate.
Write tool fails: stop, report the actual system state, and reconcile before retrying. Blind retries can duplicate durable actions.
Execution stops making progress: terminate the loop at the configured limit and hand over the trace rather than consuming resources indefinitely.
Policy or permission check fails: block the action, preserve the audit record, and route the user to an authorized path.
Tool behavior changes: disable the affected capability until its contract and evaluation cases pass again.
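
The "execution stops making progress" state can be sketched as a guard around the agent loop. This is a simplified illustration: `next_action` is a hypothetical callable standing in for the planner, and the step and repeat limits are arbitrary defaults.

```python
def run_with_loop_guard(next_action, max_steps: int = 8, max_repeats: int = 2):
    """Run an agent loop until done, a step limit, or repeated actions.

    next_action is a hypothetical callable: it receives the trace so far and
    returns an action string, or None when the task is complete.
    """
    trace = []
    for _ in range(max_steps):
        action = next_action(trace)
        if action is None:
            return "completed", trace
        if trace.count(action) >= max_repeats:
            return "stopped_loop", trace  # hand over the trace, do not retry
        trace.append(action)
    return "stopped_step_limit", trace


# A planner that finishes, and one that keeps repeating the same search.
plan = ["lookup_invoice", "draft_reply", None]
status_ok, trace_ok = run_with_loop_guard(lambda t: plan[len(t)])
status_stuck, trace_stuck = run_with_loop_guard(lambda t: "search:refund policy")
print(status_ok, trace_ok)
print(status_stuck, trace_stuck)
```

Returning the trace with the stop status gives the person who picks up the task the evidence of what was already attempted.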

Privacy and auditability belong in the release gate, not in a later compliance review. Document what customer data enters prompts, retrieval, memory, and logs; who can access it; how long each class is retained; and how deletion propagates. For actions affecting customers, money, permissions, or durable data, preserve enough detail to reconstruct the input, retrieved evidence, model and prompt version, tool parameters, approval, guardrail result, and final system state.
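
The reconstruction requirement can be expressed as a record type plus a completeness check. The field names below mirror the items listed in the paragraph, but the structure itself, the identifiers, and the sample values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    task_id: str
    input_ref: str            # pointer to the stored input, not the raw text
    evidence_ids: tuple       # retrieved documents the decision relied on
    model_version: str
    prompt_version: str
    tool_name: str
    tool_params: dict
    approved_by: Optional[str]
    guardrail_result: str
    final_state: str


REQUIRED = ("task_id", "input_ref", "evidence_ids", "model_version",
            "prompt_version", "tool_name", "guardrail_result", "final_state")


def missing_for_reconstruction(record: AuditRecord, high_impact: bool) -> list:
    """List the fields that would block reconstructing this action later."""
    missing = [name for name in REQUIRED if not getattr(record, name)]
    if high_impact and not record.approved_by:
        missing.append("approved_by")
    return missing


# Invented example values.
record = AuditRecord(
    task_id="task-001", input_ref="inputs/task-001", evidence_ids=("policy-7",),
    model_version="model-a", prompt_version="prompt-12",
    tool_name="issue_credit", tool_params={"amount": 20},
    approved_by=None, guardrail_result="passed", final_state="credit_issued",
)
log_line = json.dumps(asdict(record), sort_keys=True)
print(missing_for_reconstruction(record, high_impact=True))
```

Storing a reference to the input instead of the raw text is one way to keep the audit trail useful while letting retention and deletion rules apply to the underlying data.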

The operating stack also needs an ownership decision. Build the workflow logic, data model, and user experience that encode your differentiated value. Consider buying undifferentiated capabilities such as observability, prompt versioning, red-team infrastructure, and policy enforcement when an external component meets your control and governance needs. This build-versus-buy boundary keeps product attention on the parts customers actually choose you for without treating commodity infrastructure as strategically unique.

The production exit gate should require a visible scope statement, passing evaluations, a feature flag, a rollback target, a customer-safe fallback, usable audit traces, an incident owner, and a tested escalation route. If the team cannot explain what the customer sees during failure, it has not finished designing the feature.

Keep discovery, evaluation, and production in one learning loop

Once the product is live, production behavior becomes new discovery input. That does not mean replacing customer conversations with dashboards. Metrics show where the workflow breaks; customer evidence explains what the break means and whether fixing it matters.

Review failures against the original opportunity map. Concentrated escalation around one scenario may reveal an opportunity that was hidden during initial synthesis. High groundedness with low satisfaction may indicate that the product answered accurately but tackled the wrong job. A growing step count may expose orchestration waste, while a rising tool error rate points to integration reliability. If cost per successful task increases, inspect failure and retry paths before making the model cheaper; optimizing unit cost cannot rescue an unsuccessful workflow.

Every meaningful production failure should produce at least one durable change: a corrected opportunity assumption, a new evaluation case, a narrower permission, a tool-contract test, a policy update, a clearer interaction, or a revised fallback. That is how customer discovery and operational learning remain connected instead of becoming separate product and engineering rituals.

Key takeaways

Synthesize each customer interview separately before looking across interviews, then review the AI-generated opportunity structure with human judgment.
Select a customer opportunity before selecting the AI interface. A fluent prototype is not evidence that the underlying job matters.
Use a copilot for judgment-heavy work and consider an agent only for bounded, tool-heavy tasks with verifiable completion.
Define task success, prohibited outcomes, escalation, cost, and fallback before optimizing prompts or choosing a model.
Measure copilots as assistance and agents as end-to-end execution. Do not mistake latency, containment, or tool-call success for customer success.
Release behind flags, expand through gated exposure, and rehearse rollback, fallback, and incident paths before granting more autonomy.

At your next AI product review, ask to see the outcome and opportunity map, the responsibility boundary, the evaluation contract, and the rollout and recovery plan. If one is missing, pause the launch decision at that handoff. Closing that gap is usually more valuable than adding another prompt, tool, or autonomous step.

February 18, 2026

How to Scale Trustworthy Enterprise Analytics With AI Agents

Your analytics agent can turn a question into a chart. Then a product leader asks which activation definition it used, an analyst gets a different cohort result, or security discovers that the agent queried data the user could not normally access. That is where a promising pilot becomes an enterprise risk.

The way through is not a better chat interface. You need a controlled path from question to decision: approved definitions, bounded tools, task-level evaluations, visible evidence, and permissions that expand only after the agent proves it can handle a specific workflow reliably.

Define trust as an executable contract

A trustworthy answer is more than a plausible explanation. It is the output of a reproducible analytical process. The enterprise bar includes consistent metric definitions, privacy-by-design, role-based access control, audit trails, low-latency support, and repeatable results. If any link in that chain is implicit, the agent can be eloquent and still be unsafe.

Before you give an agent a task, define its contract. The contract should answer five questions:

What decision is being supported? A request to explain a funnel is different from a request to change the funnel definition or publish a recommendation.
Which definitions are authoritative? Identify the canonical metric, its version, the population, the unit of analysis, the time window, and any approved exclusions.
What may the agent access and do? Specify datasets, fields, tools, credentials, and whether the task is read-only, produces a draft, or can trigger an action.
What evidence must accompany the answer? Require the metric identifier, query or tool calls, filters, lineage, assumptions, and enough result detail for an analyst to reproduce the work.
When must the agent stop? Define the ambiguities, policy conflicts, statistical gaps, and high-consequence actions that require clarification or approval.
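
The five questions can be captured as a small contract object that the agent checks before doing any work. This is a sketch under stated assumptions: the field names, dataset names, and stop conditions are hypothetical, and a production contract would carry more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    decision: str                 # what decision the analysis supports
    metric_id: str                # canonical metric identifier
    metric_version: str
    allowed_datasets: frozenset
    mode: str                     # "read_only", "draft", or "action"
    required_evidence: frozenset  # what must accompany the answer
    stop_conditions: frozenset    # ambiguities that require clarification


def next_step(contract: TaskContract, requested_dataset: str,
              open_ambiguities: frozenset) -> str:
    """Decide whether the agent may proceed with a request."""
    if requested_dataset not in contract.allowed_datasets:
        return "refuse"
    if open_ambiguities & contract.stop_conditions:
        return "ask_for_clarification"
    return "proceed"


# Invented contract for the activation question.
contract = TaskContract(
    decision="Explain whether activation declined for new accounts",
    metric_id="activation",
    metric_version="v3",
    allowed_datasets=frozenset({"product_events"}),
    mode="read_only",
    required_evidence=frozenset({"metric_id", "query", "filters"}),
    stop_conditions=frozenset({"time_zone", "cohort_entry_rule"}),
)
print(next_step(contract, "product_events", frozenset({"time_zone"})))
```

Access is checked before ambiguity so that a clarifying question never reveals anything about data the user cannot reach.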

Consider a seemingly simple question: Did activation decline for new accounts? The answer depends on the approved activation event or event sequence, cohort entry rule, identity resolution, time zone, date range, and exclusions. If the agent silently supplies one of those details, it has made a product decision while pretending to perform analysis.

The safe behavior is straightforward. The agent should retrieve the approved definition, display the material assumptions, and ask for clarification when the remaining ambiguity could change the result. It should not create a new activation definition in the course of answering the question. Changes to definitions belong in a governed workflow with an owner, review, version history, and rollback path.

This distinction also gives you a better definition of accuracy. An answer fails if it uses the wrong metric, violates an access rule, omits a material assumption, or cannot be reproduced, even when the final number happens to be correct. Trust is a property of the whole execution path, not only the sentence shown to the user.
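
That definition of accuracy can be expressed as a check on the execution trace rather than on the final number. The following is a minimal sketch under stated assumptions: `AnswerTrace` and its fields are hypothetical, and a real trace would carry far more detail.

```python
from dataclasses import dataclass


@dataclass
class AnswerTrace:
    """Hypothetical record of how an answer was produced."""
    metric_id: str
    access_allowed: bool
    assumptions_shown: bool
    query: str  # an empty string means the work cannot be re-run


def failures(trace: AnswerTrace, canonical_metric_id: str) -> list[str]:
    """List every reason the answer fails, regardless of the final number."""
    problems = []
    if trace.metric_id != canonical_metric_id:
        problems.append("wrong metric")
    if not trace.access_allowed:
        problems.append("access rule violated")
    if not trace.assumptions_shown:
        problems.append("material assumption omitted")
    if not trace.query:
        problems.append("not reproducible")
    return problems


# The number may be right, but no query was recorded, so the answer still fails.
trace = AnswerTrace(metric_id="activation_rate_v3", access_allowed=True,
                    assumptions_shown=True, query="")
print(failures(trace, "activation_rate_v3"))  # ['not reproducible']
```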

Move through four levels of autonomy one task at a time

Teams often treat agent maturity as a platform-wide label. That hides risk. The same system may be mature enough to draft a funnel but not mature enough to interpret an under-specified experiment. Assign maturity to each task, dataset, and action instead.

Level	Agent role	Evidence required before moving forward
L0: Conversational interface	Summarizes charts or reports that already exist.	The agent accurately identifies the selected artifact, preserves its filters and caveats, and does not imply that it performed new analysis.
L1: Grounded retrieval	Retrieves definitions and context from the analytics catalog, taxonomy, or metric store before answering.	Canonical definitions are consistently selected, citations and assumptions are visible, and retrieval respects the requesting user’s permissions.
L2: Governed tool use	Reads schemas, generates safe SQL, calls approved tools, and reconciles results against canonical definitions.	Representative tasks pass golden-data and regression evaluations; queries, tool calls, lineage, errors, latency, and cost are observable.
L3: Bounded autonomous workflow	Completes an end-to-end workflow with approval gates, audit logs, feature flags, and rollback controls.	The exact workflow has a stable evaluation history, clear ownership, tested failure handling, and a reversible execution path.

L0 can still be useful. It reduces navigation work and helps a user understand an existing dashboard. The mistake is presenting that convenience as autonomous analytics. L1 improves trust by grounding language in the organization’s own definitions, but retrieval alone does not prove that a newly calculated result is correct.

L2 is the consequential transition. The agent is no longer explaining an approved artifact; it is producing analytical work. Schema awareness, safe SQL, result reconciliation, and complete traces become release requirements rather than optional diagnostics.

L3 should describe a narrow, governed workflow, not a general promise that the agent can handle anything. For example, an agent might autonomously refresh an approved weekly retention analysis while still requiring an analyst to approve a new cohort definition. Broaden the task boundary only after the additional behavior has its own tests and controls.
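
Assigning maturity per task can be as simple as a lookup that defaults to the most restrictive level. The sketch below assumes a hypothetical in-memory registry and made-up task names; in practice the registry would live wherever approvals are governed.

```python
from enum import IntEnum


class Level(IntEnum):
    L0_CONVERSATIONAL = 0
    L1_GROUNDED_RETRIEVAL = 1
    L2_GOVERNED_TOOL_USE = 2
    L3_BOUNDED_WORKFLOW = 3


# Hypothetical registry: maturity is recorded per task, not per platform.
APPROVED_LEVEL = {
    "refresh_weekly_retention": Level.L3_BOUNDED_WORKFLOW,
    "draft_funnel": Level.L2_GOVERNED_TOOL_USE,
    "define_new_cohort": Level.L1_GROUNDED_RETRIEVAL,
}


def may_run(task: str, required: Level) -> bool:
    """Allow a task only if it has been approved at the required level.

    Unknown tasks default to L0, the most restrictive level.
    """
    return APPROVED_LEVEL.get(task, Level.L0_CONVERSATIONAL) >= required


print(may_run("refresh_weekly_retention", Level.L3_BOUNDED_WORKFLOW))  # True
print(may_run("define_new_cohort", Level.L3_BOUNDED_WORKFLOW))         # False
```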

The capabilities that justify early investment are rapid exploration, schema-grounded SQL generation, experiment summarization, and conversion of natural-language questions into charts. Ambiguous metric semantics and under-specified experiment designs remain poor candidates for unreviewed autonomy. Use the agent to compress the mechanical work, but keep unresolved organizational judgment visible.

Build evaluations around the work people actually do

A generic chatbot benchmark will not tell you whether an agent can support your product decisions. Your evaluation unit should be a complete analytics task performed under your definitions, schemas, policies, and edge cases.

Start with the ten high-frequency analytics tasks that matter most in your environment. Do not select only the cleanest demonstrations. Include work that is frequent, consequential, and likely to expose semantic or governance failures.


February 17, 2026

Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.

When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
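
To make the planner, executor, verifier, and orchestrator roles tangible, here is a toy sketch in plain Python. Everything in it is a stand-in: the planner just splits a string, the verifier only rejects empty results, and the function names are assumptions rather than any framework's API.

```python
from typing import Callable


def plan(goal: str) -> list[str]:
    """Planner stand-in: split a goal into steps on ' then '."""
    return [step.strip() for step in goal.split(" then ")]


def verify(result: str) -> bool:
    """Verifier stand-in: reject empty results."""
    return bool(result.strip())


def orchestrate(goal: str, execute: Callable[[str], str], max_retries: int = 2) -> list[str]:
    """Run each planned step, retrying when the verifier rejects the result."""
    results = []
    for step in plan(goal):
        for _ in range(max_retries + 1):
            result = execute(step)
            if verify(result):
                results.append(result)
                break
        else:
            # Every attempt was rejected, so surface the failure instead of guessing.
            raise RuntimeError(f"step failed verification: {step!r}")
    return results


# Hypothetical executor that fails once, then succeeds.
calls = {"n": 0}


def flaky_executor(step: str) -> str:
    calls["n"] += 1
    return "" if calls["n"] == 1 else f"done: {step}"


print(orchestrate("look up order then draft reply", flaky_executor))
# ['done: look up order', 'done: draft reply']
```

The retry lives in the orchestrator, not in the executor, so each agent keeps a single responsibility.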

To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

Governance is equally critical. I design privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews—habits borrowed from SRE culture that map well to AI systems.

On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.

Inspired by this post on Product School.

February 16, 2026
Deeper AI Integration, Clearer ROI: How Mature Deployments Redefine Support Economics

Over the last year, I’ve had the same conversation with a lot of support leaders.

They’ve deployed AI and are seeing initial efficiency gains, but want to push beyond these early results and achieve meaningful transformation.

When AI is first introduced, the gains show up quickly. Teams resolve higher volumes of queries, free up capacity, and deliver faster responses. But the real opportunity for impact extends well beyond those initial wins. As AI becomes more deeply integrated into support operations, taking on harder, more complex work, those results compound, new ways to create and measure value open up, and the economics of support change entirely. That shift is where I spend most of my time with leaders—turning early efficiency into durable business value.

This sits at the heart of “The 2026 Customer Service Transformation Report.” In this reflection, I explore how deeper integration compounds impact and why that makes business value easier to articulate across the organization—especially to finance and product peers who need to see outcomes, not just output.

The teams going deeper are seeing higher returns. The research shows that 62% of support teams have seen their customer service metrics improve since implementing AI, with early wins showing up most clearly in speed and efficiency. But for teams that have reached mature deployment (where AI is fully integrated into operations), that number jumps to 87%.

Figure: Share of teams that can measure ROI on AI, rising from 35% at the exploring stage to 70% at mature deployment.

The same pattern holds for the ability to measure ROI. Among teams in early exploration, just 35% say they can measure their return on AI investment, but for teams at the mature deployment stage, that rises to 70%. In my experience, this is the moment the conversation shifts from “is AI working?” to “how much leverage are we creating?”

As AI becomes more embedded in support workflows, what teams choose to measure starts to change. In the early stages of deployment, ROI is typically understood through improved customer response times, lower cost to serve, and freeing up capacity. Teams focus on how much time AI creates and whether it’s relieving pressure on the support organization. These signals help validate that the system is working, but they say little about how that capacity is ultimately used.

As deployments mature, measurement starts to reflect a different intent. Instead of stopping at time saved, teams look at where that capacity is reinvested—into higher value customer work and revenue-generating activities. ROI becomes less about relief and more about leverage. I encourage teams to set targets for capacity redeployment and tie them directly to activation, retention, and expansion outcomes.

The report data shows this clearly. Across all maturity stages, the most commonly cited measure of ROI is "time freed up that the support team can use to focus on value-adding activities for customers." But at mature deployment, that signal intensifies, with 73% of teams citing it, compared to 56% at early exploration.

Figure: Mature deployments versus initial rollouts. Time freed for value-adding customer work: 73% vs 59%. Hours redirected to revenue-generating tasks: 56% vs 34%.

What’s also interesting is that 56% of mature teams say freed capacity is being directed toward revenue-generating activities, up from 34% at initial deployment. That’s a powerful indicator that AI is shifting from a cost narrative to a growth narrative.

The result is a shift in economic intent: from measuring what AI saves to demonstrating how the capacity it creates is reinvested to drive growth. As a product leader, I anchor this conversation in outcome-based metrics and clear counterfactuals: what would it have cost to deliver the same experience without AI?

As AI takes on more work, the question moves from “does it save money?” to “how does it change the economics of support?” Legacy support economics were built for linear growth: more customer tickets meant more headcount, more outsourcing, and more software costs. Success was measured through containment—the number of queries that didn’t reach human agents. These models worked when volume and effort were tightly linked, but AI doesn’t scale linearly, and it needs to be evaluated differently.

To sustain AI investment and expand its impact, teams need to move beyond cost-cutting narratives and build a clearer case for business value. When done right, AI goes far beyond improving support efficiency. It rewires the financial model, breaking the link between support costs and revenue growth, and turning support into a contributor to customer activation, retention, and lifetime value. This means treating your AI Agent as a new workforce capability that changes how your support function creates and captures value. Here’s what value looks like in an AI-first model:

Figure: Support volume rising while team size plateaus as AI integration deepens.

Human productivity: Your team focuses on more strategic areas, not the queue.

System improvement: Every resolved query makes the system smarter.

Revenue influence: Support becomes a lever for activation, retention, and growth.

Organizational agility: You scale service without scaling headcount.


How does this look in practice? Intercom offers a compelling example with Fin. What started as a focused effort to improve their customer support experience has become one of the clearest illustrations of what happens when AI is fully embraced across an organization.

Since 2022, Fin has helped Intercom absorb more than a 300% increase in customer demand while improving the consistency of delivery—including supporting new routes into support for trial customers and website visitors. Today, Fin is involved in 97% of their customers' conversations. Of those, it resolves 83.5% end-to-end, putting their overall automation rate at 81%.
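
The three figures are consistent if the overall automation rate is read as the product of the involvement rate and the end-to-end resolution rate. That reading is my assumption; the quick check below simply confirms the arithmetic.

```python
involvement_rate = 0.97   # share of conversations Fin is involved in
resolution_rate = 0.835   # share of those conversations resolved end-to-end

# Assumed relationship: overall automation = involvement x resolution.
automation_rate = involvement_rate * resolution_rate
print(f"{automation_rate:.0%}")  # 81%
```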

That depth of deployment allowed Intercom to scale service without scaling headcount. Without Fin, they would have needed at least 100 additional support teammates to meet rising demand and service standards.

As Fin took on the majority of day-to-day volume, the human support team shifted toward consultative work—helping customers adopt Fin more deeply, succeed faster, and unlock more value from the platform. Intercom now tracks metrics like “direct revenue generated” and “expansion revenue influenced” to understand the impact of these consultative support activities. This repositioned support from a cost center to an active contributor to long-term growth.

The throughline from The 2026 Customer Service Transformation Report is that deployment depth makes a significant difference. Teams that are investing in deeply integrating AI are reshaping how support scales and contributes to growth. Value becomes clearer as AI takes on more work, and support leaders can articulate that value to the rest of the business.

The gap between these teams and those still in the early stages is widening. A select group of pioneers is setting a new bar for what AI-powered customer service can deliver, and understanding what they’re doing differently is the first step toward closing that gap. If you want to dive deeper into the data and frameworks, you can download the report here: https://www.intercom.com/customer-transformation-report?utm_source=blog&utm_medium=internal&utm_campaign=20260128-report-owned-2026cstransformationreport&utm_content=chapterseries_2

Inspired by this post on The Intercom Blog.

February 13, 2026
AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcome-based OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.
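
A threshold gate of this kind can be very small. The sketch below is illustrative only: the metric names and limits are hypothetical placeholders, and real thresholds should come from your own baselines.

```python
# Hypothetical thresholds: "min" metrics must reach the limit, "max" metrics must stay under it.
THRESHOLDS = {
    "accuracy": ("min", 0.90),
    "helpfulness": ("min", 0.80),
    "p95_latency_s": ("max", 4.0),
    "cost_per_task_usd": ("max", 0.05),
}


def gate(metrics: dict[str, float]) -> list[str]:
    """Return the metrics that miss their threshold. An empty list means promote."""
    misses = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            misses.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            misses.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            misses.append(f"{name}: {value} > {limit}")
    return misses


run = {"accuracy": 0.85, "helpfulness": 0.88, "p95_latency_s": 3.2, "cost_per_task_usd": 0.03}
print(gate(run))  # ['accuracy: 0.85 < 0.9']
```

Treating a missing metric as a miss keeps an incomplete evaluation run from passing by default.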

Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics and deployment frequency to ensure we’re improving the engine, not just the output.
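
Canary cohorts are often implemented with deterministic hashing so the same user stays in the same cohort across sessions. This is one common approach, not necessarily how any specific feature-flag product works; the function name and flag name below are assumptions.

```python
import hashlib


def in_canary(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


# Widening the rollout from 5% to 50% keeps every user who was already included.
users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if in_canary(u, "agent-v2", 5)}
at_50 = {u for u in users if in_canary(u, "agent-v2", 50)}
print(at_5 <= at_50)  # True
```

Rolling back is then a matter of setting `percent` to 0, which excludes everyone.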

Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool whitelists, separate read vs write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.
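
These constraints can be sketched as a guard that inspects each proposed step before it runs. The tool names, the step limit, and the confidence threshold here are hypothetical values chosen for illustration.

```python
ALLOWED_TOOLS = {"lookup_order", "search_docs"}  # hypothetical read-only tools
MAX_STEPS = 5
MIN_CONFIDENCE = 0.7


def run(proposed_steps: list[tuple[str, float]]) -> str:
    """Walk proposed (tool, confidence) steps and stop at the first violation."""
    for count, (tool, confidence) in enumerate(proposed_steps, start=1):
        if count > MAX_STEPS:
            return "handoff: step limit reached"
        if tool not in ALLOWED_TOOLS:
            return f"handoff: tool not allowed ({tool})"
        if confidence < MIN_CONFIDENCE:
            return "handoff: low confidence"
    return "completed"


print(run([("search_docs", 0.9), ("lookup_order", 0.8)]))   # completed
print(run([("search_docs", 0.9), ("issue_refund", 0.95)]))  # handoff: tool not allowed (issue_refund)
```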

Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.

Inspired by this post on Product School.

February 9, 2026
Vibe Coding Unleashed: How Parallel Agents Build KPI Driver Trees in Under Two Hours

I’ve been exploring what I call the next level of vibe coding: orchestrating agentic AI to build complex product artifacts in minutes, not days. The breakthrough comes from ditching linear handoffs and embracing true parallelism—letting specialized agents tackle the work simultaneously while I steer the orchestration. In product management contexts where speed and clarity matter, this shift changes everything.

Building a KPI Driver Tree in two hours becomes possible when you stop building sequentially and start building with parallel agents.

For product leaders, a KPI Driver Tree is the fastest way to make strategy legible. It ties high-level outcomes to the levers we can actually pull (features, channels, pricing, onboarding, activation, and retention mechanics) so we can prioritize with confidence. Done well, it keeps OKRs focused on outcomes rather than output, clarifies measurement, and aligns the team around a shared, testable model of growth.

Here’s how I operationalize it with agentic AI and AI workflows. I spin up a small team of specialized parallel agents: a Metrics Librarian (taxonomy and definitions), a Data Modeler (event and table design), a Research Synthesizer (voice of customer and causal hypotheses), a UX Prototyper (visualizing the tree and flows), and a QA/Evaluator (logic and consistency checks). An Orchestrator coordinates these agents, resolves conflicts, and composes outputs into a single, production-ready artifact—while I set constraints, review deltas, and decide.

In a typical two-hour sprint, all agents run at once. While the Metrics Librarian finalizes the KPI ontology, the Data Modeler validates instrumentable events and joins, and the UX Prototyper renders an interactive driver tree for a unified analytics platform. Meanwhile, the Synthesizer maps qualitative insights to quantitative levers, and the Evaluator stress-tests assumptions. Because we’re not waiting for sequential handoffs, we converge on a coherent driver tree and its initial measurement plan in one pass.

The payoff isn’t just speed; it’s higher-quality decisions. Parallel agents reduce context loss, expose trade-offs earlier, and allow me to compare multiple viable paths side-by-side. This accelerates continuous discovery, aligns with product strategy, and gives product managers, and the LLMs that assist them, a clear, living map of how inputs roll up to outcomes. It’s the closest I’ve found to running a product trio at machine speed.

Guardrails matter. I pair this approach with strong data governance, privacy-by-design, and eval-driven development so every agent’s output is testable and auditable. Clear prompts, scoped corpora, and consistent acceptance criteria keep the Orchestrator honest, while lightweight Agent Analytics helps me see where reasoning falters and where to improve the system.

If your team is still tackling analytics artifacts sequentially—requirements, then instrumentation, then visualization—consider switching mental models. Treat the driver tree as the backbone, empower parallel agents to co-create around it, and reserve human judgment for the critical calls. This is vibe coding for product management: creative, fast, and grounded in measurable outcomes.

Inspired by this post on Pendo – Best Practices.

February 5, 2026

How to Build a Mature AI Customer Service Operation

Your customer-service AI agent is live. It answers common questions, the launch dashboard looks healthy, and the next budget conversation is already about scale. Then a harder question arrives: which customer problems can the system actually own from start to finish?

That answer separates a production pilot from a mature deployment. Maturity is not the number of channels using AI or the quality of the demo. It is your ability to give the system meaningful responsibility, measure the result, recover safely when it fails, and improve it as part of normal operations. The framework below will help you diagnose where your deployment is shallow and decide what to build next.

Maturity begins where the pilot stops

Investment no longer distinguishes an AI leader. Among 2,470 global support professionals surveyed by Intercom, 82% of senior leaders said their teams had invested in AI during the previous year, 87% planned to invest in 2026, and 77% said AI was meeting or exceeding expectations. Yet only 10% classified their deployment as mature.

Those are self-reported responses collected by an AI-support vendor, so treat them as a directional benchmark rather than causal proof. The useful signal is the gap: buying and launching AI has become common, while redesigning customer service around it remains rare.

A pilot proves that an AI agent can participate. A mature operation proves that it can take responsibility. Participation might mean generating an answer before handing the conversation to a person. Responsibility means resolving the customer’s need, completing any permitted action, recording what happened, and escalating with context when human judgment is required.

Dimension	Pilot-shaped deployment	Mature operating behavior
Scope	A few answerable intents on one surface	Selected journeys owned from initial request through verified outcome
Work performed	Retrieves information or drafts a reply	Explains, gathers context, uses approved tools, and completes permitted tasks
Ownership	A launch team watches aggregate results	A named operator owns performance, failures, and the improvement backlog
Knowledge	Content is cleaned up before launch	Knowledge coverage, accuracy, and maintenance are governed as production dependencies
Testing	The happy path works in a demo	Realistic scenarios, boundary cases, and regressions are evaluated before changes ship
Handoffs	Escalation is an undifferentiated escape route	Every handoff has a reason, preserves context, and feeds the next improvement decision
Success	Containment or deflection rises	Verified resolution, task completion, quality, safety, and customer impact improve together

Use this as a constraint map, not an average score. A deployment with excellent content but unreliable account permissions is not ready to complete account changes. A deployment with strong automation but no failure taxonomy cannot improve systematically. Your least-developed operating dependency usually limits the next safe increase in responsibility.

Expand responsibility one customer intent at a time

The safest unit of expansion is not a channel, market, or percentage target. It is a customer intent with a defined outcome. Shipping an AI agent to every messaging surface can increase reach without increasing capability. Giving it end-to-end ownership of one additional support journey creates measurable depth.

For each intent, move up this responsibility ladder only when the previous level is dependable:

Answer: Retrieve and explain approved information.
Clarify: Ask the minimum questions needed to identify the customer’s situation.
Contextualize: Use authenticated account, product, region, or history data to provide the applicable answer.
Act: Complete a permitted task through a reliable tool or workflow, then confirm the result.
Intervene proactively: Detect a relevant condition and offer or perform an appropriate next step under explicit rules.

This ladder explains why an answer bot and an operational AI agent can look similar in a dashboard but create very different value. The first reduces reading and typing. The second can remove an entire unit of work for the customer and the support team.

The reported difference between early and deep deployments appears in the type of work performed. Mature teams were more likely than teams in initial deployment to report automation of manual work, proactive engagement, and task completion: 63% versus 52%, 51% versus 41%, and 45% versus 28%, respectively. Mature teams also reported higher quality and consistency more often. The figures do not establish that deployment depth alone caused the gains, but they show what deeper responsibility looks like in practice.

Before promoting an intent to the next rung, answer these questions:

Outcome: Can you state exactly what successful resolution means for the customer?
Knowledge: Is there an approved, current answer for the common case and its important exceptions?
Identity: Does the workflow know who the customer is when personalization or action requires authentication?
Authorization: Can the system verify that this customer and this AI workflow are allowed to perform the action?
Inputs: Can required values be validated before an action is submitted?
Confirmation: Can the system verify that the downstream task succeeded instead of assuming that a tool call worked?
Recovery: Is there a safe retry, rollback, approval, or human-handoff path?
Evidence: Can an operator reconstruct which knowledge, data, rules, and tool results produced the outcome?
Evaluation: Do your test scenarios cover ambiguity, missing information, exceptions, and known failure modes?

If an answer is no, you have found the next capability to build. Do not compensate with a more confident prompt. Missing permissions need a permission model. Unreliable data needs an integration fix. Conflicting policy pages need knowledge governance.
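
The nine questions above can double as a promotion gate: any unanswered or negative check names the capability to build next. A minimal sketch, assuming the answers are collected as a simple mapping with hypothetical keys:

```python
CHECKS = [
    "outcome", "knowledge", "identity", "authorization", "inputs",
    "confirmation", "recovery", "evidence", "evaluation",
]


def next_capabilities(answers: dict[str, bool]) -> list[str]:
    """Return checks answered 'no' or left unanswered. Empty means ready to promote."""
    return [check for check in CHECKS if not answers.get(check, False)]


# Hypothetical intent that cannot yet verify the downstream result.
answers = {check: True for check in CHECKS}
answers["confirmation"] = False
print(next_capabilities(answers))  # ['confirmation']
```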

Use additional care for refunds, cancellations, account changes, identity-sensitive requests, and other consequential actions. Start with reversible or approval-gated operations. Validate the customer, the requested change, the permitted amount or scope, and the downstream result. A fast autonomous action is not a success if it creates financial loss, locks the wrong account, or leaves no reliable audit trail.

Build the operating system behind the agent

An AI agent does not mature on its own after launch. Performance plateaus when ownership, content, testing, integrations, and analysis remain side projects. These capabilities need to operate as one system.

Give performance to a named operator

Executive sponsorship and operational ownership solve different problems. The sponsor aligns customer experience, economics, organizational design, and cross-functional priorities. The operator turns failures into changes and makes sure those changes reach production safely. One person can fill both roles in a smaller organization, but the accountabilities should still be explicit.

The operator should own a working backlog organized by customer intent. Each entry needs enough context to support a decision:

The customer intent and desired outcome.
Where the current journey begins and ends.
Conversation volume and customer impact drawn from your own data.
The primary failure mode, supported by examples.
The proposed content, behavior, integration, or policy change.
The person responsible for the dependency.
The scenarios that will validate the change.
The deployment status, observed result, and rollback decision.

This prevents the backlog from becoming a collection of prompt tweaks. It also exposes systemic problems. If several intents fail because account status arrives late, the priority is the shared data dependency, not separate wording changes in every conversation.

Treat knowledge as a runtime dependency

Content quality is not a launch task. The AI agent depends on current knowledge every time it answers, just as a transactional workflow depends on a functioning service. A policy change can therefore create production failures even when no AI configuration changes.

Create a content contract for every intent you expect the agent to own:

Canonical location: Identify the approved source rather than allowing several conflicting pages to compete.
Coverage: Include the common case, eligibility conditions, exceptions, prerequisites, and the point where human judgment begins.
Scope: Separate product, plan, market, language, and policy variants when the answer differs.
Owner: Assign the person or function authorized to approve changes.
Freshness trigger: Tie review to the product, pricing, policy, or workflow event that can make the content stale.
Retirement: Remove or clearly supersede obsolete information so retrieval does not surface an old rule.
Validation: Attach representative scenarios that should pass whenever the knowledge changes.
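
The freshness trigger is the part of the contract that is easiest to automate. As a rough sketch, assuming contracts are stored as simple records, a product or policy event can be matched against each contract to find the content that needs review. The ContentContract class, the event names, and the source paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentContract:
    """Hypothetical record for one intent's content contract."""
    intent: str
    canonical_source: str
    owner: str
    freshness_triggers: frozenset[str]
    validation_scenarios: tuple[str, ...]


def contracts_to_review(contracts, event):
    """Return contracts whose freshness triggers include the given event."""
    return [c for c in contracts if event in c.freshness_triggers]


# Illustrative contracts; names and paths are invented.
contracts = [
    ContentContract("refund_eligibility", "kb/refund-policy", "Billing policy lead",
                    frozenset({"pricing_change", "policy_change"}),
                    ("refund inside window", "refund after window")),
    ContentContract("password_reset", "kb/password-reset", "Identity team",
                    frozenset({"workflow_change"}),
                    ("reset with verified email",)),
]

due = contracts_to_review(contracts, "pricing_change")
print([c.intent for c in due])
```

The validation scenarios attached to each returned contract are the ones to rerun once the owner approves the updated content.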

A retrieval-first pipeline makes content maintainable because the approved explanation lives in governed knowledge instead of being buried inside prompts. Prompt behavior should decide how to use policy, not become a second unofficial policy store.

Run every change through an evaluation loop

A useful production loop is Train, Test, Deploy, Analyze. Its value is not the labels. It is the discipline of connecting an observed failure to a controlled change and then checking whether the change improved real outcomes.

Train: Change the relevant knowledge, behavior, data access, or tool. Record the failure you expect the change to fix.
Test: Run representative customer scenarios, including the happy path, ambiguous wording, missing data, policy exceptions, tool failure, and required escalation. Govern or redact conversation data under your privacy controls.
Deploy: Release to the intended intent, channel, customer segment, language, or market with a known fallback and rollback path.
Analyze: Check the customer outcome and guardrails, inspect new failure patterns, and decide whether to keep, revise, expand, or revert the change.

Your evaluation set should evolve with production. Add scenarios when a customer finds a new ambiguity, a product release changes the journey, or an integration fails in a way the original tests did not anticipate. Keep regression cases after the immediate defect is fixed. Otherwise, one improvement can quietly reintroduce an old failure elsewhere.
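
A regression set can start as a plain list of scenarios that is rerun on every change. The sketch below assumes behavior can be reduced to a coarse label; the labels, scenario ids, and stub_agent are illustrative stand-ins, and a real suite would call your own agent and use a richer rubric.

```python
from typing import Callable

# Each scenario pairs a customer message with the behavior we expect.
# The behavior labels ("answer", "clarify", "escalate") are illustrative.
SCENARIOS = [
    {"id": "happy_path", "message": "How do I update my card?", "expected": "answer"},
    {"id": "ambiguous", "message": "It doesn't work", "expected": "clarify"},
    {"id": "regression_refund_cancel", "message": "Cancel and refund everything",
     "expected": "escalate"},
]


def run_suite(agent: Callable[[str], str], scenarios: list[dict]) -> list[str]:
    """Run every scenario and return the ids whose behavior did not match."""
    return [s["id"] for s in scenarios if agent(s["message"]) != s["expected"]]


def stub_agent(message: str) -> str:
    """Stand-in for the real agent; replace with a call to your own system."""
    text = message.lower()
    if "refund" in text:
        return "escalate"
    if len(text.split()) < 4:
        return "clarify"
    return "answer"


failures = run_suite(stub_agent, SCENARIOS)
print(failures or "all scenarios passed")
```

Keeping the regression scenario in the list after the defect is fixed is what lets a later change fail loudly instead of quietly.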

Make actions observable and recoverable

Answer quality alone is insufficient once the AI agent can perform tasks. Your operation must distinguish a bad explanation from a failed action, a denied permission, stale account data, a duplicate request, and a downstream timeout. Those failures require different owners and different fixes.

For each consequential workflow, preserve the facts needed to reconstruct the outcome: the detected intent, the applicable knowledge or policy version, required customer inputs, authorization result, tool invoked, request status, returned result, confirmation shown to the customer, and handoff reason. The goal is not indiscriminate data collection. Retain only what your privacy and security rules permit, but retain enough operational evidence to diagnose a failure.
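
To make the distinction between failure types concrete, here is a minimal sketch of an action record and a function that maps it to a coarse category. The ActionRecord fields follow the facts listed above; the status values, category labels, and the diagnose function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRecord:
    """Hypothetical evidence record for one consequential action."""
    intent: str
    policy_version: str
    authorized: bool
    tool: str
    request_status: str          # assumed values: "ok", "timeout", "duplicate"
    confirmation_shown: bool
    handoff_reason: Optional[str] = None


def diagnose(record: ActionRecord) -> str:
    """Map a record to a coarse category so it can reach the right owner."""
    if not record.authorized:
        return "denied_permission"
    if record.request_status == "timeout":
        return "downstream_timeout"
    if record.request_status == "duplicate":
        return "duplicate_request"
    if record.request_status == "ok" and not record.confirmation_shown:
        return "missing_confirmation"
    return "completed" if record.request_status == "ok" else "unknown_failure"


record = ActionRecord(intent="refund", policy_version="refund-policy-v3",
                      authorized=True, tool="issue_refund",
                      request_status="timeout", confirmation_shown=False)
print(diagnose(record))
```

The record deliberately holds references and statuses rather than raw payloads, which keeps it compatible with the retention limits described above.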

Design the human path at the same time as the autonomous path. A handoff should carry the customer’s request, relevant facts already collected, actions attempted, results received, and the unresolved decision. Making the customer repeat the conversation transfers the AI agent’s failure cost directly to them.

Turn handoffs into the improvement backlog

A handoff is not automatically a failure. Some requests require empathy, judgment, negotiation, policy discretion, or authority that should remain with a person. The operational failure is an unexplained handoff. When every escalation looks the same in analytics, you cannot tell whether to improve knowledge, retrieval, workflow reliability, or the boundary itself.

Handoff or failure type	What to inspect	Likely improvement
Knowledge gap	No approved answer, missing exception, or obsolete policy	Create or update canonical content and add regression scenarios
Retrieval mismatch	Relevant content exists but the wrong variant is selected	Improve structure, metadata, scoping, or content separation
Interpretation or behavior error	The right information is available but applied incorrectly	Refine behavior instructions and add boundary-case evaluations
Missing customer context	The answer depends on account, plan, region, or history data that is unavailable	Connect the required data or ask a precise clarifying question
Authorization boundary	The requested action is not permitted for this customer or workflow	Preserve the guardrail; improve explanation or approval routing
Tool or data failure	A permitted action fails, times out, or returns an uncertain result	Improve integration reliability, confirmation, retry, and fallback behavior
Deliberate human boundary	The request requires judgment, discretion, or specialized handling	Keep the handoff and improve context transfer

Apply one primary reason to each reviewed failure, even when several contributing factors exist. Route the item to the owner who can change that dependency. Over time, the distribution of reasons tells you whether the deployment is becoming more capable or merely handing off in different places.
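
Tracking that distribution needs very little machinery. The sketch below encodes the seven reasons from the table as an enumeration and computes each reason's share of reviewed handoffs; the HandoffReason and reason_shares names are hypothetical, and the sample data is invented.

```python
from collections import Counter
from enum import Enum


class HandoffReason(Enum):
    KNOWLEDGE_GAP = "knowledge gap"
    RETRIEVAL_MISMATCH = "retrieval mismatch"
    BEHAVIOR_ERROR = "interpretation or behavior error"
    MISSING_CONTEXT = "missing customer context"
    AUTHORIZATION_BOUNDARY = "authorization boundary"
    TOOL_OR_DATA_FAILURE = "tool or data failure"
    DELIBERATE_BOUNDARY = "deliberate human boundary"


def reason_shares(reviewed: list[HandoffReason]) -> dict[HandoffReason, float]:
    """Return each observed reason's share of all reviewed handoffs."""
    counts = Counter(reviewed)
    total = len(reviewed)
    return {reason: count / total for reason, count in counts.items()}


# Illustrative review sample: one primary reason per handoff.
reviewed = [
    HandoffReason.KNOWLEDGE_GAP,
    HandoffReason.KNOWLEDGE_GAP,
    HandoffReason.TOOL_OR_DATA_FAILURE,
    HandoffReason.DELIBERATE_BOUNDARY,
]
for reason, share in reason_shares(reviewed).items():
    print(f"{reason.value}: {share:.0%}")
```

Comparing these shares across review periods shows whether fixes are removing a class of failure or only moving handoffs to a different reason.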

Measure the operation as a stack rather than relying on one headline rate:

Reach: Where was the AI agent involved, broken down by intent, channel, language, market, and product area?
Outcome: Was the customer’s issue actually resolved, and did any requested task complete successfully?
Quality: Was the answer correct, consistent, clear, and appropriate for the applicable policy and context?
Customer impact: What happened to satisfaction, repeat contact, abandonment, and escalation experience?
Guardrails: Were there unauthorized actions, incorrect confirmations, failed tools, or missed mandatory handoffs?
Diagnostics: Which knowledge gaps, retrieval mismatches, behavior errors, and integration failures drove the result?

Do not confuse involvement with success. Involvement measures only how often the system participated. Do not treat a conversation that ended without a human as verified resolution either; the customer may have abandoned the interaction or returned through another channel. Tie autonomous resolution to evidence that the intended outcome occurred, especially when a tool or account change was involved.

Aggregate containment is also easy to misread. It can rise because the mix shifted toward simpler questions while a high-impact journey deteriorated. Review results by intent and relevant customer segment before crediting a model or configuration change. If containment improves while repeat contacts, task failures, or customer satisfaction worsen, the operation has not become more mature.
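
A small worked example shows how the mix effect arises. The numbers below are invented purely for illustration: the simple intent keeps the same containment rate but triples in volume, while the high-impact intent gets worse. The intent names and the containment function are hypothetical.

```python
def containment(rows: list[dict]) -> float:
    """Share of conversations that ended without a human handoff."""
    return sum(r["contained"] for r in rows) / sum(r["conversations"] for r in rows)


# Invented figures for two review periods.
before = [
    {"intent": "order_status", "conversations": 100, "contained": 80},
    {"intent": "refund_dispute", "conversations": 100, "contained": 60},
]
after = [
    {"intent": "order_status", "conversations": 300, "contained": 240},
    {"intent": "refund_dispute", "conversations": 100, "contained": 50},
]

for label, rows in (("before", before), ("after", after)):
    per_intent = {r["intent"]: r["contained"] / r["conversations"] for r in rows}
    print(label, round(containment(rows), 3), per_intent)
```

With these figures, aggregate containment rises from 70% to 72.5% even though containment for refund disputes falls from 60% to 50%. The headline number improved only because the mix shifted toward the easier intent.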

Key takeaways

AI deployment maturity is the ability to give an AI agent measurable, recoverable responsibility for customer outcomes, not simply to expose it to more conversations.
Expand one customer intent at a time through answering, clarification, contextualization, action, and carefully governed proactive work.
Do not automate consequential actions until identity, authorization, validation, confirmation, observability, and recovery are in place.
Assign a named operator to own intent-level performance, failure analysis, dependencies, evaluations, and the improvement backlog.
Manage knowledge as production infrastructure with canonical content, explicit scope, accountable owners, freshness triggers, and regression scenarios.
Classify handoffs by root cause and measure verified resolution, quality, customer impact, and guardrails alongside containment.

At your next operating review, choose one important intent that the AI agent currently answers but does not own. Map it onto the responsibility ladder, run the readiness questions, name its operator, classify its current handoffs, and put the next change through the evaluation loop. The scope is deliberately narrow. The maturity gain is real: one more customer problem resolved safely from beginning to end.

References

Intercom – Go Deep or Get Left Behind: How AI Deployment Depth Transforms Customer Service

February 5, 2026

How Product Leaders Turn AI Agents Into Adopted Workflows

Your AI agent may look convincing in a demonstration and still disappear from daily work. If people try it once but return to spreadsheets, dashboards, tickets, and manual handoffs, you do not have an awareness problem. You have a workflow design problem.

Real adoption begins when a specific user can delegate a meaningful part of a recurring job, understand the agent’s limits, and see that the resulting decision or action is better. Product leaders create those conditions by narrowing the workflow, defining the agent’s authority, measuring the complete decision loop, and expanding autonomy only after the evidence supports it.

Choose a workflow, not a place to add AI

Starting with “Where can I deploy an agent?” pushes the team toward a feature. Start with “Which recurring decision or action is unnecessarily difficult?” That question keeps the work tied to customer or business value.

A good first workflow is frequent enough to generate feedback, narrow enough to evaluate, and bounded enough that a mistake can be caught before it causes material harm. It also has an identifiable beginning and end. “Help people be more productive” is not a workflow. “Use approved customer evidence to prepare the next-best-action options for a campaign review” is much closer.

Evaluate candidate workflows against six practical criteria:

Trigger: The user can recognize the moment when the agent should enter the workflow.
Frequency: The job repeats often enough for the user to form a habit and for the team to learn from actual use.
Grounding: The agent can retrieve the approved data, policies, history, or customer evidence required to do the job.
Completion: The team can observe whether the task reached a useful end state, rather than merely whether the model returned text.
Decision boundary: Everyone can state what the agent may decide, what requires approval, and what it must never do.
Recoverability: An incorrect recommendation or action can be rejected, corrected, or reversed without disproportionate damage.

Mark each candidate high, medium, or low on those criteria. Do not hide a weak decision boundary behind an attractive use case. A repetitive workflow with clear evidence and a review point is usually a better adoption bet than an ambitious end-to-end process with unclear ownership.
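
The rating exercise can be captured in a few lines. In this sketch, the numeric weights (high = 2, medium = 1, low = 0) and the rule that a low decision boundary excludes a candidate outright are assumptions chosen to reflect the advice above, not a standard scoring method. The candidate names are invented.

```python
SCORE = {"high": 2, "medium": 1, "low": 0}  # assumed weights
CRITERIA = ("trigger", "frequency", "grounding", "completion",
            "decision_boundary", "recoverability")


def rank(candidates: dict[str, dict[str, str]]) -> list[tuple[str, int]]:
    """Rank candidates by total score, excluding any with a low decision boundary."""
    ranked = []
    for name, ratings in candidates.items():
        if ratings["decision_boundary"] == "low":
            continue  # a weak boundary is not offset by other strengths
        ranked.append((name, sum(SCORE[ratings[c]] for c in CRITERIA)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Ratings are listed in the same order as CRITERIA.
candidates = {
    "campaign_review_prep": dict(zip(CRITERIA, ["high", "medium", "high", "high", "high", "high"])),
    "end_to_end_renewal": dict(zip(CRITERIA, ["high", "high", "high", "high", "low", "high"])),
    "weekly_report_summary": dict(zip(CRITERIA, ["medium"] * 6)),
}
print(rank(candidates))
```

The ambitious end-to-end candidate scores well on five criteria and is still excluded, which is the point: the total must not be allowed to mask an unclear boundary.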

This is also why natural-language access alone is not an agent strategy. It can lower the barrier between a user’s question and an analytical answer, which may improve activation. Adoption becomes more valuable when the answer connects to a defined next action and the eventual impact of that action can be observed.

Write the selected workflow in one sentence before approving a roadmap:

Agent workflow template:
When [user] encounters [trigger], the agent uses [approved context] to [recommend, prepare, or execute an action]; [person or policy] controls [decision boundary], and success is measured by [workflow or customer outcome].

If the team cannot complete that sentence without vague language, discovery is not finished.

Write an adoption contract before writing the roadmap

An agent changes who performs work, which information informs it, and where accountability sits. That is an operating-model decision disguised as a product feature. A one-page adoption contract makes the change explicit before implementation creates momentum around the wrong behavior.

The contract should answer seven questions:

Who is the intended user? Name the role and the situation, not a broad department.
What job is being delegated? Separate information retrieval, analysis, recommendation, preparation, and execution. They carry different risks.
What outcome should improve? Connect the workflow to an existing customer or business outcome, not to the amount of AI content produced.
Which information is authorized? Identify the systems of record, retrieval scope, freshness requirements, and data that must remain unavailable.
Where does human judgment remain mandatory? Put approval at the consequential decision, not at an arbitrary screen in the interface.
How should uncertainty and failure appear? Define when the agent should cite evidence, ask for missing context, abstain, escalate, or report that a tool failed.
What earns expansion? Specify the quality, adoption, outcome, and risk signals required before the agent receives more users, tools, or autonomy.

This contract prevents a common measurement error: treating interaction volume as value. Conversations, generated documents, and tool calls are outputs. They can help diagnose behavior, but they do not show that the workflow improved. Activation, successful completion, repeat use at the next relevant trigger, and retention are stronger adoption signals. They still need to connect to a journey outcome such as a better decision, a completed customer task, or a validated change.

Use the outcome-versus-output distinction in your OKRs to keep this difference visible. An output key result might promise to launch an agent or add integrations. An outcome key result should describe the behavior or customer result that the workflow is intended to change. The delivery milestone belongs in the plan; it should not masquerade as proof of adoption.

The contract also makes prioritization easier. A request for another model, data connector, or agent tool must improve a named part of the workflow. If it cannot be tied to grounding quality, task completion, user control, or the target outcome, it is probably infrastructure enthusiasm rather than a product requirement.

Earn autonomy through observable stages

Do not jump from a chat interface to autonomous execution because the happy-path demo worked. Autonomy should advance in stages, with a different role for the user and a different standard of evidence at each stage.

Capability stage	What the agent does	Human responsibility	Evidence needed to advance
Explain	Retrieves and synthesizes approved information	Checks the evidence and interprets it	Grounding, completeness, and answer-quality evals
Recommend	Produces alternatives or ranks possible next actions	Makes the decision and records important overrides	Relevance, reasoning, boundary, and decision-support evals
Prepare	Creates a draft action, configuration, or artifact without committing it	Edits and approves before execution	Task-specific correctness, policy, format, and exception evals
Act	Executes a bounded action through approved tools	Supervises exceptions and reviews consequential cases	Reliable task completion, tool behavior, auditability, and recovery controls

The stages are not a maturity contest. Some workflows should remain in recommendation or preparation mode because the consequences of an incorrect action outweigh the benefit of removing approval. Human-in-the-loop design is useful when the person has evidence, authority, and enough context to intervene. A mandatory click from someone who cannot evaluate the result adds friction without adding control.

Before releasing each stage, create an evaluation set that represents the actual workflow. Include normal cases, ambiguous requests, missing or stale context, policy boundaries, conflicting evidence, and tool failures. For every case, record the expected behavior, unacceptable behavior, scoring rubric, and evidence the evaluator should inspect.

Do not collapse evaluation into a single pass rate. An answer can be fluent and wrong, properly grounded but irrelevant, or correct while attempting an unauthorized action. Score the dimensions that matter independently: retrieval and grounding, task correctness, tool selection, instruction adherence, policy compliance, escalation behavior, and completion quality.
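
To see why a single pass rate hides the useful signal, consider scoring each dimension separately. The sketch below assumes every evaluated case yields a pass or fail per dimension; the dimension identifiers follow the list above, while the helper names and the four sample cases are invented.

```python
DIMENSIONS = ("grounding", "task_correctness", "tool_selection",
              "instruction_adherence", "policy_compliance",
              "escalation_behavior", "completion_quality")


def case(*failed: str) -> dict[str, bool]:
    """Build one evaluated case that passes every dimension except those named."""
    return {d: d not in failed for d in DIMENSIONS}


def dimension_pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Pass rate per dimension across all evaluated cases."""
    return {d: sum(r[d] for r in results) / len(results) for d in DIMENSIONS}


def overall_pass_rate(results: list[dict[str, bool]]) -> float:
    """Share of cases that passed every dimension."""
    return sum(all(r.values()) for r in results) / len(results)


results = [case(), case("policy_compliance"), case("policy_compliance"), case("grounding")]
print(overall_pass_rate(results))
print(dimension_pass_rates(results))
```

In this sample, only one case in four passes everything, but the per-dimension view shows that policy compliance accounts for half of the failures. That points to a specific fix instead of a general sense that quality is low.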

Treat prompts and evaluation datasets as versioned product assets. When the model, prompt, retrieval logic, tool definition, or policy changes, rerun the relevant evaluation set and preserve the result with the release. Otherwise, a team can improve one visible behavior while silently degrading another.
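
One lightweight way to enforce the rerun rule is to fingerprint the versioned assets and compare the result with the fingerprint stored alongside the last evaluation run. This is a sketch under the assumption that each asset has a version string; the asset keys, version values, and function names are hypothetical.

```python
import hashlib
import json


def fingerprint(assets: dict[str, str]) -> str:
    """Stable hash of the asset versions, independent of key order."""
    payload = json.dumps(assets, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rerun(current: dict[str, str], last_evaluated: str) -> bool:
    """True when any versioned asset differs from the last evaluated release."""
    return fingerprint(current) != last_evaluated


# Illustrative asset versions.
baseline = {"model": "model-a", "prompt": "v12", "retrieval": "v4",
            "tools": "v7", "policy": "2026-01"}
last_evaluated = fingerprint(baseline)

candidate = {**baseline, "prompt": "v13"}
print(needs_rerun(candidate, last_evaluated))
```

Storing the fingerprint with the evaluation result also gives each release a record of exactly which combination of assets was tested.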

A retrieval-first design is especially important when the workflow depends on institutional knowledge. The agent should use authorized context before relying on general model knowledge, expose enough evidence for the user to inspect, and ask for clarification or abstain when required context is unavailable. That behavior may look less magical in a demonstration, but it is much easier to trust in repeated work.

Measure the entire agent loop, not the chat surface

A traditional feature funnel can tell you who opened an agent and who returned. It cannot explain whether the agent retrieved the right context, selected the right tool, required extensive correction, or produced an action that affected the intended outcome. Agent Analytics must reconstruct the path from intent to result.

Instrument the workflow as a connected event chain:

Intent and eligibility: Which workflow was triggered, and was the user and situation within scope?
Context: Which approved knowledge or data was retrieved, and was essential context unavailable?
Reasoning path: Which plan or action sequence did the system select?
Tool behavior: Which tools were called, which arguments were passed, and where did errors or retries occur?
Human intervention: Did the user accept, edit, reject, override, or abandon the result?
Completion: Did the workflow reach its defined end state?
Outcome: Did the customer or business indicator named in the adoption contract move in the intended direction?

Apply privacy-by-design to that event model. Logging every raw prompt, retrieved record, or tool payload by default can create unnecessary exposure. Decide which fields are required for product learning, who may access them, how sensitive data is handled, and how long the information is retained. Data governance belongs in the instrumentation design, not in a review after launch.
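
An allow-list is a simple way to make minimization the default in the event model. The sketch below keeps only fields that have been explicitly approved for product learning and drops everything else, including raw prompts and tool payloads. The field names and the minimize function are assumptions for illustration; your approved set would come from your own governance review.

```python
# Assumed allow-list: one field per link in the event chain.
ALLOWED_FIELDS = {"workflow", "in_scope", "sources_retrieved", "tools_called",
                  "human_action", "completed", "outcome_metric"}


def minimize(event: dict) -> dict:
    """Keep only allow-listed fields; anything else is dropped by default."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


# Illustrative event; identifiers and values are invented.
event = {
    "workflow": "campaign_review_prep",
    "in_scope": True,
    "sources_retrieved": ["doc-17"],
    "tools_called": ["draft_options"],
    "human_action": "edited",
    "completed": True,
    "raw_prompt": "<full user text>",
    "tool_payload": {"customer_email": "<redacted>"},
}
print(minimize(event))
```

Because new fields are excluded until someone adds them to the list, extending the log becomes a deliberate decision with an owner, not a side effect of a code change.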

Review four layers together:

Quality: Evaluation results by task and failure dimension.
Behavior: Activation, successful completion, repeat use, abandonment, edits, and overrides.
Outcome: The customer or business result attached to the workflow.
Risk and reliability: Boundary violations, unsupported claims, tool failures, escalations, and consequential incidents.

Each layer corrects a possible misreading. High usage with weak quality can mean users are compensating for the system. Strong offline quality with little repeat use can mean the workflow is not important or the interaction arrives at the wrong moment. Completion without an outcome can mean the agent is accelerating work that should not have been done. Outcome movement without traceability makes it difficult to know whether the agent deserves credit or whether the result will persist.

Use qualitative evidence to explain those patterns. Review corrections and overrides, collect feedback at the point of use, and connect support signals to roadmap decisions. A generic satisfaction question is less useful than asking what evidence was missing, which step the user repeated manually, or why the recommendation could not be acted on.

When comparing user-facing variants, define the primary outcome and minimum detectable effect before running an A/B test. This prevents the team from declaring success based on an incidental movement in a convenient metric. A/B testing is appropriate only where traffic, exposure, and risk make controlled experimentation meaningful; rare or consequential actions need direct evaluation, review, and guardrails instead.
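
Fixing the minimum detectable effect in advance also tells you whether the test is feasible at all. The sketch below uses the common normal-approximation formula for comparing two proportions to estimate the users needed per variant. The baseline rate, the effect size, and the default significance and power values are illustrative assumptions, and the result is an approximation, not a substitute for a proper power analysis.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Illustrative: 40% baseline completion, 5-point absolute lift.
print(sample_size_per_arm(baseline=0.40, mde=0.05))
print(sample_size_per_arm(baseline=0.40, mde=0.02))
```

Halving the effect you want to detect roughly quadruples the required sample. If the workflow cannot supply that exposure in a reasonable period, that is a sign to rely on direct evaluation and review instead.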

Make agent adoption an operating change

A launch campaign can create trials. It cannot resolve unclear ownership, weak evaluation, missing context, or a workflow that asks users to supervise the agent without giving them useful control. Sustainable adoption requires a product operating model around the capability.

Give a product trio responsibility for the complete workflow and pair it with the people who can close the distance between a prototype and production use:

Product management owns the user problem, target outcome, decision boundary, adoption contract, and expansion decision.
Design owns how intent, evidence, uncertainty, approval, correction, and escalation appear in the experience.
Engineering owns retrieval, tool permissions, system behavior, observability, release controls, and recovery paths.
A forward deployed engineer or equivalent customer-facing technical partner helps expose the real context, integrations, and exceptions hidden by a clean prototype.
Data and risk owners define acceptable model behavior, privacy constraints, access rules, and the evidence required for governance.

The leadership cadence should follow the learning loop. Discovery identifies a high-value workflow and pressure-tests it with user evidence. Pre-release review examines evaluations and failure modes. A narrow rollout tests the workflow with explicit human checkpoints. Operating reviews examine quality, behavior, outcomes, and incidents together. Expansion adds a capability, population, tool, or level of autonomy only when the prior boundary is performing as intended.

This model should influence AI hiring as well. A strong AI product candidate should be able to turn a broad ambition into a bounded workflow, define an evaluation rubric, separate model quality from product outcomes, place human judgment at the right decision, and explain what evidence would justify more autonomy. Prompt fluency without those skills is not product leadership.

Key takeaways

Start with one recurring, bounded workflow whose completion and outcome can be observed.
Write an adoption contract covering the user, trigger, delegated job, approved context, decision boundary, failure behavior, and expansion criteria.
Progress from explanation to recommendation, preparation, and bounded action only as evaluation and production evidence improve.
Version prompts, retrieval logic, tool definitions, and evaluation datasets with releases.
Instrument intent, context, tool calls, human intervention, completion, and downstream outcomes as one decision loop.
Scale when quality, repeat use, workflow outcomes, and risk controls agree – not when a demonstration attracts attention.

Your next move does not need to be a company-wide agent mandate. Put three candidate workflows through the six selection criteria. Choose the one with the clearest trigger, evidence, completion point, and decision boundary. Then write its adoption contract and evaluation set before funding a broad build. If the narrow workflow earns repeat use and improves its named outcome, you will have evidence for the next capability – and a repeatable method for every agent that follows.

February 4, 2026

Real-Time Analytics for Financial-Services Contact Centers

Your contact center can have excellent reporting and still react too late. A weekly chart may explain why transfers rose, authentication failed, or members called again. It cannot recover the interaction that is already going wrong.

That is the practical case for real-time analytics in financial services: detect a useful signal while there is still time to change the outcome, then deliver a safe action to the person or system that can take it. The goal is not a faster dashboard. It is a shorter path from behavior to decision to resolution.

Key takeaways

Define real time against the decision window. A signal is timely only if it arrives before the next useful action expires.
Start with journeys that create material cost or dissatisfaction, such as lost cards, fraud disputes, loan-status requests, password resets, and payment issues.
Instrument the outcome as carefully as the interaction. Otherwise, you can see that an alert fired without knowing whether it helped.
Activate insights inside routing, agent, supervisor, and follow-up workflows. A separate analytics destination creates another queue for people to monitor.
Measure resolution, repeat demand, and guardrails. Activity metrics such as alerts generated or prompts displayed are diagnostics, not business outcomes.
Build privacy controls, consent handling, access restrictions, and auditability into the decision loop before expanding its reach.

Define real time as a decision contract

Real time is not a universal refresh rate. It is a promise that a signal will reach its decision point while an effective response is still possible. An agent-assist prompt must arrive before the conversation moves past the relevant step. A routing signal must arrive before the interaction enters the wrong queue. A proactive follow-up must arrive before the member has to contact you again.

This distinction prevents an expensive architecture mistake: streaming every event without deciding what any event should change. Some information needs immediate activation. Some belongs in a supervisor review. Some is useful only for longer-term journey redesign. Treating all three as equally urgent increases cost and noise without improving service.

Before building a pipeline, write a decision contract for each use case. The contract should connect the signal to an owner, action, deadline, guardrail, and measurable outcome.

Decision-contract field	Question to answer	Illustrative fraud-routing example
Trigger	What observable event or state starts the decision?	A potential fraud signal appears during an active interaction.
Decision	What choice becomes possible because of the signal?	Whether the interaction should receive specialized handling.
Action	What should the workflow do?	Prioritize the appropriate route and carry the available context forward.
Owner	Who or what is accountable for acting?	The routing workflow, with a supervisor responsible for defined exceptions.
Action window	When does the intervention stop being useful?	Before the interaction is transferred or the relevant verification step is completed.
Guardrail	What must never be bypassed?	Required compliance steps, authorized data access, and a clear human override.
Outcome	How will you know whether the action helped?	Resolution without an avoidable transfer, escalation, or repeat contact.

A contract also exposes weak use cases early. If nobody can name the action, the signal is probably reporting data rather than real-time decision data. If the action has no owner, it will become an ignored alert. If the outcome is merely that a prompt appeared, the team has confused delivery with impact.
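
The contract and its early-warning checks can be expressed as a small record with two helper functions. This sketch makes one simplifying assumption that the table does not: the action window is expressed as elapsed seconds, whereas in practice it may be tied to a journey state such as a transfer. The class, the helpers, and the 20-second window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecisionContract:
    trigger: str
    decision: str
    action: str
    owner: str
    action_window_seconds: float   # assumption: window expressed as elapsed time
    guardrail: str
    outcome: str


def weaknesses(contract: DecisionContract) -> list[str]:
    """Flag the gaps that tend to turn a signal into an ignored alert."""
    issues = []
    if not contract.action.strip():
        issues.append("no action named: probably reporting data")
    if not contract.owner.strip():
        issues.append("no owner: likely to become an ignored alert")
    if not contract.outcome.strip():
        issues.append("no measurable outcome")
    return issues


def is_timely(contract: DecisionContract, seconds_since_trigger: float) -> bool:
    """A signal is useful only if it arrives inside the action window."""
    return seconds_since_trigger <= contract.action_window_seconds


fraud_routing = DecisionContract(
    trigger="Potential fraud signal during an active interaction",
    decision="Whether the interaction should receive specialized handling",
    action="Prioritize the appropriate route and carry context forward",
    owner="",  # left blank to show the check
    action_window_seconds=20.0,
    guardrail="Required compliance steps and a clear human override",
    outcome="Resolution without an avoidable transfer or repeat contact",
)
print(weaknesses(fraud_routing))
print(is_timely(fraud_routing, seconds_since_trigger=35.0))
```

Running the checks before any pipeline work begins is cheap, and it surfaces the ownerless or actionless use cases while they are still easy to drop.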

The underlying platform still needs to bring together behavior across voice, chat, IVR, email, and in-app journeys. But unification is useful only when identity, journey state, and timing remain coherent across those channels. A member who fails authentication in the app and then calls should not look like two unrelated problems.

Instrument five costly journeys before the whole contact center

A complete contact-center data program is too broad a starting point. It invites months of taxonomy work before anyone changes an outcome. Begin with the five journeys most likely to concentrate cost or dissatisfaction: lost card, fraud dispute, loan status, password reset, and payment issue.

This is not a mandate to automate all five at once. Rank them using the evidence you already have: contact demand, transfers, repeat contacts, unresolved cases, authentication failures, and escalations. Choose the journey where a specific intervention is both valuable and operationally feasible.

For the chosen journey, create an outcome card before defining events:

Member intent: What is the person actually trying to complete?
Observable start: Which event shows that the journey has begun?
Resolution state: What evidence means the need was completed, not merely that the interaction ended?
Failure states: Where can authentication, routing, handoff, self-service, or follow-up break down?
Intervention: Which failure can the contact center change while the journey is active?
Outcome and guardrails: Which result should move, and which compliance or experience measures must not deteriorate?

The event model should then describe the journey rather than mirror the screens of each tool. At minimum, preserve a pseudonymous member reference, interaction reference, channel, event time, journey, journey step, authentication state, transfer or escalation state, intervention, and outcome. If intent or risk is inferred, record the version and confidence associated with that inference. If an agent accepts, dismisses, or overrides guidance, capture that response too.

Consistent definitions matter more than a large event count. Decide what a transfer is, when a new contact belongs to an existing journey, and what qualifies as resolution. Version those definitions. Otherwise, a changed IVR flow or CRM configuration can appear to improve performance simply because the instrumentation changed.

Instrument the negative space as well. If the member disappears from a self-service flow, the absence of a completion event is not enough to explain why. Capture the last meaningful step, the failure category when it is available, and whether the member moved to another channel. That is how you distinguish successful deflection from abandonment followed by a call.
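
The distinction can be made mechanical once sessions and later contacts share a member reference and a journey label. The sketch below classifies one self-service session by whether it completed and whether the same member returned on the same journey within a follow-on window. The 24-hour window, the field names, and the category labels are assumptions; pick a window that matches your own journey definitions.

```python
from datetime import datetime, timedelta


def classify_self_service(session: dict, later_contacts: list[dict],
                          window: timedelta = timedelta(hours=24)) -> str:
    """Classify one self-service session using completion and follow-on contacts."""
    returned = any(
        c["member_ref"] == session["member_ref"]
        and c["journey"] == session["journey"]
        and timedelta(0) < c["started_at"] - session["ended_at"] <= window
        for c in later_contacts
    )
    if session["completed"]:
        return "completed_then_recontact" if returned else "resolved_in_self_service"
    return "abandoned_then_contacted" if returned else "abandoned_no_contact"


# Illustrative data with pseudonymous references.
session = {"member_ref": "m-123", "journey": "password_reset",
           "completed": False, "ended_at": datetime(2026, 1, 5, 10, 0)}
contacts = [{"member_ref": "m-123", "journey": "password_reset",
             "channel": "voice", "started_at": datetime(2026, 1, 5, 10, 20)}]

print(classify_self_service(session, contacts))
```

Only the first category, completion with no follow-on contact, is evidence of successful deflection. The others are either unresolved demand or demand that moved to a more expensive channel.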

Do not copy every transcript, recording, credential, or financial value into a broadly accessible analytics stream merely because the technology allows it. Use minimized attributes and controlled references where they are sufficient. Keep restricted evidence behind narrower permissions. Availability is not the same as permission.

Put the decision inside the workflow

The last mile determines whether real-time analytics changes performance. An insight that requires an agent to open another application, interpret a graph, and decide what it means has already lost much of its value. Activation belongs in the systems where agents, supervisors, and automated workflows already act.

Four activation patterns cover most of the useful surface area:

Routing: Use intent, journey state, or a potential risk signal to direct the interaction to the appropriate skill. High-risk transactions can be prioritized for specialized handling, but the signal should not silently become a final financial or fraud decision.
Agent guidance: Surface the next relevant step, missing compliance action, or known journey context during the interaction. Explain why the guidance appeared, avoid conflicting prompts, and give the agent a defined way to dismiss or override it.
Supervisor intervention: Alert on a material pattern with an attached playbook. The notification should identify what changed, which interactions are affected, which action is available, and when the alert expires.
Member follow-up: Trigger a relevant message or next step after an unresolved interaction. The follow-up should close a known gap, not merely create another generic communication.

Self-service requires particular care. If balance inquiries or password resets are overwhelming queues, routing eligible demand to self-service may help. But containment is not the same as resolution. Measure whether the member completed the task and whether another contact followed. A journey that exits the IVR but returns through chat has changed channels, not disappeared.

Each activation needs a safe fallback. If identity is uncertain, the signal is stale, or a dependency is unavailable, revert to the normal approved workflow. Do not let a broken analytics path invent a route or compliance step. Log the fallback so operational teams can distinguish a bad recommendation from a recommendation that never reached its destination.
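
The fallback rule can be sketched as a guard in front of the routing decision. In this illustration, the signal is a plain dictionary, timestamps are seconds on a shared clock, and the five-second freshness limit, route names, and reason codes are all assumptions. The function always returns a reason alongside the route so that fallbacks can be logged and counted.

```python
def choose_route(signal, now, max_age_seconds=5.0, default_route="standard_queue"):
    """Return (route, reason); fall back to the approved default when unsafe."""
    if signal is None:
        return default_route, "fallback:dependency_unavailable"
    if not signal.get("identity_verified", False):
        return default_route, "fallback:identity_uncertain"
    if now - signal["created_at"] > max_age_seconds:
        return default_route, "fallback:stale_signal"
    return signal["recommended_route"], "signal_applied"


# Illustrative signal; the route name is invented.
signal = {"identity_verified": True, "created_at": 98.0,
          "recommended_route": "fraud_specialists"}

print(choose_route(signal, now=100.0))
print(choose_route(signal, now=110.0))
print(choose_route(None, now=100.0))
```

Every unsafe condition resolves to the same approved default, so a broken analytics path can degrade the benefit but cannot invent a new route.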

Alert design deserves the same product discipline as customer-facing design. Deduplicate repeated signals, suppress guidance after the relevant action window, and route exceptions to a named owner. A queue full of low-value alerts trains people to ignore the important ones.
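Deduplication and expiry can be sketched in a few lines. The `AlertGate` class, its key format, and the cooldown value are hypothetical; a production system would also need persistence and owner routing.

```python
from dataclasses import dataclass, field


@dataclass
class AlertGate:
    """Suppress repeats of the same alert key within a cooldown and drop expired alerts."""
    cooldown_seconds: float
    _last_sent: dict = field(default_factory=dict)  # alert key -> time last sent

    def should_send(self, key: str, now: float, expires_at: float) -> bool:
        if now >= expires_at:
            return False  # the action window has passed
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # duplicate of a recent alert
        self._last_sent[key] = now
        return True


gate = AlertGate(cooldown_seconds=300)
first = gate.should_send("queue-7:auth_failures", now=0, expires_at=600)     # sent
repeat = gate.should_send("queue-7:auth_failures", now=100, expires_at=600)  # suppressed
```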

The technology choice comes after these workflow requirements. CRM integration should carry member and journey context forward, while the analytics layer captures behavior and evaluates interventions. Products such as Amplitude, Pendo, and Intercom may instrument digital touchpoints, but the build-versus-buy decision should turn on your decision contracts: identity reconciliation, activation latency, workflow integrations, experimentation, access control, auditability, and operational reliability.

I would not approve a platform solely because its dashboards are polished. Ask the vendor or internal platform team to demonstrate an end-to-end loop using one of your journeys: signal received, decision evaluated, workflow changed, outcome captured, and audit record produced. That sequence is the product you are buying or building.

Measure outcomes, experiment carefully, and govern the loop

Real-time analytics does not reduce operating cost by itself. It changes a decision, which changes a journey, which may change demand and resolution. Your measurement model has to preserve that chain.

Use a scorecard that separates outcomes from activity

Choose a primary outcome that matches the journey. Useful candidates include first-contact resolution, repeat-contact reduction, containment that is verified by task completion, and average time to resolution. Define the eligible population and exclusions explicitly so the metric cannot drift when channel mix changes.

Then organize the remaining measures by purpose:

Journey outcome: Was the member’s need resolved, and did it stay resolved?
Operational mechanism: Did transfers, escalations, routing failures, or authentication failures change?
Intervention delivery: Was the recommendation generated, delivered in time, accepted, dismissed, or overridden?
Experience and compliance guardrails: Were required steps completed, and did complaints, corrections, or manual exceptions increase?
System health: Was the signal complete, timely, correctly joined to the journey, and available when the workflow needed it?

Average handle time can be diagnostic, but it should not become the automatic objective. A shorter interaction that leaves the member unresolved may simply move cost into a repeat contact. Resolution and repeat demand tell you whether the system removed work or postponed it.

Test the intervention, not the existence of the data

Controlled experiments can show whether a changed IVR path, authentication step, or post-contact follow-up improves the chosen outcome. Define the minimum detectable effect before the test so the team knows which improvement would justify a decision and whether the eligible volume can support a useful result.
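To check whether eligible volume can support a given minimum detectable effect, a rough sample-size estimate is often enough. The sketch below assumes a binary outcome such as first-contact resolution, a two-sided test, and the standard normal approximation for comparing two proportions. The baseline and lift in the example are made-up numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate units per arm needed to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical inputs: 70% baseline resolution, 3-point absolute improvement.
n = sample_size_per_arm(baseline=0.70, mde=0.03)
```

With these illustrative inputs the estimate is roughly 3,500 units per arm. Halving the detectable effect roughly quadruples the requirement, which is why the effect size should be agreed before the test starts.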

Choose the unit of assignment deliberately. If the same member can return during the measurement window, assigning different experiences by interaction can contaminate the comparison. A member-level assignment may be cleaner. If the intervention changes an entire queue or supervisor workflow, individual assignment may be impractical; use a rollout design that reflects how the operation actually works.
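One common way to get member-level assignment is to hash a stable member identifier together with the experiment name, so a returning member always sees the same experience. This is a sketch under that assumption; the function name and identifiers are illustrative.

```python
import hashlib


def assign_variant(member_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a member to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same member gets the same arm on every contact within this experiment.
first_contact = assign_variant("member-123", "ivr-auth-step")
repeat_contact = assign_variant("member-123", "ivr-auth-step")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a member in one test's treatment arm is not automatically in the next test's treatment arm.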

Do not randomize away mandatory compliance controls. When an intervention affects fraud handling, sensitive disclosures, or consequential routing, begin in observe-only mode, review false positives and overrides, and use an approved rollout. Experiment with the delivery or operational design only where compliance and legal owners confirm that variation is permissible.

Make governance part of the product

Privacy and compliance cannot sit downstream of activation. A real-time system makes decisions from live member behavior, so access controls, consent management, and audit trails belong in the initial architecture.

For every decision contract, document the permitted purpose of the data, who can access it, where it is retained, how consent is honored, what enters the audit record, and who approves changes. Do not infer that an attribute is lawful to use because it exists in the CRM. The relevant compliance and legal owners must determine acceptable use for the jurisdiction, product, and member context.

Auditability should reach beyond data access. Preserve enough context to reconstruct what signal arrived, which rule or model version evaluated it, what action was recommended, what the workflow did, whether a person overrode it, and what outcome followed. That record supports incident investigation, performance review, and defensible change management.
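A record with that reach might look like the following sketch. The field names are assumptions chosen to mirror the list above; the retention location, identifiers, and timestamp format would come from your own decision contract.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DecisionAuditRecord:
    signal_id: str
    signal_received_at: str        # ISO 8601 timestamp
    evaluator_version: str         # rule or model version that evaluated the signal
    recommended_action: str
    workflow_action: str           # what the workflow actually did
    overridden_by: Optional[str]   # person who overrode the recommendation, if any
    outcome: Optional[str]         # filled in once the outcome is known

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionAuditRecord(
    signal_id="sig-1",
    signal_received_at="2026-01-23T10:00:00Z",
    evaluator_version="rules-v3",
    recommended_action="route_to_specialist",
    workflow_action="route_to_specialist",
    overridden_by=None,
    outcome=None,
)
# Records are immutable; attaching the outcome creates a new record.
closed = replace(record, outcome="resolved")
```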

Run the operating cadence through a product trio spanning operations, data, and compliance. In each review, ask which decisions fired, which arrived too late, which actions were ignored, which outcomes changed, and which guardrails moved. Retire noisy signals. Refine ambiguous definitions. Promote successful interventions gradually. This keeps the program focused on decision quality instead of dashboard volume.

Your next step is small and concrete: choose the highest-cost or highest-friction journey among the initial five, write its decision contract, and run the signal in observe-only mode. When the team can trace the path from trigger to approved action to outcome, activate the narrowest useful intervention. Expand only after that loop is measurable, reliable, and governable.

References

Shivam.Consulting Blog – Stop Drowning in Dashboards: Real-Time Digital Analytics for Finserv Contact Centers

January 23, 2026

Agentic AI for Construction Tendering: A Product Playbook

Your tender inbox contains a deadline, a stack of attachments, and a chain of decisions that still lives in people’s heads. The tempting response is to buy a model that can read PDFs. That solves only the most visible part of the problem.

A useful tendering product must determine which documents matter, extract requirements with evidence, match those requirements to a catalog, retrieve approved pricing, draft an offer, identify uncertainty, and route exceptions before anything reaches the customer. If you lead AI or product strategy for a manufacturer or supplier, your first goal should not be an autonomous bidder. It should be the smallest tender category in which every decision can be observed, evaluated, and improved.

Pick a bounded quote, not the entire tendering department

Construction tendering is too broad for a credible first release. Product categories have different terminology, selection rules, catalogs, pricing structures, and exception patterns. A system that works for one bounded category has not automatically learned how to quote every building product.

One effective wedge started with radiator requests for a single design partner before expanding to other building products. That constraint made the catalog, expected outputs, and expert reviewers knowable. It also created a place to learn the real workflow rather than designing from an idealized process diagram.

Choose your wedge using operational criteria, not enthusiasm for the model:

The request appears often enough for reviewers to recognize recurring patterns.
The relevant catalog is bounded and maintained by a clear owner.
A domain expert can explain why a product is suitable, unsuitable, or uncertain.
The correct price can be traced to an approved system or document.
Historical tenders and reviewer corrections are available for evaluation.
An error can be caught during review before it becomes a customer-facing commitment.

A design partner is especially valuable because the work is not fully documented. In one implementation, the product team spent a week observing the process on-site. That kind of observation exposes the browser tabs, informal checks, catalog shortcuts, and exception handling that an interview alone can miss.

Follow one tender from the incoming email to the final offer. At every handoff, record five things: the input, the decision, the evidence used, the person or system responsible, and the condition that triggers an exception. If you cannot state those five things, you do not yet have a well-defined agent task.
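The five-item check can be made mechanical during discovery. In this sketch the `Handoff` type and the example values are hypothetical; the point is that a handoff with a blank field is flagged as not yet well defined.

```python
from dataclasses import dataclass, fields


@dataclass
class Handoff:
    input: str
    decision: str
    evidence: str
    owner: str
    exception_trigger: str

    def missing(self) -> list[str]:
        """Names of the five items that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


handoff = Handoff(
    input="tender email with attachments",
    decision="is the request in the supported category?",
    evidence="product terms in the specification",
    owner="inside sales",
    exception_trigger="",  # nobody could state when this step escalates
)
gaps = handoff.missing()
```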

Keep three layers of information separate from the beginning:

Stated requirements: what the tender explicitly asks for, with the originating file, page, section, or table cell.
Interpretations: conclusions the system or reviewer draws when terminology is ambiguous, incomplete, or inconsistent.
Commercial decisions: the selected product, approved price, assumptions, exclusions, and offer language.

This separation matters because a polished offer can hide a weak inference. A reviewer needs to see where the tender ends and the system’s judgment begins.

Define the first outcome as a review-ready tender case: organized source documents, structured requirements, proposed product matches, price provenance, unresolved issues, and a draft offer. That is a more useful product boundary than “understands construction PDFs.” It gives the reviewer something concrete to accept, correct, or reject.

Turn the workflow into a decision graph with specialist agents

A chatbot is the wrong mental model. Tendering is a decision graph in which an early classification or extraction error can contaminate every downstream step. Real packages can range from a short request to more than 1,800 pages describing an entire building. The system therefore needs to plan work, retain state, reconcile evidence, and know when coverage is incomplete.

Use agents only where a task requires interpretation, planning, or exception handling. Keep exact operations – file handling, arithmetic, catalog queries, identifiers, template validation, and access control – in deterministic code or approved systems.

Stage	Preferred control	Required output
Intake	Rules plus a classifier	A tender case containing the email, attachments, document types, and routing status
Requirement extraction	Parser tools plus a specialist agent	Structured requirements with source locations, missing fields, and ambiguities
Product matching	Catalog retrieval plus a reasoning agent	Candidate products, requirement coverage, incompatibilities, and rationale
Pricing	Approved database, CPQ, or pricing service	Exact product-price records with source and validity information
Offer generation	Controlled template plus a drafting agent	A draft that distinguishes confirmed facts, assumptions, and exclusions
Quality review	Rules plus a separate review agent	A pass or block decision with issue codes and supporting evidence
Human approval	Domain and commercial policy	An approval, correction, rejection, or escalation that becomes evaluation data

Each agent should have an explicit contract. Specify its permitted inputs, tools, output schema, evidence requirements, completion test, and escalation behavior. “Find the right product” is not a contract. “Return catalog candidates that meet the extracted requirements, identify uncovered requirements, cite the catalog evidence, and abstain when no candidate qualifies” is much closer to one.
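The second contract can be expressed as an output schema that makes abstention a first-class result. The sketch below replaces the reasoning agent with a plain exact-match filter, purely to show the shape of the contract; the types, attribute names, and SKU values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MatchResult:
    candidates: list[str]
    uncovered: dict[str, list[str]]  # rejected SKU -> requirements it does not meet
    evidence: dict[str, str]         # candidate SKU -> catalog record cited
    abstained: bool


def match_products(requirements: dict[str, str],
                   catalog: dict[str, dict[str, str]]) -> MatchResult:
    """Return catalog candidates that meet every extracted requirement, or abstain."""
    candidates, uncovered, evidence = [], {}, {}
    for sku, attrs in catalog.items():
        gaps = [name for name, value in requirements.items() if attrs.get(name) != value]
        if gaps:
            uncovered[sku] = gaps
        else:
            candidates.append(sku)
            evidence[sku] = f"catalog:{sku}"
    # Abstain instead of returning the "closest" product when nothing qualifies.
    return MatchResult(candidates, uncovered, evidence, abstained=not candidates)


catalog = {
    "RAD-600": {"height_mm": "600", "type": "22"},
    "RAD-900": {"height_mm": "900", "type": "22"},
}
result = match_products({"height_mm": "600", "type": "22"}, catalog)
no_match = match_products({"height_mm": "300"}, catalog)
```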

For large document sets, require the workflow to maintain a task plan. It should inventory files, identify relevant sections, process bounded units of work, track completed and pending units, reconcile repeated or conflicting requirements, and run a final coverage check. A generated answer is not proof that the package was fully processed.
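A task plan does not need to be elaborate to be useful. This sketch tracks bounded units of work and refuses to report completion while any unit is pending; the unit identifiers are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPlan:
    units: dict[str, str] = field(default_factory=dict)  # unit id -> "pending" | "done"

    def add_unit(self, unit_id: str) -> None:
        self.units.setdefault(unit_id, "pending")

    def mark_done(self, unit_id: str) -> None:
        if unit_id not in self.units:
            raise KeyError(f"unknown unit: {unit_id}")
        self.units[unit_id] = "done"

    def pending(self) -> list[str]:
        return [u for u, status in self.units.items() if status == "pending"]

    def coverage_complete(self) -> bool:
        # An empty plan is not "complete"; nothing was inventoried.
        return bool(self.units) and not self.pending()


plan = TaskPlan()
for unit in ["spec.pdf:p1-40", "spec.pdf:p41-80", "boq.xlsx:sheet1"]:
    plan.add_unit(unit)
plan.mark_done("spec.pdf:p1-40")
plan.mark_done("boq.xlsx:sheet1")
```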

The review agent deserves its own role. Asking the drafting agent to “check your work” keeps creation and approval inside the same reasoning path. A separate reviewer can inspect the draft against the extracted requirements, catalog evidence, price records, and policy rules. It should return defects and a gate decision rather than silently rewriting the offer. Silent rewriting makes it harder to identify which upstream component failed.
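The report-do-not-rewrite rule is easy to see in a deterministic price check, sketched here. The defect codes, stage names, and draft structure are assumptions; a real review step would also cover requirements, catalog evidence, and policy rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    code: str       # issue code, e.g. "price_mismatch"
    stage: str      # upstream stage the defect points to
    evidence: str
    blocking: bool


def review_offer(draft: dict, price_records: dict[str, float]) -> tuple[str, list[Defect]]:
    """Compare draft line prices with approved records; report defects, never edit the draft."""
    defects = []
    for sku, quoted in draft["lines"].items():
        approved = price_records.get(sku)
        if approved is None:
            defects.append(Defect("price_missing", "pricing",
                                  f"no approved record for {sku}", True))
        elif abs(approved - quoted) > 0.005:
            defects.append(Defect("price_mismatch", "offer_generation",
                                  f"{sku}: quoted {quoted}, approved {approved}", True))
    decision = "block" if any(d.blocking for d in defects) else "pass"
    return decision, defects


draft = {"lines": {"RAD-600": 120.0, "RAD-900": 150.0}}
decision, defects = review_offer(draft, {"RAD-600": 120.0})
```

Because the draft is returned untouched, each defect still points at the stage that produced it.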

This pattern has practical value because a dedicated review agent can catch errors before human review, much like a separate code review step. Independence comes from the reviewer’s task and evidence contract; adding more agent personas without distinct responsibilities only creates orchestration overhead.

The interface is part of the architecture. During discovery, a dedicated web workbench can be more useful than hiding the workflow behind a legacy integration. Put the source document, extracted requirement, proposed match, price evidence, and review issue within the same review path. That gives the product team control over feedback capture and makes the reason for each correction visible. One tendering product used its own web application to iterate toward greater automation rather than beginning as a backend-only integration.

You can still read from and write to existing systems at defined boundaries. The distinction is between integration and dependence: integrate with systems of record for catalogs, prices, customers, and approved quotes, but do not let an inflexible legacy screen determine how reviewers inspect an emerging AI workflow.

Evaluate every decision before judging the complete quote

An end-to-end result tells you whether a tender case failed. It rarely tells you why. If the final product is wrong, the defect may have come from document routing, requirement extraction, catalog retrieval, product reasoning, pricing, drafting, or review. A single overall accuracy score collapses those failure modes into an unactionable number.

Build an evaluation set for each agent contract and retain a smaller end-to-end set for workflow behavior. Per-agent evaluations make changes and regressions easier to localize. The useful measures differ by decision:

Intake: correct document classification, attachment coverage, and routing accuracy for each supported tender type.
Extraction: field-level completeness, exactness for identifiers and numeric values, source-location accuracy, and the rate of unsupported fields.
Product matching: reviewer agreement, requirement coverage, incompatible recommendations, unsupported matches, and appropriate abstention.
Pricing: exact agreement with the approved source, correct product-price association, formula validation, and rejection of unavailable or invalid records.
Offer generation: required-field completeness, consistency with selected products and prices, correct treatment of assumptions, and unsupported statements.
Review: detection of known defects, false blocks on valid cases, issue classification, and evidence quality.

Slice failures by characteristics that change the work: document length, file type, layout, product category, presence of tables, and conflicting or revised requirements. An aggregate score can improve while performance deteriorates on the long or unusual tenders that consume most reviewer attention.
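A small illustration of why slicing matters, using invented evaluation results: the aggregate looks healthy while the long-document slice does not. The record fields are assumptions.

```python
from collections import defaultdict


def accuracy_by_slice(results: list[dict], slice_key: str) -> dict[str, float]:
    """Share of correct results within each value of `slice_key`."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[slice_key]] += 1
        correct[r[slice_key]] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}


# Illustrative data only: four short tenders, two long ones.
results = (
    [{"length": "short", "correct": True}] * 4
    + [{"length": "long", "correct": True}, {"length": "long", "correct": False}]
)
overall = sum(r["correct"] for r in results) / len(results)
by_length = accuracy_by_slice(results, "length")
```

Here the overall score is 5 of 6, yet half of the long tenders fail.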

Use end-to-end measures for the product outcome: review time, correction volume, correction severity, exception rate, percentage of cases that reach the defined completion state, and whether the workflow finishes before its operational deadline. Keep commercial outcomes separate from model correctness. Quote acceptance or win rate can be affected by price, availability, competition, customer relationships, and sales execution; it should not be treated as a clean extraction or matching metric.

Observability must connect those layers. For each tender case, retain the task plan, agent inputs and outputs, tool calls, retrieved catalog or price records, prompt and model versions, gate decisions, latency, failures, and human corrections. Complex agent chains can exceed what generic monitoring exposes, which is why custom tracing and Agent Analytics became necessary in a production tendering workflow.

Capture reviewer feedback as structured data, not only as an edited final document. Store the original output, corrected value, responsible stage, reason code, evidence used, and final disposition. Useful reason codes include missing requirement, incorrect extraction, unsupported product match, pricing issue, unresolved conflict, invalid assumption, and drafting defect.

Do not feed every edit directly back into the system and call it self-learning. A reviewer may change wording for preference, apply customer-specific knowledge, or correct an upstream error in the final draft. Validate the correction, assign it to the right component, and add it to the corresponding evaluation set. That turns human review into controlled learning rather than an untraceable feedback loop.
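The validate-then-assign step might be sketched as follows. The `Correction` fields follow the list above, while the `validated` flag and the `style_preference` reason code are hypothetical additions used to show how preference edits stay out of the evaluation sets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    case_id: str
    original: str
    corrected: str
    responsible_stage: str
    reason_code: str
    validated: bool  # a domain owner confirmed the correction applies beyond this case


def route_corrections(corrections: list[Correction]) -> dict[str, list[Correction]]:
    """Group validated, non-preference corrections by the stage that should learn from them."""
    eval_sets = defaultdict(list)
    for c in corrections:
        if c.validated and c.reason_code != "style_preference":
            eval_sets[c.responsible_stage].append(c)
    return dict(eval_sets)


corrections = [
    Correction("T-1", "height 500", "height 600", "extraction", "incorrect_extraction", True),
    Correction("T-1", "Dear Sir", "Hello", "offer_generation", "style_preference", True),
    Correction("T-2", "RAD-900", "RAD-600", "product_matching", "unsupported_product_match", False),
]
eval_sets = route_corrections(corrections)
```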

Release changes through two gates. First, the modified agent must pass its own evaluation set. Second, the complete workflow must pass the end-to-end set because an improvement in one component can change the assumptions of another. The trace should show exactly which version produced every customer-facing artifact.
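The two gates can be written as a single release check. This sketch assumes every metric is higher-is-better and compares scores against a stored baseline; the metric names and the tolerance parameter are illustrative.

```python
def release_decision(agent_scores: dict[str, float], agent_baseline: dict[str, float],
                     e2e_scores: dict[str, float], e2e_baseline: dict[str, float],
                     tolerance: float = 0.0) -> dict:
    """Gate 1: the changed agent's own evaluation set. Gate 2: the end-to-end set."""

    def regressions(scores: dict[str, float], baseline: dict[str, float]) -> list[str]:
        # A metric missing from the new scores counts as a regression.
        return sorted(m for m, base in baseline.items()
                      if scores.get(m, float("-inf")) < base - tolerance)

    failed = regressions(agent_scores, agent_baseline)
    if failed:
        return {"release": False, "failed_gate": "agent", "regressions": failed}
    failed = regressions(e2e_scores, e2e_baseline)
    if failed:
        return {"release": False, "failed_gate": "end_to_end", "regressions": failed}
    return {"release": True, "failed_gate": None, "regressions": []}


# The extraction agent improved, but the complete workflow got worse.
decision = release_decision(
    agent_scores={"field_completeness": 0.95}, agent_baseline={"field_completeness": 0.92},
    e2e_scores={"completion_rate": 0.80}, e2e_baseline={"completion_rate": 0.85},
)
```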

Earn autonomy one commercial boundary at a time

“Straight-through processing” is incomplete unless you define where the straight-through path ends. Automatically extracting requirements is not the same risk as automatically selecting a product, committing a price, writing to the CPQ, or sending an offer to a customer.

Use an autonomy ladder with an explicit boundary at each stage:

Shadow: the system processes live or representative cases, but its outputs do not affect the operational tender.
Assist: it organizes documents and extracts requirements while a person performs matching, pricing, and drafting.
Draft: it proposes products and produces an offer, but a human must review and approve every case.
Gated processing: it completes predefined internal actions for in-scope cases and sends all exceptions to a reviewer.
External dispatch: it sends an offer without case-by-case approval only when commercial policy explicitly permits that action and every required gate passes.

Eligibility for a higher-autonomy path should be machine-checkable. At minimum, confirm that the product category is in scope, every required document was processed, required fields are present, source evidence is attached, conflicts and revisions are resolved, the product match satisfies its evidence rules, the price comes from an approved valid source, the review agent reports no blocking defect, and the full trace is retained.
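Machine-checkable eligibility can be as simple as one boolean per condition and a function that names every failure. The field names below paraphrase the conditions in the paragraph above; how each flag is computed is left to the upstream stages and is an assumption of this sketch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EligibilityChecks:
    category_in_scope: bool
    all_documents_processed: bool
    required_fields_present: bool
    source_evidence_attached: bool
    conflicts_resolved: bool
    match_meets_evidence_rules: bool
    price_from_approved_valid_source: bool
    no_blocking_review_defect: bool
    trace_retained: bool


def eligibility_failures(checks: EligibilityChecks) -> list[str]:
    """Names of failed checks; an empty list means the case may take the higher-autonomy path."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


checks = EligibilityChecks(
    category_in_scope=True, all_documents_processed=True, required_fields_present=True,
    source_evidence_attached=True, conflicts_resolved=False,
    match_meets_evidence_rules=True, price_from_approved_valid_source=True,
    no_blocking_review_defect=True, trace_retained=True,
)
failures = eligibility_failures(checks)
```

Returning the failed check names, not just a boolean, gives the exception queue a readable reason for every block.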

A wrong product or price can create margin, delivery, contractual, and customer-trust exposure. If an offer may create a binding commitment, keep human approval until the appropriate commercial and legal owners have defined the policy for automatic dispatch. The safe alternative is to automate preparation while preserving approval at the commitment boundary.

Operational controls matter after launch. Give reviewers a visible exception queue, make the reason for every block legible, preserve manual processing when the AI path is unavailable, and provide a way to suspend autonomous actions without disabling access to already processed cases. Assign owners for catalog quality, pricing validity, tender policy, model behavior, and production reliability; otherwise each exception will bounce between teams.

Expansion should follow evidence and customer pull. A request to replace an existing CPQ system is a meaningful product-market signal, but it is also a change in product scope. CPQ replacement introduces responsibilities for quote versions, approval policy, catalog administration, pricing governance, integrations, and records. Treat that request as a roadmap decision, not as proof that the original agent workflow already covers those capabilities.

Key takeaways

Start with one product category, one known workflow, and reviewers who can explain the correct decision.
Optimize first for a review-ready tender case, not an impressive answer from a general chatbot.
Use deterministic systems for exact operations and specialist agents for interpretation, planning, and exceptions.
Require structured outputs, source evidence, explicit completion tests, and an abstain-or-escalate path from every agent.
Evaluate each stage separately, then use end-to-end metrics to measure the operational outcome.
Increase autonomy only when observable eligibility gates protect the next commercial boundary.

At your next roadmap review, put one representative tender on the screen and draw the decision path from email to offer. Name the owner, evidence, pass condition, and exception path for every node. Wherever those are missing, the next task is workflow discovery, not another agent. Once the graph is explicit, you can automate one bounded decision, measure it, and earn the right to automate the next.

References

Shivam.Consulting Blog – From PDFs to Proposals: How Tendos AI’s Agent Swarm Automates Construction Quotes Fast

January 15, 2026