I’ve led multiple AI agent launches, and the single most reliable way I’ve found to ship with confidence is to treat evaluations as a product capability, not a side project. When we make AI quality measurable, predictable, and comparable over time, we move faster, reduce risk, and build trust with customers and stakeholders.
Learn how product managers use AI evaluations to measure agent quality. Covers traces, LLM judges, offline evals, online evals, and how to connect evals to product outcomes.
Why does this matter so much in product management? Because agent quality is only meaningful when it drives adoption, satisfaction, and revenue. I use eval-driven development to align the day-to-day iteration of prompts, policies, and workflows with business outcomes like activation, retention, and Net Recurring Revenue (NRR). That alignment turns AI quality from an abstract notion into a roadmap lever.
First, traces. Traces are the spine of evaluation for agentic AI: they capture inputs, intermediate steps, tools invoked, and final responses. I instrument traces to make reasoning visible—what the agent tried, where it hesitated, and why it chose a path. With that visibility, I can compare prompts, policies, and tools, and I can teach the team to fix the root cause instead of patching symptoms. This is also where Agent Analytics becomes real: we move from anecdotes to observable behavior trends across cohorts and use cases.
Next, LLM judges. I use model-as-judge to score qualities like helpfulness, coherence, or adherence to brand and policy. The trick is calibration. I pair LLM judges with a small, high-quality human-labeled set to ground the scale, then monitor drift as models, prompts, or data shift. LLM judges help me evaluate at speed, but I still spot-check edge cases and highly regulated flows to balance efficiency with risk controls.
Offline evals come first. Before I expose users to changes, I run fixed test suites representing core scenarios, failure modes, and edge cases. I include golden examples, adversarial prompts, and domain-specific queries. Metrics cover task success, factuality, safety, latency, and cost. This is where prompt engineering and retrieval quality are tuned; if I’m using a retrieval-first pipeline, I evaluate evidence quality separately from generation so improvements are attributable and reproducible.
Online evals follow to validate real-world performance. I roll changes out behind feature flags and use A/B testing to compare variants under production conditions. I track conversation outcomes, tool success rates, fallbacks to human support, and user satisfaction. These online signals close the loop on whether an offline improvement actually compounds value in the product—critical for product-led growth.
Connecting evals to product outcomes is non-negotiable. I map quality signals to a driver tree: from per-turn scores (helpfulness, safety, latency) up to session-level outcomes (task completion, deflection, revenue intent), and finally to product KPIs (activation, retention, NRR). With this structure, I can set thresholds for launch gates, prioritize roadmap items that move the biggest levers, and build dashboards that leadership understands at a glance.
A few lessons learned. Start with a minimal but durable test set and grow it as you discover new failure modes. Version everything—prompts, tools, and datasets—so you can reproduce wins. Beware metric drift when you swap models or update prompts. Blend human review where the cost of error is high. Above all, make evaluations part of your AI workflows and sprint rituals so quality improves continuously, not sporadically.
If you’re just getting started, begin with traces and a small offline suite, add LLM judges for scale, then prove impact with a focused online experiment. Within a few cycles, you’ll have a living evaluation system that guides decisions, accelerates delivery, and gives your team—and your customers—confidence in every AI release.
Inspired by this post on Amplitude – Perspectives.
We open-sourced our AI Skills library. Here's what we built, why we built it, and how to use it. I’m sharing the approach we’ve used to move faster with more confidence across product discovery, prototyping, and production—while keeping governance, safety, and measurement front and center.
What we built is a modular, open-source library of “skills” for agentic AI and LLM-powered workflows—things like retrieval and grounding, summarization, classification, tool-use, data enrichment, safety guardrails, and evaluation harnesses. Each skill follows consistent interfaces and conventions so teams can compose them like building blocks, swap implementations without breaking flows, and standardize best practices across products.
Why we built it is simple: we kept rebuilding the same core capabilities across experiments and teams. Standardizing these skills accelerates time-to-value, reduces integration risk, and helps product trios collaborate with a common language. It also lets us scale what works—prompt patterns, eval datasets, telemetry—so every new initiative starts on third base instead of at bat.
How to use it in practice: start by running a quick-start example to see a baseline skill chain in action. Then compose your own flow by selecting skills (for example, retrieval + summarization + tool call), configure them with environment variables and guardrails, and wire in evaluation datasets. From there, instrument the pipeline with metrics so you can compare variants and promote the best-performing chain to your main app or API.
In a typical stack, the library dovetails with analytics and experimentation: ship skill variants behind feature flags, measure impact with A/B testing, and observe runtime behavior with logs and traces. CI/CD hooks let you run evals pre-merge, and production dashboards keep an eye on latency, cost, and outcome quality. This creates a virtuous loop where ideas move from prototype to production with clear evidence.
Common use cases include customer support summarization and triage, lead scoring and enrichment, anomaly detection in product telemetry, and automated content workflows. Because the skills are composable, you can try multiple retrieval-first strategies, swap prompt templates, or add tools (search, RAG, calculators, connectors) without rewriting everything from scratch.
Governance and safety are built in. Guardrails handle PII redaction, content policy checks, and rate limiting; configs make it easy to enforce privacy-by-design; and evaluation harnesses encourage an eval-driven development culture. The result is faster iteration without sacrificing data governance or reliability.
If you want to contribute, add a new skill, improve prompts, share eval datasets, or open an issue with a scenario you want supported. The roadmap focuses on richer retrieval adapters, better test fixtures, and deeper observability so teams can debug and optimize complex chains with confidence.
I’m excited to see how you’ll use the library to accelerate your roadmap. Clone it, run a quick start, and compose your first workflow today—then measure, iterate, and scale what works. I’ll keep sharing patterns, learnings, and updates as we grow the skills catalog and sharpen the tooling.
Inspired by this post on Amplitude – Perspectives.
In competitive markets, I see two options: try to win the game competitors set, or choose to play a different game. In the "Customer Agents" category, I’ve watched too many glossy, fabricated demos—especially around voice—mask the real challenges. Voice is just extremely hard. We all know the future of customer experiences will be Agent-driven voice, yet most of us haven’t actually spoken with a modern AI Agent when calling a business because the tech hasn’t been truly ready in the wild. Today, the bar moves.
What changed? There’s a live, public demo of cutting-edge voice tech you can stress test yourself—no smoke, no mirrors. I recommend taking it for a spin: https://fin.ai/voice. It’s fast, natural, and, yes, very, very good.
For context, yesterday brought Apex Flash, their newest and fastest model, built for the unique demands of low latency channels like voice. Today comes Fin Voice 2, a major upgrade to Fin Voice with over 20 new features, and the first product built on Apex Flash.
Here are the three things that stood out to me—and why they matter for customer support AI strategy and product strategy.
First — thanks to Apex Flash, Fin Voice 2 is now the fastest, most natural Agent for phone, with higher resolution rates and customer satisfaction scores than ever before. Apex Flash is trained on millions of customer experience interactions, fine tuned for customer service, and can be configured to understand all your knowledge and follow all your policies. The result is higher resolution at significantly lower latency—the best of both worlds for voice AI agent performance.
Speed and naturalness here aren’t accidental. Most voice AI products are slow because they convert speech to text, send it to a general model, get a text answer, and then convert it back to speech. Fin Voice 2 was designed to work differently, separating the real time layer that handles speech processing, and the layer that generates answers. That architecture is purpose-built for the demands of customer service on voice.
Powered by Apex Flash, Fin Voice 2 raises the bar on quality and speed—boosting resolution rates and guidance following while cutting time to first audio and semantic search latency, with a lift in CSAT too.
Second — Fin Voice 2 can handle complex queries end to end: taking actions in external systems, verifying callers’ identities, processing refunds, booking appointments, and more. Phone is a high-stakes channel, and Fin adapts to customers across emotional states, clarifies when needed, and confirms key details before taking action. Most of the time, Fin can resolve the query in full, and when it can’t, it seamlessly hands off to the human team, maintaining full customer context and history. You also get multiple improvements to call quality, plus proactive outbound calls to follow up on unresolved issues—all orchestrated by robust AI workflows.
Third — Fin Voice 2 gives you total control with industry-leading tools to configure and manage how Fin behaves. You get rich, detailed insights into call behavior and quality, the most common topics of calls, and one-click recommendations to improve. As with everything in Fin, you can fully self-serve and then manage it all with ease, without requiring professional services. Many vendors only let you set up their voice agent under supervision; with Fin, you get everything you need to iterate fast.
If you haven’t tried the demo yet, go check it out: https://fin.ai/voice. If you prefer to wait, don’t be surprised when you end up speaking with it at a favorite brand soon.
From a product management lens, this is what matters: latency is a feature customers feel; transparency builds trust in enterprise AI; and control is non-negotiable for CX leaders. The combination of a purpose-built, agentic AI architecture, measurable gains in resolution and CSAT, and true self-serve configuration signals that voice is moving from prototype theater to production reality. That’s the different game I want our industry to play.
I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.
Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.
Practically, this means I can treat Amplitude analytics as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift—from “insight later” to “insight-to-action now.”
Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.
From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments—not replaces—expert judgment. Data governance and permissions remain front and center: the agent sees what your teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.
If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.
The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.
Inspired by this post on Amplitude – Best Practices.
I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.
I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.
Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.
Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.
We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.
When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.
On May 11th, I started seeing tweets about a TanStack hack. At that time, I didn’t know what TanStack was. But apparently, it’s a popular set of JavaScript libraries that are used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm—malicious software that self-replicates—and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.
If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.
So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.
Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.
Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.
An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.
In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.
You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.
Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.
The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.
Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.
Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.
Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.
A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.
Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.
You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.
Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.
Amid a wave of package-supply attacks, this Product Talk visual launches a 3-part guide to safer AI building—starting with Cowork safety today, then Claude code config next week, and off-device development coming soon.
What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.
Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.
One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.
To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.
My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.
If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.
I’ve learned that the fastest way to unlock better AI outcomes is to understand how the system reasons, then partner with it deliberately. In product organizations, that means treating AI like a capable collaborator with a transparent process, clear inputs, rigorous checks, and measurable success criteria. When I work this way, my teams ship insights and experiments faster—and with far fewer surprises.
Discover how Amplitude AI thinks and best practices for working with it. Partner with AI at each step of its process for more accurate, actionable outputs.
Here’s the mental model I use. AI moves through a series of steps: clarify the goal, ingest context, retrieve and rank relevant information, reason through candidate solutions, draft an answer, self-critique, and refine. My job is to actively guide each step. I define the objective precisely, supply high-signal context, specify constraints, ask for structured reasoning, and require a quality bar before anything ships to stakeholders.
Start by setting intent and success criteria. I write a one-sentence objective (“what problem are we solving now”), then define the evaluation rubric (“what good looks like”) up front. This small habit powers eval-driven development: it keeps AI outputs aligned with product goals, not just plausible-sounding text. I’ll often include target metrics and guardrails, such as confidence thresholds or required evidence from “Amplitude analytics.”
Next, I curate the context. For analytics use cases, I provide event taxonomies, metric definitions, segments, and recent behavioral analytics trends to ground the model. A retrieval-first pipeline helps here: I scope the corpus, trim noise, and apply context window management so the model sees only what’s essential. The result is sharper, faster answers that map to our real data model and “unified analytics platform.”
Then I shape the prompt. I use concise role framing, 1–3 high-quality exemplars, and explicit constraints (format, length, tone, citation requirements). I also ask the model to show its reasoning with a short, labeled scratchpad and to state uncertainties. This is practical prompt engineering—not magic—designed to make reasoning inspectable and reproducible across “AI workflows.”
When tools are available, I encourage agentic AI patterns: let the system plan, call functions, and iterate. With “Amplitude AI,” I ask it to propose the next best analysis (e.g., segment drill-down, funnel step attribution, or anomaly detection), run it, summarize findings, then reflect on whether the next step changes. If you’re using “Amplitude MCP,” formalize these actions as callable tools so the model can chain them reliably.
Quality is never an afterthought. I build lightweight evaluations into every loop: compare the model’s output against the rubric, check factual grounding, and A/B test alternative prompts for clarity and conversion where appropriate. Over time, these evaluations become our regression suite, giving us confidence as data, prompts, or model versions evolve. This discipline keeps LLMs for product managers aligned with shifting business priorities.
Finally, I turn insights into action. I ask “Amplitude AI” for decision-ready artifacts—clear hypotheses, prioritized opportunities, and concrete next steps owners can execute. I require the model to cite the specific supporting events or segments and to flag assumptions. That last step is crucial: it invites human judgment where it matters and prevents automation from outpacing accountability.
This approach doesn’t slow teams down; it speeds them up with focus. By guiding each step—intent, context, reasoning, tools, and evaluation—you transform AI from a black box into a reliable copilot. The payoff is tangible: clearer insights, faster cycles, and outputs stakeholders trust the first time they see them.
Inspired by this post on Amplitude – Perspectives.
I’ve watched a once-reliable A/B testing playbook buckle under the weight of generative AI. Traffic patterns aren’t stable, LLMs update behind the scenes, prompts evolve weekly, and personalization reshapes cohorts mid-flight. The result is non-stationary data, diluted statistical power, and “wins” that don’t replicate in production. If your experimentation program feels slower, noisier, and less trustworthy, you’re not imagining it—and you’re not alone.
Learn why running more tests isn’t the answer to AI, and the three ways mature teams are shifting their experimentation programs.
First, I’ve shifted from test volume to an evaluation stack—what I call eval-driven development. Instead of defaulting to production A/B tests, we front-load learning with offline evaluations (golden sets, synthetic scenarios), automated regressions on prompts and policies, and pre-production canaries. We size experiments with a clear minimum detectable effect (MDE), use sequential or Bayesian methods to handle drift, and reserve full A/B runs for hypotheses with sufficient power and operational readiness. This layered approach accelerates decisions, reduces traffic waste, and restores trust in effect sizes.
Second, I’ve re-anchored our metrics and governance for AI-era reliability. We define a driver tree that links value creation to guardrail metrics such as latency, hallucination rate, cost per request, safety incidents, and user trust proxies. Persistent holdouts and long-lived control cohorts protect against platform-wide regressions, while anomaly detection highlights model or data shifts before they corrupt reads. Strong instrumentation—behavioral analytics, consistent event semantics, and product telemetry wired into Amplitude analytics—keeps our feedback loop tight and auditable.
Third, we rebuilt rollout mechanics to make delivery experimentation-native. Feature flags, progressive delivery, and targeted canaries let us test safely in production while gating exposure by segment, risk, or policy. Shadow mode and offline replay provide signal before real users see risk. Multi-armed bandits help with exploration when goals are clear and guardrails are enforced, but we resist over-rotating to bandits when measurement is fragile. Tightly integrating experiments into CI/CD and observability shortens the cycle from hypothesis to validated outcome.
In practice, here’s how I operationalize this shift. In 30 days, I audit the backlog, kill or consolidate tests that can’t meet MDE, and establish a minimal evaluation harness for prompts, policies, and safety checks. By 60 days, guardrail metrics are live with persistent holdouts and feature flags across AI surfaces. By 90 days, the team runs a balanced portfolio: offline evals for fast iteration, canaries for risk, and selective A/B testing for strategic bets—supported by continuous discovery to keep hypotheses grounded in real customer needs.
AI didn’t eliminate the need for experimentation; it raised the bar for rigor. By moving from volume to validity, from vanity lifts to guardrailed outcomes, and from monolithic launches to progressive delivery, I’ve seen experimentation regain its edge—fewer false positives, faster cycles, and clearer signal on what truly drives impact. That’s how we turn a brittle testing culture into a resilient, learning system built for LLMs and beyond.
Inspired by this post on Amplitude – Perspectives.
I keep meeting talented product teams who can demo impressive proof-of-concepts but can’t get durable business impact into production. The difference isn’t raw ingenuity—it’s the operating model. As I’ve scaled AI initiatives in my own organization, one sentence has proven painfully accurate: "What the top 1% of AI-native product teams are doing differently – and why most won't catch up without rebuilding the operating model."
When I say “AI operating model,” I mean the end-to-end way we set strategy, discover value, build, ship, govern, and learn—specifically adapted for AI systems. If we try to bolt AI onto a classic software cadence, we stall. If we rebuild our operating model around AI’s unique constraints and compounding advantages, we accelerate.
It starts with strategy. I anchor our portfolio to explicit outcomes, not features—tying every initiative to measurable customer and commercial impact. Driver trees and an opportunity solution tree make tradeoffs transparent, while outcomes vs output OKRs prevent us from celebrating activity over results. This is how empowered product teams earn autonomy without losing alignment on the AI Strategy.
Next is discovery. Continuous discovery reframes “can we ship a model?” into “can we change a behavior or decision with acceptable risk?” I pair customer interviews with in-product telemetry and journey mapping to qualify moments of high value and high frequency. The litmus test: can we describe the target workflow in plain language and simulate success before training models? If not, we’re not ready.
Data foundations come third. A retrieval-first pipeline is now my default, not an afterthought. We invest in data governance, privacy-by-design, and observability so we can explain where answers come from, prove consent, and debug drift. Without trustworthy data and clear lineage, every downstream AI promise is fragile—and your AI readiness is mostly theater.
Then I insist on eval-driven development. Before we optimize prompts or tune models, we define offline and online evals that represent the real task, including safety and “gotcha” cases. We treat prompt engineering, context window management, and agentic AI patterns as hypotheses that must beat a baseline under repeatable tests. This moves debate from opinions to evidence.
Shipping is where most teams quietly stall. We integrate AI into our CI/CD with feature flags, shadow modes, and progressive rollouts, building MLOps into the same platform that runs our services. I watch DORA metrics to keep delivery velocity healthy, but I also watch AI-specific signals—input distribution shifts, response variance, and time-to-mitigation—so we catch regressions before customers do. Platform scalability matters more when inference costs and latency can spike overnight.
Governance isn’t a gate at the end; it’s a runway from the start. We operationalize AI risk management with tiered reviews, model and data cards, and clear escalation paths. The goal is not to slow down, but to reduce surprise—so product managers, engineers, and legal share the same playbook for safety, fairness, and regulatory compliance.
Value capture closes the loop. We connect product metrics to commercial levers like Net Recurring Revenue (NRR) and retention analysis, then shape packaging so customers pay for outcomes, not raw compute. This is where product-led growth meets sales-led growth: we demonstrate value in-product, then arm go-to-market teams with unambiguous proof.
So why are 80% of teams stuck? Three patterns recur: technology FOMO masquerading as strategy, fragmented data that can’t support high-quality retrieval, and a lack of evals that forces decisions by vibes. Add ad hoc governance and you get pilots that impress in slides but wither under real-world variance.
How do the top 1% think differently? They rebuild the operating model first. They position discovery around workflows, not models. They invest in retrieval-first architectures early. They standardize evals. They ship with guardrails. And they treat “learning per week” as a sacred metric—because compounding insight beats sporadic heroics.
If you need a 90-day plan, here’s the sequence I use. Week 1–2: run a content audit of data sources and map the top five repeatable workflows ripe for AI leverage. Week 3–4: define success metrics and offline evals for one beachhead use case. Week 5–8: build the retrieval pipeline, implement prompt baselines, and instrument observability. Week 9–12: ship behind feature flags, run A/B testing with safety thresholds, and iterate on failure cases. By the end, you’ll have a reusable blueprint—not just a demo.
Team design matters. I staff product trios (PM, design, tech lead) with forward deployed engineers or solutions engineering partners who sit with customers. That proximity reduces spec ambiguity and accelerates learning. It also sharpens our product roadmapping and sprint planning because we plan against outcomes, not outputs.
The hardest part is emotional, not technical: letting go of familiar software rituals that don’t serve AI. Once we accept that AI demands a different operating rhythm, progress feels lighter. The top 1% don’t have secret models; they have disciplined systems. Rebuild yours, and the compounding benefits will outpace any single model upgrade.
I keep asking myself a simple, high-stakes question: what does it take to build an AI customer support agent that actually knows when it can't help — and says so?
Recently, I dug into how Jamie Hall (Co-founder & CTO), Xharmagne Carandang, and Rona Wang at Lorikeet are answering that question for enterprises in regulated industries. Their target outcome is refreshingly concrete: an agent that responds like the best customer support you’ve ever had — one that knows you, gets things fixed, and hands off gracefully when it’s out of its depth.
What resonated first was the honesty about early missteps. The team explored reflection tools and information dashboards before a healthcare startup reframed the job-to-be-done with a blunt directive: just help us clear the inbox. The earliest prototype wasn’t flashy — a command-line script spitting out a CSV — yet it paved the way for a scalable, measurable foundation.
Today, the system runs on a dual-agent architecture: a Concierge that handles customer tickets end-to-end, and a Coach that helps customers configure, test, and continuously improve it. That split is more than a technical choice; it’s a product strategy that separates operational resolution from the meta-work of quality, guardrails, and evaluation.
The backbone principle is "AI humility" — defaulting to a human handoff when uncertain. In practice, this isn’t about avoiding responsibility; it’s about preserving trust. When an agent signals uncertainty, it protects brand equity and customer experience while still accelerating the path to resolution.
Lorikeet integrates with Zendesk and Intercom instead of replacing them. That decision respects the entrenched workflows and analytics ecosystems support leaders already rely on, and it reduces adoption friction while enhancing existing queues, macros, and reporting.
The UX has evolved from a workflow builder to a conversational interface — and yet the blank chat box is still hard. Guardrails, prompts, and example-led onboarding help teams get started without forcing them to be prompt engineers. When you’re aiming for low cognitive load, a hybrid of guided steps and conversational nudges works better than a pure canvas.
One of the most nuanced patterns is "resolution in the loop": how human agents unblock the AI without taking over a ticket. Instead of a full manual escalation, humans can provide a targeted nudge — a missing piece of data, a policy citation, a link to a system of record — and let the Concierge finish the job. That collaboration preserves productivity while keeping humans in the quality loop.
Guardrails turned out to be deeply domain-specific — a cannabis company’s support tickets famously broke the team’s first approach. That’s a crucial lesson for regulated industries: policy nuance often lives in the edge cases. Lorikeet responded by making customer-configurable guardrails a first-class capability through the Coach interface.
Even more interesting, they’re flipping the configuration workflow so customers define "what good looks like" before they ever write a standard operating procedure. By anchoring configuration in outcomes and test cases rather than prose SOPs, teams move faster, reduce ambiguity, and get to measurable quality earlier.
The platform leans into eval-driven development: using AI to diagnose failure modes in traces and automatically suggest fixes. A "Trace Diagnosis Agent" surfaces root causes and remediation paths, shrinking the feedback loop from discovery to improvement.
Culturally, the product engineering cadence is customer-obsessed: every engineer asks weekly what they learned from a customer. That lightweight ritual is a forcing function for continuous discovery and keeps prioritization tethered to real-world tickets, not just internal hypotheses.
Here’s how I translate these lessons for any customer support AI strategy in regulated environments. First, ship with opinionated "AI humility" and measure handoffs as a quality feature, not a failure. Second, separate resolution from configuration via a dual-agent architecture so each can evolve independently. Third, integrate where your customers already work (Zendesk, Intercom) to accelerate time-to-value. Fourth, make guardrails domain-native and customer-configurable, and start with evals that define "what good looks like". Finally, invest in trace analysis and automatic fix suggestions to shorten the learning cycle.
If you’re scaling support in healthcare, financial services, or any high-stakes domain, these patterns are practical, defensible, and ready to operationalize. Build the Concierge to resolve, empower the Coach to continuously improve, and let "resolution in the loop" bind humans and agents into one reliable system of service.
AI in customer service is no longer experimental—it’s the standard. In my work leading product and customer experience teams, I’ve seen the shift firsthand, and the stakes have never been higher for getting the foundations right.
Fin’s 2026 Customer Service Transformation Report found that 82% of senior leaders say their teams invested in AI for customer service over the last 12 months, with 87% planning to invest in 2026. Those investments pay off with 24/7 availability, multilingual support, major time savings, and faster resolutions. But there’s an unsung hero behind every AI-first support experience: knowledge management.
A Service Agent is only as good as what we give it to work with. If we’re using an Agent, like Fin, to resolve customer queries end to end, it needs an extensive pool of knowledge to draw from. We have to feed it accurate answers on our product, features, policies, and troubleshooting. Without these, the Agent can’t do its job—and our team ends up handling repetitive queries that should be automated.
A Fin-branded quote pairs with a friendly black-and-white portrait to champion smarter support. It reminds readers that time spent building knowledge and processes today compounds into fewer tickets and smoother operations.
In this guide, I’ll walk you through two phases of the journey. Phase 1 is about building a high-quality knowledge base from scratch or overhauling what you have. Phase 2 is about maintaining, optimizing, and scaling that knowledge so your AI performance keeps compounding over time.
Definition: Knowledge management is the process of creating, organizing, sharing, and maintaining knowledge in your business.
Fin’s quote card blends a friendly headshot with a message to think outside the box and tap new information sources to power an AI knowledge base—ideal inspiration for service teams leveling up knowledge management.
Your help center is the obvious example, but it’s only the tip of the iceberg. Effective knowledge management also means creating resources like FAQs, troubleshooting guides, onboarding and best-practice docs, internal support guidance, and learning materials that cover everything from everyday how‑tos to complex billing and account questions.
It means identifying content gaps—missing troubleshooting steps, unclear policy explanations, outdated feature details, or unanswered edge cases—before your customers find them. It means implementing systems so both your Agent and your support reps can access the right information at the right time. And it means developing processes so your content stays in lockstep with product updates, policy changes, and bug fixes.
From Fin's guide to knowledge management, this monochrome quote card urges teams to test their first deployment themselves so agents feel the same journey customers do, turning insights into faster, higher-quality support.
Your knowledge base now fuels your entire support experience, not just self-serve. It’s the key to accurately answering complex questions, reducing handle time, and delighting customers across channels.
Here’s the blunt truth I share with every team: your Agent is only as strong as what you feed it. A lack of information, messy structure, or stale documentation will tank accuracy and trust. No large language model (LLM) knows your business like you do. It doesn’t understand your customers’ needs, pain points, and use cases. That knowledge is unique to you and your organization, meaning you need to be the one to map it all out and make it available to your Agent.
Equip service agents with a clear playbook for damaged delivery reports. This procedure page outlines when to use the guide, how to verify evidence, and the next action to reorder—ready to test, save, and set live.
Every investment in knowledge also has compounding results. Think of it as a flywheel: when you improve your knowledge base, your Agent solves more cases and generates better data. That data shows you what to add, update, or refine next. The sooner you plant the seeds, the sooner you’ll harvest the returns.
Consider a simple calculation. If it takes 30 minutes to write a troubleshooting article for a common issue, that half hour often saves hours for your support reps, who no longer need to handle that query. You can estimate impact by multiplying the average time to compose a response by the frequency of the query. For customers, multiply the number of customers who ask this question by their average time to resolution to quantify time saved. Then monitor Agent involvement rate, resolution rate, and automation rate to see the compounding effect.
Give every seller instant, trusted answers with an AI-powered knowledge base that unifies docs, FAQs, and playbooks into a single source of truth—accelerating ramp, boosting call confidence, and improving every customer conversation.
Phase 1: Building your knowledge base is about getting your content durable and AI-ready. I start by prioritizing what to include, where to source it, and how to audit and triage before go‑live.
Data-driven tools can surface the right starting points. For example, platforms like Fin can surface knowledge gaps from real customer conversations where help content is missing, unclear, duplicated, or contradictory. A centralized knowledge hub then becomes your single source of truth for both customer-facing and internal content, with audience controls to ensure your Agent only uses the right materials for the right users.
AI elevates service when teams treat deployment as a learning loop. This Fin-branded quote visual introduces our ultimate guide to knowledge management for service agents—iterate from day one to improve customer outcomes and teammate efficiency.
Here’s how I prioritize content for the first wave. Support FAQs come first—billing changes, account updates, feature usage, troubleshooting, and policy questions. I mine the inbox and historical conversations to find the highest-frequency issues and turn them into crisp help articles the Agent can quote.
Next, I build onboarding and setup guides so new customers reach value fast. I collaborate with customer success and product to document the fastest path to “first win,” and I ensure the Agent can reference those steps in chat and in‑product guidance.
Keep your help content fresh. A Fin quote urges support leaders to audit and update their knowledge base so AI assistants and service agents surface accurate answers that genuinely add value.
Then I add troubleshooting and advanced guides for deeper issues and power-user workflows. I pull in product managers, engineering, and success managers to capture deeper diagnostics, known limitations, and recommended workarounds—exactly the details that prevent escalations.
Finally, I create content for specific use cases and customer segments. Different goals and configurations require contextual guidance, so I reflect language customers actually use and tailor examples to their jobs-to-be-done.
Smarter support starts with better knowledge. A testimonial highlights how Fin learns from website and help center content, showing that robust knowledge bases train AI agents, raise accuracy, and yield compounding gains.
When sourcing knowledge, I cast a wide net and consolidate it so the Agent and my team can use it reliably. That includes public help articles and troubleshooting guides; internal runbooks, escalation steps, and policy clarifications; curated snippets for short replies and exceptions; past conversations that expose gaps; relevant website pages; and documents like PDFs and DOCX with selectable text.
Before anything goes live, I run a structured content audit. The goal is twofold: prevent the Agent from learning from outdated information, and expose gaps that will cause escalations. I divide content by product area, assign clear ownership, and set a time‑boxed review window to update, consolidate, or retire content. Shared ownership turns a daunting clean‑up into a manageable sprint.
Why can’t knowledge content be an afterthought? This Fin visual pairs a grayscale portrait with a bold message: great Service Agents rely on a strong, current knowledge base to deliver accurate, evolving support. Explore the guide.
I also walk the customer journey myself—exactly as a new user would—so I can experience the Agent’s responses firsthand and spot missing topics or keywords. Where my platform supports it, I use preview and batch testing to validate coverage across common questions, then simulate more complex workflows to ensure handoffs and steps are properly defined before launch.
After 30 days of Agent activity, I dive into the data. I look for topics driving handoffs to humans, articles correlated with low resolution rates or CSAT, and content that customers view but still escalate. Those signals tell me exactly what to write or refine next—and where to tighten conversation design or retrieval.
Centralize your conversations, customer data, and knowledge in one place to sharpen context and speed resolutions. This Fin graphic pairs a monochrome portrait with a bold pull-quote highlighting unified platforms for better support.
Prioritization is where impact accelerates. I focus first on the content my team shares most: top help articles, troubleshooting steps, onboarding flows, and policies. I study conversation analytics to identify the most common questions, the longest handle times, and the lowest CX scores, then close those gaps with targeted content. I also review high‑view articles that haven’t been updated recently and refresh anything affected by changes to product, policies, or plans.
Resourcing matters. Building a high-performing Service Agent shouldn’t be a side gig. I explicitly allocate weekly time for frontline reps, support specialists, and product partners to work on content requests and knowledge improvements. A 5–10 hour per‑person cadence is a practical baseline, and it doubles as a powerful way to upskill the team for emerging AI roles.
Jumpstart smarter support with the #1 Agent—organize knowledge, speed answers, and automate routine work. Click Start a free trial to see how AI elevates your service team and delivers faster resolutions.
Writing for AI is writing for customers. I train the Agent to mirror the terms our customers use by analyzing search queries and real conversation language. I avoid internal jargon, expand acronyms, and clarify key concepts to eliminate ambiguity. When a topic invites yes/no answers, I restate the question and add the necessary context so the Agent doesn’t misinterpret shorthand. I always pair images or videos with clear explanatory text so the guidance is accessible and machine‑readable. And I structure content for scanning with crisp headings and short sections, avoiding hidden information that requires clicks to reveal.
When I have bite‑size answers—common edge cases, policy clarifications, repetitive high‑volume queries—I collect them into focused internal snippets or compact FAQs so the Agent can retrieve and deliver precise answers quickly.
Phase 2: Knowledge management is where the compounding value kicks in. Once live, I track the metrics that matter: resolution rate (conversations fully resolved by the Agent when it was involved), automation rate (total conversations handled by the Agent across overall volume), time saved (hours of manual work offloaded), Customer Experience (CX) Score comparisons across AI and human conversations, and CSAT parity.
Then I put those learnings to work. Inevitably, some problems won’t be solvable on day one. That’s a gift—it shows me where to refine workflows, add clarifying steps, and strengthen knowledge depth. The richest insights often come from where the Agent struggles or escalates; those friction points become my highest‑ROI content tickets.
Knowledge management is never one‑and‑done. As products, customers, and business goals evolve, so must the knowledge. I formalize an ongoing maintenance cadence with clear ownership, review intervals, and time blocks on the calendar. Wherever possible, I use AI‑assisted drafting to propose updates, summarize gaps, and accelerate review without sacrificing quality.
To sustain momentum, I create a simple intake for content requests—often a lightweight ticket workflow inside our support tools—so anyone in support, success, sales, marketing, engineering, or product can flag gaps and propose improvements. The teams closest to customers usually spot the patterns first; a good intake system ensures we don’t lose those insights.
I also bake knowledge work into every launch plan. New features, product updates, plans, and policies require Agent‑ready content at launch, not after. I partner with product, support, and product marketing to produce best practices and anticipated FAQs in advance, then I review early conversations post‑launch to spot recurring confusion and fast‑follow content needs.
Brand consistency builds trust across every touchpoint. I standardize terminology for products, features, plans, and policies so the Agent, the help center, and human reps all speak the same language. I proof for tone, spelling, and grammar, and I use templates so content feels cohesive. I also include clear contact options for customers who need them—what channel to use, when to use it, and what to expect—so we maintain confidence even when escalation is required.
Clarity about audience matters, too. If certain content applies only to specific roles, plans, or regions, I label it explicitly and, where my platform supports it, target content so the Agent uses the right guidance for the right segment.
Finally, I connect the dots. When conversations, customer data, and knowledge live in one place, every interaction becomes an insight loop. A connected Agent turns support into a retrieval-first pipeline, making it far easier to diagnose issues, improve accuracy, and continuously raise the bar on customer experience.
Behind every high-performing Agent is a rigorous, AI-friendly knowledge management practice. Treating knowledge as a core service function—not a project—creates systems that improve with every conversation. That’s how we transform support from a cost center into a compounding engine for customer satisfaction, operational efficiency, and growth.
I’ve learned the hard way that the fastest path to a reliable command-line agent is radical subtraction. "In the last month of developing Amplitude Wizard CLI, we cut more than we added. Learn less is more when it comes to building CLI agents." That decision was less about minimalism and more about product strategy: constraints sharpen behavior, clarify intent, and raise trust.
When I evaluate agentic AI systems, especially those that act on developer environments, I start by asking what the agent must never do. By establishing hard guardrails first, the design naturally converges on an opinionated, safe, and teachable interface. Every additional flag, tool, or permission expands the blast radius; every removal shortens the path to first success.
For CLI agents, the most valuable product choice is a narrow toolset with sane defaults. Opinionated workflows reduce cognitive load and failure modes, while clear human override points keep users in control. I prefer a bias toward idempotent actions, reversible changes, and explicit confirmation gates for anything destructive. If a feature can’t explain itself in a single, crisp sentence in the help text, it likely doesn’t belong.
Security and reliability flow from limits. Progressive permissioning, scoped credentials, and time-bounded tokens prevent the agent from wandering. Dry-run modes build confidence without side effects. When a user can reason about what the agent will and won’t do, adoption accelerates—and support tickets plummet.
Observability is the other half of trust. I instrument "Agent Analytics" across every run: inputs, tool choices, durations, outcomes, and error patterns. Those signals reveal where the agent gets confused, which steps users abandon, and which prompts need pruning. With that loop in place, "less is more" stops being a philosophy and becomes an evidence-backed operating model.
I anchor the roadmap in eval-driven development. Before adding a capability, I define a measurable task, a success threshold, and the smallest viable interface to reach it. If the capability can’t lift completion rate, time-to-first-success, or re-run stability, it waits. That simple discipline protects the experience from feature creep and preserves velocity in CI/CD.
Under the hood, I design for a retrieval-first pipeline and careful context window management. The agent should fetch only the minimally relevant facts, present a compact plan, and execute predictably. Thoughtful prompt engineering helps—but prompts are not a substitute for clear boundaries, deterministic tool contracts, and robust error handling.
Documentation is product. I maintain docs-as-code with runnable examples that mirror the golden paths. When the docs and the CLI disagree, the CLI changes—never the docs. This creates an internal forcing function: if we can’t document it simply, we probably shouldn’t ship it.
My litmus test for any proposed addition is simple: does this make the mental model smaller? If not, cut it, make it progressive, or hide it behind a clearly named subcommand. Defaults should be boring, safe, and fast. Advanced power should be opt-in and discoverable without overwhelming new users.
The paradox of agentic AI is that capability grows as surface area shrinks. By removing distractions, we amplify signal, increase repeatability, and earn the right to add the next carefully chosen step. The result is a CLI agent that feels sharp, dependable, and—most importantly—useful on day one.
Inspired by this post on Amplitude – Perspectives.
A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.
We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, forty-eight hours for everyone else. Those were considered good numbers.
No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.
An AI Agent closes it completely.
When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.
Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.
Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.
What AI changes
Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.
The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.
The constraint the industry built around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.
If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.
Introduce Fin for Sales to your team with this clean hero banner: bold headline, signature blue spiral, and a clear 'Start free trial' call to action—inviting readers to explore an AI customer agent built for revenue.
SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.
Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.
You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:
1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.
2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.
3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.
These three questions reveal whether the motion is working. Time-to-first-response can’t.
Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.
An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.
The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.
Speed-to-lead made sense when the delay was structural. It isn’t anymore. If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.