I’ve led multiple AI agent launches, and the single most reliable way I’ve found to ship with confidence is to treat evaluations as a product capability, not a side project. When we make AI quality measurable, predictable, and comparable over time, we move faster, reduce risk, and build trust with customers and stakeholders.
Why does this matter so much in product management? Because agent quality is only meaningful when it drives adoption, satisfaction, and revenue. I use eval-driven development to align the day-to-day iteration of prompts, policies, and workflows with business outcomes like activation, retention, and net revenue retention (NRR). That alignment turns AI quality from an abstract notion into a roadmap lever.
First, traces. Traces are the spine of evaluation for agentic AI: they capture inputs, intermediate steps, tools invoked, and final responses. I instrument traces to make reasoning visible—what the agent tried, where it hesitated, and why it chose a path. With that visibility, I can compare prompts, policies, and tools, and I can teach the team to fix the root cause instead of patching symptoms. This is also where Agent Analytics becomes real: we move from anecdotes to observable behavior trends across cohorts and use cases.
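To make the idea of a trace concrete, here is a minimal sketch of what one record might look like. The `Step` and `Trace` classes and their field names are hypothetical, chosen only for illustration; real tracing tools define their own schemas.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One intermediate step in an agent run (hypothetical schema)."""
    kind: str          # e.g. "llm_call" or "tool_call"
    name: str          # prompt or tool name
    ok: bool           # whether the step succeeded
    latency_ms: float  # time spent in this step


@dataclass
class Trace:
    """A full agent run: input, intermediate steps, and final response."""
    trace_id: str
    user_input: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""

    def first_failure(self) -> Optional[Step]:
        """Return the earliest failed step, a starting point for root-cause work."""
        return next((s for s in self.steps if not s.ok), None)

    def total_latency_ms(self) -> float:
        """Sum the latency of every recorded step."""
        return sum(s.latency_ms for s in self.steps)


# Illustrative data only, not taken from a real system.
trace = Trace(
    trace_id="t-001",
    user_input="Where is my order?",
    steps=[
        Step("llm_call", "plan", ok=True, latency_ms=420.0),
        Step("tool_call", "order_lookup", ok=False, latency_ms=1300.0),
        Step("llm_call", "fallback_answer", ok=True, latency_ms=380.0),
    ],
    final_response="I could not find that order.",
)

failed = trace.first_failure()
print(failed.name if failed else "no failures")
```

In this example the final answer looks like a model problem, but the trace points at the `order_lookup` tool call as the first thing that went wrong, which is the kind of root cause a transcript alone would hide.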
Next, LLM judges. I use model-as-judge to score qualities like helpfulness, coherence, or adherence to brand and policy. The trick is calibration. I pair LLM judges with a small, high-quality human-labeled set to ground the scale, then monitor drift as models, prompts, or data shift. LLM judges help me evaluate at speed, but I still spot-check edge cases and highly regulated flows to balance efficiency with risk controls.
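One way to calibrate a judge against a human-labeled set is to measure how often the two agree, and how much of that agreement exceeds chance. The sketch below uses raw agreement and Cohen's kappa; the label lists are made-up illustrations, and the choice of kappa is my assumption rather than a prescribed method.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for eight responses.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(agreement_rate(human_labels, judge_labels))
print(round(cohens_kappa(human_labels, judge_labels), 3))
```

Here raw agreement is 0.75 while kappa is about 0.47, a reminder that a judge can look well aligned on an imbalanced set while agreeing only moderately beyond chance. Re-running the same check after a model or prompt change is one simple way to watch for drift.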
Offline evals come first. Before I expose users to changes, I run fixed test suites representing core scenarios, failure modes, and edge cases. I include golden examples, adversarial prompts, and domain-specific queries. Metrics cover task success, factuality, safety, latency, and cost. This is where prompt engineering and retrieval quality are tuned; if I’m using a retrieval-first pipeline, I evaluate evidence quality separately from generation so improvements are attributable and reproducible.
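To illustrate scoring evidence quality separately from generation, here is a small sketch of an offline suite. The `Case` fields, document IDs, and the exact-match answer check are simplifying assumptions; a real suite would likely use richer graders.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One offline test case with both retrieval and generation results."""
    query: str
    expected_doc_ids: set[str]    # evidence a good retriever should surface
    expected_answer: str          # golden answer
    retrieved_doc_ids: list[str]  # what the pipeline actually retrieved
    answer: str                   # what the pipeline actually generated


def retrieval_recall(case: Case) -> float:
    """Share of expected evidence documents that were retrieved."""
    if not case.expected_doc_ids:
        return 1.0
    hits = case.expected_doc_ids & set(case.retrieved_doc_ids)
    return len(hits) / len(case.expected_doc_ids)


def answer_correct(case: Case) -> bool:
    """Exact match after normalization (a deliberately simple grader)."""
    return case.answer.strip().lower() == case.expected_answer.strip().lower()


def summarize(cases: list[Case]) -> dict[str, float]:
    """Report retrieval and generation quality as separate metrics."""
    n = len(cases)
    return {
        "mean_retrieval_recall": sum(retrieval_recall(c) for c in cases) / n,
        "answer_accuracy": sum(answer_correct(c) for c in cases) / n,
    }


# Illustrative cases only.
suite = [
    Case("refund window?", {"d1", "d2"}, "30 days", ["d1", "d9"], "30 days"),
    Case("support hours?", {"d3"}, "9am to 5pm", ["d3"], "24/7"),
]
print(summarize(suite))
```

Keeping the two numbers apart makes attribution easier: the first case answers correctly despite missing evidence, while the second retrieves the right document and still answers wrong, so the fixes belong in different parts of the pipeline.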
Online evals follow to validate real-world performance. I roll changes out behind feature flags and use A/B testing to compare variants under production conditions. I track conversation outcomes, tool success rates, fallbacks to human support, and user satisfaction. These online signals close the loop on whether an offline improvement actually compounds value in the product—critical for product-led growth.
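For comparing two variants on a binary outcome such as task completion, one common choice is a two-proportion z-test. This is a sketch under that assumption, using only the standard library; the counts are invented for illustration, and teams may prefer sequential or Bayesian methods instead.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts: completed tasks out of sessions for control and variant.
z, p = two_proportion_z_test(success_a=400, n_a=1000, success_b=450, n_b=1000)
print(round(z, 2), round(p, 3))
```

A small p-value here only says the difference is unlikely under the null; whether the lift matters still depends on the session-level and product-level outcomes being tracked alongside it.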
Connecting evals to product outcomes is non-negotiable. I map quality signals to a driver tree: from per-turn scores (helpfulness, safety, latency) up to session-level outcomes (task completion, deflection, revenue intent), and finally to product KPIs (activation, retention, NRR). With this structure, I can set thresholds for launch gates, prioritize roadmap items that move the biggest levers, and build dashboards that leadership understands at a glance.
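A launch gate can be as simple as a table of thresholds checked against the latest eval run. The metric names and threshold values below are hypothetical placeholders, not recommendations.

```python
# Hypothetical gates: metric name -> (direction, threshold).
# "min" means the value must be at least the threshold; "max" means at most.
GATES = {
    "helpfulness": ("min", 0.80),
    "safety_violation_rate": ("max", 0.01),
    "p95_latency_s": ("max", 3.0),
}


def failed_gates(metrics: dict[str, float], gates: dict = GATES) -> list[str]:
    """Return the names of gates that the given metrics do not satisfy."""
    failures = []
    for name, (direction, threshold) in gates.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)  # a missing metric is treated as a failure
        elif direction == "min" and value < threshold:
            failures.append(name)
        elif direction == "max" and value > threshold:
            failures.append(name)
    return failures


# Illustrative eval results for a release candidate.
candidate = {"helpfulness": 0.85, "safety_violation_rate": 0.02, "p95_latency_s": 2.1}
blocked_by = failed_gates(candidate)
print("ship" if not blocked_by else f"blocked by: {blocked_by}")
```

Treating a missing metric as a failure is a deliberate design choice in this sketch: a gate that silently passes when instrumentation breaks is worse than one that blocks a release.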
A few lessons learned. Start with a minimal but durable test set and grow it as you discover new failure modes. Version everything—prompts, tools, and datasets—so you can reproduce wins. Beware metric drift when you swap models or update prompts. Blend human review where the cost of error is high. Above all, make evaluations part of your AI workflows and sprint rituals so quality improves continuously, not sporadically.
If you’re just getting started, begin with traces and a small offline suite, add LLM judges for scale, then prove impact with a focused online experiment. Within a few cycles, you’ll have a living evaluation system that guides decisions, accelerates delivery, and gives your team—and your customers—confidence in every AI release.
Inspired by this post on Amplitude – Perspectives.
I spend a lot of my time asking a deceptively simple question: what does excellent marketing actually look like in 2026? From the vantage point of product leadership, the answer isn’t a spreadsheet or a channel plan—it’s a feeling. Beloved tech brands earn the benefit of the doubt, create gravity around their roadmap, and make customers proud to belong. That kind of momentum is not an accident; it’s a system.
Here’s the hard truth I’ve learned building and scaling products: giving teams different goals creates dysfunction. When brand, demand gen, product marketing, and comms run on fragmented OKRs, you manufacture internal headwinds. “Marketing is one engine – not separate pieces.” One strategy, one narrative, one set of outcomes—expressed through different craft disciplines and time horizons.
That unity of purpose clarifies executive roles, too. The real difference between an SVP and a CMO is scope and narrative ownership. A great CMO architects the whole system—portfolio allocation, brand architecture, integrated go-to-market strategy, and the bar for creative taste—while refusing to get dragged into decisions they should never be making (for example, approving every headline or micromanaging channel tactics). Leaders should decide the outcomes, standards, and constraints; teams should control the craft.
On portfolio design, I run marketing like a portfolio of moonshots. You need a healthy mix: proven programs that compound, emergent bets that learn fast, and a small set of true moonshots that can change the slope of the curve. The point isn’t bravado; it’s risk-balanced exploration. If everything ships safely, you’re under-investing in differentiation. If everything is a swing for the fences, you’re not building a repeatable growth engine.
This is where taste becomes a strategic advantage. “Ubiquity is the opposite of cool.” If you want to be beloved, you cannot treat every channel, audience, and moment as equal. Early on, selective distribution, distinctive creative codes, and tight community loops create status and meaning. Later, you scale without sanding off the edges that made the product special.
Why do a few companies build a flywheel of momentum while others stall? They align story, product, and distribution. The product earns trust, the narrative creates aspiration, and the go-to-market strategy ensures the right customers experience both at the right time. Then perception cycles kick in—the Silicon Valley clock turns—and irrational optimism or skepticism can amplify signals. The antidote is compounding proof: consistent product shipping, community advocacy, and creative that makes people care.
Scaling taste across an organization is teachable. I codify brand principles, narrative guardrails, and examples of “right” versus “almost right.” I replace abstract feedback with decision rubrics—what we keep, kill, or revise and why. I run recurring creative reviews with a small cross-functional council, so judgment compounds. Taste can’t be fully automated, but it can be operationalized: shared references, a story bible, and a high bar for craft that’s explicit, not mystical.
In a post-LLM world, the fundamentals haven’t changed—but the frontier has. Generative tools supercharge iteration and research, yet the artistry never really left. You still need a point of view, a tension worth resolving, and a value proposition that’s felt, not just stated. Can taste be encoded in software? Parts of it—pattern libraries, style constraints, data-driven feedback—absolutely. But the spark that makes work unforgettable remains human: judgment, risk tolerance, and the courage to ship something that might not fit the playbook.
That’s why telling an optimistic yet realistic story about AI matters. Over-automation drains humanity; under-automation wastes potential. The best work pairs AI strategy with craft leadership: LLMs for rapid exploration, humans for narrative decisions and ethical judgment. Your message should show how AI expands customer agency, not just efficiency.
The brand-versus-growth debate is a false choice. The right story accelerates pipeline, and the right demand programs reinforce the brand. Look at Apple’s discipline around product truth and design codes, or Google Chrome’s “The Web Is What You Make of It (Dear Sophie)” for proof that emotion and utility can co-exist. Notion, Pinterest, Square, HubSpot, and Harley-Davidson show how community, identity, and product-led growth interlock when the company knows exactly what it stands for.
When it comes to launches, I’ve learned that announcement videos full of humans can still lack humanity. Overproduced gloss often dilutes the truth customers seek: what problem does this solve, how quickly can I feel the value, and why does it matter now? Real users, real context, and a crisp arc from problem to promise will outperform most theatrics.
Practically, I architect my week to protect taste and outcomes. Early-week for strategy, portfolio reviews, and cross-functional alignment; mid-week for deep creative and product marketing work; late-week for decision clears and postmortems. I time-box “disruptive energy”—space to chase non-obvious ideas—and I guard it like any critical meeting. Without protected cycles for exploration, the urgent will always suffocate the important.
If there’s a single takeaway: playbooks are obsolete, but the fundamentals are not. The channels change; the psychology doesn’t. Run one engine. Allocate a true portfolio. Scale taste with rigor. In the AI era, make people care. That’s how beloved tech brands are built—and how they endure.
I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.
I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.
Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.
Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.
We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.
When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.
On May 11th, I started seeing tweets about a TanStack hack. At the time, I didn’t know what TanStack was; it turns out to be a popular set of JavaScript libraries used by many React sites. At first, I didn’t pay much attention. Then I learned the packages had been compromised by a worm (malicious software that self-replicates), and that it was spreading quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.
If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.
So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.
Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.
Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.
An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.
In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
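Since install scripts are the entry point described above, one small hardening step is to see which installed packages declare them. The sketch below walks a project folder and reports any `package.json` that defines npm's `preinstall`, `install`, or `postinstall` lifecycle scripts. The function name and the idea of running this as a review step are my assumptions; it is a heuristic for visibility, not a malware detector.

```python
import json
from pathlib import Path

# npm lifecycle scripts that run automatically when a package is installed.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def packages_with_install_scripts(root: str) -> dict[str, dict[str, str]]:
    """Map package name -> install-time scripts found under the given folder."""
    findings: dict[str, dict[str, str]] = {}
    for manifest in Path(root).rglob("package.json"):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest; skip it
        if not isinstance(data, dict):
            continue
        scripts = data.get("scripts")
        if not isinstance(scripts, dict):
            continue
        hooks = {h: scripts[h] for h in INSTALL_HOOKS if h in scripts}
        if hooks:
            # Fall back to the folder name if the manifest has no "name" field.
            findings[str(data.get("name", manifest.parent.name))] = hooks
    return findings


if __name__ == "__main__":
    for name, hooks in packages_with_install_scripts("node_modules").items():
        print(name, hooks)
```

Many legitimate packages use install scripts, so a hit is a prompt to look closer rather than proof of compromise. npm also offers an `--ignore-scripts` option on install that skips these scripts entirely, at the cost of breaking packages that genuinely need them.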
Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.
You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.
Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.
The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.
Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.
Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.
Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.
A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.
Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
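The first of those two questions can be partly automated. Below is a rough sketch that flags files in a folder whose names or contents look like credentials before the folder is shared. The filename list and the regular expression are hypothetical heuristics I picked for illustration; they will miss real secrets and flag harmless files, so treat the output as a prompt for manual review.

```python
import re
from pathlib import Path

# Hypothetical heuristics: filenames and suffixes that often hold credentials.
SUSPICIOUS_NAMES = {".env", ".npmrc", ".netrc", "id_rsa", "id_ed25519", "credentials"}
SUSPICIOUS_SUFFIXES = {".pem", ".key", ".p12"}
# Matches lines such as "api_key: abc123" or "PASSWORD=hunter2".
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|token|password)\s*[=:]\s*\S+", re.IGNORECASE
)


def risky_files(folder: str, max_bytes: int = 1_000_000) -> list[tuple[str, str]]:
    """Return (path, reason) pairs for files that deserve a look before sharing."""
    flagged = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        if path.name in SUSPICIOUS_NAMES or path.suffix in SUSPICIOUS_SUFFIXES:
            flagged.append((str(path), "suspicious filename"))
            continue
        try:
            if path.stat().st_size > max_bytes:
                continue  # skip large files to keep the scan quick
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if SECRET_PATTERN.search(text):
            flagged.append((str(path), "secret-like content"))
    return flagged


if __name__ == "__main__":
    for path, reason in risky_files("."):
        print(f"{reason}: {path}")
```

An empty result does not mean the folder is safe to mount. The second question, whether the data itself is proprietary, still needs a human answer.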
It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.
You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.
Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.
Amid a wave of package-supply attacks, this Product Talk visual launches a 3-part guide to safer AI building, starting with Cowork safety today, then Claude Code config next week, and off-device development coming soon.
What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.
Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.
One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.
To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.
My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.
If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.
I’ve learned that the fastest way to unlock better AI outcomes is to understand how the system reasons, then partner with it deliberately. In product organizations, that means treating AI like a capable collaborator with a transparent process, clear inputs, rigorous checks, and measurable success criteria. When I work this way, my teams ship insights and experiments faster—and with far fewer surprises.
Here’s the mental model I use. AI moves through a series of steps: clarify the goal, ingest context, retrieve and rank relevant information, reason through candidate solutions, draft an answer, self-critique, and refine. My job is to actively guide each step. I define the objective precisely, supply high-signal context, specify constraints, ask for structured reasoning, and require a quality bar before anything ships to stakeholders.
Start by setting intent and success criteria. I write a one-sentence objective (“what problem are we solving now”), then define the evaluation rubric (“what good looks like”) up front. This small habit powers eval-driven development: it keeps AI outputs aligned with product goals, not just plausible-sounding text. I’ll often include target metrics and guardrails, such as confidence thresholds or required evidence from Amplitude analytics.
Next, I curate the context. For analytics use cases, I provide event taxonomies, metric definitions, segments, and recent behavioral analytics trends to ground the model. A retrieval-first pipeline helps here: I scope the corpus, trim noise, and apply context window management so the model sees only what’s essential. The result is sharper, faster answers that map to our real data model and unified analytics platform.
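As a sketch of context window management, the snippet below keeps the most relevant pieces of context that fit within a token budget. The relevance scores, the budget, and the whitespace-based token estimate are all assumptions for illustration; a real pipeline would use the model's own tokenizer and a proper ranker.

```python
def estimate_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


def select_context(snippets: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring snippets that fit within the budget.

    Each snippet is a (relevance_score, text) pair.
    """
    chosen: list[str] = []
    used = 0
    for _score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen


# Illustrative snippets with made-up relevance scores.
candidates = [
    (0.9, "signup_completed fires after email verification"),
    (0.5, "legacy dashboard notes from an older tracking plan that is no longer used"),
    (0.8, "activation means three sessions in seven days"),
]
print(select_context(candidates, budget_tokens=14))
```

With a budget of 14 estimated tokens, the two high-relevance definitions fit and the long legacy note is dropped, which is the "trim noise" step in miniature.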
Then I shape the prompt. I use concise role framing, 1–3 high-quality exemplars, and explicit constraints (format, length, tone, citation requirements). I also ask the model to show its reasoning with a short, labeled scratchpad and to state uncertainties. This is practical prompt engineering, not magic, designed to make reasoning inspectable and reproducible across AI workflows.
When tools are available, I encourage agentic AI patterns: let the system plan, call functions, and iterate. With Amplitude AI, I ask it to propose the next best analysis (e.g., segment drill-down, funnel step attribution, or anomaly detection), run it, summarize findings, then reflect on whether the next step changes. If you’re using Amplitude MCP, formalize these actions as callable tools so the model can chain them reliably.
Quality is never an afterthought. I build lightweight evaluations into every loop: compare the model’s output against the rubric, check factual grounding, and A/B test alternative prompts for clarity and conversion where appropriate. Over time, these evaluations become our regression suite, giving us confidence as data, prompts, or model versions evolve. This discipline keeps LLM-assisted product work aligned with shifting business priorities.
Finally, I turn insights into action. I ask Amplitude AI for decision-ready artifacts: clear hypotheses, prioritized opportunities, and concrete next steps that owners can execute. I require the model to cite the specific supporting events or segments and to flag assumptions. That last step is crucial: it invites human judgment where it matters and prevents automation from outpacing accountability.
This approach doesn’t slow teams down; it speeds them up with focus. By guiding each step—intent, context, reasoning, tools, and evaluation—you transform AI from a black box into a reliable copilot. The payoff is tangible: clearer insights, faster cycles, and outputs stakeholders trust the first time they see them.
Inspired by this post on Amplitude – Perspectives.
I’ve watched a once-reliable A/B testing playbook buckle under the weight of generative AI. Traffic patterns aren’t stable, LLMs update behind the scenes, prompts evolve weekly, and personalization reshapes cohorts mid-flight. The result is non-stationary data, diluted statistical power, and “wins” that don’t replicate in production. If your experimentation program feels slower, noisier, and less trustworthy, you’re not imagining it—and you’re not alone.
First, I’ve shifted from test volume to an evaluation stack—what I call eval-driven development. Instead of defaulting to production A/B tests, we front-load learning with offline evaluations (golden sets, synthetic scenarios), automated regressions on prompts and policies, and pre-production canaries. We size experiments with a clear minimum detectable effect (MDE), use sequential or Bayesian methods to handle drift, and reserve full A/B runs for hypotheses with sufficient power and operational readiness. This layered approach accelerates decisions, reduces traffic waste, and restores trust in effect sizes.
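To show how an MDE translates into traffic, here is a sketch of the standard sample-size approximation for comparing two proportions. It assumes a fixed-horizon, two-sided test with equal-sized arms, which is a simplification of the sequential and Bayesian methods mentioned above; the baseline and MDE values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float, mde_abs: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative: 10% baseline conversion, detect a 2-point absolute lift.
n = sample_size_per_arm(baseline=0.10, mde_abs=0.02)
print(n)
```

With these inputs the estimate comes out a bit over 3,800 users per arm, and halving the MDE roughly quadruples it. That scaling is why tests that cannot reach their MDE within a reasonable window are better killed or consolidated than left running.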
Second, I’ve re-anchored our metrics and governance for AI-era reliability. We define a driver tree that links value creation to guardrail metrics such as latency, hallucination rate, cost per request, safety incidents, and user trust proxies. Persistent holdouts and long-lived control cohorts protect against platform-wide regressions, while anomaly detection highlights model or data shifts before they corrupt reads. Strong instrumentation—behavioral analytics, consistent event semantics, and product telemetry wired into Amplitude analytics—keeps our feedback loop tight and auditable.
Third, we rebuilt rollout mechanics to make delivery experimentation-native. Feature flags, progressive delivery, and targeted canaries let us test safely in production while gating exposure by segment, risk, or policy. Shadow mode and offline replay provide signal before real users see risk. Multi-armed bandits help with exploration when goals are clear and guardrails are enforced, but we resist over-rotating to bandits when measurement is fragile. Tightly integrating experiments into CI/CD and observability shortens the cycle from hypothesis to validated outcome.
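Progressive delivery usually depends on assigning each user to a stable bucket so that raising the rollout percentage only adds users and never reshuffles them. The sketch below does this with a hash of the flag name and user ID; the function name, flag name, and bucket granularity are hypothetical, and real feature-flag services handle this for you.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Return True if the user falls inside the first `percent` of buckets.

    The bucket depends only on (flag, user_id), so a user's assignment is
    stable across calls and only changes when the percentage crosses it.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% granularity
    return bucket < percent * 100


# Illustrative: count how many of 10,000 synthetic users see a 20% canary.
exposed = sum(in_rollout(f"user-{i}", "new-agent", 20) for i in range(10_000))
print(exposed)
```

Including the flag name in the hash keeps buckets independent across flags, so the same users are not always the first to receive every risky change.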
In practice, here’s how I operationalize this shift. In 30 days, I audit the backlog, kill or consolidate tests that can’t meet MDE, and establish a minimal evaluation harness for prompts, policies, and safety checks. By 60 days, guardrail metrics are live with persistent holdouts and feature flags across AI surfaces. By 90 days, the team runs a balanced portfolio: offline evals for fast iteration, canaries for risk, and selective A/B testing for strategic bets—supported by continuous discovery to keep hypotheses grounded in real customer needs.
AI didn’t eliminate the need for experimentation; it raised the bar for rigor. By moving from volume to validity, from vanity lifts to guardrailed outcomes, and from monolithic launches to progressive delivery, I’ve seen experimentation regain its edge—fewer false positives, faster cycles, and clearer signal on what truly drives impact. That’s how we turn a brittle testing culture into a resilient, learning system built for LLMs and beyond.
Inspired by this post on Amplitude – Perspectives.
I keep meeting talented product teams who can demo impressive proof-of-concepts but can’t get durable business impact into production. The difference isn’t raw ingenuity—it’s the operating model. As I’ve scaled AI initiatives in my own organization, one sentence has proven painfully accurate: "What the top 1% of AI-native product teams are doing differently – and why most won't catch up without rebuilding the operating model."
When I say “AI operating model,” I mean the end-to-end way we set strategy, discover value, build, ship, govern, and learn—specifically adapted for AI systems. If we try to bolt AI onto a classic software cadence, we stall. If we rebuild our operating model around AI’s unique constraints and compounding advantages, we accelerate.
It starts with strategy. I anchor our portfolio to explicit outcomes, not features, tying every initiative to measurable customer and commercial impact. Driver trees and an opportunity solution tree make tradeoffs transparent, while outcome-based (rather than output-based) OKRs prevent us from celebrating activity over results. This is how empowered product teams earn autonomy without losing alignment on the AI strategy.
Next is discovery. Continuous discovery reframes “can we ship a model?” into “can we change a behavior or decision with acceptable risk?” I pair customer interviews with in-product telemetry and journey mapping to qualify moments of high value and high frequency. The litmus test: can we describe the target workflow in plain language and simulate success before training models? If not, we’re not ready.
Data foundations come third. A retrieval-first pipeline is now my default, not an afterthought. We invest in data governance, privacy-by-design, and observability so we can explain where answers come from, prove consent, and debug drift. Without trustworthy data and clear lineage, every downstream AI promise is fragile—and your AI readiness is mostly theater.
Then I insist on eval-driven development. Before we optimize prompts or tune models, we define offline and online evals that represent the real task, including safety and “gotcha” cases. We treat prompt engineering, context window management, and agentic AI patterns as hypotheses that must beat a baseline under repeatable tests. This moves debate from opinions to evidence.
Shipping is where most teams quietly stall. We integrate AI into our CI/CD with feature flags, shadow modes, and progressive rollouts, building MLOps into the same platform that runs our services. I watch DORA metrics to keep delivery velocity healthy, but I also watch AI-specific signals—input distribution shifts, response variance, and time-to-mitigation—so we catch regressions before customers do. Platform scalability matters more when inference costs and latency can spike overnight.
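One way to quantify an input distribution shift is the population stability index (PSI), computed over bucketed inputs such as intent categories or prompt-length bins. PSI is my choice of illustration here, not something the workflow above prescribes, and the bucket counts are made up.

```python
from math import log


def psi(expected_counts: list[int], actual_counts: list[int], eps: float = 1e-6) -> float:
    """Population stability index between a baseline and a current distribution.

    Both lists hold counts for the same buckets in the same order.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same buckets")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp to eps so empty buckets do not cause log(0) or division by zero.
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * log(a_frac / e_frac)
    return score


# Illustrative bucket counts: requests by intent last month vs. this week.
baseline = [500, 300, 200]
current = [200, 300, 500]
print(round(psi(baseline, baseline), 4), round(psi(baseline, current), 4))
```

Identical distributions score zero and larger shifts score higher. The alert threshold is a team decision, best set by looking at how the score behaved during past incidents and quiet periods.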
Governance isn’t a gate at the end; it’s a runway from the start. We operationalize AI risk management with tiered reviews, model and data cards, and clear escalation paths. The goal is not to slow down, but to reduce surprise—so product managers, engineers, and legal share the same playbook for safety, fairness, and regulatory compliance.
Value capture closes the loop. We connect product metrics to commercial levers like net revenue retention (NRR) and customer retention, then shape packaging so customers pay for outcomes, not raw compute. This is where product-led growth meets sales-led growth: we demonstrate value in-product, then arm go-to-market teams with unambiguous proof.
So why are 80% of teams stuck? Three patterns recur: technology FOMO masquerading as strategy, fragmented data that can’t support high-quality retrieval, and a lack of evals that forces decisions by vibes. Add ad hoc governance and you get pilots that impress in slides but wither under real-world variance.
How do the top 1% think differently? They rebuild the operating model first. They position discovery around workflows, not models. They invest in retrieval-first architectures early. They standardize evals. They ship with guardrails. And they treat “learning per week” as a sacred metric—because compounding insight beats sporadic heroics.
If you need a 90-day plan, here’s the sequence I use. Weeks 1–2: run a content audit of data sources and map the top five repeatable workflows ripe for AI leverage. Weeks 3–4: define success metrics and offline evals for one beachhead use case. Weeks 5–8: build the retrieval pipeline, implement prompt baselines, and instrument observability. Weeks 9–12: ship behind feature flags, run A/B testing with safety thresholds, and iterate on failure cases. By the end, you’ll have a reusable blueprint, not just a demo.
Team design matters. I staff product trios (PM, design, tech lead) with forward deployed engineers or solutions engineering partners who sit with customers. That proximity reduces spec ambiguity and accelerates learning. It also sharpens our product roadmapping and sprint planning because we plan against outcomes, not outputs.
The hardest part is emotional, not technical: letting go of familiar software rituals that don’t serve AI. Once we accept that AI demands a different operating rhythm, progress feels lighter. The top 1% don’t have secret models; they have disciplined systems. Rebuild yours, and the compounding benefits will outpace any single model upgrade.
I build "GTM and analytics products for the AI era—tools that make hard calls simple." That guiding principle shapes how I design systems, prioritize roadmaps, and lead teams: we earn speed by engineering clarity. My north star is straightforward—turn noisy signals into trusted insights that move the business, without adding friction for customers or chaos for teams.
In practice, this starts with behavioral analytics. Whether you're using Amplitude analytics or a homegrown stack, the goal is the same: a unified analytics platform that captures clean events, enforces a clear taxonomy, and maps behaviors to outcomes. I focus on journey mapping, activation and retention analysis, and honest attribution so that every GTM motion ladders to real product usage, not vanity metrics.
Decisions should be testable and reversible. I operationalize experimentation with A/B testing, feature flags, and guardrailed rollouts. Minimum detectable effect, power analyses, and anomaly detection aren’t academic exercises; they’re the foundation for credible learnings. When a result is unclear, we tighten hypotheses, shrink blast radius, and iterate quickly—biasing for learning while protecting the customer experience.
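As a rough illustration of how minimum detectable effect and power translate into a sample size, here is a sketch using the standard normal-approximation formula for comparing two conversion rates. The function name and default values are assumptions; a real experiment platform may use a different test or correction.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion test.

    baseline_rate: current conversion rate, e.g. 0.10
    mde_abs: minimum detectable effect as an absolute lift, e.g. 0.02
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1) or not (0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and mde_abs must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Detecting a 2-point lift on a 10% baseline needs a few thousand users per arm;
# halving the MDE roughly quadruples that.
n_two_points = sample_size_per_arm(0.10, 0.02)
n_one_point = sample_size_per_arm(0.10, 0.01)
```

The practical takeaway is the inverse-square relationship: small effects are expensive to detect, which is why tightening the hypothesis often beats running a longer test.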
AI changes the surface area of product work, but it doesn’t change the discipline. I treat LLMs as a product capability, not a shortcut: eval-driven development, clear success criteria, and human-in-the-loop feedback remain non-negotiable. Privacy-by-design and data governance shape what we build; responsible prompts, retrieval strategies, and safety checks shape how it behaves in the wild. When the model is uncertain, the product should be honest about it and offer a graceful fallback.
Great GTM is a system, not a launch day. I connect product strategy to go-to-market strategy through product-led growth loops: in-app guides that meet users where they are, onboarding that accelerates time-to-value, and signals that identify true qualified intent. Driver trees tie adoption to monetization so that marketing, sales, and success work from the same picture—making trade-offs visible and reversible.
Execution is where clarity compounds. Continuous discovery with product trios keeps problems crisp and solutions grounded in user truth. Product roadmapping and sprint planning follow outcome-first principles: fewer projects, clearer intents, stronger accountability. When teams can trace every backlog item to a metric that matters, they move faster with less oversight—and deliver results that stand up to scrutiny.
When we do all of this well, decisions feel simple because the work behind them is rigorous. That’s the promise of modern GTM and analytics in the AI era: no theatrics, just dependable systems that turn possibilities into predictable progress.
Inspired by this post on Amplitude – Best Practices.
I keep asking myself a simple, high-stakes question: what does it take to build an AI customer support agent that actually knows when it can't help — and says so?
Recently, I dug into how Jamie Hall (Co-founder & CTO), Xharmagne Carandang, and Rona Wang at Lorikeet are answering that question for enterprises in regulated industries. Their target outcome is refreshingly concrete: an agent that responds like the best customer support you’ve ever had — one that knows you, gets things fixed, and hands off gracefully when it’s out of its depth.
What resonated first was the honesty about early missteps. The team explored reflection tools and information dashboards before a healthcare startup reframed the job-to-be-done with a blunt directive: just help us clear the inbox. The earliest prototype wasn’t flashy — a command-line script spitting out a CSV — yet it paved the way for a scalable, measurable foundation.
Today, the system runs on a dual-agent architecture: a Concierge that handles customer tickets end-to-end, and a Coach that helps customers configure, test, and continuously improve it. That split is more than a technical choice; it’s a product strategy that separates operational resolution from the meta-work of quality, guardrails, and evaluation.
The backbone principle is "AI humility" — defaulting to a human handoff when uncertain. In practice, this isn’t about avoiding responsibility; it’s about preserving trust. When an agent signals uncertainty, it protects brand equity and customer experience while still accelerating the path to resolution.
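One way to picture "defaulting to a human handoff when uncertain" is a routing rule like the sketch below. This is not Lorikeet's implementation; the `Draft` fields, the confidence score, the policy flags, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    RESPOND = "respond"
    HANDOFF = "handoff_to_human"


@dataclass(frozen=True)
class Draft:
    answer: str
    confidence: float                  # assumed score in [0, 1] from the agent or a judge
    policy_flags: Tuple[str, ...] = () # assumed guardrail hits, e.g. ("medical_advice",)


def decide(draft: Draft, threshold: float = 0.8) -> Action:
    """Hand off unless the agent is both confident and clear of guardrails."""
    if draft.policy_flags:
        return Action.HANDOFF
    if draft.confidence < threshold:
        return Action.HANDOFF
    return Action.RESPOND


confident = decide(Draft("Your card was reissued on Monday.", confidence=0.93))
unsure = decide(Draft("I think the fee might be waived.", confidence=0.55))
flagged = decide(Draft("You should stop taking it.", 0.97, ("medical_advice",)))
```

The design choice worth noting is the default: anything that fails a check falls through to a human, so uncertainty produces a handoff instead of a guess.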
Lorikeet integrates with Zendesk and Intercom instead of replacing them. That decision respects the entrenched workflows and analytics ecosystems support leaders already rely on, and it reduces adoption friction while enhancing existing queues, macros, and reporting.
The UX has evolved from a workflow builder to a conversational interface — and yet the blank chat box is still hard. Guardrails, prompts, and example-led onboarding help teams get started without forcing them to be prompt engineers. When you’re aiming for low cognitive load, a hybrid of guided steps and conversational nudges works better than a pure canvas.
One of the most nuanced patterns is "resolution in the loop": how human agents unblock the AI without taking over a ticket. Instead of a full manual escalation, humans can provide a targeted nudge — a missing piece of data, a policy citation, a link to a system of record — and let the Concierge finish the job. That collaboration preserves productivity while keeping humans in the quality loop.
Guardrails turned out to be deeply domain-specific — a cannabis company’s support tickets famously broke the team’s first approach. That’s a crucial lesson for regulated industries: policy nuance often lives in the edge cases. Lorikeet responded by making customer-configurable guardrails a first-class capability through the Coach interface.
Even more interesting, they’re flipping the configuration workflow so customers define "what good looks like" before they ever write a standard operating procedure. By anchoring configuration in outcomes and test cases rather than prose SOPs, teams move faster, reduce ambiguity, and get to measurable quality earlier.
The platform leans into eval-driven development: using AI to diagnose failure modes in traces and automatically suggest fixes. A "Trace Diagnosis Agent" surfaces root causes and remediation paths, shrinking the feedback loop from discovery to improvement.
Culturally, the product engineering cadence is customer-obsessed: every engineer asks weekly what they learned from a customer. That lightweight ritual is a forcing function for continuous discovery and keeps prioritization tethered to real-world tickets, not just internal hypotheses.
Here’s how I translate these lessons for any customer support AI strategy in regulated environments. First, ship with opinionated "AI humility" and measure handoffs as a quality feature, not a failure. Second, separate resolution from configuration via a dual-agent architecture so each can evolve independently. Third, integrate where your customers already work (Zendesk, Intercom) to accelerate time-to-value. Fourth, make guardrails domain-native and customer-configurable, and start with evals that define "what good looks like". Finally, invest in trace analysis and automatic fix suggestions to shorten the learning cycle.
If you’re scaling support in healthcare, financial services, or any high-stakes domain, these patterns are practical, defensible, and ready to operationalize. Build the Concierge to resolve, empower the Coach to continuously improve, and let "resolution in the loop" bind humans and agents into one reliable system of service.
A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.
We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, forty-eight hours for everyone else. Those were considered good numbers.
No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.
An AI Agent closes it completely.
When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.
Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.
Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.
What AI changes
Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.
The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.
The constraint the industry built around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.
If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.
SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.
Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.
You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:
1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.
2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.
3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.
These three questions reveal whether the motion is working. Time-to-first-response can’t.
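The three questions above can be sketched as a small summary over conversation records. The field names and the exact metric definitions (for example, dividing by all conversations) are assumptions for illustration; your CRM or agent analytics tool may define them differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Conversation:
    completed: bool          # reached a clear outcome
    qualified: bool          # met the qualification criteria
    contact_captured: bool   # usable contact details for the handoff
    meeting_booked: bool
    csat: Optional[int] = None  # 1-5 rating, if the buyer left one


def summarize(conversations: Sequence[Conversation]) -> Dict[str, Optional[float]]:
    """Outcome metrics for Agent-handled conversations."""
    n = len(conversations)
    if n == 0:
        return {}
    ratings = [c.csat for c in conversations if c.csat is not None]
    return {
        # 1) Is the Agent doing the work?
        "completion_rate": sum(c.completed for c in conversations) / n,
        "qualification_rate": sum(c.qualified for c in conversations) / n,
        "contact_capture_rate": sum(c.contact_captured for c in conversations) / n,
        # 2) Is the work producing pipeline?
        "meetings_booked": float(sum(c.meeting_booked for c in conversations)),
        # 3) Are buyers having a good experience?
        "avg_csat": sum(ratings) / len(ratings) if ratings else None,
    }


sample = [
    Conversation(True, True, True, True, csat=5),
    Conversation(True, False, True, False, csat=4),
    Conversation(True, True, False, False),
    Conversation(False, False, False, False, csat=2),
]
report = summarize(sample)
```

Note that time-to-first-response does not appear anywhere in the report, which is the point of the section.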
Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.
An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.
The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.
Speed-to-lead made sense when the delay was structural. It isn’t anymore. If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.
I’ve spent my career building products on top of the internet, championing social media, and now scaling AI. Lately, I keep returning to an uncomfortable but necessary question: are we still building a net positive future—or have we drifted into something else entirely?
A recent long-form conversation in my podcast queue challenged me to do a deeper self-audit. If you want to hear the debate that sparked this reflection, you can listen on Spotify or Apple Podcasts. What follows is my synthesis as a product management leader: the hard truths, the hopeful paths forward, and the practical actions I’m taking with my teams.
The moment that hit me hardest was a family member’s blunt assessment that the internet has become “net negative.” That phrase landed like a wake-up call—a reminder that those of us inside tech often operate in an echo chamber. We see our roadmaps, our metrics, our progress; the rest of the world experiences the second-order effects. As a leader, I have to seek out those outside-in perspectives with the same rigor I apply to any product discovery practice.
Another truth I can’t ignore: somewhere along the way, parts of our industry slid from “make people’s lives better” to “extract maximum value at any human cost.” You can see it in incentives that prioritize growth at all costs, in waves of layoffs that treat people as an expense line, and in platform behaviors that resemble a modern tycoon era. This isn’t just a moral critique—it’s a product strategy risk. Extractive models erode trust, weaken retention, and invite regulatory and reputational headwinds that no amount of optimization can out-execute.
The loneliness crisis is real, and technology has too often replaced human connection instead of augmenting it. Spend a week in San Francisco and you’ll notice what I call “isolation by design”—QR-code menus, autonomous Waymos, frictionless everything, but fewer genuine human moments. It’s efficient, yes, but alienating. No algorithm can substitute for physical touch, care, and community. As builders, we should design products that create on-ramps to real-world connection, not cul-de-sacs of infinite scroll.
We still have agency. “Don’t be evil” shouldn’t be a nostalgic slogan; it should be a minimum bar. Responsible product management means being a citizen of the ecosystems we influence: naming trade-offs clearly, instrumenting for externalities, and building AI risk management into our operating cadence. It also means stepping outside the industry narrative to ask neighbors, parents, teachers, and small business owners how our products actually land in their lives.
One idea that gives me hope is “mom and pop tech”: AI-enabled, hyper-local tools crafted for specific neighborhoods and communities. Think “inch wide, mile deep”—software that solves a real problem for a defined community rather than chasing a horizontal total addressable market. Consider ride share. The extractive platform playbook maximized liquidity but squeezed drivers and frayed local fabric. A community-owned alternative could optimize for safety, fair wages, and neighborhood vitality over blitz-scaled margins. That’s civic tech with a viable product strategy.
I’m also watching how social norms evolve. At a recent Elternabend (parents’ evening) at a German primary school, parents collectively agreed to delay smartphones until age 11 or 12, a striking shift from just five years ago, when many 7–8 year olds had devices. Culture moves, sometimes faster than we expect. Product-led growth that ignores cultural momentum (or ethical guardrails) is fragile growth.
So what do we do on Monday morning? First, rebuild our discovery muscles outside the echo chamber: continuous discovery with the people most affected by our products, not just our power users. Second, measure what matters: add well-being, community impact, and qualitative trust signals to the same dashboards that track activation and retention. Third, resist technology FOMO—choose fewer bets and go deeper, especially where AI can be applied responsibly to unlock real-world value. Fourth, cultivate communities of practice that normalize responsible experimentation, privacy-by-design, and transparent communication. Finally, narrate the change: as product people, we are educators as much as we are builders; our stories shape what teams believe is possible.
If you’re looking for frameworks to anchor this work, revisit classics like Bowling Alone: The Collapse and Revival of American Community for context on social capital, and pair that with modern conversations on local resilience and community spaces. The future isn’t written yet. With clear principles, careful incentives, and the courage to narrow our scope in service of depth, we can still build technology that strengthens the bonds that make life worth living.
I’d love to hear how you’re approaching this in your organization—especially examples of “mom and pop tech,” AI Strategy in service of community, or product strategies that trade a little scale for a lot of human good. Join the conversation in the comments.
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.
If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.
How does it handle your real-world setup?
Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.
When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.
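The scenario mix above can be organized as a small, categorized test suite so that gaps in coverage are visible before the POC starts. This is a sketch under assumptions: the category names, the `Scenario` shape, and the substring check are illustrative, and a real POC would score replies with human review or a calibrated judge.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical labels mirroring the six scenario types listed above.
CATEGORIES = ("multi_turn", "messy_input", "edge_case",
              "rephrasing", "multi_source", "multilingual")


@dataclass(frozen=True)
class Scenario:
    category: str
    turns: Tuple[str, ...]  # customer messages, in order
    must_include: str       # text expected in the Agent's final reply


def run_suite(agent: Callable[[Sequence[str]], str],
              scenarios: Sequence[Scenario]) -> Dict[str, float]:
    """Pass rate per category, so one strong area cannot hide a weak one."""
    results: Dict[str, List[bool]] = defaultdict(list)
    for s in scenarios:
        reply = agent(s.turns)
        results[s.category].append(s.must_include.lower() in reply.lower())
    return {cat: sum(hits) / len(hits) for cat, hits in results.items()}


def missing_categories(scenarios: Sequence[Scenario]) -> List[str]:
    """Categories from the checklist that have no scenarios yet."""
    covered = {s.category for s in scenarios}
    return [c for c in CATEGORIES if c not in covered]


suite = [
    Scenario("multi_turn", ("I was charged twice", "It was order 1234"), "refund"),
    Scenario("messy_input", ("wher is my refnd??",), "refund"),
    Scenario("rephrasing", ("How do I get my money back?",), "refund"),
]
```

Reporting per category matters because an aggregate score can look healthy while, say, every multilingual scenario fails.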
What does it feel like to interact with the Agent?
Two AI Agents can post the same quantitative scores—resolution rates, containment rate, and more—and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.
Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
Is the tone natural and on-brand, or does it feel robotic and generic?
Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
When it doesn’t know the answer, does it handle that gracefully?
When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”
That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.
I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.
Can you keep improving it after launch?
This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and that you can keep improving over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.
The feedback loop
Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.
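A lightweight version of that loop is simply counting tagged failures so the team knows what to fix first. The tag names and the `ReviewedConversation` record below are hypothetical; most tools expose something similar through their own tagging or export features.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ReviewedConversation:
    conversation_id: str
    failure_tags: Tuple[str, ...] = ()  # e.g. ("missing_knowledge", "tone")


def top_failure_modes(reviews: Sequence[ReviewedConversation],
                      limit: int = 3) -> List[Tuple[str, int]]:
    """Most frequent failure tags, counting each tag once per conversation."""
    counts = Counter(tag for r in reviews for tag in set(r.failure_tags))
    # Sort by count (descending), then by name, so ties are ordered predictably.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


reviews = [
    ReviewedConversation("c1", ("missing_knowledge",)),
    ReviewedConversation("c2", ("missing_knowledge", "tone")),
    ReviewedConversation("c3", ("poor_handoff",)),
    ReviewedConversation("c4", ("missing_knowledge", "tone")),
    ReviewedConversation("c5"),
]
worst = top_failure_modes(reviews, limit=2)
```

Each top tag then becomes a candidate for a targeted eval, so the fix can be verified against the same failure mode that surfaced it.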
The speed of iteration
When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.
The vendor partnership
The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
How does customer feedback influence the product roadmap, and can they show you examples?
If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
What kind of support will you get post-launch?
Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.
What a good POC proves
If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.
Old-school, in-person selling is having a renaissance in the AI era, and I’ve seen why up close. From leading product and go-to-market teams through hypergrowth, I keep returning to one lesson: enterprise buyers still reward the teams who show up, orchestrate change management, and own outcomes end-to-end. The tech has changed; the human dynamics haven’t.
Has the sales playbook changed in the AI era? The tools are faster and the surface area is bigger, but the core motion remains the same: “showing up” beats letting the marketplace decide. That’s why in-person enterprise rollouts still beat product-led motions, especially when the stakes include security, governance, and cross-functional adoption. You win by reducing organizational risk, not by assuming free trials will do the heavy lifting.
Great enterprise sellers collapse silos. They sell to engineers and executives in one motion, pairing deeply technical validation with crisp business narratives. In my org, that means every high-velocity pilot has a dual thread: hands-on, eval-driven proof for the builders and a value architecture for the budget owners. When those motions run in parallel, time-to-value plummets and procurement friction fades.
Selling to AI-native buyers who grew up on ChatGPT changes tempo, not fundamentals. The same seller, different tempo: 8 weeks vs. 8 business days. These buyers evaluate fast, expect clear ROI, and push for automation-first workflows. Their build-vs.-buy decisions come down to a simple rule: build for differentiation, buy for acceleration. If you make procurement feel like product (frictionless, instrumented, and transparent), you’ll meet their bar.
Process matters, but humanity wins. Building a robust sales process that still leaves room for unscripted moments is where trust is formed. I’ll never forget the story of the rep who taught a champion’s son guitar over Zoom—an unscripted moment that cemented a partnership. The lesson: raise the floor without capping the ceiling. Equip every rep with repeatable plays, then celebrate the creative instincts that make champions out of customers.
In early GTM, the idea that the three highest-leverage early sales hires aren’t sellers at all matches my experience. I prioritize a solutions engineer who can de-risk integration, a forward-deployed operator who can run the first rollout like a product manager, and a customer success lead who designs adoption paths from day zero. Together, they compress the value journey from proof to production.
Compensation design shapes your talent market. The case for outsized commission accelerators for star sellers is real, and so is the kind of person they attract: they are magnets for competitive sellers who close complex, multi-threaded deals and thrive with ownership. But beware: too much process narrows the kind of seller you attract. Over-script it and you filter out the very people who can navigate ambiguity with customers.
Under the hood, instrumenting the funnel from stage zero to close keeps the system honest. I track intent signals before pipeline, conversion by persona and use case, proof milestones, and time-to-value in production. The three pillars of GTM excellence for me are repeatable discovery, referenceable outcomes, and relentless enablement. And inside the leadership team, building peers who are 80% aligned, not 100%, preserves healthy tension while keeping execution fast.
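Instrumenting the funnel from stage zero to close mostly comes down to counting how many accounts reach each stage and computing stage-to-stage conversion. The stage names and counts below are hypothetical placeholders; substitute whatever your CRM defines.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical stage names, ordered from stage zero to close.
STAGES = ("intent_signal", "pipeline", "proof_milestone", "closed_won")


def stage_conversions(counts: Dict[str, int],
                      stages: Sequence[str] = STAGES
                      ) -> List[Tuple[str, str, Optional[float]]]:
    """Conversion rate between each pair of adjacent stages.

    Returns None for a step when nothing entered the earlier stage.
    """
    steps = []
    for earlier, later in zip(stages, stages[1:]):
        entered = counts.get(earlier, 0)
        advanced = counts.get(later, 0)
        rate = advanced / entered if entered else None
        steps.append((earlier, later, rate))
    return steps


# Illustrative numbers only.
example_counts = {"intent_signal": 1000, "pipeline": 200,
                  "proof_milestone": 80, "closed_won": 20}
steps = stage_conversions(example_counts)
```

Running the same function on counts filtered by persona or use case gives the segmented view described above.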
AI is expanding the definition of enablement; whether AI is changing what good enablement looks like is no longer a theoretical question. I see world-class teams arming reps with retrieval-first knowledge bases, sandbox environments, and objection libraries that evolve weekly. Meanwhile, selling against direct and implied competitors at once is the norm: your battlecard must cover “do nothing,” internal tools, adjacent categories, and new AI entrants, and you still need to remember why in-person enterprise rollouts beat product-led motions for durable adoption.
Planning horizons tighten in AI markets. How far out should a GTM leader be planning? I work a dual cadence: a rolling 6-week operating plan that’s ruthlessly tactical and a 2–3 quarter roadmap for coverage, enablement, and category storytelling. A normal week in hypergrowth blends customer time, pipeline triage, onboarding and enablement, deal engineering, and process tuning, always with one or two high-conviction bets that could bend the curve.
If you’re scaling an AI product today, pair a disciplined sales-led growth engine with the best of product-led growth: fast paths to proof, hands-on validation for builders, executive-level value mapping, and human moments that turn customers into advocates. That’s how you compress an eight-week cycle into eight business days and keep the expansion flywheel spinning.