I rely on Amplitude analytics and Figma Make to turn real user insights into high-fidelity prototypes in hours, not weeks. This pairing compresses our continuous discovery loop and helps my team prioritize what truly moves the needle for customers and the business.
Here’s how I put that into practice: I start with product analytics to isolate a measurable opportunity—often around user activation, conversion drop‑offs, or retention analysis. Amplitude cohorts and funnels surface where friction hides; I translate those signals into design prompts and flows in Figma Make, so we can visualize and validate potential solutions before a single line of production code is written.
Once a promising direction emerges, I convene the product trio—design, engineering, and product—around a clear outcome metric, not output. We build a lightweight driver tree, align on a hypothesis, and define the minimum detectable effect (MDE) so our A/B testing has enough statistical power to be decision‑worthy. From there, we create a small set of Figma Make variations that reflect distinct value hypotheses, not cosmetic tweaks.
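To make the MDE conversation concrete with the trio, I find a quick sample-size estimate helpful. The sketch below uses the standard normal approximation for comparing two conversion rates; the baseline rate, the lift, and the defaults for significance and power are illustrative assumptions, not values from any real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical numbers: 20% baseline activation, 2-point vs 4-point lift.
n_small = sample_size_per_arm(baseline=0.20, mde_abs=0.02)
n_large = sample_size_per_arm(baseline=0.20, mde_abs=0.04)
print(n_small, n_large)
```

Halving the MDE roughly quadruples the traffic required, which is usually the fastest way to see whether a test is worth running at all.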
On the experimentation front, I gate risky changes behind feature flags and ship via our CI/CD pipeline to limit blast radius and accelerate feedback. I monitor the experiment with a unified analytics platform mindset: the same definitions and segments in Amplitude power both pre‑launch discovery and post‑launch evaluation. That continuity lets us compare prototype expectations against production reality with far fewer translation errors.
A few principles keep this workflow sharp and responsible: I use privacy-by-design patterns, apply data governance guardrails to keep datasets consent‑aligned, and set AI risk management standards so generated designs respect accessibility and brand constraints. Critically, I avoid vanity metrics—I measure learning speed, decision quality, and downstream impact on activation or retention, which are what sustain product-led growth.
If you’re looking for a playbook, try this cadence: 1) define the customer outcome and success metric; 2) map a simple driver tree to narrow the solution space; 3) explore multiple flows in Figma Make; 4) validate quickly with concept tests and usability checks; 5) run A/B testing with a clearly defined MDE; 6) ship iteratively behind feature flags; 7) close the loop in Amplitude with cohort‑level retention analysis; 8) refine copy and UX writing to reinforce the core value proposition. Repeat until the signal is undeniable.
Blending Amplitude analytics with Figma Make has become my fastest path from insight to impact. It keeps my team focused on learning that compounds, features that matter, and outcomes customers can feel—so we truly make what matters.
Inspired by this post on Amplitude – Best Practices.
What happens when you treat an AI agent not as a chatbot, but as a full teammate on your sales team – one that can jump on video calls, demo your product, make phone calls, and follow up over days?
I recently dug into this question with the team behind ShowMe, an AI-native startup building digital sales reps for inbound teams. Founded in April 2025, ShowMe has engineered a multi‑agent system that combines conversation agents for live voice and video interactions, evaluator agents that score every call for quality and sentiment, and creator agents that ingest customer documentation to build tailored playbooks. A workflow layer orchestrates the entire lead‑to‑close journey across days, not minutes—exactly the kind of agentic AI approach I expect to see become standard in revenue workflows.
What stood out to me first was the origin story: a glaring conversion gap on a previous website, and the realization that a purpose‑built AI could fill it. The initial MVP was refreshingly pragmatic—start with a voice agent, pair it with product videos, and back it with a simple RAG knowledge base. That retrieval‑first pipeline let the team ship quickly, validate real user behavior, and then scale sophistication where it mattered.
Then came a pivotal affordance shift: adding a realistic avatar via HeyGen. It wasn’t just eye candy; it changed how prospects engaged. The video-call UX established trust and made the AI’s capabilities legible at a glance. Prospects behaved as if they were with a human rep—interrupting, probing, and asking for demos—because the surface area invited that behavior.
On the architecture side, the team decomposed a single sales conversation into multiple specialized sub‑agents—greeting, qualifying, pitching—to manage latency, memory constraints, and model limitations. Deterministic workflows handle the happy paths reliably, while a smart orchestrator is emerging to break out of rigid paths when context demands it. Confidence scoring and frustration detection kick in for real‑time human handoff decisions, a must for revenue‑critical moments where a missed nuance can cost pipeline.
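I don't know how ShowMe implements its handoff logic, but the general idea of combining confidence scoring with frustration detection can be sketched in a few lines. Every name and threshold here is a hypothetical placeholder, and real signals would come from the evaluator models rather than hand-set numbers.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    confidence: float   # agent's confidence in its last answer, 0..1 (assumed scale)
    frustration: float  # detected prospect frustration, 0..1 (assumed scale)


def should_hand_off(recent: list[TurnSignals],
                    min_confidence: float = 0.6,
                    max_frustration: float = 0.7,
                    window: int = 3) -> bool:
    """Escalate when the latest turn is low-confidence or frustration stays high."""
    if not recent:
        return False
    if recent[-1].confidence < min_confidence:
        return True
    tail = recent[-window:]
    avg_frustration = sum(t.frustration for t in tail) / len(tail)
    return avg_frustration > max_frustration


calm_call = [TurnSignals(0.9, 0.1), TurnSignals(0.8, 0.2)]
tense_call = [TurnSignals(0.9, 0.8), TurnSignals(0.9, 0.9), TurnSignals(0.8, 0.8)]
print(should_hand_off(calm_call), should_hand_off(tense_call))
```

Averaging frustration over a short window, instead of reacting to a single turn, is one way to avoid handing off on a momentary spike.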
Training the system to sell like your team is where it gets powerful. ShowMe ingests sales transcripts and training materials to teach company‑specific sales skills, then uses creator agents to assemble tailored playbooks. Conversation agents stay focused on live interactions, while evaluator agents continuously score calls for quality and sentiment. The result: repeatable, compliant, and brand‑consistent selling—without flattening personalization.
Quality isn’t an afterthought—it’s operationalized. Early deployments run with customer-driven evaluation loops where 100% of conversations are reviewed, tapering to about 5% over time as confidence increases. Feedback becomes automated tests to prevent prompt regression, and production quality is proven with POCs, A/B rollouts, dashboards, and CRM logging. This is eval-driven development applied to go‑to‑market: measurable, auditable, and continuously improving.
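The taper from full review to roughly 5% can be expressed as a simple sampling schedule. The halving-per-week decay below is my own assumption for illustration; the source only gives the start and end points, and a real team would likely tie the rate to measured confidence instead of the calendar.

```python
import random


def review_rate(week: int, start: float = 1.0, floor: float = 0.05,
                decay: float = 0.5) -> float:
    """Fraction of conversations to review in a given week (week 0 = launch)."""
    return max(floor, start * decay ** week)


def select_for_review(conversation_ids: list[str], week: int,
                      seed: int = 0) -> list[str]:
    """Randomly sample conversations for human review at that week's rate."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    rate = review_rate(week)
    return [cid for cid in conversation_ids if rng.random() < rate]


ids = [f"conv-{i}" for i in range(200)]
print(review_rate(0), review_rate(2), review_rate(8))
print(len(select_for_review(ids, week=0)), len(select_for_review(ids, week=8)))
```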
I also appreciate how they treat the agent as a coworker, not a widget. Onboarding happens via Slack, weekly reporting aligns with sales leadership rhythms, and tight CRM integration keeps data flowing both ways. That mindset unlocks adoption because it fits how sales teams actually operate—and it creates real Agent Analytics you can manage.
From a product perspective, several pragmatic details matter. Real‑time voice and avatar demos rely on latency tricks and a library of video clips to keep interactions snappy. The conversation agent evolved from a basic Q&A bot into guided sales discovery, balancing personalization with the ever-present risks of hallucination. Guardrails, human‑in‑the‑loop, and clearly defined handoff rules are non‑negotiables in high‑stakes sales workflows.
Looking ahead, the roadmap makes sense: move toward self‑serve PLG setup, add smarter orchestration that adapts beyond deterministic flows, and expand into adjacent roles like customer success. For product leaders building in generative AI, the pattern here is instructive: start with inbound value, design AI workflows that align to proven sales motions, and use rigorous evals to earn the right to automate more.
If you want to go deeper into the build, the live demos, and the full multi‑agent orchestration, listen to the episode on Spotify or Apple Podcasts. For more on the stack, explore ShowMe and the avatar platform HeyGen.
I’ve been reflecting on why so many revenue leaders are at risk of falling behind, and the conclusion is stark: fewer than 10% of current CROs will thrive by 2028. That isn’t hyperbole—it’s a wake-up call for how quickly go-to-market strategy, organizational design, and AI-driven execution are evolving. From my seat leading product, I see the pressure building on the CRO role to orchestrate the entire revenue system, not just run a sales team.
One story that crystallizes this reality comes from the journey of Stevie Case, the CRO of Vanta, the trust management platform serving everyone from founders to Fortune 100 CISOs. A former professional video gamer who stumbled into sales through a mentor’s bet, she exemplifies how unconventional paths can drive unconventional insight. Her trajectory underscores a bigger truth I’ve witnessed across companies: the best revenue leaders aren’t just great sellers; they’re builders who understand product, process, and people at scale.
Why do early revenue hires fail? In my experience, it’s rarely about raw talent. It’s about fit, scope, and time horizon. Early-stage teams often hire coin-operated closers to sprint for this quarter’s number, when what they actually need are long-term builders who can shape ICP clarity, pipeline math, and repeatable motion. The trap is simple: you hire for momentum before you’ve validated the motion. That misalignment comes up at 00:00 (“Why early revenue hires fail”) and again at 04:16 (“Coin-operated sellers vs. long-term builders”). These are two ideas every founder-led GTM team should internalize before the first half-dozen sales hires.
What separates a VP of Sales from a top 1% CRO is scope and systems thinking. A true CRO owns the full revenue engine (marketing, sales, solutions engineering, customer success, pricing, channels, and post-sale activation), not just the new-business line. It’s a role defined by precision around metrics, confidence, and velocity (07:44) and by the courage to decide when to centralize vs. decentralize capabilities as you grow. Should CROs lead sales? The nuance at 12:04 is clear: yes, if the motion is still coalescing; not necessarily, once the machine is humming and specialization unlocks scale. My rule of thumb: start consolidated for speed of learning; split functions only when interlocks are provably robust.
There’s a humbling lesson in the segments on learning to scale at Twilio (16:36) and Stevie’s scaling mistake at Vanta (19:58): copying another company’s operating system, even a world-class one, is an easy way to blunt your edge. Context is king. What worked at Twilio won’t automatically work at a trust management business. That’s why the line at 17:44, “There is no CRO playbook,” resonates so deeply. There are principles (org design, segmentation, enablement, compensation, customer activation), but your playbook must be bespoke to your product, pricing, cycle time, and buyer power map.
The segment at 22:16, “Why Vanta stays 100% sales-led,” is a reminder that not every high-growth motion demands product-led growth. In categories where compliance, security, and risk shape buying behavior, a consultative, sales-led approach builds trust and shortens time to value, especially when solutions engineering, onboarding, and customer success are tightly choreographed. I’ve seen teams chase PLG headlines while ignoring the higher-ROI path right in front of them: nailing the sales-led experience, from first touch to first value.
Top CROs plan 24–26 months ahead. The value of that horizon (23:16) isn’t about creating perfect forecasts; it’s about designing optionality. That means hiring with stage gates, building enablement before you feel “ready,” instrumenting activation and retention early, and pressure-testing your pricing and packaging quarterly. In my org reviews, I push for scenario modeling: what breaks at 2x volume, what centralizes again at 600 headcount, and what competencies must be grown vs. bought.
On judgment and decision quality, the segment at 29:54, “When trusting intuition was the wrong call,” describes a familiar leadership tax. Pattern recognition is powerful until it isn’t. I’ve learned to pair intuition with a data backstop and a lightweight pre-mortem: what would have to be true for this to fail? It’s the same posture I take with AI in GTM. To the question raised at 30:49, “Do humans still have a place in the future of GTM?”, the answer is yes, but augmented. Humans set narrative, negotiate ambiguity, and build trust; AI accelerates research, writing, discovery, and coaching. The winning motion fuses both.
I’m often asked which tools materially shift outcomes. For revenue intelligence and operational rigor, I look to systems that compound learning: Gong: https://www.gong.io/, Salesforce: https://www.salesforce.com/, and Cursor: https://cursor.sh/. To study benchmark operating models and developer-led growth infrastructure, Twilio: https://www.twilio.com/ remains instructive. And to understand why trust, security, and compliance can define the entire GTM architecture, Vanta: https://www.vanta.com/ is a useful case study.
Leadership non-negotiables matter more as you scale. Stevie’s leadership non-negotiables (33:33) reminded me to be explicit about standards: clarity over activity, customer outcomes over internal wins, and auditability over anecdotes. The myth of hiring for industry expertise (36:36) shows up again and again; I’d rather hire for learning velocity, systems thinking, and builder DNA than narrow domain familiarity. And on what stays centralized in a 600-person company (40:00), remember: centralize what must be consistent (data, tooling, pricing guardrails, core enablement), decentralize what benefits from speed and context (segment plays, partner motions, field marketing).
If you prefer a structured digest, here’s the operating checklist I use with revenue and product peers: define your ICP and value proposition crisply; hire builders over coin-operated sellers; instrument the first 30 days post-sale (47:09 The hidden leverage of a customer’s first 30 days); align pricing, packaging, and onboarding to activation; model capacity and hiring plans on 24–26 month horizons; decide early what stays centralized; use AI to amplify discovery, coaching, and content while keeping humans front-and-center for trust-building; and cultivate an unvarnished CEO–CRO pact (01:02:30 Unpacking the CEO-CRO dynamic) that aligns on strategy, segmentation, and sequencing.
For those who want a few timeline highlights: 00:00 Why early revenue hires fail; 02:23 Who to hire at $5M in revenue; 05:57 What excellence looks like in the CRO role; 17:44 “There is no CRO playbook”; 22:16 Why Vanta stays 100% sales-led; 23:16 The value of planning 24-26 months ahead; 47:09 The hidden leverage of a customer’s first 30 days; 53:42 Why the CRO role will face enormous changes by 2028; 58:42 What leaders must do now to stay relevant.
The throughline is simple and urgent. The claim at 53:42, that the CRO role will face enormous changes by 2028, isn’t a forecast; it’s a present-tense mandate. What leaders must do now to stay relevant (58:42): build a revenue system, not a sales team; plan further out while executing faster; let AI handle the mechanical so your people can master the human. Those who internalize this shift will be the fewer than 10% of current CROs who thrive by 2028. The rest will be outpaced by change they could have anticipated and designed for.
Over the past few years, I’ve led cross-functional teams to deploy agentic AI in production, and I’ve learned that success rarely hinges on the model alone. It comes from methodically designing the right workflows, instrumenting every step, and building a feedback loop that compounds. Learn how companies like Replit are consolidating workflows, creating one-person departments, and building systems for scale with Amplitude.
When I talk about AI agents, I’m describing software that behaves like a focused teammate—owning a clear job to be done end-to-end. In practice, that means consolidating fragmented tasks into a single accountable “one-person department,” then giving it the context, tools, and analytics to perform reliably. This is how agentic AI moves beyond demos into durable business impact.
I start with outcomes, not algorithms. I map a driver tree from business goals (e.g., lower response time, higher activation, better retention) to the specific moments an agent can influence. This outcome-first alignment keeps scope tight, informs guardrails, and grounds the value proposition in measurable change instead of vanity metrics.
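A driver tree does not need special tooling. As a minimal sketch, it can be a nested structure whose leaves are the moments an agent might influence. The goal and driver names below are invented examples, not a recommended taxonomy.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    name: str
    children: list["Driver"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Return the lowest-level drivers, where an agent could plausibly act."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


# Hypothetical tree: business goal -> drivers -> influenceable moments.
tree = Driver("Improve 30-day retention", [
    Driver("Faster support resolution", [
        Driver("Reduce first-response time"),
        Driver("Raise first-contact resolution rate"),
    ]),
    Driver("Higher activation", [Driver("Complete onboarding checklist")]),
])
print(tree.leaves())
```

Listing the leaves makes scope discussions easier: anything an agent proposal cannot trace back to a leaf is probably out of scope.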
Next, I define the workflow the agent will fully own. I look for high-volume, rules-adjacent processes—think lead qualification, support triage, or billing inquiries—where clear decision criteria already exist but human time is the bottleneck. I document triggers, inputs, decision points, and handoffs, then design the ideal-state flow the agent will run autonomously, with transparent escalation paths to humans.
On architecture, I favor a retrieval-first pipeline to keep responses accurate and current. I scope the knowledge base, implement context window management, and standardize tools the agent can call (search, CRM actions, ticket updates). For teams new to this, I coach “LLMs for product managers” fundamentals so we make sensible trade-offs between speed and reliability rather than chasing model-of-the-week headlines.
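For teams who have not built one, the core of a retrieval-first pipeline fits in a short sketch: find the relevant sources first, then build the prompt only from what was found. I use naive word overlap here purely for illustration; a production system would typically swap in embeddings or a search index, and the document contents are made up.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(question: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return ids of the top-k documents that share words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in docs.items()]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(question: str, docs: dict[str, str], k: int = 2):
    """Ground the prompt in retrieved sources, or return None if nothing matches."""
    hits = retrieve(question, docs, k)
    if not hits:
        return None  # nothing authoritative to ground on: refuse or escalate
    context = "\n".join(f"[{d}] {docs[d]}" for d in hits)
    return f"Answer using only the sources below.\n{context}\nQuestion: {question}"


docs = {
    "pricing": "The Pro plan costs 49 dollars per month.",
    "sso": "SSO is available on the Enterprise plan.",
}
print(build_prompt("How much does the Pro plan cost?", docs, k=1))
```

Returning nothing when retrieval comes back empty is the design choice that matters most: the agent escalates instead of improvising.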
Instrumentation is where the system becomes self-improving. I use Amplitude analytics and an Agent Analytics schema to track intent detection, tool usage, resolution rate, time-to-resolution, deflection, and escalation causes. A unified analytics platform lets me connect agent outcomes to core product metrics—activation, retention, and conversion—so we can see the real revenue and experience impact, not just local efficiency gains.
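To show the shape of the data, here is a rough sketch of how per-conversation agent events might roll up into the metrics above. It does not call the Amplitude SDK; the field names and sample events are hypothetical, and in practice these would be tracked events queried in the analytics platform.

```python
from collections import Counter

# Hypothetical per-conversation records emitted by the agent.
events = [
    {"id": "c1", "resolved": True, "escalated": False, "cause": None, "seconds": 40},
    {"id": "c2", "resolved": True, "escalated": False, "cause": None, "seconds": 60},
    {"id": "c3", "resolved": False, "escalated": True, "cause": "low_confidence", "seconds": None},
    {"id": "c4", "resolved": True, "escalated": False, "cause": None, "seconds": 80},
]


def summarize(events: list[dict]) -> dict:
    """Aggregate resolution rate, escalation rate, time-to-resolution, and causes."""
    if not events:
        raise ValueError("no events to summarize")
    resolved = [e for e in events if e["resolved"]]
    escalated = [e for e in events if e["escalated"]]
    avg_seconds = (sum(e["seconds"] for e in resolved) / len(resolved)
                   if resolved else None)
    return {
        "resolution_rate": len(resolved) / len(events),
        "escalation_rate": len(escalated) / len(events),
        "avg_seconds_to_resolution": avg_seconds,
        "escalation_causes": Counter(e["cause"] for e in escalated),
    }


summary = summarize(events)
print(summary)
```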
To validate impact, I run A/B testing when traffic allows, setting a minimum detectable effect (MDE) upfront to avoid inconclusive reads. In lower-volume scenarios, I lean on eval-driven development: curated test sets for edge cases, scenario-based regression suites, and error taxonomies that accelerate iteration. Feature flags let us stage capabilities safely (shadow mode, assistive, autonomous) while we monitor deltas before full rollout.
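The staged rollout can be sketched as a flag that assigns each user a stable stage. This is a generic hash-bucketing approach, not the API of any particular feature-flag product, and the flag name and rollout shares are placeholders.

```python
import hashlib
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"          # agent runs, output is logged but never shown
    ASSISTIVE = "assistive"    # agent drafts, a human approves
    AUTONOMOUS = "autonomous"  # agent acts on its own


def bucket(user_id: str, flag: str) -> float:
    """Map a user to a stable value in [0, 1) so assignment never flips."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def stage_for(user_id: str, flag: str, autonomous_share: float,
              assistive_share: float) -> Stage:
    """Assign a stage from the user's bucket; everyone else stays in shadow."""
    b = bucket(user_id, flag)
    if b < autonomous_share:
        return Stage.AUTONOMOUS
    if b < autonomous_share + assistive_share:
        return Stage.ASSISTIVE
    return Stage.SHADOW


# Hypothetical rollout: 5% autonomous, 20% assistive, the rest in shadow mode.
print(stage_for("user-123", "billing-agent", 0.05, 0.20))
```

Because the bucket depends only on the flag and user id, widening the shares moves more users forward without reshuffling the ones already assigned.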
Reliability and trust are designed in from the start. I apply AI risk management practices—privacy-by-design, data governance, and policy-aligned prompt templates—paired with observability to trace decisions. Clear escalation policies, incident management runbooks, and human-in-the-loop checkpoints ensure the agent fails safe, not silently.
Shipping cadence matters. I use CI/CD to increase deployment frequency, keep prompts and tools versioned, and gate risky changes with targeted rollouts. As patterns stabilize, we scale horizontally to new use cases, sharing core capabilities (retrieval, analytics, guardrails) as a platform. This is how “one-person departments” multiply without multiplying overhead.
Change management closes the loop. I partner with product trios and frontline teams to co-design prompts, set acceptance criteria, and define what “good” looks like in plain language. In-app guides and product tours introduce the agent’s role and limits, and structured feedback channels feed directly into our discovery and iteration rhythm.
The throughline of this playbook is simple: treat agents like real teammates with a job description, operating procedures, and performance reviews. With disciplined workflow design, a retrieval-first pipeline, and outcome-level instrumentation in Amplitude, agentic AI stops being a science project and starts compounding into durable product-led growth.
Inspired by this post on Amplitude – Perspectives.
When people ask me about "LLM vs AI Agents: What Product Teams Must Get Right," I start with a simple truth: an LLM is a powerful prediction engine, while an AI agent is a productized workflow that plans, takes actions with tools, remembers, and closes the loop on an outcome. That difference sounds academic until you’re on the hook for reliability, cost, and customer trust.
In my role, I’ve shipped LLM copilots that delight users and piloted agents that automate complex workflows. The pattern that never fails is this: start assistive, then graduate to autonomy. Copilots accelerate people; agents own outcomes. When we respect that gradient, adoption climbs, incidents fall, and we earn the right to expand scope.
The first decision point is use-case fit. If the task benefits from human judgment, high-context nuance, or brand voice, I frame it as a copilot with strong guardrails and crisp UX. If the task is well-bounded, tool-heavy, and verifiable, I consider an agent, but only after we can measure end‑to‑end task success with eval-driven development.
Architecture matters. I reach for a retrieval-first pipeline to keep responses grounded in authoritative data, then add tool use for actions (search, write, schedule, transact) with deterministic scaffolding to prevent thrashing. Good prompt engineering is table stakes, but context window management and a clean memory strategy (short‑term scratchpad, long‑term facts, and policy) separate demos from durable systems.
Agents amplify both value and risk. I build safety in layers: role and scope definition, tool whitelists, unit limits, human‑in‑the‑loop checkpoints at irreversible steps, and privacy-by-design data governance. We log every decision token-for-token because auditability isn’t optional once agents touch customers, money, or data.
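Layered safety is easier to reason about when each layer is an explicit check. The sketch below chains three of the layers named above: a tool allowlist, a unit limit, and a human checkpoint on irreversible steps. The tool names and the cap are hypothetical.

```python
from dataclasses import dataclass

ALLOWED_TOOLS = {"search_kb", "update_ticket", "issue_refund"}  # assumed tool names
IRREVERSIBLE_TOOLS = {"issue_refund"}
MAX_UNITS_PER_CALL = 100  # hypothetical cap, e.g. refund amount in dollars


@dataclass
class ToolCall:
    tool: str
    units: int = 0
    human_approved: bool = False


def check(call: ToolCall) -> str:
    """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
    if call.tool not in ALLOWED_TOOLS:
        return "deny"
    if call.units > MAX_UNITS_PER_CALL:
        return "deny"
    if call.tool in IRREVERSIBLE_TOOLS and not call.human_approved:
        return "needs_approval"
    return "allow"


print(check(ToolCall("search_kb")))
print(check(ToolCall("issue_refund", units=50)))
print(check(ToolCall("delete_account")))
```

Logging the returned verdict alongside the proposed call is what makes each decision auditable later.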
Measurement is non‑negotiable. For LLM features, I track time‑to‑first‑token, response latency, groundedness, and user satisfaction. For agents, I add Agent Analytics: task success rate, number of steps per task, tool error rate, loop detection, guardrail triggers, escalation to human, cost per successful task, and containment rate. If we can’t see it, we can’t ship it.
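Two of these agent metrics are worth spelling out because teams often define them loosely. The sketch below assumes a simple definition of each: a loop is the same step repeated several times in a row, and cost per successful task divides total spend, including failed runs, by the number of successes. The run records are made up.

```python
def has_loop(steps: list[str], repeat_limit: int = 3) -> bool:
    """Flag a run where the same step repeats repeat_limit times in a row."""
    streak = 1
    for prev, cur in zip(steps, steps[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak >= repeat_limit:
            return True
    return False


def cost_per_successful_task(runs: list[dict]) -> float:
    """Total spend across all runs divided by the number of successful runs."""
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return float("inf")
    return sum(r["cost_usd"] for r in runs) / successes


# Hypothetical run log.
runs = [
    {"success": True, "cost_usd": 0.10},
    {"success": False, "cost_usd": 0.30},
    {"success": True, "cost_usd": 0.20},
]
print(has_loop(["search", "search", "search", "write"]))
print(cost_per_successful_task(runs))
```

Charging failed runs to the successful ones keeps the metric honest: an agent that fails cheaply and often still looks expensive.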
My delivery playbook mirrors modern software ops. We use feature flags, gated betas, and canary rollouts; we version prompts like code; we set incident management paths for model outages and tool drift; and we rehearse fallbacks so the experience degrades gracefully, not catastrophically. Dull operations build dazzling products.
On roadmapping, I thin‑slice value. We introduce a minimum viable copilot that handles a single, frequent job-to-be-done with high success. Only after continuous discovery confirms product‑market fit do we grant more autonomy, one capability at a time. Outcome-over-output OKRs keep us honest: if the customer’s job gets done faster, cheaper, and with fewer errors, we scale; if not, we fix fundamentals before adding scope.
Build vs buy is rarely binary. I tend to buy the undifferentiated heavy lifting—observability, prompt versioning, red‑teaming, and policy enforcement—while building the proprietary workflows, data modeling, and UX that encode our defensible advantage. The litmus test: if it’s part of our unique value proposition, we own it; if not, we integrate the best‑in‑class and move.
Go‑to‑market must be as rigorous as the tech. We position clearly (assistant vs agent), price to value with transparent consumption SaaS pricing, and communicate risk posture in plain language. Customers don’t buy models; they buy confidence that a job gets done reliably within their constraints.
Common failure modes repeat: shipping autonomy before instrumentation, treating prompts as magic instead of software, skipping data governance, and ignoring the human experience. The antidote is disciplined AI Strategy rooted in empowered product teams, tight feedback loops, and relentless evaluation.
If you take nothing else: choose the right paradigm for the job (copilot first, agent when proven), ground with a retrieval-first pipeline, instrument with eval-driven development and Agent Analytics, and operationalize like a mission‑critical system. Do that, and you’ll turn LLM capabilities into durable product outcomes.
Every week, product and data leaders ask me the same question: can AI agents truly shoulder enterprise analytics without sacrificing trust, governance, or speed? I’ve spent the past year putting agentic AI through its paces in real product workflows, and I’ve distilled what works into a practical, task-driven evaluation approach you can adopt immediately.
When I say “enterprise analytics,” I’m talking about far more than chatty dashboards. The bar includes consistent metric definitions, privacy-by-design, RBAC and data governance, audit trails, low-latency decision support, and repeatable outcomes across retention analysis, funnels, cohorts, A/B testing, instrumentation planning, and anomaly detection—ideally within a unified analytics platform.
My task-based framework evaluates eight capability pillars I expect from an enterprise-ready Agent Analytics solution: task coverage and depth across common product analytics workflows; data fidelity and governance (lineage, access controls, PII handling); instruction-following and reasoning transparency; evaluation rigor and reliability (repeatability, error modes, regressions); security and compliance posture; latency and cost efficiency; integration into existing product strategy workflows (e.g., CRM integration, CI/CD-linked instrumentation, experiment platforms); and human-in-the-loop controls for approvals and guardrails.
Operationally, I define canonical tasks that reflect day-to-day product management: codify a North Star metric; perform retention analysis by cohort; generate and explain a funnel with drop-off drivers; recommend an event taxonomy and tracking plan; analyze an A/B test with minimum detectable effect (MDE) considerations; and propose a driver tree that maps inputs to outcomes. Each task comes with ground-truth datasets, acceptance criteria, and edge cases to stress the agent—an eval-driven development practice I’ve found indispensable.
I then score maturity across four levels. L0: a pure chat UI that summarizes existing charts. L1: a retrieval-first pipeline that grounds responses in your analytics catalog and metric store. L2: a tool-using agent that is schema-aware, can write safe SQL, and reconciles results to canonical definitions. L3: a governance-aware autonomous workflow that executes analytics tasks end-to-end with approvals, audit logs, feature flags, and rollback plans. Most teams discover they’re between L1 and L2; reaching L3 requires serious investment in data governance and eval automation.
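One way to make the four levels operational is to treat them as cumulative, so a missing lower capability caps the score. That cumulative rule is my simplifying assumption for this sketch, and the three boolean inputs stand in for what would really be a scored rubric.

```python
from enum import IntEnum


class Maturity(IntEnum):
    L0 = 0  # chat UI that summarizes existing charts
    L1 = 1  # retrieval grounded in the analytics catalog and metric store
    L2 = 2  # schema-aware tool use, safe SQL, reconciled to canonical definitions
    L3 = 3  # governed autonomous workflow with approvals, audit logs, rollback


def classify(grounded_retrieval: bool, safe_tool_use: bool,
             governed_autonomy: bool) -> Maturity:
    """Return the highest level whose prerequisites are all met."""
    if not grounded_retrieval:
        return Maturity.L0
    if not safe_tool_use:
        return Maturity.L1
    if not governed_autonomy:
        return Maturity.L2
    return Maturity.L3


print(classify(True, True, False).name)
```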
Risk management is non-negotiable. I require strict data governance and privacy-by-design controls, including scoped credentials, PII redaction, policy-aware retrieval, and comprehensive observability (query traces, prompt/response logs, lineage). Feature flags and approval gates prevent unintended metric redefinitions. Red-teaming tasks expose prompt injection, schema drift, and hallucination failure modes before they hit production stakeholders.
Where do agents shine today? Rapid exploration, SQL generation from schema context, summarizing experimentation results, and turning natural-language questions into actionable charts. Where do they struggle? Ambiguous metric semantics, under-specified experiment designs, and edge-case-heavy analyses where ground truth depends on organizational nuance. The cure is disciplined product management: codify definitions, maintain a living analytics taxonomy, and continuously harden your eval suite.
In the context of product analytics stacks, Amplitude analytics is a common anchor for product teams, and many are evaluating “Amplitude’s Global Agent” to accelerate insight generation. In my framework, I look for how well it grounds to canonical metrics, handles retention and funnel tasks, explains trade-offs, and respects governance boundaries—before I consider expanded autonomy. I share the full task matrix and scoring rubric so you can replicate the assessment in your environment.
If you’re getting started, pick your top ten high-frequency analytics tasks and define crisp success metrics for each (accuracy, explainability, latency, and reusability). Build a small eval harness with golden datasets, assertions, and regression tests. Favor a retrieval-first pipeline tied to your taxonomy and metric store, add human-in-the-loop approvals for sensitive actions, then pilot with a cross-functional tiger team. Measure time-to-insight, analyst hours saved, and stakeholder trust—then iterate.
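A small eval harness can start as little more than a list of golden cases and a loop. In this sketch the golden values, task names, and the stub agent are all invented for illustration; a real harness would call your actual agent and load ground truth from vetted datasets.

```python
from typing import Callable

# Hypothetical golden cases: expected metric values with an accepted tolerance.
GOLDEN = [
    {"task": "d7_retention", "question": "What is day-7 retention?",
     "expected": 0.42, "tolerance": 0.01},
    {"task": "signup_conversion", "question": "What is signup funnel conversion?",
     "expected": 0.18, "tolerance": 0.005},
]


def run_evals(agent: Callable[[str], float], cases: list[dict]) -> dict:
    """Run every golden case and report the pass rate plus failing task names."""
    failures = []
    for case in cases:
        answer = agent(case["question"])
        if abs(answer - case["expected"]) > case["tolerance"]:
            failures.append(case["task"])
    return {"pass_rate": 1 - len(failures) / len(cases), "failures": failures}


def stub_agent(question: str) -> float:
    """Stand-in for a real analytics agent; returns canned answers."""
    canned = {
        "What is day-7 retention?": 0.425,
        "What is signup funnel conversion?": 0.25,
    }
    return canned.get(question, 0.0)


report = run_evals(stub_agent, GOLDEN)
print(report)
```

Running the same cases on every prompt or tool change turns the harness into a regression suite.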
Enterprise analytics isn’t a single feature; it’s a system of definitions, workflows, and governance. With a task-based, eval-driven approach, agentic AI can become a reliable partner—not just a novel interface. If you’re evaluating options, apply this framework first, then expand scope as reliability and trust climb.
Inspired by this post on Amplitude – Best Practices.
I hear the same refrain from product leadership peers everywhere: we’re overwhelmed. Shrinking headcount, constant AI disruption, economic uncertainty, and relentless context switching make it feel like we’re carrying two jobs—setting strategy while shielding our teams. I recently listened to an episode of All Things Product that zeroes in on what a real support system for product leaders looks like, and it resonated deeply with my day-to-day.
Want to listen to the conversation yourself? Find it on Spotify or Apple Podcasts.
Here’s the core tension I see (and felt early in my own leadership journey): product leaders tend to underinvest in themselves. We hold onto work because it feels faster, safer, or “just easier if I do it.” But that pattern quietly taxes strategy, slows learning, and caps team throughput. The hidden cost of “doing it all yourself” is real.
Early in my tenure leading product, I tried to keep every plate spinning—roadmap reviews, stakeholder prep, user research, executive updates—while protecting my team’s focus. I was busy and useful, but not maximally valuable. The turning point came when I started building a lightweight support stack: a few hours of executive assistant help each week, targeted research support for bet sizing, and a personal cadence with a leadership coach. The result wasn’t just more time; it was better time.
One provocative point that landed hard: product leaders rarely have executive assistants—and that’s a problem. If your calendar is your operating system, an EA is an extension of your leverage. Mine now handles scheduling, meeting hygiene, prep packets, and post-meeting artifacts. That shift moved me from “calendar triage” to “strategic curation.” It also reinforced a core principle: delegation is a leadership skill, not a weakness. When I delegate outcomes (not just tasks), my team learns, ownership grows, and we ship decisions faster.
Support for strategy work shouldn’t stop at the calendar. Research and data enable better bets. Lightweight research ops, access to product analytics, and brief synthesis sprints keep me anchored in evidence without drowning in artifacts. Paired with a strong community of practice, I get a steady stream of comparative patterns—how other leaders delegate, scope advisory boards, or run decision reviews—which short-circuits trial-and-error.
Coaches were framed as shortcuts for clarity, accountability, and skill-building—and I agree. A good coach compresses cycles, sharpens decision quality, and holds the mirror up when you drift into doer mode. Two quotes captured the mindset perfectly: “You are a pro athlete. It makes sense to think about how you scale your impact without adding more to your calendar.” — Petra Wille. “As you get busier, it becomes more important to focus on the value only you can bring.” — Teresa Torres.
There’s also a helpful nudge to let go of perfectionism: “80% done by someone else is 100% awesome.” — Dan Martell (quoted). In practice, that means I accept great drafts from others, then add the 10–20% only I can contribute—context, narrative, and the sharp edges of the decision.
What about AI? The conversation hits a practical middle ground I share: use AI where it compounds leverage—meeting summaries, research synthesis starters, doc outlines, and backlog triage. But keep humans where judgment, alignment, and context truly matter—strategy framing, stakeholder management, and the final decision-making loops. In other words, apply an AI Strategy that respects product leadership’s uniquely human work.
Key themes I took away: why product leaders struggle to scale themselves; the true cost of “doing it all yourself”; why not having executive assistants limits impact; delegation as a core leadership capability; how to identify and protect the work only you can uniquely do; using research and data to inform strategy; coaches as accelerators for clarity and accountability; communities of practice as a force multiplier; adopting a “professional athlete” mindset; when AI helps—and when humans still matter; and the liberating mantra that “80% done by someone else is 100% awesome.”
If you’re wondering where to begin, start small and practical. Audit your time: what work truly requires you? Experiment with small amounts of support (even a few hours a week). Delegate outcomes, not just tasks. Keep the hands-on work you love—but be intentional. Use peers, coaches, and communities to learn how others delegate. Don’t wait until burnout to build your support system.
Resources mentioned if you want to go deeper: Follow Teresa Torres: https://ProductTalk.org. Follow Petra Wille: https://Petra-Wille.com. Petra’s Coaching for Product Leaders: https://www.petra-wille.com/coaching-packages. Dan Martell’s book Buy Back Your Time: https://www.buybackyourtime.com.
I’m curious: what’s one outcome you’ll delegate this week, and what support would make it stick? Share your thoughts in the comments—your playbook might be exactly what another product leader needs right now.
In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.
Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.
When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.
The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
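To make the planner, executor, verifier, and orchestrator roles concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the production design: `run_workflow`, `StepResult`, and the toy agents are hypothetical names, and the stand-in functions replace what would be model and tool calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    output: str
    approved: bool


def run_workflow(
    goal: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[StepResult]:
    """Orchestrate: plan the goal, execute each step, retry steps the verifier rejects."""
    results: List[StepResult] = []
    for step in planner(goal):
        output, approved = "", False
        # One initial attempt plus up to max_retries retries per step.
        for _attempt in range(max_retries + 1):
            output = executor(step)
            approved = verifier(step, output)
            if approved:
                break
        results.append(StepResult(step, output, approved))
    return results


# Hypothetical stand-in agents so the sketch runs without any model calls.
def toy_planner(goal: str) -> List[str]:
    return [f"research: {goal}", f"draft: {goal}"]


def toy_executor(step: str) -> str:
    return step.upper()


def toy_verifier(step: str, output: str) -> bool:
    return output == step.upper()


results = run_workflow("refund policy answer", toy_planner, toy_executor, toy_verifier)
for r in results:
    print(r.step, "->", r.output, "| approved:", r.approved)
```

Keeping the agents as plain callables means each one can be swapped or tested on its own, and a step that never passes verification is still recorded with `approved=False` instead of being silently dropped.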
To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.
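The promotion rule in eval-driven development can be sketched in a few lines of Python. This is a simplified illustration: `evaluate`, `should_promote`, the two scorers, and the toy eval cases are all hypothetical, and a real eval set would be far larger and scored with richer checks than exact match.

```python
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]            # (input, expected output)
Scorer = Callable[[str, str], float]  # (actual output, expected output) -> score in [0, 1]


def evaluate(
    system: Callable[[str], str],
    cases: List[EvalCase],
    scorers: Dict[str, Scorer],
) -> Dict[str, float]:
    """Run the system on every case and average each scorer over the eval set."""
    if not cases:
        raise ValueError("eval set must not be empty")
    outputs = [(system(prompt), expected) for prompt, expected in cases]
    return {
        name: sum(score(out, exp) for out, exp in outputs) / len(outputs)
        for name, score in scorers.items()
    }


def should_promote(baseline: Dict[str, float], candidate: Dict[str, float]) -> bool:
    """Promote only if no metric regresses and at least one improves."""
    no_regression = all(candidate[m] >= baseline[m] for m in baseline)
    some_gain = any(candidate[m] > baseline[m] for m in baseline)
    return no_regression and some_gain


# Toy eval set and systems, made up purely for illustration.
cases = [("2+2", "4"), ("capital of France", "Paris")]
scorers = {
    "accuracy": lambda out, exp: float(out == exp),
    "safety": lambda out, exp: float("password" not in out.lower()),
}
baseline_system = lambda prompt: {"2+2": "4"}.get(prompt, "unknown")
candidate_system = lambda prompt: {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

baseline_scores = evaluate(baseline_system, cases, scorers)
candidate_scores = evaluate(candidate_system, cases, scorers)
print(baseline_scores, candidate_scores, should_promote(baseline_scores, candidate_scores))
```

Requiring "no regression on any metric" is a deliberately strict choice for this sketch; a team might instead allow small, explicitly budgeted trade-offs between metrics.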
Governance is equally critical. I design privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews—habits borrowed from SRE culture that map well to AI systems.
On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.
If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.
The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.
Over the last year, I’ve had the same conversation with a lot of support leaders.
They’ve deployed AI and are seeing initial efficiency gains, but want to push beyond these early results and achieve meaningful transformation.
When AI is first introduced, the gains show up quickly. Teams resolve higher volumes of queries, free up capacity, and deliver faster responses. But the real opportunity for impact extends well beyond those initial wins. As AI becomes more deeply integrated into support operations, taking on harder, more complex work, those results compound, new ways to create and measure value open up, and the economics of support change entirely. That shift is where I spend most of my time with leaders—turning early efficiency into durable business value.
This sits at the heart of “The 2026 Customer Service Transformation Report.” In this reflection, I explore how deeper integration compounds impact and why that makes business value easier to articulate across the organization—especially to finance and product peers who need to see outcomes, not just output.
The teams going deeper are seeing higher returns. The research shows that 62% of support teams have seen their customer service metrics improve since implementing AI, with early wins showing up most clearly in speed and efficiency. But for teams that have reached mature deployment (where AI is fully integrated into operations), that number jumps to 87%.
As AI programs advance, measurement confidence surges. This chart shows how ROI tracking rises from 35% in exploring to 70% in mature deployments—evidence of a widening execution gap in customer service.
The same pattern holds for the ability to measure ROI. Among teams in early exploration, just 35% say they can measure their return on AI investment, but for teams at the mature deployment stage, that rises to 70%. In my experience, this is the moment the conversation shifts from “is AI working?” to “how much leverage are we creating?”
As AI becomes more embedded in support workflows, what teams choose to measure starts to change. In the early stages of deployment, ROI is typically understood through improved customer response times, lower cost to serve, and freeing up capacity. Teams focus on how much time AI creates and whether it’s relieving pressure on the support organization. These signals help validate that the system is working, but they say little about how that capacity is ultimately used.
As deployments mature, measurement starts to reflect a different intent. Instead of stopping at time saved, teams look at where that capacity is reinvested—into higher value customer work and revenue-generating activities. ROI becomes less about relief and more about leverage. I encourage teams to set targets for capacity redeployment and tie them directly to activation, retention, and expansion outcomes.
The report data shows this clearly. Across all maturity stages, the most commonly cited measure of ROI is "time freed up that the support team can use to focus on value-adding activities for customers." But at mature deployment, that signal intensifies, with 73% of teams citing it, compared to 59% at initial deployment.
Mature AI deployments reveal clearer ROI: teams report more time freed for value-adding customer work (73% vs 59%) and more hours redirected to revenue-generating tasks (56% vs 34%) than initial rollouts.
What’s also interesting is that 56% of mature teams say freed capacity is being directed toward revenue-generating activities, up from 34% at initial deployment. That’s a powerful indicator that AI is shifting from a cost narrative to a growth narrative.
The result is a shift in economic intent: from measuring what AI saves to demonstrating how the capacity it creates is reinvested to drive growth. As a product leader, I anchor this conversation in outcome-based metrics and clear counterfactuals: what would it have cost to deliver the same experience without AI?
As AI takes on more work, the question moves from “does it save money?” to “how does it change the economics of support?” Legacy support economics were built for linear growth: more customer tickets meant more headcount, more outsourcing, and more software costs. Success was measured through containment—the number of queries that didn’t reach human agents. These models worked when volume and effort were tightly linked, but AI doesn’t scale linearly, and it needs to be evaluated differently.
To sustain AI investment and expand its impact, teams need to move beyond cost-cutting narratives and build a clearer case for business value. When done right, AI goes far beyond improving support efficiency. It rewires the financial model, breaking the link between support costs and revenue growth, and turning support into a contributor to customer activation, retention, and lifetime value. This means treating your AI Agent as a new workforce capability that changes how your support function creates and captures value. Here’s what value looks like in an AI-first model:
Deeper AI integration decouples growth from headcount. This split chart shows support volume surging while team size plateaus, revealing how automation unlocks scale, reduces costs, and makes ROI easier to prove.
Human productivity: Your team focuses on more strategic areas, not the queue.
System improvement: Every resolved query makes the system smarter.
Revenue influence: Support becomes a lever for activation, retention, and growth.
Organizational agility: You scale service without scaling headcount.
How does this look in practice? Intercom offers a compelling example with Fin. What started as a focused effort to improve their customer support experience has become one of the clearest illustrations of what happens when AI is fully embraced across an organization.
Since 2022, Fin has helped Intercom absorb more than a 300% increase in customer demand while improving the consistency of delivery—including supporting new routes into support for trial customers and website visitors. Today, Fin is involved in 97% of their customers' conversations. Of those, it resolves 83.5% end-to-end, putting their overall automation rate at 81%.
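The three figures above fit together arithmetically: the overall automation rate is the share of conversations Fin is involved in multiplied by the share of those it resolves end-to-end. A quick check in Python, where the function name is just illustrative:

```python
def overall_automation_rate(involvement_rate: float, resolution_rate: float) -> float:
    """Share of all conversations resolved end-to-end by the AI agent."""
    return involvement_rate * resolution_rate


# Figures quoted in the paragraph above: 97% involvement, 83.5% resolution.
rate = overall_automation_rate(0.97, 0.835)
print(f"{rate:.0%}")  # prints "81%"
```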
That depth of deployment allowed Intercom to scale service without scaling headcount. Without Fin, they would have needed at least 100 additional support teammates to meet rising demand and service standards.
As Fin took on the majority of day-to-day volume, the human support team shifted toward consultative work—helping customers adopt Fin more deeply, succeed faster, and unlock more value from the platform. Intercom now tracks metrics like “direct revenue generated” and “expansion revenue influenced” to understand the impact of these consultative support activities. This repositioned support from a cost center to an active contributor to long-term growth.
The throughline from The 2026 Customer Service Transformation Report is that deployment depth makes a significant difference. Teams that are investing in deeply integrating AI are reshaping how support scales and contributes to growth. Value becomes clearer as AI takes on more work, and support leaders can articulate that value to the rest of the business.
The gap between these teams and those still in the early stages is widening. A select group of pioneers is setting a new bar for what AI-powered customer service can deliver, and understanding what they’re doing differently is the first step toward closing that gap. If you want to dive deeper into the data and frameworks, you can download the report here: https://www.intercom.com/customer-transformation-report?utm_source=blog&utm_medium=internal&utm_campaign=20260128-report-owned-2026cstransformationreport&utm_content=chapterseries_2
I've been getting a lot of questions about why I'm diving so deep into Claude Code, so I want to take a step back and provide some context.
Last March, when I started building my first AI product—the Interview Coach—I felt like I had to figure it all out on my own. I had never built an AI product before, and I didn't have a team I could lean on. It was equal parts energizing and intimidating.
I had a blast digging in, experimenting, and learning what I needed to learn to ship that first AI product. But I also started to wonder, "How are product teams going to learn this stuff?"
As an industry, we are being asked to leverage a new technology that is foreign to us. We are all experimenting and learning what's just now possible. It's moving so fast, it's exhausting just following the news, let alone trying to learn and develop new skills.
My mission has always been to help teams make better product decisions. That still drives me today.
After releasing the Interview Coach, I asked myself two questions: "How am I going to rapidly develop my skill set?" and "How can I help others do the same?" I landed on a three-part plan: First, I'm going to collect and share stories about how other teams are learning and building AI products—that's why I launched Just Now Possible. Second, I'm going to push the boundaries on how I can use AI in my day-to-day life, and I'm going to write about it. Third, I'm going to keep building AI products—and I'm going to write about that, too.
The Claude Code series was born out of number two. It’s had an interesting side effect: it’s also helping me build better AI products.
The more I push the boundaries of what's possible with Claude Code, the more I understand how to build more robust AI products. That’s reinforced my belief that product teams need to get hands-on with this stuff in their day-to-day lives. It’s how we’re going to develop the skill sets we need to build tomorrow’s products.
In my context rot article—where we learned how to manage the context window in Claude Code—I showed just how much day-to-day practice compounds. Today, I want to show how learning about context window management in our day-to-day lives directly maps to managing the context window in the AI products we might build. My hope is to make it crystal clear how experience in one area develops expertise in the other. Let’s dive in.
Discover how product teams engineer context in generative AI: compact prompts, curated turns, external memory, repetition, and sub-agents, all feeding a shared context window to deliver clearer, faster outcomes.
A quick refresher on context window management. In the context rot article, we learned: "what the context window is and what goes into it"; "how to offload conversational context to the file system"; "about the /compact and /clear tools"; "to repeat critical information as the context window fills up to overcome tokens 'lost in the middle' or at the beginning of the input"; and "how to use agents to get access to more context windows."
It turns out these exact same skills are being used by developers to manage the context window in production products. If you haven't read the context rot article, start there: "Context Rot: Why AI Gets Worse the Longer You Talk (And How to Fix It)."
What is Context Engineering? Context engineering is the work that we do to manage the context window in the AI products and services that we build. It's how we give the large language model the context it needs to do the job well. It's also how we manage and mitigate context rot in our products and services, so that we can get the highest performance from the underlying model.
Today, we are going to look at five different strategies that product teams are currently using in their context engineering efforts. You are going to see that each of these strategies ties back to a strategy you might already be using in your day-to-day AI usage (especially if you followed the advice in the context rot article).
Here's how product teams are putting this into practice right now: designing compact system prompts by breaking big tasks into smaller tasks; building external memory/state structures to keep the context window clean; curating what goes into each turn; repeating critical information as context grows; and using sub-agents to grow the context window.
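Several of these strategies can be seen working together in a single context-assembly function. The Python sketch below is an assumption-heavy illustration, not how any particular product does it: `build_context`, its parameters, and the sample data are hypothetical. It keeps the system prompt compact, pulls only the relevant notes from an external memory store, curates the conversation down to the most recent turns, and repeats critical rules at the end once older turns have been trimmed.

```python
from typing import Dict, List


def build_context(
    system_prompt: str,
    critical_rules: List[str],
    memory: Dict[str, str],
    memory_keys: List[str],
    turns: List[str],
    max_turns: int = 4,
) -> str:
    """Assemble the text sent to the model for one turn."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")

    parts = [system_prompt]  # compact system prompt goes first

    # External memory: include only the notes requested for this turn.
    notes = [memory[key] for key in memory_keys if key in memory]
    if notes:
        parts.append("Relevant notes:\n" + "\n".join(f"- {n}" for n in notes))

    # Curate the turn history: keep only the most recent turns.
    recent = turns[-max_turns:]
    if recent:
        parts.append("Recent conversation:\n" + "\n".join(recent))

    # Repeat critical rules at the end once older turns have been trimmed.
    if len(turns) > max_turns and critical_rules:
        parts.append("Reminder:\n" + "\n".join(f"- {r}" for r in critical_rules))

    return "\n\n".join(parts)


# Sample data, invented for illustration.
memory = {"pricing": "Pro plan is billed annually."}
turns = [f"user: message {i}" for i in range(6)]
context = build_context(
    "You are a support assistant.",
    ["Never promise refunds."],
    memory,
    ["pricing"],
    turns,
    max_turns=2,
)
print(context)
```

Sub-agents are left out of the sketch; in this framing, each sub-agent would call something like `build_context` with its own smaller task and its own window.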
Along the way, I’ll share practical guardrails and instrumentation ideas so you can track quality with eval-driven development, reduce context rot, and scale performance predictably.
Why this matters for product trios: these strategies clarify the handoffs between prompt engineering, external memory design, and orchestration, which strengthens collaboration across PM, design, and engineering. Whether you’re exploring generative AI prototypes, hardening a retrieval-first pipeline, or evolving toward agentic AI, context engineering is the backbone of reliable, high-performing experiences.
If you build or lead LLM initiatives as a product manager, consider this your field guide. In upcoming posts, I’ll break down each strategy with concrete examples and templates you can adapt to your stack, so your team can move from experiments to durable, scalable AI workflows with confidence.
I’ve been pushing hard to operationalize AI for real product work, and this episode zeroes in on the moment Claude Code stops feeling like a demo and starts behaving like a dependable teammate. If you’ve ever wondered how to go from clever prompts in the browser to durable, repeatable workflows on your machine, this walkthrough is for you.
My first honest reaction to installing and configuring the desktop agent was the all-too-relatable “this tool thinks everything is a code repo” reality. That framing helped me reset expectations fast: instead of treating it like a magical universal assistant, I began designing guardrails, context, and repeatable routines—exactly how I’d onboard a new team member.
The shift from Claude-in-the-browser to Claude Code on my machine was the unlock. Locally, it can finally work with my files, folders, and workflows. That meant I could ground it in real artifacts—project docs, meeting notes, product specs, and historical decisions—so responses weren’t just plausible; they were contextual and verifiable.
On setup, I now treat /init and CLAUDE.md files as my product requirements. I define roles, boundaries, and canonical sources up front, then run in a deliberate “walled garden.” The “treat it like an intern” model works beautifully: scope access intentionally, expand privileges as trust grows, and keep a tight audit trail of what it can touch and why.
Surprisingly, task management became my ideal on-ramp. It’s easy to validate, the feedback loops are tight, and the ROI is immediate. I export calendar windows rather than granting full calendar access, then let the agent map priorities into Trello, reconcile time blocks, and surface trade-offs. Fast wins build confidence—mine and the agent’s.
Model switching matters more than I expected. When speed is king and “good enough” will do, Haiku keeps the loop snappy. When stakes are higher—complex synthesis, nuanced product strategy, or gnarly ambiguity—I step up to Claude Opus 4.5. Being intentional about when to optimize for latency versus depth is a quiet superpower.
Web tasks can still spiral. When that happens, I pause its autonomy, toggle to fewer steps, and ask, “What are you doing?” Paired with Claude’s Web fetch tool, this gets the agent to lay out its plan step by step, so I can spot brittle assumptions, prune distractions, and re-ground the task.
Content retrieval has become a killer workflow. I point the agent at my archives—blog posts, book drafts, transcripts, notes—and ask, “Where have I talked about this before?” It assembles a map of prior art, connects themes I’d forgotten, and prevents me from reinventing work. Over time, this evolves into a Zettelkasten-style research system that upgrades rigor and accelerates synthesis.
I’ve also turned Claude Code into a publishing engine. From a single transcript, it drafts titles, descriptions, show notes, and chapters, then routes artifacts to Ghost for formatting. Before anything ships, I run fact-checking workflows that validate claims against transcripts and research sources. The output improves, but more importantly, the scaffolding makes quality repeatable.
Reusable workflows compound. I rely on slash commands to trigger common jobs, break down larger efforts with sub-agents, and wire in hooks and plugins where external systems are needed. This is agentic AI at its most practical: fewer hero prompts, more reliable processes.
Audience analytics and content prioritization are helpful with caveats. I let the agent cluster themes and flag gaps, then I pressure-test its suggestions against first-party data and strategic goals. As with any model-driven insight, triangulation beats blind faith.
Two metaphors guide my day-to-day. First, Claude Code is like a dog—sometimes it returns with the stick, sometimes it gets lost in the woods. Second, the “intern” framing keeps me honest: don’t hand it the whole company on day one. With that mindset, my output jumped—more volume without sacrificing quality—because the workflow scaffolding got better.
In this episode, I cover what Claude Code is and why it’s useful even if you’re not an engineer, the real difference between the browser experience and running locally, how to shape behavior with /init and CLAUDE.md files, why task management is the perfect proving ground, when to export calendar windows versus connecting directly, and when model-switching makes sense: Haiku for speed, Opus for depth.
I also dig into debugging web tasks by asking “What are you doing?”, content retrieval workflows across personal archives, building reusable slash-command systems with sub-agents, hooks, and plugins, practical publishing stacks from transcripts, fact-checking against transcripts and research sources, and using analytics to prioritize content—with a healthy respect for uncertainty.
If you’ve been trying to make Claude Code feel less like “throwing a stick into the woods,” this is the candid, tactical tour I wish I’d had on day one. Drop your questions and experiments below—I’m eager to compare notes and refine the playbook together.
“You don’t have to trust the algorithm; you can see exactly why a conversation earned the score it did.”
We recently shared how we redesigned CX Score to deliver deeper, more actionable insights across every conversation. The most common follow-up from support leaders was simpler and incredibly important: “Can I trust it?” It’s the right question—and it’s the one I use as my own bar for whether a metric is ready for the C‑suite.
CS teams are the subject matter experts on customer experience. They understand the nuance of what customers feel, the context behind every interaction, and the difference between a technically resolved issue and a genuinely satisfied customer. I’ve learned, conversation by conversation, that any metric we ship has to capture that nuance at scale—or it doesn’t deserve to be used.
We built CX Score to give support teams a complete view of how their customers feel across every conversation. It surfaces what’s working, what’s not, and why—so leaders can communicate impact clearly and drive change across support, product, and the wider business.
A CX Score in action: repeated CSV export failures trigger a low score and customer frustration, while the AI agent clarifies next steps and gathers details—turning raw signals into actionable support insights.
Here’s exactly how I approached building a trustworthy metric that support leaders can inspect, explain, and defend.
1) It’s grounded in how support teams define quality. I started with how experienced support professionals actually evaluate conversations—collecting real examples of strong, mixed, and poor interactions across industries, identifying the specific factors that shape overall experience, and writing plain-English rules for each. The result: CX Score applies the same criteria a trained support professional would use, not generic LLM assumptions.
2) It’s aligned with human judgment. We created a dataset of thousands of real customer conversations spanning multiple industries, languages, channels, and agent types. Each was manually reviewed by experienced support professionals—with two reviewers per conversation where possible and disagreement resolution to create stable consensus labels. The result: CX Score is trained and tested to behave like an expert reviewer, not a language model making broad guesses.
A modern CX analytics view shows how conversations flow from chat, email, and mobile into AI assistance, then to resolutions and sentiment outcomes—turning messy support data into a single, defensible CX Score.
3) It’s engineered by AI specialists. CX Score isn’t a prompt attached to an LLM. It’s a production system built by Intercom’s AI Group: 37 ML scientists and 350 engineers whose full-time focus is AI for customer service. The system includes specialized handling for long transcripts, model configuration tailored for support language and subtle sentiment, prompt engineering designed to default to neutral when evidence is weak, and a multi-stage evaluation pipeline that checks for precision, consistency, and reliability. The result: A metric built by a team that understands LLM behavior in production support environments, where accuracy and consistency matter most.
4) It’s validated statistically, not qualitatively. Trust requires measurement, not vibes. We tested CX Score across standard ML metrics: precision (when the model flags a negative experience, how often do humans agree?), recall (how many human-identified issues does it catch?), and F1 score (the harmonic mean of the two, which balances them). We set an explicit bar: F1 above 0.8, representing high agreement with human judgment. We reran these evaluations on every revision, checking for regressions or biases, and I focused especially on negative experiences, because a false negative hides a real problem. The result: CX Score meets a measurable standard before it ships. It is not a gut check but a statistical requirement.
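For readers who want to see how those three metrics relate, here is a small Python sketch that computes them from paired model and human labels. The labels are invented for illustration and are not CX Score data; `precision_recall_f1` is a hypothetical helper, and `True` means "flagged as a negative experience."

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[bool], human: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compare model flags against human labels for the positive (flagged) class."""
    if len(predicted) != len(human):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for p, h in zip(predicted, human) if p and h)       # both flagged
    fp = sum(1 for p, h in zip(predicted, human) if p and not h)   # model only
    fn = sum(1 for p, h in zip(predicted, human) if not p and h)   # human only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented example labels for six conversations.
model_flags = [True, True, True, False, False, True]
human_flags = [True, True, False, True, False, True]

precision, recall, f1 = precision_recall_f1(model_flags, human_flags)
meets_bar = f1 > 0.8  # the bar described above: F1 above 0.8
print(precision, recall, f1, meets_bar)
```

In this toy example there is one false positive and one false negative, so precision, recall, and F1 all come out at 0.75 and the 0.8 bar is not met.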
5) It was battle-tested with real customers. Lab accuracy isn’t enough. Customer environments are messy: varied ticket types, mixed languages, unpredictable edge cases. Before release, we ran a multi-phase field test: shadow-scoring conversations with both old and new models, validating sensible behavior across agent type and conversation length, then rolling out to a controlled customer group who confirmed the scores felt right, reasons were clear, and insights were actionable. The result: CX Score shipped because real teams told us it made sense in practice, not because it passed internal tests.
From conversation to clarity: this visual maps the drivers behind a CX Score. Explore how policy feedback, answer quality, and effort combine to produce defendable insights support leaders can act on.
The importance of explainability. One of the most critical choices I made was ensuring CX Score isn’t a black box. Every score comes with clear reasons, concrete excerpts, and a short explanation of what influenced the rating. This turns the metric into something you can inspect, audit, and explain to executives. You don’t have to trust the algorithm. You can see exactly why a conversation earned the score it did.
A metric that evolves with your business. Customer expectations shift. Products change. AI improves. A trustworthy metric can’t be static. CX Score evolves with the same commitments that shaped its redesign: Evaluate the real signals that shape customer experience, keep the logic simple and interpretable, and ensure leaders can make clear decisions from it. It’s built to be a durable source of truth across every conversation.
The takeaway. In a world where products look the same and AI can generate any interaction, customer experience is one of the few differentiators that actually matters. Support leaders have built that expertise conversation by conversation. What they’ve lacked is a measurement system that could validate it at scale: one that’s reliable enough to report to the C-suite, explainable enough to defend in strategy meetings, and rigorous enough to drive real decisions. That’s what CX Score is designed to be: a metric that reflects the reality support leaders see every day, backed by the technical rigor to make it credible everywhere else.
Want to see CX Score in your workspace? Ask your admin to enable it for your team, and start using explainable AI insights to improve customer experience and coach with confidence.