Author: Shivam Tiwari

  • AI Evals for Product Managers: How I Measure Agent Quality—A Beginner’s Playbook

    AI Evals for Product Managers: How I Measure Agent Quality—A Beginner’s Playbook

    I’ve led multiple AI agent launches, and the single most reliable way I’ve found to ship with confidence is to treat evaluations as a product capability, not a side project. When we make AI quality measurable, predictable, and comparable over time, we move faster, reduce risk, and build trust with customers and stakeholders.

    Learn how product managers use AI evaluations to measure agent quality. Covers traces, LLM judges, offline evals, online evals, and how to connect evals to product outcomes.

    Why does this matter so much in product management? Because agent quality is only meaningful when it drives adoption, satisfaction, and revenue. I use eval-driven development to align the day-to-day iteration of prompts, policies, and workflows with business outcomes like activation, retention, and Net Recurring Revenue (NRR). That alignment turns AI quality from an abstract notion into a roadmap lever.

    First, traces. Traces are the spine of evaluation for agentic AI: they capture inputs, intermediate steps, tools invoked, and final responses. I instrument traces to make reasoning visible—what the agent tried, where it hesitated, and why it chose a path. With that visibility, I can compare prompts, policies, and tools, and I can teach the team to fix the root cause instead of patching symptoms. This is also where Agent Analytics becomes real: we move from anecdotes to observable behavior trends across cohorts and use cases.

    Next, LLM judges. I use model-as-judge to score qualities like helpfulness, coherence, or adherence to brand and policy. The trick is calibration. I pair LLM judges with a small, high-quality human-labeled set to ground the scale, then monitor drift as models, prompts, or data shift. LLM judges help me evaluate at speed, but I still spot-check edge cases and highly regulated flows to balance efficiency with risk controls.

    Offline evals come first. Before I expose users to changes, I run fixed test suites representing core scenarios, failure modes, and edge cases. I include golden examples, adversarial prompts, and domain-specific queries. Metrics cover task success, factuality, safety, latency, and cost. This is where prompt engineering and retrieval quality are tuned; if I’m using a retrieval-first pipeline, I evaluate evidence quality separately from generation so improvements are attributable and reproducible.

    Online evals follow to validate real-world performance. I roll changes out behind feature flags and use A/B testing to compare variants under production conditions. I track conversation outcomes, tool success rates, fallbacks to human support, and user satisfaction. These online signals close the loop on whether an offline improvement actually compounds value in the product—critical for product-led growth.

    Connecting evals to product outcomes is non-negotiable. I map quality signals to a driver tree: from per-turn scores (helpfulness, safety, latency) up to session-level outcomes (task completion, deflection, revenue intent), and finally to product KPIs (activation, retention, NRR). With this structure, I can set thresholds for launch gates, prioritize roadmap items that move the biggest levers, and build dashboards that leadership understands at a glance.

    A few lessons learned. Start with a minimal but durable test set and grow it as you discover new failure modes. Version everything—prompts, tools, and datasets—so you can reproduce wins. Beware metric drift when you swap models or update prompts. Blend human review where the cost of error is high. Above all, make evaluations part of your AI workflows and sprint rituals so quality improves continuously, not sporadically.

    If you’re just getting started, begin with traces and a small offline suite, add LLM judges for scale, then prove impact with a focused online experiment. Within a few cycles, you’ll have a living evaluation system that guides decisions, accelerates delivery, and gives your team—and your customers—confidence in every AI release.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • From Solutions Engineering to PMM Leadership: Darshil Gandhi’s Playbook for Amplitude’s Edge

    From Solutions Engineering to PMM Leadership: Darshil Gandhi’s Playbook for Amplitude’s Edge

    I look for product marketing leaders who translate market noise into clear decisions that move roadmap, revenue, and relationships. In that context, Darshil Gandhi exemplifies how competitive rigor and technical depth can sharpen product strategy and accelerate go-to-market strategy across empowered product teams.

    Darshil leads competitive intelligence, partner product marketing and technical marketing at Amplitude. He is a former solutions engineering team principal.

    That blend matters: a solutions engineering mindset grounds messaging in real implementation details, while competitive intelligence and partner product marketing align product positioning, points of parity, and competitive differentiation with what buyers actually evaluate. At a company centered on Amplitude analytics, that cross-functional view helps transform behavioral data into a crisp value proposition customers can feel in evaluations and expansions.

    In practice, I prioritize a few patterns when partnering with leaders who span these domains: align on a single competitive narrative using driver trees that connect capabilities to outcomes; use Amplitude analytics to validate claims and win themes; co-create partner playbooks that make integrations repeatable; and ensure technical marketing closes the loop by pressure-testing demos, docs-as-code, and reference architectures with field feedback. This strengthens stakeholder management across sales, solutions engineering, and product trios, reducing ambiguity and speeding decisions.

    The net effect is clarity: sharper differentiation in the field, cleaner handoffs between teams, and faster feedback cycles that de-risk launches. It’s a model I trust when stakes are high—use the truth of implementation to tell a compelling story, then let the market confirm it.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • We Open-Sourced Our AI Skills Library: Reusable Skills to Supercharge Product Velocity

    We Open-Sourced Our AI Skills Library: Reusable Skills to Supercharge Product Velocity

    We open-sourced our AI Skills library. Here's what we built, why we built it, and how to use it. I’m sharing the approach we’ve used to move faster with more confidence across product discovery, prototyping, and production—while keeping governance, safety, and measurement front and center.

    What we built is a modular, open-source library of “skills” for agentic AI and LLM-powered workflows—things like retrieval and grounding, summarization, classification, tool-use, data enrichment, safety guardrails, and evaluation harnesses. Each skill follows consistent interfaces and conventions so teams can compose them like building blocks, swap implementations without breaking flows, and standardize best practices across products.

    Why we built it is simple: we kept rebuilding the same core capabilities across experiments and teams. Standardizing these skills accelerates time-to-value, reduces integration risk, and helps product trios collaborate with a common language. It also lets us scale what works—prompt patterns, eval datasets, telemetry—so every new initiative starts on third base instead of at bat.

    How to use it in practice: start by running a quick-start example to see a baseline skill chain in action. Then compose your own flow by selecting skills (for example, retrieval + summarization + tool call), configure them with environment variables and guardrails, and wire in evaluation datasets. From there, instrument the pipeline with metrics so you can compare variants and promote the best-performing chain to your main app or API.

    In a typical stack, the library dovetails with analytics and experimentation: ship skill variants behind feature flags, measure impact with A/B testing, and observe runtime behavior with logs and traces. CI/CD hooks let you run evals pre-merge, and production dashboards keep an eye on latency, cost, and outcome quality. This creates a virtuous loop where ideas move from prototype to production with clear evidence.

    Common use cases include customer support summarization and triage, lead scoring and enrichment, anomaly detection in product telemetry, and automated content workflows. Because the skills are composable, you can try multiple retrieval-first strategies, swap prompt templates, or add tools (search, RAG, calculators, connectors) without rewriting everything from scratch.

    Governance and safety are built in. Guardrails handle PII redaction, content policy checks, and rate limiting; configs make it easy to enforce privacy-by-design; and evaluation harnesses encourage an eval-driven development culture. The result is faster iteration without sacrificing data governance or reliability.

    If you want to contribute, add a new skill, improve prompts, share eval datasets, or open an issue with a scenario you want supported. The roadmap focuses on richer retrieval adapters, better test fixtures, and deeper observability so teams can debug and optimize complex chains with confidence.

    I’m excited to see how you’ll use the library to accelerate your roadmap. Clone it, run a quick start, and compose your first workflow today—then measure, iterate, and scale what works. I’ll keep sharing patterns, learnings, and updates as we grow the skills catalog and sharpen the tooling.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Director of Product, Growth & AI at Amplitude: My Playbook for Viral Growth and Engagement

    Director of Product, Growth & AI at Amplitude: My Playbook for Viral Growth and Engagement

    I see the Director of Product, Growth & AI at Amplitude as a mandate to operationalize "viral and core growth strategies, user acquisition, and product engagement" with precision. From my vantage point, that means building a rigorous, metrics-first operating system grounded in Amplitude analytics and product-led growth principles, then layering in an AI Strategy that personalizes experiences without sacrificing control or safety.

    I start by defining a clear North Star Metric and mapping a driver tree to expose causal levers across acquisition, activation, engagement, retention, and monetization. With behavioral analytics and cohort analysis, I quantify which user behaviors correlate with long-term value. I operationalize rapid experimentation through A/B testing with sensible minimum detectable effect (MDE) thresholds, guardrail metrics, and sequential testing to ensure we move fast while preserving measurement integrity.

    For "viral and core growth strategies," I lean on durable growth loops more than one-off hacks. Viral loops might include collaboration invites, user-generated content, and shareable artifacts that make the product more valuable as it spreads. Core growth centers on frictionless activation: guided onboarding, in-app guides, product tours, progressive disclosure, and judicious tooltip design that connects users to the ‘aha’ moment quickly. Session replay and funnel instrumentation help isolate friction and systematically remove it.

    On user acquisition, I connect performance channels and go-to-market strategy tightly to in-product activation. Rather than optimizing for clicks, I optimize for post-signup behaviors that predict retention. This includes improving landing page-message-product congruence, refining qualification (so top-of-funnel aligns with downstream value), and orchestrating lifecycle messaging that nudges users toward key activation milestones.

    To deepen product engagement, I focus on leading indicators of retention and feature adoption. I segment by jobs-to-be-done and intent, then personalize in-app prompts to surface the right capability at the right moment. Retention analysis, pathing, and funnel breakouts inform which nudges to deploy and where—whether that’s smarter checklists, contextual education, or lightweight in-product interventions that turn sporadic usage into reliable habits.

    AI raises the ceiling on what’s possible here. With a thoughtful AI Strategy, I use gen ai to personalize onboarding flows, recommend next-best actions based on behavioral signals, and summarize complex activity patterns into actionable insights for the team. I maintain strict measurement: every AI intervention ships behind feature flags, is evaluated through controlled experiments, and adheres to privacy-by-design principles. The outcome is a system that learns continuously while staying aligned to business and user outcomes.

    Execution is where strategy becomes real. I rely on empowered product trios, continuous discovery with customers, and outcome-focused roadmaps that tie directly to the driver tree. This keeps the organization moving in sync: engineering prioritizes the highest-signal experiments, design accelerates comprehension and task success, and product ensures each release strengthens the core loop rather than adding ornamental features.

    Ultimately, the blueprint is simple and disciplined: anchor on "viral and core growth strategies, user acquisition, and product engagement," quantify what matters with behavioral analytics, and iterate through well-instrumented experiments. Combine that with targeted AI augmentation, and you create a compounding growth engine that is both measurable and resilient.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • A Game-Changing Leap in Voice AI: Fin Voice 2, Apex Flash, and a Live Demo You Can Trust

    A Game-Changing Leap in Voice AI: Fin Voice 2, Apex Flash, and a Live Demo You Can Trust

    In competitive markets, I see two options: try to win the game competitors set, or choose to play a different game. In the "Customer Agents" category, I’ve watched too many glossy, fabricated demos—especially around voice—mask the real challenges. Voice is just extremely hard. We all know the future of customer experiences will be Agent-driven voice, yet most of us haven’t actually spoken with a modern AI Agent when calling a business because the tech hasn’t been truly ready in the wild. Today, the bar moves.

    What changed? There’s a live, public demo of cutting-edge voice tech you can stress test yourself—no smoke, no mirrors. I recommend taking it for a spin: https://fin.ai/voice. It’s fast, natural, and, yes, very, very good.

    For context, yesterday brought Apex Flash, their newest and fastest model, built for the unique demands of low latency channels like voice. Today comes Fin Voice 2, a major upgrade to Fin Voice with over 20 new features, and the first product built on Apex Flash.

    Here are the three things that stood out to me—and why they matter for customer support AI strategy and product strategy.

    First — thanks to Apex Flash, Fin Voice 2 is now the fastest, most natural Agent for phone, with higher resolution rates and customer satisfaction scores than ever before. Apex Flash is trained on millions of customer experience interactions, fine tuned for customer service, and can be configured to understand all your knowledge and follow all your policies. The result is higher resolution at significantly lower latency—the best of both worlds for voice AI agent performance.

    Speed and naturalness here aren’t accidental. Most voice AI products are slow because they convert speech to text, send it to a general model, get a text answer, and then convert it back to speech. Fin Voice 2 was designed to work differently, separating the real time layer that handles speech processing, and the layer that generates answers. That architecture is purpose-built for the demands of customer service on voice.

    Slide for Fin Voice 2, powered by Apex Flash, showing it beats Voice 1: +24.5% average resolution, +8.4% guidance following, +1.3% CSAT, -19.2% time to first audio, -37.6% semantic search latency.
    Powered by Apex Flash, Fin Voice 2 raises the bar on quality and speed—boosting resolution rates and guidance following while cutting time to first audio and semantic search latency, with a lift in CSAT too.

    Second — Fin Voice 2 can handle complex queries end to end: taking actions in external systems, verifying callers’ identities, processing refunds, booking appointments, and more. Phone is a high-stakes channel, and Fin adapts to customers across emotional states, clarifies when needed, and confirms key details before taking action. Most of the time, Fin can resolve the query in full, and when it can’t, it seamlessly hands off to the human team, maintaining full customer context and history. You also get multiple improvements to call quality, plus proactive outbound calls to follow up on unresolved issues—all orchestrated by robust AI workflows.

    Third — Fin Voice 2 gives you total control with industry-leading tools to configure and manage how Fin behaves. You get rich, detailed insights into call behavior and quality, the most common topics of calls, and one-click recommendations to improve. As with everything in Fin, you can fully self-serve and then manage it all with ease, without requiring professional services. Many vendors only let you set up their voice agent under supervision; with Fin, you get everything you need to iterate fast.

    If you haven’t tried the demo yet, go check it out: https://fin.ai/voice. If you prefer to wait, don’t be surprised when you end up speaking with it at a favorite brand soon.

    From a product management lens, this is what matters: latency is a feature customers feel; transparency builds trust in enterprise AI; and control is non-negotiable for CX leaders. The combination of a purpose-built, agentic AI architecture, measurable gains in resolution and CSAT, and true self-serve configuration signals that voice is moving from prototype theater to production reality. That’s the different game I want our industry to play.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • Crafting Beloved Tech Brands: My Moonshot Marketing Playbook for the Post-LLM Era

    I spend a lot of my time asking a deceptively simple question: what does excellent marketing actually look like in 2026? From the vantage point of product leadership, the answer isn’t a spreadsheet or a channel plan—it’s a feeling. Beloved tech brands earn the benefit of the doubt, create gravity around their roadmap, and make customers proud to belong. That kind of momentum is not an accident; it’s a system.

    Here’s the hard truth I’ve learned building and scaling products: giving teams different goals creates dysfunction. When brand, demand gen, product marketing, and comms run on fragmented OKRs, you manufacture internal headwinds. “Marketing is one engine – not separate pieces.” One strategy, one narrative, one set of outcomes—expressed through different craft disciplines and time horizons.

    That unity of purpose clarifies executive roles, too. The real difference between an SVP and a CMO is scope and narrative ownership. A great CMO architects the whole system—portfolio allocation, brand architecture, integrated go-to-market strategy, and the bar for creative taste—while refusing to get dragged into decisions they should never be making (for example, approving every headline or micromanaging channel tactics). Leaders should decide the outcomes, standards, and constraints; teams should control the craft.

    On portfolio design, I run marketing like a portfolio of moonshots. You need a healthy mix: proven programs that compound, emergent bets that learn fast, and a small set of true moonshots that can change the slope of the curve. The point isn’t bravado; it’s risk-balanced exploration. If everything ships safely, you’re under-investing in differentiation. If everything is a swing for the fences, you’re not building a repeatable growth engine.

    This is where taste becomes a strategic advantage. “Ubiquity is the opposite of cool.” If you want to be beloved, you cannot treat every channel, audience, and moment as equal. Early on, selective distribution, distinctive creative codes, and tight community loops create status and meaning. Later, you scale without sanding off the edges that made the product special.

    Why do a few companies build a flywheel of momentum while others stall? They align story, product, and distribution. The product earns trust, the narrative creates aspiration, and the go-to-market strategy ensures the right customers experience both at the right time. Then perception cycles kick in—the Silicon Valley clock turns—and irrational optimism or skepticism can amplify signals. The antidote is compounding proof: consistent product shipping, community advocacy, and creative that makes people care.

    Scaling taste across an organization is teachable. I codify brand principles, narrative guardrails, and examples of “right” versus “almost right.” I replace abstract feedback with decision rubrics—what we keep, kill, or revise and why. I run recurring creative reviews with a small cross-functional council, so judgment compounds. Taste can’t be fully automated, but it can be operationalized: shared references, a story bible, and a high bar for craft that’s explicit, not mystical.

    In a post-LLM world, the fundamentals haven’t changed—but the frontier has. Generative tools supercharge iteration and research, yet the artistry never really left. You still need a point of view, a tension worth resolving, and a value proposition that’s felt, not just stated. Can taste be encoded in software? Parts of it—pattern libraries, style constraints, data-driven feedback—absolutely. But the spark that makes work unforgettable remains human: judgment, risk tolerance, and the courage to ship something that might not fit the playbook.

    That’s why telling an optimistic, yet realistic story about AI matters. Over-automation drains humanity; under-automation wastes potential. The best work pairs AI Strategy with craft leadership: LLMs for rapid exploration, humans for narrative decisions and ethical judgment. Your message should show how AI expands customer agency, not just efficiency.

    The brand-versus-growth debate is a false choice. The right story accelerates pipeline, and the right demand programs reinforce the brand. Look at Apple’s discipline around product truth and design codes, or Google Chrome’s “The Web Is What You Make of It (Dear Sophie)” for proof that emotion and utility can co-exist. Notion, Pinterest, Square, HubSpot, and Harley-Davidson show how community, identity, and product-led growth interlock when the company knows exactly what it stands for.

    When it comes to launches, I’ve learned that announcement videos full of humans, lack humanity. Overproduced gloss often dilutes the truth customers seek: what problem does this solve, how quickly can I feel the value, and why does it matter now? Real users, real context, and a crisp arc from problem to promise will outperform most theatrics.

    Practically, I architect my week to protect taste and outcomes. Early-week for strategy, portfolio reviews, and cross-functional alignment; mid-week for deep creative and product marketing work; late-week for decision clears and postmortems. I time-box “disruptive energy”—space to chase non-obvious ideas—and I guard it like any critical meeting. Without protected cycles for exploration, the urgent will always suffocate the important.

    If there’s a single takeaway: playbooks are obsolete, but the fundamentals are not. The channels change; the psychology doesn’t. Run one engine. Allocate a true portfolio. Scale taste with rigor. In the AI era, make people care. That’s how beloved tech brands are built—and how they endure.


    Book a consult png image
  • Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

    Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

    I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.

    Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.

    Practically, this means I can treat Amplitude analytics as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift—from “insight later” to “insight-to-action now.”

    Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.

    From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments—not replaces—expert judgment. Data governance and permissions remain front and center: the agent sees what your teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.

    If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.

    The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

    Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

    I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

    I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

    Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

    Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

    We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

    Infographic titled 'When Trusted Packages Go Rogue' summarizing a talk on package hacks: worm spread, defense framework, risks from AI coding tools, and practical mitigation steps, with security-themed icons.
    When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.

    On May 11th, I started seeing tweets about a TanStack hack. At that time, I didn’t know what TanStack was. But apparently, it’s a popular set of JavaScript libraries that are used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm—malicious software that self-replicates—and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

    If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

    So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

    Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

    Infographic showing a 3‑step pattern in malicious software: enter via package or script, search a device for sensitive data, then exfiltrate to an attacker, with icons and expanding entry points.
    Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.

    An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.

    In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.

    Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

    You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

    Infographic titled 'Are You at Risk?' listing third-party code exposure points: agent skills and MCP servers, coding agents on local devices, IDE extensions (VS Code, Cursor), and Obsidian plugins.
    Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.

    The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

    Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

    Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

    Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.

    Infographic titled 'Does Cowork Keep You Safe?' with three points: entry point contained, data safe only if kept outside, and partially limited network traffic, highlighting risks in package attacks.
    A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.

    Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?

    It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

    You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

    Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

    Infographic titled 'Keep Building. Stay Safe.' outlining a 3-part series for AI builders: 1 Cowork Safety, 2 Claude Code Config, 3 Off-Device Development, with teal security, AI, and cloud icons and a 'Product Talk' label.
    Amid a wave of package-supply attacks, this Product Talk visual launches a 3-part guide to safer AI building—starting with Cowork safety today, then Claude code config next week, and off-device development coming soon.

    What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

    Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

    One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

    To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.

    My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

    If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.


    Inspired by this post on Product Talk.


    Book a consult png image
  • Broken Procurement Is Costing You Talent: A Product Leader’s Playbook for Speed and Sanity

    Broken Procurement Is Costing You Talent: A Product Leader’s Playbook for Speed and Sanity

    Procurement should accelerate value, not suffocate it. Listening to this episode, I found myself nodding (and wincing) through a painfully familiar story about how well-intended controls morph into barriers that keep great expertise out. As a product leader responsible for speed, outcomes, and brand experience, I see procurement as a direct mirror of culture—and an often overlooked part of the product operating system.

    In the conversation, Teresa is cranky—and honestly, she has every right to be. She’s simultaneously juggling seven speaking engagement contracts, and six of them have become a part-time job in themselves—think 80-page ethics policies, 800-question security forms, and Multi-Factor Authentication (MFA) questions asked 17 different times. Meanwhile, the one company that just put her fee on a credit card? Scheduled, confirmed, and done in two weeks. That contrast is the whole story: friction repels talent; clarity and simplicity attract it.

    Petra adds her own horror story—filling out 12 identical Word document forms—and together they surface a deeper truth I’ve seen across organizations: broken vendor processes don’t just frustrate consultants; they stop companies from getting the expertise they actually need. And despite what many assume, company size isn’t the deciding factor—leadership intent and process ownership are.

    If you’ve ever wondered why a training got canceled, why a speaker backed out, or why your team can’t seem to bring in outside experts, this is likely the culprit: procurement theater. Repetitive forms, unbounded scope creep, and sprawling security reviews create drag that outlasts any short-term legal or compliance gain. The opportunity cost—lost learning, slower progress, and talent that simply says no—is enormous.

    One detail that stood out: with CEO-level buy-in, a legal review timeline collapsed from four months to 10 days. I’ve seen the same thing. Executive sponsorship is the fastest procurement tool there is, and it reveals what the organization truly values. If you can compress the path when a leader cares, you can redesign the path so it’s always faster—without compromising real risk management.

    I also loved the clarity of a simple policy from the episode: Teresa’s new policy is straightforward—her paperwork, credit card payment, no vendor setup—or no speaking engagement. That’s not obstinance; it’s a bright-line test for whether an organization respects expert time and understands total cost. The best experts have options, and friction filters them out first.

    Here’s how I operationalize this in product-led organizations. Tier risk by engagement type (e.g., one-hour talk vs. long-term software vendor) and match the process to the risk. Offer a credit-card fast lane with standard, plain-English terms for low-risk work. Eliminate duplicate data entry and kill redundant questionnaires. Use a single, secure intake that auto-fills known fields. Track cycle time end to end, and publish SLAs for legal, InfoSec, and finance. Most importantly, make vendor experience a first-class metric—because it is a brand experience.

    Security and compliance matter, but they must be right-sized. If you’re buying a keynote, you’re not buying data processing—so why the 800-question security review? Calibrate controls to actual data access and system interaction. The episode even references AWS DynamoDB and GuardDuty, plus Claude Code—helpful reminders that your stack context matters, but not every purchase touches it. Don’t conflate deep technical diligence for a SaaS integration with a simple, no-data engagement.

    There’s a reason the classic film Office Space gets a nod—it’s the perfect metaphor for what happens when well-meaning governance calcifies. Bureaucracy compounds over time, usually after adverse events, until startups—or any team that still moves fast—run circles around you. Procurement that treats experts like adversaries won’t win the race that actually matters: learning faster than the market.

    If you want the full story, listen to the episode here: Spotify (https://open.spotify.com/episode/2JHnTvnZX2WcFczml7ozKY?ref=producttalk.org) | Apple Podcasts (https://podcasts.apple.com/kh/podcast/procurement/id1794203808?i=1000770701690&ref=producttalk.org). It’s cathartic, but more importantly, it’s a blueprint for fixing what’s broken.

    Mentioned in the episode: Hire Teresa to Speak (https://www.producttalk.org/hire-teresa-to-speak/), AWS DynamoDB (https://aws.amazon.com/dynamodb/?ref=producttalk.org), GuardDuty (https://aws.amazon.com/guardduty/?ref=producttalk.org), Claude Code (https://www.claude.com/product/claude-code?ref=producttalk.org), and Office Space (https://en.wikipedia.org/wiki/Office_Space?ref=producttalk.org).

    I’d love to hear your experiences and fixes. Where does your procurement flow break, how do you measure cycle time today, and what would it take to create a vendor experience you’d be proud to put your brand on? Drop your thoughts below and let’s trade playbooks.


    Inspired by this post on Product Talk.


    Book a consult png image
  • Decode How Amplitude AI Thinks: Proven Workflows to Get Actionable, High-Accuracy Results

    Decode How Amplitude AI Thinks: Proven Workflows to Get Actionable, High-Accuracy Results

    I’ve learned that the fastest way to unlock better AI outcomes is to understand how the system reasons, then partner with it deliberately. In product organizations, that means treating AI like a capable collaborator with a transparent process, clear inputs, rigorous checks, and measurable success criteria. When I work this way, my teams ship insights and experiments faster—and with far fewer surprises.

    Discover how Amplitude AI thinks and best practices for working with it. Partner with AI at each step of its process for more accurate, actionable outputs.

    Here’s the mental model I use. AI moves through a series of steps: clarify the goal, ingest context, retrieve and rank relevant information, reason through candidate solutions, draft an answer, self-critique, and refine. My job is to actively guide each step. I define the objective precisely, supply high-signal context, specify constraints, ask for structured reasoning, and require a quality bar before anything ships to stakeholders.

    Start by setting intent and success criteria. I write a one-sentence objective (“what problem are we solving now”), then define the evaluation rubric (“what good looks like”) up front. This small habit powers eval-driven development: it keeps AI outputs aligned with product goals, not just plausible-sounding text. I’ll often include target metrics and guardrails, such as confidence thresholds or required evidence from “Amplitude analytics.”

    Next, I curate the context. For analytics use cases, I provide event taxonomies, metric definitions, segments, and recent behavioral analytics trends to ground the model. A retrieval-first pipeline helps here: I scope the corpus, trim noise, and apply context window management so the model sees only what’s essential. The result is sharper, faster answers that map to our real data model and “unified analytics platform.”

    Then I shape the prompt. I use concise role framing, 1–3 high-quality exemplars, and explicit constraints (format, length, tone, citation requirements). I also ask the model to show its reasoning with a short, labeled scratchpad and to state uncertainties. This is practical prompt engineering—not magic—designed to make reasoning inspectable and reproducible across “AI workflows.”

    When tools are available, I encourage agentic AI patterns: let the system plan, call functions, and iterate. With “Amplitude AI,” I ask it to propose the next best analysis (e.g., segment drill-down, funnel step attribution, or anomaly detection), run it, summarize findings, then reflect on whether the next step changes. If you’re using “Amplitude MCP,” formalize these actions as callable tools so the model can chain them reliably.

    Quality is never an afterthought. I build lightweight evaluations into every loop: compare the model’s output against the rubric, check factual grounding, and A/B test alternative prompts for clarity and conversion where appropriate. Over time, these evaluations become our regression suite, giving us confidence as data, prompts, or model versions evolve. This discipline keeps LLMs for product managers aligned with shifting business priorities.

    Finally, I turn insights into action. I ask “Amplitude AI” for decision-ready artifacts—clear hypotheses, prioritized opportunities, and concrete next steps owners can execute. I require the model to cite the specific supporting events or segments and to flag assumptions. That last step is crucial: it invites human judgment where it matters and prevents automation from outpacing accountability.

    This approach doesn’t slow teams down; it speeds them up with focus. By guiding each step—intent, context, reasoning, tools, and evaluation—you transform AI from a black box into a reliable copilot. The payoff is tangible: clearer insights, faster cycles, and outputs stakeholders trust the first time they see them.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • Join Me in June: Master Opportunity-First Product Strategy with Continuous Discovery Habits

    Join Me in June: Master Opportunity-First Product Strategy with Continuous Discovery Habits

    I’m celebrating the five-year anniversary of Continuous Discovery Habits by inviting you to read it with me this June. As someone who leads product management and coaches product trios, I’ve seen how a shared discovery practice tightens alignment, speeds up learning, and drives outcomes. This month, we’ll go deep on prioritizing opportunities—not solutions—and I’ll guide you step by step so you can apply the ideas on your own team.

    Each month, I’m releasing an in-depth reading guide that includes:

    We’ll discuss each month’s reading in the comments, and we’ll gather quarterly on a live call to unpack real-world applications, trade wins and missteps, and keep the momentum going.

    Joining late? No problem. I monitor the comments on each reading guide throughout the year. Start with the current month or go back to January—whatever works for you. Ask for help, share what’s working, and connect with other readers at any point.

    If you want to participate, grab a copy of the book (or dust off your old copy), share the “Spread the Love” videos with your team, block time for the exercises, and register for the community sessions. Let’s do this.

    This Month’s Reading

    Chapter:

    Estimated reading time: ~16 minutes

    This month's chapter will introduce you to:

    Need a copy? Grab the book

    Share the Love with Friends and Colleagues

    We learn best in community. Use these short videos to spread the key ideas across your product trios, engineering partners, and stakeholders. Invite them to read along with you so your discovery cadence—and your product strategy—advance together.

    Reflect & Discuss What You Read

    When we reflect and discuss what we read, we absorb more and apply it faster. This chapter challenges a deeply ingrained habit: prioritizing solutions. I’ve been in those meetings—spreadsheets full of features, heated roadmap debates, and a creeping sense that we’re optimizing outputs rather than outcomes. The shift to opportunity-first thinking changed how my teams frame bets, sequence discovery, and communicate product strategy.

    Individual Reflection

    Team Discussion

    Put It Into Practice

    This month is all about shifting from solution-first to opportunity-first thinking. These short, focused exercises will help your product trio practice opportunity prioritization and improve decision speed without sacrificing product discovery rigor.

    Exercise: Map Your Roadmap to Opportunities

    Time: 45 minutesDo this: With your product trio

    Take your current roadmap or backlog and work backwards. For each planned feature or solution:

    This exercise often reveals that you're either:

    Use these insights to inform your next prioritization conversation.

    Exercise: Practice Two-Way Door Thinking

    Time: 30 minutesDo this: With your product trio

    Choose 3-5 recent or upcoming product decisions. For each one, discuss:

    The goal is to calibrate your team's decision-making speed. Two-way door decisions should be made quickly with "just enough" evidence. One-way door decisions deserve more deliberation and data.

    Go Deeper: Additional Reading

    If you prefer an audio summary of this month’s reading, including the book chapters and the following resources, I’ve included an audio version for members at the bottom of this post.

    Related In-Depth Guides

    Supplementary Reading

    Related Courses

    Our Live Discussion Schedule

    Our live discussion sessions are for registered members. Sessions are not recorded. Invitations will go out two weeks before the scheduled event—reserve time now.

    Audio Summary

    Prefer to listen? Stream the audio overview here: June — Prioritizing Opportunities (audio).

    Ready to put continuous discovery into action? Grab the book, share the videos with your team, schedule the exercises, and join the community sessions. Opportunity-first product strategy is a muscle we can build together.

    The chapters we will be readingA preview of the most important concepts we'll be learning aboutShort videos you can share with friends and colleagues to help spread the ideasIndividual and team discussion questions to help you absorb and engage with the readingTeam exercises to help you put the ideas into practiceAdditional reading to help you go deeper on the core ideasChapter 7: Prioritizing Opportunities, Not SolutionsWhy product strategy happens in the opportunity space, not the solution spaceHow to focus on one target opportunity at a time to deliver value iterativelyUsing the tree structure to simplify prioritization decisionsThe four criteria for assessing opportunities: sizing, market factors, company factors, and customer factorsWhy treating prioritization as a messy, subjective decision leads to better outcomes than scoring formulasThe concept of two-way door decisions and how they apply to opportunity prioritizationWork on one small opportunity at a time – Reduce your batch sizeGetting started with compare and contrast decisions – Choose the right target opportunityTurn big intractable problems into smaller, more solvable problems – The power of decompositionThink about your team's current roadmap or backlog. How much of your time is spent prioritizing features versus understanding and prioritizing customer opportunities? What would change if you flipped that ratio?Reflect on the last time you made a product decision. Did you treat it as a one-way door (irreversible) or a two-way door (reversible)? How did that framing affect your decision-making process and timeline?Consider the four assessment criteria (opportunity sizing, market factors, company factors, customer factors). Which of these does your team currently emphasize most? Which do you tend to overlook or underweight?As a team, list the top 5-10 items on your current roadmap or backlog. For each one, try to identify the underlying customer opportunity it addresses. If you can't clearly articulate the opportunity, what does that tell you about how you're making decisions?The chapter argues against scoring formulas (like RICE or ICE) for prioritization, calling them "made-up math." If your team uses a scoring system, discuss: What is it really measuring? Does it help you make better decisions, or does it just make subjective decisions feel more objective?Walk through a recent prioritization decision. Did you assess options in isolation ("should we build this?") or compare and contrast them? How might your decision have been different with a compare-and-contrast approach?Identify the customer opportunity it's meant to addressWrite it as something a customer might say (e.g., "I can't find anything to watch" not "We need better search")Look for patterns: Are multiple solutions addressing the same opportunity? Are some solutions disconnected from any clear customer need?Spreading yourself thin across too many opportunitiesOver-investing in a single opportunity with multiple solutionsBuilding solutions with no clear opportunity attachedIs this a one-way door decision (hard to reverse) or a two-way door decision (easy to reverse)?If it's a two-way door, what's the smallest step we could take to learn whether we're on the right track?What would we need to see to know we made the wrong choice?If we realize we're wrong, how quickly could we course-correct?Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive OutcomesCustomer Interviews: Uncover Hidden Insights from Every ConversationPrioritize Opportunities, Not Solutions7 Key Benefits of Using Opportunity Solution TreesProduct in Practice: How 2-Way Door Decisions Helped Simply Business Learn FastProduct in Practice: Getting Started with Opportunity Solution Trees at SuperAwesomeProduct Discovery Fundamentals: Learn a structured and sustainable approach to continuous discovery.Tuesday, June 16, 2026: 9am-10am PDTThursday, September 17, 2026: 9am-10am PDTWednesday, December 16, 2026: 9am-10am PST


    Inspired by this post on Product Talk.


    Book a consult png image
  • Stop Support Tickets Before They Start: How AI Unsticks Users and Lifts Conversions

    Stop Support Tickets Before They Start: How AI Unsticks Users and Lifts Conversions

    Every moment of friction in a product carries a hidden cost: attention drifts, motivation wanes, and the next click becomes a support ticket—or worse, silent churn. Over the years, I’ve learned to treat “stuck” as an urgent product signal, not just an operational nuisance. When we unstick users in the flow, we protect revenue, brand trust, and the momentum that powers product-led growth.

    Learn how Amplitude’s Global Support team uses AI Assistant to reduce support tickets, prevent user churn, and increase conversions.

    I reference that line often because it captures a proven pattern: meet users where confusion peaks and resolve it instantly. In my practice, the formula is straightforward—pair behavioral analytics and session replay with a just-in-time AI Assistant, routed by clear driver trees. This transforms support from reactive firefighting into a proactive, in-product experience that accelerates onboarding and boosts user activation.

    Here’s how I operationalize it. First, I use Amplitude analytics and behavioral analytics to surface high-friction steps—pages with elevated drop-off, loops, or rage clicks. Session replay clarifies the “why” behind the numbers, while cohort and retention analysis reveal who’s most at risk. Then I deploy targeted in-app guides and tooltip design to preempt known pitfalls, while an AI Assistant handles real-time questions with context from our knowledge base and product docs.

    The AI Assistant is more than a chatbot. With well-structured AI workflows, it detects intent, pulls precise snippets from docs-as-code, and handles routine issues instantly. When complexity spikes, it executes a graceful handoff to consultative support via Intercom or a Zendesk integration—preserving conversation history and sentiment cues—so humans spend time where judgment matters. This hybrid model keeps response times low without sacrificing quality.

    To de-risk changes, I lean on A/B testing and feature flags. I measure time-to-value, activation rate, and funnel conversion as leading indicators, while tracking ticket deflection, CSAT, and NRR as trailing indicators. The goal isn’t just fewer tickets; it’s faster learning loops and a compounding improvement in user outcomes. When we see activation curves steepen and onboarding friction flatten, we know the system is working.

    Practically, I start with the top three friction points in onboarding, implement narrow in-app guides, and deploy the AI Assistant with strict guardrails and clear escalation paths. Weekly reviews align product, customer success, and solutions engineering around shared telemetry—so we tune prompts, content, and UI patterns together. Over time, I’ve seen ticket volume decline meaningfully, while conversion and retention rise as users experience fewer dead ends.

    If you’re evaluating where to begin, identify the moments where confusion compounds—pricing configuration, integrations, and data mapping are common culprits. Then introduce targeted, context-aware help right where users hesitate. You’ll not only prevent “every stuck user” from turning into a ticket—you’ll convert friction into confidence, and confidence into growth.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image