Tag: AI risk management

  • Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

    Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

    I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

    I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

    Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

    Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

    We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

    Infographic titled 'When Trusted Packages Go Rogue' summarizing a talk on package hacks: worm spread, defense framework, risks from AI coding tools, and practical mitigation steps, with security-themed icons.
    When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.

    On May 11th, I started seeing tweets about a TanStack hack. At that time, I didn’t know what TanStack was. But apparently, it’s a popular set of JavaScript libraries that are used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm—malicious software that self-replicates—and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

    If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

    So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

    Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

    Infographic showing a 3‑step pattern in malicious software: enter via package or script, search a device for sensitive data, then exfiltrate to an attacker, with icons and expanding entry points.
    Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.

    An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.

    In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.

    Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

    You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

    Infographic titled 'Are You at Risk?' listing third-party code exposure points: agent skills and MCP servers, coding agents on local devices, IDE extensions (VS Code, Cursor), and Obsidian plugins.
    Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.

    The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

    Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

    Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

    Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.

    Infographic titled 'Does Cowork Keep You Safe?' with three points: entry point contained, data safe only if kept outside, and partially limited network traffic, highlighting risks in package attacks.
    A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.

    Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?

    It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

    You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

    Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

    Infographic titled 'Keep Building. Stay Safe.' outlining a 3-part series for AI builders: 1 Cowork Safety, 2 Claude Code Config, 3 Off-Device Development, with teal security, AI, and cloud icons and a 'Product Talk' label.
    Amid a wave of package-supply attacks, this Product Talk visual launches a 3-part guide to safer AI building—starting with Cowork safety today, then Claude code config next week, and off-device development coming soon.

    What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

    Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

    One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

    To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.

    My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

    If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.


    Inspired by this post on Product Talk.


    Book a consult png image
  • AI Operating Model Playbook: Why 80% Stall—and How the Top 1% Accelerate with Discipline

    AI Operating Model Playbook: Why 80% Stall—and How the Top 1% Accelerate with Discipline

    I keep meeting talented product teams who can demo impressive proof-of-concepts but can’t get durable business impact into production. The difference isn’t raw ingenuity—it’s the operating model. As I’ve scaled AI initiatives in my own organization, one sentence has proven painfully accurate: "What the top 1% of AI-native product teams are doing differently – and why most won't catch up without rebuilding the operating model."

    When I say “AI operating model,” I mean the end-to-end way we set strategy, discover value, build, ship, govern, and learn—specifically adapted for AI systems. If we try to bolt AI onto a classic software cadence, we stall. If we rebuild our operating model around AI’s unique constraints and compounding advantages, we accelerate.

    It starts with strategy. I anchor our portfolio to explicit outcomes, not features—tying every initiative to measurable customer and commercial impact. Driver trees and an opportunity solution tree make tradeoffs transparent, while outcomes vs output OKRs prevent us from celebrating activity over results. This is how empowered product teams earn autonomy without losing alignment on the AI Strategy.

    Next is discovery. Continuous discovery reframes “can we ship a model?” into “can we change a behavior or decision with acceptable risk?” I pair customer interviews with in-product telemetry and journey mapping to qualify moments of high value and high frequency. The litmus test: can we describe the target workflow in plain language and simulate success before training models? If not, we’re not ready.

    Data foundations come third. A retrieval-first pipeline is now my default, not an afterthought. We invest in data governance, privacy-by-design, and observability so we can explain where answers come from, prove consent, and debug drift. Without trustworthy data and clear lineage, every downstream AI promise is fragile—and your AI readiness is mostly theater.

    Then I insist on eval-driven development. Before we optimize prompts or tune models, we define offline and online evals that represent the real task, including safety and “gotcha” cases. We treat prompt engineering, context window management, and agentic AI patterns as hypotheses that must beat a baseline under repeatable tests. This moves debate from opinions to evidence.

    Shipping is where most teams quietly stall. We integrate AI into our CI/CD with feature flags, shadow modes, and progressive rollouts, building MLOps into the same platform that runs our services. I watch DORA metrics to keep delivery velocity healthy, but I also watch AI-specific signals—input distribution shifts, response variance, and time-to-mitigation—so we catch regressions before customers do. Platform scalability matters more when inference costs and latency can spike overnight.

    Governance isn’t a gate at the end; it’s a runway from the start. We operationalize AI risk management with tiered reviews, model and data cards, and clear escalation paths. The goal is not to slow down, but to reduce surprise—so product managers, engineers, and legal share the same playbook for safety, fairness, and regulatory compliance.

    Value capture closes the loop. We connect product metrics to commercial levers like Net Recurring Revenue (NRR) and retention analysis, then shape packaging so customers pay for outcomes, not raw compute. This is where product-led growth meets sales-led growth: we demonstrate value in-product, then arm go-to-market teams with unambiguous proof.

    So why are 80% of teams stuck? Three patterns recur: technology FOMO masquerading as strategy, fragmented data that can’t support high-quality retrieval, and a lack of evals that forces decisions by vibes. Add ad hoc governance and you get pilots that impress in slides but wither under real-world variance.

    How do the top 1% think differently? They rebuild the operating model first. They position discovery around workflows, not models. They invest in retrieval-first architectures early. They standardize evals. They ship with guardrails. And they treat “learning per week” as a sacred metric—because compounding insight beats sporadic heroics.

    If you need a 90-day plan, here’s the sequence I use. Week 1–2: run a content audit of data sources and map the top five repeatable workflows ripe for AI leverage. Week 3–4: define success metrics and offline evals for one beachhead use case. Week 5–8: build the retrieval pipeline, implement prompt baselines, and instrument observability. Week 9–12: ship behind feature flags, run A/B testing with safety thresholds, and iterate on failure cases. By the end, you’ll have a reusable blueprint—not just a demo.

    Team design matters. I staff product trios (PM, design, tech lead) with forward deployed engineers or solutions engineering partners who sit with customers. That proximity reduces spec ambiguity and accelerates learning. It also sharpens our product roadmapping and sprint planning because we plan against outcomes, not outputs.

    The hardest part is emotional, not technical: letting go of familiar software rituals that don’t serve AI. Once we accept that AI demands a different operating rhythm, progress feels lighter. The top 1% don’t have secret models; they have disciplined systems. Rebuild yours, and the compounding benefits will outpace any single model upgrade.


    Inspired by this post on Product School.


    Book a consult png image
  • Is Technology Still Net Positive? A Product Leader’s Reckoning and Playbook for Humane Growth

    Is Technology Still Net Positive? A Product Leader’s Reckoning and Playbook for Humane Growth

    I’ve spent my career building products on top of the internet, championing social media, and now scaling AI. Lately, I keep returning to an uncomfortable but necessary question: are we still building a net positive future—or have we drifted into something else entirely?

    A recent long-form conversation in my podcast queue challenged me to do a deeper self-audit. If you want to hear the debate that sparked this reflection, you can listen on: Spotify | Apple Podcasts. What follows is my synthesis as a product management leader: the hard truths, the hopeful paths forward, and the practical actions I’m taking with my teams.

    The moment that hit me hardest was a family member’s blunt assessment that the internet has become “net negative.” That phrase landed like a wake-up call—a reminder that those of us inside tech often operate in an echo chamber. We see our roadmaps, our metrics, our progress; the rest of the world experiences the second-order effects. As a leader, I have to seek out those outside-in perspectives with the same rigor I apply to any product discovery practice.

    Another truth I can’t ignore: somewhere along the way, parts of our industry slid from “make people’s lives better” to “extract maximum value at any human cost.” You can see it in incentives that prioritize growth at all costs, in waves of layoffs that treat people as an expense line, and in platform behaviors that resemble a modern tycoon era. This isn’t just a moral critique—it’s a product strategy risk. Extractive models erode trust, weaken retention, and invite regulatory and reputational headwinds that no amount of optimization can out-execute.

    The loneliness crisis is real, and technology has too often replaced human connection instead of augmenting it. Spend a week in San Francisco and you’ll notice what I call “isolation by design”—QR-code menus, autonomous Waymos, frictionless everything, but fewer genuine human moments. It’s efficient, yes, but alienating. No algorithm can substitute for physical touch, care, and community. As builders, we should design products that create on-ramps to real-world connection, not cul-de-sacs of infinite scroll.

    We still have agency. “Don’t be evil” shouldn’t be a nostalgic slogan; it should be a minimum bar. Responsible product management means being a citizen of the ecosystems we influence: naming trade-offs clearly, instrumenting for externalities, and building AI risk management into our operating cadence. It also means stepping outside the industry narrative to ask neighbors, parents, teachers, and small business owners how our products actually land in their lives.

    One idea that gives me hope is “mom and pop tech”: AI-enabled, hyper-local tools crafted for specific neighborhoods and communities. Think “inch wide, mile deep”—software that solves a real problem for a defined community rather than chasing a horizontal total addressable market. Consider ride share. The extractive platform playbook maximized liquidity but squeezed drivers and frayed local fabric. A community-owned alternative could optimize for safety, fair wages, and neighborhood vitality over blitz-scaled margins. That’s civic tech with a viable product strategy.

    I’m also watching how social norms evolve. At a recent Elternabend at a German primary school, parents collectively agreed to delay smartphones until age 11 or 12—a striking shift from just five years ago when many 7–8 year olds had devices. Culture moves, sometimes faster than we expect. Product-led growth that ignores cultural momentum (or ethical guardrails) is fragile growth.

    So what do we do on Monday morning? First, rebuild our discovery muscles outside the echo chamber: continuous discovery with the people most affected by our products, not just our power users. Second, measure what matters: add well-being, community impact, and qualitative trust signals to the same dashboards that track activation and retention. Third, resist technology FOMO—choose fewer bets and go deeper, especially where AI can be applied responsibly to unlock real-world value. Fourth, cultivate communities of practice that normalize responsible experimentation, privacy-by-design, and transparent communication. Finally, narrate the change: as product people, we are educators as much as we are builders; our stories shape what teams believe is possible.

    If you’re looking for frameworks to anchor this work, revisit classics like Bowling Alone: The Collapse and Revival of American Community for context on social capital, and pair that with modern conversations on local resilience and community spaces. The future isn’t written yet. With clear principles, careful incentives, and the courage to narrow our scope in service of depth, we can still build technology that strengthens the bonds that make life worth living.

    I’d love to hear how you’re approaching this in your organization—especially examples of “mom and pop tech,” AI Strategy in service of community, or product strategies that trade a little scale for a lot of human good. Join the conversation in the comments.


    Inspired by this post on Product Talk.


    Book a consult png image
  • Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

    There’s a question that runs underneath every AI Agent evaluation: what can it do?

    Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

    That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

    The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

    The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

    “Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

    Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

    Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

    Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, variations that have never been written down because they’ve always lived in someone’s head.

    Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

    Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

    Most companies have the first two, some have the third, fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

    We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

    One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

    In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding—walking users through processes step by step, rather than executing changes on their behalf. It worked, it resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

    Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

    The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI,” they had more standardized operations. Actions that were already well-defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior, it was triggering actions that were already tightly controlled elsewhere.

    Readiness enables capability, not the other way around. Which reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

    Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

    The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

    What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

    In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

    Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • Proven 3-Step Playbook to Quantify AI Agent ROI: Boost Revenue, Cut Costs, Reduce Risk

    AI agents are only as valuable as the measurable outcomes they deliver. In my role leading product strategy at HighLevel, I’ve learned that the fastest way to earn executive trust is to translate agent performance into clear revenue impact, cost savings, and risk reduction. The challenge isn’t enthusiasm for AI; it’s creating a disciplined, repeatable way to prove business value.

    Here’s the three-step playbook my teams and I use to quantify the value of agentic AI, align stakeholders, and scale what works.

    Step 1 — Define value outcomes and success criteria. Start with a driver tree that ties agent outcomes to company-level goals. For revenue, target conversion lift, average order value, and expansion (e.g., trial-to-paid, self-serve upsell). For cost, focus on containment/deflection rate, reduced handle time, and lower cost to serve. For risk, measure error rates, hallucinations, security/policy violations, and customer complaint rate. Convert these into outcomes vs output OKRs, set baselines, and pre-commit to thresholds for launch, scale, or rollback. This ensures the team is accountable to business KPIs, not vanity metrics.

    Step 2 — Instrument comprehensively and establish baselines. Instrument the full journey: prompts, responses, human-in-the-loop events, escalations, feedback, and downstream conversions. Capture both leading indicators (time-to-first-value, containment rate, self-serve completion) and lagging outcomes (NRR, churn, LTV/CAC). Use behavioral analytics, session replay, product tours, and in-app guides to contextualize what users do before and after agent interactions. Baselines matter—freeze a control period so improvements are truly incremental.

    Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.

    Step 3 — Experiment, attribute, and risk-adjust. Treat every agent capability like a hypothesis. Run A/B tests or holdouts with a precomputed minimum detectable effect so you can ship confidently. Attribute outcomes to the agent by linking events to conversions and support deflection, and calculate ROI as (incremental revenue + cost avoided – total operating cost, including model/API, labeling, and oversight). Apply AI risk management by tracking false positives/negatives, escalation rate, and policy breaches; adjust ROI with a risk score so the “cheapest” agent isn’t inadvertently the riskiest. This is eval-driven development in practice: define success, measure, iterate.

    Operationalizing the playbook requires crisp reporting. Stand up Agent Analytics dashboards in your unified analytics platform that roll up per-agent KPIs, funnel performance, cohort trends, and experiment results. Review them in QBRs and with frontline teams to connect numbers to lived customer experience. When metrics improve, amplify with product-led growth motions—targeted in-app guides and lifecycle nudges to get more users into high-value agent flows.

    What does this look like in the real world? Early on, we celebrated “tickets deflected” and missed that some conversations quietly increased churn risk. After we adopted this three-step approach, we saw the full picture: a modest dip in deflection quality was offset by a larger lift in expansion revenue and a meaningful drop in time-to-resolution. The risk-adjusted ROI was unambiguous, and the CFO greenlit broader rollout.

    If you’re building or scaling AI agents, anchor on outcomes, instrument ruthlessly, and insist on experimentation. With the right measurement discipline, you’ll know exactly which agents deserve more investment, which need redesign, and which should be retired. The result is a portfolio of agents that reliably drive adoption, engagement, and durable business value.


    Inspired by this post on Pendo – Best Practices.


    Book a consult png image
  • My Playbook for Safe AI Analytics in Financial Services: Compliance, Trust, and Real Workflows

    My Playbook for Safe AI Analytics in Financial Services: Compliance, Trust, and Real Workflows

    I spend a lot of time helping financial services teams adopt AI analytics without compromising on risk, compliance, or customer trust. The stakes are high: regulations are evolving, data sensitivity is non‑negotiable, and a single misstep can erode confidence. That’s why my approach centers on governed AI, rigorous data governance, and measurable business value—not flashy demos.

    Learn how Amplitude delivers safe, governed AI analytics for financial services—aligned to compliance, built for trust, and ready for real workflows.

    In practice, “safe and governed” means clear lines of accountability and controls that hold up under audit. I look for privacy-by-design principles, role-based access controls, robust audit trails, and granular data permissions that keep sensitive data segregated. Strong AI risk management also requires model oversight—documented policies, human-in-the-loop review where needed, and explainability for high-impact decisions. Above all, the platform must meet regulatory compliance expectations and support the organization’s risk posture without slowing teams down.

    Real workflows are where the value shows up. In financial services, that can mean using behavioral analytics to understand user intent, applying anomaly detection to surface suspicious patterns earlier, and empowering product managers and analysts to iterate safely within a unified analytics platform. When these capabilities are built into the core analytics motion, I see faster detection of issues, clearer attribution of outcomes, and more confident decision-making—all while staying within governance guardrails.

    When I evaluate a solution, my checklist is simple and strict: does it enforce strong data governance by default; does it provide transparent, auditable AI behaviors; can it scale securely to meet enterprise requirements; does it tie insights directly to product and growth outcomes; and will it help risk, compliance, and product teams work together instead of at cross purposes? If the answer is yes across that list, the platform earns a place in the enterprise toolbelt.

    Done right, governed AI analytics give financial services teams the confidence to move faster with less risk. You gain sharper insights from behavioral data, earlier warning from anomalies, and the trust that comes from controls that are aligned to compliance and resilient under scrutiny. That’s the path to durable advantage: responsible AI that accelerates learning, protects customers, and translates directly into better products and performance.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

    How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

    Only 10% of the plastic we manufacture gets recycled. For a century, we have relied on mechanical and chemical methods that were never designed to close the loop. As a product leader, I look for step-change technologies that break through entrenched ceilings, and biology—specifically engineered enzymes—has emerged as that missing piece.

    Recently, I dug into the work of Rhea's Factory and spoke with their founders, Arzu Sandıkçı (co-founder and CEO) and Mert Topcu (co-founder). Arzu brings deep expertise in molecular biology and enzyme engineering. Mert brings 20 years in tech, including a decade at Google as a product manager. Their combined perspective—domain science plus product rigor—shows up in every design choice.

    Rhea's Factory has built an AI platform that uses protein language models, multi-step agentic pipelines, and proprietary wet lab data to design novel enzymes that deconstruct plastic polymers into their original monomers—selectively, at low temperatures, and at industrial scale. That stack matters: it layers foundation models with domain-specific constraints and real-world data to systematically explore, evaluate, and scale candidates.

    Here’s the crux: traditional recycling mostly just chops polymer chains into shorter fragments. Enzymatic recycling, by contrast, breaks plastic all the way back to its original monomers. Think of a necklace and pearls analogy—mechanical methods snip the chain; enzymes cleanly return the pearls. The result is true circularity: you can remake high-quality plastic without downcycling.

    Selectivity is the superpower. Enzymes can target specific plastic types even in mixed waste streams, operating at low temperatures in a controlled, low energy reactor process. That combination of precision and energy efficiency is why this approach can be both greener and economically competitive.

    The field accelerated after the discovery of a plastic-eating bacteria in Japan, which opened the door to enzymatic recycling. Advances in protein structure prediction—“AlphaFold” and the Nobel Prize in Chemistry—transformed what’s possible in enzyme engineering, and created space for AI-native design loops to flourish.

    On the AI side, the team evolved from a human-orchestrated pipeline to an agentic AI scientist. Problem statements serve as inputs, multi-step protein generation builds on foundation models, and guardrails at each pipeline step keep the AI pointed in the right direction without limiting exploration. It’s a textbook example of agentic AI applied to a highly constrained, safety-critical domain.

    Crucially, wet lab feedback closes the loop. Why wet lab data—even just hundreds of proprietary data points—can be enough to train a powerful domain-specific prediction model is a reminder that quality and relevance can trump sheer volume when you’re operating in a narrow, high-signal domain. The team measures success in the lab first, then scales what works.

    I appreciated their take on exploration: there are moments when Mert sometimes wants the model to hallucinate. Running high temperature settings helps explore the full enzyme design space, and the guardrails ensure those forays remain productive rather than random. In other words, controlled creativity beats blind search.

    The business constraint is unambiguous: enzymatic recycling must compete economically with cheap, oil-based plastic production. That framing forces disciplined choices around energy use, throughput, and yield—factors that directly determine unit economics and the path to industrial reality and cost parity.

    What’s next is equally compelling: a process agent to optimize end-to-end system performance, a 5,000-ton demo plant in California to validate scale, and enzymes for new plastic types. I’m especially intrigued by enzyme blends for mixed plastics and the practical insight into why clamshells aren’t recyclable—precisely the messy corner cases that decide whether circularity works outside the lab.

    From a product management lens, several patterns stand out: define clear problem statements as inputs to the agentic orchestration; use eval-driven development to enforce stage-by-stage quality; build a proprietary data moat with wet lab results; and tie milestones to industrial metrics (conversion, selectivity, energy per ton) rather than vanity outputs. This is AI Strategy in action—aligning model capability, data leverage, and operational design to deliver outcomes, not just demos.

    Most of all, the ambition to explore an enzyme design space that “makes everything nature has ever evolved look like a tiny dot” captures the promise of this approach. Pairing agentic AI with rigorous lab validation doesn’t just make plastic circularity plausible—it makes it programmable.


    Inspired by this post on Product Talk.


    Book a consult png image
  • Why I Bet on First-Time Executives: Inside Figma’s Playbook for AI, IPO Readiness, and Scale

    Founders should bet on first-time executives. I’ve seen it pay off repeatedly, and a recent deep dive with Praveer Melwani, CFO at Figma, reinforced exactly why. Praveer joined Figma in 2017 as the company’s first business operations and finance hire—when the team was around 30 people and not yet charging for the product—and stepped into the CFO seat in 2022, helping to lead the company’s IPO in 2025. His journey from IC to CFO isn’t just a career arc; it’s a blueprint for scaling leadership capacity in high-velocity environments.

    What struck me first was the clarity of the step functions that took him from operator to “whole-company” leader. Early on, he optimized for doing the work—building driver trees, stress-testing go-to-market assumptions, and putting the basics of board management in place. As the business matured, he shifted from answering questions to defining them, owning capital allocation, and shaping the operating cadence. That evolution—from execution to orchestration—is exactly the arc I look for when I’m hiring first-time VPs.

    Another takeaway: Figma started acting like a public company three years before its IPO. That wasn’t optics; it was operating discipline. Quarterly rhythms, tight controls, an audit-proof close, and forward-looking narrative management helped the company move faster, not slower. In my experience, this kind of public-company readiness clarifies trade-offs, compresses decision cycles, and strengthens cross-functional trust—especially between product, finance, and go-to-market leadership.

    We also unpacked what separates world-class finance leaders from a traffic-cop CFO. The latter enforces rules and guards budgets; the former uses first principles decision making to direct resources toward asymmetric upside. World-class CFOs help the company understand risk in a post-ChatGPT world, design SaaS pricing that matches product reality, and build reliable instrumentation for outcomes—not just outputs. They’re partners in product strategy as much as stewards of the balance sheet.

    On pricing, I appreciated the courage behind selling the exec team on AI consumption pricing. Consumption SaaS pricing introduces variance, but it also aligns value with usage and accelerates time-to-value—especially for AI-driven features whose unit economics evolve rapidly. It requires tight stakeholder management, robust telemetry, and a crisp value proposition, but when executed well it can unlock both growth and discipline.

    One of the boldest moves: Figma intentionally cut its 90% gross margin to invest in AI. That’s a masterclass in capital allocation. The reflex to protect margins is strong, but durable advantage often comes from compounding learning loops, not short-term optics. Framed correctly, AI Strategy isn’t a cost center—it’s an option on multiple future S-curves. The key is to define decision guardrails, instrument usage, and keep a living risk register for AI risk management.

    I was also intrigued by how AI is changing the CFO craft itself. Tools like Claude Code are now part of the financial leader’s toolbox—useful for scenario modeling, policy drafting, and exploring new domains without slowing down the team. Paired with strong data governance and controls, this is where FinOps meets executive leverage: faster cycles, tighter experiments, and better communication with product and engineering.

    Leadership transitions can catalyze phase shifts. When a COO leaves or a company re-architects its operating model, great executives don’t just fill gaps—they redesign the system. That’s when clarity about swimlanes, escalation paths, and decision rights matters most. The lesson for founders: hire for adaptability, not just pedigree, and look for people who can turn ambiguity into momentum.

    Hiring leaders in functions you don’t deeply understand is a common founder challenge. The best antidote is a first-principles test for hiring VPs: can the candidate map the business model, define success metrics, and explain trade-offs in plain language? Do they show how they’d build the team, not just run it? Can they teach you something new in 30 minutes? I use this pattern across executive hiring because it scales better than relying on domain buzzwords.

    Another practice I recommend: build an internal board of peer CFOs and operators. Regular, no-agenda check-ins create a community of practice that shortens feedback loops and surfaces non-obvious risks. It’s one of the most efficient ways to de-risk capital allocation and sharpen strategic narratives ahead of real board meetings.

    We talked about scope versus depth: how deeply in the details should a CFO be? My view aligns with what I heard here—be in the details often enough to validate the model and coach the team, but not so deep that you become the bottleneck. The executive job is to raise the quality of decisions at scale, not to personally make every decision.

    There were personal lessons, too—from the nine-year working relationship with Dylan Field to foundational team-building insights from time at Dropbox. Strong teams are built on crisp roles, tight feedback loops, and a bias for writing things down. That muscle—organizational development through clarity—is what separates resilient companies from merely lucky ones.

    If you’re a founder weighing whether to back a rising operator or recruit a “proven” exec, this story tips the scale toward the former. Bet on slope, not just intercept. Create the scaffolding—public-company behaviors early, transparent metrics, and a culture that rewards learning—and your first-time executives will scale with the business. Done right, it’s the highest-LEVERAGE people decision you can make.


    Book a consult png image
  • My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts

    My Playbook for a Smarter Feature Launch Slack Channel with Agents, Feature Flags, and Readouts

    Feature launches move fast, and the Slack channel is our command center. Recently, I leveled it up with agentic AI so every data question, feature flag decision, and post-launch readout lives in one trusted place—faster, clearer, and with less operational drag on the team.

    Learn how to set up your launch Slack channel so agents handle your data questions, feature flags, and post-launch readouts in one place.

    Here’s the strategy I use. I treat the launch Slack channel like a real-time control room: agentic AI handles the repetitive asks, experts handle the judgment calls, and stakeholders stay aligned through crisp, automated summaries. The result is tighter stakeholder management, quicker go/no-go calls, and fewer meetings—without sacrificing data quality or governance.

    First, I set clear channel rituals. I name the space #launch-[feature], declare scope and SLAs, and pin the success metrics, dashboards, and rollout plan. Product, engineering, data, support, and GTM all join. I keep threads focused: one for metrics, one for incidents, one for enablement, one for feedback. This small bit of structure makes agent responses and human follow-ups easy to find.

    Next, I add a data questions agent. The agent connects to approved sources and answers the most common queries—activation by cohort, conversion by segment, latency by region—directly in-thread with citations and timestamps. When the question requires nuance, the agent routes to an owner and posts a handoff note, preserving context. This keeps our AI workflows safe and reliable while giving the team quick visibility.

    Then I wire in a feature flags agent. It exposes read-only status by environment, shows rollout percentages, and links to change history. When a toggle is requested, the agent enforces approvals and logs who asked, who approved, and why. We can pause, ramp, or roll back in seconds—with auditability intact. Feature flags become an operational muscle instead of a bottleneck.

    Finally, I schedule post-launch readouts. The readout agent publishes T+1 hour, T+24 hours, and T+7 days summaries: adoption, performance, anomalies, and key learnings. It highlights A/B testing results, flags outliers, and threads follow-up actions to owners. The team gets a single source of truth for post-launch readouts without scrambling across tools.

    Governance matters. I apply role-based access, protect PII, and make the agent cite sources so we can trust what we see. I use Agent Analytics to monitor response accuracy, deflection, and time-to-answer, then refine prompts and permissions. This is practical AI risk management: clear boundaries, human-in-the-loop for consequential decisions, and transparent logs.

    The impact has been real: faster decisions during go-to-market, fewer pings to data and engineering, and higher confidence in our product management rituals. Centralizing “questions, flags, and readouts” in Slack doesn’t replace expertise—it frees it to focus on the hard problems.

    If you’re rolling this out, start small: define the channel, pin your metrics, launch the data agent with a handful of approved queries, add the feature flags agent with strict approvals, and automate a simple daily readout. Iterate weekly. Within one or two launches, you’ll feel the compounding benefits.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • AI Agents That Truly Help Product Teams: A Practical Framework for When—and When Not—to Use Them

    AI Agents That Truly Help Product Teams: A Practical Framework for When—and When Not—to Use Them

    Every week, I field the same question from product leaders and engineers: should we deploy an AI agent here, or are we overfitting the problem to a shiny solution? Learn when AI Agents actually help product teams—plus a simple framework to decide when not to use them.

    When I say “AI agents,” I’m talking about autonomous or semi-autonomous systems that can perceive context, plan steps, and take actions across tools and data sources with minimal supervision—what many now call agentic AI. In product management terms, they’re not just another feature; they’re an operating model shift. Used well, they compound team leverage. Used poorly, they add invisible complexity, new failure modes, and governance headaches.

    To make the call with confidence, I use a straightforward VITAL framework that my team can apply in minutes. It keeps us honest about where AI agents are a force multiplier—and where a simpler automation, rule, or in-product UX is the better choice.

    V is for Volume. Agents shine where there’s sustained, repetitive, high-throughput work: triaging inbound support, cleansing CRM records, orchestrating QA checks, or synthesizing weekly research summaries. If the workflow happens rarely or ad hoc, an agent is often overhead in disguise.

    I is for Instructions. Can I specify success in clear, testable terms? Strong instructions include measurable acceptance criteria and constraints. If I can’t articulate what “good” looks like without hand-waving, the task likely needs product discovery, not autonomy.

    T is for Tolerance. What is the blast radius if the agent makes a wrong call? Low-stakes, reversible actions with tight guardrails are ideal. If the tolerance for error is near zero (e.g., irreversible financial transactions or sensitive regulatory actions), favor human-in-the-loop, stronger approvals, or defer agents entirely.

    A is for Access. The agent needs the right data, tools, and permissions, with privacy-by-design and data governance in place. If telemetry is sparse, integrations are brittle, or you can’t enforce least-privilege access, you’ll fight fragility more than you’ll gain leverage.

    L is for Learning loop. Agents require eval-driven development, Agent Analytics, and continuous feedback to stay accurate as reality shifts. If you can’t measure quality, latency, and cost per outcome—or you lack a retrieval-first pipeline to ground responses—expect drift and stakeholder distrust.

    Now, the counterweight. Don’t use agents when the problem is novel or strategically ambiguous and you still need exploratory research; when outcomes are unmeasurable or subjective without heavy context; when stakes are high and the acceptable error rate is effectively zero; when data is siloed, stale, or legally constrained; when the work is one-off or low-volume; or when your team can’t commit to instrumentation, evaluations, and ongoing maintenance. In these cases, a simpler rules engine, a clearer UX, or a well-defined workflow usually beats agentic complexity.

    Here’s how this plays out in practice. We’ve seen agents materially improve customer support triage (categorization, priority, and next-best-action suggestions), CRM hygiene (deduplication, enrichment, and routing), and release QA (regression check orchestration with human sign-off). Conversely, we avoid agents for nuanced pricing decisions, sensitive risk scoring without robust datasets, or any workflow where “explainability” and auditability trump speed.

    Operationalizing agents is a product problem before it’s an ML problem. Start narrow with a retrieval-first pipeline and rigorous prompt engineering, define success metrics upfront (quality, latency, cost per task), and run head-to-head evaluations against human baselines. Ship behind feature flags, monitor with Agent Analytics, and graduate from assisted to autonomous modes only after you’ve proven stability. Align this with product roadmapping and sprint planning so the work lands as durable capability, not a lab demo.

    Finally, be honest about build vs buy. If the workflow is a point of parity, consider buying and focusing your team on integration quality and governance. If it’s a potential source of competitive differentiation, invest in a modular architecture with clear context window management, strong observability, and a feedback loop tightly coupled to your empowered product teams.

    The bottom line: AI agents unlock leverage when there’s volume, clarity, tolerance, access, and a learning loop. If any of those pillars is missing, pause. Your best next move is likely better instrumentation, sharper problem framing, and continuous discovery—not more autonomy. That discipline is how product teams turn agentic AI from hype into habit.


    Inspired by this post on Product School.


    Book a consult png image
  • AI Data Security for Product Teams: Protect Sensitive Product Data Without Slowing Innovation

    AI Data Security for Product Teams: Protect Sensitive Product Data Without Slowing Innovation

    Protecting product data has never felt more urgent. Every week, my teams experiment with gen ai prototypes and LLM-powered capabilities, and I’m accountable for ensuring our innovation never compromises cybersecurity, privacy, or customer trust. The goal is not to slow down—it's to build in the right guardrails so speed and safety reinforce each other.

    Understand AI data security risks in product teams, what product data is most exposed, and how to use AI tools responsibly without slowing innovation.

    When I assess AI risk with product managers, I start with how data moves. The biggest threats usually come from prompt and context leaks, unsafe logging of sensitive inputs or outputs, permissive access controls, unmanaged third-party model usage (shadow AI), and unclear data-retention policies. For LLMs for product managers, I emphasize that every step in AI workflows—from collection to processing to storage—must assume adversarial conditions.

    In my experience, the product data most exposed includes customer PII and payment identifiers, internal strategy documents and roadmaps, analytics and behavioral telemetry tied to users, feature flags and configuration values, embeddings and vector stores that can reveal sensitive patterns, and the prompts or contexts themselves. Even “harmless” evaluation datasets can contain inferred identities. Treat all of this as high-value assets in your data governance model.

    I apply privacy-by-design from the first discovery conversation: minimize data by default, redact or tokenize before any external model call, and separate identities from content wherever possible. A retrieval-first pipeline helps keep raw customer data within our boundary while still enabling relevant context. We combine deterministic safeguards (policy-based redaction, allow/deny lists) with runtime observability to detect anomalous prompts, outputs, or access patterns.

    To keep velocity high, we operationalize risk rather than debate it ad hoc. A lightweight risk scoring rubric classifies each capability (e.g., internal-only, customer-facing, regulated data adjacent) and dictates controls: redaction requirements, human-in-the-loop thresholds, eval-driven development gates, and incident response readiness. These controls live in CI/CD so product teams get fast, automated feedback without waiting on meetings.

    Partnership is essential. I bring Security, Legal, and Data partners into the product trios early to align on regulatory compliance and threat modeling while scoping solutions that meet outcome goals. We maintain a shared catalog of approved providers and architectures, document data flows, and version our policies just like code—so everyone can see what changed and why.

    Vendor diligence is non-negotiable. I ask LLM providers about data retention and training usage, encryption at rest and in transit, key management, regional data controls, audit posture (SOC 2, ISO 27001, HIPAA where needed), and support for private networking. We restrict scopes with least-privilege access and instrument robust observability for threat detection and response across the full path, not just the API call.

    Culture makes the biggest difference. I coach teams on prompt hygiene, secret handling, and context window management; we publish redaction patterns, approved libraries, and clear do/don’t examples. When incidents happen, we treat them as learning opportunities, run blameless reviews, and update our playbooks, guardrails, and training materials accordingly.

    The outcome I aim for is confidence with speed: we ship AI features that customers love while protecting the data they entrust to us. With a clear risk model, strong data governance, and embedded controls, product teams can innovate boldly—without compromising on security or trust.


    Inspired by this post on Product School.


    Book a consult png image
  • AI Experimentation Mastery: How I Test Faster, Tame Variability, and Ship with Confidence

    AI Experimentation Mastery: How I Test Faster, Tame Variability, and Ship with Confidence

    I’ve learned that the fastest path to durable AI impact is a disciplined experimentation engine: one that moves quickly, reduces ambiguity, and earns trust with evidence. My goal isn’t just to ship models—it’s to ship measurable outcomes with repeatable rigor.

    AI experimentation for product teams. Here’s how to test AI features, choose the right metrics, handle variability, and make data-driven decisions.

    I start every AI initiative by framing a clear decision: what must be true for this feature to be worth building, and how will we know quickly? From there, I map driver trees that connect user value to measurable signals, so every test clarifies both impact and risk, not just accuracy.

    Success criteria come next. I translate aspirations into testable thresholds, define leading and lagging indicators, and size tests with minimum detectable effect (MDE) so we don’t confuse noise for signal. This keeps us honest about sample sizes, power, and the real cost of waiting for certainty.

    Before I touch production traffic, I run eval-driven development. I curate golden datasets that reflect real user complexity, codify rubrics for correctness, safety, tone, and latency, and automate scoring so improvements are reproducible—not anecdotal. This gives the team a stable baseline to iterate prompts, tools, and policies with confidence.

    Model behavior is inherently stochastic, so I deliberately control variability. I document temperature, top-p, and seed strategies; I compare deterministic settings for regression checks versus sampled settings for user-facing creativity; and I test sensitivity across content lengths and edge cases. This reduces flakiness and prevents surprise regressions during CI/CD.

    When it’s time to learn from real users, I favor A/B testing with thoughtful guardrails. I run holdouts, cap exposure with feature flags, and protect core experience metrics like retention and time-to-value. For ranking and retrieval changes, I’ll use interleaving or switchback tests to isolate effects from seasonality and traffic mix.

    To handle LLM variability online, I aggregate outcomes over multiple prompts per cohort, use stratified bucketing to balance power users and new accounts, and track confidence intervals over time instead of snapshot p-values. This approach turns noisy model outputs into stable product signals.

    Instrumentation fuels everything. I rely on behavioral analytics to trace user intent, effort, and satisfaction across flows, and I wire up Amplitude analytics for event schemas, funnel drop-offs, and cohort comparisons. Clear event taxonomies and naming discipline make it trivial to separate model quality from UX friction.

    Risk is part of the work, so I bake in AI risk management early. I include toxicity and PII checks in my offline evals, monitor safety metrics in every A/B, and set rollback criteria tied to user harm and system costs. Privacy-by-design, audit logs, and runtime safeguards aren’t afterthoughts—they’re acceptance criteria.

    The operating cadence matters as much as the math. I run continuous discovery with customer interviews to keep the test queue grounded in real jobs-to-be-done, and I align product trios on hypotheses, success metrics, and stop-loss rules before launch. Weekly readouts keep decisions crisp, and post-ship learning cycles feed the next iteration.

    Finally, I invest in upskilling the team. We run internal workshops on LLMs for product managers, standardize experiment templates, and maintain a living playbook so new experiments start at 80% instead of 0%. The result: faster learning loops, safer bets, and more confident shipping.


    Inspired by this post on Product School.


    Book a consult png image