Tag: LLMs for product managers

  • Stop Losing Users: How a Second Message and Prompt Audit Drive 2–3x Retention

    Default prompts are quietly sabotaging agent retention. I learned this the hard way while reviewing early funnels for our voice and chat agents—engagement looked great at the greeting, but the moment the agent stopped after a single reply, the conversation flatlined. The fix wasn’t a fancy LLM trick; it was a disciplined second message and a rigorous audit of defaults across every entry point.

    When an AI agent opens with a generic, low-friction greeting and then waits, users hesitate. Cognitive load rises, intent stays fuzzy, and drop-off follows. A thoughtful second message—delivered quickly, with clarity and options—reduces ambiguity and gives people a low-effort path to progress. It’s a small behavioral nudge that pays off in outsized retention gains.

    Here’s the pattern that consistently works for me. First, keep the initial default prompt short, confident, and specific to the channel and task domain. Then ship a fast follow-up if the user hesitates for a few seconds. That second message should clarify what the agent can do, present 2–3 concrete choices, and invite free-form input. I’ve repeatedly seen this simple sequence unlock a 2–3x retention lift in early sessions, especially for first-time users.
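
    A minimal sketch of that sequence, assuming a five-second hesitation threshold and a hypothetical `SecondMessage` type (neither comes from a specific agent framework). It only decides whether to send the follow-up and what it says; wiring it to a real timer or channel is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SecondMessage:
    """Follow-up sent when a user hesitates after the greeting."""
    capability: str                 # one sentence on what the agent can do
    options: tuple[str, ...]        # 2-3 concrete choices
    invite: str = "Or tell me what you need in your own words."

    def __post_init__(self) -> None:
        if not 2 <= len(self.options) <= 3:
            raise ValueError("offer 2-3 concrete choices")

    def render(self) -> str:
        numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(self.options, 1))
        return f"{self.capability}\n{numbered}\n{self.invite}"


def next_agent_turn(
    seconds_since_greeting: float,
    user_replied: bool,
    follow_up: SecondMessage,
    hesitation_threshold_s: float = 5.0,  # assumed value; tune per channel
) -> Optional[str]:
    """Return the second message only if the user is still silent past the threshold."""
    if user_replied or seconds_since_greeting < hesitation_threshold_s:
        return None
    return follow_up.render()
```

    Keeping the threshold as a parameter makes it easy to test a shorter wait on voice than on chat.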

    Auditing default prompts is where the leverage lives. I inventory every ingress—web widget, IVR, SMS, in-app, help center—and catalogue the exact default system, developer, and user-facing prompts. Then I inspect turn-1 and turn-2 transcripts in Agent Analytics to quantify where users stall: time-to-first-intent, clarification rate, option selection rate, and completion. This makes the drop-off visible and turns “vibes” into data we can A/B test.
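
    To make those stall metrics concrete, here is a rough sketch that computes them from per-session records. The `Session` fields are a hypothetical schema I made up for illustration, not an Agent Analytics export format; map your own transcript data onto it.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    channel: str
    seconds_to_first_intent: Optional[float]  # None if the user never stated an intent
    agent_asked_clarification: bool
    picked_offered_option: bool
    completed_task: bool


def funnel_metrics(sessions: list[Session]) -> dict[str, float]:
    """Summarize where users stall across turn 1 and turn 2."""
    n = len(sessions)
    if n == 0:
        raise ValueError("no sessions to summarize")
    times = [s.seconds_to_first_intent for s in sessions
             if s.seconds_to_first_intent is not None]
    return {
        "sessions": n,
        "stated_intent_rate": len(times) / n,
        "median_seconds_to_first_intent": median(times) if times else float("nan"),
        "clarification_rate": sum(s.agent_asked_clarification for s in sessions) / n,
        "option_selection_rate": sum(s.picked_offered_option for s in sessions) / n,
        "completion_rate": sum(s.completed_task for s in sessions) / n,
    }
```

    Running it once per ingress (web widget, IVR, SMS, and so on) gives a like-for-like baseline before any variant ships.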

    Designing the second message is a conversation design exercise, not a copy tweak. My recipe: empathize with the user’s likely uncertainty, constrain scope so the agent appears capable, and apply choice architecture. For voice AI agents, I keep it shorter, use confirmation questions, and bias toward read-back for accuracy. For chat, I include tappable options and examples that mirror top intents. The goal is momentum without feeling pushy.

    Operationally, I run controlled A/B tests on default and second-message variants, sized to a realistic minimum detectable effect. I segment by source (ad, organic, support), device, and use case, because the winning prompt for sales qualification rarely matches the one for customer support. With proper instrumentation in our analytics stack, we track retention curves over the first 3–5 sessions, not just single-session reply rates, to avoid optimizing for chatter over outcomes.
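
    Sizing a test to a minimum detectable effect can be approximated with the standard two-proportion formula. The sketch below assumes a two-sided test, equal arm sizes, and an absolute lift on a rate such as turn-2 continuation; the defaults (5% significance, 80% power) are conventional choices, not figures from my own tests.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate sessions needed per variant to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be in (0, 1) and the effect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

    Halving the detectable effect roughly quadruples the required sample, which is why per-segment tests on low-traffic sources often need a larger effect or a longer run.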

    Strong prompt engineering underpins the experience. I keep system prompts stable and explicit about persona, tone, and refusal behavior; manage the context window so examples don’t drown live intent; and use a retrieval-first pipeline when domain knowledge matters. The most expensive mistake I see is shipping defaults like “How can I help you?” without guardrails or examples—great for demos, bad for real users.

    If you’re starting fresh, begin with a prompt audit this week: list all defaults, map them to top intents, and pair each with a channel-appropriate second message. Instrument the funnel, launch two variants, and set a crisp success metric (e.g., turn-2 continuation rate to task start, then task completion). This is one of those rare changes that is simple to ship and compounds across onboarding, activation, and long-term retention.

    The takeaway is straightforward: don’t let your best work stall after the first reply. A disciplined second message and a focused default prompt audit will lift engagement, reduce ambiguity, and create the kind of early momentum that sustains retention over time.


    Inspired by this post on Amplitude – Perspectives.


  • My Always‑On AI Team: How I Get Claude Agents to Tackle Work While I’m Offline

    Most mornings I wake up to a to-do list that’s already been updated—because my always-on team of agentic AI assistants has been working while I sleep. I rely on Claude to orchestrate these agents so routine prep, follow-ups, and retrospectives never slip through the cracks.

    When a podcast recording hits my calendar, my podcast-manager agent (powered by Claude) automatically creates a podcast-interview-prep task with a concise summary of who I’m interviewing and what they are building. It also creates a transcript review document with the correct share settings. After the recording, it adds a task to my to-do list to share the transcript with the podcast participants.

    For sales, my sales-admin agent (also powered by Claude) prepares a sales-meeting-prep task with notes on who I’m meeting with, where they are in the sales process, and what I need to move the deal forward. After the call, it generates clear next-step tasks so momentum doesn’t stall.

    Every week, my coding-manager agent (still powered by Claude) compiles a report from my prior week’s coding sessions and offers targeted tips. It flags recurring mistakes or dead ends, shows how to avoid them, and suggests ways to work better with Claude. It’s the retrospective I never skip.

    In this walkthrough, I’ll explain how I get Claude to complete tasks for me while I’m away from the computer—and how I designed the system to balance power, safety, and cost control.

    I first explored this approach after seeing the rapid growth of OpenClaw. OpenClaw is an open-source "agent harness" that lets you configure personalized agents to act on your behalf. It’s incredibly promising, but the early wave of enthusiasm also revealed pitfalls: complex safety configuration, overly broad machine access (browser, terminal, files, credentials), third-party skills of varying quality, and surprise usage bills.

    After hearing one too many horror stories about wasted hours and unexpected charges, I set out to design a safer, more predictable way to capture the benefits of OpenClaw while managing risk and spend. That’s what led to my current agent setup.

    For transparency: I’m a long-time practitioner and a genuine fan of Claude Code. I have not received any compensation from Anthropic for writing about my approach. If that ever changes, I will disclose it—both because it’s required by the FTC in the U.S. and because it’s simply the right thing to do.

    An Overview of How My Agent Team Works

    Today, I run three specialized agents: a podcast manager, a sales admin, and a coding manager. As I invest more, I expect this team to grow—because the pattern scales cleanly across use cases.

    This system runs on four core components that keep everything reliable, auditable, and cost-aware.

    First, agent identity. I use a simple but powerful convention: an identity markdown file that tells the agent who it is, where its task folder lives, and provides context for the types of tasks it will do. This keeps scope tight and intent explicit—critical for safety and predictable automation.

    Second, the scheduler. I’m using macOS’s built-in scheduler, launchd (via LaunchAgents). It works like cron, but jobs run as my logged-in user with all of my user permissions. That means I can run all of this under my Claude Code Max subscription or my ChatGPT/Codex subscription. The result is a dependable heartbeat for my AI workflows without relying on fragile cloud glue.
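
    As an illustration of what such a scheduled job looks like, this sketch generates a LaunchAgent property list with Python’s standard `plistlib`. The label, script path, log paths, and run time are placeholders I invented for the example; the keys themselves (`Label`, `ProgramArguments`, `StartCalendarInterval`, `StandardOutPath`, `StandardErrorPath`) are standard launchd job keys.

```python
import plistlib


def build_launch_agent(label: str, script_path: str, hour: int, minute: int) -> bytes:
    """Return XML plist bytes for a job that runs a script once a day."""
    job = {
        "Label": label,
        # Hypothetical wrapper script that starts the agent run.
        "ProgramArguments": ["/bin/zsh", script_path],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
        "StandardOutPath": f"/tmp/{label}.out.log",
        "StandardErrorPath": f"/tmp/{label}.err.log",
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    xml = build_launch_agent(
        "com.example.podcast-manager",          # placeholder label
        "/Users/me/agents/podcast/run.sh",      # placeholder path
        hour=6,
        minute=30,
    )
    print(xml.decode())
```

    Per-user jobs like this live as `.plist` files under `~/Library/LaunchAgents`.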

    Third, tasks. Each agent owns a dedicated folder of tasks. A task is a markdown file with frontmatter. That structure makes work items easy to create, parse, review, and version—perfect for repeatable automation with a human-in-the-loop safety net.
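
    For a sense of why this structure is easy to parse, here is a small sketch that splits a task file into its frontmatter and body. It handles only flat `key: value` pairs, and the field names in the usage example (`agent`, `status`) are hypothetical; a real setup with nested values would want a proper YAML parser.

```python
def parse_task(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown task into (frontmatter dict, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter
            body = "\n".join(lines[i + 1:]).strip()
            return meta, body
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    raise ValueError("frontmatter block is not closed")


if __name__ == "__main__":
    sample = "---\nagent: podcast-manager\nstatus: open\n---\n\nShare the transcript."
    print(parse_task(sample))
```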

    Fourth, scripts. Each agent has its own scripts folder with utilities it can call on demand or that run on a schedule. These scripts are small, composable, and transparent—so I can evolve capabilities without ballooning risk or complexity.

    Agent identity, tasks, and scripts are saved in Obsidian, not as Claude Code skills or agents. The scheduler runs on my always-on Mac Mini. The benefit is that the setup works across all of my devices, and I can switch between Claude Code, Codex, or any other coding CLI as I need to. All it takes is updating the script that the scheduler calls.

    In practice, this architecture delivers exactly what I want from agentic AI: clarity of responsibility, strong guardrails, and outcomes that compound. My podcast manager keeps interviews buttoned up, my sales admin removes administrative drag, and my coding manager turns lessons learned into steady skill gains—all while I focus on higher-leverage product management work.

    If you’re considering a similar setup, start with a single agent and a narrow task, then expand. Keep identities crisp, scripts small, and schedules explicit. With that foundation, you’ll get the benefits of automation and delegation—without surrendering control.


    Inspired by this post on Product Talk.


  • Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

    I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments (the Black Friday sales, the sporting events, the production incidents), and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business.

    Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.

    We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.

    While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice.

    We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.

    “That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”

    This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition; they reserve our ambition for the nuanced scaling challenges that matter.

    The source of truth is the hard part. You can scale stateless web traffic by adding machines, add queue consumers, and add cache. Those are real problems, just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.

    That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.

    Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it.

    Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.

    Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform, from vector search in our AI features to realtime reporting, and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work.

    Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.

    More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident. We’ve used this pattern for years to make large search migrations safer and more incremental, a core playbook in our platform scalability and SRE practices.

    Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.

    We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks.

    The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed.

    Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. AI providers are not commodity storage systems, and we do not design as if they are.

    That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point. For enterprise customers, this matters because AI support volume can spike just like human support volume, and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.

    Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce.

    Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.

    “That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”

    Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.

    We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer, a habit that anchors our incident management, observability, and FinOps practices.

    Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.

    “That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”

    Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?

    This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric; it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.

    “Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”

    The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.

    Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime.

    With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime, and we do this in practice, not just as a goal.

    This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.

    “Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”

    Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.

    What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale; it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer.

    Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system, and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.
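
    The online index reshaping sequence described earlier in this post (partition by customer ID, dual-write, backfill, validate, move customers behind flags, delete the old index last) can be sketched with in-memory dictionaries standing in for the old and new indexes. This illustrates the ordering only. It is not Intercom’s implementation and contains no Elasticsearch client code; `IndexMigration`, its method names, and the CRC32 partitioning are assumptions made for the example.

```python
import zlib


class IndexMigration:
    """In-memory stand-in for moving documents from one index to partitioned ones."""

    def __init__(self, old_index: dict[str, dict], partitions: int) -> None:
        self.old = old_index                              # doc_id -> document
        self.new = [dict() for _ in range(partitions)]    # one dict per partition
        self.dual_write = False                           # turned on before backfill
        self.migrated: set[str] = set()                   # per-customer read flag

    def _partition(self, customer_id: str) -> dict:
        # Stable hash so a customer always lands in the same partition.
        return self.new[zlib.crc32(customer_id.encode()) % len(self.new)]

    def write(self, doc_id: str, doc: dict) -> None:
        self.old[doc_id] = doc
        if self.dual_write:
            self._partition(doc["customer_id"])[doc_id] = doc

    def backfill(self) -> None:
        # Copy historical documents; setdefault never overwrites a dual-written doc.
        for doc_id, doc in self.old.items():
            self._partition(doc["customer_id"]).setdefault(doc_id, doc)

    def validate(self) -> bool:
        merged: dict[str, dict] = {}
        for part in self.new:
            merged.update(part)
        return merged == self.old

    def read(self, customer_id: str, doc_id: str):
        # Flagged customers read from the new partitions; everyone else stays on old.
        if customer_id in self.migrated:
            return self._partition(customer_id).get(doc_id)
        return self.old.get(doc_id)

    def safe_to_delete_old(self, all_customers: set[str]) -> bool:
        return self.validate() and all_customers <= self.migrated
```

    Because reads flip per customer, a problem found after cutover affects only the customers already flagged, and unflagging them sends reads back to the old index.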

    Inspired by this post on The Intercom Blog.


  • From PM to AI Engineer: RAG, Evals, and Discovery—The Surprising Playbook I’m Applying

    I just finished a standout conversation on AI engineering and product discovery that hit squarely at the questions I hear from product leaders every week: What does practical AI engineering actually look like for product managers, and how do we ramp without a traditional software background?

    Here’s the arc that resonated with me: a product leader goes from occasional tinkerer to spending 60% of her time on real engineering work—building AI-powered tools for continuous discovery, forming a licensing partnership with Vistaly, and quietly constructing "Teresa Bot," an AI discovery coach trained on everything she’s ever written. The journey is less about mastering every framework up front and more about structuring learning, tightening feedback loops, and shipping useful outcomes.

    The most energizing throughline is the myth-busting: you don’t need a deep engineering pedigree to operate in this space. Curiosity, rigorous discovery habits, and eval-driven development will take you further than brute-force coding. As one moment put it beautifully, "I know anything that I don't know how to do, Claude will teach me how to do. And Claude is infinitely patient." That captures the posture I expect modern PMs to adopt with LLMs and tools like Claude Code.

    On the nuts and bolts, the discussion gets concrete about AI engineering in practice: context engineering, prompt writing, RAG, observability, and evals. This is the real stack—think retrieval-first pipeline design, prompt engineering guardrails, instrumentation for model drift, and continuous, automated evals to protect behavior as you iterate. If you’ve been dabbling with context window management but haven’t formalized your test harnesses or dashboards, this is your cue.

    What I appreciated most is how directly discovery skills transfer. Framing assumptions, running tight customer interviews, mapping opportunity solution trees, and aligning stakeholders—these are precisely the muscles you need to shape problem spaces before you “vibe code” solutions. As one reflection nails it, "The moment I learned more about data science, all of my discovery work became so different." That’s the bridge from qualitative sense-making to measurable, model-centered learning.

    The partnership with Vistaly is also a smart build vs buy case study. Rather than reinvent infrastructure, the choice to license purpose-built opportunity solution tree software keeps focus on the differentiated layer—learning systems and product outcomes. As it’s put plainly: "I don't want to build all that stuff. I don't really want to be a software company. I'm almost set up like an AI researcher." Product leaders should internalize this lens for platform choices across their AI roadmaps.

    On "Teresa Bot," the implementation breadcrumbs are familiar and pragmatic: pair a solid retrieval-first pipeline (RAG) with clean content sources, keep prompts modular, enforce code review even for vibe coding, and stand up observability and evals early. I’ve had similar success using Claude Code for rapid iteration while treating every prompt and context change as a versioned artifact. That discipline pays dividends when you need to trace regressions or prove improvements.
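
    To show the shape of a retrieval-first pipeline with modular prompts, here is a toy sketch. It ranks documents by keyword overlap rather than embeddings, and the corpus keys, `SYSTEM_PROMPT` text, and function names are all invented for illustration; it says nothing about how "Teresa Bot" is actually built.

```python
import re

# Kept separate from retrieval so the prompt can be versioned on its own.
SYSTEM_PROMPT = "Answer using only the sources below. Say so if they do not cover the question."


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k document ids ranked by shared-word count with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in corpus.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Retrieve first, then assemble system prompt, sources, and question."""
    ids = retrieve(query, corpus, k)
    context = "\n\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Sources:\n{context or '(no matching sources)'}\n\n"
        f"Question: {query}"
    )
```

    Swapping `retrieve` for an embedding search later does not touch `build_prompt`, which is the practical payoff of keeping the pieces modular.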

    If you’re a PM ready to lean in, start small and systematic. Pick one high-signal discovery workflow, model the knowledge you already have, and wire up basic evals before you scale. Keep a lab notebook, use programmatic tests to gate deployments, and measure outcome movement—not just model cleverness. This is where LLMs for product managers move from novelty to execution readiness.
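
    A basic eval gate can be as small as the sketch below. It assumes a `generate` callable that wraps whatever model call you use, and it checks only for required terms, which is a deliberately crude stand-in for richer graders; `EvalCase`, `run_evals`, and `gate` are names I chose for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    must_include: tuple[str, ...]  # terms a passing answer has to mention


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the model wrapper and record pass/fail by name."""
    results: dict[str, bool] = {}
    for case in cases:
        answer = generate(case.prompt).lower()
        results[case.name] = all(term.lower() in answer for term in case.must_include)
    return results


def gate(results: dict[str, bool], min_pass_rate: float = 1.0) -> bool:
    """Allow a deployment only if enough cases pass; an empty suite never passes."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_pass_rate
```

    Storing the per-case results next to the prompt version makes it possible to say which change caused a regression.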

    Resources mentioned: Watch the episode on YouTube, Claude Code, Vistaly (opportunity solution tree software), Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive Outcomes, Product Talk Academy, Just Now Possible Podcast, Vibe Coding Best Practices: Avoid the Doom Loop with Planning and Code Reviews, and the AI Evals for Engineers and PMs course on Maven.

    What stood out to you—RAG design choices, eval frameworks, or the discovery-to-engineering mindset shift? Drop your thoughts below; I’d love to learn how you’re applying these patterns in your own product roadmaps.


    Inspired by this post on Product Talk.


  • Stop Asking AI Anything: The 3 Outcome-Based Prompts That Unlock Real Product Insights

    Too often I watch teams ping a global agent with vague AMAs and then wonder why they get generic summaries instead of decisive guidance. When I lead product reviews, I push the team to treat AI like a partner in decision-making, not a trivia engine. That simple mindset shift transforms how quickly we move from questions to confident action.

    AI isn’t built for AMA (ask me anything). Get recommendations for outcome-based questions for the best results with Amplitude AI.

    In practice, outcome-based prompting means I don’t ask an agent to “analyze the data.” I ask it to help me reach a specific product decision, grounded in behavioral analytics and connected to our outcome-based (not output-based) OKRs. To make that concrete, I always frame my prompts around three things.

    First, I state the outcome and metric. I name the business goal and the exact measure in Amplitude analytics that will validate success—activation rate, funnel conversion from A to B, or 8-week retention. I’ll reference the relevant events, segments, or driver trees so the agent has a crisp target. This is where product strategy meets measurement discipline.

    Second, I define the context and constraints. I specify the user cohort, the timeframe, and the surface area I care about—new self-serve signups in the last 30 days, first-session behavior on web only, or EU traffic where data governance rules apply. On a unified analytics platform, this context lets an agentic AI narrow its search to the highest-signal slices of behavioral analytics rather than pattern-matching across noise.

    Third, I declare the decision and deliverable. I tell the agent exactly what I will do next and the format I need to act: a ranked list of levers for an A/B testing plan, a recommended prompt engineering template for in-app guides, or a one-page brief I can hand to the growth team. Clear decisions lead to clear outputs; vague intents lead to vague answers.

    Operationally, I turn these three elements into reusable prompt templates, and I track their performance with Agent Analytics. I review traces to see which inputs drive the best recommendations, and I refine prompts the same way I iterate on product copy. For product managers working with LLMs, this is the craft: small, testable improvements that compound into outsized impact.
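
    One way to turn the three elements into a reusable template is a small data class that refuses to render until every part is filled in. The field names and output layout are my own assumptions for illustration, not an Amplitude feature or format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OutcomePrompt:
    outcome: str       # business goal
    metric: str        # exact measure that validates success
    context: str       # cohort, timeframe, surface area
    constraints: str   # e.g. governance rules, excluded segments
    decision: str      # what I will do next
    deliverable: str   # format I need in order to act

    def render(self) -> str:
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"fill in before sending: {', '.join(missing)}")
        return "\n".join([
            f"Outcome: {self.outcome}",
            f"Success metric: {self.metric}",
            f"Context: {self.context}",
            f"Constraints: {self.constraints}",
            f"Decision I will make: {self.decision}",
            f"Deliverable: {self.deliverable}",
        ])
```

    Failing loudly on a blank field is the point: a prompt without a stated decision is the vague ask this approach is meant to prevent.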

    Here’s a quick example. When I needed to lift user activation, I asked for a prioritized set of friction points blocking first-value within 24 hours for new self-serve accounts, based on last month’s data. I defined activation as completing event X within Y hours, asked the agent to analyze top drop-offs in the funnel, and requested an action plan with two experiment ideas and success thresholds. The response mapped behaviors to interventions, connected to retention analysis, and gave me a prompt engineering snippet for the onboarding nudge we shipped the same week.
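
    The activation definition in that example (completing event X within Y hours of signup) translates directly into a small calculation. The sketch below assumes events arrive as `(user_id, event_name, timestamp)` tuples; that shape and the event name in the usage example are hypothetical.

```python
from datetime import datetime, timedelta


def activation_rate(
    signups: dict[str, datetime],
    events: list[tuple[str, str, datetime]],
    activation_event: str,
    window_hours: float,
) -> float:
    """Share of signed-up users who fired activation_event within the window."""
    if not signups:
        raise ValueError("no signups in cohort")
    window = timedelta(hours=window_hours)
    activated: set[str] = set()
    for user_id, name, ts in events:
        start = signups.get(user_id)
        if start is None or name != activation_event:
            continue  # user outside the cohort, or a different event
        if start <= ts <= start + window:
            activated.add(user_id)
    return len(activated) / len(signups)


if __name__ == "__main__":
    cohort = {"u1": datetime(2025, 1, 1, 9), "u2": datetime(2025, 1, 1, 9)}
    log = [("u1", "project_created", datetime(2025, 1, 1, 12))]
    print(activation_rate(cohort, log, "project_created", window_hours=24))
```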

    If your AI workflow still starts with “What does the data say?”, you’ll keep getting broad narratives. Start with outcomes, sharpen the context, and specify the decision you will make. That’s how Amplitude analytics, paired with agentic AI, stops being merely interesting and starts being indispensable.


    Inspired by this post on Amplitude – Perspectives.


  • Level Up: May 26 Claude Code Show & Tell + Final Product Discovery Fundamentals Cohort

    I’m excited to share two opportunities this season to uplevel your craft, connect with peers, and leave with practical, repeatable techniques you can apply immediately to your product work.

    We will be doing another round of Claude Code: Show and Tell on May 26th at 9am PDT. These community-driven sessions are hands-on and fast-paced—we swap proven workflows, compare prompts, and pressure-test approaches together. You’ll see how product teams are operationalizing AI workflows in real contexts and walk away with ideas you can adapt for your own roadmap and experimentation pipeline. Invites will go out to Supporting Members and CDH Members tomorrow. If you'd like to join us, keep an eye on your inbox for the invite.

    I love these Show & Tell sessions because they translate tacit knowledge into clear, reusable playbooks. Whether you’re refining evaluation loops for LLMs, streamlining discovery synthesis, or standardizing prompts for consistency, the shared rigor and camaraderie make it a high-signal hour for any product leader invested in AI workflows.

    I also want to share that I'll be teaching our June 4th – July 9th cohort of Product Discovery Fundamentals. This is the last time I'll be teaching this cohort in its current format. If you've been thinking of enrolling in this program, and want to take it with me, this is your last chance. Register here.

    Across this cohort, we’ll practice continuous discovery habits—framing opportunities, tightening assumptions, running lean experiments, and aligning product trios on evidence-backed decisions. If you want a rigorous, repeatable system for turning customer insight into confident prioritization and compelling product strategy, I’d be thrilled to have you in the room.


    Inspired by this post on Product Talk.


  • Operator Unleashed: The AI Agent That Transforms Customer Ops across Fin and Intercom

    Operator Unleashed: The AI Agent That Transforms Customer Ops across Fin and Intercom

    Today I’m introducing Operator, an Agent that works across both Fin and the Intercom helpdesk to help you manage your customer operations.

    In practical terms, Operator manages help content, builds automation, does the ongoing work that determines how well Fin performs, and runs the operational work your human team doesn’t have time for. That combination is precisely what modern support teams need to move from reactive firefighting to proactive, consultative support.

    Why does this matter? Running a customer operation means managing AI and humans simultaneously, and doing this well requires more capacity than most teams realistically have. I’ve felt that strain firsthand—competing priorities, constant context switching, and a never-ending queue that blurs strategic focus.

    On the AI side, Fin’s performance is largely influenced by what surrounds it: the accuracy of your help content, the quality of your Fin configuration, and how well you understand what’s working and why. When product teams ship daily, keeping your help center current means finding every affected article before customers notice the gaps. When Fin gets a conversation wrong, diagnosing it requires reading through what happened, identifying the root cause at the configuration level, making the fix, and verifying it worked. Analyzing why your resolution rate dropped means pulling conversations, finding patterns, and tracing the cause back to something actionable. And beyond individual fixes, there’s the ongoing question of what to automate next – what your human reps are still handling repetitively, whether it’s worth building a Procedure for it, and how to test it before it goes live.

    On the human side, the demands are just as continuous. When an incident hits, someone needs to identify every affected customer, draft the right response, and send it before the problem compounds. Team leads need visibility into rep performance across hundreds of conversations to coach effectively and prep for 1:1s. Reps need to know what to prioritize without spending the first part of their day figuring it out. In fast-moving environments, that operational overhead wastes energy you should be investing in better customer outcomes.

    Image: Synthesia testimonial about Fin Operator. The quote describes how asking Operator clarifies what happened in a conversation and makes improving Fin easier.

    Too often, the work outpaces what teams can manage, so it happens reactively, or not at all. Operator was built to change that, giving teams a new way to understand, manage, and improve their customer operations. Here’s how I put Operator to work across AI workflows and human-led processes.

    First, I use Operator to ask my data anything. Your support operation generates more useful data than most teams have time to process. Operator gives you direct access to it. You can ask it any question about what’s happening in your operation (why a metric changed, what’s driving escalations, how the team performed last week) and it returns structured answers with charts, breakdowns, and the ability to dig further. It analyzes samples of real conversations on the fly to surface patterns and identify root causes. If your head of product wants to know what customers are saying about a new release, you can ask Operator rather than spending half a day pulling a report together. It also works across your entire operation, analyzing Fin’s performance, your human reps’ performance, and customer sentiment.

    Crucially, I don’t start from scratch every time. Give Operator ongoing work, like analyzing your automation rate every Monday, flagging anything that needs attention, and posting the report in your Fin workspace. It’ll run the analysis, write the report, and deliver it without you having to go looking for it. That’s the kind of agentic AI leverage that compounds week after week.

    Second, I keep the knowledge base current without writing a single article. Your knowledge base is only as useful as it is accurate. When product teams ship fast, keeping pace with content updates is a substantial, ongoing job. Give Operator a brief about anything, from a new feature or policy change to release notes, and it finds every article in your help center that needs updating, drafts the edits in your tone of voice and style, identifies content gaps, and drafts new articles to fill them. It even handles localized versions. Every change is formatted as a proposal (Operator’s version of a pull request) for you to review, edit, and approve before anything goes live. It feels like adding several knowledge managers to the team overnight, without the ramp time.

    Image: Raylo testimonial praising Fin Operator for accurate analysis, strong trend insights, conversation debugging, and reporting that goes beyond basic LLM connectors.

    Third, I build, test, and ship improvements to Fin directly through Operator. When Fin gets a conversation wrong because of a content gap or misconfigured rule, Operator can debug it by reading through the conversation, identifying what caused the problem, proposing a fix, and running simulation tests to verify it before you approve. You see what changed and why before anything goes live. Beyond debugging, Operator has deep knowledge of every Fin feature and capability, so you can ask it directly to help you configure whatever you need. If you need a Procedure for a specific query type, describe the outcome you want and Operator builds it, including triggers, multi-step instructions, edge case handling, and a simulation test, all from a single prompt. The same applies to configuring Guidance rules, data connectors, monitors, and workflows. You don’t need to know which feature solves your problem or how to configure it; you just describe what you want.

    For teams looking to increase their overall automation rate, Operator can handle that strategically too. Ask it to analyze where your biggest automation opportunities are and it surfaces them by volume, along with an estimate of the weekly team time each one is consuming. Pick one, and it builds the solution for you to approve. That’s consultative support, productized.
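
    The ranking described here is, at its core, a group-and-sort. The sketch below is a hedged illustration of that idea only: the sample data, the `rank_opportunities` function, and the use of handling minutes as the time estimate are my assumptions, not how Operator computes it.

```python
from collections import defaultdict

# Hypothetical one-week sample of human-handled conversations: (topic, handling minutes).
conversations = [
    ("refund status", 6),
    ("password reset", 4),
    ("refund status", 8),
    ("invoice copy", 5),
    ("refund status", 7),
    ("password reset", 3),
]


def rank_opportunities(rows):
    """Group by topic and return (topic, volume, total_minutes), highest volume first."""
    volume = defaultdict(int)
    minutes = defaultdict(int)
    for topic, handled_minutes in rows:
        volume[topic] += 1
        minutes[topic] += handled_minutes
    # Sort by volume, breaking ties by the time the team spends.
    return sorted(
        ((topic, volume[topic], minutes[topic]) for topic in volume),
        key=lambda item: (item[1], item[2]),
        reverse=True,
    )


for topic, count, total in rank_opportunities(conversations):
    print(f"{topic}: {count} conversations, {total} minutes this week")
```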

    Finally, I use Operator to effortlessly manage the human side of support. When an incident hits, Operator identifies every affected conversation, drafts targeted responses, and sends them proactively, turning what would normally be hours of reactive triage into minutes of review and approval. For ongoing management, a team lead prepping for 1:1s can ask Operator to pull each rep’s metrics, flag outliers, and surface what’s worth digging into. A rep coming back from a meeting can ask what to focus on next and get a prioritized queue based on urgency, customer value, and wait time. And because Operator sees patterns across everything your human team is handling, it can surface the conversations they’re still resolving manually, flagging your next automation opportunity before you’ve had time to go looking for it.
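
    A prioritized queue based on urgency, customer value, and wait time can be pictured as a weighted score. This is a sketch under stated assumptions: the `Ticket` fields, the 1 to 3 scales, the weights, and the wait cap are all hypothetical values I picked for illustration, and nothing here reflects Operator's actual ranking logic.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    urgency: int          # 1 (low) to 3 (high), hypothetical scale
    customer_value: int   # 1 (low) to 3 (high), hypothetical scale
    wait_minutes: int


# Illustrative weights only; a real system would tune or learn these.
WEIGHTS = {"urgency": 3.0, "customer_value": 2.0, "wait": 1.0}
MAX_WAIT_MINUTES = 240  # waits longer than this all count as the maximum


def priority(ticket: Ticket) -> float:
    """Weighted sum of the three signals, each scaled so its maximum is 1."""
    wait_score = min(ticket.wait_minutes, MAX_WAIT_MINUTES) / MAX_WAIT_MINUTES
    return (
        WEIGHTS["urgency"] * ticket.urgency / 3
        + WEIGHTS["customer_value"] * ticket.customer_value / 3
        + WEIGHTS["wait"] * wait_score
    )


tickets = [
    Ticket("A", urgency=1, customer_value=3, wait_minutes=30),
    Ticket("B", urgency=3, customer_value=2, wait_minutes=10),
    Ticket("C", urgency=2, customer_value=1, wait_minutes=240),
]
queue = sorted(tickets, key=priority, reverse=True)
print([ticket.ticket_id for ticket in queue])
```

    With these made-up weights the urgent ticket comes first, and the long-waiting ticket outranks the high-value one that has only just arrived.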

    Here’s why this works. Operator isn’t a general-purpose AI model given access to your data. It’s built on a library of purpose-built tools that encode expertise specific to support operations, like how to pick the right attributes for a given analysis, search a knowledge base semantically, debug Fin’s reasoning in a specific conversation, or write and test a Procedure that will actually work. That specialized toolkit is what makes its recommendations trustworthy and its execution reliable.

    Image: banner reading “Transform your support operation with Operator.”

    The proposal (pull request) system makes this possible. When Operator updates content, adjusts configuration, or modifies how Fin behaves, it creates a proposal – a structured diff of what’s changing and why. You review it, edit if needed, and approve before it takes effect. Operator does the cognitive work; the human stays in control of what goes live.

    More than 200 early users are already trying Operator, and every one of them is finding new use cases. It’s a genuine step change in capability, and I expect it will change the way support teams run their operation. We’re working towards a vision of Operator being increasingly agentic, expanding across every new role Fin takes on.

    Operator is available in early access now. If you’re ready to transform your customer operations across Fin and the Intercom helpdesk with agentic AI, start here: https://fin.ai/operator.


    Inspired by this post on The Intercom Blog.


  • Escape the “It’s Just an LLM” Trap: Inside Operator, a Reliable, Actionable AI Agent

    Escape the “It’s Just an LLM” Trap: Inside Operator, a Reliable, Actionable AI Agent

    We just launched Operator, an Agent for your customer operations that helps you understand, manage, and improve your entire customer experience. I’ve spent years shipping AI-driven products at production scale, and this one reflects the lessons I’ve learned the hard way about what it really takes to go from a flashy demo to a dependable system your team trusts.

    To give you a clear view of just how powerful this Agent is, I want to share the technical infrastructure and engineering choices that make Operator work reliably at production scale across thousands of customer workspaces. My goal is to demystify the gap between a well-prompted LLM and a true, production-grade Agent—so you can make an informed build vs. buy decision.

    If you’re a technical leader evaluating whether to build something like this yourself, or trying to understand the difference between a well-prompted LLM and a production Agent system, this is for you.

    Escaping the “it’s just an LLM” trap

    Most engineering teams in this space start the same way: a prototype. You take a foundation model, give it API access to your support data, add a system prompt with some domain context, and you’ve got something that queries your database, summarizes tickets, and generates reports that look right. It demos convincingly—and I’ve been there, impressed in the moment, only to watch it buckle under real-world complexity.

    The problem with that prototype is that it obscures the scope of what’s actually required. It demonstrates the 10% of the system that’s straightforward to build, and it’s easy to assume the rest is just as straightforward. It isn’t. The gap between a working demo and a production system your team depends on daily is where most of the engineering investment lives. That’s precisely the gap we focused on closing.

    With Operator, we’ve invested deeply in every layer: tooling, reasoning, how the Agent takes action, and the infrastructure that makes it reliable at scale. Here’s a closer look at the architecture and why it matters for agentic AI, platform scalability, and observability.

    The tooling layer

    The first thing we had to confront was that the obvious approach (giving a model access to your APIs and letting it figure things out) doesn’t hold up in production. The model makes reasonable decisions for simple queries, but operating across thousands of customer workspaces with different configurations, data models, and usage patterns, a “figure it out” approach isn’t nearly precise enough.

    What you need is purpose-built tooling: tools that encode decisions about what data to fetch, how to structure it, what context to include, and what to leave out. Operator has over 50 of these tools and 10 skills.

    A tool is a single action that Operator takes (search content, run a query, look up a conversation). A skill chains multiple tools together to complete a whole job, like debugging a conversation end-to-end, rolling out a content update across an entire help center, or identifying the next automation opportunity. This is where AI workflows move from abstract prompts to dependable, repeatable outcomes.
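
    The tool-versus-skill distinction maps onto a simple pattern: single-purpose functions, and a runner that chains them. The sketch below is a deliberately small illustration of that pattern. The function names, the shared `state` dictionary, and the hard-coded return values are hypothetical stand-ins, not Operator's tools.

```python
from typing import Callable, Dict, List

Tool = Callable[[Dict], Dict]


# A tool is a single action. These two are hypothetical stand-ins.
def look_up_conversation(state: Dict) -> Dict:
    state["transcript"] = f"transcript for {state['conversation_id']}"
    return state


def search_content(state: Dict) -> Dict:
    state["articles"] = ["How refunds work"]
    return state


def run_skill(tools: List[Tool], state: Dict) -> Dict:
    """A skill chains tools: each tool receives the state the previous one returned."""
    for tool in tools:
        state = tool(state)
    return state


debug_conversation = [look_up_conversation, search_content]
result = run_skill(debug_conversation, {"conversation_id": "c-123"})
print(sorted(result))
```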

    The difference between using thin wrappers around API endpoints and purpose-built tooling shows up in something as seemingly simple as a performance question. When you ask “how did Fin perform last week?”, a naive implementation runs a query and hands back a table. Operator runs a reporting tool that determines which metrics are relevant for your specific workspace, which are meaningful for your particular question, and what the numbers actually mean in context, giving you a much richer answer that you can do something tangible with.

    Developing that behavior took months of engineering. Not because any individual piece is conceptually hard, but because getting it right across the full range of customer workspaces, configurations, and edge cases is an iterative process. You build it, you test it against real conversations, you find the cases where it breaks, you fix those, and you repeat. There’s no shortcut—and in practice, this is where most DIY efforts stall.

    The intelligence layer

    The tooling layer solves what to do, but beneath it is a harder problem: understanding what’s worth doing, and why. This is the layer that makes Operator understand your business rather than just query it. Three components go into it, and in my experience they’re non-negotiable for a reliable Agent.

    1. Semantic search

    Unlike solutions that rely on keyword matching, Operator uses a system that understands what content is about, not just what words it contains. When it searches your help center, it’s using the same semantic search engine we’ve spent years optimizing for Fin itself. This is a retrieval system that’s been tuned against millions of real support conversations, with precision and recall characteristics we’ve measured and improved continuously. This retrieval-first pipeline is the backbone of grounding and dramatically reduces hallucinations.

    2. Attribute awareness

    Operator has access to your data and knows what is meaningful for different questions. It knows which metrics are actually in use in your workspace, which custom attributes carry signals, and which fields are populated versus effectively empty. We’ve built specific skills that give Operator this meta-knowledge, so when it’s investigating a performance question, it’s looking at the right things, not hallucinating insights from sparse data.

    3. Intelligent reasoning

    A well-built Agent can answer your question and anticipate what you should ask next. If you ask Operator about escalations spiking, it doesn’t just say, “escalations increased 23% week-over-week.” It’ll continue on to tell you why this happened by examining the escalated conversations and identifying that a disproportionate number involved a specific product area, before moving on to check whether the relevant help content is up to date, and, if it isn’t, proposing an update. That chain of reasoning isn’t prompt engineering. It’s encoded in the skills we’ve built, refined against the patterns we see across our entire customer base.
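
    Attribute awareness, the second component above, starts with knowing which fields are populated before trusting them. Here is a minimal sketch of that check. The records, the `fill_rates` helper, and the 0.5 cut-off are assumptions for illustration, not a description of how Operator's skills are built.

```python
# Hypothetical conversation records; None or "" means the attribute was never filled in.
records = [
    {"plan": "pro", "region": "EU", "churn_risk": None},
    {"plan": "free", "region": "", "churn_risk": None},
    {"plan": "pro", "region": "US", "churn_risk": "high"},
    {"plan": "pro", "region": "EU", "churn_risk": None},
]


def fill_rates(rows):
    """Return the share of rows in which each attribute has a non-empty value."""
    keys = {key for row in rows for key in row}
    return {
        key: sum(1 for row in rows if row.get(key) not in (None, "")) / len(rows)
        for key in keys
    }


def usable_attributes(rows, min_fill=0.5):
    # min_fill is an illustrative cut-off, not a documented Operator setting.
    return sorted(key for key, rate in fill_rates(rows).items() if rate >= min_fill)


print(usable_attributes(records))
```

    In this toy data, `churn_risk` is filled in for only one record in four, so an analysis that leaned on it would be drawing conclusions from sparse data.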

    The action layer

    This is where the engineering complexity increases by an order of magnitude because instead of just analyzing problems and recommending solutions, Operator takes action to solve them itself. It can update Guidance rules, draft and publish help articles, create Procedures, configure data connectors, and modify your Fin configuration. Moving from read-only insights to write-capable actions is a fundamentally different class of product and infrastructure problem—one that demands rigorous SRE practices and rock-solid safeguards.

    Every one of these actions has to be safe, reversible, and auditable. An analytics tool that occasionally returns a wrong number is frustrating, but an Agent that occasionally applies a wrong configuration change to a live support system is a different category of problem. To prevent this, we built a robust proposal system, whereby every change Operator suggests is presented as a reviewable diff. You see exactly what will change before anything is applied, with the option to accept, reject, or refine. Nothing goes live without your explicit approval.
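
    For readers weighing a build of their own, the shape of a proposal-and-approval gate can be sketched in a few lines. This is an illustration under my own assumptions: the `Proposal` class, its fields, and `apply_proposal` are hypothetical, and the diff comes from Python's standard `difflib`, not from anything in Operator.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical proposal: a pending change plus the reason for it."""
    target: str
    before: str
    after: str
    reason: str
    status: str = "pending"

    def diff(self) -> str:
        # unified_diff yields lines; lineterm="" keeps header lines free of newlines.
        lines = difflib.unified_diff(
            self.before.splitlines(),
            self.after.splitlines(),
            fromfile=f"{self.target} (live)",
            tofile=f"{self.target} (proposed)",
            lineterm="",
        )
        return "\n".join(lines)


def apply_proposal(proposal: Proposal, live: dict) -> None:
    """Write the change only if a human has approved it."""
    if proposal.status != "approved":
        raise PermissionError("proposal has not been approved")
    live[proposal.target] = proposal.after


live_articles = {"refunds": "Refunds take 10 days.\nContact support to start one."}
proposal = Proposal(
    target="refunds",
    before=live_articles["refunds"],
    after="Refunds take 5 days.\nContact support to start one.",
    reason="Policy change in the latest release notes",
)
print(proposal.diff())
proposal.status = "approved"  # the reviewer's explicit decision
apply_proposal(proposal, live_articles)
```

    Keeping the `before` text on the proposal is what makes the change reversible in this sketch, and the stored `reason` and `status` give a basic audit trail.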

    What else sets Operator apart

    A UI that’s both conversational and graphical, not one or the other. Operator blends conversational interaction with purpose-built graphical components. Proposal diffs show exactly what will change in an article. Inline charts visualize performance trends. Dashboards render directly inside the conversation thread. In practice, that means a knowledge manager reviews a structured diff—not a wall of LLM-generated text—and a team lead asking about weekly performance gets an accurate chart with context, not a paragraph approximating data.

    Building this hybrid experience is extremely difficult outside of a native platform integration. In a chat interface or CLI, you’re limited to text output; in a standalone dashboard, you lose conversational context. Operator does both in the same thread, so every interaction is detailed and context-rich—and importantly, actionable in the flow of work.

    It lives where your team already works. Operator is built into the same platform your team uses every day. It’s not a separate tool with a separate login, nor is it a Slack bot your engineer set up that only three people know about. It operates exactly where you are, alongside the conversations, help center articles, workflows, and data you’re working with. That tight integration closes the gap between finding a problem and fixing it: spot an outdated article while reviewing a Fin conversation, and Operator can surface the fix in the same session. Notice an escalation spike in the morning, and you can ask Operator to investigate without switching tools, waiting for a data pull, or filing a ticket.

    The compounding advantage

    Every customer using Operator teaches us something. We see which debugging approaches work across different types of support operations, learn which content structures perform better, and identify automation strategies that consistently land. Those patterns get encoded back into Operator’s skills and tools. When we discover that a particular sequence of investigation steps reliably identifies the root cause of a spike in escalations, we build that into Operator’s diagnostic skill. When we find that a specific way of structuring help articles leads to higher Fin resolution rates, we encode that into the content creation skill. Our engineering team is continuously shipping improvements based on what we observe across the entire customer base.

    A custom-built solution gives you exactly what you built, meaning it doesn’t get smarter unless you invest engineering resources into making it smarter. And that usually means taking time and talent away from your core product. I’ve watched teams underestimate the ongoing cost of eval-driven development, model upgrades, and API churn—costs that only grow as your footprint expands.

    We’re not locking the door

    Some teams want to build their own Agents. Some of our most technical customers do this. But when you do, you’re working with raw APIs and building your own tooling on top of them. When you use Operator, you’re working with a system that already knows what questions to ask, understands your data, and encodes the best practices we’ve learned from thousands of support teams. We recently launched the Fin CLI, which means you can use third-party agents like Claude Code or Cursor to interact with your Fin data and configuration. That door is open. What I hope this post has clarified is everything that goes into the build of Operator: Over 50 tools and 10 skills, purpose-built for support operations. Years of investment in semantic search. Deep integration with every layer of Fin’s stack. The proposal system. The intelligence layer. The reliability infrastructure.

    If you’d still like to move ahead with building a custom solution, here’s an honest assessment. You can build a useful read-only tool in weeks. It’ll query your data, summarize tickets, and generate reports, but turning it into a production system will take quarters. Reliability, security, edge case handling, multi-tenant data isolation, and graceful degradation are all important architectural decisions that you’ll need to get right from the start. The action layer is also where you might risk stalling out. Going from “here’s what’s wrong” to safely making changes in a production system is a fundamentally different engineering problem than analysis. Most DIY projects never get there. Finally, you’ll be maintaining it forever. Every model upgrade, API change, and new capability in your support platform means updating your custom tooling. We have a team dedicated to this. You’ll need one too.

    The economics still favor buying when a vendor has invested more in the problem than you can justify internally. What I hope this post adds is a clearer picture of what that investment actually looks like from an engineering perspective—and why it compounds into a durable advantage for your support organization.

    The investment is ongoing. The problems we’re solving at the infrastructure level today are harder than the ones we solved a year ago, and that trajectory isn’t slowing down. If you’re ready to see the difference a production-grade Agent can make, explore Operator.


    Inspired by this post on The Intercom Blog.


  • How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

    How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

    Only 10% of the plastic we manufacture gets recycled. For a century, we have relied on mechanical and chemical methods that were never designed to close the loop. As a product leader, I look for step-change technologies that break through entrenched ceilings, and biology—specifically engineered enzymes—has emerged as that missing piece.

    Recently, I dug into the work of Rhea's Factory and spoke with their founders, Arzu Sandıkçı (co-founder and CEO) and Mert Topcu (co-founder). Arzu brings deep expertise in molecular biology and enzyme engineering. Mert brings 20 years in tech, including a decade at Google as a product manager. Their combined perspective—domain science plus product rigor—shows up in every design choice.

    Rhea's Factory has built an AI platform that uses protein language models, multi-step agentic pipelines, and proprietary wet lab data to design novel enzymes that deconstruct plastic polymers into their original monomers—selectively, at low temperatures, and at industrial scale. That stack matters: it layers foundation models with domain-specific constraints and real-world data to systematically explore, evaluate, and scale candidates.

    Here’s the crux: traditional recycling mostly just chops polymer chains into shorter fragments. Enzymatic recycling, by contrast, breaks plastic all the way back to its original monomers. Think of a pearl necklace: mechanical methods snip the chain into shorter strands, while enzymes cleanly return the individual pearls. The result is true circularity: you can remake high-quality plastic without downcycling.

    Selectivity is the superpower. Enzymes can target specific plastic types even in mixed waste streams, operating at low temperatures in a controlled, low-energy reactor process. That combination of precision and energy efficiency is why this approach can be both greener and economically competitive.

    The field accelerated after the discovery of a plastic-eating bacterium in Japan, which opened the door to enzymatic recycling. Advances in protein structure prediction, notably AlphaFold and the work recognized by the Nobel Prize in Chemistry, transformed what’s possible in enzyme engineering and created space for AI-native design loops to flourish.

    On the AI side, the team evolved from a human-orchestrated pipeline to an agentic AI scientist. Problem statements serve as inputs, multi-step protein generation builds on foundation models, and guardrails at each pipeline step keep the AI pointed in the right direction without limiting exploration. It’s a textbook example of agentic AI applied to a highly constrained, safety-critical domain.

    Crucially, wet lab feedback closes the loop. Wet lab data, even just hundreds of proprietary data points, can be enough to train a powerful domain-specific prediction model. It is a reminder that quality and relevance can trump sheer volume when you’re operating in a narrow, high-signal domain. The team measures success in the lab first, then scales what works.

    I appreciated their take on exploration: there are moments when Mert actually wants the model to hallucinate. Running at high temperature settings helps explore the full enzyme design space, and the guardrails ensure those forays remain productive rather than random. In other words, controlled creativity beats blind search.
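
    For anyone unfamiliar with what “temperature” does, here is the standard temperature-scaled softmax used when sampling from language models. I don’t know how Rhea's Factory’s pipeline samples, so treat this as a generic sketch. The three logits are hypothetical scores for three candidate tokens, chosen only to show the effect.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities; higher temperature flattens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate tokens at one position.
logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, temperature=0.5)
high = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 2) for p in low])
print([round(p, 2) for p in high])
```

    At the higher temperature the top candidate loses probability mass to the less likely ones, which is what lets a generator wander further from its safest guess.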

    The business constraint is unambiguous: enzymatic recycling must compete economically with cheap, oil-based plastic production. That framing forces disciplined choices around energy use, throughput, and yield—factors that directly determine unit economics and the path to industrial reality and cost parity.

    What’s next is equally compelling: a process agent to optimize end-to-end system performance, a 5,000-ton demo plant in California to validate scale, and enzymes for new plastic types. I’m especially intrigued by enzyme blends for mixed plastics and the practical insight into why clamshells aren’t recyclable—precisely the messy corner cases that decide whether circularity works outside the lab.

    From a product management lens, several patterns stand out: define clear problem statements as inputs to the agentic orchestration; use eval-driven development to enforce stage-by-stage quality; build a proprietary data moat with wet lab results; and tie milestones to industrial metrics (conversion, selectivity, energy per ton) rather than vanity outputs. This is AI Strategy in action—aligning model capability, data leverage, and operational design to deliver outcomes, not just demos.

    Most of all, the ambition to explore an enzyme design space that “makes everything nature has ever evolved look like a tiny dot” captures the promise of this approach. Pairing agentic AI with rigorous lab validation doesn’t just make plastic circularity plausible—it makes it programmable.


    Inspired by this post on Product Talk.


  • No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

    No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

    Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

    Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

    In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

    We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.
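
    As a hedged illustration of what “measurable” can look like, here is one way to compute a clarification-question rate from a labelled transcript. The definition (share of agent turns that are clarifying questions), the `kind` labels, and the sample turns are my assumptions for this sketch and are not taken from Agent Analytics.

```python
# Hypothetical labelled transcript: each turn is tagged with what it did.
turns = [
    {"speaker": "user", "kind": "request"},
    {"speaker": "agent", "kind": "clarification"},
    {"speaker": "user", "kind": "answer"},
    {"speaker": "agent", "kind": "action"},
    {"speaker": "agent", "kind": "answer"},
    {"speaker": "agent", "kind": "clarification"},
]


def clarification_rate(transcript) -> float:
    """Share of agent turns that are clarifying questions (an assumed definition)."""
    agent_turns = [turn for turn in transcript if turn["speaker"] == "agent"]
    if not agent_turns:
        return 0.0
    asked = sum(1 for turn in agent_turns if turn["kind"] == "clarification")
    return asked / len(agent_turns)


print(clarification_rate(turns))
```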

    To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.
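
    One way to picture an ask-versus-proceed decision point is as an explicit rule. The sketch below is illustrative only: the `next_step` function, the confidence input, and the 0.7 threshold are hypothetical, and in our case the thresholds lived in prompts, not in code like this.

```python
def next_step(intent_confidence: float, missing_fields: list, ask_below: float = 0.7) -> str:
    """Decide whether to ask a follow-up or act. ask_below is an illustrative threshold."""
    if missing_fields:
        # Required information is absent, so ask one concise question about the first gap.
        return f"ask:{missing_fields[0]}"
    if intent_confidence < ask_below:
        # Everything is present but the intent is unclear, so confirm before acting.
        return "ask:confirm_intent"
    return "proceed"


print(next_step(0.9, []))
print(next_step(0.9, ["appointment_date"]))
print(next_step(0.5, []))
```

    Asking about only the first missing field at a time is one simple way to keep the agent curious without becoming interrogatory.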

    To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

    Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

    Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

    The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

    If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.


    Inspired by this post on Amplitude – Best Practices.


  • From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

    From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

    When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of agentic and data driven product development." That vision resonates deeply with how I lead teams, anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

    Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators—activation, engagement depth, and retention—then instrument product telemetry with Amplitude analytics and behavioral analytics to surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

    Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action, grounded in a retrieval-first pipeline and governed by eval-driven development. For product managers, this demands fluency with LLMs and practical prompt engineering, plus a rigorous AI strategy around data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

    Cross-functional cadence is everything. I partner closely with Principal AI Engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and outcomes vs output OKRs to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

    If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.


    Inspired by this post on Amplitude – Perspectives.


  • From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

    From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

    I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

    But AI has expanded what’s just now possible—for our products, and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

    I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

    Opportunity Solution Tree diagram with a blue Desired Outcome branching to green Opportunity nodes, yellow Solution nodes, and orange Assumption Tests for product discovery and AI workflows.
    A clear, color‑coded map from desired outcome to opportunities, solutions, and assumption tests—showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.

    Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

    Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

    Dark interface screenshot of an opportunity solution tree with colored cards and dotted connectors, showing merged, moved, and evidence-added Opportunity notes about onboarding, support, and bot readiness.
    Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.

    Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

    To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

    Diagram of an opportunity solution tree with an Outcome node pointing to Opportunity A and Opportunity B; B branches to child opportunities and shows source evidence, labeled “Updates Can't Result in Data Loss.”
    A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.

    That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

    I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

    Opportunity Solution Tree diagram: teal Outcome links to Opportunities A and B; Opportunities C and D branch under B; right panel lists the change set steps for adding nodes.
    Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.

    On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.
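
    To make the "no data loss" constraint concrete, here is a minimal Python sketch of a node that carries both source opportunities and children, plus a check that an update did not drop any evidence. The `Opportunity` class, its field names, and `lost_evidence` are illustrative assumptions, not the actual Vistaly schema or my service's code.

```python
from dataclasses import dataclass, field


@dataclass
class Opportunity:
    """One tree node. Field names are hypothetical, chosen for illustration."""
    name: str
    sources: set = field(default_factory=set)      # ids of supporting interview evidence
    children: list = field(default_factory=list)   # child Opportunity nodes


def walk(node):
    """Yield the node and every descendant."""
    yield node
    for child in node.children:
        yield from walk(child)


def all_sources(root):
    """Collect every evidence id attached anywhere in the tree."""
    return set().union(*(n.sources for n in walk(root)))


def lost_evidence(before, after):
    """Evidence ids present before an update but missing afterwards."""
    return all_sources(before) - all_sources(after)


before = Opportunity("Outcome", children=[
    Opportunity("A", {1, 2, 3}),
    Opportunity("C", {4, 5}),
])
after = Opportunity("Outcome", children=[
    Opportunity("A", {1, 2}),   # evidence 3 was dropped by a faulty update
    Opportunity("C", {4, 5}),
])
print(lost_evidence(before, after))
```

    The same pattern would apply to children: collect node names before and after, and flag any that vanished without an explicit delete or merge.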

    In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.

    Diagram of an Opportunity Solution Tree showing how parent 'Opportunity A' with children x, y, z is split into 'Opportunity A' and 'Opportunity B' to reassign evidence and connections.
    Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.

    At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

    They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

    Flowchart merging two opportunity solution trees: Opportunity B with children y and z, and Opportunity C with t, u, v, consolidated into one tree led by Opportunity C connected to five child opportunity nodes.
    From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.

    Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

    In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

    Diagram of an opportunity solution tree labeled 'Input Tree' showing an Outcome node branching to Opportunity A and C, each with child nodes x-z and t-v, with arrows indicating hierarchy.
    Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.

    When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

    Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
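
    The split-then-merge example above can be replayed in a few lines of Python. This is a sketch under simplifying assumptions: the tree is a plain dictionary, and `split`, `merge`, and `naive_diff` are hypothetical helpers written for this illustration, not the production code.

```python
import copy


def split(tree, node, new, sources, children):
    """Move the given sources and children out of `node` into a new node `new`."""
    tree[node]["sources"] -= sources
    tree[node]["children"] -= children
    tree[new] = {"sources": set(sources), "children": set(children)}


def merge(tree, src, dst):
    """Fold `src` into `dst`, keeping all of its sources and children."""
    removed = tree.pop(src)
    tree[dst]["sources"] |= removed["sources"]
    tree[dst]["children"] |= removed["children"]


def naive_diff(before, after):
    """Per-node (lost, gained) items, with no record of the moves in between."""
    diff = {}
    for name in before.keys() & after.keys():
        for key in ("sources", "children"):
            lost = before[name][key] - after[name][key]
            gained = after[name][key] - before[name][key]
            if lost or gained:
                diff[(name, key)] = (lost, gained)
    return diff


original = {
    "A": {"sources": {1, 2, 3}, "children": {"x", "y", "z"}},
    "C": {"sources": {4, 5}, "children": {"t", "u", "v"}},
}
final = copy.deepcopy(original)
split(final, "A", "B", sources={3}, children={"y", "z"})  # A keeps 1, 2, x
merge(final, "B", "C")                                    # C absorbs 3, y, z

for key, (lost, gained) in sorted(naive_diff(original, final).items()):
    print(key, "lost", sorted(lost), "gained", sorted(gained))
```

    The diff only reports that A lost items and C gained them. B never appears, because it was created and consumed between the two snapshots.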

    Opportunity Solution Tree diagram titled Output Tree: a blue Outcome node branches to green Opportunity A and Opportunity C, which expand to nodes x-v with arrows; Product Talk badge.
    See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.

    That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

    I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

    Diagram of an AI-generated Opportunity Solution Tree: blue Outcome node with children Opportunity A and Opportunity B; B branches to Opportunity C and D. A right-hand list shows the change set for each step.
    Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.

    For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set didn’t actually generate the recommendation.

    Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
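
    One way to encode the directional merge rule is a small guard that runs before a merge is accepted. The sketch below assumes the tree is available as a child-to-parent mapping; `validate_merge` and its message format are hypothetical, and only cover the parent/child case described above.

```python
def validate_merge(parent_of, keep, delete):
    """Return an error message if deleting `delete` in favor of `keep` would orphan nodes."""
    # Children of the node being deleted, other than the node we are keeping.
    stranded = sorted(n for n, p in parent_of.items() if p == delete and n != keep)
    # The problem case: the kept node is itself a child of the deleted parent.
    if parent_of.get(keep) == delete and stranded:
        return f"cannot merge {delete} into {keep}: would orphan {', '.join(stranded)}"
    return None


# Opportunity A has children B, C, and D.
parent_of = {"A": "Outcome", "B": "A", "C": "A", "D": "A"}

print(validate_merge(parent_of, keep="A", delete="B"))  # parent absorbs child
print(validate_merge(parent_of, keep="B", delete="A"))  # child would replace parent
```

    Returning a message rather than raising keeps the result easy to hand back to the model as feedback.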

    Diagram of Opportunity Solution Tree merge rules: merging node B into parent A is allowed, while merging A into B is not because it would orphan opportunities B, C, and D.
    Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.

    Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, even that small rate compounds: if each call fails independently, roughly 22–40% of runs still contain at least one bad change set. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
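
    A quick back-of-the-envelope check shows how a small per-call error rate compounds across a pipeline. It assumes each LLM call fails independently at the 1-in-40 rate, which is a simplification: real failures can be correlated, and a single call may involve more than one instance, so observed run-level failure can be higher.

```python
def run_failure_rate(per_call_error, calls_per_run):
    """Probability that at least one call in a run produces a bad change set."""
    return 1 - (1 - per_call_error) ** calls_per_run


for calls in (10, 20):
    rate = run_failure_rate(1 / 40, calls)
    print(f"{calls} calls per run: {rate:.0%} of runs fail")
```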

    Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
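
    Detection really is the easy half. Below is a minimal sketch of one verifier, the "delete a node, then try to use it later" check. The tuple-based step format and the `find_conflicts` name are assumptions made for this example; the real change sets carry more detail.

```python
def find_conflicts(change_set, existing):
    """Return a message for every step that references a node that no longer exists.

    Steps are tuples (hypothetical format):
      ("add", name), ("delete", name), ("merge", src, dst), ("split", node, new)
    """
    live = set(existing)
    errors = []
    for i, (op, *names) in enumerate(change_set, start=1):
        # For add and split, the last name is a node being created, not used.
        creates = names[-1:] if op in ("add", "split") else []
        uses = names[:-1] if creates else names
        for name in uses:
            if name not in live:
                errors.append(f"step {i}: {op} references missing node {name}")
        live.update(creates)
        if op in ("delete", "merge"):
            live.discard(names[0])  # the deleted node, or the merge source
    return errors


bad = [("delete", "A"), ("merge", "A", "B")]
good = [("split", "A", "B"), ("merge", "B", "C")]
print(find_conflicts(bad, existing={"A", "B", "C"}))
print(find_conflicts(good, existing={"A", "C"}))
```

    Note what the check does not do: it reports that step 2 is invalid, but it cannot say whether A should be removed entirely or salvaged into B. That choice is the part that requires intent.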

    Workflow diagram titled 'My Simple Repair Loop' showing an iterative validation cycle: Generate the Change Set → Run the validation tool → Check Result, with branches to retry on failure or exit on pass.
    A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

    After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

    On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code, so I knew exactly when a change set failed and why. No amount of prompt tuning alone was fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

    The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses those instructions to repair the change set and calls the tool again. Iterate until success or a max number of turns.
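
    The shape of that loop fits in a short Python sketch. Everything here is a stand-in: `fake_model` replaces a real LLM call, `validate` is a toy validator, and the harness calls the validator directly rather than exposing it to the model as a tool, which is a simplification of the setup described above.

```python
def repair_loop(propose, validate, max_turns=5):
    """Ask `propose` for a change set, feeding back validator errors until it passes."""
    feedback = []
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)
        feedback = validate(change_set)
        if not feedback:                 # empty list means the validator passed
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")


def fake_model(feedback):
    """Stand-in for an LLM call: the first draft is wrong, the repair is right."""
    if not feedback:
        return [("delete", "A"), ("merge", "A", "B")]
    return [("merge", "A", "B")]


def validate(change_set):
    """Toy validator: flag any step that uses a node deleted by an earlier step."""
    deleted, errors = set(), []
    for i, (op, *names) in enumerate(change_set, start=1):
        errors += [f"step {i}: {n} was already deleted" for n in names if n in deleted]
        if op == "delete":
            deleted.add(names[0])
    return errors


change_set, turns = repair_loop(fake_model, validate)
print(f"valid after {turns} turns: {change_set}")
```

    The turn cap matters as much as the feedback: without it, a loop that fails to converge just keeps spending compute.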

    I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

    I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.

    Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

    Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

    Third, I learned when I can and can’t trust Claude to write code for me. After Opus 4.6 came out, I gave Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I had simply trusted that Claude got it right when it hadn’t. If you don’t have the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.

    Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

    I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.


    Inspired by this post on Product Talk.

