Tag: eval-driven development

  • Master Burger Prompting: Build a High-Impact AI Resume Coach with Proven LLM Structures

    I turned the playful idea of “burger prompting” into a rigorous framework for building an AI resume coach that delivers consistent, high‑quality guidance. In product management, repeatability matters: I want dependable LLM behavior, tight control of outputs, and measurable outcomes. This approach gives me exactly that—clear roles, crisp constraints, and an evaluation loop that raises the quality bar with each iteration.

    Here’s the metaphor in practice. The top bun sets the role and goal; the middle layers stack context, examples, constraints, and tools; the patty is the core algorithm and output schema; and the bottom bun locks in the quality bar and follow-up behavior. When I apply this structure to an AI resume coach, I get results that feel expert, empathetic, and actionable—without rewriting the prompt every time.

    Top bun: I define the system role and success criteria. I’ll say, “Act as an experienced hiring manager and resume coach for SaaS product roles” and specify the north star: improve clarity, impact, and ATS alignment without fabricating experience. I also name the audience (mid-career PMs, early-career candidates, or executives) so tone and calibration stay consistent across sessions.

    First layer: I load precise context. That includes the candidate’s resume, the target job description, and any constraints (for example: keep bullets under 22 words, lead with impact, quantify outcomes). I also clarify non-goals (no inflated titles, no unverifiable claims). This is where I set the voice: confident, concise, and supportive, not generic or robotic.

    Second layer: I attach the tools and references that anchor outputs. A skill taxonomy for product roles, a style guide for resume bullets, and a scoring rubric (impact, clarity, relevance, keyword coverage) help the model prioritize. To protect quality, I call out context window management rules—what to include or trim—and how to summarize long inputs without losing signal.

    Third layer: I add exemplars. Few-shot examples of excellent resume bullets (“before” and “after”) teach the model what “great” looks like. I also include a counterexample or two to prevent bad habits (for instance, over-indexing on buzzwords). Exemplars act like taste buds; they steer nuance without overfitting.

    Patty: I define the core algorithm and the output schema. The algorithm moves in stages: diagnose the resume against the job, identify 3–5 high-leverage improvements, rewrite bullets with quantified outcomes, and propose a summary that highlights relevant wins. I then specify the output sections: a brief diagnosis, rewritten bullets mapped to the job’s requirements, an ATS keyword coverage table, and a confidence score with rationale. A tight schema produces consistent, scannable outputs that are easy to evaluate—and easy to ship.
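    A minimal Python sketch of the output schema described above. The class names, field names, and the `validate` helper are illustrative assumptions rather than part of any library, and the 22-word limit simply reuses the example constraint from the first layer.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_BULLET_WORDS = 22  # assumption: reuses the "under 22 words" example constraint


@dataclass
class RewrittenBullet:
    original: str
    rewrite: str
    job_requirement: str  # the job-description requirement this bullet maps to


@dataclass
class CoachOutput:
    diagnosis: str
    bullets: List[RewrittenBullet]
    keyword_coverage: Dict[str, bool]  # ATS keyword -> present in the rewrite
    confidence: float  # 0.0 to 1.0
    confidence_rationale: str

    def validate(self) -> List[str]:
        """Return a list of schema problems; an empty list means the output passes."""
        problems = []
        if not self.diagnosis.strip():
            problems.append("diagnosis is empty")
        if not self.bullets:
            problems.append("no rewritten bullets")
        for i, bullet in enumerate(self.bullets):
            if len(bullet.rewrite.split()) >= MAX_BULLET_WORDS:
                problems.append(f"bullet {i} is not under {MAX_BULLET_WORDS} words")
            if not bullet.job_requirement.strip():
                problems.append(f"bullet {i} is not mapped to a job requirement")
        if not 0.0 <= self.confidence <= 1.0:
            problems.append("confidence must be between 0.0 and 1.0")
        if not self.confidence_rationale.strip():
            problems.append("confidence rationale is missing")
        return problems
```

    Parsing the model's response into a structure like this and rejecting anything where `validate()` returns problems is one way to keep outputs consistent enough to evaluate.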

    Bottom bun: I lock in the quality bar and the follow-up behavior. If inputs are incomplete, the coach must ask clarifying questions before rewriting. If claims lack evidence, it should suggest proof points (metrics, scope, stakeholders) rather than embellish. Finally, I require a self-check pass where the coach verifies that each bullet demonstrates impact, relevance, and clarity before presenting the final result.

    Implementation blueprint: I create a reusable prompt template with clear system and user sections, then parameterize it for different roles (PM, design, data). If I have a library of style guides or skill matrices, I wire it into a retrieval layer so the model references the right material for each job. This setup makes the coach portable across tools and easy to maintain as the taxonomy evolves.

    Evaluation and iteration: I practice eval-driven development. I assemble a small, representative test set of resumes and job descriptions, define acceptance criteria (readability score, keyword coverage, human rater alignment), and A/B test prompt variants. I track drift and tighten the schema whenever outputs start to meander. The goal isn’t just impressive demos—it’s reliable performance at scale.
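    As a rough illustration of one acceptance criterion from the paragraph above, the sketch below scores keyword coverage for a prompt variant across a small test set. The function names, the `keywords` field, and the 0.8 threshold are assumptions made for the example; readability scores and human rater alignment would need their own checks.

```python
import re
from typing import Dict, List, Sequence


def keyword_coverage(text: str, keywords: Sequence[str]) -> float:
    """Fraction of keywords that appear in text as whole words (case-insensitive)."""
    if not keywords:
        return 1.0
    lowered = text.lower()
    hits = sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
    )
    return hits / len(keywords)


def evaluate_variant(
    outputs: Sequence[str], cases: Sequence[Dict], min_coverage: float = 0.8
) -> Dict:
    """Score one prompt variant's outputs against the test set."""
    if not cases or len(outputs) != len(cases):
        raise ValueError("need a non-empty test set and exactly one output per case")
    scores: List[float] = [
        keyword_coverage(out, case["keywords"]) for out, case in zip(outputs, cases)
    ]
    mean = sum(scores) / len(scores)
    return {"scores": scores, "mean_coverage": mean, "passed": mean >= min_coverage}
```

    Running `evaluate_variant` once per prompt variant on the same test set gives a like-for-like comparison. Note that the word-boundary match is a simplification: it will miss keywords that end in punctuation, such as "C++".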

    Governance guardrails: A trustworthy resume coach respects privacy-by-design. I strip PII where possible, avoid storing raw resumes beyond what’s necessary, and document bias checks so advice doesn’t disadvantage non-traditional candidates. Clear data governance and risk management keep the product shippable and compliant as it grows.

    When I apply burger prompting end to end, the AI resume coach becomes a repeatable system: fast, accurate, and measurably helpful. The structure teaches the model how to behave; the evals keep it honest; and the schema makes the result easy to review, refine, and ship. If you want dependable LLM outcomes, start with a great bun—and don’t skimp on the patty.


    Inspired by this post on Pendo – Best Practices.


  • How I Make Diagnostic AI Trustworthy: Confidence Levels, Citations, and Evals That Win Trust

    Trust is the true currency of diagnostic analytics. If customers can’t verify why a system reached a conclusion—or how confident it is—adoption stalls. That’s why this line resonated so strongly with my own playbook: Amplitude used confidence levels, citations, and evals to build a diagnostic AI tool accurate enough to earn customer trust.

    Confidence levels are my first non-negotiable. When a model flags a root cause or prescribes a next step, I want the UI to state its certainty upfront and in plain language—ideally with calibrated ranges and a brief rationale. This simple pattern sets the right expectations, reduces over-trust, and supports AI risk management by making uncertainty visible. In practice, we pair this with clear UX writing so users understand what “High,” “Medium,” or “Low” confidence really means in their workflow.
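    One way to make those labels concrete is sketched below: map a model probability to a plain-language label, then check how often each label turned out to be right on reviewed cases. The 0.85 and 0.6 cut-offs and both function names are assumptions for illustration, not values from Amplitude.

```python
from typing import Dict, Iterable, Tuple


def confidence_label(p: float, high: float = 0.85, medium: float = 0.6) -> str:
    """Map a probability to the label shown in the UI."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    if p >= high:
        return "High"
    if p >= medium:
        return "Medium"
    return "Low"


def observed_accuracy_by_label(records: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """records holds (predicted probability, was the diagnosis correct) pairs."""
    totals: Dict[str, Tuple[int, int]] = {}
    for p, correct in records:
        label = confidence_label(p)
        hits, n = totals.get(label, (0, 0))
        totals[label] = (hits + int(correct), n + 1)
    return {label: hits / n for label, (hits, n) in totals.items()}
```

    If "High" diagnoses are correct far less often than the cut-off implies, the labels are overstating certainty and the thresholds (or the model) need recalibrating.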

    Citations are the second pillar. Every diagnostic needs a breadcrumb trail back to source data: which metrics were analyzed, what time window was used, and how the insight was derived. Linking directly to the underlying chart, query, or dashboard reinforces data governance and shortens the path from “interesting” to “actionable.” When customers can click through to verify the evidence, they gain the confidence to make decisions—fast.

    Evals complete the trio. Before and after launch, I hold the team to eval-driven development: offline benchmarks, targeted scenario tests, and live performance monitoring that mirrors real customer use. We define success criteria for precision/recall, false-positive thresholds, and latency, then wire those checks into CI/CD so regressions are caught early. Continuous evals aren’t just QA; they’re the heartbeat of an AI workflow that keeps insights reliable at scale.
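    A hedged sketch of what such a check might look like as a CI step: compute precision, recall, and false-positive rate on a labeled benchmark and report any metric that misses its bar. The function names and default thresholds are assumptions; latency checks are left out for brevity.

```python
from typing import Dict, List, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = flagged)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            counts["tp"] += 1
        elif pred:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def eval_gate(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    min_precision: float = 0.9,
    min_recall: float = 0.9,
    max_fpr: float = 0.05,
) -> List[str]:
    """Return the list of failed criteria; an empty list means the gate passes."""
    c = confusion_counts(y_true, y_pred)
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    fpr = c["fp"] / (c["fp"] + c["tn"]) if c["fp"] + c["tn"] else 0.0
    failures = []
    if precision < min_precision:
        failures.append(f"precision {precision:.2f} < {min_precision:.2f}")
    if recall < min_recall:
        failures.append(f"recall {recall:.2f} < {min_recall:.2f}")
    if fpr > max_fpr:
        failures.append(f"false-positive rate {fpr:.2f} > {max_fpr:.2f}")
    return failures
```

    In a pipeline, a non-empty result from `eval_gate` would fail the build, which is how a regression gets caught before release rather than after.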

    Operationally, these practices compound. Confidence levels help prioritize follow-up analysis, citations accelerate collaboration across product and data teams, and evals keep quality high even as models, data, and usage evolve. Together, they form a pragmatic AI strategy that aligns product discovery with measurable outcomes and safeguards customer trust where it matters most—inside daily decisions.

    If you’re building a diagnostic AI tool, start with these three building blocks and resist the urge to hide uncertainty. Make it legible. Make it verifiable. And measure it continuously. That’s how we turn powerful models into trustworthy products customers depend on.


    Inspired by this post on Amplitude – Perspectives.


  • What It Takes to Build AI-Powered Products: A Senior Engineer’s Playbook and Mindset

    I spend my days partnering with technical leaders who bridge invention and impact. The role of a Senior Software Engineer at Amplitude working on AI-powered products epitomizes how engineering and product fuse to ship customer value with speed, safety, and conviction. In my world, that fusion isn’t accidental—it’s designed, measured, and relentlessly improved.

    When I form product trios—engineering, product, and design—we clarify the problem, the target users, and the measurable outcomes before a single line of code ships. This is how empowered product teams operate: we trade feature wish-lists for hypotheses, align on success metrics, and commit to learning loops that turn ambiguity into progress.

    On the technical front, modern AI systems demand a retrieval-first pipeline, robust data contracts, and a thoughtful orchestration layer for LLMs. I expect eval-driven development to be first-class: offline unit-style evals for prompts and policies, and online evals that track behavior changes and quality at scale. This rigor gives us confidence to ship, learn, and iterate without burning cycles on guesswork.

    Velocity matters, and so does reliability. I look for CI/CD that makes small, safe, frequent releases the default, and for DORA metrics to shine a light on delivery health. Pair that with platform scalability, clear SLOs, and pragmatic SRE practices, and teams earn the right to move fast without breaking trust.

    Responsible AI is non-negotiable. We operationalize AI risk management with guardrails, input/output filters, red-teaming, and human-in-the-loop review where stakes are high. Data governance and privacy-by-design ensure that our creativity never outruns our compliance—because durable products are built on durable trust.

    Impact comes from evidence. I advocate for disciplined A/B testing, careful minimum detectable effect (MDE) planning, and retention analysis that ties feature work to real business outcomes. Clear analytics pipelines and transparent dashboards keep stakeholders aligned and make good decisions repeatable.

    Ultimately, the Senior Software Engineer I want to collaborate with is a builder who balances systems thinking with customer empathy: someone who can design reliable architectures, instrument the work with meaningful evals, and co-lead discovery to de-risk the roadmap. When we combine that mindset with crisp execution, AI-powered products stop being demos—and start becoming indispensable.


    Inspired by this post on Amplitude – Perspectives.


  • Beyond the Support Iceberg: Gradient Labs’ Multi‑Agent Breakthrough That Actually Gets Work Done

    When a customer reports a stolen credit card, the frontline play seems straightforward—freeze it. But that’s just the visible tip of a much larger customer support iceberg. Underneath sits the real work: dispute filings, fraud investigations, merchant communications, proactive outreach, and follow-ups that unfold over days across multiple systems. Most AI support tools only touch the surface; they don’t coordinate or close the loop. That gap is exactly where my product instincts kick in—and why this story matters.

    I recently listened to a conversation with Jack Taylor (Product Engineer) and Ibrahim Faruqi (AI Engineer) from Gradient Labs, an AI-native startup building agents that automate the full scope of customer support in fintech. Their approach resonated with the challenges I see every day in customer support automation: fragmented workflows, regulatory complexity, and the need for human-in-the-loop moments. Gradient Labs has architected a platform with three coordinating agents—"inbound, back office, and outbound"—all built on a shared foundation of "natural language procedures, modular skills, and configurable guardrails."

    What impressed me most was how they "Let non-technical subject matter experts define agent behavior through natural language procedures—no coding required." That’s a powerful way to remove engineering bottlenecks, accelerate iteration, and keep the domain experts—those closest to fraud, disputes, and compliance—directly in control. In my experience, this design choice alone can compress lead times from weeks to hours and aligns perfectly with continuous discovery and eval-driven development.

    At the heart of their platform is orchestration. They "Architected a state machine orchestrator that manages turns, triggers, and skill selection across long-running conversations." That "turn" architecture is built for the messy reality of async, multi-day support. They treat "Skills as modular agent capabilities" that are "scoped deterministically per turn," ensuring the system stays predictable and auditable. They also confront a nuanced challenge most teams dodge: "Defining 'done' for outbound agents when the customer isn't the one ending the conversation." That’s where deterministic criteria, timers, and clearly scoped outcomes matter as much as the model beneath.
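    To make the idea tangible, here is a toy state machine in Python. It is my own sketch, not Gradient Labs' implementation: the states, triggers, and skill names are all hypothetical. It shows three things from the paragraph above: transitions driven by triggers, skills scoped by the current state, and a "done" condition that a timer can satisfy when the customer never replies.

```python
from enum import Enum
from typing import Dict, FrozenSet, List, Tuple


class State(Enum):
    INTAKE = "intake"
    INVESTIGATING = "investigating"
    AWAITING_CUSTOMER = "awaiting_customer"
    RESOLVED = "resolved"


# (current state, trigger) -> next state. Anything not listed is rejected.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.INTAKE, "card_frozen"): State.INVESTIGATING,
    (State.INVESTIGATING, "info_requested"): State.AWAITING_CUSTOMER,
    (State.INVESTIGATING, "dispute_filed"): State.RESOLVED,
    (State.AWAITING_CUSTOMER, "customer_replied"): State.INVESTIGATING,
    (State.AWAITING_CUSTOMER, "timer_expired"): State.RESOLVED,
}

# Skills the model may pick from on a turn, fixed by the current state.
SKILLS_BY_STATE: Dict[State, FrozenSet[str]] = {
    State.INTAKE: frozenset({"verify_identity", "freeze_card"}),
    State.INVESTIGATING: frozenset({"file_dispute", "request_info", "ask_a_human"}),
    State.AWAITING_CUSTOMER: frozenset({"send_reminder"}),
    State.RESOLVED: frozenset(),
}


class Orchestrator:
    def __init__(self) -> None:
        self.state = State.INTAKE
        self.history: List[Tuple[State, str, State]] = []  # audit trail of turns

    def allowed_skills(self) -> FrozenSet[str]:
        return SKILLS_BY_STATE[self.state]

    def handle(self, trigger: str) -> State:
        key = (self.state, trigger)
        if key not in TRANSITIONS:
            raise ValueError(f"trigger {trigger!r} is not valid in state {self.state.value}")
        next_state = TRANSITIONS[key]
        self.history.append((self.state, trigger, next_state))
        self.state = next_state
        return next_state

    def is_done(self) -> bool:
        return self.state is State.RESOLVED
```

    Because the transition table and skill sets are plain data, the model only ever chooses among skills the current state allows, and every turn lands in `history` for audit.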

    Compliance is not an afterthought—it’s baked into the core. Gradient Labs "Built guardrails as binary classifiers with eval pipelines, tuning for high recall on critical regulatory checks." In regulated domains, optimizing for recall on high-stakes checks is the right call; you can tolerate a few extra reviews, but you can’t miss a potential fraud signal. More broadly, they frame "Guardrails as classification problems: balancing recall and precision for regulatory compliance." That mindset is exactly how I like to merge AI risk management with product velocity.
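    Treating a guardrail as a binary classifier means the recall/precision trade-off comes down to a threshold on its score. The sketch below is my own illustration with hypothetical names: given labeled eval data, it picks the highest threshold that still meets a recall target and reports the precision that results.

```python
from typing import Sequence, Tuple


def pick_threshold(
    scores: Sequence[float], labels: Sequence[int], min_recall: float = 0.99
) -> Tuple[float, float, float]:
    """Return (threshold, recall, precision) for the highest threshold meeting min_recall.

    scores are classifier outputs (higher = more likely a violation);
    labels are 1 for a true violation and 0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(1 for y in labels if y)
    if positives == 0:
        raise ValueError("need at least one positive example to measure recall")
    # Walk thresholds from strict to lenient; recall can only rise as we go.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        recall = tp / positives
        if recall >= min_recall:
            precision = tp / (tp + fp)
            return threshold, recall, precision
    raise ValueError("no threshold reaches the requested recall")
```

    The precision returned alongside the threshold is the cost side of the trade: it tells you roughly how many flagged conversations will be extra reviews rather than real violations.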

    Crucially, they avoid the trap of fully autonomous optimism. "Ask a Human: a tool call that brings humans into the loop for approvals or missing APIs" gives the system a safety valve for novel or high-risk cases. Because it is modeled as an ordinary tool call, approvals, policy exceptions, and data gaps fold into the workflow without derailing it.

    Quality doesn’t happen by accident. They "Designed an auto-eval system that samples conversations for human review to catch edge cases and build labeled datasets" and built "Auto-eval pipelines that flag conversations for manual review and feed labeled datasets." That closed-loop evaluation flow is the backbone of sustainable performance in agentic AI. Combine this with targeted instrumentation—think CSAT, first contact resolution, deflection rate, time to resolution, and escalation rate—and you get a real Agent Analytics discipline, not just logs and dashboards.

    The "iceberg" metaphor is more than a catchy visual. It’s a blueprint for scoping multi-agent platforms that work across the entire customer journey. With "inbound, back office, and outbound" agents coordinating on complex tasks like fraud disputes, the system can transition cleanly from intake to investigation to resolution—without dropping context or asking customers to repeat themselves. This is what genuine customer support automation looks like when it’s grounded in real operations.

    Under the hood, the team leans into robust design choices that matter at scale: the "Complexities of Natural Language Input" are managed with explicit state and skill scoping, "Deterministic Skill Execution" reduces flakiness, and "Customer-Specific Guardrails" ensure compliance remains aligned to each client’s policies. Add their focus on "APIs and Customer Tools Integration" and the result is a platform that can actually take action—not just answer questions.

    If you’re building in this space, here’s how I’d apply these lessons. Start by mapping the iceberg: enumerate back-office steps, approvals, and SLAs that follow the initial customer touchpoint. Capture those steps as "natural language procedures" owned by SMEs. Implement a "state machine orchestrator" to manage "turns, triggers, and skill selection" across multi-day workflows. Treat "guardrails as classification problems" and tune for high recall on high-stakes checks. Introduce "Ask a Human" early to handle missing APIs or policy exceptions. Finally, operationalize learning with "auto-eval pipelines" and tight, eval-driven development loops. That’s how multi-agent platforms deliver measurable outcomes in fintech support.

    If you want to hear the full conversation, you can listen on Spotify or Apple Podcasts. You’ll also hear a nod to the "Incident.io episode – Referenced in the conversation," and a thoughtful take on the "Future of Multi-Agent Systems."

    In short: this is a shift from simple Q&A bots to agents that can coordinate, comply, and complete. It’s the kind of multi-agent platform work that moves the needle for customer support in fintech—and a compelling template for any product leader scaling agentic AI and AI workflows beyond the tip of the iceberg.


    Inspired by this post on Product Talk.


  • Master AI as a Product Manager in 12 Months: My 2026 Roadmap to Ship Smarter, Faster

    AI isn’t a side quest for product managers anymore—it’s the skill stack that will define how we discover problems, prototype solutions, and ship value in 2026. Over the last few cycles, I’ve watched teams that embrace AI Strategy outperform on speed, signal, and stakeholder confidence. This roadmap is the approach I use to build capability in a structured, outcome-driven way—so we ship smarter, faster, and more impact-driven products.

    "AI for PMs in 2026: why it matters, what to learn, and a 12-month AI roadmap to master product skills and ship smarter, faster, impact-driven products."

    Here’s how I frame what to learn and why: focus on enduring capabilities first (problem discovery, experimentation, ethics), then layer on the AI product toolbox (LLMs, retrieval-first pipeline patterns, AI workflows), and finally operationalize with OKRs that measure outcomes rather than output. The goal isn’t to sprinkle gen AI on everything; it’s to make better decisions, reduce cycle time, and unlock product-led growth in measurable ways.

    Months 1–3: Foundations. I build literacy around model behavior and constraints, context window management, and prompting patterns. I pair this with data governance and privacy-by-design basics so we avoid rework later. Practically, I assemble an AI product toolbox (evaluation checklists, prompt libraries, retrieval-first pipeline templates) and apply them to product discovery—summarizing research, clustering feedback, and sharpening value propositions without losing critical nuance.

    Months 4–6: Prototyping and evaluation. This is where ideas become testable artifacts. I use gen AI to prototype UX mocks, PRDs, and in-app guides rapidly, then validate them with eval-driven development. I run lean experiments (A/B testing with a clear minimum detectable effect), wire up analytics to Amplitude, and track activation and retention signals. The mantra: instrument early, measure causally, and iterate based on evidence.

    Months 7–9: Shipping AI-enabled workflows. I partner with product trios to integrate AI into real user journeys; customer support, CRM integration, and guided onboarding are common wins. We explore agentic AI for complex multi-step tasks, add safeguards for AI risk management, and pressure-test systems with threat detection and response playbooks. As features reach production, we monitor deployment frequency and tighten feedback loops to protect quality while accelerating learning.

    Months 10–12: Scale and governance. I operationalize what works with product roadmapping and sprint planning aligned to outcomes vs output OKRs. We codify playbooks for continuous discovery, define eval gates for new AI features, and unify analytics so teams can compare lift apples-to-apples. Stakeholder management matures into clear narratives: what shipped, what moved, what’s next—so leadership sees compounding value, not just activity.

    Throughout the year, I keep the focus on real users and real metrics: fewer hops from insight to iteration, tighter loops between problem and prototype, and crisper communication around trade-offs. The result is a team that can translate AI capabilities into differentiated product experiences—reliably and responsibly. If you follow this path, you’ll enter 2026 with the confidence to lead, the systems to scale, and the evidence to prove it.


    Inspired by this post on Product School.


  • Stop Tuning Prompts: How Context Engineering 10x’d Accuracy and Adoption in Our AI Platform

    "The best AI products improve more through context engineering than prompt tinkering." I’ve seen this play out repeatedly in high-stakes, enterprise use cases: substantive gains come from how we curate, structure, and deliver context to models—not from wordsmithing. When we started treating context as a product surface, performance climbed, hallucinations dropped, and teams shipped with more confidence.

    Here are four key decisions we made to improve our AI context.

    First, we moved to a retrieval-first pipeline. We unified trusted sources—CRM records, support knowledge bases, product telemetry, and governance-approved docs—behind hybrid retrieval (semantic + keyword) with strong metadata ranking. This let us constrain generations to verifiable facts, apply privacy-by-design rules at the edge, and practice disciplined context window management so every token carried its weight. Freshness policies, source-level confidence scores, and lightweight schemas kept the system precise and auditable.
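    For readers wondering how semantic and keyword results can be combined, reciprocal rank fusion (RRF) is one common, simple option. I am using it here purely as an illustration; the post does not say which fusion method the team used. The function name is mine, and `k=60` is the conventional default for RRF rather than a tuned value.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1),
    so documents ranked well by several retrievers rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["a", "b", "c"]  # hypothetical IDs from a vector search
keyword_hits = ["b", "d", "a"]   # hypothetical IDs from a keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

    RRF needs only ranks, not comparable scores, which is why it is a convenient first step before adding metadata-based ranking, freshness policies, or source-level confidence on top.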

    Second, we made eval-driven development non-negotiable. Every change to context assembly goes through offline evals and online A/B testing with clear acceptance thresholds (e.g., task success, groundedness, time-to-first-answer, and deflection rate). We sized tests with minimum detectable effect (MDE) and tied them to outcomes vs output OKRs so we weren’t just shipping more prompts—we were shipping measurable improvements that mattered to customers.
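    Sizing a test from an MDE can be sketched with the standard normal-approximation formula for comparing two proportions: n per arm = (z for 1 - alpha/2 + z for power)^2 × [p1(1 - p1) + p2(1 - p2)] / (p2 - p1)^2. The function name and defaults below are assumptions for the example, and this is a planning estimate, not a substitute for a proper experiment design tool.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float, mde_abs: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Uses a two-sided test at significance level alpha with the given power.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not 0.0 < p1 < 1.0 or not 0.0 < p2 < 1.0 or mde_abs == 0:
        raise ValueError("rates must be strictly between 0 and 1, and mde_abs non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

    The practical takeaway is the inverse-square relationship: halving the MDE roughly quadruples the traffic required, which is why the MDE should be agreed before the test starts rather than after.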

    Third, we personalized context based on intent and role. We built AI workflows that detect user intent, segment by persona, and dynamically assemble context: recent account activity for customer success, policy-safe excerpts for finance, and fine-grained reasoning chains for product teams. For conversational and voice agent experiences, we combined short-term conversation memory with scoped, long-term account memory to preserve relevance without bloating the prompt. This agentic pattern made responses faster, safer, and more helpful.

    Fourth, we operationalized context as a first-class platform capability. We invested in data governance (ownership, lineage, and redaction), instrumentation (Amplitude analytics for usage, retrieval hit rates, and failure modes), and CI/CD guardrails for context updates. Product trios partnered with SRE to monitor drift, while side-by-side comparisons and human-in-the-loop reviews turned frontline feedback into structured improvements. The result: a durable system that improves continuously instead of relying on one-off prompt tweaks.

    Context engineering isn’t glamorous, but it compounds. By prioritizing retrieval-first design, rigorous evaluation, intent-aware assembly, and operational excellence, we transformed our AI features into dependable, enterprise-ready capabilities. If you’re serious about putting LLMs to work in product management and building a sustainable AI strategy, shift your energy from clever prompts to robust context, and watch adoption and trust follow.


    Inspired by this post on Amplitude – Perspectives.


  • Inside a Staff AI Engineer’s Impact: How Cross-Functional AI Initiatives Drive Product Wins

    When I think about the roles that truly move the needle on AI Strategy and product outcomes, the Staff AI Engineer stands out. This is the person who can translate research into repeatable AI workflows, partner with product to solve real user problems, and operationalize models in a way that scales. It’s where innovation meets accountability—and where product management leadership meets hands-on engineering craft.

    Ram Soma is a Staff AI Engineer at Amplitude, leading various AI initiatives across the company. He has a background in data science and machine learning engineering.

    What does that look like in practice from my seat? It starts with precise problem framing and measurable success criteria. I align with a Staff AI Engineer on eval-driven development and instrumentation so we can track impact from prototype to production. With Amplitude analytics operating as a unified analytics platform, we can quantify user activation, retention, and feature adoption, then iterate through continuous discovery with tight feedback loops.

    Execution quality hinges on robust experimentation. Together, we design A/B testing plans with minimum detectable effect (MDE) targets, isolate confounding variables, and build evaluation harnesses that reflect real-world UX constraints. We also agree on rollout strategies—staged deployments, guardrails, and observability—so we can learn safely while preserving customer trust and performance SLAs.

    On the technical approach, I look for pragmatic architectures that balance speed and reliability: a retrieval-first pipeline for grounding, judicious use of LLMs with instrumented prompts and policies, and agentic AI patterns only when task decomposition truly reduces complexity. Just as important are privacy-by-design and data governance practices from day one, because responsible innovation beats retrofitting controls after the fact.

    Finally, the magic happens in empowered product teams and product trios. When product, design, and Staff AI Engineering operate with shared context and clear constraints, we compress decision cycles and ship value faster. That’s how AI initiatives evolve from demos to durable capabilities—and how we enable product-led growth with measurable results that customers feel, not just features they see.


    Inspired by this post on Amplitude – Perspectives.


  • Teach AI to Think Like an Analyst: My Amplitude-Inspired Playbook for Faster Decisions

    I’ve spent years trying to bottle the judgment of a great product analyst and pour it into our AI workflows. The hardest part isn’t access to data; it’s encoding the nuance of analytical reasoning. That’s why Amplitude’s approach resonated with me—turning expert analysis into a repeatable, stepwise process AI can run with discipline and speed.

    "Learn how Amplitude turned its data analysis expertise into a structured, iterative process that AI can execute in moments."

    In practical terms, I translate that one line into an operating model: define the decision, formalize the metrics, map the data, decompose the questions, iterate on evidence, and converge on a recommendation with clear trade-offs. This is the backbone of agentic AI for product managers—giving an LLM not just data, but a procedure that mirrors how our best analysts think.

    Here’s the analyst-to-AI loop I use. First, frame the business question in decision language (what will we do differently?). Second, anchor on success metrics and guardrails, including statistical sensitivity and minimum detectable effect (MDE). Third, locate trusted sources—your unified analytics platform, experiment logs, and product instrumentation—so the AI never guesses. Fourth, generate hypotheses and segment the data (cohorts, channels, plans, geos), prioritizing signal over noise. Finally, synthesize findings into options with expected impact, risks, and next steps.

    To operationalize this, I build a retrieval-first pipeline that binds Amplitude analytics to structured prompts and function calls. The AI receives exact metric definitions, event taxonomies, and governance rules, then returns a predictable schema—headlines, evidence, segments, caveats, and recommended actions. That combination of clear constraints and consistent output makes eval-driven development possible: I can test prompts and tooling against a gold set of analyses and steadily improve quality.

    Consider retention analysis on a new onboarding flow. I’ll ask the system to pull activation rate, time-to-value, and day-7 retention from Amplitude, then compare cohorts by channel and plan. The AI proposes hypotheses (e.g., tooltip engagement correlates with activation), runs segmentation to validate them, and lays out product-led growth levers—like simplifying the first-run checklist or moving guidance in-app. What used to take hours of manual slicing now becomes an iterative loop that lets me spend more time on prioritization and less on tab wrangling.
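    To show the kind of calculation behind that comparison, here is a small Python sketch of day-7 retention by cohort. It runs on plain in-memory records and does not call Amplitude's API; the record fields are hypothetical, and "retained" is assumed to mean active exactly seven days after signup (teams often use a window instead).

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, List


def day7_retention(users: Iterable[Dict]) -> Dict[str, float]:
    """Share of each cohort that was active exactly 7 days after signing up.

    Each user record needs: "cohort" (str), "signup" (date),
    and "active_days" (a set of dates on which the user was active).
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [retained, total]
    for user in users:
        bucket = counts[user["cohort"]]
        bucket[1] += 1
        if user["signup"] + timedelta(days=7) in user["active_days"]:
            bucket[0] += 1
    return {cohort: retained / total for cohort, (retained, total) in counts.items()}
```

    Pinning the definition down in code like this is also what makes the metric safe to hand to an AI workflow: the model receives the exact rule rather than guessing what "day-7 retention" means.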

    Of course, speed without rigor is a trap. I guard against metric drift and hallucinations with strong definitions, lineage checks, and human-in-the-loop approvals for consequential decisions. I also log analysis steps and outcomes so we can audit reasoning, catch regressions, and keep AI grounded in our true north metrics—not just what’s easy to compute.

    The big unlock isn’t a clever prompt; it’s codifying the analyst’s craft. When we treat analysis as a structured, iterative process, AI can execute it with consistency, and product teams can move faster with more confidence. If you’re building AI workflows for product insight, start by formalizing your analyst loop, connect it to your Amplitude analytics, and evaluate continuously. The result is smarter, faster decisions—and a repeatable path from raw data to action.


    Inspired by this post on Amplitude – Best Practices.


  • AI in Product Design: My Proven Playbook, Real Use Cases, and the Tools That Win Faster

    In product design, AI has shifted from novelty to non-negotiable. I’ve watched teams accelerate discovery, compress prototyping cycles, and turn ambiguous ideas into validated experiences faster than ever—without sacrificing quality or customer trust.

    "AI in product design has quickly moved from new to necessary. Here are the AI product design tools and approaches you need to stay relevant in this decade."

    From my vantage point leading product teams, “necessary” means AI is woven throughout the product lifecycle—discovery, prioritization, prototyping, validation, and iteration—not bolted on. The goal isn’t to chase hype; it’s to build durable advantage with clear AI Strategy, disciplined execution, and measurable outcomes.

    First, anchor the work in strategy. Tie every AI initiative to a specific customer problem and value proposition, then express that linkage with outcomes vs output OKRs. This keeps teams focused on real impact and avoids feature-chasing. It also sharpens product positioning and clarifies where AI can deliver competitive differentiation versus simple points of parity.

    Second, upgrade discovery. I rely on AI workflows to synthesize interviews, cluster themes, and surface insights at scale. A retrieval-first pipeline—grounding models in our own data—improves factuality and reduces hallucinations. Combine this with strong data governance and privacy-by-design so insights are trustworthy and compliant from day one.

    Third, make quality measurable. Adopt eval-driven development: define evaluation sets and acceptance thresholds that reflect real user tasks before you ship. Pair that with A/B testing and minimum detectable effect (MDE) discipline, so you learn quickly and confidently. Add safety guardrails (red-teaming prompts, content filters, and bias checks) to manage AI risk without slowing the pace.

    Fourth, enable empowered product teams. Product trios (PM, design, engineering) should co-create prompts, prototypes, and evaluation criteria. Give designers and PMs practical tools (LLM assistants suited to product work, structured prompt templates, and reusable components) so AI-augmented work becomes the default, not a special project.

    Where does AI shine in product design today? In concept exploration and market scans, it turns fuzzy opportunity spaces into crisp problem statements. In rapid wireframing and interaction design, generative AI lets teams explore multiple design directions in minutes. In UX writing, it adapts tone and reduces friction across onboarding, tooltips, and microcopy.

    It also excels at guided experiences. I’ve seen strong lifts in user activation when we pair in-app guides and product tours with context-aware suggestions. For support and education use cases, a retrieval-grounded assistant can deflect tickets, shorten time-to-value, and reinforce the product’s value proposition at the exact moment a user needs help.

    Voice is another frontier. A well-scoped voice AI agent can accelerate complex workflows (think data entry or multi-step configurations) when hands-free is faster or more intuitive. Just be intentional about when agentic AI adds net value versus when a simple UI tweak would do.

    On the tooling side, my AI product toolbox is pragmatic and modular. For analytics and learning loops, Amplitude and Pendo help quantify behavior changes and retention. For in-product engagement and feedback routing, Intercom and HubSpot integrate cleanly with LLM-driven tagging and summarization. For ideation and automation, I use a ChatGPT connector and Claude Code for quick scripts, data wrangling, and prompt experiments. The constant: a retrieval-first pipeline that grounds models in approved knowledge and keeps context windows manageable at scale.

    Risk management is built in, not bolted on. Set clear AI risk management policies, catalog model and data dependencies, and document decisions. Align with regulatory compliance requirements early, and keep an audit trail of prompts, datasets, and eval results. That’s how you move fast without breaking trust.

    If you’re getting started, begin small: pick one high-friction workflow, add a retrieval-grounded copilot, and measure the lift. Use the results to inform product roadmapping and sprint planning, then scale to adjacent use cases. With disciplined discovery, sharp evaluation, and the right tooling, AI becomes a force multiplier for product teams and a clear win for customers.


    Inspired by this post on Product School.


  • From Concierge to AI Marketing Engine: Inside Mowie’s Document Hierarchy Playbook


    SMB owners constantly ask me a version of the same question: what if a small business could have a full marketing team, with automated content calendars, customer segmentation, and channel-specific posts, without the headcount? That question is no longer hypothetical. It is precisely the promise behind Mowie, and the way they got there is a masterclass in practical AI product development.

    I recently listened to Chris O'Connor (CEO) and Jessica Valenzuela (Co-Founder) of Mowie, an AI marketing platform built for small and medium-sized businesses in restaurants, retail, and e-commerce. Their story starts with a concierge marketing service—doing the work by hand for overwhelmed owners—and evolves into a fully automated AI product.

    They walk through their "document hierarchy" approach: how Mowie crawls the web to build a "dossier" about each business, infers customer segments and marketing pillars, and generates quarterly content calendars with channel-specific posts. As a product leader, this is the kind of retrieval-first pipeline that consistently outperforms naive prompt chaining because it builds durable context before generation.

    They also unpack the technical challenges of structuring unstructured data and the evolution from rigid schemas to loosely structured markdown. In my experience applying LLMs to product work, markdown becomes a flexible intermediate representation that’s easy to diff, trace, and feed back into models without brittle parsing.

    Equally important, they use customer feedback—from calendar approvals to regeneration requests—as their primary evaluation signal. That’s eval-driven development in practice: close the loop with lightweight evals that reflect genuine user intent, not proxy metrics.

    The planning model is elegant: three mini-calendars (public events, business-specific events, and recommended campaigns) roll up into a coherent plan that eliminates the blank-page problem and enables steady, predictable execution.
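
    To make the roll-up concrete, here is a rough sketch of merging three mini-calendars into one chronological plan. It is an illustration only, not Mowie’s implementation: `CalendarEntry`, `build_plan`, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CalendarEntry:
    day: date
    title: str
    source: str  # which mini-calendar the entry came from


def build_plan(public_events, business_events, campaigns):
    """Merge the three mini-calendars into one plan ordered by date.

    sorted() is stable, so entries on the same day keep the order
    public -> business -> campaign.
    """
    combined = list(public_events) + list(business_events) + list(campaigns)
    return sorted(combined, key=lambda entry: entry.day)


# Hypothetical sample data for a restaurant.
PUBLIC = [CalendarEntry(date(2025, 2, 14), "Valentine's Day", "public")]
BUSINESS = [CalendarEntry(date(2025, 2, 1), "New menu launch", "business")]
CAMPAIGNS = [CalendarEntry(date(2025, 2, 7), "Valentine's pre-order push", "campaign")]

if __name__ == "__main__":
    for entry in build_plan(PUBLIC, BUSINESS, CAMPAIGNS):
        print(entry.day.isoformat(), entry.source, entry.title)
```

    Keeping the `source` field on every entry means the merged plan still shows which mini-calendar each item came from.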

    Crucially, they’re building traceability so customers can see which context documents influenced their content. This kind of transparency increases trust, accelerates edits, and supports governance in regulated categories where auditability matters.

    Onboarding and data collection stay pragmatic: let the system crawl first, ask humans only for deltas, and progressively profile over time. It’s a pattern I advocate in continuous discovery and AI workflows—keep humans in the loop without overwhelming them, and make the right action the easy action.

    Early on, they used Simon Sinek's Golden Circle framework to validate demand and sharpen messaging. Framing the "why" before the "what" helps teams maintain a crisp value proposition and tighten their go-to-market strategy.

    Performance measurement goes beyond vanity metrics by connecting marketing performance back to point-of-sale data for attribution. The ability to tie campaigns to revenue events is the bridge from clever content to accountable outcomes.

    What’s next is equally compelling: deeper attribution, omnichannel expansion, and digital out-of-home displays. For SMBs, that points to a unified analytics platform spanning email, social, and in-store touchpoints—exactly where modern marketing is headed.

    My takeaways for builders: invest in a retrieval-first pipeline with a resilient document hierarchy; prefer loosely structured markdown over rigid JSON when dealing with messy inputs; design human-in-the-loop controls that double as evals; and always connect activity to business outcomes. That’s how you turn an idea into a repeatable system that scales.

    If you want to explore further, start here: Mowie AI — AI marketing platform for SMBs. For early validation and storytelling, revisit Simon Sinek's Golden Circle.


    Inspired by this post on Product Talk.


  • Make Every Answer the Last: Building a Self-Improving AI Support Engine for 2026


    Once I’ve defined the right roles on my team, the next move is to design an operating model that makes progress a habit. My goal is simple: every interaction should strengthen the system so the AI Agent keeps improving over time.

    I anchor the team on a mantra that has never failed me: “The first time you answer a question should be the last.” That single statement reframes support as a compounding system rather than a one-off activity.

    The ambition is to ensure every resolution makes the next one faster and more accurate, so fewer issues repeat, quality compounds, and support scales naturally. That doesn’t happen by accident—it requires intentional design.

    In practice, this comes down to four essentials: clear ownership of performance, guardrails that make iteration fast and safe, feedback loops that turn learning into routine upgrades, and a culture that celebrates the work of improvement—not just the outcomes. Here’s how I put that into play.

    First, I start with clear ownership. Ambiguity is one of the most common reasons AI performance plateaus. When no one truly owns how the AI Agent performs, feedback gets lost, issues linger, and improvements stall.

    On high-performing teams, I assign a single owner—often an AI ops lead—responsible for making the AI Agent better. They review resolution trends to spot underperformance, make targeted updates to content, configuration, and behavior, coordinate with product and engineering on systemic blockers, and set improvement priorities, targets, and timelines. The title matters less than the mandate; what matters is clear authority to drive change across teams.

    Real-world example: At Dotdigital, AI performance plateaued after a strong start—resolving around 2,800 conversations per month for three consecutive months. To drive resolution rates up, the team created a dedicated support operations specialist role, filled by an experienced agent with deep product knowledge. This person will focus on refining snippets, improving content, and enhancing the AI’s resolution capabilities.

    Second, I make iteration fast and safe. As the AI Agent takes on more volume and complexity, change can start to feel risky—so teams hesitate, and performance stalls. Lightweight governance fixes that by making the path from insight to action predictable.

    I keep the rules simple and explicit: which changes need review (and which don’t), who the decision-makers are, how we test updates before they go live, where feedback flows so it’s seen and acted on, and when progress gets reviewed on a steady cadence. Governance isn’t bureaucracy—it’s what keeps improvement routine and safe.

    Real-world example: Anthropic ran a focused “Fin hackathon” sprint to improve their AI Agent’s resolution rate. The team audited unresolved queries, identified underperforming topics, and created or updated content to close gaps. They converted frequently used macros into AI-usable snippets, monitored Fin’s performance during live support, and continuously refined content based on real interactions. This structured approach enabled rapid improvement while maintaining quality standards.

    Third, I build a system that learns by default. AI performance isn’t static, but many organizations treat it like a one-time implementation. The most successful teams operationalize learning: they analyze where the AI Agent struggles and feed those insights directly into structured improvements.

    The signals are straightforward: review common handoffs to humans, track unresolved queries by topic or intent, measure resolution rate trends over time, and use those inputs to prioritize fixes and content upgrades. Whether you follow a formal loop like the Fin Flywheel framework or something lighter, the goal is the same—make improvement inevitable.
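
    A small sketch of how those signals could be turned into a prioritized fix list, assuming each conversation is logged with a topic and a flag for whether the AI Agent resolved it. The record fields (`topic`, `resolved_by_ai`) and function names are hypothetical, not taken from any particular support platform.

```python
from collections import Counter


def resolution_rate_by_topic(conversations):
    """Return {topic: resolved / total} for a list of conversation records."""
    totals = Counter()
    resolved = Counter()
    for convo in conversations:
        totals[convo["topic"]] += 1
        if convo["resolved_by_ai"]:
            resolved[convo["topic"]] += 1
    return {topic: resolved[topic] / totals[topic] for topic in totals}


def worst_topics(conversations, min_volume=2):
    """Topics with at least min_volume conversations, lowest resolution rate first."""
    totals = Counter(convo["topic"] for convo in conversations)
    rates = resolution_rate_by_topic(conversations)
    eligible = [topic for topic in rates if totals[topic] >= min_volume]
    return sorted(eligible, key=lambda topic: (rates[topic], topic))


# Hypothetical sample log.
SAMPLE = [
    {"topic": "billing", "resolved_by_ai": True},
    {"topic": "billing", "resolved_by_ai": False},
    {"topic": "billing", "resolved_by_ai": False},
    {"topic": "login", "resolved_by_ai": True},
    {"topic": "login", "resolved_by_ai": True},
    {"topic": "exports", "resolved_by_ai": False},
]

if __name__ == "__main__":
    print(worst_topics(SAMPLE))  # ['billing', 'login']
```

    The `min_volume` filter keeps one-off topics (here, `exports`) from jumping to the top of the queue on a single unresolved conversation.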

    Fourth, I treat content as competitive infrastructure. Your AI Agent is only as good as what it knows. As George Dilthey, Head of Support at Clay, put it: “That’s when we realized: AI doesn’t just come up with information out of nowhere, you have to feed it. We were spending all our time evaluating tools when we should’ve been focused on content.”

    I operationalize knowledge like infrastructure: every topic has a clear owner, content is structured, versioned, and ingestion-ready, new products ship with source-of-truth content by default, and changes ship on a schedule—not when someone finds time. This is the backbone that differentiates teams who scale confidently from those who stall out.

    In my organization, we’ve evolved our New Product Introduction (NPI) process by aligning early with R&D on a single, canonical source of truth that becomes the foundation for all downstream content—including what the AI Agent uses to resolve queries. By embedding content creation into launch readiness, not as an afterthought, we’ve consistently hit 50%+ resolution rates on new features from day one.

    Finally, I make belief visible. Even the best system will stagnate if people stop believing in it. Belief can fade quietly unless you reinforce it on purpose. I keep it strong by sharing specific wins regularly, highlighting improvements with metrics, and recognizing the people behind the gains—then giving them space to lead. This isn’t just about morale; it keeps everyone aligned on the bigger play.

    When you put it all together—clear ownership, safe iteration, a learning system by default, and content as infrastructure—AI performance compounds. As the AI Agent gets better, the entire support model becomes faster, more reliable, and truly scalable. That’s the foundation of a modern, AI-first support organization.

    Next, I’ll take this a level deeper and share how capacity planning changes when AI handles the majority of inbound volume and your team shifts into higher-value roles. If scaling with confidence is the goal, this is where the operating model pays off.


    Inspired by this post on The Intercom Blog.


  • Beyond Accuracy: The Trust-First Evaluation Metrics I Use to Scale High-Impact AI Products


    When I assess whether an AI product is ready for prime time, I start with trust—not model accuracy. Accuracy is table stakes; trust is what earns adoption, drives retention, and unlocks durable product-led growth.

    Evaluation metrics in AI products go beyond accuracy. Learn how product teams use trust-driven metrics to build reliable, growth-driving AI systems.

    In practice, I organize trust-driven metrics into four layers: model quality and safety, user and business outcomes, operational reliability and cost, and governance and compliance. This layered approach keeps product trios aligned on what matters now, what must be gated in CI/CD, and what signals we’ll use to prove progress against outcome-focused (rather than output-focused) OKRs.

    On model quality and safety, I care about precision, recall, F1, calibration, and abstention behavior, but also the hard-to-fake signals: hallucination rate, grounding and faithfulness, citation coverage, toxicity, bias, and fairness. For generative systems, I instrument refusal correctness (declining unsafe requests) and evidence adequacy (did the answer rely on retrieved, trustworthy sources).
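
    For reference, precision, recall, and F1 follow directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels, and the function name is my own.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (truthy = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    # 2 true positives, 1 false positive, 1 false negative.
    print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

    In this example all three values come out to 2/3, because the single false positive and single false negative offset each other.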

    User and business outcomes must be explicit. I track adoption, activation, task success rate, time to first value, win rate uplift in assisted workflows, CSAT and NPS deltas, and retention analysis by cohort exposed to AI features. For customer support scenarios, deflection rate, average handle time change, and first-contact resolution are core; for sales or ops copilots, I monitor cycle-time reduction and error-rate reduction in critical tasks.

    Experimentation is non-negotiable. I design A/B testing with a clear minimum detectable effect (MDE), pre-registered guardrails for safety and quality, and sequential tests that stop early if harm outpaces benefit. Online metrics are always paired with offline evals so we can iterate quickly without exposing users to regressions.
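
    The MDE drives how much traffic a test needs. As a rough sketch, the standard two-proportion approximation below estimates users per arm for a fixed-horizon, two-sided test. It does not cover the sequential designs mentioned above, which need their own corrections, and the function name and defaults are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift of mde_abs.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


if __name__ == "__main__":
    # 10% baseline conversion, detecting a 2-point absolute lift.
    print(sample_size_per_arm(0.10, 0.02))
```

    With a 10% baseline and a 2-point absolute MDE this comes out to roughly 3,800 users per arm. Halving the MDE roughly quadruples the requirement, which is why the MDE is worth agreeing on before the test starts.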

    Operationally, trust shows up as speed, stability, and cost predictability. I track latency end-to-end, time to first token, throughput, rate of 5xx and timeouts, cost per request, and caching effectiveness. We also trend safety incidents per 10,000 interactions and mean time to mitigation to keep reliability visible alongside performance.
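
    A sketch of how a few of these operational metrics could be computed from request logs, assuming each record carries `latency_ms`, `status`, and `cost_usd` fields (hypothetical names). Percentiles use the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ops_summary(requests):
    """Summarize latency, 5xx rate, and cost per request for a batch of log records."""
    latencies = [r["latency_ms"] for r in requests]
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    total_cost = sum(r["cost_usd"] for r in requests)
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "error_rate": server_errors / len(requests),
        "cost_per_request_usd": total_cost / len(requests),
    }


# Hypothetical log: ten requests, 100..1000 ms, one of them a 503.
SAMPLE_LOG = [
    {"latency_ms": 100 * i, "status": 503 if i == 10 else 200, "cost_usd": 0.002}
    for i in range(1, 11)
]

if __name__ == "__main__":
    print(ops_summary(SAMPLE_LOG))
```

    Reporting p95 alongside p50 matters here: the median of the sample log is 500 ms, while the slowest tail sits at 1,000 ms.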

    Governance and compliance are part of the product, not an afterthought. Data governance and privacy-by-design metrics include PII exposure rate, data lineage coverage, access-control correctness, audit pass rate against internal policies, and model and prompt change traceability. This is the backbone of our AI risk management posture and accelerates regulatory compliance reviews instead of slowing them down.

    The delivery engine for all of this is eval-driven development. We maintain golden datasets and scenario-based test suites that mirror real user intents, gate releases in CI/CD with minimum thresholds, and run canary rollouts to validate offline–online alignment. Every model or prompt update gets a comparable scorecard so product, engineering, and design can trade off quality, speed, and cost with shared facts.
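
    One way to make scorecards comparable is to diff a candidate against the current baseline and flag regressions per metric. The sketch below assumes both scorecards are flat dictionaries. The metric names, values, and the `compare_scorecards` helper are illustrative only.

```python
def compare_scorecards(baseline, candidate, higher_is_better):
    """Return {metric: (baseline_value, candidate_value)} for metrics that got worse."""
    regressions = {}
    for metric, base_value in baseline.items():
        cand_value = candidate[metric]
        improved_by = cand_value - base_value
        if not higher_is_better[metric]:
            # For metrics like latency or hallucination rate, lower is better.
            improved_by = -improved_by
        if improved_by < 0:
            regressions[metric] = (base_value, cand_value)
    return regressions


# Hypothetical scorecards for a prompt update.
BASELINE = {"task_success": 0.82, "hallucination_rate": 0.06, "p95_latency_ms": 1800}
CANDIDATE = {"task_success": 0.85, "hallucination_rate": 0.04, "p95_latency_ms": 2100}
DIRECTION = {"task_success": True, "hallucination_rate": False, "p95_latency_ms": False}

if __name__ == "__main__":
    # Quality improved on both counts, but latency regressed.
    print(compare_scorecards(BASELINE, CANDIDATE, DIRECTION))
```

    Here the candidate wins on task success and hallucination rate but loses on p95 latency, which is the kind of explicit trade-off the scorecard is meant to surface.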

    For LLM-heavy features, retrieval-first pipeline metrics are mandatory. I monitor retrieval hit rate, recall at K, mean reciprocal rank, context contamination, and citation correctness. With large prompts, context window management matters: we track context utilization, truncation rate, and the contribution of each context block to final answers to avoid silently losing critical evidence.
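
    Two of these retrieval metrics are simple enough to sketch directly. The code below assumes each query has a ranked list of retrieved document IDs and a set of IDs judged relevant. The function names and sample IDs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set is empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("no queries supplied")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical judged queries.
RUNS = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9"], {"d5"}),              # first relevant hit at rank 1
]

if __name__ == "__main__":
    print(recall_at_k(RUNS[0][0], RUNS[0][1], k=2))  # 0.5
    print(mean_reciprocal_rank(RUNS))                # 0.75
```

    Recall at K shows whether the evidence reached the context at all, while MRR shows how high the first useful document ranked. Tracking both helps separate retrieval gaps from ranking problems.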

    Finally, trust must be legible. I package these metrics into an executive scorecard that maps to business outcomes, risk appetite, and OKRs, with clear thresholds for ship, improve, or roll back. When teams can articulate trade-offs—say, a 20% latency reduction at a small cost increase, or a lower hallucination rate at the expense of higher abstention—they build credibility with stakeholders and confidence with customers.

    Trust is not a single number; it’s a system of evidence. By instrumenting these layers and operationalizing AI Strategy with rigorous, transparent metrics, we can ship faster, reduce surprises, and earn the right to scale AI features across the product portfolio.


    Inspired by this post on Product School.

