Tag: AI risk management

  • Make Your Analytics AI-Ready: De-Risk, Measure, and Scale AI-First Products Fast

    Make Your Analytics AI-Ready: De-Risk, Measure, and Scale AI-First Products Fast

    I ask one question before I green‑light any new AI feature: is our analytics truly AI‑ready? If the answer is no, we slow down, because nothing derails an AI roadmap faster than shipping features we can’t measure, iterate, or trust. Over time, I’ve learned that the right analytics foundation is the difference between a flashy demo and a durable, compounding product advantage.

    "Product and engineering teams face new challenges when building AI-first products. A modern digital analytics platform offers solutions." I agree—and I’d add that the real win comes when model metrics and product outcomes live in one coherent system, so we can connect every improvement to customer value.

    Here’s what “AI‑ready” analytics means in practice for me: a unified event taxonomy tied to clear user and account identities; consistent product analytics (activation, funnels, retention analysis, cohorts); ground‑truth labels and feedback signals for model evaluation; and a single source of truth that blends model telemetry with user behavior. When those pieces click, our AI Strategy turns from guessing to “eval‑driven development.”

    Start with data governance and privacy‑by‑design. Define event names, properties, and versioning rules up front. Capture the context that AI needs—inputs, outputs, confidence scores, content types—without storing unnecessary PII. This discipline reduces rework, improves observability, and keeps auditors and customers confident in how we handle data.

    Next, operationalize eval‑driven development. I run offline evaluations with representative datasets, then shadow mode in production, and finally controlled rollouts with A/B testing and feature flags. We set a minimum detectable effect so experiments are conclusive, and we include AI risk management metrics—like safety violations, fallback rates, and moderation triggers—alongside core product KPIs such as activation, task success, and time‑to‑value.

    On the product analytics side, I rely on a unified analytics platform (e.g., Amplitude analytics or similar) to track adoption of AI features: who sees the feature, who tries it, who repeats it, and who retains because of it. Cohort analyses help me isolate lift among target segments; CRM integration connects usage to revenue; and pathing highlights where users need guidance. This is the engine of product‑led growth for AI capabilities.

    Quality and observability complete the loop. I monitor latency, error rates, and cost per successful outcome, but I also watch human‑grounded proxies: thumbs up/down, edits after AI suggestions, and deflection and CSAT for support workflows. These signals feed back into prompt engineering, retrieval quality, and model selection—closing the gap between LLM behavior and customer value.

    None of this works without strong cross‑functional rituals. Product trios align on success metrics before we write a line of code; continuous discovery validates user problems; and QBRs versus OKRs are reconciled so we invest in durable capabilities, not just quarterly spikes. When analytics and discovery move in lockstep, we ship fewer speculative features and more compounding improvements.

    Finally, choose build versus buy intentionally. I buy a robust, scalable analytics substrate and only build the custom AI evals I need for proprietary use cases. With feature flags in CI/CD and automated schema checks, instrumentation becomes part of deployment frequency—not an afterthought. The result is a reliable runway to scale AI‑first products without losing speed, safety, or clarity.

    If you want a quick readiness check: do you have a clean event schema, identity resolution, and governed properties; a measurable definition of activation for each AI feature; offline and online evals connected to business KPIs; guardrails and human feedback in the loop; and dashboards that team leaders actually use? If not, start there. The payoff is faster iteration, lower risk, and a clearer line from AI investment to customer outcomes.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

    AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

    Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

    This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

    Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcomes vs output OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

    Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

    Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.

    Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

    Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics and deployment frequency to ensure we’re improving the engine, not just the output.

    Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool whitelists, separate read vs write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.

    Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

    If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

    Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

    Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

    Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

    The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.


    Inspired by this post on Product School.


    Book a consult png image
  • Amplitude’s AI Visibility Upgrade: Content Generation, Chat Segmentation, Sleeker UI—Why It Matters

    Amplitude’s AI Visibility Upgrade: Content Generation, Chat Segmentation, Sleeker UI—Why It Matters

    I look for analytics upgrades that meaningfully compress time-to-insight for product teams. The newest expansion of Amplitude AI Visibility stands out because it improves how we explore user behavior, automate insight creation, and translate data into action across product-led growth motions.

    Explore the most recent updates to Amplitude AI Visibility, including content generation, AI chat-driven segmentation, better UI, and improved reliability.

    Here’s how I’m thinking about the impact. Content generation can turn raw events into ready-to-share narratives—experiment summaries for A/B testing, cohort deep-dives for retention analysis, and executive briefs that tie outcomes to roadmap decisions. For leaders and ICs alike, this trims the manual lift in Amplitude analytics while keeping the human in the loop to verify context and nuance.

    AI chat-driven segmentation is another meaningful unlock. Instead of clicking through complex filters, I can describe the cohort I want in natural language and iterate quickly. That speeds up continuous segmentation work—spotting activation bottlenecks, isolating churn precursors, or defining cohorts for product-led growth experiments—and keeps the team focused on hypotheses and decisions, not interface friction. With LLMs for product managers, the key is pairing this speed with clear guardrails and validation steps.

    The updated UI matters more than aesthetic polish. A clearer, more consistent experience reduces cognitive load, improves adoption across cross-functional partners, and reinforces a unified analytics platform approach. Improved reliability, paired with strong observability, increases trust in the stack—critical when insights drive roadmap priorities and high-visibility launches.

    Operationally, I’d roll this out with a simple playbook: identify 2–3 high-value use cases (e.g., activation funnel analysis, churn cohort exploration, experiment reporting), define success metrics (time-to-insight, stakeholder adoption, decision velocity), and establish basic AI risk management and data governance guardrails (prompt templates, access policies, and review steps). The goal is to turn AI workflows into a durable capability rather than a one-off novelty.

    Bottom line: these enhancements remove friction between questions and answers. If your team relies on Amplitude analytics, the combination of content generation, AI chat-driven segmentation, a cleaner UI, and stronger reliability should accelerate discovery cycles and help you translate insight into action with greater confidence.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Two People, Zero Waste: How Earmark’s Agentic AI Turns Meetings into Finished Work

    Two People, Zero Waste: How Earmark’s Agentic AI Turns Meetings into Finished Work

    I care about meetings only insofar as they create momentum and outcomes. What if your meetings could actually produce the artifacts you need—specs, tickets, slides—before the call even ends?

    I recently listened to an episode of Just Now Possible where Teresa Torres talks with Mark Barbir (CEO) and Sanden Gocka (Co-Founder), the co-founders of Earmark, about building a productivity suite that turns unstructured conversations into finished work in real time. As a product leader, this premise hits the sweet spot of agentic AI, real-time AI workflows, and ruthless focus on outcomes over output.

    Listen to this episode on: Spotify | Apple Podcasts

    Unlike generic AI notetakers that produce summaries nobody reads, Earmark runs multiple agents in parallel during your meetings—translating engineering jargon, drafting product specs, even spinning up prototypes in Cursor or V0 while you're still talking. That’s the bar I want from AI in the room: finished work, not notes.

    What impressed me most was the clarity of their pivot. They moved from an Apple Vision Pro presentation coaching tool to a web-based meeting assistant. I’ve made similar calls: when the distribution path and daily workflow are obvious, you follow the user’s gravity. This shift unlocked a broader surface area—PMs, engineers, design partners—and made agentic workflows useful where work actually happens.

    They also turned a technical constraint into a commercial advantage. Their ephemeral (no-storage) architecture became a feature for enterprise sales. I’ve seen this repeatedly in AI risk management: privacy-by-design and clear data governance reduce friction with security reviewers and accelerate procurement. For many enterprises, “we don’t store your data” is the win condition.

    Cost discipline was another standout. They tackled the hard problem of making real-time AI affordable—from $70 per meeting down to under a dollar through prompt caching. That’s not just optimization; it’s product strategy. Choices like model selection, context window management, and retrieval-first pipeline design determine whether a feature can scale to every meeting or remains a demo.

    On capability design, the team leaned into templates and simulated stakeholders to ship value fast. Template-based agents: Engineering Translator, Make Me Look Smart, Acronym Explainer. Personas that simulate absent team members (security architect, legal, accessibility). This is exactly how I frame early AI workflows: remove friction for the product trio, anticipate blockers, and let the agent do the tedious, error-prone first pass.

    They were refreshingly pragmatic about models. Why GPT 4.1 still beats newer models for prose quality in their use case is a reminder that “best” is contextual. When the job-to-be-done is precise prose and production-grade artifacts, consistent quality trumps leaderboard buzz. Of course, they also invest in guardrails to ensure quality and manage hallucinations—another non-negotiable for enterprise adoption.

    Search and analysis across time is where many AI products stumble. They explained the limits of vector search for analysis questions across meetings and how they’re building agentic search with multiple retrieval tools (RAG, BM25, metadata queries, bespoke summaries). I couldn’t agree more: analysis requires reasoning over structure, time, and purpose—not just semantic proximity. Layered retrieval with stateful agents beats a single embedding call.

    They also articulated a crisp user thesis: design for product managers as the extreme user to solve for everyone. In my experience, if you satisfy the PM’s bar for clarity, traceability, and actionability, engineers, designers, and go-to-market teams benefit immediately. That’s how you earn daily active use, not once-a-week novelty.

    For builders curious about the stack and comparables, they discuss services and tools like Assembly AI for speech-to-text, OpenAI API with prompt caching support, and build integrations with Cursor and V0 by Vercel. They also reference Granola as a comparison point and nod to ProductPlan, where both founders previously worked. If you want to try the product, here’s Earmark—a productivity suite where the work completes itself.

    If you're a PM drowning in follow-up work or a builder curious about real-time AI architectures, this conversation offers a detailed look at what it takes to ship an AI product that people can't imagine working without. Personally, I see this as a credible path toward an AI chief of staff—their vision goes beyond automating deliverables to orchestrating judgment, compliance signals, and cross-functional readiness.

    The episode covers the founder backstory, what Earmark does, comparisons to competitors, unique features, templates and personas, technical decisions, early versions and challenges, optimizing transcript summarization, managing multiple tools and costs, challenges with context and reasoning models, innovative search and retrieval techniques, creating actionable artifacts from meetings, ensuring quality and managing hallucinations, and the future vision for an AI chief of staff. It’s a full-spectrum look at building with agentic AI, not just talking about it.

    Podcast transcripts are only available to paid subscribers.


    Inspired by this post on Product Talk.


    Book a consult png image
  • From Coaching to Co‑Pilots: How AI Elevates Product Owners and Feature Teams

    From Coaching to Co‑Pilots: How AI Elevates Product Owners and Feature Teams

    After two decades of coaching product teams, I’m making a deliberate shift in how I guide leaders and practitioners. The destination hasn’t changed—great products, empowered product teams, and durable outcomes—but the route has. AI is now a practical, compounding advantage, and it demands we evolve our product coaching model.

    In my day-to-day as a VP of Product Management at HighLevel, I’ve watched AI move from novelty to necessity. Large language models, agentic AI, and streamlined AI workflows now accelerate how we discover opportunities, test hypotheses, and communicate decisions. This is not about replacing product judgment; it’s about augmenting it with a disciplined AI Strategy.

    For years, I’ve raised the alarm about the gap between execution and strategy among “product owners and feature team product managers.” The intent was never to pile on more process. It was to strengthen product discovery, sharpen product strategy, and clarify outcomes vs output OKRs so that teams ship what matters. AI finally gives us the leverage to make that shift unavoidable—and repeatable.

    Here’s the new coaching stance: treat AI as a co-pilot, not an answer engine. I coach teams to build an AI product toolbox they can trust—prompt engineering patterns, eval-driven development to measure model quality, and a retrieval-first pipeline for institutional knowledge. When combined with continuous discovery, this creates a tight loop between insight, iteration, and impact.

    Practically, this means elevating core rituals. In product trios, we start discovery with AI-assisted opportunity mapping, then pressure-test problem framing with user evidence. We generate multiple solution sketches with LLMs for product managers, annotate assumptions, and use A/B testing with a minimum detectable effect (MDE) to validate the riskiest bets. The result is faster learning without skipping the hard thinking.

    On the governance side, I set clear guardrails: privacy-by-design, data governance, AI risk management, and explicit criteria for acceptable model behavior. We treat prompts and evaluation datasets as versioned assets, and we pair product managers with forward deployed engineers to operationalize insights in production safely.

    Coaching also extends to measurement. We anchor product outcomes in the customer journey and watch leading indicators for activation, adoption, and retention. On the delivery side, we look at deployment frequency and the health of the feedback loop between support signals and roadmap choices—because empowered product teams win when they learn faster than the market shifts.

    The most profound cultural change is mindset. Instead of asking AI for answers, we ask it for alternatives, counterexamples, and structured ways to explain tradeoffs to stakeholders. That makes product positioning clearer, decision narratives stronger, and the path from insight to execution shorter.

    If you’re responsible for developing talent, reframe coaching as enablement plus guardrails. Build the AI muscle into everyday discovery and delivery, not as a side project. When we do this well, we transform good practitioners into strategic operators—people who pair judgment with leverage and consistently ship value.

    The bottom line: AI doesn’t replace the craft; it amplifies it. Our job as leaders is to harness that amplification responsibly and turn it into a durable competitive advantage.


    Inspired by this post on SVPG.


    Book a consult png image
  • How We Built Rock-Solid AI Infrastructure: Lessons From Scaling AI Visibility and Reliability

    How We Built Rock-Solid AI Infrastructure: Lessons From Scaling AI Visibility and Reliability

    Scaling AI Visibility pushed me to rethink what “reliable” really means for AI infrastructure. As my team expanded usage across more datasets, models, and workflows, we uncovered unexpected sources of report failure and built the guardrails, observability, and processes that now anchor our stability strategy.

    In practice, the surprising failure modes were rarely the loud ones. We saw report failure triggered by small schema drift from non-deterministic LLM outputs, silent permission changes in upstream data sources, token-limit truncation that broke downstream parsing, third-party API rate limits that surfaced only under bursty load, and clock skew that confused idempotent writes. Individually these issues looked minor; together they created reliability debt.

    Our first move was deep observability. We instrumented the end-to-end pipeline with structured logs, distributed tracing, and high-signal metrics mapped to SLOs and error budgets. That visibility let us separate symptom from cause, quantify impact by segment, and prioritize fixes that moved business outcomes, not just vanity thresholds. It also gave product managers and SREs a shared, real-time view to make tradeoffs explicit.

    Next, we hardened the runtime with resilience patterns: circuit breakers on flaky dependencies, timeouts tuned to p95 behavior, retries with jittered backoff, idempotent processing for at-least-once delivery, and backpressure-aware queues. We enforced schema contracts at ingestion with JSON validation and added feature flags to decouple deploys from releases, so we could roll forward or back within minutes when signals degraded.

    On the product side, we adopted eval-driven development for model and prompt changes, shifting risky modifications behind canaries and staged rollouts. CI/CD gates required evaluation baselines to hold or improve before promotion. We tracked DORA metrics to keep deployment frequency high without sacrificing change failure rate, and we used P95 latency and budget burn as the forcing functions for prioritization.

    Culture mattered as much as code. We formalized incident management with clear ownership, lightweight runbooks, and blameless reviews that produced crisp, automatable actions. We partnered early with SRE on SLO design, integrated privacy-by-design and PII scanning into the pipeline, and treated AI risk management as an ongoing product constraint rather than a checkbox.

    The net effect: fewer flaky reports, faster recovery when things do break, and far more confidence to ship improvements to AI Visibility at pace. If you’re scaling similar capabilities, start with observability, make resilience patterns non-negotiable, and let SLOs guide your product roadmap. Reliability is not a phase—it’s the product.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • AI Operating Model Masterclass: How I Scale Teams, Tech, and Governance Without Chaos

    AI Operating Model Masterclass: How I Scale Teams, Tech, and Governance Without Chaos

    When I set out to operationalize AI across a product organization, I focus on one promise: repeatable outcomes without chaos. An effective AI operating model turns experiments into an engine—aligning strategy, teams, technology, and governance so we can ship value safely and at scale.

    At its core, an AI operating model is the connective tissue between vision and delivery. I anchor it on a few pillars: clear AI Strategy, empowered cross-functional teams, a modern AI platform, rigorous AI risk management and data governance, and a cadence of eval-driven development that ties everything back to outcomes.

    Strategy comes first. I translate big ambitions into a portfolio of use cases ranked by customer impact, feasibility, and risk. I use continuous discovery to validate the problem, then frame each bet with outcomes vs output OKRs, a crisp value proposition, and a build vs buy decision. For generative AI, I encourage PMs to treat LLMs for product managers as a craft—rapid prototyping, deliberate prompt engineering, and disciplined evaluation from day one.

    Team design matters as much as models. I organize around product trios—PM, design, and engineering—augmented by data, ML, and a “forward deployed” mindset when the domain is complex. I invest in empowered product teams and communities of practice to spread patterns quickly while avoiding centralized bottlenecks.

    On the platform side, I start retrieval-first pipeline before fancy modeling. A solid foundation—feature stores, vector search, observability, and safe integration points—beats bolt-on hacks. I rely on CI/CD with feature flags, strong deployment frequency, DORA metrics, and SRE-grade reliability to keep the iteration loop tight and safe.

    Governance is non-negotiable. I implement privacy-by-design, clear data governance, audit trails, and policy controls aligned to regulatory compliance. AI risk management includes model red teaming, safety layers, and human-in-the-loop review where needed. The goal is confidence: we know what shipped, why it works, and how it fails.

    Execution rides on eval-driven development. For every AI workflow, I define offline and online test sets, target metrics, and a decision policy before launch. I A/B test with proper minimum detectable effect (MDE), layer canaries for protection, and monitor user experience and outcomes in production. This is how we turn “it seems smarter” into statistically confident improvements.

    Adoption is a product in itself. I build onboarding, in-app guides, and product tours that help users form habits quickly. I monitor activation, time-to-value, and retention analysis while partnering with customer support ai strategy to close the loop between real-world issues and roadmap priorities.

    Culture scales the system. I normalize rapid learning, shared playbooks, and personal knowledge management so insights don’t disappear into meetings or notebooks. I upskill teams on prompt engineering, context window management, and model selection, and I celebrate the humility required to refactor what “worked” yesterday.

    Operating cadence keeps it all coherent. I run an AI portfolio review tied to outcomes vs output OKRs, keep a single source of truth for evaluations, and align go-to-market strategy with release readiness. We review risks alongside results so speed never outruns safety.

    If you’re starting from scratch, I recommend a 30-60-90 approach: baseline your current state, choose two lighthouse use cases, stand up the retrieval-first pipeline and eval harness, define governance and data policies, then ship small, safe increments behind feature flags. Teach the system to learn before you make it run.

    I’ve felt the pain of brilliant prototypes that crumble in production and the thrill of AI features that compound value month after month. The difference is the operating model. Build it with intent, and you’ll scale AI with confidence—teams aligned, tech resilient, and customers seeing real outcomes.


    Inspired by this post on Product School.


    Book a consult png image
  • Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

    Trust is the currency of any high-stakes AI product, and nowhere is that more true than in healthcare. I recently dug into how Healio built an AI assistant for physicians—an audience that can’t afford to be wrong—and it’s a masterclass in balancing accuracy, transparency, and speed without compromising credibility.

    Healio, a 125-year-old medical publishing company, set out to create Healio AI to help clinicians prepare for patient care. From the outset, their guiding principle was simple: physicians won’t trust you until you prove it. That lens shaped every decision—from discovery and prototyping to architecture, evaluation, and ongoing validation.

    Discovery started with a survey of 300 healthcare professionals to understand real-world needs at the point of care. The headline insight: physicians primarily want AI for preparation, not bedside use. Even more surprising, the top ask wasn’t purely diagnostic support; it was help with patient communication and empathy—translating complex information into clear, accessible conversation.

    Momentum mattered. After beginning with Figma mockups to validate workflows, the team built a working prototype in a single weekend using Cursor. That velocity wasn’t about cutting corners; it was about proving value quickly, reducing ambiguity, and iterating with concrete feedback from physicians.

    Under the hood, the system employs RAG and hybrid search—combining lexical search, vector search, and semantic search across multiple trusted sources like PubMed. As any PM who has integrated biomedical literature knows, "just use PubMed" isn’t simple—there are five different ways to access the same data, each with trade-offs. The team made pragmatic choices to balance freshness, coverage, latency, and cost while preserving trust in source quality.

    Designing for trust extended all the way to the citation UX. The team leaned into citations that physicians actually trust: subscripts, hover states, and progressive disclosure. This gave clinicians verifiable threads back to source material without overwhelming the core interaction, aligning with how experts want to audit evidence under time pressure.

    Evaluation wasn’t left to chance. They stood up eight LLM judges for evals: safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and overall quality. Just as importantly, they treated those signals as directional, not definitive. In a high-stakes domain, physician feedback trumps LLM-as-judge feedback—so they complemented automated evals with direct reviews from practicing clinicians to calibrate quality and reduce hallucinations.

    On the safety front, the team implemented HIPAA compliance and input guardrails for masking personal health information. That choice reflects strong data governance and privacy-by-design thinking: protect PHI by default, constrain prompts to safe boundaries, and make compliance a first-class citizen in the product architecture.

    They also addressed monetization without compromising experience. Serving contextual ads while the LLM processes queries is a practical approach that preserves physician workflow efficiency and creates a clear, non-intrusive revenue model.

    Critically, the work didn’t stop at launch. The Healio Innovation Partners provide ongoing discovery and validation, ensuring the system evolves with physician needs and the medical evidence base. This is the operating cadence you want for any AI product that sits at the intersection of safety, accuracy, and fast-changing knowledge.

    My takeaways for building AI in high-stakes domains: prioritize retrieval-first pipelines over model cleverness; couple RAG with hybrid search across vetted sources; design citations that earn trust at a glance; use eval-driven development, but let domain-expert feedback be the ultimate judge; and embed regulatory compliance into your product strategy from day one. If trust is your North Star, this is a playbook worth emulating.


    Inspired by this post on Product Talk.


    Book a consult png image
  • AI Ethics That Win Trust: The Product Manager’s Playbook for Safe, Scalable Innovation

    AI Ethics That Win Trust: The Product Manager’s Playbook for Safe, Scalable Innovation

    I’ve learned that the fastest way to lose customers with AI is to ship something powerful but unpredictable. The fastest way to earn their loyalty is to ship something powerful and trustworthy. That’s the job.

    AI ethics in product management isn’t about theory anymore. It’s the line between trusted products and unpredictable ones. Here’s what PMs need to know.

    When I frame AI ethics for my team, I translate principles into practices that protect customers and accelerate velocity. We bake trust into product strategy, delivery, and operations—so ethics is not a separate checklist, but a core capability that compounds over time.

    First, I anchor the roadmap on explicit outcomes and guardrails. We set success metrics alongside ethical constraints, tying them to outcomes vs output OKRs, so teams know not only what to achieve but what to avoid. If a feature can’t meet our trust thresholds, it doesn’t ship—no matter how impressive the demo.

    Data is where trust starts. We enforce data governance from day one: clear data lineage, collection minimization, role-based access, and privacy-by-design defaults. We document lawful bases for processing, consent flows, and retention policies, then automate checks so they run with every change—not just at launch.

    On the model side, we use eval-driven development to turn subjective “looks good” into measurable quality. We design evaluations for safety, bias, robustness, and performance; we red-team prompts; and we test failure modes in realistic conditions. For LLMs, we lean on a retrieval-first pipeline to ground responses in authoritative data, and we apply context window management and prompt engineering patterns to reduce hallucinations.

    In the product experience, we make ethical choices visible. That means clear disclosures when AI is in the loop, user controls to review and correct outputs, and transparent UX writing that avoids overclaiming. In-app guides and thoughtful tooltip design help users understand capabilities and limits without friction.

    Shipping safely requires operational discipline. We build kill switches, human-in-the-loop overrides for high-risk actions, and incident playbooks that pair incident management with threat detection and response. SRE partnerships ensure observability covers both model behavior and customer impact, with rollback paths ready when drift or regressions appear.

    Governance is a team sport. I maintain an AI risk register, review it with security, legal, and product trios, and brief leadership on residual risks and mitigations. Regulatory compliance isn’t a final hurdle; it’s a design input that shapes technical choices long before code reaches production.

    Build vs buy decisions carry ethical implications too. Vendor due diligence covers model provenance, data handling, eval results, and incident history—not just feature checklists. Contracts codify SLAs, audit rights, and deletion commitments so our obligations to customers flow down the stack.

    Finally, we earn trust in public. We publish model facts, change logs, and limitations in a customer-facing trust center, and we invite feedback loops that turn real-world usage into better safeguards. Stakeholder management matters here: being candid about trade-offs often increases confidence more than chasing perfection.

    This is how I keep teams fast without being reckless: ethics as a product capability, not a poster. Build with intention, measure what matters, and make it easy for customers to understand, control, and benefit from your AI. That’s how we ship innovation that stays trusted—at scale.


    Inspired by this post on Product School.


    Book a consult png image
  • The Modern Playbook for AI Agents: Build One‑Person Departments and Scale with Amplitude

    The Modern Playbook for AI Agents: Build One‑Person Departments and Scale with Amplitude

    I’ve spent the last few years turning AI from an intriguing demo into an operational advantage, and the clearest wins come when we treat agents as productized workflows—not toys. In practice, that means aligning agentic AI to a sharp product strategy, instrumenting everything, and scaling what works across the organization.

    Learn how companies like Replit are consolidating workflows, creating one-person departments, and building systems for scale with Amplitude

    When I talk about agentic AI, I’m focused on outcomes: fewer handoffs, faster cycle times, and measurable uplift in activation, retention, and NPS. The most successful rollouts start with a specific job-to-be-done, translate it into clear AI workflows, and then iterate with a tight feedback loop between data, design, and engineering.

    My implementation playbook is simple and disciplined. First, choose a high-friction workflow and define success upfront. Second, make the build vs buy call on the foundation model, orchestration layer, and connectors. Third, establish AI risk management and safeguards early—before scale amplifies errors. Finally, run small, eval-driven releases and promote what performs.

    Instrumentation is where the leverage compounds. With Amplitude analytics as a unified analytics platform, I design purposeful events (agent intent, tool calls, resolution state, human handoff), map funnels from user input to agent outcome, and cohort users by context to pinpoint lift. This gives me an honest read on where agents help, where they hinder, and what to tune next.

    The “one-person departments” concept isn’t about doing more with less at all costs; it’s about assembling a tight loop of product management leadership, data, and automation so one operator can own a business outcome end-to-end. An agent handles the repeatable work, while the human focuses on judgment, edge cases, and continuous improvement that compounds.

    As we scale, I look for platform scalability patterns: shared tools and policies, reusable prompt libraries, standardized evaluation suites, and consistent governance. That structure keeps agent performance predictable while preserving speed, and it aligns beautifully with product-led growth when agents are embedded directly in the product experience.

    If you’re starting now, begin with a single, valuable workflow. Instrument it thoroughly with Amplitude analytics, make decisions from the data you see—not the demos you remember—and expand only after you’ve proven uplift. Iteration beats ambition here: agentic AI rewards teams who measure relentlessly and scale only what truly works.


    Inspired by this post on Amplitude – Perspectives.


    Book a consult png image
  • How We Built an AI Career Co‑pilot that Turns Knowing into Doing for Disadvantaged Students

    How We Built an AI Career Co‑pilot that Turns Knowing into Doing for Disadvantaged Students

    How do you help disadvantaged students take action on opportunities they don't even know exist? That question has been top of mind for me as I’ve explored how AI can augment—not replace—human mentorship. Recently, I dug into the work behind Zero Gravity, a UK-based platform using mentoring, community, and learning pathways to unlock elite career opportunities for state school students. Their approach reframed a core problem I care deeply about: the "knowing-doing gap."

    I sat down with Elliot Little (Product Manager) and Dan St. Paul (Software Engineer) from Zero Gravity to unpack how they’re tackling this gap with an AI career co‑pilot. They’ve intentionally positioned the system as an orchestrator, not an automation tool—bridging the space between knowing what to do and actually doing it. As a product leader, I see this as a powerful pattern for Generative AI: use AI to coordinate steps, personalize guidance, and empower action in moments where confidence and clarity are fragile.

    What resonated most was the humility of their build journey. They started with grand visions of AI mentors and synthetic avatars, then scaled back to something simpler and more effective. The first prototype—a job suitability summary—didn’t deliver the "wow moment" they expected. And they discovered that hiding the "LLM magic" backfired—students needed to feel the personalization. That insight aligns with my own experience: users must perceive the value for trust and motivation to compound.

    From a UX standpoint, the team chose text chat over voice input and leaned into guided prompts rather than empty text boxes. That decision lowered cognitive load and increased completion rates—classic product management tradeoffs that privilege momentum over novelty. In my view, this is what good AI product strategy looks like: invite action with structure, then expand autonomy as confidence grows.

    The technical backbone is equally thoughtful. Multi‑month journeys require rigorous context window management to avoid exploding token counts and degrading quality. I appreciated their pragmatic toolkit: context management techniques like removing stale tool calls, summarizing history, exposing tools conditionally. They also used application logic rather than complex RAG architectures to manage tool availability and context freshness. This is the kind of disciplined engineering that keeps systems reliable at scale without overcomplicating the stack.

    Model selection was fit‑for‑purpose, not one‑size‑fits‑all. They’re using different models for different tasks, including "GPT-5 Nano for structured outputs, lighter models for quick replies." That modularity enables speed and cost control while preserving high‑fidelity moments where structure matters most.

    Safeguarding was treated as a first‑class concern—non‑negotiable when you’re building AI for 16‑year‑olds. Their safeguarding architecture pairs moderation endpoints with external verification via Unitary. They also invested in building a failure taxonomy through internal red team/green team exercises. This is AI risk management done right: define failure modes early, test ruthlessly, and wire safety into the product surface area—not just the model layer.

    Evaluation was grounded in outcomes, not demos. The team focused on whether students progressed from insight to action: applying, interviewing, and engaging with mentors. That aligns with how I run eval‑driven development—ship narrowly, measure real behavior, and iterate toward a repeatable "wow moment" that students can actually feel.

    Looking ahead, I’m excited by what’s next: long‑term memory management for multi‑year student journeys. It’s a hard problem—balancing privacy, provenance, and portability—but it’s precisely where an AI career co‑pilot can compound value over time. The vision is compelling: a resilient companion that remembers goals, adapts to context, and orchestrates the right next step.

    If you want to dive deeper, you can listen to the full conversation on Spotify and Apple Podcasts:

    Listen to this episode on: Spotify | Apple Podcasts

    Resources mentioned:

    Zero Gravity: https://zerogravity.co.uk/

    Unitary – AI-powered content moderation: https://www.unitary.ai/

    Blue Dot Impact AI Safety Course – free AI safety course Elliot recommended: https://bluedot.org/

    My key takeaways: build AI that augments human relationships, not replaces them; don’t hide the personalization—let learners feel it; privilege application logic over unnecessary architectural complexity; and treat safety, context, and evaluation as product features, not afterthoughts. That’s how we bridge the "knowing-doing gap" with integrity and scale.


    Inspired by this post on Product Talk.


    Book a consult png image
  • PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

    PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

    I’ve sat in countless AI measurement debates and noticed a recurring gap. One major voice has been noticeably underrepresented in the AI measurement conversation: the product manager (PM) that’s leading development. From experience, PMs and developers do need different measurement tools—and making those differences explicit is exactly what speeds up decisions and improves outcomes.

    Developers optimize the model and system layer. Their toolkit centers on eval-driven development: offline evals, regression suites, red-teaming, latency and throughput monitoring, token cost tracking, and hallucination rate reduction. On the delivery side, engineering teams watch DORA metrics alongside CI/CD performance to keep iteration fast and safe. When building LLM-backed experiences, they also care deeply about retrieval-first pipeline quality and context window management because those mechanics determine grounding, relevance, and consistency.

    PMs, by contrast, own outcomes. We instrument user journeys end to end and define a clear north-star tied to value: activation, time-to-value, task success rate, retention analysis, support deflection, and revenue contribution. We rely on A/B testing frameworks and minimum detectable effect (MDE) planning to separate real impact from noise, and we consolidate behavioral signals in a unified analytics platform like Amplitude analytics and Pendo to understand adoption, friction, and cohort differences. This is the heart of product-led growth and continuous discovery: evidence, not anecdotes.

    The fact that these toolboxes differ is a strength, not a weakness. Specialized metrics keep responsibilities crisp: developers guarantee model quality and reliability; PMs guarantee that quality translates into customer and business outcomes. What we need is an explicit metrics ladder that connects layers—model-level quality floors and SLOs, feature-level KPIs, and company-level results—so trade-offs are transparent and prioritization is principled.

    In practice, I create a shared measurement contract for every AI initiative. It links eval sets to user-facing success criteria, defines acceptance thresholds, and spells out observability across the stack. We include governance from day one—AI risk management, privacy-by-design, and data governance—so we can scale responsibly without slowing teams down.

    Here’s the AI product toolbox I give my teams: start with a concise value hypothesis; define a success rubric the customer would recognize; instrument the happy path and the failure path; plan experiments with MDE up front; segment results by persona and job-to-be-done; and close the loop with qualitative feedback inside the product via in-app guides, product tours, and lightweight surveys. For AI features specifically, add Agent Analytics for agentic AI, capture grounding sources for explainability, and log model/context inputs to make debugging and iteration repeatable. That way, LLMs for product managers stop being magic and start being manageable.

    When we roll out a new assistant—whether a retrieval-augmented copilot or a voice AI agent—we set two dashboards: one for developers (eval pass rates, latency, context integrity, error budgets) and one for PMs (activation, task completion, deflection, satisfaction). The dashboards read differently by design, yet they are joined at the hip by shared definitions and experiment IDs. This lets us move quickly with confidence: engineering can tighten quality loops while product steers toward the outcome that matters most.

    If you’re feeling the tension between model metrics and product metrics, don’t collapse them—connect them. Start with a thin slice, agree on 3–5 measurable outcomes, and let your evals and A/B tests work together. With a clear metrics ladder and a unified analytics platform, PMs and developers can each excel at their craft and still ship AI that customers love.


    Inspired by this post on Pendo – Perspectives.


    Book a consult png image