Tag: context window management

Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.

When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.

To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

Governance is equally critical. I design privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews—habits borrowed from SRE culture that map well to AI systems.

On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.

Inspired by this post on Product School.

February 16, 2026
Context Engineering Playbook: 5 Proven Ways to Slash Context Rot and Scale Smarter AI

I've been getting a lot of questions about why I'm diving so deep into Claude Code, so I want to take a step back and provide some context.

Last March, when I started building my first AI product—the Interview Coach—I felt like I had to figure it all out on my own. I had never built an AI product before, and I didn't have a team I could lean on. It was equal parts energizing and intimidating.

I had a blast digging in, experimenting, and learning what I needed to learn to ship that first AI product. But I also started to wonder, "How are product teams going to learn this stuff?"

As an industry, we are being asked to leverage a new technology that is foreign to us. We are all experimenting and learning what's just now possible. It's moving so fast, it's exhausting just following the news, let alone trying to learn and develop new skills.

My mission has always been to help teams make better product decisions. That still drives me today.

After releasing the Interview Coach, I asked myself two questions: "How am I going to rapidly develop my skill set?" and "How can I help others do the same?" I landed on a three-part plan: First, I'm going to collect and share stories about how other teams are learning and building AI products—that's why I launched Just Now Possible. Second, I'm going to push the boundaries on how I can use AI in my day-to-day life, and I'm going to write about it. Third, I'm going to keep building AI products—and I'm going to write about that, too.

The Claude Code series was born out of number two. It’s had an interesting side effect: it’s also helping me build better AI products.

The more I push the boundaries of what's possible with Claude Code, the more I understand how to build more robust AI products. That’s reinforced my belief that product teams need to get hands-on with this stuff in their day-to-day lives. It’s how we’re going to develop the skillsets we need to build tomorrow’s products.

In my context rot article—where we learned how to manage the context window in Claude Code—I showed just how much day-to-day practice compounds. Today, I want to show how learning about context window management in our day-to-day lives directly maps to managing the context window in the AI products we might build. My hope is to make it crystal clear how experience in one area develops expertise in the other. Let’s dive in.

Discover how product teams engineer context in generative AI: compact prompts, curated turns, external memory, repetition, and sub-agents, all feeding a shared context window to deliver clearer, faster outcomes.

A quick refresher on context window management. In the context rot article, we learned: "what the context window is and what goes into it"; "how to offload conversational context to the file system"; "about the /compact and /clear tools"; "to repeat critical information as the context window fills up to overcome tokens "lost in the middle" or at the beginning of the input"; and "how to use agents to get access to more context windows."

It turns out these exact same skills are being used by developers to manage the context window in production products. If you haven't read the context rot article, start there: "Context Rot: Why AI Gets Worse the Longer You Talk (And How to Fix It)."

What is Context Engineering? Context engineering is the work that we do to manage the context window in the AI products and services that we build. It's how we give the large language model the context it needs to do the job well. It's also how we manage and mitigate context rot in our product and services, so that we can get the highest performance from the underlying model.

Today, we are going to look at five different strategies that product teams are currently using in their context engineering efforts. You are going to see that each of these strategies ties back to a strategy you might already be using in your day-to-day AI usage (especially if you followed the advice in the context rot article).

Here's how product teams are putting this into practice right now: designing compact system prompts by breaking big tasks into smaller tasks; building external memory/state structures to keep the context window clean; curating what goes into each turn; repeating critical information as context grows; and using sub-agents to grow the context window.

I'll connect each tactic back to patterns you're likely already using in your daily AI workflows, especially if you followed the advice in the context rot article. Along the way, I’ll share practical guardrails and instrumentation ideas so you can track quality with eval-driven development, reduce context rot, and scale performance predictably.

Why this matters for product trios: these strategies clarify the handoffs between prompt engineering, external memory design, and orchestration, which strengthens collaboration across PM, design, and engineering. Whether you’re exploring gen ai prototypes, hardening a retrieval-first pipeline, or evolving toward agentic AI, context engineering is the backbone of reliable, high-performing experiences.

If you build or lead LLMs for product managers initiatives, consider this your field guide. In upcoming posts, I’ll break down each strategy with concrete examples and templates you can adapt to your stack, so your team can move from experiments to durable, scalable AI workflows with confidence.

Inspired by this post on Product Talk.

February 11, 2026
AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcomes vs output OKRs and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.

Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics and deployment frequency to ensure we’re improving the engine, not just the output.

Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool whitelists, separate read vs write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.

Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

Plan the business rollout like a product, not a press release. I align pricing (often consumption SaaS pricing), packaging, and SLAs with actual unit economics—tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.

Inspired by this post on Product School.

February 9, 2026
Two People, Zero Waste: How Earmark’s Agentic AI Turns Meetings into Finished Work

I care about meetings only insofar as they create momentum and outcomes. What if your meetings could actually produce the artifacts you need—specs, tickets, slides—before the call even ends?

I recently listened to an episode of Just Now Possible where Teresa Torres talks with Mark Barbir (CEO) and Sanden Gocka (Co-Founder), the co-founders of Earmark, about building a productivity suite that turns unstructured conversations into finished work in real time. As a product leader, this premise hits the sweet spot of agentic AI, real-time AI workflows, and ruthless focus on outcomes over output.

Listen to this episode on: Spotify | Apple Podcasts

Unlike generic AI notetakers that produce summaries nobody reads, Earmark runs multiple agents in parallel during your meetings—translating engineering jargon, drafting product specs, even spinning up prototypes in Cursor or V0 while you're still talking. That’s the bar I want from AI in the room: finished work, not notes.

What impressed me most was the clarity of their pivot. They moved from an Apple Vision Pro presentation coaching tool to a web-based meeting assistant. I’ve made similar calls: when the distribution path and daily workflow are obvious, you follow the user’s gravity. This shift unlocked a broader surface area—PMs, engineers, design partners—and made agentic workflows useful where work actually happens.

They also turned a technical constraint into a commercial advantage. Their ephemeral (no-storage) architecture became a feature for enterprise sales. I’ve seen this repeatedly in AI risk management: privacy-by-design and clear data governance reduce friction with security reviewers and accelerate procurement. For many enterprises, “we don’t store your data” is the win condition.

Cost discipline was another standout. They tackled the hard problem of making real-time AI affordable—from $70 per meeting down to under a dollar through prompt caching. That’s not just optimization; it’s product strategy. Choices like model selection, context window management, and retrieval-first pipeline design determine whether a feature can scale to every meeting or remains a demo.

On capability design, the team leaned into templates and simulated stakeholders to ship value fast. Template-based agents: Engineering Translator, Make Me Look Smart, Acronym Explainer. Personas that simulate absent team members (security architect, legal, accessibility). This is exactly how I frame early AI workflows: remove friction for the product trio, anticipate blockers, and let the agent do the tedious, error-prone first pass.

They were refreshingly pragmatic about models. Why GPT 4.1 still beats newer models for prose quality in their use case is a reminder that “best” is contextual. When the job-to-be-done is precise prose and production-grade artifacts, consistent quality trumps leaderboard buzz. Of course, they also invest in guardrails to ensure quality and manage hallucinations—another non-negotiable for enterprise adoption.

Search and analysis across time is where many AI products stumble. They explained the limits of vector search for analysis questions across meetings and how they’re building agentic search with multiple retrieval tools (RAG, BM25, metadata queries, bespoke summaries). I couldn’t agree more: analysis requires reasoning over structure, time, and purpose—not just semantic proximity. Layered retrieval with stateful agents beats a single embedding call.

They also articulated a crisp user thesis: design for product managers as the extreme user to solve for everyone. In my experience, if you satisfy the PM’s bar for clarity, traceability, and actionability, engineers, designers, and go-to-market teams benefit immediately. That’s how you earn daily active use, not once-a-week novelty.

For builders curious about the stack and comparables, they discuss services and tools like Assembly AI for speech-to-text, OpenAI API with prompt caching support, and build integrations with Cursor and V0 by Vercel. They also reference Granola as a comparison point and nod to ProductPlan, where both founders previously worked. If you want to try the product, here’s Earmark—a productivity suite where the work completes itself.

If you're a PM drowning in follow-up work or a builder curious about real-time AI architectures, this conversation offers a detailed look at what it takes to ship an AI product that people can't imagine working without. Personally, I see this as a credible path toward an AI chief of staff—their vision goes beyond automating deliverables to orchestrating judgment, compliance signals, and cross-functional readiness.

The episode covers the founder backstory, what Earmark does, comparisons to competitors, unique features, templates and personas, technical decisions, early versions and challenges, optimizing transcript summarization, managing multiple tools and costs, challenges with context and reasoning models, innovative search and retrieval techniques, creating actionable artifacts from meetings, ensuring quality and managing hallucinations, and the future vision for an AI chief of staff. It’s a full-spectrum look at building with agentic AI, not just talking about it.

Podcast transcripts are only available to paid subscribers.

Inspired by this post on Product Talk.

February 5, 2026
Build Your Personal Operating System with Claude Code: A Playbook for Focus, Speed, Clarity

This is the year to build your personal operating system. For me, that line isn’t a slogan; it’s a commitment to eliminate context switching, compress decision cycles, and turn fragmented information into a reliable source of truth. As a product leader, I needed a system that blends judgment, data, and automation—so I built mine around Claude Code.

When I say “personal operating system,” I mean an integrated set of AI workflows, rituals, and tools that capture knowledge, structure decisions, and automate execution. It’s where product discovery meets delivery: a place to synthesize signals, prioritize with clarity, and move from insight to action without friction. The outcome is fewer ad hoc decisions, more deliberate strategy, and a calmer, more focused day.

Claude Code sits at the center because it helps me translate intent into working software and repeatable processes. I use it to scaffold small utilities, write adapters for APIs, and evolve prompts into robust patterns. It accelerates everything from research synthesis and PRD drafting to backlog grooming and stakeholder updates—while keeping me in the loop for final judgment.

Under the hood, I run a retrieval-first pipeline that connects notes, docs, tickets, research transcripts, and roadmaps into a searchable, living memory. With careful context window management, I feed only the most relevant snippets into Claude Code, preserving accuracy and speed. The result: richer answers, fewer hallucinations, and an assistant that “remembers” what matters without drowning in noise.

My daily loop is simple: capture, synthesize, decide, and act. I capture customer signals and meeting notes into a personal knowledge management vault; synthesize patterns with prompt engineering that emphasizes evidence; decide using outcomes vs output OKRs; and act by generating drafts, creating tasks, and updating artifacts. Claude Code helps me wire this end-to-end, so the system works even on my busiest days.

If you’re implementing this from scratch, start small. Pick one high-friction workflow—say, product feedback triage—and build a narrow agentic AI flow to classify, summarize, and route items. Use eval-driven development to test prompts against known edge cases. Add guardrails and privacy-by-design practices from day one, then expand to neighboring workflows once the first loop is reliable.

Governance matters. I treat AI risk management, data governance, and security as first-class citizens: limited data scopes, clear audit trails, human-in-the-loop approvals, and rollback plans. Feature flags control changes; observability tracks drift and quality; and a simple playbook documents how we deploy, monitor, and improve the system.

Measure what this personal operating system earns you. Track decision latency, cycle time from signal to action, meeting-to-output ratios, and the signal-to-noise ratio of inputs. When the system is working, you’ll feel it: fewer meetings, more momentum, and sharper product strategy supported by trustworthy AI workflows.

The goal isn’t to automate judgment—it’s to protect it. By letting Claude Code handle the glue work and information wrangling, I preserve energy for high-leverage thinking: positioning, sequencing, and trade-offs. Build your personal operating system now, and make this the year your product practice runs with clarity and composure.

Inspired by this post on Pendo – Best Practices.

February 3, 2026
How We Built an AI Career Co‑pilot that Turns Knowing into Doing for Disadvantaged Students

How do you help disadvantaged students take action on opportunities they don't even know exist? That question has been top of mind for me as I’ve explored how AI can augment—not replace—human mentorship. Recently, I dug into the work behind Zero Gravity, a UK-based platform using mentoring, community, and learning pathways to unlock elite career opportunities for state school students. Their approach reframed a core problem I care deeply about: the "knowing-doing gap."

I sat down with Elliot Little (Product Manager) and Dan St. Paul (Software Engineer) from Zero Gravity to unpack how they’re tackling this gap with an AI career co‑pilot. They’ve intentionally positioned the system as an orchestrator, not an automation tool—bridging the space between knowing what to do and actually doing it. As a product leader, I see this as a powerful pattern for Generative AI: use AI to coordinate steps, personalize guidance, and empower action in moments where confidence and clarity are fragile.

What resonated most was the humility of their build journey. They started with grand visions of AI mentors and synthetic avatars, then scaled back to something simpler and more effective. The first prototype—a job suitability summary—didn’t deliver the "wow moment" they expected. And they discovered that hiding the "LLM magic" backfired—students needed to feel the personalization. That insight aligns with my own experience: users must perceive the value for trust and motivation to compound.

From a UX standpoint, the team chose text chat over voice input and leaned into guided prompts rather than empty text boxes. That decision lowered cognitive load and increased completion rates—classic product management tradeoffs that privilege momentum over novelty. In my view, this is what good AI product strategy looks like: invite action with structure, then expand autonomy as confidence grows.

The technical backbone is equally thoughtful. Multi‑month journeys require rigorous context window management to avoid exploding token counts and degrading quality. I appreciated their pragmatic toolkit: context management techniques like removing stale tool calls, summarizing history, exposing tools conditionally. They also used application logic rather than complex RAG architectures to manage tool availability and context freshness. This is the kind of disciplined engineering that keeps systems reliable at scale without overcomplicating the stack.

Model selection was fit‑for‑purpose, not one‑size‑fits‑all. They’re using different models for different tasks, including "GPT-5 Nano for structured outputs, lighter models for quick replies." That modularity enables speed and cost control while preserving high‑fidelity moments where structure matters most.

Safeguarding was treated as a first‑class concern—non‑negotiable when you’re building AI for 16‑year‑olds. Their safeguarding architecture pairs moderation endpoints with external verification via Unitary. They also invested in building a failure taxonomy through internal red team/green team exercises. This is AI risk management done right: define failure modes early, test ruthlessly, and wire safety into the product surface area—not just the model layer.

Evaluation was grounded in outcomes, not demos. The team focused on whether students progressed from insight to action: applying, interviewing, and engaging with mentors. That aligns with how I run eval‑driven development—ship narrowly, measure real behavior, and iterate toward a repeatable "wow moment" that students can actually feel.

Looking ahead, I’m excited by what’s next: long‑term memory management for multi‑year student journeys. It’s a hard problem—balancing privacy, provenance, and portability—but it’s precisely where an AI career co‑pilot can compound value over time. The vision is compelling: a resilient companion that remembers goals, adapts to context, and orchestrates the right next step.

If you want to dive deeper, you can listen to the full conversation on Spotify and Apple Podcasts:

Listen to this episode on: Spotify | Apple Podcasts

Resources mentioned:

Zero Gravity: https://zerogravity.co.uk/

Unitary – AI-powered content moderation: https://www.unitary.ai/

Blue Dot Impact AI Safety Course – free AI safety course Elliot recommended: https://bluedot.org/

My key takeaways: build AI that augments human relationships, not replaces them; don’t hide the personalization—let learners feel it; privilege application logic over unnecessary architectural complexity; and treat safety, context, and evaluation as product features, not afterthoughts. That’s how we bridge the "knowing-doing gap" with integrity and scale.

Inspired by this post on Product Talk.

January 8, 2026
AI Context Pulling Playbook: How I Get LLMs and Teams to Collaborate for Better Product Outcomes

In my role leading product, I’ve learned that the fastest path to higher-quality deliverables from large language models (LLMs) is not a clever prompt—it’s rigorous context. I call the practice AI context pulling: a repeatable way to assemble, compress, and structure the most relevant knowledge before the model ever starts generating. Done well, it turns generative AI into a dependable partner for discovery, prioritization, and execution.
AI context pulling means I proactively gather the right artifacts (customer insights, analytics, strategy, constraints), manage context windows intentionally, and shape the model’s task with clear objectives and guardrails. This reduces hallucinations, improves alignment, and creates traceability back to sources—critical for product management leadership and stakeholder trust.
Learn a new way in which product professionals can collaborate with AI to get even better results on their projects.
Here’s the simple flow I use: first, I define the intent (e.g., “synthesize discovery interviews for a positioning brief”). Next, I inventory relevant context: top customer pains from product discovery, usage patterns from Amplitude analytics, recent support trends from Intercom, and any constraints from our product strategy. Then I run a retrieval-first pipeline to select only the most pertinent slices—favoring recency, representativeness, and canonical sources.
Because context window management matters, I compress long documents into short, source-cited summaries and keep raw excerpts handy when nuance is important. My prompts follow a consistent structure: role and objective, constraints and audience, curated context, the explicit ask, preferred output format, and a brief self-check (e.g., “cite sources and flag uncertainty”). This is prompt engineering for reliability, not theatrics.
A quick example: when drafting a one-page feature brief, I attach three items—the product strategy paragraph that sets the frame, a usage cohort analysis that highlights who’s affected, and five verbatim customer quotes. I ask the LLM to propose a problem statement, success criteria, and a shortlist of solution hypotheses, each tied to a cited piece of evidence. The result is a grounded, decision-ready artifact I can share with product trios and stakeholders.
Tooling-wise, I keep it pragmatic. A lightweight retrieval-first pipeline (embeddings, metadata filters, and recency rules) ensures the LLM pulls what matters. I version prompts and contexts together so I can run quick A/B testing on output quality. And I log decisions and sources to support eval-driven development and continuous discovery.
Common pitfalls are avoidable. Too little context yields generic answers; too much overwhelms the model. Stale docs can mislead; curate aggressively. Vague asks invite fluffy prose; specify outcomes, audiences, and formats. If the task is high risk, I bias toward smaller, well-cited outputs and expand iteratively with human review in the loop.
To measure impact, I track rework rate, review time, and stakeholder alignment on first pass. Over time, teams adopting AI context pulling report clearer artifacts, faster synthesis cycles, and more confident decisions—because every recommendation traces back to evidence. That’s how humans and LLMs truly collaborate better: we provide the right context, and the model amplifies our judgment.
If you’re ready to operationalize this, start by templatizing your most common product workflows—discovery synthesis, roadmap rationale, and release notes—and attach small, high-signal context packs. With a retrieval-first mindset and disciplined prompting, AI becomes an extension of your product craft, not a gamble.

Inspired by this post on Pendo – Perspectives.

January 4, 2026
Master Burger Prompting: Build a High-Impact AI Resume Coach with Proven LLM Structure

I’ve been refining a hands-on approach to “burger prompting” that turns prompt engineering into a reliable, repeatable system. Using an AI resume coach as the proving ground, I’ll walk through a detailed prompt structure to get the most out of your LLM and share what’s worked for me in product environments where clarity, consistency, and measurable outcomes matter.

At a high level, burger prompting follows a simple mental model: the top bun frames the role and mission, the fillings pack in context and examples, and the bottom bun locks in output format and quality guardrails. It’s deceptively simple and extremely effective for Generative AI use cases where you need predictable behavior across different inputs and user personas.

For the top bun, I establish the AI’s role, audience, and objective in one place. In the resume coach flow, I define the assistant as a structured, unbiased reviewer tasked with aligning a candidate’s resume to a specific job description. I set constraints on tone (supportive but direct), scope (resume and job description only), and safety (avoid speculative claims, defer legal or medical advice). This crisp intent statement reduces ambiguity and prevents the model from wandering outside the product’s value proposition.

The fillings are where context window management becomes crucial. I inject the job description, the candidate’s resume, a capability rubric aligned to the role, and the company’s style preferences. If the content is long, I chunk inputs and, when needed, use a retrieval-first pipeline to fetch only the most relevant snippets. I also include a brief style guide with voice, depth, and formatting expectations so the AI doesn’t drift between terse and verbose responses across sessions.

Strong examples are the meat of the burger. I include a few annotated comparisons that show what “excellent,” “good,” and “needs improvement” look like for specific competencies, from impact statements to quantification. These examples are compact and domain-specific, so the LLM sees the pattern I expect without overfitting to a single profile. I encourage transparent reasoning by asking for stepwise evaluations that reference evidence from the resume and job description, while keeping the explanations concise and user-friendly.

The bottom bun finalizes structure and guardrails. I specify an output schema that always returns a brief summary, evidence-backed strengths, concrete gaps with examples of what’s missing, and a prioritized action plan with suggested rewrites. I also request a rubric-aligned score to support eval-driven development, and I cap length to ensure scannability inside product UI. This predictable format reduces downstream parsing errors and keeps the AI workflow snappy.

To operationalize this in a product context, I run small A/B tests on the prompt variants and measure utility through user activation and completion rates. I tune the prompt with tight feedback loops, comparing structured scores against human spot checks until the variance narrows. When I see drift, I adjust the constraints, swap underperforming examples, or expand the rubric to capture overlooked signals.

Quality and trust are non-negotiable. I add guidance to avoid hallucinated credentials or inflated claims, enforce privacy-by-design around sensitive data, and encourage the assistant to cite which resume lines support each recommendation. When the model is uncertain or the resume lacks evidence, the assistant should explicitly say so and propose realistic next steps rather than guessing.

The result is an AI resume coach that feels both helpful and disciplined. With burger prompting, you get a durable prompt pattern you can reuse across adjacent AI workflows, from portfolio reviews to job description rewrites. Once you internalize the top bun, fillings, and bottom bun, you’ll find it far easier to ship prompts that scale, maintain consistency across releases, and deliver tangible, career-advancing outcomes for users.

Inspired by this post on Pendo – Best Practices.

January 4, 2026
Unlock Product Insights Fast: Connect MCP and Pendo to Claude, ChatGPT, and Cursor

I’ve spent the last year pushing our AI Strategy from slideware to shipped value, and one pattern keeps winning in real-world product teams: connecting agentic AI directly to trustworthy product analytics. That connection is where Model Context Protocol shines—safely bridging LLMs with the tools and data product managers rely on every day.
Model Context Protocol (MCP) gives AI agents access to your business data. Learn how MCP works, how product managers are using it, and how to connect Pendo’s MCP server to Claude, ChatGPT, or Cursor for instant product insights.
In practice, I treat MCP as a clean, auditable interface between LLMs and enterprise systems—decoupling the model choice from the data plane and enabling a retrieval-first pipeline with strong data governance. Because MCP standardizes the way agents discover resources and tools, it simplifies context window management, enforces least-privilege access, and makes it easier to evolve our stack without rewriting prompts or fragile glue code.
For product leaders, the immediate payoff is speed to insight. Instead of hopping across dashboards, I ask the agent questions in natural language—“Which onboarding step drives the biggest drop-off by segment?”—and get synthesized answers backed by traceable queries. That shift turns AI workflows into a daily habit, improving continuous discovery and accelerating product-led growth while maintaining privacy-by-design controls.
Under the hood, I think about MCP in four layers: resources (read-only data surfaces such as feature usage or retention cohorts), tools (safe operations like creating a note, exporting a segment, or proposing an in-app guide), prompts (task-scoped instructions tuned for LLMs for product managers), and observability (logs and evaluations). This structure keeps eval-driven development front and center and reduces operational risk.
Here’s how I connect Pendo analytics through MCP to my preferred assistants without compromising security or accuracy:
1) Prepare access: confirm your Pendo MCP server endpoint, authentication method, and scopes; apply least-privilege and redact any PII not required for analysis.
2) Register the server: in Claude, ChatGPT, or Cursor, add the MCP server with the provided URL and API key or token, then enable only the resources and tools your use case demands.
3) Validate the contract: prompt the agent to list available resources and describe tools; run harmless dry runs (e.g., “summarize top feature adoption trends last 30 days”) to confirm the interface behaves as expected.
4) Operationalize: standardize prompts for recurring analyses (QBRs vs OKRs, activation funnels, retention analysis), set guardrails, and log every interaction for audit. This is where prompt engineering meets governance.
5) Iterate with metrics: track answer quality, latency, and usage; expand scopes gradually and gate new tools behind human-in-the-loop until you reach reliable performance.
Once configured, I use the agent to surface weekly activation insights, identify outlier cohorts, and auto-draft product discovery notes with links back to Pendo reports. The result isn’t magic; it’s a disciplined AI product toolbox that brings the right context to the right question, fast.
If you’re starting from zero, pilot with one high-value question, one team, and one assistant. Keep the footprint small, measure outcomes, and then scale—with security, compliance, and stakeholder management baked in from day one. That’s how you turn MCP from an interesting protocol into a durable competitive advantage.

Inspired by this post on Pendo – Best Practices.

January 3, 2026
Master Burger Prompting: Build a High-Impact AI Resume Coach with Proven LLM Structures

I turned the playful idea of “burger prompting” into a rigorous framework for building an AI resume coach that delivers consistent, high‑quality guidance. In product management, repeatability matters: I want dependable LLM behavior, tight control of outputs, and measurable outcomes. This approach gives me exactly that—clear roles, crisp constraints, and an evaluation loop that raises the quality bar with each iteration.

Here’s the metaphor in practice. The top bun sets the role and goal; the middle layers stack context, examples, constraints, and tools; the patty is the core algorithm and output schema; and the bottom bun locks in the quality bar and follow-up behavior. When I apply this structure to an AI resume coach, I get results that feel expert, empathetic, and actionable—without rewriting the prompt every time.

Top bun: I define the system role and success criteria. I’ll say, “Act as an experienced hiring manager and resume coach for SaaS product roles” and specify the north star: improve clarity, impact, and ATS alignment without fabricating experience. I also name the audience (mid-career PMs, early-career candidates, or executives) so tone and calibration stay consistent across sessions.

First layer: I load precise context. That includes the candidate’s resume, the target job description, and any constraints (for example: keep bullets under 22 words, lead with impact, quantify outcomes). I also clarify non-goals (no inflated titles, no unverifiable claims). This is where I set the voice: confident, concise, and supportive, not generic or robotic.

Second layer: I attach the tools and references that anchor outputs. A skill taxonomy for product roles, a style guide for resume bullets, and a scoring rubric (impact, clarity, relevance, keyword coverage) help the model prioritize. To protect quality, I call out context window management rules—what to include or trim—and how to summarize long inputs without losing signal.

Third layer: I add exemplars. Few-shot examples of excellent resume bullets (“before” and “after”) teach the model what “great” looks like. I also include a counterexample or two to prevent bad habits (for instance, over-indexing on buzzwords). Exemplars act like taste buds; they steer nuance without overfitting.

Patty: I define the core algorithm and the output schema. The algorithm moves in stages: diagnose the resume against the job, identify 3–5 high-leverage improvements, rewrite bullets with quantified outcomes, and propose a summary that highlights relevant wins. I then specify the output sections: a brief diagnosis, rewritten bullets mapped to the job’s requirements, an ATS keyword coverage table, and a confidence score with rationale. A tight schema produces consistent, scannable outputs that are easy to evaluate—and easy to ship.

Bottom bun: I lock in the quality bar and the follow-up behavior. If inputs are incomplete, the coach must ask clarifying questions before rewriting. If claims lack evidence, it should suggest proof points (metrics, scope, stakeholders) rather than embellish. Finally, I require a self-check pass where the coach verifies that each bullet demonstrates impact, relevance, and clarity before presenting the final result.

Implementation blueprint: I create a reusable prompt template with clear system and user sections, then parameterize it for different roles (PM, design, data). If I have a library of style guides or skill matrices, I wire it into a retrieval layer so the model references the right material for each job. This setup makes the coach portable across tools and easy to maintain as the taxonomy evolves.

Evaluation and iteration: I practice eval-driven development. I assemble a small, representative test set of resumes and job descriptions, define acceptance criteria (readability score, keyword coverage, human rater alignment), and A/B test prompt variants. I track drift and tighten the schema whenever outputs start to meander. The goal isn’t just impressive demos—it’s reliable performance at scale.

Governance guardrails: A trustworthy resume coach respects privacy-by-design. I strip PII where possible, avoid storing raw resumes beyond what’s necessary, and document bias checks so advice doesn’t disadvantage non-traditional candidates. Clear data governance and risk management keep the product shippable and compliant as it grows.

When I apply burger prompting end to end, the AI resume coach becomes a repeatable system: fast, accurate, and measurably helpful. The structure teaches the model how to behave; the evals keep it honest; and the schema makes the result easy to review, refine, and ship. If you want dependable LLM outcomes, start with a great bun—and don’t skimp on the patty.

Inspired by this post on Pendo – Best Practices.

December 19, 2025
Stop Tuning Prompts: How Context Engineering 10x’d Accuracy and Adoption in Our AI Platform

"The best AI products improve more through context engineering than prompt tinkering." I’ve seen this play out repeatedly in high-stakes, enterprise use cases: substantive gains come from how we curate, structure, and deliver context to models—not from wordsmithing. When we started treating context as a product surface, performance climbed, hallucinations dropped, and teams shipped with more confidence.

Here are four key decisions we made to improve our AI context.

First, we moved to a retrieval-first pipeline. We unified trusted sources—CRM records, support knowledge bases, product telemetry, and governance-approved docs—behind hybrid retrieval (semantic + keyword) with strong metadata ranking. This let us constrain generations to verifiable facts, apply privacy-by-design rules at the edge, and practice disciplined context window management so every token carried its weight. Freshness policies, source-level confidence scores, and lightweight schemas kept the system precise and auditable.

Second, we made eval-driven development non-negotiable. Every change to context assembly goes through offline evals and online A/B testing with clear acceptance thresholds (e.g., task success, groundedness, time-to-first-answer, and deflection rate). We sized tests with minimum detectable effect (MDE) and tied them to outcomes vs output OKRs so we weren’t just shipping more prompts—we were shipping measurable improvements that mattered to customers.

Third, we personalized context based on intent and role. We built AI workflows that detect user intent, segment by persona, and dynamically assemble context: recent account activity for customer success, policy-safe excerpts for finance, and fine-grained reasoning chains for product teams. For conversational and voice AI agent experiences, we combined short-term conversation memory with scoped, long-term account memory to preserve relevance without bloating the prompt. This agentic AI pattern ensured faster, safer, and more helpful responses.

Fourth, we operationalized context as a first-class platform capability. We invested in data governance (ownership, lineage, and redaction), instrumentation (Amplitude analytics for usage, retrieval hit rates, and failure modes), and CI/CD guardrails for context updates. Product trios partnered with SRE to monitor drift, while side-by-side comparisons and human-in-the-loop reviews turned frontline feedback into structured improvements. The result: a durable system that improves continuously instead of relying on one-off prompt tweaks.

Context engineering isn’t glamorous, but it compounds. By prioritizing retrieval-first design, rigorous evaluation, intent-aware assembly, and operational excellence, we transformed our AI features into dependable, enterprise-ready capabilities. If you’re serious about LLMs for product managers and sustainable AI Strategy, shift your energy from clever prompts to robust context—and watch adoption and trust follow.

Inspired by this post on Amplitude – Perspectives.

December 16, 2025
Beyond Accuracy: The Trust-First Evaluation Metrics I Use to Scale High-Impact AI Products

When I assess whether an AI product is ready for prime time, I start with trust—not model accuracy. Accuracy is table stakes; trust is what earns adoption, drives retention, and unlocks durable product-led growth.

Evaluation metrics in AI products go beyond accuracy. Learn how product teams use trust-driven metrics to build reliable, growth-driving AI systems.

In practice, I organize trust-driven metrics into four layers: model quality and safety, user and business outcomes, operational reliability and cost, and governance and compliance. This layered approach keeps product trios aligned on what matters now, what must be gated in CI/CD, and what signals we’ll use to prove progress against outcomes vs output OKRs.

On model quality and safety, I care about precision, recall, F1, calibration, and abstention behavior, but also the hard-to-fake signals: hallucination rate, grounding and faithfulness, citation coverage, toxicity, bias, and fairness. For generative systems, I instrument refusal correctness (declining unsafe requests) and evidence adequacy (did the answer rely on retrieved, trustworthy sources).

User and business outcomes must be explicit. I track adoption, activation, task success rate, time to first value, win rate uplift in assisted workflows, CSAT and NPS deltas, and retention analysis by cohort exposed to AI features. For customer support scenarios, deflection rate, average handle time change, and first-contact resolution are core; for sales or ops copilots, I monitor cycle-time reduction and error-rate reduction in critical tasks.

Experimentation is non-negotiable. I design A/B testing with a clear minimum detectable effect (MDE), pre-registered guardrails for safety and quality, and sequential tests that stop early if harm outpaces benefit. Online metrics are always paired with offline evals so we can iterate quickly without exposing users to regressions.

Operationally, trust shows up as speed, stability, and cost predictability. I track latency end-to-end, time to first token, throughput, rate of 5xx and timeouts, cost per request, and caching effectiveness. We also trend safety incidents per 10,000 interactions and mean time to mitigation to keep reliability visible alongside performance.

Governance and compliance are part of the product, not an afterthought. Data governance and privacy-by-design metrics include PII exposure rate, data lineage coverage, access-control correctness, audit pass rate against internal policies, and model and prompt change traceability. This is the backbone of our AI risk management posture and accelerates regulatory compliance reviews instead of slowing them down.

The delivery engine for all of this is eval-driven development. We maintain golden datasets and scenario-based test suites that mirror real user intents, gate releases in CI/CD with minimum thresholds, and run canary rollouts to validate offline–online alignment. Every model or prompt update gets a comparable scorecard so product, engineering, and design can trade off quality, speed, and cost with shared facts.

For LLM-heavy features, retrieval-first pipeline metrics are mandatory. I monitor retrieval hit rate, recall at K, mean reciprocal rank, context contamination, and citation correctness. With large prompts, context window management matters: we track context utilization, truncation rate, and the contribution of each context block to final answers to avoid silently losing critical evidence.

Finally, trust must be legible. I package these metrics into an executive scorecard that maps to business outcomes, risk appetite, and OKRs, with clear thresholds for ship, improve, or roll back. When teams can articulate trade-offs—say, a 20% latency reduction at a small cost increase, or a lower hallucination rate at the expense of higher abstention—they build credibility with stakeholders and confidence with customers.

Trust is not a single number; it’s a system of evidence. By instrumenting these layers and operationalizing AI Strategy with rigorous, transparent metrics, we can ship faster, reduce surprises, and earn the right to scale AI features across the product portfolio.

Inspired by this post on Product School.

December 8, 2025