Tag: eval-driven development

AI Product Validation: From Promising Demo to Proven Value

You have an AI demo that looks impressive. It answers the happy-path prompt, the latency seems acceptable, and stakeholders can already imagine the launch. The uncomfortable question is whether any of that proves the product is worth building.

It does not. A useful validation process has to reduce several different risks: whether customers care, whether the workflow helps them, whether the AI performs reliably, whether the economics work, and whether failures remain tolerable. Test those risks in that order and you can make a defensible investment decision without turning production traffic into your debugging environment.

Define the decision before you design the AI

The first artifact for an AI initiative should not be a model shortlist or a prototype. It should be a decision contract that states what must become true for the initiative to deserve more investment.

A practical decision statement has this shape: For a defined user in a defined situation, the proposed capability will improve an observable outcome relative to the current alternative, without breaching named guardrails. If the agreed threshold is met, you will advance. If it is not, you will stop or change a specific assumption.

Write down these five elements before the experiment begins:

User and job: Name who encounters the problem, when it occurs, and what they are trying to complete. A broad label such as knowledge workers is not precise enough to design a useful test.
Current alternative: Record what the user does now, including manual work, an existing product flow, a rules engine, or simply tolerating the problem. This is the baseline the AI must beat.
Observable outcome: Choose a user or business result, not a model activity. Task completion, time-to-value, corrected routing, rework, repeat use, or downstream resolution can carry more meaning than generations or prompt volume.
Success threshold and guardrails: Decide how much improvement would justify the cost and what must not deteriorate. Safety failures, latency, privacy exposure, retention, and cost per successful outcome can all constrain an otherwise positive result.
Decision rule: State what evidence will trigger expansion, another iteration, a change in direction, or cancellation. Precommitting prevents enthusiasm for a polished demo from moving the goalposts later.

The threshold is not universal. It should reflect the value of the outcome, the implementation and operating costs, the consequences of errors, and the return available from competing roadmap work. Minimum detectable effect belongs here: define the smallest improvement that would actually change your decision, then size the test to detect that effect. A test that cannot distinguish a worthwhile gain from noise is not a faster test. It is a delayed decision.

A driver tree helps prevent a common measurement mistake. Start with the desired outcome, connect it to the user behaviors that could produce that outcome, and then connect those behaviors to system-level drivers. For an AI support-triage capability, the outcome might be faster correct routing. Accepted category and priority suggestions are leading signals; downstream corrections, reassignment, and resolution are closer to the outcome. Model classification accuracy matters, but it is only one driver in the chain.

If the proposal involves an autonomous or semi-autonomous agent, run a precondition check before planning the experiment. Volume, instructions, tolerance, access, and a learning loop expose whether agentic complexity is justified:

Volume: Does the workflow happen often enough for automation to create meaningful leverage?
Instructions: Can success, constraints, and exceptions be expressed in testable terms?
Tolerance: Is the likely failure reversible, detectable, and contained?
Access: Can the system use the necessary data and tools with reliable integrations and least-privilege permissions?
Learning loop: Can you measure quality, latency, cost, and failures after launch?

A missing condition tells you what to validate next. Unclear instructions call for more discovery and rubric design. Weak access calls for an integration or data-quality spike. Low error tolerance calls for approvals and a narrower action space. Low volume may mean that a clear workflow, a rule, or better product UX is the better answer. The purpose of validation is not to prove that AI belongs in the solution; it is to discover whether it does.

Climb an evidence ladder instead of jumping to a pilot

An oversized pilot often mixes market, usability, model, integration, and operational risk into one expensive test. When the result disappoints, nobody knows which assumption failed. An evidence ladder gives each experiment one dominant question and increases fidelity only after the previous uncertainty has been reduced.

Question to answer	Lean experiment	Evidence to inspect	What it does not prove
Do users care enough to act?	Painted door, landing page, waitlist, concierge offer, preorder, or deposit where appropriate	Click-through intent, qualified sign-ups, willingness to pay, and continued requests	Usability, AI quality, or scalable delivery
Can the proposed workflow help?	Wizard-of-Oz flow or realistic interactive prototype	Task completion, time on task, errors, material friction, and repeat use	Whether an AI system can deliver the experience reliably
Can the system perform the job?	Offline evaluation on a curated golden set plus targeted technical spikes	Rubric results by case type, failure patterns, latency, and cost	Whether the complete product changes user behavior
Does the product improve the target outcome?	Feature-flagged A/B test or holdout	Primary outcome, leading indicators, cohort effects, and guardrails	Long-term stability under every operating condition
Can it operate within acceptable risk?	Capped rollout with approvals, audit logs, monitoring, and rollback controls	Harm and privacy events, reversals, escalations, reliability, and cost per successful outcome	That future changes will remain safe without continued evaluation

Use the first row when demand is the dominant risk. A painted-door click is a signal of curiosity, not proof of durable value. A qualified sign-up asks for more commitment. A preorder or deposit, when honest and operationally appropriate, tests willingness to pay. Repeated use of a manually delivered service provides stronger behavioral evidence. Do not collapse these signals into a single conversion metric; they represent different levels of commitment.

Once demand appears credible, use a prototype or Wizard-of-Oz flow to learn whether the proposed interaction helps. Pretotyping should answer whether the product deserves to exist, while prototyping should answer how it needs to work. Keeping those questions separate prevents a polished interface from disguising weak demand and prevents a crude early interface from killing a valuable idea before its workflow has been understood.

These experiments still owe users honest expectations. A painted door should reveal that the capability is unavailable after the user expresses interest, rather than pretending it already exists. A concierge or Wizard-of-Oz flow should be explicit about how data will be handled and what follow-up the participant can expect. Deception can manufacture a metric while damaging the trust the eventual product will need.

Advance when the evidence changes the dominant uncertainty. Strong demand does not authorize a production launch; it authorizes a workflow test. A usable workflow authorizes a system evaluation. An offline pass authorizes limited exposure. Each rung earns the next investment without pretending to answer questions it was not designed to answer.

Separate model quality from product value

A model can produce better answers while the product creates less value. Added latency can interrupt the workflow. A retrieval failure can ground an otherwise capable model in the wrong context. A user may spend more time checking and rewriting an answer than doing the task manually. This is why a single accuracy score cannot validate an AI product.

Build a golden set from the work users actually do

Eval-driven development starts before production traffic. Build a curated set of cases that reflects real user complexity, then turn your definition of good into a reproducible scoring process.

Define the evaluation unit: Score the completed job whenever possible, not merely an isolated response. An agent that drafts a correct message but sends it to the wrong destination has failed the job.
Represent meaningful variation: Include normal cases, longer and shorter inputs, ambiguous requests, important customer segments, and known edge conditions. A convenience sample of clean happy paths measures demo readiness.
Tag each slice: Label cases by intent, complexity, risk, input type, or other distinctions that could conceal a concentrated failure. Aggregate performance can improve while a critical slice gets worse.
Write a multidimensional rubric: Score correctness, completeness, groundedness, safety, tone, policy compliance, and any task-specific requirements separately. Add latency and cost as system measures rather than blending everything into an opaque average.
Choose a real baseline: Compare the candidate with the current product, manual workflow, rules-based approach, or incumbent model. The relevant question is not whether the candidate looks capable in isolation; it is whether switching produces enough value.
Preserve regression evidence: Keep a stable set for comparisons and add newly discovered failures to an evolving challenge set. This turns production learning into protection against recurrence.

Keep the measurement layers visible in every readout:

Output quality: correctness, completeness, groundedness, tone, safety, and compliance.
System performance: retrieval quality, tool execution, policy enforcement, latency, reliability, and cost.
User outcome: task completion, time-to-value, edits, rejection, rework, escalation, and repeat use.
Business consequence: the downstream result the initiative was funded to improve, along with retention or other core guardrails where relevant.

Each layer diagnoses a different problem. If output quality is weak, work on context, prompts, retrieval, tools, policies, or the model. If output quality passes but completion does not improve, inspect the interaction and workflow. If users succeed but the cost per successful outcome is unacceptable, narrow the use case or revisit the architecture. A composite score can hide these distinctions at exactly the moment you need them.

Test the behavior distribution, not a lucky response

AI output is variable, so a candidate should not pass because one run happened to look good. Use two evaluation modes. A regression configuration should be as controlled as the system allows, with model, prompt, retrieval, tool, temperature, top-p, and seed settings documented where they apply. A production-like configuration should match the variability users will experience and repeat cases often enough to reveal unstable behavior and tail failures.

Run candidate and baseline systems on the same cases under comparable settings.
Inspect results by slice and failure type, not only the overall average.
Repeat stochastic cases so the team sees consistency, variance, and severe outliers.
Automate clear rubric checks, but retain human review for ambiguous or high-consequence judgments.
Version the model, prompt, retrieval configuration, tools, policies, and evaluation set so a change can be reproduced.

This creates a release gate instead of a demo contest. Offline evaluation will not prove market value, but it can prevent known regressions, unsafe behavior, and obviously weak variants from consuming customer trust in a live experiment.

Make the production test answer a business decision

Production exposure is justified when demand, workflow, and offline performance have enough evidence behind them. The live test should then answer a narrow causal question: does access to this capability improve the intended outcome for the eligible population, compared with the current experience, without violating the operating constraints?

Instrument the complete causal chain

Your event schema should connect eligibility to exposure, interaction, system behavior, task completion, and downstream consequences. At minimum, distinguish these moments:

The user or account became eligible for the test.
The treatment was actually shown or made available.
The capability was invoked, whether explicitly or automatically.
The system succeeded, failed, timed out, or triggered a safeguard.
The output was displayed, accepted, edited, rejected, reversed, or escalated.
The target task was completed or abandoned.
The downstream outcome occurred, such as a correction, reassignment, reopening, or successful resolution for a support workflow.

Attach the cohort and the relevant model, prompt, retrieval, tool, and policy versions to the trace. Capture latency, cost, and safety results without indiscriminately logging sensitive payloads. Privacy-by-design and data governance determine which data may be retained, who may inspect it, and how long it should remain available.

Missing links create predictable misreadings. Without an exposure event, low adoption can be confused with low visibility. Without version information, a regression cannot be tied to a system change. Without the downstream event, acceptance can be mistaken for value even when users later undo the AI’s work.

Choose the design and sample around the decision

Randomization: Choose user, account, workflow, or time window based on where contamination can occur. If people in one account share outputs, user-level assignment may mix treatment and control experiences.
Population: Define eligibility before launch. Balance or stratify meaningful groups such as new accounts and power users when their behavior or exposure differs.
Primary metric: Select one outcome that can settle the main question. Treat diagnostic measures as supporting evidence, not a menu from which to pick a winner later.
Guardrails: Monitor core experience, retention where relevant, time-to-value, safety, privacy, reliability, and cost. Write rollback conditions for unacceptable movement before exposure begins.
Effect size and power: Set the minimum detectable effect from the business decision, estimate the required sample, and acknowledge when available traffic cannot support the desired conclusion.
Exposure control: Use feature flags, a capped rollout, and a holdout so you can stop quickly and preserve a valid comparison.

Standard A/B testing fits many product changes. Ranking and retrieval changes can benefit from interleaving when alternatives can be compared within the same experience. Switchback designs can help when time, seasonality, or shared operating conditions make simultaneous assignment misleading. Match the design to the interference in the workflow instead of defaulting to the experiment template you use for deterministic UI changes.

AI variability also changes the readout. Aggregate outcomes across the multiple interactions users have, compare cohorts, and track confidence intervals over time. A snapshot p-value should not overrule an underpowered test, an unstable effect, or a concentrated safety failure. A statistically inconclusive result means the test did not resolve the decision; it does not prove that the feature has no effect.

Prewrite the scale, iterate, and stop rules

I prefer four explicit decision states because they force the readout to connect evidence to action:

Scale: The primary outcome clears the meaningful threshold, guardrails hold, important cohorts do not show an unacceptable reversal, and reliability and cost remain viable.
Iterate the AI system: User intent is strong, but a defined output or system failure blocks value. The next test should target that failure rather than repeat the same broad pilot.
Change the product experience: Offline quality passes, but users cannot discover, trust, control, or efficiently use the capability. Treat this as workflow evidence, not an automatic reason to swap models.
Stop or reframe: Demand is weak, the economics cannot work, the necessary data or access is unavailable, or credible risk remains outside tolerance.

Risk must be part of the launch rule, not a review added after a positive metric appears. Include toxicity and personally identifiable information checks where relevant, enforce least-privilege access, retain appropriate audit logs, and make rollback operational before exposure. For irreversible financial actions, sensitive regulatory decisions, or any workflow where the acceptable error rate is effectively zero, keep a qualified human approval step or defer autonomy. Faster execution does not compensate for an unacceptable blast radius.

Autonomy should be earned in stages. Begin with assistance that the user can inspect. Move to required approval before actions. Allow autonomous execution only for narrow, low-stakes, reversible actions after stability is demonstrated. Expand permissions and exposure only when monitoring shows that the earlier guardrails continue to hold.

The experiment does not end at launch. Model behavior, retrieval content, user mix, prompts, tools, and operating costs can change. Continue tracking quality, latency, cost per successful outcome, safety, and cohort behavior. Feed new failures into the evaluation set and keep a holdout when the decision warrants one. A weekly readout should identify what changed, which assumption the evidence affected, and what decision follows; it should not become a tour of every available dashboard.

Key takeaways

Start with a precommitted decision contract: user, job, baseline, outcome, threshold, guardrails, and next action.
Validate demand before usability, usability before system capability, and system capability before broad production impact.
Compare the AI with the user’s current alternative, not with an abstract standard of impressive output.
Measure output quality, system performance, user outcomes, and business consequences separately so failures remain diagnosable.
Treat stochastic behavior as a distribution: document configurations, repeat runs, inspect slices, and watch severe outliers.
Use feature flags, holdouts, exposure caps, auditability, and prewritten rollback rules to contain risk while learning.

At your next AI review, ask for the experiment contract instead of another demo. If the team cannot name the dominant risk, current baseline, meaningful threshold, guardrails, and action for each possible result, the next step is not production exposure. It is a sharper test.

Start with the smallest experiment that could credibly invalidate the idea. Evidence that survives that test earns the right to spend more, increase fidelity, and expose more users.

References

April 27, 2026

The AI PM One-Pager: Radical prototyping requirements for speed, clarity, and truth

I move fastest in Generative AI when I strip work down to its essential signals. At HighLevel, I rely on a single-page format—”Prototyping Requirements: The One-Pager for AI PMs”—to turn ideas into testable artifacts within hours, not weeks. This approach reinforces AI Strategy, minimizes coordination overhead, and keeps Product Management focused on learning over ceremony.

“Prototyping requirements go rogue: one page, zero bureaucracy, built for AI. Shape concepts fast, prompt tools directly, and get to the truth sooner.”

In practice, my one-pager captures only what’s required to run an immediate experiment: the user problem, the target behavior change, success signals, core constraints, intended AI workflows, and the smallest realistic path to an evaluable demo. I also include example prompts, guardrails, and evaluation criteria so the team can apply prompt engineering and LLMs for product managers without guessing.

This is eval-driven development in action. I document a minimal hypothesis, concrete inputs/outputs, and a quick plan for metrics, including qualitative signals from product discovery and continuous discovery. By prompting tools directly, we expose assumptions early, shorten feedback loops, and build an AI product toolbox that compounds learning sprint after sprint.

I run this with a product trio to ensure we balance feasibility, usability, and value. We align on risks, dependencies, and what “good” looks like, then we integrate the learnings into product roadmapping and sprint planning. The result: fewer meetings, tighter collaboration, and empowered product teams delivering sharper outcomes with less friction.

If you want speed and clarity without sacrificing rigor, adopt the one-pager. It centers the conversation on evidence, accelerates AI workflows from prompt to prototype, and makes it obvious what to try next—and what to stop doing. Most importantly, it keeps the team focused on truth over theater, which is how great AI products actually ship.

Inspired by this post on Product School.

April 24, 2026
Unleashing Inbound Sales with AI: My Playbook for Launching and Scaling Sales Agents Fast

Inbound leads shouldn’t wait for a rep’s calendar. When we first launched The Service Agent Blueprint, support leaders finally had a clear AI path. Go-to-market and revenue teams are now facing similar uncertainty, so I’m introducing The Sales Agent Blueprint—a practical map for launching and scaling AI for sales with confidence.

For most sales teams, inbound motions require a lot of manual work. I’ve watched leads pile up in queues, waiting for availability rather than being prioritized by buyer intent. That delay costs meetings, pipeline, and momentum—and it’s exactly where a modern AI Strategy can transform your go-to-market strategy.

Agents can run sales conversations end to end – engaging buyers, qualifying leads, and routing high-intent opportunities to the right team to move prospective buyers forward quickly. Humans will still be involved, but will move their focus to the consultative conversations and higher-value work they did not have time to focus on before. In practice, this shift enables cleaner AI workflows, better conversation design, and a healthier balance between sales-led growth and product-led growth.

The questions many go-to-market and revenue leaders are facing now are where do you start? What should success look like? How do you actually test and deploy these solutions? These are the right questions—and the ones I hear most often when teams weigh build vs buy decisions, evaluation frameworks, and CRM integration nuances.

The Sales Agent Blueprint answers those questions. It’s designed to be a strategic guide for sales, revenue, and AI transformation leaders who want to deploy AI for inbound sales fast, prove value, and build momentum. If you’re aiming for eval-driven development, this will help you define success up front and operationalize it.

What’s inside is simple by design yet deep enough to take you from zero to value. The Sales Agent Blueprint is structured around two tracks that reflect how high-performing teams adopt agentic AI: first, launch for quick wins; next, scale for durable growth.

Coming soon: Sales Agent Blueprint. A sleek, blueprint-inspired teaser with the call to 'Scale it' signals tools, playbooks, and workflows to grow revenue, streamline operations, and scale teams with confidence.

Today, I’m releasing the first part of the Blueprint: “Launch it.” It’s a practical guide for getting your Agent live and seeing real results. You’ll learn how to deploy a Sales Agent that runs inbound sales conversations end to end, engaging buyers, qualifying leads, and routing high-intent opportunities to the right outcome in real time—without disrupting your current CRM integration or pipeline processes.

By the end of the “Launch it” track, you’ll be ready to execute with clarity. Here’s how I frame the essential steps, based on what consistently works in the field.

Understand what a Sales Agent is: Discover why they’re different from chatbots and how they work. Build a business case: Prove the basic economics of AI, decide whether to buy or build, and get the buy-in and budget you need to move forward.

Evaluate an Agent: Learn how to define success, choose the right evaluation criteria, and run a focused, high-impact assessment with our five-step framework.

Deploy with confidence: Build a deployment plan that gets your Agent live quickly to engage buyers at peak intent. Learn what to expect at each stage.

Introducing the Sales Agent Blueprint. This crisp, grid-based graphic spotlights step 1—Launch it—signaling day-one activation for an AI sales agent. Explore the framework and get started at fin.ai/blueprint/sales.

Continuously improve performance: After launch, your Agent becomes a system to manage. We’ll show you how to implement a repeatable process to train, test, deploy, and optimize.

The second track, “Scale it” (coming soon), focuses on the organizational and systems design work that unlocks compounding gains. Launching AI is only the beginning. To unlock its full potential, you need to rewire your inbound sales motion—redesigning the buyer journey, building AI-first systems and ownership models, and rethinking how pipeline is generated and scaled. This is where governance, measurement, and team roles evolve to support sustainable growth.

I’ll be building this Blueprint in public as I navigate the same challenges—sharing what works, what to avoid, and how to accelerate time-to-value without sacrificing quality or trust. If you’re ready to turn intent into revenue with agentic AI, this is your head start.

The Sales Agent Blueprint is live now. Explore the full guide at fin.ai/blueprint/sales and start your “Launch it” sprint today.

Inspired by this post on The Intercom Blog.

April 23, 2026

Build an AI Toolbox That Improves Product Management

You have an interview transcript waiting to be synthesized, a roadmap debate with more opinions than evidence, and a stakeholder update due before the decisions are settled. A general-purpose chatbot can help with each task. It can also produce a polished version of the wrong answer.

I’ve evaluated dozens of generative AI products against the work product managers actually do, from discovery through launch. The useful pattern is simple: choose a recurring decision, connect the model to the evidence for that decision, define the human review, and measure whether the workflow improves. The tool is only one part of that system.

Start with the decision that needs to improve

If you begin with a tool, its demo will define your use case. You will end up generating summaries, specifications, and slide copy because those outputs are easy to show, not because they remove the most important constraint in your product process.

Begin with a decision that is slow, inconsistent, or poorly supported. Write the workflow in one sentence:

When [trigger occurs], [owner] uses [approved evidence] to produce [decision artifact], which [reviewer] checks before [downstream action]. Success is measured by [workflow metric] and [product metric].

For customer discovery, that might become: when an interview round closes, the product manager uses transcripts, participant metadata, and the research question to produce a theme map and a list of unresolved questions. A research or design partner checks the evidence before the findings enter an opportunity solution tree. Synthesis time, evidence corrections, and the quality of the next research questions show whether the workflow is helping.

A strong first use case has four properties:

It recurs. A workflow used repeatedly gives you enough opportunities to find failure modes and improve the prompt.
Its evidence is bounded. You can identify the transcripts, event definitions, strategy documents, or decision logs the model is allowed to use.
A qualified person can review it. The reviewer knows what a plausible but unsupported answer looks like.
The improvement is observable. You can compare cycle time, rework, evidence quality, or another meaningful measure before and after introducing AI.

My rule is to start with frequent, evidence-rich work where a mistake is reversible. Interview synthesis, experiment readouts, roadmap option framing, and release communication are usually better learning environments than an autonomous decision that immediately changes customer data or launches an experience.

Capture a baseline before changing the workflow. Record how long the work takes, where review cycles occur, which errors appear repeatedly, and what downstream decision the artifact supports. Without that baseline, faster drafting can look like progress even when reviewers spend the saved time correcting unsupported claims.

Build the toolbox in layers, not by brand

An effective product management toolbox connects LLMs, research synthesis, behavioral analytics, and lightweight automation. These layers solve different problems. Buying several products that all generate text does not create a complete system.

Tool layer	Best PM job	Evidence it needs	Useful output	Main failure to catch
General-purpose LLM workspace	Framing, critique, drafting, and option generation	Objective, constraints, definitions, and approved documents	Questions, alternatives, structured drafts, and decision briefs	Confident invention or generic advice detached from product context
Research synthesis	Organizing customer interviews and qualitative feedback	Transcripts, participant identifiers, segment metadata, and research questions	Evidence-linked themes, contradictions, unmet needs, and follow-up questions	Treating a small sample as market prevalence or erasing minority views
Behavioral analytics	Finding where behavior changes and sizing an opportunity	Event definitions, entity grain, cohorts, funnels, paths, retention views, and experiment results	Drop-off patterns, affected segments, anomalies, and testable hypotheses	Turning correlation into causation or analyzing an incorrectly defined event
Knowledge and retrieval layer	Grounding answers in current product context	Strategy, decision logs, research, taxonomy, policies, and product documentation	Traceable answers with evidence and visible conflicts	Retrieving stale, unauthorized, or contradictory material without warning
Workflow and experience automation	Moving an approved decision into repeatable execution	Approved copy, segments, triggers, stop conditions, owners, and measurement events	In-app guides, product tours, handoffs, checklists, and status updates	Publishing or acting before human approval, measurement, or rollback is ready

Use the table to expose missing layers. If research synthesis is strong but event definitions are unreliable, another writing assistant will not improve opportunity sizing. If analytics is mature but the model cannot access the current strategy or decision history, its prioritization advice will remain generic. If automation is available but ownership and rollback are unclear, speed will amplify operational risk.

Evaluate each candidate against the workflow, not against a feature checklist. Ask:

Can it work where the approved evidence already lives, or will people create uncontrolled copies?
Can a reviewer trace a conclusion back to a transcript, event definition, document, or decision record?
Can access, retention, sharing, and deletion follow your data governance rules?
Can you test a stable workflow with representative examples instead of judging a polished demo?
Can you observe failures, corrections, latency, and cost after rollout?
Does the total cost include integration, governance, evaluation, review time, and maintenance rather than only the license?

A vendor can be impressive and still be wrong for your operating environment. The decisive question is whether it strengthens a specific product decision without weakening evidence quality, privacy, or accountability.

Turn the tools into repeatable PM workflows

The prompt is not the workflow. A production workflow includes prepared inputs, an output contract, a review step, a decision owner, and a place to record what happened. The following patterns cover the PM work where AI can create leverage without pretending to replace product judgment.

Synthesize interviews without manufacturing certainty

Qualitative synthesis becomes unreliable when the model merges observation, interpretation, and recommendation into one smooth narrative. Preserve those boundaries. Give each participant a stable identifier, retain relevant segment context, and tell the model to cite the evidence behind every theme.

Copy-paste prompt: Act as a product research analyst. Use only the supplied interviews and research brief. For each theme, return the claim, supporting participant identifiers, contradictory evidence, affected segment, confidence with a reason, and the next unanswered question. Separate direct observations from interpretation and recommendation. Do not infer market prevalence from this interview sample. If a conclusion lacks evidence, label it unsupported.

Review the output by opening the cited passages, not by judging whether the summary sounds plausible. Look for participants who do not fit the dominant theme. Check whether two different needs have been combined because they use similar words. Confirm that the model has not converted the loudest quotation into the most important opportunity.

Only then move the findings into your discovery structure. The useful handoff is not a list of themes. It is a set of evidence-backed needs, open questions, affected segments, and assumptions that the product trio can investigate.

Combine behavioral data with customer evidence

Behavioral analytics can tell you where users drop out, which segments behave differently, and whether a pattern is large enough to deserve attention. It does not tell you why the behavior occurred. Interviews can reveal possible motivations, but a qualitative sample does not establish how common each motivation is. Use the two evidence types together without asking either to answer the other’s question.

Before involving an LLM, verify the event name, event meaning, user or account grain, relevant cohort, and analysis window. If instrumentation changed, include that context. Prefer aggregated or appropriately governed data; do not paste raw personal or confidential customer data into an unapproved model.

Copy-paste prompt: Use the supplied event definitions, cohort table, funnel, and interview themes. Identify the largest observed behavior changes by segment. For each change, distinguish the observed fact from possible explanations. List data quality questions, supporting customer evidence, conflicting customer evidence, and the cheapest analysis or experiment that could reduce uncertainty. Do not claim causation from a correlation.

Return to the analytics system to validate every material claim. The model is useful for connecting evidence and generating hypotheses; the governed analytics layer remains the place to confirm event behavior, segment definitions, retention patterns, and experiment results.

Frame roadmap choices as options, not generated certainty

A roadmap debate rarely fails because nobody can generate feature ideas. It fails when alternatives, assumptions, constraints, and expected outcomes are implicit. AI is most useful here as an argument compiler: it can turn scattered evidence into comparable options and expose what each option requires you to believe.

Copy-paste prompt: Use the supplied product objective, customer evidence, behavioral evidence, strategic constraints, technical constraints, and decision history. Create a set of distinct options rather than a ranked feature backlog. For each option, state the target outcome, supporting evidence, contradictory evidence, critical assumptions, excluded alternatives, leading indicator, delivery risk, and cheapest test. Flag any recommendation that lacks a traceable source. Do not make the final priority decision.

This format makes outcome-versus-output confusion visible. An option such as build a new onboarding checklist is an output. Improve successful first-time setup for a defined customer segment is an outcome. The first can support the second, but the relationship is still a hypothesis. Keep that hypothesis visible in the roadmap and in the experiment plan.

The human decision owner should record the selected option, why it won, what evidence mattered, which assumption remains unresolved, and when the decision should be revisited. That decision log becomes grounding material for later planning instead of forcing the next model session to reconstruct context from scattered documents.

Move an approved launch into an observable experience

Once the decision is approved, AI can reduce the mechanical work of adapting positioning into release notes, support context, product tours, and in-app guides. The risky part is not drafting the words. It is allowing generated content to reach the wrong segment, appear at the wrong moment, or launch without a measurement and stop condition.

Copy-paste prompt: Using only the approved positioning, UX terminology, target segment, trigger event, and product constraints, draft an in-app sequence. For each message, state its purpose, trigger, target user, action requested, dismissal behavior, stop condition, and measurement event. Preserve the approved claim boundaries. Flag any copy that introduces a benefit, capability, or promise not present in the supplied material.

Review the experience in context. Confirm that the audience definition matches the analytics definition, the trigger can actually be observed, the requested action exists in the current interface, and users can dismiss or complete the sequence. Keep experiment design and success analysis outside the copy generator. Fluent wording cannot declare the launch successful.

Make every output inspectable before it becomes operational

The difference between a useful personal assistant and a dependable organizational workflow is inspectability. A reviewer must be able to see which evidence was available, which instructions shaped the answer, what the model produced, what a person changed, and which decision followed.

Use a retrieval-first pipeline grounded in product documents and decision logs. Do not rely on model memory for current strategy, naming, policy, or product behavior. Define an authority order for conflicting material. A current approved decision record should not silently lose to an older planning document simply because the older document contains more text.

Your grounding layer should preserve permissions. Retrieval is not an excuse to expose every document to every workflow. Record the owner and freshness of important material, remove obsolete versions from the approved collection, and instruct the model to show conflicts instead of resolving them invisibly.

Treat each repeated prompt as a small product surface with a contract:

Goal: the decision or artifact the workflow must support.
Allowed evidence: the documents, data, and tools the model may use.
Definitions: the product terms, entities, events, segments, and metrics that must remain consistent.
Method constraints: what the model must separate, preserve, cite, or avoid inferring.
Output contract: the required fields, order, labels, and evidence links.
Uncertainty behavior: when to flag missing context, conflicting inputs, or unsupported conclusions.
Review and stop conditions: who approves the output and what prevents it from moving downstream.

Then create an evaluation set from representative work. Include ordinary inputs, ambiguous cases, conflicting documents, incomplete evidence, sensitive-data traps, and previously observed failures. A good evaluation checks groundedness, traceability, coverage, decision usefulness, confidentiality, and consistency. Writing quality matters, but polish is not evidence.

Re-run the evaluation whenever the model, prompt, connector, knowledge collection, event taxonomy, or output schema changes. A workflow that passed yesterday’s cases can regress when one dependency changes. This is why eval-driven development, observability, privacy-by-design, and AI risk management belong in the product manager’s toolbox rather than in a separate governance document.

For each operational run, retain enough information to diagnose failure: workflow name, input sources, prompt or configuration version, output, reviewer corrections, final decision, latency, and cost where available. The record should support improvement without retaining sensitive data longer than your policy permits.

A screenshot checklist can make the workflow easier to teach and audit. Capture the approved input location, relevant access setting, prompt configuration, evidence-linked output, human edits, final decision record, and measurement view. Screenshots do not replace logs or documentation, but they give PMs and stakeholders the same operating picture during onboarding and review.

Scale adoption through gates and measurable outcomes

Do not roll an AI tool out to every product manager and hope good practices emerge. Move one workflow through explicit gates:

Baseline the current workflow. Record cycle time, review effort, recurring errors, and the downstream outcome it supports.
Run in shadow mode. Produce the AI-assisted artifact without allowing it to drive the real decision. Compare it with the normal process and save failure cases.
Introduce assisted use. Let a named human owner use and edit the output. Require evidence checks before it reaches stakeholders or customers.
Standardize the operating pattern. Publish the input rules, prompt contract, evaluation set, owner, storage location, escalation path, and fallback process.
Expand only after the workflow holds up. Add users, data sources, or automation after quality, privacy, and review behavior remain dependable.

Measure the workflow at more than one level. Cycle time tells you whether work moves faster. Correction rate and review effort show whether speed is hiding rework. Evidence coverage shows whether claims can be defended. The linked product metric shows whether the artifact supports a meaningful outcome. Total cost tells you whether licenses, integration, evaluations, governance, and human review are worth the saved effort.

Do not count prompts submitted, words generated, summaries created, or seats assigned as product impact. Those are activity measures. A workflow is valuable when it shortens a real decision cycle, improves the evidence behind a decision, reduces preventable rework, or helps the team learn about an outcome sooner.

Pause or roll back the workflow when material claims cannot be traced, confidential data crosses an unapproved boundary, reviewers begin rubber-stamping output, small configuration changes cause unpredictable recommendations, or the review and governance burden cancels the useful gain. A graceful fallback to the previous process is part of the design, not an admission that AI failed.

Key takeaways

Choose a recurring product decision before choosing an AI product.
Combine LLMs, research synthesis, behavioral analytics, grounded knowledge, and automation only where the workflow needs them.
Require bounded evidence, visible uncertainty, traceable claims, and a named human decision owner.
Turn repeated prompts into governed contracts with evaluations, observability, and clear stop conditions.
Judge the toolbox by cycle time, evidence quality, rework, product learning, and total cost rather than by generated output.

This week, select one recurring PM decision and write its workflow sentence. Baseline the current process, run the AI-assisted version in shadow mode, and save every failure as an evaluation case. Your toolbox becomes valuable when it improves a decision you can defend, not when it produces more material to review.

References

Shivam.Consulting Blog – My Essential AI Toolbox for Product Managers: Tested Picks, Prompts, Workflows + Checklists

April 22, 2026

How to Build Agentic AI for Product Analytics and Support

Your support bot can tell a customer where a setting lives, yet leave that customer to diagnose the problem, change the setting, and hope it worked. Your product team then receives a chat transcript without knowing whether the interaction improved activation, feature adoption, or retention.

If you are deciding how to connect AI, product analytics, and support, do not start with the model. Design the closed loop first: assemble trustworthy context, choose an allowed action, verify the resulting product state, and measure the user outcome. The model is one component inside that system.

Treat product analytics as the agent’s control plane

A useful standard is an assistant that understands the user’s context, can complete an allowed action, and measures whether the action helped. Remove any one of those capabilities and the experience degrades quickly. Context without action produces advice. Action without context creates risk. Action without measurement creates an impressive demo that cannot earn a durable place on the roadmap.

Product analytics supplies the behavioral context and outcome signals for this loop. It can show where the user is in a journey, which features have been adopted, which step failed, and whether the expected success event eventually occurred. It should not be treated as a warehouse-sized attachment to the prompt.

Define a support context contract

Create a small, governed context object for each supported workflow. Give the agent only the fields required to understand and resolve that workflow:

Actor and access: the authenticated user, account, role, entitlements, and permissions relevant to the requested action.
Journey state: the onboarding step, feature-adoption state, experiment assignment, or other stage that explains what the user is trying to complete.
Current product state: the relevant configuration from the operational system of record, including whether required prerequisites are satisfied.
Friction evidence: recent failed events, validation results, repeated attempts, and known errors connected to this workflow.
Desired outcome: the product state and behavioral event that will count as successful resolution.

Resolve analytics events and tool calls to the same stable user and account identifiers. Preserve timestamps and the origin of each field. For a live action decision, let the operational system of record determine current state; use analytics to explain the journey and measure the outcome. An event stream can be delayed or incomplete, so it should not overrule a current configuration read.

Behavior is also evidence, not intent. Repeated visits to a setup screen could indicate confusion, careful verification, or an advanced workflow. When those interpretations require different actions, the agent should ask one targeted question instead of turning a behavioral pattern into a confident diagnosis.

Apply data minimization at this boundary. Do not place secrets, payment information, unrelated conversation history, or an account’s entire event history into the model context. Filter fields before the model sees them, and enforce the filter in code rather than relying on a prompt instruction.

Give the analytics agent a metric contract

An internal analytics agent has a different job from a customer-facing support agent. It may translate a product question into metrics, cohorts, funnels, or retention views, but a fluent answer is not enough. Require every analysis to return:

the product question it interpreted;
the metric definition and success event it used;
the cohort, filters, and observation window;
the analysis or query reference needed to reproduce the result;
known data-quality limitations and unresolved ambiguity; and
a clear distinction between observed association and demonstrated causal lift.

This turns the analytics agent into a traceable decision aid. It also prevents two agents from using the same metric name while silently applying different event definitions, account filters, or windows.

Design one closed loop from signal to verified outcome

The core unit of agentic support is not the conversation. It is a resolution attempt with a beginning, an authorized action, and a verifiable end state. Use the following loop for every workflow:

Observe the trigger. Capture the user’s request or a product signal that indicates likely friction.
Assemble scoped context. Load only the identity, permission, journey, state, and error fields defined in the context contract.
Diagnose the next constraint. Determine which prerequisite, configuration, permission, or knowledge gap is blocking progress. If the evidence is ambiguous, ask rather than assume.
Select an approved playbook. Match the constraint to a versioned workflow with explicit eligibility rules, allowed tools, and prohibited actions.
Obtain the required authorization. Show the proposed change and its consequence whenever the action changes product state or affects other people.
Execute through a narrow tool. Use a typed, allowlisted operation. Make retryable actions idempotent so a repeated call does not create duplicate changes.
Verify the result. Read the resulting product state and look for the defined success event. Tool completion alone does not prove customer resolution.
Record the outcome. Log the context version, playbook, model, policy decision, tool call, result, success signal, and any escalation or user reversal.

The loop supports two related products without collapsing their permissions. An internal analytics agent can identify an affected cohort, inspect a funnel, or surface a recurring failure pattern. A customer-facing support agent can use the approved finding to help one authenticated user, but it should see only that user’s permitted context and tools. A human support operator should receive the same trace when the agent escalates.

Keep the shared layer deliberately small: stable identities, canonical metric definitions, governed context fields, outcome events, and versioned playbooks. The analytics agent and support agent can then improve the same system while retaining separate access policies and evaluation criteria.

Do not automatically convert every observed correlation into a new support action. Let analytics generate a candidate playbook, review the causal logic and risk, test it against known cases, and release it through a controlled experiment. The learning unit is the reviewed playbook, not an unexamined prompt change.

Choose a first workflow that can prove its own value

The first pilot should be easy to verify, not merely easy to demonstrate. A conversational answer looks polished even when it does not change the user’s outcome. A narrow configuration or onboarding workflow is usually a better proving ground because eligibility, allowed actions, and success can be defined before launch.

Score candidate workflows against these criteria:

Repeated demand: the same intent or failure appears often enough to justify a reusable playbook.
Observable state: the agent can read the prerequisites and current configuration instead of guessing from the user’s description.
Clear success: one product state or behavioral event can verify that the problem was resolved.
Safe execution: the initial actions are reversible, user-scoped, and unlikely to affect billing, security, data retention, or other users.
Short feedback: the primary outcome appears soon enough to support iteration, even if retention is monitored later.
Enough eligible traffic: the workflow can support a meaningful experiment rather than a handful of anecdotes.

Write the pilot contract before the prompt

A pilot contract forces the product, analytics, support, engineering, and risk decisions into one inspectable artifact. It should specify:

the user problem and eligible cohort;
the trigger that starts the workflow;
the context fields and systems of record;
the approved diagnostic branches;
the allowed and prohibited actions;
the point at which confirmation is required;
the precondition and postcondition for each tool call;
the success event and observation window;
known failure modes and the human handoff rule; and
the primary outcome, guardrail metrics, experiment design, and minimum detectable effect.

Consider an onboarding configuration workflow. The trigger might be a user repeatedly reaching setup without completing it. The context could include entitlement, current configuration, prerequisite status, and the latest validation result. The agent may be allowed to run validation, explain a missing prerequisite, prefill a reversible setting, or launch the next approved step. Resolution requires both the expected configuration state and its corresponding success event. If validation continues to fail, the handoff should include the exact state, error, playbook branch, and actions already attempted.

Avoid starting with data deletion, broad permission changes, security recovery, billing adjustments, or external communications. Those workflows combine difficult authorization questions with high consequences. Prove context quality, tool reliability, verification, and measurement on a narrower action set before expanding the blast radius.

Set the minimum detectable effect before the experiment. If the eligible population cannot detect an outcome change that would justify the investment, narrow the claim, combine additional time periods, or choose a more observable workflow. Do not call an underpowered neutral result proof that the agent has no effect.

Instrument the agent like a product surface, not a transcript

Conversation volume, message count, and thumbs-up feedback are diagnostic signals. They are not sufficient outcome measures. A customer can like an explanation and still remain blocked; another can dislike the wording even though the configuration was fixed.

Measurement layer	Question it answers	Useful signals
Operational reliability	Did the system execute as designed?	Tool success, validation failure, retry, latency, rollback, and escalation
Verified resolution	Did the requested product state become true?	Verified resolution rate, time to resolution, repeat attempt, and repeat contact
Product outcome	Did the user progress in the journey?	Activation, feature adoption, workflow completion, and later retention
Support outcome	Did the workflow reduce avoidable support effort?	Eligible ticket rate, escalation reason, handle-time impact, and handoff quality
Safety and trust	Did the agent stay within policy and user intent?	Permission block, wrong-action review, user reversal, policy violation, and privacy incident

Define the denominators as carefully as the numerators. Verified resolution rate should use eligible support sessions as its denominator and require the success state defined in the pilot contract. Action completion rate should use authorized action attempts, not every conversation. Time to resolution should begin with the original request and stop only when the postcondition is verified, not when the agent finishes generating text.

Do not optimize ticket deflection or containment in isolation. The absence of a ticket can represent resolution, abandonment, or a user working around the problem. Pair support-efficiency measures with product success, repeat contact, and safety guardrails.

Use evaluations and experiments for different questions

A disciplined AI product rhythm connects eval-driven development, A/B testing, minimum detectable effect, activation, retention analysis, and data governance. Each mechanism answers a different question:

Pre-release evaluations: Can the system interpret known intents, select the right context, follow policy, choose an allowed tool, handle tool errors, and verify the expected postcondition? Run the relevant suite whenever the model, prompt, context contract, policy, tool, or playbook changes.
Shadow operation: What would the agent have proposed in real traffic without being allowed to change state? Review mismatched diagnoses, unsupported context, unsafe actions, and missed escalation conditions.
Controlled experiments: Does the agent improve the predefined outcome compared with the existing support experience for the eligible population? Record assignment before the interaction and preserve it through outcome analysis.
Production monitoring: Are errors, reversals, escalations, latency, or policy blocks changing by journey, user role, entitlement, playbook, or release version?

Be careful with naive correlation. Users who invoke support are often already struggling, so their outcomes may look worse than those of users who never needed help. Random assignment among eligible users gives you a defensible counterfactual. When randomization is not possible, describe the result as observational and avoid claiming that the agent caused the change.

Log enough version information to reproduce a decision: model, prompt, policy, context schema, playbook, experiment assignment, tool version, input identifiers, authorization result, and postcondition. Do not place raw secrets or unrestricted personal data in that trace. A metric change is actionable only when you can connect it to the system version that produced it.

Set action boundaries before the model receives tool access

Model confidence is not authority. A highly confident response must never expand a user’s permissions, bypass confirmation, or convert a prohibited action into an allowed one. Authorization belongs in deterministic policy and tool infrastructure outside the model.

Action class	Typical scope	Required controls	Verification
Read and explain	Show relevant state, explain an error, or recommend a next step	User-scoped reads, field filtering, and visible uncertainty when evidence conflicts	Confirm that the response used current state and an approved knowledge path
Reversible change	Update a non-sensitive preference, run validation, or trigger a recoverable workflow	Preview, confirmation when needed, typed input, idempotency, and rollback	Read the resulting state and observe the workflow’s success event
Consequential change	Alter billing, permissions, security, external communication, or retained data	Strong confirmation or human review, separation of duties, and a complete audit trail	Verify every postcondition and provide a safe recovery or escalation route

Implement the boundary with controls the agent cannot negotiate away:

Least-privilege credentials: issue short-lived, user-scoped authorization rather than a general service credential wherever the architecture permits it.
Allowlisted tools: expose narrow actions with typed parameters, explicit preconditions, and constrained targets. Do not give a customer-facing agent arbitrary database or shell access.
Policy before execution: validate identity, permission, data sensitivity, action class, and confirmation status outside the model before any state-changing call.
Postcondition checks: require the agent to read the resulting state. A successful API response can still produce the wrong business outcome.
Safe retries: attach idempotency controls to operations that might be repeated after a timeout or interrupted conversation.
Complete handoffs: send the human operator the intent, relevant context, diagnosis, attempted action, tool result, and unresolved condition so the customer does not have to start over.
Controlled release: use feature flags, cohort restrictions, action-level limits, and an immediate disable path while a workflow is being validated.

Evaluate build-versus-buy decisions at the system boundary

Conversation quality is easy to demonstrate and difficult to use as a purchasing criterion. Evaluate an agent platform on whether it can operate inside your context, permission, observability, and experimentation model.

Can you define and inspect the context contract for each workflow?
Can the platform use user-scoped credentials and enforce tool permissions outside the prompt?
Can every decision, action, version, and outcome be exported to your unified analytics platform?
Can you separate aggregate analytics access from individual customer support access?
Can you run offline evaluations, shadow traffic, controlled experiments, and cohort rollouts?
Can you configure confirmation, rollback, handoff, retention, and data-residency policies?
Can you change the model, tool, or support system without losing metric definitions and historical outcome traces?

A platform that generates excellent dialogue but cannot expose its action trace or connect to verified outcomes will make governance and product measurement harder. A less theatrical system with clear contracts may be the more useful product foundation.

Key takeaways

Start with a governed context contract, not a larger prompt or model.
Connect product analytics and support through shared identities, metric definitions, outcome events, and versioned playbooks.
Give customer-facing agents user-scoped context and a small set of reversible, allowlisted actions.
Count a resolution only when the intended product state or success event is verified.
Use offline evaluations for capability and policy, controlled experiments for causal impact, and production monitoring for drift and safety.
Expand autonomy only after context accuracy, tool reliability, outcome lift, and guardrails have all been demonstrated.

At your next roadmap review, ask for one pilot contract rather than a broad AI support initiative. Choose one recurring journey, name its verified success event, define the smallest safe action set, and make the owner show how every action will be authorized, observed, and reversed. That is enough to move from a chatbot concept to an agentic product you can manage.

References

April 21, 2026

Auditable AI Code Review: A Practical Operating Model

You are not deciding whether an AI model can find bugs in a pull request. You are deciding whether an automated reviewer can participate in a production control without leaving your team unable to explain, challenge, or reverse its decision.

If the only evidence behind an approval is a bot comment that says the change looks safe, keep the system advisory. An auditable AI reviewer needs a bounded mandate, a deterministic approval policy, traceable evidence, and a feedback loop tied to production outcomes. Build those controls first, and faster review becomes a consequence rather than a gamble.

Start with a decision contract, not a model prompt

An approval is a policy decision. The model can supply findings, evidence, and a recommendation, but it should not define the conditions under which its own recommendation becomes authoritative.

Write a decision contract before selecting a model or tuning a prompt. It should answer five questions:

What may the system decide? Typical outcomes are approve, request changes, provide non-blocking comments, or escalate to a person.
Which changes are eligible? Eligibility should be determined by explicit repository, path, change-type, test, ownership, and reversibility rules.
Which checks are mandatory? An eligible pull request should not be approved if a required review lens failed to run, returned incomplete evidence, or produced an unresolved blocking finding.
When must the system abstain? Missing context, conflicting findings, unavailable tools, excessive scope, low-confidence evidence, and protected code paths should cause escalation rather than optimistic approval.
Who owns the result? Name the engineer accountable for the change, the owner of the review policy, and the person or group authorized to change the automation boundary.

The core approval rule can be expressed plainly: the change is eligible, every mandatory check completed, no blocking issue remains, the evidence record is complete, and no human-review requirement was triggered. Encode that rule in a controller your team can inspect and test. Do not bury it inside natural-language instructions to the model.

This separation gives you a clean control plane. Review agents analyze the change. A policy engine evaluates their structured results. A narrowly permissioned service performs the approved action. The model never gets to reinterpret the boundary at the moment it encounters a difficult pull request.

Auditability does not require a future model run to reproduce the same words. Model endpoints, retrieved context, and dependencies can change. It requires the original decision to remain reconstructable from preserved inputs, outputs, policies, tool results, and versions. A skeptical engineer should be able to determine why the pull request was approved without trusting the personality or reputation of the bot.

Split the review into specialist checks with explicit evidence

A single prompt asking whether a pull request is safe compresses several different judgments into one opaque answer. Decompose the review so that each judgment has a clear purpose, input set, output schema, and failure mode.

A practical review pipeline can include these specialist lenses:

Problem-definition quality: Is the requested behavior specific enough to review, and are the acceptance conditions testable?
Intent alignment: Does the diff implement the stated change without silently expanding or contradicting it?
Scope and dependency impact: Which callers, data flows, interfaces, jobs, or services can the change affect outside the edited files?
Logical correctness: Do the changed execution paths handle expected states, boundary conditions, and failure paths?
Test adequacy: Do the tests exercise the behavior that changed, and did the required checks actually run against the reviewed commit?
Security and privacy: Does the change alter trust boundaries, permissions, authentication, secrets, sensitive data handling, or externally controlled inputs?
Local engineering guidance: Does the implementation comply with versioned repository conventions, architectural constraints, and known anti-patterns?
Deployment and recovery: Can the change be observed, disabled, or rolled back without creating a second unsafe operation?

Every specialist should return the same minimum structure: a check identifier, pass/fail/escalate status, a concise claim, evidence tied to files or tool output, the applicable rule version, severity, and a recommended action. A finding such as a possible regression is not auditable. A finding that identifies the affected path, explains the conflicting behavior, points to the relevant code, and names the violated policy is.

Run independent checks before aggregation, and preserve every result even when the final decision is approval. The aggregator may deduplicate findings, but it should not erase dissent. If the intent checker says the change is aligned while the execution-path checker finds a contradiction, route the conflict to a person.

Review context must extend beyond the visible diff. A seemingly harmless one-line copy change was once found to contradict validation behavior elsewhere in the codebase. That is the kind of defect a diff-only reviewer is structurally unlikely to see. Give relevant checks controlled access to callers, validators, schemas, tests, ownership metadata, and versioned internal guidance, then record exactly which context each check used.

More context is not automatically better. Retrieval should be targeted and attributable. When a finding depends on an internal rule, capture the rule identifier and version. When it depends on a test, capture the command, commit, status, and output reference. When it depends on inferred execution flow, record the relevant path so a maintainer can inspect it.

Treat pull-request text, code comments, test fixtures, generated files, and documentation on the changed branch as untrusted data. They can contain instructions designed to redirect an agent. Load approval policy from a protected service or the trusted base branch, not from files the pull request can rewrite. Run proposed code in an isolated environment, mediate tool calls through an allowlist, and keep approval or merge credentials outside the model’s reach. The policy controller should translate a valid decision into an action; the model should never hold the credential that performs it.

Set the automation boundary with hard eligibility gates

Do not begin by assigning every pull request a risk score and approving anything below a convenient number. A composite score can hide a disqualifying condition: a tiny authorization change may receive a low size score even though its blast radius is high. Apply hard gates first. Use scoring only to route changes that remain eligible after those gates.

Common reasons to require human review include:

Authentication, authorization, permissions, cryptography, secrets, or trust-boundary changes.
Payments, billing, entitlements, destructive data operations, or irreversible migrations.
Public API contracts, shared schemas, release infrastructure, or broadly consumed dependencies.
A pull request that changes its own review policy, test requirements, ownership rules, or deployment controls.
Missing required tests, failed or stale CI results, unavailable analysis tools, or a mismatch between the reviewed commit and the tested commit.
Changes spanning too many concerns, components, or execution paths for the approved review envelope.
An active incident, an unclear rollback path, or a direct request for human review.

There is no universal line-count threshold for a small pull request. Derive limits from your architecture and incident history, then version them. A change to a central permission function may be riskier than a much larger isolated test refactor. Scope should include dependency reach and behavior change, not just added and deleted lines.

A staged authority model keeps the boundary legible:

Mode	What the AI reviewer may do	Who decides the merge	Appropriate use
Shadow	Produce a private decision record without affecting the pull request	Human reviewer	Baseline evaluation and policy tuning
Advisory	Post evidence-backed, non-blocking findings	Human reviewer	Measuring usefulness and false alarms in normal work
Blocking	Request changes for narrow, testable policy violations	Human reviewer after resolution	Stable rules with clear evidence and an appeal path
Bounded approval	Approve only changes that pass every eligibility and review condition	Policy controller within its delegated scope	Validated low-risk change classes with complete audit records
Mandatory escalation	Summarize evidence and route the change	Named human owner	Sensitive paths, conflicting findings, missing evidence, or any requested human review

Do not turn bounded approval into an auto-approval quota. Coverage is a result of demonstrated safety, not a target that should pressure teams to weaken eligibility rules.

One high-frequency engineering environment reports that more than 93% of pull requests across two main codebases are agent-driven and more than 19% are approved without a human reviewer. Its reported median merge time fell from 75.8 minutes with human review to 14.6 minutes with AI approval, while downtime from breaking changes declined 35% as deployments doubled. Those organization-level results show that bounded automation can coexist with high deployment frequency and improving safety outcomes. They do not prove that AI approval caused the downtime reduction, and they should not be imported as another team’s launch threshold.

Keep the escape hatch explicit. Any engineer should be able to request a human review without defending the choice. The accountable engineer should still watch the change in production and be ready to roll it back. Automated approval changes who performs a review step; it does not transfer ownership of the production outcome to a model.

Preserve the evidence, then earn autonomy through evaluation

Build a decision record that survives model and policy changes

Create an append-only decision event for every review attempt, including abstentions and failed runs. At minimum, retain:

Repository, pull-request identifier, base commit, reviewed head commit, author, accountable owner, and timestamps.
The pull-request description and acceptance criteria as they existed when the decision was made.
Eligibility rules, protected-path rules, ownership data, prompt-template identifiers, and policy versions.
Model provider and model identifier, relevant runtime settings, retrieval configuration, and tool versions.
The context each specialist received, including immutable references or preserved snapshots for mutable material.
Structured specialist outputs, supporting evidence, tool invocations, CI results, conflicts, and failures.
The deterministic rule evaluation that produced approve, block, comment, or escalate.
Subsequent human overrides, appeals, edits, approvals, merges, rollbacks, hotfixes, and linked incidents.

Store concise decision rationale and inspectable evidence, not hidden chain-of-thought. An auditor needs to know which claim was made, what supported it, which rule applied, and how the controller reached the outcome. Private internal reasoning is neither necessary nor a reliable substitute for those artifacts.

Apply the same security discipline to review logs that you apply to source code. Minimize captured secrets and personal data, control access, define retention, and log policy changes. If a model or retrieval service cannot handle the code under your data-governance requirements, that repository is not eligible for the workflow.

Evaluate decisions, not polished comments

A review can sound thoughtful and still approve the wrong change. Build an evaluation set around decisions and evidence rather than writing quality.

Assemble representative cases. Include clean pull requests, valuable historical human findings, escaped defects, incident-causing changes, incomplete requirements, sensitive paths, cross-component changes, and attempts to manipulate the reviewer through repository content.
Label the expected control outcome. For each case, identify whether the correct action is approve, request changes, or escalate. Record the evidence that an acceptable review must surface.
Separate clear cases from disputed ones. Known incident causes and explicit policy violations can provide strong labels. Ambiguous architectural judgments need maintainer adjudication, and disagreement should remain visible rather than being forced into false certainty.
Freeze a holdout set. Use one portion to improve prompts, retrieval, and policy. Keep another portion unseen until release evaluation so repeated tuning does not create a misleading score.
Compare equivalent cohorts. Evaluate AI and human review on the same risk classes and change types. Comparing AI-approved low-risk changes with all human-reviewed pull requests confounds reviewer quality with task difficulty.

Track metrics that expose different failure modes:

Decision accuracy: How often did the system choose the expected approve, block, or escalate outcome?
False auto-approval rate: How often did it approve a labeled case that should have been blocked or escalated? Break this out by severity and risk class.
Blocking precision: Of the findings that stopped a change, how many maintainers judged valid and actionable?
Known-defect recall: Which seeded or historically verified defects did the review catch? Label this carefully; it is not recall over every defect that might exist.
Evidence completeness: Can every decision be traced to required checks, immutable inputs, policy versions, and supporting artifacts?
Abstention and override rates: Where is the system uncertain, and where do engineers reverse it? Investigate patterns by repository and change class.
Delivery performance: Measure review latency and merge time, but only alongside quality metrics.
Production outcomes: Track rollbacks, hotfixes, escaped defects, incidents, downtime, and customer impact for comparable risk cohorts.

Comment helpfulness is useful feedback, but it is not a safety metric. Engineers may like a concise reviewer that misses a critical defect, or dislike a strict reviewer that correctly blocks an unsafe change. Keep usefulness, correctness, and production impact as separate measures.

Roll out by change class and turn escapes into regression tests

Move from shadow mode to advisory comments, then to narrow blocking rules, and only then to bounded approval. Start with one repository and one low-risk, reversible change class. Write the exit criteria before the pilot begins, including acceptable false-approval and false-block rates, required audit completeness, escalation behavior, and production guardrails.

Canary each expansion. Maintain a kill switch that disables new automated approvals without removing the accumulated evidence. If a required service, model, retrieval index, test runner, or policy store is unavailable, fail closed and return the pull request to the human path.

When an approved change causes a production problem, diagnose the control layer that failed:

Was the change wrongly eligible?
Did retrieval omit relevant code or guidance?
Did a specialist miss or misclassify the defect?
Did the aggregator suppress a conflict?
Did the policy permit approval despite the evidence?
Did CI test a different commit or an incomplete environment?
Did production monitoring fail to surface the effect promptly?

Add the case to the regression suite, version the corrective policy or guidance, rerun the holdout evaluation, and preserve the relationship between the incident and the updated control. That is eval-driven development applied to governance: every escape should make a specific layer harder to fail in the same way again.

Key takeaways

AI output is an input to approval, not the approval policy itself.
Use deterministic eligibility gates before any model-based risk judgment.
Decompose review into specialist checks that return claims, evidence, rule versions, and explicit pass/fail/escalate states.
Keep policy and credentials outside the pull request and outside the model’s control.
Preserve enough evidence to reconstruct the original decision even when the model, repository, or internal guidance later changes.
Expand autonomy only when evaluation and comparable production cohorts support it; never optimize for auto-approval coverage by itself.

Your first useful milestone is not an AI-approved pull request. It is a shadow decision that a maintainer can reconstruct, dispute, and improve. Once that record is dependable, grant the smallest reversible slice of authority, watch what reaches production, and make every expansion earn its place.

References

Intercom — AI Now Approves Our Pull Requests—Safely: Inside an Agentic, Auditable Review Engine

April 21, 2026

AI-Native Startup Execution: A Practical Operating System

Your startup can produce impressive demos every week and still learn slowly. If model behavior, customer evidence, and deployment feedback move through separate queues, faster shipping only creates more uncertainty.

If you are deciding how to run an AI-native startup, use one standard: how quickly can your team turn a real customer artifact into an evaluated product change and a measurable customer outcome? That standard should shape your wedge, ideal customer profile, team design, sales motion, and operating cadence.

Build the smallest closed learning loop

An AI-enabled company can add a model to an existing product process. An AI-native company has to organize the process around intelligence itself: the data entering the system, the judgment the model makes, the action produced, the evaluation of that action, and the feedback that improves the next decision.

That makes a long AI feature list the wrong starting point. Begin with the smallest end-to-end loop that proves the product’s central claim. In a security product, for example, the loop might start with suspicious activity, produce a risk judgment, stop or downgrade a threat, and capture enough evidence to assess whether the intervention was correct. A polished dashboard without that closed loop is packaging, not proof.

Use three gates before committing to a wedge:

The customer can quantify the problem. Ask for frequency, severity, operational burden, or another observable consequence in the customer’s own workflow. General concern about AI is not enough.
The wedge can create value within one buying cycle. If proving value requires a broad platform rollout, several integrations, and organizational change across multiple functions, you have probably selected a destination rather than an entry point.
Product use improves your data advantage. Determine what feedback the product will capture, whether you can legitimately use it, how quickly it arrives, and which evaluation or model decision it improves. Data volume alone does not create an advantage.

Failing any one of these gates weakens the loop. Urgent pain without accessible data leaves the model blind. Useful data without a near-term outcome produces an experiment that customers may admire but will not adopt. A fast result that teaches you nothing can become a services engagement disguised as software.

Your first version should feel narrower and less polished than the roadmap in your head. It may require manual review, a rough operator interface, or close founder involvement. That is acceptable when the roughness is visible and recoverable. It is not permission to hide model failure, skip evaluations, or make consequential actions impossible to inspect. Cut breadth before you cut control.

Define completion around the loop rather than the interface. You are done with the first proof when a target customer can supply the relevant input, receive the intended outcome, show whether it helped, and feed the result back into a repeatable evaluation. Everything else competes with learning speed.

Choose an ICP that maximizes learning density

The largest market is rarely the most useful starting market. Your early ideal customer profile should concentrate urgency, usable data, workflow access, and a reachable decision maker. Those conditions allow one customer interaction to improve both the product and the go-to-market motion.

Create a one-page ICP card with four evidence fields:

Urgency: What is happening now that makes the buyer act rather than merely agree that the problem matters?
Data pathway: Which alerts, cases, decisions, errors, or other operational artifacts can enter the product and its evaluation process?
Workflow position: Where will your output appear, who will act on it, and what existing step must change?
Success signal: What observable customer outcome would justify adoption, renewal, or expansion?

Rate each field as high, medium, or low, but require an evidence note beside the rating. A founder’s conviction is not evidence. A buyer who can describe the cost of the problem, provide representative artifacts, identify the operator who owns the workflow, and agree on a success signal gives you something testable.

Keep adjacent customer segments out of the first learning loop when they require different data, integrations, evaluation criteria, or buying logic. A larger pipeline can make progress look healthier while mixing incompatible requirements into the roadmap. The practical test is simple: if two prospects would judge the same model behavior by different definitions of success, they should not share an early product plan merely because they could buy from the same company.

Founder-market fit helps here, but it is a compression mechanism rather than proof of product-market fit. Deep domain experience can sharpen the problem statement, reduce translation with buyers, and establish credibility. It cannot substitute for evidence that customers will change a workflow around the product.

Founder-led sales should therefore operate as a structured discovery system. For every lost or stalled opportunity, record the exact objection, the stage at which it appeared, the evidence offered by the buyer, and the suspected root cause. Do not turn each objection into a feature request. Cluster the objections, identify the two most common root causes, and turn those into experiments within a sprint.

Look beyond compliments when judging early product-market fit. Stronger signals appear when customers escalate a problem to your team, volunteer data that can improve the system, or invite you into the workflow where decisions are made. None is conclusive alone, but each requires the customer to spend trust, attention, or organizational effort. That makes them more informative than enthusiasm during a demo.

Organize the team around evaluation and deployment

A conventional feature organization separates product definition, engineering delivery, model work, implementation, and customer support. That creates handoffs precisely where an AI-native startup needs rapid feedback. Organize durable units around a customer outcome, with the people who can inspect data, change the system, deploy it, and evaluate the result.

The exact titles can vary, but the unit needs four capabilities:

Product judgment: Choose the scenario, define the customer outcome, and decide which uncertainty deserves the next experiment.
Model and data judgment: Design evaluations, diagnose failure patterns, and understand whether a change improves one scenario while damaging another.
Product engineering: Turn model behavior into a reliable workflow with usable interfaces, permissions, integrations, and operational controls.
Frontline deployment: Work safely in live customer environments, resolve implementation friction, and return generalizable learning to the product.

Forward deployed engineers and solutions engineers are especially valuable when the product has to enter complicated workflows. They can shorten the distance between a live failure and a product decision. They can also become an unofficial custom-development queue. Prevent that by attaching three fields to every customer-specific change: the broader pattern it tests, the product owner responsible for reviewing the result, and the condition under which the work will be standardized, stopped, or kept outside the core product.

AI fluency should also change your hiring process. Model vocabulary on a resume tells you little about how someone will reason under uncertainty. Give candidates a work sample built around a realistic failure: model performance changes across scenarios, the data distribution is moving, and customer trust is at risk.

Ask the candidate to define the evaluation, explain the consequences of false positives and false negatives, diagnose a precision-recall tradeoff, and translate the technical result into a product decision. The strongest response will not jump straight to a model change. It will clarify the scenario, inspect the evidence, identify what the current evaluation misses, and explain the tradeoff in terms an operator or buyer can use.

Use the same discipline for founder and leadership disagreement. Write down the principle in dispute, the evidence each side is using, the decision owner, the time box, and the condition that will trigger a review. Debate the principle rather than multiplying competing implementation plans. Once the decision is made, commit until the agreed review point. This preserves speed without pretending that an uncertain decision has become permanently correct.

Run model and business evidence on the same cadence

An AI-native startup needs two connected scoreboards. One shows whether the system is becoming more reliable. The other shows whether customers are receiving enough value to adopt, retain, and expand. Reviewing them separately makes it easy to celebrate a model improvement that does not change customer behavior, or a short-term commercial win built on fragile product performance.

Evidence layer	Question	Measures to inspect	Decision it should inform
Scenario quality	Does the system make the right judgment in each important situation?	Precision and recall by scenario; recurring misclassification patterns	Evaluation coverage, data work, model changes, and acceptable operating thresholds
Change detection	Does the system recognize when its environment is moving?	Drift, data freshness, and time to first signal for an emerging pattern	Refresh cadence, investigation priority, and new evaluation cases
Operational experience	Can the customer use the output without creating a new burden?	Alert-fatigue reduction and latency under load	Workflow design, prioritization, and deployment constraints
Business value	Is reliable behavior turning into durable adoption?	Activation, expansion, and Net Recurring Revenue	ICP focus, onboarding, positioning, and roadmap investment

Do not collapse these measures into one composite score. Aggregation hides the failure mode you need to fix. A model can improve overall precision while degrading a high-consequence scenario. Activation can remain flat even when evaluation results improve because the output arrives too late, lacks an escalation path, or does not fit the operator’s workflow. The broken link between the two scoreboards is often the most valuable finding in the review.

Make the weekly customer debrief evidence-based. Bring actual alerts, misclassifications, escalations, and deployment failures rather than a verbal account of how the customer feels. For each material artifact, record the hypothesis it challenges, the evaluation that can reproduce it, the next bet, and the owner. That written log becomes organizational memory when priorities change quickly.

Use an evaluation gate for every consequential release:

Run the relevant scenario-level evaluations, not only an aggregate benchmark.
Inspect newly introduced failure modes as well as improvements to the target case.
Check latency and operational behavior under the conditions the customer will actually use.
Name the owner of monitoring, escalation, and rollback before deployment.

This is not process for its own sake. In an AI product, observability is the control center for trust. It tells your team whether the model still deserves the authority the workflow gives it. Demo speed cannot compensate for an inability to see, explain, and correct failure in production.

Key takeaways for your next operating cycle

Define the product as a closed loop from customer signal to verified outcome, not as a list of AI capabilities.
Commit to a wedge only when the problem is quantified, value can appear within one buying cycle, and legitimate product use improves the data advantage.
Select the first ICP for urgency, data access, workflow access, and a clear success signal. Keep adjacent segments separate when they require different definitions of success.
Turn founder-led sales into a discovery system by logging every rejection, clustering root causes, and testing the two most common causes within a sprint.
Build durable customer-facing units that combine product, model, engineering, and deployment judgment.
Review scenario-level model evidence and business outcomes together, then investigate the link whenever one improves without the other.

For your next planning cycle, choose one ICP, one end-to-end loop, one scenario-based evaluation set, and one customer outcome. Give a single owner responsibility for showing the connection between them. If the team cannot trace that chain with production evidence, the roadmap is still too broad. Narrow it before adding another feature, segment, or model.

References

Shivam.Consulting Blog – Inside Artemis’ AI vs AI Security War: Hiring at Speed, PMF Signals, and Founder-Led Sales

April 21, 2026

From Brain Dump to Done: How Todoist’s Ramble Captures Tasks in Real Time with AI

Turning a rambling stream of consciousness into a clean task list while someone is still talking has been a longtime product dream of mine. With Ramble, Todoist brought that dream to life by using live audio AI to capture tasks in real time—no transcription step required. The result is a voice-to-task flow that feels natural, fast, and surprisingly disciplined.

As I listened to the Doist team—Ernesto Garcia (Front-end Product Engineer), Thomas Jost (Backend Software Engineer), and Hugo Fauquenoi (Product Manager)—walk through their approach, I heard a blueprint for building pragmatic GenAI features. What began as a two-to-three month AI exploration became one of their most technically deliberate releases: a “Gemini-powered pipeline that makes tool calls while the user is still speaking, surfacing tasks on screen in real time without any text output from the model.”

The breakthrough started with user research. People weren’t merely dictating tasks; they were doing a “brain dump” first—often into pen and paper or even ChatGPT voice—and only then committing items to Todoist. Meeting users where they already are reframed the problem: don’t force structure upfront; capture fluid thought and translate it into actionable tasks instantly.

That insight led to a bold architectural choice: skip transcription entirely and process raw audio directly with a Gemini live audio model. By removing the brittle middleman of text, the team reduced latency and kept the model focused on one job—turning intent into structured actions. It’s a crisp example of AI workflows designed for reliability over novelty.

The real magic is in the real-time “tool calls.” As the user speaks, the model triggers add task, edit task, and delete task operations immediately. For high-friction contexts like driving, they paired visual task cards with subtle sound effects as confirmation cues. It’s thoughtful conversation design that respects attention and safety without sacrificing speed.

Teaching the model to capture tasks literally—without over-interpreting or trying to complete the work—required careful prompt engineering for voice and temperature tuning. Drawing a bright line between “capture versus do” kept the experience trustworthy. In my own AI Strategy work, I’ve found that establishing explicit agentic guardrails early prevents unintended autonomy later.

Dates were the sleeper challenge. The team had to inject the current date, normalize to days vs. months, and always output dates in English for the natural language parser—while preserving the user’s original language for everything else. If you’ve ever shipped date handling across locales, you’ll appreciate how many edge cases hide in “Taming Dates and Time.”

Quality didn’t hinge on intuition alone. They built an LLM-judge eval system using real employee recordings from 100+ people across 35 countries in 20+ languages to catch prompt regressions. That’s eval-driven development done right: representative data, repeatable scoring, and tight feedback loops as models and prompts evolve.

For project and label matching, they chose direct context injection over RAG. Instead of building a retrieval pipeline, they injected the full project/label list into the system prompt. With smart context window management and a sharply constrained task schema, this was both simpler and more accurate. Sometimes the fastest path to product-market fit is removing moving parts, not adding them.

One product principle stood out: easy correction beats perfect first-time accuracy. Natural language interfaces earn trust when users can fix misfires in a tap or two. That bias toward quick recovery over false precision is how you ship AI that feels useful from day one.

Looking ahead, the roadmap is compelling: multimodal task capture from images and text blobs, Apple Watch support, and automation integrations. As voice AI agent patterns mature, this “tool-only architecture” sets a solid foundation for going from capture to coordinated execution—without losing the simplicity that makes Ramble shine.

If you want to hear the full conversation, you can listen on Spotify or Apple Podcasts. It’s a masterclass in building focused GenAI features that trade cleverness for clarity—and still delight.

Resources & Links: Todoist • Doist • Google Vertex AI (Gemini)

Inspired by this post on Product Talk.

April 16, 2026

How to Build a Trusted AI Product Platform That Scales

Your teams have AI pilots that work in a demo. Then the questions start. Security wants to know what data the system can reach. Product wants to know whether the answers are dependable. Support wants a fallback when the model fails. Executives want evidence that the investment is changing a customer or business outcome.

You do not need another impressive model response. You need a product platform that makes AI behavior understandable, controllable, and repeatable across use cases. That requires a trust architecture, a path from prototype to production, and metrics that expose failure instead of averaging it away.

Trust fails where an AI output crosses a decision boundary

Most teams discuss AI trust as if it were a property of the model. It is better understood as a property of the whole product system. A capable model can still create an untrustworthy experience if it uses the wrong context, hides a consequential assumption, calls an unauthorized tool, or leaves the user unable to correct an action.

The important moment is the handoff from generation to decision. Before that handoff, the output is a possibility. After it, someone may use it to answer a customer, change a record, prioritize work, or trigger another system. The controls you need depend on what crosses that boundary.

A practical way to classify AI use cases is by the authority you give the system:

Inform: The system summarizes, explains, retrieves, or drafts. A person still interprets the result.
Recommend: The system ranks options or proposes a next action. Its framing can materially influence a decision.
Act: The system invokes tools, changes state, communicates externally, or starts a workflow.

Use mode	Primary trust failure	Required product control	Evidence needed before release
Inform	An incorrect, incomplete, or untraceable answer	Visible scope, supporting evidence, uncertainty, and an easy correction path	An evaluator can reproduce the evidence path and identify known limitations
Recommend	A hidden assumption, weak comparison, or recommendation that ignores the user’s constraints	Explicit assumptions, alternatives, decision criteria, and user-editable constraints	Representative cases show whether the recommendation applies the intended rubric
Act	An unauthorized, excessive, or difficult-to-reverse change	Least-privilege access, previews, confirmation, audit records, and reversal where the underlying system supports it	Authorized reviewers validate simulated actions, denied actions, failure recovery, and a limited production path

This classification prevents a common planning error: giving every AI feature the same review process. A summarizer and an autonomous account-management agent should not pass through identical gates. The second system needs stronger identity, permission, confirmation, and recovery controls because its mistakes can propagate beyond the conversation.

For each proposed use case, ask five questions before discussing a model:

<!– wp:list {

April 15, 2026

How We Taught Agentic AI to Speak Product Analytics—and Unlocked Actionable Insights

I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.

The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.

We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.

Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check “minimum detectable effect (MDE)” for an A/B test, and summarize findings with references.

Architecturally, we treated this as “Agent Analytics”: an orchestrator that selects tools based on intent—querying Amplitude analytics or Pendo for behavioral paths and funnels, hitting our warehouse for cohort tables, or pulling experiment metadata and anomaly detection alerts. Tool use is permission-aware, auditable, and designed to fail safe. The agent’s outputs include citations back to the exact definitions, dashboards, and SQL used, so reviewers can validate and iterate.

Quality came from eval-driven development, not intuition. We built a gold set of representative product questions (activation inflections, retention analysis by segment, funnel drop-offs after feature launches) and scored the agent on faithfulness to definitions, numerical accuracy, latency, and actionability. We incorporated regression checks to catch drifts after schema changes, and we tuned prompts to reduce overconfident answers and push for clarifying questions when context was missing.

Safety and reliability were non-negotiable. We layered AI risk management with role-based access, guardrails that block destructive queries, and risk scoring for unfamiliar joins or sudden spikes in metric deltas. The agent logs every step—what it retrieved, which tools it called, and why—so analysts can replay and refine the chain of thought with transparent provenance.

The payoff: product teams now self-serve nuanced questions in minutes instead of days, and our analysts spend more time on discovery than report wrangling. Retention analysis improved as the agent standardized cohort logic; conversion investigations accelerated thanks to consistent funnel definitions; and cross-functional decisions aligned around the same driver trees and shared language. Most importantly, the agent turned ambiguous asks into structured analyses that stand up to scrutiny.

For fellow product leaders, my lesson is simple: start with semantics, not models. A crisp ontology, disciplined taxonomy, and clear ownership will outperform a flashy stack riddled with ambiguity. Avoid technology FOMO; favor retrieval-first grounding, small sharp tools, and continuous discovery with your product trios. When your organization speaks a common analytics language, agents can finally think with you, not just for you.

Next, we’re extending the agent’s planning skills to recommend experiment designs, estimate power and “minimum detectable effect (MDE),” and propose driver-tree-informed bet sizing. We’re also tightening feedback loops so every accepted answer, edit, or override strengthens the retrieval corpus and evaluations. The vision: a calm, reliable layer that makes rigorous product analytics feel conversational—and helps teams move from questions to confident action.

Inspired by this post on Amplitude – Best Practices.

April 13, 2026
Stop Drowning in Tasks: How AI Marketing Agents Restore Focus and Maximize Impact

Every week I meet marketers who are working harder than ever—more campaigns, more content, more dashboards—yet seeing less movement on metrics that matter. The surge of AI tooling has amplified activity, not necessarily impact. That’s the focus problem: we confuse motion with momentum, and our backlogs look great while our outcomes stall.

Learn how AI agents for marketing can help you prioritize impact so you can do important work, instead of just more work.

In my role leading product and growth teams, I’ve learned that AI only compounds value when it is pointed squarely at outcomes. If we don’t define what “good” looks like, agentic AI will simply scale busywork. The antidote is a disciplined operating model that connects strategy to execution and instruments agents with clear success criteria.

First, anchor your program with outcomes vs output OKRs. Choose one or two measurable business outcomes—such as qualified pipeline, conversion rate, or activation—and make everything else subordinate. This provides the compass agents need to make effective trade-offs when speed and volume tempt you to do “one more thing.”

Second, map a driver tree from the target outcome down to the controllable levers: audience segments, offers, channels, messaging, and experience friction. This traceability shows where agents can move the needle fastest—whether that’s accelerating research, sharpening positioning, or eliminating handoffs that slow experimentation.

Third, design a small, agentic AI workforce aligned to those levers. For example: a Research Agent that synthesizes market insights and past performance; a Copy Agent that generates on-brief, on-brand variants; a Distribution Agent that adapts content to each channel and schedules posts; and an Analytics Agent that runs A/B tests, summarizes results, and flags anomalies. Keep human oversight where judgment matters most—strategy, brand voice, and high-stakes decisions.

Fourth, instrument rigor from day one with Agent Analytics and eval-driven development. Define offline evals for brand consistency, factuality, safety, and response time; pair them with online experiments that quantify lift on your target outcomes. Set a minimum detectable effect (MDE) so you stop shipping changes that cannot plausibly move the metric.

Fifth, operationalize your AI workflows. Standardize prompts, inputs, and handoffs; templatize briefs and acceptance criteria; and keep a change log so improvements compound rather than reset. Use short, frequent feedback loops to prune low-impact work and double down on what demonstrably advances your objectives.

I’ve seen teams reclaim focus and momentum when they treat agents as teammates, not toys. The magic isn’t in producing more assets—it’s in consistently choosing the next best action in service of a clear outcome. When you combine outcome clarity, a driver tree, targeted agents, and tight evals, AI becomes a force multiplier for marketing impact.

If you’re feeling overwhelmed by AI’s possibilities, start small: commit to one outcome, one driver you believe is material, and one agent designed for that job. Prove lift, codify the workflow, then scale. Velocity is only valuable when it’s pointed in the right direction.

Inspired by this post on Amplitude – Best Practices.

April 10, 2026
Stop Forcing AI to Prove ROI: A Product Leader’s Playbook to Measure Real Business Value

Every planning cycle, I feel the drumbeat: “Show me the AI ROI—this quarter.” The pressure is real, especially when boards and CFOs expect immediate payback. Yet when I review stalled initiatives across teams and peers, the pattern is consistent: most companies treat AI like a feature to ship, not a system to manage. That mindset almost guarantees we measure the wrong things, declare victory (or failure) too early, and miss the durable value AI can create.

Here’s the core problem I see: we leap to solution and skip the counterfactual. Without a baseline, a clear control, or a defined “what would have happened otherwise,” we’re guessing. We also fixate on lagging, financial KPIs that move slowly (revenue, cost, risk), then use outputs—not outcomes—as OKRs. If we don’t align on outcomes vs output OKRs upfront, the best team in the world can still optimize for activity over impact.

My AI Strategy starts from a simple truth: value shows up along three vectors—revenue, cost, and risk—on different timelines. In the near term, we must validate leading indicators (adoption, engagement, activation) that ladder to those vectors through a transparent driver tree. Over time, those drivers compound into the lagging KPIs finance cares about. When we make the driver tree explicit, everyone can see how model precision, response time, and workflow integration roll up to conversion lift, case deflection, time-to-resolution, or reduced exposure.

To make this rigorous, I run a five-step playbook. First, define the decision and business outcome in plain terms. Second, instrument the baseline with behavioral analytics on a unified analytics platform—tools like Amplitude analytics or Pendo help expose friction points we’ll later target. Third, create a counterfactual using A/B testing and specify a minimum detectable effect (MDE) so we know how long to run and how much traffic we need. Fourth, quantify costs (training, inference, integration, change management) and include AI risk management, privacy-by-design, and data governance up front. Fifth, lock a measurement plan that connects leading indicators to lagging ROI through the driver tree.

Most AI initiatives don’t fail on model quality—they fail on adoption. If the workflow isn’t smoother, trust isn’t earned, or value isn’t obvious, users revert. That’s why I invest early in onboarding, in-app guides, product tours, and thoughtful tooltip design to reduce the time-to-first-value. Then I watch user activation, retention analysis, and task completion to ensure the assistive experience is not just novel—it’s habit-forming.

For generative use cases, eval-driven development is non-negotiable. I maintain offline evaluations for accuracy and safety, and online evaluations for business impact. Retrieval-first pipeline health, context window management, and prompt engineering affect reliability; so do latency and grounding quality. We ship behind feature flags, measure guardrail effectiveness, and tighten feedback loops from human-in-the-loop reviews into model updates—continuously.

On the business side, I avoid “AI theater” by structuring benefits like a CFO. Revenue: increased conversion or expansion driven by better recommendations, faster sales cycles, or higher trial activation. Cost: case deflection, agent time saved, fewer escalations, and lower rework. Risk: reduced exposure via automated checks, anomaly detection, and consistent policy application. If any claim can’t be tied to measured deltas—via A/B testing or strong quasi-experiments—it doesn’t go in the deck.

Build vs buy deserves the same discipline. I map platform scalability, governance requirements, and total cost of ownership against time-to-impact. Teams often underestimate integration and maintenance drag; a pragmatic mix of bought components with thin custom layers can accelerate outcomes while keeping options open. The goal isn’t to own every layer—it’s to own the learning loop and the differentiated experience.

I also remind teams that tooling should serve the strategy, not replace it. I’ve seen concise, effective messaging that captures the point: “Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.” The words are compelling because they reflect the three-vector value model and the adoption imperative. The same standard should apply to any AI initiative we propose.

If you’re under pressure to prove ROI, shift the conversation: lead with the driver tree, specify your counterfactual, and anchor on leading indicators you can move in weeks—not quarters. Then connect those to the lagging KPIs finance expects over time. When we manage AI like a product—grounded in evidence, experimentation, and user-centered adoption—we don’t have to force ROI. We compound it.

Inspired by this post on Pendo – Perspectives.

April 8, 2026