
AI Product Validation: From Promising Demo to Proven Value

You have an AI demo that looks impressive. It answers the happy-path prompt, the latency seems acceptable, and stakeholders can already imagine the launch. The uncomfortable question is whether any of that proves the product is worth building.

It does not. A useful validation process has to reduce several different risks: whether customers care, whether the workflow helps them, whether the AI performs reliably, whether the economics work, and whether failures remain tolerable. Test those risks in that order and you can make a defensible investment decision without turning production traffic into your debugging environment.

Define the decision before you design the AI

The first artifact for an AI initiative should not be a model shortlist or a prototype. It should be a decision contract that states what must become true for the initiative to deserve more investment.

A practical decision statement has this shape: For a defined user in a defined situation, the proposed capability will improve an observable outcome relative to the current alternative, without breaching named guardrails. If the agreed threshold is met, you will advance. If it is not, you will stop or change a specific assumption.

Write down these five elements before the experiment begins:

User and job: Name who encounters the problem, when it occurs, and what they are trying to complete. A broad label such as knowledge workers is not precise enough to design a useful test.
Current alternative: Record what the user does now, including manual work, an existing product flow, a rules engine, or simply tolerating the problem. This is the baseline the AI must beat.
Observable outcome: Choose a user or business result, not a model activity. Task completion, time-to-value, corrected routing, rework, repeat use, or downstream resolution can carry more meaning than generation counts or prompt volume.
Success threshold and guardrails: Decide how much improvement would justify the cost and what must not deteriorate. Safety failures, latency, privacy exposure, retention, and cost per successful outcome can all constrain an otherwise positive result.
Decision rule: State what evidence will trigger expansion, another iteration, a change in direction, or cancellation. Precommitting prevents enthusiasm for a polished demo from moving the goalposts later.

The threshold is not universal. It should reflect the value of the outcome, the implementation and operating costs, the consequences of errors, and the return available from competing roadmap work. Minimum detectable effect belongs here: define the smallest improvement that would actually change your decision, then size the test to detect that effect. A test that cannot distinguish a worthwhile gain from noise is not a faster test. It is a delayed decision.
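To make the sizing step concrete, here is a minimal sketch that uses the standard normal-approximation formula for comparing two proportions. The function name, the 40% baseline, the lift values, and the default significance level and power are all illustrative assumptions, not recommendations.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate units needed per arm for a two-sided two-proportion z-test.

    baseline_rate: outcome rate under the current alternative (assumed known).
    mde_abs: smallest absolute lift that would change the decision.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Illustrative numbers only: a 40% baseline with two candidate effect sizes.
n_large_effect = sample_size_per_arm(0.40, 0.05)
n_small_effect = sample_size_per_arm(0.40, 0.025)
print(n_large_effect, n_small_effect)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why the effect size should come from the decision rather than from the traffic you happen to have.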

A driver tree helps prevent a common measurement mistake. Start with the desired outcome, connect it to the user behaviors that could produce that outcome, and then connect those behaviors to system-level drivers. For an AI support-triage capability, the outcome might be faster correct routing. Accepted category and priority suggestions are leading signals; downstream corrections, reassignment, and resolution are closer to the outcome. Model classification accuracy matters, but it is only one driver in the chain.

If the proposal involves an autonomous or semi-autonomous agent, run a precondition check before planning the experiment. Volume, instructions, tolerance, access, and a learning loop expose whether agentic complexity is justified:

Volume: Does the workflow happen often enough for automation to create meaningful leverage?
Instructions: Can success, constraints, and exceptions be expressed in testable terms?
Tolerance: Is the likely failure reversible, detectable, and contained?
Access: Can the system use the necessary data and tools with reliable integrations and least-privilege permissions?
Learning loop: Can you measure quality, latency, cost, and failures after launch?

A missing condition tells you what to validate next. Unclear instructions call for more discovery and rubric design. Weak access calls for an integration or data-quality spike. Low error tolerance calls for approvals and a narrower action space. Low volume may mean that a clear workflow, a rule, or better product UX is the better answer. The purpose of validation is not to prove that AI belongs in the solution; it is to discover whether it does.
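The mapping from a missing condition to the next validation step can be written down as a small lookup. This is a sketch under assumptions: the key names and function are hypothetical, and the step for a missing learning loop is inferred from the question in the list rather than stated in the text.

```python
# Next validation step for each unmet precondition, paraphrased from the text.
NEXT_STEP = {
    "volume": "Consider a clear workflow, a rule, or better product UX instead",
    "instructions": "Run more discovery and design a rubric",
    "tolerance": "Add approvals and narrow the action space",
    "access": "Run an integration or data-quality spike",
    # Assumption: the text does not name a step for this one.
    "learning_loop": "Instrument quality, latency, cost, and failures first",
}


def precondition_gaps(answers):
    """Return (condition, next_step) pairs for every condition not yet satisfied.

    answers: dict mapping condition name to True or False.
    A condition missing from `answers` is treated as unmet.
    """
    return [
        (name, step)
        for name, step in NEXT_STEP.items()
        if not answers.get(name, False)
    ]


gaps = precondition_gaps({
    "volume": True,
    "instructions": False,
    "tolerance": True,
    "access": True,
    "learning_loop": True,
})
print(gaps)
```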

Climb an evidence ladder instead of jumping to a pilot

An oversized pilot often mixes market, usability, model, integration, and operational risk into one expensive test. When the result disappoints, nobody knows which assumption failed. An evidence ladder gives each experiment one dominant question and increases fidelity only after the previous uncertainty has been reduced.

Question to answer	Lean experiment	Evidence to inspect	What it does not prove
Do users care enough to act?	Painted door, landing page, waitlist, concierge offer, preorder, or deposit where appropriate	Click-through intent, qualified sign-ups, willingness to pay, and continued requests	Usability, AI quality, or scalable delivery
Can the proposed workflow help?	Wizard-of-Oz flow or realistic interactive prototype	Task completion, time on task, errors, material friction, and repeat use	Whether an AI system can deliver the experience reliably
Can the system perform the job?	Offline evaluation on a curated golden set plus targeted technical spikes	Rubric results by case type, failure patterns, latency, and cost	Whether the complete product changes user behavior
Does the product improve the target outcome?	Feature-flagged A/B test or holdout	Primary outcome, leading indicators, cohort effects, and guardrails	Long-term stability under every operating condition
Can it operate within acceptable risk?	Capped rollout with approvals, audit logs, monitoring, and rollback controls	Harm and privacy events, reversals, escalations, reliability, and cost per successful outcome	That future changes will remain safe without continued evaluation

Use the first row when demand is the dominant risk. A painted-door click is a signal of curiosity, not proof of durable value. A qualified sign-up asks for more commitment. A preorder or deposit, when honest and operationally appropriate, tests willingness to pay. Repeated use of a manually delivered service provides stronger behavioral evidence. Do not collapse these signals into a single conversion metric; they represent different levels of commitment.

Once demand appears credible, use a prototype or Wizard-of-Oz flow to learn whether the proposed interaction helps. Pretotyping should answer whether the product deserves to exist, while prototyping should answer how it needs to work. Keeping those questions separate prevents a polished interface from disguising weak demand and prevents a crude early interface from killing a valuable idea before its workflow has been understood.

These experiments still owe users honest expectations. A painted door should reveal that the capability is unavailable after the user expresses interest, rather than pretending it already exists. A concierge or Wizard-of-Oz flow should be explicit about how data will be handled and what follow-up the participant can expect. Deception can manufacture a metric while damaging the trust the eventual product will need.

Advance when the evidence changes the dominant uncertainty. Strong demand does not authorize a production launch; it authorizes a workflow test. A usable workflow authorizes a system evaluation. An offline pass authorizes limited exposure. Each rung earns the next investment without pretending to answer questions it was not designed to answer.

Separate model quality from product value

A model can produce better answers while the product creates less value. Added latency can interrupt the workflow. A retrieval failure can ground an otherwise capable model in the wrong context. A user may spend more time checking and rewriting an answer than doing the task manually. This is why a single accuracy score cannot validate an AI product.

Build a golden set from the work users actually do

Eval-driven development starts before production traffic. Build a curated set of cases that reflects real user complexity, then turn your definition of good into a reproducible scoring process.

Define the evaluation unit: Score the completed job whenever possible, not merely an isolated response. An agent that drafts a correct message but sends it to the wrong destination has failed the job.
Represent meaningful variation: Include normal cases, longer and shorter inputs, ambiguous requests, important customer segments, and known edge conditions. A convenience sample of clean happy paths measures demo readiness, not product readiness.
Tag each slice: Label cases by intent, complexity, risk, input type, or other distinctions that could conceal a concentrated failure. Aggregate performance can improve while a critical slice gets worse.
Write a multidimensional rubric: Score correctness, completeness, groundedness, safety, tone, policy compliance, and any task-specific requirements separately. Add latency and cost as system measures rather than blending everything into an opaque average.
Choose a real baseline: Compare the candidate with the current product, manual workflow, rules-based approach, or incumbent model. The relevant question is not whether the candidate looks capable in isolation; it is whether switching produces enough value.
Preserve regression evidence: Keep a stable set for comparisons and add newly discovered failures to an evolving challenge set. This turns production learning into protection against recurrence.
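The point about tagged slices can be illustrated with a short sketch. The slice names and results below are made up, and both helper functions are hypothetical; the example only shows how an overall pass rate can rise while one slice regresses.

```python
from collections import defaultdict


def pass_rate_by_slice(results):
    """results: iterable of (slice_tag, passed) pairs from one evaluation run."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for slice_tag, passed in results:
        totals[slice_tag][0] += int(passed)
        totals[slice_tag][1] += 1
    return {tag: p / n for tag, (p, n) in totals.items()}


def regressed_slices(baseline, candidate, tolerance=0.0):
    """Slices where the candidate pass rate trails the baseline by more than tolerance."""
    base = pass_rate_by_slice(baseline)
    cand = pass_rate_by_slice(candidate)
    return sorted(tag for tag in base if cand.get(tag, 0.0) < base[tag] - tolerance)


# Made-up data: the overall pass rate rises while the "refund" slice gets worse.
baseline = [("faq", True)] * 6 + [("faq", False)] * 4 + [("refund", True)] * 2
candidate = [("faq", True)] * 10 + [("refund", True), ("refund", False)]

print(regressed_slices(baseline, candidate))
```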

Keep the measurement layers visible in every readout:

Output quality: correctness, completeness, groundedness, tone, safety, and compliance.
System performance: retrieval quality, tool execution, policy enforcement, latency, reliability, and cost.
User outcome: task completion, time-to-value, edits, rejection, rework, escalation, and repeat use.
Business consequence: the downstream result the initiative was funded to improve, along with retention or other core guardrails where relevant.

Each layer diagnoses a different problem. If output quality is weak, work on context, prompts, retrieval, tools, policies, or the model. If output quality passes but completion does not improve, inspect the interaction and workflow. If users succeed but the cost per successful outcome is unacceptable, narrow the use case or revisit the architecture. A composite score can hide these distinctions at exactly the moment you need them.

Test the behavior distribution, not a lucky response

AI output is variable, so a candidate should not pass because one run happened to look good. Use two evaluation modes. A regression configuration should be as controlled as the system allows, with model, prompt, retrieval, tool, temperature, top-p, and seed settings documented where they apply. A production-like configuration should match the variability users will experience and repeat cases often enough to reveal unstable behavior and tail failures.

Run candidate and baseline systems on the same cases under comparable settings.
Inspect results by slice and failure type, not only the overall average.
Repeat stochastic cases so the team sees consistency, variance, and severe outliers.
Automate clear rubric checks, but retain human review for ambiguous or high-consequence judgments.
Version the model, prompt, retrieval configuration, tools, policies, and evaluation set so a change can be reproduced.

This creates a release gate instead of a demo contest. Offline evaluation will not prove market value, but it can prevent known regressions, unsafe behavior, and obviously weak variants from consuming customer trust in a live experiment.
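The repeated-run step can be sketched as a small summary over several executions of the same case. The scorer here is a scripted stand-in for a real stochastic system, and the function name, run count, and severity cutoff are assumptions chosen for illustration.

```python
import itertools
import statistics


def summarize_repeated_runs(score_case, case, runs=20, severe_below=0.3):
    """Run one case several times and summarize the score distribution.

    score_case: callable returning a rubric score in [0, 1] for one run.
    severe_below: scores under this value count as severe outliers.
    """
    scores = [score_case(case) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "worst": min(scores),
        "severe_outliers": sum(s < severe_below for s in scores),
    }


# Scripted stand-in: three good runs, then one bad run, repeating.
scripted = itertools.cycle([0.9, 0.9, 0.9, 0.1])
summary = summarize_repeated_runs(
    lambda case: next(scripted), "ambiguous refund request", runs=8
)
print(summary)
```

A mean near 0.7 would look tolerable on its own; the worst score and the outlier count are what reveal the tail failure.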

Make the production test answer a business decision

Production exposure is justified when demand, workflow, and offline performance have enough evidence behind them. The live test should then answer a narrow causal question: does access to this capability improve the intended outcome for the eligible population, compared with the current experience, without violating the operating constraints?

Instrument the complete causal chain

Your event schema should connect eligibility to exposure, interaction, system behavior, task completion, and downstream consequences. At minimum, distinguish these moments:

The user or account became eligible for the test.
The treatment was actually shown or made available.
The capability was invoked, whether explicitly or automatically.
The system succeeded, failed, timed out, or triggered a safeguard.
The output was displayed, accepted, edited, rejected, reversed, or escalated.
The target task was completed or abandoned.
The downstream outcome occurred, such as a correction, reassignment, reopening, or successful resolution for a support workflow.

Attach the cohort and the relevant model, prompt, retrieval, tool, and policy versions to the trace. Capture latency, cost, and safety results without indiscriminately logging sensitive payloads. Privacy-by-design and data governance determine which data may be retained, who may inspect it, and how long it should remain available.

Missing links create predictable misreadings. Without an exposure event, low adoption can be confused with low visibility. Without version information, a regression cannot be tied to a system change. Without the downstream event, acceptance can be mistaken for value even when users later undo the AI’s work.
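One way to keep those links explicit is to record each moment as a staged event and count how many units reach each stage. This is a minimal sketch: the stage names, the `TraceEvent` fields, and `funnel_counts` are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Ordered stages of the causal chain; the names are illustrative.
STAGES = [
    "eligible", "exposed", "invoked", "system_result",
    "user_response", "task_result", "downstream",
]


@dataclass
class TraceEvent:
    unit_id: str    # user or account, matching the randomization unit
    stage: str      # one of STAGES
    cohort: str     # for example "treatment" or "control"
    versions: dict = field(default_factory=dict)  # model, prompt, retrieval, tool, policy
    detail: Optional[str] = None  # e.g. "accepted" or "timed_out"; no sensitive payloads

    def __post_init__(self):
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage}")


def funnel_counts(events):
    """Count distinct units that reached each stage, so drop-offs are attributable."""
    reached = {stage: set() for stage in STAGES}
    for event in events:
        reached[event.stage].add(event.unit_id)
    return {stage: len(units) for stage, units in reached.items()}


events = [
    TraceEvent("acct-a", "eligible", "treatment"),
    TraceEvent("acct-a", "exposed", "treatment", {"prompt": "v3"}),
    TraceEvent("acct-b", "eligible", "treatment"),
]
print(funnel_counts(events))
```

With separate eligible and exposed counts, a unit that never invoked the capability can be told apart from one that never saw it.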

Choose the design and sample around the decision

Randomization: Choose user, account, workflow, or time window based on where contamination can occur. If people in one account share outputs, user-level assignment may mix treatment and control experiences.
Population: Define eligibility before launch. Balance or stratify meaningful groups such as new accounts and power users when their behavior or exposure differs.
Primary metric: Select one outcome that can settle the main question. Treat diagnostic measures as supporting evidence, not a menu from which to pick a winner later.
Guardrails: Monitor core experience, retention where relevant, time-to-value, safety, privacy, reliability, and cost. Write rollback conditions for unacceptable movement before exposure begins.
Effect size and power: Set the minimum detectable effect from the business decision, estimate the required sample, and acknowledge when available traffic cannot support the desired conclusion.
Exposure control: Use feature flags, a capped rollout, and a holdout so you can stop quickly and preserve a valid comparison.
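Randomization and exposure control are often implemented with deterministic hashing, so the same unit always lands in the same arm. The sketch below assumes that approach; the function name, experiment key, and 10% cap are illustrative, and a real feature-flag service may work differently.

```python
import hashlib


def assign_arm(unit_id, experiment, exposure_cap=0.10):
    """Deterministically map a randomization unit to "treatment" or "control".

    unit_id should identify the level where contamination is contained, for
    example an account ID when people in one account share outputs.
    exposure_cap is the share of units allowed into treatment.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < exposure_cap else "control"


arms = [assign_arm(f"acct-{i}", "triage-suggestions-v1") for i in range(2000)]
print(arms.count("treatment") / len(arms))
```

Setting `exposure_cap` to 0 stops new treatment exposure without changing any unit's bucket, so raising the cap again later restores the same assignments.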

Standard A/B testing fits many product changes. Ranking and retrieval changes can benefit from interleaving when alternatives can be compared within the same experience. Switchback designs can help when time, seasonality, or shared operating conditions make simultaneous assignment misleading. Match the design to the interference in the workflow instead of defaulting to the experiment template you use for deterministic UI changes.

AI variability also changes the readout. Aggregate outcomes across the multiple interactions users have, compare cohorts, and track confidence intervals over time. A snapshot p-value should not overrule an underpowered test, an unstable effect, or a concentrated safety failure. A statistically inconclusive result means the test did not resolve the decision; it does not prove that the feature has no effect.
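The difference between "no effect" and "not resolved" shows up when the interval is compared with the decision threshold. The following sketch uses a simple Wald interval for a difference in proportions; the counts, the threshold, and both function names are assumptions, and a real analysis may need a method that accounts for repeated interactions per user.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(success_t, n_t, success_c, n_c, confidence=0.95):
    """Wald interval for (treatment rate - control rate)."""
    p_t, p_c = success_t / n_t, success_c / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


def readout(interval, mde):
    """Compare the interval with the smallest effect that would change the decision."""
    low, high = interval
    if low >= mde:
        return "effect clears the threshold"
    if high < mde:
        return "effect is below the threshold"
    return "inconclusive: the test did not resolve the decision"


# Illustrative counts: 230 of 500 in treatment, 200 of 500 in control.
interval = diff_confidence_interval(230, 500, 200, 500)
print(interval, readout(interval, mde=0.05))
```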

Prewrite the scale, iterate, and stop rules

I prefer four explicit decision states because they force the readout to connect evidence to action:

Scale: The primary outcome clears the meaningful threshold, guardrails hold, important cohorts do not show an unacceptable reversal, and reliability and cost remain viable.
Iterate the AI system: User intent is strong, but a defined output or system failure blocks value. The next test should target that failure rather than repeat the same broad pilot.
Change the product experience: Offline quality passes, but users cannot discover, trust, control, or efficiently use the capability. Treat this as workflow evidence, not an automatic reason to swap models.
Stop or reframe: Demand is weak, the economics cannot work, the necessary data or access is unavailable, or credible risk remains outside tolerance.
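Prewriting these states can be as simple as a function that maps named evidence flags to one outcome. The flag names and the order of checks below are assumptions, and the "unresolved" return value is an addition for readouts the four states do not cover; a real contract should close that gap before the test starts.

```python
def decision_state(evidence):
    """Map a readout to a decision state. Missing flags default to False."""
    flag = lambda name: bool(evidence.get(name, False))

    if (flag("demand_weak") or flag("economics_unworkable")
            or flag("access_unavailable") or flag("risk_outside_tolerance")):
        return "stop_or_reframe"
    if flag("output_or_system_failure"):
        return "iterate_ai_system"
    if flag("experience_blocks_use"):
        return "change_product_experience"
    if (flag("outcome_clears_threshold") and flag("guardrails_hold")
            and flag("cohorts_ok") and flag("reliability_and_cost_viable")):
        return "scale"
    # Assumption: anything else goes back to the decision owner.
    return "unresolved"


positive = {
    "outcome_clears_threshold": True,
    "guardrails_hold": True,
    "cohorts_ok": True,
    "reliability_and_cost_viable": True,
}
print(decision_state(positive))
print(decision_state({**positive, "experience_blocks_use": True}))
```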

Risk must be part of the launch rule, not a review added after a positive metric appears. Include toxicity and personally identifiable information checks where relevant, enforce least-privilege access, retain appropriate audit logs, and make rollback operational before exposure. For irreversible financial actions, sensitive regulatory decisions, or any workflow where the acceptable error rate is effectively zero, keep a qualified human approval step or defer autonomy. Faster execution does not compensate for an unacceptable blast radius.

Autonomy should be earned in stages. Begin with assistance that the user can inspect. Move to required approval before actions. Allow autonomous execution only for narrow, low-stakes, reversible actions after stability is demonstrated. Expand permissions and exposure only when monitoring shows that the earlier guardrails continue to hold.

The experiment does not end at launch. Model behavior, retrieval content, user mix, prompts, tools, and operating costs can change. Continue tracking quality, latency, cost per successful outcome, safety, and cohort behavior. Feed new failures into the evaluation set and keep a holdout when the decision warrants one. A weekly readout should identify what changed, which assumption the evidence affected, and what decision follows; it should not become a tour of every available dashboard.

Key takeaways

Start with a precommitted decision contract: user, job, baseline, outcome, threshold, guardrails, and next action.
Validate demand before usability, usability before system capability, and system capability before broad production impact.
Compare the AI with the user’s current alternative, not with an abstract standard of impressive output.
Measure output quality, system performance, user outcomes, and business consequences separately so failures remain diagnosable.
Treat stochastic behavior as a distribution: document configurations, repeat runs, inspect slices, and watch severe outliers.
Use feature flags, holdouts, exposure caps, auditability, and prewritten rollback rules to contain risk while learning.

At your next AI review, ask for the experiment contract instead of another demo. If the team cannot name the dominant risk, current baseline, meaningful threshold, guardrails, and action for each possible result, the next step is not production exposure. It is a sharper test.

Start with the smallest experiment that could credibly invalidate the idea. Evidence that survives that test earns the right to spend more, increase fidelity, and expose more users.


April 27, 2026

Auditable AI Code Review: A Practical Operating Model

You are not deciding whether an AI model can find bugs in a pull request. You are deciding whether an automated reviewer can participate in a production control without leaving your team unable to explain, challenge, or reverse its decision.

If the only evidence behind an approval is a bot comment that says the change looks safe, keep the system advisory. An auditable AI reviewer needs a bounded mandate, a deterministic approval policy, traceable evidence, and a feedback loop tied to production outcomes. Build those controls first, and faster review becomes a consequence rather than a gamble.

Start with a decision contract, not a model prompt

An approval is a policy decision. The model can supply findings, evidence, and a recommendation, but it should not define the conditions under which its own recommendation becomes authoritative.

Write a decision contract before selecting a model or tuning a prompt. It should answer five questions:

What may the system decide? Typical outcomes are approve, request changes, provide non-blocking comments, or escalate to a person.
Which changes are eligible? Eligibility should be determined by explicit repository, path, change-type, test, ownership, and reversibility rules.
Which checks are mandatory? An eligible pull request should not be approved if a required review lens failed to run, returned incomplete evidence, or produced an unresolved blocking finding.
When must the system abstain? Missing context, conflicting findings, unavailable tools, excessive scope, low-confidence evidence, and protected code paths should cause escalation rather than optimistic approval.
Who owns the result? Name the engineer accountable for the change, the owner of the review policy, and the person or group authorized to change the automation boundary.

The core approval rule can be expressed plainly: the change is eligible, every mandatory check completed, no blocking issue remains, the evidence record is complete, and no human-review requirement was triggered. Encode that rule in a controller your team can inspect and test. Do not bury it inside natural-language instructions to the model.
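As a sketch of what an inspectable controller rule might look like, the five conditions can be encoded as one pure function over structured inputs. The `ReviewState` type and its field names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReviewState:
    eligible: bool
    mandatory_checks_completed: bool
    blocking_findings: int
    evidence_complete: bool
    human_review_required: bool


def may_auto_approve(state: ReviewState) -> bool:
    """Deterministic approval rule: every condition must hold."""
    return (
        state.eligible
        and state.mandatory_checks_completed
        and state.blocking_findings == 0
        and state.evidence_complete
        and not state.human_review_required
    )


clean = ReviewState(True, True, 0, True, False)
print(may_auto_approve(clean))
print(may_auto_approve(replace(clean, evidence_complete=False)))
```

Because the rule takes only structured fields, it can be unit-tested and versioned like any other policy code, independent of the model that produced the findings.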

This separation gives you a clean control plane. Review agents analyze the change. A policy engine evaluates their structured results. A narrowly permissioned service performs the approved action. The model never gets to reinterpret the boundary at the moment it encounters a difficult pull request.

Auditability does not require a future model run to reproduce the same words. Model endpoints, retrieved context, and dependencies can change. It requires the original decision to remain reconstructable from preserved inputs, outputs, policies, tool results, and versions. A skeptical engineer should be able to determine why the pull request was approved without trusting the personality or reputation of the bot.

Split the review into specialist checks with explicit evidence

A single prompt asking whether a pull request is safe compresses several different judgments into one opaque answer. Decompose the review so that each judgment has a clear purpose, input set, output schema, and failure mode.

A practical review pipeline can include these specialist lenses:

Problem-definition quality: Is the requested behavior specific enough to review, and are the acceptance conditions testable?
Intent alignment: Does the diff implement the stated change without silently expanding or contradicting it?
Scope and dependency impact: Which callers, data flows, interfaces, jobs, or services can the change affect outside the edited files?
Logical correctness: Do the changed execution paths handle expected states, boundary conditions, and failure paths?
Test adequacy: Do the tests exercise the behavior that changed, and did the required checks actually run against the reviewed commit?
Security and privacy: Does the change alter trust boundaries, permissions, authentication, secrets, sensitive data handling, or externally controlled inputs?
Local engineering guidance: Does the implementation comply with versioned repository conventions, architectural constraints, and known anti-patterns?
Deployment and recovery: Can the change be observed, disabled, or rolled back without creating a second unsafe operation?

Every specialist should return the same minimum structure: a check identifier, pass/fail/escalate status, a concise claim, evidence tied to files or tool output, the applicable rule version, severity, and a recommended action. A finding such as a possible regression is not auditable. A finding that identifies the affected path, explains the conflicting behavior, points to the relevant code, and names the violated policy is.
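A shared output structure could look like the sketch below. The field names follow the list in the paragraph above, while the `Status` values, the severity vocabulary, the example findings, and the `is_auditable` check are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Finding:
    check_id: str
    status: Status
    claim: str
    evidence: tuple          # file references or tool-output references
    rule_version: str
    severity: str            # e.g. "blocking" or "info"
    recommended_action: str


def is_auditable(finding: Finding) -> bool:
    """A finding needs a claim, concrete evidence, and a named rule version."""
    return (
        bool(finding.claim.strip())
        and bool(finding.evidence)
        and bool(finding.rule_version.strip())
    )


# Hypothetical examples: a vague finding and a specific one.
vague = Finding("logic", Status.FAIL, "possible regression", (), "", "blocking", "investigate")
specific = Finding(
    "logic", Status.FAIL,
    "Empty cart reaches checkout() without validation",
    ("cart/checkout.py:42",), "correctness-rules@v3",
    "blocking", "add a guard and a test",
)
print(is_auditable(vague), is_auditable(specific))
```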

Run independent checks before aggregation, and preserve every result even when the final decision is approval. The aggregator may deduplicate findings, but it should not erase dissent. If the intent checker says the change is aligned while the execution-path checker finds a contradiction, route the conflict to a person.

Review context must extend beyond the visible diff. A seemingly harmless one-line copy change was once found to contradict validation behavior elsewhere in the codebase. That is the kind of defect a diff-only reviewer is structurally unlikely to see. Give relevant checks controlled access to callers, validators, schemas, tests, ownership metadata, and versioned internal guidance, then record exactly which context each check used.

More context is not automatically better. Retrieval should be targeted and attributable. When a finding depends on an internal rule, capture the rule identifier and version. When it depends on a test, capture the command, commit, status, and output reference. When it depends on inferred execution flow, record the relevant path so a maintainer can inspect it.

Treat pull-request text, code comments, test fixtures, generated files, and documentation on the changed branch as untrusted data. They can contain instructions designed to redirect an agent. Load approval policy from a protected service or the trusted base branch, not from files the pull request can rewrite. Run proposed code in an isolated environment, mediate tool calls through an allowlist, and keep approval or merge credentials outside the model’s reach. The policy controller should translate a valid decision into an action; the model should never hold the credential that performs it.

Set the automation boundary with hard eligibility gates

Do not begin by assigning every pull request a risk score and approving anything below a convenient number. A composite score can hide a disqualifying condition: a tiny authorization change may receive a low size score even though its blast radius is high. Apply hard gates first. Use scoring only to route changes that remain eligible after those gates.

Common reasons to require human review include:

Authentication, authorization, permissions, cryptography, secrets, or trust-boundary changes.
Payments, billing, entitlements, destructive data operations, or irreversible migrations.
Public API contracts, shared schemas, release infrastructure, or broadly consumed dependencies.
A pull request that changes its own review policy, test requirements, ownership rules, or deployment controls.
Missing required tests, failed or stale CI results, unavailable analysis tools, or a mismatch between the reviewed commit and the tested commit.
Changes spanning too many concerns, components, or execution paths for the approved review envelope.
An active incident, an unclear rollback path, or a direct request for human review.

There is no universal line-count threshold for a small pull request. Derive limits from your architecture and incident history, then version them. A change to a central permission function may be riskier than a much larger isolated test refactor. Scope should include dependency reach and behavior change, not just added and deleted lines.

A staged authority model keeps the boundary legible:

Mode	What the AI reviewer may do	Who decides the merge	Appropriate use
Shadow	Produce a private decision record without affecting the pull request	Human reviewer	Baseline evaluation and policy tuning
Advisory	Post evidence-backed, non-blocking findings	Human reviewer	Measuring usefulness and false alarms in normal work
Blocking	Request changes for narrow, testable policy violations	Human reviewer after resolution	Stable rules with clear evidence and an appeal path
Bounded approval	Approve only changes that pass every eligibility and review condition	Policy controller within its delegated scope	Validated low-risk change classes with complete audit records
Mandatory escalation	Summarize evidence and route the change	Named human owner	Sensitive paths, conflicting findings, missing evidence, or any requested human review

Do not turn bounded approval into an auto-approval quota. Coverage is a result of demonstrated safety, not a target that should pressure teams to weaken eligibility rules.

One high-frequency engineering environment reports that more than 93% of pull requests across two main codebases are agent-driven and more than 19% are approved without a human reviewer. Its reported median merge time fell from 75.8 minutes with human review to 14.6 minutes with AI approval, while downtime from breaking changes declined 35% as deployments doubled. Those organization-level results show that bounded automation can coexist with high deployment frequency and improving safety outcomes. They do not prove that AI approval caused the downtime reduction, and they should not be imported as another team’s launch threshold.

Keep the escape hatch explicit. Any engineer should be able to request a human review without defending the choice. The accountable engineer should still watch the change in production and be ready to roll it back. Automated approval changes who performs a review step; it does not transfer ownership of the production outcome to a model.

Preserve the evidence, then earn autonomy through evaluation

Build a decision record that survives model and policy changes

Create an append-only decision event for every review attempt, including abstentions and failed runs. At minimum, retain:

Repository, pull-request identifier, base commit, reviewed head commit, author, accountable owner, and timestamps.
The pull-request description and acceptance criteria as they existed when the decision was made.
Eligibility rules, protected-path rules, ownership data, prompt-template identifiers, and policy versions.
Model provider and model identifier, relevant runtime settings, retrieval configuration, and tool versions.
The context each specialist received, including immutable references or preserved snapshots for mutable material.
Structured specialist outputs, supporting evidence, tool invocations, CI results, conflicts, and failures.
The deterministic rule evaluation that produced approve, block, comment, or escalate.
Subsequent human overrides, appeals, edits, approvals, merges, rollbacks, hotfixes, and linked incidents.
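One possible way to make an append-only record tamper-evident is to chain entries by hash, so each entry commits to the one before it. The text does not prescribe this technique; the in-memory `DecisionLog` class below is a hypothetical sketch, and a production system would need durable storage and access control.

```python
import hashlib
import json


class DecisionLog:
    """In-memory append-only log where each entry commits to the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        payload = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self._entries.append({"event": payload, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain and confirm no stored entry was altered."""
        prev_hash = ""
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev_hash + entry["event"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


log = DecisionLog()
log.append({"pr": "example-1", "outcome": "escalate", "policy_version": "v7"})
log.append({"pr": "example-1", "outcome": "human_approved"})
print(log.verify())
```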

Store concise decision rationale and inspectable evidence, not hidden chain-of-thought. An auditor needs to know which claim was made, what supported it, which rule applied, and how the controller reached the outcome. Private internal reasoning is neither necessary nor a reliable substitute for those artifacts.

Apply the same security discipline to review logs that you apply to source code. Minimize captured secrets and personal data, control access, define retention, and log policy changes. If a model or retrieval service cannot handle the code under your data-governance requirements, that repository is not eligible for the workflow.

Evaluate decisions, not polished comments

A review can sound thoughtful and still approve the wrong change. Build an evaluation set around decisions and evidence rather than writing quality.

Assemble representative cases. Include clean pull requests, valuable historical human findings, escaped defects, incident-causing changes, incomplete requirements, sensitive paths, cross-component changes, and attempts to manipulate the reviewer through repository content.
Label the expected control outcome. For each case, identify whether the correct action is approve, request changes, or escalate. Record the evidence that an acceptable review must surface.
Separate clear cases from disputed ones. Known incident causes and explicit policy violations can provide strong labels. Ambiguous architectural judgments need maintainer adjudication, and disagreement should remain visible rather than being forced into false certainty.
Freeze a holdout set. Use one portion to improve prompts, retrieval, and policy. Keep another portion unseen until release evaluation so repeated tuning does not create a misleading score.
Compare equivalent cohorts. Evaluate AI and human review on the same risk classes and change types. Comparing AI-approved low-risk changes with all human-reviewed pull requests confounds reviewer quality with task difficulty.

Track metrics that expose different failure modes:

Decision accuracy: How often did the system choose the expected approve, block, or escalate outcome?
False auto-approval rate: How often did it approve a labeled case that should have been blocked or escalated? Break this out by severity and risk class.
Blocking precision: Of the findings that stopped a change, how many did maintainers judge valid and actionable?
Known-defect recall: Which seeded or historically verified defects did the review catch? Label this carefully; it is not recall over every defect that might exist.
Evidence completeness: Can every decision be traced to required checks, immutable inputs, policy versions, and supporting artifacts?
Abstention and override rates: Where is the system uncertain, and where do engineers reverse it? Investigate patterns by repository and change class.
Delivery performance: Measure review latency and merge time, but only alongside quality metrics.
Production outcomes: Track rollbacks, hotfixes, escaped defects, incidents, downtime, and customer impact for comparable risk cohorts.
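
The first three metrics can be computed directly from labeled cases. The sketch below is illustrative only: the Case fields, the outcome labels ("approve", "block", "escalate"), and the choice of denominators are assumptions, and "block" stands in for a request-changes outcome.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Case:
    expected: str                         # labeled outcome
    actual: str                           # outcome the system chose
    finding_valid: Optional[bool] = None  # maintainer verdict on a blocking finding


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None rather than a misleading 0.0 when nothing was measured.
    return numerator / denominator if denominator else None


def decision_accuracy(cases: Sequence[Case]) -> Optional[float]:
    return _ratio(sum(c.expected == c.actual for c in cases), len(cases))


def false_auto_approval_rate(cases: Sequence[Case]) -> Optional[float]:
    # Assumed denominator: cases labeled as block or escalate.
    should_not_approve = [c for c in cases if c.expected != "approve"]
    approved = sum(c.actual == "approve" for c in should_not_approve)
    return _ratio(approved, len(should_not_approve))


def blocking_precision(cases: Sequence[Case]) -> Optional[float]:
    # Only blocks that a maintainer has actually judged are counted.
    judged = [c for c in cases if c.actual == "block" and c.finding_valid is not None]
    return _ratio(sum(bool(c.finding_valid) for c in judged), len(judged))


cases = [
    Case("approve", "approve"),
    Case("block", "block", finding_valid=True),
    Case("block", "approve"),               # a false auto-approval
    Case("escalate", "escalate"),
    Case("approve", "block", finding_valid=False),
]
```

In practice you would compute each rate per severity and risk class rather than over one pooled list, since a pooled number can hide a concentrated failure.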

Comment helpfulness is useful feedback, but it is not a safety metric. Engineers may like a concise reviewer that misses a critical defect, or dislike a strict reviewer that correctly blocks an unsafe change. Keep usefulness, correctness, and production impact as separate measures.

Roll out by change class and turn escapes into regression tests

Move from shadow mode to advisory comments, then to narrow blocking rules, and only then to bounded approval. Start with one repository and one low-risk, reversible change class. Write the exit criteria before the pilot begins, including acceptable false-approval and false-block rates, required audit completeness, escalation behavior, and production guardrails.

Canary each expansion. Maintain a kill switch that disables new automated approvals without removing the accumulated evidence. If a required service, model, retrieval index, test runner, or policy store is unavailable, fail closed and return the pull request to the human path.
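
The kill switch and the fail-closed rule can be expressed as one routing check that runs before any automated approval. This is a sketch under assumptions: the dependency names and the shape of the health report are hypothetical, and a real controller would read them from its own configuration.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical dependency names; substitute the services your workflow requires.
REQUIRED_DEPENDENCIES = ("model", "retrieval_index", "test_runner", "policy_store")


@dataclass(frozen=True)
class RoutingDecision:
    path: str    # "automated" or "human"
    reason: str


def route_pull_request(kill_switch_on: bool,
                       health: Mapping[str, bool]) -> RoutingDecision:
    """Decide whether a pull request may enter the automated approval path."""
    if kill_switch_on:
        return RoutingDecision("human", "kill switch engaged")
    # Fail closed: a dependency missing from the report counts as unavailable.
    unavailable = [name for name in REQUIRED_DEPENDENCIES
                   if not health.get(name, False)]
    if unavailable:
        return RoutingDecision("human", "unavailable: " + ", ".join(unavailable))
    return RoutingDecision("automated", "all required dependencies healthy")
```

The switch only changes routing. It does not touch stored evidence, so earlier decisions remain reconstructable after automation is disabled.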

When an approved change causes a production problem, diagnose the control layer that failed:

Was the change wrongly eligible?
Did retrieval omit relevant code or guidance?
Did a specialist miss or misclassify the defect?
Did the aggregator suppress a conflict?
Did the policy permit approval despite the evidence?
Did CI test a different commit or an incomplete environment?
Did production monitoring fail to surface the effect promptly?

Add the case to the regression suite, version the corrective policy or guidance, rerun the holdout evaluation, and preserve the relationship between the incident and the updated control. That is eval-driven development applied to governance: every escape should make a specific layer harder to fail in the same way again.

Key takeaways

AI output is an input to approval, not the approval policy itself.
Use deterministic eligibility gates before any model-based risk judgment.
Decompose review into specialist checks that return claims, evidence, rule versions, and explicit pass/fail/escalate states.
Keep policy and credentials outside the pull request and outside the model’s control.
Preserve enough evidence to reconstruct the original decision even when the model, repository, or internal guidance later changes.
Expand autonomy only when evaluation and comparable production cohorts support it; never optimize for auto-approval coverage by itself.

Your first useful milestone is not an AI-approved pull request. It is a shadow decision that a maintainer can reconstruct, dispute, and improve. Once that record is dependable, grant the smallest reversible slice of authority, watch what reaches production, and make every expansion earn its place.

References

Intercom — AI Now Approves Our Pull Requests—Safely: Inside an Agentic, Auditable Review Engine

April 21, 2026

How to Build a Trusted AI Product Platform That Scales

Your teams have AI pilots that work in a demo. Then the questions start. Security wants to know what data the system can reach. Product wants to know whether the answers are dependable. Support wants a fallback when the model fails. Executives want evidence that the investment is changing a customer or business outcome.

You do not need another impressive model response. You need a product platform that makes AI behavior understandable, controllable, and repeatable across use cases. That requires a trust architecture, a path from prototype to production, and metrics that expose failure instead of averaging it away.

Trust fails where an AI output crosses a decision boundary

Most teams discuss AI trust as if it were a property of the model. It is better understood as a property of the whole product system. A capable model can still create an untrustworthy experience if it uses the wrong context, hides a consequential assumption, calls an unauthorized tool, or leaves the user unable to correct an action.

The important moment is the handoff from generation to decision. Before that handoff, the output is a possibility. After it, someone may use it to answer a customer, change a record, prioritize work, or trigger another system. The controls you need depend on what crosses that boundary.

A practical way to classify AI use cases is by the authority you give the system:

Inform: The system summarizes, explains, retrieves, or drafts. A person still interprets the result.
Recommend: The system ranks options or proposes a next action. Its framing can materially influence a decision.
Act: The system invokes tools, changes state, communicates externally, or starts a workflow.

Use mode	Primary trust failure	Required product control	Evidence needed before release
Inform	An incorrect, incomplete, or untraceable answer	Visible scope, supporting evidence, uncertainty, and an easy correction path	An evaluator can reproduce the evidence path and identify known limitations
Recommend	A hidden assumption, weak comparison, or recommendation that ignores the user’s constraints	Explicit assumptions, alternatives, decision criteria, and user-editable constraints	Representative cases show whether the recommendation applies the intended rubric
Act	An unauthorized, excessive, or difficult-to-reverse change	Least-privilege access, previews, confirmation, audit records, and reversal where the underlying system supports it	Authorized reviewers validate simulated actions, denied actions, failure recovery, and a limited production path

This classification prevents a common planning error: giving every AI feature the same review process. A summarizer and an autonomous account-management agent should not pass through identical gates. The second system needs stronger identity, permission, confirmation, and recovery controls because its mistakes can propagate beyond the conversation.
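
The table can be turned into a simple release check that reports which controls a use case still lacks for its authority mode. The control identifiers below are shorthand invented for this sketch, taken from the "Required product control" column; they are not names from any framework.

```python
from enum import Enum
from typing import Iterable, List


class Mode(Enum):
    INFORM = "inform"
    RECOMMEND = "recommend"
    ACT = "act"


# Shorthand identifiers for the controls listed in the table (assumed names).
REQUIRED_CONTROLS = {
    Mode.INFORM: {"visible_scope", "supporting_evidence",
                  "uncertainty", "correction_path"},
    Mode.RECOMMEND: {"explicit_assumptions", "alternatives",
                     "decision_criteria", "editable_constraints"},
    # "reversal" applies where the underlying system supports it.
    Mode.ACT: {"least_privilege", "preview", "confirmation",
               "audit_record", "reversal"},
}


def missing_controls(mode: Mode, implemented: Iterable[str]) -> List[str]:
    """Return the required controls that are not yet implemented, sorted."""
    return sorted(REQUIRED_CONTROLS[mode] - set(implemented))


gaps = missing_controls(Mode.ACT, {"least_privilege", "preview"})
# gaps == ["audit_record", "confirmation", "reversal"]
```

A summarizer and an account-management agent then pass through different gates by construction, because the check is keyed on the mode rather than on a single shared checklist.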

For each proposed use case, ask five questions before discussing a model:

April 15, 2026

Stop Forcing AI to Prove ROI: A Product Leader’s Playbook to Measure Real Business Value

Every planning cycle, I feel the drumbeat: “Show me the AI ROI—this quarter.” The pressure is real, especially when boards and CFOs expect immediate payback. Yet when I review stalled initiatives across teams and peers, the pattern is consistent: most companies treat AI like a feature to ship, not a system to manage. That mindset almost guarantees we measure the wrong things, declare victory (or failure) too early, and miss the durable value AI can create.

Here’s the core problem I see: we leap to a solution and skip the counterfactual. Without a baseline, a clear control, or a defined “what would have happened otherwise,” we’re guessing. We also fixate on lagging financial KPIs that move slowly (revenue, cost, risk), then use outputs, not outcomes, as OKRs. If we don’t agree upfront that OKRs track outcomes rather than outputs, the best team in the world can still optimize for activity over impact.

My AI Strategy starts from a simple truth: value shows up along three vectors—revenue, cost, and risk—on different timelines. In the near term, we must validate leading indicators (adoption, engagement, activation) that ladder to those vectors through a transparent driver tree. Over time, those drivers compound into the lagging KPIs finance cares about. When we make the driver tree explicit, everyone can see how model precision, response time, and workflow integration roll up to conversion lift, case deflection, time-to-resolution, or reduced exposure.

To make this rigorous, I run a five-step playbook. First, define the decision and business outcome in plain terms. Second, instrument the baseline with behavioral analytics on a unified analytics platform—tools like Amplitude analytics or Pendo help expose friction points we’ll later target. Third, create a counterfactual using A/B testing and specify a minimum detectable effect (MDE) so we know how long to run and how much traffic we need. Fourth, quantify costs (training, inference, integration, change management) and include AI risk management, privacy-by-design, and data governance up front. Fifth, lock a measurement plan that connects leading indicators to lagging ROI through the driver tree.
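
To show how the MDE in the third step drives traffic requirements, here is a rough sketch of a per-arm sample-size estimate. It assumes a two-sided two-proportion z-test with the usual normal approximation; the 10% baseline, 2-point absolute MDE, 5% significance level, and 80% power are illustrative values, not recommendations.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in each arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Illustrative inputs: 10% baseline conversion, 2-point absolute MDE.
n_per_arm = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
```

Halving the MDE roughly quadruples the required sample, which is why I agree on the smallest effect worth detecting before the test starts rather than after.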

Most AI initiatives don’t fail on model quality—they fail on adoption. If the workflow isn’t smoother, trust isn’t earned, or value isn’t obvious, users revert. That’s why I invest early in onboarding, in-app guides, product tours, and thoughtful tooltip design to reduce the time-to-first-value. Then I watch user activation, retention analysis, and task completion to ensure the assistive experience is not just novel—it’s habit-forming.

For generative use cases, eval-driven development is non-negotiable. I maintain offline evaluations for accuracy and safety, and online evaluations for business impact. Retrieval-first pipeline health, context window management, and prompt engineering affect reliability; so do latency and grounding quality. We ship behind feature flags, measure guardrail effectiveness, and tighten feedback loops from human-in-the-loop reviews into model updates—continuously.

On the business side, I avoid “AI theater” by structuring benefits like a CFO. Revenue: increased conversion or expansion driven by better recommendations, faster sales cycles, or higher trial activation. Cost: case deflection, agent time saved, fewer escalations, and lower rework. Risk: reduced exposure via automated checks, anomaly detection, and consistent policy application. If any claim can’t be tied to measured deltas—via A/B testing or strong quasi-experiments—it doesn’t go in the deck.

Build vs buy deserves the same discipline. I map platform scalability, governance requirements, and total cost of ownership against time-to-impact. Teams often underestimate integration and maintenance drag; a pragmatic mix of bought components with thin custom layers can accelerate outcomes while keeping options open. The goal isn’t to own every layer—it’s to own the learning loop and the differentiated experience.

I also remind teams that tooling should serve the strategy, not replace it. I’ve seen concise, effective messaging that captures the point: “Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.” The words are compelling because they reflect the three-vector value model and the adoption imperative. The same standard should apply to any AI initiative we propose.

If you’re under pressure to prove ROI, shift the conversation: lead with the driver tree, specify your counterfactual, and anchor on leading indicators you can move in weeks—not quarters. Then connect those to the lagging KPIs finance expects over time. When we manage AI like a product—grounded in evidence, experimentation, and user-centered adoption—we don’t have to force ROI. We compound it.

Inspired by this post on Pendo – Perspectives.

April 8, 2026

Product Management Isn’t Dead: Why ‘Product Builders’ Will Win in the AI Era—and How to Upskill Now

“Is product management dead?” I hear this question at almost every conference hallway chat. After listening to the latest Product Builders – All Things Product Podcast with Teresa Torres & Petra Wille, I’m more convinced than ever: product management isn’t dead—it’s evolving fast, and the leaders will be those who embrace the shift.

The core take resonated deeply with my day-to-day at HighLevel: product management isn’t dying—“the traditional product trio (PM, design, engineering) is collapsing into something new.” The center of gravity is shifting from swim lanes to outcomes, from rigid handoffs to fluid collaboration, and from role definitions to capabilities that actually ship value.

AI is raising the baseline across the board. That “80/20 shift: AI handles patterns, humans handle hard problems” is real on my teams. With LLMs like “GPT 5.2” and “Opus 4.5,” coding agents such as “Claude Code” and “Codex,” and tools like “Replit” and “Lovable,” we’re compressing cycle time on the repeatable 80%. The bottleneck is no longer typing code or drafting copy—it’s selecting the right problems, crafting sharp product strategy, and making confident trade-offs.

This is why the future belongs to “product builders” — people with a shared foundation across disciplines and deep expertise in one area. I look for teams that can shape, prototype, validate, and iterate in tight loops, blending continuous discovery with empowered product teams. The baseline expands, the craft deepens.

Functional expertise still matters—more than ever—because the hard parts are getting harder. We need leaders who can weigh platform scalability against time-to-value, protect privacy-by-design, apply AI risk management, and navigate data governance while sustaining product-market fit. When AI accelerates execution, judgment becomes the differentiator.

For leaders, this creates a clear mandate: “What product leaders must do to create safe AI infrastructure.” In practice, that means building guardrails early—security reviews tailored to AI workflows, QA harnesses that include eval-driven development, model performance observability, and human-in-the-loop review systems. You can’t bolt this on later without paying a tax in velocity and trust.

Hiring signals are already shifting. “How job descriptions and hiring expectations are already shifting” shows up in my reqs: we emphasize cross-functional range, fluency with AI workflows, prompt engineering literacy, and the ability to frame measurable outcomes. We still want craft depth—design systems, systems thinking in engineering, rigorous discovery—but we prize people who move seamlessly from discovery to delivery.

In the episode, I appreciated the crisp framing of why product management isn’t dying—but changing. The rise of the “product builder” foundation reframes team topology and unlocks smaller, more cross-functional squads. AI changes the baseline skill set across product teams, and ignoring it is a career risk. If you’re not learning AI tools, you’re falling behind.

My key takeaways were straightforward and actionable. Smaller, more cross-functional teams are likely. Deep expertise still matters—especially for complex trade-offs. Leaders need guardrails: security, QA, and review systems built for an AI-driven workflow. And if you work in product, design, or engineering, this episode is your signal to start upskilling now.

“The risk of ignoring AI in your craft” is not hypothetical. I encourage PMs to carve out weekly lab time for hands-on experiments with LLMs, build lightweight prototypes with Replit or Lovable, and pressure-test opportunity solution trees with data-informed discovery. Pair with your engineers on agentic AI use cases, and integrate model evals into your CI/CD pipelines.

“Mentioned in the episode” were several resources worth exploring: “Product at Heart” (June, Hamburg), “Replit,” “Lovable,” “Every,” “Petra’s Coaching Packages,” and “coding agents (Claude Code, Codex) and LLMs (GPT 5.2, Opus 4.5).” These are great jumping-off points for your own product builder toolkit.

My recommendation: queue up the episode on your commute, then pick one workflow to augment with AI before the week ends. Replace a handoff with a shared canvas. Automate a repetitive analysis. Ship a scrappy prototype. Momentum compounds.

Have thoughts on this episode? Leave a comment below. I’d love to hear how your teams are evolving your product trios, what AI workflows are sticking, and where governance has been most challenging.

Inspired by this post on Product Talk.

March 31, 2026

How to Scale AI Customer Experience Without Losing Quality

Your AI agent can resolve more conversations while customer experience quietly becomes harder to trust. A ticket marked resolved can still contain an inaccurate answer, a skipped process step, a repetitive loop, or an escalation that arrived too late.

As the product leader, you don’t have to choose between automation and judgment. You need an operating system that identifies which conversations matter, defines what good looks like, routes exceptions to the right owner, verifies fixes, and connects support quality to customer behavior.

Measure outcomes, execution quality, and coverage separately

Resolution rate is a throughput metric. It tells you how often the operation reached a terminal state, but not whether the answer was correct or the customer was treated appropriately. CSAT has a similar limitation: it captures sentiment, not conformance. Customer sentiment and adherence to your standards answer different questions, so neither should stand in for the other.

Build the dashboard around four layers. Keeping them separate prevents a good aggregate number from concealing a weak customer experience.

Measurement layer	Question it answers	Signals to track	Decision it supports
Customer outcome	Did the customer get useful help?	Resolution outcome, recontact, escalation outcome, sentiment	Whether the interaction solved the customer’s problem
Execution quality	Did the conversation meet your operating standard?	Accuracy, process adherence, clarification, escalation ease, efficiency	Whether the agent behaved correctly
Evaluation coverage	How much of the operation did you actually inspect?	Eligible, automatically evaluated, human-reviewed, and still-unreviewed conversations	How much confidence to place in the quality result
Product impact	Did better help change customer behavior?	Activation, feature adoption, retention, and journey completion by cohort	Whether the CX improvement created durable value

Read the layers together, but don’t merge them into one executive number. High sentiment with low accuracy can mean the agent sounded helpful while giving the wrong answer. A strong scorecard result with poor sentiment can expose a technically correct but difficult experience. Good conversation quality with repeated contacts may mean the product, policy, or documentation still creates the underlying problem.

The product-impact layer is especially important. A support answer may pass every conversational check without improving the journey that matters. Connect CX data to activation, adoption, and retention behavior so you can distinguish a better answer from a better customer outcome.

A simple driver tree makes that connection explicit. Start with the business result, trace it to the customer behavior that produces it, identify the journey friction blocking that behavior, and then define the AI behavior that should remove the friction. If you can’t trace a proposed quality criterion through that chain, it may be a preference rather than a requirement.

Design monitoring as a portfolio, not a random sample

Monitoring begins with selection. A precisely calculated score from the wrong population creates false confidence. Small random samples are useful for trend stability, but they are unlikely to expose every high-risk edge case, complex escalation, or early sign of drift.

Use four complementary monitoring layers:

A stable benchmark cohort. Evaluate a repeatable sample on a consistent schedule. Preserve the same eligibility rules and segmentation so changes in pass rate represent changes in performance rather than changes in the sample.
Risk-targeted monitors. Select conversations with signals that deserve deliberate review. Examples include a customer showing signs of financial vulnerability, an agent repeating essentially the same answer, a required escalation that did not happen, or a sensitive process being handled without its required checks.
Change-specific monitors. Create cohorts for a new model, prompt, workflow, tool, policy, or knowledge release. Version every relevant component so a quality change can be traced to the production change that preceded it.
Journey monitors. Group conversations by customer intent and journey stage, not just channel or queue. This exposes recurring friction in onboarding, adoption, billing, account management, and other product journeys that an operation-wide average will flatten.

Instrument every eligible conversation, but don’t assume every conversation needs human review. Automated evaluation is appropriate for high-volume, clearly defined criteria. Human judgment belongs on critical failures, ambiguous cases, disputed scores, new scenarios, and calibration cohorts. The scalable design is broad automated visibility with concentrated human attention.

Each monitor should report more than a pass rate:

Coverage rate: completed evaluations divided by eligible conversations. Never present pass rate without this denominator.
Pass rate: completed evaluations that met the scorecard standard divided by all completed evaluations.
Critical failure rate: completed evaluations that failed at least one critical criterion divided by all completed evaluations.
Unreviewed queue: qualifying conversations that have not received their required review. Break this out by risk and age rather than showing only a total.
Evaluator overturn rate: dual-reviewed cases in which a human changed the automated judgment divided by all dual-reviewed cases. A rising rate can signal an ambiguous rubric, a weak evaluator, or a new conversation pattern.
Failure recurrence: previously addressed failure modes that appear again after a fix. This distinguishes completed work from effective work.
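
The first three rates follow directly from four counts. The sketch below uses hypothetical field names and example numbers to show why the coverage denominator belongs next to the pass rate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MonitorCounts:
    eligible: int           # conversations that qualified for this monitor
    evaluated: int          # completed evaluations
    passed: int             # evaluations that met the scorecard standard
    critical_failures: int  # evaluations that failed a critical criterion


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # None signals "not measurable" instead of implying a zero rate.
    return numerator / denominator if denominator else None


def monitor_report(counts: MonitorCounts) -> Dict[str, Optional[float]]:
    return {
        "coverage_rate": _ratio(counts.evaluated, counts.eligible),
        "pass_rate": _ratio(counts.passed, counts.evaluated),
        "critical_failure_rate": _ratio(counts.critical_failures, counts.evaluated),
    }


# Example numbers are invented for illustration.
report = monitor_report(MonitorCounts(eligible=2000, evaluated=500,
                                      passed=450, critical_failures=10))
# A 90% pass rate here rests on only 25% coverage.
```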

Segment these measures by AI versus human handling, journey, intent, customer segment, channel, and deployed version. One shared quality system can compare AI and human conversations against the same customer outcomes, while still assigning different operational responsibilities. Unified review across automated and human conversations also makes handoff failures visible; otherwise, each side can look healthy while the transition between them breaks.

Be deliberate about the data attached to monitoring. Store the fields needed for segmentation, diagnosis, and audit, and restrict access to sensitive conversation content. A monitoring program creates risk if it spreads personal or regulated data into dashboards that were designed only for aggregate analytics.

Turn scorecards into executable product requirements

A monitor determines what enters review. A scorecard determines how the conversation is judged. That distinction is operationally important: targeted selection and custom evaluation criteria work as two separate controls.

I treat a scorecard as an executable product requirement. Each criterion should be observable in the conversation, interpreted the same way by two independent reviewers, and connected to a specific action when it fails. Vague criteria such as helpful, natural, or on-brand produce arguments rather than reliable signals.

Build each criterion with the following fields:

Intent: the customer or business outcome the criterion protects.
Pass condition: the observable behavior that must be present.
Failure condition: the observable behavior that makes the conversation fail.
Evidence rule: the part of the transcript, tool trace, policy, or approved knowledge that supports the judgment.
Applicability: when the criterion is required and when it is not applicable.
Severity: whether failure contributes to a weighted score or overrides the entire evaluation.
Reviewer: automated evaluation, human evaluation, or both.
Remediation owner: the person or function expected to act on failure.

A practical CX scorecard usually needs criteria such as:

Accuracy: the answer is supported by approved knowledge, customer context, and tool results. Unsupported claims fail even if the customer accepts them.
Resolution and next step: the agent answers the request, clearly states what remains, or routes the customer to the correct next action.
Process adherence: required verification, disclosure, permission, and workflow steps are completed in the correct context.
Clarification: the agent asks for missing information when intent or account context is too ambiguous to answer safely.
Escalation: the agent recognizes defined handoff conditions, escalates without unnecessary resistance, and transfers the context the next agent needs.
Conversation efficiency: the agent avoids repetition, irrelevant steps, and loops while preserving the information necessary for a correct outcome.
Communication quality: the response is clear, appropriately direct, and consistent with the brand’s communication standard.

Weights help express relative importance, but a weighted average must not wash away a consequential failure. Mark accuracy, safety, security, required process steps, or other non-negotiable controls as critical where appropriate. A polite answer that gives a harmful instruction should not pass because it accumulated enough points elsewhere.
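
A critical override can be sketched as a scoring function in which any failed critical criterion fails the evaluation regardless of the weighted score. The weights, the 0.8 pass threshold, and the field names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Union


@dataclass(frozen=True)
class CriterionResult:
    name: str
    weight: float
    passed: Optional[bool]  # None means the criterion was not applicable
    critical: bool = False


def evaluate(results: Sequence[CriterionResult],
             pass_threshold: float = 0.8) -> Dict[str, Union[float, bool, List[str], None]]:
    applicable = [r for r in results if r.passed is not None]
    critical_failed = [r.name for r in applicable if r.critical and not r.passed]
    total_weight = sum(r.weight for r in applicable)
    score = (sum(r.weight for r in applicable if r.passed) / total_weight
             if total_weight else None)
    # The weighted score matters only when no critical criterion failed.
    passed = (not critical_failed) and score is not None and score >= pass_threshold
    return {"score": score, "passed": passed, "critical_failures": critical_failed}


result = evaluate([
    CriterionResult("accuracy", weight=1, passed=False, critical=True),
    CriterionResult("process_adherence", weight=3, passed=True),
    CriterionResult("efficiency", weight=3, passed=True),
    CriterionResult("communication", weight=3, passed=True),
    CriterionResult("escalation", weight=2, passed=None),  # not applicable
])
# The weighted score is 0.9, yet the evaluation fails on the critical criterion.
```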

For legal, financial, safety, account-security, or regulated decisions, follow the applicable organizational policy and require human review or escalation where that policy demands it. An aggregate quality score is not authorization for an AI agent to make a decision outside its approved scope.

Automated evaluators also need calibration. They are measurement components, not ground truth. Build an adjudicated set of clear passes, clear failures, difficult edge cases, and not-applicable examples. Have human reviewers score the same cases independently, compare their reasoning with the automated result, and rewrite any criterion that allows materially different interpretations. Repeat calibration after changes to the model, evaluator, prompt, tools, knowledge, policy, or conversation mix.

Keep the evidence behind every automated judgment. A score without the relevant transcript excerpt or trace is difficult to challenge and nearly impossible to improve. Reviewers should be able to see what failed, why it failed, and which requirement governed the decision.

Make every failure end in a product decision

Quality monitoring creates value only when it changes the system. The review queue should represent work, not a museum of bad conversations. A useful workflow moves each case through explicit states such as Not reviewed, Reviewed, Needs a fix, and Fix complete.
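
Those states are easier to enforce when the allowed transitions are explicit. The sketch below uses the four state names from the text; the transition map itself is an assumption, including the choice to let a completed fix reopen when the failure recurs.

```python
from enum import Enum


class ReviewState(Enum):
    NOT_REVIEWED = "Not reviewed"
    REVIEWED = "Reviewed"
    NEEDS_FIX = "Needs a fix"
    FIX_COMPLETE = "Fix complete"


# Assumed transition rules, for illustration only.
ALLOWED_TRANSITIONS = {
    ReviewState.NOT_REVIEWED: {ReviewState.REVIEWED},
    ReviewState.REVIEWED: {ReviewState.NEEDS_FIX},
    ReviewState.NEEDS_FIX: {ReviewState.FIX_COMPLETE},
    ReviewState.FIX_COMPLETE: {ReviewState.NEEDS_FIX},  # reopen on recurrence
}


def advance(current: ReviewState, target: ReviewState) -> ReviewState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target
```

Rejecting a jump from Not reviewed straight to Fix complete keeps the queue honest: a case cannot be closed without passing through review and a recorded fix.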

Require the following information before a failed case leaves review:

The customer intent and journey stage.
The failed criterion and supporting evidence.
The failure class, not merely the visible symptom.
The owner responsible for the correction.
The proposed change and the signal expected to improve.
The regression case, deployment version, and monitor that will verify the correction.

A consistent failure taxonomy keeps teams from treating every bad answer as a prompt problem:

Knowledge failure: the approved information is missing, stale, contradictory, or too difficult to interpret.
Retrieval or context failure: the right information exists, but the system did not retrieve, rank, or apply it.
Policy or workflow failure: the operating rule is wrong, incomplete, or impossible for the agent to execute.
Model behavior failure: the system ignored instructions, made an unsupported inference, or produced an otherwise defective response despite receiving adequate context.
Conversation-design failure: the interaction collected the wrong information, asked an unclear question, or sequenced the exchange poorly.
Tool or handoff failure: an integration, action, routing rule, or transfer prevented the correct outcome.
Product-friction failure: the support interaction is a recurring symptom of something the product itself should make clearer or eliminate.

This taxonomy changes prioritization. If the same onboarding question keeps passing through support, improving the answer may reduce handling friction but preserve the underlying product problem. Journey mapping, behavioral analytics, and in-product guidance can reveal whether the better fix belongs in the interface, workflow, documentation, or agent.

Close each failure through a controlled loop:

Reproduce it. Confirm the failure and preserve the relevant inputs, context, versions, and tool behavior.
Diagnose it. Assign a cause from the shared taxonomy and identify the owner with authority to change that component.
Correct it. Update the product, knowledge, retrieval logic, prompt, workflow, policy, integration, or escalation rule that caused the defect.
Test it. Run the original failure and nearby cases that should remain unchanged. Add the sanitized case to the regression set where appropriate.
Deploy it with versioning. Preserve enough release context to compare behavior before and after the change.
Verify it in production. Watch the targeted monitor, baseline quality, and associated customer behavior. Close the case only when the expected signal improves without creating a new failure elsewhere.

Bring this loop into a regular operating review. Inspect coverage first, then critical failures, the largest changes by segment and version, queue health, recurring causes, completed fixes, and downstream customer behavior. The meeting should end with product decisions: roll back a change, revise knowledge, alter a workflow, strengthen an escalation, change the interface, expand monitoring, or explicitly accept a known limitation.

Assign ownership before volume grows. CX and support leaders can define service standards; product leaders can connect recurring friction to roadmap decisions; AI and engineering owners can maintain instrumentation, evaluators, and regression tests; analytics can connect interactions to behavioral outcomes; and security, legal, or compliance owners can approve critical controls in their domains. Your organization may divide the roles differently, but every critical criterion and failure class needs a named decision owner.

Key takeaways

Resolution rate measures throughput, not whether an AI interaction was accurate, compliant, or useful.
Track customer outcome, execution quality, evaluation coverage, and product impact as separate layers.
Combine a stable benchmark cohort with risk-targeted, change-specific, and journey-based monitoring.
Show pass rate with coverage and critical failure rate. A strong score over thin or biased coverage is weak evidence.
Write scorecard criteria as executable requirements with pass conditions, failure evidence, severity, reviewer type, and remediation ownership.
Use automated evaluation for breadth and human judgment for calibration, ambiguity, disputes, and consequential failures.
Don’t close a quality issue when a document or prompt changes. Close it after the deployed fix passes regression checks and improves the intended production signal.
Route recurring conversation failures into product discovery. Sometimes the best CX fix is removing the reason customers need to ask.

Start with one journey that combines meaningful volume with meaningful consequence. Give it an eligibility rule, a scorecard, explicit coverage, a review queue, a failure taxonomy, named owners, regression cases, and a downstream product outcome. Run that chain until failures reliably produce verified changes, then extend the same control loop to the next journey. Scalable quality comes from repeating a dependable operating system, not from adding another dashboard.

March 25, 2026

How to Ship Responsible AI Products in Regulated Healthcare

Your healthcare AI prototype works in a demo. Clinicians see potential. Then privacy, security, compliance, and legal reviewers ask questions the roadmap cannot answer: Which data crosses the model boundary? What happens when the output is wrong? Who can stop it? What evidence justifies exposing it to patients or providers?

The answer is not a longer policy document. You need a delivery system in which the use case, data boundaries, acceptable behavior, evidence, and rollback path are inspectable before anyone depends on the product. That system lets you move faster because each review produces a decision instead of another round of open-ended concerns.

Key takeaways

Start with the decision or action the AI will influence, not the model you want to deploy.
Keep identifiers in clinical systems by default and send only the behavioral or operational signals a downstream system genuinely needs.
Put success metrics, unacceptable behavior, human review, and stop conditions in the same release contract.
Move from synthetic or de-identified sandbox testing to a tightly controlled pilot, then scale only when the agreed evidence supports it.
Monitor model behavior, workflow performance, segment outcomes, data quality, and incidents as one production system.

Define the clinical boundary before choosing the AI approach

A vague use case such as improving patient engagement is almost impossible to evaluate responsibly. It does not identify a user, a decision, an action, or a credible failure. The first useful artifact is a use-case card that makes those boundaries explicit.

Complete these fields before discussing vendors, models, or architecture:

User and job: Name the person using the capability and the task that person is trying to complete.
Input: List the information required to perform the task. Separate essential inputs from data that is merely available.
Output: Define what the system produces: a summary, draft, recommendation, prediction, classification, or action.
Action authority: State whether the AI informs a person, proposes an action for approval, or executes an action itself.
Unacceptable outcome: Describe the failure that must not reach the user, patient, provider, or downstream system.
Human checkpoint: Identify who reviews the output, what that person can see, and how the person can reject or correct it.
Success measure: Name the workflow outcome that should improve, such as task completion, time-to-first-value, or sustained adoption.
Accountable owner: Name the person who can approve the use case, pause it, and accept or reject residual risk.
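
One way to make the card enforceable is to store it as a typed record, so that a blank field blocks intake instead of slipping past review. The following is a minimal Python sketch under that assumption, not a prescribed schema: the class names, field names, and the missing_fields helper are hypothetical.

```python
from dataclasses import dataclass, fields
from enum import Enum


class ActionAuthority(Enum):
    INFORMS = "informs a person"
    PROPOSES = "proposes an action for approval"
    EXECUTES = "executes an action itself"


@dataclass(frozen=True)
class UseCaseCard:
    # Field names mirror the card above; they are illustrative only.
    user_and_job: str
    essential_inputs: tuple[str, ...]
    output: str
    action_authority: ActionAuthority
    unacceptable_outcome: str
    human_checkpoint: str
    success_measure: str
    accountable_owner: str


def missing_fields(card: UseCaseCard) -> list[str]:
    """Return the names of fields left blank or empty, in card order."""
    missing = []
    for f in fields(card):
        value = getattr(card, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
        elif isinstance(value, tuple) and not value:
            missing.append(f.name)
    return missing
```

An intake step could refuse to schedule a vendor or architecture discussion while missing_fields returns anything. The check only proves the fields are filled in; whether the answers are adequate remains a judgment for the accountable owner.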

The action-authority field is especially important. A system that drafts text for a qualified person to review has a different failure surface from one that sends the text automatically. A recommendation that a clinician can inspect is different from an action that changes a care workflow without an intervening decision. If the team cannot describe that distinction, it is too early to approve a production design.

I use a simple product-risk ladder during intake, ordered from the lowest-consequence rung to the highest:

The AI summarizes or drafts, and its output has no effect until a qualified person reviews it.
The AI recommends a next step, but a person must make and record the decision.
The AI executes a reversible administrative action within a tightly bounded workflow.
The AI influences a care pathway, patient communication, or another consequential decision.
The AI executes a consequential or difficult-to-reverse action without prior human approval.

This ladder is a product-triage device, not a legal or clinical classification. Your qualified clinical, privacy, security, compliance, and legal owners still need to determine the obligations that apply. Its purpose is to prevent a low-risk drafting assistant and a high-consequence decision system from passing through the same generic review.
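
To show how the ladder can keep different risk levels out of the same generic review, here is a small Python sketch. It assumes each rung maps to a review track; the track names and cut-offs are hypothetical placeholders that your accountable owners would replace, not a legal or clinical classification.

```python
from enum import IntEnum


class RiskRung(IntEnum):
    """Rungs of the intake ladder, ordered from least to most consequential."""
    DRAFT_REVIEWED_BEFORE_USE = 1
    RECOMMENDATION_PERSON_DECIDES = 2
    REVERSIBLE_ADMIN_ACTION = 3
    INFLUENCES_CONSEQUENTIAL_DECISION = 4
    CONSEQUENTIAL_ACTION_WITHOUT_APPROVAL = 5


# Hypothetical routing table: track names and cut-offs are placeholders.
REVIEW_TRACKS = {
    RiskRung.DRAFT_REVIEWED_BEFORE_USE: "lightweight",
    RiskRung.RECOMMENDATION_PERSON_DECIDES: "lightweight",
    RiskRung.REVERSIBLE_ADMIN_ACTION: "enhanced",
    RiskRung.INFLUENCES_CONSEQUENTIAL_DECISION: "full",
    RiskRung.CONSEQUENTIAL_ACTION_WITHOUT_APPROVAL: "full",
}


def review_track(rung: RiskRung) -> str:
    """Look up the review track for a rung; unknown rungs raise KeyError."""
    return REVIEW_TRACKS[rung]
```

Because the rungs are ordered integers, a change that moves a use case to a higher rung can be detected with a plain comparison and sent back through intake.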

Once the boundary is clear, choose the least complex mechanism that can deliver the outcome. Conventional automation may be enough for deterministic rules. Retrieval may be appropriate when the primary job is finding and grounding information. An agentic workflow introduces additional action authority and therefore needs stronger controls. Selecting among conventional automation, a retrieval-first pipeline, and agentic AI should follow the use case, its failure modes, and its lifecycle requirements.

Apply the same discipline to build-versus-buy decisions. Do not reduce the choice to feature coverage or procurement cost. Evaluate who can control data handling, model and prompt versions, evaluation, incident response, observability, and future changes. A vendor can supply technology, but it cannot own your product decision or your duty to operate the resulting workflow responsibly.

Make the data boundary reviewable, not merely promised

Privacy-by-design becomes real when a reviewer can trace each field from its origin to every place it is processed, logged, measured, retained, and deleted. A sentence saying the product is secure is not a data-control mechanism.

Start with a data-flow map that covers the entire operating path:

The clinical or operational system where the data originates.
Any transformation, minimization, masking, or de-identification step.
The application, retrieval layer, model, or external service that processes it.
Prompt, response, diagnostic, and application logs.
Behavioral analytics and product dashboards.
Human-review, support, escalation, and incident queues.
Long-term storage, retention, deletion, and backup paths.

For every step, record the purpose, permitted fields, prohibited fields, access roles, retention rule, downstream recipients, and owner. If a field has no necessary purpose, remove it before debating how to secure it. Data minimization reduces both the risk surface and the number of controls the team has to maintain.

A practical default is to keep identifiers in clinical systems while allowing only the behavioral signals needed for product analytics to cross the boundary. An analytics event can record that a recommendation was opened, edited, accepted, rejected, or completed without carrying a patient name or clinical narrative. The event should describe what happened in the product, not reproduce the underlying record.

Do not assume data is de-identified merely because a visible name or patient identifier has been removed. Combinations of fields, free text, prompts, model responses, URLs, error messages, and support attachments can still disclose sensitive information. Have the designated privacy and legal owners determine whether the transformation meets the applicable requirements. If they cannot verify it, keep the data inside the approved clinical boundary or use synthetic data for development.

Behavioral instrumentation needs its own contract. For each event, define:

The event name and the exact behavior it represents.
The allowed properties and the business purpose of each property.
Explicitly prohibited identifiers, clinical text, and other sensitive payloads.
The application and workflow versions that generate the event.
The owner who approves schema changes.
Validation rules that reject or quarantine malformed events.
The metric definitions and dashboards that consume the event.
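
The contract above can be checked mechanically at the point where events are emitted. The following Python sketch assumes a simple allow-list and deny-list per event; EventContract, validate_event, and the example property names are hypothetical, and a real pipeline would also need rules for versions and malformed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventContract:
    name: str
    allowed_properties: frozenset
    prohibited_properties: frozenset
    owner: str  # the person who approves schema changes


def validate_event(contract, name, properties):
    """Return (accepted, reasons); any reason means reject or quarantine."""
    reasons = []
    if name != contract.name:
        reasons.append(f"unexpected event name: {name}")
    for key in sorted(properties):
        if key in contract.prohibited_properties:
            reasons.append(f"prohibited property: {key}")
        elif key not in contract.allowed_properties:
            reasons.append(f"unknown property: {key}")
    return (not reasons, reasons)


contract = EventContract(
    name="recommendation_accepted",
    allowed_properties=frozenset({"action", "workflow_version"}),
    prohibited_properties=frozenset({"patient_name", "clinical_note"}),
    owner="analytics-schema-owner",
)
```

Checking property names catches an identifier that arrives under its own key. It does not catch sensitive text placed inside an allowed property, so value-level checks are still needed.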

This is governed analytics in operational form. Curated events, certified metric definitions, role-based access, lineage, and change control create a shared, auditable view for product, data, security, and compliance. They also prevent a quieter product failure: two teams using the same metric name for different behaviors and making incompatible release decisions.

Apply comparable scrutiny to an external provider. Ask what data the provider processes, where it is stored, whether inputs or outputs can be used for training, what is logged, how long each artifact is retained, how deletion works, who can access it, which subprocessors receive it, how tenants are separated, and what happens during an outage or security incident. Route the answers to the people responsible for contractual, security, privacy, and regulatory assessment. Product should own the use-case decision, not silently treat vendor approval as proof that every use is approved.

Convert responsible AI into a release contract

Responsible AI fails as a delivery practice when responsibility is expressed only as principles. A team needs observable release criteria: the behavior it expects, the behavior it prohibits, the evidence it will collect, and the condition that stops the launch.

Put those criteria in one release contract shared by product, engineering, data science, clinical leadership, security, privacy, and compliance. The exact metric thresholds will vary by use case, so the accountable owners must set them before the pilot produces results. A threshold chosen after seeing the data is an explanation, not a gate.

Release layer	Define before the pilot	Evidence to collect	Do not proceed when
Product value	The user task and expected workflow improvement	Task completion, time-to-value, adoption, abandonment, and sustained use	The feature creates activity without improving the intended task
Model behavior	Expected responses, prohibited responses, escalation behavior, and task-specific pass criteria	Versioned offline evaluations, human review, guardrail results, and regression comparisons	A critical safety case fails or behavior cannot be reproduced
Data quality	Required inputs, permitted schemas, freshness expectations, and lineage	Schema validation, missing-data checks, source versions, and anomaly monitoring	Inputs are stale, malformed, untraceable, or outside the approved boundary
Human control	Review point, override, correction, escalation, and rollback path	Correction behavior, overrides, escalations, and successful rollback tests	The responsible person cannot inspect, reject, or stop the output
Operational health	Acceptable latency, cost, availability, error behavior, and incident ownership	Production telemetry, alerts, version history, and incident records	Failure is silent, alerts lack an owner, or recovery depends on an untested path
Segment outcomes	The patient, provider, workflow, and operating segments that require separate review	Outcome and error variance across approved segments	Material variance is unexplained or a consequential segment lacks adequate evidence
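
To illustrate what it means for a threshold to be a gate rather than an explanation, the sketch below freezes the thresholds before the pilot and compares the evidence against them afterward. It is written in Python under simplifying assumptions: the Gate class, the metric names, and the numeric limits are hypothetical, and each criterion is reduced to a single minimum or maximum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: thresholds cannot be edited after results arrive
class Gate:
    layer: str
    metric: str
    minimum: Optional[float] = None  # block when the observed value is lower
    maximum: Optional[float] = None  # block when the observed value is higher


def blocking_reasons(gates, evidence):
    """Return one reason for every gate that is unmet or lacks evidence."""
    reasons = []
    for gate in gates:
        value = evidence.get(gate.metric)
        if value is None:
            reasons.append(f"{gate.layer}: no evidence for {gate.metric}")
        elif gate.minimum is not None and value < gate.minimum:
            reasons.append(f"{gate.layer}: {gate.metric}={value} below {gate.minimum}")
        elif gate.maximum is not None and value > gate.maximum:
            reasons.append(f"{gate.layer}: {gate.metric}={value} above {gate.maximum}")
    return reasons
```

Missing evidence is treated as a blocking reason on purpose: a layer that was never measured should not pass by default.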

Model quality is only one layer. A strong offline result can still produce a poor product if the workflow is slow, users cannot correct the output, input data is unreliable, or the intervention fails to improve the intended task. Connect the layers with a driver tree:

Model behavior: What must the system produce or avoid?
Workflow behavior: What will the user do differently if the output is useful and trusted?
User outcome: Which task becomes more complete, efficient, or reliable?
Organizational or care outcome: What meaningful result should eventually change?

Treat each link from one layer to the next as a hypothesis, not an assumed causal relationship. For example, a more relevant recommendation might reduce corrections, and fewer corrections might improve task completion. Instrument both transitions. If relevance improves but completion does not, the team has learned that the bottleneck is elsewhere.

Your offline evaluation set should include representative routine inputs, ambiguous inputs, edge cases, and the sensitive scenarios most closely connected to the unacceptable outcomes on the use-case card. For each case, store the expected behavior, reviewer rubric, model version, prompt version, retrieval configuration, policy or rule version, and result. This makes regression testing possible when any part of the system changes.

Prompt libraries, model and prompt regression tests, eval-driven development, feature flags, and observability belong in the product delivery system rather than in an isolated data-science workflow. AI behavior can change when the model, prompt, retrieved context, guardrail, input distribution, or surrounding application changes. Version the complete configuration that produced the output.
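
One lightweight way to version the complete configuration is to derive a single fingerprint from every component that can change behavior and store it with each output. The Python sketch below assumes the configuration can be expressed as a JSON-serializable dictionary; the key names in the example are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a configuration so that any component change yields a new ID.

    Keys are sorted before hashing, so the same settings always produce
    the same fingerprint regardless of the order they were written in.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


release_config = {
    "model": "approved-model-2026-01",      # hypothetical identifiers
    "prompt_version": "discharge-draft-v7",
    "retrieval": {"index": "policies-v12", "top_k": 5},
    "guardrail_policy": "gp-v3",
}
fingerprint = config_fingerprint(release_config)
```

Logging the fingerprint next to each evaluation result and production output lets a reviewer confirm that the configuration that passed evaluation is the one that ran.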

Use A/B testing only where exposure is ethically and operationally appropriate, failure is reversible, and the relevant reviewers have approved the experiment. Do not use an experiment to discover whether an unbounded high-consequence behavior is safe. Establish safety through evaluation and controlled review first. For an approved experiment, predefine the minimum detectable effect that would make the release risk worthwhile, along with guardrail metrics and stop conditions.
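
Predefining the minimum detectable effect also tells you roughly how many participants an approved experiment would need. The Python sketch below uses the standard normal-approximation formula for comparing two proportions, n per arm = (z_(1-alpha/2) + z_power)^2 * [p1(1-p1) + p2(1-p2)] / (p2-p1)^2. The function name and default values are assumptions for illustration, and a statistician should confirm the design for any real study.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, minimum_detectable_effect,
                        alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    if minimum_detectable_effect == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1 and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

A smaller effect requires a much larger sample, which is one reason to agree on the effect size before the experiment rather than after it.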

Use evidence gates from sandbox to controlled scale

A responsible rollout is not one approval followed by unrestricted production access. It is a sequence of gates. Each gate expands exposure only after the previous stage produces the required evidence.

Gate 1: Sandbox validation

Start with synthetic or appropriately de-identified data. The sandbox should reproduce the workflow closely enough to test prompts, retrieval, interface behavior, event instrumentation, alerts, and rollback without exposing a patient or provider to an unproven capability.

Use the sandbox to answer concrete questions:

Does each approved input produce a traceable output?
Do ambiguous, incomplete, or malformed inputs fail safely?
Are prohibited data fields rejected before they reach logs or analytics?
Do critical evaluation cases pass on the exact release configuration?
Can a reviewer see the context needed to accept, edit, or reject an output?
Do alerts reach a named owner?
Can the feature be disabled without disrupting the underlying workflow?
Are latency and cost compatible with the intended operating model?

A polished demonstration is not the exit criterion. The exit criterion is a reproducible evidence packet containing the use-case card, data-flow map, event contracts, evaluation results, open risks, mitigations, configuration versions, approvals, and tested rollback procedure.

Gate 2: Controlled production pilot

A pilot is an instrumented risk test, not a smaller marketing launch. Define its boundaries before enabling the feature:

Which users and roles are eligible.
Which workflows and data types are permitted.
Which outputs and actions are enabled.
Where human review is mandatory.
Which feature flag or access control contains exposure.
Which metrics and segments will be reviewed.
Which events trigger an alert, pause, rollback, or incident process.
Who makes the decision to continue, modify, or stop.

Write the success and stop criteria before the first participant enters the pilot. Otherwise, adoption pressure can turn a temporary exception into a permanent operating state. A pre-agreed stop condition gives the incident owner authority to act without waiting for a fresh executive debate while a consequential failure continues.
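
The pilot boundary and its stop conditions can both be written down as data before the first participant is enrolled. The following Python sketch is one hedged way to do that: PilotBoundary, is_exposed, triggered_stops, and the example metric names are hypothetical, and stop conditions are simplified to upper limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotBoundary:
    eligible_roles: frozenset
    permitted_workflows: frozenset
    flag_enabled: bool  # the feature flag that contains exposure


def is_exposed(boundary, role, workflow):
    """A user sees the feature only inside every agreed boundary."""
    return (
        boundary.flag_enabled
        and role in boundary.eligible_roles
        and workflow in boundary.permitted_workflows
    )


def triggered_stops(stop_limits, metrics):
    """Return stop conditions whose limit is exceeded.

    A metric that was not reported counts as exceeded, so a silent
    monitoring gap pauses the pilot instead of passing it.
    """
    return sorted(
        name for name, limit in stop_limits.items()
        if metrics.get(name, float("inf")) > limit
    )
```

If triggered_stops returns anything, the incident owner already has the pre-agreed authority to turn flag_enabled off.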

The pilot should test the entire sociotechnical workflow. Measure whether people understand the AI’s role, inspect the output, use the correction path, escalate uncertain cases, and complete the intended task. A model can appear accurate while users over-trust it, ignore it, or spend more time verifying it than the workflow saves.

Gate 3: Controlled expansion

Scale only when the evidence satisfies the release contract and the remaining risks have named owners. Expand one meaningful dimension at a time where practical: the eligible cohort, supported workflow, data scope, or action authority. Opening all four simultaneously makes it difficult to identify which change caused a new failure.
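
A small check can flag an expansion proposal that changes more than one dimension at once. This Python sketch assumes the current and proposed scopes are described as dictionaries with one key per dimension; the key names and values are hypothetical.

```python
def changed_dimensions(current: dict, proposed: dict) -> list:
    """List the scope dimensions that differ between two rollout states."""
    keys = current.keys() | proposed.keys()
    return sorted(k for k in keys if current.get(k) != proposed.get(k))


current_scope = {
    "cohort": "pilot_clinic_a",
    "workflow": "discharge_summary",
    "data_scope": "de_identified",
    "action_authority": "proposes",
}
proposed_scope = dict(current_scope, cohort="clinics_a_and_b")

# Exactly one changed dimension means a later failure has one obvious suspect.
single_step = len(changed_dimensions(current_scope, proposed_scope)) == 1
```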

A disciplined pattern is to move from sandbox validation to controlled pilots with documented data flows, guardrails, and pre-agreed mitigations. The audit trail should be generated from normal delivery artifacts rather than reconstructed when an auditor, customer, or executive asks what happened.

After launch: operate the product as a learning system

Production is where input distributions, user behavior, costs, and failure modes become visible. Run three connected operational views:

System health: Model, prompt, retrieval, and policy versions; latency; cost; errors; availability; and data-pipeline anomalies.
Workflow health: Eligibility, activation, task completion, abandonment, corrections, overrides, escalations, and time-to-value.
Outcome and safety health: Guardrail failures, prohibited behavior, incidents, rollback events, and outcome variance across relevant segments.

Every alert needs an owner, response path, and severity interpretation. Every material incident needs a record of the affected configuration, inputs, outputs, user impact, containment action, root cause, and prevention work. If the team cannot reconstruct which version produced a harmful or noncompliant output, observability is incomplete.

Treat a material model, prompt, retrieval, policy, or data-schema change as a product release even when the interface does not change. Run the relevant regression suite, compare the new configuration with the approved baseline, update the risk record, and preserve the decision. Change control is what prevents a previously reviewed system from becoming a different system under the same feature name.

Keep customer success, support, solutions engineering, and operational users in the feedback loop. Structured corrections and escalations can reveal workflow failures that aggregate accuracy metrics hide. Route those signals into evaluation cases, product discovery, and prioritization instead of treating them as isolated support tickets.

Your next step does not need to be a company-wide governance rewrite. Pick one healthcare AI use case and complete four artifacts: the use-case card, data-flow map, release contract, and gated rollout plan. If you cannot name the unacceptable outcome, the person who can stop the system, or the evidence required to resume it, the use case is not ready for production. Once those answers exist, responsibility becomes part of delivery rather than a negotiation at the end of it.

March 25, 2026

Bad Advice from Your AI Clone? Ethics, IP, and How Product Leaders Protect Quality

What happens when an AI starts giving advice in your voice—advice you’d never actually give? I’ve been thinking a lot about that question, and this conversation hit home for me as a product leader navigating the fast-evolving reality of AI “clones.”

Listen to this episode on: https://open.spotify.com/episode/7DNDIlIimwbbMOytArewRp?ref=producttalk.org | https://podcasts.apple.com/kh/podcast/bad-advice/id1794203808?i=1000756914818&ref=producttalk.org. Prefer video? Watch on YouTube: https://www.youtube.com/embed/RF4BwaeMMlg?feature=oembed

The episode examines AI “clones” built from podcast transcripts and public content—where the experimentation feels exciting, where it crosses ethical lines, and what happens when mediocre AI outputs get attributed to real people. The tension is real: when a bot confidently answers in your style but misses the nuance, “it’s not me” becomes more than a disclaimer—it’s a reputational defense.

We dig into the messy parts: IP ownership of open-sourced transcripts, the role of pirated books in LLM training sets, rising inference costs, and the uncomfortable economic question: if anyone can prompt “act like Teresa,” how do creators make a living? In my own decision-making, I look for clear consent, guardrails that prevent impersonation, and transparent UX that never confuses a synthetic perspective with a human expert.

This isn’t anti-AI. It’s a nuanced conversation about quality, consent, and remembering there are real humans behind the ideas.

Here’s how I translate the key takeaways into practice. Using AI for perspective is fine; equating it with the real person isn’t. Free-feeling AI outputs still rely on someone’s work. Expertise is more than past content; it’s context, judgment, and evolution. If someone’s work influences you, find a way to support them. These principles help teams benefit from generative AI without eroding trust or the creator ecosystem.

“Technically possible” doesn’t mean “ethically okay.” My AI Strategy playbook includes privacy-by-design, clear data governance on training materials, and a bright line between inspiration and impersonation. When we ship AI features, we label synthetic outputs, avoid mimicking living experts without permission, and create paths to compensate or promote the humans whose thinking underpins the experience.

I’ve also tested the “act like X” pattern to stress-test product quality. Even when outputs sound plausible, they rarely capture the expert’s mental models, trade-offs, or the evolution of their thinking—especially in complex product discovery work. That gap is the difference between average AI text and expert product management leadership.

If you listen, consider a few reflection prompts: Have you ever used AI to “act like” someone you admire? Could you tell whether the output matched that person’s actual thinking? How do you decide what’s ethically okay when using public content in LLMs? And how can we support creators while still embracing new tools?

Resources & Links you may find helpful:

Follow Teresa Torres: https://ProductTalk.org
Follow Petra Wille: https://Petra-Wille.com
Delphi.ai (AI bot platform discussed): https://www.delphi.ai/?ref=producttalk.org
Lenny’s Podcast: https://www.lennysnewsletter.com/podcast?ref=producttalk.org
ChatGPT: https://chatgpt.com/?ref=producttalk.org
Petra’s Coaching Packages: https://www.petra-wille.com/coaching-packages?ref=producttalk.org
Teresa’s Product Talk: https://www.producttalk.org/
Teresa’s book Continuous Discovery Habits: https://www.producttalk.org/continuous-discovery-habits/
Lenny’s open-sourced podcast transcripts: https://www.dropbox.com/scl/fo/yxi4s2w998p1gvtpu4193/AMdNPR8AOw0lMklwtnC0TrQ?rlkey=j06x0nipoti519e0xgm23zsn9&e=1&st=ahz0fj11&dl=0&ref=producttalk.org

Have thoughts on this episode or practices that have worked in your org? Share them below—I’m keen to learn how other teams are balancing innovation with integrity.

Inspired by this post on Product Talk.

March 24, 2026

Agentic AI for Clinical Trial Operations: A Practical Playbook

If you are deciding where to introduce agentic AI in clinical trial operations, the hard question is not whether an agent can complete an impressive demonstration. It is whether the agent can produce a traceable, reviewable result under real trial conditions without obscuring who remains accountable.

Start with a bounded operational workflow, not a promise to automate an entire role. The useful outcome is not an agent that sounds intelligent. It is a smaller work queue, earlier detection of issues, faster human review, and enough evidence to explain every recommendation after the fact.

Start with work that is bounded, frequent, and reversible

Clinical operations contains no shortage of repetitive work. That does not make every task a suitable first agent use case. A workflow can be repetitive and still be unsafe to automate if an error changes a source record, delays escalation, affects patient safety, or hides a protocol issue.

Do not rank candidate workflows by estimated time savings alone. Rank them by risk-adjusted learnability: how quickly can you observe the agent’s behavior, compare it with an accountable reviewer, and contain a mistake before it has a consequential downstream effect?

A strong initial workflow usually has these properties:

A clear trigger and an unambiguous end state.
A finite set of authorized inputs.
An output that a qualified person can independently verify.
A mistake that can be corrected before it changes a consequential decision.
A named reviewer who already owns the underlying process.
An existing queue, baseline, or historical record against which performance can be evaluated.
A defined escalation path for ambiguity, missing data, conflicting records, and tool failure.

Document classification is a useful illustration. An electronic trial master file (eTMF) agent has been applied to more than 80,000 documents per year. That workload is high-volume and structured enough to create repeatable evaluation data. The agent can recommend a classification, expose the evidence behind it, and send uncertain cases to a reviewer. A person can correct the result before the document proceeds through the controlled process.

Monitoring is a different risk class. A clinical research associate (CRA) agent can assemble safety and data-quality signals from 13 clinical systems, but that breadth is not permission to replace clinical judgment. The safer product boundary is evidence gathering, prioritization, and routing. The accountable professional still determines what the signal means and what action is appropriate.

My rule is simple: let the agent compress evidence gathering before it earns authority to execute an outcome. An agent may identify a possible discrepancy, collect the associated records, and prepare a review packet. It should not resolve a safety issue, close a query, approve clinical content, or alter an authoritative record unless that specific action has been validated, authorized, and made recoverable.

Turn the operating contract into governed platform primitives

Before writing prompts, write the operating contract. It should state the agent’s intended use, authorized inputs, available tools, required output, prohibited actions, review owner, escalation conditions, and evidence to retain. This contract gives product, clinical operations, quality, security, and engineering the same object to inspect.

The prohibited-actions section deserves particular attention. An instruction such as “help the CRA monitor the trial” is too broad to test. A useful boundary sounds more like this: retrieve permitted records, normalize specified fields, identify conditions defined in the approved specification, present supporting evidence, and route the result to the assigned reviewer. Do not interpret clinical significance, overwrite a source value, or close the issue.

A durable platform can encode that contract through reusable primitives such as models, skills, knowledge bases, Model Context Protocol (MCP) connectors, versions, and trigger types. Each primitive should own a specific control rather than serving as a loose container for prompts.

Platform primitive	Product decision to make explicit	Operational failure it should contain
Model	Which approved model and configuration may perform the task, including fallback behavior.	An unreviewed model change silently altering the output.
Skill	The narrow action, permitted inputs, expected schema, and failure behavior.	A general-purpose prompt expanding beyond the validated task.
Knowledge base	Which controlled material is authoritative and which version applies.	An answer relying on obsolete or unapproved material.
Connector	Identity, credential, record scope, and read-versus-write permission.	The agent retrieving or changing data beyond its authorization.
Trigger	What condition may start a run and what happens when the condition repeats.	Duplicate, unexpected, or untraceable execution.
Version	Which complete configuration produced a result and how it can be rolled back.	An output that cannot be reproduced during investigation.

Version everything that can materially change behavior: prompts, skills, model configuration, knowledge, ontology mappings, connector permissions, and escalation logic. A run record should identify why the agent started, which configuration ran, which tools it called, what evidence it retrieved, what it produced, and how the reviewer disposed of the result.
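
The run record described above can be sketched as a small data structure with a completeness check. This is an illustrative Python example, not a reference schema: RunRecord, audit_gaps, and the field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    trigger: str                     # why the agent started
    config_version: str              # which complete configuration ran
    tool_calls: list = field(default_factory=list)
    evidence_ids: list = field(default_factory=list)
    output: Optional[str] = None
    reviewer_disposition: Optional[str] = None


def audit_gaps(record: RunRecord) -> list:
    """Name the parts of a run that could not be explained after the fact."""
    gaps = []
    if not record.trigger:
        gaps.append("trigger")
    if not record.config_version:
        gaps.append("config_version")
    if record.output is None:
        gaps.append("output")
    elif not record.evidence_ids:
        gaps.append("evidence_ids")  # an output with no retrievable evidence
    if record.reviewer_disposition is None:
        gaps.append("reviewer_disposition")
    return gaps
```

Running audit_gaps over recent records is a cheap way to find runs that would be difficult to reconstruct during an investigation.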

Separate read authority from write authority. A standard connector interface can make a system callable; it does not make every call permissible. Authentication and credential handling belong in a governed connector layer, as demonstrated by custom MCP connectors with an authentication and credentialing wrapper. The agent should receive only the tools and permissions required for the current task.
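
The distinction between callable and permissible can be shown with a thin authorization layer in front of the tools. The Python sketch below is a generic illustration and does not use any real MCP SDK: ToolGrant, GovernedConnector, PermissionDenied, and the tool name are all hypothetical.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when a task asks for a tool or mode it was not granted."""


@dataclass(frozen=True)
class ToolGrant:
    tool: str
    can_read: bool = False
    can_write: bool = False


class GovernedConnector:
    """Holds only the grants issued for the current task."""

    def __init__(self, grants):
        self._grants = {g.tool: g for g in grants}

    def authorize(self, tool: str, mode: str) -> None:
        grant = self._grants.get(tool)
        if grant is None:
            raise PermissionDenied(f"{tool}: not granted for this task")
        if mode == "read":
            allowed = grant.can_read
        elif mode == "write":
            allowed = grant.can_write
        else:
            allowed = False  # unknown modes are denied by default
        if not allowed:
            raise PermissionDenied(f"{tool}: {mode} not permitted")


connector = GovernedConnector([ToolGrant("edc.records", can_read=True)])
connector.authorize("edc.records", "read")  # passes silently
```

Here a write request against edc.records raises PermissionDenied even though the tool is reachable, which is the behavior the operating contract should be able to test.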

The same governance should apply across delivery models. First-party agents can prove reusable patterns, services-led implementations can handle complex workflows, and self-service configuration can extend adoption. Those three deployment paths should share the same identity controls, version model, evaluation process, monitoring, and audit record. Self-service without centralized guardrails merely distributes configuration risk.

Match retrieval to the question the agent must answer

Many apparent reasoning failures begin as retrieval or data-alignment failures. The agent received an outdated document, missed the relevant section, joined records under inconsistent identifiers, or treated two conflicting statuses as though they agreed. A larger context window does not repair those defects. It can make them harder to notice.

Choose the retrieval pattern from the operational question:

Use embeddings for semantic discovery. This is useful when the agent needs to find conceptually related material despite differences in wording. Retrieval results still need document identity, version, and provenance.
Use document hierarchies when structure carries meaning. Markdown or another explicit hierarchy can preserve the relationship among sections, subsections, tables, and controlled instructions. This is preferable when a nearby heading changes how a passage should be interpreted.
Use just-in-time connector retrieval for current system state. When the answer depends on the latest authorized record, retrieve it from the system at run time rather than relying on a stale copied index.

These patterns are complementary. An agent may use semantic retrieval to identify relevant controlled material and an MCP connector to fetch the current operational record. What matters is that the final output distinguishes retrieved policy or guidance from live trial data and preserves the provenance of both.

Cross-system monitoring also needs an ontology layer. Terms, statuses, units, and identifiers that appear similar may not carry the same operational meaning. A unified ontology can align terminology across multiple clinical systems, but normalization must not erase the original value. Retain the source system, source field, retrieval time, transformation applied, and canonical concept alongside every normalized field used in a recommendation.
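
A normalized field that keeps its provenance might look like the following Python sketch. The mapping table, system names, and field names are hypothetical examples, and the approved ontology itself would come from your governed source rather than from code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedField:
    canonical_concept: str
    canonical_value: str
    source_system: str
    source_field: str
    source_value: str      # the original value is kept, never overwritten
    retrieved_at: datetime
    transformation: str


# Hypothetical ontology entries: (system, field, raw value) -> canonical value.
STATUS_ONTOLOGY = {
    ("ctms", "subject_status", "Enrolled"): "ENROLLED",
    ("edc", "subj_stat", "ENR"): "ENROLLED",
}


def normalize_status(source_system, source_field, source_value, retrieved_at):
    """Map a raw status to its canonical concept, or fail loudly if unmapped."""
    key = (source_system, source_field, source_value)
    if key not in STATUS_ONTOLOGY:
        raise LookupError(f"no approved mapping for {key}")
    return NormalizedField(
        canonical_concept="subject_status",
        canonical_value=STATUS_ONTOLOGY[key],
        source_system=source_system,
        source_field=source_field,
        source_value=source_value,
        retrieved_at=retrieved_at,
        transformation="status_ontology_lookup",
    )
```

An unmapped value raises an error instead of being guessed, so a gap in the ontology surfaces as an escalation rather than as a quietly wrong recommendation.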

Define conflict behavior explicitly. The newest value should not automatically win merely because it has the latest timestamp. If two authoritative records disagree and no validated reconciliation rule applies, the agent should show both, explain the conflict in neutral terms, and escalate. Fabricating a clean answer from inconsistent data is more dangerous than returning no answer.
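
The conflict rule can be made explicit in a few lines. In this Python sketch, reconcile only returns a single value when the records agree or when a validated rule is supplied; otherwise it returns every source for escalation. The function name, the record shape, and the status labels are assumptions for illustration.

```python
def reconcile(records, rule=None):
    """Resolve records for one field without inventing agreement.

    records: list of dicts with at least "system" and "value" keys.
    rule: an optional validated reconciliation function; without one,
          disagreement is escalated rather than resolved by timestamp.
    """
    if not records:
        return {"status": "no_data"}
    distinct_values = {r["value"] for r in records}
    if len(distinct_values) == 1:
        return {"status": "agreed", "value": records[0]["value"], "sources": records}
    if rule is not None:
        return {"status": "reconciled", "value": rule(records), "sources": records}
    return {"status": "escalate", "sources": records}


conflicting = [
    {"system": "ctms", "value": "ENROLLED", "retrieved_at": "2026-03-20"},
    {"system": "edc", "value": "WITHDRAWN", "retrieved_at": "2026-03-22"},
]
result = reconcile(conflicting)
```

Note that the escalation result carries both records and no "value" key, so downstream code cannot accidentally treat the newer record as the answer.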

Context management should reduce the agent’s working set to what the current decision requires. Sub-agents and automatic tool filtering can isolate tasks and limit the tools presented at each step. A retrieval sub-agent might return structured evidence with provenance, while a separate workflow skill applies the approved decision rule. That separation makes failures easier to test and permissions easier to constrain.

Do not optimize context solely for token efficiency. In clinical operations, the stronger reason to keep context narrow is control: fewer irrelevant records, fewer callable tools, clearer evidence lineage, and a smaller surface on which conflicting instructions can alter behavior.

Make evaluation and human review release gates

A clinical operations agent is not ready because it succeeds on a happy-path demonstration. Readiness means its intended behavior, failure behavior, and escalation behavior have all been tested against representative conditions. The evaluation plan should exist before the team sees the final test results, so release criteria do not drift to accommodate a weak agent.

Move through increasing levels of operational authority:

Retrospective evaluation: run the agent against a controlled golden dataset without access to live workflows.
Shadow operation: process current inputs in read-only mode while the existing process remains authoritative.
Assisted operation: show recommendations and evidence to a qualified reviewer, requiring approval before any downstream action.
Bounded execution: automate only the reversible actions that have earned sufficient evidence, while preserving escalation and rollback.

A golden dataset needs more than obvious examples. Include normal cases, ambiguous inputs, missing records, conflicting fields, outdated knowledge, duplicate triggers, unauthorized tool requests, and cases that should produce an abstention. Keep high-consequence failure modes visible as separate evaluation slices; a strong average can conceal the specific false negative that matters most.

Human feedback is useful, but it is not automatically ground truth. Reviewers can disagree, inherit inconsistent local practices, or approve a recommendation without examining it closely. Capture the initial agent output, reviewer action, reason for correction, and final adjudication where the process provides one. Use adjudicated outcomes to improve the golden set instead of treating every click as an equally reliable label.

Evaluate the properties that correspond to the operating contract:

Correct classification, routing, or evidence assembly on adjudicated cases.
Recall on important conditions, reviewed separately for higher-consequence misses.
Abstention and escalation when information is missing, conflicting, or outside scope.
Evidence completeness, including links or identifiers that let a reviewer verify the output.
Tool-use correctness, permission failures, and attempts to call unauthorized tools.
Reviewer acceptance, correction, and overturn reasons.
Operational impact on queue size, review effort, and time to disposition.

Set release thresholds according to the consequence of the task. A threshold appropriate for a reversible document suggestion is not automatically appropriate for a safety-monitoring signal. Do not compensate for weak performance on a high-risk slice with excellent performance on easy cases.
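
A per-slice gate makes the "no compensation" rule mechanical. The sketch below assumes hypothetical slice names, counts, and thresholds; SliceResult and release_gate are illustrative names, not part of any specific evaluation framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    name: str
    passed: int
    total: int
    threshold: float  # minimum pass rate required for this slice

    @property
    def pass_rate(self) -> float:
        # An empty slice counts as 0.0 so missing evidence cannot pass a gate.
        return self.passed / self.total if self.total else 0.0


def release_gate(results: list[SliceResult]) -> tuple[bool, list[str]]:
    """Pass only if every slice meets its own threshold; no overall average."""
    failures = [
        f"{r.name}: {r.pass_rate:.2f} < {r.threshold:.2f}"
        for r in results
        if r.pass_rate < r.threshold
    ]
    return (not failures, failures)


# Hypothetical results: the pooled pass rate is 115/120, yet the gate fails.
results = [
    SliceResult("normal_cases", passed=98, total=100, threshold=0.90),
    SliceResult("conflicting_fields", passed=17, total=20, threshold=0.95),
]
ok, failures = release_gate(results)
print(ok, failures)
```

In this example the pooled pass rate is about 96 percent, but the high-consequence slice sits at 85 percent against a 95 percent requirement, so the release is blocked.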

The human review interface is part of the safety system. It should present the recommendation, the exact supporting evidence, source identity, relevant timestamps, detected conflicts, and the permitted next actions. The reviewer needs an obvious way to correct, reject, or escalate the output. A generic approve button encourages automation bias and produces weak feedback data.

Preserve a traceable chain from agent intent to specification to test evidence. A release packet should identify the approved intended use, current versions, evaluation dataset, results by risk slice, known limitations, required human controls, monitoring plan, and rollback procedure. This is not paperwork added after product development. In a GxP-regulated setting, it is part of the product.

Production monitoring should detect changes in both behavior and operating conditions. Watch for shifts in input mix, rising abstention, changes in reviewer overturn reasons, missing provenance, connector failures, and differences after any model, knowledge, permission, or ontology update. When a material change occurs, route the affected configuration back through the relevant evaluation gates.

Key takeaways

Choose a bounded, frequent, and reversible workflow before attempting broad role automation.
Use agents to assemble evidence and prioritize work before granting authority over consequential outcomes.
Express the operating contract through governed models, narrow skills, controlled knowledge, permissioned connectors, triggers, and reproducible versions.
Match retrieval to the question: semantic discovery, hierarchical document access, and live connector retrieval solve different problems.
Preserve ontology mappings and field-level provenance when normalizing data across clinical systems.
Treat abstention, escalation, human review, evaluation evidence, production monitoring, and rollback as release requirements.

Your next artifact should not be a broader agent demonstration. Write the operating contract for the narrowest valuable workflow, then identify its authoritative inputs, prohibited actions, accountable reviewer, evaluation cases, escalation path, and rollback procedure. If any of those are unclear, narrow the workflow again. If they are explicit, you have a credible starting point for an agent that can improve clinical operations without outrunning the evidence.

References

Shivam.Consulting Blog – Inside Medable’s Agent Studio: The Agentic AI Blueprint to Accelerate Safer Clinical Trials

March 19, 2026

Agentic Architecture Demystified: How Modern AI Systems Plan, Learn, and Execute at Scale

In my role leading product teams at HighLevel, I’m often asked to explain what’s really happening behind the scenes of today’s AI products. The short answer is that modern systems are built on an agentic architecture: not just a single model, but a coordinated loop of planning, tool use, memory, and evaluation. Once you see that pattern, the design decisions snap into focus and the roadmap becomes far easier to prioritize.

At its core, agentic AI treats the model as a reasoning engine embedded within an AI workflow. The agent interprets intent, plans steps, calls the right tools and APIs, grounds itself in trusted data, and then evaluates outcomes before deciding to continue or stop. This loop creates reliability, reduces hallucinations, and enables the system to operate in real-world, multi-step scenarios.

Here’s the practical lifecycle I rely on. A user provides intent (a goal or request). We run a retrieval-first pipeline to ground the model in accurate, current data. Prompt engineering structures the task and primes the agent with constraints and success criteria while keeping the context window focused. The agent generates a plan, executes steps by calling tools or services, evaluates intermediate results, reflects or revises as needed, and only then returns a final answer with clear citations or evidence.
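
To make the lifecycle concrete, here is a rough sketch of the loop with every component passed in as a plain function. It is not a description of any particular product: run_agent, the callable signatures, and the stub lambdas in the usage example are all hypothetical, and a real system would call a model and real tools where the stubs are.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]
Planner = Callable[[str, list[str]], list[str]]
Executor = Callable[[str], str]
Evaluator = Callable[[str, list[str]], bool]


def run_agent(intent: str, retrieve: Retriever, plan: Planner,
              execute: Executor, evaluate: Evaluator,
              max_revisions: int = 2) -> dict:
    """Retrieve first, then plan, act, and evaluate until a check passes."""
    evidence = retrieve(intent)                  # ground before generating
    for attempt in range(max_revisions + 1):
        steps = plan(intent, evidence)           # propose steps for this attempt
        results = [execute(step) for step in steps]
        if results and evaluate(intent, results):
            # Return the answer together with the evidence it was grounded in.
            return {"status": "done", "answer": results[-1],
                    "evidence": evidence, "attempts": attempt + 1}
    # Out of revisions: report that, do not return an unchecked answer.
    return {"status": "unresolved", "answer": None,
            "evidence": evidence, "attempts": max_revisions + 1}


# Stubbed usage; each lambda stands in for a model or tool call.
out = run_agent(
    "refund policy",
    retrieve=lambda q: ["doc-12: refunds within 30 days"],
    plan=lambda q, ev: ["summarize " + ev[0]],
    execute=lambda step: step.upper(),
    evaluate=lambda q, res: "REFUNDS" in res[-1],
)
print(out["status"], out["attempts"])
```

The explicit revision limit is the termination condition; without one, a failing evaluator would keep the loop running indefinitely.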

For more complex work, I orchestrate multiple specialized agents—commonly a planner, a solver, and a critic—coordinated by a lightweight controller. This multi-agent pattern reduces single-agent blind spots, encourages self-checking, and mirrors how empowered product teams collaborate. Whether it’s conversation design for support flows or a voice AI agent driving hands-free tasks, orchestration is the difference between a clever demo and a dependable product.

Memory is the second pillar. Short-term working context sits in the prompt, while long-term memory lives in vector stores or databases to track past interactions, preferences, and outcomes. Retrieval augments the model with the right facts at the right time, and tight context window management ensures the agent stays focused on signal, not noise. The result is faster responses, lower costs, and far better accuracy.

Reliability is earned through eval-driven development and robust AI risk management. I define offline and online evaluations, guardrails, and human-in-the-loop checkpoints before scaling traffic. These evaluations become living, automated tests that protect against regressions as prompts, models, and tools evolve. The payoff is real: fewer escalations, higher trust, and measurable improvements to quality over time.

From a product strategy perspective, I resist over-engineering. Start with a simple retrieval-first pipeline and a single agent; prove value; then layer in multi-agent orchestration only where it moves key metrics. Instrument everything—latency, cost, grounding coverage, and outcome quality—and build Agent Analytics dashboards so teams can diagnose issues and iterate with confidence.

If you’re looking for a practical playbook, here’s mine: clarify the user intent and success criteria; design the tools the agent can call; ground with authoritative data; write prompts that constrain scope and define termination conditions; add reflection and automated evaluations; and ship behind feature flags for safe, staged rollout. Each step compounds reliability without killing velocity.

The diagram and the video above bring these patterns to life. If you watch closely, you’ll see the same loop—plan, retrieve, act, evaluate—show up in every effective implementation, regardless of domain. That repetition isn’t accidental; it’s the backbone of agentic architecture and a blueprint you can adapt to your own stack.

Ultimately, what matters is outcomes. When we build around agentic AI, we create systems that are explainable to stakeholders, maintainable by engineers, and genuinely helpful to customers. That’s how we move past hype to durable impact—shipping AI products that plan, learn, and execute at scale.

Inspired by this post on Product School.

March 16, 2026

Behavioral Analytics That Crush Fraud: Spot Anomalies, Prioritize Risk, Act with Confidence

Fraud teams are drowning in signals—events, alerts, and edge cases that look suspicious but rarely point to what truly matters now. In my role leading product, I focus on turning that noise into clear, ranked actions the team can trust. Behavioral analytics is how we bridge the gap from “something looks off” to “here’s why it matters and what to do next.”

See how behavioral analytics helps fraud management teams surface anomalies, prioritize risk factors, and act faster with greater confidence.

When I build fraud capabilities, I start by defining the outcomes that matter: find anomalies early, prioritize by impact, and respond in minutes—not days. That requires a rigorous approach to data governance, strong observability across the stack, and a mindset tuned to threat detection and response rather than passive reporting.

For me, behavioral analytics means unifying event streams across web, mobile, payments, and support into a single, trustworthy analytics platform. We then apply anomaly detection on top of baselines for user, device, and entity behavior, capturing velocity spikes, geolocation drift, account takeover signals, and unusual journey paths. The win is not more alerts; it’s clearer context per alert.

Prioritization is where the value compounds. I combine deterministic signals (e.g., device fingerprint mismatches, impossible travel, repeated declines) with weighted risk scoring that adapts to emerging patterns. This helps fraud analysts triage by potential loss and customer impact, not just alert volume—so the highest-risk cases land at the top of the queue with the right context attached.
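
As a simplified illustration of triage by potential loss, the sketch below scores each alert from the signals present and ranks alerts by score times amount at risk. The weights, signal names, and amounts are hypothetical, and the weights are static here; adapting them to emerging patterns is outside the scope of this example.

```python
from dataclasses import dataclass

# Hypothetical weights; in practice these would be tuned and reviewed regularly.
WEIGHTS = {
    "device_mismatch": 0.35,
    "impossible_travel": 0.40,
    "repeated_declines": 0.25,
}


@dataclass(frozen=True)
class Alert:
    alert_id: str
    signals: frozenset[str]
    amount_at_risk: float


def risk_score(alert: Alert) -> float:
    """Sum the weights of the signals present, capped at 1.0."""
    return min(1.0, sum(WEIGHTS.get(s, 0.0) for s in alert.signals))


def triage(alerts: list[Alert]) -> list[Alert]:
    """Order by expected loss (score times amount at risk), highest first."""
    return sorted(alerts,
                  key=lambda a: risk_score(a) * a.amount_at_risk,
                  reverse=True)


alerts = [
    Alert("a1", frozenset({"repeated_declines"}), 5000.0),
    Alert("a2", frozenset({"device_mismatch", "impossible_travel"}), 800.0),
    Alert("a3", frozenset({"impossible_travel"}), 4000.0),
]
for alert in triage(alerts):
    print(alert.alert_id, round(risk_score(alert) * alert.amount_at_risk, 2))
```

Note that a2 carries the highest score but lands last, because its amount at risk is small. Ranking on score alone would have put it first.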

Actionability is the final mile. I map each risk tier to a playbook—step-up authentication, temporary holds, secondary review, or immediate block—so teams can act with confidence. Real-time alerts route to the right channel; feature flags allow fast containment; and AI risk management practices ensure continuous learning while preserving precision and recall. We close the loop by measuring investigation time, false positive rates, and recovery to keep improving.
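
The tier-to-playbook mapping can live in a small table so it is reviewable and auditable. The cut-offs, tier names, the ordering of actions by severity, and the default "allow" outcome below are assumptions made for illustration; only the four action types come from the text above.

```python
# (minimum score, tier, action), checked from highest to lowest.
# All thresholds and tier names are hypothetical.
PLAYBOOK = [
    (0.90, "critical", "immediate_block"),
    (0.70, "high", "temporary_hold"),
    (0.40, "medium", "step_up_authentication"),
    (0.20, "low", "secondary_review"),
]


def action_for(score: float) -> tuple[str, str]:
    """Return (tier, action) for a risk score in the range 0.0 to 1.0."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for minimum, tier, action in PLAYBOOK:
        if score >= minimum:
            return tier, action
    return "minimal", "allow"   # assumed default below the lowest tier


for s in (0.95, 0.75, 0.40, 0.10):
    print(s, action_for(s))
```

Keeping the table as data means a change to a threshold is a one-line diff that can be reviewed and versioned like any other policy change.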

A few lessons keep paying off: instrument early and consistently; keep your schema stable; document risk definitions; and validate changes with A/B tests to quantify impact before scaling. Treat your fraud stack like a mission-critical cybersecurity system with tight SLAs, clear ownership, and auditable decisions, because it is one.

If you’re evaluating your next move, start with a narrow but high-ROI use case (account takeover or payment fraud), stand up clear dashboards for analysts, and iterate on the risk scoring model weekly. With disciplined data practices and aligned playbooks, behavioral analytics turns scattered signals into decisive, defensible action.

Inspired by this post on Amplitude – Perspectives.

March 5, 2026

Battle-Tested AI Agent Orchestration Patterns for Reliable, Observable, Product-Ready Systems

Shipping agentic AI into production is exhilarating until a flaky output torpedoes trust. Over the past year, I’ve led teams at HighLevel to operationalize agents across customer-facing and internal workflows, and I’ve learned that reliability isn’t an afterthought; it’s an architecture. In this piece, I share the orchestration patterns that consistently deliver dependable outcomes at scale.

When we talk about orchestration, we’re talking about more than a single prompt. The shift is from monolithic calls to coordinated “agentic AI” where routers, planners, and specialists collaborate through structured “AI workflows.” In practice, I rely on a few canonical patterns: a planner–executor loop for multi-step tasks, a router–specialist setup for skill selection, and a “retrieval-first pipeline” that grounds generation with authoritative context before a single token is produced.

Reliability-by-design starts with typed inputs/outputs and strict validation. I standardize on JSON schemas, enforce tool/function signatures, and implement idempotency keys so retries don’t wreak havoc on downstream systems. Timeouts, circuit breakers, and backpressure protect the platform under load, while rate limiting and dead-letter queues keep failure modes contained. Most importantly, we engineer graceful degradation: agents “abstain” when uncertain, fall back to deterministic paths, and escalate to humans instead of guessing.
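
To show why idempotency keys make retries safe, here is a minimal in-memory sketch. The names idempotency_key and IdempotentExecutor are hypothetical, and deriving the key from the call contents is an assumption made for brevity; production systems often scope the key to a task or request ID and persist results in a durable store, so that an intentional repeat of the same call is not suppressed.

```python
import hashlib
import json


def idempotency_key(tool: str, args: dict) -> str:
    """Derive a stable key from the tool name and canonicalized arguments."""
    payload = json.dumps({"tool": tool, "args": args},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    """Runs each distinct tool call once and replays the stored result on retry."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}
        self.executions = 0

    def call(self, tool: str, args: dict, fn):
        key = idempotency_key(tool, args)
        if key in self._results:
            return self._results[key]     # retry: no second side effect
        result = fn(**args)
        self.executions += 1
        self._results[key] = result
        return result


def send_email(to: str, body: str) -> str:
    # Stand-in for a tool with a side effect.
    return f"sent:{to}"


executor = IdempotentExecutor()
first = executor.call("send_email", {"to": "a@example.com", "body": "hi"}, send_email)
# Same call retried with keys in a different order: the side effect is skipped.
retry = executor.call("send_email", {"body": "hi", "to": "a@example.com"}, send_email)
print(first == retry, executor.executions)
```

Sorting the keys before hashing is what makes the two argument orderings produce the same key.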

Safety is a first-class concern, not a bolt-on. Our “AI risk management” pipeline includes PII redaction, allow/deny lists for tools and data, and the principle of least privilege for every connector (yes, even the ChatGPT connector). We codify policy-as-code for repeatability and require human-in-the-loop approvals for sensitive or irreversible actions. In my experience, clear red lines and reversible defaults prevent the vast majority of regrettable outcomes.
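
A deny-by-default tool check is one simple way to express least privilege and approval requirements as code. The roles, tool names, and the authorize function below are hypothetical; this is a sketch of the idea, not a description of any specific policy engine.

```python
from typing import Optional


class ToolNotPermitted(Exception):
    """Raised when a role may not call a tool, or approval is missing."""


# Hypothetical per-role grants. Anything not listed is denied.
ALLOWED_TOOLS = {
    "support_agent": {"search_kb", "draft_reply"},
    "billing_agent": {"search_kb", "lookup_invoice", "issue_refund"},
}

# Hypothetical set of sensitive or irreversible tools.
REQUIRES_APPROVAL = {"issue_refund"}


def authorize(role: str, tool: str, approved_by: Optional[str] = None) -> None:
    """Raise unless the role is granted the tool and any approval is present."""
    if tool not in ALLOWED_TOOLS.get(role, set()):
        raise ToolNotPermitted(f"{role} is not granted {tool}")
    if tool in REQUIRES_APPROVAL and not approved_by:
        raise ToolNotPermitted(f"{tool} requires human approval")


authorize("support_agent", "search_kb")                      # allowed
authorize("billing_agent", "issue_refund", approved_by="reviewer-1")
try:
    authorize("support_agent", "issue_refund")
except ToolNotPermitted as exc:
    print("blocked:", exc)
```

An unknown role resolves to an empty grant set, so a typo in configuration fails closed instead of open.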

Without strong “observability,” you’re flying blind. I instrument agents with an “Agent Analytics” layer that captures traces, spans, tool invocations, and token usage across the entire chain. The essential metrics are outcome quality (task success rate), latency (p50/p95), tool failure rates, cost per task, and user-level satisfaction signals. Cross-agent lineage allows us to pinpoint where a plan went awry and which tool or prompt introduced drift—vital for rapid remediation.
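
The core metrics can be derived from per-task trace records. The sketch below assumes a hypothetical TaskTrace shape and uses the nearest-rank definition of a percentile; other percentile definitions interpolate and will give slightly different values on small samples.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTrace:
    success: bool
    latency_ms: float
    cost_usd: float
    tool_calls: int
    tool_failures: int


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: list[TaskTrace]) -> dict[str, float]:
    if not traces:
        raise ValueError("no traces to summarize")
    n = len(traces)
    latencies = [t.latency_ms for t in traces]
    calls = sum(t.tool_calls for t in traces)
    failures = sum(t.tool_failures for t in traces)
    return {
        "task_success_rate": sum(t.success for t in traces) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "tool_failure_rate": failures / calls if calls else 0.0,
        "cost_per_task_usd": sum(t.cost_usd for t in traces) / n,
    }


# Hypothetical traces for four tasks.
traces = [
    TaskTrace(True, 100.0, 0.01, 2, 0),
    TaskTrace(True, 200.0, 0.02, 3, 1),
    TaskTrace(False, 300.0, 0.03, 1, 1),
    TaskTrace(True, 400.0, 0.04, 4, 0),
]
summary = summarize(traces)
print(summary)
```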

Quality improves fastest when it is measured relentlessly. I practice “eval-driven development” with golden datasets, rubric-based scoring, and risk-weighted sampling of edge cases. LLM-as-judge can help, but we always calibrate against human ratings and monitor agreement. In production, I blend online metrics with controlled “A/B testing” and plan experiments to hit a realistic minimum detectable effect (MDE). The result is a virtuous loop where prompt tweaks, tool changes, and retrieval adjustments are verified before wide rollout.
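
Planning for a minimum detectable effect comes down to a sample-size estimate. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name and the example baseline and effect are assumptions, and a dedicated power-analysis tool may give slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks needed per arm to detect an absolute change of `mde`.

    n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and baseline + mde must lie strictly in (0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical plan: 70% baseline task success, detect a 5-point lift.
n_five_points = sample_size_per_arm(baseline=0.70, mde=0.05)
n_two_points = sample_size_per_arm(baseline=0.70, mde=0.02)
print(n_five_points, n_two_points)
```

Because the effect size is squared in the denominator, halving the detectable effect roughly quadruples the traffic required, which is why a realistic MDE has to be chosen before the experiment starts.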

Agents need the same rigor we expect from any modern system. I gate releases through “CI/CD” with linting for prompts, schema checks for tools, and simulation runs for critical paths. “Feature flags” enable shadow and canary deployments so we can throttle exposure by segment or workflow. I also track delivery performance and stability with “DORA metrics,” including “deployment frequency,” and I partner closely with “SRE” for on-call coverage, runbooks, and incident postmortems tailored to agent failure modes.

Context is a resource to allocate, not a bottomless pit. Thoughtful “context window management” means curating retrieval, summarizing long-running threads, setting memory time-to-live, and constraining what the agent can see at any given step. I bias hard toward retrieval over recall, keep chunks small and semantically precise, and validate that the “retrieval-first pipeline” truly returns the right evidence—not just the nearest match.

In day-to-day product work, I lean on a compact playbook: a router that selects the best specialist; a planner that decomposes tasks and allocates tools; a deterministic guard that verifies preconditions; an execution loop with explicit budgets; and a fallback policy that prefers abstaining over hallucinating. Together, these patterns create an agent that behaves like a dependable teammate rather than a creative wildcard.
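
The guard, budget, and fallback pieces of that playbook can be sketched in a few lines. Everything here is illustrative: Step, Budget, run_with_budget, the confidence threshold, and the cost figures are hypothetical, and the router and planner are left out so the example stays focused on the execution loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    estimated_cost_usd: float
    run: Callable[[], tuple[str, float]]   # returns (output, confidence)


@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    steps_used: int = 0
    cost_used: float = 0.0

    def can_afford(self, cost: float) -> bool:
        return (self.steps_used < self.max_steps
                and self.cost_used + cost <= self.max_cost_usd)

    def charge(self, cost: float) -> None:
        self.steps_used += 1
        self.cost_used += cost


def run_with_budget(steps: list[Step], budget: Budget,
                    preconditions_met: Callable[[], bool],
                    min_confidence: float = 0.7) -> dict:
    """Execute steps in order; abstain or escalate instead of guessing."""
    if not preconditions_met():             # deterministic guard runs first
        return {"status": "abstained", "reason": "precondition failed", "outputs": []}
    outputs: list[str] = []
    for step in steps:
        if not budget.can_afford(step.estimated_cost_usd):
            return {"status": "abstained",
                    "reason": f"budget exhausted before {step.name}",
                    "outputs": outputs}
        budget.charge(step.estimated_cost_usd)
        output, confidence = step.run()
        if confidence < min_confidence:     # fallback: hand off to a human
            return {"status": "escalated",
                    "reason": f"low confidence at {step.name}",
                    "outputs": outputs}
        outputs.append(output)
    return {"status": "completed", "reason": "", "outputs": outputs}


steps = [
    Step("lookup", 0.01, lambda: ("found record", 0.9)),
    Step("draft", 0.02, lambda: ("draft reply", 0.5)),
]
result = run_with_budget(steps, Budget(max_steps=5, max_cost_usd=0.10), lambda: True)
print(result["status"], result["reason"], result["outputs"])
```

The low-confidence draft is withheld from the outputs, so a caller cannot mistake it for a verified result.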

No architecture thrives without the right rituals. Product trios keep discovery continuous, while clear outcomes (not output) align teams on value instead of vanity. We map risks early, maintain a public quality dashboard, and rehearse failure recoveries so incidents never become improvisations. The cultural signal is simple: we celebrate root-cause clarity and safe iteration over heroics.

If you’re just starting, implement three patterns first: retrieval before generation, abstain-and-escalate for low confidence, and canary releases under feature flags. Instrument everything from day one, run a weekly eval review, and expand scope only when the data says you’re ready. With these habits, your agents will earn user trust—and keep it.

Inspired by this post on Product School.

March 2, 2026