Author: Shivam Tiwari

Apex Arrives: Vertical AI That Beats GPT-5.4 on Customer Service Speed, Accuracy, and Cost

I just watched one of the most significant leaps in customer service AI in years. Last week, a quiet but seismic release landed in CX: Fin introduced Apex, a vertical model purpose-built for support that raises the bar on speed, accuracy, and cost. As a product leader, this is exactly the kind of breakthrough that changes roadmaps, vendor strategies, and what customers can expect from modern service operations.

It’s a brand new model for Fin called Apex, and it’s objectively the highest performing, fastest, and cheapest model for customer service. It beats the very best models in the industry including GPT-5.4 and Opus 4.5.

In this analysis, I’ll unpack why the launch matters for the customer service agent category, what it signals for frontier labs and open‑weight ecosystems, and how leaders should rethink their AI Strategy, build vs buy decisions, and eval-driven development roadmaps.

Fin was already the highest performing and most sophisticated agent in the customer service space, consistently beating impressive competitors like Decagon and Sierra at an average win rate in the 70s. It operates at tremendous scale, now resolving almost 2M customer issues per week, a number that’s growing at an exponential clip. In its short life it’s grown to nearly $100M in recurring revenue.

As of last week, ~100% of all (English language, chat and email) customer conversations are now running on Apex. Since day 1, the Fin engine has comprised a system of models, and last year the team began replacing off‑the‑shelf models with custom ones trained on proprietary data. The core answering model had been a frontier labs offering—initially versions of GPT and more recently Sonnet 4.0. Now, that core answering model is Apex 1.0.

This model resolves customer issues at a materially higher rate than any other model available. One of their largest customers in the gaming space saw the resolution rate improve overnight from 68% to 75% (i.e. a reduction in unresolved conversations of 22%). The team notes they had never seen a jump this large from a single improvement since they started Fin.

Just as important, it’s dramatically faster, has fewer hallucinations, and is far cheaper than other available models—exactly the attributes operations leaders weigh most when deploying agents at scale. In practice, these are the levers that unlock higher CSAT, tighter SLAs, and better unit economics.

Achieving all three simultaneously is extraordinarily hard. Credit goes to foundational research from a 60‑person AI group run by Fergal Reid, and, crucially, to domain‑specific proprietary evals drawn from billions of human and agent interactions produced by the Fin resolution engine—already hand‑tuned to be the most effective in the category. That creates a flywheel: an eval‑driven development loop that trains models to keep improving at the edge of the system’s abilities. In other words, Apex 1.0 looks like the tip of the iceberg.

Zooming out, service is one of the few categories where generative AI has already delivered commercial impact at scale (alongside coding, and arguably the legal industry). With TAMs measured in the hundreds of billions, competition is intense and well capitalized. The pattern I’ve seen repeatedly is clear: winners in these spaces must become full‑stack AI companies. As features become ~free to build, durable competitive differentiation shifts under the hood—to proprietary data, post‑training, inference efficiency, and the quality of the eval loop.

Fin Apex raises the bar for finance-ready AI, highlighting a -65% cut in hallucinations and a quicker first token at 3.7s (0.6s faster), compared with Sonnet 4.6, Opus 4.5, and GPT-5.4 in side-by-side charts.

That’s why competitors will need to release their own models. Many appear to be just starting to hire the talent to do so, which likely gives Fin at least a year of head start. For product leaders, this is a strong signal to revisit build vs buy assumptions, and to quantify when owning your post‑training pipeline and evals becomes the rational move.

Honestly, 2–3 years ago I expected AI application differentiation to live mostly in what we built around third‑party models. The AI game humbles all of us; today it’s obvious that vertical models paired with proprietary evals create compounding moats.

In a podcast interview last week, Andrej Karpathy said:

"I do think we should expect more speciation in the intelligences. The animal kingdom is extremely [diverse] in the brains that exist. And there’s lots of different niches of nature… And I think we should be able to see more speciation. And you don’t need this oracle that knows everything. You kind of speciate it. And then you put it on a specific task. And we should be seeing some of that because you should be able to have much smaller models that still have the cognitive core."

The frontier labs still have the very best models, but open‑weight models aren’t far behind—making pre‑training look increasingly like a commodity. The frontier is moving to post‑training, which is precisely what we see with Apex (and Cursor’s Composer 2), and what we should expect to dominate going forward.

Labs now face a dual reality. On one hand, horizontal general‑purpose models can over‑serve specific verticals (e.g., customer service doesn’t need an oracle that knows everything). On the other, open‑weight models are good enough that high‑quality, domain‑specific post‑training can produce superior models for special‑purpose jobs—and in the ways that matter for those jobs. In service, soft factors like judgement, pleasantness, and attentiveness matter alongside hard factors like resolution effectiveness, speed, and cost.

I’m still bullish on the labs. Many organizations remain heavy customers of Anthropic—whether as part of multi‑model systems or through deep usage of Claude Code in engineering teams (see this example of Claude Code adoption). Yet classic disruption (à la the late, great Clay Christensen) is now at their door. The way out is to disrupt themselves by building cheaper specialized models too, which likely requires acquiring the evals—or the companies with the evals—needed for each task. Expect creative data partnerships, M&A consolidation, and a wave of hyper‑specific model providers that compete head‑to‑head with the labs.

In the meantime, Fin appears to be the only vendor in its space with a custom model that’s also objectively superior to everything else out there. I’m excited to see it deployed broadly for end customers, and I’m watching closely for the next announcement that will accelerate that rollout. For product leaders, the message is clear: the age of vertical models and agentic AI is here—bring your evals, or bring your checkbook.

Inspired by this post on The Intercom Blog.

March 26, 2026
From Engineer to CEO: Hard-Won Lessons on GTM, Cloud-First Bets, and Must-Do Focus

Making the leap from engineer to CEO demands an almost entirely new skillset. I’ve felt that jolt firsthand: the tools that serve you as an IC or even a product leader—system design, crisp PRDs, elegant roadmaps—only get you about 20% of the way. The rest is learning to orchestrate go-to-market strategy, finance, hiring, culture, and product positioning with just enough depth to make sound, fast decisions while empowering true experts to execute.

My operating heuristic is the 80% rule. As CEO or GM, I don’t need to be the best marketer, seller, or finance leader; I need to understand 80% of each function well enough to set a compelling product strategy, ask the right questions, and catch the second-order effects. That breadth unlocks speed, quality of judgment, and the conviction to say no when the organization is tempted by what it can do rather than what it must do.

The clearest illustration comes from the journey that turned Apache Kafka—originally built at LinkedIn—into Confluent, a publicly traded enterprise software company. The technical insight was powerful, but the real lift came from translating that insight into a repeatable go-to-market engine. That required building new muscles: founder-led GTM, enterprise sales orchestration, and open source monetization without alienating the community that fueled adoption.

Early on, the product was “embarrassing” by enterprise standards—thin features, sharp edges, and a long tail of operational gaps. Shipping anyway was the point. A thin vertical slice into the market created learning loops with real customers, not hypotheticals. That uncomfortable speed became a superpower, especially when the company decided to push toward a cloud-first business in the face of widespread opposition.

The messaging challenge was just as hard as the technical one. Most marketing fails because it starts with what we built, not what customers must achieve. A simple product marketing pyramid—vision at the top, category framing and points of parity in the middle, crisp value props and proof at the base—helped explain Kafka to the world in customer language. When the narrative snaps into place, adoption accelerates. In Kafka’s case, one well-timed blog post clarified the “why now” and unlocked a step-change in community and enterprise pull.

There’s a pivotal distinction leaders underestimate: the gap between what a company can do and what it must do. I use a must-do filter before every planning cycle: What moves are non-discretionary for durable product-market fit? For Kafka and Confluent, that meant ruthless prioritization on managed cloud services, reliability, and platform scalability—even when it jeopardized short-term revenue or required retooling how engineering, sales, and support worked.

Fundraising strategy mirrored this clarity. Planning to raise before building the full product wasn’t about hype; it was about matching capital to the physics of the problem. If your category requires enterprise credibility, global infrastructure, and 24/7 SRE, you finance those table stakes early. That’s first principles decision making: instrument the constraints, then design the sequence that gets you to scale with the fewest irreversible mistakes.

In the early years, every product decision felt like a trade between polish and learning. The team essentially bludgeoned its way into a cloud-first posture—less because the initial product was ready, and more because the market’s must-do was obvious. That’s the essence of founder-led GTM: get into the field, close lighthouse customers, and use their arcs to shape the roadmap. It’s also where open source monetization matures from downloads into durable, enterprise value.

As the organization scales, excellence often erodes—the Chipotle problem. Process hardens; quality blurs; the magic decays. The antidotes are simple but hard: a few non-negotiable product quality bars, a short set of product-market fit metrics that everyone can recite, and empowered product teams who own outcomes over output. This is where organizational development matters as much as code: design clear interfaces between product, sales, and success, and you’ll keep velocity without losing standards.

Contrary to popular lore, founder optimism is overrated. Constructive realism wins. I try to model “probabilistic optimism”: assume we will win, but instrument the journey like an SRE runs an incident. Set leading indicators, rehearse failure modes, and make pre-commitments to the must-do path so you’re not swayed by the latest anecdote. It keeps the team out of a failure mindset while making room for rigorous course correction.

Giving up the right things at the right time is a CEO superpower. As complexity grows, I hand off decisions that benefit from specialization and keep only those tied to company narrative, must-do prioritization, and talent bar. CEO time management becomes a portfolio problem: ensure each week contains deep product time, frontline customer exposure, and one compounding systems fix (hiring loop, pricing rubric, or GTM enablement) that pays back for quarters.

If you’re moving from IC or PM into a GM/CEO role, here’s a practical playbook: build your product marketing pyramid; write the one-page must-do memo for the next six quarters; ship a narrow, managed cloud slice early; pick three product-market fit metrics (usage, time-to-value, retention) and publish them company-wide; and architect an enablement engine that turns field learnings into roadmap changes within one quarter. That’s how you transform technical advantage into a category-defining business.

The Kafka-to-Confluent arc reminds me that technology can open a door—but clarity of narrative, sequencing, and must-do focus determines whether you walk through it. When in doubt, bias toward shipping, talking to customers, and tightening the loop between what you learn and what you build. That’s the work of product management leadership at scale.

March 26, 2026
Inside My Product Marketing Playbook: Amplitude Analytics Tactics That Drive PLG Wins

I’ve curated a focused set of product marketing insights that zero in on what actually moves the needle—turning data into decisions. You’ll find a special emphasis on Amplitude Analytics, because its behavioral analytics foundation makes it easier to translate product usage into clear messaging, sharper positioning, and measurable growth.

In my day-to-day as a product leader, I’m constantly bridging the gap between product discovery and go-to-market strategy. The best outcomes come when we connect quantitative signals to narrative: using behavioral analytics to inform the value proposition, refining product positioning with cohort trends, and driving product-led growth with activation and retention insights.

Here’s how I put this into practice. I start with user activation and retention analysis to identify the few behaviors that predict long-term value. Then I run tightly scoped A/B testing to validate messaging and in-product prompts that nudge those behaviors. When the numbers move, I translate wins into a consistent story—one that sales, success, and marketing can all rally around.

One pattern keeps repeating: clarity beats complexity. Instead of piling on more features, I focus on the minimum, verifiable set of behaviors that correlate with outcomes. That discipline makes it easier to craft a crisp value proposition, streamline go-to-market strategy, and accelerate feedback loops between product, design, and marketing.

As you explore this collection, expect practical playbooks over platitudes. You’ll see how to apply Amplitude Analytics to uncover hidden friction, validate hypotheses faster, and operationalize product-led growth motions that compound over time. My goal is to help you move from interesting dashboards to decisive actions that strengthen your roadmap and your revenue.

If you care about building empowered product teams that learn continuously, you’ll feel at home here. Dive in, borrow what works, and adapt the rest to your context—then measure it, iterate, and share the wins with your team.

Inspired by this post on Amplitude – Best Practices.

March 25, 2026

How to Scale AI Customer Experience Without Losing Quality

Your AI agent can resolve more conversations while customer experience quietly becomes harder to trust. A ticket marked resolved can still contain an inaccurate answer, a skipped process step, a repetitive loop, or an escalation that arrived too late.

As the product leader, you don’t have to choose between automation and judgment. You need an operating system that identifies which conversations matter, defines what good looks like, routes exceptions to the right owner, verifies fixes, and connects support quality to customer behavior.

Measure outcomes, execution quality, and coverage separately

Resolution rate is a throughput metric. It tells you how often the operation reached a terminal state, but not whether the answer was correct or the customer was treated appropriately. CSAT has a similar limitation: it captures sentiment, not conformance. Customer sentiment and adherence to your standards answer different questions, so neither should stand in for the other.

Build the dashboard around four layers. Keeping them separate prevents a good aggregate number from concealing a weak customer experience.

Measurement layer	Question it answers	Signals to track	Decision it supports
Customer outcome	Did the customer get useful help?	Resolution outcome, recontact, escalation outcome, sentiment	Whether the interaction solved the customer’s problem
Execution quality	Did the conversation meet your operating standard?	Accuracy, process adherence, clarification, escalation ease, efficiency	Whether the agent behaved correctly
Evaluation coverage	How much of the operation did you actually inspect?	Eligible, automatically evaluated, human-reviewed, and still-unreviewed conversations	How much confidence to place in the quality result
Product impact	Did better help change customer behavior?	Activation, feature adoption, retention, and journey completion by cohort	Whether the CX improvement created durable value

Read the layers together, but don’t merge them into one executive number. High sentiment with low accuracy can mean the agent sounded helpful while giving the wrong answer. A strong scorecard result with poor sentiment can expose a technically correct but difficult experience. Good conversation quality with repeated contacts may mean the product, policy, or documentation still creates the underlying problem.

The product-impact layer is especially important. A support answer may pass every conversational check without improving the journey that matters. Connect CX data to activation, adoption, and retention behavior so you can distinguish a better answer from a better customer outcome.

A simple driver tree makes that connection explicit. Start with the business result, trace it to the customer behavior that produces it, identify the journey friction blocking that behavior, and then define the AI behavior that should remove the friction. If you can’t trace a proposed quality criterion through that chain, it may be a preference rather than a requirement.

Design monitoring as a portfolio, not a random sample

Monitoring begins with selection. A precisely calculated score from the wrong population creates false confidence. Small random samples are useful for trend stability, but they are unlikely to expose every high-risk edge case, complex escalation, or early sign of drift.

Use four complementary monitoring layers:

A stable benchmark cohort. Evaluate a repeatable sample on a consistent schedule. Preserve the same eligibility rules and segmentation so changes in pass rate represent changes in performance rather than changes in the sample.
Risk-targeted monitors. Select conversations with signals that deserve deliberate review. Examples include a customer showing signs of financial vulnerability, an agent repeating essentially the same answer, a required escalation that did not happen, or a sensitive process being handled without its required checks.
Change-specific monitors. Create cohorts for a new model, prompt, workflow, tool, policy, or knowledge release. Version every relevant component so a quality change can be traced to the production change that preceded it.
Journey monitors. Group conversations by customer intent and journey stage, not just channel or queue. This exposes recurring friction in onboarding, adoption, billing, account management, and other product journeys that an operation-wide average will flatten.

Instrument every eligible conversation, but don’t assume every conversation needs human review. Automated evaluation is appropriate for high-volume, clearly defined criteria. Human judgment belongs on critical failures, ambiguous cases, disputed scores, new scenarios, and calibration cohorts. The scalable design is broad automated visibility with concentrated human attention.

Each monitor should report more than a pass rate:

Coverage rate: completed evaluations divided by eligible conversations. Never present pass rate without this denominator.
Pass rate: completed evaluations that met the scorecard standard divided by all completed evaluations.
Critical failure rate: completed evaluations that failed at least one critical criterion divided by all completed evaluations.
Unreviewed queue: qualifying conversations that have not received their required review. Break this out by risk and age rather than showing only a total.
Evaluator overturn rate: dual-reviewed cases in which a human changed the automated judgment. A rising rate can signal an ambiguous rubric, a weak evaluator, or a new conversation pattern.
Failure recurrence: previously addressed failure modes that appear again after a fix. This distinguishes completed work from effective work.

Segment these measures by AI versus human handling, journey, intent, customer segment, channel, and deployed version. One shared quality system can compare AI and human conversations against the same customer outcomes, while still assigning different operational responsibilities. Unified review across automated and human conversations also makes handoff failures visible; otherwise, each side can look healthy while the transition between them breaks.

Be deliberate about the data attached to monitoring. Store the fields needed for segmentation, diagnosis, and audit, and restrict access to sensitive conversation content. A monitoring program creates risk if it spreads personal or regulated data into dashboards that were designed only for aggregate analytics.

Turn scorecards into executable product requirements

A monitor determines what enters review. A scorecard determines how the conversation is judged. That distinction is operationally important: targeted selection and custom evaluation criteria work as two separate controls.

I treat a scorecard as an executable product requirement. Each criterion should be observable in the conversation, interpretable by two independent reviewers, and connected to a specific action when it fails. Vague criteria such as helpful, natural, or on-brand produce arguments rather than reliable signals.

Build each criterion with the following fields:

Intent: the customer or business outcome the criterion protects.
Pass condition: the observable behavior that must be present.
Failure condition: the observable behavior that makes the conversation fail.
Evidence rule: the part of the transcript, tool trace, policy, or approved knowledge that supports the judgment.
Applicability: when the criterion is required and when it is not applicable.
Severity: whether failure contributes to a weighted score or overrides the entire evaluation.
Reviewer: automated evaluation, human evaluation, or both.
Remediation owner: the person or function expected to act on failure.

A practical CX scorecard usually needs criteria such as:

Accuracy: the answer is supported by approved knowledge, customer context, and tool results. Unsupported claims fail even if the customer accepts them.
Resolution and next step: the agent answers the request, clearly states what remains, or routes the customer to the correct next action.
Process adherence: required verification, disclosure, permission, and workflow steps are completed in the correct context.
Clarification: the agent asks for missing information when intent or account context is too ambiguous to answer safely.
Escalation: the agent recognizes defined handoff conditions, escalates without unnecessary resistance, and transfers the context the next agent needs.
Conversation efficiency: the agent avoids repetition, irrelevant steps, and loops while preserving the information necessary for a correct outcome.
Communication quality: the response is clear, appropriately direct, and consistent with the brand’s communication standard.

Weights help express relative importance, but a weighted average must not wash away a consequential failure. Mark accuracy, safety, security, required process steps, or other non-negotiable controls as critical where appropriate. A polite answer that gives a harmful instruction should not pass because it accumulated enough points elsewhere.

For legal, financial, safety, account-security, or regulated decisions, follow the applicable organizational policy and require human review or escalation where that policy demands it. An aggregate quality score is not authorization for an AI agent to make a decision outside its approved scope.

Automated evaluators also need calibration. They are measurement components, not ground truth. Build an adjudicated set of clear passes, clear failures, difficult edge cases, and not-applicable examples. Have human reviewers score the same cases independently, compare their reasoning with the automated result, and rewrite any criterion that allows materially different interpretations. Repeat calibration after changes to the model, evaluator, prompt, tools, knowledge, policy, or conversation mix.

Keep the evidence behind every automated judgment. A score without the relevant transcript excerpt or trace is difficult to challenge and nearly impossible to improve. Reviewers should be able to see what failed, why it failed, and which requirement governed the decision.

Make every failure end in a product decision

Quality monitoring creates value only when it changes the system. The review queue should represent work, not a museum of bad conversations. A useful workflow moves each case through explicit states such as Not reviewed, Reviewed, Needs a fix, and Fix complete.

Require the following information before a failed case leaves review:

The customer intent and journey stage.
The failed criterion and supporting evidence.
The failure class, not merely the visible symptom.
The owner responsible for the correction.
The proposed change and the signal expected to improve.
The regression case, deployment version, and monitor that will verify the correction.

A consistent failure taxonomy keeps teams from treating every bad answer as a prompt problem:

Knowledge failure: the approved information is missing, stale, contradictory, or too difficult to interpret.
Retrieval or context failure: the right information exists, but the system did not retrieve, rank, or apply it.
Policy or workflow failure: the operating rule is wrong, incomplete, or impossible for the agent to execute.
Model behavior failure: the system ignored instructions, made an unsupported inference, or produced an otherwise defective response despite receiving adequate context.
Conversation-design failure: the interaction collected the wrong information, asked an unclear question, or sequenced the exchange poorly.
Tool or handoff failure: an integration, action, routing rule, or transfer prevented the correct outcome.
Product-friction failure: the support interaction is a recurring symptom of something the product itself should make clearer or eliminate.

This taxonomy changes prioritization. If the same onboarding question keeps passing through support, improving the answer may reduce handling friction but preserve the underlying product problem. Journey mapping, behavioral analytics, and in-product guidance can reveal whether the better fix belongs in the interface, workflow, documentation, or agent.

Close each failure through a controlled loop:

Reproduce it. Confirm the failure and preserve the relevant inputs, context, versions, and tool behavior.
Diagnose it. Assign a cause from the shared taxonomy and identify the owner with authority to change that component.
Correct it. Update the product, knowledge, retrieval logic, prompt, workflow, policy, integration, or escalation rule that caused the defect.
Test it. Run the original failure and nearby cases that should remain unchanged. Add the sanitized case to the regression set where appropriate.
Deploy it with versioning. Preserve enough release context to compare behavior before and after the change.
Verify it in production. Watch the targeted monitor, baseline quality, and associated customer behavior. Close the case only when the expected signal improves without creating a new failure elsewhere.

Bring this loop into a regular operating review. Inspect coverage first, then critical failures, the largest changes by segment and version, queue health, recurring causes, completed fixes, and downstream customer behavior. The meeting should end with product decisions: roll back a change, revise knowledge, alter a workflow, strengthen an escalation, change the interface, expand monitoring, or explicitly accept a known limitation.

Assign ownership before volume grows. CX and support leaders can define service standards; product leaders can connect recurring friction to roadmap decisions; AI and engineering owners can maintain instrumentation, evaluators, and regression tests; analytics can connect interactions to behavioral outcomes; and security, legal, or compliance owners can approve critical controls in their domains. Your organization may divide the roles differently, but every critical criterion and failure class needs a named decision owner.

Key takeaways

Resolution rate measures throughput, not whether an AI interaction was accurate, compliant, or useful.
Track customer sentiment, execution quality, evaluation coverage, and product impact as separate layers.
Combine a stable benchmark cohort with risk-targeted, change-specific, and journey-based monitoring.
Show pass rate with coverage and critical failure rate. A strong score over thin or biased coverage is weak evidence.
Write scorecard criteria as executable requirements with pass conditions, failure evidence, severity, reviewer type, and remediation ownership.
Use automated evaluation for breadth and human judgment for calibration, ambiguity, disputes, and consequential failures.
Don’t close a quality issue when a document or prompt changes. Close it after the deployed fix passes regression checks and improves the intended production signal.
Route recurring conversation failures into product discovery. Sometimes the best CX fix is removing the reason customers need to ask.

Start with one journey that combines meaningful volume with meaningful consequence. Give it an eligibility rule, a scorecard, explicit coverage, a review queue, a failure taxonomy, named owners, regression cases, and a downstream product outcome. Run that chain until failures reliably produce verified changes, then extend the same control loop to the next journey. Scalable quality comes from repeating a dependable operating system, not from adding another dashboard.

References

March 25, 2026

How to Ship Responsible AI Products in Regulated Healthcare

Your healthcare AI prototype works in a demo. Clinicians see potential. Then privacy, security, compliance, and legal reviewers ask questions the roadmap cannot answer: Which data crosses the model boundary? What happens when the output is wrong? Who can stop it? What evidence justifies exposing it to patients or providers?

The answer is not a longer policy document. You need a delivery system in which the use case, data boundaries, acceptable behavior, evidence, and rollback path are inspectable before anyone depends on the product. That system lets you move faster because each review produces a decision instead of another round of open-ended concerns.

Key takeaways

Start with the decision or action the AI will influence, not the model you want to deploy.
Keep identifiers in clinical systems by default and send only the behavioral or operational signals a downstream system genuinely needs.
Put success metrics, unacceptable behavior, human review, and stop conditions in the same release contract.
Move from synthetic or de-identified sandbox testing to a tightly controlled pilot, then scale only when the agreed evidence supports it.
Monitor model behavior, workflow performance, segment outcomes, data quality, and incidents as one production system.

Define the clinical boundary before choosing the AI approach

A vague use case such as improving patient engagement is almost impossible to evaluate responsibly. It does not identify a user, a decision, an action, or a credible failure. The first useful artifact is a use-case card that makes those boundaries explicit.

Complete these fields before discussing vendors, models, or architecture:

User and job: Name the person using the capability and the task that person is trying to complete.
Input: List the information required to perform the task. Separate essential inputs from data that is merely available.
Output: Define what the system produces: a summary, draft, recommendation, prediction, classification, or action.
Action authority: State whether the AI informs a person, proposes an action for approval, or executes an action itself.
Unacceptable outcome: Describe the failure that must not reach the user, patient, provider, or downstream system.
Human checkpoint: Identify who reviews the output, what that person can see, and how the person can reject or correct it.
Success measure: Name the workflow outcome that should improve, such as task completion, time-to-first-value, or sustained adoption.
Accountable owner: Name the person who can approve the use case, pause it, and accept or reject residual risk.

The action-authority field is especially important. A system that drafts text for a qualified person to review has a different failure surface from one that sends the text automatically. A recommendation that a clinician can inspect is different from an action that changes a care workflow without an intervening decision. If the team cannot describe that distinction, it is too early to approve a production design.

I use a simple product-risk ladder during intake:

The AI summarizes or drafts, and its output has no effect until a qualified person reviews it.
The AI recommends a next step, but a person must make and record the decision.
The AI executes a reversible administrative action within a tightly bounded workflow.
The AI influences a care pathway, patient communication, or another consequential decision.
The AI executes a consequential or difficult-to-reverse action without prior human approval.

This ladder is a product-triage device, not a legal or clinical classification. Your qualified clinical, privacy, security, compliance, and legal owners still need to determine the obligations that apply. Its purpose is to prevent a low-risk drafting assistant and a high-consequence decision system from passing through the same generic review.

Once the boundary is clear, choose the least complex mechanism that can deliver the outcome. Conventional automation may be enough for deterministic rules. Retrieval may be appropriate when the primary job is finding and grounding information. An agentic workflow introduces additional action authority and therefore needs stronger controls. Selecting among conventional automation, a retrieval-first pipeline, and agentic AI should follow the use case, its failure modes, and its lifecycle requirements.

Apply the same discipline to build-versus-buy decisions. Do not reduce the choice to feature coverage or procurement cost. Evaluate who can control data handling, model and prompt versions, evaluation, incident response, observability, and future changes. A vendor can supply technology, but it cannot own your product decision or your duty to operate the resulting workflow responsibly.

Make the data boundary reviewable, not merely promised

Privacy-by-design becomes real when a reviewer can trace each field from its origin to every place it is processed, logged, measured, retained, and deleted. A sentence saying the product is secure is not a data-control mechanism.

Start with a data-flow map that covers the entire operating path:

The clinical or operational system where the data originates.
Any transformation, minimization, masking, or de-identification step.
The application, retrieval layer, model, or external service that processes it.
Prompt, response, diagnostic, and application logs.
Behavioral analytics and product dashboards.
Human-review, support, escalation, and incident queues.
Long-term storage, retention, deletion, and backup paths.

For every step, record the purpose, permitted fields, prohibited fields, access roles, retention rule, downstream recipients, and owner. If a field has no necessary purpose, remove it before debating how to secure it. Data minimization reduces both the risk surface and the number of controls the team has to maintain.

A practical default is to keep identifiers in clinical systems while allowing only the behavioral signals needed for product analytics to cross the boundary. An analytics event can record that a recommendation was opened, edited, accepted, rejected, or completed without carrying a patient name or clinical narrative. The event should describe what happened in the product, not reproduce the underlying record.

Do not assume data is de-identified merely because a visible name or patient identifier has been removed. Combinations of fields, free text, prompts, model responses, URLs, error messages, and support attachments can still disclose sensitive information. Have the designated privacy and legal owners determine whether the transformation meets the applicable requirements. If they cannot verify it, keep the data inside the approved clinical boundary or use synthetic data for development.

Behavioral instrumentation needs its own contract. For each event, define:

The event name and the exact behavior it represents.
The allowed properties and the business purpose of each property.
Explicitly prohibited identifiers, clinical text, and other sensitive payloads.
The application and workflow versions that generate the event.
The owner who approves schema changes.
Validation rules that reject or quarantine malformed events.
The metric definitions and dashboards that consume the event.

This is governed analytics in operational form. Curated events, certified metric definitions, role-based access, lineage, and change control create a shared, auditable view for product, data, security, and compliance. They also prevent a quieter product failure: two teams using the same metric name for different behaviors and making incompatible release decisions.

Apply comparable scrutiny to an external provider. Ask what data the provider processes, where it is stored, whether inputs or outputs can be used for training, what is logged, how long each artifact is retained, how deletion works, who can access it, which subprocessors receive it, how tenants are separated, and what happens during an outage or security incident. Route the answers to the people responsible for contractual, security, privacy, and regulatory assessment. Product should own the use-case decision, not silently treat vendor approval as proof that every use is approved.

Convert responsible AI into a release contract

Responsible AI fails as a delivery practice when responsibility is expressed only as principles. A team needs observable release criteria: the behavior it expects, the behavior it prohibits, the evidence it will collect, and the condition that stops the launch.

Put those criteria in one release contract shared by product, engineering, data science, clinical leadership, security, privacy, and compliance. The exact metric thresholds will vary by use case, so the accountable owners must set them before the pilot produces results. A threshold chosen after seeing the data is an explanation, not a gate.

Release layer	Define before the pilot	Evidence to collect	Do not proceed when
Product value	The user task and expected workflow improvement	Task completion, time-to-value, adoption, abandonment, and sustained use	The feature creates activity without improving the intended task
Model behavior	Expected responses, prohibited responses, escalation behavior, and task-specific pass criteria	Versioned offline evaluations, human review, guardrail results, and regression comparisons	A critical safety case fails or behavior cannot be reproduced
Data quality	Required inputs, permitted schemas, freshness expectations, and lineage	Schema validation, missing-data checks, source versions, and anomaly monitoring	Inputs are stale, malformed, untraceable, or outside the approved boundary
Human control	Review point, override, correction, escalation, and rollback path	Correction behavior, overrides, escalations, and successful rollback tests	The responsible person cannot inspect, reject, or stop the output
Operational health	Acceptable latency, cost, availability, error behavior, and incident ownership	Production telemetry, alerts, version history, and incident records	Failure is silent, alerts lack an owner, or recovery depends on an untested path
Segment outcomes	The patient, provider, workflow, and operating segments that require separate review	Outcome and error variance across approved segments	Material variance is unexplained or a consequential segment lacks adequate evidence

Model quality is only one layer. A strong offline result can still produce a poor product if the workflow is slow, users cannot correct the output, input data is unreliable, or the intervention fails to improve the intended task. Connect the layers with a driver tree:

Model behavior: What must the system produce or avoid?
Workflow behavior: What will the user do differently if the output is useful and trusted?
User outcome: Which task becomes more complete, efficient, or reliable?
Organizational or care outcome: What meaningful result should eventually change?

Treat each arrow as a hypothesis, not an assumed causal relationship. For example, a more relevant recommendation might reduce corrections, and fewer corrections might improve task completion. Instrument both transitions. If relevance improves but completion does not, the team has learned that the bottleneck is elsewhere.

Your offline evaluation set should include representative routine inputs, ambiguous inputs, edge cases, and the sensitive scenarios most closely connected to the unacceptable outcomes on the use-case card. For each case, store the expected behavior, reviewer rubric, model version, prompt version, retrieval configuration, policy or rule version, and result. This makes regression testing possible when any part of the system changes.

Prompt libraries, model and prompt regression tests, eval-driven development, feature flags, and observability belong in the product delivery system rather than in an isolated data-science workflow. AI behavior can change when the model, prompt, retrieved context, guardrail, input distribution, or surrounding application changes. Version the complete configuration that produced the output.

Use A/B testing only where exposure is ethically and operationally appropriate, failure is reversible, and the relevant reviewers have approved the experiment. Do not use an experiment to discover whether an unbounded high-consequence behavior is safe. Establish safety through evaluation and controlled review first. For an approved experiment, predefine the minimum detectable effect that would make the release risk worthwhile, along with guardrail metrics and stop conditions.

Use evidence gates from sandbox to controlled scale

A responsible rollout is not one approval followed by unrestricted production access. It is a sequence of gates. Each gate expands exposure only after the previous stage produces the required evidence.

Gate 1: Sandbox validation

Start with synthetic or appropriately de-identified data. The sandbox should reproduce the workflow closely enough to test prompts, retrieval, interface behavior, event instrumentation, alerts, and rollback without exposing a patient or provider to an unproven capability.

Use the sandbox to answer concrete questions:

Does each approved input produce a traceable output?
Do ambiguous, incomplete, or malformed inputs fail safely?
Are prohibited data fields rejected before they reach logs or analytics?
Do critical evaluation cases pass on the exact release configuration?
Can a reviewer see the context needed to accept, edit, or reject an output?
Do alerts reach a named owner?
Can the feature be disabled without disrupting the underlying workflow?
Are latency and cost compatible with the intended operating model?

A polished demonstration is not the exit criterion. The exit criterion is a reproducible evidence packet containing the use-case card, data-flow map, event contracts, evaluation results, open risks, mitigations, configuration versions, approvals, and tested rollback procedure.

Gate 2: Controlled production pilot

A pilot is an instrumented risk test, not a smaller marketing launch. Define its boundaries before enabling the feature:

Which users and roles are eligible.
Which workflows and data types are permitted.
Which outputs and actions are enabled.
Where human review is mandatory.
Which feature flag or access control contains exposure.
Which metrics and segments will be reviewed.
Which events trigger an alert, pause, rollback, or incident process.
Who makes the decision to continue, modify, or stop.

Write the success and stop criteria before the first participant enters the pilot. Otherwise, adoption pressure can turn a temporary exception into a permanent operating state. A pre-agreed stop condition gives the incident owner authority to act without waiting for a fresh executive debate while a consequential failure continues.

The pilot should test the entire sociotechnical workflow. Measure whether people understand the AI’s role, inspect the output, use the correction path, escalate uncertain cases, and complete the intended task. A model can appear accurate while users over-trust it, ignore it, or spend more time verifying it than the workflow saves.

Gate 3: Controlled expansion

Scale only when the evidence satisfies the release contract and the remaining risks have named owners. Expand one meaningful dimension at a time where practical: the eligible cohort, supported workflow, data scope, or action authority. Opening all four simultaneously makes it difficult to identify which change caused a new failure.

A disciplined pattern is to move from sandbox validation to controlled pilots with documented data flows, guardrails, and pre-agreed mitigations. The audit trail should be generated from normal delivery artifacts rather than reconstructed when an auditor, customer, or executive asks what happened.

After launch: operate the product as a learning system

Production is where input distributions, user behavior, costs, and failure modes become visible. Run three connected operational views:

System health: Model, prompt, retrieval, and policy versions; latency; cost; errors; availability; and data-pipeline anomalies.
Workflow health: Eligibility, activation, task completion, abandonment, corrections, overrides, escalations, and time-to-value.
Outcome and safety health: Guardrail failures, prohibited behavior, incidents, rollback events, and outcome variance across relevant segments.

Every alert needs an owner, response path, and severity interpretation. Every material incident needs a record of the affected configuration, inputs, outputs, user impact, containment action, root cause, and prevention work. If the team cannot reconstruct which version produced a harmful or noncompliant output, observability is incomplete.

Treat a material model, prompt, retrieval, policy, or data-schema change as a product release even when the interface does not change. Run the relevant regression suite, compare the new configuration with the approved baseline, update the risk record, and preserve the decision. Change control is what prevents a previously reviewed system from becoming a different system under the same feature name.

Keep customer success, support, solutions engineering, and operational users in the feedback loop. Structured corrections and escalations can reveal workflow failures that aggregate accuracy metrics hide. Route those signals into evaluation cases, product discovery, and prioritization instead of treating them as isolated support tickets.

Your next step does not need to be a company-wide governance rewrite. Pick one healthcare AI use case and complete four artifacts: the use-case card, data-flow map, release contract, and gated rollout plan. If you cannot name the unacceptable outcome, the person who can stop the system, or the evidence required to resume it, the use case is not ready for production. Once those answers exist, responsibility becomes part of delivery rather than a negotiation at the end of it.

References

March 25, 2026

Stop Flying Blind with AI Agents: Put Users at the Center with Pendo Agent Analytics

I’ve watched too many AI agent deployments celebrate velocity while overlooking the one thing that determines long-term success: whether real users are actually getting value. Dashboards tend to spotlight model upgrades, prompt tweaks, and launch counts, yet they rarely quantify task completion, trust, or time-to-value. That blind spot isn’t technical—it’s human.

Enterprises are spending 93% of their AI budget building agents and almost none know if those agents are actually working for users. Pendo Agent Analytics closes the gap.

In my product reviews, I look for evidence that agentic AI is improving outcomes across the customer journey, not just the demo path. Without behavioral analytics and observability, teams optimize for throughput instead of resolution, for novelty instead of reliability. This is where eval-driven development, A/B testing, and rigorous cohort analysis become non-negotiable: they translate agent performance into user impact we can measure and improve.

Here’s the pattern that works for me: define user-centric success metrics first, then let the AI follow. I prioritize signals like successful task completion, low-friction activation, reduced escalations, and sentiment lift—tied directly to product-led growth indicators such as retention and expansion. When these metrics move in the right direction, I know the agent is creating compounding value, not just answering faster.

Practically, I operationalize this with an analytics spine that captures end-to-end agent interactions: intents, prompts, responses, clarifying turns, handoffs, and final outcomes. I segment by persona, journey stage, and account tier to uncover where agents delight and where they degrade trust. With this foundation, I can run controlled experiments, spot anomalies early, and connect improvements in agent behavior to improvements in business performance.

Pendo Agent Analytics closes the loop by making these user outcomes visible and actionable. Instead of guessing whether an agent helped or hindered, I can analyze where users stall, which prompts or skills drive completion, and how interventions like in-app guides or product tours change behavior. That visibility lets me tune models and experiences in days, not quarters—and gives stakeholders confidence that our AI investments are paying off for customers.

If you’re scaling agents today, start small but instrument deeply: map top user intents, define offline and online evals, A/B test prompts and policies, monitor regressions, and tie every improvement to activation, adoption, and retention. The result is a durable feedback loop that keeps agents aligned with user value as your surface area grows.

AI agents are not a destination—they’re a capability. When we anchor that capability to clear user outcomes and measure it with the right analytics, we stop flying blind and start compounding advantage. That’s how we turn promising demos into dependable products.

Inspired by this post on Pendo – Best Practices.

March 25, 2026
Bad Advice from Your AI Clone? Ethics, IP, and How Product Leaders Protect Quality

What happens when an AI starts giving advice in your voice—advice you’d never actually give? I’ve been thinking a lot about that question, and this conversation hit home for me as a product leader navigating the fast-evolving reality of AI “clones.”

Listen to this episode on: https://open.spotify.com/episode/7DNDIlIimwbbMOytArewRp?ref=producttalk.org | https://podcasts.apple.com/kh/podcast/bad-advice/id1794203808?i=1000756914818&ref=producttalk.org. Prefer video? Watch on YouTube: https://www.youtube.com/embed/RF4BwaeMMlg?feature=oembed

The episode examines AI “clones” built from podcast transcripts and public content—where the experimentation feels exciting, where it crosses ethical lines, and what happens when mediocre AI outputs get attributed to real people. The tension is real: when a bot confidently answers in your style but misses the nuance, “it’s not me” becomes more than a disclaimer—it’s a reputational defense.

We dig into the messy parts: IP ownership of open-sourced transcripts, the role of pirated books in LLM training sets, rising inference costs, and the uncomfortable economic question: if anyone can prompt “act like Teresa,” how do creators make a living? In my own decision-making, I look for clear consent, guardrails that prevent impersonation, and transparent UX that never confuses a synthetic perspective with a human expert.

This isn’t anti-AI. It’s a nuanced conversation about quality, consent, and remembering there are real humans behind the ideas.

Here’s how I translate the key takeaways into practice. Using AI for perspective is fine—equating it to the real person isn’t. Free-feeling AI outputs still rely on someone’s work. Expertise is more than past content—it’s context, judgment, and evolution. If someone’s work influences you, find a way to support them. These principles help teams benefit from gen ai without eroding trust or the creator ecosystem.

“Technically possible” doesn’t mean “ethically okay.” My AI Strategy playbook includes privacy-by-design, clear data governance on training materials, and a bright line between inspiration and impersonation. When we ship AI features, we label synthetic outputs, avoid mimicking living experts without permission, and create paths to compensate or promote the humans whose thinking underpins the experience.

I’ve also tested the “act like X” pattern to stress-test product quality. Even when outputs sound plausible, they rarely capture the expert’s mental models, trade-offs, or the evolution of their thinking—especially in complex product discovery work. That gap is the difference between average AI text and expert product management leadership.

If you listen, consider a few reflection prompts: Have you ever used AI to “act like” someone you admire? Could you tell whether the output matched that person’s actual thinking? How do you decide what’s ethically okay when using public content in LLMs? And how can we support creators while still embracing new tools?

Resources & Links you may find helpful: Follow Teresa Torres: https://ProductTalk.org; Follow Petra Wille: https://Petra-Wille.com; Delphi.ai (AI bot platform discussed): https://www.delphi.ai/?ref=producttalk.org; Lenny’s Podcast: https://www.lennysnewsletter.com/podcast?ref=producttalk.org; ChatGPT: https://chatgpt.com/?ref=producttalk.org; Petra’s Coaching Packages: https://www.petra-wille.com/coaching-packages?ref=producttalk.org; Teresa’s Product Talk: https://www.producttalk.org/; Teresa’s book Continuous Discovery Habits: https://www.producttalk.org/continuous-discovery-habits/; Lenny’s open-sourced podcast transcripts: https://www.dropbox.com/scl/fo/yxi4s2w998p1gvtpu4193/AMdNPR8AOw0lMklwtnC0TrQ?rlkey=j06x0nipoti519e0xgm23zsn9&e=1&st=ahz0fj11&dl=0&ref=producttalk.org

Have thoughts on this episode or practices that have worked in your org? Share them below—I’m keen to learn how other teams are balancing innovation with integrity.

Inspired by this post on Product Talk.

March 24, 2026
Meet Amplitude’s Always‑On AI Analysts: Instant Answers Without Dashboards or Reports

For years, I’ve watched product, growth, and data teams burn cycles stitching together manual dashboards and reports, then slogging through replay review just to validate a hunch. That overhead slows discovery and delays decisions. The promise here is different: "Discover how Amplitude AI Agents help product, growth, and data teams turn questions into action without manual dashboards, reports, or replay review." As someone obsessed with decision velocity and evidence-based product strategy, that shift is exactly what I’ve been waiting for.

In practice, I think about "Amplitude AI Agents" as always-on data analysts embedded in our workflow. Instead of queuing requests or context-switching into tooling, I can ask targeted questions, get synthesized insights, and move directly to action. This is a powerful example of agentic AI meeting behavioral analytics in a unified analytics platform—removing friction between inquiry and impact while keeping teams focused on outcomes, not artifacts.

What changes for my day-to-day? I can interrogate customer behavior in real time, pressure-test hypotheses from discovery interviews, and quickly understand whether activation, retention, or monetization is the current constraint. If I’m probing a driver tree for activation or a retention analysis for a specific cohort, I can get to a decision faster—without waiting on someone to build a bespoke dashboard. That means more cycles spent shaping product strategy and fewer sunk into report wrangling.

This matters beyond speed. When product, growth, and data leaders anchor discussions in the same source of truth, we shorten the distance from signal to decision. That alignment is the backbone of product-led growth and continuous discovery: shared context, faster feedback loops, and clearer trade-offs. It also reduces the long tail of analytics debt—those one-off reports and stale views that quietly accumulate across teams.

Of course, adopting any AI workflow in analytics demands governance. I hold these systems to the same bar I set for my teams: clarity of assumptions, consistent metric definitions, and auditable reasoning. Pairing "Amplitude analytics" with strong data governance, CI/CD for analytics definitions, and lightweight evals helps ensure the recommendations we act on are reliable, reproducible, and explainable. AI should accelerate our judgment, not replace it.

The strategic shift is simple and profound: move from building dashboards to making decisions. With always-on analysis, we can spend less time instrumenting analytics theater and more time delivering customer value. That is how we translate insights into impact—and why I’m excited to operationalize this capability across our product trios and go-to-market partners.

Inspired by this post on Amplitude – Best Practices.

March 23, 2026

How to Build Scalable, AI-Ready Product Documentation

Your AI assistant gives a confident but outdated setup answer. Search returns three pages with slightly different instructions. Support knows the real workaround, but the documentation owner does not know the product changed. This is usually described as an AI problem. It is more often a knowledge-system problem.

You do not need a second documentation estate written for machines. You need one governed source of product truth that a customer can follow, a support engineer can trust, and an AI system can retrieve without reconstructing the answer from conflicting fragments.

Key takeaways

Organize documentation around the questions and tasks users bring to it, not only around your product navigation or internal team structure.
Give every important section a clear answer, scope, procedure, expected result, and permanent link so it remains useful when retrieved on its own.
Control terminology, versions, ownership, and deprecation explicitly. An AI assistant cannot reliably resolve contradictions that your organization has left unresolved.
Put documentation changes through version control, review, automated checks, and release gates so the published truth keeps pace with the product.
Measure successful task completion and grounded answer quality, not page views alone. Use failures to decide whether to fix the content, retrieval layer, assistant behavior, or product itself.

Start with an answer contract, not a page inventory

A documentation redesign often begins with a list of existing pages. That tells you what you publish, but not what customers need to accomplish. It also preserves accidental boundaries: a feature may have five pages because five teams touched it, while the customer still sees one task.

Begin with an intent register for one product area. Capture the questions that appear during activation, onboarding, routine use, escalation, and renewal. Include the language people actually use in search queries and support requests, even when it differs from your preferred product terminology.

For each intent, record:

The user’s question in their own language.
The task they are trying to complete or the decision they need to make.
The relevant audience or role, such as administrator, developer, or analyst.
The product version, plan, permission, integration, or prerequisite that changes the answer.
The canonical page and section that should answer the question.
The person accountable for keeping that answer current.
The consequence of a wrong or missing answer, such as failed activation, an unnecessary escalation, or use of a deprecated workflow.

This register exposes three different problems that page counts conceal. Some important questions have no answer. Some have several competing answers. Others have an answer that exists but cannot stand on its own because the conditions or expected result appear somewhere else.

Turn each priority intent into an answer contract. A complete unit should state what the user can accomplish, when the instructions apply, what must already be true, what to do, what success looks like, and where to go next. If any of those elements are missing, a human has to infer them and an AI system may invent the bridge.

The opening of a page should therefore name the job, not advertise the feature. “Configure routing for inbound leads” gives the reader a destination. “About lead routing” merely names a subject. This small distinction also gives retrieval systems a stronger match between a real question and the section intended to answer it.

Build retrieval units that still make sense alone

A person may enter through a search result, while an AI application may retrieve only a passage from the middle of a page. In both cases, the selected section has to survive separation from the surrounding document.

That does not mean chopping every page into tiny fragments. Atomic content is complete enough to answer one intent and bounded enough to avoid unrelated material. A fragment that says “click Save” without naming the object, required permission, or expected result is short, but it is not atomic.

Use a repeatable section pattern

For a task-oriented section, use this sequence:

Write a heading that reflects the question or task.
Give the direct answer or outcome before background material.
State who the instructions are for and when they apply.
List permissions, inputs, and prerequisites before the procedure.
Use numbered steps with one observable action in each step.
State the expected result and how the reader can verify it.
Separate exceptions, limitations, and failure states from the main path.
Link to the next likely task rather than a generic documentation landing page.

Keep interface labels, API parameters, status values, and error messages verbatim. If the product displays “Connection expired,” do not rewrite it as “Your integration is no longer active.” The second phrase may read naturally, but it weakens exact search, obscures the product state, and makes support instructions harder to match.

Examples should expose inputs, outputs, and constraints. A useful example says which role is acting, what value is supplied, what the system returns, and which condition would make the result different. A screenshot without that context is evidence of appearance, not a durable explanation of behavior.

Make boundaries and links dependable

Use one primary topic per page, semantic H1-H3 hierarchy, descriptive slugs, and stable section anchors. These practices make pages easier to scan and create smaller, linkable units that retrieval systems can identify precisely.

A stable anchor is part of the content contract. If an implementation guide links directly to the authentication prerequisite, changing that anchor silently breaks more than navigation. It breaks the path by which customers, support macros, release notes, and AI responses reach the authoritative answer.

Do not copy the same procedure into several pages to make each page self-contained. Keep one canonical procedure and give adjacent pages enough context to explain why the reader needs it, followed by a precise link. Duplication feels convenient at publication time and becomes a contradiction risk at the next product change.

Control vocabulary without ignoring customer language

Choose one canonical term for each product concept across the interface, API, documentation, and support material. Put accepted synonyms and older names in a glossary or metadata field so search can recognize them, but keep the explanation anchored to the current term.

This is the difference between supporting natural language and allowing synonym sprawl. “Workspace,” “account,” “tenant,” and “organization” may sound interchangeable inside a company. If they represent different objects in the product, casual substitution creates false equivalence. If they represent the same object, choosing one term removes needless translation work for every reader and retrieval pipeline.

Protect the current truth with metadata and delivery controls

Good prose cannot compensate for missing scope. Two instructions can each be correct for a different version, role, or integration and still produce a wrong answer when retrieved together. Metadata makes those boundaries explicit before retrieval begins.

Define a required metadata contract for every governed page or content unit. At minimum, include:

A stable content ID and canonical URL.
A descriptive title and short task-oriented description.
The product area and content type.
The intended audience or role.
The applicable version or version status.
The lifecycle state, such as current or deprecated.
The accountable owner.
The last-updated or last-reviewed date.

Use the fields as controls, not decoration. Audience metadata should allow an assistant to distinguish administrator instructions from end-user instructions. Version metadata should prevent a current answer from silently incorporating an obsolete step. Ownership should route a failed evaluation to someone who can resolve it.

Deprecation needs more than a warning banner. State what is deprecated, which users or versions are affected, what replaces it, and how to move forward. Preserve old URLs with redirects when a current replacement exists. Removing the old page without a forward path turns bookmarks and deep links into dead ends; leaving it searchable without a clear status lets obsolete guidance continue to circulate.

Ship documentation as part of the product change

Scalability depends on the delivery system behind the content. Version control, peer review, and CI/CD give documentation the same traceability and release discipline used for software changes.

For each product change, the release workflow should answer:

Which user intents and canonical sections are affected?
Do interface labels, parameters, permissions, errors, examples, or screenshots change?
Does the change introduce a new term or alter an existing definition?
Do version boundaries, redirects, or deprecation notices need updating?
Which retrieval evaluations must pass before release?
Who approves the content and owns follow-up corrections?

Automate the checks that have unambiguous pass or fail conditions: broken links, missing required metadata, duplicate IDs, invalid internal references, and orphaned pages. Use human review for semantic accuracy, task completeness, terminology, and whether an image still reflects the current workflow. Automation can detect that a screenshot file exists; it cannot reliably decide that the image teaches the correct behavior.

Set update expectations according to consequence. Instructions tied to a product release need to be correct when the change reaches users. A deprecated workflow needs a forward path before the old path disappears. Lower-risk explanatory material can follow a review schedule. One blanket service level treats cosmetic drift and activation-breaking errors as if they carry the same cost.

Measure answer quality, then migrate in risk order

Page views tell you that someone arrived. They do not tell you whether the person completed the task or whether an AI answer was accurate, grounded, and current. Pair human behavior with retrieval evaluations so each signal leads to a plausible corrective action.

Signal	What it can reveal	Likely action
Repeated searches or rapid returns to results	The answer is hard to find, uses mismatched language, or does not resolve the intent	Improve the title, intent mapping, vocabulary, or section completeness
Low task completion after reading	The procedure may omit prerequisites, verification, or a failure path	Test the instructions against the actual workflow and repair the answer contract
Support escalation after a documentation visit	The content may be incomplete, untrusted, outdated, or describing product friction	Inspect the escalation reason before assuming more content is the solution
Low answer accuracy or grounding	The wrong passage was retrieved, the selected passage conflicts with another, or the assistant exceeded the evidence	Separate retrieval, content, and answer-generation failures
Current and deprecated guidance in one answer	Version metadata, lifecycle labels, or retrieval filters are insufficient	Strengthen version boundaries and remove obsolete material from current-answer paths
High response latency	The retrieval or answer path may be doing unnecessary work	Inspect the pipeline without trading away accuracy or grounding

Build the evaluation set from the same intent register used to design the documentation. For each test question, define the expected canonical page or section, the claims a correct answer must contain, the audience and version it applies to, and any deprecated claim that must not appear. Include questions that should not be answered when the documentation lacks enough evidence. A reliable assistant must be able to stop at the boundary of the known answer.

When a test fails, classify the failure before editing anything:

If retrieval selected the wrong section, inspect information architecture, headings, metadata, vocabulary, and chunk boundaries.
If retrieval selected the correct section but the answer distorted it, inspect the assistant’s instructions and answer-generation behavior.
If two selected sections disagree, resolve the underlying ownership, versioning, or duplication problem.
If no section answers the question, add the missing knowledge or make the limitation explicit.
If the answer is correct but users still fail, inspect the procedure and the product experience. Documentation should not be used to disguise avoidable product friction.

You do not need to rebuild the entire knowledge base before learning whether this operating model works. Migrate in this order:

Choose one product area with meaningful activation, support, or deprecation risk.
Collect its real user intents and map each one to an accountable answer.
Resolve duplicate, contradictory, and missing guidance before changing the retrieval system.
Restructure priority answers into self-contained, linkable sections.
Add the required metadata, ownership, version, and lifecycle controls.
Put those sections through the product release workflow and automated checks.
Run human task checks and retrieval evaluations, classify the failures, and repair the responsible layer.
Expand only after the pattern is repeatable for another product area.

Your first useful deliverable is not an AI documentation strategy deck. It is one high-value customer question with one canonical, current, owned answer that survives retrieval and changes alongside the product.

Start with the question that creates the most expensive ambiguity today. Make its answer complete, linkable, versioned, testable, and part of the release path. That single vertical slice will show you where the larger system actually needs work.

References

March 20, 2026

How to Build a Continuous Improvement Loop for AI Support

Your AI support agent is containing more conversations, but you still cannot answer the question that matters: did it solve the customer’s problem, or did it merely avoid a handoff?

You do not need another dashboard of conversation counts and token usage. You need a closed loop that connects customer outcomes to individual agent decisions, turns failures into test cases, and makes each release safer than the last. Here is the operating model I would put in place.

Define resolution before you optimize the agent

Continuous improvement starts with an outcome contract, not an analytics implementation. If product, support, and engineering use different definitions of success, every subsequent metric will create debate instead of direction.

Write the first outcome contract for one customer journey. Keep it to one page and specify:

Eligible cases: Which requests can the agent reasonably handle? Separate supported journeys from requests that must go directly to a person.
The unit of analysis: Decide whether success belongs to a conversation, a case, or a completed task. A customer who restarts the same issue in a second conversation has not necessarily created a new problem.
Completion evidence: Name the customer statement, downstream system event, human judgment, or evaluator result that confirms the task was completed.
Acceptable escalation: Define when handing the case to a person is the correct outcome. A policy-mandated handoff is not an agent failure.
Failure conditions: Include abandonment, repeated rephrasing, an incorrect action, an unsupported claim, a failed tool call, and a handoff without usable context where they apply.
The observation window: Choose how long you will look for a repeat contact or reversal before treating the resolution as durable. The right interval depends on the journey, so publish it with the metric.

Then separate the signals that teams commonly collapse into one number:

Signal	What it tells you	How to use it
Verified resolution	The eligible customer task was completed using the evidence in your outcome contract.	Use this as the primary outcome for a resolvable journey.
Containment	No person entered the interaction.	Treat it as a routing and labor signal, not proof of success.
Escalation	The agent transferred the case to a person.	Split appropriate escalation from avoidable escalation before drawing conclusions.
Reliability	Tools, retrieval, guardrails, and fallbacks behaved as intended.	Use step-level rates to locate the mechanism behind an outcome.
Performance	The customer waited for the first response and for the complete task.	Track both typical and tail latency so a healthy average does not hide slow cases.
Efficiency	The agent consumed tokens, model spend, tool spend, and human effort.	Measure cost per successful eligible task, not just cost per conversation.
Groundedness	Claims based on retrieved material are supported by the retrieved material.	Evaluate retrieval-dependent journeys separately from general response quality.

The distinction between resolution and containment is particularly important. An unresolved conversation that never reached a person improves containment while making the customer experience worse. Conversely, a fast and well-prepared escalation can be the right resolution path.

Use explicit formulas in the metric catalog. For example, resolution rate should be verified successful eligible cases divided by eligible cases. Cost per success should be the applicable model and tool cost divided by verified successful cases. If you change the eligibility rule or verification method, version the definition rather than silently changing the chart.

Do not let the agent’s own declaration of success become your only evidence. Prefer an observable business event or explicit customer confirmation. When neither exists, use a documented evaluation rubric and periodically compare automated judgments with human review. The goal is not to pretend ambiguity has disappeared. It is to make the ambiguity visible and consistent.

Instrument the trajectory, not just the conversation

Traditional product analytics can tell you that a conversation opened, escalated, and closed. It cannot explain why an agent chose a tool, which knowledge it retrieved, what a guardrail rejected, or where a multistep task went off course. Agent behavior includes nondeterministic trajectories, tool chains, prompt context, retrieval, policies, and fallbacks. Your telemetry has to preserve that sequence.

Build three connected records:

Case record: A stable identifier for the customer need, the normalized intent, channel, eligible cohort, final outcome, escalation reason, customer signal, and any repeat-contact relationship.
Run record: The model, prompt, policy, knowledge, tool, evaluator, and experiment versions used for one execution, along with total latency, token use, cost, guardrail events, fallback state, and final response.
Step record: Each retrieval, model decision, function call, tool result, validation, retry, and handoff. Capture inputs and outputs in an appropriately protected form, plus status, latency, error type, and the next step selected.

Every run should be reproducible at the level needed to investigate it. Version anything that can change behavior: system instructions, prompt templates, policies, model configuration, tool schemas, knowledge snapshots, routing rules, guardrails, and evaluators. Record experiment assignment on the run itself. Otherwise, a failed trace can become impossible to explain after the underlying assets change.

You need three views over these records. The aggregate scorecard shows whether customer and business outcomes are moving. The trace explorer shows the decision path behind a particular result. The evaluation system tests candidate behavior against expected results. None is a substitute for the others: aggregates reveal scale, traces reveal mechanisms, and evaluations reveal whether a proposed change is safe to release.

Privacy has to be part of the data model. Support conversations can contain personal or sensitive information, and copying every raw transcript into a broadly accessible analytics system creates unnecessary exposure. Separate human-identifiable data from model telemetry and mask sensitive content while retaining useful semantic context. Apply access controls and retention rules to raw content, and let most analysis operate on sanitized traces, structured labels, and protected identifiers.

Before declaring instrumentation complete, take one unresolved case and answer five questions from the data alone: What was the customer trying to do? What path did the agent take? Which versions governed that path? At which step did reality diverge from the expected behavior? What outcome evidence followed? Any missing answer is a concrete telemetry gap.

Turn production failures into a daily improvement queue

Analytics becomes continuous improvement only when an observation can enter a queue, acquire an owner, produce a test, and reach production. A weekly dashboard presentation without that path is reporting, not learning.

Create one improvement queue shared by product, support, engineering, and the people responsible for the agent. Feed it from failed evaluations, avoidable handoffs, repeat contacts, low-confidence outcomes, guardrail events, tool errors, and anomalies in journey-level metrics.

At each human handoff, prefill the case identifier and trace, then ask the support specialist for three short inputs:

What was the customer actually trying to accomplish?
What prevented the agent from completing it?
What reusable change could prevent the same failure?

Keep the capture lightweight. The frontline should add information the trace cannot supply, not reconstruct a conversation the system already recorded. This turns human support into a sensor for recurring defects and supports the principle that a solved customer issue should create a chance to prevent its recurrence.

Cluster related cases before prioritizing them. Fixing a single transcript can overfit the agent to one phrasing; fixing a recurring failure pattern improves a journey. Give every cluster a normalized intent and one primary root-cause label:

Knowledge: The required information was missing, stale, contradictory, or too difficult to retrieve.
Retrieval: Useful information existed, but the wrong material was selected or relevant material was omitted.
Instruction or policy: The agent had ambiguous, conflicting, or incomplete directions.
Tooling: A function was unavailable, called incorrectly, timed out, or returned unusable data.
Conversation design: The agent failed to collect required information, explain a limitation, confirm an action, or recover from confusion.
Routing and handoff: The case went to the wrong destination, escalated too late, or arrived without sufficient context.
Capability boundary: The task was eligible on paper but was not reliably achievable with the current model and workflow.
Product or process: The customer encountered a defect or an underlying workflow that support automation alone cannot repair.

Do not use the taxonomy to force certainty. Allow an unknown label, but review unknowns regularly; a growing unknown bucket usually means the taxonomy or telemetry has stopped describing production.

Each queue item should contain the affected journey and cohort, linked traces, observed and expected behavior, root-cause hypothesis, customer consequence, recurrence evidence, responsible owner, proposed smallest change, a new or updated evaluation case, rollout method, and rollback condition. That record makes the improvement auditable from failure to fix.

Prioritize with four questions instead of reacting to the loudest transcript: How often does the pattern appear? How consequential is it when it happens? How confident are you in the root cause? How reversible is the proposed change? A frequent knowledge miss with a narrow, testable correction is a good candidate for rapid iteration. An uncommon policy failure with serious consequences may deserve attention first even if its volume is low.

Review the highest-impact new cluster each workday and choose the smallest safe change that can teach you something. Small changes lower the effort required to act and compound when the cycle is repeated. The discipline lies in closing the loop, not in making every change large enough to justify a project.

Use evaluations as the release contract

A prompt edit is a product change. So is a new knowledge document, model version, tool description, routing rule, or policy. Treating these assets as untracked configuration makes regressions difficult to detect and harder to reverse.

Use a four-part improvement loop: update the controlled asset, test complete simulated conversations, deploy through a controlled release, and analyze production outcomes. That train-test-deploy-analyze cycle keeps learning connected to customer behavior rather than ending at the prompt editor.

Every evaluation case should include:

<!– wp:list {

March 19, 2026

A Practical Strategy for Foundational AI Platforms and Analytics
Your AI roadmap is filling up, but every new use case seems to require another custom retrieval pipeline, another evaluation method, and another analytics implementation. The pilots may work. The portfolio does not yet compound.

You do not solve that problem by declaring an AI platform initiative and assembling a long infrastructure backlog. You solve it by identifying the decisions your products must support, defining the contracts every AI workflow must honor, and connecting evaluation to real product behavior. The result is a foundation that makes the next useful experiment easier to ship, safer to operate, and easier to learn from.

Make critical use cases define the platform

A platform team can spend months building abstractions that application teams neither understand nor need. The usual cause is starting with components: a model gateway, vector storage, a feature store, an evaluation tool, or a new analytics stack. Each may be useful, but none tells you what the platform must make easier.

Start with the highest-value AI decisions in your product and internal operations. A use case is specific enough when you can describe the person making the decision, the context the system requires, the action the AI is allowed to take, the unacceptable failure modes, and the outcome you expect to change.

Write a short capability brief for each critical use case:
- User and moment: Who is trying to do what, and where does the AI enter the workflow?
- Decision or action: Is the system recommending, drafting, classifying, retrieving, or acting autonomously?
- Required context: Which behavioral events, account data, documents, permissions, and prior actions affect the answer?
- Quality definition: What makes an output acceptable, and which errors matter most?
- Release evidence: What must pass before the change reaches users?
- Product outcome: Which user behavior or business result should move if the capability works?
- Operating constraints: What must be logged, redacted, approved, monitored, or reversible?
- Owner: Who owns the outcome after launch, rather than merely delivering the component?
Now compare the briefs. Repeated needs are platform candidates. If several use cases require permission-aware retrieval, consistent experiment assignment, traceable prompt versions, or the same release evaluation, those capabilities deserve shared interfaces. A requirement that appears once may belong inside the application until reuse is proven.

This distinction prevents two expensive mistakes. The first is premature generalization: building a universal service before you understand the variation. The second is copy-and-paste scaling: allowing every team to create incompatible versions of a capability that is already clearly common.

Prioritize the platform backlog by friction removed from critical use cases, not by architectural elegance. A useful backlog item should complete a sentence such as: After this ships, a product team can run the standard evaluation suite on every retrieval change without building its own runner. If the item cannot name the team behavior it changes, it is probably still an implementation idea rather than a platform outcome.

Build the foundation as four enforceable contracts

A foundational AI platform is not one product or one technical layer. It is a set of contracts connecting data, evaluation, delivery, and analytics. The contracts matter more than whether every component comes from the same vendor. They let application teams move independently without giving up consistency where consistency is valuable.

1. The data and context contract

This contract defines what context an AI workflow can request and what comes back. It should cover identity, permissions, event definitions, document metadata, freshness, provenance, and retention. Retrieval should enforce access rules before context reaches the model, not rely on the model to decide what a user may see.

Keep the interface narrow. An application should be able to request approved context for a known user, account, and task without understanding every underlying data system. The response should carry enough metadata to explain where the context came from and which version was used.

A feature store belongs here when several predictive or real-time workflows need the same feature definitions at training and serving time. It should not become a mandatory platform component merely because feature stores appear on AI architecture diagrams. Add one when inconsistency is a demonstrated problem.

2. The evaluation contract

This contract defines the evidence required to change a model, prompt, retrieval configuration, tool, or policy. It should include representative test cases, expected behavior, failure labels, scoring rules, comparison baselines, and release gates.

Do not reduce evaluation to one average score. A harmless wording variation and a permission leak cannot cancel each other out. Track important failure classes separately and make critical failures blocking. Keep examples that exposed production problems so the evaluation set becomes a memory of what the system has learned.

The evaluation harness should produce a result that a product manager can interpret and an automated delivery pipeline can enforce. That is the core of eval-driven development: model and workflow changes move through repeatable evidence rather than one-off demonstrations.

3. The delivery and operations contract

This contract governs how a tested change reaches production and how you recover when it behaves badly. It should preserve the versions of the model, prompt, retriever, tools, policies, and relevant data configuration associated with each release.

Use controlled rollout, feature flags, rollback paths, and CI/CD gates for AI changes just as you would for other consequential product changes. Observability should connect a production trace to the exact configuration that produced it. Otherwise, a team can detect a quality drop without being able to isolate whether the model, context, prompt, tool call, or upstream data changed.

Record enough to diagnose the workflow, but do not treat raw prompts and conversations as ordinary telemetry. Apply privacy-by-design: minimize collection, redact sensitive fields where appropriate, control access, and make retention a deliberate policy. The default path should be the governed path. If secure operation requires every application team to remember a separate checklist, the platform has not removed the risk.

4. The product analytics contract

This contract maps AI interactions to product behavior. Define events for exposure, acceptance, correction, rejection, fallback, task completion, and abandonment where those states apply. Include stable identifiers that connect an interaction to its release and experiment assignment without copying sensitive content into the analytics layer.

The contract should also specify which product metric each AI capability is expected to influence. That keeps teams from declaring success because response quality improved in an offline test while users ignore the feature or fail to complete the task.

Turn analytics into the AI learning and control loop

Traditional product analytics asks what users did. AI evaluation asks how a probabilistic workflow performed. You need both views joined at the interaction level. Without that connection, model teams optimize scores, product teams optimize funnels, and neither can explain why a release helped or hurt the user.

Use four layers of measurement, and do not blend them into one health score:
<!– wp:list {
March 19, 2026

Agentic AI for Clinical Trial Operations: A Practical Playbook

If you are deciding where to introduce agentic AI in clinical trial operations, the hard question is not whether an agent can complete an impressive demonstration. It is whether the agent can produce a traceable, reviewable result under real trial conditions without obscuring who remains accountable.

Start with a bounded operational workflow, not a promise to automate an entire role. The useful outcome is not an agent that sounds intelligent. It is a smaller work queue, earlier detection of issues, faster human review, and enough evidence to explain every recommendation after the fact.

Start with work that is bounded, frequent, and reversible

Clinical operations contains no shortage of repetitive work. That does not make every task a suitable first agent use case. A workflow can be repetitive and still be unsafe to automate if an error changes a source record, delays escalation, affects patient safety, or hides a protocol issue.

Do not rank candidate workflows by estimated time savings alone. Rank them by risk-adjusted learnability: how quickly can you observe the agent’s behavior, compare it with an accountable reviewer, and contain a mistake before it has a consequential downstream effect?

A strong initial workflow usually has these properties:

A clear trigger and an unambiguous end state.
A finite set of authorized inputs.
An output that a qualified person can independently verify.
A mistake that can be corrected before it changes a consequential decision.
A named reviewer who already owns the underlying process.
An existing queue, baseline, or historical record against which performance can be evaluated.
A defined escalation path for ambiguity, missing data, conflicting records, and tool failure.

Document classification is a useful illustration. An eTMF agent has been applied to more than 80,000 documents per year. That workload is high-volume and structured enough to create repeatable evaluation data. The agent can recommend a classification, expose the evidence behind it, and send uncertain cases to a reviewer. A person can correct the result before the document proceeds through the controlled process.

Monitoring is a different risk class. A CRA agent can assemble safety and data-quality signals from 13 clinical systems, but that breadth is not permission to replace clinical judgment. The safer product boundary is evidence gathering, prioritization, and routing. The accountable professional still determines what the signal means and what action is appropriate.

My rule is simple: let the agent compress evidence gathering before it earns authority to execute an outcome. An agent may identify a possible discrepancy, collect the associated records, and prepare a review packet. It should not resolve a safety issue, close a query, approve clinical content, or alter an authoritative record unless that specific action has been validated, authorized, and made recoverable.

Turn the operating contract into governed platform primitives

Before writing prompts, write the operating contract. It should state the agent’s intended use, authorized inputs, available tools, required output, prohibited actions, review owner, escalation conditions, and evidence to retain. This contract gives product, clinical operations, quality, security, and engineering the same object to inspect.

The prohibited-actions section deserves particular attention. An instruction such as “help the CRA monitor the trial” is too broad to test. A useful boundary sounds more like this: retrieve permitted records, normalize specified fields, identify conditions defined in the approved specification, present supporting evidence, and route the result to the assigned reviewer. Do not interpret clinical significance, overwrite a source value, or close the issue.

A durable platform can encode that contract through reusable primitives such as models, skills, knowledge bases, MCP connectors, versions, and trigger types. Each primitive should own a specific control rather than serving as a loose container for prompts.

Platform primitive	Product decision to make explicit	Operational failure it should contain
Model	Which approved model and configuration may perform the task, including fallback behavior.	An unreviewed model change silently altering the output.
Skill	The narrow action, permitted inputs, expected schema, and failure behavior.	A general-purpose prompt expanding beyond the validated task.
Knowledge base	Which controlled material is authoritative and which version applies.	An answer relying on obsolete or unapproved material.
Connector	Identity, credential, record scope, and read-versus-write permission.	The agent retrieving or changing data beyond its authorization.
Trigger	What condition may start a run and what happens when the condition repeats.	Duplicate, unexpected, or untraceable execution.
Version	Which complete configuration produced a result and how it can be rolled back.	An output that cannot be reproduced during investigation.

Version everything that can materially change behavior: prompts, skills, model configuration, knowledge, ontology mappings, connector permissions, and escalation logic. A run record should identify why the agent started, which configuration ran, which tools it called, what evidence it retrieved, what it produced, and how the reviewer disposed of the result.

Separate read authority from write authority. A standard connector interface can make a system callable; it does not make every call permissible. Authentication and credential handling belong in a governed connector layer, as demonstrated by custom MCP connectors with an authentication and credentialing wrapper. The agent should receive only the tools and permissions required for the current task.

The same governance should apply across delivery models. First-party agents can prove reusable patterns, services-led implementations can handle complex workflows, and self-service configuration can extend adoption. Those three deployment paths should share the same identity controls, version model, evaluation process, monitoring, and audit record. Self-service without centralized guardrails merely distributes configuration risk.

Match retrieval to the question the agent must answer

Many apparent reasoning failures begin as retrieval or data-alignment failures. The agent received an outdated document, missed the relevant section, joined records under inconsistent identifiers, or treated two conflicting statuses as though they agreed. A larger context window does not repair those defects. It can make them harder to notice.

Choose the retrieval pattern from the operational question:

Use embeddings for semantic discovery. This is useful when the agent needs to find conceptually related material despite differences in wording. Retrieval results still need document identity, version, and provenance.
Use document hierarchies when structure carries meaning. Markdown or another explicit hierarchy can preserve the relationship among sections, subsections, tables, and controlled instructions. This is preferable when a nearby heading changes how a passage should be interpreted.
Use just-in-time connector retrieval for current system state. When the answer depends on the latest authorized record, retrieve it from the system at run time rather than relying on a stale copied index.

These patterns are complementary. An agent may use semantic retrieval to identify relevant controlled material and an MCP connector to fetch the current operational record. What matters is that the final output distinguishes retrieved policy or guidance from live trial data and preserves the provenance of both.

Cross-system monitoring also needs an ontology layer. Terms, statuses, units, and identifiers that appear similar may not carry the same operational meaning. A unified ontology can align terminology across multiple clinical systems, but normalization must not erase the original value. Retain the source system, source field, retrieval time, transformation applied, and canonical concept alongside every normalized field used in a recommendation.

Define conflict behavior explicitly. The newest value should not automatically win merely because it has the latest timestamp. If two authoritative records disagree and no validated reconciliation rule applies, the agent should show both, explain the conflict in neutral terms, and escalate. Fabricating a clean answer from inconsistent data is more dangerous than returning no answer.

Context management should reduce the agent’s working set to what the current decision requires. Sub-agents and automatic tool filtering can isolate tasks and limit the tools presented at each step. A retrieval sub-agent might return structured evidence with provenance, while a separate workflow skill applies the approved decision rule. That separation makes failures easier to test and permissions easier to constrain.

Do not optimize context solely for token efficiency. In clinical operations, the stronger reason to keep context narrow is control: fewer irrelevant records, fewer callable tools, clearer evidence lineage, and a smaller surface on which conflicting instructions can alter behavior.

Make evaluation and human review release gates

A clinical operations agent is not ready because it succeeds on a happy-path demonstration. Readiness means its intended behavior, failure behavior, and escalation behavior have all been tested against representative conditions. The evaluation plan should exist before the team sees the final test results, so release criteria do not drift to accommodate a weak agent.

Move through increasing levels of operational authority:

Retrospective evaluation: run the agent against a controlled golden dataset without access to live workflows.
Shadow operation: process current inputs in read-only mode while the existing process remains authoritative.
Assisted operation: show recommendations and evidence to a qualified reviewer, requiring approval before any downstream action.
Bounded execution: automate only the reversible actions that have earned sufficient evidence, while preserving escalation and rollback.

A golden dataset needs more than obvious examples. Include normal cases, ambiguous inputs, missing records, conflicting fields, outdated knowledge, duplicate triggers, unauthorized tool requests, and cases that should produce an abstention. Keep high-consequence failure modes visible as separate evaluation slices; a strong average can conceal the specific false negative that matters most.

Human feedback is useful, but it is not automatically ground truth. Reviewers can disagree, inherit inconsistent local practices, or approve a recommendation without examining it closely. Capture the initial agent output, reviewer action, reason for correction, and final adjudication where the process provides one. Use adjudicated outcomes to improve the golden set instead of treating every click as an equally reliable label.

Evaluate the properties that correspond to the operating contract:

Correct classification, routing, or evidence assembly on adjudicated cases.
Recall on important conditions, reviewed separately for higher-consequence misses.
Abstention and escalation when information is missing, conflicting, or outside scope.
Evidence completeness, including links or identifiers that let a reviewer verify the output.
Tool-use correctness, permission failures, and attempts to call unauthorized tools.
Reviewer acceptance, correction, and overturn reasons.
Operational impact on queue size, review effort, and time to disposition.

Set release thresholds according to the consequence of the task. A threshold appropriate for a reversible document suggestion is not automatically appropriate for a safety-monitoring signal. Do not compensate for weak performance on a high-risk slice with excellent performance on easy cases.

The human review interface is part of the safety system. It should present the recommendation, the exact supporting evidence, source identity, relevant timestamps, detected conflicts, and the permitted next actions. The reviewer needs an obvious way to correct, reject, or escalate the output. A generic approve button encourages automation bias and produces weak feedback data.

Preserve a traceable chain from agent intent to specification to test evidence. A release packet should identify the approved intended use, current versions, evaluation dataset, results by risk slice, known limitations, required human controls, monitoring plan, and rollback procedure. This is not paperwork added after product development. In a GxP-regulated setting, it is part of the product.

Production monitoring should detect changes in both behavior and operating conditions. Watch for shifts in input mix, rising abstention, changes in reviewer overturn reasons, missing provenance, connector failures, and differences after any model, knowledge, permission, or ontology update. When a material change occurs, route the affected configuration back through the relevant evaluation gates.

Key takeaways

Choose a bounded, frequent, and reversible workflow before attempting broad role automation.
Use agents to assemble evidence and prioritize work before granting authority over consequential outcomes.
Express the operating contract through governed models, narrow skills, controlled knowledge, permissioned connectors, triggers, and reproducible versions.
Match retrieval to the question: semantic discovery, hierarchical document access, and live connector retrieval solve different problems.
Preserve ontology mappings and field-level provenance when normalizing data across clinical systems.
Treat abstention, escalation, human review, evaluation evidence, production monitoring, and rollback as release requirements.

Your next artifact should not be a broader agent demonstration. Write the operating contract for the narrowest valuable workflow, then identify its authoritative inputs, prohibited actions, accountable reviewer, evaluation cases, escalation path, and rollback procedure. If any of those are unclear, narrow the workflow again. If they are explicit, you have a credible starting point for an agent that can improve clinical operations without outrunning the evidence.

References

Shivam.Consulting Blog – Inside Medable’s Agent Studio: The Agentic AI Blueprint to Accelerate Safer Clinical Trials

March 19, 2026

Author: Shivam Tiwari

Measure outcomes, execution quality, and coverage separately

Design monitoring as a portfolio, not a random sample

Turn scorecards into executable product requirements

Make every failure end in a product decision

Key takeaways

References

Define the clinical boundary before choosing the AI approach

Make the data boundary reviewable, not merely promised

Convert responsible AI into a release contract

Use evidence gates from sandbox to controlled scale

Gate 1: Sandbox validation

Gate 2: Controlled production pilot

Gate 3: Controlled expansion

After launch: operate the product as a learning system

References

Key takeaways

Start with an answer contract, not a page inventory

Build retrieval units that still make sense alone

Use a repeatable section pattern

Make boundaries and links dependable

Control vocabulary without ignoring customer language

Protect the current truth with metadata and delivery controls

Ship documentation as part of the product change

Measure answer quality, then migrate in risk order

References

Define resolution before you optimize the agent

Instrument the trajectory, not just the conversation

Turn production failures into a daily improvement queue

Use evaluations as the release contract

Make critical use cases define the platform

Build the foundation as four enforceable contracts

1. The data and context contract

2. The evaluation contract

3. The delivery and operations contract

4. The product analytics contract

Turn analytics into the AI learning and control loop

Start with work that is bounded, frequent, and reversible

Turn the operating contract into governed platform primitives

Match retrieval to the question the agent must answer

Make evaluation and human review release gates

Key takeaways

References