Month: March 2026

How to Build Scalable, AI-Ready Product Documentation

Your AI assistant gives a confident but outdated setup answer. Search returns three pages with slightly different instructions. Support knows the real workaround, but the documentation owner does not know the product changed. This is usually described as an AI problem. It is more often a knowledge-system problem.

You do not need a second documentation estate written for machines. You need one governed source of product truth that a customer can follow, a support engineer can trust, and an AI system can retrieve without reconstructing the answer from conflicting fragments.

Key takeaways

Organize documentation around the questions and tasks users bring to it, not only around your product navigation or internal team structure.
Give every important section a clear answer, scope, procedure, expected result, and permanent link so it remains useful when retrieved on its own.
Control terminology, versions, ownership, and deprecation explicitly. An AI assistant cannot reliably resolve contradictions that your organization has left unresolved.
Put documentation changes through version control, review, automated checks, and release gates so the published truth keeps pace with the product.
Measure successful task completion and grounded answer quality, not page views alone. Use failures to decide whether to fix the content, retrieval layer, assistant behavior, or product itself.

Start with an answer contract, not a page inventory

A documentation redesign often begins with a list of existing pages. That tells you what you publish, but not what customers need to accomplish. It also preserves accidental boundaries: a feature may have five pages because five teams touched it, while the customer still sees one task.

Begin with an intent register for one product area. Capture the questions that appear during activation, onboarding, routine use, escalation, and renewal. Include the language people actually use in search queries and support requests, even when it differs from your preferred product terminology.

For each intent, record:

The user’s question in their own language.
The task they are trying to complete or the decision they need to make.
The relevant audience or role, such as administrator, developer, or analyst.
The product version, plan, permission, integration, or prerequisite that changes the answer.
The canonical page and section that should answer the question.
The person accountable for keeping that answer current.
The consequence of a wrong or missing answer, such as failed activation, an unnecessary escalation, or use of a deprecated workflow.

This register exposes three different problems that page counts conceal. Some important questions have no answer. Some have several competing answers. Others have an answer that exists but cannot stand on its own because the conditions or expected result appear somewhere else.
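As a minimal sketch, the register can be held as structured rows so the three problems become a query rather than a reading exercise. The `Intent` class, its field names, the `register_problem` helper, and the sample rows are all hypothetical; a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """One row of a hypothetical intent register."""
    question: str                 # the user's question in their own words
    task: str                     # task or decision behind the question
    audience: str                 # for example "administrator"
    owner: str                    # person or team accountable for the answer
    canonical_sections: list[str] = field(default_factory=list)
    stands_alone: bool = False    # section states its own scope and expected result


def register_problem(intent: Intent) -> str:
    """Name the register problem for one intent, or "ok" if none applies."""
    if not intent.canonical_sections:
        return "no answer"
    if len(intent.canonical_sections) > 1:
        return "competing answers"
    if not intent.stands_alone:
        return "cannot stand alone"
    return "ok"


# Illustrative rows only; the questions and section links are invented.
register = [
    Intent("how do i connect my crm", "Connect the CRM integration",
           "administrator", "docs-integrations"),
    Intent("why are leads not routing", "Fix lead routing",
           "administrator", "docs-routing",
           ["/routing#configure", "/faq#routing"]),
]
problems = {i.question: register_problem(i) for i in register}
```

Sorting the register by this label and by the consequence of a wrong answer gives a first repair order.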

Turn each priority intent into an answer contract. A complete unit should state what the user can accomplish, when the instructions apply, what must already be true, what to do, what success looks like, and where to go next. If any of those elements are missing, a human has to infer them and an AI system may invent the bridge.
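The six elements of an answer contract can be checked mechanically before a human reviews the wording. This is a sketch under stated assumptions: the key names in `REQUIRED_ELEMENTS` and the sample `draft` are hypothetical, and the check only detects absence, not quality.

```python
REQUIRED_ELEMENTS = (
    "outcome",          # what the user can accomplish
    "applies_when",     # when the instructions apply
    "prerequisites",    # what must already be true
    "steps",            # what to do
    "expected_result",  # what success looks like
    "next_step",        # where to go next
)


def missing_elements(unit: dict) -> list[str]:
    """Return the contract elements that are absent or empty in a content unit."""
    return [name for name in REQUIRED_ELEMENTS if not unit.get(name)]


# Illustrative draft; the values are invented.
draft = {
    "outcome": "Configure routing for inbound leads",
    "applies_when": "Accounts with lead routing enabled",
    "steps": ["Open Routing rules", "Select New rule", "Select Save"],
    "expected_result": "",
}
gaps = missing_elements(draft)
```

Here `gaps` lists the prerequisites, the expected result, and the next step, which are exactly the bridges a reader or an assistant would otherwise have to infer.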

The opening of a page should therefore name the job, not advertise the feature. “Configure routing for inbound leads” gives the reader a destination. “About lead routing” merely names a subject. This small distinction also gives retrieval systems a stronger match between a real question and the section intended to answer it.

Build retrieval units that still make sense alone

A person may enter through a search result, while an AI application may retrieve only a passage from the middle of a page. In both cases, the selected section has to survive separation from the surrounding document.

That does not mean chopping every page into tiny fragments. Atomic content is complete enough to answer one intent and bounded enough to avoid unrelated material. A fragment that says “click Save” without naming the object, required permission, or expected result is short, but it is not atomic.

Use a repeatable section pattern

For a task-oriented section, use this sequence:

Write a heading that reflects the question or task.
Give the direct answer or outcome before background material.
State who the instructions are for and when they apply.
List permissions, inputs, and prerequisites before the procedure.
Use numbered steps with one observable action in each step.
State the expected result and how the reader can verify it.
Separate exceptions, limitations, and failure states from the main path.
Link to the next likely task rather than a generic documentation landing page.

Keep interface labels, API parameters, status values, and error messages verbatim. If the product displays “Connection expired,” do not rewrite it as “Your integration is no longer active.” The second phrase may read naturally, but it weakens exact search, obscures the product state, and makes support instructions harder to match.

Examples should expose inputs, outputs, and constraints. A useful example says which role is acting, what value is supplied, what the system returns, and which condition would make the result different. A screenshot without that context is evidence of appearance, not a durable explanation of behavior.

Make boundaries and links dependable

Use one primary topic per page, semantic H1-H3 hierarchy, descriptive slugs, and stable section anchors. These practices make pages easier to scan and create smaller, linkable units that retrieval systems can identify precisely.

A stable anchor is part of the content contract. If an implementation guide links directly to the authentication prerequisite, changing that anchor silently breaks more than navigation. It breaks the path by which customers, support macros, release notes, and AI responses reach the authoritative answer.

Do not copy the same procedure into several pages to make each page self-contained. Keep one canonical procedure and give adjacent pages enough context to explain why the reader needs it, followed by a precise link. Duplication feels convenient at publication time and becomes a contradiction risk at the next product change.

Control vocabulary without ignoring customer language

Choose one canonical term for each product concept across the interface, API, documentation, and support material. Put accepted synonyms and older names in a glossary or metadata field so search can recognize them, but keep the explanation anchored to the current term.

This is the difference between supporting natural language and allowing synonym sprawl. “Workspace,” “account,” “tenant,” and “organization” may sound interchangeable inside a company. If they represent different objects in the product, casual substitution creates false equivalence. If they represent the same object, choosing one term removes needless translation work for every reader and retrieval pipeline.
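One way to let search recognize older names while keeping the explanation anchored to the current term is to rewrite accepted synonyms in the query before retrieval. The sketch below assumes a hypothetical glossary in which "tenant" and "team space" are accepted synonyms for "workspace"; it should only ever contain terms that refer to the same product object.

```python
import re

# Hypothetical glossary: canonical term -> accepted synonyms and older names.
GLOSSARY = {"workspace": ["tenant", "team space"]}


def normalize_query(query: str, glossary: dict[str, list[str]]) -> str:
    """Rewrite accepted synonyms in a search query to their canonical term."""
    lookup = {syn.lower(): term for term, syns in glossary.items() for syn in syns}
    if not lookup:
        return query
    # Longest synonyms first so multiword names win over shorter overlaps.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: lookup[m.group(0).lower()], query)
```

For example, `normalize_query("How do I rename my Tenant?", GLOSSARY)` yields a query that uses "workspace". Terms that name different objects stay out of the glossary, so the rewrite never creates a false equivalence.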

Protect the current truth with metadata and delivery controls

Good prose cannot compensate for missing scope. Two instructions can each be correct for a different version, role, or integration and still produce a wrong answer when retrieved together. Metadata makes those boundaries explicit before retrieval begins.

Define a required metadata contract for every governed page or content unit. At minimum, include:

A stable content ID and canonical URL.
A descriptive title and short task-oriented description.
The product area and content type.
The intended audience or role.
The applicable version or version status.
The lifecycle state, such as current or deprecated.
The accountable owner.
The last-updated or last-reviewed date.

Use the fields as controls, not decoration. Audience metadata should allow an assistant to distinguish administrator instructions from end-user instructions. Version metadata should prevent a current answer from silently incorporating an obsolete step. Ownership should route a failed evaluation to someone who can resolve it.
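Using metadata as a control can be as simple as filtering candidates before ranking. The following sketch assumes exact-match fields and hypothetical names (`ContentUnit`, `retrievable`, the sample IDs); a real system may need version ranges or role hierarchies instead of equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentUnit:
    content_id: str
    audience: str    # for example "administrator" or "end-user"
    version: str     # applicable product version
    lifecycle: str   # "current" or "deprecated"
    owner: str       # where a failed evaluation is routed


def retrievable(units, audience: str, version: str):
    """Keep only current units that match the asker's audience and version."""
    return [
        u for u in units
        if u.lifecycle == "current" and u.audience == audience and u.version == version
    ]


# Illustrative units; IDs and owners are invented.
units = [
    ContentUnit("sso-setup-v2", "administrator", "2", "current", "identity-docs"),
    ContentUnit("sso-setup-v1", "administrator", "1", "deprecated", "identity-docs"),
    ContentUnit("sso-login", "end-user", "2", "current", "identity-docs"),
]
candidates = retrievable(units, audience="administrator", version="2")
```

Only the current administrator unit for version 2 survives the filter, so the deprecated procedure cannot be blended into the answer.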

Deprecation needs more than a warning banner. State what is deprecated, which users or versions are affected, what replaces it, and how to move forward. Preserve old URLs with redirects when a current replacement exists. Removing the old page without a forward path turns bookmarks and deep links into dead ends; leaving it searchable without a clear status lets obsolete guidance continue to circulate.

Ship documentation as part of the product change

Scalability depends on the delivery system behind the content. Version control, peer review, and CI/CD give documentation the same traceability and release discipline used for software changes.

For each product change, the release workflow should answer:

Which user intents and canonical sections are affected?
Do interface labels, parameters, permissions, errors, examples, or screenshots change?
Does the change introduce a new term or alter an existing definition?
Do version boundaries, redirects, or deprecation notices need updating?
Which retrieval evaluations must pass before release?
Who approves the content and owns follow-up corrections?

Automate the checks that have unambiguous pass or fail conditions: broken links, missing required metadata, duplicate IDs, invalid internal references, and orphaned pages. Use human review for semantic accuracy, task completeness, terminology, and whether an image still reflects the current workflow. Automation can detect that a screenshot file exists; it cannot reliably decide that the image teaches the correct behavior.
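The unambiguous checks fit in a short script that a CI job can run against parsed page metadata. This is a sketch, not a complete linter: the field names, the `links` key holding referenced content IDs, and the `entry_points` parameter are assumptions, and external URL checking is left out.

```python
REQUIRED_FIELDS = ("content_id", "canonical_url", "title", "audience",
                   "version", "lifecycle", "owner", "last_reviewed")


def check_pages(pages, entry_points=frozenset()):
    """Return one message per failed check; an empty list means the gate passes."""
    errors = []
    first_seen = {}   # content_id -> path of the first page that uses it

    for page in pages:
        path = page.get("path", "<unknown>")
        for name in REQUIRED_FIELDS:
            if not page.get(name):
                errors.append(f"{path}: missing required metadata '{name}'")
        content_id = page.get("content_id")
        if content_id in first_seen:
            errors.append(f"{path}: duplicate ID '{content_id}' "
                          f"(first used in {first_seen[content_id]})")
        elif content_id:
            first_seen[content_id] = path

    linked = set()
    for page in pages:
        for target in page.get("links", []):
            if target in first_seen:
                linked.add(target)
            else:
                errors.append(f"{page.get('path', '<unknown>')}: "
                              f"invalid internal reference '{target}'")

    # Orphaned: nothing links to the page and it is not a declared entry point.
    for content_id, path in first_seen.items():
        if content_id not in linked and content_id not in entry_points:
            errors.append(f"{path}: orphaned page, nothing links to '{content_id}'")
    return errors


# Illustrative input; every required field is filled with a placeholder value.
base = dict.fromkeys(REQUIRED_FIELDS, "set")
pages = [
    {**base, "path": "index.md", "content_id": "home", "links": ["setup"]},
    {**base, "path": "setup.md", "content_id": "setup", "links": ["missing"]},
    {**base, "path": "old.md", "content_id": "setup"},
    {**base, "path": "lonely.md", "content_id": "lonely"},
]
errors = check_pages(pages, entry_points={"home"})
```

A CI step would fail the build when `errors` is non-empty and print the messages, leaving semantic accuracy to the human reviewer.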

Set update expectations according to consequence. Instructions tied to a product release need to be correct when the change reaches users. A deprecated workflow needs a forward path before the old path disappears. Lower-risk explanatory material can follow a review schedule. One blanket service level treats cosmetic drift and activation-breaking errors as if they carry the same cost.

Measure answer quality, then migrate in risk order

Page views tell you that someone arrived. They do not tell you whether the person completed the task or whether an AI answer was accurate, grounded, and current. Pair human behavior with retrieval evaluations so each signal leads to a plausible corrective action.

Signal	What it can reveal	Likely action
Repeated searches or rapid returns to results	The answer is hard to find, uses mismatched language, or does not resolve the intent	Improve the title, intent mapping, vocabulary, or section completeness
Low task completion after reading	The procedure may omit prerequisites, verification, or a failure path	Test the instructions against the actual workflow and repair the answer contract
Support escalation after a documentation visit	The content may be incomplete, untrusted, outdated, or describing product friction	Inspect the escalation reason before assuming more content is the solution
Low answer accuracy or grounding	The wrong passage was retrieved, the selected passage conflicts with another, or the assistant exceeded the evidence	Separate retrieval, content, and answer-generation failures
Current and deprecated guidance in one answer	Version metadata, lifecycle labels, or retrieval filters are insufficient	Strengthen version boundaries and remove obsolete material from current-answer paths
High response latency	The retrieval or answer path may be doing unnecessary work	Inspect the pipeline without trading away accuracy or grounding

Build the evaluation set from the same intent register used to design the documentation. For each test question, define the expected canonical page or section, the claims a correct answer must contain, the audience and version it applies to, and any deprecated claim that must not appear. Include questions that should not be answered when the documentation lacks enough evidence. A reliable assistant must be able to stop at the boundary of the known answer.
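A test case from the register can be expressed as data plus a small grader. The sketch below is deliberately crude: `EvalCase`, `grade`, and the sample strings are hypothetical, and claims are matched by case-insensitive substring, which a real evaluation would likely replace with a rubric or a reviewed automated judge.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    question: str
    expected_section: Optional[str]    # None: the assistant should decline
    required_claims: list[str] = field(default_factory=list)
    forbidden_claims: list[str] = field(default_factory=list)  # e.g. deprecated steps
    audience: str = ""
    version: str = ""


def grade(case: EvalCase, retrieved_section: Optional[str],
          answer: Optional[str]) -> list[str]:
    """Return failure labels for one answer; an empty list is a pass.

    `answer` is None when the assistant declined to answer.
    """
    if case.expected_section is None:
        return [] if answer is None else ["answered without evidence"]
    if answer is None:
        return ["declined an answerable question"]
    failures = []
    if retrieved_section != case.expected_section:
        failures.append("wrong section retrieved")
    text = answer.lower()
    failures += [f"missing claim: {c}" for c in case.required_claims
                 if c.lower() not in text]
    failures += [f"forbidden claim: {c}" for c in case.forbidden_claims
                 if c.lower() in text]
    return failures


# Illustrative case; the question, anchor, and claims are invented.
case = EvalCase("How do I rotate an API key?", "/api-keys#rotate",
                ["Select Rotate key"], ["legacy token page"])
```

Returning labels rather than a single score keeps retrieval failures, missing claims, and deprecated claims separate for later triage.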

When a test fails, classify the failure before editing anything:

If retrieval selected the wrong section, inspect information architecture, headings, metadata, vocabulary, and chunk boundaries.
If retrieval selected the correct section but the answer distorted it, inspect the assistant’s instructions and answer-generation behavior.
If two selected sections disagree, resolve the underlying ownership, versioning, or duplication problem.
If no section answers the question, add the missing knowledge or make the limitation explicit.
If the answer is correct but users still fail, inspect the procedure and the product experience. Documentation should not be used to disguise avoidable product friction.
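The triage rules above can be sketched as a small decision function so every failed test gets a layer before anyone edits content. The class, field names, and the order in which conditions are checked are assumptions made for illustration; the order puts missing and conflicting content ahead of retrieval and generation.

```python
from dataclasses import dataclass


@dataclass
class FailedTest:
    section_exists: bool       # some section answers the question
    sections_conflict: bool    # two selected sections disagree
    retrieved_expected: bool   # retrieval selected the intended section
    answer_faithful: bool      # the answer matches the selected section
    user_succeeded: bool       # the reader completed the task


def responsible_layer(t: FailedTest) -> str:
    """Map one failed test to the layer that should be inspected first."""
    if not t.section_exists:
        return "content gap"
    if t.sections_conflict:
        return "ownership, versioning, or duplication"
    if not t.retrieved_expected:
        return "retrieval"
    if not t.answer_faithful:
        return "answer generation"
    if not t.user_succeeded:
        return "procedure or product"
    return "unclassified"
```

A result of "unclassified" means the test failed for a reason the five rules do not describe and needs a human look.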

You do not need to rebuild the entire knowledge base before learning whether this operating model works. Migrate in this order:

Choose one product area with meaningful activation, support, or deprecation risk.
Collect its real user intents and map each one to an accountable answer.
Resolve duplicate, contradictory, and missing guidance before changing the retrieval system.
Restructure priority answers into self-contained, linkable sections.
Add the required metadata, ownership, version, and lifecycle controls.
Put those sections through the product release workflow and automated checks.
Run human task checks and retrieval evaluations, classify the failures, and repair the responsible layer.
Expand only after the pattern is repeatable for another product area.

Your first useful deliverable is not an AI documentation strategy deck. It is one high-value customer question with one canonical, current, owned answer that survives retrieval and changes alongside the product.

Start with the question that creates the most expensive ambiguity today. Make its answer complete, linkable, versioned, testable, and part of the release path. That single vertical slice will show you where the larger system actually needs work.


March 20, 2026

How to Build a Continuous Improvement Loop for AI Support

Your AI support agent is containing more conversations, but you still cannot answer the question that matters: did it solve the customer’s problem, or did it merely avoid a handoff?

You do not need another dashboard of conversation counts and token usage. You need a closed loop that connects customer outcomes to individual agent decisions, turns failures into test cases, and makes each release safer than the last. Here is the operating model I would put in place.

Define resolution before you optimize the agent

Continuous improvement starts with an outcome contract, not an analytics implementation. If product, support, and engineering use different definitions of success, every subsequent metric will create debate instead of direction.

Write the first outcome contract for one customer journey. Keep it to one page and specify:

Eligible cases: Which requests can the agent reasonably handle? Separate supported journeys from requests that must go directly to a person.
The unit of analysis: Decide whether success belongs to a conversation, a case, or a completed task. A customer who restarts the same issue in a second conversation has not necessarily created a new problem.
Completion evidence: Name the customer statement, downstream system event, human judgment, or evaluator result that confirms the task was completed.
Acceptable escalation: Define when handing the case to a person is the correct outcome. A policy-mandated handoff is not an agent failure.
Failure conditions: Include abandonment, repeated rephrasing, an incorrect action, an unsupported claim, a failed tool call, and a handoff without usable context where they apply.
The observation window: Choose how long you will look for a repeat contact or reversal before treating the resolution as durable. The right interval depends on the journey, so publish it with the metric.

Then separate the signals that teams commonly collapse into one number:

Signal	What it tells you	How to use it
Verified resolution	The eligible customer task was completed using the evidence in your outcome contract.	Use this as the primary outcome for a resolvable journey.
Containment	No person entered the interaction.	Treat it as a routing and labor signal, not proof of success.
Escalation	The agent transferred the case to a person.	Split appropriate escalation from avoidable escalation before drawing conclusions.
Reliability	Tools, retrieval, guardrails, and fallbacks behaved as intended.	Use step-level rates to locate the mechanism behind an outcome.
Performance	The customer waited for the first response and for the complete task.	Track both typical and tail latency so a healthy average does not hide slow cases.
Efficiency	The agent consumed tokens, model spend, tool spend, and human effort.	Measure cost per successful eligible task, not just cost per conversation.
Groundedness	Claims based on retrieved material are supported by the retrieved material.	Evaluate retrieval-dependent journeys separately from general response quality.

The distinction between resolution and containment is particularly important. An unresolved conversation that never reached a person improves containment while making the customer experience worse. Conversely, a fast and well-prepared escalation can be the right resolution path.

Use explicit formulas in the metric catalog. For example, resolution rate should be verified successful eligible cases divided by eligible cases. Cost per success should be the applicable model and tool cost divided by verified successful eligible cases. If you change the eligibility rule or verification method, version the definition rather than silently changing the chart.
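Written as code, the two formulas and the versioning rule might look like the sketch below. The function names, the `definition_version` label, and the choice to return `None` for an empty denominator are assumptions for illustration.

```python
from typing import Optional


def resolution_rate(verified_successes: int, eligible_cases: int) -> Optional[float]:
    """Verified successful eligible cases divided by eligible cases."""
    if eligible_cases == 0:
        return None   # undefined, not zero, when nothing was eligible
    return verified_successes / eligible_cases


def cost_per_success(model_cost: float, tool_cost: float,
                     verified_successes: int) -> Optional[float]:
    """Applicable model and tool cost divided by verified successful cases."""
    if verified_successes == 0:
        return None
    return (model_cost + tool_cost) / verified_successes


def metric_row(definition_version: str, verified_successes: int,
               eligible_cases: int, model_cost: float, tool_cost: float) -> dict:
    """Publish each value together with the definition version that produced it."""
    return {
        "definition_version": definition_version,
        "resolution_rate": resolution_rate(verified_successes, eligible_cases),
        "cost_per_success": cost_per_success(model_cost, tool_cost, verified_successes),
    }


# Illustrative numbers only.
row = metric_row("2026-03-v1", verified_successes=30, eligible_cases=40,
                 model_cost=90.0, tool_cost=30.0)
```

With these sample inputs the resolution rate is 0.75 and the cost per success is 4.0. A chart that stores `definition_version` next to each point can show where a definition changed instead of presenting a silent jump.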

Do not let the agent’s own declaration of success become your only evidence. Prefer an observable business event or explicit customer confirmation. When neither exists, use a documented evaluation rubric and periodically compare automated judgments with human review. The goal is not to pretend ambiguity has disappeared. It is to make the ambiguity visible and consistent.

Instrument the trajectory, not just the conversation

Traditional product analytics can tell you that a conversation opened, escalated, and closed. It cannot explain why an agent chose a tool, which knowledge it retrieved, what a guardrail rejected, or where a multistep task went off course. Agent behavior includes nondeterministic trajectories, tool chains, prompt context, retrieval, policies, and fallbacks. Your telemetry has to preserve that sequence.

Build three connected records:

Case record: A stable identifier for the customer need, the normalized intent, channel, eligible cohort, final outcome, escalation reason, customer signal, and any repeat-contact relationship.
Run record: The model, prompt, policy, knowledge, tool, evaluator, and experiment versions used for one execution, along with total latency, token use, cost, guardrail events, fallback state, and final response.
Step record: Each retrieval, model decision, function call, tool result, validation, retry, and handoff. Capture inputs and outputs in an appropriately protected form, plus status, latency, error type, and the next step selected.

Every run should be reproducible at the level needed to investigate it. Version anything that can change behavior: system instructions, prompt templates, policies, model configuration, tool schemas, knowledge snapshots, routing rules, guardrails, and evaluators. Record experiment assignment on the run itself. Otherwise, a failed trace can become impossible to explain after the underlying assets change.
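The three records can be sketched as linked data classes. All class and field names here are hypothetical and cover only a subset of the attributes listed above; the point is that a step belongs to a run, a run carries its versions and experiment assignment, and a run points back to a case.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str                  # "retrieval", "tool_call", "validation", "handoff", ...
    status: str                # "ok" or "error"
    latency_ms: int
    error_type: Optional[str] = None


@dataclass
class Run:
    run_id: str
    case_id: str               # links the run back to the customer need
    versions: dict[str, str]   # model, prompt, policy, knowledge, tools, evaluator
    experiment: Optional[str] = None   # assignment recorded on the run itself
    steps: list[Step] = field(default_factory=list)


@dataclass
class Case:
    case_id: str
    intent: str
    outcome: str               # for example "resolved", "escalated", "abandoned"
    run_ids: list[str] = field(default_factory=list)


def first_divergence(run: Run) -> Optional[Step]:
    """Return the first step that did not succeed, if any."""
    return next((s for s in run.steps if s.status != "ok"), None)


# Illustrative trace; identifiers and versions are invented.
run = Run("run-1", "case-9", {"prompt": "p-14", "knowledge": "kb-2026-03-18"},
          experiment="control",
          steps=[Step("retrieval", "ok", 120),
                 Step("tool_call", "error", 3000, "timeout"),
                 Step("handoff", "ok", 40)])
case = Case("case-9", "change billing address", "escalated", ["run-1"])
failed = first_divergence(run)
```

Because the run stores its own versions, the timeout in this trace can still be tied to the prompt and knowledge snapshot that were live at the time.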

You need three views over these records. The aggregate scorecard shows whether customer and business outcomes are moving. The trace explorer shows the decision path behind a particular result. The evaluation system tests candidate behavior against expected results. None is a substitute for the others: aggregates reveal scale, traces reveal mechanisms, and evaluations reveal whether a proposed change is safe to release.

Privacy has to be part of the data model. Support conversations can contain personal or sensitive information, and copying every raw transcript into a broadly accessible analytics system creates unnecessary exposure. Separate human-identifiable data from model telemetry and mask sensitive content while retaining useful semantic context. Apply access controls and retention rules to raw content, and let most analysis operate on sanitized traces, structured labels, and protected identifiers.

Before declaring instrumentation complete, take one unresolved case and answer five questions from the data alone: What was the customer trying to do? What path did the agent take? Which versions governed that path? At which step did reality diverge from the expected behavior? What outcome evidence followed? Any missing answer is a concrete telemetry gap.

Turn production failures into a daily improvement queue

Analytics becomes continuous improvement only when an observation can enter a queue, acquire an owner, produce a test, and reach production. A weekly dashboard presentation without that path is reporting, not learning.

Create one improvement queue shared by product, support, engineering, and the people responsible for the agent. Feed it from failed evaluations, avoidable handoffs, repeat contacts, low-confidence outcomes, guardrail events, tool errors, and anomalies in journey-level metrics.

At each human handoff, prefill the case identifier and trace, then ask the support specialist for three short inputs:

What was the customer actually trying to accomplish?
What prevented the agent from completing it?
What reusable change could prevent the same failure?

Keep the capture lightweight. The frontline should add information the trace cannot supply, not reconstruct a conversation the system already recorded. This turns human support into a sensor for recurring defects and supports the principle that a solved customer issue should create a chance to prevent its recurrence.

Cluster related cases before prioritizing them. Fixing a single transcript can overfit the agent to one phrasing; fixing a recurring failure pattern improves a journey. Give every cluster a normalized intent and one primary root-cause label:

Knowledge: The required information was missing, stale, contradictory, or too difficult to retrieve.
Retrieval: Useful information existed, but the wrong material was selected or relevant material was omitted.
Instruction or policy: The agent had ambiguous, conflicting, or incomplete directions.
Tooling: A function was unavailable, called incorrectly, timed out, or returned unusable data.
Conversation design: The agent failed to collect required information, explain a limitation, confirm an action, or recover from confusion.
Routing and handoff: The case went to the wrong destination, escalated too late, or arrived without sufficient context.
Capability boundary: The task was eligible on paper but was not reliably achievable with the current model and workflow.
Product or process: The customer encountered a defect or an underlying workflow that support automation alone cannot repair.

Do not use the taxonomy to force certainty. Allow an unknown label, but review unknowns regularly; a growing unknown bucket usually means the taxonomy or telemetry has stopped describing production.

Each queue item should contain the affected journey and cohort, linked traces, observed and expected behavior, root-cause hypothesis, customer consequence, recurrence evidence, responsible owner, proposed smallest change, a new or updated evaluation case, rollout method, and rollback condition. That record makes the improvement auditable from failure to fix.

Prioritize with four questions instead of reacting to the loudest transcript: How often does the pattern appear? How consequential is it when it happens? How confident are you in the root cause? How reversible is the proposed change? A frequent knowledge miss with a narrow, testable correction is a good candidate for rapid iteration. An uncommon policy failure with serious consequences may deserve attention first even if its volume is low.

Review the highest-impact new cluster each workday and choose the smallest safe change that can teach you something. Small changes lower the effort required to act and compound when the cycle is repeated. The discipline lies in closing the loop, not in making every change large enough to justify a project.

Use evaluations as the release contract

A prompt edit is a product change. So is a new knowledge document, model version, tool description, routing rule, or policy. Treating these assets as untracked configuration makes regressions difficult to detect and harder to reverse.

Use a four-part improvement loop: update the controlled asset, test complete simulated conversations, deploy through a controlled release, and analyze production outcomes. That train-test-deploy-analyze cycle keeps learning connected to customer behavior rather than ending at the prompt editor.

Every evaluation case should include:


March 19, 2026

A Practical Strategy for Foundational AI Platforms and Analytics

Your AI roadmap is filling up, but every new use case seems to require another custom retrieval pipeline, another evaluation method, and another analytics implementation. The pilots may work. The portfolio does not yet compound.

You do not solve that problem by declaring an AI platform initiative and assembling a long infrastructure backlog. You solve it by identifying the decisions your products must support, defining the contracts every AI workflow must honor, and connecting evaluation to real product behavior. The result is a foundation that makes the next useful experiment easier to ship, safer to operate, and easier to learn from.

Make critical use cases define the platform

A platform team can spend months building abstractions that application teams neither understand nor need. The usual cause is starting with components: a model gateway, vector storage, a feature store, an evaluation tool, or a new analytics stack. Each may be useful, but none tells you what the platform must make easier.

Start with the highest-value AI decisions in your product and internal operations. A use case is specific enough when you can describe the person making the decision, the context the system requires, the action the AI is allowed to take, the unacceptable failure modes, and the outcome you expect to change.

Write a short capability brief for each critical use case:
- User and moment: Who is trying to do what, and where does the AI enter the workflow?
- Decision or action: Is the system recommending, drafting, classifying, retrieving, or acting autonomously?
- Required context: Which behavioral events, account data, documents, permissions, and prior actions affect the answer?
- Quality definition: What makes an output acceptable, and which errors matter most?
- Release evidence: What must pass before the change reaches users?
- Product outcome: Which user behavior or business result should move if the capability works?
- Operating constraints: What must be logged, redacted, approved, monitored, or reversible?
- Owner: Who owns the outcome after launch, rather than merely delivering the component?

Now compare the briefs. Repeated needs are platform candidates. If several use cases require permission-aware retrieval, consistent experiment assignment, traceable prompt versions, or the same release evaluation, those capabilities deserve shared interfaces. A requirement that appears once may belong inside the application until reuse is proven.

This distinction prevents two expensive mistakes. The first is premature generalization: building a universal service before you understand the variation. The second is copy-and-paste scaling: allowing every team to create incompatible versions of a capability that is already clearly common.

Prioritize the platform backlog by friction removed from critical use cases, not by architectural elegance. A useful backlog item should complete a sentence such as: After this ships, a product team can run the standard evaluation suite on every retrieval change without building its own runner. If the item cannot name the team behavior it changes, it is probably still an implementation idea rather than a platform outcome.

Build the foundation as four enforceable contracts

A foundational AI platform is not one product or one technical layer. It is a set of contracts connecting data, evaluation, delivery, and analytics. The contracts matter more than whether every component comes from the same vendor. They let application teams move independently without giving up consistency where consistency is valuable.

1. The data and context contract

This contract defines what context an AI workflow can request and what comes back. It should cover identity, permissions, event definitions, document metadata, freshness, provenance, and retention. Retrieval should enforce access rules before context reaches the model, not rely on the model to decide what a user may see.

Keep the interface narrow. An application should be able to request approved context for a known user, account, and task without understanding every underlying data system. The response should carry enough metadata to explain where the context came from and which version was used.
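A narrow context interface that enforces access before the model sees anything could look like the following sketch. `Document`, `approved_context`, and the role-based rule are assumptions; a real platform would take permissions from its identity system rather than a field on the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset   # roles permitted to read this document
    source: str                # provenance: where the content came from
    version: str               # which snapshot or revision was used


def approved_context(documents, user_roles: set) -> list[dict]:
    """Drop documents the user may not see, then return text with provenance."""
    return [
        {"doc_id": d.doc_id, "text": d.text, "source": d.source, "version": d.version}
        for d in documents
        if d.allowed_roles & user_roles   # access check runs before any model call
    ]


# Illustrative documents; IDs, sources, and roles are invented.
docs = [
    Document("pricing-internal", "Discount floor is ...", frozenset({"sales"}),
             "wiki", "r12"),
    Document("billing-help", "To update a card ...", frozenset({"sales", "customer"}),
             "help-center", "r40"),
]
context = approved_context(docs, user_roles={"customer"})
```

The caller supplies only the user's roles and receives text plus source and version, without needing to know which systems hold the documents.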

A feature store belongs here when several predictive or real-time workflows need the same feature definitions at training and serving time. It should not become a mandatory platform component merely because feature stores appear on AI architecture diagrams. Add one when inconsistency is a demonstrated problem.

2. The evaluation contract

This contract defines the evidence required to change a model, prompt, retrieval configuration, tool, or policy. It should include representative test cases, expected behavior, failure labels, scoring rules, comparison baselines, and release gates.

Do not reduce evaluation to one average score. A harmless wording variation and a permission leak cannot cancel each other out. Track important failure classes separately and make critical failures blocking. Keep examples that exposed production problems so the evaluation set becomes a memory of what the system has learned.

The evaluation harness should produce a result that a product manager can interpret and an automated delivery pipeline can enforce. That is the core of eval-driven development: model and workflow changes move through repeatable evidence rather than one-off demonstrations.
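A release gate that keeps failure classes separate and treats critical ones as blocking can be sketched in a few lines. The label names, the `blocking` set, and the `max_other_failure_rate` threshold are hypothetical parameters chosen for illustration.

```python
from collections import Counter


def release_gate(results, blocking, max_other_failure_rate=0.05):
    """Decide whether a change may ship, given per-case failure labels.

    `results` is a list of dicts such as {"case": "a", "failures": ["wording"]}.
    Any blocking label fails the gate, whatever the other scores are.
    """
    counts = Counter(label for r in results for label in r["failures"])
    blocking_hits = {k: v for k, v in counts.items() if k in blocking}
    other_failed = sum(
        1 for r in results if r["failures"] and not set(r["failures"]) & blocking
    )
    other_rate = other_failed / len(results) if results else 0.0
    return {
        "passed": not blocking_hits and other_rate <= max_other_failure_rate,
        "failure_counts": dict(counts),   # readable by a product manager
        "blocking": blocking_hits,        # enforceable by a pipeline
        "other_failure_rate": other_rate,
    }


# Illustrative results only.
results = [
    {"case": "a", "failures": []},
    {"case": "b", "failures": ["wording"]},
    {"case": "c", "failures": ["permission_leak"]},
]
gate = release_gate(results, blocking={"permission_leak"}, max_other_failure_rate=0.5)
```

In this sample the single permission leak fails the gate even though the wording failure rate is within the threshold, so the two classes never cancel each other out.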

3. The delivery and operations contract

This contract governs how a tested change reaches production and how you recover when it behaves badly. It should preserve the versions of the model, prompt, retriever, tools, policies, and relevant data configuration associated with each release.

Use controlled rollout, feature flags, rollback paths, and CI/CD gates for AI changes just as you would for other consequential product changes. Observability should connect a production trace to the exact configuration that produced it. Otherwise, a team can detect a quality drop without being able to isolate whether the model, context, prompt, tool call, or upstream data changed.

Record enough to diagnose the workflow, but do not treat raw prompts and conversations as ordinary telemetry. Apply privacy-by-design: minimize collection, redact sensitive fields where appropriate, control access, and make retention a deliberate policy. The default path should be the governed path. If secure operation requires every application team to remember a separate checklist, the platform has not removed the risk.

4. The product analytics contract

This contract maps AI interactions to product behavior. Define events for exposure, acceptance, correction, rejection, fallback, task completion, and abandonment where those states apply. Include stable identifiers that connect an interaction to its release and experiment assignment without copying sensitive content into the analytics layer.
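An event that carries identifiers but no conversation content might be modeled as below. The class name, field names, and the acceptance-rate roll-up are assumptions; the roll-up is one illustrative way to connect the events to a product metric, not a prescribed definition.

```python
from dataclasses import dataclass
from typing import Optional

STATES = {"exposure", "acceptance", "correction", "rejection",
          "fallback", "task_completion", "abandonment"}


@dataclass(frozen=True)
class AIInteractionEvent:
    interaction_id: str    # stable ID; joins to traces held elsewhere
    state: str
    release_id: str        # which release produced the interaction
    experiment_arm: str    # experiment assignment

    def __post_init__(self):
        if self.state not in STATES:
            raise ValueError(f"unknown state: {self.state!r}")


def acceptance_rate(events) -> Optional[float]:
    """Share of exposed interactions that were later accepted."""
    exposed = {e.interaction_id for e in events if e.state == "exposure"}
    accepted = {e.interaction_id for e in events if e.state == "acceptance"}
    if not exposed:
        return None
    return len(accepted & exposed) / len(exposed)


# Illustrative events; identifiers are invented.
events = [
    AIInteractionEvent("i1", "exposure", "rel-7", "treatment"),
    AIInteractionEvent("i1", "acceptance", "rel-7", "treatment"),
    AIInteractionEvent("i2", "exposure", "rel-7", "treatment"),
    AIInteractionEvent("i2", "rejection", "rel-7", "treatment"),
]
rate = acceptance_rate(events)
```

Because the event holds only identifiers and a state, prompts and responses stay in the governed trace store and the analytics layer never receives them.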

The contract should also specify which product metric each AI capability is expected to influence. That keeps teams from declaring success because response quality improved in an offline test while users ignore the feature or fail to complete the task.
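
A minimal sketch of such an event contract might look like the following. The `AIEvent` fields, the capability name, and the `completion_rate` helper are assumptions for illustration. The point is that the event carries identifiers and a state, and has no field for prompt or response text.

```python
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    EXPOSURE = "exposure"
    ACCEPTANCE = "acceptance"
    CORRECTION = "correction"
    REJECTION = "rejection"
    FALLBACK = "fallback"
    TASK_COMPLETION = "task_completion"
    ABANDONMENT = "abandonment"


@dataclass(frozen=True)
class AIEvent:
    # Assumption: hypothetical schema. Deliberately no prompt, response, or free-text field.
    interaction_id: str
    release_id: str
    experiment_arm: str
    capability: str
    state: State


def completion_rate(events, capability):
    """Share of exposed interactions for a capability that reached task completion."""
    exposed = {e.interaction_id for e in events
               if e.capability == capability and e.state is State.EXPOSURE}
    completed = {e.interaction_id for e in events
                 if e.capability == capability and e.state is State.TASK_COMPLETION}
    return len(exposed & completed) / len(exposed) if exposed else None


events = [
    AIEvent("i1", "rel-7", "treatment", "draft_reply", State.EXPOSURE),
    AIEvent("i1", "rel-7", "treatment", "draft_reply", State.ACCEPTANCE),
    AIEvent("i1", "rel-7", "treatment", "draft_reply", State.TASK_COMPLETION),
    AIEvent("i2", "rel-7", "treatment", "draft_reply", State.EXPOSURE),
    AIEvent("i2", "rel-7", "treatment", "draft_reply", State.ABANDONMENT),
]
rate = completion_rate(events, "draft_reply")
print(rate)
```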

Turn analytics into the AI learning and control loop

Traditional product analytics asks what users did. AI evaluation asks how a probabilistic workflow performed. You need both views joined at the interaction level. Without that connection, model teams optimize scores, product teams optimize funnels, and neither can explain why a release helped or hurt the user.

Use four layers of measurement, and do not blend them into one health score:
March 19, 2026

Agentic AI for Clinical Trial Operations: A Practical Playbook

If you are deciding where to introduce agentic AI in clinical trial operations, the hard question is not whether an agent can complete an impressive demonstration. It is whether the agent can produce a traceable, reviewable result under real trial conditions without obscuring who remains accountable.

Start with a bounded operational workflow, not a promise to automate an entire role. The useful outcome is not an agent that sounds intelligent. It is a smaller work queue, earlier detection of issues, faster human review, and enough evidence to explain every recommendation after the fact.

Start with work that is bounded, frequent, and reversible

Clinical operations contains no shortage of repetitive work. That does not make every task a suitable first agent use case. A workflow can be repetitive and still be unsafe to automate if an error changes a source record, delays escalation, affects patient safety, or hides a protocol issue.

Do not rank candidate workflows by estimated time savings alone. Rank them by risk-adjusted learnability: how quickly can you observe the agent’s behavior, compare it with an accountable reviewer, and contain a mistake before it has a consequential downstream effect?

A strong initial workflow usually has these properties:

A clear trigger and an unambiguous end state.
A finite set of authorized inputs.
An output that a qualified person can independently verify.
A mistake that can be corrected before it changes a consequential decision.
A named reviewer who already owns the underlying process.
An existing queue, baseline, or historical record against which performance can be evaluated.
A defined escalation path for ambiguity, missing data, conflicting records, and tool failure.

Document classification is a useful illustration. An eTMF agent has been applied to more than 80,000 documents per year. That workload is high-volume and structured enough to create repeatable evaluation data. The agent can recommend a classification, expose the evidence behind it, and send uncertain cases to a reviewer. A person can correct the result before the document proceeds through the controlled process.

Monitoring is a different risk class. A CRA agent can assemble safety and data-quality signals from 13 clinical systems, but that breadth is not permission to replace clinical judgment. The safer product boundary is evidence gathering, prioritization, and routing. The accountable professional still determines what the signal means and what action is appropriate.

My rule is simple: let the agent compress evidence gathering before it earns authority to execute an outcome. An agent may identify a possible discrepancy, collect the associated records, and prepare a review packet. It should not resolve a safety issue, close a query, approve clinical content, or alter an authoritative record unless that specific action has been validated, authorized, and made recoverable.

Turn the operating contract into governed platform primitives

Before writing prompts, write the operating contract. It should state the agent’s intended use, authorized inputs, available tools, required output, prohibited actions, review owner, escalation conditions, and evidence to retain. This contract gives product, clinical operations, quality, security, and engineering the same object to inspect.

The prohibited-actions section deserves particular attention. An instruction such as “help the CRA monitor the trial” is too broad to test. A useful boundary sounds more like this: retrieve permitted records, normalize specified fields, identify conditions defined in the approved specification, present supporting evidence, and route the result to the assigned reviewer. Do not interpret clinical significance, overwrite a source value, or close the issue.
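
To show how that boundary becomes testable, here is a minimal sketch that treats the contract as data. The action names are hypothetical labels derived from the example boundary above, and the `decide` method is an assumption about how a platform might consult the contract. It is not a description of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    intended_use: str
    authorized_inputs: frozenset
    permitted_actions: frozenset
    prohibited_actions: frozenset
    review_owner: str

    def decide(self, action):
        """Return 'deny', 'allow', or 'escalate' for a requested action."""
        if action in self.prohibited_actions:
            return "deny"
        if action in self.permitted_actions:
            return "allow"
        # Anything the contract does not mention goes to the review owner.
        return "escalate"


# Assumption: every name below is an illustrative placeholder.
contract = OperatingContract(
    intended_use="Prepare a review packet for possible data discrepancies",
    authorized_inputs=frozenset({"permitted_records"}),
    permitted_actions=frozenset({
        "retrieve_permitted_records", "normalize_specified_fields",
        "identify_defined_conditions", "present_evidence", "route_to_reviewer",
    }),
    prohibited_actions=frozenset({
        "interpret_clinical_significance", "overwrite_source_value", "close_issue",
    }),
    review_owner="assigned_cra",
)
print(contract.decide("present_evidence"), contract.decide("close_issue"))
```

Routing unlisted actions to escalation, not silently allowing them, keeps the contract the narrowest description of what was validated.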

A durable platform can encode that contract through reusable primitives such as models, skills, knowledge bases, MCP connectors, versions, and trigger types. Each primitive should own a specific control rather than serving as a loose container for prompts.

Platform primitive	Product decision to make explicit	Operational failure it should contain
Model	Which approved model and configuration may perform the task, including fallback behavior.	An unreviewed model change silently altering the output.
Skill	The narrow action, permitted inputs, expected schema, and failure behavior.	A general-purpose prompt expanding beyond the validated task.
Knowledge base	Which controlled material is authoritative and which version applies.	An answer relying on obsolete or unapproved material.
Connector	Identity, credential, record scope, and read-versus-write permission.	The agent retrieving or changing data beyond its authorization.
Trigger	What condition may start a run and what happens when the condition repeats.	Duplicate, unexpected, or untraceable execution.
Version	Which complete configuration produced a result and how it can be rolled back.	An output that cannot be reproduced during investigation.

Version everything that can materially change behavior: prompts, skills, model configuration, knowledge, ontology mappings, connector permissions, and escalation logic. A run record should identify why the agent started, which configuration ran, which tools it called, what evidence it retrieved, what it produced, and how the reviewer disposed of the result.
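
A run record along those lines could be sketched as a small data structure with a completeness check. The field names and the `missing_fields` helper are hypothetical. An empty `tool_calls` tuple is treated as valid here, on the assumption that some runs legitimately call no tools.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunRecord:
    # Assumption: illustrative field names for the six items a run record should identify.
    trigger: str
    config_version: str
    tool_calls: tuple
    evidence_ids: tuple
    output: str
    reviewer_disposition: Optional[str] = None  # filled in after human review


def missing_fields(record):
    """Names of fields that are None or an empty string."""
    empty = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == "":
            empty.append(f.name)
    return empty


record = RunRecord(
    trigger="new_document_uploaded",
    config_version="cfg-2026-03-19.1",
    tool_calls=("etmf.read_document",),
    evidence_ids=("doc-123#page-2",),
    output="Recommended classification: site correspondence",
)
pending = missing_fields(record)
record.reviewer_disposition = "corrected"
print(pending, missing_fields(record))
```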

Separate read authority from write authority. A standard connector interface can make a system callable; it does not make every call permissible. Authentication and credential handling belong in a governed connector layer, as demonstrated by custom MCP connectors with an authentication and credentialing wrapper. The agent should receive only the tools and permissions required for the current task.
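
The distinction between callable and permissible can be illustrated with a small authorization check. This is a sketch under assumptions: `Grant` and `ConnectorLayer` are hypothetical names, and a real governed layer would also handle authentication and credentials, which are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    connector: str
    resource: str
    mode: str  # "read" or "write"


class ConnectorLayer:
    """Hypothetical governed layer: a call is allowed only if an exact grant exists."""

    def __init__(self, grants):
        self._grants = frozenset(grants)

    def authorize(self, connector, resource, mode):
        if Grant(connector, resource, mode) not in self._grants:
            raise PermissionError(f"{mode} on {connector}/{resource} is not granted")


# The task needs to read queries, so that is the only grant it receives.
layer = ConnectorLayer([Grant("edc", "queries", "read")])
layer.authorize("edc", "queries", "read")  # passes silently

try:
    layer.authorize("edc", "queries", "write")
    write_allowed = True
except PermissionError:
    write_allowed = False
print(write_allowed)
```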

The same governance should apply across delivery models. First-party agents can prove reusable patterns, services-led implementations can handle complex workflows, and self-service configuration can extend adoption. Those three deployment paths should share the same identity controls, version model, evaluation process, monitoring, and audit record. Self-service without centralized guardrails merely distributes configuration risk.

Match retrieval to the question the agent must answer

Many apparent reasoning failures begin as retrieval or data-alignment failures. The agent received an outdated document, missed the relevant section, joined records under inconsistent identifiers, or treated two conflicting statuses as though they agreed. A larger context window does not repair those defects. It can make them harder to notice.

Choose the retrieval pattern from the operational question:

Use embeddings for semantic discovery. This is useful when the agent needs to find conceptually related material despite differences in wording. Retrieval results still need document identity, version, and provenance.
Use document hierarchies when structure carries meaning. Markdown or another explicit hierarchy can preserve the relationship among sections, subsections, tables, and controlled instructions. This is preferable when a nearby heading changes how a passage should be interpreted.
Use just-in-time connector retrieval for current system state. When the answer depends on the latest authorized record, retrieve it from the system at run time rather than relying on a stale copied index.

These patterns are complementary. An agent may use semantic retrieval to identify relevant controlled material and an MCP connector to fetch the current operational record. What matters is that the final output distinguishes retrieved policy or guidance from live trial data and preserves the provenance of both.

Cross-system monitoring also needs an ontology layer. Terms, statuses, units, and identifiers that appear similar may not carry the same operational meaning. A unified ontology can align terminology across multiple clinical systems, but normalization must not erase the original value. Retain the source system, source field, retrieval time, transformation applied, and canonical concept alongside every normalized field used in a recommendation.
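
A minimal sketch of normalization that keeps the original value might look like this. The `ONTOLOGY` mapping, system names, and status values are invented for illustration. A real ontology would be a governed, versioned artifact, not an inline dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumption: hypothetical mapping of (system, field, value) to a canonical concept.
ONTOLOGY = {
    ("edc", "status", "Open"): "query_open",
    ("ctms", "state", "OPEN_PENDING"): "query_open",
}


@dataclass(frozen=True)
class NormalizedField:
    source_system: str
    source_field: str
    original_value: str
    canonical_concept: Optional[str]  # None when no mapping exists
    transformation: str
    retrieved_at: datetime


def normalize(system, field_name, value, retrieved_at):
    """Map a source value to a canonical concept while keeping the source details."""
    concept = ONTOLOGY.get((system, field_name, value))
    transformation = "ontology_lookup" if concept else "unmapped"
    return NormalizedField(system, field_name, value, concept, transformation, retrieved_at)


now = datetime(2026, 3, 19, tzinfo=timezone.utc)
mapped = normalize("ctms", "state", "OPEN_PENDING", now)
unmapped = normalize("edc", "status", "Reopened", now)
print(mapped.canonical_concept, mapped.original_value, unmapped.transformation)
```

An unmapped value is surfaced as `unmapped` instead of being forced into the nearest concept, so a reviewer can see that the ontology has a gap.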

Define conflict behavior explicitly. The newest value should not automatically win merely because it has the latest timestamp. If two authoritative records disagree and no validated reconciliation rule applies, the agent should show both, explain the conflict in neutral terms, and escalate. Fabricating a clean answer from inconsistent data is more dangerous than returning no answer.
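
The conflict rule can be sketched as a function that never picks a winner on timestamp alone. The record shape and the `resolve` name are assumptions for illustration. A reconciliation rule is applied only when one is passed in explicitly, standing in for a rule that has been validated.

```python
def resolve(records, rule=None):
    """Agree, reconcile with an explicit validated rule, or escalate with both values."""
    values = {r["value"] for r in records}
    if len(values) <= 1:
        return {"status": "agreed", "value": next(iter(values), None), "sources": records}
    if rule is not None:
        return {"status": "reconciled", "value": rule(records),
                "rule": rule.__name__, "sources": records}
    return {
        "status": "escalate",
        "value": None,  # no fabricated answer
        "sources": records,
        "reason": "authoritative records disagree and no validated rule applies",
    }


# Hypothetical records: the second is newer, but that alone does not make it correct.
records = [
    {"system": "edc", "value": "enrolled", "retrieved_at": "2026-03-18T09:00:00Z"},
    {"system": "ctms", "value": "screen_failed", "retrieved_at": "2026-03-19T09:00:00Z"},
]
result = resolve(records)
print(result["status"], result["value"])
```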

Context management should reduce the agent’s working set to what the current decision requires. Sub-agents and automatic tool filtering can isolate tasks and limit the tools presented at each step. A retrieval sub-agent might return structured evidence with provenance, while a separate workflow skill applies the approved decision rule. That separation makes failures easier to test and permissions easier to constrain.

Do not optimize context solely for token efficiency. In clinical operations, the stronger reason to keep context narrow is control: fewer irrelevant records, fewer callable tools, clearer evidence lineage, and a smaller surface on which conflicting instructions can alter behavior.

Make evaluation and human review release gates

A clinical operations agent is not ready because it succeeds on a happy-path demonstration. Readiness means its intended behavior, failure behavior, and escalation behavior have all been tested against representative conditions. The evaluation plan should exist before the team sees the final test results, so release criteria do not drift to accommodate a weak agent.

Move through increasing levels of operational authority:

Retrospective evaluation: run the agent against a controlled golden dataset without access to live workflows.
Shadow operation: process current inputs in read-only mode while the existing process remains authoritative.
Assisted operation: show recommendations and evidence to a qualified reviewer, requiring approval before any downstream action.
Bounded execution: automate only the reversible actions that have earned sufficient evidence, while preserving escalation and rollback.

A golden dataset needs more than obvious examples. Include normal cases, ambiguous inputs, missing records, conflicting fields, outdated knowledge, duplicate triggers, unauthorized tool requests, and cases that should produce an abstention. Keep high-consequence failure modes visible as separate evaluation slices; a strong average can conceal the specific false negative that matters most.

Human feedback is useful, but it is not automatically ground truth. Reviewers can disagree, inherit inconsistent local practices, or approve a recommendation without examining it closely. Capture the initial agent output, reviewer action, reason for correction, and final adjudication where the process provides one. Use adjudicated outcomes to improve the golden set instead of treating every click as an equally reliable label.

Evaluate the properties that correspond to the operating contract:

Correct classification, routing, or evidence assembly on adjudicated cases.
Recall on important conditions, reviewed separately for higher-consequence misses.
Abstention and escalation when information is missing, conflicting, or outside scope.
Evidence completeness, including links or identifiers that let a reviewer verify the output.
Tool-use correctness, permission failures, and attempts to call unauthorized tools.
Reviewer acceptance, correction, and overturn reasons.
Operational impact on queue size, review effort, and time to disposition.

Set release thresholds according to the consequence of the task. A threshold appropriate for a reversible document suggestion is not automatically appropriate for a safety-monitoring signal. Do not compensate for weak performance on a high-risk slice with excellent performance on easy cases.
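
The arithmetic behind that warning is easy to show. In the sketch below, the slice names, thresholds, and case counts are invented for illustration. Recall is computed per slice, and a slice with no positive cases counts as failing for lack of evidence.

```python
# Assumption: illustrative minimum recall per risk slice.
MIN_RECALL = {"document_suggestion": 0.90, "safety_signal": 0.99}


def recall(cases):
    """Fraction of expected-positive cases that the agent flagged."""
    positives = [c for c in cases if c["expected"]]
    if not positives:
        return None
    return sum(c["predicted"] for c in positives) / len(positives)


def release_decision(cases_by_slice):
    per_slice = {name: recall(cases) for name, cases in cases_by_slice.items()}
    failing = sorted(name for name, value in per_slice.items()
                     if value is None or value < MIN_RECALL[name])
    return {"recall_by_slice": per_slice, "failing_slices": failing,
            "release_allowed": not failing}


def make_cases(hits, misses):
    return ([{"expected": True, "predicted": True}] * hits
            + [{"expected": True, "predicted": False}] * misses)


cases_by_slice = {
    "document_suggestion": make_cases(19, 1),  # recall 19/20
    "safety_signal": make_cases(9, 1),         # recall 9/10
}
decision = release_decision(cases_by_slice)
pooled = recall(cases_by_slice["document_suggestion"] + cases_by_slice["safety_signal"])
print(decision["failing_slices"], round(pooled, 3))
```

Pooled recall across both slices is 28 of 30, which would look acceptable against the easier threshold, while the safety slice on its own clearly fails.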

The human review interface is part of the safety system. It should present the recommendation, the exact supporting evidence, source identity, relevant timestamps, detected conflicts, and the permitted next actions. The reviewer needs an obvious way to correct, reject, or escalate the output. A generic approve button encourages automation bias and produces weak feedback data.

Preserve a traceable chain from agent intent to specification to test evidence. A release packet should identify the approved intended use, current versions, evaluation dataset, results by risk slice, known limitations, required human controls, monitoring plan, and rollback procedure. This is not paperwork added after product development. In a GxP-regulated setting, it is part of the product.

Production monitoring should detect changes in both behavior and operating conditions. Watch for shifts in input mix, rising abstention, changes in reviewer overturn reasons, missing provenance, connector failures, and differences after any model, knowledge, permission, or ontology update. When a material change occurs, route the affected configuration back through the relevant evaluation gates.
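
As a rough sketch, a monitoring check can compare current rates with a baseline window and name the signals that moved. The signal names, the rates, and the absolute tolerance of 0.05 are all assumptions. A real system would set tolerances per signal and per risk class.

```python
def drift_alerts(baseline, current, tolerance=0.05):
    """Names of rates that rose more than `tolerance` (absolute) above the baseline."""
    return sorted(name for name, value in current.items()
                  if value - baseline.get(name, 0.0) > tolerance)


# Hypothetical rates from a baseline window and the current window.
baseline = {"abstention_rate": 0.08, "overturn_rate": 0.10, "missing_provenance_rate": 0.00}
current = {"abstention_rate": 0.09, "overturn_rate": 0.19, "missing_provenance_rate": 0.02}

alerts = drift_alerts(baseline, current)
print(alerts)
```

An alert here is a reason to route the affected configuration back through evaluation. It does not explain on its own what changed.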

Key takeaways

Choose a bounded, frequent, and reversible workflow before attempting broad role automation.
Use agents to assemble evidence and prioritize work before granting authority over consequential outcomes.
Express the operating contract through governed models, narrow skills, controlled knowledge, permissioned connectors, triggers, and reproducible versions.
Match retrieval to the question: semantic discovery, hierarchical document access, and live connector retrieval solve different problems.
Preserve ontology mappings and field-level provenance when normalizing data across clinical systems.
Treat abstention, escalation, human review, evaluation evidence, production monitoring, and rollback as release requirements.

Your next artifact should not be a broader agent demonstration. Write the operating contract for the narrowest valuable workflow, then identify its authoritative inputs, prohibited actions, accountable reviewer, evaluation cases, escalation path, and rollback procedure. If any of those are unclear, narrow the workflow again. If they are explicit, you have a credible starting point for an agent that can improve clinical operations without outrunning the evidence.

References

Shivam.Consulting Blog – Inside Medable’s Agent Studio: The Agentic AI Blueprint to Accelerate Safer Clinical Trials

March 19, 2026

A Product Leadership System for Faster, Clearer Execution
Your roadmap is full, every function has a planning ritual, and experienced people are working hard. Yet decisions still wait, priorities keep reopening, and substantial work reaches customers later than anyone expected. Adding another process layer will not solve that problem.

You need an execution system: explicit ownership, small batches, a dependable decision cadence, direct customer feedback, and a scorecard that distinguishes progress from activity. When those elements reinforce one another, your teams can move faster without lowering the quality bar or routing every judgment through you.

Give each team an operating contract, not just a roadmap

A roadmap identifies intended destinations. It rarely tells a team how to make the decisions required to reach them. That gap is where autonomy turns into ambiguity: product believes it owns the sequence, engineering waits for scope to stabilize, design explores a wider problem, and an executive assumes a requested feature is already committed.

Before an initiative becomes active work, give the team a short operating contract. It should fit on one page and answer these questions:
- Whose problem are you solving, and in what specific scenario does it occur?
- What observable customer or business outcome should change if the work succeeds?
- Who is accountable for the initiative and its sequence?
- Which constraints are fixed, and which assumptions remain open?
- What is explicitly outside the current scope?
- What is the smallest end-to-end slice that can produce useful evidence?
- What evidence will support the next decision?
- When will that decision be made, and who has the right to make it?
The owner is not the person who approves every task. The owner keeps the problem, outcome, sequence, and unresolved decisions coherent. Engineering, design, research, and product still make solution decisions together inside the stated boundaries.

This contract also protects the team from executive requests that arrive as solutions without context. When someone asks for a feature, do not turn the request directly into a backlog item. Translate it into a problem entry first: the affected customer, the workflow that breaks, the evidence behind the request, the relevant constraint, and the result the requester expects. A commercially important request can remain urgent after that translation, but the team can now evaluate it rather than merely obey it.

Set escalation boundaries at the same time. A team should escalate when decision rights are unclear, two constraints conflict, a priority change affects another team, or the work crosses an agreed risk boundary. It should not need escalation merely because a solution choice is consequential. If every consequential choice travels upward, the team is not autonomous; it is a queue feeding a senior leader.

Finally, maintain one prioritized backlog for the team. Separate executive, product, engineering, and sales backlogs create hidden competition. The operating contract establishes the logic, and the single backlog makes the resulting sequence visible.

Run a weekly loop around decisions and customer learning

Many product cadences organize meetings while leaving decisions to happen unpredictably. A useful cadence does the opposite. Every recurring touchpoint should help the organization choose, learn, or remove a constraint.

A workable leadership week looks like this:
- Monday: Confirm the few priorities that matter, identify decisions that could block progress, and resolve changes in sequencing. Do not reread the entire roadmap.
- Midweek: Review selected product requirements, design flows, research findings, and engineering readiness. Concentrate on ambiguity, batch size, and untested assumptions.
- Thursday: Spend time with customers and partners. Put working slices in front of them when possible, and bring the resulting evidence back to the team.
- Friday: Write down what changed in your understanding. Update the backlog, decision log, and operating contracts where the evidence warrants it.
The sequence matters. Monday establishes intent. Midweek exposes execution risk while there is still time to change course. Customer contact tests the team’s reasoning. Friday turns scattered observations into organizational memory. Without the synthesis step, customer conversations can become interesting anecdotes that never alter a decision.

Make the weekly demo the heartbeat of the team. A good demo starts with the user scenario and the intended outcome, shows the smallest working behavior, states what the team learned, and ends with the next decision. A tour of completed tickets is not a substitute. For platform or infrastructure work, demonstrate working behavior, operational evidence, or a retired technical risk rather than manufacturing a customer-facing screen.

When a team repeatedly has nothing meaningful to demonstrate, inspect the system before questioning effort. The batch may be too large. A dependency may lack an owner. Decisions may be waiting in an approval queue. The team may be building several disconnected components before completing one testable path. The correction is usually to narrow the slice, clarify the decision, or remove the dependency.

A thin slice is not an arbitrary reduction in scope. It must preserve one coherent scenario, reach a state where someone can evaluate it, and create evidence for a consequential next choice. Backend, frontend, and enablement tasks can all be necessary, but completing them separately does not create a feedback loop.

Put product and revenue in the same operating loop

Product and revenue drift apart when they maintain different versions of the customer. Product sees research themes and usage behavior. Revenue sees active deals, objections, urgency, and willingness to pay. Neither view is sufficient on its own.

Use one customer narrative, one shared pipeline of problems worth solving, and one scorecard. Review them together every week. Each proposed problem should carry the customer segment, affected workflow, available evidence, commercial context, expected outcome, and complexity the solution could add.

Then make the sequencing decision explicit:
- Solve now: The problem is important enough, supported well enough, and compatible with the current strategy.
- Stage for scale: The need is credible, but the team must first validate the pattern, build a reusable foundation, or resolve a dependency.
- Do not add: The request is too narrow, conflicts with the product direction, or creates complexity that its value does not justify.
- Sunset: Existing functionality consumes attention without contributing enough customer value or strategic leverage.
This turns product-versus-sales conflict into a visible portfolio decision. Revenue contributes evidence and urgency. Product protects coherence and long-term defensibility. Both functions see why an item moved, waited, or stopped.

Measure outcomes, flow, and quality as separate signals

A team can ship frequently without improving a customer outcome. It can also improve an outcome temporarily while accumulating quality problems that make the pace unsustainable. Your scorecard needs to keep those conditions separate.

For each important bet, review three signal groups:
- Outcome: The observable customer or business result the team is trying to change, supported by current evidence rather than a list of releases.
- Flow: Deployment frequency and the age or state of the current thin slice. These signals reveal whether value and learning can move through the system.
- Quality: Change failure rate and the recurring friction exposed in customer feedback, support conversations, or postmortems.
Use the scorecard to direct attention, not to automate judgment. If deployment frequency is healthy but the intended outcome is not moving, inspect the hypothesis, target customer, and value proposition. More releases may simply deliver the wrong idea faster. If deployment frequency falls, examine batch size, dependencies, and delayed decisions. If change failure rate worsens, narrow the slice and strengthen readiness or recovery before asking the team to accelerate.

Do not rank unlike teams by raw deployment counts. Use trends within the relevant product and technical context. The point is to find constraints and make decisions, not to turn a diagnostic signal into a performance contest.
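
If you do not yet have a dashboard, the flow and quality signals can be computed from an ordinary deployment log. This is a minimal sketch with invented data. It assumes each entry records a deployment date and whether the change later needed remediation, and it defines change failure rate as failed deployments divided by total deployments.

```python
from collections import Counter
from datetime import date

# Hypothetical log: (deployment date, whether the change failed in production).
deployments = [
    (date(2026, 3, 2), False),
    (date(2026, 3, 4), True),
    (date(2026, 3, 10), False),
    (date(2026, 3, 12), False),
]


def deployments_per_week(log):
    """Count deployments per (ISO year, ISO week)."""
    return Counter(tuple(day.isocalendar())[:2] for day, _ in log)


def change_failure_rate(log):
    """Failed deployments divided by all deployments, or None for an empty log."""
    return sum(failed for _, failed in log) / len(log) if log else None


per_week = deployments_per_week(deployments)
cfr = change_failure_rate(deployments)
print(dict(per_week), cfr)
```

Read the weekly counts as a trend for one team over time, in keeping with the caution above about comparing unlike teams.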

Write outcome-focused OKRs with enough precision to guide a trade-off. A useful structure is: for a named user and scenario, improve an observable result from its current baseline toward an agreed target by the review point, without damaging a stated guardrail. Establish the baseline before debating the target. If the team cannot observe the result, say that plainly and make instrumentation or customer evidence part of the initial slice.

Feature count, roadmap completion, tickets closed, and activity volume can help with local planning. They are not proof of customer value. Treat them as operational context, not as the headline definition of success.

Keep the executive view compact. Each team should be able to present its intended outcome, current evidence, deployment-frequency trend, change-failure trend, most important customer learning, and next unresolved decision. If a metric never changes a question, a priority, or an intervention, remove it from the review.

Stay close to the work without taking the work away

Product leaders lose judgment when they only consume summaries. They become bottlenecks when they join every working session. The useful middle ground is deliberate sampling: inspect enough real work to calibrate your view, then give feedback that strengthens the team’s next decision.

Each week, sample a rotating set of artifacts such as a product requirements document, a design flow, customer research notes, a postmortem, or a customer thread. You are not trying to approve every artifact. You are checking whether the operating system is producing clear thinking.

Use questions that reveal decision quality:
- Does the requirement name a user scenario and a problem, or does it begin with a predetermined feature?
- Does the design expose a complete path that can be tested, or only polished fragments?
- Do the research notes separate what customers did and said from the team’s interpretation?
- Does the postmortem change an operating mechanism, or merely remind people to be careful?
- Does the customer thread reveal a pattern, an important exception, or one loud request?
- Can the team state the next decision this artifact is meant to support?
Feedback should create motion. Name the user scenario, identify the friction or ambiguity, state the decision principle, propose a smaller testable slice when appropriate, and clarify the next decision. A vague comment such as “make this more strategic” forces the team to guess what you mean and then wait for another review.

I use a simple leadership boundary: push hard on problem clarity, sequencing, and the quality bar; leave room on solution design and implementation. That boundary keeps accountability with leadership without converting senior judgment into remote-control product management.

Exemplars make this boundary easier to scale. Keep a small, current library of strong problem statements, concise narrative memos, useful research syntheses, clear acceptance criteria, and honest postmortems. Show why each example is effective. Teams learn a quality bar faster from visible work than from an expanding rulebook.

Create short paths for decisions and uncomfortable information

Open office hours give anyone a direct route for a difficult escalation, unfinished design, customer insight, or cross-team conflict. Run them as a decision forum, not an extra status meeting. Capture the decision, owner, rationale, and follow-up so people who were not present can still act consistently.

Keep weekly one-to-ones with your leaders as well. Office hours expose work across the organization; one-to-ones develop judgment, surface recurring constraints, and help a leader notice when someone is absorbing ambiguity on behalf of the system.

Fast feedback from leadership matters because waiting expands batches. When teams expect a long approval cycle, they tend to gather more material and seek approval for more decisions at once. Publish clear decision rights and a dependable response path. If you do not need to make the decision, say so immediately and return it to the named owner.

Spend unscripted time with individual contributors, too. Formal reporting lines filter information. Direct exposure to the people building, researching, designing, supporting, and selling the product helps you hear where the written process and actual work have diverged.

Install the system without reorganizing first

You do not need a company-wide transformation program to test this operating model. Start with one important initiative that is moving slowly or generating repeated disagreement. Keep the current reporting structure and change the mechanics around the work.
1. Capture the current friction. Identify where the initiative waits, where priorities conflict, which decisions keep reopening, and where work returns for avoidable clarification.
2. Write the operating contract. Name the problem, outcome, owner, constraints, non-goals, initial thin slice, required evidence, and next decision.
3. Collapse the work into one sequence. Bring product, engineering, executive, and commercial requests into one prioritized backlog. Preserve their context rather than preserving separate queues.
4. Run the weekly loop. Set priorities on Monday, inspect selected artifacts midweek, expose work and assumptions to customers, and synthesize the learning on Friday.
5. Publish the compact scorecard. Show the intended outcome, deployment frequency, change failure rate, newest customer evidence, and next decision. Do not wait for a perfect dashboard.
6. Inspect the mechanism after a full loop. Remove one gate that added waiting without adding learning, divide one oversized batch, and clarify one decision right that caused an escalation.
During the review, ask concrete questions: What waited for a decision? What was redone because the original problem was unclear? Which customer signal changed the plan? Which metric caused an intervention? Which request arrived without enough context? Where did leadership provide useful boundaries, and where did it take ownership away from the team?

Expand the model only after the team can explain how it changed actual work. Copying the ceremonies without the decision rights, customer exposure, and scorecard will create more meetings, not a stronger execution system.

Key takeaways
- Start each important initiative with a one-page operating contract that connects a real customer problem to an owner, outcome, constraints, thin slice, and next decision.
- Protect autonomy with explicit boundaries. Escalate conflicting priorities and constraints, not every consequential solution choice.
- Organize the week around decisions, working evidence, customer contact, and written synthesis rather than status reporting.
- Read outcomes, deployment frequency, and change failure rate together. No single signal can tell you whether the team is delivering sustainable value.
- Sample real artifacts and give specific feedback, while leaving solution and implementation ownership with the team.
- Give product and revenue one customer narrative, one problem pipeline, and one scorecard so trade-offs become visible sequencing decisions.
At your next Monday priority review, choose one live initiative and write its operating contract before discussing another roadmap change. The missing answers will show you where execution is actually slowing down. Fix that mechanism, run the loop, and let evidence determine where the system should expand next.

References
- Shivam.Consulting Blog — The CPO Playbook I Wish I’d Had: Ditch Bad Wisdom, Ship Faster, and Lead with Clarity
March 19, 2026
Outcomes vs Outputs: How I Stopped the Feature Factory and Drove Real Product Impact

“Outcomes over outputs” is the right mantra—and one I’ve championed across product teams—but turning it into daily practice is where most teams stumble.

It’s simple in theory: focus on the impact of what we build, not just shipping features. In reality, it’s rarely black and white because most teams are asked to do both—hit outcomes and deliver specific outputs—at the same time.

In a benchmark survey, 20% of product teams claim to be outcome-focused, nearly half describe themselves as working in a mix of outcomes and outputs, and about 30% are still primarily working with outputs. I’ve seen versions of this in my own org: we aspire to outcomes, but our rituals, roadmaps, and reporting still reward shipping.

Here’s how I draw the line clearly, coach my teams to avoid common traps, and negotiate better, more actionable outcomes that unlock genuine product discovery and business results.

Simple definitions we live by

An output is something you build or produce—a feature, a project, an initiative. It’s something your team ships.

An outcome is the impact of that output—a change in customer behavior or a business result.

Josh Seiden puts it well in his book Outcomes Over Output: “An outcome is a change in human behavior that drives business results.”

Shift from shipping to shaping results. This graphic clarifies outputs vs outcomes, revealing that value emerges between deliverables and impact—when features change customer behavior and move business results.

I distinguish business outcomes from product outcomes. Business outcomes are typically financial metrics that measure the health of the business (e.g. increase revenue or reduce costs) while product outcomes measure a customer behavior in the product or a sentiment about the product.

Here’s a simple example I’ve used with platform teams. Many B2B companies support a number of integrations. Integrations are outputs. Having integrations alone doesn’t create value. Customers using and finding value in those integrations—that’s an outcome. If those customers retain their subscriptions longer because of the integrations—that’s also an outcome.

Building something isn’t the same as creating value. That’s the core of this distinction, and it’s what separates empowered product teams from feature factories.

Why this distinction matters for empowered product teams

When we task teams with delivering outputs, they’re done when the software ships. When we task teams with delivering outcomes, they aren’t done until the software ships and has the expected impact.

That small shift changes almost everything about how a team works: what we measure (impact, not just delivery), how we know we’re done (measurable behavior change, not release notes), the autonomy we grant (told what to achieve, not what to build), and the planning artifacts we use (an opportunity solution tree beats a feature roadmap when we’re exploring the best path to an outcome).

When I assign outcomes, I’m giving the team latitude—and responsibility—to figure out the best path to success. That’s what opens the door for real product discovery and continuous discovery habits.

Shift your lens from shipping features to achieving impact. This side-by-side visual explains how outcome-driven teams measure success, grant more autonomy, define 'done' by results, and plan with an opportunity solution tree.

Examples: spotting outputs disguised as outcomes

Clear-cut example: “Our outcome is to deliver an Android app.” An Android app is something we build and ship. It’s clearly an output.

To get to an outcome, I ask, “What’s the value of having an Android app?” or “How will we know the Android app is successful?”

We might answer: “Having an Android app will allow us to engage more users. We’ll know it’s successful when people engage with the app on a regular basis.”

This answer uncovers the hidden outcome: engage more people. Now we can set the right scope: increase the percentage of engaged users across any platform; increase the percentage of engaged mobile users; or increase the percentage of engaged Android users.

Any of these outcomes gives us more room to explore than a fixed output. Maybe we don’t need a native app at all. We could deliver the same engagement through a mobile web experience, notifications, or email. And we’re not done when we ship—we’re done when the right people are actually engaged.

Tricky example 1: measure the value creation moment (hires, not applicants)

When setting outcomes, it’s tempting to choose the easiest-to-measure metric. But a good outcome measures the customer’s value creation moment.

I worked at a company that helped new college grads find their first job. When I started working there, the primary outcome was “increase job applications.” This technically is an outcome—it measures a specific behavior in the product.

But it doesn’t measure the value creation moment. A job seeker doesn’t get value when they apply for a job. They only get value when they get the job. Similarly, employers don’t get value from just any applicant; they get value when the right applicant applies.

Many job boards try to measure qualified applicants—instead of counting any applicant, they compare the credentials of the applicant to the job description and only count qualified applicants. This is better. But it still doesn’t measure the value creation moment. Both the job seeker and the employer get value when an open job is successfully filled. The right metric is hires.

Yes, “hires” can be hard to instrument because it happens off-platform and incentives misalign. Measure it anyway, even with proxies. The easy metric isn’t always the right outcome.

Tricky example 2: measure impact, not user-generated output (the course reviews trap)

I worked with a team that helped students choose university courses. They set their outcome as: “Increase the number of course reviews on our platform.”

Sounds like an outcome, right? It’s a metric. You can measure it. It’s an action users take on the site—writing a review. But it’s actually an output in disguise.

Reviews are valuable when they help a student evaluate a course. They don’t create any value if a student never sees them. More reviews aren’t always better, especially if they’re clustered where nobody looks.

A better outcome is “Increase the number of course views that include reviews.” Now we’re measuring impact on the decision moment, not just the production of content.

If you can hit your metric without helping customers, you’re tracking an output, not an outcome.
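
To see the difference in numbers, here is a minimal sketch with made-up data. The `Course` type, the course names, and every count are hypothetical, and "course views that include reviews" is approximated as views of courses that have at least one review.

```python
from dataclasses import dataclass


@dataclass
class Course:
    name: str
    views: int    # times students opened the course page (hypothetical)
    reviews: int  # reviews written for the course (hypothetical)


def total_reviews(courses):
    """The output-style metric: how much content was produced."""
    return sum(c.reviews for c in courses)


def views_with_reviews(courses):
    """The outcome-style metric: views where a student could see a review."""
    return sum(c.views for c in courses if c.reviews > 0)


before = [
    Course("Intro Stats", 900, 0),
    Course("Art History", 50, 4),
    Course("Obscure Seminar", 50, 6),
]
# 40 new reviews land on a course almost nobody looks at.
after_clustered = [
    Course("Intro Stats", 900, 0),
    Course("Art History", 50, 4),
    Course("Obscure Seminar", 50, 46),
]
# 5 new reviews land on the course most students look at.
after_spread = [
    Course("Intro Stats", 900, 5),
    Course("Art History", 50, 4),
    Course("Obscure Seminar", 50, 6),
]

for label, data in [("before", before), ("clustered", after_clustered), ("spread", after_spread)]:
    print(label, total_reviews(data), views_with_reviews(data))
```

In this toy data, the clustered scenario lifts the review count from 10 to 50 while views with reviews stay at 100. The spread scenario adds only five reviews but lifts views with reviews to 1,000.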

Tricky example 3: measure success, not just adoption (the traction metric trap)

“Increase the percentage of users who viewed the performance report.”

This looks like a good outcome. It measures a specific behavior in the product. It’s within the team’s control. But it’s what I call a traction metric—it measures adoption of a single feature, not value to the customer.

Two problems arise. First, people can view the report and still not find what they need. Second, we might have perfectly happy customers who don’t need the report at all. Driving usage of an unneeded feature wastes time and erodes trust.

Measure the value creation moment, not just feature adoption.

Tricky example 4: pair sentiment with behavior

I define a product outcome as a metric that measures either 1. a specific behavior in the product or 2. a sentiment about the product. But sentiment metrics—like CSAT or NPS—can be tricky on their own.

Sentiment metrics are outcomes, but they aren’t directional. They don’t tell us where to explore or set guardrails for what to avoid. So I pair a behavior with a sentiment, for example: “Increase engagement without negatively impacting satisfaction.” I use sentiment as a counterweight.

Facebook and Instagram illustrate why this matters. Meta is exceptional at driving engagement—but to a fault. Many of us don’t like these addictive products. Pairing engagement with a satisfaction guardrail prevents “engagement at all costs.”
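
One way to make the pairing concrete is to evaluate the behavior metric and the sentiment guardrail together. This is only a sketch: the function name, the labels it returns, and the `max_csat_drop` tolerance are assumptions for illustration, not an established scoring method.

```python
def evaluate_outcome(engagement_before, engagement_after,
                     csat_before, csat_after, max_csat_drop=0.0):
    """Judge an engagement change against a satisfaction guardrail.

    max_csat_drop is a hypothetical tolerance: how far satisfaction may
    fall before the guardrail counts as breached.
    """
    engagement_up = engagement_after > engagement_before
    guardrail_held = (csat_before - csat_after) <= max_csat_drop
    if engagement_up and guardrail_held:
        return "success"
    if engagement_up:
        return "guardrail breached"
    return "no improvement"


# Engagement rose and satisfaction held steady.
print(evaluate_outcome(0.40, 0.46, 4.2, 4.2))
# Engagement rose, but satisfaction fell.
print(evaluate_outcome(0.40, 0.46, 4.2, 3.9))
```

The second call is the "engagement at all costs" case: the behavior metric moved, but the result should not be counted as a win.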

Why getting this right is hard (and how I counter it)

The trust cycle. Managers don’t trust that teams can reach outcomes on their own. So managers micromanage the outputs. Teams, in turn, don’t communicate their progress toward outcomes—they communicate their progress on features. This reinforces the manager’s belief that they need to stay involved in the details. It’s a vicious cycle.

I break it by asking teams to show their work—share assumptions, research, opportunity solution trees, and evidence behind choices—and by giving feedback on the thinking, not just the solutions.

The accountability trap. When performance reviews are tied to hitting outcomes, teams play it safe. They sandbag their targets. They disguise outputs as outcomes to guarantee “success.”

I treat outcomes as learning opportunities first. When we start on a new outcome, I set a learning goal—“learn what moves the needle on this metric”—before a performance goal—“increase X by Y%.” This creates space to explore without fear.

How I get teams started with better outcomes

Translate business outcomes to product outcomes. Business outcomes like revenue, retention, and market share are lagging indicators—by the time you see them, it’s too late to act. Product outcomes measure behavior changes within the product that lead to those business results. They’re leading indicators within the team’s control.

Negotiate outcomes with your team. Outcome-setting should be a two-way conversation. Leadership brings the cross-company context. The team brings customer insight and technical realities. Neither side dictates; we co-own the target and the constraints.

Expect to iterate on your metrics. Your first outcome metric probably won’t be right. That’s normal. Sonja at tails.com went through four iterations—from 90-day retention to 30-day to 5-day to behavior-based metrics—before landing on something actionable. Thomas at Bluestone Analytics iterated three or four times before finding the right metric. Iteration is the work.

Watch for common mistakes. Outputs disguised as outcomes. Traction metrics masquerading as product outcomes. Sentiment metrics without direction. Business outcomes assigned directly to product teams without translating to behavior change.

Use the right artifacts. Replace feature roadmaps with an opportunity solution tree to explore multiple paths, test assumptions, and sequence bets explicitly against a clear outcome.

Align OKRs with outcomes. If your company uses OKRs, make sure the “KR”s are true product outcomes (behavior change and value creation), not a list of features to ship.

The bottom line

When we shift from an output-first mindset to an outcome-first mindset, it doesn’t mean that outputs stop mattering. Product teams will always ship features, and the ability to do so quickly and with quality still matters. This shift simply ensures those features achieve the intended impact. We aren’t done when we ship—we’re done when what we shipped has the intended impact.

Measure success by the impact of what you ship and you’ll build a product team that learns, adapts, and creates real value. Measure success by what you ship and you’ll get a feature factory.

Quick self-check: is your “outcome” really an outcome?

Ask yourself: 1) Does it measure a behavior change or a sentiment tied to value creation? 2) Could we hit it without helping customers? 3) Is it adoption of a single feature (a traction metric) or a result that customers and the business care about? 4) Do we have a counter-metric to prevent unintended harm? If you stumble on any of these, refine it before you commit.
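
The four questions can be written down as a small checklist. This is a sketch under assumptions: the `OutcomeCandidate` fields and the wording of each issue are hypothetical, and the answers still come from human judgment, not from code.

```python
from dataclasses import dataclass


@dataclass
class OutcomeCandidate:
    statement: str
    measures_behavior_or_sentiment: bool        # question 1
    achievable_without_helping_customers: bool  # question 2
    single_feature_adoption: bool               # question 3
    has_counter_metric: bool                    # question 4


def self_check(candidate):
    """Return the list of problems to fix before committing to the outcome."""
    issues = []
    if not candidate.measures_behavior_or_sentiment:
        issues.append("does not measure a behavior change or sentiment")
    if candidate.achievable_without_helping_customers:
        issues.append("can be hit without helping customers (output in disguise)")
    if candidate.single_feature_adoption:
        issues.append("tracks adoption of a single feature (traction metric)")
    if not candidate.has_counter_metric:
        issues.append("has no counter-metric to prevent unintended harm")
    return issues


reviews = OutcomeCandidate(
    statement="Increase the number of course reviews on our platform",
    measures_behavior_or_sentiment=True,
    achievable_without_helping_customers=True,
    single_feature_adoption=False,
    has_counter_metric=False,
)
for issue in self_check(reviews):
    print("-", issue)
```

An empty list means the candidate passed all four questions; anything else is a prompt to refine the outcome.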

Inspired by this post on Product Talk.

March 18, 2026
Staying Sane as a Product Leader: Practical Strategies I’m Using from Teresa Torres & Petra Wille

The world can feel like it’s spinning, and as a product leader, I feel that pressure acutely—juggling customer needs, stakeholder expectations, and the relentless news cycle. I recently listened to a powerful conversation with Teresa Torres and Petra Wille about staying grounded when everything feels “bonkers,” and it offered a practical, human way to keep showing up without losing yourself.

What resonated most was the invitation to live my values through small, consistent actions. Rather than waiting for grand gestures or perfect solutions, I’m leaning into the mindset of “Something is better than nothing.” It’s the same spirit we bring to continuous improvement in product: make a change, evaluate impact, iterate.

“Create the world you want to live in” has become a daily prompt for me. I’m applying it to how I spend my attention, time, and platform—three scarce resources for any product management leader. I’m not going to do everything perfectly, but I can make better trade-offs this week than I did last week, and I can keep improving.

Practically, that looks like reconsidering which speaking invites I accept, especially when representation is skewed. If a stage is heavily male, I now ask organizers about their plan for balance before committing. I also question travel expectations for short talks when a high-quality virtual experience is possible—good for sustainability, budgets, and energy. These choices compound, just like product roadmapping and sprint planning decisions.

Petra’s “under-complexity” lens was a wake-up call. In product, oversimplified narratives—whether a single KPI, a vanity metric, or a forced binary—usually increase fear and bad decisions. The same is true in civic discourse. To counter that, I’m seeking more nuance on purpose: reading multiple sources on the same story, listening for who’s not in the room, and noticing how the same facts can carry different meanings depending on who’s telling it.

One simple habit helps: I’ll read The New York Times and The Wall Street Journal on a headline, then follow up with Tangle by Isaac Saul, which lays out “what the left says / what the right says / editor’s take,” sometimes including perspectives from affected communities. It’s a lightweight form of personal knowledge management that improves my product judgment and my citizenship.

Another idea that stuck with me is swapping media proxies for human connection. In product, we don’t ship based on secondhand opinions—we run customer interviews, co-create with users, and build empowered product teams. The same principle applies in community: talk to someone directly affected, ask real questions, and stay curious. When conversations get heated, I try to build bridges, reduce proxies, and look people in the eye.

I’m also reflecting on platform responsibility. Even a “small” platform can snowball through weak ties inside a company or community. I’m asking: When should I speak up? Where should I draw lines? And when is “staying in your lane” actually a way to avoid necessary leadership? These are the same stakeholder management questions we navigate in product strategy—assess impact, clarify intent, and act with integrity.

Local grounding matters, too. I’ve found energy and clarity in community-level action: voting, attending public protests when it feels right, mentoring, and supporting nonprofits like World Pulse. I love the framing of “don’t mess with my neighbors”—it keeps me focused on tangible care when the internet starts to feel like reality. I’ve also seen leaders use angel investing in agriculture-related efforts as a counterbalance to “internet reality,” channeling resources into durable, real-world outcomes.

If you want to experiment this week, pick one small lever you control: where you spend money, time, attention, or your platform. Add nuance by reading at least two different perspectives before reacting. Replace proxies with people by talking to someone with lived experience. Reduce polarization by asking, “what shaped that view?” before judging it. And go local—connect with neighbors or a community group and let small actions compound.

If you’d like to hear the full conversation that inspired these reflections, you can listen on Spotify or Apple Podcasts. Here are the direct links: Spotify: https://open.spotify.com/episode/1sxEFquu73ZB9fL9gGk6Om and Apple Podcasts: https://podcasts.apple.com/kh/podcast/staying-sane/id1794203808?i=1000755696295

Resources I’m exploring and recommend: World Pulse (https://www.worldpulse.org/), The New York Times (https://www.nytimes.com/), The Wall Street Journal (https://www.wsj.com/), and Tangle by Isaac Saul (https://www.readtangle.com/ and https://www.readtangle.com/author/isaac-saul/). For builders and writers, I also appreciate Ghost (https://ghost.org/) as an open-source publishing platform. If you work in or with the MENA ecosystem, take a look at MENA Product Summit ’26 (https://www.prdkt.plus/summit26). Colleagues like Jeff Merrell (https://jeffdmerrell.com/) and grassroots efforts such as No Kings Protest (https://www.nokings.org/) offer additional perspectives and ways to get involved.

If this resonates, share it with a teammate who’s been feeling the weight of the world. I’d love to hear one small, values-aligned action you’re taking this month—what “something” will you try next?

Inspired by this post on Product Talk.

March 17, 2026
Agentic Architecture Demystified: How Modern AI Systems Plan, Learn, and Execute at Scale

In my role leading product teams at HighLevel, I’m often asked to explain what’s really happening behind the scenes of today’s AI products. The short answer is that modern systems are built on an agentic architecture: not a single model, but a coordinated loop of planning, tool use, memory, and evaluation. Once you see that pattern, the design decisions snap into focus and the roadmap becomes far easier to prioritize.

At its core, agentic AI treats the model as a reasoning engine embedded within an AI workflow. The agent interprets intent, plans steps, calls the right tools and APIs, grounds itself in trusted data, and then evaluates outcomes before deciding to continue or stop. This loop creates reliability, reduces hallucinations, and enables the system to operate in real-world, multi-step scenarios.

Here’s the practical lifecycle I rely on. A user provides intent (a goal or request). We run a retrieval-first pipeline to ground the model in accurate, current data. Prompt engineering structures the task, primes the agent with constraints and success criteria, and keeps the context window under control. The agent generates a plan, executes steps by calling tools or services, evaluates intermediate results, reflects or revises as needed, and only then returns a final answer with clear citations or evidence.
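
The lifecycle can be sketched in a few lines. Everything here is a stand-in: `KNOWLEDGE`, `retrieve`, `plan`, `TOOLS`, and `evaluate` are hypothetical names, the keyword lookup replaces a real retrieval layer, and no model is called. The point is the order of the steps, not the implementation.

```python
# Hypothetical in-memory knowledge base standing in for a retrieval layer.
KNOWLEDGE = {
    "refund window": "Refunds are available within 30 days of purchase.",
    "support hours": "Support is staffed 24/7.",
}


def retrieve(intent):
    """Retrieval-first step: return facts whose key appears in the intent."""
    return [fact for key, fact in KNOWLEDGE.items() if key in intent.lower()]


def plan(intent, facts):
    """Pick steps: answer when grounded facts exist, otherwise escalate."""
    return ["answer_from_facts"] if facts else ["escalate"]


# Tools the agent is allowed to call (toy implementations).
TOOLS = {
    "answer_from_facts": lambda intent, facts: " ".join(facts),
    "escalate": lambda intent, facts: "Handing off to a human.",
}


def evaluate(result, facts):
    """Accept a result only if every retrieved fact is present in it."""
    return bool(facts) and all(fact in result for fact in facts)


def run_agent(intent, max_steps=3):
    facts = retrieve(intent)                    # ground first
    for step in plan(intent, facts)[:max_steps]:
        result = TOOLS[step](intent, facts)     # act
        if evaluate(result, facts):             # check before answering
            return {"answer": result, "citations": facts, "grounded": True}
    return {"answer": "Handing off to a human.", "citations": [], "grounded": False}


print(run_agent("What is your refund window?"))
print(run_agent("Can I pay in crypto?"))
```

The second call has nothing to ground on, so the sketch hands off instead of guessing.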

For more complex work, I orchestrate multiple specialized agents—commonly a planner, a solver, and a critic—coordinated by a lightweight controller. This multi-agent pattern reduces single-agent blind spots, encourages self-checking, and mirrors how empowered product teams collaborate. Whether it’s conversation design for support flows or a voice AI agent driving hands-free tasks, orchestration is the difference between a clever demo and a dependable product.
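
A toy version of the planner, solver, critic, and controller pattern might look like the following. The function names and the semicolon-separated task format are assumptions, and the solver deliberately returns an empty first draft so the critic has something to reject.

```python
def planner(task):
    """Split a task written as 'step one; step two' into ordered steps."""
    return [step.strip() for step in task.split(";") if step.strip()]


def solver(step, attempt):
    """Toy solver: the first attempt is an empty draft, later ones succeed."""
    return "" if attempt == 0 else f"done: {step}"


def critic(step, draft):
    """Accept a draft only if it reports completion of this exact step."""
    return draft == f"done: {step}"


def controller(task, max_attempts=2):
    """Lightweight controller: plan, then retry each step until the critic agrees."""
    results = []
    for step in planner(task):
        for attempt in range(max_attempts):
            draft = solver(step, attempt)
            if critic(step, draft):
                results.append(draft)
                break
        else:
            # No attempt passed review; surface it instead of hiding it.
            results.append(f"unresolved: {step}")
    return results


print(controller("look up order; draft reply"))
print(controller("look up order; draft reply", max_attempts=1))
```

With a single attempt allowed, every step comes back as unresolved, which is the self-checking behavior the pattern is meant to provide.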

Memory is the second pillar. Short-term working context sits in the prompt, while long-term memory lives in vector stores or databases to track past interactions, preferences, and outcomes. Retrieval augments the model with the right facts at the right time, and tight context window management ensures the agent stays focused on signal, not noise. The result is faster responses, lower costs, and far better accuracy.
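
Here is a minimal sketch of long-term memory with a context budget. It assumes a bag-of-words "embedding" and cosine similarity in place of a real embedding model and vector store; `LongTermMemory`, `build_prompt`, and `budget_words` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LongTermMemory:
    def __init__(self):
        self.items = []  # (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(query, memory, budget_words=10):
    """Add the most relevant memories until the word budget is spent."""
    context, used = [], 0
    for text in memory.search(query, k=len(memory.items)):
        words = len(text.split())
        if used + words > budget_words:
            break
        context.append(text)
        used += words
    return "\n".join(context + [f"Question: {query}"])


memory = LongTermMemory()
memory.add("customer prefers email updates")
memory.add("customer plan is enterprise tier")
memory.add("last ticket was about billing export")

print(memory.search("which plan is the customer on"))
print(build_prompt("which plan is the customer on", memory))
```

The budget is what keeps the prompt on signal: the least relevant memory is dropped once the word limit is reached.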

Reliability is earned through eval-driven development and robust AI risk management. I define offline and online evaluations, guardrails, and human-in-the-loop checkpoints before scaling traffic. These evaluations become living, automated tests that protect against regressions as prompts, models, and tools evolve. The payoff is real: fewer escalations, higher trust, and measurable improvements to quality over time.
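
As a sketch of what an automated offline evaluation could look like, the snippet below runs fixed cases against an agent and gates release on a pass rate. The case format, the `must_contain` check, the 0.9 threshold, and both toy agents are assumptions; real evaluations usually need richer scoring than substring matching.

```python
def run_evals(agent, cases, min_pass_rate=0.9):
    """Run every case and report whether the agent is safe to release."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = [
        case["input"]
        for case in cases
        if case["must_contain"].lower() not in agent(case["input"]).lower()
    ]
    pass_rate = 1 - len(failures) / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        "release_ok": pass_rate >= min_pass_rate,
    }


# Hypothetical regression cases, kept in version control with the prompts.
CASES = [
    {"input": "How long is the refund window?", "must_contain": "30 days"},
    {"input": "When is support available?", "must_contain": "24/7"},
]


def agent_v1(question):
    """Toy agent that answers both cases correctly."""
    if "refund" in question.lower():
        return "Refunds are available within 30 days."
    return "Support is staffed 24/7."


def agent_v2(question):
    """Toy agent after a bad prompt change: it deflects everything."""
    return "Please contact support."


print(run_evals(agent_v1, CASES))
print(run_evals(agent_v2, CASES))
```

Run on every prompt, model, or tool change, a suite like this turns a regression into a failed check instead of a customer escalation.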

From a product strategy perspective, I resist over-engineering. Start with a simple retrieval-first pipeline and a single agent; prove value; then layer in multi-agent orchestration only where it moves key metrics. Instrument everything—latency, cost, grounding coverage, and outcome quality—and build Agent Analytics dashboards so teams can diagnose issues and iterate with confidence.
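
The four signals named above can be collected per run and rolled up for a dashboard. This is a sketch with hypothetical field names and sample values; how "grounded" and "resolved" are determined is left to the evaluation layer.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentRun:
    latency_ms: float
    cost_usd: float
    grounded: bool   # did the answer cite retrieved evidence?
    resolved: bool   # did the run reach a good outcome?


def summarize(runs):
    """Aggregate per-run records into the four headline metrics."""
    if not runs:
        raise ValueError("no runs recorded")
    return {
        "avg_latency_ms": mean(r.latency_ms for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "grounding_coverage": sum(r.grounded for r in runs) / len(runs),
        "resolution_rate": sum(r.resolved for r in runs) / len(runs),
    }


# Hypothetical sample data.
runs = [
    AgentRun(latency_ms=100, cost_usd=0.01, grounded=True, resolved=True),
    AgentRun(latency_ms=300, cost_usd=0.03, grounded=False, resolved=True),
]
print(summarize(runs))
```

Comparing these numbers before and after adding a second agent is one way to check whether the extra orchestration actually moves a key metric.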

If you’re looking for a practical playbook, here’s mine: clarify the user intent and success criteria; design the tools the agent can call; ground with authoritative data; write prompts that constrain scope and define termination conditions; add reflection and automated evaluations; and ship behind feature flags for safe, staged rollout. Each step compounds reliability without killing velocity.

The diagram and the video above bring these patterns to life. If you watch closely, you’ll see the same loop—plan, retrieve, act, evaluate—show up in every effective implementation, regardless of domain. That repetition isn’t accidental; it’s the backbone of agentic architecture and a blueprint you can adapt to your own stack.

Ultimately, what matters is outcomes. When we build around agentic AI, we create systems that are explainable to stakeholders, maintainable by engineers, and genuinely helpful to customers. That’s how we move past hype to durable impact—shipping AI products that plan, learn, and execute at scale.

Inspired by this post on Product School.

March 16, 2026
How We Automated 81% of Customer Support with AI—While Uplifting CX, Speed, and ROI

Leading the Support function for a company that builds a leading Agent and AI-forward customer service platform has been, for me, unique, exciting, and yes—daunting. It’s where product ambition meets operational reality, and where every decision I make is immediately tested by customers who expect excellence.

It’s unique because we use the same technology as our customers. We live in the product every day, which puts us in a privileged position to be the voice of the customer across the organization. That tight feedback loop has shaped how I prioritize, what I build next, and how I measure success.

It’s exciting because we get to try all of the new features and capabilities of Fin and the Intercom helpdesk. With a relentless focus on AI innovation, I’ve had access to remarkable tools that help us deliver an incredible customer experience—and I’ve seen firsthand how the right workflows and guardrails turn those tools into outcomes.

And it’s daunting because expectations for our own Customer Support (CS) team are sky high. If we can’t deliver incredible support using our own technology, we undermine its value proposition. That imperative has kept me honest, focused, and fast.

In our new research, “The 2026 Customer Service Transformation Report,” we’ve been sharing how forward-looking teams use AI to transform their support models.

When Intercom changed its focus in late 2022 to prioritize the customer service use case, we undertook a critical review of the support experience we were delivering and committed to driving meaningful change under an AI-first framework. That was a turning point: I aligned product strategy and operations around a single north star—automate with quality, and elevate humans to higher-value work.

Three years on, Fin now resolves over 81% of all our customer support volume, delivering immediate and high-quality resolutions. We have absorbed a 300%+ increase in customer demand since 2022 without proportional headcount growth. Without Fin, we would have needed at least 100 additional CS team members to meet that demand and our improved service levels – a net saving to Intercom of between $7.5M–$9M annually.
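
As a back-of-the-envelope reading of those figures, dividing the stated net saving by the stated headcount gives an implied saving per avoided hire. The per-hire number is not reported in the source. It is derived here only for illustration, and because the text says "at least 100", it is an upper bound.

```python
def implied_saving_per_hire(net_saving_usd, avoided_hires):
    """Naive division of an annual net saving across avoided hires."""
    return net_saving_usd / avoided_hires


avoided_hires = 100            # "at least 100 additional CS team members"
net_saving_low = 7_500_000     # USD per year, low end of the stated range
net_saving_high = 9_000_000    # USD per year, high end of the stated range

low = implied_saving_per_hire(net_saving_low, avoided_hires)
high = implied_saving_per_hire(net_saving_high, avoided_hires)
print(f"Implied net saving per avoided hire: ${low:,.0f} to ${high:,.0f} per year")
```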

Throughout this work, we drew on research from the 2026 Customer Service Transformation Report and applied the lessons directly to our own org design, knowledge management, and AI workflows. What follows is our story of transformation and how we achieved a mature deployment of Fin.

The problems we set out to solve

Back in 2022, our challenges looked familiar to any modern support organization, and I knew we needed a step-change—not incremental tweaks.

We faced increased support demand from new and existing customers: Intercom was launching major features and changes at speed, driving up overall customer conversation volume and requiring additional headcount for the CS team. I could see we were scaling people faster than processes—unsustainable without automation.

Our support policy (as defined by our service level objectives) was not based on a high bar: In most cases, we were only committed to “business hours” coverage for the majority of our customers, impacting first response times. Even with SLOs that were not considered best in class, we were struggling to meet our commitments. I wanted 24/7 coverage and faster first responses without sacrificing quality.

We wanted to do more: As we pivoted our strategy, we wanted to open new routes to our support team, such as providing support to website visitors with technical questions and to trial customers. That meant meeting customers earlier in their journey with accurate, on-brand responses—at scale.

What we did

We made a very conscious decision to become our own best reference customer. As Intercom embraced the opportunity that generative AI presented to transform customer service, we intentionally moved to an AI-first strategy for our Customer Support team. I set a simple operating principle: ship value quickly, measure relentlessly, and let evidence guide the next bet.

We started with the highest-volume, informational queries and saw our resolution rates climb quickly. With that foundation in place, we pushed Fin further, training it on deeper documentation and internal procedures, and eventually giving it the ability to take actions on behalf of customers. As Fin took on more complex work, our results started to compound—and trust in the system grew across the organization.

Early adoption and building trust. When “AI Assist” features came to the Intercom Inbox, the CS team got early exposure to AI and were empowered to provide feedback directly to our product teams. This built awareness and trust across the team about what we were trying to achieve with AI, and helped shape the product roadmap. We were also the first beta customer for Fin, rolling it out to a subset of customers to watch sentiment and outcomes closely. With no adverse reaction and an initial resolution rate of over 25%, we deployed Fin to most customer segments within weeks. I’ll never forget the first week we put Fin in front of real customers—the silence of issues that never reached humans was the loudest signal of success.

Knowledge management as a product. We recognized quickly that time spent tuning our help center and knowledge assets for Fin would pay dividends. We transitioned our Help Center Manager into a “Knowledge Manager,” with a dedicated remit to optimize content for Fin. We embedded knowledge creation into our “New Product Introduction” (NPI) process, with a target that Fin resolves at least 50% of customer issues at every new product and feature launch. Over time, we added new sources, including “Developer Documents,” enabling Fin to handle increasingly complex issues. We built a culture of continuous improvement, allocating “out of the inbox” time so every teammate could close content gaps and raise the bar.

Conversation design end-to-end. To ensure a consistent, high-quality customer experience, we created a new “Conversation Designer” role that owns the journey across automation and human handoffs. Using Intercom’s Workflows, we introduced “skills-based routing” so that when a customer asks for a human, the conversation reaches someone with the right expertise quickly. This is now handled by Fin directly using a feature called “Attributes.” The result: a seamless, on-brand experience regardless of channel or escalation path.
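
The routing idea itself is simple to sketch. This is not Intercom's Workflows or Attributes API: the `route` function, the teammate records, and the least-loaded tie-break are all hypothetical, chosen only to show what "right expertise, quickly" can mean in code.

```python
def route(required_skill, teammates):
    """Pick the least-loaded teammate who has the required skill."""
    qualified = [t for t in teammates if required_skill in t["skills"]]
    if not qualified:
        return None  # nobody matches; fall back to a general queue
    return min(qualified, key=lambda t: t["open_conversations"])["name"]


# Hypothetical team roster.
teammates = [
    {"name": "Ana", "skills": {"billing", "api"}, "open_conversations": 4},
    {"name": "Ben", "skills": {"billing"}, "open_conversations": 1},
    {"name": "Caz", "skills": {"api"}, "open_conversations": 0},
]

print(route("billing", teammates))
print(route("api", teammates))
print(route("sso", teammates))
```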

Organization changes that unlocked leverage. As we scaled Fin, we stood up a dedicated AI Support team under a senior CS leader to continuously optimize automation and define our AI adoption strategy across the journey. We restructured human roles into “Technical Support Specialist” and “Technical Support Engineer” to better align with the complexity of incoming work. We also expanded Support Operations to focus on optimization—using AI to uplevel Enablement, Workforce Management, QA, Process Management, and Data Insights. Just as important, we reset expectations about the balance between time spent supporting customers directly versus improving AI. That mindset shift created compounding returns.

Pushing Fin further with new capabilities. As capabilities matured, we were early adopters and saw measurable wins:

Fin Guidance: Multiple Guidance rules provide additional controls and a more personalized, targeted experience for customers.

Fin Tasks and Procedures: Enables Fin to carry out activities such as updating customers on incident status and deep troubleshooting for technical issues.

Insights: AI-driven dashboards provide deep insight into Fin’s performance and surface recommendations for further optimization. Insights also provides a Customer Experience (CX) Score for every customer interaction, enabling more targeted improvement efforts and opening up new ways to close the loop with customers who have had a poor experience.

What we achieved

What started as a focused effort to improve our customer support experience became the strongest proof point for what’s possible when you fully embrace AI. Fin now resolves over 81% of all our customer support volume and has allowed us to absorb a 300%+ increase in demand without proportional headcount growth. Over 90% of our customers now benefit from improved first response performance, 24/7 coverage, and outbound phone support.

What the numbers don’t fully capture is the shift in how our team operates. With volume absorbed by Fin, our CS teammates now deliver consultative support—guiding next best actions, deepening product adoption, and contributing directly to retention and expansion. Customers that receive these engagements adopt Fin at a much deeper level and achieve greater support success. What was once a reactive, volume-driven team is now a function that generates significant revenue.

What’s next

Customer expectations are always rising, so we’re building on our progress by embracing the Fin Flywheel—an actionable framework for ongoing improvement and optimization. This keeps us honest about the discipline required to sustain AI performance at scale.

Train: Teach Fin to resolve even the most complex queries with Procedures, knowledge, and policies.

Test: Run fully simulated customer conversations from start to finish to see exactly how Fin will behave before going live.

Deploy: Set Fin live across every channel – voice, email, chat, and social – for consistent support wherever customers reach out.

Analyze: Use AI-powered Insights to analyze and improve Fin’s performance and deliver better customer experiences.

We are also investing in our support teammates so they can adjust to the new world of AI—taking on more complex work and being valued for the subject matter expertise, consultative engagement, and empathy they bring to the role. That human layer is where differentiation shines.

We will continue to develop and share best practices for deploying an Agent, based on our own experience with Fin and the lessons learned from our most forward-looking customers. These are captured and continually evolving in The Agent Blueprint.

Transformation takes commitment

The most successful teams aren’t bolting AI onto old processes; they’re rebuilding support around it—investing in knowledge and people alongside technology, and treating AI as a continuous discipline rather than a one-time deployment. That’s the real change required. For support teams willing to make it, there’s a rare opportunity to redefine what customer service can deliver—higher CSAT, faster resolution, and durable ROI.

Inspired by this post on The Intercom Blog.

March 13, 2026
From Resolutions to Outcomes: How We Price AI Agents Fairly and Amplify Customer Value

I’ve long believed a simple truth about AI in customer support: if AI is going to earn trust, pricing has to be aligned with value. That principle has guided my product decisions and the way I hold our teams accountable for measurable outcomes, not activity.

When we shared our perspective on pricing AI Agents in 2023, we made that same argument. For Fin at the time, the value was clear. You pay when the AI resolves a customer’s problem. If it doesn’t, you don’t. That’s fair, easy to understand, and grounded in results, not activity. We were the first to introduce this pricing model because we believed that pricing and value should be inherently linked.

That belief hasn’t changed; it has grown stronger over time. What’s changed is what Fin can do. As we expanded capabilities and pushed deeper into complex workflows, it became clear that measuring value solely by end-to-end resolutions no longer captured the full picture of impact.

Resolutions were the right place to start. Historically, we measured value based on whether Fin fully resolved a conversation on its own. These are known as resolutions and they gave support teams a clear way to measure ROI, easily comparing the cost of AI versus human support. They also aligned our incentives with our customers, as our revenue was directly tied to Fin’s performance.

That clarity worked. Today, more than 7,000 teams use Fin. Our average resolution rate across customers has increased every month and now stands at 67%, even as Fin increasingly handles more complex queries. That progress came from building an Agent that could take on harder problems and still deliver.

But as Fin got more powerful, “success” stopped being binary. I saw this first-hand in customer design sessions where policy, risk, and compliance needs rightly demanded human-in-the-loop confirmation. We weren’t failing to deliver value; we were delivering it differently.

Over the last couple of years, we invested heavily to ensure Fin could handle the most complex parts of support. As Fin’s capabilities expanded, customers began pushing the limits of what Fin could do for them by deploying it deeper into their workflows to handle the toughest queries.

In some cases, this required Fin to work in tandem with a human agent because that’s what customer policies and oversight needs dictated. Subscription changes, transaction disputes, billing issues, and other multi-step support scenarios can often require Fin to gather context, read and write to external systems, and execute actions before handing off to a human agent for confirmation.

Fin is still doing what it was configured for – intentionally handing off after doing more of the heavy lifting, saving support teams valuable time and reducing overall time to serve for their customers. But our pricing metric only recognized value when the conversation ended in a full “AI resolution” (i.e., a human was never involved).

That’s why we’re evolving Fin’s pricing metric from resolutions to outcomes. This shift reflects how customers now define value: not just in full automation, but in safe, efficient progress toward the right result across complex, multi-step, and policy-constrained workflows.

An outcome is counted when Fin successfully completes the action it was configured to perform as part of a conversation. Resolutions are still one type of outcome Fin can deliver, where it handles the issue end-to-end. Another type of outcome is a Procedure in which Fin gathers context, takes action, and hands the conversation off when that’s what the customer configured it to do.

Increasing end-to-end AI resolutions is still a core component of scaling Agents, but they are no longer the only measure of Fin’s success and utility, especially as Fin takes on more complex work. Moving to outcomes recognizes that solving a customer problem with full automation isn’t always appropriate. It’s about getting to the right result, safely and efficiently.

As Fin’s capabilities expand, teams should feel empowered to use it in more nuanced, collaborative work. Outcomes support that by allowing customers to design workflows that meet compliance requirements and include a human agent when necessary. From a product management standpoint, this is how we align incentives, keep risk controls intact, and still accelerate time-to-value.

Fin is becoming even more powerful at handling complex, multi-step support queries. With outcomes, we can support that growth without constantly reinventing how value is measured. And this change gives us a strong pricing foundation that can scale as Fin continues to grow and take on more roles beyond service. This aligns with our vision of Fin becoming a “Customer Agent,” capable of handling the entire customer experience.

What this means for pricing is intentionally straightforward. An outcome will be counted when Fin successfully completes an action it was configured to perform, as part of a conversation. That keeps the model predictable for finance leaders while staying transparent for operators and product teams managing AI workflows.
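To make the distinction between outcomes and resolutions concrete, here is a minimal sketch of how the two counts could be derived from conversation records. The `ActionRun` schema, the `kind` labels, and the choice to count one record per configured action are my own assumptions for illustration; the source does not describe how Fin records or bills these events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRun:
    """One configured action attempted inside a conversation (hypothetical schema)."""
    conversation_id: str
    kind: str               # assumed labels: "resolution" or "procedure"
    completed: bool         # did the agent finish what it was configured to do?
    human_involved: bool    # was a human agent part of the conversation?


def is_outcome(run: ActionRun) -> bool:
    # An outcome: the configured action completed, with or without a handoff.
    return run.completed


def is_resolution(run: ActionRun) -> bool:
    # A resolution is the narrower case: handled end-to-end, no human involved.
    return run.completed and run.kind == "resolution" and not run.human_involved


def summarize(runs: list[ActionRun]) -> dict[str, int]:
    # Every resolution is also an outcome, so "resolutions" <= "outcomes".
    return {
        "outcomes": sum(is_outcome(r) for r in runs),
        "resolutions": sum(is_resolution(r) for r in runs),
    }


runs = [
    ActionRun("c1", "resolution", completed=True, human_involved=False),
    ActionRun("c2", "procedure", completed=True, human_involved=True),
    ActionRun("c3", "procedure", completed=False, human_involved=True),
]
print(summarize(runs))
```

Under the old metric only the first record would count. Under outcomes, the completed Procedure with a configured handoff counts as well, while the incomplete one still does not.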

The pricing model stays simple and the definition of value becomes more accurate. In other words, we’re doubling down on fairness, predictability, and competitiveness—core tenets for any consumption SaaS pricing strategy tied to real business impact.

When we first wrote about outcome-based pricing, we said that trust is the currency of AI. That’s still true. Trust is earned when customers see pricing move in lockstep with utility and risk posture, especially as gen AI and agentic AI take on higher-stakes tasks.

Pricing has to feel fair, it has to be predictable, and it has to stay competitive. Evolving from resolutions to outcomes isn’t a departure from that belief. It’s the natural maturation of how we measure value as AI moves from simple Q&A into complex procedures and human-in-the-loop collaboration.

Fin has grown more powerful because customers asked more of it. Outcomes are how we reflect that progress honestly, while staying true to the same principles that guided us from the start. This is product strategy in action: align incentives, measure what matters, and scale what works.

And as Fin continues to get stronger, we’ll keep holding ourselves to the same standard: price based on the value delivered. That’s how we build durable trust, sustainable ROI, and a better customer experience at scale.

Inspired by this post on The Intercom Blog.

March 12, 2026
Inside Zipline’s Wild Pivot: My Take on Hiring Heat-Seekers and Scaling to 5,000 Hospitals

I’m consistently drawn to stories where product strategy and operational grit collide to change real lives. Zipline, the world’s largest commercial autonomous delivery system, is one of those rare cases. Serving 5,000 hospitals across multiple countries and saving an estimated 17,000 lives per year, it embodies the kind of mission-driven execution I try to model in product management. The arc—from a near-dead home robot startup to a scrappy bet on drone blood delivery in Rwanda, to 135 million autonomous miles flown—offers some of the clearest lessons I’ve seen on hiring, leadership, and product-market fit under extreme constraints.

One principle that immediately resonated with me: why Zipline doesn’t hire for experience. The idea behind “Why Zipline hires teenagers over PhDs” isn’t a dismissal of expertise; it’s a commitment to learning velocity, ownership, and unteachable hunger. The best startup employees, as described here, are “heat-seeking missiles for pain”—people who chase the hardest problems, not the shiniest projects. In my org, I look for the same signal: candidates who can move from ambiguity to action, who find the bottleneck without being asked, and who care more about outcomes than optics.

I also appreciated the unapologetic stance that “blind references are a non-negotiable.” In high-stakes builds—especially in regulated or safety-critical categories—the cost of a mis-hire compounds. I routinely validate for two traits during references: intellectual humility and accountability. “Can candidates admit when they screwed up?” is a powerful filter. If someone can’t name a hard mistake and how they specifically changed as a result, they’re unlikely to scale with the organization.

Equally important is clarity about who not to hire. The employees Zipline doesn’t want are those who optimize for status, process theater, or low-friction work. In practice, that means pressure-testing for problem-finding, not just problem-solving. I often design interviews around messy, cross-functional constraints (regulatory, operational, and financial) to see who can integrate tradeoffs, not just ideate features. That’s how we build empowered product teams that ship consequential outcomes, not outputs.

There’s a reference to “Zipline’s secret leadership playbook,” and while the specifics remain private, the spirit is unmistakable: first principles decision making, ruthless focus, and a culture that rewards radical responsibility. Translating that to my product organization, I emphasize five behaviors: orient to the mission under uncertainty, run fast but close the loop with data, communicate constraints early and often, own the long tail of consequences (especially in safety and reliability), and scale judgment by teaching the why, not just the what. That blend of clarity and autonomy is the backbone of product management leadership at any growth stage.

On the other side of the culture coin is “Why you should always fire quickly” and “The brutal firing advice that shaped Keller’s leadership.” I’ve learned (sometimes the hard way) that slow decisions erode trust and team velocity. Moving quickly doesn’t mean being harsh; it means being fair, explicit, and humane—tight feedback loops, role clarity, and decisive action when the gap persists. If your bar is clear and your coaching is consistent, acting fast protects both the mission and the team’s energy.

Strategically, the origin story reads like a masterclass in choosing the right problem. The team moved “from toy robots to drone delivery: Zipline’s pivot,” then partnered deeply with Rwanda, where “How Rwanda’s health minister changed everything” is a pivotal moment. It wasn’t a linear climb. “How Zipline almost died – twice” and “Why Zipline’s launch was a ‘complete disaster’” underline a tough truth: breakthrough products rarely arrive fully formed. What matters is the operating cadence that turns early chaos into repeatable reliability, especially when the stakes are measured in minutes and lives.

Scaling from one hospital to 5,000 required more than product brilliance; it demanded systems thinking across logistics, compliance, safety, and community trust. That’s stakeholder management at its highest level. The product lessons are durable: anchor on outcomes, not artifacts; build reliability as a feature; and practice founder-led GTM where your credibility is on the line with customers and regulators. This is where first principles decision making beats benchmarking, particularly in novel categories where there are no playbooks to copy.

There’s also a hard-nosed operational takeaway in “The 10x hardware cost rule every founder should know.” My read: assume total cost of ownership will balloon once you account for manufacturing variability, support, redundancy, maintenance, and compliance. In product strategy, I treat those multipliers as design inputs, not afterthoughts. If the unit economics can’t survive these realities, the idea isn’t ready—no matter how elegant the prototype looks in a lab.

Across all of this, a few product management patterns stand out for me: build teams around outcomes rather than output-based OKRs; hire for slope, not just intercept; make continuous discovery routine with real users (in this case, clinicians and health systems); and treat operational excellence as a product surface. When a mission is this consequential, culture becomes a safety system, and every leadership decision compounds into either speed with quality or speed with regret.

For leaders building in complex domains, this journey is a blueprint: pick problems that matter, hire “heat-seeking missiles for pain,” keep blind references non-negotiable, lead with first principles, and scale with responsibility. Do that well and even a “complete disaster” launch can become the inflection point of a category-defining company that flies 135 million autonomous miles and saves 17,000 lives per year.

March 12, 2026
Ship Smarter with Amplitude + Lovable: See Behavior, Fix Friction, Iterate Faster

I build products with a simple mantra: launch, learn, repeat. Shipping fast is necessary, but shipping smart is what compounds. To do that, I keep analytics close to the work—inside the builder—so every decision is tied to real user behavior, not assumptions.

Connect Amplitude MCP to Lovable to understand user behavior, spot friction, and ship better updates without leaving your builder.

In practice, this integration lets me bring Amplitude analytics and behavioral analytics directly into the creative flow. I can explore funnels, cohorts, and drop‑offs the moment I’m crafting an experience, then translate those insights into concrete changes without context switching. The result is tighter feedback loops and more confident iteration.

My typical loop looks like this: identify a friction point from funnel analysis, design two or three variants in the builder, and run A/B testing to validate the improvement. I focus on user activation and retention analysis as leading signals, because sustained engagement is the clearest indicator that we’ve solved a real problem. When the data confirms it, we promote the winning experience and move to the next opportunity.
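The loop above can be sketched with plain arithmetic. This is a minimal illustration, not Amplitude’s or Lovable’s API: the step names and counts are made up, and the two-proportion z-test is one common way to compare variants that I am assuming here, since the post does not say which statistical method is used.

```python
import math


def biggest_drop_off(funnel):
    """funnel: ordered list of (step_name, user_count) pairs.

    Returns (from_step, to_step, drop_rate) for the steepest step-to-step loss.
    """
    worst = None
    for (a, n_a), (b, n_b) in zip(funnel, funnel[1:]):
        drop = 1 - n_b / n_a
        if worst is None or drop > worst[2]:
            worst = (a, b, drop)
    return worst


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rate between variants."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical all-or-nothing rates: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical funnel counts for illustration only.
funnel = [("visit", 1000), ("signup", 400), ("activate", 300)]
step_from, step_to, drop = biggest_drop_off(funnel)
print(f"Largest drop-off: {step_from} -> {step_to} ({drop:.0%})")

# Hypothetical A/B result: control converts 100/1000, variant 150/1000.
p = two_proportion_p_value(100, 1000, 150, 1000)
print(f"p-value: {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise. I would still check activation and retention over the following weeks before promoting the winning variant.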

Keeping the work inside the builder also supports continuous discovery. I can pair quantitative insights with qualitative observations, refine journey mapping, and document learnings while the context is fresh. That makes prioritization and product discovery more reliable, and it turns each iteration into a teachable moment for the team.

Strategically, this builder‑first approach enables product-led growth. With fewer handoffs and a unified analytics platform, we compress time from insight to impact. It helps me defend roadmap decisions with evidence, communicate trade‑offs clearly, and keep the team focused on outcomes that matter to customers and the business.

If your goal is to iterate with speed and precision, bring analytics to where you build. Keep the loop tight, measure what moves the needle, and let the data guide your next best update.

Inspired by this post on Amplitude – Best Practices.

March 11, 2026