Category: Product Management

Inside Partner Product Marketing: Lessons that Elevate Go-to-Market and Product-Led Growth

I’ve learned that the most effective partner product marketing is less about decks and more about decisions. When I collaborate with partner product marketing managers, we translate complex capabilities from a unified analytics platform into crisp, outcome-led narratives that customers can act on. This is where product positioning and go-to-market strategy intersect to create momentum for product-led growth.

In my experience, the strongest partner product marketing managers operate like solution orchestrators. They align value propositions across partners, clarify the problem-solution fit, and articulate competitive differentiation without drowning teams in feature lists. By anchoring messaging in clear customer pains and measurable gains, they help everyone, from solutions engineering to sales, tell the same story with confidence.

My playbook starts with outcomes. We define the “why” in terms customers care about, then quantify it with retention analysis, user activation, and time-to-value. That evidence shapes positioning, enables tighter points of parity and differentiation, and ensures our value proposition resonates in the market. The result is faster alignment and fewer cycles spent debating messaging without data.

Cross-functional execution makes or breaks the strategy. I partner closely with solutions engineering to validate solution patterns, and with sales to balance sales-led motions alongside product-led growth. Strong stakeholder management keeps discovery loops tight: we capture objections early, refine narratives quickly, and reduce friction across the funnel.

On the tactics side, I rely on A/B testing to de-risk bold messaging changes and to optimize in-app guides and product tours. We set a minimum detectable effect up front, instrument journeys with Amplitude analytics, and iterate quickly. This gives the team statistical confidence while keeping speed high, especially when refining narratives for complex partner solutions.

Ultimately, great partner product marketing illuminates the shortest path from capability to customer value. When we pair disciplined positioning with data-driven learning, we strengthen our go-to-market strategy and build durable competitive advantage. That’s how we turn strong solutions into market-leading stories that win and keep customers.

Inspired by this post on Amplitude – Best Practices.

February 21, 2026

How to Build AI-Ready Product Analytics and Experiments

You are about to approve an AI feature. The demo works, the team has an adoption dashboard, and every response can collect a thumbs-up or thumbs-down. Yet nobody can answer the questions that will matter after launch: Did the feature help customers finish the job? Was the improvement caused by the AI? Did quality hold across important customer segments? Was the gain worth the latency, cost, and risk?

Do not solve that problem by adding more charts. Build an evidence chain from eligibility and exposure through model behavior and human action to a completed customer outcome. An AI-ready measurement system makes model telemetry and product behavior part of the same decision. That is what lets you improve prompts, retrieval, models, and product design without confusing technical progress with customer value.

Key takeaways

Define the product decision, eligible population, primary outcome, guardrails, and minimum detectable effect before choosing events or building dashboards.
Instrument a traceable sequence from eligibility to exposure, request, response, user action, task completion, and repeat value. Shared identifiers matter more than a large event catalog.
Keep model quality, product behavior, reliability, cost, risk, and business outcomes as separate measurement layers, but make them queryable through the same identities and version fields.
Move through offline evaluation, production shadowing, and a controlled rollout. Each stage answers a different question and needs its own exit criteria.
End every experiment with an explicit decision: ship, iterate, restrict, or stop. A result that produces another indefinite request to collect data is not a decision system.

Start with an evidence contract, not an event list

An instrumentation plan often begins too late in the reasoning process. Someone opens a spreadsheet and lists clicks, generations, feedback actions, and errors. The events may all be valid, but they do not guarantee that the resulting data can answer a product question.

Start with a one-page evidence contract. It should force the product, engineering, data, and AI owners to agree on the decision they are trying to make. Complete these fields before implementation:

Decision: State what will change if the evidence is positive, negative, or inconclusive. For example, the decision might be whether to expand a drafting assistant from one workflow to every workflow.
User problem: Name the job the customer is trying to complete. Avoid substituting the proposed AI capability for the problem.
Eligible population: Define who could reasonably benefit, including account type, workflow state, permission, and any relevant exclusions.
Intervention: Specify what is different from the current experience. Include the product surface and the model, prompt, retrieval, and guardrail configuration that define the treatment.
Primary outcome: Choose one customer behavior that represents successful completion of the job. Give it an exact numerator, denominator, and observation window.
Diagnostics: Identify the signals that will explain why the outcome moved, such as output acceptance, editing, retries, fallbacks, and time to completion.
Guardrails: Define the reliability, safety, customer-experience, and cost conditions that the treatment cannot violate.
Decision rule: Predefine the minimum effect worth detecting, how uncertainty will be handled, which segments will be inspected, and what would cause an early rollback.

A useful hypothesis has a visible causal claim: For an eligible cohort, a defined AI experience will improve a named task outcome over a stated observation window, while specific guardrails remain acceptable. Consider a support workflow. “Customers will like AI drafts” is not testable enough. “Giving eligible support agents an AI-generated draft will improve successful ticket completion without degrading customer satisfaction, safety, latency, or cost per successful resolution” tells you what to instrument and what could veto a rollout.
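The evidence contract can also live as a small structured record so that an incomplete contract is caught before implementation starts. The following is a minimal sketch, not a prescribed format: the class name, the `missing_fields` helper, and the sample values are illustrative assumptions, and only the eight field names come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidenceContract:
    """One-page evidence contract; fields mirror the list in the text."""
    decision: str
    user_problem: str
    eligible_population: str
    intervention: str
    primary_outcome: str  # should state numerator, denominator, and window
    diagnostics: tuple
    guardrails: tuple
    decision_rule: str

    def missing_fields(self):
        """Return names of fields left empty (blank text or empty tuple)."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                if not value.strip():
                    missing.append(f.name)
            elif not value:
                missing.append(f.name)
        return missing


# Hypothetical contract for the support drafting example.
contract = EvidenceContract(
    decision="Expand the drafting assistant from one workflow to every workflow?",
    user_problem="Support agents need to resolve tickets correctly and quickly.",
    eligible_population="Agents with reply permission on open tickets.",
    intervention="AI-generated draft shown in the reply editor.",
    primary_outcome="",  # deliberately blank to show the check
    diagnostics=("draft_accepted", "draft_edited", "retries"),
    guardrails=("customer_satisfaction", "latency", "cost_per_resolution"),
    decision_rule="Ship only if the pre-agreed minimum effect is met.",
)

print(contract.missing_fields())  # ['primary_outcome']
```

A check like this only confirms that every field has been filled in. Whether the primary outcome is the right one still needs agreement among the product, engineering, data, and AI owners.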

Separate the six measurement layers

One composite AI score is tempting and usually unhelpful. A single number hides trade-offs and makes failures difficult to diagnose. Keep the layers distinct:

Measurement layer	Question it answers	Useful measures	Decision it informs
Eligibility and adoption	Did the intended customer have a real opportunity to use the feature?	Eligible users or accounts, exposures, first use, repeat use	Reach, discoverability, onboarding, and denominator quality
Task outcome	Did the customer complete the job better?	Task success, time to value, completion without rework, durable repeat behavior	Whether the feature creates customer value
Model quality	Was the output usable for this use case?	Rubric score, groundedness where relevant, acceptance, edits, rejection, regeneration	Prompt, retrieval, data, and model improvements
Reliability and efficiency	Can the experience operate consistently?	Latency, error rate, fallback rate, availability, cost per successful outcome	Architecture, model routing, and operational readiness
Risk and trust	Did the system cross a boundary that should block scale?	Safety violations, moderation triggers, unsupported responses, user overrides	Guardrails, restrictions, and rollback
Business outcome	Does the customer value become durable business value?	Activation, retention, support deflection, account expansion, or attributable revenue	Investment level and product strategy

Choose one primary outcome for the experiment. The other layers are not decorative. Product and model diagnostics explain the result, while guardrails can veto it. A faster workflow that creates unacceptable safety failures is not a win. A highly rated output that does not improve task completion is not yet a product outcome.

Instrument one traceable chain, not a bag of events

The core unit of AI analytics is a traceable attempt to complete a job. You need to follow that attempt across the product interface, AI runtime, and downstream outcome. If each system produces isolated records, the dashboard may show healthy model performance and healthy adoption without revealing whether the same customers received both.

A practical event sequence looks like this:

ai_feature_eligible: The user or account entered a state in which the feature could provide value. This creates the denominator for reach and experiment eligibility.
ai_feature_exposed: The experience was actually rendered or otherwise made available. Keep assignment separate from display so delivery failures remain visible.
ai_request_submitted: The customer initiated an AI-assisted action. Capture the intended use case, not the full sensitive input by default.
ai_response_generated: The AI system produced a response. Record the configuration, latency, error state, fallback behavior, and attributable cost.
ai_response_presented: The output reached the customer. A generated response that never rendered should not count as a usable response.
ai_output_action_taken: The customer accepted, copied, edited, regenerated, rejected, or undid the output. Preserve the difference between no action and an explicit rejection.
ai_task_outcome_recorded: The workflow reached its product-level success or failure state. Link this outcome to the request even if it occurs later in another system.
ai_repeat_value_observed: The user or account returned to the workflow and obtained value again. This distinguishes novelty from an emerging habit.

Those names are examples, not a mandatory standard. Your taxonomy should match the language of your product. The important distinction is semantic: eligibility is not exposure, exposure is not use, generation is not delivery, delivery is not acceptance, and acceptance is not task success.
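To make the semantic distinction concrete, here is a minimal sketch that counts how many requests reached each stage and every stage before it. It assumes each event is a dictionary carrying a shared `request_id`, which is a hypothetical field name, and it covers only the request-level part of the example chain.

```python
from collections import defaultdict

# Stage order follows the example event names in the text.
STAGES = [
    "ai_request_submitted",
    "ai_response_generated",
    "ai_response_presented",
    "ai_output_action_taken",
    "ai_task_outcome_recorded",
]


def funnel_counts(events):
    """Count requests that reached each stage and all stages before it."""
    seen = defaultdict(set)
    for event in events:
        seen[event["request_id"]].add(event["name"])
    counts = {stage: 0 for stage in STAGES}
    for stages_seen in seen.values():
        for stage in STAGES:
            if stage not in stages_seen:
                break  # the chain is broken here; later stages do not count
            counts[stage] += 1
    return counts


# Three illustrative requests: r1 completes the chain, r2 is generated but
# never presented, r3 is presented but the customer takes no action.
events = (
    [{"request_id": "r1", "name": stage} for stage in STAGES]
    + [{"request_id": "r2", "name": stage} for stage in STAGES[:2]]
    + [{"request_id": "r3", "name": stage} for stage in STAGES[:3]]
)

print(funnel_counts(events))
# submitted 3, generated 3, presented 2, action taken 1, outcome recorded 1
```

The gap between generated (3) and presented (2) is exactly the kind of delivery failure that disappears when generation is treated as delivery.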

Give every layer the same join keys

The event chain works only when the records can be joined without relying on an email address, timestamp guess, or mutable account field. At minimum, decide how you will represent:

Identity: Stable user and account identifiers, plus an explicit anonymous-to-authenticated identity rule where needed.
Workflow: A workflow or task identifier that survives navigation, retries, and asynchronous processing.
AI execution: Request and response identifiers that distinguish one customer request from multiple internal model or retrieval calls.
Experiment state: Experiment identifier, assigned variant, assignment timestamp, and the reason a user or account was eligible.
Configuration: Model, prompt template, retrieval index, tool, policy, and guardrail versions. A treatment is not stable if these change invisibly during the test.
Product context: Use case, surface, lifecycle stage, account segment, permission state, and other dimensions selected in the evidence contract.
Operational result: Latency, error class, fallback reason, moderation result, and cost fields defined consistently across providers.
Governance: Schema version, data classification, consent or policy state where applicable, and retention treatment.

Capture context at the time of the event. If an account changes plan or segment later, a query should not silently rewrite the conditions under which the experiment ran. Preserve both the stable identity and the relevant historical snapshot.

Apply privacy-by-design to inputs, outputs, and feedback. Raw prompts and generated text can contain customer data that does not belong in a broadly accessible analytics platform. Prefer structured categories, redacted attributes, content-type labels, and references to a separately governed evaluation store. Store the minimum information needed for the decision, not every token merely because it is available.
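One way to apply that principle is to convert the raw request into an analytics-safe event before it leaves the AI runtime. The sketch below is an assumption-heavy illustration: every field name, the length threshold, and the idea of reusing the request identifier as the reference into the governed store are hypothetical choices, not a required design.

```python
def to_analytics_event(raw_request, long_prompt_threshold=200):
    """Keep structured, low-sensitivity fields and drop the raw text."""
    prompt = raw_request["prompt_text"]
    return {
        "name": "ai_request_submitted",
        "request_id": raw_request["request_id"],
        "use_case": raw_request["use_case"],
        "content_type": raw_request["content_type"],
        # A coarse bucket is often enough for analysis; the threshold is arbitrary.
        "prompt_length_bucket": (
            "long" if len(prompt) >= long_prompt_threshold else "short"
        ),
        # Pointer into a separately governed evaluation store (hypothetical).
        "evaluation_store_ref": raw_request["request_id"],
    }


raw = {
    "request_id": "r-123",
    "use_case": "ticket_reply_draft",
    "content_type": "support_ticket",
    "prompt_text": "Customer Jane Doe reports a billing error on invoice 4471.",
}

safe_event = to_analytics_event(raw)
print(sorted(safe_event))
```

The analytics event can still be joined to the governed store through the reference when a reviewer with the right access needs the full text.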

Catch instrumentation defects before launch

AI workflows create several failure modes that ordinary click tracking can miss. Add these checks to the release path:

Count one logical customer request separately from provider retries, tool calls, retrieval queries, and fallback calls. Otherwise usage and cost denominators will disagree.
Use idempotency or deduplication rules for events emitted by asynchronous jobs. A replayed queue message should not create a second successful task.
Validate required properties and accepted values automatically. Schema checks and feature flags belong in the delivery workflow, not in a cleanup project after launch.
Version an event when its meaning changes. Adding an optional property may be compatible; changing what counts as task success is a new semantic contract.
Test identity resolution across the full journey, including anonymous use, authentication, account switching, shared workspaces, and delayed downstream outcomes.
Reconcile generated, presented, and acted-on counts. A large unexplained gap often reveals a delivery, client, or instrumentation failure before it becomes a misleading product conclusion.
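Two of these checks, property validation and deduplication, are simple enough to sketch directly. The code below assumes each event carries a unique `event_id` that is reused when a queue message is replayed. The required keys and the allowed action values are illustrative assumptions based on the examples in this article, not a fixed schema.

```python
REQUIRED_KEYS = {"event_id", "name", "request_id", "account_id", "schema_version"}
ALLOWED_ACTIONS = {"accepted", "copied", "edited", "regenerated", "rejected", "undone"}


def validate(event):
    """Return a list of problems; an empty list means the event passes."""
    problems = [f"missing:{key}" for key in sorted(REQUIRED_KEYS - set(event))]
    if (
        event.get("name") == "ai_output_action_taken"
        and event.get("action") not in ALLOWED_ACTIONS
    ):
        problems.append("invalid:action")
    return problems


def deduplicate(events):
    """Keep the first event per event_id so a replayed message counts once."""
    seen, unique = set(), []
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            unique.append(event)
    return unique


outcome = {
    "event_id": "e1",
    "name": "ai_task_outcome_recorded",
    "request_id": "r1",
    "account_id": "a1",
    "schema_version": 1,
}
bad_action = {
    "event_id": "e2",
    "name": "ai_output_action_taken",
    "request_id": "r1",
    "account_id": "a1",
    "action": "liked",  # not an accepted value
}

print(validate(outcome))     # []
print(validate(bad_action))  # ['missing:schema_version', 'invalid:action']
print(len(deduplicate([outcome, outcome])))  # 1
```

Running checks like these in the release pipeline, rather than in a notebook after launch, is what keeps them from becoming a cleanup project.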

Turn model quality into a product scorecard

An offline model score and an online product metric answer different questions. The offline evaluation asks whether a configuration can produce an acceptable result on a defined set of cases. The online measurement asks whether the experience changes behavior and outcomes for real customers. You need both, and you should not let either impersonate the other.

Use denominators that expose failure

Every rate should state what had the opportunity to enter its numerator. These definitions are more useful than labels such as quality score or engagement:

Task success rate = successful target tasks divided by eligible tasks that reached the defined opportunity.
Delivered response rate = responses presented to the customer divided by valid submitted requests.
Helpful output rate = reviewed outputs that satisfy the use-case rubric divided by outputs with a completed review.
Fallback rate = requests that used the defined fallback path divided by eligible AI requests.
Safety intervention rate = requests that triggered a defined safety intervention divided by requests evaluated by that policy.
Cost per successful outcome = attributable AI runtime cost divided by successful target tasks. Use a consistent cost boundary so model, retrieval, and fallback costs are not included selectively.
Repeat value rate = users or accounts that complete the target task again within the chosen window divided by those that first completed it.

Display the numerator, denominator, missing-outcome count, and metric definition beside the rate. A percentage can look healthy because delivery failures disappeared from its denominator or because only enthusiastic users submitted feedback.
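As an illustration of carrying the numerator, denominator, and missing-outcome count with the rate, here is a minimal sketch for task success rate. The `Rate` class and the task fields (`eligible`, `outcome`) are hypothetical names. The key design choice is that tasks with no recorded outcome stay in the denominator and are reported separately instead of being dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rate:
    name: str
    numerator: int
    denominator: int
    missing_outcomes: int

    @property
    def value(self) -> Optional[float]:
        """Return None instead of a misleading 0 when nothing was eligible."""
        return self.numerator / self.denominator if self.denominator else None


def task_success_rate(tasks):
    """tasks: dicts with 'eligible' (bool) and 'outcome'
    ('success', 'failure', or None when no outcome was recorded)."""
    eligible = [t for t in tasks if t["eligible"]]
    successes = sum(1 for t in eligible if t["outcome"] == "success")
    missing = sum(1 for t in eligible if t["outcome"] is None)
    return Rate("task_success_rate", successes, len(eligible), missing)


tasks = [
    {"eligible": True, "outcome": "success"},
    {"eligible": True, "outcome": "success"},
    {"eligible": True, "outcome": "failure"},
    {"eligible": True, "outcome": None},        # outcome never recorded
    {"eligible": False, "outcome": "success"},  # outside the eligible population
]

rate = task_success_rate(tasks)
print(rate.value, rate.numerator, rate.denominator, rate.missing_outcomes)
# 0.5 2 4 1
```

Had the task with the missing outcome been dropped, the same data would have reported two successes out of three, which is the kind of quietly improved percentage this section warns about.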

Human signals such as thumbs, edits, acceptance, deflection, and customer satisfaction are valuable diagnostics, but each has an interpretation problem. Thumbs reflect the minority who choose to respond. Acceptance can reward a convenient draft that still needs correction later. A large edit may mean the output was poor, or that it provided a useful starting structure. Regeneration can indicate failure, exploration, or a request for variety. Pair these signals with task completion, time to value, downstream correction, and representative human review.

Build the offline evaluation around the product decision

A representative evaluation set is a product artifact, not merely a model-engineering artifact. Construct it deliberately:

Define the unit being judged. It may be an answer, classification, draft, action plan, tool decision, or completed multi-step workflow.
Write a rubric that separates must-pass requirements from preferences. Include factual or grounded behavior, task completion, policy compliance, and format only where they matter to the user job.
Sample the cases the target population actually produces. Preserve important slices such as use case, complexity, language, account type, or risk level when those dimensions affect the decision.
Define how ambiguous cases, missing context, and evaluator disagreement will be handled. Do not force false certainty into a label simply to complete a dataset.
Record the exact model, prompt, retrieval, tool, and guardrail configuration for every run. A score without a reproducible configuration cannot guide a rollout.
Keep a stable benchmark for comparison while adding a governed set of newly discovered failure cases. If every prompt change also changes the test, improvement becomes impossible to interpret.

Offline success is an entry condition for production learning, not evidence of customer impact. It can eliminate weak configurations cheaply and expose slice-level failures before customers encounter them. It cannot tell you whether people discover the feature, trust it, change their behavior, or retain because of it.
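The separation of must-pass requirements from preferences, and the reporting by slice, can be sketched in a few lines. The check names, the `complexity` slice key, and the sample cases below are illustrative assumptions. A real rubric would come from the user job, and the check results would come from reviewers or an evaluation pipeline.

```python
from collections import defaultdict

MUST_PASS = ("grounded", "policy_compliant")  # any failure fails the case
PREFERENCES = ("concise", "well_formatted")   # scored, but never blocking


def judge(case):
    """Return (passed, preference_score) for one evaluated case."""
    passed = all(case["checks"][name] for name in MUST_PASS)
    met = sum(1 for name in PREFERENCES if case["checks"][name])
    return passed, met / len(PREFERENCES)


def pass_rate_by_slice(cases, slice_key):
    """Must-pass rate for each value of the chosen slice dimension."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [passed, total]
    for case in cases:
        passed, _ = judge(case)
        totals[case[slice_key]][0] += int(passed)
        totals[case[slice_key]][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


cases = [
    {"complexity": "simple", "checks": {"grounded": True, "policy_compliant": True,
                                        "concise": True, "well_formatted": False}},
    {"complexity": "complex", "checks": {"grounded": False, "policy_compliant": True,
                                         "concise": True, "well_formatted": True}},
    {"complexity": "complex", "checks": {"grounded": True, "policy_compliant": True,
                                         "concise": True, "well_formatted": True}},
]

print(pass_rate_by_slice(cases, "complexity"))
# {'simple': 1.0, 'complex': 0.5}
```

The overall pass rate here is two out of three, which hides the fact that half of the complex cases fail a must-pass requirement. Slice-level reporting is what surfaces it.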

Run experiments as a sequence of risk-reducing gates

Do not ask one A/B test to discover whether the model works, whether the infrastructure survives production, whether the interface is understandable, and whether the business case holds. Move through offline evaluation, production shadowing, and controlled rollout before scaling the release. Each gate removes a different uncertainty.

Offline evaluation: Compare the candidate configuration with the current baseline on the representative evaluation set. Review overall quality, must-pass requirements, important slices, safety behavior, and cost. Exit only when the candidate is good enough to justify production exposure.
Shadow mode: Run the candidate against production traffic without showing its output to customers or changing the workflow. Use this stage to verify input distribution, integration behavior, latency, failures, fallbacks, policy coverage, and attributable cost. Shadow mode cannot demonstrate customer lift because the customer never experiences the treatment.
Controlled rollout: Deliver the experience through a feature flag to a randomized treatment group while preserving a valid control. Measure the primary outcome and guardrails using the assignment unit specified in the evidence contract.
Scaled release: Expand only after the decision rule is met. Continue monitoring for distribution shifts, configuration changes, operational regressions, cost drift, and safety failures that a time-bounded experiment may not capture.

Feature flags are more than a release convenience. They preserve a control, enable a rapid rollback, restrict exposure when a feature is safe only for a defined cohort, and separate model deployment from product exposure. Name an owner for the flag, the rollout decision, and the rollback action before traffic begins.
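A flag that preserves a valid control needs stable assignment: the same unit must land in the same variant on every evaluation. One common approach, sketched below under assumptions, is to hash the experiment identifier together with the assignment unit. The function name and identifiers are hypothetical, and the sketch uses the account as the unit. A feature-flag service may implement assignment differently.

```python
import hashlib


def assign_variant(experiment_id, account_id, treatment_share=0.5):
    """Deterministically map an account to 'treatment' or 'control'."""
    key = f"{experiment_id}:{account_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The first 8 hex characters give an integer in [0, 2**32); scale to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


first = assign_variant("draft-assistant-v1", "account-42")
second = assign_variant("draft-assistant-v1", "account-42")
print(first == second)  # True: assignment is stable across calls
```

Including the experiment identifier in the hash keeps assignments in one experiment from lining up with assignments in another. Setting `treatment_share` to 0 acts as a rollback, because no account can fall below that threshold.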

Pre-register the experiment brief

Pre-registered hypotheses, guardrails, and minimum detectable effect prevent a familiar failure: the team sees a noisy result and rewrites the question until something appears positive. Your brief should contain:

The product decision and the hypothesis being tested.
The eligible population and every exclusion that will be applied.
The baseline experience and the complete treatment configuration.
The randomization unit, assignment method, and exposure definition.
The primary metric, including numerator, denominator, and observation window.
The minimum detectable effect: the smallest improvement that would be material enough to justify the cost or complexity of rollout.
Guardrail definitions, acceptable boundaries, and rollback conditions.
The diagnostic metrics that may explain the result but will not be promoted to primary after the test begins.
The segments that will be examined and why they matter to the product decision.
The analysis method, expected decision point, and owner of the final call.

The minimum detectable effect is a product choice before it is a statistical input. If a smaller gain would not change the roadmap, do not design the experiment around detecting it. Traffic, baseline behavior, outcome variability, assignment unit, observation window, and the selected effect all shape whether the experiment can be conclusive. When traffic is insufficient, the honest choices are to run longer, test a larger change, use a nearer but defensible outcome, combine learning with other evidence, or decline to run an underpowered experiment. Lowering the standard after seeing the result does not create evidence.
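To see how the chosen effect drives feasibility, here is a rough sample-size sketch for a primary metric that is a proportion. It uses the standard normal-approximation formula for comparing two proportions and assumes independent randomization units, a two-sided test, and equal group sizes. The baseline rate and effect sizes are made-up inputs, and a real experiment plan should use whatever analysis method the brief pre-registers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_absolute, alpha=0.05, power=0.8):
    """Approximate units needed per arm to detect an absolute lift."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_absolute
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_absolute ** 2)


# Hypothetical inputs: 40% baseline task success.
n_five_points = sample_size_per_arm(0.40, 0.05)  # roughly 1,500 per arm
n_two_points = sample_size_per_arm(0.40, 0.02)   # several times larger
print(n_five_points, n_two_points)
```

The units counted here must be the randomization units. If assignment is by account, the result is a number of accounts, not a number of generations.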

Avoid the analysis traps specific to AI products

Do not treat every generation as an independent experimental subject. A single user or account may generate repeatedly, and those observations are correlated because they come from the same user behavior and the same assignment.
Randomize at the account level when treatment can spill across a shared workspace, team process, or common customer record. User-level randomization in that setting can contaminate the control.
Do not analyze only people who clicked the AI control. Treatment may change whether they click, so filtering on that action can remove part of the treatment effect. Start from the assigned eligible population and use triggered views as diagnostics.
Do not change the model, prompt, retrieval source, or guardrail silently inside a treatment. If an urgent fix is necessary, record the version boundary and decide whether the test remains interpretable.
Do not optimize an intermediate signal in isolation. More generations can mean adoption or repeated failure; more acceptance can coexist with lower downstream accuracy; faster responses can be worse responses.
Do not repeatedly inspect the result, stop when it looks favorable, and then present that stopping point as planned. Follow the pre-registered analysis or use a statistical design that explicitly supports sequential decisions.
Do not search every segment for a winner after an inconclusive overall result. Treat an unexpected segment pattern as a hypothesis for validation, not automatic authorization to scale.
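Two of these traps, treating generations as independent subjects and filtering on people who clicked, can be avoided by aggregating to the assignment unit and starting from everyone assigned. The sketch below assumes account-level assignment. The function name and input shapes are hypothetical.

```python
from collections import defaultdict


def account_level_success(assignments, task_outcomes):
    """
    assignments: {account_id: variant} for every assigned eligible account.
    task_outcomes: iterable of (account_id, succeeded) pairs; an account
        may appear many times or not at all.
    Returns {variant: (accounts_with_a_success, accounts_assigned)}.
    """
    succeeded_accounts = {acc for acc, ok in task_outcomes if ok}
    totals = defaultdict(lambda: [0, 0])
    for account_id, variant in assignments.items():
        totals[variant][0] += int(account_id in succeeded_accounts)
        totals[variant][1] += 1
    return {variant: tuple(counts) for variant, counts in totals.items()}


assignments = {"a1": "treatment", "a2": "treatment", "a3": "control", "a4": "control"}
task_outcomes = [
    ("a1", True), ("a1", True), ("a1", True),  # one heavy user, counted once
    ("a3", True),
    ("a4", False),
    # a2 never used the feature but remains in the treatment denominator
]

print(account_level_success(assignments, task_outcomes))
# {'treatment': (1, 2), 'control': (1, 2)}
```

Counting generations instead would have credited the treatment with three successes from a single account and dropped the account that never engaged.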

Create an operating loop that can say stop

A technically correct dashboard does not create accountability. The system becomes useful when the team knows who reviews each signal, what action follows, and which metric has authority when measures disagree.

Use one semantic layer and several decision views

You do not need a single dashboard that serves every audience. You need shared definitions and trustworthy product, marketing, and customer signals underneath purpose-built views:

Leadership view: Primary customer outcome, durable business outcome, cost per successful outcome, major guardrails, rollout status, and decision owner.
Product view: Eligibility-to-outcome funnel, activation, repeat use, retention by cohort, time to value, and the diagnostics behind the current experiment.
AI quality view: Offline rubric results, online review results, feedback behavior, fallbacks, and performance by use case, model version, and important slice.
Operations and trust view: Latency, errors, availability, cost, moderation triggers, safety interventions, and rollback state.

Every view should resolve to the same metric registry. The registry needs a definition, owner, source events, inclusion and exclusion rules, observation window, grain, version, and change history. If task success means one thing in the product review and another in the model review, a common dashboard tool will not create a common truth.

Put measurement into the delivery workflow

During discovery, write the evidence contract alongside the problem statement. The primary outcome should be agreed before the implementation solution hardens.
During implementation, review event semantics, identity, privacy, configuration versioning, and metric formulas. Run automated schema checks with the same seriousness as other release validations.
Before rollout, verify the offline gate, shadow-mode results, experiment assignment, dashboards, alerts, flag owner, and rollback path.
During the experiment, review data quality and guardrails on the agreed cadence. Distinguish operational monitoring from an unplanned search for a favorable outcome.
At the decision point, record the result, uncertainty, segment findings, guardrail status, configuration, and action. Make the record reusable by the next prompt, retrieval, model, or experience iteration.
After the decision, remove abandoned dashboards and events, close obsolete flags, and update the evaluation set with newly validated failure modes. Measurement debt compounds when every experiment leaves permanent debris.

The decision itself should fall into one of four states:

Ship: The primary outcome meets the decision rule, the evidence is interpretable, and guardrails and economics remain acceptable.
Iterate: The result is not ready to scale, but diagnostics identify a plausible and testable failure in quality, retrieval, interaction design, reliability, or targeting.
Restrict: The value is credible only for a defined cohort or use case, and that boundary can be enforced and validated without creating unacceptable risk.
Stop: The effect is below what would justify the investment, a critical guardrail fails, the economics do not work, or the experiment cannot be made interpretable without redesign.

Cost, safety, privacy, and customer trust are not secondary metrics that a conversion lift can overrule. If one is a hard boundary, say so in the evidence contract and give it the power to stop the rollout.
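The four states and the hard-boundary rule can be written down as an explicit function so that the order of precedence is visible. This is a sketch under assumptions: the boolean inputs and the precedence among ship, restrict, and iterate are illustrative choices, and each input stands for a judgment the team makes against its pre-registered brief.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    ITERATE = "iterate"
    RESTRICT = "restrict"
    STOP = "stop"


def decide(*, interpretable, hard_guardrail_failed, meets_decision_rule,
           enforceable_cohort_meets_rule, testable_failure_identified):
    """Map experiment findings to one of the four decision states."""
    # A hard boundary or an uninterpretable test overrides any lift.
    if hard_guardrail_failed or not interpretable:
        return Decision.STOP
    if meets_decision_rule:
        return Decision.SHIP
    if enforceable_cohort_meets_rule:
        return Decision.RESTRICT
    if testable_failure_identified:
        return Decision.ITERATE
    return Decision.STOP


outcome = decide(
    interpretable=True,
    hard_guardrail_failed=True,   # for example, a safety boundary was crossed
    meets_decision_rule=True,     # the lift alone would have justified shipping
    enforceable_cohort_meets_rule=False,
    testable_failure_identified=False,
)
print(outcome)  # Decision.STOP
```

The function never returns "collect more data". An inconclusive result with no testable failure resolves to stop, which matches the requirement that every experiment end with an explicit decision.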

If your current analytics cannot support this full system, start with one high-value AI workflow. Write its evidence contract, implement the traceable event chain, assemble a representative offline evaluation set, and place the experience behind a controlled flag. Your first useful deliverable is not a larger dashboard. It is a product decision that can be made without debating what the data was supposed to mean.

February 20, 2026

How to Run AI-Accelerated Product Discovery and Delivery

Your team can turn a behavioral anomaly into a polished prototype within hours rather than over weeks, yet still stall when it is time to choose a problem, approve a test, or act on the result. That is the central trap in AI-accelerated product development: producing artifacts faster does not automatically produce better decisions.

The useful unit of acceleration is the complete learning loop: detect a meaningful signal, frame the opportunity, explore distinct hypotheses, validate the riskiest assumptions, ship with controlled exposure, and use production evidence to decide what happens next. You need one operating model across that loop, not a collection of disconnected AI shortcuts.

Key takeaways

Optimize for time from signal to a decision backed by evidence, not the number of analyses, prototypes, or tickets generated.
Give every investigation an outcome contract: the customer behavior, target cohort, primary metric, guardrails, and decision that the work is intended to inform.
Use AI to create alternatives that represent different value hypotheses. More cosmetic variants usually create more review work without expanding what you can learn.
Carry the same cohort, metric definitions, hypothesis, and constraints from discovery into the production experiment. This prevents the handoff from silently changing the question.
Let agents act only where their permissions, thresholds, audit trail, and rollback path are explicit. Autonomy should expand with evidence, reversibility, and trust.

Design one loop from a product signal to a decision

Most teams first apply AI to individual tasks. An agent summarizes a dashboard. A model drafts a product requirements document. A design tool generates a flow. A coding assistant implements it. Each task becomes faster, but the work still waits between tasks because nobody has defined what evidence is sufficient, who can make the next decision, or what outcome the change should affect.

An agent that discovers more anomalies while the product trio reviews opportunities through the same overloaded process has created a longer inbox. The bottleneck has moved; it has not disappeared. The remedy is to treat a decision-ready hypothesis, rather than an AI-generated artifact, as the unit of product work.

A practical discovery loop has the following sequence:

Write the outcome contract. Name the customer behavior you want to change, the cohort in which it matters, the primary outcome metric, the metrics that must not deteriorate, and the decision this evidence will support.
Map the driver tree. Break the outcome into observable behavioral drivers. This gives the agent a bounded search space and prevents a broad metric movement from producing an equally broad list of possible features.
Issue an investigation brief. Tell the agent which definitions, segments, releases, and time comparisons it may use; which data it may access; what it should monitor; and whether it may only recommend or may also initiate an approved workflow.
Require an evidence packet. An anomaly should arrive with the affected cohort, direction and materiality of the movement, relevant timing, instrumentation checks, plausible alternative explanations, and the next question worth answering.
Record the decision. The product trio should accept, reject, defer, or refine the hypothesis and state why. That decision becomes context for the next investigation instead of disappearing into a meeting.

For an onboarding problem, the outcome contract might identify accounts attempting their first meaningful setup, define the activation behavior precisely, name downstream retention and support demand as guardrails, and authorize the agent to investigate friction without changing the customer experience. That is much more useful than asking AI to find onboarding insights. The broad request has no stopping condition and no decision attached to it.

The driver tree then narrows the investigation. Activation might depend on starting setup, completing required configuration, reaching an initial value-bearing action, and returning to use that value. The point is not to make the tree exhaustive. It is to show which behaviors could plausibly explain the outcome and which are observable in your product data.
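A driver tree for this onboarding example can be kept as plain data, which makes the observable drivers easy to separate from the ones that still need instrumentation. The structure below is a sketch: the event names are hypothetical, and marking the return driver as uninstrumented is an assumption made only to show the split.

```python
driver_tree = {
    "outcome": "activation",
    "drivers": [
        {"name": "started_setup", "event": "setup_started"},
        {"name": "completed_required_configuration", "event": "configuration_completed"},
        {"name": "reached_first_value_action", "event": "first_value_action"},
        {"name": "returned_to_use_value", "event": None},  # not yet instrumented
    ],
}


def split_by_observability(tree):
    """Return (observable, unobserved) driver names for one outcome."""
    observable = [d["name"] for d in tree["drivers"] if d["event"]]
    unobserved = [d["name"] for d in tree["drivers"] if not d["event"]]
    return observable, unobserved


observable, unobserved = split_by_observability(driver_tree)
print(observable)
print(unobserved)  # ['returned_to_use_value']
```

The observable list bounds what an agent can investigate today. The unobserved list is an instrumentation backlog rather than a set of hypotheses.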

This is where continuous agents can provide real leverage. They can monitor established metrics, inspect funnel and cohort movements, and surface material changes such as an activation decline in a valuable cohort or a retention change following a release. They can also compare segments and assemble supporting context without waiting for a fresh manual analysis request.

But the alert is not yet an opportunity, and correlation is not a causal explanation. A broken event, a changed identity rule, a traffic-mix shift, or a simultaneous release can resemble a change in customer behavior. Make instrumentation confidence and alternative explanations mandatory fields in the evidence packet. If either is weak, the next action is to improve the evidence, not to generate a feature.
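The rule that weak evidence leads to better evidence, not to a feature, can be enforced at the point where a packet enters review. The following sketch makes several assumptions: the field names, the three-level confidence scale, and the treatment of "no alternative explanations examined" as weak are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class EvidencePacket:
    affected_cohort: str
    movement: str
    instrumentation_confidence: str  # "high", "medium", or "low" (assumed scale)
    alternatives_examined: list = field(default_factory=list)


def next_action(packet):
    """Route a packet to evidence work or to product review."""
    weak_instrumentation = packet.instrumentation_confidence != "high"
    no_alternatives = not packet.alternatives_examined
    if weak_instrumentation or no_alternatives:
        return "improve_evidence"
    return "review_with_product_trio"


unverified = EvidencePacket(
    affected_cohort="new accounts on first setup",
    movement="activation down after release",
    instrumentation_confidence="medium",
)
verified = EvidencePacket(
    affected_cohort="new accounts on first setup",
    movement="activation down after release",
    instrumentation_confidence="high",
    alternatives_examined=["traffic-mix shift", "changed identity rule"],
)

print(next_action(unverified))  # improve_evidence
print(next_action(verified))    # review_with_product_trio
```

Neither branch generates a feature. Even a well-supported packet goes to the product trio, which still owns the judgment about whether the problem is worth solving.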

The product trio still owns the consequential judgment: whether the problem is worth solving, what tradeoff is acceptable, which customer evidence is missing, and whether the likely value justifies the delivery cost. AI should remove investigative toil and expose overlooked evidence. It should not hide a strategic choice inside an automated recommendation.

Use AI to expand hypotheses without expanding waste

Generative design changes the economics of exploration. Once a measurable opportunity is clear, high-fidelity flows can be produced in hours instead of stretching across weeks. That makes it practical to inspect several possible mechanisms before production code is written.

Cheap variation also creates a new failure mode. If every stakeholder can request another screen, the team spends its saved production time reviewing undifferentiated options. The prompt should therefore ask for distinct value hypotheses, not a gallery of cosmetic alternatives.

Build the prototype brief from the evidence packet. It should contain:

Target user and context: the affected cohort, the job it is trying to complete, and the point at which friction appears.
Observed evidence: the behavioral signal, qualitative context if available, instrumentation caveats, and alternative explanations that remain open.
Value hypothesis: why a proposed mechanism should change the target behavior, stated in a form that can be rejected.
Meaningfully different mechanisms: alternatives that change how value is delivered, explained, sequenced, or experienced rather than merely changing visual treatment.
Outcome and guardrails: the primary behavior to influence and the accessibility, privacy, brand, reliability, and business constraints that every variation must respect.
Instrumentation needs: the events and properties required to tell whether people encounter, understand, use, and benefit from the proposed experience.

A useful review question is: If these alternatives perform differently, will you learn something about customer value? If the answer is no, the variations probably differ in presentation but not in hypothesis. Asking AI for more of them will not improve the decision.

Match the validation method to the uncertainty:

Concept validation addresses whether the intended user understands the proposition and considers it relevant.
Usability validation addresses whether the user can recognize the next step, complete the flow, and recover from confusion.
Production experimentation addresses whether exposure changes actual behavior under real product conditions.
Cohort-level follow-through addresses whether an immediate movement is accompanied by the activation, retention, or expansion outcome the team ultimately cares about.

Do not ask a prototype to answer a production question. A polished interaction can expose comprehension and usability problems, but it cannot establish that the experience will improve retention. Conversely, do not consume production capacity to answer a basic usability question that a prototype could resolve before engineering begins.

Define the decision rule before each validation step. State what evidence would cause the trio to advance the hypothesis, revise it, or stop. This prevents a compelling AI-generated design from becoming the default simply because it exists. High fidelity is a communication advantage, not proof of value.

Carry the discovery contract into production

The discovery-to-delivery handoff often introduces more error than the tools remove. A metric is renamed, a cohort becomes broader, a design constraint disappears from the ticket, or an experiment is configured to answer a slightly different question. The team ships quickly and then debates what the result means.

Prevent that translation loss by treating the outcome contract as a living production artifact. Keep the same definitions and segments across pre-launch discovery and post-launch evaluation. If a definition must change, document the change and revisit the hypothesis rather than pretending the evidence is still directly comparable.

Before implementation begins, the trio should be able to point to a compact delivery contract containing:

The customer problem, target cohort, and value hypothesis.
The primary outcome metric and the metrics that protect against unacceptable side effects.
The exact event, property, identity, and segment definitions needed for evaluation.
The minimum detectable effect, meaning the smallest change that would be consequential enough to alter the product decision.
The planned exposure controls, eligibility rules, rollback conditions, and owner.
The accessibility, privacy, data-governance, reliability, and brand constraints inherited from discovery.
The result that would lead to shipping, iteration, further investigation, or rollback.
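One way to keep these fields from drifting between discovery and evaluation is to store the contract as data and compare versions explicitly. The sketch below is a minimal illustration; the `DeliveryContract` fields, the `contract_drift` helper, and the example values are all hypothetical names chosen for this example, not a prescribed schema.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DeliveryContract:
    """Hypothetical, trimmed-down contract; extend it to match your own fields."""
    customer_problem: str
    target_cohort: str
    primary_metric: str
    guardrail_metrics: tuple
    mde_abs: float
    rollback_condition: str
    owner: str


def contract_drift(discovery: DeliveryContract,
                   evaluation: DeliveryContract) -> dict:
    """Return every field whose value changed, as {field: (before, after)}."""
    before, after = asdict(discovery), asdict(evaluation)
    return {key: (before[key], after[key])
            for key in before if before[key] != after[key]}


# Illustrative values only.
discovery_contract = DeliveryContract(
    customer_problem="New admins stall during required configuration",
    target_cohort="self-serve accounts in their first 14 days",
    primary_metric="activation_rate",
    guardrail_metrics=("support_tickets", "setup_errors"),
    mde_abs=0.02,
    rollback_condition="setup_errors rises above the agreed ceiling",
    owner="product-trio",
)

# A cohort that quietly became broader shows up as explicit drift.
evaluation_contract = replace(discovery_contract,
                              target_cohort="all accounts in their first 14 days")

if __name__ == "__main__":
    for name, (old, new) in contract_drift(discovery_contract,
                                           evaluation_contract).items():
        print(f"{name} changed: {old!r} -> {new!r}")
```

An empty result means the evaluation still answers the question discovery asked. A non-empty one is the prompt to document the change and revisit the hypothesis.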

Set the minimum detectable effect before examining experiment results. The question is not merely whether a statistical difference can be found. It is whether the experiment can detect an effect large enough to matter to the decision. If realistic exposure cannot provide decision-worthy power, acknowledge that limitation. Consider a more substantial intervention, a longer evidence path, or a different validation method instead of asking an underpowered test for certainty it cannot provide.
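As a rough illustration of checking whether realistic exposure can detect the chosen effect, the sketch below estimates the users needed per variant for a two-sided comparison of two conversion rates, using the standard normal-approximation formula. The function names, the 20% baseline, the 2-point absolute MDE, and the 5% significance and 80% power defaults are illustrative assumptions, not recommendations.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline_rate: float, mde_abs: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm for a two-sided two-proportion test.

    Normal-approximation planning estimate, not an exact requirement.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    treated_rate = baseline_rate + mde_abs
    if mde_abs == 0 or not 0 < treated_rate < 1:
        raise ValueError("mde_abs must be non-zero and keep the rate in (0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def is_decision_worthy(expected_users_per_variant: int,
                       baseline_rate: float, mde_abs: float) -> bool:
    """True when expected exposure covers the estimated requirement."""
    return expected_users_per_variant >= users_per_variant(baseline_rate, mde_abs)


if __name__ == "__main__":
    needed = users_per_variant(baseline_rate=0.20, mde_abs=0.02)
    print(f"Users needed per variant: {needed}")
    print("Enough with 3,000 per variant?",
          is_decision_worthy(3_000, 0.20, 0.02))
```

Halving the MDE roughly quadruples the required sample, which is why the threshold has to be chosen for its decision value rather than for convenience.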

Risky changes should be gated behind feature flags and delivered through a controlled CI/CD path. A flag limits exposure and creates a rollback mechanism; it does not, by itself, make a release an experiment. You still need stable assignment, defined eligibility, trustworthy instrumentation, and a predeclared interpretation plan.
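Stable assignment is the piece most often missing from a plain flag. A minimal sketch, assuming a string user identifier and a hash-based bucketing scheme (one common approach, not the behavior of any specific flagging product); `assign_variant`, `experiment_key`, and the variant names are hypothetical:

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str,
                   exposure_pct: float = 10.0) -> str:
    """Deterministically map a user to 'holdout', 'control', or 'treatment'.

    The same user and experiment key always land in the same bucket, so
    raising exposure_pct adds users without reshuffling existing ones.
    """
    if not 0 <= exposure_pct <= 100:
        raise ValueError("exposure_pct must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    if bucket >= exposure_pct * 100:
        return "holdout"  # not exposed to the experiment at all
    # Split exposed users roughly evenly using the bucket's parity.
    return "treatment" if bucket % 2 == 0 else "control"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "onboarding-checklist", exposure_pct=50))
```

Eligibility rules, instrumentation checks, and the interpretation plan still sit outside this function. The sketch covers only consistent assignment and bounded exposure.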

Not every change is suitable for an A/B test. Some changes are required, too interconnected for clean isolation, or exposed to too little eligible traffic for a decision-worthy test. The discipline still applies: state the expected behavioral change, release progressively when possible, validate the instrumentation, inspect guardrails, and choose the review point before launch.

When production data arrives, evaluate more than the aggregate primary metric. Confirm that exposure and events behaved as intended. Inspect the cohorts named in the original opportunity. Check whether the result varies across important segments. Then follow the downstream activation or retention signal that justified the work. Production conditions include latency, reliability, real data, competing tasks, and repeated use; prototype enthusiasm does not remove any of them.

Finally, record the product decision and feed it back into the system. The agent should know which hypothesis was accepted, what actually shipped, what the experiment showed, and why the team chose to scale, revise, or stop. Without that context, the next automated investigation starts from activity rather than accumulated learning.

Give agents decision rights, guardrails, and a balanced scorecard

Agentic workflows become risky when a team discusses autonomy as a general capability. Decision rights need to be assigned to a defined action in a defined context. The same agent may safely monitor an established metric, recommend an investigation, prepare a prototype, and still require explicit approval before changing a customer experience.

Use the following as a starting policy, then tighten it to your data sensitivity, product risk, and operational controls:

Work	Agent role	Human decision gate	Required control
Monitor an established metric	Run continuously within approved read access	Metric definitions and alert conditions approved in advance	Access boundaries, instrumentation-health checks, and an audit log
Investigate an anomaly	Assemble evidence and recommend hypotheses	Product trio decides whether the signal represents a meaningful opportunity	Cohort context, alternative explanations, confidence, and traceable queries
Generate a prototype or implementation draft	Prepare alternatives and supporting artifacts	Design and engineering approve customer experience and technical choices	Accessibility, privacy, brand, architecture, and data-use constraints
Launch a customer-facing experiment	Prepare configuration; execute only when policy explicitly permits it	Named owner approves exposure, success criteria, and rollback path	Feature flag, eligibility rules, MDE, guardrails, monitoring, and rollback
Trigger a CRM or in-app workflow	Act only inside preapproved conditions	Owner approves audience, message, frequency, and stop rules	Consent-aligned data, bounded actions, suppression logic, and reviewable history

The key distinction is not human versus autonomous work. It is whether the action is bounded, observable, reversible, and aligned to an approved outcome. An agent can be highly autonomous inside a narrow monitoring job and strictly advisory when a decision affects customers, commitments, or sensitive data.

Three governance questions should appear in every agent brief: What may the agent observe? What may it decide? What may it change? Add the owner who reviews its reasoning, the evidence it must preserve, and the mechanism that stops or reverses an action. This turns broad principles such as decision rights, reasoning transparency, and outcome alignment into enforceable operating rules.
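The three questions can be made enforceable by writing the brief as data and checking every requested action against it. The following is a minimal sketch under that assumption; `AgentBrief`, `AuditLog`, `authorize`, and the example metric and action names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class AgentBrief:
    """Hypothetical brief: what the agent may observe, decide, and change."""
    name: str
    reviewing_owner: str
    stop_mechanism: str
    may_observe: FrozenSet[str] = frozenset()
    may_decide: FrozenSet[str] = frozenset()
    may_change: FrozenSet[str] = frozenset()


@dataclass
class AuditLog:
    entries: List[dict] = field(default_factory=list)


def authorize(brief: AgentBrief, kind: str, target: str, log: AuditLog) -> bool:
    """Check a requested action against the brief and record the outcome."""
    scopes = {
        "observe": brief.may_observe,
        "decide": brief.may_decide,
        "change": brief.may_change,
    }
    if kind not in scopes:
        raise ValueError(f"unknown action kind: {kind}")
    permitted = target in scopes[kind]
    # Denied requests are logged too, so reviewers can see attempted overreach.
    log.entries.append({"agent": brief.name, "kind": kind,
                        "target": target, "permitted": permitted})
    return permitted


monitoring_agent = AgentBrief(
    name="activation-monitor",
    reviewing_owner="product-trio",
    stop_mechanism="disable the scheduled job",
    may_observe=frozenset({"activation_rate", "retention_d30"}),
    may_decide=frozenset({"raise_alert"}),
    # may_change stays empty: this agent is advisory for customer-facing changes.
)

if __name__ == "__main__":
    log = AuditLog()
    print(authorize(monitoring_agent, "observe", "activation_rate", log))
    print(authorize(monitoring_agent, "change", "onboarding_flow", log))
```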

Measure flow, quality, outcomes, and risk together

A scorecard focused only on speed will reward premature action. A scorecard focused only on business outcomes will hide whether the operating model is actually improving. Track four dimensions:

Flow: time to insight, time to action, manual analysis effort, and waiting time between investigation, decision, validation, and release.
Decision quality: whether investigations include instrumentation checks and alternative explanations, and whether experiments have a hypothesis, MDE, guardrails, and interpretation rule before launch.
Customer and business outcomes: the relevant movement in activation, retention, expansion, or another outcome named in the contract, including differences across the target cohorts.
Risk: actions outside approved permissions, privacy or access violations, misleading analyses caused by instrumentation problems, customer-impacting errors, and rollbacks.

The relationships between these measures are diagnostic. Shorter time to insight with unchanged time to action means the decision queue is now the bottleneck. More agent-initiated work with flat activation or retention means the organization has increased automation, not product value. Lower manual analysis effort paired with weaker evidence packets means the work became cheaper by discarding necessary scrutiny.

The share of initiatives started by agents can be useful as an adoption indicator, but it is a poor destination metric. The meaningful result is a shorter, more reliable path to customer and business impact. Keep outcome measures beside time-to-insight, time-to-action, agent-initiated work, and manual analysis effort so local efficiency cannot masquerade as progress.

Start with one bounded learning loop

Do not begin by making every product workflow agentic. Choose one recurring, measurable problem in a trusted part of the data, such as onboarding friction, activation, or retention for a defined cohort. Then roll out the operating model in sequence:

Timestamp the current stages from signal detection through decision, validation, release, and post-launch review. This establishes where work actually waits.
Stabilize the outcome, cohort, event, and segment definitions. If the instrumentation is not trustworthy, repair it before automating interpretation.
Run the agent in read-only, recommendation mode. Require the standard evidence packet and audit whether its conclusions can be reproduced.
Connect approved investigations to the prototype brief. Ask the product trio to select among distinct hypotheses and document why.
Carry the selected hypothesis into the delivery contract, feature flag, instrumentation plan, and evaluation rule.
Permit automated actions only after the team has defined bounded permissions, monitoring, stop conditions, ownership, and rollback.
Review whether the loop became faster without weakening decision quality, customer outcomes, or governance. Expand the model only where that balance holds.

If an agent cannot show how it reached a conclusion, keep it in an investigative support role. If the team cannot state what result would change its decision, pause the experiment design. If cycle time falls but no relevant outcome improves, revisit the opportunity selection and hypothesis quality rather than adding more automation.

For your next active product problem, write the outcome contract before requesting an AI analysis or prototype. Give the agent a bounded investigation brief, require the trio to compare meaningfully different hypotheses, and move the chosen hypothesis into production without changing its metric or cohort. That single end-to-end loop will tell you more about your AI readiness than a long inventory of tools.

The test is straightforward: if AI helps you reach a consequential, auditable product decision sooner and learn from the result, it has accelerated product development. If it merely creates more things to review, it has accelerated output.

February 19, 2026

An End-to-End AI Product Workflow From Discovery to Deployment

You have customer interviews, an AI prototype, and a launch request. What you may not have is a defensible chain connecting them. The prototype can look convincing while the team still disagrees about the customer problem, the acceptable failure rate, the limits of automation, and what should happen when the model or a connected tool fails.

A durable AI product workflow makes those decisions explicit. It connects customer evidence to a bounded opportunity, the opportunity to an interaction model, that model to an evaluation contract, and the contract to a guarded production release. You should be able to trace every automated action backward to a customer need and forward to a metric, an owner, and a recovery path.

Turn interviews into an opportunity map, not a feature request

AI products often go wrong before anyone writes a prompt. A customer describes a slow or frustrating task, someone proposes an assistant, and the proposed interface quietly becomes the problem definition. The team then tests whether it can build the assistant instead of whether solving that part of the workflow changes the customer’s outcome.

Start by defining the discovery boundary. Name the user, the workflow, the outcome the user is trying to reach, and the part of that outcome your product could reasonably influence. Keep interviews in the same outcome or product space when you synthesize them. A small batch of three interviews can be enough to produce a useful first draft, but it is not a universal saturation threshold or proof that you understand the market.

The sequence of synthesis matters. Analyze each interview on its own before looking for patterns across interviews. That preserves the situation, sequence, and meaning around each customer’s comments. If you combine transcripts immediately, repeated vocabulary can appear more important than the underlying context, and an unusual but consequential problem can disappear into the average.

Write the outcome anchor. State whose behavior or result should change. Avoid a feature-shaped outcome such as “increase use of the AI assistant.” A better outcome describes progress in the customer’s work.
Create one snapshot per interview. Capture the customer’s goal, the relevant sequence of events, key moments, obstacles, current workaround, and evidence supporting each inferred opportunity.
Separate observation from interpretation. Preserve what happened and what the customer said separately from the team’s explanation of why it happened. Label uncertainty instead of filling gaps with generated prose.
Synthesize across snapshots. Look for shared opportunities, meaningful differences, dependencies, and contradictions. Similar wording does not automatically mean the same need.
Organize opportunities before proposing solutions. Build an opportunity solution tree or equivalent map that connects the product outcome to customer opportunities. Keep solution ideas outside the opportunity labels.
Review the generated structure as a team. Ask what was merged incorrectly, what was missed, what lacks evidence, and which branch reflects a solution disguised as a need.

AI is useful here as a first-pass analyst, not as an authority. It can extract moments, propose opportunity statements, and suggest a hierarchy. Human reviewers contribute product context, recognize important exceptions, and challenge confident-looking inferences. The strongest practical model is an AI-generated draft that the team refines.

Your exit gate for discovery is not a polished tree. It is agreement on a selected opportunity, the evidence behind it, the customer outcome it should influence, and the opportunities deliberately excluded from the current scope. If the team cannot explain those choices without mentioning a model or interface, it is not ready to prototype.

Choose assistance or autonomy before choosing the architecture

The next decision is not which model to use. It is what responsibility the product will accept. An LLM can generate or classify content. An agent wraps model behavior in a workflow that plans, uses tools, retains relevant state, and attempts to complete an outcome. That difference changes the customer promise, the evaluation plan, the permission model, and the consequences of failure.

Decision	Copilot	Agent
Best task shape	High-context work that benefits from judgment, nuance, or brand voice	Bounded, tool-heavy work with a verifiable completion state
Customer promise	Drafts, explains, recommends, or accelerates	Completes an agreed task within a defined scope
Human role	Reviews and commits the result	Sets policy, handles exceptions, and approves sensitive actions
Default permissions	Read, retrieve, and propose	Narrowly scoped tool access, including only the writes required for the task
Primary proof	Useful, grounded output that improves the user’s work	End-to-end task success without unacceptable actions or loops
Failure consequence	A poor suggestion reaches the reviewer	A poor decision can propagate into another system

When the task still depends on tacit knowledge or subjective review, start with a copilot. When it is bounded, tool-heavy, and objectively checkable, consider an agent. The safer product progression is to start assistive and grant autonomy only after success is measurable. Autonomy should be earned capability by capability, not declared at the product level.

You can make that progression concrete without redesigning the entire experience. Let the product draft first. Then let it recommend a plan and show the evidence behind the recommendation. Next, allow reversible actions through a narrow tool whitelist. Keep approval immediately before actions that affect customers, money, permissions, or durable data. Expand the scope only when production evidence supports the previous boundary.

Once the responsibility is clear, define the architecture around it:

Authoritative context: retrieve relevant product, account, policy, or workflow information before asking the model to decide. A retrieval-first pipeline reduces dependence on whatever happens to be encoded in model weights.
Explicit scope: state the role, allowed objectives, prohibited actions, and conditions that require escalation.
Controlled tools: expose only the operations needed for the selected job. Apply unit limits and validate tool inputs outside the model.
Deliberate memory: separate temporary working state, durable customer facts, and governing policy. Do not treat the entire conversation history as an undifferentiated memory store.
Visible checkpoints: show the user what will happen, what data will be used, and which action requires approval.
Traceable execution: record retrieval results, model and prompt versions, tool calls, approvals, guardrail events, and final task status.
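To make the controlled-tools and traceable-execution items concrete, here is a minimal sketch of a gateway that sits between the model and its tools. It assumes the model proposes a tool name plus arguments, and that ordinary code decides whether the call runs. `ToolGateway`, `lookup_order`, the per-run call limit, and the validator are hypothetical illustrations, not part of any specific agent framework.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolRejected(Exception):
    """Raised when a requested call falls outside the approved boundary."""


class ToolGateway:
    def __init__(self, max_calls_per_run: int = 5) -> None:
        self._tools: Dict[str, Tuple[Callable[..., Any], Callable[[dict], bool]]] = {}
        self._max_calls = max_calls_per_run
        self._calls = 0
        self.trace: List[dict] = []  # every attempt is recorded, allowed or not

    def register(self, name: str, func: Callable[..., Any],
                 validator: Callable[[dict], bool]) -> None:
        self._tools[name] = (func, validator)

    def call(self, name: str, args: Dict[str, Any]) -> Any:
        record = {"tool": name, "args": args, "status": "rejected"}
        self.trace.append(record)
        if name not in self._tools:
            raise ToolRejected(f"tool not on the allowlist: {name}")
        if self._calls >= self._max_calls:
            raise ToolRejected("call limit reached for this run")
        func, validator = self._tools[name]
        if not validator(args):  # validation happens outside the model
            raise ToolRejected(f"invalid arguments for {name}")
        self._calls += 1
        try:
            result = func(**args)
        except Exception:
            record["status"] = "failed"
            raise
        record["status"] = "ok"
        return result


def lookup_order(order_id: str) -> dict:
    """Stand-in for a read-only tool; returns canned data for the example."""
    return {"order_id": order_id, "status": "shipped"}


def valid_order_args(args: dict) -> bool:
    return (set(args) == {"order_id"}
            and isinstance(args["order_id"], str)
            and args["order_id"].isdigit())


if __name__ == "__main__":
    gateway = ToolGateway(max_calls_per_run=2)
    gateway.register("lookup_order", lookup_order, valid_order_args)
    print(gateway.call("lookup_order", {"order_id": "12345"}))
    print(gateway.trace)
```

Because the allowlist, the validator, and the limit live in ordinary code, each can be tested deterministically without invoking a model.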

This architecture is more durable than a large prompt because each component has a distinct failure mode and owner. Retrieval can be evaluated for evidence quality. Tools can be tested deterministically. Policy can be reviewed independently. The model remains important, but it no longer carries responsibilities that ordinary software can enforce more reliably.

The exit gate is a written responsibility boundary. The team should be able to say what the product may read, what it may write, what it must never do, when a person intervenes, and how successful completion is verified. If any answer is “the model will decide,” the boundary is still incomplete.

Write the evaluation contract before optimizing the prompt

A compelling demo proves that a path can work. It does not establish how often it works, which inputs break it, whether its evidence is trustworthy, or whether it completes the customer’s job at an acceptable cost. Prompt iteration without an evaluation contract tends to optimize whatever the last reviewer noticed.

Write the contract in product language. For each target task, define the eligible input, the expected outcome, the evidence the product may use, allowed actions, prohibited outcomes, completion criteria, escalation conditions, and fallback. Add latency and cost limits chosen for your product economics. There is no universal threshold that makes an AI workflow production-ready; the important discipline is setting the threshold before seeing launch results.

Build the evaluation set from discovery evidence. Include representative customer inputs, important workflow variations, ambiguous cases, missing context, conflicting instructions, tool failures, and requests the product must refuse or escalate. Remove or protect sensitive data according to your governance rules. Every case should identify the acceptable outcome, not merely an ideal sentence, because multiple responses may solve the same job.

For copilots, measure the quality of assistance

Time to first token: how long the user waits before the response begins.
Response latency: how long the useful result takes to complete.
Groundedness: whether material claims are supported by the authoritative context supplied to the model.
User satisfaction: whether the assistance was useful in the actual workflow, not merely fluent.
Task impact: whether the user completes the selected job faster, with less effort, or with fewer corrections, using the outcome defined during discovery.

For agents, measure the whole execution

Task success rate: successfully completed eligible tasks divided by all eligible attempts. Define completion in the customer’s system of record where possible.
Steps per task: the number of model and tool steps required to finish. A rising count can expose inefficient planning or repeated work.
Tool error rate: failed, rejected, or malformed tool calls relative to attempted calls.
Loop detection: executions stopped because the agent repeated actions or failed to make progress.
Guardrail triggers: attempts blocked or redirected by policy. A trigger is diagnostic evidence, not automatically a success or a failure.
Human escalation: tasks handed to a person because the agent lacked permission, confidence, context, or a valid recovery path.
Cost per successful task: total execution cost divided by successful completions. Cost per request can hide expensive retries and failed runs.
Containment rate: eligible tasks completed within the automated workflow without human handling. Publish the eligibility and escalation rules with the metric so teams do not improve containment by narrowing the denominator invisibly.
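The ratio definitions above can be computed from per-run records. The sketch below is a minimal illustration that assumes each run is logged with the fields shown; `AgentRun`, `summarize`, and the sample data are hypothetical, and containment is interpreted here as eligible runs that succeeded without any human handling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AgentRun:
    """One attempted task. Field names are illustrative assumptions."""
    eligible: bool     # met the published eligibility rules
    succeeded: bool    # completion verified in the system of record
    escalated: bool    # handed to a person at any point
    tool_calls: int
    tool_errors: int   # failed, rejected, or malformed calls
    cost: float        # total execution cost, including retries


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    return numerator / denominator if denominator else None


def summarize(runs: Iterable[AgentRun]) -> dict:
    eligible = [r for r in runs if r.eligible]
    successes = [r for r in eligible if r.succeeded]
    contained = [r for r in successes if not r.escalated]
    return {
        "task_success_rate": _ratio(len(successes), len(eligible)),
        "containment_rate": _ratio(len(contained), len(eligible)),
        "tool_error_rate": _ratio(sum(r.tool_errors for r in eligible),
                                  sum(r.tool_calls for r in eligible)),
        # Failed runs still count toward cost, so retries cannot hide.
        "cost_per_successful_task": _ratio(sum(r.cost for r in eligible),
                                           len(successes)),
    }


# Made-up records for demonstration only.
SAMPLE_RUNS = [
    AgentRun(True, True, False, tool_calls=4, tool_errors=0, cost=0.25),
    AgentRun(True, True, False, tool_calls=6, tool_errors=1, cost=0.25),
    AgentRun(True, False, True, tool_calls=10, tool_errors=3, cost=0.5),
    AgentRun(True, True, True, tool_calls=5, tool_errors=0, cost=0.5),
    AgentRun(False, False, True, tool_calls=2, tool_errors=0, cost=9.0),
]

if __name__ == "__main__":
    for metric, value in summarize(SAMPLE_RUNS).items():
        print(metric, value)
```

Keeping the eligibility flag on each record makes the denominator visible, so a change in containment can be checked against a change in who was counted.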

These agent analytics complement rather than replace end-to-end task success. A fast response can still be wrong. A low tool error rate can coexist with a bad plan. High containment can be harmful if the agent completes the wrong task. Choose one outcome metric, pair it with quality and safety constraints, and retain the diagnostic metrics needed to find the cause of failure.

Route failures to the component that can fix them. Unsupported claims point first to retrieval and grounding. Correct plans with failed actions point to tool integration. Repeated steps point to orchestration or stopping logic. Frequent, legitimate escalations may mean the autonomy boundary is too broad. High model scores with low customer satisfaction should send the team back to the opportunity definition or user experience.

The exit gate is a versioned evaluation suite with release criteria, prohibited outcomes, an approved cost ceiling, and named escalation rules. Run it against every material change to the model, prompt, retrieval configuration, tool contract, or policy. Treat prompts and evaluation cases as product assets under version control, not as text pasted into a dashboard.

Release through gates and design the failure path

Deployment is where an AI capability becomes a product promise. The team now has to manage model variability, external tool behavior, changing knowledge, permissions, cost, and customer expectations at the same time. A launch plan that covers only the happy path is unfinished.

Put the capability behind a feature flag. Separate deployment from exposure so the team can stop new executions without waiting for a code release.
Open a gated beta around one bounded job. Limit the eligible users, tool permissions, data scope, and advertised promise. Make it clear whether the product recommends an action or performs it.
Use a canary for broader production traffic. Expand exposure gradually while comparing task success, guardrail events, tool errors, latency, escalation, and cost per successful task with the release criteria.
Change one material layer at a time when practical. Simultaneous changes to the model, prompt, retrieval index, tools, and policy make regressions difficult to attribute.
Expand only after the previous boundary is stable. More users, more tools, and more autonomy are separate risk decisions. Do not bundle them into one rollout.
Keep rollback and fallback distinct. Rollback restores a known model, prompt, policy, or tool version. Fallback gives the customer a safe alternative when the AI path is unavailable.

Feature flags, gated betas, canary rollouts, incident paths, and rehearsed fallbacks are ordinary operational controls, but they carry unusual weight in AI products because model and tool behavior can drift independently of an application release.

Design specific degraded states before launch:

Model unavailable: preserve the user’s work, explain that automation is unavailable, and offer the established manual path.
Retrieval unavailable or evidence missing: do not silently generate an ungrounded answer. Ask for the missing context, provide a limited response, or escalate.
Write tool fails: stop, report the actual system state, and reconcile before retrying. Blind retries can duplicate durable actions.
Execution stops making progress: terminate the loop at the configured limit and hand over the trace rather than consuming resources indefinitely.
Policy or permission check fails: block the action, preserve the audit record, and route the user to an authorized path.
Tool behavior changes: disable the affected capability until its contract and evaluation cases pass again.
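The "execution stops making progress" state can be sketched as a guard around the agent's step loop. This is a minimal illustration, assuming the planner is a function that returns the next action or `None` when finished; `run_with_loop_guard`, the `Action` shape, and the default limits are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (tool name, serialized arguments)


def run_with_loop_guard(
    next_action: Callable[[List[Action]], Optional[Action]],
    max_steps: int = 8,
    max_repeats: int = 2,
) -> dict:
    """Run planner steps until completion, repetition, or the step limit.

    Always returns the trace so a person can pick up where the agent stopped.
    """
    trace: List[Action] = []
    for _ in range(max_steps):
        action = next_action(trace)
        if action is None:  # planner reports the task is finished
            return {"status": "completed", "trace": trace}
        if trace.count(action) >= max_repeats:  # same action requested again
            return {"status": "escalated", "reason": "repeated action",
                    "trace": trace}
        trace.append(action)  # a real system would execute the action here
    return {"status": "escalated", "reason": "step limit reached",
            "trace": trace}


if __name__ == "__main__":
    # A planner that keeps asking for the same search is stopped early.
    stuck = run_with_loop_guard(lambda trace: ("search", "refund policy"))
    print(stuck["status"], stuck["reason"], len(stuck["trace"]))
```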

Privacy and auditability belong in the release gate, not in a later compliance review. Document what customer data enters prompts, retrieval, memory, and logs; who can access it; how long each class is retained; and how deletion propagates. For actions affecting customers, money, permissions, or durable data, preserve enough detail to reconstruct the input, retrieved evidence, model and prompt version, tool parameters, approval, guardrail result, and final system state.

The operating stack also needs an ownership decision. Build the workflow logic, data model, and user experience that encode your differentiated value. Consider buying undifferentiated capabilities such as observability, prompt versioning, red-team infrastructure, and policy enforcement when an external component meets your control and governance needs. This build-versus-buy boundary keeps product attention on the parts customers actually choose you for without treating commodity infrastructure as strategically unique.

The production exit gate should require a visible scope statement, passing evaluations, a feature flag, a rollback target, a customer-safe fallback, usable audit traces, an incident owner, and a tested escalation route. If the team cannot explain what the customer sees during failure, it has not finished designing the feature.

Keep discovery, evaluation, and production in one learning loop

Once the product is live, production behavior becomes new discovery input. That does not mean replacing customer conversations with dashboards. Metrics show where the workflow breaks; customer evidence explains what the break means and whether fixing it matters.

Review failures against the original opportunity map. Concentrated escalation around one scenario may reveal an opportunity that was hidden during initial synthesis. High groundedness with low satisfaction may indicate that the product answered accurately but tackled the wrong job. A growing step count may expose orchestration waste, while a rising tool error rate points to integration reliability. If cost per successful task increases, inspect failure and retry paths before making the model cheaper; optimizing unit cost cannot rescue an unsuccessful workflow.

Every meaningful production failure should produce at least one durable change: a corrected opportunity assumption, a new evaluation case, a narrower permission, a tool-contract test, a policy update, a clearer interaction, or a revised fallback. That is how customer discovery and operational learning remain connected instead of becoming separate product and engineering rituals.

Key takeaways

Synthesize each customer interview separately before looking across interviews, then review the AI-generated opportunity structure with human judgment.
Select a customer opportunity before selecting the AI interface. A fluent prototype is not evidence that the underlying job matters.
Use a copilot for judgment-heavy work and consider an agent only for bounded, tool-heavy tasks with verifiable completion.
Define task success, prohibited outcomes, escalation, cost, and fallback before optimizing prompts or choosing a model.
Measure copilots as assistance and agents as end-to-end execution. Do not mistake latency, containment, or tool-call success for customer success.
Release behind flags, expand through gated exposure, and rehearse rollback, fallback, and incident paths before granting more autonomy.

At your next AI product review, ask to see the outcome and opportunity map, the responsibility boundary, the evaluation contract, and the rollout and recovery plan. If one is missing, pause the launch decision at that handoff. Closing that gap is usually more valuable than adding another prompt, tool, or autonomous step.

February 18, 2026

Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

Think of a multi-agent system as a well-run product trio for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work—divide complex problems, play to strengths, and create tight feedback loops.

When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
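A minimal sketch of the planner, executor, verifier, and orchestrator pattern, with plain functions standing in for the agents. It is an illustration under stated assumptions rather than a production design: `orchestrate`, the type aliases, the retry count, and the stand-in agents are hypothetical, and real agents would call models and tools where these functions return strings.

```python
from typing import Callable, List

Planner = Callable[[str], List[str]]   # goal -> ordered steps
Executor = Callable[[str], str]        # step -> output
Verifier = Callable[[str, str], bool]  # (step, output) -> acceptable?


def orchestrate(goal: str, planner: Planner, executor: Executor,
                verifier: Verifier, max_retries: int = 1) -> dict:
    """Coordinate handoffs: plan once, then execute and verify each step."""
    results = []
    for step in planner(goal):
        for attempt in range(max_retries + 1):
            output = executor(step)
            if verifier(step, output):
                results.append({"step": step, "output": output,
                                "attempts": attempt + 1})
                break
        else:
            # Every attempt failed verification: stop and hand over.
            return {"status": "escalated", "failed_step": step,
                    "results": results}
    return {"status": "completed", "results": results}


def simple_planner(goal: str) -> List[str]:
    return [f"retrieve context for: {goal}", f"draft answer for: {goal}"]


def simple_executor(step: str) -> str:
    return f"[output of '{step}']"


def simple_verifier(step: str, output: str) -> bool:
    return bool(output.strip())


if __name__ == "__main__":
    outcome = orchestrate("summarize the refund policy", simple_planner,
                          simple_executor, simple_verifier)
    print(outcome["status"], len(outcome["results"]))
```

Because each role is a separate callable with a narrow contract, any one of them can be replaced or tested without touching the others.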

To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

Governance is equally critical. I apply privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews, habits borrowed from SRE culture that map well to AI systems.

On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.

Inspired by this post on Product School.

February 16, 2026
Why “Figma Is Not the Source of Truth”: My Playbook for Design Leadership That Scales

I keep a simple mantra front and center: Figma is not the source of truth. The customer is. In practice, that means the only thing that truly counts is what we ship, how it performs, and whether users come back for more. Mockups are hypotheses; production usage is evidence. When my teams adopt this lens, velocity improves, judgment sharpens, and quality rises where it matters most.

So what does design actually do in a software company? At its best, design builds leverage for the whole system—engineering, product, and marketing—by clarifying problems, raising the quality bar, and making complex decisions legible. The standard I hold is ancient and still essential: products must be useful, usable, and desirable — and above all, used. When we calibrate around “used,” debates about pixels give way to outcomes, and cross-functional partners feel the difference.

I often trace the roots of our craft back well beyond the digital era. The lineage from industrial design to software is real; constraints, ergonomics, affordances, and systems thinking didn’t start with screens. If you’ve ever mapped delight, performance, and reliability in a Kano Model, you’ve touched this lineage. The translation to software is simple: design the full journey, not just the interface—prioritize what improves time-to-value, reduces cognitive load, and earns habitual use.

One lesson I’ve learned the hard way: design leaders who stop designing stop leading. I still sketch flows, write UX copy, and prototype when it unblocks the team or sets a decisive quality bar. The altitude changes constantly: one hour I’m in a strategic roadmap review, the next I’m in a critique or poking at a prototype. Great design leaders jump up and down in altitude to connect vision to details without becoming a bottleneck.

Over time, I’ve come to rely on four pillars every design manager must master: craft (raising taste and execution), product strategy (clarifying choices and trade-offs), people leadership (coaching, feedback, and hiring), and systems (processes, rituals, and design ops that scale). Neglect any one of these and either quality, speed, or team health will eventually falter.

Perfectionism is a double-edged sword. Over-indexing on quality can paralyze decision-making, but lowering the bar indiscriminately is worse. I’ve seen moments where relaxing standards to “go faster” actually cost the business—rework piled up, trust eroded, and customer value stalled. The answer is principled delegation: I define what “must be true” at each milestone, delegate ownership with clear guardrails, and reserve my veto power for moments where product integrity is genuinely at risk.

Measuring success as a design leader starts with OKRs built on outcomes rather than output. I care about activation, retention, time-to-first-value, NPS verbatims tied to key journeys, and the operational metrics that earn the right to build the next thing. Design output is visible; design outcomes are durable. When trade-offs are needed, I optimize for the smallest shippable surface that still proves the core value proposition, then expand with data.

Scaling judgment is the multiplier. I build it through pattern matching—studying enduring product systems from companies like Airbnb, Amazon, Apple, Asana, Notion, Stripe, Nest, and others—to distinguish where polish compels usage versus where it’s ornamental. Strong opinions matter, but so does being easy to convince with new evidence. I encourage designers to articulate the pattern they’re invoking, why it fits the job-to-be-done, and how we’ll know it worked.

Operating cadence matters. My week is anchored around recruiting, crits, and staff meetings that actually make decisions. In critiques, I use the Do/Try/Consider framework to give actionable direction without micromanaging. On one-on-ones, the question isn’t “Should one-on-ones exist?” but “What are they for right now?”—coaching, performance, or clearing execution blockers. If a meeting doesn’t increase clarity or commitment, it gets redesigned or removed.

Execution-wise, I’ve taken inspiration from Rippling’s operating system, especially its emphasis on speed, precise ownership, and hard commitments. The lesson is timeless: go fast on the right things, make clear promises, and instrument your work so you can see reality quickly. When speed is paired with crisp decision rights and observable outcomes, momentum compounds and trust holds instead of fraying.

Hiring your first design leader? Look for someone who can set standards, scale judgment, and ship. They should be able to zoom from company narrative to interaction copy in a single afternoon, coach product trios, and build rituals that make taste and trade-offs explicit. Above all, they should have a point of view on where quality moves the business and where speed is the quality.

Here’s how my team’s approach differs from many: Figma is not the source of truth. We design in Figma, but we learn from production. We pair designers with engineering early, prototype in code when it reduces risk, and wire telemetry into every critical path. Product trios use discovery to validate “useful, usable, desirable — and used,” then commit to outcomes with clear, testable definitions of success. The result is faster iteration, fewer surprises, and experiences customers actually adopt.

If you want to deepen your own pattern library, study products and practices from leaders like Airbnb (https://www.airbnb.com/), Amazon (https://www.amazon.com/), Apple (https://www.apple.com/), Asana (https://www.asana.com/), CrossFit (https://www.crossfit.com/), Figma (https://www.figma.com/), Honeywell (https://www.honeywell.com/), Nest (https://store.google.com/category/google_nest), Notion (https://www.notion.so/), Retool (https://retool.com/), Rippling (https://www.rippling.com/), and Stripe (https://www.stripe.com/). Pay attention to how they balance versatility with clarity, defaults with flexibility, and speed with trust.

The throughline is simple and demanding: design for reality, not for the board. Keep your standards where they create business value, scale judgment with explicit patterns, and instrument everything so learning never stops. When teams embrace that, the work gets better, customers feel it, and the roadmap starts to pull you forward.

February 12, 2026

How to Build an AI-Native Product Development Workflow

Your team can generate a PRD, summarize an interview, and draft acceptance criteria in minutes. Yet the product still may not ship faster. Customer evidence remains scattered, decisions lose their rationale at handoffs, and nobody knows whether an AI-generated recommendation deserves to be trusted.

An AI-native product development workflow fixes that operating system. It connects evidence, decisions, delivery, and evaluation in one traceable learning loop. The goal is not to produce more documents. It is to shorten the path from a customer signal to a reliable product decision, then carry the result back into the next decision.

Change the unit of work from an artifact to a decision

AI-assisted teams use a model inside an existing process. They write the same documents, hold the same handoffs, and make the same decisions, only with faster drafting. That can save time, but it leaves the fundamental bottlenecks untouched.

An AI-native workflow reorganizes the process around decisions. Every meaningful unit of work should carry enough context for the next person or system to understand what is being decided, why it matters, and what evidence would change the decision.

Use a decision packet with five parts:

Decision: State the exact choice in front of the team. Replace broad assignments such as “improve onboarding” with a decision such as whether to change the first-session setup flow for a defined customer segment.
Evidence: Link the customer examples, research moments, usage data, and business constraints that support the problem. Preserve the original evidence rather than storing only an AI summary.
Assumptions: Separate what the team knows from what it believes. An assumption should be written so that new evidence can confirm or challenge it.
Success condition: Name the customer or business behavior expected to change. For an experiment, define the hypothesis and, where appropriate, the minimum detectable effect before exposure begins.
Decision state: Record the owner, status, unresolved questions, next test, and reason for the latest change.
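
One way to keep those five parts together is a small typed record. The sketch below is an illustration under assumptions, not a prescribed schema: the class name `DecisionPacket`, its field names, the `REQUIRED` tuple, and the sample values are all hypothetical. It also shows the kind of missing-field check a model or script could run before a person reviews the packet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecisionPacket:
    decision: str = ""
    evidence_links: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)
    success_condition: str = ""
    # Decision state
    owner: str = ""
    status: str = ""
    open_questions: List[str] = field(default_factory=list)
    next_test: str = ""
    last_change_reason: str = ""


# Fields a packet needs before anyone should act on it (an assumed policy).
REQUIRED = ("decision", "evidence_links", "assumptions", "success_condition", "owner", "status")


def missing_fields(packet: DecisionPacket) -> List[str]:
    """Return the names of required fields that are still empty."""
    return [name for name in REQUIRED if not getattr(packet, name)]


# Hypothetical example values.
packet = DecisionPacket(
    decision="Change the first-session setup flow for self-serve trial accounts?",
    evidence_links=["interview-12", "ticket-4481"],
    success_condition="More trial accounts finish setup in their first session",
    owner="onboarding PM",
)
print(missing_fields(packet))  # ['assumptions', 'status']
```

The check only confirms that fields are filled in. Whether the evidence supports the decision remains a human call.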

The model can retrieve evidence, compress it, identify inconsistencies, draft alternatives, and check whether required fields are missing. A person still owns the interpretation, trade-offs, priority, and release decision. This boundary prevents polished language from being mistaken for product judgment.

Apply a simple test to every AI-generated artifact: what decision will this change? If the answer is unclear, the artifact is probably workflow noise. If the answer is clear, attach the artifact to the decision packet instead of allowing it to become another disconnected document.

Build an evidence spine before adding more automation

Most product workflows fragment evidence before a model ever sees it. Support tickets sit in one system, sales notes in another, interviews in folders, and behavioral data in an analytics platform. A prompt cannot recover relationships that the operating system never preserved.

A retrieval-first intake can unify customer feedback, support tickets, sales notes, research transcripts, and usage analytics. Embeddings can help cluster related signals and remove duplicates, but the useful output is not a list of themes. It is a navigable path from a theme to representative evidence and then to the decision it informed.

Build that path as a closed sequence:

Normalize incoming evidence while preserving its source identifier, relevant customer or segment context, and access permissions.
De-duplicate repeated signals and cluster related evidence without erasing meaningful differences between customers or use cases.
Retrieve a small set of representative examples for the decision being made. Do not dump the entire evidence store into the model context.
Write the approved decision, its assumptions, and its rationale into durable external state.
Return experiment results, release outcomes, and new qualitative feedback to the same evidence system.
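
The first three steps can be sketched in a few lines. This is a simplified illustration: it assumes evidence arrives as short text records, and it uses plain word overlap as a stand-in for embedding-based similarity. The `Evidence` class, the function names, and the sample records are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    source_id: str  # pointer back to the original record
    segment: str    # customer or segment context
    text: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(items: List[Evidence]) -> List[Evidence]:
    """Drop repeated wording within a segment; keep the same signal from other segments."""
    seen, kept = set(), []
    for item in items:
        key = (item.segment, normalize(item.text))
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


def retrieve(items: List[Evidence], question: str, k: int = 3) -> List[Evidence]:
    """Return up to k items that share the most words with the question."""
    terms = set(normalize(question).split())
    scored = [(len(terms & set(normalize(i.text).split())), i) for i in items]
    relevant = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(relevant, key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:k]]


store = [
    Evidence("ticket-1", "smb", "CSV export fails on large files"),
    Evidence("ticket-2", "smb", "csv export  fails on large files"),
    Evidence("call-7", "enterprise", "CSV export fails on large files"),
    Evidence("note-3", "smb", "Would like dark mode"),
]
unique = deduplicate(store)
top = retrieve(unique, "Why does CSV export fail?", k=2)
print([e.source_id for e in top])
```

Keying de-duplication on segment as well as wording is deliberate: the same complaint from an SMB customer and an enterprise customer is two signals, not one. Every retrieved item also keeps its `source_id`, so a recommendation can link back to the original record.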

Keep three forms of information distinct. The evidence store contains raw and normalized inputs. Working context contains only the material needed for the current task. The decision log contains approved conclusions, rejected alternatives, owners, and changes. Mixing all three creates stale prompts, contradictory instructions, and summaries that can no longer be audited.

A prioritization recommendation, for example, should link back to representative customer records and the relevant analytics view. A summary without those links is compression, not evidence. When somebody challenges the recommendation, the team should be able to inspect the underlying material without asking the model to reconstruct its reasoning from memory.

This is also where data governance belongs. Decide which systems the workflow may retrieve from, which fields require redaction, who can see sensitive records, and how model outputs will be retained before connecting those systems. Privacy-by-design, cybersecurity, and regulatory controls need to sit alongside the workflow, not appear as a review after customer information has already crossed an inappropriate boundary.

Run one closed loop from discovery to shipped learning

The product trio remains important in an AI-native workflow. Product, design, and engineering use automation to reach the evidence faster and explore more alternatives, while keeping explicit human gates around interpretation, feasibility, customer experience, and risk. Clear handoffs between context design, external memory, and orchestration make those responsibilities easier to see.

For each stage, name the AI job, the human gate, and the durable output. That turns a collection of AI tools into an operating workflow.

Stage	AI accelerates	Human gate	Durable output
Intake and triage	Normalize, de-duplicate, cluster, and retrieve representative customer signals.	Verify that a cluster reflects a real customer problem rather than repeated wording or a noisy channel.	An opportunity record linked to original evidence.
Discovery	Draft interview guides, summarize transcripts, extract entities, and tag moments of friction.	Interpret what the customer meant, identify contradictions, and decide which uncertainty deserves another conversation.	An evidence-backed problem narrative with open questions.
Opportunity sizing	Organize evidence against a driver tree and assemble available inputs about potential impact.	Choose the outcome, inspect data quality, expose assumptions, and make the prioritization trade-off.	A ranked opportunity with decision criteria and explicit assumptions.
Solution shaping	Generate alternatives, first-pass flows, PRD sections, acceptance criteria, and experiment ideas.	Test desirability, usability, feasibility, strategic fit, and the cost of being wrong.	A solution hypothesis, acceptance criteria, and a test plan.
Planning and execution	Break an approved bet into sequenced work, surface dependencies, and check artifacts for missing requirements.	Set scope, choose rollout controls, confirm instrumentation, and approve release readiness.	An instrumented release plan connected to feature flags, CI/CD, and observability.
Iteration	Compare expected and actual outcomes, organize qualitative feedback, and surface anomalies for review.	Decide whether to scale, revise, stop, or collect more evidence.	An updated decision record returned to the evidence spine.

Exit criteria keep each stage honest. Discovery is not complete because the transcripts have been summarized. It is complete enough to move forward when the team can name the customer problem, the supporting evidence, and the uncertainty it intends to resolve next. Solution shaping is not complete because a PRD exists. It is complete when the hypothesis, constraints, acceptance criteria, test method, and required telemetry are clear enough for a responsible decision.

Plan measurement before release. If the team will use an A/B test, write the hypothesis and minimum detectable effect before looking at the result. If controlled experimentation is not appropriate, name the expected behavior change and the qualitative evidence that would support or challenge it. Feature flags provide controlled exposure, while observability helps the team understand why behavior changed rather than merely showing that it changed.
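
Writing down the minimum detectable effect also tells you roughly how much traffic the test needs. The sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes a two-sided test, equal-sized arms, and an absolute (not relative) lift; the function name and the example numbers are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical example: 20% baseline activation, 2-point absolute lift.
print(sample_size_per_arm(baseline=0.20, mde=0.02))
```

With these example inputs the estimate comes out at roughly 6,500 users per arm. Halving the detectable effect roughly quadruples the requirement, which is why the number belongs in the plan before exposure begins.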

The workflow closes only when actual outcomes return to discovery. Comparing expected and actual outcomes, harvesting qualitative feedback, and feeding the result back into the evidence system turns a release into organizational learning. Without that return path, the model keeps retrieving yesterday’s beliefs even after the product has disproved them.

Engineer context, evaluations, and decision rights together

Reliability cannot be added as a final quality check. Every AI transformation can lose evidence, introduce unsupported language, or carry stale assumptions into the next stage. The workflow needs controls at the moment each failure can occur.

Give each task a context contract

One large prompt that tries to perform discovery, prioritization, specification, and planning will accumulate irrelevant material and conflicting instructions. Break the workflow into smaller tasks, each with a compact context contract:

The decision or job the output must support.
The approved evidence the model may use.
The constraints and non-negotiable requirements.
The information the model must not infer.
The required output structure.
The conditions that require human review.
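
A context contract can live as a small structure that renders into the task prompt. The sketch below is one possible shape, with hypothetical class, field, and section names and made-up example content. It is not tied to any particular model API.

```python
from dataclasses import dataclass
from typing import List


def _bullets(items: List[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


@dataclass
class ContextContract:
    job: str
    approved_evidence: List[str]
    constraints: List[str]
    must_not_infer: List[str]
    output_structure: str
    review_triggers: List[str]

    def render(self) -> str:
        """Render the six parts of the contract as labeled prompt sections."""
        sections = [
            ("Job", self.job),
            ("Approved evidence (use nothing else)", _bullets(self.approved_evidence)),
            ("Constraints", _bullets(self.constraints)),
            ("Do not infer", _bullets(self.must_not_infer)),
            ("Output structure", self.output_structure),
            ("Escalate to a human when", _bullets(self.review_triggers)),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Hypothetical example content.
contract = ContextContract(
    job="Draft acceptance criteria for the approved setup-flow change.",
    approved_evidence=["ticket-1", "interview-12"],
    constraints=["No changes to billing screens"],
    must_not_infer=["Customer revenue", "Root cause not stated in evidence"],
    output_structure="Numbered list of Given/When/Then criteria",
    review_triggers=["Evidence conflicts", "A required field cannot be filled"],
)
prompt = contract.render()
print(prompt)
```

Keeping the contract as data rather than free text makes it easy to version, diff, and reuse in evaluations.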

Compact task prompts, curated turns, external memory, repeated critical instructions, and isolated sub-agents are practical ways to manage a limited context window. Use external state for durable decisions and retrieve only the relevant slice for the current task. Repeat a critical constraint when the context grows rather than assuming an earlier mention will retain equal influence.

Use a sub-agent when a task benefits from an isolated context or a separate review, such as checking a PRD against approved evidence. Do not add one merely to make the system look agentic. Every additional agent creates another handoff whose inputs, outputs, permissions, and failure behavior must be evaluated.

Build an evaluation harness before scaling the workflow

An evaluation should answer a repeatable question: does this workflow produce an acceptable result on representative work? A few impressive demonstrations do not tell you whether a prompt, retrieval change, or model update made the system more dependable.

Start with real task types your team already performs. Preserve representative inputs, the evidence that should be used, the requirements an acceptable output must satisfy, and known failure conditions. Then run those cases whenever you change the prompt, model, retrieval logic, tool permissions, or output schema.

Evaluate at least these dimensions:

Grounding: Can each important claim be traced to approved evidence?
Fidelity: Did the output preserve material differences, uncertainty, and constraints rather than flattening them into a convenient narrative?
Completeness: Are the fields required for the next decision present?
Decision usefulness: Does the output help a named owner make a specific choice?
Data handling: Did the workflow respect access, redaction, and retention rules?
Format and tool behavior: Did the model follow the schema and use only permitted systems or actions?
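
A first harness can be small. The sketch below automates two of those dimensions, grounding and completeness, over a fixed set of cases and reports a pass rate for each. It assumes the workflow returns a dictionary with a `citations` list; the case format, check names, and `toy_workflow` stand-in are hypothetical. Dimensions such as fidelity and decision usefulness usually need human or rubric-based review.

```python
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]
Output = Dict[str, Any]


def check_grounding(output: Output, case: Case) -> bool:
    """Pass only if there are citations and all of them are approved evidence."""
    cited = set(output.get("citations", []))
    return bool(cited) and cited <= set(case["approved_evidence"])


def check_completeness(output: Output, case: Case) -> bool:
    """Pass only if every required field is present and non-empty."""
    return all(output.get(name) for name in case["required_fields"])


CHECKS: Dict[str, Callable[[Output, Case], bool]] = {
    "grounding": check_grounding,
    "completeness": check_completeness,
}


def run_evals(workflow: Callable[[str], Output], cases: List[Case]) -> Dict[str, float]:
    """Run every case through the workflow and return the pass rate per check."""
    passed = {name: 0 for name in CHECKS}
    for case in cases:
        output = workflow(case["input"])
        for name, check in CHECKS.items():
            passed[name] += int(check(output, case))
    return {name: count / len(cases) for name, count in passed.items()}


# Hypothetical stand-in for the real workflow; it always cites ticket-1.
def toy_workflow(task: str) -> Output:
    return {"summary": f"Findings for {task}", "citations": ["ticket-1"], "owner": ""}


cases = [
    {"input": "export failures", "approved_evidence": ["ticket-1", "call-7"],
     "required_fields": ["summary"]},
    {"input": "onboarding drop-off", "approved_evidence": ["interview-4"],
     "required_fields": ["summary", "owner"]},
]
scores = run_evals(toy_workflow, cases)
print(scores)  # {'grounding': 0.5, 'completeness': 0.5}
```

Rerunning the same cases after each prompt, model, or retrieval change turns "it seems better" into a comparison of pass rates.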

Eval-driven development makes prompts and heuristics repeatable. It also gives you a safer way to adopt new models: compare them against the same task set instead of judging them from a fresh demo with different inputs.

Measure learning flow, not AI activity

Documents generated, prompts executed, and summaries produced are activity measures. They can rise while product decisions become less reliable. Use four layers of measurement instead:

Learning flow: Time from a customer signal to an evidence-backed decision, time spent waiting at handoffs, and rework caused by missing context.
AI quality: Evaluation results by task, unsupported claims found during review, required fields missed, and human corrections before approval.
Customer outcome: The activation, adoption, retention, or other behavior named in the original hypothesis.
Delivery health: Deployment frequency, change failure rate, and the operational signals relevant to the release.

Keep decision rights visible beside those measures. The model may propose a priority, but the accountable product leader approves it. The model may draft a customer interpretation, but the product trio validates it against evidence. The model may prepare a release plan, but engineering owns operational readiness. Feature flags, access controls, and human approval are not signs that the workflow is insufficiently automated. They are what make greater automation responsible.

Log the decision, evidence references, model version, prompt or workflow version, retrieval configuration, evaluation result, and approving owner. Documenting decisions, model versions, and test artifacts makes a nuanced call auditable and gives the team a concrete starting point when quality changes.

Key takeaways: a 30/60/90-day rollout

Do not begin by automating the full product lifecycle. Start with one recurring decision, connect its evidence to its outcome, and prove that the loop can be operated reliably. A practical 30/60/90 sequence expands from the evidence foundation to selected workflows and then into planning and delivery.

Days 1-30: Map the evidence systems used for one recurring product decision. Define the decision packet, access rules, retrieval path, current human gates, and initial evaluation cases. Build the smallest retrieval-first pipeline that can preserve links from a recommendation back to original evidence.
Days 31-60: Pilot continuous discovery and PRD drafting. Keep approval manual, evaluate representative cases, record recurring corrections, and tighten the context contract. Do not expand until the team can identify why an output passed or failed.
Days 61-90: Extend the proven pattern to prioritization and experiment design. Connect approved outputs to planning, CI/CD, feature flags, and observability. Feed release outcomes and customer feedback back into the evidence spine.

By the end of the rollout, you should be able to trace an AI recommendation to customer evidence, reconstruct why a decision changed, detect a quality regression after a workflow update, and compare the expected outcome with what happened after release. If one of those paths is missing, fix it before adding another agent or automating another handoff.

Your next move can be small. Choose one product decision scheduled for this week. Put its evidence, assumptions, success condition, and state into a decision packet. Then follow that packet through discovery, delivery, and the first outcome review. That single trace will reveal where your workflow is genuinely AI-native and where faster drafting is only hiding an old bottleneck.

February 11, 2026

From Chaos to Clarity with Claude Code: My Hands-On Playbook for Product Leaders

I’ve been pushing hard to operationalize AI for real product work, and this episode zeroes in on the moment Claude Code stops feeling like a demo and starts behaving like a dependable teammate. If you’ve ever wondered how to go from clever prompts in the browser to durable, repeatable workflows on your machine, this walkthrough is for you.

Listen on: Spotify | Apple Podcasts.

My first honest reaction to installing and configuring the desktop agent was the all-too-relatable “this tool thinks everything is a code repo” reality. That framing helped me reset expectations fast: instead of treating it like a magical universal assistant, I began designing guardrails, context, and repeatable routines—exactly how I’d onboard a new team member.

The shift from Claude-in-the-browser to Claude Code on my machine was the unlock. Locally, it can finally work with my files, folders, and workflows. That meant I could ground it in real artifacts—project docs, meeting notes, product specs, and historical decisions—so responses weren’t just plausible; they were contextual and verifiable.

On setup, I now treat /init and CLAUDE.md files as my product requirements. I define roles, boundaries, and canonical sources up front, then run in a deliberate “walled garden.” The “treat it like an intern” model works beautifully: scope access intentionally, expand privileges as trust grows, and keep a tight audit trail of what it can touch and why.

Surprisingly, task management became my ideal on-ramp. It’s easy to validate, the feedback loops are tight, and the ROI is immediate. I export calendar windows rather than granting full calendar access, then let the agent map priorities into Trello, reconcile time blocks, and surface trade-offs. Fast wins build confidence—mine and the agent’s.

Model switching matters more than I expected. When speed is king and “good enough” will do, Haiku keeps the loop snappy. When stakes are higher—complex synthesis, nuanced product strategy, or gnarly ambiguity—I step up to Claude Opus 4.5. Being intentional about when to optimize for latency versus depth is a quiet superpower.

Web tasks can still spiral. When that happens, I pause its autonomy, toggle to fewer steps, and ask, “What are you doing?” Paired with Claude’s Web fetch tool, this gets the agent to summarize its plan and current step in plain language rather than dump raw reasoning, so I can spot brittle assumptions, prune distractions, and re-ground the task.

Content retrieval has become a killer workflow. I point the agent at my archives—blog posts, book drafts, transcripts, notes—and ask, “Where have I talked about this before?” It assembles a map of prior art, connects themes I’d forgotten, and prevents me from reinventing work. Over time, this evolves into a Zettelkasten-style research system that upgrades rigor and accelerates synthesis.

I’ve also turned Claude Code into a publishing engine. From a single transcript, it drafts titles, descriptions, show notes, and chapters, then routes artifacts to Ghost for formatting. Before anything ships, I run fact-checking workflows that validate claims against transcripts and research sources. The output improves, but more importantly, the scaffolding makes quality repeatable.

Reusable workflows compound. I rely on slash commands to trigger common jobs, break down larger efforts with sub-agents, and wire in hooks and plugins where external systems are needed. This is agentic AI at its most practical: fewer hero prompts, more reliable processes.

Audience analytics and content prioritization are helpful with caveats. I let the agent cluster themes and flag gaps, then I pressure-test its suggestions against first-party data and strategic goals. As with any model-driven insight, triangulation beats blind faith.

Two metaphors guide my day-to-day. First, Claude Code is like a dog—sometimes it returns with the stick, sometimes it gets lost in the woods. Second, the “intern” framing keeps me honest: don’t hand it the whole company on day one. With that mindset, my output jumped—more volume without sacrificing quality—because the workflow scaffolding got better.

In this episode, I cover what Claude Code is and why it’s useful even if you’re not an engineer, the real difference between the browser experience and running locally, how to shape behavior with /init and CLAUDE.md files, why task management is the perfect proving ground, when to export calendar windows versus connecting directly, and when model switching makes sense: Haiku for speed, Opus for depth.

I also dig into debugging web tasks by asking “What are you doing?”, content retrieval workflows across personal archives, building reusable slash-command systems with sub-agents, hooks, and plugins, practical publishing stacks from transcripts, fact-checking against transcripts and research sources, and using analytics to prioritize content—with a healthy respect for uncertainty.

If you’ve been trying to make Claude Code feel less like “throwing a stick into the woods,” this is the candid, tactical tour I wish I’d had on day one. Drop your questions and experiments below—I’m eager to compare notes and refine the playbook together.

Inspired by this post on Product Talk.

February 10, 2026
Build CX Scores You Can Defend: My 5-step playbook for transparent, trustworthy AI metrics

“You don’t have to trust the algorithm; you can see exactly why a conversation earned the score it did.”

We recently shared how we redesigned CX Score to deliver deeper, more actionable insights across every conversation. The most common follow-up from support leaders was simpler and incredibly important: “Can I trust it?” It’s the right question—and it’s the one I use as my own bar for whether a metric is ready for the C‑suite.

CS teams are the subject matter experts on customer experience. They understand the nuance of what customers feel, the context behind every interaction, and the difference between a technically resolved issue and a genuinely satisfied customer. I’ve learned, conversation by conversation, that any metric we ship has to capture that nuance at scale—or it doesn’t deserve to be used.

We built CX Score to give support teams a complete view of how their customers feel across every conversation. It surfaces what’s working, what’s not, and why—so leaders can communicate impact clearly and drive change across support, product, and the wider business.

A CX Score in action: repeated CSV export failures trigger a low score and customer frustration, while the AI agent clarifies next steps and gathers details—turning raw signals into actionable support insights.

Here’s exactly how I approached building a trustworthy metric that support leaders can inspect, explain, and defend.

1) It’s grounded in how support teams define quality. I started with how experienced support professionals actually evaluate conversations—collecting real examples of strong, mixed, and poor interactions across industries, identifying the specific factors that shape overall experience, and writing plain-English rules for each. The result: CX Score applies the same criteria a trained support professional would use, not generic LLM assumptions.

2) It’s aligned with human judgment. We created a dataset of thousands of real customer conversations spanning multiple industries, languages, channels, and agent types. Each was manually reviewed by experienced support professionals—with two reviewers per conversation where possible and disagreement resolution to create stable consensus labels. The result: CX Score is trained and tested to behave like an expert reviewer, not a language model making broad guesses.

A modern CX analytics view shows how conversations flow from chat, email, and mobile into AI assistance, then to resolutions and sentiment outcomes—turning messy support data into a single, defensible CX Score.

3) It’s engineered by AI specialists. CX Score isn’t a prompt attached to an LLM. It’s a production system built by Intercom’s AI Group: 37 ML scientists and 350 engineers whose full-time focus is AI for customer service. The system includes specialized handling for long transcripts, model configuration tailored for support language and subtle sentiment, prompt engineering designed to default to neutral when evidence is weak, and a multi-stage evaluation pipeline that checks for precision, consistency, and reliability. The result: A metric built by a team that understands LLM behavior in production support environments, where accuracy and consistency matter most.

4) It’s validated statistically, not qualitatively. Trust requires measurement, not vibes. We tested CX Score across standard ML metrics: precision (when the model flags a negative experience, how often do humans agree?), recall (how many human-identified issues does it catch?), and F1 score (the harmonic mean of the two). We set an explicit bar: F1 above 0.8, representing high agreement with human judgment. We reran these evaluations through every revision, checking for regressions or biases, and I focused especially on negative experiences, because a false negative hides a real problem. The result: CX Score meets a measurable standard before it ships. That standard is a statistical requirement, not a gut check.
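
For readers who want the arithmetic behind that bar, here is a minimal sketch of the three metrics computed against human labels. The labels are invented for illustration, and the function name and the `F1_BAR` constant are hypothetical; this is not the production evaluation pipeline.

```python
from typing import List, Tuple

F1_BAR = 0.8  # the release bar described above


def precision_recall_f1(human: List[bool], model: List[bool]) -> Tuple[float, float, float]:
    """Compare model flags with human labels; True means a negative experience."""
    tp = sum(h and m for h, m in zip(human, model))
    fp = sum((not h) and m for h, m in zip(human, model))
    fn = sum(h and (not m) for h, m in zip(human, model))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels for ten conversations.
human = [True, True, True, True, False, False, False, False, False, False]
model = [True, True, True, False, True, False, False, False, False, False]

precision, recall, f1 = precision_recall_f1(human, model)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("meets bar" if f1 > F1_BAR else "below bar")
```

In this toy example the model catches three of four real problems and raises one false alarm, so precision, recall, and F1 all land at 0.75, which is below the bar. The missed conversation is the false negative, the error type that hides a real problem.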

5) It was battle-tested with real customers. Lab accuracy isn’t enough. Customer environments are messy: Varied ticket types, mixed languages, unpredictable edge cases. Before release, we ran a multi-phase field test—shadow-scoring conversations with both old and new models, validating sensible behavior across agent type and conversation length, then rolling out to a controlled customer group who confirmed the scores felt right, reasons were clear, and insights were actionable. The result: CX Score shipped because real teams told us it made sense in practice, not because it passed internal tests.

From conversation to clarity: this visual maps the drivers behind a CX Score. Explore how policy feedback, answer quality, and effort combine to produce defendable insights support leaders can act on.

The importance of explainability. One of the most critical choices I made was ensuring CX Score isn’t a black box. Every score comes with clear reasons, concrete excerpts, and a short explanation of what influenced the rating. This turns the metric into something you can inspect, audit, and explain to executives. You don’t have to trust the algorithm. You can see exactly why a conversation earned the score it did.

A metric that evolves with your business. Customer expectations shift. Products change. AI improves. A trustworthy metric can’t be static. CX Score evolves with the same commitments that shaped its redesign: Evaluate the real signals that shape customer experience, keep the logic simple and interpretable, and ensure leaders can make clear decisions from it. It’s built to be a durable source of truth across every conversation.

The takeaway. In a world where products look the same and AI can generate any interaction, customer experience is one of the few differentiators that actually matters. Support leaders have built that expertise conversation by conversation. What they’ve lacked is a measurement system that could validate it at scale—one that’s reliable enough to report to the C-suite, explainable enough to defend in strategy meetings, and rigorous enough to drive real decisions. That’s what CX Score is designed to be: A metric that reflects the reality support leaders see every day, backed by the technical rigor to make it credible everywhere else.

Want to see CX Score in your workspace? Ask your admin to enable it for your team, and start using explainable AI insights to improve customer experience and coach with confidence.

Inspired by this post on The Intercom Blog.

February 9, 2026
AI Agent Deployment Mastery: My Proven Checklist to Ship Safely, Faster, and at Scale

Shipping AI agents is not like shipping a typical feature. The system learns, reasons, and takes action in unpredictable environments, and when it’s customer-facing, the stakes are high. Over the past few years, I’ve refined a practical checklist that helps my teams move quickly without breaking trust. It balances speed with safety, and ambition with accountability—exactly what you need to scale agentic AI in production.

This checklist was forged in real launches—some smooth, some humbling. Early on, I watched an otherwise brilliant agent confidently offer a refund policy we didn’t have. That one incident made it clear: AI agents require a higher bar for guardrails, evals, and observability. Today, I won’t greenlight an AI rollout without these steps being explicit, owned, and testable.

Start with outcomes, not output. I define the job-to-be-done, the target users, and the measurable business impact using outcome-based OKRs (rather than output-based ones) and driver trees. Success is not “ship an agent,” it’s “reduce first-response time by 40% with no drop in CSAT,” or “increase qualified demo bookings by 20% at a lower cost per acquisition.” Clear outcomes give the agent a purpose and the team a north star.

Prepare the knowledge the agent will use. A retrieval-first pipeline beats raw prompting for most enterprise cases. I inventory sources of truth, set access controls, and enforce data governance from day one. That includes PII handling, redaction, retention policies, and privacy-by-design. If the agent can’t reliably retrieve the right fact at the right time, the rest doesn’t matter.

Choose models and prompts with discipline. I align model selection with context window management, cost, latency, and tool-use requirements. Then I build prompts and tools together, not in isolation, and I keep temperature, stop conditions, and function-calling explicit. Most importantly, I use eval-driven development: golden datasets, task-specific metrics (accuracy, helpfulness, latency, cost), and target thresholds that must be met before widening rollout.
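
A rollout gate of this kind can be sketched in a few lines. The metric names, target values, and the `failed_thresholds` helper below are assumptions made for illustration; the point is only that a run on the golden dataset either meets every threshold or names the ones it missed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool = True


def failed_thresholds(results: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the metrics that miss their target; a missing metric counts as a failure."""
    failures = []
    for t in thresholds:
        value = results.get(t.metric)
        if value is None:
            failures.append(t.metric)
        elif t.higher_is_better and value < t.limit:
            failures.append(t.metric)
        elif not t.higher_is_better and value > t.limit:
            failures.append(t.metric)
    return failures


# Hypothetical targets and one hypothetical run against the golden dataset.
THRESHOLDS = [
    Threshold("accuracy", 0.90),
    Threshold("helpfulness", 0.80),
    Threshold("p95_latency_s", 4.0, higher_is_better=False),
    Threshold("cost_per_task_usd", 0.05, higher_is_better=False),
]
run = {"accuracy": 0.93, "helpfulness": 0.78,
       "p95_latency_s": 3.2, "cost_per_task_usd": 0.04}

blocked_by = failed_thresholds(run, THRESHOLDS)  # ["helpfulness"]
can_widen_rollout = not blocked_by               # False
```

Returning the list of failed metrics, rather than a bare pass or fail, gives the team something specific to fix before the next run.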

Manage AI risk upfront. I treat jailbreaks, toxicity, and data leakage as product risks, not just security issues. I implement layered defenses—input/output filtering, policy checks, rate limits, and abuse monitoring—and define escalation paths and human-in-the-loop handoffs for ambiguous cases. Every risky capability needs an owner, a playbook, and a test.

Build the pipeline that lets you iterate safely. Prompts, tools, policies, and retrieval configs go through the same CI/CD rigor as code. I use feature flags for progressive delivery, canary cohorts to limit blast radius, and clear rollback procedures. Observability isn’t optional; I track latency, token usage, cost, failure modes, and user outcomes. I also watch DORA metrics, deployment frequency in particular, to ensure we’re improving the engine, not just the output.
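
One common way to build a canary cohort behind a feature flag is deterministic hash bucketing. The sketch below assumes string user IDs and a percentage-based flag; the function and flag names are hypothetical and not tied to any particular feature-flag product.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(user_id: str, flag_name: str, rollout_percent: int) -> bool:
    """A user is in the canary cohort when their bucket falls below the rollout percentage."""
    return rollout_bucket(user_id, flag_name) < rollout_percent


# Widening from 5% to 25% keeps every user who was already in the cohort,
# and setting the percentage to 0 acts as an immediate rollback.
in_canary = is_enabled("user-42", "agent_v2", rollout_percent=5)
rolled_back = is_enabled("user-42", "agent_v2", rollout_percent=0)  # always False
```

Because the bucket depends only on the flag name and the user ID, a user sees consistent behavior across sessions, and different flags get independent cohorts.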

Constrain autonomy intentionally. Agent behavior design matters as much as model choice. I set step limits, define tool whitelists, separate read vs write permissions, and specify decision checkpoints. When the agent is uncertain or confidence drops below a threshold, it hands off to a human or a deterministic workflow. Guardrails aren’t barriers; they’re bumpers that keep you on the track.
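
These constraints can be expressed as a small policy check that runs before every tool call. The tool names, the step limit, and the confidence threshold below are placeholder assumptions; how confidence is estimated is left out of this sketch entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPolicy:
    max_steps: int = 8
    min_confidence: float = 0.7
    read_tools: frozenset = frozenset({"search_kb", "get_order"})
    write_tools: frozenset = frozenset({"issue_refund"})
    writes_need_approval: bool = True  # decision checkpoint for write actions


def next_action(policy: AgentPolicy, step: int, tool: str, confidence: float) -> str:
    """Return 'run', 'checkpoint', or 'handoff' for a proposed tool call."""
    if step >= policy.max_steps:
        return "handoff"  # step limit reached
    if confidence < policy.min_confidence:
        return "handoff"  # too uncertain to continue autonomously
    if tool in policy.read_tools:
        return "run"
    if tool in policy.write_tools:
        return "checkpoint" if policy.writes_need_approval else "run"
    return "handoff"  # tool is not on the allowlist


policy = AgentPolicy()
decisions = [
    next_action(policy, 0, "search_kb", 0.92),       # "run"
    next_action(policy, 1, "issue_refund", 0.92),    # "checkpoint"
    next_action(policy, 2, "delete_account", 0.99),  # "handoff"
    next_action(policy, 3, "search_kb", 0.40),       # "handoff"
]
```

Anything the policy does not explicitly allow falls through to a handoff, so adding a new capability is a deliberate change to the policy and not a side effect of a prompt edit.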

Instrument what users experience, not just what models produce. I track activation, task success, self-serve completion rates, and time-to-value. I pair Agent Analytics with journey analytics so I can see where the agent helps or hurts. I also invest in UX trust cues—transparent explanations, undo paths, and in-app guides—so users feel in control. When the agent changes behavior through learning, the interface should make that understandable.

If you’re shipping a voice AI agent, test in realistic conditions. I set targets for ASR accuracy, barge-in responsiveness, TTS prosody, and end-to-end latency. I predefine safe transfer logic for complex calls and ensure compliance for call recording and data retention. Voice amplifies both the magic and the mistakes; operational excellence is non-negotiable.

Plan the business rollout like a product, not a press release. I align pricing (often consumption-based SaaS pricing), packaging, and SLAs with actual unit economics: the cost of tokens, inference, and retrieval. I equip solutions engineering with playbooks and reference architectures, wire up CRM integration for attribution, and put feedback loops into Intercom or the support stack so we learn from every interaction.

Run operations like an SRE team. I define incident severity for AI-specific failures (e.g., harmful output, runaway cost, degraded retrieval), add alerting, and keep runbooks current. I schedule postmortems that feed directly into eval baselines and backlog priorities. Continuous discovery isn’t a ceremony; it’s the safety net that keeps improvements compounding.

Close the loop on compliance and governance. From day zero, I document data flows, vendor scopes, and audit logs. I verify regulatory compliance and adopt privacy-by-design so I’m not retrofitting later. Transparency, user consent, and opt-outs aren’t just legal checkboxes; they’re trust-building tools that differentiate your product.

The result of this checklist is speed with confidence. It gives my teams a common language to debate trade-offs, a clear path to production, and the guardrails to scale safely. If you’re preparing to deploy an agent, adapt these steps to your stack and your customers. Your future self—and your users—will thank you.

Inspired by this post on Product School.

February 9, 2026
Vibe Coding Unleashed: How Parallel Agents Build KPI Driver Trees in Under Two Hours

I’ve been exploring what I call the next level of vibe coding: orchestrating agentic AI to build complex product artifacts in minutes, not days. The breakthrough comes from ditching linear handoffs and embracing true parallelism—letting specialized agents tackle the work simultaneously while I steer the orchestration. In product management contexts where speed and clarity matter, this shift changes everything.

Building a KPI Driver Tree in two hours becomes possible when you stop building sequentially and start building with parallel agents.

For product leaders, a KPI Driver Tree is the fastest way to make strategy legible. It ties high-level outcomes to the levers we can actually pull (features, channels, pricing, onboarding, activation, and retention mechanics) so we can prioritize with confidence. Done well, it anchors OKRs in outcomes rather than output, clarifies measurement, and aligns the team around a shared, testable model of growth.
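
As a data structure, a driver tree is just a nested hierarchy with the outcome at the root and the levers at the leaves. The sketch below is one possible representation; the `Driver` class and every metric name in the example tree are placeholders I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """One node in the tree: an outcome, an intermediate driver, or a lever at the leaves."""
    name: str
    children: list["Driver"] = field(default_factory=list)


def levers(node: Driver) -> list[str]:
    """Collect the leaf nodes, which are the levers a team can actually pull."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(levers(child))
    return found


def paths(node: Driver, prefix: tuple[str, ...] = ()) -> list[str]:
    """Spell out how each lever rolls up to the top-level outcome."""
    here = prefix + (node.name,)
    if not node.children:
        return [" -> ".join(reversed(here))]
    lines = []
    for child in node.children:
        lines.extend(paths(child, here))
    return lines


# Hypothetical tree; the metric names are placeholders.
tree = Driver("Weekly active teams", [
    Driver("Activation rate", [
        Driver("Onboarding checklist completion"),
        Driver("Time to first dashboard"),
    ]),
    Driver("Retention", [
        Driver("Weekly report usage"),
    ]),
])

for line in paths(tree):
    print(line)
```

Printing the paths gives one line per lever, each reading from the lever up to the outcome, which is a quick way to check that every proposed initiative attaches to something measurable.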

Here’s how I operationalize it with agentic AI and AI workflows. I spin up a small team of specialized parallel agents: a Metrics Librarian (taxonomy and definitions), a Data Modeler (event and table design), a Research Synthesizer (voice of customer and causal hypotheses), a UX Prototyper (visualizing the tree and flows), and a QA/Evaluator (logic and consistency checks). An Orchestrator coordinates these agents, resolves conflicts, and composes outputs into a single, production-ready artifact—while I set constraints, review deltas, and decide.

In a typical two-hour sprint, all agents run at once. While the Metrics Librarian finalizes the KPI ontology, the Data Modeler validates instrumentable events and joins, and the UX Prototyper renders an interactive driver tree for a unified analytics platform. Meanwhile, the Synthesizer maps qualitative insights to quantitative levers, and the Evaluator stress-tests assumptions. Because we’re not waiting for sequential handoffs, we converge on a coherent driver tree and its initial measurement plan in one pass.
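
The orchestration pattern itself is simple to sketch with `asyncio`. In the example below, `run_agent` is a stand-in that sleeps where a real agent would call a model, and the agent names and task strings are assumptions taken loosely from the roles described above.

```python
import asyncio


async def run_agent(name: str, task: str, seconds: float) -> dict:
    """Stand-in for a real agent call; it sleeps where a model call would go."""
    await asyncio.sleep(seconds)
    return {"agent": name, "output": f"{task} (draft)"}


async def orchestrate() -> dict:
    """Start every agent at once, then compose the outputs into one artifact."""
    jobs = [
        run_agent("metrics_librarian", "KPI ontology", 0.03),
        run_agent("data_modeler", "event and table design", 0.02),
        run_agent("research_synthesizer", "causal hypotheses", 0.02),
        run_agent("ux_prototyper", "interactive driver tree", 0.03),
        run_agent("qa_evaluator", "consistency checks", 0.01),
    ]
    # gather runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(*jobs)
    return {r["agent"]: r["output"] for r in results}


artifact = asyncio.run(orchestrate())
```

Total wall-clock time is roughly that of the slowest agent and not the sum of all five, which is the whole argument for dropping sequential handoffs. Conflict resolution between agents is the hard part and is not shown here.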

The payoff isn’t just speed; it’s higher-quality decisions. Parallel agents reduce context loss, expose trade-offs earlier, and allow me to compare multiple viable paths side-by-side. This accelerates continuous discovery, aligns with product strategy, and gives product managers, and the LLMs that assist them, a clear, living map of how inputs roll up to outcomes. It’s the closest I’ve found to running a product trio at machine speed.

Guardrails matter. I pair this approach with strong data governance, privacy-by-design, and eval-driven development so every agent’s output is testable and auditable. Clear prompts, scoped corpora, and consistent acceptance criteria keep the Orchestrator honest, while lightweight Agent Analytics helps me see where reasoning falters and where to improve the system.

If your team is still tackling analytics artifacts sequentially—requirements, then instrumentation, then visualization—consider switching mental models. Treat the driver tree as the backbone, empower parallel agents to co-create around it, and reserve human judgment for the critical calls. This is vibe coding for product management: creative, fast, and grounded in measurable outcomes.

Inspired by this post on Pendo – Best Practices.

February 5, 2026

How to Operationalize Amplitude AI Visibility Upgrades

If your team has plenty of dashboards but still spends too much time turning a product question into a cohort, an explanation, and a decision, the bottleneck is no longer data collection. It is the work between asking the question and acting on the answer.

Amplitude AI Visibility now combines content generation, natural-language segmentation, a cleaner interface, and reliability improvements. That can shorten the path to insight, but only if you place those capabilities inside a disciplined product workflow. The goal is not to generate more analysis. It is to make sound decisions sooner without weakening review, governance, or accountability.

Treat the upgrade as a decision system, not an AI shortcut

A weak rollout starts by giving everyone access and encouraging them to try prompts. That produces activity, but it does not establish whether the technology is improving product work.

Define the unit of value as a completed decision. Each use of AI Visibility should move through a traceable sequence:

Start with a specific product question that could change an action.
Translate the question into an explicit cohort and metric definition.
Examine the relevant behavioral evidence.
Draft a narrative that separates observations from interpretations.
Record the decision, owner, and next action.

The enhancements reduce different kinds of friction inside that sequence. AI chat can reduce the interface work involved in expressing a segment. Content generation can reduce the effort required to turn analysis into a readable brief. A clearer interface can make the workflow easier for cross-functional partners to follow. Reliability improvements can support confidence in the system. None of those changes removes the need to define the question or approve the conclusion.

I would begin with two or three recurring, high-value use cases, not every analytics task. A good pilot question appears often, has a trusted baseline for comparison, and ends in a recognizable decision. Activation analysis, churn exploration, and experiment reporting meet those conditions for many product teams.

Match each enhancement to a concrete product job

Do not ask a team to use AI for analytics in the abstract. Give each workflow an input contract: the decision being considered, the population, the behavior, the observation period, the metric, and the exclusions. This prevents a fluent prompt from hiding an underspecified question.
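
An input contract can be as light as a typed record plus a completeness check that runs before any prompt is written. The field names below mirror the list above; the class, the helper, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InputContract:
    decision: str
    population: str
    behavior: str
    observation_period: str
    metric: str
    exclusions: tuple[str, ...]  # write ("none",) explicitly if nothing is excluded


def missing_fields(contract: InputContract) -> list[str]:
    """Names of fields left empty; an empty list means the question is fully specified."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Hypothetical draft with the metric not yet agreed.
draft = InputContract(
    decision="Whether to redesign the onboarding checklist",
    population="Users who signed up in the last full quarter",
    behavior="Created a first dashboard",
    observation_period="14 days after signup",
    metric="",
    exclusions=("internal accounts", "test workspaces"),
)

gaps = missing_fields(draft)  # ["metric"]
ready_for_prompt = not gaps   # False
```

The check is deliberately strict about exclusions: an empty tuple is treated as unanswered, so the author has to state that nothing is excluded instead of leaving it implied.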

Find an activation bottleneck without redefining activation

An activation question usually sounds simple: which new users reach value, and where do the others stop? The difficult part is deciding what counts as a new user, what behavior represents value, how long the observation period lasts, and which internal or test activity should be excluded.

Set those definitions before opening AI chat. Then describe the desired cohort in behavioral language and use chat-driven segmentation to iterate on it. Before analyzing the result, compare the AI-created segment with a known cohort, a manually configured version, or an established dashboard. If the populations differ, investigate the definition rather than explaining the chart.
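
When both cohorts can be exported as lists of IDs, the comparison is a few set operations. This sketch assumes such an export exists and uses made-up user IDs; it does not call any Amplitude API.

```python
def compare_segments(ai_ids: set[str], trusted_ids: set[str]) -> dict:
    """Summarize how closely an AI-created segment matches a trusted cohort."""
    overlap = ai_ids & trusted_ids
    union = ai_ids | trusted_ids
    return {
        # Jaccard similarity: 1.0 means identical membership.
        "jaccard": len(overlap) / len(union) if union else 1.0,
        "only_in_ai": sorted(ai_ids - trusted_ids),
        "only_in_trusted": sorted(trusted_ids - ai_ids),
    }


# Hypothetical exports of the two cohorts.
ai_segment = {"u1", "u2", "u3", "u9"}
manual_segment = {"u1", "u2", "u3", "u4"}

report = compare_segments(ai_segment, manual_segment)
# report["jaccard"] is 0.6; "u9" and "u4" are the members to investigate.
```

The two difference lists matter more than the similarity score: they point at the specific members whose inclusion or exclusion reveals how the definitions diverge.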

Once the segment is accepted, use content generation to draft a brief that identifies the observed drop-off, the affected population, the relevant comparison, and the question that deserves further discovery. Keep causal language out unless the evidence supports it. A funnel can show where behavior changes; it does not, by itself, explain why.

Explore churn precursors without turning correlation into cause

Churn analysis becomes unreliable when a cohort mixes users who never activated, customers who became inactive, and accounts that formally cancelled. Those are different states with different product implications.

Write a plain-language definition of the state you care about before generating the segment. A useful prompt pattern is: create a cohort of the specified customer population that completed the core behavior during the reference period but did not complete it during the comparison period; exclude internal and test activity; then separate the result by the business attribute relevant to the decision.
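
Writing the same definition as code is a useful way to check that the prompt pattern is unambiguous. The sketch below works on a hypothetical list of `(user_id, event_name, date)` tuples; the event name, periods, and plan attribute are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def in_period(day: date, period: tuple[date, date]) -> bool:
    start, end = period
    return start <= day <= end


def lapsed_cohort(events, core_event, reference, comparison, excluded_ids):
    """IDs that did the core behavior in the reference period but not in the comparison period."""
    active_ref, active_cmp = set(), set()
    for user_id, name, day in events:
        if name != core_event or user_id in excluded_ids:
            continue
        if in_period(day, reference):
            active_ref.add(user_id)
        if in_period(day, comparison):
            active_cmp.add(user_id)
    return active_ref - active_cmp


def split_by_attribute(user_ids, attribute_by_user):
    """Group the cohort by a business attribute such as plan."""
    groups = defaultdict(list)
    for user_id in sorted(user_ids):
        groups[attribute_by_user.get(user_id, "unknown")].append(user_id)
    return dict(groups)


# Hypothetical event log.
events = [
    ("u1", "report_created", date(2026, 1, 5)),
    ("u1", "report_created", date(2026, 2, 3)),
    ("u2", "report_created", date(2026, 1, 12)),
    ("u3", "report_created", date(2026, 1, 20)),
    ("qa-bot", "report_created", date(2026, 1, 8)),
    ("u4", "login", date(2026, 1, 9)),
]
reference = (date(2026, 1, 1), date(2026, 1, 31))
comparison = (date(2026, 2, 1), date(2026, 2, 28))

lapsed = lapsed_cohort(events, "report_created", reference, comparison, {"qa-bot"})
by_plan = split_by_attribute(lapsed, {"u2": "starter", "u3": "enterprise"})
```

Here `u1` stays out of the cohort because the behavior continued, `u4` never performed the core behavior, and the test account is excluded, leaving `u2` and `u3` split by plan.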

Use AI chat to test legitimate variations in that definition, not to invent the definition for you. When a behavioral difference appears, label it as a precursor or association until customer evidence or an experiment supports a causal explanation. The next action may be another analysis, a customer interview, or a retention experiment. It should not automatically be a roadmap commitment.

Draft experiment reports without delegating the decision

AI-generated experiment summaries are useful because the structure is repetitive even when the decision is not. Give the system the approved hypothesis, eligible population, exposure definition, primary outcome, guardrail measures, and underlying analysis. Ask for a draft that covers what changed, what remained uncertain, which segments require caution, and what decision the evidence supports.

The generated narrative should never become the statistical authority. The experiment analysis remains the record for effect estimates, uncertainty, and data-quality caveats. The brief exists to make that evidence understandable and actionable. If the prose and the analysis disagree, correct the prose before it travels to stakeholders.

Put human review around definitions and conclusions

AI can make a loosely defined request look finished. That is the central operating risk. The safest control is to review the workflow where meaning enters and where meaning leaves: validate the segment before interpreting the result, then validate the narrative before sharing it.

Validate the segment before reading the result

Confirm the identity unit. A user, device, workspace, and customer account are not interchangeable.
Check that event names and properties map to the team’s current tracking taxonomy.
Make inclusion rules, exclusions, sequence requirements, and observation periods explicit.
Compare membership or aggregate trends with a trusted manual definition when one exists.
Inspect surprising differences before using them as evidence. A mismatch may come from the cohort definition rather than user behavior.
Store a plain-language definition with the accepted cohort so another person can reproduce the analysis.

Validate the narrative before distributing it

Require each material claim to point back to a chart, table, or approved metric.
Separate observed behavior from a proposed explanation.
Verify that the population, date range, and comparison in the prose match the analysis.
Remove unsupported causal language and any detail the audience is not permitted to access.
State the decision, the remaining uncertainty, and the person responsible for the next action.

Content generation reduces drafting work; it does not transfer review responsibility to the model. This distinction is especially important for executive briefs, where polished language can make a weak inference appear more certain than it is.

Govern prompts, access, and workflow changes

Basic prompt templates, access policies, review steps, and data-governance controls turn experimentation into a repeatable capability. A prompt template should specify the business question, required definitions, exclusions, expected output, evidence standard, and reviewer. Access should follow the same least-privilege principles applied to the underlying analytics data.

Reliability also needs operational visibility. Keep a lightweight record of the original question, accepted cohort definition, supporting analysis, generated brief, reviewer, and resulting decision. When an answer changes unexpectedly, that record helps you distinguish a tracking problem from a cohort change, a prompt change, or an interpretation error.
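
A record like this does not need special tooling. The sketch below stores the fields named above in a frozen dataclass and diffs two records to show which part changed; the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DecisionRecord:
    question: str
    cohort_definition: str
    prompt: str
    analysis_ref: str
    brief_ref: str
    reviewer: str
    decision: str


def changed_fields(before: DecisionRecord, after: DecisionRecord) -> list[str]:
    """Name the parts of the workflow that differ between two runs of the same question."""
    return [f.name for f in fields(DecisionRecord)
            if getattr(before, f.name) != getattr(after, f.name)]


# Hypothetical records for the same question asked a month apart.
january = DecisionRecord(
    question="Where do new users stall before activation?",
    cohort_definition="Signed up in period; no first dashboard within 14 days",
    prompt="activation-bottleneck template v1",
    analysis_ref="funnel chart, January",
    brief_ref="activation brief, January",
    reviewer="analytics lead",
    decision="Run discovery interviews on the data-connection step",
)
february = replace(january, cohort_definition="Signed up in period; no first dashboard within 7 days")

diff = changed_fields(january, february)  # ["cohort_definition"]
```

If the answer shifts and the diff shows only the cohort definition changed, the place to look is the definition and not the tracking or the prompt.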

Measure whether the rollout changes product decisions

Prompt volume and generated summaries are adoption signals, not proof of value. Establish a baseline before the pilot, run the selected use cases through the new workflow, and compare the result using measures tied to decisions.

Signal	How to observe it	What a weak result means
Time-to-insight	Track elapsed time from an accepted question to a reviewed analysis brief.	If the time does not fall, find the handoff or review step that still creates delay.
Stakeholder adoption	Track whether product, design, engineering, growth, and leadership use the workflow in recurring decisions.	If only analysts use it, the interface or output may not fit cross-functional work.
Decision velocity	Track elapsed time from requesting evidence to recording an explicit decision or next action.	If output increases but decisions do not move sooner, the workflow is producing content rather than clarity.
Review quality	Count material corrections to cohort definitions, metrics, and conclusions before and after sharing.	If rework rises, improve the event taxonomy, prompt contract, validation process, or reviewer guidance before expanding access.
Trust exceptions	Record cases in which an AI-assisted result conflicts with validated analytics or cannot be reproduced.	If exceptions persist, pause expansion and resolve the data, definition, or workflow problem.

Judge the pilot as a system. Faster segmentation with heavy correction is not a win. Faster drafting with unchanged decision velocity is not a win either. The useful outcome is a shorter path from question to reviewed decision, with stable or improving quality.
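
That joint judgment can be written down as a simple rule. The sketch below compares median time-to-insight and the rate of material corrections per brief before and after the pilot; the function name, the verdict strings, and all the numbers are illustrative assumptions, and a real pilot would need more data than four observations.

```python
from statistics import median


def pilot_verdict(baseline_hours, pilot_hours,
                  baseline_corrections_per_brief, pilot_corrections_per_brief) -> str:
    """Treat the pilot as a win only if it is faster and review quality has held."""
    faster = median(pilot_hours) < median(baseline_hours)
    quality_held = pilot_corrections_per_brief <= baseline_corrections_per_brief
    if faster and quality_held:
        return "candidate for expansion"
    if faster:
        return "faster but heavier correction: fix quality first"
    return "no time saved: find the handoff or review step causing delay"


# Hypothetical hours from accepted question to reviewed brief.
baseline = [30, 42, 51, 38]  # median 40.0
pilot = [18, 25, 22, 40]     # median 23.5

verdict = pilot_verdict(baseline, pilot,
                        baseline_corrections_per_brief=1.5,
                        pilot_corrections_per_brief=2.0)
# "faster but heavier correction: fix quality first"
```

The median is used so that one unusually slow request does not dominate the comparison.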

Expand only after the pilot workflow is reproducible. At that point, turn the accepted prompt patterns, cohort definitions, review criteria, and measurement approach into a shared operating playbook. The cleaner interface can help more partners participate, but the playbook is what keeps participation consistent.

Key takeaways

Use Amplitude AI Visibility to shorten a decision workflow, not merely to increase the volume of segments and summaries.
Begin with two or three recurring use cases that have trusted baselines and recognizable decisions.
Define the population, behavior, period, metric, and exclusions before asking AI to create a segment.
Validate cohort meaning before interpreting behavior, then validate the generated narrative before sharing it.
Measure time-to-insight, stakeholder adoption, decision velocity, review quality, and trust exceptions together.
Scale the workflow only when faster output is accompanied by reproducibility and sound review.

Choose the next recurring product decision that still involves too much manual translation. Write its input contract, capture its current path to a reviewed decision, and use that single workflow to determine whether AI Visibility is removing the right friction.

References

Shivam.Consulting Blog – Amplitude’s AI Visibility Upgrade: Content Generation, Chat Segmentation, Sleeker UI – Why It Matters

February 5, 2026