Category: Product Management Leadership

Unlock Travel & Hospitality Growth: Product Benchmarks and Metrics Top Teams Rely On

I lead product teams building travel and hospitality experiences, and one lesson keeps repeating: companies that measure what matters move faster. Benchmarks turn gut feel into grounded product strategy, making it clear where activation, conversion, and retention are underperforming—and where we can unlock outsized growth.

Discover exclusive data and strategies from our Product Benchmark Report. Compare the travel and hospitality industry’s performance across key product metrics.

When I evaluate a product line, I start with a simple model: attract, convert, delight, and retain. For travel and hospitality specifically, I focus on search-to-book conversion, onboarding completion, first-booking activation rate, time-to-book, average booking value, cancellation rate, support contact rate, DAU/MAU stickiness, repeat booking rate, and long-term retention. These key product metrics reveal friction in discovery and checkout flows, surface pricing and inventory gaps, and quantify loyalty.
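As a minimal sketch, here is how three of these metrics can be computed from raw counts. Definitions vary by team; this version assumes search-to-book is bookings divided by searches, stickiness is average DAU divided by MAU, and repeat booking rate is repeat bookers divided by all bookers. The `rate` helper and every number are hypothetical and do not come from any benchmark report.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for one month, for illustration only.
searches = 40_000
bookings = 1_200
daily_active_users = 9_000       # average DAU over the month
monthly_active_users = 45_000    # MAU for the same month
bookers = 1_000                  # distinct users with at least one booking
repeat_bookers = 180             # distinct users with two or more bookings

search_to_book = rate(bookings, searches)
stickiness = rate(daily_active_users, monthly_active_users)
repeat_booking_rate = rate(repeat_bookers, bookers)

print(f"search-to-book conversion: {search_to_book:.1%}")
print(f"DAU/MAU stickiness:        {stickiness:.1%}")
print(f"repeat booking rate:       {repeat_booking_rate:.1%}")
```

Writing the definitions down this explicitly matters more than the arithmetic: a benchmark comparison only holds if your numerator and denominator match the ones your peers report.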

From there, I assemble a test-and-learn plan. Using Amplitude analytics to instrument the funnel and Pendo for in-app guides and product tours, my teams design A/B tests with a clear minimum detectable effect (MDE), prioritize hypotheses, and iterate weekly. This is classic product-led growth: reduce cognitive load in onboarding, streamline search and filter UX, clarify policies before payment, and personalize reactivation nudges to improve user activation and retention.
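To make the MDE concrete, here is a rough sketch of the sample size it implies. It uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The function name, the defaults (5% significance, 80% power), and the example baseline are my assumptions, not figures from the report or from either tool.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    if mde_abs <= 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and baseline + mde_abs must lie in (0, 1), mde_abs > 0")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # unpooled variance of the difference
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Hypothetical: 3.0% search-to-book baseline, aiming to detect a lift to 3.5%.
print(sample_size_per_arm(baseline=0.03, mde_abs=0.005))
```

Halving the MDE roughly quadruples the required traffic, which is why I ask teams to agree on the MDE before they commit to a weekly iteration cadence.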

Benchmarks are only as trustworthy as the underlying data. I insist on strong data governance, privacy-by-design practices, and clear event taxonomies so that insights remain reliable across quarters and across markets. That foundation keeps our decisions defensible with stakeholders and regulators while accelerating delivery.

Finally, we translate insights into action with crisp product roadmapping and sprint planning. Cross-functional product trios align OKRs to the biggest benchmark gaps, and we review progress in weekly performance rituals so every experiment ladders up to strategy. This cadence helps teams stay empowered and keeps leadership focused on outcomes, not output.

If you’re building in travel and hospitality, use these benchmarks as your starting line and your ongoing scorecard. Calibrate targets against peers, double down on what moves the needle, and let the data guide bold, customer-centered bets. When teams rally around meaningful metrics, momentum compounds.

Inspired by this post on Amplitude – Perspectives.

January 5, 2026

How to Build a Continuous Discovery Habit That Survives Delivery

Your team probably doesn’t lack discovery techniques. It loses discovery when delivery becomes urgent. Customer contact clusters around planning, the team commits to a solution, the calendar fills, and assumptions quietly harden into backlog items. Everyone stays busy, but no one can point to the customer evidence behind the next decision.

Durable continuous discovery is an operating rhythm, not a research phase. The goal isn’t to conduct more interviews. It is to shorten the distance between customer reality and the decisions shaping your product. A weekly rhythm owned by a product trio can reduce rework, sharpen strategy, and keep discovery alive while delivery continues.

See your real discovery system before changing it

Adding a recurring customer interview to the calendar won’t fix a decision process built around handoffs. If ideas arrive from executives, become requirements in product, move to design, and reach engineering as implementation work, the interview is an extra activity attached to the side of the system. It isn’t part of how the team decides.

Start by making the existing system visible. Map what actually happened, including the awkward shortcuts and informal approvals. Do not draw the process described in a playbook.

Spend 60 minutes drawing how your team decides what to build. Show where ideas enter, who shapes them, who approves them, where customers appear, and how the team decides whether the result worked.
Compare your drawing with the drawings made by product, design, and engineering. Differences are evidence that the team does not share the same decision model.
Audit every product decision from last week in a 30-minute session. Include small decisions, not just roadmap commitments.
For each decision, record who made it, what information informed it, and whether the team had direct customer input or received a secondhand interpretation.
Mark the places where discovery and delivery reconnect. A production problem, adoption signal, support request, or implementation constraint can create a new discovery question; it should not disappear into a separate queue.

Together, the process map and decision audit give you a baseline without turning discovery into a maturity score. Look for the mechanism behind the misses. Perhaps customer input arrives after commitment. Perhaps the product manager is the only person who interprets it. Perhaps the team can describe the solution but not the opportunity it addresses.

Track a compact baseline: how recently the team had direct customer contact, which current decisions include direct input, where cross-functional decisions become handoffs, and which active solutions lack a named customer opportunity. Do not set targets yet. First identify where evidence stops influencing action.
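If a spreadsheet feels too loose, the audit can be kept as a small structured record. The sketch below is one possible shape, not a prescribed template: the `Decision` fields, the three `customer_input` labels, and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    summary: str
    decided_by: str
    informed_by: str
    customer_input: str                 # "direct", "secondhand", or "none"
    opportunity: Optional[str] = None   # named customer opportunity, if any

def baseline(decisions):
    """Summarize where customer evidence reached last week's decisions."""
    return {
        "by_customer_input": dict(Counter(d.customer_input for d in decisions)),
        "no_named_opportunity": [d.summary for d in decisions if d.opportunity is None],
    }

# Hypothetical audit entries, for illustration only.
audit = [
    Decision("Reorder checkout steps", "product trio", "usability session",
             "direct", "Customers abandon when fees appear late"),
    Decision("Add export button", "executive request", "sales summary", "secondhand"),
    Decision("Rename settings tab", "designer", "internal review", "none"),
]

report = baseline(audit)
print(report)
```

The output is a description, not a score. It shows where evidence stops influencing action, which is all the baseline needs to do at this stage.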

If the process looks reasonable but the habit still collapses, inspect the six prerequisite mindsets: outcome-oriented, customer-centric, collaborative, visual, experimental, and continuous. Turn them into diagnostic questions:

Outcome-oriented: Can the team name the customer or business change it is trying to create, or only the feature it plans to ship?
Customer-centric: Does the team hear directly from customers, or mainly through sales, support, analytics, and stakeholder summaries?
Collaborative: Do product, design, and engineering make decisions together, or meet mainly to exchange work?
Visual: Is there one shared representation of the outcome, opportunities, solutions, and assumptions?
Experimental: Can the team name what could make the current idea fail?
Continuous: Does each learning activity lead to the next question, or does discovery end with a presentation?

Choose the weakest link as your first intervention. A team with output-based goals does not need a better interview script first; it needs an outcome that gives the interview a purpose. A team dominated by handoffs needs shared sensemaking, not another repository.

Install a weekly loop small enough to protect

A habit survives because its trigger, action, and output are clear. Put a recurring discovery block at a stable point in the team’s operating rhythm. Tie it to a current outcome and a live decision, not to a general ambition to understand users better.

Trigger: A protected calendar block recurs every week, including during active delivery.
Focus: The trio brings one current outcome, the decision in front of it, and the uncertainty preventing a confident choice.
Customer contact: The team has a direct customer touchpoint every week. That might be a customer interview, observation of a workflow, or a usability session connected to the current question.
Sensemaking: The trio separates what it observed from what it inferred.
Update: New evidence changes the opportunity solution tree or confirms why no change is warranted.
Commitment: The team names the next uncertainty and starts arranging the next customer contact.

A customer touchpoint is not any meeting attended by a customer. A sales demo, account review, or advisory session dominated by presentation may be valuable, but it does not automatically answer a discovery question. The useful test is whether the customer can reveal a real behavior, need, constraint, or reaction and whether the team can ask follow-up questions.

Prepare each touchpoint by completing this sentence: “After this contact, the trio might decide whether…” If you cannot finish it, the question is probably too broad. Starting with a decision also reduces the temptation to collect interesting comments that never affect the product.

During the interaction, capture concrete observations before interpretations. Afterward, answer five questions while the context is fresh:

What did the customer do, describe, or struggle to explain?
What interpretation is the team placing on that observation?
Which opportunity or assumption does it affect?
What decision changes, if any?
What remains uncertain enough to examine next?

The distinction between observation and interpretation matters. A customer abandoning a task is an observation. Assuming that price caused the abandonment is an interpretation. If the team records only the interpretation, an early guess can become institutional memory.

Recruiting is part of the habit, not administrative work that begins after an interview is requested. Give coordination to a named owner, maintain a rolling pool of relevant customers, and create simple paths for customer success and support to nominate people who recently experienced the problem. Start the next invitation before the current discovery cycle feels complete. Otherwise, every customer cancellation becomes a reason to skip the week.

When delivery pressure rises, protect the trigger and narrow the activity. Ask a smaller question, review a focused prototype, or examine one step in a workflow. Do not silently replace direct contact with an internal meeting and call the habit complete. If a customer cancels, use the protected time to recruit, refine the decision question, and reschedule. Preserve the rhythm without pretending the missing evidence exists.

Give the product trio ownership of decisions, not ceremonies

A product trio is not three people attending the same interview. It is product, design, and engineering sharing responsibility for understanding the opportunity and choosing how to address it. Attendance can rotate. Interpretation and decision-making cannot be delegated to one function and handed back as a deck.

Make the trio’s decision rights explicit at the start of an outcome. Record the outcome it owns, the decisions it can make autonomously, the constraints it must respect, what requires escalation, and where its evidence will remain visible. Without that contract, discovery may reveal a better direction while the roadmap continues unchanged because nobody knows who can act.

The responsibilities below are a practical starting point, not rigid job boundaries:

Product keeps the outcome, strategic context, customer segment, and pending decision visible.
Design helps the trio expose customer behavior, frame opportunities, and choose an appropriate way to learn.
Engineering surfaces feasibility, system behavior, data, and implementation assumptions before the solution becomes expensive to change.
The trio decides what the evidence means, which option remains viable, and what uncertainty deserves attention next.

Use a short shared debrief after customer contact. The format can remain simple:

Observation: What happened without interpretation?
Meaning: What plausible explanations fit the observation?
Decision: What will the trio change or preserve?
Unknown: What still blocks commitment?

This prevents the loudest interpretation from becoming the team’s conclusion. It also gives engineering a role before implementation and gives design a role beyond producing artifacts.

Leadership should ask for evidence of changed decisions, not proof that ceremonies occurred. Instead of asking how many interviews the team completed, ask which opportunity became clearer, which assumption weakened, what decision changed, and how the change connects to the outcome. Interview volume is easy to report and easy to game. Decision quality is harder to display, but it is the reason the habit exists.

Connect discovery evidence to strategy and delivery

A weekly customer conversation can still become theater if its evidence floats separately from strategy, roadmaps, and sprint planning. The opportunity solution tree provides a shared spine: the desired outcome sits at the top, customer opportunities sit beneath it, and candidate solutions connect to the opportunities they could address. That outcome-opportunity-solution structure keeps the team connected to why it is considering a particular feature.

Use the tree as a decision interface, not a workshop artifact:

Product strategy: Put the intended outcome at the top so the team can test whether its discovery work supports the strategic direction.
Roadmapping: Attach candidate solutions to named opportunities. Keep alternatives visible until evidence or a real constraint justifies commitment.
Sprint planning: Require each significant item to trace back to an opportunity and outcome. If it cannot, surface the mismatch before implementation.
Customer contact: Update the affected opportunity, solution, or assumption during the debrief. Do not wait for a separate documentation session.
Stakeholder communication: Show what changed in the tree, why it changed, and which decision follows. This is more useful than presenting a collection of customer quotations.

Keep a record of rejected options and the evidence or constraint behind each rejection. Otherwise, an old idea can return with a new label and consume another round of debate. The record should remain revisable: new customer behavior, technical capability, or strategic constraints can justify reopening a branch.
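For teams that keep the tree in a repository or a script rather than on a whiteboard, a minimal sketch of the structure might look like this. The class names, the three status labels, the `trace` helper, and the sample tree are hypothetical; the only point is that sprint items and rejected options can be checked against the same object.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Solution:
    name: str
    status: str = "candidate"               # "candidate", "committed", or "rejected"
    rejection_reason: Optional[str] = None  # evidence or constraint behind a rejection

@dataclass
class Opportunity:
    name: str
    solutions: list = field(default_factory=list)

@dataclass
class Outcome:
    name: str
    opportunities: list = field(default_factory=list)

def trace(tree: Outcome, solution_name: str):
    """Return (outcome, opportunity, status) for a solution, or None if untraceable."""
    for opportunity in tree.opportunities:
        for solution in opportunity.solutions:
            if solution.name == solution_name:
                return (tree.name, opportunity.name, solution.status)
    return None

# Hypothetical tree, for illustration only.
tree = Outcome("Increase completed follow-ups after customer calls", [
    Opportunity("Reduce the effort to write an accurate follow-up", [
        Solution("Saved follow-up templates", "committed"),
        Solution("Auto-generated summary", "rejected",
                 "Customers did not trust unreviewed summaries"),
    ]),
])

sprint_items = ["Saved follow-up templates", "Dark mode"]
untraceable = [item for item in sprint_items if trace(tree, item) is None]
print(untraceable)
```

An item that comes back untraceable is not automatically wrong. It is a prompt to name the opportunity it serves or to question why it is in the sprint.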

Measure whether evidence enters decisions

The safest discovery metric is not an isolated activity count. Measure the health of the loop:

Cadence: Did direct customer contact happen during the weekly rhythm?
Decision integration: Which current decision did that contact inform?
Shared ownership: Did the trio participate in sensemaking, even if not every member attended the session?
Strategic traceability: Can a delivery item be traced to an opportunity and outcome?
Learning movement: Which belief, option, or assumption changed?

A team can conduct many interviews and learn very little if every conversation validates a solution already selected. Conversely, one focused interaction can be valuable when it exposes a faulty workflow assumption and changes a pending decision. Track cadence to protect the habit, but judge value by movement in the decision model.
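One way to keep cadence and movement visible side by side is a short weekly log. This is a sketch under my own assumptions: the `WeekLog` fields and the sample weeks are hypothetical, and the output is meant to prompt a conversation rather than grade the team.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WeekLog:
    week: str
    direct_contact: bool
    decision_informed: Optional[str] = None  # the live decision the contact informed
    what_changed: Optional[str] = None       # belief, option, or assumption that moved

def loop_health(logs):
    """Report cadence, plus contact weeks that informed no decision or moved nothing."""
    contact_weeks = [w for w in logs if w.direct_contact]
    return {
        "cadence": f"{len(contact_weeks)}/{len(logs)} weeks with direct contact",
        "contact_without_decision": [w.week for w in contact_weeks if not w.decision_informed],
        "contact_without_movement": [w.week for w in contact_weeks if not w.what_changed],
    }

# Hypothetical log, for illustration only.
logs = [
    WeekLog("W1", True, "Onboarding checklist scope",
            "Dropped the assumption that admins set up accounts"),
    WeekLog("W2", True),
    WeekLog("W3", False),
]

health = loop_health(logs)
print(health)
```

A week listed under `contact_without_movement` is not a failure by itself, because evidence can legitimately confirm that no change is warranted. A long run of such weeks is the signal worth discussing.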

Separate customer, model, and operational uncertainty in AI products

AI product teams face a specific discovery trap: an impressive model demonstration can make technical possibility look like customer demand. Keep different uncertainties separate so one kind of evidence does not answer a different question.

Customer uncertainty: What job is the person trying to complete? Where does the current workflow break? Under what conditions will the person trust, verify, correct, or reject an AI-assisted result?
Model uncertainty: Does the system produce acceptable behavior for the intended context? Which failures matter to the user, and how will the team evaluate them?
Operational uncertainty: Can the product obtain the required data and permissions? Where is human review needed? How will failures be detected, explained, and supported?

Customer contact can reveal workflow, language, trust conditions, and failure consequences. It cannot prove that the model behaves reliably. Model evaluations can reveal performance and failure patterns. They cannot prove that the workflow is valuable. Operational checks can establish feasibility and controls. They cannot prove adoption. Keep all of these linked to the same outcome while using the right evidence for each uncertainty.

On the opportunity solution tree, write opportunities in customer terms. “Use generative AI” is a solution direction, not an opportunity. “Reduce the effort required to turn a customer conversation into an accurate follow-up” describes a customer problem that could have AI and non-AI solutions. That distinction helps the trio discover value without becoming attached to a technology.

Fix the mechanism when the habit breaks

What you notice	Likely mechanism	What to change
The team talks to customers, but the roadmap never changes.	Sessions are disconnected from a live decision.	Write the decision before recruiting and record what changed immediately after the interaction.
Engineering joins only after discovery is complete.	The trio label is masking a handoff.	Include engineering in opportunity framing, assumption identification, and shared sensemaking. Session attendance can rotate.
Customer sessions repeatedly fall through.	Recruitment starts only after a question becomes urgent.	Maintain a rolling pool of relevant customers and assign coordination to a named owner.
The opportunity solution tree is stale.	The tree is treated as presentation material.	Update it during the debrief and remove or annotate branches that no longer have support.
Discovery pauses whenever delivery accelerates.	Discovery is scoped as a project rather than a continuous rhythm.	Protect the weekly trigger and narrow the question or method when capacity is tight.
Leadership keeps asking the team for certainty.	The team reports activities without showing their decision impact.	Show the outcome, changed opportunity or assumption, resulting decision, and remaining uncertainty.

Do not respond to a broken habit by adding more process everywhere. Match the intervention to the failure. A recruiting problem needs a pipeline. A decision-rights problem needs leadership alignment. A stale artifact needs an update trigger. A handoff problem needs shared sensemaking.

Key takeaways

Map the current decision system and audit last week’s decisions before adding a new discovery ceremony.
Anchor a direct customer touchpoint every week to a current outcome, decision, and uncertainty.
Let attendance vary when necessary, but keep interpretation and decisions jointly owned by the product trio.
Use the opportunity solution tree as the live connection between strategy, customer evidence, roadmap choices, and sprint work.
When delivery pressure rises, protect the trigger and shrink the activity instead of suspending the cadence.
For AI products, do not use customer enthusiasm as proof of model reliability or an evaluation result as proof of customer value.

Put the recurring customer touchpoint on the calendar, choose the outcome and decision it must inform, and name the product trio responsible for acting on what it learns. At the end of the next weekly cycle, do not ask whether the team “did discovery.” Ask what changed in the decision and what the team needs to learn next.

References

Product Talk – Join My 2026 Continuous Discovery Habits Book Club: Build Weekly Discovery Routines That Stick

January 5, 2026

A Practical Framework for AI-Era Build-versus-Buy Decisions

You have an AI capability on the roadmap. A vendor can demonstrate something credible almost immediately, while engineering believes an internal version would fit the product better. Both claims may be true, and neither one answers the decision in front of you.

The useful question is not simply whether to build or buy. You need to decide which parts of the capability create strategic advantage, what you must learn before committing further, which obligations you are prepared to own, and how you will leave if the economics or technology changes.

Draw the capability boundary before comparing options

Most weak build-versus-buy debates begin with a label that is too broad. AI assistant, support automation, recommendation engine, and enterprise search each describe an experience, not a single technical capability. Comparing a vendor’s finished product with an imagined internal system at that level guarantees an uneven evaluation.

Break the experience into layers before discussing ownership. An AI product might contain data connectors, ingestion, domain retrieval, ranking, generation, orchestration, evaluation, observability, policy guardrails, workflow logic, a user interface, and a human handoff. You can make a different decision for each layer.

Classify every layer by its strategic role:

Differentiation: The layer materially affects why customers choose, retain, or expand with your product. It may encode a proprietary workflow, use unique data, or create a feedback loop competitors cannot easily reproduce.
Parity: Customers expect the capability, but it is not a meaningful reason to choose you. Reliable billing infrastructure, standard integrations, and generic analytics plumbing often belong here.
Control: The layer may not be visible to customers, but it determines whether you can satisfy security, regulatory, reliability, cost, or product-policy obligations. Control can justify ownership even when the layer itself is not differentiating.

My default is to build where the capability creates differentiation and buy where it provides parity. The control category prevents that principle from becoming simplistic. A commodity function can still require an internal boundary, a contractual guarantee, or an owned abstraction if failure would compromise a core promise.

Ask these questions for each layer:

If this layer became substantially better, would it change the product’s value proposition or merely close a feature gap?
Does operating it create proprietary data, evaluation evidence, workflow knowledge, or customer insight that compounds over time?
Would dependence on a vendor’s roadmap prevent you from making an important product promise?
Could a close competitor buy the same capability and achieve roughly the same result?
Do privacy, residency, auditability, reliability, or recovery requirements force you to retain direct control?
Can your team support the layer after launch, including incidents, upgrades, security work, and user adoption?

A retrieval-augmented generation system shows why this decomposition matters. The right answer may be to build the parts that encode domain knowledge while buying fast-moving infrastructure around them.

Layer	Strategic question	Plausible initial posture
Domain retrieval and ranking	Does relevance depend on proprietary content, metadata, permissions, or customer context?	Build when this is central to answer quality and differentiation.
Orchestration and observability	Would owning the runtime create customer value, or only infrastructure work?	Buy when a platform provides adequate reliability, APIs, and portability.
Prompts, policies, guardrails, and evaluation cases	Do these artifacts encode product behavior, risk tolerance, and domain expertise?	Own the specifications and evidence even if a vendor executes them.
User workflow and human handoff	Is the workflow part of the product’s distinctive experience?	Build the differentiated interaction; integrate commodity components behind it.

The point is not that every retrieval system should use this split. The point is to stop forcing one ownership decision across layers with different strategic value. A composed architecture can give you speed at the edges and control at the center.
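The layer-by-layer default can be written down in a few lines. This is a sketch, not a decision engine: the `Role` enum, the posture labels, and the role I assign to each layer are assumptions for illustration, and your own classification may differ.

```python
from enum import Enum

class Role(Enum):
    DIFFERENTIATION = "differentiation"
    PARITY = "parity"
    CONTROL = "control"

# Default posture per strategic role; a starting point to argue with, not a rule.
DEFAULT_POSTURE = {
    Role.DIFFERENTIATION: "build",
    Role.PARITY: "buy",
    Role.CONTROL: "own the boundary",  # internal abstraction or contractual guarantee
}

# Hypothetical classification of the retrieval-augmented generation layers above.
layers = {
    "Domain retrieval and ranking": Role.DIFFERENTIATION,
    "Orchestration and observability": Role.PARITY,
    "Prompts, policies, guardrails, and evaluation cases": Role.CONTROL,
    "User workflow and human handoff": Role.DIFFERENTIATION,
}

plan = {layer: DEFAULT_POSTURE[role] for layer, role in layers.items()}
for layer, posture in plan.items():
    print(f"{layer}: {posture}")
```

Keeping the classification separate from the posture makes the disagreement explicit. If someone objects to buying orchestration, the debate is about whether it is really parity, not about the vendor.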

Compare time to value and total ownership cost separately

Buying and building usually produce different cost curves. Buying can reduce the initial implementation burden and provide proven operations. Building concentrates cost and complexity near the beginning but may create a better fit and more favorable economics at scale. Neither profile is automatically cheaper.

Evaluate the decision across two horizons. The first is time to activated value: how long it takes before the intended users complete the intended workflow successfully. The second is total cost of ownership over the period in which the capability must operate, evolve, and eventually migrate.

Do not treat a signed contract, completed deployment, or merged pull request as time to value. Procurement, security review, data preparation, integration, enablement, in-product guidance, and user activation sit between acquisition and an actual outcome. A fast purchase with weak adoption is not a fast result.

A useful cost model is:

Total ownership cost = acquisition or development + integration + operations + change + risk exposure + exit.

Apply the same formula to both choices. Teams often present the vendor’s full commercial cost against only the internal development estimate, or compare a subscription price with an imagined build that excludes maintenance. Both comparisons are misleading.
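A small sketch shows how the formula can be enforced rather than just quoted. The `OwnershipCost` class is hypothetical, and the figures are invented placeholders in thousands of dollars. Because every field is required, neither option can be presented with a cost area left out.

```python
from dataclasses import dataclass, fields

@dataclass
class OwnershipCost:
    acquisition_or_development: float
    integration: float
    operations: float
    change: float
    risk_exposure: float
    exit_cost: float

    def total(self) -> float:
        """Sum all six cost areas from the formula above."""
        return sum(getattr(self, f.name) for f in fields(self))

# Hypothetical figures in thousands of dollars over the same operating period.
buy = OwnershipCost(acquisition_or_development=300, integration=120, operations=90,
                    change=60, risk_exposure=40, exit_cost=80)
build = OwnershipCost(acquisition_or_development=450, integration=80, operations=150,
                      change=70, risk_exposure=50, exit_cost=30)

print("buy total:  ", buy.total())
print("build total:", build.total())
```

The totals matter less than the line items. When one option's estimate for a cost area is suspiciously low, that is usually the assumption to investigate first.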

Cost area	Evidence needed for a buy option	Evidence needed for a build option
Acquisition or development	Subscription, per-seat or consumption charges, implementation fees, support tier, and expected price changes with growth.	Product, design, engineering, data, security, and platform capacity required to reach usable scope.
Integration	Connector work, identity and permission mapping, data transformation, API constraints, testing, and CI/CD maintenance.	Interfaces with existing systems, migration of current workflows, data contracts, and platform dependencies.
Operations	Internal administration, vendor management, incident coordination, usage monitoring, and workarounds for roadmap gaps.	On-call ownership, observability, model and dependency updates, incident response, capacity management, and reliability work.
Change	Configuration limits, professional services, retraining, contract changes, and waiting for vendor roadmap delivery.	Continuing product development, evaluation maintenance, documentation, enablement, and the opportunity cost of displaced roadmap work.
Risk exposure	Vendor outages, security posture, data handling, roadmap dependence, quota changes, and concentration risk.	Internal security gaps, insufficient operational maturity, key-person dependency, and failure to meet compliance obligations.
Exit	Data export, contract termination, migration assistance, replacement integration, and reconstruction of non-portable artifacts.	Decommissioning, data migration, user transition, and replacement of internally coupled components.

Buying often wins the first horizon while integration work, consumption pricing, roadmap gaps, training, and connector maintenance accumulate later. Building reverses the pressure: the early commitment is larger, and any long-run advantage depends on sustained adoption, sufficient scale, and a team that can operate what it creates.

Run an expected case and a stress case for both options. For a vendor, stress usage, API consumption, support requirements, and the cost of additional environments or features. For an internal system, stress incident load, model or infrastructure changes, evaluation maintenance, and continued product demands. The purpose is not to produce a perfectly precise forecast. It is to expose which assumptions can overturn the decision.

Record those assumptions in the decision memo. If vendor consumption cost must stay within an agreed envelope, state that envelope internally and assign someone to monitor it. If the build case depends on reuse across several product surfaces, name those surfaces and verify that their teams actually intend to adopt the component. An unowned assumption is not a forecast; it is hidden risk.

Turn the debate into an evidence-based decision

A scorecard is useful only when it forces explicit trade-offs. It should not turn judgment into decorative arithmetic. Establish hard gates first, agree on the relative importance of the remaining criteria before vendor demonstrations or internal prototypes create attachment, and then evaluate both options against the same outcome.

A practical scorecard covers differentiation, urgency, security and regulatory risk, integration complexity, and AI leverage and portability.

Dimension	Decision question	Evidence to collect	What changes the decision
Differentiation	How directly does the capability support the value proposition or defensibility?	Product strategy, roadmap commitments, customer workflow evidence, proprietary data advantages, and the importance of controlling behavior.	Build becomes more attractive as the capability determines why customers choose or stay.
Urgency and time to value	What is the cost of waiting, and when can users reach a meaningful outcome?	Procurement and security timelines, integration dependencies, build scope, launch readiness, enablement needs, and adoption path.	Buy becomes more attractive when delay is costly and the purchased path can reach activated value materially sooner.
Security and regulatory risk	Can either option verifiably meet non-negotiable obligations within the launch window?	Data-flow diagrams, privacy controls, residency, retention, audit logs, access controls, certifications, threat response, model lineage, and red-team practices.	An option that fails a mandatory obligation should be removed, regardless of its aggregate score.
Integration complexity	How much continuing work is hidden behind the initial connection?	Sandbox tests, API behavior, quotas, identity mapping, data contracts, failure modes, deployment workflow, and ownership of connectors.	Build gains ground when vendor constraints create persistent product or operational work; buy gains ground when internal integration and support exceed the apparent build scope.
AI leverage and portability	Which prompts, data, evaluations, embeddings, policies, and feedback become valuable, and can they move?	Export tests, API abstraction, model-routing options, ownership terms, deletion process, evaluation access, and migration design.	Build or a hybrid architecture gains ground when the vendor captures an asset central to future differentiation.

Security, regulatory compliance, and minimum reliability are gates, not preferences. A high score elsewhere cannot compensate for an option that cannot lawfully handle the data, meet a required recovery posture, or provide necessary audit evidence. The same logic applies to internal capacity: if no team can own production incidents, an attractive prototype is not a viable build option.
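The gate-then-score order can be sketched as follows. The weights, the 1 to 5 scoring scale, the gate names, and the two sample options are all hypothetical. The structure is the point: a failed gate removes an option before any weighted score is computed.

```python
from typing import Optional

# Hypothetical weights, agreed before any demo or prototype; they sum to 1.0.
WEIGHTS = {"differentiation": 0.35, "urgency": 0.25,
           "integration": 0.20, "portability": 0.20}

def evaluate(gates: dict, scores: dict, weights: dict = WEIGHTS) -> Optional[float]:
    """Return a weighted score on a 1-5 scale, or None if any hard gate fails."""
    if not all(gates.values()):
        return None  # a failed gate cannot be offset by strong scores elsewhere
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical options, for illustration only.
vendor = evaluate(
    gates={"lawful_data_handling": True, "recovery_posture": False, "audit_evidence": True},
    scores={"differentiation": 5, "urgency": 5, "integration": 5, "portability": 5},
)
internal = evaluate(
    gates={"lawful_data_handling": True, "recovery_posture": True,
           "audit_evidence": True, "team_owns_incidents": True},
    scores={"differentiation": 4, "urgency": 2, "integration": 3, "portability": 4},
)

print("vendor:", vendor)
print("internal:", internal)
```

Here the vendor option scores perfectly on every preference and is still removed, which is the behavior a gate is supposed to produce.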

Use a product trio of product, design, and engineering to set the scorecard’s priorities. Bring security, data, finance, procurement, and operations into the criteria they own. This prevents a late-stage veto from appearing as a surprise when it was actually a missing requirement.

Then run comparable discovery work. Give the vendor a production-like workflow in a sandbox. Give the internal option a thin vertical slice that touches the real data and integration boundary. Test the same cases for outcome quality, failure handling, permissions, auditability, operator effort, integration behavior, and unit economics. A polished vendor demonstration and a rough internal prototype reveal different things; common acceptance cases make the evidence comparable.

Keep confidence separate from the decision direction. A criterion can favor building while resting on weak evidence. Mark it as an assumption and define the cheapest test that would resolve it. This is more useful than adding precision to a score whose inputs remain speculative.

The final memo should fit the decision, not the politics around it. Include the capability boundary, strategic classification of each layer, intended user outcome, hard gates, scorecard, cost assumptions, evidence quality, operational owner, exit path, and re-evaluation triggers. Anyone reading it later should be able to tell why the decision was reasonable at the time and which changed condition would justify revisiting it.

Run an AI-specific risk and portability pass

AI changes more than development speed. It introduces movable models, probabilistic behavior, data-dependent quality, metered usage, and artifacts that can become strategically valuable. A normal software procurement checklist will miss several of these dependencies.

Data route: Document what enters the system, which service receives it, where it is stored, how long it is retained, whether it can be used for training, how deletion works, and whether residency requirements apply. Include prompts, retrieved context, generated output, user feedback, and operational logs.
Model and quality governance: Require a way to identify the model, configuration, prompt, retrieval state, and policy version associated with important behavior. Decide who maintains evaluation cases, reviews regressions, investigates failures, and approves consequential changes.
Security and privacy: Verify role-based access, audit logs, PII handling, privacy-by-design controls, threat detection and response, and the vendor’s red-team and incident practices. For an internal build, require equally concrete evidence rather than assuming control equals safety.
Portability: Establish ownership and export mechanisms for source data, metadata, prompts, policies, evaluation sets, feedback, transcripts, and relevant logs. Treat a contractual right to export and a technically usable export as separate requirements.
Unit economics: Map every metered event in the actual workflow. Per-seat pricing, consumption charges, model usage, and orchestration can behave differently as adoption and workflow complexity grow. Test the economic model against expected and stressed usage.
Operational responsibility: Specify who diagnoses a failure that crosses your application, the vendor platform, a model provider, and a data source. Shared architecture does not remove accountability; it makes the handoffs more important.
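The unit-economics item above can be made concrete with a small cost model that is run against both expected and stressed usage. This is a sketch under stated assumptions: the pricing keys, the prices, the usage volumes, and the `envelope` budget are hypothetical and should be replaced with the metered events in your actual workflow.

```python
def monthly_cost(seats, tasks, pricing):
    """Sum seat charges and per-task metered charges for one month."""
    seat_cost = seats * pricing["per_seat"]
    per_task = (
        pricing["model_calls_per_task"] * pricing["per_model_call"]
        + pricing["orchestration_steps_per_task"] * pricing["per_orchestration_step"]
    )
    return seat_cost + tasks * per_task

# Hypothetical prices, for illustration only.
pricing = {
    "per_seat": 40.0,
    "model_calls_per_task": 3,
    "per_model_call": 0.02,
    "orchestration_steps_per_task": 5,
    "per_orchestration_step": 0.004,
}

expected = monthly_cost(seats=50, tasks=20_000, pricing=pricing)

# Stressed case: five times the volume and twice the model calls per task.
stressed_pricing = dict(pricing, model_calls_per_task=6)
stressed = monthly_cost(seats=50, tasks=100_000, pricing=stressed_pricing)

envelope = 10_000.0  # hypothetical approved monthly budget
within_envelope = {
    "expected": expected <= envelope,
    "stressed": stressed <= envelope,
}
```

In this example the seat charge stays flat while the metered portion grows with both volume and workflow complexity, so the stressed case leaves the envelope even though the expected case sits well inside it.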

Portability deserves an actual exit test. Ask the vendor to produce a representative export before the contract is final. Confirm its format, completeness, permission model, and usefulness in another environment. An export button is not evidence that you can reconstruct the product behavior that matters.

Prompts require the same caution. Access to prompt text is necessary, but equivalent behavior may still depend on a model, tool interface, retrieval implementation, or vendor-specific orchestration. Preserve the intent, policies, evaluation cases, and expected outcomes around a prompt, not just the string itself.

Embeddings can also create false confidence about portability. Preserve the original content, chunking inputs, metadata, permission relationships, and evaluation set so embeddings can be regenerated if the model or retrieval system changes. The derived vectors alone are not a complete migration asset.
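A representative export can be checked mechanically before the contract is final. The sketch below assumes the vendor export comes with a manifest that maps artifact names to record counts; the manifest shape and the artifact names are hypothetical, chosen to mirror the portability items listed above.

```python
# Artifacts the exit test expects, based on the portability items above.
REQUIRED_ARTIFACTS = {
    "source_data", "metadata", "prompts", "policies",
    "evaluation_sets", "feedback", "transcripts", "logs",
    # Needed to regenerate embeddings instead of migrating raw vectors.
    "chunking_inputs", "permission_relationships",
}

def missing_from_export(manifest):
    """Return required artifacts that are absent or empty in the export."""
    present = {name for name, count in manifest.items() if count > 0}
    return sorted(REQUIRED_ARTIFACTS - present)

# Hypothetical manifest: vectors are included, but much of what is
# needed to rebuild behavior elsewhere is not.
sample_manifest = {
    "source_data": 1200,
    "metadata": 1200,
    "prompts": 14,
    "policies": 3,
    "embeddings": 48000,
    "transcripts": 0,
}
gaps = missing_from_export(sample_manifest)
```

A non-empty `gaps` list only shows that something is missing. It does not prove that a complete export is usable in another environment, so the check complements the hands-on exit test rather than replacing it.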

For vendors, negotiate transparent API quotas, usable sandbox environments, data-export terms, price protections as usage grows, and clear ownership of AI artifacts. Pressure-test the roadmap against your deployment cadence and ask how incidents, breaking changes, and model transitions are communicated. For an internal build, apply the same rigor to service levels, incident response, observability, model lineage, retention, and ongoing staffing.

Buying does not outsource your responsibility for the product’s behavior. Building does not prove that the behavior is controlled. Choose the implementation that can produce the evidence your risk level demands within the launch window.

Make a staged commitment with explicit re-evaluation triggers

A build-versus-buy decision does not need to be permanent to be disciplined. When uncertainty is high and speed matters, a bounded purchase can be a learning instrument. When differentiation or control is already clear, a minimum lovable internal slice can establish the core while purchased components accelerate everything around it.

For a buy-to-learn path, use this sequence:

Name the uncertainty. Decide whether you are testing demand, workflow fit, quality, integration feasibility, adoption, operational burden, or economics. Do not call a general implementation a pilot.
Bound the commitment. Limit initial scope, data exposure, coupling, and custom vendor work to what the learning objective requires. Preserve an adapter or interface where replacement would otherwise become expensive.
Instrument the outcome. Track whether intended users activate, return, complete the workflow, accept the output, escalate to a human, and create operational work. Monitor consumption and connector reliability alongside product use.
Review against prewritten triggers. Deepen the vendor integration if adoption is durable, economics remain acceptable, and integration pain is manageable. Move toward building if unique requirements emerge, strategic artifacts accumulate, vendor constraints block the roadmap, or costs reach the agreed inflection point. Stop if the user outcome does not materialize.

This approach works because a purchased solution can validate value before a deeper build commitment. The learning is reusable only if you retain the data model, evaluation evidence, workflow understanding, and user-behavior insight rather than burying them inside vendor-specific configuration.

For a build-to-differentiate path, keep the first scope narrow. Build the smallest end-to-end experience that proves the differentiating hypothesis. Buy mature infrastructure around it where doing so does not surrender the key data, policy, or product behavior. Isolate components behind explicit interfaces so a model, orchestration service, retrieval system, or observability layer can change without rewriting the entire experience.

Set re-evaluation triggers before launch, while nobody is defending a sunk decision:

Product trigger: Usage fails to become durable, or customers reveal a need that the current option cannot support.
Financial trigger: Consumption pricing, operating cost, or internal staffing moves outside the approved economic envelope.
Technical trigger: Integration maintenance, API limits, reliability, or roadmap mismatch begins delaying important releases.
Risk trigger: Data handling, retention, auditability, model governance, or regulatory obligations can no longer be met.
Strategic trigger: A previously generic layer begins creating proprietary data, workflow advantage, or meaningful differentiation.
Capacity trigger: The internal team can no longer sustain the operational burden, or gains the maturity needed to own a capability previously bought.

Assign an owner and a review event to each trigger. Without ownership, continuous re-evaluation becomes a good intention that loses to roadmap pressure. The decision memo should remain a living control surface for product, engineering, finance, security, and procurement, not an artifact filed after approval.
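Prewritten triggers are easier to keep alive when each one is recorded with its condition, owner, and review event. The following sketch assumes a simple metrics dictionary; the metric names, thresholds, owners, and review events are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Trigger:
    kind: str
    condition: Callable[[dict], bool]   # returns True when the trigger fires
    owner: Optional[str] = None
    review_event: Optional[str] = None

def review(triggers, metrics):
    """Return the triggers that fired and the triggers nobody owns."""
    fired = [t.kind for t in triggers if t.condition(metrics)]
    unowned = [t.kind for t in triggers if not (t.owner and t.review_event)]
    return fired, unowned

# Hypothetical triggers written before launch.
triggers = [
    Trigger("financial",
            lambda m: m["monthly_cost"] > m["approved_envelope"],
            owner="finance lead", review_event="quarterly business review"),
    Trigger("product",
            lambda m: m["weekly_returning_users"] < m["durable_usage_floor"],
            owner="product lead", review_event="monthly roadmap review"),
    Trigger("technical",
            lambda m: m["releases_delayed_by_vendor"] >= 2),  # no owner yet
]

metrics = {
    "monthly_cost": 18_000,
    "approved_envelope": 15_000,
    "weekly_returning_users": 420,
    "durable_usage_floor": 300,
    "releases_delayed_by_vendor": 0,
}
fired, unowned = review(triggers, metrics)
```

Here the financial trigger fires and the technical trigger is flagged as unowned, which is the gap this paragraph warns about.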

Do not neglect activation. Whether you build or buy, budget for workflow changes, onboarding, in-app guidance, support preparation, and measurement. Deployment creates availability. Repeated successful use creates value.

Key takeaways

Decompose an AI experience into layers before deciding who should own it.
Build differentiated or control-critical layers; buy parity where a vendor can accelerate activated value.
Compare both choices on time to value and total cost of ownership, using the same scope and service expectations.
Apply non-negotiable gates before a weighted scorecard, then test both options against common acceptance cases.
Own the data, policies, evaluation evidence, and migration path that protect your future leverage.
Use staged commitments and prewritten triggers so changing the decision becomes responsible management, not an admission of failure.

The next time this question reaches your roadmap review, do not ask for a permanent verdict on build or buy. Ask for a capability map, comparable evidence, an operational owner, a tested exit path, and the conditions that would change the answer. That gives you a decision you can defend now without mortgaging your ability to adapt later.

References

Product School – Build vs Buy in 2026: How I Make Confident, AI-Savvy Software Decisions That Scale

January 5, 2026

AI Customer Service Transformation: An Operating Playbook

Your AI support pilot can look successful while the service operation gets worse. The agent closes more conversations, but customers repeat themselves after escalation, risky cases receive plausible but incomplete answers, and human agents inherit a queue made almost entirely of exceptions.

If you own this transformation, your job is not to install an AI agent. It is to redesign how customer demand moves through knowledge, automation, human judgment, and product feedback. You also need to prove that a conversation marked resolved was actually resolved. That requires an operating model, not just a deployment plan.

Start with an operating thesis, not a deflection target

Production AI changes the work around customer service before it changes the org chart. In a coded set of 166 interviews with support leaders, managers, and frontline specialists discussing Fin or similar AI agents, 94.58% reported a workflow or process change, and 82.53% reported changed role responsibilities. Only 6.02% reported a change to team structure or reporting lines.

That gap matters. If you treat the program as a software rollout, the technology can reach production while ownership, escalation rules, quality controls, and performance expectations remain designed for a human-only queue. The result is automation sitting on top of an unchanged operation.

The interviews were drawn from Intercom customers or prospects and centered on Fin or similar products. They are useful directional evidence from teams close to this transition, but they are not a vendor-neutral census of every customer service organization. Your own demand, risk profile, knowledge quality, and channel mix should determine the design.

I would begin with a one-page transformation brief. Force the leadership team to complete these fields before discussing a broad rollout:

Customer promise: Which customer outcome will become faster, easier, or more reliable?
Eligible demand: Which intents, channels, languages, customer states, and account types may enter the AI workflow?
Decision boundary: What may the AI explain, recommend, decide, or execute? These are different levels of authority.
Human boundary: Which ambiguity, consequence, customer request, or system condition requires a human?
Business hypothesis: Which cost, capacity, service-level, or growth constraint should improve if the workflow succeeds?
Quality gates: Which measures must improve, and which failure measures must not regress?
Learning owner: Who converts failures into knowledge fixes, workflow changes, model evaluations, or product improvements?

Do not make deflection the customer promise. Deflection records the absence of a human interaction; it does not establish that the customer’s problem was solved. A better promise names the intended outcome, such as completing a defined action correctly or answering an eligible question from an approved source without avoidable repetition.

Scope automation using two dimensions: how repeatable the work is and what happens when the answer is wrong. A simple decision matrix prevents the team from treating every incoming conversation as equally automatable.

Work pattern	AI role	Human role	Release condition
Repeatable and low consequence	Resolve from approved knowledge or execute a reversible workflow	Review samples and handle defined exceptions	Correct resolution and reliable rollback are demonstrated
Repeatable and higher consequence	Retrieve, summarize, validate inputs, or draft	Approve the final answer or action	Authoritative sources, approval capture, and auditability are in place
Ambiguous and low consequence	Ask clarifying questions, categorize, and route	Resolve cases that remain ambiguous	The escalation reason and collected context are visible to the human
Ambiguous and higher consequence	Collect only the minimum safe context, then stop	Own judgment, communication, and action	Hard escalation rules have been tested and cannot be bypassed conversationally

Risk is contextual. The same intent may be routine for one account state and consequential for another. Eligibility therefore belongs in the workflow itself, using customer state, requested action, permissions, available knowledge, and tool health. It should not live only in a prompt that asks the model to be careful.
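The matrix and the eligibility rule can live in workflow code rather than in a prompt. The sketch below is one possible encoding; the `AIRole` names are labels for the four rows of the table, and falling back to the most restrictive role when knowledge or tools are unavailable is an assumption of this example, not a rule stated in the matrix.

```python
from enum import Enum

class AIRole(Enum):
    RESOLVE = "resolve from approved knowledge or run a reversible workflow"
    ASSIST = "retrieve, summarize, validate inputs, or draft for approval"
    CLARIFY_AND_ROUTE = "ask clarifying questions, categorize, and route"
    COLLECT_AND_STOP = "collect the minimum safe context, then stop"

def ai_role(repeatable, high_consequence,
            knowledge_approved=True, tools_healthy=True):
    """Map a work pattern to the AI's permitted role."""
    # Assumption: missing knowledge or unhealthy tools force the most
    # restrictive role regardless of the work pattern.
    if not (knowledge_approved and tools_healthy):
        return AIRole.COLLECT_AND_STOP
    if repeatable:
        return AIRole.ASSIST if high_consequence else AIRole.RESOLVE
    return AIRole.COLLECT_AND_STOP if high_consequence else AIRole.CLARIFY_AND_ROUTE

# The same intent can land in different rows depending on account state.
routine = ai_role(repeatable=True, high_consequence=False)
consequential = ai_role(repeatable=True, high_consequence=True)
```

Because the function runs outside the model, a conversational instruction cannot talk it into a broader role.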

Redesign the full conversation, especially the human handoff

AI-driven service is a routing and resolution system, not a layer that sits in front of the old queue. Teams are already moving triage, routing, translation, categorization, and repetitive responses into automated workflows. Humans increasingly enter for exceptions, nuance, oversight, and quality control.

The unit of design should be one end-to-end customer intent. Do not stop at the AI response. Trace what happens from the first message through resolution, escalation, downstream action, and learning:

Define the intent and entry conditions. State what the customer is trying to accomplish and which signals make the conversation eligible.
Name the authoritative knowledge. Identify the policy, product data, account data, or workflow state required to answer correctly.
Specify permitted actions. Separate explaining a process, recommending an action, preparing an action, and executing it.
Write explicit exit conditions. Define successful completion, customer-requested escalation, uncertainty, missing data, tool failure, policy conflict, and risk escalation.
Design the handoff packet. Give the human the context needed to continue without interrogating the customer again.
Capture a failure reason. Every failed or escalated attempt should produce a category that can be assigned to an owner.
Close the learning loop. Route the failure to knowledge, conversation design, support operations, product, engineering, or governance.

The handoff is where many apparently successful deployments reveal their real cost. If the human receives only a transcript, the AI has transferred a conversation but not the work. The human agent must then reconstruct the goal, identify what the system already attempted, verify customer-provided facts, and decide whether any prior answer can be trusted.

A useful handoff contract should include:

The customer’s detected goal and the intent assigned to it.
The material facts the customer supplied, with no invented completion of missing fields.
The approved sources used to form the answer.
Any tools called, actions attempted, results returned, and side effects created.
The point of uncertainty or the exact escalation rule triggered.
The unresolved question or recommended next action for the human.
The relevant transcript, available for verification rather than presented as the only summary.
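The contract above can be expressed as a typed packet so that missing fields are detected before the case reaches a human. This is a minimal sketch; the field names are my own shorthand for the listed items, and treating `tool_calls` as optional (because some conversations involve no tools) is an assumption.

```python
from dataclasses import dataclass, field, fields

@dataclass
class HandoffPacket:
    detected_goal: str = ""
    intent: str = ""
    customer_facts: dict = field(default_factory=dict)  # only facts the customer supplied
    sources_used: list = field(default_factory=list)    # approved sources behind the answer
    tool_calls: list = field(default_factory=list)      # actions attempted and their results
    escalation_reason: str = ""                         # uncertainty or rule that triggered
    next_action: str = ""                               # unresolved question for the human
    transcript: str = ""                                # kept for verification

# Assumption: a conversation may legitimately involve no tool calls.
OPTIONAL_FIELDS = {"tool_calls"}

def missing_fields(packet):
    """List required fields that are empty, in declaration order."""
    return [
        f.name for f in fields(packet)
        if f.name not in OPTIONAL_FIELDS and not getattr(packet, f.name)
    ]

# A transcript-only handoff: the conversation moved, the work did not.
thin_packet = HandoffPacket(
    detected_goal="change billing date",
    intent="billing.change_date",
    transcript="(full conversation text)",
)
defects = missing_fields(thin_packet)
```

Counting `defects` per handoff gives the workflow-defect measure described in the next paragraph without relying on average handle time.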

Test the handoff as a product experience. Give a human agent only the packet and the underlying conversation, then observe whether the case can continue without the customer repeating information. Track missing fields and unnecessary rework as workflow defects. Do not hide that effort inside average handle time.

Knowledge needs the same discipline. For each automated intent, name one canonical source, one owner, a review trigger, and a withdrawal path. If two approved pages disagree, the correct AI behavior is not to blend them into a smooth answer. It is to stop, disclose the limitation appropriately, and route the conflict to an owner.

The AI agent does not create knowledge debt, but it can expose and distribute that debt at much greater speed. A missing article, stale policy, ambiguous field, or inaccessible account state can produce thousands of superficially different conversations with the same root cause. Aggregate failures by root cause instead of editing individual answers forever.

Use a failure taxonomy that separates at least these problems: missing knowledge, stale knowledge, conflicting knowledge, retrieval failure, unsupported reasoning, policy-boundary failure, tool or integration failure, incorrect eligibility, poor conversation design, routing failure, and incomplete handoff. Each category should map to a named owner and a defined corrective action. Otherwise, quality review becomes a list of examples rather than an operating system for improvement.
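A taxonomy becomes operational once every category resolves to an owner and failures are counted by root cause. In the sketch below the category names come from the paragraph above, while the owner assigned to each category is a hypothetical mapping that your organization would need to decide for itself.

```python
from collections import Counter

# Hypothetical owner for each failure category.
OWNER_BY_CATEGORY = {
    "missing_knowledge": "knowledge",
    "stale_knowledge": "knowledge",
    "conflicting_knowledge": "knowledge",
    "retrieval_failure": "engineering",
    "unsupported_reasoning": "ai_owner",
    "policy_boundary_failure": "governance",
    "tool_or_integration_failure": "engineering",
    "incorrect_eligibility": "support_operations",
    "poor_conversation_design": "conversation_design",
    "routing_failure": "support_operations",
    "incomplete_handoff": "conversation_design",
}

def failures_by_owner(failure_categories):
    """Count failures per owning team.

    Raises KeyError for a category outside the taxonomy, so an
    unclassified failure cannot slip through silently.
    """
    counts = Counter()
    for category in failure_categories:
        counts[OWNER_BY_CATEGORY[category]] += 1
    return counts

# Hypothetical week of reviewed failures.
reviewed = [
    "stale_knowledge", "stale_knowledge", "missing_knowledge",
    "tool_or_integration_failure", "routing_failure",
]
by_owner = failures_by_owner(reviewed)
```

Grouping this way surfaces that most of the sample traces back to knowledge, which points to fixing the source content instead of editing individual answers.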

Redesign jobs before you promise headcount savings

Workforce impact is real, but it is not uniform. Headcount or hiring changed in 27.71% of the 166 interviews, often through slower Tier 1 hiring, freezes, natural attrition, or reallocation. That is materially less common than workflow and responsibility changes. The safest conclusion is not that AI automatically removes a fixed percentage of support cost. It is that repetitive demand can shrink while new oversight, exception, knowledge, and optimization work grows.

Calculate net capacity rather than gross deflection. The practical equation is:

Net capacity released = human work correctly avoided − (new review + exception + maintenance + recovery work)

Count the whole system. Include time spent reviewing samples, investigating severe failures, maintaining knowledge, configuring workflows, testing releases, repairing integrations, managing escalations, and helping customers recover from wrong actions. Also separate capacity released from cash savings. A team may use capacity to absorb growth, improve response time, eliminate backlog, or take on higher-complexity work without reducing current payroll.
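A short worked example shows how the equation behaves once the new work is itemized. All hours below are hypothetical, and the work categories are a subset of those named above.

```python
def net_capacity_released(avoided_hours, new_work_hours):
    """Human hours correctly avoided minus the new work automation created."""
    return avoided_hours - sum(new_work_hours.values())

# Hypothetical monthly figures, in human hours.
avoided_hours = 400
new_work_hours = {
    "sample_review": 60,
    "exception_handling": 45,
    "knowledge_maintenance": 40,
    "recovery_from_wrong_actions": 15,
}

net_hours = net_capacity_released(avoided_hours, new_work_hours)
gross_to_net_ratio = net_hours / avoided_hours
```

With these inputs, 160 of the 400 avoided hours are consumed by new work, so the net figure is 240 hours. That is capacity, not cash, until a staffing or budget decision is actually made.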

Role design should follow the new work, not fashionable job titles. You may create an AI specialist, automation manager, or AI-agent owner, but the essential question is who owns each recurring decision:

Frontline specialists resolve nuanced cases, identify failure patterns, validate knowledge gaps, and contribute difficult conversations to evaluation sets.
Support managers manage the changing workload mix, coach exception handling, monitor capacity, and decide where human judgment adds value.
AI or automation owners configure behavior, maintain evaluations, control releases, monitor production, and coordinate rollback.
Quality owners define error severity, audit both automated and human resolutions, and make recurring failure visible.
Knowledge owners approve canonical content, resolve conflicts, and remove information that should no longer be used.
Product and engineering owners fix product defects, data gaps, and tool failures that support conversations repeatedly expose.

These are responsibilities, not necessarily separate positions. A smaller organization may combine them, but it should not leave them implicit. One person can hold several responsibilities, but no critical responsibility can be left without an owner.

Write decision rights alongside role descriptions. Specify who may expand eligible intents, approve a high-consequence workflow, publish knowledge, change a prompt or model, accept a known quality limitation, pause automation, and communicate a customer-impacting failure. An AI owner who is accountable for outcomes but cannot stop a release is not an owner.

The capability profile changes as well. Data literacy, quality assurance, AI-output monitoring, and cross-functional communication are becoming more important as humans move from repetitive execution toward oversight and exception handling. Training should therefore use the actual work artifacts: score a conversation, classify a failure, inspect the sources used, challenge an unsupported answer, improve a handoff, and recommend the correct owning team.

Do not wait until automation is broadly deployed to explain this shift. Before changing staffing plans, show people the future queue, the new performance expectations, the skills they can build, and the paths available for redeployment. Vague assurances create uncertainty, while premature savings commitments force managers to defend a number before the operation has demonstrated sustainable quality.

Measure correct outcomes, not apparent automation

A conversation can be closed, contained, or deflected without being correct. That is why an automation dashboard cannot double as a transformation scorecard. I would make cost per correct resolution the economic anchor, then constrain it with customer-experience and severity guardrails.

Define correct resolution for every intent before launch. At minimum, it should mean that the customer received an accurate and complete answer or action, the applicable policy was followed, the workflow created no unintended side effect, and no avoidable human rescue or repeat contact occurred during an intent-appropriate observation period. The period may differ by intent; a question answered immediately and a downstream account action do not reveal failure on the same schedule.

Measure	Question it answers	Common trap
Eligible demand coverage	How much inbound demand falls inside a clearly approved scope?	Expanding eligibility merely to make automation look larger
AI attempt rate	How often did the AI engage eligible demand?	Counting an attempt as a successful outcome
Audited correct autonomous resolution	How often did sampled AI completions fully meet the intent definition without rescue?	Relying only on closure status or customer silence
Repeat or reopened contact	Did the customer return because the original issue remained unresolved?	Missing a repeat that arrives through another channel or wording
Handoff recovery	Can a human continue efficiently with accurate context?	Measuring routing speed while ignoring repeated questions and reconstruction work
Cost per correct resolution	What does a genuinely completed outcome cost across the whole system?	Excluding review, knowledge, tooling, maintenance, and recovery effort
Severity-weighted failure	How much customer or business consequence did errors create?	Allowing a high average accuracy to hide rare but serious failures
New-work burden	How much human effort did automation introduce?	Treating oversight and maintenance as free capacity

Keep the denominators explicit. Eligible demand coverage is eligible conversations divided by total inbound conversations. AI attempt rate uses eligible conversations as its denominator. Audited correct autonomous resolution should use reviewed AI-completed conversations, not every inbound contact. Mixing those denominators lets a team report a large percentage without showing how much demand was actually solved.
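The three rates can be computed side by side so that each denominator stays visible. The volumes below are hypothetical, and the dictionary keys are shorthand for the measures in the table.

```python
def funnel_rates(total_inbound, eligible, attempted,
                 reviewed_completions, reviewed_correct):
    """Compute each rate against its own denominator."""
    return {
        # eligible conversations / total inbound conversations
        "eligible_demand_coverage": eligible / total_inbound,
        # AI attempts / eligible conversations
        "ai_attempt_rate": attempted / eligible,
        # audited correct / reviewed AI-completed conversations
        "audited_correct_autonomous_resolution": reviewed_correct / reviewed_completions,
    }

# Hypothetical month of traffic.
rates = funnel_rates(
    total_inbound=10_000,
    eligible=4_000,
    attempted=3_000,
    reviewed_completions=500,
    reviewed_correct=425,
)
```

In this example an audited rate of 85 percent describes a reviewed sample of 500 AI completions. It does not mean that 85 percent of the 10,000 inbound conversations were solved, since only 40 percent of inbound demand was eligible in the first place.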

Audit with two sampling paths. Use a representative sample to estimate ordinary performance across intents, channels, languages, and customer states. Add targeted samples for high-consequence actions, new releases, known weak spots, tool failures, unusual escalations, and complaints. A purely random sample can miss rare failures that matter more than common harmless mistakes.

Define error severity before reviewers see the results. A wording issue, an incomplete answer, a wrong policy explanation, an unauthorized disclosure, and an incorrect account action should not contribute equally to one accuracy average. Severity should change the required response: monitor, correct knowledge, roll back a workflow, disable an action, or initiate the relevant incident process.
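Severity weighting can be illustrated with two review samples that share the same accuracy but carry very different consequences. The weights below are hypothetical and exist only to show the mechanics; your own scale should be agreed before reviewers see results.

```python
# Hypothetical severity weights, fixed before the audit begins.
SEVERITY_WEIGHT = {
    "wording": 1,
    "incomplete_answer": 3,
    "wrong_policy": 10,
    "unauthorized_disclosure": 50,
    "incorrect_account_action": 50,
}

def accuracy(error_counts, reviewed):
    """Share of reviewed conversations with no recorded error."""
    return 1 - sum(error_counts.values()) / reviewed

def severity_weighted_failure(error_counts, reviewed):
    """Weighted error points per reviewed conversation."""
    points = sum(SEVERITY_WEIGHT[kind] * n for kind, n in error_counts.items())
    return points / reviewed

reviewed = 1_000
sample_a = {"wording": 20}
sample_b = {"wording": 18, "incorrect_account_action": 2}

same_accuracy = abs(accuracy(sample_a, reviewed) - accuracy(sample_b, reviewed)) < 1e-9
score_a = severity_weighted_failure(sample_a, reviewed)
score_b = severity_weighted_failure(sample_b, reviewed)
```

Both samples are 98 percent accurate, yet the second scores far worse once the two incorrect account actions are weighted, which is exactly the failure a single accuracy average hides.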

Maintain separate executive and operating views. The executive view should show eligible volume, audited correct resolution, customer outcome measures, cost per correct resolution, severe-failure trend, capacity released, and where that capacity went. The operating view should break performance down by intent, channel, language, customer state, workflow version, knowledge version, tool, failure category, and escalation reason.

Versioning is essential for diagnosis. Record the model, instructions, knowledge snapshot, workflow configuration, tool version, and eligibility rules associated with each resolved conversation. When several components change together, you may know performance moved without knowing why. Controlled rollouts or eligible-traffic holdouts can provide stronger evidence than a simple before-and-after comparison, especially when demand mix or seasonality is changing.

Set release thresholds before looking at a candidate’s results. The exact threshold should reflect the consequence of the intent and your current human baseline; there is no responsible universal number. The release decision should require sufficient audited quality, acceptable handoff recovery, no prohibited failure, functioning rollback, and an owner for every material defect that remains open.
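The release decision can be written as a gate that returns every blocker rather than a single pass or fail. This is a sketch under stated assumptions: the evidence keys and the threshold values are hypothetical and should be set per intent against your human baseline.

```python
def release_decision(evidence, thresholds):
    """Return (approved, blockers) for a candidate release."""
    blockers = []
    if evidence["audited_correct_rate"] < thresholds["min_audited_correct_rate"]:
        blockers.append("audited quality below threshold")
    if evidence["handoff_recovery_rate"] < thresholds["min_handoff_recovery_rate"]:
        blockers.append("handoff recovery below threshold")
    if evidence["prohibited_failures"] > 0:
        blockers.append("prohibited failure observed")
    if not evidence["rollback_tested"]:
        blockers.append("rollback not demonstrated")
    if evidence["open_defects_without_owner"] > 0:
        blockers.append("material defect without an owner")
    return not blockers, blockers

# Hypothetical thresholds, written down before the candidate is evaluated.
thresholds = {"min_audited_correct_rate": 0.95, "min_handoff_recovery_rate": 0.85}

candidate = {
    "audited_correct_rate": 0.93,
    "handoff_recovery_rate": 0.90,
    "prohibited_failures": 0,
    "rollback_tested": True,
    "open_defects_without_owner": 1,
}
approved, blockers = release_decision(candidate, thresholds)
```

Returning the full list of blockers gives each open item somewhere to go, instead of a single rejection that nobody owns.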

Scale through evidence-gated stages

Do not scale on a calendar promise. Move when the workflow has produced enough evidence for its next level of authority. A useful sequence separates learning about the problem from granting the system permission to act.

Baseline the demand and draw the boundary

Start with the highest-volume and highest-consequence intents, but do not assume they belong in the same release. Build an inventory containing volume, current human effort, customer outcome, approved knowledge, data requirements, available actions, reversibility, failure consequence, escalation destination, and owner.

Create an evaluation set from real, appropriately handled historical conversations. Remove or protect sensitive data according to your controls. Include ordinary examples, ambiguous requests, missing information, policy conflicts, tool failures, customer requests for a human, and known edge cases. The gate for leaving this stage is not model quality. It is a testable definition of correct behavior and a clear boundary around what the AI must not do.

Run in observation or approval mode

Let the AI classify, retrieve, summarize, or draft while a human retains final authority. Compare its proposed outcome with the completed human outcome. Instrument the failure taxonomy, inspect whether the correct knowledge was available, and test the handoff packet with frontline agents.

Use this stage to repair the system around the model. Many failures will belong to missing content, conflicting policy, broken integrations, weak eligibility, or unclear product behavior. Prompt editing cannot fix an absent source of truth or an action the underlying system cannot perform reliably.

Grant controlled autonomy to bounded work

Begin with stable, low-consequence demand supported by authoritative knowledge and reversible workflows. Enforce eligibility outside the conversational instructions where possible. Keep hard escalation rules for uncertainty, missing data, customer preference, unavailable tools, policy conflicts, and prohibited actions.

Review production samples and targeted risk cases. Watch repeat contacts, human recovery work, severe errors, and changes in the composition of the human queue. A falling queue is not automatically good if the cases that remain take much longer or arrive with damaged customer trust.

Expand one meaningful dimension at a time

Add an intent, channel, language, customer state, or action only after defining how that dimension changes knowledge, evaluation, escalation, and consequence. Reusing a workflow in a new language is not just translation if policies, terminology, tone, or available support paths differ. Adding tool execution is not just a better answer; it grants the system operational authority.

Version each expansion and preserve rollback. If you need causal clarity, avoid changing the model, knowledge, tools, instructions, and eligibility rules in the same release. When simultaneous changes are unavoidable, label the release as a system change and evaluate the combined behavior rather than attributing the result to one component.

Institutionalize the operating model

Only after correct resolution and total workload remain durable should you change long-term staffing assumptions, performance management, budgets, or reporting lines. Update role charters, decision rights, quality routines, release governance, incident ownership, knowledge operations, and planning models together.

Give recurring AI failures a path into the product roadmap. If customers repeatedly ask because the interface is unclear, a workflow fails, or account state is hard to understand, automating the explanation may reduce service effort while preserving the root cause. The better product decision may be to remove the need for the conversation.

Key takeaways

Treat AI customer service as an operating-model transformation, because workflows and responsibilities change before most reporting structures do.
Automate bounded intents, not an undifferentiated share of tickets. Repeatability and consequence should determine the AI’s authority.
Design the human handoff as a product. A transcript without facts, actions, sources, uncertainty, and next steps transfers the queue but not the work.
Use audited correct resolution and cost per correct resolution as anchors. Attempts, closures, containment, and deflection are supporting events, not proof of value.
Calculate net capacity after review, maintenance, exception, and recovery work. Keep that separate from any claimed payroll saving.
Scale only when quality, severity, handoff, ownership, and rollback gates have been met for the next expansion.

Your next move can be small and consequential. Choose one recurring intent, complete the transformation brief, name its canonical knowledge owner, write the handoff contract, and define how you will audit correct resolution. If you cannot assign the knowledge, failure, and release decisions, do not automate the intent yet. Resolving that ownership gap is the first real step in the transformation.

References

Intercom – Inside the AI Customer Service Shift: What 166 Leaders Told Me About Teams, Roles, and ROI

January 5, 2026

How to Design, Launch, and Govern an AI Agent Product

Your AI agent demo works. Now the harder questions arrive: Which actions can it take, how will anyone know it helped, and who owns a bad decision? If those answers are deferred until launch, you do not yet have a product ready to scale. You have a capability looking for permission.

Your job as a product leader is to turn uncertain model behavior into a dependable operating system for one valuable task. That means designing the job, the workflow, the controls, the measurement, and the adoption path together. Model quality matters, but it cannot compensate for an undefined outcome, excessive access, weak tools, or a launch that asks users to trust what they cannot inspect or reverse.

Start with an operating contract, not an agent persona

Names such as sales agent, support copilot, or operations assistant are too broad to guide product decisions. They hide disagreements about what the system can see, what it can change, when it should stop, and what success means. Treating an agent as a product line with a narrow job, grounded data, tool access, and guardrails forces those disagreements into the open while they are still inexpensive to resolve.

Write an operating contract before debating models or interfaces. It should answer the following questions in language that product, engineering, operations, security, and the domain owner can all review:

Who is the user? Name the role performing the job, not a market segment. An account administrator and a support specialist may need different evidence, permissions, and explanations even when they use the same underlying model.
What event starts the job? Specify the observable trigger: a customer request arrives, a record enters an exception state, or a user asks for a particular action. A generic invitation to chat is not a job boundary.
What outcome counts as done? Define a state outside the conversation. The answer might be an approved response, a correctly updated record, a validated recommendation, or a complete handoff. A fluent message is output, not necessarily an outcome.
What evidence may the agent use? List permitted systems, required records, freshness requirements, and data the agent must not retrieve. If the task requires an authoritative record, make its absence a stop condition rather than an invitation to infer.
Which tools may it call? Separate read, draft, and write permissions. An agent that can inspect a record does not automatically need permission to change it, and permission to draft an action does not imply permission to execute it.
What constraints must always hold? Capture business rules, policy boundaries, approval requirements, and prohibited actions. Enforce these constraints in tool and application layers, not only in natural-language instructions.
When must it stop or escalate? Missing required evidence, conflicting records, unsupported requests, tool failures, and policy exceptions should lead to a defined fallback. The agent should not improvise its way around a boundary.
Who remains accountable? Name the owner who approves the contract, reviews failures, and decides whether autonomy can expand. Accountability cannot be assigned to the agent itself.

A compact job statement makes the contract easier to test:

When [trigger] occurs, help [user] achieve [observable outcome] using [approved evidence and tools]. If [stop condition] occurs, hand off to [role] with [required context].

For example, a support agent might retrieve an approved knowledge record and relevant account facts, prepare a response, and stop when identity, policy, or account data is unresolved. Its handoff would include the customer’s request, the evidence retrieved, the steps attempted, and the exact question requiring a specialist. That is a testable product definition. “Build a support agent” is not.

Add a negative scope as well. State what the agent will not do in the current release, even if the model appears capable of doing it. This keeps a successful pilot from quietly becoming authorization for unrelated work.
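One way to keep the contract reviewable is to hold it as structured data and generate the job statement from it. The sketch below is a minimal illustration, not a prescribed schema: the class name `OperatingContract`, its field names, and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    """Hypothetical container for the contract questions listed above."""

    user_role: str
    trigger: str
    observable_outcome: str
    approved_evidence: tuple[str, ...]
    approved_tools: tuple[str, ...]
    stop_conditions: tuple[str, ...]
    handoff_role: str
    handoff_context: tuple[str, ...]
    accountable_owner: str
    negative_scope: tuple[str, ...] = ()

    def missing_fields(self) -> list[str]:
        """Names of required fields left empty (negative scope is optional in this sketch)."""
        return [
            name
            for name, value in vars(self).items()
            if name != "negative_scope" and not value
        ]

    def job_statement(self) -> str:
        """Fill the compact job statement template from the contract fields."""
        resources = ", ".join(self.approved_evidence + self.approved_tools)
        return (
            f"When {self.trigger} occurs, help {self.user_role} achieve "
            f"{self.observable_outcome} using {resources}. "
            f"If {' or '.join(self.stop_conditions)} occurs, hand off to "
            f"{self.handoff_role} with {', '.join(self.handoff_context)}."
        )


# Example values are invented for illustration.
contract = OperatingContract(
    user_role="a support specialist",
    trigger="a customer billing question",
    observable_outcome="an approved response",
    approved_evidence=("the approved knowledge record", "account facts"),
    approved_tools=("the draft_reply tool",),
    stop_conditions=("unverified identity", "conflicting account data"),
    handoff_role="a billing specialist",
    handoff_context=("the request", "evidence retrieved", "steps attempted"),
    accountable_owner="support operations lead",
    negative_scope=("issuing refunds",),
)
print(contract.job_statement())
```

A review can then start from `missing_fields()`: any non-empty result means the contract is not ready to test.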

The final test is simple: can two reviewers inspect the same run and agree whether the job was completed within the contract? If they need to debate whether the answer merely sounded reasonable, the definition of done is still too vague.

Build deterministic edges around the model

A dependable agent is a workflow, not a long prompt. The model interprets language and chooses among bounded options; the surrounding system controls identity, data access, tool execution, validation, state, and recovery. Retrieval, context management, reliable tools, and clear state often matter more than moving to a larger model.

Design the successful path and the failure path as an explicit sequence:

Retrieve authorized evidence. Fetch only the records relevant to the job. Preserve record identifiers, versions, and freshness so the result can be inspected later.
Construct minimal task state. Carry the user’s identity, requested outcome, validated facts, previous tool results, pending approvals, and unresolved questions. Do not treat an ever-growing chat transcript as the system of record.
Choose from allowed actions. Give the model a constrained set of tools and make unavailable actions genuinely unavailable. A prompt that says “do not call a privileged endpoint” is not access control.
Validate tool inputs. Use typed schemas, required fields, enumerated values where appropriate, and server-side authorization. Reject malformed or unauthorized calls before they reach the underlying system.
Validate the resulting state. Check deterministic business rules after execution. A successful API response only proves that the call ran; it does not prove that the user’s job was completed correctly.
Finish, recover, or hand off. Return an accepted outcome, retry only when retrying is safe, or create the handoff package specified in the operating contract.
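The “choose from allowed actions” and “validate tool inputs” steps can be sketched as a gate that runs before any call reaches the underlying system. This is an illustrative sketch under stated assumptions: the registry contents, tool names, and the `check_call` helper are hypothetical, and a production gate would also authenticate the caller and validate field types.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    READ = "read"
    DRAFT = "draft"
    WRITE = "write"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    permission: Permission
    required_fields: frozenset[str]


class ToolCallRejected(Exception):
    """Raised before a call reaches the underlying system."""


# Hypothetical registry: a tool that is not listed here does not exist for the agent.
REGISTRY = {
    "get_account": ToolSpec("get_account", Permission.READ, frozenset({"account_id"})),
    "draft_reply": ToolSpec("draft_reply", Permission.DRAFT, frozenset({"account_id", "body"})),
    "update_record": ToolSpec(
        "update_record", Permission.WRITE, frozenset({"account_id", "field", "value"})
    ),
}


def check_call(tool_name: str, args: dict, granted: set[Permission]) -> ToolSpec:
    """Reject unknown, unauthorized, or malformed calls; return the spec otherwise."""
    spec = REGISTRY.get(tool_name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {tool_name}")
    if spec.permission not in granted:
        raise ToolCallRejected(f"{tool_name} requires {spec.permission.value} permission")
    missing = spec.required_fields - set(args)
    if missing:
        raise ToolCallRejected(f"missing fields: {sorted(missing)}")
    return spec


# An agent granted read and draft access cannot reach the write tool,
# whatever the prompt says.
granted = {Permission.READ, Permission.DRAFT}
check_call("get_account", {"account_id": "A1"}, granted)
```

The point of the design is that the refusal lives in code the model cannot talk its way past.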

Tool quality deserves product attention. Each consequential tool should expose the smallest permission needed, return machine-readable errors, support a preview when possible, and make repeated requests safe where the underlying operation permits it. Reversible operations need a tested undo path. Irreversible operations need tighter authorization and should not be made safe merely by adding another sentence to the prompt.

Context also needs a budget based on relevance, not on the maximum number of tokens the model accepts. Rank evidence by authority and usefulness. Remove unrelated history. Distinguish verified records from user claims and model-generated summaries. When two authoritative records conflict, preserve the conflict and route it through the stop condition instead of blending them into a plausible answer.

Build the evaluation set before the launch plan

Your evaluation set is the executable version of the operating contract. It should represent the situations that matter to the job, including conditions in which the correct behavior is to refuse, ask for information, or escalate.

Scenario class	What the evaluation should verify
Normal path	The agent retrieves the required evidence, selects the correct tool, satisfies the acceptance criteria, and records a complete result.
Ambiguous request	The agent asks for the missing fact or offers bounded choices instead of assuming the user’s intent.
Missing or stale evidence	The workflow stops, refreshes through an approved path, or escalates according to the contract.
Tool failure	The agent does not claim success, duplicate a consequential action, or lose the task state needed for recovery.
Policy boundary	The prohibited call is blocked by the system, the response explains the available path, and the event is auditable.
Human handoff	The receiving person gets the request, relevant evidence, attempted actions, unresolved issue, and recommended next step.

Score the dimensions separately. A single average can hide the failure that matters most.

Outcome correctness: Did the external result meet the job’s acceptance criteria?
Grounding: Did the response use the required evidence without inventing unsupported facts?
Tool behavior: Were the correct tool, arguments, order, and authorization used?
Policy compliance: Did every prohibited or approval-gated action remain inside its boundary?
Recovery: Did the workflow handle missing data, timeouts, and partial failures without misrepresenting the result?
Handoff quality: Could the receiving person continue without reconstructing the entire run?
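A small calculation shows why the dimensions should be reported separately. In this sketch the dimension keys and the run data are invented for illustration; the only claim is the arithmetic.

```python
from collections import defaultdict

# Hypothetical short keys for the six dimensions listed above.
DIMENSIONS = ("outcome", "grounding", "tool_behavior", "policy", "recovery", "handoff")


def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Per-dimension pass rate; a dimension not scored in a run is skipped for that run."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for run in results:
        for dimension, ok in run.items():
            total[dimension] += 1
            passed[dimension] += int(ok)
    return {dimension: passed[dimension] / total[dimension] for dimension in total}


# Ten invented runs: everything passes except policy compliance in two runs.
runs = [{dimension: True for dimension in DIMENSIONS} for _ in range(10)]
runs[3]["policy"] = False
runs[7]["policy"] = False

rates = pass_rates(runs)
blended = sum(rates.values()) / len(rates)
print(rates["policy"])      # 0.8
print(round(blended, 3))    # 0.967
```

The blended score looks healthy while one run in five breached a policy boundary, which is exactly the failure a single average conceals.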

Use deterministic assertions wherever the expected state can be checked directly. Use domain review for judgment that depends on policy or professional context. Model-based evaluators can help classify or prioritize a larger sample, but they should not become the only judge of a high-consequence action.

Run scripted evaluations whenever the model, prompt, retrieval logic, tool schema, policy, or orchestration changes. Sample live runs after release to find failure patterns the fixed set does not yet represent, subject to your data-access and retention rules. Add confirmed failures back into the regression set. That is how eval-driven development turns observed behavior into a tighter product.

Select the model after this evaluation loop exists. Compare candidates on the acceptance criteria, latency, operating cost, and operational constraints of the job. The right model is the least complex option that clears the required bar with the complete workflow around it. A model swap should be one testable hypothesis among retrieval, context, tool, state, and prompt changes, not the automatic response to erratic behavior.

Govern autonomy at the action boundary

Governance becomes practical when you classify what the agent may do, not how intelligent it appears. The important distinction is the consequence of the next action: whether it changes state, whether the change can be reversed, and who bears the cost of an error.

Action class	Typical behavior	Default product control
Advise	Summarizes evidence or recommends a next step without changing system state.	Show the supporting evidence and let the user ignore, revise, or escalate the recommendation.
Draft	Creates an editable response, plan, or proposed update that has not been sent or committed.	Require review before external effect. Capture material edits and rejection reasons as feedback.
Execute a reversible action	Changes a record or starts a bounded workflow with a reliable recovery path.	Begin with a preview and explicit approval. Enforce scope in the API, record the action, and make undo visible.
Execute a consequential action	Creates an irreversible, financial, regulatory, security, or substantial customer impact.	Keep a qualified human decision-maker in the path unless the organization has explicitly approved a narrower control model. The agent can assemble evidence and prepare the action without owning the decision.

Do not borrow one accuracy threshold for all four classes. A summarization defect and an unauthorized payment are not interchangeable errors. Set release criteria by action class, and report prohibited-action failures separately rather than averaging them together with low-consequence quality issues.
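Release criteria by action class can be expressed as a small report that never blends classes. The thresholds below are placeholders chosen only to make the example run, not recommended values, and the record fields (`action_class`, `passed`, `prohibited_action`) are assumptions for this sketch.

```python
from enum import Enum


class ActionClass(Enum):
    ADVISE = "advise"
    DRAFT = "draft"
    EXECUTE_REVERSIBLE = "execute_reversible"
    EXECUTE_CONSEQUENTIAL = "execute_consequential"


# Placeholder thresholds for illustration; set real ones with the domain owner.
MIN_PASS_RATE = {
    ActionClass.ADVISE: 0.90,
    ActionClass.DRAFT: 0.90,
    ActionClass.EXECUTE_REVERSIBLE: 0.98,
    ActionClass.EXECUTE_CONSEQUENTIAL: 0.995,
}


def release_report(runs: list[dict]) -> dict[ActionClass, dict]:
    """Per-class pass rate, with prohibited-action failures counted separately."""
    report = {}
    for action_class in ActionClass:
        subset = [run for run in runs if run["action_class"] is action_class]
        if not subset:
            continue
        pass_rate = sum(run["passed"] for run in subset) / len(subset)
        prohibited = sum(run["prohibited_action"] for run in subset)
        report[action_class] = {
            "pass_rate": pass_rate,
            "prohibited_failures": prohibited,
            # Any prohibited action blocks release, whatever the pass rate.
            "release_ok": pass_rate >= MIN_PASS_RATE[action_class] and prohibited == 0,
        }
    return report


def run(action_class: ActionClass, passed: bool, prohibited: bool = False) -> dict:
    return {"action_class": action_class, "passed": passed, "prohibited_action": prohibited}


# Invented evaluation results.
runs = (
    [run(ActionClass.ADVISE, True)] * 19
    + [run(ActionClass.ADVISE, False)]
    + [run(ActionClass.EXECUTE_CONSEQUENTIAL, True)] * 3
    + [run(ActionClass.EXECUTE_CONSEQUENTIAL, False, prohibited=True)]
)
report = release_report(runs)
```

Here the advisory class clears its bar at 0.95, while the consequential class is blocked by a single prohibited action that a blended rate of roughly 0.92 would have buried.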

Human review only reduces risk when the reviewer can make an informed decision. A confirmation button attached to a vague summary creates approval theater. The review interface should show:

The exact action that will occur and the system it will affect.
The evidence used, including record identifiers or other traceable references.
Any missing, stale, or conflicting information.
The expected side effects and whether the action can be reversed.
Clear options to approve, edit, reject, or escalate.

For a handoff, replace “approve” with a receiving workflow. The person taking over needs a concise task summary, the user’s original intent, the evidence already checked, tool results, the reason automation stopped, and the next decision. Measuring whether that package is usable is more valuable than celebrating a low handoff rate.

Enforcement belongs at the tool boundary. Authenticate the user and agent, authorize each operation, validate inputs, limit accessible records, and block disallowed transitions on the server. Natural-language instructions can guide behavior, but they are not a substitute for permissions, policy checks, or transaction controls.

Keep an audit record proportionate to the risk. For a consequential run, that commonly includes the requesting identity, agent and configuration version, evidence identifiers, tool calls and results, approval decision, final state, and any reversal or escalation. Do not log raw prompts, private records, or retrieved content by default merely because they may be useful later. Decide what is necessary, who can access it, and how long it should be retained as part of AI risk management and data governance.

Assign human ownership across the operating system. Product owns the target outcome and adoption decision. A domain owner approves acceptance criteria and policy interpretation. Engineering owns tool reliability and recovery. Security and privacy owners approve data and access controls. Operations owns monitoring, handoffs, and incident response. One person may cover more than one role, but no responsibility should disappear into the phrase “the agent decided.”

Governance review should be triggered by meaningful change, not only by a launch meeting. Revisit the contract when you change the model, retrieval source, tool schema, permission, policy, action class, or target user. Review it again when live behavior reveals a new failure mode. That keeps governance attached to the product lifecycle instead of turning it into a document that goes stale after approval.

Instrument the outcome funnel, then earn adoption

An agent does not succeed because users open it or send messages. It succeeds when eligible users complete a valuable job, accept the result, and return when the job recurs. Behavioral instrumentation becomes useful when agent interactions are connected to activation, retention, cost, and risk.

Measure the entire path from opportunity to outcome

Start the funnel before the conversation. If you count only people who already opened the agent, you cannot distinguish poor discovery from poor execution. Define an eligible opportunity for the specific job, then instrument the path through completion.

agent_opportunity_detected: The product can identify that the target job is present for an eligible user.
agent_offer_exposed: The relevant entry point or contextual suggestion is shown.
agent_invoked: The user starts the workflow or an authorized trigger starts it on the user’s behalf.
agent_action_proposed: The workflow produces a recommendation, draft, or preview inside the operating contract.
agent_approval_resolved: The proposed action is approved, edited, rejected, or escalated where review applies.
agent_task_completed: The external acceptance criteria are satisfied and the final state is recorded.
agent_outcome_reversed: The result is undone, reopened, corrected, or otherwise found not to be durable.

The names are less important than consistent semantics. Record the job type, user role, action class, model and workflow version, tool result, and final disposition. Use identifiers and controlled classifications where possible instead of copying sensitive prompt or retrieved content into analytics.
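Consistent semantics are easier to hold when events pass through one builder that rejects unknown names and free-text payloads. The sketch below uses the event names from the list above; the `build_event` helper, the property allow-list, and the `task_id` field are assumptions for illustration and do not represent any analytics vendor’s API.

```python
from datetime import datetime, timezone

FUNNEL_EVENTS = (
    "agent_opportunity_detected",
    "agent_offer_exposed",
    "agent_invoked",
    "agent_action_proposed",
    "agent_approval_resolved",
    "agent_task_completed",
    "agent_outcome_reversed",
)

# Controlled classifications only; prompt text and retrieved content are not allowed.
ALLOWED_PROPERTIES = frozenset({
    "job_type",
    "user_role",
    "action_class",
    "model_version",
    "workflow_version",
    "tool_result",
    "final_disposition",
})


def build_event(name: str, task_id: str, properties: dict[str, str]) -> dict[str, str]:
    """Return an analytics payload, refusing unknown events or properties."""
    if name not in FUNNEL_EVENTS:
        raise ValueError(f"unknown funnel event: {name}")
    unexpected = set(properties) - ALLOWED_PROPERTIES
    if unexpected:
        raise ValueError(f"properties not in the taxonomy: {sorted(unexpected)}")
    return {
        "event": name,
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **properties,
    }


event = build_event(
    "agent_task_completed",
    task_id="task-001",  # invented identifier
    properties={"job_type": "billing_reply", "action_class": "draft", "workflow_version": "v3"},
)
```

Passing a property such as `prompt_text` raises an error, so sensitive content stays out of analytics by construction rather than by convention.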

Metric	Useful definition	Common misreading
Activation	Eligible users who complete their first accepted valuable outcome divided by eligible users exposed, for a named cohort and measurement window.	Counting a first prompt or first response as activation even when no job was completed.
Task completion	Eligible initiated tasks that meet the external acceptance criteria divided by eligible initiated tasks.	Using a model’s claim of completion or a successful API call as proof of success.
Containment	Eligible tasks completed without human takeover divided by eligible tasks started, paired with quality and later correction signals.	Rewarding fewer handoffs even when the agent should have escalated.
Time to value	Elapsed time from the eligible trigger to an accepted outcome, including waiting for review when review is part of the workflow.	Measuring response latency while ignoring the rest of the job.
Acceptance and editing	Results accepted as presented, accepted after a material edit, rejected, or escalated. Define “material” for the job.	Treating any click on “approve” as equal, regardless of the correction required before approval.
Handoff quality	Handoffs containing the required context and accepted as usable by the receiving role divided by all handoffs.	Viewing every handoff as failure instead of distinguishing correct escalation from avoidable escalation.
Cost per successful outcome	Variable model, tool, infrastructure, and human-review costs divided by accepted completed outcomes.	Optimizing token cost while ignoring rework, review time, or failed attempts.
Risk signals	Blocked prohibited calls, unauthorized attempts, reversals, policy escalations, and incidents, reported as counts and against the relevant opportunity denominator.	Combining materially different events into one average quality score.

Segment these metrics by job, user role, action class, workflow version, tool, and risk class. An overall completion rate can improve while a high-consequence segment gets worse. Version-level segmentation also tells you whether a prompt, retrieval, model, or interface change actually altered behavior.
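Several of the definitions in the table reduce to ratios over the same task records. The sketch below computes four of them for one cohort; the record fields (`user_id`, `initiated`, `accepted`, `human_takeover`) and the figures are invented, and a real implementation would also apply the cohort window and segment filters described above.

```python
from typing import Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of dividing by zero when a segment is empty."""
    return numerator / denominator if denominator else None


def funnel_metrics(tasks: list[dict], exposed_users: set[str], variable_cost: float) -> dict:
    initiated = [t for t in tasks if t["initiated"]]
    completed = [t for t in initiated if t["accepted"]]
    contained = [t for t in completed if not t["human_takeover"]]
    activated_users = {t["user_id"] for t in completed} & exposed_users
    return {
        # Users with an accepted outcome over users exposed, not over users who chatted.
        "activation": ratio(len(activated_users), len(exposed_users)),
        "task_completion": ratio(len(completed), len(initiated)),
        "containment": ratio(len(contained), len(initiated)),
        "cost_per_successful_outcome": ratio(variable_cost, len(completed)),
    }


# Invented cohort: four exposed users, three initiated tasks, two accepted outcomes.
tasks = [
    {"user_id": "u1", "initiated": True, "accepted": True, "human_takeover": False},
    {"user_id": "u2", "initiated": True, "accepted": True, "human_takeover": True},
    {"user_id": "u3", "initiated": True, "accepted": False, "human_takeover": True},
    {"user_id": "u4", "initiated": False, "accepted": False, "human_takeover": False},
]
metrics = funnel_metrics(tasks, exposed_users={"u1", "u2", "u3", "u4"}, variable_cost=30.0)
```

With these records, activation is 0.5, task completion is 2/3, containment is 1/3, and cost per successful outcome is 15.0. Counting first prompts instead would have reported activation of 0.75.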

Pair leading signals with durable outcomes. Edits, rejection, undo, escalation, and approval time can expose friction quickly. Repeated successful use, lower rework, and movement in the target business outcome tell you whether the product is creating lasting value. An increase in escalation is not automatically bad: it may mean the control became easier to use. Inspect whether the escalation was correct and whether the receiving person could act on it.

Let evidence earn each expansion of autonomy

Adoption is a behavior-change problem. Users need to notice the agent at the moment the job occurs, understand its boundary, inspect its work, and recover when it is wrong. A generic product tour may create awareness, but it does not establish trust in a consequential workflow.

Move through deployment modes according to evidence rather than a predetermined calendar:

Shadow mode: Run the workflow without exposing a result or changing state. Compare its proposed outcome with the accepted human outcome and use disagreements to improve the contract and evaluations.
Assisted mode: Let the user request a recommendation or editable draft. Make the evidence and limitations visible, and collect structured edit and rejection reasons.
Approved execution: Show the exact proposed change and require explicit confirmation before the tool commits it. Test authorization, audit, recovery, and handoff paths under live operating conditions.
Bounded autonomy: Allow execution only for the job, users, data, conditions, and limits approved in the operating contract. Continue monitoring outcomes and preserve a kill switch, rollback path, and accountable operator.

Advancement should depend on the evaluation suite, live outcome quality, tool reliability, policy compliance, recovery readiness, and the receiving team’s ability to handle escalations. If the evidence is mixed, narrow the action class or eligible population. Do not compensate for unresolved risk by making the prompt longer.
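The advancement rule can be written as a gate in which every evidence check must pass before the workflow moves up one mode. This is a sketch under assumptions: the check names paraphrase the criteria above, and `next_mode` is a hypothetical helper rather than part of any framework.

```python
from enum import IntEnum


class Mode(IntEnum):
    SHADOW = 0
    ASSISTED = 1
    APPROVED_EXECUTION = 2
    BOUNDED_AUTONOMY = 3


# Hypothetical check names mirroring the advancement criteria above.
EVIDENCE_CHECKS = (
    "evaluation_suite",
    "live_outcome_quality",
    "tool_reliability",
    "policy_compliance",
    "recovery_readiness",
    "escalation_capacity",
)


def next_mode(current: Mode, evidence: dict[str, bool]) -> Mode:
    """Advance one mode only when every check passes; otherwise stay put."""
    missing = [check for check in EVIDENCE_CHECKS if not evidence.get(check, False)]
    if missing or current is Mode.BOUNDED_AUTONOMY:
        return current
    return Mode(current + 1)


all_clear = {check: True for check in EVIDENCE_CHECKS}
mixed = {**all_clear, "recovery_readiness": False}

print(next_mode(Mode.ASSISTED, all_clear).name)  # APPROVED_EXECUTION
print(next_mode(Mode.ASSISTED, mixed).name)      # ASSISTED
```

The function never skips a mode and has no date input, which keeps the calendar out of the decision. Narrowing the action class or eligible population on mixed evidence remains a human call outside this gate.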

The interface should answer the user’s practical questions before asking for trust:

Why is the agent appearing at this moment?
What task can it complete, and what remains the user’s responsibility?
Which records or evidence will it use?
What will change if the user approves?
Can the result be edited or undone?
Where does the task go if the agent cannot complete it?

Surface the agent inside the existing workflow when the eligible job appears. State the action in task language, such as “prepare this response” or “verify and update this record,” rather than “ask AI anything.” Keep preview, edit, reject, undo, and escalation controls visible at the decision point. Contextual guidance is most useful when it removes a known piece of friction, not when it explains AI in general.

Use experiments for choices that are safe to vary: entry-point placement, explanation copy, prompt starters, preview layout, or the order of optional steps. Do not A/B test away required approvals, access controls, or safety boundaries. Time-to-value, task completion, edits, undo patterns, and escalation requests provide a more useful adoption picture than raw message volume.

Define activation as the first accepted outcome, not the first interaction. For a drafting workflow, that may be the first reviewed artifact that is actually used. For an operations workflow, it may be the first verified state change. The exact event should match the operating contract, and retention should measure return when the same job recurs rather than habitual chatting that produces no business result.

Key takeaways: use this launch gate

Before exposing an agent to production data or expanding its autonomy, require a clear yes to each question:

Can the job be stated with one user, one trigger, one observable outcome, and explicit stop conditions?
Are read, draft, and write permissions separated and enforced outside the prompt?
Does the evaluation set cover ambiguity, missing evidence, tool failure, policy boundaries, and handoff behavior?
Can every consequential tool validate authorization, return a clear result, and recover safely where recovery is possible?
Is the action classified by consequence and reversibility, with an appropriate approval path?
Can a reviewer see the evidence, proposed effect, missing information, and recovery option before approving?
Is there a named owner for outcomes, policy interpretation, monitoring, escalation, and incident response?
Can analytics connect an eligible opportunity to an accepted outcome, later correction, cost, and risk?
Can the product be narrowed, paused, or rolled back without waiting for a new model release?

A “no” does not have to stop all learning. It should stop the unsafe action. Move the pilot to shadow, advisory, or draft mode while the missing control is built.

For your next roadmap review, bring four artifacts instead of another open-ended demo: the operating contract, the evaluation matrix, the action classification, and the instrumented outcome funnel. Ship the smallest permissioned workflow that can prove value. Let observed outcomes, not confidence in the demo, earn the next level of autonomy.

January 4, 2026

Stop Choosing: Blend Inside-Out and Outside-In Thinking to Accelerate Product-Led Growth

I’ve never seen great products emerge from a one-sided mindset. Inside-out thinking (strategy-first) and outside-in thinking (customer-first) aren’t rivals—they’re a flywheel. When I weave product vision and defensible differentiation together with real customer signals and behavioral data, adoption climbs, engagement deepens, and the roadmap becomes a catalyst for growth rather than a list of features.

For clarity: inside-out anchors on product strategy, value proposition, and the unique capabilities only we can deliver. Outside-in centers on continuous discovery, user research, and telemetry that reveals what customers actually do—not just what they say. At HighLevel, we pair these perspectives in every planning cycle so we’re bold in direction and grounded in evidence.

Increase revenue, cut costs, and reduce risk with Pendo’s Software Experience Management platform. Optimize the entire software experience to drive adoption and improve engagement.

That promise captures why the blend matters. Product-led growth lives or dies on moments like activation, time-to-first-value, and day-30 retention. Inside-out thinking ensures we’re building toward a compelling vision; outside-in thinking ensures users can discover, adopt, and realize value through clear onboarding, in-app guides, and contextual product tours.

Here’s how I apply it in practice. We start by articulating the smallest, sharpest version of our strategy—who we serve, the jobs we must win, and the non-negotiable outcomes. Then we pressure-test that thesis with continuous discovery: call snippets, funnel analysis, pathing, and retention analysis by cohort. When friction shows up in onboarding or early feature adoption, we deploy targeted in-app guides and tours to accelerate user activation without bloating the product or training costs.

A simple operating rhythm keeps the balance: begin each quarter with OKRs that target outcomes rather than output and are tied to adoption and retention; instrument flows to expose drop-offs; ship iterative improvements; and reinforce them with just-in-time guidance. We use outside-in signals to sequence what we tackle next, and inside-out conviction to avoid chasing noise. The result is faster learning cycles and fewer expensive reworks.

Measurement closes the loop. I track activation rate, time-to-first-value, engagement with the few behaviors that predict renewal, and the impact of each guide or tour on completion rates. When we see lift, we codify the pattern; when we don’t, we prune and refocus. That evidence-based cadence keeps teams empowered and stakeholders aligned.

Culture makes this sustainable. Empowered product teams own outcomes, not tickets. Stakeholder management becomes easier when decisions are grounded in a clear strategy and transparent evidence from real users. And customers feel the difference when the product teaches itself—meeting them with the right help, in the right moment, without getting in their way.

If you’ve been choosing between inside-out and outside-in, stop. Fuse them. Lead with a crisp product strategy, listen with humility, and operationalize adoption through purposeful onboarding, in-app guides, and product tours. That’s how we compound learning, reduce risk, cut support costs, and accelerate product-led growth.

Inspired by this post on Pendo – Perspectives.

January 4, 2026

AI Context Engineering: A Practical System for Product Teams

You ask an AI model for a feature brief. It returns polished prose, sensible recommendations, and a tidy set of success criteria. Then the review starts: the target segment is wrong, the customer evidence is anecdotal, a strategic constraint is missing, and nobody can tell where the claims came from.

This usually isn’t a writing problem. It is a context system problem. Reliable product work starts with selecting, compressing, and structuring the knowledge the model needs before it generates anything. AI context engineering turns that practice into a repeatable operating system for your team.

The goal is not to give the model everything your company knows. The goal is to provide the smallest sufficient body of evidence for the decision in front of you, while preserving enough lineage for a reviewer to inspect the result.

Key takeaways

Start with a decision contract that defines the decision, audience, constraints, evidence standard, and required output.
Build a compact context pack from canonical strategy, relevant behavioral data, direct customer evidence, operating constraints, and decision history.
Retrieve before you generate. Use metadata, recency, authority, and relevance to select evidence instead of dumping entire repositories into the context window.
Preserve traceability. Every important claim should point to an evidence identifier, and the output should separate observations, inferences, and recommendations.
Version the prompt and context together, then evaluate the complete system through rework, review time, first-pass alignment, and evidence fidelity.

Start with the decision, not the document

Product teams often describe the artifact they want rather than the decision it must support. “Draft a PRD,” “summarize these interviews,” or “write a roadmap rationale” sounds concrete, but each request leaves the model to infer what matters.

That ambiguity changes retrieval. A positioning decision needs competitive and customer-language context. A prioritization decision needs strategy, affected users, behavioral evidence, constraints, and opportunity cost. Release notes need verified product behavior, the intended audience, and approved terminology. The same generic prompt cannot reliably determine those boundaries.

Before gathering evidence, write a decision contract with these fields:

Decision: What choice, judgment, or next action will this output support?
Audience: Who will review or use it, and what do they already know?
Deliverable: What sections, level of detail, and format are required?
Boundaries: What is explicitly out of scope, already decided, or prohibited?
Evidence standard: Which claims require direct evidence, and how should citations appear?
Uncertainty: What should the model do when evidence is missing, stale, or contradictory?

A weak request is: “Summarize onboarding research.” A decision-ready request is: “Help the product trio decide whether the onboarding problem should enter discovery. Identify the affected cohort, observed friction, strength of evidence, unresolved questions, and the next research step. Do not recommend a roadmap commitment.”

The second request gives retrieval a job. It tells the system which evidence to find and gives reviewers a basis for rejecting unsupported output.

Give conflicting evidence an explicit hierarchy

Most internal knowledge bases contain competing versions of reality. A planning deck may conflict with an approved strategy. A recent support conversation may contradict an older research summary. A customer request may not match observed behavior. Without an authority rule, the model may blend these artifacts into a confident compromise that nobody actually endorsed.

A practical default hierarchy is:

Current, approved strategy and explicit leadership decisions establish the frame.
Behavioral evidence establishes what users did within the measured population and period.
Verbatim customer evidence establishes what particular customers said and how they described the problem.
Support and operational signals reveal recurring friction that may need further validation.
Team hypotheses remain hypotheses until stronger evidence supports them.

This is a starting rule, not a universal ranking. Your hierarchy should match the decision. The important move is to state it. Freshness alone does not make an artifact authoritative, and authority alone does not make old evidence current. When two credible artifacts disagree, instruct the model to expose the conflict rather than reconcile it silently.
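The default hierarchy and the “expose the conflict” rule can be sketched in a few lines. Everything here is illustrative: the `kind` labels are shorthand for the five tiers above, the evidence items are invented, and comparing `claim` strings is a crude stand-in for whatever conflict detection a team actually uses.

```python
from itertools import combinations

# Shorthand labels for the default hierarchy above, highest authority first.
AUTHORITY_ORDER = (
    "approved_strategy",
    "behavioral",
    "verbatim_customer",
    "support_signal",
    "team_hypothesis",
)


def order_by_authority(items: list[dict]) -> list[dict]:
    """Sort evidence so higher-authority kinds come first (stable within a kind)."""
    return sorted(items, key=lambda item: AUTHORITY_ORDER.index(item["kind"]))


def find_conflicts(items: list[dict]) -> list[tuple[str, str]]:
    """Pairs of evidence ids that address the same topic with different claims."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(items, 2)
        if a["topic"] == b["topic"] and a["claim"] != b["claim"]
    ]


# Invented evidence items.
evidence = [
    {"id": "E2", "kind": "team_hypothesis", "topic": "target_segment", "claim": "enterprise admins"},
    {"id": "E3", "kind": "behavioral", "topic": "onboarding_dropoff", "claim": "step 3"},
    {"id": "E1", "kind": "approved_strategy", "topic": "target_segment", "claim": "mid-market admins"},
]

ordered = order_by_authority(evidence)
conflicts = find_conflicts(ordered)
print([item["id"] for item in ordered])  # ['E1', 'E3', 'E2']
print(conflicts)                         # [('E1', 'E2')]
```

The ordering tells the model which artifact frames the decision; the conflict list is passed along as its own section so the disagreement is reported, not averaged away.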

Build a minimum viable context pack

A context pack is the evidence package for one task. It is deliberately narrower than a company knowledge base. Each item earns its place by answering a question the requested output must address.

Context layer	Question it answers	Useful artifact
Strategic frame	Why does this problem matter now?	Approved strategy statement, objective, or decision principle
Affected user	Who experiences the problem?	Cohort definition, segment criteria, or relevant account profile
Behavior	What happened in the product?	Usage pattern, funnel analysis, retention signal, or journey evidence
Customer need	How do users describe the problem?	Verbatim interview excerpts, support conversations, or research synthesis
Constraints	What limits the solution space?	Technical, operating, commercial, or policy constraint
Decision history	What has already been decided or rejected?	Decision record with rationale and status

Do not fill every row by default. For a narrow writing task, two layers may be enough. For a prioritization decision, several may be essential. Start with the requested output and ask which evidence would allow a skeptical reviewer to verify each section.

A strong feature-brief pack can be surprisingly small: one strategy paragraph, one analysis of the affected usage cohort, and five verbatim customer quotes. That combination gives the model a frame, a population, and direct language from users. You can then request a problem statement, success criteria, and solution hypotheses, with every element tied to evidence.

The example works because each artifact has a different job. Five documents making the same strategic argument would create repetition, not coverage. Context quality comes from complementary evidence, not document count.

Turn each artifact into an evidence unit

Raw files are difficult to retrieve and easy to misread. Wrap each relevant slice in a small evidence unit:

Identifier: a stable label such as E1 or E2 that the output can cite.
Origin: the system, analysis, interview, or decision record from which it came.
Status: approved, draft, superseded, disputed, or observational.
Scope: the segment, cohort, workflow, product area, and period to which it applies.
Relevant finding: a concise summary written for the current decision.
Raw evidence: the excerpt, data slice, or linked artifact needed to inspect the summary.
Caveat: a known limitation, missing comparison, or unresolved contradiction.

This two-layer structure solves a common compression problem. The short summary conserves context-window space, while the raw excerpt preserves wording and qualifiers when nuance matters. Do not repeatedly summarize prior summaries. Each compression step can remove scope, uncertainty, and disagreement. Keep a path back to the underlying evidence.

You have enough context when every required part of the deliverable has relevant evidence, major conflicts are represented, and additional artifacts merely repeat what is already present. If an output section has no supporting evidence, either retrieve more or label the section as an open question. Do not ask fluent prose to hide the gap.
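An evidence unit maps naturally onto a small record, and the “enough context” test becomes a coverage check against the deliverable’s sections. In this sketch the `supports` field and the `coverage_gaps` helper are assumptions added for illustration; the other fields follow the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceUnit:
    identifier: str
    origin: str
    status: str        # approved, draft, superseded, disputed, or observational
    scope: str
    finding: str       # short summary written for the current decision
    raw_evidence: str  # excerpt or link needed to inspect the summary
    caveat: str = ""
    supports: tuple[str, ...] = ()  # hypothetical: deliverable sections this unit backs


def coverage_gaps(required_sections: list[str], units: list[EvidenceUnit]) -> list[str]:
    """Sections with no supporting unit; superseded units do not count."""
    covered = {
        section
        for unit in units
        if unit.status != "superseded"
        for section in unit.supports
    }
    return [section for section in required_sections if section not in covered]


# Invented units for a feature brief.
units = [
    EvidenceUnit("E1", "strategy record", "approved", "all segments",
                 "Onboarding is a current priority.", "link to strategy statement",
                 supports=("problem_statement",)),
    EvidenceUnit("E2", "funnel analysis", "observational", "new admins, last quarter",
                 "Drop-off concentrates at the integration step.", "link to analysis",
                 caveat="No comparison cohort.", supports=("affected_cohort",)),
    EvidenceUnit("E3", "planning deck", "superseded", "all segments",
                 "Earlier success targets.", "link to deck",
                 supports=("success_criteria",)),
]

gaps = coverage_gaps(["problem_statement", "affected_cohort", "success_criteria"], units)
print(gaps)  # ['success_criteria']
```

A non-empty result is the cue to retrieve more or to label that section as an open question.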

Retrieve, compress, and assemble in that order

Large context windows make it tempting to attach whole repositories. That usually transfers the curation problem to the model. Relevant evidence must now compete with stale plans, duplicate findings, unrelated segments, and abandoned decisions.

A retrieval-first pipeline can combine semantic matching with metadata filters and recency rules. Semantic similarity finds conceptually related material. Metadata determines whether that material belongs to the right product area, cohort, status, and time frame. Authority rules decide which version should govern when multiple candidates match.

Use this sequence:

Translate the decision contract into evidence questions. Ask what strategic frame, customer signal, behavior, constraint, and decision history are required.
Filter by hard boundaries first. Exclude the wrong product area, segment, status, or period before semantic ranking.
Retrieve relevant slices rather than complete files. A paragraph, chart interpretation, interview excerpt, or decision entry is often the useful unit.
Check authority and freshness. Mark superseded items and retain an older artifact only when its historical context matters.
Check coverage and contradiction. Confirm that the pack represents the affected population and does not hide credible opposing evidence.
Compress each selected item into an evidence unit, retaining a link or raw excerpt for verification.
Assemble the context in a fixed interface so the model can distinguish instructions, evidence, and the requested output.
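The ordering in that sequence, hard boundaries before ranking, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Artifact` fields are hypothetical, and `keyword_score` is a crude word-overlap stand-in for real semantic similarity, not an embedding model.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    artifact_id: str
    product_area: str
    status: str   # e.g. "approved" or "superseded"
    year: int
    text: str


def keyword_score(question: str, text: str) -> float:
    """Stand-in for semantic similarity: share of question words found in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def retrieve(question, artifacts, product_area, min_year, top_k=2):
    # Hard boundaries first: wrong area, stale period, and superseded items
    # are excluded before any ranking happens.
    in_scope = [
        a for a in artifacts
        if a.product_area == product_area
        and a.year >= min_year
        and a.status != "superseded"
    ]
    # Rank only what survived the filters.
    ranked = sorted(in_scope, key=lambda a: keyword_score(question, a.text), reverse=True)
    return ranked[:top_k]


# Made-up artifacts for illustration.
artifacts = [
    Artifact("A1", "onboarding", "approved", 2025, "admins abandon onboarding at the permissions step"),
    Artifact("A2", "onboarding", "superseded", 2025, "why do admins abandon onboarding"),
    Artifact("A3", "billing", "approved", 2025, "admins abandon onboarding invoices"),
    Artifact("A4", "onboarding", "approved", 2022, "admins abandon onboarding"),
    Artifact("A5", "onboarding", "approved", 2025, "weekly usage of dashboards"),
]
selected = retrieve("why do admins abandon onboarding", artifacts, "onboarding", min_year=2024)
print([a.artifact_id for a in selected])
```

Note that A2 matches the question word for word yet never reaches ranking, because it is superseded. That is the behavior the filter-first order is meant to guarantee.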

Retrieval should also preserve access boundaries. An AI layer should not expose an artifact to someone who could not access it in its system of record. Treat customer material and internal strategy as governed inputs, not convenient prompt text.

Use a stable context interface

I treat the prompt as an interface to the context system, not as the system itself. A useful interface contains these blocks in a consistent order:

Role and objective: the perspective the model should take and the decision it must support.
Audience: the people who will use the deliverable and the assumptions they already share.
Constraints: scope boundaries, settled decisions, prohibited claims, and required terminology.
Evidence: labeled units such as E1, E2, and E3, each with status, scope, summary, raw support, and caveats.
Explicit ask: the analysis or artifact required, expressed as concrete questions.
Output contract: required sections, length, ordering, and citation format.
Evidence rules: cite material claims, distinguish observation from inference, expose conflicts, and avoid unsupported facts.
Self-check: identify missing evidence, unverified assumptions, constraint violations, and statements that lack citations.
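A minimal sketch of such an interface, assuming the block names above and a plain `=== name ===` delimiter (the delimiter and the function name are my own illustrative choices):

```python
BLOCK_ORDER = [
    "Role and objective",
    "Audience",
    "Constraints",
    "Evidence",
    "Explicit ask",
    "Output contract",
    "Evidence rules",
    "Self-check",
]


def assemble_context(blocks: dict[str, str]) -> str:
    """Join the blocks in one fixed order and fail loudly if any block is empty or missing."""
    missing = [name for name in BLOCK_ORDER if not blocks.get(name, "").strip()]
    if missing:
        raise ValueError(f"Missing context blocks: {', '.join(missing)}")
    return "\n\n".join(
        f"=== {name} ===\n{blocks[name].strip()}" for name in BLOCK_ORDER
    )
```

Failing on a missing block is deliberate in this sketch: an absent Constraints or Evidence block should stop the run rather than silently produce a thinner prompt.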

Do not rely on instructions such as “be accurate” or “think carefully.” They do not define what accuracy means for this task. A stronger rule is: “Cite an evidence identifier after every material claim. If the pack does not support a claim, label it as an inference or omit it. List unresolved questions separately.”
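A citation rule like this has the advantage of being checkable. The sketch below assumes citations appear as `[E1]`-style markers inside the sentence and that inferences are prefixed with `Inference:`; both conventions are assumptions for illustration, and the sentence splitter is deliberately naive.

```python
import re

CITATION = re.compile(r"\[E\d+\]")


def uncited_sentences(text: str) -> list[str]:
    """Return sentences with no [E<n>] marker that are not labeled as inference."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        s for s in sentences
        if not CITATION.search(s) and not s.lower().startswith("inference:")
    ]


draft = (
    "Admins stall at the permissions step [E1]. "
    "Most of them churn within a week. "
    "Inference: clearer role names may help."
)
print(uncited_sentences(draft))
```

A check like this only proves that a marker is present. Whether the cited unit actually supports the claim still needs human review.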

Diagnose output failures as context defects

Output symptom	Likely context defect	Corrective move
Generic recommendations	The pack lacks customer, behavior, or constraint evidence	Add decision-specific evidence instead of more role-playing instructions
Confident but outdated claims	Retrieval ignored status, authority, or recency	Filter superseded artifacts and define which record is canonical
Important nuance disappears	Compression removed qualifiers or disagreement	Restore raw excerpts and carry caveats into the evidence units
Long output that does not support a decision	The ask names an artifact but not the decision	Rewrite the decision contract and remove irrelevant context
Stakeholders distrust the result	Claims have no visible lineage	Require evidence identifiers and preserve links to underlying artifacts
Repeated runs produce different conclusions	The prompt or context changed without version control	Snapshot both inputs and compare one controlled change at a time

This diagnostic matters because prompt edits can disguise the real failure. If the wrong cohort entered the pack, a more detailed output format will only produce a better-organized mistake.

Manage context quality as a product system

A single well-curated prompt can produce a good result. A product team needs a system that can produce a good result again, show why it was good, and reveal what changed when quality declines.

Make the output auditable

Ask the model to separate three kinds of statements:

Observation: directly supported by an evidence unit.
Inference: a reasoned interpretation that connects observations.
Recommendation: a proposed action that depends on evidence, assumptions, and product judgment.

This distinction prevents a plausible interpretation from being presented as a measured fact. Behavioral analytics can show a pattern within its defined cohort and period; it does not, by itself, establish why the behavior occurred. A customer quote can establish that a person expressed a need; it does not, by itself, establish prevalence. The final recommendation still needs human judgment about strategy, trade-offs, and risk.

For consequential work, request a smaller cited output first. Review its evidence mapping, then expand it into a PRD, roadmap narrative, or executive brief. This makes unsupported reasoning easier to catch than reviewing a long deliverable after the model has built several sections on the same weak assumption.

Version the whole generation package

Store these elements together for each run:

Workflow and template version
Decision contract
Context snapshot and evidence identifiers
Retrieval and filtering rules
Prompt version
Model output
Human review result and requested changes

Prompt versioning without context versioning is incomplete. Two runs using identical instructions can diverge because an approved strategy changed, a stale analysis entered retrieval, or a different set of interviews was selected. The context snapshot lets you explain that difference.
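One lightweight way to explain such a difference is to fingerprint each component of the package and compare runs. This is a sketch under assumptions: the component keys and sample values are hypothetical, and it presumes every component is JSON-serializable.

```python
import hashlib
import json


def fingerprint(value) -> str:
    """Stable short hash of any JSON-serializable component."""
    canonical = json.dumps(value, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_components(run_a: dict, run_b: dict) -> list[str]:
    """List the package components whose content differs between two runs."""
    keys = sorted(set(run_a) | set(run_b))
    return [k for k in keys if fingerprint(run_a.get(k)) != fingerprint(run_b.get(k))]


# Two runs with identical instructions but a different evidence selection.
run_1 = {
    "prompt_version": "v3",
    "decision_contract": "Decide whether to redesign the permissions step.",
    "evidence_ids": ["E1", "E2"],
    "retrieval_rules": {"product_area": "onboarding", "exclude_status": ["superseded"]},
}
run_2 = {**run_1, "evidence_ids": ["E1", "E4"]}
print(changed_components(run_1, run_2))
```

Here the prompt version is unchanged, so prompt versioning alone would report no difference; the comparison points at the evidence selection instead.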

Evaluate the workflow, not the elegance of one answer

Create a small evaluation set from real, recurring product tasks. Keep the decision and expected evidence stable while testing changes to retrieval, compression, context ordering, or instructions. Change one major variable at a time; otherwise you will not know what improved the result.

Review each run against a consistent rubric:

Evidence fidelity: Do claims accurately represent the cited material and its scope?
Coverage: Does the output address every required part of the decision?
Constraint adherence: Does it respect settled decisions, exclusions, and required terminology?
Traceability: Can a reviewer follow important claims back to evidence?
Uncertainty handling: Are missing, stale, or contradictory inputs visible?
Decision usefulness: Can the intended audience act, decide, or request the right next evidence?

At the workflow level, track rework rate, review time, and stakeholder alignment on the first pass. These measures reveal whether the system reduces review burden and improves decision readiness. Output volume does not.

When an evaluation fails, route the defect to the right layer. Evidence fidelity usually points to retrieval, source selection, or compression. Constraint failures point to the context interface. A technically correct but unusable deliverable points back to the decision contract. This turns AI quality from a subjective debate into a product improvement loop.
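The routing described above can be written down as a small lookup so that review results always land in the same place. The criterion keys are hypothetical names for the rubric items, and criteria the text does not route are flagged for manual triage rather than guessed.

```python
# Routes stated in the text; other rubric criteria are intentionally left unrouted.
DEFECT_ROUTES = {
    "evidence_fidelity": "retrieval, source selection, or compression",
    "constraint_adherence": "context interface",
    "decision_usefulness": "decision contract",
}


def route_defects(scores: dict[str, bool]) -> dict[str, str]:
    """Map each failed rubric criterion to the layer to inspect first."""
    return {
        criterion: DEFECT_ROUTES.get(criterion, "unrouted: triage manually")
        for criterion, passed in scores.items()
        if not passed
    }


review = {"evidence_fidelity": False, "coverage": True, "traceability": False}
print(route_defects(review))
```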

Template workflows only after you understand their evidence needs

Discovery synthesis, roadmap rationale, feature briefs, and release notes are good candidates because they recur and have recognizable inputs. Give each workflow its own decision contract, required context layers, retrieval filters, output contract, and evaluation rubric. Do not force them into one universal mega-prompt.

Start with one workflow your team already performs frequently. Take a real task, define the decision, assemble a compact evidence pack, assign identifiers, and review the result against the rubric above. Save the complete generation package. On the next run, change one weak layer and compare the review burden.

Once that loop is repeatable, AI stops being a blank page with a clever prompt. It becomes a governed product workflow whose inputs, reasoning boundaries, and quality can be inspected and improved.

References

Pendo – AI Context Pulling Playbook: How I Get LLMs and Teams to Collaborate for Better Product Outcomes

January 4, 2026

AI Transformation Is an Operating Model, Not a Feature Roadmap

You probably do not have an AI ideas problem. You have a conversion problem. Promising prototypes appear across the company, but few survive the distance between a convincing demo and a dependable customer or business outcome.

The way out is to stop treating AI transformation as a feature portfolio. Treat it as a redesign of how your organization senses problems, makes decisions, takes safe action, and learns from production. The practical unit of change is one closed loop with an accountable owner, trusted context, explicit guardrails, and measurable results.

Key takeaways: the transformation system in brief

Start with a bounded customer or employee workflow, not a company-wide AI program or a preferred model.
Define the outcome, quality threshold, action boundary, and fallback before choosing the implementation.
Build capabilities in dependency order: governed data, grounded context, constrained workflows, task-specific evaluations, and production operations.
Measure customer outcomes, AI behavior, delivery reliability, and organizational learning separately. No single metric can represent all four.
Centralize reusable controls and infrastructure, but keep problem selection and outcome ownership inside the domain team.
Increase autonomy only after the system can detect failure, escalate uncertainty, limit permissions, and recover safely.

Start with a transformation wedge, not a transformation program

A broad mandate such as “make every team AI-first” sounds ambitious but gives teams no useful decision rule. It encourages tool adoption, disconnected pilots, and activity metrics. A narrower mandate forces the hard questions into the open.

I call that narrower unit a transformation wedge: a bounded, repeatable moment where intelligence can remove meaningful friction, where the result can be observed, and where a safe fallback already exists. The wedge is small enough to govern but important enough to prove a new organizational capability.

Use these gates when selecting it:

Meaningful friction: A customer or employee is losing time, making avoidable errors, or failing to complete an important job.
Observable outcome: You can instrument the desired behavior rather than relying on opinions about output quality.
Available context: The system can reach sufficiently trusted information without placing sensitive data into an uncontrolled context.
Repeatable demand: The workflow occurs often enough to produce learning that the team can use.
Bounded consequence: The system can be constrained, reviewed, escalated, or reversed when confidence is inadequate.
Reusable learning: At least one capability – such as retrieval, evaluation, telemetry, or an integration – can support the next workflow.

This distinction changes the conversation. “Add a support chatbot” is an implementation idea. “Reduce the time to an accurate support resolution while preserving policy adherence” is a transformation wedge. The second framing leaves room to choose retrieval, workflow automation, agentic behavior, or a simpler interface based on evidence.

Write the outcome contract before selecting a model

For the selected wedge, create a short outcome contract. It should be understandable to product, engineering, design, operations, security, and the executive sponsor without translation.

User and moment: Who encounters the friction, and at what point in the workflow?
Current behavior: What happens without the AI intervention, and what baseline evidence is available?
Primary outcome: Which customer or business behavior should change?
Quality guardrails: Which failure measures must remain within an agreed boundary?
Trusted context: Which data may be used, who owns it, and which sensitive fields must be removed or protected?
Action boundary: May the system summarize, recommend, communicate, or execute? Name prohibited actions explicitly.
Fallback: What happens when evidence is missing, the model is uncertain, an integration fails, or a policy conflict appears?
Release evidence: Which offline evaluations, controlled experiments, and production signals will justify expansion?
Accountability: Who owns the outcome, the AI behavior, the data, and incident decisions?

In a support workflow, for example, the contract might pair a resolution outcome with accuracy and policy-adherence guardrails. A retrieval-first path can ground the response in approved knowledge, while a defined escalation route gives the system somewhere safe to send ambiguity. That combination of grounding, constrained action, evaluation, and escalation is much more consequential than the choice of chat interface.
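The action boundary and fallback in particular translate well into a guard that sits in front of the model's requested operation. The sketch below assumes the four action types can be ordered by increasing authority, which is my simplification, and the operation names are hypothetical.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action types from the contract, ordered by increasing authority (an assumption)."""
    SUMMARIZE = 1
    RECOMMEND = 2
    COMMUNICATE = 3
    EXECUTE = 4


def decide(requested: Action, operation: str, boundary: Action, prohibited: frozenset) -> str:
    """Return 'allow', 'escalate', or 'blocked' for a requested operation."""
    if operation in prohibited:
        return "blocked"      # explicitly named prohibitions always win
    if requested > boundary:
        return "escalate"     # beyond the agreed boundary: use the fallback route
    return "allow"


# Hypothetical support contract: the system may summarize and recommend only.
BOUNDARY = Action.RECOMMEND
PROHIBITED = frozenset({"issue_refund"})

print(decide(Action.SUMMARIZE, "summarize_ticket", BOUNDARY, PROHIBITED))
print(decide(Action.EXECUTE, "update_address", BOUNDARY, PROHIBITED))
print(decide(Action.RECOMMEND, "issue_refund", BOUNDARY, PROHIBITED))
```

Keeping this check deterministic and outside the model means the boundary holds even when the model's output is wrong.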

Instrument the baseline and the intervention from the beginning. If telemetry arrives after launch, the team will be able to show that an AI feature shipped but not whether the targeted behavior improved.

Build the capability stack and the product loop together

Teams often start in the middle of the stack: they select a model, write prompts, and then discover that the data is unreliable, evaluation is subjective, or production failures have no owner. Model capability matters, but it cannot compensate for missing organizational capability.

Build the stack in dependency order:

Governed data: Identify approved data, access rules, sensitive fields, and accountable owners. Privacy-by-design belongs in the workflow definition, not in a review added before release.
Trusted context: When the task depends on company or customer knowledge, retrieve the relevant context from approved systems and control what enters the model’s context window. Define what the system should do when evidence is incomplete or conflicting.
Constrained workflow: Separate model judgment from deterministic operations. Give each integration an explicit purpose, permission boundary, failure path, and audit trail. Agentic AI should orchestrate only the actions the organization is prepared to observe and govern.
Task-specific evaluation: Build scenarios from the real workflow. Include expected cases, ambiguous inputs, missing context, policy conflicts, and known high-consequence failures. Define acceptance criteria before comparing prompts, models, or vendors.
Release and operations: Use feature flags, controlled rollout, production telemetry, threat detection, and incident management. Assign authority to pause or limit the system when behavior drifts.

This order is not a waterfall. Retrieval quality may expose a data problem, while an evaluation failure may expose a poorly defined policy. The point is to preserve the dependencies: autonomous action cannot become dependable before context, evaluation, permissions, and operations exist.

Use AI to expand options and evidence to make commitments

The capability stack changes day-to-day product work only when it is connected to discovery, design, delivery, and adoption. The useful pattern is to let AI accelerate reversible exploration while keeping consequential decisions anchored in evidence.

Discovery: Use AI to cluster interview notes, support tickets, and session transcripts. Then inspect the underlying material and pressure-test important themes with live customer conversations. A fluent summary is a hypothesis generator, not customer validation.
Design: Generate several storyboards, interaction flows, or guidance variants early. Refine promising options through the design system, accessibility requirements, and human review rather than treating the first plausible generation as finished design.
Delivery: Use AI to prepare hypotheses, test cases, and experiment materials. Keep success metrics and the minimum detectable effect explicit, and release variants through feature flags so that speed does not erase experimental discipline.
Adoption: Generate targeted in-app guidance, release it to controlled segments, and measure activation and retention alongside the immediate interaction. Shipping the intelligent behavior and helping users adopt it are parts of the same product decision.

This combination can create a tighter discovery, design, delivery, and learning loop without pretending that model output replaces research, statistical judgment, design standards, or customer evidence.

Replace status review with a weekly learning review

Whether the accountable unit is called a product trio or something else, give it a weekly operating rhythm focused on verified learning. A useful agenda is:

Review the primary outcome and every guardrail, including meaningful segment differences.
Inspect evaluation failures and trace them to context, model behavior, policy, workflow design, or integration behavior.
Read the latest experiment evidence and distinguish a result from an interpretation.
Review reliability changes, incidents, near misses, and unresolved escalation paths.
Make an explicit decision to continue, change, limit, or stop the current approach, with an owner for the next piece of evidence.

Do not let this become a prompt-tuning meeting. Prompt changes are only one possible response. A retrieval defect, unclear product policy, missing event, weak handoff, or badly chosen outcome may be the actual constraint.

Use a metric chain instead of one AI success number

AI pilots often look healthy when they are measured by output: drafts generated, tasks attempted, people trained, or features shipped. Those numbers can describe activity, but they do not establish customer value, dependable behavior, or organizational readiness.

A transformation scorecard needs separate layers because each answers a different management question:

Measurement layer	Question it answers	Useful measures
Customer and business outcome	Did the important behavior improve?	User activation, time-to-first-value, support resolution rate or time, retention
AI quality and safety	Is the intelligent behavior reliable enough for this workflow?	Task accuracy, hallucination rate, policy adherence, correct escalation
Delivery reliability	Can the team improve the system quickly without destabilizing it?	Deployment frequency, lead time, change failure rate, mean time to recovery
Organizational learning	Is the organization reaching better decisions faster?	Cycle time, experiment throughput, decision quality against predefined evidence

The metric names are not definitions. Make each operational for the selected workflow. Accuracy might mean correct support answers, successful tool completion, or correct classification; those are different tests. A hallucination rate needs a declared denominator and a rule for what counts as unsupported. Decision quality needs a rubric tied to the evidence available when the decision was made, not whether the result later happened to be favorable.
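A small worked example shows why the denominator has to be declared. Both functions below are plausible definitions of a hallucination rate; the field names and the sample counts are made up for illustration.

```python
def claim_level_rate(responses: list[dict]) -> float:
    """Unsupported claims divided by all material claims evaluated."""
    total_claims = sum(r["claims"] for r in responses)
    unsupported = sum(r["unsupported_claims"] for r in responses)
    if total_claims == 0:
        raise ValueError("No material claims were evaluated; the rate is undefined.")
    return unsupported / total_claims


def response_level_rate(responses: list[dict]) -> float:
    """Share of responses that contain at least one unsupported claim."""
    if not responses:
        raise ValueError("No responses were evaluated; the rate is undefined.")
    flagged = sum(1 for r in responses if r["unsupported_claims"] > 0)
    return flagged / len(responses)


# Made-up review results for three responses.
reviewed = [
    {"claims": 10, "unsupported_claims": 1},
    {"claims": 2, "unsupported_claims": 0},
    {"claims": 8, "unsupported_claims": 1},
]
print(f"claim-level:    {claim_level_rate(reviewed):.1%}")
print(f"response-level: {response_level_rate(reviewed):.1%}")
```

The same review data gives 2 unsupported claims out of 20 by one definition and 2 flagged responses out of 3 by the other. Neither is wrong, but a scorecard that does not say which one it reports cannot be compared across teams or over time.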

Connect the layers as a metric chain. In grounded support, retrieval and response evaluations establish whether the system can produce an accurate answer. Product telemetry shows whether the customer receives a useful resolution or an appropriate escalation. Resolution and retention measures show whether that behavior matters to the business. Delivery and learning measures show whether the organization can improve the loop repeatedly.

Interpret disagreement between the layers

The disagreements are often more informative than the headline result:

If offline evaluations improve but customer behavior does not, inspect workflow placement, user trust, adoption, and whether the evaluated task matches the real job.
If customer outcomes improve while policy adherence deteriorates, do not expand the rollout. The apparent win is being financed by unmanaged risk.
If deployment frequency rises while change failure rate or recovery time worsens, the team has increased release activity rather than adaptive capacity.
If cycle time falls but decisions are repeatedly reversed for missing evidence, the system is producing faster motion, not better learning.
If averages look healthy but a target segment fails, keep the rollout segmented until the failure mechanism is understood.

Use the right method for the question. Evaluations test whether AI behavior meets defined quality and safety criteria. A/B testing tests whether a product intervention changes user behavior; setting the hypothesis, success metric, and minimum detectable effect before reading results protects that inference. DORA metrics reveal the health of the delivery system. None is a substitute for the others. Connecting model, product, business, and delivery measures is what turns telemetry into an operating mechanism.
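Fixing the minimum detectable effect in advance also tells you how much traffic the test needs. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test; the defaults of 5% significance and 80% power are conventional assumptions, and `mde` is treated as an absolute difference in conversion rate.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical example: 10% baseline activation, 2-point absolute MDE.
print(sample_size_per_arm(baseline=0.10, mde=0.02))
# Halving the MDE roughly quadruples the required sample.
print(sample_size_per_arm(baseline=0.10, mde=0.01))
```

If the workflow cannot supply that many users in a reasonable period, that is a reason to choose a larger effect, a more sensitive metric, or a different method before the test starts, not after reading the results.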

Centralize guardrails and distribute outcome ownership

Organizational design usually fails at one of two extremes. A central AI group becomes a queue that is distant from customer problems, or every team builds its own prompts, data paths, evaluations, and incident process. The useful split is to centralize scarce controls and reusable capabilities while distributing domain decisions.

Centralize the capabilities that should not be reinvented

Approved data-access and privacy patterns
Retrieval, context-management, and model-routing components
Evaluation tooling, baseline scenarios, and reporting conventions
Observability, auditability, feature-flag, and incident-response patterns
Prompt and workflow libraries with named owners and change history
Security, regulatory, and procurement requirements

Keep product judgment inside the domain

Choosing the customer or employee problem
Defining the outcome and acceptable trade-offs
Validating whether retrieved context represents the domain correctly
Designing the experience, fallback, and human handoff
Running controlled rollout and interpreting segment behavior
Deciding whether to continue, constrain, redesign, or stop the bet

This division preserves empowered product teams without turning governance into optional advice. The central capability owner defines the safe road; the domain team remains accountable for choosing the destination and proving that it is worth reaching.

Scale controls with the consequence of being wrong

Do not use one approval process for every workflow. A drafting assistant and an agent that changes customer records do not create the same exposure. Classify a workflow by what it can do and what happens when it fails.

Advisory output: A person reviews the draft, summary, or analysis before it affects another party. Evaluate usefulness and factual reliability, and make the reviewer accountable for the final decision.
User-facing recommendation: The output reaches a customer or employee directly. Add grounding, policy tests, clear escalation, monitored rollout, and an accessible non-AI path.
Action-taking workflow: The system invokes tools or changes state. Limit permissions, constrain eligible actions, preserve an audit trail, test integration failures, and provide a reliable stop or recovery path.
Sensitive or regulated workflow: Add the relevant privacy, security, legal, and compliance owners before data or actions enter the system. If an approved path does not exist, keep the workflow out of production until it does.
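These tiers can be encoded as a lookup so that every new workflow gets a consistent starting checklist. This sketch makes two assumptions the text does not state: a workflow is assigned the highest tier that applies, and the sensitive or regulated requirements are layered on top of whichever tier that is.

```python
CONTROLS = {
    "advisory": [
        "evaluate usefulness and factual reliability",
        "accountable human reviewer",
    ],
    "user_facing": [
        "grounding", "policy tests", "clear escalation",
        "monitored rollout", "accessible non-AI path",
    ],
    "action_taking": [
        "limited permissions", "constrained eligible actions", "audit trail",
        "integration-failure tests", "reliable stop or recovery path",
    ],
    "sensitive": [
        "privacy, security, legal, and compliance owners involved",
        "approved path exists before production",
    ],
}


def required_controls(reaches_user: bool, changes_state: bool, sensitive: bool) -> list[str]:
    """Pick the highest applicable tier, then add the sensitive-workflow overlay if needed."""
    if changes_state:
        tier = "action_taking"
    elif reaches_user:
        tier = "user_facing"
    else:
        tier = "advisory"
    controls = list(CONTROLS[tier])
    if sensitive:
        controls += CONTROLS["sensitive"]
    return controls


# A drafting assistant versus an agent that edits regulated customer records.
print(required_controls(reaches_user=False, changes_state=False, sensitive=False))
print(required_controls(reaches_user=True, changes_state=True, sensitive=True))
```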

A human in the loop is not a complete control by itself. Name what the person must inspect, what evidence is visible, when escalation is mandatory, and whether the person has enough time and authority to intervene. Otherwise, the human becomes ceremonial approval around an automated decision.

Redesign roles around judgment, not tool usage

AI can accelerate exploration, synthesis, and test preparation. People still have to interpret customers, choose outcomes, set quality thresholds, resolve policy ambiguity, and accept accountability for consequences. Role design and hiring should reflect that boundary.

A product manager should be able to write the outcome contract, connect model behavior to user behavior, and make trade-offs visible.
A designer should be able to generate and interrogate alternatives, preserve accessibility, and design uncertainty and fallback states.
An engineer should be able to separate probabilistic behavior from deterministic operations and build evaluation, observability, permission, and recovery paths.
A leader should be able to fund reusable capability, challenge vanity metrics, and stop a persuasive demo that lacks production evidence.

Use communities of practice to spread prompt patterns, evaluation baselines, reusable workflows, and failure lessons. They work best as distribution networks for repeatable product and evaluation practices, not as committees that absorb accountability from the teams shipping the work.

At your next portfolio review, select one transformation wedge and require its outcome contract, metric chain, evaluation set, fallback, and named owners. Put it into the weekly learning rhythm before funding another disconnected pilot. Once the loop works in production, extract the reusable components and make the next team faster. That is the point at which AI stops being a collection of features and starts changing how the organization operates.

January 4, 2026

How Product Leaders Turn AI Strategy Into an Operating System

Your AI roadmap probably isn’t short of ideas. The hard decision is which ideas deserve production responsibility: a user promise, a quality bar, a failure path, an owner, and a reason to keep funding them after launch.

You operationalize AI by turning those decisions into a repeatable management system. The broader shift from experiments to execution makes that system more important than any individual model choice. It lets your teams discover useful applications, ship them responsibly, teach customers how to use them, and decide from evidence whether to scale, change, or stop.

Turn AI ambition into a portfolio of bounded bets

An AI strategy is not a list of places where a model could be added. It is a set of choices about which customer or business problems deserve investment, how much authority AI should receive, and what evidence will justify the next commitment.

Start every candidate with a one-page opportunity contract. If the team can describe the model but cannot complete the contract, the idea is not ready for prioritization.

User and moment: Name the person, the task they are trying to complete, and the point in the workflow where the difficulty occurs.
Current behavior: Record how the task works without the proposed feature. Use an observable baseline such as completion, elapsed time, handoffs, abandonment, rework, or cost per completed task.
AI contribution: State whether AI will classify, retrieve, recommend, generate, summarize, or take an action. Avoid vague phrases such as “AI-powered experience.”
Expected change: Identify the user behavior that should change first and the customer or business outcome that should follow.
Boundaries: List what the system must not decide, which data it must not use, and which users or scenarios are outside the initial release.
Consequence and reversibility: Describe what happens when the system is wrong and whether the user can inspect, correct, undo, or escalate the result.
Next evidence: Define the smallest test that could reduce the most important uncertainty. That might be a workflow prototype, customer discovery, a retrieval test, or an evaluation against representative cases.

This contract forces an important distinction between assistance and authority. Drafting a reply for a person to review is not the same product as sending that reply automatically. Recommending an account action is not the same as applying it. The second version has a larger blast radius, a different trust requirement, and a stricter need for auditability and recovery.

Begin with the minimum authority required to create value. Increase autonomy only when the evidence supports it. This is not timidity. It is a sequencing decision that lets you learn about quality and user behavior before accepting a larger operational risk.

Prioritize the resulting bets across six lenses: customer value, workflow frequency, data readiness, evaluability, blast radius, and operating cost. Do not collapse them into a decorative score that hides disagreement. Use them to expose the trade-off. A frequent, valuable task may still be a poor first bet if critical failures cannot be detected. A low-risk task may be easy to ship but too marginal to earn repeat use.

Write a stop condition at the same time as the investment case. For example: stop if the team cannot construct a credible evaluation set, if the workflow requires data the product cannot responsibly access, or if users do not reach the intended outcome after the experience and onboarding have both been tested. A portfolio becomes manageable when stopping is a designed decision rather than an admission of defeat.

Define production readiness before the team starts building

A prototype proves that a system can produce a compelling result once. A product must produce an acceptable result across the situations that matter, make its limitations understandable, and recover when the result is not acceptable.

Give each AI bet a production contract before it enters committed delivery. The contract should contain:

The user promise: Describe what the product will help the user accomplish. Do not promise intelligence in the abstract.
The context boundary: Specify which product data, retrieved knowledge, instructions, tools, and prior interactions the system may use.
The quality dimensions: Choose criteria that fit the task, such as correctness, completeness, groundedness, policy compliance, tool execution, tone, or structured-output validity.
Scenario-specific thresholds: Set release criteria for meaningful segments and failure types instead of relying on one average score. The acceptable standard for brainstorming copy is not the acceptable standard for changing an account or communicating a binding decision.
The fallback: Define what the user sees and can do when confidence is inadequate, a tool fails, retrieval returns weak context, or the output violates a rule.
The operating envelope: Set the latency, reliability, and cost constraints needed for the workflow to remain viable.
The data rules: Record what may be retained, what must be removed, who can inspect traces, and how sensitive information is handled.
The instrumentation plan: Name the events, evaluation results, feedback, escalations, and outcome measures required to make the next decision.

There is no universal quality threshold for an AI feature. The right threshold depends on the consequence of an error, the user’s ability to detect it, and the availability of a safe recovery path. Set the bar by scenario and harm, then make the release decision against that bar. An aggregate average can conceal a severe failure in a smaller but important segment.
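A per-scenario release gate makes that point mechanical. In this minimal sketch the scenario names, scores, and thresholds are invented for illustration; the only logic is that every scenario must clear its own bar.

```python
def release_decision(results: dict[str, float],
                     thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Pass only if every scenario meets its own threshold; report the ones that fail."""
    failing = [
        scenario for scenario, minimum in thresholds.items()
        if results.get(scenario, 0.0) < minimum   # a missing result counts as a failure
    ]
    return (not failing, failing)


# Made-up scores: the average looks acceptable, the high-consequence scenario does not.
results = {"brainstorm_copy": 0.93, "account_change": 0.91}
thresholds = {"brainstorm_copy": 0.80, "account_change": 0.99}

average = sum(results.values()) / len(results)
ready, failing = release_decision(results, thresholds)
print(f"average={average:.2f} ready={ready} failing={failing}")
```

The average of 0.92 would pass a single blended bar of 0.90, while the gate correctly blocks release on the account-change scenario.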

Build the evaluation set before tuning the experience

Create a versioned evaluation set from the workflow you intend to support. Include ordinary cases, meaningful variations, known edge cases, and inputs that should trigger a refusal, clarification, or handoff. Label the expected outcome and the unacceptable failure. Do not require exact wording unless exact wording is part of the product requirement.

Run that set against the initial baseline and after changes to prompts, models, retrieval, tools, policies, or orchestration. Preserve results by scenario so the team can see both improvements and regressions. A single overall score is useful for orientation; it is not enough for a launch decision.
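Preserving results by scenario makes it cheap to compare two versions directly. The sketch below assumes each run is stored as a mapping from scenario name to pass or fail; the scenario names are hypothetical.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Report which scenarios improved and which regressed between two runs."""
    improved = sorted(s for s in baseline if not baseline[s] and candidate.get(s, False))
    regressed = sorted(s for s in baseline if baseline[s] and not candidate.get(s, False))
    return {"improved": improved, "regressed": regressed}


def pass_rate(run: dict[str, bool]) -> float:
    return sum(run.values()) / len(run)


# Made-up results before and after a prompt change.
before = {"ordinary_case": True, "edge_case": False, "should_refuse": True}
after = {"ordinary_case": True, "edge_case": True, "should_refuse": False}

print(pass_rate(before), pass_rate(after))
print(compare_runs(before, after))
```

Both runs have the same overall pass rate, yet the change fixed an edge case and broke a refusal. Only the per-scenario view shows that trade.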

Automated checks work well for properties that can be specified clearly, such as output structure, required fields, tool completion, forbidden content, or citation presence. Use structured human review where quality depends on judgment. Keep the rubric stable enough to compare versions, and change it deliberately when the product promise changes.
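As an illustration of the automatable end of that spectrum, here is a minimal structural check. It assumes a hypothetical output contract in which the model returns a JSON object with `answer` and `citations` fields; the field names are not from any real product.

```python
import json

REQUIRED_FIELDS = {"answer", "citations"}   # hypothetical output contract


def check_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(payload))]
    if "citations" in payload and not payload["citations"]:
        problems.append("citations list is empty")
    return problems


print(check_output('{"answer": "Reset the role in settings.", "citations": ["E1"]}'))
print(check_output('{"answer": "Reset the role in settings.", "citations": []}'))
print(check_output("Sure! Here is the answer."))
```

Checks like these confirm form, not substance. Whether the answer is correct, complete, or appropriately toned remains a job for the structured human review described above.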

Design the failure experience as part of the feature

Users do not experience your evaluation score. They experience a suggestion they cannot verify, a slow response, an action they did not intend, or a dead end after the system fails. Design those moments before launch.

Show the context or inputs that materially shaped the result when doing so helps the user judge it.
Make generated content editable before it becomes externally visible.
Require explicit confirmation before consequential or difficult-to-reverse actions.
Preserve the original state and provide rollback where the underlying workflow permits it.
Offer a clear manual path when the system cannot complete the task.
Capture corrections and escalations as learning signals without treating every user edit as proof that the system was wrong.

Do not place sensitive production data into an unapproved model, connector, or testing tool. The downside can include unauthorized disclosure, retention outside your controls, and regulatory or contractual exposure. Use an approved environment and appropriately protected or de-identified test material while privacy and security owners validate the production path.

Run one decision loop from discovery through scale

AI initiatives become expensive when discovery, delivery, launch, and governance operate as separate queues. The useful unit of management is one decision loop with shared artifacts, named owners, and explicit gates.

Discover the workflow: Observe the current task, its failure points, the information available at the decision moment, and the user’s existing workarounds. Validate that the problem matters before testing how impressive a model can appear.
Shape a complete slice: Select the smallest workflow that can deliver an outcome, including its context, interface, recovery path, and instrumentation. A prompt without those elements is a component, not a product increment.
Pass the build gate: Approve committed delivery only when the opportunity contract, production contract, evaluation set, data path, and accountable owners are credible.
Deliver through normal product planning: Put evaluation cases, telemetry, fallback behavior, privacy work, and operational readiness into the roadmap and sprint scope. Do not leave them in a separate “hardening” phase after the visible feature is complete.
Launch a new behavior: Use onboarding, in-app guidance, examples, and product tours to show when the capability is useful, what input it needs, and how the user should review the result. The activation event should represent completed value, not a button click.
Review and decide: Compare outcomes with the baseline, inspect evaluation performance by scenario, locate adoption drop-offs, and review cost, reliability, incidents, and new risks. End with a decision to scale, revise, constrain, or stop.

A practical ownership split keeps this loop moving. Product owns the customer outcome, scope, adoption, and portfolio decision. Engineering owns the production system, reliability, observability, and cost controls. Design owns comprehension, user control, and recovery in the experience. The evaluation owner maintains cases, rubrics, baselines, and regression visibility. Privacy, security, legal, or compliance owners define required controls according to the risk. The business or operational owner defines any human review policy and accepts changes to the real-world process.

One directly responsible leader should assemble the evidence and drive the launch recommendation, but that role does not erase specialist approval where it is required. Record the decision, conditions, and unresolved risks. Otherwise the same debate returns at every review and nobody can tell why the system was allowed to progress.

Use risk-tiered oversight. A reversible drafting aid with no sensitive data does not need the same review path as an agent that changes customer records, sends external communications, or initiates a financial action. Increase review, auditability, confirmation, and monitoring as authority and consequence increase. This keeps governance proportional and makes the path to approval understandable before work begins.
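Risk-tiered oversight can be written down as a small rule set so teams know the review path before work begins. The attributes and control names below are assumptions for illustration; the real list belongs to your privacy, security, legal, and operational owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    reversible: bool
    uses_sensitive_data: bool
    acts_externally: bool  # changes records, sends messages, or moves money


def required_controls(cap: Capability) -> list[str]:
    """Map a capability's authority and consequence to hypothetical controls."""
    controls = ["evaluation_set", "telemetry"]  # baseline for every capability
    if cap.uses_sensitive_data:
        controls.append("privacy_security_review")
    if cap.acts_externally:
        controls += ["explicit_confirmation", "audit_log"]
    if cap.acts_externally and not cap.reversible:
        controls.append("human_approval")
    return controls


drafting_aid = Capability(reversible=True, uses_sensitive_data=False, acts_externally=False)
payment_agent = Capability(reversible=False, uses_sensitive_data=True, acts_externally=True)

print(required_controls(drafting_aid))
print(required_controls(payment_agent))
```

The drafting aid gets only the baseline controls, while the agent that initiates a financial action accumulates review, confirmation, auditability, and human approval.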

At each portfolio review, use the same compact decision packet: baseline and current outcome, scenario-level evaluation movement, activation funnel, operating performance, incidents or policy exceptions, learning completed, and the next requested commitment. A polished demonstration can support the discussion, but it cannot substitute for this evidence.

Measure value, quality, adoption, and risk separately

AI dashboards become misleading when usage, answer quality, customer value, and system health are blended into one success number. They answer different questions and lead to different decisions. Keep the layers separate, then connect them with a driver tree.

Layer	Question	Useful measures	Decision it informs
Customer or business outcome	Did the workflow become meaningfully better?	Task completion, resolution, conversion, elapsed time, rework, or cost per successful outcome	Whether the use case deserves continued investment
User behavior	Are eligible users reaching and repeating the value?	Eligibility, exposure, first attempt, successful completion, repeat use, abandonment, fallback, and escalation	Whether to change positioning, onboarding, interaction design, or workflow placement
System quality	Is the result fit for the intended task?	Scenario pass rate, human rubric results, groundedness where required, tool success, structured-output validity, and critical-failure count	Whether to change context, retrieval, prompts, models, tools, or scope
Operations	Can the product deliver the experience sustainably?	Latency, reliability, retries, failure rate, incidents, and cost per successful task	Whether architecture and unit economics support scale
Risk and control	Are safeguards working at the level of authority granted?	Policy exceptions, unauthorized actions, sensitive-data events, confirmations, rollbacks, and human escalations	Whether to add controls, reduce authority, constrain availability, or pause

Build the adoption funnel around the real workflow: eligible user, meaningful exposure, first attempt, successful outcome, and repeat use when the need occurs again. Define the repeat window from the natural frequency of the task. A daily workflow and a quarterly workflow cannot share a useful retention window.
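The funnel can be computed from per-user records, as in the sketch below. The record shape (stage flags plus a list of success dates) is an assumption, and "repeat" is defined here as a later success within the repeat window after the first one.

```python
from datetime import date, timedelta


def adoption_funnel(users, repeat_window: timedelta):
    """Count users at each stage; each stage requires the previous one."""
    counts = {"eligible": 0, "exposed": 0, "attempted": 0, "succeeded": 0, "repeated": 0}
    for user in users:
        if not user["eligible"]:
            continue
        counts["eligible"] += 1
        if not user["exposed"]:
            continue
        counts["exposed"] += 1
        if not user["attempted"]:
            continue
        counts["attempted"] += 1
        successes = sorted(user["successes"])
        if not successes:
            continue
        counts["succeeded"] += 1
        first = successes[0]
        # A same-day second success is not counted as a repeat.
        if any(first < d <= first + repeat_window for d in successes[1:]):
            counts["repeated"] += 1
    return counts


# Hypothetical records for five users.
users = [
    {"eligible": True, "exposed": True, "attempted": True,
     "successes": [date(2025, 1, 5), date(2025, 1, 8)]},
    {"eligible": True, "exposed": True, "attempted": True,
     "successes": [date(2025, 1, 5), date(2025, 4, 10)]},
    {"eligible": True, "exposed": True, "attempted": False, "successes": []},
    {"eligible": True, "exposed": False, "attempted": False, "successes": []},
    {"eligible": False, "exposed": False, "attempted": False, "successes": []},
]

print(adoption_funnel(users, repeat_window=timedelta(days=7)))
print(adoption_funnel(users, repeat_window=timedelta(days=120)))
```

The same data yields a different repeat count under a 7-day and a 120-day window, which is why the window has to follow the natural frequency of the task.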

Do not mistake interaction volume for value. More messages can mean the user is retrying after poor results. A low cost per response can hide an expensive task that requires several responses and a manual correction. Favor successful outcomes per eligible user and cost per successful outcome, then use interaction-level metrics to diagnose what happened inside the journey.
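A short worked example shows how a low cost per response can coexist with a high cost per successful outcome. All figures are invented for illustration.

```python
def cost_per_successful_outcome(responses: int, cost_per_response: float,
                                manual_corrections: int, cost_per_correction: float,
                                successful_outcomes: int) -> float:
    """Total cost of model responses and manual corrections per successful task."""
    if successful_outcomes <= 0:
        raise ValueError("no successful outcomes to divide the cost across")
    total_cost = (responses * cost_per_response
                  + manual_corrections * cost_per_correction)
    return total_cost / successful_outcomes


# Hypothetical month: cheap responses, but many retries and manual fixes.
cost = cost_per_successful_outcome(
    responses=1000, cost_per_response=0.02,
    manual_corrections=150, cost_per_correction=4.00,
    successful_outcomes=200,
)
print(f"cost per successful outcome: {cost:.2f}")
```

Here each response costs 0.02, but the 200 successful tasks absorb 20 in response cost plus 600 in correction time, so each successful outcome costs 3.10.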

The metric layers also tell you where to intervene:

If evaluation quality is acceptable but activation is weak, inspect discoverability, positioning, onboarding, and whether the feature appears at the right workflow moment.
If first use is strong but successful completion is weak, inspect inputs, context retrieval, interaction design, tool execution, and recovery.
If completion is strong but repeat use is weak, verify that the use case is naturally repeatable and that the experience created enough value to displace the old behavior.
If adoption is strong but critical failures or operating costs are outside the contract, constrain the release while you fix the production system. Popularity does not neutralize risk or poor economics.
If the outcome improves, scenario evaluations remain acceptable, users return when the need recurs, and operating constraints hold, you have evidence to expand availability or authority.
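The diagnostic rules above can be sketched as a single function that turns layer status into a next action. The boolean inputs and the order of checks are my assumptions: this version looks at operating constraints and quality first, on the principle that popularity does not neutralize risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerStatus:
    outcome_improved: bool
    quality_ok: bool      # scenario evaluations within the bar
    activation_ok: bool
    completion_ok: bool
    repeat_ok: bool
    operations_ok: bool   # critical failures and cost within the contract


def next_action(s: LayerStatus) -> str:
    """Return the first intervention suggested by the metric layers."""
    if not s.operations_ok:
        return "constrain the release and fix the production system"
    if not s.quality_ok:
        return "revise context, retrieval, prompts, models, tools, or scope"
    if not s.activation_ok:
        return "inspect discoverability, positioning, onboarding, and placement"
    if not s.completion_ok:
        return "inspect inputs, retrieval, interaction design, tools, and recovery"
    if not s.repeat_ok:
        return "verify the use case is repeatable and displaces the old behavior"
    if not s.outcome_improved:
        return "revisit whether the workflow outcome justifies investment"
    return "expand availability or authority"


healthy = LayerStatus(True, True, True, True, True, True)
popular_but_costly = LayerStatus(True, True, True, True, True, operations_ok=False)

print(next_action(healthy))
print(next_action(popular_but_costly))
```

A real review will weigh several signals at once. The value of writing the rules down is that the team agrees in advance which signal leads to which commitment.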

This is how measurement becomes a funding mechanism rather than a reporting ritual. Each signal points to a different action, and each review produces a clear next commitment.

Key takeaways for your next AI portfolio review

Treat every AI idea as a bounded product bet with a named user, baseline workflow, expected outcome, authority level, and stop condition.
Require a production contract covering quality, evaluation, fallback, data, economics, instrumentation, and failure recovery before committed delivery begins.
Build privacy, evaluation, telemetry, onboarding, and operational readiness into the roadmap and sprint scope instead of postponing them until launch.
Grant the minimum authority needed to create value, then expand autonomy only when quality, adoption, control, and operational evidence support it.
Measure customer outcomes, user behavior, system quality, operations, and risk as connected but distinct layers.
End every review with an explicit decision to scale, revise, constrain, or stop, plus the evidence required for the next decision.

At your next portfolio review, choose one leading AI candidate and refuse to discuss the model first. Write the opportunity contract, define its production bar, assign the owners, and identify the first complete workflow you can measure. If those decisions are clear, the technology has a path to become a product. If they are not, another prototype will only postpone the real work.

References

Pendo – Perspectives – Inside PendomoniumX London: AI’s tipping point and what product leaders should do next

January 3, 2026

Master the Five Stages of Software Experience Maturity and Prioritize What to Fix First

Experience quality compounds just like code quality. To align teams and accelerate outcomes, I rely on a clear, five-stage software experience maturity model to assess where we are, why we’re there, and how to advance. It turns fuzzy debates into concrete product strategy and reinforces a product-led growth mindset.

Find out where you stand—and what to fix first—with this maturity framework.

Why a five-stage model? It gives product, design, engineering, and go-to-market a shared language for trade-offs, helps us move from opinions to evidence, and ties day-to-day improvements to OKRs that measure outcomes rather than output. Instead of spreading effort thin, we sequence the right bets at the right time and build momentum with measurable wins.

Here’s how I apply it in practice. I start with a brief, honest self-assessment across the customer journey: onboarding clarity, user activation moments, in-app guides and product tours, UX writing, support loops, reliability, and analytics coverage. Then I layer in learnings from continuous discovery and product discovery—interviews, usage patterns, and support transcripts—so we see the experience as customers do, not just as we intended.

When it comes to what to fix first, I prioritize prerequisites over polish. If the value proposition isn’t clear, onboarding is confusing, or activation is inconsistent, we address those before adding new features. I instrument the funnel end-to-end, establish a minimum detectable effect (MDE) for A/B testing, and ensure we can answer basic questions about who activates, who retains, and why.
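Establishing an MDE mostly means checking whether you have enough traffic to detect it. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline activation rate, the MDE, and the default significance and power levels are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided test of two proportions.

    mde_abs is the absolute lift to detect, e.g. 0.02 for +2 percentage points.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical: 20% baseline activation, detect a 2-point or a 1-point lift.
print(sample_size_per_arm(0.20, 0.02))
print(sample_size_per_arm(0.20, 0.01))
```

Halving the MDE roughly quadruples the required sample, so the MDE is as much a decision about how long you can afford to run the test as it is a statistical setting.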

Measurement is non-negotiable. I pair retention analysis and activation metrics with qualitative signals to avoid local maxima. Amplitude analytics helps reveal behavioral patterns, while Pendo and in-app guides close gaps in comprehension and guidance. Intercom and a HubSpot CRM integration connect product signals to account health, so we can see how experience maturity drives revenue and retention.

Operationally, I anchor the roadmap to a small set of experience outcomes, link them to product strategy, and review progress in cadence with leadership. This approach builds product management leadership muscle: sharper stakeholder management, clearer trade-offs, and faster feedback loops. Most importantly, the team sees how each improvement ladders up to a better, more durable user experience.

If you’re mapping your own path across the five stages, start by sizing the gaps that block activation and retention, commit to a few high-leverage fixes, and measure relentlessly. With a shared maturity model, your team gains focus, your customers feel the difference, and your product compounds value with every release.

Inspired by this post on Pendo – Best Practices.

January 3, 2026

Four High-Impact Lifecycle Journeys to Run in Pendo Orchestrate for Activation and Retention

When I map the customer lifecycle, I look for the precise moments where guidance, context, and timing can transform a casual click into a committed relationship. That’s exactly why I rely on Pendo Orchestrate—to turn intent into a systematic, repeatable product strategy that scales across every stage of the journey.

From first click to lifelong retention, you’ll deliver the right message at the exact right time, every step of the way. With Pendo Orchestrate, you can design those kinds of moments with intention. And in this blog, we’ll show you how.

In practice, I translate that promise into four lifecycle journeys every product team should be running with Pendo Orchestrate: new user onboarding, activation to the aha moment, expansion and upsell, and renewal and retention. These journeys power product-led growth and keep the roadmap aligned to measurable business outcomes.

Onboarding: I use in-app guides and product tours to welcome new users, set expectations, and reduce time-to-value. Contextual tooltips and gentle checklists keep users moving, while clear, concise UX writing removes friction. The goal is simple: accelerate early wins so onboarding naturally flows into user activation.

Activation: To help users reach the aha moment, I pair behavioral insights with targeted in-app guides. When a user approaches a key milestone, Pendo Orchestrate triggers just-in-time prompts that reinforce the value proposition. I keep these nudges focused, specific, and measurable so activation improves without overwhelming the experience.

Expansion: Once users adopt core workflows, I introduce advanced capabilities through tailored tours and contextual education. These cues appear where they’re most relevant—in the flow of work—so cross-sell and upsell moments feel helpful, not salesy. The intent is to deepen adoption by connecting features to outcomes users already care about.

Renewal and retention: I watch for patterns that suggest risk (stalled usage, incomplete workflows) and offer supportive interventions. Lightweight guides, quick tips, and feedback loops help resolve issues before they become churn. Combined with retention analysis, these orchestrations keep customers engaged and set the stage for long-term value.

When these four journeys run in concert, your product becomes the primary engine of growth. Pendo Orchestrate ensures the right in-app guidance shows up at the right moment—so your product strategy, product discovery, and day-to-day execution stay tightly aligned. That’s how you move beyond one-off campaigns and build a durable, product-led growth system.

Inspired by this post on Pendo – Best Practices.

January 3, 2026

Why I’m All-In on INDUSTRY 2025: 5 Powerful Reasons For Product Leaders at The Product Conference

INDUSTRY 2025: The Product Conference is circled on my calendar for good reason. In my role leading product management at HighLevel, I look for events that sharpen strategy, accelerate learning, and connect me with operators who ship. This one consistently delivers on all three, and 2025 promises to raise the bar for product management leadership.

Join Pendo at INDUSTRY in Cleveland, Ohio.

First, I expect deeply actionable product strategy insights, not platitudes. I’m prioritizing conversations on framing OKRs around outcomes rather than output, product roadmapping and sprint planning, and how great teams articulate a crisp value proposition while maintaining the points of parity that matter. I’m going in with specific questions on product-market fit lessons and how to systematize strategic bets without stifling discovery.

Second, the surge of AI in product work is too important to observe from the sidelines. I’m comparing approaches across AI strategy, LLMs for product managers, prompt engineering, and eval-driven development, especially in retrieval-first pipeline patterns. My focus: where AI genuinely improves product discovery, in-app guides, and customer support AI strategy, and where it risks adding complexity without improving outcomes.

Third, the community is unmatched for conference networking and pragmatic learning. I’m intentional about meeting product trios who run continuous discovery at scale, as well as leaders who’ve cracked stakeholder management under pressure. These are the moments where competitive differentiation is born—through candid stories of what didn’t work and why.

Fourth, I’m eager to stress-test data practices that power product-led growth. I’ll be exchanging notes on retention analysis, unified analytics platform decisions, user activation, and how teams integrate qualitative feedback with event data to inform roadmaps. I’m also interested in how practitioners leverage platforms like Pendo, Amplitude analytics, Intercom, and HubSpot to reduce time-to-insight and craft effective product tours and in-app guides.

Fifth, I treat INDUSTRY as a checkpoint for leadership growth. I’m looking for fresh takes on empowering product teams, first principles decision making, organizational development, and the IC to manager transition. The best sessions don’t just inspire; they give me two moves I can apply with my team on Monday.

To make the most of the week, I’m applying a continuous discovery mindset: arrive with clear learning goals, capture portable frameworks, and translate at least two insights into experiments before wheels-up. If you’re focused on product strategy, product discovery, and product-led growth, we’ll have plenty to compare and build on together.

I’ll be in Cleveland ready to learn, share, and connect with peers who care about craft and outcomes. If you’re attending, let’s compare notes on what’s working, what’s stalled, and how we can raise the bar for product management leadership in 2025 and beyond.

Inspired by this post on Pendo – Perspectives.

January 3, 2026