Category: AI Strategy

Why I’m All-In on INDUSTRY 2025: 5 Powerful Reasons For Product Leaders at The Product Conference

INDUSTRY 2025: The Product Conference is circled on my calendar for good reason. In my role leading product management at HighLevel, I look for events that sharpen strategy, accelerate learning, and connect me with operators who ship. This one consistently delivers on all three, and 2025 promises to raise the bar for product management leadership.

Join Pendo at INDUSTRY in Cleveland, Ohio.

First, I expect deeply actionable product strategy insights—beyond platitudes. I’m prioritizing conversations on outcomes vs output OKRs, product roadmapping and sprint planning, and how great teams articulate a crisp value proposition while maintaining points of parity that matter. I’m going in with specific questions on product-market fit lessons and how to systematize strategic bets without stifling discovery.

Second, the surge of AI in product work is too important to observe from the sidelines. I’m comparing approaches across AI Strategy, LLMs for product managers, prompt engineering, and eval-driven development, especially in retrieval-first pipeline patterns. My focus: where AI genuinely improves product discovery, in-app guides, and customer support, and where it risks adding complexity without outcomes.

Third, the community is unmatched for conference networking and pragmatic learning. I’m intentional about meeting product trios who run continuous discovery at scale, as well as leaders who’ve cracked stakeholder management under pressure. These are the moments where competitive differentiation is born—through candid stories of what didn’t work and why.

Fourth, I’m eager to stress-test data practices that power product-led growth. I’ll be exchanging notes on retention analysis, unified analytics platform decisions, user activation, and how teams integrate qualitative feedback with event data to inform roadmaps. I’m also interested in how practitioners leverage platforms like Pendo, Amplitude analytics, Intercom, and HubSpot to reduce time-to-insight and craft effective product tours and in-app guides.

Fifth, I treat INDUSTRY as a checkpoint for leadership growth. I’m looking for fresh takes on empowering product teams, first principles decision making, organizational development, and the IC-to-manager transition. The best sessions don’t just inspire; they give me two moves I can apply with my team on Monday.

To make the most of the week, I’m applying a continuous discovery mindset: arrive with clear learning goals, capture portable frameworks, and translate at least two insights into experiments before wheels-up. If you’re focused on product strategy, product discovery, and product-led growth, we’ll have plenty to compare and build on together.

I’ll be in Cleveland ready to learn, share, and connect with peers who care about craft and outcomes. If you’re attending, let’s compare notes on what’s working, what’s stalled, and how we can raise the bar for product management leadership in 2025 and beyond.

Inspired by this post on Pendo – Perspectives.

January 3, 2026

How to Govern AI Agents With Product Analytics That Drives Action

Your dashboard can show growing AI agent usage while the product itself gets worse. Users may invoke the agent, wait for an answer, rewrite it, repeat the task manually, or discover too late that an action needs to be undone. An invocation count records activity. It does not tell you whether the agent was useful, safe, or worthy of more authority.

If you own an agent roadmap, the practical question is not whether the model can complete an impressive demo. It is whether you can see what the agent did, limit what it was allowed to do, connect its behavior to a user or business outcome, and stop or reverse a bad release. Product analytics should be the control system that helps you answer those questions.

Key takeaways

Define the agent’s job, eligible users, data boundary, action boundary, target outcome, and failure conditions before choosing dashboard metrics.
Join product behavior, agent decisions, tool activity, and business outcomes with shared run and workflow identifiers. A model trace or product funnel on its own is incomplete.
Treat permissions as product logic. Read access, recommendations, reversible actions, and high-consequence actions need different controls and evidence.
Version prompts, retrieval sources, models, tools, policies, and event schemas together so that a change in performance can be traced to a release.
Use quality, safety, experience, business, and operational gates to decide whether an agent should expand, remain constrained, be revised, or be retired.

Define the outcome and authority before the events

Teams often start by instrumenting what is easiest to count: conversations, messages, tool calls, and thumbs-up feedback. That produces a busy dashboard without a decision model. Start one level earlier. What job is the agent responsible for, and what evidence would justify giving it more reach or authority?

Write a one-page agent contract

An agent contract is a product artifact, not a legal document. It creates a stable reference for instrumentation, evaluation, access control, and rollout decisions. Write down:

Job: the decision or task the agent helps complete. Avoid broad mandates such as improve support or assist product managers.
Eligible workflow: the exact point at which the agent may appear or run. Eligibility must be measurable even when the user never invokes the agent.
Eligible users and accounts: the roles, segments, or environments included in the release, plus explicit exclusions.
Inputs: the approved resources, fields, retrieval collections, and user-provided context the agent may inspect.
Outputs: whether the agent answers, recommends, drafts, updates a system, contacts someone, or triggers another workflow.
Human checkpoints: the actions that require review, the person authorized to review them, and what that person must be shown.
Target outcome: the user or business result, its denominator, its measurement window, and the system that records it.
Known failure states: unsupported answers, irrelevant retrieval, repeated retries, blocked tools, abandoned approvals, incorrect actions, and failed handoffs.
Stop condition: the quality, risk, reliability, or outcome signal that pauses the rollout and identifies who owns the decision.
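
One way to keep the contract usable by instrumentation and access control is to store it as structured data next to the prose version. The sketch below is a hypothetical shape, not a standard schema: the class name, every field name, and the example values are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentContract:
    """Hypothetical one-page agent contract; field names are illustrative."""
    job: str
    eligible_workflow: str
    eligible_roles: tuple[str, ...]
    excluded_roles: tuple[str, ...]
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    human_checkpoints: tuple[str, ...]
    target_outcome: str
    outcome_denominator: str
    measurement_window_days: int
    outcome_source_system: str
    known_failure_states: tuple[str, ...]
    stop_condition: str
    stop_owner: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, so gaps are visible before launch."""
        return [name for name, value in vars(self).items() if not value]


# Example values are invented for illustration.
contract = AgentContract(
    job="Draft a reply to a billing question for agent review",
    eligible_workflow="support_ticket_opened:billing",
    eligible_roles=("support_agent",),
    excluded_roles=("contractor",),
    inputs=("kb:billing_articles", "ticket.subject", "ticket.body"),
    outputs=("draft",),
    human_checkpoints=("send_reply",),
    target_outcome="ticket_resolved_first_contact",
    outcome_denominator="eligible billing tickets",
    measurement_window_days=7,
    outcome_source_system="helpdesk",
    known_failure_states=("unsupported_answer", "irrelevant_retrieval"),
    stop_condition="",  # deliberately left blank to show the check
    stop_owner="product_lead",
)
print(contract.missing_fields())  # ['stop_condition']
```

A check like `missing_fields()` is only a completeness test. It cannot tell you whether the job is too broad or the stop condition is meaningful; that still needs a human review.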

The eligibility definition matters more than it appears. If you count only people who chose to use the agent, your dashboard excludes people who ignored it, did not notice it, distrusted it, or could not access it. Record the eligible population first. That gives adoption, completion, and outcome metrics a defensible denominator.
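
To make the denominator point concrete, here is a minimal sketch that computes adoption against eligible workflows. The records are invented, and the event names `agent_eligible` and `agent_run_started` are borrowed from the event model described later in this article; the helper name is an assumption.

```python
# Hypothetical event records; only the event name and workflow identifier matter here.
events = [
    {"event": "agent_eligible", "workflow_id": "w1"},
    {"event": "agent_eligible", "workflow_id": "w2"},
    {"event": "agent_eligible", "workflow_id": "w3"},
    {"event": "agent_eligible", "workflow_id": "w4"},
    {"event": "agent_run_started", "workflow_id": "w1"},
    {"event": "agent_run_started", "workflow_id": "w1"},  # retry inside the same workflow
    {"event": "agent_run_started", "workflow_id": "w3"},
]


def workflows_with(event_name, records):
    """Return the distinct workflows that emitted the given event."""
    return {r["workflow_id"] for r in records if r["event"] == event_name}


eligible = workflows_with("agent_eligible", events)
invoked = workflows_with("agent_run_started", events) & eligible
adoption = len(invoked) / len(eligible)
print(f"{len(invoked)}/{len(eligible)} eligible workflows invoked the agent ({adoption:.0%})")
```

Counting the three `agent_run_started` events on their own would suggest three adopters. Counting distinct workflows against the eligible population shows two of four.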

Keep the first contract narrow. A practical starting footprint is one valuable question, a small team, and one assistant. Narrow scope is not merely easier to ship. It makes failures interpretable and limits the consequences of a bad policy, prompt, connector, or event definition.

Translate authority into enforceable policy

I use a strict definition of governance: the agent has a bounded objective, a known identity, limited data access, limited tools, recorded policy decisions, an escalation route, and a named owner. A policy page that the runtime cannot enforce is guidance, not governance.

Authority level	What the agent may do	Evidence to retain	Default release control
Retrieve	Read approved analytics, records, or knowledge without changing a system	Resource identifiers, applied scope, retrieval status, policy version, and references used	Pre-approved resources with least-privilege access and data minimization
Recommend	Explain, summarize, rank, draft, or propose an action	Agent version, supporting references, presentation status, and user response	The user decides whether to accept, edit, reject, or escalate
Act reversibly	Create a note or make another bounded change that can be reliably undone	Tool, target, before-and-after state, approval, execution result, and reversal path	Explicit approval during the bounded rollout, followed by evidence-based expansion
Act with high consequence	Send an external communication, alter access or entitlements, disclose sensitive data, or perform a hard-to-reverse operation	Everything above, plus approver identity, policy result, purpose, and incident linkage	A human makes the consequential decision; eligibility and tool scope remain narrow

Technical reversibility is not the same as consequence reversibility. A database field may be restored while a customer message, exposed record, or lost trust cannot be recalled. Classify authority by the real-world consequence, not by whether an API offers an undo method.
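
The authority table can be expressed as a small classification that the runtime consults. This is a sketch under assumptions: the tool names and the registry are hypothetical, and the rule that unknown tools default to the strictest class is my choice rather than something the table mandates.

```python
from enum import IntEnum


class Authority(IntEnum):
    """Authority levels from the table above, ordered by consequence."""
    RETRIEVE = 1
    RECOMMEND = 2
    ACT_REVERSIBLY = 3
    ACT_HIGH_CONSEQUENCE = 4


# Hypothetical tool registry. Each tool is classified by its real-world
# consequence, not by whether its API happens to expose an undo call.
TOOL_AUTHORITY = {
    "read_usage_report": Authority.RETRIEVE,
    "draft_reply": Authority.RECOMMEND,
    "add_internal_note": Authority.ACT_REVERSIBLY,
    # An email can be deleted from the outbox after the customer has read it.
    "send_customer_email": Authority.ACT_HIGH_CONSEQUENCE,
}


def requires_human_approval(tool: str) -> bool:
    """Unknown tools fall into the strictest class instead of being allowed by default."""
    level = TOOL_AUTHORITY.get(tool, Authority.ACT_HIGH_CONSEQUENCE)
    return level >= Authority.ACT_REVERSIBLY


print(requires_human_approval("draft_reply"))          # False
print(requires_human_approval("send_customer_email"))  # True
print(requires_human_approval("unregistered_tool"))    # True
```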

Model Context Protocol can make the policy surface clearer because it separates read-only resources from bounded tools and gives agents a standard way to discover them. That interface is useful, but the protocol does not decide who should access a resource, which fields are permitted, or whether an action needs approval. Authentication, authorization, redaction, policy enforcement, retention, and audit logging still belong in your architecture.

Apply controls before the model call and again before every tool execution. Prompts, retrieved context, logs, and third-party services can all become paths for sensitive-data leakage. Redact data the task does not require, keep secrets outside prompts, use scoped credentials, validate structured tool inputs, and record blocked requests as carefully as successful ones. A denied request is evidence that your policy worked, but repeated denials may also reveal a broken workflow, an overly broad prompt, or an attempted attack.
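
A minimal sketch of the check that runs before each tool execution might look like the following. Everything named here is an assumption for illustration: the rule table, the scope strings, the in-memory audit list, and the email-only redaction pattern, which is far narrower than a production redaction layer would need to be.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical per-tool requirements: the scope a caller needs and the argument types expected.
TOOL_RULES = {
    "add_internal_note": {"scope": "notes:write", "args": {"ticket_id": str, "note": str}},
}

audit_log = []  # stands in for a durable, access-controlled audit store


def redact(text: str) -> str:
    """Replace email addresses with a placeholder before text reaches a prompt or log."""
    return EMAIL.sub("[email]", text)


def authorize_tool_call(tool, args, granted_scopes, policy_version="policy-v1"):
    """Decide allow or deny before execution and record the decision either way."""
    rule = TOOL_RULES.get(tool)
    if rule is None:
        result, reason = "denied", "unknown_tool"
    elif rule["scope"] not in granted_scopes:
        result, reason = "denied", "missing_scope"
    elif set(args) != set(rule["args"]) or not all(
        isinstance(args[name], kind) for name, kind in rule["args"].items()
    ):
        result, reason = "denied", "invalid_arguments"
    else:
        result, reason = "allowed", "ok"
    # Log the classification, not the argument values.
    audit_log.append(
        {"tool": tool, "policy_result": result, "reason": reason, "policy_version": policy_version}
    )
    return result == "allowed"


note = redact("Customer asked us to call jo@example.com")
ok = authorize_tool_call("add_internal_note", {"ticket_id": "T-1", "note": note}, {"notes:write"})
blocked = authorize_tool_call("add_internal_note", {"ticket_id": "T-1", "note": note}, {"notes:read"})
```

Denied calls land in the same log as allowed ones, so a rising count of `missing_scope` denials for one tool can be reviewed as either a workflow mismatch or an attack pattern.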

Build telemetry that joins agent decisions to user outcomes

Product analytics and AI observability answer different halves of the same question. A trace can show which context was retrieved, which policy ran, and which tool was called. Product analytics can show what the user did before and after the interaction, which cohort they belonged to, and whether the workflow reached its intended result. Neither view alone proves that the agent created value.

Join them with two identifiers. An agent run identifier follows one execution from trigger to final status. A workflow identifier connects that execution to the broader task, including manual steps, retries, handoffs, and the eventual business outcome. A user may start several runs inside one workflow, so treating every run as an independent success will inflate apparent demand and hide rework.
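
The difference between run-level and workflow-level views shows up quickly in a small example. The records below are invented, and the field names are assumptions that follow the identifiers described above.

```python
from collections import defaultdict

# Hypothetical records: several runs can belong to one workflow.
runs = [
    {"run_id": "r1", "workflow_id": "w1", "completion_state": "rejected"},
    {"run_id": "r2", "workflow_id": "w1", "completion_state": "accepted"},
    {"run_id": "r3", "workflow_id": "w2", "completion_state": "accepted"},
    {"run_id": "r4", "workflow_id": "w3", "completion_state": "abandoned"},
]
workflow_outcomes = {"w1": "achieved", "w2": "achieved", "w3": "not_achieved"}

# Group runs under the workflow they belong to.
runs_by_workflow = defaultdict(list)
for run in runs:
    runs_by_workflow[run["workflow_id"]].append(run)

summary = {
    workflow_id: {
        "runs": len(group),
        "retries": len(group) - 1,
        "outcome": workflow_outcomes.get(workflow_id, "unknown"),
    }
    for workflow_id, group in runs_by_workflow.items()
}

run_level_acceptance = sum(r["completion_state"] == "accepted" for r in runs) / len(runs)
workflow_level_success = sum(s["outcome"] == "achieved" for s in summary.values()) / len(summary)
```

Four runs look like four units of demand, but they cover only three workflows, and `summary["w1"]` records that one of them needed a retry before the user accepted the output.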

Use a minimum viable event contract

The following event model is deliberately small. Adapt the names to your analytics conventions, but preserve the states and identifiers.

Suggested event	Required properties	Decision it supports
agent_eligible	Workflow identifier, use case, surface, cohort, eligibility reason, and policy version	Who could have used the agent, including people who did not invoke it?
agent_run_started	Run identifier, workflow identifier, agent version, entry point, and initiating actor type	Where is the agent being invoked, and how often do workflows require retries?
agent_answer_presented	Run identifier, answer status, retrieval status, reference status, latency band, and fallback status	Did the user receive a grounded answer, a fallback, or no usable response?
agent_action_requested	Run identifier, tool, target type, authority level, required scope, approval requirement, and policy result	What is the agent attempting, and where are requests blocked or escalated?
agent_action_finished	Run identifier, tool, execution status, error class, approver state, reversibility state, and duration band	Did an approved action actually complete, fail, time out, or require recovery?
agent_handoff_started	Run identifier, workflow identifier, handoff reason, destination, context-transfer status, and user choice	Why did automation stop, and could the receiving person continue without reconstructing the task?
agent_run_outcome	Run identifier, workflow identifier, completion state, user response, correction state, and failure taxonomy	Was the output accepted, edited, rejected, abandoned, retried, or escalated?
workflow_outcome	Workflow identifier, outcome name, outcome state, measurement window, and source system	Did the underlying product or business result occur?
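
An event contract is easier to keep honest when it is checked in code before events are sent. The sketch below covers three of the events from the table; the snake_case property names are my assumptions, so adapt them to your own conventions.

```python
# Required properties per event, abbreviated from the table above.
# The property names are assumptions, not a fixed standard.
REQUIRED = {
    "agent_eligible": {
        "workflow_id", "use_case", "surface", "cohort", "eligibility_reason", "policy_version",
    },
    "agent_run_started": {
        "run_id", "workflow_id", "agent_version", "entry_point", "actor_type",
    },
    "workflow_outcome": {
        "workflow_id", "outcome_name", "outcome_state", "measurement_window", "source_system",
    },
}


def validate_event(name, properties):
    """Return a sorted list of problems; an empty list means the event can be sent."""
    if name not in REQUIRED:
        return [f"unknown event: {name}"]
    missing = REQUIRED[name] - set(properties)
    return [f"missing property: {p}" for p in sorted(missing)]


problems = validate_event("agent_run_started", {"run_id": "r1", "workflow_id": "w1"})
print(problems)
```

Running this kind of check in the client or in a test suite catches a dropped identifier before it silently breaks the join between runs and workflows.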

Put the agent, model, prompt, retrieval, tool, policy, and event-schema versions on the relevant records. Without version lineage, a quality shift produces debate instead of diagnosis. You will know that performance changed but not whether the cause was a prompt edit, a new model, a retrieval update, a permission change, a tool release, or broken instrumentation.

Do not make raw prompts and complete responses the default payload in a general-purpose analytics tool. They can contain personal data, secrets, customer content, or retrieved text that the analytics audience should not see. Send structured classifications and reference identifiers to product analytics. Keep any detailed trace required for investigation in an access-controlled store with explicit retention rules.

Use enumerated properties for states such as accepted, edited, rejected, blocked, failed, and handed off. Free-text status fields fragment quickly and make reliable cohorts impossible. Preserve a limited diagnostic field only where someone owns its review and classification.
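
A short sketch of an enumerated state property follows. The class and function names are hypothetical; the members come from the states listed above.

```python
from enum import Enum


class RunState(str, Enum):
    ACCEPTED = "accepted"
    EDITED = "edited"
    REJECTED = "rejected"
    BLOCKED = "blocked"
    FAILED = "failed"
    HANDED_OFF = "handed_off"


def parse_run_state(raw: str) -> RunState:
    """Normalize case and spacing, then reject anything outside the enumeration."""
    return RunState(raw.strip().lower().replace(" ", "_"))


print(parse_run_state(" Handed Off ").value)  # handed_off
```

A value outside the enumeration raises `ValueError`, which is the intended behavior: an unexpected state should surface as an instrumentation bug rather than become a new cohort.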

Measure a stack, not a vanity metric

A useful scorecard separates five layers. Each layer answers a different management question:

Reach and adoption: Of eligible workflows, where was the agent offered and invoked? This shows discoverability and voluntary use, not value.
Task experience: Of started workflows, how many completed, retried, fell back, transferred to a person, or were abandoned? Segment edits and overrides instead of treating every acceptance as equally successful.
Agent quality: Was the answer supported by approved context, relevant to the request, structurally valid, and consistent with the task-specific evaluation criteria?
Governance and safety: Which tool requests were allowed, denied, escalated, or attempted outside the approved scope? Which redaction, moderation, or policy checks failed?
Business outcome: Did the downstream result move for the eligible workflow and intended cohort? Examples include completed onboarding, resolved cases, qualified leads, retained users, or a shorter cycle, depending on the contract.

Always display the numerator and denominator behind a rate. A falling handoff rate may look positive until you discover that completions also fell. A high acceptance rate may hide repeated runs if the dashboard counts only the final answer. A rising task outcome may reflect a changing user mix rather than the agent. Cohort, version, eligibility, and workflow-level views prevent those misreadings.
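
Here is a minimal sketch of a rate that always carries its numerator and denominator. The weekly counts are invented to reproduce the falling-handoff example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    name: str
    numerator: int
    denominator: int

    @property
    def value(self) -> float:
        # An empty denominator yields NaN instead of raising.
        return self.numerator / self.denominator if self.denominator else float("nan")

    def __str__(self) -> str:
        return f"{self.name}: {self.numerator}/{self.denominator} ({self.value:.0%})"


# Hypothetical weekly counts for one cohort and agent version.
last_week = {"started": 200, "handed_off": 40, "completed": 150}
this_week = {"started": 200, "handed_off": 20, "completed": 120}

for label, week in (("last week", last_week), ("this week", this_week)):
    print(label)
    print(" ", Rate("handoff rate", week["handed_off"], week["started"]))
    print(" ", Rate("completion rate", week["completed"], week["started"]))
```

In this invented data the handoff rate halves from 20% to 10%, which looks like progress until the completion rate next to it drops from 75% to 60%.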

Behavioral analytics can establish association and expose where to investigate. It does not automatically establish causality. When the decision requires a causal claim, use a controlled experiment only after both variants meet the same safety and access requirements. Prompts, decision rules, and handoff designs can be tested across appropriate user cohorts; known unsafe behavior, privacy controls, and access boundaries are not experiment variants.

Turn analytics into release gates, not retrospective reporting

A governed agent release includes more than a prompt. It includes the model configuration, instructions, retrieval sources, tool definitions, permission scopes, policy rules, user disclosures, approval flow, handoff design, and telemetry. Change any of those and you have changed the product behavior.

That is why evaluation belongs in delivery, not in a quarterly review. Task-specific test sets, reference answers, error classifications, and pass-or-block thresholds can gate model and prompt changes in CI/CD. Production analytics then checks whether the behavior generalizes to real workflows without weakening the controls established before launch.
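
A pass-or-block check for a CI job could be sketched as follows. The failure classes and tolerances are hypothetical, and the choice to block on any class without an agreed tolerance is an assumption of this sketch.

```python
# Hypothetical pre-agreed tolerances: maximum failures allowed per class in the test set.
TOLERANCE = {"unsupported_answer": 0, "wrong_tool": 0, "format_error": 2}


def evaluate(results, tolerance=TOLERANCE):
    """results: list of dicts with 'case_id' and 'failure_class' (None when the case passed)."""
    counts = {}
    for result in results:
        failure_class = result["failure_class"]
        if failure_class is not None:
            counts[failure_class] = counts.get(failure_class, 0) + 1
    # A failure class with no agreed tolerance blocks by default.
    blocking = {fc: n for fc, n in counts.items() if n > tolerance.get(fc, 0)}
    return {"passed": not blocking, "counts": counts, "blocking": blocking}


report = evaluate([
    {"case_id": "c1", "failure_class": None},
    {"case_id": "c2", "failure_class": "format_error"},
    {"case_id": "c3", "failure_class": "unsupported_answer"},
])
print(report["passed"], report["blocking"])
```

In a pipeline, the job would exit with a non-zero status when `report["passed"]` is false. Keeping counts per failure class, rather than one blended pass rate, lets a single consequential failure block the change even when most cases pass.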

Use a staged promotion path

1) Validate the interface. Enumerate the resources, tools, schemas, scopes, and denial behavior. Run harmless requests and confirm that unavailable capabilities remain unavailable.
2) Run task evaluations. Test representative requests, known failure cases, adversarial inputs, missing context, malformed tool arguments, and handoff conditions. Classify failures by consequence rather than relying on one blended quality score.
3) Exercise the workflow without autonomous consequence. Use dry runs or recommendation-only behavior. Confirm telemetry, references, approvals, fallback, escalation, and rollback before enabling writes.
4) Release to a bounded eligible cohort. Keep tool scopes narrow and consequential actions under human control. Compare observed behavior with the contract, not with the enthusiasm generated by the demo.
5) Experiment inside the approved boundary. Test prompt, retrieval, interaction, and handoff variants only after they independently satisfy the safety gate. Analyze results by workflow and version.
6) Promote or constrain deliberately. Expand access or authority only when the relevant gates pass. A failed safety gate can restrict a release even when adoption or the business metric improves.
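
The six steps above can be sketched as an ordered path where a release advances one stage at a time. The stage names are my shorthand, and sending a release back to the dry-run stage after a safety failure is an assumption of this sketch, not a rule from the steps themselves.

```python
STAGES = [
    "validate_interface",
    "task_evaluations",
    "dry_run",
    "bounded_cohort",
    "experiment",
    "promote_or_constrain",
]


def next_stage(current: str, gates_passed: bool, safety_failed: bool = False) -> str:
    """Advance one stage at a time; a safety failure returns the release to dry run."""
    index = STAGES.index(current)
    if safety_failed:
        # Never move a release forward as a side effect of a safety failure.
        return STAGES[min(index, STAGES.index("dry_run"))]
    if gates_passed and index < len(STAGES) - 1:
        return STAGES[index + 1]
    return current


print(next_stage("dry_run", gates_passed=True))                         # bounded_cohort
print(next_stage("experiment", gates_passed=True, safety_failed=True))  # dry_run
```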

Pre-commit the gates

Choose thresholds and blocking conditions before reading the launch results. If the team sets them afterward, a promising outcome can quietly lower the quality bar, while a favored feature can turn every failure into an exception.

Gate	Evidence	Blocking condition	Typical response
Quality	Task evaluations, grounded-answer checks, correction categories, and unsupported-output reviews	A consequential failure class exceeds the pre-agreed tolerance or lacks a reliable detector	Revise instructions, retrieval, output constraints, or task scope
Safety and governance	Policy decisions, unauthorized tool attempts, redaction results, approval records, and incidents	An unresolved high-severity policy or data-control failure remains possible	Disable the affected tool or cohort, rotate credentials where needed, and follow the incident runbook
User experience	Completion, edits, rejection, fallback, abandonment, retries, and handoff continuity by cohort	The agent adds work, obscures control, or fails to transfer usable context	Simplify the interaction, improve disclosure, or return the step to a human workflow
Business outcome	The contract’s downstream metric for eligible workflows, with an appropriate comparison	Usage grows without a credible improvement in the intended outcome	Revisit the job, target cohort, workflow placement, or value hypothesis
Operations	Tool errors, latency, timeouts, dependency health, fallback success, and rollback readiness	The workflow cannot meet its reliability requirement or cannot fail safely	Reduce dependency surface, improve fallback, or pause promotion

Do not average these gates into a single agent score. A composite score can let strong adoption cancel a serious security failure or let low latency hide poor answer quality. Keep each gate visible, assign its owner, and specify which failures block promotion without negotiation.
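
The sketch below contrasts a blended score with per-gate blocking. The gate names follow the table above; the owners and the pass or fail values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    gate: str
    owner: str
    passed: bool
    blocks_promotion: bool  # pre-committed before launch results are read


def promotion_decision(results):
    """Any failed blocking gate holds the release, however the other gates look."""
    blockers = [r.gate for r in results if r.blocks_promotion and not r.passed]
    return ("hold", blockers) if blockers else ("promote", [])


results = [
    GateResult("quality", "product", True, True),
    GateResult("safety_and_governance", "security", False, True),
    GateResult("user_experience", "design", True, False),
    GateResult("business_outcome", "product", True, True),
    GateResult("operations", "engineering", True, True),
]

naive_average = sum(r.passed for r in results) / len(results)
decision = promotion_decision(results)
print(naive_average, decision)
```

The blended figure comes out at 0.8, which reads as healthy, while the per-gate decision is a hold on `safety_and_governance`.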

Release decisions should also be reversible. Keep prior prompt, policy, retrieval, and tool configurations identifiable. Define how the runtime disables a tool, narrows a cohort, returns to recommendation-only behavior, or routes directly to a person. A rollback plan that depends on diagnosing the root cause first is too slow for a live incident.
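
One way to make those containment moves cheap is to treat each release as an immutable configuration and derive restricted variants from it. The class, field names, and version labels below are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReleaseConfig:
    version: str
    prompt_version: str
    policy_version: str
    retrieval_version: str
    enabled_tools: frozenset
    recommendation_only: bool = False


# Hypothetical release history, oldest first.
history = [
    ReleaseConfig("r1", "prompt-3", "policy-2", "index-5", frozenset({"draft_reply"})),
    ReleaseConfig("r2", "prompt-4", "policy-2", "index-5",
                  frozenset({"draft_reply", "add_internal_note"})),
]


def disable_tool(config, tool):
    """Contain an incident by removing one tool; nothing else changes."""
    return replace(config, enabled_tools=config.enabled_tools - {tool})


def to_recommendation_only(config):
    """Keep answers flowing while blocking every write."""
    return replace(config, recommendation_only=True)


def previous_release(releases):
    """Return the last known configuration before the current one."""
    return releases[-2]


contained = disable_tool(history[-1], "add_internal_note")
fallback = previous_release(history)
```

None of these functions needs to know why the release misbehaved, which is the point: containment comes first and diagnosis follows.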

Make the dashboard an operating system for the product team

The best agent dashboard does not attempt to show every event. It puts the release decision in view. Organize it in the order the team should reason:

Outcome: eligible workflows, target business result, comparison group where appropriate, and results by cohort and release version.
Journey: eligible, offered, invoked, answer presented, action proposed, approved, executed, handed off, and completed.
Quality and trust: grounded status, acceptance, substantive edits, rejection, retries, corrections, fallback, and qualitative feedback categories.
Governance and operations: allowed and denied tools, approval states, out-of-scope attempts, redaction failures, incidents, errors, latency, and dependency health.

Every panel should filter by agent version, policy version, tool, entry point, cohort, and workflow outcome. A top-line average is useful for orientation, but releases fail in slices: a user role with missing permissions, a workflow with poor retrieval, a new policy that blocks a required tool, or a handoff destination that cannot use the transferred context.
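
A small example shows how a top-line average hides a failing slice. The rows are invented, and the helper name and dimensions are assumptions.

```python
from collections import defaultdict

# Hypothetical workflow-level rows for one release.
rows = (
    [{"role": "admin", "agent_version": "v2", "completed": True}] * 8
    + [{"role": "viewer", "agent_version": "v2", "completed": True}] * 1
    + [{"role": "viewer", "agent_version": "v2", "completed": False}] * 3
)


def completion_by(records, *dimensions):
    """Return {slice key: (completed, total)} for the chosen dimensions."""
    totals = defaultdict(lambda: [0, 0])
    for row in records:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += row["completed"]
        totals[key][1] += 1
    return {key: (done, total) for key, (done, total) in totals.items()}


overall = completion_by(rows)        # no dimensions: one top-line slice
by_role = completion_by(rows, "role")
print(overall, by_role)
```

Here the top line is 9 of 12 workflows completed, while the `viewer` role completes only 1 of 4. That pattern is what a missing permission for one role looks like in the data.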

Run a decision review, not a dashboard tour

A regular review with the product trio can use behavioral telemetry, user feedback, and business outcomes to refine prompts, retrieval, and decision logic. Bring security, legal, analytics, operations, or domain owners into decisions that cross their boundaries. The meeting should answer:

Which intended outcome moved, for which eligible cohort, and under which release version?
Where did users retry, edit, reject, abandon, or request a person, and what does the failure taxonomy show?
Which permissions were never needed, and which denied requests reveal either a valid attack defense or a mismatch between the job and the available tools?
Did the agent reduce user work, or did it move that work into reviewing, correcting, approving, and recovering?
Are outcomes consistent across important roles and workflow entry points, or is the top-line result hiding a weak segment?
What changed since the prior release across the model, prompt, retrieval corpus, tools, policies, user experience, and instrumentation?
Should the team expand, hold, revise, restrict, roll back, or retire the current behavior?

Record the decision beside the release lineage: the hypothesis, eligible scope, versions, expected outcome, gates, observed evidence, known risks, owner, and next review condition. This turns governance into an operating history. It also prevents the same debate from restarting when a metric moves or a stakeholder changes.

Ownership must be explicit. Product owns the job, intended outcome, and promotion decision. Engineering owns runtime reliability, tool boundaries, traceability, and rollback mechanics. Design owns disclosure, user control, approval clarity, correction, and handoff. Data or analytics owns event integrity and metric definitions. Security and legal own the policies and incident requirements within their mandates. Shared input is valuable; shared accountability without a decision owner is not.

Start with one consequential workflow. Write its contract, add the eligibility event and shared identifiers, classify every available tool by authority, pre-commit the release gates, and review the first bounded cohort against the business outcome. Do not broaden the agent until you can explain why it ran, what it was permitted to see and do, what the user did next, whether the workflow improved, and how you would stop it safely.

January 3, 2026

My Proven Experimentation Playbook for AI PMs: Faster Learning, Safer Launches, Bigger Wins

I build AI products with a simple conviction: disciplined experimentation beats intuition. Over the years, I’ve refined a practical playbook that helps my teams learn faster, reduce risk, and turn every release into a smarter next step.

Product experimentation isn’t luck; it’s a method. Learn how top AI product managers test, measure, and grow smarter with every release.

I begin every effort with a crisp hypothesis, an expected user or business outcome, and unambiguous success criteria tied to outcomes vs output OKRs. Before writing a line of code, I define primary metrics and guardrails so we know what “good” looks like and when to stop.

When the change affects UX, pricing, or activation flows, I favor A/B testing with the statistical rigor to back decisions. We calculate the minimum detectable effect (MDE), choose appropriate randomization units, and pre-register the analysis plan to avoid p-hacking. This gives the team the confidence to scale wins and sunset underperformers quickly.
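
For a conversion-style metric, the link between MDE and sample size can be sketched with the common normal-approximation formula for comparing two proportions. This is a planning estimate under stated assumptions (two-sided test, equal arm sizes, absolute MDE), not the exact method any particular experimentation platform uses; the function name and defaults are mine.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided test of two proportions."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10% baseline activation, detect a 2-point absolute lift.
n_two_points = sample_size_per_arm(0.10, 0.02)
n_one_point = sample_size_per_arm(0.10, 0.01)
print(n_two_points, n_one_point)
```

Halving the MDE roughly quadruples the required sample, which is why I agree on the MDE before the test starts rather than after seeing traffic.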

AI features demand a tailored approach, so I run eval-driven development before any user sees a variant. We curate golden datasets, score candidate prompts and models, and stress-test failure modes. This is where LLM fluency matters for product managers: prompt templates, context window management, and a retrieval-first pipeline are all evaluated for quality, latency, and cost-to-serve. I treat hallucination rate, safety violations, and bias as first-class metrics under AI risk management.

To de-risk launches, we ship behind feature flags with CI/CD, monitor DORA metrics, and roll out in stages. Product trios own problem framing to solution delivery, which shortens feedback loops and preserves accountability. If early signals drift from our hypotheses, we pause, adjust, and re-run—no sunk-cost thinking.
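
A staged rollout behind a flag is often implemented with deterministic bucketing, so a user stays in the same group as the percentage grows. The sketch below is one common approach, not the API of any specific feature-flag product; the function and flag names are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically place a user in a 0-99 bucket and compare with the rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
stage_one = {u for u in users if in_rollout(u, "agent_write_actions", 10)}
stage_two = {u for u in users if in_rollout(u, "agent_write_actions", 50)}
print(len(stage_one), len(stage_two), stage_one <= stage_two)
```

Because the bucket depends only on the flag and the user identifier, everyone in the first stage remains included when the percentage rises, and setting the percentage to 0 turns the feature off for all users.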

Measurement is non-negotiable. I instrument user journeys end-to-end with Amplitude analytics, track activation and retention, and map behavior to learning objectives. We consolidate logs and events into a unified analytics platform so qualitative insights from customer research pair cleanly with quantitative trends.

Continuous discovery keeps the engine running. Weekly customer conversations, in-product feedback, and lightweight prototypes ensure we validate needs, not just solutions. The output flows into product discovery, product roadmapping and sprint planning, and a reusable AI product toolbox that scales across teams.

Finally, I protect the culture that makes experimentation work: we celebrate invalidated hypotheses, document decisions, and optimize for outcomes over output. That’s how empowered product teams sustain product-led growth—even as complexity grows.

If you’re building AI features today, adopt this playbook to maximize learning velocity, minimize risk, and compound advantage. The method is straightforward: form strong hypotheses, test with rigor, measure what matters, and let evidence—not HiPPOs—guide the roadmap.

Inspired by this post on Product School.

December 31, 2025

The New AI Playbook for Product Portfolio Optimization: Slash Complexity, Boost ROI

The most valuable lesson I’ve learned leading product organizations is that portfolio choices make or break outcomes. In an era of infinite requests and finite teams, the question isn’t what we could build—it’s what we must build next. That’s why I’m codifying a pragmatic, AI-driven playbook to optimize the product portfolio while staying true to outcomes, not output.

AI-powered product portfolio optimization is here. Explore strategies and tools helping product leaders manage complexity and boost ROI.

My starting point is a data backbone that connects strategy to reality. I aggregate product usage, revenue by segment, cost-to-serve, retention cohorts, and support signals into a unified analytics platform, then layer a retrieval-first pipeline so LLMs can reason over clean context. Instrumentation matters: Amplitude analytics, Pendo, and in-app guides provide the behavioral and activation signals that make prioritization measurable.

From there, I translate strategy into an objective decision system. I express outcomes vs output OKRs, align initiatives to value proposition and competitive differentiation, and classify opportunities with the Kano Model. LLMs for product managers help cluster voice-of-customer at scale; with thoughtful prompt engineering and AI workflows, I can map themes to jobs-to-be-done, quantify demand, and de-duplicate asks across stakeholders.

Execution hinges on evidence. I run A/B tests with a clear minimum detectable effect (MDE), pair them with eval-driven development for AI features, and ship through CI/CD while tracking DORA metrics. This closes the loop between planning (roadmapping and sprint planning) and real-world performance: activation, retention, and Web Vitals inform the next set of portfolio bets.

Trust is a feature, so governance is built-in. Privacy-by-design, data governance, and AI risk management guide how we store, prompt, and evaluate models. I apply guardrails to sensitive workflows and define success metrics that balance short-term ROI with long-term resilience and regulatory compliance.

The operating model matters as much as the models themselves. Product trios and empowered product teams run continuous discovery, pressure-test assumptions in QBRs and OKR reviews, and make trade-offs visible. Stakeholder management becomes easier when the portfolio narrative is anchored in transparent scenarios and shared metrics.

If you’re getting started, here’s my flow: unify data, define outcomes, segment opportunities, simulate scenarios, and test fast. Use LLMs to synthesize signals you’d never humanly read, then make one focused bet per team that moves a measurable KPI. Rinse, learn, and reallocate—portfolio optimization is a living system, not an annual meeting.

Ultimately, the promise of this new playbook is simple: less noise, sharper focus, and compounding ROI. By pairing AI Strategy with disciplined product management leadership, we can manage complexity with clarity—and consistently build what matters most.

Inspired by this post on Product School.

December 29, 2025

10 AI Business Models You Need Now: Proven Playbooks Turning Algorithms into Revenue

I’ve spent the past few product cycles re-architecting roadmaps around one simple reality: AI is no longer just a feature—it’s a business model. The companies winning market share are those that treat models, data, and workflows as monetizable assets with defensible moats, not science projects.

AI business models are rewriting value creation. Learn how smart teams turn algorithms into profit engines, reshaping entire industries.

From my seat in product leadership, I evaluate AI bets through three lenses: durable value (moat and differentiation), measurable outcomes (clear ROI), and unit economics (gross margins under real-world load). With that frame, here are ten AI business models I see performing now—and how I decide when to invest.

1) API-first Model-as-a-Service. I monetize foundation or specialized models via an API, priced by tokens, requests, or time-in-context. Success hinges on latency, accuracy, and “context window management” that balances quality with cost. This is where “consumption SaaS pricing” shines and where disciplined rate-limiting, observability, and SLAs build trust.

2) Vertical AI copilots. I package domain-specific expertise (legal, healthcare, finance, field service) into workflow-native assistants that surface next-best actions. Because these copilots live where work happens, I price on outcomes—time saved, revenue recovered, or risk reduced—aligning value with customer metrics and accelerating product adoption.

3) Agentic AI automation. When autonomous agents handle multi-step tasks across tools, I lean toward per-outcome or per-job pricing. Reliability is the moat, so I invest early in eval-driven development, robust guardrails, and human-in-the-loop QA. This model compounds fast once agents can execute end-to-end workflows with transparent audit trails.

4) Copilot add-ons inside existing SaaS. I’ve seen “AI Assist” tiers deliver immediate ARPU lift and retention gains. The playbook: start with high-frequency, high-friction jobs (drafts, summaries, enrichment), then expand to proactive suggestions. This aligns tightly with product strategy and lets me stage value without overhauling the core experience.

5) Insights-as-a-Service via data network effects. I transform exhaust data into benchmarking, predictions, and prescriptive recommendations—while honoring privacy-by-design and data governance. The more customers I onboard, the stronger the patterns, and the higher the switching costs. Pricing ties to seats plus an outcomes or value metric.

6) Retrieval-first pipeline for enterprise knowledge. I land with high-accuracy answers over customer data (search, summarize, cite), then expand into workflow automations. A retrieval-first pipeline reduces hallucinations, boosts trust, and creates defensibility through connectors, semantic indexing, and continuous relevance tuning. That makes it a strong fit for product managers who put reliability first when adopting LLMs.

7) Open source monetization. When I bet on openness, I monetize hosting, support, enterprise controls, and compliance features. The advantage is developer love and rapid iteration; the moat is operational excellence at scale, plus integrations customers rely on. This model converts community momentum into predictable revenue.

8) Marketplaces for prompts, skills, and agents. I create a platform for third-party extensions and charge a take rate on usage. The flywheel spins when developers see distribution, customers see breadth, and I enforce strong quality bars. The roadmap focuses on governance, discovery, and safe execution policies.

9) Solutions with forward deployed engineers. For complex rollouts, I pair product with specialized implementation to guarantee outcomes. Revenue blends software plus services, accelerating time-to-value and informing the roadmap with real-world constraints. Over time, learnings fold back into scalable, self-serve capabilities.

10) AI risk, security, and compliance tooling. As AI scales, so does the need for policy enforcement, monitoring, and auditability. I monetize via platform subscriptions that address model provenance, data leakage prevention, red teaming, and reporting. Strong “AI risk management” is now a purchasing requirement, not a nice-to-have.

How do I choose among these models? I start with the customer’s biggest workflow pain, map it to the fastest path to measurable outcomes, and align pricing with value creation. Then I build defensibility through data advantage, distribution, and governance. If a model deepens trust, improves margins, and compounds learning, it earns a place on the roadmap.

Inspired by this post on Product School.

December 24, 2025
Monetizing AI with Confidence: Proven Models, Smart Pricing, and ROI You Can Defend

I’ve learned the hard way that shipping an impressive AI demo is not the same as creating a durable revenue engine. In my role leading product strategy, I focus on one goal: connect AI capabilities to measurable customer outcomes, then price and package them so both value and margins are visible and defensible.

Turning AI features into profit isn’t trivial. Here is how I choose a business model, set prices, and package AI capabilities so the returns are measurable.

First, I clarify the business model. Add-on AI packs work when the value is concentrated in a specific workflow (for example, automated summarization or AI copilot assistance). Tiered packaging helps when AI elevates the overall experience across many features. Usage-based or consumption SaaS pricing is ideal when value scales with volume—tokens, documents processed, calls handled, or agents invoked—because it aligns price to realized outcomes.

Next, I align pricing mechanics with the customer’s value story. I anchor price against the baseline they know: hours saved, conversions gained, cases deflected, or risk reduced. Then I set floors based on unit economics—model inference, vector storage, and orchestration costs—so gross margins remain healthy as usage grows. Clear guardrails (quotas, rate limits, and context window management) prevent surprise bills and keep cost-to-serve predictable.
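To make the floor calculation concrete, here is a minimal sketch. It assumes cost-to-serve per billable unit is simply the sum of inference, vector storage, and orchestration costs; the `UnitCost` type, the function names, and every number are hypothetical illustrations, not benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCost:
    """Hypothetical cost-to-serve for one billable unit (e.g. one document)."""
    inference: float
    vector_storage: float
    orchestration: float

    @property
    def total(self) -> float:
        return self.inference + self.vector_storage + self.orchestration


def price_floor(cost: UnitCost, target_gross_margin: float) -> float:
    """Lowest unit price that still meets the target gross margin.

    Gross margin = (price - cost) / price, so price = cost / (1 - margin).
    """
    if not 0 <= target_gross_margin < 1:
        raise ValueError("target_gross_margin must be in [0, 1)")
    return cost.total / (1 - target_gross_margin)


def gross_margin(price: float, cost: UnitCost) -> float:
    """Gross margin realized at a given unit price."""
    return (price - cost.total) / price


# Illustrative numbers only.
cost = UnitCost(inference=0.012, vector_storage=0.002, orchestration=0.001)
floor = price_floor(cost, target_gross_margin=0.70)
print(f"cost per unit: {cost.total:.3f}, price floor: {floor:.3f}")
```

Any value-based price that lands below this floor is a signal to change the meter, the model, or the quota rather than to absorb the loss.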

Packaging is where monetization becomes intuitive. I gate high-cadence, high-compute features behind premium tiers, and I expose quick wins (like smart suggestions) in core tiers to accelerate activation. For enterprise, I bundle governance, audit logs, data controls, and “privacy-by-design” features to justify step-up pricing and reduce procurement friction.

To sustain ROI, I run an eval-driven development loop. I define quality metrics (accuracy, helpfulness, latency, safety) and instrument the retrieval-first pipeline so I can isolate where value is created or lost. This lets me right-size models, tune prompts, and swap components without compromising outcomes or margins, which is critical for product managers who must balance LLM experience quality against cost.

Measurement is non-negotiable. I track activation, time-to-first-value, weekly engaged AI users, and feature-level retention. For revenue impact, I attribute uplift through A/B testing and minimum detectable effect thresholds, measuring conversion lift, ticket deflection, and cycle-time reductions. When customers see these numbers in their own dashboards, procurement turns into partnership.
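A minimum detectable effect only means something once it is tied to a sample size. The sketch below uses the standard normal approximation for a two-sided, two-proportion test; the function name, the default significance level and power, and the example rates are my assumptions, not figures from a real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must be in (0, 1) and the effect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical: 10% baseline conversion, 2-point absolute lift worth detecting.
n = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
print(f"users per arm: {n}")
```

If the required sample is larger than the traffic the AI feature will see in a reasonable window, I either accept a larger detectable effect or pick a more sensitive metric before launching the test.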

Risk and compliance are part of the product, not an afterthought. I build in AI risk management, data governance, and red-teaming from day one. Clear data boundaries, human-in-the-loop controls, and transparent disclosures protect end users and make enterprise legal teams our allies rather than blockers.

Go-to-market matters as much as the model. I use product-led growth tactics—free AI credits, transparent meters, and in-app guides—to let users feel the value before the paywall. Sales enablement centers on the value proposition: faster outcomes, higher quality, and lower total cost of ownership, not just “gen ai” for its own sake. Pricing pages should showcase tiers, usage bands, and outcomes, eliminating guesswork.

Here’s the simple playbook I follow: validate the problem with continuous discovery, instrument the workflow, pilot with generous caps, and collect willingness-to-pay signals early. Then iterate the price meter, refine units of value (documents, messages, or actions), and align SKUs to buyer personas. Over time, I introduce agentic AI capabilities as premium modules when they demonstrably reduce steps or automate entire objectives.

When AI monetization works, it feels effortless to customers because the price mirrors the outcome. When it doesn’t, it’s usually because packaging hides value, pricing ignores unit economics, or ROI isn’t visible. By grounding strategy in value metrics, consumption-aware pricing, and rigorous evaluation, I’ve found we can scale AI revenue with confidence—and keep both customers and margins happy.

Inspired by this post on Product School.

December 22, 2025
Trustworthy AI Product Engineering: From Demo to Daily Use
You have an AI feature that performs impressively in a demo. The difficult decision comes next: can you let it shape a customer’s workflow when its inputs may be incomplete, its output is probabilistic, and a polished answer can still be wrong?

The answer should not depend on confidence theater or one launch-day accuracy score. You need a product and engineering system that makes claims traceable, uncertainty actionable, failures bounded, and quality continuously measurable. That is what turns trust from a brand promise into a release criterion.

Define a trust contract before choosing the architecture

Trustworthy AI does not mean an AI product is always correct. It means the product is explicit about what it can do, shows the basis for consequential claims, declines work outside its operating boundary, and gives the user a safe way to recover when something goes wrong.

I treat every consequential AI workflow as having a trust contract. This is not a legal document or a general responsible-AI statement. It is a short product specification that connects a user decision to evidence, acceptable errors, system behavior, and ownership.

Write the contract before debating models or orchestration frameworks. Include these fields:
- User and decision: Name the person relying on the output and the decision the output will influence. Generating ideas and approving a customer-facing action are different products, even if they use the same model.
- Permitted claim: State what the system may conclude. A diagnostic assistant might identify a likely contributor to a metric change, but it should not present correlation as proven causation.
- Required evidence: Define the data, permissions, time range, comparison, and retrieval quality needed before the claim can appear.
- Uncertainty behavior: Specify when the product answers normally, adds a qualification, asks for more information, or abstains.
- Action boundary: Separate advice, preparation of a reversible action, and autonomous execution. Each step toward execution needs a stronger quality threshold and a clearer recovery path.
- Unacceptable outcome: Describe failures that block release, such as exposing another customer’s data, inventing a citation, applying an action to the wrong account, or concealing missing evidence.
- Quality measure and owner: Choose the metric that reflects the failure cost and assign a person who can stop or roll back the feature.
This contract prevents a common category error: treating model capability as product readiness. The same output quality may be acceptable when a user is brainstorming and unacceptable when the system is changing a live configuration. Risk comes from the combination of the output, the user, and the action that follows.

Consider an assistant investigating a drop in campaign performance. It may safely offer a hypothesis if it displays the metric, segment, comparison window, and missing data. It should not automatically reallocate a budget when the evidence is incomplete. The safe alternative is to keep the result advisory and require a person to verify the cited analysis before any consequential change.

If you cannot complete the trust contract, keep the feature inside a reversible, supervised workflow. That is not a failure to innovate. It is an accurate boundary for what the product can currently support.
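One way to make the contract enforceable is to store it as data and let its completeness gate the action boundary. This is a minimal sketch: the field names mirror the list above, while `ActionMode`, `allowed_mode`, and the rule that an incomplete contract falls back to advisory mode are my own illustrative assumptions.

```python
from dataclasses import dataclass, fields
from enum import Enum


class ActionMode(Enum):
    ADVISORY = "advisory"            # advice only; a person acts
    PREPARE_REVERSIBLE = "prepare"   # system prepares a reversible action
    AUTONOMOUS = "autonomous"        # system executes on its own


@dataclass(frozen=True)
class TrustContract:
    user_and_decision: str = ""
    permitted_claim: str = ""
    required_evidence: str = ""
    uncertainty_behavior: str = ""
    action_boundary: str = ""
    unacceptable_outcome: str = ""
    quality_measure: str = ""
    owner: str = ""

    def missing_fields(self) -> list:
        """Names of contract fields that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


def allowed_mode(contract: TrustContract, requested: ActionMode) -> ActionMode:
    """An incomplete contract keeps the feature advisory, whatever was requested."""
    if contract.missing_fields():
        return ActionMode.ADVISORY
    return requested


draft = TrustContract(
    user_and_decision="Marketer deciding whether to reallocate budget",
    permitted_claim="Likely contributor to the metric change, not proven cause",
)
print(draft.missing_fields())
print(allowed_mode(draft, ActionMode.AUTONOMOUS))
```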

Engineer an evidence path, not just an answer

A fluent response is an interface. It is not evidence. For an AI product to support a real decision, the user must be able to move from the claim to the data that supports it without reconstructing the system’s reasoning from scratch.

Start with a retrieval-first flow: authoritative data, retrieval, structured context, generation, policy checks, presentation, and telemetry. That requires robust data contracts and a deliberate orchestration layer, because no prompt can repair ambiguous field meanings, stale records, or broken permissions.

A useful data contract should tell the AI system and its operators:
- What each field means, including its unit and valid states.
- Which tenant, account, or user is allowed to access it.
- How fresh the value must be for the intended decision.
- How null, delayed, duplicated, or conflicting records are represented.
- Which transformations produced a derived metric.
- Which identifier links the generated claim back to the underlying record, query, chart, or dashboard.
Pass an evidence object through the system alongside the generated answer. At minimum, that object should contain the claim it supports, the source identifiers, filters, time window, retrieval timestamp, relevant transformations, and any missing or conflicting signals. The policy layer can then inspect the same evidence the interface will expose.

This design is stronger than asking the model to add citations after it has written an answer. A citation generated as decoration can look convincing while pointing to something irrelevant. A citation carried through the pipeline can be checked for permissions, relevance, and claim-level support before the user sees it.
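As a rough sketch of an evidence object carried through the pipeline, the structure below holds the fields named above, and a policy check inspects it before anything reaches the interface. The `Evidence` type, `policy_check`, the `tenant_id` field, and the freshness parameter are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Evidence:
    claim: str
    source_ids: tuple            # records, queries, charts, or dashboards
    tenant_id: str               # who is allowed to see these sources
    filters: dict
    time_window: tuple           # (start, end) as ISO 8601 dates
    retrieved_at: datetime
    transformations: tuple = ()
    missing_signals: tuple = ()
    conflicting_signals: tuple = ()


def policy_check(evidence: Evidence, requesting_tenant: str,
                 max_age: timedelta, now: datetime | None = None) -> list:
    """Return reasons the claim must not be shown as-is; empty means pass."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if evidence.tenant_id != requesting_tenant:
        problems.append("evidence belongs to another tenant")
    if not evidence.source_ids:
        problems.append("claim has no source identifiers")
    if now - evidence.retrieved_at > max_age:
        problems.append("evidence is stale")
    if evidence.missing_signals or evidence.conflicting_signals:
        problems.append("missing or conflicting signals must be disclosed")
    return problems
```

Because the check runs on the same object the interface renders, a claim cannot appear with a source the policy layer never saw.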

In the interface, build an inspection ladder:
December 18, 2025

How to Design Multi-Agent Fintech Support That Finishes Work

Your support prototype can explain what happens after a customer reports a stolen card. The harder product decision is whether you can trust it to carry that case from the first message to a verified outcome without losing state, skipping an approval, duplicating an action, or going silent while work remains open.

You will not solve that problem by adding a larger prompt or more conversational agents. You need an operating model for cases that span people, policies, systems, and days. The model below gives you a practical way to define the work, divide agent responsibilities, control execution, and measure whether the customer's problem was actually resolved.

Define the case before you define the agents

A stolen-card request exposes the central mistake in support automation. Freezing the card is visible, immediate, and easy to demonstrate. The less visible work may include dispute intake, fraud investigation, merchant communication, customer outreach, approvals, and follow-up. If your scope ends when the chat ends, you have automated the tip of the workflow while leaving its operational burden intact.

Start with a case contract. This is the shared definition of what entered the system, what outcome is owed, which actions are permitted, and what evidence will prove completion. Define it before deciding how many agents you need.

- Customer outcome: State the result in operational terms. "Card secured and required follow-up completed" is more useful than "customer helped."
- Entry conditions: Record the signals that create the case, including the customer request, the affected product, and any authentication or evidence requirements imposed by your policy.
- Required work: Enumerate the actions, investigations, notices, approvals, and follow-ups that may sit below the initial request.
- Allowed actions: Specify which tools may be called, which fields may be changed, and which financial or account actions require approval.
- State and owner: Give every open case a current state and an accountable role. "The agents are working on it" is not a state.
- Waiting conditions: Name the external event that can unblock the case, such as a customer reply, a system response, a timer, or a human decision.
- Terminal conditions: Define resolved, declined, cancelled, transferred, and incomplete outcomes separately. Each one should require evidence and a reason code.

The strongest procedure starts as a workflow map owned by the people who understand disputes, fraud, operations, and compliance. Those subject-matter experts can maintain agent procedures in natural language, but natural language should not mean unmanaged prose. Give each procedure an owner, version, effective date, test cases, and approval history. A policy change should produce a traceable procedure change, not an invisible prompt edit.

Test your case contract with an awkward question: could the system truthfully tell the customer that the case is resolved while a mandatory downstream task is still pending? If the answer is yes, your terminal condition is wrong. Fix that before tuning response quality.
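The awkward question above can be turned into an executable check. This is a minimal sketch under my own assumptions: `Task`, `Case`, and `can_resolve` are hypothetical names, and "done" here requires both a completion flag and an evidence reference.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    mandatory: bool
    done: bool = False
    evidence_ref: str = ""   # pointer to the proof of completion


@dataclass
class Case:
    case_id: str
    outcome_owed: str
    state: str
    owner_role: str
    tasks: list = field(default_factory=list)


def can_resolve(case: Case) -> tuple:
    """Return (ok, blockers). Mandatory work without evidence blocks resolution."""
    blockers = [
        t.name for t in case.tasks
        if t.mandatory and not (t.done and t.evidence_ref)
    ]
    return (not blockers, blockers)


case = Case(
    case_id="case-001",
    outcome_owed="Card secured and required follow-up completed",
    state="investigation_in_progress",
    owner_role="back_office",
    tasks=[
        Task("freeze_card", mandatory=True, done=True, evidence_ref="tool-result-17"),
        Task("dispute_intake", mandatory=True),
        Task("satisfaction_survey", mandatory=False),
    ],
)
ok, blockers = can_resolve(case)
print(ok, blockers)
```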

Split responsibilities at operational handoffs

A multi-agent design earns its complexity only when the separation makes ownership clearer. Creating several agents with overlapping prompts usually produces more routing ambiguity, not more capability. Divide the system where the nature of the work, permissions, or waiting behavior changes.

A useful pattern separates inbound, back-office, and outbound responsibilities while keeping procedures, skills, and guardrails on a shared foundation.

Agent role	What it owns	Typical handoff signal	Boundary to enforce
Inbound	Understands the request, gathers required details, performs permitted immediate actions, and creates or updates the case	The case has enough validated information to begin operational work	It cannot imply resolution merely because the conversation was handled
Back office	Executes system work, coordinates investigation steps, records evidence, and manages pending operational tasks	More information, an approval, or customer communication is required	It cannot invent missing evidence or bypass a policy gate to keep the case moving
Outbound	Requests missing information, communicates status or decisions, and follows up until a defined terminal condition is reached	The required response arrives, a timer fires, or the outreach policy is exhausted	It cannot decide that silence means success unless the procedure explicitly defines that outcome

The handoff should be a structured state transition, not an open-ended conversation between agents. Pass a compact case record containing the case identifier, current state, completed actions, evidence references, pending requirement, next allowed actions, applicable procedure version, and relevant deadline or timer. That record prevents the next agent from reconstructing the truth from a transcript.

Keep skills modular as well. "Send a status request," "retrieve transaction details," and "submit an approved case update" are easier to authorize, test, and audit than one broad tool called "handle dispute." Each skill should declare its required inputs, permitted states, side effects, expected result, and failure behavior.
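A sketch of what the handoff record and a skill declaration could look like as data, with one authorization check that uses both. All type names, field names, and the example skill are hypothetical; the fields follow the lists in the two paragraphs above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    case_id: str
    state: str
    completed_actions: tuple
    evidence_refs: tuple
    pending_requirement: str
    next_allowed_actions: tuple
    procedure_version: str
    deadline: str  # ISO 8601 timestamp, or "" when no timer applies


@dataclass(frozen=True)
class Skill:
    name: str
    required_inputs: frozenset
    permitted_states: frozenset
    side_effects: tuple
    on_failure: str  # e.g. "escalate" or "retry_later"


def can_invoke(skill: Skill, record: CaseRecord, inputs: dict) -> tuple:
    """Return (allowed, reason) for calling a skill on the handed-off case."""
    if record.state not in skill.permitted_states:
        return (False, f"{skill.name} is not permitted in state {record.state}")
    if skill.name not in record.next_allowed_actions:
        return (False, f"{skill.name} is not a next allowed action")
    missing = skill.required_inputs - set(inputs)
    if missing:
        return (False, f"missing inputs: {sorted(missing)}")
    return (True, "ok")


send_status_request = Skill(
    name="send_status_request",
    required_inputs=frozenset({"case_id", "channel"}),
    permitted_states=frozenset({"waiting_on_customer"}),
    side_effects=("outbound message sent",),
    on_failure="escalate",
)
```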

Do not use separate agents simply to mirror your organization chart. Use them when different stages need different permissions, context, completion rules, or escalation paths. If two proposed agents can perform the same actions in the same states under the same controls, they probably belong together.

Let a state machine control long-running work

The language model can interpret a message and propose the next step. It should not be the sole authority on what state the case is in or which actions are legal from that state. A state-machine orchestrator can manage turns, triggers, and skill selection across an asynchronous case while the model handles the language inside those boundaries.

For an illustrative stolen-card workflow, your states might include:

- Report received.
- Immediate protection pending.
- Immediate protection confirmed.
- Required information under review.
- Investigation or dispute work in progress.
- Waiting on the customer, a merchant, an internal system, or a human approver.
- Decision ready.
- Required communication pending.
- Resolved, transferred, declined, cancelled, or closed incomplete with a recorded reason.

Adapt the states to your product, operating procedure, and regulatory obligations. The value is not in these labels. It is in making every transition explicit. For each transition, specify the triggering event, required preconditions, allowed skill, expected side effect, accountable role, failure path, timer behavior, and evidence written back to the case.

Then scope skills deterministically for each turn. An agent handling a customer reply while the case is waiting for information may be allowed to validate the reply, attach evidence, request a missing item, or resume the workflow. It should not be able to perform unrelated account actions simply because those tools exist elsewhere in the platform. This per-state allow-list reduces the number of unsafe choices the model can make.
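A minimal sketch of an explicit transition table with a per-state allow-list. It covers only three of the illustrative states, and every state name, event name, and skill name here is a placeholder you would replace with your own procedure.

```python
from enum import Enum


class State(Enum):
    WAITING_ON_CUSTOMER = "waiting_on_customer"
    INFO_UNDER_REVIEW = "required_information_under_review"
    INVESTIGATION = "investigation_in_progress"


# Skills the model may choose from in each state; everything else is hidden.
ALLOWED_SKILLS = {
    State.WAITING_ON_CUSTOMER: frozenset(
        {"validate_reply", "attach_evidence", "request_missing_item", "resume_workflow"}
    ),
    State.INFO_UNDER_REVIEW: frozenset({"attach_evidence", "request_missing_item"}),
    State.INVESTIGATION: frozenset({"retrieve_transaction_details"}),
}

# Legal (state, event) pairs and the state each one leads to.
TRANSITIONS = {
    (State.WAITING_ON_CUSTOMER, "customer_replied"): State.INFO_UNDER_REVIEW,
    (State.INFO_UNDER_REVIEW, "information_accepted"): State.INVESTIGATION,
    (State.INFO_UNDER_REVIEW, "information_incomplete"): State.WAITING_ON_CUSTOMER,
}


def skills_for_turn(state: State) -> frozenset:
    """Deterministic allow-list; unknown states get no skills at all."""
    return ALLOWED_SKILLS.get(state, frozenset())


def authorize(state: State, proposed_skill: str) -> bool:
    """The model proposes a skill; the orchestrator decides whether it is legal."""
    return proposed_skill in skills_for_turn(state)


def next_state(state: State, event: str) -> State:
    """Apply an event, rejecting any transition the table does not define."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"event {event!r} is not legal in state {state.name}")
    return TRANSITIONS[key]
```

The model never sees a tool outside `skills_for_turn(state)`, so an unsafe choice has to get past the table rather than past a prompt instruction.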

Async triggers deserve the same design care as messages. A customer reply, API status change, timer expiry, failed tool call, and human approval are all events that can create a new turn. Store them durably and process them against the current case version. Otherwise a delayed event can act on stale state after the case has already moved forward.
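One common way to stop a delayed event from acting on stale state is a version check on the case. The sketch below assumes each event records the case version it was produced against; the in-memory `seen` set stands in for a durable store, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseState:
    case_id: str
    state: str
    version: int = 0


@dataclass(frozen=True)
class Event:
    event_id: str
    case_id: str
    kind: str
    expected_version: int  # case version the event was produced against


class StaleEvent(Exception):
    """Raised when an event targets a case version that has moved on."""


def process(case: CaseState, event: Event, new_state: str, seen: set) -> CaseState:
    """Apply an event once, and only if it targets the current case version."""
    if event.event_id in seen:
        return case  # duplicate delivery: already handled
    if event.expected_version != case.version:
        raise StaleEvent(
            f"{event.kind} expected v{event.expected_version}, case is at v{case.version}"
        )
    seen.add(event.event_id)
    case.state = new_state
    case.version += 1
    return case
```

A stale event is not discarded silently in this sketch; it raises so the orchestrator can re-evaluate it against the current state or route it to review.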

Financial actions also need protection from retries. A timeout does not prove that a tool failed; the action may have succeeded while the response was lost. Use an idempotency key where the receiving system supports one, record the attempted operation before retrying, and reconcile uncertain outcomes. Blindly repeating a freeze, refund, fee adjustment, or dispute submission can create customer harm and financial exposure.
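The retry rule can be sketched as follows. `FakeCardApi` is a stand-in I invented for a receiving system that honors idempotency keys; it is not a real provider API, and the attempt log is a plain list where production code would use durable storage.

```python
import uuid


class FakeCardApi:
    """Hypothetical receiving system that deduplicates on an idempotency key."""

    def __init__(self):
        self.results = {}
        self.freeze_count = 0
        self.drop_next_response = False

    def freeze(self, card_id: str, idempotency_key: str) -> dict:
        if idempotency_key not in self.results:
            self.freeze_count += 1  # the side effect happens at most once per key
            self.results[idempotency_key] = {"card_id": card_id, "status": "frozen"}
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the action succeeded")
        return self.results[idempotency_key]


def freeze_with_retry(api, card_id: str, attempt_log: list, max_attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # one key for the whole logical operation
    for attempt in range(1, max_attempts + 1):
        # Record the attempt before calling, so an uncertain outcome is traceable.
        attempt_log.append({"op": "freeze", "card_id": card_id, "key": key, "attempt": attempt})
        try:
            return api.freeze(card_id, idempotency_key=key)
        except TimeoutError:
            continue  # outcome unknown: retry with the SAME key
    raise RuntimeError("outcome uncertain; reconcile before acting again")
```

Generating a fresh key on each retry would defeat the purpose, because the receiving system would treat every attempt as a new request.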

Outbound completion needs its own rule. The customer may never send a final message, so "the conversation ended" cannot define success. A defensible terminal condition can require that the necessary notice was sent, mandatory actions are complete, no unresolved task remains, and any follow-up timer has reached the outcome defined by policy. Silence may end an outreach attempt; it does not automatically prove the underlying case was resolved.

Finally, write an audit record for every transition. Capture the prior state, event, procedure version, allowed skills, selected action, tool result, guardrail result, human decision if present, and resulting state. A transcript tells you what was said. A transition log tells you why the system acted.

Make compliance and human review part of execution

Do not reduce compliance to a paragraph at the end of the system prompt. High-stakes rules need controls at the point where the system interprets information, chooses an action, changes a case, or communicates a decision.

Use three complementary layers:

- Deterministic controls: Enforce permissions, required fields, state preconditions, transaction limits defined by your policy, and mandatory approvals in code or workflow configuration.
- Classification guardrails: Detect whether an input, proposed action, or outgoing message belongs to a risk category that must be blocked, revised, or reviewed.
- Human decisions: Route policy exceptions, consequential approvals, conflicting evidence, ambiguous cases, and unsupported operations to an accountable person.

For critical regulatory checks, treat guardrails as classification problems and prioritize recall when missing a risky case is more costly than sending an extra case to review. That choice has an operational consequence: more false positives can increase manual workload and delay customers. Product, operations, risk, and compliance owners should agree on that trade-off for each guardrail rather than applying one global threshold.

Every classifier needs a defined consequence. A positive result might block an action, remove a skill from the current turn, require human approval, or permit the workflow to continue with additional logging. A score without an execution rule is only dashboard data.
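To illustrate the recall trade-off and the execution rule together, the sketch below picks a threshold from labeled scores to meet a target recall, reports the share of cases that threshold sends to review, and maps a positive result to a consequence. The guardrail names, consequences, and scores are invented for the example.

```python
from math import ceil


def threshold_for_recall(scores, labels, target_recall: float) -> float:
    """Highest threshold (flag when score >= threshold) that meets target recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("need at least one labeled risky case")
    needed = max(1, ceil(target_recall * len(positives)))
    return positives[needed - 1]


def review_rate(scores, threshold: float) -> float:
    """Share of all cases that would be flagged at this threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical per-guardrail consequences agreed with risk and compliance owners.
CONSEQUENCES = {
    "sanctions_screening": "block_action",
    "vulnerable_customer": "require_human_approval",
}


def enforce(guardrail: str, score: float, threshold: float) -> str:
    """Turn a classifier score into an execution rule, not just a dashboard value."""
    if score < threshold:
        return "continue"
    return CONSEQUENCES[guardrail]


scores = [0.90, 0.70, 0.60, 0.35, 0.30, 0.10]
labels = [1, 0, 1, 1, 0, 0]  # 1 = truly risky, from reviewed cases
strict = threshold_for_recall(scores, labels, target_recall=1.0)
relaxed = threshold_for_recall(scores, labels, target_recall=0.6)
print(strict, review_rate(scores, strict))
print(relaxed, review_rate(scores, relaxed))
```

Comparing the two review rates side by side is what lets owners agree on a threshold per guardrail instead of inheriting one global setting.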

Customer-specific policies matter in a platform serving more than one fintech. The system may share an architecture while each customer requires its own procedures and guardrails. Resolve the applicable policy set from trusted configuration before the model acts, attach the policy version to the case, and prevent cross-customer retrieval or tool access. Do not ask a model to infer which client's rules should apply from conversational context.

Human escalation should be a first-class tool call, not a side-channel message. The request should contain the exact decision needed, current state, relevant evidence, attempted actions, available options, policy context, risk of delay, and response deadline. The human's answer should return as a recorded workflow event so the orchestrator can validate it and resume from the correct state.
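A sketch of escalation as a structured request and a validated response event. The two record types and `resume_state` are hypothetical; the fields follow the list in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRequest:
    case_id: str
    decision_needed: str
    current_state: str
    evidence_refs: tuple
    attempted_actions: tuple
    options: tuple
    policy_version: str
    risk_of_delay: str
    respond_by: str  # ISO 8601 deadline


@dataclass(frozen=True)
class HumanDecision:
    case_id: str
    chosen_option: str
    decided_by: str


def resume_state(request: EscalationRequest, decision: HumanDecision,
                 current_state: str, next_state_by_option: dict) -> str:
    """Validate the human's answer as a workflow event before resuming."""
    if decision.case_id != request.case_id:
        raise ValueError("decision refers to a different case")
    if decision.chosen_option not in request.options:
        raise ValueError("decision is not one of the offered options")
    if current_state != request.current_state:
        raise ValueError("case moved on since the escalation was raised")
    return next_state_by_option[decision.chosen_option]
```

Offering a closed set of options is a design choice: it keeps the reviewer's answer machine-checkable, at the cost of needing an explicit "none of these" option for unusual cases.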

This pattern is especially important when an API is missing. A person may complete the task in an internal system, but the agent must not assume it happened. Require a structured confirmation and evidence before advancing the case. If that evidence never arrives, keep the case visibly pending or escalate it according to the procedure.

Because these workflows can affect money, account access, customer rights, and regulatory obligations, your AI design cannot substitute for review by qualified legal, compliance, risk, and operations owners. Let those owners approve the policies, controls, escalation criteria, and customer communications before live execution. Begin with read-only or reversible capabilities where possible, and do not grant autonomous financial actions until the failure and recovery paths have been tested.

Measure verified resolution and improve from failures

A conversational system can produce polished replies while leaving cases unfinished. That is why containment or deflection cannot be your sole success metric. The primary question is whether the case reached the correct terminal state with the required evidence, policy checks, and customer communication.

Build a metric hierarchy that separates outcomes from diagnostics:

- Case outcome: Track the share of eligible cases reaching a verified terminal state, along with cases reopened, transferred, or found incomplete during review.
- Customer experience: Track customer satisfaction and whether the customer must contact support again because ownership or status was unclear.
- Operational performance: Track time to resolution, first-contact resolution where that metric is genuinely applicable, deflection, escalation rate, waiting time by state, and human work by escalation reason.
- Risk performance: Track critical guardrail misses, false-positive reviews, unauthorized action attempts, procedure deviations, and cases advanced without required evidence.
- Agent-stage performance: Track routing accuracy, skill success, handoff completeness, tool failures, timer outcomes, and terminal-state correctness for each role.

Be careful with first-contact resolution in workflows that are supposed to run asynchronously. A fraud investigation may remain open after a perfectly handled first interaction. Optimizing the agent to close the contact can therefore conflict with the real outcome. Use time to verified resolution and unresolved-work visibility alongside conversation metrics.
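The gap between a conversation metric and the case outcome is easy to show with a few rows of data. In this sketch the record keys (`eligible`, `contact_closed`, `terminal_state`, `evidence_verified`, `reopened`) are hypothetical, and the four cases are made up.

```python
def contact_closed_rate(cases: list) -> float:
    """Share of eligible cases whose first conversation was closed."""
    eligible = [c for c in cases if c["eligible"]]
    if not eligible:
        return 0.0
    return sum(c["contact_closed"] for c in eligible) / len(eligible)


def verified_resolution_rate(cases: list) -> float:
    """Share of eligible cases resolved with verified evidence and not reopened."""
    eligible = [c for c in cases if c["eligible"]]
    if not eligible:
        return 0.0
    verified = [
        c for c in eligible
        if c["terminal_state"] == "resolved"
        and c["evidence_verified"]
        and not c["reopened"]
    ]
    return len(verified) / len(eligible)


cases = [
    {"eligible": True, "contact_closed": True, "terminal_state": "resolved",
     "evidence_verified": True, "reopened": False},
    {"eligible": True, "contact_closed": True, "terminal_state": "resolved",
     "evidence_verified": False, "reopened": False},
    {"eligible": True, "contact_closed": True, "terminal_state": "resolved",
     "evidence_verified": True, "reopened": True},
    {"eligible": True, "contact_closed": True, "terminal_state": "waiting_on_merchant",
     "evidence_verified": False, "reopened": False},
]
print(contact_closed_rate(cases), verified_resolution_rate(cases))
```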

Evaluation should inspect both language and execution. A useful case-level rubric asks whether the system understood the request, selected an allowed skill, used the correct procedure version, obtained required evidence, respected guardrails, preserved context at handoffs, communicated accurately, and entered the right terminal state.

An automated evaluation pipeline can flag cases for human review and turn reviewed failures into labeled data. Do not sample only obviously failed conversations. Include high-risk classifications, recently changed procedures, new skills, long-running cases, human escalations, unusual state transitions, tool errors, and a baseline sample of apparently successful cases. Otherwise your evaluation set will miss failures that look normal in aggregate metrics.
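One simple sampling policy that matches this advice is to review every case carrying a risk flag and add a random baseline of unflagged cases. The function name, the `flags` key, and the baseline rate below are assumptions for illustration; a real pipeline might weight strata differently.

```python
import random
from math import ceil


def build_review_sample(cases: list, baseline_rate: float, seed: int = 0) -> list:
    """All flagged cases plus a random share of apparently successful ones."""
    rng = random.Random(seed)  # seeded so the sample can be reproduced
    flagged = [c for c in cases if c["flags"]]
    normal = [c for c in cases if not c["flags"]]
    k = min(len(normal), ceil(baseline_rate * len(normal)))
    return flagged + rng.sample(normal, k)


cases = [{"case_id": f"case-{i}", "flags": []} for i in range(8)]
cases.append({"case_id": "case-8", "flags": ["human_escalation"]})
cases.append({"case_id": "case-9", "flags": ["procedure_changed", "tool_error"]})

sample = build_review_sample(cases, baseline_rate=0.25)
print([c["case_id"] for c in sample])
```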

Give every reviewed failure a place in a product backlog. The fix may belong to the procedure, state machine, skill contract, integration, guardrail, escalation path, or model behavior. "The agent made a mistake" is too broad to assign. A stable failure taxonomy tells you which layer should change and which regression tests must be added before release.

A sensible implementation sequence is:

1. Choose one bounded journey with a meaningful operational tail and a clearly accountable owner.
2. Map the full case, including hidden back-office steps, waiting states, approvals, exceptions, communications, and terminal conditions.
3. Define the case schema, events, state transitions, evidence requirements, and audit record.
4. Assign inbound, back-office, and outbound responsibilities only where permissions or completion rules differ.
5. Expose narrow modular skills and apply a deterministic allow-list in every state.
6. Add compliance classifiers, hard controls, and human decision gates before enabling consequential actions.
7. Run historical, synthetic, or controlled cases through the workflow and evaluate the complete case, not just the generated messages.
8. Release gradually, monitor state-level failures, and feed reviewed cases back into procedures, controls, and regression evaluations.

Key takeaways

- Scope the customer's complete case before choosing the number of agents.
- Separate agents at real permission, workflow, or completion boundaries.
- Let the model interpret language, but let explicit state and policy control execution.
- Treat human review as a structured workflow event with an owner and deadline.
- Define "done" with evidence; a finished chat is not a finished case.
- Optimize for verified resolution, policy adherence, and safe recovery rather than response quality alone.

At your next design review, put one real support case on the page and ask four questions: where can it wait, what event unblocks it, who approves a risky action, and what evidence proves completion? If your team cannot answer all four from the workflow, the system is not ready to act. Once those answers are explicit, agent boundaries become an engineering decision instead of a bet on autonomous behavior.

References

Shivam.Consulting Blog — Beyond the Support Iceberg: Gradient Labs' Multi-Agent Breakthrough That Actually Gets Work Done

December 18, 2025

AI Product Management Skills: A Practical 12-Month Roadmap
You may know how to prompt a model and still feel unprepared to own an AI product. That gap is real. Producing a plausible response is easy; deciding what should be built, how to evaluate it, when to trust it, and whether it improved the user journey requires a broader product skill set.

The useful roadmap is not a queue of courses or tools. It is a sequence of increasingly consequential work: understand model behavior, turn ideas into testable artifacts, ship a bounded workflow, and then build the operating system that lets more teams do it responsibly.

What you should be able to do after 12 months

An AI product manager does not need to become a machine-learning engineer. You do need enough technical judgment to frame a feasible problem, challenge an architecture, inspect failures, define an evaluation, and make a release decision with engineering and design.

The 12-month progression from foundations to governed scale works because each phase produces evidence needed by the next one. You learn model constraints before promising a user experience. You build evaluations before exposing the system to real customers. You prove one workflow before standardizing it across a product organization.

Key takeaways
- Months 1-3: Learn model behavior, context management, prompting, retrieval, privacy, and data governance. Apply them to product discovery.
- Months 4-6: Build prototypes and an evaluation system. Instrument activation and retention before treating the feature as ready.
- Months 7-9: Ship a bounded AI-enabled workflow with safeguards, monitoring, recovery paths, and clear human control.
- Months 10-12: Standardize evaluation gates, analytics, discovery practices, roadmapping, and outcome-based reporting.
Treat these as capability gates, not calendar milestones. If you cannot explain why a prototype failed in month six, more production infrastructure will not fix the problem. If you cannot show that users received value in month nine, scaling the feature will only distribute uncertainty.

By the end of the roadmap, your portfolio should contain operating artifacts rather than course certificates: an AI product brief, a prompt and retrieval pattern, a reusable evaluation set, an instrumented production workflow, a risk checklist, and a scale playbook. Those artifacts demonstrate that you can move from possibility to accountable product performance.

Months 1-3: Learn enough AI to make sound product decisions

Your first objective is not technical fluency for its own sake. It is learning where model behavior changes a familiar product decision. A deterministic feature is expected to return the same result for the same state. A generative feature can produce different, incomplete, or confidently incorrect outputs. That changes acceptance criteria, testing, interface design, and the meaning of “done.”

Build an operator’s mental model

Work through four capabilities in order:
1. Model behavior and constraints: Learn what the model receives, what it produces, where variability enters, and which failures matter to the user. You should be able to distinguish a capability problem from a context, instruction, or workflow problem.
2. Context window management: Decide which information belongs in the model’s working context, which information is stale, and which information should never be sent. More context is not automatically better context. Irrelevant material can obscure the evidence the task actually requires.
3. Prompting as product specification: Write reusable instructions that state the task, relevant context, constraints, required output, and quality criteria. Save the prompt with examples of both acceptable and unacceptable behavior. A prompt library is useful only when another person can reproduce and assess the result.
4. Retrieval-first design: For tasks that depend on changing or proprietary knowledge, learn the basic pipeline: retrieve relevant approved information, give that information to the model, generate an answer, and preserve enough traceability to investigate failures. This is a product choice as much as an architecture choice because it determines what the experience can reliably know.
Pair these capabilities with privacy-by-design and data governance from the beginning. Before using customer or company information, write down which data classes are permitted, who can access them, where they may be retained, and what must be removed or masked. If those answers are unclear, use synthetic or explicitly approved material until the policy is settled. Avoiding sensitive data at the prototype stage is safer than trying to remove it after it has spread through prompts, logs, and evaluation files.
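
A written data policy can also be enforced in code before anything reaches a prompt. The sketch below is illustrative only: the field names, the three actions (`allow`, `mask`, `drop`), and the `apply_policy` helper are assumptions made for this example, not part of any specific product or library.

```python
from typing import Dict

# Hypothetical policy: every field must be listed explicitly.
# Unlisted fields are dropped, so new data never leaks by default.
POLICY: Dict[str, str] = {
    "plan_tier": "allow",
    "ticket_text": "allow",
    "email": "mask",
    "card_number": "drop",
}


def apply_policy(record: Dict[str, str], policy: Dict[str, str]) -> Dict[str, str]:
    """Return only the fields the policy permits, masking where required."""
    safe: Dict[str, str] = {}
    for field_name, value in record.items():
        action = policy.get(field_name, "drop")  # default-deny for unknown fields
        if action == "allow":
            safe[field_name] = value
        elif action == "mask":
            safe[field_name] = "[MASKED]"
        # "drop" and any unrecognized action: omit the field entirely
    return safe


record = {
    "plan_tier": "pro",
    "ticket_text": "Export fails on large reports",
    "email": "pat@example.com",
    "card_number": "4111111111111111",
    "internal_note": "VIP account",
}
prompt_fields = apply_policy(record, POLICY)
```

The default-deny choice matters more than the masking syntax: a field nobody classified never reaches the model, the logs, or the evaluation files.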

Apply the foundations to product discovery

Discovery gives you a low-risk place to practice. Use generative AI to summarize research, cluster feedback, compare recurring needs, or sharpen a value proposition. Keep the model in an assistive role: every synthesized theme should remain traceable to the underlying customer evidence. If you cannot inspect the feedback behind a cluster, you cannot tell whether the model found a pattern or flattened important differences.

Create an AI product brief for one candidate problem. Include:
- The user and the job they are trying to complete.
- The decision or work the model will assist with.
- The inputs the system may use and the inputs it must reject.
- The expected output and the conditions that make it useful.
- The consequence of a wrong, missing, or delayed output.
- The point at which a person reviews, edits, approves, or overrides the result.
- The product signal that would show improved user behavior.
You are ready for the next phase when you can explain the proposed experience without hiding behind model vocabulary. You should be able to identify the necessary context, name the important failure modes, explain whether retrieval is needed, and show how the user remains in control.

Months 4-6: Prototype the experience and build its evaluation system

A prototype is valuable when it tests uncertainty, not when it merely looks polished. Use generative AI to accelerate UX mocks, PRDs, in-app guidance, and alternative interaction flows, but spend the saved time on the questions that determine whether the product deserves to ship.

Prototype the entire decision loop. Show where the user supplies context, how the result is presented, what happens when the answer is weak, how the user corrects it, and whether that correction improves the next step. The error state is part of the primary AI experience; hiding it until engineering integration creates false confidence.

Use evaluation as a development method

Eval-driven development turns a vague judgment such as “the answers seem good” into a repeatable product decision. Build the evaluation alongside the prototype:
1. Define the task boundary. State what the system is expected to do and what remains outside its responsibility.
2. Collect representative cases. Include normal inputs, ambiguous inputs, missing information, adversarial behavior, and cases where the correct response is to stop or ask for clarification.
3. Write a scoring rubric. Assess the properties the user actually needs, such as correctness, relevance, completeness, appropriate tone, traceability, or compliance with a constraint.
4. Record a baseline. Compare the proposed experience with the current workflow or a simpler non-AI alternative. A model output is not valuable merely because it exists.
5. Inspect failure patterns. Separate prompt failures, missing-context failures, retrieval failures, model limitations, interface confusion, and policy violations. Each category points to a different remedy.
6. Set a release gate. Decide which failures block launch, which require human review, and which are tolerable in the intended use case. The gate should reflect the consequence of error, not enthusiasm for the feature.
Keep the evaluation set versioned with the product. When you change the prompt, model, retrieval logic, or available tools, rerun the same cases. Otherwise, an apparent improvement in one example can conceal regressions elsewhere.
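
The six steps above can be sketched as a very small harness. Everything here is an assumption for illustration: the `EvalCase` fields, the phrase-matching rubric (a real rubric would score several properties), and `toy_system`, which stands in for the prototype under test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    case_id: str
    user_input: str
    must_contain: str   # simplest possible rubric: a required phrase
    blocking: bool      # True if a failure here must block release


def run_eval(system: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, object]:
    """Run every case and report the pass rate plus any blocking failures."""
    failures = []
    for case in cases:
        output = system(case.user_input)
        if case.must_contain.lower() not in output.lower():
            failures.append(case)
    blocking = [c.case_id for c in failures if c.blocking]
    return {
        "pass_rate": (len(cases) - len(failures)) / len(cases),
        "blocking_failures": blocking,
        "release_ok": not blocking,  # the release gate
    }


def toy_system(user_input: str) -> str:
    """Stand-in for the prototype; replace with a call to the real workflow."""
    if "refund" in user_input.lower():
        return "Refunds are available within 30 days."
    return "I need more detail before I can answer."


CASES = [
    EvalCase("normal-refund", "How do I get a refund?", "30 days", blocking=True),
    EvalCase("ambiguous", "It is broken", "more detail", blocking=True),
    EvalCase("tone", "Thanks!", "welcome", blocking=False),
]
report = run_eval(toy_system, CASES)
```

Rerunning `run_eval` with the same `CASES` after every prompt, model, or retrieval change is what turns the set into a regression check.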

Instrument behavior before launch

Quality evaluation and product analytics answer different questions. An evaluation tells you whether the system behaved acceptably on known cases. Behavioral analytics tells you whether customers reached value in the product.

Define the journey in Amplitude or your existing analytics system before exposing the prototype broadly. Capture the moment a user encounters the feature, supplies enough information, receives an output, accepts or edits it, completes the downstream task, returns to use it again, abandons it, or escalates to a person. That sequence gives you activation and retention signals rather than a vanity count of generations.
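
As a rough illustration of why the sequence matters more than a generation count, the sketch below turns raw events into funnel counts. The event names and the `funnel_counts` helper are hypothetical, and the logic ignores timestamps; an analytics platform would normally compute ordered funnels for you.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical event names, listed in funnel order.
FUNNEL = ["feature_viewed", "output_received", "output_accepted", "task_completed"]


def funnel_counts(events: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count users who reached each step and every step before it."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for user_id, event_name in events:
        seen[user_id].add(event_name)
    counts = {}
    for i, step in enumerate(FUNNEL):
        required = set(FUNNEL[: i + 1])
        # A user counts for this step only if all earlier steps were also seen.
        counts[step] = sum(1 for names in seen.values() if required <= names)
    return counts


events = [
    ("u1", "feature_viewed"), ("u1", "output_received"),
    ("u1", "output_accepted"), ("u1", "task_completed"),
    ("u2", "feature_viewed"), ("u2", "output_received"),
    ("u3", "feature_viewed"),
]
counts = funnel_counts(events)
```

In this toy data, two users received an output but only one completed the downstream task, which is the gap a vanity count would hide.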

If you run an A/B test, choose the minimum detectable effect before launch. The decision matters because an experiment that cannot detect a product-relevant change may produce an inconclusive result even when the dashboard looks busy. Define the primary outcome, guardrail metrics, exposure rule, and analysis plan before looking at the results.
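
To see how the minimum detectable effect drives the size of an experiment, here is a sketch using the standard normal approximation for comparing two proportions. The baseline rate, lifts, significance level, and power are made-up inputs, and `sample_size_per_arm` is a helper written for this example; treat the result as a planning estimate, not a substitute for your experimentation platform's calculator.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Assumed 20% baseline task-completion rate.
small_lift = sample_size_per_arm(baseline=0.20, mde=0.02)
large_lift = sample_size_per_arm(baseline=0.20, mde=0.05)
```

The smaller the change you need to detect, the more users each arm requires, so choosing the effect size is really a decision about how long the experiment must run.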

Move forward when the prototype solves a defined task, the evaluation catches meaningful failures, the events expose the user journey, and the experiment can answer a decision. A persuasive demo without those four elements is still a demo.

Months 7-9: Ship a bounded workflow, not an open-ended assistant

The production phase is where product judgment becomes visible. Start with a workflow that has a recognizable beginning, end, and owner. Customer-support, CRM, and guided-onboarding workflows are useful patterns because the AI can sit inside an existing user journey rather than asking customers to invent a use case from a blank chat box.

Screen the workflow before committing engineering capacity:
- Is the user’s job clear enough to define a successful completion?
- Does the system have access to approved, relevant context?
- Can you observe whether the user accepted, corrected, ignored, or escalated the output?
- What happens to the customer if the system is wrong?
- Can a consequential action be paused, reviewed, or reversed?
- Is a generative system materially better than a rule, search result, template, or conventional workflow?
Use agentic AI only when the job genuinely requires several connected steps, tool use, or changing plans. Additional autonomy also creates more places for permissions, context, and actions to go wrong. Begin with the narrowest useful boundary, then expand it when production evidence supports the change.

Map the production loop before building it

A product trio should be able to trace the complete workflow on one page:
1. Trigger: What user action or system event begins the workflow?
2. Context: Which profile, conversation, account, or knowledge records are retrieved?
3. Generation or decision: What does the model produce, classify, recommend, or plan?
4. Tool action: Which systems can it read from or write to, and under whose authority?
5. Human checkpoint: Which output can be edited, rejected, or approved before it changes customer data or sends an external message?
6. Recovery: How does the product handle low confidence, missing data, tool failure, timeouts, or a user correction?
7. Learning signal: Which feedback updates the evaluation set, product decision, or workflow design?
Place safeguards at the point of consequence. Restrict the data and tools the workflow can access. Require explicit approval before a high-impact external action. Preserve a record of the inputs, retrieved context, output, action, and user response so a failure can be investigated. If an action cannot be safely reversed, keep it behind human review until the risk has been addressed.
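
One way to express the human checkpoint is a gate that every proposed action must pass. This is a minimal sketch under stated assumptions: the action names, the `HIGH_IMPACT` set, and the in-memory `AUDIT_LOG` are hypothetical, and a production system would persist the record and verify the approver's identity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical actions that change customer data or leave the company.
HIGH_IMPACT = {"send_external_email", "update_account_record"}


@dataclass
class ProposedAction:
    name: str
    payload: str
    reversible: bool


@dataclass
class Decision:
    status: str   # "executed" or "pending_review"
    reason: str


AUDIT_LOG: List[dict] = []


def gate(action: ProposedAction, approved_by: Optional[str] = None) -> Decision:
    """Execute low-risk actions; hold risky ones until a person approves."""
    needs_review = action.name in HIGH_IMPACT or not action.reversible
    if needs_review and approved_by is None:
        decision = Decision("pending_review", "high-impact or irreversible")
    else:
        decision = Decision("executed", f"approved_by={approved_by or 'auto'}")
    # Record every decision so a failure can be investigated later.
    AUDIT_LOG.append({"action": action.name, "status": decision.status})
    return decision


draft = gate(ProposedAction("draft_reply", "Hi ...", reversible=True))
held = gate(ProposedAction("send_external_email", "Hi ...", reversible=False))
sent = gate(ProposedAction("send_external_email", "Hi ...", reversible=False),
            approved_by="agent_42")
```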

Threat detection and response also need a product playbook. Define what counts as suspicious input or abnormal behavior, who receives the alert, how the workflow is disabled or contained, what evidence is retained, and how affected users are handled. The escalation path should exist before the first serious incident, not be improvised during it.

Monitor the experience at four levels:
- User outcome: Did the customer complete the intended job with less effort or fewer avoidable handoffs?
- AI quality: Are the evaluation scores and failure categories changing after releases?
- Workflow health: Are retrieval, model, and tool steps completing as expected, and can the team locate the failing stage?
- Risk: Are users overriding outputs, escalating cases, encountering policy violations, or triggering suspicious behavior?
Track deployment frequency because a team that can release safely can also learn faster. Do not confuse release frequency with customer value, though. The useful loop connects a deployment to a quality change, a behavior change, and a decision about what to do next.

Months 10-12: Turn one successful product into a repeatable system

Scaling is not copying the same AI feature into every surface. It is making the successful practices reusable while preserving room for different user risks and workflow requirements.

Codify the operating assets that reduced uncertainty during the earlier phases:
- An intake template that starts with the user problem, current workflow, expected outcome, and consequence of error.
- A continuous-discovery practice that keeps generated themes connected to original customer evidence.
- A retrieval-first architecture template for products that depend on approved or changing knowledge.
- A shared prompt library with owners, versions, expected behavior, and known limitations.
- An evaluation gate covering representative cases, blocking failures, human-review requirements, and regression checks.
- A production checklist covering permissions, privacy, observability, recovery, threat response, and user control.
- A monitoring cadence that connects product behavior, AI quality, workflow health, and risk.
Do not impose one universal quality threshold on every AI feature. A low-consequence drafting aid and a workflow that changes a customer account do not carry the same downside. Use the same evaluation process across teams, but set release gates according to the task, affected user, reversibility, and consequence of failure.

Use common analytics without erasing product context

A unified analytics model lets leadership compare lift across products without forcing every team to use an identical funnel. Standardize the basic meanings of exposure, meaningful use, successful task completion, correction, abandonment, escalation, and return usage. Then let each product define the events that represent those states in its own journey.

This is also where roadmapping and sprint planning should move from output commitments to outcome-based decisions. “Ship an AI assistant” is an output. A useful objective describes the customer behavior or business result that should change. The roadmap can then contain competing ways to produce that change, including improvements that do not require AI.

Use a consistent stakeholder narrative:
- What shipped: The workflow or capability placed in users’ hands.
- What moved: The user, product, quality, and risk signals that changed.
- What was learned: The assumptions confirmed, rejected, or still unresolved.
- What happens next: The decision to expand, revise, contain, or stop the work.
That structure prevents activity from masquerading as progress. It also gives executives a clear basis for funding decisions: evidence of value, evidence of control, and a specific next bet.

Start this week with one recurring user decision. Write its AI product brief, run the workflow manually with permitted data, and save the successful and failed cases as the beginning of an evaluation set. If you cannot define a good result or the consequence of a bad one, stay in discovery. If you can, you have a concrete first artifact and a reason to proceed to a prototype.

December 17, 2025

Context-Driven AI Product Engineering That Survives Production

Your AI feature can look excellent in a demo and still fail in production. The prompt has not changed, but the user, account, permissions, available data, and business decision have. A fluent answer built on the wrong context is still the wrong answer.

If your team keeps rewriting instructions to fix inconsistent results, inspect what the model can see, why it can see it, and what it is expected to do with that information. Context-driven AI product engineering turns those decisions into a versioned, measurable product system rather than hiding them inside one large prompt.

Determine whether context is actually the bottleneck

Runtime context is the complete package available to the model for a specific task. It includes instructions, retrieved evidence, permissions, conversation state, memory, tool definitions, metric definitions, output requirements, and stop conditions. Prompt text is only one part of that package.

This distinction matters because different failure classes require different fixes. A prompt change cannot retrieve a missing CRM record. A larger model cannot make a stale policy current. Better prose cannot repair an authorization error. Start by assigning every bad result to the layer that produced it.

- Evidence is missing: the necessary record, document, event, or metric never reached the system.
- Evidence was available but not selected: retrieval, filtering, metadata, or ranking favored the wrong material.
- Evidence is stale or contradictory: the system lacks a freshness rule or conflict-resolution policy.
- The procedure is incomplete: the model has facts but not the sequence, metric definition, or decision rule needed to use them.
- The scope is unsafe: the context contains data the current user, role, tenant, or workflow should not access.
- The answer contract is unclear: the model does not know when to cite evidence, expose uncertainty, request missing input, call a tool, or abstain.
- The answer is technically correct but operationally unhelpful: it does not fit the user’s role, decision, timing, or next action.

For one failed session, reconstruct the full path instead of reading only the final answer:

1. Capture the user’s request, detected intent, role, tenant, and relevant permissions.
2. Record the retrieval queries, filters, candidate results, metadata, and ranking scores.
3. Show which candidates entered the context, which were excluded, and why.
4. Inspect the assembled instructions, evidence, memory, tool contracts, and output schema.
5. Record every tool call, returned result, retry, timeout, and policy decision.
6. Compare the answer with the evidence that was actually available at generation time.

The resulting trace gives you a practical decision tree. If the correct evidence was absent from the candidate set, fix ingestion or retrieval. If it was retrieved but excluded, fix ranking or context packing. If it entered the prompt but the answer contradicted it, test instruction hierarchy, conflict handling, or model behavior. If the evidence and answer were both correct but the user still could not act, fix the product experience.
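
That decision tree is simple enough to write down directly. The sketch below assumes a reviewer has already filled in four yes-or-no observations from the reconstructed trace; the `Trace` fields and the returned labels are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    evidence_in_candidates: bool   # required evidence was retrieved at all
    evidence_in_context: bool      # it survived ranking and packing
    answer_matches_evidence: bool  # the answer agrees with that evidence
    user_could_act: bool           # the user completed the next step


def failed_layer(trace: Trace) -> str:
    """Map one reconstructed trace to the layer that should be fixed first."""
    if not trace.evidence_in_candidates:
        return "ingestion_or_retrieval"
    if not trace.evidence_in_context:
        return "ranking_or_context_packing"
    if not trace.answer_matches_evidence:
        return "instructions_conflicts_or_model"
    if not trace.user_could_act:
        return "product_experience"
    return "no_failure_found"


# Evidence was retrieved but never made it into the prompt.
example = failed_layer(Trace(True, False, False, False))
```

The checks run in pipeline order on purpose: an upstream failure makes every downstream observation unreliable, so the earliest failing stage is reported.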

This is why a retrieval-first, context-aware design usually creates more leverage than another round of isolated prompt editing: it makes the evidence path visible and gives each failure an identifiable owner.

Write a context contract before choosing the architecture

A context contract defines what the AI needs for one product task, where that context may come from, how it must be constrained, and what the system should do when the contract cannot be satisfied. It is the interface between product intent and runtime engineering.

Consider an account-risk assistant used by a customer success manager. Its contract could look like this:

Contract field	Decision to make	Example implementation
Task boundary	What may the AI decide or produce?	Summarize risk signals and propose a next step; do not change the account record.
Authorized evidence	Which information is both relevant and permitted?	CRM fields, recent support history, approved playbooks, and defined product-usage metrics visible to the current user.
Identity and scope	Which user, tenant, account, and role govern access?	Resolve all four before retrieval and preserve them through every tool call.
Freshness	How current must each evidence type be?	Carry the captured-at timestamp and qualify the answer when a required record exceeds the product’s approved freshness window.
Conflict rule	What happens when trusted inputs disagree?	Expose the conflict and its timestamps instead of silently choosing one value.
Procedure	Which reasoning process should the workflow execute?	Identify the account, retrieve authorized signals, apply metric definitions, compare evidence, state caveats, and propose an action.
Output contract	What structure must the response follow?	Answer, supporting evidence, caveats, recommended action, and provenance.
Abstention rule	When should the system decline to conclude?	Report missing evidence when a required record, metric definition, or permission check is unavailable.
Audit payload	What must be reproducible later?	Context-contract version, evidence identifiers, timestamps, policy version, tool results, and model configuration.

The contract should keep five kinds of context distinct. Task context says what the user is trying to accomplish. Evidence context contains facts relevant to that task. Policy context defines permissions, governance, and prohibited behavior. Interaction context carries the useful parts of the current conversation and approved long-term memory. Execution context defines tools, schemas, retries, and stop conditions.

Keeping those layers separate prevents a common production mistake: treating all text as equally authoritative. A user’s request should not override a permission rule. A retrieved comment should not outrank an approved policy. An old conversation should not silently redefine a current metric. Your assembly logic needs an explicit precedence order for these collisions.
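
An explicit precedence order can be as small as an ordered list consulted during assembly. The order used here (policy, then evidence, then task, then interaction) is an assumption chosen to match the three collisions described above; your product may need a different one, and the keys and values are invented for the example.

```python
from typing import Dict, Optional, Tuple

# Assumed precedence, highest authority first. Your product may differ.
PRECEDENCE = ["policy", "evidence", "task", "interaction"]


def resolve(key: str, layers: Dict[str, Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (value, winning_layer) for `key`, or None if no layer sets it."""
    for layer in PRECEDENCE:
        values = layers.get(layer, {})
        if key in values:
            return values[key], layer
    return None


layers = {
    "policy": {"export_allowed": "no"},
    "evidence": {"active_user_definition": "logged in within 7 days"},
    "task": {"export_allowed": "yes"},  # the user asked for an export
    "interaction": {"active_user_definition": "logged in within 30 days"},
}
export_rule = resolve("export_allowed", layers)
metric_rule = resolve("active_user_definition", layers)
```

Returning the winning layer alongside the value keeps the collision visible in the trace instead of resolving it silently.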

Personalization belongs in the contract too. Intent and role should narrow context, not merely add more of it. A finance user may need policy-safe excerpts and transaction evidence. A customer success user may need current account activity and support history. A product manager may need metric definitions, cohorts, experiment state, and caveats. Role-aware assembly and scoped memory make the same underlying capability useful without exposing every available field to every request.

You know the contract is testable when each field can become a pass-or-fail assertion. Did the workflow apply the current permission scope? Did it include the required metric definition? Did it expose a conflict? Did it abstain when decisive evidence was unavailable? If a requirement cannot be tested or observed, it is still an aspiration rather than an engineering contract.
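
Here is one way those pass-or-fail questions could look as code. The `Packet` and `EvidenceItem` shapes, the violation messages, and the 30-day freshness window are all assumptions for illustration; the point is only that each contract field becomes a check that returns a concrete result.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class EvidenceItem:
    evidence_id: str
    tenant: str
    captured_at: datetime


@dataclass
class Packet:
    tenant: str
    evidence: List[EvidenceItem]
    metric_definitions: List[str] = field(default_factory=list)


def check_contract(packet: Packet, required_metric: str,
                   max_age: timedelta, now: datetime) -> List[str]:
    """Return a list of contract violations; an empty list means pass."""
    violations = []
    if any(item.tenant != packet.tenant for item in packet.evidence):
        violations.append("scope: evidence from another tenant")
    if required_metric not in packet.metric_definitions:
        violations.append("procedure: required metric definition missing")
    if any(now - item.captured_at > max_age for item in packet.evidence):
        violations.append("freshness: evidence older than allowed window")
    if not packet.evidence:
        violations.append("abstention: no evidence, system must not conclude")
    return violations


now = datetime(2025, 12, 1, tzinfo=timezone.utc)
packet = Packet(
    tenant="acme",
    evidence=[
        EvidenceItem("crm-1", "acme", now - timedelta(days=2)),
        EvidenceItem("ticket-9", "acme", now - timedelta(days=45)),
    ],
    metric_definitions=["weekly_active_users"],
)
violations = check_contract(packet, "weekly_active_users", timedelta(days=30), now)
```

In this example the packet passes the scope and definition checks but fails freshness, so the workflow should qualify its answer rather than present the 45-day-old ticket as current.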

Build context assembly as a controlled pipeline

The production unit is not a prompt template. It is the pipeline that converts a user request into a bounded evidence packet and an executable task. That pipeline should have explicit stages:

1. Authorize the request. Resolve identity, role, tenant, account scope, and permitted operations before searching for evidence. Apply access controls again before generation as a second check.
2. Normalize the inputs. Give each record or chunk a stable identifier plus source type, owner, tenant, timestamp, policy classification, schema version, and other metadata needed for filtering.
3. Generate retrieval candidates. Combine semantic retrieval for conceptually related language with keyword retrieval for exact identifiers, product names, codes, and policy terms.
4. Filter and rank for the task. Use intent, role, account, freshness, authority, and source-level confidence in addition to semantic similarity.
5. Resolve stale and conflicting evidence. Apply the contract’s freshness and precedence rules before the model sees the packet. Preserve unresolved conflicts as explicit context.
6. Pack the context window. Allocate space by priority, remove duplicates, keep decisive passages intact, and exclude material that does not change the task.
7. Execute through a defined interface. Supply tool schemas, metric definitions, procedure steps, output fields, citation requirements, and abstention conditions.
8. Attach provenance and emit a trace. Store identifiers and versions needed to reproduce the decision without indiscriminately copying sensitive raw content into logs.

Hybrid retrieval is useful because semantic and lexical search solve different problems. Semantic search can find a relevant concept expressed in different words. Keyword search protects exact matches such as an account identifier, event name, plan code, or policy term. Metadata then makes the results usable: a highly similar passage from the wrong tenant or an obsolete policy is not a valid result.
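
The text above does not prescribe how to merge the two result lists. One common option is reciprocal rank fusion, sketched here as an assumption rather than as the recommended method; the document IDs are invented, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; a higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, then by ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["doc_b", "doc_a", "doc_c"]   # conceptually similar passages
keyword = ["doc_a", "doc_d"]             # exact match on an identifier
fused = reciprocal_rank_fusion([semantic, keyword])
```

`doc_a` rises to the top because both retrievers found it, while the exact-match-only `doc_d` still outranks a weaker semantic hit. Metadata filtering for tenant, freshness, and authority would still be applied to the fused list.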

Authorization must shape retrieval itself. Do not search a global corpus, rank everything, and rely on a final prompt instruction to hide unauthorized results. That approach can expose sensitive material to intermediate services, caches, traces, or debugging tools even if it never appears in the final answer. Filter at the retrieval boundary, preserve tenant and role scope through tool calls, and validate the assembled packet before generation.
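
A minimal sketch of filtering at the retrieval boundary follows. The `Chunk` shape, tenant and role names, and the toy keyword `search` are hypothetical; in a real system the same scope would usually be pushed down into the search index query rather than applied in application memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant: str
    allowed_roles: frozenset
    text: str


def authorized_corpus(corpus: List[Chunk], tenant: str, role: str) -> List[Chunk]:
    """Apply tenant and role scope BEFORE any search or ranking runs."""
    return [c for c in corpus if c.tenant == tenant and role in c.allowed_roles]


def search(corpus: List[Chunk], query: str) -> List[Chunk]:
    """Toy keyword search; it only ever sees pre-filtered chunks."""
    terms = query.lower().split()
    return [c for c in corpus if any(t in c.text.lower() for t in terms)]


CORPUS = [
    Chunk("a1", "acme", frozenset({"csm", "admin"}), "Renewal risk playbook"),
    Chunk("a2", "acme", frozenset({"admin"}), "Renewal pricing exceptions"),
    Chunk("g1", "globex", frozenset({"csm"}), "Renewal contacts"),
]
results = search(authorized_corpus(CORPUS, tenant="acme", role="csm"), "renewal")
result_ids = [c.chunk_id for c in results]
```

All three chunks match the query, but only one is ever visible to the search step, so nothing unauthorized can reach a cache, trace, or ranking log.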

Context-window management is also a relevance problem, not just a token-count problem. Reserve capacity in a deliberate order: non-negotiable policy and permissions, the current task, decisive evidence, required procedure and definitions, recent interaction state, then supplemental material. When the packet is too large, compress or drop lower-priority evidence rather than truncating whichever section happens to come last.
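
The reservation order can be sketched as a greedy packer that keeps whole sections by priority. The section kinds mirror the order in the paragraph above; the word-count token estimate, the budget, and the sample text are assumptions, and a real implementation would use the model's tokenizer and might compress a section instead of dropping it.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Priority order from the paragraph above; a lower number packs first.
PRIORITY = {"policy": 0, "task": 1, "evidence": 2,
            "procedure": 3, "interaction": 4, "supplemental": 5}


@dataclass
class Section:
    kind: str
    text: str

    @property
    def tokens(self) -> int:
        # Crude stand-in: real systems should use the model's tokenizer.
        return len(self.text.split())


def pack(sections: List[Section], budget: int) -> Tuple[List[Section], List[Section]]:
    """Keep whole sections in priority order; drop what does not fit."""
    kept, dropped, used = [], [], 0
    for section in sorted(sections, key=lambda s: PRIORITY[s.kind]):
        if used + section.tokens <= budget:
            kept.append(section)
            used += section.tokens
        else:
            dropped.append(section)
    return kept, dropped


sections = [
    Section("supplemental", "background " * 6),
    Section("evidence", "usage fell 40 percent in November"),
    Section("policy", "never reveal other tenants data"),
    Section("task", "summarize churn risk"),
]
kept, dropped = pack(sections, budget=15)
kept_kinds = [s.kind for s in kept]
dropped_kinds = [s.kind for s in dropped]
```

The supplemental section arrived first in the input but is the one dropped, which is the opposite of what naive truncation from the end would do with a different ordering.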

Memory needs its own product rules. Short-term conversation state should retain unresolved references, user corrections, and active task decisions. Long-term memory should be scoped to durable facts that the product is allowed to retain. Define how memory is written, validated, refreshed, read, and deleted. Dumping a full transcript into every turn increases noise and can revive facts or instructions that no longer apply.

For analytical products, context must include a procedure as well as data. A reliable workflow starts with the decision to be made, anchors it to metric definitions and guardrails, retrieves trusted data, generates testable hypotheses, segments the evidence, and returns options with trade-offs and caveats. That structured analyst loop is far easier to evaluate than a broad instruction to analyze the data.

The same restraint applies to agents. Use multiple steps or tools when decomposition makes the task clearer, safer, or more verifiable. Each step needs an input schema, permitted tools, completion condition, failure path, and evidence handoff. Agentic patterns are most useful when task decomposition reduces real complexity; extra autonomy without a clearer control boundary simply creates more places for context to drift.

Ship with layered evaluations, observability, and ownership

Evaluate the evidence path before scoring the prose

A single answer-quality score hides the layer that failed. Build an evaluation stack that follows the same stages as the runtime pipeline:

- Retrieval evaluation: Was the required evidence present in the candidate set, and where did it rank?
- Assembly evaluation: Did the final packet include required facts and policies, exclude unauthorized or irrelevant material, preserve provenance, and respect freshness rules?
- Behavior evaluation: Did the model follow the procedure, use the supplied evidence, handle conflicts, cite support, and abstain when required?
- Answer evaluation: Was the result correct, grounded, complete enough for the task, and structured as promised?
- Product evaluation: Did the user complete the task, reach an answer faster, correct the output, return to the capability, or escalate to a human?
- Operational evaluation: Did latency, context size, cost, tool failures, permission denials, and fallback behavior stay within the product’s approved limits?

Your offline evaluation set should represent the failure surface, not just normal requests. Include different roles and intents, sparse accounts, stale records, contradictory inputs, missing definitions, empty retrieval, tool failures, unauthorized requests, and cases where abstention is the correct result. Label the evidence that should be retrieved as well as the answer that should be produced. Otherwise, a system can pass by reaching the right conclusion through the wrong material.
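
Labeling the evidence makes the retrieval layer measurable on its own. The sketch below computes two simple measures for one labeled case; the evidence IDs are invented, and treating an empty `required` list as a perfect score (for abstention cases) is a design assumption you may prefer to handle differently.

```python
from typing import List, Optional


def first_required_rank(retrieved: List[str], required: List[str]) -> Optional[int]:
    """1-based rank of the first required evidence ID, or None if absent."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in required:
            return rank
    return None


def recall_at_k(retrieved: List[str], required: List[str], k: int) -> float:
    """Share of required evidence IDs found in the top k results."""
    if not required:
        return 1.0  # nothing was required, e.g. an abstention case
    top_k = set(retrieved[:k])
    return sum(1 for doc_id in required if doc_id in top_k) / len(required)


retrieved = ["kb-7", "crm-2", "kb-1", "ticket-5"]
required = ["crm-2", "ticket-5"]
recall = recall_at_k(retrieved, required, k=3)
rank = first_required_rank(retrieved, required)
```

Here only half of the required evidence lands in the top three results, so an answer that still looks correct should be treated with suspicion rather than counted as a pass.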

Version the evaluation cases, context contract, retrieval configuration, policy set, prompt, tools, and model independently. Change one major layer at a time when possible. If a model upgrade, ranking change, and prompt rewrite ship together, an improved aggregate score will not tell you what worked or which change caused a regression in a sensitive slice.

After offline acceptance, use staged online experiments with a predeclared outcome, guardrails, acceptance threshold, and minimum detectable effect. Task success, groundedness, time to first answer, adoption, and deflection can all be useful, but only when they match the workflow. A support assistant should not optimize deflection by confidently blocking necessary escalation. An analytical assistant should not optimize speed by dropping caveats required for a sound decision.

Instrument enough to reproduce failure without creating a new data risk

For each request, emit a structured event envelope containing the workflow and context-contract versions, detected intent, authorized scope, retrieval-query identifier, evidence identifiers, ranking metadata, freshness state, tool outcomes, policy decisions, answer status, latency, and user feedback. This gives product and engineering a common record for diagnosing failure.

Do not default to logging every raw prompt, retrieved document, or tool response. Production context can contain customer data, confidential policy, or personal information. Prefer stable identifiers, approved redaction, access-controlled traces, and retention rules. Keep the minimum raw material needed for authorized debugging and evaluation, and make data ownership explicit.
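
A trace event built on identifiers rather than raw content might look like the sketch below. The field names and version strings are hypothetical, and storing a SHA-256 fingerprint of the prompt is one possible choice, not a requirement from the text; note that a hash is not anonymization by itself when the input is short or guessable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TraceEvent:
    workflow_version: str
    contract_version: str
    intent: str
    tenant: str
    role: str
    evidence_ids: List[str]
    answer_status: str      # e.g. "answered", "abstained", "escalated"
    latency_ms: int
    prompt_sha256: str      # fingerprint only; the raw prompt is not logged


def fingerprint(text: str) -> str:
    """Hash raw content so traces can be matched without storing the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw_prompt = "Summarize risk for account 123 using the attached evidence."
event = TraceEvent(
    workflow_version="wf-3", contract_version="cc-7", intent="account_risk",
    tenant="acme", role="csm", evidence_ids=["crm-2", "ticket-5"],
    answer_status="answered", latency_ms=840,
    prompt_sha256=fingerprint(raw_prompt),
)
line = json.dumps(asdict(event), sort_keys=True)
```

With the evidence identifiers and versions in the event, an authorized engineer can rebuild the packet from the source systems under their own access controls instead of reading it out of a log.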

Roll out in stages: run the new pipeline against offline cases, observe it without user impact where possible, expose it to a constrained cohort, compare it with the existing experience, and expand only after both quality and operational guardrails hold. Preserve a feature flag, a known-safe fallback, and a rollback path for context changes as well as model changes.

Give every context surface an owner

Context crosses organizational boundaries, so shared responsibility without named ownership turns into drift. Assign decisions explicitly:

- Product owns the task boundary, target user, intended decision, outcome metric, failure taxonomy, and acceptance trade-offs.
- Design owns how evidence, uncertainty, correction, abstention, and human handoff appear in the experience.
- AI and platform engineering own retrieval, ranking, assembly, tool interfaces, reproducibility, evaluation infrastructure, and fallbacks.
- Data owners own schemas, metric definitions, lineage, freshness, and the authoritative status of each collection.
- Security, privacy, and governance owners define permitted use, redaction, retention, and audit requirements.
- SRE owns service-level monitoring, failure alerts, capacity behavior, deployment safety, and rollback readiness.

A Staff AI Engineer can connect these concerns by turning research choices into repeatable workflows and shared evaluation infrastructure, but that role should not become the sole owner of product judgment, source governance, or production reliability. Cross-functional execution works when each decision has one accountable owner and the whole group uses the same context trace and evaluation results.

Treat context changes like code changes. A release should identify the changed source, schema, ranking rule, contract, or policy; show the affected evaluation slices; state the expected product outcome; and preserve a rollback path. CI/CD guardrails, drift monitoring, and human review turn context from an informal prompt dependency into an operable platform capability.

Key takeaways

- Diagnose the failed layer before editing the prompt. Missing evidence, bad ranking, stale data, unsafe scope, incomplete procedure, and weak UX are different problems.
- Define a context contract for each workflow: task boundary, authorized evidence, freshness, precedence, procedure, output, abstention, and audit payload.
- Authorize before retrieval, rank with task and metadata signals, and validate the assembled packet before generation.
- Manage the context window by authority and decision value, not by filling every available token.
- Evaluate retrieval, assembly, model behavior, answer quality, user outcomes, and operational performance separately.
- Version context components independently, release them through staged controls, and assign an accountable owner to every surface.

At your next AI product review, do not approve the experience from the final answer alone. Ask to see the evidence packet, permission scope, context-contract version, failed evaluation slices, runtime trace, and rollback path. Those artifacts reveal whether the feature is dependable or merely persuasive.

Start with one production workflow whose failures matter to users. Trace its most common failure, write the contract, repair the responsible layer, and require the change to pass both offline evaluation and a guarded rollout. Once that loop works, you have the foundation for a reusable context platform rather than another prompt that only works in the demo.

December 16, 2025

2026 Support Capacity Playbook: Bold AI Automation, Smarter Staffing, Zero‑Surprise SLAs

Capacity planning has always been a high-stakes exercise in customer service, and when you miss, the signal shows up fast in backlogs and SLAs. I’ve lived that pressure across multiple cycles, and 2026 will reward teams that plan differently. AI fundamentally changes capacity planning because it changes the work. It resolves the bulk of your volume, speeds up execution, and elevates the complexity and value of what humans handle. The consequence is simple: planning models must evolve.

This is the final installment in my 2026 customer service planning series, and I’m focusing on the tension every leader feels right now: be ambitious about automation, but avoid the trap of understaffing if your assumptions don’t hold. My goal is to share how AI changes the logic of capacity planning, what I’ve learned implementing these practices with my team and with customers, and the common traps to avoid.

Traditional planning rests on relatively stable assumptions: volume grows predictably, work types stay consistent, handle times don’t swing dramatically, and productivity improves slowly with better tools and training. In an AI-first model, none of that is guaranteed, and the fundamentals flip. The mix of work changes as AI absorbs a growing share of simpler conversations, leaving humans with deeper, more time-consuming issues that demand human-to-human connection. Demand can actually increase when you remove friction, so AI can both resolve more and attract more volume. Human time splits differently as teammates solve customer problems and also review AI behavior, give feedback, improve content, and support system-level work. Performance becomes dynamic, not fixed: automation rate isn’t a one-time number; it can rise with care and fall with neglect.

If you plan for 2026 using a pre-AI model (assuming similar productivity, similar work mix, and a linear relationship between volume and headcount), you will underestimate what it now takes to run a high-performing support organization.
There are many metrics you can track, but the one to put at the center is automation rate (AI Agent involvement rate × AI Agent resolution rate). This single construct tells me what share of total volume AI actually resolves, how much work remains for humans, how much additional demand humans can absorb, and how ambitious I can be with headcount. Early in the journey, I prioritize raising involvement, which means getting the AI involved in more conversations. Once involvement is high, I shift to resolution on the hardest remaining work, where each additional 1% of automation can represent several people’s worth of capacity. In my 2026 plans, automation rate sits alongside projected inbound volume, average “output” per person for the more complex work that remains, and occupancy: how much time is allocated to customer-facing interactions versus operational and strategic work. Together, those inputs give a realistic picture of how many people you need and where they should spend their time.

First, plan boldly on automation, but match it with investment. I do not cap automation assumptions at 40–50% “because AI is new.” Many teams are already modeling 60%, 70%, even 80%+ for 2026, provided they invest in AI ownership and content. The investment is non-negotiable: named ownership for AI performance (AI ops, knowledge management, conversation design), clear automation targets by work type (e.g., informational vs. personalized vs. actions vs. deep troubleshooting), realistic expectations for what’s easy to automate and what’s not, and a concrete plan to raise automation over time in monthly or quarterly steps rather than a single jump. To decide where to invest first, I dig into the data. I start with the biggest volume drivers, separate content-led issues from those dependent on data or complex procedures, assume higher resolution potential for content-led topics once the knowledge base is in shape, and set more modest initial resolution expectations for system-dependent flows. Then I stair-step improvements as the systems, data contracts, and workflows mature. In short, bold automation goals only work when paired with the team structure, content, and systems required to reach them, and the discipline to iterate.

Second, expect human “output” per person to go down. That’s a mindset shift. Historically, we assumed individual productivity would stay flat or tick up as tools improved. In an AI-first model, humans handle fewer conversations but more complex, cross-functional issues, and create more value despite lower case counts. I model a lower “cases closed per person” than prior-year baselines, explicitly assume the remaining work is more complex and time-consuming, and redefine productivity to include system-level work like AI Agent improvements, content updates, and policy or workflow change management. I also report “capacity created” from automation alongside human outputs, so leadership sees the full picture.

Third, rethink occupancy: more time off the queues, on higher-value work. Traditional occupancy splits time between inbox and training, meetings, and breaks. Now there’s an expanding “out-of-inbox” portfolio that directly affects AI performance and overall capacity: reviewing AI-handled conversations, improving AI Agent triaging and handovers, contributing to content and procedures, feeding insights to product and engineering, and supporting system changes that reduce future volume. I set lower inbox occupancy targets than before and make the rationale explicit. People aren’t working less; they’re working differently. In planning, I assume more time spent on improvement and system work, make it visible (for example, X% in inbox and Y% on AI and system improvement), and treat this as critical, not a “nice to have.” If you don’t proactively allocate it, it won’t happen, and your automation and performance targets will suffer.

Fourth, work with the finance team early, and treat your plan as a set of assumptions.
Capacity planning with AI is a set of bets across automation rate, human output, demand growth, occupancy, and where surplus capacity (if any) goes. I bring finance in early, show that the plan is dynamic and directly tied to AI performance, and label every lever as an assumption with ranges. I commit to a quarterly review cadence with finance to compare assumptions versus reality and adjust headcount, targets, and investment as needed.

The risks are real: if automation grows slower than expected and you stop backfilling too early, you’ll be understaffed for months. Hiring and onboarding take time, so course-correcting late creates strain. If you do produce surplus capacity, have a clear strategy to reallocate those teammates to higher-value work (improving systems, feeding insights back to product, supporting new channels, and driving proactive CX) rather than defaulting to reductions. I also set explicit guardrails: if automation rate misses by five points for two consecutive months, we pause planned reductions and revisit hiring gates. If it over-performs, we shift people into backlog eradication, content upgrades, or proactive outreach, so we bank compounding value.

To set your team up for success in 2026, anchor your plan on automation rate, be honest that humans will handle fewer but harder conversations, and protect time for system improvements. Partner early and often with finance, avoid shrinking too fast, and design a plan for surplus capacity so you’re never caught flat-footed. If AI is going to handle the majority of your customer conversations, your plan has to be designed to help it do that well and to keep your team set up for meaningful, sustainable work. A 2026 plan built on adaptable assumptions, not fixed predictions, will hold up as your work, your systems, and your customers’ expectations continue to change.
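
To make the planning inputs concrete, here is a minimal Python sketch of how automation rate, inbound volume, output per person, and occupancy might combine into a headcount estimate, plus the two-month guardrail check. The function names, the example numbers, and the assumption that output per person scales linearly with inbox occupancy are mine for illustration, not a prescribed model. I also read "misses by five points" as trailing the target by at least five percentage points.

```python
import math


def automation_rate(involvement_rate: float, resolution_rate: float) -> float:
    """Share of total volume AI resolves: involvement rate x resolution rate."""
    return involvement_rate * resolution_rate


def required_headcount(
    inbound_volume: float,
    involvement_rate: float,
    resolution_rate: float,
    cases_per_person_full_inbox: float,
    inbox_occupancy: float,
) -> int:
    """Rough headcount for the volume AI does not resolve in one period.

    Assumption (illustrative): a person's output scales linearly with the
    share of their time spent in the inbox.
    """
    rate = automation_rate(involvement_rate, resolution_rate)
    human_volume = inbound_volume * (1 - rate)
    cases_per_person = cases_per_person_full_inbox * inbox_occupancy
    return math.ceil(human_volume / cases_per_person)


def should_pause_reductions(
    target_pct: list[float],
    actual_pct: list[float],
    miss_points: float = 5.0,
    consecutive_months: int = 2,
) -> bool:
    """True if actuals trail targets by >= miss_points for N months in a row."""
    streak = 0
    for target, actual in zip(target_pct, actual_pct):
        # Reset the streak as soon as a month lands within the tolerance.
        streak = streak + 1 if target - actual >= miss_points else 0
        if streak >= consecutive_months:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 monthly conversations, 80% involvement,
    # 75% resolution, 450 cases per person at full inbox time, 60% occupancy.
    print(required_headcount(10_000, 0.8, 0.75, 450, 0.6))
    # Hypothetical monthly automation targets vs. actuals, in percent.
    print(should_pause_reductions([60, 62, 64], [58, 56, 58]))
```

Rerunning the estimate with a low, expected, and high automation rate is one simple way to show finance the range each assumption implies.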
If you’d like future editions like this, subscribe and stay close—I’ll keep sharing what’s working, what isn’t, and how to tune your customer support AI strategy in real time.

Inspired by this post on The Intercom Blog.

December 16, 2025
AI in Product Design: My Proven Playbook, Real Use Cases, and the Tools That Win Faster

In product design, AI has shifted from novelty to non-negotiable. I’ve watched teams accelerate discovery, compress prototyping cycles, and turn ambiguous ideas into validated experiences faster than ever—without sacrificing quality or customer trust.

AI in product design has quickly moved from new to necessary. Here are the AI product design tools and approaches you need to stay relevant in this decade.

From my vantage point leading product teams, “necessary” means AI is woven throughout the product lifecycle—discovery, prioritization, prototyping, validation, and iteration—not bolted on. The goal isn’t to chase hype; it’s to build durable advantage with clear AI Strategy, disciplined execution, and measurable outcomes.

First, anchor the work in strategy. Tie every AI initiative to a specific customer problem and value proposition, then express that linkage in OKRs that measure outcomes rather than output. This keeps teams focused on real impact and avoids feature-chasing. It also sharpens product positioning and clarifies where AI can deliver competitive differentiation versus simple points of parity.

Second, upgrade discovery. I rely on AI workflows to synthesize interviews, cluster themes, and surface insights at scale. A retrieval-first pipeline—grounding models in our own data—improves factuality and reduces hallucinations. Combine this with strong data governance and privacy-by-design so insights are trustworthy and compliant from day one.
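
A retrieval-first pipeline can be sketched in a few lines of Python. This is a toy version under loud assumptions: ranking uses plain word overlap instead of embeddings or a search index, no model is actually called, and the interview notes, ids, and function names are invented for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a stand-in for a real embedding or search index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Return ids of up to top_k documents ranked by word overlap with the question."""
    query = tokenize(question)
    scored = [(len(query & tokenize(body)), doc_id) for doc_id, body in documents.items()]
    # Highest overlap first; ties broken by id so the order is deterministic.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked[:top_k] if score > 0]


def build_grounded_prompt(question: str, documents: dict[str, str], top_k: int = 2) -> str:
    """Assemble a prompt that restricts the model to the retrieved sources."""
    doc_ids = retrieve(question, documents, top_k)
    if doc_ids:
        context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    else:
        context = "(no relevant sources found)"
    return (
        "Answer using only the sources below and cite their ids. "
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical interview notes, made up for this example.
NOTES = {
    "interview-01": "Admins struggle to find the export button during onboarding.",
    "interview-02": "Billing emails are confusing and customers ignore them.",
    "interview-03": "Onboarding checklist helped new admins finish setup faster.",
}

if __name__ == "__main__":
    print(build_grounded_prompt("What slows admins down during onboarding?", NOTES))
```

The point of the pattern is the ordering: the sources are selected first, and the model only sees the question alongside them, with an explicit instruction to admit when the sources fall short.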

Third, make quality measurable. Adopt eval-driven development: define evaluation sets and acceptance thresholds that reflect real user tasks before you ship. Pair that with A/B testing and minimum detectable effect (MDE) discipline, so you learn quickly and confidently. Add safety guardrails (red-teaming prompts, content filters, and bias checks) to manage AI risk without slowing the pace.
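
As a rough illustration of eval-driven development, the Python sketch below scores a generator against a small evaluation set and compares the pass rate to an acceptance threshold fixed before shipping. Everything here is an assumption for the example: the substring check is a deliberately crude grader, `fake_model` stands in for a real model call, and the cases and thresholds are invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    task: str          # a real user task, used as the model input
    must_include: str  # substring the output must contain to pass


def pass_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output contains the required substring."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(
        case.must_include.lower() in generate(case.task).lower() for case in cases
    )
    return passed / len(cases)


def release_gate(
    generate: Callable[[str], str], cases: list[EvalCase], threshold: float
) -> bool:
    """True only if the pass rate meets the pre-committed threshold."""
    return pass_rate(generate, cases) >= threshold


def fake_model(task: str) -> str:
    """Hypothetical stand-in for a model call, with canned answers."""
    canned = {
        "reset password": "Go to Settings > Security and choose Reset password.",
        "export report": "Open Reports and click Export to CSV.",
    }
    return canned.get(task, "Sorry, I am not sure.")


# Hypothetical evaluation set built from real user tasks.
CASES = [
    EvalCase("reset password", "Settings > Security"),
    EvalCase("export report", "Export to CSV"),
    EvalCase("cancel subscription", "Billing"),
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(fake_model, CASES):.2f}")
    print("ship:", release_gate(fake_model, CASES, threshold=0.9))
```

In practice the grader would be richer (rubrics, human review, or model-graded checks), but the shape stays the same: the threshold is written down before the results are known.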

Fourth, enable empowered product teams. Product trios (PM, design, engineering) should co-create prompts, prototypes, and evaluation criteria. Give designers and PMs practical tools—LLMs for product managers, structured prompt templates, and reusable components—so AI-augmented work becomes the default, not a special project.

Where does AI shine in product design today? In three places: concept exploration and market scans, which turn fuzzy opportunity spaces into crisp problem statements; rapid wireframes and interaction ideas, where generative AI for prototyping lets teams explore multiple design directions in minutes; and UX writing that adapts tone and reduces friction across onboarding, tooltip design, and microcopy.

It also excels at guided experiences. I’ve seen strong lifts in user activation when we pair in-app guides and product tours with context-aware suggestions. For support and education use cases, a retrieval-grounded assistant can deflect tickets, shorten time-to-value, and reinforce the product’s value proposition at the exact moment a user needs help.

Voice is another frontier. A well-scoped voice AI agent can accelerate complex workflows (think data entry or multi-step configurations) when hands-free is faster or more intuitive. Just be intentional about when agentic AI adds net value versus when a simple UI tweak would do.

On the tooling side, my AI product toolbox is pragmatic and modular. For analytics and learning loops, Amplitude analytics and Pendo help quantify behavior changes and support retention analysis. For in-product engagement and feedback routing, Intercom and HubSpot integrate cleanly with LLM-driven tagging and summarization. For ideation and automation, I use a ChatGPT connector and Claude Code for quick scripts, data wrangling, and prompt experiments. The constant: a retrieval-first pipeline that grounds models in approved knowledge and keeps context window usage manageable at scale.

Risk management is built in, not bolted on. Set clear AI risk management policies, catalog model and data dependencies, and document decisions. Align with regulatory compliance requirements early, and keep an audit trail of prompts, datasets, and eval results. That’s how you move fast without breaking trust.
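
One lightweight way to keep such an audit trail is an append-only log of structured records, sketched below in Python. The field names, the example values, and the choice to store a SHA-256 hash of the prompt rather than the raw text are all assumptions for illustration; a real policy may require storing the full prompt.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str        # ISO 8601 string supplied by the caller
    prompt_sha256: str    # fingerprint of the exact prompt text
    dataset_version: str  # which dataset or knowledge snapshot was used
    model_name: str
    eval_pass_rate: float
    decision: str         # e.g. "promote" or "hold"


def make_record(
    timestamp: str,
    prompt: str,
    dataset_version: str,
    model_name: str,
    eval_pass_rate: float,
    decision: str,
) -> AuditRecord:
    """Build a record, hashing the prompt so later changes are detectable."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return AuditRecord(
        timestamp, digest, dataset_version, model_name, eval_pass_rate, decision
    )


def to_json_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values, made up for this example.
    record = make_record(
        "2025-12-16T00:00:00Z",
        "Summarize this ticket.",
        "kb-v3",
        "example-model",
        0.92,
        "promote",
    )
    print(to_json_line(record))
```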

If you’re getting started, begin small: pick one high-friction workflow, add a retrieval-grounded copilot, and measure the lift. Use the results to inform product roadmapping and sprint planning, then scale to adjacent use cases. With disciplined discovery, sharp evaluation, and the right tooling, AI becomes a force multiplier for product teams and a clear win for customers.

Inspired by this post on Product School.

December 15, 2025

Category: AI Strategy
