AI Context Engineering: A System for Product Decisions
You give an LLM your discovery notes, a dashboard export, and a roadmap question. It returns polished recommendations in seconds. The recommendations sound plausible, yet your product trio still cannot tell which option deserves a commitment.

The missing ingredient is usually not a better prompt. It is a decision-ready context system: a controlled way to give AI the evidence, boundaries, and outcome definition required to reason about the same product decision your team is actually making. Done well, this gives you more than a convincing answer. It gives you a traceable choice, explicit uncertainty, and a validation plan.

Define the decision before you collect the context

For product work, context engineering is the deliberate design of everything an AI system can use at the moment it reasons: customer evidence, metrics, goals, constraints, definitions, instructions, and prior decisions. The useful unit is not a prompt or a document. It is the decision.

This distinction matters because an LLM can answer an underspecified request without exposing that the request was underspecified. Ask it to improve onboarding, and it can produce a credible list of patterns. That output still does not tell you which user segment matters, what improvement means, which current friction is supported by evidence, or what downside the team must avoid.

Before pulling any context, write a decision frame that answers these questions:
- What decision must be made? Name the commitment, not the general topic. "Choose whether to change a specific onboarding step" is a decision; "explore onboarding" is not.
- Who is the decision for? Identify the customer segment, use case, or part of the journey. Evidence from one segment should not silently become a claim about every user.
- What outcome should change? State the behavior or business result you want, then identify the guardrail signals that should not deteriorate.
- What can constrain the answer? Include privacy, risk, brand, commercial, technical, and operational boundaries before ideation begins.
- What evidence could change the choice? If no possible evidence would change the decision, you are asking AI to justify a conclusion rather than help make one.
- What must the output enable? Specify whether you need options, a recommendation, a decision memo, an experiment plan, or a list of unresolved questions.
Anchor this frame in outcomes rather than deliverables. "Improve activation for a defined segment while protecting support load" establishes a decision boundary. "Build a new onboarding checklist" merely names an output. The first lets AI compare interventions; the second encourages it to decorate a predetermined solution.

A practical test is to remove the proposed feature from the frame. If the decision still makes sense, you have probably described an outcome. If the frame collapses, the team may already be committed to an output.
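
One way to make the frame checkable before any context is pulled is to hold it in a small structure and ask which questions are still blank. The following is a minimal Python sketch; the class name `DecisionFrame`, its field names, and the example values are illustrative assumptions, not part of any tool.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionFrame:
    """Hypothetical container for the six framing questions."""
    decision: str = ""        # the commitment, not the general topic
    segment: str = ""         # who the decision is for
    outcome: str = ""         # behavior or business result that should change
    guardrails: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    decisive_evidence: list[str] = field(default_factory=list)
    output_needed: str = ""   # options, memo, experiment plan, ...

    def gaps(self) -> list[str]:
        """Return the names of framing questions that are still unanswered."""
        missing = []
        for name in ("decision", "segment", "outcome", "output_needed"):
            if not getattr(self, name).strip():
                missing.append(name)
        for name in ("guardrails", "constraints", "decisive_evidence"):
            if not getattr(self, name):
                missing.append(name)
        return missing


# Example values are invented for illustration.
frame = DecisionFrame(
    decision="Choose whether to change the invite step in onboarding",
    segment="New self-serve team admins",
    outcome="Higher activation for that segment",
    guardrails=["Support load does not rise"],
    output_needed="Options with an experiment plan",
)
print(frame.gaps())  # ['constraints', 'decisive_evidence']
```

An empty `gaps()` result does not mean the frame is good; it only means no question was skipped.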

Build a context packet that preserves evidence quality

A context packet is the smallest governed collection of information that allows the model and the product team to reason about the decision. It can combine customer quotes, behavioral trends, funnel friction, support conversations, and commercial constraints. The important work is to assemble, structure, compress, and challenge that evidence before asking for recommendations.

Do not treat every input as the same kind of truth. A customer quote gives you detail about an experience, not its prevalence. Usage analytics show behavior, not necessarily motivation. Support conversations overrepresent people who contacted support. CRM data can expose commercial constraints without proving that a feature creates customer value. Labeling these boundaries prevents the model from blending different signals into false certainty.

Use this structure for the packet:
- Decision header: the choice, decision owner, affected segment, and action that follows the decision.
- Outcome frame: the desired outcome, current signal, primary measurement, guardrails, and any metric definitions needed to interpret the data correctly.
- Evidence ledger: each relevant observation with its origin, segment, time period, and scope. Keep direct observations separate from interpretations.
- Constraints: technical dependencies, commercial commitments, privacy rules, brand boundaries, operational capacity, and known risks.
- Contradiction register: evidence that points in different directions, including differences between customer statements and observed behavior.
- Unknowns: missing evidence, ambiguous definitions, unrepresented segments, and assumptions the team has not validated.
- Output contract: the form of response you need, the criteria options must address, and the unsupported claims the model must label rather than fill in.
Compression is where many context packets either become useful or become misleading. The goal is not merely to shorten the material. It is to increase the proportion of decision-relevant signal without erasing qualifications.
1. Normalize repeated evidence. Deduplicate copied notes and repeated tickets so repetition in the packet does not impersonate independent confirmation. Preserve any real frequency data separately.
2. Retain the qualifiers. Do not compress away the segment, time range, denominator, metric definition, or product state that determines what an observation means.
3. Label epistemic status. Mark material as observation, interpretation, assumption, or generated hypothesis. A concise packet should make these distinctions clearer, not blur them.
4. Keep contradictions visible. If interviews describe one problem while behavioral data points elsewhere, preserve both signals and ask what evidence would resolve the conflict.
5. Remove inert context. My rule is simple: if an item cannot change an option, a risk assessment, or the validation plan, it does not belong in the active packet. Keep it available outside the model context if the team may need to inspect it later.
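
The first three compression steps can be sketched in a few lines of Python. This is a sketch under stated assumptions: `EvidenceItem`, `normalize`, the field names, and the sample notes are hypothetical, and matching copies by normalized wording is a deliberately crude stand-in for real deduplication.

```python
from dataclasses import dataclass

# Epistemic labels taken from step 3 above.
STATUSES = {"observation", "interpretation", "assumption", "generated_hypothesis"}


@dataclass(frozen=True)
class EvidenceItem:
    text: str
    source: str    # e.g. "interview", "support_ticket", "analytics"
    segment: str   # qualifier that must survive compression
    period: str    # qualifier that must survive compression
    status: str    # one of STATUSES

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown epistemic status: {self.status!r}")


def normalize(items):
    """Collapse copies of the same note; count copies instead of repeating them."""
    ledger = {}
    for item in items:
        # Same wording (ignoring case and spacing) with the same qualifiers is one entry.
        key = (" ".join(item.text.lower().split()), item.source, item.segment, item.period)
        entry = ledger.setdefault(key, {"item": item, "copies": 0})
        entry["copies"] += 1
    return list(ledger.values())


# Invented sample notes: the first two are the same note pasted twice.
raw = [
    EvidenceItem("Invite step is confusing", "interview", "team admins", "Q3", "observation"),
    EvidenceItem("invite step is  confusing", "interview", "team admins", "Q3", "observation"),
    EvidenceItem("Admins want SSO first", "interview", "team admins", "Q3", "interpretation"),
]
ledger = normalize(raw)
for entry in ledger:
    print(entry["copies"], entry["item"].status, "-", entry["item"].text)
```

The `copies` count records how often a note was pasted into the packet. It is not prevalence data, which still has to come from a source with a real denominator.
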
Apply privacy-by-design while assembling the packet, not after the model has processed it. Customer transcripts, CRM records, and support conversations can contain personal or confidential data. Use approved systems, follow applicable access controls and data terms, redact identifiers, and aggregate where the decision does not require record-level detail. If you cannot establish that the data is permitted in the AI workflow, leave it out and provide a safe summary. The cost of getting this wrong is not a weaker prompt; it is potential exposure of customer or company information.
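
As a small illustration of redacting identifiers before text enters the packet, here is a hedged Python sketch. The two regular expressions and the `redact` helper are assumptions for this example; they catch only simple email and phone patterns and are not a substitute for an approved redaction process.

```python
import re

# Deliberately simple patterns; real identifiers take many more forms.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


note = "Contact jane.doe@example.com or +1 (555) 010-2345 about the invite step."
print(redact(note))
# Contact [EMAIL] or [PHONE] about the invite step.
```

Names, account numbers, and free-text details will pass straight through a filter like this, which is why aggregation or a safe summary is often the better choice.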

Separate synthesis, strategy, and skepticism

Asking for a summary, a recommendation, and a critique in the same instruction makes it difficult to see where evidence ends and invention begins. A stronger agentic workflow separates those jobs into distinct passes: Summarizer, Strategist, and Skeptic.

The Summarizer creates an evidence map

The Summarizer should organize the packet without deciding what to build. Ask it to group evidence around the decision, preserve relevant qualifiers, expose conflicts, and identify missing information. Explicitly prohibit recommendations during this pass.

A useful Summarizer output contains the supported observations, the segments represented, the outcome signals involved, the contradictions, and the unknowns. Review this output against the packet before continuing. If the model has turned an assumption into a fact, fix the evidence map rather than hoping a later pass corrects it.

The Strategist develops decision options

Give the Strategist the approved evidence map, the original decision frame, and the constraints. Ask for a small, meaningfully different set of options, including the option to leave the product unchanged when that is legitimate.

Require the same fields for every option:
- the customer problem or opportunity it addresses;
- the packet evidence that supports it;
- the assumptions required for it to work;
- the expected outcome and guardrail signals;
- the dependencies and material trade-offs;
- the simplest valid way to reduce its largest uncertainty.
This format prevents one option from winning because it received a more persuasive narrative. It also makes unsupported leaps visible. If the model cannot connect an option to evidence, that option can remain an idea, but it must be labeled as a hypothesis rather than presented as a conclusion.
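
A minimal sketch of that rule in Python: every option carries the same fields, and an option that cites nothing from the packet is labeled a hypothesis. The `Option` class, the `classify` helper, and the evidence IDs are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    name: str
    problem: str
    evidence_ids: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    expected_outcome: str = ""
    guardrails: list[str] = field(default_factory=list)
    tradeoffs: list[str] = field(default_factory=list)
    next_test: str = ""


def classify(option: Option, ledger_ids: set[str]) -> str:
    """Label an option by whether it cites evidence that exists in the packet."""
    cited = [eid for eid in option.evidence_ids if eid in ledger_ids]
    return "evidence-backed option" if cited else "hypothesis"


ledger_ids = {"E1", "E2"}  # IDs of entries in the evidence ledger (invented)
simplify = Option("Simplify invite step", "Admins stall at invites", evidence_ids=["E1"])
gamify = Option("Add progress badges", "Low motivation", evidence_ids=["E9"])

print(classify(simplify, ledger_ids))  # evidence-backed option
print(classify(gamify, ledger_ids))    # hypothesis
```

The check only confirms that a citation points at a real ledger entry. Whether that entry actually supports the option is still a human judgment.
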

The Skeptic tries to disconfirm the options

The Skeptic should not produce generic risks. Ask it to find the strongest contrary evidence, the segment that might be harmed, the constraint most likely to invalidate the option, the metric that could be gamed, and the observation that would show the underlying hypothesis is wrong.

Require it to distinguish counterevidence already present in the packet from new conjecture. This matters because a skeptical tone can sound rigorous even when it is unsupported.

The same LLM can perform all three roles, but role prompts do not create independent evidence or independent reviewers. Freeze the context packet used for the loop, label every generated artifact, and keep generated claims out of the evidence ledger until a human verifies them. Role separation is a workflow control, not a guarantee of correctness.

Stop adding passes when the workflow is only rearranging language. The loop has done its job when the team can see the supported facts, viable options, disputed assumptions, material risks, and next evidence needed to decide.
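
The three passes can be wired together in a few lines. The sketch below is an assumption-heavy illustration: `call_model` stands in for whatever LLM client you use, `stub_model` is a fake that returns placeholder text, and `approve` represents the human review of the evidence map.

```python
ROLES = {
    "summarizer": "Organize the evidence. Keep qualifiers, conflicts, and unknowns. Do not recommend.",
    "strategist": "Propose a small set of distinct options, including no change where legitimate.",
    "skeptic": "Find the strongest contrary evidence. Separate packet counterevidence from conjecture.",
}


def run_passes(packet, frame, call_model, approve):
    """Run the three passes in order; stop if the evidence map is not approved."""
    evidence_map = call_model(ROLES["summarizer"], packet)
    if not approve(evidence_map):
        raise RuntimeError("evidence map rejected: fix it before the Strategist pass")
    options = call_model(ROLES["strategist"], frame + "\n\n" + evidence_map)
    critique = call_model(ROLES["skeptic"], packet + "\n\n" + options)
    outputs = {"summarizer": evidence_map, "strategist": options, "skeptic": critique}
    # Every artifact is labeled as generated so it cannot slip into the evidence ledger.
    return {role: {"status": "generated", "text": text} for role, text in outputs.items()}


def stub_model(instructions, context):
    # Stand-in for a real LLM call; returns placeholder text only.
    return f"(stub) {instructions.split('.')[0]} | context chars: {len(context)}"


result = run_passes("packet text", "decision frame text", stub_model, approve=lambda m: True)
for role, artifact in result.items():
    print(role, artifact["status"])
```

The packet string is passed unchanged to the Summarizer and the Skeptic, which is the code-level equivalent of freezing the packet for the loop.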

Make the product trio the decision gate

AI can accelerate the reasoning, but it should not become the decision owner. Bring the packet and the three-pass output into a product trio of product, design, and engineering. The purpose of that forum is not to approve the AI recommendation. It is to make the trade-offs explicit and decide what the team is prepared to learn.
1. Verify the evidence boundary. Check whether the represented segments, product states, and metrics match the decision. Ask which customer or operational perspective is absent.
2. Classify the important claims. Mark each claim as supported observation, team interpretation, assumption, or generated hypothesis. If nobody can trace a recommendation back to the packet, treat it as a hypothesis or remove it.
3. Compare trade-offs on equal terms. Evaluate every option against the desired outcome, guardrails, constraints, dependencies, and learning value. Do not let the most detailed option appear strongest merely because the model wrote more about it.
4. Choose the next commitment. The valid outcomes are to proceed, run a discovery or validation step, defer the decision, or reject the options. Assign a human owner and make clear what action the decision authorizes.
5. Record the rationale. Convert the discussion into a concise decision memo rather than forwarding raw model output to stakeholders.
The decision memo should include:
- the decision and why it is being made now;
- the target segment, desired outcome, and guardrails;
- the evidence that carried the most weight;
- the chosen option and the alternatives rejected;
- the trade-offs accepted by the decision owner;
- the assumptions and unresolved questions;
- the validation method and disconfirming signal;
- the owner and trigger for revisiting the decision.
This gives stakeholders something stronger than AI-generated confidence. They can inspect what the choice rests on, where judgment entered, what could prove the team wrong, and when the decision should be reconsidered.

Close the loop with validation and decision memory

Even a well-grounded model output is not product validation. It is a structured hypothesis. Match the validation method to the claim and to the consequence of being wrong.
- For a causal behavior claim: use a controlled A/B test when traffic, instrumentation, and the product experience make that appropriate. Define the primary metric, minimum detectable effect, guardrails, analysis approach, and stopping rules before reading the result.
- For a usability or comprehension claim: use targeted customer interviews or usability evaluation with the relevant segment. AI can help organize notes, but preserve outliers and do not turn a small qualitative sample into a prevalence claim.
- For an operational claim: use a limited release with observability, support monitoring, and an explicit rollback condition. Watch the workflow around the feature, not only the feature interaction itself.
- For privacy, brand, regulatory, or other high-consequence constraints: complete the appropriate human review before launch. A persuasive model assessment is not a substitute for the accountable specialist or decision owner.
For an onboarding decision, for example, the packet may contain segment definitions, observed friction, support themes, and conversion signals. The workflow can propose alternative interventions and measurement plans. The trio still chooses which hypothesis deserves a controlled test, whether the minimum detectable effect is practical, and which activation or retention signals will determine the next move.
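
To judge whether a minimum detectable effect is practical, a rough sample-size estimate helps. The sketch below uses the standard normal approximation for comparing two proportions; the baseline rate, effect size, significance level, and power are invented inputs, and the function name is hypothetical. Treat the result as an order-of-magnitude check, not a replacement for your experimentation platform's calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Invented example: 30% baseline activation, 3-point absolute lift.
n = sample_size_per_arm(baseline=0.30, mde_abs=0.03)
print(n)
```

If the number exceeds the traffic the segment can supply in a reasonable window, the trio has a reason to choose a larger effect, a different metric, or a different validation method.
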

After validation, return the result to the context system. Record what shipped, the observed outcome, affected segments, unexpected behavior, and which assumptions held or failed. Update the decision memo and evidence ledger. Otherwise, the next AI session begins from the same stale assumptions, and the organization pays again to relearn what it already discovered.

That accumulated decision memory is one of the most valuable outputs of context engineering. It turns AI collaboration from isolated prompting into a feedback loop connecting discovery, strategy, execution, and measurable results.

Key takeaways
- Frame the product decision, target segment, outcome, and constraints before asking AI for options.
- Give the model a compressed evidence packet, not an unstructured pile of documents.
- Keep observations, interpretations, assumptions, and generated hypotheses visibly separate.
- Use distinct Summarizer, Strategist, and Skeptic passes to expose where reasoning changes.
- Let a human product trio own the trade-offs, commitment, and stakeholder rationale.
- Treat every recommendation as a hypothesis until validation produces new evidence, then feed that evidence back into the decision record.
Choose the next real product decision that is important enough to validate and bounded enough to act on. Write its decision frame, assemble the smallest safe context packet, run the three reasoning passes, and take a decision memo into your product trio. When the result flows back into the packet, context engineering stops being a prompting technique and becomes part of how you run product.

References
- Pendo – Perspectives — AI Context Pulling Playbook: How I Make Humans + LLMs Collaborate for Sharper Product Outcomes
November 6, 2025

Agentic AI for Incident Response: A Practical Operating Model

An incident fires. Your responders are not short of data; they are short of a trustworthy path through it. Deployment timelines, service ownership, dashboards, logs, runbooks, and prior incidents live in separate places, while the cost of a wrong action rises by the minute.

The decision in front of you is not whether AI can summarize the incident channel. It is whether an agent can shorten the investigation without becoming another failure mode. That requires an operating model covering the agent’s job, context, permissions, interface, and evaluation before you give it meaningful authority.

Give the agent an investigation job before action authority

An incident-response agent should run a goal-directed investigation loop, not wait for isolated prompts like a chatbot. A credible implementation can collect context, form and test hypotheses, and draft fixes inside Slack. The important product decision is where that loop must stop for human judgment.

Model the loop on the work a strong responder already performs:

1. Scope the incident. Identify the affected service, environment, customer surface, start time, and known symptoms. Preserve unknowns instead of filling them with plausible guesses.
2. Gather relevant context. Retrieve recent changes, service ownership, dependencies, telemetry, runbooks, feature-flag changes, and similar incidents.
3. Form competing hypotheses. Produce a ranked set rather than locking onto the first convincing explanation. Distinguish observed facts from inferences.
4. Test each hypothesis. Use read-only tools to query metrics, logs, traces, deployment state, and dependency health. Record what supports or weakens each possibility.
5. Propose the next best action. Explain the target, expected effect, risk, preconditions, and recovery path. Do not hide uncertainty behind an authoritative tone.
6. Update the investigation. Incorporate tool results and responder corrections, discard disproven hypotheses, and choose the next check.
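
The hypothesis-handling part of this loop can be sketched as a small data structure. This is a minimal Python illustration, not how any particular product works: the `Hypothesis` class, the helper names, and the sample findings are assumptions, and ranking by a simple count of supporting minus conflicting findings is a crude placeholder for real weighting.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    supporting: list[str] = field(default_factory=list)
    conflicting: list[str] = field(default_factory=list)
    disproven: bool = False


def record(hypothesis, finding, supports):
    """Attach a tool result to the hypothesis it supports or weakens."""
    (hypothesis.supporting if supports else hypothesis.conflicting).append(finding)


def ranked(hypotheses):
    """Drop disproven hypotheses and order the rest by net evidence."""
    live = [h for h in hypotheses if not h.disproven]
    return sorted(live, key=lambda h: len(h.supporting) - len(h.conflicting), reverse=True)


deploy = Hypothesis("Recent deploy exhausted the connection pool")
dependency = Hypothesis("Upstream auth service is degraded")
record(deploy, "error rate rose right after the deploy", supports=True)
record(deploy, "pool saturation visible in metrics", supports=True)
record(dependency, "auth latency is normal", supports=False)
dependency.disproven = True  # set by a responder correction or a decisive check

print([h.claim for h in ranked([dependency, deploy])])
# ['Recent deploy exhausted the connection pool']
```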

The incident commander remains accountable for priorities and mitigation. The agent acts as an investigation engine: it gathers, tests, organizes, and proposes. This division is more useful than treating human involvement as a final approval click after the AI has already made every material decision.

Choose the first workflow with care. A good starting point has a bounded service area, dependable read-only signals, known responders, established runbooks, and outcomes you can verify after the incident. A workflow that depends on undocumented tribal knowledge or unrestricted production access is not ready for agentic automation. Fix the operating practices around the incident before expecting a model to compensate for them.

Do not begin with the most dramatic remediation you can automate. Early value usually comes from reducing context switching, locating the correct owner, connecting symptoms to recent changes, and eliminating weak hypotheses. Those tasks consume scarce attention but do not require the agent to mutate production.

Context quality determines the ceiling of the investigation

A capable model cannot reason with operational context it cannot find, distinguish, or trust. If a service has three names across the deployment system, observability platform, and incident channel, retrieval becomes unreliable before model reasoning even begins.

Create a context contract for every service placed within the agent’s scope. At minimum, make these fields explicit:

- Identity: canonical service name, aliases, repository, runtime, and environment.
- Ownership: accountable team, current on-call route, and escalation path.
- Topology: upstream dependencies, downstream consumers, data stores, queues, and shared infrastructure.
- Change history: deployments, configuration changes, feature flags, migrations, and rollback state.
- Operational knowledge: current runbooks, known failure modes, dashboards, alerts, and prior incident records.
- Control policy: tools the agent may call, environments it may inspect, actions it may propose, and actions it may never execute.
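
A context contract can start as plain structured data with two checks: resolve any alias to the canonical name, and report which fields are missing. The Python sketch below assumes an in-memory dictionary; the service names, field values, and helper names are invented for illustration.

```python
REQUIRED_FIELDS = (
    "identity", "ownership", "topology",
    "change_history", "operational_knowledge", "control_policy",
)

# Invented example contract; control_policy is left out on purpose.
CONTRACTS = {
    "checkout-api": {
        "identity": {"aliases": ["checkout", "checkout_svc"], "environment": "production"},
        "ownership": {"team": "payments", "on_call": "payments-primary"},
        "topology": {"upstream": ["auth-api"], "data_stores": ["orders-db"]},
        "change_history": {"source": "deploy-log"},
        "operational_knowledge": {"runbooks": ["checkout-latency"]},
    }
}


def resolve(name):
    """Map any known alias to its canonical service name, or None if unknown."""
    wanted = name.strip().lower()
    for canonical, contract in CONTRACTS.items():
        aliases = contract.get("identity", {}).get("aliases", [])
        if wanted == canonical or wanted in aliases:
            return canonical
    return None


def missing_fields(canonical):
    """List contract sections that have not been filled in."""
    return [f for f in REQUIRED_FIELDS if f not in CONTRACTS[canonical]]


print(resolve("Checkout_Svc"))         # checkout-api
print(resolve("billing"))              # None
print(missing_fields("checkout-api"))  # ['control_policy']
```

A service with missing sections, or a name that resolves to `None`, is a reason to keep that service out of the agent's scope for now.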

Start retrieval with exact operational signals. Filter by canonical service, environment, incident time window, deployment identifier, alert type, and ownership tag. Then rerank the surviving records for the current question. This deterministic tagging and reranking foundation is easier to debug than making semantic similarity responsible for every retrieval decision.

Add embeddings where language actually creates ambiguity: matching an unfamiliar symptom to a differently worded historical incident, finding a relevant paragraph inside a long runbook, or connecting terminology used by two teams. Semantic retrieval should widen discovery, not erase exact boundaries such as production versus staging or one tenant versus another.
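
The filter-then-rerank order can be sketched as follows. This is a simplified Python illustration: the record shape and the `retrieve` function are assumptions, and the keyword-overlap score is only a stand-in for whatever reranker or embedding model you actually use.

```python
from datetime import datetime


def retrieve(records, service, environment, start, end, question, limit=3):
    """Apply exact filters first, then rerank the survivors for the question."""
    candidates = [
        r for r in records
        if r["service"] == service
        and r["environment"] == environment
        and start <= r["time"] <= end
    ]
    terms = set(question.lower().split())

    def overlap(record):
        # Placeholder relevance score: shared words between question and record.
        return len(terms & set(record["text"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:limit]


# Invented records for illustration.
records = [
    {"id": "deploy-1", "service": "checkout-api", "environment": "production",
     "time": datetime(2025, 1, 15, 9, 50), "text": "deploy changed connection pool size"},
    {"id": "deploy-2", "service": "checkout-api", "environment": "staging",
     "time": datetime(2025, 1, 15, 9, 55), "text": "deploy changed connection pool timeout"},
    {"id": "flag-1", "service": "checkout-api", "environment": "production",
     "time": datetime(2025, 1, 15, 9, 58), "text": "feature flag enabled for new tax rules"},
]
hits = retrieve(
    records, "checkout-api", "production",
    datetime(2025, 1, 15, 9, 0), datetime(2025, 1, 15, 10, 30),
    "connection pool errors after deploy",
)
print([r["id"] for r in hits])  # ['deploy-1', 'flag-1']
```

The staging record never reaches the reranker, however similar its wording, because the environment filter runs first.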

Require every retrieved item to carry provenance that a responder can inspect: its system of record, service and environment, creation or update time, incident-time availability, and reason for retrieval. This lets the responder notice four common failures quickly:

- A runbook is relevant but stale.
- An ownership record is current but was different when the incident began.
- A similar incident came from another environment with different dependencies.
- A historical evaluation accidentally exposed the final root cause before the agent could have known it.
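
Some of these failures can be flagged mechanically when provenance is attached to each item. The sketch below is a hedged Python example: the field names, the `provenance_warnings` helper, and the 180-day staleness threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def provenance_warnings(item, incident_env, incident_start, max_age=timedelta(days=180)):
    """Return human-readable warnings about a retrieved item's provenance."""
    warnings = []
    if incident_start - item["updated_at"] > max_age:
        warnings.append("stale: not updated recently")
    if item["environment"] != incident_env:
        warnings.append("environment mismatch")
    if item["updated_at"] > incident_start:
        warnings.append("changed after the incident began")
    return warnings


incident_start = datetime(2025, 1, 15, 9, 40)  # invented timestamps
runbook = {"id": "checkout-latency", "environment": "production",
           "updated_at": datetime(2024, 1, 10)}
ownership = {"id": "owner-record", "environment": "production",
             "updated_at": datetime(2025, 1, 15, 10, 5)}

print(provenance_warnings(runbook, "production", incident_start))
# ['stale: not updated recently']
print(provenance_warnings(ownership, "production", incident_start))
# ['changed after the incident began']
```

A warning does not disqualify the item. It tells the responder what to verify before relying on it.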

Treat missing context as an observable product state. The agent should say that it cannot locate a deployment record or dependency map, identify which system was checked, and propose a safe way to continue. A confident answer assembled around a missing record is more dangerous than an explicit gap.

Scale permissions to reversibility and blast radius

Autonomy is not one switch. It is a set of permissions attached to particular tools, targets, environments, and action classes. Granting broad credentials because the agent usually behaves conservatively turns a model-quality issue into a production-control issue.

Action class	Appropriate agent role	Required human control
Read-only investigation	Query approved telemetry, changes, ownership, and runbooks	Audited access with service and environment boundaries
Recommendation or communication	Draft a diagnostic check, remediation plan, incident update, or escalation	A responder reviews customer-facing messages and consequential recommendations
Bounded, reversible execution	Invoke a preapproved runbook against an explicitly named target	Approval bound to the exact action, target, inputs, and current incident
Irreversible or broad execution	Explain the need and prepare a plan, but do not execute during the initial rollout	Existing change controls and accountable operators remain in force

Do not label an action reversible merely because the interface contains a rollback button. A deployment rollback can still be unsafe after an incompatible schema or data change. A restart can amplify load or destroy useful diagnostic state. Reversibility has to be validated for the specific service state, not inferred from the action name.

For every executable tool, define guardrails outside the prompt:

- Use least-privilege credentials scoped by service and environment.
- Allowlist tools, targets, and input shapes rather than relying on natural-language prohibitions.
- Preview the exact command or workflow, target, parameters, and expected effect before approval.
- Bind approval to that exact action so the agent cannot reuse it for a changed target or plan.
- Use rate limits, idempotency controls, and circuit breakers where repeated calls could cause harm.
- Route production changes through existing CI/CD or runbook automation when possible.
- Record retrievals, tool inputs, tool outputs, approvals, denials, and resulting state changes in an audit trail.
- Provide a direct way to suspend the agent’s tool access without disabling the incident workflow itself.
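
Two of these guardrails, allowlisting and approval bound to the exact action, can be sketched in a few lines of Python. The tool name, target string, and incident ID below are invented, and hashing a canonical JSON form of the action is one possible way to bind an approval, not a prescribed mechanism.

```python
import hashlib
import json

# Hypothetical allowlist of (tool, target) pairs the agent may ever execute.
ALLOWLIST = {("restart_worker", "checkout-api/staging")}


def fingerprint(action):
    """Stable digest of the exact action: tool, target, parameters, incident."""
    canonical = json.dumps(action, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(action, approver):
    """Record a human approval tied to this action's digest."""
    return {"approver": approver, "digest": fingerprint(action)}


def authorize(action, approval):
    """Check the allowlist first, then that the approval matches this exact action."""
    if (action["tool"], action["target"]) not in ALLOWLIST:
        return False, "tool/target not allowlisted"
    if approval is None or approval["digest"] != fingerprint(action):
        return False, "approval does not match this exact action"
    return True, "authorized"


action = {"tool": "restart_worker", "target": "checkout-api/staging",
          "params": {"batch": 1}, "incident": "INC-EXAMPLE"}
approval = approve(action, "on-call responder")

print(authorize(action, approval))
print(authorize({**action, "params": {"batch": 5}}, approval))
print(authorize({**action, "target": "checkout-api/production"}, approval))
```

Because the digest covers the parameters and the incident as well as the target, an approval cannot be replayed for a changed plan or a later incident.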

The action proposal should be a control artifact, not a conversational suggestion. It needs the evidence supporting the action, the exact target, the expected observable result, the maximum intended scope, known preconditions, and what the responder will do if the result does not appear. If the agent cannot supply those fields, it has not earned execution authority for that action.

Keep outward communication on a separate permission path. Drafting a status update is low-risk technically but consequential for customers and the business. Human review should verify what is known, what remains uncertain, and whether the message promises a recovery time the evidence cannot support.

Make evidence and uncertainty legible in the incident room

Putting the agent inside the collaboration surface where incidents already unfold reduces the friction of opening another product and re-explaining the situation. It also means the agent’s output competes with urgent human messages. Long narrative answers will be skipped, however intelligent they sound.

Give each investigation update a stable structure:

- Observed: facts returned by named systems, with timestamps and links where available.
- Hypotheses: ranked explanations with the supporting and conflicting evidence for each.
- Changed since the last update: new evidence, rejected hypotheses, and responder corrections.
- Next check: the read-only query or tool call most likely to distinguish between the remaining possibilities.
- Proposed action: target, expected effect, blast radius, preconditions, and recovery path.
- Decision needed: the specific approval, input, or ownership choice required from a human.

This is not a request to expose a model’s private, free-form chain of thought. Responders need a structured evidence trail: claims, retrieved signals, tool results, rejected alternatives, and action rationale. That artifact is more useful for review because each part can be checked against the operational record.

Confidence labels are helpful only when they change behavior. Define what the interface does when confidence is low: ask for a missing service identifier, run another safe check, present multiple hypotheses, or escalate to the owner. Do not display a precise-looking score unless you have evaluated whether that score corresponds to actual correctness in your incident set.

Design human correction as part of the main workflow. A responder should be able to reject a hypothesis, correct the service or environment, mark a retrieved record stale, deny an action, and state why. The agent should preserve that decision in the incident record and replan from it. Repeatedly resurfacing a rejected hypothesis erodes trust even when the underlying model is otherwise capable.

Watch for a subtle interface failure: polished summaries can make weak investigations look complete. Make unresolved questions and conflicting signals visually prominent in the message structure. The goal is not to make the agent sound certain. It is to help the incident commander see what is known, what is inferred, and what decision comes next.

Test against past incidents, then expand authority one boundary at a time

A demo proves that the agent can complete a favorable path. It does not prove that the agent will retrieve the right context, resist a misleading correlation, respect permissions, or propose a safe action when production is ambiguous.

Use post-incident time-travel evaluations. Reconstruct what the agent could have known at each point in a real incident. Begin with the original trigger and expose deployments, telemetry, messages, and tool results only when they became available. Hide the final root cause, later analysis, and corrected metadata until the corresponding point in the replay. Otherwise, you are testing hindsight rather than incident response.
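
The core of a time-travel replay is a filter on when each piece of information became available. A minimal Python sketch follows; the event shape, the `available_at` field, and the sample events are assumptions for illustration.

```python
from datetime import datetime


def visible_at(events, replay_time):
    """Return only the events the agent could have known at replay_time."""
    return [e for e in events if e["available_at"] <= replay_time]


# Invented incident timeline.
events = [
    {"id": "alert", "available_at": datetime(2025, 1, 15, 9, 40)},
    {"id": "deploy-record", "available_at": datetime(2025, 1, 15, 9, 45)},
    {"id": "responder-note", "available_at": datetime(2025, 1, 15, 10, 20)},
    {"id": "postmortem-root-cause", "available_at": datetime(2025, 1, 17, 14, 0)},
]

checkpoint = datetime(2025, 1, 15, 10, 0)
print([e["id"] for e in visible_at(events, checkpoint)])
# ['alert', 'deploy-record']
```

The filter is only as good as the `available_at` values. Metadata that was corrected after the incident needs its correction time, not its original creation time.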

Grade the investigation on operational usefulness, not prose quality:

- Scoping accuracy: Did it identify the correct service, environment, symptoms, and ownership route?
- Context retrieval: Did it find the relevant change, runbook, dependency, or earlier incident without mixing incompatible records?
- Hypothesis quality: Where did the eventual cause appear in the ranked set, and what evidence was used to test it?
- Evidence integrity: Does every factual claim match a retrieved record or tool result? Did the agent invent a signal that was never observed?
- Tool correctness: Did it select the correct tool, target, environment, and parameters?
- Action safety: Was the proposed action inside policy, and were its blast radius, preconditions, and recovery path explicit?
- Calibration: Did expressed certainty track actual correctness, especially when context was incomplete?
- Time compression: How did the time to a useful hypothesis, correct owner, mitigation decision, and recovery compare with the existing workflow?
- Human effort: Which searches, handoffs, repeated explanations, and diagnostic checks did the agent remove or add?

Treat safety failures differently from diagnostic misses. A missed hypothesis is a capability problem. Crossing a permission boundary, inventing evidence, or targeting the wrong environment is a release blocker for that tool path. Averaging all outcomes into one quality score can conceal exactly the failure that matters most.
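
The difference between averaging and blocking is easy to show in code. In this Python sketch, the violation names, the `release_decision` function, the replay scores, and the 0.7 quality threshold are all assumptions for illustration.

```python
# Violations that block release for a tool path regardless of average quality.
BLOCKERS = {"permission_violation", "invented_evidence", "wrong_environment"}


def release_decision(replays, min_mean=0.7):
    """Block on any safety violation; otherwise judge on mean diagnostic score."""
    blockers = sorted({v for r in replays for v in r["violations"] if v in BLOCKERS})
    mean = sum(r["score"] for r in replays) / len(replays)
    if blockers:
        return {"release": False, "reason": "safety blocker", "blockers": blockers, "mean": mean}
    return {"release": mean >= min_mean, "reason": "diagnostic quality", "blockers": [], "mean": mean}


# Invented replay results: high scores, but one replay targeted the wrong environment.
replays = [
    {"incident": "replay-1", "score": 0.90, "violations": []},
    {"incident": "replay-2", "score": 0.95, "violations": ["wrong_environment"]},
    {"incident": "replay-3", "score": 0.85, "violations": []},
]
decision = release_decision(replays)
print(decision["release"], decision["blockers"])
# False ['wrong_environment']
```

The mean score here is high, and a single quality number would have passed this tool path.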

A practical rollout sequence

1. Instrument the human workflow. Capture incident timelines, ownership changes, diagnostic steps, approvals, mitigations, and outcomes. You need a baseline before claiming improvement.
2. Replay historical incidents. Use time-bounded context and score the agent against known outcomes. Repair retrieval and service metadata before tuning for eloquence.
3. Run in shadow mode. Let the agent investigate live incidents without posting conclusions or changing systems. Compare its evidence and hypotheses with the responder’s path.
4. Expose read-only assistance. Allow responders to request context, hypothesis checks, and draft updates. Collect explicit acceptance, correction, and rejection signals.
5. Add recommendation mode. Let the agent propose remediations using the structured action artifact, while humans continue to execute through established controls.
6. Enable one bounded action path. Choose a preapproved runbook with a clear target, validated preconditions, observable effect, and recovery procedure. Keep approval attached to the exact invocation.
7. Expand by tool and service. Grant additional authority only when evaluation evidence supports that particular boundary. Do not treat success on one service as proof of readiness everywhere.

Re-run the evaluation set after changes to prompts, models, tools, service topology, runbooks, or permissions. An agent can regress even when its general language quality improves. Operational behavior depends on the whole system around the model.

Key takeaways

- Start with investigation and context compression; earn execution authority later.
- Build deterministic service, environment, time, and ownership filters before depending on semantic retrieval.
- Separate observed facts, hypotheses, and proposed actions in every incident update.
- Enforce permissions in tools and infrastructure, not only in prompts.
- Evaluate with historical time travel so the agent never sees facts that were unavailable during the real incident.
- Expand autonomy one action, tool, service, and environment boundary at a time.

The next outage is the wrong time to discover that your agent cannot distinguish a plausible explanation from verified evidence. Before it happens, choose one bounded incident workflow, define its context contract and permission envelope, and replay several real investigations without future information. If the agent can make its evidence legible, stay inside policy, and consistently move responders toward the next correct decision, you have a foundation worth expanding.

References

Shivam.Consulting Blog — How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

November 6, 2025

Turn Claude Code Into a Trusted Teammate: My 3-Layer Memory System You Can Copy

"Can you critique the landing page for my new Story-Based Customer Interviews course?" That simple ask used to kick off hours of back-and-forth where I fed an AI the same context over and over—only to get generic feedback that wouldn’t land with my audience or fit my products. As a product leader, that inefficiency was unacceptable; as a writer, it was just plain frustrating.

Not anymore. Today, Claude not only critiques my work, it helps me produce it. It generates marketing copy—in my voice. It helps me write blog posts. It knows what search terms are relevant to my business and helps me optimize my articles for SEO and now AEO. It helps me with competitive research, academic research, and discovery research. And it does all of this with little prompting from me.

I don’t upload files to a web-based project. I don’t manage elaborate prompt libraries. I don’t repeat myself. I ask for help and Claude knows exactly what to do. The shift happened when I learned how to give Claude Code a memory. Claude now knows who my target customer is, the key value propositions I focus on, the specific opportunities each product addresses, my revenue model, my marketing channels, and so much more.

A dark-themed strategy slide for the post Stop Repeating Yourself: Give Claude Code a Memory, showing how to lead with a CLAUDE.md glossary page, write clearly for nontechnical readers, and link glossary and article to boost discovery and engagement.

With that memory, I consistently get high-quality output tailored to my audience and aligned to my products and services. I don’t retype the same context; Claude just remembers. In this article, I’ll show you exactly how I set up that memory. It relies on Claude Code (which requires a Pro subscription), and it’s worth it. If you’re new to Claude Code, start with "Claude Code: What It Is, How It’s Different, and Why Non-Technical People Should Use It."

Here’s the underlying problem: with large language models, every conversation starts from scratch. Yes, ChatGPT can remember some things and Claude can search past conversations, but practically speaking each new thread wipes the slate clean. If I were working on a new landing page, I’d normally need to upload target customer context, product details, primary and secondary value propositions, FAQ questions and answers, plus testimonials and logos for social proof—every single time.

Start fast with Claude’s home screen: Sonnet 4.5 is ready, and quick actions for writing, learning, and coding sit beneath a clean prompt box—ideal for showing how memory cuts repetition and streamlines daily development.

Projects in web-based tools help a bit, but they introduce a new dilemma. When I move to the next landing page targeting the same customer but a different product and value proposition, do I start a new Project (tedious) or keep expanding the old one (which muddies the context window and degrades output quality)? The good news: Claude Code solves this by giving the model a precise, durable memory without overloading any single conversation.

Claude Code can read files on my local machine, which is an understated superpower. I use those files to create a persistent, reusable memory that works across all chats and Projects. Files can be mixed and matched, so I give Claude exactly what it needs for the task at hand—and nothing more. For a first landing page, I reference the target customer and the relevant product; for the second, I reuse the same target customer file and point to the new product file.

Dark-mode Notes screenshot captures Claude Code in action: it fetches producttalk.org, reads context files, and delivers a concise homepage evaluation—showing how memory streamlines repeated analysis tasks.

When you give an LLM the exact right context, output quality jumps. More context only helps if it’s the right context. For a landing page, Claude needs to know about the current product and perhaps related products for differentiation—but it doesn’t need to know about unrelated offerings. Structure your memory so Claude gets precisely what’s required.

Once I did this, Claude shifted from “intern who needs handholding” to trusted advisor and capable teammate. It doesn’t guess at my value propositions—I’ve already told it. It writes in my voice because it has my writing guide and samples. It knows who owns which course and which use cases map to which features. The setup takes a bit of upfront work, but it compounds: update a file when something changes and you’re done. Most of this information already lives in your system; the trick is making it easy for Claude to use.

See how Claude Code stops repetition: global and project CLAUDE.md files, plus custom reference docs, flow into the editor so the assistant remembers your preferences and context while you code and run commands.

Because the files live on my machine, I own the system. No vendor or device lock-in. I decide when and who to share with. I can work with Claude on one project and ChatGPT on another—both can rely on the same file-based memory strategy. It’s an AI strategy that scales with product discovery, accelerates go-to-market content, sharpens competitive differentiation, and supports product-led growth.

Here’s how I design the memory: I use three layers. Claude Code already encourages global preferences and Project-specific instructions, but the third layer—reference context—is where the real power lives.

Peek inside a markdown playbook for Claude Code: concise rules for writing, multi-level planning, and clear feedback that turn repeated reminders into reusable memory and smoother, faster coding sessions.

Layer 1: Global Preferences (Always on). The first time I launched Claude Code, I created a CLAUDE.md file at ~/.claude/CLAUDE.md. This is where I keep the cross-project rules of engagement—how I like to work with Claude. Mine includes: Always create a plan for me to review before you start any work; Give me direct feedback (no hedging, no gentle suggestions); Use bullet points for summaries; Ask clarifying questions one at a time so I can give complete answers; No emojis unless I explicitly ask for them. Claude Code automatically loads this file at the start of every session, so I never restate my preferences.

Layer 2: Project-Specific Instructions. Different projects have different rules. In my writing workspace, the Project CLAUDE.md sets the roles (I’m the primary writer; Claude is my thought partner and editor), defines a multi-round review flow (content → structure → accuracy → typos), prioritizes human readability over SEO, and points to my writing style guide. In my task management system, I include how my Trello integration works, file naming conventions for tasks, and how to process research papers into summaries. In my code projects, I specify the technology stack (Node.js vs. Python), testing framework (Jest for Node.js, pytest for Python), code style and conventions, project architecture and directory structure, and which dependencies and libraries to use. Each project directory has its own CLAUDE.md, and Claude automatically loads the relevant file when I’m working there.

Peek inside a markdown playbook for collaborating with Claude—covering session setup, roles, editorial standards, and research steps—to show how saved instructions create consistent results without repeating yourself.

Layer 3: Reference Context (Pull as Needed)—the real power. LLMs have a context window—a limit to how much they can process at once. Even within that limit, loading too much degrades performance due to “context rot.” The remedy is ruthless context management: small, targeted files that load only when needed. Keep CLAUDE.md files concise and focused on rules and workflows. For detailed knowledge, create separate reference files and list them in your CLAUDE.md so Claude knows they exist and when to fetch them. When I ask for help creating a landing page, Claude knows to use my business profile, the product file, and my target customers context.
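
To make the layering concrete, here is a rough Python sketch of the selection logic: two always-on files plus only the reference files a task needs. It does not show how Claude Code works internally, since Claude Code loads CLAUDE.md files and fetches referenced files on its own. The `REFERENCE_INDEX` mapping, the `assemble_context` helper, and the file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical index: which reference files each kind of task should pull in.
REFERENCE_INDEX = {
    "landing_page": ["business-profile.md", "target-customer.md", "product-course.md"],
    "blog_post": ["writing-guide.md", "target-customer.md"],
}


def assemble_context(task: str, global_file: Path, project_file: Path, reference_dir: Path) -> str:
    """Combine the always-on layers with only the references this task needs."""
    layers = [("global", global_file), ("project", project_file)]
    layers += [
        (f"reference:{name}", reference_dir / name)
        for name in REFERENCE_INDEX.get(task, [])
    ]
    sections = []
    for label, path in layers:
        if path.is_file():  # skip files that do not exist rather than fail
            sections.append(f"[{label}]\n{path.read_text(encoding='utf-8').strip()}")
    return "\n\n".join(sections)


# Demo with throwaway files standing in for a real setup.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = {
        "global/CLAUDE.md": "Always create a plan for me to review before you start any work.",
        "project/CLAUDE.md": "I am the primary writer; Claude is my thought partner and editor.",
        "project/context/writing-guide.md": "(writing style guide goes here)",
        "project/context/target-customer.md": "(target customer profile goes here)",
        "project/context/product-course.md": "(product details go here)",
    }
    for rel, text in files.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    context = assemble_context(
        "blog_post",
        root / "global" / "CLAUDE.md",
        root / "project" / "CLAUDE.md",
        root / "project" / "context",
    )

print(context)
```

For the blog post task, the product file never enters the context. That is the point of keeping reference files small and separate.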

Here’s what most people miss: you don’t cram everything into global or Project files. You maintain small, reusable reference files that Claude only loads on demand. In my walkthrough, I share exactly which context files I created and why; how I got Claude Code to help me create them; how I break them into small, reusable components so Claude gets precisely what it needs; how I keep everything up to date; and step-by-step instructions so you can set up a similar memory system.

Three project notes funnel into Claude Code, turning reusable context into working output. This visual shows how saving key docs as memory lets the AI pick up where you left off and skip repetitive prompting across tasks.

Let’s dive in.

Inspired by this post on Product Talk.

November 5, 2025
AI at Home, Impact at Work: Experiments That Supercharged My Product Leadership

I recently tuned into an insightful All Things Product episode featuring Teresa Torres and Petra Wille on how experimenting with AI in everyday life sharpens how we build AI-powered products at work. The core premise resonated deeply with my AI Strategy: low-stakes, personal experiments accelerate confidence, clarify limitations, and build an AI product toolbox we can bring into the office with rigor.

If you want to dive in, you can listen on Spotify or Apple Podcasts. I found the conversation especially relevant for product trios and anyone shaping LLMs for product managers in high-stakes environments.

The idea is simple but powerful: when I prototype with AI at home—where the stakes are low—I learn faster, make safer mistakes, and internalize critical product patterns. Over time, those patterns transfer directly to work: tighter context management, sharper bias awareness, clearer human-in-the-loop guardrails, and a more nuanced view of when to use AI as a thought partner versus when to consider agentic AI.

In my own practice, I’ve mirrored many of the scenarios discussed: using ChatGPT by OpenAI to plan meals, analyze public data sets like school budgets, and even sanity-check real estate evaluations. These seemingly mundane tasks are fertile ground for learning about context window limits, hallucination, AI bias, and privacy-by-design trade-offs. Each experiment helps me craft better prompts, structure data for clarity, and decide when a human review step is non-negotiable. These are core habits for AI risk management.

At work, I treat AI as a thought partner for writing, research synthesis, and contract review. I also explore when and how to responsibly evolve toward agentic AI for repeatable workflows. The distinction matters: a thought partner augments judgment; an agent automates execution. Building the right scaffolding—data governance, auditability, constraints, and escalation paths—ensures we unlock speed without compromising safety.

Three lines from the episode stayed with me: “I’m trying to write things that only I can write — that’s my guiding writing light right now.” — Teresa. “The more we use AI, the more we learn what it’s good at, what it’s not good at, and where context becomes a limitation.” — Teresa. “It’s a safer playground — we can build our toolbox at home before bringing those lessons to work.” — Petra. These are practical north stars for product management leadership in the GenAI era.

For anyone getting started, here’s what worked for me: begin with “low-stakes” personal experiments, write down your prompts and outcomes, and reflect on failure modes. Treat each activity as product discovery: What problem am I solving? What outcome matters? What data and context does the model need? Which decisions must stay human-in-the-loop? This discipline builds an AI product toolbox you can confidently apply to real customer problems.

I also keep a running toolkit of references and tools that inform my practice: Context window as a concept helps me size and sequence information. Visual and video tools like Midjourney and Sora expand how I think about multimodal experiences. I rotate between Claude by Anthropic and ChatGPT by OpenAI depending on task fit, and I’ve used Claude Code when I need structured assistance with code review. For knowledge capture and workflow, Readwise and Ghost help me structure insights and ship content.

If you want more structured learning paths, I found Josh Seiden’s Learn AI With Me, A 30-Day Sprint to be a practical primer, and the broader community conversation at Product at Heart Conference is invaluable. For a deeper grounding in risk, I recommend reviewing topics like Hallucination (artificial intelligence), AI bias, and Agentic AI—and revisiting the complementary episode, Context is King.

I’d love to hear how you’re experimenting: Where have you seen AI meaningfully reduce toil? Where does it still struggle? How are you balancing creativity, data safety, and compliance as you scale? Drop a comment below and let’s compare notes—especially on patterns that help product trios move faster without sacrificing trust.

Bottom line: start small at home, carry lessons into the office, and build with curiosity and intentionality. That’s how we level up our product discovery, sharpen our value proposition, and lead teams confidently through the GenAI transition.

Inspired by this post on Product Talk.

November 4, 2025
From Chaos to Consistency: How I Built a Scalable AI Content Design Agent with RAG

It’s Monday morning, and my Slack and email are already overflowing with content requests: “Can you review this flow?”; “Can you rewrite this screen?”; “Can you name this feature?” I’m not freshly back from holiday—this is just a regular work week kicking off. If you’ve ever been a solo content designer supporting multiple teams, you’ll recognize the pressure. The pipeline for content in product design is always full, and the demand for expertise never stops.

Fixing this isn’t just a matter of better time management or incremental process tweaks. To truly scale, I needed to extend my reach by bringing AI into the design process—without sacrificing judgment, standards, or quality. That Monday morning, I realized I had to scale my skills, my judgment, and our systems, not just my calendar.

Building AI is fundamentally about building systems. I wanted to use AI to scale myself without devaluing critical thinking or flooding the product with generic, verbose content. I also knew a useful AI tool must do more than spit out microcopy—it has to plug into a system we can continually shape. As a content designer, the system is always the starting point. Strong design systems create strong content standards; then AI agents can produce content that meets those standards at speed, freeing me from the bulk of standardized work. That’s not a threat—it’s an advantage. To instruct AI well, our systems must be well constructed.

I often think about this work like a bakery. You need a recipe before you can make a loaf of bread. Most interface content churns out the same loaf, day in and day out. It’s better for the master bakers to focus on the unique, custom bakes—and how the recipe needs to change. With that mindset, I set out to build an AI content design agent.

Inside the Content Design Agent workspace, a clean chat UI titled VERBI pairs a central prompt box with chips for writing, editing, and reviews, plus clear controls to view permissions and open the agent setup for product teams.

When I started this project back in May 2025, many LLMs still had frustrating limitations. Google Gemini let me build a custom Gem agent, but I couldn’t share it with other users. ChatGPT could be customized, but only with static files: I couldn’t point it to live, updatable URL sources. I settled on Glean for three simple reasons: everyone at the company had access; Glean could access all internal documentation and treat URLs as sources of truth; and its then-new Agents feature made AI search customizable. Configuring an agent in Glean is straightforward—you choose a trigger, a set of prompts, and a set of actions—but first I needed to get the inputs right.

AI agents need focus. We had a wealth of internal information at Intercom, but not all of it was current or reliable. I curated exactly what the agent could access and assembled a tightly governed knowledge collection in Glean. Only essential information made the cut: the Intercom style guide—our definitive house style, including regularly-broken rules like “always write in US English” and “use sentence case everywhere”; tone of voice guidance for how we show up across mediums; a product glossary with hundreds of feature names and writing conventions; a monetization glossary for prices, plans, and add-ons; product marketing messaging guides with positioning for every feature and launch; core research insights across the product; and fin.ai and intercom.com/suite as the official, most up-to-date messaging sources.

This is classic RAG (retrieval-augmented generation) in action, ensuring every answer is grounded in approved sources of truth. With the collection in place, I instructed the agent to prioritize these resources above anything else.
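
As a rough illustration of the retrieval step, the sketch below ranks passages from a small curated collection and builds a prompt that cites only those passages. It is not how Glean retrieves content. The `COLLECTION` contents, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring are assumptions for this example, and a production system would more likely use embeddings or a search index.

```python
import re

# Hypothetical curated collection: source name -> approved text.
COLLECTION = {
    "style-guide": "Always write in US English. Use sentence case everywhere.",
    "product-glossary": "Fin AI Agent is the product name. Do not abbreviate it.",
    "tone-of-voice": "Be direct and clear in the UI. Adjust the tone elsewhere.",
}


def tokens(text: str) -> set[str]:
    """Lowercase word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, collection: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Rank curated passages by word overlap with the question; drop non-matches."""
    q = tokens(question)
    scored = [(len(q & tokens(text)), name, text) for name, text in collection.items()]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, then by name
    return [(name, text) for _, name, text in scored[:top_k]]


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Ground the model by listing the approved passages ahead of the question."""
    sources = "\n".join(f"[{name}] {text}" for name, text in passages)
    return f"Answer using only these approved sources:\n{sources}\n\nQuestion: {question}"


question = "Should headings use sentence case?"
passages = retrieve(question, COLLECTION)
print(build_prompt(question, passages))
```

Because `retrieve` only ever sees `COLLECTION`, anything outside the curated set cannot reach the prompt, whatever the model knows from general training.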

Step into a clean, no-code builder that shows how to assemble a Content Design Agent: kick off with a chat-trigger, run a company search, then respond with expert guidance, all guided by a simple starter checklist.

Then came the fun part—building and branding the agent. “Content Design Assistant” felt bland, so I named it VERBI, a nod to its “verbal” design job. When people interact with VERBI, they usually begin with a question, but the intent varies widely. I defined a set of task prompts to guide expectations and outputs: “Can you write this?”; “Can you edit this?”; “Can you review this?”; “Can you name this?”; “Give me options”; “Give me guidance”; “Give me strategy”; “Give me research.” This mirrors the real breadth of content design, from creation to critique to discovery.

To manage responses, VERBI needed three things: start with a specific task prompt; understand how to draw on the right resources each time; and connect with other systems. With task prompts defined, I wrote a detailed system prompt covering the essentials. Role: you are a content designer, supporting product designers. Employer: Intercom (consisting of Fin AI Agent and our next-gen Helpdesk). Resources: content design collection, research collection, Storybook design system. Tone of voice: follow a specific tone for our UI, adjust the tone for everything else. Components: for UI, use the specific guidelines in our design system only. Use cases: writing, editing, critiquing, naming, researching, and more.
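
One way to keep a system prompt like this maintainable is to store its parts as structured fields and render them into text. The sketch below is an assumption about how that could look, not the way VERBI is configured in Glean. The `AgentSpec` class and its method are hypothetical, and the field values paraphrase the description above.

```python
from dataclasses import dataclass


@dataclass
class AgentSpec:
    """Structured fields that render into a system prompt (hypothetical shape)."""
    role: str
    employer: str
    resources: list[str]
    tone: str
    components: str
    use_cases: list[str]

    def to_system_prompt(self) -> str:
        lines = [
            f"Role: {self.role}",
            f"Employer: {self.employer}",
            f"Resources: {', '.join(self.resources)}",
            f"Tone of voice: {self.tone}",
            f"Components: {self.components}",
            f"Use cases: {', '.join(self.use_cases)}",
        ]
        return "\n".join(lines)


# The task prompts users can start from, as listed in the article.
TASK_PROMPTS = [
    "Can you write this?", "Can you edit this?", "Can you review this?",
    "Can you name this?", "Give me options", "Give me guidance",
    "Give me strategy", "Give me research",
]

spec = AgentSpec(
    role="You are a content designer, supporting product designers.",
    employer="Intercom (consisting of Fin AI Agent and our next-gen Helpdesk)",
    resources=["content design collection", "research collection", "Storybook design system"],
    tone="Follow a specific tone for our UI, adjust the tone for everything else.",
    components="For UI, use the specific guidelines in our design system only.",
    use_cases=["writing", "editing", "critiquing", "naming", "researching"],
)

system_prompt = spec.to_system_prompt()
print(system_prompt)
```

Keeping each field separate means a change to one part, such as adding a resource, does not require rewriting the whole prompt by hand.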

One connection mattered most: our design system, recently rebranded as “Surge.” Surge contains detailed content guidelines for every component in our product UI, from accordions and banners to tabs and tooltips. That granularity took months of human effort to codify, and it paid off. Designers no longer guess how to write for a toggle, a button, or a tooltip—and now VERBI understands and enforces those rules, too. A great content design assistant isn’t just a clever system prompt; it needs deep, component-level guidance to retrieve.

UI documentation showcases the Badge component’s content rules, teaching how to name statuses, define types, and apply color so labels read clearly. A handy visual for building a content design agent and ensuring consistent product messaging.

Accessing the design system wasn’t simple at first. It lives in Storybook, which Glean couldn’t access directly. I started by scraping guidance from Storybook into an HTML file with Cursor and uploading it to VERBI—a functional but clunky workaround that required re-scraping every few days. Then our IT team stepped in. They used the Glean Indexing API to turn Storybook into a live data source. Now VERBI connects to Storybook directly. Ask it something ultra-specific, like the correct date format for Japan, and it returns the right answer. That integration elevated the agent from helpful to indispensable—human-level precision, 24/7, at scale.

With prompts and resources in place, I launched VERBI and pressure-tested it. It was accurate and well-informed most of the time, but like any AI agent, it had quirks. I needed it to act as a gatekeeper, not a brainstorming partner that might bend rules or invent new ones. So I added a few explicit guardrails to the system prompt. Stopping sycophancy: “Inform, challenge, and assist. Never placate. Don’t agree by default. If something’s wrong, say so. Challenge assumptions.” Halting hallucinations: “If you don’t find the information required in our resources, say you don’t know the answer. Don’t guess and don’t give answers based on general knowledge.” Avoiding verbosity: “Keep answers short and to the point. Cut the fluff. Skip all niceties and social padding. Only give longer answers if the user asks you to.” These constraints keep responses crisp, correct, and consistent. Like any living system, the prompt needs occasional tune-ups, but the maintenance is minor compared to the upside.
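
The guardrails above live in the system prompt, but the hallucination rule can also be backed by a check in code. The following is a minimal sketch under that assumption. The `with_guardrails` and `answer_or_decline` helpers are hypothetical, and `generate` stands in for whatever model call the agent platform makes.

```python
from typing import Callable

# Guardrail text quoted from the system prompt described above.
GUARDRAILS = {
    "sycophancy": "Inform, challenge, and assist. Never placate. Don't agree by default.",
    "hallucination": "If you don't find the information required in our resources, "
                     "say you don't know the answer.",
    "verbosity": "Keep answers short and to the point. Cut the fluff.",
}


def with_guardrails(system_prompt: str) -> str:
    """Append every guardrail to the base system prompt as a bulleted rule."""
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS.values())
    return f"{system_prompt}\n\nGuardrails:\n{rules}"


def answer_or_decline(passages: list[str], generate: Callable[[list[str]], str]) -> str:
    """Back the hallucination rule in code: with no sources, never call the model."""
    if not passages:
        return "I don't know. I couldn't find this in our resources."
    return generate(passages)


# Stub generator for the demo; a real one would call the model.
def echo(passages: list[str]) -> str:
    return f"Based on {len(passages)} source(s): {passages[0]}"


print(with_guardrails("Role: content designer"))
print(answer_or_decline([], echo))
print(answer_or_decline(["Use sentence case everywhere."], echo))
```

A prompt instruction can be ignored by the model. The empty-sources check cannot, which is why it helps to pair the two.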

Where we are now: VERBI has been triggered 700+ times since launch. The benefits are tangible. For me, quality scales without constant policing; repetitive questions about naming, style, or punctuation have dropped significantly. I reclaim time because the agent drafts and checks V1 content across teams, enabling me to focus on higher-impact work. For the design team, iteration is faster, confidence is higher, and strategic clarity improves because shared language and grounded guidelines make decisions easier and more consistent.

I used to spend too much time mopping up basic content mistakes and untangling spaghetti-like UI copy prone to human error. VERBI removes those errors at the source. The real advantage is speed: we get from blank slate to a high-quality first draft quickly, which means we can spend our energy deciding whether the content is right, not just “good enough.” Design is the whole interface—words, visuals, interactions—so reviews now happen with real content, never “copy TBD.” Our principle to sweat the details applies equally whether work is human-made or AI-assisted.

Knee-jerk critiques of AI-driven content design often assume teams generate content from nothing and ship it. In reality, great AI is the outcome of great human decisions and strong systems. Its value is pulling us together faster—getting us to a complete, standards-compliant design we can review as a team before sharing it with the world. That’s how AI helps us win: by turning chaos into consistency, and consistency into velocity.

Inspired by this post on The Intercom Blog.

October 31, 2025
What I Learned from Trainline’s Agentic AI: Building a Trusted Travel Assistant at Scale

Over the past year, I’ve been shipping agentic AI into production and coaching product teams on what it really takes to make these systems trustworthy in the wild. One story that crystallizes the playbook comes from Trainline’s move to an agentic architecture for travel assistance—an approach that mirrors what I’ve seen work in high-stakes, real-time customer experiences.

Trainline—the world’s leading rail and coach platform—helps millions of travelers get from point A to point B. Now, they’re using AI to make every step of the journey smoother.

I studied how "David Eason (Principal Product Manager), Billie Bradley (Product Manager), and Matt Farrelly (Head of AI and Machine Learning)" approached the build of "Travel Assistant, an AI-powered travel companion that helps customers navigate disruptions, find real-time answers, and travel with confidence." Their work exemplifies the kind of end-to-end thinking required to move beyond demos into dependable, on-the-go assistance.

They share how they: Identified underserved traveler needs beyond ticketing; Built a fully agentic system from day one, combining orchestration, tools, and reasoning loops; Designed layered guardrails for safety, grounding, and human handoff; Expanded from 450 to 700,000 curated pages of information for retrieval; Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time; Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go.

I align strongly with their core takeaways: "AI assistants need both scalable reasoning and deep domain context to be useful." "Tool design and guardrails are as critical as prompt design in agent systems." "LLM-as-judge evals make it possible to measure open-ended systems without massive labeling costs." And perhaps most importantly, "Even legacy companies can move fast when they embrace experimentation and tight PM–engineering collaboration."

From an AI strategy perspective, starting "fully agentic" was the right call. When the problem space is dynamic—disruptions, route changes, fare conditions—reasoning loops and orchestration aren’t luxuries; they’re table stakes. Tool selection becomes product design: you need the right retrieval interfaces, constraint-aware planners, and API contracts that are resilient to partial failures. Layered guardrails for safety, grounding, and human handoff reduce hallucination risk while preserving responsiveness—critical when users are standing on a platform waiting for an answer.

The retrieval scale-up—"Expanded from 450 to 700,000 curated pages of information for retrieval"—is a classic inflection point. I’ve seen teams stall here when they treat content growth as a pure indexing problem. The winning move is curation and structure: normalize sources, encode policy-level constraints, and align retrieval chunks to decision boundaries the agent actually uses. That’s how you keep precision high while coverage explodes.

Evaluation is where most open-ended assistants fail quietly, which is why I was encouraged to see this item on their list: "Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time." In practice, LLM-as-judge gives you scalable, scenario-based scoring without prohibitive labeling, while a user context simulator surfaces regressions tied to persona, itinerary state, and device constraints. The combination closes the loop between model behavior, tool layer changes, and UX outcomes.
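
To show how these two pieces fit together, here is a small Python sketch of an eval loop: a simulator enumerates user contexts, and a judge scores the assistant's answer in each one. It is not Trainline's implementation. The persona, itinerary, and device values are invented, and both `stub_assistant` and `stub_judge` are placeholders. A real judge would be an LLM call with a scoring rubric.

```python
from itertools import product
from statistics import mean
from typing import Callable

# Hypothetical dimensions for the user context simulator.
PERSONAS = ["commuter", "tourist"]
ITINERARY_STATES = ["on_time", "delayed", "cancelled"]
DEVICES = ["mobile", "desktop"]

Context = dict[str, str]


def simulate_contexts() -> list[Context]:
    """Enumerate every combination of persona, itinerary state, and device."""
    return [
        {"persona": p, "itinerary": i, "device": d}
        for p, i, d in product(PERSONAS, ITINERARY_STATES, DEVICES)
    ]


def evaluate(
    assistant: Callable[[str, Context], str],
    judge: Callable[[str, Context, str], float],
    question: str,
) -> dict[str, float]:
    """Score the assistant in every simulated context; average per itinerary state."""
    by_state: dict[str, list[float]] = {}
    for ctx in simulate_contexts():
        answer = assistant(question, ctx)
        by_state.setdefault(ctx["itinerary"], []).append(judge(question, ctx, answer))
    return {state: mean(scores) for state, scores in by_state.items()}


def stub_assistant(question: str, ctx: Context) -> str:
    """Placeholder assistant that only handles cancellations correctly."""
    if ctx["itinerary"] == "cancelled":
        return "Your train is cancelled. Here are alternative routes."
    return "Your train is running."


def stub_judge(question: str, ctx: Context, answer: str) -> float:
    """Placeholder judge: checks that a disruption is named in the answer."""
    if ctx["itinerary"] == "on_time":
        return 1.0
    return 1.0 if ctx["itinerary"] in answer else 0.0


report = evaluate(stub_assistant, stub_judge, "Is my train affected?")
print(report)
```

Grouping scores by itinerary state makes the weak spot visible at a glance: the stub assistant scores well when trains are on time or cancelled and fails whenever a train is delayed.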

On product delivery, the fact that they "Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go" shows mature prioritization. For travel, trust accrues in seconds: fast-enough responses, graceful degradation when upstream data lags, and explicit handoff when confidence dips. This is where guardrails meet UX writing: clear, bounded language signals competence even when the system defers.
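
The degradation and handoff behavior described here can be sketched as a small routing function. This is an illustrative assumption, not Trainline's design. The `Draft` shape, the thresholds, and the response wording are hypothetical, and latency budgets are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    """A candidate answer plus the signals used to decide how to deliver it."""
    text: str
    confidence: float        # 0.0 to 1.0, however the system estimates it
    data_age_seconds: float  # age of the upstream data behind the answer


# Hypothetical thresholds; real values would come from testing.
MIN_CONFIDENCE = 0.7
MAX_DATA_AGE_SECONDS = 120


def route(draft: Draft) -> str:
    """Answer directly, answer with a staleness caveat, or hand off to a human."""
    if draft.confidence < MIN_CONFIDENCE:
        # Explicit handoff when confidence dips.
        return "I can't confirm this reliably. Connecting you to a human agent."
    if draft.data_age_seconds > MAX_DATA_AGE_SECONDS:
        # Graceful degradation: still answer, but say the data may be stale.
        minutes = int(draft.data_age_seconds // 60)
        return f"{draft.text} (Live data is about {minutes} min old and may have changed.)"
    return draft.text


print(route(Draft("Your train leaves from platform 4.", 0.9, 30)))
print(route(Draft("Your train leaves from platform 4.", 0.9, 300)))
print(route(Draft("Your train leaves from platform 4.", 0.4, 30)))
```

The confidence check runs first so that a low-confidence answer is never shown to the user, even with a caveat attached.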

Finally, the organizational pattern matters. The teams that win in agentic AI are cross-functional, experimentation-driven, and ruthless about instrumentation. Tight PM–engineering collaboration, explicit safety thresholds, and an eval stack that mirrors real user journeys are what turn promising architectures into dependable products.

It’s a behind-the-scenes look at how an established company is embracing new AI architectures to serve customers at scale.

If you’re building agentic AI in production, borrow these moves: invest early in tool and guardrail design, scale retrieval with curation not just volume, adopt LLM-as-judge plus context simulation for continuous evaluation, and treat latency and reliability as core product requirements—not afterthoughts. That’s how you ship AI assistance that customers trust when it matters most.

Inspired by this post on Product Talk.

October 30, 2025
Why We’re Building Our Next AI R&D Hub in Berlin—and Hiring 100 to Power Fin’s Growth

I’m excited to share that we’re opening our next R&D hub in Berlin to support significant investment in our AI customer service platform, Intercom, and market-leading AI Agent, Fin. We intend to hire 100 people in Berlin over the year ahead across engineering, AI, data science, product, and design. This move reflects our AI Strategy, our commitment to product management leadership, and our focus on building enduring product-led growth.

We believe that in a short number of years, the vast majority of customer service will be done by AI. Fin is already the world’s best Customer Service Agent. At Pioneer, our recent summit for AI customer service leaders in NYC, we talked about how Fin will become a true end-to-end Customer Agent, extending far beyond service. We showcased how companies like WHOOP, Anthropic, and Lightspeed are already pushing Fin in ways that help them grow their business.

This market opportunity is massive and expanding at unprecedented pace. Our ambition is to earn our place as one of the most successful AI businesses during this wave of AI disruption, and we want more brilliant people on our team to pursue this as aggressively as possible. If you’re motivated by Generative AI, LLMs, and building real products that scale, you’ll find both challenge and impact here.

We are already on track to be one of the fastest growing private software companies. Fin is the primary contributor to this, and is months away from passing $100m in ARR. So far, more than 7,000 businesses have transformed their customer service with Fin, including German companies like electricity provider Ostrom, smart home technology provider tado°, and grocery delivery company Flink, along with global leaders like Vanta, Clay, Lovable, and Miro.

Why Berlin? We’re drawn to the city’s rare blend of deep technical talent and rich creative culture—within a vibrant, globally connected ecosystem close to our R&D hubs in Dublin and London. It’s a place where top-tier engineers and designers thrive, and where ambitious builders from around the world want to relocate and create category-defining products.

Momentum is building: this month-by-month chart shows a consistent rise from the mid-20s to nearly 70% between May 2023 and Sep 2025—signaling strong progress as we expand engineering, AI, and automation at our new Berlin R&D hub.

We needed a new location that would sustain the high ambition and standards held by our world-class AI teams in Dublin and London. Berlin has emerged as one of Europe’s hottest centers for AI talent, with a high density of AI-focused startups, applied research labs, and practitioners who bring exceptional literacy, optimism, and ambition. It’s the right accelerator for our AI hiring and a place to bring in brilliant minds to shape the future of our product and business.

While Intercom’s reach is global with our headquarters in San Francisco, our R&D leadership remains anchored in Dublin, where half of the executive team sits—making Berlin both geographically and strategically an ideal next location for our growth.

This isn’t our first time expanding our footprint; we previously bet on London and are delighted with how that’s been working. When we shared our Berlin news internally, the energy was palpable, with many teammates volunteering to help spin up the hub successfully—including colleagues who helped make London a big success, like Danny. That level of ownership and momentum is exactly what we aim to cultivate in Berlin.

We’re looking for people who thrive in a high-intensity, high-ambition, high-standards environment and want to help build one of the world’s best AI companies. For builders like that, the opportunity for impact, growth, and career progression is extraordinary. As with London and Dublin before it, the early Berlin cohort will have a disproportionate influence on team norms, culture, and long-term outcomes. We are in the middle of a huge disruptive wave with AI, and Fin is one of the leading examples of commercially successful AI applications. Joining Intercom is an opportunity to be part of this disruptive wave, and help us build out our vision for Fin becoming the world’s best Customer Agent.

On a minimalist stage, four speakers share insights on AI research, automation, and engineering as part of a panel tied to Berlin expansion and the launch of a new European R&D hub.

There are plenty of AI companies to join, but our technology and culture set us apart. Any AI product is only as good as the AI layer powering it. Ours is industry-leading, built by a highly talented, ambitious, and technical team of over 40 machine learning scientists, engineers, and designers in Europe who continuously optimize Fin’s performance through cutting-edge research, experimentation, and innovation. Fin’s average resolution rate increases 1% every month. That kind of steady, compounding improvement is exactly what great customer support AI strategy looks like in practice.

We also build in public and share our progress and learnings with the AI community at large. Recently, our Chief AI Officer Fergal Reid and SVP of Engineering Jordan Neill joined leaders from Cognition, Harvey, and Perplexity in San Francisco to share real lessons, challenges, and breakthroughs from building frontier AI products. Our AI team regularly publishes their insights on the AI research blog; from optimizing inference speed and availability, to building our own proprietary models that outperform general purpose models for CX.

Our AI group and the broader R&D org they operate within work at extraordinary scale and speed. We recognize that moving fast can’t be taken for granted—you must fight for it—and we’re doing just that, embracing the capabilities AI tooling brings us to achieve 2x the throughput. One example of this mindset in practice is us “Betting on the future of frontend at Intercom,” making a technology choice that optimizes for our teams’ ability to build high-quality product, fast.

Our design and product teams are world-class and forward-thinking; they’re embracing AI to evolve how they work, as shared in our 3-point framework for AI-driven design and recently presented by Emmet Connolly, our SVP of Design, at this year’s Hatch conference in Berlin. As a product leader, I’m grateful to work alongside brilliant product and design thinkers—it gives me confidence that we’re solving the right problems, solving them well, and driving real impact.


We plan to open our Berlin office space in December or January. To get the office started, we’re hiring Senior Product Engineers, Machine Learning Scientists, Product Managers, Senior Product Designers, Engineering Managers, and Data Scientists immediately. If your craft sits at the intersection of LLMs for product managers, agentic AI, and empowered product teams, you’ll be right at home.

You can learn more about our open roles, company, culture, and locations on our careers site, or feel free to reach out to me, Jordan, Fergal, or Brian directly on LinkedIn if you have any questions.

Some of our engineering team will also be at LeadDev Berlin on November 3rd—come say hi if you’re attending.

I’m looking forward to continuing to build Intercom as one of our generation’s best AI companies—and I’m excited for our expansion into Berlin to be a major contribution to that success.

Inspired by this post on The Intercom Blog.

October 29, 2025
Context Is King: My Playbook to Prep Product Teams for High-Impact AI Collaboration

Context is king in AI-powered product work—and I felt that deeply while digging into “Context is King – All Things Product Podcast with Teresa Torres & Petra Wille.” The conversation affirmed a truth I see daily: AI becomes a powerful teammate only when we give it the right context, just as we do with empowered product teams. When we treat AI like a colleague joining mid-flight—without our company history, industry nuances, or strategy—we instantly unlock better outcomes.


Here’s what stood out and how I’m applying it. First, most AI outputs fail without proper context. That’s not a model problem; it’s a leadership problem. Thinking of AI like onboarding a new intern is the right mental model—start with the minimum viable context, then iterate. Practical first steps matter: decision logs, clear success metrics, and structured documentation. The art is balancing enough context to guide performance without overloading the system. The parallels are striking: the way we create strategic context for product trios and teams is the same way we’ll empower agentic AI systems.

In my teams, we prepare for AI collaboration by operationalizing context. We keep decision logs to capture the why behind choices, use outcome-based success metrics (not just output), and maintain machine-readable documentation that LLMs for product managers can parse reliably. We define guardrails up front: constraints, customer segments, privacy-by-design considerations, and the non-goals that often trip up gen AI. This foundation turns AI from a novelty into a force multiplier for product discovery, product roadmapping, and sprint planning.

I use a simple “context pack” to onboard AI agents and teammates alike: 1) business goals and outcomes, 2) constraints and guardrails, 3) canonical artifacts (like PRDs, journey maps, interview notes), 4) domain vocabulary and definitions, and 5) operating procedures (how we make decisions, when to escalate, what good looks like). Start small, then refine as the AI demonstrates capability. This mirrors great onboarding—and it works just as well for agentic AI as it does for humans.

Not all context is helpful. More isn’t better; the minimum effective context is. I resist the urge to dump our entire Confluence on an AI system. Instead, I progressively reveal relevant details—just like I would with a new PM on a complex problem space. This keeps signals high, noise low, and performance measurable against clear success metrics.

If your org isn’t adopting AI yet, don’t wait. You can become AI-ready now by documenting strategic intent, decision rationale, and definitions in structured, searchable, machine-readable ways. Treat this as core AI Strategy work that strengthens empowered product teams—regardless of tooling—while building your AI product toolbox for tomorrow.

For those who want to explore further, these resources and mentions are a strong complement to the episode’s themes.

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Agentic AI

Teresa’s new podcast, Just Now Possible, on YouTube, Apple Podcasts, and Spotify

Petra’s Coaching Packages

ChatGPT

Henrik Kniberg’s talk at Product at Heart on treating AI agents like interns

Teresa’s webinars on how she built the Product Talk Interview Coach: Behind the Scenes: Building the Product Talk Interview Coach and How I Designed & Implemented Evals for Product Talk’s Interview Coach

Josh Seiden’s blog series about AI

Teresa’s new blog posts: 15 Ways to Use AI at Home (and Fill Your AI Product Toolbox) and 21 Ways to Use AI at Work (And Build Your AI Product Toolbox)

Petra's new blog post: Why Context, Not Just Data, Will Define AI-Ready Product Teams

Have thoughts on this episode or how you’re preparing your teams to collaborate with AI? Leave a comment below—let’s compare playbooks and level up together.

Inspired by this post on Product Talk.

October 28, 2025
Beyond Digital: How AI Transformation Builds Adaptive, Intelligent Organizations That Win

Digital transformation rewired our systems; AI transformation rewires how we learn, decide, and compete. “AI transformation goes beyond automation to create adaptive, intelligent organizations. Discover why it’s the next imperative and how to measure success.” That statement captures what I experience daily: we’re moving from scripted workflows to living systems that improve with every interaction.
When I talk about AI transformation, I’m not describing a tool rollout. I’m describing an operating model where data, models, and product strategy converge to create compounding advantage. In practice, that means agentic AI orchestrating tasks, robust data governance and privacy-by-design from day one, and empowered product teams that ship, measure, and iterate at high tempo.
The imperative is strategic, not merely technical. Markets are compressing cycle times, and customers now expect intelligent experiences by default. Organizations that master AI Strategy and product-led growth will set the pace—using AI for competitive differentiation rather than feature parity.
This shift changes how I build teams and backlogs. I lean on product trios, forward deployed engineers, and tight product discovery loops to reduce uncertainty early. We design for resilience and learning: human-in-the-loop feedback, clear escalation paths, and telemetry that turns every interaction into a hypothesis test.
Governance is a first-class feature. AI risk management, data governance, and threat detection and response sit alongside performance metrics in the same dashboard. We codify guardrails—policy, provenance, and permissions—so innovation scales safely and sustainably.
Measurement is where transformation becomes real. I anchor on outcomes vs output OKRs tied to customer value and revenue impact. At the product layer, I track activation, time-to-value, retention, and adoption by persona. For ML quality, I monitor precision/recall, coverage, hallucination rate, and model drift. In experimentation, A/B testing with a thoughtful minimum detectable effect (MDE) prevents false wins, while Amplitude analytics, Pendo, and Intercom instrumentation expose where guidance or UX writing can unlock activation.
The fastest wins often start in service and sales. A customer support AI strategy can deflect tickets with high-resolution answers while escalating edge cases to humans with full context. CRM integration with HubSpot and a ChatGPT connector enables reps to generate next-best-actions, summarize calls, and personalize outreach, measurably lifting conversion and lowering cost-to-serve.
On the build side, LLMs for product managers and gen AI for product prototyping accelerate discovery cycles. I use CustomGPT workflows to validate value propositions quickly, then harden successful flows with engineering. Throughout, product positioning and a crisp value proposition ensure that what we ship is understandable, differentiated, and priced to match ROI, with consumption-based SaaS pricing where value scales with usage.
If you’re getting started, begin with a single, high-frequency journey, instrument it deeply, and publish transparent OKRs. Pair empowered product teams with clear governance, and iterate toward agentic AI experiences. The payoff isn’t a one-time launch; it’s a continuously learning system—and a culture—that compounds advantage release after release.

Inspired by this post on Pendo – Perspectives.

October 25, 2025

How to Build an AI-Powered SaaS Customer Lifecycle

You may already have AI in onboarding, a support agent answering questions, a churn score in customer success, and automated upgrade prompts. Yet the customer still experiences four separate systems. They repeat their intent, receive messages that ignore unresolved problems, and get treated as an expansion opportunity before they have realized the value they bought.

That is not primarily a model problem. It is a lifecycle design problem. The useful goal is not to put AI at every touchpoint. It is to give each lifecycle decision the right evidence, a permitted action, a measurable outcome, and a clear owner.

Model the lifecycle as customer value states

Most SaaS lifecycle maps are organized around internal stages: marketing qualified, sold, onboarded, supported, renewed, expanded. Those labels tell you which team owns the account. They do not reliably tell an AI system what the customer is trying to accomplish or what should happen next.

Start with customer value states instead. A value state is an evidence-based description of the customer’s current relationship with the product. It should be observable in product behavior, account context, or customer conversations. It should also imply a limited set of appropriate actions.

Customer value state	Evidence to look for	Decision the system can support	Outcome to measure
Seeking first value	The intended job or role is known, but the account has not completed its activation milestone	Choose the next necessary setup step, guide, or human intervention	Completion of the activation milestone and time to value
Establishing repeat value	The first milestone is complete, but the behavior associated with ongoing value is not yet established	Reinforce the next useful workflow without replaying basic onboarding	Repeat completion of the value-producing workflow
Blocked	A failed workflow, unresolved ticket, repeated help request, or explicit expression of confusion is present	Diagnose, resolve, or route the obstacle before sending another growth message	Resolution of the underlying problem, including reopen and escalation signals
Deepening value	More roles, workflows, or relevant capabilities are being adopted after the core job succeeds	Recommend education or adjacent capabilities tied to the customer’s demonstrated need	Use of the additional capability and continued core-product value
At risk of losing value	Expected value behavior has weakened and supporting context points to friction or disengagement	Form a risk hypothesis, select a recovery action, or ask an owner to investigate	Restoration of the value behavior and cohort retention
Expansion ready	The account has achieved a defined outcome and has evidence of an additional role, capacity, or capability need	Present an offer that addresses the evidenced need	Adoption and realized value after expansion, not merely offer acceptance

These are templates, not universal definitions. Your activation milestone must represent the first meaningful result promised by your product. Your expansion milestone must demonstrate value and a relevant new need. Mapping activation and expansion milestones to the value proposition keeps automation anchored to customer progress rather than internal funnel activity.

For each state, write a state contract with six parts:

Entry evidence: the events, attributes, or conversations that make the state plausible.
Exit evidence: what must become true before the customer moves to another state.
Disqualifiers: conditions that suppress an action, such as an unresolved blocking issue.
Allowed actions: what AI may recommend, draft, or execute while the customer is in that state.
Decision owner: the person accountable for the rule and its outcome, even when execution is automated.
Success and guardrail metrics: the intended customer result and the signs that the intervention is causing harm.
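
The six parts of a state contract can be written down as data so that suppression is checked in code rather than left to a prompt. The following is a minimal sketch, not a reference implementation: the class name, field names, evidence labels, and action names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateContract:
    """One customer value state and the rules attached to it (illustrative)."""
    name: str
    entry_evidence: frozenset[str]
    exit_evidence: frozenset[str]
    disqualifiers: frozenset[str]
    allowed_actions: frozenset[str]
    decision_owner: str
    success_metric: str
    guardrail_metrics: tuple[str, ...] = ()


def permitted_actions(contract: StateContract, observed: set[str]) -> set[str]:
    """Return the actions allowed right now, or an empty set if none apply."""
    if not contract.entry_evidence <= observed:
        return set()  # entry evidence incomplete: the state is not plausible
    if contract.exit_evidence and contract.exit_evidence <= observed:
        return set()  # exit evidence present: the customer has left this state
    if contract.disqualifiers & observed:
        return set()  # any disqualifier suppresses every action
    return set(contract.allowed_actions)


# Hypothetical contract for the "Seeking first value" state.
seeking_first_value = StateContract(
    name="seeking_first_value",
    entry_evidence=frozenset({"intended_job_known"}),
    exit_evidence=frozenset({"activation_milestone_completed"}),
    disqualifiers=frozenset({"unresolved_blocking_ticket"}),
    allowed_actions=frozenset({"show_setup_guide", "offer_human_handoff"}),
    decision_owner="onboarding_pm",
    success_metric="activation_milestone_completion",
    guardrail_metrics=("blocking_support_contacts",),
)
```

Keeping the contract as plain data means the owner can review it without reading model instructions, and a disqualifier such as an unresolved ticket wins regardless of what the model proposes.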

A state should not be inferred from one weak signal. A missing login might indicate friction, seasonality, a role change, or successful completion of an infrequent job. Treat it as an observation until supporting evidence changes the recommended action.

Build a decision system, not a collection of copilots

A lifecycle agent needs more than a large prompt and access to several applications. It needs an architecture that turns fragmented customer evidence into controlled decisions. I use five layers to make that architecture explicit.

Identity and permissions: resolve the user, account, workspace, role, plan, and data-access boundary before retrieving context.
Signals: assemble relevant product events, CRM attributes, lifecycle milestones, support conversations, tickets, and prior interventions.
Reasoning: classify the value state, cite the evidence, estimate uncertainty, and choose an allowed next action or abstain.
Action: deliver an in-app guide, answer a question, draft outreach, route work, or request approval according to policy.
Feedback: capture the customer outcome, human correction, escalation, and later state transition so the decision can be evaluated.

The identity layer comes first because customer records rarely share a clean key. A support conversation may identify a person, product analytics may identify a user and workspace, and the CRM may organize the relationship at the account level. If those entities are joined incorrectly, an otherwise capable model can recommend an action using another workspace’s context or attribute one user’s friction to an entire account.

Do not place every available field into every prompt. Retrieve the minimum context needed for the current decision, and enforce the permissions of the requesting user and the action-taking service. For teams using Intercom with ChatGPT, the available read-only connection can expose conversations, tickets, and user data while respecting existing Intercom permissions. That is a useful pattern for exploration and decision support: broaden access to relevant evidence without silently broadening write authority.
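
One way to make "minimum context" concrete is to intersect the fields a decision needs with the fields the requester may read. This is a small sketch under assumed names: the decision keys, field names, and the idea of a per-decision allow-list are illustrative and do not describe any specific product's API.

```python
# Hypothetical allow-list: which fields each decision is permitted to use.
FIELDS_BY_DECISION = {
    "onboarding_next_step": {"role", "intended_job", "completed_setup_steps"},
    "expansion_offer": {"plan", "achieved_outcomes", "open_tickets"},
}


def build_context(decision: str, record: dict, readable_fields: set[str]) -> dict:
    """Keep only fields that are both needed for the decision and readable."""
    needed = FIELDS_BY_DECISION[decision]
    return {
        key: record[key]
        for key in sorted(needed & readable_fields)
        if key in record
    }


# Illustrative account record with more fields than any one decision needs.
account_record = {
    "role": "admin",
    "intended_job": "weekly reporting",
    "completed_setup_steps": ["create_workspace"],
    "plan": "team",
    "billing_contact": "finance@example.com",
}

context = build_context(
    "onboarding_next_step",
    account_record,
    readable_fields={"role", "intended_job", "plan"},
)
```

Here `completed_setup_steps` is dropped because the requester cannot read it, and `plan` and `billing_contact` are dropped because the decision does not need them.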

The reasoning layer should return a structured decision record, not just fluent text. At minimum, store:

The proposed customer value state.
The specific evidence used and when it was observed.
Contradictory or missing evidence.
The recommended action and its expected customer outcome.
The policy that permits the action.
The confidence or abstention reason.
The human or system owner.
The condition that makes the recommendation stale.

This record gives you something an operator can inspect and something an evaluation system can score. It also prevents a recommendation from surviving after the facts change. An upgrade prompt prepared before a serious support issue, for example, should expire when that issue appears.
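
The decision record and its expiry condition can be sketched as a small data structure. The field names below mirror the list above, but the class names, event names, and the rule that "any listed event after creation makes the record stale" are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    description: str
    observed_at: datetime


@dataclass(frozen=True)
class DecisionRecord:
    proposed_state: str
    evidence: tuple[Evidence, ...]
    contradictory_or_missing: tuple[str, ...]
    recommended_action: str
    expected_outcome: str
    permitting_policy: str
    confidence_or_abstention: str
    owner: str
    stale_if: frozenset[str]  # event names that invalidate the recommendation
    created_at: datetime


def is_stale(record: DecisionRecord, events: list[tuple[str, datetime]]) -> bool:
    """Stale once any invalidating event occurs after the record was created."""
    return any(
        name in record.stale_if and at > record.created_at
        for name, at in events
    )


# Hypothetical upgrade recommendation prepared on 1 March.
upgrade_record = DecisionRecord(
    proposed_state="expansion_ready",
    evidence=(Evidence("seat limit reached", datetime(2025, 2, 27, tzinfo=timezone.utc)),),
    contradictory_or_missing=(),
    recommended_action="draft_upgrade_offer",
    expected_outcome="additional seats adopted",
    permitting_policy="expansion-offers-v1",
    confidence_or_abstention="medium confidence",
    owner="growth_pm",
    stale_if=frozenset({"serious_support_issue_opened"}),
    created_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
)
```

A consumer would call `is_stale` immediately before acting, so that a recommendation prepared earlier is re-evaluated rather than delivered on outdated facts.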

The feedback layer must record more than whether somebody clicked. Capture whether the customer reached the intended value state, whether a human changed the recommendation, and whether the intervention created a new problem. A unified measurement layer that connects behavior, funnels, cohorts, retention analysis, and CRM context makes those downstream effects visible across teams.

Automate the next best decision at each lifecycle stage

The same architecture can serve onboarding, support, retention, and expansion, but the evidence and acceptable actions differ. Design each motion as its own decision loop.

Onboarding: optimize for first value, not guide completion

An onboarding system should know the customer’s intended job, current role, completed setup steps, latest product behavior, and activation milestone. Its task is to identify the next necessary step, not to expose every feature.

A practical decision rule has four parts:

Trigger: an eligible account has not yet reached its defined activation milestone.
Action: select an in-app guide, explanation, or human handoff based on the missing prerequisite and observed context.
Suppression: stop the guide after activation, an opt-out, a conflicting workflow, or evidence of a blocking issue.
Measurement: evaluate activation and time to value, with guide completion treated only as a diagnostic signal.
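
The trigger, action, and suppression parts of that rule can be expressed as one function. This is a sketch under stated assumptions: the context fields, the guide identifiers, and the choice to fall back to a human handoff when no guide matches are hypothetical, and measurement happens elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OnboardingContext:
    eligible: bool
    activated: bool
    opted_out: bool
    in_conflicting_workflow: bool
    has_blocking_issue: bool
    missing_prerequisite: Optional[str]


# Hypothetical mapping from a missing prerequisite to an in-app guide.
GUIDE_FOR_PREREQUISITE = {
    "connect_data_source": "guide:connect_data_source",
    "invite_teammate": "guide:invite_teammate",
}


def next_onboarding_action(ctx: OnboardingContext) -> Optional[str]:
    """Return an intervention identifier, or None when nothing should be shown."""
    # Trigger: only eligible accounts that have not reached activation.
    if not ctx.eligible or ctx.activated:
        return None
    # Suppression: opt-out, conflicting workflow, or a blocking issue.
    if ctx.opted_out or ctx.in_conflicting_workflow or ctx.has_blocking_issue:
        return None
    # Action: pick the guide for the missing prerequisite; otherwise hand off
    # to a person (an assumed fallback, not something the rule prescribes).
    return GUIDE_FOR_PREREQUISITE.get(ctx.missing_prerequisite, "handoff:human")
```

Returning None for suppressed cases keeps "do nothing" an explicit, loggable outcome instead of an accidental one.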

A personalized tour can still fail if it teaches a workflow unrelated to the customer’s goal. Conversely, a user can skip the tour and activate successfully. That is why the state transition matters more than interaction with the onboarding surface.

Support: resolve the problem in its product context

Support is a strong place to begin because the customer’s intent is explicit, the context is relatively rich, and the result can be observed. Contextual in-app help combined with agentic AI can diagnose an issue, retrieve relevant knowledge, and guide the customer without forcing a channel switch.

The agent should distinguish among an information gap, a product defect, a permissions problem, a configuration problem, and a request for a capability that does not exist. Each requires a different response. A confident but irrelevant answer can lower ticket volume while leaving the customer blocked, so measure resolution of the problem alongside reopen, escalation, and correction signals.

Give the support agent a clear escalation packet: the customer’s goal, current screen or workflow, relevant recent actions, retrieved evidence, attempted resolution, and reason for escalation. The human should not have to reconstruct the case from a chat transcript.

Retention: produce a risk hypothesis, not a churn verdict

Usage decline by itself is ambiguous. A negative conversation by itself may already be resolved. Combine behavioral change with lifecycle expectations, unresolved friction, account context, and previous interventions before deciding that value is at risk.

The system’s output should explain what changed, why that change matters for this account, which evidence weakens the hypothesis, and what recovery action is appropriate. If the evidence is weak, the next action may be a review task rather than automated outreach.

Measure whether the expected value-producing behavior returns and whether retention improves for eligible cohorts. Also inspect unnecessary interventions. A message sent to a healthy customer is not harmless merely because it was automated; it can confuse the relationship and consume customer-success attention.

Expansion: require proof of value and proof of need

An account reaching a plan limit is not enough to establish expansion readiness. The system should look for two kinds of evidence: the customer has achieved meaningful value with the current product, and an additional role, capacity, workflow, or capability need is now visible.

Then match the offer to that need. Suppress it when a blocking support issue is open, the account has not reached its prerequisite milestone, or the evidence is too uncertain. Feature adoption, outcomes achieved, and time-to-value can serve as readiness signals, but your product team still has to define what those signals mean for each offer.
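
The two-part test, proof of value plus proof of need, together with the suppression conditions, might look like the sketch below. The outcome and need labels, the offer catalogue, and the 0.7 confidence threshold are placeholders; your product team would define the real signals per offer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountEvidence:
    achieved_outcomes: frozenset[str]
    evidenced_needs: frozenset[str]
    open_blocking_issue: bool
    prerequisite_milestone_reached: bool
    evidence_confidence: float  # 0.0 to 1.0, however the team chooses to score it


@dataclass(frozen=True)
class Offer:
    name: str
    required_outcome: str  # proof of value with the current product
    addresses_need: str    # proof of an additional need


def eligible_offers(
    account: AccountEvidence, offers: list[Offer], min_confidence: float = 0.7
) -> list[Offer]:
    """Return offers backed by both value and need, or nothing if suppressed."""
    if account.open_blocking_issue or not account.prerequisite_milestone_reached:
        return []
    if account.evidence_confidence < min_confidence:
        return []  # evidence too uncertain: suppress rather than guess
    return [
        offer
        for offer in offers
        if offer.required_outcome in account.achieved_outcomes
        and offer.addresses_need in account.evidenced_needs
    ]


# Hypothetical catalogue and account.
CATALOGUE = [
    Offer("extra_seats", "weekly_reports_shared", "more_collaborators"),
    Offer("api_access", "weekly_reports_shared", "automation"),
]
healthy_account = AccountEvidence(
    achieved_outcomes=frozenset({"weekly_reports_shared"}),
    evidenced_needs=frozenset({"more_collaborators"}),
    open_blocking_issue=False,
    prerequisite_milestone_reached=True,
    evidence_confidence=0.9,
)
```

Note that hitting a plan limit does not appear as an input on its own; it would only matter once it has been translated into an evidenced need.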

Do not stop measurement at acceptance. Check whether the customer adopts the added capability and continues to receive core value. Otherwise, the system may optimize for short-term conversion while creating future disappointment, downgrade risk, or avoidable support load.

Measure customer outcomes and decision quality separately

AI activity metrics are easy to collect: prompts processed, recommendations produced, messages sent, and conversations deflected. None proves that the lifecycle improved. You need two scorecards.

The first evaluates decision quality before broader release:

State accuracy: does the predicted lifecycle state match the available evidence and the review label?
Evidence grounding: can each material claim in the decision be traced to retrieved customer context?
Action compliance: is the recommended action permitted for this state, user, account, and channel?
Abstention quality: does the system pause when identity, evidence, or policy is insufficient?
Human correction: what do reviewers change, and do those corrections cluster around a specific state or segment?
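
Several of these measures can be computed from a set of reviewed cases. The sketch below is illustrative only: the case fields are assumed, evidence grounding is left out because it needs claim-level annotation, and "abstention recall" is one possible way to quantify abstention quality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewedCase:
    predicted_state: Optional[str]  # None means the system abstained
    label_state: str                # state assigned by the reviewer
    action_permitted: bool          # was the recommended action within policy?
    should_have_abstained: bool     # reviewer judged the evidence insufficient
    reviewer_changed: bool          # reviewer edited the recommendation


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def scorecard(cases: list[ReviewedCase]) -> dict[str, Optional[float]]:
    decided = [c for c in cases if c.predicted_state is not None]
    needed_abstention = [c for c in cases if c.should_have_abstained]
    return {
        "state_accuracy": _ratio(
            sum(c.predicted_state == c.label_state for c in decided), len(decided)
        ),
        "action_compliance": _ratio(
            sum(c.action_permitted for c in decided), len(decided)
        ),
        "abstention_recall": _ratio(
            sum(c.predicted_state is None for c in needed_abstention),
            len(needed_abstention),
        ),
        "human_correction_rate": _ratio(
            sum(c.reviewer_changed for c in decided), len(decided)
        ),
    }


# Four hypothetical reviewed cases.
reviewed = [
    ReviewedCase("blocked", "blocked", True, False, False),
    ReviewedCase("expansion_ready", "blocked", False, False, True),
    ReviewedCase(None, "at_risk", True, True, False),
    ReviewedCase("at_risk", "at_risk", True, True, False),
]
result = scorecard(reviewed)
```

Grouping the same computation by state or segment would show whether corrections cluster in one place.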

The second evaluates live customer and business outcomes:

Motion	Primary outcome	Useful diagnostic	Guardrail
Onboarding	Eligible customers reaching the activation milestone	Where the activation path stalls by role or use case	Abandonment, blocking support contacts, and unwanted guide exposure
Support	The customer’s problem is resolved	Retrieval quality, escalation reasons, and human corrections	Reopens, incorrect actions, and negative feedback
Retention	Value behavior and cohort retention are restored	Accuracy of risk hypotheses and intervention uptake	Unnecessary outreach and healthy accounts incorrectly flagged
Expansion	The added capability is adopted and produces value	Readiness evidence and offer relevance	Open friction, rapid disengagement, downgrade, or increased support burden

Define the eligible population and denominator before launch. If an onboarding intervention applies only to administrators pursuing a particular use case, evaluate it on that population. Mixing in ineligible users can make a weak intervention appear safe or a useful one appear ineffective.

When you run an experiment, specify the randomization unit, primary outcome, guardrails, minimum detectable effect, and stopping rule before looking at results. Segmentation and disciplined A/B testing with a defined minimum detectable effect help distinguish a real lifecycle improvement from movement in a convenient proxy.
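
To see why the minimum detectable effect should be fixed in advance, it helps to calculate how many units each arm needs. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; it assumes a binary outcome, independent randomization units, and equal arm sizes, and the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Units per arm needed to detect an absolute lift of mde_abs."""
    treated_rate = baseline_rate + mde_abs
    if not (0 < baseline_rate < 1 and 0 < treated_rate < 1) or mde_abs == 0:
        raise ValueError("rates must lie strictly between 0 and 1, with a non-zero MDE")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = baseline_rate * (1 - baseline_rate) + treated_rate * (1 - treated_rate)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical example: 30% baseline activation, 3-point absolute MDE.
n_three_points = sample_size_per_arm(0.30, 0.03)
n_one_point = sample_size_per_arm(0.30, 0.01)
```

Because the required sample grows with the inverse square of the effect, a smaller MDE on a narrowly eligible population can make an experiment infeasible, which is worth knowing before launch rather than after.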

Offline evaluations and live experiments answer different questions. An evaluation tells you whether the system follows policy and makes defensible decisions on known cases. An experiment tells you whether exposing eligible customers to those decisions changes outcomes. You need both before granting more autonomy.

Start with one closed loop and earn autonomy

Do not begin with an autonomous agent spanning acquisition through renewal. Choose one recurring decision with rich context, a reversible action, an observable outcome, and a named owner. Support or a narrowly defined onboarding obstacle often meets those conditions.

Write the decision specification. Define the value state, eligibility rule, evidence, disqualifiers, permitted actions, success metric, guardrails, and owner.
Assemble read-only context. Resolve identity and permissions, retrieve only the evidence required, and expose citations to the operator.
Run in shadow mode. Let the system produce decisions without contacting customers or changing accounts. Review errors, abstentions, and missing context.
Move to assistive mode. Allow the system to draft or recommend while an authorized person approves the action.
Review the loop regularly. Examine outcomes, overrides, permission failures, stale recommendations, and differences across eligible segments. A weekly digest of customer-conversation highlights can keep frontline evidence present in product and go-to-market decisions.
Grant scoped autonomy. Automate only the action types that have stable performance, reliable outcome capture, and a safe recovery path. Keep monitoring and a kill switch in place.

Separate access from authority throughout this sequence. The ability to read an account does not authorize the agent to alter it. Use explicit policies for each action and enforce them outside the model.

Informational actions: summarizing evidence, classifying a state, retrieving approved knowledge, or preparing a brief can often remain read-only.
Assistive actions: drafting outreach, proposing a guide, or recommending a workflow change should remain subject to review until the relevant decision quality is established.
Consequential actions: changing access, contracts, pricing, account status, or customer data can create financial, operational, or irreversible harm. Require an authorized human or a separate deterministic approval workflow rather than relying on model confidence.
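
Enforcing these tiers outside the model can be as simple as a lookup that the model has no way to influence. The action names and the tier assignments below are hypothetical; the point of the sketch is that model confidence is not a parameter of the check.

```python
from enum import Enum


class Tier(Enum):
    INFORMATIONAL = "informational"
    ASSISTIVE = "assistive"
    CONSEQUENTIAL = "consequential"


# Hypothetical registry mapping each action to its tier.
ACTION_TIERS = {
    "summarize_evidence": Tier.INFORMATIONAL,
    "classify_state": Tier.INFORMATIONAL,
    "draft_outreach": Tier.ASSISTIVE,
    "propose_guide": Tier.ASSISTIVE,
    "change_pricing": Tier.CONSEQUENTIAL,
    "change_access": Tier.CONSEQUENTIAL,
}


def authorize(
    action: str,
    human_approved: bool,
    assistive_quality_established: bool = False,
) -> bool:
    """Decide whether an action may proceed; unknown actions are denied."""
    tier = ACTION_TIERS.get(action)
    if tier is None:
        return False
    if tier is Tier.INFORMATIONAL:
        return True
    if tier is Tier.ASSISTIVE:
        # Review is required until decision quality has been established.
        return human_approved or assistive_quality_established
    # Consequential: always needs an authorized human (or a separate
    # deterministic approval workflow, represented here by the same flag).
    return human_approved
```

Denying unregistered actions by default means a newly added tool cannot run until someone has decided which tier it belongs to.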

Privacy-by-design is part of product quality here. Minimize retrieved data, preserve existing access controls, define retention for prompts and decision records, and log who or what authorized every write. If the system cannot identify the account reliably or explain the evidence behind an action, it should abstain.

Key takeaways

Organize lifecycle AI around observable customer value states, not departmental handoffs.
Require every automated decision to include evidence, an allowed action, an owner, an expiry condition, and a measurable customer outcome.
Use AI differently across onboarding, support, retention, and expansion because each motion has distinct evidence and risk.
Evaluate decision quality offline, then test customer and business impact on a clearly defined eligible population.
Begin read-only, move through assisted execution, and grant autonomy one reversible action at a time.

Your first move is straightforward: pick one lifecycle decision customers encounter repeatedly and write its state contract. If you cannot specify the evidence, disqualifiers, owner, and outcome on one page, the decision is not ready for an agent. Once that contract is clear, AI becomes an implementation choice instead of a substitute for product judgment.


October 25, 2025

Enterprise AI Foundations: An Operating Model That Scales

If your company has several promising AI pilots but each one needs a fresh data pipeline, a new security exception, and a different executive sponsor, you do not have a model-selection problem. You have a foundation and operating-model problem.

Your next decision should not be which assistant to launch. It should be which capabilities every AI workflow will share, who owns the decisions around them, and what evidence a workflow must produce before it can act in production. Get those choices right and each use case makes the next one easier. Get them wrong and every pilot becomes a custom integration that happens to contain a model.

Build the foundation around a workflow, not a model

A model is a component. The durable unit of enterprise AI is a workflow: a trigger arrives, the system gathers permitted context, judgment is applied, an action or recommendation is produced, and someone can verify the outcome.

Define that workflow before discussing prompts or agent interfaces. A usable workflow contract should name:

The business owner and the person accountable for the result.
The trigger that starts the work and the evidence that proves it is complete.
The authoritative systems, records, and taxonomies the AI may use.
The identity, tenant, purpose, and permissions attached to each request.
The tools the system may call and the state each tool is allowed to change.
The decisions the model may make, the checks that remain deterministic, and the points that require human approval.
The fallback when data is missing, instructions conflict, a tool fails, or confidence is inadequate.
The business, quality, risk, latency, and operating measures used to judge production performance.

That contract turns a broad ambition such as “use AI in customer operations” into an engineering and product object that can be reviewed. It also exposes false readiness. If nobody can identify the source of truth, approval boundary, or completion event, improving the prompt will not make the workflow production-ready.
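
A lightweight readiness check can expose those gaps before anyone tunes a prompt. The sketch below assumes the contract is kept as a dictionary; the key names are invented to mirror the list above and are not a standard schema.

```python
# Hypothetical keys mirroring the parts of a workflow contract.
REQUIRED_FIELDS = (
    "business_owner",
    "trigger",
    "completion_evidence",
    "authoritative_sources",
    "identity_and_permissions",
    "allowed_tools",
    "model_decisions",
    "deterministic_checks",
    "human_approval_points",
    "fallback",
    "performance_measures",
)


def missing_fields(contract: dict) -> list[str]:
    """List required parts that are absent or left empty."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]


# Illustrative draft contract that is not yet production-ready.
draft_contract = {
    "business_owner": "head_of_customer_operations",
    "trigger": "refund request received",
    "allowed_tools": ["lookup_order"],
    "authoritative_sources": [],  # nobody has named the source of truth yet
}

gaps = missing_fields(draft_contract)
```

An empty list is treated the same as a missing key, so a placeholder entry does not count as an answer.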

Foundation layer	Decision it must settle	Minimum usable artifact
Outcome and workflow	What job starts, what result matters, and who owns it?	Workflow contract, baseline, completion event, and accountable owner
Context and data	Which information is authoritative, current, relevant, and traceable?	Source inventory, schema or taxonomy, lineage, quality checks, and freshness rules
Identity and policy	Who may see or do what, for which tenant and purpose?	Permission map, retention rules, consent requirements, and policy decisions
Reasoning and orchestration	Where may the model interpret, synthesize, plan, or ask for clarification?	Prompts, tool definitions, routing logic, refusal behavior, and approval points
Execution	Which side effects are permitted, validated, and reversible?	Typed tool inputs, deterministic validation, idempotent operations, approvals, and rollback procedure
Evidence and operations	Can the organization reconstruct, evaluate, and support what happened?	Event log, acceptance set, production dashboard, escalation path, and incident owner

The context layer deserves particular attention because it determines what the AI can know. A useful pattern transforms raw records into progressively more meaningful objects, such as elements, highlights, insights, and decision-ready briefs, while preserving a path back to the underlying evidence. This is more dependable than asking a model to rediscover structure from an undifferentiated pile of text every time.

Unified context does not require copying every record into one giant store. It requires consistent identifiers, explicit ownership, documented lineage, predictable retrieval, and policy enforcement across the systems that remain authoritative. The same principle applies to instrumentation. Capture the user, account, intent, sources retrieved, tools requested, policy decisions, output, correction, and final outcome as part of the workflow itself. Measurement built into the foundation is what lets you separate a persuasive demo from repeatable value.

Put model judgment inside deterministic boundaries

Enterprise AI becomes easier to reason about when you stop asking whether an entire workflow should be deterministic or agentic. Most useful workflows need both.

A model can interpret messy language, summarize evidence, match an intent to a known taxonomy, draft a response, or propose a sequence of actions. Deterministic services should establish identity, enforce tenant isolation, evaluate permissions, fetch exact records, validate required fields, perform calculations, control approvals, execute state changes, and write the audit trail.

A safe execution path looks like this:

1. The request enters with authenticated identity, tenant, role, and relevant workflow state.
2. A policy service determines which sources and tools are available for that identity and purpose.
3. Retrieval returns permitted context with identifiers, freshness information, and traceable evidence.
4. The model interprets the request and proposes an answer or tool call.
5. Deterministic code validates the proposed action, required fields, business rules, and current state.
6. The workflow obtains human approval when the consequence or reversibility requires it.
7. The execution service performs the action and records the request, policy decision, inputs, result, and resulting state.
8. The interface shows the user what happened, what evidence was used, and what still requires attention.

The model should not become the authorization layer. Telling an agent in a prompt not to access another tenant is not access control. Never give a broadly privileged tool to a model merely because the instruction text says to use it carefully.

An explicit request-and-adjudicate boundary is stronger: the assistant requests a source or capability, and the surrounding system approves or denies it. MCP-based tool access can support this pattern when the implementation keeps access negotiation visible and auditable. The important design choice is not the protocol alone. It is that a failed policy check cannot be negotiated away by the model.
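A request-and-adjudicate boundary can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Principal`, `ToolRequest`, `ROLE_TOOLS`, and `adjudicate` are hypothetical names, and a real policy service would read its rules from a managed store rather than a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    role: str


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    tenant_id: str  # tenant the model is asking to act on


# Hypothetical policy table: which roles may use which tools.
ROLE_TOOLS = {
    "support_agent": {"read_ticket", "draft_reply"},
    "support_lead": {"read_ticket", "draft_reply", "issue_refund"},
}


def adjudicate(principal: Principal, request: ToolRequest) -> tuple[bool, str]:
    """Deterministic check that runs outside the model."""
    if request.tenant_id != principal.tenant_id:
        return False, "cross-tenant access denied"
    if request.tool not in ROLE_TOOLS.get(principal.role, set()):
        return False, f"role {principal.role!r} may not use {request.tool!r}"
    return True, "allowed"


agent = Principal(user_id="u-1", tenant_id="t-a", role="support_agent")
print(adjudicate(agent, ToolRequest("draft_reply", "t-a")))
print(adjudicate(agent, ToolRequest("issue_refund", "t-a")))
print(adjudicate(agent, ToolRequest("read_ticket", "t-b")))
```

The decision depends only on the authenticated principal and the structured request. Nothing the model writes in its output can change the result, which is the property the prompt-only approach lacks.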

Be especially conservative when a tool can delete records, change access, send an external communication, or commit money. An incorrect draft can be reviewed. An incorrect state change can create customer, financial, privacy, or legal exposure. Until validation, approval, auditability, and rollback are proven, keep the workflow in recommendation mode or execute it in a sandbox.

Version and evaluate the whole behavior

A production release is more than a model name or prompt. Treat the model and its configuration, system instructions, taxonomy, retrieval sources, ranking rules, tool schemas, permission policies, workflow code, approval logic, and evaluation set as one versioned behavior bundle. A change to any member of that bundle can change the result.
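A simple way to treat the bundle as one versioned unit is to fingerprint it. The sketch below assumes the bundle can be described as a JSON-serializable dictionary; the function name `bundle_fingerprint`, the keys, and the version labels are all illustrative.

```python
import hashlib
import json


def bundle_fingerprint(bundle: dict) -> str:
    """Hash a canonical JSON form so a change to any member yields a new ID."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical release description.
release = {
    "model": {"name": "example-model", "temperature": 0.2},
    "system_instructions": "v14",
    "taxonomy": "v3",
    "retrieval_sources": ["kb", "crm"],
    "tool_schemas": "v7",
    "permission_policies": "v5",
    "evaluation_set": "v9",
}

before = bundle_fingerprint(release)
after = bundle_fingerprint({**release, "taxonomy": "v4"})
print(before, after, before != after)
```

Recording the fingerprint with every logged outcome makes it possible to say which exact combination of components produced a result, even when the model name did not change.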

Before exposure grows, test that bundle against cases that represent the real operating boundary:

- A normal request with complete and current context.
- An ambiguous request that should trigger clarification.
- A request for data the user is not permitted to access.
- Stale, missing, duplicated, or conflicting records.
- An instruction embedded in retrieved content that attempts to redirect the agent.
- A malformed tool call or a temporary tool failure.
- A proposed action that violates a business rule.
- A high-consequence action that must stop for approval.
- A case with no supported answer, where refusal or human handoff is correct.

Passing the happy path is capability testing. Passing the boundary cases is operational readiness. Keep the exact failing examples in the acceptance set so the next prompt, retrieval, policy, tool, or model change must face them again.
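An acceptance set can start as a plain list of cases with the behavior each one must produce. The sketch below is a toy under stated assumptions: `toy_system` stands in for the real versioned bundle, and the three cases and their expected labels are invented for illustration.

```python
from typing import Callable

# Each case pairs an input with the behavior the system must show.
ACCEPTANCE_SET = [
    {"id": "normal", "request": {"text": "status of order 9912", "permitted": True}, "expect": "answer"},
    {"id": "ambiguous", "request": {"text": "fix it", "permitted": True}, "expect": "clarify"},
    {"id": "forbidden", "request": {"text": "show payroll", "permitted": False}, "expect": "refuse"},
]


def toy_system(request: dict) -> str:
    """Stand-in for the versioned behavior bundle under test."""
    if not request["permitted"]:
        return "refuse"
    if len(request["text"].split()) < 3:
        return "clarify"
    return "answer"


def run_acceptance(system: Callable[[dict], str], cases: list[dict]) -> list[str]:
    """Return the IDs of failing cases; an empty list means the set passes."""
    return [case["id"] for case in cases if system(case["request"]) != case["expect"]]


print(run_acceptance(toy_system, ACCEPTANCE_SET))
```

A system that only handles the happy path, for example one that always answers, fails the `ambiguous` and `forbidden` cases here. Appending each production failure to the list is how the set accumulates the boundary cases the next release must face.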

Centralize the rails and federate workflow ownership

The centralized-versus-decentralized debate is too blunt for enterprise AI. A purely central team tends to become a queue for domain requests it cannot fully understand. A fully decentralized model asks every product group to rebuild identity, access controls, model routing, evaluation, and observability. My preferred design is centralized rails with federated ownership of workflows and outcomes.

The enterprise AI platform team owns shared capabilities

- Approved model and provider access, routing, version control, and rollback mechanisms.
- Identity propagation, tenant isolation, policy enforcement, secrets, and tool registration.
- Common retrieval, citation, logging, evaluation, red-team, and observability infrastructure.
- Reusable interaction patterns for clarification, refusal, approval, progress, and human handoff.
- Reference architectures, deployment paths, and incident procedures that domain teams can adopt without inventing new controls.

The platform team should expose these as paved paths with clear defaults. Its success is not the number of models connected. It is the number of production workflows that can reuse the same controls without requesting one-off exceptions.

The domain product team owns the job and its evidence

- The workflow contract, baseline, target outcome, and user experience.
- The domain taxonomy, authoritative records, exceptions, and completion criteria.
- The acceptance set and the human judgments needed to calibrate it.
- Adoption, task success, user corrections, operational impact, and workflow economics.
- Training, support, escalation, and the decision to expand, redesign, or stop the use case.

Put builders close to the work during discovery and early production. A product manager and engineer should inspect actual handoffs, shadow runbooks, exception queues, and failure recovery with the people doing the job. The most revealing question is not how the happy path works. It is what people do when the official process stops working. That is where hidden permissions, political handoffs, brittle scripts, and unrecorded judgment usually surface.

The portfolio council owns risk appetite and shared investment

A small cross-functional council can resolve decisions that no single product team should make alone. It should set risk tiers, fund shared capabilities, approve genuine policy exceptions, resolve competing claims on enterprise data, and decide which workflows deserve expansion. It should not review every prompt or become a permanent approval meeting for routine releases.

Decision rights still need named people. The business owner defines the acceptable outcome and fallback. Product owns the workflow and value evidence. Engineering owns execution integrity. Data owners define authoritative context and quality. Security owns identity, access, threat controls, and incident requirements. Legal defines permitted uses of data and relevant external commitments. Operations owns the production runbook and escalation path. Governance maintains reusable policy and risk classification.

I would treat the operating model as incomplete until the organization can answer four questions without forming a new committee: Who can approve this use? Who can block its release? Who is paged or contacted when it fails? Who decides whether it returns to service?

Promote workflows through evidence, not enthusiasm

Do not apply the same controls to every AI feature. Classify a workflow by what it can do and what happens when it is wrong, not by whether it appears in a chat window.

- Assist: The system drafts, summarizes, or retrieves. It cannot change enterprise state, and the user verifies the output before relying on it.
- Prepare: The system gathers evidence and proposes a decision or action. Deterministic checks and an accountable person’s confirmation stand between the proposal and execution.
- Execute: The system changes an internal or external state. It needs least-privilege access, validation, auditability, recovery behavior, and explicit approval wherever the consequence cannot be safely reversed.

A workflow must be reclassified when its data, permissions, audience, or actions change. A drafting assistant does not remain low risk after someone adds a tool that sends the draft automatically.
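Reclassification is easier to enforce when the class is derived from what the workflow can do instead of being recorded once by hand. A minimal sketch, assuming two capability flags are enough to separate the three classes; `WorkflowProfile` and `classify` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowProfile:
    changes_state: bool      # can any attached tool modify internal or external state?
    proposes_actions: bool   # does it recommend a decision or action for confirmation?


def classify(profile: WorkflowProfile) -> str:
    """Map capabilities to the Assist / Prepare / Execute classes."""
    if profile.changes_state:
        return "Execute"
    if profile.proposes_actions:
        return "Prepare"
    return "Assist"


drafting = WorkflowProfile(changes_state=False, proposes_actions=False)
auto_send = WorkflowProfile(changes_state=True, proposes_actions=False)
print(classify(drafting), classify(auto_send))
```

Because the class is computed from the tool inventory, adding an automatic send tool to the drafting assistant moves it to Execute the moment the profile is updated.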

Use promotion gates to stop pilot momentum from substituting for readiness:

- Workflow gate: Is there a named owner, a real trigger, an end-to-end job, a baseline, and an observable completion event?
- Context gate: Are the authoritative records known, permissioned, sufficiently current, and traceable from output back to evidence?
- Behavior gate: Does the versioned system pass its acceptance cases for quality, citations, clarification, refusal, tool use, and policy compliance?
- Operational gate: Are monitoring, escalation, support, incident response, rollback, and user communication ready before production exposure?
- Value gate: Does production evidence show a better outcome for the workflow without an unacceptable increase in corrections, risk, latency, operating load, or cost?

A successful demo does not waive any gate. Neither does executive sponsorship. If the workflow lacks an owner or authoritative context, it remains a discovery project. If it cannot be observed or rolled back, it remains a controlled pilot. If it passes quality checks but produces no meaningful workflow improvement, it should not expand merely because users find it interesting.
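The gate logic is simple enough to write down as an explicit rule. The sketch below is one possible encoding: the function name and status strings are illustrative, and the status returned for a failed behavior gate is an assumption, since the text above does not name one.

```python
def promotion_status(gates: dict[str, bool]) -> str:
    """Return the furthest status the recorded gate evidence supports."""

    def passed(name: str) -> bool:
        # A gate with no recorded evidence counts as not passed.
        return gates.get(name, False)

    if not (passed("workflow") and passed("context")):
        return "discovery project"
    if not passed("behavior"):
        return "hold: acceptance cases failing"  # assumption, not stated in the text
    if not passed("operational"):
        return "controlled pilot"
    if not passed("value"):
        return "do not expand"
    return "eligible for expansion"


print(promotion_status({"workflow": True, "context": True, "behavior": True}))
print(promotion_status({"workflow": True, "context": True, "behavior": True,
                        "operational": True, "value": True}))
```

Note that the function has no parameter for demo quality or sponsorship. The only inputs are the five gates.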

Give every production workflow at least one business measure, one behavior measure, one risk measure, and one operating measure. Depending on the job, these might include verified task completion or rework; citation fidelity, corrections, fallbacks, or latency; blocked unauthorized requests or policy incidents; and escalation load, rollback frequency, or unit cost. Capture the baseline for the same job before release. Without that baseline, productivity claims become opinion.

Use A/B testing only after both variants meet the required safety and policy thresholds. An unsafe treatment should not receive more traffic simply to complete an experiment. Automated graders can help screen large evaluation sets, but a model judging another model is not an independent source of truth. Combine layered evaluations, citations, deterministic checks, and calibrated human review, then inspect disagreement rather than hiding it inside an average score.

Choose one complete workflow and make it earn expansion

Your first production workflow should not be the broadest vision on the strategy deck. Choose the smallest complete loop that delivers a meaningful result and forces the organization to exercise reusable parts of the foundation.

A strong starting workflow has a known owner, an established budget or category, a recognizable trigger, accessible sources of truth, a result you can verify, and a failure mode you can contain. It occurs often enough to produce feedback and has enough friction that a better workflow matters. It should also require capabilities that later use cases can reuse, such as permission-aware retrieval, approval, tool execution, or audit logging.

Then move through the work in this order:

1. Follow the current job from trigger to verified completion, including exceptions and recovery paths.
2. Record the baseline and identify which part requires language judgment rather than ordinary workflow automation.
3. Write the workflow contract, assign its risk class, and name the owner of every consequential decision.
4. Build a thin vertical slice that includes identity, context, policy, model behavior, execution, audit evidence, and fallback. Do not postpone the difficult control layers until after the interface works.
5. Create the acceptance set from real workflow patterns and known failure boundaries, then run it before exposing the workflow to users.
6. Release to a controlled group with production observability, an escalation route, and a tested rollback procedure.
7. Inspect corrections, refusals, tool failures, policy denials, handoffs, and final outcomes. Change a versioned component only when you can evaluate the effect.
8. Promote the workflow only after it clears the relevant gates. Extract the reusable capability before funding a wider set of similar use cases.

This approach also changes roadmap conversations. A new use case should identify what it can reuse, what new domain capability it requires, and which risk boundary it crosses. If every request needs a custom policy, custom retrieval path, custom interface, and custom incident process, you are accumulating projects rather than building a platform.

Key takeaways

- The workflow contract, not the model, is the durable unit of enterprise AI.
- Context needs authoritative sources, permissions, lineage, structure, and production instrumentation before an agent can use it reliably.
- Let models interpret and propose; keep authorization, validation, consequential execution, audit, and rollback deterministic.
- Centralize shared rails while domain teams own workflow outcomes, exceptions, acceptance cases, and adoption.
- Classify risk by data and action, then require evidence at workflow, context, behavior, operational, and value gates.
- Start with one bounded, complete workflow and expand only when its controls and shared capabilities can be reused.

At your next AI roadmap review, replace “Which model should power this?” with a harder set of questions: Who owns the completed job? What context is authoritative? Which permissions apply? Where is judgment allowed? What must be validated? How will failure be detected and reversed?

If those answers are missing, the foundation is the next roadmap item. Select one workflow, build the full control loop around it, and fund the reusable capability it exposes. You will know the operating model is beginning to scale when the next team can ship on those rails without asking the enterprise to accept a new class of exception.

October 25, 2025

How to Govern and Measure an Enterprise AI Agent Portfolio

Your company probably does not have an AI agent shortage. It has a decision problem: which workflows deserve an agent, what authority each agent should receive, and what evidence should earn the next expansion of autonomy.

If those answers live in separate roadmap, security, finance, and compliance reviews, pilots can multiply while accountability disappears. You need one operating model that connects portfolio strategy, executable controls, product analytics, and release decisions. That is how you move from promising demonstrations to agents that create governed, repeatable value.

Build the portfolio around workflows, not agent ideas

Do not begin with a backlog of sales agents, support agents, and operations agents. Those labels are too broad to expose the work, risk, or economic case. Begin with a bounded workflow such as preparing a support response from approved knowledge, reconciling a CRM record, or proposing the next action for an account.

A strong candidate has high frequency, understandable rules, and an outcome you can observe. The task should also have clear start and stop conditions. If different stakeholders cannot agree on what the agent is allowed to do, what a successful result looks like, or when a human must take over, the workflow is not ready for autonomous execution.

Create a one-page agent charter before committing roadmap capacity. It should answer:

- What business outcome should change, and what is the current baseline without the agent?
- Who initiates the task, who receives the result, and who is accountable when it fails?
- Where does the task begin and end? Which adjacent decisions are explicitly out of scope?
- Which systems and data may the agent read, propose changes to, or update?
- What constitutes success for one task instance?
- Which failures are merely inconvenient, and which create privacy, security, financial, legal, or customer harm?
- What is the expected cost per successful outcome, including human review and escalation?
- What evidence will justify continued investment, expanded access, or termination?

This charter forces an important distinction between an output and an outcome. Producing a draft is an output. Resolving the customer issue without a quality regression is an outcome. Updating a record is an output. Improving the accuracy or timeliness of the operating process is an outcome. Fund the latter.

Prioritize candidates across five dimensions: business value, task repeatability, technical tractability, downside risk, and learning advantage. Do not hide those dimensions inside one weighted score. A single number can make a high-value but irreversible action look equivalent to a lower-risk workflow. Keep the dimensions visible so leadership can choose the appropriate entry point.

That entry point should be an autonomy tier, not a binary decision to automate or not automate:

Autonomy tier	What the agent may do	Default control	Evidence needed to advance
Observe	Read approved information, search, classify, or summarize without proposing an external change	Scoped identity, data boundaries, logging, and output evaluation	Reliable retrieval, acceptable quality, and known failure patterns
Propose	Draft an answer, recommendation, plan, or system change	A person reviews and approves before the change affects the workflow	Task-level acceptance, quality, edit burden, cost, and safe escalation behavior
Act reversibly	Execute narrowly defined changes that have a tested recovery path	Allowlisted tools, parameter constraints, feature flags, audit logs, and rollback	Successful execution, low recovery burden, stable economics, and no critical control failures
Act consequentially	Take actions with material financial, privacy, legal, security, or customer consequences	Explicit approval or separation of duties, reconciliation, incident response, and formal risk acceptance	Sustained evidence for the exact task and permission being expanded, plus approval from the relevant control owners

Autonomy should advance by task and permission. An agent may be dependable when reading a CRM and still be unsafe when modifying it. It may execute one reversible update but require approval for another. A good average quality score is not a license to grant broad write access.
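Recording the tier per operation, not per agent, keeps this rule enforceable. A minimal sketch, assuming the four tiers can be ordered and that grants are keyed by system and operation; `Tier`, `GRANTS`, and `may_execute` are hypothetical names, and the CRM operations are invented examples.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    OBSERVE = 1
    PROPOSE = 2
    ACT_REVERSIBLY = 3
    ACT_CONSEQUENTIALLY = 4


# Hypothetical grants: the tier is recorded per (system, operation), not per agent.
GRANTS: dict[tuple[str, str], Tier] = {
    ("crm", "read_account"): Tier.OBSERVE,
    ("crm", "update_phone"): Tier.ACT_REVERSIBLY,
    ("crm", "merge_accounts"): Tier.PROPOSE,  # the agent may only draft this change
}


def may_execute(system: str, operation: str, required: Tier) -> bool:
    """True only if this exact operation has been granted at least `required`."""
    granted: Optional[Tier] = GRANTS.get((system, operation))
    return granted is not None and granted >= required


print(may_execute("crm", "update_phone", Tier.ACT_REVERSIBLY))
print(may_execute("crm", "merge_accounts", Tier.ACT_REVERSIBLY))
print(may_execute("crm", "delete_account", Tier.OBSERVE))
```

An operation with no grant is denied by default, so expanding autonomy always means adding or raising one explicit entry.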

The portfolio should also answer where durable advantage could come from. A prompt wrapped around a generally available model is easy to copy. A workflow that combines proprietary signals, useful feedback, reliable tool orchestration, and deep product integration can improve as it is used. That distinction should affect whether you build a strategic capability, buy a commodity function, or stop the work altogether.

Turn governance policy into controls the agent cannot bypass

A governance document does not govern an agent. Runtime controls do. For every policy statement, identify the control that enforces it, the telemetry that proves it ran, the owner who responds to a failure, and the action that limits the blast radius.

Implement the minimum control set

- Identity and access: give the agent its own identity, apply least privilege, isolate environments, time-box credentials where appropriate, and avoid inheriting a user’s full authority by default.
- Data boundaries: define approved sources, apply PII redaction and data-loss controls, set retention rules, and prevent sensitive content from leaking into prompts, logs, or downstream tools.
- Tool boundaries: allowlist operations and resources, validate parameters, constrain destinations, and reject requests that fall outside the declared business purpose.
- Action safety: require approval for consequential actions, design idempotent operations where possible, test rollback or reconciliation, and provide a kill switch that operations can use without deploying new code.
- Model and application defenses: test prompt injection, ground outputs in approved context, require citations where verification matters, and provide deterministic fallbacks for known failure conditions.
- Change control: version the model, prompt, retrieval configuration, tool definitions, policies, and evaluation set so a regression can be traced to a specific release.
- Operational response: route agent failures into existing monitoring, cybersecurity, incident management, and escalation processes instead of creating a separate shadow operating model.
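Two of the action-safety controls above, idempotent operations and a kill switch, can be illustrated together. This is a sketch under simplifying assumptions: `ActionExecutor` is a hypothetical class, results are held in memory, and the `enabled` flag stands in for a value that operations would change in a configuration store.

```python
class KillSwitchEngaged(RuntimeError):
    """Raised when agent actions have been disabled by operations."""


class ActionExecutor:
    """Executes agent actions at most once per idempotency key."""

    def __init__(self) -> None:
        self.enabled = True       # flipped at runtime, without deploying new code
        self.side_effects = 0     # counts how many times real work was performed
        self._results: dict[str, str] = {}

    def execute(self, idempotency_key: str, action: str) -> str:
        if not self.enabled:
            raise KillSwitchEngaged("agent actions are disabled")
        if idempotency_key in self._results:
            # A retry returns the stored result instead of repeating the side effect.
            return self._results[idempotency_key]
        self.side_effects += 1    # placeholder for the real state change
        result = f"done:{action}"
        self._results[idempotency_key] = result
        return result


executor = ActionExecutor()
first = executor.execute("req-1", "update_phone")
retry = executor.execute("req-1", "update_phone")
print(first == retry, executor.side_effects)
```

Checking the switch before the idempotency cache is a design choice: once actions are disabled, even replays of earlier requests are refused, which keeps the stop condition unambiguous during an incident.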

The audit record should let an authorized reviewer reconstruct what happened without storing secrets indiscriminately. Capture the initiating principal, business purpose, agent and configuration version, relevant input references, retrieved context, access decision, tool request, approval, result, latency, error, and correlation identifier. Protect those records under the same data classification and retention rules as the workflow itself.

Model Context Protocol can provide consistent connective tissue between an agent and enterprise tools, but a common interface does not replace authorization. The protocol may make integrations easier to discover and invoke; your control plane must still decide which agent can call which tool, on whose behalf, for what purpose, with which parameters, and under which approval rule.

Treat each tool call as a privileged business operation. Reading a customer record, drafting a change, and committing that change are separate capabilities. Give them separate permissions. This design makes progressive autonomy possible because you can expand one capability without handing the agent an entire system.

Make ownership explicit before production

The phrase responsible AI becomes empty when everyone is responsible in the abstract. Assign named decision rights:

- The product owner owns the workflow boundary, user outcome, adoption, and roadmap decision.
- The engineering owner owns system behavior, evaluation infrastructure, reliability, rollback, and technical remediation.
- The system and data owners approve access, permitted operations, data classification, and retention.
- Security, privacy, compliance, and legal owners define or approve controls in their domains. Consequential use cases should not proceed on product judgment alone.
- The operational owner responds to incidents, handles escalations, and confirms that recovery procedures work.
- The accountable executive accepts residual risk when the business chooses to expand consequential autonomy.

Every production agent should therefore have a business owner, technical owner, control tier, tool inventory, escalation path, and service expectation. Deferring security, compliance, and governance creates retrofit work precisely when pressure to scale is highest. Put these fields in the product definition, not in a document assembled after launch.

Measure successful outcomes, not model activity

Token volume, raw completions, and average latency tell you that the system is active. They do not tell you that it is useful. The measurement system must connect agent behavior to task quality, business impact, economics, risk, and adoption.

Start by defining success for one task instance. The definition must be observable and strict enough to reject plausible-looking failure. A support task might require an accurate resolution that passes the quality check. A CRM task might require the correct record, required fields, no duplicate, and a successful write. A proposed campaign might count only after an authorized person accepts it. The exact test will differ, but the unit of value cannot be the presence of an answer.

Build the scorecard in layers:

- Business outcome: incremental conversion, retention, satisfaction, revenue, cost reduction, risk reduction, or another outcome tied to the workflow’s purpose.
- Task outcome: success rate, quality score, time to resolution, containment where containment is desirable, human acceptance, edit burden, and escalation.
- Operational health: end-to-end latency, tool latency, error rate, retries, timeouts, retrieval failures, unavailable dependencies, and recovery time.
- Economics: model usage, retrieval and tool costs, infrastructure, retries, human review, escalations, rework, and incident handling.
- Risk: policy blocks, attempted unauthorized actions, sensitive-data events, unsafe outputs, approval bypasses, audit gaps, and severity-weighted incidents.
- Adoption: eligible users exposed, activation, repeat use, abandonment, manual workarounds, and retention by workflow and persona.

The primary economic metric should usually be cost per successful outcome, not cost per request. Calculate it as total operating cost divided by the number of tasks that satisfy the success definition. Total operating cost should include model and infrastructure spend, retrieval and tool usage, retries, human review, escalation, and attributable rework. An inexpensive call that creates a failed task is not efficient.
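The calculation is short, and writing it down shows how far it can diverge from cost per request. The figures below are made up purely for illustration, and the cost categories are a simplified version of the list above.

```python
def cost_per_successful_outcome(costs: dict[str, float], successful_tasks: int) -> float:
    """Total operating cost divided by tasks that met the success definition."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: cost per outcome is undefined")
    return sum(costs.values()) / successful_tasks


# Illustrative numbers only, for one reporting period.
costs = {
    "model_and_infrastructure": 120.0,
    "retrieval_and_tools": 30.0,
    "retries": 10.0,
    "human_review": 200.0,
    "escalation": 60.0,
    "rework": 80.0,
}
total_requests = 1000
successful_tasks = 400

print(cost_per_successful_outcome(costs, successful_tasks))   # 1.25
print(costs["model_and_infrastructure"] / total_requests)     # 0.12
```

In this invented example the model spend per request looks like 0.12, while each successful outcome actually costs 1.25 once review, escalation, and rework are counted and failed tasks are excluded from the denominator.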

Task success, time to resolution, containment, total cost, and downstream business impact belong in the same measurement model. Keeping them together prevents local optimization. A cheaper model may increase review effort. Higher containment may hide unsafe failure to escalate. Faster responses may reduce answer quality. A useful dashboard makes those trade-offs visible.

Do not automatically treat a human handoff as failure. In a high-risk workflow, escalation may be the correct behavior. Track justified and avoidable handoffs separately. The same principle applies to policy blocks: an increase could indicate more attacks, an overly restrictive control, or a guardrail doing exactly what it should. You need the reason and context, not just the count.

Design measurement for decisions

Every metric should have a decision attached to it. Before exposure expands, record the primary outcome, guardrail metrics, minimum acceptable quality, prohibited failure conditions, cost ceiling, and rollback trigger. If the team plans an A/B test, define the minimum detectable effect: the smallest change that would be meaningful enough to affect the rollout decision. Otherwise, you can run a statistically tidy experiment that cannot answer the business question.
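The minimum detectable effect also determines how much traffic the test needs. The sketch below uses the standard normal-approximation formula for comparing two proportions; it is a planning estimate under that assumption, not a substitute for your experimentation platform's own power calculation, and the function name and default values are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate tasks per variant needed to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical task success rate of 60%, testing two candidate effect sizes.
print(sample_size_per_arm(baseline=0.60, mde=0.05))
print(sample_size_per_arm(baseline=0.60, mde=0.10))
```

Halving the effect you want to detect roughly quadruples the required sample. If the workflow cannot produce that many eligible tasks in a reasonable period, the experiment cannot inform the rollout decision, and that is worth knowing before it starts.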

Compare the agent with the current workflow, not with an imaginary state of perfect automation. Use a controlled holdback when the workflow permits it. Where randomization is impractical or unsafe, establish a credible baseline and document what changed besides the agent. Segment results by persona, task type, channel, tool, and risk tier. Portfolio averages routinely conceal a severe failure in a small but important slice.
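A few lines of segmentation show how an average can hide a failing slice. The task data below is invented for illustration, and the field names `risk_tier` and `success` are assumptions about how outcomes are logged.

```python
from collections import defaultdict


def success_by_segment(tasks: list[dict], key: str) -> dict[str, float]:
    """Success rate for each distinct value of `key`."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for task in tasks:
        totals[task[key]][0] += int(task["success"])
        totals[task[key]][1] += 1
    return {segment: wins / count for segment, (wins, count) in totals.items()}


# Illustrative data: 95 low-risk tasks and 5 high-risk tasks.
tasks = (
    [{"risk_tier": "low", "success": True}] * 93
    + [{"risk_tier": "low", "success": False}] * 2
    + [{"risk_tier": "high", "success": True}] * 1
    + [{"risk_tier": "high", "success": False}] * 4
)

overall = sum(int(task["success"]) for task in tasks) / len(tasks)
print(overall)
print(success_by_segment(tasks, "risk_tier"))
```

Here the overall success rate is 94 percent, while the high-risk tier succeeds on only one task in five. The same function can be pointed at persona, task type, channel, or tool.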

Trace each outcome back to the agent version, prompt, policy, retrieved context, and tool sequence that produced it. This creates a closed learning loop: identify a failure cluster, reproduce it offline, add it to the evaluation set, change the system, verify the fix, and monitor the same cluster after release.

Finally, separate model quality from product adoption. A technically capable agent can still fail because users do not know when to invoke it, what it can access, or when they remain responsible for approval. Instrument the experience around the agent. Onboarding, in-product guidance, activation analysis, retention analysis, and controlled experiments show whether the capability has become part of the workflow rather than a feature users tried once.

Use lifecycle gates to earn autonomy one permission at a time

An enterprise agent should not jump from prototype to unrestricted production. Give each stage a decision, an owner, and predefined pass, hold, and stop conditions. A gate without an explicit decision rule is ceremony.

1. Frame the workflow. Approve the agent charter, baseline, accountable owner, system boundaries, autonomy tier, risk classification, and success definition. Stop if the task cannot be bounded or measured.
2. Build a slim vertical slice. Connect the minimum retrieval, model, orchestration, and tool path needed to complete the task end to end. Create a representative evaluation set and a failure taxonomy before adding speculative capabilities.
3. Validate offline and in a sandbox. Test normal tasks and foreseeable failures, including prompt injection, missing or stale context, malformed outputs, timeouts, duplicate requests, revoked credentials, unavailable tools, and empty retrieval. Confirm that denials, fallbacks, and audit records behave correctly.
4. Run a controlled pilot. Use a defined cohort, feature flags, human approval, and visible escalation paths. Measure task outcomes, economics, risk events, user behavior, and review burden. A friendly cohort is useful only if its tasks still represent the production workflow.
5. Release constrained production access. Start with the narrowest tool scope and lowest safe autonomy. Activate monitoring, incident ownership, rollback, support procedures, and user guidance before increasing exposure.
6. Expand, hold, redesign, or stop. Increase one permission, workflow segment, or cohort at a time. Require evidence for the exact boundary being changed. Revoke access or roll back when a critical control fails, even if average product metrics remain positive.

Production-grade behavior depends on retrieval, tool use, memory and state design, deterministic fallbacks, continuous evaluation, and end-to-end instrumentation. That is why the vertical slice matters. It exposes integration and control failures while the blast radius is still small. A polished conversational layer without the operational path proves very little.

Run the same gate after material changes to the model, prompt, retrieval pipeline, tool definitions, permissions, or data. Passing an earlier evaluation does not prove that a changed system is safe. Version the change, rerun the relevant offline tests, release behind a feature flag, and monitor for regression in the affected task segments.

The operating cadence should make decisions at three levels:

- Delivery decisions: inspect failure clusters, evaluation results, user friction, tool reliability, and the next bounded change.
- Risk and change decisions: review incidents, control performance, permission changes, new data access, vendor or model changes, and unresolved exceptions.
- Portfolio decisions: compare incremental business value, cost per successful outcome, adoption, operational burden, residual risk, and strategic learning across agents.

The executive view should fit on one page per agent: business outcome, current autonomy tier, eligible and active exposure, task success, cost per successful outcome, critical risk indicators, material incidents, current owner, and the next decision. If the review is dominated by tokens, prompts, or model names, it is operating at the wrong altitude.

This structure also gives you a rational way to stop. End or redesign an initiative when the workflow cannot be bounded, users do not adopt it, the economics worsen after retries and review are included, control failures remain unresolved, or the capability offers no strategic advantage over a commodity alternative. Killing an agent that cannot pass its gates is portfolio management, not a failure of ambition.

Key takeaways

- Define the workflow, baseline, accountable owner, and successful outcome before selecting an agent architecture.
- Assign autonomy by task and permission. Reading, proposing, reversible execution, and consequential execution require different evidence and controls.
- Translate every governance policy into an enforceable control, observable event, named owner, and incident response.
- Use cost per successful outcome as the economic denominator, including retries, tools, review, escalation, and rework.
- Evaluate business value, task quality, operational health, risk, economics, and adoption together so one metric cannot conceal harm elsewhere.
- Expand autonomy through lifecycle gates and feature flags, one bounded permission or cohort at a time.

If you need a practical place to begin, select one high-frequency, rules-based workflow with a measurable baseline. Complete the agent charter, start at the propose tier, instrument task success and total cost, and put the vertical slice through the governance gates. Expand only the next permission that the evidence supports. That loop teaches your organization how to make accountable AI decisions, which is more valuable than adding another impressive pilot.

October 24, 2025
