Tag: LLMs for product managers

AI Won’t Replace Engineers—Engineers Using AI Will: A Practical Playbook for Your Next Move

Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.

I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.

Today’s gen AI assistants, such as Claude Code and the ChatGPT connector, excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.

What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI Strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.

Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, a stable or lower change failure rate, and a shorter mean time to restore (MTTR). Pair that with outcomes vs output OKRs to avoid gaming the system: shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
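
To make the four DORA signals concrete, here is a minimal Python sketch that summarizes them from a list of deployment records. The `Deployment` record, its field names, and the `dora_summary` helper are assumptions made for illustration; a real pipeline would pull these timestamps from its CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # first commit in the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it trigger a production incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: int) -> dict:
    """Summarize the four DORA metrics for one observation window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_seconds) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        # None means no restored failure was recorded in the window.
        "mean_time_to_restore_hours": (
            sum(restore_seconds) / len(restore_seconds) / 3600
            if restore_seconds else None
        ),
    }


# Illustrative data only: four deployments over a two-day window.
t0 = datetime(2025, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_at=t0 + timedelta(hours=9)),
    Deployment(t0, t0 + timedelta(hours=12)),
    Deployment(t0, t0 + timedelta(hours=24)),
]
summary = dora_summary(sample, window_days=2)
```

Comparing the same summary for the weeks before and after an AI pilot is one way to check whether the assistant is improving delivery or only moving work around.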

Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.

If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.

How to upskill fast: build an AI product toolbox and ship small. Prototype gen AI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM integration or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Leverage CustomGPT workflows for internal enablement and in-app guides or product tours to onboard users safely.
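
Choosing an MDE matters because it determines how much traffic a test needs. The sketch below uses the common normal-approximation formula for comparing two conversion rates with a two-sided test. The function name, the 5% significance level, and the 80% power default are assumptions for illustration, not values taken from this post.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect an absolute lift of mde_abs.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must stay strictly between 0 and 1, and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Illustrative inputs: a 10% baseline with a 2-point versus a 1-point absolute MDE.
n_two_points = sample_size_per_arm(0.10, 0.02)
n_one_point = sample_size_per_arm(0.10, 0.01)
```

Halving the MDE roughly quadruples the required sample, which is why it helps to agree on the smallest effect worth acting on before the test starts.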

Here’s a pragmatic 90-day plan. Week 0–2: audit your top 10 engineering tasks by time spent; identify 3 that are ripe for AI augmentation. Week 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Week 7–10: productionize the wins; document runbooks; add incident management paths. Week 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.

AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.

Inspired by this post on Product School.

November 12, 2025
A Quality System for Trustworthy AI-Assisted UX Research
Your AI-generated synthesis can be polished, plausible, and wrong. The dangerous failures are rarely obvious fabrications. They are quieter: a biased sample becomes a universal claim, a participant’s opinion becomes a product need, or a tidy theme loses the contradiction that should have changed the roadmap.

If you are deciding whether to trust AI-assisted UX research, do not judge the fluency of the summary. Judge the evidence chain behind it. You need to see how a product decision connects to the participants recruited, the questions asked, the underlying observations, the analytical interpretation, and the behavioral data used to check it.

Key takeaways
- Research quality is mostly determined before an AI tool sees a transcript. Start with the decision, learning question, and hypothesis.
- Use AI to accelerate transcription, extraction, tagging, clustering, and contradiction searches. Keep interpretation, confidence, and product judgment under human control.
- Require every theme to retain its participant coverage, supporting evidence, counterexamples, and unresolved uncertainty.
- Pair qualitative findings with funnels, cohorts, session evidence, and CRM data when those signals are relevant. Neither qualitative nor quantitative evidence should carry the decision alone.
- Finish with an atomic insight and a recorded choice. A summary that does not change a decision, test, or learning priority is not finished research.

Define quality at the decision boundary

Many teams begin AI-assisted research by asking which model should summarize their transcripts. That is too late in the process. The first quality control is the decision the research must inform.

Strong discovery begins with a decision statement, an explicit learning goal, and a hypothesis the team is willing to falsify. Without those constraints, an AI system can generate an impressive taxonomy of themes while leaving the actual product question untouched.

Before recruiting participants or writing prompts, create a short research contract:
- Decision: Name the choice that is genuinely open. Examples include whether to pursue an opportunity, which problem to solve first, or whether a proposed workflow deserves further testing.
- Decision condition: State what you would need to learn to proceed, pause, narrow the audience, or reject the current direction.
- Learning question: Ask about the behavior, context, constraint, or unmet need that makes the decision uncertain.
- Hypothesis: Write the current belief in a form that evidence could disprove. If every possible interview result would support it, it is not a useful hypothesis.
- Relevant population: Specify whose behavior matters to this decision and which segments could experience the problem differently.
- Evidence plan: Identify what interviews can reveal and which behavioral or operational signals could challenge the interpretation.
- Data boundary: Decide what the AI tool is allowed to receive, what must be removed, and who may review the resulting artifacts.
This contract changes how you evaluate the output. You are no longer asking whether the summary sounds reasonable. You are asking whether the evidence changes a named choice under stated conditions.

My standard is simple: a decision-grade insight must survive a skeptical review without relying on the model’s authority. A reviewer should be able to inspect the underlying evidence, see which participants and segments it covers, understand the interpretation applied to it, and identify what remains unknown.

Keep one distinction visible throughout the work:
- Observation: What the participant did, described, showed, or failed to complete.
- Interpretation: What that behavior may mean about a goal, anxiety, constraint, or job.
- Implication: What the product team may choose to change, test, or leave alone.
AI can help produce all three, but it should never blur them into a single sentence. Once an inference is written as if it were an observed fact, the rest of the synthesis becomes difficult to audit.

Protect the signal before AI touches it

An LLM cannot repair a convenient sample or a leading interview guide. It can only reorganize the resulting bias, often in language that makes the bias look more certain.

Recruit for the decision, not for convenience

If you interview only power users, you risk treating advanced workflows as mainstream needs. If you interview only vocal detractors, the roadmap can become a queue of complaints. A more useful recruiting frame includes new users, churned users, people who evaluated but did not convert, and adjacent personas where the decision calls for them.

Build a participant matrix before outreach. Use rows for the segments that could materially change the decision and columns for relevant states, such as adoption stage, conversion outcome, or workflow maturity. The matrix is not a quota formula. It is a visibility tool. It should make overrepresented groups and missing perspectives obvious.
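
As a rough illustration of the matrix as a visibility tool, the following Python sketch counts recruited participants per segment and state and lists the empty cells. The segment names, state names, and participant records are hypothetical.

```python
from collections import Counter


def coverage_matrix(participants, segments, states):
    """Count participants in each (segment, state) cell of the recruiting matrix."""
    counts = Counter((p["segment"], p["state"]) for p in participants)
    return {seg: {st: counts.get((seg, st), 0) for st in states} for seg in segments}


def coverage_gaps(matrix):
    """Return the cells with no participants so missing perspectives stay visible."""
    return [(seg, st) for seg, row in matrix.items() for st, n in row.items() if n == 0]


# Hypothetical recruiting list.
participants = [
    {"code": "P01", "segment": "power user", "state": "converted"},
    {"code": "P02", "segment": "power user", "state": "converted"},
    {"code": "P03", "segment": "new user", "state": "converted"},
    {"code": "P04", "segment": "evaluator", "state": "did not convert"},
]
segments = ["power user", "new user", "evaluator"]
states = ["converted", "did not convert"]

matrix = coverage_matrix(participants, segments, states)
gaps = coverage_gaps(matrix)
```

The output is not a quota to fill. An empty cell is simply a prompt to decide whether that perspective could change the decision, and to record the limitation if it stays empty.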

Carry that segment metadata into synthesis. A theme that appears among established customers should not silently become a claim about evaluators. When a segment is absent, write that limitation into the insight rather than hiding it in an appendix.

Ask for behavior before interpretation

Questions about whether someone likes an idea invite speculation, politeness, and solution theater. Ask about the last relevant event instead. Have the participant reconstruct what triggered it, what they tried, where they hesitated, who else became involved, what workaround they used, and what happened next.

Neutral, behavior-first questions become stronger when participants can support the account with artifacts such as screenshots or workflow examples. The artifact does not automatically prove the interpretation, but it helps distinguish remembered behavior from a general opinion.

Pilot the guide with the product trio. Remove product terminology that telegraphs the preferred answer. Check whether each question could produce evidence against the working hypothesis. If the guide repeatedly asks participants to react to your solution, it is a concept evaluation guide, not an open discovery guide. Label it accordingly.

Set privacy boundaries before uploading transcripts

Consent to an interview does not automatically settle how AI will be used in transcription, analysis, storage, or sharing. Tell participants how their material will be handled, follow your organization’s data governance requirements, and remove identifiers that are not needed for the decision.

Do not place sensitive participant data into an unapproved prompt workflow. If the tool’s handling, retention, or access controls have not been approved, keep raw transcripts out of it and work with appropriately de-identified material in an authorized environment. The downside is not merely a poor synthesis; it is unnecessary exposure of participant and customer information.

De-identification should not erase the context required for analysis. Preserve non-identifying segment labels, workflow stage, and participant codes when they are relevant. The goal is to minimize sensitive data while retaining enough context to audit coverage and interpretation.
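
A minimal de-identification sketch, assuming the team already keeps a mapping from participant names to participant codes. The regular expressions and the `deidentify` helper are illustrative only: they catch simple email and phone patterns, will miss other identifiers, and do not replace a reviewed, approved process.

```python
import re

# Deliberately simple patterns; real transcripts need broader coverage and human review.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def deidentify(text: str, name_to_code: dict[str, str]) -> str:
    """Swap known names for participant codes and mask emails and phone numbers."""
    for name, code in name_to_code.items():
        text = re.sub(rf"\b{re.escape(name)}\b", code, text)
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


# Hypothetical transcript line and participant mapping.
raw = ("Dana Lee (dana.lee@example.com, +1 415-555-0100) said the import "
       "step failed twice during onboarding.")
clean = deidentify(raw, {"Dana Lee": "P07"})
```

The participant code and the workflow detail survive the pass, so coverage and interpretation can still be audited after the identifiers are gone.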

Make AI produce an auditable synthesis

The most reliable workflow separates extraction from clustering and clustering from judgment. Asking for findings, recommendations, sentiment, and a roadmap in one prompt encourages the model to fill gaps and compress uncertainty.
1. Prepare the evidence set. Preserve the original transcript or recording, assign a participant code, attach relevant segment metadata, and remove unnecessary identifiers. Do not let an AI-generated summary replace the underlying material.
2. Extract participant-level observations. Ask the model to work through each participant separately. Capture the behavior or event, its context, the supporting excerpt or evidence location, and any missing information. Do not ask for themes yet.
3. Review the extraction. Check whether the observation is grounded in the transcript and whether the model has converted an opinion into behavior or inferred a motive the participant did not provide.
4. Cluster reviewed observations. Group similar evidence only after the participant-level pass. Require each cluster to retain the contributing participant codes, segment coverage, supporting evidence, and meaningful variations.
5. Search for contradictions. Ask which observations do not fit the cluster, which participants experienced the situation differently, and which alternative explanations remain plausible. Do not treat dissent as noise merely because it makes the summary less tidy.
6. Draft atomic insights. Turn a defensible pattern into a small evidence packet containing the finding, evidence, coverage, contradictions, confidence rationale, product implication, and unresolved question.
7. Triangulate relevant claims. Compare the qualitative interpretation with funnels, cohorts, session evidence, in-product paths, or CRM data when those systems contain a useful signal.
8. Conduct the decision review. A person accountable for the product choice inspects the evidence chain, challenges the interpretation, and records what the team will do or learn next.
You can make the separation explicit with narrowly scoped prompts.

Extraction prompt: Use only the supplied transcript. For each relevant event, return the participant code, observed or reported behavior, context, supporting excerpt, evidence location, and uncertainty. Do not merge participants, infer motives, or recommend a solution. Flag information that is missing.

Clustering prompt: Use only the reviewed observations. Group evidence by shared behavior and context. For every cluster, retain participant codes, represented segments, supporting observations, material variations, counterexamples, and plausible alternative explanations. Do not use repetition in the transcript as a substitute for participant coverage.

Challenge prompt: Review the proposed themes as a skeptical researcher. Identify unsupported generalizations, segment differences that were flattened, interpretations written as observations, contradictory evidence, and claims that cannot be traced to the supplied material. Do not invent missing evidence.

Prompt design helps, but it does not replace review. Keep the prompt, relevant tool or model information, input scope, and human corrections with the research artifact. If the synthesis later changes, you should be able to determine whether the cause was new evidence, a different analytical instruction, or a human judgment.
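
The record-keeping side of the clustering step can be sketched in a few lines of Python. This is not a model call: it assumes a reviewer has already approved each observation and assigned it a behavior tag, and it shows one way to keep participant codes, segments, and excerpts attached to every cluster. The `Observation` fields and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One reviewed, participant-level observation (hypothetical shape)."""
    participant: str
    segment: str
    behavior_tag: str  # label approved during the extraction review
    excerpt: str


def cluster_with_coverage(observations):
    """Group observations by tag while keeping coverage separate from repetition."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs.behavior_tag].append(obs)
    return {
        tag: {
            "participants": sorted({o.participant for o in items}),
            "segments": sorted({o.segment for o in items}),
            "mentions": len(items),  # repetition, not coverage
            "excerpts": [o.excerpt for o in items],
        }
        for tag, items in grouped.items()
    }


observations = [
    Observation("P01", "power user", "manual export", "I dump it to a spreadsheet."),
    Observation("P01", "power user", "manual export", "Exported again on Friday."),
    Observation("P01", "power user", "manual export", "Export is my backup plan."),
    Observation("P02", "new user", "manual export", "I copied the table by hand."),
    Observation("P03", "evaluator", "asked a colleague", "I pinged our admin."),
]
clusters = cluster_with_coverage(observations)
```

In this sample the "manual export" cluster has four mentions but only two participants, which is exactly the gap between repetition and coverage that a reviewer needs to see.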

AI is well suited to accelerating transcription, tagging, theme clustering, Jobs to Be Done extraction, and searches for hesitation or sentiment. Treat the hesitation and sentiment outputs as interpretations to validate, not measurements generated by an objective instrument. A sentiment label is useful only when a reviewer can return to the behavior and language that produced it.

Validate the insight, then record the decision

A good synthesis review is not a copy-edit. It is an attempt to break the claim before the claim influences a roadmap.

Run a quality review against the evidence chain
- Traceability: Can a reviewer move from the insight to the contributing participants and the exact supporting material?
- Coverage: Does the claim name the segments represented, and does it disclose relevant segments that are missing?
- Construct validity: Is the finding about the behavior the study intended to understand, or has a nearby opinion been used as a proxy?
- Separation: Are observation, interpretation, and product implication visibly distinct?
- Contradiction: Does the artifact preserve disconfirming cases and material variations instead of forcing consensus?
- Triangulation: Where behavioral data is relevant, does it support, narrow, or challenge the qualitative account?
- Decision relevance: Does the finding change a live choice, a test, or the next learning priority?
Do not outsource confidence to the model. A confident tone is a language property, not an evidence assessment. Record confidence as a human rationale based on the clarity of the underlying behavior, the relevance and coverage of participants, consistency and counterexamples, and any corroborating behavioral evidence.

Quantitative and qualitative signals answer different parts of the question. Funnels, cohorts, and retention analysis can show where behavior changes or where people leave. Interviews and artifacts can expose the goals, anxieties, organizational constraints, and workarounds behind that behavior. Pairing those signals is how a team moves from observing what happened to developing a testable account of why.

When the signals disagree, do not average them into a vague conclusion. Check whether the interview sample represents the population in the analytics, whether the event instrumentation reflects the behavior being discussed, whether segments have been combined, and whether the evidence refers to the same stage of the journey. A contradiction is often the next research question.

Use an atomic insight format

A reusable insight should be small enough to inspect and complete enough to guide a choice. Use this structure:
- Decision: The product choice this evidence informs.
- Finding: The observed behavioral pattern and the context in which it occurs.
- Evidence: Participant codes, excerpts or artifact locations, and any relevant behavioral signal.
- Coverage: The represented segments and known gaps.
- Interpretation: The best current explanation, clearly labeled as an inference.
- Contradictions: Cases or data that weaken, narrow, or complicate the interpretation.
- Confidence: A short rationale grounded in evidence quality, coverage, consistency, and triangulation.
- Product implication: The opportunity, risk, constraint, or tradeoff the team should consider.
- Disposition: Act, test further, monitor, or take no action.
- Next unknown: The uncertainty most likely to change the decision.
Useful insight records also prevent familiar synthesis mistakes. Replace a broad label such as “onboarding friction” with the specific behavior, actor, context, and consequence. Do not let a memorable quotation stand in for a pattern. Do not describe a participant’s requested feature as the underlying need. Do not convert an AI-generated cluster into a roadmap item until the evidence packet survives review.
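
One way to keep the atomic insight format inspectable is to store it as a small structured record with a few mechanical checks. The sketch below is an assumption about how a team might encode the fields listed above. The class name, field names, and the example content are hypothetical, and the checks only flag missing pieces; they do not judge the quality of the evidence.

```python
from dataclasses import dataclass, field

DISPOSITIONS = ("act", "test further", "monitor", "no action")


@dataclass
class AtomicInsight:
    decision: str
    finding: str
    interpretation: str
    confidence_rationale: str
    product_implication: str
    disposition: str
    next_unknown: str
    evidence: list[str] = field(default_factory=list)          # participant codes, excerpt locations
    segments_covered: list[str] = field(default_factory=list)
    known_gaps: list[str] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)

    def review_flags(self) -> list[str]:
        """List structural problems a reviewer should resolve before the decision review."""
        flags = []
        if not self.evidence:
            flags.append("untraceable: mark as hypothesis")
        if not self.segments_covered:
            flags.append("coverage not stated")
        if not self.confidence_rationale.strip():
            flags.append("confidence has no human rationale")
        if self.disposition not in DISPOSITIONS:
            flags.append("disposition not recorded")
        return flags


# Hypothetical example record.
insight = AtomicInsight(
    decision="Whether to redesign the data import step",
    finding="Two participants exported data manually after the import failed",
    interpretation="Inference: users do not trust the import to finish",
    confidence_rationale="Clear behavior, but only two segments represented",
    product_implication="Import reliability may block activation",
    disposition="test further",
    next_unknown="Does the same pattern appear among evaluators?",
    evidence=["P01 transcript 12:40", "P02 screenshot 3"],
    segments_covered=["power user", "new user"],
    known_gaps=["evaluators"],
)
flags = insight.review_flags()
```

An empty flag list only means the packet is complete enough to review. Whether the interpretation holds is still a human judgment.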

Bring the atomic insights to a decision review with the product trio. Record the choice, its rationale, what the team is deliberately not doing, and the evidence that could reopen the decision. Connect the chosen action to an outcome or learning objective rather than treating delivery of a feature as proof that the research was correct.

For your next study, start with one live decision and run the evidence through this chain. If a theme cannot be traced, mark it as a hypothesis. If participant coverage is lopsided, narrow the claim. If qualitative and behavioral evidence conflict, investigate the conflict before committing the roadmap. That is how AI becomes a fast, inspectable research assistant instead of an unaccountable author of customer truth.

References
- Shivam.Consulting Blog – 5 Costly UX Research Pitfalls I See Often – and How AI + Qual Insights Prevent Them

November 11, 2025
From Sketch to Clickable Demo: My AI Prototyping Playbook to Build Apps in Hours

I’ve spent much of my career compressing the distance between a napkin sketch and something real customers can touch. At HighLevel, my product teams use generative AI to validate ideas faster, reduce risk earlier, and win stakeholder trust with evidence instead of slides. The goal isn’t to be flashy—it’s to be precise, testable, and repeatable.

Today, you can build it before you pitch it. AI prototyping can turn ideas into clickable demos in hours. Here are some tools to try and steps to follow.

I start every AI prototyping sprint by sharpening the problem statement and the outcome we care about. That means being explicit about the target user, jobs-to-be-done, and the riskiest assumptions. I define a minimum detectable effect (MDE) and tie it to outcomes vs output OKRs so everyone aligns on what “good” looks like before we touch a tool.

From there, I move from sketch to interface. I capture a rough flow (whiteboard, tablet, or even paper) and generate UI variations with my AI product toolbox: tools that translate structure into components and screens. I’ll iterate on information hierarchy and copy until the narrative supports the core job, borrowing techniques from UX writing. For product managers leaning into LLMs, this phase is about speed to feedback, not perfection.

Next, I wire data and logic. I connect a lightweight backend or spreadsheet, stitch in a CRM integration if needed, and add LLM calls through a ChatGPT connector or Claude Code. If the concept benefits from multi-step autonomy, I introduce agentic AI to orchestrate tasks across APIs. CustomGPT workflows help me encapsulate business rules so the demo behaves consistently in user paths we care about.

Governance is not optional at this stage. I apply privacy-by-design defaults, document data governance decisions, and run a quick AI risk management pass: input validation, prompt safety, rate limits, and fallback responses. This keeps the prototype credible and prevents false positives from polluting stakeholder perception.
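
For a prototype, the risk pass can be as small as a wrapper around the model call. The Python sketch below covers three of the items named above: input validation, a rate limit, and a fallback response. `GuardedModel`, `echo_model`, and the specific limits are hypothetical stand-ins rather than any vendor API, and prompt safety checks would need their own treatment.

```python
import time
from collections import deque
from typing import Callable

FALLBACK = "Sorry, I can't answer that right now. Please try again shortly."


class GuardedModel:
    """Wrap a model call with input validation, a rate limit, and a fallback."""

    def __init__(self, call_model: Callable[[str], str], max_calls: int,
                 per_seconds: float, max_chars: int = 2000,
                 clock: Callable[[], float] = time.monotonic):
        self.call_model = call_model
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.max_chars = max_chars
        self.clock = clock
        self._recent = deque()  # timestamps of accepted calls

    def ask(self, prompt: str) -> str:
        # Input validation: reject empty or oversized prompts before spending a call.
        if not prompt.strip() or len(prompt) > self.max_chars:
            return FALLBACK
        # Sliding-window rate limit: forget calls older than the window.
        now = self.clock()
        while self._recent and now - self._recent[0] >= self.per_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_calls:
            return FALLBACK
        self._recent.append(now)
        # Fallback response: a model error should not break the demo flow.
        try:
            return self.call_model(prompt)
        except Exception:
            return FALLBACK


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical)."""
    return f"echo: {prompt}"


demo = GuardedModel(echo_model, max_calls=2, per_seconds=60)
```

Passing the clock in as a parameter keeps the rate limit testable without waiting for real time to pass, which matters when the whole prototype is built in an afternoon.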

With a click-through in hand, I instrument the experience so learning compounds. I drop in Amplitude analytics to track activation, task completion, and drop-off, and set up simple A/B testing when there’s a meaningful design or copy choice. This makes the prototype a learning vehicle, not just a demo.

Then I get it in front of users—fast. Five targeted conversations will beat fifty internal opinions. I run structured product discovery interviews, observe time-to-value, and capture objections. This is where empowered product teams shine: we make changes in real time, re-run the flow, and document what moves the needle for product-led growth.

When speed matters, I use a four-hour cadence: Hour 1 for problem framing and MDE; Hour 2 for sketch-to-UI generation; Hour 3 for data wiring and AI logic; Hour 4 for instrumentation and user walkthroughs. By the end, we have a clickable demo, preliminary analytics, and a clear decision on whether to advance, pivot, or park.

Finally, I translate insights into a concise artifact: the hypothesis we tested, the signal we observed, the trade-offs we made, and the next sprint plan for product roadmapping and sprint planning. The point is not to be right on the first try; it’s to learn precisely, cheaply, and quickly enough to invest with conviction.

If you adopt this approach, you’ll find that stakeholder management becomes easier, team energy rises, and your roadmap earns credibility. Build it before you pitch it, and let real interactions—not wishful thinking—do the heavy lifting.

Inspired by this post on Product School.

November 10, 2025
Win AI Search: Proven Playbook to Get Your Startup Recommended by ChatGPT & Perplexity

AI search is quickly becoming the new homepage for startups. When a buyer asks a model for the best tools, they often take the short list at face value. I treat this moment as a product surface I can influence with strategy, content, structure, and distribution—much like any other go-to-market channel.

Early on, I set a simple objective for my team and me: "Learn how LLMs like ChatGPT and Perplexity decide which startups to recommend and what signals help a brand get discovered in AI search." That sentence became our north star for experiments, instrumentation, and content architecture.

Here is the mental model that consistently holds up in practice. Large language models synthesize answers from a knowledge graph built from crawled content, citations, and high-signal sources. They weight consensus, clarity, recency, authority, and machine-readability. I don’t pretend to know the internals, but across hundreds of tests, the same patterns correlate with being surfaced and cited.

First, I make our entity unambiguous. I standardize the company name, product names, and leadership bios across the site and external profiles. I implement Organization and Product markup with schema.org and link out with sameAs to authoritative profiles like LinkedIn, Crunchbase, GitHub, and key directory listings. The goal is to collapse ambiguity so AI search knows exactly who we are and which claims are attributable to us.
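
As a sketch of what the entity markup can look like, the Python snippet below builds a schema.org Organization object with `sameAs` links and wraps it in a JSON-LD script tag. The company name and every URL are placeholders, and the helper function is an assumption for illustration; the real profile links and any additional properties depend on your own listings.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> dict:
    """Build a schema.org Organization description with sameAs profile links."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }


# Placeholder company and profile URLs.
markup = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    same_as=[
        "https://www.linkedin.com/company/example-co",
        "https://github.com/example-co",
    ],
)

# JSON-LD is usually embedded in the page head inside a script tag.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
```

Generating the markup from one source of truth makes it easier to keep the name and profile links identical across pages, which is the point of the standardization step.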

Next, I publish definitive, answer-first pages. For every core query—what we do, who it’s for, outcomes, differentiators, pricing, comparisons, and integrations—I ship a page that leads with a crisp summary, then supports it with evidence, examples, and plain language. I include Q&A sections, realistic use cases, and named case studies so models can quote and ground responses in verifiable facts.

I then make the site maximally machine-readable. I add schema.org for SoftwareApplication, Product, FAQPage, and HowTo where relevant. I keep titles, H1/H2 structure, internal links, and metadata descriptive and consistent. I expose last-modified dates, maintain an XML sitemap, and keep a visible changelog and release notes. Freshness matters—Perplexity, in particular, tends to privilege recent, well-cited material when answering time-sensitive questions.

Citations are non-negotiable. I earn credible mentions on third-party properties, analyst lists, comparison pages, and customer reviews. I prioritize authoritative placements over volume, then make sure our site references those sources to reinforce the signal. When Perplexity cites our page alongside a respected third-party review, our inclusion rate in answers rises noticeably.

I also design for developers, buyers, and machines at once. That means clean docs, integration pages, and transparent security and trust content. Clear API references, integration guides, and reliability notes give models concrete artifacts to summarize. Pricing, privacy, and support policies reduce uncertainty and increase the likelihood that an answer will include us.

Measurement turns this from a hunch into a system. I run controlled content experiments, track minimum detectable effect on discovery and mentions, and instrument referral patterns from AI assistants when citations appear. I monitor which prompts surface our brand, which sources are cited, and which pages are repeatedly used as references. When we move a KPI, we codify the pattern into our playbook and scale it.

Trust is the compounding advantage. I maintain a transparent trust center, privacy-by-design posture, and clear data governance practices. I remove vague claims, back up benefits with evidence, and keep all performance or security statements auditable. Models tend to lift brands that feel low-risk, well-documented, and widely corroborated.

If you want a fast start, here’s the checklist I rely on. Standardize your entity and ship schema.org. Publish answer-first pages for core jobs-to-be-done, comparisons, and integrations. Earn authoritative third-party citations and reference them. Keep release notes, changelogs, and dates current. Instrument AI discovery and iterate based on what gets cited. Do this consistently, and your startup earns a fair shot at being recommended when buyers ask AI for the best options.

Inspired by this post on Amplitude – Best Practices.

November 7, 2025
Prototypes vs Products: How I De-risk Ideas Fast and Ship Reliable Value at Scale

Note: This is part of the product creator series of articles, based on the overview article, The Era of the Product Creator. This series is for anyone who wants to create a successful product, whether or not you’ve had formal training or experience in product management, product design, or engineering.

Over the years, I’ve watched smart teams stumble because they treated a prototype like a product. The distinction is simple but vital: prototypes exist to learn; products exist to earn trust by delivering value reliably at scale. When we blur that line, we ship avoidable risk to customers and slow ourselves down later with rework.

When I build a prototype, I’m testing assumptions as quickly and cheaply as possible. It might be a clickable Figma mock, a Wizard‑of‑Oz demo, or a quick script stitching together a ChatGPT connector with a CustomGPT workflow. It’s intentionally disposable. I expect missing edge cases, fake data, hand‑waving on latency, and limited attention to security or privacy. The only goal is to answer the riskiest questions fast.

A product is a promise. It’s hardened for reliability, performance, security, and privacy‑by‑design. It’s observable with real analytics, supports CI/CD and rollback, meets accessibility guidelines, and can be maintained by empowered product teams. It has clear SLAs, incident management runbooks, and instrumentation that lets me track outcomes vs output OKRs and DORA metrics.

Keeping prototypes and products separate makes us faster and safer. Prototypes accelerate discovery; products operationalize value. If I catch myself “polishing” a prototype, I pause and either discard it or define the path to production with the right engineering rigor, data governance, and stakeholder management.

Here’s how I decide. In prototype mode, I timebox learning to days, not weeks, and focus on a single risky assumption: value, usability, or feasibility. I validate through qualitative research and usability tests, not vanity metrics. To graduate to product work, I require a crisp problem statement, evidence of problem‑solution fit, a technical plan for scale and observability, a privacy and threat modeling review, and a measurement plan (including minimum detectable effect) for upcoming A/B testing.

AI adds new wrinkles. For gen AI and agentic AI, I evaluate model behavior offline before exposing anything to customers. That includes prompt design, context window management, guardrails to minimize hallucinations, and clear fallback strategies. I define red‑team scenarios, logging for auditability, and policies for data retention and encryption as part of AI risk management.

A recent example: we prototyped an agent workflow in a day that felt magical in demos. We resisted the urge to ship. Instead, we added authentication, rate limiting, PII redaction, human‑in‑the‑loop review, observability, and in‑app guides and product tours for onboarding. Only then did we move to a limited release with a well‑defined go‑to‑market strategy and support readiness.

One more trap to avoid: calling a prototype an MVP. An MVP is still a product, minimal in scope but complete enough to deliver value, gather trustworthy data, and support customers. If you wouldn’t put your name on it or support it in production, it’s a prototype, not an MVP.

If you’re a product creator, align your product trios around this discipline. Use prototypes to learn quickly in discovery, and use products to deliver outcomes in delivery. That mindset protects customer trust, speeds iteration, and moves you toward product‑market fit with far less waste.

Inspired by this post on SVPG.

November 7, 2025
AI Context Engineering: A System for Product Decisions
You give an LLM your discovery notes, a dashboard export, and a roadmap question. It returns polished recommendations in seconds. The recommendations sound plausible, yet your product trio still cannot tell which option deserves a commitment.

The missing ingredient is usually not a better prompt. It is a decision-ready context system: a controlled way to give AI the evidence, boundaries, and outcome definition required to reason about the same product decision your team is actually making. Done well, this gives you more than a convincing answer. It gives you a traceable choice, explicit uncertainty, and a validation plan.

Define the decision before you collect the context

For product work, context engineering is the deliberate design of everything an AI system can use at the moment it reasons: customer evidence, metrics, goals, constraints, definitions, instructions, and prior decisions. The useful unit is not a prompt or a document. It is the decision.

This distinction matters because an LLM can answer an underspecified request without exposing that the request was underspecified. Ask it to improve onboarding, and it can produce a credible list of patterns. That output still does not tell you which user segment matters, what improvement means, which current friction is supported by evidence, or what downside the team must avoid.

Before pulling any context, write a decision frame that answers these questions:
- What decision must be made? Name the commitment, not the general topic. “Choose whether to change a specific onboarding step” is a decision; “explore onboarding” is not.
- Who is the decision for? Identify the customer segment, use case, or part of the journey. Evidence from one segment should not silently become a claim about every user.
- What outcome should change? State the behavior or business result you want, then identify the guardrail signals that should not deteriorate.
- What can constrain the answer? Include privacy, risk, brand, commercial, technical, and operational boundaries before ideation begins.
- What evidence could change the choice? If no possible evidence would change the decision, you are asking AI to justify a conclusion rather than help make one.
- What must the output enable? Specify whether you need options, a recommendation, a decision memo, an experiment plan, or a list of unresolved questions.
Anchor this frame in outcomes rather than deliverables. “Improve activation for a defined segment while protecting support load” establishes a decision boundary. “Build a new onboarding checklist” merely names an output. The first lets AI compare interventions; the second encourages it to decorate a predetermined solution.

A practical test is to remove the proposed feature from the frame. If the decision still makes sense, you have probably described an outcome. If the frame collapses, the team may already be committed to an output.
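
One way to make the frame concrete is to treat it as a small record the team fills in before pulling any context. The sketch below is only an illustration under stated assumptions: the class name `DecisionFrame`, its field names, and the example values are hypothetical and simply mirror the six questions above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DecisionFrame:
    """Hypothetical record mirroring the six decision-frame questions."""
    decision: str = ""                  # the commitment, not the general topic
    segment: str = ""                   # who the decision is for
    desired_outcome: str = ""           # behavior or business result to change
    guardrails: list[str] = field(default_factory=list)   # signals that must not deteriorate
    constraints: list[str] = field(default_factory=list)  # privacy, risk, brand, commercial, ...
    evidence_that_could_change_it: list[str] = field(default_factory=list)
    required_output: str = ""           # options, memo, experiment plan, ...

    def unanswered(self) -> list[str]:
        """Return the names of the questions that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; they are not taken from a real product.
frame = DecisionFrame(
    decision="Choose whether to change the workspace-setup step in onboarding",
    segment="New self-serve accounts",
    desired_outcome="Improve activation for this segment",
    guardrails=["support load"],
)
print(frame.unanswered())
```

A frame that still reports unanswered fields is a signal to go back to the questions rather than forward to the model.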

Build a context packet that preserves evidence quality

A context packet is the smallest governed collection of information that allows the model and the product team to reason about the decision. It can combine customer quotes, behavioral trends, funnel friction, support conversations, and commercial constraints. The important work is to assemble, structure, compress, and challenge that evidence before asking for recommendations.

Do not treat every input as the same kind of truth. A customer quote gives you detail about an experience, not its prevalence. Usage analytics show behavior, not necessarily motivation. Support conversations overrepresent people who contacted support. CRM data can expose commercial constraints without proving that a feature creates customer value. Labeling these boundaries prevents the model from blending different signals into false certainty.

Use this structure for the packet:
- Decision header: the choice, decision owner, affected segment, and action that follows the decision.
- Outcome frame: the desired outcome, current signal, primary measurement, guardrails, and any metric definitions needed to interpret the data correctly.
- Evidence ledger: each relevant observation with its origin, segment, time period, and scope. Keep direct observations separate from interpretations.
- Constraints: technical dependencies, commercial commitments, privacy rules, brand boundaries, operational capacity, and known risks.
- Contradiction register: evidence that points in different directions, including differences between customer statements and observed behavior.
- Unknowns: missing evidence, ambiguous definitions, unrepresented segments, and assumptions the team has not validated.
- Output contract: the form of response you need, the criteria options must address, and the unsupported claims the model must label rather than fill in.
Compression is where many context packets either become useful or become misleading. The goal is not merely to shorten the material. It is to increase the proportion of decision-relevant signal without erasing qualifications.
1. Normalize repeated evidence. Deduplicate copied notes and repeated tickets so repetition in the packet does not impersonate independent confirmation. Preserve any real frequency data separately.
2. Retain the qualifiers. Do not compress away the segment, time range, denominator, metric definition, or product state that determines what an observation means.
3. Label epistemic status. Mark material as observation, interpretation, assumption, or generated hypothesis. A concise packet should make these distinctions clearer, not blur them.
4. Keep contradictions visible. If interviews describe one problem while behavioral data points elsewhere, preserve both signals and ask what evidence would resolve the conflict.
5. Remove inert context. My rule is simple: if an item cannot change an option, a risk assessment, or the validation plan, it does not belong in the active packet. Keep it available outside the model context if the team may need to inspect it later.
Apply privacy-by-design while assembling the packet, not after the model has processed it. Customer transcripts, CRM records, and support conversations can contain personal or confidential data. Use approved systems, follow applicable access controls and data terms, redact identifiers, and aggregate where the decision does not require record-level detail. If you cannot establish that the data is permitted in the AI workflow, leave it out and provide a safe summary. The downside is not a weaker prompt; it is potential exposure of customer or company information.
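
The compression steps can be sketched in a few lines of Python. This is a minimal sketch, not a production pipeline: the names `EvidenceItem`, `Status`, `dedupe`, and `redact_emails` are hypothetical, the rule for what counts as a copy (same normalized text, origin, segment, and period) is an assumption, and email masking stands in for whatever redaction your approved systems actually require.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Epistemic status labels from step 3."""
    OBSERVATION = "observation"
    INTERPRETATION = "interpretation"
    ASSUMPTION = "assumption"
    HYPOTHESIS = "generated hypothesis"


@dataclass(frozen=True)
class EvidenceItem:
    text: str
    origin: str    # e.g. "interview", "support ticket", "analytics"
    segment: str   # qualifiers are kept as fields so they cannot be compressed away
    period: str
    status: Status


EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text: str) -> str:
    """Mask email addresses; real redaction must cover other identifiers too."""
    return EMAIL.sub("[redacted email]", text)


def normalize(text: str) -> str:
    """Collapse whitespace and case so copied notes compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(items: list[EvidenceItem]) -> list[tuple[EvidenceItem, int]]:
    """Keep the first copy of each item plus how many copies were collapsed.

    The count records duplication inside the packet. It is not prevalence data.
    """
    kept: dict[tuple, EvidenceItem] = {}
    copies: dict[tuple, int] = {}
    for item in items:
        key = (normalize(item.text), item.origin, item.segment, item.period)
        kept.setdefault(key, item)
        copies[key] = copies.get(key, 0) + 1
    return [(kept[key], copies[key]) for key in kept]
```

Keeping the same complaint from an interview and from a support ticket as two separate items is deliberate: they are different kinds of evidence, and the packet should not merge them.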

Separate synthesis, strategy, and skepticism

Asking for a summary, a recommendation, and a critique in the same instruction makes it difficult to see where evidence ends and invention begins. A stronger agentic workflow separates those jobs into distinct passes: Summarizer, Strategist, and Skeptic.

The Summarizer creates an evidence map

The Summarizer should organize the packet without deciding what to build. Ask it to group evidence around the decision, preserve relevant qualifiers, expose conflicts, and identify missing information. Explicitly prohibit recommendations during this pass.

A useful Summarizer output contains the supported observations, the segments represented, the outcome signals involved, the contradictions, and the unknowns. Review this output against the packet before continuing. If the model has turned an assumption into a fact, fix the evidence map rather than hoping a later pass corrects it.

The Strategist develops decision options

Give the Strategist the approved evidence map, the original decision frame, and the constraints. Ask for a small, meaningfully different set of options, including the option to leave the product unchanged when that is legitimate.

Require the same fields for every option:
- the customer problem or opportunity it addresses;
- the packet evidence that supports it;
- the assumptions required for it to work;
- the expected outcome and guardrail signals;
- the dependencies and material trade-offs;
- the simplest valid way to reduce its largest uncertainty.
This format prevents one option from winning because it received a more persuasive narrative. It also makes unsupported leaps visible. If the model cannot connect an option to evidence, that option can remain an idea, but it must be labeled as a hypothesis rather than presented as a conclusion.
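
If the Strategist returns structured output, the equal-fields rule can be checked mechanically before anyone reads the narrative. The sketch below assumes each option arrives as a dictionary; the key names and the labels `option`, `incomplete option`, and `hypothesis` are hypothetical choices that mirror the six fields above.

```python
REQUIRED_FIELDS = (
    "problem",
    "supporting_evidence",
    "assumptions",
    "expected_outcome_and_guardrails",
    "dependencies_and_tradeoffs",
    "uncertainty_reduction_step",
)


def review_option(option: dict) -> tuple[str, list[str]]:
    """Label an option and list the required fields it leaves empty."""
    missing = [name for name in REQUIRED_FIELDS if not option.get(name)]
    if "supporting_evidence" in missing:
        label = "hypothesis"          # no link to packet evidence
    elif missing:
        label = "incomplete option"   # evidence is present but other fields are not
    else:
        label = "option"
    return label, missing
```

The check only confirms that a field was filled in. Whether the cited evidence really supports the option still needs a human comparison against the packet.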

The Skeptic tries to disconfirm the options

The Skeptic should not produce generic risks. Ask it to find the strongest contrary evidence, the segment that might be harmed, the constraint most likely to invalidate the option, the metric that could be gamed, and the observation that would show the underlying hypothesis is wrong.

Require it to distinguish counterevidence already present in the packet from new conjecture. This matters because a skeptical tone can sound rigorous even when it is unsupported.

The same LLM can perform all three roles, but role prompts do not create independent evidence or independent reviewers. Freeze the context packet used for the loop, label every generated artifact, and keep generated claims out of the evidence ledger until a human verifies them. Role separation is a workflow control, not a guarantee of correctness.

Stop adding passes when the workflow is only rearranging language. The loop has done its job when the team can see the supported facts, viable options, disputed assumptions, material risks, and next evidence needed to decide.
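
The three passes can be wired together as a short pipeline. The sketch below is deliberately model-agnostic and rests on assumptions: `call_model` is a hypothetical function you supply that takes instructions and content and returns text, `human_review` is a hypothetical gate that returns the approved evidence map, and the instruction strings are abbreviated paraphrases of the role descriptions above.

```python
from dataclasses import dataclass
from typing import Callable

INSTRUCTIONS = {
    "summarizer": (
        "Group the evidence around the decision, keep qualifiers, expose "
        "conflicts, and list missing information. Do not recommend anything."
    ),
    "strategist": (
        "Propose a small set of meaningfully different options, including "
        "leaving the product unchanged when legitimate. Use the same fields for each."
    ),
    "skeptic": (
        "Find the strongest contrary evidence for each option and separate "
        "counterevidence in the packet from new conjecture."
    ),
}


@dataclass(frozen=True)
class Artifact:
    role: str
    text: str
    label: str = "generated, not yet verified"  # keeps output out of the evidence ledger


def run_three_passes(
    packet: str,
    frame_and_constraints: str,
    call_model: Callable[[str, str], str],
    human_review: Callable[[Artifact], str],
) -> list[Artifact]:
    """Run Summarizer, Strategist, and Skeptic against one unchanged packet."""
    summary = Artifact("summarizer", call_model(INSTRUCTIONS["summarizer"], packet))
    approved_map = human_review(summary)  # a person checks the map against the packet
    options = Artifact(
        "strategist",
        call_model(INSTRUCTIONS["strategist"], approved_map + "\n\n" + frame_and_constraints),
    )
    critique = Artifact(
        "skeptic",
        call_model(INSTRUCTIONS["skeptic"], options.text + "\n\n" + packet),
    )
    return [summary, options, critique]
```

Making `human_review` a required argument is a design choice: the pipeline cannot run end to end without someone deciding what the approved evidence map is.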

Make the product trio the decision gate

AI can accelerate the reasoning, but it should not become the decision owner. Bring the packet and the three-pass output into a product trio of product, design, and engineering. The purpose of that forum is not to approve the AI recommendation. It is to make the trade-offs explicit and decide what the team is prepared to learn.
1. Verify the evidence boundary. Check whether the represented segments, product states, and metrics match the decision. Ask which customer or operational perspective is absent.
2. Classify the important claims. Mark each claim as supported observation, team interpretation, assumption, or generated hypothesis. If nobody can trace a recommendation back to the packet, treat it as a hypothesis or remove it.
3. Compare trade-offs on equal terms. Evaluate every option against the desired outcome, guardrails, constraints, dependencies, and learning value. Do not let the most detailed option appear strongest merely because the model wrote more about it.
4. Choose the next commitment. The valid outcomes are to proceed, run a discovery or validation step, defer the decision, or reject the options. Assign a human owner and make clear what action the decision authorizes.
5. Record the rationale. Convert the discussion into a concise decision memo rather than forwarding raw model output to stakeholders.
The decision memo should include:
- the decision and why it is being made now;
- the target segment, desired outcome, and guardrails;
- the evidence that carried the most weight;
- the chosen option and the alternatives rejected;
- the trade-offs accepted by the decision owner;
- the assumptions and unresolved questions;
- the validation method and disconfirming signal;
- the owner and trigger for revisiting the decision.
This gives stakeholders something stronger than AI-generated confidence. They can inspect what the choice rests on, where judgment entered, what could prove the team wrong, and when the decision should be reconsidered.

Close the loop with validation and decision memory

Even a well-grounded model output is not product validation. It is a structured hypothesis. Match the validation method to the claim and to the consequence of being wrong.
- For a causal behavior claim: use a controlled A/B test when traffic, instrumentation, and the product experience make that appropriate. Define the primary metric, minimum detectable effect, guardrails, analysis approach, and stopping rules before reading the result.
- For a usability or comprehension claim: use targeted customer interviews or usability evaluation with the relevant segment. AI can help organize notes, but preserve outliers and do not turn a small qualitative sample into a prevalence claim.
- For an operational claim: use a limited release with observability, support monitoring, and an explicit rollback condition. Watch the workflow around the feature, not only the feature interaction itself.
- For privacy, brand, regulatory, or other high-consequence constraints: complete the appropriate human review before launch. A persuasive model assessment is not a substitute for the accountable specialist or decision owner.
For an onboarding decision, for example, the packet may contain segment definitions, observed friction, support themes, and conversion signals. The workflow can propose alternative interventions and measurement plans. The trio still chooses which hypothesis deserves a controlled test, whether the minimum detectable effect is practical, and which activation or retention signals will determine the next move.
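
To judge whether a minimum detectable effect is practical, it helps to estimate how many users each arm would need. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test and equal allocation. That formula is my assumption, not something the workflow prescribes, and the function name, the defaults (5% significance, 80% power), and the example rates are hypothetical. Your analysis approach may call for a different calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate users per arm needed to detect an absolute change of `mde`."""
    p1, p2 = baseline, baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde == 0:
        raise ValueError("baseline and baseline + mde must be rates in (0, 1), mde nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Illustrative only: a 20% baseline activation rate and a 2-point absolute lift.
needed = sample_size_per_arm(baseline=0.20, mde=0.02)
print(f"Users needed per arm: {needed}")
```

Halving the detectable effect roughly quadruples the required sample, which is often the fact that decides whether a test is worth running for a given segment.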

After validation, return the result to the context system. Record what shipped, the observed outcome, affected segments, unexpected behavior, and which assumptions held or failed. Update the decision memo and evidence ledger. Otherwise, the next AI session begins from the same stale assumptions, and the organization pays again to relearn what it already discovered.

That accumulated decision memory is one of the most valuable outputs of context engineering. It turns AI collaboration from isolated prompting into a feedback loop connecting discovery, strategy, execution, and measurable results.

Key takeaways
- Frame the product decision, target segment, outcome, and constraints before asking AI for options.
- Give the model a compressed evidence packet, not an unstructured pile of documents.
- Keep observations, interpretations, assumptions, and generated hypotheses visibly separate.
- Use distinct Summarizer, Strategist, and Skeptic passes to expose where reasoning changes.
- Let a human product trio own the trade-offs, commitment, and stakeholder rationale.
- Treat every recommendation as a hypothesis until validation produces new evidence, then feed that evidence back into the decision record.
Choose the next real product decision that is important enough to validate and bounded enough to act on. Write its decision frame, assemble the smallest safe context packet, run the three reasoning passes, and take a decision memo into your product trio. When the result flows back into the packet, context engineering stops being a prompting technique and becomes part of how you run product.

References
- Pendo – Perspectives — AI Context Pulling Playbook: How I Make Humans + LLMs Collaborate for Sharper Product Outcomes
November 6, 2025
Turn Claude Code Into a Trusted Teammate: My 3-Layer Memory System You Can Copy

"Can you critique the landing page for my new Story-Based Customer Interviews course?" That simple ask used to kick off hours of back-and-forth where I fed an AI the same context over and over—only to get generic feedback that wouldn’t land with my audience or fit my products. As a product leader, that inefficiency was unacceptable; as a writer, it was just plain frustrating.

Not anymore. Today, Claude not only critiques my work, it helps me produce it. It generates marketing copy—in my voice. It helps me write blog posts. It knows what search terms are relevant to my business and helps me optimize my articles for SEO and now AEO. It helps me with competitive research, academic research, and discovery research. And it does all of this with little prompting from me.

I don’t upload files to a web-based project. I don’t manage elaborate prompt libraries. I don’t repeat myself. I ask for help and Claude knows exactly what to do. The shift happened when I learned how to give Claude Code a memory. Claude now knows who my target customer is, the key value propositions I focus on, the specific opportunities each product addresses, my revenue model, my marketing channels, and so much more.

[Image: A dark-themed strategy slide for the post Stop Repeating Yourself: Give Claude Code a Memory, showing how to lead with a CLAUDE.md glossary page, write clearly for nontechnical readers, and link glossary and article to boost discovery and engagement.]

With that memory, I consistently get high-quality output tailored to my audience and aligned to my products and services. I don’t retype the same context; Claude just remembers. In this article, I’ll show you exactly how I set up that memory. It relies on Claude Code (which requires a Pro subscription), and it’s worth it. If you’re new to Claude Code, start with "Claude Code: What It Is, How It’s Different, and Why Non-Technical People Should Use It."

Here’s the underlying problem: with large language models, every conversation starts from scratch. Yes, ChatGPT can remember some things and Claude can search past conversations, but practically speaking each new thread wipes the slate clean. If I were working on a new landing page, I’d normally need to upload target customer context, product details, primary and secondary value propositions, FAQ questions and answers, plus testimonials and logos for social proof—every single time.

[Image: Claude’s home screen with Sonnet 4.5 selected and quick actions for writing, learning, and coding beneath a clean prompt box.]

Projects in web-based tools help a bit, but they introduce a new dilemma. When I move to the next landing page targeting the same customer but a different product and value proposition, do I start a new Project (tedious) or keep expanding the old one (which muddies the context window and degrades output quality)? The good news: Claude Code solves this by giving the model a precise, durable memory without overloading any single conversation.

Claude Code can read files on my local machine, which is an understated superpower. I use those files to create a persistent, reusable memory that works across all chats and Projects. Files can be mixed and matched, so I give Claude exactly what it needs for the task at hand—and nothing more. For a first landing page, I reference the target customer and the relevant product; for the second, I reuse the same target customer file and point to the new product file.

[Image: Dark-mode Notes screenshot of Claude Code in action: it fetches producttalk.org, reads context files, and delivers a concise homepage evaluation.]

When you give an LLM the exact right context, output quality jumps. More context only helps if it’s the right context. For a landing page, Claude needs to know about the current product and perhaps related products for differentiation—but it doesn’t need to know about unrelated offerings. Structure your memory so Claude gets precisely what’s required.

Once I did this, Claude shifted from “intern who needs handholding” to trusted advisor and capable teammate. It doesn’t guess at my value propositions—I’ve already told it. It writes in my voice because it has my writing guide and samples. It knows who owns which course and which use cases map to which features. The setup takes a bit of upfront work, but it compounds: update a file when something changes and you’re done. Most of this information already lives in your system; the trick is making it easy for Claude to use.

[Image: Diagram of global and project CLAUDE.md files, plus custom reference docs, flowing into the editor so the assistant remembers preferences and context while you code and run commands.]

Because the files live on my machine, I own the system. No vendor or device lock-in. I decide when to share and with whom. I can work with Claude on one project and ChatGPT on another; both can rely on the same file-based memory strategy. It’s an AI strategy that scales with product discovery, accelerates go-to-market content, sharpens competitive differentiation, and supports product-led growth.

Here’s how I design the memory: I use three layers. Claude Code already encourages global preferences and Project-specific instructions, but the third layer—reference context—is where the real power lives.

[Image: A markdown playbook for Claude Code with concise rules for writing, multi-level planning, and clear feedback.]

Layer 1: Global Preferences (Always on). The first time I launched Claude Code, I created a CLAUDE.md file at ~/.claude/CLAUDE.md. This is where I keep the cross-project rules of engagement: how I like to work with Claude. Mine includes:
- Always create a plan for me to review before you start any work.
- Give me direct feedback (no hedging, no gentle suggestions).
- Use bullet points for summaries.
- Ask clarifying questions one at a time so I can give complete answers.
- No emojis unless I explicitly ask for them.
Claude Code automatically loads this file at the start of every session, so I never restate my preferences.

Layer 2: Project-Specific Instructions. Different projects have different rules. In my writing workspace, the Project CLAUDE.md sets the roles (I’m the primary writer; Claude is my thought partner and editor), defines a multi-round review flow (content → structure → accuracy → typos), prioritizes human readability over SEO, and points to my writing style guide. In my task management system, I include how my Trello integration works, file naming conventions for tasks, and how to process research papers into summaries. In my code projects, I specify the technology stack (Node.js vs. Python), testing framework (Jest for Node.js, pytest for Python), code style and conventions, project architecture and directory structure, and which dependencies and libraries to use. Each project directory has its own CLAUDE.md, and Claude automatically loads the relevant file when I’m working there.

[Image: A markdown playbook for collaborating with Claude, covering session setup, roles, editorial standards, and research steps.]

Layer 3: Reference Context (Pull as Needed). This is where the real power lives. LLMs have a context window, a limit on how much they can process at once. Even within that limit, loading too much degrades performance due to “context rot.” The remedy is ruthless context management: small, targeted files that load only when needed. Keep CLAUDE.md files concise and focused on rules and workflows. For detailed knowledge, create separate reference files and list them in your CLAUDE.md so Claude knows they exist and when to fetch them. When I ask for help creating a landing page, Claude knows to use my business profile, the product file, and my target customer context.

Here’s what most people miss: you don’t cram everything into global or Project files. You maintain small, reusable reference files that Claude only loads on demand. In my walkthrough, I share exactly which context files I created and why; how I got Claude Code to help me create them; how I break them into small, reusable components so Claude gets precisely what it needs; how I keep everything up to date; and step-by-step instructions so you can set up a similar memory system.
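
The mix-and-match idea is easy to picture as a lookup from task type to the few files that task needs. The sketch below is a toy model of that selection logic only. It is not how Claude Code works internally, and every file name, task name, and product name in it is hypothetical.

```python
from typing import Optional

# Hypothetical index: the shared reference files each kind of task needs.
REFERENCE_INDEX = {
    "landing_page": ["business-profile.md", "target-customer.md"],
    "blog_post": ["writing-style-guide.md", "target-customer.md"],
}


def context_files(task: str, product: Optional[str] = None) -> list[str]:
    """Return only the reference files relevant to this task and product."""
    files = list(REFERENCE_INDEX.get(task, []))
    if product is not None:
        files.append(f"products/{product}.md")  # one small file per product
    return files


# Two landing pages reuse the shared files and swap only the product file.
first = context_files("landing_page", product="course-a")
second = context_files("landing_page", product="course-b")
print(first)
print(second)
```

In practice the index lives as plain-language instructions in CLAUDE.md rather than in code, but the principle is the same: shared context is written once, and each task pulls the smallest set that applies.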

[Image: Three project notes funnel into Claude Code, turning reusable context into working output.]

Let’s dive in.

Inspired by this post on Product Talk.

November 5, 2025
AI at Home, Impact at Work: Experiments That Supercharged My Product Leadership

I recently tuned into an insightful All Things Product episode featuring Teresa Torres and Petra Wille on how experimenting with AI in everyday life sharpens how we build AI-powered products at work. The core premise resonated deeply with my AI Strategy: low-stakes, personal experiments accelerate confidence, clarify limitations, and build an AI product toolbox we can bring into the office with rigor.

If you want to dive in, you can listen on Spotify or Apple Podcasts. I found the conversation especially relevant for product trios and anyone shaping LLMs for product managers in high-stakes environments.

The idea is simple but powerful: when I prototype with AI at home—where the stakes are low—I learn faster, make safer mistakes, and internalize critical product patterns. Over time, those patterns transfer directly to work: tighter context management, sharper bias awareness, clearer human-in-the-loop guardrails, and a more nuanced view of when to use AI as a thought partner versus when to consider agentic AI.

In my own practice, I’ve mirrored many of the scenarios discussed: using ChatGPT by OpenAI to plan meals, analyze public data sets like school budgets, and even sanity-check real estate evaluations. These seemingly mundane tasks are fertile ground for learning about context window limits, hallucination, AI bias, and privacy-by-design trade-offs. Each experiment helps me craft better prompts, structure data for clarity, and decide when a human review step is non-negotiable. These are core habits for AI risk management.

At work, I treat AI as a thought partner for writing, research synthesis, and contract review. I also explore when and how to responsibly evolve toward agentic AI for repeatable workflows. The distinction matters: a thought partner augments judgment; an agent automates execution. Building the right scaffolding—data governance, auditability, constraints, and escalation paths—ensures we unlock speed without compromising safety.

Three lines from the episode stayed with me: “I’m trying to write things that only I can write — that’s my guiding writing light right now.” — Teresa. “The more we use AI, the more we learn what it’s good at, what it’s not good at, and where context becomes a limitation.” — Teresa. “It’s a safer playground — we can build our toolbox at home before bringing those lessons to work.” — Petra. These are practical north stars for product management leadership in the GenAI era.

For anyone getting started, here’s what worked for me: begin with “low-stakes” personal experiments, write down your prompts and outcomes, and reflect on failure modes. Treat each activity as product discovery: What problem am I solving? What outcome matters? What data and context does the model need? Which decisions must stay human-in-the-loop? This discipline builds an AI product toolbox you can confidently apply to real customer problems.

I also keep a running toolkit of references and tools that inform my practice: Context window as a concept helps me size and sequence information. Visual and video tools like Midjourney and Sora expand how I think about multimodal experiences. I rotate between Claude by Anthropic and ChatGPT by OpenAI depending on task fit, and I’ve used Claude Code when I need structured assistance with code review. For knowledge capture and workflow, Readwise and Ghost help me structure insights and ship content.

If you want more structured learning paths, I found Josh Seiden’s Learn AI With Me, A 30-Day Sprint to be a practical primer, and the broader community conversation at Product at Heart Conference is invaluable. For a deeper grounding in risk, I recommend reviewing topics like hallucination, AI bias, and agentic AI, and revisiting the complementary episode, Context is King.

I’d love to hear how you’re experimenting: Where have you seen AI meaningfully reduce toil? Where does it still struggle? How are you balancing creativity, data safety, and compliance as you scale? Drop a comment below and let’s compare notes—especially on patterns that help product trios move faster without sacrificing trust.

Bottom line: start small at home, carry lessons into the office, and build with curiosity and intentionality. That’s how we level up our product discovery, sharpen our value proposition, and lead teams confidently through the GenAI transition.

Inspired by this post on Product Talk.

November 4, 2025

How to Build AI Upskilling That Changes Product Team Behavior

You’ve approved AI training, given people access to new tools, and watched the demos fill up. Yet product decisions still look the same. A few enthusiasts move faster, most people return to familiar workflows, and leaders struggle to explain what the investment changed.

The missing piece is usually not another course. It is a system that connects strategy, role-specific practice, manager coaching, and business evidence. If you are responsible for an AI-era workforce transformation, your job is to make new capability visible in the work, not merely available in a learning portal.

Start with the product behavior that must change

A broad goal such as “make the product team AI-ready” cannot guide a training program. It does not tell a PM what to do differently on Monday, a manager what to coach, or an executive what evidence to inspect.

Begin with the company strategy and work backward. Capabilities should connect to customer outcomes and outcomes-based OKRs, so every learning investment has a reason to exist. If you cannot connect a skill to a decision, workflow, or strategic bet, leave it out of the first release.

Use this sequence to turn an abstract AI ambition into a trainable capability:

1. Name the strategic outcome. Choose an outcome already present in the roadmap or operating plan. Do not create a separate set of learning goals that competes with the business.
2. Locate the workflow. Identify where the outcome is won or lost: discovery synthesis, prioritization, experimentation, sprint planning, onboarding, product tours, or another recurring part of delivery.
3. Identify the accountable role. Be precise about whether the behavior belongs to a product manager, designer, engineer, analyst, product leader, or cross-functional partner.
4. Write the observable behavior. Describe what a capable person produces or decides. “Understands LLMs” is not observable. “Can define evaluation criteria before an AI feature enters development” is.
5. Inspect current evidence. Review real artifacts, decisions, and workflow data. Self-reported confidence can help you find anxiety or demand, but it does not establish competence.
6. Select the intervention and proof. Decide whether the person needs instruction, practice, feedback, a new role path, or some combination. Name the evidence you expect to improve.

Consider a team that wants to use generative AI in product discovery. “Complete prompt training” is an activity. A useful capability statement is more demanding: the PM can use an LLM to organize customer inputs, separate supported themes from plausible-sounding output, document the method, validate the findings, and turn the synthesis into a product decision. That statement tells you what to teach, what artifact to review, and where human judgment remains essential.

Capture these decisions in a small capability map with fields for strategic outcome, workflow, role, expected behavior, current evidence, learning path, practice assignment, reviewer, and outcome metric. The map becomes the contract between the executive sponsor, functional leader, manager, and learner. It also prevents the curriculum from expanding every time someone finds a new AI tool.
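
A capability map does not need special tooling; a spreadsheet row or a small record is enough. The sketch below shows one possible shape, assuming the fields listed above. The class name, the `gaps` helper, and the scoping rule in `belongs_in_first_release` are hypothetical, and the example entry is illustrative.

```python
from dataclasses import asdict, dataclass


@dataclass
class CapabilityMapEntry:
    """One row of a capability map, using the fields named in the text."""
    strategic_outcome: str
    workflow: str
    role: str
    expected_behavior: str
    current_evidence: str
    learning_path: str
    practice_assignment: str
    reviewer: str
    outcome_metric: str

    def gaps(self) -> list[str]:
        """Fields nobody has filled in yet."""
        return [name for name, value in asdict(self).items() if not value]


def belongs_in_first_release(entry: CapabilityMapEntry) -> bool:
    """Keep a skill in scope only if it connects to an outcome or a workflow."""
    return bool(entry.strategic_outcome or entry.workflow)


entry = CapabilityMapEntry(
    strategic_outcome="Faster, better-evidenced discovery decisions",
    workflow="discovery synthesis",
    role="product manager",
    expected_behavior="Separates supported themes from plausible-sounding output",
    current_evidence="",
    learning_path="",
    practice_assignment="",
    reviewer="",
    outcome_metric="",
)
print(entry.gaps())
```

Empty fields are useful information: a row with no reviewer or outcome metric tells the sponsor what still has to be agreed before training starts.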

Decide whether you are upskilling or reskilling

Upskilling and reskilling require different commitments. Treating them as interchangeable creates false expectations for the learner and poor workforce plans for the business.

Upskilling deepens capability within a person’s current role, while reskilling prepares that person to move into a different lane. A PM learning AI-assisted discovery, evaluation design, or stronger data governance is usually upskilling. An engineer or analyst transitioning into an applied generative AI role is reskilling.

Decision	Upskilling	Reskilling
Role after training	The person remains in the same role and performs it at a higher level.	The person moves toward a materially different role or set of responsibilities.
Problem it solves	The strategy requires stronger execution in an existing workflow.	The strategy creates a capability or talent need the current organization does not cover.
Typical product example	A PM adds LLM evaluation, AI-assisted synthesis, or privacy-by-design to existing product work.	An engineer or analyst develops toward an applied generative AI position.
Primary proof	Better behavior and decisions in the person’s current workflow.	Competent performance against milestones for the destination role.
Support model	Embedded practice, feedback, coaching, and reusable playbooks.	A role charter, staged milestones, tailored onboarding, a mentor, and sandboxed practice.

The cleanest decision test is role continuity. If the role remains intact and the person needs a stronger method, upskill. If the destination changes the person’s core responsibilities, decision rights, or career lane, reskill.

Do not disguise reskilling as a short course. A person moving into applied AI needs clarity about the destination role, protected practice, feedback from someone who can judge the work, and an explicit way to demonstrate readiness. Course completion may show effort. It does not show that the person can operate independently in the new lane.

You also do not need to choose one path for the entire workforce. A sensible portfolio can upskill most PMs and product leaders in AI product judgment while reskilling a smaller cohort of engineers and analysts for specialized applied work. The mix should follow the roadmap, not a blanket mandate that every employee become an AI specialist.

Put practice inside the product operating system

A course can introduce vocabulary and demonstrate a method. It cannot, by itself, make the method survive contact with a real roadmap, imperfect data, stakeholder pressure, and an approaching release. Transfer happens when the learner applies the skill in the environment where it must eventually work.

That is why training should be embedded in product workflows and connected to adoption and business outcomes. Discovery reviews, product trio rituals, sprint planning, critiques, code reviews, onboarding work, and QBR discussions are not interruptions to learning. They are the places where learning becomes operational.

Use the 70-20-10 model as a design check: most development comes from doing, a meaningful share comes from coaching and peer learning, and a smaller share comes from formal instruction. The proportions are less important than the correction they force. If your plan is mostly video modules and workshops, it is missing the practice environment that creates capability.

A practical learning loop looks like this:

Teach one bounded concept. Examples include LLM foundations, prompt design, evaluation criteria, research synthesis, data governance, or privacy-by-design.
Demonstrate it on a recognizable artifact. Use a discovery summary, decision memo, prototype, roadmap decision, evaluation plan, onboarding flow, or product tour rather than a context-free exercise.
Let the learner perform the work. Start in an internal sandbox or a low-risk initiative, then move into a live workflow when the review and safety boundaries are clear.
Review the output, not the learner’s enthusiasm. A manager, mentor, guild, or product trio should critique the reasoning, evidence, risks, and final decision.
Publish the reusable pattern. Save the prompt, checklist, rubric, example, and known failure modes in a playbook that another person can use.
Repeat in the next work cycle. The learner should apply the capability again without relying on the instructor to drive every step.

Make each role path specific enough to practice

For product managers, concentrate on the judgments they already own: discovery synthesis, framing an AI opportunity, setting evaluation criteria, connecting a prototype to the roadmap, spotting unsupported model output, and communicating tradeoffs to stakeholders.

For product leaders and managers, add a different layer. They need to set decision rights, review AI work consistently, coach to outcomes, protect learning time, and distinguish a promising demonstration from a capability that can be adopted repeatedly. A manager who cannot evaluate the new behavior will unintentionally push the learner back toward the old one.

For engineers and analysts moving toward applied generative AI, use staged practice projects, senior mentorship, and explicit milestones. Internal tools can be useful assignments because they create real constraints and users without requiring the cohort’s first exercise to become a customer-facing production system.

For cross-functional partners, train around the handoffs they influence. Product tours, onboarding sequences, user activation, customer feedback, and stakeholder communication all benefit when the people involved understand both the product objective and the limits of the AI system.

Keep the safety boundary visible throughout the path. Do not turn a training exercise into an unreviewed production deployment or place sensitive customer data into a tool that has not been approved for it. Use sandboxed, synthetic, or otherwise appropriate material until privacy, data governance, access, and review requirements are clear. Responsible AI is part of competent product work, not a compliance module to append at the end.

Protect time as deliberately as budget

A learning budget does little when every calendar is full. Give the cohort recurring focus time, place practice assignments into normal planning, and make the manager accountable for preserving the space. When a new learning commitment enters the plan, ask what will be deprioritized. Without that tradeoff, development becomes extra work and participation will favor the people who already have the most discretionary time.

Make teaching visible as well. Communities of practice, cross-team demonstrations, shadow sessions, and critique groups allow effective methods to travel. Reward the people who turn tacit judgment into a usable rubric or playbook; their contribution raises the capability of more than one learner.

Measure adoption, behavior, and business impact separately

Attendance is an operational signal. It can tell you whether people reached the training, but it cannot tell you whether they can perform the work. Completion rates are equally limited. A person can finish every module without changing a single product decision.

Build the measurement plan in three layers:

Adoption: Is the learner using the workflow, tool, or method? Depending on the path, inspect time-to-first-value, repeat use, feature activation, participation in practice, or progress through role milestones.
Behavior and capability: Is the work different? Review the quality of discovery, evaluation plans, written strategy, stakeholder communication, prototypes, and decisions. Use a rubric so reviewers are judging the same attributes.
Business and operating outcomes: Is the changed behavior helping the system perform? Relevant measures can include time from insight to iteration, deployment frequency and other DORA metrics for engineering-heavy paths, onboarding time-to-productivity, retention analysis, user activation, and attributable ROI.

The metric must stay close to the capability. Training a PM in AI-assisted discovery and then judging the program only by company revenue creates an attribution gap too wide to manage. First inspect whether discovery synthesis and decisions improved, then whether the insight-to-iteration cycle changed, and finally how those changes relate to the wider business result.

Establish the baseline before the cohort begins. Review examples of the current work, record the relevant workflow measures, and agree on what meaningful improvement would look like. Where the data supports it, define a minimum detectable effect so normal variation is not presented as proof that training worked.
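
As a rough illustration of a minimum detectable effect, the sketch below uses the common normal approximation for comparing two proportions. Everything specific here is an assumption: the metric (the share of artifacts that pass a rubric review), equal group sizes, a two-sided 5% significance level, and 80% power. Treat it as a sanity check, not a substitute for a proper analysis.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest absolute change in a pass rate that can be detected.

    Assumes two equally sized groups and the normal approximation for
    a difference in proportions (illustrative defaults, not a standard
    mandated by the text).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    std_err = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_power) * std_err


# Hypothetical inputs: 40% of baseline artifacts pass the rubric, 200 artifacts per group.
mde = minimum_detectable_effect(baseline_rate=0.40, n_per_group=200)
print(round(mde, 3))  # 0.137
```

Under these assumed inputs, an improvement smaller than roughly 14 percentage points could not be distinguished from normal variation, which is exactly the kind of limit worth agreeing on before the cohort begins.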

Do not force every path into the same dashboard. An existing PM’s upskilling path may be best judged through discovery artifacts, decision quality, and cycle time. A reskilling path may require demonstrated milestones, mentor assessment, and time-to-productivity in the destination role. A manager path may require evidence that feedback quality and role clarity improved. Standardize the measurement logic, not a single metric applied regardless of context.

Use the reviews to make decisions. If adoption is low, inspect access, relevance, manager support, and protected time. If adoption is high but behavior is unchanged, redesign the practice and feedback. If behavior improves but the business measure does not, revisit the assumed connection between the capability and the strategic outcome. A learning dashboard earns its place only when it changes the program.
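
The review logic in the paragraph above can be written as a small decision function. This is a sketch under stated assumptions: the three boolean inputs are a simplification of what would be rubric-based judgments, and the function name is hypothetical.

```python
def next_program_action(adoption_ok: bool, behavior_changed: bool, outcome_improved: bool) -> str:
    """Map the three measurement layers to the next program decision.

    Layers are checked in order, because a later layer is not
    interpretable while an earlier one is failing.
    """
    if not adoption_ok:
        return "Inspect access, relevance, manager support, and protected time."
    if not behavior_changed:
        return "Redesign the practice and feedback."
    if not outcome_improved:
        return "Revisit the assumed link between the capability and the strategic outcome."
    return "Scale what transferred."


print(next_program_action(adoption_ok=True, behavior_changed=False, outcome_improved=False))
```

Checking the layers in order keeps a weak business metric from being blamed on the training when the learners never adopted the workflow in the first place.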

Launch one focused 90-day capability portfolio

You do not need an enterprise-wide academy to begin. A practical first release is one upskilling initiative and one reskilling initiative that can be delivered within 90 days. Running both exposes the different support each path needs without spreading the organization across too many capabilities.

Treat the portfolio like a product launch:

Frame the problem. Choose a strategic outcome, map the relevant workflow and roles, inspect current evidence, and establish a baseline.
Select the cohorts. Put people into an upskilling or reskilling path based on the work they will own, not their interest in a particular tool.
Design the path. Combine narrow instruction with a real assignment, a sandbox where needed, a reviewer, a reusable artifact, and explicit evidence of competence.
Prepare the managers. Give them the capability rubric, coaching expectations, safety boundaries, and authority to protect time or remove competing work.
Run visible practice. Use demonstrations, critiques, shadowing, product trio reviews, and communities of practice to expose both good patterns and failure modes.
Inspect the evidence. Review adoption, behavior, and outcome measures. Scale what transferred, change what created activity without capability, and stop what no longer serves the strategy.
Institutionalize what worked. Move validated paths into onboarding, career frameworks, manager expectations, product playbooks, and planning cadences so the capability survives beyond the cohort.

Set stakeholder expectations before the launch. Finance needs to understand how ROI will be evaluated. HR needs to connect reskilling and capability growth to career paths. Functional leaders need to agree on standards. Managers need to know that learning time is an operating commitment. The learner should not be left to negotiate these dependencies alone.

Key takeaways

Start with a strategic outcome and an observable product behavior, not a catalog of AI topics.
Upskill when the role stays the same; reskill when the person is moving into a materially different lane.
Use formal instruction to introduce a method, then build competence through live practice, feedback, and repetition.
Train managers to recognize and coach the new behavior, or the old operating habits will return.
Measure adoption, capability, and business impact as separate layers.
Run one upskilling path and one reskilling path in the first 90-day portfolio, then scale only what changes the work.

At your next planning session, choose one recurring product workflow where AI capability should already be improving the outcome but is not. Name the role, the behavior, the artifact, the reviewer, and the measure. That single path will teach you more about your organization’s readiness than another company-wide course.

November 3, 2025

AI-Enabled Product Management: A Practical Operating Model

Your product managers are probably already using AI to summarize feedback, draft requirements, and prepare planning documents. The harder question is whether any of that is improving the decisions behind the documents.

That distinction matters. Faster artifact production can create the appearance of progress while weak evidence, unclear ownership, and unresolved trade-offs remain untouched. A useful AI-enabled product operating model shortens the path from customer evidence to accountable action without treating fluent output as product judgment.

Start with a recurring decision, not a general-purpose assistant

The natural starting point is an assistant that can answer anything. It is also difficult to evaluate because every request has different inputs, quality criteria, and consequences. Start with one recurring decision whose current workflow you understand.

AI is already useful for synthesizing feedback, drafting PRDs and acceptance criteria, turning notes into user stories, and preparing experiment plans. Those are valuable tasks, but they are parts of a workflow. None of them determines which customer problem deserves investment or which trade-off the company should accept.

Define a decision contract before choosing a model or writing a prompt:

Decision: State the exact choice to be made. Replace “improve onboarding” with “choose which activation barrier to address next.”
Trigger: Name when the workflow runs, such as before roadmap review, after a discovery cycle, or when an anomaly appears.
Required evidence: Identify the interviews, support records, analytics, CRM context, experiments, and strategic constraints that must inform the choice.
Output contract: Specify the claims, citations, contradictory evidence, unknowns, and proposed next questions the AI must return.
Decision owner: Name the person accountable for accepting, rejecting, or changing the recommendation.
Red lines: Identify actions the system may not take, data it may not expose, and conclusions it may not present without review.
Outcome signal: Choose the product or workflow measure that will reveal whether the decision improved anything.

If you cannot name the decision owner and the action that follows the output, you have an AI demonstration rather than an operating workflow.
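
A decision contract can be captured as a small record with a readiness check. The sketch below is one possible shape, not a prescribed schema: the class name, the `follow_up_action` field, the `is_operating_workflow` helper, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecisionContract:
    """Fields mirror the contract list above; follow_up_action is an added assumption."""
    decision: str
    trigger: str
    required_evidence: list[str]
    output_contract: list[str]
    decision_owner: str
    red_lines: list[str]
    outcome_signal: str
    follow_up_action: str = ""

    def is_operating_workflow(self) -> bool:
        """True only when both an owner and the action after the output are named."""
        return bool(self.decision_owner.strip()) and bool(self.follow_up_action.strip())


# Hypothetical example values.
contract = DecisionContract(
    decision="Choose which activation barrier to address next",
    trigger="Before roadmap review",
    required_evidence=["interviews", "support records", "analytics"],
    output_contract=["claims", "citations", "contradictory evidence", "unknowns"],
    decision_owner="Onboarding PM",
    red_lines=["No customer-identifying data in the summary"],
    outcome_signal="Activation rate for the target segment",
)
print(contract.is_operating_workflow())  # False: no follow-up action is named yet
```

The check is deliberately blunt. A contract without an owner or a next action is the demonstration case described above, however complete the other fields look.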

Product decision	What AI can prepare	What the PM must decide
Which problem to investigate	Clusters of interview, support, and behavioral signals with links to the underlying records	Whether the pattern is strategically important and which customers need follow-up
Which roadmap request deserves attention	Evidence by segment, frequency, workflow, and conflicting signal	Opportunity cost, strategic fit, and whether the request represents a problem or a proposed solution
Whether an experiment is ready	Hypothesis, acceptance criteria, instrumentation needs, and minimum detectable effect inputs	Whether the causal question is worth testing and whether the exposure risk is acceptable
How to position a capability	Customer language, points of parity, objections, and candidate messages	The value proposition and competitive differentiation the company can credibly defend
How to respond to an operational signal	Anomaly context, affected journey stage, supporting records, and candidate playbooks	Whether to intervene, whom to affect, and how to judge the result

The prompt should reflect that contract. A weak request says: “Summarize customer feedback.” A decision-ready request says: “For the specified segment and workflow, group evidence by customer problem, cite every supporting record, identify contradictions and missing coverage, separate observation from inference, and propose the next discovery question without recommending a roadmap commitment.”

That change is small but important. It directs AI toward evidence preparation while preserving the PM’s responsibility for interpretation and commitment.

Build a context layer your PMs can interrogate and verify

A generic model knows language patterns, not the current state of your customers, product, strategy, or commitments. Copying a few notes into a prompt helps with an isolated task, but it does not create a reliable product-management system.

Retrieval-Augmented Generation connects an LLM to internal product, customer, and market knowledge so relevant material can be retrieved when a question is asked. For a PM, that knowledge may include interview notes, support tickets, win-loss records, QBRs, specifications, CRM data, and product analytics. The practical benefit is not merely a more personalized answer. It is an answer that can be checked against the company’s evidence.

Do not begin by indexing every repository. A large corpus increases coverage, but it also introduces stale specifications, duplicate tickets, conflicting terminology, inaccessible customer data, and documents whose status is unclear. Trust is usually lost at the corpus boundary before it is lost at the model layer.

A minimum trustworthy context layer needs:

Explicit scope: Document which repositories, products, segments, and time periods are included. The system should disclose when a question falls outside that scope.
Access enforcement: Apply user and tenant permissions during retrieval, not merely after an answer has been generated. A record being technically retrievable does not make it appropriate for every PM or every output.
Useful metadata: Preserve product area, customer segment, workflow, channel, date, product version, record owner, and status where available. These fields help distinguish current evidence from historical noise.
Evidence hierarchy: Decide how the system handles an approved specification that conflicts with an old planning note, or verified analytics that conflict with an anecdotal request. It should show the conflict rather than silently blending the two.
Answer boundaries: Require separate sections for supported facts, inferences, contradictory evidence, and unknowns. Require links to the records carrying each material claim.
Feedback history: Store reviewer corrections and the failure category behind each correction. A thumbs-down with no explanation does not tell you whether retrieval, reasoning, freshness, permissions, or presentation failed.
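
To make the access-enforcement and metadata points concrete, here is a minimal sketch of retrieval that filters on permissions, scope, freshness, and status before anything is ranked. The `Record` type, its fields, the group names, and the naive keyword scoring are all assumptions standing in for whatever retriever and permission model you actually use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    segment: str
    status: str                # e.g. "approved", "draft", "superseded"
    updated: date
    allowed_groups: frozenset  # groups permitted to read this record


def retrieve(records, query_terms, user_groups, segment, not_before):
    """Apply permission, scope, freshness, and status filters BEFORE ranking."""
    in_scope = [
        r for r in records
        if r.allowed_groups & user_groups   # access enforced during retrieval
        and r.segment == segment            # explicit scope
        and r.updated >= not_before         # freshness window
        and r.status != "superseded"        # skip records known to be stale
    ]

    def score(record):
        # Naive keyword overlap; a placeholder for a real relevance model.
        text = record.text.lower()
        return sum(term.lower() in text for term in query_terms)

    ranked = sorted(in_scope, key=score, reverse=True)
    return [r for r in ranked if score(r) > 0]


# Synthetic records for illustration only.
records = [
    Record("t1", "Customers cannot find the import button during onboarding",
           "smb", "approved", date(2025, 9, 1), frozenset({"pm-onboarding"})),
    Record("t2", "Import fails for enterprise SSO accounts",
           "enterprise", "approved", date(2025, 9, 5), frozenset({"pm-enterprise"})),
    Record("t3", "Old import spec", "smb", "superseded",
           date(2025, 9, 2), frozenset({"pm-onboarding"})),
]
hits = retrieve(records, ["import"], {"pm-onboarding"}, "smb", date(2025, 6, 1))
print([r.record_id for r in hits])  # ['t1']
```

Because the filters run first, a record the user may not see never reaches the model's context, which is a stronger guarantee than trimming the answer afterward.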

Start in read-only mode with a narrow, high-signal workflow, such as synthesizing support patterns for one segment. Ask reviewers to mark each important claim as supported, partly supported, or unsupported and to note relevant evidence that was missed. A polished answer with no traceable basis fails even when its conclusion happens to be plausible.

RAG does not turn internal data into truth. Retrieval can return stale, partial, or contradictory material, and a missing record is not proof that a customer problem does not exist. Your PM still has to assess coverage, distinguish signal from sampling bias, and decide when fresh discovery is necessary.

Privacy-by-design belongs in this layer as well. Support and CRM records may contain personal information, confidential commitments, or account-specific context. Minimize what is indexed, redact what is not needed, preserve access controls, and define which outputs may leave the internal workflow. Data governance is part of product quality here, not an administrative task to add after launch.

Match AI autonomy to the consequence of being wrong

“Human review” is too vague to be a control. It can mean a careful decision by an accountable owner, or a hurried click on an approval button after the work has effectively been accepted. Define autonomy according to the consequence and reversibility of each action.

Assist: AI transforms material without changing external state. Examples include transcribing notes, formatting requirements, clustering feedback, or drafting an internal brief. The user reviews the result before relying on it.
Recommend: AI interprets evidence and proposes a choice, but a named owner makes the decision. Roadmap evidence summaries, experiment proposals, and candidate positioning belong here.
Act reversibly: AI performs a bounded action that is observable and easy to undo, such as creating a draft ticket, applying an internal label, running an analysis, or staging an in-app guide in preview. Tool permissions, scope, and rollback must be enforced.
Act with material consequence: The workflow affects customers, exposure to an experiment, permissions, contractual commitments, published messaging, or data that cannot be restored easily. Require explicit approval from the accountable owner before execution.
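
The four levels above can be encoded so that every proposed action is classified before it runs. This is a simplified sketch: the three boolean inputs and the function names are assumptions, and a real classification would rest on a reviewed policy rather than flags supplied by the caller.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    ASSIST = 1
    RECOMMEND = 2
    ACT_REVERSIBLY = 3
    ACT_WITH_MATERIAL_CONSEQUENCE = 4


def classify(changes_external_state: bool, interprets_evidence: bool,
             material_consequence: bool) -> Autonomy:
    """Pick the highest level that applies; consequence always dominates."""
    if material_consequence:
        return Autonomy.ACT_WITH_MATERIAL_CONSEQUENCE
    if changes_external_state:
        return Autonomy.ACT_REVERSIBLY
    if interprets_evidence:
        return Autonomy.RECOMMEND
    return Autonomy.ASSIST


def requires_owner_approval(level: Autonomy) -> bool:
    """Explicit approval before execution is required at the top level."""
    return level is Autonomy.ACT_WITH_MATERIAL_CONSEQUENCE


# Creating a draft ticket changes internal state but is easy to undo.
level = classify(changes_external_state=True, interprets_evidence=False,
                 material_consequence=False)
print(level.name, requires_owner_approval(level))  # ACT_REVERSIBLY False
```

Checking material consequence first means an action that touches customers cannot be downgraded simply because it is technically reversible.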

A credible direction of travel includes agents that monitor activation funnels, flag anomalies, prepare playbooks, and help coordinate experiments or in-app guidance. That does not justify giving one agent broad access to analytics, messaging, experimentation, and customer data. Each tool should have the narrowest permission and action scope the workflow needs.

For consequential actions, make the approval packet decision-ready:

The exact action the agent proposes to take
The affected product area, customer cohort, or internal system
The evidence supporting the action, with links
Contradictory evidence and unresolved uncertainty
The expected product outcome and how it will be observed
The rollback procedure and the conditions that trigger it
The approver, approval expiry, and complete action log
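
An approval packet like the one listed above can be modeled as a record whose execution check fails closed. The sketch is illustrative: the class and field names and the 24-hour default expiry are assumptions, not recommendations from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ApprovalPacket:
    proposed_action: str
    affected_scope: str
    evidence_links: list[str]
    contradictions: list[str]
    expected_outcome: str
    rollback_procedure: str
    rollback_triggers: list[str]
    approver: str
    approved_at: Optional[datetime] = None
    approval_ttl: timedelta = timedelta(hours=24)  # assumed expiry window
    action_log: list[str] = field(default_factory=list)

    def may_execute(self, now: datetime) -> bool:
        """Fail closed: no approval, no evidence, no rollback, or expired means no."""
        if self.approved_at is None or not self.evidence_links or not self.rollback_procedure:
            allowed = False
        else:
            allowed = now < self.approved_at + self.approval_ttl
        self.action_log.append(f"{now.isoformat()} may_execute={allowed}")
        return allowed


# Hypothetical packet for a staged in-app guide.
t0 = datetime(2025, 11, 3, 9, 0, tzinfo=timezone.utc)
packet = ApprovalPacket(
    proposed_action="Publish in-app guide to trial accounts",
    affected_scope="Trial cohort, onboarding flow",
    evidence_links=["link-to-activation-analysis"],
    contradictions=["Two interviews reported guide fatigue"],
    expected_outcome="Higher completion of the import step",
    rollback_procedure="Unpublish the guide",
    rollback_triggers=["Support tickets about the guide rise"],
    approver="Onboarding PM",
    approved_at=t0,
)
print(packet.may_execute(t0 + timedelta(hours=1)))   # True
print(packet.may_execute(t0 + timedelta(hours=25)))  # False: approval expired
```

Logging every check, including refusals, keeps the action log complete rather than a record of successes only.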

Enforce guardrails in the system rather than relying on prompt language. Use constrained service accounts, scoped tools, staging environments, rate limits, complete logs, and an accessible kill switch. A prompt is an instruction to a model; it is not a security boundary.
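
As a sketch of what enforcing guardrails in the system can mean, the wrapper below checks scope, rate limit, and a kill switch in code before any tool call runs, and logs every attempt. `GuardedTool`, `ticket_api`, and the action names are hypothetical. A production version would sit behind a constrained service account rather than in the same process as the agent.

```python
import time


class GuardedTool:
    """Wrap a tool so scope, rate limit, and kill switch are enforced in code."""

    def __init__(self, name, func, allowed_actions, max_calls_per_minute):
        self.name = name
        self.func = func
        self.allowed_actions = set(allowed_actions)
        self.max_calls_per_minute = max_calls_per_minute
        self.kill_switch = False
        self.log = []      # every attempt is recorded, allowed or not
        self._calls = []   # timestamps of allowed calls in the last minute

    def call(self, action, **kwargs):
        now = time.monotonic()
        self._calls = [t for t in self._calls if now - t < 60]
        if self.kill_switch:
            outcome = "blocked: kill switch"
        elif action not in self.allowed_actions:
            outcome = "blocked: out of scope"
        elif len(self._calls) >= self.max_calls_per_minute:
            outcome = "blocked: rate limit"
        else:
            self._calls.append(now)
            outcome = "allowed"
        self.log.append((action, kwargs, outcome))
        if outcome != "allowed":
            raise PermissionError(outcome)
        return self.func(action, **kwargs)


def ticket_api(action, **kwargs):
    """Stand-in for a real ticketing integration."""
    return f"{action} done"


tool = GuardedTool("tickets", ticket_api, {"create_draft_ticket"}, max_calls_per_minute=2)
print(tool.call("create_draft_ticket", title="Import button hard to find"))
```

Nothing in the wrapper depends on what the prompt says, which is the point: the model can ask for an out-of-scope action, but the system refuses it.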

My rule is simple: if the accountable PM cannot explain how the evidence supports the proposed action, the workflow has not earned more autonomy. The right response is to improve the context and evaluation loop, not to make the approval interface easier to click through.

Evaluate the output, the workflow, and the product outcome

An AI initiative can generate more documents while making product management worse. More drafts may create review queues, spread unsupported claims, or encourage teams to reopen decisions that lacked new evidence. Measure three layers so local speed is not mistaken for organizational value.

Evaluation layer	Question	Evidence to inspect
Output reliability	Is the result grounded, complete enough for its purpose, appropriately uncertain, and safe to use?	Citation checks, missed evidence, unsupported claims, privacy failures, and subject-matter review
Workflow performance	Does AI reduce elapsed time and rework without moving effort into a hidden review step?	Time from trigger to decision, acceptance and editing patterns, handoffs, reopened work, and blocked decisions
Product impact	Did the resulting decision improve the customer or business outcome the workflow exists to influence?	The relevant activation, retention, experiment, support, or commercial measure, interpreted in the context of the decision

Baseline the existing workflow before introducing AI. Record its trigger, participants, elapsed time, common failure modes, and decision outcome. Otherwise, a faster AI run will be compared with an imaginary manual process instead of the work people actually perform.

Use outcomes rather than artifact volume when setting the objective. Drafts produced, prompts submitted, and active users describe activity. A shorter evidence-to-decision cycle, fewer unsupported roadmap claims, or better performance on the product outcome describes value. The metric must match the workflow; there is no universal AI productivity score.

A practical review loop looks like this:

Maintain a representative evaluation set containing ordinary cases, known failures, ambiguous inputs, permission boundaries, and contradictory evidence.
Run the current prompt, retrieval configuration, model, and tools against that set.
Have the relevant product, design, engineering, data, or domain reviewer score the output against the decision contract.
Classify each failure. Separate missing retrieval from unsupported inference, stale context, permission errors, incomplete instructions, and poor presentation.
Change one major component at a time so you can tell whether the prompt, corpus, retrieval rules, model, tool, or approval design improved the result.
Run the full evaluation set again before promoting the change. Keep prompts and retrieval configurations versioned so regressions can be traced and reversed.
Review production corrections and near misses, add them to the evaluation set, and revisit the autonomy level if the consequence profile has changed.
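
The loop above can be supported by a very small harness that records reviewer verdicts, classifies failures, and blocks promotion on regressions. This is a sketch under assumptions: the category labels paraphrase the list above, reviewer scoring is represented by precomputed results, and the promotion rule (no previously passing case may fail) is one reasonable choice, not the only one.

```python
from collections import Counter
from dataclasses import dataclass

FAILURE_CATEGORIES = {
    "missing_retrieval", "unsupported_inference", "stale_context",
    "permission_error", "incomplete_instructions", "poor_presentation",
}


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    failure_category: str = ""


def summarize(results):
    """Return (pass_rate, failure counts); every failure must carry a known category."""
    for r in results:
        if not r.passed and r.failure_category not in FAILURE_CATEGORIES:
            raise ValueError(f"{r.case_id}: unclassified failure")
    failures = Counter(r.failure_category for r in results if not r.passed)
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate, failures


def may_promote(baseline, candidate):
    """Assumed rule: promote only if no case that passed before now fails."""
    base_pass = {r.case_id for r in baseline if r.passed}
    cand_pass = {r.case_id for r in candidate if r.passed}
    return base_pass <= cand_pass


# Hypothetical reviewer verdicts for two versions of the same workflow.
baseline = [CaseResult("ordinary-1", True),
            CaseResult("contradiction-1", False, "unsupported_inference")]
candidate = [CaseResult("ordinary-1", True),
             CaseResult("contradiction-1", True)]
print(summarize(baseline))
print(may_promote(baseline, candidate))  # True
```

Requiring a category for every failure is what makes the later step of changing one component at a time possible, since each category points at a different component.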

This is a good ritual for a product trio, with engineering or a forward deployed engineer handling system integration and observability where the workflow requires it. The PM owns the problem definition and decision quality; design protects the fidelity of customer interpretation; engineering owns the reliability and bounded behavior of the implementation. Subject-matter owners still review claims that cross their domain.

Expand in stages. Move from a single-segment synthesis to a cited discovery brief, then to roadmap evidence, experiment preparation, and only later to reversible execution. Do not promote the workflow when material claims remain uncited, permission failures are unresolved, reviewers cannot explain its conclusions, or downstream rework is increasing. Those are operating failures, even if the model’s prose looks strong.

Key takeaways

Choose one recurring product decision and define its owner, evidence, output, red lines, and outcome before selecting AI tools.
Use a governed retrieval layer to make internal context accessible, current, permission-aware, and traceable to the underlying records.
Separate evidence preparation from judgment. AI can organize and challenge the case; the PM remains accountable for the bet.
Increase autonomy only when actions are bounded, observable, reversible, and supported by an explicit approval model.
Evaluate output reliability, workflow performance, and product impact. Artifact volume is not a proxy for better product management.
Scale only after real corrections and failure cases have been added to a repeatable evaluation set.

Before your next planning cycle, pick one disputed decision that repeats often. Write its decision contract, assemble a small representative evidence set, and run the AI workflow in read-only mode beside the current process. If reviewers can trace the material claims, identify what is missing, and make the decision with less rework, you have a foundation worth expanding. If they cannot, improve the context and controls before adding another feature or agent.

November 3, 2025

From Chaos to Consistency: How I Built a Scalable AI Content Design Agent with RAG

It’s Monday morning, and my Slack and email are already overflowing with content requests: “Can you review this flow?”; “Can you rewrite this screen?”; “Can you name this feature?” I’m not freshly back from holiday—this is just a regular work week kicking off. If you’ve ever been a solo content designer supporting multiple teams, you’ll recognize the pressure. The pipeline for content in product design is always full, and the demand for expertise never stops.

Fixing this isn’t just a matter of better time management or incremental process tweaks. To truly scale, I needed to extend my reach by bringing AI into the design process—without sacrificing judgment, standards, or quality. That Monday morning, I realized I had to scale my skills, my judgment, and our systems, not just my calendar.

Building AI is fundamentally about building systems. I wanted to use AI to scale myself without devaluing critical thinking or flooding the product with generic, verbose content. I also knew a useful AI tool must do more than spit out microcopy—it has to plug into a system we can continually shape. As a content designer, the system is always the starting point. Strong design systems create strong content standards; then AI agents can produce content that meets those standards at speed, freeing me from the bulk of standardized work. That’s not a threat—it’s an advantage. To instruct AI well, our systems must be well constructed.

I often think about this work like a bakery. You need a recipe before you can make a loaf of bread. Most interface content churns out the same loaf, day in and day out. It’s better for the master bakers to focus on the unique, custom bakes—and how the recipe needs to change. With that mindset, I set out to build an AI content design agent.

[Image: The Content Design Agent workspace, a chat UI titled VERBI with a central prompt box, chips for writing, editing, and reviews, and controls to view permissions and open the agent setup.]

When I started this project back in May 2025, many LLMs still had frustrating limitations. Google Gemini let me build a custom Gem agent, but I couldn’t share it with other users. ChatGPT could be customized, but only with static files: I couldn’t point it to live, updatable URL sources. I settled on Glean for three simple reasons: everyone at the company had access; Glean could access all internal documentation and treat URLs as sources of truth; and its then-new Agents feature made AI search customizable. Configuring an agent in Glean is straightforward—you choose a trigger, a set of prompts, and a set of actions—but first I needed to get the inputs right.

AI agents need focus. We had a wealth of internal information at Intercom, but not all of it was current or reliable. I curated exactly what the agent could access and assembled a tightly governed knowledge collection in Glean. Only essential information made the cut: the Intercom style guide, which is our definitive house style and includes regularly broken rules like “always write in US English” and “use sentence case everywhere”; tone of voice guidance for how we show up across mediums; a product glossary with hundreds of feature names and writing conventions; a monetization glossary for prices, plans, and add-ons; product marketing messaging guides with positioning for every feature and launch; core research insights across the product; and fin.ai and intercom.com/suite as the official, most up-to-date messaging sources.

This is classic RAG (retrieval-augmented generation) in action, ensuring every answer is grounded in approved sources of truth. With the collection in place, I instructed the agent to prioritize these resources above anything else.

[Image: The no-code agent builder, showing a Content Design Agent assembled from a chat trigger, a company search step, and a response step, alongside a starter checklist.]

Then came the fun part—building and branding the agent. “Content Design Assistant” felt bland, so I named it VERBI, a nod to its “verbal” design job. When people interact with VERBI, they usually begin with a question, but the intent varies widely. I defined a set of task prompts to guide expectations and outputs: “Can you write this?”; “Can you edit this?”; “Can you review this?”; “Can you name this?”; “Give me options”; “Give me guidance”; “Give me strategy”; “Give me research.” This mirrors the real breadth of content design, from creation to critique to discovery.

To manage responses, VERBI needed three things: start with a specific task prompt; understand how to draw on the right resources each time; and connect with other systems. With task prompts defined, I wrote a detailed system prompt covering the essentials. Role: you are a content designer, supporting product designers. Employer: Intercom (consisting of Fin AI Agent and our next-gen Helpdesk). Resources: content design collection, research collection, Storybook design system. Tone of voice: follow a specific tone for our UI, adjust the tone for everything else. Components: for UI, use the specific guidelines in our design system only. Use cases: writing, editing, critiquing, naming, researching, and more.

One connection mattered most: our design system, recently rebranded as “Surge.” Surge contains detailed content guidelines for every component in our product UI, from accordions and banners to tabs and tooltips. That granularity took months of human effort to codify, and it paid off. Designers no longer guess how to write for a toggle, a button, or a tooltip—and now VERBI understands and enforces those rules, too. A great content design assistant isn’t just a clever system prompt; it needs deep, component-level guidance to retrieve.

[Image: Design system documentation for the Badge component, showing content rules for naming statuses, defining types, and applying color.]

Accessing the design system wasn’t simple at first. It lives in Storybook, which Glean couldn’t access directly. I started by scraping guidance from Storybook into an HTML file with Cursor and uploading it to VERBI—a functional but clunky workaround that required re-scraping every few days. Then our IT team stepped in. They used the Glean Indexing API to turn Storybook into a live data source. Now VERBI connects to Storybook directly. Ask it something ultra-specific, like the correct date format for Japan, and it returns the right answer. That integration elevated the agent from helpful to indispensable—human-level precision, 24/7, at scale.

With prompts and resources in place, I launched VERBI and pressure-tested it. It was accurate and well-informed most of the time, but like any AI agent, it had quirks. I needed it to act as a gatekeeper, not a brainstorming partner that might bend rules or invent new ones. So I added a few explicit guardrails to the system prompt. Stopping sycophancy: “Inform, challenge, and assist. Never placate. Don’t agree by default. If something’s wrong, say so. Challenge assumptions.” Halting hallucinations: “If you don’t find the information required in our resources, say you don’t know the answer. Don’t guess and don’t give answers based on general knowledge.” Avoiding verbosity: “Keep answers short and to the point. Cut the fluff. Skip all niceties and social padding. Only give longer answers if the user asks you to.” These constraints keep responses crisp, correct, and consistent. Like any living system, the prompt needs occasional tune-ups, but the maintenance is minor compared to the upside.
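The guardrails above live in the prompt, but the "say you don't know" rule can also be backed by a check in code. The following is a minimal sketch under stated assumptions: `with_guardrails`, `grounded_answer`, and the `generate` callable are hypothetical names, and the refusal message is invented for illustration. The guardrail wording is taken from the quotes above.

```python
from typing import Callable

# Guardrail text quoted from the system prompt described above.
GUARDRAILS = {
    "sycophancy": ("Inform, challenge, and assist. Never placate. "
                   "Don't agree by default. If something's wrong, say so. "
                   "Challenge assumptions."),
    "hallucination": ("If you don't find the information required in our "
                      "resources, say you don't know the answer. Don't guess "
                      "and don't give answers based on general knowledge."),
    "verbosity": ("Keep answers short and to the point. Cut the fluff. "
                  "Skip all niceties and social padding. Only give longer "
                  "answers if the user asks you to."),
}


def with_guardrails(base_prompt: str) -> str:
    """Append every guardrail as a bullet under the base system prompt."""
    rules = "\n".join(f"- {text}" for text in GUARDRAILS.values())
    return f"{base_prompt}\n\nGuardrails:\n{rules}"


def grounded_answer(question: str, passages: list[str],
                    generate: Callable[[str, list[str]], str]) -> str:
    """Refuse in code when retrieval found nothing, before any model call."""
    if not passages:
        return "I don't know. I couldn't find that in our resources."
    return generate(question, passages)


# Stand-in for a real model call (assumption: it takes question + passages).
def echo_model(question: str, passages: list[str]) -> str:
    return f"Based on our guidelines: {passages[0]}"


print(with_guardrails("Role: You are a content designer."))
print(grounded_answer("Date format for Japan?", [], echo_model))
```

The code-level check is a backstop, not a replacement for the prompt rule: it only covers the case where retrieval returns nothing at all.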

Where we are now: VERBI has been triggered 700+ times since launch. The benefits are tangible. For me, quality scales without constant policing; repetitive questions about naming, style, or punctuation have dropped significantly. I reclaim time because the agent drafts and checks V1 content across teams, enabling me to focus on higher-impact work. For the design team, iteration is faster, confidence is higher, and strategic clarity improves because shared language and grounded guidelines make decisions easier and more consistent.

I used to spend too much time mopping up basic content mistakes and untangling spaghetti-like UI copy prone to human error. VERBI removes those errors at the source. The real advantage is speed: we get from blank slate to a high-quality first draft quickly, which means we can spend our energy deciding whether the content is right, not just “good enough.” Design is the whole interface—words, visuals, interactions—so reviews now happen with real content, never “copy TBD.” Our principle to sweat the details applies equally whether work is human-made or AI-assisted.

Knee-jerk critiques of AI-driven content design often assume teams generate content from nothing and ship it. In reality, great AI is the outcome of great human decisions and strong systems. Its value is pulling us together faster—getting us to a complete, standards-compliant design we can review as a team before sharing it with the world. That’s how AI helps us win: by turning chaos into consistency, and consistency into velocity.

Inspired by this post on The Intercom Blog.

October 31, 2025
What I Learned from Trainline’s Agentic AI: Building a Trusted Travel Assistant at Scale

Over the past year, I’ve been shipping agentic AI into production and coaching product teams on what it really takes to make these systems trustworthy in the wild. One story that crystallizes the playbook comes from Trainline’s move to an agentic architecture for travel assistance—an approach that mirrors what I’ve seen work in high-stakes, real-time customer experiences.

Trainline—the world’s leading rail and coach platform—helps millions of travelers get from point A to point B. Now, they’re using AI to make every step of the journey smoother.

I studied how David Eason (Principal Product Manager), Billie Bradley (Product Manager), and Matt Farrelly (Head of AI and Machine Learning) approached the build of "Travel Assistant, an AI-powered travel companion that helps customers navigate disruptions, find real-time answers, and travel with confidence." Their work exemplifies the kind of end-to-end thinking required to move beyond demos into dependable, on-the-go assistance.

They share how they: Identified underserved traveler needs beyond ticketing; Built a fully agentic system from day one, combining orchestration, tools, and reasoning loops; Designed layered guardrails for safety, grounding, and human handoff; Expanded from 450 to 700,000 curated pages of information for retrieval; Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time; Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go.

I align strongly with their core takeaways: "AI assistants need both scalable reasoning and deep domain context to be useful." "Tool design and guardrails are as critical as prompt design in agent systems." "LLM-as-judge evals make it possible to measure open-ended systems without massive labeling costs." And perhaps most importantly, "Even legacy companies can move fast when they embrace experimentation and tight PM–engineering collaboration."

From an AI strategy perspective, starting "fully agentic" was the right call. When the problem space is dynamic—disruptions, route changes, fare conditions—reasoning loops and orchestration aren’t luxuries; they’re table stakes. Tool selection becomes product design: you need the right retrieval interfaces, constraint-aware planners, and API contracts that are resilient to partial failures. Layered guardrails for safety, grounding, and human handoff reduce hallucination risk while preserving responsiveness—critical when users are standing on a platform waiting for an answer.
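To make "API contracts that are resilient to partial failures" concrete, here is a small sketch of one possible pattern: every tool call returns the same result shape, and the agent collects whatever succeeded plus a list of what did not. This is not Trainline's implementation; `ToolResult`, `call_tool`, `gather_context`, and the two example tools are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    """Uniform contract: every tool call yields this, success or failure."""
    tool: str
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


def call_tool(name: str, fn: Callable[[], dict], retries: int = 1) -> ToolResult:
    """Run a tool, retrying on any exception, and never raise to the caller."""
    last_error = ""
    for _ in range(retries + 1):
        try:
            return ToolResult(tool=name, ok=True, data=fn())
        except Exception as exc:  # broad on purpose: the agent must keep going
            last_error = f"{type(exc).__name__}: {exc}"
    return ToolResult(tool=name, ok=False, error=last_error)


def gather_context(tools: dict[str, Callable[[], dict]]) -> tuple[dict, list[str]]:
    """Collect data from working tools and name the ones that failed."""
    context, unavailable = {}, []
    for name, fn in tools.items():
        result = call_tool(name, fn)
        if result.ok:
            context[name] = result.data
        else:
            unavailable.append(name)
    return context, unavailable


# Hypothetical tools: one healthy, one timing out upstream.
def live_departures() -> dict:
    return {"train": "10:42", "status": "delayed 12 min"}


def fare_rules() -> dict:
    raise TimeoutError("fare service did not respond")


context, unavailable = gather_context(
    {"live_departures": live_departures, "fare_rules": fare_rules}
)
print(context)
print(unavailable)
```

The agent can then answer from `context` and tell the user plainly which information is missing, instead of failing the whole request.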

The retrieval scale-up—"Expanded from 450 to 700,000 curated pages of information for retrieval"—is a classic inflection point. I’ve seen teams stall here when they treat content growth as a pure indexing problem. The winning move is curation and structure: normalize sources, encode policy-level constraints, and align retrieval chunks to decision boundaries the agent actually uses. That’s how you keep precision high while coverage explodes.

Evaluation is where most open-ended assistants fail quietly, which is why I was encouraged by this item on their list: "Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time." In practice, LLM-as-judge gives you scalable, scenario-based scoring without prohibitive labeling, while a user context simulator surfaces regressions tied to persona, itinerary state, and device constraints. The combination closes the loop between model behavior, tool layer changes, and UX outcomes.
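The shape of an LLM-as-judge loop can be sketched in a few lines. This is an illustrative outline only, not the team's eval stack: `Scenario`, `RUBRIC`, `build_judge_prompt`, and `evaluate` are hypothetical names, the 1 to 5 scale is an assumption, and the judge and assistant are stubs standing in for real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Scenario:
    """One simulated user context: who is asking, what, and what is true."""
    persona: str
    question: str
    reference_facts: list[str]


RUBRIC = ("Score the reply from 1 (unusable) to 5 (fully grounded and helpful). "
          "Reply with the number only.")


def build_judge_prompt(scenario: Scenario, answer: str) -> str:
    """Assemble the text a judge model would receive."""
    facts = "; ".join(scenario.reference_facts)
    return (f"{RUBRIC}\nPERSONA: {scenario.persona}\n"
            f"QUESTION: {scenario.question}\nREFERENCE FACTS: {facts}\n"
            f"ANSWER: {answer}")


def parse_score(raw: str) -> int:
    """Reject anything that is not an integer in the rubric's range."""
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score


def evaluate(scenarios: list[Scenario], assistant: Callable[[str], str],
             judge: Callable[[str], str]) -> float:
    """Average judge score across all scenarios."""
    scores = []
    for scenario in scenarios:
        answer = assistant(scenario.question)
        scores.append(parse_score(judge(build_judge_prompt(scenario, answer))))
    return mean(scores)


# Stubs: a real setup would call the assistant and a judge model here.
def stub_assistant(question: str) -> str:
    return "It leaves from platform 4." if "platform" in question else "Not sure."


def stub_judge(prompt: str) -> str:
    answer = prompt.split("ANSWER:", 1)[1]
    return "5" if "platform 4" in answer.lower() else "2"


scenarios = [
    Scenario("commuter on mobile", "Which platform for the 10:42?",
             ["The 10:42 departs from platform 4"]),
    Scenario("tourist, delayed train", "Can I use my ticket on the next train?",
             ["Advance tickets are valid on the next service after a delay"]),
]
average = evaluate(scenarios, stub_assistant, stub_judge)
print(average)
```

Strict score parsing matters in practice: a judge that returns free text instead of a number should fail loudly rather than be silently counted.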

On product delivery, the line "Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go" shows mature prioritization. For travel, trust accrues in seconds: fast-enough responses, graceful degradation when upstream data lags, and explicit handoff when confidence dips. This is where guardrails meet UX writing: clear, bounded language signals competence even when the system defers.
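One way to picture the latency and handoff trade-off is a response wrapper with a time budget and a confidence threshold. The sketch below is an assumption-laden illustration, not a description of Travel Assistant: `respond`, the 2-second budget, the 0.7 threshold, and both fallback messages are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

SLOW_MESSAGE = "Live data is taking longer than usual. Here is the last known status."
HANDOFF_MESSAGE = "I'm not certain about this one. Connecting you to a support agent."


def respond(produce: Callable[[], tuple[str, float]],
            budget_s: float = 2.0, min_confidence: float = 0.7) -> str:
    """Return the draft only if it arrives in time and is confident enough.

    `produce` is assumed to return (answer_text, confidence in 0..1).
    Exceptions raised inside `produce` propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(produce)
    try:
        text, confidence = future.result(timeout=budget_s)
    except FutureTimeout:
        return SLOW_MESSAGE  # degrade gracefully instead of making the user wait
    finally:
        pool.shutdown(wait=False)  # do not block on a slow worker
    if confidence < min_confidence:
        return HANDOFF_MESSAGE  # explicit handoff when confidence dips
    return text


def fast_confident() -> tuple[str, float]:
    return "Your 10:42 is delayed by 12 minutes.", 0.93


def fast_unsure() -> tuple[str, float]:
    return "You can probably get a refund.", 0.41


def slow() -> tuple[str, float]:
    time.sleep(0.3)
    return "Late answer.", 0.99


print(respond(fast_confident))
print(respond(fast_unsure))
print(respond(slow, budget_s=0.05))
```

Both fallback messages are short and bounded on purpose, which is the UX-writing half of the guardrail.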

Finally, the organizational pattern matters. The teams that win in agentic AI are cross-functional, experimentation-driven, and ruthless about instrumentation. Tight PM–engineering collaboration, explicit safety thresholds, and an eval stack that mirrors real user journeys are what turn promising architectures into dependable products.

Their account is a behind-the-scenes look at how an established company is embracing new AI architectures to serve customers at scale.

If you’re building agentic AI in production, borrow these moves: invest early in tool and guardrail design, scale retrieval with curation not just volume, adopt LLM-as-judge plus context simulation for continuous evaluation, and treat latency and reliability as core product requirements—not afterthoughts. That’s how you ship AI assistance that customers trust when it matters most.

Inspired by this post on Product Talk.

October 30, 2025
