Category: AI Strategy

Brand Visibility in AI Answer Engines: A Product Playbook

If your CEO asks why an AI answer names a competitor but leaves out your brand, the tempting response is to publish more pages or look for a ChatGPT optimization trick. That treats the symptom. The real question is whether the answer engine can confidently connect your brand to the user’s decision, verify the connection, and explain it accurately.

Treat AI visibility as a product system. You can improve its inputs, test its outputs, and assign owners to its failure modes. You cannot guarantee a mention, but you can increase the probability of an accurate inclusion by building a clear public identity, credible evidence, reliable retrieval, and useful actions.

Define the decision you want to be present for

Brand visibility is too vague to manage. Visibility for what? A category definition, a shortlist, an integration question, a troubleshooting task, and a product comparison are different jobs. Each requires different evidence.

Start with an intent map. Use the customer journey, support conversations, sales objections, onboarding friction, and product analytics to identify the decisions that matter. Then connect each decision to the artifact an answer engine would need.

User job	Typical question	Artifact to publish	Desired answer behavior
Understand the category	What problem does this category solve?	Category explainer and glossary	Recognize the brand’s category and relevant use cases
Evaluate options	Which product fits this workflow or constraint?	Use-case page, comparison, and evidence	Include the brand when it genuinely fits and state the tradeoffs
Get started	How do I reach the first useful outcome?	Quick-start documentation	Return accurate prerequisites and steps
Integrate	Does this product connect to another system?	Integration page and API documentation	Describe compatibility, setup, and limitations correctly
Resolve a problem	Why is this workflow failing?	Troubleshooting documentation	Retrieve a grounded diagnosis and resolution path
Check current status	Is this feature available, and what changed?	Changelog and release notes	Use current product facts instead of stale descriptions

For each row, define when your brand is actually eligible. A weak objective says, ‘The brand should appear.’ A useful objective says, ‘The brand is relevant when the user needs this capability, works under these constraints, and can verify these claims.’

That distinction protects the program from vanity metrics. Your product should not appear in every answer. It should appear in the answers where it can help, in the correct category, with an honest account of its strengths and limits. My rule is simple: a mention that misclassifies the product is a failure, even if the brand name is present.

Prioritize prompt families using product judgment. Start where a better answer could affect a meaningful buying, activation, integration, or support decision. Within that set, look for the largest evidence gap: an important question for which your current public material is missing, contradictory, gated, or stale. That gives you a defensible backlog rather than an open-ended demand for more content.

Build a canonical brand record before producing more content

An answer engine has a harder job when your homepage describes one category, your documentation uses another product name, a partner directory lists an old capability, and a comparison page makes a broader claim than the evidence supports. Publishing another page adds volume without resolving the identity problem.

Create an internal brand fact record that becomes the contract for every public property. It should contain:

The official organization, product, and feature names, including approved abbreviations.
The primary category and a plain-language description of what the product does.
The users, jobs, and constraints for which the product is relevant.
The capabilities and integrations that can be stated publicly.
The limitations or eligibility conditions that materially change a recommendation.
The evidence behind important claims, such as documentation, case studies, API references, or release notes.
An owner and review trigger for every fact that can change.
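
One way to make this record auditable is to store each fact with its evidence, owner, and review trigger, then flag any fact that lacks them. The sketch below is a minimal illustration, not a prescribed schema: the `BrandFact` fields, the team names, and the sample facts are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BrandFact:
    """One publicly stated fact with its evidence, owner, and review trigger."""
    key: str
    value: str
    evidence: list[str]      # e.g. documentation, case studies, release notes
    owner: str               # hypothetical team name
    review_trigger: str      # event that should force a re-check
    last_reviewed: date


def facts_missing_governance(facts: list[BrandFact]) -> list[str]:
    """Return keys of facts that lack evidence, an owner, or a review trigger."""
    return [
        f.key for f in facts
        if not f.evidence or not f.owner.strip() or not f.review_trigger.strip()
    ]


# Hypothetical sample data for illustration only.
record = [
    BrandFact("primary_category", "Workflow automation platform",
              ["docs/overview"], "product-marketing", "category repositioning",
              date(2025, 1, 15)),
    BrandFact("integration.crm", "Two-way sync with the example CRM",
              [], "", "integration release", date(2025, 1, 15)),
]

print(facts_missing_governance(record))  # ['integration.crm']
```

A check like this can run before each content audit so that unowned or unevidenced facts are fixed in the record first and in public pages second.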

Use this record to audit the homepage, product pages, documentation, API references, GitHub repositories, partner listings, review profiles, and conference descriptions. Do not force identical prose everywhere. Do keep the underlying identity, category, capability, and product status consistent.

Your site architecture should make that identity easy to follow. Connect category explainers to use-case pages, use-case pages to product documentation, documentation to integrations and troubleshooting, and changing capabilities to release notes. The links should reflect a real path from understanding to evaluation to action.

Then inspect the technical path an unauthenticated visitor can use. The essentials are concrete:

Put foundational product facts in semantic HTML rather than only inside images, videos, or interfaces that require a login.
Check that robots.txt does not block public product and documentation pages and that XML sitemaps list them.
Use canonical tags to concentrate signals when similar pages exist.
Apply schema.org types such as Organization, Product, HowTo, and FAQPage only where the visible content supports them.
Use descriptive headings and rich alt text so page meaning is not dependent on presentation.
Keep public pages fast enough to retrieve reliably.
Leave foundational documentation open when there is no business, privacy, or security reason to gate it.
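
The robots.txt item in this checklist is easy to test automatically. The sketch below uses Python's standard `urllib.robotparser` to report which intended-public paths a given robots.txt would block. The rules, the paths, and the `example.com` host are assumptions for illustration, and real crawlers may interpret rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with an overly broad rule on API documentation.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /docs/api/
"""

# Hypothetical pages the team intends to be publicly retrievable.
PUBLIC_PATHS = [
    "/docs/quick-start",
    "/docs/api/reference",
    "/integrations/example-crm",
]


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "*") -> list[str]:
    """Return the paths that the given user agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in paths
        if not parser.can_fetch(agent, f"https://example.com{path}")
    ]


print(blocked_paths(ROBOTS_TXT, PUBLIC_PATHS))  # ['/docs/api/reference']
```

Running a check of this kind in the web team's deployment pipeline can catch a public page that was blocked by accident before it disappears from the retrievable footprint.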

Do not loosen access controls in the name of visibility. Public product facts, help content, and approved evidence belong in the retrievable footprint. Customer data, internal plans, private support records, and administrative documentation do not. The right fix for a gated public fact is a safe public page, not broader access to a private system.

Write pages that answer prompts without requiring guesswork

Traditional marketing pages often ask the visitor to infer the product’s category, audience, and value from slogans. An answer engine needs explicit relationships. It should be able to identify what the product is, who it is for, what task it performs, what conditions apply, and where the supporting evidence lives.

Use a predictable page contract

Write as if you are teaching a capable assistant that lacks your internal context. A useful page contract contains:

A short opening that directly answers the page’s primary question.
A clear definition of the product, feature, workflow, or integration.
Prerequisites and eligibility conditions before the instructions begin.
Steps or decision criteria in the order the user needs them.
Limitations, tradeoffs, and unsupported cases near the claim they qualify.
Links to evidence and deeper documentation.
A visible path to the next task, such as setup, troubleshooting, or an API operation.

Define acronyms where they first appear. Use descriptive headings rather than clever labels. Add concise question-and-answer sections when they match real prompts. Repeat canonical facts consistently, but do not bury the useful answer under repeated positioning language.

Match the artifact to the intent

A single generic landing page cannot cover the full journey. Build the artifact that makes the intended answer defensible:

Category explainers should define the problem, the common workflow, the relevant buyer, and the boundaries of the category.
Use-case pages should connect a specific user job to product capabilities and show the conditions under which the fit holds.
Comparison pages should state points of parity, meaningful differences, user fit, limitations, and migration considerations without turning every dimension into a victory claim.
Quick starts should identify prerequisites, the setup sequence, the first observable success, and common failure paths.
Integration pages should state supported objects or workflows, authentication requirements, data direction, limitations, and links to the relevant API or setup instructions.
Troubleshooting pages should connect symptoms to likely causes, corrective steps, and a way to verify that the fix worked.
Release notes and changelogs should make changing availability, behavior, and terminology explicit.

Comparison content deserves particular care because it directly affects product positioning. Do not hide obvious points of parity or invent distinctions that a buyer cannot verify. Explain where the alternatives differ, who benefits from each difference, and when the distinction should change the decision. Honest limits make the rest of the page more credible.

Maintain a claim ledger behind these pages. Record the exact claim, its evidence, the public locations where it appears, its owner, and the event that should trigger review. A product rename, integration change, policy update, or feature release should update the ledger and the affected pages together. This is how content operations become part of product operations.
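
A claim ledger can be as simple as a list of records that you query by trigger event. The following sketch assumes a hypothetical `Claim` structure and made-up claims, locations, and owners; it shows one way to find every public page that needs review when a given event occurs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str                          # the exact public claim
    evidence: str                      # where the proof lives
    locations: tuple[str, ...]         # public pages that repeat the claim
    owner: str
    review_triggers: frozenset[str]    # events that should force a review


def pages_to_review(ledger: list[Claim], event: str) -> dict[str, list[str]]:
    """Map each affected public location to the claims it carries for this event."""
    affected: dict[str, list[str]] = {}
    for claim in ledger:
        if event in claim.review_triggers:
            for location in claim.locations:
                affected.setdefault(location, []).append(claim.text)
    return affected


# Hypothetical ledger entries for illustration only.
ledger = [
    Claim("Supports two-way sync with the example CRM",
          "docs/integrations/example-crm",
          ("/integrations/example-crm", "/compare/example-alternative"),
          "integrations-pm",
          frozenset({"integration change", "product rename"})),
    Claim("Single sign-on is available on every plan",
          "docs/security/sso",
          ("/pricing",),
          "security-pm",
          frozenset({"policy update"})),
]

for page, claims in pages_to_review(ledger, "integration change").items():
    print(page, claims)
```

The design choice here is to key the lookup on the event rather than the page, because a release or rename usually arrives as an event and the affected pages are what you need to discover.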

Layer authority, live retrieval, and useful actions

AI visibility can happen at different layers. Treating them as one channel makes diagnosis difficult:

Public-footprint visibility comes from a clear, consistent body of information that helps an engine recognize the brand and its category.
Retrieval visibility happens when the engine or an attached workflow fetches current material during the conversation.
Action visibility happens when a connector or tool lets the user complete a task through the assistant.

The public footprint needs distribution as well as first-party content. Keep product facts consistent across documentation, API references, GitHub repositories, partner directories, reputable media, conference material, and legitimate third-party reviews. Pursue inclusion in structured knowledge bases such as Wikidata only when the brand meets the relevant eligibility requirements.

Do not manufacture authority through fabricated claims, fake reviews, or spammy link schemes. Those tactics create contradictions and reputational risk. The durable strategy is to be verifiably useful on the surfaces where practitioners already look for answers.

Live retrieval becomes important when an answer depends on current documentation, account context, or a changing product state. A retrieval-first pipeline should fetch the relevant material before the response is generated. Its quality depends on more than adding documents to an index.

Chunk documentation around a coherent task or concept rather than breaking related instructions apart.
Carry the heading and parent context with each chunk so a retrieved paragraph retains its meaning.
Add metadata for product, feature, version or status, intent, update state, and access permissions.
Prefer canonical documentation when duplicate explanations compete.
Return citations or document identifiers that allow the answer to be checked.
Test retrieval against the same prompt families used for visibility measurement.
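
The first three practices above can be sketched in a few lines. This example assumes documentation written with `#`-style headings, which may not match your source format, and the metadata keys are illustrative. It splits a document at headings, keeps each section whole, and attaches the full heading path and shared metadata to every chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    breadcrumb: str                       # heading path, e.g. "Guide > Setup"
    text: str
    metadata: dict[str, str] = field(default_factory=dict)


def chunk_by_heading(doc: str, metadata: dict[str, str]) -> list[Chunk]:
    """Split on '#' headings and carry the parent heading path with each chunk."""
    chunks: list[Chunk] = []
    stack: list[str] = []   # current heading path, one entry per level
    body: list[str] = []    # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(" > ".join(stack), text, dict(metadata)))
        body.clear()

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section under its own heading path
            level = len(line) - len(line.lstrip("#"))
            del stack[level - 1:]
            stack.append(line.lstrip("#").strip())
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical documentation page and metadata for illustration only.
DOC = """\
# Example CRM integration
Syncs contacts in both directions.
## Prerequisites
An admin account and an API key.
## Setup
1. Open Settings.
2. Paste the API key.
"""

chunks = chunk_by_heading(DOC, {"product": "example", "status": "current", "access": "public"})
for chunk in chunks:
    print(chunk.breadcrumb)
```

Because the setup steps stay together under "Example CRM integration > Setup", a retrieved chunk still says which integration it belongs to. A production splitter would also need to ignore `#` characters inside code samples, which this sketch does not handle.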

A ChatGPT connector or CustomGPT workflow adds the action layer. Publish a high-quality OpenAPI specification, keep each action narrowly scoped, and describe its inputs, permissions, output, and failure conditions clearly. The assistant should be able to choose the correct operation without guessing between overlapping tools.
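
A lightweight lint can catch action descriptions that would force an assistant to guess. The sketch below walks an OpenAPI-style document held as a Python dictionary and flags operations that lack an `operationId`, have a very short description, or document no failure response. The sample API, the 20-character threshold, and the specific rules are assumptions for illustration, not requirements of any connector platform.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def lint_operations(spec: dict) -> list[str]:
    """Flag operations an assistant could not choose or handle confidently."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            label = f"{method.upper()} {path}"
            op_id = op.get("operationId")
            if not op_id:
                problems.append(f"{label}: missing operationId")
            elif op_id in seen_ids:
                problems.append(f"{label}: duplicate operationId {op_id}")
            else:
                seen_ids.add(op_id)
            # Arbitrary threshold chosen for this example.
            if len(op.get("description", "")) < 20:
                problems.append(f"{label}: description too short to disambiguate")
            responses = op.get("responses", {})
            if not any(str(code).startswith(("4", "5")) for code in responses):
                problems.append(f"{label}: no documented failure response")
    return problems


# Hypothetical specification fragment for illustration only.
spec = {
    "openapi": "3.1.0",
    "info": {"title": "Example Orders API", "version": "1.0.0"},
    "paths": {
        "/orders/{orderId}": {
            "get": {
                "operationId": "getOrderStatus",
                "description": "Return the current status of one order the caller may view.",
                "responses": {"200": {"description": "Order found"},
                              "404": {"description": "No such order"}},
            }
        },
        "/orders": {
            "get": {
                "operationId": "listOrders",
                "description": "List.",
                "responses": {"200": {"description": "OK"}},
            }
        },
    },
}

for problem in lint_operations(spec):
    print(problem)
```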

Privacy-by-design belongs in the architecture, not in a warning added after launch. Enforce the user’s permissions before retrieval, preserve tenant boundaries, minimize the data passed into the model context, and keep secrets out of indexed content. If an action changes data or creates an external consequence, use clear confirmation and guardrails appropriate to that action.
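
Enforcing permissions before retrieval means filtering the candidate set first and ranking only what the user may see. The sketch below illustrates that ordering with a toy keyword scorer. The `Doc` and `User` shapes, the `"public"` tenant marker, and the scoring are assumptions, and a real system would use its own access model and search index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str                    # "public" marks material anyone may read
    required_role: Optional[str]   # None means no role is required
    text: str


@dataclass(frozen=True)
class User:
    tenant: str
    roles: frozenset[str]


def visible_docs(user: User, docs: list[Doc]) -> list[Doc]:
    """Keep only documents inside the user's tenant boundary and role set."""
    return [
        d for d in docs
        if d.tenant in ("public", user.tenant)
        and (d.required_role is None or d.required_role in user.roles)
    ]


def retrieve(user: User, query: str, docs: list[Doc], limit: int = 3) -> list[str]:
    """Filter by permission first, then rank; restricted text is never scored."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in visible_docs(user, docs)]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d.doc_id for _, d in ranked[:limit]]


# Hypothetical documents and users for illustration only.
DOCS = [
    Doc("kb-1", "public", None, "reset your api key from settings"),
    Doc("acme-7", "acme", "admin", "acme api key rotation schedule"),
    Doc("globex-2", "globex", None, "globex api key escalation notes"),
]

member = User("acme", frozenset({"member"}))
admin = User("acme", frozenset({"member", "admin"}))

print(retrieve(member, "api key", DOCS))  # ['kb-1']
print(retrieve(admin, "api key", DOCS))   # ['kb-1', 'acme-7']
```

Filtering before scoring keeps another tenant's text out of the ranking step entirely, so it cannot leak into the model context through a scoring or truncation bug.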

A connector does not replace the public footprint. It improves accuracy and task completion for users who can access it. Public explanations still establish category relevance, authority, and discoverability before the user invokes a tool.

Measure visibility as a product system, not a screenshot

A favorable answer copied into a presentation is not a measurement system. Answer behavior can vary with wording, context, model configuration, accessible material, and tool availability. Build a stable panel of priority prompts and track its outputs over time.

Each prompt in the panel should have an intent identifier, target user, task, wording, expected eligibility condition, claims that must be correct, and an artifact owner. Include natural variants across category discovery, evaluation, setup, integration, and troubleshooting. Preserve the panel long enough to compare changes instead of rewriting it after every result.

Score more than whether the name appeared:

Eligible mention rate: how often the brand appears when the predefined fit conditions are present.
Grounded citation rate: how often the answer points to appropriate first-party or credible third-party evidence.
Factual accuracy: whether the answer passes a predefined set of product facts.
Positioning accuracy: whether the brand is placed in the right category, use case, and competitive context.
Freshness: whether changing capabilities and product status match the canonical record.
Retrieval success: whether the workflow returns the document needed for the task.
Action completion: whether an enabled connector completes the intended task under the correct permissions.
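
Several of these scores reduce to simple ratios once each panel response has been graded. The sketch below assumes a hypothetical `PanelResult` record filled in by a reviewer or grader and computes three of the rates, restricting them to eligible prompts. The field names and sample results are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PanelResult:
    intent_id: str
    eligible: bool         # the predefined fit conditions hold for this prompt
    mentioned: bool        # the brand appeared in the answer
    cited_grounded: bool   # the answer pointed to appropriate evidence
    facts_passed: int      # predefined product facts the answer got right
    facts_checked: int     # predefined product facts that applied


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def score(results: list[PanelResult]) -> dict[str, object]:
    eligible = [r for r in results if r.eligible]
    mentioned = [r for r in eligible if r.mentioned]
    return {
        "eligible_mention_rate": rate(len(mentioned), len(eligible)),
        "grounded_citation_rate": rate(sum(r.cited_grounded for r in mentioned), len(mentioned)),
        "factual_accuracy": rate(sum(r.facts_passed for r in mentioned),
                                 sum(r.facts_checked for r in mentioned)),
        # Mentions where the brand does not fit are tracked, not celebrated.
        "ineligible_mentions": sum(r.mentioned for r in results if not r.eligible),
    }


# Hypothetical graded results for illustration only.
results = [
    PanelResult("evaluate-01", True, True, True, 4, 4),
    PanelResult("evaluate-02", True, True, False, 3, 4),
    PanelResult("integrate-01", True, False, False, 0, 0),
    PanelResult("category-09", False, True, False, 1, 2),
]

print(score(results))
```

Keeping ineligible mentions as a separate count makes it visible when a rising mention total comes from answers where the brand should not have appeared.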

Share of voice can help, but only within eligible prompts. A rising mention rate paired with falling accuracy is not progress. Nor is a citation useful when it points to an outdated page.

Use the failure pattern to choose the next intervention:

If the brand is absent across an entire intent family, inspect coverage, category clarity, and external authority.
If it appears under the wrong category, reconcile names and definitions across the canonical record and public properties.
If it appears without evidence, strengthen the relevant artifact and its links to documentation or proof.
If the facts are stale, repair canonical pages, release notes, metadata, and duplicate content.
If retrieval returns the wrong page, adjust chunking, metadata, canonical preference, and evaluation queries.
If the answer is correct but the action fails, inspect the OpenAPI description, authentication, permissions, inputs, and error handling.
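
The rules above can be encoded as an ordered checklist so that every graded response routes to the same next step. The sketch below is one possible encoding: the observation keys are hypothetical names for the grader's findings, and the checks run in the order listed, returning the first failure.

```python
# Ordered checks: (observation key, what to inspect when the check fails).
CHECKS: list[tuple[str, list[str]]] = [
    ("mentioned", ["coverage", "category clarity", "external authority"]),
    ("category_correct", ["names and definitions across the canonical record and public properties"]),
    ("has_evidence", ["the relevant artifact", "links to documentation or proof"]),
    ("facts_current", ["canonical pages", "release notes", "metadata", "duplicate content"]),
    ("retrieved_expected_doc", ["chunking", "metadata", "canonical preference", "evaluation queries"]),
    ("action_succeeded", ["OpenAPI description", "authentication", "permissions", "inputs", "error handling"]),
]


def triage(observation: dict[str, bool]) -> list[str]:
    """Return what to inspect for the first failed check, or [] if all pass.

    Keys missing from the observation are treated as not applicable.
    """
    for key, inspect in CHECKS:
        if not observation.get(key, True):
            return inspect
    return []


# Hypothetical graded observation: correct answer, but the connector call failed.
print(triage({"mentioned": True, "category_correct": True, "has_evidence": True,
              "facts_current": True, "retrieved_expected_doc": True,
              "action_succeeded": False}))
```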

Test changes with the same discipline used for a product experiment. State the hypothesis before shipping. Freeze the evaluation rubric. Capture a baseline, compare the candidate under the same conditions, and use repeated samples rather than interpreting one convenient response. Use an A/B design only where exposure can be isolated; otherwise label the result as a before-and-after observation and avoid claiming causality.

Set the minimum detectable effect before reviewing the outcome. In this context, it is the smallest improvement large enough to justify a decision. That prevents a tiny movement in a noisy prompt panel from becoming a success story merely because the team wants the release to work.
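
As a worked illustration, the comparison can be reduced to two questions: is the observed lift at least the pre-registered minimum, and is it unlikely to be noise? The sketch below uses a one-sided two-proportion z-test with a normal approximation. The counts, the 0.10 minimum effect, and the 0.05 threshold are assumptions for the example, and the test treats samples as independent, which repeated runs of the same prompt may not be.

```python
from math import sqrt
from statistics import NormalDist


def compare_rates(base_hits: int, base_n: int, cand_hits: int, cand_n: int,
                  mde: float, alpha: float = 0.05) -> dict[str, object]:
    """Compare candidate and baseline success rates against a preset MDE."""
    p_base = base_hits / base_n
    p_cand = cand_hits / cand_n
    lift = p_cand - p_base
    pooled = (base_hits + cand_hits) / (base_n + cand_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    # One-sided p-value: probability of a lift this large if nothing changed.
    p_value = 1.0 if se == 0 else 1 - NormalDist().cdf(lift / se)
    return {
        "lift": lift,
        "p_value": p_value,
        "meets_mde": lift >= mde,
        "significant": p_value < alpha,
    }


# Hypothetical panel: eligible mentions out of 100 sampled responses each.
clear_win = compare_rates(40, 100, 58, 100, mde=0.10)
noise = compare_rates(40, 100, 43, 100, mde=0.10)

print(clear_win["meets_mde"], clear_win["significant"])  # True True
print(noise["meets_mde"], noise["significant"])          # False False
```

Requiring both conditions guards against the two failure modes described above: a statistically detectable but trivial movement, and a large-looking lift from too few samples.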

Assign ownership by failure class. Product marketing can own canonical positioning, documentation can own instructional accuracy, the web team can own crawlability and structured markup, engineering can own retrieval and connectors, and product or analytics can own the evaluation panel. A shared dashboard is useful only when each red metric has a named route to action.

Key takeaways

Optimize for eligibility in a real user decision, not for raw brand-name frequency.
Establish one canonical brand fact record before adding more public content.
Publish answer-shaped artifacts for category, comparison, setup, integration, troubleshooting, and product-change intents.
Combine a trustworthy public footprint with live retrieval and carefully scoped actions.
Measure mentions, citations, accuracy, freshness, retrieval, and task completion separately.
Tie every content or technical change to a hypothesis, a stable prompt panel, and a minimum detectable effect.

Start with the prompt family closest to a real buying, activation, integration, or support decision. Capture the baseline answer, identify the smallest missing or unreliable artifact, fix it, and rerun the same evaluation. Expand to adjacent intents only after the first one produces consistently accurate, well-grounded answers.

The goal is not to make an assistant say your name. It is to make your brand a defensible inclusion for the right question, supported by current evidence and a working next step.

References

Shivam.Consulting Blog – Crack the AI Answer Engine: How I Boost Brand Visibility in ChatGPT – Proven, Ethical Playbook

November 17, 2025

How I Use ChatGPT to Supercharge PM: Smart Workflows, Killer Prompts, and Real-World Wins

Every week, I lean on ChatGPT to cut through noise, reduce rework, and move faster with more confidence. It’s not a silver bullet, but it has become an unfair advantage in my day-to-day leadership of product strategy, discovery, and delivery.

Here’s my stance: ChatGPT doesn’t replace product judgment. It amplifies it. Used well, it accelerates product discovery, clarifies roadmaps, sharpens positioning, and strengthens stakeholder management. Used poorly, it creates noise and risk. What follows are the specific workflows and prompts that reliably save me hours while protecting quality and trust.

Discovery and research are where I see the biggest upside. I use ChatGPT to draft interview guides, transform raw notes into theme clusters, and generate “Jobs to Be Done” problem statements—then I validate them with customers. I anonymize inputs to protect privacy and follow privacy-by-design and data governance commitments; AI risk management matters more than ever when we’re handling real user data.

When I move from insight to definition, ChatGPT helps me spin up crisp PRDs and user stories. I provide context about our users, constraints, and success metrics and ask for structured outputs: goals, non-goals, acceptance criteria, and risks. This keeps our product trios aligned and focused on outcome-based OKRs rather than on shipped features alone.

For competitive analysis and positioning, I feed in public information and ask for points of parity, points of differentiation, and potential messaging angles. I treat the output as a starting point for my value proposition and battlecards—not the final word. It’s a fast way to surface hypotheses and pressure-test our product-led growth narrative.

Roadmapping and sprint planning also benefit. I use ChatGPT to map dependencies, draft milestone narratives, and transform epics into well-formed backlogs. When we align quarterly plans, I ask for risk scenarios and contingency options so we can make trade-offs explicit before we commit.

On analytics and experiments, ChatGPT is my drafting partner. It helps me define A/B testing plans, clarify the minimum detectable effect (MDE), and outline instrumentation requirements. I still verify numbers in our analytics stack, but the scaffolding is done in minutes, not hours—freeing me to focus on retention analysis and activation levers.

Stakeholder communication is where the time savings compound. I use ChatGPT to produce executive summaries, QBR write-ups that compare results against OKRs, and board-ready narratives that highlight outcomes, risks, and next steps. It’s a powerful way to stay crisp and consistent across leadership updates without losing the nuance that matters.

Prompt patterns make or break results. I keep four rules: set the role, provide rich context, define constraints, and specify the output format. For example: “You are a senior PM advisor. Context: [user, market, problem]. Constraints: [privacy, timeline, budget]. Output: PRD with goals, acceptance criteria, and risks.” With larger inputs, I use context window management by chunking content and asking for summaries before synthesis.
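
Here is a minimal sketch of how I might turn those four rules into a reusable template, with a naive word-based splitter for larger inputs. The function names, field labels, and word limit are my own illustrative choices, and a real splitter would count tokens rather than words.

```python
def build_prompt(role: str, context: dict[str, str],
                 constraints: list[str], output_format: str) -> str:
    """Assemble a prompt from the four rules: role, context, constraints, output."""
    context_lines = "\n".join(f"- {key}: {value}" for key, value in context.items())
    constraint_lines = "\n".join(f"- {item}" for item in constraints)
    return (
        f"You are {role}.\n\n"
        f"Context:\n{context_lines}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        f"Output: {output_format}"
    )


def split_for_context(text: str, max_words: int) -> list[str]:
    """Split long input into word-limited pieces to summarize before synthesis."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Hypothetical inputs for illustration only.
prompt = build_prompt(
    role="a senior PM advisor",
    context={"user": "finance operations lead", "problem": "slow invoice approvals"},
    constraints=["no customer PII", "ship within one quarter"],
    output_format="PRD with goals, acceptance criteria, and risks",
)
print(prompt)
print(split_for_context("one two three four five", 2))  # ['one two', 'three four', 'five']
```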

For internal knowledge, I lean on a retrieval-first pipeline. Instead of pasting long docs, I reference curated, approved sources so answers track to current reality. CustomGPT workflows and a simple ChatGPT connector help with governance: they increase speed while reducing the chance of hallucinations and stale information.

Guardrails are non-negotiable. We never paste sensitive data into prompts; we redact PII, spot-check against source-of-truth systems, and red-team important outputs. AI risk management isn’t just a checkbox; it’s how we maintain trust while scaling productivity with GenAI.

Finally, enablement turns personal productivity into team capability. I run short playbooks for empowered product teams: discovery synthesis, PRD drafting, roadmap storytelling, and stakeholder-ready updates. The result is higher-quality thinking, faster cycles, and fewer meetings to align on the essentials.

ChatGPT for product managers isn’t hype; it’s a practical edge when you apply discipline. Start with one workflow that drains your time, add a prompt template, and measure the outcome. In a week, you’ll have proof. In a quarter, you’ll have a new operating system for how your team learns, decides, and ships.

Inspired by this post on Product School.

November 17, 2025
Taming 1,000+ Vendor Emails: How Xelix’s AI Helpdesk Delivers Fast, Confident Answers

Chaos in vendor communications is a problem I see across finance operations: sprawling accounts payable inboxes, slow response times, and missed context. That’s why this build caught my attention—not just because it’s GenAI, but because it’s a disciplined product strategy that converts email overload into measurable outcomes.

Accounts payable inboxes can see 1,000+ vendor emails a day. Xelix’s new Helpdesk turns that chaos into structured tickets, enriched with ERP data, and pre-drafted replies—complete with confidence scores.

I dug into the end-to-end approach with the Xelix team (Claire Smid, AI Engineer; Emilija Gransaull, Back-End Tech Lead; and Talal A., Product Manager), focusing on how they scoped the problem, iterated fast, and de-risked AI in production.

Their product thesis is refreshingly pragmatic. They prototyped with “daily slices” (Carpaccio-style) and built a retrieval-first pipeline that matches vendors, links invoices, and drafts accurate responses—before a human ever clicks “send.” That framing matters: enrichment and matching take center stage, with the model amplifying precision instead of improvising.

We unpacked the tricky bits that make or break an AI helpdesk at scale: vendor identity matching, Outlook threading, UX pivots from “inbox clone” to ticket-first views, and the metrics that prove real impact (handling time, stickiness, auto-closed spam). The pipeline architecture and email processing choices were grounded in operational realities, not just AI aspirations.

Several takeaways are worth pinning to any AI product roadmap. “Start narrow to win: pick high-volume, high-cost requests (invoice status & reminders).” “Enrichment > magic: accurate replies come from great retrieval/matching, not just a bigger LLM.” “Design for adoption: familiar inbox view helps onboarding, but a ticket-first UI unlocks AI features.” These are the kinds of decisions that drive adoption, trust, and ROI.

Data enrichment challenges dominated early learning curves: stitching ERP context into tickets, handling vendor identification at scale, managing email thread continuity, and calibrating response generation for accuracy. On the generation side, the team emphasized precision over verbosity (clean responses that reflect system-of-record truth), then instrumented the experience with production-grade telemetry to evaluate system performance.

Trust was treated as a product feature. “Measure outcomes, not vibes: track ‘messages sent from Helpdesk’, % auto-resolved.” And critically, “Confidence builds trust: show match quality and response confidence so humans know when to edit.” By surfacing match quality and confidence scores, they shortened coaching loops and made human-in-the-loop supervision feel natural, not burdensome.

What’s next is equally compelling: “targeted generation, multiple specialized responders, and more agentic routing.” That direction aligns with agentic AI patterns I recommend for operations-heavy workflows—route first, retrieve deeply, then generate with intent. It’s a scalable path from assistive AI to autonomous resolution while maintaining governance and auditability.

If you want a quick map of the journey, the conversation flowed as follows:

00:00 Meet the Team: Claire, Emilija, and Talal
00:36 Introduction to Xelix and Its Products
01:08 Understanding Accounts Payable Teams
01:37 Help Desk Product Overview
03:11 Challenges Faced by Accounts Payable Teams
04:03 AI Integration in Help Desk
05:47 Automating Reconciliation Requests
07:45 Development Methodology: Carpaccio
09:11 Prototyping and Beta Testing
12:00 Manual Tagging and Data Collection
16:39 Focusing on High-Impact Use Cases
18:55 User Experience and Interface Design
24:56 Pipeline Architecture and Email Processing
28:21 Data Enrichment Challenges
29:04 Handling Vendor Identification
33:33 Email Thread Management
36:15 Generating Accurate Responses
40:48 Evaluating System Performance
49:20 Future Developments and Goals

My takeaway for product leaders: when the domain is high-volume and rules-heavy (like AP), retrieval-first beats model-first. Start with the narrowest, costliest intents; prove lift with “messages sent from Helpdesk” and “% auto-resolved”; then graduate UX from familiar to AI-native (ticket-first) once trust is earned. That’s how you turn vendor chaos into answers—reliably, scalably, and fast.

Inspired by this post on Product Talk.

November 13, 2025
AI Won’t Replace Engineers—Engineers Using AI Will: A Practical Playbook for Your Next Move

Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.

I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.

Today’s GenAI assistants, such as Claude Code or a ChatGPT connector, excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.

What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.

Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, a stable or lower change failure rate, and a shorter mean time to restore (MTTR). Pair that with outcome-based OKRs to avoid gaming the system: shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
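
To make the four metrics concrete, here is a minimal sketch that computes them from a list of deployment records. The `Deployment` fields and the sample week are assumptions for illustration; real pipelines would pull these timestamps from the CI/CD and incident systems, and teams differ on exactly how they define each metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None   # set once a failed change is fixed


def dora_summary(deploys: list[Deployment], period_days: int) -> dict[str, Optional[float]]:
    """Summarize the four DORA metrics for one reporting period."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_seconds = [(d.restored_at - d.deployed_at).total_seconds()
                       for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(lead_seconds) / 3600,
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours":
            sum(restore_seconds) / len(restore_seconds) / 3600 if restore_seconds else None,
    }


# Hypothetical week of deployments for illustration only.
start = datetime(2025, 11, 3, 9, 0)
deploys = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=10)),
    Deployment(start, start + timedelta(hours=6), failed=True,
               restored_at=start + timedelta(hours=8)),
    Deployment(start, start + timedelta(hours=20)),
]

print(dora_summary(deploys, period_days=7))
```

Comparing this summary for the pilot period against the baseline period gives you the engineering half of the picture; the outcome metrics still have to come from product analytics.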

Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.

If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.

How to upskill fast: build an AI product toolbox and ship small. Prototype GenAI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Leverage CustomGPT workflows for internal enablement and in-app guides or product tours to onboard users safely.

Here’s a pragmatic 90-day plan. Weeks 0–2: audit your top 10 engineering tasks by time spent and identify 3 that are ripe for AI augmentation. Weeks 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Weeks 7–10: productionize the wins, document runbooks, and add incident management paths. Weeks 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.

AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.

Inspired by this post on Product School.

November 12, 2025
A Quality System for Trustworthy AI-Assisted UX Research

Your AI-generated synthesis can be polished, plausible, and wrong. The dangerous failures are rarely obvious fabrications. They are quieter: a biased sample becomes a universal claim, a participant’s opinion becomes a product need, or a tidy theme loses the contradiction that should have changed the roadmap.

If you are deciding whether to trust AI-assisted UX research, do not judge the fluency of the summary. Judge the evidence chain behind it. You need to see how a product decision connects to the participants recruited, the questions asked, the underlying observations, the analytical interpretation, and the behavioral data used to check it.

Key takeaways
- Research quality is mostly determined before an AI tool sees a transcript. Start with the decision, learning question, and hypothesis.
- Use AI to accelerate transcription, extraction, tagging, clustering, and contradiction searches. Keep interpretation, confidence, and product judgment under human control.
- Require every theme to retain its participant coverage, supporting evidence, counterexamples, and unresolved uncertainty.
- Pair qualitative findings with funnels, cohorts, session evidence, and CRM data when those signals are relevant. Neither qualitative nor quantitative evidence should carry the decision alone.
- Finish with an atomic insight and a recorded choice. A summary that does not change a decision, test, or learning priority is not finished research.

Define quality at the decision boundary

Many teams begin AI-assisted research by asking which model should summarize their transcripts. That is too late in the process. The first quality control is the decision the research must inform.

Strong discovery begins with a decision statement, an explicit learning goal, and a hypothesis the team is willing to falsify. Without those constraints, an AI system can generate an impressive taxonomy of themes while leaving the actual product question untouched.

Before recruiting participants or writing prompts, create a short research contract:
- Decision: Name the choice that is genuinely open. Examples include whether to pursue an opportunity, which problem to solve first, or whether a proposed workflow deserves further testing.
- Decision condition: State what you would need to learn to proceed, pause, narrow the audience, or reject the current direction.
- Learning question: Ask about the behavior, context, constraint, or unmet need that makes the decision uncertain.
- Hypothesis: Write the current belief in a form that evidence could disprove. If every possible interview result would support it, it is not a useful hypothesis.
- Relevant population: Specify whose behavior matters to this decision and which segments could experience the problem differently.
- Evidence plan: Identify what interviews can reveal and which behavioral or operational signals could challenge the interpretation.
- Data boundary: Decide what the AI tool is allowed to receive, what must be removed, and who may review the resulting artifacts.
This contract changes how you evaluate the output. You are no longer asking whether the summary sounds reasonable. You are asking whether the evidence changes a named choice under stated conditions.
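One way to keep the contract from staying informal is to store it as a small record and check it for blanks before recruiting. The sketch below is illustrative only: the class name, the field names, and the extra `disconfirming_result` field (a prompt to write down what would disprove the hypothesis) are assumptions, not part of any tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchContract:
    """One record per study; fields mirror the contract items listed above."""
    decision: str
    decision_condition: str
    learning_question: str
    hypothesis: str
    disconfirming_result: str  # assumed extra field: the result that would disprove the hypothesis
    relevant_population: str
    evidence_plan: str
    data_boundary: str


def missing_items(contract: ResearchContract) -> list[str]:
    """Return the names of contract items that are still blank."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name).strip()]


# Hypothetical draft with two items left unanswered.
draft = ResearchContract(
    decision="Whether the proposed workflow deserves further testing",
    decision_condition="",
    learning_question="What happened the last time users attempted this task?",
    hypothesis="Users abandon the task when a second approver is required",
    disconfirming_result="",
    relevant_population="New users and churned users",
    evidence_plan="Interviews plus funnel data for the approval step",
    data_boundary="De-identified transcripts only, in the approved tool",
)

print(missing_items(draft))  # ['decision_condition', 'disconfirming_result']
```

A blank `disconfirming_result` is the useful signal here: if nobody can name the result that would count against the hypothesis, the hypothesis is probably not yet falsifiable.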

My standard is simple: a decision-grade insight must survive a skeptical review without relying on the model’s authority. A reviewer should be able to inspect the underlying evidence, see which participants and segments it covers, understand the interpretation applied to it, and identify what remains unknown.

Keep one distinction visible throughout the work:
- Observation: What the participant did, described, showed, or failed to complete.
- Interpretation: What that behavior may mean about a goal, anxiety, constraint, or job.
- Implication: What the product team may choose to change, test, or leave alone.
AI can help produce all three, but it should never blur them into a single sentence. Once an inference is written as if it were an observed fact, the rest of the synthesis becomes difficult to audit.

Protect the signal before AI touches it

An LLM cannot repair a convenient sample or a leading interview guide. It can only reorganize the resulting bias, often in language that makes the bias look more certain.

Recruit for the decision, not for convenience

If you interview only power users, you risk treating advanced workflows as mainstream needs. If you interview only vocal detractors, the roadmap can become a queue of complaints. A more useful recruiting frame includes new users, churned users, people who evaluated but did not convert, and adjacent personas where the decision calls for them.

Build a participant matrix before outreach. Use rows for the segments that could materially change the decision and columns for relevant states, such as adoption stage, conversion outcome, or workflow maturity. The matrix is not a quota formula. It is a visibility tool. It should make overrepresented groups and missing perspectives obvious.
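A participant matrix can be as simple as a count per segment and state. The following sketch assumes hypothetical segments, states, and participant codes; it prints the counts and lists the empty cells so that missing perspectives are visible before outreach ends.

```python
from collections import Counter

# Hypothetical rows and columns; use the ones that could change your decision.
SEGMENTS = ["new user", "power user", "churned user", "evaluator"]
STATES = ["converted", "did not convert"]

# Hypothetical recruits as (participant code, segment, state).
participants = [
    ("P01", "power user", "converted"),
    ("P02", "power user", "converted"),
    ("P03", "power user", "converted"),
    ("P04", "new user", "converted"),
    ("P05", "evaluator", "did not convert"),
]

counts = Counter((segment, state) for _, segment, state in participants)


def empty_cells(counts):
    """List every segment/state combination with no participant."""
    return [(s, st) for s in SEGMENTS for st in STATES if counts[(s, st)] == 0]


for segment in SEGMENTS:
    row = "  ".join(f"{state}: {counts[(segment, state)]}" for state in STATES)
    print(f"{segment:<14}{row}")

print("Empty cells:", empty_cells(counts))
```

An empty cell is a prompt for review, not a quota to fill. Some combinations may be irrelevant to the decision, and the matrix only makes that judgment explicit.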

Carry that segment metadata into synthesis. A theme that appears among established customers should not silently become a claim about evaluators. When a segment is absent, write that limitation into the insight rather than hiding it in an appendix.

Ask for behavior before interpretation

Questions about whether someone likes an idea invite speculation, politeness, and solution theater. Ask about the last relevant event instead. Have the participant reconstruct what triggered it, what they tried, where they hesitated, who else became involved, what workaround they used, and what happened next.

Neutral, behavior-first questions become stronger when participants can support the account with artifacts such as screenshots or workflow examples. The artifact does not automatically prove the interpretation, but it helps distinguish remembered behavior from a general opinion.

Pilot the guide with the product trio. Remove product terminology that telegraphs the preferred answer. Check whether each question could produce evidence against the working hypothesis. If the guide repeatedly asks participants to react to your solution, it is a concept evaluation guide, not an open discovery guide. Label it accordingly.

Set privacy boundaries before uploading transcripts

Consent to an interview does not automatically settle how AI will be used in transcription, analysis, storage, or sharing. Tell participants how their material will be handled, follow your organization’s data governance requirements, and remove identifiers that are not needed for the decision.

Do not place sensitive participant data into an unapproved prompt workflow. If the tool’s handling, retention, or access controls have not been approved, keep raw transcripts out of it and work with appropriately de-identified material in an authorized environment. The downside is not merely a poor synthesis; it is unnecessary exposure of participant and customer information.

De-identification should not erase the context required for analysis. Preserve non-identifying segment labels, workflow stage, and participant codes when they are relevant. The goal is to minimize sensitive data while retaining enough context to audit coverage and interpretation.
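As a rough illustration of that balance, the sketch below redacts obvious email addresses and phone numbers from a transcript while leaving the participant code, segment label, and workflow stage untouched. The patterns and field names are assumptions, and pattern matching alone will miss names, employers, and other identifiers, so it does not replace an approved de-identification process.

```python
import re

# Simple illustrative patterns; they catch only obvious formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical record: analysis context is kept, direct identifiers are removed.
record = {
    "participant_code": "P07",
    "segment": "evaluator",
    "workflow_stage": "trial setup",
    "transcript": "You can reach me at dana@example.com or +1 (555) 010-2030 after the trial.",
}

safe = {**record, "transcript": redact(record["transcript"])}
print(safe["transcript"])
# You can reach me at [EMAIL] or [PHONE] after the trial.
```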

Make AI produce an auditable synthesis

The most reliable workflow separates extraction from clustering and clustering from judgment. Asking for findings, recommendations, sentiment, and a roadmap in one prompt encourages the model to fill gaps and compress uncertainty.
1. Prepare the evidence set. Preserve the original transcript or recording, assign a participant code, attach relevant segment metadata, and remove unnecessary identifiers. Do not let an AI-generated summary replace the underlying material.
2. Extract participant-level observations. Ask the model to work through each participant separately. Capture the behavior or event, its context, the supporting excerpt or evidence location, and any missing information. Do not ask for themes yet.
3. Review the extraction. Check whether the observation is grounded in the transcript and whether the model has converted an opinion into behavior or inferred a motive the participant did not provide.
4. Cluster reviewed observations. Group similar evidence only after the participant-level pass. Require each cluster to retain the contributing participant codes, segment coverage, supporting evidence, and meaningful variations.
5. Search for contradictions. Ask which observations do not fit the cluster, which participants experienced the situation differently, and which alternative explanations remain plausible. Do not treat dissent as noise merely because it makes the summary less tidy.
6. Draft atomic insights. Turn a defensible pattern into a small evidence packet containing the finding, evidence, coverage, contradictions, confidence rationale, product implication, and unresolved question.
7. Triangulate relevant claims. Compare the qualitative interpretation with funnels, cohorts, session evidence, in-product paths, or CRM data when those systems contain a useful signal.
8. Conduct the decision review. A person accountable for the product choice inspects the evidence chain, challenges the interpretation, and records what the team will do or learn next.
You can make the separation explicit with narrowly scoped prompts.

Extraction prompt: Use only the supplied transcript. For each relevant event, return the participant code, observed or reported behavior, context, supporting excerpt, evidence location, and uncertainty. Do not merge participants, infer motives, or recommend a solution. Flag information that is missing.

Clustering prompt: Use only the reviewed observations. Group evidence by shared behavior and context. For every cluster, retain participant codes, represented segments, supporting observations, material variations, counterexamples, and plausible alternative explanations. Do not use repetition in the transcript as a substitute for participant coverage.

Challenge prompt: Review the proposed themes as a skeptical researcher. Identify unsupported generalizations, segment differences that were flattened, interpretations written as observations, contradictory evidence, and claims that cannot be traced to the supplied material. Do not invent missing evidence.
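The clustering prompt's rule about repetition can also be enforced outside the model. This minimal sketch, with hypothetical field names and observations, counts distinct participants and segments in a cluster separately from the number of mentions, so one talkative participant cannot pass for broad coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    participant: str  # participant code
    segment: str      # non-identifying segment label
    behavior: str     # what the participant did or reported
    location: str     # where the supporting excerpt sits in the transcript


def coverage(observations: list[Observation]) -> dict:
    """Separate raw mentions from distinct participants and segments."""
    return {
        "mentions": len(observations),
        "participants": sorted({o.participant for o in observations}),
        "segments": sorted({o.segment for o in observations}),
    }


# Hypothetical cluster: four mentions, but three come from one participant.
cluster = [
    Observation("P03", "power user", "exports data to a spreadsheet before approving", "P03 12:40"),
    Observation("P03", "power user", "repeats the export for each approval", "P03 31:05"),
    Observation("P03", "power user", "calls the export the only way to compare versions", "P03 44:18"),
    Observation("P08", "new user", "asks a colleague to export the data", "P08 09:12"),
]

summary = coverage(cluster)
print(summary["mentions"], len(summary["participants"]), summary["segments"])
# 4 2 ['new user', 'power user']
```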

Prompt design helps, but it does not replace review. Keep the prompt, relevant tool or model information, input scope, and human corrections with the research artifact. If the synthesis later changes, you should be able to determine whether the cause was new evidence, a different analytical instruction, or a human judgment.

AI is well suited to accelerating transcription, tagging, theme clustering, Jobs to Be Done extraction, and searches for hesitation or sentiment. Treat hesitation and sentiment labels in particular as interpretations to validate, not measurements generated by an objective instrument. A sentiment label is useful only when a reviewer can return to the behavior and language that produced it.

Validate the insight, then record the decision

A good synthesis review is not a copy-edit. It is an attempt to break the claim before the claim influences a roadmap.

Run a quality review against the evidence chain
- Traceability: Can a reviewer move from the insight to the contributing participants and the exact supporting material?
- Coverage: Does the claim name the segments represented, and does it disclose relevant segments that are missing?
- Construct validity: Is the finding about the behavior the study intended to understand, or has a nearby opinion been used as a proxy?
- Separation: Are observation, interpretation, and product implication visibly distinct?
- Contradiction: Does the artifact preserve disconfirming cases and material variations instead of forcing consensus?
- Triangulation: Where behavioral data is relevant, does it support, narrow, or challenge the qualitative account?
- Decision relevance: Does the finding change a live choice, a test, or the next learning priority?
Do not outsource confidence to the model. A confident tone is a language property, not an evidence assessment. Record confidence as a human rationale based on the clarity of the underlying behavior, the relevance and coverage of participants, consistency and counterexamples, and any corroborating behavioral evidence.

Quantitative and qualitative signals answer different parts of the question. Funnels, cohorts, and retention analysis can show where behavior changes or where people leave. Interviews and artifacts can expose the goals, anxieties, organizational constraints, and workarounds behind that behavior. Pairing those signals is how a team moves from observing what happened to developing a testable account of why.

When the signals disagree, do not average them into a vague conclusion. Check whether the interview sample represents the population in the analytics, whether the event instrumentation reflects the behavior being discussed, whether segments have been combined, and whether the evidence refers to the same stage of the journey. A contradiction is often the next research question.

Use an atomic insight format

A reusable insight should be small enough to inspect and complete enough to guide a choice. Use this structure:
- Decision: The product choice this evidence informs.
- Finding: The observed behavioral pattern and the context in which it occurs.
- Evidence: Participant codes, excerpts or artifact locations, and any relevant behavioral signal.
- Coverage: The represented segments and known gaps.
- Interpretation: The best current explanation, clearly labeled as an inference.
- Contradictions: Cases or data that weaken, narrow, or complicate the interpretation.
- Confidence: A short rationale grounded in evidence quality, coverage, consistency, and triangulation.
- Product implication: The opportunity, risk, constraint, or tradeoff the team should consider.
- Disposition: Act, test further, monitor, or take no action.
- Next unknown: The uncertainty most likely to change the decision.
Useful insight records also prevent familiar synthesis mistakes. Replace a broad label such as onboarding friction with the specific behavior, actor, context, and consequence. Do not let a memorable quotation stand in for a pattern. Do not describe a participant’s requested feature as the underlying need. Do not convert an AI-generated cluster into a roadmap item until the evidence packet survives review.
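The structure above maps naturally onto a small record type. The sketch below is one possible shape, not a standard: the class name, the review messages, and the example content are all hypothetical. It flags an insight that has no traceable evidence or no recorded contradictions before it reaches a decision review.

```python
from dataclasses import dataclass, field

DISPOSITIONS = {"act", "test further", "monitor", "no action"}


@dataclass
class AtomicInsight:
    decision: str
    finding: str
    evidence: list[str]  # participant codes, excerpt or artifact locations
    coverage: str
    interpretation: str
    confidence: str
    product_implication: str
    disposition: str
    next_unknown: str
    contradictions: list[str] = field(default_factory=list)

    def review_issues(self) -> list[str]:
        """Return reasons this record is not yet ready for a decision review."""
        issues = []
        if not self.evidence:
            issues.append("no traceable evidence: treat as a hypothesis")
        if self.disposition not in DISPOSITIONS:
            issues.append(f"unknown disposition: {self.disposition!r}")
        if not self.contradictions:
            issues.append("no contradictions recorded: confirm a search was run")
        return issues


# Hypothetical insight that arrives without evidence links.
insight = AtomicInsight(
    decision="Which onboarding problem to solve first",
    finding="Evaluators pause at the permissions step and ask a colleague before continuing",
    evidence=[],
    coverage="3 evaluators; no churned users interviewed",
    interpretation="Inference: permission labels do not match the roles evaluators know",
    confidence="Low: one segment, no behavioral data yet",
    product_implication="Consider testing role-based permission presets",
    disposition="test further",
    next_unknown="Whether the pause appears in funnel data for the permissions step",
)

for issue in insight.review_issues():
    print(issue)
```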

Bring the atomic insights to a decision review with the product trio. Record the choice, its rationale, what the team is deliberately not doing, and the evidence that could reopen the decision. Connect the chosen action to an outcome or learning objective rather than treating delivery of a feature as proof that the research was correct.

For your next study, start with one live decision and run the evidence through this chain. If a theme cannot be traced, mark it as a hypothesis. If participant coverage is lopsided, narrow the claim. If qualitative and behavioral evidence conflict, investigate the conflict before committing the roadmap. That is how AI becomes a fast, inspectable research assistant instead of an unaccountable author of customer truth.

References
- Shivam.Consulting Blog – 5 Costly UX Research Pitfalls I See Often – and How AI + Qual Insights Prevent Them
November 11, 2025

How to Evaluate AI Voice Support in Real-World Conditions

You have a shortlist of AI voice support products, a polished recording, and a decision that could affect thousands of customer conversations. The hard question is not whether an agent can sound convincing during one ideal call. It is whether the system stays useful when a caller interrupts, corrects themselves, asks an ambiguous question, waits on a backend system, or needs a human.

You can answer that question before a broad rollout. The method is to test complete support outcomes, introduce controlled complications, score failures separately from conversational polish, and use the result to define a limited production pilot.

Evaluate the support outcome, not the performance

A natural voice can create an impression of competence before the agent has done anything useful. Pleasant pacing, expressive speech, and a quick opening matter, but they cannot compensate for retrieving the wrong account, misunderstanding the request, or claiming that an action succeeded when it did not.

Treat the unit of evaluation as a completed support job. Depending on the intent, that job may require the agent to identify the caller, understand the request, retrieve the right information, explain the answer, perform an authorized action, confirm the resulting state, and send a follow-up or transfer the conversation. If you score only the spoken answer, you leave most of the product untested.

One live Fin Voice call illustrated this end-to-end standard in about 90 seconds: the agent verified identity, retrieved account information, managed an interruption, presented options, completed a workflow, and sent a follow-up email. That sequence is a useful model for constructing a test. It is not, by itself, proof of reliability across other calls.

Before anyone places a test call, write an outcome contract for each scenario:

Caller goal: What is the person trying to accomplish?
Starting state: What customer, account, order, subscription, or case data exists before the call?
Available evidence: Which knowledge, policies, and records may the agent use?
Permitted actions: What may the agent change, create, send, cancel, or escalate?
Required clarification: Which missing or conflicting facts must be resolved before an answer or action?
Completion evidence: What observable state proves that the request was resolved?
Unacceptable outcome: What error would make the call a failure even if the conversation sounded good?

This contract prevents a common scoring mistake: confusing non-transfer with resolution. A call can remain inside the AI channel and still leave the customer with a wrong answer, an incomplete action, or no idea what happens next. Conversely, an intentional transfer can be the correct resolution when the agent reaches a policy, permission, or confidence boundary.
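The difference between containment and resolution is easy to encode once the contract exists. The following sketch uses hypothetical field names and labels; it classifies a call from four facts an evaluator can check against the contract, rather than from whether the call stayed in the AI channel.

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    transferred: bool                # did the call leave the AI channel?
    transfer_was_required: bool      # did a policy, permission, or confidence boundary call for a handoff?
    completion_evidence_found: bool  # was the contract's completion evidence observed?
    unacceptable_outcome: bool       # did the contract's unacceptable outcome occur?


def classify(result: CallResult) -> str:
    """Label a call by outcome; staying in the AI channel is never enough on its own."""
    if result.unacceptable_outcome:
        return "failed"
    if result.transferred:
        return "resolved by handoff" if result.transfer_was_required else "unnecessary transfer"
    return "resolved" if result.completion_evidence_found else "contained but unresolved"


# A call that never transferred but left no completion evidence.
print(classify(CallResult(False, False, False, False)))  # contained but unresolved
# An intentional transfer at a policy boundary.
print(classify(CallResult(True, True, False, False)))    # resolved by handoff
```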

Build scenarios around the ways real calls become difficult

Start with support intents your operation actually receives. Prioritize intents that are frequent, expensive to handle, important to customer trust, or dependent on multiple systems. Do not begin with trivia questions that merely demonstrate broad language-model knowledge. You are evaluating support execution.

For every core intent, create a straightforward case and several controlled variants. Keep the customer objective constant while changing one condition at a time. That makes a failure diagnosable instead of merely disappointing.

A practical scenario matrix

Clean path: The caller gives the relevant facts in a clear order. This establishes whether the basic workflow works at all.
Missing information: Omit a detail the agent needs. Check whether it asks a focused question instead of guessing or restarting the intake.
Ambiguous intent: Use wording that could map to two support issues. The agent should disambiguate before retrieving data or taking action.
Mid-call correction: Let the caller change an account detail, date, product, or preferred option. Check whether the corrected fact replaces the old one throughout the workflow.
Interruption: Speak while the agent is answering. Observe whether it stops cleanly, understands the new input, and continues from the right point.
Backend delay: Introduce a slow retrieval or action. Evaluate how the agent manages the wait and whether it distinguishes a pending operation from a completed one.
Backend failure: Make a required system unavailable or return an error. The agent should not fabricate a result or promise completion it cannot verify.
Policy boundary: Ask for something the agent is not allowed to do. Test the explanation, alternatives, and escalation path.
Human request: Ask directly for a person. Verify that the agent follows the configured policy without turning the handoff into an argument.
Listening conditions: If your deployment must support different languages, accents, devices, or noisy environments, test each condition explicitly rather than treating one clear studio call as representative.
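To keep the "one condition at a time" rule honest, the variants can be generated from a single base scenario. This is a minimal sketch with a hypothetical intent and account state; every variant shares the same caller goal and starting data and differs only in its complication.

```python
# Hypothetical base scenario; the intent and account state are placeholders.
BASE = {
    "intent": "update a billing date",
    "account_state": "test account with one active subscription",
    "complication": "none (clean path)",
}

# One entry per row of the matrix above, excluding the clean path.
COMPLICATIONS = [
    "missing information",
    "ambiguous intent",
    "mid-call correction",
    "interruption",
    "backend delay",
    "backend failure",
    "policy boundary",
    "human request",
    "listening condition",
]


def build_variants(base, complications):
    """Return the clean path plus one variant per complication."""
    variants = [dict(base)]
    for complication in complications:
        variants.append({**base, "complication": complication})
    return variants


variants = build_variants(BASE, COMPLICATIONS)
for number, variant in enumerate(variants, start=1):
    print(f"{number:>2}. {variant['intent']} | {variant['complication']}")
```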

Give testers the goal, account state, and one complication. Do not script every sentence. A fully written dialogue tests whether the agent can follow the dialogue you anticipated; a goal-based scenario tests whether it can manage the conversation the caller actually creates.

Keep a few variants undisclosed until the live session. This is not a trick. It prevents the evaluation from becoming a memorized path while still keeping every test fair and reproducible. Record the exact variant afterward so another evaluator can run it again.

Run the call through the systems you expect to deploy

An unedited live call is more informative than a produced recording, but live alone is not enough. A live test can still use ideal data, a simplified integration, a practiced caller, and a workflow that avoids the hard parts of your environment.

Ask to run the scenario through a path that resembles the intended deployment:

Place a normal phone call through the proposed telephony route. If production will use call forwarding, test the forwarding path rather than a direct internal endpoint.
Use a safe test account containing representative records, permissions, and history.
Require the agent to retrieve data from the backend system that will be authoritative in production.
Introduce the chosen interruption, correction, ambiguity, delay, or error during the live conversation.
Require a real test action where it is safe to do so, not a verbal description of what the agent would have done.
Inspect the backend state after the call. Confirm that the correct record changed once, with the expected values.
Verify every promised follow-up, case creation, notification, or handoff outside the voice channel.
Retain the recording, transcript, timestamps, tool activity, and final system state for scoring.

This is especially important when an agent can take consequential actions. A fluent confirmation is not evidence that the action happened. The system of record is the evidence.
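Checking the system of record can be reduced to comparing a snapshot taken before the call with one taken after it. The sketch below assumes records can be exported as plain dictionaries keyed by record ID; the IDs, fields, and function names are hypothetical. It passes only when exactly the intended record changed, with exactly the expected values.

```python
def changed_records(before: dict, after: dict) -> dict:
    """Return {record_id: {field: (old, new)}} for every record that differs."""
    changes = {}
    for record_id in before.keys() | after.keys():
        old, new = before.get(record_id, {}), after.get(record_id, {})
        diff = {
            key: (old.get(key), new.get(key))
            for key in old.keys() | new.keys()
            if old.get(key) != new.get(key)
        }
        if diff:
            changes[record_id] = diff
    return changes


def verify_single_change(before, after, record_id, expected) -> bool:
    """True only if the named record changed as expected and nothing else changed."""
    wanted = {record_id: {k: (before[record_id].get(k), v) for k, v in expected.items()}}
    return changed_records(before, after) == wanted


# Hypothetical snapshots of a test account.
before = {
    "sub_1001": {"plan": "basic", "status": "active"},
    "sub_1002": {"plan": "basic", "status": "active"},
}
after = {
    "sub_1001": {"plan": "pro", "status": "active"},
    "sub_1002": {"plan": "basic", "status": "active"},
}
print(verify_single_change(before, after, "sub_1001", {"plan": "pro"}))  # True

# A duplicate record created by a retried action makes the check fail.
after_with_duplicate = {**after, "sub_1003": {"plan": "pro", "status": "active"}}
print(verify_single_change(before, after_with_duplicate, "sub_1001", {"plan": "pro"}))  # False
```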

Repeat important scenarios with different wording and a different caller. One successful run demonstrates that the capability can work. Repeated variants reveal whether the capability depends on a narrow phrase, a rehearsed cadence, or an unusually forgiving path.

Key takeaways

Score complete resolution, including backend state and follow-up, rather than voice quality alone.
Change one condition at a time so you can identify why a call failed.
Test interruptions, corrections, ambiguity, system delays, system errors, and escalation.
Measure different kinds of waiting separately; a lookup pause and a turn-detection problem are not the same defect.
Treat a successful demo as evidence for a pilot, not permission for an unrestricted rollout.

Score conversation, reasoning, and operational closure separately

A single overall rating hides the information you need to make a product decision. The call may sound awkward but reach the correct outcome, or sound excellent while making a dangerous mistake. Separate the evaluation into three layers.

Layer	What to inspect	Evidence of a pass	Typical failure
Conversation mechanics	Turn detection, interruption handling, pacing, response length, and intelligibility	The caller can speak naturally, correct the agent, and follow the response without fighting for the floor	The agent talks over the caller, leaves confusing silence, or delivers answers too long to retain by ear
Decision quality	Intent recognition, clarification, use of account context, policy application, and answer accuracy	The agent asks only for missing information, uses the correct evidence, and avoids unsupported conclusions	The agent guesses, asks redundant questions, ignores a correction, or applies the wrong policy
Operational closure	Identity checks, tool calls, state changes, confirmation, follow-up, and escalation	The verified backend state matches the caller’s request and the agent’s final explanation	The agent claims success without a completed action, changes the wrong record, duplicates work, or drops context during handoff

Use a simple 0-2 score for each criterion: 0 for failed or unsupported, 1 for completed with material caller effort or recovery, and 2 for correct and usable. The scale is deliberately small. Evaluators can usually distinguish failure, friction, and success more consistently than they can defend the difference between seven and eight on a ten-point scale.

Do not average away critical errors. A wrong account action, failed identity control, fabricated completion, or forbidden disclosure should remain visible as a release blocker even if many low-risk calls receive high scores. Record both the criterion scores and the count of critical failures.
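A short aggregation shows why the two records need to stay separate. In this sketch the scenarios, scores, and layer names are hypothetical; the layer averages look healthy while the critical failure still blocks release.

```python
from statistics import mean

# Hypothetical scored calls using the 0-2 scale per layer.
calls = [
    {"scenario": "clean path",
     "scores": {"conversation": 2, "decision": 2, "closure": 2}, "critical": []},
    {"scenario": "interruption",
     "scores": {"conversation": 1, "decision": 2, "closure": 2}, "critical": []},
    {"scenario": "backend failure",
     "scores": {"conversation": 2, "decision": 2, "closure": 0},
     "critical": ["fabricated completion"]},
]


def summarize(calls):
    """Report layer averages alongside critical failures, never instead of them."""
    layers = list(calls[0]["scores"])
    layer_means = {
        layer: round(mean(c["scores"][layer] for c in calls), 2) for layer in layers
    }
    critical = [(c["scenario"], issue) for c in calls for issue in c["critical"]]
    return {
        "layer_means": layer_means,
        "critical_failures": critical,
        "release_blocked": bool(critical),
    }


report = summarize(calls)
print(report["layer_means"])
print(report["critical_failures"], report["release_blocked"])
```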

Break latency into moments the caller can feel

Latency is not one number. Capture at least three moments: the time the agent takes to recognize that the caller has finished, the time it spends reasoning or waiting for a system, and the time needed to begin and complete the spoken response.

End-of-turn delay: A long delay after every caller turn makes the exchange feel unresponsive and can encourage both sides to start speaking at once.
Reasoning or retrieval delay: A pause can be appropriate when the agent is checking account data or invoking a backend workflow. In the live call described earlier, brief pauses were audible during subscription and backend checks; hearing those waits is more informative than a recording with them edited out.
Response delivery: A fast start does not help if the answer becomes a long monologue. Voice responses need structure and pacing that work for listening, not merely text that sounds acceptable when read.

Ask what is happening during a pause. If the system is doing useful work, the next statement should reflect that work and the action log should verify it. If the pause is long enough to make a caller wonder whether the call has dropped, the experience needs an appropriate progress cue. If the agent answers instantly but guesses, speed is concealing a quality problem.

Review individual timings as well as an average. A generally responsive agent with occasional severe stalls creates a different operational problem from one that is consistently a little slow. Your test recordings and timestamps should make both patterns visible without inventing a universal pass threshold that ignores the complexity of the workflow.
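If the test harness can export timestamps per turn, the three moments can be derived and summarized separately. The sketch below assumes four hypothetical timestamps per turn, in milliseconds from the start of the call, and reports the mean and the worst case for each phase so that an occasional stall is not hidden by the average.

```python
from statistics import mean

# Hypothetical timestamps per turn, in milliseconds from call start.
turns = [
    {"caller_end": 4000, "turn_detected": 4600, "speech_start": 5100, "speech_end": 9100},
    {"caller_end": 15000, "turn_detected": 15500, "speech_start": 16100, "speech_end": 19100},
    {"caller_end": 30000, "turn_detected": 30700, "speech_start": 36700, "speech_end": 41700},
]

# Each phase is the gap between two timestamps.
PHASES = {
    "end_of_turn": ("caller_end", "turn_detected"),
    "reasoning_or_retrieval": ("turn_detected", "speech_start"),
    "response_delivery": ("speech_start", "speech_end"),
}


def phase_durations(turn):
    """Split one turn into the three delays the caller can feel."""
    return {name: turn[end] - turn[start] for name, (start, end) in PHASES.items()}


def summarize(turns):
    per_turn = [phase_durations(t) for t in turns]
    return {
        name: {
            "mean": mean(d[name] for d in per_turn),
            "max": max(d[name] for d in per_turn),
        }
        for name in PHASES
    }


summary = summarize(turns)
for name, stats in summary.items():
    print(f"{name}: mean {stats['mean']:.0f} ms, max {stats['max']} ms")
```

In this made-up data the end-of-turn delay is steady, while the third turn contains a six-second retrieval wait. That is the kind of single stall worth tracing to the action log.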

Make recovery and escalation part of the product test

The strongest voice experiences are not the ones that never encounter confusion. They are the ones that recover without making the caller restart. Recovery is therefore a capability to test, not an embarrassing exception to hide.

Interrupt the agent in the middle of an answer. Correct a fact it has already used. Add a second request after the first appears resolved. Say that an explanation was unclear. Ask for a human. These moves reveal whether the agent maintains conversational state or merely produces plausible turns one at a time.

During recovery, look for specific behavior:

It stops speaking promptly when the caller takes the turn.
It identifies what changed instead of repeating the whole interaction.
It replaces corrected information rather than carrying both versions forward.
It asks a narrow clarification when the next action is uncertain.
It does not claim to understand when the transcript or subsequent action shows otherwise.
It preserves verified context and the reason for contact when a human takes over.
It tells the caller what will happen next instead of ending on an internal routing label.

Tone belongs in this test, but not as a beauty contest between synthetic voices. Evaluate whether pacing, brevity, acknowledgement, and word choice suit the moment. A caller correcting a billing detail needs a clear acknowledgement and an accurate update, not theatrical empathy. A caller who sounds uncertain may need a shorter explanation and a confirming question. Tone is the behavior of the conversation, not just the timbre selected in a settings menu.

Escalation should also count as a valid outcome when it is timely and informed. Define which conditions require a handoff, which allow one, and what context must travel with it. Then test the handoff from the caller’s side. If the customer reaches a person but has to repeat identity, intent, and every attempted step, the routing technically worked while the support experience failed.

Turn the evaluation into a controlled pilot decision

A strong live evaluation earns the right to run a pilot. It does not justify sending every eligible call to the agent. Production introduces variation in callers, data quality, traffic, integrations, and issue combinations that a demonstration cannot reproduce fully.

I would require five gates before approving even a limited external pilot:

Capability gate: Every must-have intent has completed its end-to-end workflow, including at least one controlled complication.
Critical-risk gate: No unresolved failure can expose the wrong account, bypass a required check, perform an unauthorized action, or report a false completion.
Conversation gate: The agent can handle interruptions, corrections, clarification, and explicit human requests without trapping the caller in a loop.
Operations gate: Your team can configure terminology, guidance, escalation behavior, greetings, voice, and deployment controls for the intended support environment.
Learning gate: Owners can inspect recordings, transcripts, tool activity, outcomes, and failures, then change the knowledge, workflow, policy, or conversation design responsible.

Start the pilot with a reversible slice of traffic and a clear human fallback. Select intents whose correct outcome can be verified in your systems. Define who reviews failed and escalated calls, who can pause the rollout, and who owns each class of fix. An answer-quality issue, a telephony issue, and a backend integration issue require different owners even when the caller experiences all three as one bad call.

Expand only when observed calls meet the outcome contracts you wrote before the demo. If the definition of success keeps changing after failures appear, the evaluation is no longer protecting the decision.

For your next vendor session, replace “show me your best call” with a scenario pack, a test account, and a request to inspect the final system state. You will learn more from one imperfect call that recovers correctly than from a flawless recording that never had to recover at all.

References

Intercom – Stop Falling for Hollywood Demos: The Unfiltered Truth of Live AI Voice for Support

November 11, 2025

From Sketch to Clickable Demo: My AI Prototyping Playbook to Build Apps in Hours

I’ve spent much of my career compressing the distance between a napkin sketch and something real customers can touch. At HighLevel, my product teams use generative AI to validate ideas faster, reduce risk earlier, and win stakeholder trust with evidence instead of slides. The goal isn’t to be flashy—it’s to be precise, testable, and repeatable.

Today, you can build it before you pitch it. AI prototyping can turn ideas into clickable demos in hours. Here are some tools to try and steps to follow.

I start every AI prototyping sprint by sharpening the problem statement and the outcome we care about. That means being explicit about the target user, jobs-to-be-done, and the riskiest assumptions. I define a minimum detectable effect (MDE) and tie it to outcome-based rather than output-based OKRs so everyone aligns on what “good” looks like before we touch a tool.

From there, I move from sketch to interface. I capture a rough flow (whiteboard, tablet, or even paper) and generate UI variations with my AI product toolbox: tools that translate structure into components and screens. I iterate on information hierarchy and copy until the narrative supports the core job, borrowing techniques from UX writing. For product managers leaning into LLMs, this phase is about speed to feedback, not perfection.

Next, I wire data and logic. I connect a lightweight backend or spreadsheet, stitch in a CRM integration if needed, and add LLM calls through a ChatGPT connector or Claude Code. If the concept benefits from multi-step autonomy, I introduce agentic AI to orchestrate tasks across APIs. CustomGPT workflows help me encapsulate business rules so the demo behaves consistently in user paths we care about.

Governance is not optional at this stage. I apply privacy-by-design defaults, document data governance decisions, and run a quick AI risk management pass: input validation, prompt safety, rate limits, and fallback responses. This keeps the prototype credible and prevents false positives from polluting stakeholder perception.

With a click-through in hand, I instrument the experience so learning compounds. I drop in Amplitude analytics to track activation, task completion, and drop-off, and set up simple A/B testing when there’s a meaningful design or copy choice. This makes the prototype a learning vehicle, not just a demo.

Then I get it in front of users—fast. Five targeted conversations will beat fifty internal opinions. I run structured product discovery interviews, observe time-to-value, and capture objections. This is where empowered product teams shine: we make changes in real time, re-run the flow, and document what moves the needle for product-led growth.

When speed matters, I use a four-hour cadence: Hour 1 for problem framing and MDE; Hour 2 for sketch-to-UI generation; Hour 3 for data wiring and AI logic; Hour 4 for instrumentation and user walkthroughs. By the end, we have a clickable demo, preliminary analytics, and a clear decision on whether to advance, pivot, or park.

Finally, I translate insights into a concise artifact: the hypothesis we tested, the signal we observed, the trade-offs we made, and the plan for the next sprint, which feeds roadmapping. The point is not to be right on the first try; it’s to learn precisely, cheaply, and quickly enough to invest with conviction.

If you adopt this approach, you’ll find that stakeholder management becomes easier, team energy rises, and your roadmap earns credibility. Build it before you pitch it, and let real interactions—not wishful thinking—do the heavy lifting.

Inspired by this post on Product School.

November 10, 2025
Win AI Search: Proven Playbook to Get Your Startup Recommended by ChatGPT & Perplexity

AI search is quickly becoming the new homepage for startups. When a buyer asks a model for the best tools, they often take the short list at face value. I treat this moment as a product surface I can influence with strategy, content, structure, and distribution—much like any other go-to-market channel.

Early on, I set a simple objective for my team and me: "Learn how LLMs like ChatGPT and Perplexity decide which startups to recommend and what signals help a brand get discovered in AI search." That sentence became our north star for experiments, instrumentation, and content architecture.

Here is the mental model that consistently holds up in practice. Answer engines synthesize responses from what the underlying language model learned in training and, in retrieval-backed products such as Perplexity, from crawled pages, citations, and other high-signal sources. They appear to weight consensus, clarity, recency, authority, and machine-readability. I don’t pretend to know the internals, but across hundreds of tests, the same patterns correlate with being surfaced and cited.

First, I make our entity unambiguous. I standardize the company name, product names, and leadership bios across the site and external profiles. I implement Organization and Product markup with schema.org and link out with sameAs to authoritative profiles like LinkedIn, Crunchbase, GitHub, and key directory listings. The goal is to collapse ambiguity so AI search knows exactly who we are and which claims are attributable to us.
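As a minimal sketch of that markup, the snippet below builds a schema.org Organization object with sameAs links and wraps it in a JSON-LD script tag. The company name, URLs, and the `organization_jsonld` helper are placeholders I made up for illustration; swap in your own canonical values.

```python
import json

def organization_jsonld(name, url, same_as):
    """Build a schema.org Organization object with de-duplicated sameAs links."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": sorted(set(same_as)),  # one entry per profile, stable order
    }

# Placeholder values; use the exact name and URLs you publish everywhere else.
org = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    same_as=[
        "https://www.linkedin.com/company/example-co",
        "https://github.com/example-co",
        "https://github.com/example-co",  # accidental duplicate is collapsed
    ],
)

# The tag to place in the page <head>.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(org, indent=2)
    + "\n</script>"
)
print(snippet)
```

Generating the markup from one source of truth makes it harder for the site and external profiles to drift apart.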

Next, I publish definitive, answer-first pages. For every core query—what we do, who it’s for, outcomes, differentiators, pricing, comparisons, and integrations—I ship a page that leads with a crisp summary, then supports it with evidence, examples, and plain language. I include Q&A sections, realistic use cases, and named case studies so models can quote and ground responses in verifiable facts.

I then make the site maximally machine-readable. I add schema.org for SoftwareApplication, Product, FAQPage, and HowTo where relevant. I keep titles, H1/H2 structure, internal links, and metadata descriptive and consistent. I expose last-modified dates, maintain an XML sitemap, and keep a visible changelog and release notes. Freshness matters—Perplexity, in particular, tends to privilege recent, well-cited material when answering time-sensitive questions.
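To illustrate the last-modified part, here is a small sketch that writes an XML sitemap with a `lastmod` entry per page using only the standard library. The page URLs and dates are invented, and `build_sitemap` is a hypothetical helper, not part of any particular CMS.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of (url, last_modified_date) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # ISO dates (YYYY-MM-DD) are an accepted lastmod format.
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")

# Placeholder pages; in practice these come from your CMS or build step.
sitemap_xml = build_sitemap([
    ("https://www.example.com/pricing", date(2025, 11, 10)),
    ("https://www.example.com/changelog", date(2025, 11, 7)),
])
print(sitemap_xml)
```

The useful habit is to derive `lastmod` from the real content change date rather than the deploy date, so freshness signals stay honest.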

Citations are non-negotiable. I earn credible mentions on third-party properties, analyst lists, comparison pages, and customer reviews. I prioritize authoritative placements over volume, then make sure our site references those sources to reinforce the signal. When Perplexity cites our page alongside a respected third-party review, our inclusion rate in answers rises noticeably.

I also design for developers, buyers, and machines at once. That means clean docs, integration pages, and transparent security and trust content. Clear API references, integration guides, and reliability notes give models concrete artifacts to summarize. Pricing, privacy, and support policies reduce uncertainty and increase the likelihood that an answer will include us.

Measurement turns this from a hunch into a system. I run controlled content experiments, track minimum detectable effect on discovery and mentions, and instrument referral patterns from AI assistants when citations appear. I monitor which prompts surface our brand, which sources are cited, and which pages are repeatedly used as references. When we move a KPI, we codify the pattern into our playbook and scale it.
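A rough sketch of that monitoring, assuming you already log each test prompt with the answer text and the URLs the assistant cited. The record shape, the brand name, and `mention_report` are my own illustrative choices; how you collect the answers is out of scope here.

```python
import re
from collections import Counter

def mention_report(runs, brand):
    """Summarize how often a brand appears in logged answers.

    runs: list of dicts with 'prompt', 'answer', and 'cited_urls' keys.
    """
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    mentioned = [r for r in runs if pattern.search(r["answer"])]
    cited = Counter(url for r in mentioned for url in r["cited_urls"])
    return {
        "mention_rate": len(mentioned) / len(runs) if runs else 0.0,
        "prompts_with_mention": [r["prompt"] for r in mentioned],
        "top_cited": cited.most_common(3),
    }

# Invented log entries for illustration only.
runs = [
    {"prompt": "best onboarding analytics tools",
     "answer": "Consider Acme and Beta.",
     "cited_urls": ["https://www.example.com/acme-review"]},
    {"prompt": "onboarding analytics for startups",
     "answer": "Beta is a common pick.",
     "cited_urls": []},
    {"prompt": "acme alternatives",
     "answer": "Alternatives to ACME include Beta.",
     "cited_urls": ["https://www.example.com/acme-review",
                    "https://www.example.com/compare"]},
]
report = mention_report(runs, "Acme")
print(report)
```

Because answers vary between runs, I would repeat each prompt several times and compare rates before and after a content change rather than trusting a single response.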

Trust is the compounding advantage. I maintain a transparent trust center, privacy-by-design posture, and clear data governance practices. I remove vague claims, back up benefits with evidence, and keep all performance or security statements auditable. Models tend to lift brands that feel low-risk, well-documented, and widely corroborated.

If you want a fast start, here’s the checklist I rely on. Standardize your entity and ship schema.org. Publish answer-first pages for core jobs-to-be-done, comparisons, and integrations. Earn authoritative third-party citations and reference them. Keep release notes, changelogs, and dates current. Instrument AI discovery and iterate based on what gets cited. Do this consistently, and your startup earns a fair shot at being recommended when buyers ask AI for the best options.

Inspired by this post on Amplitude – Best Practices.

November 7, 2025
AI Context Engineering: A System for Product Decisions

You give an LLM your discovery notes, a dashboard export, and a roadmap question. It returns polished recommendations in seconds. The recommendations sound plausible, yet your product trio still cannot tell which option deserves a commitment.

The missing ingredient is usually not a better prompt. It is a decision-ready context system: a controlled way to give AI the evidence, boundaries, and outcome definition required to reason about the same product decision your team is actually making. Done well, this gives you more than a convincing answer. It gives you a traceable choice, explicit uncertainty, and a validation plan.

Define the decision before you collect the context

For product work, context engineering is the deliberate design of everything an AI system can use at the moment it reasons: customer evidence, metrics, goals, constraints, definitions, instructions, and prior decisions. The useful unit is not a prompt or a document. It is the decision.

This distinction matters because an LLM can answer an underspecified request without exposing that the request was underspecified. Ask it to improve onboarding, and it can produce a credible list of patterns. That output still does not tell you which user segment matters, what improvement means, which current friction is supported by evidence, or what downside the team must avoid.

Before pulling any context, write a decision frame that answers these questions:
- What decision must be made? Name the commitment, not the general topic. “Choose whether to change a specific onboarding step” is a decision; “explore onboarding” is not.
- Who is the decision for? Identify the customer segment, use case, or part of the journey. Evidence from one segment should not silently become a claim about every user.
- What outcome should change? State the behavior or business result you want, then identify the guardrail signals that should not deteriorate.
- What can constrain the answer? Include privacy, risk, brand, commercial, technical, and operational boundaries before ideation begins.
- What evidence could change the choice? If no possible evidence would change the decision, you are asking AI to justify a conclusion rather than help make one.
- What must the output enable? Specify whether you need options, a recommendation, a decision memo, an experiment plan, or a list of unresolved questions.
Anchor this frame in outcomes rather than deliverables. “Improve activation for a defined segment while protecting support load” establishes a decision boundary. “Build a new onboarding checklist” merely names an output. The first lets AI compare interventions; the second encourages it to decorate a predetermined solution.

A practical test is to remove the proposed feature from the frame. If the decision still makes sense, you have probably described an outcome. If the frame collapses, the team may already be committed to an output.
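One way to make the frame hard to skip is to treat it as a small record that must be complete before any context is pulled. The sketch below is an assumption about how you might encode it; the field names and the example values are mine, not a standard.

```python
from dataclasses import dataclass, fields

@dataclass
class DecisionFrame:
    decision: str             # the commitment, not the topic
    segment: str              # who the decision is for
    outcome: str              # behavior or result that should change
    guardrails: list          # signals that must not deteriorate
    constraints: list         # privacy, risk, brand, technical, operational
    evidence_that_could_change_it: list
    output_needed: str        # options, recommendation, memo, experiment plan

def missing_parts(frame):
    """Return the names of frame fields that are still empty."""
    return [f.name for f in fields(frame) if not getattr(frame, f.name)]

# Hypothetical frame with one part left blank on purpose.
frame = DecisionFrame(
    decision="Choose whether to change the workspace-invite step in onboarding",
    segment="New admins on the self-serve plan",
    outcome="More new admins reach a first shared project",
    guardrails=["support contact rate"],
    constraints=["no new personal data collected"],
    evidence_that_could_change_it=[],
    output_needed="decision memo",
)
print(missing_parts(frame))
```

An empty `evidence_that_could_change_it` is the most useful failure to catch, because it signals the team may be asking for justification rather than a decision.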

Build a context packet that preserves evidence quality

A context packet is the smallest governed collection of information that allows the model and the product team to reason about the decision. It can combine customer quotes, behavioral trends, funnel friction, support conversations, and commercial constraints. The important work is to assemble, structure, compress, and challenge that evidence before asking for recommendations.

Do not treat every input as the same kind of truth. A customer quote gives you detail about an experience, not its prevalence. Usage analytics show behavior, not necessarily motivation. Support conversations overrepresent people who contacted support. CRM data can expose commercial constraints without proving that a feature creates customer value. Labeling these boundaries prevents the model from blending different signals into false certainty.

Use this structure for the packet:
- Decision header: the choice, decision owner, affected segment, and action that follows the decision.
- Outcome frame: the desired outcome, current signal, primary measurement, guardrails, and any metric definitions needed to interpret the data correctly.
- Evidence ledger: each relevant observation with its origin, segment, time period, and scope. Keep direct observations separate from interpretations.
- Constraints: technical dependencies, commercial commitments, privacy rules, brand boundaries, operational capacity, and known risks.
- Contradiction register: evidence that points in different directions, including differences between customer statements and observed behavior.
- Unknowns: missing evidence, ambiguous definitions, unrepresented segments, and assumptions the team has not validated.
- Output contract: the form of response you need, the criteria options must address, and the unsupported claims the model must label rather than fill in.
Compression is where many context packets either become useful or become misleading. The goal is not merely to shorten the material. It is to increase the proportion of decision-relevant signal without erasing qualifications.
1. Normalize repeated evidence. Deduplicate copied notes and repeated tickets so repetition in the packet does not impersonate independent confirmation. Preserve any real frequency data separately.
2. Retain the qualifiers. Do not compress away the segment, time range, denominator, metric definition, or product state that determines what an observation means.
3. Label epistemic status. Mark material as observation, interpretation, assumption, or generated hypothesis. A concise packet should make these distinctions clearer, not blur them.
4. Keep contradictions visible. If interviews describe one problem while behavioral data points elsewhere, preserve both signals and ask what evidence would resolve the conflict.
5. Remove inert context. My rule is simple: if an item cannot change an option, a risk assessment, or the validation plan, it does not belong in the active packet. Keep it available outside the model context if the team may need to inspect it later.
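The normalization and labeling steps above can be sketched as a small ledger routine. This is an illustration under my own assumptions: the `Evidence` fields, the matching rule (same wording, segment, and period), and the sample entries are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    text: str
    source: str
    segment: str
    period: str
    status: str  # observation | interpretation | assumption | hypothesis

def normalize(ledger):
    """Collapse repeated wording while keeping qualifiers and every source."""
    seen = {}
    for item in ledger:
        # Qualifiers stay in the key so different segments never merge.
        key = (" ".join(item.text.lower().split()), item.segment, item.period)
        if key in seen:
            seen[key]["sources"].append(item.source)
        else:
            seen[key] = {"evidence": item, "sources": [item.source]}
    return list(seen.values())

# Invented entries for illustration.
ledger = [
    Evidence("Setup wizard is confusing", "interview-07", "new admins", "2025-Q3", "observation"),
    Evidence("setup  wizard is confusing", "ticket-export", "new admins", "2025-Q3", "observation"),
    Evidence("Setup wizard is confusing", "interview-12", "enterprise admins", "2025-Q3", "observation"),
]
normalized = normalize(ledger)
for entry in normalized:
    print(entry["evidence"].segment, entry["sources"])
```

The routine only groups repeats and keeps their sources; deciding whether two sources are a copied note or independent confirmation remains a human judgment.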
Apply privacy-by-design while assembling the packet, not after the model has processed it. Customer transcripts, CRM records, and support conversations can contain personal or confidential data. Use approved systems, follow applicable access controls and data terms, redact identifiers, and aggregate where the decision does not require record-level detail. If you cannot establish that the data is permitted in the AI workflow, leave it out and provide a safe summary. The downside is not a weaker prompt; it is potential exposure of customer or company information.

Separate synthesis, strategy, and skepticism

Asking for a summary, a recommendation, and a critique in the same instruction makes it difficult to see where evidence ends and invention begins. A stronger agentic workflow separates those jobs into distinct passes: Summarizer, Strategist, and Skeptic.

The Summarizer creates an evidence map

The Summarizer should organize the packet without deciding what to build. Ask it to group evidence around the decision, preserve relevant qualifiers, expose conflicts, and identify missing information. Explicitly prohibit recommendations during this pass.

A useful Summarizer output contains the supported observations, the segments represented, the outcome signals involved, the contradictions, and the unknowns. Review this output against the packet before continuing. If the model has turned an assumption into a fact, fix the evidence map rather than hoping a later pass corrects it.

The Strategist develops decision options

Give the Strategist the approved evidence map, the original decision frame, and the constraints. Ask for a small, meaningfully different set of options, including the option to leave the product unchanged when that is legitimate.

Require the same fields for every option:
- the customer problem or opportunity it addresses;
- the packet evidence that supports it;
- the assumptions required for it to work;
- the expected outcome and guardrail signals;
- the dependencies and material trade-offs;
- the simplest valid way to reduce its largest uncertainty.
This format prevents one option from winning because it received a more persuasive narrative. It also makes unsupported leaps visible. If the model cannot connect an option to evidence, that option can remain an idea, but it must be labeled as a hypothesis rather than presented as a conclusion.

The Skeptic tries to disconfirm the options

The Skeptic should not produce generic risks. Ask it to find the strongest contrary evidence, the segment that might be harmed, the constraint most likely to invalidate the option, the metric that could be gamed, and the observation that would show the underlying hypothesis is wrong.

Require it to distinguish counterevidence already present in the packet from new conjecture. This matters because a skeptical tone can sound rigorous even when it is unsupported.

The same LLM can perform all three roles, but role prompts do not create independent evidence or independent reviewers. Freeze the context packet used for the loop, label every generated artifact, and keep generated claims out of the evidence ledger until a human verifies them. Role separation is a workflow control, not a guarantee of correctness.
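A minimal sketch of the three passes as a pipeline with a human gate after the Summarizer. The `llm` argument is a stand-in for whatever model call you use, the prompt wording is illustrative, and `run_three_passes` is a hypothetical helper rather than any vendor's API.

```python
from typing import Callable

def run_three_passes(frame: str, packet: str,
                     llm: Callable[[str], str],
                     approve_map: Callable[[str], bool]) -> dict:
    """Run Summarizer, Strategist, and Skeptic against one frozen packet."""
    summary = llm("ROLE: Summarizer. Map the evidence; make no recommendations.\n"
                  f"DECISION FRAME:\n{frame}\nPACKET:\n{packet}")
    # Human review of the evidence map happens before any options exist.
    if not approve_map(summary):
        raise ValueError("Evidence map rejected: fix it before the Strategist pass.")
    options = llm("ROLE: Strategist. Propose a few distinct options, including no change.\n"
                  f"DECISION FRAME:\n{frame}\nEVIDENCE MAP:\n{summary}")
    critique = llm("ROLE: Skeptic. Separate packet counterevidence from new conjecture.\n"
                   f"PACKET:\n{packet}\nOPTIONS:\n{options}")
    artifacts = {"evidence_map": summary, "options": options, "critique": critique}
    # Every artifact is labeled so it cannot slip into the evidence ledger.
    return {name: {"text": text, "status": "generated, unverified"}
            for name, text in artifacts.items()}

# Stub model for demonstration: it just echoes the role line of each prompt.
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return prompt.splitlines()[0]

result = run_three_passes("frame text", "packet text", fake_llm, lambda m: True)
print(result["critique"])
```

Passing the same `packet` string to every pass is what "freezing" means here: no pass can quietly introduce evidence the others did not see.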

Stop adding passes when the workflow is only rearranging language. The loop has done its job when the team can see the supported facts, viable options, disputed assumptions, material risks, and next evidence needed to decide.

Make the product trio the decision gate

AI can accelerate the reasoning, but it should not become the decision owner. Bring the packet and the three-pass output into a product trio of product, design, and engineering. The purpose of that forum is not to approve the AI recommendation. It is to make the trade-offs explicit and decide what the team is prepared to learn.
1. Verify the evidence boundary. Check whether the represented segments, product states, and metrics match the decision. Ask which customer or operational perspective is absent.
2. Classify the important claims. Mark each claim as supported observation, team interpretation, assumption, or generated hypothesis. If nobody can trace a recommendation back to the packet, treat it as a hypothesis or remove it.
3. Compare trade-offs on equal terms. Evaluate every option against the desired outcome, guardrails, constraints, dependencies, and learning value. Do not let the most detailed option appear strongest merely because the model wrote more about it.
4. Choose the next commitment. The valid outcomes are to proceed, run a discovery or validation step, defer the decision, or reject the options. Assign a human owner and make clear what action the decision authorizes.
5. Record the rationale. Convert the discussion into a concise decision memo rather than forwarding raw model output to stakeholders.
The decision memo should include:
- the decision and why it is being made now;
- the target segment, desired outcome, and guardrails;
- the evidence that carried the most weight;
- the chosen option and the alternatives rejected;
- the trade-offs accepted by the decision owner;
- the assumptions and unresolved questions;
- the validation method and disconfirming signal;
- the owner and trigger for revisiting the decision.
This gives stakeholders something stronger than AI-generated confidence. They can inspect what the choice rests on, where judgment entered, what could prove the team wrong, and when the decision should be reconsidered.

Close the loop with validation and decision memory

Even a well-grounded model output is not product validation. It is a structured hypothesis. Match the validation method to the claim and to the consequence of being wrong.
- For a causal behavior claim: use a controlled A/B test when traffic, instrumentation, and the product experience make that appropriate. Define the primary metric, minimum detectable effect, guardrails, analysis approach, and stopping rules before reading the result.
- For a usability or comprehension claim: use targeted customer interviews or usability evaluation with the relevant segment. AI can help organize notes, but preserve outliers and do not turn a small qualitative sample into a prevalence claim.
- For an operational claim: use a limited release with observability, support monitoring, and an explicit rollback condition. Watch the workflow around the feature, not only the feature interaction itself.
- For privacy, brand, regulatory, or other high-consequence constraints: complete the appropriate human review before launch. A persuasive model assessment is not a substitute for the accountable specialist or decision owner.
For an onboarding decision, for example, the packet may contain segment definitions, observed friction, support themes, and conversion signals. The workflow can propose alternative interventions and measurement plans. The trio still chooses which hypothesis deserves a controlled test, whether the minimum detectable effect is practical, and which activation or retention signals will determine the next move.
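To judge whether a minimum detectable effect is practical, a rough sample-size estimate helps. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline rate, lift, significance level, and power are assumptions to replace with your own, and a real test plan should be checked with your analytics team.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion test.

    baseline: control conversion rate, e.g. 0.20
    mde: smallest absolute lift worth detecting, e.g. 0.02
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Illustrative inputs only: 20% baseline activation, 2-point absolute lift.
needed = sample_size_per_arm(baseline=0.20, mde=0.02)
print(needed)
```

If the estimate exceeds the traffic the segment can supply in a reasonable window, that is a signal to test a bolder intervention or choose a different validation method.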

After validation, return the result to the context system. Record what shipped, the observed outcome, affected segments, unexpected behavior, and which assumptions held or failed. Update the decision memo and evidence ledger. Otherwise, the next AI session begins from the same stale assumptions, and the organization pays again to relearn what it already discovered.

That accumulated decision memory is one of the most valuable outputs of context engineering. It turns AI collaboration from isolated prompting into a feedback loop connecting discovery, strategy, execution, and measurable results.

Key takeaways
- Frame the product decision, target segment, outcome, and constraints before asking AI for options.
- Give the model a compressed evidence packet, not an unstructured pile of documents.
- Keep observations, interpretations, assumptions, and generated hypotheses visibly separate.
- Use distinct Summarizer, Strategist, and Skeptic passes to expose where reasoning changes.
- Let a human product trio own the trade-offs, commitment, and stakeholder rationale.
- Treat every recommendation as a hypothesis until validation produces new evidence, then feed that evidence back into the decision record.
Choose the next real product decision that is important enough to validate and bounded enough to act on. Write its decision frame, assemble the smallest safe context packet, run the three reasoning passes, and take a decision memo into your product trio. When the result flows back into the packet, context engineering stops being a prompting technique and becomes part of how you run product.

References
- Pendo – Perspectives, “AI Context Pulling Playbook: How I Make Humans + LLMs Collaborate for Sharper Product Outcomes”

November 6, 2025
Agentic AI for Incident Response: A Practical Operating Model

An incident fires. Your responders are not short of data; they are short of a trustworthy path through it. Deployment timelines, service ownership, dashboards, logs, runbooks, and prior incidents live in separate places, while the cost of a wrong action rises by the minute.

The decision in front of you is not whether AI can summarize the incident channel. It is whether an agent can shorten the investigation without becoming another failure mode. That requires an operating model covering the agent’s job, context, permissions, interface, and evaluation before you give it meaningful authority.

Give the agent an investigation job before action authority

An incident-response agent should run a goal-directed investigation loop, not wait for isolated prompts like a chatbot. A credible implementation can collect context, form and test hypotheses, and draft fixes inside Slack. The important product decision is where that loop must stop for human judgment.

Model the loop on the work a strong responder already performs:

1. Scope the incident. Identify the affected service, environment, customer surface, start time, and known symptoms. Preserve unknowns instead of filling them with plausible guesses.
2. Gather relevant context. Retrieve recent changes, service ownership, dependencies, telemetry, runbooks, feature-flag changes, and similar incidents.
3. Form competing hypotheses. Produce a ranked set rather than locking onto the first convincing explanation. Distinguish observed facts from inferences.
4. Test each hypothesis. Use read-only tools to query metrics, logs, traces, deployment state, and dependency health. Record what supports or weakens each possibility.
5. Propose the next best action. Explain the target, expected effect, risk, preconditions, and recovery path. Do not hide uncertainty behind an authoritative tone.
6. Update the investigation. Incorporate tool results and responder corrections, discard disproven hypotheses, and choose the next check.
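The hypothesis steps in this loop can be sketched as a small piece of state. The structure below is an assumption for illustration: the field names are mine, and the count-based score is a deliberately crude placeholder for whatever ranking your team trusts.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str
    supports: list = field(default_factory=list)  # evidence in favor
    weakens: list = field(default_factory=list)   # evidence against
    disproven: bool = False

    def score(self):
        # Placeholder ranking: net count of supporting evidence.
        return len(self.supports) - len(self.weakens)

def ranked(hypotheses):
    """Return live hypotheses, strongest first; disproven ones are dropped."""
    live = [h for h in hypotheses if not h.disproven]
    return sorted(live, key=lambda h: h.score(), reverse=True)

# Invented incident state.
deploy = Hypothesis("Recent deploy changed a timeout",
                    supports=["deploy at 09:02", "errors start 09:05"])
database = Hypothesis("Database is saturated",
                      supports=["slow queries"], weakens=["CPU is normal"])
network = Hypothesis("Network partition", disproven=True)

order = [h.claim for h in ranked([database, network, deploy])]
print(order)
```

Keeping supporting and weakening evidence as separate lists preserves the distinction between what was observed and what was inferred from it.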

The incident commander remains accountable for priorities and mitigation. The agent acts as an investigation engine: it gathers, tests, organizes, and proposes. This division is more useful than treating human involvement as a final approval click after the AI has already made every material decision.

Choose the first workflow with care. A good starting point has a bounded service area, dependable read-only signals, known responders, established runbooks, and outcomes you can verify after the incident. A workflow that depends on undocumented tribal knowledge or unrestricted production access is not ready for agentic automation. Fix the operating system around the incident before expecting a model to compensate for it.

Do not begin with the most dramatic remediation you can automate. Early value usually comes from reducing context switching, locating the correct owner, connecting symptoms to recent changes, and eliminating weak hypotheses. Those tasks consume scarce attention but do not require the agent to mutate production.

Context quality determines the ceiling of the investigation

A capable model cannot reason with operational context it cannot find, distinguish, or trust. If a service has three names across the deployment system, observability platform, and incident channel, retrieval becomes unreliable before model reasoning even begins.

Create a context contract for every service placed within the agent’s scope. At minimum, make these fields explicit:

- Identity: canonical service name, aliases, repository, runtime, and environment.
- Ownership: accountable team, current on-call route, and escalation path.
- Topology: upstream dependencies, downstream consumers, data stores, queues, and shared infrastructure.
- Change history: deployments, configuration changes, feature flags, migrations, and rollback state.
- Operational knowledge: current runbooks, known failure modes, dashboards, alerts, and prior incident records.
- Control policy: tools the agent may call, environments it may inspect, actions it may propose, and actions it may never execute.

Start retrieval with exact operational signals. Filter by canonical service, environment, incident time window, deployment identifier, alert type, and ownership tag. Then rerank the surviving records for the current question. This deterministic tagging and reranking foundation is easier to debug than making semantic similarity responsible for every retrieval decision.
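A minimal sketch of filter-then-rerank, assuming each record carries a canonical service, environment, timestamp, and text. The record shape, the service name, and the term-overlap rerank are illustrative stand-ins; a production reranker would likely be more sophisticated.

```python
from datetime import datetime

def retrieve(records, service, environment, start, end, question_terms):
    """Apply exact filters first, then rerank survivors by term overlap."""
    exact = [
        r for r in records
        if r["service"] == service
        and r["environment"] == environment
        and start <= r["timestamp"] <= end
    ]
    terms = {t.lower() for t in question_terms}

    def overlap(record):
        return len(terms & set(record["text"].lower().split()))

    return sorted(exact, key=overlap, reverse=True)

# Invented records; note the staging entry that matches the wording well.
records = [
    {"service": "checkout-api", "environment": "production",
     "timestamp": datetime(2025, 11, 6, 9, 5),
     "text": "deploy changed payment timeout config"},
    {"service": "checkout-api", "environment": "staging",
     "timestamp": datetime(2025, 11, 6, 9, 6),
     "text": "payment timeout errors after deploy"},
    {"service": "checkout-api", "environment": "production",
     "timestamp": datetime(2025, 11, 6, 9, 10),
     "text": "payment timeout errors rising"},
    {"service": "checkout-api", "environment": "production",
     "timestamp": datetime(2025, 11, 1, 9, 0),
     "text": "payment timeout errors"},
]
hits = retrieve(records, "checkout-api", "production",
                datetime(2025, 11, 6, 8, 0), datetime(2025, 11, 6, 10, 0),
                ["payment", "timeout", "errors"])
print([h["text"] for h in hits])
```

Because the filters run before any scoring, a well-worded staging record can never outrank a production one, and a wrong result can be traced to a specific tag.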

Add embeddings where language actually creates ambiguity: matching an unfamiliar symptom to a differently worded historical incident, finding a relevant paragraph inside a long runbook, or connecting terminology used by two teams. Semantic retrieval should widen discovery, not erase exact boundaries such as production versus staging or one tenant versus another.

Require every retrieved item to carry provenance that a responder can inspect: its system of record, service and environment, creation or update time, incident-time availability, and reason for retrieval. This lets the responder notice four common failures quickly:

- A runbook is relevant but stale.
- An ownership record is current but was different when the incident began.
- A similar incident came from another environment with different dependencies.
- A historical evaluation accidentally exposed the final root cause before the agent could have known it.
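These four checks can be approximated from provenance fields alone. The sketch below assumes each item records `updated_at`, `environment`, and `available_at`; those field names and the 180-day staleness threshold are my assumptions, not a recommended standard.

```python
from datetime import datetime, timedelta

def provenance_flags(item, incident_start, environment, as_of,
                     max_age=timedelta(days=180)):
    """Return warnings a responder should see next to a retrieved item."""
    flags = []
    if incident_start - item["updated_at"] > max_age:
        flags.append("possibly stale")
    if item["updated_at"] > incident_start:
        flags.append("changed after incident start")
    if item["environment"] != environment:
        flags.append("different environment")
    if item["available_at"] > as_of:
        flags.append("not available yet at this point")
    return flags

incident_start = datetime(2025, 11, 6, 9, 0)
as_of = datetime(2025, 11, 6, 9, 30)  # the moment being investigated or replayed

# Invented items for illustration.
runbook = {"updated_at": datetime(2024, 1, 15), "environment": "production",
           "available_at": datetime(2024, 1, 15)}
postmortem = {"updated_at": datetime(2025, 11, 8), "environment": "staging",
              "available_at": datetime(2025, 11, 8)}

print(provenance_flags(runbook, incident_start, "production", as_of))
print(provenance_flags(postmortem, incident_start, "production", as_of))
```

The flags do not decide anything on their own; they make the provenance visible enough for the responder to question an item quickly.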

Treat missing context as an observable product state. The agent should say that it cannot locate a deployment record or dependency map, identify which system was checked, and propose a safe way to continue. A confident answer assembled around a missing record is more dangerous than an explicit gap.

Scale permissions to reversibility and blast radius

Autonomy is not one switch. It is a set of permissions attached to particular tools, targets, environments, and action classes. Granting broad credentials because the agent usually behaves conservatively turns a model-quality issue into a production-control issue.

Action class	Appropriate agent role	Required human control
Read-only investigation	Query approved telemetry, changes, ownership, and runbooks	Audited access with service and environment boundaries
Recommendation or communication	Draft a diagnostic check, remediation plan, incident update, or escalation	A responder reviews customer-facing messages and consequential recommendations
Bounded, reversible execution	Invoke a preapproved runbook against an explicitly named target	Approval bound to the exact action, target, inputs, and current incident
Irreversible or broad execution	Explain the need and prepare a plan, but do not execute during the initial rollout	Existing change controls and accountable operators remain in force

Do not label an action reversible merely because the interface contains a rollback button. A deployment rollback can still be unsafe after an incompatible schema or data change. A restart can amplify load or destroy useful diagnostic state. Reversibility has to be validated for the specific service state, not inferred from the action name.

For every executable tool, define guardrails outside the prompt:

- Use least-privilege credentials scoped by service and environment.
- Allowlist tools, targets, and input shapes rather than relying on natural-language prohibitions.
- Preview the exact command or workflow, target, parameters, and expected effect before approval.
- Bind approval to that exact action so the agent cannot reuse it for a changed target or plan.
- Use rate limits, idempotency controls, and circuit breakers where repeated calls could cause harm.
- Route production changes through existing CI/CD or runbook automation when possible.
- Record retrievals, tool inputs, tool outputs, approvals, denials, and resulting state changes in an audit trail.
- Provide a direct way to suspend the agent’s tool access without disabling the incident workflow itself.
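As one possible way to bind an approval to an exact action, the sketch below fingerprints the full action payload and checks it, together with an allowlist and the incident identifier, before execution. The tool name, target, incident IDs, and helper functions are hypothetical.

```python
import hashlib
import json

# Hypothetical allowlist of (tool, target, environment) combinations.
ALLOWED = {("restart_worker", "checkout-api", "production")}

def fingerprint(action):
    """Hash a canonical form of the action so any change alters the value."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def approve(action, incident_id):
    """Record a human approval for this exact action in this incident."""
    return {"incident_id": incident_id, "fingerprint": fingerprint(action)}

def may_execute(action, incident_id, approval):
    key = (action["tool"], action["target"], action["environment"])
    return (
        key in ALLOWED
        and approval["incident_id"] == incident_id
        and approval["fingerprint"] == fingerprint(action)
    )

action = {"tool": "restart_worker", "target": "checkout-api",
          "environment": "production", "params": {"max_instances": 1}}
approval = approve(action, incident_id="INC-1")

changed = dict(action, params={"max_instances": 10})
print(may_execute(action, "INC-1", approval))
print(may_execute(changed, "INC-1", approval))
```

The check lives in code outside the prompt, so a persuasive model output cannot widen the scope of something a human already approved.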

The action proposal should be a control artifact, not a conversational suggestion. It needs the evidence supporting the action, the exact target, the expected observable result, the maximum intended scope, known preconditions, and what the responder will do if the result does not appear. If the agent cannot supply those fields, it has not earned execution authority for that action.

Keep outward communication on a separate permission path. Drafting a status update is low-risk technically but consequential for customers and the business. Human review should verify what is known, what remains uncertain, and whether the message promises a recovery time the evidence cannot support.

Make evidence and uncertainty legible in the incident room

Putting the agent inside the collaboration surface where incidents already unfold reduces the friction of opening another product and re-explaining the situation. It also means the agent’s output competes with urgent human messages. Long narrative answers will be skipped, however intelligent they sound.

Give each investigation update a stable structure:

- Observed: facts returned by named systems, with timestamps and links where available.
- Hypotheses: ranked explanations with the supporting and conflicting evidence for each.
- Changed since the last update: new evidence, rejected hypotheses, and responder corrections.
- Next check: the read-only query or tool call most likely to distinguish between the remaining possibilities.
- Proposed action: target, expected effect, blast radius, preconditions, and recovery path.
- Decision needed: the specific approval, input, or ownership choice required from a human.
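A stable structure is easier to keep if the message is rendered from data rather than written freehand. This is a sketch under my own assumptions: the plain-text layout and the placeholder for empty sections are illustrative, and a real integration would use the chat platform's own formatting.

```python
SECTIONS = [
    "Observed",
    "Hypotheses",
    "Changed since the last update",
    "Next check",
    "Proposed action",
    "Decision needed",
]

def render_update(update):
    """Render every section in a fixed order, even when it is empty."""
    lines = []
    for section in SECTIONS:
        items = update.get(section) or ["(nothing reported)"]
        lines.append(f"{section}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)

# Invented update content.
message = render_update({
    "Observed": ["p95 latency rose at 09:05 (metrics dashboard)"],
    "Decision needed": ["Approve rollback of the 09:02 deploy?"],
})
print(message)
```

Showing empty sections explicitly keeps gaps visible, so a short update is not mistaken for a complete investigation.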

This is not a request to expose a model’s private, free-form chain of thought. Responders need a structured evidence trail: claims, retrieved signals, tool results, rejected alternatives, and action rationale. That artifact is more useful for review because each part can be checked against the operational record.

Confidence labels are helpful only when they change behavior. Define what the interface does when confidence is low: ask for a missing service identifier, run another safe check, present multiple hypotheses, or escalate to the owner. Do not display a precise-looking score unless you have evaluated whether that score corresponds to actual correctness in your incident set.
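One simple way to check whether a label corresponds to correctness is to compare accuracy per label across your evaluated incidents. The sketch assumes you have pairs of (label, was_correct) from past evaluations; the labels and sample values are invented.

```python
from collections import defaultdict

def accuracy_by_label(results):
    """results: iterable of (confidence_label, was_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for label, correct in results:
        totals[label][0] += int(correct)
        totals[label][1] += 1
    return {label: hits / n for label, (hits, n) in totals.items()}

# Invented evaluation outcomes.
results = [
    ("high", True), ("high", True), ("high", False),
    ("low", False), ("low", True), ("low", False),
]
table = accuracy_by_label(results)
print(table)
```

If "high" is not clearly more accurate than "low" on your own incident set, the label should not be shown as if it were meaningful.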

Design human correction as part of the main workflow. A responder should be able to reject a hypothesis, correct the service or environment, mark a retrieved record stale, deny an action, and state why. The agent should preserve that decision in the incident record and replan from it. Repeatedly resurfacing a rejected hypothesis erodes trust even when the underlying model is otherwise capable.

Watch for a subtle interface failure: polished summaries can make weak investigations look complete. Make unresolved questions and conflicting signals visually prominent in the message structure. The goal is not to make the agent sound certain. It is to help the incident commander see what is known, what is inferred, and what decision comes next.

Test against past incidents, then expand authority one boundary at a time

A demo proves that the agent can complete a favorable path. It does not prove that the agent will retrieve the right context, resist a misleading correlation, respect permissions, or propose a safe action when production is ambiguous.

Use post-incident time-travel evaluations. Reconstruct what the agent could have known at each point in a real incident. Begin with the original trigger and expose deployments, telemetry, messages, and tool results only when they became available. Hide the final root cause, later analysis, and corrected metadata until the corresponding point in the replay. Otherwise, you are testing hindsight rather than incident response.
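The core of a time-travel replay is a filter on when each piece of information became available. The sketch assumes every event in the incident record carries an `available_at` timestamp; that field name and the sample events are hypothetical.

```python
from datetime import datetime

def visible_at(events, replay_time):
    """Return only what the agent could have seen at replay_time, oldest first."""
    return sorted(
        (e for e in events if e["available_at"] <= replay_time),
        key=lambda e: e["available_at"],
    )

# Invented incident record, including a postmortem written days later.
events = [
    {"kind": "postmortem", "text": "Root cause: timeout config",
     "available_at": datetime(2025, 11, 8, 14, 0)},
    {"kind": "alert", "text": "Error rate above threshold",
     "available_at": datetime(2025, 11, 6, 9, 5)},
    {"kind": "deploy", "text": "Deploy finished",
     "available_at": datetime(2025, 11, 6, 9, 2)},
]

step = visible_at(events, datetime(2025, 11, 6, 9, 10))
print([e["kind"] for e in step])
```

Running the agent against successive `replay_time` values reproduces the incident as it unfolded, which is the property that separates this from testing with hindsight.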

Grade the investigation on operational usefulness, not prose quality:

- Scoping accuracy: Did it identify the correct service, environment, symptoms, and ownership route?
- Context retrieval: Did it find the relevant change, runbook, dependency, or earlier incident without mixing incompatible records?
- Hypothesis quality: Where did the eventual cause appear in the ranked set, and what evidence was used to test it?
- Evidence integrity: Does every factual claim match a retrieved record or tool result? Did the agent invent a signal that was never observed?
- Tool correctness: Did it select the correct tool, target, environment, and parameters?
- Action safety: Was the proposed action inside policy, and were its blast radius, preconditions, and recovery path explicit?
- Calibration: Did expressed certainty track actual correctness, especially when context was incomplete?
- Time compression: How did the time to a useful hypothesis, correct owner, mitigation decision, and recovery compare with the existing workflow?
- Human effort: Which searches, handoffs, repeated explanations, and diagnostic checks did the agent remove or add?

Treat safety failures differently from diagnostic misses. A missed hypothesis is a capability problem. Crossing a permission boundary, inventing evidence, or targeting the wrong environment is a release blocker for that tool path. Averaging all outcomes into one quality score can conceal exactly the failure that matters most.
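To see why a single average hides the problem, consider a report that keeps diagnostic scores and blocking failures separate for each tool path. The failure labels, the `ReplayResult` fields, and the sample data below are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels; the three blockers mirror the safety failures named above.
BLOCKERS = {"permission_boundary_crossed", "evidence_invented", "wrong_environment"}


@dataclass(frozen=True)
class ReplayResult:
    incident_id: str
    tool_path: str
    diagnostic_score: float          # 0.0 to 1.0, from the grading rubric
    failures: frozenset = frozenset()


def release_report(results):
    """Summarize each tool path: mean diagnostic score plus any blocking failures."""
    by_path = {}
    for r in results:
        entry = by_path.setdefault(r.tool_path, {"scores": [], "blockers": set()})
        entry["scores"].append(r.diagnostic_score)
        entry["blockers"] |= r.failures & BLOCKERS
    return {
        path: {
            "mean_score": sum(e["scores"]) / len(e["scores"]),
            "blockers": sorted(e["blockers"]),
            "releasable": not e["blockers"],  # one blocker is enough to hold the path
        }
        for path, e in by_path.items()
    }


results = [
    ReplayResult("inc-1", "restart_service", 1.0),
    ReplayResult("inc-2", "restart_service", 1.0, frozenset({"wrong_environment"})),
    ReplayResult("inc-3", "read_logs", 0.5),
    ReplayResult("inc-4", "read_logs", 0.5, frozenset({"missed_hypothesis"})),
]
report = release_report(results)
print(report)
```

In this sample, `restart_service` has the better average but is held back by a single wrong-environment replay, while `read_logs` scores lower and remains releasable because a missed hypothesis is a capability gap rather than a safety failure.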

A practical rollout sequence

Instrument the human workflow. Capture incident timelines, ownership changes, diagnostic steps, approvals, mitigations, and outcomes. You need a baseline before claiming improvement.
Replay historical incidents. Use time-bounded context and score the agent against known outcomes. Repair retrieval and service metadata before tuning for eloquence.
Run in shadow mode. Let the agent investigate live incidents without posting conclusions or changing systems. Compare its evidence and hypotheses with the responder’s path.
Expose read-only assistance. Allow responders to request context, hypothesis checks, and draft updates. Collect explicit acceptance, correction, and rejection signals.
Add recommendation mode. Let the agent propose remediations using the structured action artifact, while humans continue to execute through established controls.
Enable one bounded action path. Choose a preapproved runbook with a clear target, validated preconditions, observable effect, and recovery procedure. Keep approval attached to the exact invocation.
Expand by tool and service. Grant additional authority only when evaluation evidence supports that particular boundary. Do not treat success on one service as proof of readiness everywhere.
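The sequence above implies that authority is tracked per boundary rather than per agent. A minimal sketch of that idea follows; the mode names, the `AuthorityRegistry` class, and the service and tool names are hypothetical, and in practice enforcement belongs in the tools and infrastructure rather than in a lookup table alone.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Rollout stages in order of increasing authority (names are illustrative)."""
    SHADOW = 1
    READ_ONLY = 2
    RECOMMEND = 3
    BOUNDED_ACTION = 4


class AuthorityRegistry:
    """Tracks the highest mode granted for each (service, tool) boundary."""

    def __init__(self, default=Mode.SHADOW):
        self._default = default
        self._grants = {}

    def grant(self, service, tool, mode, evidence):
        # Require a pointer to evaluation evidence for every expansion.
        if not evidence:
            raise ValueError("an expansion needs evaluation evidence")
        self._grants[(service, tool)] = (mode, evidence)

    def allowed(self, service, tool, requested):
        # Boundaries with no grant fall back to the default (lowest) mode.
        mode, _ = self._grants.get((service, tool), (self._default, None))
        return requested <= mode


registry = AuthorityRegistry()
registry.grant("checkout-api", "restart_pod", Mode.BOUNDED_ACTION, evidence="replay-set-v3")

print(registry.allowed("checkout-api", "restart_pod", Mode.BOUNDED_ACTION))
print(registry.allowed("billing-api", "restart_pod", Mode.BOUNDED_ACTION))
```

A grant on one service does not carry over to another: `billing-api` stays in shadow mode until its own evaluation evidence is recorded.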

Re-run the evaluation set after changes to prompts, models, tools, service topology, runbooks, or permissions. An agent can regress even when its general language quality improves. Operational behavior depends on the whole system around the model.

Key takeaways

Start with investigation and context compression; earn execution authority later.
Build deterministic service, environment, time, and ownership filters before depending on semantic retrieval.
Separate observed facts, hypotheses, and proposed actions in every incident update.
Enforce permissions in tools and infrastructure, not only in prompts.
Evaluate with historical time travel so the agent never sees facts that were unavailable during the real incident.
Expand autonomy one action, tool, service, and environment boundary at a time.

The next outage is the wrong time to discover that your agent cannot distinguish a plausible explanation from verified evidence. Before it happens, choose one bounded incident workflow, define its context contract and permission envelope, and replay several real investigations without future information. If the agent can make its evidence legible, stay inside policy, and consistently move responders toward the next correct decision, you have a foundation worth expanding.

References

Shivam.Consulting Blog — How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

November 6, 2025

Turn Claude Code Into a Trusted Teammate: My 3-Layer Memory System You Can Copy

"Can you critique the landing page for my new Story-Based Customer Interviews course?" That simple ask used to kick off hours of back-and-forth where I fed an AI the same context over and over—only to get generic feedback that wouldn’t land with my audience or fit my products. As a product leader, that inefficiency was unacceptable; as a writer, it was just plain frustrating.

Not anymore. Today, Claude not only critiques my work, it helps me produce it. It generates marketing copy—in my voice. It helps me write blog posts. It knows what search terms are relevant to my business and helps me optimize my articles for SEO and now AEO. It helps me with competitive research, academic research, and discovery research. And it does all of this with little prompting from me.

I don’t upload files to a web-based project. I don’t manage elaborate prompt libraries. I don’t repeat myself. I ask for help and Claude knows exactly what to do. The shift happened when I learned how to give Claude Code a memory. Claude now knows who my target customer is, the key value propositions I focus on, the specific opportunities each product addresses, my revenue model, my marketing channels, and so much more.

[Image: dark-themed strategy slide for the post "Stop Repeating Yourself: Give Claude Code a Memory."]

With that memory, I consistently get high-quality output tailored to my audience and aligned to my products and services. I don’t retype the same context; Claude just remembers. In this article, I’ll show you exactly how I set up that memory. It relies on Claude Code (which requires a Pro subscription), and it’s worth it. If you’re new to Claude Code, start with "Claude Code: What It Is, How It’s Different, and Why Non-Technical People Should Use It."

Here’s the underlying problem: with large language models, every conversation starts from scratch. Yes, ChatGPT can remember some things and Claude can search past conversations, but practically speaking each new thread wipes the slate clean. If I were working on a new landing page, I’d normally need to upload target customer context, product details, primary and secondary value propositions, FAQ questions and answers, plus testimonials and logos for social proof—every single time.

[Image: Claude’s home screen with Sonnet 4.5 selected and quick actions for writing, learning, and coding beneath the prompt box.]

Projects in web-based tools help a bit, but they introduce a new dilemma. When I move to the next landing page targeting the same customer but a different product and value proposition, do I start a new Project (tedious) or keep expanding the old one (which muddies the context window and degrades output quality)? The good news: Claude Code solves this by giving the model a precise, durable memory without overloading any single conversation.

Claude Code can read files on my local machine, which is an understated superpower. I use those files to create a persistent, reusable memory that works across all chats and Projects. Files can be mixed and matched, so I give Claude exactly what it needs for the task at hand—and nothing more. For a first landing page, I reference the target customer and the relevant product; for the second, I reuse the same target customer file and point to the new product file.

[Image: dark-mode screenshot of Claude Code fetching producttalk.org, reading context files, and returning a concise homepage evaluation.]

When you give an LLM the exact right context, output quality jumps. More context only helps if it’s the right context. For a landing page, Claude needs to know about the current product and perhaps related products for differentiation—but it doesn’t need to know about unrelated offerings. Structure your memory so Claude gets precisely what’s required.

Once I did this, Claude shifted from “intern who needs handholding” to trusted advisor and capable teammate. It doesn’t guess at my value propositions—I’ve already told it. It writes in my voice because it has my writing guide and samples. It knows who owns which course and which use cases map to which features. The setup takes a bit of upfront work, but it compounds: update a file when something changes and you’re done. Most of this information already lives in your system; the trick is making it easy for Claude to use.

[Image: diagram of global and project CLAUDE.md files, plus custom reference docs, flowing into the editor.]

Because the files live on my machine, I own the system. No vendor or device lock-in. I decide when to share and with whom. I can work with Claude on one project and ChatGPT on another, and both can rely on the same file-based memory strategy. It’s an AI strategy that scales with product discovery, accelerates go-to-market content, sharpens competitive differentiation, and supports product-led growth.

Here’s how I design the memory: I use three layers. Claude Code already encourages global preferences and Project-specific instructions, but the third layer—reference context—is where the real power lives.

[Image: a markdown file of rules for Claude Code covering writing, multi-level planning, and clear feedback.]

Layer 1: Global Preferences (Always on). The first time I launched Claude Code, I created a CLAUDE.md file at ~/.claude/CLAUDE.md. This is where I keep the cross-project rules of engagement—how I like to work with Claude. Mine includes: Always create a plan for me to review before you start any work; Give me direct feedback (no hedging, no gentle suggestions); Use bullet points for summaries; Ask clarifying questions one at a time so I can give complete answers; No emojis unless I explicitly ask for them. Claude Code automatically loads this file at the start of every session, so I never restate my preferences.

Layer 2: Project-Specific Instructions. Different projects have different rules. In my writing workspace, the Project CLAUDE.md sets the roles (I’m the primary writer; Claude is my thought partner and editor), defines a multi-round review flow (content → structure → accuracy → typos), prioritizes human readability over SEO, and points to my writing style guide. In my task management system, I include how my Trello integration works, file naming conventions for tasks, and how to process research papers into summaries. In my code projects, I specify the technology stack (Node.js vs. Python), testing framework (Jest for Node.js, pytest for Python), code style and conventions, project architecture and directory structure, and which dependencies and libraries to use. Each project directory has its own CLAUDE.md, and Claude automatically loads the relevant file when I’m working there.

[Image: a markdown file of project instructions covering session setup, roles, editorial standards, and research steps.]

Layer 3: Reference Context (Pull as Needed). This is where the real power lives. LLMs have a context window, a limit on how much they can process at once. Even within that limit, loading too much degrades performance due to “context rot.” The remedy is ruthless context management: small, targeted files that load only when needed. Keep CLAUDE.md files concise and focused on rules and workflows. For detailed knowledge, create separate reference files and list them in your CLAUDE.md so Claude knows they exist and when to fetch them. When I ask for help creating a landing page, Claude knows to use my business profile, the product file, and my target customer context.
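If you keep many reference files, a short script can generate the index that goes into a CLAUDE.md. The sketch below is an assumption-heavy illustration: the file paths, the descriptions, and the section layout are hypothetical, and CLAUDE.md is free-form markdown, so no particular heading or format is required.

```python
# Hypothetical file names and descriptions; substitute your own.
REFERENCE_FILES = {
    "context/business-profile.md": "Any marketing or strategy task",
    "context/target-customer.md": "Landing pages, emails, and interview guides",
    "context/product-course-a.md": "Work on that specific product",
}


def reference_index(files):
    """Render a short 'Reference files' section to paste into a CLAUDE.md."""
    lines = ["## Reference files", "", "Read a file only when the task calls for it.", ""]
    for path, when in files.items():
        # One bullet per file: where it lives and when it is relevant.
        lines.append(f"- `{path}`: {when}")
    return "\n".join(lines)


index = reference_index(REFERENCE_FILES)
print(index)
```

The index stays small even as the reference files grow, which keeps the always-loaded instructions short while still telling Claude what exists and when to read it.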

Here’s what most people miss: you don’t cram everything into global or Project files. You maintain small, reusable reference files that Claude only loads on demand. In my walkthrough, I share exactly which context files I created and why; how I got Claude Code to help me create them; how I break them into small, reusable components so Claude gets precisely what it needs; how I keep everything up to date; and step-by-step instructions so you can set up a similar memory system.

[Image: three project notes feeding into Claude Code as reusable context.]

Let’s dive in.

Inspired by this post on Product Talk.

November 5, 2025

AI at Home, Impact at Work: Experiments That Supercharged My Product Leadership

I recently tuned in to an insightful All Things Product episode featuring Teresa Torres and Petra Wille on how experimenting with AI in everyday life sharpens how we build AI-powered products at work. The core premise resonated deeply with my AI strategy: low-stakes, personal experiments accelerate confidence, clarify limitations, and build an AI product toolbox we can bring into the office with rigor.

If you want to dive in, you can listen on Spotify or Apple Podcasts. I found the conversation especially relevant for product trios and anyone shaping LLMs for product managers in high-stakes environments.

The idea is simple but powerful: when I prototype with AI at home—where the stakes are low—I learn faster, make safer mistakes, and internalize critical product patterns. Over time, those patterns transfer directly to work: tighter context management, sharper bias awareness, clearer human-in-the-loop guardrails, and a more nuanced view of when to use AI as a thought partner versus when to consider agentic AI.

In my own practice, I’ve mirrored many of the scenarios discussed: using ChatGPT by OpenAI to plan meals, analyze public data sets like school budgets, and even sanity-check real estate evaluations. These seemingly mundane tasks are fertile ground for learning about context window limits, hallucinations, AI bias, and privacy-by-design trade-offs. Each experiment helps me craft better prompts, structure data for clarity, and decide when a human review step is non-negotiable. These are core habits for AI risk management.

At work, I treat AI as a thought partner for writing, research synthesis, and contract review. I also explore when and how to responsibly evolve toward agentic AI for repeatable workflows. The distinction matters: a thought partner augments judgment; an agent automates execution. Building the right scaffolding—data governance, auditability, constraints, and escalation paths—ensures we unlock speed without compromising safety.

Three lines from the episode stayed with me: “I’m trying to write things that only I can write — that’s my guiding writing light right now.” — Teresa. “The more we use AI, the more we learn what it’s good at, what it’s not good at, and where context becomes a limitation.” — Teresa. “It’s a safer playground — we can build our toolbox at home before bringing those lessons to work.” — Petra. These are practical north stars for product management leadership in the GenAI era.

For anyone getting started, here’s what worked for me: begin with “low-stakes” personal experiments, write down your prompts and outcomes, and reflect on failure modes. Treat each activity as product discovery: What problem am I solving? What outcome matters? What data and context does the model need? Which decisions must stay human-in-the-loop? This discipline builds an AI product toolbox you can confidently apply to real customer problems.

I also keep a running toolkit of references and tools that inform my practice: Context window as a concept helps me size and sequence information. Visual and video tools like Midjourney and Sora expand how I think about multimodal experiences. I rotate between Claude by Anthropic and ChatGPT by OpenAI depending on task fit, and I’ve used Claude Code when I need structured assistance with code review. For knowledge capture and workflow, Readwise and Ghost help me structure insights and ship content.

If you want more structured learning paths, I found Josh Seiden’s Learn AI With Me, A 30-Day Sprint to be a practical primer, and the broader community conversation at Product at Heart Conference is invaluable. For a deeper grounding in risk, I recommend reviewing topics like Hallucination (artificial intelligence), AI bias, and Agentic AI—and revisiting the complementary episode, Context is King.

I’d love to hear how you’re experimenting: Where have you seen AI meaningfully reduce toil? Where does it still struggle? How are you balancing creativity, data safety, and compliance as you scale? Drop a comment below and let’s compare notes—especially on patterns that help product trios move faster without sacrificing trust.

Bottom line: start small at home, carry lessons into the office, and build with curiosity and intentionality. That’s how we level up our product discovery, sharpen our value proposition, and lead teams confidently through the GenAI transition.

Inspired by this post on Product Talk.

November 4, 2025
