Tag: AI workflows

Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.

Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.

Practically, this means I can treat Amplitude as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift from “insight later” to “insight-to-action now.”

Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.

From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments, rather than replaces, expert judgment. Data governance and permissions remain front and center: the agent sees only what my teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.

If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.

The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.

Inspired by this post on Amplitude – Best Practices.

June 3, 2026
Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

Then the recent wave of malicious software spreading through the open-source community burst my bubble. It hit companies big and small, including names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.

On May 11th, I started seeing tweets about a TanStack hack. At the time, I didn’t know what TanStack was; it turns out to be a popular set of JavaScript libraries used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm (malicious software that self-replicates), and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.

An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run automatically when a package is installed. It then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels, including the victim’s own public GitHub repos, to distribute secrets.

In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
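One way to act on the "harden entry points" idea is to check which installed packages declare install-time scripts. Here is a minimal sketch, assuming an npm-style `node_modules` folder where each package has a `package.json`. It looks for the `preinstall`, `install`, and `postinstall` lifecycle hooks. The function name `find_install_scripts` is my own, and a clean result does not prove a package is safe.

```python
import json
from pathlib import Path

# npm lifecycle hooks that run automatically during an install.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def find_install_scripts(node_modules: Path) -> dict[str, dict[str, str]]:
    """Return {package_name: {hook: command}} for packages declaring install hooks."""
    findings: dict[str, dict[str, str]] = {}
    for manifest in node_modules.rglob("package.json"):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest: skip rather than crash
        if not isinstance(data, dict):
            continue
        scripts = data.get("scripts")
        if not isinstance(scripts, dict):
            continue
        hooks = {h: str(scripts[h]) for h in INSTALL_HOOKS if h in scripts}
        if hooks:
            # Fall back to the folder path when the manifest has no name field.
            name = str(data.get("name") or manifest.parent)
            findings[name] = hooks
    return findings
```

Anything this reports is worth reading before you trust it. npm also accepts an `--ignore-scripts` flag on install, which skips these hooks entirely, at the cost of breaking packages that genuinely need a build step.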

Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.

The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.

A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.

Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
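To make the first of those two questions a habit, I can run a quick check on a folder before mounting it. This is only a sketch: the file names and suffixes below are my own non-exhaustive guesses at where credentials commonly live, and it does not look inside files, so a clean result is not a guarantee.

```python
from pathlib import Path

# Hypothetical, non-exhaustive lists of names and suffixes that often hold secrets.
RISKY_NAMES = {".env", ".npmrc", ".pypirc", ".netrc", "id_rsa", "id_ed25519", "credentials.json"}
RISKY_SUFFIXES = {".pem", ".key", ".p12"}


def risky_files(folder: Path) -> list[Path]:
    """List files under `folder` whose names suggest they contain credentials."""
    hits = []
    for path in folder.rglob("*"):
        if not path.is_file():
            continue
        looks_risky = (
            path.name in RISKY_NAMES
            or path.name.startswith(".env.")  # e.g. .env.production
            or path.suffix in RISKY_SUFFIXES
        )
        if looks_risky:
            hits.append(path.relative_to(folder))
    return sorted(hits)
```

If the list is not empty, I move those files out (or pick a narrower folder) before sharing anything with the VM.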

It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

Amid a wave of package-supply attacks, this Product Talk visual launches a 3-part guide to safer AI building—starting with Cowork safety today, then Claude code config next week, and off-device development coming soon.

What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

To summarize Cowork’s protections against the attacker’s three-step pattern:
- Entry point: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device.
- Access to sensitive data: strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach.
- Data exfiltration: partially constrained because Anthropic limits outbound network traffic from the VM. That is helpful, but not absolute.

By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits, so any malicious install scripts run directly on your machine with full access and open egress.

My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.

Inspired by this post on Product Talk.

June 3, 2026
A Reliable Amplitude AI Workflow for Product Decisions

You ask Amplitude AI why activation fell. It returns a convincing explanation, a few plausible segments, and a recommendation your team could act on. The problem is that you still don’t know whether the answer reflects your product data, an ambiguous metric, or a reasonable-sounding guess.

You don’t fix that uncertainty with a longer prompt. You fix it with a controlled workflow: define the decision, provide only the context needed to analyze it, let AI run a bounded sequence of checks, and require evidence before accepting a conclusion. The result is an analysis another product manager can inspect, reproduce, and turn into action.

Start with a decision contract, not an open-ended question

A request such as “analyze our onboarding” leaves too many choices to the model. It must decide what onboarding means, which users count, what success looks like, which period matters, and whether the goal is diagnosis or opportunity discovery. A polished answer can hide those unresolved choices.

Write a short decision contract before opening the analysis. It should contain five elements:
- Decision: State what someone will decide after reading the result. For example: decide which activation bottleneck the onboarding team should investigate next.
- Population: Name the eligible users, accounts, plan types, platforms, markets, or acquisition channels.
- Metric: Supply the exact event or formula, its time window, and any exclusions.
- Evidence bar: Specify what the answer must show, such as the supporting events, segments, funnel steps, or behavioral trend.
- Output: Ask for a conclusion, competing explanations, uncertainties, and the next analysis or product action.

A useful objective is narrow enough to fit in one sentence. Your quality rubric can be slightly longer: require every conclusion to identify the relevant metric, population, comparison, and evidence. This intent-first, evaluation-driven approach keeps the analysis tied to a product decision instead of rewarding whatever answer sounds most complete.

Constraints belong in the contract too. If the team cannot change pricing, instrumentation, or a particular onboarding step, say so. If a result must remain descriptive because the analysis cannot establish causality, require that distinction. AI is more useful when it knows which doors are closed.
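If you keep contracts in code instead of a doc, a small record type can enforce that none of the five elements is left blank. This is a sketch under my own naming: `DecisionContract`, `missing`, and `to_prompt` are illustrative and not part of any Amplitude API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DecisionContract:
    decision: str
    population: str
    metric: str
    evidence_bar: str
    output: str
    constraints: tuple[str, ...] = ()  # doors that are closed, e.g. "pricing cannot change"

    def missing(self) -> list[str]:
        """Names of required elements that are empty or whitespace-only."""
        required = [f.name for f in fields(self) if f.name != "constraints"]
        return [name for name in required if not getattr(self, name).strip()]

    def to_prompt(self) -> str:
        """Render the contract as plain text, refusing incomplete contracts."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"contract is incomplete: {', '.join(gaps)}")
        lines = [
            f"Decision: {self.decision}",
            f"Population: {self.population}",
            f"Metric: {self.metric}",
            f"Evidence bar: {self.evidence_bar}",
            f"Output: {self.output}",
        ]
        lines += [f"Constraint: {c}" for c in self.constraints]
        return "\n".join(lines)
```

The point of raising on a gap is that an incomplete contract never reaches the model, so the unresolved choice gets resolved by a person first.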

Build a compact context packet Amplitude AI can actually use

Amplitude AI can only interpret behavior through the data model it receives. If two teams use different definitions of an activated account, or an event changed meaning after an instrumentation update, the model can produce a coherent answer to the wrong question.

Create a reusable context packet for each important product area. Keep it short enough to review, but precise enough to remove semantic guesswork. Include:
- Metric definitions: Write the numerator, denominator, qualifying window, and exclusions for activation, retention, conversion, or any other decision metric.
- Event taxonomy: List the events and properties relevant to the question, including known aliases or deprecated events that should not be used.
- Segment definitions: Explain how key cohorts are formed and which properties distinguish users from accounts.
- Known data limitations: Flag missing platforms, delayed events, identity-resolution issues, tracking changes, and periods that should not be compared.
- Recent product context: Include only releases, experiments, or journey changes that could plausibly affect the behavior under review.

Use retrieval before expansion. Start with the smallest relevant set of definitions and observations. Add more context only when the analysis reaches a question that requires it. Dumping an entire analytics catalog into the prompt makes it harder to see which definitions shaped the answer and gives irrelevant details more chances to distract the model.

Examples can stabilize recurring work, but choose them carefully. One to three strong examples are enough to demonstrate the expected structure, evidence standard, and level of uncertainty. Remove old conclusions and stale numbers before reuse. You want the model to copy the analytical pattern, not inherit a previous answer.

Version this packet alongside the workflow. When an event definition, segment, or guardrail changes, record the change and rerun the analyses that depend on it. That turns context management from prompt housekeeping into part of your analytics governance.
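One lightweight way to version a packet is to fingerprint its contents and track which sections each analysis depends on. The sketch below assumes the packet is a JSON-serializable dictionary keyed by section. The helper names and the `dependencies` mapping are hypothetical.

```python
import hashlib
import json


def packet_fingerprint(packet: dict) -> str:
    """Short, stable identifier for a packet's contents (key order does not matter)."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def analyses_to_rerun(old: dict, new: dict, dependencies: dict[str, set[str]]) -> list[str]:
    """Return analyses that depend on any packet section that changed.

    `dependencies` maps an analysis name to the packet sections it relies on.
    """
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    return sorted(name for name, deps in dependencies.items() if deps & changed)
```

Recording the fingerprint next to each saved analysis makes it easy to tell later which version of the definitions produced it.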

Run a bounded analysis loop, then challenge the result

Move from observation to explanation in explicit steps

Don’t ask for a diagnosis in a single jump. A reliable workflow separates what happened from why it may have happened. Use a fixed sequence:
1. Establish the baseline. Confirm the metric definition, eligible population, comparison, and direction of change.
2. Locate the difference. Break the result down by the segments most relevant to the decision. Avoid exploring every available property.
3. Inspect the journey. Examine funnel steps, behavioral paths, retention patterns, or other views that can show where behavior diverges.
4. Generate competing hypotheses. Ask for more than one plausible explanation and require supporting and contradicting evidence for each.
5. Choose the next best analysis. Run the segment drill-down, funnel attribution, or anomaly check most likely to separate the leading explanations.
6. Apply a stop rule. End when the evidence is sufficient for the stated decision, when the remaining uncertainty requires new instrumentation, or when another analysis would not change the next action.

The stop rule matters. Without one, an agentic workflow can keep generating cuts of the data that add activity without increasing confidence. Before each tool call, require the system to state what question the analysis will answer and how each possible result would change its next step.
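The loop and its stop rule can be sketched in a few lines. Everything here is an assumption made for illustration: each planned check carries its question and the next step for either outcome, `run` stands in for a real tool call, and confidence is reduced to a single score between 0 and 1, which is far cruder than real judgment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlannedCheck:
    question: str             # what this analysis will answer
    if_supported: str         # next step if the result supports the leading hypothesis
    if_contradicted: str      # next step if it does not
    run: Callable[[], float]  # hypothetical tool call returning a confidence score in [0, 1]


def run_bounded_loop(
    checks: list[PlannedCheck], confidence_needed: float, max_checks: int = 6
) -> tuple[str, list[str]]:
    """Run checks in order until a stop rule fires. Returns (stop_reason, log)."""
    log: list[str] = []
    executed = 0
    for check in checks:
        # A check whose outcomes lead to the same next step cannot change the action.
        if check.if_supported == check.if_contradicted:
            log.append(f"skipped: {check.question}")
            continue
        if executed >= max_checks:
            return "budget exhausted", log
        confidence = check.run()
        executed += 1
        log.append(f"ran: {check.question} -> {confidence:.2f}")
        if confidence >= confidence_needed:
            return "evidence sufficient", log
    # Nothing left to run: the remaining uncertainty needs new data or a human.
    return "needs new instrumentation or review", log
```

The hard cap on executed checks is an extra guard beyond the three conditions in the stop rule, so a long plan cannot run unbounded.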

If you expose Amplitude actions through MCP or another callable interface, keep each tool narrow and observable. A call should have explicit inputs, a recognizable output shape, and an error state the workflow can surface. Log the question, parameters, returned evidence, and the interpretation built from it. Tool access makes iteration faster; it does not remove the need for an audit trail.
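As a sketch of what "narrow and observable" can mean in practice, the wrapper below gives every call the same output shape, turns exceptions into a visible error state, and appends an audit record. The names `ToolResult` and `logged_call` are hypothetical, and the tool itself is any plain Python callable, not a real Amplitude or MCP client.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Callable


@dataclass
class ToolResult:
    ok: bool
    rows: list[dict[str, Any]] = field(default_factory=list)
    error: str = ""


def logged_call(
    tool: Callable[..., list[dict[str, Any]]],
    question: str,
    params: dict[str, Any],
    audit_log: list[dict[str, Any]],
) -> ToolResult:
    """Call a tool with explicit inputs and record what was asked and returned."""
    try:
        result = ToolResult(ok=True, rows=tool(**params))
    except Exception as exc:  # surface the failure instead of hiding it
        result = ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    audit_log.append(
        {
            "question": question,
            "tool": getattr(tool, "__name__", repr(tool)),
            "params": dict(params),
            "result": asdict(result),
            "interpretation": "",  # filled in once someone reads the evidence
        }
    )
    return result
```

Leaving `interpretation` empty at call time keeps the returned evidence and the reading of it as separate fields in the trail.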

Put every conclusion through a verification gate

Before a finding reaches a stakeholder, check it against a simple evidence ledger. For each important claim, record:
- the event, metric, segment, funnel step, or trend that supports it;
- the population and comparison to which it applies;
- whether it is an observation, interpretation, or causal hypothesis;
- the strongest alternative explanation;
- the assumptions or data limitations that could change the conclusion;
- the next check required if confidence is still too low for the decision.

Then try to disprove the preferred answer. Ask whether the pattern survives a relevant segment change, whether a tracking change could explain it, and whether the same evidence also supports a competing hypothesis. This adversarial pass is often more valuable than asking the model to make its first response more detailed.
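The ledger fields map naturally onto a record with a small gate function. The sketch below uses my own field names, and the `confident_enough` flag is an assumption: the ledger only says a next check is required when confidence is too low, so someone still has to make that call.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    OBSERVATION = "observation"
    INTERPRETATION = "interpretation"
    CAUSAL_HYPOTHESIS = "causal hypothesis"


@dataclass
class LedgerEntry:
    claim: str
    support: str                # event, metric, segment, funnel step, or trend
    population: str
    comparison: str
    claim_type: ClaimType
    strongest_alternative: str
    limitations: str            # assumptions or data limits that could change the conclusion
    confident_enough: bool      # a human judgment, not something the code can infer
    next_check: str = ""


def gate_failures(entry: LedgerEntry) -> list[str]:
    """Return the reasons an entry should not reach a stakeholder yet."""
    failures = []
    for name in ("support", "population", "comparison", "strongest_alternative", "limitations"):
        if not getattr(entry, name).strip():
            failures.append(f"missing {name}")
    if not entry.confident_enough and not entry.next_check.strip():
        failures.append("low confidence without a next check")
    return failures
```

An empty list means the claim passes the gate. It does not mean the claim is true.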

Turn repeated checks into an evaluation set. Save representative questions, approved metric definitions, required evidence fields, and known failure cases. Rerun them when prompts, context, instrumentation, or model versions change. Review failures by category: wrong scope, wrong metric, unsupported inference, missed uncertainty, or unusable recommendation. That gives your team a regression signal instead of a vague impression that the workflow still works.
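A regression run over the evaluation set can be as simple as grading each saved case and counting failures by category. In this sketch the case and answer dictionaries, their keys, and the order in which categories are checked are all my own assumptions. A real grader would need human review or a stricter rubric for judgments like "unusable recommendation".

```python
from collections import Counter
from typing import Optional


def grade(case: dict, answer: dict) -> Optional[str]:
    """Return the first failure category that applies, or None if the answer passes."""
    if answer.get("population") != case["population"]:
        return "wrong scope"
    if answer.get("metric") != case["metric"]:
        return "wrong metric"
    evidence = answer.get("evidence", {})
    if any(not evidence.get(name) for name in case["required_evidence"]):
        return "unsupported inference"
    if not answer.get("uncertainty"):
        return "missed uncertainty"
    if not answer.get("recommendation"):
        return "unusable recommendation"
    return None


def regression_report(cases: list[dict], answers: list[dict]) -> Counter:
    """Count failures by category across paired cases and answers."""
    categories = (grade(case, answer) for case, answer in zip(cases, answers))
    return Counter(category for category in categories if category)
```

Comparing the counter before and after a prompt, context, or model change is the regression signal: a category that grows is where to look first.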

Hand stakeholders a decision artifact, not an AI transcript

The output should make the next decision easier. A long transcript of prompts, tool calls, and exploratory branches shifts the work of interpretation onto the reader. Keep the trace for auditability, but present a concise decision artifact with six fields:
- Decision: The choice this analysis informs.
- Finding: The clearest supported behavioral observation.
- Evidence: The exact events, segments, funnel steps, or trends behind the finding.
- Uncertainty: What remains unknown and what the analysis cannot establish.
- Recommendation: The next analysis, discovery activity, experiment, or product change justified by the evidence.
- Owner: The person responsible for the next step and the condition that triggers a follow-up.

Keep human judgment at the decision boundary. Amplitude AI can retrieve definitions, propose analyses, call tools, compare patterns, and draft the artifact. A product leader should still decide whether the evidence is strong enough, whether the recommendation fits current constraints, and whether the cost of being wrong is acceptable.

That division of labor also clarifies accountability. If the AI workflow produces an unsupported inference, improve the context, tool contract, or evaluation. If the evidence is sound but the organization chooses a different path, record the strategic reason. Don’t let an AI-generated recommendation blur the difference between analytical output and an accountable product decision.

Key takeaways
- Begin with the decision, population, metric, evidence bar, and required output.
- Give Amplitude AI a small, versioned context packet instead of an unfiltered analytics catalog.
- Separate baseline measurement, segmentation, journey analysis, hypothesis generation, and the next tool call.
- Require evidence, alternatives, assumptions, and a stop rule before accepting a conclusion.
- Save recurring checks as evaluations and rerun them when data, prompts, tools, or models change.
- Deliver a decision artifact with a named owner while keeping the analytical trace available for review.

Start with one recurring product question this week. Write its decision contract, assemble the minimum context packet, and define the verification gate before asking Amplitude AI to analyze anything. Once that workflow survives review, save it as the template for the next question.

References
- Shivam.Consulting Blog — Decode How Amplitude AI Thinks: Proven Workflows to Get Actionable, High-Accuracy Results
June 2, 2026
Stop Support Tickets Before They Start: How AI Unsticks Users and Lifts Conversions

Every moment of friction in a product carries a hidden cost: attention drifts, motivation wanes, and the next click becomes a support ticket—or worse, silent churn. Over the years, I’ve learned to treat “stuck” as an urgent product signal, not just an operational nuisance. When we unstick users in the flow, we protect revenue, brand trust, and the momentum that powers product-led growth.

Learn how Amplitude’s Global Support team uses AI Assistant to reduce support tickets, prevent user churn, and increase conversions.

I reference that line often because it captures a proven pattern: meet users where confusion peaks and resolve it instantly. In my practice, the formula is straightforward—pair behavioral analytics and session replay with a just-in-time AI Assistant, routed by clear driver trees. This transforms support from reactive firefighting into a proactive, in-product experience that accelerates onboarding and boosts user activation.

Here’s how I operationalize it. First, I use Amplitude analytics and behavioral analytics to surface high-friction steps—pages with elevated drop-off, loops, or rage clicks. Session replay clarifies the “why” behind the numbers, while cohort and retention analysis reveal who’s most at risk. Then I deploy targeted in-app guides and tooltip design to preempt known pitfalls, while an AI Assistant handles real-time questions with context from our knowledge base and product docs.

The AI Assistant is more than a chatbot. With well-structured AI workflows, it detects intent, pulls precise snippets from docs-as-code, and handles routine issues instantly. When complexity spikes, it executes a graceful handoff to consultative support via Intercom or a Zendesk integration—preserving conversation history and sentiment cues—so humans spend time where judgment matters. This hybrid model keeps response times low without sacrificing quality.

To de-risk changes, I lean on A/B testing and feature flags. I measure time-to-value, activation rate, and funnel conversion as leading indicators, while tracking ticket deflection, CSAT, and NRR as trailing indicators. The goal isn’t just fewer tickets; it’s faster learning loops and a compounding improvement in user outcomes. When we see activation curves steepen and onboarding friction flatten, we know the system is working.

Practically, I start with the top three friction points in onboarding, implement narrow in-app guides, and deploy the AI Assistant with strict guardrails and clear escalation paths. Weekly reviews align product, customer success, and solutions engineering around shared telemetry—so we tune prompts, content, and UI patterns together. Over time, I’ve seen ticket volume decline meaningfully, while conversion and retention rise as users experience fewer dead ends.

If you’re evaluating where to begin, identify the moments where confusion compounds—pricing configuration, integrations, and data mapping are common culprits. Then introduce targeted, context-aware help right where users hesitate. You’ll not only prevent “every stuck user” from turning into a ticket—you’ll convert friction into confidence, and confidence into growth.

Inspired by this post on Amplitude – Best Practices.

June 1, 2026
Speed-to-Lead Is Dead: How AI Agents End the Wait and Rebuild a High-Velocity Sales Org

A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.

We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, forty-eight hours for everyone else. Those were considered good numbers.

No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.

An AI Agent closes it completely.

When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.

Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.

Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.

What AI changes

Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.

The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.

The constraint the industry built around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.

If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.

SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.

Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.

You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:

1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.

2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.

3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.

These three questions reveal whether the motion is working. Time-to-first-response can’t.

Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.

An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.

The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.

Speed-to-lead made sense when the delay was structural, and that delay is gone. If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.

Inspired by this post on The Intercom Blog.

May 26, 2026
How to Operate Always-On AI Agents Without Losing Control
You want an AI agent to keep work moving after you close your laptop. The difficult part is not getting one successful overnight run. It is making the hundredth run predictable enough that you do not wake up to an embarrassing email, a corrupted task queue, or an unexplained usage bill.

The right operating model looks less like a clever prompt and more like a small, well-managed operations team. Give each agent a narrow job, an inspectable queue, limited tools, a clear definition of done, and an explicit place to stop. That is how you gain useful autonomy without surrendering control.

Start with a delegation contract, not a general-purpose assistant

An always-on agent should not begin with a broad instruction such as “manage my sales work.” That leaves the model to decide what managing means, which systems it may change, and when it has enough evidence to act. The ambiguity is tolerable during an interactive session because you can correct it. It becomes operational risk when the agent runs unattended.

Start by defining a job that produces a recognizable artifact. A sales-admin agent can prepare a briefing before a scheduled call and create proposed follow-up tasks afterward. A podcast-manager agent can assemble interview context, prepare a transcript-review document, and queue a reminder to share it. A coding-manager agent can review prior sessions and identify recurring mistakes. These are bounded responsibilities with visible outputs, not vague mandates to “help.” Three specialized agents handling podcast, sales, and coding workflows demonstrate how cleanly this pattern can separate unrelated work.

Write the delegation contract in an identity file that the agent reads at the beginning of every run. It should answer seven questions:
1. Who are you? Name the role, not the underlying model: sales admin, podcast manager, coding manager, or another function a person would recognize.
2. What outcome do you own? Describe the recurring deliverable and the event that makes it useful.
3. Where may you work? Name the exact task, output, and script folders the agent can use.
4. What inputs may you trust? Identify the calendar, task file, transcript, session log, or other allowed input for the job.
5. What may you change? Separate reading, drafting, creating internal files, updating tasks, and acting in external systems.
6. What counts as complete? Specify the artifact, required fields, location, and status update expected at the end.
7. When must you stop? Define what the agent should do when information conflicts, a tool fails, permission is missing, or the next step would affect another person.
The last question matters most. A useful agent does not need permission to improvise its way through every obstacle. It needs a reliable way to say, “I could not complete this safely; here is the missing decision.” Treat a well-documented block as a successful operational outcome, not as agent failure.
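
One way to keep the contract honest is to refuse to schedule an agent until all seven questions have an answer. The sketch below is only an illustration: the `DelegationContract` class, its field names, the `unanswered` helper, and the folder paths are hypothetical, not a format the source prescribes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DelegationContract:
    """One field per question in the delegation contract (names are illustrative)."""
    role: str               # 1. Who are you?
    outcome: str            # 2. What outcome do you own?
    workspaces: tuple       # 3. Where may you work?
    trusted_inputs: tuple   # 4. What inputs may you trust?
    allowed_changes: tuple  # 5. What may you change?
    done_when: str          # 6. What counts as complete?
    stop_when: tuple        # 7. When must you stop?


def unanswered(contract: DelegationContract) -> list:
    """Return the names of fields that are empty or whitespace-only."""
    missing = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        empty = not value.strip() if isinstance(value, str) else len(value) == 0
        if empty:
            missing.append(f.name)
    return missing


# Hypothetical contract for a sales-admin role; paths are placeholders.
sales_admin = DelegationContract(
    role="sales admin",
    outcome="briefing before each scheduled call; proposed follow-up tasks after",
    workspaces=("agents/sales/tasks", "agents/sales/output", "agents/sales/scripts"),
    trusted_inputs=("calendar", "task files"),
    allowed_changes=("read", "draft", "create internal files"),
    done_when="brief saved to the output folder and task marked completed",
    stop_when=(),  # question 7 deliberately left unanswered
)

print(unanswered(sales_admin))
```

Here the check reports `stop_when`, which is the question the contract most needs answered before the agent runs unattended.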

Keep consequential decisions outside the unattended role. The agent can prepare a customer email without sending it. It can propose changes to a deal record without changing the commercial commitment. It can summarize a coding pattern without modifying a production system. Moving from preparation to execution should be a deliberate permission decision, not an accidental side effect of adding another tool.

Build an inspectable operating loop around four components

The prompt is only one part of the system. Reliable agent operations need four components with distinct responsibilities: identity, scheduling, tasks, and scripts. Keeping them separate makes failures easier to locate and changes easier to review.

Identity defines responsibility

The identity file is the stable operating policy. It tells the agent what role it is playing, where its work lives, what it may do, and what completion looks like. Do not overload it with the details of one assignment. If the identity changes every time a task arrives, you no longer have a stable agent; you have an unreviewed prompt generator.

The scheduler supplies a heartbeat

The scheduler should wake the agent, point it to the correct identity and queue, and capture the result. It should not contain the business logic for podcast preparation or sales follow-up. That logic belongs in inspectable task instructions and small scripts.

A Mac that remains online can use macOS LaunchAgents as this heartbeat. LaunchAgents run with the user’s permissions, which is operationally convenient but also defines the risk boundary: the agent may be able to reach anything the scheduled process and its tools can reach. Running scheduled agents on an always-on Mac Mini therefore makes permission design part of the architecture, not a setting to revisit later.

Make the schedule explicit and easy to disable. Each job should have a known trigger, whether that is a recurring interval, a calendar-related event, or a periodic review. If you cannot quickly answer why an agent ran at a particular time, the scheduler is already too opaque.
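
A LaunchAgent is described by a property list, so one way to keep the schedule explicit is to generate that file from a few reviewed values. This is a minimal sketch using Python's standard `plistlib`; the label, script path, interval, and log path are placeholders, and the `build_launch_agent` helper is hypothetical.

```python
import plistlib


def build_launch_agent(label, program_args, interval_seconds, log_path):
    """Return the XML property list for a job that runs on a fixed interval."""
    job = {
        "Label": label,                     # unique name for the job
        "ProgramArguments": program_args,   # command and arguments to run
        "StartInterval": interval_seconds,  # seconds between runs
        "StandardOutPath": log_path,        # capture the run's output
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(job)


# Placeholder values; the runner script and its flags are assumptions.
plist_bytes = build_launch_agent(
    label="com.example.sales-admin",
    program_args=["/usr/bin/python3", "/Users/me/agents/run_agent.py",
                  "--identity", "sales-admin"],
    interval_seconds=3600,
    log_path="/Users/me/agents/logs/sales-admin.log",
)
print(plist_bytes.decode("utf-8"))
```

Per-user jobs conventionally live in `~/Library/LaunchAgents`. Keeping one small file per agent means the answer to "why did this run?" is a single trigger you can read, and removing or unloading that file is a disable path that does not depend on the agent.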

Tasks hold durable state

Use a dedicated task folder for each agent. A Markdown file with frontmatter is enough to represent a work item while remaining readable by both a person and a tool. The frontmatter can hold machine-readable state; the body can hold the request, context, acceptance criteria, and eventual run notes.

Choose a small lifecycle and apply it consistently. For example: queued, in progress, blocked, completed, and failed. The exact labels matter less than the transition rules:
- A queued task is eligible to be claimed.
- An in-progress task records which run claimed it, preventing another run from silently doing the same work.
- A blocked task names the missing input or decision and preserves all useful partial work.
- A completed task links to its output and records what changed.
- A failed task records the failed operation and whether retrying it is safe.
Give each recurring event a stable identifier. Before creating a meeting brief, transcript-review document, or follow-up task, the agent should check whether that event has already been processed. This idempotency check prevents a retry or overlapping schedule from creating duplicates.
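
The transition rules and the idempotency check can be sketched in a few lines. Everything here is an assumption for illustration: tasks are plain dictionaries standing in for parsed frontmatter, the status names use underscores, and the choice to let blocked and failed tasks return to queued is mine, not the source's.

```python
# Allowed moves between states (an assumed policy, adjust to your own rules).
ALLOWED = {
    "queued": {"in_progress"},
    "in_progress": {"blocked", "completed", "failed"},
    "blocked": {"queued"},
    "completed": set(),
    "failed": {"queued"},
}


def transition(task, new_status, **notes):
    """Return an updated copy of the task, or raise if the move is not allowed."""
    current = task["status"]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"cannot move task from {current} to {new_status}")
    return {**task, **notes, "status": new_status}


def claim(task, run_id):
    """Mark a queued task as claimed by one specific run."""
    return transition(task, "in_progress", claimed_by=run_id)


def already_processed(event_id, tasks):
    """True if some task for this event is already claimed or finished."""
    return any(
        t["event_id"] == event_id and t["status"] in {"in_progress", "completed"}
        for t in tasks
    )


task = {"event_id": "call-2026-05-26-acme", "status": "queued"}
claimed = claim(task, run_id="run-001")
print(claimed["status"], claimed["claimed_by"])

try:
    claim(claimed, run_id="run-002")  # a second run cannot claim the same task
except ValueError as err:
    print(err)

print(already_processed("call-2026-05-26-acme", [claimed]))
```

This version only guards against double claims within one process. If two scheduled runs can overlap on disk, the claim itself also needs an atomic step, which this sketch does not attempt.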

Do not treat chat history as the task database. Conversations are useful working context, but durable state belongs in a file or system you can inspect independently. Saving identities, task files, and scripts in a shared knowledge workspace such as Obsidian also makes the operating model portable across devices and coding assistants. Changing the model runner should not require rebuilding the job.

Scripts expose narrow capabilities

Scripts should perform small, deterministic operations: fetch an allowed input, create a document in a known location, normalize a transcript, or update a task field. Keep the judgement in the agent and the mechanics in scripts with explicit inputs and outputs.

A small script is easier to inspect than a broad instruction to use the terminal however the model sees fit. It also gives you one place to add validation, duplicate checks, and error handling. When an agent repeatedly constructs the same command or edits the same file shape, promote that operation into a reviewed script rather than relying on the model to reproduce it perfectly on every run.

Design the overnight failure path before the happy path

Unattended automation changes the cost of a mistake. During an interactive session, a confusing output costs a correction. Overnight, the same confusion can trigger repeated work, alter several systems, or contact someone before you see it. Your design should limit the consequence of a wrong interpretation, not merely improve the probability of a correct one.

Use a permission ladder

Classify capabilities by consequence and grant them one level at a time:
1. Read: inspect approved calendars, task files, transcripts, logs, or documents.
2. Prepare: create drafts, summaries, reports, and proposed tasks inside a bounded workspace.
3. Update: change internal records whose history can be inspected and reversed.
4. Act externally: send messages, share files, update customer-facing systems, or invoke paid services.
5. Perform destructive or privileged work: delete data, change access, alter infrastructure, or execute an irreversible operation.
Most new agents should prove themselves at the read and prepare levels. Promotion should be capability-specific. An agent that reliably prepares a sales brief has not thereby earned permission to send customer communication. Reliability does not transfer automatically from one action class to another.

For external actions, use a pending-approval state that contains the exact proposed action. You should be able to review the recipient, content, destination, and relevant context without reopening the entire run. Destructive or privileged actions should remain outside unattended execution unless you have an explicit recovery path and have deliberately accepted the consequence of failure.
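
The ladder is easier to enforce when the levels are ordered values and grants are recorded per capability. In this sketch the `Level` names, the `GRANTS` table, and the `decide` function are hypothetical, and routing every external action to approval is an assumed policy that follows the paragraph above.

```python
from enum import IntEnum


class Level(IntEnum):
    READ = 1
    PREPARE = 2
    UPDATE = 3
    ACT_EXTERNALLY = 4
    DESTRUCTIVE = 5


# Grants are per agent AND per capability, so reliability in one area
# does not silently transfer to another. Entries are placeholders.
GRANTS = {
    ("sales-admin", "sales_brief"): Level.PREPARE,
    ("sales-admin", "task_file"): Level.UPDATE,
}


def decide(agent, capability, required, proposed_action=None):
    """Return allow, pending_approval, or refuse for one requested operation."""
    if required >= Level.DESTRUCTIVE:
        return {"decision": "refuse"}  # never part of unattended execution here
    if required == Level.ACT_EXTERNALLY:
        # Keep the exact proposed action so a reviewer can approve it as written.
        return {"decision": "pending_approval", "proposed_action": proposed_action}
    granted = GRANTS.get((agent, capability))
    if granted is not None and required <= granted:
        return {"decision": "allow"}
    return {"decision": "refuse"}


print(decide("sales-admin", "sales_brief", Level.PREPARE))
print(decide("sales-admin", "customer_email", Level.PREPARE))
print(decide(
    "sales-admin", "customer_email", Level.ACT_EXTERNALLY,
    proposed_action={"recipient": "buyer@example.com", "body": "Draft follow-up"},
))
```

The second call is refused even though the agent holds a prepare-level grant elsewhere, which is the capability-specific promotion rule in practice.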

Treat external text as data, not authority

Calendar descriptions, transcripts, web pages, emails, and documents may contain instructions that conflict with the agent’s job. The identity and task contract must outrank text found inside those inputs. An interview guest’s biography can inform a briefing; it cannot expand the podcast agent’s permissions. A meeting note can identify a follow-up; it cannot authorize the agent to send one.

Keep credentials out of identity and task files. Give scripts access only to the credentials required for their operation, and avoid handing an agent a general browser, terminal, file system, and credential store merely because each tool is useful in isolation. The dangerous capability is often the combination.

Make retries selective

A retry is appropriate when the failure is plausibly temporary and repeating the operation is safe. A network timeout during a read may qualify. Ambiguous recipient identity, conflicting meeting details, missing share settings, or an unclear customer commitment do not. Retrying an ambiguity only asks the model to make the same unsupported decision again.

Before enabling automatic retries, require the operation to pass three tests: it can detect whether it already succeeded, a duplicate would not create harm, and the number of attempts is capped. Otherwise, mark the task blocked and surface it for review.
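
Those three tests translate directly into a gate that decides between retrying and blocking. The `Operation` record and `next_step` function below are hypothetical names, and treating a cap of zero as "no cap configured" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    can_detect_success: bool     # test 1: it can tell whether it already succeeded
    duplicate_is_harmless: bool  # test 2: a duplicate would not create harm
    max_attempts: int            # test 3: attempts are capped (0 means no cap set)


def next_step(op, attempts_made, failure_is_transient):
    """Return 'retry' only when the failure is temporary and all three tests pass."""
    passes_tests = (
        op.can_detect_success and op.duplicate_is_harmless and op.max_attempts > 0
    )
    if not failure_is_transient or not passes_tests:
        return "blocked"
    if attempts_made >= op.max_attempts:
        return "blocked"
    return "retry"


read_calendar = Operation("read_calendar", True, True, max_attempts=3)
send_email = Operation("send_email", False, False, max_attempts=3)

print(next_step(read_calendar, attempts_made=1, failure_is_transient=True))
print(next_step(read_calendar, attempts_made=3, failure_is_transient=True))
print(next_step(read_calendar, attempts_made=1, failure_is_transient=False))
print(next_step(send_email, attempts_made=1, failure_is_transient=True))
```

Only the first call retries. Hitting the cap, an ambiguity rather than a timeout, or an operation that fails the tests all end in a blocked task for review.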

Put hard boundaries around usage

Always-on does not mean continuously reasoning. It means the system is available to process eligible work on a known schedule. A run should inspect the queue, process a bounded amount of work, record its result, and exit.

Set limits at several layers: eligible task types, work accepted per run, retries per task, tools available to the role, and provider-side spending or usage controls where available. Record usage beside the task outcome so you can distinguish an expensive valuable job from an agent that consumes resources while circling an ambiguity. Surprise charges are not only a pricing problem; they usually indicate that the operating loop lacks a stopping rule.
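
A bounded run can be as simple as a loop with a stopping rule that does not depend on the model. In this sketch, `run_once` and `fake_handle` are hypothetical, the limit of three tasks is arbitrary, and usage is counted in made-up units.

```python
def run_once(queue, handle, max_tasks=3, eligible_types=("meeting_brief",)):
    """Process a bounded slice of the queue, record each result, and return."""
    results = []
    for task in queue:
        if len(results) >= max_tasks:
            break  # stopping rule: the run ends even if work remains
        if task["status"] != "queued" or task["type"] not in eligible_types:
            continue  # ineligible work is left untouched for review
        outcome, usage = handle(task)
        # Usage is stored beside the outcome so cost is attributable to a task.
        results.append({"task_id": task["id"], "outcome": outcome, "usage": usage})
    return results


def fake_handle(task):
    """Stand-in for the real agent call; returns (outcome, usage units)."""
    return "completed", 1


queue = [{"id": f"t{i}", "type": "meeting_brief", "status": "queued"} for i in range(5)]
queue.insert(0, {"id": "x1", "type": "send_email", "status": "queued"})

results = run_once(queue, fake_handle)
print([r["task_id"] for r in results])
print(sum(r["usage"] for r in results))
```

The remaining two briefs wait for the next scheduled run, and the email task is never picked up because its type is not on the eligible list.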

Finally, maintain a kill switch you can use without asking the agent to cooperate. Disabling the schedule or revoking the narrow credential should stop future work. If stopping the system requires the same model and scripts that may be malfunctioning, it is not an independent control.

Measure whether the agent is reducing work or relocating it

A completed status is not proof of value. An agent can close every task while leaving you to verify facts, repair formatting, remove duplicates, and reconstruct why it made a decision. That is work relocation, not delegation.

Evaluate the operation with measures tied to the job:
- Usable completion rate: the share of eligible tasks that produce an output meeting the acceptance criteria without substantive rework.
- Correction rate: how often you must change facts, recipients, permissions, status, or next steps before using the output.
- Duplicate or false-action rate: how often the agent repeats a job or creates an action that the triggering event did not require.
- Blocked rate by cause: which missing inputs, permissions, or unclear rules repeatedly prevent completion.
- Time to review: the human attention required to approve, repair, or understand the result.
- Usage per usable outcome: the model or service consumption attached to work you actually keep.
These measures tell you what to change. A high blocked rate caused by missing context points to an input problem. Frequent factual corrections point to retrieval or acceptance-criteria problems. Duplicate work points to task identity and idempotency. High review time with otherwise correct output often means the evidence and change log are poorly presented.
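
If each run is logged as a small record, the measures above fall out of a few sums. The record fields, the sample values, and the choice of all logged runs as the denominator are assumptions made for this sketch, not definitions from the source.

```python
from collections import Counter


def operating_metrics(runs):
    """Summarize logged runs; every rate uses all logged runs as its denominator."""
    total = len(runs)
    usable = [r for r in runs if r["usable"]]
    return {
        "usable_completion_rate": len(usable) / total if total else None,
        "correction_rate": sum(r["corrected"] for r in runs) / total if total else None,
        "duplicate_rate": sum(r["duplicate"] for r in runs) / total if total else None,
        "blocked_by_cause": dict(
            Counter(r["block_cause"] for r in runs if r["status"] == "blocked")
        ),
        "avg_review_minutes": sum(r["review_minutes"] for r in runs) / total if total else None,
        "usage_per_usable_outcome": (
            sum(r["usage"] for r in runs) / len(usable) if usable else None
        ),
    }


# Illustrative records only; not measurements from any real system.
runs = [
    {"status": "completed", "usable": True, "corrected": False, "duplicate": False,
     "review_minutes": 2, "usage": 10, "block_cause": None},
    {"status": "completed", "usable": False, "corrected": True, "duplicate": False,
     "review_minutes": 8, "usage": 12, "block_cause": None},
    {"status": "blocked", "usable": False, "corrected": False, "duplicate": False,
     "review_minutes": 1, "usage": 3, "block_cause": "missing_context"},
    {"status": "completed", "usable": True, "corrected": False, "duplicate": False,
     "review_minutes": 3, "usage": 15, "block_cause": None},
]

for name, value in operating_metrics(runs).items():
    print(name, value)
```

Note that usage per usable outcome divides all consumption, including the blocked and reworked runs, by the outputs you actually kept. That is what exposes an agent that spends resources circling an ambiguity.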

Require every run to leave a compact receipt: the task it claimed, inputs it used, scripts it invoked, files or records it changed, output location, completion status, and reason for any block. You should not need to replay hidden reasoning. You need enough evidence to verify the operation and diagnose the next failure.
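
A receipt can be a small structured record written once per run. The `RunReceipt` class, its field names, and the example paths below are hypothetical; the fields simply mirror the list in the paragraph above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunReceipt:
    task_id: str
    inputs_used: list
    scripts_invoked: list
    changes: list            # files or records the run changed
    output_location: str
    status: str              # e.g. completed, blocked, failed
    block_reason: str = ""

    def validate(self):
        """Reject receipts that omit the evidence a reviewer needs."""
        if self.status == "blocked" and not self.block_reason:
            raise ValueError("a blocked run must state its reason")
        if self.status == "completed" and not self.output_location:
            raise ValueError("a completed run must link to its output")
        return self


# Placeholder values for illustration.
receipt = RunReceipt(
    task_id="call-2026-05-26-acme",
    inputs_used=["calendar", "agents/sales/tasks/call-2026-05-26-acme.md"],
    scripts_invoked=["create_brief.py"],
    changes=["agents/sales/output/brief-acme.md"],
    output_location="agents/sales/output/brief-acme.md",
    status="completed",
).validate()

line = json.dumps(asdict(receipt), sort_keys=True)
print(line)
```

Writing one JSON line per run keeps receipts in plain text, so they can sit beside the task file and be diffed or searched like everything else in the workspace.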

Review early runs closely and review again after changing an identity, script, tool, model, or input source. A stable task can become unstable when any one of those dependencies changes. Plain-text identities, tasks, and scripts make that change surface inspectable and versionable.

Your agents can also improve the operating system itself. A periodic coding-manager workflow, for example, can review prior coding sessions, identify recurring dead ends, and propose changes in how future sessions are run. The important separation is that the agent proposes an improvement with evidence; the operating policy changes only after review. Self-observation is useful. Unreviewed self-modification is a different risk class.

Expand only when the current job has earned more autonomy

Adding agents is easy once the scheduler and folder structure exist. That convenience can tempt you to automate work whose boundaries are not ready. Scale based on operational evidence, not on the number of possible use cases you can imagine.

A job is a strong candidate for always-on operation when it has a recurring trigger, stable inputs, an observable deliverable, clear acceptance criteria, bounded permissions, and enough repetition to justify maintaining the workflow. Preparation, follow-up capture, document setup, and periodic retrospectives fit because a person can inspect their artifacts and correct them before higher-consequence decisions are made.

Keep work interactive when the task depends on novel judgement, unresolved organizational context, sensitive negotiation, or irreversible action. An agent may still prepare evidence and options, but the decision should remain with the person who owns the consequence.

Before expanding an existing agent’s permissions or creating another role, check five gates:
1. The current output is regularly usable without substantial reconstruction.
2. Common failure modes are visible and end in safe states.
3. Duplicate prevention and retry behavior have been exercised.
4. Usage is attributable to tasks and bounded by stopping rules.
5. The next capability has its own acceptance criteria and consequence review.
Do not create one agent per application. Create one per coherent responsibility. A podcast manager may use a calendar, a document system, and a task list while retaining one outcome. Conversely, sales administration and coding retrospectives should not share an identity merely because they use the same model. Role boundaries should follow accountability, not tooling.

Key takeaways
- Begin with one recurring job that produces an inspectable artifact, not a general instruction to manage a function.
- Give the agent a durable identity, a dedicated task queue, an explicit schedule, and small reviewed scripts.
- Use task states, stable event identifiers, and completion receipts so retries and overlapping runs do not create invisible duplication.
- Keep new agents at read-and-prepare permissions until their outputs and failure modes are consistently understandable.
- Route ambiguity and consequential external actions to approval instead of asking the model to guess.
- Cap eligible work, retries, tools, and usage; always-on availability should still produce finite runs.
- Measure usable outcomes, corrections, blocks, duplicates, review effort, and usage before granting more autonomy.
Pick one task you already repeat and write its delegation contract before choosing more tools. If you cannot define the input, output, permission boundary, completion test, and safe stopping condition on one page, the job is not ready to run while you are offline. Tighten the job first. The agent can earn broader responsibility after the operating evidence is there.

References
- Product Talk – My Always-On AI Team: How I Get Claude Agents to Tackle Work While I’m Offline
May 20, 2026

How to Build an AI-Native Product Discovery Workflow

Your discovery stack may already hold interview transcripts, support conversations, behavioral analytics, experiment results, and roadmap assumptions. Yet the decision in a product review can still depend on whoever read the most material or built the most persuasive deck.

If adding an LLM only gives you faster summaries, the workflow is not AI-native. An AI-native discovery workflow shortens the distance from evidence to a decision while making every important claim easier to inspect. AI retrieves, structures, compares, and challenges the evidence. You remain accountable for what the evidence means and what the product team does next.

Key takeaways

Begin every AI-assisted discovery run with an outcome, a metric, defined context, and a decision that someone needs to make.
Preserve raw evidence and give each observation a stable identifier before asking AI to synthesize it.
Break the workflow into bounded jobs such as retrieval, extraction, clustering, contradiction detection, and decision-brief drafting.
Evaluate citation accuracy, evidence fidelity, counterevidence, abstention, and access controls before the output enters a roadmap discussion.
Measure whether the workflow improves decision quality and product outcomes, not merely whether the model produces polished prose.

Frame the decision before you involve the model

Most weak discovery prompts fail before the model sees them. “Analyze the interviews,” “summarize the feedback,” and “find insights” are activities, not decisions. They give the model no principled way to distinguish useful evidence from interesting noise.

Write a short decision contract first. A useful contract specifies the outcome and metric, the context and constraints, and the decision and deliverable. Those fields turn an open-ended request into a bounded discovery task.

Outcome and metric: Name the user or business outcome, then define the behavior or measure that represents it. Activation, funnel conversion, and retention are not interchangeable. Include the event definition and observation window used by your analytics system.
Context and constraints: State the relevant cohort, product surface, timeframe, market, known exclusions, and data-access limits. New self-serve accounts on the web can exhibit a different pattern from established accounts or customers using another surface.
Decision and deliverable: Say what someone will do with the answer. Ask for a ranked opportunity brief, an interview plan, a set of competing explanations, or experiment candidates only when that format supports a real pending decision.

Reusable decision prompt: Help me decide [decision]. The outcome is [outcome], measured as [metric definition]. Limit the analysis to [cohort, surface, timeframe, and constraints]. Retrieve evidence from [approved repositories]. Return [deliverable]. For every material claim, include the evidence identifier, any conflicting evidence, the affected segment, and what is still unknown. If the available evidence cannot support a recommendation, say so and specify what is missing.
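
To keep that prompt from being sent half-filled, you can render it from a structured contract and fail when a field is missing. This sketch uses Python's standard `string.Template`; the placeholder names, the `render_decision_prompt` helper, and the example contract values are assumptions for illustration.

```python
from string import Template

# The wording follows the reusable prompt above; only the placeholders are new.
PROMPT = Template(
    "Help me decide $decision. The outcome is $outcome, measured as $metric. "
    "Limit the analysis to $scope. Retrieve evidence from $repositories. "
    "Return $deliverable. For every material claim, include the evidence "
    "identifier, any conflicting evidence, the affected segment, and what is "
    "still unknown. If the available evidence cannot support a recommendation, "
    "say so and specify what is missing."
)

REQUIRED = ("decision", "outcome", "metric", "scope", "repositories", "deliverable")


def render_decision_prompt(contract):
    """Fill the prompt, refusing to run when any contract field is blank."""
    missing = [key for key in REQUIRED if not str(contract.get(key, "")).strip()]
    if missing:
        raise ValueError("decision contract is missing: " + ", ".join(missing))
    return PROMPT.substitute(contract)


# Illustrative contract; every value here is a placeholder.
contract = {
    "decision": "which onboarding friction to investigate next",
    "outcome": "activation",
    "metric": "first project created within 7 days of signup",
    "scope": "new self-serve accounts on the web, last 90 days",
    "repositories": "the interview repository and saved analytics queries",
    "deliverable": "a ranked opportunity brief",
}

print(render_decision_prompt(contract))
```

Failing early on a blank field is the point: a prompt without a decision or a metric definition is the open-ended request the contract exists to prevent.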

The last sentence matters. An AI system should be allowed to return insufficient evidence. If every run must end with a recommendation, the workflow rewards plausible completion instead of honest discovery.

Keep the outcome separate from the proposed solution. “Improve activation” is an outcome. “Validate an onboarding checklist” is already a solution choice. When you embed the solution in the prompt, AI tends to organize the available evidence around that choice instead of testing whether another opportunity matters more.

Use evidence-strength labels that a reviewer can verify rather than asking the model for an unsupported confidence percentage:

Sufficient: Direct evidence applies to the target context, and no material contradiction remains unresolved.
Mixed: Direct evidence and meaningful counterevidence both exist, or the pattern changes by segment.
Insufficient: Evidence is missing, indirect, stale for the decision, or outside the target context.
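
Because the labels depend on facts a reviewer can count, they can be assigned by a rule rather than by asking the model how confident it feels. The function below is a rough sketch: its inputs are hypothetical, and it assumes the caller has already excluded evidence that is indirect, stale, or outside the target context.

```python
def evidence_strength(direct_in_context, unresolved_contradictions, varies_by_segment):
    """Map reviewer-checkable counts to one of the three labels.

    direct_in_context: number of direct evidence units that apply to the target
        context (indirect, stale, or out-of-context units are not counted).
    unresolved_contradictions: number of material contradictions still open.
    varies_by_segment: True when the pattern changes by segment.
    """
    if direct_in_context == 0:
        return "Insufficient"
    if unresolved_contradictions > 0 or varies_by_segment:
        return "Mixed"
    return "Sufficient"


print(evidence_strength(4, 0, False))
print(evidence_strength(4, 1, False))
print(evidence_strength(3, 0, True))
print(evidence_strength(0, 0, False))
```

The thresholds are deliberately crude. What matters is that each label can be traced to counts someone can check against the evidence units.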

Build a traceable evidence pipeline, not a transcript pile

AI cannot make discovery evidence traceable if the underlying repository has already flattened observations, interpretations, and decisions into the same notes. Preserve those layers separately. My rule is simple: automate the movement and inspection of evidence before automating judgment.

Layer	What it contains	Control that matters
Raw evidence	Interview recordings or transcripts, support records, session evidence, and analytics query results	Keep the original record intact, access-controlled, and addressable by a stable locator
Evidence units	Atomic observations with metadata	Separate exact customer language, observed behavior, and analyst interpretation
Opportunities	Candidate needs, frictions, or desired outcomes	Attach supporting evidence, counterevidence, affected segments, and unresolved questions
Decisions	Choices made, rejected alternatives, assumptions, and rationale	Name the decision owner and preserve the evidence available at the time
Learning	Experiment results and later customer or behavioral evidence	Update the opportunity without erasing the earlier reasoning

Each evidence unit should carry enough metadata to survive outside its original document:

A stable evidence identifier.
The collection date and an exact locator such as a transcript timestamp or saved analytics query.
The relevant user segment, product surface, and journey stage.
The raw observation, kept separate from the interpretation proposed by a person or model.
The access, retention, and sensitivity classification.
The opportunity, assumption, or outcome to which the evidence may relate.

This structure prevents a common failure: a model paraphrases an interview, a later summary compresses that paraphrase, and the roadmap eventually treats the compressed interpretation as a customer fact. A reviewer should always be able to move from a claim to the evidence unit and then to the original record.
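
A record type makes that path from claim to evidence unit to original record concrete. The `EvidenceUnit` fields mirror the metadata list above, but the field names, the `resolve_claim` helper, and the sample unit are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvidenceUnit:
    evidence_id: str      # stable identifier
    collected_on: date    # collection date
    locator: str          # transcript timestamp or saved analytics query
    segment: str
    surface: str
    journey_stage: str
    raw_observation: str  # kept separate from any interpretation
    interpretation: str
    interpreted_by: str   # person or model that proposed the interpretation
    sensitivity: str      # access, retention, and sensitivity classification
    relates_to: str       # opportunity, assumption, or outcome


def resolve_claim(evidence_ids, units_by_id):
    """Map a claim's evidence identifiers to original-record locators."""
    missing = [eid for eid in evidence_ids if eid not in units_by_id]
    if missing:
        raise KeyError(f"claim cites unknown evidence: {missing}")
    return [units_by_id[eid].locator for eid in evidence_ids]


# Illustrative unit; the quote and locator are invented for the example.
unit = EvidenceUnit(
    evidence_id="E-0001",
    collected_on=date(2026, 5, 4),
    locator="interviews/acct-17.txt@00:14:32",
    segment="new self-serve accounts",
    surface="web",
    journey_stage="onboarding",
    raw_observation="I could not find where to invite my team.",
    interpretation="Team invitation is hard to discover during setup.",
    interpreted_by="model",
    sensitivity="internal",
    relates_to="activation",
)
units_by_id = {unit.evidence_id: unit}

print(resolve_claim(["E-0001"], units_by_id))
```

Raising on an unknown identifier is intentional: a claim that cannot be resolved to a record should stop the workflow rather than pass through as a customer fact.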

Apply data-governance rules before ingestion. If customer conversations contain personal, confidential, or contract-restricted information, do not copy them into an AI system until its access, retention, redaction, and model-training terms match your commitments. A more convenient synthesis workflow is not worth an unauthorized disclosure.

Retrieve the smallest useful context

Once the evidence corpus no longer fits sensibly into a prompt, use a retrieval-first pipeline with modular prompts and observable traces. Retrieval-augmented generation should select evidence relevant to the decision contract, rather than asking a general agent to reason over everything the company knows.

RAG is a grounding mechanism, not a truth guarantee. A fluent answer does not prove that the retriever found the decisive interview, the correct event definition, or the evidence that contradicts the dominant pattern. Configure retrieval to look for both support and contradiction, preserve evidence identifiers, respect access controls, and return no result when the available context does not meet the task.

An opportunity solution tree can provide the shared view above this pipeline: the desired outcome connects to opportunities, solution candidates, and tests. Treat the tree as a navigable representation of current thinking. Every important node should still resolve to evidence and assumptions beneath it.

Give AI a chain of bounded jobs

A single agent asked to interview customers, interpret feedback, size opportunities, choose a solution, and write a roadmap has too many ways to hide a weak inference. Break the work into stages with explicit inputs and review gates:

Prepare: Give AI the outcome, assumptions, and learning gaps. Let it draft non-leading interview questions. A human checks whether the guide is testing an assumption or merely inviting agreement.
Convert: Extract atomic observations from approved records. Require exact locators and label customer language, observed behavior, and interpretation separately.
Synthesize: Cluster candidate opportunities without erasing segment differences. Request supporting evidence, counterevidence, and unrepresented cohorts for every cluster.
Connect: Use behavioral analytics to examine whether the observed pattern appears in the target cohort. Interviews can expose mechanisms and unmet needs; they should not be treated as a substitute for measuring prevalence.
Challenge: Ask for rival explanations, evidence that would reverse the conclusion, and assumptions that remain untested. This stage should consume the evidence record, not just the previous summary.
Draft: Produce a decision brief containing the pending decision, options, evidence, contradictions, unknowns, and proposed next test. A named human accepts, revises, or rejects it.
Learn: Attach experiment and outcome evidence to the same opportunity record. Preserve what the team believed before the test so later reviewers can inspect how the decision changed.

Pass structured artifacts between stages. If each stage receives only prose copied from the previous chat, unsupported claims can become progressively harder to distinguish from evidence.

Buy workflow plumbing; own the decision logic

You do not need to build every repository, connector, permission system, visualization, and observability screen. Licensing purpose-built opportunity-tree infrastructure can be the sensible choice when your differentiated work is the learning system rather than the canvas or collaboration layer.

Keep ownership of the parts that encode how your company makes product decisions: the decision contract, evidence schema, opportunity taxonomy, prompt modules, evaluation cases, escalation rules, and approval gates. Before choosing a platform, ask:

Can you export the raw evidence, metadata, opportunity structure, prompts, and run traces?
Can access rules follow the evidence through retrieval and generation?
Can the system connect to your approved analytics and customer-evidence repositories without repeated manual copying?
Can you evaluate a prompt or retrieval change against representative past cases?
Can a reviewer inspect why a claim appeared and what evidence was omitted?
Would building this capability improve the customer outcome, or merely recreate commodity workflow infrastructure?

Evaluate the workflow before it shapes the roadmap

Start evals before AI-generated conclusions become routine inputs to product reviews. The evaluation set should represent the cases the workflow will actually encounter: a clear pattern, conflicting evidence, insufficient evidence, cohort-specific behavior, stale material, duplicated records, and content the requesting user is not allowed to retrieve.

For synthesis and decision-support tasks, evaluate behavior that a reviewer can observe:

Citation validity: Every material claim points to a real, accessible evidence identifier.
Evidence fidelity: Quotations and behavioral facts remain faithful to the underlying record; interpretations are labeled as interpretations.
Retrieval coverage: The output includes the evidence required to assess the target opportunity, not merely the easiest matching passages.
Contradiction handling: Material counterevidence and segment differences are visible rather than buried.
Abstention: The system returns insufficient evidence when the decision cannot be supported.
Decision fit: The deliverable answers the stated decision instead of drifting into a generic summary or unrelated recommendation.
Policy compliance: Restricted evidence stays outside unauthorized retrieval, traces, and generated output.

A strict release gate is useful here. Fail the output if it invents an evidence identifier, turns an interpretation into a quotation, ignores a material contradiction, or exposes restricted content. Those are not cosmetic defects that a polished paragraph can offset.
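
The four failure conditions can be checked mechanically against an evaluation case. In this sketch the shapes of `output` and `case` are assumptions: `case["records"]` maps every real evidence identifier to its raw text, and the case author lists which identifiers are restricted and which contradictions are material.

```python
def release_gate(output, case):
    """Return a list of hard failures; an empty list means the output may proceed."""
    failures = []
    cited = set()
    for claim in output["claims"]:
        cited.update(claim["evidence_ids"])
        quote = claim.get("quotation")
        # A quotation must appear verbatim in at least one record the claim cites.
        if quote and not any(
            quote in case["records"].get(eid, "") for eid in claim["evidence_ids"]
        ):
            failures.append(f"quotation not found in cited record: {quote!r}")
    shown = set(output.get("conflicting_evidence_ids", []))
    for eid in sorted(cited - set(case["records"])):
        failures.append(f"invented evidence identifier: {eid}")
    for eid in sorted((cited | shown) & case["restricted_ids"]):
        failures.append(f"restricted evidence exposed: {eid}")
    for eid in sorted(case["material_contradictions"] - shown):
        failures.append(f"material contradiction ignored: {eid}")
    return failures


# Illustrative evaluation case and outputs.
case = {
    "records": {
        "E-1": "I could not find where to invite my team.",
        "E-2": "Setup was quick for us.",
        "E-9": "Contract terms discussed on the call.",
    },
    "restricted_ids": {"E-9"},
    "material_contradictions": {"E-2"},
}
good = {
    "claims": [{"evidence_ids": ["E-1"], "quotation": "could not find where to invite"}],
    "conflicting_evidence_ids": ["E-2"],
}
bad = {
    "claims": [{"evidence_ids": ["E-7"], "quotation": "invites are broken"}],
    "conflicting_evidence_ids": [],
}

print(release_gate(good, case))
for failure in release_gate(bad, case):
    print(failure)
```

A verbatim substring match is a blunt test for quotations, and it will reject legitimate light edits such as trimmed filler words. That strictness suits a release gate, where a false rejection costs a review and a false acceptance puts an invented quote in front of a roadmap discussion.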

Treat the prompt, retrieval configuration, model choice, taxonomy, and evaluation set as versioned artifacts. This is the practical value of eval-driven development and early observability: when behavior changes, you can identify the change that caused it and rerun representative cases before wider use.

For each production run, retain the decision contract, evidence identifiers retrieved, prompt and retrieval versions, generated output, reviewer edits, final decision, and later outcome. That trace lets you distinguish a retrieval failure from a synthesis failure, a weak decision contract, or a reasonable decision invalidated by new evidence.

Model-quality checks are only one layer. Also baseline and monitor the discovery workflow itself:

Time from a framed question to a reviewable decision brief.
The share of material claims with inspectable evidence.
Reviewer corrections to quotations, segments, event definitions, and interpretations.
Decisions reopened because relevant evidence was missing or misread.
Movement in the outcome and metric named in the original decision contract.

Do not set improvement targets until you have a baseline for the existing process. A system can make synthesis faster while increasing correction work or encouraging premature decisions. The end-to-end measure tells you whether the saved time is real.

Turn the workflow into a product operating system

AI-native discovery changes the product team’s operating model only when ownership remains explicit. The product manager or product trio owns the outcome, assumptions, and decision. Research and design judgment protects interview quality and interpretive nuance. Data and engineering ownership protects event definitions, retrieval reliability, instrumentation, and access controls. AI produces candidate artifacts. The decision owner approves the action.

Review by exception instead of rereading every generated sentence. Inspect claims marked mixed or insufficient, new opportunity clusters, segment differences, material contradictions, changed event definitions, and outputs that differ from earlier runs. This focuses human attention where judgment is most valuable without treating the model as an authority.

Roll out the workflow through one recurring, reversible discovery decision:

Choose a decision for which customer evidence and behavioral data already exist, such as prioritizing an onboarding friction or investigating a repeated support issue.
Baseline the current path from question to decision, including reviewer corrections and missing-evidence failures.
Create the decision contract, evidence schema, and access rules before connecting an agent.
Build the evaluation set from previous clear, contradictory, insufficient, segment-specific, and restricted cases.
Run the AI workflow in shadow mode beside the existing process. Compare claims, omissions, reviewer effort, and the resulting decision without allowing the generated output to act automatically.
Promote bounded jobs only after they pass their gates. Evidence extraction may be ready before opportunity ranking, and opportunity ranking may be ready before solution recommendations.
Expand to another workflow only when the traces are stable, reviewers understand escalation paths, and the first use case is improving the decision process rather than merely generating more material.
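
The promotion step can be sketched as a gate evaluated per job on shadow-mode results. The job names, fields, and thresholds below are assumptions for illustration; the thresholds in particular are placeholders to be replaced with values derived from your own baseline.

```python
from dataclasses import dataclass


@dataclass
class ShadowResult:
    job: str
    cases: int
    agreed_with_reviewer: int  # shadow output matched the existing process
    corrections: int           # reviewer had to fix the generated output


def passes_gate(r: ShadowResult, min_agreement: float = 0.9,
                max_correction_rate: float = 0.1) -> bool:
    """Decide whether one bounded job may leave shadow mode."""
    if r.cases == 0:
        return False  # no shadow evidence, no promotion
    return (r.agreed_with_reviewer / r.cases >= min_agreement
            and r.corrections / r.cases <= max_correction_rate)


results = [
    ShadowResult("evidence_extraction", cases=40, agreed_with_reviewer=38, corrections=2),
    ShadowResult("opportunity_ranking", cases=40, agreed_with_reviewer=31, corrections=9),
]
promoted = [r.job for r in results if passes_gate(r)]
```

Each job is gated independently, which matches the idea that evidence extraction may be ready while ranking is not.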

At your next discovery review, do not ask what AI found. Bring one decision contract, require every consequential claim to resolve to evidence, and make the unresolved assumption visible. That is a small enough change to start immediately and a strong enough foundation for everything you automate later.

May 19, 2026

Level Up: May 26 Claude Code Show & Tell + Final Product Discovery Fundamentals Cohort

I’m excited to share two opportunities this season to uplevel your craft, connect with peers, and leave with practical, repeatable techniques you can apply immediately to your product work.

We will be doing another round of Claude Code: Show and Tell on May 26th at 9am PDT. These community-driven sessions are hands-on and fast-paced—we swap proven workflows, compare prompts, and pressure-test approaches together. You’ll see how product teams are operationalizing AI workflows in real contexts and walk away with ideas you can adapt for your own roadmap and experimentation pipeline. Invites will go out to Supporting Members and CDH Members tomorrow. If you'd like to join us, keep an eye on your inbox for the invite.

I love these Show & Tell sessions because they translate tacit knowledge into clear, reusable playbooks. Whether you’re refining evaluation loops for LLMs, streamlining discovery synthesis, or standardizing prompts for consistency, the shared rigor and camaraderie make it a high-signal hour for any product leader invested in AI workflows.

I also want to share that I'll be teaching our June 4th – July 9th cohort of Product Discovery Fundamentals. This is the last time I'll be teaching this cohort in its current format. If you've been thinking of enrolling in this program, and want to take it with me, this is your last chance. Register here.

Across this cohort, we’ll practice continuous discovery habits—framing opportunities, tightening assumptions, running lean experiments, and aligning product trios on evidence-backed decisions. If you want a rigorous, repeatable system for turning customer insight into confident prioritization and compelling product strategy, I’d be thrilled to have you in the room.

Inspired by this post on Product Talk.

May 18, 2026
Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

There’s a question that runs underneath every AI Agent evaluation: what can it do?

Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

“Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, and sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, and variations that have never been written down because they’ve always lived in someone’s head.

Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

Most companies have the first two, some have the third, and fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding: walking users through processes step by step rather than executing changes on their behalf. It worked. It resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI”; they had more standardized operations. Their actions were already well defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior; it was triggering actions that were already tightly controlled elsewhere.

Readiness enables capability, not the other way around. That reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.

Inspired by this post on The Intercom Blog.

May 18, 2026

How to Deploy an Operator AI Agent in Customer Operations

Your support team probably does not need another chatbot that summarizes a ticket on command. It needs help with the operational work surrounding every ticket: finding why escalations changed, keeping knowledge accurate, correcting broken automations, coordinating incident communication, and showing human reps what deserves attention next.

An operator AI agent can take on that work, but only if you design it as an operating system for customer operations rather than a conversational layer over support APIs. The useful version closes the loop from signal to diagnosis to tested change. The dangerous version produces plausible commentary and receives permission to act before it has earned trust.

Define the job as a closed loop, not a chat box

A customer-facing AI agent handles an individual customer’s request. An operator agent works on the system around those requests: conversations, help content, automation configuration, performance data, incident workflows, and the human queue.

That distinction changes the product requirement. The agent is not complete when it answers a question such as why escalations increased. It is complete when it can investigate the increase, identify a supported cause, determine which operational object needs attention, prepare a change, test that change where possible, and route it to the right person for approval.

Observe: Detect a question, anomaly, scheduled task, failed conversation, release brief, or incident.
Diagnose: Select the relevant metrics and attributes, inspect representative conversations, and separate recurring patterns from isolated cases.
Locate the control point: Determine whether the problem sits in knowledge, guidance, a procedure, a data connector, an automation rule, or a human workflow.
Propose: Produce a concrete artifact such as an article diff, configuration change, procedure, incident audience, or prioritized queue.
Verify: Run a simulation or another appropriate check and expose failures, edge cases, and remaining uncertainty.
Act and learn: Apply an approved change, record what happened, and monitor the affected outcome for regression.
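
One way to make the loop explicit is a small state machine. This is a sketch: the stage names follow the list above, while the approval check and the return from the last stage to the first are my assumptions about how the loop closes.

```python
from enum import Enum


class Stage(Enum):
    OBSERVE = 1
    DIAGNOSE = 2
    LOCATE = 3
    PROPOSE = 4
    VERIFY = 5
    ACT_AND_LEARN = 6


def next_stage(current: Stage, approved: bool = False) -> Stage:
    """Advance one step; moving into ACT_AND_LEARN requires approval."""
    if current is Stage.ACT_AND_LEARN:
        return Stage.OBSERVE  # monitoring feeds the next observation
    upcoming = Stage(current.value + 1)
    if upcoming is Stage.ACT_AND_LEARN and not approved:
        raise PermissionError("a reviewer must approve before the agent acts")
    return upcoming
```

The point of the structure is that a run ending at DIAGNOSE is visibly incomplete, and a run cannot reach a live change without passing through a proposal and a verification step.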

Consider the prompt “Why did escalations rise last week?” A reporting copilot returns a chart. A useful operator identifies which escalation definition applies, segments the change, reads relevant conversations, finds the repeated cause, checks whether the corresponding help content or automation is deficient, and prepares the smallest defensible correction. That progression from an operational question to an actionable proposal is already possible across analysis, knowledge maintenance, automation building, and human support workflows.

Write the acceptance criteria around that complete handoff. Require the evidence used, the proposed artifact, the scope of impact, the verification result, the named reviewer, and any action the agent is forbidden to take. If the output still leaves an operations manager rebuilding the context manually, you have a chat assistant, not an operator.

Build reliability below the model and price that work honestly

A foundation model with API access can make a persuasive prototype. It can query ticket data, summarize conversations, and write a report that appears coherent. The hard part begins when different workspaces use different fields, configurations, workflows, permissions, and definitions of success.

The model should not have to rediscover your operating rules on every run. Encode those rules in purpose-built tools and reusable skills. A tool performs one bounded operation, such as retrieving a conversation, searching knowledge, or running a defined report. A skill coordinates several tools to complete a business job, such as debugging a failed resolution or rolling a policy change through the help center.
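
The distinction is easier to see in code. In this sketch, the two tools and the skill are hypothetical, the data is an in-memory stand-in for a support platform, and the keyword search is only a placeholder for real retrieval.

```python
# Hypothetical in-memory data standing in for a support platform.
CONVERSATIONS = {"c1": "Customer could not reset password after the update."}
ARTICLES = {"a1": "How to reset your password", "a2": "Billing FAQ"}


def get_conversation(conversation_id: str) -> str:
    """Tool: one bounded read."""
    return CONVERSATIONS[conversation_id]


def search_knowledge(query: str) -> list[str]:
    """Tool: naive keyword overlap, a stand-in for semantic retrieval."""
    words = set(query.lower().split())
    return [article_id for article_id, title in ARTICLES.items()
            if words & set(title.lower().split())]


def debug_failed_resolution(conversation_id: str) -> dict:
    """Skill: chains tools in a fixed order to complete one business job."""
    text = get_conversation(conversation_id)
    matches = search_knowledge(text)
    return {
        "conversation_id": conversation_id,
        "candidate_articles": matches,
        "diagnosis": "content_exists" if matches else "content_gap",
    }


report = debug_failed_resolution("c1")
```

Because the skill fixes the order of steps, the same request follows the same method on every run instead of depending on what the model decides to call.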

Operator’s production architecture is described as having more than 50 tools and 10 multi-step skills. Those counts are not targets to copy. They illustrate how quickly the hidden surface area grows once an agent must do dependable operational work instead of demonstrating a few API calls.

System layer	Job it must perform	Failure you should test for	Control to add
Semantic retrieval	Find content by meaning, not only exact words	Irrelevant or incomplete evidence produces a confident diagnosis	Evaluate retrieval against real support questions and known content gaps
Attribute awareness	Know which metrics, fields, and custom attributes are populated and meaningful	The agent invents a pattern from sparse or unused fields	Expose field definitions, coverage, allowed joins, and missing-data signals
Atomic tools	Perform narrow reads or writes predictably	A broad API wrapper allows an unintended query or change	Use typed inputs, constrained scopes, explicit permissions, and structured results
Domain skills	Chain tools according to a repeatable customer-operations method	The same request follows a different process on each run	Define required steps, exit conditions, evidence, and escalation paths
Review interface	Turn reasoning into charts, diffs, tests, and proposals	A reviewer approves a wall of prose without understanding the change	Render the decision in the format appropriate to the object being changed

Semantic retrieval and attribute awareness deserve particular attention. Retrieval grounds the agent in the content that can actually answer the question. Attribute awareness stops it from treating every available field as equally meaningful. A custom field that exists but is almost never populated should not become the foundation of an operational recommendation.
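
A coverage check is one simple form of attribute awareness. The sketch below is illustrative only: the function names, the sample tickets, and the 0.5 threshold are assumptions, not values from any product.

```python
def field_coverage(records: list[dict], field_name: str) -> float:
    """Fraction of records where the field is present and non-empty."""
    if not records:
        return 0.0
    populated = sum(1 for r in records if r.get(field_name) not in (None, ""))
    return populated / len(records)


def usable_fields(records: list[dict], fields: list[str],
                  min_coverage: float = 0.5) -> list[str]:
    """Keep only fields populated often enough to support a conclusion.

    The default threshold is an arbitrary placeholder.
    """
    return [f for f in fields if field_coverage(records, f) >= min_coverage]


tickets = [
    {"plan": "pro", "churn_risk": None},
    {"plan": "free", "churn_risk": ""},
    {"plan": "pro", "churn_risk": "high"},
    {"plan": "pro"},
]
trusted = usable_fields(tickets, ["plan", "churn_risk"])
```

Exposing the coverage number to the agent, rather than silently dropping the field, also lets it report that a pattern could not be checked because the data was too sparse.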

Give every tool a contract before the model can call it:

The business purpose and the questions it is allowed to answer.
The read and write permissions it requires.
The preconditions that must be true before it runs.
The evidence and identifiers it must return.
Its behavior when data is missing, ambiguous, stale, or inconsistent.
The audit event, approval requirement, and rollback path for a write.
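
The contract can be captured as a structured record that is validated before a tool is registered. This is a sketch with hypothetical field and tool names. Treating every write as requiring an approval rule is my assumption, in line with the proposal boundary discussed later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    purpose: str                    # business purpose and allowed questions
    permissions: tuple[str, ...]    # read and write scopes the tool requires
    preconditions: tuple[str, ...]  # may be empty for simple reads
    returns: tuple[str, ...]        # evidence and identifiers it must return
    on_bad_data: str                # behavior for missing or stale data
    writes: bool = False
    audit_event: str = ""
    approval_required: bool = False
    rollback: str = ""

    def problems(self) -> list[str]:
        """Return the gaps that should block registration of this tool."""
        issues = []
        for attr in ("purpose", "permissions", "returns", "on_bad_data"):
            if not getattr(self, attr):
                issues.append(f"missing {attr}")
        if self.writes:
            if not self.audit_event:
                issues.append("write without audit event")
            if not self.approval_required:
                issues.append("write without approval requirement")
            if not self.rollback:
                issues.append("write without rollback path")
        return issues


read_tool = ToolContract(
    name="get_conversation",
    purpose="Retrieve one conversation for diagnosis",
    permissions=("read:conversations",),
    preconditions=("conversation id is known",),
    returns=("conversation_id", "transcript"),
    on_bad_data="return a not_found result, never guess",
)
write_tool = ToolContract(
    name="publish_article",
    purpose="Publish an approved article diff",
    permissions=("write:articles",),
    preconditions=("proposal approved",),
    returns=("article_id", "revision_id"),
    on_bad_data="abort without writing",
    writes=True,
)
```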

Evaluate build versus buy beyond the demonstration

A proof of concept establishes that a model can produce a plausible answer with your data. It does not establish that the answer is grounded, that the proposed action is safe, or that the system will behave consistently as configurations change.

For a build decision, include retrieval tuning, permission design, tenant isolation, tool maintenance, skill development, evaluation data, observability, proposal interfaces, audit history, rollback behavior, and on-call ownership. Also ask who will update the agent when a support object, metric definition, product policy, or API changes. If these responsibilities do not have durable owners, the internal agent will age like any other unsupported operations system.

For a buy decision, ask the vendor to demonstrate your difficult cases rather than its preferred prompts. Use a conversation with conflicting evidence, an unused custom attribute, an outdated localized article, a misconfigured rule, and a proposed write with a wide blast radius. Inspect the evidence, tool trace, permissions, diff, test result, and audit record. The quality of the generated prose is one of the least informative parts of that evaluation.

Put a proposal boundary around every material action

Moving from analysis to live changes is a different class of production problem. A wrong summary wastes time. A wrong configuration can degrade customer outcomes across every conversation that matches it. An incorrect outbound message cannot be recalled after customers have read it.

I would give the agent autonomy according to consequence, not according to how confident its language sounds:

Read: Search content, inspect conversations, calculate approved metrics, and assemble evidence. Run these tasks autonomously within access controls and log every operation.
Recommend: Explain a root cause or rank an opportunity. Attach the underlying conversations, segments, fields, and assumptions so a person can challenge the conclusion.
Prepare: Draft an article, procedure, rule, connector configuration, customer response, or queue. Save it as a proposal with no production effect.
Change: Publish, configure, send, or otherwise alter the live operation only after the required reviewer sees the exact scope and explicitly approves it.
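
A sketch of that rule in code, assuming a hypothetical in-memory audit log. Real access controls would sit in front of this check; the example only shows that the gate depends on the class of action, not on the agent's stated confidence.

```python
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    READ = "read"
    RECOMMEND = "recommend"
    PREPARE = "prepare"
    CHANGE = "change"


# Hypothetical audit log; a real system would persist this.
audit_log: list[tuple[str, str, bool]] = []


def authorize(action: ActionClass, description: str,
              approved_by: Optional[str] = None) -> bool:
    """Allow an action based on its consequence class and log the decision."""
    allowed = action is not ActionClass.CHANGE or bool(approved_by)
    audit_log.append((action.value, description, allowed))
    return allowed


ok_read = authorize(ActionClass.READ, "search help content")
ok_draft = authorize(ActionClass.PREPARE, "draft article diff")
blocked = authorize(ActionClass.CHANGE, "publish article diff")
approved = authorize(ActionClass.CHANGE, "publish article diff",
                     approved_by="support-content-lead")
```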

A proposal is a structured change object, not a paragraph asking for trust. Production-grade operator systems can present reviewable diffs before applying changes, allowing the reviewer to accept, reject, or refine the work. The same principle should govern any operator implementation.

Your review screen should answer six questions without forcing the approver into another tool:

What object will change?
What exact fields, passages, rules, or recipients are affected?
What evidence connects the observed problem to this change?
What test ran, and which cases failed or remained untested?
Who must approve, and which permission will execute the action?
How can the change be reversed, and what cannot be reversed?
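
If the proposal is a structured object, the six questions become required fields that can be checked before a reviewer ever sees the screen. The field names and sample values here are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Proposal:
    target_object: str            # what object will change
    affected_scope: str           # exact fields, passages, rules, or recipients
    evidence: str                 # what links the problem to this change
    test_result: str              # what ran, what failed or stayed untested
    approver_and_permission: str  # who approves, which permission executes
    reversal_plan: str            # how to reverse, and what cannot be reversed


def unanswered(p: Proposal) -> list[str]:
    """Return the names of review questions the proposal leaves blank."""
    return [f.name for f in fields(p) if not getattr(p, f.name).strip()]


draft = Proposal(
    target_object="Help article: Reset your password",
    affected_scope="Step 2 wording",
    evidence="Conversations c1 and c7",
    test_result="",
    approver_and_permission="Support content lead; write:articles",
    reversal_plan="Restore the previous revision",
)
gaps = unanswered(draft)
```

A proposal with any gap should go back to the agent, not forward to the approver.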

Customer outreach needs the strictest treatment because sending is effectively irreversible. Do not approve a batch from a conversational summary that hides the audience. The safe alternative is a preview containing the resolved customer list, inclusion logic, exclusions, exact message variants, delivery channel, and approver. Start by allowing the agent to prepare that package while a person performs the send.

Simulation also needs a visible place in the proposal. If the agent modifies an automation procedure, show which representative conversations were tested, the expected outcome for each, the observed outcome, and why any mismatch occurred. An overall pass label is not enough to reveal an important edge case.
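
A per-case result structure keeps mismatches visible instead of collapsing them into one label. This is a sketch with hypothetical names and sample cases.

```python
from dataclasses import dataclass


@dataclass
class SimulationCase:
    conversation_id: str
    expected: str
    observed: str
    note: str = ""  # why a mismatch occurred, if known

    @property
    def passed(self) -> bool:
        return self.expected == self.observed


def summarize(cases: list[SimulationCase]) -> dict:
    """Report totals plus every mismatch, so an edge case cannot hide."""
    failures = [c for c in cases if not c.passed]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "mismatches": [
            (c.conversation_id, c.expected, c.observed, c.note) for c in failures
        ],
    }


cases = [
    SimulationCase("c1", "resolved", "resolved"),
    SimulationCase("c2", "escalate", "escalate"),
    SimulationCase("c3", "escalate", "resolved", "rule skipped the VIP segment"),
]
summary = summarize(cases)
```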

Human approval is not a permanent substitute for system quality. If reviewers routinely accept proposals without inspecting them, the control has become ceremonial. Track corrections, rejections, rollbacks, and the evidence reviewers open. Use those signals to improve the relevant retrieval rule, tool, skill, or interface.

Roll out workflows in increasing order of consequence

Choose the first workflow by its operating characteristics. A strong starting candidate recurs frequently, consumes expert attention, has accessible evidence, produces a clear artifact, and has a named reviewer. It should also allow the agent to be useful before it receives broad write permission.

A practical rollout sequence looks like this:

Recurring operations analyst. Give the agent one standing question, such as what changed in escalations or automation performance. Define the metric, comparison period, relevant segments, evidence requirements, and report destination. Require links to representative conversations and allow the conclusion that no action is warranted. Compare its reasoning with an experienced operator’s review until the failure modes are understood.
Knowledge steward. Feed it a release brief or policy change. Ask it to find affected help content, identify missing coverage, and prepare article diffs in the required voice and format. Include localized variants where they exist. The reviewer should validate product behavior, instructions, links, policy language, and whether the proposed set of pages is complete before publishing.
Automation maintainer. Start with known failed conversations. Ask the agent to distinguish a content gap from a rule, procedure, guidance, or connector problem; prepare the smallest correction; define triggers and edge cases; and simulate the result. Do not grant live configuration access until the tool trace and tests make the diagnosis reproducible.
Human-operations coordinator. Use the agent to assemble an incident audience, draft targeted responses, prepare coaching evidence, or prioritize a rep’s queue. These workflows can save substantial coordination time, but they touch customer communication and employee decisions. Begin in preparation mode, expose the selection logic, and expand autonomy only after identity, permission, review, and audit controls have been exercised.

This sequence is a risk ordering, not a universal maturity model. A read-only weekly analysis is easier to inspect and reverse than an outbound incident campaign. A knowledge proposal has a reviewable artifact. A live automation change affects future conversations, while customer communication may create an immediate and irreversible consequence. Move forward when the evidence and controls for the next class of action are ready, not merely because the previous feature launched.

Measure the completed loop, not chat activity

Prompt counts and conversation volume tell you that people opened the product. They do not tell you that customer operations improved. Build the scorecard around the operational loop:

Diagnostic quality: Whether the proposed root cause survives expert review, whether its evidence supports the conclusion, and how often factual correction is required.
Operational throughput: Time from a detected signal to a reviewed proposal and from an approved proposal to a verified change.
Artifact quality: Acceptance, revision, rejection, and rollback patterns for knowledge, automation, configuration, and communication proposals.
Customer outcome: Resolution, escalation, repeat contact, and sentiment for the affected topic after the change, interpreted alongside volume and case mix.
Safety: Permission denials, attempted out-of-scope actions, failed simulations, unauthorized writes, rollbacks, and missing audit events.
Human leverage: Expert time spent collecting evidence, recreating context, drafting the artifact, and reviewing the final proposal.

Do not make automation rate the only goal. A higher rate can coexist with poor resolutions or avoidable escalations. Treat it as one diagnostic measure and pair it with customer outcomes, correction rates, and topic-level regressions.

Create an evaluation set from real operating conditions: known content gaps, misconfigured rules, legitimate escalations, sparse attributes, conflicting evidence, localized content, and incidents with precise audience criteria. Give each case an expected outcome, required evidence, allowed tools, and forbidden action. Re-run the set when the model, retrieval system, tool, skill, permissions, or support configuration changes.
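
Each evaluation case can carry its own expectations, and a grader can check an agent run against them. The structures below are a sketch; the names and the sample case are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    case_id: str
    expected_outcome: str
    required_evidence: set[str]
    allowed_tools: set[str]
    forbidden_action: str


@dataclass
class AgentRun:
    outcome: str
    evidence_used: set[str]
    tools_called: set[str]
    actions_taken: set[str]


def grade(case: EvalCase, run: AgentRun) -> list[str]:
    """Return the reasons a run fails the case; an empty list means it passes."""
    failures = []
    if run.outcome != case.expected_outcome:
        failures.append("wrong outcome")
    if not case.required_evidence <= run.evidence_used:
        failures.append("missing required evidence")
    if not run.tools_called <= case.allowed_tools:
        failures.append("used a tool outside the allowed set")
    if case.forbidden_action in run.actions_taken:
        failures.append("took the forbidden action")
    return failures


case = EvalCase(
    case_id="sparse-attribute-01",
    expected_outcome="no_action_warranted",
    required_evidence={"coverage_report"},
    allowed_tools={"run_report", "get_conversation"},
    forbidden_action="publish_article",
)
good_run = AgentRun("no_action_warranted", {"coverage_report"}, {"run_report"}, set())
bad_run = AgentRun("content_gap", set(), {"run_report", "publish_tool"},
                   {"publish_article"})
```

Grading on evidence, tools, and actions as well as the outcome catches runs that reach the right answer by an unsafe route.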

Scheduled work is where the leverage begins to compound. An operator can run recurring analysis and deliver the resulting report without waiting for a manager to remember the question. Keep an owner on every scheduled job, however. That owner should know where failures appear, when the task last completed, which data it used, and how to pause it.

Key takeaways

An operator agent improves the system around customer conversations; it is not simply another customer-facing bot.
The product boundary should cover observation, diagnosis, proposal, verification, approval, action, and monitoring.
Reliable behavior comes from grounded retrieval, attribute awareness, bounded tools, encoded domain skills, and structured review surfaces.
Grant autonomy by consequence: broad freedom to inspect approved data, tighter controls to prepare changes, and explicit approval for production writes.
Roll out recurring analysis before knowledge changes, automation configuration, and customer communication unless your own risk profile clearly supports another order.
Measure supported diagnoses, accepted artifacts, customer outcomes, human time, and safety events rather than prompt volume alone.

Your next step is to choose one recurring operational question and write down the evidence it requires, the artifact a good answer should produce, the person who will review it, and the actions the agent must not take. Once that loop works reliably, add one downstream proposal. That is a much stronger foundation for an operator agent than beginning with an open-ended prompt and a broad API key.

May 14, 2026

I Pointed a “Ralph Wiggum” AI Loop at My Product for a Week—The Data That Stopped Chaos

I spent a week pointing a "Ralph Wiggum loop" at my product to see how far an agentic AI could take pragmatic, everyday improvements without human micromanagement. It was equal parts exhilarating and nerve-wracking. The short version: the loop moved fast and broke assumptions, but Amplitude analytics kept it from going off the rails—and turned chaos into controlled acceleration.

By "Ralph Wiggum loop," I mean a deliberately naive, endlessly curious cycle: try something small, ship it behind a flag, watch the data, then try again. It is the product equivalent of a fearless intern who experiments constantly. That energy is invaluable for discovery, but it absolutely demands strong guardrails and a clear definition of success.

Before I started, I framed the outcomes I cared about: user activation within the first session, reduction in time-to-value, and early retention indicators. I set baselines and a minimum detectable effect (MDE) for A/B testing so the loop could distinguish noise from signal. I also documented a driver tree of behaviors we wanted to influence and ensured every event was cleanly instrumented in Amplitude analytics to support reliable behavioral analytics.
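
To make the MDE concrete, here is a sketch of the standard normal-approximation sample size for comparing two proportions. It is a textbook formula, not necessarily how Amplitude or this experiment computed it, and the baseline and lift values are made-up inputs.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical inputs: 20% baseline activation, 2-point vs 5-point lift.
n_small_effect = sample_size_per_arm(baseline=0.20, mde=0.02)
n_large_effect = sample_size_per_arm(baseline=0.20, mde=0.05)
```

The smaller the effect you want to detect, the more users each arm needs, which is why a fast loop on small cohorts can only trust fairly large movements.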

The guardrails mattered most. I put every change behind feature flags with instant rollback. I defined "off the rails" conditions upfront, including regression thresholds for activation and retention analysis, and enabled anomaly detection to surface unexpected spikes or drops. Session replay was ready to diagnose confusion fast, and I kept a daily evaluation cadence so the loop never ran unattended for long.
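
A sketch of what an "off the rails" check might look like. The guardrail structure, the thresholds, and the in-memory flag store are all hypothetical stand-ins; a real setup would read metrics from your analytics tool and toggle flags through your feature flag service.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str
    baseline: float
    max_relative_drop: float  # 0.05 means a 5% drop from baseline trips it


def breached(guardrails: list[Guardrail], observed: dict[str, float]) -> list[str]:
    """Return the metrics that fell below their allowed floor."""
    tripped = []
    for g in guardrails:
        value = observed.get(g.metric)
        # Assumption: a missing metric is treated as a breach, not a pass.
        if value is None or value < g.baseline * (1 - g.max_relative_drop):
            tripped.append(g.metric)
    return tripped


flags = {"bold_cta": True}  # hypothetical in-memory flag store


def enforce(flag_name: str, guardrails: list[Guardrail],
            observed: dict[str, float]) -> bool:
    """Turn the flag off if any guardrail is breached; return its state."""
    if breached(guardrails, observed):
        flags[flag_name] = False
    return flags[flag_name]


guardrails = [
    Guardrail("activation_rate", baseline=0.40, max_relative_drop=0.05),
    Guardrail("d7_retention", baseline=0.25, max_relative_drop=0.05),
]
observed = {"activation_rate": 0.45, "d7_retention": 0.22}
still_on = enforce("bold_cta", guardrails, observed)
```

In this example activation is up while retention falls through its floor, so the flag is switched off even though the headline metric looks like a win.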

Day by day, the loop proposed micro-experiments: onboarding copy variants, tooltip timing, in-app guide sequencing, and subtle changes to progressive disclosure. Each iteration shipped behind a flag to a small cohort. I watched leading indicators in real time, then zoomed out to cohort views to guard against short-term gains that might erode longer-term value. When something looked promising, we expanded exposure methodically; when something looked risky, we paused immediately.

We had a pivotal moment where the loop suggested a bolder call-to-action that spiked activation. On the surface, it looked like a win. Amplitude cohorts told a fuller story: downstream engagement softened, and anomaly detection flagged a pattern that hinted at premature conversion rather than genuine intent. A quick rollback through feature flags saved the week—and reminded me why eval-driven development should be the default for agentic AI workflows.

The most surprising part was how quickly the loop unlocked small compounding gains once the measurement scaffolding was in place. With a unified analytics platform and crisp guardrails, the system became a safe sandbox where the AI could explore aggressively while we stayed anchored to outcomes. The combination of behavioral analytics, A/B testing discipline, and daily human review turned raw speed into durable learning.

My takeaways are direct. Agentic AI can accelerate discovery, but only if you define stop conditions and wire strict feedback loops into your stack. Measurement is product strategy here—without it, you get noisy activity instead of progress. Invest in instrumentation first, treat feature flags as non-negotiable, and let anomaly detection and session replay be your early warning system. Most of all, tie every experiment to activation, engagement, or retention, not vanity metrics.

If you’re considering your own week with a "Ralph Wiggum loop," start painfully small, constrain the blast radius, and insist on decision-quality data. Do that, and you’ll turn a chaotic agent into a compounding engine for product discovery—one that moves fast, learns faster, and stays on track.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

But AI has expanded what’s possible, both for our products and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

Figure: A color-coded map from desired outcome to opportunities, solutions, and assumption tests, showing how to structure discovery work and prompt AI to generate, compare, and validate product ideas.

Each team uploaded three customer interviews. My services turned each interview into a snapshot of its key moments and opportunities, then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

Go behind the scenes as AI turns raw feedback into a clear Opportunity Solution Tree. Linked cards reveal user needs—onboarding, support offload, and bot-readiness signals—so product teams can spot priorities and next steps at a glance.

Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

A clear visual of AI‑generated opportunity solution trees: outcomes feed opportunities that branch into sub‑opportunities, while evidence is preserved. The structure ensures updates stay traceable and never cause data loss.

That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

Explore how an outcome expands into an Opportunity Solution Tree: Opportunities A and B stem from the goal, with C and D nested under B, while a concise change set tracks every node added along the way.

On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.
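To make the "no data loss" requirement concrete, here is a minimal Python sketch of a node that carries both source opportunities and children, plus a check that no evidence disappears between two versions of a tree. The `Node` class, its field names, and `lost_sources` are illustrative assumptions, not the actual Vistaly data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One opportunity in the tree (field names are hypothetical)."""
    name: str
    sources: set = field(default_factory=set)      # interview evidence IDs
    children: list = field(default_factory=list)   # sub-opportunity Nodes


def walk(node):
    """Yield the node and every descendant."""
    yield node
    for child in node.children:
        yield from walk(child)


def all_sources(root):
    """Collect every evidence ID attached anywhere in the tree."""
    found = set()
    for node in walk(root):
        found |= node.sources
    return found


def lost_sources(before, after):
    """Evidence IDs present before the update but missing afterwards."""
    return all_sources(before) - all_sources(after)


before = Node("Outcome", children=[Node("A", {1, 2, 3}), Node("C", {4, 5})])
after = Node("Outcome", children=[Node("A", {1, 2}), Node("C", {4, 5})])
print(lost_sources(before, after))  # {3}
```

An empty result means every piece of evidence survived the update. The same idea extends to children: collect the sub-opportunities before and after, and flag any that vanished without an explicit delete or merge.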

In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They converted and started uploading aggressively.

Watch an Opportunity Solution Tree evolve: the original parent A with x, y, z branches is split into A and B, shifting evidence while preserving links—mirroring how AI refines scope and structure in discovery.

At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

From fragments to focus: this diagram shows how Opportunities B and C are merged into a single Opportunity Solution Tree, removing duplicates and unifying context so AI can rank and explore five related opportunities with clarity.

Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

Peek inside our process: a simple opportunity solution tree maps an outcome to prioritized opportunities A and C with downstream options x-z and t-v. A clear snapshot of how AI organizes product discovery.

When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
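The split-then-merge example can be replayed in a few lines of Python. This is a sketch under simplifying assumptions: the tree is a flat dictionary, and `split`, `merge`, and `naive_diff` are hypothetical helpers written for this illustration, not the service's real operations.

```python
import copy

# Each opportunity maps to its evidence ("sources") and sub-opportunities ("children").
original = {
    "A": {"sources": {1, 2, 3}, "children": {"x", "y", "z"}},
    "C": {"sources": {4, 5}, "children": {"t", "u", "v"}},
}


def split(tree, node, new_node, sources, children):
    """Move the given sources and children from `node` into a new sibling."""
    tree[node]["sources"] -= sources
    tree[node]["children"] -= children
    tree[new_node] = {"sources": set(sources), "children": set(children)}


def merge(tree, removed, kept):
    """Fold `removed` into `kept`, carrying over all evidence and children."""
    gone = tree.pop(removed)
    tree[kept]["sources"] |= gone["sources"]
    tree[kept]["children"] |= gone["children"]


def naive_diff(before, after):
    """Report per-node gains and losses, with no notion of split or merge."""
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        for key in ("sources", "children"):
            lost = before[name][key] - after[name][key]
            gained = after[name][key] - before[name][key]
            if lost:
                changes[(name, key, "lost")] = lost
            if gained:
                changes[(name, key, "gained")] = gained
    return changes


final = copy.deepcopy(original)
split(final, "A", "B", sources={3}, children={"y", "z"})
merge(final, removed="B", kept="C")
diff = naive_diff(original, final)
```

The resulting `diff` says only that A lost source 3 and children y and z, and that C gained them. B never appears, because it was created and removed between the two snapshots.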

See how an AI-generated Opportunity Solution Tree unfolds: one Outcome flows to Opportunities A and C, then into options x–v. Clean colors and arrows reveal the hierarchy from goal to opportunities at a glance.

That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

Watch an opportunity solution tree take shape: start with the outcome, add opportunities A and B, then extend B to C and D. The paired change set makes every edit transparent—ideal for AI-assisted product discovery.

For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set, when applied step by step, didn’t reproduce the recommended tree.
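The second class of error is easy to state in code: replay the change set against the starting tree and compare the result with the tree the model recommended. The Python sketch below assumes a tree stored as a `{name: parent}` dictionary and covers only three of the actions (add, delete, reframe); the step format and function names are illustrative, and merge and split are left out for brevity.

```python
def apply_change_set(tree, change_set):
    """Apply add/delete/reframe steps to a {name: parent} tree; return a new tree."""
    result = dict(tree)
    for step in change_set:
        action = step["action"]
        if action == "add":
            result[step["node"]] = step["parent"]
        elif action == "delete":
            del result[step["node"]]
        elif action == "reframe":
            # Rename the node and re-point its children to the new name.
            old, new = step["node"], step["new_name"]
            result[new] = result.pop(old)
            for name, parent in list(result.items()):
                if parent == old:
                    result[name] = new
        else:
            raise ValueError(f"unknown action: {action}")
    return result


def reproduces(tree, change_set, recommended):
    """True if replaying the change set yields exactly the recommended tree."""
    return apply_change_set(tree, change_set) == recommended


start = {"A": "Outcome", "B": "Outcome", "x": "A"}
steps = [
    {"action": "add", "node": "C", "parent": "B"},
    {"action": "reframe", "node": "A", "new_name": "A2"},
]
good = {"A2": "Outcome", "B": "Outcome", "x": "A2", "C": "B"}
bad = dict(good, D="B")  # recommends a node the change set never adds

print(reproduces(start, steps, good))  # True
print(reproduces(start, steps, bad))   # False
```

Any `False` here is exactly the mismatch that makes a change set unsafe to review step by step.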

Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
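One way to encode the directional rule, sketched in Python: reject a merge when the node being kept sits below the node being deleted. The `{name: parent}` layout and the `check_merge` helper are assumptions for illustration; a real rule set could instead re-parent the siblings rather than reject the move.

```python
def is_descendant(parents, node, ancestor):
    """True if `ancestor` appears on the path from `node` up to the root."""
    current = parents.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


def check_merge(parents, keep, delete):
    """Return an error message if deleting `delete` would orphan nodes, else None."""
    if is_descendant(parents, keep, delete):
        orphans = sorted(n for n, p in parents.items() if p == delete and n != keep)
        return (
            f"cannot merge {delete} into {keep}: "
            f"{keep} is below {delete}; would orphan {orphans}"
        )
    return None


parents = {"A": None, "B": "A", "C": "A", "D": "A"}
print(check_merge(parents, keep="A", delete="B"))  # None
print(check_merge(parents, keep="B", delete="A"))
```

The first call passes because the parent absorbs the child. The second returns a message naming C and D as the nodes that would be left without a parent.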

Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.

Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, that meant a large share of runs still failed: roughly 20–40% if each call fails independently. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
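A quick back-of-the-envelope check shows how a small per-call error rate compounds across a run. The Python sketch below assumes each call fails independently with the same probability, which is a simplification; real failures may cluster on particular trees or steps.

```python
def run_failure_rate(per_call_error, calls_per_run):
    """Probability that at least one call in a run fails, assuming independence."""
    return 1 - (1 - per_call_error) ** calls_per_run


for calls in (10, 20):
    rate = run_failure_rate(1 / 40, calls)
    print(f"{calls} calls per run: {rate:.0%} of runs contain at least one failure")
```

Under that assumption, a 1-in-40 per-call error works out to roughly 22% of runs failing at 10 calls and roughly 40% at 20 calls. Either figure is far too high for a contract that requires every change set to match.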

Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
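The detection half of that work can be sketched briefly. This Python example flags any step that references a node an earlier step already removed. The step format (with `uses` and `removes` lists) is a hypothetical simplification, and the function only reports the conflict; it makes no attempt to guess which repair the model intended.

```python
def find_conflicts(change_set):
    """Return messages for steps that reference a node deleted by an earlier step."""
    deleted = {}  # node name -> number of the step that removed it
    problems = []
    for index, step in enumerate(change_set, start=1):
        for node in step.get("uses", []):
            if node in deleted:
                problems.append(
                    f"step {index} ({step['action']}) uses {node}, "
                    f"deleted in step {deleted[node]}"
                )
        for node in step.get("removes", []):
            deleted[node] = index
    return problems


change_set = [
    {"action": "delete", "uses": ["A"], "removes": ["A"]},
    {"action": "merge", "uses": ["A", "B"], "removes": ["A"]},
]
print(find_conflicts(change_set))
# ['step 2 (merge) uses A, deleted in step 1']
```

The message pinpoints the problem, but it cannot say whether A should be removed outright or folded into B.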

A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code, so I knew exactly when a change set failed and why. Prompt tuning alone wasn’t fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail result plus specific feedback. If validation fails, the model uses that feedback to repair the change set and calls the tool again. The loop repeats until the change set passes or a maximum number of turns is reached.
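Stripped of the model and the real validator, the control flow looks roughly like the Python sketch below. `repair_loop`, `fake_model`, and `fake_validator` are stand-ins invented for illustration. In the real service the model invokes the validator as a tool; here the loop drives both from the outside to keep the example self-contained.

```python
def repair_loop(propose, validate, max_turns=5):
    """Ask `propose` for a change set until `validate` passes or turns run out.

    propose(feedback) -> change set; feedback is None on the first turn.
    validate(change_set) -> (ok, feedback message).
    """
    feedback = None
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)
        ok, feedback = validate(change_set)
        if ok:
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")


# Stand-ins for the real model call and validator, for illustration only.
def fake_model(feedback):
    if feedback is None:
        return ["delete A", "merge A into B"]  # first attempt contains a conflict
    return ["merge A into B"]                  # repaired attempt


def fake_validator(change_set):
    if "delete A" in change_set and "merge A into B" in change_set:
        return False, "step 2 uses A, which step 1 deleted"
    return True, "ok"


result, turns = repair_loop(fake_model, fake_validator)
print(result, turns)  # ['merge A into B'] 2
```

The turn cap matters as much as the validator: without it, a change set that never converges would keep consuming model calls.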

I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop times out. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved—and ready for a day off.

Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

Third, I learned when I can and can’t trust Claude to write code for me. After Opus 4.6 came out, I had given Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I simply trusted that Claude had gotten it right when it hadn’t. Without the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.

Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.

Inspired by this post on Product Talk.

May 13, 2026