Tag: agentic AI

How to Design a Dependable CLI Agent Users Can Trust
Your CLI agent can look impressive in a controlled demo and still feel unsafe in a real repository. The moment it can edit files, invoke tools, or use credentials, users need to understand what it will do before they let it proceed.

The dependable design is rarely the one with the most capabilities. It is the one with the smallest clear promise, predictable execution, visible controls, and evidence that it succeeds repeatedly.

Define the boundary before you define the features

Start by writing an operating contract for the agent. This is a product decision, not a prompt-writing exercise. A useful contract answers five questions:
- What job does the agent complete?
- Which resources and tools may it use?
- What must it never do?
- Which actions require explicit approval?
- What observable result counts as success?
Keep the job narrow enough to explain in one sentence. If the description needs a collection of exceptions, the interface is already carrying too much ambiguity. Split the work into a clearly named subcommand or make the advanced behavior opt-in.
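
One way to keep the contract honest is to store the five answers as data and check them before the agent runs. The sketch below assumes a hypothetical `OperatingContract` class; the field names, the example tool names, and the one-sentence heuristic are illustrative, not part of any existing framework.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    """Hypothetical container for the five contract answers."""
    job: str                          # what job the agent completes
    allowed_tools: frozenset[str]     # resources and tools it may use
    forbidden_actions: frozenset[str] # what it must never do
    approval_required: frozenset[str] # actions that need explicit approval
    success_check: str                # observable result that counts as success

    def problems(self) -> list[str]:
        """Return readable gaps; an empty list means the contract is usable."""
        issues = []
        if not self.job.strip():
            issues.append("job is empty")
        # Crude heuristic: a period inside the text suggests more than one sentence.
        elif "." in self.job.strip().rstrip("."):
            issues.append("job should fit in one sentence")
        if not self.allowed_tools:
            issues.append("no tools are allowed")
        overlap = self.allowed_tools & self.forbidden_actions
        if overlap:
            issues.append(f"allowed and forbidden overlap: {sorted(overlap)}")
        if not self.success_check.strip():
            issues.append("success check is empty")
        return issues


contract = OperatingContract(
    job="Rename one symbol across a single repository",
    allowed_tools=frozenset({"read_file", "apply_patch"}),
    forbidden_actions=frozenset({"push", "delete_branch"}),
    approval_required=frozenset({"apply_patch"}),
    success_check="tests pass and only the named symbol changed",
)
```

Refusing to start when `problems()` is non-empty makes a vague contract a visible defect rather than something the model papers over at run time.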

Treat every flag, tool, and permission as an increase in blast radius. A new option does not merely add flexibility. It creates another state the agent can misunderstand, another path you must test, and another behavior the user must learn. Reducing the surface area can improve repeatability and trust because both the agent and the user have fewer possible paths to reason about.

When reviewing a proposed capability, ask whether it makes the mental model smaller. If it does not, remove it, defer it, or isolate it behind progressive disclosure. Safe, fast defaults should handle the common case without demanding that a new user understand the entire system.

Design one boring, observable execution path

A dependable run should feel like a transaction with recognizable stages. The model can help interpret intent, but it should not invent the execution contract as it goes.
- Capture intent: Ask only for information required to resolve the task. If a missing choice would materially change the result, stop and ask.
- Retrieve context: Fetch the smallest relevant set of files, facts, or records. More context can introduce conflicting instructions and distract the agent from the requested change.
- Show the plan: Present a compact description of the intended actions, affected targets, and likely side effects.
- Preview when useful: Provide a dry run for operations whose effects the user should inspect before execution.
- Execute through narrow tools: Give each tool a deterministic input and output contract. Reject malformed responses instead of guessing what they meant.
- Verify the result: Check the resulting state and tell the user what changed, what did not, and whether any step failed.
The agent should stop when the requested scope changes, required context is unavailable, or a tool returns an unexpected result. A visible stop is easier to recover from than confident improvisation.
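
The "reject malformed responses" rule can be sketched as a strict parser that sits between a tool and the agent. This is a minimal illustration that assumes tools return JSON objects; `LIST_FILES_CONTRACT` and `MalformedToolResponse` are hypothetical names.

```python
import json


class MalformedToolResponse(ValueError):
    """Raised when a tool's output does not match its declared contract."""


# Hypothetical contract for a "list_files" tool: field name -> required type.
LIST_FILES_CONTRACT = {"status": str, "files": list}


def parse_tool_response(raw: str, contract: dict) -> dict:
    """Parse a JSON tool response and reject anything outside the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedToolResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise MalformedToolResponse("expected a JSON object")
    missing = contract.keys() - data.keys()
    extra = data.keys() - contract.keys()
    if missing or extra:
        raise MalformedToolResponse(
            f"missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, expected in contract.items():
        if not isinstance(data[name], expected):
            raise MalformedToolResponse(f"{name} should be {expected.__name__}")
    return data
```

A raised `MalformedToolResponse` is the visible stop: the run halts and reports the mismatch instead of letting the model guess what the tool meant.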

Favor idempotent operations wherever you can. Repeating an idempotent action produces the intended state without duplicating or compounding its effects. That property matters in a CLI because interrupted runs and retries are normal operating conditions. Test the second run as deliberately as the first.
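
As a small illustration of the property, the hypothetical helper below describes a desired end state ("this line is present") rather than an action ("append this line"), so a retry leaves the file unchanged.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file as a whole line.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # second and later runs land here
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Returning whether anything changed also gives the verification step something concrete to report on a rerun.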

Put human control at the blast-radius boundary

Do not ask for approval at every step. Constant prompts train users to approve without reading. Place confirmation gates where the consequence or scope changes.
- Read-only work: Make inspection and planning the default where possible.
- Scoped writes: Request access only to the specific project, service, or resource needed for the task.
- Destructive actions: Require a separate confirmation that names the target and explains the consequence.
- Credentials: Use narrowly scoped, time-bounded access rather than broad credentials that persist beyond the run.
- Expanded capability: Let users opt into advanced tools instead of quietly enabling them for every session.
A confirmation message should help the user make a decision. Replace a generic question such as “Continue?” with a concrete statement of what will be changed and whether it can be undone.
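
A gate like this can be sketched as a function that decides whether a prompt is needed and, if so, builds a concrete message. The `Risk` categories and `PlannedAction` fields are assumptions made for the example; here a gate fires only for destructive or irreversible actions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    READ_ONLY = "read-only"
    SCOPED_WRITE = "scoped write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class PlannedAction:
    verb: str         # what will happen, e.g. "delete"
    target: str       # the exact resource that will be affected
    risk: Risk
    reversible: bool


def confirmation_prompt(action: PlannedAction) -> Optional[str]:
    """Return the text to show at a gate, or None when no gate is needed."""
    needs_gate = action.risk is Risk.DESTRUCTIVE or not action.reversible
    if not needs_gate:
        return None
    undo = "This can be undone." if action.reversible else "This cannot be undone."
    return f"About to {action.verb} {action.target}. {undo} Proceed? [y/N]"
```

Read-only and reversible scoped work passes without a prompt, which keeps the remaining prompts rare enough to be read.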

Reversibility should shape the underlying implementation as well. Prefer changes that can be represented as a patch, show the proposed difference before applying it, and preserve enough information to explain how to undo the operation. When reversal is impossible, make that fact visible before execution.
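
Showing the proposed difference before applying it needs very little machinery. A minimal sketch using Python's standard `difflib` module, with a hypothetical `preview_patch` helper:

```python
import difflib


def preview_patch(original: str, proposed: str, filename: str) -> str:
    """Return a unified diff of the proposed change, or "" if nothing changes."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)


patch = preview_patch("timeout = 30\n", "timeout = 60\n", "settings.ini")
```

Keeping the diff text alongside the run record also preserves what is needed to explain, and later reverse, the change.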

Use a simple review question for each workflow: can a user predict the maximum consequence of saying yes? If the answer is unclear, the permission boundary is too broad or the confirmation arrives too late.

Prove reliability before expanding the roadmap

Do not use capability count as the measure of progress. Before adding a feature, define the task it should complete, the success threshold it must meet, and the smallest interface needed to test it. This turns roadmap discussions into observable product decisions.

Evaluate at least three outcomes: task completion, time to first successful result, and stability when the same operation is run again. A capability that succeeds once but behaves differently on a retry is not ready merely because the first demonstration worked.
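
These three outcomes can be computed from plain run records. The sketch below assumes each task is attempted at least once and optionally rerun; the `RunRecord` fields and the exact metric definitions (for example, counting a rerun as stable only when both attempts succeed with the same result digest) are illustrative choices.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunRecord:
    task_id: str
    attempt: int         # 1 for the first run, 2 for the rerun
    succeeded: bool
    seconds: float
    result_digest: str   # hash or summary of the resulting state


def evaluate(runs: list) -> dict:
    """Summarize completion, time to first success, and rerun stability."""
    if not runs:
        raise ValueError("no runs to evaluate")
    by_task = {}
    for run in sorted(runs, key=lambda r: (r.task_id, r.attempt)):
        by_task.setdefault(run.task_id, []).append(run)

    completed, stable, times = 0, 0, []
    for attempts in by_task.values():
        elapsed = 0.0
        for run in attempts:
            elapsed += run.seconds
            if run.succeeded:
                completed += 1
                times.append(elapsed)  # time spent up to the first success
                break
        digests = {r.result_digest for r in attempts}
        if len(attempts) >= 2 and all(r.succeeded for r in attempts) and len(digests) == 1:
            stable += 1

    total = len(by_task)
    return {
        "completion_rate": completed / total,
        "median_seconds_to_first_success": median(times) if times else float("nan"),
        "rerun_stability": stable / total,
    }
```

A task that succeeds twice but produces a different digest on the rerun lowers `rerun_stability` while leaving `completion_rate` untouched, which is exactly the gap a single demo hides.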

Instrument each run with Agent Analytics. Capture the input, tools selected, duration, outcome, and error pattern. Review those signals to find where the agent asks unnecessary questions, repeats tool calls, loses users, or encounters the same failure. The response may be a smaller prompt, a tighter tool contract, a safer default, or the removal of a confusing option.

Documentation belongs in this reliability loop. Keep runnable examples alongside the code and make them reflect the golden path. Treat any mismatch between documented behavior and actual behavior as a product defect. If the workflow cannot be explained and demonstrated simply, it is not yet a dependable workflow.

Use these evaluations as promotion gates. Add power only after the current path is measurable, understandable, and stable. That discipline earns you the right to expand without turning the CLI into a collection of loosely related agent behaviors.

Key takeaways
- Write the agent’s operating contract before choosing its tools or refining its prompt.
- Keep the default workflow narrow, safe, fast, and explainable in one sentence.
- Retrieve minimal context, show a compact plan, execute through deterministic contracts, and verify the result.
- Place explicit approval at destructive, irreversible, or scope-expanding boundaries.
- Measure completion, time to first success, and rerun stability before adding another capability.
- Use run telemetry and executable documentation to decide what to simplify next.
Choose one golden-path task and write its operating contract now. Then run it twice: once normally and once as a retry. Every surprise you find is a reliability requirement to resolve before you broaden the agent’s reach.

References
- Shivam.Consulting Blog — The Counterintuitive Playbook for CLI Agents: Why Ruthless Subtraction Beats Feature Creep
May 27, 2026
Speed-to-Lead Is Dead: How AI Agents End the Wait and Rebuild a High-Velocity Sales Org

A prospect lands on our site, skims pricing, watches a demo, and clicks “contact sales.” For years, that’s where momentum died. They waited, and we built entire sales motions around managing that delay.

We optimized for “speed-to-lead,” made it the hallmark of a high-performing sales development org, hired more SDRs, tuned routing rules, added shift coverage, and stared at response-time dashboards. Typical SLA targets were one hour for best-fit leads, four hours for core MQLs, forty-eight hours for everyone else. Those were considered good numbers.

No one questioned the premise because the lag felt structural—shift scheduling, routing delays, and humans working 9–5. The fastest teams could only shrink the gap; nobody could remove it.

An AI Agent closes it completely.

When a prospect arrives today, the conversation can begin immediately. That single change reshapes how I design a sales org—how we staff it, what our team prioritizes, and the metrics we hold ourselves accountable for.

Step outside our dashboards and look at the buyer experience. We spend heavily to drive traffic, then push visitors into forms and queues that add friction precisely when purchase intent peaks.

Intent is highest the moment someone seeks out our product. If an SDR follows up two or three hours later, that buyer’s in another meeting, the urgency has faded, and the moment is gone. We still call it a lead; the buyer has already moved on.

What AI changes

Agents eliminate the structural constraints that made speed-to-lead a problem—shift scheduling, routing delays, CRM batch processing, the SDR being on another call. None of it applies anymore because every single lead can be engaged immediately, at any hour and in any language.

The impact goes beyond response time. When an Agent engages at peak intent, qualification, discovery, and even an initial demo moment can unfold in a single, continuous conversation. The gated funnel collapses. There’s no reason to qualify someone today, schedule discovery for Thursday, and demo the following week when the conversation is already happening.

The constraint the industry built around simply isn’t there anymore. We’re already seeing it with Fin, a Customer Agent. As sales leaders, we need to frame this differently.

If speed-to-lead is no longer the constraint, the knock-on effects reach every part of the org.


SDRs focus on moving deals forward. Instead of frontline triage, they double down on phone-based selling and relationship building, complex deal navigation, and multi-threaded engagement across stakeholders—the high-leverage work that used to get crowded out by the inbox.

Pipeline gets more relevant. The old model rewarded volume: capture as many form fills as possible, respond fast, and sort quality later. When an Agent engages at the moment of intent, it qualifies during the conversation. Low-fit leads get filtered out before they reach the team, and high-fit prospects arrive with context—needs, timeline, stakeholders—instead of just a name and email.

You measure outcomes, not response time. When first response is instant, different metrics matter. I anchor on three questions:

1) Is the Agent doing the work? Completion rate, qualification rate, and contact capture rate indicate whether conversations reach clear outcomes and produce usable handoffs to the team.

2) Is the work producing pipeline? Meetings booked and pipeline created through Agent-handled conversations are the leading indicators of revenue, not how fast someone followed up.

3) Are buyers having a good experience? Conversation-level satisfaction matters more than ever because the Agent is the first interaction prospects have with your company. The experience it delivers is the first impression you make.

These three questions reveal whether the motion is working. Time-to-first-response can’t.

Sales orgs built hiring plans, workflows, and performance metrics around beating intent decay. That made sense when the lag was unavoidable. It isn’t anymore.

An Agent is always on. It engages the moment a prospect arrives on your site, qualifies them in real time, and routes them to the right outcome without waiting for someone to be free. The lag the industry built itself around doesn’t exist when the conversation starts immediately.

The companies leaning into this are investing in what happens after the conversation starts: how well the Agent qualifies, where it creates pipeline, and what SDRs should actually spend time on. What matters now is not how fast you respond, but what the conversation produces.

Speed-to-lead made sense when the delay was structural. It isn’t anymore. If you’re re-architecting go-to-market, instrument Agent Analytics, revisit SDR charters, and tighten CRM integration so every qualified handoff is instant, traceable, and revenue-linked.

Inspired by this post on The Intercom Blog.

May 26, 2026
Beyond Accuracy: How I Evaluate AI Customer Service Agents That Delight and Scale
When teams evaluate AI Agent options for customer service, I often see the rigor aimed at the wrong subset of criteria. After leading and observing dozens of proof of concept (POC) efforts with our customers and prospects, I understand why performance—accuracy scores, resolution rates, and benchmark tests on curated datasets—soaks up most of the attention. But those indicators alone won’t guarantee success once you leave the sandbox and face real customers.

If your POC only proves that the AI “works,” you’re missing the bigger picture. Here’s what else I look for to make the best long-term decision.

How does it handle your real-world setup?

Performance is table stakes, but it has to reflect the messiness of an actual support environment. The best-performing Agents don’t just get answers right—they exhibit resilient, human-like behavior under pressure. I watch how the Agent behaves when it doesn’t know an answer: does it recover or spiral? Does it stay on track through multi-step requests, and how gracefully does it hand off to human agents? If your knowledge base depends on a retrieval-first pipeline, test cross-source retrieval and grounding—not just single-document lookups.

When I build evaluation scenarios, I put the Agent through its paces with a broad, realistic mix:
- Multi-turn queries that require the Agent to carry context across a conversation, not just answer isolated questions.
- Vague or fragmented inputs, like typos, grammatical errors, and incomplete questions, because that’s how customers actually write.
- Edge cases and sensitive scenarios, like billing disputes, frustrated customers, and questions that sit at the boundary of what the Agent is trained on.
- Different phrasings of the same question. An Agent that handles one version well but fails on a rephrasing has a knowledge problem, not a performance problem.
- Queries that require pulling from multiple knowledge sources. Real issues are rarely answered by a single help article, and an Agent that can only handle single-source questions will hit a ceiling fast.
- Multilingual conversations, if your customer base requires it. Performance can vary significantly across languages and it’s better to discover that in testing than in production.
This preparation is worth the effort. Any Agent can look impressive in a demo; what matters is how it holds up as part of your team, serving your customers in production.

What does it feel like to interact with the Agent?

Two AI Agents can post the same quantitative scores (resolution rate, containment rate, and more) and still deliver very different customer experiences. Resolution rate tells me whether the Agent finishes conversations; it says nothing about how customers felt during them. I deliberately assess the experience, not just the outcome, because conversation design shapes trust and brand perception.

Here’s what I look for to ensure the AI Agent is enjoyable to interact with:
- Is the tone natural and on-brand, or does it feel robotic and generic?
- Does it build trust early in the conversation, or does it create friction that makes customers want to immediately request a human?
- When it doesn’t know the answer, does it handle that gracefully?
- When it hands off to a human, is that transition seamless, or does the customer feel abandoned?
As George Dilthey at Clay put it when evaluating their AI setup: “Keep what’s important to your business up front and center. For us, that was transparency and control over the customer experience.”

That framing is exactly right. The Agent represents your brand in every conversation. Customers don’t experience “accuracy,” they experience conversations. An Agent that’s technically accurate but tonally off-brand will erode customer trust over time.

I make the experience dimension explicit in my POCs. I have people on my team—and when possible, a small cohort of real customers—interact with the Agent under realistic conditions. Then I ask how it felt, not just whether it worked.

Can you keep improving it after launch?

This is the dimension most teams don’t evaluate at all, and it’s possibly the most important one. Choosing an Agent that works today and that you can keep improving over time requires more than a functional demo. You’re buying a system that must get better every week, not just during the first sprint.

The feedback loop

Can your team easily review conversations and identify where the Agent is underperforming? Can you pinpoint specific gaps (missing knowledge, incorrect tone, poor handoff decisions) and act on them quickly? The faster the loop between “something isn’t working” and “we’ve fixed it,” the more value compounds over time. In practice, that means instrumenting conversations, leveraging Agent Analytics, tagging misroutes and tone slips, and running targeted evals on known failure modes.

The speed of iteration

When you identify a gap, how quickly can you address it? This is partly a question of tooling (how easy is it to update knowledge, refine guidance, adjust behavior?) and partly a question of team capability. The teams getting the most out of AI are the ones that have changed how they operate and made continuous improvement a part of their everyday work. They’ve committed to going all-in for the long term, not just the first few weeks when launching their AI Agent. We treat this as eval-driven development: automate evaluations that mirror real tickets, tighten prompt engineering and retrieval settings, and ship small fixes daily.

The vendor partnership

The vendor behind the Agent matters just as much as the solution itself. You’re choosing a partner for transformation that will help you evolve how your business delivers customer experience. Ask:
- How does customer feedback influence the product roadmap, and can they show you examples?
- If you have feedback on limitations or weaknesses, do they engage transparently or get defensive?
- What kind of support will you get post-launch?
- Are they shaping where AI customer experience is going, or reacting to what others are building?
How a vendor responds to those questions tells you more about the long-term relationship than any benchmark result.

What a good POC proves

If your POC only proves “the AI works,” you haven’t done enough. A strong proof of concept tests performance in realistic conditions, evaluates the experience from the customer’s perspective, and validates the system that will support continuous improvement after launch. Done well, it sets you up for long-term operational success and builds organizational AI readiness—not just a flashy demo.

Inspired by this post on The Intercom Blog.
May 22, 2026
Supercharge Core Web Vitals with Amplitude’s Global Agent: Faster Rankings, Happier Users

I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.

The aim is to track web vitals and page rankings faster with Amplitude AI Agents, and in turn improve the site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.

Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.

My practical playbook is straightforward:
1. Establish a performance baseline for Core Web Vitals on key templates and critical user paths.
2. Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction.
3. Connect those vitals to downstream behaviors (scroll depth, engagement, and conversion) so we prioritize fixes that move business outcomes, not just lab scores.
4. Use feature flags and A/B testing to ship improvements safely and quantify uplift.
5. Close the loop with Agent Analytics to keep learnings visible and actionable.

Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading UX for speed—or vice versa.

The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.

If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.

Inspired by this post on Amplitude – Best Practices.

May 20, 2026
How to Operate Always-On AI Agents Without Losing Control
You want an AI agent to keep work moving after you close your laptop. The difficult part is not getting one successful overnight run. It is making the hundredth run predictable enough that you do not wake up to an embarrassing email, a corrupted task queue, or an unexplained usage bill.

The right operating model looks less like a clever prompt and more like a small, well-managed operations team. Give each agent a narrow job, an inspectable queue, limited tools, a clear definition of done, and an explicit place to stop. That is how you gain useful autonomy without surrendering control.

Start with a delegation contract, not a general-purpose assistant

An always-on agent should not begin with a broad instruction such as “manage my sales work.” That leaves the model to decide what managing means, which systems it may change, and when it has enough evidence to act. The ambiguity is tolerable during an interactive session because you can correct it. It becomes operational risk when the agent runs unattended.

Start by defining a job that produces a recognizable artifact. A sales-admin agent can prepare a briefing before a scheduled call and create proposed follow-up tasks afterward. A podcast-manager agent can assemble interview context, prepare a transcript-review document, and queue a reminder to share it. A coding-manager agent can review prior sessions and identify recurring mistakes. These are bounded responsibilities with visible outputs, not vague mandates to “help.” Three specialized agents handling podcast, sales, and coding workflows demonstrate how cleanly this pattern can separate unrelated work.

Write the delegation contract in an identity file that the agent reads at the beginning of every run. It should answer seven questions:
1. Who are you? Name the role, not the underlying model: sales admin, podcast manager, coding manager, or another function a person would recognize.
2. What outcome do you own? Describe the recurring deliverable and the event that makes it useful.
3. Where may you work? Name the exact task, output, and script folders the agent can use.
4. What inputs may you trust? Identify the calendar, task file, transcript, session log, or other allowed input for the job.
5. What may you change? Separate reading, drafting, creating internal files, updating tasks, and acting in external systems.
6. What counts as complete? Specify the artifact, required fields, location, and status update expected at the end.
7. When must you stop? Define what the agent should do when information conflicts, a tool fails, permission is missing, or the next step would affect another person.
The last question matters most. A useful agent does not need permission to improvise its way through every obstacle. It needs a reliable way to say, “I could not complete this safely; here is the missing decision.” Treat a well-documented block as a successful operational outcome, not as agent failure.

Keep consequential decisions outside the unattended role. The agent can prepare a customer email without sending it. It can propose changes to a deal record without changing the commercial commitment. It can summarize a coding pattern without modifying a production system. Moving from preparation to execution should be a deliberate permission decision, not an accidental side effect of adding another tool.

Build an inspectable operating loop around four components

The prompt is only one part of the system. Reliable agent operations need four components with distinct responsibilities: identity, scheduling, tasks, and scripts. Keeping them separate makes failures easier to locate and changes easier to review.

Identity defines responsibility

The identity file is the stable operating policy. It tells the agent what role it is playing, where its work lives, what it may do, and what completion looks like. Do not overload it with the details of one assignment. If the identity changes every time a task arrives, you no longer have a stable agent; you have an unreviewed prompt generator.

The scheduler supplies a heartbeat

The scheduler should wake the agent, point it to the correct identity and queue, and capture the result. It should not contain the business logic for podcast preparation or sales follow-up. That logic belongs in inspectable task instructions and small scripts.

A Mac that remains online can use macOS LaunchAgents as this heartbeat. LaunchAgents run with the user’s permissions, which is operationally convenient but also defines the risk boundary: the agent may be able to reach anything the scheduled process and its tools can reach. Running scheduled agents on an always-on Mac mini therefore makes permission design part of the architecture, not a setting to revisit later.

Make the schedule explicit and easy to disable. Each job should have a known trigger, whether that is a recurring interval, a calendar-related event, or a periodic review. If you cannot quickly answer why an agent ran at a particular time, the scheduler is already too opaque.
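
To make the trigger explicit, the job definition can be generated from a few reviewed values rather than edited by hand. The sketch below uses Python's standard `plistlib` to build a property list with the launchd keys `Label`, `ProgramArguments`, `StartInterval`, `StandardOutPath`, and `StandardErrorPath`. The label, script path, interval, and log directory are hypothetical placeholders.

```python
import plistlib


def build_launch_agent(label: str, command: list, interval_seconds: int,
                       log_dir: str) -> bytes:
    """Return an XML property list describing one scheduled job."""
    job = {
        "Label": label,                      # unique name for the job
        "ProgramArguments": command,         # program and arguments to run
        "StartInterval": interval_seconds,   # run every N seconds
        "StandardOutPath": f"{log_dir}/{label}.out.log",
        "StandardErrorPath": f"{log_dir}/{label}.err.log",
    }
    return plistlib.dumps(job)


# Hypothetical values for a sales-admin agent that wakes every 15 minutes.
plist_bytes = build_launch_agent(
    label="com.example.sales-admin",
    command=["/usr/local/bin/run-agent", "--identity", "sales-admin.md"],
    interval_seconds=900,
    log_dir="/tmp/agent-logs",
)
```

Per-user LaunchAgents conventionally live in `~/Library/LaunchAgents`. Because the scheduler entry holds only a label, a command, and an interval, answering "why did this run?" means reading one short file.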

Tasks hold durable state

Use a dedicated task folder for each agent. A Markdown file with frontmatter is enough to represent a work item while remaining readable by both a person and a tool. The frontmatter can hold machine-readable state; the body can hold the request, context, acceptance criteria, and eventual run notes.
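
As a rough sketch of how such a file can be read, the parser below handles only flat `key: value` frontmatter between two `---` lines. It is deliberately not a full YAML parser, and the field names in the example (`id`, `status`, `claimed_by`) are hypothetical.

```python
def parse_task(text: str) -> tuple:
    """Split a task file into (frontmatter dict, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("task file must start with a frontmatter block")
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        raise ValueError("frontmatter block is not closed") from None
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


EXAMPLE_TASK = """---
id: brief-0001
status: queued
claimed_by:
---
Prepare a briefing for the scheduled call.

Acceptance criteria: one page, saved to the output folder.
"""

meta, body = parse_task(EXAMPLE_TASK)
```

The same file stays readable in any Markdown editor, so a person can inspect or correct the state without going through the agent.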

Choose a small lifecycle and apply it consistently. For example: queued, in progress, blocked, completed, and failed. The exact labels matter less than the transition rules:
- A queued task is eligible to be claimed.
- An in-progress task records which run claimed it, preventing another run from silently doing the same work.
- A blocked task names the missing input or decision and preserves all useful partial work.
- A completed task links to its output and records what changed.
- A failed task records the failed operation and whether retrying it is safe.
Give each recurring event a stable identifier. Before creating a meeting brief, transcript-review document, or follow-up task, the agent should check whether that event has already been processed. This idempotency check prevents a retry or overlapping schedule from creating duplicates.
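
The transition rules and the duplicate check can be sketched as a small state machine over task dictionaries. The allowed transitions beyond those described above (for example, letting a blocked or failed task return to queued) and the `event_id` field are assumptions for illustration.

```python
# Which status changes are legal; anything else is rejected.
ALLOWED = {
    "queued": {"in_progress"},
    "in_progress": {"blocked", "completed", "failed"},
    "blocked": {"queued"},      # assumption: a person supplies the missing input
    "failed": {"queued"},       # assumption: requeue only when a retry is safe
    "completed": set(),
}


def claim(task: dict, run_id: str) -> dict:
    """Mark a queued task as owned by one run."""
    if task["status"] != "queued":
        raise RuntimeError(f"task {task['id']} is {task['status']}, not queued")
    return {**task, "status": "in_progress", "claimed_by": run_id}


def transition(task: dict, new_status: str, note: str) -> dict:
    """Move a task to a new status, recording why."""
    if new_status not in ALLOWED[task["status"]]:
        raise ValueError(f"{task['status']} -> {new_status} is not allowed")
    return {**task, "status": new_status, "note": note}


def already_processed(event_id: str, tasks: list) -> bool:
    """True if any existing task, in any status, covers this event."""
    return any(t["event_id"] == event_id for t in tasks)
```

This sketch works on in-memory dictionaries. When two scheduled runs can overlap, the claim also has to be written atomically (for example, with a lock or an atomic file rename), which is not shown here.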

Do not treat chat history as the task database. Conversations are useful working context, but durable state belongs in a file or system you can inspect independently. Saving identities, task files, and scripts in a shared knowledge workspace such as Obsidian also makes the operating model portable across devices and coding assistants. Changing the model runner should not require rebuilding the job.

Scripts expose narrow capabilities

Scripts should perform small, deterministic operations: fetch an allowed input, create a document in a known location, normalize a transcript, or update a task field. Keep the judgment in the agent and the mechanics in scripts with explicit inputs and outputs.

A small script is easier to inspect than a broad instruction to use the terminal however the model sees fit. It also gives you one place to add validation, duplicate checks, and error handling. When an agent repeatedly constructs the same command or edits the same file shape, promote that operation into a reviewed script rather than relying on the model to reproduce it perfectly on every run.

Design the overnight failure path before the happy path

Unattended automation changes the cost of a mistake. During an interactive session, a confusing output costs a correction. Overnight, the same confusion can trigger repeated work, alter several systems, or contact someone before you see it. Your design should limit the consequence of a wrong interpretation, not merely improve the probability of a correct one.

Use a permission ladder

Classify capabilities by consequence and grant them one level at a time:
1. Read: inspect approved calendars, task files, transcripts, logs, or documents.
2. Prepare: create drafts, summaries, reports, and proposed tasks inside a bounded workspace.
3. Update: change internal records whose history can be inspected and reversed.
4. Act externally: send messages, share files, update customer-facing systems, or invoke paid services.
5. Perform destructive or privileged work: delete data, change access, alter infrastructure, or execute an irreversible operation.
Most new agents should prove themselves at the read and prepare levels. Promotion should be capability-specific. An agent that reliably prepares a sales brief has not thereby earned permission to send customer communication. Reliability does not transfer automatically from one action class to another.
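
One way to encode the ladder is an ordered enum plus a per-capability grant table. The sketch assumes that a grant at one level includes the levels below it; the capability names are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    READ = 1
    PREPARE = 2
    UPDATE = 3
    ACT_EXTERNALLY = 4
    DESTRUCTIVE = 5


# Highest level granted per capability. Promotion edits exactly one entry.
GRANTS = {
    "sales_brief": Level.UPDATE,        # promoted after reliable runs
    "customer_email": Level.PREPARE,    # may draft, may not send
}


def is_allowed(grants: dict, capability: str, requested: Level) -> bool:
    """Allow a request only if this capability was granted that level."""
    granted = grants.get(capability)
    if granted is None:
        return False  # unknown capabilities are denied by default
    return requested <= granted
```

Because grants are keyed by capability, promoting `sales_brief` does nothing for `customer_email`: sending remains denied until someone changes that entry on purpose.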

For external actions, use a pending-approval state that contains the exact proposed action. You should be able to review the recipient, content, destination, and relevant context without reopening the entire run. Destructive or privileged actions should remain outside unattended execution unless you have an explicit recovery path and have deliberately accepted the consequence of failure.

Treat external text as data, not authority

Calendar descriptions, transcripts, web pages, emails, and documents may contain instructions that conflict with the agent’s job. The identity and task contract must outrank text found inside those inputs. An interview guest’s biography can inform a briefing; it cannot expand the podcast agent’s permissions. A meeting note can identify a follow-up; it cannot authorize the agent to send one.

Keep credentials out of identity and task files. Give scripts access only to the credentials required for their operation, and avoid handing an agent a general browser, terminal, file system, and credential store merely because each tool is useful in isolation. The dangerous capability is often the combination.

Make retries selective

A retry is appropriate when the failure is plausibly temporary and repeating the operation is safe. A network timeout during a read may qualify. Ambiguous recipient identity, conflicting meeting details, missing share settings, or an unclear customer commitment do not. Retrying an ambiguity only asks the model to make the same unsupported decision again.

Before enabling automatic retries, require the operation to pass three tests: it can detect whether it already succeeded, a duplicate would not create harm, and the number of attempts is capped. Otherwise, mark the task blocked and surface it for review.
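
The three tests can be written as a single gate that decides between retrying and blocking. The sketch below assumes a hypothetical `Operation` record whose fields you would fill in per script; the names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    can_detect_prior_success: bool  # test 1
    duplicate_is_harmless: bool     # test 2
    max_attempts: int               # test 3; 0 means no cap was defined


def may_auto_retry(op: Operation, failure_is_transient: bool, attempts_made: int) -> bool:
    """Retry only transient failures on operations that pass all three tests."""
    return (
        failure_is_transient
        and op.can_detect_prior_success
        and op.duplicate_is_harmless
        and op.max_attempts > 0
        and attempts_made < op.max_attempts
    )


def next_state(op: Operation, failure_is_transient: bool, attempts_made: int) -> str:
    """Anything that fails the gate is surfaced for review, not retried."""
    if may_auto_retry(op, failure_is_transient, attempts_made):
        return "retry"
    return "blocked"
```

An ambiguity is passed in as `failure_is_transient=False`, so it always lands in `blocked` regardless of how safe the operation is to repeat.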

Put hard boundaries around usage

Always-on does not mean continuously reasoning. It means the system is available to process eligible work on a known schedule. A run should inspect the queue, process a bounded amount of work, record its result, and exit.

Set limits at several layers: eligible task types, work accepted per run, retries per task, tools available to the role, and provider-side spending or usage controls where available. Record usage beside the task outcome so you can distinguish an expensive valuable job from an agent that consumes resources while circling an ambiguity. Surprise charges are not only a pricing problem; they usually indicate that the operating loop lacks a stopping rule.
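
A bounded run might look like the following sketch. The task types, the per-run cap, and the `handler` callback (assumed to return a usage figure for the task) are all hypothetical; the point is that the loop has a stopping rule that does not depend on the model.

```python
from collections import deque

ELIGIBLE_TYPES = {"prepare_brief", "capture_followups"}  # hypothetical task types
MAX_TASKS_PER_RUN = 3                                    # hypothetical cap


def run_once(queue: deque, handler) -> list:
    """Process a bounded batch, record usage beside each outcome, then return."""
    results = []
    skipped = []
    while queue and len(results) < MAX_TASKS_PER_RUN:
        task = queue.popleft()
        if task["type"] not in ELIGIBLE_TYPES:
            skipped.append(task)  # not this agent's job; leave it for review
            continue
        usage = handler(task)
        results.append({"task_id": task["id"], "status": "done", "usage": usage})
    queue.extend(skipped)  # ineligible work goes back on the queue untouched
    return results
```

The scheduler calls `run_once` and the process exits. Remaining work waits for the next scheduled run instead of keeping the agent reasoning indefinitely.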

Finally, maintain a kill switch you can use without asking the agent to cooperate. Disabling the schedule or revoking the narrow credential should stop future work. If stopping the system requires the same model and scripts that may be malfunctioning, it is not an independent control.

Measure whether the agent is reducing work or relocating it

A completed status is not proof of value. An agent can close every task while leaving you to verify facts, repair formatting, remove duplicates, and reconstruct why it made a decision. That is work relocation, not delegation.

Evaluate the operation with measures tied to the job:
- Usable completion rate: the share of eligible tasks that produce an output meeting the acceptance criteria without substantive rework.
- Correction rate: how often you must change facts, recipients, permissions, status, or next steps before using the output.
- Duplicate or false-action rate: how often the agent repeats a job or creates an action that the triggering event did not require.
- Blocked rate by cause: which missing inputs, permissions, or unclear rules repeatedly prevent completion.
- Time to review: the human attention required to approve, repair, or understand the result.
- Usage per usable outcome: the model or service consumption attached to work you actually keep.
These measures tell you what to change. A high blocked rate caused by missing context points to an input problem. Frequent factual corrections point to retrieval or acceptance-criteria problems. Duplicate work points to task identity and idempotency. High review time with otherwise correct output often means the evidence and change log are poorly presented.
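
Several of these measures can be computed directly from run records. The sketch below assumes each record is a dictionary with hypothetical keys `eligible`, `status`, `corrected`, and `usage`; your own receipts may carry different fields.

```python
def scorecard(runs: list) -> dict:
    """Compute a few of the measures above from per-run records."""
    eligible = [r for r in runs if r["eligible"]]
    completed = [r for r in eligible if r["status"] == "done"]
    usable = [r for r in completed if not r["corrected"]]
    blocked = [r for r in eligible if r["status"] == "blocked"]
    total_usage = sum(r["usage"] for r in eligible)

    def rate(part: list, whole: list) -> float:
        return len(part) / len(whole) if whole else 0.0

    return {
        "usable_completion_rate": rate(usable, eligible),
        "correction_rate": rate([r for r in completed if r["corrected"]], completed),
        "blocked_rate": rate(blocked, eligible),
        # Usage spent on everything, divided by the outputs you actually kept.
        "usage_per_usable_outcome": total_usage / len(usable) if usable else None,
    }
```

Dividing total usage by usable outcomes, not by completed tasks, is the design choice that exposes an agent consuming resources on work you later discard.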

Require every run to leave a compact receipt: the task it claimed, inputs it used, scripts it invoked, files or records it changed, output location, completion status, and reason for any block. You should not need to replay hidden reasoning. You need enough evidence to verify the operation and diagnose the next failure.
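
A receipt can be as simple as one JSON line per run. The field names below follow the items listed above, but the `RunReceipt` class and its validation rule are an illustrative sketch, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunReceipt:
    task_id: str
    inputs: list           # sources the run read
    scripts: list          # scripts it invoked
    changes: list          # files or records it changed
    output_location: str
    status: str            # "done" or "blocked"
    block_reason: str = ""

    def to_json(self) -> str:
        """Serialize the receipt; a blocked run must say why it stopped."""
        if self.status == "blocked" and not self.block_reason:
            raise ValueError("a blocked run must state its reason")
        return json.dumps(asdict(self), sort_keys=True)
```

Appending each receipt to a plain-text log keeps the evidence inspectable and versionable alongside the identities, tasks, and scripts.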

Review early runs closely and review again after changing an identity, script, tool, model, or input source. A stable task can become unstable when any one of those dependencies changes. Plain-text identities, tasks, and scripts make that change surface inspectable and versionable.

Your agents can also help improve the operating setup itself. A periodic coding-manager workflow, for example, can review prior coding sessions, identify recurring dead ends, and propose changes in how future sessions are run. The important separation is that the agent proposes an improvement with evidence; the operating policy changes only after review. Self-observation is useful. Unreviewed self-modification is a different risk class.

Expand only when the current job has earned more autonomy

Adding agents is easy once the scheduler and folder structure exist. That convenience can tempt you to automate work whose boundaries are not ready. Scale based on operational evidence, not on the number of possible use cases you can imagine.

A job is a strong candidate for always-on operation when it has a recurring trigger, stable inputs, an observable deliverable, clear acceptance criteria, bounded permissions, and enough repetition to justify maintaining the workflow. Preparation, follow-up capture, document setup, and periodic retrospectives fit because a person can inspect their artifacts and correct them before higher-consequence decisions are made.

Keep work interactive when the task depends on novel judgement, unresolved organizational context, sensitive negotiation, or irreversible action. An agent may still prepare evidence and options, but the decision should remain with the person who owns the consequence.

Before expanding an existing agent’s permissions or creating another role, check five gates:
1. The current output is regularly usable without substantial reconstruction.
2. Common failure modes are visible and end in safe states.
3. Duplicate prevention and retry behavior have been exercised.
4. Usage is attributable to tasks and bounded by stopping rules.
5. The next capability has its own acceptance criteria and consequence review.
Do not create one agent per application. Create one per coherent responsibility. A podcast manager may use a calendar, a document system, and a task list while retaining one outcome. Conversely, sales administration and coding retrospectives should not share an identity merely because they use the same model. Role boundaries should follow accountability, not tooling.

Key takeaways
- Begin with one recurring job that produces an inspectable artifact, not a general instruction to manage a function.
- Give the agent a durable identity, a dedicated task queue, an explicit schedule, and small reviewed scripts.
- Use task states, stable event identifiers, and completion receipts so retries and overlapping runs do not create invisible duplication.
- Keep new agents at read-and-prepare permissions until their outputs and failure modes are consistently understandable.
- Route ambiguity and consequential external actions to approval instead of asking the model to guess.
- Cap eligible work, retries, tools, and usage; always-on availability should still produce finite runs.
- Measure usable outcomes, corrections, blocks, duplicates, review effort, and usage before granting more autonomy.
Pick one task you already repeat and write its delegation contract before choosing more tools. If you cannot define the input, output, permission boundary, completion test, and safe stopping condition on one page, the job is not ready to run while you are offline. Tighten the job first. The agent can earn broader responsibility after the operating evidence is there.

References
- Product Talk – My Always-On AI Team: How I Get Claude Agents to Tackle Work While I’m Offline
May 20, 2026
Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

There’s a question that runs underneath every AI Agent evaluation: what can it do?

Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

“Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, and variations that have never been written down because they’ve always lived in someone’s head.

Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

Most companies have the first two, some have the third, fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding: walking users through processes step by step rather than executing changes on their behalf. It worked. It resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI”; they had more standardized operations: actions that were already well defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior; it was triggering actions that were already tightly controlled elsewhere.

Readiness enables capability, not the other way around. Which reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.

Inspired by this post on The Intercom Blog.

May 18, 2026

How to Deploy an Operator AI Agent in Customer Operations

Your support team probably does not need another chatbot that summarizes a ticket on command. It needs help with the operational work surrounding every ticket: finding why escalations changed, keeping knowledge accurate, correcting broken automations, coordinating incident communication, and showing human reps what deserves attention next.

An operator AI agent can take on that work, but only if you design it as an operating system for customer operations rather than a conversational layer over support APIs. The useful version closes the loop from signal to diagnosis to tested change. The dangerous version produces plausible commentary and receives permission to act before it has earned trust.

Define the job as a closed loop, not a chat box

A customer-facing AI agent handles an individual customer’s request. An operator agent works on the system around those requests: conversations, help content, automation configuration, performance data, incident workflows, and the human queue.

That distinction changes the product requirement. The agent is not complete when it answers a question such as why escalations increased. It is complete when it can investigate the increase, identify a supported cause, determine which operational object needs attention, prepare a change, test that change where possible, and route it to the right person for approval.

1. Observe: Detect a question, anomaly, scheduled task, failed conversation, release brief, or incident.
2. Diagnose: Select the relevant metrics and attributes, inspect representative conversations, and separate recurring patterns from isolated cases.
3. Locate the control point: Determine whether the problem sits in knowledge, guidance, a procedure, a data connector, an automation rule, or a human workflow.
4. Propose: Produce a concrete artifact such as an article diff, configuration change, procedure, incident audience, or prioritized queue.
5. Verify: Run a simulation or another appropriate check and expose failures, edge cases, and remaining uncertainty.
6. Act and learn: Apply an approved change, record what happened, and monitor the affected outcome for regression.

Consider the prompt, “Why did escalations rise last week?” A reporting copilot returns a chart. A useful operator identifies which escalation definition applies, segments the change, reads relevant conversations, finds the repeated cause, checks whether the corresponding help content or automation is deficient, and prepares the smallest defensible correction. That progression from an operational question to an actionable proposal is already possible across analysis, knowledge maintenance, automation building, and human support workflows.

Write the acceptance criteria around that complete handoff. Require the evidence used, the proposed artifact, the scope of impact, the verification result, the named reviewer, and any action the agent is forbidden to take. If the output still leaves an operations manager rebuilding the context manually, you have a chat assistant, not an operator.

Build reliability below the model and price that work honestly

A foundation model with API access can make a persuasive prototype. It can query ticket data, summarize conversations, and write a report that appears coherent. The hard part begins when different workspaces use different fields, configurations, workflows, permissions, and definitions of success.

The model should not have to rediscover your operating rules on every run. Encode those rules in purpose-built tools and reusable skills. A tool performs one bounded operation, such as retrieving a conversation, searching knowledge, or running a defined report. A skill coordinates several tools to complete a business job, such as debugging a failed resolution or rolling a policy change through the help center.

Operator’s production architecture is described as having more than 50 tools and 10 multi-step skills. Those counts are not targets to copy. They illustrate how quickly the hidden surface area grows once an agent must do dependable operational work instead of demonstrating a few API calls.

System layer	Job it must perform	Failure you should test for	Control to add
Semantic retrieval	Find content by meaning, not only exact words	Irrelevant or incomplete evidence produces a confident diagnosis	Evaluate retrieval against real support questions and known content gaps
Attribute awareness	Know which metrics, fields, and custom attributes are populated and meaningful	The agent invents a pattern from sparse or unused fields	Expose field definitions, coverage, allowed joins, and missing-data signals
Atomic tools	Perform narrow reads or writes predictably	A broad API wrapper allows an unintended query or change	Use typed inputs, constrained scopes, explicit permissions, and structured results
Domain skills	Chain tools according to a repeatable customer-operations method	The same request follows a different process on each run	Define required steps, exit conditions, evidence, and escalation paths
Review interface	Turn reasoning into charts, diffs, tests, and proposals	A reviewer approves a wall of prose without understanding the change	Render the decision in the format appropriate to the object being changed

Semantic retrieval and attribute awareness deserve particular attention. Retrieval grounds the agent in the content that can actually answer the question. Attribute awareness stops it from treating every available field as equally meaningful. A custom field that exists but is almost never populated should not become the foundation of an operational recommendation.

Give every tool a contract before the model can call it:

- The business purpose and the questions it is allowed to answer.
- The read and write permissions it requires.
- The preconditions that must be true before it runs.
- The evidence and identifiers it must return.
- Its behavior when data is missing, ambiguous, stale, or inconsistent.
- The audit event, approval requirement, and rollback path for a write.
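
A tool contract of this kind can be captured as data and checked before the tool is registered. The sketch below is an assumption about how you might encode it; `ToolContract`, its field names, and `contract_gaps` are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    purpose: str               # business purpose and allowed questions
    permissions: frozenset     # e.g. {"conversations:read"} (hypothetical scope)
    preconditions: tuple       # may be empty if none apply
    returns: tuple             # evidence and identifiers it must return
    on_bad_data: str           # behavior for missing, stale, or conflicting data
    writes: bool = False
    audit_event: str = ""
    approval: str = ""         # who must approve a write
    rollback: str = ""         # how a write is reversed


def contract_gaps(contract: ToolContract) -> list:
    """List the contract items still undefined; register only when empty."""
    gaps = [
        name
        for name in ("purpose", "permissions", "returns", "on_bad_data")
        if not getattr(contract, name)
    ]
    if contract.writes:
        gaps += [
            name
            for name in ("audit_event", "approval", "rollback")
            if not getattr(contract, name)
        ]
    return gaps
```

Refusing to expose a tool while `contract_gaps` returns anything keeps the write-specific items from being skipped for a tool that started life as read-only.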

Evaluate build versus buy beyond the demonstration

A proof of concept establishes that a model can produce a plausible answer with your data. It does not establish that the answer is grounded, that the proposed action is safe, or that the system will behave consistently as configurations change.

For a build decision, include retrieval tuning, permission design, tenant isolation, tool maintenance, skill development, evaluation data, observability, proposal interfaces, audit history, rollback behavior, and on-call ownership. Also ask who will update the agent when a support object, metric definition, product policy, or API changes. If these responsibilities do not have durable owners, the internal agent will age like any other unsupported operations system.

For a buy decision, ask the vendor to demonstrate your difficult cases rather than its preferred prompts. Use a conversation with conflicting evidence, an unused custom attribute, an outdated localized article, a misconfigured rule, and a proposed write with a wide blast radius. Inspect the evidence, tool trace, permissions, diff, test result, and audit record. The quality of the generated prose is one of the least informative parts of that evaluation.

Put a proposal boundary around every material action

Moving from analysis to live changes is a different class of production problem. A wrong summary wastes time. A wrong configuration can degrade customer outcomes across every conversation that matches it. An incorrect outbound message cannot be recalled after customers have read it.

I would give the agent autonomy according to consequence, not according to how confident its language sounds:

1. Read: Search content, inspect conversations, calculate approved metrics, and assemble evidence. Run these tasks autonomously within access controls and log every operation.
2. Recommend: Explain a root cause or rank an opportunity. Attach the underlying conversations, segments, fields, and assumptions so a person can challenge the conclusion.
3. Prepare: Draft an article, procedure, rule, connector configuration, customer response, or queue. Save it as a proposal with no production effect.
4. Change: Publish, configure, send, or otherwise alter the live operation only after the required reviewer sees the exact scope and explicitly approves it.

A proposal is a structured change object, not a paragraph asking for trust. Production-grade operator systems can present reviewable diffs before applying changes, allowing the reviewer to accept, reject, or refine the work. The same principle should govern any operator implementation.

Your review screen should answer six questions without forcing the approver into another tool:

- What object will change?
- What exact fields, passages, rules, or recipients are affected?
- What evidence connects the observed problem to this change?
- What test ran, and which cases failed or remained untested?
- Who must approve, and which permission will execute the action?
- How can the change be reversed, and what cannot be reversed?

Customer outreach needs the strictest treatment because sending is effectively irreversible. Do not approve a batch from a conversational summary that hides the audience. The safe alternative is a preview containing the resolved customer list, inclusion logic, exclusions, exact message variants, delivery channel, and approver. Start by allowing the agent to prepare that package while a person performs the send.

Simulation also needs a visible place in the proposal. If the agent modifies an automation procedure, show which representative conversations were tested, the expected outcome for each, the observed outcome, and why any mismatch occurred. An overall pass label is not enough to reveal an important edge case.
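
As a rough sketch of what per-case visibility means, the report below lists every tested conversation with its expected and observed outcome and never collapses them into a single label. The `CaseResult` fields and the outcome strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    conversation_id: str
    expected: str
    observed: str
    note: str = ""  # explanation for a mismatch, if known

    @property
    def passed(self) -> bool:
        return self.expected == self.observed


def simulation_report(results: list) -> list:
    """Return one line per tested conversation, with reasons for mismatches."""
    lines = []
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        line = f"{status} {r.conversation_id}: expected={r.expected} observed={r.observed}"
        if not r.passed:
            line += f" ({r.note or 'mismatch unexplained'})"
        lines.append(line)
    return lines
```

A mismatch without a note is reported as unexplained, which gives the reviewer a reason to hold the proposal until the agent or a person can account for it.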

Human approval is not a permanent substitute for system quality. If reviewers routinely accept proposals without inspecting them, the control has become ceremonial. Track corrections, rejections, rollbacks, and the evidence reviewers open. Use those signals to improve the relevant retrieval rule, tool, skill, or interface.

Roll out workflows in increasing order of consequence

Choose the first workflow by its operating characteristics. A strong starting candidate recurs frequently, consumes expert attention, has accessible evidence, produces a clear artifact, and has a named reviewer. It should also allow the agent to be useful before it receives broad write permission.

A practical rollout sequence looks like this:

1. Recurring operations analyst. Give the agent one standing question, such as what changed in escalations or automation performance. Define the metric, comparison period, relevant segments, evidence requirements, and report destination. Require links to representative conversations and allow the conclusion that no action is warranted. Compare its reasoning with an experienced operator’s review until the failure modes are understood.
2. Knowledge steward. Feed it a release brief or policy change. Ask it to find affected help content, identify missing coverage, and prepare article diffs in the required voice and format. Include localized variants where they exist. The reviewer should validate product behavior, instructions, links, policy language, and whether the proposed set of pages is complete before publishing.
3. Automation maintainer. Start with known failed conversations. Ask the agent to distinguish a content gap from a rule, procedure, guidance, or connector problem; prepare the smallest correction; define triggers and edge cases; and simulate the result. Do not grant live configuration access until the tool trace and tests make the diagnosis reproducible.
4. Human-operations coordinator. Use the agent to assemble an incident audience, draft targeted responses, prepare coaching evidence, or prioritize a rep’s queue. These workflows can save substantial coordination time, but they touch customer communication and employee decisions. Begin in preparation mode, expose the selection logic, and expand autonomy only after identity, permission, review, and audit controls have been exercised.

This sequence is a risk ordering, not a universal maturity model. A read-only weekly analysis is easier to inspect and reverse than an outbound incident campaign. A knowledge proposal has a reviewable artifact. A live automation change affects future conversations, while customer communication may create an immediate and irreversible consequence. Move forward when the evidence and controls for the next class of action are ready, not merely because the previous feature launched.

Measure the completed loop, not chat activity

Prompt counts and conversation volume tell you that people opened the product. They do not tell you that customer operations improved. Build the scorecard around the operational loop:

- Diagnostic quality: Whether the proposed root cause survives expert review, whether its evidence supports the conclusion, and how often factual correction is required.
- Operational throughput: Time from a detected signal to a reviewed proposal and from an approved proposal to a verified change.
- Artifact quality: Acceptance, revision, rejection, and rollback patterns for knowledge, automation, configuration, and communication proposals.
- Customer outcome: Resolution, escalation, repeat contact, and sentiment for the affected topic after the change, interpreted alongside volume and case mix.
- Safety: Permission denials, attempted out-of-scope actions, failed simulations, unauthorized writes, rollbacks, and missing audit events.
- Human leverage: Expert time spent collecting evidence, recreating context, drafting the artifact, and reviewing the final proposal.

Do not make automation rate the only goal. A higher rate can coexist with poor resolutions or avoidable escalations. Treat it as one diagnostic measure and pair it with customer outcomes, correction rates, and topic-level regressions.

Create an evaluation set from real operating conditions: known content gaps, misconfigured rules, legitimate escalations, sparse attributes, conflicting evidence, localized content, and incidents with precise audience criteria. Give each case an expected outcome, required evidence, allowed tools, and forbidden action. Re-run the set when the model, retrieval system, tool, skill, permissions, or support configuration changes.
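
An evaluation case with those four attributes can be graded mechanically. The sketch below assumes you can extract the agent's outcome, cited evidence, tools used, and attempted actions from its run trace; `EvalCase`, `grade`, and the failure labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    name: str
    expected_outcome: str
    required_evidence: frozenset
    allowed_tools: frozenset
    forbidden_action: str


def grade(case: EvalCase, outcome: str, evidence: list, tools_used: list, actions: list) -> list:
    """Return the reasons a run fails this case; an empty list means it passed."""
    failures = []
    if outcome != case.expected_outcome:
        failures.append("wrong outcome")
    if not case.required_evidence <= set(evidence):
        failures.append("missing evidence")
    if not set(tools_used) <= case.allowed_tools:
        failures.append("disallowed tool")
    if case.forbidden_action in actions:
        failures.append("forbidden action taken")
    return failures
```

Grading the forbidden action separately from the outcome matters: a run that reaches the right diagnosis while attempting a prohibited write should still fail.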

Scheduled work is where the leverage begins to compound. An operator can run recurring analysis and deliver the resulting report without waiting for a manager to remember the question. Keep an owner on every scheduled job, however. That owner should know where failures appear, when the task last completed, which data it used, and how to pause it.

Key takeaways

- An operator agent improves the system around customer conversations; it is not simply another customer-facing bot.
- The product boundary should cover observation, diagnosis, proposal, verification, approval, action, and monitoring.
- Reliable behavior comes from grounded retrieval, attribute awareness, bounded tools, encoded domain skills, and structured review surfaces.
- Grant autonomy by consequence: broad freedom to inspect approved data, tighter controls to prepare changes, and explicit approval for production writes.
- Roll out recurring analysis before knowledge changes, automation configuration, and customer communication unless your own risk profile clearly supports another order.
- Measure supported diagnoses, accepted artifacts, customer outcomes, human time, and safety events rather than prompt volume alone.

Your next step is to choose one recurring operational question and write down the evidence it requires, the artifact a good answer should produce, the person who will review it, and the actions the agent must not take. Once that loop works reliably, add one downstream proposal. That is a much stronger foundation for an operator agent than beginning with an open-ended prompt and a broad API key.

May 14, 2026

AI-Enabled Enzymatic Recycling: A Product Leader’s Playbook

You have an AI-enabled materials proposal in front of you, a promising set of enzyme candidates, and a difficult decision: fund another round of discovery or start building toward industrial scale. The candidate sequences may be impressive, but they are not yet the product.

Your decision should turn on whether the full system can repeatedly transform a defined waste stream into usable monomers at an economically viable cost. That framing connects model performance, laboratory evidence, process engineering, and commercial reality before an exciting demonstration becomes a stranded pilot.

Define the product around recovered monomers

Only 10% of the plastic manufactured gets recycled. That ceiling is not merely a sorting or consumer-behavior problem. Traditional recycling commonly shortens polymer chains instead of restoring their original molecular building blocks, so the resulting material can lose quality and move toward downcycling.

Enzymatic recycling changes the intended output. An engineered enzyme can deconstruct a polymer into its original monomers, which can then become inputs for new, high-quality plastic. The difference is fundamental: the product is not processed waste or a smaller plastic fragment. It is recovered molecular feedstock.

This distinction gives you a better product boundary. A generated protein sequence is a feature. An enzyme that shows activity in one assay is a technical result. The product is a repeatable monomer-recovery system with a defined input, output, operating envelope, and cost structure.

Before approving a roadmap, require the team to define five contracts:

- Input contract: Which polymer, packaging format, mixture, and contamination profile will the process accept? “Mixed plastic” is not a specification. Name the included materials and the variation the system must tolerate.
- Transformation contract: Which polymer bonds must the enzyme break, and what conversion and selectivity must the reaction demonstrate?
- Output contract: Which monomers will be recovered, what downstream use must they support, and how will the team determine that the output is suitable for that use?
- Operating contract: What reaction conditions, throughput, energy consumption, and process controls must hold outside a small laboratory assay?
- Economic contract: Which cost per ton must the integrated process approach, and which assumptions currently separate measured economics from projected economics?

Selectivity is especially important. An enzyme can target a particular plastic within a mixed waste stream, potentially reducing the need to treat every input as chemically identical. But selectivity does not make an undefined waste stream manageable. The process still needs to know which target material is present, whether the enzyme can reach it, and how the desired products will be recovered.

Write the product brief in one sentence: For this defined feedstock, transform this polymer into these monomers, within this operating envelope, output specification, and cost boundary. If a number is unknown, leave a visible blank and assign an experiment to fill it. Do not hide the uncertainty inside a broad ambition such as “make plastic circular.”

Build the AI as a closed learning system

AI changes the economics of searching enzyme-design space. Protein language models can generate candidates, multi-step agents can coordinate specialized tasks, and computational evaluations can eliminate weak options before scarce laboratory capacity is used. Advances in protein structure prediction have expanded what can be explored, but prediction does not remove the need for physical validation.

The useful architecture is therefore not a model that emits sequences. It is a closed loop in which every physical result makes the next design round better. Rhea’s Factory combines protein language models, an agentic pipeline, domain constraints, and proprietary wet-lab feedback. The product lesson is broader than any one implementation: generation, evaluation, experimentation, and learning need to operate as one traceable system.

- Encode the objective. Convert the product contract into machine-readable constraints: target polymer, desired products, acceptable operating conditions, and the metrics that will decide whether a candidate advances.
- Generate candidates. Explore multiple plausible designs rather than optimizing immediately around the first promising family.
- Apply computational gates. Reject candidates that violate explicit constraints, preserve the reasons for rejection, and rank the remaining candidates for laboratory use.
- Run controlled wet-lab experiments. Test candidates under recorded conditions and capture successes, failures, and inconclusive results.
- Update domain predictions. Use the measured outcomes to improve ranking and candidate selection for the next round.
- Feed process evidence back into discovery. When a candidate struggles under reactor or feedstock conditions, turn that failure into a new design constraint instead of treating it as a separate engineering problem.

Agentic AI is valuable here because the workflow is multi-step, not because an agent should make every decision autonomously. At each handoff, define the required input, expected output, validator, and failure behavior. A generation step should not advance an incomplete candidate. A computational score should not be presented as a laboratory observation. A promising assay should not silently become a scale claim.

Exploration also needs an explicit lane. Higher model-sampling temperatures can produce more unusual enzyme candidates and reach beyond the safest local variations. Controlled model “hallucination” can be useful during candidate exploration when downstream guardrails prevent novelty from being mistaken for evidence.

Separate the candidate portfolio into three buckets: improvements near known winners, adjacent designs that test a clear hypothesis, and high-variance exploration. Give each bucket a deliberate laboratory budget. Raise sampling temperature only in the exploratory lane, and never allow generated assay values, reaction outcomes, or scale results into the measured-data record.

The durable advantage sits in the feedback data. In a narrow, high-signal domain, even hundreds of relevant proprietary laboratory observations can support a useful domain prediction model. That is not a general claim that small datasets are always sufficient. It means contextual quality can matter more than indiscriminate volume when the problem, assay, and outcomes are tightly defined.

For every experiment, preserve enough context to make the result reusable:

- The enzyme identity, sequence version, and design lineage.
- The target polymer, material format, mixture, and relevant contamination profile.
- The assay and protocol version used for the test.
- The reaction conditions and duration.
- The measured conversion, selectivity, yield, and uncertainty available from the experiment.
- The full result, including failure, no-result, and inconclusive outcomes.
- The relationship between the candidate, computational evaluations, physical test, and model or data release.
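
The context listed above can be sketched as a record type. This is a minimal illustration, not a schema from the source: the field names, the `Outcome` values, the `missing_context` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_RESULT = "no_result"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class ExperimentRecord:
    # Candidate identity and lineage
    enzyme_id: str
    sequence_version: str
    design_lineage: tuple
    # Feedstock context
    target_polymer: str
    material_format: str
    mixture: str
    contamination_profile: str
    # Protocol context
    assay: str
    protocol_version: str
    reaction_conditions: dict
    duration_hours: float
    # Result: every outcome is recorded, not only successes
    outcome: Outcome
    conversion: Optional[float] = None
    selectivity: Optional[float] = None
    monomer_yield: Optional[float] = None
    uncertainty: Optional[float] = None
    # Traceability to computational evaluations and releases
    computational_eval_ids: tuple = ()
    model_release: Optional[str] = None


# Context fields that must be non-empty for a result to be reusable (assumed set).
REQUIRED_CONTEXT = (
    "enzyme_id", "sequence_version", "target_polymer",
    "material_format", "assay", "protocol_version",
)


def missing_context(record: ExperimentRecord) -> list:
    """Return the names of required context fields that are empty."""
    return [name for name in REQUIRED_CONTEXT if not getattr(record, name)]


# A failed run is still a valid record; measurements stay None instead of invented.
failed_run = ExperimentRecord(
    enzyme_id="ENZ-001", sequence_version="v3", design_lineage=("ENZ-000",),
    target_polymer="polymer-X", material_format="clamshell",
    mixture="single-polymer", contamination_profile="food residue",
    assay="hypothetical-assay", protocol_version="",
    reaction_conditions={"temperature_c": 40.0}, duration_hours=24.0,
    outcome=Outcome.FAILURE,
)
gaps = missing_context(failed_run)
```

Here `gaps` flags the empty protocol version, which is the kind of omission that makes a result hard to reuse later.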

A spreadsheet of winning sequences is not a data moat. A traceable record of why candidates were proposed, how they were tested, what failed, and how each result changed the next decision can become one.

Use stage gates that end in physical evidence

AI product teams often gravitate toward a model leaderboard because it creates a clean sense of progress. Enzymatic recycling does not have one adequate master score. A candidate can look structurally plausible and fail in the lab. It can perform in a controlled assay and miss the required throughput. It can convert the polymer and still lose economically once the rest of the process is counted.

Use a hierarchy of evidence that moves from design compliance to laboratory performance, operating fit, and scale economics:

Gate	Decision question	Required evidence	Red flag
Design compliance	Does the candidate satisfy the stated target and pipeline constraints?	Deterministic checks, recorded constraint evaluations, and candidate provenance	A candidate advances mainly because it appears novel
Wet-lab performance	Does the enzyme convert the target with the required selectivity under defined conditions?	Repeatable measured observations, including negative and inconclusive runs	Only the best run is retained or shared
Operating fit	Does useful performance hold within the intended controlled, low-temperature process and throughput requirements?	Process measurements tied to reaction conditions, conversion, yield, throughput, and energy use	Activity is reported without the process context needed to interpret it
Scale economics	Can the integrated system move toward cost parity with inexpensive oil-based plastic?	A cost and energy model tied to measured inputs, with assumptions and sensitivities exposed	Commercial viability is inferred from enzyme activity alone

Set pass, hold, and stop conditions before seeing the result. Otherwise, an interesting candidate will repeatedly earn one more experiment while the commercial requirement drifts. Relative improvement is useful for learning, but an enzyme that is twice as good as an unusable baseline may still be unusable. Every relative metric should sit beside the absolute requirement it is meant to approach.
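
As a minimal sketch of pre-registered gate conditions, assuming a higher-is-better metric: the `Gate` class, the threshold values, and the sample measurements below are hypothetical and only illustrate how a relative gain can sit beside an absolute requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    pass_at: float     # absolute requirement, fixed before the experiment runs
    stop_below: float  # floor below which the program stops or is redesigned

    def decide(self, measured: float) -> str:
        """Map a measured value to pass, hold, or stop."""
        if measured >= self.pass_at:
            return "pass"
        if measured < self.stop_below:
            return "stop"
        return "hold"


def relative_improvement(measured: float, baseline: float) -> float:
    """Ratio of the candidate to the baseline (useful for learning only)."""
    return measured / baseline


# Hypothetical numbers: the candidate doubles the baseline but still
# falls short of the absolute requirement.
gate = Gate(metric="conversion", pass_at=0.90, stop_below=0.30)
baseline, candidate = 0.20, 0.40
ratio = relative_improvement(candidate, baseline)
decision = gate.decide(candidate)
```

Reporting `ratio` and `decision` together keeps a twofold improvement from being read as a pass.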

Keep conversion, selectivity, yield, throughput, and energy per ton separate. Combining them too early into a single score can conceal the actual tradeoff. A team should be able to show why it is advancing a faster candidate with lower selectivity, or a more selective candidate with a different operating burden, without claiming that the candidates are equivalent.

Three common metric substitutions deserve direct scrutiny:

- Low reaction temperature is not automatically low total energy. Count the energy demands of the complete process rather than the enzyme reaction in isolation.
- Polymer conversion is not automatically usable monomer recovery. Measure whether the desired output can be recovered to the specification required downstream.
- Bench performance is not automatically scaled performance. Treat increasing process scale as a new evidence gate, not a routine deployment step.

My rule is simple: model output can earn laboratory time; only measured process evidence can earn scale capital.

Plan the roadmap backward from cost parity

The commercial benchmark is unforgiving. Enzymatic recycling ultimately has to compete with inexpensive oil-based plastic production. A greener reaction that cannot approach a viable delivered cost will remain dependent on special conditions rather than becoming a broadly adopted circular process.

Build the economic model while discovery is still underway. At minimum, separate these cost lines:

- Feedstock acquisition, sorting, and rejected material.
- Preparation required before the enzyme can act on the target polymer.
- Enzyme production, delivery, useful lifetime, and replacement.
- Reactor capacity, reaction time, process control, and energy.
- Monomer recovery and purification.
- Waste handling, downtime, and variability in plant utilization.

Do not wait for perfect values. Use ranges, label each input as measured or assumed, and run sensitivity analysis. The purpose is to identify which uncertain variable can kill the business case. If enzyme lifetime dominates cost, another candidate-generation run may be rational. If purification dominates, generating thousands of additional sequences may be a distraction from the real constraint.
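
One simple way to run that sensitivity analysis is a one-at-a-time sweep: hold every input at its base value, move one input across its range, and rank inputs by the swing in total cost. The sketch below assumes a purely additive cost model, and every name and number in it is a hypothetical placeholder, not data from any real process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInput:
    name: str
    base: float     # cost units per ton (hypothetical)
    low: float
    high: float
    measured: bool  # True if measured, False if still a design assumption


def total_cost(values: dict) -> float:
    """Additive cost model; a real model would likely be non-linear."""
    return sum(values.values())


def rank_by_swing(inputs: list) -> list:
    """Return (name, swing, measured) tuples, largest swing first."""
    base = {i.name: i.base for i in inputs}
    swings = []
    for i in inputs:
        low_total = total_cost({**base, i.name: i.low})
        high_total = total_cost({**base, i.name: i.high})
        swings.append((i.name, high_total - low_total, i.measured))
    return sorted(swings, key=lambda s: s[1], reverse=True)


inputs = [
    CostInput("feedstock", base=100, low=80, high=130, measured=True),
    CostInput("enzyme", base=150, low=100, high=220, measured=False),
    CostInput("purification", base=200, low=150, high=400, measured=False),
    CostInput("energy", base=60, low=50, high=75, measured=True),
]
ranked = rank_by_swing(inputs)
```

With these placeholder ranges, purification carries the largest swing and is still an assumed input, so the next experiment would target it instead of another round of sequence generation.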

Pair every scientific milestone with an industrial question:

- Discovery gate: Is activity and selectivity reproducible enough to justify process work?
- Process gate: Does the candidate perform inside the intended operating envelope rather than only under a convenient assay condition?
- Feedstock gate: Does performance survive representative material formats and mixtures, including difficult packaging such as clamshells?
- Demonstration gate: Can the system sustain the required material flow, output quality, and energy profile at a scale that tests the major engineering assumptions?
- Commercial gate: Does the cost case remain credible when feedstock composition, utilization, throughput, and other sensitive inputs move away from the preferred case?

A planned 5,000-ton demonstration plant in California illustrates why demonstration capacity belongs on the product roadmap. A plant is not simply a larger laboratory. It tests whether biology, equipment, controls, feedstock variability, and recovery operations behave as an integrated product.

Before committing meaningful scale capital, ask six kill questions:

- Which assumption has the largest effect on delivered cost per ton?
- Which inputs are measured, and which still come from a design estimate?
- At what physical scale was each important input measured?
- What fails first when the feedstock mix changes?
- If enzyme performance improves as planned, which downstream step becomes the bottleneck?
- Which observed result will stop, narrow, or materially redesign the program?

Expansion into additional plastics should follow the same discipline. Enzyme selectivity creates a plausible path toward enzyme blends for mixed streams, and new plastic types and mixed-plastic blends remain important development directions. Treat each added polymer as a new product vertical with its own input contract, assays, process interactions, recovery requirements, and economics. A new enzyme is not automatically a low-cost extension of the first process.

Key takeaways for your next roadmap review

- Define success as repeatable recovery of specified monomers, not the generation of novel enzyme sequences.
- Run discovery as a closed loop connecting product constraints, AI generation, computational gates, wet-lab measurements, and process feedback.
- Treat proprietary experimental context, including failures, as the data asset; candidate count alone is not a defensible moat.
- Use separate gates for design compliance, laboratory performance, operating fit, and scale economics.
- Work backward from cost parity and direct the next experiment toward the assumption that most threatens the integrated business case.

For your next review, ask the team to bring one page containing the input and output contracts, a diagram of the learning loop, the current stage-gate thresholds, the experimental data schema, and a cost sensitivity model with measured and assumed inputs clearly separated. Every roadmap item should change one of those artifacts or produce evidence for a named decision.

If the team cannot fill those fields yet, that is the immediate product work. The first defensible milestone is one traceable loop from a defined industrial problem through candidate generation, laboratory measurement, and an updated cost model. Repeat that loop with increasing realism before increasing capital exposure. That is how you determine whether programmable biology is becoming an industrial recycling product rather than remaining an impressive AI demonstration.

References

Product Talk — How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

May 14, 2026

No More Accidental Agents: How We Engineered Global Agent’s Helpful, Curious Personality

Most teams ship AI agent personalities by accident—emergent quirks, brittle prompts, and uneven behavior. We refused to let that happen. From day one, we treated personality as a first-class product surface, one that should be designed, instrumented, and iterated with the same rigor as any core capability.

Learn how we designed Global Agent’s personality and fine-tuned its inquisitiveness and helpfulness using Agent Analytics.

In my role leading product at HighLevel, Inc., I framed our approach around agentic AI and conversation design: personality is not “flavor text”; it is the control system for how an agent interprets context, asks questions, and decides when to act. Our product strategy prioritized clarity, empathy, and consistency—so the agent would be curious enough to resolve ambiguity without becoming interrogatory, and helpful enough to move work forward without overstepping.

We made that intent measurable. Using behavioral analytics, we defined operational signals such as clarification-question rate, resolution-path efficiency, and escalation quality. We combined eval-driven development with targeted A/B testing to compare prompt patterns and tool strategies, ensuring each change had a clear hypothesis and measurable outcome.

To calibrate inquisitiveness, we mapped decision points where the agent should ask follow-ups versus proceed autonomously. Prompt engineering codified those thresholds, while a retrieval-first pipeline reduced unnecessary questions by improving context completeness up front. When the agent did ask, we constrained tone and cadence to keep queries concise, respectful, and progress-oriented.

To enhance helpfulness, we prioritized precise action-taking and unambiguous guidance. Context window management preserved relevant facts without diluting intent, and guardrails aligned with AI risk management principles ensured the agent stayed within policy, privacy, and compliance boundaries. The result was an assistant that resolved more tasks end-to-end, with fewer stalls and clearer handoffs when human help was warranted.

Agent Analytics became our nervous system. We instrumented every dialog turn to attribute outcomes to design choices, then used driver trees to connect micro-behaviors to macro results like time-to-resolution and customer satisfaction. This closed-loop view let us ship confidently, knowing which levers improved helpfulness, which sharpened curiosity, and which merely added noise.

Process mattered as much as tooling. Product trios ran continuous discovery with customers to surface edge cases—ambiguous intents, multi-intent turns, and sensitive scenarios—while our engineering partners operationalized experiments with clean rollback paths. We favored small, testable changes over sweeping rewrites, building momentum and trust with each iteration.

The payoff is a personality that feels consistent across use cases: curious when clarity is missing, decisive when action is obvious, and transparent when limits are reached. Users experience fewer dead ends, faster resolutions, and a brand voice that shows up the same way every time—because it was defined, measured, and improved on purpose.

If you’re building agentic AI, don’t leave personality to chance. Treat it like a product: set clear outcomes, instrument deeply with Agent Analytics, and iterate with eval-driven development and A/B testing. That’s how curiosity becomes a feature, helpfulness becomes a habit, and your agent becomes reliably, intentionally excellent.

Inspired by this post on Amplitude – Best Practices.

May 13, 2026
I Pointed a “Ralph Wiggum” AI Loop at My Product for a Week—The Data That Stopped Chaos

I spent a week pointing a "Ralph Wiggum loop" at my product to see how far an agentic AI could take pragmatic, everyday improvements without human micromanagement. It was equal parts exhilarating and nerve-wracking. The short version: the loop moved fast and broke assumptions, but Amplitude analytics kept it from going off the rails—and turned chaos into controlled acceleration.

By "Ralph Wiggum loop," I mean a deliberately naive, endlessly curious cycle: try something small, ship it behind a flag, watch the data, then try again. It is the product equivalent of a fearless intern who experiments constantly. That energy is invaluable for discovery, but it absolutely demands strong guardrails and a clear definition of success.

Before I started, I framed the outcomes I cared about: user activation within the first session, reduction in time-to-value, and early retention indicators. I set baselines and a minimum detectable effect (MDE) for A/B testing so the loop could distinguish noise from signal. I also documented a driver tree of behaviors we wanted to influence and ensured every event was cleanly instrumented in Amplitude analytics to support reliable behavioral analytics.
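
To make the MDE concrete, here is a rough sketch of the standard two-proportion sample-size approximation. The baseline activation rate, the MDE, and the default significance and power values are hypothetical; the post does not state which numbers or which method were actually used.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Uses the normal approximation for comparing two proportions with a
    two-sided test at significance level alpha.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical: 30% baseline activation, detect a 3-point absolute lift.
n_coarse = sample_size_per_arm(baseline=0.30, mde_abs=0.03)
# Halving the MDE roughly quadruples the required sample.
n_fine = sample_size_per_arm(baseline=0.30, mde_abs=0.015)
```

The useful consequence for a fast loop is that small cohorts can only detect large effects, so micro-experiments shipped to a small slice of traffic need either a generous MDE or more time to accumulate users.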

The guardrails mattered most. I put every change behind feature flags with instant rollback. I defined "off the rails" conditions upfront, including regression thresholds for activation and retention analysis, and enabled anomaly detection to surface unexpected spikes or drops. Session replay was ready to diagnose confusion fast, and I kept a daily evaluation cadence so the loop never ran unattended for long.

Day by day, the loop proposed micro-experiments: onboarding copy variants, tooltip timing, in-app guide sequencing, and subtle changes to progressive disclosure. Each iteration shipped behind a flag to a small cohort. I watched leading indicators in real time, then zoomed out to cohort views to guard against short-term gains that might erode longer-term value. When something looked promising, we expanded exposure methodically; when something looked risky, we paused immediately.

We had a pivotal moment where the loop suggested a bolder call-to-action that spiked activation. On the surface, it looked like a win. Amplitude cohorts told a fuller story: downstream engagement softened, and anomaly detection flagged a pattern that hinted at premature conversion rather than genuine intent. A quick rollback through feature flags saved the week—and reminded me why eval-driven development should be the default for agentic AI workflows.

The most surprising part was how quickly the loop unlocked small compounding gains once the measurement scaffolding was in place. With a unified analytics platform and crisp guardrails, the system became a safe sandbox where the AI could explore aggressively while we stayed anchored to outcomes. The combination of behavioral analytics, A/B testing discipline, and daily human review turned raw speed into durable learning.

My takeaways are direct. Agentic AI can accelerate discovery, but only if you define stop conditions and wire strict feedback loops into your stack. Measurement is product strategy here—without it, you get noisy activity instead of progress. Invest in instrumentation first, treat feature flags as non-negotiable, and let anomaly detection and session replay be your early warning system. Most of all, tie every experiment to activation, engagement, or retention, not vanity metrics.

If you’re considering your own week with a "Ralph Wiggum loop," start painfully small, constrain the blast radius, and insist on decision-quality data. Do that, and you’ll turn a chaotic agent into a compounding engine for product discovery—one that moves fast, learns faster, and stays on track.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Vision to Execution: Building Agentic, Data‑Driven Products with Real‑World Rigor

When I consider where product development is headed, one statement captures the mandate perfectly: "Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next generation vision of agentic and data driven product development." That vision resonates deeply with how I lead teams: anchoring strategy in behavioral analytics while enabling agentic AI to act on insights with speed, safety, and measurable impact.

Translating that vision into execution starts with clarity of outcomes. I frame driver trees that connect customer value to leading indicators—activation, engagement depth, and retention—then instrument product telemetry with Amplitude analytics and behavioral analytics to surface the moments that matter. From there, we operationalize learning with A/B testing and feature flags, ensuring each hypothesis gets a fair, observable run and that we can safely ramp what works.

Agentic AI changes the operating model. Instead of static dashboards, we design autonomous workflows that observe signals, reason over context, and take action, grounded in a retrieval-first pipeline and governed by eval-driven development. For product managers, this demands working fluency with LLMs and practical prompt engineering, plus a rigorous AI strategy for data governance, privacy-by-design, and risk scoring so agents remain trustworthy under real-world conditions.

Cross-functional cadence is everything. I partner closely with Principal AI Engineers and product trios to blend continuous discovery with execution: rapid user interviews to reveal intent, opportunity solution trees to prioritize, and outcomes vs output OKRs to align incentives. The result is a system where insights are unified, decisions are explainable, and agents improve through tight feedback loops across analytics, experimentation, and production telemetry.

If you’re building toward an agentic, data-driven future, invest in a unified analytics platform, shorten the path from signal to action, and measure learning velocity as carefully as feature delivery. With the right foundations, agentic AI becomes more than a feature—it becomes a force multiplier for product strategy, customer value, and sustainable growth.

Inspired by this post on Amplitude – Perspectives.

May 13, 2026
From Prototype to Production: How I Built Reliable AI-Generated Opportunity Solution Trees

I just wrapped an all-out engineering sprint. That still sounds odd coming from me, because while I’ve written code on and off for years, I don’t self-identify as an engineer. I’m a product manager who used to be a designer. It’s been a long time since I wrote code for a living.

But AI has expanded what’s possible, both for our products and for us. It’s pushed me to do more than I imagined. In that spirit, I want to share a recent engineering story. It includes technical details, and a year ago I couldn’t have done any of it. I learned it with the help of AI, and my aim is to show what’s now within reach.

I’ve been building two services with a partner at Vistaly: AI-generated interview snapshots and AI-generated opportunity solution trees. We put out a call for alpha partners, received over 100 applicants, and selected eight design partners to start.

[Figure: A color-coded opportunity solution tree mapping a desired outcome to opportunities, solutions, and assumption tests.]

Each team uploaded three customer interviews. I identified the key moments and opportunities and then generated an opportunity solution tree from those snapshots. I provide the AI services; Vistaly is building the UI and workflows around them.

Early feedback was strong. Teams immediately asked to upload more interviews—exactly the kind of demand signal you hope to see—so we got to work making that possible.

[Figure: An AI-generated opportunity solution tree built from raw feedback, with linked cards for user needs such as onboarding, support offload, and bot-readiness signals.]

Updating an opportunity solution tree with new interview content is far harder than generating a new tree from scratch. I initially underestimated the complexity. Our goal wasn’t to produce a tree and declare it truth. We wanted teams to engage, correct, and collaborate with the AI—scaffolding cross-interview synthesis instead of doing it for them.

To support that, we needed a way to communicate precisely how a tree would change after new interviews were added. We took inspiration from git diff and set out to build the equivalent for opportunity solution trees—step-by-step change sets that explain each proposed modification.

[Figure: An AI-generated opportunity solution tree in which an outcome feeds opportunities that branch into sub-opportunities, with supporting evidence preserved.]

That decision was right, but the lift was larger than I expected. It wasn’t enough to generate an updated tree; I also had to provide a clear, ordered walkthrough of what changed and why.

I often see the same pattern with AI: it’s easy to get to an impressive prototype, but much harder to reach a production-grade product. That was exactly my experience here. My service actually comprised two sub-services: generating a new tree from scratch and updating an existing tree with new interviews. The first worked well in alpha; the second had to be built before anyone could add a fourth interview.

[Figure: An outcome with opportunities A and B, with C and D nested under B, beside a change set that lists each node added.]

On the surface, these services look similar. In reality, updates must preserve existing structure unless new evidence requires a change. You have to account for compound operations—merges, splits, deletes—while guaranteeing no data loss. Every node has source opportunities (supporting evidence from interviews) and children (tree sub-opportunities), and neither can be dropped.

In classic AI fashion, I got a reasonable version working in a few days and shipped it to our design partners. One team quickly hit our beta limits and asked to convert to a paid subscription so they could keep going. They showed a willingness to pay, converted, and started uploading aggressively.

[Figure: Parent opportunity A with branches x, y, and z is split into A and B, shifting evidence while preserving links.]

At the 14th, 15th, and 16th uploads, the cracks appeared. We saw odd behavior in some trees. The Vistaly team noticed that the change sets—the step-by-step instructions emitted by my service—didn’t always reconstruct the final tree my service also emitted. We needed those steps to match exactly, so teams could review and accept, modify, or reject each change with confidence.

They flagged the issue the day I was flying to New Orleans for Jazz Fest. In hindsight, I’m glad I didn’t grasp the scope of what awaited me. I had roughly 80% of the work still to do to make tree updates rock solid. At least I got to enjoy the music first.

[Figure: Opportunities B and C merged into one, removing duplicates and unifying context across five related opportunities.]

Back home, I started diagnosing. My service was a pipeline: several LLM-driven steps followed by deterministic code to compare trees and produce change sets. As I dug in, I realized that approach was flawed. Tree diffs, unlike linear document diffs, are ambiguous.

In a document, if I add a sentence, the diff shows an addition. If I delete a paragraph and rewrite it, the diff shows a removal and an addition. Simple. But trees are different. Suppose I split opportunity A into A and B, and later merge B with C. The split can disappear from the final diff.

[Figure: An opportunity solution tree mapping an outcome to opportunities A and C, with children x–z and t–v.]

When the model splits an opportunity, it must distribute A’s source opportunities and children between A and B. For instance, if A has source opportunities 1, 2, 3 and children x, y, z, after the split A might keep 1, 2, and x, while B takes 3, y, and z.

Now suppose the model merges B into C. If C originally had source opportunities 4 and 5 and children t, u, v, then after the merge C now has source opportunities 3, 4, 5 and children t, u, v, y, z. When you compare the original and final trees, it looks like A somehow donated some evidence and children directly to C. The split and merge that explain why are invisible to a naive diff.
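
The example above can be replayed in a few lines. This is a simplified sketch, not the service's actual data model: the `Node` type and the `split`, `merge`, and `naive_diff` helpers are hypothetical, and each node is reduced to a set of source opportunities and a set of children.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    sources: set = field(default_factory=set)
    children: set = field(default_factory=set)


def split(tree, node_id, new_id, move_sources, move_children):
    """Create new_id and move the given sources and children out of node_id."""
    node = tree[node_id]
    tree[new_id] = Node(set(move_sources), set(move_children))
    node.sources -= set(move_sources)
    node.children -= set(move_children)


def merge(tree, from_id, into_id):
    """Fold from_id into into_id, keeping all sources and children."""
    donor = tree.pop(from_id)
    tree[into_id].sources |= donor.sources
    tree[into_id].children |= donor.children


def naive_diff(before, after):
    """Compare only the first and last trees, node by node."""
    changes = {}
    for node_id in sorted(before.keys() | after.keys()):
        old = before.get(node_id, Node())
        new = after.get(node_id, Node())
        delta = {
            "sources_added": sorted(new.sources - old.sources),
            "sources_removed": sorted(old.sources - new.sources),
            "children_added": sorted(new.children - old.children),
            "children_removed": sorted(old.children - new.children),
        }
        if any(delta.values()):
            changes[node_id] = delta
    return changes


original = {
    "A": Node({1, 2, 3}, {"x", "y", "z"}),
    "C": Node({4, 5}, {"t", "u", "v"}),
}
final = copy.deepcopy(original)
split(final, "A", "B", move_sources={3}, move_children={"y", "z"})
merge(final, "B", "C")
diff = naive_diff(original, final)
```

The resulting `diff` mentions only A and C: A lost source 3 and children y and z, and C gained them. B never appears, because it exists in neither the first nor the last tree.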

[Figure: An opportunity solution tree in which one outcome flows to opportunities A and C, then to children x through v.]

That was the core insight: we didn’t just need to show what changed—we needed to show why it changed. I had to reconstruct each move step-by-step. That meant getting the model to show its work, which opened a new can of worms.

I refactored my prompts so the model produced both the final output and the exact change set it used to get there. The action language was explicit: add, delete, reframe, merge, split, and so on. Crucially, I asked the model to describe its moves in user-meaningful terms—“split A into A and B, then merge B into C”—not as opaque reassignments of sources and children.

[Figure: A tree built step by step (the outcome, then opportunities A and B, then C and D under B) alongside the paired change set.]

For each LLM step, the model now emitted its recommendation and the corresponding change set. This helped, but it wasn’t perfect. After extensive testing and error analysis, two classes of errors emerged: (1) the model attempted an invalid move, and (2) the change set didn’t actually generate the recommendation.

Category 1 felt like designing a game while the model played it creatively. For example, what happens when the model tries to merge a parent with a child? If opportunity A has children B, C, and D and the model merges A with B, the merge is directional. If the instruction is “keep A, delete B,” that works—the parent absorbs the child. But if the instruction is “keep B, delete A,” then C and D become orphans. These puzzles were solvable and even fun.
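
A directional merge rule like this one can be checked before the move is applied. The sketch below is one possible encoding, assuming the tree is stored as a child-to-parent map with no cycles; `is_ancestor` and `validate_merge` are hypothetical helpers, not the service's code.

```python
def is_ancestor(parent_of: dict, ancestor: str, node: str) -> bool:
    """Walk up from node and report whether ancestor is on the path."""
    current = parent_of.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)
    return False


def validate_merge(parent_of: dict, keep: str, delete: str):
    """Return (ok, message) for merging `delete` into `keep`."""
    if keep == delete:
        return False, "cannot merge a node with itself"
    if is_ancestor(parent_of, delete, keep):
        # Deleting an ancestor of the kept node would orphan its other children.
        return False, f"deleting {delete!r} would orphan its other children"
    return True, "ok"


# A is the parent of B, C, and D, as in the example above.
parent_of = {"A": None, "B": "A", "C": "A", "D": "A"}
allowed = validate_merge(parent_of, keep="A", delete="B")
rejected = validate_merge(parent_of, keep="B", delete="A")
```

Rejecting the move outright is only one design choice. An alternative would be to re-parent the stranded children, but that requires deciding where they belong, which is a guess about intent.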

Visual explainer from Product Talk on AI-generated Opportunity Solution Trees. It contrasts an allowed merge (B into A) with a not-allowed merge (A into B) that leaves child opportunities orphaned, guiding safe hierarchy edits.
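The parent-child case above can be expressed as a small check. This Python sketch assumes the rule as described in the text (a parent may absorb a child, but not the reverse); the function name and the tree representation are hypothetical.

```python
# Parent -> children mapping for the example in the text.
children = {"A": ["B", "C", "D"], "B": [], "C": [], "D": []}

def orphans_after_merge(children, keep, delete):
    """Return nodes left without a parent if `delete` is merged into `keep`.

    Assumed rule: merging a child into its parent is fine, but deleting a
    parent in favor of one of its children strands the remaining children.
    """
    if keep in children.get(delete, []):
        return [child for child in children[delete] if child != keep]
    return []

print(orphans_after_merge(children, keep="A", delete="B"))  # []
print(orphans_after_merge(children, keep="B", delete="A"))  # ['C', 'D']
```

A non-empty result is a reason to reject the move before it touches the tree.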

Category 2 was harder. Despite prompt iterations, I could only push the discrepancy rate down to about 1 in 40 instances. With 10–20 LLM calls per run, and assuming failures are independent, that meant roughly 20–40% of runs still contained at least one failure. Not acceptable for production. I hit a wall. A paying customer was waiting, and more design partners were queued up.
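A quick way to see how a small per-call error rate compounds across a run is the Python sketch below. It assumes every call fails independently at the same rate, which is a simplification; the function name is illustrative.

```python
def run_failure_rate(per_call_rate: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1 - (1 - per_call_rate) ** calls

for calls in (10, 20):
    print(calls, round(run_failure_rate(1 / 40, calls), 2))
# 10 0.22
# 20 0.4
```

Even a 1-in-40 discrepancy rate leaves a large share of multi-call runs with at least one bad step.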

Next, I tried to correct the model’s mistakes with deterministic code. I had promised that my change sets would generate the output tree, so I wrote verifiers: detect conflicts (e.g., delete a node, then try to use it later), guard against data loss, prevent orphaned nodes, and more. Detection was straightforward; correction was not. Fixing issues required guessing the model’s intent. If the sequence said “delete A, then merge A with B,” should I remove A entirely or salvage A’s sources and children by merging into B? There were dozens of such cases with no unambiguous answer.
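As an illustration of the detection side, here is a minimal Python sketch of one such check: flagging any step that references a node an earlier step deleted. The step format (`action`, `nodes`) is an assumption, and the sketch only detects the conflict. It makes no attempt to guess the intended fix.

```python
def find_conflicts(change_set):
    """Flag any step that references a node deleted by an earlier step."""
    deleted = {}  # node -> index of the step that deleted it
    errors = []
    for i, step in enumerate(change_set, start=1):
        for node in step["nodes"]:
            if node in deleted:
                errors.append(
                    f"step {i}: {step['action']} uses {node}, "
                    f"deleted in step {deleted[node]}"
                )
        if step["action"] == "delete":
            for node in step["nodes"]:
                deleted.setdefault(node, i)
    return errors

conflicts = find_conflicts([
    {"action": "delete", "nodes": ["A"]},
    {"action": "merge", "nodes": ["A", "B"]},
])
print(conflicts)  # ['step 2: merge uses A, deleted in step 1']
```

The error message names the step and the node. Messages this specific are easier to act on than a bare pass/fail.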

A step-by-step loop shows how changes are validated: generate a change set, run a validation tool, review the result, then repeat on failure and exit on pass—mirroring iterative work behind AI-built Opportunity Solution Trees.

After 11 straight days of deep work—including weekends—I was exhausted. I dislike hustle culture; this isn’t how I design my life. But I was stuck, and then I had an insight.

On a walk with my husband (also an engineer), I realized I could have the LLM repair its own mistakes. My data contract with Vistaly requires that the change set generate the output tree. I had already built robust validation code. I knew exactly when a change set failed, and why. Prompt tuning alone was not fixing it. So I turned the validator into a tool for the model and created a simple agentic loop.

The loop works like this: the model proposes a change set, calls the validation tool, and gets back a pass/fail plus specific feedback. If it fails, the model uses that feedback to repair the change set and calls the tool again. The loop repeats until the change set passes or a maximum number of turns is reached.
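The shape of that loop can be sketched in a few lines of Python. This is a simplified illustration, not the shipped implementation: the harness calls the validator directly instead of exposing it through a tool-calling API, and `fake_model`, `validate`, and `KNOWN_NODES` are stand-ins invented for the example.

```python
KNOWN_NODES = {"A", "B", "C"}

def validate(change_set):
    """Stand-in validator: pass/fail plus specific feedback."""
    errors = [
        f"step {i}: unknown node {step['node']}"
        for i, step in enumerate(change_set, start=1)
        if step["node"] not in KNOWN_NODES
    ]
    return (not errors, errors)

def fake_model(feedback):
    """Stand-in for an LLM call: wrong at first, fixed once it sees feedback."""
    if not feedback:
        return [{"action": "merge", "node": "Q", "into": "C"}]
    return [{"action": "merge", "node": "B", "into": "C"}]

def repair_loop(propose, validate, max_turns=5):
    """Propose, validate, and repair until success or the turn limit."""
    feedback = []
    for turn in range(1, max_turns + 1):
        change_set = propose(feedback)
        ok, feedback = validate(change_set)
        if ok:
            return change_set, turn
    raise RuntimeError(f"no valid change set after {max_turns} turns: {feedback}")

result, turns = repair_loop(fake_model, validate)
print(turns, result)
```

With these stand-ins the first proposal fails and the second passes, so the loop returns on turn 2. The turn limit bounds the cost when the model never converges.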

I prototyped in Node.js with a single model call, a verifier pass, and a repair attempt. At first, the loop didn’t converge—it just accumulated compute. I experimented with how to communicate errors, how much context to include, and how to sequence feedback. Eventually, it clicked: the model began fixing its own mistakes and typically returned a valid change set in one or two repairs. It was, in practice, eval-driven development applied to LLM outputs.

I had already built an agent loop utility for another AI workflow, so I productionized quickly: model call, optional tool invocation, tool result returned to the model, repeat until the validator signals success or the loop reaches its turn limit. I integrated the new loop into the pipeline and shipped the revamped service to Vistaly on Monday at noon. They’re integrating now, and it will be in the hands of our design partners shortly. I’m relieved, and ready for a day off.

Reflecting on the last two weeks, a few things stand out. First, I shed limiting beliefs about being an engineer. To make this reliable, I had to solve legitimately hard problems, and that feels good.

Second, this was genuinely fun. Designing the action set and watching the model push those boundaries was like working through elegant puzzles. Models are incredibly creative, and harnessing that creativity with the right constraints is deeply satisfying.

Third, I learned when I can and can’t trust Claude to write code for me. After Opus 4.6 came out, I gave Claude a much longer leash. After the past two weeks, Claude is back on a short leash. I found a lot of gaps in my implementation in areas where I had simply trusted that Claude got it right, when in fact it hadn’t. If you don’t have the right infrastructure (planning, testing, code review), this can be disastrous. I’ll be investing more here and sharing what I learn.

Finally, if this work had been spread over two months, it would have been thoroughly enjoyable. I’m discovering how much I like being an AI engineer. It feels like a new chapter where I can combine opportunity solution trees with modern AI engineering—and deliver real value to product teams doing continuous discovery.

I’m excited to share more of what we’re building with Vistaly and to onboard more design partners soon. If you’re interested, get on the waiting list. And if you’ve been hesitant to stretch beyond your current skill set, I hope this story nudges you to take the first small step toward what’s just now possible.

Inspired by this post on Product Talk.

May 13, 2026