Tag: data governance

Connecting Product Analytics, Attribution, and Growth Decisions
Connected product analytics is not simply a larger collection of events, dashboards, and campaign reports. Its practical value comes from preserving the context behind customer behavior, applying consistent definitions, and carrying trustworthy insights into the systems where teams make decisions.

The four source articles describe complementary parts of that operating model: journey-aware attribution, governed product data, AI-assisted analysis across tools, and continuous measurement. Combined, they offer a framework for turning scattered signals into more defensible growth decisions.

Key takeaways
- Attribution becomes more informative when relevant campaign, session, and product context remains connected to later outcomes.
- Persisted context can reveal associations across a journey, but it does not by itself prove that a touchpoint caused a conversion.
- Naming standards, ownership, metadata, and shared customer definitions determine whether connected analytics can be trusted.
- AI agents and connectors can reduce the effort required to investigate and communicate insights, provided permissions and analytical boundaries are explicit.
- Growth improves through a repeatable learning loop that connects observed behavior to a decision, an intervention, and subsequent measurement.

Attribution improves when journey context survives the final click

The source on persisted properties challenges the idea that the last recorded interaction adequately explains a conversion. It reports that customer decisions may be shaped by activity distributed across sessions, channels, campaigns, and product experiences. In its examples, an e-commerce purchase may follow product discovery, promotions, and cart activity; a financial-services outcome may depend on education, trust-building, eligibility checks, and compliance-sensitive steps; and a B2B lead may emerge after product tours, comparison pages, demos, onboarding interactions, stakeholder reviews, and CRM touchpoints.

Persisted properties address part of this measurement problem by retaining meaningful context as a user continues through a journey. This gives analysts more than the attributes attached to the final event and supports questions such as which acquisition context is associated with later activation, which discovery experience precedes stronger conversion, or which onboarding path appears among retained users.
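
A minimal sketch of the idea, assuming a simple in-memory event stream. The property names (`utm_campaign`, `utm_source`, `onboarding_path`) and the `JourneyContext` helper are hypothetical and do not reflect any particular platform's persisted-property API.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical property names that should persist across a user's later events.
PERSISTED_KEYS = {"utm_campaign", "utm_source", "onboarding_path"}


@dataclass
class JourneyContext:
    """Holds the most recent value seen for each persisted property, per user."""
    by_user: dict[str, dict[str, Any]] = field(default_factory=dict)

    def enrich(self, event: dict[str, Any]) -> dict[str, Any]:
        """Return a copy of the event with remembered context attached."""
        context = self.by_user.setdefault(event["user_id"], {})
        props = event.get("properties", {})
        # Remember any persisted property carried by this event.
        for key in PERSISTED_KEYS.intersection(props):
            context[key] = props[key]
        # Properties set on the event itself win over remembered context.
        return {**event, "properties": {**context, **props}}


# Invented sample journey for illustration only.
events = [
    {"user_id": "u1", "name": "landing_viewed",
     "properties": {"utm_campaign": "spring_promo", "utm_source": "email"}},
    {"user_id": "u1", "name": "cart_updated", "properties": {"items": 2}},
    {"user_id": "u1", "name": "purchase_completed", "properties": {"total": 40}},
]

journey = JourneyContext()
enriched = [journey.enrich(e) for e in events]
print(enriched[-1]["properties"])
```

The purchase event now carries the campaign and source seen earlier in the journey, so an analyst can group later outcomes by acquisition context without relying only on the final event's attributes.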

That richer context should not be confused with automatic causal proof. Attribution assigns or interprets credit according to available data and a chosen analytical approach. A recurring touchpoint may be a useful signal, a proxy for user intent, or an actual contributor to an outcome. Connected journey data makes those possibilities easier to investigate, while controlled experiments and other appropriate evaluation methods remain necessary when a team needs to establish whether changing a touchpoint changes the result.

The practical shift is therefore from asking which interaction deserves all the credit to asking which sequence of interactions warrants attention. That framing is more useful for product roadmaps, campaign investment, onboarding design, and retention analysis because it treats conversion as the outcome of a journey rather than an isolated click.
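
To illustrate the difference between the two questions, the sketch below compares last-touch credit with a simple count of how often each touchpoint appears in converting journeys. The journeys are invented sample data, and the presence count is a descriptive association, not a causal estimate.

```python
from collections import Counter

# Hypothetical journeys: ordered touchpoints plus whether the user converted.
journeys = [
    (["ad_click", "product_tour", "pricing_page"], True),
    (["email", "product_tour", "demo"], True),
    (["ad_click", "pricing_page"], False),
    (["email", "demo"], True),
]


def last_touch_credit(journeys):
    """Give all credit for each conversion to the final touchpoint."""
    return Counter(path[-1] for path, converted in journeys if converted and path)


def presence_in_conversions(journeys):
    """Count converting journeys in which each touchpoint appears at least once."""
    counts = Counter()
    for path, converted in journeys:
        if converted:
            counts.update(set(path))
    return counts


last_touch = last_touch_credit(journeys)
presence = presence_in_conversions(journeys)
print("last touch:", dict(last_touch))
print("presence:  ", dict(presence))
```

In this sample, `product_tour` never receives last-touch credit even though it appears in two of the three converting journeys, which is the kind of pattern a sequence-oriented view surfaces for further testing.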

Data governance supplies the shared meaning behind every signal

More connected data creates more analytical value only when teams agree on what the data represents. The Pendo administration source emphasizes naming conventions, ownership rules, and review cycles for pages, features, segments, guides, and reports. It also describes visitor, account, and product metadata as a strategic asset that should reflect concepts such as onboarding stage, plan type, activation, customer-success motion, and retention.

The marketing analytics source approaches the same requirement from an organizational angle. It argues that analytics works best as a shared language across product, marketing, sales, and customer success. Instead of allowing each function to interpret campaign and product signals independently, teams can align around customer journeys, funnel behavior, and the points at which users find value or leave.

Together, these sources show that the semantic layer is as important as the technical connection. A campaign label, user segment, account tier, activation event, and retention definition must remain intelligible when they move between an analytics platform, a CRM integration, a product report, or an AI-assisted workflow. Otherwise, a connected system can distribute ambiguity more efficiently without improving judgment.

Governance also affects interventions, not just reports. The Pendo source recommends contextual and concise in-app guides, product tours, and tooltips tied to measurable outcomes. This connects the measurement layer to the product experience: the same governed definitions used to identify friction should inform who receives guidance, what behavior the guidance is intended to change, and how the result will be evaluated.

AI connectors reduce workflow friction but do not repair weak analytics

The agent-connectors source extends connected analytics beyond dashboards. It describes an agent working across tools already used by product, analytics, and go-to-market teams, allowing context, analysis, and action to be brought into a more unified interaction. Its central benefit is operational: people can spend less effort moving information between tabs and systems while maintaining the flow of an investigation.

The marketing source similarly presents AI as most useful when paired with behavioral analytics, customer context, disciplined measurement, positioning, and a clear go-to-market strategy. In that account, AI workflows improve the scale and speed of judgment; they do not create durable growth independently of a sound measurement practice.

This distinction matters because an agent can make an answer easier to obtain without making its underlying evidence more reliable. If event definitions conflict, metadata is incomplete, or attribution assumptions are hidden, a connected agent may produce a fluent response to the wrong question. The connector source therefore places importance on permissions, appropriate context, governance, and boundaries alongside prompt design.

A well-designed workflow should preserve the path from a business question to the supporting behavioral evidence. It should also make clear which system supplied the context, which segment or journey definition was used, and whether the result is a descriptive association, an attributed outcome, or evidence from a stronger evaluation. That transparency helps an agent accelerate analysis without becoming an unexamined source of truth.
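
One way to keep that path visible is to attach provenance to every finding an agent returns. The sketch below is an assumption about how such a record could look; the class names, fields, and example values are hypothetical and are not taken from any connector's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceType(Enum):
    """The three strengths of evidence distinguished in the text above."""
    DESCRIPTIVE_ASSOCIATION = "descriptive association"
    ATTRIBUTED_OUTCOME = "attributed outcome"
    CONTROLLED_EVALUATION = "controlled evaluation"


@dataclass(frozen=True)
class InsightRecord:
    question: str            # the business question that was asked
    source_system: str       # which system supplied the context
    segment_definition: str  # which segment or journey definition was used
    evidence_type: EvidenceType
    finding: str

    def summary(self) -> str:
        """Render the finding together with its provenance."""
        return (f"{self.finding} [{self.evidence_type.value}; "
                f"source: {self.source_system}; segment: {self.segment_definition}]")


# Invented example values, for illustration only.
record = InsightRecord(
    question="Which onboarding path appears among retained users?",
    source_system="product analytics",
    segment_definition="accounts active 30 days after signup",
    evidence_type=EvidenceType.DESCRIPTIVE_ASSOCIATION,
    finding="Guided setup is more common among retained accounts",
)
summary = record.summary()
print(summary)
```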

A connected growth loop joins evidence, intervention, and learning

The sources converge on a continuous operating loop even though each enters it at a different point. Persisted properties preserve the journey context needed to form a better question. Governance and metadata make the relevant users, accounts, features, and outcomes consistently identifiable. Behavioral analytics helps teams locate meaningful movement or friction. Product guidance, campaigns, positioning changes, and go-to-market decisions then become interventions whose effects can be measured.

The Pendo source makes this learning loop explicit by recommending that initiatives record the expected behavior, the observed result, the change in the customer journey, and the team’s next response. The marketing source adds that product, marketing, sales, and customer success should use those findings collectively. The agent-connectors source supplies a potential interface for carrying the analysis across their tools, while the attribution source supplies the longitudinal context needed to avoid judging the intervention solely by the final interaction.

This model also clarifies what a useful growth insight looks like. It is not merely a rising metric or a generated explanation. It connects a defined audience and journey to an observable outcome, states the limits of the attribution, identifies a decision the organization can make, and establishes what should be measured afterward. That standard directs attention toward learning and resource allocation rather than dashboard activity.

The next stage of connected analytics will depend less on adding isolated reports and more on maintaining reliable context as questions move across teams and tools. Organizations that preserve that context, govern its meaning, and test the decisions made from it will be better positioned to turn analytics and AI into a durable growth capability.

July 3, 2026

Migrate Analytics Platforms Without Chaos: 7 Proven Lessons to Plan, Move, and Land Cleanly

I’ve led and rescued more analytics migrations than I can count, and I know the pressure: every event, dashboard, and decision pipeline depends on getting it right. Migrating analytics platforms doesn't have to be painful. Get seven lessons from Human37 and Amplitude to help your team plan, migrate, and land cleanly.

Here’s how I approach this work so teams keep momentum, regain trust in their numbers, and accelerate product-led growth on a unified analytics platform—without the rework and stakeholder fatigue that typically follow.

Lesson 1 — Start with outcomes, not events. Before moving a single event, I align leaders on the questions we must answer and the decisions we must speed up: activation, retention, and expansion. I map those goals to a simple driver tree, then back into the behavioral analytics we need. This trims noise, tightens scope, and ensures Amplitude analytics (or any destination) is instrumented for decisions, not vanity metrics.

Lesson 2 — Audit and map your data with rigor. I inventory current events, properties, IDs, and sources, then define a target schema with clear naming conventions, ownership, and versioning. Data governance and privacy-by-design are non-negotiable: we separate PII, document consent paths, and remove legacy debris. This step prevents schema drift and makes platform scalability sustainable.
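
As a rough sketch of the audit in Lesson 2, the snippet below maps legacy event names onto a target schema and flags anything unmapped or badly named. The snake_case convention, the event names, and the `audit` helper are all assumptions made for illustration.

```python
import re

# Assumed convention for this sketch: snake_case names with at least two words.
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z]+)+$")

# Hypothetical mapping from legacy event names to the target schema.
LEGACY_TO_TARGET = {
    "SignUp Complete": "signup_completed",
    "clickedUpgrade": "upgrade_clicked",
    "Report-Export": "report_exported",
}


def audit(legacy_events, mapping):
    """Split legacy events into mapped and unmapped, and flag bad target names."""
    mapped = {e: mapping[e] for e in legacy_events if e in mapping}
    unmapped = sorted(e for e in legacy_events if e not in mapping)
    bad_names = sorted(t for t in set(mapping.values()) if not NAME_PATTERN.match(t))
    return mapped, unmapped, bad_names


mapped, unmapped, bad_names = audit(
    ["SignUp Complete", "clickedUpgrade", "old_debug_ping"], LEGACY_TO_TARGET
)
print("unmapped legacy events:", unmapped)
print("target names violating the convention:", bad_names)
```

Unmapped events are the candidates for the "legacy debris" decision: either give them an owner and a target name, or retire them before the move.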

Lesson 3 — De-risk the cutover with a phased plan. Rather than a big-bang switch, I dual-run critical flows, compare telemetry, and use feature flags to roll forward (and back) safely. Observability and anomaly detection are my guardrails: I monitor volume, cardinality, and event timeliness to spot regressions early—long before executives notice broken charts.
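
A dual-run comparison can be as simple as checking daily event counts from both platforms against a tolerance. The sketch below assumes counts have already been exported into dictionaries; the 5% threshold and the sample numbers are illustrative, not recommendations.

```python
def compare_volumes(legacy, target, tolerance=0.05):
    """Return events whose relative volume difference exceeds the tolerance."""
    drift = {}
    for name in sorted(set(legacy) | set(target)):
        old, new = legacy.get(name, 0), target.get(name, 0)
        baseline = max(old, new)
        diff = abs(old - new) / baseline if baseline else 0.0
        if diff > tolerance:
            drift[name] = round(diff, 3)
    return drift


# Invented counts for one day of dual-running.
legacy_counts = {"signup_completed": 1000, "upgrade_clicked": 200, "report_exported": 50}
target_counts = {"signup_completed": 990, "upgrade_clicked": 150}

drift = compare_volumes(legacy_counts, target_counts)
print(drift)
```

Here `signup_completed` is within tolerance, `upgrade_clicked` is off by a quarter, and `report_exported` is missing from the new platform entirely, so the last two would block the cutover until explained.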

Lesson 4 — Treat instrumentation like product code. I wire schema checks into CI/CD, enforce typed analytics wrappers, and validate payloads pre-merge. With docs-as-code, the tracking plan stays current and reviewable. This keeps quality high at scale and avoids the slow death of broken funnels caused by well-meaning quick fixes.
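
A minimal sketch of a pre-merge payload check, assuming the tracking plan is available as code. `EventSpec`, `TRACKING_PLAN`, and the event shown are hypothetical; a real setup would likely load the plan from the reviewed docs-as-code source and run this in CI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSpec:
    name: str
    required: dict[str, type]  # property name -> expected Python type


# Hypothetical tracking plan with a single event.
TRACKING_PLAN = {
    "signup_completed": EventSpec("signup_completed", {"plan": str, "seats": int}),
}


def validate(name, properties):
    """Return a list of problems; an empty list means the payload conforms."""
    spec = TRACKING_PLAN.get(name)
    if spec is None:
        return [f"unknown event: {name}"]
    problems = []
    for prop, expected in spec.required.items():
        if prop not in properties:
            problems.append(f"missing property: {prop}")
        elif not isinstance(properties[prop], expected):
            problems.append(f"wrong type for {prop}: expected {expected.__name__}")
    return problems


print(validate("signup_completed", {"plan": "pro", "seats": "3"}))
```

A CI step could fail the build whenever `validate` returns a non-empty list for any payload in the test fixtures.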

Lesson 5 — Enable the people, not just the platform. Tools don’t create insight—teams do. I run hands-on enablement with product tours and in-app guides tailored to each role, establish communities of practice, and publish short playbooks for common questions (activation analysis, cohort retention, and journey mapping). When customer success and growth marketers can self-serve, adoption sticks.

Lesson 6 — Land cleanly with fast, visible wins. Within the first two weeks post-cutover, I showcase analyses that matter: retention analysis by use-case, friction points via session replay and heatmaps, and conversion lift by segment. These quick proofs build confidence, reinforce the value proposition, and keep stakeholders engaged through the longer tail of hardening.

Lesson 7 — Govern and evolve continuously. After go-live, I schedule schema reviews, backlog grooming, and QBRs to prune events and refine definitions. Ownership is explicit, and changes flow through the same review process as code. This keeps the unified analytics platform trustworthy as the product (and org) changes.

I’ve seen this playbook turn skepticism into momentum. In one migration I inherited mid-flight, we refocused on decisions, tightened governance, and phased the rollout; the team moved from fire drills to confident launches—and stakeholders finally believed the numbers again.

If your team is staring down a migration, anchor on outcomes, automate quality, and invest in enablement. With disciplined execution and the lessons I’ve applied alongside partners like Human37 and platforms like Amplitude, you can move fast, reduce risk, and land cleanly, without the chaos.

Inspired by this post on Amplitude – Perspectives.

June 22, 2026

Secure System Access for AI Agents: A Phased Control Model

An AI agent becomes operationally valuable when it can move beyond explaining a process and complete the underlying work. That same transition gives the agent access to sensitive data and consequential actions, so integration must be designed as both a product capability and a security boundary.

The practical objective is not maximum access. It is the smallest dependable set of permissions that lets an agent resolve a well-defined workflow, supported by deterministic controls, observable outcomes, and a clear path to human intervention.

System access changes both the value and the risk

Without backend access, an agent can describe how to update an account, check a renewal, or report a damaged order. With access to a CRM, billing platform, or order-management system, it can potentially retrieve the relevant record and complete the request during the conversation. The Intercom article presents this shift from answering to acting as a central difference between basic AI adoption and mature deployment.

The article cites Intercom’s 2026 Customer Service Transformation Report, reporting improved metrics among 87% of teams with mature AI deployments, compared with 62% overall. It also reports that 82% of senior leaders said their teams had invested in AI during the preceding year, while only 10% said they had reached mature deployment. These source-reported figures suggest an integration gap, but they do not independently establish that system access caused the reported improvements or that an integration is secure.

Security therefore cannot be added after the workflow succeeds. A customer-facing interface may remove the need to visit a separate application, but it must not remove identity and authorization checks. The agent still needs a trustworthy way to associate the request with the correct customer, determine what that customer is permitted to do, and constrain the backend operation accordingly.

Choose workflows where access justifies its complexity

Not every automated conversation benefits equally from deeper integration. Intercom reports the results of rebuilding four fixed, scripted Tasks as Procedures with system access. Over the 12 months through May 2026, the reported resolution rate for its bounce-list workflow rose from 9.3% to 79.9%, while bug reporting increased from 9.2% to 66.5%. Email forwarding moved from 44.9% to 66.5%, but Messenger installation rose only from 67% to 69.2%.

The variation is more instructive than the headline gains. According to the article, the bounce-list process required multi-step reasoning, dynamic branches, and error recovery. Bug reporting still ended in a human handoff, but the procedure improved that handoff by pre-triaging the issue, surfacing possible GitHub matches, extracting relevant URLs, and requesting impersonation access. Messenger installation was already a comparatively linear process, leaving less room for improvement.

A suitable first integration is therefore not merely a popular support topic. It should be high-volume and repeatable, have an identifiable system owner, and depend on live data or actions that materially change the outcome. Existing APIs improve feasibility, but the security review should also consider data sensitivity, reversibility, authorization complexity, and the consequences of acting on an ambiguous request.

Use an access ladder instead of a single launch

The phased approach described by Intercom can also serve as a security model. Each stage expands capability only after the workflow and its controls have produced enough evidence to justify the next step.

Stage	Agent capability	Appropriate use	Control emphasis
No integration	Guide, troubleshoot, check policy, triage, and route	Discover where explanations repeatedly lead to manual work	Evaluate answer quality, routing accuracy, and escalation behavior
Read-only access	Retrieve approved fields such as order or subscription status	Resolve information requests without changing a record	Restrict endpoints, records, and fields; verify customer authorization
Write access	Update records or initiate actions such as cancellations or refunds	Complete bounded workflows after earlier stages are dependable	Validate inputs, limit action scope, record outcomes, and require approval where consequences warrant it
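
The ladder can be expressed as an ordered set of stages, with each operation declaring the minimum stage it needs. This is a sketch under assumptions: the operation names are hypothetical, and a real deployment would enforce the check in the integration layer rather than in agent instructions.

```python
from enum import IntEnum


class AccessStage(IntEnum):
    """Stages of the access ladder, ordered from least to most capability."""
    NO_INTEGRATION = 0
    READ_ONLY = 1
    WRITE = 2


# Hypothetical operations and the minimum stage each one requires.
REQUIRED_STAGE = {
    "answer_policy_question": AccessStage.NO_INTEGRATION,
    "get_order_status": AccessStage.READ_ONLY,
    "issue_refund": AccessStage.WRITE,
}


def is_permitted(operation, granted):
    """Allow an operation only if the granted stage meets its requirement."""
    required = REQUIRED_STAGE.get(operation)
    # Operations that are not listed are denied by default.
    return required is not None and granted >= required


print(is_permitted("get_order_status", AccessStage.READ_ONLY))
print(is_permitted("issue_refund", AccessStage.READ_ONLY))
```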

Mock responses can test branching logic before an API is ready, as the Intercom article notes. It also proposes a temporary human-in-the-loop step when an integration is still several engineering sprints away. These methods can validate the workflow and expose missing requirements, but simulated success should not be treated as proof that production identity, authorization, failure recovery, and audit controls are ready.

Put deterministic controls around probabilistic decisions

Plain-language workflow instructions can guide an agent, but security-critical constraints should not depend solely on the model interpreting those instructions correctly. A safer architecture places enforceable controls between the agent and each backend system.

Control	Practical design implication
Dedicated identity	Give the agent its own service identity rather than borrowing a staff account, so permissions and activity remain attributable.
Least privilege	Allow only the endpoints, operations, records, and fields required by the selected workflow.
Read and write separation	Keep retrieval permissions distinct from mutation permissions and grant write access only when the use case requires it.
Independent policy enforcement	Validate identity, authorization, limits, and required inputs outside the model before executing an operation.
Bounded actions	Prefer narrow, purpose-built operations over unrestricted database or administrative access.
Human approval and escalation	Route ambiguous, exceptional, sensitive, or difficult-to-reverse cases to an authorized person.
Auditability and monitoring	Record the request, decision, tool call, result, and escalation so failures and unusual patterns can be investigated.
Safe failure behavior	Prevent retries, timeouts, or partial completion from producing duplicated or inconsistent changes.
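
As an illustration of independent policy enforcement, the sketch below evaluates a refund request with plain deterministic checks that run outside the model. The `RefundRequest` fields and the 50.0 auto-approval limit are assumptions chosen for the example, not values from the Intercom article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    customer_id: str     # identity verified by the surrounding system, not the model
    order_owner_id: str  # owner of the order record the agent selected
    amount: float


AUTO_APPROVE_LIMIT = 50.0  # hypothetical threshold for unattended refunds


def decide(request):
    """Return 'deny', 'escalate', or 'allow' using only deterministic rules."""
    if request.customer_id != request.order_owner_id:
        return "deny"      # the customer does not own this order
    if request.amount <= 0:
        return "deny"      # invalid input
    if request.amount > AUTO_APPROVE_LIMIT:
        return "escalate"  # consequential enough to need a person
    return "allow"


print(decide(RefundRequest("c1", "c1", 20.0)))
print(decide(RefundRequest("c1", "c1", 500.0)))
print(decide(RefundRequest("c1", "c2", 20.0)))
```

Because the gate does not depend on how the agent phrased or interpreted the request, a misread instruction can at worst produce a denied or escalated call.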

The integration request should document the workflow in plain language, identify every read and write point, name the system owner, and specify the minimum required fields. It should also define how success and harm will be measured: not only whether the agent completed the conversation, but whether it selected the correct record, performed the authorized action once, protected restricted data, and escalated when it lacked sufficient confidence or permission.
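
"Performed the authorized action once" is commonly approached with an idempotency key, so that a retry replays the stored result instead of repeating the change. The sketch below keeps results in memory for brevity; the `ActionExecutor` class is hypothetical, and a production version would need durable storage and handling for actions that fail partway.

```python
class ActionExecutor:
    """Runs an action at most once per idempotency key and logs every attempt."""

    def __init__(self, perform):
        self._perform = perform
        self._results = {}
        self.audit_log = []

    def execute(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # A retry or duplicate: return the stored result without acting again.
            self.audit_log.append(("replayed", idempotency_key))
            return self._results[idempotency_key]
        result = self._perform(payload)
        self._results[idempotency_key] = result
        self.audit_log.append(("executed", idempotency_key))
        return result


calls = []


def refund(payload):
    """Stand-in for a real backend call; records each invocation."""
    calls.append(payload)
    return f"refunded {payload['order_id']}"


executor = ActionExecutor(refund)
first = executor.execute("req-123", {"order_id": "A1"})
retry = executor.execute("req-123", {"order_id": "A1"})
print(first, retry, len(calls))
```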

This framing also improves the business case. Engineering is being asked to expose a narrowly scoped capability with explicit boundaries, rather than to provide broad access to a general-purpose agent. Leadership can then compare measurable workflow value with implementation effort and residual risk.

Key takeaways

- System access creates value when it lets an agent complete work, but it simultaneously expands the security boundary.
- The best initial workflow is frequent, bounded, operationally meaningful, and owned by a team that can approve its data and actions.
- Progress from no integration to read-only retrieval and then to narrowly scoped write operations; do not treat access as an all-or-nothing decision.
- Enforce identity, authorization, field restrictions, action limits, and audit logging outside the model’s natural-language instructions.
- Evaluate correctness, unauthorized-action risk, failure recovery, and handoff quality alongside resolution rate.

The strongest long-term pattern is a portfolio of small, governed capabilities rather than one broadly privileged agent. Each successful workflow can supply the evidence needed to extend access deliberately, while keeping the consequences of error visible and contained.

References

Intercom — Win Executive Buy-In for AI Agent System Access: Unlock Actions, Boost Resolution, Cut Costs

June 11, 2026

Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.

Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.

Practically, this means I can treat Amplitude analytics as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift—from “insight later” to “insight-to-action now.”

Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.

From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments—not replaces—expert judgment. Data governance and permissions remain front and center: the agent sees what your teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.

If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.

The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.

Inspired by this post on Amplitude – Best Practices.

June 3, 2026

Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

[Figure: agenda for responding to a worm that compromises hundreds of packages: how it spreads, how to guard against it, AI tool risks, and concrete mitigation steps.]

On May 11th, I started seeing tweets about a TanStack hack. At that time, I didn’t know what TanStack was. But apparently, it’s a popular set of JavaScript libraries that are used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm—malicious software that self-replicates—and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

[Figure: the three-step package-hack playbook (get in, sweep for secrets, phone home), with entry points ranging from packages to MCP servers, agent skills, and app extensions.]

An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.

In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
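
One small way to inspect the entry-point step is to list which installed npm packages declare install-time lifecycle scripts (`preinstall`, `install`, `postinstall`). The sketch below is a heuristic, not a malware detector: many legitimate packages use these hooks, and the demo package it creates is invented. npm's `--ignore-scripts` install flag is a related control worth reading about.

```python
import json
import tempfile
from pathlib import Path

# npm lifecycle hooks that run automatically during installation.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def find_install_scripts(root):
    """Map package name -> install-time scripts declared in its package.json."""
    findings = {}
    for manifest in Path(root).rglob("package.json"):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest; skip it
        if not isinstance(data, dict):
            continue
        scripts = data.get("scripts")
        if not isinstance(scripts, dict):
            continue
        hooks = {h: scripts[h] for h in INSTALL_HOOKS if h in scripts}
        if hooks:
            findings[data.get("name", str(manifest.parent))] = hooks
    return findings


# Demo against a throwaway directory containing one invented package.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "node_modules" / "example-pkg"
    pkg.mkdir(parents=True)
    (pkg / "package.json").write_text(json.dumps({
        "name": "example-pkg",
        "scripts": {"postinstall": "node setup.js", "test": "jest"},
    }), encoding="utf-8")
    findings = find_install_scripts(tmp)

print(findings)
```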

Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

[Figure: four common risk zones for third-party code: agent skills and MCP servers, coding agents, IDE extensions, and Obsidian plugins.]

The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

Because Claude can install packages inside Cowork, it remains susceptible to malicious code like the Mini Shai-Hulud worm. And GitHub is on the allow list so Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.
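
The gap is easier to see with a toy egress check. The allow list below is an assumption made for illustration and is not Cowork's actual configuration: it shows that a host-level rule blocks an unknown server but cannot distinguish your repositories from an attacker's on the same allowed host.

```python
from urllib.parse import urlparse

# Hypothetical allow list; the real one is not documented in this post.
ALLOWED_HOSTS = {"github.com", "api.github.com"}


def egress_allowed(url):
    """Permit outbound requests only to hosts on the allow list."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS


# An unknown exfiltration server is blocked...
print(egress_allowed("https://evil.example/upload"))
# ...but a repository on an allowed host passes the same check.
print(egress_allowed("https://github.com/attacker/leaked-secrets"))
```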

[Figure: how Cowork handles threats: entry points are contained, data is safe only when kept outside, and network traffic is partly limited, which makes shared data the gap to watch.]

Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
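
Before mounting a folder, a quick scan can help answer the first of those questions. The sketch below is a rough heuristic under stated assumptions: the file names and patterns are examples, it will miss secrets and raise false positives, and it is no substitute for reviewing what you share.

```python
import re
import tempfile
from pathlib import Path

# Example file names that often hold credentials (not an exhaustive list).
SENSITIVE_NAMES = {".env", ".npmrc", "id_rsa", "id_ed25519", "credentials"}
# Example content patterns: private key headers and key=value style secrets.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*\S+"),
]


def risky_files(folder, max_bytes=1_000_000):
    """Return relative paths of files that look like they contain secrets."""
    flagged = []
    for path in Path(folder).rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(folder).as_posix()
        if path.name in SENSITIVE_NAMES:
            flagged.append(rel)
            continue
        try:
            if path.stat().st_size > max_bytes:
                continue  # skip large files to keep the scan quick
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if any(p.search(text) for p in SECRET_PATTERNS):
            flagged.append(rel)
    return sorted(flagged)


# Demo against a throwaway folder with invented contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".env").write_text("DB_URL=postgres://localhost/dev\n", encoding="utf-8")
    (root / "config.txt").write_text("API_KEY=abc123\n", encoding="utf-8")
    (root / "notes.txt").write_text("hello world\n", encoding="utf-8")
    flagged = risky_files(tmp)

print(flagged)
```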

It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

Amid a wave of package supply-chain attacks, this Product Talk visual launches a 3-part guide to safer AI building, starting with Cowork safety today, then Claude Code config next week, and off-device development coming soon.

What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.

My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.

Inspired by this post on Product Talk.

June 3, 2026

How to Build a Resilient Experimentation Program at Scale

Your teams are running more experiments, but decisions are not getting easier. Results arrive late, apparent wins fail to repeat, and every readout starts a new argument about the data.

The fix is not another testing tool or a higher experiment count. You need an operating system that protects validity when traffic, products, models, and customer behavior change underneath you. That system starts before exposure, routes each question to the right evaluation method, and ends with a decision your team can execute.

Give every experiment a decision contract

An experiment should begin with a decision, not a feature. Ask what you will do if the result is positive, negative, inconclusive, or unsafe. If the answer is the same in every case, the test is not worth running.

Turn the proposed test into a short decision contract before engineering begins. Record:

- The customer problem: the friction or unmet need you observed.
- The causal hypothesis: the product change, the behavior it should alter, and why.
- The eligible population: who can enter the experiment and who must be excluded.
- The primary outcome: the one metric that determines whether the hypothesis worked.
- The guardrails: the measures that can block a rollout even when the primary outcome improves.
- The decision thresholds: the minimum effect worth acting on and the conditions for shipping, iterating, stopping, or rolling back.
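One way to keep the contract from becoming a formality is to store it as structured data and check it before work starts. The sketch below is one possible shape, not a standard: the class, field names, and messages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four result states come from the text above; every field name is illustrative.
RESULT_STATES = ("positive", "negative", "inconclusive", "unsafe")


@dataclass
class DecisionContract:
    customer_problem: str
    causal_hypothesis: str
    eligible_population: str
    primary_outcome: str
    guardrails: list[str] = field(default_factory=list)
    minimum_effect: Optional[float] = None  # smallest effect worth acting on
    actions: dict[str, str] = field(default_factory=dict)  # result state -> planned action

    def problems(self) -> list[str]:
        """List the gaps that should block engineering work from starting."""
        issues = []
        for name in ("customer_problem", "causal_hypothesis",
                     "eligible_population", "primary_outcome"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        if not self.guardrails:
            issues.append("no guardrails defined")
        if self.minimum_effect is None:
            issues.append("no minimum effect worth acting on")
        missing = [s for s in RESULT_STATES if s not in self.actions]
        if missing:
            issues.append("no planned action for: " + ", ".join(missing))
        elif len(set(self.actions.values())) == 1:
            issues.append("same action for every result; the test is not worth running")
        return issues
```

An empty list from `problems()` means the contract is complete enough to review. The last check encodes the rule that a test whose outcome cannot change the decision should not run.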

A driver tree helps you connect the primary metric to the business outcome without pretending that one experiment can prove the entire chain. If the goal is retention, for example, the immediate experiment may be designed to change activation behavior. The contract should distinguish that leading behavior from the longer-term outcome.

Set the minimum detectable effect and guardrails before reading results. The minimum detectable effect is not the smallest movement your analytics can display. It is the smallest improvement that would justify the cost, risk, and complexity of the change. If your available population cannot reliably detect that effect, narrow the question, combine low-traffic variants, choose a more sensitive proximal metric, or do not run the test.
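To judge whether the available population can detect the effect, a rough sample-size estimate is usually enough. The sketch below uses the standard normal-approximation formula for comparing two conversion rates with a two-sided test. The default significance level and power are common conventions, not values from this article, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

With a 10% baseline and a 2-point minimum effect, `sample_size_per_arm(0.10, 0.02)` comes out at roughly 3,800 users per arm. Halving the detectable effect roughly quadruples the requirement, which is why narrowing the question or choosing a more sensitive metric is often the cheaper path.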

Pre-committing to the metric, stopping rule, exclusions, and decision criteria also limits convenient reinterpretation. Teams can still investigate unexpected patterns, but those findings should become new hypotheses rather than retroactive proof that the original bet won.

Match the question to the cheapest reliable evidence

Production A/B testing is only one layer of experimentation. It is often the slowest and most expensive layer because it consumes customer attention, operational capacity, and statistical power. Use it when real behavior is necessary to resolve a meaningful decision.

Evidence layer	Best question	Move forward when
Offline evaluation	Does the output meet a defined quality, policy, or safety standard?	The candidate passes the agreed evaluation set and regression checks.
Replay or shadow mode	How would the change behave on realistic inputs without affecting users?	Failure patterns, cost, and latency remain inside the operating limits.
Targeted canary	Is the change safe and observable under live conditions?	Telemetry is healthy and no guardrail triggers a rollback.
Controlled A/B test	Does the change cause a valuable shift in user behavior?	The result meets the pre-registered decision criteria.
Progressive rollout	Do the effect and reliability persist as exposure expands?	Segment-level outcomes and operational measures remain acceptable.

This layered model becomes essential for AI products. Prompts, retrieval logic, policies, model versions, and traffic composition can all change the experience. A single production metric cannot tell you whether a decline came from product value, output quality, latency, cost, safety, or an upstream model shift.

Build an evaluation stack for prompts, policies, regressions, canaries, and selective A/B tests. A candidate should earn broader exposure by passing the cheaper layers first. This reduces traffic waste and gives the team diagnostic evidence when a live result moves unexpectedly.
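The "cheaper layers first" rule can be expressed as a simple ordered gate. This is a minimal sketch that assumes every candidate moves through the layers in the order of the table above. The identifiers are illustrative, and a real program may let low-risk changes skip layers.

```python
from typing import Optional

# Ordered from cheapest to most expensive evidence, following the table above.
EVIDENCE_LAYERS = [
    "offline_evaluation",
    "replay_or_shadow",
    "targeted_canary",
    "controlled_ab_test",
    "progressive_rollout",
]


def next_layer(passed: set[str]) -> Optional[str]:
    """Return the cheapest layer the candidate has not yet passed, or None when done."""
    for layer in EVIDENCE_LAYERS:
        if layer not in passed:
            return layer
    return None
```

A candidate that passed a canary but never ran the offline evaluation is sent back to the offline evaluation, which is the point of the gate.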

Do not use a multi-armed bandit simply because it can direct more traffic toward a leading variant. Bandits are useful when the objective is clear, feedback is timely, and guardrails are dependable. They are a poor substitute for stable measurement or causal understanding. If you need to estimate an effect, learn about segments, or detect delayed harm, retain a controlled comparison.

Engineer trustworthy measurement and reversible delivery

An experimentation program is only as resilient as its event pipeline. A mathematically correct analysis built on shifting event definitions is still wrong. Treat instrumentation as a product interface with owners, documentation, versioning, tests, and observability.

Before exposure begins, verify that assignment, exposure, outcome, and guardrail events share consistent identities and timestamps. Confirm that users enter only the experiments for which they are eligible. Check that retries, duplicate events, delayed ingestion, and cross-device behavior cannot silently change the denominator.

Naming conventions, schema versioning, lineage, anomaly detection, and pipeline observability are not analytics housekeeping. They let teams move without sacrificing the meaning of their measurements. Assign an owner to every critical event and make schema changes visible to the teams whose experiments depend on them.

During the run, monitor data quality separately from product performance. Sample ratio mismatch, assignment failures, missing exposure events, sharp volume changes, and implausible segment movements should pause interpretation. Do not explain these signals away because the headline result looks attractive.
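Sample ratio mismatch is the easiest of these checks to automate. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom on the two arm counts, using only the standard library. The function name is hypothetical, and the alert threshold you choose (0.001 is a common convention) is an assumption rather than a rule from this article.

```python
from math import erfc, sqrt


def srm_p_value(control: int, treatment: int,
                expected_control_share: float = 0.5) -> float:
    """P-value that the observed split matches the intended allocation."""
    total = control + treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (treatment - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))
```

A very small p-value, for example from a 5,000 versus 5,400 split on an intended 50/50 allocation, means the assignment or exposure logging deserves investigation before anyone reads the product metrics.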

Delivery must be reversible as well as measurable. Put material treatments behind feature flags. Start with a targeted canary, watch operational and customer guardrails, and expand exposure in stages. Define who can stop the rollout and make sure that person has both the telemetry and access required to act.
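Staged exposure behind a flag is often implemented with deterministic hashing, so a user who enters at 5% stays in as the rollout grows. The sketch below is a generic illustration, not the API of any particular feature-flag product. The function names, flag key, and bucket count are assumptions.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in the range 0..9999 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 10_000


def is_exposed(user_id: str, flag: str, rollout_percent: float,
               kill_switch: bool = False) -> bool:
    """Decide exposure; the kill switch overrides everything for fast rollback."""
    if kill_switch:
        return False
    return bucket(user_id, flag) < rollout_percent * 100
```

Because the bucket depends only on the flag and the user, raising `rollout_percent` adds users without reshuffling the ones already exposed. Setting `kill_switch=True` is the action the rollback owner needs to be able to take.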

For broad platform or AI changes, maintain a persistent holdout when feasible. A long-lived control gives you a reference point for cumulative effects that short experiments miss, including changes in retention, trust, support burden, and cost. Protect the holdout from accidental contamination and document every change that affects its interpretation.

Scale the program around decisions, not test volume

A central experimentation team cannot design and analyze every test at scale. Product teams need autonomy inside a governed system. Centralize the parts where inconsistency creates shared risk: assignment services, metric definitions, event standards, quality checks, templates, and audit records. Let teams own hypotheses, customer context, treatment design, and decisions inside those guardrails.

Use a lightweight review based on risk. A reversible interface change with a proven metric can follow a standard path. A pricing change, safety policy, ranking system, or shared AI capability deserves stronger review, tighter exposure controls, and a clearer rollback plan. Governance should become more demanding as the blast radius grows.

Maintain a portfolio view rather than a leaderboard of teams by test count. For each active experiment, track the decision it supports, expected value, detectable effect, traffic requirement, risk class, owner, and current evidence layer. This reveals when several teams are competing for the same population, when a strategic question is underpowered, and when multiple small tests should become one coherent learning plan.

Reset a brittle program over 90 days

You can make the operating model concrete without attempting a platform-wide rebuild:

By day 30: audit the backlog and current tests. Stop or consolidate experiments that cannot meet their minimum detectable effect. Identify unreliable events, missing owners, conflicting metric definitions, and launches without explicit decision criteria. For AI surfaces, establish a minimal offline evaluation harness for prompts, policies, quality, and safety.
By day 60: publish standard hypothesis and readout templates. Put high-risk changes behind feature flags, make guardrails visible, and introduce canary exposure. Establish persistent holdouts where broad or cumulative effects matter. Add alerts for instrumentation drift and operational regressions.
By day 90: manage a balanced portfolio across offline evaluations, replay or shadow tests, canaries, controlled experiments, and progressive rollouts. Review program health through decision speed, valid learning, repeatability, and detected harm rather than the number of tests launched.

Create a community of practice alongside these controls. Regularly examine inconclusive results, failed replications, instrumentation incidents, and stopped rollouts. These cases expose weaknesses in the system more reliably than a gallery of wins. The goal is not to eliminate failure. It is to make failure informative, contained, and cheap.

Key takeaways

- Start with the decision the experiment must support, then pre-register the hypothesis, primary metric, guardrails, detectable effect, and stopping rule.
- Use offline evaluations, replay, shadow mode, and canaries to eliminate weak or unsafe candidates before consuming production traffic.
- Treat event semantics, assignment, exposure, lineage, and anomaly detection as production infrastructure.
- Pair controlled measurement with feature flags, progressive exposure, explicit rollback authority, and persistent holdouts where cumulative effects matter.
- Judge the program by trustworthy decisions and reusable learning, not experiment volume or the percentage of positive results.

Choose one upcoming decision with meaningful customer or operational risk. Write its decision contract, identify the cheapest evidence layer that could disprove it, and verify the rollback path before anyone builds the treatment. That single discipline is a practical starting point for a program that can keep learning as your product and organization change.

June 1, 2026

Behavioral Customer Data for Proactive SaaS Retention
Your cancellation dashboard can tell you who has already left. It cannot tell you which accounts are failing to reach value, why their behavior changed, or what your team should do while the relationship is still recoverable.

That is the real purpose of behavioral customer data. You are not trying to produce a more sophisticated churn report. You are building an operating system that turns observable behavior into a reason, an owner, and a timely response.

Start with the retention decision, not the dashboard

A risk score has no operational value if nobody knows what to do when it changes. Before choosing events, dashboards, or models, write down the retention decisions your data must support.

For every proposed signal, define a decision contract:
- Trigger: What behavior changed, started, stopped, failed, or never happened?
- Interpretation: What customer state might that behavior indicate?
- Owner: Should product, customer success, support, solutions engineering, or billing respond?
- Intervention: What is the smallest useful action that could remove the obstacle?
- Success signal: Which subsequent behavior would show that the customer is back on a value path?
- Expiration rule: When should the alert or intervention stop so the customer is not repeatedly contacted?
This contract prevents a common failure: treating all declining activity as the same problem. A customer who cannot finish an integration needs a different response from an activated customer whose core usage suddenly drops. A payment problem is different again. Combining them into one generic churn-risk label hides the information required to help.

The signal also needs to match the product’s natural rhythm. Daily inactivity can matter in a daily workflow, but the same rule will create false alarms for a workflow used weekly or at the end of a reporting cycle. Compare behavior with the expected use pattern for the account’s persona, plan, lifecycle stage, and use case.
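A cadence-aware inactivity rule can be very small. In the sketch below, the cadence labels, expected gaps, and tolerance multiplier are hypothetical values chosen for illustration; in practice they would come from the observed rhythm of each persona, plan, and use case.

```python
from datetime import datetime, timedelta

# Hypothetical expected gap between core actions, in days, per usage cadence.
EXPECTED_GAP_DAYS = {"daily": 1, "weekly": 7, "monthly": 31}


def is_unusually_inactive(last_core_action: datetime, now: datetime,
                          cadence: str, tolerance: float = 2.0) -> bool:
    """Flag inactivity only when it exceeds the expected gap by the tolerance factor."""
    allowed = timedelta(days=EXPECTED_GAP_DAYS[cadence] * tolerance)
    return now - last_core_action > allowed
```

Five quiet days trigger the rule for a daily workflow and stay silent for a weekly one, which is the false alarm the paragraph above warns about.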

I would design backward from a small set of decisions rather than forward from every event that happens to be available. The most useful leading indicators usually describe activation, time-to-first-value, depth of feature adoption, usage momentum, friction, and expansion intent. Each tells you something different about whether value is beginning, recurring, weakening, or growing.

Instrument the path from first value to recurring value

Measure value at the account level

In B2B SaaS, the person clicking is not always the entity that retains. Users perform actions, while the account usually owns the subscription. Your model therefore needs both a reliable user identity and an account identity, plus a record of which users belonged to which account when the behavior occurred.

This distinction matters when roles differ. An administrator may configure the product once, an operator may use the core workflow repeatedly, and an executive may only view outcomes. A login-frequency rule applied equally to all three will misclassify healthy behavior as disengagement. Define the value-producing behavior for each relevant persona, then roll those behaviors into an account-level state.

Map the customer journey around observable value states:
- Setup: The account has supplied the prerequisites required to attempt the core workflow.
- Activation: The account has completed a meaningful milestone that indicates initial value, not merely finished an onboarding screen.
- Recurring value: The core workflow is being completed at a cadence consistent with the use case.
- Adoption depth: The account is using the capabilities required to obtain more complete or durable value.
- Friction: Attempts, errors, failed integrations, or support interactions indicate that progress is being blocked.
- Expansion intent: Behavior indicates a new use case, broader adoption, or interest in a relevant upgrade path.
Your activation milestone is the pivotal definition. It should represent the earliest behavior that credibly demonstrates value. Completing profile fields or dismissing a tour may be easy to measure, but neither proves that the customer accomplished the job for which the product was purchased.

Do not force one milestone across materially different use cases. If a plan, persona, or workflow changes the way value is produced, define the appropriate milestone for that segment. You can still report a common activation outcome while preserving the underlying reason an account qualified.

Use a minimal tracking contract

Once the value path is clear, instrument attempts, completions, failures, and meaningful outcomes along that path. A useful event contract includes:
- A stable event name with a documented business meaning.
- The user and account identifiers required for identity resolution.
- The time the behavior actually occurred, not only the time it reached the analytics system.
- The persona, plan, lifecycle stage, and use case needed for segmentation.
- The product object or workflow involved.
- A normalized outcome or error category when the action can fail.
- The event owner and the process for approving semantic changes.
For an integration workflow, for example, separate connection attempted, connection completed, and connection failed. Attach the provider and a controlled error category. Do not attach credentials, tokens, raw request bodies, or unrestricted personal information. Those fields create security and privacy exposure without improving the retention decision.
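The integration example can be turned into a validation step that runs before an event is accepted. This is a minimal sketch: the event names follow the three events described above, while the field names, prohibited-field list, and error categories are assumptions for illustration.

```python
from datetime import datetime

EVENT_NAMES = {"connection_attempted", "connection_completed", "connection_failed"}
REQUIRED_FIELDS = {"event", "user_id", "account_id", "occurred_at", "provider"}
# Hypothetical deny-list and error taxonomy; adapt both to your own contract.
PROHIBITED_FIELDS = {"credentials", "token", "password", "request_body"}
ERROR_CATEGORIES = {"auth_rejected", "timeout", "permission_missing", "unknown"}


def validate_event(event: dict) -> list[str]:
    """Return contract violations; an empty list means the event is acceptable."""
    errors = []
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        errors.append("missing fields: " + ", ".join(sorted(missing)))
    leaked = PROHIBITED_FIELDS & set(event)
    if leaked:
        errors.append("prohibited fields: " + ", ".join(sorted(leaked)))
    if event.get("event") not in EVENT_NAMES:
        errors.append("unknown event name")
    if event.get("event") == "connection_failed":
        if event.get("error_category") not in ERROR_CATEGORIES:
            errors.append("connection_failed needs a controlled error_category")
    if "occurred_at" in event and not isinstance(event["occurred_at"], datetime):
        errors.append("occurred_at must be a datetime")
    return errors
```

Rejecting free-text error messages in favor of a controlled category keeps raw request details out of analytics and makes failures comparable across providers.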

The foundation is a clean event taxonomy, dependable identity resolution, and privacy-by-design. Capture only what the decision requires. If support sentiment is useful, prefer a governed derived category over copying unrestricted support conversations into an analytics platform. Keep sensitive material in the controlled system that already owns it.

Before using any event in a risk score, ask product, data, and customer success to reconstruct the same account timeline. Check for duplicate events, delayed delivery, internal or test traffic, users mapped to the wrong account, plan changes that were not propagated, and renamed events with conflicting meanings. If those teams see different stories, automation will only distribute the disagreement faster.

It is also safer to trigger interventions from a derived account state than directly from a raw event. A raw event says that something happened. An account state says whether activation is incomplete, recurring value has weakened, an integration is blocked, or a commercial issue is unresolved. That state can carry a reason code, observation time, data-quality status, and expiration rule into the product, lifecycle messaging, or customer success workflow.
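A derived state of that kind might look like the sketch below. It assumes events are dictionaries with `event`, `occurred_at`, and an optional `error_category`. The state name, the seven-day expiry, and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class AccountState:
    account_id: str
    state: str
    reason_code: str
    observed_at: datetime
    data_quality_ok: bool
    expires_at: datetime

    def is_actionable(self, now: datetime) -> bool:
        """Only act on states that are trusted and have not expired."""
        return self.data_quality_ok and now < self.expires_at


def derive_integration_state(account_id: str, events: list[dict],
                             data_quality_ok: bool = True,
                             ttl_days: int = 7) -> Optional[AccountState]:
    """Return an 'integration_blocked' state if the latest connection event failed."""
    relevant = sorted(
        (e for e in events if e["event"].startswith("connection_")),
        key=lambda e: e["occurred_at"],
    )
    if not relevant or relevant[-1]["event"] != "connection_failed":
        return None
    latest = relevant[-1]
    return AccountState(
        account_id=account_id,
        state="integration_blocked",
        reason_code=latest.get("error_category", "unknown"),
        observed_at=latest["occurred_at"],
        data_quality_ok=data_quality_ok,
        expires_at=latest["occurred_at"] + timedelta(days=ttl_days),
    )
```

Downstream workflows consult `is_actionable` rather than the raw failure event, so a later successful connection or an expired observation stops the outreach automatically.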

Build a risk score people can challenge and act on

You do not need a black-box model to begin. A transparent rule set is often more useful because product and customer success can inspect the evidence, dispute a weak assumption, and choose the correct response.

A practical account score can combine several distinct dimensions:
May 18, 2026
Unlocking AI Agents: The Real Barrier Is Readiness—Not Capability—Here’s How to Scale

There’s a question that runs underneath every AI Agent evaluation: what can it do?

Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.

That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.

The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.

The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.

“Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.

Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.

Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.

Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, and variations that have never been written down because they’ve always lived in someone’s head.

Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”

Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?

Most companies have the first two, some have the third, fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.

We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.

One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.

In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding: walking users through processes step by step rather than executing changes on their behalf. It worked. It resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.

Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.

The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI”; they had more standardized operations. Their actions were already well-defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior; it was triggering actions that were already tightly controlled elsewhere.

Readiness enables capability, not the other way around. Which reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”

Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.

The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.

What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?

In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.

Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.

Inspired by this post on The Intercom Blog.

May 18, 2026

Governed AI Analytics in Financial Services: A Playbook

You have a credible AI analytics use case, product teams want access, and risk leaders want proof that the system will not expose sensitive data or influence the wrong decision. The mistake is to settle that tension with a broad choice between “innovation” and “control.” That choice is too vague to operate.

Start with a narrower question: what decision may this system influence, using which data, under whose authority, with what evidence afterward? Once those boundaries are explicit, you can give teams meaningful speed without asking compliance to accept an invisible risk.

Classify the decision before you assess the AI

Many AI reviews begin with the model: where it is hosted, how it was trained, or whether it can explain an answer. Those questions matter, but they do not establish the business risk. The same model can summarize an approved dashboard, flag an unusual transaction pattern, or help determine an outcome that affects a customer. Those are not equivalent uses.

Classify each use case by consequence, reversibility, and action authority. Consequence asks what happens if the output is wrong. Reversibility asks whether a person can correct the result before harm occurs. Action authority asks whether the system informs a person, recommends an action, or executes one.

Use case pattern	Permitted role for AI	Control that matters most	Boundary to make explicit
Descriptive analysis	Summarize approved metrics or behavioral patterns	Data permissions and traceable metric definitions	The output cannot create a new customer-level action
Investigative signal	Surface anomalies or suspicious patterns for review	Analyst validation, evidence capture, and disposition logging	A signal is not a finding or a verdict
Product recommendation	Suggest an intervention, workflow, or experiment	Human approval and outcome monitoring	The recommendation cannot bypass existing approval paths
Customer-affecting decision	Support a formally governed decision process	Documented oversight, explainability, and accountable human authority	The final authority and escalation path must be unambiguous

This classification prevents two common errors. The first is applying the heaviest possible review to every analytical assistant, which sends teams into unofficial tools and manual workarounds. The second is treating every output as “just an insight” even when a downstream workflow turns it into a customer action.

Trace the output one step beyond the interface. If an anomaly score enters a case-management queue, changes account handling, or triggers outreach, govern that downstream effect as part of the use case. A recommendation does not become low risk merely because a person clicks the final button.

Before development begins, write an allowed-action statement and a prohibited-action statement. For example: “The system may prioritize patterns for analyst investigation. It may not label a customer, close a case, or initiate an external action.” That pair of sentences is more operationally useful than calling the project “medium risk.”
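The two statements become more useful when the workflow checks them at runtime. The sketch below encodes the example above as two sets and denies anything that is not explicitly allowed. The action identifiers and the default-deny choice are assumptions for illustration.

```python
# Hypothetical action identifiers derived from the example statements above.
ALLOWED_ACTIONS = {"prioritize_for_analyst_review"}
PROHIBITED_ACTIONS = {"label_customer", "close_case", "initiate_external_action"}


def authorize(action: str) -> tuple[bool, str]:
    """Return (permitted, reason) for an action the system wants to take."""
    if action in PROHIBITED_ACTIONS:
        return False, "explicitly prohibited"
    if action in ALLOWED_ACTIONS:
        return True, "explicitly allowed"
    # Default deny: unlisted actions need review before they can be added.
    return False, "not listed; needs review before it can be added"
```

Returning a reason alongside the decision gives the control owner evidence to log, and it keeps a new downstream action from appearing without a review.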

Risk and compliance leaders still need to map the use case to the organization’s actual legal and regulatory obligations. A product risk classification is an operating tool, not a legal conclusion. When a use case could affect access, eligibility, pricing, fraud treatment, or another consequential outcome, obtain the appropriate compliance and legal review before activation.

Turn governance principles into an enforceable contract

Principles such as fairness, privacy, transparency, and human oversight do not control a production workflow by themselves. Each principle needs an owner, an enforcement point, and evidence that the control operated. I treat that combination as the governance contract for the use case.

Define the data boundary

List the approved data domains, fields, purposes, environments, and user groups. Do not stop at “customer data” or “analytics data.” Those labels are too broad to enforce. State which attributes the system can retrieve, which identifiers it can display, whether results may be exported, and where generated outputs may be stored.

Purpose: the business question the data may be used to answer.
Permitted inputs: the approved events, attributes, aggregates, and reference data.
Prohibited inputs: data classes that the workflow must never retrieve or infer.
Permitted users: roles allowed to query, review, approve, or export results.
Output handling: where results may be displayed, retained, shared, or reused.
Failure behavior: what the system does when permission, provenance, or confidence is insufficient.

Enforce that boundary with role-based access controls and granular permissions at retrieval time. Filtering an answer after a model has received restricted data is not equivalent to preventing access. The model, retrieval layer, analytics service, export path, and destination workflow all need to respect the same user identity and policy context.
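A minimal sketch of a retrieval-time check follows. It assumes the boundary is stored as a simple record; the `DataBoundary` class, the `authorize_retrieval` function, and the example field names are all hypothetical and not taken from any specific platform. The point is only that the refusal happens before data is fetched, not after the model has seen it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBoundary:
    """Hypothetical data-boundary contract for one use case."""
    purpose: str
    permitted_inputs: frozenset
    prohibited_inputs: frozenset
    permitted_roles: frozenset


class AccessDenied(Exception):
    """Raised before any data is fetched, so restricted data never reaches the model."""


def authorize_retrieval(boundary: DataBoundary, role: str, fields: set) -> set:
    """Return the fields to fetch, or fail visibly if the request crosses the boundary."""
    if role not in boundary.permitted_roles:
        raise AccessDenied(f"role '{role}' is not permitted for: {boundary.purpose}")
    blocked = fields & boundary.prohibited_inputs
    if blocked:
        raise AccessDenied(f"prohibited fields requested: {sorted(blocked)}")
    unapproved = fields - boundary.permitted_inputs
    if unapproved:
        raise AccessDenied(f"fields outside the approved scope: {sorted(unapproved)}")
    return fields
```

Because the error message names the boundary that was crossed, the same check can also serve as the failure behavior in the contract: a visible, explainable denial in place of a silently filtered answer.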

Assign decision rights to named roles

A committee can set policy, but it cannot own every operational decision. Give each use case an accountable product owner, a data owner, a control owner, and a business reviewer. Clarify who can approve launch, who can change the data scope, who reviews exceptions, and who has authority to stop the workflow.

The product owner defines the user problem, allowed action, prohibited action, and business outcome.
The data owner approves the data purpose, quality expectations, permissions, and reuse limits.
The risk or compliance owner maps policy obligations to testable controls and reviews material exceptions.
The platform or security owner implements identity, access, isolation, logging, and change controls.
The business reviewer accepts, rejects, or escalates outputs and records why.

Keep the decision rights close to the workflow. If a reviewer sees an unsupported conclusion, that person needs a clear way to reject it, preserve the evidence, and route the issue. If every exception disappears into a general governance inbox, the formal control will be bypassed when operational pressure rises.

Design the audit record before launch

An audit trail should reconstruct what happened without relying on someone’s memory. Capture the requesting identity and role, the approved purpose, the data and metric definitions used, the system configuration, the generated result, any human review, the resulting action, and later corrections or overrides.

Logging creates its own data risk. Prompts, retrieved context, generated explanations, and reviewer notes can contain sensitive information. Protect the audit store with appropriate access, retention, and segregation rather than treating logs as harmless operational exhaust. Where policy permits, record protected references to sensitive records instead of duplicating raw payloads.
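One way to picture the audit record is as a typed structure whose fields mirror the list above. The sketch below is illustrative only: the field names are assumptions, and the salted digest in `protected_ref` is just one possible form of "protected reference". Whether a digest is acceptable under your policy is a question for the data owner, since a salted hash is pseudonymous and not anonymous.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def protected_ref(record_id: str, salt: str) -> str:
    """Return a salted SHA-256 digest so the log can point at a record without copying it."""
    return hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).hexdigest()


@dataclass
class AuditRecord:
    """Hypothetical audit entry; one instance per generated output."""
    requester_id: str
    requester_role: str
    approved_purpose: str
    metric_definitions: List[str]
    system_config_version: str
    result_summary: str
    record_refs: List[str]  # protected references, not raw payloads
    reviewer_id: Optional[str] = None
    review_disposition: Optional[str] = None
    resulting_action: Optional[str] = None
    corrections: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```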

A practical platform evaluation should test whether the system combines strong data governance, auditable AI behavior, security that holds as usage scales, and a direct connection to product outcomes. A policy document that cannot be enforced in the workflow is not enough, and neither is a platform control without an accountable operating process.

Put controls inside the workflows people actually use

Governance fails when it exists as a review ceremony around the product rather than a behavior inside it. Analysts should not have to remember a separate policy every time they ask a question. The approved data scope, identity context, review step, and evidence capture should travel with the task.

Behavioral analytics: govern the meaning as well as the data

Behavioral analytics can reveal how customers move through onboarding, self-service, support, payments, and other product journeys. The danger is not limited to unauthorized access. An AI system can also combine valid events into a misleading interpretation of customer intent.

Start the workflow with curated event definitions and approved business metrics. Require the output to expose the cohort definition, time context, filters, exclusions, and comparison used. The analyst should be able to inspect the path from a narrative claim back to the underlying measure before sharing it.

Separate observation from inference in the interface. “Users in this cohort abandoned the flow after this step” is an observation tied to event data. “They abandoned because they distrusted the process” is a hypothesis. Labeling those differently prevents fluent language from turning a plausible explanation into an unsupported fact.

Anomaly detection: route a signal into investigation, not judgment

An anomaly means a pattern differs from an expected baseline. It does not establish fraud, customer intent, system abuse, or operational error. Treat anomaly detection as a prioritization mechanism unless a separately governed process establishes something more.

Give the reviewer the observed deviation, relevant context, the comparison baseline, and links to permitted evidence. Capture the reviewer’s disposition: confirmed issue, expected behavior, insufficient evidence, data-quality problem, or escalation. That disposition is both an audit artifact and a feedback signal for improving the workflow.

Watch the operational burden as closely as the detection capability. A flood of weak signals can make the nominal control less safe because reviewers rush, defer, or stop trusting the queue. Monitor false positives, unresolved escalations, overrides, and the reasons analysts reject outputs. When those indicators deteriorate, reduce scope or pause automated routing while the cause is investigated.
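The disposition list and the queue-health check can be sketched together. The disposition names below come from the text; the `queue_health` and `should_pause_routing` functions and the 0.8 threshold are assumptions chosen for illustration, not recommended values.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Disposition(Enum):
    CONFIRMED_ISSUE = "confirmed issue"
    EXPECTED_BEHAVIOR = "expected behavior"
    INSUFFICIENT_EVIDENCE = "insufficient evidence"
    DATA_QUALITY_PROBLEM = "data-quality problem"
    ESCALATION = "escalation"


def queue_health(dispositions: List[Disposition]) -> Dict[str, float]:
    """Summarize reviewed signals as each disposition's share of the total."""
    total = len(dispositions)
    if total == 0:
        return {}
    counts = Counter(dispositions)
    return {d.value: counts[d] / total for d in Disposition}


def should_pause_routing(health: Dict[str, float], max_expected_share: float = 0.8) -> bool:
    """Illustrative rule: pause when most signals turn out to be expected behavior."""
    return health.get(Disposition.EXPECTED_BEHAVIOR.value, 0.0) > max_expected_share
```

A real workflow would track more than one indicator, including unresolved escalations and overrides, but even a single share like this makes a deteriorating queue visible before reviewers stop trusting it.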

Self-service analysis: give teams a governed lane

Product managers and analysts need enough freedom to explore without sending every question through a central approval queue. Create a governed workspace containing approved metrics, documented data products, role-aware access, and restricted export paths. Let people iterate freely inside that lane while changes to data scope, decision authority, or external activation trigger a new review.

Make the boundary visible. Users should know when an answer is based on incomplete data, when a metric is not approved for customer-level decisions, and when an output cannot be exported. A silent denial encourages workarounds; a clear denial that identifies the policy boundary gives the user a legitimate next step.

Do not give an analytics assistant write access to operational systems merely because the integration is convenient. Insight generation and action execution are separate privileges. Connect them only when the action, reviewer, failure mode, and rollback path have been governed explicitly.

Pilot with evidence, not a polished demonstration

A convincing demo proves that the happy path works. A governed pilot must also prove that the system refuses the wrong request, exposes enough evidence for review, and leaves a usable record when something goes wrong.

Choose a narrow workflow with an identifiable user, a bounded data set, a reviewable output, and a business outcome you already understand. Avoid beginning with an enterprise-wide assistant or an autonomous action layer. Broad scope makes it difficult to distinguish model behavior, data problems, permission failures, and process gaps.

Write the decision contract. Record the user, purpose, permitted inputs, allowed action, prohibited action, reviewer, and stop authority.
Configure the smallest useful data boundary. Include only the fields and metrics needed for the chosen workflow.
Test legitimate work. Confirm that authorized users can produce an insight, inspect its basis, and complete the intended review.
Test prohibited work. Attempt access with the wrong role, request excluded attributes, try an unauthorized export, and ask the system to take a prohibited action.
Test ambiguity and failure. Use incomplete context, conflicting metric definitions, missing permissions, and unavailable dependencies. Confirm that the system fails visibly and safely.
Reconstruct the event. Use the audit record to determine who requested the output, what information was used, what was generated, who reviewed it, and what happened next.
Change the system deliberately. Update a relevant configuration or model component and confirm that approval, documentation, testing, and monitoring follow the change.

Do not accept screenshots as evidence for controls that operate behind the interface. Ask the vendor or internal platform team to demonstrate a denied request, a permission change, a reviewer override, an exported audit record, and the behavior after a governed configuration change. The test should follow your use case and identities, not a generic demonstration tenant.

Measure value and control health together. If the system produces faster insights but increases unreviewed actions, weakens attribution, or creates an investigation backlog, it has not delivered a durable improvement.

Dimension	Question	Useful signals
Business value	Does the workflow improve a real product, growth, risk, or operational decision?	Time to a validated insight, useful investigations completed, issues resolved, and attributable product outcomes
Analytical quality	Can a reviewer verify the conclusion?	Accepted and rejected outputs, unsupported claims, metric-definition errors, and missing context
Control effectiveness	Did policy operate as designed?	Prohibited requests blocked, required reviews completed, permission exceptions, and audit-record completeness
Operational health	Can people sustain the workflow?	False-positive burden, unresolved escalations, overrides, rework, and reviewer backlog
Change safety	Do updates preserve the approved boundary?	Documented changes, completed regression checks, new failure patterns, and monitored post-change behavior

Set release gates in binary language. The use case has a named accountable owner or it does not. Permissions have been tested with unauthorized identities or they have not. High-impact outputs receive the required review or they do not. Audit evidence can reconstruct an event or it cannot. Ambiguous gates become exceptions as soon as delivery pressure appears.
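Binary gates translate directly into a check that accepts only `True` or `False`. The sketch below is a hypothetical helper, not a feature of any platform; the gate names you pass in are your own.

```python
from typing import Dict, List, Tuple


def release_decision(gates: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (ready, failed_gates). Every gate must be strictly True or False."""
    for name, value in gates.items():
        if not isinstance(value, bool):
            # "Mostly done" or a percentage is exactly the ambiguity the gate exists to prevent.
            raise TypeError(f"gate '{name}' must be True or False, got {value!r}")
    failed = [name for name, passed in gates.items() if not passed]
    return (not failed, failed)
```

Rejecting any non-boolean value is the design choice that matters: a gate recorded as "partial" cannot quietly pass.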

When the pilot is stable, reuse the control components rather than copying the entire use case. Standard identity propagation, data classification, audit schemas, reviewer workflows, and change gates can form a shared control plane. Each new use case still needs its own purpose, decision boundary, outcome measure, and risk assessment.

Key takeaways

- Govern the decision the AI can influence, not just the model that produces the output.
- Write both an allowed-action statement and a prohibited-action statement before development begins.
- Enforce data permissions before retrieval and carry the user’s identity through analysis, export, and downstream action.
- Treat human review as an operational workflow with evidence, dispositions, escalations, and stop authority.
- Keep observations, hypotheses, recommendations, and customer-affecting decisions visibly distinct.
- Test denial, ambiguity, change, and audit reconstruction alongside the happy path.
- Track business value, analytical quality, control effectiveness, and operational burden on the same scorecard.

Your next move is not to draft an enterprise AI policy. Pick one live analytics workflow and write its decision contract on a single page. If you cannot name the allowed action, prohibited action, data boundary, reviewer, audit evidence, and stop authority, the workflow is not ready to scale. If you can, you have the foundation for AI analytics that product teams can use and risk leaders can defend.

References

Amplitude – Financial Services AI

May 15, 2026

AI-Enabled Enzymatic Recycling: A Product Leader’s Playbook

You have an AI-enabled materials proposal in front of you, a promising set of enzyme candidates, and a difficult decision: fund another round of discovery or start building toward industrial scale. The candidate sequences may be impressive, but they are not yet the product.

Your decision should turn on whether the full system can repeatedly transform a defined waste stream into usable monomers at an economically viable cost. That framing connects model performance, laboratory evidence, process engineering, and commercial reality before an exciting demonstration becomes a stranded pilot.

Define the product around recovered monomers

Only 10% of the plastic manufactured gets recycled. That low rate is not merely a sorting or consumer-behavior problem. Traditional recycling commonly shortens polymer chains instead of restoring their original molecular building blocks, so the resulting material can lose quality and move toward downcycling.

Enzymatic recycling changes the intended output. An engineered enzyme can deconstruct a polymer into its original monomers, which can then become inputs for new, high-quality plastic. The difference is fundamental: the product is not processed waste or a smaller plastic fragment. It is recovered molecular feedstock.

This distinction gives you a better product boundary. A generated protein sequence is a feature. An enzyme that shows activity in one assay is a technical result. The product is a repeatable monomer-recovery system with a defined input, output, operating envelope, and cost structure.

Before approving a roadmap, require the team to define five contracts:

Input contract: Which polymer, packaging format, mixture, and contamination profile will the process accept? “Mixed plastic” is not a specification. Name the included materials and the variation the system must tolerate.
Transformation contract: Which polymer bonds must the enzyme break, and what conversion and selectivity must the reaction demonstrate?
Output contract: Which monomers will be recovered, what downstream use must they support, and how will the team determine that the output is suitable for that use?
Operating contract: What reaction conditions, throughput, energy consumption, and process controls must hold outside a small laboratory assay?
Economic contract: Which cost per ton must the integrated process approach, and which assumptions currently separate measured economics from projected economics?

Selectivity is especially important. An enzyme can target a particular plastic within a mixed waste stream, potentially reducing the need to treat every input as chemically identical. But selectivity does not make an undefined waste stream manageable. The process still needs to know which target material is present, whether the enzyme can reach it, and how the desired products will be recovered.

Write the product brief in one sentence: For this defined feedstock, transform this polymer into these monomers, within this operating envelope, output specification, and cost boundary. If a number is unknown, leave a visible blank and assign an experiment to fill it. Do not hide the uncertainty inside a broad ambition such as “make plastic circular.”

Build the AI as a closed learning system

AI changes the economics of searching enzyme-design space. Protein language models can generate candidates, multi-step agents can coordinate specialized tasks, and computational evaluations can eliminate weak options before scarce laboratory capacity is used. Advances in protein structure prediction have expanded what can be explored, but prediction does not remove the need for physical validation.

The useful architecture is therefore not a model that emits sequences. It is a closed loop in which every physical result makes the next design round better. Rhea’s Factory combines protein language models, an agentic pipeline, domain constraints, and proprietary wet-lab feedback. The product lesson is broader than any one implementation: generation, evaluation, experimentation, and learning need to operate as one traceable system.

Encode the objective. Convert the product contract into machine-readable constraints: target polymer, desired products, acceptable operating conditions, and the metrics that will decide whether a candidate advances.
Generate candidates. Explore multiple plausible designs rather than optimizing immediately around the first promising family.
Apply computational gates. Reject candidates that violate explicit constraints, preserve the reasons for rejection, and rank the remaining candidates for laboratory use.
Run controlled wet-lab experiments. Test candidates under recorded conditions and capture successes, failures, and inconclusive results.
Update domain predictions. Use the measured outcomes to improve ranking and candidate selection for the next round.
Feed process evidence back into discovery. When a candidate struggles under reactor or feedstock conditions, turn that failure into a new design constraint instead of treating it as a separate engineering problem.

Agentic AI is valuable here because the workflow is multi-step, not because an agent should make every decision autonomously. At each handoff, define the required input, expected output, validator, and failure behavior. A generation step should not advance an incomplete candidate. A computational score should not be presented as a laboratory observation. A promising assay should not silently become a scale claim.

Exploration also needs an explicit lane. Higher model-sampling temperatures can produce more unusual enzyme candidates and reach beyond the safest local variations. Controlled model “hallucination” can be useful during candidate exploration when downstream guardrails prevent novelty from being mistaken for evidence.

Separate the candidate portfolio into three buckets: improvements near known winners, adjacent designs that test a clear hypothesis, and high-variance exploration. Give each bucket a deliberate laboratory budget. Raise sampling temperature only in the exploratory lane, and never allow generated assay values, reaction outcomes, or scale results into the measured-data record.
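The three-bucket portfolio can be sketched as a small allocation table. Everything numeric below is a placeholder: the budget shares, the sampling temperatures, and the `Lane` and `allocate_slots` names are assumptions for illustration, not values from any real pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Lane:
    name: str
    budget_share: float          # fraction of available laboratory slots
    sampling_temperature: float  # raised only in the exploratory lane


# Placeholder shares and temperatures, not recommended settings.
LANES = [
    Lane("near_known_winners", 0.5, 0.7),
    Lane("adjacent_hypotheses", 0.3, 0.7),
    Lane("high_variance_exploration", 0.2, 1.2),
]


def allocate_slots(lanes: List[Lane], total_slots: int) -> Dict[str, int]:
    """Split laboratory slots by share; slots lost to rounding go to the first lane."""
    if abs(sum(lane.budget_share for lane in lanes) - 1.0) > 1e-9:
        raise ValueError("budget shares must sum to 1")
    slots = {lane.name: int(lane.budget_share * total_slots) for lane in lanes}
    slots[lanes[0].name] += total_slots - sum(slots.values())
    return slots
```

Writing the shares down before the round starts is what makes the budget deliberate: exploration gets a fixed allowance instead of whatever capacity is left over.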

The durable advantage sits in the feedback data. In a narrow, high-signal domain, even hundreds of relevant proprietary laboratory observations can support a useful domain prediction model. That is not a general claim that small datasets are always sufficient. It means contextual quality can matter more than indiscriminate volume when the problem, assay, and outcomes are tightly defined.

For every experiment, preserve enough context to make the result reusable:

The enzyme identity, sequence version, and design lineage.
The target polymer, material format, mixture, and relevant contamination profile.
The assay and protocol version used for the test.
The reaction conditions and duration.
The measured conversion, selectivity, yield, and uncertainty available from the experiment.
The full result, including failure, no-result, and inconclusive outcomes.
The relationship between the candidate, computational evaluations, physical test, and model or data release.
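As a rough illustration, the context list above maps onto a record type like the following. The field names and the `Outcome` values are assumptions; your assay and protocol vocabulary will differ. The one rule encoded here is that failures and inconclusive runs are first-class outcomes, not rows that get dropped.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_RESULT = "no_result"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical schema for one wet-lab run."""
    enzyme_id: str
    sequence_version: str
    design_lineage: Tuple[str, ...]
    target_polymer: str
    material_format: str
    mixture: str
    contamination_profile: str
    protocol_version: str
    reaction_conditions: Dict[str, float]
    duration_hours: float
    outcome: Outcome
    conversion: Optional[float] = None
    selectivity: Optional[float] = None
    yield_fraction: Optional[float] = None
    uncertainty: Optional[float] = None
    model_release: Optional[str] = None

    def __post_init__(self) -> None:
        # A run recorded as a success must carry at least a measured conversion.
        if self.outcome is Outcome.SUCCESS and self.conversion is None:
            raise ValueError("a successful run must carry a measured conversion")
```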

A spreadsheet of winning sequences is not a data moat. A traceable record of why candidates were proposed, how they were tested, what failed, and how each result changed the next decision can become one.

Use stage gates that end in physical evidence

AI product teams often gravitate toward a model leaderboard because it creates a clean sense of progress. Enzymatic recycling does not have one adequate master score. A candidate can look structurally plausible and fail in the lab. It can perform in a controlled assay and miss the required throughput. It can convert the polymer and still lose economically once the rest of the process is counted.

Use a hierarchy of evidence that moves from design compliance to laboratory performance, operating fit, and scale economics:

Gate	Decision question	Required evidence	Red flag
Design compliance	Does the candidate satisfy the stated target and pipeline constraints?	Deterministic checks, recorded constraint evaluations, and candidate provenance	A candidate advances mainly because it appears novel
Wet-lab performance	Does the enzyme convert the target with the required selectivity under defined conditions?	Repeatable measured observations, including negative and inconclusive runs	Only the best run is retained or shared
Operating fit	Does useful performance hold within the intended controlled, low-temperature process and throughput requirements?	Process measurements tied to reaction conditions, conversion, yield, throughput, and energy use	Activity is reported without the process context needed to interpret it
Scale economics	Can the integrated system move toward cost parity with inexpensive oil-based plastic?	A cost and energy model tied to measured inputs, with assumptions and sensitivities exposed	Commercial viability is inferred from enzyme activity alone

Set pass, hold, and stop conditions before seeing the result. Otherwise, an interesting candidate will repeatedly earn one more experiment while the commercial requirement drifts. Relative improvement is useful for learning, but an enzyme that is twice as good as an unusable baseline may still be unusable. Every relative metric should sit beside the absolute requirement it is meant to approach.
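A small sketch makes the relative-versus-absolute point concrete. It assumes a metric where higher is better, and the `GateThreshold` and `gate_decision` names and all thresholds are hypothetical. The thresholds are fixed before the measurement is seen, and the relative improvement is always reported next to the absolute requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateThreshold:
    metric: str
    pass_at: float     # absolute requirement, set before the experiment
    stop_below: float  # below this value the candidate is dropped

    def __post_init__(self) -> None:
        if self.stop_below > self.pass_at:
            raise ValueError("stop_below must not exceed pass_at")


def gate_decision(measured: float, baseline: float, threshold: GateThreshold) -> dict:
    """Classify a measurement as pass, hold, or stop (higher values are better)."""
    if measured >= threshold.pass_at:
        decision = "pass"
    elif measured < threshold.stop_below:
        decision = "stop"
    else:
        decision = "hold"
    relative: Optional[float] = measured / baseline if baseline > 0 else None
    return {
        "metric": threshold.metric,
        "decision": decision,
        "relative_improvement": relative,
        "absolute_requirement": threshold.pass_at,
    }
```

With a made-up baseline of 0.05 and a measurement of 0.10, the candidate is twice as good as the baseline and still lands on "stop" when the requirement is 0.90 and the stop line is 0.30.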

Keep conversion, selectivity, yield, throughput, and energy per ton separate. Combining them too early into a single score can conceal the actual tradeoff. A team should be able to show why it is advancing a faster candidate with lower selectivity, or a more selective candidate with a different operating burden, without claiming that the candidates are equivalent.

Three common metric substitutions deserve direct scrutiny:

Low reaction temperature is not automatically low total energy. Count the energy demands of the complete process rather than the enzyme reaction in isolation.
Polymer conversion is not automatically usable monomer recovery. Measure whether the desired output can be recovered to the specification required downstream.
Bench performance is not automatically scaled performance. Treat increasing process scale as a new evidence gate, not a routine deployment step.

My rule is simple: model output can earn laboratory time; only measured process evidence can earn scale capital.

Plan the roadmap backward from cost parity

The commercial benchmark is unforgiving. Enzymatic recycling ultimately has to compete with inexpensive oil-based plastic production. A greener reaction that cannot approach a viable delivered cost will remain dependent on special conditions rather than becoming a broadly adopted circular process.

Build the economic model while discovery is still underway. At minimum, separate these cost lines:

Feedstock acquisition, sorting, and rejected material.
Preparation required before the enzyme can act on the target polymer.
Enzyme production, delivery, useful lifetime, and replacement.
Reactor capacity, reaction time, process control, and energy.
Monomer recovery and purification.
Waste handling, downtime, and variability in plant utilization.

Do not wait for perfect values. Use ranges, label each input as measured or assumed, and run sensitivity analysis. The purpose is to identify which uncertain variable can kill the business case. If enzyme lifetime dominates cost, another candidate-generation run may be rational. If purification dominates, generating thousands of additional sequences may be a distraction from the real constraint.
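A one-at-a-time sensitivity pass is enough to start. The sketch below assumes a deliberately simple cost model in which cost per ton is the sum of the cost lines; `CostInput`, `additive_cost`, and `sensitivity` are hypothetical names, and a real model would replace `additive_cost` with the team's own function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CostInput:
    name: str
    low: float
    base: float
    high: float
    measured: bool  # False means the value is still a design assumption


def additive_cost(values: Dict[str, float]) -> float:
    """Toy model: cost per ton is the plain sum of the cost lines."""
    return sum(values.values())


def sensitivity(
    inputs: List[CostInput],
    model: Callable[[Dict[str, float]], float] = additive_cost,
) -> List[Tuple[str, float, bool]]:
    """Vary one input from low to high with the others at base; rank by swing in total cost."""
    base = {i.name: i.base for i in inputs}
    rows = []
    for i in inputs:
        low_total = model({**base, i.name: i.low})
        high_total = model({**base, i.name: i.high})
        rows.append((i.name, abs(high_total - low_total), i.measured))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The first row of the result is the variable most able to kill the business case. If that row is also flagged as assumed and not measured, it is a strong candidate for the next experiment.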

Pair every scientific milestone with an industrial question:

Discovery gate: Is activity and selectivity reproducible enough to justify process work?
Process gate: Does the candidate perform inside the intended operating envelope rather than only under a convenient assay condition?
Feedstock gate: Does performance survive representative material formats and mixtures, including difficult packaging such as clamshells?
Demonstration gate: Can the system sustain the required material flow, output quality, and energy profile at a scale that tests the major engineering assumptions?
Commercial gate: Does the cost case remain credible when feedstock composition, utilization, throughput, and other sensitive inputs move away from the preferred case?

A planned 5,000-ton demonstration plant in California illustrates why demonstration capacity belongs on the product roadmap. A plant is not simply a larger laboratory. It tests whether biology, equipment, controls, feedstock variability, and recovery operations behave as an integrated product.

Before committing meaningful scale capital, ask six kill questions:

Which assumption has the largest effect on delivered cost per ton?
Which inputs are measured, and which still come from a design estimate?
At what physical scale was each important input measured?
What fails first when the feedstock mix changes?
If enzyme performance improves as planned, which downstream step becomes the bottleneck?
Which observed result will stop, narrow, or materially redesign the program?

Expansion into additional plastics should follow the same discipline. Enzyme selectivity creates a plausible path toward enzyme blends for mixed streams, and new plastic types and mixed-plastic blends remain important development directions. Treat each added polymer as a new product vertical with its own input contract, assays, process interactions, recovery requirements, and economics. A new enzyme is not automatically a low-cost extension of the first process.

Key takeaways for your next roadmap review

- Define success as repeatable recovery of specified monomers, not the generation of novel enzyme sequences.
- Run discovery as a closed loop connecting product constraints, AI generation, computational gates, wet-lab measurements, and process feedback.
- Treat proprietary experimental context, including failures, as the data asset; candidate count alone is not a defensible moat.
- Use separate gates for design compliance, laboratory performance, operating fit, and scale economics.
- Work backward from cost parity and direct the next experiment toward the assumption that most threatens the integrated business case.

For your next review, ask the team to bring one page containing the input and output contracts, a diagram of the learning loop, the current stage-gate thresholds, the experimental data schema, and a cost sensitivity model with measured and assumed inputs clearly separated. Every roadmap item should change one of those artifacts or produce evidence for a named decision.

If the team cannot fill those fields yet, that is the immediate product work. The first defensible milestone is one traceable loop from a defined industrial problem through candidate generation, laboratory measurement, and an updated cost model. Repeat that loop with increasing realism before increasing capital exposure. That is how you determine whether programmable biology is becoming an industrial recycling product rather than remaining an impressive AI demonstration.

References

Product Talk – How AI-Designed Enzymes and Agentic AI Could Finally Make Plastic Truly Recyclable

May 14, 2026

From Internal FinOps Agents to Customer-Embedded Optimization

Your cloud-cost agent can identify the line item that moved and still fail to change a single decision. The gap appears after the diagnosis: the recommendation arrives without the product, pricing, ownership, and risk context needed to act.

If you are taking an internal FinOps capability into the customer experience, design for a closed decision loop. The goal is not autonomous cost cutting. It is a governed system that connects spend to customer value, recommends the next move, and proves whether the move worked.

Design a decision loop, not another cost dashboard

Start by naming the decision your product will improve. A broad promise such as “optimize cloud spend” gives the agent no useful boundary. A better contract is: detect a material change in workload cost, identify the most plausible driver, propose one permitted response, route it to the right owner, and verify the effect.

Draw the product boundary around an outcome

The operating loop is simple to describe: observe, explain, propose, authorize, execute, and verify. A dashboard normally stops at observe or explain. An agentic FinOps workflow carries evidence into a recommendation and then closes the loop with an approved action and post-action telemetry.

Agentic does not mean unrestricted. It means the agent can select the next permitted step based on context. Deterministic services should still perform calculations, enforce policies, check permissions, and execute infrastructure changes. Use the model where interpretation is valuable: reconciling signals, building a driver narrative, identifying missing context, explaining tradeoffs, and routing a decision.

That distinction matters in FinOps. A model should not improvise a billing calculation, invent a price, or bypass a commitment policy. If a calculation has one correct result, compute it in code and give the result to the agent as evidence.

Build four layers with explicit responsibilities

Evidence layer: Billing exports, usage metering, observability, product telemetry, pricing logic, feature flags, deployment activity, environment metadata, customer segmentation, and ownership records.
Reasoning layer: Driver trees, anomaly triage, competing explanations, confidence evidence, and recommendation selection.
Action layer: Policy checks, approval routing, change preparation, execution, rollback, and escalation.
Learning layer: Post-action telemetry, realized outcomes, agent evaluations, customer feedback, and recurring patterns that belong in the product roadmap.

A retrieval-first pipeline that combines billing, usage, observability, product, and go-to-market context is more useful than a large prompt containing a monthly cost export. Retrieve the records needed for the current decision and preserve their lineage. Every recommendation should reveal which records were used, when they were updated, which pricing assumptions applied, and what the agent could not retrieve.

Customer-facing retrieval adds another non-negotiable boundary: tenant isolation must be enforced before context reaches the model. Do not rely on a prompt to prevent cross-customer disclosure. Access control belongs in the retrieval and service layers, with the resulting access decision recorded in the audit trail.
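A minimal sketch of that boundary, assuming an in-memory list stands in for the real store: the tenant filter runs in the service function, and the access decision is appended to an audit log before any prompt is assembled. The `Record` type and `retrieve_for_tenant` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Record:
    tenant_id: str
    source: str  # for example "billing" or "usage"
    text: str


def retrieve_for_tenant(
    store: List[Record], tenant_id: str, sources: Set[str], audit_log: List[dict]
) -> List[Record]:
    """Filter by tenant and source in the service layer, before context reaches the model."""
    allowed = [r for r in store if r.tenant_id == tenant_id and r.source in sources]
    audit_log.append({
        "tenant_id": tenant_id,
        "sources": sorted(sources),
        "returned": len(allowed),
        "withheld": len(store) - len(allowed),
    })
    return allowed
```

In production the filter would normally be pushed into the query itself so that other tenants' rows are never loaded; the in-memory comparison here only illustrates where the check sits relative to the model.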

Start with one anomaly and one reversible response

Your first release does not need to optimize every cloud service. A practical thin slice is anomaly detection plus one high-leverage remediation path. For example, the agent might detect a change in non-production workload cost, connect it to a schedule change, prepare a schedule correction, request approval from the workload owner, and monitor the next usage window.

Choose a first action that is bounded and reversible. A scheduling correction is easier to inspect and undo than a long-term financial commitment or a production capacity change. The purpose of the thin slice is to prove the whole operating loop, not merely the anomaly model.

Make every recommendation safe enough to act on

A recommendation without an execution envelope is an opinion. It may be correct, but the recipient still has to reconstruct the evidence, find the owner, assess the downside, and decide how to validate it. That is where apparently intelligent systems create more work than they remove.

Use a recommendation contract

Treat every agent recommendation as a structured product object. At minimum, require these fields:

Decision: The exact choice the recipient is being asked to make.
Scope: The account, workload, service, environment, and time window affected.
Owner: The person or role accountable for the workload and the person authorized to approve the action.
Evidence: Links to the billing, usage, observability, deployment, and product records that support the diagnosis, including their freshness.
Driver path: The causal chain the agent believes explains the change, plus material alternative explanations it considered.
Proposed action: The change, its expected mechanism, and any assumptions behind an estimated effect. If the effect cannot be estimated reliably, say that it is unknown.
Confidence and unknowns: How confident the agent is and which evaluation results support that confidence, what context is missing, and which conditions would invalidate the recommendation.
Execution envelope: Policy checks, blast radius, approver, expiration, rollback procedure, and escalation path.
Verification plan: The telemetry, observation window, success condition, and stop condition used after the action.

The expiration field is easy to overlook. Cloud state changes quickly enough that an old recommendation can remain plausible after its evidence has gone stale. Expire the recommendation when its pricing, topology, deployment, or usage assumptions are no longer current. Force a fresh retrieval before execution.
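One way to make the contract and its expiry enforceable is to model the recommendation as a typed object with a pre-execution check. The sketch below is illustrative only: the class covers a subset of the fields listed above, and the field names, the six-hour `max_evidence_age`, and the `blocking_reasons` method are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Recommendation:
    decision: str
    scope: str
    owner: str
    approver: str
    evidence: dict             # record link -> time the record was last updated
    driver_path: str
    proposed_action: str
    unknowns: list
    rollback: str
    verification_plan: str
    expires_at: datetime
    max_evidence_age: timedelta = timedelta(hours=6)  # assumed freshness budget

    def blocking_reasons(self, now):
        """Return the reasons this recommendation must not be executed yet."""
        reasons = []
        if now >= self.expires_at:
            reasons.append("expired")
        stale = [link for link, updated in self.evidence.items()
                 if now - updated > self.max_evidence_age]
        if stale:
            reasons.append(f"stale evidence: {sorted(stale)}")
        if not self.rollback:
            reasons.append("no rollback procedure")
        return reasons


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
rec = Recommendation(
    decision="Restore the nightly shutdown schedule for staging?",
    scope="staging / batch workers / last 7 days",
    owner="workload-owner",
    approver="workload-owner",
    evidence={"billing/export-1": now - timedelta(hours=1)},
    driver_path="schedule change -> workers left running overnight",
    proposed_action="Re-enable the previous schedule",
    unknowns=["effect on the weekend batch job is unknown"],
    rollback="Disable the schedule again",
    verification_plan="Compare the next usage window with the baseline",
    expires_at=now + timedelta(hours=4),
)
```

An empty list from `rec.blocking_reasons(now)` means the check passed at that moment. Any non-empty result should force a fresh retrieval rather than an override.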

Grant autonomy by action class

Do not give an agent one global autonomy setting. Earn autonomy independently for each action class:

Observe: Detect and organize a possible anomaly.
Explain: Build a driver tree and expose supporting evidence without proposing a change.
Recommend: Propose an action while a human retains approval and execution.
Prepare: Generate a change plan or dry run, but require an authorized owner to apply it.
Execute within policy: Apply a reversible, bounded action only when the policy engine, permissions, evidence freshness, and rollback checks all pass.

Purchasing a cloud commitment or altering production resources can create real financial or availability exposure. Keep finance and service owners in the approval path until confidence evidence and post-action telemetry demonstrate reliable performance for that specific intervention. Good results on anomaly explanations do not establish that the same agent is safe to execute infrastructure changes.
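A per-class autonomy ceiling can be expressed as a small lookup plus a gate. This is a sketch under assumptions: the action-class names and their ceilings in `CEILINGS` are hypothetical examples, and unknown classes default to the lowest level.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    OBSERVE = 1
    EXPLAIN = 2
    RECOMMEND = 3
    PREPARE = 4
    EXECUTE_WITHIN_POLICY = 5


# Hypothetical ceilings: each action class earns its own level.
CEILINGS = {
    "nonprod_schedule_correction": Autonomy.EXECUTE_WITHIN_POLICY,
    "production_resize": Autonomy.PREPARE,
    "commitment_purchase": Autonomy.RECOMMEND,
}


def may_execute(action_class, *, policy_passed, permitted,
                evidence_fresh, rollback_ready):
    """Allow execution only for classes that earned it and only if all checks pass."""
    ceiling = CEILINGS.get(action_class, Autonomy.OBSERVE)
    if ceiling < Autonomy.EXECUTE_WITHIN_POLICY:
        return False
    return all((policy_passed, permitted, evidence_fresh, rollback_ready))


checks = dict(policy_passed=True, permitted=True,
              evidence_fresh=True, rollback_ready=True)
can_fix_schedule = may_execute("nonprod_schedule_correction", **checks)
can_buy_commitment = may_execute("commitment_purchase", **checks)
```

With every check passing, the schedule correction is allowed and the commitment purchase is not, because the second class has not earned execution rights.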

Governance should be visible in the product, not left in a policy document. Show the approver which data was accessed, which rules passed, who changed the recommendation, what action ran, and what happened afterward. Privacy-by-design, data controls, and transparent decision logs are part of the user experience when the system influences money and production infrastructure.

Evaluate the decision loop, not the prose

A polished explanation is not evidence of a useful agent. Build evaluations around the failure modes that can block or distort a decision:

Did the recommendation use the correct customer, workload, environment, price, and time window?
Can each material claim be traced to an underlying record?
Does the driver path match known cases, including cases with several plausible causes?
Does the agent abstain when ownership, telemetry, or pricing context is missing?
Did approval routing and policy enforcement behave correctly?
Can the recipient perform the proposed action without reconstructing missing steps?
Did post-action telemetry confirm the expected direction of change without creating an unacceptable operational tradeoff?

Put retrieval changes, prompts, policies, and tools through the same delivery discipline as application code. Eval-driven development, CI/CD, and a weekly shipping cadence make regressions visible before a persuasive but poorly grounded recommendation reaches an operator or customer.
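Several of these questions can be checked mechanically in CI. The sketch below covers three of them (correct scope, traceable claims, and abstention) against a labeled case. The case and output shapes are assumptions made for illustration; a real harness would use your own schema.

```python
SCOPE_KEYS = ("customer", "workload", "environment", "time_window")


def evaluate(case, output):
    """Return a list of failures for one agent output against one labeled case."""
    if case["must_abstain"]:
        return [] if output.get("abstained") else ["should have abstained"]
    if output.get("abstained"):
        return ["abstained despite sufficient context"]
    failures = [f"wrong {key}" for key in SCOPE_KEYS
                if output["scope"].get(key) != case["scope"][key]]
    # Every material claim must point at one or more underlying records.
    untraced = [c["text"] for c in output["claims"] if not c.get("record_ids")]
    if untraced:
        failures.append(f"untraceable claims: {untraced}")
    return failures


case = {
    "must_abstain": False,
    "scope": {"customer": "acme", "workload": "batch",
              "environment": "staging", "time_window": "7d"},
}
output = {
    "scope": {"customer": "acme", "workload": "batch",
              "environment": "production", "time_window": "7d"},
    "claims": [
        {"text": "Schedule was disabled", "record_ids": ["deploy-17"]},
        {"text": "Spend will fall next week", "record_ids": []},
    ],
}
failures = evaluate(case, output)
```

Here the output names the wrong environment and makes one claim with no supporting record, so both failures are reported however polished the explanation reads.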

Embed the capability with customers before scaling it

The first customer version should not be a general-purpose cost chatbot. It should be a narrow, product-assisted engineering motion in which a Forward Deployed Engineer, or FDE, helps the customer connect product usage, cloud architecture, and cost-to-value.

Choose a small pod and customers that can teach you

A sensible starting shape is one FDE pod focused on two or three high-potential customers. High potential should not mean merely the largest cloud bill. Select customers where the team can access the necessary evidence, an accountable sponsor can authorize changes, the problem is likely to recur, and the customer agrees to clear data and governance boundaries.

Evidence readiness: Billing, metering, observability, pricing, and deployment context can be joined without weeks of manual reconciliation.
Decision access: An engineering, product, or finance owner can approve an intervention and explain the operational constraints.
Learning value: The problem represents a pattern that may apply beyond one account.
Measurability: The customer and FDE can agree on a cost-to-value measure before making a change.
Governance fit: Data access, retention, tenant isolation, approvals, and audit expectations are explicit.

If any of these conditions is absent, the engagement may still be commercially important, but it is a weak environment for deciding whether the agentic product works. Separate account urgency from product-learning quality.

Run a customer optimization loop that produces reusable knowledge

Define the value unit. Agree on what an active workload or valuable unit of product usage means. Total spend alone cannot distinguish efficient growth from contraction.
Establish the baseline. Record current cost per active workload, time-to-first-value, relevant deployment behavior, and the constraints the customer will not trade away.
Build the driver tree. Connect the spend change to services, environments, releases, product behavior, and customer usage. Surface gaps instead of filling them with assumptions.
Select one intervention. Prefer the smallest action that can test the diagnosis. Document the expected mechanism, approver, risk, and rollback before execution.
Verify the outcome. Compare post-action telemetry with the agreed baseline. Record savings, unit-economics movement, performance effects, adoption effects, and unintended consequences separately.
Codify the pattern. Capture the inputs, decision rule, action, exceptions, safeguards, and evidence required to repeat the intervention.
Send a weekly learning packet to product. Include successful patterns, failed diagnoses, missing platform capabilities, customer language, and recommendations that still depend on FDE judgment.

Within a quarter, this loop should make it possible to distinguish interventions that can be automated, patterns that should become native product features, and problems that still require deeper solutions engineering. The point is not to eliminate the FDE. It is to reserve that scarce judgment for cases where ambiguity and customer context remain material.

Make the commercial incentive legible

Customer-embedded optimization creates an obvious trust question for a consumption business: does the vendor want the customer to spend less or consume more? The clean answer is to optimize cost-to-value rather than either number in isolation.

A customer’s total cloud cost can rise while cost per active workload improves because valuable usage is growing. Total cost can also fall because the customer is using less of the product, which is not an optimization success. Label the outcome precisely: lower total spend, lower unit cost, avoided waste, shifted commitment, higher useful consumption, or reduced operational risk. Do not collapse these different effects into a generic savings claim.
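A short worked example shows why the two numbers must be read together. All figures below are invented for illustration, and `describe` is a hypothetical helper that labels only two of the outcomes named above.

```python
def cost_per_active_workload(total_cost, active_workloads):
    return total_cost / active_workloads


def describe(before, after):
    """Label total spend and unit cost separately instead of claiming 'savings'."""
    total = "lower total spend" if after[0] < before[0] else "higher total spend"
    unit_before = cost_per_active_workload(*before)
    unit_after = cost_per_active_workload(*after)
    unit = "lower unit cost" if unit_after < unit_before else "higher unit cost"
    return f"{total}, {unit}"


# (total cost, active workloads) per period; illustrative numbers only.
baseline = (100_000, 400)      # 250 per active workload
growth = (120_000, 600)        # 200 per active workload
contraction = (80_000, 250)    # 320 per active workload

growth_label = describe(baseline, growth)
contraction_label = describe(baseline, contraction)
```

In the growth case total spend rises by 20% while unit cost falls from 250 to 200. In the contraction case total spend falls while unit cost rises to 320, which is why a lower bill alone is not evidence of optimization.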

The FDE is also a trust boundary. The role should explain the recommendation, expose assumptions, and represent the customer’s constraints. It should not become a human interface for repetitive exports and one-off queries that the platform ought to handle.

Turn field work into a roadmap, not permanent custom service

A strong FDE can make a weak product look successful by solving every gap manually. That is useful for an individual customer and dangerous for product strategy. You need an explicit test for moving work from the field into an agent workflow or native platform capability.

Apply a productization test to every recurring intervention

Can the same signal be retrieved reliably across the intended customer segment?
Can the decision logic be expressed without undocumented customer-specific knowledge?
Can the action be bounded by a stable policy, approval path, and rollback procedure?
Can the outcome be measured with telemetry that exists before and after the change?
Do the likely exceptions fit a review workflow, or do they fundamentally change the decision?

If the signal, decision, action, and measurement are repeatable, make the pattern a native feature or automated playbook. If the evidence is repeatable but judgment varies, keep an agentic workflow with human review. If the action carries high financial or availability risk, keep the FDE and accountable owner in the loop. If the pattern is a one-off, document it but resist turning it into product scope.
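The routing rule above can be written down so that reviews apply it consistently. This is a sketch, not a prescribed policy: the parameter names are hypothetical, and the precedence (one-off first, then risk, then repeatability) and the final fallback are assumptions, because the text does not say which condition wins when several apply.

```python
def productization_path(*, recurring, high_risk, signal_repeatable,
                        decision_codifiable, action_bounded, outcome_measurable):
    """Map the productization test answers to a destination for the work."""
    if not recurring:
        return "document only"
    if high_risk:
        return "keep FDE and accountable owner in the loop"
    if signal_repeatable and outcome_measurable:
        if decision_codifiable and action_bounded:
            return "native feature or automated playbook"
        # Evidence is repeatable but judgment still varies.
        return "agentic workflow with human review"
    # Assumed fallback: the evidence itself is not yet repeatable.
    return "stays in the field until evidence is repeatable"


schedule_fix = productization_path(
    recurring=True, high_risk=False, signal_repeatable=True,
    decision_codifiable=True, action_bounded=True, outcome_measurable=True)
commitment = productization_path(
    recurring=True, high_risk=True, signal_repeatable=True,
    decision_codifiable=True, action_bounded=True, outcome_measurable=True)
```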

Use a scorecard that reveals where the loop is breaking

Dimension	Measure	Decision it informs
Insight speed	Time-to-insight from a material spend change	Is the system finding the issue early enough to change an engineering decision?
Action quality	Recommendations with evidence, an owner, a permitted action, and a verification plan	Is the agent producing executable decisions or polished commentary?
Economics	Realized savings per recommendation and cost per active workload	Did the intervention improve spend or unit economics for the intended value unit?
Reliability	Post-action effects, abstentions, rollbacks, and policy failures by action class	Which interventions have earned more autonomy, and which need tighter controls?
Customer outcome	Time-to-first-value and NRR movement on FDE-supported accounts	Is the motion improving adoption and durable account value? NRR is directional evidence, not proof of causation.
Product leverage	Recurring field patterns converted into features, guardrails, or in-product guidance	Is customer work compounding into a scalable product?

Recommendation volume, prompt length, and agent activity are operating diagnostics, not business outcomes. A quiet system that changes a few high-value decisions can be more useful than an active system that produces hundreds of unactioned findings.

Make build versus buy a component decision

Do not treat the choice as one monolithic platform decision. Separate commodity capabilities from the context and workflow that create differentiation. Evaluate billing ingestion, normalization, anomaly detection, the context model, pricing logic, recommendation policy, approval routing, execution, and agent analytics independently.

Does the capability require knowledge of your architecture, pricing model, feature flags, customer usage, or deployment behavior?
Can an external component preserve evidence lineage, tenant isolation, and decision logs at the level your customers require?
Is the capability a generic input to the product, or is it where your product makes a differentiated decision?
Can your team evaluate and operate the component continuously, including regressions after model, prompt, policy, or data changes?
Will the component reduce time-to-value without trapping critical customer and pricing context in an opaque workflow?

Unique architecture, pricing, and growth loops can justify building the context and decision layers. But weak tagging, unclear ownership, and missing observability undermine either path. Fix those foundations before expecting an in-house or purchased agent to produce precise optimization decisions.

Give the core product to a product trio spanning product management, engineering, and FinOps. Bring FDE, customer success, SRE, finance, and security into discovery and evaluation where their decisions are affected. Field requests should enter the roadmap with evidence of recurrence, strategic importance, or platform leverage rather than becoming an informal side door to custom development.

Key takeaways

Define the product as observe, explain, propose, authorize, execute, and verify. Diagnosis alone is not an agentic outcome.
Retrieve billing, usage, observability, pricing, product, and ownership context for each decision, with lineage and tenant boundaries enforced outside the prompt.
Represent every recommendation as a governed contract containing evidence, owner, action, risk, approval, rollback, expiration, and verification.
Grant autonomy by action class. Keep humans in the loop for commitments and production changes until that intervention has reliable post-action evidence.
Start customer delivery with one FDE pod and two or three customers that offer evidence access, decision access, measurable value, and reusable learning.
Measure time-to-insight, realized outcomes, unit economics, reliability, customer value, and productized patterns instead of counting recommendations.

This week, choose one recurring cost anomaly and map the complete path from underlying records to a verified action. Name the owner, approval rule, rollback, and success telemetry before improving the prompt. Do not add a second workflow until the first can explain what changed, why the action was allowed, and whether it improved customer cost-to-value.

May 11, 2026

How to Scale Session Replay Without Sacrificing Privacy

You want session replay on more journeys because the blind spots are expensive. A funnel can show where users leave, but it cannot show whether they encountered a broken control, a confusing message, a layout shift, or an error that never reached your analytics. Replay can turn those behavioral signals into enough context to make a product decision.

The hard part is expanding that visibility without collecting data you should not have, degrading the experience you are trying to understand, or filling storage with recordings nobody will use. The answer is not a single masking setting. You need a capture contract, a delivery architecture, a sampling model, and an operating scorecard that treat performance, fidelity, and privacy as one system.

Set the capture contract before you expand coverage

Replay programs often begin with a coverage question: what percentage of sessions should you record? That is the wrong first question. Start with the decision you expect the recording to change. If nobody can name that decision, more coverage will create more cost and exposure without producing more insight.

Write a capture contract for each product surface. This is a short, reviewable specification that connects a business purpose to technical controls. It should answer:

What question is replay meant to answer? Examples include diagnosing failed activation, explaining an error spike, or finding friction in a conversion step.
Which routes, components, and user cohorts are in scope? Name them. Do not approve an undefined all-product rollout.
Which data is prohibited? Include form values, credentials, payment details, message content, health information, account-recovery data, and any product-specific sensitive fields that apply.
What consent state permits capture? The recorder should not initialize before the required state is known. Withdrawal should stop capture and prevent queued data from being sent.
Who can watch a replay? Define roles by purpose. Product discovery, support investigation, engineering diagnosis, and administration do not automatically require identical access.
How long will the data remain available? Tie retention to the stated purpose rather than keeping replay indefinitely because storage permits it.
What sampling rule applies? State the baseline rate, targeted cohorts, exclusions, temporary overrides, owner, and expiry condition.

Selective capture, redaction, consent, retention, role-based access, and environment-aware sampling are separate controls. Treating one of them as a substitute for the others creates predictable gaps. Masking does not grant consent. Restricted access does not justify excessive collection. Short retention does not make an exposed credential harmless.

Apply those controls as close to collection as possible. A web replay is commonly reconstructed from serialized page state, changes, and interaction events. The privacy risk therefore sits in the data leaving the browser, not only in what the player later displays. A value hidden during playback may already exist in an outbound payload or stored record.

A useful default is to block text and input values, then allowlist only fields proven safe and necessary. Add route-level and component-level exclusions for sensitive surfaces. Use a separate, time-bounded approval for diagnostic capture that needs greater fidelity. I would reject a policy that merely says to mask personal information: the term depends on context, and engineers cannot reliably implement an undefined category.
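The block-by-default rule is small enough to state as code. The sketch below shows the decision logic only. In a real recorder the same check would run in the browser before serialization, and the routes, field names, and `MASK` placeholder here are hypothetical.

```python
MASK = "***"

# Hypothetical policy: only fields proven safe and necessary are allowlisted.
ALLOWLIST = {("/checkout", "shipping_method")}
EXCLUDED_ROUTES = {"/account/recovery", "/payment"}


def capture_value(route, field, value):
    """Decide what, if anything, is serialized for one field."""
    if route in EXCLUDED_ROUTES:
        return None        # nothing from this surface is captured
    if (route, field) in ALLOWLIST:
        return value       # explicitly approved
    return MASK            # default: the value never leaves the page


shipping = capture_value("/checkout", "shipping_method", "express")
email = capture_value("/checkout", "email", "person@example.test")
recovery = capture_value("/account/recovery", "answer", "first pet")
```

A new field added to `/checkout` is masked until someone adds it to the allowlist, so an omission results in less capture rather than more.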

Test the contract against the raw system, not just the player. Seed a non-production fixture page with recognizable test values, exercise every relevant component state, inspect the browser payload, inspect the stored representation, and verify that exports and downstream tools preserve the restriction. If a prohibited test value crosses the collection boundary, the control has failed even if the replay screen obscures it.
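A seeded-value check can run in CI against captured payloads. This is a minimal sketch: the canary strings and payload shape are made up, and it scans plain JSON only, so compressed or encoded payloads would need to be decoded first.

```python
import json

# Recognizable test values typed into the non-production fixture page.
CANARIES = ["canary-email@example.test", "CANARY-4111-TEST"]


def find_leaks(payloads, canaries=CANARIES):
    """Return (payload index, canary) pairs for every seeded value that escaped."""
    leaks = []
    for index, payload in enumerate(payloads):
        raw = json.dumps(payload)
        for canary in canaries:
            if canary in raw:
                leaks.append((index, canary))
    return leaks


outbound = [
    {"type": "input", "field": "email", "value": "***"},
    {"type": "dom", "text": "Card CANARY-4111-TEST saved"},
]
leaks = find_leaks(outbound)
```

The second payload fails the contract even if the player would later hide that text. Running the same check against the stored representation and any exports covers the remaining boundaries named above.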

Consent and retention obligations vary by jurisdiction, contract, and data type. Your privacy or legal owner must approve those rules for the markets you serve. Engineering can enforce an approved policy; it cannot infer that policy from a generic replay configuration.

Keep capture off the user’s critical path

Scalable replay starts in the browser, where your product competes with the recorder for main-thread time, memory, and bandwidth. A backend that can ingest billions of events does not help if the recorder makes an interaction sluggish or loses the DOM changes needed to explain the problem.

The delivery design should make page experience more important than recording completeness. Decoupled capture and delivery, adaptive batching, compression, backpressure controls, and priority handling provide the basic pattern:

Capture the minimum useful representation. Filter excluded nodes and values before serialization. Avoid collecting detail that no approved use case needs.
Separate recording from transport. The capture path should write to a bounded queue rather than waiting for a network request. Upload latency must not become interaction latency.
Batch adaptively. Small batches can reduce delay during quiet periods, while larger compressed batches can reduce request overhead during sustained activity. The policy should respond to queue pressure and network conditions.
Define backpressure behavior. When production exceeds delivery capacity, the recorder needs a documented degradation order. Preserve navigation, consent changes, errors, and the structural events required for reconstruction before lower-value detail. Never freeze the page to protect the replay.
Bound long sessions. Flush incrementally, cap memory use, and make reconnection behavior explicit. A queue that grows for the life of a tab will eventually turn a delivery problem into a page-performance problem.
Make partial data visible. Mark gaps, dropped segments, and incomplete uploads. A replay that silently appears complete is more dangerous than one that clearly communicates its limits.

Backpressure deserves special attention because it forces a product decision disguised as an implementation detail. If the system cannot retain everything, what must survive? The answer should come from the capture contract. An error marker without enough surrounding state may be useless, but exhaustive cursor movement may be expendable. Rank event classes before an incident forces the recorder to choose implicitly.
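Ranking event classes ahead of time can look like the bounded queue below, which evicts the lowest-value event when it overflows and counts what it dropped so gaps stay visible. The event types and their ranks in `PRIORITY` are assumptions. A production recorder would implement the same idea in the browser.

```python
# Lower number = more important to preserve. Ranks are illustrative.
PRIORITY = {
    "consent_change": 0, "navigation": 0, "error": 0,
    "dom_structure": 1, "input": 2, "cursor_move": 3,
}
LOWEST_VALUE = 3


class BoundedReplayQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []
        self.dropped = {}   # event type -> count, surfaced as replay gaps

    def push(self, event):
        self.events.append(event)
        if len(self.events) <= self.capacity:
            return
        # Evict the lowest-value event; among equals, evict the oldest.
        victim_index = max(
            range(len(self.events)),
            key=lambda i: (PRIORITY.get(self.events[i]["type"], LOWEST_VALUE), -i),
        )
        victim = self.events.pop(victim_index)
        self.dropped[victim["type"]] = self.dropped.get(victim["type"], 0) + 1


queue = BoundedReplayQueue(capacity=3)
for kind in ["cursor_move", "error", "cursor_move", "navigation", "dom_structure"]:
    queue.push({"type": kind})
kept = [event["type"] for event in queue.events]
```

The queue never grows past its capacity, and the page is never asked to wait. Cursor movement is sacrificed first, while the error and navigation events survive.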

Do not validate the client only on a fast laptop and stable connection. Use representative complex pages and test with replay on and off under CPU pressure, constrained networking, rapid DOM change, background-tab transitions, reconnection, and long sessions. Compare Web Vitals, long tasks, memory growth, bytes transferred, queue drops, upload completion, and playback completeness. Long sessions, traffic spikes, complex interactions, and variable networks are precisely where an apparently sound design reveals its failure modes.

There is no universal acceptable overhead that fits every product. Set budgets relative to your production baseline and the importance of the journey. A small regression on a frequently used mobile activation path may matter more than a larger regression on an internal administration page. Segment the results by route, browser, device class, network condition, and session length so averages do not hide the users most affected.

Sample for decisions, not for a warehouse of footage

A single global sample rate is easy to configure and hard to defend. It spends collection capacity uniformly even though product questions are not uniformly valuable. It can also miss rare failures while overrepresenting routine sessions that nobody will watch.

Use a portfolio of sampling modes:

Random baseline sampling gives you a less biased view of ordinary behavior and lets you notice problems you did not predefine.
Cohort sampling increases visibility for a defined population such as new users, a browser family, a release cohort, or users entering a critical journey.
Signal-based sampling concentrates diagnosis around errors, failed steps, rage clicks, dead clicks, abnormal exits, or other instrumented friction signals.
Temporary diagnostic sampling raises fidelity for a narrow incident or release window, with an owner and an automatic expiry condition.
Hard exclusions override every sampling mode. A high-value investigation is not permission to collect from a prohibited surface or consent state.
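The portfolio can be combined into one decision function that also records why a session was captured. This is a sketch under assumptions: the session and rule shapes are hypothetical, `now` and `expires_at` are plain epoch seconds, and hashing the session ID is just one way to get a stable pseudo-random fraction. The baseline check runs before the targeted modes so that sessions labeled `random_baseline` remain an unbiased sample.

```python
import hashlib


def stable_fraction(session_id, salt):
    """Map a session ID to a repeatable value in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sampling_decision(session, rules, now):
    """Return (record?, reason). Hard exclusions win over every other mode."""
    if not session["consent_granted"] or session["route"] in rules["excluded_routes"]:
        return False, "hard_exclusion"
    fraction = stable_fraction(session["id"], rules["salt"])
    if fraction < rules["baseline_rate"]:
        return True, "random_baseline"
    override = rules.get("diagnostic_override")
    if override and session["route"] == override["route"] and now < override["expires_at"]:
        return True, "temporary_diagnostic"
    if set(session.get("signals", ())) & rules["signals"]:
        return True, "signal"
    cohort_rate = rules["cohort_rates"].get(session.get("cohort"))
    if cohort_rate is not None and fraction < cohort_rate:
        return True, "cohort"
    return False, "not_sampled"


rules = {
    "salt": "2026-q1", "baseline_rate": 0.0,
    "excluded_routes": {"/payment"}, "signals": {"rage_click", "error"},
    "cohort_rates": {"new_user": 1.0},
    "diagnostic_override": {"route": "/signup", "expires_at": 1_000},
}
session = {"id": "s-1", "route": "/checkout", "consent_granted": True,
           "signals": ["error"], "cohort": "returning"}
decision = sampling_decision(session, rules, now=2_000)
```

Storing the reason alongside the recording lets you separate baseline sessions from targeted ones later. The baseline rate is set to zero here only to keep the example deterministic.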

Onboarding, activation, high-friction conversion flows, and paths with disproportionate revenue or trust impact are sensible places to begin because a clearer diagnosis can change a meaningful decision. Signals such as errors, rage clicks, dead clicks, scroll behavior, and stalled progress can then help you find the sessions worth examining.

Keep one statistical distinction clear. Targeted replay is good for explaining a known problem, but it cannot tell you how prevalent that problem is. If you record sessions because they contain an error, the resulting library will naturally make errors look common. Use analytics or a random baseline to measure frequency. Use replay to understand mechanism and context.
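A quick calculation makes the distortion concrete. The numbers below are invented: 100,000 sessions, a true error rate of 2%, every error session recorded by a signal rule, and a 1% random baseline whose error count is taken at its expected value.

```python
total_sessions = 100_000
error_sessions = total_sessions * 2 // 100          # true prevalence: 2%

# Signal-based capture records every error session.
recorded_by_signal = error_sessions

# A 1% random baseline, using expected counts for simplicity.
baseline_total = total_sessions // 100
baseline_errors = error_sessions // 100
baseline_clean = baseline_total - baseline_errors

# The replay library is the union of both capture modes.
library_total = recorded_by_signal + baseline_clean
library_error_share = recorded_by_signal / library_total

baseline_error_share = baseline_errors / baseline_total
true_error_share = error_sessions / total_sessions
```

Roughly two thirds of the library contains an error even though only 2% of sessions do. The baseline subset recovers the true rate, which is why frequency questions belong to analytics or the random sample and mechanism questions belong to targeted replay.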

A disciplined investigation looks like this:

Find a measurable change in a funnel, cohort, error rate, performance signal, or support pattern.
Define the affected population before opening replays.
Review a deliberately selected set of relevant sessions and record recurring observable behaviors, not interpretations of user intent.
Turn those observations into a falsifiable product or technical hypothesis.
Instrument, release, or experiment so the hypothesis can be measured outside the replay player.

This prevents two common mistakes: browsing memorable sessions until a story feels true, and treating one vivid recording as evidence of market-wide demand. Replay is strongest when it explains a quantitative signal and leads back to a measurable change.

Run replay with a coupled performance, privacy, and value scorecard

Session replay is not finished when playback works. It is an operating capability with client releases, configuration changes, storage growth, access decisions, and incident risk. Give it an owner and review the system across five dimensions.

Dimension	Signals to watch	Decision the signals should trigger
User experience	Web Vitals, long tasks, main-thread work, memory growth, and replay bytes	Reduce capture detail, change delivery behavior, narrow coverage, or halt a rollout when the recorder breaks its budget
Replay fidelity	Queue drops, missing segments, incomplete uploads, event integrity, and playback reconstruction errors	Fix prioritization or transport before teams rely on incomplete recordings for decisions
Platform reliability	Ingestion failures, processing delay, retrieval latency, playback-start failures, and behavior during traffic spikes	Add capacity, repair a failing stage, or adjust sampling without shifting the problem into the browser
Privacy and governance	Redaction test failures, capture outside approved consent states, retention exceptions, and access outside approved roles	Disable affected capture, contain the data, follow the approved deletion or incident process, and repair the control before restoring it
Decision value	Investigations that reached a useful replay, time to diagnosis, time to resolution, and product hypotheses validated outside replay	Move coverage toward high-value use cases or retire collection that produces no action

These dimensions constrain each other. Aggressive compression may improve bandwidth while hurting reconstruction. More capture may improve fidelity while violating the page budget. Narrow access may improve governance while blocking the support engineers responsible for incident response. The job is not to maximize any single metric; it is to keep the entire system inside approved boundaries.

Version capture configuration like production code. A seemingly harmless selector change can expose text, remove necessary context, or increase mutation volume. Test recorder and configuration releases against fixture pages containing known sensitive values and known reconstructable interactions. Keep a rollback path.

Prepare shutdown controls before launch. You should be able to stop capture for a component, route, environment, tenant group, or the whole product without waiting for a new application release. Document who can use each control, how queued data is handled, how affected stored data is identified, and when privacy, security, support, and engineering must be involved. If collection crosses a prohibited boundary, continuing to record while the team debates ownership compounds the exposure.

Finally, connect replay operations to the workflows that consume it. Product teams need links from behavioral cohorts to relevant sessions. Support needs controlled escalation paths. Engineering and SRE need errors, network signals, layout shifts, and performance context close to the replay timeline. Connecting interaction context to observability and delivery workflows can shorten the path from an anomaly to a testable explanation, but only if the data remains trustworthy and accessible to the right roles.

Key takeaways

Approve a capture contract for each surface before approving a broader sample rate.
Redact or exclude sensitive data before it leaves the browser; a masked player is not enough.
Protect the page with decoupled delivery, bounded queues, adaptive batching, and explicit backpressure priorities.
Keep random sampling for prevalence and use targeted sampling to explain known signals.
Operate performance, fidelity, platform reliability, privacy, and decision value as a coupled scorecard.
Require scoped shutdown controls, retention handling, access ownership, and rollback before production expansion.

Before you increase replay coverage, ask for two artifacts: a one-page capture contract for the next journey and a replay-on versus replay-off test under that journey’s difficult conditions. If the team cannot show what is allowed to leave the browser, how the page stays within budget, and which decision the recordings will change, the rollout is not ready to scale.

May 7, 2026
