Tag: privacy-by-design

AI Product Leadership: Faster Learning, Safer Systems
AI-enabled product leadership is not primarily a contest to automate more work. The stronger opportunity is to shorten learning loops while improving the quality, traceability, and safety of product decisions.

Across the five source articles, a common operating model emerges: begin with bounded problems, connect AI to real customer evidence, define quality through domain expertise, and make safeguards proportional to the consequences of failure. This model applies both to internal product workflows and to customer-facing AI systems.

Move from an AI tool stack to an evidence system

The article on essential tools for product managers presents AI as a working layer across product intelligence, research, analytics, roadmapping, design, prioritization, and delivery. Its most useful implication is that tool selection should begin with the decision a team needs to improve, not with the number of AI features available.

A feedback summarizer, behavioral analytics platform, prototyping assistant, and requirements generator can each save time. Their strategic value appears when their outputs are connected: qualitative feedback helps explain observed behavior, behavioral evidence tests assumptions raised in interviews, and both inform prioritization. The product manager still has to reconcile customer pain, business outcomes, engineering effort, differentiation, and stakeholder expectations.

The practical guide to finding AI use cases reaches the same conclusion from a different direction. It recommends starting with a concrete item from everyday work, testing how AI might help, and studying the gap between the desired result and the output. It specifically proposes a 15-minute daily practice and treats an initially poor result as evidence about instructions, context, constraints, or model capability.

Together, these perspectives suggest two complementary levels of adoption. At the individual level, task-first experimentation builds judgment about what AI can do. At the team level, connected evidence workflows turn that judgment into a repeatable product operating system. Buying tools without the first creates shallow adoption; isolated personal experiments without the second produce scattered efficiency rather than organizational learning.

Use AI to deepen discovery, not to create distance from customers

The 2026 roadmap article frames roadmaps as portfolios of experiments involving products, learning methods, teaching models, and choices about what to stop doing. It argues that AI can reduce tedious discovery work and provide feedback on demanding skills, including interviewing, assumption testing, and opportunity mapping. At the same time, it warns against substituting agents or dashboards for human curiosity and direct customer contact.

That tension supplies an important boundary for AI-enabled discovery. Models can organize notes, identify recurring themes, critique an interview guide, expose possible confirmation bias, or compare evidence across sources. They cannot independently determine whether the team asked the right customers, understood the social context, or interpreted ambiguous language correctly. Those remain product and research judgments.

The safety-first consent coach described in the Override Labs article illustrates why context matters. According to that account, the nonprofit examined 2,000 Reddit posts per subreddit to validate demand and understand how vulnerable questions were expressed. The discovery material included uncertainty, shame, peer pressure, and the possibility that someone might be seeking permission rather than reflection. A conventional feature request or decontextualized summary could have obscured those conditions.

The cross-team review reinforces this point through other domains. It reports that former teachers at eSpark created evaluation rubrics based on how educators assess student work and enriched educational content with domain-specific metadata when generic embeddings produced weak matches. It also describes how local-government knowledge at Zencity changed the interpretation of sentiment, and how incident-response experience informed Incident.io’s investigation architecture. Across these examples, AI increased the importance of domain expertise because people still had to define what relevance, quality, and failure meant.

Let the consequence of failure determine the product architecture

Not every AI-assisted task needs the same controls. A weak draft of an internal stakeholder update can be reviewed and corrected cheaply. A response that could be interpreted as permission in a consent-related situation has a fundamentally different risk profile. Responsible product development begins by distinguishing those cases before selecting architecture or interaction patterns.

The Override Labs account offers the clearest high-stakes pattern. The team reportedly defined a "South star" around the worst outcome: a teenager using the product response as a green light for harmful action. The product therefore avoids giving a green-flag verdict. It runs deterministic risk classification before calling Claude, adjusts responses by risk tier, and uses a structure that validates, reflects, and invites further reflection. A licensed therapist contributed to the evaluation rubric, while positive masculinity coaches helped shape the tone.
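The ordering described above, a deterministic check first and generation second, can be sketched roughly as follows. This is a minimal illustration and not the Override Labs implementation: the tier names, the keyword lists, the fixed high-risk message, and the `generate` callback are all hypothetical, and a real product would use rules reviewed by qualified domain experts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class RiskTier(Enum):
    LOW = 1
    ELEVATED = 2
    HIGH = 3


# Hypothetical keyword rules, for illustration only.
HIGH_RISK_TERMS = ("said no", "asleep")
ELEVATED_RISK_TERMS = ("pressure", "not sure")


def classify_risk(message: str) -> RiskTier:
    """Deterministic classification: the same input always yields the same tier."""
    text = message.lower()
    if any(term in text for term in HIGH_RISK_TERMS):
        return RiskTier.HIGH
    if any(term in text for term in ELEVATED_RISK_TERMS):
        return RiskTier.ELEVATED
    return RiskTier.LOW


@dataclass
class Reply:
    tier: RiskTier
    text: str
    used_model: bool


def respond(message: str, generate: Callable[[str, RiskTier], str]) -> Reply:
    """Run the deterministic gate first; only lower tiers reach the model."""
    tier = classify_risk(message)
    if tier is RiskTier.HIGH:
        # Assumption: generation is withheld and fixed, pre-approved wording is used.
        fixed = "This sounds like a moment to pause. Consider talking with a trusted adult."
        return Reply(tier, fixed, used_model=False)
    # The tier is passed along so the generated response can be adjusted to it.
    return Reply(tier, generate(message, tier), used_model=True)
```

The point of the sketch is the ordering: the model never decides whether it should be called, so the worst-case path stays under product control.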

The underlying principle is broader than that implementation. A generative model should operate inside a product-defined safety system rather than becoming the safety system. Product leaders can translate that principle into four design questions: what outcome must never be encouraged, which decisions require deterministic handling, when should generation be constrained or withheld, and which domain experts are qualified to judge the response?

The review of AI product teams adds another trust boundary: deciding when a system should admit that it does not know. This is both a model-quality issue and a product behavior. Teams need to specify what insufficient evidence looks like, what the interface communicates in that state, and whether the user should retry, provide more context, consult a person, or stop the workflow.
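One way to make "insufficient evidence" a specified product state rather than an accident of model behavior is to encode it as an explicit decision. The sketch below is hypothetical: the `Evidence` fields, the two-source threshold, and the three next steps are assumptions chosen for illustration, not details from the review.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    ANSWER = "answer"
    ASK_FOR_CONTEXT = "ask_for_context"
    HAND_OFF_TO_PERSON = "hand_off_to_person"


@dataclass(frozen=True)
class Evidence:
    supporting_sources: int  # how many retrieved sources back the draft answer
    consequential: bool      # whether a wrong answer would be costly for the user


def decide_next_step(evidence: Evidence, min_sources: int = 2) -> NextStep:
    """Map an evidence state to what the interface should do next."""
    if evidence.supporting_sources >= min_sources:
        return NextStep.ANSWER
    if evidence.consequential:
        # Thin evidence plus high stakes: route to a person instead of guessing.
        return NextStep.HAND_OFF_TO_PERSON
    return NextStep.ASK_FOR_CONTEXT
```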

This risk-based approach avoids two unhelpful extremes. Applying high-stakes controls to every low-consequence drafting task can make experimentation needlessly heavy. Treating sensitive decisions like ordinary content generation can leave critical failure modes to probabilistic behavior. The appropriate control set follows the plausible harm, reversibility, affected population, and user’s ability to detect an error.

Make evaluation, privacy, and leadership part of delivery

The production-team review describes evaluation as an evolving operational capability rather than a final test. It reports that Stack Overflow ran about 50 experiments across five pods in three months, produced four versions of an AI-powered search product, and ultimately stopped that effort. Arize began building its Alyx agent before established agent frameworks were available, while eSpark’s former teachers learned to write evaluation code with LLM assistance. These are source-reported examples, not independently verified benchmarks, but they demonstrate how structured learning can support both shipping and stopping decisions.

Evaluation should therefore start when the use case is defined. Early rubrics can be simple: representative tasks, expected properties, unacceptable outputs, and a review process. As the product matures, teams can add risk tiers, regression sets, production observations, and explicit release criteria. The goal is not to claim that a model is universally good; it is to establish whether a particular system performs acceptably within a bounded workflow.
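An early rubric of this kind can be very small. The following sketch assumes a rubric is a list of representative prompts, each with phrases the output must contain and phrases it must never contain; the `RubricCase` and `evaluate` names are illustrative, and substring matching stands in for whatever review process a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RubricCase:
    prompt: str
    must_include: list[str] = field(default_factory=list)      # expected properties
    must_not_include: list[str] = field(default_factory=list)  # unacceptable outputs


def evaluate(system: Callable[[str], str], cases: list[RubricCase]) -> dict:
    """Run every case through the system and collect rubric violations."""
    failures = []
    for case in cases:
        output = system(case.prompt).lower()
        missing = [p for p in case.must_include if p.lower() not in output]
        forbidden = [p for p in case.must_not_include if p.lower() in output]
        if missing or forbidden:
            failures.append(
                {"prompt": case.prompt, "missing": missing, "forbidden": forbidden}
            )
    return {"passed": len(cases) - len(failures), "total": len(cases), "failures": failures}
```

Keeping the cases in version control gives the team a regression set that can grow as the product matures.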

Privacy belongs in the same product definition. The consent-coach article reports that the service uses no accounts, cookies, or cross-session tracking. That choice limits conventional retention analytics, but it also supports the trust required for a sensitive interaction. It shows that less data can be a deliberate product feature when identification or surveillance would discourage honest use.

Leadership determines whether these practices persist. The roadmap article argues that training alone does not change an organization when leaders continue to reward old behaviors. Its proposed learning model combines on-demand material, AI-generated feedback, coaching resources, and human support. The practical-use-case article similarly recommends peer demonstrations and structured practice. Both suggest that AI readiness is a management system: teams need permission to experiment, shared examples, quality standards, and leaders who reinforce evidence-based behavior.

Key takeaways
- Start with a bounded task and a defined outcome; use repeated practice to learn where AI adds leverage and where it fails.
- Connect research, feedback, behavioral data, prioritization, and delivery so that AI improves decisions rather than producing isolated artifacts.
- Keep direct customer contact and domain expertise at the center of discovery, synthesis, and quality judgment.
- Define the worst credible outcome before designing a customer-facing AI experience, then match controls to that risk.
- Build evaluation and privacy into the product operating model, including criteria for refusing, escalating, or admitting uncertainty.
- Measure AI leadership by better learning and safer outcomes, not by tool count, output volume, or automation alone.

Building the next product operating rhythm

The next step for product organizations is not a universal AI playbook. It is a disciplined rhythm in which teams choose a real problem, gather contextual evidence, define acceptable and unacceptable behavior, test a bounded intervention, and revise or stop it based on results. As AI capabilities change, that rhythm can remain stable. It gives product leaders a way to pursue faster learning without treating speed as a substitute for responsibility.

July 3, 2026
Migrate Analytics Platforms Without Chaos: 7 Proven Lessons to Plan, Move, and Land Cleanly

I’ve led and rescued more analytics migrations than I can count, and I know the pressure: every event, dashboard, and decision pipeline depends on getting it right. Migrating analytics platforms doesn’t have to be painful. The seven lessons below, drawn from Human37 and Amplitude, can help your team plan, migrate, and land cleanly.

Here’s how I approach this work so teams keep momentum, regain trust in their numbers, and accelerate product-led growth on a unified analytics platform—without the rework and stakeholder fatigue that typically follow.

Lesson 1 — Start with outcomes, not events. Before moving a single event, I align leaders on the questions we must answer and the decisions we must speed up: activation, retention, and expansion. I map those goals to a simple driver tree, then back into the behavioral analytics we need. This trims noise, tightens scope, and ensures Amplitude analytics (or any destination) is instrumented for decisions, not vanity metrics.

Lesson 2 — Audit and map your data with rigor. I inventory current events, properties, IDs, and sources, then define a target schema with clear naming conventions, ownership, and versioning. Data governance and privacy-by-design are non-negotiable: we separate PII, document consent paths, and remove legacy debris. This step prevents schema drift and makes platform scalability sustainable.
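Separating PII can be enforced at the point where event properties are assembled. The sketch below is a minimal illustration under stated assumptions: the `PII_FIELDS` list and the `split_pii` helper are hypothetical, and a real migration would derive the field list from the documented target schema.

```python
# Assumed list of property names treated as PII in the target schema.
PII_FIELDS = frozenset({"email", "full_name", "phone", "ip_address"})


def split_pii(properties: dict) -> tuple[dict, dict]:
    """Return (analytics_safe, pii) so the two can be stored and governed separately."""
    safe = {k: v for k, v in properties.items() if k not in PII_FIELDS}
    pii = {k: v for k, v in properties.items() if k in PII_FIELDS}
    return safe, pii


safe_props, pii_props = split_pii(
    {"plan": "pro", "email": "user@example.com", "seats": 3}
)
```

Only `safe_props` would be sent to the analytics destination; where `pii_props` goes depends on the consent path documented for that user.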

Lesson 3 — De-risk the cutover with a phased plan. Rather than a big-bang switch, I dual-run critical flows, compare telemetry, and use feature flags to roll forward (and back) safely. Observability and anomaly detection are my guardrails: I monitor volume, cardinality, and event timeliness to spot regressions early—long before executives notice broken charts.
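During a dual-run, the comparison itself can be automated. This sketch assumes each platform can export daily counts per event name as a dictionary; the function name and the 2% tolerance are illustrative choices, not values from the original lessons.

```python
def compare_event_volumes(
    legacy: dict[str, int], target: dict[str, int], tolerance: float = 0.02
) -> list[str]:
    """Return readable discrepancies between two platforms' event counts."""
    issues = []
    for name in sorted(set(legacy) | set(target)):
        old, new = legacy.get(name, 0), target.get(name, 0)
        if old == 0 or new == 0:
            # Missing or zero on one side: relative drift is not meaningful.
            issues.append(f"{name}: missing or zero on one side (legacy={old}, target={new})")
            continue
        drift = abs(new - old) / old
        if drift > tolerance:
            issues.append(f"{name}: volume drift {drift:.1%} exceeds {tolerance:.0%}")
    return issues
```

An empty list for the critical flows over several consecutive days is one reasonable signal that a flow is ready to cut over.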

Lesson 4 — Treat instrumentation like product code. I wire schema checks into CI/CD, enforce typed analytics wrappers, and validate payloads pre-merge. With docs-as-code, the tracking plan stays current and reviewable. This keeps quality high at scale and avoids the slow death of broken funnels caused by well-meaning quick fixes.
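A pre-merge payload check can be as simple as validating each event against the tracking plan. The sketch below is an assumption-laden example: `TRACKING_PLAN`, the snake_case convention, and `validate_event` are hypothetical stand-ins for whatever plan format and naming rules a team adopts.

```python
import re

EVENT_NAME = re.compile(r"^[a-z]+(_[a-z]+)*$")  # assumed convention: snake_case

# Hypothetical tracking plan: event name -> required property names and types.
TRACKING_PLAN = {
    "signup_completed": {"plan": str, "seats": int},
}


def validate_event(name: str, properties: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload is valid."""
    errors = []
    if not EVENT_NAME.match(name):
        errors.append(f"event name {name!r} is not snake_case")
    spec = TRACKING_PLAN.get(name)
    if spec is None:
        errors.append(f"event {name!r} is not in the tracking plan")
        return errors
    for prop, expected in spec.items():
        if prop not in properties:
            errors.append(f"missing property {prop!r}")
        elif not isinstance(properties[prop], expected):
            errors.append(f"property {prop!r} should be {expected.__name__}")
    for prop in sorted(properties.keys() - spec.keys()):
        errors.append(f"unexpected property {prop!r}")
    return errors
```

Run in CI against fixture payloads, a non-empty result fails the build, which is what keeps quick fixes from silently breaking funnels.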

Lesson 5 — Enable the people, not just the platform. Tools don’t create insight—teams do. I run hands-on enablement with product tours and in-app guides tailored to each role, establish communities of practice, and publish short playbooks for common questions (activation analysis, cohort retention, and journey mapping). When customer success and growth marketers can self-serve, adoption sticks.

Lesson 6 — Land cleanly with fast, visible wins. Within the first two weeks post-cutover, I showcase analyses that matter: retention analysis by use-case, friction points via session replay and heatmaps, and conversion lift by segment. These quick proofs build confidence, reinforce the value proposition, and keep stakeholders engaged through the longer tail of hardening.

Lesson 7 — Govern and evolve continuously. After go-live, I schedule schema reviews, backlog grooming, and QBRs to prune events and refine definitions. Ownership is explicit, and changes flow through the same review process as code. This keeps the unified analytics platform trustworthy as the product (and org) changes.

I’ve seen this playbook turn skepticism into momentum. In one migration I inherited mid-flight, we refocused on decisions, tightened governance, and phased the rollout; the team moved from fire drills to confident launches—and stakeholders finally believed the numbers again.

If your team is staring down a migration, anchor on outcomes, automate quality, and invest in enablement. With disciplined execution and the lessons I’ve applied alongside partners like Human37 and platforms like Amplitude, you can move fast, reduce risk, and land cleanly, without the chaos.

Inspired by this post on Amplitude – Perspectives.

June 22, 2026
Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

Then the recent wave of malicious software spreading through the open-source community popped my bubble. It hit companies big and small—names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

[Figure: agenda for responding when one malicious worm compromises hundreds of packages: how it spreads, how to guard against it, AI tool risks, and concrete mitigation steps.]

On May 11th, I started seeing tweets about a TanStack hack. At that time, I didn’t know what TanStack was. But apparently, it’s a popular set of JavaScript libraries that are used by a lot of React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm—malicious software that self-replicates—and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

[Figure: the three-step package-hack playbook (get in, sweep for secrets, phone home), with newer entry points flagged: packages, MCP servers, agent skills, and app extensions.]

An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm spreads via malicious package install scripts that run at download time, then searches the device for credentials (including package publishing rights), poisons additional packages to continue replicating, and uses multiple channels—including the victim’s own GitHub public repos—to distribute secrets.

In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
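To see how much of the install-script entry point exists in a project, you can list which installed packages declare install-time hooks. The sketch below relies on npm’s `preinstall`, `install`, and `postinstall` lifecycle script names; the function name is my own, and a clean result is not proof of safety, because malicious code can also run when a package is imported.

```python
import json
from pathlib import Path

# npm lifecycle scripts that run automatically during installation.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def find_install_scripts(root: Path) -> dict[str, dict[str, str]]:
    """Map each package.json under `root` to the install-time scripts it declares."""
    findings = {}
    for manifest in sorted(root.rglob("package.json")):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifests are skipped
        scripts = data.get("scripts") if isinstance(data, dict) else None
        if not isinstance(scripts, dict):
            continue
        hooks = {h: scripts[h] for h in INSTALL_HOOKS if h in scripts}
        if hooks:
            findings[str(manifest)] = hooks
    return findings
```

Calling `find_install_scripts(Path("node_modules"))` in a project directory returns the packages worth reading before you trust them.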

Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

[Figure: four common risk zones for third-party code: agent skills and MCP servers, coding agents, IDE extensions, and Obsidian plugins.]

The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

Because Claude can install packages inside Cowork, the virtual machine remains susceptible to malicious code like the Mini Shai-Hulud worm. GitHub is also on the network allow list so that Cowork can read and write to your repos. Since the Mini Shai-Hulud worm uses GitHub to publish secrets, this creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.

[Figure: how Cowork handles each attack step. Entry points are contained, data is safe only when kept outside the VM, and network traffic is partly limited, leaving shared data as the gap to watch.]

Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
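The first of those two questions can be partly automated with a rough pre-mount scan. What follows is a heuristic sketch, not a security tool: the filename list, the regular expression, and the function name are my own illustrative assumptions, and a clean result does not prove a folder is free of secrets.

```python
import re
from pathlib import Path

# Illustrative assumptions only; neither list is complete.
SUSPECT_NAMES = {".env", ".npmrc", "id_rsa", "id_ed25519", "credentials.json"}
SUSPECT_CONTENT = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----|(?i:api[_-]?key|secret|token)\s*[:=]\s*\S+"
)


def find_possible_secrets(folder: Path, max_bytes: int = 1_000_000) -> list[str]:
    """List files whose name or content looks like it may hold credentials."""
    hits = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        rel = str(path.relative_to(folder))
        if path.name in SUSPECT_NAMES:
            hits.append(f"{rel}: filename often holds credentials")
            continue
        try:
            if path.stat().st_size > max_bytes:
                continue  # skip large files to keep the scan quick
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        if SUSPECT_CONTENT.search(text):
            hits.append(f"{rel}: content resembles a key or token")
    return hits
```

Anything the scan reports is a reason to move the file out of the folder, or to mount a narrower folder, before sharing.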

It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

With Cowork, you can also use custom MCP servers that you create yourself. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

[Figure: outline of a three-part Product Talk guide to safer AI building: Cowork safety (this installment), Claude Code configuration, and off-device development.]

What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

To summarize Cowork’s protections against the attacker’s three-step pattern:
- Entry point: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device.
- Access to sensitive data: strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach.
- Data exfiltration: partially constrained because Anthropic limits outbound network traffic from the VM. Helpful, but not absolute.

By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits, so any malicious install scripts run directly on your machine with full access and open egress.

My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.

Inspired by this post on Product Talk.

June 3, 2026

A Product Leader’s Playbook for Humane, Sustainable Growth

Your growth dashboard can be green while your product is becoming less valuable to the people who use it. Activation rises. Engagement deepens. Revenue follows. Yet customers feel pressured, workers absorb hidden costs, or automation removes the human contact that made the experience trustworthy.

You don’t have to choose between humane technology and commercial performance. You do need an operating model that treats human outcomes as product outcomes, exposes harmful trade-offs early, and rewards durable value rather than extraction.

Start with the harm your growth model could create

Most growth models describe the path from acquisition to revenue. A humane growth model also describes who could be worse off if that path succeeds.

Map the product’s intended value first: the problem a person wants to solve, the moment they receive a useful result, and the reason they would return. Then examine the same journey from the perspective of people who may not appear in your analytics. That can include a customer’s employees, contractors who deliver the service, family members affected by the product, local businesses, or people excluded by the design.

Create an impact ledger for the growth surface you are reviewing. Keep it beside the business case, not in a separate ethics document that nobody consults during prioritization.

Impact area	Question to answer	Signal to monitor
User agency	Can people understand the choice, refuse it, reverse it, and leave?	Overrides, cancellations, reversals, and interview evidence
Well-being	Does additional use help people finish their intended task, or merely keep them present?	Successful outcomes, passive time, and expressions of regret
Economic fairness	Who captures the value, and who absorbs the labor, risk, or cost?	Complaints, payout concerns, and changes in burden across participants
Human connection	Does the experience strengthen useful relationships or replace them unnecessarily?	Human handoffs and feedback from affected communities
Trust and safety	Do people know when automation is involved and what happens to their data?	Escalations, corrections, safety reports, and trust feedback

The ledger is not an attempt to predict every consequence. It is a way to make foreseeable trade-offs visible before a team becomes committed to a launch. This matters commercially as well as ethically: extractive growth can weaken trust and retention while increasing regulatory and reputational exposure.

Pair every growth metric with a human countermetric

A metric becomes dangerous when the team can improve it while making the customer’s life worse. Engagement is the familiar example. More time in a product may indicate value, confusion, dependency, or difficulty leaving. The number alone cannot tell you which.

Give each primary growth metric a countermetric that protects the outcome you actually intend. The pair should appear in the same experiment brief and the same review meeting.

Growth metric	Human countermetric	Decision it improves
Activation	Completion of the customer’s intended outcome	Whether setup creates value or only reaches an internal milestone
Engagement	Intentional task completion	Whether additional use is productive or merely prolonged
Retention	Trust, voluntary continuation, and ease of exit	Whether customers stay because the product remains useful
Conversion	Comprehension of price, consent, and commitment	Whether revenue depends on informed choice
Automation rate	Correction, reversal, and human-escalation success	Whether efficiency survives real-world exceptions

Do not combine the pair into a single score too quickly. A blended score can conceal the exact trade-off leaders need to see. Review both trends and ask whether the business result would still be desirable if the countermetric deteriorated further.

Set the stopping condition before running an experiment. Decide which trust, safety, fairness, or agency signal would block rollout even if the primary metric improves. A guardrail invented after seeing strong conversion is rarely a real guardrail.
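A pre-committed stopping condition can be written down as a rule that takes both numbers and never blends them. The sketch below is a simplified illustration: `MetricPair`, the threshold, and the three decisions are hypothetical, and a real review would also weigh statistical confidence and interview evidence.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ROLL_OUT = "roll_out"
    HOLD = "hold"
    STOP = "stop"


@dataclass(frozen=True)
class MetricPair:
    growth_name: str
    counter_name: str
    max_counter_drop: float  # recorded before the experiment runs, e.g. 0.02


def review(pair: MetricPair, growth_lift: float, counter_change: float) -> Decision:
    """Review both trends side by side; the guardrail is checked first."""
    if counter_change < -pair.max_counter_drop:
        return Decision.STOP  # guardrail breached, regardless of the growth result
    if growth_lift > 0:
        return Decision.ROLL_OUT
    return Decision.HOLD


pair = MetricPair("conversion", "price_comprehension", max_counter_drop=0.02)
```

Because `max_counter_drop` is frozen into the pair before any results exist, a strong conversion number cannot quietly loosen the guardrail afterwards.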

Expand discovery beyond the people who already love the product

Power users are good at explaining how to improve the experience they have accepted. They are less able to represent people who abandoned it, avoided it, could not access it, or carry costs without being the buyer.

Add an outside-in lane to continuous discovery. Include customers who reduced usage or left, people who encountered a failed automation, front-line workers affected by the workflow, and community members who experience consequences without controlling the purchase. Treat these conversations as product discovery, not public relations.

Ask questions that reveal displacement and dependency: What became easier? What became harder? What did this replace? When did you feel unable to make a meaningful choice? Who else had to change their behavior so you could receive the benefit? What would a responsible version of this experience preserve?

Bring the evidence into roadmap decisions in its original shape. A complaint about loss of control should not be translated into a generic request for better usability. A contractor describing unfair risk is not reporting a minor service defect. Name the underlying impact so the team can address the product model rather than polish its interface.

Put humane constraints inside the experiment

Principles have little effect if they enter the process after pricing, interaction design, and technical architecture are settled. Put them into the experiment before the team writes production code.

- State the human outcome. Describe what should become better in the person’s life or work, not merely what behavior should increase.
- Name the affected groups. Include non-users who supply labor, absorb risk, or experience downstream effects.
- Define meaningful choice. Specify how people will understand automation, decline it, correct it, and reverse important actions.
- Design the failure path. Decide how a person reaches human help when the system is uncertain, unsafe, or wrong.
- Pre-commit to a stopping rule. Record which negative signal pauses expansion regardless of the growth result.
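The five constraints above can be made a gate in the experiment brief itself. This is a minimal sketch with hypothetical field names: it only checks that each constraint has been written down, not that the content is adequate.

```python
from dataclasses import dataclass, fields


@dataclass
class ExperimentBrief:
    human_outcome: str = ""
    affected_groups: str = ""
    meaningful_choice: str = ""
    failure_path: str = ""
    stopping_rule: str = ""

    def missing_constraints(self) -> list[str]:
        """Names of constraints that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_for_build(self) -> bool:
        """True only when every humane constraint has been filled in."""
        return not self.missing_constraints()


draft = ExperimentBrief(
    human_outcome="Riders reach appointments on time",
    stopping_rule="Pause if reversal requests rise",
)
```

Here `draft.missing_constraints()` names the three fields still to be written, which gives a review meeting a concrete list to work through before production code starts.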

For AI products, this is where risk management becomes part of product management. Give users enough information to understand when AI is acting. Preserve review for consequential outputs. Build correction and escalation into the main workflow. Apply privacy-by-design while deciding what data the product needs, rather than after collecting everything that might be useful.

The product trio should own these decisions. Legal, security, trust, and policy partners can strengthen the work, but they cannot compensate for a roadmap whose incentives reward harm. The product leader remains accountable for the whole system being optimized.

Choose durable depth over indiscriminate scale

Scale is not proof of value. It is an amplifier. If the operating model depends on weak consent, hidden costs, unfair labor, or the removal of every human interaction, scale magnifies those weaknesses.

A narrower product can create a stronger business when the team understands a community deeply enough to solve its full problem. A locally focused mobility service, for example, could optimize for rider safety, driver economics, and neighborhood usefulness rather than treating every participant as an interchangeable unit of supply or demand. The market is smaller by design, but the value proposition can be clearer and trust can become part of the product’s advantage.

Test the durability of your strategy with a simple question: if customers become better informed and cultural expectations become stricter, does the growth model become stronger or weaker? Expectations do shift. In one example, a group of German primary-school parents collectively chose to delay smartphones until age 11 or 12. Product leaders should expect social norms to change, sometimes in direct opposition to adoption assumptions embedded in a forecast.

At the next roadmap review, challenge any initiative that needs customers to misunderstand a choice, remain dependent, or accept worsening treatment as the company grows. If removing that mechanism destroys the economics, you have found a strategy problem, not an optimization problem.

Key takeaways

Document who could be harmed by a successful growth initiative, including people who never appear in the customer database.
Pair activation, engagement, retention, conversion, and automation metrics with measures of outcomes, agency, trust, and recovery.
Include former users, affected workers, and non-buyers in continuous discovery.
Define consent, correction, escalation, and stopping conditions before launching an experiment.
Prefer a focused market with durable value over scale that depends on hidden human costs.

Start with the growth initiative carrying the greatest human risk. Add its impact ledger and countermetric to the next decision meeting, assign an owner, and make expansion conditional on both business value and human value holding up.

References

Shivam.Consulting Blog — Is Technology Still Net Positive? A Product Leader’s Reckoning and Playbook for Humane Growth

May 26, 2026

Governed AI Analytics in Financial Services: A Playbook

You have a credible AI analytics use case, product teams want access, and risk leaders want proof that the system will not expose sensitive data or influence the wrong decision. The mistake is to settle that tension with a broad choice between “innovation” and “control.” That choice is too vague to operate.

Start with a narrower question: what decision may this system influence, using which data, under whose authority, with what evidence afterward? Once those boundaries are explicit, you can give teams meaningful speed without asking compliance to accept an invisible risk.

Classify the decision before you assess the AI

Many AI reviews begin with the model: where it is hosted, how it was trained, or whether it can explain an answer. Those questions matter, but they do not establish the business risk. The same model can summarize an approved dashboard, flag an unusual transaction pattern, or help determine an outcome that affects a customer. Those are not equivalent uses.

Classify each use case by consequence, reversibility, and action authority. Consequence asks what happens if the output is wrong. Reversibility asks whether a person can correct the result before harm occurs. Action authority asks whether the system informs a person, recommends an action, or executes one.

Use case pattern	Permitted role for AI	Control that matters most	Boundary to make explicit
Descriptive analysis	Summarize approved metrics or behavioral patterns	Data permissions and traceable metric definitions	The output cannot create a new customer-level action
Investigative signal	Surface anomalies or suspicious patterns for review	Analyst validation, evidence capture, and disposition logging	A signal is not a finding or a verdict
Product recommendation	Suggest an intervention, workflow, or experiment	Human approval and outcome monitoring	The recommendation cannot bypass existing approval paths
Customer-affecting decision	Support a formally governed decision process	Documented oversight, explainability, and accountable human authority	The final authority and escalation path must be unambiguous

This classification prevents two common errors. The first is applying the heaviest possible review to every analytical assistant, which sends teams into unofficial tools and manual workarounds. The second is treating every output as “just an insight” even when a downstream workflow turns it into a customer action.

Trace the output one step beyond the interface. If an anomaly score enters a case-management queue, changes account handling, or triggers outreach, govern that downstream effect as part of the use case. A recommendation does not become low risk merely because a person clicks the final button.

Before development begins, write an allowed-action statement and a prohibited-action statement. For example: “The system may prioritize patterns for analyst investigation. It may not label a customer, close a case, or initiate an external action.” That pair of sentences is more operationally useful than calling the project “medium risk.”
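
The example statement pair can be made enforceable rather than descriptive. This is a minimal sketch, assuming the workflow routes every system-initiated action through one authorization function. The Action names mirror the example sentence above; EXPORT_RESULTS and the default-deny behavior are assumptions added for illustration.

```python
from enum import Enum


class Action(Enum):
    PRIORITIZE_FOR_REVIEW = "prioritize_for_review"
    LABEL_CUSTOMER = "label_customer"
    CLOSE_CASE = "close_case"
    INITIATE_EXTERNAL_ACTION = "initiate_external_action"
    EXPORT_RESULTS = "export_results"  # hypothetical action named in neither statement


# "The system may prioritize patterns for analyst investigation."
ALLOWED = frozenset({Action.PRIORITIZE_FOR_REVIEW})

# "It may not label a customer, close a case, or initiate an external action."
PROHIBITED = frozenset(
    {Action.LABEL_CUSTOMER, Action.CLOSE_CASE, Action.INITIATE_EXTERNAL_ACTION}
)


class ProhibitedActionError(PermissionError):
    """Raised when the system attempts an action outside its contract."""


def authorize(action: Action) -> bool:
    if action in PROHIBITED:
        raise ProhibitedActionError(f"{action.value} is explicitly prohibited")
    if action not in ALLOWED:
        # Default deny: an action the contract never mentioned is not permitted.
        raise ProhibitedActionError(f"{action.value} is not explicitly allowed")
    return True
```

Keeping both sets, instead of only an allowlist, lets the error message distinguish a deliberate prohibition from a gap in the contract that needs a new review.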

Risk and compliance leaders still need to map the use case to the organization’s actual legal and regulatory obligations. A product risk classification is an operating tool, not a legal conclusion. When a use case could affect access, eligibility, pricing, fraud treatment, or another consequential outcome, obtain the appropriate compliance and legal review before activation.

Turn governance principles into an enforceable contract

Principles such as fairness, privacy, transparency, and human oversight do not control a production workflow by themselves. Each principle needs an owner, an enforcement point, and evidence that the control operated. I treat that combination as the governance contract for the use case.

Define the data boundary

List the approved data domains, fields, purposes, environments, and user groups. Do not stop at “customer data” or “analytics data.” Those labels are too broad to enforce. State which attributes the system can retrieve, which identifiers it can display, whether results may be exported, and where generated outputs may be stored.

Purpose: the business question the data may be used to answer.
Permitted inputs: the approved events, attributes, aggregates, and reference data.
Prohibited inputs: data classes that the workflow must never retrieve or infer.
Permitted users: roles allowed to query, review, approve, or export results.
Output handling: where results may be displayed, retained, shared, or reused.
Failure behavior: what the system does when permission, provenance, or confidence is insufficient.

Enforce that boundary with role-based access controls and granular permissions at retrieval time. Filtering an answer after a model has received restricted data is not equivalent to preventing access. The model, retrieval layer, analytics service, export path, and destination workflow all need to respect the same user identity and policy context.
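
To make the difference between retrieval-time enforcement and after-the-fact filtering concrete, here is a minimal sketch. It assumes a hypothetical DataBoundary record and a plan_retrieval check that runs before any query is issued; the field and role names in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBoundary:
    purpose: str
    permitted_inputs: frozenset
    prohibited_inputs: frozenset
    permitted_roles: frozenset


class AccessDenied(Exception):
    """Raised before retrieval, so restricted data never reaches the model."""


def plan_retrieval(boundary: DataBoundary, role: str, requested_fields) -> list:
    """Return the fields that may be fetched, or fail visibly."""
    if role not in boundary.permitted_roles:
        raise AccessDenied(f"role '{role}' is not permitted for: {boundary.purpose}")
    requested = set(requested_fields)
    blocked = requested & boundary.prohibited_inputs
    if blocked:
        raise AccessDenied(f"prohibited inputs requested: {sorted(blocked)}")
    unapproved = requested - boundary.permitted_inputs
    if unapproved:
        raise AccessDenied(f"inputs outside the approved scope: {sorted(unapproved)}")
    return sorted(requested)


# Hypothetical boundary for one onboarding-analysis workflow.
boundary = DataBoundary(
    purpose="explain drop-off in onboarding",
    permitted_inputs=frozenset({"event_name", "step", "cohort"}),
    prohibited_inputs=frozenset({"account_number", "national_id"}),
    permitted_roles=frozenset({"product_analyst"}),
)
```

The function raises instead of silently dropping fields, so the user sees which policy boundary was hit and the denial itself can be logged as evidence that the control operated.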

Assign decision rights to named roles

A committee can set policy, but it cannot own every operational decision. Give each use case an accountable product owner, a data owner, a control owner, and a business reviewer. Clarify who can approve launch, who can change the data scope, who reviews exceptions, and who has authority to stop the workflow.

The product owner defines the user problem, allowed action, prohibited action, and business outcome.
The data owner approves the data purpose, quality expectations, permissions, and reuse limits.
The risk or compliance owner maps policy obligations to testable controls and reviews material exceptions.
The platform or security owner implements identity, access, isolation, logging, and change controls.
The business reviewer accepts, rejects, or escalates outputs and records why.

Keep the decision rights close to the workflow. If a reviewer sees an unsupported conclusion, that person needs a clear way to reject it, preserve the evidence, and route the issue. If every exception disappears into a general governance inbox, the formal control will be bypassed when operational pressure rises.

Design the audit record before launch

An audit trail should reconstruct what happened without relying on someone’s memory. Capture the requesting identity and role, the approved purpose, the data and metric definitions used, the system configuration, the generated result, any human review, the resulting action, and later corrections or overrides.

Logging creates its own data risk. Prompts, retrieved context, generated explanations, and reviewer notes can contain sensitive information. Protect the audit store with appropriate access, retention, and segregation rather than treating logs as harmless operational exhaust. Where policy permits, record protected references to sensitive records instead of duplicating raw payloads.
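
One way to record protected references instead of raw payloads is a keyed hash of the record identifier. The sketch below uses Python's standard hmac and hashlib modules; the AuditRecord fields follow the list above, while the class name, field names, and key handling are assumptions. In practice the key would come from a secret manager, and whether a keyed hash satisfies your policy is a decision for the control owner.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from typing import Optional


def protected_ref(record_id: str, key: bytes) -> str:
    """Keyed hash that lets auditors match records without storing the raw ID."""
    return hmac.new(key, record_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class AuditRecord:
    requested_by: str
    role: str
    purpose: str
    metric_definitions: tuple
    system_config_version: str
    record_refs: tuple           # protected references, never raw identifiers
    result_summary: str
    reviewer: Optional[str] = None
    disposition: Optional[str] = None
    resulting_action: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


DEMO_KEY = b"demo-only-key"  # assumption: replace with a managed secret

record = AuditRecord(
    requested_by="analyst-17",
    role="product_analyst",
    purpose="explain drop-off in onboarding",
    metric_definitions=("onboarding_completion_v3",),
    system_config_version="2026-05-01",
    record_refs=(protected_ref("cust-001", DEMO_KEY),),
    result_summary="Drop-off concentrated at identity verification step",
    reviewer="lead-04",
    disposition="insufficient_evidence",
)
```

Because the hash is keyed, someone with log access but without the key cannot confirm a guessed identifier, while an authorized investigator can recompute the reference for a known record.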

A practical platform evaluation should test whether the system combines strong data governance, auditable AI behavior, secure scale, and a direct connection to product outcomes. A policy document that cannot be enforced in the workflow is not enough, and a platform control without an accountable operating process is not enough either.

Put controls inside the workflows people actually use

Governance fails when it exists as a review ceremony around the product rather than a behavior inside it. Analysts should not have to remember a separate policy every time they ask a question. The approved data scope, identity context, review step, and evidence capture should travel with the task.

Behavioral analytics: govern the meaning as well as the data

Behavioral analytics can reveal how customers move through onboarding, self-service, support, payments, and other product journeys. The danger is not limited to unauthorized access. An AI system can also combine valid events into a misleading interpretation of customer intent.

Start the workflow with curated event definitions and approved business metrics. Require the output to expose the cohort definition, time context, filters, exclusions, and comparison used. The analyst should be able to inspect the path from a narrative claim back to the underlying measure before sharing it.

Separate observation from inference in the interface. “Users in this cohort abandoned the flow after this step” is an observation tied to event data. “They abandoned because they distrusted the process” is a hypothesis. Labeling those differently prevents fluent language from turning a plausible explanation into an unsupported fact.

Anomaly detection: route a signal into investigation, not judgment

An anomaly means a pattern differs from an expected baseline. It does not establish fraud, customer intent, system abuse, or operational error. Treat anomaly detection as a prioritization mechanism unless a separately governed process establishes something more.

Give the reviewer the observed deviation, relevant context, the comparison baseline, and links to permitted evidence. Capture the reviewer’s disposition: confirmed issue, expected behavior, insufficient evidence, data-quality problem, or escalation. That disposition is both an audit artifact and a feedback signal for improving the workflow.

Watch the operational burden as closely as the detection capability. A flood of weak signals can make the nominal control less safe because reviewers rush, defer, or stop trusting the queue. Monitor false positives, unresolved escalations, overrides, and the reasons analysts reject outputs. When those indicators deteriorate, reduce scope or pause automated routing while the cause is investigated.
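
The dispositions and the pause rule can be expressed in a few lines. This is a sketch under stated assumptions: it counts an "expected behavior" disposition as a false positive, and the two thresholds are hypothetical values that each team would set from its own baseline.

```python
from enum import Enum


class Disposition(Enum):
    CONFIRMED_ISSUE = "confirmed_issue"
    EXPECTED_BEHAVIOR = "expected_behavior"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"
    DATA_QUALITY_PROBLEM = "data_quality_problem"
    ESCALATION = "escalation"


def false_positive_rate(dispositions) -> float:
    """Share of reviewed signals that turned out to be expected behavior."""
    reviewed = list(dispositions)
    if not reviewed:
        return 0.0
    false_positives = sum(1 for d in reviewed if d is Disposition.EXPECTED_BEHAVIOR)
    return false_positives / len(reviewed)


def should_pause_routing(
    dispositions,
    open_escalations: int,
    max_false_positive_rate: float = 0.5,  # hypothetical threshold
    max_open_escalations: int = 20,        # hypothetical threshold
) -> bool:
    """Pause automated routing when either health indicator deteriorates."""
    return (
        false_positive_rate(dispositions) > max_false_positive_rate
        or open_escalations > max_open_escalations
    )
```

A pause here means reviewers stop receiving automatically routed signals while the cause is investigated; it does not delete the signals or the dispositions already captured.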

Self-service analysis: give teams a governed lane

Product managers and analysts need enough freedom to explore without sending every question through a central approval queue. Create a governed workspace containing approved metrics, documented data products, role-aware access, and restricted export paths. Let people iterate freely inside that lane while changes to data scope, decision authority, or external activation trigger a new review.

Make the boundary visible. Users should know when an answer is based on incomplete data, when a metric is not approved for customer-level decisions, and when an output cannot be exported. A silent denial encourages workarounds; a clear denial that identifies the policy boundary gives the user a legitimate next step.

Do not give an analytics assistant write access to operational systems merely because the integration is convenient. Insight generation and action execution are separate privileges. Connect them only when the action, reviewer, failure mode, and rollback path have been governed explicitly.

Pilot with evidence, not a polished demonstration

A convincing demo proves that the happy path works. A governed pilot must also prove that the system refuses the wrong request, exposes enough evidence for review, and leaves a usable record when something goes wrong.

Choose a narrow workflow with an identifiable user, a bounded data set, a reviewable output, and a business outcome you already understand. Avoid beginning with an enterprise-wide assistant or an autonomous action layer. Broad scope makes it difficult to distinguish model behavior, data problems, permission failures, and process gaps.

Write the decision contract. Record the user, purpose, permitted inputs, allowed action, prohibited action, reviewer, and stop authority.
Configure the smallest useful data boundary. Include only the fields and metrics needed for the chosen workflow.
Test legitimate work. Confirm that authorized users can produce an insight, inspect its basis, and complete the intended review.
Test prohibited work. Attempt access with the wrong role, request excluded attributes, try an unauthorized export, and ask the system to take a prohibited action.
Test ambiguity and failure. Use incomplete context, conflicting metric definitions, missing permissions, and unavailable dependencies. Confirm that the system fails visibly and safely.
Reconstruct the event. Use the audit record to determine who requested the output, what information was used, what was generated, who reviewed it, and what happened next.
Change the system deliberately. Update a relevant configuration or model component and confirm that approval, documentation, testing, and monitoring follow the change.

Do not accept screenshots as evidence for controls that operate behind the interface. Ask the vendor or internal platform team to demonstrate a denied request, a permission change, a reviewer override, an exported audit record, and the behavior after a governed configuration change. The test should follow your use case and identities, not a generic demonstration tenant.

Measure value and control health together. If the system produces faster insights but increases unreviewed actions, weakens attribution, or creates an investigation backlog, it has not delivered a durable improvement.

Dimension	Question	Useful signals
Business value	Does the workflow improve a real product, growth, risk, or operational decision?	Time to a validated insight, useful investigations completed, issues resolved, and attributable product outcomes
Analytical quality	Can a reviewer verify the conclusion?	Accepted and rejected outputs, unsupported claims, metric-definition errors, and missing context
Control effectiveness	Did policy operate as designed?	Prohibited requests blocked, required reviews completed, permission exceptions, and audit-record completeness
Operational health	Can people sustain the workflow?	False-positive burden, unresolved escalations, overrides, rework, and reviewer backlog
Change safety	Do updates preserve the approved boundary?	Documented changes, completed regression checks, new failure patterns, and monitored post-change behavior

Set release gates in binary language. The use case has a named accountable owner or it does not. Permissions have been tested with unauthorized identities or they have not. High-impact outputs receive the required review or they do not. Audit evidence can reconstruct an event or it cannot. Ambiguous gates become exceptions as soon as delivery pressure appears.
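
Binary gates translate directly into boolean fields with no "partial" state. The sketch below is illustrative; the ReleaseGates class and its field names are hypothetical labels for the four gates described above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseGates:
    named_accountable_owner: bool
    permissions_tested_with_unauthorized_identities: bool
    high_impact_outputs_receive_required_review: bool
    audit_evidence_reconstructs_an_event: bool


def failed_gates(gates: ReleaseGates) -> list:
    """Names of gates that are not strictly True."""
    # "is not True" rejects None, strings such as "mostly", and other
    # values that would let an ambiguous answer pass as a yes.
    return [f.name for f in fields(gates) if getattr(gates, f.name) is not True]


def ready_to_release(gates: ReleaseGates) -> bool:
    return not failed_gates(gates)
```

Returning the list of failed gate names, not just a boolean, gives the release review a concrete agenda: each name is either fixed or the release waits.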

When the pilot is stable, reuse the control components rather than copying the entire use case. Standard identity propagation, data classification, audit schemas, reviewer workflows, and change gates can form a shared control plane. Each new use case still needs its own purpose, decision boundary, outcome measure, and risk assessment.

Key takeaways

Govern the decision the AI can influence, not just the model that produces the output.
Write both an allowed-action statement and a prohibited-action statement before development begins.
Enforce data permissions before retrieval and carry the user’s identity through analysis, export, and downstream action.
Treat human review as an operational workflow with evidence, dispositions, escalations, and stop authority.
Keep observations, hypotheses, recommendations, and customer-affecting decisions visibly distinct.
Test denial, ambiguity, change, and audit reconstruction alongside the happy path.
Track business value, analytical quality, control effectiveness, and operational burden on the same scorecard.

Your next move is not to draft an enterprise AI policy. Pick one live analytics workflow and write its decision contract on a single page. If you cannot name the allowed action, prohibited action, data boundary, reviewer, audit evidence, and stop authority, the workflow is not ready to scale. If you can, you have the foundation for AI analytics that product teams can use and risk leaders can defend.

References

Amplitude – Financial Services AI

May 15, 2026

How to Scale Session Replay Without Sacrificing Privacy

You want session replay on more journeys because the blind spots are expensive. A funnel can show where users leave, but it cannot show whether they encountered a broken control, a confusing message, a layout shift, or an error that never reached your analytics. Replay can turn those behavioral signals into enough context to make a product decision.

The hard part is expanding that visibility without collecting data you should not have, degrading the experience you are trying to understand, or filling storage with recordings nobody will use. The answer is not a single masking setting. You need a capture contract, a delivery architecture, a sampling model, and an operating scorecard that treat performance, fidelity, and privacy as one system.

Set the capture contract before you expand coverage

Replay programs often begin with a coverage question: what percentage of sessions should you record? That is the wrong first question. Start with the decision you expect the recording to change. If nobody can name that decision, more coverage will create more cost and exposure without producing more insight.

Write a capture contract for each product surface. This is a short, reviewable specification that connects a business purpose to technical controls. It should answer:

What question is replay meant to answer? Examples include diagnosing failed activation, explaining an error spike, or finding friction in a conversion step.
Which routes, components, and user cohorts are in scope? Name them. Do not approve an undefined all-product rollout.
Which data is prohibited? Include form values, credentials, payment details, message content, health information, account-recovery data, and any product-specific sensitive fields that apply.
What consent state permits capture? The recorder should not initialize before the required state is known. Withdrawal should stop capture and prevent queued data from being sent.
Who can watch a replay? Define roles by purpose. Product discovery, support investigation, engineering diagnosis, and administration do not automatically require identical access.
How long will the data remain available? Tie retention to the stated purpose rather than keeping replay indefinitely because storage permits it.
What sampling rule applies? State the baseline rate, targeted cohorts, exclusions, temporary overrides, owner, and expiry condition.

Selective capture, redaction, consent, retention, role-based access, and environment-aware sampling are separate controls. Treating one of them as a substitute for the others creates predictable gaps. Masking does not grant consent. Restricted access does not make excessive collection acceptable. Short retention does not make an exposed credential harmless.

Apply those controls as close to collection as possible. A web replay is commonly reconstructed from serialized page state, changes, and interaction events. The privacy risk therefore sits in the data leaving the browser, not only in what the player later displays. A value hidden during playback may already exist in an outbound payload or stored record.

A useful default is to block text and input values, then allowlist only fields proven safe and necessary. Add route-level and component-level exclusions for sensitive surfaces. Use a separate, time-bounded approval for diagnostic capture that needs greater fidelity. I would reject a policy that merely says to mask personal information: the term depends on context, and engineers cannot reliably implement an undefined category.
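
The block-by-default rule can be illustrated on a simplified node tree. This is a language-neutral sketch written in Python; a real recorder runs in the browser and has its own serialization format. The node shape, the MASK constant, and the identifiers in the example are assumptions.

```python
from typing import Optional

MASK = "***"  # fixed-length mask, so the replacement does not reveal value length


def redact_node(node: dict, allowlist: frozenset, excluded_components: frozenset) -> Optional[dict]:
    """Copy a node for serialization, masking everything not explicitly allowed."""
    if node.get("component") in excluded_components:
        # Component-level exclusion: the whole subtree is dropped before serialization.
        return None
    is_allowed = node.get("id") in allowlist
    out = {"id": node.get("id"), "component": node.get("component")}
    for key in ("text", "value"):
        if node.get(key) is not None:
            out[key] = node[key] if is_allowed else MASK
    children = (
        redact_node(child, allowlist, excluded_components)
        for child in node.get("children", [])
    )
    out["children"] = [child for child in children if child is not None]
    return out


# Hypothetical page fragment.
page = {
    "id": "signup",
    "component": "form",
    "children": [
        {"id": "plan-name", "component": "label", "text": "Pro plan"},
        {"id": "email", "component": "input", "value": "person@example.test"},
        {"id": "card", "component": "payment-widget", "value": "4111 1111 1111 1111"},
    ],
}

safe_page = redact_node(
    page,
    allowlist=frozenset({"plan-name"}),
    excluded_components=frozenset({"payment-widget"}),
)
```

Note what the default does to a field nobody classified: the email input is masked because it is absent from the allowlist, not because someone remembered to mark it sensitive.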

Test the contract against the raw system, not just the player. Seed a non-production fixture page with recognizable test values, exercise every relevant component state, inspect the browser payload, inspect the stored representation, and verify that exports and downstream tools preserve the restriction. If a prohibited test value crosses the collection boundary, the control has failed even if the replay screen obscures it.
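
A fixture test of this kind can be as simple as searching captured payloads for seeded values. The following sketch assumes the outbound payload has already been decoded into JSON-compatible data; the canary strings and the find_leaks helper are hypothetical. Compressed or otherwise encoded payloads would need to be decoded first, or the check will pass falsely.

```python
import json

# Recognizable, non-production test values seeded into the fixture page.
CANARIES = (
    "CANARY-EMAIL-7f3a@example.test",
    "CANARY-CARD-0000",
    "CANARY-PASSWORD-zx9",
)


def find_leaks(payload, canaries=CANARIES) -> list:
    """Return every canary value found anywhere in a decoded payload."""
    blob = json.dumps(payload, ensure_ascii=False)
    return [canary for canary in canaries if canary in blob]


# Example: one payload respects the contract, the other does not.
clean_payload = {"events": [{"type": "input", "target": "email", "value": "***"}]}
leaky_payload = {
    "events": [
        {"type": "input", "target": "email", "value": "***"},
        {"type": "attr", "target": "card", "attrs": {"data-last": "CANARY-CARD-0000"}},
    ]
}
```

Run the same check at each boundary named above (browser payload, stored representation, exports), since a value can be absent from one and present in another.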

Consent and retention obligations vary by jurisdiction, contract, and data type. Your privacy or legal owner must approve those rules for the markets you serve. Engineering can enforce an approved policy; it cannot infer that policy from a generic replay configuration.

Keep capture off the user’s critical path

Scalable replay starts in the browser, where your product competes with the recorder for main-thread time, memory, and bandwidth. A backend that can ingest billions of events does not help if the recorder makes an interaction sluggish or loses the DOM changes needed to explain the problem.

The delivery design should make page experience more important than recording completeness. Decoupled capture and delivery, adaptive batching, compression, backpressure controls, and priority handling provide the basic pattern:

Capture the minimum useful representation. Filter excluded nodes and values before serialization. Avoid collecting detail that no approved use case needs.
Separate recording from transport. The capture path should write to a bounded queue rather than waiting for a network request. Upload latency must not become interaction latency.
Batch adaptively. Small batches can reduce delay during quiet periods, while larger compressed batches can reduce request overhead during sustained activity. The policy should respond to queue pressure and network conditions.
Define backpressure behavior. When production exceeds delivery capacity, the recorder needs a documented degradation order. Preserve navigation, consent changes, errors, and the structural events required for reconstruction before lower-value detail. Never freeze the page to protect the replay.
Bound long sessions. Flush incrementally, cap memory use, and make reconnection behavior explicit. A queue that grows for the life of a tab will eventually turn a delivery problem into a page-performance problem.
Make partial data visible. Mark gaps, dropped segments, and incomplete uploads. A replay that silently appears complete is more dangerous than one that clearly communicates its limits.

Backpressure deserves special attention because it forces a product decision disguised as an implementation detail. If the system cannot retain everything, what must survive? The answer should come from the capture contract. An error marker without enough surrounding state may be useless, but exhaustive cursor movement may be expendable. Rank event classes before an incident forces the recorder to choose implicitly.
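
A bounded queue with an explicit degradation order might look like the sketch below. It is written in Python for illustration only, since a real recorder runs in the browser. The three priority tiers and the eviction policy are assumptions; the actual ranking should come from your capture contract.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    CURSOR_MOVE = 0  # expendable detail
    INTERACTION = 1  # clicks, input events, scrolls
    CRITICAL = 2     # navigation, consent changes, errors, structural events


@dataclass
class BoundedReplayQueue:
    capacity: int
    events: list = field(default_factory=list)
    dropped: int = 0

    def __post_init__(self) -> None:
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def offer(self, priority: Priority, payload: str) -> bool:
        """Enqueue without blocking; returns False if the new event was dropped."""
        if len(self.events) < self.capacity:
            self.events.append((priority, payload))
            return True
        self.dropped += 1  # something is lost either way once the queue is full
        # Oldest event among those with the lowest priority.
        lowest = min(range(len(self.events)), key=lambda i: self.events[i][0])
        if self.events[lowest][0] >= priority:
            return False  # the new event is the least valuable: drop it
        del self.events[lowest]  # evict lower-value detail to keep the new event
        self.events.append((priority, payload))
        return True

    def drain(self) -> dict:
        """Hand a batch to transport, marking it incomplete if anything was dropped."""
        batch = {
            "events": [payload for _, payload in self.events],
            "dropped": self.dropped,
            "complete": self.dropped == 0,
        }
        self.events = []
        self.dropped = 0
        return batch
```

The offer method never waits on the network, and drain carries the drop count with the batch so the player can show a gap instead of presenting a partial recording as complete.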

Do not validate the client only on a fast laptop and stable connection. Use representative complex pages and test with replay on and off under CPU pressure, constrained networking, rapid DOM change, background-tab transitions, reconnection, and long sessions. Compare Web Vitals, long tasks, memory growth, bytes transferred, queue drops, upload completion, and playback completeness. Long sessions, traffic spikes, complex interactions, and variable networks are precisely where an apparently sound design reveals its failure modes.

There is no universal acceptable overhead that fits every product. Set budgets relative to your production baseline and the importance of the journey. A small regression on a frequently used mobile activation path may matter more than a larger regression on an internal administration page. Segment the results by route, browser, device class, network condition, and session length so averages do not hide the users most affected.

Sample for decisions, not for a warehouse of footage

A single global sample rate is easy to configure and hard to defend. It spends collection capacity uniformly even though product questions are not uniformly valuable. It can also miss rare failures while overrepresenting routine sessions that nobody will watch.

Use a portfolio of sampling modes:

Random baseline sampling gives you a less biased view of ordinary behavior and lets you notice problems you did not predefine.
Cohort sampling increases visibility for a defined population such as new users, a browser family, a release cohort, or users entering a critical journey.
Signal-based sampling concentrates diagnosis around errors, failed steps, rage clicks, dead clicks, abnormal exits, or other instrumented friction signals.
Temporary diagnostic sampling raises fidelity for a narrow incident or release window, with an owner and an automatic expiry condition.
Hard exclusions override every sampling mode. A high-value investigation is not permission to collect from a prohibited surface or consent state.
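
The precedence among these modes can be sketched as a single decision function. Everything named here is an assumption for illustration: the session and policy dictionaries, their keys, and the choice to evaluate the function once the relevant signals are known. The hash-based stable_fraction keeps the decision consistent for a given session ID.

```python
import hashlib


def stable_fraction(session_id: str, salt: str) -> float:
    """Deterministic value in [0, 1) derived from the session ID."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def should_record(session: dict, policy: dict, now: float) -> dict:
    # 1. Hard exclusions override every sampling mode.
    if not session["consent_granted"]:
        return {"record": False, "reason": "excluded: consent", "in_random_baseline": False}
    if session["route"] in policy["excluded_routes"]:
        return {"record": False, "reason": "excluded: route", "in_random_baseline": False}

    u = stable_fraction(session["id"], policy["salt"])
    # Tag baseline membership separately so prevalence can be estimated
    # from the random sample even when a targeted mode also matches.
    in_baseline = u < policy["baseline_rate"]

    # 2. Temporary diagnostic sampling, valid only until its expiry.
    diagnostic = policy.get("diagnostic")
    if diagnostic and now < diagnostic["expires_at"] and session["route"] == diagnostic["route"]:
        return {"record": True, "reason": "diagnostic", "in_random_baseline": in_baseline}
    # 3. Signal-based sampling.
    if set(session.get("signals", ())) & set(policy["signals"]):
        return {"record": True, "reason": "signal", "in_random_baseline": in_baseline}
    # 4. Cohort sampling.
    if u < policy["cohort_rates"].get(session.get("cohort"), 0.0):
        return {"record": True, "reason": "cohort", "in_random_baseline": in_baseline}
    # 5. Random baseline.
    return {
        "record": in_baseline,
        "reason": "baseline" if in_baseline else "not sampled",
        "in_random_baseline": in_baseline,
    }
```

Because the exclusions return before any sampling logic runs, no later rule, including a diagnostic override, can reach a prohibited surface or consent state.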

Onboarding, activation, high-friction conversion flows, and paths with disproportionate revenue or trust impact are sensible places to begin because a clearer diagnosis can change a meaningful decision. Signals such as errors, rage clicks, dead clicks, scroll behavior, and stalled progress can then help you find the sessions worth examining.

Keep one statistical distinction clear. Targeted replay is good for explaining a known problem, but it cannot tell you how prevalent that problem is. If you record sessions because they contain an error, the resulting library will naturally make errors look common. Use analytics or a random baseline to measure frequency. Use replay to understand mechanism and context.
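
A short calculation shows how large the distortion can be. The numbers are hypothetical: 10,000 sessions, a true error rate of 2%, signal-based capture of every error session, and a 1% random baseline applied to the rest.

```python
def library_error_share(total_sessions: int, true_error_rate: float, baseline_rate: float) -> float:
    """Share of recorded sessions that contain an error under targeted capture."""
    error_sessions = total_sessions * true_error_rate
    other_sessions = total_sessions - error_sessions
    recorded_errors = error_sessions                  # every error session is captured
    recorded_others = other_sessions * baseline_rate  # only a random slice of the rest
    return recorded_errors / (recorded_errors + recorded_others)


share = library_error_share(total_sessions=10_000, true_error_rate=0.02, baseline_rate=0.01)
print(f"true error rate: 2%, share of replay library with errors: {share:.0%}")
```

With these inputs the library holds 200 error sessions and 98 others, so roughly two thirds of the recordings show an error even though only 2% of sessions had one. The replay library answers "what does the failure look like", not "how often does it happen".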

A disciplined investigation looks like this:

Find a measurable change in a funnel, cohort, error rate, performance signal, or support pattern.
Define the affected population before opening replays.
Review a deliberately selected set of relevant sessions and record recurring observable behaviors, not interpretations of user intent.
Turn those observations into a falsifiable product or technical hypothesis.
Instrument, release, or experiment so the hypothesis can be measured outside the replay player.

This prevents two common mistakes: browsing memorable sessions until a story feels true, and treating one vivid recording as evidence of market-wide demand. Replay is strongest when it explains a quantitative signal and leads back to a measurable change.

Run replay with a coupled performance, privacy, and value scorecard

Session replay is not finished when playback works. It is an operating capability with client releases, configuration changes, storage growth, access decisions, and incident risk. Give it an owner and review the system across five dimensions.

Dimension	Signals to watch	Decision the signals should trigger
User experience	Web Vitals, long tasks, main-thread work, memory growth, and replay bytes	Reduce capture detail, change delivery behavior, narrow coverage, or halt a rollout when the recorder breaks its budget
Replay fidelity	Queue drops, missing segments, incomplete uploads, event integrity, and playback reconstruction errors	Fix prioritization or transport before teams rely on incomplete recordings for decisions
Platform reliability	Ingestion failures, processing delay, retrieval latency, playback-start failures, and behavior during traffic spikes	Add capacity, repair a failing stage, or adjust sampling without shifting the problem into the browser
Privacy and governance	Redaction test failures, capture outside approved consent states, retention exceptions, and access outside approved roles	Disable affected capture, contain the data, follow the approved deletion or incident process, and repair the control before restoring it
Decision value	Investigations that reached a useful replay, time to diagnosis, time to resolution, and product hypotheses validated outside replay	Move coverage toward high-value use cases or retire collection that produces no action

These dimensions constrain each other. Aggressive compression may improve bandwidth while hurting reconstruction. More capture may improve fidelity while violating the page budget. Narrow access may improve governance while blocking the support engineers responsible for incident response. The job is not to maximize any single metric; it is to keep the entire system inside approved boundaries.

Version capture configuration like production code. A seemingly harmless selector change can expose text, remove necessary context, or increase mutation volume. Test recorder and configuration releases against fixture pages containing known sensitive values and known reconstructable interactions. Keep a rollback path.

Prepare shutdown controls before launch. You should be able to stop capture for a component, route, environment, tenant group, or the whole product without waiting for a new application release. Document who can use each control, how queued data is handled, how affected stored data is identified, and when privacy, security, support, and engineering must be involved. If collection crosses a prohibited boundary, continuing to record while the team debates ownership compounds the exposure.
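
Scoped shutdown can be modeled as a check the recorder consults before capturing anything. This sketch assumes the disabled scopes arrive from a remotely updated configuration, which is what removes the dependency on an application release. The CaptureContext fields follow the scopes listed above; the names are otherwise hypothetical.

```python
from dataclasses import dataclass

SCOPES = ("environment", "tenant_group", "route", "component")


@dataclass(frozen=True)
class CaptureContext:
    environment: str
    tenant_group: str
    route: str
    component: str


def capture_allowed(ctx: CaptureContext, disabled: dict, global_stop: bool = False) -> bool:
    """Return False if any shutdown control covers this capture context."""
    if global_stop:
        return False  # whole-product stop
    for scope in SCOPES:
        if getattr(ctx, scope) in disabled.get(scope, ()):
            return False
    return True


# Hypothetical remote configuration: one route and one tenant group are stopped.
disabled_scopes = {"route": {"/payments"}, "tenant_group": {"regulated-eu"}}
```

The check answers only whether new capture may start. Handling of queued and already stored data still needs the documented process described above.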

Finally, connect replay operations to the workflows that consume it. Product teams need links from behavioral cohorts to relevant sessions. Support needs controlled escalation paths. Engineering and SRE need errors, network signals, layout shifts, and performance context close to the replay timeline. Connecting interaction context to observability and delivery workflows can shorten the path from an anomaly to a testable explanation, but only if the data remains trustworthy and accessible to the right roles.

Key takeaways

Approve a capture contract for each surface before approving a broader sample rate.
Redact or exclude sensitive data before it leaves the browser; a masked player is not enough.
Protect the page with decoupled delivery, bounded queues, adaptive batching, and explicit backpressure priorities.
Keep random sampling for prevalence and use targeted sampling to explain known signals.
Operate performance, fidelity, platform reliability, privacy, and decision value as a coupled scorecard.
Require scoped shutdown controls, retention handling, access ownership, and rollback before production expansion.

Before you increase replay coverage, ask for two artifacts: a one-page capture contract for the next journey and a replay-on versus replay-off test under that journey’s difficult conditions. If the team cannot show what is allowed to leave the browser, how the page stays within budget, and which decision the recordings will change, the rollout is not ready to scale.

May 7, 2026

Amplitude MCP: Evidence-Grounded AI Workflows for Product Teams

An AI assistant can produce a convincing roadmap recommendation or code patch before you have established what users actually did. That speed feels productive until a confident answer turns an instrumentation gap, a rare edge case, or a coincidental sequence into a product decision.

Amplitude MCP is most useful when it reverses that order. The assistant retrieves behavioral evidence first, labels what is observed versus inferred, proposes a bounded action, and defines how the result will be verified. You still make the decision and own the release, but you spend less time moving context between analytics, product documents, Session Replay, and the development environment.

Key takeaways

Treat Amplitude MCP as an evidence-retrieval layer, not an automated decision-maker. Access to analytics does not make every conclusion valid.
Require every response to separate observed behavior, inferred explanations, proposed actions, and verified outcomes.
Use aggregate analytics to establish prevalence and affected segments, Session Replay to understand the journey, and code-level tests to validate a technical explanation.
End product workflows with a decision brief and engineering workflows with a reproducible test, a controlled release plan, and post-release behavioral verification.
Begin with a narrow, high-value workflow. Apply least-privilege access, redact sensitive data, and evaluate retrieval accuracy, analytical discipline, latency, and business usefulness before expanding.

Create an evidence contract before asking for a recommendation

An MCP connection can make evidence accessible, but it cannot decide whether your event taxonomy is reliable, whether a cohort is appropriate, or whether a pattern is causal. Amplitude MCP can let an assistant request behavioral context such as funnels, cohorts, segments, and user journeys as needed. Your workflow still has to constrain what is retrieved and how it may be interpreted.

The practical control is an evidence contract: a short specification for the question, the permitted data, the expected output, and the point at which the assistant must stop. Write it before asking for a recommendation. Otherwise, the assistant can silently change the population, comparison, or definition while producing an answer that sounds coherent.

Decision: State the exact choice the analysis is meant to inform. “Improve onboarding” is a theme; “decide which onboarding step needs further investigation” is a decision.
Population: Name the relevant segment, account type, lifecycle stage, product surface, or release exposure. Do not let the assistant substitute all users because that query is easier.
Behavior definition: Specify the events or funnel that represent the outcome. If activation, retention, or failure has no agreed event definition, resolve that ambiguity before interpreting results.
Comparison: Define the cohort, release, segment, or other baseline against which a difference should be assessed.
Permitted evidence: List the analytics views, event paths, Session Replays, error details, and code context the assistant may use.
Required traceability: Make the assistant identify the query, event definition, segment, and replay behind each material observation.
Abstention rule: Require the assistant to say when missing instrumentation, insufficient data, or conflicting evidence prevents a conclusion.

A reusable prompt can be direct: “Analyze [outcome] for [segment] using [funnel, cohort, or event path]. Use [comparison] as the baseline. For every conclusion, identify the supporting query or replay. Return observed facts, data limitations, hypotheses, next retrievals, recommended action, and a verification plan. If the evidence is insufficient, state what is missing instead of filling the gap.”
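One way to keep the contract from being skipped is to make the prompt impossible to render until the contract is complete. The sketch below is an assumption about how a team might encode it; the class and field names are hypothetical, and the prompt wording simply reuses the template above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvidenceContract:
    """Illustrative structure; field names are assumptions, not a product schema."""
    decision: str
    outcome: str
    segment: str
    evidence_path: str  # funnel, cohort, or event path the assistant may use
    baseline: str       # comparison cohort, release, or segment

    def missing_fields(self):
        """List contract items that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def to_prompt(self):
        """Render the prompt only when every contract item is filled in."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("Evidence contract incomplete: " + ", ".join(missing))
        return (
            f"Decision to inform: {self.decision}. "
            f"Analyze {self.outcome} for {self.segment} using {self.evidence_path}. "
            f"Use {self.baseline} as the baseline. "
            "For every conclusion, identify the supporting query or replay. "
            "Return observed facts, data limitations, hypotheses, next retrievals, "
            "recommended action, and a verification plan. "
            "If the evidence is insufficient, state what is missing instead of filling the gap."
        )
```

Raising on a blank field is a deliberate choice in this sketch: an incomplete contract fails loudly before the assistant has a chance to substitute its own population or baseline.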

The labels matter. Without them, a behavioral sequence can become a supposed root cause within one paragraph. Use the following distinction in product investigations, incident work, and roadmap analysis:

Layer	What belongs here	What must support it
Observed	An event pattern, funnel difference, cohort trend, replayed interaction, error, or test result	A traceable query, event timeline, replay, log, or test output
Inferred	A plausible explanation for the observed behavior	Supporting and conflicting evidence, plus assumptions that remain unverified
Proposed	An instrumentation change, discovery step, experiment, code change, or rollout action	A stated rationale, expected effect, risk, and owner
Verified	A conclusion that the intervention produced the intended result without an unacceptable regression	Post-change tests and behavioral evidence using definitions consistent with the original investigation

This structure does more than improve prompt quality. It makes reviews faster. A product manager can challenge the population, an analyst can challenge the event definition, and an engineer can challenge the technical hypothesis without reopening the entire conversation.
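The four labels can also be checked mechanically before a human review. The following is a rough sketch under assumed names (`Layer`, `Claim`, `review_problems`); the rules are a simplified reading of the table, not a complete validation scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"
    PROPOSED = "proposed"
    VERIFIED = "verified"


@dataclass
class Claim:
    """One labeled statement from an assistant response (illustrative shape)."""
    layer: Layer
    statement: str
    evidence_refs: tuple = ()     # query IDs, replay links, logs, test outputs
    open_assumptions: tuple = ()  # what remains unverified
    owner: str = ""


def review_problems(claim):
    """Return the reasons a claim is not yet supported for its layer."""
    problems = []
    if claim.layer in (Layer.OBSERVED, Layer.VERIFIED) and not claim.evidence_refs:
        problems.append("needs a traceable query, replay, log, or test output")
    if claim.layer is Layer.INFERRED and not claim.open_assumptions:
        problems.append("must state the assumptions that remain unverified")
    if claim.layer is Layer.PROPOSED and not claim.owner:
        problems.append("needs a named owner")
    return problems
```

A check like this does not judge whether a claim is true. It only flags claims whose label promises support that the response did not supply, which is where reviewers usually need to look first.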

Turn product questions into bounded analytics tasks

Broad questions invite broad stories. “Why is activation down?” asks the assistant to choose the definition, locate a pattern, infer a cause, and recommend a solution in one leap. Break that work into retrieval, interpretation, and decision stages instead.

Find an activation blocker without inventing causality

Suppose you need to determine which onboarding step deserves attention for an SMB segment. Behavioral analytics can locate where journeys diverge, while Session Replay can show what happened around that point. Neither alone proves why the behavior occurred.

Define activation. Name the event or event sequence that represents the outcome. If stakeholders use different definitions, surface that disagreement rather than averaging it away.
Fix the population and comparison. Specify the SMB segment and the cohort, release, or successful journey against which it should be compared.
Retrieve the funnel or event path. Ask for the event definitions as well as the result. An unexplained event name is not enough to support a decision.
Locate the observed divergence. Identify where completion or progression differs. Call it a divergence, not a cause or even a blocker yet.
Inspect contrasting journeys. Review unsuccessful and successful Session Replays around the same step. Capture UI state, preceding actions, environment details, errors, and unexpected loops.
Generate competing hypotheses. Include product friction, technical failure, user intent, and instrumentation error where each is plausible. Ask what evidence would weaken each explanation.
Choose the next action that matches the evidence. That may be additional instrumentation, customer discovery, a controlled experiment, a targeted technical investigation, or a product change. The assistant should not default to shipping.
Write the decision record. Preserve the query, segment, replay references, observed facts, unresolved uncertainty, chosen action, and verification signal.

Do not let the assistant jump from “fewer users completed this step” to “the copy is confusing.” The first statement may be observable. The second is a hypothesis that needs corroboration. This distinction is the difference between faster analysis and faster rationalization.

Use behavioral context to sharpen roadmap decisions

Behavioral evidence can show whether a problem appears in real journeys, which segments encounter it, and how the surrounding path differs. It does not determine strategic importance, implementation cost, contractual commitments, regulatory exposure, or the opportunity cost of displacing other work. Those remain product leadership inputs.

Ask the assistant to produce an opportunity brief rather than a priority score. The brief should contain:

The outcome and user segment under consideration
The observed behavior and the exact analytics definition behind it
The prevalence and journey context the available evidence can support, without pretending that frequency equals severity
Successful paths or unaffected segments that provide counterevidence
Known data-quality limitations
Competing explanations and what would distinguish them
The smallest useful discovery, instrumentation, experiment, or delivery step
The signal that would cause you to continue, revise, or stop

This format is particularly useful for activation and retention work because it prevents a familiar category error: an analytics pattern describes behavior, while a roadmap decision combines that behavior with strategy, feasibility, risk, and judgment. Amplitude MCP can improve the behavioral part of the decision without pretending to own the whole decision.

Close the engineering loop from customer signal to verified fix

Code generation is only the middle of a debugging workflow. The more important sequence is evidence, reproduction, hypothesis, failing test, bounded change, controlled release, and verification. Amplitude MCP helps connect the customer side of that sequence to Claude or Cursor, but a plausible diff is not a completed investigation.

From a customer report to a reproducible failure

A support ticket usually contains a symptom. Turn it into an evidence packet before asking the coding assistant for a fix.

Establish impact. Use behavioral analytics to find affected segments, related anomalies, and comparable successful journeys. This tells you whether you are investigating an isolated path or a broader degradation.
Reconstruct the experience. Use Session Replay to capture the sequence of actions, UI state, environment, and the moment the behavior diverged. Preserve timestamps for relevant console errors or API failures.
State expected versus actual behavior. Do not make the coding assistant infer the product requirement from the failure.
Provide constraints. Include known dependencies, release exposure, rate limits, feature-flag state, and any code areas that must not change.
Ask for hypotheses before a patch. Require a list of candidate causes, supporting evidence, contradictory evidence, and missing instrumentation.
Request the smallest failing test. Whenever feasible, reproduce the failure in a test before accepting a code change. If urgent containment is necessary, record it separately from the durable fix.
Validate locally and through CI/CD. A generated test or patch still needs human review and the normal engineering checks.
Release behind a feature flag where appropriate. Limit exposure while verifying the behavior in production.
Verify with the original signals. Re-run the relevant analytics, inspect post-change replays, and monitor related behavioral and performance indicators before increasing exposure.

This workflow can turn a replayed customer problem into reproduction steps, a root-cause hypothesis, a minimal failing test, and a controlled verification plan. The human owner still decides whether the evidence is sufficient, whether the patch is safe, and whether the rollout should continue.

A useful debugging prompt is: “Reconstruct the observed sequence from this replay and event timeline. Separate facts from suspected causes. Identify missing instrumentation. Propose the smallest failing test and the narrowest relevant patch surface. State what post-release evidence would confirm or falsify the fix.”

A passing test proves that the code behaves under the conditions represented by that test. It does not prove that the affected customer journey is repaired. That is why the workflow returns to behavioral evidence after deployment.

From a code symptom back to customer impact

Sometimes the investigation begins with a flaky test, a suspicious diff, or a performance regression. In that direction, the assistant first maps possible failure modes and critical code paths. Amplitude then helps answer whether real users reach those paths, under which conditions, and with what observable consequences.

Give the assistant the test failure, diff, or performance symptom and ask it to enumerate the affected code paths.
Translate those paths into observable events, screens, releases, or journey conditions. If no observable signal exists, add instrumentation before making a product-impact claim.
Retrieve matching behavioral patterns and inspect replays that support and contradict the suspected failure.
Separate technical correctness from operational priority. A real defect may have limited observed reach; a common path may still be functioning correctly.
Implement and test the narrowest justified change.
After release, monitor the original journey, relevant errors, and performance measures such as Web Vitals before ramping the flag.

Frequency must not become the only severity test. Security, privacy, data-integrity, and irreversible-loss risks can demand action even when behavioral analytics shows few affected sessions. Use analytics to understand exposure, not to override the appropriate risk process.

Scale only after retrieval and governance earn trust

The strongest rollout begins with one recurring question, not unrestricted access to every project and replay. Activation blockers and bug triage are good candidates because the input, evidence, decision, and verification artifacts can all be made explicit. Start with a high-value, lower-risk dataset and expand only after the workflow performs reliably.

Make access narrower than the assistant’s capability

Session Replay and event data can contain sensitive customer context. An MCP connection does not remove the obligations attached to that data. Apply the same access rules inside the AI workflow that apply in the analytics product, then reduce exposure further where the task does not require it.

Begin with read-only retrieval for the selected workflow.
Limit access to the relevant projects, datasets, and replay permissions supported by your access model.
Redact sensitive fields before the data reaches either replay or the assistant.
Send the minimum context necessary for the task. Prefer event identifiers, stack traces, test cases, and bounded timelines over raw personally identifiable information.
Keep analytics retrieval, code modification, and deployment authority separate. Successful retrieval is not a reason to grant release permissions.
Preserve the query and evidence references behind material decisions so a reviewer can reconstruct what the assistant saw.
Treat a replay link as governed customer data, not as a generic attachment that can be copied into any conversation.

These controls reflect a practical privacy-by-design rule: include only the information needed to reach the fix and favor structured technical artifacts over raw PII. If the workflow cannot answer a question within those boundaries, the correct result may be escalation to an authorized person rather than broader automated access.
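A minimal sketch of the "minimum context" rule, assuming events arrive as dictionaries. The allowlist and key names are hypothetical; a real workflow would derive them from its own evidence contract and data classification.

```python
# Assumed allowlist for a debugging task; every key name here is illustrative.
ALLOWED_KEYS = frozenset(
    {"event_id", "event_name", "timestamp", "release", "stack_trace", "feature_flags"}
)


def build_context_packet(raw_event):
    """Split an event into fields to send and the names of fields withheld."""
    packet = {key: value for key, value in raw_event.items() if key in ALLOWED_KEYS}
    withheld = sorted(key for key in raw_event if key not in ALLOWED_KEYS)
    return packet, withheld
```

An allowlist is used instead of a denylist so that a newly added field is withheld by default. Returning the withheld names gives a reviewer a record of what the assistant did not see without copying the values themselves.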

Evaluate the workflow, not just the prose

A polished response is a weak success criterion. Build an evaluation set from representative work and include cases where the answer is easy, where it is ambiguous, where current instrumentation cannot support it, and where permissions block it. The assistant should succeed by reaching the right conclusion or by refusing to overstate what the evidence supports.

Retrieval correctness: Did it use the intended project, event definitions, segment, comparison, and available time scope?
Traceability: Can a reviewer follow every material observation back to a query, replay, error, or test?
Analytical discipline: Did it distinguish behavioral association from cause and identify counterevidence?
Action quality: Is the proposed next step bounded, testable, and proportionate to the evidence?
Abstention quality: Did it stop when data was missing, permissions were insufficient, or the available evidence conflicted?
Latency: Did the workflow reduce time spent finding and transferring context without adding review overhead elsewhere?
Business usefulness: Did the evidence improve the decision, reproduction, or verification outcome rather than merely shorten the response?
Governance: Did retrieval stay within approved access and data-handling boundaries?

Classify failures by layer. A wrong segment is a retrieval failure. An unsupported causal claim is an interpretation failure. An oversized code rewrite is an action failure. Exposure of unnecessary customer data is a governance failure. That classification tells you whether to change permissions, analytics definitions, prompts, review rules, or the underlying product instrumentation.
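The layer classification can be recorded alongside each evaluation run so that patterns become visible over time. This is a sketch only; the failure codes are hypothetical labels for the four examples in the paragraph above.

```python
from collections import Counter
from enum import Enum


class FailureLayer(Enum):
    RETRIEVAL = "retrieval"
    INTERPRETATION = "interpretation"
    ACTION = "action"
    GOVERNANCE = "governance"


# Hypothetical failure codes mapped to the layer where each belongs.
FAILURE_LAYERS = {
    "wrong_segment": FailureLayer.RETRIEVAL,
    "unsupported_causal_claim": FailureLayer.INTERPRETATION,
    "oversized_code_rewrite": FailureLayer.ACTION,
    "unnecessary_customer_data": FailureLayer.GOVERNANCE,
}


def tally_by_layer(failure_codes):
    """Count recorded failures per layer; unknown codes raise instead of being guessed."""
    unknown = sorted(set(failure_codes) - set(FAILURE_LAYERS))
    if unknown:
        raise ValueError("Unclassified failure codes: " + ", ".join(unknown))
    return Counter(FAILURE_LAYERS[code] for code in failure_codes)
```

A tally dominated by one layer points at where to intervene: retrieval failures suggest permissions or analytics definitions, while interpretation failures suggest prompts and review rules.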

Use a narrow adoption sequence

Choose one repeated workflow with a visible evidence trail, such as activation analysis or production bug triage.
Record how the workflow operates without MCP, including where context is lost and which handoffs cause rework.
Define the evidence contract, approved access, expected artifact, and human decision gate.
Run representative cases and record retrieval, interpretation, action, and governance failures.
Standardize the prompts, evidence packet, and review checklist only after the failure patterns are understood.
Measure time-to-insight, decision usefulness, and engineering outcomes without assuming that faster responses mean better decisions.
Expand to retention analysis, roadmap shaping, or experiment generation only when the narrow workflow remains traceable and safe.

For incident and engineering use cases, preserve root causes and guardrails as docs-as-code so the next investigation can retrieve known failure patterns instead of rediscovering them. Watch change lead time and deployment frequency alongside stability; speed that produces more regressions is not an improvement.

Start with one decision your team faces repeatedly. Define what the assistant may observe, how it must label inference, who approves the action, and what evidence will verify the result. If it cannot show that chain, it is not ready to influence the decision. If it can, Amplitude MCP becomes more than a convenient connector: it becomes part of a disciplined evidence loop between product behavior and execution.

May 6, 2026

AI Product Data Security: A Practical Playbook for PMs

Your AI feature is ready to move beyond the prototype, but one question can still stop the release: exactly which customer data leaves your boundary, where is it copied, and who can retrieve it later? If the answer is scattered across architecture diagrams, vendor settings, and assumptions, you do not yet have a security decision.

You can resolve that uncertainty without turning every experiment into a committee exercise. Map the data path, assign the capability a risk lane, minimize what the model receives, and automate the controls that follow from the classification. The result is a release process that is both faster and easier to defend.

Start with the data path, not the model

The first security question is not what the model knows. It is what your product sends, retrieves, transforms, stores, logs, and displays. A provider can have a strong security posture while your implementation still exposes data through an overbroad retrieval query, a debug log, or an incorrectly scoped support tool.

Draw the complete path for one user request. Do not use a generic platform diagram. Follow the actual capability from the moment a user or system creates an input until every resulting copy has expired or been deleted.

Identify the original input, including form fields, uploaded files, messages, system-generated events, and API payloads.
List the context added by your application, such as account attributes, conversation history, analytics, retrieved documents, feature configuration, or tool results.
Mark every transformation before the model call: filtering, redaction, tokenization, summarization, chunking, or schema conversion.
Name the service that receives each payload, including gateways, model providers, observability tools, evaluation systems, queues, and caches.
Trace the response through validation, tool execution, display, analytics, support access, and downstream storage.
Record when each copy expires, how deletion propagates, and who can access it while it exists.

For every step, capture six fields: data class, system owner, access scope, external recipient, retention rule, and failure consequence. If any field is unknown, label it unknown. An explicit unknown is useful discovery work; an undocumented assumption is hidden risk.
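The six fields and the explicit "unknown" label lend themselves to a small structured record. The sketch below assumes one record per step in the data path; the class and function names are illustrative.

```python
from dataclasses import dataclass, fields

UNKNOWN = "unknown"


@dataclass
class DataPathStep:
    """One step in the data path; every field defaults to an explicit unknown."""
    name: str
    data_class: str = UNKNOWN
    system_owner: str = UNKNOWN
    access_scope: str = UNKNOWN
    external_recipient: str = UNKNOWN
    retention_rule: str = UNKNOWN
    failure_consequence: str = UNKNOWN


def open_unknowns(steps):
    """Return (step name, field name) pairs that still need discovery work."""
    return [
        (step.name, f.name)
        for step in steps
        for f in fields(step)
        if f.name != "name" and getattr(step, f.name) == UNKNOWN
    ]
```

Defaulting every field to unknown means a step cannot look complete by omission. The resulting list doubles as the discovery backlog for the map.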

Do not stop at obvious records such as customer PII and payment identifiers. Prompts, retrieved context, user-linked analytics, internal roadmaps, feature flags, configuration values, embeddings, vector stores, and evaluation datasets can also reveal confidential facts or inferred identity. Treat them as product data with owners and controls, not harmless implementation residue.

Use a completion test that exposes weak assumptions

Your map is ready for a decision when someone outside the feature team can answer these questions from it:

What is the most sensitive field the capability can receive?
Which fields cross the company boundary, and which named service receives them?
Can one customer ever retrieve another customer’s data?
Are raw prompts, completions, retrieved passages, or tool results logged?
Which identities can inspect those logs or replay a request?
What happens to derived data when the original record is deleted or its permissions change?
Which control contains the incident if the model, retrieval layer, or tool call behaves unexpectedly?

If the team can only answer these questions by asking several vendors or searching production settings, keep the release open. The missing work is not paperwork. It is part of the product’s operating design.

Turn the risk assessment into a release lane

A risk score is useful only when it changes what the team must do. Avoid a long questionnaire that ends with an ambiguous rating. Use a small number of lanes, give each lane an observable entry condition, and attach default release controls.

Risk lane	Typical signals	Default release posture
Low	Internal capability; synthetic or public inputs; no sensitive context; no consequential external action	Approved provider, least-privilege credentials, basic access tests, and confirmation that secrets are not entering prompts or logs
Elevated	Customer-facing capability; authenticated user context; behavioral telemetry; stored prompts or outputs; retrieval from private content	Data minimization, pre-call redaction, permission-aware retrieval, explicit retention, adversarial evaluations, runtime monitoring, and a named incident owner
High	Regulated-data adjacent; payment identifiers; broad confidential retrieval; sensitive identity data; or authority to perform a consequential action	Early security, legal, privacy, and data involvement; documented threat model; human approval where an action warrants it; verified containment; and release evidence reviewed before exposure

These lanes are an operating model, not a compliance determination. Applicable controls depend on the actual data, customer contracts, geography, industry, and use case. Security and legal specialists should make those determinations when the capability creates legal, regulatory, or material customer exposure.

Classify the capability, not the entire product. A writing assistant that uses text supplied for a single request may sit in a different lane from an account assistant that searches every customer conversation and updates CRM records, even when both use the same model.

Score the capability across these dimensions:

Data sensitivity: public, internal, confidential, personal, payment-related, or regulated-data adjacent.
Audience: constrained employee group, all employees, authenticated customers, or public users.
Retrieval reach: one supplied record, an authorized account subset, or a broad internal corpus.
Action authority: produces a suggestion, drafts a change, or executes an external action.
Persistence: ephemeral processing, structured event storage, or retained raw inputs and outputs.
Third-party exposure: stays inside your controlled environment or passes through one or more providers and subprocessors.

Use the highest-risk dimension to set the initial lane. Lower it only after a design change removes the exposure. A promise to be careful is not a mitigating control; scoped retrieval, enforced redaction, disabled raw logging, and restricted tool permissions are.
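The "highest-risk dimension wins" rule is simple enough to encode directly. In this sketch the team supplies a lane rating for each of the six dimensions; how a given value maps to a lane is left to that team, and the dimension keys are assumed names.

```python
from enum import IntEnum


class Lane(IntEnum):
    LOW = 1
    ELEVATED = 2
    HIGH = 3


# Assumed keys for the six dimensions listed above.
DIMENSIONS = (
    "data_sensitivity",
    "audience",
    "retrieval_reach",
    "action_authority",
    "persistence",
    "third_party_exposure",
)


def initial_lane(ratings):
    """Set the initial lane to the highest rating; refuse to score a partial assessment."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError("Unrated dimensions: " + ", ".join(missing))
    return max(ratings[d] for d in DIMENSIONS)
```

Taking the maximum rather than an average is the point of the rule: five low-risk dimensions do not dilute one high-risk dimension.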

Reclassify when the feature changes its data, audience, retrieval reach, retention, provider, or ability to act. A seemingly small roadmap addition, such as remembering past conversations or connecting a second data source, can change the security posture more than a model upgrade does.

Design the system to disclose less data

The most reliable way to protect data is to keep unnecessary data out of the AI path. Encryption and contractual terms matter, but they do not make an irrelevant customer field necessary. Start with the user outcome and ask which minimum facts the model needs to produce it.

Minimize before you redact

Redaction is a valuable deterministic safeguard, but it should not carry the whole design. Free-form text can contain names, secrets, identifiers, and confidential business information in formats your rules do not recognize. Reduce the payload first, then redact the smaller payload that remains.

Replace a full customer object with the few fields required for the task.
Use a temporary account token when the model does not need a person’s name, email address, or payment identifier.
Convert long interaction histories into purpose-specific structured fields when the task does not require the original prose.
Exclude internal notes, disabled fields, hidden metadata, and unrelated attachments by default.
Log structured events such as policy result, model identifier, latency, and request status when raw prompt text is not required.

Separate identity from content wherever the workflow allows it. The application can retain the relationship between a temporary token and an account while the model processes only the content needed for the task. Access to the token map should remain narrower than access to routine AI telemetry.
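A rough sketch of the token map, assuming an in-memory store for illustration. The class name, token prefix, and storage are hypothetical; a production version would need persistence, expiry, and its own access controls.

```python
import secrets


class TokenMap:
    """Illustrative in-memory map between account IDs and opaque tokens."""

    def __init__(self):
        self._account_to_token = {}
        self._token_to_account = {}

    def token_for(self, account_id):
        """Return a stable opaque token; the model only ever sees this value."""
        if account_id not in self._account_to_token:
            token = "acct_" + secrets.token_hex(8)
            self._account_to_token[account_id] = token
            self._token_to_account[token] = account_id
        return self._account_to_token[account_id]

    def resolve(self, token):
        """Map a token back to the account; keep access to this call narrow."""
        return self._token_to_account[token]
```

The token is random rather than a hash of the account ID, so it cannot be reversed or recomputed by someone who only sees routine AI telemetry.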

Make retrieval permission-aware

A retrieval-first architecture can keep the raw corpus inside your controlled boundary while selecting only relevant context for a request. It is not automatically private. If an external model receives the selected passages, those passages still cross the boundary and still require minimization, redaction, approved-provider controls, and a clear retention policy.

Apply authorization when the request is made, not only when content is indexed. The retrieval layer should constrain results by tenant, user, role, and current document permissions before any text becomes model context. Do not index content that the eventual searcher could never be allowed to read unless the architecture has another enforceable isolation boundary.
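A minimal sketch of request-time filtering, with assumed names throughout. `can_read_now` stands in for whatever service holds current document permissions. In practice the filter is usually pushed into the retrieval query itself; the post-filter here only illustrates the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    tenant_id: str
    allowed_roles: frozenset
    text: str


def authorized_passages(candidates, *, tenant_id, user_id, role, can_read_now):
    """Keep only passages the requester may read at request time.

    can_read_now(user_id, doc_id) is a hypothetical callback that checks
    current permissions, not the permissions recorded when content was indexed.
    """
    return [
        p
        for p in candidates
        if p.tenant_id == tenant_id
        and role in p.allowed_roles
        and can_read_now(user_id, p.doc_id)
    ]
```

Only the returned passages should ever be assembled into model context. Anything filtered out never reaches the prompt, so the model is not asked to keep a secret it was given.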

Treat embeddings and vector-store metadata as sensitive derived data. A vector is not a magic anonymizer, and metadata can disclose document names, account relationships, categories, or activity patterns even when full text is elsewhere. Your deletion and permission-change process must reach the index, cached results, evaluation copies, and any stored citations, not just the primary database.

Retrieved content is also untrusted input. A malicious or compromised document can contain instructions intended to change model behavior. Keep system instructions separate, restrict available tools, validate tool arguments, and enforce authorization in application code. The model should never be the component that decides whether a user may access a record or perform an action.

Place deterministic controls on both sides of the call

Before the call: validate the request schema, remove disallowed fields, redact known sensitive patterns, apply allow and deny policies, and constrain retrieval.
After the call: validate output structure, block disallowed sensitive patterns, verify any cited record belongs to the authorized scope, and check tool arguments before execution.
During operation: monitor unusual prompt, output, retrieval, and access patterns without creating a second uncontrolled store of raw content.

An output filter cannot undo data already disclosed to an external provider. Use post-call checks to protect users and downstream systems, but use pre-call minimization and access enforcement to prevent the disclosure itself.
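The pre-call and post-call checks might look like the sketch below. The request schema, the single email pattern, and the function names are assumptions chosen for brevity; real redaction needs many more patterns and still will not catch everything in free-form text.

```python
import re

ALLOWED_FIELDS = ("task", "message")  # assumed request schema
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def prepare_request(payload):
    """Pre-call: drop fields outside the schema, then redact a known pattern.

    Assumes every allowed field holds a string.
    """
    return {
        key: EMAIL.sub("[REDACTED_EMAIL]", payload[key])
        for key in ALLOWED_FIELDS
        if key in payload
    }


def output_problems(text, cited_ids, authorized_ids):
    """Post-call: report sensitive patterns and citations outside the allowed scope."""
    problems = []
    if EMAIL.search(text):
        problems.append("output contains an email-like pattern")
    out_of_scope = sorted(set(cited_ids) - set(authorized_ids))
    if out_of_scope:
        problems.append("cites records outside the authorized scope: " + ", ".join(out_of_scope))
    return problems
```

The two functions are deliberately asymmetric. `prepare_request` changes what leaves the boundary, while `output_problems` can only report on what has already come back.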

Make vendor approval specific to the intended use

Do not approve an AI vendor in the abstract. Approve a defined service, account configuration, data class, region, retention posture, and use case. A provider suitable for public-content summarization may not be suitable for customer conversations or payment-related identifiers.

Ask questions that produce enforceable answers rather than broad assurances:

Training and service improvement: Can prompts, files, retrieved passages, outputs, feedback, or metadata be used to train models or improve services? Is the restriction a default, a setting, or a contractual term?
Retention: How long does each data type remain in primary systems, safety systems, failure logs, backups, and support tooling? What initiates deletion, and what exceptions apply?
Human access: Under what conditions can provider personnel inspect customer content, and how is that access authorized, logged, and reviewed?
Security controls: Is data encrypted in transit and at rest? What key-management options, private networking, scoped credentials, access logs, and administrative controls are available?
Location and subprocessors: Which regions process and store the data? Where can support access occur? Which subprocessors participate in the path?
Assurance evidence: Which services and controls are covered by SOC 2, ISO 27001, or HIPAA-related commitments where relevant to the use case?
Response: How will the provider communicate a security incident, policy change, model change, or subprocessor change that affects your approved use?

An audit or certification is useful evidence about a defined scope. It is not proof that your architecture, settings, or use case is safe. Confirm that the service named in the evidence is the service your product will actually call, and that your configuration does not bypass the controls you evaluated.

Keep a short decision record with the approved purpose, permitted and prohibited data, named endpoints or services, required account settings, retention terms, region, responsible owner, and review triggers. Reopen the decision when the purpose, data class, provider terms, model path, subprocessor chain, or architecture changes.

A shared catalog of approved providers and patterns also reduces shadow AI. Make the approved route easier to use by supplying scoped credentials, reference architectures, redaction utilities, retrieval patterns, and clear examples of prohibited inputs. Governance works better when the safe path is a usable product for internal teams.

Put the controls into delivery and incident response

A policy that depends on every engineer remembering every rule will drift. Store the capability’s classification, required controls, approved provider configuration, and decision owner alongside the delivery artifacts. Version changes so the team can see when a new data source or retention behavior altered the release posture.

Translate the release lane into automated checks wherever the control can be tested:

Scan prompts, templates, configuration, and code for exposed secrets and unapproved endpoints.
Unit-test redaction and tokenization against representative allowed and disallowed inputs.
Integration-test tenant boundaries, role permissions, retrieval filters, and deletion propagation.
Run evaluations that attempt to elicit restricted data, override instructions, retrieve unauthorized records, or trigger tools outside the allowed scope.
Validate the selected provider, model path, region, logging setting, and retention configuration against the approval record.
Block release when required evidence, monitoring, rollback controls, or an incident owner is missing.
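As one illustration of the configuration check in the list above, the sketch below compares a deployed provider configuration with an approval record and returns the reasons to block a release. The `ApprovalRecord` structure, its field names, and the sample values are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRecord:
    """Hypothetical approval record; field names are illustrative."""
    provider: str
    model_path: str
    region: str
    logging_enabled: bool
    max_retention_days: int
    incident_owner: str


def release_blockers(approval: ApprovalRecord, deployed: dict) -> list[str]:
    """Return the reasons to block a release; an empty list means the check passes."""
    blockers = []
    # Exact-match settings: any drift from the approved value blocks release.
    for key in ("provider", "model_path", "region", "logging_enabled"):
        if deployed.get(key) != getattr(approval, key):
            blockers.append(f"{key} differs from the approval record")
    # Retention may be shorter than approved, never longer or undefined.
    retention = deployed.get("retention_days")
    if retention is None or retention > approval.max_retention_days:
        blockers.append("retention_days is missing or exceeds the approved limit")
    if not approval.incident_owner:
        blockers.append("incident_owner is not assigned")
    return blockers


approval = ApprovalRecord("example-provider", "chat-v1", "eu-west", False, 30, "privacy-oncall")
deployed = {"provider": "example-provider", "model_path": "chat-v1",
            "region": "us-east", "logging_enabled": False, "retention_days": 30}
print(release_blockers(approval, deployed))
```

Run in CI, a non-empty list would fail the pipeline, so a region or retention change cannot ship until the approval record is reopened.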

Evaluation data needs the same scrutiny as production data. Remove unnecessary identities, restrict access, define retention, and avoid copying raw customer interactions merely because an evaluation system is internal. A test corpus can become a long-lived data store if nobody owns its lifecycle.

Monitor security-relevant events rather than indiscriminately recording content. Useful signals include blocked sensitive-data patterns, denied cross-scope retrieval, calls to unapproved services, unusual access behavior, unexpected changes in model or endpoint usage, and failed retention or deletion jobs. Structured metadata often provides the operational signal you need without preserving every prompt and completion.

Prepare containment before the first customer request

Your incident runbook should name the people and mechanisms needed to contain the feature. Depending on the incident, that can include disabling the affected path with a feature flag, revoking or rotating credentials, restricting retrieval, stopping unsafe logging, locating downstream copies, and contacting the provider.

Do not improvise evidence deletion or customer notification during an incident. Security, privacy, and legal owners should determine preservation, notification, and regulatory obligations based on the specific exposure. The product runbook should make those owners reachable and give them an accurate data-flow record, timestamps, affected systems, and containment status.

After containment, update the control that failed: the architecture, automated check, provider setting, policy, runbook, or team guidance. A review that ends with a reminder to be more careful leaves the same mechanism in place.

Key takeaways

Map every copy of the data, including retrieved passages, logs, embeddings, evaluations, caches, and tool results.
Classify individual capabilities by their highest-risk dimension, then attach mandatory controls to the lane.
Minimize fields before redaction, enforce permissions outside the model, and treat derived stores as sensitive.
Approve vendors for a named use, configuration, data class, region, and retention posture rather than issuing blanket approval.
Put redaction, access, retrieval, configuration, evaluation, and release checks into CI/CD.
Design containment and ownership before launch so an incident does not begin with a search for the right people and switches.

Pick one AI capability currently approaching release and produce its request-to-deletion data map. Assign its lane, turn every unknown into an owned backlog item, and automate the first control the team is still checking by hand. That is how security becomes part of product delivery instead of a negotiation at the end.

References

Shivam.Consulting Blog – AI Data Security for Product Teams: Protect Sensitive Product Data Without Slowing Innovation

April 27, 2026

How to Build Agentic AI for Product Analytics and Support

Your support bot can tell a customer where a setting lives, yet leave that customer to diagnose the problem, change the setting, and hope it worked. Your product team then receives a chat transcript without knowing whether the interaction improved activation, feature adoption, or retention.

If you are deciding how to connect AI, product analytics, and support, do not start with the model. Design the closed loop first: assemble trustworthy context, choose an allowed action, verify the resulting product state, and measure the user outcome. The model is one component inside that system.

Treat product analytics as the agent’s control plane

A useful standard is an assistant that understands the user’s context, can complete an allowed action, and measures whether the action helped. Remove any one of those capabilities and the experience degrades quickly. Context without action produces advice. Action without context creates risk. Action without measurement creates an impressive demo that cannot earn a durable place on the roadmap.

Product analytics supplies the behavioral context and outcome signals for this loop. It can show where the user is in a journey, which features have been adopted, which step failed, and whether the expected success event eventually occurred. It should not be treated as a warehouse-sized attachment to the prompt.

Define a support context contract

Create a small, governed context object for each supported workflow. Give the agent only the fields required to understand and resolve that workflow:

Actor and access: the authenticated user, account, role, entitlements, and permissions relevant to the requested action.
Journey state: the onboarding step, feature-adoption state, experiment assignment, or other stage that explains what the user is trying to complete.
Current product state: the relevant configuration from the operational system of record, including whether required prerequisites are satisfied.
Friction evidence: recent failed events, validation results, repeated attempts, and known errors connected to this workflow.
Desired outcome: the product state and behavioral event that will count as successful resolution.

Resolve analytics events and tool calls to the same stable user and account identifiers. Preserve timestamps and the origin of each field. For a live action decision, let the operational system of record determine current state; use analytics to explain the journey and measure the outcome. An event stream can be delayed or incomplete, so it should not overrule a current configuration read.

Behavior is also evidence, not intent. Repeated visits to a setup screen could indicate confusion, careful verification, or an advanced workflow. When those interpretations require different actions, the agent should ask one targeted question instead of turning a behavioral pattern into a confident diagnosis.

Apply data minimization at this boundary. Do not place secrets, payment information, unrelated conversation history, or an account’s entire event history into the model context. Filter fields before the model sees them, and enforce the filter in code rather than relying on a prompt instruction.
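A minimal sketch of enforcing that filter in code, assuming a per-workflow allowlist. The workflow name `onboarding_setup` and every field name are hypothetical. The sketch uses an allowlist rather than a denylist so that a newly added upstream field stays out of the model context until someone approves it.

```python
# Hypothetical context contract: workflow name -> fields the model may see.
CONTEXT_CONTRACTS = {
    "onboarding_setup": {
        "user_id", "account_id", "role", "journey_step",
        "config_state", "last_validation_error",
    },
}


def build_model_context(workflow: str, raw_record: dict) -> dict:
    """Return only the fields the contract allows for this workflow."""
    allowed = CONTEXT_CONTRACTS.get(workflow)
    if allowed is None:
        # No contract means no context: fail closed instead of passing data through.
        raise ValueError(f"no context contract defined for workflow {workflow!r}")
    return {key: value for key, value in raw_record.items() if key in allowed}


raw = {
    "user_id": "u-1", "account_id": "a-9", "role": "admin",
    "journey_step": "connect_source", "payment_token": "tok_example",
    "full_event_history": ["..."],
}
print(build_model_context("onboarding_setup", raw))
```

The payment token and the full event history never reach the prompt, regardless of what the prompt instructions say.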

Give the analytics agent a metric contract

An internal analytics agent has a different job from a customer-facing support agent. It may translate a product question into metrics, cohorts, funnels, or retention views, but a fluent answer is not enough. Require every analysis to return:

the product question it interpreted;
the metric definition and success event it used;
the cohort, filters, and observation window;
the analysis or query reference needed to reproduce the result;
known data-quality limitations and unresolved ambiguity; and
a clear distinction between observed association and demonstrated causal lift.

This turns the analytics agent into a traceable decision aid. It also prevents two agents from using the same metric name while silently applying different event definitions, account filters, or windows.
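One way to make the metric contract checkable is to return each analysis as a typed record and reject any record with an empty field. The sketch below assumes that approach; the class name, field names, and the two evidence labels are illustrative, not taken from a particular analytics product.

```python
from dataclasses import dataclass, fields

EVIDENCE_TYPES = {"observed_association", "causal_lift"}


@dataclass(frozen=True)
class AnalysisResult:
    """Hypothetical shape of one analytics-agent answer."""
    question: str
    metric_definition: str
    success_event: str
    cohort: str
    filters: str
    observation_window: str
    query_reference: str
    limitations: str
    evidence_type: str  # must be one of EVIDENCE_TYPES


def contract_violations(result: AnalysisResult) -> list[str]:
    """List the contract fields that are empty or carry an unknown evidence label."""
    problems = [f.name for f in fields(result)
                if not str(getattr(result, f.name)).strip()]
    if result.evidence_type not in EVIDENCE_TYPES and "evidence_type" not in problems:
        problems.append("evidence_type")
    return problems


answer = AnalysisResult(
    question="Did setup completion change after the new checklist?",
    metric_definition="accounts with setup_completed / accounts with setup_started",
    success_event="setup_completed",
    cohort="accounts created in the observation window",
    filters="",
    observation_window="28 days",
    query_reference="saved-query-123",
    limitations="event delivery can lag",
    evidence_type="probably causal",
)
print(contract_violations(answer))
```

An answer that fails the check goes back to the agent or to a reviewer instead of into a roadmap discussion.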

Design one closed loop from signal to verified outcome

The core unit of agentic support is not the conversation. It is a resolution attempt with a beginning, an authorized action, and a verifiable end state. Use the following loop for every workflow:

Observe the trigger. Capture the user’s request or a product signal that indicates likely friction.
Assemble scoped context. Load only the identity, permission, journey, state, and error fields defined in the context contract.
Diagnose the next constraint. Determine which prerequisite, configuration, permission, or knowledge gap is blocking progress. If the evidence is ambiguous, ask rather than assume.
Select an approved playbook. Match the constraint to a versioned workflow with explicit eligibility rules, allowed tools, and prohibited actions.
Obtain the required authorization. Show the proposed change and its consequence whenever the action changes product state or affects other people.
Execute through a narrow tool. Use a typed, allowlisted operation. Make retryable actions idempotent so a repeated call does not create duplicate changes.
Verify the result. Read the resulting product state and look for the defined success event. Tool completion alone does not prove customer resolution.
Record the outcome. Log the context version, playbook, model, policy decision, tool call, result, success signal, and any escalation or user reversal.
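The authorization, execution, and verification steps of the loop can be sketched as follows. `SettingsTool` is an in-memory stand-in for a real narrow tool, and the names and the idempotency-key scheme are assumptions for illustration. Checking for the behavioral success event is omitted to keep the sketch short.

```python
class SettingsTool:
    """In-memory stand-in for a narrow, allowlisted tool (hypothetical)."""

    def __init__(self) -> None:
        self.state: dict[tuple[str, str], bool] = {}
        self.changes_applied = 0
        self._seen: dict[str, dict] = {}  # idempotency key -> earlier result

    def set_flag(self, account_id: str, flag: str, value: bool, idempotency_key: str) -> dict:
        # A repeated key returns the earlier result without changing state again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.state[(account_id, flag)] = value
        self.changes_applied += 1
        result = {"status": "ok"}
        self._seen[idempotency_key] = result
        return result

    def read_flag(self, account_id: str, flag: str):
        return self.state.get((account_id, flag))


def attempt_resolution(tool: SettingsTool, account_id: str, flag: str,
                       confirmed: bool, idempotency_key: str) -> dict:
    """Execute one approved change and verify the resulting state."""
    if not confirmed:
        return {"outcome": "awaiting_confirmation", "verified": False}
    tool_result = tool.set_flag(account_id, flag, True, idempotency_key)
    # Tool completion is not resolution: read the state back.
    verified = tool.read_flag(account_id, flag) is True
    return {"outcome": "resolved" if verified else "escalate",
            "verified": verified, "tool_status": tool_result["status"]}


tool = SettingsTool()
first = attempt_resolution(tool, "a-9", "weekly_digest", True, "req-001")
retry = attempt_resolution(tool, "a-9", "weekly_digest", True, "req-001")
print(first, retry, tool.changes_applied)
```

The retry reports the same outcome, yet the change counter stays at one, which is the property a timeout or interrupted conversation needs.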

The loop supports two related products without collapsing their permissions. An internal analytics agent can identify an affected cohort, inspect a funnel, or surface a recurring failure pattern. A customer-facing support agent can use the approved finding to help one authenticated user, but it should see only that user’s permitted context and tools. A human support operator should receive the same trace when the agent escalates.

Keep the shared layer deliberately small: stable identities, canonical metric definitions, governed context fields, outcome events, and versioned playbooks. The analytics agent and support agent can then improve the same system while retaining separate access policies and evaluation criteria.

Do not automatically convert every observed correlation into a new support action. Let analytics generate a candidate playbook, review the causal logic and risk, test it against known cases, and release it through a controlled experiment. The learning unit is the reviewed playbook, not an unexamined prompt change.

Choose a first workflow that can prove its own value

The first pilot should be easy to verify, not merely easy to demonstrate. A conversational answer looks polished even when it does not change the user’s outcome. A narrow configuration or onboarding workflow is usually a better proving ground because eligibility, allowed actions, and success can be defined before launch.

Score candidate workflows against these criteria:

Repeated demand: the same intent or failure appears often enough to justify a reusable playbook.
Observable state: the agent can read the prerequisites and current configuration instead of guessing from the user’s description.
Clear success: one product state or behavioral event can verify that the problem was resolved.
Safe execution: the initial actions are reversible, user-scoped, and unlikely to affect billing, security, data retention, or other users.
Short feedback: the primary outcome appears soon enough to support iteration, even if retention is monitored later.
Enough eligible traffic: the workflow can support a meaningful experiment rather than a handful of anecdotes.

Write the pilot contract before the prompt

A pilot contract forces the product, analytics, support, engineering, and risk decisions into one inspectable artifact. It should specify:

the user problem and eligible cohort;
the trigger that starts the workflow;
the context fields and systems of record;
the approved diagnostic branches;
the allowed and prohibited actions;
the point at which confirmation is required;
the precondition and postcondition for each tool call;
the success event and observation window;
known failure modes and the human handoff rule; and
the primary outcome, guardrail metrics, experiment design, and minimum detectable effect.

Consider an onboarding configuration workflow. The trigger might be a user repeatedly reaching setup without completing it. The context could include entitlement, current configuration, prerequisite status, and the latest validation result. The agent may be allowed to run validation, explain a missing prerequisite, prefill a reversible setting, or launch the next approved step. Resolution requires both the expected configuration state and its corresponding success event. If validation continues to fail, the handoff should include the exact state, error, playbook branch, and actions already attempted.

Avoid starting with data deletion, broad permission changes, security recovery, billing adjustments, or external communications. Those workflows combine difficult authorization questions with high consequences. Prove context quality, tool reliability, verification, and measurement on a narrower action set before expanding the blast radius.

Set the minimum detectable effect before the experiment. If the eligible population is too small for the experiment to detect an outcome change that would justify the investment, narrow the claim, extend the observation period, or choose a more observable workflow. Do not call an underpowered neutral result proof that the agent has no effect.
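As a rough planning aid, the sketch below estimates the smallest absolute change in a success rate that a two-arm experiment could detect. It assumes equal arm sizes, a two-sided test, and the normal approximation with the baseline variance in both arms. The defaults of 5% significance and 80% power are common conventions, not values from the pilot contract, and the sample numbers are invented.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate: float, n_per_arm: int,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a difference in proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two equal-sized arms.
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * standard_error


mde_small = minimum_detectable_effect(baseline_rate=0.40, n_per_arm=1000)
mde_large = minimum_detectable_effect(baseline_rate=0.40, n_per_arm=4000)
print(round(mde_small, 3), round(mde_large, 3))
```

With a 40% baseline and 1,000 users per arm, the detectable change is about six percentage points. Quadrupling the traffic only halves it, which is why a low-volume workflow may need a narrower claim rather than a longer wait.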

Instrument the agent like a product surface, not a transcript

Conversation volume, message count, and thumbs-up feedback are diagnostic signals. They are not sufficient outcome measures. A customer can like an explanation and still remain blocked; another can dislike the wording even though the configuration was fixed.

Measurement layer	Question it answers	Useful signals
Operational reliability	Did the system execute as designed?	Tool success, validation failure, retry, latency, rollback, and escalation
Verified resolution	Did the requested product state become true?	Verified resolution rate, time to resolution, repeat attempt, and repeat contact
Product outcome	Did the user progress in the journey?	Activation, feature adoption, workflow completion, and later retention
Support outcome	Did the workflow reduce avoidable support effort?	Eligible ticket rate, escalation reason, handle-time impact, and handoff quality
Safety and trust	Did the agent stay within policy and user intent?	Permission block, wrong-action review, user reversal, policy violation, and privacy incident

Define the denominators as carefully as the numerators. Verified resolution rate should use eligible support sessions as its denominator and require the success state defined in the pilot contract. Action completion rate should use authorized action attempts, not every conversation. Time to resolution should begin with the original request and stop only when the postcondition is verified, not when the agent finishes generating text.
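The denominator rules above can be written down directly. This sketch assumes each session is a dictionary with the hypothetical keys shown; a real implementation would read them from the outcome log.

```python
from datetime import datetime


def verified_resolution_rate(sessions: list[dict]):
    """Verified resolutions divided by eligible sessions (None if none are eligible)."""
    eligible = [s for s in sessions if s["eligible"]]
    if not eligible:
        return None
    verified = sum(1 for s in eligible if s["verified_at"] is not None)
    return verified / len(eligible)


def action_completion_rate(sessions: list[dict]):
    """Completed actions divided by authorized attempts, not by all conversations."""
    attempts = [s for s in sessions if s["action_authorized"]]
    if not attempts:
        return None
    return sum(1 for s in attempts if s["action_completed"]) / len(attempts)


def time_to_resolution_seconds(session: dict):
    """Clock runs from the original request to the verified postcondition."""
    if session["verified_at"] is None:
        return None
    return (session["verified_at"] - session["requested_at"]).total_seconds()


t0 = datetime(2026, 1, 5, 9, 0, 0)
sessions = [
    {"eligible": True, "action_authorized": True, "action_completed": True,
     "requested_at": t0, "verified_at": datetime(2026, 1, 5, 9, 4, 0)},
    {"eligible": True, "action_authorized": True, "action_completed": True,
     "requested_at": t0, "verified_at": None},  # tool succeeded, state never verified
    {"eligible": False, "action_authorized": False, "action_completed": False,
     "requested_at": t0, "verified_at": None},  # out of scope for this workflow
]
print(verified_resolution_rate(sessions), action_completion_rate(sessions))
```

The second session shows why the two rates differ: the action completed, but without a verified postcondition it does not count as a resolution.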

Do not optimize ticket deflection or containment in isolation. The absence of a ticket can represent resolution, abandonment, or a user working around the problem. Pair support-efficiency measures with product success, repeat contact, and safety guardrails.

Use evaluations and experiments for different questions

A disciplined AI product rhythm connects eval-driven development, A/B testing, minimum detectable effect, activation, retention analysis, and data governance. Each mechanism answers a different question:

Pre-release evaluations: Can the system interpret known intents, select the right context, follow policy, choose an allowed tool, handle tool errors, and verify the expected postcondition? Run the relevant suite whenever the model, prompt, context contract, policy, tool, or playbook changes.
Shadow operation: What would the agent have proposed in real traffic without being allowed to change state? Review mismatched diagnoses, unsupported context, unsafe actions, and missed escalation conditions.
Controlled experiments: Does the agent improve the predefined outcome compared with the existing support experience for the eligible population? Record assignment before the interaction and preserve it through outcome analysis.
Production monitoring: Are errors, reversals, escalations, latency, or policy blocks changing by journey, user role, entitlement, playbook, or release version?

Be careful with naive correlation. Users who invoke support are often already struggling, so their outcomes may look worse than those of users who never needed help. Random assignment among eligible users gives you a defensible counterfactual. When randomization is not possible, describe the result as observational and avoid claiming that the agent caused the change.

Log enough version information to reproduce a decision: model, prompt, policy, context schema, playbook, experiment assignment, tool version, input identifiers, authorization result, and postcondition. Do not place raw secrets or unrestricted personal data in that trace. A metric change is actionable only when you can connect it to the system version that produced it.

Set action boundaries before the model receives tool access

Model confidence is not authority. A highly confident response must never expand a user’s permissions, bypass confirmation, or convert a prohibited action into an allowed one. Authorization belongs in deterministic policy and tool infrastructure outside the model.

Action class	Typical scope	Required controls	Verification
Read and explain	Show relevant state, explain an error, or recommend a next step	User-scoped reads, field filtering, and visible uncertainty when evidence conflicts	Confirm that the response used current state and an approved knowledge path
Reversible change	Update a non-sensitive preference, run validation, or trigger a recoverable workflow	Preview, confirmation when needed, typed input, idempotency, and rollback	Read the resulting state and observe the workflow’s success event
Consequential change	Alter billing, permissions, security, external communication, or retained data	Strong confirmation or human review, separation of duties, and a complete audit trail	Verify every postcondition and provide a safe recovery or escalation route

Implement the boundary with controls the agent cannot negotiate away:

Least-privilege credentials: issue short-lived, user-scoped authorization rather than a general service credential wherever the architecture permits it.
Allowlisted tools: expose narrow actions with typed parameters, explicit preconditions, and constrained targets. Do not give a customer-facing agent arbitrary database or shell access.
Policy before execution: validate identity, permission, data sensitivity, action class, and confirmation status outside the model before any state-changing call.
Postcondition checks: require the agent to read the resulting state. A successful API response can still produce the wrong business outcome.
Safe retries: attach idempotency controls to operations that might be repeated after a timeout or interrupted conversation.
Complete handoffs: send the human operator the intent, relevant context, diagnosis, attempted action, tool result, and unresolved condition so the customer does not have to start over.
Controlled release: use feature flags, cohort restrictions, action-level limits, and an immediate disable path while a workflow is being validated.
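A minimal sketch of policy before execution, assuming a static map from tool name to action class. The tool names, the permission model, and the rule that every reversible change needs confirmation are simplifications for illustration. The request carries the model's confidence only to show that the decision never reads it.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> action class from the table above.
ACTION_CLASS = {
    "read_state": "read",
    "update_preference": "reversible",
    "change_billing_plan": "consequential",
}


@dataclass(frozen=True)
class ActionRequest:
    tool: str
    user_permissions: frozenset
    user_confirmed: bool = False
    human_approved: bool = False
    model_confidence: float = 0.0  # deliberately ignored by authorize()


def authorize(request: ActionRequest) -> tuple[bool, str]:
    """Deterministic check that runs outside the model before any tool call."""
    action_class = ACTION_CLASS.get(request.tool)
    if action_class is None:
        return False, "tool is not allowlisted"
    if request.tool not in request.user_permissions:
        return False, "user lacks permission"
    if action_class == "reversible" and not request.user_confirmed:
        return False, "user confirmation required"
    if action_class == "consequential" and not request.human_approved:
        return False, "human review required"
    return True, "allowed"


request = ActionRequest(
    tool="change_billing_plan",
    user_permissions=frozenset({"read_state", "change_billing_plan"}),
    user_confirmed=True,
    model_confidence=0.99,
)
print(authorize(request))
```

The request is denied despite the high confidence value, because only the human-approval flag can open the consequential path.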

Evaluate build-versus-buy decisions at the system boundary

Conversation quality is easy to demonstrate and difficult to use as a purchasing criterion. Evaluate an agent platform on whether it can operate inside your context, permission, observability, and experimentation model.

Can you define and inspect the context contract for each workflow?
Can the platform use user-scoped credentials and enforce tool permissions outside the prompt?
Can every decision, action, version, and outcome be exported to your unified analytics platform?
Can you separate aggregate analytics access from individual customer support access?
Can you run offline evaluations, shadow traffic, controlled experiments, and cohort rollouts?
Can you configure confirmation, rollback, handoff, retention, and data-residency policies?
Can you change the model, tool, or support system without losing metric definitions and historical outcome traces?

A platform that generates excellent dialogue but cannot expose its action trace or connect to verified outcomes will make governance and product measurement harder. A less theatrical system with clear contracts may be the more useful product foundation.

Key takeaways

Start with a governed context contract, not a larger prompt or model.
Connect product analytics and support through shared identities, metric definitions, outcome events, and versioned playbooks.
Give customer-facing agents user-scoped context and a small set of reversible, allowlisted actions.
Count a resolution only when the intended product state or success event is verified.
Use offline evaluations for capability and policy, controlled experiments for causal impact, and production monitoring for drift and safety.
Expand autonomy only after context accuracy, tool reliability, outcome lift, and guardrails have all been demonstrated.

At your next roadmap review, ask for one pilot contract rather than a broad AI support initiative. Choose one recurring journey, name its verified success event, define the smallest safe action set, and make the owner show how every action will be authorized, observed, and reversed. That is enough to move from a chatbot concept to an agentic product you can manage.

April 21, 2026

How to Ship Responsible AI Products in Regulated Healthcare

Your healthcare AI prototype works in a demo. Clinicians see potential. Then privacy, security, compliance, and legal reviewers ask questions the roadmap cannot answer: Which data crosses the model boundary? What happens when the output is wrong? Who can stop it? What evidence justifies exposing it to patients or providers?

The answer is not a longer policy document. You need a delivery system in which the use case, data boundaries, acceptable behavior, evidence, and rollback path are inspectable before anyone depends on the product. That system lets you move faster because each review produces a decision instead of another round of open-ended concerns.

Key takeaways

Start with the decision or action the AI will influence, not the model you want to deploy.
Keep identifiers in clinical systems by default and send only the behavioral or operational signals a downstream system genuinely needs.
Put success metrics, unacceptable behavior, human review, and stop conditions in the same release contract.
Move from synthetic or de-identified sandbox testing to a tightly controlled pilot, then scale only when the agreed evidence supports it.
Monitor model behavior, workflow performance, segment outcomes, data quality, and incidents as one production system.

Define the clinical boundary before choosing the AI approach

A vague use case such as improving patient engagement is almost impossible to evaluate responsibly. It does not identify a user, a decision, an action, or a credible failure. The first useful artifact is a use-case card that makes those boundaries explicit.

Complete these fields before discussing vendors, models, or architecture:

User and job: Name the person using the capability and the task that person is trying to complete.
Input: List the information required to perform the task. Separate essential inputs from data that is merely available.
Output: Define what the system produces: a summary, draft, recommendation, prediction, classification, or action.
Action authority: State whether the AI informs a person, proposes an action for approval, or executes an action itself.
Unacceptable outcome: Describe the failure that must not reach the user, patient, provider, or downstream system.
Human checkpoint: Identify who reviews the output, what that person can see, and how the person can reject or correct it.
Success measure: Name the workflow outcome that should improve, such as task completion, time-to-first-value, or sustained adoption.
Accountable owner: Name the person who can approve the use case, pause it, and accept or reject residual risk.

The action-authority field is especially important. A system that drafts text for a qualified person to review has a different failure surface from one that sends the text automatically. A recommendation that a clinician can inspect is different from an action that changes a care workflow without an intervening decision. If the team cannot describe that distinction, it is too early to approve a production design.

I use a simple product-risk ladder during intake, ordered from lowest to highest consequence:

The AI summarizes or drafts, and its output has no effect until a qualified person reviews it.
The AI recommends a next step, but a person must make and record the decision.
The AI executes a reversible administrative action within a tightly bounded workflow.
The AI influences a care pathway, patient communication, or another consequential decision.
The AI executes a consequential or difficult-to-reverse action without prior human approval.

This ladder is a product-triage device, not a legal or clinical classification. Your qualified clinical, privacy, security, compliance, and legal owners still need to determine the obligations that apply. Its purpose is to prevent a low-risk drafting assistant and a high-consequence decision system from passing through the same generic review.

Once the boundary is clear, choose the least complex mechanism that can deliver the outcome. Conventional automation may be enough for deterministic rules. Retrieval may be appropriate when the primary job is finding and grounding information. An agentic workflow introduces additional action authority and therefore needs stronger controls. The choice among conventional automation, a retrieval-first pipeline, and agentic AI should follow the use case, its failure modes, and its lifecycle requirements.

Apply the same discipline to build-versus-buy decisions. Do not reduce the choice to feature coverage or procurement cost. Evaluate who can control data handling, model and prompt versions, evaluation, incident response, observability, and future changes. A vendor can supply technology, but it cannot own your product decision or your duty to operate the resulting workflow responsibly.

Make the data boundary reviewable, not merely promised

Privacy-by-design becomes real when a reviewer can trace each field from its origin to every place it is processed, logged, measured, retained, and deleted. A sentence saying the product is secure is not a data-control mechanism.

Start with a data-flow map that covers the entire operating path:

The clinical or operational system where the data originates.
Any transformation, minimization, masking, or de-identification step.
The application, retrieval layer, model, or external service that processes it.
Prompt, response, diagnostic, and application logs.
Behavioral analytics and product dashboards.
Human-review, support, escalation, and incident queues.
Long-term storage, retention, deletion, and backup paths.

For every step, record the purpose, permitted fields, prohibited fields, access roles, retention rule, downstream recipients, and owner. If a field has no necessary purpose, remove it before debating how to secure it. Data minimization reduces both the risk surface and the number of controls the team has to maintain.

A practical default is to keep identifiers in clinical systems while allowing only the behavioral signals needed for product analytics to cross the boundary. An analytics event can record that a recommendation was opened, edited, accepted, rejected, or completed without carrying a patient name or clinical narrative. The event should describe what happened in the product, not reproduce the underlying record.

Do not assume data is de-identified merely because a visible name or patient identifier has been removed. Combinations of fields, free text, prompts, model responses, URLs, error messages, and support attachments can still disclose sensitive information. Have the designated privacy and legal owners determine whether the transformation meets the applicable requirements. If they cannot verify it, keep the data inside the approved clinical boundary or use synthetic data for development.

Behavioral instrumentation needs its own contract. For each event, define:

The event name and the exact behavior it represents.
The allowed properties and the business purpose of each property.
Explicitly prohibited identifiers, clinical text, and other sensitive payloads.
The application and workflow versions that generate the event.
The owner who approves schema changes.
Validation rules that reject or quarantine malformed events.
The metric definitions and dashboards that consume the event.

This is governed analytics in operational form. Curated events, certified metric definitions, role-based access, lineage, and change control create a shared, auditable view for product, data, security, and compliance. They also prevent a quieter product failure: two teams using the same metric name for different behaviors and making incompatible release decisions.
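The event contract's validation rule can be sketched as a function that accepts, quarantines, or rejects an event. The event name, property names, and prohibited-key list below are hypothetical. The check is name-based only, so it does not detect sensitive content placed inside an allowed property's value.

```python
# Hypothetical contract: event name -> allowed property names.
EVENT_CONTRACTS = {
    "recommendation_accepted": {
        "app_version", "workflow_version", "recommendation_type", "latency_ms",
    },
}
# Keys that must never leave the clinical boundary (illustrative list).
PROHIBITED_KEYS = {"patient_name", "patient_id", "clinical_note", "free_text"}


def validate_event(name: str, properties: dict) -> tuple[str, list[str]]:
    """Return ("accept" | "quarantine" | "reject", list of problems)."""
    allowed = EVENT_CONTRACTS.get(name)
    if allowed is None:
        return "reject", [f"unknown event: {name}"]
    prohibited = sorted(k for k in properties if k in PROHIBITED_KEYS)
    unlisted = sorted(k for k in properties
                      if k not in allowed and k not in PROHIBITED_KEYS)
    problems = ([f"prohibited property: {k}" for k in prohibited]
                + [f"unlisted property: {k}" for k in unlisted])
    if prohibited:
        return "reject", problems      # never store the payload
    if unlisted:
        return "quarantine", problems  # hold for schema-owner review
    return "accept", []


clean = {"app_version": "4.2.0", "workflow_version": "7", "latency_ms": 830}
leaky = dict(clean, clinical_note="...")
drifted = dict(clean, screen_size="1440x900")
print(validate_event("recommendation_accepted", clean))
print(validate_event("recommendation_accepted", leaky))
print(validate_event("recommendation_accepted", drifted))
```

Separating reject from quarantine lets schema drift reach the owner for review while a prohibited payload is dropped before it is stored.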

Apply comparable scrutiny to an external provider. Ask what data the provider processes, where it is stored, whether inputs or outputs can be used for training, what is logged, how long each artifact is retained, how deletion works, who can access it, which subprocessors receive it, how tenants are separated, and what happens during an outage or security incident. Route the answers to the people responsible for contractual, security, privacy, and regulatory assessment. Product should own the use-case decision, not silently treat vendor approval as proof that every use is approved.

Convert responsible AI into a release contract

Responsible AI fails as a delivery practice when responsibility is expressed only as principles. A team needs observable release criteria: the behavior it expects, the behavior it prohibits, the evidence it will collect, and the condition that stops the launch.

Put those criteria in one release contract shared by product, engineering, data science, clinical leadership, security, privacy, and compliance. The exact metric thresholds will vary by use case, so the accountable owners must set them before the pilot produces results. A threshold chosen after seeing the data is an explanation, not a gate.

Release layer	Define before the pilot	Evidence to collect	Do not proceed when
Product value	The user task and expected workflow improvement	Task completion, time-to-value, adoption, abandonment, and sustained use	The feature creates activity without improving the intended task
Model behavior	Expected responses, prohibited responses, escalation behavior, and task-specific pass criteria	Versioned offline evaluations, human review, guardrail results, and regression comparisons	A critical safety case fails or behavior cannot be reproduced
Data quality	Required inputs, permitted schemas, freshness expectations, and lineage	Schema validation, missing-data checks, source versions, and anomaly monitoring	Inputs are stale, malformed, untraceable, or outside the approved boundary
Human control	Review point, override, correction, escalation, and rollback path	Correction behavior, overrides, escalations, and successful rollback tests	The responsible person cannot inspect, reject, or stop the output
Operational health	Acceptable latency, cost, availability, error behavior, and incident ownership	Production telemetry, alerts, version history, and incident records	Failure is silent, alerts lack an owner, or recovery depends on an untested path
Segment outcomes	The patient, provider, workflow, and operating segments that require separate review	Outcome and error variance across approved segments	Material variance is unexplained or a consequential segment lacks adequate evidence

Model quality is only one layer. A strong offline result can still produce a poor product if the workflow is slow, users cannot correct the output, input data is unreliable, or the intervention fails to improve the intended task. Connect the layers with a driver tree:

Model behavior: What must the system produce or avoid?
Workflow behavior: What will the user do differently if the output is useful and trusted?
User outcome: Which task becomes more complete, efficient, or reliable?
Organizational or care outcome: What meaningful result should eventually change?

Treat each arrow as a hypothesis, not an assumed causal relationship. For example, a more relevant recommendation might reduce corrections, and fewer corrections might improve task completion. Instrument both transitions. If relevance improves but completion does not, the team has learned that the bottleneck is elsewhere.
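The idea of instrumenting each transition can be sketched as a small check that looks for the first link where the upstream metric improved but the downstream one did not. The metric names and the `Link` structure are hypothetical, and comparing raw deltas ignores statistical noise, so a real analysis would need an appropriate significance test.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    upstream: str    # metric expected to move first
    downstream: str  # metric expected to follow

def improved(metric, baseline, current, higher_is_better):
    """True if the metric moved in its desired direction."""
    delta = current[metric] - baseline[metric]
    return delta > 0 if higher_is_better[metric] else delta < 0

def first_broken_link(links, baseline, current, higher_is_better):
    """Return the first link whose upstream improved while its downstream did not."""
    for link in links:
        up = improved(link.upstream, baseline, current, higher_is_better)
        down = improved(link.downstream, baseline, current, higher_is_better)
        if up and not down:
            return link
    return None

# Illustrative values only.
links = [Link("relevance", "correction_rate"), Link("correction_rate", "completion")]
direction = {"relevance": True, "correction_rate": False, "completion": True}
baseline = {"relevance": 0.70, "correction_rate": 0.30, "completion": 0.60}
current = {"relevance": 0.80, "correction_rate": 0.22, "completion": 0.60}
bottleneck = first_broken_link(links, baseline, current, direction)
```

In this made-up example, relevance and corrections both improve while completion stays flat, so the check points at the link between corrections and completion as the place to investigate.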

Your offline evaluation set should include representative routine inputs, ambiguous inputs, edge cases, and the sensitive scenarios most closely connected to the unacceptable outcomes on the use-case card. For each case, store the expected behavior, reviewer rubric, model version, prompt version, retrieval configuration, policy or rule version, and result. This makes regression testing possible when any part of the system changes.
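A record like the one described might look as follows. This is a sketch under assumptions: the field names, category labels, and version strings are illustrative, and a real system would persist these records rather than hold them in memory.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EvalResult:
    case_id: str
    category: str           # "routine", "ambiguous", "edge", or "sensitive"
    expected_behavior: str
    rubric_id: str
    model_version: str
    prompt_version: str
    retrieval_config: str
    policy_version: str
    passed: bool

def regressions(baseline, candidate):
    """Case IDs that passed in the baseline run but fail in the candidate run."""
    base_pass = {r.case_id for r in baseline if r.passed}
    return sorted(r.case_id for r in candidate if not r.passed and r.case_id in base_pass)

# Illustrative usage: the same case evaluated under two prompt versions.
base = EvalResult(
    case_id="sensitive-001",
    category="sensitive",
    expected_behavior="escalate to a human reviewer",
    rubric_id="rubric-v1",
    model_version="model-a",
    prompt_version="prompt-1",
    retrieval_config="retrieval-1",
    policy_version="policy-1",
    passed=True,
)
cand = replace(base, prompt_version="prompt-2", passed=False)
regressed = regressions([base], [cand])
```

Because every result carries the versions that produced it, a failing case can be traced to the specific component that changed.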

Prompt libraries, model and prompt regression tests, eval-driven development, feature flags, and observability belong in the product delivery system rather than in an isolated data-science workflow. AI behavior can change when the model, prompt, retrieved context, guardrail, input distribution, or surrounding application changes. Version the complete configuration that produced the output.
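One simple way to version the complete configuration is to hash a canonical serialization of it and attach that fingerprint to every output and log record. The configuration keys below are hypothetical; the sketch assumes the configuration is JSON-serializable.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 fingerprint of a JSON-serializable configuration."""
    # Sorting keys and fixing separators makes the hash independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical release configuration.
release_config = {
    "model": "model-a",
    "prompt": "prompt-7",
    "retrieval": {"index": "index-3", "top_k": 5},
    "guardrail_policy": "policy-2",
}
fingerprint = config_fingerprint(release_config)
```

Any change to the model, prompt, retrieval settings, or policy yields a different fingerprint, which makes it easier to tell whether two outputs came from the same reviewed system.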

Use A/B testing only where exposure is ethically and operationally appropriate, failure is reversible, and the relevant reviewers have approved the experiment. Do not use an experiment to discover whether an unbounded high-consequence behavior is safe. Establish safety through evaluation and controlled review first. For an approved experiment, predefine the minimum detectable effect that would make the release risk worthwhile, along with guardrail metrics and stop conditions.
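Predefining the minimum detectable effect also tells you roughly how much exposure the experiment needs. The sketch below uses the common normal-approximation formula for comparing two proportions; the default significance level and power are conventional assumptions rather than recommendations, and a statistician should confirm the design for any consequential experiment.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided test of two proportions."""
    p1, p2 = baseline_rate, baseline_rate + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1 and mde must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Illustrative comparison: a smaller effect needs far more participants.
n_small_effect = sample_size_per_arm(0.10, 0.02)
n_large_effect = sample_size_per_arm(0.10, 0.05)
```

If the required exposure exceeds what the reviewers consider acceptable for the risk involved, that is a signal to rethink the experiment rather than to shrink the threshold after the fact.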

Use evidence gates from sandbox to controlled scale

A responsible rollout is not one approval followed by unrestricted production access. It is a sequence of gates. Each gate expands exposure only after the previous stage produces the required evidence.

Gate 1: Sandbox validation

Start with synthetic or appropriately de-identified data. The sandbox should reproduce the workflow closely enough to test prompts, retrieval, interface behavior, event instrumentation, alerts, and rollback without exposing a patient or provider to an unproven capability.

Use the sandbox to answer concrete questions:

Does each approved input produce a traceable output?
Do ambiguous, incomplete, or malformed inputs fail safely?
Are prohibited data fields rejected before they reach logs or analytics?
Do critical evaluation cases pass on the exact release configuration?
Can a reviewer see the context needed to accept, edit, or reject an output?
Do alerts reach a named owner?
Can the feature be disabled without disrupting the underlying workflow?
Are latency and cost compatible with the intended operating model?
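Some of these questions can be turned into automated sandbox checks. As one example, the question about prohibited data fields could be tested with an allowlist applied before anything is written to a log. The field names here are hypothetical, and a key-name check of this kind does not detect sensitive values placed inside an approved field.

```python
# Hypothetical field lists; a real list would come from the approved data-flow map.
APPROVED_FIELDS = {"event_name", "feature_version", "role", "latency_ms", "outcome"}
PROHIBITED_FIELDS = {"patient_name", "date_of_birth", "free_text_note"}

class EventRejected(ValueError):
    """Raised when an event must not reach logs or analytics."""

def validate_event(event: dict) -> dict:
    keys = set(event)
    prohibited = keys & PROHIBITED_FIELDS
    if prohibited:
        # Report field names only, never the values.
        raise EventRejected(f"prohibited fields: {sorted(prohibited)}")
    unknown = keys - APPROVED_FIELDS
    if unknown:
        raise EventRejected(f"fields outside the approved schema: {sorted(unknown)}")
    return dict(event)

def log_event(event: dict, sink: list) -> None:
    """Append the event to the sink only after validation succeeds."""
    sink.append(validate_event(event))

# Illustrative usage with an in-memory sink standing in for a log pipeline.
sink = []
log_event({"event_name": "draft_accepted", "role": "nurse", "latency_ms": 420}, sink)
```

Rejecting unknown fields as well as prohibited ones keeps the check fail-safe when a new field appears that nobody has reviewed.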

A polished demonstration is not the exit criterion. The exit criterion is a reproducible evidence packet containing the use-case card, data-flow map, event contracts, evaluation results, open risks, mitigations, configuration versions, approvals, and tested rollback procedure.

Gate 2: Controlled production pilot

A pilot is an instrumented risk test, not a smaller marketing launch. Define its boundaries before enabling the feature:

Which users and roles are eligible.
Which workflows and data types are permitted.
Which outputs and actions are enabled.
Where human review is mandatory.
Which feature flag or access control contains exposure.
Which metrics and segments will be reviewed.
Which events trigger an alert, pause, rollback, or incident process.
Who makes the decision to continue, modify, or stop.

Write the success and stop criteria before the first participant enters the pilot. Otherwise, adoption pressure can turn a temporary exception into a permanent operating state. A pre-agreed stop condition gives the incident owner authority to act without waiting for a fresh executive debate while a consequential failure continues.
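The pilot boundaries and stop conditions can be written down as data before the feature is enabled. The following sketch assumes a single feature flag, role-based and workflow-based eligibility, and stop conditions expressed as maximum tolerated values; all names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PilotContract:
    flag_enabled: bool
    eligible_roles: frozenset
    permitted_workflows: frozenset
    stop_conditions: dict  # metric name -> maximum tolerated value
    decision_owner: str    # who decides to continue, modify, or stop

def is_exposed(contract, role, workflow):
    """A user sees the feature only inside every declared boundary."""
    return (
        contract.flag_enabled
        and role in contract.eligible_roles
        and workflow in contract.permitted_workflows
    )

def triggered_stops(contract, observed):
    """Stop conditions that are breached or for which no measurement exists."""
    return sorted(
        metric
        for metric, limit in contract.stop_conditions.items()
        if metric not in observed or observed[metric] > limit
    )

# Illustrative contract with made-up limits.
contract = PilotContract(
    flag_enabled=True,
    eligible_roles=frozenset({"nurse"}),
    permitted_workflows=frozenset({"discharge_summary"}),
    stop_conditions={"guardrail_failures": 0, "unresolved_escalations": 3},
    decision_owner="pilot incident owner",
)
stops = triggered_stops(contract, {"guardrail_failures": 1, "unresolved_escalations": 2})
```

A missing measurement counts as a triggered stop in this sketch, on the assumption that a pilot that cannot observe its own stop condition should pause.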

The pilot should test the entire sociotechnical workflow. Measure whether people understand the AI’s role, inspect the output, use the correction path, escalate uncertain cases, and complete the intended task. A model can appear accurate while users over-trust it, ignore it, or spend more time verifying it than the workflow saves.

Gate 3: Controlled expansion

Scale only when the evidence satisfies the release contract and the remaining risks have named owners. Expand one meaningful dimension at a time where practical: the eligible cohort, supported workflow, data scope, or action authority. Opening all four simultaneously makes it difficult to identify which change caused a new failure.
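A lightweight check can flag an expansion proposal that changes more than one dimension at once. The scope structure and values below are hypothetical, and because the guidance applies "where practical", the sketch reports the result instead of blocking the change.

```python
# The four expansion dimensions named above, as hypothetical scope keys.
DIMENSIONS = ("cohort", "workflows", "data_scope", "action_authority")

def changed_dimensions(current, proposed):
    """List the dimensions whose contents differ between two scopes."""
    return [d for d in DIMENSIONS if set(current[d]) != set(proposed[d])]

def check_expansion(current, proposed):
    """Return (ok, changed); ok is False when more than one dimension changes."""
    changed = changed_dimensions(current, proposed)
    return len(changed) <= 1, changed

# Illustrative scopes.
current_scope = {
    "cohort": {"site_a"},
    "workflows": {"intake"},
    "data_scope": {"scheduling"},
    "action_authority": {"draft_only"},
}
proposed_scope = dict(current_scope, cohort={"site_a", "site_b"})
ok, changed = check_expansion(current_scope, proposed_scope)
```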

A disciplined pattern is to move from sandbox validation to controlled pilots with documented data flows, guardrails, and pre-agreed mitigations. The audit trail should be generated from normal delivery artifacts rather than reconstructed when an auditor, customer, or executive asks what happened.

After launch: operate the product as a learning system

Production is where input distributions, user behavior, costs, and failure modes become visible. Run three connected operational views:

System health: Model, prompt, retrieval, and policy versions; latency; cost; errors; availability; and data-pipeline anomalies.
Workflow health: Eligibility, activation, task completion, abandonment, corrections, overrides, escalations, and time-to-value.
Outcome and safety health: Guardrail failures, prohibited behavior, incidents, rollback events, and outcome variance across relevant segments.

Every alert needs an owner, response path, and severity interpretation. Every material incident needs a record of the affected configuration, inputs, outputs, user impact, containment action, root cause, and prevention work. If the team cannot reconstruct which version produced a harmful or noncompliant output, observability is incomplete.

Treat a material model, prompt, retrieval, policy, or data-schema change as a product release even when the interface does not change. Run the relevant regression suite, compare the new configuration with the approved baseline, update the risk record, and preserve the decision. Change control is what prevents a previously reviewed system from becoming a different system under the same feature name.

Keep customer success, support, solutions engineering, and operational users in the feedback loop. Structured corrections and escalations can reveal workflow failures that aggregate accuracy metrics hide. Route those signals into evaluation cases, product discovery, and prioritization instead of treating them as isolated support tickets.

Your next step does not need to be a company-wide governance rewrite. Pick one healthcare AI use case and complete four artifacts: the use-case card, data-flow map, release contract, and gated rollout plan. If you cannot name the unacceptable outcome, the person who can stop the system, or the evidence required to resume it, the use case is not ready for production. Once those answers exist, responsibility becomes part of delivery rather than a negotiation at the end of it.

March 25, 2026

Bad Advice from Your AI Clone? Ethics, IP, and How Product Leaders Protect Quality

What happens when an AI starts giving advice in your voice—advice you’d never actually give? I’ve been thinking a lot about that question, and this conversation hit home for me as a product leader navigating the fast-evolving reality of AI “clones.”

Listen to this episode on: https://open.spotify.com/episode/7DNDIlIimwbbMOytArewRp?ref=producttalk.org | https://podcasts.apple.com/kh/podcast/bad-advice/id1794203808?i=1000756914818&ref=producttalk.org. Prefer video? Watch on YouTube: https://www.youtube.com/embed/RF4BwaeMMlg?feature=oembed

The episode examines AI “clones” built from podcast transcripts and public content—where the experimentation feels exciting, where it crosses ethical lines, and what happens when mediocre AI outputs get attributed to real people. The tension is real: when a bot confidently answers in your style but misses the nuance, “it’s not me” becomes more than a disclaimer—it’s a reputational defense.

We dig into the messy parts: IP ownership of open-sourced transcripts, the role of pirated books in LLM training sets, rising inference costs, and the uncomfortable economic question: if anyone can prompt “act like Teresa,” how do creators make a living? In my own decision-making, I look for clear consent, guardrails that prevent impersonation, and transparent UX that never confuses a synthetic perspective with a human expert.

This isn’t anti-AI. It’s a nuanced conversation about quality, consent, and remembering there are real humans behind the ideas.

Here’s how I translate the key takeaways into practice. Using AI for perspective is fine; equating it to the real person isn’t. Free-feeling AI outputs still rely on someone’s work. Expertise is more than past content: it’s context, judgment, and evolution. If someone’s work influences you, find a way to support them. These principles help teams benefit from generative AI without eroding trust or the creator ecosystem.

“Technically possible” doesn’t mean “ethically okay.” My AI Strategy playbook includes privacy-by-design, clear data governance on training materials, and a bright line between inspiration and impersonation. When we ship AI features, we label synthetic outputs, avoid mimicking living experts without permission, and create paths to compensate or promote the humans whose thinking underpins the experience.

I’ve also tested the “act like X” pattern to stress-test product quality. Even when outputs sound plausible, they rarely capture the expert’s mental models, trade-offs, or the evolution of their thinking—especially in complex product discovery work. That gap is the difference between average AI text and expert product management leadership.

If you listen, consider a few reflection prompts: Have you ever used AI to “act like” someone you admire? Could you tell whether the output matched that person’s actual thinking? How do you decide what’s ethically okay when using public content in LLMs? And how can we support creators while still embracing new tools?

Resources & Links you may find helpful: Follow Teresa Torres: https://ProductTalk.org; Follow Petra Wille: https://Petra-Wille.com; Delphi.ai (AI bot platform discussed): https://www.delphi.ai/?ref=producttalk.org; Lenny’s Podcast: https://www.lennysnewsletter.com/podcast?ref=producttalk.org; ChatGPT: https://chatgpt.com/?ref=producttalk.org; Petra’s Coaching Packages: https://www.petra-wille.com/coaching-packages?ref=producttalk.org; Teresa’s Product Talk: https://www.producttalk.org/; Teresa’s book Continuous Discovery Habits: https://www.producttalk.org/continuous-discovery-habits/; Lenny’s open-sourced podcast transcripts: https://www.dropbox.com/scl/fo/yxi4s2w998p1gvtpu4193/AMdNPR8AOw0lMklwtnC0TrQ?rlkey=j06x0nipoti519e0xgm23zsn9&e=1&st=ahz0fj11&dl=0&ref=producttalk.org

Have thoughts on this episode or practices that have worked in your org? Share them below—I’m keen to learn how other teams are balancing innovation with integrity.

Inspired by this post on Product Talk.

March 24, 2026
Multi‑Agent Systems Demystified: Why One AI Isn’t Enough—and How I Ship Faster With Many

In my day-to-day building AI products, I’ve learned a simple truth: a single model can be brilliant, but a coordinated team of specialized agents is what consistently ships outcomes customers trust. That’s the promise of multi-agent systems—multiple AIs with distinct roles collaborating inside robust AI workflows to deliver accuracy, speed, and resilience you can’t get from a lone model.

Think of a multi-agent system as a well-run product team for machines: a planner decomposes the job, specialists execute focused tasks, a reviewer checks quality, and an orchestrator keeps everyone aligned. This agentic AI approach mirrors how high-performing teams work: divide complex problems, play to strengths, and create tight feedback loops.

When does one AI stop being enough? Whenever tasks require tool use, domain retrieval, multi-step reasoning, or policy adherence under real-world constraints. In those moments, specialized agents shine—one for search using a retrieval-first pipeline, another for reasoning, another for action execution, and a final one for validation. The result is better accuracy with manageable latency and cost.

The core architecture I rely on starts with a planner that breaks a goal into steps, followed by execution agents equipped with tools and grounded context. I pair this with context window management to keep prompts lean and relevant, and I insert a verifier (or critic) to catch logic slips and policy violations before results reach customers. A lightweight orchestrator coordinates handoffs and retries to keep the whole flow resilient.
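As a minimal sketch of that flow, the planner, executor, and verifier can be treated as plain callables coordinated by a small orchestration loop with bounded retries. The function names and the retry policy are assumptions for illustration; in a real system each callable would wrap a model or tool call, and this sketch says nothing about how any particular framework implements orchestration.

```python
from typing import Callable, List

def run_workflow(
    goal: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Plan the goal into steps, execute each step, and retry until the verifier accepts."""
    results = []
    for step in plan(goal):
        for _attempt in range(max_retries + 1):
            output = execute(step)
            if verify(step, output):
                results.append(output)
                break
        else:
            # No attempt passed verification, so stop instead of returning unchecked output.
            raise RuntimeError(f"step failed verification after {max_retries + 1} attempts: {step}")
    return results

# Stub agents for illustration only.
def stub_plan(goal):
    return [f"{goal}: gather context", f"{goal}: draft answer"]

def stub_execute(step):
    return f"done({step})"

def stub_verify(step, output):
    return step in output

outputs = run_workflow("summarize ticket", stub_plan, stub_execute, stub_verify)
```

Raising when verification never passes keeps an unverified result from reaching a customer, which is the point of placing the verifier inside the loop.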

To make this production-grade, I treat observability as non-negotiable. Agent Analytics helps me see which agents are adding value versus adding latency, where failures cluster, and how prompts drift over time. From there, eval-driven development gives me measurable confidence: I codify representative tasks, run offline and shadow evaluations, and only promote changes that move accuracy and safety in the right direction.

Governance is equally critical. I build in privacy-by-design from the start, restrict data movement with strong data governance, and enforce policy constraints inside the workflow rather than after the fact. This includes red-teaming failure modes, rate-limiting tools, and capturing immutable traces for audits and post-incident reviews. These habits are borrowed from SRE culture and map well to AI systems.

On the practical side, prompt engineering remains foundational, but it’s the system design that converts clever prompts into reliable outcomes. Tool access, retrieval quality, memory strategy, and error handling matter more than wordsmithing alone. I’ve found that small prompt improvements are amplified when the surrounding workflow is sound—and are overwhelmed when it isn’t.

If you’re just starting, begin with a narrow use case and a minimal set of agents—planner, executor, and verifier—then expand. Use continuous discovery with real users to learn where the workflow fails in the wild, and iterate with tight release cycles. Treat every agent like a microservice with clear contracts, test coverage, and metrics, and you’ll unlock compounding gains without losing control.

The payoff is tangible: faster shipping cycles, fewer regressions, and outcomes customers can actually rely on. When stakes are high and ambiguity is real, one AI is often a talented soloist—but a disciplined ensemble of agents is how I deliver dependable, scalable value at product velocity.

Inspired by this post on Product School.

February 16, 2026
