Author: Shivam Tiwari

Why We Made Fin the Most Open Agent: Instant HubSpot & Freshdesk Support With 76% Resolutions

I’ve spent my career pairing product strategy with customer reality, and nothing is clearer right now than the demand for openness and speed. Today, we’re announcing that Fin can be used as a Service Agent on top of HubSpot and Freshworks, meaning you can use the world’s best Agent without migrating off your helpdesk.

HubSpot and Freshdesk customers can now:

Get Fin live, integrated, and working seamlessly in less than an hour.

Delivering a 76% average resolution rate.

Across all customer channels (voice, email, chat, social, and more).

Resolving complex queries that require reading and writing to third-party systems.

With everything fully configurable to follow the unique policies of every individual business.

This launch is a very visible step in a journey we’ve been on from day one: building an open, customer-first platform that plays well with the rest of your stack. We’ve long known that businesses want flexibility in how they configure their customer-facing tech stack. Since the very beginning, we have built Fin as an open platform, with APIs, MCPs, and a CLI, and by opening up access to Apex, our proprietary trained model that delivers best-in-class performance.

To make things easy for our customers, we have extensive public documentation of our product on our website, in our help center, and in our developer docs. We are the only Agent company in our space to do this; others hide most details behind sign-in screens, which we don’t believe is the right thing to do.

Open Agent platforms will win because customers refuse to be boxed into closed ecosystems. We now believe our category has reached a stage where customers demand open platforms, that those who open up are more likely to win, and those who remain closed and protectionist will accelerate their demise.

We are operating in a fast-changing world, and customers do not want to be locked into a single vendor or closed ecosystem. They want the ability to experiment, to swap things in and out, and move everything with ease, technically and commercially.

In an open world, where businesses can easily swap vendors, the best product will win. We are happy to compete on that front, confident that Fin delivers the best customer experience and the highest performance.

From a product management lens, this openness is powered by agentic AI patterns paired with robust CRM integration. Under the hood, we use Model Context Protocol (MCP), well-documented APIs, and orchestrated AI workflows to read from and write to third-party systems. That’s how Fin handles true multi-channel work—including voice AI agent scenarios—while giving teams the observability they need through Agent Analytics.

If you are a HubSpot or Freshdesk customer, you can now have Fin integrated and live within an hour, without needing any help from us. We’re here if you want us, but as part of our commitment to building an open platform, we’ve designed everything to be self-servable: start in minutes or watch a quick demo of how everything works.

Fin for HubSpot

Fin for Freshdesk

Inspired by this post on The Intercom Blog.

June 9, 2026

Learning Together: The Small-Group Product Coaching Strategy That Accelerates Real-World Growth

I’m continually evaluating how to invest in my team’s professional development in ways that create lasting capability, not just momentary enthusiasm. Recently, I revisited a compelling conversation featuring Teresa Torres and Petra Wille that zeroes in on how product teams actually learn best—especially when we’re accountable for product management leadership and sustainable practice change across empowered product teams.

What's the best way to invest in your team's professional development — train everyone at once, let people self-direct, or something in between?

In my experience, the answer depends on your goals, the maturity of your product discovery habits, and how you create peer accountability. What resonated most with me was their argument that small, intentional groups are a powerful (and underused) learning model—one that aligns with how we build momentum in product discovery, product strategy, and continuous discovery routines.

Three Models of Team Learning

Train everyone at once — builds shared language, but not everyone is ready at the same time

Self-directed learning — works for highly motivated individuals, but lacks accountability

Small-group learning — the sweet spot: peer accountability, shared momentum, and just-in-time relevance

Across my teams, I’ve seen organization-wide training create useful common ground, but it rarely changes day-to-day behaviors without a follow-on mechanism for practice. Self-directed learning can inspire, yet it often fails to translate into consistent habits without peer pressure and shared goals. Small-group learning, especially within product trios or adjacent squads, consistently drives the most adoption because it blends relevance, peer accountability, and just-in-time application to real customer interviews, roadmap decisions, and stakeholder management challenges.

Why Learning Together Works

Creates natural accountability and deadlines

Helps people apply concepts to their own real work

Especially valuable for product leaders, who rarely have built-in peers to learn alongside

I’ve found small cohorts particularly effective for product leaders who need a safe space to pressure-test decisions, compare notes on org design, and align on product strategy trade-offs—without slipping into status updates. When leaders learn together, they build shared muscle memory that makes it easier to reinforce practices like continuous discovery and communities of practice across the organization.

Group Coaching vs. One-on-One Coaching

Individual: sounding board, holding space, powerful questions

Group/team: real work in the room, peer learning, bridges between leaders who rarely collaborate

Keep participants as close colleagues — trust and vulnerability go up when people already know each other

One-on-one coaching is invaluable for personal reflection and targeted growth. But when I need to accelerate collective behavior change—like improving discovery cadence, refining opportunity solution tree reviews, or aligning around outcome-based roadmapping—group coaching wins. Keeping participants as close colleagues increases vulnerability and candor, which in turn speeds up learning and leads to real changes in how teams plan, prioritize, and ship.

Key Takeaways

Start a book club — debriefing together beats reading alone

Train pilot teams before rolling out org-wide

Encourage duos or trios to take courses together

Match your learning format to your actual goal

Keep coaching groups tight for more honest, productive sessions

Here’s how I operationalize this: I start with a pilot team to validate the learning format and cadence, then expand to adjacent trios to build a network effect. We anchor learning to current initiatives (not abstract theory), ensure weekly touchpoints, and capture playbooks in our internal knowledge base so improvements persist beyond the cohort.

Resources & Links:

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Mentioned in this episode:

Communities of Practice

Petra Wille's book Strong Product Communities – The Essential Guide to Product

Become a Better Product Leader: A 52-Week Transformation Journey – Petra's email course with quarterly live Q&A

Teresa Torres’ book Continuous Discovery Habits

Continuous Discovery Habits (CDH) Book Club

Petra’s STRONG Product People Corporate book clubs

Teresa's Product Discovery Fundamentals course

Work with Petra

Learning together at a conference like Product at Heart

Teresa & Hope Gurion's group leadership coaching program through Product Talk Train Your Team

Inspired by this post on Product Talk.

June 9, 2026

From AI Builder to Agent Swarm: A Product Delivery Model

AI-native product delivery has two distinct layers: a product professional who turns uncertainty into testable artifacts, and an agent workflow that divides complex work among specialized AI components. Treating either layer as the whole model misses the more useful opportunity.

Together, the AI Builder role described by Product School and the parallel-agent architecture discussed by Pendo suggest an operating model for moving from customer evidence to evaluated software. The central lesson is not simply to add more AI. It is to assign clear responsibilities, preserve evidence across handoffs, and expand automation only where it improves a measurable constraint.

Key takeaways

The AI Builder is the human integration layer, connecting discovery, prototyping, evaluation, and delivery inside the product trio.
Parallel agents are a system design choice, useful when specialized paths can improve latency, answer quality, or resilience.
Evaluations, analytics, observability, and controlled releases form the shared control system for both layers.
Fan-out should respond to uncertainty and business importance rather than becoming the default for every task.

One delivery system, with human and machine responsibilities

The Product School article presents the AI Builder as a hybrid product professional rather than a renamed product manager or an isolated prototyper. In its account, this person uses AI across analysis, prototyping, evaluation, and shipping, with the aim of shortening the distance between a customer problem and a runnable experiment.

The Pendo article addresses a different layer. It describes workflows in which research, reasoning, tool use, and formatting can be assigned to specialized agents and then reconciled. Its focus is not ownership of the product problem, but the computational structure used to complete work.

Read together, the articles separate two ideas that are often blurred. An AI-native team still needs a person or group to choose the problem, define acceptable behavior, interpret customer evidence, and decide whether an experiment justifies investment. Agents can perform bounded tasks within that process, but parallel execution does not establish product relevance on its own.

Layer	Primary responsibility	Typical artifacts	Control question
AI Builder and product trio	Translate customer and business uncertainty into experiments	Prototypes, evaluation criteria, instrumented experiences, delivery recommendations	Is the team learning about an outcome that matters?
Agent workflow	Execute and reconcile specialized tasks	Retrieved context, candidate responses, tool results, rankings, formatted outputs	Does orchestration improve the target measure enough to justify its complexity?
Delivery platform	Provide access, measurement, release controls, and safeguards	Tool interfaces, traces, feature flags, budgets, analytics, fallbacks	Can the workflow be observed, governed, and changed safely?

This division of responsibility also clarifies the meaning of vibe coding in the Pendo account. Prompts, examples, and constraints are used to shape an intended experience before the team commits to extensive code or rigid rules. The AI Builder supplies the product judgment and experiment design around that activity; an agent architecture supplies one possible execution mechanism.

Parallelism should target a constraint, not become a default

Pendo reports three proposed benefits of parallel agents. Independent specialists can work concurrently to reduce latency, diverse candidate paths can be compared to improve quality, and risky or failure-prone operations can be isolated behind fallbacks. The article names fan-out/fan-in, race-and-rerank, specialist swarms, consensus, and self-consistency checks as patterns for producing and reconciling candidates.

Those benefits depend on the shape of the task. Parallel research may help when several sources or interpretations must be examined independently. A race-and-rerank pattern may help when multiple plausible outputs can be scored against explicit criteria. Guarded fallbacks may improve resilience when a tool can fail without invalidating the entire experience. By contrast, multiplying agents around a simple, deterministic step adds coordination, cost, and more places to inspect when something goes wrong.
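The fan-out/fan-in idea can be illustrated with a minimal Python sketch. It is not taken from either source article: the three specialist functions, their scores, and the fallback message are hypothetical stand-ins for real model or tool calls. The sketch runs the specialists concurrently, isolates the failing path, and reranks the surviving candidates by score.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical specialists: each returns (answer, score).
# Real implementations would call models or tools instead.
async def research_agent(query: str) -> tuple[str, float]:
    await asyncio.sleep(0.01)
    return (f"research notes for {query}", 0.6)

async def reasoning_agent(query: str) -> tuple[str, float]:
    await asyncio.sleep(0.01)
    return (f"reasoned answer for {query}", 0.9)

async def flaky_tool_agent(query: str) -> tuple[str, float]:
    raise RuntimeError("tool unavailable")

Specialist = Callable[[str], Awaitable[tuple[str, float]]]

async def fan_out_fan_in(query: str, specialists: list[Specialist]) -> str:
    # Fan out: run every specialist concurrently; failures come back as values.
    results = await asyncio.gather(
        *(s(query) for s in specialists), return_exceptions=True
    )
    # Fan in: drop failed paths, then rerank the surviving candidates by score.
    candidates = [r for r in results if not isinstance(r, BaseException)]
    if not candidates:
        return "fallback: no specialist succeeded"
    best_answer, _ = max(candidates, key=lambda c: c[1])
    return best_answer

if __name__ == "__main__":
    agents = [research_agent, reasoning_agent, flaky_tool_agent]
    print(asyncio.run(fan_out_fan_in("pricing", agents)))
```

Even in this small form, the coordination cost is visible: the caller needs a scoring rule, a failure policy, and a fallback before parallelism adds anything.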

The Product School article provides the missing selection mechanism: the workflow begins with a high-signal use case and explicit evaluation criteria. That makes orchestration a response to observed limitations in an experiment rather than an architectural commitment made in advance. A prototype can begin with the smallest credible workflow, reveal whether the bottleneck is grounding, reasoning, tool reliability, or response time, and introduce specialization at that point.

Pendo proposes a similar progression at the system level: begin with retrieval, add a planner-executor split, and introduce parallel specialists where accuracy or latency problems appear. It also recommends placing budgets on fan-out, caching results, using smaller models when confidence is high, and widening the workflow when uncertainty rises. These are recommendations from the source, not independently reported benchmarks, but they establish a useful product principle: additional computation should be purchased in proportion to uncertainty and consequence.
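One way to read the principle of buying computation in proportion to uncertainty and consequence is as a budgeted fan-out rule. The sketch below is an assumption-laden illustration, not a formula from either source: the function name, the 0-to-1 scoring scale, the multiplication rule, and the default cap of four paths are all hypothetical.

```python
def fan_out_width(uncertainty: float, importance: float, max_width: int = 4) -> int:
    """Return how many parallel paths to run; 1 means no fan-out.

    Both scores are assumed to lie in [0, 1]. max_width is the fan-out budget.
    """
    if not (0.0 <= uncertainty <= 1.0 and 0.0 <= importance <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    if max_width < 1:
        raise ValueError("max_width must be at least 1")
    # Extra paths are added only when uncertainty and consequence are both high.
    extra_paths = round(uncertainty * importance * (max_width - 1))
    return 1 + extra_paths
```

Under this rule a confident, low-stakes request stays on a single path, while the widest workflow is reserved for requests that are both uncertain and important.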

Evaluation is the bridge from discovery to dependable delivery

The strongest overlap between the two articles is evaluation. Product School describes AI Builders converting interviews and behavioral analytics into instrumented experiments, benchmarking quality before production, and using A/B testing to feed results back into strategy. Pendo similarly calls for offline evaluations before rollout, production experiments afterward, and agent-level analytics to identify regressions across individual workflow steps.

This creates a continuous evidence path rather than a handoff between discovery and engineering. A customer problem informs a prototype; the prototype produces evaluation cases; those cases become release gates; production behavior supplies new evidence for the next iteration. CI/CD can move changes through the delivery system, while evaluations determine whether an AI behavior is ready to move with them.
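The step where evaluation cases become release gates can be sketched in a few lines of Python. The names `EvalCase`, `pass_rate`, and `release_gate`, the substring pass criterion, and the 0.9 threshold are illustrative assumptions; neither source prescribes a specific gate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simplistic pass criterion, assumed for illustration

def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of evaluation cases whose response contains the expected text."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in agent(case.prompt).lower()
    )
    return passed / len(cases)

def release_gate(agent: Callable[[str], str], cases: list[EvalCase],
                 threshold: float = 0.9) -> bool:
    """Allow the behavior to ship only if it meets the agreed pass rate."""
    return pass_rate(agent, cases) >= threshold
```

A CI/CD pipeline could call `release_gate` as one step, so a prompt or workflow change that lowers the pass rate is stopped before rollout.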

A staged adoption path

Select a bounded use case. Product School suggests beginning with a high-signal application such as generative-AI prototyping or an in-app guide, rather than attempting to transform the whole delivery process at once.
Define the evidence before expanding the build. Specify evaluation criteria, analytics, and the customer or business outcome the experiment is intended to illuminate.
Establish grounded context. Both articles emphasize retrieval-oriented workflows. Product School also discusses prompts, context windows, and data contracts as product surfaces that require deliberate design.
Start with minimal orchestration. A single workflow or planner-executor arrangement provides a baseline against which a specialist or parallel design can be judged.
Add parallel paths selectively. Introduce research, tool-calling, reasoning, or validation specialists only where evaluation results reveal a material limitation.
Release behind controls. The sources point to feature flags, A/B testing, observability, anomaly detection, fallbacks, and post-launch review as ways to expose failures and limit their impact.

The Model Context Protocol appears in both accounts as a way to standardize access to tools and data. Product School frames MCP integrations as part of an AI Builder’s toolbox, while Pendo argues that standardized access keeps agent roles separate from authentication, quotas, and observability. The combined implication is organizational as well as technical: shared interfaces can let product teams experiment with workflow roles without embedding every platform concern in every prompt.

The operating model changes what a product team owns

Product School places the AI Builder inside the product trio, working with design and engineering from the beginning. Pendo argues that product trios can own complete AI workflows rather than limiting their attention to prompts. These views converge on broader product accountability: the team owns the behavior, evidence, cost, risk, and release mechanism as one product surface.

That ownership requires clearer boundaries, not fewer disciplines. Product judgment determines which outcome deserves attention. Design shapes the customer interaction and failure experience. Engineering and platform work make tool access, observability, quotas, and release controls dependable. The AI Builder connects these concerns through runnable artifacts, while specialized agents remain replaceable components within the evaluated workflow.

The resulting measure of maturity is not the number of agents deployed or the speed of prototype generation. It is whether the team can trace a customer need through an experiment, an evaluation, a controlled release, and a learning decision. As tools become easier to compose, that chain of evidence will be the durable advantage in AI-native product delivery.

June 8, 2026

Turning Pendomonium Insights Into a Product Growth System

Pendomonium can be treated as more than a source of product ideas. Its practical value lies in connecting an identified growth problem to behavioral evidence, a targeted intervention, and a measurable follow-through plan.

The supplied Pendo account spans analytics, onboarding tools, product strategy, expert advice, and peer conversations. Viewed together, those elements form a repeatable operating system for converting conference learning into product growth experiments.

Conference learning becomes useful when it enters a growth loop

The Pendo article describes several distinct kinds of conference value: behavioral analytics and retention analysis can clarify where users struggle; journey mapping and continuous discovery can improve the framing of those problems; and in-app guides, product tours, or broader UX changes can become possible responses. Sessions about empowered teams, stakeholder alignment, and outcome-focused roadmaps address the organizational conditions needed to execute those responses.

The synthesis is a four-part loop: diagnose a meaningful behavior, select an intervention that fits the observed friction, define the intended outcome, and establish ownership for the work. No single conference session completes that loop. A talk may sharpen the hypothesis, a workshop may produce a prototype, an expert conversation may challenge the targeting logic, and a peer discussion may provide a useful benchmark.

This framing also changes the standard for conference return on investment. The relevant question is not how many notes or feature ideas an attendee collected. It is whether the event improved a decision that can be tested against an activation, onboarding, or retention outcome.

Begin with a constrained product growth question

The Pendo author recommends defining explicit outcomes before booking the trip: one activation measure to improve, one source of onboarding friction to address, and one discovery practice to strengthen. That constraint is useful because it turns a large agenda into a decision filter. Sessions become relevant when they contribute evidence, methods, or execution support for the selected problem.

Preparation should make the question concrete. The source advises bringing access to analytics dashboards and, where possible, a staging environment. An attendee could then inspect an event taxonomy, review a relevant session replay, or draft guidance for a defined segment while expert input is available. These activities are more actionable than collecting generalized advice because they expose assumptions about instrumentation, audience selection, and user context.

The same discipline applies to office hours. According to the article, these appointments can fill quickly, and attendees should arrive with precise questions, such as an unexplained activation drop-off or uncertainty about a guide-targeting rule. A useful expert conversation should end with a clearer hypothesis or experiment, not merely a product demonstration.

Match evidence, intervention, and measurement

The strongest practice implied by the source is to keep diagnosis and intervention connected. Session replay, behavioral analytics, retention analysis, and journey mapping illuminate different parts of user behavior. In-app guides and tours are possible treatments, but they should not become automatic answers to every point of friction.

Product growth question	Relevant conference input	Useful work product
Where does progress break down?	Behavioral analytics, retention analysis, journey mapping, or session replay	An evidence-backed problem statement
What could reduce the friction?	Guide and tour workshops, UX examples, or expert feedback	A testable intervention rather than an unprioritized idea
Who should receive the intervention?	Segmentation and targeting guidance	A defined audience, context, and trigger
How will the team evaluate it?	Activation and onboarding discussions	A success measure tied to the original problem

This sequence helps prevent tool-first product management. If replay evidence suggests that users cannot understand an interface, contextual guidance might be appropriate. If the underlying workflow is structurally difficult, adding another tour could conceal rather than remove the problem. The conference contribution is therefore not just exposure to Pendo capabilities; it is the opportunity to examine when each capability fits the diagnosed need.

The author reports applying workshop ideas to experiments and subsequently seeing faster time-to-value for new users. That is a useful practitioner account, but it is not presented as comparative evidence. Teams adopting the practice should still establish their own baseline, success measure, and evaluation method.

Use a return-home cadence to preserve accountability

Conference insights decay when they remain in personal notes. The source proposes a note structure containing the problem, supporting data, proposed intervention, and success measure. It also describes leaving the event with two prioritized experiments, named directly responsible individuals, and an execution-readiness checklist.

Before the event: choose the activation measure, onboarding problem, and discovery practice that will guide agenda decisions.
During sessions and workshops: record evidence and assumptions separately from proposed features or guidance.
During expert and peer conversations: pressure-test the hypothesis, instrumentation, targeting logic, and execution constraints.
Within 24 hours of a useful introduction: follow up with a concise summary and a specific next step, reflecting the networking practice recommended in the source.
Within 48 hours of returning: hold the debrief described by the author, prioritize the experiment backlog, and agree on a shared activation or onboarding milestone.
Across the next 30, 60, and 90 days: review whether ownership translated the selected insights into tests, decisions, and measurable learning.
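The cadence above reduces to simple date arithmetic. The following sketch assumes that the 48-hour debrief window and the 30, 60, and 90-day reviews are all counted from the return date, which the source does not state explicitly; the function and key names are hypothetical.

```python
from datetime import date, datetime, timedelta

def follow_up_deadline(introduced_at: datetime) -> datetime:
    """Latest time to send the summary and next step after an introduction."""
    return introduced_at + timedelta(hours=24)

def return_home_schedule(return_date: date) -> dict[str, date]:
    """Debrief deadline plus 30/60/90-day review dates, counted from return."""
    schedule = {"debrief_by": return_date + timedelta(days=2)}
    for days in (30, 60, 90):
        schedule[f"review_{days}"] = return_date + timedelta(days=days)
    return schedule

if __name__ == "__main__":
    for name, due in return_home_schedule(date(2026, 6, 8)).items():
        print(name, due.isoformat())
```

Putting these dates in a shared calendar before the event makes each review a commitment rather than an intention.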

Intentional networking can support this cadence rather than sit outside it. The article recommends meeting peers with comparable metrics, product operations leaders, and solutions engineers. Their value is not simply expanding a contact list: each perspective can expose a different weakness in the plan, from an unrealistic benchmark to an instrumentation gap or an unresolved delivery dependency.

Key takeaways

Anchor the event to a specific growth outcome and a defined source of user friction.
Use analytics, journey evidence, workshops, expert advice, and peer input as complementary parts of one decision process.
Select guides, tours, or UX changes only after diagnosing the behavior they are intended to change.
Capture every promising idea with supporting evidence, a success measure, and a responsible owner.
Use the post-event debrief and 30-60-90 day follow-through to turn conference learning into accountable experimentation.

The durable opportunity is to make Pendomonium the start of a learning cycle rather than the end of an annual event. Teams that arrive with a narrow question and leave with an owned experiment can carry the conference’s value into their next product decision.

References

Pendo Blog – Pendomonium 101: Why You’ll Love It-and My Field-Tested Tips for First-Time Attendees

June 8, 2026

Engineering MCP Agents as a Reliable Product Platform

Model Context Protocol adoption becomes consequential when an agent can retrieve organizational knowledge, select tools, and change a system of record. At that point, the engineering challenge is no longer simply connecting a model to an API. It is operating a product platform whose context, permissions, decisions, and side effects must remain dependable.

The source article’s experience with workflows spanning Miro, Jira, and Confluence points to a coherent platform model: retrieval determines what the agent knows, tool contracts constrain what it can do, evaluation tests its behavior, and observability makes failures diagnosable. Product strategy and interaction design then determine whether that machinery improves work users already perform.

Key takeaways
- Treat retrieval, tool schemas, prompts, policies, and telemetry as platform components with explicit owners and versioning.
- Prove one frequent, measurable workflow before expanding the agent’s tool and use-case surface.
- Combine least-privilege access with visible tool rationale, consent controls, audit records, and safe recovery paths.
- Evaluate the complete chain from retrieved context to downstream action, not just the quality of generated text.
- Govern the tool catalog and delivery pipeline continuously so that extensibility does not become uncontrolled operational risk.

The platform boundary extends beyond the MCP connection

MCP provides a practical interface through which models can reach data, tools, and actions, according to the source article. The protocol connection is therefore an enabling layer, not the whole agent platform. A production workflow also depends on source authority, identity and permission checks, context selection, tool arbitration, execution controls, user-facing recovery states, and evidence that the result was useful.

This broader boundary changes how teams should decompose the system. Retrieval is a managed context service rather than an incidental prompt-building step. Tools are governed capabilities rather than a loose collection of endpoints. Prompts and policies are deployable artifacts rather than text copied into application code. Traces and evaluations are part of the control plane because they reveal whether the other layers continue to work together.

The source recommends starting with authoritative content, normalizing it with docs-as-code discipline, attaching metadata that supports permission-aware filtering, and selecting the smallest high-signal context needed for a task. The engineering implication is important: access control must shape retrieval before information reaches the model. Filtering only when an action is attempted would leave the reasoning process exposed to context the user or agent may not be entitled to use.
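A minimal sketch of permission-aware retrieval follows. It assumes each chunk carries group-based permission metadata and a relevance score from some upstream retriever; the `Chunk` fields and the `select_context` function are hypothetical and do not come from the source article or from the MCP specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # permission metadata attached at indexing time
    score: float                    # relevance score from a hypothetical retriever

def select_context(chunks: list[Chunk], user_groups: set[str],
                   max_chunks: int = 3) -> list[Chunk]:
    # Filter on permissions first, so unauthorized text never reaches the prompt.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    # Then keep only the smallest high-signal set for the task.
    return sorted(visible, key=lambda c: c.score, reverse=True)[:max_chunks]
```

The ordering is the point of the example: the highest-scoring chunk is excluded if the user cannot see it, rather than being retrieved and blocked later at action time.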

Context quality also affects more than answer accuracy. The source links focused retrieval to lower hallucination risk, more accurate tool calls, and lower cost. That makes retrieval performance a shared dependency for safety, reliability, latency, and economics. It deserves its own contracts, tests, freshness expectations, and failure modes.

A golden path turns architecture into an operating contract

The source describes an initial workflow that summarized a Miro board into action items and wrote them to Jira. It reports that variants involving Confluence summaries, epic splitting, and backlog grooming followed only after the original path reached its reliability targets. This is less a recommendation for those particular products than a useful sequencing principle for agent platform engineering.

A narrowly defined workflow exposes the entire contract between context and consequence. The team must decide which content is authoritative, what the model may infer, which tool is appropriate, what inputs the tool accepts, what the user should review, how a partial failure is handled, and how success is measured. A broad assistant can conceal these questions behind plausible conversation; a golden path forces explicit answers.

The right first workflow is therefore not merely technically convenient. It should be frequent enough to matter, have an observable completion state, and carry side effects that can be bounded. The source frames outcomes such as time saved during backlog grooming, better meeting notes in Confluence, and fewer context switches across Miro boards as more useful roadmap anchors than novel model capabilities. It also recommends comparing task success, completion time, user edits, detected defects, and downstream business effects rather than relying on engagement alone.

Those measures form a practical evidence chain. Evaluation results show whether the system behaves as designed; workflow measures show whether users can complete the task; business measures show whether the completed task creates value. Keeping the levels distinct prevents a technically impressive agent from being mistaken for a successful product.

Safety depends on controlling actions and explaining them

Tool access creates a sharper risk boundary than text generation because an incorrect decision can alter a ticket, document, or other shared record. The source’s proposed response combines least-privilege scopes, a human-readable rationale for each call, and an audit trail. It also calls for proposed inputs and expected side effects to be visible when the agent is about to use a tool.

These controls address different failure classes. Narrow scopes limit the maximum effect of a bad decision. Input previews help users catch incorrect parameters before execution. Rationale makes the selection inspectable. Audit records support diagnosis and accountability afterward. None substitutes for the others, and a confirmation dialog alone does not make an overprivileged tool safe.
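The four controls can be combined in one execution path, sketched below under stated assumptions. `ToolProposal`, `AuditLog`, `execute`, the scope strings, and the status values are hypothetical names chosen for illustration; they are not part of MCP or of any specific product API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class ToolProposal:
    tool: str
    required_scope: str
    inputs: dict
    rationale: str        # human-readable reason for choosing this tool
    expected_effect: str  # shown to the user before execution

@dataclass
class AuditLog:
    records: list[dict] = field(default_factory=list)

    def record(self, proposal: ToolProposal, status: str) -> None:
        self.records.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "tool": proposal.tool,
            "inputs": proposal.inputs,
            "rationale": proposal.rationale,
            "status": status,
        })

def execute(proposal: ToolProposal, granted_scopes: set[str],
            user_confirmed: bool, run: Callable[..., object],
            audit: AuditLog) -> str:
    # Least privilege: refuse calls outside the granted scopes.
    if proposal.required_scope not in granted_scopes:
        audit.record(proposal, "denied_scope")
        return "denied_scope"
    # Consent: the user reviews inputs and expected effect before a write.
    if not user_confirmed:
        audit.record(proposal, "awaiting_consent")
        return "awaiting_consent"
    run(**proposal.inputs)
    audit.record(proposal, "executed")
    return "executed"
```

Each outcome is audited, including refusals, so a later review can see what the agent attempted as well as what it changed.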

Recovery behavior belongs in the same design. The source recommends retrying suitable failures with backoff, falling back to read-only behavior, or requesting consent or missing context. A robust platform should distinguish failures that are safe to retry from failures that require a different plan. It should also preserve an understandable state when a multi-step workflow completes only partially, so the user knows what changed and what did not.
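The distinction between retryable and non-retryable failures can be sketched as follows. The exception classes, the `call_with_recovery` function, and its defaults are illustrative assumptions, and the sketch also assumes the write action is idempotent, since retrying a non-idempotent write could duplicate side effects.

```python
import time
from typing import Callable

class TransientError(Exception):
    """Failure that is assumed safe to retry, such as a timeout."""

class PermanentError(Exception):
    """Failure that needs a different plan, such as a rejected input."""

def call_with_recovery(write_action: Callable[[], object],
                       read_only_fallback: Callable[[], object],
                       max_attempts: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    for attempt in range(max_attempts):
        try:
            return {"mode": "write", "result": write_action()}
        except TransientError:
            # Exponential backoff between attempts; none after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
        except PermanentError:
            break  # retrying would not help
    # Degrade to read-only behavior and report the mode to the caller.
    return {"mode": "read_only", "result": read_only_fallback()}
```

Returning the mode alongside the result lets the interface tell the user whether anything was actually changed.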

Transparency need not mean exposing raw internal reasoning. The useful product surface is operational evidence: the sources used, the selected capability, the intended inputs, the expected effect, and the resulting status. The source suggests a reveal panel containing retrieved sources, candidate tools, and confidence signals for power users. More generally, the amount of review should follow the consequence of the action: low-risk retrieval can remain lightweight, while consequential writes warrant clearer inspection and consent.

Evaluation, observability, and delivery form one reliability loop

The source outlines offline tests for intent classification and tool selection, online shadow evaluations for live drift, and regression checks after deployment. It also recommends traces that capture prompts, retrieved chunks, tool inputs, tool outputs, latency, and error codes. Together, these practices connect a visible failure to the component and version that produced it.
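A trace record covering the fields listed above might look like the following sketch. The `TraceRecord` shape, the `prompt_version` field, and the `traced_call` helper are assumptions made for illustration; the source names the data to capture but not a schema.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Optional

@dataclass
class TraceRecord:
    prompt_version: str             # assumed field, links behavior to a version
    prompt: str
    retrieved_chunk_ids: list[str]
    tool: str
    tool_inputs: dict
    tool_output: Optional[str] = None
    latency_ms: Optional[float] = None
    error_code: Optional[str] = None

def traced_call(record: TraceRecord, tool_fn: Callable[..., object]) -> TraceRecord:
    """Run the tool and fill in output, latency, and any error code."""
    start = time.perf_counter()
    try:
        record.tool_output = str(tool_fn(**record.tool_inputs))
    except Exception as exc:
        record.error_code = type(exc).__name__
    record.latency_ms = (time.perf_counter() - start) * 1000
    return record

def to_log_line(record: TraceRecord) -> str:
    """Serialize the trace as one JSON line for a log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the record carries a version identifier next to the inputs and outputs, a failing trace can be tied to the prompt revision that produced it.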

Evaluation without observability can show that quality declined without explaining why. Observability without evaluation can produce detailed traces without deciding whether the behavior was acceptable. A mature loop needs both: test cases encode desired behavior, traces expose actual behavior, and production outcomes reveal gaps in the test set.

The delivery process must preserve that connection. The source treats prompts, tool schemas, and guardrails as versioned artifacts deployed behind feature flags, with canary releases, controlled comparisons, and rollback capability. This approach makes a behavioral change attributable. If tool selection deteriorates after a prompt revision or a schema update breaks an integration, operators can identify the change and contain its reach.
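
A canary release behind a feature flag can be approximated with deterministic hashing, so the same user always sees the same variant. This sketch is an assumption about one common approach, not the source's mechanism; `in_canary`, `pick_prompt_version`, and the flag name are hypothetical.

```python
import hashlib


def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user in the canary cohort for one flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, stable per user and flag
    return bucket < percent * 100


def pick_prompt_version(user_id: str, stable: str, candidate: str, percent: float) -> str:
    """Serve the candidate to the canary cohort; percent=0 acts as a rollback."""
    return candidate if in_canary(user_id, "prompt-rollout", percent) else stable
```

Because assignment depends only on the flag name and user ID, setting `percent` to 0 returns every user to the stable version without redeploying.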

Latency should be governed in the same loop because an accurate workflow can still fail as a product experience. The source reports using task-specific latency budgets, caching stable retrieval results, parallelizing safe calls, prefetching likely session context, and providing progress when work exceeds the expected budget. These techniques should remain subordinate to correctness: parallel execution is appropriate only when calls are independent, while caching must respect freshness and permission boundaries.
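
The combination of a latency budget and parallel execution of independent calls can be sketched with `asyncio`. This is a minimal example under the assumption that every call passed in is independent of the others; `gather_within_budget` is a hypothetical helper, and caching is left out.

```python
import asyncio


async def gather_within_budget(calls: dict, budget_s: float) -> dict:
    """Run independent coroutines concurrently; return None for any that miss the budget.

    `calls` maps a label to a zero-argument coroutine function. Only pass calls
    that do not depend on each other's results.
    """
    if not calls:
        return {}
    tasks = {name: asyncio.create_task(fn()) for name, fn in calls.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_s)
    for task in pending:
        task.cancel()  # over budget: stop waiting so the caller can show progress
    return {
        name: (task.result() if task in done and task.exception() is None else None)
        for name, task in tasks.items()
    }
```

A `None` entry tells the caller which piece of context missed the budget, which is the moment to show progress rather than block silently.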

The source also assigns prompts a user-experience role, combining plain-language intent, domain constraints, and explicit tool contracts while using examples, tooltips, and in-product guidance to help users frame requests. This connects conversation design to reliability. Better instructions can reduce ambiguity before the platform has to resolve it through additional model turns or risky assumptions.

Scale requires governance of tools, teams, and ownership

MCP’s extensibility can turn into tool sprawl if every integration is added without lifecycle management. The source recommends a curated catalog recording each tool’s owner, scope, schema version, and deprecation policy. It also describes schema linting in continuous integration, backward-compatible changes, and quarterly retirement of unused tools. These are conventional platform disciplines applied to an agent’s capability surface.

A catalog is valuable because an agent reasons over descriptions and schemas while operators depend on stable implementation contracts. Poorly differentiated tools can make selection ambiguous; unannounced schema changes can invalidate prompts and evaluations; ownerless tools can remain available after their data or permission assumptions have changed. Governance should therefore assess semantic clarity as well as API validity.
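
A catalog lint step of the kind that could run in continuous integration might look like the following. It is a sketch only: the catalog layout, the `lint_catalog` function, and the duplicate-description check are illustrative assumptions, with the required fields taken from the list above.

```python
REQUIRED_FIELDS = ("owner", "scope", "schema_version", "deprecation_policy")


def lint_catalog(catalog: dict) -> list:
    """Return human-readable problems; an empty list means the catalog passes."""
    problems = []
    seen_descriptions = {}
    for name, entry in sorted(catalog.items()):
        for field_name in REQUIRED_FIELDS:
            if not entry.get(field_name):
                problems.append(f"{name}: missing {field_name}")
        description = entry.get("description", "").strip().lower()
        if description in seen_descriptions:
            # Identical descriptions make tool selection ambiguous for the agent.
            problems.append(f"{name}: description duplicates {seen_descriptions[description]}")
        elif description:
            seen_descriptions[description] = name
    return problems
```

The duplicate-description check is a crude stand-in for semantic clarity; a real review would also need human judgment about whether two tools are meaningfully distinguishable.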

Organizational design matters for the same reason. The source describes an empowered trio consisting of a product manager responsible for outcomes and risk posture, a forward-deployed engineer focused on schemas and scalability, and a designer responsible for conversational flows and recovery states. It also favors weekly evaluation reviews over demonstration-led progress. The underlying principle is shared ownership: platform reliability cannot be delegated entirely to model engineering when the decisive questions span product value, system behavior, permissions, and user comprehension.

The source’s proposed 30-day starter sequence moves from selecting one workflow and defining permissions, measures, and evaluations; through retrieval and a minimal tool set; to an instrumented internal pilot; and finally to hardening and a limited beta. The schedule is reported as a blueprint rather than independent proof of how long every implementation should take. Its more transferable lesson is dependency order: define the outcome and risk boundary before multiplying capabilities.

As agents begin coordinating across products, the durable advantage will come from platforms that preserve this discipline across every new connection. MCP can make capabilities composable, but dependable composition will still depend on controlled context, explicit authority, observable execution, and evidence that the workflow improves real work.

References
- Shivam.Consulting Blog — Mastering MCP: Battle-tested Playbooks from Miro, Atlassian, and What I’ve Learned
June 8, 2026

Reusable AI Agent Workflows Need Evaluation Contracts

Reusing an AI agent capability can accelerate delivery, but reuse also multiplies the consequences of an undetected defect. A retrieval component, tool-call routine, or safety check may appear in several workflows, so its quality cannot depend on the team that happens to integrate it next.

The practical answer is to package each reusable skill with an evaluation contract: defined behavior, test fixtures, observability, guardrails, and outcome measures that travel with the component. Read together, the two source articles outline how modular workflow design and eval-driven development can reinforce each other from prototype through production.

Reuse requires a contract, not just a prompt

The AI skills library article describes modular capabilities for retrieval and grounding, summarization, classification, tool use, data enrichment, safety controls, and evaluation harnesses. Its central architectural idea is consistency: common interfaces and conventions allow teams to compose capabilities and replace implementations without rebuilding an entire flow.

That modularity addresses code and workflow reuse, but it leaves an important product question: what must remain true when an implementation is replaced? The product-manager evaluation playbook supplies the missing half. It calls for versioned prompts, tools, and datasets; fixed offline scenarios; production experiments; and traces that expose how an agent reached an answer.

The synthesis is an evaluation contract attached to every reusable skill. The contract defines acceptable inputs and outputs, relevant policies, expected telemetry, representative tests, and promotion thresholds. A skill is then reusable because its behavior can be checked repeatedly, not merely because its code can be imported.
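
As a rough illustration, an evaluation contract can be represented as data that travels with the skill and can be run against any candidate implementation. The sketch below covers only fixtures and a promotion threshold; policies and telemetry are omitted, and `EvaluationContract` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationContract:
    skill: str
    version: str
    fixtures: tuple       # (input, expected_output) pairs that travel with the skill
    min_pass_rate: float  # promotion threshold on the fixtures

    def evaluate(self, implementation) -> dict:
        """Run every fixture against an implementation and report eligibility."""
        if not self.fixtures:
            raise ValueError("a contract needs at least one fixture")
        passed = sum(1 for given, expected in self.fixtures if implementation(given) == expected)
        pass_rate = passed / len(self.fixtures)
        return {
            "skill": self.skill,
            "version": self.version,
            "pass_rate": pass_rate,
            "eligible": pass_rate >= self.min_pass_rate,
        }
```

Because the contract accepts any callable, a replacement implementation can be checked against the same fixtures before it is swapped in.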

This distinction matters most in composed workflows. A summarizer that performs well on clean text may behave differently after a weak retrieval step. A tool-use component may generate a plausible response even when the underlying action fails. Reusable interfaces make these components interchangeable; evaluation contracts make the substitutions accountable.

Measure four layers of agent quality

No single score can represent the quality of a reusable agent workflow. The evaluation article separates concerns such as task success, factuality, safety, latency, cost, evidence quality, and product outcomes. The skills-library article adds operational concerns around guardrails, runtime metrics, and production monitoring. Combined, they suggest a four-layer model.

Evaluation layer	Question it answers	Reusable evidence	Reported signals
Component behavior	Does the skill perform its assigned task?	Fixed fixtures, golden examples, and domain scenarios	Task success, factuality, and retrieval evidence quality
Safety and policy	Does it remain within required boundaries?	Adversarial cases, policy checks, and guardrail configurations	Safety performance, PII handling, and content-policy adherence
Operational performance	Can it run reliably within product constraints?	Traces, logs, version records, and production dashboards	Latency, cost, tool success, and fallback behavior
Product impact	Does better agent behavior create user or business value?	Experiment definitions and driver-tree mappings	Task completion, satisfaction, activation, retention, and NRR

The layers should remain distinguishable even when a dashboard brings them together. If a workflow’s task-success score rises while latency or cost deteriorates, the trade-off should be visible. If offline factuality improves without changing completion or satisfaction in production, the result should not automatically be treated as a product win.

Retrieval-first workflows illustrate the value of separation. The evaluation playbook recommends assessing the quality of retrieved evidence independently from generation. That boundary makes a failure attributable: the system can distinguish missing or irrelevant evidence from a generator that mishandled useful context. The same principle applies to classification, tool selection, tool execution, and response composition.
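
Scoring retrieval separately from generation can be as simple as computing recall over the top results and using it to attribute a failure. This is a minimal sketch; `recall_at_k`, `attribute_failure`, and the 0.5 floor are illustrative assumptions rather than values from either article.

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def attribute_failure(retrieval_recall: float, answer_correct: bool, recall_floor: float = 0.5) -> str:
    """Assign a case to the stage most likely responsible for its outcome."""
    if answer_correct:
        return "pass"
    # Poor evidence points at retrieval; good evidence with a wrong answer points at generation.
    return "retrieval" if retrieval_recall < recall_floor else "generation"
```

The same pattern of scoring each stage on its own inputs extends to classification, tool selection, and tool execution.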

A reusable workflow needs a controlled promotion path

The two sources describe complementary stages rather than competing evaluation methods. The skills-library article starts with a quick-start chain, configurable skills, guardrails, evaluation datasets, and instrumentation. The evaluation playbook places fixed offline suites before user exposure, followed by controlled online validation. Together they form a promotion path from composable prototype to measured production capability.

Offline evaluation establishes eligibility

A candidate workflow should first face stable examples representing core scenarios, known failure modes, edge cases, adversarial prompts, and domain-specific questions, as reported by the evaluation playbook. Stable fixtures make comparisons reproducible when a prompt, model, tool, retrieval strategy, or policy changes. Running these checks through CI/CD, as proposed in the skills-library article, turns evaluation into a regular release control instead of a separate audit.

Model-based judges can expand coverage for qualities such as helpfulness, coherence, and adherence, but the evaluation article cautions that they require calibration against a small, high-quality human-labeled set. It also recommends monitoring judge drift and retaining human review for edge cases or flows where mistakes carry greater consequences. A reusable judge configuration should therefore include its rubric, reference labels, version, and conditions for escalation.
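
Calibration against human labels needs an agreement measure. One widely used option is Cohen's kappa, which corrects raw agreement for chance; the article does not name a specific statistic, so its use here is an assumption, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohens_kappa(judge_labels: list, human_labels: list) -> float:
    """Chance-corrected agreement between a model judge and human reference labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Recomputing the statistic on the same reference set after each judge or model update is one way to watch for judge drift.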

Online evaluation establishes value

Passing offline checks shows that a variant is eligible for controlled exposure; it does not prove that users benefit from it. Both articles describe feature flags and A/B testing as mechanisms for comparing workflow variants in production. The evaluation playbook identifies conversation outcomes, tool success rates, human-support fallbacks, and user satisfaction as useful online signals.

This staged approach also limits ambiguity. An offline regression can block a weak component before exposure, while an online experiment can test whether an eligible improvement changes real behavior. Promotion should depend on both: acceptable component performance and evidence that the complete workflow advances its intended outcome.

Traces turn composition failures into fixable problems

Composability increases the number of boundaries at which a workflow can fail. The evaluation playbook treats traces as the backbone of agent evaluation because they record inputs, intermediate actions, invoked tools, and final responses. The skills-library article similarly connects reusable chains to logs, traces, metrics, and production dashboards.

A final-answer score alone may reveal that a workflow failed, but a trace can localize the failure. It can show whether retrieval supplied poor evidence, classification selected an unsuitable route, a tool call failed, a guardrail intervened, or generation ignored valid context. This makes evaluation useful for component ownership: teams can repair the relevant skill rather than adding a broad prompt patch to the entire chain.

Trace analysis also supports reuse decisions. If one component repeatedly causes latency, cost, or safety regressions across several workflows, improving that shared component may create more value than optimizing each application independently. Conversely, a component that succeeds in one context but fails in another may need a narrower contract rather than a universal interface.
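
Localizing a failure and then counting where failures cluster across workflows can be sketched in a few lines. The example assumes each trace is a list of step records carrying an `ok` flag set by that step's own check; the record layout and both function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def locate_failure(spans: list) -> Optional[str]:
    """Return the earliest step whose check failed, or None if all passed."""
    for span in spans:
        if not span["ok"]:
            return span["step"]
    return None


def failure_hotspots(traces: list) -> list:
    """Count first-failing steps across traces, most frequent first."""
    counts = Counter(step for step in map(locate_failure, traces) if step is not None)
    return counts.most_common()
```

Attributing each trace to its earliest failing step avoids blaming generation for an error that began upstream.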

Versioning is essential to that diagnosis. The evaluation playbook recommends versioning prompts, tools, and datasets, while the skills-library article emphasizes swappable implementations and comparable variants. Without linked versions for the component, evaluation set, judge, and workflow configuration, an apparent improvement may be difficult to reproduce or attribute.
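
Linked versions can be made concrete by fingerprinting the full set of versions behind an evaluation run. The following is a sketch under assumptions: the manifest keys and the `manifest_fingerprint` helper are illustrative, not part of either article.

```python
import hashlib
import json


def manifest_fingerprint(manifest: dict) -> str:
    """Stable short hash of the versions that produced an evaluation result."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Storing the fingerprint next to each score means two results are comparable only when their fingerprints match, or when the difference between their manifests is known.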

Governance and product outcomes belong in the same system

Reusable workflows can spread good controls, but they can also propagate weak ones. The skills-library article reports guardrails for PII redaction, content-policy checks, and rate limiting, alongside configuration intended to support privacy-by-design. Packaging these controls as reusable capabilities can make the approved path easier to adopt, while evaluation fixtures test whether the controls continue to work as surrounding workflows change.

Governance should not be reduced to a final pass-or-fail gate. Safety, privacy, and policy behavior need their own cases and traces throughout development. The amount of human review can then reflect the cost of error, consistent with the evaluation playbook’s recommendation to retain human oversight for higher-risk flows.

The same evaluation system must connect technical quality to product value. The evaluation playbook proposes a driver tree that links per-turn measures such as helpfulness, safety, and latency to session outcomes such as task completion, and then to product measures including activation, retention, and net revenue retention (NRR). This hierarchy prevents a local metric from becoming the objective by default.

For product teams, the resulting unit of roadmap work is not simply a new skill. It is a versioned capability with evidence about behavior, operational fitness, policy compliance, and contribution to an intended outcome. That shared definition gives product trios, engineers, and governance stakeholders a more precise basis for deciding whether to reuse, revise, or retire a component.

Key takeaways

- Package each reusable agent skill with an evaluation contract covering behavior, fixtures, telemetry, policies, and promotion criteria.
- Keep component quality, safety, operational performance, and product impact distinct so improvements and trade-offs remain attributable.
- Use fixed offline evaluations to establish release eligibility, then controlled online experiments to determine real-world value.
- Trace intermediate steps and tool activity so failures can be assigned to the correct component instead of patched at the final response.
- Version workflows, prompts, tools, datasets, and judges together so results remain comparable and reproducible.

As skill libraries expand, their lasting advantage will come from accumulated evidence rather than component count. Teams that make evaluation portable alongside implementation can reuse workflows without surrendering visibility, governance, or product accountability.

June 5, 2026

Connecting Amplitude Positioning to Product-Led Growth

For an analytics product, positioning cannot stop at a market-facing promise. The promise has to appear in onboarding, become visible in user behavior, withstand technical evaluation, and give sales and product teams a consistent explanation of value.

Taken together, two Shivam.Consulting profiles describe complementary sides of that system at Amplitude. The profile of Darshil Gandhi emphasizes competitive, partner, and technical credibility, while the profile of Tommy Keeley concentrates on acquisition, activation, engagement, and experimentation. Their combined lesson is that positioning and product-led growth work best as one evidence loop rather than as separate marketing and product programs.

Positioning becomes credible inside the product

Product positioning defines the problem a product addresses, the value it promises, and the reasons a buyer should choose it. Product-led growth puts that proposition under immediate pressure: users encounter the product directly and can compare the promise with the experience.

The Darshil Gandhi profile reports that Gandhi leads competitive intelligence, partner product marketing, and technical marketing at Amplitude after serving as a principal on a solutions engineering team. The article treats that technical background as important because positioning must reflect real implementations, not merely persuasive language. It connects this approach to field-tested demonstrations, documentation, reference architectures, integrations, and feedback from sales and solutions engineering.

The Tommy Keeley profile approaches the same credibility question from the user’s side of the interface. It describes guided onboarding, product tours, progressive disclosure, contextual prompts, and other in-product guidance as ways to move users toward an early experience of value. Funnel instrumentation and session replay are presented as tools for locating friction in that journey.

These perspectives form a useful positioning test. A claim must be technically defensible during evaluation, understandable when a user first enters the product, and observable in subsequent behavior. If one of those conditions fails, stronger copy alone is unlikely to repair the mismatch.

Behavioral evidence closes the positioning loop

The two profiles both assign behavioral analytics a role beyond reporting. In the Gandhi article, Amplitude analytics are used to validate claims and identify themes associated with competitive wins. In the Keeley article, behavioral analytics, cohort analysis, funnels, pathing, and retention analysis help determine which actions are associated with longer-term value and where users abandon important journeys.

This creates a feedback loop between market language and product behavior. Positioning proposes that a capability produces a meaningful outcome. Instrumentation then shows whether intended users reach that capability, adopt it, and continue using the product. Field feedback adds another layer by revealing which claims survive buyer scrutiny and which require qualification or clearer proof.

The distinction between correlation and causation remains important. Cohort patterns can identify promising behaviors, but an association with retention does not by itself prove that encouraging the behavior will improve retention. The Keeley profile therefore pairs behavioral analysis with controlled A/B testing, minimum detectable effect thresholds, guardrail metrics, sequential testing, and feature flags. In this model, analytics generates hypotheses and experiments provide stronger evidence for decisions.
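
The role of a minimum detectable effect can be made concrete with the standard two-proportion sample-size approximation. This sketch is not taken from the Keeley profile; `sample_size_per_arm` is a hypothetical helper, and the defaults of 5% significance and 80% power are common conventions assumed for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm to detect an absolute lift of `mde` in a conversion rate."""
    treated = baseline + mde
    if not 0 < baseline < 1 or not 0 < treated < 1 or mde == 0:
        raise ValueError("rates must lie strictly between 0 and 1 and mde must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + treated * (1 - treated)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)
```

Halving the detectable effect roughly quadruples the required sample, which is why the threshold is worth fixing before an experiment starts. Sequential testing changes the calculation and is not covered by this approximation.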

The same discipline applies to AI-enabled personalization. The Keeley article describes using generative AI for tailored onboarding, recommended next actions, and summaries of activity patterns, while placing interventions behind feature flags and evaluating them through controlled experiments with privacy-by-design constraints. AI is therefore framed as an extension of the measurement system, not a substitute for a clear value proposition.

A shared driver tree connects the market promise to growth

A recurring mechanism across both sources is the driver tree. The Gandhi profile recommends connecting capabilities to customer outcomes so competitive narratives remain consistent. The Keeley profile starts with a North Star Metric and maps drivers across acquisition, activation, engagement, retention, and monetization. Combined, these uses turn the driver tree into a translation layer between positioning and product-led execution.

At the top sits the outcome the product claims to enable. Beneath it are the behaviors that indicate users are realizing that outcome, followed by the product capabilities and interventions intended to support those behaviors. Competitive intelligence can examine whether the top-level promise is distinctive and relevant. Technical marketing can verify that the enabling capabilities work as described. Growth teams can measure whether users discover and adopt them.

This structure also changes acquisition decisions. The Keeley profile argues for optimizing beyond clicks toward post-signup behaviors associated with retention. That requires congruence among the landing-page message, the users being attracted, and the experience after signup. A campaign that produces registrations but draws people away from the product’s strongest use case may improve a top-of-funnel measure while weakening the product-led system.

Growth loops should follow the same logic. The Keeley article identifies collaboration invitations, user-generated content, and shareable artifacts as possible viral mechanisms. Their strategic value depends on whether sharing is a natural expression of the product’s core value. When distribution emerges from useful product behavior, the loop reinforces positioning; when sharing is detached from that value, it risks becoming a short-lived acquisition tactic.

Key takeaways
- Positioning should be treated as a testable claim linking a capability, a user behavior, and a meaningful outcome.
- Technical evidence, field feedback, and behavioral analytics answer different questions; credible differentiation needs all three.
- A shared driver tree can align competitive intelligence, product marketing, growth, design, engineering, sales, and solutions engineering around the same value logic.
- Acquisition quality should be judged partly by meaningful post-signup behavior, not solely by traffic or registration volume.
- Onboarding, in-product guidance, and viral loops should express the core value proposition rather than operate as disconnected growth tactics.
- Personalization, including AI-enabled interventions, needs feature controls, privacy safeguards, and experimental evaluation.

Organizational alignment is part of the positioning system

Neither source presents this work as the responsibility of a single function. The Gandhi profile emphasizes collaboration among competitive intelligence, partner product marketing, technical marketing, sales, solutions engineering, and product. The Keeley profile describes empowered product trios, continuous discovery, and outcome-focused roadmaps that connect engineering, design, and product decisions to measured growth drivers.

The synthesis suggests a practical division of responsibility without creating separate agendas. Market-facing teams clarify the buyer’s alternatives and the basis for differentiation. Technical teams establish what can be demonstrated and implemented. Product teams reduce the distance between signup and experienced value. Growth teams measure the journey and test interventions. Partners can make integrations and associated use cases more repeatable.

The forward opportunity is to make this loop increasingly explicit: every major positioning claim can be connected to product evidence, every growth initiative can be checked against the intended value proposition, and every field objection can become an input to product discovery. That approach gives Amplitude’s reported playbooks a broader implication for product-led companies: differentiation becomes more durable when the story, the implementation, and the observed behavior keep correcting one another.

References
- Shivam.Consulting Blog — From Solutions Engineering to PMM Leadership: Darshil Gandhi’s Playbook for Amplitude’s Edge
- Shivam.Consulting Blog — Director of Product, Growth & AI at Amplitude: My Playbook for Viral Growth and Engagement
June 5, 2026

A Game-Changing Leap in Voice AI: Fin Voice 2, Apex Flash, and a Live Demo You Can Trust

In competitive markets, I see two options: try to win the game competitors set, or choose to play a different game. In the "Customer Agents" category, I’ve watched too many glossy, fabricated demos—especially around voice—mask the real challenges. Voice is just extremely hard. We all know the future of customer experiences will be Agent-driven voice, yet most of us haven’t actually spoken with a modern AI Agent when calling a business because the tech hasn’t been truly ready in the wild. Today, the bar moves.

What changed? There’s a live, public demo of cutting-edge voice tech you can stress test yourself—no smoke, no mirrors. I recommend taking it for a spin: https://fin.ai/voice. It’s fast, natural, and, yes, very, very good.

For context, yesterday brought Apex Flash, their newest and fastest model, built for the unique demands of low-latency channels like voice. Today comes Fin Voice 2, a major upgrade to Fin Voice with over 20 new features, and the first product built on Apex Flash.

Here are the three things that stood out to me—and why they matter for customer support AI strategy and product strategy.

First — thanks to Apex Flash, Fin Voice 2 is now the fastest, most natural Agent for phone, with higher resolution rates and customer satisfaction scores than ever before. Apex Flash is trained on millions of customer experience interactions, fine tuned for customer service, and can be configured to understand all your knowledge and follow all your policies. The result is higher resolution at significantly lower latency—the best of both worlds for voice AI agent performance.

Speed and naturalness here aren’t accidental. Most voice AI products are slow because they convert speech to text, send it to a general model, get a text answer, and then convert it back to speech. Fin Voice 2 was designed to work differently, separating the real-time layer that handles speech processing from the layer that generates answers. That architecture is purpose-built for the demands of customer service on voice.

Powered by Apex Flash, Fin Voice 2 raises the bar on quality and speed—boosting resolution rates and guidance following while cutting time to first audio and semantic search latency, with a lift in CSAT too.

Second — Fin Voice 2 can handle complex queries end to end: taking actions in external systems, verifying callers’ identities, processing refunds, booking appointments, and more. Phone is a high-stakes channel, and Fin adapts to customers across emotional states, clarifies when needed, and confirms key details before taking action. Most of the time, Fin can resolve the query in full, and when it can’t, it seamlessly hands off to the human team, maintaining full customer context and history. You also get multiple improvements to call quality, plus proactive outbound calls to follow up on unresolved issues—all orchestrated by robust AI workflows.

Third — Fin Voice 2 gives you total control with industry-leading tools to configure and manage how Fin behaves. You get rich, detailed insights into call behavior and quality, the most common topics of calls, and one-click recommendations to improve. As with everything in Fin, you can fully self-serve and then manage it all with ease, without requiring professional services. Many vendors only let you set up their voice agent under supervision; with Fin, you get everything you need to iterate fast.

If you haven’t tried the demo yet, go check it out: https://fin.ai/voice. If you prefer to wait, don’t be surprised when you end up speaking with it at a favorite brand soon.

From a product management lens, this is what matters: latency is a feature customers feel; transparency builds trust in enterprise AI; and control is non-negotiable for CX leaders. The combination of a purpose-built, agentic AI architecture, measurable gains in resolution and CSAT, and true self-serve configuration signals that voice is moving from prototype theater to production reality. That’s the different game I want our industry to play.

Inspired by this post on The Intercom Blog.

June 4, 2026

Crafting Beloved Tech Brands: My Moonshot Marketing Playbook for the Post-LLM Era

I spend a lot of my time asking a deceptively simple question: what does excellent marketing actually look like in 2026? From the vantage point of product leadership, the answer isn’t a spreadsheet or a channel plan—it’s a feeling. Beloved tech brands earn the benefit of the doubt, create gravity around their roadmap, and make customers proud to belong. That kind of momentum is not an accident; it’s a system.

Here’s the hard truth I’ve learned building and scaling products: giving teams different goals creates dysfunction. When brand, demand gen, product marketing, and comms run on fragmented OKRs, you manufacture internal headwinds. “Marketing is one engine – not separate pieces.” One strategy, one narrative, one set of outcomes—expressed through different craft disciplines and time horizons.

That unity of purpose clarifies executive roles, too. The real difference between an SVP and a CMO is scope and narrative ownership. A great CMO architects the whole system—portfolio allocation, brand architecture, integrated go-to-market strategy, and the bar for creative taste—while refusing to get dragged into decisions they should never be making (for example, approving every headline or micromanaging channel tactics). Leaders should decide the outcomes, standards, and constraints; teams should control the craft.

On portfolio design, I run marketing like a portfolio of moonshots. You need a healthy mix: proven programs that compound, emergent bets that learn fast, and a small set of true moonshots that can change the slope of the curve. The point isn’t bravado; it’s risk-balanced exploration. If everything ships safely, you’re under-investing in differentiation. If everything is a swing for the fences, you’re not building a repeatable growth engine.

This is where taste becomes a strategic advantage. “Ubiquity is the opposite of cool.” If you want to be beloved, you cannot treat every channel, audience, and moment as equal. Early on, selective distribution, distinctive creative codes, and tight community loops create status and meaning. Later, you scale without sanding off the edges that made the product special.

Why do a few companies build a flywheel of momentum while others stall? They align story, product, and distribution. The product earns trust, the narrative creates aspiration, and the go-to-market strategy ensures the right customers experience both at the right time. Then perception cycles kick in—the Silicon Valley clock turns—and irrational optimism or skepticism can amplify signals. The antidote is compounding proof: consistent product shipping, community advocacy, and creative that makes people care.

Scaling taste across an organization is teachable. I codify brand principles, narrative guardrails, and examples of “right” versus “almost right.” I replace abstract feedback with decision rubrics—what we keep, kill, or revise and why. I run recurring creative reviews with a small cross-functional council, so judgment compounds. Taste can’t be fully automated, but it can be operationalized: shared references, a story bible, and a high bar for craft that’s explicit, not mystical.

In a post-LLM world, the fundamentals haven’t changed—but the frontier has. Generative tools supercharge iteration and research, yet the artistry never really left. You still need a point of view, a tension worth resolving, and a value proposition that’s felt, not just stated. Can taste be encoded in software? Parts of it—pattern libraries, style constraints, data-driven feedback—absolutely. But the spark that makes work unforgettable remains human: judgment, risk tolerance, and the courage to ship something that might not fit the playbook.

That’s why telling an optimistic yet realistic story about AI matters. Over-automation drains humanity; under-automation wastes potential. The best work pairs AI strategy with craft leadership: LLMs for rapid exploration, humans for narrative decisions and ethical judgment. Your message should show how AI expands customer agency, not just efficiency.

The brand-versus-growth debate is a false choice. The right story accelerates pipeline, and the right demand programs reinforce the brand. Look at Apple’s discipline around product truth and design codes, or Google Chrome’s “The Web Is What You Make of It (Dear Sophie)” for proof that emotion and utility can co-exist. Notion, Pinterest, Square, HubSpot, and Harley-Davidson show how community, identity, and product-led growth interlock when the company knows exactly what it stands for.

When it comes to launches, I’ve learned that announcement videos full of humans can still lack humanity. Overproduced gloss often dilutes the truth customers seek: what problem does this solve, how quickly can I feel the value, and why does it matter now? Real users, real context, and a crisp arc from problem to promise will outperform most theatrics.

Practically, I architect my week to protect taste and outcomes. Early in the week is for strategy, portfolio reviews, and cross-functional alignment; midweek is for deep creative and product marketing work; late in the week is for clearing pending decisions and running postmortems. I time-box “disruptive energy” (space to chase non-obvious ideas) and guard it like any critical meeting. Without protected cycles for exploration, the urgent will always suffocate the important.

If there’s a single takeaway: playbooks are obsolete, but the fundamentals are not. The channels change; the psychology doesn’t. Run one engine. Allocate a true portfolio. Scale taste with rigor. In the AI era, make people care. That’s how beloved tech brands are built—and how they endure.

June 4, 2026
Supercharge Insights with Amplitude Agent Connectors: Connect Notion, Slack, Linear & More

I’ve led enough multi-tool product organizations to know how quickly momentum erodes when insights and actions live in different places. When my teams bounce between Notion, Atlassian, Slack, Linear, and analytics dashboards, we pay a real tax in context switching. That’s why I’m excited about what Amplitude is enabling with Agent Connectors—bringing our daily work and our data-driven decisions into one fluid, agentic AI workflow.

Connect Notion, Atlassian, Slack, Linear, and more to Amplitude's Global Agent. Get richer analysis and take action across tools without leaving Amplitude.

Practically, this means I can treat Amplitude analytics as a unified analytics platform where analysis and execution finally meet. Instead of exporting charts or copying insights into docs, I can drive Agent Analytics directly from the same surface where I manage behavioral analytics, reducing friction and accelerating decisions. For my product strategy, that’s a meaningful shift—from “insight later” to “insight-to-action now.”

Here’s how I’d use it on a typical day: I ask the agent to synthesize signals from recent feature usage, spotlight anomalies, and then draft a concise summary for our Slack channel. In the same flow, I can prompt it to reference our Notion specs for context and queue next steps in Linear, keeping Atlassian stakeholders looped in without any extra swiveling between tabs. The value isn’t just faster execution; it’s tighter alignment across teams because the analysis and the plan live together.

From an operating model perspective, this is how I scale AI workflows responsibly. I can define clear prompts, approval paths, and ownership so the agent augments—not replaces—expert judgment. Data governance and permissions remain front and center: the agent sees what your teams are allowed to see, and we maintain auditability on critical workflow steps. The outcome is a trustworthy, repeatable system that compounds learning over time.

If you’re exploring agentic AI for product teams, start small and instrument your ROI. Pick one or two connectors (Slack and Notion are great first choices), define a measurable workflow—like pushing weekly retention insights and creating prioritized follow-ups in Linear—and iterate using continuous discovery. In my experience, the first wins appear as reduced time-to-insight, fewer meetings to align, and faster cycle time from observation to shipped change.

The big picture is simple: bring your work to your analytics, and your analytics to your work. With Agent Connectors, Amplitude’s Global Agent helps close the loop from understanding behavior to taking action—without leaving the place where your insights are born.

Inspired by this post on Amplitude – Best Practices.

June 3, 2026
Package Hack Wake-Up Call: My Playbook for Securing Cowork, Coding Agents, and Secrets

I love being a builder. It feels like a superpower I can’t stop using, and lately I’ve been channeling it into better workflows, faster experimentation, and sharper product thinking.

I tinker with my Claude Code workflows to make every day more effortless. I’m having a blast creating AI-generated interview snapshots and opportunity solution trees for Vistaly. I also spend time digging into traces and iterating on the AI coaches I use for our discovery courses.

Then the recent wave of malicious software spreading through the open-source community burst my bubble. It hit companies big and small, including names like OpenAI, PostHog, and Zapier. As I dug in, I realized what many cybersecurity experts have long known: this is a deep rabbit hole. If I want to build responsibly, I have to get significantly better at protecting my devices, credentials, and code. And if you’re building with AI or modern tooling, you likely do, too.

Here’s why. We all rely on open-source software. Most modern applications assemble tried-and-true components—parsing a PDF, handling dates across time zones, visualizing spreadsheet data, connecting to an API—rather than reinventing them. The same is true for agent skills and MCP servers; they accelerate how we get value from models. This is overwhelmingly a good thing. But it also creates an attack surface that bad actors exploit.

We don’t need to abandon third-party code. We do need to understand the mechanisms attackers use and consistently defend against them.

When one malicious worm compromises hundreds of packages, what should dev teams do? This visual teaser maps the agenda—how it spreads, how to guard against it, AI tool risks, and concrete steps to mitigate.

On May 11th, I started seeing tweets about a TanStack hack. At the time, I didn’t know what TanStack was; it turns out to be a popular set of JavaScript libraries used by many React sites. At first, I didn’t pay much attention. Then I learned the packages were compromised by a worm (malicious software that self-replicates), and it spread quickly. Within hours, dozens of packages were implicated; by day’s end, it was in the hundreds. That’s when I knew I had to lean in.

If you’ve explored safe development practices with coding agents before, you’ve seen the basics of package safety. A package is a bundle of reusable code shared through registries, and nearly every app you use depends on them. The unfortunate twist with this specific hack, known as the Mini Shai-Hulud worm, is that it shows prior “safe enough” heuristics aren’t sufficient. Popularity and trust signals don’t guarantee safety. We have to do more.

So here’s what I’ll cover today: how malicious software typically works, a practical framework for guarding against it, the specific risks of using Cowork to write and run code, and concrete steps to mitigate that risk. My goal is simple: help you keep building—despite the risks—while protecting your data and your business.

Quick disclaimer: I’m not a security expert. I’m sharing my personal journey and what I’ve learned through research and hands-on work. Please use your best judgment when applying any of this.

Package hacks share a simple playbook: get in, sweep for secrets, and phone home. This visual breaks down the 3 steps and flags new entry points—from packages to MCP servers, agent skills, and app extensions.

An agent recently scoured over 230,000 malicious software incidents and found that most malicious software follows a similar pattern. First, it needs an entry point onto your computer. Once installed, it scours your device for sensitive data, and then it uses your network connection to send that data to its own servers. The Mini Shai-Hulud worm follows the same pattern: it spreads via malicious package install scripts that run at download time, searches the device for credentials (including package publishing rights), poisons additional packages to keep replicating, and uses multiple channels, including the victim’s own public GitHub repos, to publish the stolen secrets.

In practice, most attacks boil down to three steps: 1) It finds an entry point to your device. 2) It searches your device for sensitive data. 3) It sends that data to its own server. The good news: this pattern also tells us how to defend. We can harden entry points, minimize what code and agents can access, and constrain outgoing network traffic.
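To make "harden entry points" concrete for the install-script vector, here is a minimal sketch in Python. It assumes an npm-style project where each dependency ships a `package.json`, and it only reports which installed packages declare the standard npm install-time hooks (`preinstall`, `install`, `postinstall`). The function name `find_install_scripts` is my own, and the script says nothing about whether a hook is malicious; it just narrows down what deserves a closer look.

```python
import json
from pathlib import Path

# npm lifecycle hooks that run automatically while a package is being installed.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def find_install_scripts(node_modules: Path) -> dict[str, dict[str, str]]:
    """Map package name -> install-time scripts declared in its package.json.

    Hypothetical helper for illustration; it reports hooks, it does not judge them.
    """
    findings: dict[str, dict[str, str]] = {}
    for manifest in node_modules.rglob("package.json"):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest; skip it
        if not isinstance(data, dict):
            continue
        scripts = data.get("scripts")
        if not isinstance(scripts, dict):
            continue
        hooks = {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}
        if hooks:
            # Fall back to the folder path when the manifest has no name field.
            name = str(data.get("name") or manifest.parent)
            findings[name] = hooks
    return findings
```

Calling `find_install_scripts(Path("node_modules"))` from a project root returns a dictionary you can print or review by hand. Because this inspects packages that are already on disk, it is an audit step, not a defense; npm's `--ignore-scripts` option is the setting that skips these hooks at install time.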

Keep in mind that install scripts aren’t the only entry vector. Any code that runs on your machine could contain malicious payloads: third-party packages, agent skills, MCP servers, browser or desktop extensions—the list is long. As coding agents and “vibe coding” tools become mainstream, more non-engineers are exposed to the same risks engineers have managed for years.

You might be at elevated risk if you do any of the following: you download and use third-party skills or MCP servers; you let Claude Code, Codex, or other coding agents write scripts that run locally and use third-party packages; you use an IDE like VS Code or Cursor with third-party extensions; or you install third-party extensions in tools like Obsidian. This isn’t an exhaustive list, but if any of these apply, it’s worth tightening your approach.

Relying on third-party code? This visual highlights four common risk zones—agent skills/MCP servers, coding agents, IDE extensions, and Obsidian plugins—and urges a review of downloads, local scripts, and add-ons.

The “safest” approach would be to avoid installing third-party software on your local device entirely. That’s not realistic. We all depend on third-party components in our stack. So I’ll start with one of the most common paths for non-engineers writing and running code today: Cowork.

Evaluating Cowork’s safety was eye-opening. Cowork offers meaningful protection—more than running code directly on your machine—but it isn’t bulletproof. There’s a notable gap you should understand.

Here’s how Cowork helps. It runs code inside a virtual machine, which isolates the execution environment from your real device—a quarantine room for code. While Cowork doesn’t fully control what comes into the room (that part is on you), if malicious code gets in, it’s contained and cannot reach the rest of your filesystem. Cowork also limits outbound network traffic from the virtual machine, which helps disrupt data exfiltration. However, it’s not foolproof.

Because Claude can install packages inside Cowork, the virtual machine remains susceptible to malicious code like the Mini Shai-Hulud worm. GitHub is also on the network allow list so that Cowork can read and write to your repos, and since the worm uses GitHub to publish secrets, that creates exposure. The crucial mitigation: if you never give Cowork access to sensitive data, there’s nothing for an attacker to steal.
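A small Python sketch shows why an allow list only partly constrains exfiltration. This is not how Cowork or Anthropic implements network filtering; the `is_allowed` function and the example URLs are hypothetical, and the only detail taken from the discussion above is that GitHub is an allowed destination.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowlist: set[str]) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains.

    Illustrative only; a real egress filter works at the network layer.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


# Hypothetical allow list containing only the destination mentioned above.
ALLOWLIST = {"github.com"}

# Traffic to an attacker-controlled server is blocked...
blocked = is_allowed("https://attacker.example/upload", ALLOWLIST)

# ...but a request to GitHub passes, no matter what data it carries.
permitted = is_allowed("https://api.github.com/user/repos", ALLOWLIST)
```

The filter decides based on where traffic is going, not on what it contains, so any allowed destination that accepts uploads can double as an exfiltration channel. That is the reason to keep secrets out of the VM in the first place.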

A quick visual from a security deep dive on package hacks shows how Cowork handles threats: entry points are contained, data is only safe when kept outside, and network traffic is partly limited—making shared data the gap to watch.

Your responsibility is straightforward but critical: your data is only safe if it stays outside the virtual machine. When you mount folders into Cowork, those folders become accessible to any code running inside the VM. That includes malicious scripts. Before sharing, ask two questions: do the folders contain any credentials or secrets, and do they include proprietary data that would be harmful if accessed?
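The first question can be partly automated before you mount anything. The sketch below, in Python, walks a folder and flags files whose names or contents look like credentials. The file names and patterns are my own illustrative assumptions and are far from exhaustive, so a clean result is not proof that a folder is safe to share.

```python
import re
from pathlib import Path

# Illustrative, non-exhaustive heuristics (assumptions, not a complete list).
RISKY_NAMES = {".env", ".npmrc", ".netrc", "id_rsa", "id_ed25519", "credentials"}
RISKY_CONTENT = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----"  # PEM-encoded private keys
    r"|AKIA[0-9A-Z]{16}"                    # AWS access key ID format
    r"|ghp_[A-Za-z0-9]{36}"                 # GitHub classic personal access token
)
MAX_BYTES = 1_000_000  # skip large files to keep the scan fast


def flag_risky_files(folder: Path) -> list[Path]:
    """Return files under `folder` that look like they hold credentials."""
    flagged: list[Path] = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        if path.name in RISKY_NAMES:
            flagged.append(path)
            continue
        try:
            if path.stat().st_size > MAX_BYTES:
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file; skip rather than fail the whole scan
        if RISKY_CONTENT.search(text):
            flagged.append(path)
    return flagged
```

Running `flag_risky_files(Path("path/to/folder"))` on a folder you plan to mount gives you a short list to review or move elsewhere first. The second question, whether the folder holds proprietary data, still needs human judgment.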

It’s common for code to need credentials. That’s why Cowork includes connectors to third-party sources like Google Drive and Slack. Credentials configured for these connectors never enter the VM—they remain outside the quarantine room—so they’re not exposed to malicious code. But if your code requires additional credentials inside the VM, scope them tightly and assume they could be compromised.

You can also use custom MCP servers you create yourself with Cowork. Those credentials stay outside the VM as well, provided the MCP servers are remote (hosted on a web server, not downloaded locally). It’s more work than dropping in a local server, but it keeps secrets out of reach from VM-executed code.

Beyond credentials, scrutinize the actual content you share with Cowork, including anything accessed through connectors. Least privilege is the rule: grant only what’s absolutely necessary for the task, and nothing more.

Amid a wave of package supply-chain attacks, this Product Talk visual launches a 3-part guide to safer AI building: Cowork safety today, Claude Code config next week, and off-device development coming soon.

What about skills? Cowork supports skills, and you can add third-party skills inside the quarantine room. If you’re not placing your own data in that room, you can afford more risk. The moment you add sensitive or proprietary data, be selective. Skills can include third-party code, and bad actors use skill directories to distribute malicious payloads. Personally, I never use third-party skills as-is. If one looks useful, I read through the files, then ask Claude to recreate it so I understand what it does and maintain control. If I were to use third-party skills, I’d do it in Cowork and keep their data access to the minimum necessary.

Overall, Cowork is a solid, “safe-ish” option if you’re disciplined about what you share. The challenge is that utility often requires access to real data—exactly what we’re trying to protect. In an upcoming deep dive, I’ll outline strategies to keep malicious code out in the first place. While I’ll focus on local development, the same patterns can extend to Cowork with a bit of setup.

One more important clarification: don’t confuse Cowork with the Code tab in the Claude Desktop app. Cowork runs code inside a virtual machine. The Code tab does not. If you ask Claude to write and execute code from the Code tab, that code runs on your local device and you’re fully responsible for security. There is one exception: the Code tab can run code in Anthropic’s cloud; I’ll cover that approach when we get into moving development off the local machine.

To summarize Cowork’s protections against the attacker’s three-step pattern: installs and scripts still run, but they’re contained inside an isolated virtual machine instead of your real device; access to sensitive data is strongly limited to the specific folders you mount, leaving the rest of your filesystem (including unrelated credentials) out of reach; data exfiltration is partially constrained because Anthropic limits outbound network traffic from the VM—helpful, but not absolute. By contrast, local Code tab sessions offer no isolation, no filesystem restrictions, and no network limits—so any malicious install scripts run directly on your machine with full access and open egress.

My takeaways so far: I still love building with AI, but I’m doing it more cautiously. Cowork offers meaningful containment when used deliberately. I still prefer the flexibility of Claude Code, and I’ve reconfigured my setup to reduce risk. Even so, “safer” isn’t “safe,” which is why I’m increasingly shifting development off my local device to more controlled environments. I’ll share the practical details—tools, configs, and scripts—in the next installments.

If this perspective is useful, let me know. I want builders to move fast—and safely—through this new era of agentic AI. Until then, stay safe out there.

Inspired by this post on Product Talk.

June 3, 2026
Broken Procurement Is Costing You Talent: A Product Leader’s Playbook for Speed and Sanity

Procurement should accelerate value, not suffocate it. Listening to this episode, I found myself nodding (and wincing) through a painfully familiar story about how well-intended controls morph into barriers that keep great expertise out. As a product leader responsible for speed, outcomes, and brand experience, I see procurement as a direct mirror of culture—and an often overlooked part of the product operating system.

In the conversation, Teresa is cranky—and honestly, she has every right to be. She’s simultaneously juggling seven speaking engagement contracts, and six of them have become a part-time job in themselves—think 80-page ethics policies, 800-question security forms, and Multi-Factor Authentication (MFA) questions asked 17 different times. Meanwhile, the one company that just put her fee on a credit card? Scheduled, confirmed, and done in two weeks. That contrast is the whole story: friction repels talent; clarity and simplicity attract it.

Petra adds her own horror story—filling out 12 identical Word document forms—and together they surface a deeper truth I’ve seen across organizations: broken vendor processes don’t just frustrate consultants; they stop companies from getting the expertise they actually need. And despite what many assume, company size isn’t the deciding factor—leadership intent and process ownership are.

If you’ve ever wondered why a training got canceled, why a speaker backed out, or why your team can’t seem to bring in outside experts, this is likely the culprit: procurement theater. Repetitive forms, unbounded scope creep, and sprawling security reviews create drag that outlasts any short-term legal or compliance gain. The opportunity cost—lost learning, slower progress, and talent that simply says no—is enormous.

One detail that stood out: with CEO-level buy-in, a legal review timeline collapsed from four months to 10 days. I’ve seen the same thing. Executive sponsorship is the fastest procurement tool there is, and it reveals what the organization truly values. If you can compress the path when a leader cares, you can redesign the path so it’s always faster—without compromising real risk management.

I also loved the clarity of Teresa’s new policy: her paperwork, credit card payment, and no vendor setup, or no speaking engagement. That’s not obstinacy; it’s a bright-line test for whether an organization respects expert time and understands total cost. The best experts have options, and friction filters them out first.

Here’s how I operationalize this in product-led organizations. Tier risk by engagement type (e.g., one-hour talk vs. long-term software vendor) and match the process to the risk. Offer a credit-card fast lane with standard, plain-English terms for low-risk work. Eliminate duplicate data entry and kill redundant questionnaires. Use a single, secure intake that auto-fills known fields. Track cycle time end to end, and publish SLAs for legal, InfoSec, and finance. Most importantly, make vendor experience a first-class metric—because it is a brand experience.
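The risk-tiering idea above can be written down as a tiny decision rule. This Python sketch is one hypothetical way to encode it; the `Engagement` fields, the lane names, and the ordering of the checks are my own assumptions, not a policy from the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engagement:
    """Hypothetical description of a vendor engagement."""
    description: str
    touches_company_data: bool
    integrates_with_systems: bool
    ongoing: bool


def choose_lane(engagement: Engagement) -> str:
    """Match the review process to the actual risk of the engagement."""
    if engagement.integrates_with_systems:
        return "full-review"  # legal, InfoSec, and finance all weigh in
    if engagement.touches_company_data or engagement.ongoing:
        return "standard-review"  # lighter questionnaire, standard terms
    return "credit-card-fast-lane"  # plain-English terms, no vendor setup


keynote = Engagement("one-hour talk", False, False, False)
saas_vendor = Engagement("long-term software vendor", True, True, True)

keynote_lane = choose_lane(keynote)
saas_lane = choose_lane(saas_vendor)
```

Writing the rule down, even this crudely, forces the question of which attributes actually drive risk, and it makes the fast lane the default for engagements that touch neither data nor systems.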

Security and compliance matter, but they must be right-sized. If you’re buying a keynote, you’re not buying data processing—so why the 800-question security review? Calibrate controls to actual data access and system interaction. The episode even references AWS DynamoDB and GuardDuty, plus Claude Code—helpful reminders that your stack context matters, but not every purchase touches it. Don’t conflate deep technical diligence for a SaaS integration with a simple, no-data engagement.

There’s a reason the classic film Office Space gets a nod—it’s the perfect metaphor for what happens when well-meaning governance calcifies. Bureaucracy compounds over time, usually after adverse events, until startups—or any team that still moves fast—run circles around you. Procurement that treats experts like adversaries won’t win the race that actually matters: learning faster than the market.

If you want the full story, listen to the episode here: Spotify (https://open.spotify.com/episode/2JHnTvnZX2WcFczml7ozKY?ref=producttalk.org) | Apple Podcasts (https://podcasts.apple.com/kh/podcast/procurement/id1794203808?i=1000770701690&ref=producttalk.org). It’s cathartic, but more importantly, it’s a blueprint for fixing what’s broken.

Mentioned in the episode: Hire Teresa to Speak (https://www.producttalk.org/hire-teresa-to-speak/), AWS DynamoDB (https://aws.amazon.com/dynamodb/?ref=producttalk.org), GuardDuty (https://aws.amazon.com/guardduty/?ref=producttalk.org), Claude Code (https://www.claude.com/product/claude-code?ref=producttalk.org), and Office Space (https://en.wikipedia.org/wiki/Office_Space?ref=producttalk.org).

I’d love to hear your experiences and fixes. Where does your procurement flow break, how do you measure cycle time today, and what would it take to create a vendor experience you’d be proud to put your brand on? Drop your thoughts below and let’s trade playbooks.

Inspired by this post on Product Talk.

June 2, 2026