Tag: platform scalability

Durable Product and Platform Leadership Beyond the Launch
A successful product can create momentum, but durable leadership determines whether that momentum becomes an enduring company or platform. The distinction is especially important for infrastructure businesses, where trust, scalability, and operating discipline must keep pace with adoption.

Taken together, the two source articles suggest a practical leadership test: can an organization preserve customer value while strengthening the strategy, governance, and systems surrounding the product?

Product strength can conceal organizational weakness

Why Great Products Can Still Fail argues that product excellence is necessary but insufficient for company health. A compelling product may temporarily mask unclear strategy, weak accountability, poor tradeoffs, or an operating culture that values output more than outcomes. Adoption and market opportunity do not automatically prove that the organization can make sound decisions as it grows.

This changes the leadership question. The issue is not simply whether teams can ship something customers value, but whether the company can repeatedly direct talent and capital toward the right problems. Product discovery, stakeholder management, roadmapping, and sprint planning become parts of a governance system: they connect customer evidence to decisions and expose assumptions before those assumptions harden into costly commitments.

The article also emphasizes ethical decision-making and corporate governance. That perspective broadens product leadership beyond roadmap ownership. Leaders remain responsible for the organizational conditions under which a successful product is developed, sold, and extended.

Durable platforms reduce uncertainty at every stage

The Supabase article approaches durability through a developer-platform case study. It reports that Supabase started with an open-source PostgreSQL proposition intended to combine rapid application development with an architecture developers would not have to abandon as their needs became more serious. In that account, the platform’s value rests on fast setup, predictable building blocks, reliable documentation, sensible defaults, and a credible path to scale.

Those qualities reveal a broader platform principle: durability is not the same as having the largest feature set. A durable platform lowers uncertainty. It helps customers understand what they are adopting, begin using it without unnecessary friction, and remain confident that early speed will not create an architectural trap later.

The source attributes part of that confidence to Supabase’s alignment with PostgreSQL and its open-source approach. Community trust and commercial growth are presented as mutually reinforcing rather than competing motions. This complements the governance argument from the first article: trust is created when a company’s operating choices support the product promise, not merely when its marketing states that promise.

Leadership durability comes from connected operating loops

The Supabase account reports that founder Paul Copplestone’s earlier startup experiences contributed to an emphasis on finding product-market fit before blitzscaling and on separating fundraising from building. It also describes the company as operating with a constraint mindset even after raising capital. Read alongside the warning that strong products can disguise structural problems, the lesson is that available resources should not be mistaken for validated demand or organizational readiness.

Positioning forms another operating loop. According to the Supabase article, a tagline change preceded the project reaching the top position on Hacker News and was treated as an early product-market-fit signal. The useful interpretation is not that wording alone establishes fit. It is that positioning can test whether the market recognizes the job a product performs. When the message and the customer problem align, feedback becomes clearer and acquisition friction may fall.

Measurement must then distinguish genuine contribution from inherited momentum. The source reports that Supabase designed sales compensation around incremental uplift over a control group. In a product-led business, that approach asks whether sales created conversion or expansion beyond what self-service adoption would probably have generated. It places evidence above activity and limits the temptation to claim credit for demand already produced by the product.
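
The uplift idea can be made concrete with a small calculation. The following is a minimal sketch, assuming a sales-touched cohort is compared with a self-service control cohort; the `Cohort` type, the function names, and the numbers are illustrative assumptions, not details reported in the source.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    accounts: int     # accounts in the cohort
    conversions: int  # accounts that converted or expanded


def conversion_rate(cohort: Cohort) -> float:
    """Share of accounts that converted; 0.0 for an empty cohort."""
    return cohort.conversions / cohort.accounts if cohort.accounts else 0.0


def incremental_conversions(treated: Cohort, control: Cohort) -> float:
    """Conversions in the treated cohort beyond the control-group baseline."""
    baseline = conversion_rate(control) * treated.accounts
    return treated.conversions - baseline


# Illustrative numbers only, not figures from the source.
sales_touched = Cohort(accounts=200, conversions=70)
self_service = Cohort(accounts=1000, conversions=250)
print(incremental_conversions(sales_touched, self_service))  # 20.0
```

In this example the control group converts at 25%, so 50 of the 70 sales-touched conversions would be expected without sales involvement, and only the remaining 20 count as created lift.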

Organizational learning completes the system. The article describes a fully distributed, asynchronous team with near-zero attrition and connects its scaling philosophy to kaizen, or continuous improvement. Because these are claims from a single company-focused account rather than independently verified comparisons, they should be treated as reported characteristics. Their leadership relevance is still clear: asynchronous execution depends on strong writing and explicit ownership, while continuous improvement requires teams to identify and remove recurring friction.

AI readiness should amplify a durable foundation

The Supabase article reports three AI-related waves involving pgvector, Bolt and Lovable, and Claude Code. It presents these developments as successive ways in which retrieval, rapid application creation, and AI-native development workflows increased the relevance of an existing backend platform.

The sequence matters because it separates readiness from trend chasing. The reported AI opportunities could compound platform value because the underlying customer need already existed: developers wanted to build quickly on a backend they could trust. AI changed workflows and urgency, but it did not replace the platform’s core value proposition.

For leadership teams, this implies a disciplined filter for emerging technology. A new capability deserves investment when it strengthens an established customer job, improves the platform’s trusted primitives, or opens a coherent path for existing users. Technology excitement alone cannot resolve weak positioning, unclear ownership, or an unproven operating model.

Key takeaways
- Treat product success as evidence, not immunity. Adoption does not eliminate the need for governance, ethical judgment, and explicit accountability.
- Design platforms around customer confidence. Fast onboarding, dependable primitives, clear documentation, and a credible scaling path matter together.
- Preserve constraints after capital or demand arrives. Resources should follow validated customer value rather than substitute for it.
- Measure incremental impact. Product-led and sales-led motions need a method for separating created lift from revenue that would have occurred anyway.
- Use AI to extend a durable value proposition. Emerging workflows are most useful when they compound an existing platform advantage.

Durable leadership is ultimately visible in what happens after early success: whether the organization converts attention into learning, learning into disciplined choices, and those choices into a platform customers can continue to trust.

References
- Shivam.Consulting Blog – Why Great Products Can Still Fail: The Leadership Lesson Every Team Needs
- Shivam.Consulting Blog – Why Supabase’s AI-Era Growth Playbook Should Reshape How We Build Platforms

July 3, 2026
Beyond Black‑Box Scores: Custom AI That Elevates Trust & Safety Without Burnout

What do you do when off-the-shelf moderation scores aren't good enough—and the alternative is paying human contractors to spend their days reviewing traumatizing content at scale? I’ve wrestled with that exact trade-off in enterprise environments, and it’s why I was eager to unpack how custom AI can raise the bar on trust and safety without compromising accuracy, latency, or the well-being of our teams.

In this episode of Just Now Possible, I sit down with Nikki Marinsek (Data Scientist), Brian McCaffrey (Software Engineer), and Dan Means (Machine Learning Engineer) from Musubi, an AI-native trust and safety toolkit for content platforms. Musubi builds custom-trained ML models and LLM-powered moderation tools that adapt to each platform's unique policies—from dating apps to social networks to AI inference endpoints. As a product leader, I’m drawn to their blend of eval-driven development, agentic AI, and pragmatic deployment pipelines that actually meet real-world SLAs.

We walk through their full journey, from a first prototype on tabular data to the discovery that the system was sometimes catching issues human moderators missed. That insight became a forcing function to formalize evaluation, calibrate thresholds, and design feedback loops that help humans and models converge. Just as importantly, they built a policy optimizer that uses agentic flows so non-technical trust and safety teams can iterate on LLM moderation policies without needing a data scientist in the room.

If you’ve ever had to balance latency, accuracy, and cost at scale, you’ll appreciate how Musubi tests trade-offs across traditional ML, embedding-driven classification, and LLMs. Their approach mirrors the patterns I expect in high-throughput stacks: cache and pre-compute where possible, contain worst-case latencies, and push evaluation tooling to customers so policy changes are safe, observable, and fast to deploy.

What resonated most with me is their core product strategy: put eval tools directly in customers’ hands. When teams can benchmark AI against humans, referee disagreements using “LLM as judge,” and make policy gaps visible, trust increases and operational drift decreases. That’s the foundation for durable product strategy in sensitive domains like content moderation, fraud management, and risk scoring.

Guests: Nikki Marinsek, Data Scientist, Musubi; Brian McCaffrey, Software Engineer, Musubi; Dan Means, Machine Learning Engineer, Musubi.

In this episode:
- Why off-the-shelf moderation scores fail and how custom-trained models fix that
- How Musubi combines traditional ML with LLMs for different moderation tasks
- The discovery that AI can outperform human moderators, and how to communicate that to clients
- Using AI as a judge to referee disagreements between AI and human decisions
- How Musubi onboards new customers with "reverse demos"
- What custom model training actually means: fine-tuning, feature engineering, and reusable deployment pipelines
- The policy optimizer: an agentic flow that helps customers iterate on their LLM moderation policies
- Why pushing eval tools directly to customers is a core product strategy
- How Musubi is building flexible orchestration workflows for non-technical trust and safety teams

From a product management lens, a few highlights stand out. First, the disciplined separation of concerns: use traditional ML for high-precision, low-latency pattern detection and LLMs for nuanced policy interpretation. Second, invest in golden sets and policy loops early so you can quantify improvement and avoid subjective debates. Third, productize customization—create reusable deployment pipelines, parameterized policies, and self-serve evaluation—so each customer’s “custom model” still scales like a platform.

I also appreciated the onboarding tactic of "reverse demos." Rather than a canned walkthrough, the team invites customers to bring real policies and edge cases, then instruments the workflow live. That move builds credibility, accelerates discovery, and surfaces the fastest paths to value—an approach I recommend whenever you’re selling complex AI workflows to non-technical stakeholders.

If you’re navigating cost and latency trade-offs, the conversation goes deep on techniques like embedding-driven classification, fine-tuning vs. training, and when to route decisions through LLM adjudication. My takeaway: treat the router, the evaluator, and the policy as first-class products. When those elements are observable and testable, you can raise quality without exploding compute costs or creating operational bottlenecks.
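
To make the router idea tangible, here is a minimal sketch of tiered routing, assuming a cheap classifier that returns a violation score and an LLM adjudicator that is called only for uncertain cases. The function name, the two callables, and the thresholds are my own illustrative assumptions, not Musubi's implementation.

```python
def route(item, fast_score, llm_judge, low=0.2, high=0.9):
    """Decide with the cheap model when it is confident; escalate otherwise.

    fast_score(item) -> float in [0, 1], a hypothetical violation score.
    llm_judge(item)  -> "violation" or "allowed", a hypothetical LLM adjudicator.
    """
    score = fast_score(item)
    if score >= high:
        return {"decision": "violation", "path": "fast_model", "score": score}
    if score <= low:
        return {"decision": "allowed", "path": "fast_model", "score": score}
    # Only the uncertain middle band pays the LLM's latency and cost.
    return {"decision": llm_judge(item), "path": "llm", "score": score}
```

Because the thresholds are plain parameters and every result records which path produced it, the router itself can be measured against a golden set and tuned like any other product surface.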

Resources & Links: Musubi — AI-powered trust and safety toolkit for content platforms. Maven AI Evals Course — AI evals course.

Chapters:
- 00:00 Meet the Team
- 01:18 Why Everyone Wears Product
- 02:32 What Musubi Builds
- 04:51 AI for Human Moderation
- 09:59 Adversaries and Asymmetry
- 11:48 Early Days and Low Latency
- 13:35 First Prototype Slice
- 15:33 Traditional ML Meets LLMs
- 19:52 Benchmarking Against Humans
- 23:09 LLM as Judge and Policy Gaps
- 29:53 From Prototype to Platform
- 31:15 Customer Onboarding Reverse Demos
- 36:08 Custom Models Per Customer
- 38:05 Fine Tuning vs Training
- 39:14 Embedding Driven Classification
- 40:04 Cost and Latency Tradeoffs
- 43:21 Productizing Customization
- 49:16 Scaling Prototypes to Production
- 51:58 Golden Sets and Policy Loops
- 56:17 Coaching Customers Safely
- 01:02:06 Gamified Feedback Signals
- 01:06:19 Agentic Toolkit Roadmap
- 01:09:05 Workflow Orchestration Future
- 01:12:06 Wrap Up and Thanks

Ultimately, this is a playbook for modern trust and safety: align your models to your policies, make evals a habit not an event, and empower non-technical teams with agentic workflows and transparent metrics. That’s how we move beyond black-box scores to systems we can measure, manage, and trust.

Inspired by this post on Product Talk.

June 11, 2026
Engineering MCP Agents as a Reliable Product Platform
Model Context Protocol adoption becomes consequential when an agent can retrieve organizational knowledge, select tools, and change a system of record. At that point, the engineering challenge is no longer simply connecting a model to an API. It is operating a product platform whose context, permissions, decisions, and side effects must remain dependable.

The source article’s experience with workflows spanning Miro, Jira, and Confluence points to a coherent platform model: retrieval determines what the agent knows, tool contracts constrain what it can do, evaluation tests its behavior, and observability makes failures diagnosable. Product strategy and interaction design then determine whether that machinery improves work users already perform.

Key takeaways
- Treat retrieval, tool schemas, prompts, policies, and telemetry as platform components with explicit owners and versioning.
- Prove one frequent, measurable workflow before expanding the agent’s tool and use-case surface.
- Combine least-privilege access with visible tool rationale, consent controls, audit records, and safe recovery paths.
- Evaluate the complete chain from retrieved context to downstream action, not just the quality of generated text.
- Govern the tool catalog and delivery pipeline continuously so that extensibility does not become uncontrolled operational risk.

The platform boundary extends beyond the MCP connection

MCP provides a practical interface through which models can reach data, tools, and actions, according to the source article. The protocol connection is therefore an enabling layer, not the whole agent platform. A production workflow also depends on source authority, identity and permission checks, context selection, tool arbitration, execution controls, user-facing recovery states, and evidence that the result was useful.

This broader boundary changes how teams should decompose the system. Retrieval is a managed context service rather than an incidental prompt-building step. Tools are governed capabilities rather than a loose collection of endpoints. Prompts and policies are deployable artifacts rather than text copied into application code. Traces and evaluations are part of the control plane because they reveal whether the other layers continue to work together.

The source recommends starting with authoritative content, normalizing it with docs-as-code discipline, attaching metadata that supports permission-aware filtering, and selecting the smallest high-signal context needed for a task. The engineering implication is important: access control must shape retrieval before information reaches the model. Filtering only when an action is attempted would leave the reasoning process exposed to context the user or agent may not be entitled to use.
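
A minimal sketch of permission-aware context selection follows, assuming each chunk carries group metadata attached at ingestion and a relevance score from some retriever. The `Chunk` fields and `select_context` function are hypothetical names used for illustration, not an API from the source or from MCP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # permission metadata attached at ingestion
    score: float               # relevance score from a hypothetical retriever


def select_context(chunks, user_groups, max_chunks=3):
    """Filter by permission first, then keep the smallest high-signal set."""
    permitted = [c for c in chunks if c.allowed_groups & set(user_groups)]
    permitted.sort(key=lambda c: c.score, reverse=True)
    return permitted[:max_chunks]
```

The ordering is the point of the sketch: a highly relevant chunk the user may not see is dropped before ranking, so it never reaches the model's context.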

Context quality also affects more than answer accuracy. The source links focused retrieval to lower hallucination risk, more accurate tool calls, and lower cost. That makes retrieval performance a shared dependency for safety, reliability, latency, and economics. It deserves its own contracts, tests, freshness expectations, and failure modes.

A golden path turns architecture into an operating contract

The source describes an initial workflow that summarized a Miro board into action items and wrote them to Jira. It reports that variants involving Confluence summaries, epic splitting, and backlog grooming followed only after the original path reached its reliability targets. This is less a recommendation for those particular products than a useful sequencing principle for agent platform engineering.

A narrowly defined workflow exposes the entire contract between context and consequence. The team must decide which content is authoritative, what the model may infer, which tool is appropriate, what inputs the tool accepts, what the user should review, how a partial failure is handled, and how success is measured. A broad assistant can conceal these questions behind plausible conversation; a golden path forces explicit answers.

The right first workflow is therefore not merely technically convenient. It should be frequent enough to matter, have an observable completion state, and carry side effects that can be bounded. The source frames outcomes such as time saved during backlog grooming, better meeting notes in Confluence, and fewer context switches across Miro boards as more useful roadmap anchors than novel model capabilities. It also recommends comparing task success, completion time, user edits, detected defects, and downstream business effects rather than relying on engagement alone.

Those measures form a practical evidence chain. Evaluation results show whether the system behaves as designed; workflow measures show whether users can complete the task; business measures show whether the completed task creates value. Keeping the levels distinct prevents a technically impressive agent from being mistaken for a successful product.

Safety depends on controlling actions and explaining them

Tool access creates a sharper risk boundary than text generation because an incorrect decision can alter a ticket, document, or other shared record. The source’s proposed response combines least-privilege scopes, a human-readable rationale for each call, and an audit trail. It also calls for proposed inputs and expected side effects to be visible when the agent is about to use a tool.

These controls address different failure classes. Narrow scopes limit the maximum effect of a bad decision. Input previews help users catch incorrect parameters before execution. Rationale makes the selection inspectable. Audit records support diagnosis and accountability afterward. None substitutes for the others, and a confirmation dialog alone does not make an overprivileged tool safe.
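
The way these controls layer can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `Gate` that wraps tool execution; the class names, field names, and outcome strings are assumptions made for the example rather than anything specified by the source.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    tool: str
    required_scope: str
    inputs: dict
    rationale: str        # human-readable reason shown to the user
    expected_effect: str  # side effect the user is asked to approve


@dataclass
class Gate:
    granted_scopes: set
    audit_log: list = field(default_factory=list)

    def execute(self, call, run, approve):
        """Check scope, ask for approval, run if allowed, and always audit."""
        if call.required_scope not in self.granted_scopes:
            outcome, result = "denied_scope", None
        elif not approve(call):  # approve() shows inputs, rationale, effect
            outcome, result = "rejected_by_user", None
        else:
            outcome, result = "executed", run(**call.inputs)
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "tool": call.tool,
            "inputs": call.inputs,
            "rationale": call.rationale,
            "outcome": outcome,
        })
        return outcome, result
```

The scope check runs before the approval prompt, so user confirmation cannot widen what the agent is permitted to do, and denied or rejected calls are recorded alongside executed ones.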

Recovery behavior belongs in the same design. The source recommends retrying suitable failures with backoff, falling back to read-only behavior, or requesting consent or missing context. A robust platform should distinguish failures that are safe to retry from failures that require a different plan. It should also preserve an understandable state when a multi-step workflow completes only partially, so the user knows what changed and what did not.
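
One way to express that distinction is sketched below, assuming two hypothetical exception types that separate retry-safe failures from failures needing consent. The names, the status strings, and the default delays are illustrative assumptions.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that is safe to retry, such as a timeout."""


class PermissionRequired(Exception):
    """Hypothetical: the action needs consent or a broader scope."""


def run_with_recovery(write, read_only, attempts=3, base_delay=0.5,
                      sleep=time.sleep):
    """Retry transient failures with backoff, then degrade to read-only."""
    for attempt in range(attempts):
        try:
            return {"status": "written", "result": write()}
        except TransientError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
        except PermissionRequired:
            # Retrying cannot help here; hand the decision back to the user.
            return {"status": "needs_consent", "result": None}
    # Retries exhausted: return a read-only result with an explicit status.
    return {"status": "read_only", "result": read_only()}
```

The returned status tells the interface what actually happened, which supports the goal of leaving the user with an understandable state. A write that is retried this way should be idempotent so that a retry cannot create a duplicate record.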

Transparency need not mean exposing raw internal reasoning. The useful product surface is operational evidence: the sources used, the selected capability, the intended inputs, the expected effect, and the resulting status. The source suggests a reveal panel containing retrieved sources, candidate tools, and confidence signals for power users. More generally, the amount of review should follow the consequence of the action: low-risk retrieval can remain lightweight, while consequential writes warrant clearer inspection and consent.

Evaluation, observability, and delivery form one reliability loop

The source outlines offline tests for intent classification and tool selection, online shadow evaluations for live drift, and regression checks after deployment. It also recommends traces that capture prompts, retrieved chunks, tool inputs, tool outputs, latency, and error codes. Together, these practices connect a visible failure to the component and version that produced it.
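
As a rough illustration of such a trace, the sketch below records the fields listed above together with artifact versions, then groups failures by version. The `Trace` shape and the `failures_by_version` helper are assumptions for the example, not a schema from the source.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    prompt_version: str
    schema_version: str
    prompt: str
    retrieved_chunk_ids: list
    tool: str
    tool_input: dict
    tool_output: dict
    latency_ms: float
    error_code: Optional[str] = None  # None means the call succeeded


def failures_by_version(traces):
    """Count failed traces per (prompt_version, schema_version) pair."""
    return Counter(
        (t.prompt_version, t.schema_version) for t in traces if t.error_code
    )
```

Grouping by version is what turns a pile of traces into an answer to the question of which change introduced a failure.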

Evaluation without observability can show that quality declined without explaining why. Observability without evaluation can produce detailed traces without deciding whether the behavior was acceptable. A mature loop needs both: test cases encode desired behavior, traces expose actual behavior, and production outcomes reveal gaps in the test set.

The delivery process must preserve that connection. The source treats prompts, tool schemas, and guardrails as versioned artifacts deployed behind feature flags, with canary releases, controlled comparisons, and rollback capability. This approach makes a behavioral change attributable. If tool selection deteriorates after a prompt revision or a schema update breaks an integration, operators can identify the change and contain its reach.

Latency should be governed in the same loop because an accurate workflow can still fail as a product experience. The source reports using task-specific latency budgets, caching stable retrieval results, parallelizing safe calls, prefetching likely session context, and providing progress when work exceeds the expected budget. These techniques should remain subordinate to correctness: parallel execution is appropriate only when calls are independent, while caching must respect freshness and permission boundaries.
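
The caching caveat can be sketched as follows, assuming an in-memory cache whose key includes the caller's permission groups and whose entries expire after a fixed time. The `RetrievalCache` class and its parameters are hypothetical and stand in for whatever cache a real platform uses.

```python
import time


class RetrievalCache:
    """Cache retrieval results per query and permission scope, with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def _key(query, user_groups):
        # Permission scope is part of the key, so a result cached for one
        # set of groups is never served to a caller with different groups.
        return (query, frozenset(user_groups))

    def get(self, query, user_groups):
        entry = self._store.get(self._key(query, user_groups))
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            return None  # stale: the caller must retrieve again
        return value

    def put(self, query, user_groups, value):
        self._store[self._key(query, user_groups)] = (self.clock(), value)
```

A time-based expiry is only one freshness policy; content that changes on edit may be better served by invalidating entries when the source document is updated.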

The source also assigns prompts a user-experience role, combining plain-language intent, domain constraints, and explicit tool contracts while using examples, tooltips, and in-product guidance to help users frame requests. This connects conversation design to reliability. Better instructions can reduce ambiguity before the platform has to resolve it through additional model turns or risky assumptions.

Scale requires governance of tools, teams, and ownership

MCP’s extensibility can turn into tool sprawl if every integration is added without lifecycle management. The source recommends a curated catalog recording each tool’s owner, scope, schema version, and deprecation policy. It also describes schema linting in continuous integration, backward-compatible changes, and quarterly retirement of unused tools. These are conventional platform disciplines applied to an agent’s capability surface.

A catalog is valuable because an agent reasons over descriptions and schemas while operators depend on stable implementation contracts. Poorly differentiated tools can make selection ambiguous; unannounced schema changes can invalidate prompts and evaluations; ownerless tools can remain available after their data or permission assumptions have changed. Governance should therefore assess semantic clarity as well as API validity.
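
A catalog check of this kind could run in continuous integration. The sketch below assumes the catalog is a dictionary keyed by tool name, and it treats identical descriptions as a crude proxy for poorly differentiated tools; the field names mirror the attributes listed above, while the function and the duplicate-description rule are illustrative assumptions.

```python
REQUIRED_FIELDS = ("owner", "scope", "schema_version", "deprecation_policy")


def lint_catalog(catalog):
    """Return a list of problems; an empty list means the catalog passes."""
    problems = []
    seen_descriptions = {}
    for name, entry in catalog.items():
        for field_name in REQUIRED_FIELDS:
            if not entry.get(field_name):
                problems.append(f"{name}: missing {field_name}")
        description = entry.get("description", "").strip().lower()
        if not description:
            problems.append(f"{name}: missing description")
        elif description in seen_descriptions:
            # Two tools with the same description are hard to choose between.
            problems.append(
                f"{name}: description duplicates {seen_descriptions[description]}"
            )
        else:
            seen_descriptions[description] = name
    return problems
```

Exact-match comparison will miss tools that overlap in meaning but differ in wording, so a check like this complements human review of semantic clarity rather than replacing it.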

Organizational design matters for the same reason. The source describes an empowered trio consisting of a product manager responsible for outcomes and risk posture, a forward-deployed engineer focused on schemas and scalability, and a designer responsible for conversational flows and recovery states. It also favors weekly evaluation reviews over demonstration-led progress. The underlying principle is shared ownership: platform reliability cannot be delegated entirely to model engineering when the decisive questions span product value, system behavior, permissions, and user comprehension.

The source’s proposed 30-day starter sequence moves from selecting one workflow and defining permissions, measures, and evaluations; through retrieval and a minimal tool set; to an instrumented internal pilot; and finally to hardening and a limited beta. The schedule is reported as a blueprint rather than independent proof of how long every implementation should take. Its more transferable lesson is dependency order: define the outcome and risk boundary before multiplying capabilities.

As agents begin coordinating across products, the durable advantage will come from platforms that preserve this discipline across every new connection. MCP can make capabilities composable, but dependable composition will still depend on controlled context, explicit authority, observable execution, and evidence that the workflow improves real work.

References
- Shivam.Consulting Blog – Mastering MCP: Battle-tested Playbooks from Miro, Atlassian, and What I’ve Learned

June 8, 2026
Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments (the Black Friday sales, the sporting events, the production incidents), and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business.

Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.

We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.

While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice.

We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.

“That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”

This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition; they reserve our ambition for the nuanced scaling challenges that matter.

The source of truth is the hard part. You can scale stateless web traffic by adding machines, and you can add queue consumers and cache. Those are real problems, just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.

That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.

Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it.

Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.

Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform, from vector search in our AI features to realtime reporting, and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work.

Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.

More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident.
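
The online reshaping sequence (dual-write, backfill, validate, then move customers behind a flag) can be sketched with in-memory dictionaries standing in for the old and new indexes. This is a minimal illustration under that assumption; it does not use the Elasticsearch client, and the class and method names are hypothetical.

```python
class MigratingIndex:
    """In-memory stand-in for an old index and its replacement."""

    def __init__(self):
        self.old = {}               # doc_id -> document
        self.new = {}               # doc_id -> document in the new layout
        self.read_from_new = set()  # customer IDs flagged onto the new index

    def write(self, doc_id, doc):
        # Dual-write: every write lands in both indexes during migration.
        self.old[doc_id] = doc
        self.new[doc_id] = doc

    def backfill(self):
        # Copy documents that predate dual-writing; setdefault leaves any
        # document already present in the new index untouched.
        for doc_id, doc in self.old.items():
            self.new.setdefault(doc_id, doc)

    def _docs_for(self, index, customer_id):
        return {k: v for k, v in index.items() if v["customer_id"] == customer_id}

    def validate(self, customer_id):
        # A customer is ready when both indexes hold identical documents.
        return (self._docs_for(self.old, customer_id)
                == self._docs_for(self.new, customer_id))

    def move_customer(self, customer_id):
        # Flip the per-customer flag only after validation passes.
        if not self.validate(customer_id):
            return False
        self.read_from_new.add(customer_id)
        return True

    def search(self, customer_id):
        index = self.new if customer_id in self.read_from_new else self.old
        return list(self._docs_for(index, customer_id).values())
```

Reads switch per customer, so a problem found after moving one customer can be undone by clearing that customer's flag while the old index still exists.
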
We’ve used this pattern for years to make large search migrations safer and more incremental — a core playbook in our platform scalability and SRE practices. Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants. We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks. The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed. Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. 
AI providers are not commodity storage systems, and we do not design as if they are. That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point.

For enterprise customers, this matters because AI support volume can spike just like human support volume, and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.

Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce.

Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.

“That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”

Sometimes, these events teach us something specific.
In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.

We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer, a habit that anchors our incident management, observability, and FinOps practices.

Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.

“That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”

Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?

This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric; it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release.
Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.

“Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”

The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.

Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime, and we do this in practice, not just as a goal.

This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.

“Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”

Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.

What this means for customers is simple: be skeptical of vague scale claims.
The question isn’t whether a vendor says they can scale; it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer.

Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system, and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

Inspired by this post on The Intercom Blog.

May 19, 2026

How to Scale Session Replay Without Sacrificing Privacy

You want session replay on more journeys because the blind spots are expensive. A funnel can show where users leave, but it cannot show whether they encountered a broken control, a confusing message, a layout shift, or an error that never reached your analytics. Replay can turn those behavioral signals into enough context to make a product decision.

The hard part is expanding that visibility without collecting data you should not have, degrading the experience you are trying to understand, or filling storage with recordings nobody will use. The answer is not a single masking setting. You need a capture contract, a delivery architecture, a sampling model, and an operating scorecard that treat performance, fidelity, and privacy as one system.

Set the capture contract before you expand coverage

Replay programs often begin with a coverage question: what percentage of sessions should you record? That is the wrong first question. Start with the decision you expect the recording to change. If nobody can name that decision, more coverage will create more cost and exposure without producing more insight.

Write a capture contract for each product surface. This is a short, reviewable specification that connects a business purpose to technical controls. It should answer:

What question is replay meant to answer? Examples include diagnosing failed activation, explaining an error spike, or finding friction in a conversion step.
Which routes, components, and user cohorts are in scope? Name them. Do not approve an undefined all-product rollout.
Which data is prohibited? Include form values, credentials, payment details, message content, health information, account-recovery data, and any product-specific sensitive fields that apply.
What consent state permits capture? The recorder should not initialize before the required state is known. Withdrawal should stop capture and prevent queued data from being sent.
Who can watch a replay? Define roles by purpose. Product discovery, support investigation, engineering diagnosis, and administration do not automatically require identical access.
How long will the data remain available? Tie retention to the stated purpose rather than keeping replay indefinitely because storage permits it.
What sampling rule applies? State the baseline rate, targeted cohorts, exclusions, temporary overrides, owner, and expiry condition.
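
A capture contract is easier to review when it exists as a typed record rather than a paragraph in a wiki. The sketch below is one possible shape, not a standard schema: every field name is an assumption chosen to mirror the questions above, and the sampling rule is reduced to a baseline rate, an owner, and an expiry date for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaptureContract:
    """Hypothetical contract record; field names are illustrative only."""
    surface: str
    question: str              # the decision replay is expected to inform
    routes: tuple              # named routes in scope
    prohibited_fields: tuple   # data that must never be collected
    required_consent: str      # consent state that permits capture
    viewer_roles: tuple        # roles allowed to watch replays
    retention_days: int
    baseline_sample_rate: float
    owner: str
    expires_on: date

    def problems(self, today):
        """Return reasons the contract is not ready for review (empty means ready)."""
        issues = []
        if not self.question.strip():
            issues.append("no question or decision named")
        if not self.routes or "*" in self.routes:
            issues.append("routes must be named explicitly")
        if not self.prohibited_fields:
            issues.append("prohibited data is not listed")
        if not self.viewer_roles:
            issues.append("no viewer roles defined")
        if self.retention_days <= 0:
            issues.append("retention must be a positive number of days")
        if not 0.0 <= self.baseline_sample_rate <= 1.0:
            issues.append("sample rate must be between 0 and 1")
        if self.expires_on <= today:
            issues.append("contract has expired")
        return issues
```

A check like `problems()` can run in code review or CI so that a wildcard route or a missing retention period blocks the rollout before any recorder configuration changes.
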

Selective capture, redaction, consent, retention, role-based access, and environment-aware sampling are separate controls. Treating one of them as a substitute for the others creates predictable gaps. Masking does not grant consent. Restricted access does not make excessive collection necessary. Short retention does not make an exposed credential harmless.

Apply those controls as close to collection as possible. A web replay is commonly reconstructed from serialized page state, changes, and interaction events. The privacy risk therefore sits in the data leaving the browser, not only in what the player later displays. A value hidden during playback may already exist in an outbound payload or stored record.

A useful default is to block text and input values, then allowlist only fields proven safe and necessary. Add route-level and component-level exclusions for sensitive surfaces. Use a separate, time-bounded approval for diagnostic capture that needs greater fidelity. I would reject a policy that merely says to mask personal information: the term depends on context, and engineers cannot reliably implement an undefined category.
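
The block-by-default rule can be expressed as a small function that runs before anything is serialized. This is a sketch in Python for readability; a real recorder would apply the same logic in the browser. The allowlist entries, route prefixes, and the fixed mask string are all hypothetical.

```python
# Hypothetical allowlist: only fields reviewed as safe and necessary.
ALLOWED_FIELDS = {"plan_tier", "country"}

# Hypothetical route prefixes that are excluded from capture entirely.
EXCLUDED_ROUTE_PREFIXES = ("/billing", "/account/recovery")

# A fixed-length mask avoids leaking the length of the original value.
MASK = "****"


def redact_value(route, field_name, value):
    """Return what may leave the browser, or None when nothing may be captured."""
    if route.startswith(EXCLUDED_ROUTE_PREFIXES):
        return None  # route-level exclusion wins over the allowlist
    if field_name in ALLOWED_FIELDS:
        return value
    return MASK  # default: block text and input values
```

The ordering matters: the route exclusion is checked first, so an allowlisted field name on a sensitive surface is still dropped.
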

Test the contract against the raw system, not just the player. Seed a non-production fixture page with recognizable test values, exercise every relevant component state, inspect the browser payload, inspect the stored representation, and verify that exports and downstream tools preserve the restriction. If a prohibited test value crosses the collection boundary, the control has failed even if the replay screen obscures it.
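
One way to automate that check is to seed the fixture page with recognizable canary values and scan every captured artifact for them. The sketch below assumes the payload has already been decoded into plain Python data; the canary strings are invented for illustration.

```python
import json

# Hypothetical seed values typed into a non-production fixture page.
CANARIES = ("CANARY-PASSWORD-7f3a", "CANARY-CARD-4111")


def find_leaks(payload, canaries=CANARIES):
    """Return the canary values that appear anywhere in a decoded payload."""
    serialized = json.dumps(payload, ensure_ascii=False)
    return [canary for canary in canaries if canary in serialized]
```

A plain substring scan will not see values that were compressed or encoded before capture, so run it on the decoded browser payload, the stored representation, and any export, not only on one of them.
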

Consent and retention obligations vary by jurisdiction, contract, and data type. Your privacy or legal owner must approve those rules for the markets you serve. Engineering can enforce an approved policy; it cannot infer that policy from a generic replay configuration.

Keep capture off the user’s critical path

Scalable replay starts in the browser, where your product competes with the recorder for main-thread time, memory, and bandwidth. A backend that can ingest billions of events does not help if the recorder makes an interaction sluggish or loses the DOM changes needed to explain the problem.

The delivery design should make page experience more important than recording completeness. Decoupled capture and delivery, adaptive batching, compression, backpressure controls, and priority handling provide the basic pattern:

Capture the minimum useful representation. Filter excluded nodes and values before serialization. Avoid collecting detail that no approved use case needs.
Separate recording from transport. The capture path should write to a bounded queue rather than waiting for a network request. Upload latency must not become interaction latency.
Batch adaptively. Small batches can reduce delay during quiet periods, while larger compressed batches can reduce request overhead during sustained activity. The policy should respond to queue pressure and network conditions.
Define backpressure behavior. When production exceeds delivery capacity, the recorder needs a documented degradation order. Preserve navigation, consent changes, errors, and the structural events required for reconstruction before lower-value detail. Never freeze the page to protect the replay.
Bound long sessions. Flush incrementally, cap memory use, and make reconnection behavior explicit. A queue that grows for the life of a tab will eventually turn a delivery problem into a page-performance problem.
Make partial data visible. Mark gaps, dropped segments, and incomplete uploads. A replay that silently appears complete is more dangerous than one that clearly communicates its limits.

Backpressure deserves special attention because it forces a product decision disguised as an implementation detail. If the system cannot retain everything, what must survive? The answer should come from the capture contract. An error marker without enough surrounding state may be useless, but exhaustive cursor movement may be expendable. Rank event classes before an incident forces the recorder to choose implicitly.
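
A bounded queue with an explicit degradation order might look like the sketch below. The priority ranking is an illustrative assumption that should come from the capture contract, and the `gap` marker is a hypothetical event type used to make dropped data visible to the player.

```python
from collections import deque

# Lower number means higher priority. This ranking is an assumption for illustration.
PRIORITY = {
    "consent": 0, "navigation": 0, "error": 0,
    "dom_snapshot": 1, "dom_mutation": 1,
    "input": 2, "cursor": 3,
}


class BoundedReplayQueue:
    """Never blocks the caller; evicts the lowest-value event when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def push(self, event):
        if len(self.events) >= self.capacity:
            # The oldest event among the lowest-priority class is the eviction candidate.
            victim = max(self.events, key=lambda e: PRIORITY[e["type"]])
            if PRIORITY[victim["type"]] <= PRIORITY[event["type"]]:
                self.dropped += 1  # incoming event is no more valuable: drop it
                return False
            self.events.remove(victim)
            self.dropped += 1
        self.events.append(event)
        return True

    def drain(self):
        """Return the queued batch, appending a gap marker if anything was dropped."""
        batch = list(self.events)
        self.events.clear()
        if self.dropped:
            batch.append({"type": "gap", "dropped": self.dropped})
            self.dropped = 0
        return batch
```

With a capacity of two, pushing two cursor events and then an error evicts the older cursor event, and the next drained batch carries a gap marker so the replay does not silently appear complete.
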

Do not validate the client only on a fast laptop and stable connection. Use representative complex pages and test replay on and off under CPU pressure, constrained networking, rapid DOM change, background-tab transitions, reconnection, and long sessions. Compare Web Vitals, long tasks, memory growth, bytes transferred, queue drops, upload completion, and playback completeness. Long sessions, traffic spikes, complex interactions, and variable networks are precisely where an apparently sound design reveals its failure modes.

There is no universal acceptable overhead that fits every product. Set budgets relative to your production baseline and the importance of the journey. A small regression on a frequently used mobile activation path may matter more than a larger regression on an internal administration page. Segment the results by route, browser, device class, network condition, and session length so averages do not hide the users most affected.

Sample for decisions, not for a warehouse of footage

A single global sample rate is easy to configure and hard to defend. It spends collection capacity uniformly even though product questions are not uniformly valuable. It can also miss rare failures while overrepresenting routine sessions that nobody will watch.

Use a portfolio of sampling modes:

Random baseline sampling gives you a less biased view of ordinary behavior and lets you notice problems you did not predefine.
Cohort sampling increases visibility for a defined population such as new users, a browser family, a release cohort, or users entering a critical journey.
Signal-based sampling concentrates diagnosis around errors, failed steps, rage clicks, dead clicks, abnormal exits, or other instrumented friction signals.
Temporary diagnostic sampling raises fidelity for a narrow incident or release window, with an owner and an automatic expiry condition.
Hard exclusions override every sampling mode. A high-value investigation is not permission to collect from a prohibited surface or consent state.
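
The sampling modes above can be combined in a single decision function. The sketch below is one possible ordering, not a prescribed one: the policy keys, the session fields, and the use of epoch seconds for `now` and `expires_at` are all assumptions. Hashing the session ID keeps the decision stable across page loads.

```python
import hashlib


def stable_fraction(key):
    """Map a string to a deterministic value in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the top 53 bits so the result is exactly representable and below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def should_record(session, policy, now):
    """Return (record, reason) for a session under a hypothetical policy dict."""
    # Hard exclusions override every sampling mode.
    if not session["consented"] or session["route"] in policy["excluded_routes"]:
        return False, "excluded"
    # The random baseline is checked first so it stays an unbiased subset.
    if stable_fraction(session["id"]) < policy["baseline_rate"]:
        return True, "baseline"
    # Signal-based sampling: any instrumented friction signal in the policy.
    if set(session.get("signals", ())) & policy["signals"]:
        return True, "signal"
    # Temporary diagnostic sampling stops by itself once it expires.
    diag = policy.get("diagnostic")
    if diag and now < diag["expires_at"] and session["route"] in diag["routes"]:
        return True, "diagnostic"
    # Cohort sampling uses a separate hash so it is independent of the baseline roll.
    cohort_rate = policy["cohort_rates"].get(session.get("cohort"), 0.0)
    if stable_fraction("cohort:" + session["id"]) < cohort_rate:
        return True, "cohort"
    return False, "not_sampled"
```

Returning the reason alongside the decision lets the replay library label each recording, which matters later when someone is tempted to estimate prevalence from targeted sessions.
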

Onboarding, activation, high-friction conversion flows, and paths with disproportionate revenue or trust impact are sensible places to begin because a clearer diagnosis can change a meaningful decision. Signals such as errors, rage clicks, dead clicks, scroll behavior, and stalled progress can then help you find the sessions worth examining.

Keep one statistical distinction clear. Targeted replay is good for explaining a known problem, but it cannot tell you how prevalent that problem is. If you record sessions because they contain an error, the resulting library will naturally make errors look common. Use analytics or a random baseline to measure frequency. Use replay to understand mechanism and context.
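
A short worked example shows how large the distortion can be. All numbers are hypothetical: 10,000 sessions, a true error rate of 2 percent, a 1 percent random baseline, and signal-based capture of every remaining error session.

```python
def replay_library_bias(total_sessions, error_sessions, baseline_rate):
    """Compare the error share of a targeted library with the baseline estimate."""
    baseline_recorded = total_sessions * baseline_rate
    baseline_errors = error_sessions * baseline_rate  # expected errors in the baseline
    targeted_recorded = error_sessions - baseline_errors  # error sessions not already sampled
    library_size = baseline_recorded + targeted_recorded
    return {
        "true_prevalence": error_sessions / total_sessions,
        "library_error_share": error_sessions / library_size,
        "baseline_estimate": baseline_errors / baseline_recorded,
    }


result = replay_library_bias(total_sessions=10_000, error_sessions=200, baseline_rate=0.01)
```

Here the library holds 298 recordings, 200 of which contain an error, so roughly two thirds of what a reviewer sees is an error session even though the true rate is 2 percent. The baseline subset alone recovers the 2 percent figure.
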

A disciplined investigation looks like this:

Find a measurable change in a funnel, cohort, error rate, performance signal, or support pattern.
Define the affected population before opening replays.
Review a deliberately selected set of relevant sessions and record recurring observable behaviors, not interpretations of user intent.
Turn those observations into a falsifiable product or technical hypothesis.
Instrument, release, or experiment so the hypothesis can be measured outside the replay player.

This prevents two common mistakes: browsing memorable sessions until a story feels true, and treating one vivid recording as evidence of market-wide demand. Replay is strongest when it explains a quantitative signal and leads back to a measurable change.

Run replay with a coupled performance, privacy, and value scorecard

Session replay is not finished when playback works. It is an operating capability with client releases, configuration changes, storage growth, access decisions, and incident risk. Give it an owner and review the system across five dimensions.

Dimension	Signals to watch	Decision the signals should trigger
User experience	Web Vitals, long tasks, main-thread work, memory growth, and replay bytes	Reduce capture detail, change delivery behavior, narrow coverage, or halt a rollout when the recorder breaks its budget
Replay fidelity	Queue drops, missing segments, incomplete uploads, event integrity, and playback reconstruction errors	Fix prioritization or transport before teams rely on incomplete recordings for decisions
Platform reliability	Ingestion failures, processing delay, retrieval latency, playback-start failures, and behavior during traffic spikes	Add capacity, repair a failing stage, or adjust sampling without shifting the problem into the browser
Privacy and governance	Redaction test failures, capture outside approved consent states, retention exceptions, and access outside approved roles	Disable affected capture, contain the data, follow the approved deletion or incident process, and repair the control before restoring it
Decision value	Investigations that reached a useful replay, time to diagnosis, time to resolution, and product hypotheses validated outside replay	Move coverage toward high-value use cases or retire collection that produces no action

These dimensions constrain each other. Aggressive compression may improve bandwidth while hurting reconstruction. More capture may improve fidelity while violating the page budget. Narrow access may improve governance while blocking the support engineers responsible for incident response. The job is not to maximize any single metric; it is to keep the entire system inside approved boundaries.

Version capture configuration like production code. A seemingly harmless selector change can expose text, remove necessary context, or increase mutation volume. Test recorder and configuration releases against fixture pages containing known sensitive values and known reconstructable interactions. Keep a rollback path.

Prepare shutdown controls before launch. You should be able to stop capture for a component, route, environment, tenant group, or the whole product without waiting for a new application release. Document who can use each control, how queued data is handled, how affected stored data is identified, and when privacy, security, support, and engineering must be involved. If collection crosses a prohibited boundary, continuing to record while the team debates ownership compounds the exposure.
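
Scoped shutdown can be as simple as a remotely fetched switch set that the recorder consults before it captures anything. The sketch below assumes a hypothetical configuration shape with one set per scope; how the configuration is delivered and who may change it are outside the sketch.

```python
# Scope name in the switch config paired with the matching key in the capture context.
SCOPES = (
    ("environments", "environment"),
    ("tenant_groups", "tenant_group"),
    ("routes", "route"),
    ("components", "component"),
)


def capture_allowed(context, switches):
    """Return (allowed, reason); any matching switch stops capture."""
    if switches.get("global"):
        return False, "global"
    for scope, key in SCOPES:
        if context.get(key) in switches.get(scope, set()):
            return False, scope
    return True, None
```

Because the switch set is data rather than code, stopping capture for one route or tenant group does not require a new application release.
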

Finally, connect replay operations to the workflows that consume it. Product teams need links from behavioral cohorts to relevant sessions. Support needs controlled escalation paths. Engineering and SRE need errors, network signals, layout shifts, and performance context close to the replay timeline. Connecting interaction context to observability and delivery workflows can shorten the path from an anomaly to a testable explanation, but only if the data remains trustworthy and accessible to the right roles.

Key takeaways

Approve a capture contract for each surface before approving a broader sample rate.
Redact or exclude sensitive data before it leaves the browser; a masked player is not enough.
Protect the page with decoupled delivery, bounded queues, adaptive batching, and explicit backpressure priorities.
Keep random sampling for prevalence and use targeted sampling to explain known signals.
Operate performance, fidelity, platform reliability, privacy, and decision value as a coupled scorecard.
Require scoped shutdown controls, retention handling, access ownership, and rollback before production expansion.

Before you increase replay coverage, ask for two artifacts: a one-page capture contract for the next journey and a replay-on versus replay-off test under that journey’s difficult conditions. If the team cannot show what is allowed to leave the browser, how the page stays within budget, and which decision the recordings will change, the rollout is not ready to scale.

May 7, 2026

From 70 Employees to Dominance: My Playbook for Hypergrowth, Focus, and Top-Down Goals

Scaling a real-world marketplace from scrappy to dominant takes a different kind of product leadership. Reflecting on Christopher Payne’s decade leading DoorDash as President and COO — growing from roughly 70 employees to the dominant food delivery platform in the US — I’m struck by how much of that success hinged on mastering an atoms-based business while still operating with software-level rigor. As a VP of Product Management, I see the same patterns in my own work: relentless clarity on inputs, a bias for builder-executives, and a cadence that keeps leaders close to product details without becoming bottlenecks.

Running an atoms-based business versus a pure software company forces you to obsess over operational physics: unit economics, quality control, on-time reliability, and dense local liquidity. It’s precisely where traditional “bits” executives can stumble. What’s worked for me is a simple “plate spinning” framework for executive attention: identify the five or six plates that must never stop — customer experience, marketplace health, quality and safety, product velocity, platform reliability, and P&L — then schedule recurring deep dives to keep those plates spinning. If a plate wobbles, I drop in, fix the root cause, re-instrument the inputs, and zoom back out.

Hiring at hypergrowth speed only works when you bias toward a “builder mentality.” I look for executives who run toward fuzzy problems, write clearly, and can prove they’ve shipped value with incomplete information. Prior industry experience can be a liability when you’re reinventing the market; first-principles thinkers outlearn domain experts who try to port yesterday’s playbooks. In executive hiring, I’ve found structured work samples and narrative memos far more predictive than marathon interview loops — companies routinely spend too much time on job interviews and too little time evaluating how candidates think and execute.

Great executives never outgrow the details. Staying close doesn’t mean micromanaging — it means sampling the customer journey and instrumenting the system so you can feel where it hurts. In my own practice, I rotate through frontline touchpoints weekly: support transcripts, NPS verbatims, failed checkout sessions, and reliability dashboards. Small signals often reveal systemic issues. A single ciabatta bread moment — the kind of edge-case substitution that seems trivial — can expose broken handoffs, unclear policies, and misaligned incentives across the marketplace.

Top-down goal setting beats bottom-up when you’re aiming for category leadership. Bottom-up targets tend to regress to comfort; they calibrate to today’s constraints, not tomorrow’s possibilities. I set ambitious, top-down outcomes (not output), frame the non-negotiables, and map driver trees to clarify the input metrics that matter. Then I ask empowered product teams to pressure-test the plan, propose approaches, and own the how. This preserves ambition while unlocking creativity: a practical balance of clarity and autonomy that outcome-focused OKRs were designed to achieve.

One-size-fits-all management is a myth. Early-stage teams need hands-on coaching and fast decisions; later-stage teams need mechanisms that scale: crisp PRDs, pre-mortems, and operating cadences that separate strategy, planning, and execution. The mark of a high-functioning executive team is not uniform style — it’s high candor, fast escalation paths, and visible commitment after debate. In tough moments, a little charisma goes a long way; in practice, that’s not theatrics, it’s steady optimism, simple language, and consistent follow-through that keeps people moving forward.

The hypergrowth skill stack for executives is surprisingly learnable: ruthless prioritization under uncertainty, narrative writing that aligns cross-functionally, structured delegation with clear “inspection points,” and a weekly rhythm that protects maker time. I leverage a cadence of business reviews (inputs > outputs), customer-scent checks, and decision logs so we can move fast without losing the thread. CEO and executive time management is the ultimate forcing function — if we can’t show where our attention maps to goals, the team won’t either.

Some of my enduring lessons echo the best of Amazon and eBay: customer obsession beats competitor obsession, input metrics beat lagging vanity metrics, and simple mechanisms beat heroics. From Jeff Bezos’s playbook I borrow the insistence on written narratives, single-threaded ownership, and clarity on what will not change. Those principles remain the backbone of platform scalability and resilient product strategy, especially when markets get noisy.

AI is about to flatten organizations. With agentic AI, retrieval-first pipelines, and AI workflows embedded into product development, managers can widen their span without losing fidelity. I see LLMs for product managers accelerating discovery, PRD drafting, and experiment analysis — while raising the bar on decision quality. The implication for leadership: fewer layers, more transparency, and even greater pressure to define sharp, top-down outcomes that teams can autonomously pursue.

If I had to compress this into a playbook, it’s this: set audacious, top-down goals; keep your “plate spinning” calendar sacred; write more than you talk; hire builders, not resume archetypes; sample the customer journey every week; and build mechanisms that make the right thing easier than the heroic thing. That’s how you scale product management leadership from dozens to thousands — in atoms, in bits, and in the messy, exhilarating space where they meet.

April 17, 2026

How to Build a Trusted AI Product Platform That Scales

Your teams have AI pilots that work in a demo. Then the questions start. Security wants to know what data the system can reach. Product wants to know whether the answers are dependable. Support wants a fallback when the model fails. Executives want evidence that the investment is changing a customer or business outcome.

You do not need another impressive model response. You need a product platform that makes AI behavior understandable, controllable, and repeatable across use cases. That requires a trust architecture, a path from prototype to production, and metrics that expose failure instead of averaging it away.

Trust fails where an AI output crosses a decision boundary

Most teams discuss AI trust as if it were a property of the model. It is better understood as a property of the whole product system. A capable model can still create an untrustworthy experience if it uses the wrong context, hides a consequential assumption, calls an unauthorized tool, or leaves the user unable to correct an action.

The important moment is the handoff from generation to decision. Before that handoff, the output is a possibility. After it, someone may use it to answer a customer, change a record, prioritize work, or trigger another system. The controls you need depend on what crosses that boundary.

A practical way to classify AI use cases is by the authority you give the system:

Inform: The system summarizes, explains, retrieves, or drafts. A person still interprets the result.
Recommend: The system ranks options or proposes a next action. Its framing can materially influence a decision.
Act: The system invokes tools, changes state, communicates externally, or starts a workflow.

Use mode	Primary trust failure	Required product control	Evidence needed before release
Inform	An incorrect, incomplete, or untraceable answer	Visible scope, supporting evidence, uncertainty, and an easy correction path	An evaluator can reproduce the evidence path and identify known limitations
Recommend	A hidden assumption, weak comparison, or recommendation that ignores the user’s constraints	Explicit assumptions, alternatives, decision criteria, and user-editable constraints	Representative cases show whether the recommendation applies the intended rubric
Act	An unauthorized, excessive, or difficult-to-reverse change	Least-privilege access, previews, confirmation, audit records, and reversal where the underlying system supports it	Authorized reviewers validate simulated actions, denied actions, failure recovery, and a limited production path

This classification prevents a common planning error: giving every AI feature the same review process. A summarizer and an autonomous account-management agent should not pass through identical gates. The second system needs stronger identity, permission, confirmation, and recovery controls because its mistakes can propagate beyond the conversation.
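
The classification and its release gates can be encoded so that review tooling applies the right checklist automatically. The sketch below is a simplified reading of the table above: the use-case fields and control names are hypothetical labels, and the controls are treated as per-mode rather than cumulative.

```python
from enum import Enum


class Authority(Enum):
    INFORM = "inform"
    RECOMMEND = "recommend"
    ACT = "act"


# Control names are shorthand for the table's "Required product control" column.
REQUIRED_CONTROLS = {
    Authority.INFORM: {"visible_scope", "supporting_evidence", "uncertainty", "correction_path"},
    Authority.RECOMMEND: {"explicit_assumptions", "alternatives", "decision_criteria",
                          "editable_constraints"},
    # Reversal applies where the underlying system supports it.
    Authority.ACT: {"least_privilege", "preview", "confirmation", "audit_record", "reversal"},
}


def classify(use_case):
    """Pick the highest authority the use case is granted."""
    if (use_case.get("invokes_tools") or use_case.get("changes_state")
            or use_case.get("communicates_externally")):
        return Authority.ACT
    if use_case.get("ranks_or_proposes"):
        return Authority.RECOMMEND
    return Authority.INFORM


def missing_controls(use_case):
    """Return the required controls the use case has not implemented yet."""
    return REQUIRED_CONTROLS[classify(use_case)] - set(use_case.get("controls", ()))
```

A summarizer with no tool access lands in `INFORM`, while anything that changes state is held to the `ACT` checklist regardless of how it is described in the proposal.
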

For each proposed use case, ask five questions before discussing a model:


April 15, 2026

Commercial vs. Internal Products: Hard Truths, High Leverage, and How I Make the Call

Internal Products Are Hard; Commercial Products Are Harder. That line captures years of hard-won lessons from leading both internal platforms and market-facing SaaS at HighLevel. I’ve seen how the two demand different muscles—even when the tech stack, talent, and timelines look the same on paper.

When I talk about internal products, I mean services and solutions that our own employees use to take care of customers—customer-enabling tools and services, agent consoles, fulfillment and billing workflows, operations dashboards, and the underlying platforms that keep them fast, compliant, and resilient. These tools don’t generate revenue directly, but they quietly determine customer experience, gross margin, and how quickly we can ship, resolve issues, and scale.

Commercial products, by contrast, add a second layer of challenge. Beyond discovery, usability, and reliability, we must conquer positioning, pricing and packaging, competitive differentiation, sales enablement, procurement hurdles, and an ongoing customer success motion. The surface area for failure is bigger, and the time-to-signal on product-market fit is slower and noisier.

Here’s how I decide where to invest. First, I anchor on outcomes, not output. If the business priority is net revenue retention, faster onboarding, or reduced cost-to-serve, internal products often provide the highest-leverage path. If the priority is new revenue, new market entry, or a must-have differentiator, we lean commercial. I make the trade-off explicit in outcome-focused OKRs so we can defend the decision when pressure mounts.

Second, I run a clear build vs buy calculus. For internal needs, the default is buy if a mature, configurable solution exists that meets our security, data governance, and integration requirements. I only build when the workflow is core to our differentiation, the TCO of customization is lower than vendor sprawl, or we can capture unique proprietary advantage. For commercial products, I avoid embedding third-party IP in a way that caps differentiation or compresses margins as we scale.

Third, I insist on continuous discovery. Internal audiences are not a captive market—they’re discerning experts with real jobs to do. I treat them like customers, with structured customer interviews, journey mapping, and opportunity solution trees. I rely on empowered product teams and product trios to validate problems and reduce solution risk before we commit engineering time.

Fourth, I frame commercial vs internal work with capacity guardrails. In most planning cycles, I reserve explicit allocation for platform scalability and internal tooling, separate from feature bets. Without this, internal products become backlog filler, which guarantees we’ll pay the interest later in churn, SLA breaches, and slower delivery.

Execution differs too. For internal products, change management is the make-or-break. I plan enablement as a first-class deliverable: clear rollouts, in-app guides, training, and feedback loops with frontline champions. I track adoption, time-to-resolution, error rate, and satisfaction for internal users with the same rigor we apply to external users.

For commercial products, I design the discovery-to-GTM handshake early. Pricing and packaging must reflect value drivers discovered in research, not what’s easiest to meter. Sales and solutions engineering need crisp narratives, objection handling, and proof points. Customer success needs activation plans and health signals tied directly to leading indicators of retention.

Across both, I instrument the product and process. I lean on feature flags and progressive delivery to manage risk, and I protect SLOs with error budgets so teams balance reliability with iteration speed. CI/CD isn’t a badge—it’s how we earn the right to ship continuously without eroding trust.
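To make the error-budget idea concrete, here is a minimal Python sketch. It assumes a request-based availability SLO; the function names, the 99.9% target, the 25% release threshold, and the traffic numbers are all illustrative assumptions, not figures from any real system.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship_risky_change(remaining: float, min_remaining: float = 0.25) -> bool:
    """Hypothetical release gate: allow risky rollouts only while enough budget is left."""
    return remaining >= min_remaining


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {remaining:.0%}, ship risky change: {can_ship_risky_change(remaining)}")
```

A gate like this is one way to turn the reliability-versus-speed balance into a rule the team agrees on in advance, rather than a debate during an incident.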

Common pitfalls recur. Teams skip UX for employee tools because “they have to use it”—which backfires as shadow workflows and rework. Leaders underfund internal platforms, then wonder why velocity stalls. On the commercial side, teams over-index on features and under-invest in positioning and onboarding, leading to poor activation and elongated sales cycles.

What’s the payoff? When we treat internal products as products, we unlock scale: shorter handling times, fewer escalations, clearer accountability, and higher customer satisfaction. When we approach commercial products with the same discovery rigor plus smart GTM, we compress time-to-value and amplify differentiation. The craft is knowing which lever to pull when—and having the discipline to measure what matters.

My rule of thumb is simple. If the goal is operational excellence that compounds across the entire customer journey, invest in internal products with the same intensity you reserve for revenue-generating features. If the goal is market expansion or category leadership, invest in commercial products with a tight discovery-to-GTM loop. In either case, clarity of outcomes, disciplined discovery, and empowered teams win the day.

Inspired by this post on SVPG.

April 9, 2026
Never Stop Disrupting: Why the Fin API Platform Signals a New Era for Agentic AI

Disruption is the only sustainable strategy in product. When a platform meaningfully changes how we build and operate, I pay attention, not just as a product leader but as someone accountable for turning AI strategy into durable competitive differentiation. That’s why the launch of the Fin API platform stands out: it’s a concrete step toward agentic AI at enterprise scale.

Today, I’m diving into what this launch includes, why it matters for product strategy, and how I’d navigate the build vs buy decision in this new landscape. My goal is to translate the announcement into actionable guidance for product teams, CX leaders, and forward-deployed engineers who are building the next generation of customer support and product-led experiences.

Fin is a customer agent platform that, according to the announcement, currently resolves more than 2 million customer issues a week and is growing rapidly. It is used by brands large and small across many verticals, from Atlassian and Riot Games to smaller upstarts like Mercury and Polymarket. It runs on a family of models trained by its AI group. Last week, the company announced Apex, which it describes as the world’s first specialized customer service LLM. It reports that in production tests over the last six months, Apex beat every frontier model, including those from Anthropic and OpenAI, on resolution rate, latency, hallucination rate, and cost.

With this launch, teams can access the platform’s core capabilities and underlying models directly via API, with contracts starting at $250k per year and usage rates that the company says are the lowest in the industry for each of the models’ subcategories. For leaders evaluating total cost of ownership, this is a meaningful data point: it shifts the economics of scaled automation from experimental to operational.

Why now? Because builders want options. I hear from teams daily that want to design their own agents, tune prompts and policies, and integrate with bespoke CRMs, data lakes, and product surfaces. The Fin announcement meets that demand with three clear build-paths, each mapping to a different operating model and maturity stage.

First, for the vast majority of companies, the Fin Agent Platform is the pragmatic starting point. Fin reports roughly 8,000 companies on it today. By Fin’s account, it addresses 99% of customer needs out of the box, without exhausting consulting engagements, while delivering top-tier resolution rates. If your priority is time-to-value, governance, and platform scalability, this route de-risks implementation and accelerates outcomes.

Second, for teams that need custom surfaces or channels, the Fin Agent API lets you present Fin in unique contexts. You get the Fin platform’s orchestration and controls, but you’re free to bypass the default messenger, email, voice, or any prebuilt channel and embed the agent natively in your product. I see this as the sweet spot for product-led growth motions where conversation design and UX writing are strategic levers.

Third, for companies building hyper-specific agents—think service plus in-product actions—the new API access to Apex and the broader collection of models is the obvious move. Unlike generalized models, these are purpose-trained for customer service scenarios and operational policies. If you have strong in-house solutions engineering, a retrieval-first pipeline, and eval-driven development in place, this path maximizes control without reinventing the model layer.

This also opens the door for vertical specialists. Fin-like businesses focused on deep domains can emerge quickly—Fin for dentists? Why not? Fin for car dealerships? Sure. I expect startups and modern CX providers (including players like Decagon and Sierra) to carve out niches where domain data, workflows, and compliance are the real moats. That’s where differentiated AI beats generic capability.

There’s a defensive reason to pay attention here. The software landscape is shifting fast: the moat is no longer feature parity but the quality of your agents and the data flywheels powering them. Building software is simply easier now, and I’ve watched engineering teams more than double measurable productivity as they adopt AI-assisted development. The implication is clear: the interface-and-features era is giving way to an agents-and-outcomes era.

Serious software companies must evolve from being a features company to an agents company—and build those agents on differentiated AI. More value will accrue at the model and orchestration layers, where safety, latency, cost, and resolution quality are won. That puts a premium on prompt engineering discipline, policy routing, continuous discovery of edge cases, and rigorous offline/online evals to keep hallucination rates low while maintaining speed.

How would I choose among the three build-paths? If you’re early or resource-constrained, start with the Fin Agent Platform to validate outcomes and align stakeholders. If you need branded experiences and tighter product integration, use the Fin Agent API to control surfaces without owning the heavy lifting. If you have strong MLOps and a mature customer support AI strategy, go model-level with Apex and its companion models, layering in your own guardrails, context window management, and test harnesses. In each case, balance velocity, control, and risk: your build vs buy decision should be grounded in clear metrics and an explicit product strategy.

Where does this lead? We’ll see more companies expose specialized model families with clearer economics and stronger governance. For now, I’m excited to see what teams build with the Fin API platform—and how they turn agentic AI into measurable improvements in resolution rate, CSAT, cost-to-serve, and ultimately, customer loyalty.

Inspired by this post on The Intercom Blog.

April 3, 2026
How We Built PR Review Bots In‑House for a Fraction of the Cost—and How You Can Too

PR review bots are all the rage, but they come at a premium. We built our own cheaply, and they work just as well, if not better. Here’s how.

As a VP of Product Management, I care deeply about the velocity and quality of our software delivery. The decision to build our own pull request (PR) review agents came from a simple calculus: we needed tighter control over developer experience, CI/CD integration, and cost—without sacrificing accuracy or reliability. The result was a pragmatic system that accelerates reviews, improves code quality, and pays for itself through faster feedback loops.

Before we wrote a line of code, we defined success. Our objectives were to shorten review cycles, reduce back-and-forth on style and test coverage, and surface risks earlier—measured against DORA metrics like lead time and deployment frequency. That focus aligned the team, guided our build vs buy decision, and anchored scope to the highest-impact use cases.

We started rules-first, AI-optional. The initial release enforced guardrails that are universally valuable: linting and formatting checks, required test coverage thresholds, commit message standards, ownership validation (CODEOWNERS), and basic security scans. These automated gates eliminated predictable review friction, freeing engineers to focus on logic and architecture rather than style debates.
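As a rough illustration of what a rules-first gate can look like, the Python sketch below implements two of the checks named above: a commit message standard and a coverage threshold. The commit convention, the 80% threshold, and every name in the code are assumptions made for the example, not a description of the system we actually run.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: "<type>: <summary>", with a summary of 1 to 72 characters.
COMMIT_PATTERN = re.compile(r"(feat|fix|docs|refactor|test|chore): .{1,72}")


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_commit_message(subject: str) -> CheckResult:
    """Pass only if the commit subject follows the assumed convention."""
    ok = COMMIT_PATTERN.fullmatch(subject) is not None
    detail = "ok" if ok else "expected '<type>: <summary>' with a summary of at most 72 characters"
    return CheckResult("commit-message", ok, detail)


def check_coverage(coverage_pct: float, threshold_pct: float = 80.0) -> CheckResult:
    """Pass only if measured coverage meets the (assumed) threshold."""
    ok = coverage_pct >= threshold_pct
    return CheckResult("coverage", ok, f"{coverage_pct:.1f}% measured, {threshold_pct:.1f}% required")


def run_gates(subject: str, coverage_pct: float) -> list[CheckResult]:
    """Run every gate and return all results so the bot can report them together."""
    return [check_commit_message(subject), check_coverage(coverage_pct)]


if __name__ == "__main__":
    for result in run_gates("fix: handle empty diff", 83.4):
        print(f"{result.name}: {'PASS' if result.passed else 'FAIL'} ({result.detail})")
```

Returning every result, rather than stopping at the first failure, lets the bot post one complete status per PR instead of making engineers fix problems one push at a time.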

Then we layered intelligence where it mattered. We added lightweight, explainable checks for common code smells and dependency risks, plus optional natural-language summaries that turn large diffs into concise context. Where appropriate, we introduced agentic AI workflows to triage PRs by risk, draft review comments, and suggest missing tests—always keeping humans in the loop. This hybrid approach kept costs low and outcomes high.

Integration with our CI/CD pipeline was non-negotiable. We wired GitHub/GitLab webhooks to a stateless service that queued work, executed checks in containerized workers, and posted results back as status checks and review comments. Caching, parallelization, and smart diff-scoping ensured we only computed what changed, keeping the experience snappy even on large repos.
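The diff-scoping and caching ideas can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the pattern-to-check mapping, the check names, and the in-memory cache are hypothetical, and a production service would more likely use a shared cache and real check runners.

```python
import hashlib
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical mapping from file-name patterns to the checks they trigger.
CHECKS_BY_PATTERN = {
    "*.py": ["lint", "unit-tests"],
    "requirements*.txt": ["dependency-scan"],
    "*.md": ["docs-links"],
}


def checks_for_diff(changed_paths: list[str]) -> list[str]:
    """Select only the checks relevant to the files touched by a PR."""
    selected: set[str] = set()
    for path in changed_paths:
        name = PurePosixPath(path).name  # match on the file name, not the directory
        for pattern, checks in CHECKS_BY_PATTERN.items():
            if fnmatchcase(name, pattern):
                selected.update(checks)
    return sorted(selected)


class ResultCache:
    """In-memory cache keyed by check name plus file content hash."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_run(self, check: str, content: bytes, runner: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(check.encode() + b"\0" + content).hexdigest()
        if key not in self._store:
            self._store[key] = runner(content)  # compute only for unseen content
        return self._store[key]


if __name__ == "__main__":
    print(checks_for_diff(["src/app.py", "README.md"]))
```

Keying the cache on content rather than on commit or branch means an unchanged file is never re-checked, even when it reappears in a rebased or force-pushed PR.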

Adoption hinged on developer experience. We made the bot’s feedback fast, specific, and actionable, with clear remediation steps and links to documentation. Feature flags allowed teams to opt into new checks gradually. ChatOps commands enabled quick overrides for emergencies, while policy-as-code kept rules visible, versioned, and auditable.
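One common way to roll a new check out gradually is deterministic percentage bucketing, sketched below in Python. This is an assumption about how such a flag could work, not a description of our flagging system; the flag and team names are made up for the example.

```python
import hashlib


def in_rollout(flag: str, team: str, percent: int) -> bool:
    """Deterministically decide whether a team is inside a percentage rollout."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the flag and team together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{team}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical flag and team names.
    for team in ["payments", "search", "mobile"]:
        print(team, in_rollout("suggest-missing-tests", team, 50))
```

Because the bucket depends only on the flag and team names, a team that is included at 30% stays included as the percentage rises, so nobody sees a check appear and then disappear.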

We treated this like any product: eval-driven development for accuracy, ongoing telemetry for false-positive rates, and explicit SLAs for response times. We instrumented outcomes end-to-end—tracking PR cycle time, comment-to-merge ratios, and rework—so we could prove the ROI and tune the system without guesswork.

The outcome: a reliable PR review companion that runs on a shoestring budget, integrates cleanly with our workflows, and measurably improves engineering throughput. If you’re weighing build vs buy, start small with rules that deliver immediate value, then layer intelligence where it earns its keep. With a clear product strategy, you can stand up capable PR review bots quickly—and scale them as your needs grow.

If you’re ready to try this yourself, begin with your top three friction points in code reviews, wire them into your CI/CD checks, and pilot with a single team. Iterate weekly, measure relentlessly, and let your developers be your strongest signal. You’ll be surprised how far a pragmatic, product-led approach can take you.

Inspired by this post on Amplitude – Perspectives.

March 27, 2026
Unlocking Impact: What Amplitude’s MCP server and experimentation platform teach product leaders

In my role leading product management at HighLevel, I study the architectures and operating models behind high-velocity learning. I often reference Amplitude’s MCP server and its experimentation platform as a benchmark for how to operationalize scale, reliability, and speed of insight across complex product ecosystems. That lens informs how I design processes, data flows, and decision loops that turn ambiguity into measurable outcomes.

Experimentation is the heartbeat of eval-driven development. In practice, that means running disciplined A/B testing, deploying targeted feature flags to de-risk rollouts, and sizing experiments with a clear minimum detectable effect (MDE) so we avoid vanity wins. When teams internalize these habits, we shift from opinion-led debates to evidence-led decisions—and that’s where product-led growth compounds.
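Sizing an experiment against an MDE can be done with the standard normal-approximation formula for comparing two proportions. The Python sketch below uses only the standard library; the 10% baseline, the 2-point absolute MDE, and the default significance and power values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided, two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0) or mde_abs == 0:
        raise ValueError("rates must lie strictly between 0 and 1, with a non-zero MDE")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


if __name__ == "__main__":
    # Detecting a lift from 10% to 12% conversion.
    print(sample_size_per_arm(0.10, 0.02))
```

Halving the MDE roughly quadruples the required sample, which is why agreeing on the smallest effect worth acting on before launch matters more than any post hoc analysis.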

I'm an AI enthusiast, so I think a lot about how experimentation accelerates AI roadmaps. The same rigor that validates UI changes should govern prompts, retrieval strategies, and policy settings for LLM-backed features. By treating AI behaviors as first-class experiment surfaces—and tying them to user activation, retention analysis, and value proposition metrics—we move faster without compromising safety, privacy-by-design, or customer trust.

Making this work in production demands clean instrumentation and a unified analytics platform. I look for stacks that combine Amplitude analytics with robust observability and CI/CD to ensure we can ship, measure, and iterate continuously. When platform scalability and data governance are baked in from the start, product trios can focus on product discovery rather than firefighting pipelines or reconciling metrics.

My playbook is straightforward: define decision-worthy questions, map them to crisp success metrics, run right-sized experiments with feature flags, and use consistent analytics to close the loop. Do this well, and you create a durable advantage—faster learning cycles, sharper product positioning, and a culture that lives by outcomes over output. That’s the real lesson I take from platforms that execute experimentation at scale: process and technology are table stakes; what wins is the discipline to learn relentlessly.

Inspired by this post on Amplitude – Perspectives.

March 27, 2026
From Engineer to CEO: Hard-Won Lessons on GTM, Cloud-First Bets, and Must-Do Focus

Making the leap from engineer to CEO demands an almost entirely new skillset. I’ve felt that jolt firsthand: the tools that serve you as an IC or even a product leader—system design, crisp PRDs, elegant roadmaps—only get you about 20% of the way. The rest is learning to orchestrate go-to-market strategy, finance, hiring, culture, and product positioning with just enough depth to make sound, fast decisions while empowering true experts to execute.

My operating heuristic is the 80% rule. As CEO or GM, I don’t need to be the best marketer, seller, or finance leader; I need to understand 80% of each function well enough to set a compelling product strategy, ask the right questions, and catch the second-order effects. That breadth unlocks speed, quality of judgment, and the conviction to say no when the organization is tempted by what it can do rather than what it must do.

The clearest illustration comes from the journey that turned Apache Kafka—originally built at LinkedIn—into Confluent, a publicly traded enterprise software company. The technical insight was powerful, but the real lift came from translating that insight into a repeatable go-to-market engine. That required building new muscles: founder-led GTM, enterprise sales orchestration, and open source monetization without alienating the community that fueled adoption.

Early on, the product was “embarrassing” by enterprise standards—thin features, sharp edges, and a long tail of operational gaps. Shipping anyway was the point. A thin vertical slice into the market created learning loops with real customers, not hypotheticals. That uncomfortable speed became a superpower, especially when the company decided to push toward a cloud-first business in the face of widespread opposition.

The messaging challenge was just as hard as the technical one. Most marketing fails because it starts with what we built, not what customers must achieve. A simple product marketing pyramid—vision at the top, category framing and points of parity in the middle, crisp value props and proof at the base—helped explain Kafka to the world in customer language. When the narrative snaps into place, adoption accelerates. In Kafka’s case, one well-timed blog post clarified the “why now” and unlocked a step-change in community and enterprise pull.

There’s a pivotal distinction leaders underestimate: the gap between what a company can do and what it must do. I use a must-do filter before every planning cycle: What moves are non-discretionary for durable product-market fit? For Kafka and Confluent, that meant ruthless prioritization on managed cloud services, reliability, and platform scalability—even when it jeopardized short-term revenue or required retooling how engineering, sales, and support worked.

Fundraising strategy mirrored this clarity. Planning to raise before building the full product wasn’t about hype; it was about matching capital to the physics of the problem. If your category requires enterprise credibility, global infrastructure, and 24/7 SRE, you finance those table stakes early. That’s first principles decision making: instrument the constraints, then design the sequence that gets you to scale with the fewest irreversible mistakes.

In the early years, every product decision felt like a trade between polish and learning. The team essentially bludgeoned its way into a cloud-first posture, less because the initial product was ready and more because the market’s must-do was obvious. That’s the essence of founder-led GTM: get into the field, close lighthouse customers, and use their arcs to shape the roadmap. It’s also where open source monetization matures from downloads into durable enterprise value.

As the organization scales, excellence often erodes—the Chipotle problem. Process hardens; quality blurs; the magic decays. The antidotes are simple but hard: a few non-negotiable product quality bars, a short set of product-market fit metrics that everyone can recite, and empowered product teams who own outcomes over output. This is where organizational development matters as much as code: design clear interfaces between product, sales, and success, and you’ll keep velocity without losing standards.

Contrary to popular lore, founder optimism is overrated. Constructive realism wins. I try to model “probabilistic optimism”: assume we will win, but instrument the journey like an SRE runs an incident. Set leading indicators, rehearse failure modes, and make pre-commitments to the must-do path so you’re not swayed by the latest anecdote. It keeps the team out of a failure mindset while making room for rigorous course correction.

Giving up the right things at the right time is a CEO superpower. As complexity grows, I hand off decisions that benefit from specialization and keep only those tied to company narrative, must-do prioritization, and talent bar. CEO time management becomes a portfolio problem: ensure each week contains deep product time, frontline customer exposure, and one compounding systems fix (hiring loop, pricing rubric, or GTM enablement) that pays back for quarters.

If you’re moving from IC or PM into a GM/CEO role, here’s a practical playbook: build your product marketing pyramid; write the one-page must-do memo for the next six quarters; ship a narrow, managed cloud slice early; pick three product-market fit metrics (usage, time-to-value, retention) and publish them company-wide; and architect an enablement engine that turns field learnings into roadmap changes within one quarter. That’s how you transform technical advantage into a category-defining business.

The Kafka-to-Confluent arc reminds me that technology can open a door—but clarity of narrative, sequencing, and must-do focus determines whether you walk through it. When in doubt, bias toward shipping, talking to customers, and tightening the loop between what you learn and what you build. That’s the work of product management leadership at scale.

March 26, 2026