Tag: DORA metrics

Why Product Engineers Are Transforming Software Delivery: Ownership, Speed, and Real Impact

I’ve watched the rise of product engineering up close, and it’s reshaping how we build software. The old model of rigid handoffs and separate functions is giving way to small, empowered product teams where engineers own the customer problem end to end. That shift isn’t just cultural—it’s a performance advantage that compounds with every release.

I often summarize it this way: “Product engineers are taking over. They ship code, talk to users, and own outcomes—no handoff required. Here’s what the role is, and why it matters now.”

When I say “product engineer,” I’m describing a builder who goes beyond writing code. I expect them to partner in product trios with product management and design, participate in continuous discovery, and make decisions grounded in product strategy and real customer insight. They don’t toss features over a wall; they own the problem, the solution, and the measurable outcome.

Why now? Modern delivery practices like CI/CD and feature flags compress feedback loops, while behavioral analytics and session replay make customer friction visible in real time. As expectations rise for quick iterations and clear value, teams that reduce handoffs and align around outcomes outperform on DORA metrics such as deployment frequency and lead time for changes.

Day to day, a strong product engineer blends discovery and delivery. They join customer interviews, review support tickets, analyze usage patterns, and run A/B testing to validate hypotheses. Then they ship code in small, safe increments, instrument telemetry, and watch adoption and retention signals to confirm they’re moving the numbers that matter.

Team shape matters. I favor compact, cross-functional squads anchored by product trios, each with explicit OKRs that target outcomes rather than output. Product engineers often operate like forward deployed engineers, partnering with customer success and solutions engineering to learn at the edge of real-world usage. This proximity to customers turns ambiguity into insight, and insight into product leverage.

Accountability is concrete. We track DORA metrics for delivery health and pair them with product outcomes such as activation, time-to-value, and Net Revenue Retention (NRR) drivers. The combination keeps us honest about both how fast we move and whether what we ship truly works for customers.
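
To make the delivery side concrete, here is a minimal sketch of two DORA metrics, deployment frequency and lead time for changes, computed from deployment records. The `Deployment` record and its field names are assumptions for illustration, and lead time is measured from merge rather than from first commit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    merged_at: datetime    # when the change was merged (assumed start of lead time)
    deployed_at: datetime  # when the change reached production


def deployment_frequency(deploys: list[Deployment], days: int) -> float:
    """Average number of production deployments per day over the window."""
    return len(deploys) / days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median merge-to-production time across the given deployments."""
    return median(d.deployed_at - d.merged_at for d in deploys)
```

The median is used instead of the mean so that one slow release does not hide an otherwise healthy pipeline.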

The hiring profile is distinct. I look for engineers who are curious about the “why,” comfortable with trade-offs, and energized by customer conversations. They can navigate architectural complexity, but they also translate user feedback into crisp product bets. Many grow into natural facilitators of discovery rituals and developer evangelism across the organization.

If you’re getting started, pilot a single squad. Establish clear OKRs framed around outcomes rather than output, invest in CI/CD and feature flags, and commit to continuous discovery with weekly customer interviews. Give the team ownership of a KPI tied to product strategy, and measure progress with DORA metrics plus usage and retention signals. The early wins (fewer handoffs, faster learning, tighter feedback loops) build momentum quickly.

In short, product engineers thrive where accountability, autonomy, and user empathy meet. They reduce wasteful coordination, shorten the path from insight to impact, and ensure we ship code that customers actually adopt. That’s why this role is reshaping how software gets built—and why the teams that embrace it will set the pace for everyone else.

Inspired by this post on Pendo – Perspectives.

June 15, 2026
Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments — the Black Friday sales, the sporting events, the production incidents — and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business.

Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.

We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.

While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, or week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice.

We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.

“That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”

This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition — they reserve our ambition for the nuanced scaling challenges that matter.

The source of truth is the hard part. You can scale stateless web traffic by adding machines, and you can add queue consumers and cache. Those are real problems — just not the hardest ones.

The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.

That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.

Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it.

Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.

Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform — from vector search in our AI features to realtime reporting — and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work. Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.

More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident.
We’ve used this pattern for years to make large search migrations safer and more incremental — a core playbook in our platform scalability and SRE practices.

Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.

We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus there is primarily on correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks.

The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed.

Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. AI providers are not commodity storage systems, and we do not design as if they are.

That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point.

For enterprise customers, this matters because AI support volume can spike just like human support volume — and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.

Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce.

Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.

“That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”

Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.

We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer — a habit that anchors our incident management, observability, and FinOps practices.

Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.

“That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”

Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?

This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric — it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.

“Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”

The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.

Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime — and we do this in practice, not just as a goal.

This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.

“Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”

Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.

What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale — it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer.

Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system — and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

Inspired by this post on The Intercom Blog.

May 19, 2026
How We Built PR Review Bots In‑House for a Fraction of the Cost—and How You Can Too

PR review bots are all the rage, but they come at a premium. We built our own cheaply, and they work just as well, if not better. Here's how.

As a VP of Product Management, I care deeply about the velocity and quality of our software delivery. The decision to build our own pull request (PR) review agents came from a simple calculus: we needed tighter control over developer experience, CI/CD integration, and cost—without sacrificing accuracy or reliability. The result was a pragmatic system that accelerates reviews, improves code quality, and pays for itself through faster feedback loops.

Before we wrote a line of code, we defined success. Our objectives were to shorten review cycles, reduce back-and-forth on style and test coverage, and surface risks earlier—measured against DORA metrics like lead time and deployment frequency. That focus aligned the team, guided our build vs buy decision, and anchored scope to the highest-impact use cases.

We started rules-first, AI-optional. The initial release enforced guardrails that are universally valuable: linting and formatting checks, required test coverage thresholds, commit message standards, ownership validation (CODEOWNERS), and basic security scans. These automated gates eliminated predictable review friction, freeing engineers to focus on logic and architecture rather than style debates.
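
A rules-first gate of this kind can be small. The sketch below checks commit message format, a coverage threshold, and owner approval. The `PullRequest` shape, the commit pattern, the 80% threshold, and the prefix-based `OWNERS` map (a simplified stand-in for a real CODEOWNERS file) are all assumptions for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical policy values; a real setup would load these from versioned config.
COMMIT_PATTERN = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([\w-]+\))?: .+")
MIN_COVERAGE = 80.0
OWNERS = {"billing/": {"alice"}, "search/": {"bob"}}  # path prefix -> owners


@dataclass(frozen=True)
class PullRequest:
    commit_messages: list[str]
    coverage_percent: float
    changed_files: list[str]
    approvers: set[str]


def check_pr(pr: PullRequest) -> list[str]:
    """Return human-readable failures; an empty list means the PR passes."""
    failures = []
    for message in pr.commit_messages:
        if not COMMIT_PATTERN.match(message):
            failures.append(f"commit message does not follow convention: {message!r}")
    if pr.coverage_percent < MIN_COVERAGE:
        failures.append(f"coverage {pr.coverage_percent:.1f}% is below {MIN_COVERAGE:.1f}%")
    for prefix, owners in OWNERS.items():
        touched = any(path.startswith(prefix) for path in pr.changed_files)
        if touched and not (owners & pr.approvers):
            failures.append(f"{prefix} changed without approval from an owner")
    return failures
```

Returning every failure at once, instead of stopping at the first, lets the author fix everything in a single pass.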

Then we layered intelligence where it mattered. We added lightweight, explainable checks for common code smells and dependency risks, plus optional natural-language summaries that turn large diffs into concise context. Where appropriate, we introduced agentic AI workflows to triage PRs by risk, draft review comments, and suggest missing tests—always keeping humans in the loop. This hybrid approach kept costs low and outcomes high.

Integration with our CI/CD pipeline was non-negotiable. We wired GitHub/GitLab webhooks to a stateless service that queued work, executed checks in containerized workers, and posted results back as status checks and review comments. Caching, parallelization, and smart diff-scoping ensured we only computed what changed, keeping the experience snappy even on large repos.
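
The webhook-to-queue hop might look like the sketch below, which covers GitHub only. It verifies the `X-Hub-Signature-256` header (an HMAC-SHA256 of the raw body) before enqueuing work. The in-process `queue.Queue`, the `handle_webhook` name, and the returned status codes are assumptions standing in for a real queue and web framework.

```python
import hashlib
import hmac
import json
import queue

# Stand-in for a durable queue consumed by containerized workers.
work_queue: "queue.Queue[dict]" = queue.Queue()


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the X-Hub-Signature-256 header to our own HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)


def handle_webhook(secret: bytes, body: bytes, signature_header: str) -> int:
    """Validate a pull_request event and enqueue it; returns an HTTP status code."""
    if not signature_is_valid(secret, body, signature_header):
        return 401
    event = json.loads(body)
    if event.get("action") not in {"opened", "synchronize"}:
        return 204  # nothing to review for this action
    work_queue.put({
        "repo": event["repository"]["full_name"],
        "pr": event["pull_request"]["number"],
        "sha": event["pull_request"]["head"]["sha"],
    })
    return 202  # accepted; the checks run asynchronously
```

Keeping the handler stateless and quick means slow checks never hold the webhook connection open.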

Adoption hinged on developer experience. We made the bot’s feedback fast, specific, and actionable, with clear remediation steps and links to documentation. Feature flags allowed teams to opt into new checks gradually. ChatOps commands enabled quick overrides for emergencies, while policy-as-code kept rules visible, versioned, and auditable.

We treated this like any product: eval-driven development for accuracy, ongoing telemetry for false-positive rates, and explicit SLAs for response times. We instrumented outcomes end-to-end—tracking PR cycle time, comment-to-merge ratios, and rework—so we could prove the ROI and tune the system without guesswork.

The outcome: a reliable PR review companion that runs on a shoestring budget, integrates cleanly with our workflows, and measurably improves engineering throughput. If you’re weighing build vs buy, start small with rules that deliver immediate value, then layer intelligence where it earns its keep. With a clear product strategy, you can stand up capable PR review bots quickly—and scale them as your needs grow.

If you’re ready to try this yourself, begin with your top three friction points in code reviews, wire them into your CI/CD checks, and pilot with a single team. Iterate weekly, measure relentlessly, and let your developers be your strongest signal. You’ll be surprised how far a pragmatic, product-led approach can take you.

Inspired by this post on Amplitude – Perspectives.

March 27, 2026
Battle-Tested AI Agent Orchestration Patterns for Reliable, Observable, Product-Ready Systems

Shipping agentic AI into production is exhilarating—until a flaky output torpedoes trust. Over the past year, I’ve led teams at HighLevel to operationalize agents across customer-facing and internal workflows, and I’ve learned that reliability isn’t an afterthought; it’s an architecture. In this piece, I share the AI Agent Orchestration Patterns for Reliable Products that consistently deliver dependable outcomes at scale.

When we talk about orchestration, we’re talking about more than a single prompt. The shift is from monolithic calls to coordinated “agentic AI” where routers, planners, and specialists collaborate through structured “AI workflows.” In practice, I rely on a few canonical patterns: a planner–executor loop for multi-step tasks, a router–specialist setup for skill selection, and a “retrieval-first pipeline” that grounds generation with authoritative context before a single token is produced.

Reliability-by-design starts with typed inputs/outputs and strict validation. I standardize on JSON schemas, enforce tool/function signatures, and implement idempotency keys so retries don’t wreak havoc on downstream systems. Timeouts, circuit breakers, and backpressure protect the platform under load, while rate limiting and dead-letter queues keep failure modes contained. Most importantly, we engineer graceful degradation: agents “abstain” when uncertain, fall back to deterministic paths, and escalate to humans instead of guessing.
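
Two of these ideas, strict validation of model output and idempotency keys, fit in a short sketch. The refund tool, the `RefundRequest` fields, and the in-memory result store are hypothetical. The hand-rolled checks stand in for a real JSON Schema validator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int


def parse_refund(raw: dict) -> RefundRequest:
    """Strictly validate untrusted model output before any tool runs."""
    if set(raw) != {"order_id", "amount_cents"}:
        raise ValueError("unexpected or missing fields")
    order_id, amount = raw["order_id"], raw["amount_cents"]
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return RefundRequest(order_id, amount)


class IdempotentExecutor:
    """Runs a tool at most once per idempotency key; retries get the stored result."""

    def __init__(self, tool: Callable[[RefundRequest], str]) -> None:
        self._tool = tool
        self._results: dict[str, str] = {}  # a real system would persist this

    def run(self, idempotency_key: str, request: RefundRequest) -> str:
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._tool(request)
        self._results[idempotency_key] = result
        return result
```

A validation failure is a natural point to abstain or escalate, since the agent's output could not be trusted enough to act on.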

Safety is a first-class concern, not a bolt-on. Our “AI risk management” pipeline includes PII redaction, allow/deny lists for tools and data, and the principle of least privilege for every connector (yes, even the ChatGPT connector). We codify policy-as-code for repeatability and require human-in-the-loop approvals for sensitive or irreversible actions. In my experience, clear red lines and reversible defaults prevent the vast majority of regrettable outcomes.

Without strong “observability,” you’re flying blind. I instrument agents with an “Agent Analytics” layer that captures traces, spans, tool invocations, and token usage across the entire chain. The essential metrics are outcome quality (task success rate), latency (p50/p95), tool failure rates, cost per task, and user-level satisfaction signals. Cross-agent lineage allows us to pinpoint where a plan went awry and which tool or prompt introduced drift—vital for rapid remediation.
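
As a rough illustration, the essential metrics can be rolled up from per-task trace records like this. The `TaskTrace` fields are assumed for the example and do not reflect any particular tracing product's schema.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class TaskTrace:
    succeeded: bool
    latency_ms: float
    cost_usd: float
    tool_calls: int
    tool_failures: int


def summarize(traces: list[TaskTrace]) -> dict[str, float]:
    """Aggregate task traces into headline metrics (needs at least two traces)."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles([t.latency_ms for t in traces], n=100, method="inclusive")
    total_calls = sum(t.tool_calls for t in traces)
    total_failures = sum(t.tool_failures for t in traces)
    return {
        "task_success_rate": sum(t.succeeded for t in traces) / len(traces),
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "tool_failure_rate": total_failures / total_calls if total_calls else 0.0,
        "cost_per_task_usd": sum(t.cost_usd for t in traces) / len(traces),
    }
```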

Quality improves fastest when it is measured relentlessly. I practice “eval-driven development” with golden datasets, rubric-based scoring, and risk-weighted sampling of edge cases. LLM-as-judge can help, but we always calibrate against human ratings and monitor agreement. In production, I blend online metrics with controlled “A/B testing” and plan experiments to hit a realistic minimum detectable effect (MDE). The result is a virtuous loop where prompt tweaks, tool changes, and retrieval adjustments are verified before wide rollout.
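
To see what a "realistic" MDE means in practice, the standard normal-approximation formula for comparing two proportions gives the sample size each arm needs. This is a sketch. The function name and the default significance and power values are assumptions, and the metric is assumed to be a success rate.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Tasks needed per arm to detect an absolute lift of `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)
```

Halving the MDE roughly quadruples the required sample. That is why a small prompt tweak often cannot be verified on low-traffic workflows.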

Agents need the same rigor we expect from any modern system. I gate releases through “CI/CD” with linting for prompts, schema checks for tools, and simulation runs for critical paths. “Feature flags” enable shadow and canary deployments so we can throttle exposure by segment or workflow. I also track delivery health with “DORA metrics” such as “deployment frequency,” and I partner closely with “SRE” for on-call coverage, runbooks, and incident postmortems tailored to agent failure modes.

Context is a resource to allocate, not a bottomless pit. Thoughtful “context window management” means curating retrieval, summarizing long-running threads, setting memory time-to-live, and constraining what the agent can see at any given step. I bias hard toward retrieval over recall, keep chunks small and semantically precise, and validate that the “retrieval-first pipeline” truly returns the right evidence—not just the nearest match.
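
One way to treat context as a budget is shown below: drop stale or weakly relevant chunks, then pack the best of the rest until the budget is spent. The `Chunk` fields, the thresholds, and the word-count token estimate are assumptions. A real system would use the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    relevance: float    # retrieval score; higher is better
    age_seconds: float  # time since the chunk was written to memory


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly one token per word."""
    return len(text.split())


def build_context(chunks: list[Chunk], token_budget: int,
                  ttl_seconds: float, min_relevance: float) -> list[str]:
    """Select fresh, relevant chunks in relevance order without exceeding the budget."""
    eligible = [c for c in chunks
                if c.age_seconds <= ttl_seconds and c.relevance >= min_relevance]
    selected, used = [], 0
    for chunk in sorted(eligible, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= token_budget:
            selected.append(chunk.text)
            used += cost
    return selected
```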

In day-to-day product work, I lean on a compact playbook: a router that selects the best specialist; a planner that decomposes tasks and allocates tools; a deterministic guard that verifies preconditions; an execution loop with explicit budgets; and a fallback policy that prefers abstaining over hallucinating. Together, these patterns create an agent that behaves like a dependable teammate rather than a creative wildcard.

No architecture thrives without the right rituals. Product trios keep discovery continuous, while clear outcomes (not output) align teams on value instead of vanity. We map risks early, maintain a public quality dashboard, and rehearse failure recoveries so incidents never become improvisations. The cultural signal is simple: we celebrate root-cause clarity and safe iteration over heroics.

If you’re just starting, implement three patterns first: retrieval before generation, abstain-and-escalate for low confidence, and canary releases under feature flags. Instrument everything from day one, run a weekly eval review, and expand scope only when the data says you’re ready. With these habits, your agents will earn user trust—and keep it.

Inspired by this post on Product School.

March 2, 2026
The Safety of Speed: 180 Deploys a Day, 12‑Minute Releases, 99.8%+ Availability

“Speed is not the enemy of safety; it is the prerequisite for it.” I live by this principle. In our organization, the average time from merging code to it being used by customers in production is just 12 minutes, and that short window is fundamental to how we build, ship, and learn.

As of January 2026, we average 180 ships per workday, roughly 20 deployments every hour. Conventional wisdom suggests that to increase stability, you must slow down. I believe the opposite: accumulating code creates risk, and shipping small batches minimizes it. Shipping is our company’s heartbeat.

Maintaining this frequency while targeting 99.8%+ availability has required over a decade of focused investment in systems, principles, and processes. We protect the integrity of our systems through three layers of defense: an automated pipeline that is simple and reliable and removes the need for manual intervention; a shipping workflow that promotes ownership and uses guardrails as accelerants; and a recovery model that optimizes for mitigating inevitable failures. Here’s how we’ve built each layer so that velocity is our greatest source of stability.

While our platform consists of various services and frontend applications, I’ll focus here on our Ruby on Rails monolith. It is our core application and the one we deploy most frequently; we also deploy it to three different data‑hosting regions with independent pipelines. Our other services follow similar pipeline principles and safeguards, but the Rails monolith is the clearest example of how we ship at scale.

The automated pipeline is designed to move code from merge to production as fast as possible while enforcing strict safety checks. It is fully automated, and the vast majority of releases require no human intervention—critical for CI/CD at high deployment frequency.

Once an engineer merges code to GitHub, two things happen immediately. First, the build: we compile the Rails application and its dependencies into a deployable asset (a slug) in about four minutes. Second, parallel CI: our test suite runs alongside the build; through extensive optimization, parallelization, and test selection, the vast majority of CI builds finish in under five minutes.

As soon as the slug is built, it’s deployed to a pre‑production environment. CI does not block the progression of the slug to pre‑production. Deploying to pre‑production takes around two minutes. This environment serves no customer traffic, but it is connected to our production datastores, mirrors our production infrastructure variants (e.g., web serving, asynchronous worker), and is configured so that requests exercise the pre‑release code and workers.

Immediately after deployment, we run and await several automated approval gates. We verify that the application boots cleanly on hosts (boot test), confirm the parallel test suite passed (CI check), and execute functional synthetics using Datadog Synthetics on critical flows—such as loading or editing a Fin workflow. If any gate fails, the release is halted and does not go to production.
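
The gate logic reduces to "every check must pass, and any failure halts promotion." A minimal sketch follows, with hypothetical gate names. It runs the gates one after another, where a real pipeline might await them concurrently.

```python
from typing import Callable

Gate = tuple[str, Callable[[], bool]]  # (name, check that returns True on success)


def run_gates(gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run every gate and report (approved, names of failed gates)."""
    failed = []
    for name, check in gates:
        try:
            passed = check()
        except Exception:
            passed = False  # a gate that crashes counts as a failed gate
        if not passed:
            failed.append(name)
    return (not failed, failed)
```

Running all gates even after one fails gives the engineer a full list of problems in a single report.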

Once approved, we promote the code to thousands of large virtual machines. A deployment orchestrator triggers these deployments simultaneously, while a decentralized, staggered rollout avoids changing the state of the entire fleet at the same millisecond. Within each machine, a rolling restart mechanism removes a process with old code from the serving path, lets it drain gracefully, and replaces it with a fresh process running the new code. From the moment a deployment starts, first requests are served by new code within roughly two minutes, and the vast majority of the global fleet updates transparently within six minutes. Once restarts have been triggered on every machine, the production stage unblocks so the next deployment can begin.

We treat a stalled pipeline as a high‑priority incident. If the automated system rejects three consecutive release attempts, it pages an on‑call engineer. These are pre‑production blocks, but if the shipping lane stops moving, changes pile up—and our stability relies on building and shipping in small steps. The on‑call’s job is to restore flow so that tiny, safe, frequent updates continue to keep risk low.
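
The paging rule amounts to a counter of consecutive rejections. In this sketch the `StallDetector` name and the `page_on_call` callback are hypothetical, and paging only once per stall is an assumption made to avoid repeat pages.

```python
from typing import Callable


class StallDetector:
    """Pages on-call when a run of consecutive release rejections hits the threshold."""

    def __init__(self, page_on_call: Callable[[str], None], threshold: int = 3) -> None:
        self._page = page_on_call
        self._threshold = threshold
        self._consecutive = 0

    def record(self, release_accepted: bool) -> None:
        if release_accepted:
            self._consecutive = 0  # flow is restored, so the streak resets
            return
        self._consecutive += 1
        if self._consecutive == self._threshold:
            self._page(f"{self._threshold} consecutive releases rejected; pipeline stalled")
```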

Our shipping workflow is built on extreme ownership: tools assist, but the engineer is accountable for quality and the decision to merge. I insist that you are present when you ship. The practical benefit of a 12‑minute deployment cycle is that engineers remain in the zone, focused on the problem they just solved, and ready to validate behavior as it goes live.


To support this, our deployment system sends Slack notifications the moment code is submitted and as it advances through stages, embeds direct observability links to relevant dashboards and logs in every PR and message, and prompts verification so engineers actively watch the dials and test features in production. It is not acceptable to rely on green builds alone. You’re expected to watch your change go live, and if you’re not prepared to roll back, you’re not prepared to ship. We maintain a no‑blame culture: quick rollbacks and immediate reverts are signs of vigilance and ownership, not failure.

We make extensive use of feature flags to turn deployment into a non‑event. By decoupling deployment (moving code to servers) from release (turning features on), we shrink the blast radius of change. Flags can be enabled for all customers, a specific subset, or disabled for everyone in under 60 seconds through our backend UI. Engineers can group flags into beta features and run phased rollouts; we also ensure flags work consistently across non‑monolith applications. In the past three months, we created over 560 flags—and we actively manage them to avoid permanent complexity.
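A flag check along these lines shows how deployment and release can be separated. This is a sketch under my own assumptions (the flag dictionary shape, `allow_list`, and `rollout_percent` are hypothetical names), not the backend described above.

```python
import hashlib


def bucket(flag_name, customer_id):
    """Map a (flag, customer) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag, customer_id):
    """Decide whether a deployed feature is released to this customer."""
    if not flag.get("enabled", False):
        return False  # kill switch: off for everyone
    if customer_id in flag.get("allow_list", ()):
        return True  # explicit subset, for example beta customers
    # Phased rollout: the same customer always lands in the same bucket.
    return bucket(flag["name"], customer_id) < flag.get("rollout_percent", 0)
```

Because the bucket is derived from a hash rather than a random draw, raising `rollout_percent` only ever adds customers; nobody flips back and forth between variants.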

For complex refactors—especially when behavior should not change—we leverage GitHub Scientist, an open‑source experimentation library. It runs candidate logic (new code) in parallel with existing logic (old code) in production, instruments both paths for result and timing comparisons, and keeps existing behavior user‑visible. That means we can iterate on and validate new code under real load without risking the experience, then switch seamlessly when confident.
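Scientist itself is a Ruby library, so the following is only a Python sketch of the same pattern, not its API. The function name, the `publish` callback, and the observation fields are assumptions, and this version runs the two paths one after the other rather than concurrently.

```python
import time


def run_experiment(name, control, candidate, publish, *args, **kwargs):
    """Run both code paths, publish a comparison, always return the control result."""
    start = time.perf_counter()
    control_result = control(*args, **kwargs)
    observation = {"name": name, "control_seconds": time.perf_counter() - start}

    try:
        start = time.perf_counter()
        candidate_result = candidate(*args, **kwargs)
        observation["candidate_seconds"] = time.perf_counter() - start
        observation["matched"] = candidate_result == control_result
    except Exception as exc:  # candidate failures must never reach the user
        observation["candidate_error"] = repr(exc)
        observation["matched"] = False

    publish(observation)
    return control_result
```

The caller only ever sees the old behavior; mismatches and timings go to whatever `publish` sends them to, such as a metrics pipeline.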

When engineers need to go deeper before merging, they can generate a slug and deploy it to a virtual machine, detaching a running production host from the serving path and connecting for manual testing. They can also put a pre‑release slug on a serving machine that handles a small percentage of jobs or web requests. Single‑host validation lets us slice observability to those hosts, compare against the main release, and make low‑level changes safer. Staging is a simulation; production is reality. Testing on a single production host validates assumptions with real‑world data without risking the fleet.

Our recovery model starts from a simple principle: stop monitoring systems; start monitoring outcomes. Traditional monitoring tells you if a server is healthy; we care whether customers are healthy. We rely on heartbeat metrics—vital signs that represent the core value our product provides—such as the rate at which messages and comments are created.

Unlike standard uptime checks, heartbeat metrics are binary in spirit. If message send rates dip below baseline, it does not matter if infrastructure dashboards are green. Down is down, and if customers can’t do their job, uptime percentages are irrelevant. By tracking real‑world success rates as a high‑level signal, we catch subtle degradations that traditional alerting either misses or over‑alerts on.

Because we ship in small, incremental steps and maintain previous releases on our virtual machines, our Time to Recover (TTR) is generally very fast. If a heartbeat metric drops or a critical anomaly is detected right after a ship, the system can trigger an automatic rollback, reverting to the release that was running 20 minutes ago—often restoring service before an engineer responds. For complex issues, engineers can initiate a manual rollback through our deployment UI; doing so also locks the production pipeline to prevent further releases while we investigate and remove problematic code.
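As a rough illustration of the automatic path, the sketch below compares a heartbeat rate to its baseline and falls back to the previous release on a dip. The 20% tolerance, the function names, and the oldest-to-newest release list are all assumptions of mine.

```python
def heartbeat_is_healthy(current_rate, baseline_rate, tolerance=0.2):
    """True while the observed rate stays within `tolerance` of the baseline."""
    if baseline_rate <= 0:
        return True  # no baseline to compare against yet
    return current_rate >= baseline_rate * (1.0 - tolerance)


def choose_release(releases, current_rate, baseline_rate, tolerance=0.2):
    """Pick the release to serve; `releases` is ordered oldest to newest.

    Serve the newest release unless the heartbeat has dropped, in which
    case fall back to the one that was running before it.
    """
    unhealthy = not heartbeat_is_healthy(current_rate, baseline_rate, tolerance)
    if unhealthy and len(releases) >= 2:
        return releases[-2]
    return releases[-1]
```

For example, `choose_release(["r41", "r42"], current_rate=70, baseline_rate=100)` falls back to `"r41"` because the rate is more than 20% below baseline.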

Resumption of service is not the end. Every incident prompts an incident review, and we don’t just fix the bug. We ask, “How did the machine allow this to happen?” Then we harden the system so it cannot happen again. This loop—fast shipping, fast recovery, rigorous learning—compounds resilience over time.

This operating model aligns with the DORA metrics: high deployment frequency, short lead time for changes, low change failure rate, and rapid time to restore service. It’s a CI/CD and SRE‑informed approach that converts speed into a defensive advantage rather than a liability.

Shipping 180 times a day isn’t a vanity metric; it’s a deliberate choice to protect the customer experience. With a 12‑minute window from code to customer, the feedback loop is tight and engineers retain context—and accountability—for the immediate impact of their work. Maintaining this pace requires more than fast CI; it requires judgment, extreme ownership, disciplined use of feature flags, and a recovery model that monitors outcomes. We rely on human expertise, augmented by these layers of defense, to catch issues before they turn into customer pain. We don’t ship fast despite our need for stability; we ship fast to stay in control of change.

Inspired by this post on The Intercom Blog.

January 26, 2026
PMs and Developers Need Different AI Metrics—Here’s How That Builds Faster, Better Products

I’ve sat in countless AI measurement debates and noticed a recurring gap: one major voice is consistently underrepresented, the product manager (PM) who’s leading development. From experience, PMs and developers do need different measurement tools, and making those differences explicit is exactly what speeds up decisions and improves outcomes.

Developers optimize the model and system layer. Their toolkit centers on eval-driven development: offline evals, regression suites, red-teaming, latency and throughput monitoring, token cost tracking, and hallucination rate reduction. On the delivery side, engineering teams watch DORA metrics alongside CI/CD performance to keep iteration fast and safe. When building LLM-backed experiences, they also care deeply about retrieval-first pipeline quality and context window management because those mechanics determine grounding, relevance, and consistency.

PMs, by contrast, own outcomes. We instrument user journeys end to end and define a clear north-star metric tied to value: activation, time-to-value, task success rate, retention, support deflection, and revenue contribution. We rely on A/B testing frameworks and minimum detectable effect (MDE) planning to separate real impact from noise, and we consolidate behavioral signals in a unified analytics platform such as Amplitude analytics or Pendo to understand adoption, friction, and cohort differences. This is the heart of product-led growth and continuous discovery: evidence, not anecdotes.

The fact that these toolboxes differ is a strength, not a weakness. Specialized metrics keep responsibilities crisp: developers guarantee model quality and reliability; PMs guarantee that quality translates into customer and business outcomes. What we need is an explicit metrics ladder that connects layers—model-level quality floors and SLOs, feature-level KPIs, and company-level results—so trade-offs are transparent and prioritization is principled.

In practice, I create a shared measurement contract for every AI initiative. It links eval sets to user-facing success criteria, defines acceptance thresholds, and spells out observability across the stack. We include governance from day one—AI risk management, privacy-by-design, and data governance—so we can scale responsibly without slowing teams down.

Here’s the AI product toolbox I give my teams: start with a concise value hypothesis; define a success rubric the customer would recognize; instrument the happy path and the failure path; plan experiments with MDE up front; segment results by persona and job-to-be-done; and close the loop with qualitative feedback inside the product via in-app guides, product tours, and lightweight surveys. For AI features specifically, add Agent Analytics for agentic AI, capture grounding sources for explainability, and log model/context inputs to make debugging and iteration repeatable. That way, LLMs for product managers stop being magic and start being manageable.

When we roll out a new assistant—whether a retrieval-augmented copilot or a voice AI agent—we set two dashboards: one for developers (eval pass rates, latency, context integrity, error budgets) and one for PMs (activation, task completion, deflection, satisfaction). The dashboards read differently by design, yet they are joined at the hip by shared definitions and experiment IDs. This lets us move quickly with confidence: engineering can tighten quality loops while product steers toward the outcome that matters most.

If you’re feeling the tension between model metrics and product metrics, don’t collapse them—connect them. Start with a thin slice, agree on 3–5 measurable outcomes, and let your evals and A/B tests work together. With a clear metrics ladder and a unified analytics platform, PMs and developers can each excel at their craft and still ship AI that customers love.

Inspired by this post on Pendo – Perspectives.

January 7, 2026
My Proven Experimentation Playbook for AI PMs: Faster Learning, Safer Launches, Bigger Wins

I build AI products with a simple conviction: disciplined experimentation beats intuition. Over the years, I’ve refined a practical playbook that helps my teams learn faster, reduce risk, and turn every release into a smarter next step.

Product experimentation isn’t luck; it’s a method. Learn how top AI product managers test, measure, and grow smarter with every release.

I begin every effort with a crisp hypothesis, an expected user or business outcome, and unambiguous success criteria tied to outcomes vs output OKRs. Before writing a line of code, I define primary metrics and guardrails so we know what “good” looks like—and what to stop.

When the change affects UX, pricing, or activation flows, I favor A/B testing with the statistical rigor to back decisions. We calculate the minimum detectable effect (MDE), choose appropriate randomization units, and pre-register the analysis plan to avoid p-hacking. This gives the team the confidence to scale wins and sunset underperformers quickly.
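For a conversion-style metric, the MDE translates into a required sample size. The sketch below uses the standard two-proportion normal approximation; the function name and the defaults (5% two-sided significance, 80% power) are my assumptions, so treat the result as a planning estimate rather than an exact figure.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_absolute, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift.

    Uses the normal approximation for comparing two proportions with a
    two-sided test at significance `alpha` and the given statistical power.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + mde_absolute
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_absolute ** 2)
```

With a 10% baseline and a 2-point absolute MDE this lands at a few thousand users per arm, and halving the MDE roughly quadruples the requirement. That is why an unrealistic MDE for your traffic is worth catching before the test starts.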

AI features demand a tailored approach, so I run eval-driven development before any user sees a variant. We curate golden datasets, score candidate prompts and models, and stress-test failure modes. This is where LLM fluency matters for product managers: prompt templates, context window management, and a retrieval-first pipeline are all evaluated for quality, latency, and cost-to-serve. I treat “hallucination rate,” safety violations, and bias as first-class metrics under AI risk management.
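A golden-dataset gate can be as small as the sketch below. The case shape, the grader callback, and the 0.9 floor are hypothetical choices for illustration; real evals usually score more than exact correctness.

```python
def pass_rate(candidate, golden_set, grader):
    """Fraction of golden cases where the candidate's output passes the grader.

    `candidate` maps an input to an output; `grader(output, expected)`
    returns True or False. Each case is {"input": ..., "expected": ...}.
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one case")
    passed = sum(
        1 for case in golden_set if grader(candidate(case["input"]), case["expected"])
    )
    return passed / len(golden_set)


def clears_gate(candidate_rate, baseline_rate, floor=0.9):
    """Allow a variant through only if it meets the floor and does not regress."""
    return candidate_rate >= floor and candidate_rate >= baseline_rate
```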

To de-risk launches, we ship behind feature flags with CI/CD, monitor DORA metrics, and roll out in stages. Product trios own everything from problem framing to solution delivery, which shortens feedback loops and preserves accountability. If early signals drift from our hypotheses, we pause, adjust, and re-run, with no sunk-cost thinking.

Measurement is non-negotiable. I instrument user journeys end-to-end with Amplitude analytics, track activation and retention analysis, and map behavior to learning objectives. We consolidate logs and events into a unified analytics platform so qualitative insights from customer research pair cleanly with quantitative trends.

Continuous discovery keeps the engine running. Weekly customer conversations, in-product feedback, and lightweight prototypes ensure we validate needs, not just solutions. The output flows into product discovery, product roadmapping and sprint planning, and a reusable AI product toolbox that scales across teams.

Finally, I protect the culture that makes experimentation work: we celebrate invalidated hypotheses, document decisions, and optimize for outcomes over output. That’s how empowered product teams sustain product-led growth—even as complexity grows.

If you’re building AI features today, adopt this playbook to maximize learning velocity, minimize risk, and compound advantage. The method is straightforward: form strong hypotheses, test with rigor, measure what matters, and let evidence—not HiPPOs—guide the roadmap.

Inspired by this post on Product School.

December 31, 2025
Quantitative Metrics vs. Qualitative Insight: How I Balance Data and Discovery to Grow Products

Quantitative metrics tell the story in numbers; qualitative ones whisper why it matters. Both shape how products grow. Here’s what you need to know.

In my day-to-day, I rely on quantitative metrics to surface what’s changing in the business and where we need to focus. Activation rate, conversion through the onboarding funnel, feature adoption, retention analysis, and LTV/CAC give me a precise read on performance. I also keep an eye on DORA metrics to understand delivery health and deployment frequency, but I never mistake those for customer outcomes. Numbers spotlight signal—but they rarely explain causality on their own.

That’s where qualitative analysis earns its keep. Customer interviews, usability studies, win/loss debriefs, support transcripts, and community feedback give me the context behind the charts. Tools like Pendo help me layer in in-app guides and micro-surveys to capture intent and friction in the flow. This combination turns raw data into decisions that actually move the product strategy forward.

My operating cadence is simple: weekly dashboards to monitor quantitative metrics, ongoing continuous discovery to collect qualitative insight, and a monthly synthesis to reconcile both with our outcomes vs output OKRs. The aim is to move from opinions to evidence, and from anecdotes to patterns. When quant and qual agree, we execute confidently; when they diverge, we design the smallest experiment to learn fast.

I use a three-question decision tree to choose the method. First, are we exploring or validating? Exploration leans qualitative; validation leans quantitative. Second, do we have enough volume for statistical power? If yes, I’ll run A/B testing with a clear minimum detectable effect (MDE) to avoid false positives. If not, I’ll rely on targeted qualitative discovery until we can instrument a meaningful test. Third, will this decision meaningfully impact our product-led growth or user activation goals? If it will, we invest in both measurement and discovery to reduce decision risk.

Here’s a concrete example. We once saw a sudden drop in user activation. The quantitative dashboard flagged a step-function change at onboarding step three, but it couldn’t explain why. A quick round of qualitative interviews revealed that our tooltip design buried a critical permission request. We shipped a Pendo-powered in-app guide variant and ran an A/B test to validate the fix. Activation rebounded within a week, and 30-day retention followed suit.

There are common pitfalls I actively avoid. Chasing vanity metrics that don’t ladder up to outcomes. Conflating shipping speed with customer value by over-indexing on DORA metrics. Overfitting with A/B testing when the MDE is unrealistic for our traffic. And on the qualitative side, mistaking a compelling anecdote for a representative sample without triangulating evidence.

If you’re looking to tighten your practice, start with a lightweight playbook: instrument core events in Amplitude analytics; define a small set of outcomes vs output OKRs; schedule recurring customer conversations as part of continuous discovery; tag qualitative insights so patterns surface over time; and pair every material UX change with either a well-powered experiment or a clear qualitative learning goal. This creates a unified analytics and discovery loop that compounds.

Ultimately, quantitative metrics help me prioritize with clarity, while qualitative analysis helps me decide with confidence. When you weave them together, you not only ship faster—you ship the right thing, for the right reason, at the right time.

Inspired by this post on Product School.

December 30, 2025
Inside the Engine Room: How I Drive Scalable Analytics APIs, Reliability, and Performance

I build and scale analytics platforms with a product mindset, and the work starts with the "middleware and compute systems that power analytics at scale." In platforms like Amplitude analytics and other unified analytics platform architectures, that foundation is what makes everything else possible.

Day to day, I oversee the "APIs behind charts, cohorts, and metrics—driving performance, reliability, and platform scalability." When those APIs are fast and resilient, every product team—from growth to customer success—can trust the insights they use to ship, learn, and iterate.

From an engineering leadership standpoint, I partner closely with SRE to define SLOs and error budgets, wire CI/CD pipelines for safe deploys, and track DORA metrics so we improve speed without compromising quality. This combination reduces incident management toil and shortens MTTR while keeping data freshness and query latency within strict thresholds.
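Error budgets fall straight out of the SLO target. Here is a minimal sketch of the arithmetic; the function name and the returned fields are my own, and I am assuming a simple request-based availability SLO.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of a request-based error budget has been spent.

    `slo_target` is the success-rate objective, for example 0.999.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failed requests means about a quarter of the budget is gone.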

From a product management leadership lens, the goal is clarity: crisp APIs, predictable contracts, and transparent stakeholder management across data, engineering, and GTM teams. That alignment empowers product teams with reliable cohorts and metrics, accelerates experimentation, and de-risks roadmaps.

If you’re scaling analytics, invest first in the platform layer: middleware and compute, schema governance, caching strategies, and cost-aware compute. Do that well, and the visible experience—charts, cohorts, and metrics—feels effortless, even as you grow to serve billions of events with confidence.

Inspired by this post on Amplitude – Best Practices.

December 12, 2025
AI Won’t Replace Engineers—Engineers Using AI Will: A Practical Playbook for Your Next Move

Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.

I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.

Today’s generative AI assistants (think Claude Code and ChatGPT connectors) excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.

What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI Strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.

Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, stable change failure rate, and faster MTTR. Pair that with outcomes vs output OKRs to avoid gaming the system—shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
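If you want a before-and-after baseline, the four metrics can be computed from plain deployment records. The record fields below (`commit_at`, `deployed_at`, `failed`, `restored_at`) are an assumed shape, not a standard schema.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_summary(deployments, window_days):
    """Compute the four DORA metrics from a list of deployment records.

    Each record is a dict with `commit_at` and `deployed_at` datetimes,
    a `failed` flag, and `restored_at` (a datetime, or None if not failed).
    """
    lead_minutes = [
        (d["deployed_at"] - d["commit_at"]).total_seconds() / 60 for d in deployments
    ]
    failures = [d for d in deployments if d["failed"]]
    restore_minutes = [
        (d["restored_at"] - d["deployed_at"]).total_seconds() / 60
        for d in failures
        if d.get("restored_at") is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_minutes": median(lead_minutes) if lead_minutes else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "median_restore_minutes": median(restore_minutes) if restore_minutes else None,
    }
```

Run it over the same window before and after an AI pilot, and read the numbers next to your outcome OKRs rather than in isolation.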

Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.

If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.

How to upskill fast: build an AI product toolbox and ship small. Prototype generative AI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM integration or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Leverage CustomGPT workflows for internal enablement, and use in-app guides or product tours to onboard users safely.

Here’s a pragmatic 90-day plan. Weeks 0–2: audit your top 10 engineering tasks by time spent; identify 3 that are ripe for AI augmentation. Weeks 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Weeks 7–10: productionize the wins; document runbooks; add incident management paths. Weeks 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.

AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.

Inspired by this post on Product School.

November 12, 2025
11 Unconventional Product Management Moves That Supercharge Strategy, Teams, and Impact

I’ve spent years leading product strategy at HighLevel, Inc., and the patterns I rely on don’t always show up in the usual playbooks. In practice, the moves that compound impact are often the quiet ones—unsexy, rigorous, and relentlessly customer-centered.

These product management best practices challenge the norm. Read on and you’ll sharpen your strategy and elevate your impact beyond just features.

What follows are the 11 under-discussed habits I return to when the stakes are high and the path is foggy. They help me ship meaningful outcomes, develop empowered product teams, and align our go-to-market strategy without getting trapped in feature theater.

Best practice 1 — Anchor goals to outcomes, not output. I frame “outcomes vs output OKRs” so teams focus on behavior change and business results, not ticket counts. Activation rate, retained revenue, and cycle time beat launch volume every time.

Best practice 2 — Run discovery with product trios. I put design, engineering, and product in the same room early, often with forward deployed engineers. This trio model accelerates product discovery, uncovers risks faster, and builds shared ownership.

Best practice 3 — Decide from first principles, then apply the try do consider framework. I separate points of parity from true differentiation and protect our value proposition. The result: clearer choices, less rework, and a strategy that compounds.

Best practice 4 — Be statistically honest with A/B testing. I size experiments by minimum detectable effect (MDE), guard against peeking, and follow through with retention analysis. This discipline prevents false positives from steering the roadmap.

Best practice 5 — Treat delivery as a learning engine. CI/CD, feature flags, and progressive rollouts let us learn without gambling the brand. I track deployment frequency and DORA metrics to raise quality while increasing the tempo of validated learning.

Best practice 6 — Build a unified analytics backbone. I connect product telemetry to a unified analytics platform and CRM integration so we can see the full funnel. Amplitude analytics, Pendo, and Intercom help us tie behaviors to value realization and inform prioritization.

Best practice 7 — Make onboarding a first-class product. In-app guides, product tours, UX writing, and thoughtful tooltip design shorten time-to-value and lift user activation. This is the quiet lever behind sustainable product-led growth.

Best practice 8 — Systematize stakeholder management. I pair QBRs with OKRs to balance narrative and numbers, keep board management transparent, and align sequencing through product roadmapping and sprint planning. Clear rituals minimize thrash and build trust.

Best practice 9 — Connect strategy to positioning early. I pressure-test product positioning, clarify our value proposition, and deliberately choose which points of parity to match and which to ignore. This reduces me-too work and sharpens competitive differentiation.

Best practice 10 — Use AI as a responsible force multiplier. I use LLMs in my own product management work and generative AI for product prototyping while enforcing privacy-by-design, AI risk management, and strong data governance. The goal is leverage without compromising trust.

Best practice 11 — Write it down to move faster together. I keep crisp decision logs, assumptions, and pre-mortems so empowered product teams can act with context. This simple habit makes onboarding easy, reduces re-litigating, and keeps momentum through change.

When I apply these practices consistently, the team ships less noise and more value. The compounding effect is real: clearer priorities, faster learning cycles, stronger alignment, and a roadmap that tells a coherent story from discovery to adoption.

Inspired by this post on Product School.

November 3, 2025
Scale Product Operations with Confidence: Hard-Won Lessons to Drive Experimentation and Value

Scaling product operations across markets and teams is equal parts craft and discipline. Over the years, I’ve distilled what works into a pragmatic operating system that balances speed with rigor, enables experimentation at scale, and keeps the entire organization aligned on customer value.

Learn how top product leaders at leading companies scale product operations, drive experimentation, and deliver customer value.

The backbone is a clear outcomes-first operating model. I anchor strategy in outcomes vs output OKRs, empower product trios to own problem discovery and solution delivery end to end, and insist on empowered product teams that can make decisions without waiting for permission. This structure raises the signal-to-noise ratio, reduces handoffs, and accelerates learning.

Operational excellence then turns intent into predictable flow. CI/CD pipelines, high deployment frequency, and DORA metrics give me a real-time view of delivery health while creating the safety to ship smaller, reversible changes. When teams can deploy confidently and measure impact continuously, execution quality and morale both improve.

Experimentation is a first-class citizen, not an afterthought. We normalize A/B testing by defining a minimum detectable effect (MDE) up front, instrumenting guardrails for customer experience, and pre-registering success criteria. This keeps experiments honest, speeds up decision-making, and makes it clear when to iterate, when to scale, and when to stop.

Data turns experiments into insight. I lean on a unified analytics platform, with tools like Amplitude analytics for product discovery, activation, and retention analysis. Standardized taxonomies and event quality reviews ensure we can trust the numbers, compare tests, and build cumulative knowledge rather than running one-off trials.

To translate insight into adoption, I invest in product-led growth mechanics. In-app guides, product tours, and thoughtful tooltip design help users discover value fast, while lifecycle nudges align with milestones in the journey. This reduces the burden on sales and success while compounding engagement and retention over time.

Governance should enable, not constrain. Lightweight data governance and privacy-by-design practices mean experiments respect user trust and regulatory requirements without slowing teams down. Clear review paths and pre-approved templates make it easier to do the right thing quickly.

Alignment is continuous, not quarterly theater. I connect strategy and execution with crisp product roadmapping and sprint planning, and I reconcile learning cycles with planning cycles so insights flow into the next iteration. QBRs evolve from status updates into decision forums where we reallocate capacity based on evidence, not opinion.

Here’s the playbook I rely on: clarify the few outcomes that matter; form durable product trios around customer problems; instrument ruthlessly so every change is measurable; operationalize experimentation with A/B testing, MDE, and guardrails; and maintain fast flow with CI/CD and DORA metrics. When this system hums, teams move faster, risk goes down, and customers feel the improvement in every interaction.

At scale, excellence looks deceptively simple: clear outcomes, empowered teams, fast and safe delivery, and relentless learning. Get those right and product operations become a force multiplier—one that compounds customer value with every release.

Inspired by this post on Product School.

November 3, 2025