Inspired by this post on The Intercom Blog.



We just launched Operator, an Agent for your customer operations that helps you understand, manage, and improve your entire customer experience. I’ve spent years shipping AI-driven products at production scale, and this one reflects the lessons I’ve learned the hard way about what it really takes to go from a flashy demo to a dependable system your team trusts.
To give you a clear view of just how powerful this Agent is, I want to share the technical infrastructure and engineering choices that make Operator work reliably at production scale across thousands of customer workspaces. My goal is to demystify the gap between a well-prompted LLM and a true, production-grade Agent—so you can make an informed build vs. buy decision.
If you’re a technical leader evaluating whether to build something like this yourself, or trying to understand the difference between a well-prompted LLM and a production Agent system, this is for you.
Escaping the “it’s just an LLM” trap
Most engineering teams in this space start the same way: a prototype. You take a foundation model, give it API access to your support data, add a system prompt with some domain context, and you’ve got something that queries your database, summarizes tickets, and generates reports that look right. It demos convincingly—and I’ve been there, impressed in the moment, only to watch it buckle under real-world complexity.
The problem with that prototype is that it obscures the scope of what’s actually required. It demonstrates the 10% of the system that’s straightforward to build, and it’s easy to assume the rest is just as straightforward. It isn’t. The gap between a working demo and a production system your team depends on daily is where most of the engineering investment lives. That’s precisely the gap we focused on closing.
With Operator, we’ve invested deeply in every layer: tooling, reasoning, how the Agent takes action, and the infrastructure that makes it reliable at scale. Here’s a closer look at the architecture and why it matters for agentic AI, platform scalability, and observability.
The tooling layer
The first thing we had to confront was that the obvious approach (giving a model access to your APIs and letting it figure things out) doesn’t hold up in production. The model makes reasonable decisions for simple queries, but when you operate across thousands of customer workspaces with different configurations, data models, and usage patterns, a “figure it out” approach isn’t nearly precise enough.
What you need is purpose-built tooling: tools that encode decisions about what data to fetch, how to structure it, what context to include, and what to leave out. Operator has over 50 of these tools and 10 skills.
A tool is a single action that Operator takes (search content, run a query, look up a conversation). A skill chains multiple tools together to complete a whole job, like debugging a conversation end-to-end, rolling out a content update across an entire help center, or identifying the next automation opportunity. This is where AI workflows move from abstract prompts to dependable, repeatable outcomes.
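The tool/skill distinction can be sketched in a few lines. This is a minimal illustration, not Operator's implementation: the `Skill` class, the two tool functions, and the shared context dictionary are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

# A tool is one action: it takes the shared context and returns an updated copy.
Tool = Callable[[dict], dict]


def look_up_conversation(ctx: dict) -> dict:
    # Hypothetical tool: a real one would call a conversation API.
    return {**ctx, "conversation": f"transcript for {ctx['conversation_id']}"}


def search_content(ctx: dict) -> dict:
    # Hypothetical tool: a real one would run a content search.
    return {**ctx, "articles": ["Resetting your password"]}


@dataclass
class Skill:
    """A skill chains several tools to complete a whole job."""
    name: str
    steps: list[Tool]

    def run(self, ctx: dict) -> dict:
        for step in self.steps:
            ctx = step(ctx)  # each tool sees what the previous ones produced
        return ctx


debug_conversation = Skill("debug_conversation", [look_up_conversation, search_content])
result = debug_conversation.run({"conversation_id": "c-123"})
```

Keeping each tool a plain function makes it testable on its own, while the skill fixes the order in which tools run.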
The difference between using thin wrappers around API endpoints and purpose-built tooling shows up in something as seemingly simple as a performance question. When you ask “how did Fin perform last week?”, a naive implementation runs a query and hands back a table. Operator runs a reporting tool that determines which metrics are relevant for your specific workspace, which are meaningful for your particular question, and what the numbers actually mean in context, giving you a much richer answer that you can do something tangible with.
Developing that behavior took months of engineering. Not because any individual piece is conceptually hard, but because getting it right across the full range of customer workspaces, configurations, and edge cases is an iterative process. You build it, you test it against real conversations, you find the cases where it breaks, you fix those, and you repeat. There’s no shortcut—and in practice, this is where most DIY efforts stall.
The intelligence layer
The tooling layer solves what to do, but beneath it is a harder problem: understanding what’s worth doing, and why. This is the layer that makes Operator understand your business rather than just query it. Three components go into it, and in my experience they’re non-negotiable for a reliable Agent.
1. Semantic search
Unlike solutions that rely on keyword matching, Operator uses a system that understands what content is about, not just what words it contains. When it searches your help center, it’s using the same semantic search engine we’ve spent years optimizing for Fin itself. This is a retrieval system that’s been tuned against millions of real support conversations, with precision and recall characteristics we’ve measured and improved continuously. This retrieval-first pipeline is the backbone of grounding and dramatically reduces hallucinations.
2. Attribute awareness
Operator has access to your data and knows what is meaningful for different questions. It knows which metrics are actually in use in your workspace, which custom attributes carry signals, and which fields are populated versus effectively empty. We’ve built specific skills that give Operator this meta-knowledge, so when it’s investigating a performance question, it’s looking at the right things, not hallucinating insights from sparse data.
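One simple way to picture "populated versus effectively empty" is a fill-rate check over a sample of records. The sketch below assumes records are plain dictionaries and that a 50% fill rate is the cutoff; both are illustrative assumptions, not how Operator measures it.

```python
def fill_rates(records: list[dict]) -> dict[str, float]:
    """Fraction of records in which each attribute has a non-empty value."""
    fields = {key for record in records for key in record}
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def usable_attributes(records: list[dict], min_fill: float = 0.5) -> list[str]:
    """Attributes populated often enough to support an analysis."""
    if not records:
        return []
    rates = fill_rates(records)
    return sorted(field for field, rate in rates.items() if rate >= min_fill)


records = [
    {"plan": "pro", "region": "EU", "nps": None},
    {"plan": "free", "region": "", "nps": None},
    {"plan": "pro", "region": "US", "nps": 9},
    {"plan": "pro", "region": "US", "nps": None},
]
print(usable_attributes(records))  # nps is filled in only 1 of 4 records, so it is excluded
```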
3. Intelligent reasoning
A well-built Agent can answer your question and anticipate what you should ask next. If you ask Operator about escalations spiking, it doesn’t just say, “escalations increased 23% week-over-week.” It’ll continue on to tell you why this happened by examining the escalated conversations and identifying that a disproportionate number involved a specific product area, before moving on to check whether the relevant help content is up to date, and, if it isn’t, proposing an update. That chain of reasoning isn’t prompt engineering. It’s encoded in the skills we’ve built, refined against the patterns we see across our entire customer base.
The action layer
This is where the engineering complexity increases by an order of magnitude because instead of just analyzing problems and recommending solutions, Operator takes action to solve them itself. It can update Guidance rules, draft and publish help articles, create Procedures, configure data connectors, and modify your Fin configuration. Moving from read-only insights to write-capable actions is a fundamentally different class of product and infrastructure problem—one that demands rigorous SRE practices and rock-solid safeguards.
Every one of these actions has to be safe, reversible, and auditable. An analytics tool that occasionally returns a wrong number is frustrating, but an Agent that occasionally applies a wrong configuration change to a live support system is a different category of problem. To prevent this, we built a robust proposal system, whereby every change Operator suggests is presented as a reviewable diff. You see exactly what will change before anything is applied, with the option to accept, reject, or refine. Nothing goes live without your explicit approval.
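The shape of a proposal system can be sketched with Python's standard `difflib`. The `Proposal` class, its `status` values, and the in-memory `store` are assumptions made for illustration; the point is only that the write path refuses anything a reviewer has not accepted.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Proposal:
    """A pending change that is only applied after explicit approval."""
    target: str
    current: str
    proposed: str
    status: str = "pending"  # pending -> accepted | rejected

    def diff(self) -> str:
        lines = difflib.unified_diff(
            self.current.splitlines(),
            self.proposed.splitlines(),
            fromfile=f"{self.target} (live)",
            tofile=f"{self.target} (proposed)",
            lineterm="",
        )
        return "\n".join(lines)


def apply(store: dict[str, str], proposal: Proposal) -> None:
    """Write the change only if a reviewer accepted it."""
    if proposal.status != "accepted":
        raise PermissionError("proposal has not been approved")
    store[proposal.target] = proposal.proposed


store = {"refund-policy": "Refunds take 5 days.\nContact support for help."}
proposal = Proposal(
    "refund-policy",
    store["refund-policy"],
    "Refunds take 3 days.\nContact support for help.",
)
print(proposal.diff())
```

Because the proposal keeps the `current` text alongside the `proposed` text, the same record also serves as the audit trail and the rollback source.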
What else sets Operator apart
A UI that’s both conversational and graphical, not one or the other. Operator blends conversational interaction with purpose-built graphical components. Proposal diffs show exactly what will change in an article. Inline charts visualize performance trends. Dashboards render directly inside the conversation thread. In practice, that means a knowledge manager reviews a structured diff—not a wall of LLM-generated text—and a team lead asking about weekly performance gets an accurate chart with context, not a paragraph approximating data.
Building this hybrid experience is extremely difficult outside of a native platform integration. In a chat interface or CLI, you’re limited to text output; in a standalone dashboard, you lose conversational context. Operator does both in the same thread, so every interaction is detailed and context-rich—and importantly, actionable in the flow of work.
It lives where your team already works. Operator is built into the same platform your team uses every day. It’s not a separate tool with a separate login, nor is it a Slack bot your engineer set up that only three people know about. It operates exactly where you are, alongside the conversations, help center articles, workflows, and data you’re working with. That tight integration closes the gap between finding a problem and fixing it: spot an outdated article while reviewing a Fin conversation, and Operator can surface the fix in the same session. Notice an escalation spike in the morning, and you can ask Operator to investigate without switching tools, waiting for a data pull, or filing a ticket.
The compounding advantage
Every customer using Operator teaches us something. We see which debugging approaches work across different types of support operations, learn which content structures perform better, and identify automation strategies that consistently land. Those patterns get encoded back into Operator’s skills and tools. When we discover that a particular sequence of investigation steps reliably identifies the root cause of a spike in escalations, we build that into Operator’s diagnostic skill. When we find that a specific way of structuring help articles leads to higher Fin resolution rates, we encode that into the content creation skill. Our engineering team is continuously shipping improvements based on what we observe across the entire customer base.
A custom-built solution gives you exactly what you built, meaning it doesn’t get smarter unless you invest engineering resources into making it smarter. And that usually means taking time and talent away from your core product. I’ve watched teams underestimate the ongoing cost of eval-driven development, model upgrades, and API churn—costs that only grow as your footprint expands.
We’re not locking the door
Some teams want to build their own Agents, and some of our most technical customers do. But when you do, you’re working with raw APIs and building your own tooling on top of them. When you use Operator, you’re working with a system that already knows what questions to ask, understands your data, and encodes the best practices we’ve learned from thousands of support teams. We recently launched the Fin CLI, which means you can use third-party agents like Claude Code or Cursor to interact with your Fin data and configuration. That door is open.
What I hope this post has clarified is everything that goes into building Operator: over 50 tools and 10 skills purpose-built for support operations, years of investment in semantic search, deep integration with every layer of Fin’s stack, the proposal system, the intelligence layer, and the reliability infrastructure.
If you’d still like to move ahead with building a custom solution, here’s an honest assessment. You can build a useful read-only tool in weeks. It’ll query your data, summarize tickets, and generate reports, but turning it into a production system will take quarters. Reliability, security, edge case handling, multi-tenant data isolation, and graceful degradation are all important architectural decisions that you’ll need to get right from the start. The action layer is also where you might risk stalling out. Going from “here’s what’s wrong” to safely making changes in a production system is a fundamentally different engineering problem than analysis. Most DIY projects never get there. Finally, you’ll be maintaining it forever. Every model upgrade, API change, and new capability in your support platform means updating your custom tooling. We have a team dedicated to this. You’ll need one too.
The economics still favor buying when a vendor has invested more in the problem than you can justify internally. What I hope this post adds is a clearer picture of what that investment actually looks like from an engineering perspective—and why it compounds into a durable advantage for your support organization.
The investment is ongoing. The problems we’re solving at the infrastructure level today are harder than the ones we solved a year ago, and that trajectory isn’t slowing down. If you’re ready to see the difference a production-grade Agent can make, explore Operator.
Inspired by this post on The Intercom Blog.


Session replay should illuminate user behavior, not slow it down. That belief drove us to rebuild the delivery layer behind our Session Replay from the ground up so it’s lighter on your pages while capturing richer, more reliable signals for behavioral analytics and product insights.
Our objective was clear: preserve page performance and Core Web Vitals while improving data completeness under real-world conditions. We focused on reducing client-side overhead, smoothing network bursts, and scaling the pipeline so it performs consistently during long sessions, high-traffic spikes, and complex interactions—without compromising observability or user experience.
To get there, we redesigned how events flow from the browser to our edge and storage layers. We decoupled capture from delivery, introduced adaptive batching and backpressure-aware controls, tightened compression strategies, and prioritized critical events to reduce jitter and dropped packets. The result is a delivery path that’s resilient to network variance, efficient in payload size, and friendlier to the main thread—key ingredients for platform scalability and SRE-grade reliability.
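To make the ideas of decoupled capture, prioritized events, and adaptive batching concrete, here is a minimal sketch. It is written in Python for readability rather than as browser code, and every name and threshold in it (`EventBuffer`, `capacity`, the halve-on-failure rule) is an assumption for illustration, not a description of the actual delivery layer.

```python
from collections import deque


class EventBuffer:
    """Bounded buffer: capture never blocks, delivery drains in adaptive batches."""

    def __init__(self, capacity: int = 1000, min_batch: int = 10, max_batch: int = 200):
        self.capacity = capacity
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.batch_size = max_batch
        self.critical: deque = deque()
        self.normal: deque = deque()
        self.dropped = 0

    def capture(self, event: dict, critical: bool = False) -> None:
        """Backpressure: when full, shed the oldest non-critical event first."""
        if len(self.critical) + len(self.normal) >= self.capacity:
            self.dropped += 1
            if not self.normal:
                return  # nothing cheaper to shed, so the incoming event is dropped
            self.normal.popleft()
        (self.critical if critical else self.normal).append(event)

    def next_batch(self) -> list[dict]:
        """Critical events are delivered ahead of everything else."""
        batch = []
        while len(batch) < self.batch_size and (self.critical or self.normal):
            source = self.critical if self.critical else self.normal
            batch.append(source.popleft())
        return batch

    def report(self, ok: bool) -> None:
        """Shrink batches quickly after a failed upload, grow them slowly on success."""
        if ok:
            self.batch_size = min(self.max_batch, self.batch_size + self.min_batch)
        else:
            self.batch_size = max(self.min_batch, self.batch_size // 2)
```

The capture path only appends to a queue, so the cost on the page stays constant regardless of how the network behaves; all adaptation happens on the delivery side.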
Here’s a glimpse into how we overhauled Session Replay’s data delivery and what you can expect from it: more complete data and lower payload sizes. In practice, that means steadier capture across long sessions, fewer gaps during rapid DOM changes, and leaner, faster uploads that respect the constraints of modern browsers and mobile networks. It’s an upgrade designed to protect page speed while strengthening the fidelity of what you see in replay.
These changes elevate how product teams, analysts, and support engineers diagnose issues and optimize funnels. With higher-fidelity replay and lighter page impact, you can connect the dots faster—from anomaly detection and conversion bottlenecks to subtle UX friction—within a unified analytics platform. It’s a meaningful step forward for data-driven product strategy and for keeping your observability toolkit both accurate and performance-aware.
While performance guided every decision, privacy and governance stayed first-class. Our delivery patterns work hand-in-hand with data governance practices to help teams maintain responsible capture boundaries while still achieving the completeness and granularity they need. This balance lets you scale replay confidently across surfaces and teams.
We’ll continue monitoring downstream impact across Web Vitals, long tasks, error rates, and event integrity—iterating as we learn. If you rely on session replay to inform roadmaps, triage incidents, or accelerate product-led growth, you should feel the difference: a lighter footprint on your page and a stronger foundation for trustworthy insights.
Inspired by this post on Amplitude – Best Practices.


At Intercom, shipping is our heartbeat. We push code to production hundreds of times a day, and I’ve seen firsthand how that pace sharpens our product instincts and forces clarity in our CI/CD practices.
Engineers, engineering managers, designers, and PMs all contribute to this, safely. The average time from merging code to it running in production is 12 minutes. For me, that’s not just a vanity metric—it’s a DORA-style signal that our release pipeline and observability are aligned with the velocity our customers expect.
I’ve long held a belief that might sound counterintuitive: speed is not the enemy of safety. It’s a prerequisite for it. Accumulating code creates risk. Shipping small batches minimizes it. The faster you ship, the smaller each change is, and the easier it is to catch problems and roll back when something goes wrong, because the context is still fresh in your head. That small-batch discipline underpins how I approach AI workflows and risk management across product teams.
Today, over 93% of our pull requests (PRs) across our two main codebases are Agent-driven. And over 19% are auto-approved with no human reviewer in the loop. When I first saw those numbers at scale, I asked the same question you might be asking: are we trading rigor for speed? The answer lives in the data.
I want to focus on that second number, and why I think it makes us safer. Most people hear “AI is approving our pull requests” and think that’s reckless. I thought so once, too—until I looked at the outcomes that actually matter.
Last year, our CTO Darragh Curran set an explicit goal: double the productivity of our entire R&D organization within 12 months. Because the faster we can build and ship, the faster our customers get the capabilities they need. Ambitious? Absolutely. But the operational clarity that comes from such a target is invaluable for product leaders.
Nine months later, we did it. The results were significant across the board, but here’s the stat that crystallized it for me: downtime from breaking code changes dropped 35%, even as our deployments doubled. Shipping faster made us safer. As we modernize how we build and ship software, we systematically surface bottlenecks and tackle them. One of the biggest we found? PR review.
Humans simply don’t have the time or mental capacity to properly review the volume of AI-generated code we’re now producing. I’ve watched great engineers get stuck in review queues, or worse, feel pressure to rubber-stamp under time constraints—an anti-pattern I’ve battled in multiple orgs.
When an AI Agent can produce a working implementation in minutes, waiting hours or days for a human to review it is an impedance mismatch. The production line is moving faster than the quality gate can keep up. When that happens, one of two things follows: either the queue backs up and velocity drops, or, more dangerously, humans start rubber-stamping. Glancing at a diff, skimming the description, clicking approve. Some companies are drifting into this failure mode silently. We chose to confront it head-on and built a rigorous solution.
PR review, done properly, is complex. A good reviewer evaluates the problem statement, aligns the diff to intent, checks for safety and logical issues, applies deep product context, and scans for performance and anti-patterns. No single human can cover all of that on every PR at high deployment frequency. The truth—borne out by data—is that the human baseline we often assume is stronger than it really is.

So we asked ourselves: what if we could do better?
Our PR review Agent doesn’t treat code review as a single task. It decomposes it into separate sub-jobs, each handled by an independent sub-Agent. One assesses the quality of the problem description. Another checks whether the diff actually aligns with the stated intent. Another reviews for safety concerns. Another checks for logical correctness. Another reviews against best practices and known anti-patterns. And so on. As a product leader, this is exactly the kind of agentic AI architecture I look for: specialized, auditable steps that strengthen the overall control plane.
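The decomposition described above can be sketched as a list of independent checks whose findings are aggregated. In this illustration the checks are trivial rule-based stand-ins (a real sub-Agent would call a model), and the names `PullRequest`, `Finding`, and `review` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PullRequest:
    description: str
    diff: str


@dataclass
class Finding:
    check: str
    passed: bool
    note: str


Check = Callable[[PullRequest], Finding]


def description_quality(pr: PullRequest) -> Finding:
    # Stand-in rule: a real sub-Agent would judge the problem statement itself.
    ok = len(pr.description.split()) >= 5
    return Finding("description", ok, "ok" if ok else "problem statement is too thin")


def safety(pr: PullRequest) -> Finding:
    # Stand-in rule: flag an obviously destructive statement in the diff.
    risky = "DROP TABLE" in pr.diff.upper()
    return Finding("safety", not risky, "destructive SQL in diff" if risky else "ok")


def review(pr: PullRequest, checks: list[Check]) -> tuple[bool, list[Finding]]:
    """Run every check independently; approve only if all of them pass."""
    findings = [check(pr) for check in checks]
    return all(f.passed for f in findings), findings


pr = PullRequest("Fix typo", "- colour\n+ color")
approved, findings = review(pr, [description_quality, safety])
for f in findings:
    print(f.check, "pass" if f.passed else "fail", "-", f.note)
```

Because each check returns its own finding, the record shows which specialist lens blocked a change, not just that it was blocked.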
The result is that every PR is reviewed as if a dozen of our most tenured and knowledgeable engineers were all looking at it simultaneously, each bringing their own specialist lens. In the past, getting that breadth of review on a single PR was impossible. Now it’s the default. And unlike ad hoc human review, this system is consistent and tireless.
A human reviewer typically focuses on the actual code changes, the diff. Our Agent goes deeper. It traces execution paths, following the implications of a change through the codebase. This is something humans rarely had time to do, even when they wanted to.
While testing our new PR review Agent on a set of historical PRs, we found it flagging a one-line text copy change as incorrect. On the surface, it looked completely harmless, just a text update. We assumed it was a mistake, but it wasn’t. Our Agent caught that the new copy contradicted an existing validation mechanism elsewhere in the codebase. No human reviewer would have realistically found this unless they happened to have written that validation code very recently. Our Agent catches this kind of thing consistently, every time, because it’s always tracing execution.
The review isn’t generic either. It’s grounded in Intercom-specific guidance that our engineers have built and continue to refine, encoding the same context, standards, and product knowledge they’d apply if they were reviewing the PR themselves. When the Agent reviews a PR, engineers flag whether the review comments were helpful or not, and that feedback continuously sharpens the guidance. It’s a flywheel: the more our engineers invest in teaching the system how to think about our codebase, the better every subsequent review gets. This is eval-driven development in action.
Automated approval is also never forced. Any engineer can request a human review on any change, at any time. The system is a tool, not a mandate. At Intercom, shipping code doesn’t end at merge. The engineer who ships a change is expected to watch it go live, monitor its behaviour in production, and be ready to roll back if something isn’t right. AI approval doesn’t change that. The human who ships the code remains accountable for the outcome.

The naive take on AI-approved PRs is that it’s just a rubber-stamp LLM call so that humans don’t have to bother. A convenience feature. That misses what’s actually happening. Our Agent is strict. It won’t approve large PRs. If a change is too big, too complex, or too broad in scope, it flags it and requires it to be broken down. That design nudges engineers toward smaller, well-scoped changes—the safest way to ship, review, test, and, if needed, roll back.
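A size gate of this kind is easy to picture. The limits below (10 files, 300 changed lines) are invented for the example; the post does not state what thresholds the Agent actually uses.

```python
def size_gate(
    lines_changed_by_file: dict[str, int],
    max_files: int = 10,    # hypothetical limit
    max_lines: int = 300,   # hypothetical limit
) -> tuple[bool, str]:
    """Return (eligible, reason) for automated approval based on change size."""
    files = len(lines_changed_by_file)
    lines = sum(lines_changed_by_file.values())
    if files > max_files:
        return False, f"touches {files} files (limit {max_files}); split by concern"
    if lines > max_lines:
        return False, f"changes {lines} lines (limit {max_lines}); break into smaller PRs"
    return True, "within auto-approval size limits"


print(size_gate({"app/models/user.rb": 12, "spec/models/user_spec.rb": 30}))
print(size_gate({"app/big_refactor.rb": 450}))
```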
This matters enormously for safety. Small changes are easier to review, easier to test, easier to understand, and, critically, easier to roll back when something goes wrong. This is the same principle that has always underpinned our shipping culture, but now the PR review Agent actively enforces it. As someone who’s owned incident management and SRE partnerships, I can’t overstate how powerful this is.

It’s tempting to look at a goal like “>50% AI-approved PRs” and worry we’re optimizing for a metric rather than an outcome. I see it differently. The real goal is to remove a bottleneck that, if left unchecked, pushes people toward rubber-stamping. By elevating the review bar and keeping batch sizes small, we protect both speed and stability.
We didn’t assume AI review would be good enough; we actively ran an experiment. Our hypothesis was that AI review could match or outperform human review quality, measured by outcomes: were the changes correct? Did they cause problems in production? How quickly were they reviewed and approved?
We started with a controlled pilot of over 100 PRs through the AI approval pipeline. The results: zero reverts of AI-approved PRs, and a 6–16x improvement in time-to-approval at the 75th percentile. Since then, the system has scaled significantly. In the first four weeks of broader rollout, 497 PRs went fully autonomous, with Claude writing the code and our AI approval system reviewing, approving, and shipping to production.

Beyond the approval pipeline itself, we also looked more broadly at how AI-authored code performs in production compared to human-authored code. AI-authored backend code had a revert rate of 0.53%, compared to 5.39% for human-authored. On the frontend, it was 0.22% versus 2.00%.
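To put those percentages side by side, the ratio of the two revert rates can be computed directly from the figures quoted above; the helper name `revert_ratio` is just for this example.

```python
def revert_ratio(ai_rate_pct: float, human_rate_pct: float) -> float:
    """How many times more often human-authored code was reverted."""
    return human_rate_pct / ai_rate_pct


rates = {"backend": (0.53, 5.39), "frontend": (0.22, 2.00)}
for area, (ai, human) in rates.items():
    print(f"{area}: human-authored code was reverted {revert_ratio(ai, human):.1f}x as often")
# backend: human-authored code was reverted 10.2x as often
# frontend: human-authored code was reverted 9.1x as often
```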

AI-authored code, reviewed and approved through our automated pipeline, is being reverted at a fraction of the rate of human-authored, human-approved code. I don’t expect the zero-revert record for AI-approved PRs to last forever, but the evidence shows the quality bar our Agent holds is at least as high as the one humans were holding, and in many cases higher. And here’s the humbling perspective: the product changes that caused outages in the past? They were all reviewed and approved by humans. Human review is not a guarantee of safety. It never was.
Everything I’ve described—the sub-Agent architecture, the traceability, the labeling, the data—wasn’t just built for speed. It was built for auditability. Every AI-approved PR is labelled, logged, and queryable. The review comments, the approval decision, the test results, the merge event: all recorded. The evidence an auditor expects to see is the same whether a human or an AI approved the change. The “who” may change, but the “what” doesn’t. That’s how you meet SOC 2, HIPAA, ISO 27001, ISO 42001, and AIUC-1 without compromising agility.
We engaged our auditors, Schellman, early, before we scaled. We proactively worked with them to confirm that our automated review processes and the evidence they produce meet the requirements of our compliance frameworks, including SOC 2, HIPAA, ISO 27001, ISO 42001, and AIUC-1, among others. We think AI-driven change management can meet and exceed the standards that human-driven processes set, and we want to help prove that. In my experience, when you build for safety, compliance follows—never the other way around.
You can only go so far with PR review as a safety mechanism, no matter how good the reviewer is, human or AI. Only in production do you discover the unknown unknowns. The majority of Intercom’s largest outages weren’t even caused by changes to product code at all. They were infrastructure issues, unanticipated customer usage patterns, or third-party outages. PR review, whether human or AI, was never going to catch those. That’s why, in parallel, we’re also working on an Agent that proactively diagnoses issues in production. We’ll share more on this soon.
Speed has always been at the core of how we build at Intercom, not in spite of safety, but because of it. And we’re getting even faster with AI. It’s easy to assume that AI-approved PRs would lead to a drop in quality and safety, but our data shows otherwise. Our heartbeat is just getting stronger. For product leaders, this is the blueprint: pair agentic AI with small batches, robust observability, and clear accountability, and you make shipping both faster and safer.
Inspired by this post on The Intercom Blog.


Shipping agentic AI into production is exhilarating until a flaky output torpedoes trust. Over the past year, I’ve led teams at HighLevel to operationalize agents across customer-facing and internal workflows, and I’ve learned that reliability isn’t an afterthought; it’s an architecture. In this piece, I share the agent orchestration patterns that consistently deliver dependable outcomes at scale.
When we talk about orchestration, we’re talking about more than a single prompt. The shift is from monolithic calls to coordinated “agentic AI” where routers, planners, and specialists collaborate through structured “AI workflows.” In practice, I rely on a few canonical patterns: a planner–executor loop for multi-step tasks, a router–specialist setup for skill selection, and a “retrieval-first pipeline” that grounds generation with authoritative context before a single token is produced.
Reliability-by-design starts with typed inputs/outputs and strict validation. I standardize on JSON schemas, enforce tool/function signatures, and implement idempotency keys so retries don’t wreak havoc on downstream systems. Timeouts, circuit breakers, and backpressure protect the platform under load, while rate limiting and dead-letter queues keep failure modes contained. Most importantly, we engineer graceful degradation: agents “abstain” when uncertain, fall back to deterministic paths, and escalate to humans instead of guessing.
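Two of these ideas, strict input validation and idempotency keys, fit in a short sketch. It uses a plain type map instead of a full JSON Schema validator and an in-memory dictionary instead of a durable store; `IdempotentExecutor`, `validate`, and `send_refund` are hypothetical names.

```python
def validate(args: dict, schema: dict[str, type]) -> None:
    """Reject calls whose arguments do not match the declared tool signature."""
    unexpected = set(args) - set(schema)
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    for name, expected in schema.items():
        if not isinstance(args.get(name), expected):
            raise ValueError(f"{name!r} must be {expected.__name__}")


class IdempotentExecutor:
    """Runs a side-effecting tool at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}  # a real system would persist this

    def call(self, idempotency_key: str, tool, args: dict, schema: dict[str, type]):
        validate(args, schema)
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: replay the stored result
        result = tool(**args)
        self._results[idempotency_key] = result
        return result


sent = []


def send_refund(order_id: str, amount_cents: int) -> dict:
    sent.append(order_id)  # the side effect we must not repeat
    return {"order_id": order_id, "refunded": amount_cents}


executor = IdempotentExecutor()
schema = {"order_id": str, "amount_cents": int}
args = {"order_id": "o-42", "amount_cents": 1500}
first = executor.call("req-001", send_refund, args, schema)
retry = executor.call("req-001", send_refund, args, schema)
```

The caller supplies the key once per logical request, so a retry after a timeout replays the recorded result instead of issuing a second refund.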
Safety is a first-class concern, not a bolt-on. Our “AI risk management” pipeline includes PII redaction, allow/deny lists for tools and data, and the principle of least privilege for every connector (yes, even the ChatGPT connector). We codify policy-as-code for repeatability and require human-in-the-loop approvals for sensitive or irreversible actions. In my experience, clear red lines and reversible defaults prevent the vast majority of regrettable outcomes.
Without strong “observability,” you’re flying blind. I instrument agents with an “Agent Analytics” layer that captures traces, spans, tool invocations, and token usage across the entire chain. The essential metrics are outcome quality (task success rate), latency (p50/p95), tool failure rates, cost per task, and user-level satisfaction signals. Cross-agent lineage allows us to pinpoint where a plan went awry and which tool or prompt introduced drift—vital for rapid remediation.
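As a rough sketch of how those metrics roll up from per-task traces, the function below uses the standard library's `statistics.quantiles`. The record fields (`ok`, `latency_ms`, `cost_usd`, `tool_errors`, `tool_calls`) are assumed names for the example, and it needs at least two runs to compute percentiles.

```python
import statistics


def summarize(runs: list[dict]) -> dict:
    """Aggregate per-task traces into the headline reliability metrics."""
    latencies = [r["latency_ms"] for r in runs]
    # n=100 yields 99 cut points; index 49 is the median and index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    tool_calls = sum(r["tool_calls"] for r in runs)
    return {
        "task_success_rate": sum(r["ok"] for r in runs) / len(runs),
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "tool_failure_rate": sum(r["tool_errors"] for r in runs) / tool_calls,
        "cost_per_task_usd": sum(r["cost_usd"] for r in runs) / len(runs),
    }


runs = [
    {"ok": i not in (3, 7), "latency_ms": 100 * (i + 1), "cost_usd": 0.02,
     "tool_calls": 4, "tool_errors": 1 if i == 3 else 0}
    for i in range(10)
]
summary = summarize(runs)
```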
Quality improves fastest when it is measured relentlessly. I practice “eval-driven development” with golden datasets, rubric-based scoring, and risk-weighted sampling of edge cases. LLM-as-judge can help, but we always calibrate against human ratings and monitor agreement. In production, I blend online metrics with controlled “A/B testing” and plan experiments to hit a realistic minimum detectable effect (MDE). The result is a virtuous loop where prompt tweaks, tool changes, and retrieval adjustments are verified before wide rollout.
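One common way to monitor agreement between an LLM judge and human raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text above does not name a specific statistic, so treat this as one reasonable choice; the labels are made up.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "fail"]
kappa = cohens_kappa(human, judge)
print(f"raw agreement 75%, kappa {kappa:.2f}")
```

Here the raters agree on 6 of 8 items, but because both say "pass" most of the time, about 53% agreement would be expected by chance, which is why kappa lands well below the raw figure.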
Agents need the same rigor we expect from any modern system. I gate releases through CI/CD with linting for prompts, schema checks for tools, and simulation runs for critical paths. Feature flags enable shadow and canary deployments so we can throttle exposure by segment or workflow. I also track delivery performance with DORA metrics, deployment frequency among them, and I partner closely with SRE for on-call coverage, runbooks, and incident postmortems tailored to agent failure modes.
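A canary rollout behind a feature flag usually comes down to deterministic bucketing: hash a stable identifier and compare it with the rollout percentage. The sketch below assumes SHA-256 bucketing into 10,000 slots; the flag name and unit IDs are hypothetical.

```python
import hashlib


def in_canary(flag: str, unit_id: str, percent: float) -> bool:
    """Deterministically place a user or workspace inside or outside a rollout."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable slot in [0, 9999]
    return bucket < percent * 100          # e.g. 5% covers slots 0..499


exposed = [u for u in (f"workspace-{i}" for i in range(1000))
           if in_canary("new-planner-prompt", u, 5)]
print(f"{len(exposed)} of 1000 workspaces see the canary")
```

Including the flag name in the hash keeps cohorts independent across flags, and because a unit's slot never changes, raising the percentage only ever adds units to the rollout.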
Context is a resource to allocate, not a bottomless pit. Thoughtful “context window management” means curating retrieval, summarizing long-running threads, setting memory time-to-live, and constraining what the agent can see at any given step. I bias hard toward retrieval over recall, keep chunks small and semantically precise, and validate that the “retrieval-first pipeline” truly returns the right evidence—not just the nearest match.
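Treating context as a budget can be sketched as two small filters: expire old memories, then pack the highest-scoring retrieved chunks until the budget is spent. Token cost is approximated by word count here, which is a simplifying assumption; a real system would use the model's tokenizer.

```python
def fresh(memories: list[dict], now: float, ttl_seconds: float) -> list[dict]:
    """Drop memories older than their time-to-live."""
    return [m for m in memories if now - m["created_at"] <= ttl_seconds]


def pack_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Greedily keep the best-scoring chunks that still fit in the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"].split())  # rough stand-in for a token count
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    {"text": "refunds take three", "score": 0.9},
    {"text": "shipping is free over fifty", "score": 0.8},
    {"text": "contact support", "score": 0.5},
]
picked = pack_context(chunks, budget_tokens=5)
memories = [{"note": "prefers email", "created_at": 100.0},
            {"note": "asked about refunds", "created_at": 990.0}]
recent = fresh(memories, now=1000.0, ttl_seconds=60.0)
```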
In day-to-day product work, I lean on a compact playbook: a router that selects the best specialist; a planner that decomposes tasks and allocates tools; a deterministic guard that verifies preconditions; an execution loop with explicit budgets; and a fallback policy that prefers abstaining over hallucinating. Together, these patterns create an agent that behaves like a dependable teammate rather than a creative wildcard.
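The playbook can be compressed into one control loop. In this sketch the router is a keyword match, the specialist is a stub that returns a fixed answer with rising confidence, and the threshold and step budget are arbitrary; all of these are assumptions standing in for model calls and tuned values.

```python
from typing import Callable, Optional

# A specialist returns (answer, confidence in [0, 1]) for a task and attempt number.
Specialist = Callable[[str, int], tuple[str, float]]


def billing_specialist(task: str, attempt: int) -> tuple[str, float]:
    # Hypothetical stand-in for a model call; confidence improves on retry.
    return "Invoice resent to the account owner.", 0.5 + 0.2 * attempt


def route(task: str, specialists: dict[str, Specialist]) -> Optional[Specialist]:
    """Router: pick the specialist whose keyword appears in the task."""
    for keyword, specialist in specialists.items():
        if keyword in task.lower():
            return specialist
    return None


def run(task: str, specialists: dict[str, Specialist],
        max_steps: int = 3, min_confidence: float = 0.8) -> dict:
    if not task.strip():  # deterministic guard on preconditions
        return {"status": "rejected", "reason": "empty task"}
    specialist = route(task, specialists)
    if specialist is None:
        return {"status": "escalated", "reason": "no specialist matched"}
    for attempt in range(max_steps):  # explicit step budget
        answer, confidence = specialist(task, attempt)
        if confidence >= min_confidence:
            return {"status": "answered", "answer": answer, "attempts": attempt + 1}
    # Fallback policy: abstain rather than return a low-confidence answer.
    return {"status": "abstained", "reason": "confidence stayed below threshold"}


specialists = {"billing": billing_specialist}
print(run("Resend billing invoice for March", specialists))
print(run("Resend billing invoice for March", specialists, max_steps=2))
```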
No architecture thrives without the right rituals. Product trios keep discovery continuous, while clear outcomes (not output) align teams on value instead of vanity. We map risks early, maintain a public quality dashboard, and rehearse failure recoveries so incidents never become improvisations. The cultural signal is simple: we celebrate root-cause clarity and safe iteration over heroics.
If you’re just starting, implement three patterns first: retrieval before generation, abstain-and-escalate for low confidence, and canary releases under feature flags. Instrument everything from day one, run a weekly eval review, and expand scope only when the data says you’re ready. With these habits, your agents will earn user trust—and keep it.
Inspired by this post on Product School.


Where is the true boundary between product and engineering—and what happens when it gets blurry? I’ve led and coached teams through this question many times, and I’ve learned that clarity here isn’t just a nice-to-have; it’s foundational to quality, velocity, and team health.
I’ve seen well-intentioned product managers step in to “help” by taking ownership of bug triage, tech debt prioritization, or even system architecture. At first, it feels productive. Over time, it creates role confusion, slows decision-making, and burns out PMs—while paradoxically lowering engineering quality. The “CEO of the product” myth and legacy IT, project-based mindsets are usually at the root. Treating engineers as “order takers” breaks down in evergreen product environments.
The healthiest collaboration model is simple and disciplined: The product trio owns the “what”; engineering owns the “how”. Product managers are not people managers for engineers—and shouldn’t be accountable for engineering quality. Our job is to frame the problem, align on outcomes, and continuously discover value with customers—not to supervise technical execution.
If quality is a problem, the solution is to escalate and fix the system, not to manage individual bugs. In practice, that means surfacing patterns and elevating them to engineering leadership, who can address root causes (staffing, skills, code health, CI/CD gaps, observability, or process design) rather than asking PMs to paper over issues with status updates. This keeps accountability where it belongs and reinforces OKRs that measure outcomes rather than output.
One high-leverage move is to remove unnecessary intermediaries. Removing the PM as a middleman creates better flow and clearer ownership. Create direct paths for stakeholders to get bug status without routing everything through product. Use dashboards, shared tools, or Slack channels instead of one-off updates. In my teams, shared Jira views, Slack incident channels, and status pages eliminated handoffs, improved stakeholder management, and gave engineers the space to solve problems end-to-end.
Strong engineering leadership is non-negotiable. It should own the technical system, quality guardrails, sustainable pace, and the practices that uphold them: incident management, code review rigor, test coverage, and SLOs defined with SRE. Skilled engineering teams naturally push back when boundaries are crossed, and that’s a good thing. It signals ownership, craft pride, and a pathway to durable execution.
When do I step in as product? Primarily to clarify desired outcomes, sequencing, and trade-offs—bringing customer and business context to the table. I structure product roadmapping and sprint planning around value slices and risks, not task lists. I align on decision rights early: architecture and tech debt strategies live with engineering; product strategy, positioning, and success metrics live with product; discovery and prioritization live with the product trio.
Here are the system-level moves I’ve found most effective: Escalate systemic quality issues to engineering leadership, not individual contributors. Advocate for real engineering leadership if your org expects product teams—not IT teams. Then reinforce a culture of continuous discovery so product, design, and engineering make better upstream decisions together. This is how empowered product teams ship higher-quality outcomes—without burning anyone out.
If you’ve ever found yourself acting as the middleman for bug status or being asked to “own” engineering decisions outside your expertise, you’re not alone. Reset the boundaries, make work visible, and double down on shared outcomes. In my experience, the moment we clarify roles and remove status theater, quality rises, cycle time improves, and everyone does the job they were hired to do—better.
Inspired by this post on Product Talk.


Scaling AI Visibility pushed me to rethink what “reliable” really means for AI infrastructure. As my team expanded usage across more datasets, models, and workflows, we uncovered unexpected sources of report failure and built the guardrails, observability, and processes that now anchor our stability strategy.
In practice, the surprising failure modes were rarely the loud ones. We saw report failure triggered by small schema drift from non-deterministic LLM outputs, silent permission changes in upstream data sources, token-limit truncation that broke downstream parsing, third-party API rate limits that surfaced only under bursty load, and clock skew that confused idempotent writes. Individually these issues looked minor; together they created reliability debt.
Our first move was deep observability. We instrumented the end-to-end pipeline with structured logs, distributed tracing, and high-signal metrics mapped to SLOs and error budgets. That visibility let us separate symptom from cause, quantify impact by segment, and prioritize fixes that moved business outcomes, not just vanity thresholds. It also gave product managers and SREs a shared, real-time view to make tradeoffs explicit.
Next, we hardened the runtime with resilience patterns: circuit breakers on flaky dependencies, timeouts tuned to p95 behavior, retries with jittered backoff, idempotent processing for at-least-once delivery, and backpressure-aware queues. We enforced schema contracts at ingestion with JSON validation and added feature flags to decouple deploys from releases, so we could roll forward or back within minutes when signals degraded.
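Two of these patterns, retries with jittered backoff and a circuit breaker, fit in a short sketch. This is an illustration under simplifying assumptions rather than the implementation described above: the delays and thresholds are arbitrary, and the breaker omits the timed half-open state a production version would need to recover on its own.

```python
import random
import time

def retry_with_jitter(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: uniform delay in [0, cap)

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then fails fast."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: dependency skipped")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Injecting `sleep` and `rng` keeps the retry helper testable without real waiting; the jitter spreads retries so clients do not hammer a recovering dependency in lockstep.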
On the product side, we adopted eval-driven development for model and prompt changes, shifting risky modifications behind canaries and staged rollouts. CI/CD gates required evaluation baselines to hold or improve before promotion. We tracked DORA metrics to keep deployment frequency high without sacrificing change failure rate, and we used P95 latency and budget burn as the forcing functions for prioritization.
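A promotion gate of this kind can be as simple as comparing candidate eval scores with a stored baseline. The sketch below assumes higher-is-better metrics and hypothetical metric names; a missing metric counts as a regression so a change cannot pass by silently dropping an eval.

```python
def promotion_allowed(baseline, candidate, tolerance=0.0):
    """Return (ok, regressions) for per-metric eval scores, higher is better."""
    regressions = [
        name for name, base in baseline.items()
        # A metric absent from the candidate run is treated as a regression.
        if candidate.get(name, float("-inf")) < base - tolerance
    ]
    return (not regressions, sorted(regressions))
```

In a CI job, a `False` result would fail the build and list the regressed metrics, so the author sees exactly which baseline did not hold.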
Culture mattered as much as code. We formalized incident management with clear ownership, lightweight runbooks, and blameless reviews that produced crisp, automatable actions. We partnered early with SRE on SLO design, integrated privacy-by-design and PII scanning into the pipeline, and treated AI risk management as an ongoing product constraint rather than a checkbox.
The net effect: fewer flaky reports, faster recovery when things do break, and far more confidence to ship improvements to AI Visibility at pace. If you’re scaling similar capabilities, start with observability, make resilience patterns non-negotiable, and let SLOs guide your product roadmap. Reliability is not a phase—it’s the product.
Inspired by this post on Amplitude – Best Practices.


“Speed is not the enemy of safety; it is the prerequisite for it.” I live by this principle. In our organization, the average time from merging code to it being used by customers in production is just 12 minutes, and that short window is fundamental to how we build, ship, and learn.
In January 2026, we are averaging 180 ships per workday, roughly 20 deployments every hour. Conventional wisdom suggests that to increase stability, you must slow down. I believe the opposite: accumulating code creates risk; shipping small batches minimizes it. Shipping is our company’s heartbeat.
Maintaining this frequency while targeting 99.8+% availability has required over a decade of focused investment in systems, principles, and processes. We protect the integrity of our systems through three layers of defense: an automated pipeline that is simple, reliable, and removes the need for manual intervention; a shipping workflow that promotes ownership and uses guardrails as accelerants; and a recovery model that optimizes for mitigating inevitable failures. Here’s how we’ve built each layer so that velocity is our greatest source of stability.
While our platform consists of various services and frontend applications, I’ll focus here on our Ruby on Rails monolith. It is our core application and the one we deploy most frequently; we also deploy it to three different data‑hosting regions with independent pipelines. Our other services follow similar pipeline principles and safeguards, but the Rails monolith is the clearest example of how we ship at scale.
The automated pipeline is designed to move code from merge to production as fast as possible while enforcing strict safety checks. It is fully automated, and the vast majority of releases require no human intervention—critical for CI/CD at high deployment frequency.
Once an engineer merges code to GitHub, two things happen immediately. First, the build: we compile the Rails application and its dependencies into a deployable asset (a slug) in about four minutes. Second, parallel CI: our test suite runs alongside the build; through extensive optimization, parallelization, and test selection, the vast majority of CI builds finish in under five minutes.
As soon as the slug is built, it’s deployed to a pre‑production environment. CI does not block the progression of the slug to pre‑production. Deploying to pre‑production takes around two minutes. This environment serves no customer traffic, but it is connected to our production datastores, mirrors our production infrastructure variants (e.g., web serving, asynchronous worker), and is configured so that requests exercise the pre‑release code and workers.
Immediately after deployment, we run and await several automated approval gates. We verify that the application boots cleanly on hosts (boot test), confirm the parallel test suite passed (CI check), and execute functional synthetics using Datadog Synthetics on critical flows—such as loading or editing a Fin workflow. If any gate fails, the release is halted and does not go to production.
Once approved, we promote the code to thousands of large virtual machines. A deployment orchestrator triggers these deployments simultaneously, while a decentralized, staggered rollout avoids changing the state of the entire fleet at the same millisecond. Within each machine, a rolling restart mechanism removes a process with old code from the serving path, lets it drain gracefully, and replaces it with a fresh process running the new code. From the moment a deployment starts, first requests are served by new code within roughly two minutes, and the vast majority of the global fleet updates transparently within six minutes. Once restarts have been triggered on every machine, the production stage unblocks so the next deployment can begin.
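To make the two mechanisms concrete, here is a toy model of a staggered start across hosts and a rolling restart within one host. It is a sketch under assumptions, not the orchestrator described above: the random offset per host and the one-process-at-a-time replacement are illustrative choices, and draining is reduced to swapping a version label.

```python
import random

def stagger_offsets(hosts, window_s, seed=None):
    """Give each host its own random start delay inside the window."""
    rng = random.Random(seed)
    return {host: rng.uniform(0, window_s) for host in hosts}

def rolling_restart(processes, new_version):
    """Replace processes one at a time so the others keep serving."""
    current = list(processes)
    for i in range(len(current)):
        # Stand-in for: remove from serving path, drain, start fresh process.
        current[i] = new_version
        yield list(current)
```

Because each host picks its own delay, no central component has to sequence thousands of machines, yet the fleet never flips state at a single instant.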
We treat a stalled pipeline as a high‑priority incident. If the automated system rejects three consecutive release attempts, it pages an on‑call engineer. These are pre‑production blocks, but if the shipping lane stops moving, changes pile up—and our stability relies on building and shipping in small steps. The on‑call’s job is to restore flow so that tiny, safe, frequent updates continue to keep risk low.
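The gate-then-page behavior can be sketched as a small state machine. The class, gate names, and `page_on_call` hook below are hypothetical; only the rules come from the text: any failed gate halts the release, and three consecutive rejections page the on-call engineer.

```python
class ReleasePipeline:
    """Runs approval gates and pages after repeated rejections."""

    def __init__(self, gates, page_on_call, max_consecutive_rejections=3):
        self.gates = gates                # list of (name, check) pairs
        self.page_on_call = page_on_call  # hypothetical paging hook
        self.max_rejections = max_consecutive_rejections
        self.consecutive_rejections = 0

    def attempt_release(self, release_id):
        for name, check in self.gates:
            if not check(release_id):
                self.consecutive_rejections += 1
                if self.consecutive_rejections == self.max_rejections:
                    self.page_on_call(f"pipeline stalled at gate {name!r}")
                return False  # halted: the release never reaches production
        self.consecutive_rejections = 0  # a promoted release restores flow
        return True
```

Counting consecutive rejections rather than total failures keeps the page tied to a blocked shipping lane, not to the occasional bad release.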
Our shipping workflow is built on extreme ownership: tools assist, but the engineer is accountable for quality and the decision to merge. I insist that you are present when you ship. The practical benefit of a 12‑minute deployment cycle is that engineers remain in the zone, focused on the problem they just solved, and ready to validate behavior as it goes live.

To support this, our deployment system sends Slack notifications the moment code is submitted and as it advances through stages, embeds direct observability links to relevant dashboards and logs in every PR and message, and prompts verification so engineers actively watch the dials and test features in production. It is not acceptable to rely on green builds. You’re expected to watch your change go live, and if you’re not prepared to roll back, you’re not prepared to ship. We maintain a no‑blame culture: quick rollbacks and immediate reverts are signs of vigilance and ownership, not failure.
We make extensive use of feature flags to turn deployment into a non‑event. By decoupling deployment (moving code to servers) from release (turning features on), we shrink the blast radius of change. Flags can be enabled for all customers, a specific subset, or disabled for everyone in under 60 seconds through our backend UI. Engineers can group flags into beta features and run phased rollouts; we also ensure flags work consistently across non‑monolith applications. In the past three months, we created over 560 flags—and we actively manage them to avoid permanent complexity.
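A flag check with the three states described (on for everyone, on for a subset, off for everyone) might look like the sketch below. The flag dictionary shape, the `allowlist` and `percent` fields, and the hash-based bucketing are assumptions for illustration, not the backend described above.

```python
import hashlib

def bucket(flag_name: str, customer_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag and customer."""
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: dict, customer_id: str) -> bool:
    if flag["state"] == "off":
        return False
    if flag["state"] == "on":
        return True
    # "subset": explicit allowlist first, then a percentage rollout.
    if customer_id in flag.get("allowlist", ()):
        return True
    return bucket(flag["name"], customer_id) < flag.get("percent", 0)
```

Hashing the flag name together with the customer id keeps each customer's experience stable across requests while giving different flags independent rollout cohorts.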
For complex refactors, especially when behavior should not change, we use Scientist, GitHub’s open‑source experimentation library. It runs candidate logic (new code) alongside existing logic (old code) in production, instruments both paths for result and timing comparisons, and keeps existing behavior user‑visible. That means we can iterate on and validate new code under real load without risking the experience, then switch seamlessly when confident.
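The pattern itself is easy to see in miniature. The following is a Python sketch of the idea, not the Scientist API (which is a Ruby library): the `experiment` function and the shape of the published report are hypothetical, and the two paths run one after the other here rather than concurrently.

```python
import time

def experiment(control, candidate, publish):
    """Run both paths, publish the comparison, always return control's result."""
    start = time.perf_counter()
    control_result = control()
    control_s = time.perf_counter() - start

    candidate_result, candidate_error = None, None
    start = time.perf_counter()
    try:
        candidate_result = candidate()
    except Exception as exc:  # candidate failures must never reach the user
        candidate_error = exc
    candidate_s = time.perf_counter() - start

    publish({
        "matched": candidate_error is None and candidate_result == control_result,
        "control_s": control_s,
        "candidate_s": candidate_s,
        "candidate_error": repr(candidate_error) if candidate_error else None,
    })
    return control_result
```

The caller only ever sees the control's return value, so a mismatch or crash in the new code shows up in the published report instead of in the customer experience.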
When engineers need to go deeper before merging, they can generate a slug and deploy it to a virtual machine, detaching a running production host from the serving path and connecting for manual testing. They can also put a pre‑release slug on a serving machine that handles a small percentage of jobs or web requests. Single‑host validation lets us slice observability to those hosts, compare against the main release, and make low‑level changes safer. Staging is a simulation; production is reality. Testing on a single production host validates assumptions with real‑world data without risking the fleet.
Our recovery model starts from a simple principle: stop monitoring systems; start monitoring outcomes. Traditional monitoring tells you if a server is healthy; we care whether customers are healthy. We rely on heartbeat metrics—vital signs that represent the core value our product provides—such as the rate at which messages and comments are created.
Unlike standard uptime checks, heartbeat metrics are binary in spirit. If message send rates dip below baseline, it does not matter if infrastructure dashboards are green. Down is down, and if customers can’t do their job, uptime percentages are irrelevant. By tracking real‑world success rates as a high‑level signal, we catch subtle degradations that traditional alerting either misses or over‑alerts on.
Because we ship in small, incremental steps and maintain previous releases on our virtual machines, our Time to Recover (TTR) is generally very fast. If a heartbeat metric drops or a critical anomaly is detected right after a ship, the system can trigger an automatic rollback, reverting to the release that was running 20 minutes ago—often restoring service before an engineer responds. For complex issues, engineers can initiate a manual rollback through our deployment UI; doing so also locks the production pipeline to prevent further releases while we investigate and remove problematic code.
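The rollback decision can be sketched as two small functions: one that decides whether a heartbeat dip shortly after a ship warrants action, and one that finds the release that was live a fixed interval ago. The drop ratio, watch window, and data shapes are assumptions chosen for illustration; only the 20-minute lookback comes from the text.

```python
def should_roll_back(recent_rates, baseline, seconds_since_ship,
                     drop_ratio=0.5, watch_window_s=600):
    """True when every recent sample is below the floor soon after a ship."""
    if not recent_rates or seconds_since_ship > watch_window_s:
        return False
    floor = baseline * drop_ratio
    # Requiring all samples below the floor avoids reacting to a single blip.
    return all(rate < floor for rate in recent_rates)

def release_running_at(releases, now_s, lookback_s=20 * 60):
    """Return the id of the release that was live `lookback_s` seconds ago.

    `releases` is a list of (deployed_at_seconds, release_id) pairs.
    """
    cutoff = now_s - lookback_s
    live = [rid for deployed_at, rid in sorted(releases) if deployed_at <= cutoff]
    return live[-1] if live else None
```

Keeping the trigger tied to a customer-facing rate, rather than host health, is what lets the rollback fire even when infrastructure dashboards look green.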
Resumption of service is not the end. Every incident prompts an incident review, and we don’t just fix the bug. We ask, “How did the machine allow this to happen?” Then we harden the system so it cannot happen again. This loop—fast shipping, fast recovery, rigorous learning—compounds resilience over time.
This operating model aligns to DORA metrics: high deployment frequency, short lead time for changes, low change failure rate, and rapid time to restore service. It’s a CI/CD and SRE‑informed approach that converts speed into a defensive advantage rather than a liability.
Shipping 180 times a day isn’t a vanity metric; it’s a deliberate choice to protect the customer experience. With a 12‑minute window from code to customer, the feedback loop is tight and engineers retain context—and accountability—for the immediate impact of their work. Maintaining this pace requires more than fast CI; it requires judgment, extreme ownership, disciplined use of feature flags, and a recovery model that monitors outcomes. We rely on human expertise, augmented by these layers of defense, to catch issues before they turn into customer pain. We don’t ship fast despite our need for stability; we ship fast to stay in control of change.
Inspired by this post on The Intercom Blog.


I build and scale analytics platforms with a product mindset, and the work starts with the "middleware and compute systems that power analytics at scale." In platforms like Amplitude analytics and other unified analytics platform architectures, that foundation is what makes everything else possible.
Day to day, I oversee the "APIs behind charts, cohorts, and metrics—driving performance, reliability, and platform scalability." When those APIs are fast and resilient, every product team—from growth to customer success—can trust the insights they use to ship, learn, and iterate.
From an engineering leadership standpoint, I partner closely with SRE to define SLOs and error budgets, wire CI/CD pipelines for safe deploys, and track DORA metrics so we improve speed without compromising quality. This combination reduces incident management toil and shortens MTTR while keeping data freshness and query latency within strict thresholds.
From a product management leadership lens, the goal is clarity: crisp APIs, predictable contracts, and transparent stakeholder management across data, engineering, and GTM teams. That alignment empowers product teams with reliable cohorts and metrics, accelerates experimentation, and de-risks roadmaps.
If you’re scaling analytics, invest first in the platform layer: middleware and compute, schema governance, caching strategies, and cost-aware compute. Do that well, and the visible experience—charts, cohorts, and metrics—feels effortless, even as you grow to serve billions of events with confidence.
Inspired by this post on Amplitude – Best Practices.


I’ve spent years helping talented engineers explore what’s next when pure coding no longer feels like the only—or best—path. From hiring across cross-functional teams to mentoring career pivots, I’ve seen firsthand how engineering strengths translate into high-leverage roles that shape product, strategy, and growth.
Software engineers can apply their skills in many alternative roles, including product manager, data scientist, and business analyst.
When an engineer moves into product management, they’re not starting from scratch; they’re redirecting problem-solving, systems thinking, and customer empathy toward outcomes. In practice, that means mastering product discovery, strengthening stakeholder management, and getting fluent in product roadmapping and sprint planning, so decisions are guided by outcomes rather than output. I’ve watched this transition unlock empowered product teams and clearer prioritization across complex backlogs.
Data-oriented paths are equally compelling. If you enjoy experimentation and evidence-based decisions, roles in analytics or data science reward rigor. Think A/B testing, identifying the minimum detectable effect (MDE), and using tools like Amplitude analytics to translate behavioral signals into product bets. Pair that with retention analysis and you’ll become indispensable to growth conversations.
Business-facing roles such as business analyst or product marketing manager are ideal if you’re energized by customer problems and market narratives. Your engineering fluency sharpens value propositions, product positioning, and go-to-market strategy in a way that resonates with both buyers and builders. In my teams, the best bridges between product and revenue often came from former engineers who could articulate trade-offs with clarity.
If operational excellence is your edge, consider SRE, DevOps, or cybersecurity. The same instincts that push you toward clean CI/CD pipelines and resilient architectures translate well into incident management, threat detection and response, and privacy-by-design practices. These roles reward systems thinking and the ability to balance reliability with delivery speed.
For engineers who love community and storytelling, developer evangelism is a natural fit. You’ll translate complex concepts into actionable guidance, from in-app guides and product tours to UX writing and documentation. The best evangelists I’ve worked with turn feedback loops into product insight, strengthening activation and product-led growth without heavy sales pressure.
Customer-facing technical roles—solutions engineer, forward deployed engineer, or technical consultant—let you stay close to the product while solving real-world problems. You’ll drive onboarding quality, user activation, and adoption while surfacing insights that influence roadmaps. Done well, this work tightens the loop between customer outcomes and product decisions.
AI-centered roles are expanding rapidly. If you’re curious about AI Strategy, retrieval-first pipelines, or the practical use of LLMs for product managers, you can bring an engineer’s discernment to a noisy space. The most valuable contributors here pair pragmatic architecture choices with clear risk management and measurable business value, not hype.
Leadership tracks remain a strong option too. The IC to manager transition isn’t about title; it’s about raising the ceiling for others. You’ll coach empowered product teams, shape organizational development, and align initiatives to defensible metrics—think DORA metrics for flow, leading indicators for value, and OKRs that measure outcomes over output.
If you’re exploring a pivot, start small and intentional. Run “career A/B tests” by taking on cross-functional projects, shadowing adjacent roles, or shipping a lightweight portfolio that demonstrates the new muscle. Join a ProductCon session, practice conference networking, and refine a narrative that links your engineering foundation to the outcomes your target role owns.
Finally, map your personal unfair advantages—domain knowledge, systems thinking, customer empathy, or operational rigor—to the roles that value them most. With focus, you can reposition your engineering experience into a differentiated story that accelerates your next chapter. The breadth of options is real, and with a deliberate plan, you’ll turn curiosity into conviction—and conviction into impact.
Inspired by this post on Product School.


Will AI replace software engineers or reshape their roles? Explore risks, opportunities, and alternative career paths in tech.
I’m often asked whether AI will make software engineers obsolete. My short answer: AI is already automating tasks, not eliminating the role. The engineers who learn to orchestrate models, systems, and stakeholders will create more value—not less. The real shift is from keystrokes to judgment, from writing code to designing socio-technical systems that deliver outcomes.
Today’s gen AI assistants (think Claude Code and the ChatGPT connector) excel at unit test scaffolding, boilerplate generation, refactoring, docstrings, and code search. When integrated into CI/CD, they can open draft pull requests, annotate diffs, and propose fixes. This lifts developer productivity and frees time for higher-leverage work: problem framing, architecture decisions, and customer discovery.
What changes in the role? We spend more cycles on product discovery, privacy-by-design, and AI Strategy, and fewer on repetitive implementation. We design agentic AI workflows that combine retrieval, tools, and guardrails; we evaluate trade-offs that blend performance, cost, and safety; and we partner with empowered product teams to ship the smallest valuable slice, learn, and iterate.
Measure what matters. If AI is working, DORA metrics should improve: higher deployment frequency, shorter lead time for changes, stable change failure rate, and faster MTTR. Pair that with OKRs that measure outcomes rather than output to avoid gaming the system: shaving seconds off a build is meaningless if it doesn’t move activation, retention, or revenue. A unified analytics platform can help connect engineering signals to business impact.
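The four DORA metrics can be computed from a plain list of deployment records. This is a minimal sketch with a hypothetical record shape (`merged_at`, `deployed_at`, `failed`, `restored_at`); real pipelines would pull these fields from the deploy system and incident tracker.

```python
from datetime import datetime, timedelta
from statistics import median

def dora_summary(deploys, days):
    """Summarize deploy records over a period of `days` days."""
    lead_times = [d["deployed_at"] - d["merged_at"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restores = [d["restored_at"] - d["deployed_at"]
                for d in failures if d["restored_at"]]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_time_to_restore": (sum(restores, timedelta()) / len(restores)
                                 if restores else None),
    }
```

Tracking the same summary before and after an AI pilot gives a like-for-like comparison, which is the point of using DORA as the yardstick.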
Risk is real—and manageable. AI risk management and data governance are now core competencies, not afterthoughts. Protect IP with robust access controls, context window management, and red-teaming. In production, instrument threat detection and response to catch prompt injection, data leakage, and model drift. Treat this like any other reliability discipline alongside SRE.
If parts of coding get automated, where can great engineers thrive? Several high-impact paths are emerging: platform engineering for LLMs (tooling, evals, observability), SRE for AI-infused systems, developer evangelism and education, product management for AI-native experiences, security engineering focused on model and data threats, and forward deployed engineers who pair with customers to solve messy, real-world problems.
How to upskill fast: build an AI product toolbox and ship small. Prototype gen AI features end-to-end (retrieval, function calling, human-in-the-loop QA) and connect them to your CRM integration or support stack. Use A/B testing with a clear minimum detectable effect (MDE) to validate impact. Leverage CustomGPT workflows for internal enablement and in-app guides or product tours to onboard users safely.
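Choosing an MDE up front determines how much traffic a test needs. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name and the default significance level and power are conventional choices, not values from this article.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift of `mde_abs`."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

Because the MDE is squared in the denominator, halving the effect you want to detect roughly quadruples the sample you need, which is why it pays to agree on the MDE before the experiment starts.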
Here’s a pragmatic 90-day plan. Week 0–2: audit your top 10 engineering tasks by time spent; identify 3 that are ripe for AI augmentation. Week 3–6: pilot inside CI/CD with explicit guardrails; track DORA metrics and developer sentiment. Week 7–10: productionize the wins; document runbooks; add incident management paths. Week 11–12: share learnings with product trios, refine your value proposition, and set next-quarter OKRs.
AI won’t replace software engineers; engineers who master AI will outpace those who don’t. If we embrace the shift—toward systems thinking, responsible governance, and customer outcomes—we’ll build better products faster and open new, rewarding career paths. The opportunity is here and compounding.
Inspired by this post on Product School.


When your site goes down, every second counts. I’ve lived that reality across multiple product lines, and the difference between a five-minute blip and a two-hour outage is felt by customers, engineers, and the business. That’s why I’ve been closely following how Incident.io has evolved from coordination during chaos to intelligent, proactive response.
Now, they’re building something new: an AI SRE that can actually help diagnose and respond to incidents. As someone who thinks deeply about reliability, velocity, and customer trust, that promise hits the intersection of AI Strategy, product management leadership, and operational excellence.
I recently spent time with Lawrence Jones, Founding Engineer at Incident.io, and Ed Dean, Product Lead for AI at Incident.io, digging into how their team is teaching AI to think like a site reliability engineer. They shared how they went from simple prototypes that summarized incidents to a multi-agent system that forms hypotheses, tests them, and even drafts fixes, all from within Slack.
Here’s what stood out to me first: AI’s biggest impact comes from compressing time, identifying causes in minutes instead of hours. In practice, that means fewer cycles lost to paging the wrong on-call, clearer paths to root cause, and faster recovery, all without cutting humans out of the decision loop.
Equally important is deciding where automation belongs. The team’s approach aligns with how I evaluate high-risk workflows: identify which parts of debugging can safely be automated; combine retrieval, tagging, and re-ranking to find relevant context fast; use post-incident “time travel” evals to measure how well the AI performed; and balance human trust and AI confidence inside high-stakes workflows. The human remains accountable; the AI accelerates context, options, and execution.
On the technical side, the retrieval choices were refreshingly pragmatic. Retrieval-augmented reasoning still benefits from simplicity: deterministic tagging and re-ranking often beat complex vector setups. I’ve seen the same in production: start with crisp, deterministic signals, then layer embeddings where they truly add value. This keeps systems debuggable and stable as you scale.
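A deterministic baseline of this kind needs very little machinery. The sketch below is an assumption about what "tagging and re-ranking" could look like in its simplest form, not Incident.io's implementation: documents carry hypothetical `tags` and `updated_at` fields, candidates are filtered by tag overlap, and ranking uses overlap size with recency as the tie-breaker.

```python
def retrieve(query_tags, docs, top_k=3):
    """Filter docs by tag overlap, then re-rank by overlap size and recency."""
    query_tags = set(query_tags)
    candidates = []
    for doc in docs:
        overlap = query_tags & set(doc["tags"])
        if overlap:  # deterministic filter: at least one shared tag
            candidates.append((len(overlap), doc["updated_at"], doc["id"]))
    # Re-rank: more shared tags first, newer documents break ties.
    candidates.sort(key=lambda c: (c[0], c[1]), reverse=True)
    return [doc_id for _, _, doc_id in candidates[:top_k]]
```

Every ranking decision here can be explained by pointing at tags and timestamps, which is what keeps the system debuggable before embeddings are layered on.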
The interface choices matter just as much as the models. Using Slack as the interface for human-AI collaboration puts the agent where incidents already live, reducing friction and increasing adoption. Under the hood, they’ve been pragmatic, running retrieval experiments on Postgres with PGVector and using retrieval-augmented generation (RAG) and multi-agent orchestration to chain context gathering, hypothesis formation, and action proposals. The north star is compelling: AI as your company’s immune system.
What impressed me operationally was the rigor around evaluation. Post-incident “time travel” evals let teams score AI accuracy after they know what really happened. That’s the standard we should all adopt: test the agent against reality, not just synthetic prompts, and feed those learnings back into prompts, tools, and guardrails.
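One plausible way to score such an eval, once the real root cause is known, is to check where it appeared in the agent's ranked hypotheses. The record shape and the choice of top-1 accuracy plus mean reciprocal rank are assumptions for illustration; the source does not say which metrics the team uses.

```python
def time_travel_score(cases):
    """Score ranked hypotheses against the confirmed root cause per incident.

    Each case is a dict with a ranked `hypotheses` list and a `root_cause`.
    """
    top1 = sum(1 for c in cases if c["hypotheses"][:1] == [c["root_cause"]])
    reciprocal_ranks = []
    for c in cases:
        if c["root_cause"] in c["hypotheses"]:
            rank = c["hypotheses"].index(c["root_cause"]) + 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0.0)  # the agent never proposed the cause
    n = len(cases)
    return {"top1_accuracy": top1 / n, "mrr": sum(reciprocal_ranks) / n}
```

Replaying past incidents through this kind of scorer after each prompt or tool change turns "did the agent get better?" into a number that can be tracked over time.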
Trust is the currency in incidents, so the product surface must reflect uncertainty with care. Building trust in AI isn’t just about precision—it’s about showing reasoning and uncertainty in ways humans understand. In other words, show the chain of thought as a structured artifact (signals considered, hypotheses rejected, evidence gathered), expose confidence bands, and always make it easy for humans to override or guide.
From a workflow standpoint, the investigation loop mirrors seasoned SRE practice: fast scoping, parallel checks and data sources, building hypotheses and refining findings, then proposing remediations paired with the context that justifies them. Human-agent collaboration here is not a handoff—it’s a tight copilot loop where the agent gathers, tests, and drafts, and the human confirms, prioritizes, and executes.
For platform and security leaders, this approach blends speed with safety. Clear permissions, auditable actions, blast-radius constraints, and CI/CD integration keep the AI inside defined guardrails while still delivering material acceleration. The payoff is higher deployment frequency without compromising reliability—because detection, triage, and rollback become faster and more repeatable.
My takeaway as a product leader: this is a blueprint for agentic AI in mission-critical workflows. Start in the tools users live in (Slack), nail retrieval with deterministic foundations, model the expert’s playbook (not just their summaries), and make evaluation a first-class part of the product. Do that well, and the AI goes from assistant to teammate—conservative when it should be, bold when the evidence supports it, and always legible to the humans in the loop.
The momentum around Incident.io’s AI SRE suggests where we’re headed next: deeper integrations, broader coverage across service catalogs, and richer automations that remain transparent and controllable. For teams investing in reliability, this is the moment to operationalize agentic AI—measured, auditable, and designed for trust—so you can move faster when it matters most.
Inspired by this post on Product Talk.
