Tag: incident management

  • Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

    I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments — the Black Friday sales, the sporting events, the production incidents — and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business. Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform. We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer. While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it. Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice. We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product. “That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.” This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition — they reserve our ambition for the nuanced scaling challenges that matter. The source of truth is the hard part. You can scale stateless web traffic by adding machines, add queue consumers, and add cache. Those are real problems — just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears. That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day. Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it. Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows. Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform — from vector search in our AI features to realtime reporting — and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work. Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster. More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident. We’ve used this pattern for years to make large search migrations safer and more incremental — a core playbook in our platform scalability and SRE practices. Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants. We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks. The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed. Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. AI providers are not commodity storage systems, and we do not design as if they are. That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point. For enterprise customers, this matters because AI support volume can spike just like human support volume — and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform. Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce. Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect. “That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.” Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb. We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer — a habit that anchors our incident management, observability, and FinOps practices. Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents. “That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.” Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully? This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric — it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability. “Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.” The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time. Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime — and we do this in practice, not just as a goal. This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience. “Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.” Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers. What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale — it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer. Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system — and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • Operator Unleashed: The AI Agent That Transforms Customer Ops across Fin and Intercom

    Operator Unleashed: The AI Agent That Transforms Customer Ops across Fin and Intercom

    Today I’m introducing Operator, an Agent that works across both Fin and the Intercom helpdesk to help you manage your customer operations.

    In practical terms, Operator manages help content, builds automation, does the ongoing work that determines how well Fin performs, and runs the operational work your human team doesn’t have time for. That combination is precisely what modern support teams need to move from reactive firefighting to proactive, consultative support.

    Why does this matter? Running a customer operation means managing AI and humans simultaneously, and doing this well requires more capacity than most teams realistically have. I’ve felt that strain firsthand—competing priorities, constant context switching, and a never-ending queue that blurs strategic focus.

    On the AI side, Fin’s performance is largely influenced by what surrounds it: the accuracy of your help content, the quality of your Fin configuration, and how well you understand what’s working and why. When product teams ship daily, keeping your help center current means finding every affected article before customers notice the gaps. When Fin gets a conversation wrong, diagnosing it requires reading through what happened, identifying the root cause at the configuration level, making the fix, and verifying it worked. Analyzing why your resolution rate dropped means pulling conversations, finding patterns, and tracing the cause back to something actionable. And beyond individual fixes, there’s the ongoing question of what to automate next – what your human reps are still handling repetitively, whether it’s worth building a Procedure for it, and how to test it before it goes live.

    On the human side, the demands are just as continuous. When an incident hits, someone needs to identify every affected customer, draft the right response, and send it before the problem compounds. Team leads need visibility into rep performance across hundreds of conversations to coach effectively and prep for 1:1s. Reps need to know what to prioritize without spending the first part of their day figuring it out. In fast-moving environments, that operational overhead wastes energy you should be investing in better customer outcomes.

    Black-and-white testimonial graphic from Synthesia about Fin Operator: a smiling professional at left and a quote at right describing how asking Operator clarifies what happened and makes improving Fin easier.
    Meet Operator, the agent that explains your customer conversations. This Synthesia testimonial shows how simply asking Operator reveals what happened and makes refining Fin faster for support and enablement teams.

    Too often, the work outpaces what teams can manage, so it happens reactively, or not at all. Operator was built to change that, giving teams a new way to understand, manage, and improve their customer operations. Here’s how I put Operator to work across AI workflows and human-led processes.

    First, I use Operator to ask my data anything. Your support operation generates more useful data than most teams have time to process. Operator gives you direct access to it. You can ask it any question about what’s happening in your operation (why a metric changed, what’s driving escalations, how the team performed last week) and it returns structured answers with charts, breakdowns, and the ability to dig further. It analyzes samples of real conversations on the fly to surface patterns and identify root causes. If your head of product wants to know what customers are saying about a new release, you can ask Operator rather than spending half a day pulling a report together. It also works across your entire operation, analyzing Fin’s performance, your human reps’ performance, and customer sentiment.

    Crucially, I don’t start from scratch every time. Give Operator ongoing work, like analyzing your automation rate every Monday, flagging anything that needs attention, and posting the report in your Fin workspace. It’ll run the analysis, write the report, and deliver it without you having to go looking for it. That’s the kind of agentic AI leverage that compounds week after week.

    Second, I keep the knowledge base current without writing a single article. Your knowledge base is only as useful as it is accurate. When product teams ship fast, keeping pace with content updates is a substantial, ongoing job. Give Operator a brief about anything, from a new feature or policy change to release notes, and it finds every article in your help center that needs updating, drafts the edits in your tone of voice and style, identifies content gaps, and drafts new articles to fill them. It even handles localized versions. Every change is formatted as a proposal (Operator’s version of a pull request) for you to review, edit, and approve before anything goes live. It feels like adding several knowledge managers to the team overnight, without the ramp time.

    Monochrome testimonial graphic showing a bearded person's headshot beside bold copy from Raylo praising Fin Operator for accurate analysis, strong trend insights, and reporting beyond basic LLM connectors.
    See why teams choose Fin Operator for customer operations: accurate analysis, trend insights, and conversation debugging—going beyond basic LLM connectors. A Raylo testimonial spotlights daily, real-world impact.

    Third, I build, test, and ship improvements to Fin directly through Operator. When Fin gets a conversation wrong because of a content gap or misconfigured rule, Operator can debug it by reading through the conversation, identifying what caused the problem, proposing a fix, and running simulation tests to verify it before you approve. You see what changed and why before anything goes live. Beyond debugging, Operator has deep knowledge of every Fin feature and capability, so you can ask it directly to help you configure whatever you need. If you need a Procedure for a specific query type, describe the outcome you want and Operator builds it, including triggers, multi-step instructions, edge case handling, and a simulation test, all from a single prompt. The same applies to configuring Guidance rules, data connectors, monitors, and workflows. You don’t need to know which feature solves your problem or how to configure it; you just describe what you want.

    For teams looking to increase their overall automation rate, Operator can handle that strategically too. Ask it to analyze where your biggest automation opportunities are and it surfaces them by volume, along with an estimate of the weekly team time each one is consuming. Pick one, and it builds the solution for you to approve. That’s consultative support, productized.

    Finally, I use Operator to effortlessly manage the human side of support. When an incident hits, Operator identifies every affected conversation, drafts targeted responses, and sends them proactively, turning what would normally be hours of reactive triage into minutes of review and approval. For ongoing management, a team lead prepping for 1:1s can ask Operator to pull each rep’s metrics, flag outliers, and surface what’s worth digging into. A rep coming back from a meeting can ask what to focus on next and get a prioritized queue based on urgency, customer value, and wait time. And because Operator sees patterns across everything your human team is handling, it can surface the conversations they’re still resolving manually, flagging your next automation opportunity before you’ve had time to go looking for it.

    Here’s why this works. Operator isn’t a general-purpose AI model given access to your data. It’s built on a library of purpose-built tools that encode expertise specific to support operations, like how to pick the right attributes for a given analysis, search a knowledge base semantically, debug Fin’s reasoning in a specific conversation, or write and test a Procedure that will actually work. That specialized toolkit is what makes its recommendations trustworthy and its execution reliable.

    Minimalist banner reading 'Transform your support operation with Operator' above a bright orange square with an abstract purple-green knot logo, suggesting AI-driven customer support automation.
    Elevate customer service with Operator. The bold headline and vivid knot logo introduce a modern AI platform that streamlines workflows, speeds resolutions, and scales support operations without extra headcount.

    The proposal (pull request) system makes this possible. When Operator updates content, adjusts configuration, or modifies how Fin behaves, it creates a proposal – a structured diff of what’s changing and why. You review it, edit if needed, and approve before it takes effect. Operator does the cognitive work; the human stays in control of what goes live.

    More than 200 early users are already trying Operator, and every one of them is finding new use cases. It’s a genuine step change in capability, and I expect it will change the way support teams run their operation. We’re working towards a vision of Operator being increasingly agentic, expanding across every new role Fin takes on.

    Operator is available in early access now. If you’re ready to transform your customer operations across Fin and the Intercom helpdesk with agentic AI, start here: https://fin.ai/operator.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • AI Agents Can Ship Code, But You Own the Bugs: Debug Faster in Claude or Cursor with Amplitude MCP

    AI Agents Can Ship Code, But You Own the Bugs: Debug Faster in Claude or Cursor with Amplitude MCP

    AI agents are getting remarkably good at scaffolding features and writing tests, yet when production issues surface, accountability still lands on me and my team. The last mile of quality—reproducing the issue, isolating the root cause, and validating a durable fix—remains a human responsibility, even in an era of agentic AI. That’s why I’ve built a repeatable debugging approach that blends behavioral analytics with agent-assisted coding to close the loop quickly and safely.

    Investigate bugs directly in Claude or Cursor with Amplitude MCP. Learn two Session Replay workflows to debug faster.

    The goal is simple: transform messy, anecdotal bug reports into actionable, prioritized work that my developers can resolve confidently. By pairing Session Replay with Amplitude analytics, I can quantify impact, capture precise reproduction steps, and feed rich context into Claude or Cursor. The result is a faster path from signal to solution—and fewer back-and-forth cycles with engineering, support, and product.

    Here’s how I use Session Replay to tighten the feedback loop. First, I lean on behavioral analytics to detect anomalies and segment affected users, so I know whether we’re facing an edge case or a widespread degradation. Then I use the replay to see exactly what the customer experienced: the path they took, the UI state, the environment details that matter (device, browser, version), and the precise moment things went sideways. This contextual backbone lets me enter Claude or Cursor with high-signal inputs, rather than guesswork.

    Workflow 1: From customer session to reproducible issue. I start with the offending Session Replay and capture the exact steps to reproduce, including state transitions and timestamps for any console errors or API failures. In Claude or Cursor, I provide those steps, reference the replay link, and ask the model to propose a minimal failing test and a hypothesis for root cause. With Amplitude MCP as the connective tissue, I can keep the model anchored to the relevant events and user path while it generates patches or targeted instrumentation. I validate the hypothesis locally, run the failing test, and then move the fix through CI/CD with feature flags so we can verify in production without overexposing risk.

    Workflow 2: From code symptoms back to customer evidence. Sometimes I begin in the IDE or agent environment with a flaky test, a suspicious diff, or a performance regression. In that case, I ask Claude or Cursor to outline likely failure modes and the critical code paths. Then I pivot to Session Replay for corroboration: do real users hit these paths, under what conditions, and how often? Using Amplitude MCP to anchor the agent in actual user journeys helps separate theoretical fixes from changes that will meaningfully improve outcomes. I confirm with replays after the patch lands, monitor Web Vitals and related behavioral metrics, and only then ramp the flag.

    Two practices make these workflows consistently effective. First, I frame prompts to keep the model tightly scoped: reproduction steps, expected vs. actual behavior, impacted segments, and any known constraints (e.g., rate limits, third-party dependencies). Second, I treat the agent as a proactive pair-programmer: it drafts hypotheses, tests, and diffs, while I provide ground truth from Session Replay and analytics. That division of labor keeps the LLM productive without letting it drift from the evidence.

    Operationally, I also align this approach with our incident management and observability standards. For high-severity issues, SREs and product managers share the same replay artifacts, event timelines, and roll-forward criteria. We document root causes and guardrails as docs-as-code, then socialize them via developer evangelism so similar classes of bugs get caught earlier. Over time, this tightens our DORA metrics—particularly lead time for changes and deployment frequency—without compromising stability.

    Privacy-by-design is non-negotiable. We ensure Session Replay redacts sensitive fields, enforces least-privilege access, and complies with our data governance policies. When I involve an agent, I include only the minimum data necessary to reach a fix and prefer structured artifacts (event IDs, stack traces, and test cases) over raw PII. These safeguards let us move quickly without trading away trust.

    The takeaway is pragmatic: agents can accelerate creation, but accountability for quality still rests with us. By grounding Claude or Cursor in real user behavior via Amplitude MCP and Session Replay, I get faster reproduction, more accurate fixes, and cleaner rollouts. The combination turns “mysterious customer bug” into “verified hypothesis and passing test” in a fraction of the time—and that’s how we ship responsibly at speed.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Stop Blurring the Lines: Clear Product–Engineering Boundaries to Boost Quality and Prevent Burnout

    Stop Blurring the Lines: Clear Product–Engineering Boundaries to Boost Quality and Prevent Burnout

    Where is the true boundary between product and engineering—and what happens when it gets blurry? I’ve led and coached teams through this question many times, and I’ve learned that clarity here isn’t just a nice-to-have; it’s foundational to quality, velocity, and team health.

    I’ve seen well-intentioned product managers step in to “help” by taking ownership of bug triage, tech debt prioritization, or even system architecture. At first, it feels productive. Over time, it creates role confusion, slows decision-making, and burns out PMs—while paradoxically lowering engineering quality. The “CEO of the product” myth and legacy IT, project-based mindsets are usually at the root. Treating engineers as “order takers” breaks down in evergreen product environments.

    The healthiest collaboration model is simple and disciplined: The product trio owns the “what”; engineering owns the “how”. Product managers are not people managers for engineers—and shouldn’t be accountable for engineering quality. Our job is to frame the problem, align on outcomes, and continuously discover value with customers—not to supervise technical execution.

    If quality is a problem, the solution is escalating and fixing the system, not managing individual bugs. In practice, that means surfacing patterns and elevating them to engineering leadership, who can address root causes—staffing, skills, code health, CI/CD gaps, observability, or process design—rather than asking PMs to paper over issues with status updates. This keeps accountability where it belongs and reinforces outcomes vs output OKRs.

    One high-leverage move is to remove unnecessary intermediaries. Removing the PM as a middleman creates better flow and clearer ownership. Create direct paths for stakeholders to get bug status without routing everything through product. Use dashboards, shared tools, or Slack channels instead of one-off updates. In my teams, shared Jira views, Slack incident channels, and status pages eliminated handoffs, improved stakeholder management, and gave engineers the space to solve problems end-to-end.

    Strong engineering leadership is non-negotiable. What strong engineering leadership should own (and why that matters) is the technical system, quality guardrails, sustainable pace, and the practices that uphold them—incident management, code review rigor, test coverage, and SLOs with SRE. Skilled engineering teams naturally push back when boundaries are crossed—and that’s a good thing. It signals ownership, craft pride, and a pathway to durable execution.

    When do I step in as product? Primarily to clarify desired outcomes, sequencing, and trade-offs—bringing customer and business context to the table. I structure product roadmapping and sprint planning around value slices and risks, not task lists. I align on decision rights early: architecture and tech debt strategies live with engineering; product strategy, positioning, and success metrics live with product; discovery and prioritization live with the product trio.

    Here are the system-level moves I’ve found most effective: Escalate systemic quality issues to engineering leadership, not individual contributors. Advocate for real engineering leadership if your org expects product teams—not IT teams. Then reinforce a culture of continuous discovery so product, design, and engineering make better upstream decisions together. This is how empowered product teams ship higher-quality outcomes—without burning anyone out.

    If you’ve ever found yourself acting as the middleman for bug status or being asked to “own” engineering decisions outside your expertise, you’re not alone. Reset the boundaries, make work visible, and double down on shared outcomes. In my experience, the moment we clarify roles and remove status theater, quality rises, cycle time improves, and everyone does the job they were hired to do—better.


    Inspired by this post on Product Talk.


    Book a consult png image
  • The Safety of Speed: 180 Deploys a Day, 12‑Minute Releases, 99.8%+ Availability

    The Safety of Speed: 180 Deploys a Day, 12‑Minute Releases, 99.8%+ Availability

    “Speed is not the enemy of safety; it is the prerequisite for it.” I live by this principle. In our organization, the average time from merging code to it being used by customers in production is just 12 minutes, and that short window is fundamental to how we build, ship, and learn.

    In January 2026, we are averaging 180 ships per workday – roughly 20 deployments every hour. Conventional wisdom suggests that to increase stability, you must slow down. I believe the opposite. Speed is not the enemy of safety; it is the prerequisite for it. Accumulating code creates risk; shipping small batches minimizes it. Shipping is our company’s heartbeat.

    Maintaining this frequency while targeting 99.8+% availability has required over a decade of focused investment in systems, principles, and processes. We protect the integrity of our systems through three layers of defense: an automated pipeline that is simple, reliable, and removes the need for manual intervention, a shipping workflow that promotes ownership and uses guardrails as accelerants, and a recovery model that optimizes for mitigating inevitable failures. Here’s how we’ve built each layer so that velocity is our greatest source of stability.

    While our platform consists of various services and frontend applications, I’ll focus here on our Ruby on Rails monolith. It is our core application and the one we deploy most frequently; we also deploy it to three different data‑hosting regions with independent pipelines. Our other services follow similar pipeline principles and safeguards, but the Rails monolith is the clearest example of how we ship at scale.

    The automated pipeline is designed to move code from merge to production as fast as possible while enforcing strict safety checks. It is fully automated, and the vast majority of releases require no human intervention—critical for CI/CD at high deployment frequency.

    Once an engineer merges code to GitHub, two things happen immediately. First, the build: we compile the Rails application and its dependencies into a deployable asset (a slug) in about four minutes. Second, parallel CI: our test suite runs alongside the build; through extensive optimization, parallelization, and test selection, the vast majority of CI builds finish in under five minutes.

    As soon as the slug is built, it’s deployed to a pre‑production environment. CI does not block the progression of the slug to pre‑production. Deploying to pre‑production takes around two minutes. This environment serves no customer traffic, but it is connected to our production datastores, mirrors our production infrastructure variants (e.g., web serving, asynchronous worker), and is configured so that requests exercise the pre‑release code and workers.

    Immediately after deployment, we run and await several automated approval gates. We verify that the application boots cleanly on hosts (boot test), confirm the parallel test suite passed (CI check), and execute functional synthetics using Datadog Synthetics on critical flows—such as loading or editing a Fin workflow. If any gate fails, the release is halted and does not go to production.

    Once approved, we promote the code to thousands of large virtual machines. A deployment orchestrator triggers these deployments simultaneously, while a decentralized, staggered rollout avoids changing the state of the entire fleet at the same millisecond. Within each machine, a rolling restart mechanism removes a process with old code from the serving path, lets it drain gracefully, and replaces it with a fresh process running the new code. From the moment a deployment starts, first requests are served by new code within roughly two minutes, and the vast majority of the global fleet updates transparently within six minutes. When restarts trigger on every machine, production unblocks so the next deployment can begin.

    We treat a stalled pipeline as a high‑priority incident. If the automated system rejects three consecutive release attempts, it pages an on‑call engineer. These are pre‑production blocks, but if the shipping lane stops moving, changes pile up—and our stability relies on building and shipping in small steps. The on‑call’s job is to restore flow so that tiny, safe, frequent updates continue to keep risk low.

    Our shipping workflow is built on extreme ownership: tools assist, but the engineer is accountable for quality and the decision to merge. I insist that you are present when you ship. The practical benefit of a 12‑minute deployment cycle is that engineers remain in the zone, focused on the problem they just solved, and ready to validate behavior as it goes live.

    Stylized rocket launch piercing dramatic cumulus clouds at sunrise, glowing vapor trail symbolizing fast yet controlled delivery; overlaid headline text reads 'The safety of speed' in the sky.
    A rocket lifts into a luminous sky, a metaphor for shipping code fast without breaking things, where precision, automation, and guardrails power 180 safe deployments a day.

    To support this, our deployment system sends Slack notifications the moment code is submitted and as it advances through stages, embeds direct observability links to relevant dashboards and logs in every PR and message, and prompts verification so engineers actively watch the dials and test features in production. It is not acceptable to rely on green builds. You’re expected to watch your change go live and if you’re not prepared to rollback, you’re not prepared to ship. We maintain a no‑blame culture: quick rollbacks and immediate reverts are signs of vigilance and ownership, not failure.

    We make extensive use of feature flags to turn deployment into a non‑event. By decoupling deployment (moving code to servers) from release (turning features on), we shrink the blast radius of change. Flags can be enabled for all customers, a specific subset, or disabled for everyone in under 60 seconds through our backend UI. Engineers can group flags into beta features and run phased rollouts; we also ensure flags work consistently across non‑monolith applications. In the past three months, we created over 560 flags—and we actively manage them to avoid permanent complexity.

    For complex refactors—especially when behavior should not change—we leverage GitHub Scientist, an open‑source experimentation library. It runs candidate logic (new code) in parallel with existing logic (old code) in production, instruments both paths for result and timing comparisons, and keeps existing behavior user‑visible. That means we can iterate on and validate new code under real load without risking the experience, then switch seamlessly when confident.

    When engineers need to go deeper before merging, they can generate a slug and deploy it to a virtual machine, detaching a running production host from the serving path and connecting for manual testing. They can also put a pre‑release slug on a serving machine that handles a small percentage of jobs or web requests. Single‑host validation lets us slice observability to those hosts, compare against the main release, and make low‑level changes safer. Staging is a simulation; production is reality. Testing on a single production host validates assumptions with real‑world data without risking the fleet.

    Our recovery model starts from a simple principle: stop monitoring systems; start monitoring outcomes. Traditional monitoring tells you if a server is healthy; we care whether customers are healthy. We rely on heartbeat metrics—vital signs that represent the core value our product provides—such as the rate at which messages and comments are created.

    Unlike standard uptime checks, heartbeat metrics are binary in spirit. If message send rates dip below baseline, it does not matter if infrastructure dashboards are green. Down is down, and if customers can’t do their job, uptime percentages are irrelevant. By tracking real‑world success rates as a high‑level signal, we catch subtle degradations that traditional alerting either misses or over‑alerts on.

    Because we ship in small, incremental steps and maintain previous releases on our virtual machines, our Time to Recover (TTR) is generally very fast. If a heartbeat metric drops or a critical anomaly is detected right after a ship, the system can trigger an automatic rollback, reverting to the release that was running 20 minutes ago—often restoring service before an engineer responds. For complex issues, engineers can initiate a manual rollback through our deployment UI; doing so also locks the production pipeline to prevent further releases while we investigate and remove problematic code.

    Resumption of service is not the end. Every incident prompts an incident review, and we don’t just fix the bug. We ask, “How did the machine allow this to happen?” Then we harden the system so it cannot happen again. This loop—fast shipping, fast recovery, rigorous learning—compounds resilience over time.

    This operating model aligns to DORA metrics: high deployment frequency, short lead time for changes, low change failure rate, and rapid time to restore service. It’s a CI/CD and SRE‑informed approach that converts speed into a defensive advantage rather than a liability.

    Shipping 180 times a day isn’t a vanity metric; it’s a deliberate choice to protect the customer experience. With a 12‑minute window from code to customer, the feedback loop is tight and engineers retain context—and accountability—for the immediate impact of their work. Maintaining this pace requires more than fast CI; it requires judgment, extreme ownership, disciplined use of feature flags, and a recovery model that monitors outcomes. We rely on human expertise, augmented by these layers of defense, to catch issues before they turn into customer pain. We don’t ship fast despite our need for stability; we ship fast to stay in control of change.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • 3 Powerful Ways AI Is Rewriting Cybersecurity: Smarter Defense, Faster Response, Fewer Breaches

    3 Powerful Ways AI Is Rewriting Cybersecurity: Smarter Defense, Faster Response, Fewer Breaches

    Every week, I watch the cybersecurity landscape shift under our feet. As a VP of Product Management, I’m responsible for building secure, resilient products—and that means understanding how artificial intelligence is transforming the way IT teams defend, respond, and even anticipate attacks.

    Learn the ways in which AI is transforming both cybersecurity offense and defense for IT teams.

    First, AI supercharges threat detection and prevention. Pattern-recognition models now sift through endpoint telemetry, identity signals, and network flows to surface anomalies in near real time. In practice, that means fewer false positives, faster prioritization, and earlier containment. We’re pairing behavioral analytics with enrichment from our SIEM/EDR stack so analysts get a ranked, explainable view of risk instead of a noisy alert queue—directly improving mean time to detect and laying the groundwork for scalable threat detection and response.

    Second, AI accelerates incident response. We’ve embedded LLM-powered copilots into our SOC workflows to summarize alerts, propose next-best actions, and auto-generate draft remediation steps from playbooks. Orchestration then executes routine tasks—isolating endpoints, rotating credentials, updating tickets—while keeping a human-in-the-loop for approvals. To keep this safe, we use privacy-by-design principles, a retrieval-first pipeline for authoritative playbook content, and eval-driven development to measure precision/recall on suggested actions. The result is meaningful reduction in mean time to recover and more consistent incident management.

    Third, the offense is getting smarter—and we need to be honest about it. Adversaries use gen AI to craft targeted spear-phishing, deepfake executive voice notes, and polymorphic malware that evades signature-based tools. We counter by red-teaming with AI, deploying deception tech to waste attacker cycles, and hardening identity as the new perimeter (MFA, conditional access, continuous risk scoring). Education matters, too: when employees see how convincing AI-generated lures have become, phishing reports spike and successful compromise rates drop.

    None of this works without strong governance. We treat AI like any high-impact capability: rigorous data governance, model access controls, and AI risk management across the lifecycle. We log model prompts and outputs, restrict sensitive data via contextual policies, and continuously test for drift and bias. This is as much an IT leadership challenge as it is a technical one—clear ownership, well-defined runbooks, and regular tabletop exercises make the difference between resilience and chaos.

    If you’re getting started, I recommend a focused 90-day plan: identify one high-signal detection use case, one response playbook ripe for automation, and one employee risk area (usually phishing) for immediate uplift. Instrument everything—latency, precision/recall, MTTR—and iterate with a cross-functional group spanning security engineering, SRE, and product management leadership. With disciplined AI strategy and guardrails in place, you can move faster, reduce noise, and stay ahead of adversaries without compromising data or trust.


    Inspired by this post on Pendo – Perspectives.


    Book a consult png image
  • Inside the Engine Room: How I Drive Scalable Analytics APIs, Reliability, and Performance

    Inside the Engine Room: How I Drive Scalable Analytics APIs, Reliability, and Performance

    I build and scale analytics platforms with a product mindset, and the work starts with the "middleware and compute systems that power analytics at scale." In platforms like Amplitude analytics and other unified analytics platform architectures, that foundation is what makes everything else possible.

    Day to day, I oversee the "APIs behind charts, cohorts, and metrics—driving performance, reliability, and platform scalability." When those APIs are fast and resilient, every product team—from growth to customer success—can trust the insights they use to ship, learn, and iterate.

    From an engineering leadership standpoint, I partner closely with SRE to define SLOs and error budgets, wire CI/CD pipelines for safe deploys, and track DORA metrics so we improve speed without compromising quality. This combination reduces incident management toil and shortens MTTR while keeping data freshness and query latency within strict thresholds.

    From a product management leadership lens, the goal is clarity: crisp APIs, predictable contracts, and transparent stakeholder management across data, engineering, and GTM teams. That alignment empowers product teams with reliable cohorts and metrics, accelerates experimentation, and de-risks roadmaps.

    If you’re scaling analytics, invest first in the platform layer: middleware and compute, schema governance, caching strategies, and cost-aware compute. Do that well, and the visible experience—charts, cohorts, and metrics—feels effortless, even as you grow to serve billions of events with confidence.


    Inspired by this post on Amplitude – Best Practices.


    Book a consult png image
  • Spain’s Tough New Customer Service Law: What It Signals—and How AI Keeps You Compliant, Fast, and Human

    Spain’s Tough New Customer Service Law: What It Signals—and How AI Keeps You Compliant, Fast, and Human

    Support teams in Spain just got the clearest signal yet that the old way of doing things won’t cut it anymore. As I look at the details, I see more than a regulatory hurdle—I see a blueprint for the modernization many of us have been pushing toward for years.

    The signal arrives in the form of one of the most ambitious customer service regulations in Europe—a law designed to strengthen consumer protections and set clear expectations for fair, transparent, and personalized customer service. Among its measures: new protections against spam calls, stronger transparency requirements, safeguards around personalized interactions, and measurable standards for speed, accessibility, and complaint handling within customer support.

    It’s a significant shift, especially for large enterprises and essential-service providers. While the initial reaction might be anxiety about audits and penalties, the larger opportunity is hard to ignore: this law compels us to build modern, resilient support operations that scale, perform, and earn trust.

    Spain is often an early mover in consumer-protection regulation, and this shift could signal what future standards across the EU might look like. For EMEA leaders, this is a moment to reevaluate operating models, invest in automation thoughtfully, and ensure customer experience improvements directly support regulatory compliance.

    Below, I break down what the law requires, what it means in practice, and how AI Agents like Fin can help teams meet regulatory expectations while delivering faster, more personal support at scale.

    The law applies in full to providers of regulated services, including water, energy, passenger transport, postal services, pay-audiovisual media, and electronic communications, and also to any company (or group) that meets certain size and turnover thresholds, even if their core business falls outside those sectors.

    Large companies (those with more than 250 employees and over €50 million in turnover) also hold additional obligations, particularly around multilingual support in Spain’s co-official language regions.

    While the law is still moving through its final approval stages, the direction is clear: a broad set of obligations will apply to reinforce consumer rights, ensuring they can: Reach support quickly. Speak to a human when needed. Get clear information during outages or service disruptions. Have complaints handled promptly and on time.

    1. 95% of support calls must be answered within three minutes

    This raises the bar significantly for responsiveness, especially during spikes, outages, billing cycles, or seasonal surges. Most support systems are not built for this level of agility. In my experience, you can’t hire your way to this metric sustainably—you have to design for it.

    2. Customers must be able to speak to a human on request

    Automation is allowed, but it cannot be the only option. At any point during a call, a customer must be able to transfer to a human if they ask for one. Companies cannot trap customers in automated loops. The practical implication: every workflow needs a reliable, audited escape hatch to a person.

    3. Support lines must be free of charge

    Premium-rate numbers are prohibited. Customer service cannot generate revenue for the business, nor may it be used to upsell products. This cleanly separates service from sales and reduces consumer friction.

    4. Essential services must offer 24/7 support for continuity issues

    Electricity, water, gas, telecoms, and transport providers must always be reachable at all hours when customers need to report service interruptions. That means coverage, triage, and routing must be always-on.

    5. Complaints must be resolved within 15 days – or within five days for undue charges

    This halves the previous general complaint window of 30 days and adds a much faster path for billing-error complaints. Companies must maintain records, assign tracking numbers, and ensure timely follow-up. Your case management discipline will make or break this requirement.

    6. No spam calls or unwanted commercial pressure

    Companies must identify business calls with a designated prefix, and customer -service calls with a different one. Telecom operators will be required to block calls that do not use these codes. Additionally, contracts obtained via unsolicited calls will be legally null and void, protecting consumers from being pressured into commitments they never intended to make.

    7. Companies must maintain a unified complaint-tracking system

    All complaints, claims, and incidents must be recorded in a centralized system to ensure traceability. If your data is fragmented across tools, this is a call to centralize and standardize intake.

    8. Companies must pass annual external audits

    These audits assess whether customer service processes are meeting the required standards. In practice, that means consistent processes, measurable outcomes, and reliable evidence.

    9. Better linguistic and accessibility rights

    Large companies operating in regions with co-official languages must be able to provide support in those languages. They must also ensure their customer service is accessible for vulnerable consumers, such as those with disabilities or older adults. Multilingual and accessible by design is the new default.

    10. Fairer contract renewals

    Companies must provide customers with 15 days’ notice prior to automatic renewal of online subscriptions and make cancellation simple. This is both a compliance and customer trust win.

    Most support systems weren’t built for this level of speed or operational rigor. But the steps required to comply are the same ones that make service better for customers—and better for the teams delivering it. That’s why I view AI as an essential capability, not a bolt-on.

    With the regulatory expectations clear, the question becomes: what does a modern, compliant support operation look like? For me, it blends human empathy with intelligent automation, proving auditability without sacrificing experience.

    This is where AI plays a meaningful role. Not as a replacement for humans, but as a reliable front line that can handle a wide range of queries, including the most complex ones that require real depth, while keeping queues under control.

    Adopting an AI Agent like Fin helps teams build a support model that meets regulatory expectations and improves customer experience across all your channels. Here’s how.

    Many organizations will struggle to meet the three-minute standard during normal times, let alone during spikes or busy seasons, without unsustainably scaling their teams. Fin can help by reducing the number of calls that reach your phone lines and Fin Voice will ensure the ones that do are handled quickly.

    Reducing avoidable call volume before it reaches the queue

    Many of the queries teams receive are predictable: outage updates, billing questions, account changes, and other repeatable issues. Fin can resolve these instantly across several channels, including live chat, SMS, email, and WhatsApp, using the content and processes your team already maintains. I’ve seen this alone cut peak-time pressure dramatically.

    Answering the phone immediately

    For customers who do call, Fin Voice can pick up straight away. It provides natural, conversational responses based on your existing knowledge and helps your team stay responsive during busy periods.

    Making it easy to reach a human easier during spikes

    When queues build up, Fin can capture the reason for the call, gather details, and prioritize the most urgent issues. If you offer callback options, Fin can help schedule them quickly so customers avoid long wait times, which is key for staying compliant during peak periods.

    The law requires customers to reach a real person whenever they request one. Fin supports this by keeping the path to a human clear and dependable: every interaction includes an option to speak to a person, and that option is accessible until the issue is resolved; when chosen, Fin hands over full context so human teams don’t start from scratch; if you show team availability or wait times, Fin can surface that information for customers; escalations can be prioritized to ensure faster pickup; alerts can notify on-call staff when urgent issues arise. On the phone, Fin Voice follows the same principle. Callers can request a transfer at any moment, and Fin routes the call to the right team with context intact.

    Essential-service providers must be reachable at any hour when customers need to report service interruptions. Fin can help you meet this requirement without building a full overnight staffing model.

    Always-on answers and triage

    Fin provides first-line support at any hour of the day or night. Fin Voice brings this capability to the phone, giving callers immediate help even when your human team is offline. Fin can also direct customers to the latest updates you’ve published, such as outage information or status pages.

    Routing urgent issues to the right people

    When an issue requires human judgment, Fin gathers the necessary details and routes it to the appropriate on-call team using your existing after-hours processes. Teams can set up notifications so urgent issues are seen quickly.

    Proactively surface what matters most

    With AI Insights, Fin can also monitor for emerging patterns in customer conversations through Trending Topics. This means that if there’s a sudden spike in reports about a specific outage or a recurring question about a new process, Fin can flag these trends in real time. Your team is alerted to what’s top-of-mind for customers, so you can prioritize updates, publish targeted FAQs, or escalate critical issues, ensuring your support stays relevant and responsive, even overnight.

    Complaints and outages often create the biggest spikes in volume, and the new law increases pressure to respond quickly, keep customers informed, and maintain complete records. This is exactly where structured AI intake adds value.

    A more structured complaint intake

    Fin can recognize when a customer is lodging a complaint, gather required information, and initiate a record in your existing system with a clear ID assigned from the outset.

    Clear ownership and deadline alignment

    Your team can then use your case-management tools to apply the 15-day resolution timeline (or five says for undue charges). Fin’s structured intake helps ensure that ownership and next steps are visible, rather than buried in unstructured notes.

    Faster, more consistent outage communications

    During service interruptions, Fin can share the latest published information, provide estimated fix times when available, and direct customers to live updates. On the phone, Fin Voice can triage incident-related calls quickly so callers aren’t waiting for a human agent just to receive basic information.

    While multilingual support is only mandatory for large companies operating in co-official language regions, it remains essential for meeting consumer expectations. Fin helps by supporting multilingual, natural language interactions across voice and other channels; operating within channels that support accessibility features, like channels compatible with screen readers or commonly used messaging apps; and offering “request a call” paths and collecting the necessary information up front so teams can follow up quickly for customers who prefer phone support.

    The law prohibits customer service interactions from generating additional revenue or being used to offer new products. With Guidance, you can set Fin up to stay firmly within these boundaries by shaping how it responds, which topics it should avoid, and what it should prioritize when a customer is seeking help or lodging a complaint.

    The law raises expectations around documentation and audit readiness. Fin helps by making customer interactions more structured and consistent: when a conversation involves a complaint, Fin can ensure the required information is captured and a clear ID assigned; that ID can follow the interaction so it remains easy to trace; consistent intake gives you better visibility into key metrics regulators care about, like response times, time to first human contact, escalation volume, and whether complaints are resolved within required timelines; transcripts, summaries, and metadata can be retained until cases are resolved, supporting audit requirements; many organizations maintain internal compliance playbooks outlining processes and owners. Fin’s structured intake helps keep these practices reliable; leverage Insights to identify trending topics, optimize processes and measure service quality.

    Spain’s new customer service law raises the bar on speed, access, and accountability. It’s natural to worry about how your team will cope, especially if your support operation has grown organically across tools and regions. I’ve seen how quickly burnout and chaos can set in when expectations rise faster than capacity.

    The reality is that meeting these expectations through people alone would put unsustainable pressure on already stretched support teams. The risk of burnout and operational chaos is real, which is why an AI Agent like Fin can bring welcome relief.

    By handling everything from high-volume, repetitive questions to many of the deeper, more involved issues customers raise, Fin keeps queues manageable and prevents the strain from falling entirely on your human team, helping everyone stay above water as expectations rise.

    For companies operating across the EU, adapting early to Spain’s stricter expectations can build resilience for whatever comes next—whether that ends up being driven by regulation or customer demand. Now is the time to align compliance, AI strategy, and customer experience into a single, measurable operating model.


    Inspired by this post on The Intercom Blog.


    Book a consult png image
  • How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

    How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

    When your site goes down, every second counts. I’ve lived that reality across multiple product lines, and the difference between a five-minute blip and a two-hour outage is felt by customers, engineers, and the business. That’s why I’ve been closely following how Incident.io has evolved from coordination during chaos to intelligent, proactive response.

    Now, they’re building something new: an AI SRE that can actually help diagnose and respond to incidents. As someone who thinks deeply about reliability, velocity, and customer trust, that promise hits the intersection of AI Strategy, product management leadership, and operational excellence.

    I recently spent time with Lawrence Jones, Founding Engineer at Incident.io and Ed Dean Product Lead for AI at Incident.io, digging into how their team is teaching AI to think like a site reliability engineer. They shared how they went from simple prototypes that summarized incidents to a multi-agent system that forms hypotheses, tests them, and even drafts fixes—all from within Slack.

    Here’s what stood out to me first: AI’s biggest impact comes from compressing time—identifying causes minutes instead of hours. In practice, that means fewer cycles lost to paging the wrong on-call, clearer paths to root cause, and faster recovery—without cutting humans out of the decision loop.

    Equally important is deciding where automation belongs. The team’s approach aligns with how I evaluate high-risk workflows: Identify which parts of debugging can safely be automated. Combine retrieval, tagging, and re-ranking to find relevant context fast. Use post-incident “time travel” evals to measure how well their AI performed. Balance human trust and AI confidence inside high-stakes workflows. The human remains accountable; the AI accelerates context, options, and execution.

    On the technical side, the retrieval choices were refreshingly pragmatic. Retrieval-augmented reasoning still benefits from simplicity: deterministic tagging and re-ranking often beat complex vector setups. I’ve seen the same in production: start with crisp, deterministic signals, then layer embeddings where they truly add value. This keeps systems debuggable and stable as you scale.

    The interface choices matter just as much as the models. “Slack as the interface for human-AI collaboration” puts the agent where incidents already live, reducing friction and increasing adoption. Under the hood, they’ve been pragmatic with “PGVector and Postgres for retrieval experiments”, using “RAG (Retrieval-Augmented Generation)” and “Multi-agent orchestration” to chain context gathering, hypothesis formation, and action proposals. The north star is compelling: “AI as your company’s immune system”.

    What impressed me operationally was the rigor around evaluation. Post-incident “time travel” evals let teams score AI accuracy after they know what really happened. That’s the standard we should all adopt: test the agent against reality, not just synthetic prompts, and feed those learnings back into prompts, tools, and guardrails.

    Trust is the currency in incidents, so the product surface must reflect uncertainty with care. Building trust in AI isn’t just about precision—it’s about showing reasoning and uncertainty in ways humans understand. In other words, show the chain of thought as a structured artifact (signals considered, hypotheses rejected, evidence gathered), expose confidence bands, and always make it easy for humans to override or guide.

    From a workflow standpoint, the investigation loop mirrors seasoned SRE practice: fast scoping, parallel checks and data sources, building hypotheses and refining findings, then proposing remediations paired with the context that justifies them. Human-agent collaboration here is not a handoff—it’s a tight copilot loop where the agent gathers, tests, and drafts, and the human confirms, prioritizes, and executes.

    For platform and security leaders, this approach blends speed with safety. Clear permissions, auditable actions, blast-radius constraints, and CI/CD integration keep the AI inside defined guardrails while still delivering material acceleration. The payoff is higher deployment frequency without compromising reliability—because detection, triage, and rollback become faster and more repeatable.

    My takeaway as a product leader: this is a blueprint for agentic AI in mission-critical workflows. Start in the tools users live in (Slack), nail retrieval with deterministic foundations, model the expert’s playbook (not just their summaries), and make evaluation a first-class part of the product. Do that well, and the AI goes from assistant to teammate—conservative when it should be, bold when the evidence supports it, and always legible to the humans in the loop.

    The momentum around Incident.io’s AI SRE suggests where we’re headed next: deeper integrations, broader coverage across service catalogs, and richer automations that remain transparent and controllable. For teams investing in reliability, this is the moment to operationalize agentic AI—measured, auditable, and designed for trust—so you can move faster when it matters most.


    Inspired by this post on Product Talk.


    Book a consult png image