
From Customer Signals to Reliable Product Operations
Customer signals become operationally useful only when a team knows what each signal can establish, how quickly it requires action, and who owns the next decision. A support complaint, a workflow metric, and a detailed customer story may describe the same experience, but they do not carry the same context or call for the same response.

The two source articles illuminate opposite ends of this system. The incident-management article shows how customer impact should trigger rapid containment, while the product-discovery article explains why early evidence usually needs enrichment before it supports a durable product commitment. Together, they suggest a product operations model that separates detection, diagnosis, recovery, and learning without disconnecting them.

Key takeaways
- Signals should be classified by purpose: some reveal that customers are being harmed, while others help explain why.
- The cost of waiting should determine response speed, but urgency should not turn an incomplete signal into false certainty.
- Support, behavioral data, operational telemetry, rollout monitoring, and customer interviews contribute different forms of evidence.
- Strong product operations preserve signal provenance, route it to a clear owner, and define the next evidence-building or recovery action.
- Incident learning and continuous discovery should feed the same organizational memory so recurring friction becomes easier to recognize and address.

One customer signal can serve several operational jobs

The phrase “customer signal” often collapses several distinct concepts. A signal can detect a change, indicate its scale, describe a particular experience, test an explanation, or evaluate a proposed solution. Confusion arises when an input collected for one of these jobs is treated as if it can perform all of them.

The incident playbook reports that Support, including automated support capabilities, may identify a pattern in customer conversations before a technical dashboard exposes it. It also describes heartbeat metrics that track whether customers can complete core workflows, rather than merely whether underlying systems remain online. In that setting, tickets and outcome metrics act as detection mechanisms: they establish that the experience may be unhealthy and that investigation should begin.

The evidence-focused article assigns a different role to many of the same inputs. It characterizes support tickets, app-store reviews, sales notes, and behavioral analytics as useful prompts for discovery but weak foundations for deciding what to build on their own. These sources can expose repetition or friction, yet they may omit the sequence, motivation, constraints, and tradeoffs behind the observed behavior.

These positions are complementary. A compressed support report can be strong enough to initiate triage without being rich enough to define a roadmap solution. Likewise, a behavioral change can justify investigation without proving its cause. Product operations should therefore attach an explicit purpose to each signal: detect, size, explain, validate, or monitor. That label prevents teams from asking an input to support a conclusion it cannot carry.
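As a minimal sketch of that labeling idea, the Python below attaches one purpose to each signal and refuses conclusions the source type cannot carry. The five purposes come from the paragraph above; the source names, the `SUPPORTED_PURPOSES` table, and `can_support` are illustrative assumptions, not something either source article prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    DETECT = "detect"
    SIZE = "size"
    EXPLAIN = "explain"
    VALIDATE = "validate"
    MONITOR = "monitor"


# Hypothetical table: which purposes each source type can responsibly serve.
SUPPORTED_PURPOSES = {
    "support_ticket": {Purpose.DETECT},
    "workflow_metric": {Purpose.DETECT, Purpose.SIZE, Purpose.MONITOR},
    "story_interview": {Purpose.EXPLAIN},
}


@dataclass(frozen=True)
class Signal:
    source: str
    summary: str
    purpose: Purpose


def can_support(signal: Signal, needed: Purpose) -> bool:
    """True only if the signal was labeled for this purpose and its source can carry it."""
    allowed = SUPPORTED_PURPOSES.get(signal.source, set())
    return signal.purpose is needed and needed in allowed


# Made-up example: a ticket can start triage but cannot explain the cause.
ticket = Signal("support_ticket", "Export fails for several accounts", Purpose.DETECT)
print(can_support(ticket, Purpose.DETECT), can_support(ticket, Purpose.EXPLAIN))
```

The useful part is the refusal: a ticket labeled for detection cannot be quietly reused as an explanation.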

Response speed and evidence depth belong on different clocks

Customer signals create two fundamentally different decision conditions. When customers are actively unable to complete an important task, delay expands the harm. When a team is considering a durable product investment, premature certainty can consume capacity and institutionalize the wrong interpretation.

The incident article argues that a declared incident should become the responsible team’s immediate priority. In the process it describes, customer reports, product alarms, and engineers’ rollout monitoring converge on a rapid assessment of customer impact. It also reports that engineers monitor their changes as they reach production and that a rollback can land in a little under two minutes. In this context, a safe rollback does not require a complete causal theory; it is a reversible containment decision intended to reduce exposure while investigation continues.

The discovery article describes a more deliberate progression through a “Ladder of Evidence.” Repeated low-context signals justify moving upward toward recent, story-based customer accounts. Those accounts reconstruct what the customer was trying to do, what happened, and what constraints shaped the experience. The purpose is not to delay action indefinitely, but to avoid turning frequency into an unsupported solution.

A useful synthesis is to separate the action threshold from the belief threshold. Teams can act quickly when an intervention is reversible and the cost of waiting is high. They should demand richer evidence when a choice is difficult to reverse, consumes substantial capacity, or assumes a specific explanation for customer behavior. Fast containment and careful learning are therefore not competing philosophies; they govern different commitments.
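One way to make the two thresholds concrete is a small decision function. This is a sketch under stated assumptions: the `Commitment` fields paraphrase the conditions in the paragraph above, while the field names, the return labels, and the `keep_monitoring` fallback are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    reversible: bool
    waiting_is_costly: bool
    consumes_substantial_capacity: bool
    assumes_specific_cause: bool


def next_step(c: Commitment) -> str:
    # Action threshold: a reversible move with a high cost of delay can go first.
    if c.reversible and c.waiting_is_costly:
        return "act_now"
    # Belief threshold: durable or explanation-dependent choices need richer evidence.
    if (not c.reversible) or c.consumes_substantial_capacity or c.assumes_specific_cause:
        return "gather_richer_evidence"
    # Assumed fallback for cheap, low-urgency cases.
    return "keep_monitoring"


rollback = Commitment(True, True, False, False)
new_feature = Commitment(False, False, True, True)
print(next_step(rollback), next_step(new_feature))
```

A rollback clears the action threshold without any causal theory, while the roadmap commitment is held to the belief threshold.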

A routed signal system turns inputs into decisions

Preserve provenance before interpreting the signal

Every captured signal should retain enough context to show where it came from, which customer workflow it concerns, when it occurred, and whether it is an observation or an interpretation. This is a general operating practice rather than a fact reported by either source, but it follows directly from their shared concern with signal quality. A ticket summary, a metric anomaly, and an interview account should remain distinguishable after entering a common repository.

Preserving provenance also makes limitations visible. A Sales note may reflect the priorities of a commercial conversation. A dashboard records selected events but not necessarily customer intent. A story-based interview offers depth about a specific experience but does not by itself establish prevalence. None of these limitations makes the source unusable; each defines the questions it can responsibly answer.
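A provenance-preserving record might look like the sketch below. The fields mirror the four things named above (origin, workflow, time, observation versus interpretation) plus the questions a source can and cannot answer. The class name, field names, and the sample interview are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal


@dataclass(frozen=True)
class SignalRecord:
    source: str                                    # e.g. "support", "dashboard", "interview"
    workflow: str                                  # customer workflow the signal concerns
    occurred_at: datetime
    kind: Literal["observation", "interpretation"]
    text: str
    can_answer: tuple[str, ...]                    # questions this source can responsibly answer
    cannot_answer: tuple[str, ...]                 # known limitations, kept visible


# Hypothetical record: depth about one experience, no claim about prevalence.
interview = SignalRecord(
    source="interview",
    workflow="invoice_export",
    occurred_at=datetime(2026, 3, 2, tzinfo=timezone.utc),
    kind="observation",
    text="Customer retried the export three times before contacting support.",
    can_answer=("what happened in this case",),
    cannot_answer=("how many customers are affected",),
)
```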

Correlate without treating evidence as a vote

The discovery article presents triangulation across quantitative data, organizational observations, and qualitative customer insight. It cautions, in effect, against treating three inputs as interchangeable ballots. Convergence can strengthen an explanation, contradiction can expose segmentation or missing context, and silence in one channel can reveal an instrumentation or access gap.

The incident article supplies an operational version of the same principle. Customer conversations, heartbeat metrics, ordinary alarms, and rollout monitoring offer separate views of product health. A support pattern may establish visible pain, while a workflow metric helps assess scope and timing. Combining them produces a more useful impact picture than either channel can produce alone.

Route the signal to an explicit next action

A signal repository becomes a backlog graveyard if collection is not paired with routing. The next action might be incident triage, instrumentation review, identification of affected customers, a story-based interview, solution evaluation, or continued monitoring. The choice should reflect what is already known and which uncertainty most constrains the next decision.
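The routing choice can be sketched as a lookup from the uncertainty that most constrains the next decision to one of the actions listed above. The uncertainty labels and the `route` helper are assumptions for illustration; only the six actions come from the text.

```python
# Hypothetical mapping from the dominant uncertainty to the next action.
ROUTES = {
    "customers_blocked_now": "incident_triage",
    "metric_may_be_misleading": "instrumentation_review",
    "scope_unknown": "identify_affected_customers",
    "cause_unknown": "story_based_interview",
    "solution_unproven": "solution_evaluation",
}


def route(dominant_uncertainty: str) -> str:
    """Return the next action; anything unrecognized stays under monitoring."""
    return ROUTES.get(dominant_uncertainty, "continued_monitoring")


print(route("cause_unknown"))
```

Keeping the default as monitoring rather than "feature request" reflects the point that not every input deserves one.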

This routing step is where product operations adds leverage. It connects customer-facing teams, product trios, engineering owners, and decision-makers without pretending that every input deserves a feature request. It also creates a traceable path from the original observation to the investigation, intervention, and later result.

Ownership and cadence close the signal-to-learning loop

Signals move faster when ownership is defined before pressure arrives. The incident article reports distinct responsibilities for a technical lead, an incident commander when escalation is needed, a business lead for customer-facing coordination, and a resolution owner for follow-up work. The benefit is not hierarchy for its own sake; it is reduced ambiguity while customers are affected.

Discovery needs comparable clarity. The evidence article places responsibility on product teams to distinguish observations from interpretations, match the research method to the question, and improve interview quality without discouraging customer contact. Product operations can support that discipline by making evidence strength visible and ensuring that recurring signals receive either an investigation owner or an explicit decision not to pursue them.

The two workflows should ultimately reconnect. An incident can generate product questions about confusing recovery paths, missing safeguards, or poorly observed workflows. Discovery can reveal customer-critical actions that deserve heartbeat metrics or stronger operational readiness. Post-incident follow-ups, recurring signal reviews, customer research, and roadmap discussions should contribute to a shared record rather than separate departmental archives.

The next stage of mature product operations is therefore not simply collecting more feedback or adding more dashboards. It is designing a system in which the weakest signal can trigger appropriate attention, stronger evidence can refine the explanation, and clear ownership can carry learning into safer product and operational choices.

References
- Shivam.Consulting Blog — When Systems Fail: A Proven Incident Playbook for Fast, Safe Customer Recovery
- Shivam.Consulting Blog — Escape the Evidence Trap: Turn Customer Signals Into Better Product Decisions

July 14, 2026
Built for Your Biggest Days: How We Engineer Fair, Reliable Scale Without Downtime

I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments (the Black Friday sales, the sporting events, the production incidents), and they want confidence their growth won’t outpace the systems they depend on. We welcome those questions. They’re the right ones to ask of any critical component of your business.

Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.

We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.

While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it.

Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them. Here’s how we do that in practice.

We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.

“That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”

This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition; they reserve our ambition for the nuanced scaling challenges that matter.

The source of truth is the hard part. You can scale stateless web traffic by adding machines, add queue consumers, and add cache. Those are real problems, just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.

That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.

Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed. We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it.

Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk. Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.

Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform, from vector search in our AI features to realtime reporting, and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work.

Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.

More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident. We’ve used this pattern for years to make large search migrations safer and more incremental, a core playbook in our platform scalability and SRE practices.

Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers.

For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.

We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks.

The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed.

Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome.

AI providers are not commodity storage systems, and we do not design as if they are. That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point.

For enterprise customers, this matters because AI support volume can spike just like human support volume, and the AI layer must absorb that spike without relying on one fragile upstream path. When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.

Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce.

Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.

“That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”

Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.

We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer, a habit that anchors our incident management, observability, and FinOps practices.

Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.

“That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”

Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?

This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric; it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.

“Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”

The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.

Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime, and we do this in practice, not just as a goal.

This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.

“Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”

Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.

What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale; it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production.

We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer.

Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system, and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.

Inspired by this post on The Intercom Blog.

May 19, 2026
Stop Blurring the Lines: Clear Product–Engineering Boundaries to Boost Quality and Prevent Burnout

Where is the true boundary between product and engineering—and what happens when it gets blurry? I’ve led and coached teams through this question many times, and I’ve learned that clarity here isn’t just a nice-to-have; it’s foundational to quality, velocity, and team health.

I’ve seen well-intentioned product managers step in to “help” by taking ownership of bug triage, tech debt prioritization, or even system architecture. At first, it feels productive. Over time, it creates role confusion, slows decision-making, and burns out PMs—while paradoxically lowering engineering quality. The “CEO of the product” myth and legacy IT, project-based mindsets are usually at the root. Treating engineers as “order takers” breaks down in evergreen product environments.

The healthiest collaboration model is simple and disciplined: The product trio owns the “what”; engineering owns the “how”. Product managers are not people managers for engineers—and shouldn’t be accountable for engineering quality. Our job is to frame the problem, align on outcomes, and continuously discover value with customers—not to supervise technical execution.

If quality is a problem, the solution is escalating and fixing the system, not managing individual bugs. In practice, that means surfacing patterns and elevating them to engineering leadership, who can address root causes (staffing, skills, code health, CI/CD gaps, observability, or process design) rather than asking PMs to paper over issues with status updates. This keeps accountability where it belongs and reinforces OKRs built around outcomes rather than output.

One high-leverage move is to remove unnecessary intermediaries. Removing the PM as a middleman creates better flow and clearer ownership. Create direct paths for stakeholders to get bug status without routing everything through product. Use dashboards, shared tools, or Slack channels instead of one-off updates. In my teams, shared Jira views, Slack incident channels, and status pages eliminated handoffs, improved stakeholder management, and gave engineers the space to solve problems end-to-end.

Strong engineering leadership is non-negotiable. It should own the technical system, quality guardrails, sustainable pace, and the practices that uphold them: incident management, code review rigor, test coverage, and SLOs with SRE. Skilled engineering teams naturally push back when boundaries are crossed, and that’s a good thing. It signals ownership, craft pride, and a pathway to durable execution.

When do I step in as product? Primarily to clarify desired outcomes, sequencing, and trade-offs—bringing customer and business context to the table. I structure product roadmapping and sprint planning around value slices and risks, not task lists. I align on decision rights early: architecture and tech debt strategies live with engineering; product strategy, positioning, and success metrics live with product; discovery and prioritization live with the product trio.

Here are the system-level moves I’ve found most effective. Escalate systemic quality issues to engineering leadership, not individual contributors. Advocate for real engineering leadership if your org expects product teams rather than IT teams. Then reinforce a culture of continuous discovery so product, design, and engineering make better upstream decisions together. This is how empowered product teams ship higher-quality outcomes without burning anyone out.

If you’ve ever found yourself acting as the middleman for bug status or being asked to “own” engineering decisions outside your expertise, you’re not alone. Reset the boundaries, make work visible, and double down on shared outcomes. In my experience, the moment we clarify roles and remove status theater, quality rises, cycle time improves, and everyone does the job they were hired to do—better.

Inspired by this post on Product Talk.

February 24, 2026

Reliable AI Infrastructure: A Product Leader’s Playbook

Your AI feature can be online, fast, and still be failing. A report renders but omits important records. A workflow returns valid JSON with the wrong meaning. A retry creates a duplicate. A permissions change quietly removes the data needed for a trustworthy answer.

If you own an AI product, an uptime dashboard cannot tell you whether users are receiving the outcome you promised. You need a reliability system that covers data, models, runtime dependencies, output quality, delivery, and recovery. The practical goal is not to eliminate every failure. It is to detect meaningful failures early, contain their impact, and recover without making the situation worse.

Define reliability at the user-outcome boundary

Traditional service reliability often starts with a relatively clean question: did the request succeed? AI products make that question insufficient. A request can return a success status while the user receives an incomplete, structurally invalid, stale, unauthorized, or semantically poor result.

The failures worth designing for include small schema changes in non-deterministic output, silent permission changes, token-limit truncation, burst-driven rate limits, and clock skew affecting idempotent writes. None requires a total outage. Each can still break the product promise.

Start by writing a reliability contract for one important user journey. State what must be true when that journey succeeds. A useful contract usually covers these dimensions:

Reliability dimension	Question to answer	Evidence to capture
Completion	Did the workflow reach a terminal outcome?	Completed, rejected, timed out, cancelled, or still pending
Structural validity	Does the output satisfy the interface expected downstream?	Schema-validation result, schema version, and rejection reason
Data integrity	Was the required data accessible, current, and complete enough for the task?	Data-source status, permission result, retrieval result, and freshness signal
Semantic quality	Is the answer useful and acceptable for this use case?	Evaluation result by task, customer segment, language, or workflow
Latency	Did the outcome arrive while it was still useful?	End-to-end latency and latency for each pipeline stage
Delivery integrity	Was the result applied once, without duplication or corruption?	Idempotency key, write status, attempt count, and final state
Privacy and risk	Did processing respect the product’s data-handling rules?	Policy checks, PII-scanning result, access decision, and exception path

This contract prevents an easy but damaging mistake: counting technically completed requests as successful user outcomes. If a report is truncated yet parseable, the transport succeeded and the product failed. If a model response is excellent but based on data the user can no longer access, the answer should not be delivered as a success.
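To make the distinction between a completed request and a successful outcome concrete, here is a minimal Python sketch. The seven boolean fields mirror the dimensions in the table above; the class name, field names, and the sample report are hypothetical, and a real system would derive each check from recorded evidence rather than hard-coded flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeChecks:
    completed: bool
    schema_valid: bool
    data_intact: bool
    semantically_acceptable: bool
    within_latency: bool
    delivered_once: bool
    policy_ok: bool


def failed_dimensions(checks: OutcomeChecks) -> list[str]:
    """Names of every contract dimension that did not pass."""
    return [name for name, passed in vars(checks).items() if not passed]


def is_user_success(checks: OutcomeChecks) -> bool:
    # A user outcome counts as success only when every dimension passes.
    return not failed_dimensions(checks)


# Hypothetical truncated-but-parseable report: transport fine, product failed.
truncated_report = OutcomeChecks(
    completed=True, schema_valid=True, data_intact=True,
    semantically_acceptable=False, within_latency=True,
    delivered_once=True, policy_ok=True,
)
print(is_user_success(truncated_report), failed_dimensions(truncated_report))
```

Returning the failed dimensions, not just a boolean, keeps the reason attached to the outcome so it can be aggregated later.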

Turn the contract into service-level indicators that the system can measure. Then set service-level objectives around the indicators that matter to the user. The gap between the objective and perfect performance is the error budget; the portion that actual performance has not yet consumed is what remains available for change and experimentation.
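The arithmetic is simple enough to sketch. Assuming a ratio-style indicator (good outcomes over total outcomes) and an objective expressed as a target fraction, the function below reports how much of the error budget is still unspent. The function name and the sample numbers are illustrative only.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # nothing measured yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total   # failures the objective tolerates
    actual_bad = total - good                  # failures actually observed
    return 1.0 - actual_bad / allowed_bad


# Made-up numbers: a 99% objective over 10,000 outcomes tolerates about 100 failures.
remaining = error_budget_remaining(slo_target=0.99, good=9_950, total=10_000)
print(f"{remaining:.0%} of the budget remains")
```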

Do not hide behind a global average. Break reliability down by model, prompt version, schema version, dataset, workflow, customer segment, and dependency. AI failures are often concentrated. A healthy aggregate can conceal a severe regression for one language, one integration, or one high-value workflow.
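A small example shows how an aggregate can hide a concentrated regression. The event shape (`language`, `ok`) and the numbers are invented for illustration; the same grouping would apply to model, prompt version, or any other dimension named above.

```python
from collections import defaultdict


def success_rate_by(events: list[dict], key: str) -> dict[str, float]:
    """Success rate per value of `key`, e.g. per language or per workflow."""
    counts = defaultdict(lambda: [0, 0])  # value -> [good, total]
    for event in events:
        bucket = counts[event[key]]
        bucket[0] += 1 if event["ok"] else 0
        bucket[1] += 1
    return {value: good / total for value, (good, total) in counts.items()}


# Invented data: 95 healthy English requests, 5 German requests with 4 failures.
events = (
    [{"language": "en", "ok": True}] * 95
    + [{"language": "de", "ok": True}] * 1
    + [{"language": "de", "ok": False}] * 4
)

overall = sum(1 for e in events if e["ok"]) / len(events)
by_language = success_rate_by(events, "language")
print(overall, by_language)
```

In this made-up data the overall rate looks comfortable while one language is failing most of the time, which is exactly the case a global average conceals.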

Your error budget should also drive decisions. When budget consumption accelerates, narrow the rollout, pause the risky change, or redirect capacity toward the failure path. When the budget is healthy, you have evidence that the product can absorb controlled experimentation. That is more useful than declaring reliability important while allowing roadmap pressure to settle every tradeoff.
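One hedged way to turn budget consumption into a decision is a burn rate: the share of budget spent divided by the share of the measurement window elapsed. The thresholds and action names below are assumptions chosen for illustration, not recommended values.

```python
def burn_rate(budget_consumed: float, window_elapsed: float) -> float:
    """Budget fraction spent per window fraction elapsed; 1.0 means exactly on pace."""
    if not 0.0 < window_elapsed <= 1.0:
        raise ValueError("window_elapsed must be in (0, 1]")
    return budget_consumed / window_elapsed


def rollout_action(rate: float) -> str:
    # Hypothetical policy thresholds; tune them to your own objectives.
    if rate >= 2.0:
        return "pause_risky_change"
    if rate > 1.0:
        return "narrow_rollout"
    return "continue_experimentation"


# Half the budget gone after a quarter of the window: spending twice as fast as planned.
print(rollout_action(burn_rate(budget_consumed=0.5, window_elapsed=0.25)))
```

Writing the policy down before the pressure arrives is what keeps roadmap urgency from settling the tradeoff by default.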

Instrument the full path from request to delivered outcome

A useful AI trace does not stop at the model call. It follows the user request through authentication, permission checks, data retrieval, context assembly, model execution, output validation, business rules, persistence, and delivery. Give the journey one correlation identifier so an engineer can move from a failed user outcome to the responsible stage without reconstructing the request from unrelated logs.
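As a rough sketch of the single correlation identifier, the code below uses Python's standard `contextvars` module so every stage records the same ID without passing it through each call. The stage names follow the paragraph above; the in-memory `events` list, the helper names, and the simulated failure are assumptions standing in for a real tracing backend.

```python
import uuid
from contextvars import ContextVar

# One identifier per user request, visible to every stage that runs within it.
correlation_id = ContextVar("correlation_id", default="unset")
events = []  # stand-in for a real event or trace sink


def record(stage: str, result: str) -> None:
    events.append({"correlation_id": correlation_id.get(), "stage": stage, "result": result})


def handle_request() -> str:
    request_id = str(uuid.uuid4())
    token = correlation_id.set(request_id)
    try:
        record("permission_check", "ok")
        record("data_retrieval", "ok")
        record("model_execution", "ok")
        record("output_validation", "schema_invalid")  # simulated failure
    finally:
        correlation_id.reset(token)
    return request_id


def failing_stage(request_id: str):
    """Walk from a failed outcome to the first stage that did not succeed."""
    for event in events:
        if event["correlation_id"] == request_id and event["result"] != "ok":
            return event["stage"]
    return None


request_id = handle_request()
print(failing_stage(request_id))
```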

Build visibility at three levels:

Structured events: Record the request identifier, workflow, customer segment, model, prompt version, schema version, dependency, attempt number, latency, result class, and failure code. Use controlled fields rather than free-form error messages for the dimensions you expect to aggregate.
Distributed traces: Create a span for each meaningful stage. A trace should show whether time was spent waiting in a queue, retrieving data, calling a provider, validating output, or committing a side effect.
Product-level metrics: Measure valid completion, semantic evaluation results, p95 latency, queue pressure, validation failures, permission failures, truncation, retry volume, circuit-breaker activity, and error-budget consumption.
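
One possible shape for the structured events described above is sketched here. The field names, the `ResultClass` values, and the `to_log_line` helper are hypothetical; the point is only that aggregated dimensions are controlled values and that every stage carries the same correlation identifier.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from enum import Enum


class ResultClass(str, Enum):
    """Controlled vocabulary, so dashboards can aggregate without parsing text."""
    SUCCESS = "success"
    INVALID_OUTPUT = "invalid_output"
    PERMISSION_DENIED = "permission_denied"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class StageEvent:
    correlation_id: str  # shared by every stage of one user request
    stage: str
    workflow: str
    model: str
    prompt_version: str
    schema_version: str
    attempt: int
    latency_ms: int
    result_class: ResultClass


def to_log_line(event: StageEvent) -> str:
    """Serialize metadata only; prompts and responses are deliberately absent."""
    record = asdict(event)
    record["result_class"] = event.result_class.value
    return json.dumps(record, sort_keys=True)


example = StageEvent(
    correlation_id=uuid.uuid4().hex,
    stage="output_validation",
    workflow="weekly_report",
    model="model-a",
    prompt_version="p12",
    schema_version="s3",
    attempt=1,
    latency_ms=840,
    result_class=ResultClass.INVALID_OUTPUT,
)
print(to_log_line(example))
```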

Keep raw customer data, prompts, and model responses out of routine telemetry unless there is a defined and approved need to retain them. Structured metadata is usually enough for operational diagnosis. When content must be inspected, apply access controls, retention rules, redaction, and PII scanning as part of the observability design. Logging sensitive data first and deciding how to govern it later creates a second reliability problem: the monitoring system becomes a source of risk.

Design failure codes around actions, not organizational boundaries. Invalid model output, missing source permission, provider throttling, exhausted token budget, duplicate delivery, and policy rejection tell the responder what kind of path failed. A generic model error or integration error forces the on-call person to rediscover information the system already had.
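
To make the idea concrete, the following sketch pairs each failure code named above with a first action for the responder. The action strings are assumptions for illustration; a real mapping would come from the team's own runbooks.

```python
from enum import Enum


class FailureCode(Enum):
    INVALID_MODEL_OUTPUT = "invalid_model_output"
    MISSING_SOURCE_PERMISSION = "missing_source_permission"
    PROVIDER_THROTTLED = "provider_throttled"
    TOKEN_BUDGET_EXHAUSTED = "token_budget_exhausted"
    DUPLICATE_DELIVERY = "duplicate_delivery"
    POLICY_REJECTION = "policy_rejection"


# Hypothetical first actions; each code points at a path, not a team.
NEXT_ACTION = {
    FailureCode.INVALID_MODEL_OUTPUT: "quarantine output and check prompt or schema version",
    FailureCode.MISSING_SOURCE_PERMISSION: "do not retry; surface the unavailable source",
    FailureCode.PROVIDER_THROTTLED: "retry with backoff; check circuit-breaker state",
    FailureCode.TOKEN_BUDGET_EXHAUSTED: "reduce context or split the task",
    FailureCode.DUPLICATE_DELIVERY: "inspect idempotency key and stored result",
    FailureCode.POLICY_REJECTION: "route to the policy exception path",
}


def next_action(code: FailureCode) -> str:
    """Return the first responder action recorded for a failure code."""
    return NEXT_ACTION[code]


print(next_action(FailureCode.PROVIDER_THROTTLED))
```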

Alerts should represent conditions that require intervention. Error-budget burn, broad validation failures, growing queue age, or a dependency circuit remaining open may justify an immediate response. A slow-moving change in evaluation performance may belong in a product review instead. If every anomaly pages someone, the monitoring system trains the organization to ignore it.

The same dashboard should work for product and engineering. An SRE needs the failing dependency and trace. A product leader needs the affected workflow, segment, volume, and user consequence. Connecting both views prevents a team from fixing the loudest technical symptom while a quieter failure causes more product damage.

Harden each boundary instead of trusting the happy path

Most AI workflows combine components with different failure behavior: internal services, databases, queues, retrieval systems, model providers, and third-party data sources. Reliability comes from controlling the boundary around each component. The following sequence gives you a practical hardening checklist.

Bound every external call. Set explicit timeouts using observed latency distributions, including p95 behavior, as an input. A missing timeout allows one slow dependency to consume workers and delay unrelated requests. Treat timeout as a classified outcome rather than an unhandled exception.
Retry only failures likely to be temporary. Provider throttling and transient network failures may recover. Invalid input, permission denial, and schema rejection usually will not. Use delayed retries with exponential backoff and jitter so concurrent failures do not return as another synchronized burst. Cap attempts and record the final reason.
Put a circuit breaker around unstable dependencies. When failure crosses the condition you have defined, stop sending traffic long enough to prevent resource exhaustion and cascading latency. Make the open, probing, and closed states visible. The product should communicate a controlled unavailable or delayed state rather than pretending work completed.
Make side effects idempotent. Derive the idempotency key from the logical operation, destination, and relevant payload version. Persist the result of the operation so retries can return or reconcile the prior outcome. Do not depend on local wall-clock time alone to distinguish writes; clock skew can turn retry protection into duplicate or missing work.
Apply backpressure before the queue becomes the outage. Bound concurrency for each constrained dependency. When demand exceeds safe processing capacity, queue, defer, or reject according to the user promise. Preserve enough state to resume safely. Unbounded retries feeding an unbounded queue convert a temporary provider problem into a long recovery.
Validate contracts before committing effects. Validate generated JSON against the expected schema, including required fields, types, allowed values, and relevant bounds. Keep parsing separate from business validation: syntactically valid output can still violate a product rule. Reject or quarantine invalid results before they reach reporting, billing, messaging, or another irreversible operation.
Detect incomplete generation explicitly. Budget context and expected output together. When the provider exposes completion metadata, use it to distinguish a completed response from one stopped by a limit. Do not pass partial structured output downstream merely because a parser can repair it. Reduce unnecessary context, split an oversized task, or return a controlled failure.
Treat permissions as changing runtime state. Check access near the point of retrieval, classify authorization failures separately, and monitor permission-related drops by integration. Do not repeatedly retry a denial. If upstream access changes silently, the product should expose which data is unavailable rather than producing an apparently complete result from a partial dataset.
Put risky behavior behind feature flags. Separate deployment from release. A flag should let you disable a model, prompt, retrieval path, or downstream action without waiting for another deployment. Test the rollback or disable path before relying on it during an incident.
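
The retry guidance in the checklist above can be sketched as follows. The exception classes, the default limits, and the injected `sleep` parameter are assumptions made for the example; the delay uses the common "full jitter" variant, which is one of several reasonable choices.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Failure that may recover, e.g. throttling or a network blip."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, e.g. permission denial."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform between 0 and the capped exponential step."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only transient failures, with a hard cap on attempts."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except PermanentError:
            raise  # classified as non-retryable: fail immediately
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Record the final reason by chaining the last transient error.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A timeout around `operation` itself would still be needed and is outside this sketch.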

These controls need an explicit order of operations. Validate permissions before retrieving sensitive data. Validate generated output before executing a side effect. Persist idempotency state before acknowledging completion. Apply retry policy after classifying the failure. Ordering is what prevents individually sensible mechanisms from undermining one another.
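
A minimal sketch of that ordering for a single side effect is shown below. The report shape, the in-memory `store`, and the `send` callable are hypothetical stand-ins; a production version would use durable storage and a real schema validator.

```python
import hashlib
import json
from typing import Callable, Dict


def idempotency_key(operation: str, destination: str, payload_version: str) -> str:
    """Derive the key from the logical operation, never from wall-clock time."""
    raw = json.dumps([operation, destination, payload_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def validate_report(report: dict) -> None:
    """Structural check only; business rules would be a separate step."""
    if not isinstance(report.get("title"), str) or not report["title"]:
        raise ValueError("title must be a non-empty string")
    if not isinstance(report.get("rows"), list):
        raise ValueError("rows must be a list")


def deliver_once(
    report: dict,
    destination: str,
    payload_version: str,
    store: Dict[str, str],
    send: Callable[[str, dict], None],
) -> str:
    validate_report(report)  # 1. validate before any side effect
    key = idempotency_key("send_report", destination, payload_version)
    if key in store:
        return store[key]  # 2. a retry returns the recorded outcome
    store[key] = "pending"  # 3. persist state before acting
    send(destination, report)  # 4. perform the side effect
    store[key] = "delivered"  # 5. record the result, then acknowledge
    return "delivered"
```

If a retry finds the key still marked `pending`, the caller has to reconcile with the destination rather than send again; that reconciliation step is intentionally left out here.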

Be careful with graceful degradation. It is useful when the degraded state remains honest and valuable, such as delaying a non-urgent report or identifying an unavailable data source. It is dangerous when the system silently substitutes stale, incomplete, or lower-quality information and presents it as equivalent. The user must be able to distinguish degraded output from normal output.

Make model and prompt releases earn production traffic

A prompt edit can change output structure. A model change can improve one task while weakening another. A retrieval change can alter both answer quality and latency. Treat these modifications as production changes even when no application code changed.

An eval-driven release path should work like this:

Version the complete behavior. Record the model, prompt, schema, retrieval configuration, tool definitions, policy rules, and relevant application release. Without this bundle, a failed response cannot be reproduced with confidence.
Build evaluations around the product contract. Cover representative tasks, important customer segments, difficult inputs, and failure cases discovered in production. Include structural checks alongside semantic checks. A quality score cannot compensate for output that breaks its interface.
Establish a baseline. Compare the candidate with the current production behavior on the same evaluation set. Review the distribution by meaningful slice rather than relying only on one average score.
Gate promotion in CI/CD. Require the agreed evaluation baselines to hold or improve before the candidate can progress. Make exceptions explicit, owned, and reversible. A hidden manual bypass is not a release policy.
Release through a canary. Send a limited, observable portion of eligible traffic to the candidate. Keep the current version available. Watch evaluation signals, validation failures, p95 latency, dependency behavior, and error-budget consumption by version.
Expand in stages or roll back. Increase exposure only while the user-facing indicators remain within the agreed conditions. If a signal degrades, use the feature flag or version control to stop exposure quickly while preserving diagnostic evidence.
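
The baseline comparison and promotion gate in this path could look roughly like the sketch below. The slice names, scores, and zero default tolerance are invented for illustration; the gate blocks on any regressing slice even when the average improves.

```python
from typing import Dict, List


def regressed_slices(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return slices where the candidate scores below baseline.

    A slice missing from the candidate counts as a regression.
    An empty list means the gate passes.
    """
    failures = []
    for slice_name, base_score in baseline.items():
        score = candidate.get(slice_name)
        if score is None or score < base_score - tolerance:
            failures.append(slice_name)
    return sorted(failures)


# Hypothetical evaluation results per slice (higher is better).
baseline = {"lang=en": 0.90, "lang=de": 0.88, "schema_valid": 1.00}
candidate = {"lang=en": 0.99, "lang=de": 0.84, "schema_valid": 1.00}

average_improved = sum(candidate.values()) > sum(baseline.values())
blocked_by = regressed_slices(baseline, candidate)
print(average_improved, blocked_by)
```

A CI job would typically fail the build when `blocked_by` is non-empty, with any exception recorded against a named owner rather than bypassed silently.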

The release gate needs product judgment. Not every evaluation failure carries the same consequence. A formatting defect in an internal draft is different from an unsupported claim in a customer-facing recommendation or an unauthorized action by an agent. Define which failures block release, which require human review, and which can be monitored after release.

Do not force a choice between delivery speed and reliability without evidence. Track deployment frequency alongside change failure rate. Frequent, small, reversible releases can improve both learning speed and recovery. Large bundled changes make it harder to identify the cause of regression and increase the amount of behavior a rollback must undo.

Before approving an AI release, a product leader should be able to answer five questions:

Which user promise can this change affect?
Which evaluation and production indicators represent that promise?
Which segments could regress even if the aggregate improves?
What condition stops or reverses the rollout?
Who has the authority and the mechanism to act when that condition appears?

If those answers are missing, the release is relying on optimism rather than a control system.

Run reliability as a product operating system

Technical safeguards decay unless ownership and operating routines keep them current. Models change, integrations evolve, permissions move, and traffic develops new burst patterns. Reliability therefore belongs in roadmap and incident decisions, not in a one-time infrastructure project.

Prepare a lightweight runbook for each critical journey. It should identify the owner, user-visible failure states, primary indicators, relevant dashboards, recent release controls, dependency status, safe disable path, and rules for replaying work. A responder should not have to infer whether replay can duplicate a message, report, charge, or external action.

During an incident, establish the user impact before chasing every technical symptom. Identify the affected workflow and segment, stop further harm, preserve evidence, and use the safest available rollback or containment control. Communicate whether results are delayed, incomplete, unavailable, or at risk of duplication. Those states require different user actions.

Afterward, use a blameless review to find the conditions that allowed the failure to reach users. The strongest follow-up actions are testable and automatable: a new schema check, an evaluation case, a permission metric, a retry limit, a canary gate, a better idempotency key, or a rehearsed rollback. An instruction to be more careful is not a control.

Prioritize the reliability backlog by user consequence and error-budget impact. A noisy internal exception with no lost outcome may matter less than a silent data omission affecting a small but important workflow. This keeps observability from becoming a competition to reduce whichever counter is easiest to move.

Privacy-by-design and AI risk management belong in the same operating system. Add PII scanning, access validation, and policy checks to the pipeline and release gates. Assign owners for exceptions. Revisit the controls as the product gains new data sources or actions. Risk is a continuing product constraint, not a review performed after the architecture is settled.

Key takeaways

- Define success at the delivered user outcome, not at the HTTP response or completed model call.
- Measure completion, structural validity, data integrity, semantic quality, latency, delivery integrity, and privacy where each applies.
- Trace the whole pipeline and segment reliability by model, prompt, schema, workflow, dataset, and customer group.
- Use timeouts, selective retries, circuit breakers, idempotency, backpressure, validation, and feature flags as coordinated controls.
- Gate model and prompt changes with evaluations, then use canaries and staged releases to limit exposure.
- Let SLOs, error-budget consumption, and user consequence determine when reliability work outranks feature work.

Choose your highest-consequence AI journey and write its reliability contract. Trace it end to end, attach an SLO to the user outcome, and replay the known failure modes against the controls you already have. If the system cannot tell you whether its output was valid, complete, permitted, and delivered once, that is the first reliability gap to close.

References

Shivam.Consulting Blog — How We Built Rock-Solid AI Infrastructure: Lessons From Scaling AI Visibility and Reliability

February 4, 2026

The Safety of Speed: 180 Deploys a Day, 12‑Minute Releases, 99.8%+ Availability

“Speed is not the enemy of safety; it is the prerequisite for it.” I live by this principle. In our organization, the average time from merging code to it being used by customers in production is just 12 minutes, and that short window is fundamental to how we build, ship, and learn.

In January 2026, we are averaging 180 ships per workday – roughly 20 deployments every hour. Conventional wisdom suggests that to increase stability, you must slow down. I believe the opposite. Speed is not the enemy of safety; it is the prerequisite for it. Accumulating code creates risk; shipping small batches minimizes it. Shipping is our company’s heartbeat.

Maintaining this frequency while targeting 99.8%+ availability has required over a decade of focused investment in systems, principles, and processes. We protect the integrity of our systems through three layers of defense: an automated pipeline that is simple and reliable and removes the need for manual intervention; a shipping workflow that promotes ownership and uses guardrails as accelerants; and a recovery model that optimizes for mitigating inevitable failures. Here’s how we’ve built each layer so that velocity is our greatest source of stability.

While our platform consists of various services and frontend applications, I’ll focus here on our Ruby on Rails monolith. It is our core application and the one we deploy most frequently; we also deploy it to three different data‑hosting regions with independent pipelines. Our other services follow similar pipeline principles and safeguards, but the Rails monolith is the clearest example of how we ship at scale.

The automated pipeline is designed to move code from merge to production as fast as possible while enforcing strict safety checks. It is fully automated, and the vast majority of releases require no human intervention—critical for CI/CD at high deployment frequency.

Once an engineer merges code to GitHub, two things happen immediately. First, the build: we compile the Rails application and its dependencies into a deployable asset (a slug) in about four minutes. Second, parallel CI: our test suite runs alongside the build; through extensive optimization, parallelization, and test selection, the vast majority of CI builds finish in under five minutes.

As soon as the slug is built, it’s deployed to a pre‑production environment. CI does not block the progression of the slug to pre‑production. Deploying to pre‑production takes around two minutes. This environment serves no customer traffic, but it is connected to our production datastores, mirrors our production infrastructure variants (e.g., web serving, asynchronous worker), and is configured so that requests exercise the pre‑release code and workers.

Immediately after deployment, we run and await several automated approval gates. We verify that the application boots cleanly on hosts (boot test), confirm the parallel test suite passed (CI check), and execute functional synthetics using Datadog Synthetics on critical flows—such as loading or editing a Fin workflow. If any gate fails, the release is halted and does not go to production.

Once approved, we promote the code to thousands of large virtual machines. A deployment orchestrator triggers these deployments simultaneously, while a decentralized, staggered rollout avoids changing the state of the entire fleet at the same millisecond. Within each machine, a rolling restart mechanism removes a process with old code from the serving path, lets it drain gracefully, and replaces it with a fresh process running the new code. From the moment a deployment starts, the first requests are served by new code within roughly two minutes, and the vast majority of the global fleet updates transparently within six minutes. Once restarts have been triggered on every machine, the production stage unblocks so the next deployment can begin.

We treat a stalled pipeline as a high‑priority incident. If the automated system rejects three consecutive release attempts, it pages an on‑call engineer. These are pre‑production blocks, but if the shipping lane stops moving, changes pile up—and our stability relies on building and shipping in small steps. The on‑call’s job is to restore flow so that tiny, safe, frequent updates continue to keep risk low.
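
The paging rule described here reduces to a small counter. The class below is a hypothetical sketch of that rule, not the pipeline's actual implementation; it pages once when the streak reaches the threshold and resets on the next approved release.

```python
class PipelineMonitor:
    """Track consecutive rejected release attempts."""

    def __init__(self, page_after: int = 3) -> None:
        self.page_after = page_after
        self.consecutive_rejections = 0

    def record(self, approved: bool) -> bool:
        """Record one release attempt; return True if on-call should be paged."""
        if approved:
            self.consecutive_rejections = 0
            return False
        self.consecutive_rejections += 1
        # Page exactly once, at the moment the streak reaches the threshold.
        return self.consecutive_rejections == self.page_after


monitor = PipelineMonitor()
pages = [monitor.record(approved) for approved in (False, False, True, False, False, False)]
print(pages)
```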

Our shipping workflow is built on extreme ownership: tools assist, but the engineer is accountable for quality and the decision to merge. I insist that you are present when you ship. The practical benefit of a 12‑minute deployment cycle is that engineers remain in the zone, focused on the problem they just solved, and ready to validate behavior as it goes live.

To support this, our deployment system sends Slack notifications the moment code is submitted and as it advances through stages, embeds direct observability links to relevant dashboards and logs in every PR and message, and prompts verification so engineers actively watch the dials and test features in production. It is not acceptable to rely on green builds alone. You’re expected to watch your change go live, and if you’re not prepared to roll back, you’re not prepared to ship. We maintain a no‑blame culture: quick rollbacks and immediate reverts are signs of vigilance and ownership, not failure.

We make extensive use of feature flags to turn deployment into a non‑event. By decoupling deployment (moving code to servers) from release (turning features on), we shrink the blast radius of change. Flags can be enabled for all customers, a specific subset, or disabled for everyone in under 60 seconds through our backend UI. Engineers can group flags into beta features and run phased rollouts; we also ensure flags work consistently across non‑monolith applications. In the past three months, we created over 560 flags—and we actively manage them to avoid permanent complexity.
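
One common way to implement this kind of phased rollout is deterministic bucketing, sketched below. The function names, the allowlist, and the kill switch parameter are assumptions for the example and do not describe the flag system in the article.

```python
import hashlib
from typing import AbstractSet


def bucket(flag: str, customer_id: str) -> int:
    """Stable bucket in [0, 100) so a customer keeps the same assignment."""
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    flag: str,
    customer_id: str,
    rollout_percent: int,
    allowlist: AbstractSet[str] = frozenset(),
    kill_switch: bool = False,
) -> bool:
    if kill_switch:
        return False  # disabled for everyone, regardless of other settings
    if customer_id in allowlist:
        return True  # specific subset, e.g. beta customers
    return bucket(flag, customer_id) < rollout_percent
```

Because the bucket depends only on the flag name and customer identifier, raising `rollout_percent` from 10 to 50 keeps every customer who was already enabled and only adds new ones.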

For complex refactors, especially when behavior should not change, we use Scientist, GitHub’s open‑source experimentation library. It runs candidate logic (new code) in parallel with existing logic (old code) in production, instruments both paths for result and timing comparisons, and keeps only the existing behavior visible to users. That means we can iterate on and validate new code under real load without risking the customer experience, then switch when confident.
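
Scientist itself is a Ruby library; the Python sketch below only illustrates the underlying pattern and does not reproduce its API. The `run_experiment` function and the observation fields are hypothetical, and the two paths run one after the other here for simplicity.

```python
import time
from typing import Any, Callable, Dict, List


def run_experiment(
    name: str,
    control: Callable[[], Any],
    candidate: Callable[[], Any],
    observations: List[Dict[str, Any]],
) -> Any:
    """Run both code paths, record a comparison, return only the control result."""
    start = time.perf_counter()
    control_result = control()
    control_seconds = time.perf_counter() - start

    candidate_result, candidate_error = None, None
    start = time.perf_counter()
    try:
        candidate_result = candidate()
    except Exception as exc:  # a candidate failure must never reach the user
        candidate_error = repr(exc)
    candidate_seconds = time.perf_counter() - start

    observations.append({
        "experiment": name,
        "matched": candidate_error is None and candidate_result == control_result,
        "candidate_error": candidate_error,
        "control_seconds": control_seconds,
        "candidate_seconds": candidate_seconds,
    })
    return control_result
```

This shape only suits read paths or side-effect-free logic: running a candidate that writes data would apply the write twice.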

When engineers need to go deeper before merging, they can generate a slug and deploy it to a virtual machine, detaching a running production host from the serving path and connecting for manual testing. They can also put a pre‑release slug on a serving machine that handles a small percentage of jobs or web requests. Single‑host validation lets us slice observability to those hosts, compare against the main release, and make low‑level changes safer. Staging is a simulation; production is reality. Testing on a single production host validates assumptions with real‑world data without risking the fleet.

Our recovery model starts from a simple principle: stop monitoring systems; start monitoring outcomes. Traditional monitoring tells you if a server is healthy; we care whether customers are healthy. We rely on heartbeat metrics—vital signs that represent the core value our product provides—such as the rate at which messages and comments are created.

Unlike standard uptime checks, heartbeat metrics are binary in spirit. If message send rates dip below baseline, it does not matter if infrastructure dashboards are green. Down is down, and if customers can’t do their job, uptime percentages are irrelevant. By tracking real‑world success rates as a high‑level signal, we catch subtle degradations that traditional alerting either misses or over‑alerts on.

Because we ship in small, incremental steps and maintain previous releases on our virtual machines, our Time to Recover (TTR) is generally very fast. If a heartbeat metric drops or a critical anomaly is detected right after a ship, the system can trigger an automatic rollback, reverting to the release that was running 20 minutes ago—often restoring service before an engineer responds. For complex issues, engineers can initiate a manual rollback through our deployment UI; doing so also locks the production pipeline to prevent further releases while we investigate and remove problematic code.
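
A heartbeat check that can trigger rollback might be reduced to the logic below. The 80% ratio and the ten-minute watch window are illustrative assumptions, not figures from the article, and a real system would also need a trustworthy baseline for the time of day.

```python
def heartbeat_healthy(
    current_rate: float,
    baseline_rate: float,
    min_ratio: float = 0.8,
) -> bool:
    """True when the outcome rate is at or above the tolerated share of baseline."""
    if baseline_rate <= 0:
        return True  # no usable baseline, so this check cannot judge health
    return current_rate >= baseline_rate * min_ratio


def should_auto_rollback(
    current_rate: float,
    baseline_rate: float,
    seconds_since_deploy: float,
    watch_window_seconds: float = 600,
    min_ratio: float = 0.8,
) -> bool:
    """Roll back only when a dip closely follows a deployment."""
    recently_shipped = 0 <= seconds_since_deploy <= watch_window_seconds
    return recently_shipped and not heartbeat_healthy(
        current_rate, baseline_rate, min_ratio
    )


# Messages created per minute: a sharp dip two minutes after a ship.
print(should_auto_rollback(current_rate=500, baseline_rate=1000, seconds_since_deploy=120))
```

A dip outside the watch window still matters; in this sketch it would be left to alerting and human investigation rather than an automatic revert.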

Resumption of service is not the end. Every incident prompts an incident review, and we don’t just fix the bug. We ask, “How did the machine allow this to happen?” Then we harden the system so it cannot happen again. This loop—fast shipping, fast recovery, rigorous learning—compounds resilience over time.

This operating model aligns with the DORA metrics: high deployment frequency, short lead time for changes, low change failure rate, and rapid time to restore service. It’s a CI/CD and SRE‑informed approach that converts speed into a defensive advantage rather than a liability.

Shipping 180 times a day isn’t a vanity metric; it’s a deliberate choice to protect the customer experience. With a 12‑minute window from code to customer, the feedback loop is tight and engineers retain context—and accountability—for the immediate impact of their work. Maintaining this pace requires more than fast CI; it requires judgment, extreme ownership, disciplined use of feature flags, and a recovery model that monitors outcomes. We rely on human expertise, augmented by these layers of defense, to catch issues before they turn into customer pain. We don’t ship fast despite our need for stability; we ship fast to stay in control of change.

Inspired by this post on The Intercom Blog.

January 26, 2026

How to Build AI-Enabled Cybersecurity Operations Safely

You have an alert queue full of low-context signals, analysts spending time assembling evidence, and pressure to show that AI can improve the operation. The tempting move is to add a copilot to the security console and call the problem solved.

The harder leadership decision is where AI may influence a security decision, where it may take action, and how you will know it is helping. The right goal is not an autonomous security operations center. It is a shorter, more reliable path from signal to containment, with explicit limits on what a model can do.

Design the decision loop before choosing the AI

AI-enabled cybersecurity operations are easier to manage when you separate three capabilities that vendors often bundle together:

Detection models identify patterns, anomalies, or risk signals in security telemetry.
Generative AI explains evidence, summarizes an incident, retrieves a relevant playbook, and proposes a next action.
Orchestration performs a deterministic operation such as collecting evidence, updating a ticket, isolating an endpoint, or rotating a credential.

These components should not share the same authority. An anomaly score is not proof of compromise. A fluent explanation is not an approved response. A tool call is not safe merely because the model produced valid syntax.

Map the operational loop before you evaluate a model:

Observe: collect the endpoint, identity, network, and application signals relevant to the use case.
Detect: rank suspicious activity without hiding the underlying evidence.
Enrich: add asset criticality, identity context, recent changes, and the applicable response procedure.
Decide: show the recommended action, its prerequisites, and the reason for escalation.
Act: send the approved instruction to deterministic automation with narrowly scoped permissions.
Learn: record the analyst’s disposition, edits, approval, execution result, and any reversal.

For each stage, name the owner, permitted inputs, expected output, failure mode, and fallback. If the AI service becomes unavailable, established detections and response paths should continue to work. If the model produces a poor recommendation, an analyst should be able to reject it without fighting the workflow.

This map is also the product specification. It gives security engineering, SRE, product management, and risk owners a shared object to review. It prevents the initiative from collapsing into a feature list such as summarization, chat, and automation without a defined operational result.

Start with one detection decision, not another alert stream

A strong first use case has frequent decisions, usable feedback, and enough context to evaluate the model. It should improve an existing analyst workflow instead of creating a separate queue that someone must remember to check.

Behavioral models can examine endpoint telemetry, identity signals, and network flows to find activity that fixed signatures may miss. The useful product is not the anomaly itself. It is a ranked case that tells the analyst what changed, which evidence drove the score, what asset or identity is exposed, and what decision is required.

Use these criteria to choose the first workflow:

The decision is specific. “Investigate unusual authentication behavior for a privileged identity” is testable. “Use AI to detect threats” is not.
The evidence is available at decision time. If analysts must leave the workflow and search several systems before judging the recommendation, the AI is working with incomplete context.
The disposition is captured. Confirmed threat, benign activity, insufficient evidence, and duplicate are more useful than a generic closed status.
The existing path remains visible. Analysts should be able to compare the AI-ranked case with the evidence they already trust.
A wrong answer is recoverable. Begin with prioritization and investigation support, not an irreversible action.

Do not treat a smaller alert queue as proof of better detection. A model can reduce noise by suppressing useful signals. Measure precision and recall together: precision asks how much surfaced work was relevant, while recall asks how much relevant activity the workflow found. Because missed incidents may become visible only later, define how labels will be corrected when an investigation changes the original disposition.
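
A short worked example shows why the two measures have to be read together. The counts are invented: the second configuration surfaces far fewer alerts and looks more precise, yet it finds half as much of the relevant activity.

```python
from typing import Tuple


def precision_recall(
    true_positives: int,
    false_positives: int,
    false_negatives: int,
) -> Tuple[float, float]:
    """Precision: share of surfaced work that was relevant.
    Recall: share of relevant activity that was surfaced."""
    surfaced = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / surfaced if surfaced else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical labels for the same period, 50 relevant cases in total.
noisy_queue = precision_recall(true_positives=40, false_positives=160, false_negatives=10)
small_queue = precision_recall(true_positives=20, false_positives=20, false_negatives=30)
print(noisy_queue, small_queue)
```

Since late-confirmed incidents change the false-negative count, these figures should be recomputed whenever a disposition is corrected.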

Mean time to detect also needs a precise starting point. Decide whether the clock begins when the event occurs, when telemetry reaches the platform, or when an existing control first observes it. Otherwise, a faster model can appear to improve detection while ingestion or analyst queue time remains untouched.
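
The effect of the starting point is easy to see with two made-up cases. The field names `event_at`, `ingested_at`, and `detected_at` are assumptions for the sketch.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List

Case = Dict[str, datetime]


def mean_minutes_to_detect(cases: List[Case], clock_start: str) -> float:
    """Mean detection time in minutes, measured from the chosen starting field."""
    return mean(
        (case["detected_at"] - case[clock_start]).total_seconds() / 60
        for case in cases
    )


cases = [
    {
        "event_at": datetime(2026, 1, 5, 9, 0),
        "ingested_at": datetime(2026, 1, 5, 9, 20),
        "detected_at": datetime(2026, 1, 5, 9, 30),
    },
    {
        "event_at": datetime(2026, 1, 6, 14, 0),
        "ingested_at": datetime(2026, 1, 6, 14, 40),
        "detected_at": datetime(2026, 1, 6, 14, 50),
    },
]

from_event = mean_minutes_to_detect(cases, "event_at")
from_ingestion = mean_minutes_to_detect(cases, "ingested_at")
print(from_event, from_ingestion)
```

Measured from ingestion, these cases average ten minutes; measured from the event, they average forty, and most of that time sits in the pipeline rather than in the model.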

The launch question is therefore not “Did the model find anomalies?” Ask whether it moved the right cases forward sooner, preserved the evidence needed for judgment, and avoided pushing material risk below the analyst’s line of sight.

Give the response copilot context, not unchecked authority

Incident response is a natural place for generative AI because analysts repeatedly assemble timelines, summarize evidence, search runbooks, draft ticket updates, and prepare remediation steps. Those tasks are language-heavy, but the actions they inform can disrupt production or destroy evidence.

Use a retrieval-first flow for response recommendations:

Retrieve the approved playbook and the version that applies to the incident type.
Assemble the facts the model is permitted to see, including the alert evidence and relevant asset context.
Generate a recommendation tied to a named playbook step rather than relying on the model’s general memory.
Check prerequisites, identity permissions, environment, and action scope through policy code outside the model.
Present the evidence, proposed action, expected impact, and rollback path to the designated approver.
Execute the approved operation through a deterministic orchestration layer.
Log the retrieved material, prompt, output, approval, tool arguments, result, and subsequent reversal or escalation.

This architecture makes an important distinction: the model can propose an action, but policy and people grant authority. The model should never be able to expand its own permissions or substitute a different tool when the approved operation fails.

An authority ladder gives that distinction operational force. Use the following as a starting policy and adapt it to the blast radius of your environment:

Action class	Examples	AI role	Required control
Read-only support	Summarize evidence, retrieve a runbook, collect approved diagnostics	Generate or execute within a fixed scope	Least-privilege access, complete logging, and no mutation permissions
Reversible operational change	Update a ticket, isolate an endpoint, rotate a credential	Recommend and prepare the action	Named human approval, validated target, impact warning, and tested rollback
High-blast-radius or irreversible change	Block a production network segment, alter broad access policy, delete data or evidence	Explain and escalate only	Incident command process and approval from the responsible system owner

Endpoint isolation can interrupt legitimate work. Credential rotation can break services when dependencies are unknown. Deleting data can permanently remove forensic evidence. Put those consequences beside the approval button, and provide a safe alternative such as collecting more evidence or opening an incident bridge.
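
The authority ladder can be enforced in policy code that sits outside the model, along the lines of this sketch. The action names, the `authorize` function, and its return values are hypothetical; the important property is that an unknown or high-impact action can never be executed on the model's say-so.

```python
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    HIGH_BLAST_RADIUS = "high_blast_radius"


# Hypothetical registry; anything not listed here is denied outright.
ACTION_CLASSES = {
    "summarize_evidence": ActionClass.READ_ONLY,
    "retrieve_runbook": ActionClass.READ_ONLY,
    "update_ticket": ActionClass.REVERSIBLE,
    "isolate_endpoint": ActionClass.REVERSIBLE,
    "rotate_credential": ActionClass.REVERSIBLE,
    "delete_data": ActionClass.HIGH_BLAST_RADIUS,
}


def authorize(action: str, approver: Optional[str], target_validated: bool) -> str:
    """Decide what may happen to a model-proposed action."""
    action_class = ACTION_CLASSES.get(action)
    if action_class is None:
        return "deny"  # the model cannot introduce a new tool
    if action_class is ActionClass.READ_ONLY:
        return "execute"
    if action_class is ActionClass.REVERSIBLE:
        if approver and target_validated:
            return "execute"
        return "await_approval"
    return "escalate"  # explain and hand off to incident command


print(authorize("isolate_endpoint", approver=None, target_validated=True))
```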

Test the copilot as a security product, not as a conversational demo. Your evaluation set should cover correct recommendations, missing prerequisites, conflicting evidence, obsolete playbooks, requests outside the user’s permission, sensitive data, malformed tool arguments, and situations that require refusal or escalation. Measure whether the recommendation is grounded in the approved playbook, whether the action is appropriate, and whether the system preserved the required approval boundary.

Begin in shadow mode, where recommendations are evaluated but cannot change systems. Move next to draft-only assistance. Permit bounded execution only after the team has defined promotion criteria, rollback behavior, and an owner who can stop the workflow.

Prompt and output logs deserve the same access discipline as other sensitive security records. They may contain identities, indicators, configuration details, or incident evidence. Apply contextual data policies before information reaches the model, restrict access to the logs, and make retention a deliberate governance decision rather than a vendor default.

Counter AI-enabled attacks by changing the process

Attackers can use generative AI for targeted spear-phishing, deepfake executive voice messages, and more evasive malware. Trying to make every employee reliably identify synthetic content is a weak control. The appearance and quality of the lure will keep changing.

Change the process that turns a convincing message into access, money movement, or sensitive disclosure:

Require an out-of-band verification step for unusual executive requests, especially when the request changes credentials, access, payment details, or normal procedure.
Do not let familiarity with a voice, writing style, profile image, or caller ID serve as identity proof.
Harden identity controls with multifactor authentication, conditional access, and continuous risk scoring.
Give help-desk and operations teams a defined escalation path when a requester applies urgency or asks them to bypass verification.
Train employees with realistic AI-generated lure patterns, then measure reporting behavior and successful compromise rather than course completion alone.
Use AI-assisted red-team exercises to test the process, and use deception controls where they can divert attacker effort without putting production data at risk.
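
The first, second, and fourth controls above can be written down as an explicit rule rather than left to judgment in the moment. The following is a minimal sketch under stated assumptions: the function name `next_step`, the step labels, and the change categories are hypothetical, and a real policy would live in the identity or help-desk tooling rather than in a standalone function.

```python
# Request attributes that cross a risk boundary (taken from the list above).
RISK_BOUNDARY_CHANGES = frozenset(
    {"credentials", "access", "payment_details", "normal_procedure"}
)


def next_step(changes: set[str], applies_urgency: bool, asks_to_bypass: bool) -> str:
    """Return the process step for a request, ignoring how convincing it looks.

    Voice, writing style, profile image, and caller ID are deliberately not
    parameters: familiarity is not treated as identity proof.
    """
    if asks_to_bypass or applies_urgency:
        return "escalate"            # pressure triggers the defined escalation path
    if changes & RISK_BOUNDARY_CHANGES:
        return "verify_out_of_band"  # confirm through a separate, known channel
    return "proceed"
```

For example, `next_step({"payment_details"}, False, False)` yields the out-of-band verification step, while the same request delivered with urgency is escalated instead.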

This reframes awareness training. Employees are not expected to become media-forensics experts. They need to notice when a request crosses a risk boundary and know the exact verification step to take. Product leaders can help by removing friction from the safe path: make reporting easy, make escalation visible, and avoid punishing someone who pauses a suspicious request.

The same principle applies to detection. Do not build the defense around whether content “looks AI-generated.” Build it around identity, behavior, privilege, asset sensitivity, and the actions an attacker is attempting.

Use a 90-day plan with measurable promotion gates

A focused 90-day plan is enough to establish an operating model if you keep the scope narrow: one high-signal detection decision, one mature response playbook, and one employee risk path such as phishing. The purpose is not to automate the security operation in a quarter. It is to prove that the decision loop can become faster without weakening control.

Days 1-30: define the workflow and baseline

Map the current signal-to-action path and identify where time, context, or consistency is lost.
Name a product owner, security owner, model-risk owner, and operational approver for the workflow.
Select the detection decision, response playbook, and employee risk process in scope.
Record baseline mean time to detect, mean time to recover, queue time, disposition quality, and the existing failure modes.
Define the data the model may access, the data it must not access, and the identity under which each tool operation runs.
Write the authority ladder, fallback behavior, stop condition, and rollback procedure before connecting production tools.

Days 31-60: evaluate in shadow mode

Run the detection model beside the existing workflow and compare ranked cases with analyst dispositions.
Test response recommendations against approved playbooks, including ambiguous and adversarial cases.
Review false positives and false negatives with analysts instead of reducing model quality to one aggregate score.
Confirm that sensitive-data policies, model access controls, prompt and output logging, and audit access work as designed.
Run a tabletop exercise covering model failure, unavailable retrieval, unsafe recommendations, excessive permissions, and orchestration failure.
Set promotion criteria for model quality, operational benefit, privacy, access control, and reversibility. Use thresholds appropriate to the risk of the chosen workflow rather than copying a generic benchmark.

Days 61-90: release bounded capability

Release the detection workflow to a defined analyst group while preserving the established fallback.
Enable draft-only response assistance before allowing any system mutation.
Permit only the actions covered by the approved authority policy; keep high-blast-radius changes outside model execution.
Review analyst edits, rejections, approvals, reversals, and escalations to find where the workflow lacks context.
Compare mean time to detect and recover with the baseline, while checking that precision, recall, privacy, and control failures have not regressed.
Make the next release decision explicitly: expand, hold, narrow the scope, or stop. A pilot that exposes an unsafe assumption has still produced a useful result.
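
The release decision is easier to defend when its rules are written before the pilot data arrives. The sketch below shows one possible mapping from pilot results to the four outcomes. The rules, the `PilotMetrics` fields, and the sample numbers are assumptions made for illustration; your own thresholds should reflect the risk of the chosen workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotMetrics:
    mean_time_to_detect_min: float
    mean_time_to_recover_min: float
    precision: float
    recall: float
    control_failures: int = 0  # unauthorized-action attempts, privacy violations


def release_decision(baseline: PilotMetrics, pilot: PilotMetrics) -> str:
    """Map pilot results to expand / hold / narrow / stop (assumed rules)."""
    if pilot.control_failures > baseline.control_failures:
        return "stop"    # a guardrail regression outweighs any speed gain
    if pilot.precision < baseline.precision or pilot.recall < baseline.recall:
        return "narrow"  # model quality regressed, so shrink the scope
    faster = (
        pilot.mean_time_to_detect_min < baseline.mean_time_to_detect_min
        and pilot.mean_time_to_recover_min < baseline.mean_time_to_recover_min
    )
    return "expand" if faster else "hold"


# Illustrative numbers only, not measurements.
baseline = PilotMetrics(42.0, 180.0, precision=0.80, recall=0.70)
pilot = PilotMetrics(30.0, 150.0, precision=0.82, recall=0.71)
decision = release_decision(baseline, pilot)
```

Checking control failures first means a faster pilot cannot be promoted while a safety regression is outstanding.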

The dashboard should separate outcomes from guardrails. Detection and recovery time tell you whether the operation improved. Precision, recall, recommendation correctness, and playbook grounding tell you how the model behaved. Rejections, manual edits, reversals, unauthorized-action attempts, and sensitive-data policy violations tell you whether the workflow is safe enough to scale.

Acceptance rate alone is not a quality metric. Analysts may accept a recommendation because it is correct, because the interface makes editing difficult, or because workload encourages quick approval. Review the resulting action and later incident outcome, not only the click.

Governance must continue after launch. Assign an owner to every model-enabled workflow, control access by role and context, version the model and retrieved playbooks, retain an auditable decision record, test for drift and bias, and repeat tabletop exercises when permissions or orchestration change. A model update is a security-product release, even when it arrives through a managed vendor.

Key takeaways

Optimize the full signal-to-action loop; do not add a disconnected AI queue.
Let models detect, summarize, and recommend, while policy and named people control authority.
Ground response guidance in approved, versioned playbooks before generating remediation steps.
Use shadow mode, draft-only assistance, and bounded execution as separate promotion stages.
Measure operational outcomes alongside precision, recall, overrides, reversals, privacy failures, and unauthorized-action attempts.
Defend against convincing AI-generated lures by hardening identity and verification processes, not by expecting perfect human detection.

Your next operating review should end with three named decisions: the detection workflow you will improve, the response action the AI may only recommend, and the metric that would stop the release. Once those are explicit, AI becomes a governable capability instead of an open-ended security experiment.

References

Pendo – 3 Powerful Ways AI Is Rewriting Cybersecurity: Smarter Defense, Faster Response, Fewer Breaches

January 4, 2026

Inside the Engine Room: How I Drive Scalable Analytics APIs, Reliability, and Performance

I build and scale analytics platforms with a product mindset, and the work starts with the "middleware and compute systems that power analytics at scale." In Amplitude and other unified analytics platforms, that foundation is what makes everything else possible.

Day to day, I oversee the "APIs behind charts, cohorts, and metrics—driving performance, reliability, and platform scalability." When those APIs are fast and resilient, every product team—from growth to customer success—can trust the insights they use to ship, learn, and iterate.

From an engineering leadership standpoint, I partner closely with SRE to define SLOs and error budgets, wire CI/CD pipelines for safe deploys, and track DORA metrics so we improve speed without compromising quality. This combination reduces incident management toil and shortens MTTR while keeping data freshness and query latency within strict thresholds.

From a product management leadership lens, the goal is clarity: crisp APIs, predictable contracts, and transparent stakeholder management across data, engineering, and GTM teams. That alignment empowers product teams with reliable cohorts and metrics, accelerates experimentation, and de-risks roadmaps.

If you’re scaling analytics, invest first in the platform layer: middleware and compute, schema governance, caching strategies, and cost-aware compute. Do that well, and the visible experience—charts, cohorts, and metrics—feels effortless, even as you grow to serve billions of events with confidence.

Inspired by this post on Amplitude – Best Practices.

December 12, 2025

Agentic AI for Incident Response: A Practical Operating Model

An incident fires. Your responders are not short of data; they are short of a trustworthy path through it. Deployment timelines, service ownership, dashboards, logs, runbooks, and prior incidents live in separate places, while the cost of a wrong action rises by the minute.

The decision in front of you is not whether AI can summarize the incident channel. It is whether an agent can shorten the investigation without becoming another failure mode. That requires an operating model covering the agent’s job, context, permissions, interface, and evaluation before you give it meaningful authority.

Give the agent an investigation job before action authority

An incident-response agent should run a goal-directed investigation loop, not wait for isolated prompts like a chatbot. A credible implementation can collect context, form and test hypotheses, and draft fixes inside Slack. The important product decision is where that loop must stop for human judgment.

Model the loop on the work a strong responder already performs:

Scope the incident. Identify the affected service, environment, customer surface, start time, and known symptoms. Preserve unknowns instead of filling them with plausible guesses.
Gather relevant context. Retrieve recent changes, service ownership, dependencies, telemetry, runbooks, feature-flag changes, and similar incidents.
Form competing hypotheses. Produce a ranked set rather than locking onto the first convincing explanation. Distinguish observed facts from inferences.
Test each hypothesis. Use read-only tools to query metrics, logs, traces, deployment state, and dependency health. Record what supports or weakens each possibility.
Propose the next best action. Explain the target, expected effect, risk, preconditions, and recovery path. Do not hide uncertainty behind an authoritative tone.
Update the investigation. Incorporate tool results and responder corrections, discard disproven hypotheses, and choose the next check.
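
The hypothesis-handling part of this loop can be sketched as a small data structure. This is a simplified illustration, not a description of any vendor's implementation: `Hypothesis`, `record_check`, and `ranked` are hypothetical names, and ordering by net evidence count is a placeholder rule that a real system would replace with something more careful.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    supporting: list[str] = field(default_factory=list)
    conflicting: list[str] = field(default_factory=list)
    disproven: bool = False


def record_check(h: Hypothesis, evidence: str, verdict: str) -> None:
    """Attach a read-only check result: 'supports', 'weakens', or 'disproves'."""
    if verdict == "supports":
        h.supporting.append(evidence)
    elif verdict in ("weakens", "disproves"):
        h.conflicting.append(evidence)
        h.disproven = h.disproven or verdict == "disproves"
    else:
        raise ValueError(f"unknown verdict: {verdict}")


def ranked(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Drop disproven hypotheses; order the rest by net evidence (placeholder rule)."""
    active = [h for h in hypotheses if not h.disproven]
    return sorted(
        active, key=lambda h: len(h.supporting) - len(h.conflicting), reverse=True
    )


deploy = Hypothesis("Recent deploy introduced a regression")
dependency = Hypothesis("Upstream dependency is degraded")
capacity = Hypothesis("Traffic exceeded capacity")

record_check(deploy, "error rate rose within minutes of the deploy", "supports")
record_check(dependency, "dependency health checks are green", "weakens")
record_check(capacity, "request volume is flat versus last week", "disproves")

order = [h.statement for h in ranked([dependency, capacity, deploy])]
```

Because each hypothesis keeps its supporting and conflicting evidence, the responder can see why an explanation moved up, moved down, or was discarded.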

The incident commander remains accountable for priorities and mitigation. The agent acts as an investigation engine: it gathers, tests, organizes, and proposes. This division is more useful than treating human involvement as a final approval click after the AI has already made every material decision.

Choose the first workflow with care. A good starting point has a bounded service area, dependable read-only signals, known responders, established runbooks, and outcomes you can verify after the incident. A workflow that depends on undocumented tribal knowledge or unrestricted production access is not ready for agentic automation. Fix the operating system around the incident before expecting a model to compensate for it.

Do not begin with the most dramatic remediation you can automate. Early value usually comes from reducing context switching, locating the correct owner, connecting symptoms to recent changes, and eliminating weak hypotheses. Those tasks consume scarce attention but do not require the agent to mutate production.

Context quality determines the ceiling of the investigation

A capable model cannot reason with operational context it cannot find, distinguish, or trust. If a service has three names across the deployment system, observability platform, and incident channel, retrieval becomes unreliable before model reasoning even begins.

Create a context contract for every service placed within the agent’s scope. At minimum, make these fields explicit:

Identity: canonical service name, aliases, repository, runtime, and environment.
Ownership: accountable team, current on-call route, and escalation path.
Topology: upstream dependencies, downstream consumers, data stores, queues, and shared infrastructure.
Change history: deployments, configuration changes, feature flags, migrations, and rollback state.
Operational knowledge: current runbooks, known failure modes, dashboards, alerts, and prior incident records.
Control policy: tools the agent may call, environments it may inspect, actions it may propose, and actions it may never execute.
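
A context contract is easier to enforce when it is a typed record with a completeness check. The sketch below covers a subset of the fields listed above. The class name, the field names, and the choice of which fields count as required are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContextContract:
    """Per-service context contract; field names loosely mirror the list above."""
    canonical_name: str = ""
    aliases: list[str] = field(default_factory=list)
    owning_team: str = ""
    on_call_route: str = ""
    upstream_dependencies: list[str] = field(default_factory=list)
    runbook_urls: list[str] = field(default_factory=list)
    allowed_tools: list[str] = field(default_factory=list)
    forbidden_actions: list[str] = field(default_factory=list)


# Fields that must be non-empty before the service enters the agent's scope
# (an assumed minimum, chosen for illustration).
REQUIRED = ("canonical_name", "owning_team", "on_call_route", "runbook_urls", "allowed_tools")


def missing_fields(contract: ContextContract) -> list[str]:
    """Return required fields that are empty, so the gap is explicit."""
    return [name for name in REQUIRED if not getattr(contract, name)]


checkout = ContextContract(
    canonical_name="checkout-api",
    aliases=["checkout", "co-api"],
    owning_team="payments",
    allowed_tools=["query_metrics"],
)
gaps = missing_fields(checkout)  # on_call_route and runbook_urls are empty here
```

A non-empty result is a reason to keep the service out of scope until the record is repaired, rather than letting the agent guess.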

Start retrieval with exact operational signals. Filter by canonical service, environment, incident time window, deployment identifier, alert type, and ownership tag. Then rerank the surviving records for the current question. This deterministic tagging and reranking foundation is easier to debug than making semantic similarity responsible for every retrieval decision.
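
A minimal sketch of that two-stage retrieval follows. The record shape, the `retrieve` function, and the word-overlap score are hypothetical; the overlap score is only a stand-in for whatever reranker you actually use. The point is the ordering: exact filters run first, and ranking only sees records that survived them.

```python
from datetime import datetime, timedelta


def retrieve(records, service, environment, start, end, question, limit=3):
    """Apply exact filters first, then rerank the survivors for the question."""
    survivors = [
        r for r in records
        if r["service"] == service
        and r["environment"] == environment
        and start <= r["timestamp"] <= end
    ]
    terms = set(question.lower().split())

    def overlap(record):
        # Placeholder relevance score: words shared by the question and the text.
        return len(terms & set(record["text"].lower().split()))

    return sorted(survivors, key=overlap, reverse=True)[:limit]


incident_start = datetime(2025, 1, 1, 12, 0)
records = [
    {"id": "deploy-1", "service": "checkout-api", "environment": "production",
     "timestamp": incident_start - timedelta(minutes=10),
     "text": "deploy changed payment timeout configuration"},
    {"id": "deploy-2", "service": "checkout-api", "environment": "staging",
     "timestamp": incident_start - timedelta(minutes=5),
     "text": "deploy changed payment timeout configuration"},
    {"id": "flag-1", "service": "checkout-api", "environment": "production",
     "timestamp": incident_start - timedelta(minutes=30),
     "text": "feature flag enabled for new cart banner"},
]
hits = retrieve(records, "checkout-api", "production",
                incident_start - timedelta(hours=1), incident_start,
                "why are payment requests hitting a timeout")
hit_ids = [r["id"] for r in hits]
```

The staging deploy has identical text to the production one, yet it never reaches the ranking step, which is the behavior a similarity-only search cannot guarantee.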

Add embeddings where language actually creates ambiguity: matching an unfamiliar symptom to a differently worded historical incident, finding a relevant paragraph inside a long runbook, or connecting terminology used by two teams. Semantic retrieval should widen discovery, not erase exact boundaries such as production versus staging or one tenant versus another.

Require every retrieved item to carry provenance that a responder can inspect: its system of record, service and environment, creation or update time, incident-time availability, and reason for retrieval. This lets the responder notice four common failures quickly:

A runbook is relevant but stale.
An ownership record is current but was different when the incident began.
A similar incident came from another environment with different dependencies.
A historical evaluation accidentally exposed the final root cause before the agent could have known it.
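
Three of these four failures can be flagged mechanically once provenance is attached to each item. The sketch below is an assumption-laden illustration: `RetrievedItem`, `provenance_warnings`, and the 180-day staleness threshold are hypothetical. Detecting an ownership change between incident start and now is not covered, because it needs versioned ownership records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetrievedItem:
    system_of_record: str
    service: str
    environment: str
    updated_at: datetime
    available_at: datetime  # when the record first existed
    reason: str             # why retrieval returned it


def provenance_warnings(item, incident_env, as_of, max_age=timedelta(days=180)):
    """Return human-readable flags for a retrieved item (threshold is assumed)."""
    warnings = []
    if as_of - item.updated_at > max_age:
        warnings.append("stale: not updated within max_age")
    if item.environment != incident_env:
        warnings.append("environment mismatch")
    if item.available_at > as_of:
        warnings.append("not yet available at this point in the incident")
    return warnings


as_of = datetime(2025, 6, 1, 12, 0)
runbook = RetrievedItem("wiki", "checkout-api", "production",
                        updated_at=datetime(2024, 1, 15),
                        available_at=datetime(2023, 5, 2),
                        reason="matched alert type")
postmortem = RetrievedItem("incident tracker", "checkout-api", "staging",
                           updated_at=datetime(2025, 6, 3),
                           available_at=datetime(2025, 6, 3),
                           reason="similar symptoms")
runbook_flags = provenance_warnings(runbook, "production", as_of)
postmortem_flags = provenance_warnings(postmortem, "production", as_of)
```

Surfacing the flags next to the item lets the responder decide whether to trust it, instead of discovering the mismatch after acting on it.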

Treat missing context as an observable product state. The agent should say that it cannot locate a deployment record or dependency map, identify which system was checked, and propose a safe way to continue. A confident answer assembled around a missing record is more dangerous than an explicit gap.

Scale permissions to reversibility and blast radius

Autonomy is not one switch. It is a set of permissions attached to particular tools, targets, environments, and action classes. Granting broad credentials because the agent usually behaves conservatively turns a model-quality issue into a production-control issue.

Action class	Appropriate agent role	Required human control
Read-only investigation	Query approved telemetry, changes, ownership, and runbooks	Audited access with service and environment boundaries
Recommendation or communication	Draft a diagnostic check, remediation plan, incident update, or escalation	A responder reviews customer-facing messages and consequential recommendations
Bounded, reversible execution	Invoke a preapproved runbook against an explicitly named target	Approval bound to the exact action, target, inputs, and current incident
Irreversible or broad execution	Explain the need and prepare a plan, but do not execute during the initial rollout	Existing change controls and accountable operators remain in force

Do not label an action reversible merely because the interface contains a rollback button. A deployment rollback can still be unsafe after an incompatible schema or data change. A restart can amplify load or destroy useful diagnostic state. Reversibility has to be validated for the specific service state, not inferred from the action name.

For every executable tool, define guardrails outside the prompt:

Use least-privilege credentials scoped by service and environment.
Allowlist tools, targets, and input shapes rather than relying on natural-language prohibitions.
Preview the exact command or workflow, target, parameters, and expected effect before approval.
Bind approval to that exact action so the agent cannot reuse it for a changed target or plan.
Use rate limits, idempotency controls, and circuit breakers where repeated calls could cause harm.
Route production changes through existing CI/CD or runbook automation when possible.
Record retrievals, tool inputs, tool outputs, approvals, denials, and resulting state changes in an audit trail.
Provide a direct way to suspend the agent’s tool access without disabling the incident workflow itself.
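
Binding approval to the exact action can be done by fingerprinting the action and accepting only a matching fingerprint at execution time. The following sketch uses Python's standard `hashlib` and `json` modules. The `ApprovalStore` class, the action fields, and the single-use behavior are assumptions for illustration, and a production version would also need expiry and persistent storage.

```python
import hashlib
import json


def action_fingerprint(action: dict) -> str:
    """Hash a canonical JSON form of the action so any change alters the digest."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalStore:
    """Holds single-use approvals keyed by the exact action fingerprint."""

    def __init__(self) -> None:
        self._approved = set()

    def approve(self, action: dict) -> None:
        self._approved.add(action_fingerprint(action))

    def consume(self, action: dict) -> bool:
        """Return True once for an approved action; False for anything else."""
        digest = action_fingerprint(action)
        if digest in self._approved:
            self._approved.remove(digest)
            return True
        return False


approved = {
    "incident": "INC-1",
    "runbook": "restart-worker",
    "target": "checkout-api/staging",
    "inputs": {"replicas": 1},
}
changed = dict(approved, target="checkout-api/production")

store = ApprovalStore()
store.approve(approved)
changed_allowed = store.consume(changed)  # different target, so no approval exists
first_use = store.consume(approved)       # exact match, approval is spent
second_use = store.consume(approved)      # approval cannot be reused
```

Because the check happens in code at the tool boundary, it holds even if the model's plan drifts after the human approved an earlier version.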

The action proposal should be a control artifact, not a conversational suggestion. It needs the evidence supporting the action, the exact target, the expected observable result, the maximum intended scope, known preconditions, and what the responder will do if the result does not appear. If the agent cannot supply those fields, it has not earned execution authority for that action.

Keep outward communication on a separate permission path. Drafting a status update is low-risk technically but consequential for customers and the business. Human review should verify what is known, what remains uncertain, and whether the message promises a recovery time the evidence cannot support.

Make evidence and uncertainty legible in the incident room

Putting the agent inside the collaboration surface where incidents already unfold reduces the friction of opening another product and re-explaining the situation. It also means the agent’s output competes with urgent human messages. Long narrative answers will be skipped, however intelligent they sound.

Give each investigation update a stable structure:

Observed: facts returned by named systems, with timestamps and links where available.
Hypotheses: ranked explanations with the supporting and conflicting evidence for each.
Changed since the last update: new evidence, rejected hypotheses, and responder corrections.
Next check: the read-only query or tool call most likely to distinguish between the remaining possibilities.
Proposed action: target, expected effect, blast radius, preconditions, and recovery path.
Decision needed: the specific approval, input, or ownership choice required from a human.

This is not a request to expose a model’s private, free-form chain of thought. Responders need a structured evidence trail: claims, retrieved signals, tool results, rejected alternatives, and action rationale. That artifact is more useful for review because each part can be checked against the operational record.

Confidence labels are helpful only when they change behavior. Define what the interface does when confidence is low: ask for a missing service identifier, run another safe check, present multiple hypotheses, or escalate to the owner. Do not display a precise-looking score unless you have evaluated whether that score corresponds to actual correctness in your incident set.
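
Checking whether a score corresponds to correctness usually starts with a calibration table: group past predictions by stated confidence and compare each group's average confidence with how often it was actually right. The sketch below shows one simple way to build that table. The function name, the bin count, and the sample data are illustrative assumptions, not results from any real incident set.

```python
from collections import defaultdict


def calibration_table(predictions, bins=5):
    """Group (confidence, was_correct) pairs into equal-width bins.

    Returns {bin_index: (mean_confidence, observed_accuracy, count)}.
    """
    grouped = defaultdict(list)
    for confidence, was_correct in predictions:
        index = min(int(confidence * bins), bins - 1)  # 1.0 falls in the top bin
        grouped[index].append((confidence, was_correct))
    table = {}
    for index, rows in sorted(grouped.items()):
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[index] = (mean_conf, accuracy, len(rows))
    return table


# Made-up replay results: (stated confidence, was the top hypothesis right?)
replayed = [(0.9, True), (0.9, False), (0.9, False), (0.9, True),
            (0.3, False), (0.3, True)]
table = calibration_table(replayed)
```

In this invented data the agent states 0.9 confidence but is right only half the time in that bin, which is the kind of gap that should keep a numeric score out of the interface.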

Design human correction as part of the main workflow. A responder should be able to reject a hypothesis, correct the service or environment, mark a retrieved record stale, deny an action, and state why. The agent should preserve that decision in the incident record and replan from it. Repeatedly resurfacing a rejected hypothesis erodes trust even when the underlying model is otherwise capable.

Watch for a subtle interface failure: polished summaries can make weak investigations look complete. Make unresolved questions and conflicting signals visually prominent in the message structure. The goal is not to make the agent sound certain. It is to help the incident commander see what is known, what is inferred, and what decision comes next.

Test against past incidents, then expand authority one boundary at a time

A demo proves that the agent can complete a favorable path. It does not prove that the agent will retrieve the right context, resist a misleading correlation, respect permissions, or propose a safe action when production is ambiguous.

Use post-incident time-travel evaluations. Reconstruct what the agent could have known at each point in a real incident. Begin with the original trigger and expose deployments, telemetry, messages, and tool results only when they became available. Hide the final root cause, later analysis, and corrected metadata until the corresponding point in the replay. Otherwise, you are testing hindsight rather than incident response.
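
The core of a time-travel replay is a filter on when each record became available, not when it was written about. A minimal sketch, assuming each event carries a hypothetical `available_at` timestamp:

```python
from datetime import datetime


def visible_at(events, replay_time):
    """Return events the agent could have known at replay_time, oldest first."""
    known = [e for e in events if e["available_at"] <= replay_time]
    return sorted(known, key=lambda e: e["available_at"])


events = [
    {"kind": "alert", "available_at": datetime(2025, 3, 1, 9, 0),
     "text": "checkout error rate high"},
    {"kind": "deploy", "available_at": datetime(2025, 3, 1, 8, 45),
     "text": "checkout-api release"},
    {"kind": "postmortem", "available_at": datetime(2025, 3, 4, 15, 0),
     "text": "final root-cause analysis"},
]
seen = [e["kind"] for e in visible_at(events, datetime(2025, 3, 1, 9, 30))]
```

At the 09:30 replay point the agent sees the deploy and the alert but not the postmortem, so its hypotheses are graded on what a responder could have known at that moment.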

Grade the investigation on operational usefulness, not prose quality:

Scoping accuracy: Did it identify the correct service, environment, symptoms, and ownership route?
Context retrieval: Did it find the relevant change, runbook, dependency, or earlier incident without mixing incompatible records?
Hypothesis quality: Where did the eventual cause appear in the ranked set, and what evidence was used to test it?
Evidence integrity: Does every factual claim match a retrieved record or tool result? Did the agent invent a signal that was never observed?
Tool correctness: Did it select the correct tool, target, environment, and parameters?
Action safety: Was the proposed action inside policy, and were its blast radius, preconditions, and recovery path explicit?
Calibration: Did expressed certainty track actual correctness, especially when context was incomplete?
Time compression: How did the time to a useful hypothesis, correct owner, mitigation decision, and recovery compare with the existing workflow?
Human effort: Which searches, handoffs, repeated explanations, and diagnostic checks did the agent remove or add?

Treat safety failures differently from diagnostic misses. A missed hypothesis is a capability problem. Crossing a permission boundary, inventing evidence, or targeting the wrong environment is a release blocker for that tool path. Averaging all outcomes into one quality score can conceal exactly the failure that matters most.
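
One way to keep a safety failure from being averaged away is to gate on it separately from the diagnostic score. The sketch below assumes each replay result is a dictionary with hypothetical `found_cause` and `failures` keys, and the failure labels are invented for this example.

```python
SAFETY_FAILURES = frozenset(
    {"permission_boundary_crossed", "invented_evidence", "wrong_environment"}
)


def gate_tool_path(replay_results):
    """Block on any safety failure; report diagnostic accuracy separately."""
    blockers = sorted(
        {f for r in replay_results for f in r["failures"] if f in SAFETY_FAILURES}
    )
    hits = sum(1 for r in replay_results if r["found_cause"])
    return {
        "release_blocked": bool(blockers),
        "blockers": blockers,
        "diagnostic_hit_rate": hits / len(replay_results) if replay_results else 0.0,
    }


# Illustrative replay outcomes, not real evaluation data.
replays = [
    {"found_cause": True, "failures": []},
    {"found_cause": True, "failures": []},
    {"found_cause": True, "failures": ["wrong_environment"]},
    {"found_cause": False, "failures": ["missed_hypothesis"]},
]
report = gate_tool_path(replays)
```

Here the agent found the cause in three of four replays, yet the tool path stays blocked because one replay targeted the wrong environment.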

A practical rollout sequence

Instrument the human workflow. Capture incident timelines, ownership changes, diagnostic steps, approvals, mitigations, and outcomes. You need a baseline before claiming improvement.
Replay historical incidents. Use time-bounded context and score the agent against known outcomes. Repair retrieval and service metadata before tuning for eloquence.
Run in shadow mode. Let the agent investigate live incidents without posting conclusions or changing systems. Compare its evidence and hypotheses with the responder’s path.
Expose read-only assistance. Allow responders to request context, hypothesis checks, and draft updates. Collect explicit acceptance, correction, and rejection signals.
Add recommendation mode. Let the agent propose remediations using the structured action artifact, while humans continue to execute through established controls.
Enable one bounded action path. Choose a preapproved runbook with a clear target, validated preconditions, observable effect, and recovery procedure. Keep approval attached to the exact invocation.
Expand by tool and service. Grant additional authority only when evaluation evidence supports that particular boundary. Do not treat success on one service as proof of readiness everywhere.

Re-run the evaluation set after changes to prompts, models, tools, service topology, runbooks, or permissions. An agent can regress even when its general language quality improves. Operational behavior depends on the whole system around the model.

Key takeaways

Start with investigation and context compression; earn execution authority later.
Build deterministic service, environment, time, and ownership filters before depending on semantic retrieval.
Separate observed facts, hypotheses, and proposed actions in every incident update.
Enforce permissions in tools and infrastructure, not only in prompts.
Evaluate with historical time travel so the agent never sees facts that were unavailable during the real incident.
Expand autonomy one action, tool, service, and environment boundary at a time.

The next outage is the wrong time to discover that your agent cannot distinguish a plausible explanation from verified evidence. Before it happens, choose one bounded incident workflow, define its context contract and permission envelope, and replay several real investigations without future information. If the agent can make its evidence legible, stay inside policy, and consistently move responders toward the next correct decision, you have a foundation worth expanding.

References

Shivam.Consulting Blog — How Incident.io’s AI SRE Diagnoses, Hypothesizes, and Fixes Outages in Slack at Record Speed

November 6, 2025
