Month: January 2026

How to Scale AI Pilots Into Mature Production Systems

You have AI pilots that demo well, enthusiastic teams asking for broader rollout, and executives expecting the investment to show up in operating results. Yet the closer you get to production, the longer the list of unresolved questions becomes: Who owns the workflow? How will quality be measured? What happens when the model is wrong? Can the economics survive real usage?

The next move is not to launch more pilots. It is to install a system that can repeatedly turn a validated use case into a governed, measurable, and improving production workflow. That system is what separates AI experimentation from mature deployment.

A successful pilot is not evidence of production readiness

AI adoption is already common enough that adoption itself tells you very little. Among more than 2,400 global customer service professionals, 82% of senior leaders invested in AI in 2025, 87% planned to invest in 2026, and only 10% described their deployment as mature. The sample is specific to customer service, so those figures are better used as a directional benchmark than as a universal maturity rate. The underlying execution problem applies much more broadly: buying or piloting AI is easier than making it dependable inside a core workflow.

A pilot is designed to answer a narrow learning question. Can the model classify this request, draft this response, summarize this record, or choose the next action under controlled conditions? Production has to answer a harder question: can the entire workflow create enough value, across ordinary and difficult cases, while remaining safe, observable, supportable, and economically sensible?

I use a simple test. If the team can describe the model but cannot describe the operating workflow around it, the work is still a prototype. A production case should make each of these elements explicit:

Outcome: The customer or business result that should improve, plus the current baseline.
Workflow boundary: Where AI enters, which decisions it may make, which systems it may use, and where its authority ends.
Quality standard: The evaluation cases, acceptance criteria, and failure categories that determine whether a release is good enough.
Safe failure path: What the system does when information is missing, a tool fails, a policy is triggered, or the requested action exceeds its authority.
Accountability: A named product owner for the outcome and a named operational owner for production performance.
Economics: The value created and the full cost of inference, retrieval, tools, review, support, and incident handling.
Learning mechanism: How production failures and user corrections return to the evaluation set and release process.

These are not finishing tasks to schedule after the model works. They are part of the product. Deferring them creates a predictable trap: the pilot looks increasingly impressive while the distance to a responsible launch quietly grows.

Do not confuse automation coverage with maturity, either. A system can handle many requests and still be immature if nobody can explain why it made a decision, detect a quality regression, contain a failure, or calculate the result. Conversely, a narrowly scoped workflow can be mature when its boundaries, controls, outcomes, and ownership are clear.

Depth matters because quality is produced by the whole operating system, not the prompt alone. In customer service, 43% of mature adopters reported higher quality and consistency, compared with 24% of teams in earlier stages. These are self-reported results, but the practical implication is sound: integration, evaluation, and continuous improvement are not overhead around the AI. They are how the AI becomes useful at scale.

Promote each workflow through explicit maturity gates

Maturity should be earned workflow by workflow. An organization does not become mature because it has a central AI team, an approved model vendor, or a large portfolio. It becomes mature when important workflows can move through a repeatable sequence of decisions without relying on heroics.

Stage	Decision to make	Evidence required to advance	Reason to hold
Discover	Is this a valuable and appropriate problem for AI?	A defined user problem, current baseline, workflow map, risk classification, and initial build-versus-buy view	The use case is driven by model novelty, has no meaningful outcome, or depends on inaccessible data
Prove	Can the proposed workflow improve on the current process?	Representative evaluation cases, a working prototype, documented failure modes, and a controlled comparison with the baseline	Success appears only in curated demos, or the team cannot reproduce the result across realistic cases
Operate	Can the workflow run safely and reliably in production?	Monitoring, escalation, access controls, auditability, incident procedures, release controls, rollback, and an accountable operator	Failures cannot be detected or contained, or production responsibility is still ambiguous
Scale	Should usage, autonomy, channels, or organizational reach expand?	Sustained outcome improvement, acceptable quality and risk, validated economics, user adoption, and reusable operating components	Volume is growing faster than quality, cost, support capacity, or governance can be understood

The purpose of a gate is not to create a committee. It is to prevent enthusiasm, executive attention, or sunk cost from substituting for evidence. The domain team should be able to prepare the evidence as part of normal product development. Specialist review should become more demanding only as the possible consequence of failure increases.

Give every workflow a short deployment contract. Keep it in the same system where the team manages releases and evaluations, not in a presentation that disappears after approval. The contract should include:

The intended user, job to be done, business outcome, and current baseline.
The inputs the workflow accepts and the outputs or actions it may produce.
The actions that are prohibited, require confirmation, or must be routed to a person.
The data sources, retrieval rules, system permissions, retention rules, and privacy constraints.
The evaluation set, quality dimensions, acceptance criteria, and known limitations.
The failure taxonomy, escalation path, incident owner, and customer recovery procedure.
The prompt, model, retrieval, tool, and policy versions included in the release.
The production metrics, cost measures, rollout control, and rollback conditions.
The product owner, operational owner, and risk approvers.

The acceptance criteria will differ by workflow. A drafting assistant, an internal search experience, and an agent authorized to modify a customer account should not face the same bar. Base the bar on consequence, reversibility, detectability, and recovery. If an error can create an irreversible change, expose sensitive data, make a material commitment, or deny someone an important service, require an appropriate human authorization step rather than relying on average model performance.

The deployment contract also makes scope changes visible. Adding a new tool, data source, channel, language, model, or autonomous action is not merely more traffic. It changes the system’s failure surface. Update the contract, extend the evaluation set, and pass the relevant gate again.

Build three feedback loops before increasing autonomy

A mature deployment learns at three levels: whether the workflow creates value, whether its decisions meet the required standard, and whether the production system remains reliable. If any loop is missing, the team can collect impressive activity metrics while the actual product deteriorates.

Connect model behavior to a business outcome

Start with the baseline process, not an AI metric. If the workflow is intended to resolve a support request, qualify an opportunity, complete an onboarding step, or assist an employee, measure how that outcome happens without the new system. Otherwise, you will know that the AI generated output but not whether it improved anything.

Use a metric stack that separates outcomes from diagnostics:

Business outcome: The customer, revenue, cost, risk, or productivity result the investment is meant to change.
Workflow outcome: Completion, resolution, successful handoff, correction, rework, abandonment, or another measure of whether the task reached its intended end.
Quality and safety: Correctness, grounding, policy compliance, appropriate escalation, harmful failure, and user correction.
Operational performance: Availability, latency, tool success, retrieval quality, incident volume, and recovery.
Economics: Cost per successful outcome, including model usage, infrastructure, external tools, human review, support, and remediation.

The layers diagnose different problems. A prompt change may improve an offline score without changing task completion. More automation may reduce handling work while increasing corrections. A cheaper model may lower inference cost but create enough rework to raise the cost per successful outcome. Do not compress those effects into one AI score.

Measurement tends to improve as deployment deepens. In the customer service maturity data, reported ROI tracking increased from 35% among teams exploring AI to 70% among mature deployments. That does not prove maturity automatically causes measurement, but it shows how closely operational depth and measurement discipline travel together.

When traffic and product conditions support an experiment, compare the AI workflow with the current experience. Define the decision metric and minimum detectable effect before running an A/B test. For lower-volume or higher-risk workflows, use controlled rollout evidence, expert review, and structured case analysis rather than pretending a small sample provides statistical certainty.

Turn evaluations into release criteria

An evaluation set is not a collection of attractive examples. It should represent ordinary work, difficult edge cases, policy boundaries, known failures, and the situations in which the system should refuse or escalate. Build it before optimizing the prompt so the team cannot unconsciously redefine success around whatever the prototype already does well.

For each case, record the expected behavior and why it is expected. Some outputs can be checked against a deterministic answer. Others need a rubric that distinguishes task completion, factual support, instruction following, tone, policy compliance, and escalation quality. Where reviewers can reasonably disagree, capture that disagreement instead of forcing false precision into a single label.

Use offline and online evaluation for different jobs. Offline evaluation protects releases by testing candidate changes against a stable set. Online evaluation reveals distribution shifts, new user behavior, integration failures, and outcomes that cannot be recreated fully before launch. Neither is sufficient on its own.

Version the entire behavior-producing system: model, prompt, retrieval configuration, knowledge snapshot, tools, policies, and routing logic. A model comparison is not meaningful if the surrounding system changed silently. For every proposed release, make the decision policy explicit: ship, hold, narrow the scope, expand gradually, or roll back. This is the practical core of eval-driven development with target metrics and a decision policy defined before launch.

Operate the workflow as a production service

AI introduces variable outputs, but it still depends on familiar production systems: identity, permissions, data pipelines, APIs, queues, search, external tools, and user interfaces. A model can appear to be wrong when retrieval returned stale information or a downstream tool rejected an action. Monitoring only the final text hides the failure that engineers need to fix.

Trace the workflow end to end. Subject to your privacy and retention rules, capture the release version, retrieval and tool events, policy decisions, response, escalation, user correction, and eventual workflow outcome. Monitor distributions and failure categories, not just averages. An acceptable overall score can conceal a serious regression for a particular intent, customer segment, channel, or action.

When the workflow depends on changing or private knowledge, connect it to governed retrieval instead of expecting the base model to contain the right answer. Use safe integration points for tools, least-privilege access, and explicit authorization for consequential actions. CI/CD, feature flags, canary releases, observability, audit trails, privacy controls, red teaming, and human review form a practical control plane for releasing changes without exposing the entire population at once.

Every material production failure should produce more than an incident ticket. Classify the failure, add or update the corresponding evaluation case, correct the prompt, retrieval, policy, tool, or interface responsible, and retest the workflow before restoring scope. That turns operational pain into a permanent improvement in the release system.

Use 30-60-90 days to build the scaling system

A useful 30-60-90-day sequence starts with two lighthouse use cases. The goal is not to force every use case into production within a quarter. It is to prove that your organization can move valuable workflows through the same gates, shared controls, and learning loops.

Days 0-30: narrow the portfolio and establish accountability

Inventory active pilots and classify each as discovery, proof, operation, or scale. Do not let a polished demo assign its own stage.
Select two lighthouse workflows using customer impact, feasibility, strategic relevance, and risk. Choose workflows meaningful enough to matter but bounded enough to operate responsibly.
Record the current process and baseline before the AI changes user or employee behavior.
Name the product owner, operational owner, and required risk decision-makers for each workflow.
Complete the first version of each deployment contract, including the autonomy boundary and safe failure path.
Make the build-versus-buy decision at the workflow level. Include data access, integration, auditability, evaluation portability, operating cost, and switching constraints.
Pause pilots that have no accountable owner, no measurable outcome, or no plausible route through the operating gate.

This first phase is where leadership earns focus. A broad AI mandate often creates a queue of unrelated prototypes, each with its own vendor, data assumptions, and definition of success. Choosing lighthouse workflows gives the platform and governance work a real customer instead of turning them into abstract architecture programs.

Days 31-60: install evaluation, controls, and workflow operations

Build the offline evaluation set from representative work, edge cases, policy boundaries, and failures already found during discovery.
Define acceptance criteria and the release decision policy before further prompt or model optimization.
Integrate the necessary retrieval and tools through governed access points. Keep permissions narrower than the user’s full access where the workflow does not need it.
Add observability across retrieval, reasoning inputs, tool execution, output, escalation, and business outcome.
Prepare feature flags, a controlled rollout, rollback, incident procedures, and a customer recovery path.
Run the workflow with appropriate human oversight. Record corrections and escalations as structured evidence, not informal feedback in chat.
Train the people who will supervise, support, and improve the workflow. Update operating procedures before transferring real responsibility to AI.

Training cannot be limited to prompt tips. Operators need to know what the system may do, how its failure modes appear, when to intervene, how to report a new failure, and who can change production behavior. Product and engineering teams need the same vocabulary for evaluation, incidents, and risk.

Days 61-90: expand evidence, not enthusiasm

Increase scope only for workflows that meet their operating gate. Expansion may mean more traffic, another intent, a new channel, or greater autonomy; evaluate each change explicitly.
Compare the production outcome and cost with the original baseline. Include corrections, review, support, and remediation in the economics.
Turn repeated needs into shared components such as model access, retrieval, identity, evaluation infrastructure, observability, policy enforcement, and audit logging.
Move validated production failures into the evaluation suite and confirm that the release process catches them.
Review job responsibilities, incentives, staffing assumptions, and training needs created by the redesigned workflow.
Hold a portfolio decision for every remaining pilot: advance, narrow, combine, pause, buy, or stop.

Organizational change is part of this phase. As AI altered customer service work, 45% of teams updated job descriptions and 40% increased AI training. That is a useful warning against treating adoption as an in-app onboarding problem. If AI takes responsibility for part of a workflow, someone must take responsibility for supervising it, handling exceptions, and improving the system.

Assign decision rights clearly. The domain product team should own the user problem, outcome, workflow design, evaluation cases, and adoption. A platform function should own shared access, retrieval, observability, release infrastructure, and policy enforcement. Risk specialists should define control requirements and review higher-consequence uses. The operational owner should manage quality, escalations, and incidents after launch. Executive leadership should decide portfolio priority, capacity, and which bets no longer deserve investment.

This structure avoids two common extremes. A fully centralized AI team becomes a delivery bottleneck and loses domain context. Fully independent teams duplicate infrastructure and apply inconsistent controls. Centralize reusable capabilities and non-negotiable policies; keep workflow outcomes and day-to-day learning with empowered domain teams.

Expect pressure to spread successful patterns. In customer service organizations, 52% planned to scale AI into areas such as customer success, marketing, and sales. Reuse the platform, governance, evaluation methods, and operating vocabulary. Do not copy a support workflow into another function and assume its value, risks, permissions, or quality bar remain valid.

FAQ: decisions that determine whether AI scales

Should AI be owned centrally or by product teams?

Use a federated model. Centralize capabilities that become safer, cheaper, or more consistent when shared: approved model access, identity, data controls, retrieval services, evaluation tooling, observability, auditability, incident standards, and risk policies. Embed workflow ownership in the domain team that understands the user, process, and business outcome. A central group can set the paved road, but it should not become the permanent product team for every AI use case.

When is an AI workflow ready for more autonomy?

Increase autonomy when the workflow has demonstrated acceptable behavior for the exact action and population being added, failures are detectable, consequences are containable, rollback works, and an operational owner can handle exceptions. Do not remove human review merely because the average quality score improved. Judge autonomy by the worst credible consequence, the reversibility of the action, and the system’s ability to recognize when it should stop.

Autonomy is not binary. The system can retrieve information, recommend an action, draft the result, ask for confirmation, execute within a limited permission, or execute and trigger retrospective review. Choose the narrowest level that captures the value. Expand only when evidence supports the next level.

When should a pilot be stopped rather than scaled?

Stop or reframe a pilot when it has no accountable workflow owner, cannot beat a meaningful baseline, works only on curated inputs, requires unacceptable access, has no safe failure path, or creates more review and remediation than the outcome justifies. Also stop when the supposed AI problem is actually a broken policy, missing data, or poorly designed process that should be fixed directly.

A failed autonomy concept can still reveal a useful assistive product. If execution is too risky, narrow the workflow to retrieval, recommendation, drafting, or exception detection. That is a product decision, not a face-saving exercise. The right scope is the one that creates measurable value under an operating model you can defend.

At your next AI portfolio review, ask each owner to bring a baseline, deployment contract, evaluation evidence, and a clear gate decision. Fund shared infrastructure where the lighthouse workflows expose a recurring need. Expand only after the operating evidence catches up with the demo. That is how you turn a collection of pilots into an AI capability that can carry real responsibility.

References

January 28, 2026

Stop Losing Customers: Predict Churn with Digital Analytics and Act Before It’s Too Late

I stopped treating churn as a postmortem and started treating it as a forecasting problem. When we instrument our product, connect the dots across journeys, and embed those signals into our daily operations, churn becomes predictable—and preventable. This shift has been one of the most impactful product strategy moves my teams have made for product-led growth and retention analysis.

"Discover why and how CS teams can use digital analytics to take a proactive, predictive approach to churn, stopping it before it happens." That is exactly the mindset I bring to customer success and product collaboration: anticipate risk, intervene with precision, and demonstrate measurable impact.

The practical work starts with leading indicators. I look at user activation milestones, time-to-first-value, feature adoption depth, frequency and recency of key events, account-level coverage (are multiple users active or just one champion?), usage volatility, and friction signals like repeated errors or stalled onboarding. These behavioral inputs are stronger predictors of churn than survey sentiment alone.

From there, I create a churn risk score. Early on, a transparent rules-based model is usually enough to separate healthy from at-risk accounts. Over time, we can layer in supervised learning if the data supports it. I rely on Amplitude analytics, Pendo, or a unified analytics platform to tag events, build cohorts, and compute risk in near real time. This is where we consistently see the patterns that matter—especially around user activation and sustained adoption.

Signals without action won’t save a customer, so I connect the model to our systems of engagement. Through CRM integration, at-risk accounts trigger clear playbooks for CSMs and lifecycle marketers. Inside the product, in-app guides address gaps exactly where they occur—guiding users to the next best action, unblocking onboarding, or showcasing the value hidden behind underused features.

Because not every nudge works for every segment, we treat intervention design as a product problem and run A/B testing on copy, timing, channel, and offer. We test whether a contextual tooltip outperforms an email sequence, whether a short product tour beats a knowledge base link, and which incentives accelerate onboarding without cannibalizing expansion.

Operationally, this is a team sport. Product, CS, and marketing meet in product trios to review risk cohorts, prioritize root-cause fixes, and tune playbooks. We run a weekly risk review to turn insights into decisions, and we use monthly business reviews to connect leading indicators to lagging outcomes like retention, expansion, and NRR.

Measurement is non-negotiable. We pair retention analysis with qualitative feedback to understand whether our interventions truly change behavior. The goal is to close the loop: when a risk cluster improves, we codify the playbook; when a tactic underperforms, we learn, adjust, and try again. Over time, the organization builds a muscle for proactive, data-informed customer health management.

If you’re getting started, begin by instrumenting events tied to value moments, define a simple health score, and stand up a basic alerting workflow. Pilot one or two interventions, measure lift, and iterate. Within a single quarter, you’ll have enough signal to prioritize product improvements and scale the practices that reliably reduce risk.

Churn rarely surprises teams that listen to their data and respond in real time. With disciplined analytics, thoughtful in-product guidance, and tight alignment across CS and product, we can move from reacting to predicting—and keep more customers succeeding with far less effort.

Inspired by this post on Amplitude – Perspectives.

January 27, 2026
Build vs. Buy in an AI-First World: My Framework to De-Risk Decisions and Own Your Data

Build vs. buy is a decision that never truly goes away, and with AI reshaping the economics of software, I’m revisiting this question more frequently—and with more nuance—than ever. The temptation to “just build it” is real when prototypes are cheaper, shipping feels faster, and small tools can rival big platforms. But the real decision has never been about code; it’s about value, data, and long-term responsibility.

Across product orgs at every stage, I see the same pattern: AI makes building feel easier—but it doesn’t eliminate the tradeoffs. The hard part is separating what differentiates your product from what simply supports it. That’s why I start by asking whether the capability is truly core to my value stream, and then I force myself to reason about ownership and maintenance, not just velocity.

My rule of thumb remains simple: If something isn’t core to your value stream, don’t build it. And even when it is core, vendors may still be better positioned—especially for payments, invoicing, and infrastructure. Those domains carry deep operational complexity, continuous compliance, and reliability requirements that are easy to underestimate and painful to own.

Here’s how this plays out for me. I would never build my own blogging platform. I moved from WordPress to Ghost, because publishing isn’t where I differentiate, and the long tail of upgrades, security, and performance is a drag on focus. The platform does the job, my audience gets a better experience, and my team avoids owning commodity maintenance work.

On the other hand, I did build my own task management system—despite the abundance of excellent tools like Trello, Evernote, and OmniFocus. For me, tasks, notes, and workflows are deeply personal and idiosyncratic. I wanted my system to reflect how I think, plan, and communicate, with tight integration to my daily product rituals. In this case, the underlying data became the real product—and owning and controlling that data changed the equation.

That’s the heart of the decision: When the underlying data becomes the real product, ownership matters. Task management, notes, and workflows evolve into a personalized operating system. The moment your data model represents your unique value—and your future differentiation—build vs. buy is no longer a tooling choice; it’s a strategy choice.

AI is pushing this even further. Cheaper prototyping and “vibe coding” lower the cost of building. Tools like Claude Code and platforms from OpenAI make it viable to ship smaller, targeted tools that would have been uneconomical a few years ago. That expands the frontier of what teams can build without committing to a monolithic platform—and it puts pressure on vendors to improve data portability.

Which brings me to vendor lock-in. Exports aren’t always enough. When I evaluate CRMs or course platforms, I look for more than CSV dumps. I want robust, well-documented APIs, webhook coverage, import/export parity, schema transparency, and a clear migration path. I’ve seen teams drown in brittle integrations with Salesforce or HubSpot, struggle to unwind course data from Teachable, or get stuck in signature workflows around DocuSign without a clean escape hatch. Portability is table stakes now.

I treat build vs. buy as a discovery problem. Options are assumptions to test. On the build side, I run feasibility spikes: proof-of-concept integrations, latency checks, cost-to-serve models, and a sober read on maintenance. On the buy side, I trial vendors, not their marketing. I replicate a real workflow, test the edges, validate data portability, and simulate failure modes like vendor downtime or schema changes.

A word of caution on complexity: “we can build anything” is not the same as “we should build this.” Long-lived products accumulate hidden complexity over time—security, privacy, performance, observability, SRE runbooks, QA automation, documentation, and compliance. Be honest about engineering capabilities and maintenance costs, especially when uptime and regulatory exposure are in play.

My practical checklist looks like this: Is this core to our differentiation? Do we need to own the data model? How strong is data portability (APIs, webhooks, mapping, re-import)? What’s the true total cost of ownership over three years (people, ops, security, compliance)? Are there regulatory or reliability constraints better handled by a vendor? What’s the opportunity cost of not building something more strategic? And if we buy, what’s our exit plan?

Ultimately, build vs. buy isn’t just about speed or cost—it’s about core value, data ownership, and long-term responsibility. AI lowers the barrier to building, but it doesn’t erase complexity. Treat build vs. buy decisions like any other discovery effort: test assumptions, prototype, and validate before committing. Ask not just can we build it, but should we own it?

If you’re wrestling with vendor lock-in, fielding pressure to “just build it,” or rethinking your stack in an AI-first world, this lens will help you ask better questions before you commit. And if you’re exploring targeted builds alongside platforms like Stripe, Dropbox, Obsidian, or Ghost, I’d love to hear what’s working for you and where portability remains a hurdle.

Inspired by this post on Product Talk.

January 27, 2026
The Safety of Speed: 180 Deploys a Day, 12‑Minute Releases, 99.8%+ Availability

“Speed is not the enemy of safety; it is the prerequisite for it.” I live by this principle. In our organization, the average time from merging code to it being used by customers in production is just 12 minutes, and that short window is fundamental to how we build, ship, and learn.

In January 2026, we are averaging 180 ships per workday – roughly 20 deployments every hour. Conventional wisdom suggests that to increase stability, you must slow down. I believe the opposite. Speed is not the enemy of safety; it is the prerequisite for it. Accumulating code creates risk; shipping small batches minimizes it. Shipping is our company’s heartbeat.

Maintaining this frequency while targeting 99.8+% availability has required over a decade of focused investment in systems, principles, and processes. We protect the integrity of our systems through three layers of defense: an automated pipeline that is simple, reliable, and removes the need for manual intervention, a shipping workflow that promotes ownership and uses guardrails as accelerants, and a recovery model that optimizes for mitigating inevitable failures. Here’s how we’ve built each layer so that velocity is our greatest source of stability.

While our platform consists of various services and frontend applications, I’ll focus here on our Ruby on Rails monolith. It is our core application and the one we deploy most frequently; we also deploy it to three different data‑hosting regions with independent pipelines. Our other services follow similar pipeline principles and safeguards, but the Rails monolith is the clearest example of how we ship at scale.

The automated pipeline is designed to move code from merge to production as fast as possible while enforcing strict safety checks. It is fully automated, and the vast majority of releases require no human intervention—critical for CI/CD at high deployment frequency.

Once an engineer merges code to GitHub, two things happen immediately. First, the build: we compile the Rails application and its dependencies into a deployable asset (a slug) in about four minutes. Second, parallel CI: our test suite runs alongside the build; through extensive optimization, parallelization, and test selection, the vast majority of CI builds finish in under five minutes.

As soon as the slug is built, it’s deployed to a pre‑production environment. CI does not block the progression of the slug to pre‑production. Deploying to pre‑production takes around two minutes. This environment serves no customer traffic, but it is connected to our production datastores, mirrors our production infrastructure variants (e.g., web serving, asynchronous worker), and is configured so that requests exercise the pre‑release code and workers.

Immediately after deployment, we run and await several automated approval gates. We verify that the application boots cleanly on hosts (boot test), confirm the parallel test suite passed (CI check), and execute functional synthetics using Datadog Synthetics on critical flows—such as loading or editing a Fin workflow. If any gate fails, the release is halted and does not go to production.

Once approved, we promote the code to thousands of large virtual machines. A deployment orchestrator triggers these deployments simultaneously, while a decentralized, staggered rollout avoids changing the state of the entire fleet at the same millisecond. Within each machine, a rolling restart mechanism removes a process with old code from the serving path, lets it drain gracefully, and replaces it with a fresh process running the new code. From the moment a deployment starts, first requests are served by new code within roughly two minutes, and the vast majority of the global fleet updates transparently within six minutes. When restarts trigger on every machine, production unblocks so the next deployment can begin.

We treat a stalled pipeline as a high‑priority incident. If the automated system rejects three consecutive release attempts, it pages an on‑call engineer. These are pre‑production blocks, but if the shipping lane stops moving, changes pile up—and our stability relies on building and shipping in small steps. The on‑call’s job is to restore flow so that tiny, safe, frequent updates continue to keep risk low.

Our shipping workflow is built on extreme ownership: tools assist, but the engineer is accountable for quality and the decision to merge. I insist that you are present when you ship. The practical benefit of a 12‑minute deployment cycle is that engineers remain in the zone, focused on the problem they just solved, and ready to validate behavior as it goes live.

A rocket lifts into a luminous sky, a metaphor for shipping code fast without breaking things, where precision, automation, and guardrails power 180 safe deployments a day.

To support this, our deployment system sends Slack notifications the moment code is submitted and as it advances through stages, embeds direct observability links to relevant dashboards and logs in every PR and message, and prompts verification so engineers actively watch the dials and test features in production. It is not acceptable to rely on green builds. You’re expected to watch your change go live and if you’re not prepared to rollback, you’re not prepared to ship. We maintain a no‑blame culture: quick rollbacks and immediate reverts are signs of vigilance and ownership, not failure.

We make extensive use of feature flags to turn deployment into a non‑event. By decoupling deployment (moving code to servers) from release (turning features on), we shrink the blast radius of change. Flags can be enabled for all customers, a specific subset, or disabled for everyone in under 60 seconds through our backend UI. Engineers can group flags into beta features and run phased rollouts; we also ensure flags work consistently across non‑monolith applications. In the past three months, we created over 560 flags—and we actively manage them to avoid permanent complexity.

For complex refactors—especially when behavior should not change—we leverage GitHub Scientist, an open‑source experimentation library. It runs candidate logic (new code) in parallel with existing logic (old code) in production, instruments both paths for result and timing comparisons, and keeps existing behavior user‑visible. That means we can iterate on and validate new code under real load without risking the experience, then switch seamlessly when confident.

When engineers need to go deeper before merging, they can generate a slug and deploy it to a virtual machine, detaching a running production host from the serving path and connecting for manual testing. They can also put a pre‑release slug on a serving machine that handles a small percentage of jobs or web requests. Single‑host validation lets us slice observability to those hosts, compare against the main release, and make low‑level changes safer. Staging is a simulation; production is reality. Testing on a single production host validates assumptions with real‑world data without risking the fleet.

Our recovery model starts from a simple principle: stop monitoring systems; start monitoring outcomes. Traditional monitoring tells you if a server is healthy; we care whether customers are healthy. We rely on heartbeat metrics—vital signs that represent the core value our product provides—such as the rate at which messages and comments are created.

Unlike standard uptime checks, heartbeat metrics are binary in spirit. If message send rates dip below baseline, it does not matter if infrastructure dashboards are green. Down is down, and if customers can’t do their job, uptime percentages are irrelevant. By tracking real‑world success rates as a high‑level signal, we catch subtle degradations that traditional alerting either misses or over‑alerts on.

Because we ship in small, incremental steps and maintain previous releases on our virtual machines, our Time to Recover (TTR) is generally very fast. If a heartbeat metric drops or a critical anomaly is detected right after a ship, the system can trigger an automatic rollback, reverting to the release that was running 20 minutes ago—often restoring service before an engineer responds. For complex issues, engineers can initiate a manual rollback through our deployment UI; doing so also locks the production pipeline to prevent further releases while we investigate and remove problematic code.

Resumption of service is not the end. Every incident prompts an incident review, and we don’t just fix the bug. We ask, “How did the machine allow this to happen?” Then we harden the system so it cannot happen again. This loop—fast shipping, fast recovery, rigorous learning—compounds resilience over time.

This operating model aligns to DORA metrics: high deployment frequency, short lead time for changes, low change failure rate, and rapid time to restore service. It’s a CI/CD and SRE‑informed approach that converts speed into a defensive advantage rather than a liability.

Shipping 180 times a day isn’t a vanity metric; it’s a deliberate choice to protect the customer experience. With a 12‑minute window from code to customer, the feedback loop is tight and engineers retain context—and accountability—for the immediate impact of their work. Maintaining this pace requires more than fast CI; it requires judgment, extreme ownership, disciplined use of feature flags, and a recovery model that monitors outcomes. We rely on human expertise, augmented by these layers of defense, to catch issues before they turn into customer pain. We don’t ship fast despite our need for stability; we ship fast to stay in control of change.

Inspired by this post on The Intercom Blog.

January 26, 2026
The Customer Feedback Playbook: AI-Powered Tactics I Use to Make Better Product Decisions

Customer feedback is the most reliable compass I have for product strategy and execution. Over the years leading product at HighLevel, I’ve built and refined a system that turns raw signals from users into clear, prioritized decisions our teams can confidently ship.

A practical guide to collecting and using product feedback in product management (from AI tools to early-stage tactics) for better product decisions.

My playbook starts with continuous discovery. I keep a steady flow of insights from sales calls, customer support threads, community forums, and in-product behavior so I can triangulate patterns rather than chase loud anecdotes. This mix of quantitative and qualitative data helps me separate urgent noise from strategically meaningful trends.

On the quantitative side, I rely on product analytics to ground the conversation. Amplitude analytics gives me activation, retention cohorts, and feature engagement, while controlled experiments and A/B testing validate whether an idea actually moves a target metric. Tying these signals to specific customer segments helps me see where product-led growth is working—and where it’s stalling.

For qualitative insight, I combine in-app guides and lightweight surveys (via tools like Pendo) with structured interviews and support escalations (often surfaced through platforms like Intercom). I map problems using the Kano Model to understand which requests are basic expectations, which are performance drivers, and which are potential delights. This keeps our roadmap focused on outcomes, not just outputs.

AI now accelerates the synthesis step. With LLMs for product managers in my AI product toolbox, I summarize interview transcripts, cluster themes across thousands of notes, and quantify sentiment without losing nuance. I still review raw artifacts to avoid hallucinations and preserve context, but AI reduces the time from signal to insight dramatically—freeing me to spend more energy on judgment and storytelling.

In early-stage contexts, I bias toward speed and proximity to users. I schedule founder- or PM-led discovery calls weekly, instrument product tours early, and launch scrappy in-product prompts to validate demand before over-investing. When data is sparse, I focus on high-signal channels (power users, churned customers with qualified use cases) and document crisp problem statements that connect directly to activation, retention analysis, and revenue outcomes.

Prioritization ties everything together. I translate insights into hypotheses aligned to outcomes vs output OKRs, then pressure-test them with feasibility and strategic fit. We run small, measurable experiments, track deltas in activation and retention, and adjust the product roadmapping and sprint planning cadence based on what the data and customers teach us.

This approach builds trust with stakeholders and creates empowered product teams. By grounding decisions in a transparent trail of feedback, analytics, and experiments, we reduce thrash, move faster, and—most importantly—ship product moments that customers value.

If you’re refining your own feedback engine, start by instrumenting the basics, set a weekly discovery rhythm, and let AI handle the heavy lifting on aggregation and synthesis. The compounding effect is real: better insights lead to better bets, which lead to better outcomes for your users and your business.

Inspired by this post on Product School.

January 26, 2026

Real-Time Analytics for Financial-Services Contact Centers

Your contact center can have excellent reporting and still react too late. A weekly chart may explain why transfers rose, authentication failed, or members called again. It cannot recover the interaction that is already going wrong.

That is the practical case for real-time analytics in financial services: detect a useful signal while there is still time to change the outcome, then deliver a safe action to the person or system that can take it. The goal is not a faster dashboard. It is a shorter path from behavior to decision to resolution.

Key takeaways

Define real time against the decision window. A signal is timely only if it arrives before the next useful action expires.
Start with journeys that create material cost or dissatisfaction, such as lost cards, fraud disputes, loan-status requests, password resets, and payment issues.
Instrument the outcome as carefully as the interaction. Otherwise, you can see that an alert fired without knowing whether it helped.
Activate insights inside routing, agent, supervisor, and follow-up workflows. A separate analytics destination creates another queue for people to monitor.
Measure resolution, repeat demand, and guardrails. Activity metrics such as alerts generated or prompts displayed are diagnostics, not business outcomes.
Build privacy controls, consent handling, access restrictions, and auditability into the decision loop before expanding its reach.

Define real time as a decision contract

Real time is not a universal refresh rate. It is a promise that a signal will reach its decision point while an effective response is still possible. An agent-assist prompt must arrive before the conversation moves past the relevant step. A routing signal must arrive before the interaction enters the wrong queue. A proactive follow-up must arrive before the member has to contact you again.

This distinction prevents an expensive architecture mistake: streaming every event without deciding what any event should change. Some information needs immediate activation. Some belongs in a supervisor review. Some is useful only for longer-term journey redesign. Treating all three as equally urgent increases cost and noise without improving service.

Before building a pipeline, write a decision contract for each use case. The contract should connect the signal to an owner, action, deadline, guardrail, and measurable outcome.

Decision-contract field	Question to answer	Illustrative fraud-routing example
Trigger	What observable event or state starts the decision?	A potential fraud signal appears during an active interaction.
Decision	What choice becomes possible because of the signal?	Whether the interaction should receive specialized handling.
Action	What should the workflow do?	Prioritize the appropriate route and carry the available context forward.
Owner	Who or what is accountable for acting?	The routing workflow, with a supervisor responsible for defined exceptions.
Action window	When does the intervention stop being useful?	Before the interaction is transferred or the relevant verification step is completed.
Guardrail	What must never be bypassed?	Required compliance steps, authorized data access, and a clear human override.
Outcome	How will you know whether the action helped?	Resolution without an avoidable transfer, escalation, or repeat contact.

A contract also exposes weak use cases early. If nobody can name the action, the signal is probably reporting data rather than real-time decision data. If the action has no owner, it will become an ignored alert. If the outcome is merely that a prompt appeared, the team has confused delivery with impact.

The underlying platform still needs to bring together behavior across voice, chat, IVR, email, and in-app journeys. But unification is useful only when identity, journey state, and timing remain coherent across those channels. A member who fails authentication in the app and then calls should not look like two unrelated problems.

Instrument five costly journeys before the whole contact center

A complete contact-center data program is too broad a starting point. It invites months of taxonomy work before anyone changes an outcome. Begin with the five journeys most likely to concentrate cost or dissatisfaction: lost card, fraud dispute, loan status, password reset, and payment issue.

This is not a mandate to automate all five at once. Rank them using the evidence you already have: contact demand, transfers, repeat contacts, unresolved cases, authentication failures, and escalations. Choose the journey where a specific intervention is both valuable and operationally feasible.

For the chosen journey, create an outcome card before defining events:

Member intent: What is the person actually trying to complete?
Observable start: Which event shows that the journey has begun?
Resolution state: What evidence means the need was completed, not merely that the interaction ended?
Failure states: Where can authentication, routing, handoff, self-service, or follow-up break down?
Intervention: Which failure can the contact center change while the journey is active?
Outcome and guardrails: Which result should move, and which compliance or experience measures must not deteriorate?

The event model should then describe the journey rather than mirror the screens of each tool. At minimum, preserve a pseudonymous member reference, interaction reference, channel, event time, journey, journey step, authentication state, transfer or escalation state, intervention, and outcome. If intent or risk is inferred, record the version and confidence associated with that inference. If an agent accepts, dismisses, or overrides guidance, capture that response too.

Consistent definitions matter more than a large event count. Decide what a transfer is, when a new contact belongs to an existing journey, and what qualifies as resolution. Version those definitions. Otherwise, a changed IVR flow or CRM configuration can appear to improve performance simply because the instrumentation changed.

Instrument the negative space as well. If the member disappears from a self-service flow, the absence of a completion event is not enough to explain why. Capture the last meaningful step, the failure category when it is available, and whether the member moved to another channel. That is how you distinguish successful deflection from abandonment followed by a call.

Do not copy every transcript, recording, credential, or financial value into a broadly accessible analytics stream merely because the technology allows it. Use minimized attributes and controlled references where they are sufficient. Keep restricted evidence behind narrower permissions. Availability is not the same as permission.

Put the decision inside the workflow

The last mile determines whether real-time analytics changes performance. An insight that requires an agent to open another application, interpret a graph, and decide what it means has already lost much of its value. Activation belongs in the systems where agents, supervisors, and automated workflows already act.

Four activation patterns cover most of the useful surface area:

Routing: Use intent, journey state, or a potential risk signal to direct the interaction to the appropriate skill. High-risk transactions can be prioritized for specialized handling, but the signal should not silently become a final financial or fraud decision.
Agent guidance: Surface the next relevant step, missing compliance action, or known journey context during the interaction. Explain why the guidance appeared, avoid conflicting prompts, and give the agent a defined way to dismiss or override it.
Supervisor intervention: Alert on a material pattern with an attached playbook. The notification should identify what changed, which interactions are affected, which action is available, and when the alert expires.
Member follow-up: Trigger a relevant message or next step after an unresolved interaction. The follow-up should close a known gap, not merely create another generic communication.

Self-service requires particular care. If balance inquiries or password resets are overwhelming queues, routing eligible demand to self-service may help. But containment is not the same as resolution. Measure whether the member completed the task and whether another contact followed. A journey that exits the IVR but returns through chat has changed channels, not disappeared.

Each activation needs a safe fallback. If identity is uncertain, the signal is stale, or a dependency is unavailable, revert to the normal approved workflow. Do not let a broken analytics path invent a route or compliance step. Log the fallback so operational teams can distinguish a bad recommendation from a recommendation that never reached its destination.

Alert design deserves the same product discipline as customer-facing design. Deduplicate repeated signals, suppress guidance after the relevant action window, and route exceptions to a named owner. A queue full of low-value alerts trains people to ignore the important ones.

The technology choice comes after these workflow requirements. CRM integration should carry member and journey context forward, while the analytics layer captures behavior and evaluates interventions. Products such as Amplitude, Pendo, and Intercom may instrument digital touchpoints, but the build-versus-buy decision should turn on your decision contracts: identity reconciliation, activation latency, workflow integrations, experimentation, access control, auditability, and operational reliability.

I would not approve a platform solely because its dashboards are polished. Ask the vendor or internal platform team to demonstrate an end-to-end loop using one of your journeys: signal received, decision evaluated, workflow changed, outcome captured, and audit record produced. That sequence is the product you are buying or building.

Measure outcomes, experiment carefully, and govern the loop

Real-time analytics does not reduce operating cost by itself. It changes a decision, which changes a journey, which may change demand and resolution. Your measurement model has to preserve that chain.

Use a scorecard that separates outcomes from activity

Choose a primary outcome that matches the journey. Useful candidates include first-contact resolution, repeat-contact reduction, containment, and average time to resolution. Define the eligible population and exclusions explicitly so the metric cannot drift when channel mix changes.

Then organize the remaining measures by purpose:

Journey outcome: Was the member’s need resolved, and did it stay resolved?
Operational mechanism: Did transfers, escalations, routing failures, or authentication failures change?
Intervention delivery: Was the recommendation generated, delivered in time, accepted, dismissed, or overridden?
Experience and compliance guardrails: Were required steps completed, and did complaints, corrections, or manual exceptions increase?
System health: Was the signal complete, timely, correctly joined to the journey, and available when the workflow needed it?

Average handle time can be diagnostic, but it should not become the automatic objective. A shorter interaction that leaves the member unresolved may simply move cost into a repeat contact. Resolution and repeat demand tell you whether the system removed work or postponed it.

Test the intervention, not the existence of the data

Controlled experiments can show whether a changed IVR path, authentication step, or post-contact follow-up improves the chosen outcome. Define the minimum detectable effect before the test so the team knows which improvement would justify a decision and whether the eligible volume can support a useful result.

Choose the unit of assignment deliberately. If the same member can return during the measurement window, assigning different experiences by interaction can contaminate the comparison. A member-level assignment may be cleaner. If the intervention changes an entire queue or supervisor workflow, individual assignment may be impractical; use a rollout design that reflects how the operation actually works.

Do not randomize away mandatory compliance controls. When an intervention affects fraud handling, sensitive disclosures, or consequential routing, begin in observe-only mode, review false positives and overrides, and use an approved rollout. Experiment with the delivery or operational design only where compliance and legal owners confirm that variation is permissible.

Make governance part of the product

Privacy and compliance cannot sit downstream of activation. A real-time system makes decisions from live member behavior, so access controls, consent management, and audit trails belong in the initial architecture.

For every decision contract, document the permitted purpose of the data, who can access it, where it is retained, how consent is honored, what enters the audit record, and who approves changes. Do not infer that an attribute is lawful to use because it exists in the CRM. The relevant compliance and legal owners must determine acceptable use for the jurisdiction, product, and member context.

Auditability should reach beyond data access. Preserve enough context to reconstruct what signal arrived, which rule or model version evaluated it, what action was recommended, what the workflow did, whether a person overrode it, and what outcome followed. That record supports incident investigation, performance review, and defensible change management.

Run the operating cadence through a product trio spanning operations, data, and compliance. In each review, ask which decisions fired, which arrived too late, which actions were ignored, which outcomes changed, and which guardrails moved. Retire noisy signals. Refine ambiguous definitions. Promote successful interventions gradually. This keeps the program focused on decision quality instead of dashboard volume.

Your next step is small and concrete: choose the highest-cost or highest-friction journey among the initial five, write its decision contract, and run the signal in observe-only mode. When the team can trace the path from trigger to approved action to outcome, activate the narrowest useful intervention. Expand only after that loop is measurable, reliable, and governable.

References

Shivam.Consulting Blog – Stop Drowning in Dashboards: Real-Time Digital Analytics for Finserv Contact Centers

January 23, 2026

From Product Strategy to Adoption: The Solutions Engineering Loop

Your roadmap can be coherent, the launch can go smoothly, and adoption can still stall. The usual gap is not a lack of effort. Product strategy, field discovery, technical validation, and post-launch adoption were treated as separate activities, so each team learned something that the others could not use.

You can close that gap by treating solutions engineering as part of the product learning system. The goal is not to give sales more technical coverage or product managers a larger queue of requests. It is to turn real customer conditions into testable product decisions, then follow those decisions until customers reach and repeat the intended outcome.

Key takeaways

Define adoption as an observable customer behavior before you approve features or plan a launch.
Ask solutions engineers to capture evidence in a consistent format, separating what happened from what the team thinks it means.
Treat demos and proofs of value as experiments with a hypothesis, baseline, value event, and decision rule.
Measure the entire path from promise to repeated value. A login, click, or feature visit is rarely sufficient evidence of adoption.
Route each adoption problem to the right response: positioning, enablement, onboarding, integration, product design, or core capability.
Carry launch learning into roadmap and operating reviews so that adoption changes product decisions instead of becoming a reporting exercise.

Write the adoption contract before debating the roadmap

Product strategy becomes operational only when it specifies whose behavior should change, why that change matters, and what evidence would show that the product created value. Without that translation, teams can agree on a strategic theme while holding incompatible ideas about what success means.

Write an adoption contract for every significant product bet. This is not a legal document or a long requirements package. It is a compact agreement among product, engineering, design, solutions engineering, sales, and customer success about the outcome being pursued.

Target user: Name the persona and operating context. A broad account segment is not enough when different people inside the account perform different jobs.
Starting condition: Describe the workflow, constraint, or workaround that exists before the change. This gives you a baseline against which the new experience can be judged.
Desired outcome: State the progress the customer wants, not the capability you intend to ship.
Value event: Identify the observable behavior showing that the user reached meaningful value. Opening a page may indicate exposure; completing the intended workflow is stronger evidence.
Repeat condition: Define what sustained use looks like at the natural cadence of the workflow. Daily activity is useful only when the job itself occurs daily.
Decision evidence: Agree on the metric, qualitative evidence, technical constraints, and guardrails that will determine whether to continue, change, or stop the bet.

Suppose a product automates part of an operations workflow. Increase feature adoption is too weak to guide a team. A more useful contract says that the operations manager can complete the target workflow, obtain the intended result, and return to it when the job recurs without relying on the old workaround. The team can now examine setup, permissions, integrations, comprehension, completion, and repetition as separate parts of the same outcome.

This contract also changes roadmap conversations. A feature request no longer competes on enthusiasm alone. You can ask which persona it serves, which blocked outcome it unlocks, what evidence supports the problem, and which customer behavior should change if the solution works. If those questions have no answers, the request needs more discovery before it needs an estimate.

The solutions engineer is especially valuable here because the role can translate ambiguous requirements, technical conditions, and customer goals into product and go-to-market decisions. Bring that perspective in while the adoption contract is still editable. Waiting until a sales demonstration turns solutions engineering into a downstream interpreter of decisions it could have helped improve.

Keep the contract stable enough to align the organization, but do not protect it from evidence. If customers repeatedly pursue a different outcome, struggle with an unanticipated dependency, or assign value to another part of the workflow, revise the contract explicitly. Silent reinterpretation is dangerous because every function can declare success against a different definition.

Turn solutions engineering into a field evidence system

A solutions engineer sees details that rarely survive a conventional feature request: the customer’s existing architecture, the sequence of the workflow, the people involved in approval, the point at which a demonstration loses credibility, and the difference between a stated objection and an actual blocker. That information can improve strategy, but only if the organization captures it in a form that product teams can compare and act on.

Do not send raw call notes into a backlog and call it discovery. Use a consistent field record containing:

The persona, account context, and job being attempted.
The customer’s desired outcome and current way of achieving it.
The specific moment where the current product or proposed solution created friction.
The relevant technical condition, such as an integration, data, permission, security, or scalability constraint.
The evidence observed: customer behavior, workflow review, demonstration response, deployment result, or instrumented product data.
The proposed interpretation and any plausible alternative explanation.
The next question or test that would reduce uncertainty.

Within that record, separate observation, interpretation, and request. They are not interchangeable.

Observation: The administrator could not complete configuration because a required permission was unavailable.
Interpretation: The permission model may not match how this persona operates.
Request: Add a new administrative role.

The observation is evidence. The interpretation is a hypothesis. The request is merely a candidate solution. Keeping them separate prevents the first proposed feature from becoming the definition of the problem.

Product leaders also need to protect this system from deal gravity. A commercially urgent request can be important without representing a widespread product problem. Conversely, a small implementation detail can reveal a structural barrier affecting an entire target persona. Evaluate field evidence with questions that make those differences visible:

Does the problem recur for the same persona, workflow, or environment?
Does it block the value event, or does it affect a peripheral preference?
Is the root cause a missing capability, product usability, configuration, integration, positioning, or customer process?
Does the problem appear in the strategic segment the product is designed to serve?
Can a smaller or reversible test distinguish competing explanations?
What roadmap decision would change if the hypothesis were confirmed?

The last question is a useful filter. If no plausible answer would change a decision, the team may be collecting interesting information rather than decision-grade evidence.

Bring the strongest records into the product trio’s learning cycle. Field evidence can then sharpen roadmaps, sprint planning, positioning, integration decisions, and stakeholder conversations. The solutions engineer contributes technical discovery and customer context. Product management synthesizes patterns and owns tradeoffs. Design examines behavior and workflow. Engineering tests feasibility and exposes architectural consequences. Sales contributes commercial context without converting urgency into automatic priority. Customer success adds evidence about repeated use after the initial deployment.

This arrangement does not require every customer conversation to become a committee meeting. It requires a common evidence format, clear decision ownership, and a route by which meaningful field learning reaches the people making product choices.

Run demos and proofs of value as product experiments

A polished demo can prove that a presenter understands the product. It does not prove that a customer can adopt it. Even a successful technical pilot can produce a false positive if the solutions engineer performed work that the target user cannot repeat, the environment was unusually controlled, or the team never defined what customer value would look like.

Design every proof around a falsifiable outcome hypothesis. A practical structure is:

For [persona] in [context], enabling [change] should produce [observable behavior] because [value mechanism]. The hypothesis is supported if [agreed evidence] meets [predefined decision rule] without violating [guardrail].

The brackets force useful decisions. You have to name the user, the operating condition, the expected behavior, and the reason the behavior represents value. You also have to decide what would count as failure before enthusiasm about the result changes the standard.

Build the proof plan around the following elements:

Baseline: Document how the customer handles the job now, including meaningful friction and dependencies.
Scope: State which workflow, persona, environment, and constraints the proof includes. Anything outside that boundary remains unvalidated.
Value event: Identify the customer action or result that demonstrates progress toward the desired outcome.
Instrumentation: Decide which product events and qualitative observations will show where users advance or stop.
Enablement: Record how much assistance the customer receives. Heavy expert intervention can validate technical feasibility while leaving self-service adoption unproven.
Decision rule: Define what evidence will justify progression, iteration, repositioning, or stopping.
Learning owner: Name the person responsible for translating the result into a product, go-to-market, or adoption decision.

A compact learning record can follow Insight – Hypothesis – Experiment – Metric. The sequence matters. A customer comment produces an insight, not a roadmap commitment. The insight leads to an explanation that can be tested. The test produces evidence against a metric or decision criterion.

Capture objections with the same discipline. An objection about missing functionality may actually reflect an unclear value proposition, concern about integration effort, lack of trust in the output, or a mismatch between the buyer and the eventual user. Ask what the customer is unable or unwilling to do because of the concern. That question moves the discussion from the requested feature to the blocked behavior.

At the end of a proof, classify what was learned:

Value validated: The target persona reached the intended outcome under representative conditions.
Need validated, capability missing: The problem matters, but the product cannot yet support the required workflow.
Capability works, adoption friction remains: The product can deliver value, but setup, comprehension, trust, or workflow design prevents the user from reaching it reliably.
Technical path blocked: An integration, architecture, data, permission, or operational constraint prevents deployment.
Positioning mismatch: The product can do what was promised, but the outcome is not important enough for the target persona or was framed in the wrong terms.
Evidence inconclusive: The test did not represent the intended user or environment, or the decision criteria were not measurable.

These outcomes lead to different work. A positioning mismatch belongs in messaging and discovery. Adoption friction may require onboarding or product design. A recurring technical blocker may deserve roadmap investment. An inconclusive test deserves a better test, not a success story.

Demos and early deployments become high-signal learning mechanisms when their evidence returns to product strategy instead of ending in the opportunity record. That closed loop is what allows solutions engineering to improve both customer execution and the underlying product.

Measure adoption as a path, then act at the break

Adoption is not a single metric. It is a sequence that begins with a credible promise and ends when the customer repeatedly receives value. Measuring only the end hides the reason users failed. Measuring only clicks overstates progress.

Map the path for the persona named in the adoption contract:

Promise understood: The user or buyer can connect the product to a relevant outcome.
Entry: The intended user begins setup or enters the target workflow.
Readiness: Required data, permissions, integrations, and configuration are in place.
Activation: The user reaches the first meaningful value event.
Repetition: The user returns when the underlying job recurs and reaches value again.
Expansion: Appropriate users, workflows, or capabilities are added because the initial value is credible.
Retention: The product remains part of the operating workflow because it continues to produce the desired outcome.

Instrument enough of this path to locate the break. For each event, retain the eligible persona or account, relevant product context, action attempted, result, and connection to the intended outcome. Always state the denominator. Activation among configured users answers a different question from activation among all purchased accounts.

Use metrics such as time-to-value, daily active usage, and feature adoption only when they match the product’s natural workflow. A monthly administrative task should not be judged by daily activity. A high feature-visit count does not establish value if users repeatedly enter the feature because they are confused. Pair behavioral data with the qualitative context collected during proofs, implementations, and customer conversations.

Observed pattern	Question to investigate	Likely response area
Interest is high, but eligible users do not begin	Is the value proposition relevant and credible to this persona?	Positioning, targeting, or enablement
Users begin, but readiness fails	Which data, permission, configuration, or integration dependency is blocking progress?	Onboarding, implementation, or platform capability
Users complete setup, but do not reach the value event	Does the workflow lead clearly to the promised outcome?	Product design, guidance, or core capability
Users activate, but do not return	Is the job recurring, and was the first outcome valuable enough to change behavior?	Discovery, workflow fit, trust, or value proposition
Adoption differs sharply by persona or context	Was the product designed and positioned for the segment that is actually succeeding?	Segmentation, strategy, or packaging
Usage is high, but retention or customer outcomes remain weak	Is activity measuring value, required effort, or repeated friction?	Metric design, product quality, or outcome alignment

Match the intervention to the break. Use onboarding checklists, empty-state prompts, in-app guides, and product tours when the user needs contextual direction. Do not use a tooltip to conceal a broken permission model, an unreliable integration, or a workflow that does not create enough value. Guidance can reduce comprehension friction; it cannot repair a missing capability.

When you test an intervention, define the intended behavior change and success criterion in advance. Segment the result by the persona and context in the adoption contract. An aggregate improvement can hide a deteriorating experience for the strategic user, while a neutral aggregate can hide a strong response in the segment the product is meant to serve.

Complete the loop with a compact adoption review. After a launch or concentrated batch of customer learning, hold the internal readout while the context is still usable. A practical pattern is to run the readout within a week and convert learning into a 30-60-90-day experiment plan. The review should show:

The adoption contract and any evidence that challenges it.
Progress through the journey, segmented by the intended persona and context.
The strongest field observations, with interpretations kept separate.
The active hypotheses and experiments.
The product, positioning, enablement, integration, or onboarding decision each experiment may change.
The owner responsible for making that decision and returning the result to the group.

Use operating reviews and QBRs to evaluate outcomes, not to count shipped features or completed guides. Ask which hypothesis was confirmed, which constraint was removed, where customers still stop, and what the organization will do differently. If an experiment has no route to a decision, revise it or stop it.

For your next significant product bet, start before the launch plan. Write the adoption contract, ask a solutions engineer to challenge it with field conditions, and define the evidence that would change your mind. That small discipline gives every later conversation – roadmap, demo, onboarding, analytics, and QBR – the same customer outcome to pursue.

References

January 23, 2026

Why Codeless Product Analytics Wins: Faster Insights, Fewer Bottlenecks, Bigger PLG Results

Every quarter, I watch product teams move from gut feel to data-informed decisions—until instrumentation bottlenecks slow them to a crawl. That’s why I’ve become an advocate for codeless analytics: it removes the dependency on engineering sprints for basic event tracking and lets teams answer product questions in hours, not weeks.

We explain what codeless analytics are, why (and how) Pendo supports them, plus responses to the top three myths about low-code/no-code solutions.

Here’s how I frame it with my teams: codeless analytics enables product managers, designers, and customer success to tag features visually, track interactions, and analyze adoption without shipping code. The goal isn’t to replace engineered events; it’s to accelerate discovery, speed up iteration, and reduce context-switching for developers. In practice, this means cleaner prioritization, faster validation of hypotheses, and tighter product-led growth loops.

Why Pendo? In my experience, Pendo’s codeless model shortens the distance from question to insight. Visual tagging makes event setup accessible, in-app guides and product tours let us experiment with onboarding and activation, and governance controls ensure data remains trustworthy across teams. The result is a unified analytics approach where we reserve custom instrumentation for complex logic while using codeless tracking for everyday product questions.

Let’s address the top three myths I hear most often. Myth 1: “No-code is only for simple use cases.” In reality, most decisions we make weekly—feature adoption, path analysis, funnel drop-offs, and retention analysis—do not require custom code. Codeless analytics handles these well, and when we need deeper context (like server-side events), we complement it with engineered tracking. It’s a both/and, not an either/or.

Myth 2: “Codeless data isn’t accurate.” Accuracy comes from governance, not the method. I set clear standards: naming conventions, tagging reviews, ownership, and periodic audits. With disciplined process, codeless tracking yields consistent, decision-grade data. The added benefit is visibility—non-technical stakeholders can validate the instrumentation themselves, reducing misalignment.

Myth 3: “Engineers must instrument everything to scale.” Engineering time is precious; we should spend it on differentiated capabilities, not on routine click tracking. Codeless analytics scales by empowering product teams to self-serve, while engineering focuses on back-end, performance, and edge cases. When paired with a unified analytics platform and clear data contracts, this model scales cleanly across product lines.

For teams adopting this approach, I recommend a simple operating model: define your core product questions up front, tag features aligned to those questions, connect insights to in-app guides for experiments, and measure user activation and retention continuously. Whether you run Pendo alongside Amplitude analytics or within a broader unified analytics platform, the key is to keep the insight-to-action loop tight.

The future of product analytics is codeless because it puts insights where they belong—directly in the hands of the people designing the experience. When we remove bottlenecks, we learn faster, ship smarter, and drive measurable PLG impact. That’s how we turn product analytics from a reporting function into a competitive advantage.

Inspired by this post on Pendo – Best Practices.

January 22, 2026
Building Physician‑Grade AI When Trust Is Everything: Inside Healio’s Proven Playbook

Trust is the currency of any high-stakes AI product, and nowhere is that more true than in healthcare. I recently dug into how Healio built an AI assistant for physicians—an audience that can’t afford to be wrong—and it’s a masterclass in balancing accuracy, transparency, and speed without compromising credibility.

Healio, a 125-year-old medical publishing company, set out to create Healio AI to help clinicians prepare for patient care. From the outset, their guiding principle was simple: physicians won’t trust you until you prove it. That lens shaped every decision—from discovery and prototyping to architecture, evaluation, and ongoing validation.

Discovery started with a survey of 300 healthcare professionals to understand real-world needs at the point of care. The headline insight: physicians primarily want AI for preparation, not bedside use. Even more surprising, the top ask wasn’t purely diagnostic support; it was help with patient communication and empathy—translating complex information into clear, accessible conversation.

Momentum mattered. After beginning with Figma mockups to validate workflows, the team built a working prototype in a single weekend using Cursor. That velocity wasn’t about cutting corners; it was about proving value quickly, reducing ambiguity, and iterating with concrete feedback from physicians.

Under the hood, the system employs RAG and hybrid search—combining lexical search, vector search, and semantic search across multiple trusted sources like PubMed. As any PM who has integrated biomedical literature knows, "just use PubMed" isn’t simple—there are five different ways to access the same data, each with trade-offs. The team made pragmatic choices to balance freshness, coverage, latency, and cost while preserving trust in source quality.

Designing for trust extended all the way to the citation UX. The team leaned into citations that physicians actually trust: subscripts, hover states, and progressive disclosure. This gave clinicians verifiable threads back to source material without overwhelming the core interaction, aligning with how experts want to audit evidence under time pressure.

Evaluation wasn’t left to chance. They stood up eight LLM judges for evals: safety, medical accuracy, faithfulness, relevancy, completeness, reasoning, clarity, and overall quality. Just as importantly, they treated those signals as directional, not definitive. In a high-stakes domain, physician feedback trumps LLM-as-judge feedback—so they complemented automated evals with direct reviews from practicing clinicians to calibrate quality and reduce hallucinations.

On the safety front, the team implemented HIPAA compliance and input guardrails for masking personal health information. That choice reflects strong data governance and privacy-by-design thinking: protect PHI by default, constrain prompts to safe boundaries, and make compliance a first-class citizen in the product architecture.

They also addressed monetization without compromising experience. Serving contextual ads while the LLM processes queries is a practical approach that preserves physician workflow efficiency and creates a clear, non-intrusive revenue model.

Critically, the work didn’t stop at launch. The Healio Innovation Partners provide ongoing discovery and validation, ensuring the system evolves with physician needs and the medical evidence base. This is the operating cadence you want for any AI product that sits at the intersection of safety, accuracy, and fast-changing knowledge.

My takeaways for building AI in high-stakes domains: prioritize retrieval-first pipelines over model cleverness; couple RAG with hybrid search across vetted sources; design citations that earn trust at a glance; use eval-driven development, but let domain-expert feedback be the ultimate judge; and embed regulatory compliance into your product strategy from day one. If trust is your North Star, this is a playbook worth emulating.

Inspired by this post on Product Talk.

January 22, 2026
AI-Powered Growth Loops: Transform Your PLG Product into a Self-Optimizing Engine

Across my teams and portfolio, I’m watching AI fundamentally reshape product-led growth—from static funnels and one-off playbooks to adaptive, compounding growth loops that learn in real time. The shift isn’t just technological; it’s an operating model change that rewards continuous discovery, rigorous instrumentation, and outcome-driven product strategy.

"Learn how AI is transforming PLG with a new generation of growth loops that can turn your product into a self-optimizing platform." That line captures what I’ve been building toward: systems that sense user intent, decide the next best action, act contextually, and learn to improve the loop with every interaction.

Here’s the core pattern I rely on. First, sense: unify product analytics and behavioral signals (think Amplitude analytics, Pendo events, Intercom conversations) into a single, queryable, privacy-safe layer. Second, decide: apply AI Strategy—LLMs for product managers, rules, and retrieval—to segment users by intent and probability of success. Third, act: deliver in-app guides, product tours, tooltips, or personalized nudges that accelerate user activation and time-to-value. Finally, learn: run A/B testing with a clear minimum detectable effect (MDE), then feed outcomes back into the model for continuous optimization.

Activation is where the gains start compounding. With gen ai, I can auto-generate tailored onboarding checklists, dynamic walkthroughs, and contextual help that adapts to the user’s role, data maturity, and current friction points. We’ve moved from generic product tours to precision guidance that updates based on real-time behavior—often lifting first-week activation and shortening time-to-first-value without adding support load.

Experimentation is the governor that keeps speed and quality in balance. I instrument every growth loop end to end and pair eval-driven development with A/B testing to confirm incremental impact. Amplitude analytics gives me cohort views and path analysis; Pendo or Intercom can deliver in-app variants; a unified analytics platform closes the loop on retention analysis so I’m not optimizing for click-through at the expense of long-term value.

Retention and expansion are where AI shines as a compounding engine. Retrieval-first pipeline patterns allow instant, contextual support that deflects tickets and boosts perceived product competence. Agentic AI can orchestrate next-best actions—prompting power users toward advanced features, surfacing value moments, or timing expansion prompts when success signals appear. The result is a virtuous cycle: better guidance drives deeper adoption, which improves model accuracy, which unlocks more relevant guidance.

None of this works without guardrails. I bake in AI risk management from the start: strict data governance, privacy-by-design, human-in-the-loop review for high-impact actions, transparent user consent, and continuous drift monitoring. The goal is reliable automation that users trust—augmented by clear fail-safes when confidence drops.

Operationally, I anchor the work in empowered product teams and product trios, focus on outcomes vs output OKRs, and practice continuous discovery to validate problems and solutions before scaling. The baseline metrics I watch: activation rate, time-to-value, week-four retention, PQL/PQA conversion, expansion revenue, and support deflection—each tied to a specific growth loop hypothesis.

If you’re starting fresh, begin with the highest-leverage loop: user activation. Instrument your onboarding journey, define the critical path to value, ship two to three personalized interventions, and measure impact with a precommitted MDE. Scale what wins, drop what doesn’t, and iterate weekly. Once activation is compounding, extend the same approach to adoption depth, collaboration features, and expansion triggers.

In practical terms, AI-powered PLG is less about flashy features and more about disciplined feedback loops. Build the sensing fabric, keep the decision layer auditable, ship small actions quickly, and treat learning as the product. Do that, and your product doesn’t just grow—it becomes a self-optimizing platform.

Inspired by this post on Product School.

January 21, 2026
Inside Product at Heart 2026: Bold Single-Track Vision, AI Everywhere, Deeper Connections

I just tuned into the latest conversation on the upcoming Product at Heart 2026, and it hit on the exact challenges product leaders are navigating right now: curating meaningful content in a world where AI moves faster than our agendas, designing formats that create real connection, and ensuring every minute earns its place. Listening to Petra Wille and Teresa Torres map out the speaker lineup, workshops, and structural shifts, I found myself nodding along—this is the kind of thoughtful curation we need if we want product teams and product leaders to walk away with practical value, not just inspiration.

Listen to this episode on: Spotify | Apple Podcasts

What stood out immediately is the bold move to a single-track conference for 2026. In an era of gen ai hype and endless breakouts, this choice signals clear intent: tighter curation, a shared experience, and less FOMO. The team isn’t carving out a separate AI track—and I love that decision. Their stance is simple and sensible: No AI track—AI will show up everywhere, but not as a siloed topic. The team sees it as part of the everyday toolkit. That mirrors how high-performing, empowered product teams actually work today—AI Strategy and AI workflows are part of the operating system, not a side show.

The keynote lineup is already compelling. Christian Idiodi (SVPG) brings storytelling that turns product principles into habits you can actually use on Monday. Elaine Kasket, cyber-psychologist, exploring digital afterlife and AI replicas, will push us to think more deeply about the human side of our systems. And Teresa Torres will be sharing what she’s learning about AI—exactly the kind of continuous discovery mindset we need as we integrate LLMs into product discovery and delivery.

I’m also thrilled to see roundtables become what they’re calling an “alternative track.” That’s a smart way to deepen learning without fragmenting attention. The best conference ROI I’ve had often comes from targeted small-group conversations—where product trios compare approaches, swap metrics frameworks, or challenge each other’s product strategy assumptions. It’s a design choice that rewards curiosity and builds communities of practice.

We also get a behind-the-scenes look at Teresa’s Maker Studio workshop, where participants will build personal AI workflows. That’s exactly the hands-on, practitioner-first approach teams need right now—less demo theater, more systems that stick. If your roadmap includes integrating LLMs into continuous discovery or augmenting your team’s decision velocity, this kind of guided practice is gold.

The broader workshop slate looks deep and balanced. Expect returning favorites and practical frameworks: Rich Mironov on the realities of product leadership in complex orgs; Büşra’s metrics workshop translating outcomes into action; and an overview of additional workshops from Rich Mironov, Büşra Coşkuner, Marcus Castenfors, and Özlem Yüce. From success metrics to toolkits for product managers, the content spans IC to product management leadership—ideal if you’re stepping into new roles or scaling empowered product teams.

One of the most exciting evolutions is the Product Leadership Event, now a 1.5-day retreat. The format blends talk sessions, mini-workshops, dinners, and small-group excursions (boat rides, improv, etc.), giving leaders time and space to exchange playbooks, stress-test decisions, and build real relationships. It’s capped at 60 attendees (all in product leadership roles) to keep it intimate and useful. As someone who believes in outcomes vs output OKRs and first principles decision making, I appreciate how this structure encourages depth over breadth—and real accountability among peers.

Here are the core takeaways I’m carrying into my own planning: single-track means tighter curation, so every talk has to earn its place. Roundtables are growing into an “alternative track,” offering more ways to engage beyond stage talks. Workshops go deep and meet you where you are—IC, manager, or executive. And the leadership retreat expands to maximize learning from peers, not just from the stage. If you care about product discovery, product strategy, and conference networking that leads to actual business impact, this program looks thoughtfully engineered.

If you’re planning your 2026 calendar—or just curious how conferences evolve alongside the craft—this is a thoughtful walkthrough of what to expect. Come say hi to Teresa and Petra—on stage, at a roundtable, or somewhere in the hallway conversations that make these events memorable.

For more context and resources mentioned, explore: Product at Heart, Arne Kittler, Mind the Product, Christian Idiodi of Silicon Valley Product Group, Elaine Kasket, House of Beautiful Business, The 7 Habits of Highly Effective People by Stephen Covey, Rich Mironov, Marty Cagan, Claude Code, Codex by OpenAI, Marcus Castenfors, Büşra Coşkuner and her Success Metrics: A Playbook for Product Managers, Özlem Yüce’s Essential Toolkit for Product Managers, Petra’s Product Leadership Wheel (PLwheel), and Netlight.

Follow Teresa Torres: https://ProductTalk.org

Follow Petra Wille: https://Petra-Wille.com

Full transcripts are only available for paid subscribers.

Inspired by this post on Product Talk.

January 20, 2026
How I Harness AI to Supercharge Product Discovery for Faster Research, Prototyping, and Validation

I’ve led product teams through countless discovery cycles, and nothing has accelerated our learning loops like AI. By weaving AI into our continuous discovery practice at HighLevel, I cut time-to-insight, reduce risk earlier, and keep our product strategy relentlessly focused on customer outcomes.

AI streamlines product discovery by accelerating research, prototyping, and validation, enabling teams to make faster, smarter, and user-driven decisions.

In the research phase, I use gen ai and LLMs for product managers to synthesize interviews, cluster themes, and surface unmet needs in minutes instead of days. Pairing those qualitative insights with behavioral signals in Amplitude analytics helps me spot high-intent cohorts and friction points at scale, so our problem framing is both human-centered and data-backed.

From there, I translate insights into crisp hypotheses and prioritize with the Kano Model and outcomes vs output OKRs. To keep experiments honest, I define a minimum detectable effect (MDE) up front and design A/B testing plans that reflect realistic traffic and seasonality, ensuring our decisions are statistically grounded rather than anecdotal.

Prototyping is where gen ai for product prototyping really shines. I spin up multiple UX flows, UI copy variants, and edge-case scenarios using prompt engineering, then iterate with rapid feedback from product trios. When needed, I mock in-app guides and product tours to validate onboarding concepts before we commit to code, preserving velocity without sacrificing quality.

For validation, I lean on a mix of lightweight experiments—fake-door tests, concierge pilots, and targeted A/B testing—augmented by in-product surveys via Pendo or Intercom. For AI-powered features, I apply eval-driven development to measure relevance, latency, and safety, so we can ship responsibly while maintaining the pace of learning.

This approach only works when the team is structured to move fast. Empowered product teams and product trios own discovery end-to-end, with clear guardrails around data governance, privacy-by-design, and AI risk management. That alignment lets us shift from opinions to evidence, and from output to outcomes, without friction.

If you’re getting started, pick one discovery loop to transform: automate research synthesis, prototype two to three variants with AI, and validate with a tightly scoped experiment. Instrument your analytics, track time-to-insight and time-to-prototype, and iterate your product roadmapping and sprint planning with what you learn. The payoff is immediate: faster cycles, stronger conviction, and a more user-driven path to product-led growth.

Inspired by this post on Product School.

January 19, 2026