Tag: agentic AI

How We Taught Agentic AI to Speak Product Analytics—and Unlocked Actionable Insights

I set out to solve a deceptively simple problem: help our teams ask product questions in plain English and get trustworthy, analysis-grade answers—fast. That required more than a powerful model; it demanded agents that genuinely understand the language of product analytics, from behavioral analytics nuances to the messy reality of event taxonomies, funnels, and cohorts. In this post, I share how we engineered agentic AI that speaks our domain fluently and turns questions into decisions.

The core challenge wasn’t data volume or dashboard sprawl; it was semantics. Different teams said “activation,” “onboarding,” or “first value” and meant overlapping but distinct things. Our PMs, analysts, and engineers navigated a maze of synonyms across Amplitude analytics, Pendo, and our unified analytics platform. Generic LLMs stumbled on these nuances, so we built a shared ontology—driver trees anchored to a clear North Star—with canonical definitions for activation, retention, and conversion, plus consistent event naming and cohort logic.

We started with a rigorous metric catalog: every KPI linked to its drivers, exact formulas, cohorts, and time windows; every event mapped to a product taxonomy; every dashboard and SQL snippet versioned with ownership and lineage. That catalog became the ground truth for agents. We embedded data governance and privacy-by-design from the start—permissioning for fields and queries, PII redaction, and scoped access that reflected how product teams actually work.
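
To make the idea concrete, here is a minimal sketch of what a catalog entry and a term lookup could look like. The class names, field names, and the choice of which terms count as synonyms are hypothetical illustrations, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricDefinition:
    """One catalog entry. All field names are illustrative assumptions."""
    name: str
    formula: str        # human-readable formula
    cohort: str         # population the metric is computed over
    window_days: int    # time window for the calculation
    owner: str          # team accountable for the definition
    version: int = 1
    drivers: tuple = ()   # names of metrics that drive this one
    synonyms: tuple = ()  # other terms teams use for the same concept


class MetricCatalog:
    def __init__(self) -> None:
        self._by_name = {}
        self._alias = {}

    def register(self, metric: MetricDefinition) -> None:
        self._by_name[metric.name] = metric
        for term in (metric.name, *metric.synonyms):
            self._alias[term.lower()] = metric.name

    def resolve(self, term: str) -> Optional[MetricDefinition]:
        """Map a team's wording to the canonical definition, if one exists."""
        canonical = self._alias.get(term.strip().lower())
        return self._by_name.get(canonical) if canonical else None
```

Resolving a term through one alias table is what lets an agent answer "what do you mean by activation?" from the catalog instead of guessing.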

Next, we built a retrieval-first pipeline to ground the agents in our corpus before generation. We indexed metric definitions, dashboards, experiment readouts, runbooks, and high-signal Slack threads so the agent could cite relevant artifacts, not just predict plausible text. With careful context window management and prompt engineering, the agent retrieves definitions and prior analyses, then plans multi-step actions: run a query, compare cohorts, check “minimum detectable effect (MDE)” for an A/B test, and summarize findings with references.
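
The retrieve-then-generate order can be sketched in a few lines. This toy version ranks documents by keyword overlap, which is a stand-in assumption; a real pipeline would more likely use embeddings or a search index. The function names, document ids, and prompt wording are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, k=2):
    """Return up to k document ids ranked by keyword overlap with the question."""
    question_tokens = tokenize(question)
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(question_tokens & tokenize(text))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(question, corpus, k=2):
    """Ground the model by putting retrieved sources ahead of the question."""
    doc_ids = retrieve(question, corpus, k)
    sources = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{sources}\n"
        f"Question: {question}"
    )
```

Because each source carries its id into the prompt, the answer can cite the artifact it relied on.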

Architecturally, we treated this as “Agent Analytics”: an orchestrator that selects tools based on intent—querying Amplitude analytics or Pendo for behavioral paths and funnels, hitting our warehouse for cohort tables, or pulling experiment metadata and anomaly detection alerts. Tool use is permission-aware, auditable, and designed to fail safe. The agent’s outputs include citations back to the exact definitions, dashboards, and SQL used, so reviewers can validate and iterate.
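
A minimal sketch of permission-aware, auditable tool selection follows. It assumes the intent has already been classified upstream, and the `Tool` and `Orchestrator` names, role strings, and messages are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_role: str
    run: Callable[[str], str]


class Orchestrator:
    def __init__(self, tools_by_intent):
        self._tools = tools_by_intent
        self.audit_log = []  # (intent, outcome) pairs for later review

    def handle(self, intent, request, roles):
        tool = self._tools.get(intent)
        if tool is None:
            # Fail safe: an unknown intent never falls through to a default tool.
            self.audit_log.append((intent, "no_tool"))
            return "No tool is registered for this intent."
        if tool.required_role not in roles:
            self.audit_log.append((intent, "denied"))
            return "Access denied for this data source."
        self.audit_log.append((intent, tool.name))
        return tool.run(request)
```

Logging the denied and unknown cases alongside the successful ones is what makes the trail useful to reviewers.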

Quality came from eval-driven development, not intuition. We built a gold set of representative product questions (activation inflections, retention analysis by segment, funnel drop-offs after feature launches) and scored the agent on faithfulness to definitions, numerical accuracy, latency, and actionability. We incorporated regression checks to catch drifts after schema changes, and we tuned prompts to reduce overconfident answers and push for clarifying questions when context was missing.
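
As an illustration of the numerical-accuracy part of such a gold set, the sketch below scores an agent against expected values within a relative tolerance. The case format, the 1% tolerance, and the `run_evals` name are assumptions; faithfulness, latency, and actionability would need their own scorers.

```python
import math


def run_evals(gold_set, agent, rel_tol=0.01):
    """Score an agent on numeric gold answers and list the questions it missed."""
    if not gold_set:
        raise ValueError("gold_set must contain at least one case")
    failures = []
    for case in gold_set:
        answer = agent(case["question"])
        # A missing answer counts as a failure rather than being skipped.
        if answer is None or not math.isclose(answer, case["expected"], rel_tol=rel_tol):
            failures.append(case["question"])
    pass_rate = 1 - len(failures) / len(gold_set)
    return {"pass_rate": pass_rate, "failures": failures}
```

Running the same gold set after every schema change turns this into the regression check described above.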

Safety and reliability were non-negotiable. We layered AI risk management with role-based access, guardrails that block destructive queries, and risk scoring for unfamiliar joins or sudden spikes in metric deltas. The agent logs every step—what it retrieved, which tools it called, and why—so analysts can replay and refine the chain of thought with transparent provenance.
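
One simple form a destructive-query guardrail can take is an allowlist check before execution. This is a deliberately conservative sketch under stated assumptions: the keyword list and function name are hypothetical, and a production system would more likely pair read-only database credentials with a real SQL parser.

```python
import re

# Keywords that should never appear in an agent-issued analytics query.
FORBIDDEN = re.compile(
    r"\b(drop|delete|truncate|update|insert|alter|grant|create)\b",
    re.IGNORECASE,
)


def is_safe_query(sql):
    """Allow a single read-only statement; reject anything else."""
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        return False  # empty input or stacked statements
    statement = statements[0]
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        return False
    # Conservative: a forbidden keyword inside a string literal is also rejected.
    return FORBIDDEN.search(statement) is None
```

The check errs toward blocking: a false rejection costs a retry, while a false acceptance could cost data.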

The payoff: product teams now self-serve nuanced questions in minutes instead of days, and our analysts spend more time on discovery than report wrangling. Retention analysis improved as the agent standardized cohort logic; conversion investigations accelerated thanks to consistent funnel definitions; and cross-functional decisions aligned around the same driver trees and shared language. Most importantly, the agent turned ambiguous asks into structured analyses that stand up to scrutiny.

For fellow product leaders, my lesson is simple: start with semantics, not models. A crisp ontology, disciplined taxonomy, and clear ownership will outperform a flashy stack riddled with ambiguity. Avoid technology FOMO; favor retrieval-first grounding, small sharp tools, and continuous discovery with your product trios. When your organization speaks a common analytics language, agents can finally think with you, not just for you.

Next, we’re extending the agent’s planning skills to recommend experiment designs, estimate power and “minimum detectable effect (MDE),” and propose driver-tree-informed bet sizing. We’re also tightening feedback loops so every accepted answer, edit, or override strengthens the retrieval corpus and evaluations. The vision: a calm, reliable layer that makes rigorous product analytics feel conversational—and helps teams move from questions to confident action.

Inspired by this post on Amplitude – Best Practices.

April 13, 2026
Stop Drowning in Tasks: How AI Marketing Agents Restore Focus and Maximize Impact

Every week I meet marketers who are working harder than ever—more campaigns, more content, more dashboards—yet seeing less movement on metrics that matter. The surge of AI tooling has amplified activity, not necessarily impact. That’s the focus problem: we confuse motion with momentum, and our backlogs look great while our outcomes stall.

Learn how AI agents for marketing can help you prioritize impact so you can do important work, instead of just more work.

In my role leading product and growth teams, I’ve learned that AI only compounds value when it is pointed squarely at outcomes. If we don’t define what “good” looks like, agentic AI will simply scale busywork. The antidote is a disciplined operating model that connects strategy to execution and instruments agents with clear success criteria.

First, anchor your program with OKRs that measure outcomes rather than output. Choose one or two measurable business outcomes, such as qualified pipeline, conversion rate, or activation, and make everything else subordinate. This provides the compass agents need to make effective trade-offs when speed and volume tempt you to do “one more thing.”

Second, map a driver tree from the target outcome down to the controllable levers: audience segments, offers, channels, messaging, and experience friction. This traceability shows where agents can move the needle fastest—whether that’s accelerating research, sharpening positioning, or eliminating handoffs that slow experimentation.
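
A driver tree is easy to represent as a small recursive structure. In this sketch the `Driver` class and the example levers are hypothetical; the point is that every leaf can be traced back to the target outcome.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    name: str
    children: list = field(default_factory=list)


def lever_paths(node, prefix=()):
    """Return one path per controllable lever (leaf), from outcome down to lever."""
    path = prefix + (node.name,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(lever_paths(child, path))
    return paths
```

Listing the paths gives each agent a documented line of sight from its lever to the outcome it is meant to move.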

Third, design a small, agentic AI workforce aligned to those levers. For example: a Research Agent that synthesizes market insights and past performance; a Copy Agent that generates on-brief, on-brand variants; a Distribution Agent that adapts content to each channel and schedules posts; and an Analytics Agent that runs A/B tests, summarizes results, and flags anomalies. Keep human oversight where judgment matters most—strategy, brand voice, and high-stakes decisions.

Fourth, instrument rigor from day one with Agent Analytics and eval-driven development. Define offline evals for brand consistency, factuality, safety, and response time; pair them with online experiments that quantify lift on your target outcomes. Set a minimum detectable effect (MDE) so you stop shipping changes that cannot plausibly move the metric.
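
For a conversion-style metric, the MDE can be approximated before launch. The sketch below uses the common normal approximation for a two-sided test with equal traffic per arm; the function name and the defaults (5% significance, 80% power) are assumptions, and it is not a substitute for your experimentation platform's own calculator.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-arm test on a proportion."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference, using the baseline variance for both arms.
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * standard_error
```

With a 10% baseline and 10,000 users per arm, this comes out to roughly 1.2 percentage points, so a change expected to add 0.2 points is not worth testing at that traffic level. Quadrupling the sample size halves the MDE.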

Fifth, operationalize your AI workflows. Standardize prompts, inputs, and handoffs; templatize briefs and acceptance criteria; and keep a change log so improvements compound rather than reset. Use short, frequent feedback loops to prune low-impact work and double down on what demonstrably advances your objectives.

I’ve seen teams reclaim focus and momentum when they treat agents as teammates, not toys. The magic isn’t in producing more assets—it’s in consistently choosing the next best action in service of a clear outcome. When you combine outcome clarity, a driver tree, targeted agents, and tight evals, AI becomes a force multiplier for marketing impact.

If you’re feeling overwhelmed by AI’s possibilities, start small: commit to one outcome, one driver you believe is material, and one agent designed for that job. Prove lift, codify the workflow, then scale. Velocity is only valuable when it’s pointed in the right direction.

Inspired by this post on Amplitude – Best Practices.

April 10, 2026
Never Stop Disrupting: Why the Fin API Platform Signals a New Era for Agentic AI

Disruption is the only sustainable strategy in product. When a platform meaningfully changes how we build and operate, I pay attention—not just as a product leader, but as someone accountable for turning AI Strategy into durable competitive differentiation. That’s why the launch of the Fin API platform stands out: it’s a concrete step toward agentic AI at enterprise scale.

Today, I’m diving into what this launch includes, why it matters for product strategy, and how I’d navigate the build vs buy decision in this new landscape. My goal is to translate the announcement into actionable guidance for product teams, CX leaders, and forward-deployed engineers who are building the next generation of customer support and product-led experiences.

Fin is a customer agent platform that currently resolves over 2M customer issues a week, a number that is growing exponentially. Brands large and small rely on it across verticals, from Atlassian and Riot Games to smaller upstarts like Mercury and Polymarket. It runs on a family of models trained by its AI group. Last week, they announced Apex, which they describe as the world’s first specialized customer service LLM. According to the announcement, in production tests over the last 6 months it beat every frontier model it was compared against, including those from Anthropic and OpenAI, on resolution rate, latency, hallucination rate, and cost.

With this launch, teams can access the platform’s core capabilities and underlying models directly via API, with contracts starting at $250k per year, and usage rates that are by far the cheapest in the industry for each of the model’s subcategories. For leaders evaluating total cost of ownership, this is a meaningful data point: it shifts the economics of scaled automation from experimental to operational.

Why now? Because builders want options. I hear from teams daily that want to design their own agents, tune prompts and policies, and integrate with bespoke CRMs, data lakes, and product surfaces. The Fin announcement meets that demand with three clear build-paths, each mapping to a different operating model and maturity stage.

First, for the vast majority of companies, the Fin Agent Platform is the pragmatic starting point. Fin reports ~8k companies on it today. It addresses 99% of customer needs out of the box—without exhausting consulting engagements—while delivering top-tier resolution rates. If your priority is time-to-value, governance, and platform scalability, this route de-risks implementation and accelerates outcomes.

Second, for teams that need custom surfaces or channels, the Fin Agent API lets you present Fin in unique contexts. You get the Fin platform’s orchestration and controls, but you’re free to bypass the default messenger, email, voice, or any prebuilt channel and embed the agent natively in your product. I see this as the sweet spot for product-led growth motions where conversation design and UX writing are strategic levers.

Third, for companies building hyper-specific agents—think service plus in-product actions—the new API access to Apex and the broader collection of models is the obvious move. Unlike generalized models, these are purpose-trained for customer service scenarios and operational policies. If you have strong in-house solutions engineering, a retrieval-first pipeline, and eval-driven development in place, this path maximizes control without reinventing the model layer.

This also opens the door for vertical specialists. Fin-like businesses focused on deep domains can emerge quickly—Fin for dentists? Why not? Fin for car dealerships? Sure. I expect startups and modern CX providers (including players like Decagon and Sierra) to carve out niches where domain data, workflows, and compliance are the real moats. That’s where differentiated AI beats generic capability.

There’s a defensive reason to pay attention here. The software landscape is shifting fast: the moat is no longer feature parity—it’s the quality of your agents and the data flywheels powering them. Building software is simply less hard now, and I’ve watched engineering teams more than double measurable productivity as they adopt AI-assisted development. The implication is clear: the interface-and-features era is giving way to an agents-and-outcomes era.

Serious software companies must evolve from being a features company to an agents company—and build those agents on differentiated AI. More value will accrue at the model and orchestration layers, where safety, latency, cost, and resolution quality are won. That puts a premium on prompt engineering discipline, policy routing, continuous discovery of edge cases, and rigorous offline/online evals to keep hallucination rates low while maintaining speed.

How would I choose among the three build-paths? If you’re early or resource-constrained, start with the Fin Agent Platform to validate outcomes and align stakeholders. If you need branded experiences and tighter product integration, use the Fin Agent API to control surfaces without owning the heavy lifting. If you have strong MLOps and a mature customer support AI strategy, go model-level with Apex and companions, layering in your own guardrails, context window management, and test harnesses. In each case, balance velocity, control, and risk: your build vs buy decision should be grounded in clear metrics and an explicit product strategy.

Where does this lead? We’ll see more companies expose specialized model families with clearer economics and stronger governance. For now, I’m excited to see what teams build with the Fin API platform—and how they turn agentic AI into measurable improvements in resolution rate, CSAT, cost-to-serve, and ultimately, customer loyalty.

Inspired by this post on The Intercom Blog.

April 3, 2026
Inside Banani: How a Canvas-First AI Designer Elevates UX and Accelerates Product Teams

I believe the future of product design isn’t about replacing designers—it’s about giving every team access to one. That’s why Banani grabbed my attention. It’s an AI product designer that doesn’t just generate code—it generates design. For solo founders, stretched design teams, and early-stage startups, that shift matters: it raises the design floor without lowering the creative ceiling.

I spent time with Vlad Solomakha (CEO & Co-founder), Vova Kovalchuk (CTO & Co-founder), and Vlad Ostapovats (Founding Growth) to unpack how they took Banani from a Figma plugin proof-of-concept to a canvas-first AI design tool generating hundreds of thousands of designs per week. Vlad brings a decade of design experience and a precise north star: AI should produce beautiful, tasteful design rather than average, undifferentiated UI.

The architectural choices stood out. They engineered their agent to handle parallel screen edits, manage per-screen context across canvases with hundreds of frames, and make surgical edits without regenerating entire screens. This is the kind of agentic AI work that product leaders have been waiting for: concrete advances in context window management, tool orchestration, and prompt engineering that translate into higher throughput without sacrificing quality.

Equally important is how they addressed the "gulf of specification"—the mismatch between how designers think visually and how agents understand text. Banani’s canvas-first approach acknowledges that design is spatial, hierarchical, and iterative. Rather than forcing a chat-first UX, they center the canvas and let the agent do production work while keeping the designer firmly in control. In practice, this narrows intent ambiguity, speeds up iteration, and preserves taste.

The team made another pivotal bet: Banani doesn’t compile running applications, just HTML/CSS mockups, and that choice shapes everything. By decoupling the design artifact from runnable code, they optimize for velocity, taste, and exploration. In my experience, this separation is the right product strategy for early discovery and generative AI prototyping: move fast on aesthetics and flows, then converge on implementation once you’ve validated the direction.

I also appreciated their pragmatic evaluation approach. Instead of traditional evals, they spin up 10 screens from one prompt to compare models. It’s hands-on, outcome-based, and aligned with eval-driven development in real product environments. They’re relentlessly discerning about when to work around model limitations versus when to wait for the models to improve—an essential discipline when building at the edge of what’s possible.

Under the hood, context engineering and specialized agent tools do the heavy lifting. Per-screen history with shared project context enables precise, reversible changes across large canvases. The result: fewer destructive regenerations, more reliable design intent preservation, and a workflow that feels like collaborating with a strong mid-level designer who’s exceptionally fast and consistent.
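
To illustrate the general pattern (this is not Banani's implementation, which the conversation does not spell out), here is a sketch of per-screen version history alongside shared project context. All class and method names are hypothetical.

```python
class ScreenHistory:
    """Keeps every version of one screen so edits stay reversible."""

    def __init__(self, initial_html):
        self._versions = [initial_html]

    @property
    def current(self):
        return self._versions[-1]

    def apply_edit(self, old, new):
        """Surgical edit: replace one exact fragment instead of regenerating."""
        if self.current.count(old) != 1:
            raise ValueError("edit target must match exactly once")
        self._versions.append(self.current.replace(old, new))

    def undo(self):
        if len(self._versions) > 1:
            self._versions.pop()


class Project:
    def __init__(self, shared_context):
        self.shared_context = shared_context  # e.g. design tokens, brief
        self.screens = {}

    def add_screen(self, name, html):
        self.screens[name] = ScreenHistory(html)

    def context_for(self, name):
        """Give the agent only this screen plus the shared project context."""
        return {"shared": self.shared_context, "screen": self.screens[name].current}
```

Scoping context to a single screen keeps prompts small on large canvases, and requiring an exact single match makes an ambiguous edit fail loudly instead of silently changing the wrong element.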

If you want a quick tour, I recommend jumping to a few highlights: 20:13 Product Tour Canvas First AI, 33:40 Gulf of Specification, 42:54 Agent Architecture Under Hood, 48:48 State History Context Tricks, and 56:04 Navigating Busy Canvases. Each segment reveals a different layer of the system design and product thinking behind Banani’s canvas-first UX.

For product leaders, this is a compelling blueprint for raising the design floor while protecting the last mile of craft. It fits empowered product teams, continuous discovery, and product managers who want leverage from LLMs without losing judgment. If you’re exploring agentic AI in design, this is a thoughtful, execution-focused model worth studying and trialing on your next redesign.

Resources worth exploring: Banani and TL Draw. To hear the full conversation, you can listen on Spotify or Apple Podcasts. Then, pressure-test the approach inside your own product development lifecycle and see how a canvas-first AI designer reshapes your team’s velocity and quality bar.

Inspired by this post on Product Talk.

April 2, 2026
Product Management Isn’t Dead: Why ‘Product Builders’ Will Win in the AI Era—and How to Upskill Now

“Is product management dead?” I hear this question at almost every conference hallway chat. After listening to the latest Product Builders – All Things Product Podcast with Teresa Torres & Petra Wille, I’m more convinced than ever: product management isn’t dead—it’s evolving fast, and the leaders will be those who embrace the shift.

The core take resonated deeply with my day-to-day at HighLevel: product management isn’t dying—“the traditional product trio (PM, design, engineering) is collapsing into something new.” The center of gravity is shifting from swim lanes to outcomes, from rigid handoffs to fluid collaboration, and from role definitions to capabilities that actually ship value.

AI is raising the baseline across the board. The episode’s “80/20 shift: AI handles patterns, humans handle hard problems” is real on my teams. With LLMs like GPT 5.2 and Opus 4.5, coding agents such as Claude Code and Codex, and tools like Replit and Lovable, we’re compressing cycle time on the repeatable 80%. The bottleneck is no longer typing code or drafting copy; it’s selecting the right problems, crafting sharp product strategy, and making confident trade-offs.

This is why the future belongs to “product builders” — people with a shared foundation across disciplines and deep expertise in one area. I look for teams that can shape, prototype, validate, and iterate in tight loops, blending continuous discovery with empowered product teams. The baseline expands, the craft deepens.

Functional expertise still matters—more than ever—because the hard parts are getting harder. We need leaders who can weigh platform scalability against time-to-value, protect privacy-by-design, apply AI risk management, and navigate data governance while sustaining product-market fit. When AI accelerates execution, judgment becomes the differentiator.

For leaders, this creates a clear mandate: create safe AI infrastructure. In practice, that means building guardrails early: security reviews tailored to AI workflows, QA harnesses that include eval-driven development, model performance observability, and human-in-the-loop review systems. You can’t bolt this on later without paying a tax in velocity and trust.

Job descriptions and hiring expectations are already shifting, and it shows up in my reqs: we emphasize cross-functional range, fluency with AI workflows, prompt engineering literacy, and the ability to frame measurable outcomes. We still want craft depth (design systems, systems thinking in engineering, rigorous discovery), but we prize people who move seamlessly from discovery to delivery.

In the episode, I appreciated the crisp framing of why product management isn’t dying—but changing. The rise of the “product builder” foundation reframes team topology and unlocks smaller, more cross-functional squads. AI changes the baseline skill set across product teams, and ignoring it is a career risk. If you’re not learning AI tools, you’re falling behind.

My key takeaways were straightforward and actionable. Smaller, more cross-functional teams are likely. Deep expertise still matters—especially for complex trade-offs. Leaders need guardrails: security, QA, and review systems built for an AI-driven workflow. And if you work in product, design, or engineering, this episode is your signal to start upskilling now.

The risk of ignoring AI in your craft is not hypothetical. I encourage PMs to carve out weekly lab time for hands-on experiments with LLMs, build lightweight prototypes with Replit or Lovable, and pressure-test opportunity solution trees with data-informed discovery. Pair with your engineers on agentic AI use cases, and integrate model evals into your CI/CD pipelines.

Several resources mentioned in the episode are worth exploring: Product at Heart (June, Hamburg), Replit, Lovable, Every, Petra’s Coaching Packages, coding agents (Claude Code, Codex), and LLMs (GPT 5.2, Opus 4.5). These are great jumping-off points for your own product builder toolkit.

My recommendation: queue up the episode on your commute, then pick one workflow to augment with AI before the week ends. Replace a handoff with a shared canvas. Automate a repetitive analysis. Ship a scrappy prototype. Momentum compounds.

Have thoughts on this episode? Leave a comment below. I’d love to hear how your teams are evolving your product trios, what AI workflows are sticking, and where governance has been most challenging.

Inspired by this post on Product Talk.

March 31, 2026
How We Built PR Review Bots In‑House for a Fraction of the Cost—and How You Can Too

PR review bots are all the rage, but they cost a premium. We built our own cheaply, and they work just as well, if not better. Here's how.

As a VP of Product Management, I care deeply about the velocity and quality of our software delivery. The decision to build our own pull request (PR) review agents came from a simple calculus: we needed tighter control over developer experience, CI/CD integration, and cost—without sacrificing accuracy or reliability. The result was a pragmatic system that accelerates reviews, improves code quality, and pays for itself through faster feedback loops.

Before we wrote a line of code, we defined success. Our objectives were to shorten review cycles, reduce back-and-forth on style and test coverage, and surface risks earlier—measured against DORA metrics like lead time and deployment frequency. That focus aligned the team, guided our build vs buy decision, and anchored scope to the highest-impact use cases.

We started rules-first, AI-optional. The initial release enforced guardrails that are universally valuable: linting and formatting checks, required test coverage thresholds, commit message standards, ownership validation (CODEOWNERS), and basic security scans. These automated gates eliminated predictable review friction, freeing engineers to focus on logic and architecture rather than style debates.
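
Two of these gates fit in a few lines. The sketch below assumes a "type(scope): summary" commit convention and an 80% coverage floor; both are illustrative choices, and the function name is hypothetical.

```python
import re

# Assumed convention: "type(scope): summary", scope optional, summary up to 72 chars.
COMMIT_SUBJECT = re.compile(
    r"^(feat|fix|docs|refactor|test|chore)(\([\w-]+\))?: \S.{0,71}$"
)


def gate_failures(commit_message, coverage_percent, min_coverage=80.0):
    """Return human-readable failures; an empty list means the PR passes."""
    failures = []
    lines = commit_message.splitlines()
    subject = lines[0] if lines else ""
    if not COMMIT_SUBJECT.match(subject):
        failures.append("commit subject must look like 'type(scope): summary'")
    if coverage_percent < min_coverage:
        failures.append(
            f"coverage {coverage_percent:.1f}% is below {min_coverage:.1f}%"
        )
    return failures
```

Returning messages rather than a bare pass/fail keeps the feedback specific enough to act on without reading the bot's source.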

Then we layered intelligence where it mattered. We added lightweight, explainable checks for common code smells and dependency risks, plus optional natural-language summaries that turn large diffs into concise context. Where appropriate, we introduced agentic AI workflows to triage PRs by risk, draft review comments, and suggest missing tests—always keeping humans in the loop. This hybrid approach kept costs low and outcomes high.

Integration with our CI/CD pipeline was non-negotiable. We wired GitHub/GitLab webhooks to a stateless service that queued work, executed checks in containerized workers, and posted results back as status checks and review comments. Caching, parallelization, and smart diff-scoping ensured we only computed what changed, keeping the experience snappy even on large repos.
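
Diff-scoping can be as simple as mapping file patterns to checks and running only the checks whose patterns match the changed files. The check names and patterns below are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical mapping from check name to the file patterns that trigger it.
CHECKS = {
    "python-lint": ["*.py"],
    "js-lint": ["*.js", "*.ts"],
    "dependency-audit": ["*requirements*.txt", "*package.json"],
}


def checks_for_diff(changed_files):
    """Select only the checks relevant to the files touched by a PR."""
    selected = []
    for check, patterns in CHECKS.items():
        if any(fnmatch(path, pattern)
               for path in changed_files
               for pattern in patterns):
            selected.append(check)
    return selected
```

A docs-only PR then triggers no lint or audit work at all, which is one way to keep feedback fast on large repos.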

Adoption hinged on developer experience. We made the bot’s feedback fast, specific, and actionable, with clear remediation steps and links to documentation. Feature flags allowed teams to opt into new checks gradually. ChatOps commands enabled quick overrides for emergencies, while policy-as-code kept rules visible, versioned, and auditable.

We treated this like any product: eval-driven development for accuracy, ongoing telemetry for false-positive rates, and explicit SLAs for response times. We instrumented outcomes end-to-end—tracking PR cycle time, comment-to-merge ratios, and rework—so we could prove the ROI and tune the system without guesswork.

The outcome: a reliable PR review companion that runs on a shoestring budget, integrates cleanly with our workflows, and measurably improves engineering throughput. If you’re weighing build vs buy, start small with rules that deliver immediate value, then layer intelligence where it earns its keep. With a clear product strategy, you can stand up capable PR review bots quickly—and scale them as your needs grow.

If you’re ready to try this yourself, begin with your top three friction points in code reviews, wire them into your CI/CD checks, and pilot with a single team. Iterate weekly, measure relentlessly, and let your developers be your strongest signal. You’ll be surprised how far a pragmatic, product-led approach can take you.

Inspired by this post on Amplitude – Perspectives.

March 27, 2026
Apex Arrives: Vertical AI That Beats GPT-5.4 on Customer Service Speed, Accuracy, and Cost

I just watched one of the most significant leaps in customer service AI in years. Last week, a quiet but seismic release landed in CX: Fin introduced Apex, a vertical model purpose-built for support that raises the bar on speed, accuracy, and cost. As a product leader, this is exactly the kind of breakthrough that changes roadmaps, vendor strategies, and what customers can expect from modern service operations.

It’s a brand new model for Fin called Apex, and it’s objectively the highest performing, fastest, and cheapest model for customer service. It beats the very best models in the industry including GPT-5.4 and Opus 4.5.

In this analysis, I’ll unpack why the launch matters for the customer service agent category, what it signals for frontier labs and open‑weight ecosystems, and how leaders should rethink their AI Strategy, build vs buy decisions, and eval-driven development roadmaps.

Fin was already the highest performing and most sophisticated agent in the customer service space, consistently beating impressive competitors like Decagon and Sierra at an average win rate in the 70s. It operates at tremendous scale, now resolving almost 2M customer issues per week, a number that’s growing at an exponential clip. In its short life it’s grown to nearly $100M in recurring revenue.

As of last week, ~100% of all (English language, chat and email) customer conversations are now running on Apex. Since day 1, the Fin engine has comprised a system of models, and last year the team began replacing off‑the‑shelf models with custom ones trained on proprietary data. The core answering model had been a frontier labs offering—initially versions of GPT and more recently Sonnet 4.0. Now, that core answering model is Apex 1.0.

This model resolves customer issues at a materially higher rate than any other model available. One of their largest customers in the gaming space saw the resolution rate improve overnight from 68% to 75% (i.e. a reduction in unresolved conversations of 22%). The team notes they had never seen a jump this large from a single improvement since they started Fin.
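
The 22% figure follows from looking at the unresolved share rather than the resolved share. A quick check of the arithmetic, with a hypothetical helper name:

```python
def unresolved_reduction(rate_before, rate_after):
    """Relative drop in unresolved conversations when resolution rate improves."""
    unresolved_before = 1 - rate_before
    unresolved_after = 1 - rate_after
    return (unresolved_before - unresolved_after) / unresolved_before
```

Going from 68% to 75% resolved means unresolved conversations fall from 32% to 25% of the total, a relative reduction of 7/32, or about 22%.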

Just as important, it’s dramatically faster, has fewer hallucinations, and is far cheaper than other available models—exactly the attributes operations leaders weigh most when deploying agents at scale. In practice, these are the levers that unlock higher CSAT, tighter SLAs, and better unit economics.

Achieving all three simultaneously is extraordinarily hard. Credit goes to foundational research from a 60‑person AI group run by Fergal Reid, and, crucially, to domain‑specific proprietary evals drawn from billions of human and agent interactions produced by the Fin resolution engine—already hand‑tuned to be the most effective in the category. That creates a flywheel: an eval‑driven development loop that trains models to keep improving at the edge of the system’s abilities. In other words, Apex 1.0 looks like the tip of the iceberg.

Zooming out, service is one of the few categories where generative AI has already delivered commercial impact at scale (alongside coding, and arguably the legal industry). With TAMs measured in the hundreds of billions, competition is intense and well capitalized. The pattern I’ve seen repeatedly is clear: winners in these spaces must become full‑stack AI companies. As features become ~free to build, durable competitive differentiation shifts under the hood—to proprietary data, post‑training, inference efficiency, and the quality of the eval loop.

[Figure: side-by-side charts comparing Fin Apex with Sonnet 4.6, Opus 4.5, and GPT-5.4, highlighting a 65% reduction in hallucinations and a time to first token of 3.7s (0.6s faster).]

That’s why competitors will need to release their own models. Many appear to be just starting to hire the talent to do so, which likely gives Fin at least a year of head start. For product leaders, this is a strong signal to revisit build vs buy assumptions, and to quantify when owning your post‑training pipeline and evals becomes the rational move.

Honestly, 2–3 years ago I expected AI application differentiation to live mostly in what we built around third‑party models. The AI game humbles all of us; today it’s obvious that vertical models paired with proprietary evals create compounding moats.

In a podcast interview last week, Andrej Karpathy said:

"I do think we should expect more speciation in the intelligences. The animal kingdom is extremely [diverse] in the brains that exist. And there’s lots of different niches of nature… And I think we should be able to see more speciation. And you don’t need this oracle that knows everything. You kind of speciate it. And then you put it on a specific task. And we should be seeing some of that because you should be able to have much smaller models that still have the cognitive core."

The frontier labs still have the very best models, but open‑weight models aren’t far behind—making pre‑training look increasingly like a commodity. The frontier is moving to post‑training, which is precisely what we see with Apex (and Cursor’s Composer 2), and what we should expect to dominate going forward.

Labs now face a dual reality. On one hand, horizontal general‑purpose models can over‑serve specific verticals (e.g., customer service doesn’t need an oracle that knows everything). On the other, open‑weight models are good enough that high‑quality, domain‑specific post‑training can produce superior models for special‑purpose jobs—and in the ways that matter for those jobs. In service, soft factors like judgement, pleasantness, and attentiveness matter alongside hard factors like resolution effectiveness, speed, and cost.

I’m still bullish on the labs. Many organizations remain heavy customers of Anthropic—whether as part of multi‑model systems or through deep usage of Claude Code in engineering teams (see this example of Claude Code adoption). Yet classic disruption (à la the late, great Clay Christensen) is now at their door. The way out is to disrupt themselves by building cheaper specialized models too, which likely requires acquiring the evals—or the companies with the evals—needed for each task. Expect creative data partnerships, M&A consolidation, and a wave of hyper‑specific model providers that compete head‑to‑head with the labs.

In the meantime, Fin appears to be the only vendor in its space with a custom model that’s also objectively superior to everything else out there. I’m excited to see it deployed broadly for end customers, and I’m watching closely for the next announcement that will accelerate that rollout. For product leaders, the message is clear: the age of vertical models and agentic AI is here—bring your evals, or bring your checkbook.

Inspired by this post on The Intercom Blog.

March 26, 2026
Stop Flying Blind with AI Agents: Put Users at the Center with Pendo Agent Analytics

I’ve watched too many AI agent deployments celebrate velocity while overlooking the one thing that determines long-term success: whether real users are actually getting value. Dashboards tend to spotlight model upgrades, prompt tweaks, and launch counts, yet they rarely quantify task completion, trust, or time-to-value. That blind spot isn’t technical—it’s human.

Enterprises are spending 93% of their AI budget building agents and almost none know if those agents are actually working for users. Pendo Agent Analytics closes the gap.

In my product reviews, I look for evidence that agentic AI is improving outcomes across the customer journey, not just the demo path. Without behavioral analytics and observability, teams optimize for throughput instead of resolution, for novelty instead of reliability. This is where eval-driven development, A/B testing, and rigorous cohort analysis become non-negotiable: they translate agent performance into user impact we can measure and improve.

Here’s the pattern that works for me: define user-centric success metrics first, then let the AI follow. I prioritize signals like successful task completion, low-friction activation, reduced escalations, and sentiment lift—tied directly to product-led growth indicators such as retention and expansion. When these metrics move in the right direction, I know the agent is creating compounding value, not just answering faster.

Practically, I operationalize this with an analytics spine that captures end-to-end agent interactions: intents, prompts, responses, clarifying turns, handoffs, and final outcomes. I segment by persona, journey stage, and account tier to uncover where agents delight and where they degrade trust. With this foundation, I can run controlled experiments, spot anomalies early, and connect improvements in agent behavior to improvements in business performance.
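To make the analytics spine concrete, here is a minimal sketch of what one captured interaction and one segmented metric could look like. The `AgentInteraction` fields and the `completion_rate_by` helper are my own illustrative names, not a Pendo schema or API, and the sample rows are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentInteraction:
    """One end-to-end agent conversation. All field names are illustrative."""
    intent: str
    persona: str
    journey_stage: str
    account_tier: str
    clarifying_turns: int
    handed_off: bool
    completed: bool


def completion_rate_by(interactions, key):
    """Return {segment value: share of interactions where the task completed}."""
    totals = defaultdict(int)
    done = defaultdict(int)
    for item in interactions:
        segment = getattr(item, key)
        totals[segment] += 1
        done[segment] += int(item.completed)
    return {segment: done[segment] / totals[segment] for segment in totals}


# Invented sample data, only to show the shape of the analysis.
sample = [
    AgentInteraction("reset_password", "admin", "onboarding", "enterprise", 0, False, True),
    AgentInteraction("reset_password", "admin", "onboarding", "enterprise", 2, True, False),
    AgentInteraction("export_report", "analyst", "adoption", "growth", 1, False, True),
]
rates = completion_rate_by(sample, "persona")
```

The same helper works for `journey_stage` or `account_tier`, which is the point of capturing those dimensions on every interaction rather than reconstructing them later.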

Pendo Agent Analytics closes the loop by making these user outcomes visible and actionable. Instead of guessing whether an agent helped or hindered, I can analyze where users stall, which prompts or skills drive completion, and how interventions like in-app guides or product tours change behavior. That visibility lets me tune models and experiences in days, not quarters—and gives stakeholders confidence that our AI investments are paying off for customers.

If you’re scaling agents today, start small but instrument deeply: map top user intents, define offline and online evals, A/B test prompts and policies, monitor regressions, and tie every improvement to activation, adoption, and retention. The result is a durable feedback loop that keeps agents aligned with user value as your surface area grows.
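For the A/B testing step, one common way to compare task completion between two prompt variants is a two-proportion z-test. The sketch below uses only the standard library; the function name and the example counts are assumptions for illustration, and a real program would also fix the sample size and minimum detectable effect before looking at results.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in completion rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("completion rate is 0% or 100% in both arms; test is undefined")
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: prompt A completes 400 of 1,000 tasks, prompt B 460 of 1,000.
z, p_value = two_proportion_z_test(400, 1000, 460, 1000)
```

A small p-value says the completion gap is unlikely to be noise. It does not say the new prompt is better for every persona, which is why the segmented view still matters.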

AI agents are not a destination—they’re a capability. When we anchor that capability to clear user outcomes and measure it with the right analytics, we stop flying blind and start compounding advantage. That’s how we turn promising demos into dependable products.

Inspired by this post on Pendo – Best Practices.

March 25, 2026
Meet Amplitude’s Always‑On AI Analysts: Instant Answers Without Dashboards or Reports

For years, I’ve watched product, growth, and data teams burn cycles stitching together manual dashboards and reports, then slogging through replay review just to validate a hunch. That overhead slows discovery and delays decisions. The promise here is different: "Discover how Amplitude AI Agents help product, growth, and data teams turn questions into action without manual dashboards, reports, or replay review." As someone obsessed with decision velocity and evidence-based product strategy, that shift is exactly what I’ve been waiting for.

In practice, I think about "Amplitude AI Agents" as always-on data analysts embedded in our workflow. Instead of queuing requests or context-switching into tooling, I can ask targeted questions, get synthesized insights, and move directly to action. This is a powerful example of agentic AI meeting behavioral analytics in a unified analytics platform—removing friction between inquiry and impact while keeping teams focused on outcomes, not artifacts.

What changes for my day-to-day? I can interrogate customer behavior in real time, pressure-test hypotheses from discovery interviews, and quickly understand whether activation, retention, or monetization is the current constraint. If I’m probing a driver tree for activation or a retention analysis for a specific cohort, I can get to a decision faster—without waiting on someone to build a bespoke dashboard. That means more cycles spent shaping product strategy and fewer sunk into report wrangling.

This matters beyond speed. When product, growth, and data leaders anchor discussions in the same source of truth, we shorten the distance from signal to decision. That alignment is the backbone of product-led growth and continuous discovery: shared context, faster feedback loops, and clearer trade-offs. It also reduces the long tail of analytics debt—those one-off reports and stale views that quietly accumulate across teams.

Of course, adopting any AI workflow in analytics demands governance. I hold these systems to the same bar I set for my teams: clarity of assumptions, consistent metric definitions, and auditable reasoning. Pairing "Amplitude analytics" with strong data governance, CI/CD for analytics definitions, and lightweight evals helps ensure the recommendations we act on are reliable, reproducible, and explainable. AI should accelerate our judgment, not replace it.

The strategic shift is simple and profound: move from building dashboards to making decisions. With always-on analysis, we can spend less time instrumenting analytics theater and more time delivering customer value. That is how we translate insights into impact—and why I’m excited to operationalize this capability across our product trios and go-to-market partners.

Inspired by this post on Amplitude – Best Practices.

March 23, 2026

Agentic AI for Clinical Trial Operations: A Practical Playbook

If you are deciding where to introduce agentic AI in clinical trial operations, the hard question is not whether an agent can complete an impressive demonstration. It is whether the agent can produce a traceable, reviewable result under real trial conditions without obscuring who remains accountable.

Start with a bounded operational workflow, not a promise to automate an entire role. The useful outcome is not an agent that sounds intelligent. It is a smaller work queue, earlier detection of issues, faster human review, and enough evidence to explain every recommendation after the fact.

Start with work that is bounded, frequent, and reversible

Clinical operations contains no shortage of repetitive work. That does not make every task a suitable first agent use case. A workflow can be repetitive and still be unsafe to automate if an error changes a source record, delays escalation, affects patient safety, or hides a protocol issue.

Do not rank candidate workflows by estimated time savings alone. Rank them by risk-adjusted learnability: how quickly can you observe the agent’s behavior, compare it with an accountable reviewer, and contain a mistake before it has a consequential downstream effect?

A strong initial workflow usually has these properties:

A clear trigger and an unambiguous end state.
A finite set of authorized inputs.
An output that a qualified person can independently verify.
A mistake that can be corrected before it changes a consequential decision.
A named reviewer who already owns the underlying process.
An existing queue, baseline, or historical record against which performance can be evaluated.
A defined escalation path for ambiguity, missing data, conflicting records, and tool failure.

Document classification is a useful illustration. An eTMF agent has been applied to more than 80,000 documents per year. That workload is high-volume and structured enough to create repeatable evaluation data. The agent can recommend a classification, expose the evidence behind it, and send uncertain cases to a reviewer. A person can correct the result before the document proceeds through the controlled process.

Monitoring is a different risk class. A CRA agent can assemble safety and data-quality signals from 13 clinical systems, but that breadth is not permission to replace clinical judgment. The safer product boundary is evidence gathering, prioritization, and routing. The accountable professional still determines what the signal means and what action is appropriate.

My rule is simple: let the agent compress evidence gathering before it earns authority to execute an outcome. An agent may identify a possible discrepancy, collect the associated records, and prepare a review packet. It should not resolve a safety issue, close a query, approve clinical content, or alter an authoritative record unless that specific action has been validated, authorized, and made recoverable.

Turn the operating contract into governed platform primitives

Before writing prompts, write the operating contract. It should state the agent’s intended use, authorized inputs, available tools, required output, prohibited actions, review owner, escalation conditions, and evidence to retain. This contract gives product, clinical operations, quality, security, and engineering the same object to inspect.

The prohibited-actions section deserves particular attention. An instruction such as “help the CRA monitor the trial” is too broad to test. A useful boundary sounds more like this: retrieve permitted records, normalize specified fields, identify conditions defined in the approved specification, present supporting evidence, and route the result to the assigned reviewer. Do not interpret clinical significance, overwrite a source value, or close the issue.
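One way to make that boundary testable is to hold the contract as data and check every requested action against it. This is a minimal sketch under my own assumptions: the `OperatingContract` class, its field names, and the action strings are hypothetical, not part of any named platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    """Hypothetical, simplified operating contract for one agent workflow."""
    intended_use: str
    authorized_inputs: frozenset
    allowed_actions: frozenset
    prohibited_actions: frozenset
    review_owner: str

    def check_action(self, action: str) -> str:
        """Classify a requested action as 'allow', 'deny', or 'escalate'."""
        if action in self.prohibited_actions:
            return "deny"
        if action in self.allowed_actions:
            return "allow"
        # Anything the contract does not name goes to the accountable reviewer.
        return "escalate"


contract = OperatingContract(
    intended_use="Assemble a review packet for possible data discrepancies",
    authorized_inputs=frozenset({"edc_records", "query_log"}),
    allowed_actions=frozenset({"retrieve_record", "normalize_field", "route_to_reviewer"}),
    prohibited_actions=frozenset(
        {"interpret_clinical_significance", "overwrite_source_value", "close_issue"}
    ),
    review_owner="assigned_cra",
)
```

Treating unnamed actions as escalations rather than silent denials keeps the reviewer aware of what the agent tried to do outside its validated scope.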

A durable platform can encode that contract through reusable primitives such as models, skills, knowledge bases, MCP connectors, versions, and trigger types. Each primitive should own a specific control rather than serving as a loose container for prompts.

Platform primitive	Product decision to make explicit	Operational failure it should contain
Model	Which approved model and configuration may perform the task, including fallback behavior.	An unreviewed model change silently altering the output.
Skill	The narrow action, permitted inputs, expected schema, and failure behavior.	A general-purpose prompt expanding beyond the validated task.
Knowledge base	Which controlled material is authoritative and which version applies.	An answer relying on obsolete or unapproved material.
Connector	Identity, credential, record scope, and read-versus-write permission.	The agent retrieving or changing data beyond its authorization.
Trigger	What condition may start a run and what happens when the condition repeats.	Duplicate, unexpected, or untraceable execution.
Version	Which complete configuration produced a result and how it can be rolled back.	An output that cannot be reproduced during investigation.

Version everything that can materially change behavior: prompts, skills, model configuration, knowledge, ontology mappings, connector permissions, and escalation logic. A run record should identify why the agent started, which configuration ran, which tools it called, what evidence it retrieved, what it produced, and how the reviewer disposed of the result.
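A run record along those lines could be sketched as follows. The field names, the version labels, and the idea of hashing the configuration into a single fingerprint are my assumptions for illustration; the source does not prescribe a storage format.

```python
import hashlib
import json
from dataclasses import dataclass


def config_fingerprint(config_versions: dict) -> str:
    """Hash the full configuration so any change yields a different identifier."""
    canonical = json.dumps(config_versions, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical audit record for a single agent run."""
    trigger: str                # why the agent started
    config_hash: str            # which complete configuration ran
    tool_calls: tuple           # which tools it called
    evidence_ids: tuple         # what evidence it retrieved
    output: str                 # what it produced
    reviewer_disposition: str   # how the reviewer disposed of the result


# Illustrative version labels only.
versions = {
    "prompt": "v12",
    "skill": "classify_document@3",
    "model": "approved-model-a",
    "knowledge_base": "sop-2026-02",
    "connector_scope": "etmf:read",
}
record = RunRecord(
    trigger="document_uploaded",
    config_hash=config_fingerprint(versions),
    tool_calls=("etmf.get_document",),
    evidence_ids=("doc-001#page-2",),
    output="suggested_class=site_correspondence",
    reviewer_disposition="accepted",
)
```

Sorting the keys before hashing means two records with the same configuration always share a fingerprint, which makes it straightforward to group outputs by the exact configuration that produced them.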

Separate read authority from write authority. A standard connector interface can make a system callable; it does not make every call permissible. Authentication and credential handling belong in a governed connector layer, as demonstrated by custom MCP connectors with an authentication and credentialing wrapper. The agent should receive only the tools and permissions required for the current task.
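The read-versus-write split can be enforced in the connector wrapper rather than left to the prompt. The sketch below is an assumption-laden illustration: `ScopedConnector`, `Access`, and the operation names are hypothetical and do not represent the MCP specification or any vendor's connector API.

```python
from enum import Flag, auto


class Access(Flag):
    READ = auto()
    WRITE = auto()


class PermissionDenied(Exception):
    """Raised when a call needs access the connector was not granted."""


class ScopedConnector:
    """Hypothetical wrapper that checks granted access before any call runs."""

    def __init__(self, system: str, granted: Access):
        self.system = system
        self.granted = granted

    def call(self, operation: str, required: Access) -> str:
        if required not in self.granted:
            raise PermissionDenied(
                f"{operation} on {self.system} needs {required.name} access"
            )
        # A real connector would invoke the underlying system here.
        return f"{operation} executed on {self.system}"


edc = ScopedConnector("edc", granted=Access.READ)
read_result = edc.call("get_subject_visits", required=Access.READ)
try:
    edc.call("update_visit_date", required=Access.WRITE)
    write_blocked = False
except PermissionDenied:
    write_blocked = True
```

Because the check lives in the connector, a prompt change cannot widen the agent's authority; only a reviewed change to the granted scope can.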

The same governance should apply across delivery models. First-party agents can prove reusable patterns, services-led implementations can handle complex workflows, and self-service configuration can extend adoption. Those three deployment paths should share the same identity controls, version model, evaluation process, monitoring, and audit record. Self-service without centralized guardrails merely distributes configuration risk.

Match retrieval to the question the agent must answer

Many apparent reasoning failures begin as retrieval or data-alignment failures. The agent received an outdated document, missed the relevant section, joined records under inconsistent identifiers, or treated two conflicting statuses as though they agreed. A larger context window does not repair those defects. It can make them harder to notice.

Choose the retrieval pattern from the operational question:

Use embeddings for semantic discovery. This is useful when the agent needs to find conceptually related material despite differences in wording. Retrieval results still need document identity, version, and provenance.
Use document hierarchies when structure carries meaning. Markdown or another explicit hierarchy can preserve the relationship among sections, subsections, tables, and controlled instructions. This is preferable when a nearby heading changes how a passage should be interpreted.
Use just-in-time connector retrieval for current system state. When the answer depends on the latest authorized record, retrieve it from the system at run time rather than relying on a stale copied index.

These patterns are complementary. An agent may use semantic retrieval to identify relevant controlled material and an MCP connector to fetch the current operational record. What matters is that the final output distinguishes retrieved policy or guidance from live trial data and preserves the provenance of both.

Cross-system monitoring also needs an ontology layer. Terms, statuses, units, and identifiers that appear similar may not carry the same operational meaning. A unified ontology can align terminology across multiple clinical systems, but normalization must not erase the original value. Retain the source system, source field, retrieval time, transformation applied, and canonical concept alongside every normalized field used in a recommendation.
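A normalized field that keeps its provenance might look like the sketch below. The ontology entries, system names, and field names are invented for illustration; the only point carried over from the text is that the original value and its source travel with the canonical concept.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical mapping: (source system, raw value) -> canonical concept.
STATUS_ONTOLOGY = {
    ("edc", "Screen Fail"): "screen_failure",
    ("ctms", "SF"): "screen_failure",
}


@dataclass(frozen=True)
class NormalizedField:
    canonical_concept: str
    original_value: str      # never discarded during normalization
    source_system: str
    source_field: str
    retrieved_at: datetime
    transformation: str


def normalize_status(source_system, source_field, raw_value, retrieved_at):
    """Map a raw status to its canonical concept, or return None if unmapped."""
    canonical = STATUS_ONTOLOGY.get((source_system, raw_value))
    if canonical is None:
        return None  # unmapped values should be escalated, not guessed
    return NormalizedField(
        canonical_concept=canonical,
        original_value=raw_value,
        source_system=source_system,
        source_field=source_field,
        retrieved_at=retrieved_at,
        transformation="status_ontology_lookup",
    )


now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
from_edc = normalize_status("edc", "subject_status", "Screen Fail", now)
from_ctms = normalize_status("ctms", "status_code", "SF", now)
unmapped = normalize_status("ctms", "status_code", "XX", now)
```

Both records resolve to the same concept while each still shows what its system originally said, so a reviewer can verify the mapping instead of trusting it.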

Define conflict behavior explicitly. The newest value should not automatically win merely because it has the latest timestamp. If two authoritative records disagree and no validated reconciliation rule applies, the agent should show both, explain the conflict in neutral terms, and escalate. Fabricating a clean answer from inconsistent data is more dangerous than returning no answer.
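That conflict rule is small enough to express directly. In this sketch the record shape and the `reconcile` function are hypothetical; note that it deliberately ignores timestamps unless a validated rule is passed in.

```python
def reconcile(records, rule=None):
    """Return an agreed value, a rule-based value, or an escalation.

    records: list of dicts with at least a "value" key.
    rule: an optional validated reconciliation function; None means no rule applies.
    """
    distinct_values = {record["value"] for record in records}
    if len(distinct_values) == 1:
        return {"status": "agreed", "value": next(iter(distinct_values)), "evidence": records}
    if rule is not None:
        return {"status": "reconciled", "value": rule(records), "evidence": records}
    # No validated rule: show every record and let the reviewer decide.
    return {"status": "escalate", "value": None, "evidence": records}


# Invented records from two systems that disagree on a visit date.
conflicting = [
    {"source": "edc", "value": "2026-02-10", "updated": "2026-02-11T09:00Z"},
    {"source": "ctms", "value": "2026-02-12", "updated": "2026-02-14T16:30Z"},
]
decision = reconcile(conflicting)
```

The later `updated` timestamp on the second record has no effect on the outcome, which is the behavior the paragraph above asks for.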

Context management should reduce the agent’s working set to what the current decision requires. Sub-agents and automatic tool filtering can isolate tasks and limit the tools presented at each step. A retrieval sub-agent might return structured evidence with provenance, while a separate workflow skill applies the approved decision rule. That separation makes failures easier to test and permissions easier to constrain.

Do not optimize context solely for token efficiency. In clinical operations, the stronger reason to keep context narrow is control: fewer irrelevant records, fewer callable tools, clearer evidence lineage, and a smaller surface on which conflicting instructions can alter behavior.

Make evaluation and human review release gates

A clinical operations agent is not ready because it succeeds on a happy-path demonstration. Readiness means its intended behavior, failure behavior, and escalation behavior have all been tested against representative conditions. The evaluation plan should exist before the team sees the final test results, so release criteria do not drift to accommodate a weak agent.

Move through increasing levels of operational authority:

Retrospective evaluation: run the agent against a controlled golden dataset without access to live workflows.
Shadow operation: process current inputs in read-only mode while the existing process remains authoritative.
Assisted operation: show recommendations and evidence to a qualified reviewer, requiring approval before any downstream action.
Bounded execution: automate only the reversible actions that have earned sufficient evidence, while preserving escalation and rollback.
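The four levels can be encoded as an ordered gate that every downstream action passes through. This is a sketch under my own assumptions, including the choice that an irreversible action at the bounded-execution level still needs reviewer approval; the names are hypothetical.

```python
from enum import IntEnum


class Authority(IntEnum):
    RETROSPECTIVE = 1      # golden dataset only, no live workflows
    SHADOW = 2             # live inputs, read-only
    ASSISTED = 3           # reviewer approves every downstream action
    BOUNDED_EXECUTION = 4  # validated reversible actions may run automatically


def may_act(level: Authority, reversible: bool, reviewer_approved: bool) -> bool:
    """Decide whether a downstream action may proceed at this authority level."""
    if level <= Authority.SHADOW:
        return False
    if level == Authority.ASSISTED:
        return reviewer_approved
    # Bounded execution: reversible actions run alone; others still need approval.
    return reversible or reviewer_approved
```

For example, `may_act(Authority.SHADOW, True, True)` is false because shadow mode never acts, while `may_act(Authority.BOUNDED_EXECUTION, True, False)` is true for a reversible action that has earned automation.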

A golden dataset needs more than obvious examples. Include normal cases, ambiguous inputs, missing records, conflicting fields, outdated knowledge, duplicate triggers, unauthorized tool requests, and cases that should produce an abstention. Keep high-consequence failure modes visible as separate evaluation slices; a strong average can conceal the specific false negative that matters most.

Human feedback is useful, but it is not automatically ground truth. Reviewers can disagree, inherit inconsistent local practices, or approve a recommendation without examining it closely. Capture the initial agent output, reviewer action, reason for correction, and final adjudication where the process provides one. Use adjudicated outcomes to improve the golden set instead of treating every click as an equally reliable label.

Evaluate the properties that correspond to the operating contract:

Correct classification, routing, or evidence assembly on adjudicated cases.
Recall on important conditions, reviewed separately for higher-consequence misses.
Abstention and escalation when information is missing, conflicting, or outside scope.
Evidence completeness, including links or identifiers that let a reviewer verify the output.
Tool-use correctness, permission failures, and attempts to call unauthorized tools.
Reviewer acceptance, correction, and overturn reasons.
Operational impact on queue size, review effort, and time to disposition.

Set release thresholds according to the consequence of the task. A threshold appropriate for a reversible document suggestion is not automatically appropriate for a safety-monitoring signal. Do not compensate for weak performance on a high-risk slice with excellent performance on easy cases.
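A per-slice release gate makes the last point mechanical. The slice names, counts, and thresholds below are invented for illustration; appropriate thresholds are a quality and clinical decision, not something this sketch can supply.

```python
def release_decision(results, thresholds):
    """Check each evaluation slice against its own minimum score.

    results: {slice name: (correct, total)}
    thresholds: {slice name: minimum acceptable share correct}
    """
    failed = []
    for slice_name, minimum in thresholds.items():
        correct, total = results.get(slice_name, (0, 0))
        score = correct / total if total else 0.0  # a missing slice counts as failing
        if score < minimum:
            failed.append(slice_name)
    return {"release": not failed, "failed_slices": failed}


# Hypothetical numbers: the overall average looks strong, the high-risk slice does not.
results = {"routine_documents": (980, 1000), "safety_signal": (45, 50)}
thresholds = {"routine_documents": 0.95, "safety_signal": 0.98}
decision = release_decision(results, thresholds)
overall = (980 + 45) / (1000 + 50)
```

Here the pooled score is about 97.6%, yet the gate blocks release because the safety slice sits at 90% against a 98% bar. Averaging would have hidden exactly the failure that matters.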

The human review interface is part of the safety system. It should present the recommendation, the exact supporting evidence, source identity, relevant timestamps, detected conflicts, and the permitted next actions. The reviewer needs an obvious way to correct, reject, or escalate the output. A generic approve button encourages automation bias and produces weak feedback data.

Preserve a traceable chain from agent intent to specification to test evidence. A release packet should identify the approved intended use, current versions, evaluation dataset, results by risk slice, known limitations, required human controls, monitoring plan, and rollback procedure. This is not paperwork added after product development. In a GxP-regulated setting, it is part of the product.

Production monitoring should detect changes in both behavior and operating conditions. Watch for shifts in input mix, rising abstention, changes in reviewer overturn reasons, missing provenance, connector failures, and differences after any model, knowledge, permission, or ontology update. When a material change occurs, route the affected configuration back through the relevant evaluation gates.
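As a rough sketch of that monitoring check, the function below compares current operating rates with a baseline and lists what should go back through evaluation. The metric names and the fixed tolerance are assumptions; a production system would likely use per-metric limits and proper statistical control charts.

```python
def needs_reevaluation(baseline, current, tolerance=0.05):
    """Return the metrics whose rate moved more than `tolerance` or went missing."""
    flagged = []
    for name, expected in baseline.items():
        if name not in current:
            flagged.append(name)  # a metric that stopped reporting is itself a signal
        elif abs(current[name] - expected) > tolerance:
            flagged.append(name)
    return sorted(flagged)


# Invented rates for illustration.
baseline = {"abstention_rate": 0.04, "overturn_rate": 0.10, "missing_provenance_rate": 0.0}
current = {"abstention_rate": 0.12, "overturn_rate": 0.11}
flagged = needs_reevaluation(baseline, current)
```

In this example abstention has risen well past the tolerance and the provenance metric has stopped reporting, so both are flagged while the small move in overturn rate is not.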

Key takeaways

Choose a bounded, frequent, and reversible workflow before attempting broad role automation.
Use agents to assemble evidence and prioritize work before granting authority over consequential outcomes.
Express the operating contract through governed models, narrow skills, controlled knowledge, permissioned connectors, triggers, and reproducible versions.
Match retrieval to the question: semantic discovery, hierarchical document access, and live connector retrieval solve different problems.
Preserve ontology mappings and field-level provenance when normalizing data across clinical systems.
Treat abstention, escalation, human review, evaluation evidence, production monitoring, and rollback as release requirements.

Your next artifact should not be a broader agent demonstration. Write the operating contract for the narrowest valuable workflow, then identify its authoritative inputs, prohibited actions, accountable reviewer, evaluation cases, escalation path, and rollback procedure. If any of those are unclear, narrow the workflow again. If they are explicit, you have a credible starting point for an agent that can improve clinical operations without outrunning the evidence.

References

Shivam.Consulting Blog – Inside Medable’s Agent Studio: The Agentic AI Blueprint to Accelerate Safer Clinical Trials

March 19, 2026

Agentic Architecture Demystified: How Modern AI Systems Plan, Learn, and Execute at Scale

In my role leading product teams at HighLevel, I’m often asked to explain what’s really happening behind the scenes of today’s AI products. The short answer, and the subject of "Agentic Architecture: How Modern AI Systems Actually Work," is that modern systems are not just a single model but a coordinated loop of planning, tool use, memory, and evaluation. Once you see that pattern, the design decisions snap into focus and the roadmap becomes far easier to prioritize.

At its core, agentic AI treats the model as a reasoning engine embedded within an AI workflow. The agent interprets intent, plans steps, calls the right tools and APIs, grounds itself in trusted data, and then evaluates outcomes before deciding to continue or stop. This loop creates reliability, reduces hallucinations, and enables the system to operate in real-world, multi-step scenarios.

Here’s the practical lifecycle I rely on. A user provides intent (a goal or request). We run a retrieval-first pipeline to ground the model in accurate, current data. Prompt engineering structures the task and primes the agent with constraints and success criteria while keeping the context window tightly managed. The agent generates a plan, executes steps by calling tools or services, evaluates intermediate results, reflects or revises as needed, and only then returns a final answer with clear citations or evidence.
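Stripped of any particular framework, that lifecycle is a short control loop. The sketch below is mine, not the architecture from the source: `run_agent` and the four callables it takes are hypothetical stand-ins for a retriever, a planner, a tool executor, and an evaluator, and the stubs exist only so the loop can run.

```python
def run_agent(goal, retrieve, plan, act, evaluate, max_steps=5):
    """Ground, plan, act, and evaluate; stop early when a step fails its check."""
    context = retrieve(goal)                 # retrieval first, before any planning
    pending = list(plan(goal, context))
    evidence = []
    steps_taken = 0
    while pending and steps_taken < max_steps:
        step = pending.pop(0)
        result = act(step, context)
        steps_taken += 1
        passed = evaluate(step, result)
        evidence.append({"step": step, "result": result, "passed": passed})
        if not passed:
            # Surface the failure with its evidence rather than guessing an answer.
            return {"status": "needs_review", "evidence": evidence}
    status = "done" if not pending else "step_limit_reached"
    return {"status": status, "evidence": evidence}


# Stub components, invented for illustration.
outcome = run_agent(
    goal="quote the current price",
    retrieve=lambda goal: {"docs": ["pricing_v3"]},
    plan=lambda goal, context: ["lookup_price", "draft_answer"],
    act=lambda step, context: f"{step}:ok",
    evaluate=lambda step, result: result.endswith(":ok"),
)
```

The step limit and the early exit are the termination conditions; without them, a loop like this can keep acting on a plan that has already gone wrong.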

For more complex work, I orchestrate multiple specialized agents—commonly a planner, a solver, and a critic—coordinated by a lightweight controller. This multi-agent pattern reduces single-agent blind spots, encourages self-checking, and mirrors how empowered product teams collaborate. Whether it’s conversation design for support flows or a voice AI agent driving hands-free tasks, orchestration is the difference between a clever demo and a dependable product.

Memory is the second pillar. Short-term working context sits in the prompt, while long-term memory lives in vector stores or databases to track past interactions, preferences, and outcomes. Retrieval augments the model with the right facts at the right time, and tight context window management ensures the agent stays focused on signal, not noise. The result is faster responses, lower costs, and far better accuracy.

Reliability is earned through eval-driven development and robust AI risk management. I define offline and online evaluations, guardrails, and human-in-the-loop checkpoints before scaling traffic. These evaluations become living, automated tests that protect against regressions as prompts, models, and tools evolve. The payoff is real: fewer escalations, higher trust, and measurable improvements to quality over time.

From a product strategy perspective, I resist over-engineering. Start with a simple retrieval-first pipeline and a single agent; prove value; then layer in multi-agent orchestration only where it moves key metrics. Instrument everything—latency, cost, grounding coverage, and outcome quality—and build Agent Analytics dashboards so teams can diagnose issues and iterate with confidence.

If you’re looking for a practical playbook, here’s mine: clarify the user intent and success criteria; design the tools the agent can call; ground with authoritative data; write prompts that constrain scope and define termination conditions; add reflection and automated evaluations; and ship behind feature flags for safe, staged rollout. Each step compounds reliability without killing velocity.

The diagram and the video above bring these patterns to life. If you watch closely, you’ll see the same loop—plan, retrieve, act, evaluate—show up in every effective implementation, regardless of domain. That repetition isn’t accidental; it’s the backbone of agentic architecture and a blueprint you can adapt to your own stack.

Ultimately, what matters is outcomes. When we build around agentic AI, we create systems that are explainable to stakeholders, maintainable by engineers, and genuinely helpful to customers. That’s how we move past hype to durable impact—shipping AI products that plan, learn, and execute at scale.

Inspired by this post on Product School.

March 16, 2026
How We Automated 81% of Customer Support with AI—While Uplifting CX, Speed, and ROI

Running the Support function for a company that builds a leading agent- and AI-forward customer service platform has been, for me, unique, exciting, and, yes, daunting. It’s where product ambition meets operational reality, and where every decision I make is immediately tested by customers who expect excellence.

It’s unique because we use the same technology as our customers. We live in the product every day, which puts us in a privileged position to be the voice of the customer across the organization. That tight feedback loop has shaped how I prioritize, what I build next, and how I measure success.

It’s exciting because we get to try all of the new features and capabilities of Fin and the Intercom helpdesk. With a relentless focus on AI innovation, I’ve had access to remarkable tools that help us deliver an incredible customer experience—and I’ve seen firsthand how the right workflows and guardrails turn those tools into outcomes.

And it’s daunting because expectations for our own Customer Support (CS) team are sky high. If we can’t deliver incredible support using our own technology, we undermine its value proposition. That imperative has kept me honest, focused, and fast.

In our new research, “The 2026 Customer Service Transformation Report,” we’ve been sharing how forward-looking teams use AI to transform their support models. If you’d like to get straight to the report, download it here.

When Intercom changed its focus in late 2022 to prioritize the customer service use case, we undertook a critical review of the support experience we were delivering and committed to driving meaningful change under an AI-first framework. That was a turning point: I aligned product strategy and operations around a single north star—automate with quality, and elevate humans to higher-value work.

Three years on, Fin now resolves over 81% of all our customer support volume, delivering immediate and high-quality resolutions. We have absorbed a 300%+ increase in customer demand since 2022 without proportional headcount growth. Without Fin, we would have needed at least 100 additional CS team members to meet that demand and our improved service levels, a net saving to Intercom of $7.5M to $9M annually.

Throughout this work, we drew on research from the 2026 Customer Service Transformation Report and applied the lessons directly to our own org design, knowledge management, and AI workflows. What follows is our story of transformation and how we achieved a mature deployment of Fin.

The problems we set out to solve

Back in 2022, our challenges looked familiar to any modern support organization, and I knew we needed a step-change—not incremental tweaks.

We faced increased support demand from new and existing customers: Intercom was launching major features and changes at speed, driving up overall customer conversation volume and requiring additional headcount for the CS team. I could see we were scaling people faster than processes—unsustainable without automation.

Our support policy (as defined by our service level objectives) did not set a high bar: for most customers, we committed only to “business hours” coverage, which hurt first response times. Even with SLOs that fell short of best in class, we were struggling to meet our commitments. I wanted 24/7 coverage and faster first responses without sacrificing quality.

We wanted to do more: As we pivoted our strategy, we wanted to open new routes to our support team, such as providing support to website visitors with technical questions and to trial customers. That meant meeting customers earlier in their journey with accurate, on-brand responses—at scale.

What we did

We made a very conscious decision to become our own best reference customer. As Intercom embraced the opportunity that generative AI presented to transform customer service, we intentionally moved to an AI-first strategy for our Customer Support team. I set a simple operating principle: ship value quickly, measure relentlessly, and let evidence guide the next bet.

We started with the highest-volume, informational queries and saw our resolution rates climb quickly. With that foundation in place, we pushed Fin further, training it on deeper documentation and internal procedures, and eventually giving it the ability to take actions on behalf of customers. As Fin took on more complex work, our results started to compound—and trust in the system grew across the organization.

Early adoption and building trust. When “AI Assist” features came to the Intercom Inbox, the CS team got early exposure to AI and were empowered to provide feedback directly to our product teams. This built awareness and trust across the team about what we were trying to achieve with AI, and helped shape the product roadmap. We were also the first beta customer for Fin, rolling it out to a subset of customers to watch sentiment and outcomes closely. With no adverse reaction and an initial resolution rate of over 25%, we deployed Fin to most customer segments within weeks. I’ll never forget the first week we put Fin in front of real customers—the silence of issues that never reached humans was the loudest signal of success.

Knowledge management as a product. We recognized quickly that time spent tuning our help center and knowledge assets for Fin would pay dividends. We transitioned our Help Center Manager into a “Knowledge Manager,” with a dedicated remit to optimize content for Fin. We embedded knowledge creation into our “New Product Introduction” (NPI) process, with a target that Fin resolve at least 50% of customer issues at every new product and feature launch. Over time, we added new sources, including “Developer Documents,” enabling Fin to handle increasingly complex issues. We built a culture of continuous improvement, allocating “out of the inbox” time so every teammate could close content gaps and raise the bar.

Conversation design end-to-end. To ensure a consistent, high-quality customer experience, we created a new “Conversation Designer” role that owns the journey across automation and human handoffs. Using Intercom’s Workflows, we introduced “skills-based routing” so that when a customer asks for a human, the conversation quickly reaches someone with the right expertise. That routing is now handled directly by Fin, using a feature called “Attributes.” The result: a seamless, on-brand experience regardless of channel or escalation path.
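For readers who want a concrete picture of skills-based routing, here is a minimal sketch of the general idea. It is not how Workflows or Attributes are implemented; the `Teammate` class, the `route_conversation` function, the skill labels, and the "least-loaded qualified teammate" rule are all hypothetical names and assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Teammate:
    # Hypothetical model of a support teammate; not an Intercom data type.
    name: str
    skills: set[str]
    open_conversations: int = 0


def route_conversation(required_skill: str, teammates: list[Teammate]) -> Optional[Teammate]:
    """Pick the least-loaded teammate who has the required skill.

    Returns None when nobody qualifies, so the caller can choose a fallback
    (for example, a general queue).
    """
    qualified = [t for t in teammates if required_skill in t.skills]
    if not qualified:
        return None
    chosen = min(qualified, key=lambda t: t.open_conversations)
    chosen.open_conversations += 1  # record the new assignment
    return chosen


team = [
    Teammate("Ana", {"billing", "api"}, open_conversations=3),
    Teammate("Ben", {"api"}, open_conversations=1),
    Teammate("Cy", {"billing"}, open_conversations=0),
]

assigned = route_conversation("api", team)
print(assigned.name if assigned else "no qualified teammate")
```

The design choice worth noting is the explicit `None` for unmatched skills: a handoff that silently lands with the wrong person is usually worse than one that visibly falls back to a general queue.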


Organization changes that unlocked leverage. As we scaled Fin, we stood up a dedicated AI Support team under a senior CS leader to continuously optimize automation and define our AI adoption strategy across the journey. We restructured human roles into “Technical Support Specialist” and “Technical Support Engineer” to better align with the complexity of incoming work. We also expanded Support Operations to focus on optimization—using AI to uplevel Enablement, Workforce Management, QA, Process Management, and Data Insights. Just as important, we reset expectations about the balance between time spent supporting customers directly versus improving AI. That mindset shift created compounding returns.

Pushing Fin further with new capabilities. As capabilities matured, we were early adopters and saw measurable wins:

Fin Guidance: Multiple Guidance rules provide additional controls and a more personalized, targeted experience for customers.

Fin Tasks and Procedures: Enables Fin to carry out activities such as updating customers on incident status and deep troubleshooting for technical issues.

Insights: AI-driven dashboards provide deep insight into Fin’s performance and surface recommendations for further optimization. Insights also provides a Customer Experience (CX) Score for every customer interaction, enabling more targeted improvement efforts and opening up new ways to close the loop with customers who have had a poor experience.

What we achieved

What started as a focused effort to improve our customer support experience became the strongest proof point for what’s possible when you fully embrace AI. Fin now resolves over 81% of all our customer support volume and has allowed us to absorb a 300%+ increase in demand without proportional headcount growth. Over 90% of our customers now benefit from improved first response performance, 24/7 coverage, and outbound phone support.
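A back-of-the-envelope sketch shows how those two figures can fit together. It assumes, purely for illustration, that a 300% increase means four times the baseline volume and that the 81% resolution rate applies evenly across all of it; the baseline of 1,000 conversations and the function name `human_handled_volume` are hypothetical, not figures or tooling from our reporting.

```python
def human_handled_volume(baseline: float, demand_increase: float, ai_resolution_rate: float) -> float:
    """Conversations left for humans after demand grows and the AI resolves its share.

    demand_increase is fractional (3.0 means +300%); ai_resolution_rate is 0..1.
    """
    total_volume = baseline * (1 + demand_increase)
    return total_volume * (1 - ai_resolution_rate)


# Hypothetical baseline of 1,000 conversations, using the lower bounds quoted above.
remaining = human_handled_volume(baseline=1000, demand_increase=3.0, ai_resolution_rate=0.81)
print(round(remaining))  # 760
```

Under these assumptions, the volume reaching humans ends up below the original baseline even though total demand has quadrupled, which is consistent with absorbing the growth without proportional headcount.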

What the numbers don’t fully capture is the shift in how our team operates. With volume absorbed by Fin, our CS teammates now deliver consultative support: guiding next best actions, deepening product adoption, and contributing directly to retention and expansion. Customers who receive these engagements adopt Fin at a much deeper level and achieve greater support success. What was once a reactive, volume-driven team is now a function that generates significant revenue.

What’s next

Customer expectations are always rising, so we’re building on our progress by embracing the Fin Flywheel—an actionable framework for ongoing improvement and optimization. This keeps us honest about the discipline required to sustain AI performance at scale.

Train: Teach Fin to resolve even the most complex queries with Procedures, knowledge, and policies.

Test: Run fully simulated customer conversations from start to finish to see exactly how Fin will behave before going live.

Deploy: Set Fin live across every channel (voice, email, chat, and social) for consistent support wherever customers reach out.

Analyze: Use AI-powered Insights to analyze and improve Fin’s performance and deliver better customer experiences.

We are also investing in our support teammates so they can adjust to the new world of AI—taking on more complex work and being valued for the subject matter expertise, consultative engagement, and empathy they bring to the role. That human layer is where differentiation shines.

We will continue to develop and share best practices for deploying an Agent, based on our own experience with Fin and the lessons learned from our most forward-looking customers. These are captured and continually evolving in The Agent Blueprint.

Transformation takes commitment

The most successful teams aren’t bolting AI onto old processes; they’re rebuilding support around it—investing in knowledge and people alongside technology, and treating AI as a continuous discipline rather than a one-time deployment. That’s the real change required. For support teams willing to make it, there’s a rare opportunity to redefine what customer service can deliver—higher CSAT, faster resolution, and durable ROI.

Inspired by this post on The Intercom Blog.

March 13, 2026