Default prompts are quietly sabotaging agent retention. I learned this the hard way while reviewing early funnels for our voice and chat agents—engagement looked great at the greeting, but the moment the agent stopped after a single reply, the conversation flatlined. The fix wasn’t a fancy LLM trick; it was a disciplined second message and a rigorous audit of defaults across every entry point.
When an AI agent opens with a generic, low-friction greeting and then waits, users hesitate. Cognitive load rises, intent stays fuzzy, and drop-off follows. A thoughtful second message—delivered quickly, with clarity and options—reduces ambiguity and gives people a low-effort path to progress. It’s a small behavioral nudge that pays off in outsized retention gains.
Here’s the pattern that consistently works for me. First, keep the initial default prompt short, confident, and specific to the channel and task domain. Then ship a fast follow-up if the user hesitates for a few seconds. That second message should clarify what the agent can do, present 2–3 concrete choices, and invite free-form input. I’ve repeatedly seen this simple sequence unlock a 2–3x retention lift in early sessions, especially for first-time users.
Auditing default prompts is where the leverage lives. I inventory every ingress—web widget, IVR, SMS, in-app, help center—and catalogue the exact default system, developer, and user-facing prompts. Then I inspect turn-1 and turn-2 transcripts in Agent Analytics to quantify where users stall: time-to-first-intent, clarification rate, option selection rate, and completion. This makes the drop-off visible and turns “vibes” into data we can A/B test.
Designing the second message is a conversation design exercise, not a copy tweak. My recipe: empathize with the user’s likely uncertainty, constrain scope so the agent appears capable, and apply choice architecture. For voice AI agents, I keep it shorter, use confirmation questions, and bias toward read-back for accuracy. For chat, I include tappable options and examples that mirror top intents. The goal is momentum without feeling pushy.
Operationally, I run controlled A/B tests on default and second-message variants, sized to a realistic minimum detectable effect. I segment by source (ad, organic, support), device, and use case, because the winning prompt for sales qualification rarely matches the one for customer support. With proper instrumentation in our analytics stack, we track retention curves over the first 3–5 sessions, not just single-session reply rates, to avoid optimizing for chatter over outcomes.
Strong prompt engineering underpins the experience. I keep system prompts stable and explicit about persona, tone, and refusal behavior; manage the context window so examples don’t drown live intent; and use a retrieval-first pipeline when domain knowledge matters. The most expensive mistake I see is shipping defaults like “How can I help you?” without guardrails or examples—great for demos, bad for real users.
If you’re starting fresh, begin with a prompt audit this week: list all defaults, map them to top intents, and pair each with a channel-appropriate second message. Instrument the funnel, launch two variants, and set a crisp success metric (e.g., turn-2 continuation rate to task start, then task completion). This is one of those rare changes that is simple to ship and compounds across onboarding, activation, and long-term retention.
The takeaway is straightforward: don’t let your best work stall after the first reply. A disciplined second message and a focused default prompt audit will lift engagement, reduce ambiguity, and create the kind of early momentum that sustains retention over time.
Inspired by this post on Amplitude – Perspectives.
Old-school, in-person selling is having a renaissance in the AI era, and I’ve seen why up close. From leading product and go-to-market teams through hypergrowth, I keep returning to one lesson: enterprise buyers still reward the teams who show up, orchestrate change management, and own outcomes end-to-end. The tech has changed; the human dynamics haven’t.
Has the sales playbook changed in the AI era? The tools are faster and the surface area is bigger, but the core motion remains the same: “showing up” beats letting the marketplace decide. That’s why in-person enterprise rollouts still beat product-led motions, especially when the stakes include security, governance, and cross-functional adoption. You win by reducing organizational risk, not by assuming free trials will do the heavy lifting.
Great enterprise sellers collapse silos. They sell to engineers and executives in one motion, pairing deeply technical validation with crisp business narratives. In my org, that means every high-velocity pilot has a dual thread: hands-on, eval-driven proof for the builders and a value architecture for the budget owners. When those motions run in parallel, time-to-value plummets and procurement friction fades.
Selling to AI-native buyers who grew up on ChatGPT changes tempo, not fundamentals. The same seller, different tempo: 8 weeks vs. 8 business days. These buyers evaluate fast, expect clear ROI, and push for automation-first workflows. How AI-native buyers handle build vs. buy decisions comes down to build for differentiation and buy for acceleration. If you make procurement feel like product—frictionless, instrumented, and transparent—you’ll meet their bar.
Process matters, but humanity wins. Building a robust sales process that still leaves room for unscripted moments is where trust is formed. I’ll never forget the story of the rep who taught a champion’s son guitar over Zoom—an unscripted moment that cemented a partnership. The lesson: raise the floor without capping the ceiling. Equip every rep with repeatable plays, then celebrate the creative instincts that make champions out of customers.
In early GTM, why the three highest-leverage early sales hires aren’t sellers at all resonates with my experience. I prioritize a solutions engineer who can de-risk integration, a forward-deployed operator who can run the first rollout like a product manager, and a customer success lead who designs adoption paths from day zero. Together, they compress the value journey from proof to production.
Compensation design shapes your talent market. The case for outsized commission accelerators for star sellers — and the kind of person they attract is real: magnets for competitors who close complex, multi-threaded deals and thrive with ownership. But beware: why too much process narrows the kind of seller you attract. Over-script it and you filter out the very people who can navigate ambiguity with customers.
Under the hood, instrumenting the funnel from stage zero to close keeps the system honest. I track intent signals before pipeline, conversion by persona and use case, proof milestones, and time-to-value in production. The three pillars of GTM excellence for me are repeatable discovery, referenceable outcomes, and relentless enablement. And inside the leadership team, building peers who are 80% aligned, not 100% preserves healthy tension while keeping execution fast.
AI is expanding the definition of enablement—whether AI is changing what good enablement looks like isn’t a theoretical question anymore. I see world-class teams arming reps with retrieval-first knowledge bases, sandbox environments, and objection libraries that evolve weekly. Meanwhile, selling against direct and implied competitors at once is the norm: your battlecard must cover “do nothing,” internal tools, adjacent categories, and new AI entrants—while you still remember why in-person enterprise rollouts still beat product-led motions for durable adoption.
Planning horizons tighten in AI markets. How far out should a GTM leader be planning? I work a dual cadence: a rolling 6-week operating plan that’s ruthlessly tactical and a 2–3 quarter roadmap for coverage, enablement, and category storytelling. What a normal week looks like in hypergrowth blends customer time, pipeline triage, onboarding and enablement, deal engineering, and process tuning—always with one or two high-conviction bets that could bend the curve.
If you’re scaling an AI product today, pair a disciplined sales-led growth engine with the best of product-led growth: fast paths to proof, hands-on validation for builders, executive-level value mapping, and human moments that turn customers into advocates. That’s how you compress an eight-week cycle into five business days—and keep the expansion flywheel spinning.
I measure product health by a simple equation: speed plus clarity equals trust. That’s why I prioritize Core Web Vitals and search performance together—because the fastest path to better UX and higher rankings is a closed loop between measurement, diagnosis, and action. Standardizing on Amplitude’s Global Agent with Amplitude AI Agents let my teams compress that loop from weeks to hours, and in many cases, to minutes.
Learn how to track your web vitals and page rankings faster with Amplitude AI Agents and improve your site’s user experience and SEO rankings. That goal sounds ambitious, but with the right instrumentation and analytics workflow, it becomes a repeatable operating rhythm rather than a one-off project.
Here’s what changed for us with Amplitude’s Global Agent: a single, consistent way to capture performance signals across pages and journeys, unified context for every session, and a lightweight footprint that doesn’t get in the way of speed. By centralizing measurement, we eliminated blind spots and gave product, growth, and engineering one shared truth for Core Web Vitals and behavioral analytics.
My practical playbook is straightforward: 1) Establish a performance baseline for Core Web Vitals on key templates and critical user paths. 2) Segment results by device, location, acquisition channel, and content type to surface where users actually feel the friction. 3) Connect those vitals to downstream behaviors—scroll depth, engagement, and conversion—so we prioritize fixes that move business outcomes, not just lab scores. 4) Use feature flags and A/B testing to ship improvements safely and quantify uplift. 5) Close the loop with Agent Analytics to keep learnings visible and actionable.
Operationally, we rely on anomaly detection to flag regressions early, CI/CD guardrails to prevent performance slips at deploy time, and observability plus session replay to accelerate root-cause analysis. This combination reduces mean time to resolution, protects page experience during fast iteration cycles, and helps us avoid trading UX for speed—or vice versa.
The strategic benefit is compounding: better Core Web Vitals improve user perception and increase engagement, which strengthens SEO signals and, ultimately, page rankings. With a unified analytics platform in place, we can spotlight the few improvements that create outsized gains, then scale those patterns across the site with confidence.
If your roadmap includes faster pages, stronger rankings, and happier users, align your teams around this simple loop: measure precisely, diagnose quickly, experiment safely, and learn continuously. Amplitude’s Global Agent and Amplitude AI Agents give you the instrumentation and insight to make that loop your competitive advantage.
Inspired by this post on Amplitude – Best Practices.
Most mornings I wake up to a to-do list that’s already been updated—because my always-on team of agentic AI assistants has been working while I sleep. I rely on Claude to orchestrate these agents so routine prep, follow-ups, and retrospectives never slip through the cracks.
When a podcast recording hits my calendar, my podcast-manager agent (powered by Claude) automatically creates a podcast-interview-prep task with a concise summary of who I’m interviewing and what they are building. It also creates a transcript review document with the correct share settings. After the recording, it adds a task to my to-do list to share the transcript with the podcast participants.
For sales, my sales-admin agent (also powered by Claude) prepares a sales-meeting-prep task with notes on who I’m meeting with, where they are in the sales process, and what I need to move the deal forward. After the call, it generates clear next-step tasks so momentum doesn’t stall.
Every week, my coding-manager agent (still powered by Claude) compiles a report from my prior week’s coding sessions and offers targeted tips. It flags recurring mistakes or dead ends, shows how to avoid them, and suggests ways to work better with Claude. It’s the retrospective I never skip.
In this walkthrough, I’ll explain how I get Claude to complete tasks for me while I’m away from the computer—and how I designed the system to balance power, safety, and cost control.
I first explored this approach after seeing the rapid growth of OpenClaw. OpenClaw is an open-source "agent harness" that lets you configure personalized agents to act on your behalf. It’s incredibly promising, but the early wave of enthusiasm also revealed pitfalls: complex safety configuration, overly broad machine access (browser, terminal, files, credentials), third-party skills of varying quality, and surprise usage bills.
After hearing one too many horror stories about wasted hours and unexpected charges, I set out to design a safer, more predictable way to capture the benefits of OpenClaw while managing risk and spend. That’s what led to my current agent setup.
For transparency: I’m a long-time practitioner and a genuine fan of Claude Code. I have not received any compensation from Anthropic for writing about my approach. If that ever changes, I will disclose it—both because it’s required by the FTC in the U.S. and because it’s simply the right thing to do.
An Overview of How My Agent Team Works
Today, I run three specialized agents: a podcast manager, a sales admin, and a coding manager. As I invest more, I expect this team to grow—because the pattern scales cleanly across use cases.
This system runs on four core components that keep everything reliable, auditable, and cost-aware.
First, agent identity. I use a simple but powerful convention: an identity markdown file that tells the agent who it is, where its task folder lives, and provides context for the types of tasks it will do. This keeps scope tight and intent explicit—critical for safety and predictable automation.
Second, the scheduler. I’m using MacOS’s built-in scheduler (via LaunchAgents). This is like cron, but runs with all your user permissions on Mac. That means I can run all of this under my Claude Code Max subscription or my ChatGPT/Codex subscription. The result is a dependable heartbeat for my AI workflows without relying on fragile cloud glue.
Third, tasks. Each agent owns a dedicated folder of tasks. A task is a markdown file with frontmatter. That structure makes work items easy to create, parse, review, and version—perfect for repeatable automation with a human-in-the-loop safety net.
Fourth, scripts. Each agent has its own scripts folder with utilities it can call on demand or that run on a schedule. These scripts are small, composable, and transparent—so I can evolve capabilities without ballooning risk or complexity.
Agent identity, tasks, and scripts are saved in Obsidian—not Claude Code skills or agents. The scheduler runs on my always-on Mac Mini. The benefit of this is it just works across all of my devices and I can seamlessly switch between Claude Code, Codex—or any other coding CLI—as I need to. All it takes is updating my script that the scheduler uses.
In practice, this architecture delivers exactly what I want from agentic AI: clarity of responsibility, strong guardrails, and outcomes that compound. My podcast manager keeps interviews buttoned up, my sales admin removes administrative drag, and my coding manager turns lessons learned into steady skill gains—all while I focus on higher-leverage product management work.
If you’re considering a similar setup, start with a single agent and a narrow task, then expand. Keep identities crisp, scripts small, and schedules explicit. With that foundation, you’ll get the benefits of automation and delegation—without surrendering control.
Pricing looks deceptively simple from the outside; inside, it’s anything but. Over the years, I’ve learned that every price tag is really a strategic statement about value, priorities, and the future we’re building toward.
At Fin, pricing and packaging (P&P) is more than a finishing touch. It’s a research problem, a forecasting challenge, a commercial decision, and ultimately, a strategic statement, requiring deep cross-functional work. We must balance the needs and wants of our customers, the value delivered by our product, and the broader vision we are building towards.
Our approach keeps evolving as our product and market mature. I treat it as a living system—continuously informed by research, GTM learning, and customer behavior, never "set and forget."
Here’s how I run the process in practice, especially when we launch something new that needs to be monetized, like Fin, our AI Agent. The work moves from qualitative discovery to quantitative validation to commercial modeling, with tight partnership across product, research, data science, finance, GTM, and engineering.
Step 1: Foundational research
I start by talking to buyers to understand their mental models of value. How do they define ROI? What pricing models do they expect in this category? What feels intuitive, and what feels off? This early discovery shapes two crucial choices: the pricing model and the pricing metric.
The pricing model is the overall structure; value-based, usage-based, access-based, fixed fee, and so on. With Fin, we chose a value-based model: you only pay when Fin delivers value. Our research clearly showed that buyers don’t want to pay for usage, they want to pay for results.
The pricing metric is the unit of value within that model, the unit we anchor pricing to. For Fin, the pricing metric is “outcomes.” An outcome is defined by Fin successfully handling a customer service query.
Small definitional changes can dramatically alter how customers perceive value, so I obsess over details. Buyers rarely hand us the “right” model; they reveal how they evaluate value, and I translate that into a model and metric that align with their goals and expectations.
Throughout, I loop in execs, finance, GTM, and engineering to ensure alignment before proceeding. Pricing choices cut across the business; they can’t be made in isolation.
Step 2: Willingness to pay
Once we have a model and metric, I quantify what the market will bear. This is where rigorous willingness-to-pay (WTP) research comes in, grounded in the language we validated through the qualitative work.
Here’s the kind of framing I use in surveys to keep things concrete and consistent with our model and metric:
You would only pay when Fin delivers an outcome (→ the model). An outcome is counted when the AI Agent resolves a customer query with no further help needed (→ the metric). Would you be willing to pay $X per outcome for Fin?
The foundational qual is so important as a first step. It helps us decide what we should be asking about before we start asking how much people will pay. Without the qual ground work, you risk building a very convincing answer to the wrong question.
The goal isn’t to find a perfect price. That doesn’t exist. The goal is to ground our discussions in the reality of the market.
I use methods like Gabor-Granger and Van Westendorp to understand WTP and to shape a demand curve that informs strategy, not just a single number.
This chart shows us what percentage of the market is willing to buy the product at various price points. The demand curve shows that 69% of buyers were willing to pay for the product at $0.86 per outcome, whereas only 39% were willing to pay at $1.42.
The dashed line shows the price point at which revenue for the business would be maximized (by multiplying adoption by the dollar amount).
This allows us to debate knotty questions like: What’s the right balance between growth and revenue? How sensitive is demand to price changes? At what price do we start losing the market? If we wanted to increase adoption, would lowering our prices by $X make a meaningful difference?
Those conversations help me weigh customer value and business outcomes side by side. At this stage, decisions feel more tangible, but I don’t finalize a price until I’ve modeled the operational realities.
Step 3: Modeling
By now I have a validated model, a clear metric, and a strong WTP signal. Next I translate theory into a commercially workable plan—this is where data science and finance are indispensable.
I start with a list price aligned to our strategy and commercial goals. Then I adjust for likely discounting to estimate realized price. Next, I analyze beta usage to project outcomes per customer by segment and derive average ARR. I combine usage projections with WTP to model attach rates across conservative-to-optimistic scenarios. Finally, I connect the dots in our long-range plan—logos, ARR, margins—iterating until the numbers and narrative cohere.
The modeling step is important because willingness-to-pay data is somewhat theoretical. It reflects intent, not behavior. Modeling helps us bridge that gap.
The goal of this step is to land on a price point recommendation, alongside forecasts for ARR and adoption. It allows us to understand the real business impact of the decisions we’re making.
Alongside all of this, we need to ensure any decision we make falls in line with our pricing principles and broader business objectives.
Step 4: Sign-off and execution
With the analysis complete, I consolidate everything into a clear P&P recommendation for executive approval. Once approved, the real work begins: enabling sales, communicating changes to customers, instrumenting ROI proof points, and monitoring performance so we can learn and iterate.
Do we run the full process every time?
Not always. This is the ideal process, and I apply it end-to-end for the most material decisions. In reality, time and resource constraints require judgment; rigor should mirror impact. When uncertainty crops up midstream, I run scrappier, targeted research rather than forcing a linear path.
The ongoing challenge
As Fin’s breadth has expanded, our pricing system has had to evolve, too. For a while, modular pricing worked well—each product had its own logic tied to a crisp outcome. As we add more products, more Agent capabilities, and more outcomes, the question shifts from “what is the right P&P for this one product?” to “how does everything fit together into a coherent pricing system?”
We must recognize that pricing isn’t something you set once and leave alone. As products evolve, especially in a world where AI is rapidly changing how value is created and delivered, it’s important to regularly step back and review the bigger picture, not just the component parts.
For example, outcome-based pricing has served us well, particularly when our products were tightly tied to clear, measurable outcomes. But as our products become more varied, and as we continue building toward a broader platform, it becomes less straightforward to apply a single model cleanly everywhere.
The challenge becomes less about replacing one model with another, and more about continually looking up and asking: what pricing philosophy best reflects the value we’re delivering today? And how do we deliver that philosophy in a way that still feels right for customers?
In short, there is no finish line, pricing is never “done” – and that’s exactly how it should be.
Why this work matters
Pricing and packaging is often noticeable only when it goes wrong. A confusing model, a bad metric, or a price that feels disconnected from value. And we hear about those quickly.
When pricing is done well, it becomes nearly invisible—but it still does a lot of work. It shapes how people perceive value, clarifies what they’re paying for, and makes the product easier to sell, easier to buy, and easier to scale. Most importantly, it forces us to be honest about what the product is really worth. That’s why I take it so seriously—and why I treat pricing as a product in its own right.
I’m getting sharper, more specific questions about scale from enterprise customers every quarter, and that’s exactly how it should be. Teams want to know how our platform behaves during their highest-volume moments — the Black Friday sales, the sporting events, the production incidents — and they want confidence their growth won’t outpace the systems they depend on.
We welcome those questions. They’re the right ones to ask of any critical component of your business.
Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. During our busiest days of the week, we handle over five million conversations and more than 100 million comments being added across the platform.
We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking with hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that would map to millions of conversations from a single customer.
While those numbers matter, they age quickly. Every growing software company can publish a bigger number every year, month, week. What ultimately matters is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it.
Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them.
Here’s how we do that in practice.
We build on boring foundations because at the edges, we try hard not to be clever. We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.
“That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.”
This extends a principle I’ve embraced for years: run less software. The point isn’t to minimize the stack for its own sake; it’s to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together. Boring technology choices aren’t a lack of ambition — they reserve our ambition for the nuanced scaling challenges that matter.
The source of truth is the hard part. You can scale stateless web traffic by adding machines, add queue consumers, and add cache. Those are real problems — just not the hardest ones. The source-of-truth database is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.
That is why we moved to Vitess, managed by PlanetScale. The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers. When we first laid out this direction, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.
Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed.
We have not come close to needing that, which is significant. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it. Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk.
Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.
Search is not a hidden bottleneck for us. Search underpins core product surfaces across the platform — from vector search in our AI features to realtime reporting — and if it’s slow or unhealthy, customers feel it. Scaling isn’t just adding more machines; often the better approach is making the product do less unnecessary work. Today, our Elasticsearch clusters support a much higher-throughput product than in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We’re serving a larger product surface more efficiently, not just running a bigger cluster.
More importantly, when an index gets too large or traffic distribution turns unhealthy, we don’t want a high-risk, manual migration. We reshape Elasticsearch indexes online by partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and deleting the old index only when we’re confident. We’ve used this pattern for years to make large search migrations safer and more incremental — a core playbook in our platform scalability and SRE practices.
Fairness is non-negotiable in a multi-tenant system. A single customer’s high-volume moment should not quietly become everyone else’s latency problem. We design for this at multiple layers. For asynchronous work, we use overflow queues and queueing strategies that prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They’re designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.
We also build application-level guardrails so customer isolation doesn’t depend on every engineer remembering every rule in every code path. In a large multi-tenant Rails application, the safe path must be built into the system. The focus is primarily about correctness and customer data separation, but the broader operating principle is the same: important customer boundaries should be enforced by infrastructure and application frameworks.
The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it’s actually needed.
Fin adds a new dimension to scaling. Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work. The details differ from traditional SaaS infrastructure, but the principle is the same: understand the bottlenecks, build clear scaling levers, and monitor the customer outcome. AI providers are not commodity storage systems, and we do not design as if they are.
That is why we have invested in Fin-specific reliability systems. Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point. For enterprise customers, this matters because AI support volume can spike just like human support volume — and the AI layer must absorb that spike without relying on one fragile upstream path.
When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.
Performance tests help, but production traffic is reality. Real customers use products in ways no synthetic test will perfectly predict: launches, incidents, seasonal patterns, gaming events, sudden changes in end-user behavior. Those moments give us data that no lab can fully reproduce. Often, a large customer event barely moves the platform-wide graphs because our customer base is broad enough that one industry’s peak aligns with another’s quiet period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.
“That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.”
Sometimes, these events teach us something specific. In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed. The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.
We treat large customer events this way even when there’s no broad customer impact. They’re opportunities to understand real scaling properties and make the next event safer — a habit that anchors our incident management, observability, and FinOps practices.
Scale is also an operating model. The infrastructure matters, but it’s not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.
“That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.”
Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through infrastructure noise and answer the question that matters most during an incident: are customers able to use the product successfully?
This shapes how we ship. Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. That isn’t a vanity metric — it’s part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers. These are the very DORA metrics we hold ourselves to in order to balance CI/CD speed with stability.
“Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.”
The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence. That’s how scale stays safe over time.
Scheduled maintenance should be extraordinary. Historically, database maintenance was a main reason for maintenance windows: upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime. With the move to Vitess and PlanetScale, we changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime — and we do this in practice, not just as a goal.
This matters because customers rely on our platform for live operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.
“Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.”
Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact. That’s the practical benefit of the architecture work: better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.
What this means for customers is simple: be skeptical of vague scale claims. The question isn’t whether a vendor says they can scale — it’s whether they can explain how, where the limits are, what they measure, how they recover, and what they’ve changed after learning from production. We understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer. Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system — and we use that data to keep improving. That’s what you should expect from a platform you depend on during your busiest moments.
I just finished a standout conversation on AI engineering and product discovery that hit squarely at the questions I hear from product leaders every week: What does practical AI engineering actually look like for product managers, and how do we ramp without a traditional software background?
Listen to this episode on: Spotify | Apple Podcasts
Here’s the arc that resonated with me: a product leader goes from occasional tinkerer to spending 60% of her time on real engineering work—building AI-powered tools for continuous discovery, forming a licensing partnership with Vistaly, and quietly constructing "Teresa Bot," an AI discovery coach trained on everything she’s ever written. The journey is less about mastering every framework up front and more about structuring learning, tightening feedback loops, and shipping useful outcomes.
The most energizing throughline is the myth-busting: you don’t need a deep engineering pedigree to operate in this space. Curiosity, rigorous discovery habits, and eval-driven development will take you further than brute-force coding. As one moment put beautifully, "I know anything that I don't know how to do, Claude will teach me how to do. And Claude is infinitely patient." That captures the posture I expect modern PMs to adopt with LLMs and tools like Claude Code.
On the nuts and bolts, the discussion gets concrete about AI engineering in practice: context engineering, prompt writing, RAG, observability, and evals. This is the real stack—think retrieval-first pipeline design, prompt engineering guardrails, instrumentation for model drift, and continuous, automated evals to protect behavior as you iterate. If you’ve been dabbling with context window management but haven’t formalized your test harnesses or dashboards, this is your cue.
What I appreciated most is how directly discovery skills transfer. Framing assumptions, running tight customer interviews, mapping opportunity solution trees, and aligning stakeholders—these are precisely the muscles you need to shape problem spaces before you “vibe code” solutions. As one reflection nails it, "The moment I learned more about data science, all of my discovery work became so different." That’s the bridge from qualitative sense-making to measurable, model-centered learning.
The partnership with Vistaly is also a smart build vs buy case study. Rather than reinvent infrastructure, the choice to license purpose-built opportunity solution tree software keeps focus on the differentiated layer—learning systems and product outcomes. As it’s put plainly: "I don't want to build all that stuff. I don't really want to be a software company. I'm almost set up like an AI researcher." Product leaders should internalize this lens for platform choices across their AI roadmaps.
On "Teresa Bot," the implementation breadcrumbs are familiar and pragmatic: pair a solid retrieval-first pipeline (RAG) with clean content sources, keep prompts modular, enforce code review even for vibe coding, and stand up observability and evals early. I’ve had similar success using Claude Code for rapid iteration while treating every prompt and context change as a versioned artifact. That discipline pays dividends when you need to trace regressions or prove improvements.
If you’re a PM ready to lean in, start small and systematic. Pick one high-signal discovery workflow, model the knowledge you already have, and wire up basic evals before you scale. Keep a lab notebook, use programmatic tests to gate deployments, and measure outcome movement—not just model cleverness. This is where LLMs for product managers move from novelty to execution readiness.
Resources mentioned: Watch the episode on YouTube, Claude Code, Vistaly (opportunity solution tree software), Opportunity Solution Trees: Visualize Your Discovery to Stay Aligned and Drive Outcomes, Product Talk Academy, Just Now Possible Podcast, Vibe Coding Best Practices: Avoid the Doom Loop with Planning and Code Reviews, and the AI Evals for Engineers and PMs course on Maven.
What stood out to you—RAG design choices, eval frameworks, or the discovery-to-engineering mindset shift? Drop your thoughts below; I’d love to learn how you’re applying these patterns in your own product roadmaps.
Too often I watch teams ping a global agent with vague AMAs and then wonder why they get generic summaries instead of decisive guidance. When I lead product reviews, I push the team to treat AI like a partner in decision-making, not a trivia engine. That simple mindset shift transforms how quickly we move from questions to confident action.
AI isn’t built for AMA (ask me anything). Get recommendations for outcome-based questions for the best results with Amplitude AI.
In practice, outcome-based prompting means I don’t ask an agent to “analyze the data.” I ask it to help me reach a specific product decision, grounded in behavioral analytics and connected to our outcomes vs output OKRs. To make that concrete, I always frame my prompts around three things.
First, I state the outcome and metric. I name the business goal and the exact measure in Amplitude analytics that will validate success—activation rate, funnel conversion from A to B, or 8-week retention. I’ll reference the relevant events, segments, or driver trees so the agent has a crisp target. This is where product strategy meets measurement discipline.
Second, I define the context and constraints. I specify the user cohort, the timeframe, and the surface area I care about—new self-serve signups in the last 30 days, first-session behavior on web only, or EU traffic where data governance rules apply. On a unified analytics platform, this context lets an agentic AI narrow its search to the highest-signal slices of behavioral analytics rather than pattern-matching across noise.
Third, I declare the decision and deliverable. I tell the agent exactly what I will do next and the format I need to act: a ranked list of levers for an A/B testing plan, a recommended prompt engineering template for in-app guides, or a one-page brief I can hand to the growth team. Clear decisions lead to clear outputs; vague intents lead to vague answers.
Operationally, I turn these three elements into reusable prompt templates, and I track their performance with Agent Analytics. I review traces to see which inputs drive the best recommendations, and I refine prompts the same way I iterate on product copy. For LLMs for product managers, this is the craft: small, testable improvements that compound into outsized impact.
Here’s a quick example. When I needed to lift user activation, I asked for a prioritized set of friction points blocking first-value within 24 hours for new self-serve accounts, based on last month’s data. I defined activation as completing event X within Y hours, asked the agent to analyze top drop-offs in the funnel, and requested an action plan with two experiment ideas and success thresholds. The response mapped behaviors to interventions, connected to retention analysis, and gave me a prompt engineering snippet for the onboarding nudge we shipped the same week.
If your AI workflow still starts with “What does the data say?”, you’ll keep getting broad narratives. Start with outcomes, sharpen the context, and specify the decision you will make. That’s how Amplitude analytics, paired with agentic AI, stops being interesting and starts being indispensable.
Inspired by this post on Amplitude – Perspectives.
I’m excited to share two opportunities this season to uplevel your craft, connect with peers, and leave with practical, repeatable techniques you can apply immediately to your product work.
We will be doing another round of Claude Code: Show and Tell on May 26th at 9am PDT. These community-driven sessions are hands-on and fast-paced—we swap proven workflows, compare prompts, and pressure-test approaches together. You’ll see how product teams are operationalizing AI workflows in real contexts and walk away with ideas you can adapt for your own roadmap and experimentation pipeline. Invites will go out to Supporting Members and CDH Members tomorrow. If you'd like to join us, keep an eye on your inbox for the invite.
I love these Show & Tell sessions because they translate tacit knowledge into clear, reusable playbooks. Whether you’re refining evaluation loops for LLMs, streamlining discovery synthesis, or standardizing prompts for consistency, the shared rigor and camaraderie make it a high-signal hour for any product leader invested in AI workflows.
I also want to share that I'll be teaching our June 4th – July 9th cohort of Product Discovery Fundamentals. This is the last time I'll be teaching this cohort in its current format. If you've been thinking of enrolling in this program, and want to take it with me, this is your last chance. Register here.
Across this cohort, we’ll practice continuous discovery habits—framing opportunities, tightening assumptions, running lean experiments, and aligning product trios on evidence-backed decisions. If you want a rigorous, repeatable system for turning customer insight into confident prioritization and compelling product strategy, I’d be thrilled to have you in the room.
Churn is a lagging indicator—and by the time I see it in a dashboard, the moment to change a customer’s mind has usually passed. At HighLevel, I’ve learned that durable retention starts long before a cancellation ticket, with product-led growth habits, customer success partnerships, and a clear view of user behavior that flags risk early and often.
Stop chasing SaaS churn after it happens. Learn how proactive product and service experiences, powered by behavioral analytics, help reduce churn before users leave.
My operating model is simple: treat retention as a design problem, not a rescue mission. I anchor our strategy in behavioral analytics and retention analysis, translating leading indicators—activation milestones, time-to-first-value, depth of feature adoption, and expansion intent—into outcomes like Net Recurring Revenue (NRR) and cohort-based retention. When these inputs move in the right direction, churn becomes the exception, not the trend.
To get there, I start with rigorous journey mapping and continuous discovery. We define the exact “aha” moments that signal value realization, instrument events across the funnel, and segment cohorts by persona, plan, and use case. Tools in a unified analytics platform (e.g., Amplitude analytics or Pendo) help us pinpoint where engagement decays, which features predict stickiness, and which friction points block activation. This evidence replaces hunches and lets us prioritize the highest-leverage work.
From those signals, I build a transparent risk score that anyone can use. It blends usage momentum (DAU/WAU), core feature frequency, anomaly detection on key behaviors, billing and payment health, and support sentiment. When the score crosses a threshold, we trigger plays—inside the product and through customer success—so we’re helping users before they drift, not pleading after they’ve left.
On the product side, I favor lightweight, contextual interventions: in-app guides tailored to stalled tasks, checklists that shorten time-to-value, adaptive product tours, and tooltip design that clarifies the next best action. We A/B test these experiences with a clear minimum detectable effect (MDE), watching both local metrics (feature completion, error rate) and global metrics (activation, retention). The goal is precision—right nudge, right user, right moment—without adding cognitive load.
On the service side, we run consultative support and customer success plays keyed to the same behavioral triggers. A sudden drop in core usage may prompt a quick diagnostic call; repeated failed integrations can route to solutions engineering; stalled accounts get value reviews or QBRs focused on outcomes, not feature checklists. Because product and service draw from the same data, customers experience a single, coherent journey.
Proactive retention also depends on smart packaging and pricing. When value metrics mirror how customers win, plan boundaries reinforce the right behaviors and reduce “silent churn” caused by misaligned tiers. Outcome-based pricing and clear upgrade paths can turn potential risk into expansion rather than attrition.
Operationally, I keep a weekly retention review with product trios and customer success leaders. We walk driver trees from inputs (activation, engagement depth, support friction) to outputs (NRR, churn), review session replay where confusion spikes, and commit to small, measurable experiments. This cadence compounds learning and keeps us honest about what’s moving the needle.
If you’re starting fresh, begin with four moves: define an activation milestone tied to value; instrument the few events that prove users are on track; build a basic risk score from those events; and craft three plays—one in-product, one lifecycle message, one success outreach—triggered by that score. You’ll create a flywheel where insights power interventions, and interventions feed better insights.
Churn will always exist, but it doesn’t have to be a cliff. With behavioral analytics guiding both product and service experiences, we can make retention the natural outcome of how we build, communicate, and support—long before a customer ever thinks about leaving.
Inspired by this post on Amplitude – Perspectives.
I focus every day on turning raw customer signals into meaningful product experiences that create measurable outcomes. Human37 is a Brussels-based customer data strategy agency helping organizations turn data into real customer experiences. That statement sets a useful standard for the kind of partner I look for: one that helps us move beyond reports and into shipped value customers can feel.
What matters most to me is the bridge between discovery and delivery—how insights inform product strategy and roadmaps without slowing execution. The strongest partners operationalize behavioral analytics within a unified analytics platform, connect qualitative learning with quantitative evidence, and make journey mapping a living artifact rather than a slide. Tools like Amplitude analytics can accelerate this work, but the real differentiator is the operating model that converts data into decisions and decisions into outcomes.
When I evaluate a customer data strategy partner, I look for five things: rigorous data governance and privacy-by-design; clean event taxonomy and robust identity resolution; clear experimentation workflows that tie to activation and retention analysis; practical enablement for product teams (not just analysts); and a bias for product-led growth rooted in real user behavior. If a partner can’t articulate how insights ladder to user activation and long-term value, they’re not ready to guide the roadmap.
Here’s how I sequence the work to turn signals into experiences: first, define the outcomes that matter and the driver trees behind them; second, instrument events and unify identities to power trustworthy behavioral analytics; third, map critical paths with journey mapping to expose friction and moments of delight; fourth, run focused experiments linked to product strategy, not vanity metrics; finally, scale what works with in-product experiences and lifecycle messaging that compounds retention.
The payoff is speed and clarity: faster time-to-insight, more confident bets, and fewer handoffs between data teams and product builders. If you’re exploring European partners, a Brussels-based agency with a sharp customer data strategy capability can help you move from analysis to action. The litmus test is simple—can they help your team ship experiences that customers notice and your metrics confirm?
Inspired by this post on Amplitude – Perspectives.
There’s a question that runs underneath every AI Agent evaluation: what can it do?
Two years ago, that was the right question to ask because Agents were limited and capability was a genuine constraint. The gap between what organizations needed and what the technology could deliver was wide. I felt that gap acutely in early pilots—plenty of ambition, not enough dependable execution.
That gap has since narrowed considerably, and yet most organizations are running their Agents well below what’s technically possible. I see teams lean on answering and routing, but stop short of looking things up, taking actions, or resolving complex, multi-step problems—especially where data, process variance, or risk come into play.
The standard explanation is that AI isn’t good enough yet—models must improve, or vendors must ship more features. But after studying organizations across industries actively expanding their AI automation, I’ve found that this explanation holds up less often than people assume. The blockers tend to be elsewhere.
The teams I’ve observed weren’t primarily constrained by what their AI could do; they were constrained by what their organization was structured to let it do. In other words, the ceiling wasn’t the Agent’s capability—it was organizational readiness, governance, and risk tolerance.
“Readiness” for AI breaks into five distinct types, and most organizations have some but not all of them. Below is how I assess them with product, operations, and engineering leaders.
Content readiness is whether you can explain your product and policies clearly and consistently. Most companies can. In practice, that means up-to-date knowledge bases, unified policy language, and clear versions that Agents can cite and apply.
Scope readiness is whether you’ve defined the edges: when should AI engage, and when should it step aside? Edge cases multiply, intent varies by customer segment, sensitive topics surface mid-conversation, but most teams can work through this with effort. Clear guardrails reduce ambiguity and shrink risk.
Procedural readiness is where things start to get harder. This is about whether you can articulate your processes clearly enough for something other than a human with years of tacit knowledge to follow. The happy path is rarely the problem. It’s the failure paths, decision branches, variations that have never been written down because they’ve always lived in someone’s head.
Data readiness is the first real cliff. Can you reliably identify the right user, account, or object at the moment a decision needs to be made? Is the data trustworthy in real time? Are the APIs stable, accessible, and actually connected? For most organizations, the honest answer is “partially, but we’re not always sure when it breaks.”
Execution readiness is the highest bar. Not just technically (can the Agent make the change?) but organizationally. Who owns it when the wrong refund gets processed? Who detects it? Who recovers? Does someone with authority actually accept the risk?
Most companies have the first two, some have the third, fewer have the fourth and fifth. When I map this with teams, we often discover that their Agent’s ceiling is really a reflection of operational maturity and data plumbing, not model quality.
We studied companies across six industries – energy, healthcare, ecommerce, gaming, financial services, property management – all trying to expand what their Agents could do. The pattern was consistent: teams set out to automate real actions—looking up account status, processing changes, handling transactions. In most cases, the AI could technically do it, but at a certain point (somewhere between guiding a user through a process and looking something up on their behalf) they hit a wall.
One team tried to automate application changes but couldn’t reliably identify which application to modify across their internal systems. Another explored billing automation but couldn’t access live account data due to regulatory constraints. A third needed to verify status across third-party vendor systems their Agent couldn’t reliably reach. I’ve seen similar constraints surface around CRM integration, data governance, and vendor SLAs—none of which are model issues.
In most cases, the team redesigned around what their infrastructure could support. They moved toward guiding—walking users through processes step by step, rather than executing changes on their behalf. It worked, it resolved conversations and delivered real value, just differently than anyone planned. In customer support, this often looks like consultative flows that shorten time-to-resolution even without direct writes.
Most Agent evaluations are built around capability. Can it handle complex queries? Does it support multiple channels? Can it integrate with our systems? These are reasonable things to evaluate for, but they produce a capability score, and that doesn’t tell you whether your organization can actually use what you’re buying.
The teams that got to deeper automation, the ones executing actions early, didn’t have “better AI,” they had more standardized operations. Actions that were already well-defined, consistently applied, and exposed through stable systems with clear rules. Automation wasn’t inventing new behavior, it was triggering actions that were already tightly controlled elsewhere.
Readiness enables capability, not the other way around. Which reframes the evaluation question from “can the AI do this?” to “are we actually ready for it to?”
Something that gets lost in most conversations about AI readiness is that organizations are often further along than they assume, just not for the kind of work they were planning for. A team that set out to automate refunds but can reliably guide users through complex troubleshooting has genuine capability deployed. They’re operating at the level their readiness supports, which is a starting point, not a deficit.
The more useful frame isn’t “are we ready?” – it’s “what are we ready for, and what specifically stands between here and the next level?” The gaps tend to be concrete: a missing API, data that lives in three systems that don’t agree, a process that’s never been documented, or an ownership question nobody has answered. These are solvable problems. They just require a different kind of investment than buying a more capable Agent.
What nobody has worked through seriously yet is how organizations actually build readiness. Does it develop naturally through using AI at shallower levels first? Or is it mostly a function of prior decisions, like system architecture choices made years ago, operational maturity that accumulated over time, engineering investments that have nothing to do with AI? When readiness does increase, what actually changes? Does the support team develop it? Does engineering grant it? Does it require executive sponsorship and investment in infrastructure with no obvious AI label on it?
In my experience, progress comes from a joint effort: product to define scope and guardrails, operations to codify procedures and edge cases, engineering to harden APIs and observability, and leadership to underwrite risk with clear ownership. When those pieces align, agentic AI moves from guided assistance to safe, auditable execution.
Until there are clearer answers, the pattern is likely to continue. Companies will buy capable Agents, plan ambitious rollouts, and find that the harder work is building the organizational infrastructure. The Agents can do the work. The question is what it takes to let them.