Once I’ve defined the right roles on my team, the next move is to design an operating model that makes progress a habit. My goal is simple: every interaction should strengthen the system so the AI Agent keeps improving over time.
I anchor the team on a mantra that has never failed me: “The first time you answer a question should be the last.” That single statement reframes support as a compounding system rather than a one-off activity.
The ambition is to ensure every resolution makes the next one faster and more accurate, so fewer issues repeat, quality compounds, and support scales naturally. That doesn’t happen by accident—it requires intentional design.
In practice, this comes down to five essentials: clear ownership of performance, guardrails that make iteration fast and safe, feedback loops that turn learning into routine upgrades, content treated as infrastructure, and a culture that celebrates the work of improvement, not just the outcomes. Here’s how I put that into play.
First, I start with clear ownership. Ambiguity is one of the most common reasons AI performance plateaus. When no one truly owns how the AI Agent performs, feedback gets lost, issues linger, and improvements stall.
On high-performing teams, I assign a single owner—often an AI ops lead—responsible for making the AI Agent better. They review resolution trends to spot underperformance, make targeted updates to content, configuration, and behavior, coordinate with product and engineering on systemic blockers, and set improvement priorities, targets, and timelines. The title matters less than the mandate; what matters is clear authority to drive change across teams.
Real-world example: At Dotdigital, AI performance plateaued after a strong start—resolving around 2,800 conversations per month for three consecutive months. To drive resolution rates up, the team created a dedicated support operations specialist role, filled by an experienced agent with deep product knowledge. This person will focus on refining snippets, improving content, and enhancing the AI’s resolution capabilities.
Second, I make iteration fast and safe. As the AI Agent takes on more volume and complexity, change can start to feel risky—so teams hesitate, and performance stalls. Lightweight governance fixes that by making the path from insight to action predictable.
I keep the rules simple and explicit: which changes need review (and which don’t), who the decision-makers are, how we test updates before they go live, where feedback flows so it’s seen and acted on, and when progress gets reviewed on a steady cadence. Governance isn’t bureaucracy—it’s what keeps improvement routine and safe.
Real-world example: Anthropic ran a focused “Fin hackathon” sprint to improve their AI Agent’s resolution rate. The team audited unresolved queries, identified underperforming topics, and created or updated content to close gaps. They converted frequently used macros into AI-usable snippets, monitored Fin’s performance during live support, and continuously refined content based on real interactions. This structured approach enabled rapid improvement while maintaining quality standards.
Third, I build a system that learns by default. AI performance isn’t static, but many organizations treat it like a one-time implementation. The most successful teams operationalize learning: they analyze where the AI Agent struggles and feed those insights directly into structured improvements.
The signals are straightforward: review common handoffs to humans, track unresolved queries by topic or intent, measure resolution rate trends over time, and use those inputs to prioritize fixes and content upgrades. Whether you follow a formal loop like the Fin Flywheel framework or something lighter, the goal is the same—make improvement inevitable.
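To make those signals concrete, here is a minimal sketch of how unresolved conversations could be grouped by topic to rank where to fix content first. The record shape (a `topic` label and a `resolved_by_ai` flag) and the sample data are assumptions for illustration, not an export format from Fin or any specific tool.

```python
from collections import defaultdict


def resolution_by_topic(conversations):
    """Return (topic, total, unresolved, resolution_rate) rows, biggest gaps first."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for convo in conversations:
        totals[convo["topic"]] += 1
        if convo["resolved_by_ai"]:
            resolved[convo["topic"]] += 1

    rows = []
    for topic, total in totals.items():
        unresolved = total - resolved[topic]
        rows.append((topic, total, unresolved, resolved[topic] / total))

    # Most unresolved conversations first; ties go to the lower resolution rate.
    rows.sort(key=lambda row: (-row[2], row[3]))
    return rows


# Hypothetical sample data, for illustration only.
sample = [
    {"topic": "billing", "resolved_by_ai": True},
    {"topic": "billing", "resolved_by_ai": False},
    {"topic": "billing", "resolved_by_ai": False},
    {"topic": "login", "resolved_by_ai": True},
    {"topic": "login", "resolved_by_ai": True},
    {"topic": "export", "resolved_by_ai": False},
]

for topic, total, unresolved, rate in resolution_by_topic(sample):
    print(f"{topic}: {unresolved}/{total} unresolved, resolution rate {rate:.0%}")
```

Sorting by unresolved count rather than by rate alone keeps a low-volume topic with a poor rate from outranking a high-volume one where a fix would help more customers.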
Fourth, I treat content as competitive infrastructure. Your AI Agent is only as good as what it knows. As George Dilthey, Head of Support at Clay, put it: “That’s when we realized: AI doesn’t just come up with information out of nowhere, you have to feed it. We were spending all our time evaluating tools when we should’ve been focused on content.”
I operationalize knowledge like infrastructure: every topic has a clear owner, content is structured, versioned, and ingestion-ready, new products ship with source-of-truth content by default, and changes ship on a schedule—not when someone finds time. This is the backbone that differentiates teams who scale confidently from those who stall out.
In my organization, we’ve evolved our New Product Introduction (NPI) process by aligning early with R&D on a single, canonical source of truth that becomes the foundation for all downstream content—including what the AI Agent uses to resolve queries. By embedding content creation into launch readiness, not as an afterthought, we’ve consistently hit 50%+ resolution rates on new features from day one.
Finally, I make belief visible. Even the best system will stagnate if people stop believing in it. Belief can fade quietly unless you reinforce it on purpose. I keep it strong by sharing specific wins regularly, highlighting improvements with metrics, and recognizing the people behind the gains—then giving them space to lead. This isn’t just about morale; it keeps everyone aligned on the bigger play.
When you put it all together (clear ownership, safe iteration, a learning system by default, content as infrastructure, and visible belief), AI performance compounds. As the AI Agent gets better, the entire support model becomes faster, more reliable, and truly scalable. That’s the foundation of a modern, AI-first support organization.
Next, I’ll take this a level deeper and share how capacity planning changes when AI handles the majority of inbound volume and your team shifts into higher-value roles. If scaling with confidence is the goal, this is where the operating model pays off.
Support teams in Spain just got the clearest signal yet that the old way of doing things won’t cut it anymore. As I look at the details, I see more than a regulatory hurdle—I see a blueprint for the modernization many of us have been pushing toward for years.
The signal arrives in the form of one of the most ambitious customer service regulations in Europe—a law designed to strengthen consumer protections and set clear expectations for fair, transparent, and personalized customer service. Among its measures: new protections against spam calls, stronger transparency requirements, safeguards around personalized interactions, and measurable standards for speed, accessibility, and complaint handling within customer support.
It’s a significant shift, especially for large enterprises and essential-service providers. While the initial reaction might be anxiety about audits and penalties, the larger opportunity is hard to ignore: this law compels us to build modern, resilient support operations that scale, perform, and earn trust.
Spain is often an early mover in consumer-protection regulation, and this shift could signal what future standards across the EU might look like. For EMEA leaders, this is a moment to reevaluate operating models, invest in automation thoughtfully, and ensure customer experience improvements directly support regulatory compliance.
Below, I break down what the law requires, what it means in practice, and how AI Agents like Fin can help teams meet regulatory expectations while delivering faster, more personal support at scale.
The law applies in full to providers of regulated services, including water, energy, passenger transport, postal services, pay-audiovisual media, and electronic communications, and also to any company (or group) that meets certain size and turnover thresholds, even if their core business falls outside those sectors.
Large companies (those with more than 250 employees and over €50 million in turnover) also hold additional obligations, particularly around multilingual support in Spain’s co-official language regions.
While the law is still moving through its final approval stages, the direction is clear: a broad set of obligations will apply to reinforce consumer rights, ensuring they can reach support quickly, speak to a human when needed, get clear information during outages or service disruptions, and have complaints handled promptly and on time.
1. 95% of support calls must be answered within three minutes
This raises the bar significantly for responsiveness, especially during spikes, outages, billing cycles, or seasonal surges. Most support systems are not built for this level of agility. In my experience, you can’t hire your way to this metric sustainably—you have to design for it.
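As a rough illustration of what designing for this metric means, here is a minimal sketch that checks a batch of call wait times against the 95% within three minutes target. How the law will actually measure this (the reporting window, or how abandoned calls count) is not covered here, so treating abandoned calls as misses is my assumption, as are the function names.

```python
def answered_within_share(wait_seconds, limit_seconds=180):
    """Share of calls answered within the limit.

    `None` marks a call that was never answered; this sketch assumes such
    calls count against the target.
    """
    if not wait_seconds:
        raise ValueError("no calls to evaluate")
    on_time = sum(1 for w in wait_seconds if w is not None and w <= limit_seconds)
    return on_time / len(wait_seconds)


def meets_target(wait_seconds, target=0.95, limit_seconds=180):
    """True when the on-time share reaches the target."""
    return answered_within_share(wait_seconds, limit_seconds) >= target


# Hypothetical waits in seconds: 18 quick answers, one slow, one abandoned.
waits = [20] * 18 + [200, None]
print(answered_within_share(waits), meets_target(waits))
```

Running the same check per hour or per queue, rather than once per month, is what surfaces the spikes that a monthly average hides.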
2. Customers must be able to speak to a human on request
Automation is allowed, but it cannot be the only option. At any point during a call, a customer must be able to transfer to a human if they ask for one. Companies cannot trap customers in automated loops. The practical implication: every workflow needs a reliable, audited escape hatch to a person.
3. Support lines must be free of charge
Premium-rate numbers are prohibited. Customer service cannot generate revenue for the business, nor may it be used to upsell products. This cleanly separates service from sales and reduces consumer friction.
4. Essential services must offer 24/7 support for continuity issues
Electricity, water, gas, telecoms, and transport providers must be reachable at all hours when customers need to report service interruptions. That means coverage, triage, and routing must be always-on.
5. Complaints must be resolved within 15 days – or within five days for undue charges
This halves the previous general complaint window of 30 days and adds a much faster path for billing-error complaints. Companies must maintain records, assign tracking numbers, and ensure timely follow-up. Your case management discipline will make or break this requirement.
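To show how these windows could be tracked in a case-management workflow, here is a minimal sketch that derives a due date from the received date. It assumes calendar days, which the summary above does not specify; confirm against the final legal text whether working days apply. The function names are hypothetical.

```python
from datetime import date, timedelta

GENERAL_DAYS = 15       # general complaint window described above
UNDUE_CHARGE_DAYS = 5   # faster window for undue-charge complaints


def complaint_due_date(received, undue_charge=False):
    """Due date for a complaint, counting calendar days (an assumption)."""
    days = UNDUE_CHARGE_DAYS if undue_charge else GENERAL_DAYS
    return received + timedelta(days=days)


def is_overdue(received, today, undue_charge=False):
    """True once today is past the due date."""
    return today > complaint_due_date(received, undue_charge)


received = date(2025, 3, 1)
print(complaint_due_date(received))                     # general complaint
print(complaint_due_date(received, undue_charge=True))  # billing error
```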
6. No spam calls or unwanted commercial pressure
Companies must identify business calls with a designated prefix, and customer-service calls with a different one. Telecom operators will be required to block calls that do not use these codes. Additionally, contracts obtained via unsolicited calls will be legally null and void, protecting consumers from being pressured into commitments they never intended to make.
7. Companies must maintain a unified complaint-tracking system
All complaints, claims, and incidents must be recorded in a centralized system to ensure traceability. If your data is fragmented across tools, this is a call to centralize and standardize intake.
8. Companies must pass annual external audits
These audits assess whether customer service processes are meeting the required standards. In practice, that means consistent processes, measurable outcomes, and reliable evidence.
9. Better linguistic and accessibility rights
Large companies operating in regions with co-official languages must be able to provide support in those languages. They must also ensure their customer service is accessible for vulnerable consumers, such as those with disabilities or older adults. Multilingual and accessible by design is the new default.
10. Fairer contract renewals
Companies must provide customers with 15 days’ notice prior to automatic renewal of online subscriptions and make cancellation simple. This is both a compliance and customer trust win.
Most support systems weren’t built for this level of speed or operational rigor. But the steps required to comply are the same ones that make service better for customers—and better for the teams delivering it. That’s why I view AI as an essential capability, not a bolt-on.
With the regulatory expectations clear, the question becomes: what does a modern, compliant support operation look like? For me, it blends human empathy with intelligent automation, proving auditability without sacrificing experience.
This is where AI plays a meaningful role. Not as a replacement for humans, but as a reliable front line that can handle a wide range of queries, including the most complex ones that require real depth, while keeping queues under control.
Adopting an AI Agent like Fin helps teams build a support model that meets regulatory expectations and improves customer experience across all your channels. Here’s how.
Many organizations will struggle to meet the three-minute standard during normal times, let alone during spikes or busy seasons, without unsustainably scaling their teams. Fin can help by reducing the number of calls that reach your phone lines, and Fin Voice can help ensure the ones that do are handled quickly.
Reducing avoidable call volume before it reaches the queue
Many of the queries teams receive are predictable: outage updates, billing questions, account changes, and other repeatable issues. Fin can resolve these instantly across several channels, including live chat, SMS, email, and WhatsApp, using the content and processes your team already maintains. I’ve seen this alone cut peak-time pressure dramatically.
Answering the phone immediately
For customers who do call, Fin Voice can pick up straight away. It provides natural, conversational responses based on your existing knowledge and helps your team stay responsive during busy periods.
Making it easier to reach a human during spikes
When queues build up, Fin can capture the reason for the call, gather details, and prioritize the most urgent issues. If you offer callback options, Fin can help schedule them quickly so customers avoid long wait times, which is key for staying compliant during peak periods.
The law requires customers to reach a real person whenever they request one. Fin supports this by keeping the path to a human clear and dependable: every interaction includes an option to speak to a person, and that option is accessible until the issue is resolved; when chosen, Fin hands over full context so human teams don’t start from scratch; if you show team availability or wait times, Fin can surface that information for customers; escalations can be prioritized to ensure faster pickup; alerts can notify on-call staff when urgent issues arise. On the phone, Fin Voice follows the same principle. Callers can request a transfer at any moment, and Fin routes the call to the right team with context intact.
Essential-service providers must be reachable at any hour when customers need to report service interruptions. Fin can help you meet this requirement without building a full overnight staffing model.
Always-on answers and triage
Fin provides first-line support at any hour of the day or night. Fin Voice brings this capability to the phone, giving callers immediate help even when your human team is offline. Fin can also direct customers to the latest updates you’ve published, such as outage information or status pages.
Routing urgent issues to the right people
When an issue requires human judgment, Fin gathers the necessary details and routes it to the appropriate on-call team using your existing after-hours processes. Teams can set up notifications so urgent issues are seen quickly.
Proactively surfacing what matters most
With AI Insights, Fin can also monitor for emerging patterns in customer conversations through Trending Topics. This means that if there’s a sudden spike in reports about a specific outage or a recurring question about a new process, Fin can flag these trends in real time. Your team is alerted to what’s top-of-mind for customers, so you can prioritize updates, publish targeted FAQs, or escalate critical issues, ensuring your support stays relevant and responsive, even overnight.
Complaints and outages often create the biggest spikes in volume, and the new law increases pressure to respond quickly, keep customers informed, and maintain complete records. This is exactly where structured AI intake adds value.
A more structured complaint intake
Fin can recognize when a customer is lodging a complaint, gather required information, and initiate a record in your existing system with a clear ID assigned from the outset.
Clear ownership and deadline alignment
Your team can then use your case-management tools to apply the 15-day resolution timeline (or five days for undue charges). Fin’s structured intake helps ensure that ownership and next steps are visible, rather than buried in unstructured notes.
Faster, more consistent outage communications
During service interruptions, Fin can share the latest published information, provide estimated fix times when available, and direct customers to live updates. On the phone, Fin Voice can triage incident-related calls quickly so callers aren’t waiting for a human agent just to receive basic information.
While multilingual support is only mandatory for large companies operating in co-official language regions, it remains essential for meeting consumer expectations. Fin helps by supporting multilingual, natural language interactions across voice and other channels; operating within channels that support accessibility features, like channels compatible with screen readers or commonly used messaging apps; and offering “request a call” paths and collecting the necessary information up front so teams can follow up quickly for customers who prefer phone support.
The law prohibits customer service interactions from generating additional revenue or being used to offer new products. With Guidance, you can set Fin up to stay firmly within these boundaries by shaping how it responds, which topics it should avoid, and what it should prioritize when a customer is seeking help or lodging a complaint.
The law raises expectations around documentation and audit readiness. Fin helps by making customer interactions more structured and consistent: when a conversation involves a complaint, Fin can ensure the required information is captured and a clear ID assigned; that ID can follow the interaction so it remains easy to trace; consistent intake gives you better visibility into key metrics regulators care about, like response times, time to first human contact, escalation volume, and whether complaints are resolved within required timelines; and transcripts, summaries, and metadata can be retained until cases are resolved, supporting audit requirements. Many organizations also maintain internal compliance playbooks outlining processes and owners, and Fin’s structured intake helps keep these practices reliable. Finally, you can use Insights to identify trending topics, optimize processes, and measure service quality.
Spain’s new customer service law raises the bar on speed, access, and accountability. It’s natural to worry about how your team will cope, especially if your support operation has grown organically across tools and regions. I’ve seen how quickly burnout and chaos can set in when expectations rise faster than capacity.
The reality is that meeting these expectations through people alone would put unsustainable pressure on already stretched support teams. The risk of burnout and operational chaos is real, which is why an AI Agent like Fin can bring welcome relief.
By handling everything from high-volume, repetitive questions to many of the deeper, more involved issues customers raise, Fin keeps queues manageable and prevents the strain from falling entirely on your human team, helping everyone stay above water as expectations rise.
For companies operating across the EU, adapting early to Spain’s stricter expectations can build resilience for whatever comes next—whether that ends up being driven by regulation or customer demand. Now is the time to align compliance, AI strategy, and customer experience into a single, measurable operating model.
I build products on the belief that trust is earned in every design decision and every deployment. Trust has always been a first principle at Intercom, from our early investments in security and privacy to the globally recognized certifications that shape our approach today.
As AI becomes more deeply embedded in customer-facing work, it’s essential that businesses can rely on systems that are safe, reliable, and governed to the highest standards. That’s why we’re proud to share that Intercom is now AIUC-1 certified, becoming one of the first companies to meet the world’s first standard designed specifically for AI Agents. For leaders navigating AI Strategy and AI risk management, this is more than a badge—it’s a measurable leap forward in governance and operational rigor.
AIUC-1 is the first certification tailored to the unique risks and challenges of AI Agents. It complements broader AI governance frameworks like ISO 42001 by focusing on enterprise-specific concerns like security, customer safety, system reliability, data and privacy, society, and accountability. In practice, this alignment helps us translate policy into deployable safeguards across cybersecurity, data governance, and regulatory compliance.
To achieve certification, organizations undergo independent third-party audits and quarterly adversarial testing across more than a thousand enterprise risk scenarios. This continuous technical evaluation ensures that AI systems remain robust against fast-evolving threats and that safeguards keep pace with rapid progress in the field. As a product leader, I welcome this level of scrutiny—it’s how we operationalize threat detection and response and make agentic AI dependable at scale.
AIUC-1 itself evolves every quarter, incorporating new research, threat patterns, and global best practices. The standard is shaped by the AIUC-1 Consortium, launched in November with more than 50 founding members who collectively handle tens of trillions of dollars in payments and serve over a billion people daily. Intercom is proud not only to be certified, but to be recognized as a founding technical contributor helping shape the development of the standard. That continuous, community-driven iteration mirrors how we build—measure, learn, and harden—so our customers benefit from real-world, enterprise-ready AI.
Intercom has decades of combined experience in security, compliance, and trust, and we’ve consistently demonstrated that robust governance and fast innovation can coexist. Achieving AIUC-1 certification reinforces that the same rigor we apply across our platform also extends to Fin, our AI Agent. I’ve seen first-hand how risk and procurement teams evaluate generative AI: they expect clarity, evidence, and controls. This certification delivers independent proof that our approach meets those expectations.
For our customers, this certification provides independent validation that Intercom’s AI systems are safe, resilient, and enterprise-ready. It confirms that our AI is tested regularly, built with strong safeguards, and aligned with the expectations of modern security and risk teams. It also signals our continued leadership in shaping responsible AI practices globally, ensuring our customers benefit from standards built for real-world use. In short, you can move faster with confidence—without compromising on governance.
Intercom has always approached trust as an ongoing commitment. AIUC-1 strengthens the foundation we’ve built across other frameworks and certifications, including SOC 2, ISO 27001, ISO 27701, ISO 27018, HIPAA, HDS, and ISO 42001. Together, these certifications create a comprehensive control fabric across privacy, security, and reliability—critical pillars for any enterprise deploying gen AI into production workflows.
As AI technology accelerates, we will continue to evolve our safeguards, deepen our governance practices, and contribute to the standards that shape responsible AI. Our promise is simple: to build AI that is not only powerful and efficient, but safe, transparent, and deserving of the trust our customers place in us. That’s how we turn innovation into durable value.
You can learn more about our certifications and access our security and compliance documentation through the Intercom Trust Center.
Get started with Fin and see how an AIUC-1 certified, enterprise-ready AI Agent can elevate your customer experience with confidence.
When I assess whether an AI product is ready for prime time, I start with trust—not model accuracy. Accuracy is table stakes; trust is what earns adoption, drives retention, and unlocks durable product-led growth.
Evaluation metrics in AI products go beyond accuracy. Learn how product teams use trust-driven metrics to build reliable, growth-driving AI systems.
In practice, I organize trust-driven metrics into four layers: model quality and safety, user and business outcomes, operational reliability and cost, and governance and compliance. This layered approach keeps product trios aligned on what matters now, what must be gated in CI/CD, and what signals we’ll use to prove progress against outcomes vs output OKRs.
On model quality and safety, I care about precision, recall, F1, calibration, and abstention behavior, but also the hard-to-fake signals: hallucination rate, grounding and faithfulness, citation coverage, toxicity, bias, and fairness. For generative systems, I instrument refusal correctness (declining unsafe requests) and evidence adequacy (did the answer rely on retrieved, trustworthy sources).
User and business outcomes must be explicit. I track adoption, activation, task success rate, time to first value, win rate uplift in assisted workflows, CSAT and NPS deltas, and retention analysis by cohort exposed to AI features. For customer support scenarios, deflection rate, average handle time change, and first-contact resolution are core; for sales or ops copilots, I monitor cycle-time reduction and error-rate reduction in critical tasks.
Experimentation is non-negotiable. I design A/B testing with a clear minimum detectable effect (MDE), pre-registered guardrails for safety and quality, and sequential tests that stop early if harm outpaces benefit. Online metrics are always paired with offline evals so we can iterate quickly without exposing users to regressions.
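To make the MDE concrete, here is a minimal sketch of the standard normal-approximation sample-size estimate for comparing two proportions in a fixed-horizon test. It is not a sequential design, which needs adjusted boundaries; the default alpha and power values are common conventions I am assuming, not figures from any specific experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute change of `mde`.

    Uses n = (z_{1-alpha/2} + z_power)^2 * (p1(1-p1) + p2(1-p2)) / mde^2
    for a two-sided, two-proportion test.
    """
    p1, p2 = baseline, baseline + mde
    if mde == 0 or not 0 < p1 < 1 or not 0 < p2 < 1:
        raise ValueError("baseline and baseline + mde must be in (0, 1), mde nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)


# Hypothetical example: 10% baseline task success, detect a 2-point lift.
print(sample_size_per_arm(baseline=0.10, mde=0.02))
```

Halving the MDE roughly quadruples the required sample, which is why I agree on the MDE before the test starts rather than after the data comes in.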
Operationally, trust shows up as speed, stability, and cost predictability. I track latency end-to-end, time to first token, throughput, rate of 5xx and timeouts, cost per request, and caching effectiveness. We also trend safety incidents per 10,000 interactions and mean time to mitigation to keep reliability visible alongside performance.
Governance and compliance are part of the product, not an afterthought. Data governance and privacy-by-design metrics include PII exposure rate, data lineage coverage, access-control correctness, audit pass rate against internal policies, and model and prompt change traceability. This is the backbone of our AI risk management posture and accelerates regulatory compliance reviews instead of slowing them down.
The delivery engine for all of this is eval-driven development. We maintain golden datasets and scenario-based test suites that mirror real user intents, gate releases in CI/CD with minimum thresholds, and run canary rollouts to validate offline–online alignment. Every model or prompt update gets a comparable scorecard so product, engineering, and design can trade off quality, speed, and cost with shared facts.
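A release gate of this kind can be sketched in a few lines. The metric names and threshold values below are hypothetical placeholders, not recommended targets; in a real pipeline the scorecard would come from the eval suite and a non-empty failure list would fail the CI job.

```python
# Hypothetical thresholds: ("min", x) means the metric must be at least x,
# ("max", x) means it must not exceed x.
THRESHOLDS = {
    "task_success_rate": ("min", 0.85),
    "hallucination_rate": ("max", 0.02),
    "p95_latency_seconds": ("max", 3.0),
}


def gate_release(scorecard, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = []
    for metric, (direction, limit) in thresholds.items():
        value = scorecard.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from scorecard")
        elif direction == "min" and value < limit:
            failures.append(f"{metric}: {value} is below minimum {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{metric}: {value} is above maximum {limit}")
    return failures


candidate = {
    "task_success_rate": 0.88,
    "hallucination_rate": 0.035,
    "p95_latency_seconds": 2.4,
}
for failure in gate_release(candidate):
    print("BLOCKED:", failure)
```

Treating a missing metric as a failure is deliberate: a gate that silently passes when an eval did not run is worse than no gate.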
For LLM-heavy features, retrieval-first pipeline metrics are mandatory. I monitor retrieval hit rate, recall at K, mean reciprocal rank, context contamination, and citation correctness. With large prompts, context window management matters: we track context utilization, truncation rate, and the contribution of each context block to final answers to avoid silently losing critical evidence.
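Two of these retrieval metrics have standard definitions that are easy to compute offline. Here is a minimal sketch of recall at K and mean reciprocal rank, assuming each query comes with a ranked list of retrieved document IDs and a labeled set of relevant IDs; the IDs shown are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not queries:
        raise ValueError("no queries to evaluate")
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical labeled queries.
queries = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["c", "d"], {"c"}),
]
print(recall_at_k(*queries[0], k=2))
print(mean_reciprocal_rank(queries))
```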
Finally, trust must be legible. I package these metrics into an executive scorecard that maps to business outcomes, risk appetite, and OKRs, with clear thresholds for ship, improve, or roll back. When teams can articulate trade-offs—say, a 20% latency reduction at a small cost increase, or a lower hallucination rate at the expense of higher abstention—they build credibility with stakeholders and confidence with customers.
Trust is not a single number; it’s a system of evidence. By instrumenting these layers and operationalizing AI Strategy with rigorous, transparent metrics, we can ship faster, reduce surprises, and earn the right to scale AI features across the product portfolio.
Vibe marketing can electrify a brand, but it can also derail a strategy if it outruns the fundamentals. I have seen campaigns with breathtaking creative fall flat because the message had no anchor in product truth, no measurable goals, and no operational guardrails. In this installment, I share the patterns I watch for, the diagnostics I run, and the AI tools I use to keep the vibe aligned with outcomes.
Learn how to avoid the five most common mistakes in vibe marketing to have more success with AI marketing tools.
At its best, vibe marketing translates product positioning and value proposition into an emotional signal customers immediately recognize. At its worst, it becomes mood without meaning. The difference is disciplined product management: clear go-to-market strategy, outcomes vs output OKRs, rigorous A/B testing, and a feedback loop that connects creative choices to customer behavior.
Mistake 1: Mistaking mood for strategy. Early drafts often lean on catchy lines or trending aesthetics that don’t map to customer jobs-to-be-done or competitive differentiation. When I feel that drift, I force the team to articulate the core product promise, restate the positioning, and tie each headline to a measurable outcome. If a message cannot be traced to a specific hypothesis, audience, and metric, we rewrite it before it ships.
Mistake 2: Chasing trends instead of customer truth. Vibes built on whatever is viral this week rarely compound into lasting learnings. I push for continuous discovery with interviews, in-product surveys, and sentiment analysis, then let gen AI generate multiple narrative variants grounded in actual quotes and objections. We evaluate with A/B testing and an explicit minimum detectable effect so we don’t declare victory on noise. That keeps our experimentation eval-driven, not anecdote-driven.
Mistake 3: Measuring vanity, not meaning. Reach and likes can be directional, but I optimize for activation, time-to-value, retention analysis, and conversion lift across the funnel. I instrument journeys in a unified analytics platform with Amplitude analytics and CRM integration so we can connect vibe exposure to outcomes. If the creative lifts click-through but hurts downstream activation, it’s not working—no matter how cool it looks.
Mistake 4: One vibe for every segment and channel. Audiences experience value differently, so the same creative rarely works in ads, landing pages, and in-app guides. I use LLMs for product managers and CustomGPT workflows to adapt the message by segment and stage, then validate with product tours, in-app prompts, and targeted lifecycle emails. The goal is coherence, not uniformity: a consistent story tuned to the context where decisions happen.
Mistake 5: Unbounded AI experimentation. Without AI risk management and data governance, teams can unintentionally ship off-brand or non-compliant copy. I set privacy-by-design standards, define approval thresholds, and establish context window management so models stay on-brief and on-policy. We log generations, review outputs against brand guidelines, and use retrieval to ground messaging in approved claims.
My practical playbook is simple: define the hypothesis tied to positioning, generate creative options with gen AI, pre-qualify with qualitative feedback, run A/B tests with clear success criteria, and iterate only on variants that move a business metric. Product trios align weekly on learnings so marketing signals and product-led growth motions reinforce each other. When the vibe matches the value and the data, momentum compounds.
Vibe marketing is not the opposite of rigor; it is rigor expressed emotionally. With the right AI strategy, measurement discipline, and governance, the creative spark becomes a durable advantage—and your brand earns the right to keep the spotlight.
Inspired by this post on Amplitude – Perspectives.
I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.
What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?
Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.
What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.
From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.
They built a working prototype without writing a single line of backend code.
They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.
They created two eval systems: one for call success classification, another for conversational behavior.
They scaled from five calls a day to tens of thousands per week while maintaining quality.
This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.
Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.
What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Their choice of Make.com for prototyping, and the fact that they shipped to production without touching backend code, underscores how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is exactly how you harden agent behavior for production.
Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.
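To illustrate what "breaking up the agent's task" can look like, here is a minimal sketch of a staged call flow. This is not Perk's implementation: the stage names follow the three stages described above, while the prompt text, the `Call` class, and the `advance` method are hypothetical placeholders I made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    IVR = "ivr"
    BOOKING_CONFIRMATION = "booking_confirmation"
    PAYMENT = "payment"
    DONE = "done"


# Hypothetical one-line instructions; real stage prompts would be far richer.
STAGE_PROMPTS = {
    Stage.IVR: "Navigate the phone menu until a person answers.",
    Stage.BOOKING_CONFIRMATION: "Confirm the booking reference with the hotel.",
    Stage.PAYMENT: "Ask whether the virtual card payment went through.",
    Stage.DONE: "",
}

NEXT_STAGE = {
    Stage.IVR: Stage.BOOKING_CONFIRMATION,
    Stage.BOOKING_CONFIRMATION: Stage.PAYMENT,
    Stage.PAYMENT: Stage.DONE,
}


@dataclass
class Call:
    stage: Stage = Stage.IVR
    history: list = field(default_factory=list)

    def current_prompt(self) -> str:
        # Only the active stage's instructions are sent to the model.
        return STAGE_PROMPTS[self.stage]

    def advance(self, stage_goal_met: bool) -> Stage:
        """Record the outcome and move on only when the stage goal was met."""
        if self.stage is Stage.DONE:
            return self.stage
        self.history.append((self.stage, stage_goal_met))
        if stage_goal_met:
            self.stage = NEXT_STAGE[self.stage]
        return self.stage


if __name__ == "__main__":
    call = Call()
    call.advance(stage_goal_met=True)   # left the IVR menu
    print(call.stage, "->", call.current_prompt())
```

The design point is that each stage has one narrow goal, so a failure is attributable to a specific stage, and the per-stage history gives both eval systems something concrete to score.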
The challenge of prompt engineering for voice (numbers, booking references, and text-to-speech markup) was non-trivial. Expanding to German revealed that prompts written in the native language improve results. And, as often happens with operations-heavy rollouts, this project uncovered other operational problems they didn't know existed, which is valuable signal for the roadmap.
Resources & Links: Perk. Make.com — No-code automation platform used for the prototype. Twilio — Voice/telephony provider. Eleven Labs — Text-to-speech provider (used in early experiments).
Chapters: 00:00 Introduction to the Team; 01:54 Understanding PERK's Mission; 02:59 Challenges in Travel Booking; 07:27 AI Solutions for Customer Care; 09:52 Prototyping with AI and Voice; 17:00 Implementing AI in Production; 25:51 Learning Through Trial and Error; 26:40 Prompting Challenges and Solutions; 27:58 Iterating on Prompts and Evaluations; 30:08 Scaling and Production Challenges; 32:43 Advanced Evaluation Techniques; 35:32 Real-World Applications and Success; 49:07 Future Directions and Expansion; 53:53 Conclusion and Team Reflections.
My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into deterministic stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language—your accuracy will thank you.
If you’re exploring agentic AI for operations, this is the blueprint: tight scoping, Make.com for speed, Twilio for reliability, structured prompts for control, and an eval-driven loop to scale quality with confidence.
AI search is reshaping how customers discover emerging products, and I’ve seen firsthand how this shift rewards startups that speak clearly to both humans and machines. Learn how LLMs like ChatGPT and Perplexity decide which startups to recommend and what signals help a brand get discovered in AI search.
In practice, AI search behaves less like a list of blue links and more like a synthesis engine. These models look for credible, consensus-backed, well-structured sources they can cite with confidence. That means your brand’s discoverability hinges on technical clarity (schema, structure, speed), topical authority (depth, citations, expert bylines), and evidence of real-world adoption (reviews, case studies, third-party validation).
I start by mapping buyer intent across the entire journey—category exploration, problem framing, solution fit, integration needs, ROI, and competitive comparisons. Then I design a page system that answers each intent with precision: clear “About” and “Use Cases” pages, integration-specific pages, objective "X vs Y" comparisons, transparent pricing, and a living FAQ that mirrors the exact questions users ask in conversational queries.
Structure matters. I add JSON-LD schema for Organization, Product, FAQPage, HowTo, and Article where appropriate; keep canonical URLs consistent; and ensure titles, meta descriptions, and Open Graph data reinforce the same story. Clean sitemaps, a sensible robots.txt, and fast, mobile-first performance reduce friction for crawlers and increase the odds that LLMs extract accurate snippets.
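As a small illustration of the JSON-LD point, here is a sketch that builds schema.org FAQPage markup from question and answer pairs. The `@type` values and property names come from the schema.org vocabulary; the helper names and the sample question are hypothetical.

```python
import json


def faq_jsonld(qa_pairs):
    """Build a schema.org FAQPage object from (question, answer) tuples."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


def as_script_tag(data) -> str:
    """Serialize to the <script> element that goes into the page HTML."""
    body = json.dumps(data, indent=2)
    # Escape "</" so answer text can never close the script element early.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"


if __name__ == "__main__":
    pairs = [("Does the product integrate with a CRM?",
              "Yes, through a native integration.")]  # made-up example content
    print(as_script_tag(faq_jsonld(pairs)))
```

Generating the markup from the same source as the visible FAQ keeps the structured data and the on-page copy from drifting apart.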
Authority is earned off-site as much as on-site. I prioritize third-party signals—G2/Capterra reviews, analyst mentions, reputable press, open-source repos with README clarity, academic or industry citations, and credible partner integrations. LLMs heavily weight these external proofs when recommending solutions, especially for B2B and regulated categories.
On your site, demonstrate expertise. I include expert bylines with real credentials, cite primary sources, showcase customer outcomes with verifiable metrics, and make methodologies transparent. Shallow, keyword-stuffed posts don’t help; comprehensive, up-to-date explainers with references do.
Make your content retrieval-friendly. LLMs favor text they can segment, anchor, and quote. I structure pages with descriptive headings, short paragraphs, and linkable anchors; offer HTML-first documentation (not just PDFs); and provide copyable code or configuration steps when relevant. This also sets you up for a retrieval-first pipeline in your own product experiences.
From a product and platform angle, I expose trustworthy documentation and a clear trust center—security, compliance, data governance, and privacy-by-design content. When a user asks an LLM whether they can safely deploy your solution, these pages often get pulled into the answer.
Evaluation closes the loop. I run an eval-driven development process for content: a stable prompt set that mirrors real queries, regular tests in both Perplexity and ChatGPT, and analytics to track referrals from AI-driven sources. I iterate headlines, schema, and on-page structure, then tie changes back to engagement and pipeline using A/B testing where it’s appropriate.
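One way to make that loop measurable is to record the answers each assistant gives to the stable prompt set, then track how often the brand appears. The sketch below assumes you have already collected the answers as plain text by whatever means you use; `mention_rate`, the sample prompts, and the brand name "Acme" are all hypothetical.

```python
def mention_rate(answers: dict[str, str], brand: str) -> float:
    """Share of prompts whose recorded answer mentions the brand (case-insensitive)."""
    if not answers:
        return 0.0
    hits = sum(brand.lower() in text.lower() for text in answers.values())
    return hits / len(answers)


def rate_change(before: dict[str, str], after: dict[str, str], brand: str) -> float:
    """Difference in mention rate between two runs of the same prompt set."""
    return mention_rate(after, brand) - mention_rate(before, brand)


if __name__ == "__main__":
    # Made-up answers keyed by prompt, from two runs a month apart.
    last_month = {
        "best analytics tool for startups": "Popular options include Foo and Bar.",
        "alternatives to Foo": "Consider Bar or Acme.",
    }
    this_month = {
        "best analytics tool for startups": "Acme and Foo are frequently cited.",
        "alternatives to Foo": "Consider Bar or Acme.",
    }
    print(rate_change(last_month, this_month, "Acme"))
```

A substring match is a deliberately crude signal. It says nothing about whether the mention is favorable, so I would still read the answers themselves.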
Don’t neglect comparison and alternatives pages. Fair, well-cited pages that address trade-offs and points of parity build trust—and they give LLMs succinct, quotable language for recommendation contexts. Clarity beats hype every time.
Finally, keep your corpus fresh. I schedule quarterly content reviews, retire outdated claims, and highlight release notes and integration updates. Freshness signals help models favor your content when they resolve time-sensitive queries.
If you treat AI search as a product surface—one that rewards precision, provenance, and performance—you’ll dramatically increase your odds of being recommended where it matters. That’s how I operationalize AI discovery for startups: intent mapping, structured content, external authority, a retrieval-friendly corpus, and a rigorous eval loop.
Inspired by this post on Amplitude – Perspectives.
I’ve learned that the most powerful AI features rarely emerge from lone-wolf brilliance—they’re born when a community rallies around a shared objective. “Building Amplitude’s AI for insight automation felt a lot like the fable of travelers making stone soup with their community.” That spirit captures how I approach shipping AI for analytics: bring focused ingredients, invite contributions, and let rigorous evaluation transform the result into something extraordinary.
At the core is Eval-Driven Development. Rather than debating preferences, we define explicit evaluation sets, success thresholds, and guardrails, then wire them into CI/CD so every change improves reliability, quality, and relevance. For AI-driven analytics, our evals combine offline judgment tests (precision, recall, hallucination rates), user-centric measures (time-to-insight, actionability), and production health signals (failure modes, latency). When the bar rises, the product improves—continuously and measurably.
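To show what an offline judgment test wired into a pipeline might look like, here is a minimal sketch that scores a set of flagged insights against a gold-standard set and applies a pass/fail gate. The threshold values, the function names, and the sample insight IDs are assumptions for illustration, not Amplitude's actual eval suite.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    precision: float
    recall: float


def score(predicted: set, expected: set) -> EvalResult:
    """Compare the insights the system surfaced with the gold-standard set."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return EvalResult(precision, recall)


def passes_gate(result: EvalResult,
                min_precision: float = 0.8,   # assumed threshold
                min_recall: float = 0.7) -> bool:  # assumed threshold
    """Return True only when both metrics clear their bars."""
    return result.precision >= min_precision and result.recall >= min_recall


if __name__ == "__main__":
    gold = {"checkout_drop", "ios_retention_dip", "pricing_page_spike"}
    surfaced = {"checkout_drop", "ios_retention_dip", "weekend_noise"}
    result = score(surfaced, gold)
    # In CI, a failed gate would exit non-zero so the change cannot merge.
    raise SystemExit(0 if passes_gate(result) else 1)
```

Hallucination rate, latency, and the user-centric measures would need their own checks. This sketch covers only the precision and recall piece.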
We made “stone soup” by inviting contributions from every function. Data science established gold-standard datasets and baselines. Engineering implemented retrieval, orchestration, and safe deployment paths. Product and design framed high-value use cases, in-app guides, and UX writing that clarified intent. Customer success and support piped real-world edge cases into our evals so the system improved where it mattered. Product trios kept us outcome-focused, and empowered product teams moved quickly without sacrificing governance.
Why this matters for analytics: AI insight automation reduces the heavy lift of exploring funnels, cohorts, anomalies, and retention patterns—accelerating activation and product-led growth. With a unified analytics platform and strong data governance, we can surface relevant patterns proactively, explain the “why” behind movements, and recommend next best actions without drowning users in noise. The result is faster decisions, cleaner handoffs between teams, and a tighter loop from observation to intervention.
Our practical playbook is simple but strict: define a clear north-star outcome; curate representative eval sets that mirror real user questions; simulate A/B testing offline before live traffic; instrument time-to-insight and adoption; and integrate evals into CI/CD so regressions never ship. We monitor DORA metrics to maintain delivery velocity while holding quality lines, and we use human-in-the-loop review to continuously refine prompts, patterns, and explanations.
We also learned what doesn’t work. General-purpose prompts seldom transfer cleanly to analytics without domain grounding and context window management. A retrieval-first pipeline improves factuality, but only if metadata and event taxonomies are consistent. And while generative UX can delight in demos, it must earn trust in production through transparent reasoning, privacy-by-design, and predictable behavior under load.
In the end, the stone soup metaphor isn’t about cute storytelling—it’s about disciplined collaboration. When a cross-functional community contributes the right ingredients and Eval-Driven Development keeps us honest, AI for insight automation becomes both credible and compounding. That’s how we turn analytics into action—and how we ship AI products that users rely on every day.
Inspired by this post on Amplitude – Best Practices.
I’ve spent years watching users bounce between product screens, docs, and support tickets when they hit a roadblock. The fastest path to value is always the same: deliver relevant, contextual help exactly when and where the user needs it. That’s why I’m excited about the next wave of in-app guidance that blends behavioral data with AI to anticipate intent and remove friction in real time.
Announcing Resource Centers, Amplitude’s newest in-product help feature that uses behavioral data and AI to serve help content users actually need.
Here’s why that matters. In a product-led growth model, in-app guides, product tours, and just-in-time tips are essential to onboarding and user activation. When help content is informed by real behavioral signals—events, cohorts, milestones—it stops being a static knowledge base and becomes a living system that adapts to a user’s journey. That means fewer context switches, faster time-to-value, and more confident users who can self-serve their way to outcomes.
In practice, the most effective resource centers are opinionated and contextual: they surface content by role, plan, and lifecycle stage; trigger nudges based on key events; and offer multiple modalities (microcopy, short clips, interactive guides) so users can choose how they learn. They also respect pacing, avoiding notification fatigue with rate limits and prioritization rules. Think of this as high-quality UX writing paired with data-driven orchestration—useful, discoverable, and never in the way.
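Here is a minimal sketch of the rate limit and prioritization idea: show at most one nudge per cooldown window, and when several are eligible, pick the highest-priority one. The `Nudge` fields, the 24-hour default, and the event names are hypothetical and do not describe how Resource Centers works internally.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Nudge:
    name: str
    priority: int        # higher number wins
    trigger_event: str   # behavioral event that makes the nudge eligible


def pick_nudge(candidates: list,
               recent_events: set,
               last_shown_at: Optional[datetime],
               now: datetime,
               min_gap: timedelta = timedelta(hours=24)) -> Optional[Nudge]:
    """Return the single nudge to show now, or None to stay quiet."""
    # Rate limit: respect the cooldown since the last nudge this user saw.
    if last_shown_at is not None and now - last_shown_at < min_gap:
        return None
    # Eligibility: only nudges whose trigger event actually happened.
    eligible = [n for n in candidates if n.trigger_event in recent_events]
    # Prioritization: highest priority wins; None when nothing is eligible.
    return max(eligible, key=lambda n: n.priority, default=None)


if __name__ == "__main__":
    nudges = [
        Nudge("invite_teammates", priority=1, trigger_event="project_created"),
        Nudge("fix_import_error", priority=5, trigger_event="import_failed"),
    ]
    now = datetime(2025, 1, 2, 9, 0)
    chosen = pick_nudge(nudges, {"project_created", "import_failed"}, None, now)
    print(chosen.name if chosen else "no nudge")
```

Keeping the cooldown check first means a user who was just nudged sees nothing, no matter how many triggers fire, which is the behavior that guards against notification fatigue.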
Execution matters. Start with a clear content taxonomy, map help assets to journey stages, and establish a content ops cadence so guides stay fresh. Partner closely with data governance to ensure privacy-by-design and transparent consent for behavioral data usage. Then wire in feedback loops—thumbs up/down, quick polls, and session replays—so you can continuously discover gaps and iterate quickly.
Measure impact with the same rigor you apply to product features. Track activation rates, time-to-first-value, self-serve resolution rates, reduction in ticket volume on targeted topics, and downstream retention. Use A/B testing to validate which interventions move the needle, and segment results to learn what works for new users versus power users. When results differ, treat that as a design signal—not a failure—and refine the targeting.
Roll out thoughtfully. Pilot with a high-friction workflow, localize the help content to the user’s context, and set clear exit criteria before scaling. Align with customer support and success so your resource center becomes the canonical source for in-app help, not yet another content silo. Over time, unify insights across Amplitude analytics and your support stack to close the loop between product behavior and help outcomes.
As product leaders, our goal is simple: reduce effort and increase confidence for every user. AI-assisted, behaviorally triggered resource centers are a pragmatic step toward that future—meeting users where they are, with exactly what they need, at the moment they need it.
Inspired by this post on Amplitude – Best Practices.
Every week, I ask a simple question with massive implications for our AI Strategy: what do large language models actually say about our brand? As a VP of Product Management at HighLevel, I’ve learned that competitive differentiation now lives as much in AI-generated responses as it does in traditional search or social. That’s why a reliable, unified analytics platform for AI visibility is quickly becoming table stakes for product management leadership.
Discover how Amplitude AI Visibility helps you track your visibility score, uncover competitor rankings, and prove business impact—all in one platform.
Here’s why that matters. A visibility score gives me a measurable baseline—our AI share of voice—so I can see whether our product-led growth and go-to-market strategy are landing in the places where buyers increasingly look for answers. Competitor rankings reveal points of parity and opportunities to differentiate, which directly inform product positioning and our value proposition. And the ability to prove business impact closes the loop between AI exposure and outcomes that executives care about.
Operationally, I would start by benchmarking our visibility score against key competitors, then segment by core use cases to identify where our story underperforms. Those insights feed product discovery, content strategy, and enablement—tightening the narrative to better align with buyer intent. I’d translate the findings into prioritized bets for the roadmap and partner closely with marketing to amplify wins and address gaps.
For teams exploring LLMs for product managers and GenAI-driven growth, this approach creates a disciplined feedback loop: measure what AI says, experiment to improve it, and verify the impact across the funnel. It’s a pragmatic way to connect messaging, discovery, and differentiation—without guessing what the models are surfacing about your brand.
I’ve followed Amplitude analytics for years, and Amplitude AI Visibility slots naturally into a modern operating model: one platform to monitor the signals that matter, align stakeholders, and make faster, evidence-based decisions. If your mandate includes scaling product-led growth and sharpening competitive differentiation, this is a timely, actionable way to see—and shape—how AI represents you.
Inspired by this post on Amplitude – Best Practices.
I’m constantly looking for ways to collapse the distance between product questions and trustworthy answers. When behavioral data shows up in the tools I already use, my team moves faster, aligns better, and makes higher-confidence calls. That’s exactly why Amplitude MCP caught my attention—and why it’s quickly becoming essential to my AI Strategy and day-to-day Product Management practice.
Discover how Amplitude MCP brings behavioral context to AI tools like Claude and Cursor, enabling data-driven decisions in your existing workflows.
In practice, this means I can ask Claude, Cursor, or even Claude Code about activation cohorts, retention analysis, funnel drop‑offs, and feature adoption—and get responses grounded in Amplitude analytics without tab-hopping. By bringing our unified analytics platform into the flow of work, I keep momentum high and decision latency low, especially during fast-moving discovery and delivery cycles.
This approach elevates LLMs for product managers from clever assistants to reliable copilots. During continuous discovery, I can interrogate segments, compare behaviors across personas, and pressure-test hypotheses in minutes. In product-led growth environments, that behavioral context turns prioritization into a repeatable, outcomes-first ritual rather than a debate fueled by anecdotes.
Equally important, MCP helps me protect the integrity of our metrics. With consistent definitions flowing into AI tools, I reduce shadow analysis, preserve governance, and support privacy-by-design. Stakeholders—from engineers to design to GTM—see the same truths, which improves trust and accelerates alignment across the organization.
Getting started is straightforward: connect your workspace, ensure your event taxonomy is clean, and align key properties with CRM integration so segments and journeys remain attributable. I also curate an AI product toolbox of prompts for common workflows—say, exploring A/B testing outcomes or checking the minimum detectable effect (MDE) before a new experiment—so the team can move quickly without reinventing the wheel.
The payoff is immediate: fewer context switches, faster iteration loops, and sharper decisions where they matter most, inside the tools we already rely on. If you’re charting your gen AI roadmap, consider how Amplitude MCP can infuse behavioral insight into every conversation and commit. For me, it’s a pragmatic step toward an intelligent, data-informed product practice that scales.
Inspired by this post on Amplitude – Best Practices.
Most mornings start the same way for me: coffee in hand, I sit down, open Claude Code, and type /today. In a few seconds, Claude pulls fresh tasks from my Trello board, compiles a clean today.md with what matters most, and assembles a research digest of the latest academic work across my focus areas.
Scanning that today.md has become my daily ritual. My workload typically spans writing, coding, and administration. I now make a habit of asking Claude, "What's on my to-do list that you can help with?" That simple question keeps me honest about where AI can accelerate my day.
I’m experimenting with a workflow where Claude enriches every task based on what it can take on or accelerate. It’s still early, so we iterate together for a few minutes each morning to tighten the loop and improve the prompts and outputs.
Next up is my research digest. I skim, download the PDFs that look promising, and move on. Tomorrow, Claude will deliver detailed summaries of every paper I saved—so I stay current without burning hours on search and sorting.
For the first few hours, I protect deep work. Today, that means writing this article. My to-do list and draft live side-by-side in Obsidian, so I click directly from the task into the outline, pick up my running conversation with Claude, and get right back into flow. I pair-write: we outline, I draft, and then I ask, "I wrote the intro. What do you think?"
[Image: a terminal-based AI helper suggesting concrete ways to lighten the workload: draft a blog, plan 2026, launch a course, migrate files, craft a survey, and digest research.]
Claude gives pointed feedback—what’s working, what needs tightening—and we iterate. This is genuinely how I work now. I pair with Claude on almost everything I do. It didn’t happen overnight; over the past five months, I’ve built a personal AI-enhanced operating system that has fundamentally improved how I operate: more output, faster cycles, and frankly, more joy in the work.
Because it’s made such a difference, I’m sharing the playbook. If you’re new to Claude Code or want to get more from it, start here:
Claude Code: What It Is, How It's Different, and Why Non-Technical People Should Use It
Stop Repeating Yourself: Give Claude Code a Memory
How to Use Claude Code Safely: A Non-Technical Guide to Managing Risk
In recent office hours, one question came up again and again: Where do I start—what should I automate and what should I have AI augment? Today, I’ll walk through how I decide, share my own workflows, and show how I prioritize what to build next. Next week, we’ll get into how to design and build personal workflows.
This series was inspired by my personal usage of Claude Code. I have not received any compensation from Anthropic for writing this series. And you can trust that if that ever changes, I will disclose it. This is not only required by the FTC here in the US, but I strongly believe it is the right thing to do. You can count on me to do so.
Understanding what AI workflows can do for you
[Image: a dark-themed markdown editor showing an article draft on choosing tasks to automate with AI. The sidebar organizes notes, and the draft outlines pulling Trello tasks, making today.md, and using Claude.]
I started with ChatGPT in the browser not long after it launched and quickly began asking, “Can ChatGPT help with this?” As my use cases grew (and my patience for copy-paste vanished), I moved to Claude Code. The philosophy never changed: continuously push the envelope of what LLMs can do today while managing risk.
My default stance is to attempt everything with AI, then decide what becomes a reusable workflow versus a one-off assist. A workflow, to me, is a sequence of steps where some are automated by AI, others are AI-augmented, and some still require me.
Across my setup, clear patterns emerged. I use AI to: (1) do more of what I’m already good at, (2) eliminate friction in frequent tasks, and (3) remove what drains me. The goal is simple: multiply impact without sacrificing quality.
Take writing. I now average about 35,000 words per month—up from roughly 8,000. I’m writing more often and in more depth. I draw more from academic research and include more stories—both my own and those from others. Claude gives me detailed feedback on everything I write, which helps me maintain momentum. It’s remarkable how often a simple nudge—“Ready to write the next section?”—keeps me in the zone. I also spend more time with Claude on structure before drafting, so I discard far less.
[Image: a split-screen workspace pairing the article draft with detailed reviewer notes, showing an iterative process of outlining, fact-checking, and refining before publication.]
Podcast production is another domain where AI shines. I produce two weekly shows: I love connecting with Petra Wille on All Things Product, and talking with product teams building AI-powered products on Just Now Possible. I use Descript to edit, and I rely on Claude Code shortcuts (slash commands) to draft episode titles, descriptions, show notes, chapters, and social posts. I still own the editorial bar—no “AI slop”—but I let AI handle the heavy lifting so I can focus on shaping the final story.
Then there are tasks I fully automate. I love reading across creativity, collaboration, AI efficacy, and more. I do not love searching for relevant papers. So I don’t. Every morning, my automated research workflow finds the newest, most relevant articles and populates my digest. All I do is review.
Choosing your first AI workflows
Classic delegation advice still applies: build awareness of where your time goes; identify what you can delegate; invest your time in the work you’re uniquely equipped to do. That’s a great start for AI workflow strategy, but don’t ignore what you love doing and want to do more of. Augmentation often generates the highest returns—AI helps me go deeper, faster, without diluting my craft.
[Image: a markdown workspace compiling a 'Filtered Research Digest' with criteria, paper counts, and summaries.]
To uncover opportunities, I simply ask, over and over: Can AI help with this? As you go about your work today, keep asking yourself: How can AI help with this?
Evaluating if a task is a good candidate for an AI workflow
Through trial and error, I now run new tasks through a quick filter:
• Is this a one-time task or do I do it often?
• Do I enjoy doing this task or would I give it to someone else if I could?
• How complex is the task?
• Can I articulate how I would do the task step-by-step?
• Does completing the task require my human judgment?
• Can I define what "done successfully" looks like?
• How much risk is there if the task is not done well?
This checklist takes minutes and pays off quickly. The answers tell me whether to automate, augment, or keep a task human-only for now—and they guide how much process and guardrailing to build around each workflow.
From here, I’ll walk through how to answer these questions in practice, how the answers map to different levels of automation or augmentation, and how I prioritize which workflows to invest in. I’ll also share 41 of my own AI workflows (noting which are automated versus augmented) plus 9 discovery-related workflows currently in development so you can steal shamelessly and ship your first one today.