Tag: eval-driven development

  • From No-Code Hack to 10,000 Weekly Calls: Inside Perk’s Voice AI That Actually Works

    I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.

    What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?

    Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.

    What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.

    From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.

    They built a working prototype without writing a single line of backend code.

    They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.

    They created two eval systems: one for call success classification, another for conversational behavior.

    They scaled from five calls a day to more than 10,000 per week while maintaining quality.

    This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.

    Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.

    What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Their choice of Make.com for prototyping, and the fact that they shipped to production without touching backend code, shows how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is a sound way to harden agent behavior for production.
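
    To make the staged structure concrete, here is a minimal sketch of a call as a small state machine. The stage names come from the episode; the prompt text, the `advance` and `run_call` helpers, and the "goal met" signal are my own assumptions for illustration, not Perk's implementation.

```python
from enum import Enum, auto


class Stage(Enum):
    IVR = auto()
    BOOKING_CONFIRMATION = auto()
    PAYMENT = auto()
    DONE = auto()


# Hypothetical per-stage instructions; the real prompts are not in the source.
STAGE_PROMPTS = {
    Stage.IVR: "Navigate the phone menu until a person answers.",
    Stage.BOOKING_CONFIRMATION: "Confirm the hotel can find the booking.",
    Stage.PAYMENT: "Ask whether the virtual card payment went through.",
}

# Each stage can only hand off to the next one, so the agent
# works on one narrow goal at a time.
NEXT_STAGE = {
    Stage.IVR: Stage.BOOKING_CONFIRMATION,
    Stage.BOOKING_CONFIRMATION: Stage.PAYMENT,
    Stage.PAYMENT: Stage.DONE,
}


def advance(stage: Stage, goal_met: bool) -> Stage:
    """Stay in the current stage until its goal is met, then move on."""
    if stage is Stage.DONE or not goal_met:
        return stage
    return NEXT_STAGE[stage]


def run_call(goal_results: list[bool]) -> list[Stage]:
    """Replay per-turn goal checks and return the stages visited, in order."""
    stage = Stage.IVR
    visited = [stage]
    for goal_met in goal_results:
        stage = advance(stage, goal_met)
        if stage is not visited[-1]:
            visited.append(stage)
    return visited
```

    In a sketch like this, only the prompt for the current stage is active on any turn, which is one plausible reason a staged design behaves more predictably than a single monolithic prompt.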

    Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.
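
    A rough sketch of how the two eval types differ, under my own assumptions: the outcome labels, the rubric questions, and the `ask_judge` callable (a stand-in for whatever LLM call does the judging) are all hypothetical and not taken from Perk's system.

```python
from collections import Counter
from typing import Callable


def success_rate(outcomes: list[str]) -> float:
    """Outcome eval: share of calls classified as 'success'."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["success"] / total if total else 0.0


# Hypothetical behavior rubric; each question expects a yes/no verdict.
RUBRIC = [
    "Was the agent polite throughout the call?",
    "Did the agent state the booking reference clearly?",
]


def judge_behavior(
    transcript: str,
    ask_judge: Callable[[str, str], bool],
) -> dict[str, bool]:
    """Behavior eval: ask a judge one rubric question at a time.

    `ask_judge(transcript, question)` is injected so the same code
    runs with a real model in production and a stub in tests.
    """
    return {question: ask_judge(transcript, question) for question in RUBRIC}
```

    Keeping the two apart matters because a call can succeed on outcome while still behaving badly, and the reverse.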

    Prompt engineering for voice was non-trivial: numbers, booking references, and text-to-speech markup all needed special handling. Expanding to German revealed that prompts written in the target language improve results. And, as often happens with operations-heavy rollouts, the project uncovered other operational problems the team didn't know existed, which is valuable signal for the roadmap.
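
    One common way to handle references in voice output is to spell them out before they reach the speech engine. The sketch below is my own illustration of that idea, not the team's approach; the `spell_reference` helper and its comma-separated output format are assumptions.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_reference(reference: str) -> str:
    """Spell a booking reference one character at a time.

    Digits become words and letters are upper-cased; separators such
    as dashes or spaces are dropped. The commas give the speech
    engine a short pause between characters.
    """
    parts = []
    for ch in reference:
        if ch in DIGIT_WORDS:
            parts.append(DIGIT_WORDS[ch])
        elif ch.isalpha():
            parts.append(ch.upper())
    return ", ".join(parts)
```

    For example, `spell_reference("ab-12")` returns `"A, B, one, two"`, which avoids the reference being read as a word or as the number twelve.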

    Resources & Links: Perk. Make.com — No-code automation platform used for the prototype. Twilio — Voice/telephony provider. Eleven Labs — Text-to-speech provider (used in early experiments).

    Chapters:
    00:00 Introduction to the Team
    01:54 Understanding Perk's Mission
    02:59 Challenges in Travel Booking
    07:27 AI Solutions for Customer Care
    09:52 Prototyping with AI and Voice
    17:00 Implementing AI in Production
    25:51 Learning Through Trial and Error
    26:40 Prompting Challenges and Solutions
    27:58 Iterating on Prompts and Evaluations
    30:08 Scaling and Production Challenges
    32:43 Advanced Evaluation Techniques
    35:32 Real-World Applications and Success
    49:07 Future Directions and Expansion
    53:53 Conclusion and Team Reflections

    My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into deterministic stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language—your accuracy will thank you.

    If you’re exploring agentic AI for operations, this is a useful blueprint: tight scoping, Make.com for speed, Twilio for telephony, structured prompts for control, and an eval-driven loop to scale quality with confidence.


    Inspired by this post on Product Talk.


  • Crack the AI Search Code: How Startups Win Recommendations in ChatGPT and Perplexity

    AI search is reshaping how customers discover emerging products, and I’ve seen firsthand how this shift rewards startups that speak clearly to both humans and machines. This post covers how AI assistants like ChatGPT and Perplexity choose which startups to recommend and which signals help a brand get discovered in AI search.

    In practice, AI search behaves less like a list of blue links and more like a synthesis engine. These models look for credible, consensus-backed, well-structured sources they can cite with confidence. That means your brand’s discoverability hinges on technical clarity (schema, structure, speed), topical authority (depth, citations, expert bylines), and evidence of real-world adoption (reviews, case studies, third-party validation).

    I start by mapping buyer intent across the entire journey—category exploration, problem framing, solution fit, integration needs, ROI, and competitive comparisons. Then I design a page system that answers each intent with precision: clear “About” and “Use Cases” pages, integration-specific pages, objective "X vs Y" comparisons, transparent pricing, and a living FAQ that mirrors the exact questions users ask in conversational queries.

    Structure matters. I add JSON-LD schema for Organization, Product, FAQPage, HowTo, and Article where appropriate; keep canonical URLs consistent; and ensure titles, meta descriptions, and Open Graph data reinforce the same story. Clean sitemaps, a sensible robots.txt, and fast, mobile-first performance reduce friction for crawlers and increase the odds that LLMs extract accurate snippets.
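
    As an example of the schema step, here is a small sketch that builds FAQPage JSON-LD from question and answer pairs. The `@type` and property names follow the schema.org vocabulary; the `faq_jsonld` helper and the sample content are my own illustration.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a schema.org FAQPage document from (question, answer) pairs."""
    doc = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    # Hypothetical FAQ entry for illustration only.
    print(faq_jsonld([("Does it integrate with Slack?", "Yes, via a native app.")]))
```

    The output goes inside a `<script type="application/ld+json">` tag on the page, and the questions should match the visible FAQ text.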

    Authority is earned off-site as much as on-site. I prioritize third-party signals: G2/Capterra reviews, analyst mentions, reputable press, open-source repos with clear READMEs, academic or industry citations, and credible partner integrations. In my experience, LLMs lean heavily on these external proofs when recommending solutions, especially for B2B and regulated categories.

    On your site, demonstrate expertise. I include expert bylines with real credentials, cite primary sources, showcase customer outcomes with verifiable metrics, and make methodologies transparent. Shallow, keyword-stuffed posts don’t help; comprehensive, up-to-date explainers with references do.

    Make your content retrieval-friendly. LLMs favor text they can segment, anchor, and quote. I structure pages with descriptive headings, short paragraphs, and linkable anchors; offer HTML-first documentation (not just PDFs); and provide copyable code or configuration steps when relevant. This also sets you up for a retrieval-first pipeline in your own product experiences.

    From a product and platform angle, I expose trustworthy documentation and a clear trust center—security, compliance, data governance, and privacy-by-design content. When a user asks an LLM whether they can safely deploy your solution, these pages often get pulled into the answer.

    Evaluation closes the loop. I run an eval-driven development process for content: a stable prompt set that mirrors real queries, regular tests in both Perplexity and ChatGPT, and analytics to track referrals from AI-driven sources. I iterate headlines, schema, and on-page structure, then tie changes back to engagement and pipeline using A/B testing where it’s appropriate.
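
    A minimal sketch of the scoring half of that loop, assuming the answers have already been collected (by hand or otherwise) for each prompt in the stable set. The `mention_rate` helper, the brand name, and the sample answers are hypothetical.

```python
def mention_rate(answers: dict[str, str], brand: str) -> float:
    """Share of prompts whose collected answer mentions the brand.

    `answers` maps each prompt in the stable set to the text an
    assistant returned for it. Matching is a case-insensitive
    substring check, which is crude but easy to audit.
    """
    if not answers:
        return 0.0
    needle = brand.lower()
    hits = sum(needle in text.lower() for text in answers.values())
    return hits / len(answers)


if __name__ == "__main__":
    # Hypothetical snapshot of one test run.
    run = {
        "best tools for product analytics": "Options include Acme and others.",
        "how to track retention": "You can build cohorts in a spreadsheet.",
    }
    print(f"mention rate: {mention_rate(run, 'Acme'):.0%}")
```

    Running the same prompt set on a schedule and comparing rates before and after a content change gives a trend line rather than a single anecdote.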

    Don’t neglect comparison and alternatives pages. Fair, well-cited pages that address trade-offs and points of parity build trust—and they give LLMs succinct, quotable language for recommendation contexts. Clarity beats hype every time.

    Finally, keep your corpus fresh. I schedule quarterly content reviews, retire outdated claims, and highlight release notes and integration updates. Freshness signals help models favor your content when they resolve time-sensitive queries.

    If you treat AI search as a product surface—one that rewards precision, provenance, and performance—you’ll dramatically increase your odds of being recommended where it matters. That’s how I operationalize AI discovery for startups: intent mapping, structured content, external authority, a retrieval-friendly corpus, and a rigorous eval loop.


    Inspired by this post on Amplitude – Perspectives.


  • From Stone Soup to Insights: Eval-Driven Development That Supercharges AI Analytics

    I’ve learned that the most powerful AI features rarely emerge from lone-wolf brilliance—they’re born when a community rallies around a shared objective. “Building Amplitude’s AI for insight automation felt a lot like the fable of travelers making stone soup with their community.” That spirit captures how I approach shipping AI for analytics: bring focused ingredients, invite contributions, and let rigorous evaluation transform the result into something extraordinary.

    At the core is Eval-Driven Development. Rather than debating preferences, we define explicit evaluation sets, success thresholds, and guardrails, then wire them into CI/CD so every change is measured against reliability, quality, and relevance. For AI-driven analytics, our evals combine offline judgment tests (precision, recall, hallucination rates), user-centric measures (time-to-insight, actionability), and production health signals (failure modes, latency). When the bar rises, the product improves, continuously and measurably.

    We made “stone soup” by inviting contributions from every function. Data science established gold-standard datasets and baselines. Engineering implemented retrieval, orchestration, and safe deployment paths. Product and design framed high-value use cases, in-app guides, and UX writing that clarified intent. Customer success and support piped real-world edge cases into our evals so the system improved where it mattered. Product trios kept us outcome-focused, and empowered product teams moved quickly without sacrificing governance.

    Why this matters for analytics: AI insight automation reduces the heavy lift of exploring funnels, cohorts, anomalies, and retention patterns—accelerating activation and product-led growth. With a unified analytics platform and strong data governance, we can surface relevant patterns proactively, explain the “why” behind movements, and recommend next best actions without drowning users in noise. The result is faster decisions, cleaner handoffs between teams, and a tighter loop from observation to intervention.

    Our practical playbook is simple but strict: define a clear north-star outcome; curate representative eval sets that mirror real user questions; simulate A/B testing offline before live traffic; instrument time-to-insight and adoption; and integrate evals into CI/CD so regressions are caught before they ship. We monitor DORA metrics to maintain delivery velocity while holding the quality bar, and we use human-in-the-loop review to continuously refine prompts, patterns, and explanations.
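
    The CI/CD step in the playbook above can be sketched as a simple threshold gate. Everything here is illustrative: the `EvalResult` shape, the threshold values, and the `gate` function are my assumptions, not Amplitude's pipeline.

```python
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    true_pos: int
    false_pos: int
    false_neg: int


def precision(r: EvalResult) -> float:
    flagged = r.true_pos + r.false_pos
    return r.true_pos / flagged if flagged else 0.0


def recall(r: EvalResult) -> float:
    relevant = r.true_pos + r.false_neg
    return r.true_pos / relevant if relevant else 0.0


# Hypothetical minimums; a real team would tune these per use case.
THRESHOLDS = {"precision": 0.90, "recall": 0.80}


def gate(r: EvalResult) -> list[str]:
    """Return one message per metric that falls below its threshold."""
    metrics = {"precision": precision(r), "recall": recall(r)}
    return [
        f"{name} {metrics[name]:.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics[name] < minimum
    ]


if __name__ == "__main__":
    failures = gate(EvalResult(true_pos=90, false_pos=10, false_neg=30))
    for message in failures:
        print("EVAL FAIL:", message)
    # A non-zero exit code is what makes the CI job block the merge.
    sys.exit(1 if failures else 0)
```

    Run as a CI step, a script like this fails the build when any metric drops below its floor, which turns the eval set into a release criterion rather than a dashboard.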

    We also learned what doesn’t work. General-purpose prompts seldom transfer cleanly to analytics without domain grounding and context window management. A retrieval-first pipeline improves factuality, but only if metadata and event taxonomies are consistent. And while generative UX can delight in demos, it must earn trust in production through transparent reasoning, privacy-by-design, and predictable behavior under load.

    In the end, the stone soup metaphor isn’t about cute storytelling—it’s about disciplined collaboration. When a cross-functional community contributes the right ingredients and Eval-Driven Development keeps us honest, AI for insight automation becomes both credible and compounding. That’s how we turn analytics into action—and how we ship AI products that users rely on every day.


    Inspired by this post on Amplitude – Best Practices.

