What AI problem did Perk's team solve with a voice agent?

Perk's team built a voice AI agent that calls hotels to verify virtual credit card payments. The goal was to prevent travelers from arriving at hotels and finding that their rooms had not been paid for.

How did the Perk voice AI project start?

The project began as a hackathon experiment using Make.com. The team turned that no-code prototype into a production system without initially writing backend code.

How did Perk improve the reliability of the voice AI agent?

Perk moved from a single monolithic prompt to structured conversation stages such as IVR handling, booking confirmation, and payment request. Breaking the call into discrete stages made the agent's behavior easier to control and improve.

What evaluation systems did the team use for the voice AI agent?

The team created two eval systems: one for classifying call success and another using LLM-as-judge for conversational behavior. They also continued listening to calls manually to find edge cases and trust issues that dashboards could miss.

What prompt engineering lessons came from building a production voice AI system?

The post highlights challenges with numbers, booking references, pronunciation, and text-to-speech markup. It also notes that when the team expanded to German, writing prompts in the native language improved results.

What tools are mentioned in the Perk voice AI workflow?

The post mentions Make.com for no-code prototyping, Twilio as a voice and telephony provider, and Eleven Labs as a text-to-speech provider used in early experiments. The production approach paired these tools with structured prompts and evaluation loops.

What should product teams learn from Perk's agentic AI rollout?

The main lesson is to start with clear operational pain, validate quickly, then harden the system through stages, guardrails, and measurement. The post emphasizes combining automated evals with human reviews to scale quality with confidence.

From No-Code Hack to 10,000 Weekly Calls: Inside Perk’s Voice AI That Actually Works

I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.

What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?

Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.

What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.

From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.

They built a working prototype without writing a single line of backend code.

They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.

They created two eval systems: one for call success classification, another for conversational behavior.

They scaled from five calls a day to more than 10,000 per week while maintaining quality.

This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.

Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.

What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Their choice of Make.com for prototyping, and the fact that they reached production without initially writing backend code, underscores how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is exactly how you harden agent behavior for production.
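To make the staging idea concrete, here is a minimal Python sketch of a call modeled as a small state machine. The three stage names come from the episode; the event names, the prompt text, and the `Call` class are my own illustrative assumptions, not Perk's implementation (their prototype lived in Make.com, not Python).

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    IVR = "ivr"
    BOOKING_CONFIRMATION = "booking_confirmation"
    PAYMENT_REQUEST = "payment_request"
    DONE = "done"


# Hypothetical per-stage prompts; the real prompts are not published.
STAGE_PROMPTS = {
    Stage.IVR: "Navigate the phone menu until a person answers.",
    Stage.BOOKING_CONFIRMATION: "Confirm the booking reference with the hotel.",
    Stage.PAYMENT_REQUEST: "Ask the hotel to charge the virtual card.",
}

# Each stage has one (assumed) exit event and one follow-up stage.
TRANSITIONS = {
    Stage.IVR: ("human_answered", Stage.BOOKING_CONFIRMATION),
    Stage.BOOKING_CONFIRMATION: ("booking_found", Stage.PAYMENT_REQUEST),
    Stage.PAYMENT_REQUEST: ("payment_confirmed", Stage.DONE),
}


@dataclass
class Call:
    stage: Stage = Stage.IVR
    history: list = field(default_factory=list)

    def active_prompt(self):
        """Return the prompt for the current stage, or None once the call is done."""
        return STAGE_PROMPTS.get(self.stage)

    def on_event(self, event: str) -> Stage:
        """Record the event and advance only if it is the current stage's exit event."""
        self.history.append((self.stage, event))
        if self.stage in TRANSITIONS:
            exit_event, next_stage = TRANSITIONS[self.stage]
            if event == exit_event:
                self.stage = next_stage
        return self.stage
```

The point of the design is that the agent only ever sees one narrow prompt at a time, and an out-of-order event (say, `payment_confirmed` while still in the IVR) cannot skip a stage.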

Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.
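As a rough illustration of how two such eval tracks can sit side by side, here is a small Python sketch. Everything specific in it is an assumption on my part: the outcome label, the rubric criteria, and the `ask_llm` callable (a stand-in for whatever model client you use) are hypothetical and not taken from Perk's system.

```python
from collections import Counter
from typing import Callable

# Hypothetical outcome label for a call that achieved its goal.
SUCCESS = "payment_verified"

# Hypothetical behavior rubric for the LLM-as-judge track.
RUBRIC = ["polite", "stated_purpose_clearly", "read_reference_correctly"]


def success_rate(labels: list[str]) -> float:
    """Outcome eval: share of classified calls labeled as successful."""
    if not labels:
        return 0.0
    return Counter(labels)[SUCCESS] / len(labels)


def judge_transcript(transcript: str, ask_llm: Callable[[str], str]) -> dict[str, bool]:
    """Behavior eval: ask a judge model one PASS/FAIL question per criterion."""
    results = {}
    for criterion in RUBRIC:
        question = (
            f"Criterion: {criterion}\n"
            f"Transcript:\n{transcript}\n"
            "Answer PASS or FAIL."
        )
        verdict = ask_llm(question).strip().upper()
        results[criterion] = verdict == "PASS"
    return results
```

Keeping the two functions separate mirrors the split described in the episode: one number tells you whether calls achieve their goal, the other tells you how the agent behaved while getting there. Neither replaces listening to the recordings.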

Prompt engineering for voice (numbers, booking references, and text-to-speech markup) was non-trivial. Expanding to German revealed that prompts written in the native language improve results. And, as often happens with operations-heavy rollouts, this project uncovered other operational problems the team didn't know existed, which is valuable signal for the roadmap.
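One common way to tame booking references is to pre-process them into something a text-to-speech engine can read character by character. The Python sketch below is my own assumption about what that might look like, not the approach Perk describes: the function name, the grouping into threes, and the comma-as-pause convention are all hypothetical.

```python
# Spoken form for each digit, so the engine never reads "129" as a single number.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def speakable_reference(reference: str, group_size: int = 3) -> str:
    """Spell a booking reference one character at a time, in short groups."""
    # Drop separators such as dashes and spaces; keep letters and digits only.
    chars = [c for c in reference.upper() if c.isalnum()]
    spoken = [DIGIT_WORDS.get(c, c) for c in chars]
    # Commas between groups give most TTS engines a short pause (an assumption).
    groups = [
        " ".join(spoken[i:i + group_size])
        for i in range(0, len(spoken), group_size)
    ]
    return ", ".join(groups)
```

For example, `speakable_reference("ab-12 9x")` returns `"A B one, two nine X"`. Whether short groups and pauses actually help a given hotel receptionist is exactly the kind of thing you only learn by listening to calls.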

Resources & Links:

Perk

Make.com: No-code automation platform used for the prototype.

Twilio: Voice/telephony provider.

Eleven Labs: Text-to-speech provider (used in early experiments).

Chapters:

00:00 Introduction to the Team
01:54 Understanding PERK's Mission
02:59 Challenges in Travel Booking
07:27 AI Solutions for Customer Care
09:52 Prototyping with AI and Voice
17:00 Implementing AI in Production
25:51 Learning Through Trial and Error
26:40 Prompting Challenges and Solutions
27:58 Iterating on Prompts and Evaluations
30:08 Scaling and Production Challenges
32:43 Advanced Evaluation Techniques
35:32 Real-World Applications and Success
49:07 Future Directions and Expansion
53:53 Conclusion and Team Reflections

My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into discrete stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language. Your accuracy will thank you.

If you’re exploring agentic AI for operations, this is the blueprint: tight scoping, Make.com for speed, Twilio for reliability, structured prompts for control, and an eval-driven loop to scale quality with confidence.

Inspired by this post on Product Talk.