I’ve been deep in the work of building practical, agentic capabilities into AI products, so this story about Alyx immediately resonated with me. It’s a rare, clear-eyed look at what it actually takes to ship a useful AI agent inside an AI platform—while using that same platform to build, test, and continuously improve the agent.
What does it really take to build an AI agent inside an AI platform—especially when you’re using that same platform to build the agent?
Listening to SallyAnn DeLucia (Director of Product at Arize) and Jack Zhou (Staff Engineer at Arize) unpack Alyx—the AI agent that helps teams debug, optimize, and evaluate AI applications—I recognized playbooks I trust: start scrappy, dogfood relentlessly, build intuition with real users, and systematize improvement with thoughtful evals.
Their early phase looked exactly like the messy reality many of us try to hide: Jupyter notebooks, hacked-together web apps, and weekly dogfooding sessions with their customer success team. That’s where patterns emerged, confidence was built, and the highest-leverage skills for the agent were prioritized. It’s a reminder that “vibe checks” matter at first—but you must quickly graduate to measurable, repeatable learning loops.
In my experience, the foundation of GenAI product quality is threefold: tracing, observability, and evals. They reached the same conclusion—defining traces across tool calls and sessions, creating observability into model behavior, and layering evals to compare both micro-decisions and system-level outcomes. That discipline converts hunches into evidence and makes agent behavior improvable, not mysterious.
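To make that concrete, here is a minimal sketch of what tracing tool calls within a session and layering evals on top could look like. It is not Arize's API or Alyx's implementation; every name here (Span, SessionTrace, eval_tool_call, eval_session) is hypothetical and exists only to illustrate the idea of a micro-level check per tool call plus a system-level check per session.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Span:
    """One tool call recorded inside a session trace (hypothetical structure)."""
    tool: str
    inputs: dict
    output: Any
    error: Optional[str]
    duration_s: float


@dataclass
class SessionTrace:
    """Collects every tool call made during one agent session."""
    session_id: str
    spans: list = field(default_factory=list)

    def call_tool(self, name: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        # Run the tool, capture the result or the error, and always record a span.
        start = time.perf_counter()
        output, error = None, None
        try:
            output = fn(**inputs)
        except Exception as exc:
            error = repr(exc)
        self.spans.append(Span(name, inputs, output, error, time.perf_counter() - start))
        return output


def eval_tool_call(span: Span) -> bool:
    """Micro-level eval: the call finished without error and returned something."""
    return span.error is None and span.output is not None


def eval_session(trace: SessionTrace, expected_tools: list) -> dict:
    """System-level eval: pass rate across calls, and whether the expected tools ran in order."""
    passed = sum(eval_tool_call(s) for s in trace.spans)
    total = len(trace.spans)
    return {
        "tool_pass_rate": passed / total if total else 0.0,
        "used_expected_tools": [s.tool for s in trace.spans] == expected_tools,
    }


# Usage with stand-in tools; the second call fails on purpose.
trace = SessionTrace("demo-session")
trace.call_tool("lookup", lambda query: {"hits": 3}, query="slow traces")
trace.call_tool("summarize", lambda text: 1 / 0, text="...")
report = eval_session(trace, ["lookup", "summarize"])
```

Because every call leaves a span behind, a failed tool call shows up as data you can evaluate rather than as a mystery in the final answer.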
What stood out was how cross-functional, boundary-spanning teams made the difference. Customer success engineers surfaced repeatable workflows. Product framed early skills. Engineering wrapped prototype tools into something coherent. Using their own platform to build Alyx accelerated intuition and de-risked launch. That’s the product loop I aim to cultivate: close to customers, close to data, and fast to learn.
As Alyx matures, the next step is moving from “on rails” workflows to more autonomous, agentic planning loops. That evolution requires stronger tool design, richer feedback signals, and evals that reflect end-to-end user value. It’s exactly the shift I expect across GenAI: from scripted assistants to adaptive systems that reason, plan, and act with guardrails.
Listen to this episode on: Spotify | Apple Podcasts
Guests:
SallyAnn DeLucia, Director of Product, Arize
Jack Zhou, Staff Engineer, Arize
In this episode, we cover:
What tracing, observability, and evals really mean in GenAI applications
How Arize used its own platform to build Alyx, its AI agent
The role of customer success engineers in surfacing repeatable workflows
Why early prototyping looked like messy notebooks and hacked-together local apps
How dogfooding shaped Alyx’s evolution and built confidence for launch
Why evals start messy, and how Arize layered evals across tool calls, sessions, and system-level decisions
The importance of cross-functional, boundary-spanning teams in building AI products
What’s next for Alyx: moving from “on rails” workflows to more autonomous, agentic planning loops
My takeaways for product teams building GenAI agents are simple and hard: design tools with observability in mind; operationalize evals early even if they’re imperfect; embed customer-facing engineers in the loop to capture real workflows; and keep the first skills narrow, high-impact, and testable. If your team can move from demos to disciplined measurement quickly, you’ll accelerate product-market fit.
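As one way to picture "operationalize evals early even if they're imperfect," here is a small sketch of an eval harness for a single narrow skill. The skill, the cases, and the names (EvalCase, run_evals, stub_skill) are all hypothetical stand-ins I made up for illustration; a real agent skill would call a model and the checks would likely be richer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case: a prompt plus a check that returns True for acceptable output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_evals(skill: Callable[[str], str], cases: list) -> dict:
    """Run every case against the skill and report per-case results and a pass rate."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(skill(case.prompt)))
        except Exception:
            # A crash in the skill or the check counts as a failure, not a skipped case.
            results[case.name] = False
    passed = sum(results.values())
    return {"pass_rate": passed / len(cases) if cases else 0.0, "results": results}


def stub_skill(prompt: str) -> str:
    """Hypothetical stand-in for a narrow agent skill that labels an issue type."""
    return "latency" if "slow" in prompt else "unknown"


cases = [
    EvalCase("flags_latency", "why is my app slow?", lambda out: out == "latency"),
    EvalCase("flags_hallucination", "why are answers made up?", lambda out: out == "hallucination"),
]
summary = run_evals(stub_skill, cases)
```

Even a handful of cases like these gives you a number to watch as the skill changes, which is the point of starting before the evals are polished.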
Resources & Links
Arize AI — Sign up for a free account and try Alyx
Arize Blog — Lessons learned from building AI products
Maven AI Evals Course — The course Teresa took to learn about evals (Get 35% off with Teresa’s affiliate link)
Cursor — The AI-powered code editor used by the Arize engineering team
Datadog — For understanding application traces
OpenAI GPT Models — GPT-3.5, GPT-4, and newer models used in early and current versions of Alyx
Jupyter Notebooks — A tool for combining code, data, and notes, used in Arize’s prototyping
Axial Coding Method by Hamel Husain — A framework for analyzing data and designing evals
Chapters
00:00 Introduction to SallyAnn and Jack
01:08 Overview of Arize.ai and Its Core Components
01:44 Deep Dive into Tracing, Observability, and Evals
03:56 Introduction to Alyx: Arize's AI Agent
04:15 The Genesis and Evolution of Alyx
08:51 Challenges and Solutions in Building Alyx
24:33 Prototyping and Early Development of Alyx
26:22 Exploring the Power of Coding Notebooks
26:51 Early Experiments with Alyx
27:59 Challenges with Real Data
29:20 Internal Testing and Dogfooding
31:55 The Importance of Evals
35:16 Developing Custom Evals
43:09 Future Plans for Alyx
47:59 How to Get Started with Alyx
If you’re building in GenAI right now, this conversation offers a pragmatic blueprint. Start with high-signal workflows, turn qualitative insights into quantitative evals, and use tracing plus observability to make agents debuggable. That’s how scrappy prototypes become reliable systems. And if you want a tangible starting point, the chapter at 47:59, “How to Get Started with Alyx,” is a helpful on-ramp.
Inspired by this post on Product Talk.











