What is Alyx in the context of this article?

Alyx is described as Arize’s AI agent that helps teams debug, optimize, and evaluate AI applications. The article uses Alyx as an example of building an AI agent inside an AI platform while using that same platform to improve it.

Why are tracing, observability, and evals important for GenAI agents?

The article frames tracing, observability, and evals as the foundation of GenAI product quality. They make model behavior visible across tool calls and sessions, turning hunches into evidence and making agent behavior improvable.

What role did customer success engineers play in Alyx’s development?

Customer success engineers surfaced repeatable workflows that helped the team understand where Alyx could provide value. Their input supported a product loop that stayed close to customers, close to data, and fast to learn.

What should product teams learn from Alyx’s early prototyping process?

The article recommends starting with narrow, high-impact, testable skills and moving quickly from qualitative feedback to measurable learning loops. Early prototypes can be messy, but teams should operationalize evals and observability before relying on demos alone.

Inside Alyx: Dogfooding, Evals, and Observability That Power an Agentic AI Future

How did Arize dogfood its own platform while building Alyx?

The article says Arize used its own platform to build Alyx, paired with scrappy prototypes, Jupyter notebooks, hacked-together web apps, and weekly dogfooding sessions. Those sessions helped the team find patterns, build confidence, and prioritize the agent’s highest-leverage skills.

What is the next frontier for Alyx and similar GenAI agents?

The article points to a shift from on-rails workflows toward more autonomous, agentic planning loops. That evolution requires stronger tool design, richer feedback signals, and evals that reflect end-to-end user value.

I’ve been deep in the work of building practical, agentic capabilities into AI products, so this story about Alyx immediately resonated with me. It’s a rare, clear-eyed look at what it actually takes to ship a useful AI agent inside an AI platform—while using that same platform to build, test, and continuously improve the agent.

What does it really take to build an AI agent inside an AI platform—especially when you’re using that same platform to build the agent?

Listening to SallyAnn DeLucia (Director of Product at Arize) and Jack Zhou (Staff Engineer at Arize) unpack Alyx—the AI agent that helps teams debug, optimize, and evaluate AI applications—I recognized playbooks I trust: start scrappy, dogfood relentlessly, build intuition with real users, and systematize improvement with thoughtful evals.

Their early phase looked exactly like the messy reality many of us try to hide: Jupyter notebooks, hacked-together web apps, and weekly dogfooding sessions with their customer success team. That’s where patterns emerged, confidence was built, and the highest-leverage skills for the agent were prioritized. It’s a reminder that “vibe checks” matter at first—but you must quickly graduate to measurable, repeatable learning loops.
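To make that graduation concrete, here is a minimal sketch of how notes from a dogfooding session could be tallied into ranked failure patterns. The note schema, the label names, and the `rank_failure_modes` helper are all my own assumptions for illustration; the episode does not describe the format Arize actually used.

```python
from collections import Counter

# Hypothetical dogfooding notes: one entry per reviewed session, each tagged
# by a reviewer with zero or more failure labels (labels are made up here).
NOTES = [
    {"session": "s1", "labels": ["wrong_tool"]},
    {"session": "s2", "labels": ["wrong_tool", "verbose_answer"]},
    {"session": "s3", "labels": ["missing_context"]},
    {"session": "s4", "labels": ["wrong_tool"]},
]


def rank_failure_modes(notes):
    """Return (label, count, share_of_sessions) tuples, most frequent first."""
    counts = Counter(label for note in notes for label in note["labels"])
    total = len(notes)
    return [(label, n, n / total) for label, n in counts.most_common()]


ranked = rank_failure_modes(NOTES)
for label, count, share in ranked:
    print(f"{label}: {count} sessions ({share:.0%})")
```

Even a tally this crude turns a pile of impressions into a priority list, which is usually enough to decide which eval to write first.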

In my experience, the foundation of GenAI product quality is threefold: tracing, observability, and evals. SallyAnn and Jack reached the same conclusion: they defined traces across tool calls and sessions, built observability into model behavior, and layered evals to compare both micro-decisions and system-level outcomes. That discipline converts hunches into evidence and makes agent behavior improvable, not mysterious.
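As a rough illustration of what tracing across tool calls and sessions can look like, here is a standard-library sketch. The `Span` and `SessionTrace` classes and the `search_docs` tool are hypothetical names of my own; this is not Arize's tracing API, only the general shape of the idea.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Span:
    """One recorded tool call inside a session (hypothetical schema)."""
    session_id: str
    tool: str
    args: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


@dataclass
class SessionTrace:
    """Collects every span produced during one user session."""
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def call(self, tool: Callable[..., Any], **args: Any) -> Any:
        """Run a tool and record its inputs, output, error, and latency."""
        span = Span(self.session_id, tool.__name__, args)
        start = time.perf_counter()
        try:
            span.output = tool(**args)
            return span.output
        except Exception as exc:
            span.error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            span.duration_s = time.perf_counter() - start
            self.spans.append(span)


def search_docs(query: str) -> list:
    """Stand-in tool; a real agent would query a search index here."""
    return [f"doc about {query}"]


trace = SessionTrace()
trace.call(search_docs, query="evals")
print(trace.spans[0].tool, trace.spans[0].output)
```

Because every span carries the same `session_id`, later evals can score a single tool call or the whole session from the same records.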

What stood out was how cross-functional, boundary-spanning teams made the difference. Customer success engineers surfaced repeatable workflows. Product framed early skills. Engineering wrapped prototype tools into something coherent. Using their own platform to build Alyx accelerated intuition and de-risked launch. That’s the product loop I aim to cultivate: close to customers, close to data, and fast to learn.

As Alyx matures, the next step is moving from “on rails” workflows to more autonomous, agentic planning loops. That evolution requires stronger tool design, richer feedback signals, and evals that reflect end-to-end user value. It’s exactly the shift I expect across GenAI: from scripted assistants to adaptive systems that reason, plan, and act with guardrails.
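To show what "plan and act with guardrails" might mean in code, here is a toy planning loop with two guardrails: a step budget and a tool allowlist. The `run_agent` function, the scripted planner, and the `lookup` tool are assumptions of mine, not a description of how Alyx is built.

```python
def run_agent(plan_next, tools, goal, max_steps=5):
    """Let a planner choose tools until it finishes or the step budget runs out.

    plan_next(goal, history) returns (tool_name, argument); the special
    tool name "finish" ends the loop with the argument as the answer.
    """
    history = []
    for _ in range(max_steps):
        tool_name, arg = plan_next(goal, history)
        if tool_name == "finish":
            return {"status": "done", "answer": arg, "history": history}
        if tool_name not in tools:
            # Guardrail: never execute a tool outside the allowlist.
            history.append((tool_name, arg, "error: tool not allowed"))
            continue
        history.append((tool_name, arg, tools[tool_name](arg)))
    # Guardrail: stop after max_steps instead of looping forever.
    return {"status": "step_limit", "answer": None, "history": history}


def scripted_planner(goal, history):
    """Toy planner; a real agent would ask a model for the next action."""
    if not history:
        return ("lookup", goal)
    return ("finish", f"Found: {history[-1][2]}")


TOOLS = {"lookup": lambda q: f"notes on {q}"}

result = run_agent(scripted_planner, TOOLS, "eval drift")
print(result["status"], result["answer"])
```

The `history` list doubles as a trace of the run, so an end-to-end eval can inspect both the final answer and the path taken to reach it.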

Listen to this episode on: Spotify | Apple Podcasts

Guests:

SallyAnn DeLucia, Director of Product, Arize

Jack Zhou, Staff Engineer, Arize

In this episode, we cover:

What tracing, observability, and evals really mean in GenAI applications

How Arize used its own platform to build Alyx, its AI agent

The role of customer success engineers in surfacing repeatable workflows

Why early prototyping looked like messy notebooks and hacked-together local apps

How dogfooding shaped Alyx’s evolution and built confidence for launch

Why evals start messy, and how Arize layered evals across tool calls, sessions, and system-level decisions

The importance of cross-functional, boundary-spanning teams in building AI products

What’s next for Alyx: moving from “on rails” workflows to more autonomous, agentic planning loops

My takeaways for product teams building GenAI agents are simple to state and hard to execute: design tools with observability in mind; operationalize evals early even if they’re imperfect; embed customer-facing engineers in the loop to capture real workflows; and keep the first skills narrow, high-impact, and testable. If your team can move from demos to disciplined measurement quickly, you’ll accelerate product-market fit.
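An imperfect early eval can be very small. The sketch below checks one narrow skill: whether the agent picks the expected tool for a prompt. The tool names, the test cases, and the keyword router standing in for the agent are all hypothetical; in practice `choose_tool` would wrap a real model call.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One labeled example for a single narrow skill (hypothetical schema)."""
    prompt: str
    expected_tool: str


def run_tool_choice_eval(choose_tool, cases):
    """Score how often choose_tool(prompt) returns the expected tool name."""
    failures = []
    for case in cases:
        got = choose_tool(case.prompt)
        if got != case.expected_tool:
            failures.append(
                {"prompt": case.prompt, "got": got, "expected": case.expected_tool}
            )
    total = len(cases)
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"total": total, "accuracy": accuracy, "failures": failures}


def keyword_router(prompt: str) -> str:
    """Toy stand-in for the agent's tool-selection step."""
    text = prompt.lower()
    if "latency" in text or "slow" in text:
        return "trace_search"
    if "prompt" in text:
        return "prompt_optimizer"
    return "general_help"


CASES = [
    EvalCase("Why is my chatbot slow?", "trace_search"),
    EvalCase("Improve this prompt for me", "prompt_optimizer"),
    EvalCase("Find traces with hallucinations", "trace_search"),
]

report = run_tool_choice_eval(keyword_router, CASES)
print(f"accuracy: {report['accuracy']:.2f}")
for failure in report["failures"]:
    print("missed:", failure)
```

The failures list matters as much as the score: each miss is a concrete case to bring to the next dogfooding session.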

Resources & Links

Arize AI — Sign up for a free account and try Alyx

Arize Blog — Lessons learned from building AI products

Maven AI Evals Course — The course Teresa took to learn about evals (Get 35% off with Teresa’s affiliate link)

Cursor — The AI-powered code editor used by the Arize engineering team

Datadog — For understanding application traces

OpenAI GPT Models — GPT-3.5, GPT-4, and newer models used in early and current versions of Alyx

Jupyter Notebooks — A tool for combining code, data, and notes, used in Arize’s prototyping

Axial Coding Method by Hamel Husain — A framework for analyzing data and designing evals

Chapters

00:00 Introduction to SallyAnn and Jack

01:08 Overview of Arize.ai and Its Core Components

01:44 Deep Dive into Tracing, Observability, and Evals

03:56 Introduction to Alyx: Arize's AI Agent

04:15 The Genesis and Evolution of Alyx

08:51 Challenges and Solutions in Building Alyx

24:33 Prototyping and Early Development of Alyx

26:22 Exploring the Power of Coding Notebooks

26:51 Early Experiments with Alyx

27:59 Challenges with Real Data

29:20 Internal Testing and Dogfooding

31:55 The Importance of Evals

35:16 Developing Custom Evals

43:09 Future Plans for Alyx

47:59 How to Get Started with Alyx

If you’re building in GenAI right now, this conversation offers a pragmatic blueprint. Start with high-signal workflows, turn qualitative insights into quantitative evals, and use tracing plus observability to make agents debuggable. That’s how scrappy prototypes become reliable systems. And if you want a tangible example, the chapter at 47:59, “How to Get Started with Alyx,” is a helpful on-ramp.

Inspired by this post on Product Talk.