Inside Alyx: Dogfooding, Evals, and Observability That Power an Agentic AI Future


I’ve been deep in the work of building practical, agentic capabilities into AI products, so this story about Alyx immediately resonated with me. It’s a rare, clear-eyed look at what it actually takes to ship a useful AI agent inside an AI platform—while using that same platform to build, test, and continuously improve the agent.

What does it really take to build an AI agent inside an AI platform—especially when you’re using that same platform to build the agent?

Listening to SallyAnn DeLucia (Director of Product at Arize) and Jack Zhou (Staff Engineer at Arize) unpack Alyx—the AI agent that helps teams debug, optimize, and evaluate AI applications—I recognized playbooks I trust: start scrappy, dogfood relentlessly, build intuition with real users, and systematize improvement with thoughtful evals.

Their early phase looked exactly like the messy reality many of us try to hide: Jupyter notebooks, hacked-together web apps, and weekly dogfooding sessions with their customer success team. That’s where patterns emerged, confidence was built, and the highest-leverage skills for the agent were prioritized. It’s a reminder that “vibe checks” matter at first—but you must quickly graduate to measurable, repeatable learning loops.

In my experience, the foundation of GenAI product quality is threefold: tracing, observability, and evals. They reached the same conclusion—defining traces across tool calls and sessions, creating observability into model behavior, and layering evals to compare both micro-decisions and system-level outcomes. That discipline converts hunches into evidence and makes agent behavior improvable, not mysterious.
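To make "defining traces across tool calls and sessions" concrete, here is a minimal sketch of the idea in plain Python. It is not Arize's implementation or API: the Span and Tracer classes, the session_id field, and the search_traces tool are hypothetical names used only for illustration. Each tool call is recorded as a span carrying its inputs, output, error, and duration, grouped by session so the records can be inspected or evaluated later.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One recorded step, such as a single tool call (hypothetical schema)."""
    name: str
    session_id: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Tracer:
    """Collects spans in memory so they can be inspected or evaluated later."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, session_id, **inputs):
        record = Span(name=name, session_id=session_id, inputs=inputs)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Keep the failure on the span, then let the caller handle it.
            record.error = repr(exc)
            raise
        finally:
            record.duration_s = time.perf_counter() - start
            self.spans.append(record)

    def by_session(self, session_id):
        """Return every span recorded for one user session."""
        return [s for s in self.spans if s.session_id == session_id]


def search_traces(query):
    """Stand-in tool; a real agent would call an actual backend here."""
    return [f"trace matching {query!r}"]


tracer = Tracer()
with tracer.span("search_traces", session_id="session-1", query="timeouts") as s:
    s.output = search_traces("timeouts")
```

Because every tool call passes through the same wrapper, the recorded spans become the raw material for both observability (what happened, and how long it took) and evals (was it any good).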

What stood out was how cross-functional, boundary-spanning teams made the difference. Customer success engineers surfaced repeatable workflows. Product framed early skills. Engineering wrapped prototype tools into something coherent. Using their own platform to build Alyx accelerated intuition and de-risked launch. That’s the product loop I aim to cultivate: close to customers, close to data, and fast to learn.

As Alyx matures, the next step is moving from “on rails” workflows to more autonomous, agentic planning loops. That evolution requires stronger tool design, richer feedback signals, and evals that reflect end-to-end user value. It’s exactly the shift I expect across GenAI: from scripted assistants to adaptive systems that reason, plan, and act with guardrails.

Listen to this episode on: Spotify | Apple Podcasts

Guests:

SallyAnn DeLucia, Director of Product, Arize

Jack Zhou, Staff Engineer, Arize

In this episode, we cover:

What tracing, observability, and evals really mean in GenAI applications

How Arize used its own platform to build Alyx, its AI agent

The role of customer success engineers in surfacing repeatable workflows

Why early prototyping looked like messy notebooks and hacked-together local apps

How dogfooding shaped Alyx’s evolution and built confidence for launch

Why evals start messy, and how Arize layered evals across tool calls, sessions, and system-level decisions

The importance of cross-functional, boundary-spanning teams in building AI products

What’s next for Alyx: moving from “on rails” workflows to more autonomous, agentic planning loops

My takeaways for product teams building GenAI agents are simple and hard: design tools with observability in mind; operationalize evals early even if they’re imperfect; embed customer-facing engineers in the loop to capture real workflows; and keep the first skills narrow, high-impact, and testable. If your team can move from demos to disciplined measurement quickly, you’ll accelerate product-market fit.

Resources & Links

Arize AI — Sign up for a free account and try Alyx

Arize Blog — Lessons learned from building AI products

Maven AI Evals Course — The course Teresa took to learn about evals (Get 35% off with Teresa’s affiliate link)

Cursor — The AI-powered code editor used by the Arize engineering team

Datadog — For understanding application traces

OpenAI GPT Models — GPT-3.5, GPT-4, and newer models used in early and current versions of Alyx

Jupyter Notebooks — A tool for combining code, data, and notes, used in Arize’s prototyping

Axial Coding Method by Hamel Husain — A framework for analyzing data and designing evals

Chapters

00:00 Introduction to SallyAnn and Jack

01:08 Overview of Arize.ai and Its Core Components

01:44 Deep Dive into Tracing, Observability, and Evals

03:56 Introduction to Alyx: Arize's AI Agent

04:15 The Genesis and Evolution of Alyx

08:51 Challenges and Solutions in Building Alyx

24:33 Prototyping and Early Development of Alyx

26:22 Exploring the Power of Coding Notebooks

26:51 Early Experiments with Alyx

27:59 Challenges with Real Data

29:20 Internal Testing and Dogfooding

31:55 The Importance of Evals

35:16 Developing Custom Evals

43:09 Future Plans for Alyx

47:59 How to Get Started with Alyx


If you’re building in GenAI right now, this conversation offers a pragmatic blueprint. Start with high-signal workflows, turn qualitative insights into quantitative evals, and use tracing plus observability to make agents debuggable. That’s how scrappy prototypes become reliable systems. And if you want a concrete starting point, the chapter at 47:59, “How to Get Started with Alyx,” is a helpful on-ramp.


Inspired by this post on Product Talk.



What three elements form the foundation of GenAI product quality?

Tracing, observability, and evals form the foundation. The post explains that defining traces across tool calls and sessions, observing model behavior, and layering evals to compare micro-decisions and system-level outcomes turns hunches into evidence and makes agent behavior improvable.

How did Arize use its own platform to build Alyx?

Arize used its platform to build Alyx by dogfooding, wrapping prototype tools into a coherent stack, and conducting regular sessions with their customer success team to surface real workflows. Using their own platform accelerated intuition and de-risked launch.

What role did customer success engineers play in the Alyx project?

Customer success engineers surfaced repeatable workflows. Their involvement helped frame early skills and align product development with real customer needs.

Why did prototyping look messy in Alyx’s early days?

Early prototyping looked like messy notebooks and hacked-together local apps. It was a realistic starting point that allowed patterns to emerge and confidence to grow.

How do evals evolve, according to the post?

Evals start messy and are layered across tool calls, sessions, and system-level decisions. This layered approach turns hunches into evidence and reduces uncertainty about agent behavior.
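As a rough illustration of what "layered" can mean here, the sketch below scores the same records at three levels: the individual tool call, the session, and the system as a whole. The record fields and the pass criteria are assumptions made up for this example, not Arize's actual evals. In practice the tool-call and session checks might be LLM-judged or human-labeled rather than simple rules.

```python
from collections import defaultdict

# Hypothetical trace records: one dict per tool call, tagged with its session.
spans = [
    {"session": "a", "tool": "search_traces", "error": None, "output": ["t1"]},
    {"session": "a", "tool": "summarize", "error": None, "output": "summary"},
    {"session": "b", "tool": "search_traces", "error": "timeout", "output": None},
]


def tool_call_ok(span):
    """Tool-call layer: the call finished without error and returned something."""
    return span["error"] is None and bool(span["output"])


def session_ok(session_spans):
    """Session layer: every tool call in the session passed."""
    return all(tool_call_ok(s) for s in session_spans)


def system_pass_rate(all_spans):
    """System layer: per-session results plus the fraction of sessions that passed."""
    sessions = defaultdict(list)
    for s in all_spans:
        sessions[s["session"]].append(s)
    results = {sid: session_ok(group) for sid, group in sessions.items()}
    return results, sum(results.values()) / len(results)


per_session, pass_rate = system_pass_rate(spans)
print(per_session, pass_rate)
```

Keeping the layers separate makes a regression easier to locate: a drop in the system-level pass rate can be traced to the failing sessions, and from there to the specific tool calls that broke.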

What’s next for Alyx?

The next step is moving from on-rails workflows to more autonomous, agentic planning loops. This evolution requires stronger tool design, richer feedback signals, and evals that reflect end-to-end user value.

What are the key takeaways for GenAI product teams from this post?

Key takeaways include designing tools with observability in mind and embedding customer-facing engineers in the loop. Also, operationalize evals early and keep initial skills narrow, high-impact, and testable to speed up product-market fit.
