What I Learned Building AI Teacher Assistants: RAG, Evals, and Designs Teachers Love

Podcast cover for Just Now Possible featuring eSpark’s Teaching Assistant, with bold typography, a light-blue neural network graphic, and the line “with Teresa Torres” on a navy field with a yellow footer.

How do you build an AI-powered assistant that teachers will actually use?

As a VP of Product Management who ships AI features to real users, I’ve learned that the answer starts with deep empathy and ends with disciplined engineering. I recently dug into a compelling case study of K–5 edtech, where a team with more than a decade of experience building adaptive learning tools launched an AI-powered Teacher Assistant to help educators align supplemental lessons with district-mandated core curricula. The result is a practical blueprint for product leaders navigating gen AI in high-stakes environments.

In this episode of Just Now Possible, Teresa Torres talks with Thom van der Doef (Principal Product Designer), Mary Gurley (Director of Learning Design & Product Manager), and Ray Lyons (VP of Product & Engineering) from eSpark. Listening through a product lens, I focused on what translated from vision to value in busy classrooms—and why some early instincts (like a chatbot-first UI) didn’t survive contact with reality.

Listen to this episode on: Spotify | Apple Podcasts

Here’s what stood out to me. Post-COVID shifts in education created new pressures for teachers and administrators, amplifying the gap between top-down mandates and classroom realities. The team’s first instinct—a chatbot interface—failed in testing, and what ultimately worked was a more structured workflow that mapped to how teachers actually plan, select, and assign lessons. That’s a timeless product discovery lesson: meet users where they are, especially when their cognitive load is already maxed.

On the technical side, their first RAG system surfaced all the usual suspects—and all the usual surprises. The team had to learn to wrangle embeddings, debug semantic search vs. keyword search, and tune retrieval to the nuance of curricula, standards, and lesson objectives. As someone who has shipped RAG-backed features, I appreciate how much of the work happens in the unglamorous middle: data quality, ontology decisions, metadata hygiene, and evaluation strategy.

Speaking of evaluation, their background in education shaped a surprisingly rigorous eval process, long before “evals” became a buzzword. They leaned on rubrics, Braintrust, and a human-in-the-loop approach to ensure the assistant’s recommendations were accurate, aligned, and classroom-ready. It’s a reminder that in domains like education and healthcare, model observability and structured evaluation are non-negotiable for product-market fit.

The most energizing signal for me: they’ve learned from thousands of teachers using the product this school year—and they’re already translating that learning into roadmap bets. What’s next for Teacher Assistant: more contextual recommendations using student data. Done well, that shift moves the product from “helpful” to “indispensable,” grounding gen AI in student outcomes rather than generic assistance.

Show notes for context: Guests include Thom van der Doef, Principal Product Designer at eSpark; Mary [last name], Director of Learning Design & Product Manager at eSpark; and Ray Lyons, VP of Product & Engineering at eSpark. Topics covered span the origin story of Teacher Assistant (connecting administrator mandates with teacher needs), why the team abandoned a chatbot interface in favor of a more structured workflow, how retrieval augmented generation (RAG) and embeddings shaped the product architecture, lessons learned from debugging semantic search vs. keyword search, building evals with rubrics, Braintrust, and a human-in-the-loop approach, and what’s next for Teacher Assistant: more contextual recommendations using student data.

If you like to follow along chronologically, the chapter flow is tight and practical: 02:05 Overview of Epar's Adaptive Learning Program; 07:19 Challenges and Insights from COVID-19; 17:06 Developing the Teacher Assistant Feature; 24:55 User Experience and Interface Evolution; 34:29 Chat GPT-5's New Features; 35:16 Balancing Engagement and Efficiency; 35:40 Seasonal Business and Real Traffic; 36:29 Technical Decisions and RAG Implementation; 38:28 Challenges with Embeddings and Metadata; 41:24 Improving Recommendations and Data Enrichment; 55:18 Evaluating the Teaching Assistant; 01:05:51 Future Plans and User Feedback; 01:07:57 Conclusion and Final Thoughts.

Useful links if you want to go deeper: eSpark Learning; Braintrust.dev – evals and observability for LLM applications; AI Evals Maven Course by Hamel Husain and Shreya Shanker.

My product takeaways for anyone building AI in complex, regulated, or mission-driven domains: First, resist the chatbot reflex; many users need structured, high-signal workflows. Second, treat retrieval as a product surface—data modeling, metadata, and domain language matter as much as model choice. Third, invest early in evals with rubric-based scoring and human-in-the-loop reviews to protect trust. Finally, plan for seasonality and “real traffic” patterns; the strongest eval is usage in production with tight feedback loops from your most demanding users.

Gen AI is only as valuable as the outcomes it enables. In classrooms, that means saving teachers time, raising instructional alignment, and ultimately improving student learning. This case study shows that when we combine empathetic product discovery with disciplined RAG architecture and rigorous evals, AI stops being a demo—and starts being a difference-maker.


Inspired by this post on Product Talk.


Book a consult png image

Why did the team move away from a chatbot-first UI?

Testing showed the chatbot interface failed, so they adopted a structured workflow that maps to how teachers plan and assign lessons. They emphasize data quality, ontology decisions, metadata hygiene, and evaluation strategy to ensure trust.

How did they handle RAG and retrieval in the project?

They wrangled embeddings, debugged semantic search vs. keyword search, and tuned retrieval to reflect curricula, standards, and lesson objectives. The work also highlighted the importance of data quality, ontology decisions, metadata hygiene, and evaluation strategy.

What role did evaluation play?

They used rubrics, Braintrust, and a human-in-the-loop approach to ensure the assistant’s recommendations were accurate and classroom-ready. It’s a reminder that model observability and structured evaluation are essential for trust and product-market fit.

What is next for Teacher Assistant?

The next step is more contextual recommendations using student data. This shift moves the product from helpful to indispensable, grounding gen AI in student outcomes.

What is the overarching message about AI in regulated domains?

Gen AI is only as valuable as the outcomes it enables. In classrooms, that means saving teachers time, improving instructional alignment, and boosting student learning.

What practical takeaway about UI design and workflows does the post offer?

Resist the chatbot reflex and favor structured, high-signal workflows to reduce cognitive load. Meet users where they are to improve adoption.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Signup for Weekly Digest Emails

Categories

Archieve