What is the main lesson from building AI teacher assistants?

The post argues that AI teacher assistants need more than a chatbot interface. They work best when the product is grounded in real classroom workflows, deep user empathy, and disciplined engineering.

Why did the eSpark team move away from a chatbot-first design?

Testing showed that a chatbot-first UI did not match how busy teachers plan, select, and assign lessons. A more structured workflow better supported teachers with high cognitive load.

How did RAG shape the Teacher Assistant product architecture?

The team used retrieval augmented generation to connect supplemental lessons with curricula, standards, and lesson objectives. The post highlights embeddings, semantic search versus keyword search, metadata hygiene, and retrieval tuning as important product decisions.

Why are evals important for AI products in education?

The post describes evals as essential for trust in high-stakes domains like education. Rubrics, Braintrust, and human-in-the-loop reviews helped check whether recommendations were accurate, aligned, and classroom-ready.

What product discovery takeaway does the post emphasize for gen AI teams?

The key takeaway is to resist the chatbot reflex and meet users where they are. For complex or mission-driven domains, structured workflows, domain language, real traffic, and feedback loops matter as much as model choice.

What is next for eSpark’s Teacher Assistant according to the post?

The post says the next roadmap direction is more contextual recommendations using student data. The goal is to move from generic assistance toward recommendations tied more directly to student outcomes.

What I Learned Building AI Teacher Assistants: RAG, Evals, and Designs Teachers Love

How do you build an AI-powered assistant that teachers will actually use?

As a VP of Product Management who ships AI features to real users, I’ve learned that the answer starts with deep empathy and ends with disciplined engineering. I recently dug into a compelling case study of K–5 edtech, where a team with more than a decade of experience building adaptive learning tools launched an AI-powered Teacher Assistant to help educators align supplemental lessons with district-mandated core curricula. The result is a practical blueprint for product leaders navigating gen AI in high-stakes environments.

In this episode of Just Now Possible, Teresa Torres talks with Thom van der Doef (Principal Product Designer), Mary Gurley (Director of Learning Design & Product Manager), and Ray Lyons (VP of Product & Engineering) from eSpark. Listening through a product lens, I focused on what translated from vision to value in busy classrooms—and why some early instincts (like a chatbot-first UI) didn’t survive contact with reality.

Listen to this episode on: Spotify | Apple Podcasts

Here’s what stood out to me. Post-COVID shifts in education created new pressures for teachers and administrators, amplifying the gap between top-down mandates and classroom realities. The team’s first instinct—a chatbot interface—failed in testing, and what ultimately worked was a more structured workflow that mapped to how teachers actually plan, select, and assign lessons. That’s a timeless product discovery lesson: meet users where they are, especially when their cognitive load is already maxed.

On the technical side, their first RAG system surfaced all the usual suspects—and all the usual surprises. The team had to learn to wrangle embeddings, debug semantic search vs. keyword search, and tune retrieval to the nuance of curricula, standards, and lesson objectives. As someone who has shipped RAG-backed features, I appreciate how much of the work happens in the unglamorous middle: data quality, ontology decisions, metadata hygiene, and evaluation strategy.
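To make the semantic-versus-keyword distinction concrete, here is a minimal sketch of hybrid retrieval in plain Python. It is not eSpark's implementation. The lesson titles, standards codes, three-dimensional "embeddings", and the `alpha` blending weight are all hypothetical, and a real system would get its vectors from an embedding model. The sketch filters on metadata first (grade level), then ranks candidates by a blend of cosine similarity and keyword overlap.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Lesson:
    title: str
    grade: int              # metadata used for filtering
    standard: str           # hypothetical standards code
    embedding: List[float]  # toy vector; a real one comes from an embedding model


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, title: str) -> float:
    """Fraction of query words that appear verbatim in the title."""
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(query, query_embedding, lessons, grade, alpha=0.7, k=2):
    """Filter by grade, then rank by alpha * semantic + (1 - alpha) * keyword."""
    candidates = [l for l in lessons if l.grade == grade]
    scored = [
        (alpha * cosine(query_embedding, l.embedding)
         + (1 - alpha) * keyword_score(query, l.title), l)
        for l in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [lesson for _, lesson in scored[:k]]


# Hypothetical catalog: made-up titles, codes, and vectors.
LESSONS = [
    Lesson("Adding fractions with like denominators", 4, "MATH.4.A", [0.9, 0.1, 0.0]),
    Lesson("Comparing parts of a whole", 4, "MATH.4.B", [0.8, 0.2, 0.1]),
    Lesson("Main idea and supporting details", 4, "ELA.4.A", [0.0, 0.1, 0.9]),
    Lesson("Adding fractions review", 5, "MATH.5.A", [0.9, 0.1, 0.0]),
]

results = retrieve("fractions practice", [1.0, 0.0, 0.0], LESSONS, grade=4)
for lesson in results:
    print(lesson.title)
```

In this toy data, "Comparing parts of a whole" shares no words with the query, so keyword search alone would miss it, yet it ranks second because its vector sits close to the query vector. The fifth-grade lesson never appears because the metadata filter removes it before ranking, which is one reason metadata hygiene matters as much as the embeddings themselves.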

Speaking of evaluation, their background in education shaped a surprisingly rigorous eval process, long before “evals” became a buzzword. They leaned on rubrics, Braintrust, and a human-in-the-loop approach to ensure the assistant’s recommendations were accurate, aligned, and classroom-ready. It’s a reminder that in domains like education and healthcare, model observability and structured evaluation are non-negotiable for product-market fit.
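A rubric-based eval with a human-in-the-loop step can be sketched in a few lines. The three criteria come from the episode (accurate, aligned, classroom-ready), but everything else here is my assumption: the 1 to 5 scale, the weights, the `pass_mark` and `floor` thresholds, and the function names. It is plain Python and does not use Braintrust's SDK.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical weights per rubric criterion; they sum to 1.0.
RUBRIC = {"accurate": 0.4, "aligned": 0.4, "classroom_ready": 0.2}


@dataclass
class Verdict:
    score: float              # weighted rubric score on the 1-5 scale
    needs_human_review: bool  # True when a person should check the output


def grade(scores: Dict[str, int], pass_mark: float = 3.5, floor: int = 3) -> Verdict:
    """Combine per-criterion scores (1-5) and decide whether to escalate.

    A recommendation goes to human review if its weighted score is below
    pass_mark, or if any single criterion falls below floor.
    """
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"missing rubric scores: {sorted(missing)}")
    weighted = sum(RUBRIC[c] * scores[c] for c in RUBRIC)
    below_floor = any(scores[c] < floor for c in RUBRIC)
    return Verdict(round(weighted, 2), weighted < pass_mark or below_floor)


strong = grade({"accurate": 5, "aligned": 4, "classroom_ready": 4})
weak = grade({"accurate": 5, "aligned": 2, "classroom_ready": 5})
print(strong)
print(weak)
```

The second example is the interesting one: its weighted score clears the pass mark, but the low alignment score still sends it to a reviewer. A per-criterion floor like this keeps one strong dimension from masking a failure on another, which matters when a misaligned lesson could end up in front of students.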

The most energizing signal for me: they’ve learned from thousands of teachers using the product this school year—and they’re already translating that learning into roadmap bets. What’s next for Teacher Assistant: more contextual recommendations using student data. Done well, that shift moves the product from “helpful” to “indispensable,” grounding gen AI in student outcomes rather than generic assistance.

Show notes for context: Guests include Thom van der Doef, Principal Product Designer at eSpark; Mary Gurley, Director of Learning Design & Product Manager at eSpark; and Ray Lyons, VP of Product & Engineering at eSpark. Topics covered span the origin story of Teacher Assistant (connecting administrator mandates with teacher needs), why the team abandoned a chatbot interface in favor of a more structured workflow, how retrieval augmented generation (RAG) and embeddings shaped the product architecture, lessons learned from debugging semantic search vs. keyword search, building evals with rubrics, Braintrust, and a human-in-the-loop approach, and what’s next for Teacher Assistant: more contextual recommendations using student data.

If you like to follow along chronologically, the chapter flow is tight and practical: 02:05 Overview of eSpark's Adaptive Learning Program; 07:19 Challenges and Insights from COVID-19; 17:06 Developing the Teacher Assistant Feature; 24:55 User Experience and Interface Evolution; 34:29 Chat GPT-5's New Features; 35:16 Balancing Engagement and Efficiency; 35:40 Seasonal Business and Real Traffic; 36:29 Technical Decisions and RAG Implementation; 38:28 Challenges with Embeddings and Metadata; 41:24 Improving Recommendations and Data Enrichment; 55:18 Evaluating the Teaching Assistant; 01:05:51 Future Plans and User Feedback; 01:07:57 Conclusion and Final Thoughts.

Useful links if you want to go deeper: eSpark Learning; Braintrust.dev – evals and observability for LLM applications; AI Evals Maven Course by Hamel Husain and Shreya Shankar.

My product takeaways for anyone building AI in complex, regulated, or mission-driven domains: First, resist the chatbot reflex; many users need structured, high-signal workflows. Second, treat retrieval as a product surface—data modeling, metadata, and domain language matter as much as model choice. Third, invest early in evals with rubric-based scoring and human-in-the-loop reviews to protect trust. Finally, plan for seasonality and “real traffic” patterns; the strongest eval is usage in production with tight feedback loops from your most demanding users.

Gen AI is only as valuable as the outcomes it enables. In classrooms, that means saving teachers time, raising instructional alignment, and ultimately improving student learning. This case study shows that when we combine empathetic product discovery with disciplined RAG architecture and rigorous evals, AI stops being a demo—and starts being a difference-maker.

Inspired by this post on Product Talk.