What I Learned from Trainline’s Agentic AI: Building a Trusted Travel Assistant at Scale

Podcast artwork featuring the title 'Just Now Possible', a teal network-style diagram, and a banner reading 'Building Trainline’s AI Travel Assistant' on a dark navy background with white and yellow text.

Over the past year, I’ve been shipping agentic AI into production and coaching product teams on what it really takes to make these systems trustworthy in the wild. One story that crystallizes the playbook comes from Trainline’s move to an agentic architecture for travel assistance—an approach that mirrors what I’ve seen work in high-stakes, real-time customer experiences.

Trainline—the world’s leading rail and coach platform—helps millions of travelers get from point A to point B. Now, they’re using AI to make every step of the journey smoother.

I studied how "David Eason (Principal Product Manager) Billie Bradley (Product Manager), and Matt Farrelly (Head of AI and Machine Learning)" approached the build of "Travel Assistant, an AI-powered travel companion that helps customers navigate disruptions, find real-time answers, and travel with confidence." Their work exemplifies the kind of end-to-end thinking required to move beyond demos into dependable, on-the-go assistance.

They share how they: Identified underserved traveler needs beyond ticketing; Built a fully agentic system from day one, combining orchestration, tools, and reasoning loops; Designed layered guardrails for safety, grounding, and human handoff; Expanded from 450 to 700,000 curated pages of information for retrieval; Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time; Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go.

I align strongly with their core takeaways: "AI assistants need both scalable reasoning and deep domain context to be useful." "Tool design and guardrails are as critical as prompt design in agent systems." "LLM-as-judge evals make it possible to measure open-ended systems without massive labeling costs." And perhaps most importantly, "Even legacy companies can move fast when they embrace experimentation and tight PM–engineering collaboration."

From an AI strategy perspective, starting "fully agentic" was the right call. When the problem space is dynamic—disruptions, route changes, fare conditions—reasoning loops and orchestration aren’t luxuries; they’re table stakes. Tool selection becomes product design: you need the right retrieval interfaces, constraint-aware planners, and API contracts that are resilient to partial failures. Layered guardrails for safety, grounding, and human handoff reduce hallucination risk while preserving responsiveness—critical when users are standing on a platform waiting for an answer.

The retrieval scale-up—"Expanded from 450 to 700,000 curated pages of information for retrieval"—is a classic inflection point. I’ve seen teams stall here when they treat content growth as a pure indexing problem. The winning move is curation and structure: normalize sources, encode policy-level constraints, and align retrieval chunks to decision boundaries the agent actually uses. That’s how you keep precision high while coverage explodes.

Evaluation is where most open-ended assistants fail quietly, which is why I was encouraged to see "Developed LLM-as-judge evals and a custom user context simulator to measure quality in real-time." In practice, LLM-as-judge gives you scalable, scenario-based scoring without prohibitive labeling, while a user context simulator surfaces regressions tied to persona, itinerary state, and device constraints. The combination closes the loop between model behavior, tool layer changes, and UX outcomes.

On product delivery, the decision to have the system "Balanced latency, UX, and reliability to make AI assistance feel trustworthy on the go" shows mature prioritization. For travel, trust accrues in seconds: fast-enough responses, graceful degradation when upstream data lags, and explicit handoff when confidence dips. This is where guardrails meet UX writing—clear, bounded language signals competence even when the system defers.

Finally, the organizational pattern matters. The teams that win in agentic AI are cross-functional, experimentation-driven, and ruthless about instrumentation. Tight PM–engineering collaboration, explicit safety thresholds, and an eval stack that mirrors real user journeys are what turn promising architectures into dependable products.

It’s a behind-the-scenes look at how an established company is embracing new AI architectures to serve customers at scale.

If you’re building agentic AI in production, borrow these moves: invest early in tool and guardrail design, scale retrieval with curation not just volume, adopt LLM-as-judge plus context simulation for continuous evaluation, and treat latency and reliability as core product requirements—not afterthoughts. That’s how you ship AI assistance that customers trust when it matters most.


Inspired by this post on Product Talk.


Book a consult png image

What is the core concept behind Trainline’s agentic AI?

Agentic AI combines orchestration, tools, and reasoning loops to create an end-to-end travel assistant that can navigate disruptions and provide real-time answers. Trainline’s example shows how this approach supports scalable, dependable assistance in high-stakes, on-the-go contexts.

How much content did Trainline scale for retrieval?

They expanded from 450 to 700,000 curated pages of information for retrieval. That growth required careful curation and structuring to maintain precision while increasing coverage.

What role do guardrails play in the system?

Layered guardrails for safety, grounding, and human handoff were designed to reduce hallucination risk while preserving responsiveness. They’re as critical as tool design and prompt design in agent systems.

What evaluation approach did Trainline adopt?

They developed LLM-as-judge evaluations and a custom user context simulator to measure quality in real time. This approach enables scalable scenario-based scoring without heavy manual labeling.

What organizational patterns contributed to success?

Cross-functional, experimentation-driven teams with instrumentation and tight PM–engineering collaboration were essential. They also established explicit safety thresholds and an evaluation stack that mirrors real user journeys.

What is the main takeaway about AI assistants according to the author?

AI assistants need both scalable reasoning and deep domain context to be useful. Tool design and guardrails are as critical as prompt design in agent systems.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Signup for Weekly Digest Emails

Categories

Archieve