What makes CX Score a trustworthy AI metric for support leaders?

The article explains that CX Score is grounded in how experienced support teams define quality, aligned with expert human judgment, and validated with standard ML metrics. It is designed so leaders can inspect, explain, and defend the score instead of accepting an algorithm on faith.

How was CX Score aligned with human judgment?

The team created a dataset of thousands of real customer conversations across industries, languages, channels, and agent types. Experienced support professionals manually reviewed the conversations, with two reviewers where possible and disagreement resolution to create stable consensus labels.

What statistical threshold did the team use before shipping CX Score?

The article says CX Score was evaluated on precision, recall, and F1 score. The explicit bar was an F1 score above 0.8, representing high agreement with human judgment, and the metric had to clear it before shipping.

How did field testing shape CX Score before release?

The article describes a multi-phase field test with shadow scoring, checks across agent type and conversation length, and a controlled customer rollout. CX Score shipped after real teams confirmed that the scores felt sensible, the reasons were clear, and the insights were actionable.

How is CX Score designed to evolve as a business changes?

The post says a trustworthy metric cannot be static because customer expectations, products, and AI systems change. CX Score is built around evaluating real customer experience signals, keeping the logic interpretable, and helping leaders make clear decisions over time.

Build CX Scores You Can Defend: My 5-step playbook for transparent, trustworthy AI metrics

Q: Why does explainability matter for CX Score?

Explainability matters because every score includes clear reasons, concrete excerpts, and a short explanation of what influenced the rating. That makes the metric inspectable, auditable, and easier to explain to executives.

“You don’t have to trust the algorithm; you can see exactly why a conversation earned the score it did.”

We recently shared how we redesigned CX Score to deliver deeper, more actionable insights across every conversation. The most common follow-up from support leaders was simpler and incredibly important: “Can I trust it?” It’s the right question—and it’s the one I use as my own bar for whether a metric is ready for the C‑suite.

CS teams are the subject matter experts on customer experience. They understand the nuance of what customers feel, the context behind every interaction, and the difference between a technically resolved issue and a genuinely satisfied customer. I’ve learned, conversation by conversation, that any metric we ship has to capture that nuance at scale—or it doesn’t deserve to be used.

We built CX Score to give support teams a complete view of how their customers feel across every conversation. It surfaces what’s working, what’s not, and why—so leaders can communicate impact clearly and drive change across support, product, and the wider business.

Figure: An interface card showing “CX Score: 2” for a conversation in which repeated CSV export attempts failed and frustrated the customer. The AI agent explains the issue and requests more details.

Here’s exactly how I approached building a trustworthy metric that support leaders can inspect, explain, and defend.

1) It’s grounded in how support teams define quality. I started with how experienced support professionals actually evaluate conversations—collecting real examples of strong, mixed, and poor interactions across industries, identifying the specific factors that shape overall experience, and writing plain-English rules for each. The result: CX Score applies the same criteria a trained support professional would use, not generic LLM assumptions.

2) It’s aligned with human judgment. We created a dataset of thousands of real customer conversations spanning multiple industries, languages, channels, and agent types. Each was manually reviewed by experienced support professionals—with two reviewers per conversation where possible and disagreement resolution to create stable consensus labels. The result: CX Score is trained and tested to behave like an expert reviewer, not a language model making broad guesses.
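To make the two-reviewer process concrete, here is a minimal sketch of how consensus labels could be derived. The post does not say how disagreements were resolved, so the `adjudicated` argument (a label supplied by a later resolution step) and the function name are assumptions made for illustration only.

```python
from typing import Optional


def consensus_label(
    first: str,
    second: Optional[str] = None,
    adjudicated: Optional[str] = None,
) -> Optional[str]:
    """Return a consensus label for one conversation (hypothetical helper).

    first:       label from the first reviewer
    second:      label from the second reviewer, if one was available
    adjudicated: label from a disagreement-resolution step (assumed here)
    """
    if second is None:
        # Only one reviewer was available, so their label stands.
        return first
    if first == second:
        # Both reviewers agree.
        return first
    # Reviewers disagree: use the resolved label, or None if still unresolved.
    return adjudicated


print(consensus_label("negative"))
print(consensus_label("negative", "neutral"))
print(consensus_label("negative", "neutral", adjudicated="negative"))
```

Returning `None` for an unresolved disagreement keeps unstable labels out of the evaluation set until someone resolves them.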

Figure: An analytics dashboard showing a CX Score with KPI cards and a Sankey performance funnel that traces conversations from support channels (chat, email, and mobile) through AI involvement and resolutions to positive, neutral, or negative outcomes.

3) It’s engineered by AI specialists. CX Score isn’t a prompt attached to an LLM. It’s a production system built by Intercom’s AI Group: 37 ML scientists and 350 engineers whose full-time focus is AI for customer service. The system includes specialized handling for long transcripts, model configuration tailored for support language and subtle sentiment, prompt engineering designed to default to neutral when evidence is weak, and a multi-stage evaluation pipeline that checks for precision, consistency, and reliability. The result: A metric built by a team that understands LLM behavior in production support environments, where accuracy and consistency matter most.

4) It’s validated statistically, not qualitatively. Trust requires measurement, not vibes. We tested CX Score across standard ML metrics: precision (when the model flags a negative experience, how often do humans agree?), recall (how many human-identified issues does it catch?), and F1 score (the harmonic mean of the two, which is high only when both are high). We set an explicit bar: F1 above 0.8, representing high agreement with human judgment. We reran these evaluations through every revision, checking for regressions or biases, and I focused especially on negative experiences, because a false negative (a bad experience the model fails to flag) hides a real problem. The result: CX Score meets a measurable standard before it ships. That standard is a statistical requirement, not a gut check.
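For readers who want to see how these three numbers relate, here is a small sketch that computes them from human labels and model flags, treating "negative experience" as the positive class. The function name and the twelve sample labels are illustrative assumptions. Only the 0.8 bar comes from the post, and this is not the team's actual evaluation pipeline.

```python
def precision_recall_f1(human_flags, model_flags):
    """Score model flags against human labels.

    Both inputs are equal-length lists of booleans where True means
    "negative experience" (the positive class in this sketch).
    """
    if len(human_flags) != len(model_flags):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_flags, model_flags))
    true_pos = sum(1 for human, model in pairs if human and model)
    false_pos = sum(1 for human, model in pairs if not human and model)
    false_neg = sum(1 for human, model in pairs if human and not model)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


F1_BAR = 0.8  # the threshold named in the post

# Hypothetical labels for 12 conversations (illustrative data only).
human = [True] * 6 + [False] * 6
model = [True] * 5 + [False] + [True] + [False] * 5

precision, recall, f1 = precision_recall_f1(human, model)
meets_bar = f1 > F1_BAR
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} meets_bar={meets_bar}")
```

In this toy data the model misses one negative experience and wrongly flags one other conversation, so precision, recall, and F1 all come out to 5/6, which clears the bar.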

5) It was battle-tested with real customers. Lab accuracy isn’t enough. Customer environments are messy: varied ticket types, mixed languages, unpredictable edge cases. Before release, we ran a multi-phase field test. We shadow-scored conversations with both old and new models, validated sensible behavior across agent type and conversation length, then rolled out to a controlled customer group who confirmed the scores felt right, the reasons were clear, and the insights were actionable. The result: CX Score shipped because real teams told us it made sense in practice, not just because it passed internal tests.
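Shadow scoring is easier to picture with a sketch: run the old and new models over the same conversations, then compare their scores within each agent type and conversation-length segment. Everything below is an assumption made for illustration, including the record fields (`agent_type`, `messages`), the length cut-offs, and the stand-in scoring functions. None of it describes Intercom's internal tooling.

```python
from collections import defaultdict


def length_bucket(message_count):
    """Bucket a conversation by message count (cut-offs are arbitrary)."""
    if message_count <= 5:
        return "short"
    if message_count <= 20:
        return "medium"
    return "long"


def shadow_report(conversations, old_model, new_model):
    """Mean absolute score difference between two models, per segment.

    A segment is an (agent_type, length_bucket) pair. Large gaps in one
    segment point to where the new model deserves a closer human look.
    """
    diffs = defaultdict(list)
    for conv in conversations:
        key = (conv["agent_type"], length_bucket(len(conv["messages"])))
        diffs[key].append(abs(new_model(conv) - old_model(conv)))
    return {key: sum(values) / len(values) for key, values in diffs.items()}


# Stand-in scorers and data, purely illustrative.
def old_score(conv):
    return 3


def new_score(conv):
    return 3 if conv["agent_type"] == "human" else 2


conversations = [
    {"agent_type": "human", "messages": ["hi"] * 3},
    {"agent_type": "ai", "messages": ["hi"] * 3},
    {"agent_type": "ai", "messages": ["hi"] * 30},
]

report = shadow_report(conversations, old_score, new_score)
for segment, gap in sorted(report.items()):
    print(segment, gap)
```

Grouping by segment matters because an overall average can look fine while one slice, such as long AI-handled conversations, drifts badly.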

Figure: A donut chart of CX categories (policy feedback, answer quality, customer effort, and emotion) beside a chat UI showing a CX Score of 3 with a “Negative policy feedback” tag.

The importance of explainability. One of the most critical choices I made was ensuring CX Score isn’t a black box. Every score comes with clear reasons, concrete excerpts, and a short explanation of what influenced the rating. This turns the metric into something you can inspect, audit, and explain to executives. You don’t have to trust the algorithm. You can see exactly why a conversation earned the score it did.
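One way to picture an inspectable score is as a record that carries its own evidence. The sketch below is hypothetical: the field names and the audit check are assumptions, not CX Score's actual schema. The audit helper shows the kind of check an explainable score makes possible, which is confirming that every quoted excerpt really appears in the transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredConversation:
    """A score bundled with the evidence behind it (hypothetical shape)."""
    score: int
    reasons: List[str]
    excerpts: List[str]
    explanation: str


def unsupported_excerpts(result: ScoredConversation, transcript: str) -> List[str]:
    """Return excerpts that do not appear verbatim in the transcript."""
    return [excerpt for excerpt in result.excerpts if excerpt not in transcript]


# Illustrative data only.
transcript = "Customer: the export failed again. Agent: could you share the file size?"
result = ScoredConversation(
    score=2,
    reasons=["Repeated failure", "Issue not resolved"],
    excerpts=["the export failed again", "I already tried that"],
    explanation="The customer hit the same export error more than once.",
)

missing = unsupported_excerpts(result, transcript)
print(missing)
```

An empty list from the check means every excerpt can be traced back to the conversation, which is what makes the score something a reviewer can audit instead of take on faith.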

A metric that evolves with your business. Customer expectations shift. Products change. AI improves. A trustworthy metric can’t be static. CX Score evolves with the same commitments that shaped its redesign: Evaluate the real signals that shape customer experience, keep the logic simple and interpretable, and ensure leaders can make clear decisions from it. It’s built to be a durable source of truth across every conversation.

The takeaway. In a world where products look the same and AI can generate any interaction, customer experience is one of the few differentiators that actually matters. Support leaders have built that expertise conversation by conversation. What they’ve lacked is a measurement system that could validate it at scale—one that’s reliable enough to report to the C-suite, explainable enough to defend in strategy meetings, and rigorous enough to drive real decisions. That’s what CX Score is designed to be: A metric that reflects the reality support leaders see every day, backed by the technical rigor to make it credible everywhere else.

Want to see CX Score in your workspace? Ask your admin to enable it for your team, and start using explainable AI insights to improve customer experience and coach with confidence.

Inspired by this post on The Intercom Blog.