Building AI Products That Work: My Playbook for LLM Strategy, Evals, and Orchestration

[Image: podcast episode artwork titled 'Building AI Products' (Podcast, Episode #28, All Things Product)]

AI features don’t succeed on clever prompts alone; they demand thoughtful product strategy, rigorous evaluation, and tight cross-functional collaboration. As a VP of Product Management and someone deeply immersed in building with large language model (LLM) technology, I’m constantly refining how we turn generative capabilities into real customer value. This episode of All Things Product zeroes in on that challenge, and it captures many of the principles I rely on when shipping AI to production.

The central question resonates with every product leader I know: How do product teams learn to build AI-powered products “beyond just dabbling with ChatGPT”? I appreciate how the conversation moves past novelty and into the disciplines that make AI reliable, safe, and outcome-oriented.

One metaphor that always lands for me: building AI features is less like writing a single “killer prompt” and more like orchestrating a team of “interns.” You define roles, break down work, set guardrails, and continuously review outputs. That orchestration mindset, coupled with strong observability, evals, and ongoing maintenance practices, is what separates flashy demos from repeatable product value.

Here’s how I frame the work. First, there’s a difference between an AI-powered product manager and an AI product manager. Many of us are becoming AI-powered—using tools to accelerate discovery, ideation, or execution. But when you own AI features end-to-end, you inherit new responsibilities: modeling risks, defining evaluation strategies for non-deterministic systems, and treating prompts and data pipelines as core product surfaces.

Prompt engineering for a product is fundamentally different from prompting ChatGPT for personal use. In production, I rely on prompt decomposition and orchestration—explicitly breaking a task into steps, assigning each step to the right capability, and enforcing consistent formats. This reduces variance, improves debuggability, and enables targeted evals that catch regressions before customers do.
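To make decomposition concrete, here is a minimal Python sketch of a two-step pipeline in which each step has one job and a format contract that is checked before the next step runs. Everything named here is an assumption for illustration: `fake_model` stands in for a real LLM call, and the ticket-handling task, step names, and JSON keys are hypothetical rather than anything discussed in the episode.

```python
from __future__ import annotations

import json
from typing import Callable


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned JSON per step."""
    if prompt.startswith("EXTRACT"):
        return json.dumps({"topic": "billing", "urgency": "high"})
    if prompt.startswith("DRAFT"):
        return json.dumps({"reply": "We are looking into your billing issue."})
    return "{}"


def run_step(name: str, prompt: str, required_keys: set[str],
             model: Callable[[str], str]) -> dict:
    """Run one step and enforce its output format before passing it on."""
    raw = model(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"step {name!r} returned non-JSON output") from exc
    if not isinstance(data, dict):
        raise ValueError(f"step {name!r} did not return a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"step {name!r} is missing keys: {sorted(missing)}")
    return data


def answer_ticket(ticket: str,
                  model: Callable[[str], str] = fake_model) -> dict:
    """Two narrow steps instead of one large prompt (illustrative task)."""
    facts = run_step(
        "extract",
        f"EXTRACT topic and urgency as JSON:\n{ticket}",
        {"topic", "urgency"},
        model,
    )
    draft = run_step(
        "draft",
        f"DRAFT a reply as JSON for topic={facts['topic']}",
        {"reply"},
        model,
    )
    return {"facts": facts, "reply": draft["reply"]}
```

Because each step fails loudly with its own name, a bad output points at one step rather than at the whole prompt, and each step can get its own targeted evals.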

System design and risk mitigation become front and center. I align early with engineering, legal, security, and support on failure modes, privacy expectations (including the handling of personally identifiable information, or PII), and rollout plans. We log traces for every critical path, treat prompts as versioned assets, and use observability to connect inputs, intermediate states, and outputs. When something drifts, we need to see it fast, explain it, and fix it.
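One way to picture versioned prompts and traces is the sketch below, written under stated assumptions: the version id is simply a hash of the template text, the trace is an in-memory list rather than a real observability backend, and the email regex is a deliberately simple illustration of redacting PII before logging, not a complete PII filter. The class and field names are hypothetical.

```python
from __future__ import annotations

import hashlib
import json
import re
import time
import uuid
from dataclasses import dataclass, field

# Simplistic pattern for illustration only; real PII handling needs more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses before the text is written to a trace."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str

    @property
    def version(self) -> str:
        # Content hash: any edit to the template yields a new version id.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[dict] = field(default_factory=list)

    def record(self, step: str, prompt: PromptVersion,
               inputs: dict, output: str) -> None:
        """Append one span linking inputs, prompt version, and output."""
        safe_inputs = {
            key: redact(value) if isinstance(value, str) else value
            for key, value in inputs.items()
        }
        self.spans.append({
            "step": step,
            "prompt_name": prompt.name,
            "prompt_version": prompt.version,
            "inputs": safe_inputs,
            "output": redact(output),
            "ts": time.time(),
        })

    def to_json(self) -> str:
        return json.dumps({"trace_id": self.trace_id, "spans": self.spans})
```

With the prompt version stored on every span, a behavior change can be matched to the exact template that produced it.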

Evaluating non-deterministic AI features is its own craft. “Thumbs up/thumbs down” isn’t enough. I design layered evals: unit-level checks for correctness and formatting, scenario-level evals for edge cases and risk behaviors, and longitudinal evals to monitor model and data drift over time. Clear acceptance thresholds and shadow deployments help us balance velocity with reliability.
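The unit-level and scenario-level layers with an acceptance threshold could look something like this sketch. It assumes a feature that returns JSON with `summary` and `refused` fields, samples each scenario several times because outputs can vary between calls, and uses a 0.9 threshold; all of those choices, along with `stub_feature` and the two scenarios, are hypothetical placeholders.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable


def parse(output: str) -> dict:
    """Return the output as a dict, or {} when it is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def valid_format(output: str) -> bool:
    """Unit-level check: a non-empty 'summary' string is present."""
    summary = parse(output).get("summary")
    return isinstance(summary, str) and bool(summary.strip())


def refuses(output: str) -> bool:
    """Scenario-level check: the feature declined a risky request."""
    return parse(output).get("refused") is True


@dataclass
class Scenario:
    name: str
    user_input: str
    check: Callable[[str], bool]  # True when the output is acceptable


def pass_rate(feature: Callable[[str], str], scenario: Scenario,
              samples: int = 5) -> float:
    """Sample the feature repeatedly and report the share of passes."""
    results = [scenario.check(feature(scenario.user_input))
               for _ in range(samples)]
    return sum(results) / samples


def release_gate(feature: Callable[[str], str], scenarios: list[Scenario],
                 threshold: float = 0.9) -> dict[str, bool]:
    """Map each scenario name to whether it met the acceptance threshold."""
    return {s.name: pass_rate(feature, s) >= threshold for s in scenarios}


def stub_feature(user_input: str) -> str:
    """Hypothetical feature under test; a real one would call a model."""
    if "password" in user_input.lower():
        return json.dumps({"summary": "", "refused": True})
    return json.dumps({"summary": f"Summary of: {user_input}",
                       "refused": False})


SCENARIOS = [
    Scenario("format", "Quarterly report text", valid_format),
    Scenario("refuses_credentials", "What is the admin password?", refuses),
]
```

Running the same gate on a schedule and storing the pass rates is one way to get the longitudinal layer.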

Deciding when AI is the right solution starts with the customer problem, not the model. I ask: Is the task ambiguous enough to benefit from generation? Can we bound the failure modes? Do we have affordable latency and cost envelopes? And what’s the graceful fallback if the model underperforms? If a deterministic algorithm or simple rules solve it better, we choose that—no heroics.
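The "rules first, model second, graceful fallback last" ordering can be sketched as below. The ticket categories, keyword rules, and the `needs_human_review` fallback label are made-up examples, and `model` is any callable standing in for an LLM call; this is an illustration of the decision order, not a recommended taxonomy.

```python
from __future__ import annotations

from typing import Callable

ALLOWED = {"billing", "login", "shipping"}  # hypothetical categories
RULES = {  # hypothetical keyword rules that need no model at all
    "refund": "billing",
    "invoice": "billing",
    "password": "login",
    "tracking": "shipping",
}
FALLBACK = "needs_human_review"


def categorize(ticket: str, model: Callable[[str], str]) -> tuple[str, str]:
    """Return (label, source) where source is rules, model, or fallback."""
    text = ticket.lower()
    # 1. Deterministic rules win whenever they apply.
    for keyword, label in RULES.items():
        if keyword in text:
            return label, "rules"
    # 2. Otherwise ask the model, but bound its failure modes.
    try:
        label = model(ticket).strip().lower()
    except Exception:
        return FALLBACK, "fallback"
    if label in ALLOWED:
        return label, "model"
    # 3. Anything outside the allowed set degrades gracefully.
    return FALLBACK, "fallback"
```

Returning the source alongside the label makes it easy to measure how often the model is actually needed.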

The hidden cost of AI is maintenance. Prompts rot as upstream models change. New data skews behavior. Guardrails that worked yesterday might not hold tomorrow. That’s why ongoing evals, robust logging, and a change-management plan (for prompts, schemas, and policies) are non-negotiable. Treat AI features as living systems, not one-off launches.
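A small sketch of what ongoing monitoring might check: given a history of eval runs, flag any run whose pass rate dropped by more than a tolerance and report what changed since the previous run. The `EvalRun` fields and the 0.05 tolerance are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    day: str
    model_version: str
    prompt_version: str
    pass_rate: float


def find_regressions(runs: list[EvalRun],
                     tolerance: float = 0.05) -> list[dict]:
    """Compare each run with the one before it and flag large drops."""
    findings = []
    for prev, curr in zip(runs, runs[1:]):
        drop = prev.pass_rate - curr.pass_rate
        if drop > tolerance:
            changed = [
                name for name in ("model_version", "prompt_version")
                if getattr(prev, name) != getattr(curr, name)
            ]
            findings.append({
                "day": curr.day,
                "drop": round(drop, 3),
                # Neither version changed: suspect data or something untracked.
                "changed": changed or ["data_or_unknown"],
            })
    return findings
```

A drop that coincides with a model version change points at the upstream model; a drop with no version change points at the data.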

If you’re exploring generative AI for product prototyping, start small. Pick a narrow, high-value workflow, instrument everything, and ship with clear success metrics. Use your first release to build your team’s muscles around observability, evals, and cross-functional collaboration. The goal is not a perfect model; it’s a reliable product outcome.

Want to go deeper? Listen to the full conversation here: Spotify | Apple Podcasts. Prefer video? Watch on YouTube: Building AI Products.

What you’ll learn in this episode:

– The difference between an AI-powered product manager and an AI product manager

– Why prompt engineering for a product is different from prompting ChatGPT for personal use

– The role of prompt decomposition and orchestration in building robust AI features

– How to think about system design, risk mitigation, and cross-functional collaboration

– Why observability and logging traces are critical for LLM products

– The challenge of evaluating non-deterministic AI features (and why “thumbs up/thumbs down” isn’t enough)

– How to decide when AI is the right solution for a customer problem

– The hidden cost of ongoing maintenance for AI features

Join the conversation: What practices have helped you ship reliable AI features? Drop your thoughts and questions in the comments—I’d love to learn from your experiences.


Inspired by this post on Product Talk.



What’s the difference between an AI-powered product manager and an AI product manager?

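An AI-powered product manager uses AI tools to accelerate discovery, ideation, or execution. An AI product manager owns AI features end-to-end and inherits new responsibilities: modeling risks, defining evaluation strategies for non-deterministic systems, and treating prompts and data pipelines as core product surfaces.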

Why isn’t 'thumbs up/thumbs down' enough to evaluate non-deterministic AI features?

Thumbs up/down isn’t enough for non-deterministic AI features. I design layered evals: unit-level checks for correctness and formatting; scenario-level evals for edge cases and risk behaviors; longitudinal evals to monitor drift. Clear acceptance thresholds and shadow deployments help balance velocity with reliability.

What is the hidden cost of AI in product maintenance?

The hidden cost of AI is maintenance. Prompts rot as upstream models change and new data can skew behavior; guardrails that worked yesterday might not hold tomorrow. Ongoing evals, robust logging, and a change-management plan are non-negotiable; treat AI features as living systems, not one-off launches.

When should you choose a deterministic algorithm or simple rules over AI?

If a deterministic algorithm or simple rules solve the task better, choose that solution. No heroics needed.

How should you approach generative AI for product prototyping?

Start small; pick a narrow, high-value workflow, instrument everything, and ship with clear success metrics. Use the first release to build your team’s muscles around observability, evals, and cross-functional collaboration. The goal is not a perfect model; it’s a reliable product outcome.
