AI Changes Everything—and Nothing: A Practical Product Playbook for LLMs, RAG, and Evals

Title slide reading “AI Changes Everything (And Nothing At All)” on a dark blue background, crediting Teresa Torres of ProductTalk.org with handle @ttorres; image suits a generative AI ethics post.

AI has changed the tempo of product management, but not the timeless fundamentals. I’m living that paradox daily: the technology reshapes how we plan, build, and ship—yet the way we find real customer value hasn’t budged. Here’s how I reconcile both truths in practice.

At their core, large language models predict the next token. That’s it. Consider the prompt, “The cat sat on the ____.” Mat is most likely. Chair is pretty likely. Floor, roof, piano—all possible, just less probable. When you give an LLM a prompt, it runs through a neural network—billions of parameters making calculations—to predict the most likely next word, then the next, then the next. That neural net represents not just facts, but relationships, patterns, context—and some might even argue reasoning capabilities—that all influence the probabilities of different outputs. So LLMs don’t just predict the next token. They predict the next token by drawing upon a vast amount of knowledge. LLMs are transformational.

Presentation slide on large language models showing 'The cat sat on the ____' with probabilities—mat 45%, chair 25%, floor 15%, roof 10%, piano 5%—to illustrate next-token prediction.
Behind the magic, LLMs don’t “know”—they rank what likely comes next. This slide’s cat sentence with weighted options underscores how generative AI reshapes work while remaining probability-driven.

And yet, AI makes silly mistakes. The kind that make you wonder how it could be so smart and so dumb at the same time. No matter how hard I tried, I could not get ChatGPT-5 to fix the closing quote on an image. And it took 12 seconds of thinking to figure out that I asked for a meeting summary but didn’t include any meeting notes. Why do LLMs make these mistakes? Well, it’s because all the LLM is doing is predicting the next token based on patterns. And sometimes those predictions are wrong. And when it’s wrong, it can be confidently wrong. So what do we do?

Split graphic contrasting AI hype and skepticism: a rocket with the text 'AGI is around the corner!' on the left and a sad person with a thumbs-down bubble on the right, with 'Which is it?' centered on a dark background.
A side-by-side visual pits an optimistic AGI rocket against a frowning skeptic with a thumbs-down, ending with the prompt 'Which is it?'—capturing the debate over whether generative AI is transformative progress or overhyped.

To build reliable AI features, I focus my team on four core skills: prompt engineering, context engineering, orchestration, and evaluation (evals). Together, they let us harness what’s powerful about generative AI while reducing unforced errors.

Slide-style illustration on a dark blue background showing web, books, code, and analytics icons feeding into a stylized neural network with the line: 'The key insight… is understanding how predictions get made.'
Generative AI isn’t magic—it’s inputs, training data, and models. This visual invites readers to examine how predictions are produced and why transparency, bias checks, and ethics shape responsible outcomes.

First, prompt engineering. With LLMs, the quality of the input determines the quality of the output. An LLM can’t guess what you want. You have to specify what you want in a language the LLM understands. In a chat, you can iterate with multiple turns. In a product, you get one shot. When we ship a meeting summarizer, for example, I design a production-ready prompt that assigns a role (“You are a chief of staff for a busy executive…”), specifies output format (e.g., a structured list of action items), and clarifies exactly what to include and how to structure it. This is prompt engineering. It’s like writing a really good product spec—but for an AI.

Slide explaining next‑word prediction: text 'The cat sat on the ____' with a neural network picking 'mat' and a list of probabilities (mat 45%, chair 25%, floor 15%, roof 10%, piano 5%).
A simple sentence becomes a window into generative AI: a neural network weighs options and selects 'mat,' showing how models rank likely words to produce fluent text—and why outputs feel confident yet probabilistic.

Second, context engineering. LLMs know a lot, but not everything. Imagine a takeaway reads, “Send the pipeline report to John by Friday.” When is Friday? Which John? What is the pipeline report? The model lacks today’s date, the roles of participants, and the specifics of our reporting vocabulary. If you dump every possible detail into the prompt, the model gets noisy input and the quality drops. The key is to add only the context the LLM needs to do the task at hand and nothing more. This is where techniques like RAG (Retrieval Augmented Generation) shine: retrieve just the relevant snippets—e.g., attendee roles, the precise definition of “pipeline report,” and the calendar context for “Friday”—and include only those in the prompt.

Slide illustrating next-token prediction in large language models: 'The cat sat on the ____' with probabilities: mat (45%), chair (25%), floor (15%), roof (10%), piano (5%) on a dark blue background.
A minimalist slide demystifies how LLMs choose words: given 'The cat sat on the ____', the model ranks options by probability, showing that AI outputs are weighted predictions, not facts, which matters for ethics and product decisions.

Third, orchestration. LLMs are better at simpler tasks. If you try to do everything in one prompt—identify action items, summarize the meeting, categorize by urgency, match to owners, check for clarity—the quality goes down. But if you break it into steps, each focused on one thing, the quality goes up. I design a workflow of multiple LLM calls that work together. For example: First call: identify action items from the transcript. Second call: categorize by urgency using lightweight project context. Third call: match to owners with a minimal team directory. Fourth call: generate calendar events via an API. Each step is simple. But together, they create something sophisticated.

Presentation slide highlighting generative AI UIs: a chat-based web app builder, an AI video editor tools panel, a translated comment snippet, and the line 'And this turns out to be really powerful.'
From building apps by chatting to auto-cleaning video and translating comments, AI compresses complex tasks into simple prompts. This scene shows why it feels transformative—while nudging us to weigh its limits and ethics.

Fourth, evaluation (evals). Otherwise known as: Is my AI product any good? At the heart of evals is this question: Given the input, did we get the expected output? I use several approaches. Datasets: I collect 20–100 real examples, define expected outputs, run the prompt, and measure the success rate. Code assertions: I enforce rules the output must follow (for example, “Every action item must have a task, owner, and deadline”) and fail the test when they’re violated. LLM-as-Judge: I use a model to verify factuality (e.g., “Is each item in this summary in the meeting transcript? Yes or no.”). Human evals: I track whether users need to edit the output and by how much. Start simple; get more sophisticated as the stakes rise.

Slide juxtaposes AI hype and limits: a rocket with quotes 'LLMs are changing everything' and 'AGI is around the corner,' alongside a chat UI failing to find meetings to summarize, over text 'And AI makes silly mistakes.'
From “AI Changes Everything (And Nothing At All),” this slide contrasts bold promises with everyday glitches: optimism about LLMs and AGI on the left, and a meeting-summarizer that can’t find content on the right—reminding us AI still errs.

Here’s the hard truth: you can master all four skills and still ship the wrong thing. Why? Because you chose the wrong problem to solve. Who needs yet another meeting summarizer? The ease of building with generative AI tempts us to jump straight to solutions. Resist it.

Slide with dark blue background and white text about getting good at building with AI by leveraging its power while reducing mistakes; minimalist design with ProductTalk.org branding.
A keynote slide from AI Changes Everything (And Nothing At All) urges teams to master AI by harnessing its power while minimizing errors, framing a practical, ethical approach to building trustworthy products.

Discovery matters more than ever. Before writing your first prompt, be explicit about the impact you want and set a clear outcome. Talk to your customers and ensure you understand their needs; choose the right opportunity before you chase a cool solution. Explore multiple options and use assumption testing to converge on what will actually move the needle.

Slide on prompt engineering with a branching node diagram labeled 'The LLM response,' bullet points on giving clear instructions, and the line 'The quality of the input determines the quality of the output.'
Clear prompts shape better AI. This slide shows how explicit, well-structured instructions lead to stronger LLM responses, reminding builders that inputs drive outcomes—and that responsible prompting begins with clarity.

Prototype, test, and build iteratively. With AI products, it’s easy to get to a great demo and hard to get to a production-grade experience. I validated feasibility before I wrote a line of production code by experimenting directly in a model’s UI, pasting real transcripts, and shaping the system instructions and output format. Only after I had a repeatable prompt and useful feedback did I worry about deployment.

Slide on prompt engineering with example chat prompts — summarize my meeting, what are my action items, who said what — next to a meeting-summary template detailing role, format, and instructions.
Designing prompts is product design. This visual pairs common meeting requests with a rigorous summary framework, reminding teams that when prompts are embedded in products, accuracy matters because you get one shot.

From there, I tested deployment paths with the smallest viable investment. I tried a custom chat in Replit and embedded it. It wasn’t the right interaction model for targeted feedback. I switched to a submission flow using a homework-style pattern: students upload their transcript, the system processes it, they get detailed feedback. Done. That worked—until automation reliability lagged. I wired it up in Zapier, then rebuilt it on AWS Step Function for robust retries and better error handling when scale introduced edge cases.

Presentation slide on context engineering showing how ambiguous instructions hinder LLMs, with a 'send the pipeline report to John by Friday' example and the note that models know a lot but not everything.
A reminder that AI needs context: a simple instruction becomes ambiguous without dates, roles, recipients, and report details, highlighting the limits of LLM knowledge and the value of precise, shared prompts.

Each iteration tested a different assumption. Iteration 1 (Claude): Can AI even do this? What makes good feedback? Iteration 2 (Replit chat): How should people interact with it? Iteration 3 (Zapier): Can I integrate this into the existing workflow with minimal engineering? Iteration 4 (Step Functions): How do I make it reliable at scale when things inevitably go wrong? I didn’t know the answers up front; I learned by building, shipping, and observing.

Slide illustrating LLM inputs—documents, code, users, analytics, and a calendar—feeding a neural network; it warns that prompt quality shapes output, while too much input can confuse the model.
From AI Changes Everything (And Nothing At All), this visual reminds builders that curated, high‑quality prompts yield better LLM responses, while overloaded, noisy inputs derail reasoning and reduce reliability.

Ethical data practices are non-negotiable. Improving AI products requires inspecting traces—the user input, system prompts, tool calls, and LLM responses. For a coaching flow, that includes the uploaded transcript, the system prompts that define evaluation criteria, and the AI’s feedback. To fix issues, I look at traces where things went wrong—but only with explicit user permission. Too many AI products collect everything by default and review traces without clear consent. Don’t do this.

Slide titled “Context Engineering” with icons for notes, code and checkmark, analyst avatar, charts, and calendar feeding into a neural network graphic labeled “The LLM response” on a dark blue background.
A visual explainer showing how structured inputs—documents, verified code, roles, metrics, and timelines—flow into a network to shape a single, clear LLM response, underscoring the power of context in generative AI.

In my products, users must explicitly grant permission for me to review their data. The consent is clear and specific. If they say no, I don’t see their data. Full stop. That forces me to rely on synthetic data for many tests and to engineer privacy into the architecture from day one rather than bolt it on later. It’s the right thing to do, and increasingly, it’s required.

Slide titled Orchestration showing prompts and code panels that split a meeting summary workflow into multiple LLM calls, with checkmarks and a note that LLMs are better at simpler tasks.
A visual guide to orchestration: break a complex meeting recap into smaller LLM calls, gather action items, bullet points, and a detailed summary, then aggregate the outputs to boost accuracy and reliability.

Everything changes—and nothing changes. Yes, there are new skills you should master: prompt engineering to get the right output the first time; context engineering to give the LLM exactly what it needs; orchestration to decompose complex tasks into simpler steps; and evals to systematically measure quality. And the fundamentals still rule: solve the right customer problems with strong discovery; prototype and iterate to de-risk; and uphold ethical data practices as a design constraint, not an afterthought. The teams that win with AI will master both the new technical craft and the timeless product fundamentals.

Slide titled Orchestration showing prompts for meeting summaries, icons of documents and analytics, and a node graph of RAG and LLM calls, plus the line We can retrieve the right context.
A presentation slide visualizes AI orchestration: user prompts generate meeting summaries while a graph of RAG and LLM calls routes tasks to the best sources, underscoring that retrieving the right context drives better outputs.

If this resonates, pick one active workflow, apply the four AI skills end-to-end, and run a short discovery and eval cycle against a clear outcome. Then ship. You’ll learn faster than any slide deck could ever teach you.

Slide titled 'Evaluation (Evals)' with a neural-style diagram from Input to Output and icons for Datasets, Code Assertions, LLM-as-Judge, and Human Evals, asking if the output meets expectations in AI evaluation.
This presentation slide maps how to assess AI: link inputs to outputs with datasets, code assertions, LLM-as-judge checks, and human reviews—centering the question, "Given the input, did we get the expected output?"

Inspired by this post on Product Talk.


Book a consult png image

What are the four core AI product skills?

Prompt engineering, context engineering, orchestration, and evaluation (evals). These four skills let us harness what’s powerful about generative AI while reducing unforced errors.

How does context engineering improve AI outputs?

LLMs know a lot but not everything; add only the context needed for the task and use approaches like Retrieval Augmented Generation (RAG) to fetch relevant snippets. This helps reduce noise and improve task accuracy.

Why is discovery important in AI product development?

Discovery helps define the impact you want and set a clear outcome before prompting. It also emphasizes talking to customers and upholding ethical data practices as design constraints.

How did the author validate feasibility before production code?

They validated feasibility in a model UI with real transcripts and shaped system instructions and output format. They only deployed after achieving a repeatable prompt and useful feedback.

What are the components of evals for AI products?

Evals include datasets, code assertions, LLM-as-Judge checks, and human evals to determine if outputs meet expectations. Start simple and increase sophistication as stakes rise.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Signup for Weekly Digest Emails

Categories

Archieve