Tag: eval-driven development

How to Structure Prompts for a Reliable AI Resume Coach

You can make an AI rewrite a resume with one sentence. The harder question is whether you can trust the next rewrite. A useful resume coach must stay grounded in the candidate’s evidence, adapt to the target role, ask when important facts are missing, and produce advice that a person can review quickly.

If you are building that coach, treat the prompt as a product specification rather than a clever instruction. Define what the model may change, what it must preserve, how it should make decisions, and what a passing response looks like. That structure is what turns an impressive demo into repeatable behavior.

Key takeaways

Give the coach a measurable job: improve clarity, impact, relevance, and ATS alignment without inventing experience.
Separate stable instructions from session evidence such as the resume, job description, audience, and formatting constraints.
Require diagnosis before rewriting so the model does not polish low-value content or force unsupported keywords into the resume.
Make every new claim traceable to candidate-provided evidence. Missing metrics, scope, or ownership should trigger a question, not a guess.
Use a fixed output contract and a representative evaluation set so prompt changes can be measured instead of judged by a few attractive examples.
Minimize personal data, define retention rules, and test whether the coach treats non-traditional career paths fairly.

Start with the coach’s behavioral contract

“Act as a resume expert” assigns a persona, but it does not define reliable behavior. Two responses can sound equally expert while one preserves the candidate’s record and the other quietly adds claims that were never supplied.

The first part of your prompt should therefore establish a contract with four elements: role, audience, success criteria, and evidence boundaries.

Role: Act as an experienced hiring manager and resume coach for the target field, such as SaaS product management.
Audience: Calibrate the advice for the candidate’s level and goal, whether that is an early-career role, a mid-career move, or an executive search.
Success criteria: Improve clarity, demonstrated impact, job relevance, and appropriate keyword coverage.
Evidence boundary: Do not invent metrics, employers, titles, responsibilities, tools, qualifications, or outcomes. Do not turn participation into ownership or ownership into leadership unless the candidate supplied that distinction.

The evidence boundary matters more than an instruction to “be accurate.” Accuracy is too abstract. Tell the model what transformations are permitted. It may reorder facts, remove repetition, tighten language, connect an explicit achievement to a relevant requirement, and propose questions that would strengthen a bullet. It may not manufacture the missing proof.

Set non-goals as well. The coach should not inflate seniority, guarantee an interview, or maximize keyword count at the expense of readable prose. ATS alignment should mean expressing genuine experience in language relevant to the role, not copying every phrase from the job description.

Define the minimum viable input

A rewrite should not begin until the model has enough information to make a defensible recommendation. Require these inputs:

The current resume or the specific sections to review.
The target job description.
The target role and candidate level.
Any hard constraints, such as preserving chronology, using a particular voice, or keeping bullets under 22 words.
Optional evidence that may not appear in the current resume, including metrics, team size, customer scope, decision authority, stakeholders, or business outcomes.

If the resume or job description is missing, the model should explain what it can do with the available material and ask for what it needs. If a stronger bullet depends on an absent metric, it should ask for the metric or offer a clearly marked fill-in structure. That is a better user experience than presenting polished fiction.

Build the prompt as a stack of distinct layers

A layered prompt architecture is easier to maintain because each instruction has one job. When the output fails, you can identify whether the problem came from missing context, weak examples, an incomplete workflow, or a loose quality gate.

Use the following order for a reusable prompt:

Role and goal: State who the coach is, whom it serves, and what a successful review improves.
Evidence and safety rules: Define which facts may be used, which inferences are prohibited, and when the coach must ask a question.
Session context: Insert the resume, job description, candidate level, target role, and formatting constraints in clearly labeled sections.
References: Supply the relevant role taxonomy, resume style rules, and evaluation rubric. Retrieve only the material needed for the target role when the reference library is large.
Examples: Show a good transformation, the evidence that supports it, and a counterexample that demonstrates an unacceptable habit such as buzzword stuffing.
Workflow: Tell the model how to move from requirement extraction to evidence mapping, diagnosis, clarification, rewriting, and verification.
Output contract: Name the required sections and fields so users and downstream systems receive a predictable result.
Quality gate: Require a final check for evidence fidelity, relevance, clarity, and compliance with the requested format.

Keep stable instructions in the system-level portion of your implementation. Pass candidate-specific material as session input. This separation prevents an individual resume from quietly redefining the coach’s operating rules and makes prompt versions easier to compare.

Use examples to teach judgment, not phrases

A before-and-after pair is useful only when the prompt also shows why the revision is better. Annotate the example with the source evidence, the job requirement it addresses, and the rule it demonstrates. Otherwise, the model may copy the surface pattern while missing the reasoning.

Use placeholders when illustrating a result that must come from the candidate. For example: “Led [initiative] across [scope], changing [business or customer measure] from [baseline] to [result].” Instruct the coach never to present a placeholder as a completed claim. If the underlying values are unavailable, the placeholder belongs in a follow-up question, not the finished resume.

Add a counterexample that sounds impressive but contains no proof, such as a string of leadership adjectives or tool names detached from an outcome. Label the exact failure: unsupported seniority, generic language, duplicated keywords, or no demonstrated result. Negative examples give the model a boundary, not merely a style preference.

Protect the important context when inputs are long

Long resumes, job descriptions, and reference libraries can compete for attention. Set an explicit retention order. Preserve the target requirements, candidate evidence, measurable outcomes, constraints, and evidence rules. Compress repeated background and low-relevance reference material first. Never summarize away a number, scope statement, qualification, or ownership detail that could determine whether a rewrite is supportable.

Retrieval is useful when you support several job families. Select the skill taxonomy and style guidance for the requested role instead of inserting the entire library into every session. Version those materials independently from the core prompt so a taxonomy update does not require an untracked rewrite of the coach’s behavioral rules.

Make the workflow evidence-first, not prose-first

The model should not start by rewriting the first bullet it sees. It needs to understand the hiring problem before changing the language. A staged workflow reduces the chance that fluent prose outruns the available evidence.

Extract the hiring signals. Separate the job description into capabilities, expected scope, domain knowledge, responsibilities, and desired outcomes.
Build an evidence inventory. Identify where the resume demonstrates each signal and distinguish direct evidence from a plausible but unverified inference.
Diagnose the gaps. Prioritize 3-5 improvements with the greatest effect on relevance, clarity, impact, or keyword coverage.
Resolve blocking unknowns. Ask about missing metrics, scope, ownership, stakeholders, or outcomes when those facts would materially change the rewrite.
Rewrite selectively. Revise the bullets that address the priority gaps. Preserve the candidate’s meaning and avoid changing every line merely to create visible output.
Verify the result. Check each bullet against the source evidence, target requirement, word constraint, and style rules before returning it.

This sequence also improves the conversation. A candidate can disagree with the diagnosis before spending time refining prose. The coach can show that a requirement is unsupported instead of hiding the gap behind adjacent keywords.

Use an output contract that exposes the reasoning

Do not ask for “feedback and improved bullets.” That output is difficult to evaluate and difficult to connect to a product interface. Require sections with distinct purposes:

Output block	What it must contain	Why it matters
Diagnosis	The most important strengths, gaps, and 3-5 priority changes	Prevents indiscriminate rewriting
Clarifying questions	Only questions that could materially affect a claim or recommendation	Surfaces missing proof before prose is finalized
Requirement map	Each important job requirement, supporting resume evidence, and unresolved gap	Makes relevance inspectable
Rewritten bullets	Original wording, proposed wording, evidence used, and requirement addressed	Allows line-by-line human review
Keyword coverage	Relevant terms already supported, missing concepts, and safe opportunities to improve wording	Separates alignment from keyword stuffing
Summary draft	A concise positioning statement based only on verified experience	Connects the candidate’s strongest evidence to the target role
Confidence and rationale	Where evidence is strong, where assumptions remain, and what would raise confidence	Prevents a polished tone from masking uncertainty
Quality check	Confirmation of evidence fidelity, clarity, relevance, and format compliance	Creates a final release gate

The confidence field should explain uncertainty rather than produce an unexplained score. A low-confidence rewrite is not automatically bad; it may reveal exactly which fact the candidate needs to confirm. An unexplained score adds precision without accountability.

Include a stop condition in the prompt: if a proposed sentence depends on an unsupported achievement, the coach must withhold that sentence from the final resume. It can present a question and a fill-in pattern separately. The user should never have to inspect fluent wording to discover which parts are guesses.

Evaluate the coach as a product, not a single response

A prompt is not reliable because it produced one excellent resume. Build a small, representative evaluation set containing different levels of resume quality, candidate seniority, job families, career paths, and job-description styles. Keep the underlying cases stable while you change the prompt.

Score each run against criteria that reflect the actual risk and value of the product:

Evidence fidelity: Can every rewritten claim be traced to candidate-provided material?
Requirement relevance: Does each priority recommendation address a meaningful hiring signal?
Impact and clarity: Does the language make ownership, scope, action, and outcome easier to understand without changing the facts?
Keyword judgment: Does the coach use role-relevant language only where the candidate’s experience supports it?
Question quality: Are follow-up questions necessary, specific, and capable of changing the output?
Schema compliance: Are all required sections present and usable by the interface or downstream workflow?
Human-rater alignment: Do qualified reviewers agree that the recommendations are accurate and useful?

Compare prompt variants by changing one meaningful layer at a time. A new exemplar, a revised evidence rule, and a different output schema solve different problems; changing all of them together makes the result difficult to interpret. Record the prompt version, case, pass or failure, and failure type. When performance drifts, that history tells you whether to tighten a rule, replace an example, adjust retrieval, or simplify the output.

Pay special attention to failures that attractive prose can conceal: invented scale, overstated ownership, unjustified seniority, lost metrics, or generic advice that could apply to any candidate. A slightly less elegant response that preserves evidence is preferable to a persuasive falsehood.

Design privacy and fairness into the workflow

Resumes contain personal and employment information. Minimize what enters the system before optimizing the prompt. Remove unnecessary contact details and other identifying information where possible, send only the sections required for the requested task, and avoid retaining raw resumes longer than the workflow requires.

Separate product telemetry from resume content. You can record that a response failed schema validation or contained an unsupported claim without preserving the candidate’s full document. Define who can access stored inputs, how deletion works, and whether retrieved reference material or model outputs are retained.

Fairness checks belong in the evaluation set. Include non-traditional career paths and resumes that describe equivalent skills in different language. Look for advice that systematically treats career gaps, unconventional titles, or less familiar employers as evidence of weak capability. The coach should identify missing evidence, not convert unfamiliarity into a negative judgment.

Start with one target role, a fixed prompt contract, and representative anonymized cases. Do not add more personas, tools, or job families until the coach can consistently preserve evidence, ask useful questions, and obey its output schema. Once those behaviors hold, expand the references and use evaluation results to decide what earns its way into the stack.

References

Shivam.Consulting Blog – Master Burger Prompting: Build a High-Impact AI Resume Coach with Proven LLM Structures

December 19, 2025

Trustworthy AI Product Engineering: From Demo to Daily Use
You have an AI feature that performs impressively in a demo. The difficult decision comes next: can you let it shape a customer’s workflow when its inputs may be incomplete, its output is probabilistic, and a polished answer can still be wrong?

The answer should not depend on confidence theater or one launch-day accuracy score. You need a product and engineering system that makes claims traceable, uncertainty actionable, failures bounded, and quality continuously measurable. That is what turns trust from a brand promise into a release criterion.

Define a trust contract before choosing the architecture

Trustworthy AI does not mean an AI product is always correct. It means the product is explicit about what it can do, shows the basis for consequential claims, declines work outside its operating boundary, and gives the user a safe way to recover when something goes wrong.

I treat every consequential AI workflow as having a trust contract. This is not a legal document or a general responsible-AI statement. It is a short product specification that connects a user decision to evidence, acceptable errors, system behavior, and ownership.

Write the contract before debating models or orchestration frameworks. Include these fields:
- User and decision: Name the person relying on the output and the decision the output will influence. Generating ideas and approving a customer-facing action are different products, even if they use the same model.
- Permitted claim: State what the system may conclude. A diagnostic assistant might identify a likely contributor to a metric change, but it should not present correlation as proven causation.
- Required evidence: Define the data, permissions, time range, comparison, and retrieval quality needed before the claim can appear.
- Uncertainty behavior: Specify when the product answers normally, adds a qualification, asks for more information, or abstains.
- Action boundary: Separate advice, preparation of a reversible action, and autonomous execution. Each step toward execution needs a stronger quality threshold and a clearer recovery path.
- Unacceptable outcome: Describe failures that block release, such as exposing another customer’s data, inventing a citation, applying an action to the wrong account, or concealing missing evidence.
- Quality measure and owner: Choose the metric that reflects the failure cost and assign a person who can stop or roll back the feature.
This contract prevents a common category error: treating model capability as product readiness. The same output quality may be acceptable when a user is brainstorming and unacceptable when the system is changing a live configuration. Risk comes from the combination of the output, the user, and the action that follows.

Consider an assistant investigating a drop in campaign performance. It may safely offer a hypothesis if it displays the metric, segment, comparison window, and missing data. It should not automatically reallocate a budget when the evidence is incomplete. The safe alternative is to keep the result advisory and require a person to verify the cited analysis before any consequential change.

If you cannot complete the trust contract, keep the feature inside a reversible, supervised workflow. That is not a failure to innovate. It is an accurate boundary for what the product can currently support.

Engineer an evidence path, not just an answer

A fluent response is an interface. It is not evidence. For an AI product to support a real decision, the user must be able to move from the claim to the data that supports it without reconstructing the system’s reasoning from scratch.

Start with a retrieval-first flow: authoritative data, retrieval, structured context, generation, policy checks, presentation, and telemetry. That requires robust data contracts and a deliberate orchestration layer, because no prompt can repair ambiguous field meanings, stale records, or broken permissions.

A useful data contract should tell the AI system and its operators:
- What each field means, including its unit and valid states.
- Which tenant, account, or user is allowed to access it.
- How fresh the value must be for the intended decision.
- How null, delayed, duplicated, or conflicting records are represented.
- Which transformations produced a derived metric.
- Which identifier links the generated claim back to the underlying record, query, chart, or dashboard.
Pass an evidence object through the system alongside the generated answer. At minimum, that object should contain the claim it supports, the source identifiers, filters, time window, retrieval timestamp, relevant transformations, and any missing or conflicting signals. The policy layer can then inspect the same evidence the interface will expose.

This design is stronger than asking the model to add citations after it has written an answer. A citation generated as decoration can look convincing while pointing to something irrelevant. A citation carried through the pipeline can be checked for permissions, relevance, and claim-level support before the user sees it.

In the interface, build an inspection ladder:
<!– wp:list {
December 18, 2025

How to Design Multi-Agent Fintech Support That Finishes Work

Your support prototype can explain what happens after a customer reports a stolen card. The harder product decision is whether you can trust it to carry that case from the first message to a verified outcome without losing state, skipping an approval, duplicating an action, or going silent while work remains open.

You will not solve that problem by adding a larger prompt or more conversational agents. You need an operating model for cases that span people, policies, systems, and days. The model below gives you a practical way to define the work, divide agent responsibilities, control execution, and measure whether the customer's problem was actually resolved.

Define the case before you define the agents

A stolen-card request exposes the central mistake in support automation. Freezing the card is visible, immediate, and easy to demonstrate. The less visible work may include dispute intake, fraud investigation, merchant communication, customer outreach, approvals, and follow-up. If your scope ends when the chat ends, you have automated the tip of the workflow while leaving its operational burden intact.

Start with a case contract. This is the shared definition of what entered the system, what outcome is owed, which actions are permitted, and what evidence will prove completion. Define it before deciding how many agents you need.

Customer outcome: State the result in operational terms. "Card secured and required follow-up completed" is more useful than "customer helped."
Entry conditions: Record the signals that create the case, including the customer request, the affected product, and any authentication or evidence requirements imposed by your policy.
Required work: Enumerate the actions, investigations, notices, approvals, and follow-ups that may sit below the initial request.
Allowed actions: Specify which tools may be called, which fields may be changed, and which financial or account actions require approval.
State and owner: Give every open case a current state and an accountable role. "The agents are working on it" is not a state.
Waiting conditions: Name the external event that can unblock the case, such as a customer reply, a system response, a timer, or a human decision.
Terminal conditions: Define resolved, declined, cancelled, transferred, and incomplete outcomes separately. Each one should require evidence and a reason code.

The strongest procedure starts as a workflow map owned by the people who understand disputes, fraud, operations, and compliance. Those subject-matter experts can maintain agent procedures in natural language, but natural language should not mean unmanaged prose. Give each procedure an owner, version, effective date, test cases, and approval history. A policy change should produce a traceable procedure change, not an invisible prompt edit.

Test your case contract with an awkward question: could the system truthfully tell the customer that the case is resolved while a mandatory downstream task is still pending? If the answer is yes, your terminal condition is wrong. Fix that before tuning response quality.

Split responsibilities at operational handoffs

A multi-agent design earns its complexity only when the separation makes ownership clearer. Creating several agents with overlapping prompts usually produces more routing ambiguity, not more capability. Divide the system where the nature of the work, permissions, or waiting behavior changes.

A useful pattern separates inbound, back-office, and outbound responsibilities while keeping procedures, skills, and guardrails on a shared foundation.

Agent role	What it owns	Typical handoff signal	Boundary to enforce
Inbound	Understands the request, gathers required details, performs permitted immediate actions, and creates or updates the case	The case has enough validated information to begin operational work	It cannot imply resolution merely because the conversation was handled
Back office	Executes system work, coordinates investigation steps, records evidence, and manages pending operational tasks	More information, an approval, or customer communication is required	It cannot invent missing evidence or bypass a policy gate to keep the case moving
Outbound	Requests missing information, communicates status or decisions, and follows up until a defined terminal condition is reached	The required response arrives, a timer fires, or the outreach policy is exhausted	It cannot decide that silence means success unless the procedure explicitly defines that outcome

The handoff should be a structured state transition, not an open-ended conversation between agents. Pass a compact case record containing the case identifier, current state, completed actions, evidence references, pending requirement, next allowed actions, applicable procedure version, and relevant deadline or timer. That record prevents the next agent from reconstructing the truth from a transcript.

Keep skills modular as well. "Send a status request," "retrieve transaction details," and "submit an approved case update" are easier to authorize, test, and audit than one broad tool called "handle dispute." Each skill should declare its required inputs, permitted states, side effects, expected result, and failure behavior.

Do not use separate agents simply to mirror your organization chart. Use them when different stages need different permissions, context, completion rules, or escalation paths. If two proposed agents can perform the same actions in the same states under the same controls, they probably belong together.

Let a state machine control long-running work

The language model can interpret a message and propose the next step. It should not be the sole authority on what state the case is in or which actions are legal from that state. A state-machine orchestrator can manage turns, triggers, and skill selection across an asynchronous case while the model handles the language inside those boundaries.

For an illustrative stolen-card workflow, your states might include:

Report received.
Immediate protection pending.
Immediate protection confirmed.
Required information under review.
Investigation or dispute work in progress.
Waiting on the customer, a merchant, an internal system, or a human approver.
Decision ready.
Required communication pending.
Resolved, transferred, declined, cancelled, or closed incomplete with a recorded reason.

Adapt the states to your product, operating procedure, and regulatory obligations. The value is not in these labels. It is in making every transition explicit. For each transition, specify the triggering event, required preconditions, allowed skill, expected side effect, accountable role, failure path, timer behavior, and evidence written back to the case.

Then scope skills deterministically for each turn. An agent handling a customer reply while the case is waiting for information may be allowed to validate the reply, attach evidence, request a missing item, or resume the workflow. It should not be able to perform unrelated account actions simply because those tools exist elsewhere in the platform. This per-state allow-list reduces the number of unsafe choices the model can make.

Async triggers deserve the same design care as messages. A customer reply, API status change, timer expiry, failed tool call, and human approval are all events that can create a new turn. Store them durably and process them against the current case version. Otherwise a delayed event can act on stale state after the case has already moved forward.

Financial actions also need protection from retries. A timeout does not prove that a tool failed; the action may have succeeded while the response was lost. Use an idempotency key where the receiving system supports one, record the attempted operation before retrying, and reconcile uncertain outcomes. Blindly repeating a freeze, refund, fee adjustment, or dispute submission can create customer harm and financial exposure.

Outbound completion needs its own rule. The customer may never send a final message, so "the conversation ended" cannot define success. A defensible terminal condition can require that the necessary notice was sent, mandatory actions are complete, no unresolved task remains, and any follow-up timer has reached the outcome defined by policy. Silence may end an outreach attempt; it does not automatically prove the underlying case was resolved.

Finally, write an audit record for every transition. Capture the prior state, event, procedure version, allowed skills, selected action, tool result, guardrail result, human decision if present, and resulting state. A transcript tells you what was said. A transition log tells you why the system acted.

Make compliance and human review part of execution

Do not reduce compliance to a paragraph at the end of the system prompt. High-stakes rules need controls at the point where the system interprets information, chooses an action, changes a case, or communicates a decision.

Use three complementary layers:

Deterministic controls: Enforce permissions, required fields, state preconditions, transaction limits defined by your policy, and mandatory approvals in code or workflow configuration.
Classification guardrails: Detect whether an input, proposed action, or outgoing message belongs to a risk category that must be blocked, revised, or reviewed.
Human decisions: Route policy exceptions, consequential approvals, conflicting evidence, ambiguous cases, and unsupported operations to an accountable person.

For critical regulatory checks, treat guardrails as classification problems and prioritize recall when missing a risky case is more costly than sending an extra case to review. That choice has an operational consequence: more false positives can increase manual workload and delay customers. Product, operations, risk, and compliance owners should agree on that trade-off for each guardrail rather than applying one global threshold.

Every classifier needs a defined consequence. A positive result might block an action, remove a skill from the current turn, require human approval, or permit the workflow to continue with additional logging. A score without an execution rule is only dashboard data.

Customer-specific policies matter in a platform serving more than one fintech. The system may share an architecture while each customer requires its own procedures and guardrails. Resolve the applicable policy set from trusted configuration before the model acts, attach the policy version to the case, and prevent cross-customer retrieval or tool access. Do not ask a model to infer which client's rules should apply from conversational context.

Human escalation should be a first-class tool call, not a side-channel message. The request should contain the exact decision needed, current state, relevant evidence, attempted actions, available options, policy context, risk of delay, and response deadline. The human's answer should return as a recorded workflow event so the orchestrator can validate it and resume from the correct state.

This pattern is especially important when an API is missing. A person may complete the task in an internal system, but the agent must not assume it happened. Require a structured confirmation and evidence before advancing the case. If that evidence never arrives, keep the case visibly pending or escalate it according to the procedure.

Because these workflows can affect money, account access, customer rights, and regulatory obligations, your AI design cannot substitute for review by qualified legal, compliance, risk, and operations owners. Let those owners approve the policies, controls, escalation criteria, and customer communications before live execution. Begin with read-only or reversible capabilities where possible, and do not grant autonomous financial actions until the failure and recovery paths have been tested.

Measure verified resolution and improve from failures

A conversational system can produce polished replies while leaving cases unfinished. That is why containment or deflection cannot be your sole success metric. The primary question is whether the case reached the correct terminal state with the required evidence, policy checks, and customer communication.

Build a metric hierarchy that separates outcomes from diagnostics:

Case outcome: Track the share of eligible cases reaching a verified terminal state, along with cases reopened, transferred, or found incomplete during review.
Customer experience: Track customer satisfaction and whether the customer must contact support again because ownership or status was unclear.
Operational performance: Track time to resolution, first-contact resolution where that metric is genuinely applicable, deflection, escalation rate, waiting time by state, and human work by escalation reason.
Risk performance: Track critical guardrail misses, false-positive reviews, unauthorized action attempts, procedure deviations, and cases advanced without required evidence.
Agent-stage performance: Track routing accuracy, skill success, handoff completeness, tool failures, timer outcomes, and terminal-state correctness for each role.

Be careful with first-contact resolution in workflows that are supposed to run asynchronously. A fraud investigation may remain open after a perfectly handled first interaction. Optimizing the agent to close the contact can therefore conflict with the real outcome. Use time to verified resolution and unresolved-work visibility alongside conversation metrics.

Evaluation should inspect both language and execution. A useful case-level rubric asks whether the system understood the request, selected an allowed skill, used the correct procedure version, obtained required evidence, respected guardrails, preserved context at handoffs, communicated accurately, and entered the right terminal state.

An automated evaluation pipeline can flag cases for human review and turn reviewed failures into labeled data. Do not sample only obviously failed conversations. Include high-risk classifications, recently changed procedures, new skills, long-running cases, human escalations, unusual state transitions, tool errors, and a baseline sample of apparently successful cases. Otherwise your evaluation set will miss failures that look normal in aggregate metrics.

Give every reviewed failure a place in a product backlog. The fix may belong to the procedure, state machine, skill contract, integration, guardrail, escalation path, or model behavior. "The agent made a mistake" is too broad to assign. A stable failure taxonomy tells you which layer should change and which regression tests must be added before release.

A sensible implementation sequence is:

Choose one bounded journey with a meaningful operational tail and a clearly accountable owner.
Map the full case, including hidden back-office steps, waiting states, approvals, exceptions, communications, and terminal conditions.
Define the case schema, events, state transitions, evidence requirements, and audit record.
Assign inbound, back-office, and outbound responsibilities only where permissions or completion rules differ.
Expose narrow modular skills and apply a deterministic allow-list in every state.
Add compliance classifiers, hard controls, and human decision gates before enabling consequential actions.
Run historical, synthetic, or controlled cases through the workflow and evaluate the complete case, not just the generated messages.
Release gradually, monitor state-level failures, and feed reviewed cases back into procedures, controls, and regression evaluations.

Key takeaways

Scope the customer's complete case before choosing the number of agents.
Separate agents at real permission, workflow, or completion boundaries.
Let the model interpret language, but let explicit state and policy control execution.
Treat human review as a structured workflow event with an owner and deadline.
Define "done" with evidence; a finished chat is not a finished case.
Optimize for verified resolution, policy adherence, and safe recovery rather than response quality alone.

At your next design review, put one real support case on the page and ask four questions: where can it wait, what event unblocks it, who approves a risky action, and what evidence proves completion? If your team cannot answer all four from the workflow, the system is not ready to act. Once those answers are explicit, agent boundaries become an engineering decision instead of a bet on autonomous behavior.

References

Shivam.Consulting Blog — Beyond the Support Iceberg: Gradient Labs' Multi-Agent Breakthrough That Actually Gets Work Done

December 18, 2025

AI Product Management Skills: A Practical 12-Month Roadmap
You may know how to prompt a model and still feel unprepared to own an AI product. That gap is real. Producing a plausible response is easy; deciding what should be built, how to evaluate it, when to trust it, and whether it improved the user journey requires a broader product skill set.

The useful roadmap is not a queue of courses or tools. It is a sequence of increasingly consequential work: understand model behavior, turn ideas into testable artifacts, ship a bounded workflow, and then build the operating system that lets more teams do it responsibly.

What you should be able to do after 12 months

An AI product manager does not need to become a machine-learning engineer. You do need enough technical judgment to frame a feasible problem, challenge an architecture, inspect failures, define an evaluation, and make a release decision with engineering and design.

The 12-month progression from foundations to governed scale works because each phase produces evidence needed by the next one. You learn model constraints before promising a user experience. You build evaluations before exposing the system to real customers. You prove one workflow before standardizing it across a product organization.

Key takeaways
- Months 1-3: Learn model behavior, context management, prompting, retrieval, privacy, and data governance. Apply them to product discovery.
- Months 4-6: Build prototypes and an evaluation system. Instrument activation and retention before treating the feature as ready.
- Months 7-9: Ship a bounded AI-enabled workflow with safeguards, monitoring, recovery paths, and clear human control.
- Months 10-12: Standardize evaluation gates, analytics, discovery practices, roadmapping, and outcome-based reporting.
Treat these as capability gates, not calendar milestones. If you cannot explain why a prototype failed in month six, more production infrastructure will not fix the problem. If you cannot show that users received value in month nine, scaling the feature will only distribute uncertainty.

By the end of the roadmap, your portfolio should contain operating artifacts rather than course certificates: an AI product brief, a prompt and retrieval pattern, a reusable evaluation set, an instrumented production workflow, a risk checklist, and a scale playbook. Those artifacts demonstrate that you can move from possibility to accountable product performance.

Months 1-3: Learn enough AI to make sound product decisions

Your first objective is not technical fluency for its own sake. It is learning where model behavior changes a familiar product decision. A deterministic feature is expected to return the same result for the same state. A generative feature can produce different, incomplete, or confidently incorrect outputs. That changes acceptance criteria, testing, interface design, and the meaning of “done.”

Build an operator’s mental model

Work through four capabilities in order:
1. Model behavior and constraints: Learn what the model receives, what it produces, where variability enters, and which failures matter to the user. You should be able to distinguish a capability problem from a context, instruction, or workflow problem.
2. Context window management: Decide which information belongs in the model’s working context, which information is stale, and which information should never be sent. More context is not automatically better context. Irrelevant material can obscure the evidence the task actually requires.
3. Prompting as product specification: Write reusable instructions that state the task, relevant context, constraints, required output, and quality criteria. Save the prompt with examples of both acceptable and unacceptable behavior. A prompt library is useful only when another person can reproduce and assess the result.
4. Retrieval-first design: For tasks that depend on changing or proprietary knowledge, learn the basic pipeline: retrieve relevant approved information, give that information to the model, generate an answer, and preserve enough traceability to investigate failures. This is a product choice as much as an architecture choice because it determines what the experience can reliably know.
Pair these capabilities with privacy-by-design and data governance from the beginning. Before using customer or company information, write down which data classes are permitted, who can access them, where they may be retained, and what must be removed or masked. If those answers are unclear, use synthetic or explicitly approved material until the policy is settled. Avoiding sensitive data at the prototype stage is safer than trying to remove it after it has spread through prompts, logs, and evaluation files.

Apply the foundations to product discovery

Discovery gives you a low-risk place to practise. Use generative AI to summarize research, cluster feedback, compare recurring needs, or sharpen a value proposition. Keep the model in an assistive role: every synthesized theme should remain traceable to the underlying customer evidence. If you cannot inspect the feedback behind a cluster, you cannot tell whether the model found a pattern or flattened important differences.

Create an AI product brief for one candidate problem. Include:
- The user and the job they are trying to complete.
- The decision or work the model will assist with.
- The inputs the system may use and the inputs it must reject.
- The expected output and the conditions that make it useful.
- The consequence of a wrong, missing, or delayed output.
- The point at which a person reviews, edits, approves, or overrides the result.
- The product signal that would show improved user behavior.
You are ready for the next phase when you can explain the proposed experience without hiding behind model vocabulary. You should be able to identify the necessary context, name the important failure modes, explain whether retrieval is needed, and show how the user remains in control.

Months 4-6: Prototype the experience and build its evaluation system

A prototype is valuable when it tests uncertainty, not when it merely looks polished. Use generative AI to accelerate UX mocks, PRDs, in-app guidance, and alternative interaction flows, but spend the saved time on the questions that determine whether the product deserves to ship.

Prototype the entire decision loop. Show where the user supplies context, how the result is presented, what happens when the answer is weak, how the user corrects it, and whether that correction improves the next step. The error state is part of the primary AI experience; hiding it until engineering integration creates false confidence.

Use evaluation as a development method

Eval-driven development turns a vague judgment such as “the answers seem good” into a repeatable product decision. Build the evaluation alongside the prototype:
1. Define the task boundary. State what the system is expected to do and what remains outside its responsibility.
2. Collect representative cases. Include normal inputs, ambiguous inputs, missing information, adversarial behavior, and cases where the correct response is to stop or ask for clarification.
3. Write a scoring rubric. Assess the properties the user actually needs, such as correctness, relevance, completeness, appropriate tone, traceability, or compliance with a constraint.
4. Record a baseline. Compare the proposed experience with the current workflow or a simpler non-AI alternative. A model output is not valuable merely because it exists.
5. Inspect failure patterns. Separate prompt failures, missing-context failures, retrieval failures, model limitations, interface confusion, and policy violations. Each category points to a different remedy.
6. Set a release gate. Decide which failures block launch, which require human review, and which are tolerable in the intended use case. The gate should reflect the consequence of error, not enthusiasm for the feature.
Keep the evaluation set versioned with the product. When you change the prompt, model, retrieval logic, or available tools, rerun the same cases. Otherwise, an apparent improvement in one example can conceal regressions elsewhere.

Instrument behavior before launch

Quality evaluation and product analytics answer different questions. An evaluation tells you whether the system behaved acceptably on known cases. Behavioral analytics tells you whether customers reached value in the product.

Define the journey in Amplitude or your existing analytics system before exposing the prototype broadly. Capture the moment a user encounters the feature, supplies enough information, receives an output, accepts or edits it, completes the downstream task, returns to use it again, abandons it, or escalates to a person. That sequence gives you activation and retention signals rather than a vanity count of generations.

If you run an A/B test, choose the minimum detectable effect before launch. The decision matters because an experiment that cannot detect a product-relevant change may produce an inconclusive result even when the dashboard looks busy. Define the primary outcome, guardrail metrics, exposure rule, and analysis plan before looking at the results.

Move forward when the prototype solves a defined task, the evaluation catches meaningful failures, the events expose the user journey, and the experiment can answer a decision. A persuasive demo without those four elements is still a demo.

Months 7-9: Ship a bounded workflow, not an open-ended assistant

The production phase is where product judgment becomes visible. Start with a workflow that has a recognizable beginning, end, and owner. Customer-support, CRM, and guided-onboarding workflows are useful patterns because the AI can sit inside an existing user journey rather than asking customers to invent a use case from a blank chat box.

Screen the workflow before committing engineering capacity:
- Is the user’s job clear enough to define a successful completion?
- Does the system have access to approved, relevant context?
- Can you observe whether the user accepted, corrected, ignored, or escalated the output?
- What happens to the customer if the system is wrong?
- Can a consequential action be paused, reviewed, or reversed?
- Is a generative system materially better than a rule, search result, template, or conventional workflow?
Use agentic AI only when the job genuinely requires several connected steps, tool use, or changing plans. Additional autonomy also creates more places for permissions, context, and actions to go wrong. Begin with the narrowest useful boundary, then expand it when production evidence supports the change.

Map the production loop before building it

A product trio should be able to trace the complete workflow on one page:
1. Trigger: What user action or system event begins the workflow?
2. Context: Which profile, conversation, account, or knowledge records are retrieved?
3. Generation or decision: What does the model produce, classify, recommend, or plan?
4. Tool action: Which systems can it read from or write to, and under whose authority?
5. Human checkpoint: Which output can be edited, rejected, or approved before it changes customer data or sends an external message?
6. Recovery: How does the product handle low confidence, missing data, tool failure, timeouts, or a user correction?
7. Learning signal: Which feedback updates the evaluation set, product decision, or workflow design?
Place safeguards at the point of consequence. Restrict the data and tools the workflow can access. Require explicit approval before a high-impact external action. Preserve a record of the inputs, retrieved context, output, action, and user response so a failure can be investigated. If an action cannot be safely reversed, keep it behind human review until the risk has been addressed.

Threat detection and response also need a product playbook. Define what counts as suspicious input or abnormal behavior, who receives the alert, how the workflow is disabled or contained, what evidence is retained, and how affected users are handled. The escalation path should exist before the first serious incident, not be improvised during it.

Monitor the experience at four levels
- User outcome: Did the customer complete the intended job with less effort or fewer avoidable handoffs?
- AI quality: Are the evaluation scores and failure categories changing after releases?
- Workflow health: Are retrieval, model, and tool steps completing as expected, and can the team locate the failing stage?
- Risk: Are users overriding outputs, escalating cases, encountering policy violations, or triggering suspicious behavior?
Track deployment frequency because a team that can release safely can also learn faster. Do not confuse release frequency with customer value, though. The useful loop connects a deployment to a quality change, a behavior change, and a decision about what to do next.

Months 10-12: Turn one successful product into a repeatable system

Scaling is not copying the same AI feature into every surface. It is making the successful practices reusable while preserving room for different user risks and workflow requirements.

Codify the operating assets that reduced uncertainty during the earlier phases:
- An intake template that starts with the user problem, current workflow, expected outcome, and consequence of error.
- A continuous-discovery practice that keeps generated themes connected to original customer evidence.
- A retrieval-first architecture template for products that depend on approved or changing knowledge.
- A shared prompt library with owners, versions, expected behavior, and known limitations.
- An evaluation gate covering representative cases, blocking failures, human-review requirements, and regression checks.
- A production checklist covering permissions, privacy, observability, recovery, threat response, and user control.
- A monitoring cadence that connects product behavior, AI quality, workflow health, and risk.
Do not impose one universal quality threshold on every AI feature. A low-consequence drafting aid and a workflow that changes a customer account do not carry the same downside. Use the same evaluation process across teams, but set release gates according to the task, affected user, reversibility, and consequence of failure.

Use common analytics without erasing product context

A unified analytics model lets leadership compare lift across products without forcing every team to use an identical funnel. Standardize the basic meanings of exposure, meaningful use, successful task completion, correction, abandonment, escalation, and return usage. Then let each product define the events that represent those states in its own journey.

This is also where roadmapping and sprint planning should move from output commitments to outcome-based decisions. “Ship an AI assistant” is an output. A useful objective describes the customer behavior or business result that should change. The roadmap can then contain competing ways to produce that change, including improvements that do not require AI.

Use a consistent stakeholder narrative:
- What shipped: The workflow or capability placed in users’ hands.
- What moved: The user, product, quality, and risk signals that changed.
- What was learned: The assumptions confirmed, rejected, or still unresolved.
- What happens next: The decision to expand, revise, contain, or stop the work.
That structure prevents activity from masquerading as progress. It also gives executives a clear basis for funding decisions: evidence of value, evidence of control, and a specific next bet.

Start this week with one recurring user decision. Write its AI product brief, run the workflow manually with permitted data, and save the successful and failed cases as the beginning of an evaluation set. If you cannot define a good result or the consequence of a bad one, stay in discovery. If you can, you have a concrete first artifact and a reason to proceed to a prototype.

References
- Shivam.Consulting Blog – Master AI as a Product Manager in 12 Months: My 2026 Roadmap to Ship Smarter, Faster
December 17, 2025

Context-Driven AI Product Engineering That Survives Production

Your AI feature can look excellent in a demo and still fail in production. The prompt has not changed, but the user, account, permissions, available data, and business decision have. A fluent answer built on the wrong context is still the wrong answer.

If your team keeps rewriting instructions to fix inconsistent results, inspect what the model can see, why it can see it, and what it is expected to do with that information. Context-driven AI product engineering turns those decisions into a versioned, measurable product system rather than hiding them inside one large prompt.

Determine whether context is actually the bottleneck

Runtime context is the complete package available to the model for a specific task. It includes instructions, retrieved evidence, permissions, conversation state, memory, tool definitions, metric definitions, output requirements, and stop conditions. Prompt text is only one part of that package.

This distinction matters because different failure classes require different fixes. A prompt change cannot retrieve a missing CRM record. A larger model cannot make a stale policy current. Better prose cannot repair an authorization error. Start by assigning every bad result to the layer that produced it.

Evidence is missing: the necessary record, document, event, or metric never reached the system.
Evidence was available but not selected: retrieval, filtering, metadata, or ranking favored the wrong material.
Evidence is stale or contradictory: the system lacks a freshness rule or conflict-resolution policy.
The procedure is incomplete: the model has facts but not the sequence, metric definition, or decision rule needed to use them.
The scope is unsafe: the context contains data the current user, role, tenant, or workflow should not access.
The answer contract is unclear: the model does not know when to cite evidence, expose uncertainty, request missing input, call a tool, or abstain.
The answer is technically correct but operationally unhelpful: it does not fit the user’s role, decision, timing, or next action.

For one failed session, reconstruct the full path instead of reading only the final answer:

Capture the user’s request, detected intent, role, tenant, and relevant permissions.
Record the retrieval queries, filters, candidate results, metadata, and ranking scores.
Show which candidates entered the context, which were excluded, and why.
Inspect the assembled instructions, evidence, memory, tool contracts, and output schema.
Record every tool call, returned result, retry, timeout, and policy decision.
Compare the answer with the evidence that was actually available at generation time.

The resulting trace gives you a practical decision tree. If the correct evidence was absent from the candidate set, fix ingestion or retrieval. If it was retrieved but excluded, fix ranking or context packing. If it entered the prompt but the answer contradicted it, test instruction hierarchy, conflict handling, or model behavior. If the evidence and answer were both correct but the user still could not act, fix the product experience.

This is why a retrieval-first, context-aware design usually creates more leverage than another round of isolated prompt editing: it makes the evidence path visible and gives each failure an identifiable owner.

Write a context contract before choosing the architecture

A context contract defines what the AI needs for one product task, where that context may come from, how it must be constrained, and what the system should do when the contract cannot be satisfied. It is the interface between product intent and runtime engineering.

Consider an account-risk assistant used by a customer success manager. Its contract could look like this:

Contract field	Decision to make	Example implementation
Task boundary	What may the AI decide or produce?	Summarize risk signals and propose a next step; do not change the account record.
Authorized evidence	Which information is both relevant and permitted?	CRM fields, recent support history, approved playbooks, and defined product-usage metrics visible to the current user.
Identity and scope	Which user, tenant, account, and role govern access?	Resolve all four before retrieval and preserve them through every tool call.
Freshness	How current must each evidence type be?	Carry the captured-at timestamp and qualify the answer when a required record exceeds the product’s approved freshness window.
Conflict rule	What happens when trusted inputs disagree?	Expose the conflict and its timestamps instead of silently choosing one value.
Procedure	Which reasoning process should the workflow execute?	Identify the account, retrieve authorized signals, apply metric definitions, compare evidence, state caveats, and propose an action.
Output contract	What structure must the response follow?	Answer, supporting evidence, caveats, recommended action, and provenance.
Abstention rule	When should the system decline to conclude?	Report missing evidence when a required record, metric definition, or permission check is unavailable.
Audit payload	What must be reproducible later?	Context-contract version, evidence identifiers, timestamps, policy version, tool results, and model configuration.

The contract should keep five kinds of context distinct. Task context says what the user is trying to accomplish. Evidence context contains facts relevant to that task. Policy context defines permissions, governance, and prohibited behavior. Interaction context carries the useful parts of the current conversation and approved long-term memory. Execution context defines tools, schemas, retries, and stop conditions.

Keeping those layers separate prevents a common production mistake: treating all text as equally authoritative. A user’s request should not override a permission rule. A retrieved comment should not outrank an approved policy. An old conversation should not silently redefine a current metric. Your assembly logic needs an explicit precedence order for these collisions.

Personalization belongs in the contract too. Intent and role should narrow context, not merely add more of it. A finance user may need policy-safe excerpts and transaction evidence. A customer success user may need current account activity and support history. A product manager may need metric definitions, cohorts, experiment state, and caveats. Role-aware assembly and scoped memory make the same underlying capability useful without exposing every available field to every request.

You know the contract is testable when each field can become a pass-or-fail assertion. Did the workflow apply the current permission scope? Did it include the required metric definition? Did it expose a conflict? Did it abstain when decisive evidence was unavailable? If a requirement cannot be tested or observed, it is still an aspiration rather than an engineering contract.

Build context assembly as a controlled pipeline

The production unit is not a prompt template. It is the pipeline that converts a user request into a bounded evidence packet and an executable task. That pipeline should have explicit stages:

Authorize the request. Resolve identity, role, tenant, account scope, and permitted operations before searching for evidence. Apply access controls again before generation as a second check.
Normalize the inputs. Give each record or chunk a stable identifier plus source type, owner, tenant, timestamp, policy classification, schema version, and other metadata needed for filtering.
Generate retrieval candidates. Combine semantic retrieval for conceptually related language with keyword retrieval for exact identifiers, product names, codes, and policy terms.
Filter and rank for the task. Use intent, role, account, freshness, authority, and source-level confidence in addition to semantic similarity.
Resolve stale and conflicting evidence. Apply the contract’s freshness and precedence rules before the model sees the packet. Preserve unresolved conflicts as explicit context.
Pack the context window. Allocate space by priority, remove duplicates, keep decisive passages intact, and exclude material that does not change the task.
Execute through a defined interface. Supply tool schemas, metric definitions, procedure steps, output fields, citation requirements, and abstention conditions.
Attach provenance and emit a trace. Store identifiers and versions needed to reproduce the decision without indiscriminately copying sensitive raw content into logs.

Hybrid retrieval is useful because semantic and lexical search solve different problems. Semantic search can find a relevant concept expressed in different words. Keyword search protects exact matches such as an account identifier, event name, plan code, or policy term. Metadata then makes the results usable: a highly similar passage from the wrong tenant or an obsolete policy is not a valid result.

Authorization must shape retrieval itself. Do not search a global corpus, rank everything, and rely on a final prompt instruction to hide unauthorized results. That approach can expose sensitive material to intermediate services, caches, traces, or debugging tools even if it never appears in the final answer. Filter at the retrieval boundary, preserve tenant and role scope through tool calls, and validate the assembled packet before generation.

Context-window management is also a relevance problem, not just a token-count problem. Reserve capacity in a deliberate order: non-negotiable policy and permissions, the current task, decisive evidence, required procedure and definitions, recent interaction state, then supplemental material. When the packet is too large, compress or drop lower-priority evidence rather than truncating whichever section happens to come last.

Memory needs its own product rules. Short-term conversation state should retain unresolved references, user corrections, and active task decisions. Long-term memory should be scoped to durable facts that the product is allowed to retain. Define how memory is written, validated, refreshed, read, and deleted. Dumping a full transcript into every turn increases noise and can revive facts or instructions that no longer apply.

For analytical products, context must include a procedure as well as data. A reliable workflow starts with the decision to be made, anchors it to metric definitions and guardrails, retrieves trusted data, generates testable hypotheses, segments the evidence, and returns options with trade-offs and caveats. That structured analyst loop is far easier to evaluate than a broad instruction to analyze the data.

The same restraint applies to agents. Use multiple steps or tools when decomposition makes the task clearer, safer, or more verifiable. Each step needs an input schema, permitted tools, completion condition, failure path, and evidence handoff. Agentic patterns are most useful when task decomposition reduces real complexity; extra autonomy without a clearer control boundary simply creates more places for context to drift.

Ship with layered evaluations, observability, and ownership

Evaluate the evidence path before scoring the prose

A single answer-quality score hides the layer that failed. Build an evaluation stack that follows the same stages as the runtime pipeline:

Retrieval evaluation: Was the required evidence present in the candidate set, and where did it rank?
Assembly evaluation: Did the final packet include required facts and policies, exclude unauthorized or irrelevant material, preserve provenance, and respect freshness rules?
Behavior evaluation: Did the model follow the procedure, use the supplied evidence, handle conflicts, cite support, and abstain when required?
Answer evaluation: Was the result correct, grounded, complete enough for the task, and structured as promised?
Product evaluation: Did the user complete the task, reach an answer faster, correct the output, return to the capability, or escalate to a human?
Operational evaluation: Did latency, context size, cost, tool failures, permission denials, and fallback behavior stay within the product’s approved limits?

Your offline evaluation set should represent the failure surface, not just normal requests. Include different roles and intents, sparse accounts, stale records, contradictory inputs, missing definitions, empty retrieval, tool failures, unauthorized requests, and cases where abstention is the correct result. Label the evidence that should be retrieved as well as the answer that should be produced. Otherwise, a system can pass by reaching the right conclusion through the wrong material.

Version the evaluation cases, context contract, retrieval configuration, policy set, prompt, tools, and model independently. Change one major layer at a time when possible. If a model upgrade, ranking change, and prompt rewrite ship together, an improved aggregate score will not tell you what worked or which change caused a regression in a sensitive slice.

After offline acceptance, use staged online experiments with a predeclared outcome, guardrails, acceptance threshold, and minimum detectable effect. Task success, groundedness, time to first answer, adoption, and deflection can all be useful, but only when they match the workflow. A support assistant should not optimize deflection by confidently blocking necessary escalation. An analytical assistant should not optimize speed by dropping caveats required for a sound decision.

Instrument enough to reproduce failure without creating a new data risk

For each request, emit a structured event envelope containing the workflow and context-contract versions, detected intent, authorized scope, retrieval-query identifier, evidence identifiers, ranking metadata, freshness state, tool outcomes, policy decisions, answer status, latency, and user feedback. This gives product and engineering a common record for diagnosing failure.

Do not default to logging every raw prompt, retrieved document, or tool response. Production context can contain customer data, confidential policy, or personal information. Prefer stable identifiers, approved redaction, access-controlled traces, and retention rules. Keep the minimum raw material needed for authorized debugging and evaluation, and make data ownership explicit.

Roll out in stages: run the new pipeline against offline cases, observe it without user impact where possible, expose it to a constrained cohort, compare it with the existing experience, and expand only after both quality and operational guardrails hold. Preserve a feature flag, a known-safe fallback, and a rollback path for context changes as well as model changes.

Give every context surface an owner

Context crosses organizational boundaries, so shared responsibility without named ownership turns into drift. Assign decisions explicitly:

Product owns the task boundary, target user, intended decision, outcome metric, failure taxonomy, and acceptance trade-offs.
Design owns how evidence, uncertainty, correction, abstention, and human handoff appear in the experience.
AI and platform engineering own retrieval, ranking, assembly, tool interfaces, reproducibility, evaluation infrastructure, and fallbacks.
Data owners own schemas, metric definitions, lineage, freshness, and the authoritative status of each collection.
Security, privacy, and governance owners define permitted use, redaction, retention, and audit requirements.
SRE owns service-level monitoring, failure alerts, capacity behavior, deployment safety, and rollback readiness.

A Staff AI Engineer can connect these concerns by turning research choices into repeatable workflows and shared evaluation infrastructure, but that role should not become the sole owner of product judgment, source governance, or production reliability. Cross-functional execution works when each decision has one accountable owner and the whole group uses the same context trace and evaluation results.

Treat context changes like code changes. A release should identify the changed source, schema, ranking rule, contract, or policy; show the affected evaluation slices; state the expected product outcome; and preserve a rollback path. CI/CD guardrails, drift monitoring, and human review turn context from an informal prompt dependency into an operable platform capability.

Key takeaways

Diagnose the failed layer before editing the prompt. Missing evidence, bad ranking, stale data, unsafe scope, incomplete procedure, and weak UX are different problems.
Define a context contract for each workflow: task boundary, authorized evidence, freshness, precedence, procedure, output, abstention, and audit payload.
Authorize before retrieval, rank with task and metadata signals, and validate the assembled packet before generation.
Manage the context window by authority and decision value, not by filling every available token.
Evaluate retrieval, assembly, model behavior, answer quality, user outcomes, and operational performance separately.
Version context components independently, release them through staged controls, and assign an accountable owner to every surface.

At your next AI product review, do not approve the experience from the final answer alone. Ask to see the evidence packet, permission scope, context-contract version, failed evaluation slices, runtime trace, and rollback path. Those artifacts reveal whether the feature is dependable or merely persuasive.

Start with one production workflow whose failures matter to users. Trace its most common failure, write the contract, repair the responsible layer, and require the change to pass both offline evaluation and a guarded rollout. Once that loop works, you have the foundation for a reusable context platform rather than another prompt that only works in the demo.

References

December 16, 2025

AI in Product Design: My Proven Playbook, Real Use Cases, and the Tools That Win Faster

In product design, AI has shifted from novelty to non-negotiable. I’ve watched teams accelerate discovery, compress prototyping cycles, and turn ambiguous ideas into validated experiences faster than ever—without sacrificing quality or customer trust.

AI in product design has quickly moved from new to necessary. Here are the AI product design tools and approaches you need to stay relevant in this decade.

From my vantage point leading product teams, “necessary” means AI is woven throughout the product lifecycle—discovery, prioritization, prototyping, validation, and iteration—not bolted on. The goal isn’t to chase hype; it’s to build durable advantage with clear AI Strategy, disciplined execution, and measurable outcomes.

First, anchor the work in strategy. Tie every AI initiative to a specific customer problem and value proposition, then express that linkage with outcomes vs output OKRs. This keeps teams focused on real impact and avoids feature-chasing. It also sharpens product positioning and clarifies where AI can deliver competitive differentiation versus simple points of parity.

Second, upgrade discovery. I rely on AI workflows to synthesize interviews, cluster themes, and surface insights at scale. A retrieval-first pipeline—grounding models in our own data—improves factuality and reduces hallucinations. Combine this with strong data governance and privacy-by-design so insights are trustworthy and compliant from day one.

Third, make quality measurable. Adopt eval-driven development: define evaluation sets and acceptance thresholds that reflect real user tasks before you ship. Pair that with A/B testing and minimum detectable effect (MDE) discipline, so you learn quickly and confidently. Add safety guardrails (red-teaming prompts, content filters, and bias checks) to manage AI risk without slowing the pace.

Fourth, enable empowered product teams. Product trios (PM, design, engineering) should co-create prompts, prototypes, and evaluation criteria. Give designers and PMs practical tools—LLMs for product managers, structured prompt templates, and reusable components—so AI-augmented work becomes the default, not a special project.

Where does AI shine in product design today? Concept exploration and market scans, turning fuzzy opportunity spaces into crisp problem statements. Rapid wireframes and interaction ideas, using gen ai for product prototyping to explore multiple design directions in minutes. UX writing that adapts tone and reduces friction across onboarding, tooltip design, and microcopy.

It also excels at guided experiences. I’ve seen strong lifts in user activation when we pair in-app guides and product tours with context-aware suggestions. For support and education use cases, a retrieval-grounded assistant can deflect tickets, shorten time-to-value, and reinforce the product’s value proposition at the exact moment a user needs help.

Voice is another frontier. A well-scoped voice AI agent can accelerate complex workflows (think data entry or multi-step configurations) when hands-free is faster or more intuitive. Just be intentional about when agentic AI adds net value versus when a simple UI tweak would do.

On the tooling side, my AI product toolbox is pragmatic and modular. For analytics and learning loops, Amplitude analytics and Pendo help quantify behavior changes and retention analysis. For in-product engagement and feedback routing, Intercom and HubSpot integrate cleanly with LLM-driven tagging and summarization. For ideation and automation, I use a ChatGPT connector and Claude Code for quick scripts, data wrangling, and prompt experiments. The constant: a retrieval-first pipeline that grounds models in approved knowledge and maintains context window management at scale.

Risk management is built in, not bolted on. Set clear AI risk management policies, catalog model and data dependencies, and document decisions. Align with regulatory compliance requirements early, and keep an audit trail of prompts, datasets, and eval results. That’s how you move fast without breaking trust.

If you’re getting started, begin small: pick one high-friction workflow, add a retrieval-grounded copilot, and measure the lift. Use the results to inform product roadmapping and sprint planning, then scale to adjacent use cases. With disciplined discovery, sharp evaluation, and the right tooling, AI becomes a force multiplier for product teams and a clear win for customers.

Inspired by this post on Product School.

December 15, 2025
From Concierge to AI Marketing Engine: Inside Mowie’s Document Hierarchy Playbook

I’m constantly asked by SMB owners: What if your small business could have a full marketing team—automated content calendars, customer segmentation, and channel-specific posts—without the headcount? That question is no longer hypothetical; it’s precisely the promise behind Mowie, and the way they got there is a masterclass in practical AI product development.

I recently listened to Chris O'Connor (CEO) and Jessica Valenzuela (Co-Founder) of Mowie, an AI marketing platform built for small and medium-sized businesses in restaurants, retail, and e-commerce. Their story starts with a concierge marketing service—doing the work by hand for overwhelmed owners—and evolves into a fully automated AI product.

They walk through their "document hierarchy" approach: how Mowie crawls the web to build a "dossier" about each business, infers customer segments and marketing pillars, and generates quarterly content calendars with channel-specific posts. As a product leader, this is the kind of retrieval-first pipeline that consistently outperforms naive prompt chaining because it builds durable context before generation.

They also unpack the technical challenges of structuring unstructured data and the evolution from rigid schemas to loosely structured markdown. In my experience with LLMs for product managers, markdown becomes a flexible intermediate representation that’s easy to diff, trace, and feed back into models without brittle parsing.

Equally important, they use customer feedback—from calendar approvals to regeneration requests—as their primary evaluation signal. That’s eval-driven development in practice: close the loop with lightweight evals that reflect genuine user intent, not proxy metrics.

The planning model is elegant: the three mini-calendars—public events, business-specific events, and recommended campaigns—roll up into a coherent plan that eliminates the blank-page problem and enables steady, predictable execution.

Crucially, they’re building traceability so customers can see which context documents influenced their content. This kind of transparency increases trust, accelerates edits, and supports governance in regulated categories where auditability matters.

Onboarding and data collection stay pragmatic: let the system crawl first, ask humans only for deltas, and progressively profile over time. It’s a pattern I advocate in continuous discovery and AI workflows—keep humans in the loop without overwhelming them, and make the right action the easy action.

Early on, they used Simon Sinek's Golden Circle framework to validate demand and sharpen messaging. Framing the "why" before the "what" helps teams maintain a crisp value proposition and tighten their go-to-market strategy.

Performance measurement goes beyond vanity metrics by connecting marketing performance back to point-of-sale data for attribution. The ability to tie campaigns to revenue events is the bridge from clever content to accountable outcomes.

What’s next is equally compelling: deeper attribution, omnichannel expansion, and digital out-of-home displays. For SMBs, that points to a unified analytics platform spanning email, social, and in-store touchpoints—exactly where modern marketing is headed.

My takeaways for builders: invest in a retrieval-first pipeline with a resilient document hierarchy; prefer loosely structured markdown over rigid JSON when dealing with messy inputs; design human-in-the-loop controls that double as evals; and always connect activity to business outcomes. That’s how you turn an idea into a repeatable system that scales.

If you want to explore further, start here: Mowie AI — AI marketing platform for SMBs. For early validation and storytelling, revisit Simon Sinek's Golden Circle.

Inspired by this post on Product Talk.

December 11, 2025

Operationalizing AI: A Practical System for Scalable Growth

Your AI pilot works in the demo. Then it reaches a live workflow and slows down: the data is incomplete, nobody owns the exceptions, reviewers apply different standards, and the team cannot prove whether the result improved revenue, cost, speed, or retention.

The gap is not model quality alone. Scalable growth requires an operating system around the model: a constrained business outcome, a mapped workflow, approved data, explicit decision rights, measurable quality, controlled releases, and a path for handling failure. Build those pieces around one valuable use case, and AI can become a repeatable business capability instead of a collection of pilots.

Choose the growth constraint before the AI use case

Do not begin with a broad instruction to “find an AI use case.” That framing encourages teams to start with a model capability and search for somewhere to place it. Start with a constrained business problem instead.

The unit of investment should be a decision or task inside a customer or employee journey. “Build a churn copilot” is too broad. “Before a renewal review, summarize approved usage and CRM signals, identify the evidence of risk, and propose an action for the customer success manager to review” is narrow enough to test.

Most growth-oriented opportunities fit into four useful lanes:

Revenue: improve qualification, conversion, expansion, cross-sell, or win-back decisions. Measure the commercial event, not the number of AI recommendations generated.
Efficiency: reduce the cost, handling time, rework, or backlog associated with a repetitive process. Good candidates have high task volume and outputs that can be checked without recreating the work.
Speed: shorten a discovery, delivery, or release cycle. If the workflow serves software delivery, deployment frequency can be relevant, but it is not evidence of customer or commercial value by itself.
Activation and retention: make onboarding, guidance, or support more contextual. Measure whether customers reach the intended product behavior and continue receiving value, not whether they clicked an AI-generated tooltip.

A disciplined portfolio can pair one revenue use case with one efficiency use case, define success before development, and release each through a narrow MVP. That balance matters. An efficiency-only roadmap can shrink costs without creating differentiation, while an unconstrained revenue bet can consume attention without proving economic value.

Screen each candidate with the same questions:

What business metric should move, and what is its current baseline?
Which person, decision, and moment in the workflow create that movement?
Does the task occur often enough to justify a reusable solution?
Are the required inputs available, current, and approved for this purpose?
Can a reviewer distinguish an acceptable result from an unacceptable one?
What happens when the system is wrong, and can the action be reversed?
Who owns the outcome after the launch team moves on?

My test is blunt: if you cannot name the workflow event, the owner, the baseline, and the failure consequence, you do not yet have an implementation candidate. You have a discovery question. Fund the learning needed to answer it before funding scale.

Convert the use case into a controlled workflow

An AI feature becomes operational when its behavior is defined inside the surrounding work. That means understanding what happens before the model is called, what the model may do, how its output is checked, and what happens next.

Begin by mapping the task as it is performed, choosing one step to augment, selecting the right automation method, and iterating against an explicit quality bar. Do the task manually while mapping it if the real process is unclear. Policy documents often describe the intended path; observation reveals the exceptions that determine whether automation will survive production.

Name the trigger. Specify the event that starts the workflow, such as a support request, renewal review, onboarding milestone, invoice submission, or product release.
Identify the inputs. Record each system, document, field, permission, and freshness requirement. Separate required evidence from optional context.
Expose the decisions. Write down the classifications, judgments, calculations, and approvals a person currently makes. Hidden judgment is where apparently simple automations tend to break.
Specify the output. Define its schema, audience, channel, timing, and acceptable evidence. “Produce a helpful answer” is not a specification.
Map exceptions. Include missing records, contradictory inputs, unsupported requests, low-confidence cases, policy conflicts, and unavailable downstream systems.
Assign each step to code, retrieval, an LLM, or a person. The workflow should use the simplest reliable mechanism for each job.
Define the handoff. State who reviews the result, what they can change, when the workflow must stop, and where failures are recorded.

Use each form of automation for the work it can control

Use deterministic code for exact calculations, validation rules, permissions, routing, and other behavior that should produce the same answer from the same inputs. Use an LLM where language is ambiguous, inputs are unstructured, or the task requires drafting, summarizing, extracting, or classifying meaning.

When the answer must reflect company facts, policy, or customer history, retrieve the approved information at runtime instead of expecting the model to remember it. A retrieval-first design can connect behavioral and CRM context to account signals and recommended actions, while preserving a visible trail back to the evidence used.

Keep a person in the path when the consequence is material, the action is difficult to reverse, or the definition of a correct result remains contested. Human review is not a permanent excuse for weak quality, however. The reviewer needs defined criteria, enough context to make a decision, and an easy way to correct and categorize the failure.

Write an execution contract, not just a prompt

A production instruction set should define more than tone and role. Treat it as an execution contract containing:

the objective and the business context;
the permitted inputs and authoritative evidence;
the decision criteria the system must apply;
the required output structure;
the actions it may and may not take;
the conditions that require refusal or escalation;
the way uncertainty should be represented;
examples of acceptable, unacceptable, and edge-case behavior.

For an agentic workflow, increase authority in deliberate stages: observe, draft, recommend, act after approval, and only then act within defined limits. Do not jump from a convincing chat demonstration to autonomous execution. Agentic AI needs explicit guardrails and verifiable quality before it can safely take work out of a human queue.

Measure business value, workflow performance, and AI quality separately

A dashboard that reports requests, tokens, or generated answers tells you that the feature was used. It does not tell you whether the business improved. You need separate measures because an AI system can look healthy at one layer while failing at another.

Measurement layer	What to track	What it reveals
Business outcome	Conversion, expansion, cost per completed outcome, cycle time, activation, or retention	Whether the investment affects the growth constraint it was chosen to address
Workflow performance	Completion, rework, exception, escalation, abandonment, and end-to-end latency	Whether the surrounding process can absorb and use the AI output
AI quality	Correctness, evidence support, instruction adherence, output validity, and appropriate refusal	Whether the system behaves acceptably across expected and difficult cases
Risk and operations	Unauthorized data exposure, prohibited actions, overrides, incidents, rollback events, and unresolved failures	Whether growth is being purchased with unacceptable operational or trust costs

Build the measurement path before the rollout:

Capture the baseline. Measure the existing workflow using the same outcome definition you will use after launch. Otherwise, a faster AI step can hide slower review, higher rework, or shifted labor elsewhere.
Create a representative evaluation set. Use permitted examples from normal, difficult, and failure-prone cases. Define the expected result and the critical errors for each case.
Weight failures by consequence. Formatting errors, unsupported factual claims, privacy failures, and unauthorized actions should not disappear into one average score.
Run offline evaluations before exposure. Test the complete combination of instructions, model, retrieval, tools, and output validation. A model score alone does not represent the production system.
Release behind a feature flag. Start with a controlled cohort, preserve the ability to roll back, and compare outcomes. Use A/B testing when assignment and outcome measurement are credible; use a phased rollout when they are not.
Record versions. Log the model, instructions, retrieval configuration, tools, and policy version associated with each result so a regression can be traced.
Turn failures into future tests. Categorize meaningful production failures and add them to the evaluation set before the next release.

This is the practical meaning of eval-driven development: instrument the system, watch for drift, and tighten the delivery loop while changes remain controlled by feature flags. It turns evaluation from a launch checkpoint into part of product development.

Use a scale gate that includes economics

Do not scale because the demo is impressive or employees like the interface. Require four decisions:

The business outcome is moving in the intended direction, or there is credible evidence that the workflow is producing the leading behavior tied to it.
Quality remains acceptable across normal cases, edge cases, and high-consequence failures.
Total cost per successful outcome is viable after model usage, retrieval, storage, human review, escalation, rework, and operations are included.
The operating owner can detect, contain, and learn from failures without depending on the original project team.

If a pilot fails one of these gates, the decision is not automatically to cancel it. Narrow the scope, change the workflow, improve the evidence, or stop. What matters is that expansion is earned by measured behavior rather than assumed from adoption.

Scale through guardrails, reusable components, and clear ownership

Governance should make routine decisions faster. When every team has to rediscover which data is permitted, which evaluation is sufficient, and who can approve a release, governance becomes a sequence of meetings. When those expectations are encoded in a standard launch record, teams know the path before they build.

Create a minimum launch record for every workflow

the business outcome, baseline, and accountable owner;
the workflow boundary, users, and authorized actions;
the approved data sources, access controls, retention rules, and prohibited data;
the evaluation set, acceptance criteria, and critical failure classes;
the human review and escalation conditions;
the logging, monitoring, feature flag, and rollback plan;
the model, retrieval, tool, and vendor dependencies;
the incident owner and the method for notifying affected internal teams or customers when appropriate.

Privacy-by-design, data governance, red-teaming, and defined review gates are growth infrastructure. They reduce repeated risk debates and make the safe path reusable across launches.

If a workflow touches personal data, confidential customer content, employment decisions, payments, security actions, or contractual commitments, involve the appropriate privacy, security, legal, financial, or people owner before live use. The downside is not limited to a poor answer. The workflow can expose restricted data or take an action the business cannot easily reverse.

Assign ownership beyond launch

Four responsibilities must be explicit, even when one person holds more than one:

Business outcome ownership: decides whether the workflow is worth continuing based on the target metric and economics.
Workflow ownership: manages exceptions, reviewer behavior, process changes, and user feedback.
Technical ownership: controls releases, versions, integrations, reliability, monitoring, and rollback.
Risk ownership: defines the policy boundary and approves material changes to data, authority, or exposure.

This prevents a common operating failure: the product team treats launch as completion, while the operations team inherits a changing probabilistic system without the tools or authority to manage it.

Standardize the recurring parts, not every local process

Once working use cases expose recurring needs, turn those needs into shared capabilities. Useful candidates include identity and permissions, governed retrieval connectors, evaluation tooling, instruction and model versioning, observability, feature flags, rollback controls, and cost attribution.

Keep the final workflow close to the business team that understands the customer, exceptions, and outcome. Centralize the controls and infrastructure that should be consistent. This creates leverage without forcing every function into the same process.

Review the portfolio as a set of products, not permanent projects. The decision for each workflow should be to expand it, fix a known constraint, narrow its authority, or retire it. Continuous discovery with product trios can refine the prompts, data sources, and experience while evidence determines what scales and what stops.

Operationalizing AI: three questions leaders ask

Should you build a central AI platform first?

Usually, no. Start with the minimum secure infrastructure required for a valuable workflow. Standardize a component when several use cases need the same capability or when inconsistency creates material risk. Data access, identity, logging, and release controls may need early consistency; a broad internal platform without proven workflows can become an expensive set of assumptions.

How do you know a pilot is ready to scale?

A pilot is ready when it improves the intended business or workflow outcome, stays within quality and risk boundaries, has viable cost per successful outcome, and can be operated without daily intervention from its builders. Usage and positive comments are supporting signals, not a scale decision.

Where should a human remain in the loop?

Keep human approval where consequences are high, actions are difficult to reverse, evidence is incomplete, or acceptable judgment cannot yet be specified. Remove or reduce review only when evaluations and production monitoring show that the remaining risk is understood and controlled. A reviewer who merely clicks approve without adding judgment is not a guardrail; it is latency disguised as governance.

For your next AI proposal, require a one-page charter containing the outcome, workflow boundary, owner, baseline, approved data, evaluation set, failure policy, release plan, and full cost model. If a line is blank, fund discovery to resolve it. If the charter is complete, release the smallest useful workflow behind a control, learn from real failures, and widen its authority only when the evidence earns it.

References

December 10, 2025

How to Build a Self-Improving AI Support Operation

Your AI support agent handled the easy questions, produced an encouraging early lift, and then stopped getting better. The same topics still reach human agents. Content fixes happen when someone remembers. The aggregate resolution rate moves, but nobody can explain why.

If that describes your operating review, a newer model is unlikely to be the first thing you need. You need a closed operating loop: every weak conversation becomes evidence, every useful insight gets an owner, and every change is tested against the next conversation it is meant to improve.

Measure the improvement loop, not just resolution rate

A self-improving support operation is not an agent that quietly rewrites or retrains itself. It is a managed system in which live conversations expose failure modes, people convert those failures into controlled changes, and later conversations show whether the changes worked.

Resolution rate is an outcome of that system, not a diagnosis. An aggregate rate cannot tell you which intent deteriorated, why the agent handed a customer to a human, or whether a change repaired one topic while damaging another. It can also be misleading when eligibility changes. Expanding automation into harder intents may lower the rate while increasing the number of conversations resolved. Excluding difficult intents can produce the opposite effect.

Start by documenting exactly what your denominator includes and what counts as a resolution. Keep that definition stable enough to compare periods, and report resolved volume alongside the rate. Then add the views that turn a dashboard into a work queue:

Coverage: Which inbound conversations are eligible for AI handling, and which are excluded?
Outcome by intent: Where does the agent resolve, hand off, or fail to answer?
Failure reason: Was the problem missing knowledge, weak retrieval, incorrect behavior, poor routing, or an issue the product itself must solve?
Quality: Did an audit, repeated contact, reopened conversation, or another trusted signal indicate that the apparent resolution was weak?
Change throughput: How many identified failures are waiting for diagnosis, testing, approval, or release?

The intent-level view matters because it gives the owner somewhere to act. A falling aggregate rate is merely a warning. A cluster of unresolved questions about one feature, tied to one failure reason, is a tractable product and operations problem.

Classify the failure before choosing the fix

Teams waste cycles when every poor answer is treated as a documentation problem. Use a small failure taxonomy to route each issue to the layer that can actually repair it.

Failure class	What you observe	Likely action
Knowledge gap	No current, approved answer exists	Create or repair the canonical content
Retrieval gap	The answer exists, but the agent does not receive or select it	Improve structure, segmentation, metadata, or retrieval configuration
Behavior gap	The right information is available, but the response is incomplete or misapplied	Adjust instructions, examples, or agent configuration
Routing gap	The agent should escalate but does not, or the handoff loses essential context	Change escalation conditions and the handoff payload
Product gap	No support answer can resolve the underlying problem	Send the evidence to product or engineering instead of disguising it as a content task

This distinction prevents two common errors: endlessly rewriting accurate content when retrieval is broken, and asking the support agent to explain around a product defect that requires an actual fix.

Give one owner the authority and the improvement queue

Shared participation is useful. Shared accountability is not. One person should own the performance of the AI support operation, even though support, product, content, engineering, and security may contribute to individual changes.

The title can be AI operations lead, support operations specialist, or something else. The mandate is what matters: identify underperforming intents, maintain the improvement backlog, coordinate changes across functions, enforce the evaluation process, and report what improved or regressed.

Ownership becomes especially important after the launch surge fades. At Dotdigital, performance held at about 2,800 resolved conversations per month for three consecutive months. The response was to create a dedicated support operations specialist role focused on snippets, content, and the agent’s resolution capability. The lesson is not that every company needs the same job title. It is that a plateau without an empowered owner tends to remain a plateau.

Do not bury improvement work in the general support queue. A customer ticket can close while the underlying failure remains. Create a separate, persistent record for the system-level issue, with fields that make it possible to trace evidence through to an outcome:

Representative conversation links and the affected intent
The observed failure and its customer consequence
The failure class and the evidence supporting that diagnosis
The knowledge, retrieval, behavior, routing, or product artifact to change
The accountable owner and required reviewer
The evaluation cases that must pass
The release status, version, and deployment date
The live signal that will be checked after release

Define done as more than content published or configuration changed. An improvement is complete only when the change is linked to its originating evidence, reviewed at the appropriate risk level, tested, released, and checked in live operation.

For prioritization, assess recurrence, consequence, confidence in the diagnosis, and effort separately. Do not let raw volume make the decision by itself. A rare failure involving access, privacy, or an irreversible customer action can deserve attention before a frequent wording problem. Conversely, a recurring low-risk knowledge gap may be the best candidate for a fast content repair.

Turn live failures into governed, testable changes

Feedback does not improve an agent merely because it was collected. A thumbs-down, a handoff, or an unresolved conversation is a signal, not a root cause. The operating loop has to convert that signal into a specific hypothesis and then close the loop.

Collect: Group common handoffs and unresolved conversations by intent instead of reading them as isolated tickets.
Diagnose: Assign a failure class and confirm that the proposed layer is actually responsible.
Prioritize: Select the issue using recurrence, consequence, confidence, and effort.
Change: Modify the smallest responsible artifact rather than making broad agent changes by default.
Evaluate: Test the originating failures, realistic variations, and already-passing cases that could regress.
Release and observe: Record what shipped, monitor the affected live intent, and feed any new failure back into the queue.

Write the hypothesis before making the change: for this intent, changing this artifact should reduce this failure reason without degrading these existing behaviors. That sentence forces clarity about what success means and which regression cases belong in the evaluation set.

When a live failure reveals a missing case, promote it into the regression set after the fix. Over time, the evaluation suite becomes a practical memory of mistakes the operation should not repeat. That is where compounding comes from: the team is not merely correcting answers; it is preserving each correction as a reusable control.

Match governance to the blast radius

Fast iteration and responsible review are compatible when the rules are explicit. A useful governance model distinguishes changes by consequence:

Low blast radius: A correction to an approved fact, an obsolete product step, or a missing limitation can follow a lightweight peer review and the relevant evaluation cases.
Moderate blast radius: Retrieval, behavior, and routing changes that can affect several intents should receive cross-functional review and a controlled release.
High blast radius: Actions involving permissions, account access, customer data, money, or security need stronger approval, a safe test environment, a rollback path, and an obvious route to a human.

A wrong explanation can create confusion. A wrong action can change an account or expose data. Treating those changes as equivalent either slows harmless content repairs or makes consequential automation unsafe.

Use focused sprints without making improvement episodic

A concentrated sprint is useful when the backlog has accumulated or a set of topics is visibly underperforming. In one focused Anthropic effort, the team audited unresolved queries, repaired weak content, converted recurring macros into AI-usable snippets, and monitored live performance. That is a practical pattern for clearing known gaps quickly.

The sprint should strengthen the standing loop, not replace it. Keep the same taxonomy, backlog, review rules, and evaluation artifacts after the concentrated work ends. Otherwise, the operation improves during special events and drifts between them.

Make the improvement work visible in each operating review. Show the failure observed, the artifact changed, the evaluation result, and the live outcome or next check. Name the person who drove the repair. This rewards the behavior that creates durable gains instead of celebrating only a headline rate that few people can explain.

Make AI-ready knowledge part of product launch readiness

Company-specific support knowledge does not appear because the underlying model is capable. The agent needs current, approved information in a form it can retrieve and apply. Missing or contradictory knowledge is an operating failure, not a model mystery.

Treat knowledge as production infrastructure. Every topic needs an owner. Important changes need versions and effective dates. Retired instructions need to be removed or clearly superseded. The agent’s ingestion and retrieval path needs verification, just as the customer-facing help experience does.

A canonical source of truth does not have to be one enormous help article. It means there is one approved origin for the product facts from which help-center content, agent snippets, human macros, and other downstream formats are derived. When those formats are authored independently, contradictions are almost inevitable.

Add an AI support gate to the new product introduction process. Before a feature is considered ready, confirm that:

A named owner is accountable for keeping the feature’s knowledge current.
The canonical material explains what changed, who can use it, how it works, and where its boundaries are.
Known limitations and escalation conditions are explicit rather than left for the agent to infer.
The effective version or release state is clear, so old and new instructions cannot be confused.
The content has been ingested or indexed and retrieval has been tested.
Expected support intents and representative evaluation cases are ready before inbound volume arrives.
Support has a defined path for returning launch-day failures to product, engineering, or the knowledge owner.

This is not only administrative hygiene. In my organization, embedding a canonical source of truth into launch readiness has consistently supported resolution rates above 50% for new features from day one. That result is evidence for the operating model, not a universal benchmark; intent mix, product complexity, and the definition of resolution still matter.

Do not automatically turn every human answer into permanent knowledge. First decide whether the resolution is generalizable. If it is, update the canonical material. If it is a legitimate exception, encode the escalation path. If the underlying issue is a product defect, preserve the conversation as product evidence and route it accordingly. The objective is a cleaner system, not simply more content.

Key takeaways for your next operating review

Define self-improvement as a managed loop from conversation evidence to a verified change, not autonomous model learning.
Keep resolution rate, resolved volume, coverage, failure reasons, and change throughput visible together.
Assign one accountable owner with authority to coordinate support, content, product, and engineering.
Classify each failure before fixing it so knowledge, retrieval, behavior, routing, and product problems reach the right layer.
Turn repaired failures into regression cases, and apply stronger review as the blast radius increases.
Make canonical, AI-ready knowledge a launch requirement instead of a cleanup task for support.

At your next review, take one recurring unresolved intent and trace it all the way through: evidence, diagnosis, owner, change, evaluation, release, and live result. If any link is missing, that is the first operating gap to repair. Once the path works for one intent, make it the default path for every failure worth learning from.

References

Shivam.Consulting Blog – Make Every Answer the Last: Building a Self-Improving AI Support Engine for 2026

December 9, 2025

Beyond Accuracy: The Trust-First Evaluation Metrics I Use to Scale High-Impact AI Products

When I assess whether an AI product is ready for prime time, I start with trust—not model accuracy. Accuracy is table stakes; trust is what earns adoption, drives retention, and unlocks durable product-led growth.

Evaluation metrics in AI products go beyond accuracy. Learn how product teams use trust-driven metrics to build reliable, growth-driving AI systems.

In practice, I organize trust-driven metrics into four layers: model quality and safety, user and business outcomes, operational reliability and cost, and governance and compliance. This layered approach keeps product trios aligned on what matters now, what must be gated in CI/CD, and what signals we’ll use to prove progress against outcomes vs output OKRs.

On model quality and safety, I care about precision, recall, F1, calibration, and abstention behavior, but also the hard-to-fake signals: hallucination rate, grounding and faithfulness, citation coverage, toxicity, bias, and fairness. For generative systems, I instrument refusal correctness (declining unsafe requests) and evidence adequacy (did the answer rely on retrieved, trustworthy sources).

User and business outcomes must be explicit. I track adoption, activation, task success rate, time to first value, win rate uplift in assisted workflows, CSAT and NPS deltas, and retention analysis by cohort exposed to AI features. For customer support scenarios, deflection rate, average handle time change, and first-contact resolution are core; for sales or ops copilots, I monitor cycle-time reduction and error-rate reduction in critical tasks.

Experimentation is non-negotiable. I design A/B testing with a clear minimum detectable effect (MDE), pre-registered guardrails for safety and quality, and sequential tests that stop early if harm outpaces benefit. Online metrics are always paired with offline evals so we can iterate quickly without exposing users to regressions.

Operationally, trust shows up as speed, stability, and cost predictability. I track latency end-to-end, time to first token, throughput, rate of 5xx and timeouts, cost per request, and caching effectiveness. We also trend safety incidents per 10,000 interactions and mean time to mitigation to keep reliability visible alongside performance.

Governance and compliance are part of the product, not an afterthought. Data governance and privacy-by-design metrics include PII exposure rate, data lineage coverage, access-control correctness, audit pass rate against internal policies, and model and prompt change traceability. This is the backbone of our AI risk management posture and accelerates regulatory compliance reviews instead of slowing them down.

The delivery engine for all of this is eval-driven development. We maintain golden datasets and scenario-based test suites that mirror real user intents, gate releases in CI/CD with minimum thresholds, and run canary rollouts to validate offline–online alignment. Every model or prompt update gets a comparable scorecard so product, engineering, and design can trade off quality, speed, and cost with shared facts.

For LLM-heavy features, retrieval-first pipeline metrics are mandatory. I monitor retrieval hit rate, recall at K, mean reciprocal rank, context contamination, and citation correctness. With large prompts, context window management matters: we track context utilization, truncation rate, and the contribution of each context block to final answers to avoid silently losing critical evidence.

Finally, trust must be legible. I package these metrics into an executive scorecard that maps to business outcomes, risk appetite, and OKRs, with clear thresholds for ship, improve, or roll back. When teams can articulate trade-offs—say, a 20% latency reduction at a small cost increase, or a lower hallucination rate at the expense of higher abstention—they build credibility with stakeholders and confidence with customers.

Trust is not a single number; it’s a system of evidence. By instrumenting these layers and operationalizing AI Strategy with rigorous, transparent metrics, we can ship faster, reduce surprises, and earn the right to scale AI features across the product portfolio.

Inspired by this post on Product School.

December 8, 2025
From No-Code Hack to 10,000 Weekly Calls: Inside Perk’s Voice AI That Actually Works

I love real-world AI that ships, scales, and actually solves painful customer problems. This story checks every box. As a product leader who has brought agentic AI to production environments, I was captivated by how a small, focused team at Perk took a no-code voice AI prototype and turned it into a system that reliably makes 10,000+ calls per week to prevent failed hotel payments.

What happens when you combine a real customer problem, a no-code prototype, and a team willing to listen to every single call?

Steven Payne (Product Manager), Gabriel Stock (Senior Engineering Manager), and Philipe Steiff (Senior Software Engineer) from Perk share how they built a voice AI agent that calls hotels to verify virtual credit card payments, preventing travelers from arriving to find their rooms unpaid. This is a textbook example of linking operational pain to a high-leverage AI solution.

What started as a hackathon experiment in Make.com became a production system handling over 10,000 calls per week across multiple languages. Along the way, the team learned hard lessons about prompt engineering for voice (numbers, pronunciation, and a very "Karen-like" first version), how to break a single monolithic prompt into structured conversation stages, and why listening to actual calls beats any amount of theorizing.

From a product management perspective, this approach aligns perfectly with eval-driven development and continuous discovery. Structure the problem, instrument aggressively, ship safely, then listen—deeply—to real interactions. In my own teams, I’ve seen that nothing accelerates iteration on agentic AI like closing the loop between qualitative call reviews and quantitative evals.

They built a working prototype without writing a single line of backend code.

They structured the call into discrete stages (IVR, booking confirmation, payment) to improve reliability.

They created two eval systems: one for call success classification, another for conversational behavior.

They scaled from five calls a day to tens of thousands per week while maintaining quality.

This is a detailed look at building AI for real-time human interaction—where the stakes are high and the feedback is immediate.

Guests: Steven Payne, Product Manager, Perk; Gabriel Stock, Senior Engineering Manager, Perk; Philipe Steiff, Senior Software Engineer, Perk.

What stood out to me was how Perk's team identified an AI use case by connecting prior experimentation with a real operational problem. Why they chose Make.com for prototyping—and shipped to production without touching backend code—underscores how far no-code can take you when paired with crisp problem framing. The evolution from a single prompt to structured conversation stages (IVR handling, booking confirmation, payment request) is exactly how you harden agent behavior for production.

Breaking up the agent's task dramatically improved reliability. They also built two eval systems: classification for success rates and LLM-as-judge for conversational behavior. Even with automation, the team still listens to calls manually—a practice I strongly endorse for uncovering edge cases, trust issues, and UX nuances that dashboards can’t show.

The challenge of prompt engineering for voice—numbers, booking references, and text-to-speech markup—was non-trivial. Expanding to German revealed that prompts in native language improve results. And, as often happens with operations-heavy rollouts, this project uncovered other operational problems they didn't know existed—valuable signal for the roadmap.

Resources & Links: Perk. Make.com — No-code automation platform used for the prototype. Twilio — Voice/telephony provider. Eleven Labs — Text-to-speech provider (used in early experiments).

Chapters: 00:00 Introduction to the Team; 01:54 Understanding PERK's Mission; 02:59 Challenges in Travel Booking; 07:27 AI Solutions for Customer Care; 09:52 Prototyping with AI and Voice; 17:00 Implementing AI in Production; 25:51 Learning Through Trial and Error; 26:40 Prompting Challenges and Solutions; 27:58 Iterating on Prompts and Evaluations; 30:08 Scaling and Production Challenges; 32:43 Advanced Evaluation Techniques; 35:32 Real-World Applications and Success; 49:07 Future Directions and Expansion; 53:53 Conclusion and Team Reflections.

My product takeaways: Start with clear operational pain and measurable outcomes (e.g., payment verification). Use no-code to validate quickly, then progressively harden. Treat voice AI like any production system: break it into deterministic stages, add guardrails, and measure both outcome and behavior. Pair automated evals with hands-on reviews. And when going multilingual, write prompts in the native language—your accuracy will thank you.

If you’re exploring agentic AI for operations, this is the blueprint: tight scoping, Make.com for speed, Twilio for reliability, structured prompts for control, and an eval-driven loop to scale quality with confidence.

Inspired by this post on Product Talk.

December 4, 2025

How Startups Earn Visibility in ChatGPT and Perplexity

A prospect asks ChatGPT or Perplexity for the kind of product you sell. Several competitors appear. Your startup does not. That does not automatically mean your product is weak or your SEO has failed. It often means the system cannot find enough clear, consistent, and corroborated evidence to include you confidently.

Your job is not to force your company into every answer. It is to make your startup easy to identify, accurately categorize, and safely recommend when it genuinely fits the question. That requires coordinated work across positioning, content, technical structure, third-party proof, and measurement.

Key takeaways

Measure visibility across important buyer questions, not as one universal AI-search ranking.
Build a page for each major decision: category, use case, integration, price, comparison, and deployment risk.
Make important claims explicit in visible HTML, then reinforce them with accurate metadata and schema.
Support first-party claims with reviews, partner pages, case studies, documentation, and other independent evidence.
Use a stable prompt set to find specific visibility failures, change the relevant evidence, and retest.

Measure recommendation coverage, not an imaginary rank

Conventional search encourages a positional question: where do I rank? AI search requires a different question: for which buyer decisions can the system understand and support a recommendation of my product?

AI search behaves more like a synthesis engine than a page of ranked blue links. It assembles an answer around the wording and context of a prompt. Change the question from best software for a category to best software for a particular team, workflow, integration, budget, or risk profile, and the eligible recommendations may change.

There is therefore no single visibility score that tells the whole story. A startup can be visible for category discovery but absent from integration questions. It can be named as an alternative yet omitted when the buyer adds a security requirement. It can also be mentioned with an outdated description, which is exposure without useful discovery.

A practical baseline should distinguish four outcomes:

Discovery: Does your company appear when the prompt describes a problem you solve?
Positioning: Is it placed in the right category and associated with the right audience and job?
Fit: Does the answer explain when your product is appropriate, including relevant trade-offs?
Evidence: Are the supporting claims current, specific, and connected to credible pages?

Start with the questions that already matter in your buying journey. Include category exploration, problem framing, use-case fit, integrations, commercial value, alternatives, and deployment risk. Preserve the exact wording of each prompt. If you rewrite the test every time, you will not know whether your evidence improved or the question merely changed.

Record more than whether your name appeared. Save the product description, recommendation context, claims, citations, omissions, and factual errors. A mention is not a win if the answer sends the wrong buyer to your product or attributes a capability you do not offer.

Turn buyer intent into an answerable page system

Many startups try to solve AI visibility by publishing more blog posts. Volume is rarely the first constraint. The more common problem is that the website has no precise page capable of answering the buyer’s actual question.

Your homepage cannot carry the entire decision journey. Give each high-value intent a clear destination:

Buyer decision	Question the page must answer	Best page type	Evidence to include
Category exploration	What is this product, and who is it for?	About or category page	Plain category definition, target customer, core job, and differentiator
Problem framing	How should I understand and solve this problem?	In-depth explainer	Method, terminology, constraints, and links to primary material
Solution fit	Can this product handle my workflow?	Use-case page	User, workflow, inputs, outputs, limitations, and customer evidence
Integration fit	Does it work with the rest of my stack?	Integration page or documentation	Prerequisites, supported connection, setup steps, data flow, and known limits
Commercial fit	What will I pay, and what value should I expect?	Pricing and value page	Pricing structure, inclusions, exclusions, assumptions, and verifiable outcomes
Competitive choice	When should I choose this product instead of an alternative?	Comparison or alternatives page	Points of parity, meaningful differences, trade-offs, and cited claims
Deployment risk	Can my organization use it safely?	Trust center	Security, privacy, compliance, governance, and data-handling information

Each page should lead with a direct answer. Do not make a retrieval system infer your category from a slogan or reconstruct an integration from a press release. A useful positioning sentence follows a simple structure: [Product] is a [category] for [audience] that needs to [job], distinguished by [relevant difference]. Use the same underlying definition wherever the product is introduced.

Use-case pages need more than a collection of benefits. Name the user, triggering problem, workflow, expected output, dependencies, and boundaries. If the product is suitable only under particular conditions, state them. Precise qualification can reduce superficial visibility while improving the quality of the recommendations that remain.

Integration pages deserve the same care. A logo wall proves very little. Explain what connects, in which direction data moves, what setup requires, and which workflows the connection supports. Link to technical documentation and the partner’s corresponding page when one exists.

Comparison pages should help a buyer make a decision, not manufacture a victory. Start with the shared category, acknowledge points of parity, identify the conditions that make each option a better fit, and cite claims that a reader can verify. A fair statement such as one product suits a particular workflow while another suits a different operating model is more useful than an unsupported declaration that yours is best.

Transparent pricing matters for the same reason. If a public amount is not available, you can still explain the pricing unit, packaging logic, included capabilities, major variables, and purchasing path. The aim is to remove avoidable ambiguity from a commercial-fit question.

Make the corpus easy to retrieve and hard to misread

Good information can remain invisible when it is buried in a PDF, hidden behind vague navigation, contradicted by metadata, or scattered across pages with no canonical version. Retrieval-friendly content reduces the work required to locate, segment, and interpret an answer.

Work through the site in this order:

Make the visible narrative consistent. Use the same product name, category, audience, and core capability across the homepage, About page, product pages, documentation, and trust center. Resolve genuine contradictions before adding markup.
Give every important answer a stable URL. Use descriptive headings, short focused sections, sensible internal links, and linkable anchors. Keep documentation in HTML when possible, even if you also offer a PDF.
Add schema that describes the visible page. Organization, Product, FAQPage, HowTo, and Article JSON-LD can clarify entities and content types when they accurately match what a person can read on the page.
Align the surrounding signals. Titles, meta descriptions, canonical URLs, and Open Graph data should reinforce the same identity and purpose rather than introducing alternate names or claims.
Remove retrieval friction. Maintain a clean sitemap, review robots.txt for accidental blocking, keep important pages reachable through navigation, and provide fast mobile-first experiences.
Keep technical material usable. Provide copyable commands, configuration examples, prerequisites, expected results, and failure conditions where they are relevant.

Schema is a translation layer, not evidence. Product markup cannot rescue an unsupported claim, and FAQ markup cannot turn a thin sales page into an authoritative answer. Add structured data after the visible content is accurate and complete.

The trust center is especially important for B2B products. Security, compliance, privacy, governance, and data-handling questions often enter the buying process before a prospect speaks to sales. Give each topic a clear, current answer. Avoid mixing aspirational commitments with controls that are already in place.

Freshness also needs visible ownership. Release notes should reflect material product and integration changes. Outdated feature claims should be corrected or retired instead of left to compete with the current version. Schedule a quarterly review of commercially important pages, documentation, comparison claims, and trust material. The goal is not to alter dates cosmetically; it is to ensure that the underlying answer remains true.

Earn corroboration where your company cannot control the wording

Your website establishes what you claim. Independent surfaces help establish whether anyone else has reason to believe it. That distinction becomes important when a recommendation involves operational risk, meaningful spend, or a crowded category.

Map each commercially important claim to the strongest available proof:

Adoption: detailed customer stories, current review profiles, and customer outcomes with verifiable metrics.
Compatibility: partner directories, joint integration pages, and documentation that confirms the supported connection.
Technical maturity: accessible documentation, maintained repositories where relevant, and README files that accurately explain installation and use.
Category authority: reputable industry mentions, analyst coverage, or citations by practitioners and institutions with relevant expertise.
Deployability: security, compliance, governance, and privacy material that a buyer can inspect rather than a generic statement that the product is secure.

Do not chase mentions indiscriminately. A third-party page is useful when it verifies a claim a buyer cares about. An integration listing that confirms compatibility can be more valuable for an integration prompt than broad publicity that says nothing about the product’s operation.

Case studies should make their evidentiary limits visible. Identify the customer context, starting problem, product use, measured result, and method behind the metric. If the outcome is self-reported or cannot be independently verified, describe it that way. Specificity makes the claim easier to evaluate; inflated certainty makes the entire corpus less trustworthy.

Build a proof inventory before launching another content campaign. For each positioning claim, record the first-party explanation, customer evidence, independent corroboration, current URL, owner, and freshness status. Empty cells reveal whether you have a writing problem, a product-evidence problem, or a distribution problem.

This inventory also prevents a common sequencing mistake. A startup may publish many pages around a claim that no customer, partner, reviewer, or technical artifact supports. More repetition does not create stronger evidence. First establish the truth of the claim, then make that truth easy to discover in the places a recommendation system can retrieve.

Run AI visibility as an eval-driven product loop

AI-search work becomes vague when the team alternates between random prompts and random content changes. Treat the discovery experience as a product surface with defined test cases, observable failures, and controlled iterations.

Define a stable prompt set. Represent the buyer intents you want to serve, using the language a real evaluator would use at each decision stage.
Capture a baseline in ChatGPT and Perplexity. Record the exact prompt, system, test date, answer, recommendation context, cited pages, and factual errors.
Classify the failure. Distinguish absence from miscategorization, weak fit evidence, missing corroboration, stale information, or retrieval of the wrong page.
Change the evidence connected to that failure. Improve the category definition for a positioning error, an integration page for a compatibility gap, or the trust center for an unsupported deployment answer.
Rerun the same test cases. Look for improved coverage and accuracy without assuming that a single response proves a durable change.
Connect visibility to buyer behavior. Track referrals from AI-driven surfaces, landing-page engagement, qualified demand, and pipeline where your analytics can identify them responsibly.

Use a simple evaluation record rather than one blended score. Mark whether the product was present or absent, whether its category was correct or wrong, whether fit was supported or merely asserted, whether citations were current, and whether the linked page offered a useful next step. Separate fields tell you what to fix. A single number hides the cause.

Answer variability is part of the environment, so treat one run as an observation rather than a verdict. The useful signal is whether the same class of important prompts becomes more consistently accurate after you improve the relevant material.

A/B testing can help when a page receives enough appropriate traffic and the change can be measured through user behavior. Test answer placement, headings, proof presentation, or the route to a next step. Do not A/B test incompatible facts about what the product is. Positioning consistency is a prerequisite for the evaluation, not an experiment variant.

Avoid the shortcuts that create activity without evidence: bulk publishing shallow pages, applying every available schema type, writing hostile comparison copy, leaving essential documentation only in PDFs, and reporting raw mentions without checking accuracy or commercial relevance.

In your next working session, choose the buyer question closest to an active product decision. Inspect the answer, identify the missing or unreliable evidence, improve the page that should resolve it, and add one credible corroborating signal. Then preserve the prompt and retest it. Repeating that loop across the decision journey is how AI visibility becomes an operating capability instead of a one-time content project.

References

Amplitude – Crack the AI Search Code: How Startups Win Recommendations in ChatGPT and Perplexity

December 3, 2025

Tag: eval-driven development

Key takeaways

Start with the coach’s behavioral contract

Define the minimum viable input

Build the prompt as a stack of distinct layers

Use examples to teach judgment, not phrases

Protect the important context when inputs are long

Make the workflow evidence-first, not prose-first

Use an output contract that exposes the reasoning

Evaluate the coach as a product, not a single response

Design privacy and fairness into the workflow

References

Define a trust contract before choosing the architecture

Engineer an evidence path, not just an answer

Define the case before you define the agents

Split responsibilities at operational handoffs

Let a state machine control long-running work

Make compliance and human review part of execution

Measure verified resolution and improve from failures

Key takeaways

References

What you should be able to do after 12 months

Months 1-3: Learn enough AI to make sound product decisions

Build an operator’s mental model

Apply the foundations to product discovery

Months 4-6: Prototype the experience and build its evaluation system

Use evaluation as a development method

Instrument behavior before launch

Months 7-9: Ship a bounded workflow, not an open-ended assistant

Map the production loop before building it

Monitor the experience at four levels

Months 10-12: Turn one successful product into a repeatable system

Use common analytics without erasing product context

References

Determine whether context is actually the bottleneck

Write a context contract before choosing the architecture

Build context assembly as a controlled pipeline

Ship with layered evaluations, observability, and ownership

Evaluate the evidence path before scoring the prose

Instrument enough to reproduce failure without creating a new data risk

Give every context surface an owner

Key takeaways

References

Choose the growth constraint before the AI use case

Convert the use case into a controlled workflow

Use each form of automation for the work it can control

Write an execution contract, not just a prompt

Measure business value, workflow performance, and AI quality separately

Use a scale gate that includes economics

Scale through guardrails, reusable components, and clear ownership

Create a minimum launch record for every workflow

Assign ownership beyond launch

Standardize the recurring parts, not every local process

Operationalizing AI: three questions leaders ask

Should you build a central AI platform first?

How do you know a pilot is ready to scale?

Where should a human remain in the loop?

References

Measure the improvement loop, not just resolution rate

Classify the failure before choosing the fix

Give one owner the authority and the improvement queue

Turn live failures into governed, testable changes

Match governance to the blast radius

Use focused sprints without making improvement episodic

Make AI-ready knowledge part of product launch readiness

Key takeaways for your next operating review

References

Measure recommendation coverage, not an imaginary rank

Turn buyer intent into an answerable page system

Make the corpus easy to retrieve and hard to misread

Earn corroboration where your company cannot control the wording

Run AI visibility as an eval-driven product loop

References