Five Engineering Practices That Separate Production Gen AI from a Demo

The Wall Every Team Hits

There is a moment that repeats itself across every team that builds a generative AI application for real users. The demo has worked. The notebook has impressed the right stakeholders. Then the system goes live, and within weeks the team discovers problems that were invisible in the prototype: costs that appeared without warning, a model whose behavior shifted after a pricing change, prompts that could only be updated by shipping a new release, and no systematic way to know whether the application was getting better or worse.

These are not edge cases. They are the predictable consequences of skipping the engineering discipline that separates a production gen-AI application from a very impressive demo. As ML systems researcher Chip Huyen has observed, it is easy to make something cool with LLMs, but very hard to make something production-ready. A Deloitte survey found that only 23% of organizations feel highly prepared for gen-AI risk management and governance challenges. The gap between adoption and confident production deployment remains wide even as McKinsey's 2024 State of AI survey reports 72% of organizations have adopted AI in at least one business function.

Across two years of developing internal chat and productivity solutions, five engineering practices have proved necessary in every production gen-AI application that held up under real use. They are not theoretical; they are what teams keep relearning when they ship without them: cost transparency and observability, model independence, externalized prompts, prompt refinement through closed feedback loops, and structured outputs with human gates. They are not the complete architecture. They are the engineering floor under one.

Principle 1: Know What You Are Spending, in Real Time

Generative AI cost is variable per call, not flat per seat. A single expensive task can burn five dollars in one run, and the team will never know unless the instrumentation is there to surface it. Trussed AI, a cost observability vendor, cites data showing that 84% of companies report AI costs affecting gross margins, while Maiven, an AI cost management platform, reports that 71% of enterprises admit to little or no control over where their costs originate. Independent pricing analysis shows that the same task can be billed at $0.04 per million tokens on one provider and $25.00 on another, a 625x difference.

The implementation is straightforward: maintain a model-pricing table inside the application, estimate cost per call, track per-user and per-feature spend, and surface that data where the team will actually see it. Two concrete patterns matter most: a per-user spend dashboard with budget settings, and a per-task cost estimate shown before the user runs the task. This principle is distinct from TCO modeling at procurement time: TCO models the contract, while runtime observability shows what is actually happening in production.
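A minimal sketch of that instrumentation, assuming a hand-maintained pricing table and in-memory storage. The model names, prices, and the `SpendTracker` class are hypothetical placeholders, not any provider's real rates or API.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; real values come from
# each provider's published price sheet and need regular updates.
PRICING = {
    "small-model": {"input": 0.25, "output": 1.00},
    "large-model": {"input": 5.00, "output": 25.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from token counts."""
    price = PRICING[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


class SpendTracker:
    """Tracks spend per (user, feature) and checks per-user budgets."""

    def __init__(self, budgets: dict):
        self.budgets = budgets            # user -> budget in USD
        self.spend = defaultdict(float)   # (user, feature) -> USD spent

    def user_total(self, user: str) -> float:
        return sum(v for (u, _), v in self.spend.items() if u == user)

    def would_exceed(self, user: str, cost: float) -> bool:
        """Pre-run check: would this task push the user over budget?"""
        limit = self.budgets.get(user, float("inf"))
        return self.user_total(user) + cost > limit

    def record(self, user: str, feature: str, cost: float) -> None:
        self.spend[(user, feature)] += cost
```

In use, the application would call `estimate_cost` before a task runs, show the figure to the user, consult `would_exceed`, and call `record` once the real token counts come back. A production version would persist the data and feed the dashboard rather than hold it in memory.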

Principle 2: Never Let a Model Choice Become a Dependency

Baking a specific model into source code is a structural bet on a market that moves every quarter: prices change, capabilities shift, and provider quality can dip without notice. Independent benchmarking services like Artificial Analysis track pricing and quality rankings across a broad range of models and providers on a continuous basis, and those rankings move as the market does. A team that can swap a model without a deploy can respond to all three pressures; a team that cannot waits for a release cycle every time the market moves.

The useful pattern is an administrative screen where any agent stage can be switched between providers (OpenAI, Anthropic, Google, AWS Bedrock, self-hosted open-weights models) through configuration rather than code. Lock-in, it turns out, extends well beyond the model itself: the prompt library, evaluation harness, fine-tunes, observability stack, and billing relationships all accumulate around a provider over time. For regulated industries and public sector organizations, this flexibility is also a governance consideration: data sovereignty and residency policies may constrain which infrastructure providers are permissible for a given workload. The caveat that must not be omitted: swapping models is not free. Output behavior shifts, and a swap should pair with prompt refinement and an evaluation pass before reaching users. Model independence is the capability; it does not make model changes costless.
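One way to sketch the configuration-driven switch, assuming each provider sits behind a small adapter function with a common signature. The provider names, model names, and stage names below are hypothetical, and the adapters are stand-ins that return a tagged string instead of calling a real SDK.

```python
import json
from typing import Callable, Dict


# Stand-in adapters; a real one would wrap the vendor's SDK call.
def call_provider_a(model: str, prompt: str) -> str:
    return f"[provider-a/{model}] {prompt}"


def call_provider_b(model: str, prompt: str) -> str:
    return f"[provider-b/{model}] {prompt}"


PROVIDERS: Dict[str, Callable[[str, str], str]] = {
    "provider-a": call_provider_a,
    "provider-b": call_provider_b,
}

# Assumed to live in a config store edited from the admin screen,
# not in source control alongside the application code.
STAGE_CONFIG_JSON = """
{
  "summarize": {"provider": "provider-a", "model": "fast-1"},
  "draft":     {"provider": "provider-b", "model": "deep-2"}
}
"""


def load_stage_config(raw: str) -> dict:
    """Parse the stage map and reject providers with no adapter."""
    config = json.loads(raw)
    for stage, entry in config.items():
        if entry["provider"] not in PROVIDERS:
            raise ValueError(f"stage {stage!r}: unknown provider {entry['provider']!r}")
    return config


def run_stage(stage: str, prompt: str, config: dict) -> str:
    """Dispatch a stage to whichever provider the config names."""
    entry = config[stage]
    return PROVIDERS[entry["provider"]](entry["model"], prompt)
```

Because `run_stage` reads the provider at call time, changing the entry for a stage reroutes the next request without a deploy. The evaluation pass described above still has to run before such a change reaches users.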

Principle 3: Prompts Are the Spec, Not the Code

Prompts define what a gen-AI application does, how it responds, and what it refuses. Yet the most common architecture embeds them in source files, which means iterating on the most frequently tuned part of the system requires going through the slowest part of the development process. As one independent practitioner guide frames it, prompts should be treated as versioned, deployable assets with their own lifecycle, separate from application code.

The solution is a prompt library that can be edited without a release cycle. The important nuance is that not all prompts are safe for free-form editing. A layered architecture distinguishes system and policy prompts (the constitutional layer that enforces role, scope, output schema, and refusal behavior) from user-tunable fragments designed to be edited freely without breaking downstream contracts. Anthropic's published model specification makes explicit that safety and policy compliance take precedence over helpfulness in cases of apparent conflict, providing direct grounding for the principle that policy-layer prompts must be protected from free-form editing. Anthropic's Constitutional AI research offers broader context for the principle-based governance approach that motivates this layered design. Academic research confirms that prompt variation has measurable engineering consequences. The discipline is treating prompts as code in version control and review, but as configuration in their deploy path.
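The layered design can be sketched as follows, assuming an in-memory library. `PolicyPrompt`, `PromptLibrary`, and the fragment names are hypothetical. The frozen dataclass only signals intent; real protection of the policy layer would come from access control and review on the store that holds it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PolicyPrompt:
    """Constitutional layer: role, scope, output schema, refusals."""
    text: str


@dataclass
class PromptLibrary:
    policy: PolicyPrompt
    # fragment name -> every version ever saved, oldest first
    fragments: Dict[str, List[str]] = field(default_factory=dict)

    def update_fragment(self, name: str, text: str) -> int:
        """Save a new version of a tunable fragment; return its number."""
        versions = self.fragments.setdefault(name, [])
        versions.append(text)
        return len(versions)  # version numbers start at 1

    def render(self, name: str, version: Optional[int] = None) -> str:
        """Compose the policy layer with one fragment version."""
        versions = self.fragments[name]
        body = versions[-1] if version is None else versions[version - 1]
        return f"{self.policy.text}\n\n{body}"
```

Keeping every fragment version makes rollback a matter of rendering an earlier number, which is the "configuration in the deploy path" half of the discipline. The "code in version control and review" half would apply to how new versions get approved.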

Principle 4: Close the Loop, or Watch Quality Stagnate

A gen-AI application is not done at launch. Its quality compounds with usage if the feedback loop is closed, and stagnates if it is not. The alternative (manual prompt iteration based on whichever output someone happened to complain about) does not scale and does not converge. A survey by WRITER, a generative AI platform vendor, found that 61% of companies implementing in-house gen-AI solutions have experienced accuracy issues. Accuracy issues that are not systematically captured do not resolve themselves.

The pattern is to build feedback capture as a first-class UI element on every gen-AI output: a simple approve/reject signal and an optional comment. Then build the second half: an analysis pass that proposes prompt edits based on accumulated feedback, which a human approves or rejects before the change goes to production. The pattern is human-in-the-loop, not autonomous. The application proposes; the team disposes. Academic research on agentic AI governance frames effective human-in-the-loop systems as strategically applying human judgment at the points where it provides the most value: validating automated evaluations, catching subtle quality failures, and generating data that refines agent behavior over time. The feedback loop is also the connective tissue between the other principles, serving as the instrument by which the team learns whether a model swap or prompt change actually improved quality in production.
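A minimal sketch of both halves of the loop, assuming in-memory storage. The class names, the 30% rejection threshold, and the way the proposed edit is drafted are all hypothetical; in particular, the string concatenation in `propose` is a placeholder for whatever analysis pass a team actually builds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feedback:
    output_id: str
    approved: bool
    comment: str = ""


@dataclass
class Proposal:
    new_prompt: str
    rationale: str
    status: str = "pending"


class FeedbackLoop:
    def __init__(self, live_prompt: str):
        self.live_prompt = live_prompt
        self.feedback: List[Feedback] = []

    def capture(self, output_id: str, approved: bool, comment: str = "") -> None:
        """First half: record the approve/reject signal from the UI."""
        self.feedback.append(Feedback(output_id, approved, comment))

    def rejection_rate(self) -> float:
        if not self.feedback:
            return 0.0
        rejected = sum(1 for f in self.feedback if not f.approved)
        return rejected / len(self.feedback)

    def propose(self, threshold: float = 0.3) -> Optional[Proposal]:
        """Second half: draft an edit once rejections pass the threshold."""
        rate = self.rejection_rate()
        if rate < threshold:
            return None
        comments = [f.comment for f in self.feedback if not f.approved and f.comment]
        # Placeholder drafting step; a real pass might summarize comments.
        new_prompt = self.live_prompt + "\nAddress reviewer concerns: " + "; ".join(comments)
        return Proposal(new_prompt, rationale=f"{rate:.0%} of outputs rejected")

    def review(self, proposal: Proposal, approve: bool) -> None:
        """Human gate: only an approved proposal changes the live prompt."""
        proposal.status = "approved" if approve else "rejected"
        if approve:
            self.live_prompt = proposal.new_prompt
```

The point of the structure is that `propose` never touches `live_prompt`; only `review` does, and only on approval.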

Principle 5: Probabilistic Outputs Need Deterministic Interfaces

In a multi-step agentic pipeline, a language model's text becomes the input to the next stage. Without a validation layer between them, errors do not stay local; they amplify. Peer-reviewed research formally modeling error cascades in multi-agent LLM systems confirms that errors compound as they propagate through agent collaboration networks. To put this in concrete terms: a misinterpreted instruction at step one can cascade into incorrect tool usage at step three and unintended external action at step five. In a system handling a substantial volume of daily requests, even a low error rate at the first stage can translate into a meaningful number of corrupted downstream outputs before the issue is detected.
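A back-of-the-envelope illustration of the volume effect, under a simplifying assumption that is not from the research cited above: each stage fails independently at the same rate, and any single failure corrupts the final output. The workload figures are hypothetical, and the model captures accumulation across stages, not amplification.

```python
def expected_corrupted(daily_requests: int, per_stage_error: float, stages: int) -> float:
    """Expected requests hit by at least one stage error.

    Assumes stages fail independently at the same rate.
    """
    end_to_end_success = (1.0 - per_stage_error) ** stages
    return daily_requests * (1.0 - end_to_end_success)


# Hypothetical workload: 10,000 requests a day, 1% error per stage, 5 stages.
print(round(expected_corrupted(10_000, 0.01, 5)))  # prints 490
```

Under those assumptions, a 1% per-stage error rate becomes roughly a 4.9% end-to-end rate, or about 490 affected requests a day.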

Every agent output should be validated against a schema (Zod, Pydantic, JSON Schema, or native function-calling structured outputs) before it becomes input to the next stage. Each stage should have a human-exercisable approve/reject gate and a quality gate that can auto-rerun a stage whose output fails validation. The documented failures are instructive: Air Canada was forced to honor a nonexistent bereavement fare policy after its chatbot cited it as real, under a ruling by British Columbia's Civil Resolution Tribunal; a consulting firm's AI-generated report contained fabricated sources that internal review and quality assurance processes failed to catch before delivery. These are not model failures; they are validation and oversight failures. The EU AI Act's Article 14 mandates human oversight for high-risk AI systems, and OMB M-24-10 directs federal agencies to maintain human review of AI-assisted decisions. For regulated and public sector organizations, the human gates are not engineering preference; they are compliance requirements.
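The two gates can be sketched with the standard library alone. The hand-written `validate` function below stands in for a schema library such as Pydantic or JSON Schema, and the field names, the retry limit, and the `human_approves` callback are hypothetical.

```python
import json
from typing import Callable

# Hypothetical contract for one stage's output.
REQUIRED_FIELDS = {"ticket_id": str, "priority": int}


def validate(raw: str) -> dict:
    """Parse model text and check it against the stage contract."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


def run_stage_with_gates(
    generate: Callable[[int], str],
    human_approves: Callable[[dict], bool],
    max_attempts: int = 3,
) -> dict:
    """Rerun on validation failure, then ask a human before passing on."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            parsed = validate(generate(attempt))
        except ValueError as exc:
            last_error = exc
            continue  # quality gate: invalid output triggers a rerun
        if not human_approves(parsed):
            raise PermissionError("reviewer rejected the stage output")
        return parsed  # only validated, approved data reaches the next stage
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Here `generate` would wrap the model call and `human_approves` would block on a reviewer's decision in the UI. Raising on rejection, not retrying, keeps the human decision final for that run.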

How the Five Work Together

Cost transparency tells the team what the application is spending. Model independence and externalized prompts give the levers to change behavior without a deploy. The feedback loop closes the circuit so changes are informed by real production data. Structured outputs and human gates keep the system correct at every stage boundary. Skipping any one produces a recognizable failure mode: opaque spend, locked-in models, slow iteration, stagnant quality, or cascading bad outputs. To put the stakes in concrete terms: a team that ships without cost observability may not discover a runaway expense pattern until it appears on a monthly invoice; a team that ships without structured outputs may not discover a data corruption issue until it has propagated through dozens of downstream records.

What is deliberately not on this list is worth naming. Retrieval-augmented generation is necessary for many applications but is a substantial architecture decision in its own right. Evaluation harnesses and regression testing are a close cousin to the feedback loop principle but constitute a distinct discipline at scale. Platform-level safety guardrails, identity management, and audit trails are table-stakes enterprise software disciplines that are not gen-AI-specific. The five principles described here are the floor under all of those disciplines, not a substitute for them. Each of those broader disciplines is harder to execute well without the floor in place: a team that cannot observe its costs cannot budget for a retrieval layer; a team that cannot swap models cannot respond when a provider's safety posture changes; a team with no feedback loop has no signal to drive regression testing.

Engineering Discipline Is the Path to Production

Production gen-AI applications are reached by engineering discipline, not by finding the right model. The five principles described here are not exotic inventions; they are the practices teams keep relearning, and the difference between an application that holds up under real use and one that does not.

Spruce brings this engineering foundation to every production gen-AI engagement through its AI Solutions Engineering practice, which builds these applications with these principles embedded from the start. Spruce's AI Advisory practice works with clients upstream, framing the architecture and policy choices that determine whether a gen-AI program can scale, with both practices operating inside the AI On-Ramp. For organizations ready to move from pilot to production, the floor described here is where that work begins.