Most AI content tools are production lines, not intelligence systems

27 June 2026

Most marketing teams using AI for content are running a production line. They just don't know it.

A production line takes raw material and converts it to output, as fast as possible, with no mechanism to evaluate what came out. That's what happens when you open a chat interface, paste in a brief, hit generate, copy the output into your CMS, and publish. You've made the writing step faster. You have not built a content system. You've built a faster factory for undifferentiated text.

The teams actually winning with AI content are doing something categorically different. They're running an intelligence system — a loop of generation, evaluation, targeted improvement, structural verification, and automated deployment, with model routing economics that make the whole thing commercially viable at scale. The gap between these two approaches is not a matter of prompt quality or tool selection. It's an architectural difference.


The production line mistake

Here's how most "AI content programmes" look when you pull back the curtain.

A content manager or marketer opens an AI writing tool. They type a brief or paste in a template prompt. The tool generates an article. The human reads it, maybe edits a few sentences, pastes it into a CMS, and schedules publication. Occasionally there's a review step where another human reads it. Sometimes there isn't.

This feels like leverage because the writing step is faster. It is faster. But the writing step was never the expensive part of content at scale. The expensive parts were quality consistency, structural correctness, and the cognitive load of producing dozens or hundreds of pieces per week without them degrading. The production line approach hasn't solved any of those problems — it's just made the initial draft cheaper to produce.

The result is predictable: inconsistent quality across a large corpus, no feedback loop between what gets published and what should have been rejected, and humans spending time doing manual quality review that doesn't scale. The AI is a faster typist. The human is still the grader, the editor, and the deployment mechanism. Nothing fundamental has changed.


What an intelligence system actually looks like

A real AI content system is a pipeline of isolated processes. Each process has a specific job, a specific model, and a specific output contract. No single process does everything. The intelligence emerges from the structure of how they connect.

The core loop is: Generate → Grade → Improve → Verify → Deploy.

That loop is not a description of prompts chained together. It is a description of agents with independent context windows, operating sequentially, where the output of each stage is the input contract for the next. The distinction matters, and I'll explain why in a moment.

Here's what each stage actually does.


The five components of a real AI content system

1. The Generator

The generator's job is to produce content against a fully-loaded context. That context isn't just "write a 1,200-word article about X." It's a composition of multiple knowledge layers — the subject-matter expertise the content requires, the market or audience context it's being written for, any local or temporal signals that make the piece specific rather than generic, and the language and register for the target reader.

The generator runs on a capable content model — not the cheapest model available, not the most expensive reasoning model. The right model for generation is one that can hold a large, structured context window and produce coherent long-form output. It has one job: write the piece. It doesn't evaluate. It doesn't make deployment decisions.
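
To make "fully-loaded context" concrete, here is a minimal sketch of how those layers might be composed into one generator prompt. The type `GenerationContext`, its field names, and the function `composeGeneratorPrompt` are hypothetical names chosen for illustration, not part of any particular tool.

```typescript
// Hypothetical shape for the knowledge layers described above.
interface GenerationContext {
  brief: string;               // what to write
  subjectExpertise: string[];  // subject-matter notes the piece depends on
  audienceContext: string;     // market or audience the piece is written for
  localSignals: string[];      // local or temporal details that make it specific
  languageRegister: string;    // language and register for the target reader
}

// Compose the layers into one prompt, skipping any layer that is empty.
function composeGeneratorPrompt(ctx: GenerationContext): string {
  const sections: [string, string][] = [
    ["Brief", ctx.brief],
    ["Subject-matter notes", ctx.subjectExpertise.join("\n")],
    ["Audience", ctx.audienceContext],
    ["Local signals", ctx.localSignals.join("\n")],
    ["Language and register", ctx.languageRegister],
  ];
  return sections
    .filter(([, body]) => body.trim().length > 0)
    .map(([title, body]) => `${title}:\n${body}`)
    .join("\n\n");
}
```

Keeping each layer as its own field means a weak piece can be traced back to the layer that was thin or missing, rather than to "the prompt" as a whole.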

2. The Grader

The grader is where most AI content programmes have a complete gap. The grader's job is to evaluate the generated content against a rubric of binary pass/fail criteria. Not a holistic quality score. Not a 1–10 rating. Pass/fail per specific criterion.

Why binary? Because "7/10 quality" tells an improvement agent nothing. "FAIL: no regulatory body named, FAIL: missing local trust signal, PASS: accurate product claims" gives the improvement agent a precise brief.

The critical design requirement: the grader must have an isolated context from the generator. It cannot see what the generator was prompted with. It cannot see the generator's reasoning. It can only evaluate what was produced against the rubric.

This isolation is not bureaucratic. It solves a real problem. LLMs asked to evaluate their own output tend to validate it, because they are reasoning from the same assumptions and biases that produced it. An isolated grader has no idea what the generator intended. It can only see what the generator produced. That asymmetry is what makes the quality gate honest.

The grader runs on a cheaper, faster model. Evaluation against a fixed rubric doesn't require the same capability as generation. Using a more expensive model for grading is wasteful — provided the rubric itself was written carefully, which I'll return to.
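
A rough sketch of what a binary rubric and an isolated grader input could look like as types. Everything named here (`Criterion`, `Verdict`, `GraderInput`, `buildGraderPrompt`, `improvementBrief`) is a hypothetical name for illustration. The point is in the shape: the grader input has no field for the generator's prompt or reasoning, and a verdict has no field for a score.

```typescript
// One rubric criterion: a question that can only be answered pass or fail.
interface Criterion {
  id: string;
  question: string;
}

// The grader's verdict for one criterion. There is no score field.
interface Verdict {
  criterionId: string;
  pass: boolean;
  reason: string;
}

// Everything the grader is allowed to see. The generator's prompt and
// reasoning are deliberately absent from this type.
interface GraderInput {
  content: string;
  rubric: Criterion[];
}

// Build the grader prompt from the isolated input only.
function buildGraderPrompt(input: GraderInput): string {
  const checks = input.rubric
    .map((c, i) => `${i + 1}. [${c.id}] ${c.question}`)
    .join("\n");
  return [
    "Evaluate the article against each criterion. Answer PASS or FAIL per criterion.",
    `Criteria:\n${checks}`,
    `Article:\n${input.content}`,
  ].join("\n\n");
}

// Turn verdicts into the brief handed to the improver: failures only.
function improvementBrief(verdicts: Verdict[]): string[] {
  return verdicts.filter((v) => !v.pass).map((v) => `FAIL: ${v.reason}`);
}
```

With the three verdicts from the example earlier in this section, `improvementBrief` would return the two FAIL lines and drop the PASS, which is the precise brief the improver needs.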

3. The Improver

The improver receives two inputs: the piece that failed grading, and the specific list of failing criteria. Its job is not to rewrite the piece from scratch. Its job is targeted repair — fix the specific criteria that failed, leave the passing criteria alone.

This is architecturally important. Full regeneration is expensive and introduces regression risk — you might fix the failing criteria but break the ones that were passing. Targeted improvement on specific failing criteria is cheaper, faster, and more predictable.

The improver runs on the same class of model as the generator. Targeted rewriting requires genuine capability — it can't be done cheaply.

After improvement, the piece goes back to the grader for re-evaluation, again in an isolated context and against the same rubric. Only pieces that pass grading proceed to the next stage.
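
The generate, grade, improve, re-grade cycle can be sketched as a small loop. This is a sketch under assumptions, not a reference implementation: the stage functions are injected as plain synchronous functions to keep it short (real model calls would be asynchronous), all names are hypothetical, and the `maxRounds` cap is my own addition so that a piece that never passes stops cycling.

```typescript
interface Verdict {
  criterionId: string;
  pass: boolean;
}

// The three stages, injected so each can run on its own model and context.
interface Stages {
  generate: (brief: string) => string;
  grade: (draft: string) => Verdict[];
  improve: (draft: string, failedIds: string[]) => string;
}

interface LoopResult {
  draft: string;
  passed: boolean;
  rounds: number; // number of improvement passes that were run
}

function runContentLoop(brief: string, stages: Stages, maxRounds = 3): LoopResult {
  let draft = stages.generate(brief);
  for (let round = 0; round <= maxRounds; round++) {
    // The grader sees only the draft, never the brief.
    const failedIds = stages
      .grade(draft)
      .filter((v) => !v.pass)
      .map((v) => v.criterionId);
    if (failedIds.length === 0) {
      return { draft, passed: true, rounds: round };
    }
    if (round === maxRounds) break; // assumed cap: stop after maxRounds repairs
    // Targeted repair: the improver receives only the failing criteria.
    draft = stages.improve(draft, failedIds);
  }
  return { draft, passed: false, rounds: maxRounds };
}
```

A piece that comes back with `passed: false` would not move on to verification or deployment. What happens to it instead (human review, a fresh generation) is a policy decision outside this sketch.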

4. The TypeScript Verifier

This is the component that separates practitioners from experimenters, and it doesn't involve AI at all.

Generated content, at the point of deployment, does not go into a CMS via a form. It goes into typed data files — structured TypeScript objects that a Next.js or Astro build reads at compile time. After the content agent writes to those files, the build system runs tsc --noEmit.

The TypeScript compiler is a deterministic, zero-hallucination validator. It cannot be convinced that a type error is acceptable. It doesn't care about intent. It evaluates structure.

What it catches that content review misses: a nested array inside a field that expects a string; a missing required field that the improvement agent accidentally dropped; a numeric field populated with a string because the model inferred "free" was cleaner than 0; a field name with a one-character typo that would produce a silent undefined at render time.

These are structural errors that a human reviewer reading the content for quality would never catch. They only surface at compile time — or, if there's no compile-time check, at render time, after deployment, in production.
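
As an illustration, here is what a typed data file might look like. The interface `ProductArticle` and its fields are hypothetical, chosen to mirror the four error classes listed above; a real content model would have its own fields.

```typescript
// Hypothetical shape for one content record in a typed data file.
interface ProductArticle {
  slug: string;
  title: string;
  summary: string;      // a single string, not a nested array
  monthlyPrice: number; // 0 for a free plan, never the string "free"
  regulator: string;    // required: omitting it is a compile error
}

// A record written correctly. Type-checking this file reports no errors.
const validArticle: ProductArticle = {
  slug: "starter-plan-guide",
  title: "Starter plan guide",
  summary: "What the starter plan includes and who it suits.",
  monthlyPrice: 0,
  regulator: "Example Financial Authority",
};

// Each variant below would fail compilation if it replaced the matching
// line in the record above:
//
//   summary: ["What the plan includes", ["and who it suits"]],
//     -> an array where a string is expected
//   monthlyPrice: "free",
//     -> a string where a number is expected
//   titel: "Starter plan guide",
//     -> unknown property on the object literal, and `title` is missing
//   (no `regulator` line at all)
//     -> required property missing

const articles: ProductArticle[] = [validArticle];
```

None of the four broken variants would look wrong to someone reading the rendered article for quality, which is why the check belongs to the compiler.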

If TypeScript compilation fails, the agent reads the error output and makes a targeted fix. It re-runs the compiler. Only files that compile cleanly proceed to deployment.
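
For the agent to make a targeted fix, the compiler output has to be turned into something structured. A minimal sketch, assuming the compiler is run with `--pretty false` so that each error arrives on one line in the form `file(line,column): error TSxxxx: message`. The names `TscError` and `parseTscOutput` are hypothetical.

```typescript
// One compiler error, in a form an agent can act on.
interface TscError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Parse plain (non-pretty) tsc output. Lines that are not error lines,
// such as the trailing summary, are ignored.
function parseTscOutput(output: string): TscError[] {
  const pattern = /^(.+)\((\d+),(\d+)\): error (TS\d+): (.+)$/;
  const errors: TscError[] = [];
  for (const rawLine of output.split("\n")) {
    const match = pattern.exec(rawLine.trim());
    if (match) {
      errors.push({
        file: match[1],
        line: Number(match[2]),
        column: Number(match[3]),
        code: match[4],
        message: match[5],
      });
    }
  }
  return errors;
}
```

An empty result means the file compiled cleanly and can proceed. A non-empty result gives the agent a file, a position, and a message for each fix, instead of a wall of terminal text.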

The type system is not bureaucracy. It's the structural QA layer.

5. The Deployment Agent

The deployment agent does one thing: it pushes content that has passed grading and passed TypeScript verification to the production environment. It has no judgement about quality — that judgement happened upstream, in isolated stages, by agents designed for that specific job.

This agent runs on the cheapest model available. Writing files and triggering deploys does not require reasoning capability.


The economics of model routing

Running this pipeline on the most capable model throughout would be prohibitively expensive at scale. Running it on the cheapest model throughout would produce unreliable outputs at every stage that requires genuine capability.

Model routing is the discipline of matching the capability requirement of each stage to the cost tier of the model executing it.

The practical breakdown has three tiers. Top-tier reasoning models handle rubric design and orchestration logic: this is where the quality standard gets set, so it is the one place not to cut cost. Mid-tier capable models (Sonnet-class) handle generation and improvement, which require holding large context windows and producing coherent long-form output. Budget models (Haiku-class) handle grading, deployment, and file writes, because evaluating against a fixed rubric and executing deterministic actions don't require expensive capability.

A well-routed pipeline is roughly 3–10× cheaper per output than one that runs every stage on a top-tier model. At a thousand agent invocations per week, that is real money in absolute terms, not just a percentage saving.
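
The arithmetic behind a figure like that can be sketched in a few lines. The cost units below are hypothetical relative numbers, not real prices for any model, and the call sequence assumes a piece that fails grading once. One-off rubric design on a top-tier model is left out because it is not a per-piece cost.

```typescript
type Tier = "top" | "mid" | "budget";
type Stage = "generate" | "grade" | "improve" | "deploy";

// Routing policy: which tier runs each stage.
const routing: Record<Stage, Tier> = {
  generate: "mid",
  grade: "budget",
  improve: "mid",
  deploy: "budget",
};

// Hypothetical relative cost per invocation. Substitute real prices.
const costPerCall: Record<Tier, number> = { top: 15, mid: 3, budget: 1 };

// Assumed path for one piece: it fails grading once, is repaired, then passes.
const callsPerPiece: Stage[] = ["generate", "grade", "improve", "grade", "deploy"];

function pieceCost(calls: Stage[], tierFor: (stage: Stage) => Tier): number {
  return calls.reduce((sum, stage) => sum + costPerCall[tierFor(stage)], 0);
}

const routedCost = pieceCost(callsPerPiece, (stage) => routing[stage]); // 3 + 1 + 3 + 1 + 1 = 9
const topOnlyCost = pieceCost(callsPerPiece, () => "top");              // 5 * 15 = 75
const savingsFactor = topOnlyCost / routedCost;                         // 75 / 9, about 8.3
```

The result depends entirely on the unit costs and the call sequence fed in, so treat it as a template for your own numbers rather than evidence for the range above.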

The failure mode to avoid: routing grading to a cheap model without ensuring the rubric was designed on a capable model. Cheap-model grading quality is bounded by the quality of the rubric it's evaluating against. If the rubric is vague, the grader will be inconsistent. The rubric must be precise, and precision requires a model that can reason about what good looks like.


How to audit where you are

Four questions. Put them to whoever runs your AI content programme.

What does the grader evaluate against? If the answer is "we review it ourselves" or "the tool has a quality score," there is no grader. You have a production line.

Is the grader's context isolated from the generator's? If they don't understand why this question matters, the grader — if it exists — is not doing honest evaluation.

What is the model routing policy? If every step runs on the same model (either the most capable or the cheapest), there is no model routing. Either you're overspending, or you're under-powering the generation and improvement steps.

What happens when a piece fails structural verification before deployment? If the answer involves a human manually checking something, there is no automated structural gate. The TypeScript verification layer does not exist.

The answers to these questions tell you whether you're running an intelligence system or a production line. Most teams are running a production line. The ones who aren't are the ones pulling away in content velocity and consistency.


What to do with this

You don't need to build all five components at once. Start with the grader.

Take whatever content you're currently producing with AI. Design a binary rubric — not a holistic quality score, but a checklist of specific pass/fail criteria against your quality standard. Run your last twenty pieces through it. See what the pass rate is. See which criteria fail most consistently.

That failure pattern is your improvement brief. Fix the generation prompt or the generation context for the specific criteria that are failing. Re-run. Grade again.
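
If the grades for those twenty pieces are recorded as data, the pass rate and the failure pattern fall out of a short summary function. A minimal sketch, with hypothetical names (`PieceGrade`, `summariseAudit`) and assuming each graded piece is stored with the list of criteria it failed.

```typescript
// One graded piece: its id and the criteria it failed (empty means it passed).
interface PieceGrade {
  pieceId: string;
  failedCriteria: string[];
}

interface AuditSummary {
  passRate: number;                        // share of pieces with no failures
  failuresByCriterion: [string, number][]; // most frequent failure first
}

function summariseAudit(grades: PieceGrade[]): AuditSummary {
  const counts: Record<string, number> = {};
  let passed = 0;
  for (const grade of grades) {
    if (grade.failedCriteria.length === 0) passed++;
    for (const id of grade.failedCriteria) {
      counts[id] = (counts[id] ?? 0) + 1;
    }
  }
  const failuresByCriterion = Object.keys(counts)
    .map((id): [string, number] => [id, counts[id] ?? 0])
    .sort((a, b) => b[1] - a[1]);
  return {
    passRate: grades.length === 0 ? 0 : passed / grades.length,
    failuresByCriterion,
  };
}
```

The first entry in `failuresByCriterion` is the criterion to fix first in the generation prompt or context.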

That cycle — generate, grade on a binary rubric, improve on specific failing criteria, grade again — is the core of a real content system. Everything else (model routing, TypeScript verification, automated deployment) is built on top of that loop. But the loop is the thing. Without it, you're running a production line with better machinery.

The teams who understood this eighteen months ago are now running content operations at a scale and consistency that are difficult to catch up to. The gap compounds because the system improves every time you tighten the rubric, add a grading criterion, or route a stage to a better-suited model. A production line doesn't compound. It just produces more output.

Which one are you running?

Written by

Eitan Gorodetsky

I run an AI-native marketing operation, and write about what it takes to operate this way.
