How to Build an AI Agent Blog Pipeline: Architecture, Models, and Real Costs

Most "AI blog automation" you read about is one prompt in a loop. It produces fast, forgettable content that reads like every other model dump on the internet — and search engines, AI answer engines, and human readers all treat it that way.

We wanted something different: a multi-agent content pipeline that publishes a high-quality, fact-checked, well-illustrated article every day, indefinitely, for tens of dollars a month — not thousands. So we built one where each step is handled by the model best suited to it, where one model family fact-checks another, and where the whole thing runs unattended on a schedule.

This post is the architecture blueprint. We'll walk through the stages, the exact models we use and why, what it actually costs per article (with measured numbers, not guesses), and then give you concrete, step-by-step instructions — schemas, thresholds, and snippets included — to build a working version of your own.

Why a multi-agent content pipeline beats a single prompt

A single "write me a blog post about X" call has three structural problems:

No grounding. The model writes from its training data, which is months stale and has no idea what shipped last week. The result is confident but generic, or simply wrong.
No second opinion. The model that wrote the post is the same model that "checks" it — blind to its own habits and hallucinations.
No production discipline. No images, no structured data, no deduplication, no safety net. It's a draft, not a publishable artifact.

A pipeline fixes each by assigning a specialized stage to a specialized model, and by making the stages adversarial where it counts.

The architecture: seven stages (plus captioning)

Each article moves through these stages, and each stage hands a structured result to the next:

Research — Pull current, sourced facts about the topic from a search-grounded model. This keeps posts accurate and timely instead of training-data-stale.
Plan — Turn the research into a concrete outline: angle, headings, the specific claims the article will make, and the questions it should answer.
Write — Generate the full draft from the plan and the research brief. The brief is treated as ground truth, so the writer grounds every claim in it instead of inventing.
Fact-check & edit — A different model family reviews the draft against the brief: verifying claims, tightening prose, enforcing voice, flagging anything unsupported. This cross-family check is the highest-leverage quality lever in the system.
Illustrate (plus a quick caption/alt-text step) — Generate a custom hero image so every post is visually distinct and preview cards look intentional.
Vectorize — Embed the finished article into a vector so it's retrievable by meaning. This powers on-site semantic search and makes the content usable by your own retrieval and RAG systems.
Publish — Run a final safety check, insert the post, attach structured data, and ping the sitemap.

The key design idea: stages are decoupled and resumable. Each article is a row in a store with a stable row ID and a status field (planned → researched → written → checked → illustrated → embedded → published). A stage claims rows by ID plus expected status, does its work, and advances the status only when its output is written. A failure in the image step doesn't lose the written draft; a rejected draft loops back to planned without touching the rest of the queue. The row ID plus conditional status transition is the stage boundary. For side effects like image uploads, post inserts, and sitemap updates, still add normal idempotency controls: unique slugs, collision handling, and provider request IDs where the API supports them.

How stages pass data

Keep the contract between stages boringly explicit. A research brief that the writer consumes might look like:

{
  "topic": "vector databases for RAG",
  "angle": "practical tradeoffs for a small team",
  "key_facts": [
    {"claim": "pgvector ships as a Postgres extension", "source": "https://..."},
    {"claim": "HNSW indexes trade build time for query speed", "source": "https://..."}
  ],
  "must_answer": ["When is a dedicated vector DB worth it?", "What does pgvector cost to run?"],
  "avoid_repeating": ["title of last week's RAG post"]
}

And the fact-checker returns a structured verdict the publish gate can act on — never free text the next stage has to guess at. verdict is one of pass / revise / reject, and score is 1–5:

{
  "verdict": "revise",
  "score": 3,
  "unsupported_claims": ["the '10x faster' figure isn't in the brief"],
  "fixes_applied": ["tightened intro", "removed two hedges"],
  "edited_markdown": "..."
}

Because every stage emits structured output, the orchestration is just a loop: select rows at status X, call the stage, write the result and the new status. No model is asked to parse another model's prose. (Have the model return strict JSON — no comments or trailing commas — so it parses on the first try.)
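
One way to hold the fact-checker to that contract is to validate its reply before the orchestrator acts on it. The sketch below assumes the field names from the example verdict above; `parse_verdict` is a hypothetical helper, not part of any provider SDK.

```python
import json

VERDICTS = {"pass", "revise", "reject"}
REQUIRED = {"verdict", "score", "unsupported_claims", "fixes_applied", "edited_markdown"}


def parse_verdict(raw: str) -> dict:
    """Parse and validate the fact-checker's JSON reply.

    Raises ValueError on anything malformed so the orchestrator can retry
    the stage instead of passing a half-valid verdict downstream.
    """
    try:
        data = json.loads(raw)  # strict parser: rejects comments and trailing commas
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    verdict = data["verdict"]
    if not isinstance(verdict, str) or verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return data
```

A reply that fails validation is treated like any other stage failure: the row keeps its current status and the stage runs again.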

The most important decision: cross-model fact-checking

If you take one thing from this post, take this: don't let the model that wrote the draft be the model that approves it.

Every model family has characteristic blind spots — phrasings it over-uses, claims it's overconfident about, structures it defaults to. When the same model "reviews" its own output, it nods along. When a model from a different lab reviews it, those blind spots light up.

So the writer and the fact-checker are always from different families. One writes; the other reads the draft against the research brief and asks, in effect, "is this actually supported, and is it actually good?" Disagreements surface real problems. Agreement across families is a more useful signal than one model's self-assessment, though it still needs deterministic gates and human review for high-stakes topics.

You can take this further with a panel: spawn several independent reviewers, each prompted to refute the draft rather than rubber-stamp it, and require a majority to pass. For daily content a single cross-family check is usually enough; for high-stakes posts, escalate to a panel.
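
The panel rule can be sketched as a strict-majority vote. The reviewer callables here are placeholders for whatever model calls you wire in; `Reviewer`, `panel_passes`, and `run_panel` are hypothetical names for illustration.

```python
from typing import Callable

# A reviewer takes (draft, brief) and returns "pass", "revise", or "reject".
Reviewer = Callable[[str, dict], str]


def panel_passes(verdicts: list[str]) -> bool:
    """True only when a strict majority of reviewers returned "pass"."""
    if not verdicts:
        return False
    passes = sum(1 for v in verdicts if v == "pass")
    return passes * 2 > len(verdicts)


def run_panel(draft: str, brief: dict, reviewers: list[Reviewer]) -> bool:
    """Ask each reviewer independently, then apply the majority rule."""
    verdicts = [review(draft, brief) for review in reviewers]
    return panel_passes(verdicts)
```

With an even number of reviewers a tie fails, which errs on the side of not publishing.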

The models, and why each one

Here's the stack we run, by stage. Prices are per million tokens (input / output), verified against the providers' official pricing pages as of June 2026:

Stage	Model	Price (in / out)	Why this one
Research	Perplexity Sonar	$1 / $1 + search fee	Search-grounded; returns sourced, current facts
Plan	Claude Sonnet 4.6	$3 / $15	Fast, structured, cheap for outlining
Write	Claude Opus 4.8	$5 / $25	Our pick for long-form writing and reasoning
Fact-check	GPT-5.5	$5 / $30	Strong reasoner from a different family
Caption	Claude Sonnet 4.6	$3 / $15	Alt text and image captions
Illustrate	gpt-image-2	token-based (~$0.15–0.20 for our 1536×1024 high-quality hero images)	Custom hero per post
Vectorize	text-embedding-3-small	$0.02 / —	Cheap, solid semantic embeddings

A trend worth internalizing: frontier writing models got dramatically cheaper. The previous top Opus tier was priced at $15/$75 per million tokens; the current generation is $5/$25 — and the older rate now applies only to deprecated models. That collapse is what makes a daily, premium-quality pipeline economically practical.

What it actually costs

We measured real token usage across more than 400 production generations. An average article uses about 4,600 input / 4,000 output tokens to write, and 5,700 input / 3,900 output tokens to fact-check (both measured). Plug those into the prices above — the remaining stages are tight estimates from typical token sizes — and a single article costs:

Task	Model	Basis	Cost
Research	Sonar	estimate	~$0.011
Plan	Sonnet 4.6	estimate	~$0.021
Write	Opus 4.8	measured	~$0.123
Fact-check	GPT-5.5	measured	~$0.145
Caption	Sonnet 4.6	estimate	~$0.006
Hero image	gpt-image-2	estimate	~$0.190
Vectorize	embed-3-small	measured	~$0.0001
Total			≈ $0.50 / article

That works out to about $15/month for one post a day, or ~$75/month at five posts a day — for original, fact-checked, illustrated, search-optimized content. The single biggest line item isn't the writing; it's the image ($0.19). The two text passes together are about $0.27.

That's the headline: premium quality is no longer the expensive part of content. The expensive part is the human time you're replacing.

One honest caveat: newer model versions sometimes ship a new tokenizer that uses more tokens for the same text, so always cost your pipeline against measured token usage, not the sticker rate. Build a cost log into the system from day one — log input/output tokens and a computed cost per stage, per article.
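
The per-stage arithmetic and the cost log can be sketched in a few lines. The prices and token counts are the ones quoted in this post; the model keys, `stage_cost`, and `log_entry` are hypothetical names, and a real log would write to your article store rather than return a dict.

```python
from datetime import datetime, timezone

# USD per million tokens (input, output), copied from the pricing table in this post.
PRICES = {
    "claude-opus-4.8": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
}


def stage_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call, from measured token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def log_entry(article_id: int, stage: str, model: str,
              input_tokens: int, output_tokens: int) -> dict:
    """One cost-log record per stage, per article."""
    return {
        "article_id": article_id,
        "stage": stage,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(stage_cost(model, input_tokens, output_tokens), 6),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


# Average measured usage from this post:
write_cost = stage_cost("claude-opus-4.8", 4_600, 4_000)  # about $0.123
check_cost = stage_cost("gpt-5.5", 5_700, 3_900)          # about $0.1455
```

Summing `cost_usd` by stage over a month shows quickly whether a model or tokenizer change has moved your real cost away from the sticker rate.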

Quality and safety: the guardrails that earn their keep

Speed is easy. Not embarrassing yourself is harder. Three guardrails matter most:

Deduplication. Before writing, show the planner recent titles and tell it not to repeat them. After writing, compare the draft against existing posts by vector cosine similarity and skip anything too close — a threshold around 0.85 is a sensible starting point (tune to taste). A daily pipeline will drift into repeating itself without this.
A publish-time safety gate. The most important rule for any automated publisher: a deterministic check runs on the final text before anything goes live, and hard-blocks on anything that shouldn't be public. Not a model — a fixed denylist. A minimal sketch:
```
import re

# Deterministic denylist: any match blocks publishing.
PATTERNS = [
    r"\bsk-(?:proj-)?[A-Za-z0-9_-]{20,}\b",  # OpenAI-style API keys
    r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY",   # private keys, including plain "BEGIN PRIVATE KEY"
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b",          # raw IP addresses
    r"(?i:password)\s*[:=]",                 # inline credentials, any letter case
    # + your own: internal hostnames, private names, client names
]


def gate(markdown: str) -> list[str]:
    """Return every pattern that matches the final text; an empty list means clean."""
    return [p for p in PATTERNS if re.search(p, markdown)]


# if gate(text) is non-empty -> HARD BLOCK, do not publish
```
These patterns are illustrative. The IP-address and password rules will produce false positives on version numbers and ordinary prose, so tune them and lean on precise internal terms (your real hostnames, client names, secret prefixes) rather than broad patterns. The model is creative; the gate is not, and that's the point.
Grounding contracts. Tell the writer the research brief is ground truth and to never invent quotes, numbers, or events. The fact-checker enforces it. "Sounds plausible" is not "is true."
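
The post-write deduplication check from the first guardrail can be sketched in plain Python. It assumes each embedding is a list of floats of equal length; `cosine_similarity` and `is_duplicate` are hypothetical helpers, and at scale you would likely let the vector store run this comparison instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_vec: list[float], existing: list[list[float]],
                 threshold: float = 0.85) -> bool:
    """True when the draft is closer than `threshold` to any existing post."""
    return any(cosine_similarity(new_vec, old) > threshold for old in existing)
```

The 0.85 default is the starting point suggested above, not a tuned value; log the highest similarity per draft for a few weeks before settling on a number.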

SEO and GEO for an AI-written blog

Two audiences read your blog now: people, and the AI systems that answer people's questions. Optimize for both:

Structured data. Emit BlogPosting JSON-LD on every post, and emit FAQPage JSON-LD when your renderer supports a real Q&A section. Treat this as schema hygiene and clean, extractable structure, not a guaranteed rich result or AI citation. A minimal block:

<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting",
 "headline":"...","datePublished":"2026-06-05","author":{"@type":"Organization","name":"..."},
 "image":"https://.../hero.png","description":"..."}
</script>

Embeddings (your GEO backbone). Vectorizing every article lets your own search and RAG systems retrieve content by meaning. Public answer engines still depend on crawlability, indexing, ranking, and source selection.
Clean, citable facts. Dated, specific, sourced claims are easier for answer engines to quote. Vague marketing copy is easier to ignore.
Crawl hygiene. Submit a sitemap, keep it current, and disallow auto-generated routes (image endpoints, thin tag pages) so the crawler spends its budget on real articles.
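
If posts are rendered from rows in a store, the BlogPosting block can be generated rather than hand-written. The sketch below mirrors the fields in the minimal block above; `blogposting_jsonld` and its parameter names are hypothetical, and the escaping step is a precaution added here, not something the schema requires.

```python
import json


def blogposting_jsonld(headline: str, date_published: str, author_name: str,
                       image_url: str, description: str) -> str:
    """Render a BlogPosting JSON-LD script tag for one post."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "datePublished": date_published,  # ISO 8601 date, e.g. "2026-06-05"
        "author": {"@type": "Organization", "name": author_name},
        "image": image_url,
        "description": description,
    }
    # Escape "</" so text such as "</script>" inside a field cannot close the
    # tag early. "\/" is a valid JSON escape, so parsers still read "/".
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

Building the block with `json.dumps` also handles quotes and newlines in headlines, which string templates tend to get wrong.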

Run the automated AI agent blog unattended

The pipeline only pays off if it runs without you. Put the daily build on a scheduler: cron on Linux, a launchd job on macOS, a scheduled task on Windows, a scheduled GitHub Action, or a serverless cron. Stagger the steps (research and write in the early morning, publish through the day) and send yourself a single daily summary instead of a notification per stage. A cron entry is as simple as:

0 4 * * *  /usr/bin/python3 /opt/blog/run.py build   # 4am: research -> write -> check -> illustrate
0 9 * * *  /usr/bin/python3 /opt/blog/run.py publish  # 9am: gate -> publish

If something fails, that's when it should interrupt you. If it succeeds, one quiet line is enough.

Build your own: step by step

Concrete and model-agnostic — swap in whatever providers you prefer. Each step is independently useful, so ship Steps 1–3 first, then layer the rest on.

Step 1 — Model a single article as a state machine. Create a store (a database table is plenty) where each article is a row with a stable ID, a status field, and slots for the brief, draft, image URL, and embedding. Every stage claims rows by ID plus expected status, does its work, writes the result, and advances status. This gives you a clean base for idempotency and crash recovery; side effects still need their own guards.

Step 2 — Ground the writing in research. Before writing, call a search-grounded model and store its output as a structured brief (see the JSON above). Pass that brief to the writer and instruct it to treat the brief as ground truth and never invent quotes, numbers, or events. This removes a large class of unsupported claims when the brief itself is correct and sourced.

Step 3 — Write, then cross-check with a different family. Generate the draft from the brief, then pipe it to a second model from a different lab. Have it return a structured verdict (pass/revise/reject + a score + unsupported-claim list + edited markdown). On reject, set the row back to planned and increment a retry_count; cap it (e.g., 3) so a bad topic can't spin forever.

Step 4 — Add adversarial review for high-stakes posts. Spawn several independent reviewers, each told to try to break the draft, and require a majority pass. Adversarial-by-default catches what a single agreeable reviewer misses.

Step 5 — Illustrate, caption, and embed. Generate a custom hero image per post; generate alt text/caption with a cheap model; create a vector embedding of the final text and store it in the row (a vector column via something like pgvector, or a dedicated vector DB).

Step 6 — Deduplicate. Show the planner recent titles to avoid, and before publishing compare the new embedding against existing posts by cosine similarity — skip anything above ~0.85. This keeps a daily pipeline from eating its own tail.

Step 7 — Add structured data. Emit BlogPosting JSON-LD on every post and FAQPage JSON-LD wherever your renderer includes a real Q&A. It's cheap schema hygiene and makes the page easier for machines to parse.

Step 8 — Gate, then publish. Run the deterministic safety check (the regex denylist above, plus your own internal terms) on the final text. Treat any hit as a hard stop. Only then insert the post and update your sitemap.

Step 9 — Test one post end to end, then schedule. Before automating, run Steps 1–4 on a single topic and inspect the draft and the fact-checker's verdict by hand. When one clean post comes out the far end, enable the image/embed/publish stages, put the build on a daily schedule, send one summary, and watch your cost log.
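
The routing rule from Step 3 (reject sends the row back to planned, with a capped retry count) can be sketched as a small pure function. Two details are assumptions made for this example, since the steps above do not specify them: "revise" is treated as accepting the checker's edited markdown, and a row that hits the cap moves to a hypothetical terminal status named `abandoned`.

```python
MAX_RETRIES = 3  # the example cap from Step 3


def route(verdict: str, retry_count: int,
          max_retries: int = MAX_RETRIES) -> tuple[str, int]:
    """Return (next_status, retry_count) for a fact-checked draft."""
    if verdict in ("pass", "revise"):
        # Assumption: on "revise" the checker's edited_markdown replaces the draft.
        return "checked", retry_count
    if verdict == "reject":
        retry_count += 1
        if retry_count >= max_retries:
            # Hypothetical terminal status so a bad topic cannot loop forever.
            return "abandoned", retry_count
        return "planned", retry_count
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Keeping this logic free of I/O makes it easy to unit-test before it is wired to the article store.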

Frequently asked questions

How much does it cost to run an AI blog pipeline?
At current model prices, a fully illustrated, fact-checked article costs roughly $0.50 — about $0.27 for the writing and fact-checking passes and about $0.19 for a custom hero image. That's roughly $15/month for one post a day, or about $75/month at five posts a day.

Which AI models are best for writing blog posts?
Use a strong long-form model for writing (we use Claude Opus 4.8) and a capable model from a different family for fact-checking (we use GPT-5.5). The cross-family check matters more than any single model choice, because a model from another lab catches the writer's blind spots.

Why use two different AI models instead of one?
Because a model can't reliably check its own work. Every model family has characteristic blind spots and overconfident claims. A reviewer from a different lab surfaces them; the same model nods along. Cross-model fact-checking is the highest-leverage quality decision in the pipeline.

How do you keep AI-generated blog posts accurate?
Ground every post in fresh, sourced research, pass that research to the writer as ground truth, forbid invented quotes and numbers, and have a different model verify each claim against the research before publishing. Accuracy is a process, not a single prompt.

Can an AI blog pipeline run unattended?
Yes, that's the point. Model each article as a state machine, put the build on a scheduler, send one daily summary, and most importantly, run a deterministic safety gate on the final text before anything publishes. The gate is one of the controls that make unattended publishing safe.

How do you optimize AI-written content for AI search (GEO)?
Emit structured data (BlogPosting and, where appropriate, FAQPage JSON-LD), embed every article as a vector so your own systems can retrieve it by meaning, and write clean, dated, citable facts. Public answer engines still choose sources through crawling, indexing, ranking, and their own retrieval systems, but specific, structured, sourced content gives them something usable to cite.

The takeaway

A good content pipeline isn't a single clever prompt — it's an assembly line where specialized models do specialized jobs, one model family checks another's work, and a deterministic gate stands between the machine and the publish button. The model prices that used to make this expensive have collapsed; premium, daily, fact-checked content now costs about fifty cents an article.

The hard part was never the writing. It's the discipline around the writing: grounding, cross-checking, gating, deduplicating, and scheduling. Build those — start with one grounded, cross-checked post and layer on the rest — and you have a system that earns its keep every single day.