Every decision. Every tradeoff. Walking through a production RAG system for a news media corporation — the way a senior engineer would explain it on a whiteboard.
The prompt
"A news media corporation is exploring a Generative AI assistant to help users discover and understand content through natural language interactions. The system should provide reliable, grounded, and audience-appropriate responses aligned with the corporation's editorial standards."
On the surface this sounds like a chatbot problem. It isn't. Let's unpack the three constraints that define every architectural decision we're about to make.
Real-Time Freshness
News is perishable. A 3-day-old article about a market crash may already be superseded. The system must weight recency without hard-expiring evergreen context pieces.
Editorial Standards
This isn't Google. The corporation has a journalistic voice: factual, attributed, impartial. The AI cannot express opinions on political matters — that's a legal and institutional constraint.
Scale + Structure
A major news outlet publishes hundreds of articles per day across dozens of sections: news, opinion, analysis, live blog. The corpus is massive and heterogeneous.
Let's do this properly. The non-functional requirements are where this system gets genuinely interesting.
Functional
Non-Functional
Three plausible approaches. Let's evaluate each honestly before committing to any of them.
| Approach | How it works | The problem | Verdict |
|---|---|---|---|
| Fine-Tuning | Train the LLM on the news corpus to bake knowledge into weights | Knowledge frozen at training time. Expensive to retrain. Still hallucinates. Cannot cite sources. | ❌ Wrong tool |
| Prompt Stuffing | Dump articles directly into a huge context window | Context window limits (~200K tokens). Cannot scale to millions of articles. Expensive per query. Slow. | ❌ Doesn't scale |
| RAG | Retrieve relevant articles from a vector index, then generate grounded responses | Retrieval quality is the ceiling. Two-stage latency. Infrastructure complexity. | ✓ The right call |
RAG is only as good as the data behind it. This is where most teams underinvest.
News content arrives continuously. How you ingest it determines your freshness ceiling.
Batch Rebuild
Nightly full re-index
+ Simple to implement
− Hours of staleness on breaking news
Pure Streaming
Every article triggers immediate index
+ Minutes to live
− Complex infra, partial failure risk
Hybrid ✓
Batch baseline + streaming delta
+ Fresh for breaking news, stable for archive
− Two pipelines to maintain
Chunking is one of the highest-leverage decisions in RAG. Too large: retrieval is imprecise. Too small: you lose semantic context. For news, there's a structural advantage to exploit.
| Strategy | Chunk size | Good for | Bad for |
|---|---|---|---|
| Full article | 1,000–5,000 tokens | Full context, no boundary splits | Retrieval precision collapses on large corpora |
| Fixed-size (512 tokens) | 512 tokens + overlap | Uniform retrieval units | Splits mid-sentence, loses narrative context |
| Semantic — lede + paragraphs ✓ | 150–400 tokens/chunk | Precision + journalistic structure. Lede = summary chunk. | Requires more careful parsing |
| Hierarchical (doc summary + chunks) | Multi-level index | Two-stage retrieval: summary → paragraph | More complex indexing infrastructure |
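The lede-plus-paragraphs strategy in the table can be sketched roughly as follows. This is a minimal illustration, not the production parser: `Chunk`, `chunk_article`, the blank-line paragraph delimiter, and the whitespace word count used as a token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    article_id: str
    position: int   # 0 is always the lede
    is_lede: bool
    text: str


def chunk_article(article_id: str, body: str, max_tokens: int = 400) -> list[Chunk]:
    """Split an article into a lede chunk plus paragraph-aligned chunks."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paragraphs:
        return []

    # The lede (first paragraph) always stands alone as the summary chunk.
    groups: list[list[str]] = [[paragraphs[0]]]
    current: list[str] = []
    current_len = 0
    for para in paragraphs[1:]:
        n = len(para.split())  # crude token estimate: whitespace-separated words
        # Close the running chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        groups.append(current)

    return [
        Chunk(article_id, i, i == 0, "\n\n".join(group))
        for i, group in enumerate(groups)
    ]
```

Paragraphs are never split, so a single paragraph longer than `max_tokens` stays whole; a real tokenizer and a sentence-level fallback would be needed to enforce the 150–400 token range strictly.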
Metadata is the control plane for retrieval. Every field below enables a specific retrieval capability — filter by date, section, content type, or language. Without it, you're flying blind.
| Field | Type | Enables |
|---|---|---|
| source_id | string | Citation linking — every claim traces to an article |
| title | string | BM25 title boosting — title matches score higher |
| published_at | datetime | Temporal decay and date-range filtering |
| section | enum: politics \| business \| health \| local \| ... | Domain-scoped retrieval |
| content_type | enum: news \| opinion \| analysis \| feature \| live-blog | Editorial labeling in responses |
| language | enum: en \| fr | Multilingual routing (v2) |
| is_breaking | bool | Boost breaking news in retrieval ranking |
| keyword_tags | string[] | Named entity index: politician, location, bill names |
| freshness_score | float [0–1] | Computed recency weight for re-ranking |
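As a rough sketch of how these fields might be carried alongside each chunk and used for filtering: the field names below come from the table, while the `ChunkMetadata` class, the `matches` helper, and the default values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChunkMetadata:
    source_id: str
    title: str
    published_at: datetime
    section: str                 # e.g. "politics", "business"
    content_type: str            # e.g. "news", "opinion", "analysis"
    language: str = "en"
    is_breaking: bool = False
    keyword_tags: list[str] = field(default_factory=list)
    freshness_score: float = 0.0


def matches(
    meta: ChunkMetadata,
    *,
    language: Optional[str] = None,
    sections: Optional[set[str]] = None,
    published_after: Optional[datetime] = None,
) -> bool:
    """Return True if the chunk passes every filter that was supplied."""
    if language is not None and meta.language != language:
        return False
    if sections is not None and meta.section not in sections:
        return False
    if published_after is not None and meta.published_at < published_after:
        return False
    return True
```

A filter left as `None` is simply skipped, so the same helper covers language routing, section scoping, and date-range queries. Timestamps are assumed to be timezone-aware throughout.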
Retrieval quality is the ceiling for your entire system. The LLM can only be as good as what you give it.
| Model | Multilingual | Open Source | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | Partial | No | Strong performance, data privacy concern for orgs with strict data residency requirements |
| Cohere Embed v3 | Yes (100+ langs) | No | Strong reranking integration, enterprise pricing |
| BGE-M3 | Yes (100+ langs) | Yes (MIT) | Strong results on multilingual retrieval benchmarks. Self-hostable. No data egress. Right call for an en/fr bilingual mandate. |
Semantic only — what breaks
Finds conceptually similar content. Works for "explain rising rents". Fails badly on "what did Bill C-47 say about housing grants" — the named entity gets lost in the embedding.
Breaks on: names, dates, legislation, tickers, proper nouns
BM25 only — what breaks
Exact term matching. Works for "Bill C-47". Fails for "housing affordability crisis" when the article is titled "Rent Burden Reaches Breaking Point".
Breaks on: paraphrasing, synonyms, conceptual queries
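Because each method fails where the other succeeds, the two ranked lists need to be merged. The text does not say how; reciprocal rank fusion (RRF) is one common option, swapped in here purely as an illustrative assumption. The function name and the constant `k = 60` are choices made for the sketch.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked lists of document IDs into one list ordered by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); top ranks contribute most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, breaking ties by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a", "b", "c"]        # hypothetical BM25 ranking
semantic_hits = ["b", "d", "a"]    # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

RRF uses only ranks, so it avoids having to normalise BM25 scores against cosine similarities. Documents that appear in both lists (here `b` and `a`) rise above those found by only one retriever.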
Standard retrieval is agnostic to article age. For news, surfacing a 2015 housing article in response to a 2022 housing query is actively misleading. We add a recency weight to the final score.
Temporal decay formula
final_score = retrieval_score × (α + (1 − α) × e^(−λ × age_days))
α = 0.3: minimum floor (old articles can still rank if highly relevant)
λ = 0.001: decay rate (the exponential term halves after about 693 days, so the recency multiplier falls from 1.0 to about 0.65)
age_days: days since publication
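The formula translates directly into a few lines of code. This is a minimal sketch using the α and λ values stated above; the function names are illustrative, and clamping negative ages to zero is an added assumption to guard against clock skew.

```python
import math

ALPHA = 0.3     # floor: the multiplier never drops below this
LAMBDA = 0.001  # decay rate per day


def recency_weight(age_days: float, alpha: float = ALPHA, lam: float = LAMBDA) -> float:
    """Multiplier in [alpha, 1.0] that shrinks as an article ages."""
    age_days = max(age_days, 0.0)  # assumption: treat future-dated items as new
    return alpha + (1.0 - alpha) * math.exp(-lam * age_days)


def final_score(retrieval_score: float, age_days: float) -> float:
    """Apply the recency multiplier to a retrieval or re-ranker score."""
    return retrieval_score * recency_weight(age_days)
```

A brand-new article keeps its full score (multiplier 1.0). At 693 days the exponential term has halved, giving a multiplier of about 0.65, and very old articles settle at the 0.3 floor rather than disappearing.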
Coarse retrieval
Top-100 candidates via parallel BM25 + semantic search
Metadata filtering
Language, section, date range — narrows the candidate pool
Cross-encoder re-ranking
Score true query-chunk relevance. The highest-impact quality lever in the pipeline.
Temporal decay re-scoring
Apply the recency weight formula from above
Top-K selection
Pass top 4–6 chunks to the LLM. More = diminishing returns + higher cost.
This is the chapter that separates a generic chatbot from a journalism-grade AI system.
Rule 1: Grounding
Only make claims directly supported by the provided articles. Never use training knowledge to fill gaps.
Why: Prevents hallucination of facts from pre-training that may be outdated, wrong, or unverifiable.
Rule 2: Impartiality
Never express editorial opinions on political, economic, or social policy matters.
Why: Institutional requirement. News organisations are legally and professionally obligated to separate reporting from opinion.
Rule 3: Balanced attribution
When retrieved articles present conflicting expert views, present both perspectives with attribution. Do not adjudicate.
Why: Journalistic fairness. The AI must not take sides in contested expert debates.
Rule 4: Content type labeling
Explicitly distinguish opinion articles from news reporting in your response.
Why: Summarising an op-ed as if it were factual reporting is an editorial failure. The content_type metadata field makes this automatable.
Rule 5: Graceful degradation
If no relevant articles are retrieved, state this clearly. Do not speculate or synthesise from training data.
Why: Silence is better than confident wrongness in journalism.
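One way the five rules might be wired into generation is sketched below. The rule text is taken from the list above; `RetrievedChunk`, `build_prompt`, and the exact prompt layout are hypothetical, and the sketch returns `None` when nothing was retrieved so the caller can skip the LLM call entirely (Rule 5).

```python
from dataclasses import dataclass
from typing import Optional

EDITORIAL_RULES = [
    "Only make claims directly supported by the provided articles. "
    "Never use training knowledge to fill gaps.",
    "Never express editorial opinions on political, economic, or social policy matters.",
    "When retrieved articles present conflicting expert views, present both "
    "perspectives with attribution. Do not adjudicate.",
    "Explicitly distinguish opinion articles from news reporting in your response.",
    "If no relevant articles are retrieved, state this clearly. "
    "Do not speculate or synthesise from training data.",
]


@dataclass
class RetrievedChunk:
    title: str
    content_type: str  # e.g. "news", "opinion", "analysis"
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> Optional[str]:
    """Assemble the grounded prompt, or return None if there is nothing to ground on."""
    if not chunks:
        return None  # caller should reply "no relevant coverage" without calling the LLM
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(EDITORIAL_RULES, 1))
    articles = "\n\n".join(
        f"[Article {i}] ({c.content_type}) {c.title}\n{c.text}"
        for i, c in enumerate(chunks, 1)
    )
    return (
        "Follow these rules:\n" + rules
        + "\n\nArticles:\n" + articles
        + "\n\nQuestion: " + question
        + "\nCite every claim as [Article N]."
    )
```

Putting the content type next to each article number gives the model what it needs for Rule 4, and numbering the articles in the prompt makes the later citation check a simple lookup.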
News corpora frequently contain articles where experts disagree. Rent control is a perfect example: one article says it helps affordability, another says it reduces housing supply. The AI must handle this without taking a side.
Bad response
"Rent control is effective at improving housing affordability for existing tenants."
↳ Takes a position. Editorial violation.
Correct response
"Experts are divided. Proponents argue rent control improves affordability [Article 4]. Critics contend it reduces housing supply over time [Article 10]."
↳ Both sides, attributed, no position taken.
Citation ID verification
Every [Article N] citation in the response must reference a doc that was actually retrieved. Automated check — flag and regenerate if invalid.
NLI entailment check
Run each factual sentence through a Natural Language Inference model. Verify it is entailed by at least one retrieved chunk. Flag low-confidence sentences.
LLM-as-judge
Use a second LLM call to score faithfulness. Costlier but catches subtle fabrications that NLI misses. Run on sampled traffic, not every request.
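The first of these checks is cheap enough to run on every response. A minimal sketch, assuming citations appear literally as `[Article N]` and that the caller knows which article numbers were placed in the prompt (the function name is illustrative):

```python
import re

CITATION_PATTERN = re.compile(r"\[Article (\d+)\]")


def invalid_citations(response: str, retrieved_ids: set[int]) -> set[int]:
    """Return cited article numbers that were not in the retrieved context."""
    cited = {int(n) for n in CITATION_PATTERN.findall(response)}
    return cited - retrieved_ids


response = (
    "Experts are divided. Proponents argue rent control improves affordability "
    "[Article 4]. Critics contend it reduces housing supply over time [Article 10]."
)
bad = invalid_citations(response, retrieved_ids={4})
```

A non-empty result is the signal to flag and regenerate. This only verifies that the cited IDs exist in context; whether the cited article actually supports the sentence is the job of the NLI and LLM-as-judge checks.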
'It feels good in the demo' is not a launch criterion. Here's how to measure this properly.
We run evaluation across three independent layers, each of which catches different failure modes. Offline evaluation uses a golden dataset: 100–200 expert-labelled (query, ideal response, expected source articles) triples that editorial staff have approved. It runs before any user sees the system. Online evaluation captures what real users do: do they click cited articles, do they ask follow-ups, do they bail to site search? These are behavioural proxies for trust. Editorial audit is a manual human review, run weekly, where editors read a random sample of responses and flag anything that violates journalistic standards.
The three layers are complementary, not redundant. Offline metrics can be gamed by over-fitting to the golden set. Online signals are noisy and lag behind regressions. Editorial audit catches subtle failures both miss — a response that is technically grounded but tonally inappropriate for a news brand.
Offline
Golden dataset — run before launch
Of all articles that should appear in the top-5 results, what fraction actually do? Measures how often we miss relevant content. A recall gap means users get incomplete answers.
Of the 5 retrieved chunks, how many are actually relevant to the query? Measures noise. Low precision pollutes the LLM context and can cause off-topic or confused responses.
Does every factual claim in the generated response trace directly to a retrieved chunk? Catches hallucination. Measured via NLI entailment or LLM-as-judge on the golden set.
Editors manually review the golden-set responses: does any violate the five system-prompt rules? A single impartiality violation blocks launch, so tolerance for this metric is effectively zero.
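The two retrieval metrics above are straightforward to compute once the golden dataset lists the expected sources per query. A small sketch with hypothetical function names and IDs; precision here divides by k, the standard definition, which penalises a retriever that returns fewer than k results.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the expected items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k slots filled by relevant items."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


retrieved = ["a1", "a7", "a3", "a8", "a2", "a4"]   # ranked retriever output
relevant = {"a1", "a2", "a4"}                     # expected sources from the golden set
recall = recall_at_k(retrieved, relevant)         # a4 sits at rank 6, so it is missed
precision = precision_at_k(retrieved, relevant)
```

Averaging both values over every query in the golden set gives the headline numbers to track between releases.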
Online
Production signals — tracked continuously
% of sessions where the user clicks a cited article. A strong proxy for trust — if users verify what the AI said, they believe it's worth checking.
Of sessions where the user gives explicit feedback (thumbs up/down), what % are positive? Requires enough rating volume to be statistically meaningful — prompt users selectively.
Does the user ask a follow-up question in the same session? Yes = the AI helped and they want more. No = they got what they needed or gave up. Context-dependent; watch the trend.
% of sessions where the user abandons the AI and navigates to site search instead. High escalation signals lost trust: the AI failed to satisfy and the user sought an alternative.
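These four signals can be aggregated from per-session records. The sketch below assumes a hypothetical `Session` record with one boolean per behaviour; the field and metric names are illustrative and not taken from any analytics product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    clicked_citation: bool
    feedback: Optional[str]      # "up", "down", or None if the user gave no rating
    asked_follow_up: bool
    escalated_to_search: bool


def online_metrics(sessions: list[Session]) -> dict[str, Optional[float]]:
    """Aggregate session-level behaviour into the four online signals."""
    total = len(sessions)
    if total == 0:
        return {}
    rated = [s for s in sessions if s.feedback in ("up", "down")]
    positive = sum(1 for s in rated if s.feedback == "up")
    return {
        "citation_click_rate": sum(s.clicked_citation for s in sessions) / total,
        # Only sessions with explicit feedback count toward this rate.
        "positive_feedback_rate": positive / len(rated) if rated else None,
        "follow_up_rate": sum(s.asked_follow_up for s in sessions) / total,
        "search_escalation_rate": sum(s.escalated_to_search for s in sessions) / total,
    }
```

The feedback rate is computed over rated sessions only and is `None` when nobody rated, which keeps a low-volume day from being reported as 0% positive.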
Editorial audit
Human review — weekly sample
Any response that takes a position on a political, economic, or social policy matter. Zero tolerance — one confirmed violation is a rollback trigger, not a P2 ticket.
Response references [Article N] that was not in the retrieved context. Indicates the model is citing from training memory, not the RAG pipeline. Also zero tolerance.
Opinion articles surfaced or quoted without the 'In an Opinion piece…' labeling. Treating an op-ed as straight news reporting is an editorial failure.
Spot-check on predefined sensitive categories: election coverage, active court cases, Indigenous affairs, breaking investigations. Editors assess against written editorial guidelines.
| Phase | Who | Go criteria |
|---|---|---|
| Internal Alpha | Engineers + Editors (20 people) | |
| Editorial Sign-off | Senior editorial team | |
| Beta | Opt-in users (1,000) | |
| General Availability | Full audience | |
The RAG pipeline above is reliable, auditable, and safe for V1. Agents introduce real value in specific workflows — and real risk that must be designed around explicitly.
The short answer is yes — but incrementally, and only where a single retrieval pass is structurally insufficient. The system described in this playbook is deliberately stateless and reactive: a user asks, the pipeline retrieves, the model answers. That is the right architecture for V1. It is auditable, debuggable, and safe to hand to an editorial team that has zero tolerance for AI errors reaching publication.
Agents earn their added complexity when the task has structure that single-pass RAG cannot express: multi-step reasoning across independent topic domains, proactive monitoring of evolving stories, or workflows where the system must decide what to retrieve before it knows how to answer. Below is where that line sits for a news media system specifically.
Four workflows where an agentic layer is justified over a simple RAG call.
A user asks: "How have interest rates affected housing affordability over the last two years?" A single retrieval cannot answer this — it requires retrieving from Business (rates), Politics (housing policy), and Local (affordability impact) independently, then synthesising. The agent decomposes the query, runs sequential retrievals, and fuses the results before generating.
Breaking news stories evolve over hours. An agent can monitor a user's query intent against incoming article ingestion — when a new article is semantically similar to a prior high-engagement query, it proactively surfaces the update. This is a background agentic loop with a clear trigger condition, not open-ended autonomy.
When retrieval confidence is low, the current system degrades gracefully. An agent can go further: identify that the gap exists, determine whether it is a coverage gap (no articles exist) or a recency gap (articles exist but are stale), and route accordingly — either to a human editor flag or to a "we don't have recent coverage on this" response rather than hallucinating.
Internal-facing use case: an editor asks the system to find all coverage of a specific politician across the last 6 months, flag contradictory reporting, and surface any stories that reference each other but are not cross-linked. This is a multi-step agentic workflow that has high editorial value and can tolerate higher latency since it is not user-facing.
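The coverage-gap versus recency-gap distinction described above reduces to a small classification step. This sketch is illustrative only: the `Gap` enum, the `classify_gap` function, and both thresholds (`min_score`, `max_age_days`) are hypothetical values that would need tuning against real traffic.

```python
from enum import Enum


class Gap(Enum):
    NONE = "none"            # relevant and recent coverage exists
    COVERAGE = "coverage"    # nothing relevant in the corpus
    RECENCY = "recency"      # relevant articles exist but all are stale


def classify_gap(
    hits: list[tuple[float, float]],   # (relevance_score, age_days) per retrieved chunk
    min_score: float = 0.5,            # assumed relevance threshold
    max_age_days: float = 90.0,        # assumed staleness threshold
) -> Gap:
    """Decide why retrieval came back weak, so the agent can route accordingly."""
    relevant_ages = [age for score, age in hits if score >= min_score]
    if not relevant_ages:
        return Gap.COVERAGE
    if min(relevant_ages) > max_age_days:
        return Gap.RECENCY
    return Gap.NONE
```

A `COVERAGE` result would route to the editor flag, and a `RECENCY` result to the "we don't have recent coverage on this" response.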
Scope boundaries are not optional in a news context. These are hard constraints, not guidelines.
No external retrieval
The agent may only call tools that read from the internal corpus. No web search, no external APIs. Every cited fact must trace to an article in the audited index.
No editorial judgements
The agent cannot decide that one source is more credible than another, or that a story is worth covering. Those decisions belong to editors.
No unprompted publication
The agent cannot trigger any write action — no pushing alerts, no posting to CMS, no sending notifications — without an explicit human approval gate.
No personalisation decisions
User history and interest profiling raise privacy obligations. The agent must not use behavioural data to alter what content it surfaces without explicit consent and a defined retention policy.
Each agentic step compounds error. A faithfulness check on the final output is not sufficient — validation must happen at each retrieval-and-reason hop.
| Risk | How It Manifests | Control |
|---|---|---|
| Hallucination compounding | A bad retrieval in step 1 propagates into step 2 reasoning and step 3 synthesis. The final output can be confidently wrong and internally consistent. | Faithfulness check at every hop, not just the final output. If any step scores below threshold, abort and return a degraded response. |
| Runaway loops | The agent retries retrieval when results are unsatisfactory, enters a loop, and burns tokens or hits rate limits. | Hard cap: maximum 4 tool calls per request. Timeout at 8 seconds. Any loop pattern triggers circuit breaker and fallback to single-pass RAG. |
| Editorial standard erosion | Multi-source synthesis across Politics and Business sections without per-source impartiality checking could produce responses that violate the corporation's editorial policy. | Apply the 5 editorial guardrails to every intermediate synthesis, not just the final generation. High-stakes topics (political, judicial) require editorial review before delivery. |
| Audit trail loss | "Why did it say that?" becomes unanswerable across 4 retrieval steps and 3 LLM calls. | Every agentic step writes to the audit log: tool called, query issued, documents retrieved, reasoning produced. The complete trace is stored alongside the final response. |
| Scope creep | The agent attempts external retrieval when the internal corpus is insufficient, pulling from unvetted sources. | Strict tool sandboxing. The agent's tool registry contains only signed, internal retrieval functions. No HTTP client, no web search tool, no access to external endpoints. |
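The runaway-loop control in the table (at most 4 tool calls, an 8-second budget, fallback to single-pass RAG) can be sketched as a bounded driver loop. `run_bounded`, `step`, and `fallback` are hypothetical names; `step` stands in for one retrieve-and-reason hop and returns `None` when it has no satisfactory answer yet.

```python
import time
from typing import Callable, Optional

MAX_TOOL_CALLS = 4      # hard cap from the risk table
TIMEOUT_SECONDS = 8.0   # per-request budget from the risk table


def run_bounded(
    step: Callable[[int], Optional[str]],
    fallback: Callable[[], str],
    max_calls: int = MAX_TOOL_CALLS,
    timeout: float = TIMEOUT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run agent hops until one answers, the call cap is hit, or time runs out."""
    start = clock()
    for call_index in range(max_calls):
        # The budget is checked between hops; it does not interrupt a running hop.
        if clock() - start > timeout:
            break
        answer = step(call_index)
        if answer is not None:
            return answer
    # Circuit breaker: degrade to the single-pass RAG path.
    return fallback()
```

The injectable `clock` exists so the timeout path can be tested without sleeping. Interrupting a hop that is already running would need a per-call timeout inside `step` itself, which this sketch leaves out.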
[Figure: the complete data flow]
Real pipeline: BM25 + temporal decay retrieval → GPT-4o-mini with the 5 editorial guardrails.
Sample corpus — 10 articles
| # | Title | Date | Section | Type |
|---|---|---|---|---|
| 1 | Government Introduces Housing Affordability Plan | Mar 15, 2022 | politics | news |
| 2 | Interest Rate Hike Threatens Mortgage Renewals | Jul 13, 2022 | economics | news |
| 3 | Rental Prices in Major Cities Hit 30-Year High | May 4, 2022 | real estate | news |
| 4 | Cities That Tried Rent Control: Mixed Results | Sep 22, 2021 | real estate | analysis |
| 5 | Construction Boom Stalls Due to Material Costs | Nov 8, 2021 | business | news |
| 6 | Community Groups Push Back Against Urban Densification | Aug 30, 2020 | local | news |
| 7 | Remote Work Shifts Housing Demand to Suburbs | Jan 19, 2022 | real estate | feature |
| 8 | Study Finds Exercise Reduces Chronic Pain (out-of-scope) | Jun 7, 2021 | health | news |
| 9 | Housing Policy Experts Weigh In on Affordability (7yr-old, stale) | Jun 12, 2015 | politics | opinion |
| 10 | The Rent Control Debate: Economists Divided | Apr 28, 2022 | real estate | opinion |
Articles #8 (off-topic) and #9 (7 years old) test editorial filtering and temporal decay respectively.
Shashank Padala
Founder, Kirak Labs · Ex-Amazon · AI Product Leader
AI Product & Transformation Leader with 8+ years building production LLM systems. Previously at Amazon, led GenAI integration into content creation workflows for Corp Communications serving ~3M employees globally.