System Design · RAG · Vector Search · Editorial AI · LLM

Designing a GenAI News Discovery System

Every decision. Every tradeoff. Walking through a production RAG system for a news media corporation — the way a senior engineer would explain it on a whiteboard.

Shashank Padala · Founder, Kirak Labs · Ex-Amazon · 25 min read · May 22, 2026
Target: CS grads · ML engineers · Technical PMs
Chapter 00

The Case Study

The prompt

"A news media corporation is exploring a Generative AI assistant to help users discover and understand content through natural language interactions. The system should provide reliable, grounded, and audience-appropriate responses aligned with the corporation's editorial standards."

On the surface this sounds like a chatbot problem. It isn't. Let's unpack the three constraints that define every architectural decision we're about to make.

Real-Time Freshness

News is perishable. A 3-day-old article about a market crash may already be superseded. The system must weight recency without hard-expiring evergreen context pieces.

Editorial Standards

This isn't Google. The corporation has a journalistic voice: factual, attributed, impartial. The AI cannot express opinions on political matters — that's a legal and institutional constraint.

Scale + Structure

A major news outlet publishes hundreds of articles per day across dozens of sections: news, opinion, analysis, live blog. The corpus is massive and heterogeneous.

Key reframe before we start
This is not a "build a chatbot" problem. It's a "build an editorial compliance system with a conversational interface" problem. That framing changes your architecture, your evaluation framework, and your definition of failure.
Chapter 01

Requirements

Let's do this properly. The non-functional requirements are where this system gets genuinely interesting.

Functional

  • Accept natural language queries about news content
  • Return responses grounded in published articles only
  • Cite the specific article behind every factual claim
  • Handle discovery, explanatory, and factual lookup queries
  • Gracefully decline out-of-scope queries
  • Surface related articles the user might want to read next

Non-Functional

  • Freshness: Recent articles weighted higher in retrieval
  • Impartiality: Zero editorial opinions — institutional constraint
  • Attribution: Every claim traces to a specific article + date
  • Latency: < 2s P95 end-to-end (streaming helps)
  • Multilingual: English and French (English in v1, French in v2)
  • Auditability: Every response is traceable from query → retrieved docs → output
The core tension
Helpfulness and editorial compliance pull in opposite directions. An LLM optimised to be helpful will fill gaps, speculate, and offer opinions. A news organisation cannot afford any of that. Your architecture must make the AI less helpful in the naive sense — and more trustworthy in the professional sense.
Chapter 02

First Big Decision — What Architecture?

Three plausible approaches. Let's evaluate each honestly before committing to any of them.

Approach | How it works | The problem | Verdict
Fine-Tuning | Train the LLM on the news corpus to bake knowledge into weights | Knowledge frozen at training time. Expensive to retrain. Still hallucinates. Cannot cite sources. | ❌ Wrong tool
Prompt Stuffing | Dump articles directly into a huge context window | Context window limits (~200K tokens). Cannot scale to millions of articles. Expensive per query. Slow. | ❌ Doesn't scale
RAG | Retrieve relevant articles from a vector index, then generate grounded responses | Retrieval quality is the ceiling. Two-stage latency. Infrastructure complexity. | ✓ The right call
Decision: RAG — five reasons specific to news
  1. News requires real-time freshness, which fine-tuning can't deliver
  2. Journalistic attribution requires citable sources, and RAG grounds responses in specific documents
  3. The corpus grows daily: retrieval scales horizontally, prompt stuffing doesn't
  4. Editorial compliance requires knowing exactly which document drove each claim
  5. Hybrid search (next chapter) lets us match named entities precisely, which is critical for news
Chapter 03

Data Foundation

RAG is only as good as the data behind it. This is where most teams underinvest.

3.1 — Ingestion Strategy

News content arrives continuously. How you ingest it determines your freshness ceiling.

Batch Rebuild: Nightly full re-index
  + Simple to implement
  − Hours of staleness on breaking news

Pure Streaming: Every article triggers immediate index
  + Minutes to live
  − Complex infra, partial failure risk

Hybrid ✓: Batch baseline + streaming delta
  + Fresh for breaking news, stable for archive
  − Two pipelines to maintain

3.2 — Chunking Strategy

Chunking is one of the highest-leverage decisions in RAG. Too large: retrieval is imprecise. Too small: you lose semantic context. For news, there's a structural advantage to exploit.

Strategy | Chunk size | Good for | Bad for
Full article | 1,000–5,000 tokens | Full context, no boundary splits | Retrieval precision collapses on large corpora
Fixed-size (512 tokens) | 512 tokens + overlap | Uniform retrieval units | Splits mid-sentence, loses narrative context
Semantic (lede + paragraphs) ✓ | 150–400 tokens/chunk | Precision + journalistic structure. Lede = summary chunk. | Requires more careful parsing
Hierarchical (doc summary + chunks) | Multi-level index | Two-stage retrieval: summary → paragraph | More complex indexing infrastructure
Decision: Semantic chunking — lede + paragraph chunks
In journalism, the lede (opening paragraph) is the entire story compressed. We index it as a standalone summary chunk. Body paragraphs get indexed individually at 200–350 tokens with 15% overlap. This gives retrieval precision while preserving journalistic structure.
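The lede-plus-paragraph scheme can be sketched in a few lines. This is a minimal illustration, not the production parser: it assumes paragraphs are separated by blank lines, approximates the 15% overlap by carrying the tail of the previous paragraph's words forward, and does not enforce the 200–350 token budget (a real pipeline would count tokens with the embedding model's tokenizer). The names `Chunk` and `chunk_article` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str
    kind: str  # "lede" for the opening paragraph, "body" for the rest
    text: str


def chunk_article(source_id: str, body: str, overlap: float = 0.15) -> list[Chunk]:
    """Split an article into one lede chunk plus one chunk per body paragraph.

    Assumption: paragraphs are separated by blank lines. Overlap is
    approximated by prefixing each body chunk with the trailing `overlap`
    fraction of the previous paragraph's words.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paragraphs:
        return []

    # The lede is indexed on its own as the summary chunk.
    chunks = [Chunk(source_id, "lede", paragraphs[0])]
    previous_words = paragraphs[0].split()

    for paragraph in paragraphs[1:]:
        carry = int(len(previous_words) * overlap)
        # Guard against carry == 0, because words[-0:] would copy everything.
        prefix = " ".join(previous_words[-carry:]) if carry else ""
        chunks.append(Chunk(source_id, "body", f"{prefix} {paragraph}".strip()))
        previous_words = paragraph.split()

    return chunks
```

Keeping `kind` on each chunk lets retrieval treat the lede differently from body paragraphs later, for example by boosting it for summary-style queries.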

3.3 — Metadata Schema

Metadata is the control plane for retrieval. Every field below enables a specific retrieval capability — filter by date, section, content type, or language. Without it, you're flying blind.

Field | Type | Enables
source_id | string | Citation linking: every claim traces to an article
title | string | BM25 title boosting: title matches score higher
published_at | datetime | Temporal decay and date-range filtering
section | enum: politics, business, health, local, ... | Domain-scoped retrieval
content_type | enum: news, opinion, analysis, feature, live-blog | Editorial labeling in responses
language | enum: en, fr | Multilingual routing (v2)
is_breaking | bool | Boost breaking news in retrieval ranking
keyword_tags | string[] | Named entity index: politician, location, bill names
freshness_score | float [0–1] | Computed recency weight for re-ranking
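One way to make the schema concrete is a small record type plus a filter predicate. The sketch below mirrors the field names from the table; the class name `ChunkMetadata`, the `matches` helper, and the default values are assumptions for illustration. The `section` enum is left unvalidated because the table leaves its value list open.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

CONTENT_TYPES = {"news", "opinion", "analysis", "feature", "live-blog"}
LANGUAGES = {"en", "fr"}


@dataclass
class ChunkMetadata:
    source_id: str
    title: str
    published_at: datetime
    section: str
    content_type: str
    language: str = "en"
    is_breaking: bool = False
    keyword_tags: list[str] = field(default_factory=list)
    freshness_score: float = 0.0

    def __post_init__(self) -> None:
        # Reject values outside the enums listed in the schema.
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content_type: {self.content_type}")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")


def matches(
    meta: ChunkMetadata,
    section: Optional[str] = None,
    language: Optional[str] = None,
    published_after: Optional[datetime] = None,
) -> bool:
    """Return True if the chunk passes every filter that was supplied."""
    if section is not None and meta.section != section:
        return False
    if language is not None and meta.language != language:
        return False
    if published_after is not None and meta.published_at < published_after:
        return False
    return True
```

Validating enums at ingestion time means a mislabeled `content_type` fails loudly in the pipeline instead of silently producing an unlabeled opinion piece in a response.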
Chapter 04

Retrieval Architecture

Retrieval quality is the ceiling for your entire system. The LLM can only be as good as what you give it.

4.1 — Embedding Model

Model | Multilingual | Open Source | Notes
OpenAI text-embedding-3-large | Partial | No | Strong performance, data privacy concern for orgs with strict data residency requirements
Cohere Embed v3 | Yes (100+ langs) | No | Strong reranking integration, enterprise pricing
BGE-M3 | Yes (100+ langs) | Yes (MIT license) | Strong results on multilingual retrieval benchmarks. Self-hostable. No data egress. Right call for an en/fr bilingual mandate.

4.2 — Why Hybrid Search?

Semantic only — what breaks

Finds conceptually similar content. Works for "explain rising rents". Fails badly on "what did Bill C-47 say about housing grants" — the named entity gets lost in the embedding.

Breaks on: names, dates, legislation, tickers, proper nouns

BM25 only — what breaks

Exact term matching. Works for "Bill C-47". Fails for "housing affordability crisis" when the article is titled "Rent Burden Reaches Breaking Point".

Breaks on: paraphrasing, synonyms, conceptual queries

Decision: Hybrid search — BM25 + semantic, then re-rank
Run both in parallel. BM25 catches exact entity matches. Semantic catches conceptual similarity. Merge via Reciprocal Rank Fusion (RRF). Then apply a cross-encoder re-ranker to score true query-chunk relevance on the top-50 candidates. This is the standard production approach.
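Reciprocal Rank Fusion is simple enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk IDs, best first. The constant `k = 60` is the value commonly used with RRF, not something this design specifies, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of IDs into one list of (id, score) pairs.

    Each list contributes 1 / (k + rank) for every ID it contains, with
    ranks starting at 1. IDs found by several retrievers accumulate score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a", "b", "c"]       # exact-term matches, best first
semantic_hits = ["b", "d", "a"]   # embedding matches, best first
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be normalised onto a common scale before merging.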

4.3 — Temporal Decay

Standard retrieval is agnostic to article age. For news, surfacing a 2015 housing article in response to a 2022 housing query is actively misleading. We add a recency weight to the final score.

Temporal decay formula

$$\text{final\_score} = \text{retrieval\_score} \times \left(\alpha + (1 - \alpha)\, e^{-\lambda \cdot \text{age\_days}}\right)$$

α = 0.3 — minimum floor (old articles can still rank if highly relevant)

λ = 0.001: decay rate (the exponential recency term halves after about 693 days, since ln 2 / 0.001 ≈ 693)

age_days — days since publication

Tradeoff: tuning λ
Set λ too high and evergreen background articles (housing policy history) get buried. Too low and recency barely matters. Tune this against your editorial team's judgment of what "recent enough" means for different query types — it's not a pure ML decision.
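To get a feel for these parameters, the decay can be computed directly. This is a small sketch using the α and λ values given above; the function name is illustrative. With these defaults the multiplier is 1.0 for an article published today, about 0.65 at 693 days, and roughly 0.35 at seven years (2,555 days), approaching the 0.3 floor.

```python
import math


def temporal_decay_score(
    retrieval_score: float,
    age_days: float,
    alpha: float = 0.3,
    lam: float = 0.001,
) -> float:
    """Scale a retrieval score by a recency weight with a floor of `alpha`.

    The weight is alpha + (1 - alpha) * exp(-lam * age_days), so it starts
    at 1.0 for a brand-new article and never drops below alpha.
    """
    recency_weight = alpha + (1 - alpha) * math.exp(-lam * age_days)
    return retrieval_score * recency_weight


for age in (0, 30, 693, 2555):
    print(age, round(temporal_decay_score(1.0, age), 3))
```

Because the floor is multiplicative, a highly relevant old article can still outrank a weakly relevant new one, which is the behaviour the evergreen-content requirement asks for.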

4.4 — The Five-Stage Retrieval Pipeline

  1. Coarse retrieval: top-100 candidates via parallel BM25 + semantic search
  2. Metadata filtering: language, section, date range; narrows the candidate pool
  3. Cross-encoder re-ranking: scores true query-chunk relevance. The highest-impact quality lever in the pipeline.
  4. Temporal decay re-scoring: applies the recency weight formula from above
  5. Top-K selection: pass the top 4–6 chunks to the LLM. More means diminishing returns and higher cost.
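Stages 2 through 5 compose into one short function once coarse retrieval has produced a candidate list. The sketch below is illustrative only: the candidate dictionary shape (`id`, `text`, `language`, `age_days`) is an assumption, only the language filter is shown, and `rerank` stands in for a cross-encoder that a real system would supply.

```python
import math
from typing import Callable


def select_context(
    query: str,
    candidates: list[dict],
    rerank: Callable[[str, str], float],
    language: str = "en",
    top_k: int = 5,
    alpha: float = 0.3,
    lam: float = 0.001,
) -> list[dict]:
    """Filter, re-rank, apply temporal decay, and keep the top-K chunks."""
    # Stage 2: metadata filtering (language only in this sketch).
    pool = [c for c in candidates if c["language"] == language]

    scored = []
    for candidate in pool:
        # Stage 3: relevance from the supplied re-ranker.
        relevance = rerank(query, candidate["text"])
        # Stage 4: recency weight with a floor of alpha.
        recency = alpha + (1 - alpha) * math.exp(-lam * candidate["age_days"])
        scored.append((relevance * recency, candidate))

    # Stage 5: keep the K highest-scoring chunks.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_k]]
```

Passing the re-ranker in as a callable keeps the ordering logic testable with a trivial scoring function before a real model is wired in.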

Chapter 05

The Editorial Standards Layer

This is the chapter that separates a generic chatbot from a journalism-grade AI system.

Why this chapter matters
Any competent engineer can build a RAG pipeline. What makes a news AI system different is the editorial compliance layer. This is where institutional trust is either protected or destroyed. Get it wrong and you have an AI expressing political opinions on behalf of a news organisation — a serious incident.

5.1 — The Five System Prompt Rules

Rule 1: Grounding

Only make claims directly supported by the provided articles. Never use training knowledge to fill gaps.

Why: Prevents hallucination of facts from pre-training that may be outdated, wrong, or unverifiable.

Rule 2: Impartiality

Never express editorial opinions on political, economic, or social policy matters.

Why: Institutional requirement. News organisations are legally and professionally obligated to separate reporting from opinion.

Rule 3: Balanced attribution

When retrieved articles present conflicting expert views, present both perspectives with attribution. Do not adjudicate.

Why: Journalistic fairness. The AI must not take sides in contested expert debates.

Rule 4: Content type labeling

Explicitly distinguish opinion articles from news reporting in your response.

Why: Summarising an op-ed as if it were factual reporting is an editorial failure. The content_type metadata field makes this automatable.

Rule 5: Graceful degradation

If no relevant articles are retrieved, state this clearly. Do not speculate or synthesise from training data.

Why: Silence is better than confident wrongness in journalism.
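As a rough illustration of how the five rules reach the model, the sketch below assembles a prompt from the rule text above and a list of retrieved chunks. The prompt layout, the `build_prompt` name, and the chunk keys (`article_no`, `content_type`, `published_at`, `text`) are assumptions; the source does not specify the actual prompt format.

```python
EDITORIAL_RULES = [
    "Only make claims directly supported by the provided articles. "
    "Never use training knowledge to fill gaps.",
    "Never express editorial opinions on political, economic, or social policy matters.",
    "When retrieved articles present conflicting expert views, present both "
    "perspectives with attribution. Do not adjudicate.",
    "Explicitly distinguish opinion articles from news reporting in your response.",
    "If no relevant articles are retrieved, state this clearly. "
    "Do not speculate or synthesise from training data.",
]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble rules, labeled article chunks, and the user question."""
    rules = "\n".join(
        f"{number}. {rule}" for number, rule in enumerate(EDITORIAL_RULES, start=1)
    )
    # Each chunk carries its citation ID, content type, and date so the
    # model can attribute claims and label opinion pieces.
    articles = "\n".join(
        f"[Article {c['article_no']}] ({c['content_type']}, {c['published_at']}) {c['text']}"
        for c in chunks
    )
    if not articles:
        articles = "(no relevant articles retrieved)"
    return f"RULES\n{rules}\n\nARTICLES\n{articles}\n\nQUESTION\n{question}"
```

Putting `content_type` next to every chunk is what makes Rule 4 enforceable: the model sees "opinion" beside the text it is summarising.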

5.2 — The Conflicting Sources Problem

News corpora frequently contain articles where experts disagree. Rent control is a perfect example: one article says it helps affordability, another says it reduces housing supply. The AI must handle this without taking a side.

Bad response

"Rent control is effective at improving housing affordability for existing tenants."

↳ Takes a position. Editorial violation.

Correct response

"Experts are divided. Proponents argue rent control improves affordability [Article 4]. Critics contend it reduces housing supply over time [Article 10]."

↳ Both sides, attributed, no position taken.

5.3 — Post-Generation Faithfulness Checks

Citation ID verification

Every [Article N] citation in the response must reference a doc that was actually retrieved. Automated check — flag and regenerate if invalid.

NLI entailment check

Run each factual sentence through a Natural Language Inference model. Verify it is entailed by at least one retrieved chunk. Flag low-confidence sentences.

LLM-as-judge

Use a second LLM call to score faithfulness. Costlier but catches subtle fabrications that NLI misses. Run on sampled traffic, not every request.
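Of the three checks, citation ID verification is the cheapest to automate. A minimal sketch, assuming citations appear in the literal `[Article N]` form used in the examples above and that the retrieved set is available as integer IDs (the function name is hypothetical):

```python
import re

CITATION_PATTERN = re.compile(r"\[Article (\d+)\]")


def invalid_citations(response: str, retrieved_ids: set[int]) -> set[int]:
    """Return cited article numbers that were not in the retrieved set."""
    cited = {int(number) for number in CITATION_PATTERN.findall(response)}
    return cited - retrieved_ids


response = (
    "Experts are divided. Proponents argue rent control improves "
    "affordability [Article 4]. Critics contend it reduces housing "
    "supply over time [Article 10]."
)
print(invalid_citations(response, {4, 10}))  # set()
print(invalid_citations(response, {4}))      # {10}
```

A non-empty result is the signal to flag the response and regenerate. An empty result only shows the IDs are valid, not that the cited articles support the claims, which is why the NLI and LLM-as-judge checks still matter.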

Chapter 06

Evaluation & Readiness

'It feels good in the demo' is not a launch criterion. Here's how to measure this properly.

We run evaluation across three independent layers, each of which catches different failure modes. Offline evaluation uses a golden dataset: 100–200 expert-labelled (query, ideal response, expected source articles) triples that editorial staff have approved. It runs before any user sees the system. Online evaluation captures what real users do: do they click cited articles, do they ask follow-ups, do they bail to site search? These are behavioural proxies for trust. Editorial audit is a manual human review, run weekly, where editors read a random sample of responses and flag anything that violates journalistic standards.

The three layers are complementary, not redundant. Offline metrics can be gamed by over-fitting to the golden set. Online signals are noisy and lag behind regressions. Editorial audit catches subtle failures both miss — a response that is technically grounded but tonally inappropriate for a news brand.

Offline

Golden dataset — run before launch

Retrieval Recall@5 (target: > 85%)

Of all articles that should appear in the top-5 results, what fraction actually do? Measures how often we miss relevant content. A recall gap means users get incomplete answers.

Retrieval Precision@5 (target: > 70%)

Of the 5 retrieved chunks, how many are actually relevant to the query? Measures noise. Low precision pollutes the LLM context and can cause off-topic or confused responses.

Answer Faithfulness (target: > 90%)

Does every factual claim in the generated response trace directly to a retrieved chunk? Catches hallucination. Measured via NLI entailment or LLM-as-judge on the golden set.

Editorial Compliance (target: > 95%)

Editors manually review 200 golden responses — does any violate the 5 system prompt rules? A single impartiality violation is a blocking issue; this metric has a near-zero tolerance floor.
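The two retrieval metrics above are straightforward to compute per query and then average over the golden dataset. A minimal sketch, assuming each golden entry supplies the set of relevant article IDs and the system returns an ordered list of retrieved IDs (function names are illustrative):

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def precision_at_k(retrieved: list[int], relevant: set[int], k: int = 5) -> float:
    """Fraction of the top-k slots filled with relevant items."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


retrieved = [1, 3, 8, 4, 10, 2]   # ranked output for one query
relevant = {1, 2, 3, 4}           # editor-labelled ground truth
print(recall_at_k(retrieved, relevant))     # 0.75
print(precision_at_k(retrieved, relevant))  # 0.6
```

In this example article 2 is relevant but ranked sixth, so it counts against recall, while articles 8 and 10 occupy top-5 slots without being relevant, which lowers precision.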

Online

Production signals — tracked continuously

Citation click-through (target: track trend)

% of sessions where the user clicks a cited article. A strong proxy for trust — if users verify what the AI said, they believe it's worth checking.

Positive rating rate (target: > 70%)

Of sessions where the user gives explicit feedback (thumbs up/down), what % are positive? Requires enough rating volume to be statistically meaningful — prompt users selectively.

Session continuation (target: track trend)

Does the user ask a follow-up question in the same session? Yes = the AI helped and they want more. No = they got what they needed or gave up. Context-dependent; watch the trend.

Escalation rate (target: minimise)

% of sessions where the user abandons the AI and navigates to site search instead. High escalation signals eroded trust: the AI failed to satisfy and the user sought an alternative.

Editorial audit

Human review — weekly sample

Impartiality violations (target: 0 per audit)

Any response that takes a position on a political, economic, or social policy matter. Zero tolerance — one confirmed violation is a rollback trigger, not a P2 ticket.

Hallucinated citations (target: 0 per audit)

Response references [Article N] that was not in the retrieved context. Indicates the model is citing from training memory, not the RAG pipeline. Also zero tolerance.

Content-type errors (target: < 2%)

Opinion articles surfaced or quoted without the 'In an Opinion piece…' labeling. Treating an op-ed as straight news reporting is an editorial failure.

Sensitive topic handling (target: pass / fail)

Spot-check on predefined sensitive categories: election coverage, active court cases, Indigenous affairs, breaking investigations. Editors assess against written editorial guidelines.

Launch Go / No-Go Gates

Phase: Internal Alpha · Who: Engineers + Editors (20 people) · Go criteria:
  • Faithfulness > 85% on golden dataset
  • Zero editorial violations in 200-response audit
  • P95 end-to-end latency < 2s
  • RBAC / content-type metadata verified correct

Phase: Editorial Sign-off · Who: Senior editorial team · Go criteria:
  • 100-response manual review with written editor sign-off
  • No impartiality violations of any kind
  • Opinion vs. news labeling accurate across sample
  • Conflicting-source handling reviewed and approved

Phase: Beta · Who: Opt-in users (1,000) · Go criteria:
  • Positive rating rate > 65%
  • Escalation rate (to site search) < 20%
  • Weekly editorial audit passing with no P0 findings
  • No public editorial incident

Phase: General Availability · Who: Full audience · Go criteria:
  • Positive rating rate > 70%
  • Monitoring dashboard live with alerting configured
  • Human review queue below 2% of total queries
  • Rollback plan tested end-to-end in staging
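Gates like these are easier to enforce when they live in code instead of a slide. The sketch below encodes the Beta thresholds listed above as data and reports which ones a metrics snapshot fails. The metric key names and the `failed_gates` helper are hypothetical, and a missing metric is treated as a failure.

```python
import operator

# Thresholds taken from the Beta go criteria; key names are illustrative.
BETA_GATES = {
    "positive_rating_rate": (operator.gt, 0.65),
    "escalation_rate": (operator.lt, 0.20),
    "p0_audit_findings": (operator.eq, 0),
    "public_editorial_incidents": (operator.eq, 0),
}


def failed_gates(metrics: dict, gates: dict) -> list[str]:
    """Return the names of gates that are unmet or have no reported metric."""
    failures = []
    for name, (compare, threshold) in gates.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            failures.append(name)
    return failures


snapshot = {
    "positive_rating_rate": 0.68,
    "escalation_rate": 0.25,
    "p0_audit_findings": 0,
    "public_editorial_incidents": 0,
}
print(failed_gates(snapshot, BETA_GATES))  # ['escalation_rate']
```

An empty list means go. Anything else names exactly what blocks the phase transition.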
Chapter 07

Should This System Use Agents?

The RAG pipeline above is reliable, auditable, and safe for V1. Agents introduce real value in specific workflows — and real risk that must be designed around explicitly.

The short answer is yes — but incrementally, and only where a single retrieval pass is structurally insufficient. The system described in this playbook is deliberately stateless and reactive: a user asks, the pipeline retrieves, the model answers. That is the right architecture for V1. It is auditable, debuggable, and safe to hand to an editorial team that has zero tolerance for AI errors reaching publication.

Agents earn their added complexity when the task has structure that single-pass RAG cannot express: multi-step reasoning across independent topic domains, proactive monitoring of evolving stories, or workflows where the system must decide what to retrieve before it knows how to answer. Below is where that line sits for a news media system specifically.

The Test for Adding an Agent
If the answer to "can a single retrieval pass + one LLM call answer this reliably?" is yes — do not add an agent. The latency cost, audit complexity, and error compounding are not justified. Agents are for workflows where sequenced decisions are structurally required, not for making simple queries feel more impressive.

7.1 — Where Agents Add Value

Four workflows where an agentic layer is justified over a simple RAG call.

Multi-Hop Research

A user asks: "How have interest rates affected housing affordability over the last two years?" A single retrieval cannot answer this — it requires retrieving from Business (rates), Politics (housing policy), and Local (affordability impact) independently, then synthesising. The agent decomposes the query, runs sequential retrievals, and fuses the results before generating.

Evolving Story Tracking

Breaking news stories evolve over hours. An agent can monitor a user's query intent against incoming article ingestion — when a new article is semantically similar to a prior high-engagement query, it proactively surfaces the update. This is a background agentic loop with a clear trigger condition, not open-ended autonomy.

Corpus Gap Detection

When retrieval confidence is low, the current system degrades gracefully. An agent can go further: identify that the gap exists, determine whether it is a coverage gap (no articles exist) or a recency gap (articles exist but are stale), and route accordingly — either to a human editor flag or to a "we don't have recent coverage on this" response rather than hallucinating.

Editorial Workflow Assistance

Internal-facing use case: an editor asks the system to find all coverage of a specific politician across the last 6 months, flag contradictory reporting, and surface any stories that reference each other but are not cross-linked. This is a multi-step agentic workflow that has high editorial value and can tolerate higher latency since it is not user-facing.
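The corpus gap detection workflow described above reduces to a small decision once retrieval scores and article ages are available. A rough sketch follows; the thresholds (`min_score`, `max_age_days`), the candidate shape, and the route names are all assumptions chosen for illustration and would need tuning with the editorial team.

```python
# Hypothetical routing targets for each gap type.
ROUTES = {
    "coverage_gap": "flag_for_editor",
    "recency_gap": "respond_no_recent_coverage",
    "no_gap": "answer_normally",
}


def classify_gap(
    candidates: list[dict],
    min_score: float = 0.5,
    max_age_days: int = 90,
) -> str:
    """Classify a retrieval result as a coverage gap, recency gap, or no gap.

    Each candidate is assumed to carry a relevance "score" and "age_days".
    """
    relevant = [c for c in candidates if c["score"] >= min_score]
    if not relevant:
        # Nothing relevant exists in the corpus at all.
        return "coverage_gap"
    if all(c["age_days"] > max_age_days for c in relevant):
        # Relevant articles exist, but every one of them is stale.
        return "recency_gap"
    return "no_gap"


stale_only = [{"score": 0.8, "age_days": 400}]
print(ROUTES[classify_gap(stale_only)])  # respond_no_recent_coverage
```

Both gap outcomes lead away from generation, which keeps this workflow consistent with the graceful degradation rule.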

7.2 — What the Agent Must Never Do

Scope boundaries are not optional in a news context. These are hard constraints, not guidelines.

No external retrieval

The agent may only call tools that read from the internal corpus. No web search, no external APIs. Every cited fact must trace to an article in the audited index.

No editorial judgements

The agent cannot decide that one source is more credible than another, or that a story is worth covering. Those decisions belong to editors.

No unprompted publication

The agent cannot trigger any write action — no pushing alerts, no posting to CMS, no sending notifications — without an explicit human approval gate.

No personalisation decisions

User history and interest profiling raise privacy obligations. The agent must not use behavioural data to alter what content it surfaces without explicit consent and a defined retention policy.

7.3 — Risks and How to Control Them

Each agentic step compounds error. A faithfulness check on the final output is not sufficient — validation must happen at each retrieval-and-reason hop.

Risk | How It Manifests | Control
Hallucination compounding | A bad retrieval in step 1 propagates into step 2 reasoning and step 3 synthesis. The final output can be confidently wrong and internally consistent. | Faithfulness check at every hop, not just the final output. If any step scores below threshold, abort and return a degraded response.
Runaway loops | The agent retries retrieval when results are unsatisfactory, enters a loop, and burns tokens or hits rate limits. | Hard cap: maximum 4 tool calls per request. Timeout at 8 seconds. Any loop pattern triggers circuit breaker and fallback to single-pass RAG.
Editorial standard erosion | Multi-source synthesis across Politics and Business sections without per-source impartiality checking could produce responses that violate the corporation's editorial policy. | Apply the 5 editorial guardrails to every intermediate synthesis, not just the final generation. High-stakes topics (political, judicial) require editorial review before delivery.
Audit trail loss | "Why did it say that?" becomes unanswerable across 4 retrieval steps and 3 LLM calls. | Every agentic step writes to the audit log: tool called, query issued, documents retrieved, reasoning produced. The complete trace is stored alongside the final response.
Scope creep | The agent attempts external retrieval when the internal corpus is insufficient, pulling from unvetted sources. | Strict tool sandboxing. The agent's tool registry contains only signed, internal retrieval functions. No HTTP client, no web search tool, no access to external endpoints.
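The runaway-loop control (4 tool calls, 8 seconds, fallback to single-pass RAG) can be sketched as a budget object that every tool call must go through. The class and function names here are hypothetical. Note that this version checks the deadline only between calls, so it does not interrupt a tool that is already running; a production system would also need per-call timeouts.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the agent exhausts its tool-call or time budget."""


class ToolBudget:
    def __init__(self, max_calls: int = 4, timeout_s: float = 8.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.timeout_s = timeout_s
        self._clock = clock
        self._started = clock()
        self.calls = 0

    def call(self, tool, *args, **kwargs):
        """Run `tool` only if both the call cap and the deadline allow it."""
        if self.calls >= self.max_calls:
            raise BudgetExceeded("tool-call cap reached")
        if self._clock() - self._started > self.timeout_s:
            raise BudgetExceeded("request deadline passed")
        self.calls += 1
        return tool(*args, **kwargs)


def answer_with_fallback(agent, single_pass_rag, query: str):
    """Run the agent under a budget; fall back to plain RAG if it trips."""
    budget = ToolBudget()
    try:
        return agent(query, budget)
    except BudgetExceeded:
        return single_pass_rag(query)
```

Because the agent can only reach tools through `budget.call`, the cap holds even if the agent's own stopping logic is faulty.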
Recommended Rollout Order
Start with the internal editorial workflow agent (gap detection, cross-link analysis) before any user-facing agentic capability. Internal users provide faster feedback, have higher tolerance for errors, and their workflows are easier to audit. Validate the control framework internally before exposing agentic behaviour to the public-facing product.
Full System

End-to-End Architecture

The complete data flow, split into an ingestion pipeline, a shared storage layer, and a query pipeline.
01 Ingestion Pipeline (runs continuously in background)
News Sources (RSS · APIs) → Article Parser (metadata extract) → Semantic Chunker (lede + paragraphs) → Embedding Model (BGE-M3) → writes embeddings to the shared storage layer

Shared Storage Layer
Vector Database: embeddings · metadata index · RBAC filters · audit log. Ingestion writes to it; the query pipeline reads from it and retrieves ranked candidates.

02 Query Pipeline (executes per user request)
User Chat (mobile · web) → Safety Check (first gate) → two branches run in parallel:
  • Routing: Intent Classifier (prompt routing)
  • Retrieval: Hybrid Search (BM25 + semantic) → Re-ranker (temporal decay)
The branches merge → LLM Orchestration (intent-routed prompt) → Faithfulness Check (NLI · LLM-as-judge) → Response (citations inline)

Audit Log: every request writes query · retrieved docs · faithfulness score · response; immutable, append-only
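One way to make an append-only audit log tamper-evident is to chain records by hash, so that editing any past entry breaks verification. This is a swapped-in illustration, not something the design above prescribes: the in-memory list stands in for real storage, and the field names follow the audit log description (query, retrieved docs, faithfulness score, response).

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _digest(body: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_audit_record(log: list, query: str, retrieved_ids: list,
                        faithfulness: float, response: str) -> dict:
    """Append one record that commits to the hash of the previous record."""
    body = {
        "query": query,
        "retrieved_ids": retrieved_ids,
        "faithfulness": faithfulness,
        "response": response,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash and check each record links to its predecessor."""
    previous = GENESIS_HASH
    for record in log:
        body = {key: value for key, value in record.items() if key != "hash"}
        if body["prev_hash"] != previous or _digest(body) != record["hash"]:
            return False
        previous = record["hash"]
    return True
```

With this structure, answering "why did it say that?" means looking up one record, and proving the record was not altered afterwards means re-running `verify_chain`.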
Try It

Live Demo — Ask the System

Real pipeline: BM25 + temporal decay retrieval → GPT-4o-mini with the 5 editorial guardrails.

Sample corpus — 10 articles

# | Title | Date | Section | Type
1 | Government Introduces Housing Affordability Plan | Mar 15, 2022 | politics | news
2 | Interest Rate Hike Threatens Mortgage Renewals | Jul 13, 2022 | economics | news
3 | Rental Prices in Major Cities Hit 30-Year High | May 4, 2022 | real estate | news
4 | Cities That Tried Rent Control: Mixed Results | Sep 22, 2021 | real estate | analysis
5 | Construction Boom Stalls Due to Material Costs | Nov 8, 2021 | business | news
6 | Community Groups Push Back Against Urban Densification | Aug 30, 2020 | local | news
7 | Remote Work Shifts Housing Demand to Suburbs | Jan 19, 2022 | real estate | feature
8 | Study Finds Exercise Reduces Chronic Pain (out-of-scope) | Jun 7, 2021 | health | news
9 | Housing Policy Experts Weigh In on Affordability (7yr-old, stale) | Jun 12, 2015 | politics | opinion
10 | The Rent Control Debate: Economists Divided | Apr 28, 2022 | real estate | opinion

Articles #8 (off-topic) and #9 (7 years old) test editorial filtering and temporal decay respectively.


What We Covered

  • RAG over fine-tuning — freshness + citation are non-negotiable for news
  • Hybrid ingestion — streaming for breaking news, batch for archive
  • Semantic chunking — lede as summary chunk, paragraphs for precision
  • BGE-M3 — multilingual, open source, no data egress
  • Hybrid search + re-ranking + temporal decay — the 5-stage pipeline
  • 5 editorial guardrails — grounding, impartiality, attribution, labeling, degradation
  • Three-tier evaluation — offline golden dataset, online signals, editorial audit
  • Phased launch with explicit go/no-go thresholds at each gate
  • Agentic extensions — where agents add value, hard limits, and risk controls
Shashank Padala

Founder, Kirak Labs · Ex-Amazon · AI Product Leader

AI Product & Transformation Leader with 8+ years building production LLM systems. Previously at Amazon, led GenAI integration into content creation workflows for Corp Communications serving ~3M employees globally.
