System Design · RAG · Vector Search · Editorial AI · LLM

Designing a GenAI
News Discovery System

Every decision. Every tradeoff. Walking through a production RAG system for a news media corporation — the way a senior engineer would explain it on a whiteboard.

Shashank Padala, Founder, Kirak Labs · Ex-Amazon · 25 min read · May 22, 2026

Target: CS grads · ML engineers · Technical PMs


Chapter 00

The Case Study

The prompt

"A news media corporation is exploring a Generative AI assistant to help users discover and understand content through natural language interactions. The system should provide reliable, grounded, and audience-appropriate responses aligned with the corporation's editorial standards."

On the surface this sounds like a chatbot problem. It isn't. Let's unpack the three constraints that define every architectural decision we're about to make.

Real-Time Freshness

News is perishable. A 3-day-old article about a market crash may already be superseded. The system must weight recency without hard-expiring evergreen context pieces.

Editorial Standards

This isn't Google. The corporation has a journalistic voice: factual, attributed, impartial. The AI cannot express opinions on political matters — that's a legal and institutional constraint.

Scale + Structure

A major news outlet publishes hundreds of articles per day across dozens of sections: news, opinion, analysis, live blog. The corpus is massive and heterogeneous.

Key reframe before we start

This is not a "build a chatbot" problem. It's a "build an editorial compliance system with a conversational interface" problem. That framing changes your architecture, your evaluation framework, and your definition of failure.

Chapter 01

Requirements

Let's do this properly. The non-functional requirements are where this system gets genuinely interesting.

Functional

Accept natural language queries about news content
Return responses grounded in published articles only
Cite the specific article behind every factual claim
Handle discovery, explanatory, and factual lookup queries
Gracefully decline out-of-scope queries
Surface related articles the user might want to read next

Non-Functional

Freshness: Recent articles weighted higher in retrieval
Impartiality: Zero editorial opinions — institutional constraint
Attribution: Every claim traces to a specific article + date
Latency: < 2s P95 end-to-end (streaming helps)
Multilingual: English and French (en v1, fr v2)
Auditability: Every response: query → retrieved docs → output

The core tension

Helpfulness and editorial compliance pull in opposite directions. An LLM optimised to be helpful will fill gaps, speculate, and offer opinions. A news organisation cannot afford any of that. Your architecture must make the AI less helpful in the naive sense — and more trustworthy in the professional sense.

Chapter 02

First Big Decision — What Architecture?

Three plausible approaches. Let's evaluate each honestly before committing to any of them.

Approach	How it works	The problem	Verdict
Fine-Tuning	Train the LLM on the news corpus to bake knowledge into weights	Knowledge frozen at training time. Expensive to retrain. Still hallucinates. Cannot cite sources.	❌ Wrong tool
Prompt Stuffing	Dump articles directly into a huge context window	Context window limits (~200K tokens). Cannot scale to millions of articles. Expensive per query. Slow.	❌ Doesn't scale
RAG	Retrieve relevant articles from a vector index, then generate grounded responses	Retrieval quality is the ceiling. Two-stage latency. Infrastructure complexity.	✓ The right call

Decision: RAG — five reasons specific to news

1. News requires real-time freshness, which fine-tuning can't deliver
2. Journalistic attribution requires citable sources, and RAG grounds responses in specific documents
3. The corpus grows daily: retrieval scales horizontally, prompt stuffing doesn't
4. Editorial compliance requires knowing exactly which document drove each claim
5. Hybrid search (Chapter 04) lets us match named entities precisely, which is critical for news

Chapter 03

Data Foundation

RAG is only as good as the data behind it. This is where most teams underinvest.

3.1 — Ingestion Strategy

News content arrives continuously. How you ingest it determines your freshness ceiling.

Batch Rebuild

Nightly full re-index

+ Simple to implement

− Hours of staleness on breaking news

Pure Streaming

Every article triggers immediate index

+ Minutes to live

− Complex infra, partial failure risk

Hybrid ✓

Batch baseline + streaming delta

+ Fresh for breaking news, stable for archive

− Two pipelines to maintain

3.2 — Chunking Strategy

Chunking is one of the highest-leverage decisions in RAG. Too large: retrieval is imprecise. Too small: you lose semantic context. For news, there's a structural advantage to exploit.

Strategy	Chunk size	Good for	Bad for
Full article	1,000–5,000 tokens	Full context, no boundary splits	Retrieval precision collapses on large corpora
Fixed-size (512 tokens)	512 tokens + overlap	Uniform retrieval units	Splits mid-sentence, loses narrative context
Semantic — lede + paragraphs ✓	150–400 tokens/chunk	Precision + journalistic structure. Lede = summary chunk.	Requires more careful parsing
Hierarchical (doc summary + chunks)	Multi-level index	Two-stage retrieval: summary → paragraph	More complex indexing infrastructure

Decision: Semantic chunking — lede + paragraph chunks

In journalism, the lede (opening paragraph) is the entire story compressed. We index it as a standalone summary chunk. Body paragraphs get indexed individually at 200–350 tokens with 15% overlap. This gives retrieval precision while preserving journalistic structure.
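A minimal sketch of the lede-plus-paragraph split described above, under stated assumptions: paragraphs are separated by blank lines, whitespace-separated words stand in for tokens, and the 15% overlap is taken from the tail of the preceding paragraph. `Chunk` and `chunk_article` are illustrative names, not part of any library, and enforcing the 200–350 token range on unusually long or short paragraphs is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str  # "lede" or "body"
    text: str


def chunk_article(article: str, overlap_percent: int = 15) -> list[Chunk]:
    """Split an article into a standalone lede chunk plus one chunk per body paragraph."""
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    if not paragraphs:
        return []

    # The opening paragraph is indexed on its own as the summary chunk.
    chunks = [Chunk("lede", paragraphs[0])]

    # Each body paragraph is prefixed with the tail of the previous paragraph.
    for prev, para in zip(paragraphs, paragraphs[1:]):
        words = para.split()
        n_overlap = len(words) * overlap_percent // 100  # integer math, no float rounding
        tail = prev.split()[-n_overlap:] if n_overlap else []
        chunks.append(Chunk("body", " ".join(tail + words)))
    return chunks
```

A production version would likely count tokens with the embedding model's own tokenizer and split or merge paragraphs that fall outside the target range.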

3.3 — Metadata Schema

Metadata is the control plane for retrieval. Every field below enables a specific retrieval capability — filter by date, section, content type, or language. Without it, you're flying blind.

Field	Type	Enables
source_id	string	Citation linking — every claim traces to an article
title	string	BM25 title boosting — title matches score higher
published_at	datetime	Temporal decay and date-range filtering
section	enum: politics | business | health | local | ...	Domain-scoped retrieval
content_type	enum: news | opinion | analysis | feature | live-blog	Editorial labeling in responses
language	enum: en \| fr	Multilingual routing (v2)
is_breaking	bool	Boost breaking news in retrieval ranking
keyword_tags	string[]	Named entity index: politician, location, bill names
freshness_score	float [0–1]	Computed recency weight for re-ranking
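One way to express the schema above as a typed record, sketched with the standard library only. The class names (`ChunkMetadata`, `ContentType`, `Language`) are hypothetical. `section` stays a plain string because the table leaves that enum open-ended, and the range check on `freshness_score` is an assumption drawn from the `[0–1]` annotation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class ContentType(str, Enum):
    NEWS = "news"
    OPINION = "opinion"
    ANALYSIS = "analysis"
    FEATURE = "feature"
    LIVE_BLOG = "live-blog"


class Language(str, Enum):
    EN = "en"
    FR = "fr"


@dataclass(frozen=True)
class ChunkMetadata:
    source_id: str                     # citation linking
    title: str                         # BM25 title boosting
    published_at: datetime             # temporal decay, date-range filters
    section: str                       # open-ended in the table, so not an Enum here
    content_type: ContentType          # editorial labeling in responses
    language: Language                 # multilingual routing
    is_breaking: bool = False
    keyword_tags: tuple[str, ...] = ()
    freshness_score: float = 0.0       # computed recency weight in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.freshness_score <= 1.0:
            raise ValueError("freshness_score must be in [0, 1]")


example = ChunkMetadata(
    source_id="article-0004",  # hypothetical ID format
    title="Cities That Tried Rent Control: Mixed Results",
    published_at=datetime(2021, 9, 22, tzinfo=timezone.utc),
    section="real estate",
    content_type=ContentType("analysis"),
    language=Language.EN,
)
```

Parsing `content_type` through the enum means an unexpected value fails at ingestion time rather than surfacing later as a mislabeled response.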

Chapter 04

Retrieval Architecture

Retrieval quality is the ceiling for your entire system. The LLM can only be as good as what you give it.

4.1 — Embedding Model

Model	Multilingual	Open Source	Notes
OpenAI text-embedding-3-large	Partial	No	Strong performance, data privacy concern for orgs with strict data residency requirements
Cohere Embed v3	Yes (100+ langs)	No	Strong reranking integration, enterprise pricing
BGE-M3	Yes (100+ langs)	Yes (MIT license)	Strong results on multilingual retrieval benchmarks. Self-hostable. No data egress. Right call for an en/fr bilingual mandate.

4.2 — Why Hybrid Search?

Semantic only — what breaks

Finds conceptually similar content. Works for "explain rising rents". Fails badly on "what did Bill C-47 say about housing grants" — the named entity gets lost in the embedding.

Breaks on: names, dates, legislation, tickers, proper nouns

BM25 only — what breaks

Exact term matching. Works for "Bill C-47". Fails for "housing affordability crisis" when the article is titled "Rent Burden Reaches Breaking Point".

Breaks on: paraphrasing, synonyms, conceptual queries

Decision: Hybrid search — BM25 + semantic, then re-rank

Run both in parallel. BM25 catches exact entity matches. Semantic catches conceptual similarity. Merge via Reciprocal Rank Fusion (RRF). Then apply a cross-encoder re-ranker to score true query-chunk relevance on the top-50 candidates. This is the standard production approach.
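The RRF merge step can be sketched in a few lines. Each document receives `1 / (k + rank)` from every ranked list it appears in, and the sums decide the fused order. The constant `k = 60` is a commonly used default and is an assumption here, as are the function name and the toy document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["a", "b", "c"]       # exact-match results (toy IDs)
semantic_ranking = ["b", "d", "a"]   # embedding-similarity results (toy IDs)
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. In this toy case, documents found by both retrievers ("b" and "a") rise above those found by only one.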

4.3 — Temporal Decay

Standard retrieval is agnostic to article age. For news, surfacing a 2015 housing article in response to a 2022 housing query is actively misleading. We add a recency weight to the final score.

Temporal decay formula

final_score = retrieval_score × (α + (1−α) × e^{−λ × age_days})

α = 0.3 — minimum floor (old articles can still rank if highly relevant)

λ = 0.001: decay rate (the exponential recency term halves after ln 2 / λ ≈ 693 days; with the α floor, the overall multiplier at that age is about 0.65)

age_days — days since publication

Tradeoff: tuning λ

Set λ too high and evergreen background articles (housing policy history) get buried. Too low and recency barely matters. Tune this against your editorial team's judgment of what "recent enough" means for different query types — it's not a pure ML decision.
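To make the effect of α and λ concrete, here is the formula above as a small function, evaluated at a few ages. The function name and the sample ages are illustrative; the defaults are the values given in the text.

```python
import math


def temporal_decay_score(
    retrieval_score: float,
    age_days: float,
    alpha: float = 0.3,
    lam: float = 0.001,
) -> float:
    """final_score = retrieval_score * (alpha + (1 - alpha) * exp(-lam * age_days))"""
    return retrieval_score * (alpha + (1 - alpha) * math.exp(-lam * age_days))


half_life_days = math.log(2) / 0.001  # about 693 days for lam = 0.001

# Multiplier applied to a retrieval score of 1.0 at different ages:
today = temporal_decay_score(1.0, 0)              # 1.0, no penalty
at_half_life = temporal_decay_score(1.0, 693)     # about 0.65
seven_years = temporal_decay_score(1.0, 7 * 365)  # about 0.354
```

The multiplier never drops below α, so a seven-year-old article keeps roughly a third of its relevance score. That is enough for a highly relevant evergreen piece to outrank a weakly relevant recent one.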

4.4 — The Five-Stage Retrieval Pipeline

Coarse retrieval

Top-100 candidates via parallel BM25 + semantic search

Metadata filtering

Language, section, date range — narrows the candidate pool

Cross-encoder re-ranking

Score true query-chunk relevance. The highest-impact quality lever in the pipeline.

Temporal decay re-scoring

Apply the recency weight formula from above

Top-K selection

Pass top 4–6 chunks to the LLM. More = diminishing returns + higher cost.
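The five stages compose into a short function. This is a sketch under stated assumptions: `coarse_search` and `rerank` are hypothetical callables standing in for the hybrid retriever and the cross-encoder, the only metadata filter shown is language, and `Candidate` is an illustrative record rather than a real library type.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    chunk_id: str
    language: str
    age_days: float
    score: float = 0.0


def retrieve(
    query: str,
    coarse_search: Callable[[str, int], list[Candidate]],  # hypothetical hybrid retriever
    rerank: Callable[[str, Candidate], float],             # hypothetical cross-encoder
    language: str = "en",
    top_k: int = 5,
    alpha: float = 0.3,
    lam: float = 0.001,
) -> list[Candidate]:
    # Stage 1: coarse retrieval of up to 100 candidates.
    candidates = coarse_search(query, 100)
    # Stage 2: metadata filtering (language only in this sketch).
    candidates = [c for c in candidates if c.language == language]
    for c in candidates:
        # Stage 3: cross-encoder relevance score for the (query, chunk) pair.
        relevance = rerank(query, c)
        # Stage 4: temporal decay re-scoring.
        c.score = relevance * (alpha + (1 - alpha) * math.exp(-lam * c.age_days))
    # Stage 5: keep the top-K chunks for the LLM.
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]


# Toy run with stubbed components.
pool = [
    Candidate("fresh", "en", age_days=0),
    Candidate("old", "en", age_days=3000),
    Candidate("french", "fr", age_days=0),
]
relevance_by_id = {"fresh": 0.6, "old": 0.9, "french": 1.0}
top = retrieve("q", lambda q, n: pool, lambda q, c: relevance_by_id[c.chunk_id], top_k=2)
```

In the toy run, the older chunk has the higher cross-encoder score but ends up second after decay, and the French chunk is filtered out before re-ranking, so the expensive model never scores it.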

Chapter 05

The Editorial Standards Layer

This is the chapter that separates a generic chatbot from a journalism-grade AI system.

Why this chapter matters

Any competent engineer can build a RAG pipeline. What makes a news AI system different is the editorial compliance layer. This is where institutional trust is either protected or destroyed. Get it wrong and you have an AI expressing political opinions on behalf of a news organisation — a serious incident.

5.1 — The Five System Prompt Rules

Rule 1: Grounding

Only make claims directly supported by the provided articles. Never use training knowledge to fill gaps.

Why: Prevents hallucination of facts from pre-training that may be outdated, wrong, or unverifiable.

Rule 2: Impartiality

Never express editorial opinions on political, economic, or social policy matters.

Why: Institutional requirement. News organisations are legally and professionally obligated to separate reporting from opinion.

Rule 3: Balanced attribution

When retrieved articles present conflicting expert views, present both perspectives with attribution. Do not adjudicate.

Why: Journalistic fairness. The AI must not take sides in contested expert debates.

Rule 4: Content type labeling

Explicitly distinguish opinion articles from news reporting in your response.

Why: Summarising an op-ed as if it were factual reporting is an editorial failure. The content_type metadata field makes this automatable.

Rule 5: Graceful degradation

If no relevant articles are retrieved, state this clearly. Do not speculate or synthesise from training data.

Why: Silence is better than confident wrongness in journalism.
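As an illustration of how the five rules might reach the model, the sketch below assembles a system prompt from the rule text and a list of retrieved articles. The prompt layout, the `build_system_prompt` name, and the article dictionary keys are assumptions. The rule wording is taken from the section above, and the `[Article N]` numbering mirrors the citation style used in this article.

```python
EDITORIAL_RULES = [
    ("Grounding", "Only make claims directly supported by the provided articles. "
                  "Never use training knowledge to fill gaps."),
    ("Impartiality", "Never express editorial opinions on political, economic, "
                     "or social policy matters."),
    ("Balanced attribution", "When retrieved articles present conflicting expert views, "
                             "present both perspectives with attribution. Do not adjudicate."),
    ("Content type labeling", "Explicitly distinguish opinion articles from news reporting "
                              "in your response."),
    ("Graceful degradation", "If no relevant articles are retrieved, state this clearly. "
                             "Do not speculate or synthesise from training data."),
]


def build_system_prompt(articles: list[dict]) -> str:
    """Render the rules plus numbered, type-labelled articles into one prompt string."""
    lines = ["Follow these rules without exception:"]
    for number, (name, text) in enumerate(EDITORIAL_RULES, start=1):
        lines.append(f"Rule {number} ({name}): {text}")
    lines.append("")
    lines.append("Cite sources as [Article N]. Provided articles:")
    for n, article in enumerate(articles, start=1):
        # Exposing content_type and date lets the model apply Rule 4 and attribute correctly.
        header = f"[Article {n}] ({article['content_type']}, {article['published_at']})"
        lines.append(f"{header} {article['title']}")
        lines.append(article["text"])
    return "\n".join(lines)


prompt = build_system_prompt([
    {"title": "Cities That Tried Rent Control: Mixed Results",
     "content_type": "analysis", "published_at": "2021-09-22", "text": "(chunk text)"},
    {"title": "The Rent Control Debate: Economists Divided",
     "content_type": "opinion", "published_at": "2022-04-28", "text": "(chunk text)"},
])
```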

5.2 — The Conflicting Sources Problem

News corpora frequently contain articles where experts disagree. Rent control is a perfect example: one article says it helps affordability, another says it reduces housing supply. The AI must handle this without taking a side.

Bad response

"Rent control is effective at improving housing affordability for existing tenants."

↳ Takes a position. Editorial violation.

Correct response

"Experts are divided. Proponents argue rent control improves affordability [Article 4]. Critics contend it reduces housing supply over time [Article 10]."

↳ Both sides, attributed, no position taken.

5.3 — Post-Generation Faithfulness Checks

Citation ID verification

Every [Article N] citation in the response must reference a doc that was actually retrieved. Automated check — flag and regenerate if invalid.

NLI entailment check

Run each factual sentence through a Natural Language Inference model. Verify it is entailed by at least one retrieved chunk. Flag low-confidence sentences.

LLM-as-judge

Use a second LLM call to score faithfulness. Costlier but catches subtle fabrications that NLI misses. Run on sampled traffic, not every request.
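Of the three checks, citation ID verification is the simplest to automate. A minimal sketch, assuming citations always follow the `[Article N]` pattern used in this article and that the retrieval step exposes the set of article numbers it passed to the model (`invalid_citations` is an illustrative name):

```python
import re

CITATION_PATTERN = re.compile(r"\[Article (\d+)\]")


def invalid_citations(response: str, retrieved_ids: set[int]) -> list[int]:
    """Return cited article numbers that were not in the retrieved context."""
    cited = {int(number) for number in CITATION_PATTERN.findall(response)}
    return sorted(cited - retrieved_ids)


response = (
    "Experts are divided. Proponents argue rent control improves affordability "
    "[Article 4]. Critics contend it reduces housing supply over time [Article 10]."
)
ok = invalid_citations(response, retrieved_ids={4, 10})   # nothing to flag
bad = invalid_citations(response, retrieved_ids={4, 7})   # Article 10 was never retrieved
```

A non-empty result is the "flag and regenerate" trigger. The check only proves the cited article was in context, not that it supports the claim, which is what the NLI and LLM-as-judge checks are for.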

Chapter 06

Evaluation & Readiness

'It feels good in the demo' is not a launch criterion. Here's how to measure this properly.

We run evaluation across three independent layers, each catching different failure modes. Offline evaluation uses a golden dataset: 100–200 expert-labelled (query, ideal response, expected source articles) triples that editorial staff have approved. It runs before any user sees the system. Online evaluation captures what real users do: do they click cited articles, do they ask follow-ups, do they bail to site search? These are behavioural proxies for trust. Editorial audit is a manual human review, run weekly, where editors read a random sample of responses and flag anything that violates journalistic standards.

The three layers are complementary, not redundant. Offline metrics can be gamed by over-fitting to the golden set. Online signals are noisy and lag behind regressions. Editorial audit catches subtle failures both miss — a response that is technically grounded but tonally inappropriate for a news brand.

Offline

Golden dataset — run before launch

Retrieval Recall@5: > 85%

Of all articles that should appear in the top-5 results, what fraction actually do? Measures how often we miss relevant content. A recall gap means users get incomplete answers.

Retrieval Precision@5: > 70%

Of the 5 retrieved chunks, how many are actually relevant to the query? Measures noise. Low precision pollutes the LLM context and can cause off-topic or confused responses.

Answer Faithfulness: > 90%

Does every factual claim in the generated response trace directly to a retrieved chunk? Catches hallucination. Measured via NLI entailment or LLM-as-judge on the golden set.

Editorial Compliance: > 95%

Editors manually review 200 golden responses — does any violate the 5 system prompt rules? A single impartiality violation is a blocking issue; this metric has a near-zero tolerance floor.
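The two retrieval metrics above are straightforward to compute per golden-set query. In this sketch, `retrieved` is the ranked list of IDs the system returned and `relevant` is the expert-labelled set for that query. Both names are illustrative, and precision is divided by `k` rather than by the number of results actually returned, which is one common convention.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k slots occupied by relevant items."""
    hits = set(retrieved[:k]) & relevant
    return len(hits) / k


retrieved = ["1", "3", "8", "2", "9", "4"]  # ranked system output (toy IDs)
relevant = {"1", "2", "3", "4"}             # expert labels for this query

recall = recall_at_k(retrieved, relevant)        # 3 of 4 relevant items found: 0.75
precision = precision_at_k(retrieved, relevant)  # 3 of 5 slots relevant: 0.6
```

Averaging these per-query values over the golden set gives the numbers to compare against the 85% and 70% thresholds.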

Online

Production signals — tracked continuously

Citation click-through: track trend

% of sessions where the user clicks a cited article. A proxy for trust and engagement: users who click through consider the answer worth verifying and reading further.

Positive rating rate: > 70%

Of sessions where the user gives explicit feedback (thumbs up/down), what % are positive? Requires enough rating volume to be statistically meaningful — prompt users selectively.

Session continuation: track trend

Does the user ask a follow-up question in the same session? Yes = the AI helped and they want more. No = they got what they needed or gave up. Context-dependent; watch the trend.

Escalation rate: minimise

% of sessions where the user abandons the AI and navigates to site search instead. High escalation is a negative trust signal: the AI failed to satisfy and the user sought an alternative.

Editorial audit

Human review — weekly sample

Impartiality violations: 0 per audit

Any response that takes a position on a political, economic, or social policy matter. Zero tolerance — one confirmed violation is a rollback trigger, not a P2 ticket.

Hallucinated citations: 0 per audit

Response references [Article N] that was not in the retrieved context. Indicates the model is citing from training memory, not the RAG pipeline. Also zero tolerance.

Content-type errors: < 2%

Opinion articles surfaced or quoted without the 'In an Opinion piece…' labeling. Treating an op-ed as straight news reporting is an editorial failure.

Sensitive topic handling: pass / fail

Spot-check on predefined sensitive categories: election coverage, active court cases, Indigenous affairs, breaking investigations. Editors assess against written editorial guidelines.

Launch Go / No-Go Gates

Phase	Who	Go criteria
Internal Alpha	Engineers + Editors (20 people)	Faithfulness > 85% on golden dataset; zero editorial violations in 200-response audit; P95 end-to-end latency < 2s; RBAC / content-type metadata verified correct
Editorial Sign-off	Senior editorial team	100-response manual review with written editor sign-off; no impartiality violations of any kind; opinion vs. news labeling accurate across sample; conflicting-source handling reviewed and approved
Beta	Opt-in users (1,000)	Positive rating rate > 65%; escalation rate (to site search) < 20%; weekly editorial audit passing with no P0 findings; no public editorial incident
General Availability	Full audience	Positive rating rate > 70%; monitoring dashboard live with alerting configured; human review queue below 2% of total queries; rollback plan tested end-to-end in staging
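The numeric criteria in the table lend themselves to an automated check. The sketch below encodes only the two quantitative Beta thresholds. The metric keys and the `failed_gates` helper are hypothetical, and the qualitative criteria (editorial audit findings, public incidents) still need human sign-off and are not modelled here.

```python
# Quantitative Beta criteria from the table: (comparison, threshold).
BETA_GATES = {
    "positive_rating_rate": (">", 0.65),
    "escalation_rate": ("<", 0.20),
}


def failed_gates(metrics: dict[str, float], gates: dict[str, tuple[str, float]]) -> list[str]:
    """Return the names of gates whose threshold is not met."""
    failures = []
    for name, (comparison, threshold) in gates.items():
        value = metrics[name]
        passed = value > threshold if comparison == ">" else value < threshold
        if not passed:
            failures.append(name)
    return failures


# Example: ratings are healthy, but too many users fall back to site search.
blocking = failed_gates(
    {"positive_rating_rate": 0.71, "escalation_rate": 0.25},
    BETA_GATES,
)
```

An empty list means the numeric part of the gate passes. Anything else names the metrics blocking promotion to the next phase.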

Chapter 07

Should This System Use Agents?

The RAG pipeline above is reliable, auditable, and safe for V1. Agents introduce real value in specific workflows — and real risk that must be designed around explicitly.

The short answer is yes — but incrementally, and only where a single retrieval pass is structurally insufficient. The system described in this playbook is deliberately stateless and reactive: a user asks, the pipeline retrieves, the model answers. That is the right architecture for V1. It is auditable, debuggable, and safe to hand to an editorial team that has zero tolerance for AI errors reaching publication.

Agents earn their added complexity when the task has structure that single-pass RAG cannot express: multi-step reasoning across independent topic domains, proactive monitoring of evolving stories, or workflows where the system must decide what to retrieve before it knows how to answer. Below is where that line sits for a news media system specifically.

The Test for Adding an Agent

If the answer to "can a single retrieval pass + one LLM call answer this reliably?" is yes — do not add an agent. The latency cost, audit complexity, and error compounding are not justified. Agents are for workflows where sequenced decisions are structurally required, not for making simple queries feel more impressive.

7.1 — Where Agents Add Value

Four workflows where an agentic layer is justified over a simple RAG call.

Multi-Hop Research

A user asks: "How have interest rates affected housing affordability over the last two years?" A single retrieval cannot answer this — it requires retrieving from Business (rates), Politics (housing policy), and Local (affordability impact) independently, then synthesising. The agent decomposes the query, runs sequential retrievals, and fuses the results before generating.

Evolving Story Tracking

Breaking news stories evolve over hours. An agent can monitor a user's query intent against incoming article ingestion — when a new article is semantically similar to a prior high-engagement query, it proactively surfaces the update. This is a background agentic loop with a clear trigger condition, not open-ended autonomy.

Corpus Gap Detection

When retrieval confidence is low, the current system degrades gracefully. An agent can go further: identify that the gap exists, determine whether it is a coverage gap (no articles exist) or a recency gap (articles exist but are stale), and route accordingly — either to a human editor flag or to a "we don't have recent coverage on this" response rather than hallucinating.

Editorial Workflow Assistance

Internal-facing use case: an editor asks the system to find all coverage of a specific politician across the last 6 months, flag contradictory reporting, and surface any stories that reference each other but are not cross-linked. This is a multi-step agentic workflow that has high editorial value and can tolerate higher latency since it is not user-facing.

7.2 — What the Agent Must Never Do

Scope boundaries are not optional in a news context. These are hard constraints, not guidelines.

No external retrieval

The agent may only call tools that read from the internal corpus. No web search, no external APIs. Every cited fact must trace to an article in the audited index.

No editorial judgements

The agent cannot decide that one source is more credible than another, or that a story is worth covering. Those decisions belong to editors.

No unprompted publication

The agent cannot trigger any write action — no pushing alerts, no posting to CMS, no sending notifications — without an explicit human approval gate.

No personalisation decisions

User history and interest profiling raise privacy obligations. The agent must not use behavioural data to alter what content it surfaces without explicit consent and a defined retention policy.

7.3 — Risks and How to Control Them

Each agentic step compounds error. A faithfulness check on the final output is not sufficient — validation must happen at each retrieval-and-reason hop.

Risk	How It Manifests	Control
Hallucination compounding	A bad retrieval in step 1 propagates into step 2 reasoning and step 3 synthesis. The final output can be confidently wrong and internally consistent.	Faithfulness check at every hop, not just the final output. If any step scores below threshold, abort and return a degraded response.
Runaway loops	The agent retries retrieval when results are unsatisfactory, enters a loop, and burns tokens or hits rate limits.	Hard cap: maximum 4 tool calls per request. Timeout at 8 seconds. Any loop pattern triggers circuit breaker and fallback to single-pass RAG.
Editorial standard erosion	Multi-source synthesis across Politics and Business sections without per-source impartiality checking could produce responses that violate the corporation's editorial policy.	Apply the 5 editorial guardrails to every intermediate synthesis, not just the final generation. High-stakes topics (political, judicial) require editorial review before delivery.
Audit trail loss	"Why did it say that?" becomes unanswerable across 4 retrieval steps and 3 LLM calls.	Every agentic step writes to the audit log: tool called, query issued, documents retrieved, reasoning produced. The complete trace is stored alongside the final response.
Scope creep	The agent attempts external retrieval when the internal corpus is insufficient, pulling from unvetted sources.	Strict tool sandboxing. The agent's tool registry contains only signed, internal retrieval functions. No HTTP client, no web search tool, no access to external endpoints.
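Two of the controls in the table, the tool-call cap with timeout and the per-step audit trace, can be sketched as a thin wrapper around the agent's tool calls. Everything here is illustrative: `run_agent`, `FallbackToSinglePass`, and the `call_tool` callable are hypothetical names. A real agent would choose its next query dynamically rather than iterate over a fixed list, and the timeout is only checked between tool calls.

```python
import time
from typing import Callable, Iterable

MAX_TOOL_CALLS = 4      # hard cap from the risk table
TIMEOUT_SECONDS = 8.0   # timeout from the risk table


class FallbackToSinglePass(Exception):
    """Budget exceeded: the caller should answer with single-pass RAG instead."""


def run_agent(
    step_queries: Iterable[str],
    call_tool: Callable[[str], list[str]],   # internal retrieval tool only
    audit_log: list[dict],
    clock: Callable[[], float] = time.monotonic,
) -> list[list[str]]:
    start = clock()
    results = []
    for step, query in enumerate(step_queries, start=1):
        if step > MAX_TOOL_CALLS:
            raise FallbackToSinglePass("tool call cap reached")
        if clock() - start > TIMEOUT_SECONDS:
            raise FallbackToSinglePass("timeout")
        documents = call_tool(query)
        # Every hop is written to the audit trail before the next one starts.
        audit_log.append({"step": step, "query": query, "documents": documents})
        results.append(documents)
    return results


trace: list[dict] = []
hops = run_agent(
    ["interest rates", "housing policy", "affordability impact"],
    call_tool=lambda q: [f"doc-for-{q}"],  # stub retrieval tool
    audit_log=trace,
)
```

Because the trace is appended before the next hop begins, a request that trips the circuit breaker still leaves a complete record of the steps that did run.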

Recommended Rollout Order

Start with the internal editorial workflow agent (gap detection, cross-link analysis) before any user-facing agentic capability. Internal users provide faster feedback, have higher tolerance for errors, and their workflows are easier to audit. Validate the control framework internally before exposing agentic behaviour to the public-facing product.

Full System

End-to-End Architecture

The complete data flow: two pipelines that share one storage layer.

01 Ingestion Pipeline (runs continuously in the background)

News Sources (RSS · APIs) → Article Parser (metadata extraction) → Semantic Chunker (lede + paragraphs) → Embedding Model (BGE-M3) → writes embeddings to the Vector Database

Shared Storage Layer

Vector Database: embeddings · metadata index · RBAC filters · audit log. Ingestion writes to it; the query pipeline retrieves ranked candidates from it.

02 Query Pipeline (executes per user request)

User Chat (mobile · web) → Safety Check (first gate) → two branches run in parallel: routing via the Intent Classifier (prompt routing), and retrieval via Hybrid Search (BM25 + semantic) → Re-ranker (temporal decay). Both merge into → LLM Orchestration (intent-routed prompt) → Faithfulness Check (NLI · LLM-as-judge) → Response (citations inline)

Audit Log: every request writes query · retrieved docs · faithfulness score · response. Immutable, append-only.

Try It

Live Demo — Ask the System

Real pipeline: BM25 + temporal decay retrieval → GPT-4o-mini with the 5 editorial guardrails.

Sample corpus — 10 articles

#	Title	Date	Section	Type
1	Government Introduces Housing Affordability Plan	Mar 15, 2022	politics	news
2	Interest Rate Hike Threatens Mortgage Renewals	Jul 13, 2022	economics	news
3	Rental Prices in Major Cities Hit 30-Year High	May 4, 2022	real estate	news
4	Cities That Tried Rent Control: Mixed Results	Sep 22, 2021	real estate	analysis
5	Construction Boom Stalls Due to Material Costs	Nov 8, 2021	business	news
6	Community Groups Push Back Against Urban Densification	Aug 30, 2020	local	news
7	Remote Work Shifts Housing Demand to Suburbs	Jan 19, 2022	real estate	feature
8	Study Finds Exercise Reduces Chronic Pain (out-of-scope)	Jun 7, 2021	health	news
9	Housing Policy Experts Weigh In on Affordability (7 years old, stale)	Jun 12, 2015	politics	opinion
10	The Rent Control Debate: Economists Divided	Apr 28, 2022	real estate	opinion

Articles #8 (off-topic) and #9 (7 years old) test editorial filtering and temporal decay respectively.


What We Covered

RAG over fine-tuning — freshness + citation are non-negotiable for news
Hybrid ingestion — streaming for breaking news, batch for archive
Semantic chunking — lede as summary chunk, paragraphs for precision
BGE-M3 — multilingual, open source, no data egress
Hybrid search + re-ranking + temporal decay — the 5-stage pipeline
5 editorial guardrails — grounding, impartiality, attribution, labeling, degradation
Three-tier evaluation — offline golden dataset, online signals, editorial audit
Phased launch with explicit go/no-go thresholds at each gate
Agentic extensions — where agents add value, hard limits, and risk controls


Shashank Padala

Founder, Kirak Labs · Ex-Amazon · AI Product Leader

AI Product & Transformation Leader with 8+ years building production LLM systems. Previously at Amazon, led GenAI integration into content creation workflows for Corp Communications serving ~3M employees globally.
