The RAG Pipeline as Core Infrastructure: System Design Patterns for AI-Native Applications

Introduction

In 2023, RAG was a feature. You bolted a vector search onto your existing application, piped documents through an embedding model, and called it "AI-powered."

In 2026, RAG is core infrastructure—as fundamental to AI-native applications as the database is to traditional applications. System design interviews include RAG pipelines alongside load balancers and caches. Production architectures treat retrieval, embedding, and generation as first-class components with their own reliability, cost, and observability requirements.

This shift changes how you design systems.

Section 1: RAG as Infrastructure, Not a Feature

The feature mindset (2023)

Existing App → [RAG module added] → LLM API

RAG is an optional enhancement. The app works without it. Retrieval is a single API call.

The infrastructure mindset (2026)

                    ┌──────────────┐
User Request ──────→│ API Gateway  │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │Retrieval │ │Generation│ │  Agent   │
        │ Pipeline │ │ Pipeline │ │Orchestr. │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
             │            │            │
        ┌────▼─────┐ ┌───▼────┐ ┌────▼─────┐
        │ Vector DB│ │LLM API │ │Tool/MCP  │
        │ + Index  │ │        │ │ Registry │
        └──────────┘ └────────┘ └──────────┘

Retrieval, generation, and orchestration are peer infrastructure components—each with independent scaling, monitoring, and failure modes.

Section 2: The RAG Pipeline Architecture

Ingestion layer

Documents enter the system through an ingestion pipeline:

Source connectors: file upload, API sync, database export, web crawl,
Document processing: chunking (semantic, fixed-size, or hierarchical), metadata extraction,
Embedding generation: batch or streaming embedding with model version tracking,
Index writing: vector store + optional keyword index (hybrid search).

Ingestion is async, idempotent, and versioned. Re-embedding a document corpus when the model changes is a planned operation, not an emergency.
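
One way to get idempotent, versioned ingestion is to key each chunk by a hash of its content plus the embedding model version. The sketch below assumes that approach. The names `chunk_key`, `fake_embed`, and `ingest`, and the `embed-v1`/`embed-v2` version labels, are hypothetical, and the in-memory dict stands in for a real vector store.

```python
import hashlib

EMBEDDING_MODEL_VERSION = "embed-v1"  # hypothetical version label


def chunk_key(text: str, model_version: str) -> str:
    """Derive a deterministic key from chunk content and embedding model version."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call; returns a toy 2-d vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def ingest(chunks: list[str], index: dict[str, list[float]], model_version: str) -> int:
    """Upsert chunks into the index and return how many embeddings were computed."""
    computed = 0
    for text in chunks:
        key = chunk_key(text, model_version)
        if key in index:  # already embedded with this model version, so skip
            continue
        index[key] = fake_embed(text)
        computed += 1
    return computed


index: dict[str, list[float]] = {}
first = ingest(["alpha", "beta"], index, EMBEDDING_MODEL_VERSION)
second = ingest(["alpha", "beta"], index, EMBEDDING_MODEL_VERSION)  # replay does no work
third = ingest(["alpha", "beta"], index, "embed-v2")  # new model version re-embeds
```

Because the model version is part of the key, replaying a batch costs nothing, while a model upgrade naturally produces a fresh set of entries that can be built alongside the old ones.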

Retrieval layer

At query time:

Query processing: query expansion, HyDE, or direct embedding,
Hybrid search: vector similarity + keyword/BM25 (most production systems use both),
Reranking: cross-encoder or LLM-based reranking of top-K candidates,
Context assembly: select chunks within context window budget, deduplicate, order by relevance.
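
The context assembly step can be sketched as a greedy selection under a token budget. This is a minimal illustration, not a definitive implementation: `Chunk`, `estimate_tokens`, and `assemble_context` are hypothetical names, deduplication here is exact match on normalized text, and the whitespace token count is a stand-in for the model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    score: float  # relevance score from the reranker; higher is better


def estimate_tokens(text: str) -> int:
    """Crude whitespace estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def assemble_context(candidates: list[Chunk], token_budget: int) -> list[Chunk]:
    """Deduplicate by normalized text, order by score, keep chunks that fit the budget."""
    seen: set[str] = set()
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        normalized = " ".join(chunk.text.lower().split())
        if normalized in seen:
            continue  # duplicate of a chunk that was already selected
        cost = estimate_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        seen.add(normalized)
        selected.append(chunk)
        used += cost
    return selected
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more completely, at the price of occasionally including a lower-ranked chunk.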

Generation layer

Prompt construction: system prompt + retrieved context + user query,
LLM inference: with model routing (cheap for simple, capable for complex),
Post-processing: citation extraction, hallucination checks, format validation,
Response delivery: with source attribution.
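
Prompt construction and model routing can be sketched in a few lines. The heuristic below (query length and chunk count) is an assumption chosen for illustration; production routers often use a classifier or past quality data instead. The model names and the functions `route_model` and `build_prompt` are hypothetical.

```python
CHEAP_MODEL = "small-model"    # hypothetical model name
CAPABLE_MODEL = "large-model"  # hypothetical model name


def route_model(query: str, num_chunks: int,
                max_simple_words: int = 20, max_simple_chunks: int = 3) -> str:
    """Send short queries with little context to the cheap model, the rest to the capable one."""
    is_simple = len(query.split()) <= max_simple_words and num_chunks <= max_simple_chunks
    return CHEAP_MODEL if is_simple else CAPABLE_MODEL


def build_prompt(system_prompt: str, chunks: list[str], query: str) -> str:
    """Combine system prompt, numbered context chunks, and the user query."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nContext:\n{numbered}\n\nQuestion: {query}"
```

Numbering the chunks in the prompt gives the post-processing step stable markers to match when it extracts citations.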

Section 3: Infrastructure Concerns

Reliability

Component	Failure mode	Mitigation
Vector DB	Index unavailable	Cache recent queries, fall back to keyword search
Embedding service	Rate limit / timeout	Queue with retry, fall back to cached embeddings
LLM API	Rate limit / outage	Model routing to fallback provider
Ingestion pipeline	Partial failure	Idempotent writes, dead letter queue
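
The first row of the table, falling back to keyword search when the vector index is unavailable, might look like the sketch below. `VectorIndexUnavailable` is a hypothetical exception that a vector client wrapper would raise, and both search functions are toy stand-ins.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[str]]


class VectorIndexUnavailable(Exception):
    """Raised by the (hypothetical) vector client when the index cannot be reached."""


def retrieve_with_fallback(query: str, vector_search: SearchFn,
                           keyword_search: SearchFn, top_k: int = 5) -> tuple[list[str], str]:
    """Try vector search first; on index failure, degrade to keyword search."""
    try:
        return vector_search(query, top_k), "vector"
    except VectorIndexUnavailable:
        return keyword_search(query, top_k), "keyword"


def broken_vector_search(query: str, top_k: int) -> list[str]:
    """Simulates an outage of the vector index."""
    raise VectorIndexUnavailable("index unavailable")


def toy_keyword_search(query: str, top_k: int) -> list[str]:
    """Returns documents sharing at least one term with the query."""
    docs = ["reset your password", "billing and invoices", "password policy"]
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.split())][:top_k]


results, path = retrieve_with_fallback("password reset", broken_vector_search, toy_keyword_search)
```

Returning which path served the query lets the tracing layer record degraded responses instead of hiding them.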

Cost management

Embedding cost: batch ingestion off-peak, cache embeddings by content hash,
Retrieval cost: limit top-K, use smaller embedding models for initial retrieval,
Generation cost: semantic caching, model routing, context window optimization,
Storage cost: tiered vector storage (hot index for recent, cold for archive).

Track cost per query across all three layers (ingestion, retrieval, and generation), not just LLM inference.
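
A per-query cost model can be as small as the sketch below, which also projects daily spend at different query volumes. Every price in it is a made-up placeholder, not any provider's real rate, and `QueryCost`, `token_cost`, and `daily_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryCost:
    embedding_usd: float
    retrieval_usd: float
    generation_usd: float

    @property
    def total_usd(self) -> float:
        """Sum of all layers, so generation cost is never viewed in isolation."""
        return self.embedding_usd + self.retrieval_usd + self.generation_usd


def token_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a token count at a given price per 1,000 tokens."""
    return tokens / 1000 * usd_per_1k_tokens


def daily_cost(per_query: QueryCost, queries_per_day: int) -> float:
    """Linear projection; ignores caching and volume discounts."""
    return per_query.total_usd * queries_per_day


# Placeholder prices for illustration only.
cost = QueryCost(
    embedding_usd=token_cost(50, 0.0001),
    retrieval_usd=0.0002,
    generation_usd=token_cost(3000, 0.001) + token_cost(400, 0.002),
)
for volume in (1_000, 100_000, 1_000_000):
    print(f"{volume:>9,} queries/day -> ${daily_cost(cost, volume):,.2f}/day")
```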

Observability

Per-query tracing:

query_id → embedding_cost → retrieval_latency → chunks_selected → 
generation_cost → total_cost → quality_score

Alert on: retrieval recall drops, embedding model version mismatches, generation cost spikes, ingestion lag.
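
The trace fields above map naturally onto a small record type. This is a sketch under assumptions: the field units (USD, milliseconds), the `QueryTrace` class, and the `cost_spikes` helper are hypothetical, and a real system would emit these records through its existing tracing stack.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QueryTrace:
    query_id: str
    embedding_cost: float        # USD
    retrieval_latency_ms: float
    chunks_selected: int
    generation_cost: float       # USD
    quality_score: float         # e.g. an eval score between 0 and 1

    @property
    def total_cost(self) -> float:
        """Derived from the per-layer costs so it cannot drift out of sync."""
        return self.embedding_cost + self.generation_cost

    def to_json(self) -> str:
        """Serialize the trace, including the derived total, as one log line."""
        record = asdict(self)
        record["total_cost"] = self.total_cost
        return json.dumps(record, sort_keys=True)


def cost_spikes(traces: list[QueryTrace], threshold_usd: float) -> list[str]:
    """Return the IDs of queries whose total cost exceeds the threshold."""
    return [t.query_id for t in traces if t.total_cost > threshold_usd]
```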

Data freshness

Incremental indexing: new documents available within minutes, not hours,
Stale content detection: flag chunks older than TTL,
Re-embedding pipeline: automated when embedding model version changes,
Deletion propagation: removed source documents purge from index.
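
Stale content detection and deletion propagation both reduce to simple scans when each chunk records its source document and indexing time. The sketch below assumes that metadata exists; `IndexedChunk`, `stale_chunk_ids`, and `purge_document` are hypothetical names, and the dict stands in for the vector store's metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IndexedChunk:
    chunk_id: str
    doc_id: str           # source document this chunk came from
    indexed_at: datetime  # timezone-aware timestamp of the last (re)index


def stale_chunk_ids(index: dict[str, IndexedChunk], ttl: timedelta, now: datetime) -> list[str]:
    """Flag chunks whose age exceeds the TTL."""
    return sorted(c.chunk_id for c in index.values() if now - c.indexed_at > ttl)


def purge_document(index: dict[str, IndexedChunk], doc_id: str) -> int:
    """Remove every chunk derived from a deleted source document; return the count."""
    doomed = [cid for cid, c in index.items() if c.doc_id == doc_id]
    for cid in doomed:
        del index[cid]
    return len(doomed)


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
index = {
    "c1": IndexedChunk("c1", "doc-a", now - timedelta(days=40)),
    "c2": IndexedChunk("c2", "doc-a", now - timedelta(days=2)),
    "c3": IndexedChunk("c3", "doc-b", now - timedelta(days=1)),
}
stale = stale_chunk_ids(index, ttl=timedelta(days=30), now=now)
removed = purge_document(index, "doc-a")
```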

Section 4: Vector Store Selection as Infrastructure Decision

Factor	pgvector	Dedicated (Pinecone, Weaviate)	Managed (Bedrock KB, Vertex)
Scale	< 10M vectors	10M–1B vectors	Variable
Ops burden	Low (existing Postgres)	Medium	Low
Hybrid search	Manual (Postgres FTS)	Built-in	Built-in
Cost at 1M vectors	~$50/month (existing DB)	$200–$500/month	$100–$300/month
Best for	Startups, MVP	Scale-stage	Enterprise, compliance

Start with pgvector. Migrate when retrieval latency or index size exceeds what Postgres handles comfortably, not before.
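
A quick way to reason about index size is to estimate the raw vector payload. The sketch below assumes 4-byte float32 values and an illustrative 1536-dimension embedding; it deliberately ignores index structures, metadata, and row overhead, so treat the result as a lower bound. `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw payload only: vectors x dimensions x bytes per value (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


# 1536 dimensions is an assumed example, not a recommendation.
for n in (1_000_000, 10_000_000):
    gib = raw_vector_bytes(n, 1536) / 2**30
    print(f"{n:>10,} vectors -> {gib:.1f} GiB of raw float32 data")
```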

Section 5: System Design Interview Patterns

When designing AI-native systems, always address:

Ingestion: how documents enter, how often, how re-indexing works,
Retrieval: hybrid search strategy, reranking, context window budget,
Generation: model selection, prompt strategy, fallback,
Freshness: how quickly new data is searchable,
Cost: per-query cost at 1K, 100K, 1M queries/day,
Eval: how you measure retrieval quality and generation accuracy,
Failure modes: what happens when each component is down.

Conclusion

RAG is no longer a feature you add—it is infrastructure you design. Treat the retrieval pipeline with the same rigor as your database layer: reliability, cost management, observability, and planned scaling.