The RAG Pipeline as Core Infrastructure: System Design Patterns for AI-Native Applications
Introduction
In 2023, RAG was a feature. You bolted a vector search onto your existing application, piped documents through an embedding model, and called it "AI-powered."
In 2026, RAG is core infrastructure—as fundamental to AI-native applications as the database is to traditional applications. System design interviews include RAG pipelines alongside load balancers and caches. Production architectures treat retrieval, embedding, and generation as first-class components with their own reliability, cost, and observability requirements.
This shift changes how you design systems.
Section 1: RAG as Infrastructure, Not a Feature
The feature mindset (2023)
Existing App → [RAG module added] → LLM API
RAG is an optional enhancement. The app works without it. Retrieval is a single API call.
The infrastructure mindset (2026)
                    ┌──────────────┐
User Request ──────→│ API Gateway  │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐
         │Retrieval │ │Generation│ │  Agent   │
         │ Pipeline │ │ Pipeline │ │Orchestr. │
         └────┬─────┘ └────┬─────┘ └────┬─────┘
              │            │            │
         ┌────▼─────┐  ┌───▼────┐  ┌────▼─────┐
         │ Vector DB│  │LLM API │  │Tool/MCP  │
         │ + Index  │  │        │  │ Registry │
         └──────────┘  └────────┘  └──────────┘
Retrieval, generation, and orchestration are peer infrastructure components—each with independent scaling, monitoring, and failure modes.
Section 2: The RAG Pipeline Architecture
Ingestion layer
Documents enter the system through an ingestion pipeline:
- Source connectors: file upload, API sync, database export, web crawl,
- Document processing: chunking (semantic, fixed-size, or hierarchical), metadata extraction,
- Embedding generation: batch or streaming embedding with model version tracking,
- Index writing: vector store + optional keyword index (hybrid search).
Ingestion should be asynchronous, idempotent, and versioned. That way, re-embedding a document corpus when the embedding model changes is a planned operation, not an emergency.
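One way to get idempotent, versioned writes is to key every chunk by its content hash plus the embedding model version. The sketch below is a minimal illustration under that assumption; `InMemoryIndex`, `fake_embed`, and the `embed-v1` label are hypothetical stand-ins for a real vector store and embedding service.

```python
import hashlib
from dataclasses import dataclass, field

EMBEDDING_MODEL_VERSION = "embed-v1"  # hypothetical version label


def chunk_key(doc_id: str, chunk_text: str, model_version: str) -> str:
    """Deterministic key: same content and model version give the same key."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return f"{doc_id}:{model_version}:{digest}"


@dataclass
class InMemoryIndex:
    """Stand-in for a vector store; maps chunk key to (text, vector)."""
    rows: dict = field(default_factory=dict)
    embed_calls: int = 0

    def fake_embed(self, text: str) -> list:
        # Placeholder for a real embedding call; counts invocations.
        self.embed_calls += 1
        return [float(len(text))]

    def upsert_chunks(self, doc_id, chunks, model_version=EMBEDDING_MODEL_VERSION):
        """Write only chunks not already indexed for this model version."""
        written = 0
        for text in chunks:
            key = chunk_key(doc_id, text, model_version)
            if key in self.rows:
                continue  # replayed job or unchanged chunk: no re-embedding
            self.rows[key] = (text, self.fake_embed(text))
            written += 1
        return written
```

Replaying the same job writes nothing and costs nothing, while bumping the model version produces a fresh set of keys, so a re-embedding run can proceed alongside the old index instead of overwriting it.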
Retrieval layer
At query time:
- Query processing: query expansion, HyDE, or direct embedding,
- Hybrid search: vector similarity + keyword/BM25 (most production systems use both),
- Reranking: cross-encoder or LLM-based reranking of top-K candidates,
- Context assembly: select chunks within context window budget, deduplicate, order by relevance.
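The hybrid search and context assembly steps above can be sketched in a few lines. This example merges the vector and keyword result lists with reciprocal rank fusion (one common fusion method, not necessarily the one any given system uses) and then packs chunks into a token budget. The whitespace-based token count and the constant `k=60` are simplifying assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each list adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def assemble_context(ranked_chunks, token_budget):
    """Take chunks in relevance order, skipping duplicates and overflows."""
    selected, seen, used = [], set(), 0
    for text in ranked_chunks:
        cost = len(text.split())  # crude estimate; a real tokenizer differs
        if text in seen or used + cost > token_budget:
            continue
        selected.append(text)
        seen.add(text)
        used += cost
    return selected


vector_hits = ["a", "b", "c"]   # IDs from vector similarity search
keyword_hits = ["b", "d", "a"]  # IDs from keyword/BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Items that rank well in both lists rise to the top, which is the main reason to run both searches. A reranker would normally sit between the fusion step and `assemble_context`.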
Generation layer
- Prompt construction: system prompt + retrieved context + user query,
- LLM inference: with model routing (cheap for simple, capable for complex),
- Post-processing: citation extraction, hallucination checks, format validation,
- Response delivery: with source attribution.
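As a rough illustration of the generation layer, the sketch below numbers the retrieved chunks in the prompt so the model can cite them, checks the answer for citations that point at nothing, and routes between two models. The routing heuristic and the model names `small-model` and `large-model` are hypothetical; production routers typically rely on more than query length.

```python
import re


def build_prompt(system_prompt, chunks, question):
    """System prompt + numbered retrieved context + user query."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return f"{system_prompt}\n\nSources:\n{sources}\n\nQuestion: {question}"


def invalid_citations(answer, num_chunks):
    """Return cited source numbers that do not match any supplied chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


def route_model(question, num_chunks):
    """Hypothetical heuristic: short questions with little context go cheap."""
    if len(question.split()) < 20 and num_chunks <= 3:
        return "small-model"
    return "large-model"
```

A citation check like this catches only references to sources that were never supplied. It does not verify that a cited chunk actually supports the claim, which needs a separate hallucination check.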
Section 3: Infrastructure Concerns
Reliability
| Component | Failure mode | Mitigation |
|---|---|---|
| Vector DB | Index unavailable | Cache recent queries, fall back to keyword search |
| Embedding service | Rate limit / timeout | Queue with retry, fall back to cached embeddings |
| LLM API | Rate limit / outage | Model routing to fallback provider |
| Ingestion pipeline | Partial failure | Idempotent writes, dead letter queue |
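Most of the mitigations in the table share one shape: retry the primary, then fall back to the next option. A minimal sketch of that pattern follows, assuming each provider is a plain callable. The function name, retry count, and backoff values are illustrative, not taken from any particular client library.

```python
import time


def call_with_fallback(providers, request, retries=2, base_delay=0.5):
    """Try each (name, callable) in order, retrying with exponential backoff.

    Returns (provider_name, result) from the first call that succeeds.
    """
    errors = []
    for name, fn in providers:
        for attempt in range(retries + 1):
            try:
                return name, fn(request)
            except Exception as exc:  # a real system would catch narrower types
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries:
                    time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The same wrapper covers the LLM row (primary provider, then fallback provider) and the vector DB row (vector search, then keyword search), as long as both callables accept the same request.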
Cost management
- Embedding cost: batch ingestion off-peak, cache embeddings by content hash,
- Retrieval cost: limit top-K, use smaller embedding models for initial retrieval,
- Generation cost: semantic caching, model routing, context window optimization,
- Storage cost: tiered vector storage (hot index for recent, cold for archive).
Track cost per query across all three layers—not just LLM inference.
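To make the per-layer breakdown concrete, here is a small cost model. All prices are placeholders chosen for easy arithmetic, not real vendor rates, and the flat per-query retrieval charge is an assumption; substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices in USD."""
    embed_per_1k_tokens: float
    llm_input_per_1k_tokens: float
    llm_output_per_1k_tokens: float
    vector_query_flat: float  # assumed flat charge per retrieval call


def cost_per_query(p, query_tokens, context_tokens, output_tokens):
    """Break one query's cost down by layer so spikes can be attributed."""
    embedding = query_tokens / 1000 * p.embed_per_1k_tokens
    retrieval = p.vector_query_flat
    generation = (
        (query_tokens + context_tokens) / 1000 * p.llm_input_per_1k_tokens
        + output_tokens / 1000 * p.llm_output_per_1k_tokens
    )
    return {
        "embedding": embedding,
        "retrieval": retrieval,
        "generation": generation,
        "total": embedding + retrieval + generation,
    }
```

Because retrieved context is billed as LLM input, trimming the context window usually moves the generation line far more than any change to the embedding or retrieval lines.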
Observability
Per-query tracing:
query_id → embedding_cost → retrieval_latency → chunks_selected →
generation_cost → total_cost → quality_score
Alert on: retrieval recall drops, embedding model version mismatches, generation cost spikes, ingestion lag.
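The trace above maps naturally onto a single record per query. The sketch below is one possible shape: the two model-version fields are added so a mismatch can be detected, and the alert names and thresholds are hypothetical. Real alerting would usually evaluate aggregates over a time window rather than single queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTrace:
    query_id: str
    embedding_cost: float        # USD
    retrieval_latency_ms: float
    chunks_selected: int
    generation_cost: float       # USD
    quality_score: float         # 0.0 to 1.0, from whichever evaluator is used
    index_model_version: str     # embedding model that built the index
    query_model_version: str     # embedding model used for this query

    @property
    def total_cost(self) -> float:
        return self.embedding_cost + self.generation_cost


def alerts_for(trace, max_generation_cost=0.05, min_quality=0.5):
    """Return alert names triggered by one trace (thresholds are assumptions)."""
    alerts = []
    if trace.index_model_version != trace.query_model_version:
        alerts.append("embedding_model_version_mismatch")
    if trace.generation_cost > max_generation_cost:
        alerts.append("generation_cost_spike")
    if trace.chunks_selected == 0:
        alerts.append("empty_retrieval")
    if trace.quality_score < min_quality:
        alerts.append("low_quality_score")
    return alerts
```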
Data freshness
- Incremental indexing: new documents available within minutes, not hours,
- Stale content detection: flag chunks older than TTL,
- Re-embedding pipeline: automated when embedding model version changes,
- Deletion propagation: removing a source document purges its chunks from the index.
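Stale content detection and deletion propagation both depend on each chunk carrying its source document ID and an indexing timestamp. A minimal sketch, assuming the index is a plain dict of chunk metadata (a real vector store would expose metadata filters for the same purpose):

```python
from datetime import datetime, timedelta, timezone


def stale_chunk_ids(index, ttl, now=None):
    """Return IDs of chunks whose indexed_at is older than the TTL."""
    now = now or datetime.now(timezone.utc)
    return sorted(cid for cid, row in index.items() if now - row["indexed_at"] > ttl)


def purge_document(index, doc_id):
    """Remove every chunk that came from doc_id; return how many were removed."""
    doomed = [cid for cid, row in index.items() if row["doc_id"] == doc_id]
    for cid in doomed:
        del index[cid]
    return len(doomed)
```

Without a `doc_id` on every chunk, deletion propagation degrades into a full re-index, so it is worth storing at ingestion time even if nothing reads it yet.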
Section 4: Vector Store Selection as Infrastructure Decision
| Factor | pgvector | Dedicated (Pinecone, Weaviate) | Managed (Bedrock KB, Vertex) |
|---|---|---|---|
| Scale | < 10M vectors | 10M–1B vectors | Variable |
| Ops burden | Low (existing Postgres) | Medium | Low |
| Hybrid search | Manual (Postgres FTS) | Built-in | Built-in |
| Cost at 1M vectors | ~$50/month (existing DB) | $200–$500/month | $100–$300/month |
| Best for | Startups, MVP | Scale-stage | Enterprise, compliance |
Start with pgvector. Migrate when retrieval latency or index size exceeds what your Postgres instance handles comfortably, not before.
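A quick back-of-the-envelope check helps decide when index size is becoming a concern. The sketch below computes raw float32 storage only; it ignores index structures, metadata, and row overhead, all of which add to the total. The 1536-dimension figure is an assumption, so use your embedding model's actual dimensionality.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, excluding any index or row overhead."""
    return num_vectors * dimensions * bytes_per_value


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


for n in (1_000_000, 10_000_000, 100_000_000):
    size = to_gib(raw_vector_bytes(n, 1536))
    print(f"{n:>11,} vectors x 1536 dims ~ {size:.1f} GiB raw")
```

At 1M vectors of 1536 dimensions this comes to roughly 5.7 GiB of raw vector data, which fits in memory on a modest instance. The same arithmetic at 100M vectors gives a number that usually forces the migration conversation.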
Section 5: System Design Interview Patterns
When designing AI-native systems, always address:
- Ingestion: how documents enter, how often, how re-indexing works,
- Retrieval: hybrid search strategy, reranking, context window budget,
- Generation: model selection, prompt strategy, fallback,
- Freshness: how quickly new data is searchable,
- Cost: per-query cost at 1K, 100K, 1M queries/day,
- Eval: how you measure retrieval quality and generation accuracy,
- Failure modes: what happens when each component is down.
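For the eval item, it helps to name concrete metrics. Two common retrieval metrics are recall@k and mean reciprocal rank, sketched below under the assumption that a labeled set of relevant chunk IDs exists for each test query. Generation accuracy needs separate evaluation and is not covered here.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(results):
    """Average of 1 / rank of the first relevant hit over all queries.

    `results` is a list of (retrieved_ids, relevant_ids) pairs; a query with
    no relevant hit contributes 0.
    """
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        relevant = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

Tracking these on a fixed test set before and after any change to chunking, embedding model, or reranker gives a direct signal for the "retrieval recall drops" alert.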
Conclusion
RAG is no longer a feature you add—it is infrastructure you design. Treat the retrieval pipeline with the same rigor as your database layer: reliability, cost management, observability, and planned scaling.