2026-06-08 6 min read Tanuj Garg

The RAG Pipeline as Core Infrastructure: System Design Patterns for AI-Native Applications

AI & Automation | #RAG #Vector Database #AI Architecture #System Design #Infrastructure

Introduction

In 2023, RAG was a feature. You bolted a vector search onto your existing application, piped documents through an embedding model, and called it "AI-powered."

In 2026, RAG is core infrastructure—as fundamental to AI-native applications as the database is to traditional applications. System design interviews include RAG pipelines alongside load balancers and caches. Production architectures treat retrieval, embedding, and generation as first-class components with their own reliability, cost, and observability requirements.

This shift changes how you design systems.


Section 1: RAG as Infrastructure, Not a Feature

The feature mindset (2023)

Existing App → [RAG module added] → LLM API

RAG is an optional enhancement. The app works without it. Retrieval is a single API call.

The infrastructure mindset (2026)

                    ┌──────────────┐
User Request ──────→│ API Gateway  │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │Retrieval │ │Generation│ │  Agent   │
        │ Pipeline │ │ Pipeline │ │Orchestr. │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
             │            │            │
        ┌────▼─────┐ ┌───▼────┐ ┌────▼─────┐
        │ Vector DB│ │LLM API │ │Tool/MCP  │
        │ + Index  │ │        │ │ Registry │
        └──────────┘ └────────┘ └──────────┘

Retrieval, generation, and orchestration are peer infrastructure components—each with independent scaling, monitoring, and failure modes.


Section 2: The RAG Pipeline Architecture

Ingestion layer

Documents enter the system through an ingestion pipeline:

  1. Source connectors: file upload, API sync, database export, web crawl,
  2. Document processing: chunking (semantic, fixed-size, or hierarchical), metadata extraction,
  3. Embedding generation: batch or streaming embedding with model version tracking,
  4. Index writing: vector store + optional keyword index (hybrid search).

Ingestion is async, idempotent, and versioned. Re-embedding a document corpus when the model changes is a planned operation, not an emergency.
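One way to make ingestion idempotent and versioned is to key every chunk by a hash of its content plus the embedding model version. The sketch below assumes that approach. `InMemoryIndex`, `fake_embed`, and the key format are hypothetical stand-ins for illustration, not any particular vector store's API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a vector store: key -> (vector, metadata)."""
    rows: dict = field(default_factory=dict)

    def upsert(self, key: str, vector: list[float], metadata: dict) -> bool:
        """Write the row; return False (no-op) when the key already exists."""
        if key in self.rows:
            return False
        self.rows[key] = (vector, metadata)
        return True


def chunk_key(text: str, model_version: str) -> str:
    """Deterministic key: same content and same model version give the same key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding (assumption); a real pipeline calls a model here."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def ingest(chunks: list[str], index: InMemoryIndex, model_version: str) -> int:
    """Embed and write chunks; return how many rows were newly written."""
    written = 0
    for text in chunks:
        key = chunk_key(text, model_version)
        if key in index.rows:
            continue  # already indexed under this model version: skip the embedding call
        if index.upsert(key, fake_embed(text), {"model_version": model_version}):
            written += 1
    return written
```

Re-running `ingest` on the same chunks with the same `model_version` writes nothing, which is what makes retries safe. Passing a new `model_version` produces new keys, so a re-embedding run can fill a parallel set of rows before traffic is switched over.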

Retrieval layer

At query time:

  1. Query processing: query expansion, HyDE, or direct embedding,
  2. Hybrid search: vector similarity + keyword/BM25 (most production systems use both),
  3. Reranking: cross-encoder or LLM-based reranking of top-K candidates,
  4. Context assembly: select chunks within context window budget, deduplicate, order by relevance.
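The hybrid search and context assembly steps can be sketched as follows. The article does not say how vector and keyword results are merged; this example uses reciprocal rank fusion (RRF), one common choice, and skips the reranking step. The whitespace-based token count and all function names are assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def assemble_context(
    ranked_ids: list[str], chunks: dict[str, str], token_budget: int
) -> list[str]:
    """Walk chunks in rank order, skipping duplicates and chunks that overflow the budget."""
    selected: list[str] = []
    seen_texts: set[str] = set()
    used = 0
    for doc_id in ranked_ids:
        text = chunks[doc_id]
        cost = len(text.split())  # crude token estimate (assumption)
        if text in seen_texts or used + cost > token_budget:
            continue
        selected.append(doc_id)
        seen_texts.add(text)
        used += cost
    return selected


vector_hits = ["a", "b", "c"]   # IDs ordered by vector similarity
keyword_hits = ["b", "d", "a"]  # IDs ordered by BM25 score
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of comparing cosine similarity with BM25 values on different scales. A production tokenizer would replace the whitespace split in `assemble_context`.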

Generation layer

  1. Prompt construction: system prompt + retrieved context + user query,
  2. LLM inference: with model routing (a cheaper model for simple queries, a more capable one for complex ones),
  3. Post-processing: citation extraction, hallucination checks, format validation,
  4. Response delivery: with source attribution.
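A minimal sketch of the generation steps above, assuming chunks arrive as `(source, text)` pairs and the model is asked to cite them as `[1]`, `[2]`, and so on. The routing thresholds and the model names `small-model` and `large-model` are placeholders, not recommendations.

```python
import re


def build_prompt(system: str, chunks: list[tuple[str, str]], question: str) -> str:
    """Number each retrieved chunk so the model can cite sources as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"


def route_model(question: str, num_chunks: int) -> str:
    """Toy heuristic (assumption): short questions with little context use the cheap model."""
    if len(question.split()) <= 12 and num_chunks <= 3:
        return "small-model"
    return "large-model"


def cited_sources(answer: str, chunks: list[tuple[str, str]]) -> list[str]:
    """Map [n] markers in the answer back to source names, ignoring out-of-range markers."""
    sources: list[str] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        n = int(match)
        if 1 <= n <= len(chunks) and chunks[n - 1][0] not in sources:
            sources.append(chunks[n - 1][0])
    return sources
```

A citation marker that points outside the supplied context (for example `[3]` when only two chunks were sent) is a cheap hallucination signal worth logging, although this sketch simply drops it.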

Section 3: Infrastructure Concerns

Reliability

| Component | Failure mode | Mitigation |
|---|---|---|
| Vector DB | Index unavailable | Cache recent queries, fallback to keyword search |
| Embedding service | Rate limit / timeout | Queue with retry, fallback to cached embeddings |
| LLM API | Rate limit / outage | Model routing to fallback provider |
| Ingestion pipeline | Partial failure | Idempotent writes, dead letter queue |
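
The "fallback provider" mitigation can be sketched as an ordered chain of callables that is tried until one succeeds. This is a simplified illustration: `call_with_fallback` and `AllProvidersFailed` are hypothetical names, and a real system would add timeouts, retry budgets, and circuit breaking.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Simulated outage for the example."""
    raise TimeoutError("rate limited")


def stable_backup(prompt: str) -> str:
    return f"answer to: {prompt}"


used, reply = call_with_fallback(
    [("primary", flaky_primary), ("backup", stable_backup)], "hello"
)
```

The same shape works for the vector DB row: put vector search first and keyword search second. Returning the provider name alongside the response lets the trace record which path actually served the query.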

Cost management

  • Embedding cost: batch ingestion off-peak, cache embeddings by content hash,
  • Retrieval cost: limit top-K, use smaller embedding models for initial retrieval,
  • Generation cost: semantic caching, model routing, context window optimization,
  • Storage cost: tiered vector storage (hot index for recent, cold for archive).

Track cost per query across all three layers—not just LLM inference.
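
As a rough illustration of tracking cost across all three layers, the sketch below breaks one query's cost into embedding, retrieval, and generation. Every price in `Prices` is a made-up placeholder, and the flat per-query retrieval charge is a simplifying assumption; substitute your providers' real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices in USD, for illustration only."""
    embed_per_1k_tokens: float = 0.0001
    retrieval_per_query: float = 0.00005
    llm_input_per_1k_tokens: float = 0.001
    llm_output_per_1k_tokens: float = 0.003


def query_cost(
    prices: Prices, query_tokens: int, context_tokens: int, output_tokens: int
) -> dict[str, float]:
    """Return the per-layer cost breakdown and the total for a single query."""
    embedding = query_tokens / 1000 * prices.embed_per_1k_tokens
    retrieval = prices.retrieval_per_query
    generation = (
        (query_tokens + context_tokens) / 1000 * prices.llm_input_per_1k_tokens
        + output_tokens / 1000 * prices.llm_output_per_1k_tokens
    )
    return {
        "embedding": embedding,
        "retrieval": retrieval,
        "generation": generation,
        "total": embedding + retrieval + generation,
    }


breakdown = query_cost(Prices(), query_tokens=50, context_tokens=3000, output_tokens=400)
daily_cost_at_100k = breakdown["total"] * 100_000
```

Because retrieved context counts as LLM input, the context window budget chosen in the retrieval layer shows up directly in the generation line of the breakdown.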

Observability

Per-query tracing:

query_id → embedding_cost → retrieval_latency → chunks_selected → 
generation_cost → total_cost → quality_score

Alert on: retrieval recall drops, embedding model version mismatches, generation cost spikes, ingestion lag.
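
The trace fields above map naturally onto a small record type. In the sketch below, the two model-version fields and the `max_generation_cost` threshold are assumptions added for illustration. Only the per-query alerts are shown, since recall drops and ingestion lag need metrics aggregated over many queries.

```python
from dataclasses import dataclass


@dataclass
class QueryTrace:
    """One record per query, mirroring the trace fields listed above."""
    query_id: str
    embedding_cost: float
    retrieval_latency_ms: float
    chunks_selected: int
    generation_cost: float
    quality_score: float
    index_model_version: str  # model that embedded the stored chunks (assumed field)
    query_model_version: str  # model that embedded this query (assumed field)

    @property
    def total_cost(self) -> float:
        return self.embedding_cost + self.generation_cost


def check_alerts(trace: QueryTrace, max_generation_cost: float) -> list[str]:
    """Return the names of per-query alert conditions that fired."""
    alerts: list[str] = []
    if trace.index_model_version != trace.query_model_version:
        alerts.append("embedding_model_version_mismatch")
    if trace.generation_cost > max_generation_cost:
        alerts.append("generation_cost_spike")
    return alerts
```

A version mismatch means the query vector and the stored vectors come from different embedding spaces, so similarity scores are not meaningful even though the search still returns results.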

Data freshness

  • Incremental indexing: new documents available within minutes, not hours,
  • Stale content detection: flag chunks older than TTL,
  • Re-embedding pipeline: automated when embedding model version changes,
  • Deletion propagation: removed source documents are purged from the index.
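
Stale content detection and deletion propagation can both be expressed as scans over chunk metadata. This sketch assumes each chunk stores a `source_id` and an `indexed_at` timestamp; both field names are hypothetical, and a real index would run these as filtered queries rather than Python loops.

```python
from datetime import datetime, timedelta, timezone


def stale_chunk_ids(chunks: dict[str, dict], ttl: timedelta, now: datetime) -> list[str]:
    """Return IDs of chunks indexed longer ago than the TTL."""
    return sorted(
        cid for cid, meta in chunks.items() if now - meta["indexed_at"] > ttl
    )


def propagate_deletions(chunks: dict[str, dict], live_source_ids: set[str]) -> list[str]:
    """Delete chunks whose source document no longer exists; return the removed IDs."""
    orphaned = sorted(
        cid for cid, meta in chunks.items() if meta["source_id"] not in live_source_ids
    )
    for cid in orphaned:
        del chunks[cid]
    return orphaned


now = datetime(2026, 6, 8, tzinfo=timezone.utc)
index = {
    "c1": {"source_id": "docA", "indexed_at": now - timedelta(days=40)},
    "c2": {"source_id": "docB", "indexed_at": now - timedelta(days=2)},
    "c3": {"source_id": "docC", "indexed_at": now - timedelta(days=1)},
}
stale = stale_chunk_ids(index, ttl=timedelta(days=30), now=now)
removed = propagate_deletions(index, live_source_ids={"docA", "docB"})
```

Passing `now` in explicitly keeps the staleness check deterministic and easy to test.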

Section 4: Vector Store Selection as Infrastructure Decision

| Factor | pgvector | Dedicated (Pinecone, Weaviate) | Managed (Bedrock KB, Vertex) |
|---|---|---|---|
| Scale | < 10M vectors | 10M–1B vectors | Variable |
| Ops burden | Low (existing Postgres) | Medium | Low |
| Hybrid search | Manual (Postgres FTS) | Built-in | Built-in |
| Cost at 1M vectors | ~$50/month (existing DB) | $200–$500/month | $100–$300/month |
| Best for | Startups, MVP | Scale-stage | Enterprise, compliance |

Start with pgvector. Migrate when retrieval latency or index size exceeds what Postgres handles comfortably, not before.


Section 5: System Design Interview Patterns

When designing AI-native systems, always address:

  1. Ingestion: how documents enter, how often, how re-indexing works,
  2. Retrieval: hybrid search strategy, reranking, context window budget,
  3. Generation: model selection, prompt strategy, fallback,
  4. Freshness: how quickly new data is searchable,
  5. Cost: per-query cost at 1K, 100K, 1M queries/day,
  6. Eval: how you measure retrieval quality and generation accuracy,
  7. Failure modes: what happens when each component is down.

Conclusion

RAG is no longer a feature you add—it is infrastructure you design. Treat the retrieval pipeline with the same rigor as your database layer: reliability, cost management, observability, and planned scaling.
