Vector Databases at Scale: Architecture Patterns for High-Throughput AI Applications
Introduction
Every AI application that uses embeddings eventually hits the same wall: your vector database can't keep up with the read/write throughput, or similarity search latency spikes as your data grows beyond millions of vectors.
Vector databases are fundamentally different from traditional databases. They optimize for approximate nearest neighbor (ANN) search across high-dimensional vectors, not exact matches or range queries. This difference cascades into every architectural decision—how you shard, how you index, how you handle writes, and how you scale reads.
If you're building an AI application with embeddings—semantic search, RAG systems, recommendation engines, or agent memory—you need a vector database architecture that can handle your read/write patterns at scale.
Section 1: Read/Write Patterns in AI Applications
Understanding your workload is the foundation of vector database architecture. AI applications typically exhibit distinct patterns:
High-write scenarios
- Ingestion pipelines: bulk-loading millions of documents and their embeddings,
- Real-time updates: AI agents generating new content and storing embeddings on-the-fly,
- Versioned embeddings: re-embedding content when models update, requiring dual-write or migration strategies.
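For the versioned-embeddings case, a common migration pattern is to dual-write: while the backfill with the new model runs, every new or updated document is embedded with both models and written to both collections. A minimal sketch, assuming hypothetical old/new collection clients and embedding functions supplied by the caller:

interface VectorCollection {
  // Minimal client surface assumed for this sketch; real clients differ.
  upsert(items: { id: string; values: number[] }[]): Promise<void>;
}

type EmbedFn = (text: string) => Promise<number[]>;

// Dual-write during a model migration: the old collection keeps serving
// queries while the new collection fills with the new model's vectors.
async function upsertDuringMigration(
  doc: { id: string; text: string },
  embedOld: EmbedFn,
  embedNew: EmbedFn,
  oldCollection: VectorCollection,
  newCollection: VectorCollection
): Promise<void> {
  const [oldVec, newVec] = await Promise.all([embedOld(doc.text), embedNew(doc.text)]);
  await Promise.all([
    oldCollection.upsert([{ id: doc.id, values: oldVec }]),
    newCollection.upsert([{ id: doc.id, values: newVec }]),
  ]);
}

Once the backfill finishes, reads cut over to the new collection and the dual-write can be dropped.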
High-read scenarios
- Query-time retrieval: every user query triggers multiple vector searches (RAG retrieval, semantic search),
- Agent reasoning: AI agents performing multiple similarity searches per reasoning step,
- Batch inference: processing large datasets that require vector lookups for each item.
The challenge
Most vector databases handle either high-read OR high-write well—not both simultaneously. Pinecone, for example, optimizes for read-heavy workloads. Qdrant and Milvus offer better write throughput. Choose based on your dominant pattern, then architect around the weaker dimension.
Section 2: Sharding Strategies for Billion-Scale Vectors
When your vector collection grows beyond what a single node can handle—whether due to memory, storage, or query latency—you need sharding.
Hash-based sharding
Route vectors to shards based on a hash of the vector ID or a metadata field:
function getShardId(vectorId: string, shardCount: number): number {
  const hash = murmurhash(vectorId);
  return hash % shardCount;
}

// Query across shards
async function searchAcrossShards(queryVector: number[], topK: number) {
  const results = await Promise.all(
    shards.map(shard => shard.search(queryVector, topK))
  );
  // Merge and re-rank results from all shards
  return mergeResults(results, topK);
}
Tradeoff: queries must hit all shards (scatter-gather), but writes are evenly distributed.
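The mergeResults helper above is left undefined. A minimal sketch, assuming each shard returns results carrying an id and a similarity score where higher means more similar (the exact result shape varies by database):

interface ShardResult {
  id: string;
  score: number; // similarity score; higher is better in this sketch
}

// Merge per-shard result lists, de-duplicate by id, and keep the global top-K.
function mergeResults(perShardResults: ShardResult[][], topK: number): ShardResult[] {
  const best = new Map<string, ShardResult>();
  for (const results of perShardResults) {
    for (const r of results) {
      const existing = best.get(r.id);
      if (!existing || r.score > existing.score) {
        best.set(r.id, r);
      }
    }
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

If your database reports distances instead of similarities, flip the comparison and the sort order.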
Metadata-based sharding
Shard by a meaningful dimension—tenant ID, content type, or time range:
function getShardForVector(metadata: VectorMetadata): Shard {
  // Shard by tenant for multi-tenant apps
  return shardsByTenant[metadata.tenantId];
}
Tradeoff: queries for a specific tenant hit only one shard (fast), but tenant data may be unevenly distributed.
Hybrid approach
Use metadata-based sharding for query routing, with hash-based sharding within each metadata partition for even distribution:
function getShard(vector: Vector): Shard {
  const partition = getPartitionByMetadata(vector.metadata);
  const shardIndex = hash(vector.id) % partition.shardCount;
  return partition.shards[shardIndex];
}
Section 3: Indexing Strategies for Write-Heavy Workloads
Vector indexes (HNSW, IVF, LSH) are expensive to build and update. Your indexing strategy determines your write throughput.
HNSW (Hierarchical Navigable Small World)
The most popular ANN algorithm—excellent query performance, but expensive to update:
- Best for: read-heavy workloads with moderate write rates,
- Write cost: each insert requires traversing and updating multiple layers,
- Optimization: batch inserts rather than single-vector writes.
// Batch your writes for HNSW indexes
async function batchUpsertVectors(vectors: Vector[], batchSize: number = 1000) {
  const results = [];
  for (let i = 0; i < vectors.length; i += batchSize) {
    const batch = vectors.slice(i, i + batchSize);
    results.push(await vectorDb.upsert(batch));
  }
  return results;
}
IVF (Inverted File Index)
Divides vectors into clusters (Voronoi cells). Faster to update than HNSW:
- Best for: write-heavy workloads where query latency is less critical,
- Write cost: assign vector to nearest centroid—O(n_centroids), not O(log n),
- Tradeoff: query accuracy may be lower than HNSW.
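To make that write cost concrete, here is a rough sketch of the IVF insert path, assuming the centroids have already been trained: finding the nearest centroid is a linear scan over the centroid list, after which the vector is appended to that centroid's inverted list:

// Squared Euclidean distance; enough to pick the nearest centroid.
function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// IVF insert: O(n_centroids * dim) to find the closest centroid,
// then an O(1) append to that centroid's inverted list.
function ivfInsert(
  vector: { id: string; values: number[] },
  centroids: number[][],
  invertedLists: Map<number, { id: string; values: number[] }[]>
): void {
  let bestIdx = 0;
  let bestDist = Infinity;
  for (let c = 0; c < centroids.length; c++) {
    const dist = squaredDistance(vector.values, centroids[c]);
    if (dist < bestDist) {
      bestDist = dist;
      bestIdx = c;
    }
  }
  const list = invertedLists.get(bestIdx) ?? [];
  list.push(vector);
  invertedLists.set(bestIdx, list);
}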
Flat indexes for write-heavy scenarios
If you're writing faster than you can index, consider a two-tier approach:
// Tier 1: Fast writes to a flat (brute-force) index
await writeOptimizedIndex.upsert(vectors);

// Tier 2: Periodically migrate to optimized index
async function migrateToOptimizedIndex() {
  const vectors = await writeOptimizedIndex.getAll();
  await optimizedIndex.upsert(vectors);
  // In production, delete only the migrated ids so writes that arrive
  // during the migration aren't dropped by a blanket clear().
  await writeOptimizedIndex.clear();
}
Section 4: Handling Massive Read Throughput
When your AI application serves thousands of similarity searches per second, single-node vector search becomes a bottleneck.
Read replicas
Deploy multiple read replicas of your vector database. Route search queries to replicas, keeping the primary for writes:
class VectorSearchRouter {
  private replicas: VectorDatabase[];
  private primary: VectorDatabase;

  async search(queryVector: number[], topK: number): Promise<SearchResult[]> {
    // Round-robin or latency-based routing to replicas
    const replica = this.getLeastLoadedReplica();
    return replica.search(queryVector, topK);
  }

  async upsert(vectors: Vector[]): Promise<void> {
    // Writes go to primary, replicate asynchronously
    await this.primary.upsert(vectors);
  }
}
Caching hot queries
AI applications often see repeated or similar queries. Cache similarity search results:
class CachedVectorSearch {
  private cache: LRUCache<string, SearchResult[]>;
  private vectorDb: VectorDatabase;

  async search(queryVector: number[], topK: number): Promise<SearchResult[]> {
    const cacheKey = this.hashQuery(queryVector, topK);
    if (this.cache.has(cacheKey)) {
      return this.cache.get(cacheKey)!;
    }
    const results = await this.vectorDb.search(queryVector, topK);
    this.cache.set(cacheKey, results);
    return results;
  }

  private hashQuery(vector: number[], topK: number): string {
    // Quantize the query vector so near-identical queries share a cache key
    const quantized = vector.map(v => Math.round(v * 1000));
    return `${quantized.join(",")}-${topK}`;
  }
}
Quantization for faster reads
Reduce the precision or dimensionality of stored vectors so distance computations are cheaper at query time:
- Product Quantization (PQ): compress vectors into compact codes,
- Scalar Quantization: reduce float32 to int8 or binary (a minimal sketch follows this list),
- Dimensionality reduction: use PCA to reduce dimensions before indexing.
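As an illustration of the scalar quantization bullet above, the sketch below maps float32 components to int8 using a per-vector scale; production systems usually calibrate scales per dimension or over the whole collection, so treat this as a simplified assumption:

// Scalar quantization sketch: one scale per vector, int8 codes.
function quantizeToInt8(vector: number[]): { codes: Int8Array; scale: number } {
  const maxAbs = Math.max(...vector.map(Math.abs), 1e-12);
  const scale = maxAbs / 127;
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    codes[i] = Math.max(-127, Math.min(127, Math.round(vector[i] / scale)));
  }
  return { codes, scale };
}

// Approximate reconstruction for distance computations.
function dequantize(codes: Int8Array, scale: number): number[] {
  return Array.from(codes, c => c * scale);
}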
Section 5: Managing Storage for Massive Vector Collections
A billion 1536-dimensional float32 vectors consume ~6TB of raw storage (1B × 1536 × 4 bytes). Add indexes, and you're easily at 10–15TB.
Tiered storage architecture
Not all vectors need the same performance:
class TieredVectorStorage {
  async search(queryVector: number[], topK: number) {
    // Search hot storage first (recent/high-traffic vectors)
    let results = await this.hotStorage.search(queryVector, topK);
    if (results.length < topK) {
      // Fall back to warm storage
      const warmResults = await this.warmStorage.search(queryVector, topK - results.length);
      results = [...results, ...warmResults];
    }
    return results.slice(0, topK);
  }
}
- Hot storage: SSD-backed, in-memory indexes, recent or frequently accessed vectors,
- Warm storage: SSD-backed, disk-based indexes, older vectors,
- Cold storage: object storage (S3), vectors archived and restored on demand.
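Movement between tiers can be driven by access recency. A minimal demotion sketch, assuming each tier exposes the same listAll/upsert/delete surface and that last-access timestamps are tracked alongside each vector (both are assumptions of this sketch, not a specific product API):

interface StoredVector {
  id: string;
  values: number[];
  lastAccessedMs: number; // assumed to be tracked by the application
}

interface VectorTier {
  listAll(): Promise<StoredVector[]>;
  upsert(vectors: StoredVector[]): Promise<void>;
  delete(ids: string[]): Promise<void>;
}

// Demote vectors untouched for longer than maxAgeMs from a hotter tier to a
// colder one; run this periodically (e.g. as a nightly job).
async function demoteStaleVectors(from: VectorTier, to: VectorTier, maxAgeMs: number): Promise<number> {
  const now = Date.now();
  const stale = (await from.listAll()).filter(v => now - v.lastAccessedMs > maxAgeMs);
  if (stale.length === 0) return 0;
  await to.upsert(stale);
  await from.delete(stale.map(v => v.id));
  return stale.length;
}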
Compression techniques
Reduce storage footprint without sacrificing too much accuracy:
- Binary quantization: convert float vectors to binary (32x compression),
- Product quantization: compress vectors to a few bytes (10–50x compression),
- Dimensionality reduction: reduce from 1536 to 384 dimensions if your use case allows.
function compressVector(vector: number[], method: 'binary' | 'pq'): Buffer {
  if (method === 'binary') {
    // Convert to binary based on sign (1 bit per dimension)
    const bits = new Uint8Array(Math.ceil(vector.length / 8));
    vector.forEach((v, i) => {
      if (v > 0) bits[Math.floor(i / 8)] |= (1 << (i % 8));
    });
    return Buffer.from(bits);
  }
  // PQ implementation...
  throw new Error("PQ compression not shown in this sketch");
}
Section 6: Consistency Models for AI Applications
Vector databases for AI applications face a consistency challenge: how soon after a write should that vector be searchable?
Eventual consistency
Most vector databases default to eventual consistency. Writes propagate to indexes and replicas asynchronously:
- Pros: high write throughput, simpler architecture,
- Cons: recently written vectors may not appear in search results immediately,
- Mitigation: for user-facing writes, explicitly refresh the index or use a read-your-writes pattern.
async function writeAndVerify(vector: Vector) {
  await vectorDb.upsert([vector]);
  // Force index refresh for immediate availability
  await vectorDb.refreshIndex();
  // Verify the vector is searchable
  const results = await vectorDb.search(vector.values, 1);
  if (results[0]?.id !== vector.id) {
    throw new Error("Vector not immediately searchable");
  }
}
Strong consistency for critical applications
If your AI application requires that written vectors are immediately searchable:
- Use a single-node setup (limits scale),
- Or implement a write-through cache that serves reads for recently written vectors (a sketch follows this list),
- Or use a synchronous replication model (impacts write latency).
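One way to approximate read-your-writes on top of an eventually consistent index is the write-through cache mentioned above: keep recently written vectors in a small in-memory buffer, search it brute-force, and merge those hits with the ANN results. A minimal sketch, reusing the Vector, VectorDatabase, and SearchResult shapes from the earlier snippets and assuming SearchResult carries an id and a similarity score:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1e-12);
}

class ReadYourWritesSearch {
  // Recently written vectors, searched brute-force until the index catches up.
  private recentWrites = new Map<string, { values: number[]; writtenAtMs: number }>();

  constructor(private vectorDb: VectorDatabase, private bufferTtlMs: number = 60_000) {}

  async upsert(vectors: Vector[]): Promise<void> {
    await this.vectorDb.upsert(vectors);
    const now = Date.now();
    for (const v of vectors) {
      this.recentWrites.set(v.id, { values: v.values, writtenAtMs: now });
    }
  }

  async search(queryVector: number[], topK: number): Promise<SearchResult[]> {
    // Evict buffer entries old enough to have been indexed.
    const now = Date.now();
    for (const [id, entry] of this.recentWrites) {
      if (now - entry.writtenAtMs > this.bufferTtlMs) this.recentWrites.delete(id);
    }
    // Brute-force over the buffer, ANN over the main index, then merge by id.
    const buffered: SearchResult[] = [...this.recentWrites.entries()].map(([id, e]) => ({
      id,
      score: cosineSimilarity(queryVector, e.values),
    }));
    const indexed = await this.vectorDb.search(queryVector, topK);
    const merged = new Map<string, SearchResult>();
    for (const r of [...indexed, ...buffered]) {
      const prev = merged.get(r.id);
      if (!prev || r.score > prev.score) merged.set(r.id, r);
    }
    return [...merged.values()].sort((a, b) => b.score - a.score).slice(0, topK);
  }
}

The buffer TTL should roughly match how long your index takes to make new writes searchable.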
Section 7: Monitoring and Optimizing Vector Database Performance
Key metrics to track for vector database health:
Write performance
- Write latency: time to upsert a batch of vectors,
- Write throughput: vectors per second,
- Index build time: time to incorporate new vectors into the index.
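A simple harness for the first two metrics, assuming the same vectorDb client and Vector type used in the earlier snippets:

// Measure per-batch write latency and overall write throughput.
async function measureWritePerformance(
  vectors: Vector[],
  batchSize: number = 1000
): Promise<{ latencyMsPerBatch: number[]; vectorsPerSecond: number }> {
  const latencyMsPerBatch: number[] = [];
  const start = Date.now();
  for (let i = 0; i < vectors.length; i += batchSize) {
    const batch = vectors.slice(i, i + batchSize);
    const batchStart = Date.now();
    await vectorDb.upsert(batch);
    latencyMsPerBatch.push(Date.now() - batchStart);
  }
  const elapsedSeconds = (Date.now() - start) / 1000;
  return {
    latencyMsPerBatch,
    vectorsPerSecond: vectors.length / Math.max(elapsedSeconds, 1e-9),
  };
}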
Read performance
- Query latency (p50, p99): time for similarity search,
- Query throughput: queries per second,
- Recall accuracy: fraction of true nearest neighbors returned by ANN search.
async function measureRecall(
  queryVectors: number[][],
  groundTruth: number[][], // exact nearest neighbors
  topK: number
): Promise<number> {
  let totalRecall = 0;
  for (let i = 0; i < queryVectors.length; i++) {
    const annResults = await vectorDb.search(queryVectors[i], topK);
    const annIds = new Set(annResults.map(r => r.id));
    const trueNearest = groundTruth[i].slice(0, topK);
    const overlap = trueNearest.filter(id => annIds.has(id)).length;
    totalRecall += overlap / topK;
  }
  return totalRecall / queryVectors.length;
}
Storage metrics
- Storage per vector: including index overhead,
- Index size ratio: index size / raw vector size,
- Compression ratio: original size / compressed size.
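All three ratios can be computed from figures most databases expose (vector count, dimensionality, index size on disk). A small sketch, assuming float32 storage for the raw vectors:

interface StorageStats {
  vectorCount: number;
  dimensions: number;
  indexSizeBytes: number;       // total on-disk size including index overhead
  compressedSizeBytes?: number; // present only if compression is enabled
}

function storageMetrics(stats: StorageStats) {
  const rawVectorBytes = stats.vectorCount * stats.dimensions * 4; // float32 = 4 bytes
  return {
    bytesPerVector: stats.indexSizeBytes / stats.vectorCount,
    indexSizeRatio: stats.indexSizeBytes / rawVectorBytes,
    compressionRatio: stats.compressedSizeBytes !== undefined
      ? rawVectorBytes / stats.compressedSizeBytes
      : undefined,
  };
}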
Conclusion
Vector databases at scale require architectural decisions that balance read throughput, write throughput, storage costs, and query accuracy. There's no one-size-fits-all solution—your AI application's specific read/write patterns should drive your architecture.
Start by measuring your workload: what's your read/write ratio? What's your latency requirement? How many vectors do you need to store? Then choose your vector database, sharding strategy, indexing approach, and storage tiering accordingly.
The difference between a vector database that scales and one that becomes a bottleneck is usually architectural—not the choice of database itself.