Vector Databases & How They Work
Understanding why we need vector databases and exploring their internal implementations for efficient similarity search.
Why Do We Need Vector Databases?
Consider these two queries that might be sent to a search engine:
"what is an array"
"Array kya hota hai"
These queries are asking the same question, but traditional databases can't recognize their similarity. This is where vector databases come in.
SQL and NoSQL databases are designed to find exact matches based on known values, while vector databases are designed to find similar items based on conceptual meaning.
The Baseline: Brute Force (Exact Search)
The simplest approach is to convert text into vectors (numerical representations) and compare them using distance metrics like Euclidean distance or Cosine similarity.
Example of a vector: [0.9, 0.1, 0.0, 0.9, 0.1]
| Item | Vector |
|---|---|
| onion | [0.9, 0.1, 0.0, 0.9, 0.1] |
| tomato | [0.8, 0.2, 0.0, 0.8, 0.2] |
| dhaniya (coriander) | [0.7, 0.1, 0.0, 0.9, 0.0] |
| nimbu (lemon) | [0.1, 0.8, 0.1, 0.7, 0.3] |
| banana | [0.1, 0.9, 0.1, 0.1, 0.8] |
| protein | [0.0, 0.1, 0.9, 0.1, 0.7] |
| creatine | [0.0, 0.0, 0.95, 0.0, 0.1] |
| bcaa | [0.0, 0.0, 0.92, 0.0, 0.1] |
How Distance Calculation Works
For example, to find the distance between the points (2,3) and (3,4):
Euclidean distance: √((2-3)² + (3-4)²) = √2 ≈ 1.41
With brute force, the system compares the query vector with every single vector in the database, calculating the distance each time. It then returns the items with the shortest distances.
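Here is a minimal sketch of that brute-force scan in Python with NumPy, reusing the toy grocery vectors from the table above (the query vector is made up for illustration):

```python
import numpy as np

# The toy vectors from the table above (5 dimensions each).
items = {
    "onion":    [0.9, 0.1, 0.0, 0.9, 0.1],
    "tomato":   [0.8, 0.2, 0.0, 0.8, 0.2],
    "dhaniya":  [0.7, 0.1, 0.0, 0.9, 0.0],
    "nimbu":    [0.1, 0.8, 0.1, 0.7, 0.3],
    "banana":   [0.1, 0.9, 0.1, 0.1, 0.8],
    "protein":  [0.0, 0.1, 0.9, 0.1, 0.7],
    "creatine": [0.0, 0.0, 0.95, 0.0, 0.1],
    "bcaa":     [0.0, 0.0, 0.92, 0.0, 0.1],
}

def euclidean(a, b):
    # Square root of the sum of squared differences,
    # e.g. for (2,3) and (3,4): sqrt((2-3)^2 + (3-4)^2) ~= 1.41
    return float(np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2)))

def brute_force_search(query, k=3):
    # Exact Nearest Neighbor: measure the distance to EVERY stored vector...
    scored = [(name, euclidean(query, vec)) for name, vec in items.items()]
    # ...then return the k items with the shortest distances.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# A made-up query vector that sits near the "vegetable" items.
print(brute_force_search([0.85, 0.15, 0.0, 0.85, 0.15]))
```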
The Problem with Brute Force
A full scan is slow precisely because it is thorough: it calculates the distance to every single point, so it guarantees finding the absolute closest one. This is called Exact Nearest Neighbor (ENN) search.
While this gives perfect results, it becomes impractical when dealing with billions of vectors, as the computation time scales linearly with the number of items.
ANN Approach: Trading Accuracy for Speed
Approximate nearest neighbor (ANN) is an algorithm that finds a data point in a data set that's very close to the given query point, but not necessarily the absolute closest one.
It sacrifices a tiny amount of accuracy (it might return 9 of the top 10, plus the 11th) for a massive gain in speed (often orders of magnitude faster).
Let's explore several ANN approaches that make vector databases practical for large-scale applications.
Approach 1: Clustering / Inverted File Index (IVF)
Phase 1: Indexing (The one-time setup)
Find the 'Centers': The database first looks at all of your million vectors and uses an algorithm (like k-means clustering) to find, let's say, 100 'center points' or centroids. Each centroid is itself a vector that represents the average of a region in your data. Think of these as the signs for the sections in our library: "Fiction," "Science," etc.
Assign to Clusters: The database then goes through every single one of your million vectors and figures out which of the 100 centroids it's closest to. It then puts a "pointer" to that vector into a list kept for that centroid; these per-centroid lists are the "inverted file" that gives IVF its name.
Phase 2: Searching (when you make a query)
Find the Closest Section(s): The database does NOT compare your query to all one million vectors. Instead, it just compares it to the 100 centroids. This is super fast. It finds that your query is closest to, say, 'Centroid #42' (the "men's casual shirts" section).
Search Only Within That Section: Now, the database knows your best matches are probably in the list for Centroid #42. So it only searches through the 10,000 vectors in that one cluster.
Example
Consider vectors like:
- onion: [2,3]
- tomato: [3,4]
- grapes: [1,2]
- paali: [1,6]
By using centroids, we might find that 3 centroids summarize this small dataset well. At the scale of a million vectors and 100 centroids, each centroid points to a cluster of roughly 10,000 items, so a query is compared against the centroids and then just one small cluster instead of every vector.
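Here is a rough sketch of the two IVF phases in Python. It uses scikit-learn's KMeans and randomly generated vectors as stand-ins, since the section doesn't prescribe a particular clustering library or dataset (the dataset here is also much smaller than a million vectors so it runs quickly):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 5))      # stand-in for the full dataset
n_clusters = 100                       # the "sections" / centroids

# Phase 1: Indexing (one-time setup)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
assignments = kmeans.predict(vectors)  # which centroid each vector belongs to
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

# Phase 2: Searching
def ivf_search(query, k=3, n_probe=1):
    # Step 1: compare the query to the 100 centroids only (cheap).
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    nearest_clusters = np.argsort(centroid_dists)[:n_probe]
    # Step 2: brute-force only inside the chosen cluster(s).
    candidates = np.concatenate([inverted_lists[c] for c in nearest_clusters])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

ids, dists = ivf_search(rng.random(5))
print(ids, dists)
```

Raising n_probe (searching the two or three closest clusters instead of one) is the usual knob for trading a little speed back for accuracy.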
Approach 2: The Decision Tree Method (Binary Space Partitioning)
What if we just took the whole space and chopped it in half? And then chopped those halves in half? And so on.
How It Works:
First Split: We pick a dimension (e.g., the x-axis) and find the middle point (the median). We draw a vertical line there. All points to the left go into the "left" branch of our tree, and all points to the right go into the "right" branch.
Second Split: Now we look at the two new groups. For each group, we pick the other dimension (the y-axis) and split it in half with a horizontal line.
Keep Splitting: We keep repeating this process, alternating which dimension we split on, until we only have a few points left in each final, tiny rectangle.
Binary Tree Structure
The splits form a binary tree: each internal node stores a splitting line, and each leaf holds the handful of points left in one small rectangle. To search for the 3 nearest neighbors of a query such as Q = [13, 8], we walk down the tree to the leaf whose rectangle contains Q, measure the points there, and then walk back up, only descending into sibling branches whose rectangles could still contain something closer.
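A small sketch of this tree-based search using SciPy's KDTree (an assumed library choice), over the 2-D points from the IVF example and the query Q = [13, 8]:

```python
import numpy as np
from scipy.spatial import KDTree

# The 2-D example points from the IVF section.
names  = ["onion", "tomato", "grapes", "paali"]
points = np.array([[2, 3], [3, 4], [1, 2], [1, 6]])

# Building the tree recursively splits the space, alternating dimensions.
tree = KDTree(points)

# Find the 3 nearest neighbors of Q = [13, 8].
distances, indices = tree.query([13, 8], k=3)
for d, i in zip(distances, indices):
    print(names[i], points[i], round(float(d), 2))
```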
Approach 3: Hierarchical Navigable Small Worlds (HNSW)
How HNSW Works
HNSW organizes vectors in a hierarchical graph structure with multiple layers:
- Layer 2: Top layer with only a few nodes, used for quick, coarse jumps
- Layer 1: Mid-level layer with a small number of connections per node (e.g., 3)
- Layer 0: Base layer containing every vector, with more connections per node (e.g., 6)
HNSW works by creating a "skip list"-like structure where higher levels allow for quick, coarse navigation, and lower levels refine the search. This approach provides both fast search speeds and high accuracy.
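A short sketch with the hnswlib package (an assumed library choice, since the section doesn't name one), which exposes the number of connections per node (M) and the search breadth (ef) directly:

```python
import numpy as np
import hnswlib

dim, num_elements = 128, 10_000
data = np.random.random((num_elements, dim)).astype(np.float32)

# Build the layered graph. M controls connections per node;
# ef_construction controls how carefully links are chosen while building.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef at query time: how widely the lower layers are explored
# (larger = more accurate, slower).
index.set_ef(50)

query = np.random.random((1, dim)).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels[0], distances[0])
```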
Approach 4: The Compression Method (Product Quantization - PQ)
The Memory Problem
A single vector using 1536 dimensions of 32-bit floating-point numbers takes up 1536 * 4 bytes = 6,144 bytes. If you have a billion vectors, that's 6.1 Terabytes of RAM! That's incredibly expensive.
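Spelled out as a quick calculation:

```python
bytes_per_vector = 1536 * 4                      # 1536 dims x 4 bytes per float32
total_bytes = bytes_per_vector * 1_000_000_000   # one billion vectors
print(bytes_per_vector)                          # 6144 bytes per vector
print(total_bytes / 1e12)                        # ~6.1 terabytes
```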
Product Quantization solves this by compressing vectors while maintaining their similarity relationships.
How Product Quantization Works
It chops the big vector into smaller segments. For example, a 16-dimension vector could be chopped into 4 segments of 4 dimensions each: [v1, v2, ..., v16] → [chunk1], [chunk2], [chunk3], [chunk4]
Step A: Split V into chunks
V = [1.1, 2.3, 0.9, 3.4, 8.8, 7.6, 9.1, 6.5, 4.2, 5.5, 3.4, 5.1, 9.9, 8.1, 7.7, 9.3]
total size = 16*4 = 64 bytes
v_chunk1 = [1.1, 2.3, 0.9, 3.4] -> mapped to [10]
v_chunk2 = [8.8, 7.6, 9.1, 6.5] -> mapped to [23]
v_chunk3 = [4.2, 5.5, 3.4, 5.1] -> mapped to [123]
v_chunk4 = [9.9, 8.1, 7.7, 9.3] -> mapped to [16]
Resulting compressed vector: [10, 23, 123, 16]
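A tiny NumPy sketch of Step A, just to show the sizes involved (the centroid ids are the ones from the example above; how they are chosen is covered in Step B):

```python
import numpy as np

V = np.array([1.1, 2.3, 0.9, 3.4,  8.8, 7.6, 9.1, 6.5,
              4.2, 5.5, 3.4, 5.1,  9.9, 8.1, 7.7, 9.3], dtype=np.float32)
print(V.nbytes)              # 64 bytes: 16 dimensions x 4 bytes each

chunks = V.reshape(4, 4)     # Step A: four chunks of four dimensions each
print(chunks[0])             # [1.1 2.3 0.9 3.4]

# After each chunk is mapped to its nearest codebook centroid (Step B),
# only the centroid ids are stored:
code = np.array([10, 23, 123, 16], dtype=np.uint8)
print(code.nbytes)           # 4 bytes: a 16x reduction
```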
Step B: Create the Codebooks (k clusters/centroids per chunk)
For each chunk position, we create a codebook of representative vectors:
- Chunk 1 (dims 1-4): We take the first chunk of every vector, cluster them, and the resulting centroids become Codebook 1.
- Chunk 2 (dims 5-8): We do the same with the second chunks to get Codebook 2.
- Chunk 3 (dims 9-12): This gives us Codebook 3.
- Chunk 4 (dims 13-16): This gives us Codebook 4.
Codebook 1 might look like this (even with a billion records, a codebook typically holds just 256 centroids, so each chunk id fits in a single byte; every centroid has the same 4 dimensions as the chunk it summarizes):
0 : [1, 2, 3, 6]
1 : [1, 1, 3, 2]
2 : [2, 2, 3, 6]
3 : [1, 1, 2, 4]
...
255 : [4, 2, 3, 6]
How to Search with a Query
For a query Q = [1.5, 2.1, 0.8, 3.4, 8.1, 7.9, 9.0, 6.6, 4.8, 5.2, 4.1, 5.0, 9.5, 8.3, 7.9, 9.1]:
- Split Q into chunks, just like we did for vectors in our database
- For each chunk, compute distances to all centroids in the corresponding codebook
- Build a distance lookup table for fast comparisons
- Use these tables to estimate distances between Q and compressed vectors
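Putting the pieces together, here is a rough end-to-end sketch: training the codebooks with scikit-learn's KMeans (an assumed stand-in for whatever clustering a real database uses), encoding vectors to 4-byte codes, and searching with per-chunk distance lookup tables. The dataset is randomly generated for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dim, n_chunks, n_centroids = 16, 4, 256
chunk_dim = dim // n_chunks                               # 4 dims per chunk
vectors = rng.random((5_000, dim)).astype(np.float32)     # stand-in dataset

# --- Indexing: train one codebook per chunk position (Step B) ---
codebooks = []
for c in range(n_chunks):
    chunk = vectors[:, c * chunk_dim:(c + 1) * chunk_dim]
    codebooks.append(KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(chunk))

# --- Encoding: replace each chunk by the id of its nearest centroid (Step A) ---
codes = np.stack([
    codebooks[c].predict(vectors[:, c * chunk_dim:(c + 1) * chunk_dim])
    for c in range(n_chunks)
], axis=1).astype(np.uint8)                               # shape (5000, 4): 4 bytes per vector

# --- Searching: the query stays uncompressed, distances come from lookup tables ---
def pq_search(query, k=3):
    # One table per chunk: squared distance from the query chunk to all 256 centroids.
    tables = [
        ((codebooks[c].cluster_centers_ - query[c * chunk_dim:(c + 1) * chunk_dim]) ** 2).sum(axis=1)
        for c in range(n_chunks)
    ]
    # Estimated distance to every compressed vector = sum of 4 table lookups.
    approx = sum(tables[c][codes[:, c]] for c in range(n_chunks))
    order = np.argsort(approx)[:k]
    return order, np.sqrt(approx[order])

query = rng.random(dim).astype(np.float32)
ids, dists = pq_search(query)
print(ids, dists)
```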
This allows for extremely memory-efficient vector storage while maintaining reasonably accurate similarity searches.
Which Method is "Best"? A Comparison
| Method | Speed (Query Latency) | Accuracy (Recall) | Memory Usage |
|---|---|---|---|
| HNSW | Highest | Highest | Highest |
| IVFPQ (Hybrid) | High | High (with re-ranking) | Lowest by far |
| IVF (Standalone) | Medium | Medium | Low |
| Trees (e.g., KD-Tree) | Low (in high dimensions) | Low | Medium |
Database / Library Comparison
| Database/Library | Category | Key Feature / Philosophy | Primary Indexing Technique(s) |
|---|---|---|---|
| Milvus | Native | Open-source, highly scalable, MLOps-focused. Decoupled storage & compute. | Highly flexible: HNSW, IVFPQ, IVFFlat, and others. |
| Pinecone | Native | Managed service (SaaS), ease of use, performance-focused. | HNSW |
| Weaviate | Native | Open-source, GraphQL API, connects vectors in a knowledge graph. | HNSW |
| Qdrant | Native | Open-source, performance-focused with advanced filtering. Written in Rust. | HNSW with custom quantization. |
| PostgreSQL (pgvector) | Integrated | Keep your structured data and vectors together in a trusted relational DB. | IVFFlat and HNSW |
| Elasticsearch | Integrated | Combine traditional keyword search with semantic vector search in one query. | HNSW |
| Redis | Integrated | In-memory speed, combining vector search with other Redis data structures. | HNSW |
| Faiss (by Meta) | Library | The foundational C++/Python toolkit. Not a DB, but powers many others. | The reference: HNSW, IVFPQ, and many more. |
What Data is Actually Stored
Vector databases typically store three types of information:
// Example record in a vector database
{
  id: "product_456",                  // unique identifier
  vector: [0.98, 0.23, -0.11, ...],   // the numerical representation
  metadata: {
    product_name: "Red Running Shoes",
    price: 89.99,
    link: "https://example.com/shoes/456"
  }
}
Summary: Vector Databases and Their Implementation
1. Vector databases solve the similarity search problem
They allow us to find items based on meaning and similarity rather than exact matches.
2. Brute force is accurate but slow
Calculating distance to every vector gives perfect results but becomes impractical at scale.
3. Approximate methods offer speed-accuracy tradeoffs
IVF clustering, tree-based partitioning, HNSW, and product quantization all offer different approaches to the same problem.
4. Modern vector databases combine techniques
Most production systems use combinations of these methods to balance speed, accuracy, and memory usage.