Semantic Search and Information Retrieval with Transformers

From Keywords to Neural Understanding: The Transformer Revolution in Search — Article 9

Modern information retrieval has undergone a profound transformation. Where once we struggled with keyword limitations and Boolean operators, today’s semantic search unlocks the true meaning behind our questions. This chapter explores how transformer models have revolutionized search by bridging the gap between what users ask and what they truly seek.

We’ll journey through:

The fundamental shift from lexical matching to semantic understanding
How transformer architectures create rich, contextual embeddings that capture meaning
Vector databases that make these embeddings searchable at scale
Real-world applications across customer support, knowledge management, and legal discovery
The latest advancements including RAG integration and specialized domain models

By understanding both the theory and practical implementation of transformer-based search, you’ll gain the tools to build systems that truly comprehend user intent — not just match strings. Let’s explore how these neural networks have fundamentally changed what’s possible in information retrieval.

Topics covered: Semantic Search & Information Retrieval

Shows transition From Keywords to Understanding with limitations and benefits
Covers Transformer Embeddings including models and adaptations
Details Vector Databases & FAISS for scalable implementation
Highlights Business Applications across industries
Includes Modern Features like RAG and benchmarking

Introduction: From Keyword Search to True Understanding

Ever chased a critical document that you know exists, but can’t find because you’re using the wrong words? That’s keyword search failing you. Today’s transformer models transform this frustration into fluid discovery. They understand meaning, not just matching letters.

Picture searching a massive warehouse for a red umbrella. Keyword search gives you a flashlight that only illuminates boxes labeled “red.” Miss the one marked “crimson parasol”? Too bad. You’ll walk right past it.

Now imagine a brilliant assistant who grasps that crimson equals red, and parasol means umbrella. They understand your intent, not just your words. That’s semantic search — it discovers meaning, not just matches.

Semantic search harnesses transformer models — deep learning architectures that capture relationships between words in context. Instead of literal matching, transformers encode the essence behind text. “Refund policy” and “money back guarantee” become neighbors in meaning space, even sharing zero words.

Let’s witness this transformation with Python. We’ll contrast keyword and semantic search using the latest sentence-transformers library and production-ready patterns.

This builds on my previous exploration of custom data workflows in Custom Data Workflows Matter — Article 8, and for foundational transformer mechanics, see my blog post Inside the Transformer: Architecture and Attention Demystified — A Complete Guide.

Setting Up Your Environment with Python 3.12.9

# Using pyenv (recommended for Python version management)
pyenv install 3.12.9
pyenv local 3.12.9

# Verify Python version
python --version  # Should show Python 3.12.9

# Install required packages with poetry
poetry new semantic-search-project
cd semantic-search-project
poetry env use 3.12.9
poetry add sentence-transformers numpy faiss-cpu rank-bm25 chromadb openai

# Or use mini-conda
conda create -n semantic-search python=3.12.9
conda activate semantic-search
pip install sentence-transformers numpy faiss-cpu rank-bm25 chromadb openai

# Or use pip with pyenv
pyenv install 3.12.9
pyenv local 3.12.9
pip install sentence-transformers numpy faiss-cpu rank-bm25 chromadb openai

For hands-on code, check the companion GitHub repo, later you can follow along to load up the example and run them. This is the fastest way to set up a new environment as we have all of the starter config.

Keyword Search vs. Semantic Search: A Modern Comparison

Let’s now explore the practical application of both keyword and semantic search methods with a detailed code example. The following listing demonstrates a side-by-side comparison that highlights the fundamental differences between these approaches.

This example will show:

How traditional keyword search fails when exact word matching isn’t present
How transformer-based semantic search correctly identifies relevant content through meaning
The practical implementation using the sentence-transformers library
Why semantic search produces superior results for natural language queries

The code sample uses a realistic FAQ scenario where a user’s query about forgotten login credentials should match documents about password recovery, despite not sharing the same keywords:

# Example FAQ documents
faqs = [
    "How can I reset my password?",
    "What are the steps for account recovery?",
    "How do I request a refund?",
    "Information about our privacy policy."
]

# User query
query = "I forgot my login credentials"

# --- Keyword Search ---
# Find FAQs containing any keyword from the query (exact match)
keywords = set(query.lower().split())
keyword_matches = [faq for faq in faqs if keywords & set(faq.lower().split())]
print("Keyword Search Results:", keyword_matches)

# --- Semantic Search ---
# Use a transformer model to embed both the FAQs and the query
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # Current, lightweight model

# Embeddings as vectors from FAQs and Query
faq_embeddings = model.encode(faqs, convert_to_numpy=True)  
query_embedding = model.encode([query], convert_to_numpy=True)[0]

# Cosine similarity measures semantic closeness between query and each FAQ
cosine_scores = util.cos_sim(query_embedding, faq_embeddings)[0].cpu().numpy()

# Rank FAQs by similarity (highest score first)
top_idx = np.argsort(-cosine_scores)
semantic_matches = [faqs[i] for i in top_idx[:2]]
print("Semantic Search Results:", semantic_matches)

# Note: For production-scale search, use a vector database (see below) for efficiency.

Step-by-Step Explanation:

Define FAQs and Query: Create sample documents and a user question that doesn’t match keywords exactly
Keyword Search: Split query into words, find FAQs sharing any word — misses relevant answers when wording differs
Load Transformer Model: Initialize sentence transformer that creates meaning-rich embeddings
Generate Embeddings: Convert FAQs and query into dense vectors capturing semantic essence
Calculate Similarity: Use cosine similarity to measure meaning closeness between vectors
Rank Results: Sort FAQs by similarity score — most relevant surfaces first

Notice how the simplistic keyword search returns nothing (no shared words), while semantic search correctly identifies password/account recovery FAQs as relevant. That’s the power of understanding meaning.

Why does this transformation matter for business?

Customer Support: Users rarely phrase questions matching your documentation. Semantic search bridges this gap.
Enterprise Knowledge: Employees discover procedures using their terminology.
Legal Compliance: Lawyers surface relevant precedents by meaning, not exact phrasing.

Transformers fuel this leap. They absorb context and nuance from massive datasets, enabling search that transcends surface matching.

Key takeaway: Semantic search, powered by transformers, unlocks genuine language understanding. This shift proves vital for building smarter, more intuitive search across domains.

🔎 P**roduction Note: **Real-world semantic search stores embeddings in vector databases (FAISS, Milvus, Weaviate, Postgres/pgvector) for efficient scaling to millions of documents. We’ll explore this shortly.

🌐 L**ooking Ahead: **Recent advances include retrieval-augmented generation (RAG), Mixture of Experts architectures, and multimodal search combining text, images, and video. For a deeper dive into RAG’s evolution and why it’s far from dead, check my analysis in Is RAG Dead? Anthropic Says No. Complement this with my blog on multimodal extensions in Beyond Language: Transformers for Vision, Audio, and Multimodal AI — Article 7, which explores how semantic search evolves beyond text.

Introduction to Semantic Search

Search drives how we navigate information from company wikis to legal archives. Traditional engines focus on exact matches, missing true intent. Semantic search revolutionizes this by understanding meaning and context, leveraging transformer embeddings.

You’ll master how semantic search surpasses keyword matching, why transformer embeddings excel at capturing meaning, and which metrics prove search quality. We’ll introduce production practices — vector databases, hybrid search, modern evaluation tools — through practical examples.

Step-by-Step Explanation:

Traditional Search path shows query going through keyword engine to limited results
User becomes frustrated when intent is missed
Semantic Search path shows both query and documents becoming embeddings
Transformer creates vectors that capture meaning
Similarity matching produces relevant results
User achieves satisfaction through understood intent

Keyword vs. Semantic Search

Picture searching for “resetting your password.” Keyword search only finds documents containing “reset” and “password” — missing “Account recovery steps” despite being your answer.

Think of keyword search as a rigid librarian who only fetches books with your exact phrase. Fast, but inflexible. Synonyms or paraphrasing derail it altogether.

Semantic search resembles an insightful librarian who grasps your meaning. It connects “resetting password” with “account recovery” by matching intent, not letters. This magic happens through embeddings — numeric vectors representing text meaning, generated by transformers.

Embeddings capture context and word relationships. We compare meanings using cosine similarity — a mathematical measure of vector direction closeness. Nearby vectors share similar meaning.

Comparing Keyword and Semantic Search Results

# Ensure Python 3.12.9 environment
import sys
print(f"Python version: {sys.version}")  # Verify 3.12.9

# Example query and candidate documents
query = "How do I reset my password?"
documents = [
    "Account recovery steps",
    "Password reset instructions",
    "Update your profile information"
]

# Keyword search: match if 'password' is present
keyword_matches = [doc for doc in documents if "password" in doc.lower()]
print("Keyword Search Results:", keyword_matches)

# Semantic search: use a sentence transformer for meaning
from sentence_transformers import SentenceTransformer, util

# You can also try newer embedding models, such 
# as 'BAAI/bge-base-en-v1.5' or 'intfloat/e5-base-v2'
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_idx = scores.argmax().item()

print("Semantic Search Top Result:", documents[top_idx])

Step-by-Step Explanation:

Verify Python Version: Ensure we’re using Python 3.12.9 for consistency
Define Query and Documents: Create search scenario with varied phrasings
Keyword Search: Only finds “Password reset instructions” — ignores relevant “Account recovery”
Load Transformer: Initialize model that creates semantic embeddings
Generate Embeddings: Convert text into meaning vectors
Calculate Similarity: Find document with closest meaning to query
Surface Best Match: Semantic search correctly identifies most relevant document

Try modifying the query to “forgot login details” — watch how results shift. This demonstrates semantic search uncovering relevance that keyword search misses entirely.

Production systems scale by storing embeddings in vector databases (FAISS, Milvus, Pinecone, Weaviate, Postgres + VectorDB, AlloyDB), enabling lightning-fast similarity search across millions of documents.

Modern engines combine keyword and semantic search (hybrid search) maximizing precision and recall. Reranking models (cross-encoders, LLMs) further refine top results.

Key Takeaway: Semantic search matches meaning, not just words. Transformer embeddings plus vector databases deliver helpful, scalable, accurate results.

Hybrid Search: Combining Keyword and Semantic Approaches

After discussing keyword vs. semantic search, let’s explore how to combine both approaches for optimal results. Hybrid search leverages the precision of keyword matching with the understanding of semantic search.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Documents and query
documents = ["Account recovery steps", "Password reset instructions", "Update profile"]
query = "Forgot login details"

# Keyword component (BM25)
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
bm25_scores = bm25.get_scores(query.lower().split())

# Semantic component
model = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = model.encode(query)
doc_embs = model.encode(documents)
semantic_scores = util.cos_sim(query_emb, doc_embs)[0].numpy()

# Hybrid: Weighted average (e.g., 0.4 keyword + 0.6 semantic)
hybrid_scores = 0.4 * np.array(bm25_scores) + 0.6 * semantic_scores
top_idx = np.argsort(-hybrid_scores)[0]
print("Hybrid Top Result:", documents[top_idx])

Step-by-Step Explanation:

Tokenize documents for BM25: Split documents into words for keyword scoring
Compute keyword scores: Use BM25 algorithm for traditional relevance
Generate embeddings for semantic scores: Create meaning-based vectors
Combine with weights: Balance keyword precision with semantic understanding (tune based on domain)
Rank results: Surface the best match using combined scores

This hybrid approach often outperforms either method alone, especially in domains where specific terminology matters but users may phrase queries differently.

Business Applications of Semantic Search

Semantic search isn’t merely technical evolution — it’s competitive advantage. Real-world impact:

Enterprise Knowledge Bases: Employees use varied terminology. Semantic search bridges vocabulary gaps, surfacing answers regardless of phrasing.
Customer Support Automation: Intent-aware chatbots understand “How can I get my money back?” matches refund policies, without the word “refund.”
Legal and Compliance Discovery: Legal teams find relevant precedents through meaning, not just keywords — saving hours, reducing risk. This aligns with enterprise use cases I’ve covered in The Economics of Deploying Large Language Models: Costs, Value, and 99.7% Savings, and for scaling such systems, refer to my blog post Scaling Up: Debugging, Optimization, and Distributed Training — Article 17.

Production deployments use scalable vector databases for efficient embedding retrieval, often combining keyword and semantic methods to achieve peak accuracy.

Summary: Semantic search reduces friction, boosts satisfaction, makes knowledge work efficient — especially with cutting-edge embedding models and infrastructure.

Vector Database Comparison Table

Database Type Strengths Best For Scaling

FAISS Open-source library Fast, flexible indices Research, custom deployments Manual sharding
PostgreSQL + pgvector Extension ACID, SQL integration, mature Enterprise with existing PG Vertical + read replicas
Pinecone Managed cloud Zero-ops, auto-scaling SaaS applications Automatic
Weaviate Open-source + cloud GraphQL API, hybrid search Enterprise search Horizontal
Milvus Open-source + cloud GPU support, high performance Large-scale ML Distributed
Qdrant Open-source + cloud Rust-based, filtering Real-time applications Cloud-native
Chroma Embedded/cloud Simple API, developer-friendly Prototyping, RAG In-memory to cloud

MongoDB added support for vector indexing as did DuckDB. Recently, Amazon announced support for vector indexes via S3. RAG is far from dead. RAG is becoming somewhat ubiquitous as AI adoption grows.

Measuring Search Quality: Metrics that Matter

Building search is half the battle. Proving it works requires clear metrics:

Precision: What fraction of results are relevant? High precision minimizes irrelevant noise.
Recall: What fraction of relevant documents appeared? High recall ensures nothing important is missed.
F1 Score: Harmonizes precision and recall into one balanced metric.
Mean Reciprocal Rank (MRR): How high does the first relevant result appear? Top placement delights users.
Normalized Discounted Cumulative Gain (NDCG): Evaluates entire ranking, rewarding relevant results near the top — crucial for result lists.

Calculating Precision and Recall for Search Results

# Example: Evaluating a search system
retrieved = {"doc1", "doc2", "doc5"}  # IDs of documents returned by search
relevant = {"doc2", "doc3", "doc5"}   # IDs of truly relevant documents

precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
# Output: Precision: 0.67, Recall: 0.67

Step-by-Step Explanation:

Define Sets: Retrieved documents vs. truly relevant documents
Calculate Intersection: Find documents that are both retrieved and relevant
Compute Precision: Fraction of retrieved that are relevant
Compute Recall: Fraction of relevant that were retrieved

Practice requires averaging metrics across many queries. For ranking metrics (MRR, NDCG), libraries like Hugging Face’s evaluate, scikit-learn, or pytrec_eval automate calculations.

Mean Reciprocal Rank (MRR): This metric focuses on the position of the first relevant result. The reciprocal rank is 1 divided by the rank position of the first relevant document. For example, if the first relevant document appears at position 3, the reciprocal rank is 1/3 = 0.33. MRR averages this value across multiple queries. Higher MRR values (closer to 1.0) indicate better search performance where relevant results appear early in the results list.

Normalized Discounted Cumulative Gain (NDCG): This metric evaluates the entire ranking quality, not just the first relevant hit. It considers both the relevance of results and their positions. NDCG gives higher weight to relevant documents appearing earlier in search results, following the principle that users typically focus on top results. The metric is “normalized” by comparing the actual ranking to an ideal ranking where all relevant documents appear first in order of relevance. NDCG ranges from 0 to 1, with 1 representing perfect ranking.

Both metrics are essential for evaluating real-world search systems where result ordering significantly impacts user experience. MRR is simpler and focuses on finding the first correct answer quickly, while NDCG provides a more comprehensive view of overall ranking quality across multiple positions.

Choose metrics matching your goals. Customer support needs high recall — never miss helpful articles. Legal search demands high precision — avoid irrelevant results.

Key Point: Metrics transform search quality from guesswork into measurable outcomes, guiding continuous improvement.

Embedding Text for Search Applications

Imagine organizing a vast library by meaning, not titles. That’s semantic search’s challenge — embeddings solve it. You’ll learn to transform text into vectors, store them efficiently, adapt for languages and specialized domains. We’ll highlight cutting-edge models and databases for future-proof, production-ready solutions.

Step-by-Step Explanation:

Start with RawText that needs embedding
Text goes through Tokenization preprocessing
TransformerModel encodes tokens into Embeddings
Embeddings stored in VectorDatabase and indexed
System becomes SearchReady
Parallel paths handle Multilingual and Domain-specific content

Coverage includes:

What embeddings are and their importance
Creating them with sentence transformers and modern APIs
Storing and managing embeddings at scale
Multilingual and domain-specific strategies
Selecting optimal models and databases

What Are Embeddings — and Why Do They Matter?

An embedding is a numerical fingerprint — a vector — capturing text meaning. Picture it as DNA for sentences. Similar meanings yield similar embeddings, clustering in vector space.

Consider:

“How do I reset my password?”
“What are the steps to recover my account?”

Keyword search sees these as unrelated. Embeddings recognize their semantic kinship, enabling intelligent matching.

Practical impact? Embeddings power:

Customer question routing to relevant help articles — despite different wording
Support ticket clustering by issue type
Cross-language search focusing on meaning
Retrieval-augmented generation (RAG) combining search with LLMs for reasoning

I’ve explored practical RAG implementations in Beyond Chat: Enhancing LiteLLM Multi-Provider App with RAG, Streaming, and AWS Bedrock. For the foundational step of tokenization that underpins embeddings, see my blog post Article 5: Tokenization — Converting Text to Numbers for Neural Networks.

Bottom line: Embeddings translate language into comparable meanings, not just matched strings.

Generating Embeddings with Sentence Transformers and Modern APIs

While classic transformers, such as BERT, understand language, they weren’t designed for sentence comparison. Sentence transformers train specifically to cluster similar meanings — perfect for search. Recent APIs (Google Gemini, OpenAI, Cohere) raise the bar for multilingual and domain performance.

The sentence-transformers library remains essential for local workflows. For English, 'all-MiniLM-L6-v2' delivers speed and accuracy. Production deployments often use cloud APIs or cutting-edge open models. Pro tip: Check the MTEB leaderboard for latest benchmarks.

Generating Embeddings with Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences to embed
sentences = [
    "How do I reset my password?",
    "Account recovery steps"
]

# Generate embeddings (returns a NumPy array of floats)
embeddings = model.encode(sentences)

print(embeddings.shape)  # Output: (2, 384)

Step-by-Step Explanation:

Import Library: Load sentence-transformers for embedding generation
Load Model: Initialize pre-trained model from Hugging Face
Define Sentences: Create text samples to embed
Generate Embeddings: Convert sentences to 384-dimensional vectors
Verify Shape: Confirm output dimensions match expectations

Similar sentences produce similar vectors, enabling meaningful comparison. For other languages or industries, choose models trained on relevant data. Modern APIs support more extended sequences; essential for document search and RAG.

For state-of-the-art performance, explore Google Gemini Embedding, OpenAI’s latest APIs, or top MTEB models supporting extended inputs, superior accuracy, and broader language coverage. If you are using AWS Bedrock, Cohere’s embeddings are probably your best bet.

Modern Embedding API Example with OpenAI

# Example using OpenAI's embedding API (requires API key)
import openai
import numpy as np

# Set your OpenAI API key
openai.api_key = "your-api-key-here"

def get_openai_embeddings(texts):
    """Generate embeddings using OpenAI's latest text-embedding model"""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",  # Or newer models as available
        input=texts
    )
    embeddings = [item['embedding'] for item in response['data']]
    return np.array(embeddings)

# Example usage
texts = ["How do I reset my password?", "Account recovery steps"]
openai_embeddings = get_openai_embeddings(texts)
print(f"OpenAI embeddings shape: {openai_embeddings.shape}")  # (2, 1536)

# Compare with sentence-transformers for cost/performance trade-offs
# OpenAI: Higher accuracy, API costs, larger dimensions
# Sentence-transformers: Free, local, smaller dimensions

Step-by-Step Explanation:

Import Libraries: Load OpenAI client and NumPy
Set API Key: Configure authentication for OpenAI
Define Function: Create reusable embedding generator
Call API: Request embeddings from OpenAI’s model
Extract Vectors: Parse response into NumPy array
Compare Options: Understand trade-offs between APIs and local models

RAG: Combining Search with Generation

Since RAG (Retrieval-Augmented Generation) is increasingly important, let’s see how embeddings feed into generative AI:

from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Setup retriever (simple in-memory for demo)
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = ["Refund policy: 30 days money back.", "Account recovery: Contact support."]
doc_embs = model.encode(docs)

# Query
query = "How do I get a refund?"
query_emb = model.encode(query)
scores = util.cos_sim(query_emb, doc_embs)[0].numpy()
top_doc = docs[np.argmax(scores)]

# RAG: Use retrieved doc as context for LLM
generator = pipeline('text-generation', model='gpt2')  # Replace with fine-tuned model in production
prompt = f"Context: {top_doc}\\nQuestion: {query}\\nAnswer:"
response = generator(prompt, max_length=50)[0]['generated_text']
print("RAG Response:", response)

Step-by-Step Explanation:

Embed documents and query: Create vectors for similarity search
Retrieve top match via cosine similarity: Find most relevant document
Craft prompt with context: Combine retrieved information with question
Generate response: Use LLM to produce contextual answer

Note: Use larger models like Llama-2 or GPT-4 for better production results. This demonstrates how semantic search enables more accurate, grounded AI responses.

Batch Processing and Storing Embeddings

Real systems handle thousands of documents — product catalogs, support tickets, articles. Batch processing accelerates embedding creation. You’ll need persistent storage for search. Vector databases are now industry standard for production.

Batch Embedding and Saving to Disk

import numpy as np

# Example: List of document texts
documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]

# Batch encode documents (adjust batch_size for your hardware)
doc_embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)

# Save embeddings as a NumPy array for later use
np.save('doc_embeddings.npy', doc_embeddings)

Step-by-Step Explanation:

Prepare Documents: Create list of texts to embed
Batch Encode: Process multiple documents efficiently
Show Progress: Track encoding progress for large batches
Save Embeddings: Store vectors in NumPy format for fast loading

⚡ Pro Tip: Always maintain document ID mapping. Track which embedding corresponds to which document — critical for retrieving correct text later.

For production search, choose vector databases like Postgres, FAISS, Qdrant, Milvus, or Chroma — optimized for embedding storage and similarity search at scale.

Summary: Batch processing and organized storage enable scalable semantic search. Select databases matching your performance and deployment needs.

Multilingual and Domain-Specific Embeddings

Global businesses operate across languages and specialized fields. How do embeddings capture meaning across these dimensions? Recent advances dramatically improve multilingual and domain performance.

Multilingual embeddings: Models like 'paraphrase-multilingual-MiniLM-L12-v2' support 50+ languages. Google Gemini and OpenAI embeddings offer even broader coverage. Similar meanings cluster together regardless of language.

Generating Multilingual Embeddings

For effective multilingual search, we need specialized embedding models that understand semantic meaning across different languages. Let’s explore how to generate these powerful cross-language embeddings.

multi_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = [
    "How do I reset my password?",        # English
    "¿Cómo puedo restablecer mi contraseña?",  # Spanish
    "Comment réinitialiser mon mot de passe?"  # French
]

multi_embeddings = multi_model.encode(sentences)
print(multi_embeddings.shape)  # Output: (3, 384)

multi_embeddings = multi_model.encode(sentences)
print(multi_embeddings.shape)  # Output: (3, 384)

Step-by-Step Explanation:

Load Multilingual Model: Initialize model supporting multiple languages
Define Multilingual Sentences: Same meaning in three languages
Generate Embeddings: Create vectors capturing cross-language meaning
Verify Consistency: All produce same-dimension embeddings

These three sentences share meaning — multilingual models ensure their embeddings cluster together, enabling seamless cross-language search.

Domain-specific embeddings: Technical fields need specialized understanding. Options:

Search Hugging Face for domain models (legal, biomedical, financial)
Fine-tune base models on your data (see Article 10)
Use APIs with strong domain-specific MTEB performance

Long-context support: Modern embeddings (Gemini, OpenAI) handle extended sequences — ideal for document search and RAG applications requiring large context.

In summary:

Deploy multilingual models for global reach
Adapt models for specialized domains
Prefer long-context support for documents
Benchmark using MTEB for optimal selection

Right model choice boosts search quality and user satisfaction dramatically.

Key Takeaways and Next Steps

Embeddings transform text into semantic vectors, enabling meaning-based search. Sentence transformers and modern APIs simplify generation and storage at scale. Match models to your language, domain, and context needs using MTEB benchmarks.

Building on this, my guide The Developer’s Guide to AI File Processing with AutoRAG support: Claude vs. Bedrock vs. OpenAI compares RAG implementations across providers. For customizing models via fine-tuning, which enhances domain-specific embeddings, check my blog Mastering Fine-Tuning: Transforming General Models into Domain Specialists — Article 10.

Try this: Embed three sentences from your domain. Compare their vectors — notice meaning clusters. For production, experiment with Postgres, Qdrant, Milvus, or Chroma for managed search.

Next: Discover how embeddings power real search with vector databases and FAISS and Postgres (this chapter). Explore fine-tuning for custom needs (Article 10). For advanced applications, embeddings enable RAG systems combining search with LLM reasoning.

Vector Databases and FAISS Integration

Semantic search retrieves meaning across massive datasets. FAISS (Facebook AI Similarity Search) delivers high-performance, open-source vector search and clustering. Scale from thousands to billions of records while maintaining low latency. You’ll install FAISS, build indices, and apply production best practices for scaling. We’ll cover recent features, quantization, and integration with distributed search.

Step-by-Step Explanation:

FAISS base class provides vector search functionality
Subclasses offer different index types for various use cases
VectorDatabase interface implemented by managed solutions
ProductionSystem integrates FAISS with transformers and metadata
Shows relationships between components in real deployments

Setting Up FAISS for Scalable Similarity Search

FAISS excels at similarity search over high-dimensional vectors from transformers. It supports exact search (true nearest neighbors) and approximate search (faster with minor accuracy trade-offs). Version 1.7.2+ adds improved GPU support and enhanced algorithms.

Install FAISS choosing CPU or GPU versions based on your hardware. Ensure Python 3.7+ and NumPy 1.18+ compatibility.

Installing and Importing FAISS (v1.7.2+)

# Using Python 3.12.9 environment
# Install FAISS for CPU
# pip install faiss-cpu
# Or for GPU (CUDA)
# pip install faiss-gpu

# Or with poetry (recommended)
# poetry add faiss-cpu

# Or with conda
# conda install -c pytorch faiss-cpu

import faiss
import numpy as np

print(f"FAISS version: {faiss.__version__}")  # Should be 1.7.2 or higher
print(f"Python version: {sys.version}")  # Verify 3.12.9

FAISS requires embeddings as NumPy arrays with dtype ‘float32’. Convert other types before indexing to prevent errors.

Ensuring Embedding Data Type Compatibility

embeddings = np.array(embeddings, dtype=np.float32)  # Always use float32

Create your first FAISS index — a data structure enabling fast similarity search. For small-medium datasets, exact search works perfectly.

Creating a Simple FAISS Index (Exact Search)

# Assume 'embeddings' is a (num_vectors, dimension) float32 array
dimension = embeddings.shape[1]  # Embedding size
index = faiss.IndexFlatL2(dimension)  # Exact search with L2 distance
index.add(embeddings)  # Add embeddings to index

Step-by-Step Explanation:

Get Dimension: Extract embedding vector length (e.g., 384)
Create Index: Initialize exact L2 distance index
Add Data: Load embeddings for fast searching

For larger datasets, explore approximate or compressed indices (covered in scaling section).

Indexing, Querying, and Updating Search Indices

With your index built, search for similar items by embedding queries and using FAISS’s search API. Always use the same model and preprocessing for queries as your indexed data.

Process:

Generate query embedding (same model as documents)
Search FAISS index for nearest neighbors
Map results to original data

Performing a Search Query with Error Handling

try:
    query_text = "How can I get my money back?"
    query_embedding = model.encode([query_text]).astype(np.float32)
    # Ensure shape is (1, dimension)
    if query_embedding.ndim == 1:
        query_embedding = query_embedding.reshape(1, -1)
    distances, indices = index.search(query_embedding, k=3)  # Top-3 matches
    print("Top 3 matching document indices:", indices[0])
    print("Corresponding distances:", distances[0])
except Exception as e:
    print(f"An error occurred during FAISS search: {e}")

Step-by-Step Explanation:

Define Query: User’s search question
Generate Embedding: Convert query to vector using same model
Reshape if Needed: Ensure proper dimensions for FAISS
Search Index: Find top-k most similar embeddings
Handle Errors: Gracefully manage potential issues
Display Results: Show matching indices and distances

Lower distance indicates higher similarity for L2 distance metric.

Adding new documents: generate embeddings and use index.add(). Most indices support fast, safe addition. Note: Training-required indices (IVF) need training before adding data.

Deletion/updates are complex. Version 1.7.2 supports limited deletion (IndexIVFFlat with IDMap). Best practice: periodically rebuild indices for frequent changes.

Always save indices after major updates for reliability and fast recovery.

Saving and Loading a FAISS Index (v1.7.2+)

# Save the index
faiss.write_index(index, "my_faiss.index")

# Load it later
index = faiss.read_index("my_faiss.index")

Production tip: Version indices and store in durable locations (object storage, distributed filesystems) for disaster recovery.

Managed Vector DB Example with Chroma

To complement FAISS with a cloud-managed alternative, here’s how to use Chroma for simpler deployment:

from chromadb import Client
from sentence_transformers import SentenceTransformer

# Setup Chroma client
client = Client()
collection = client.create_collection("docs")

# Embed and upsert documents
model = SentenceTransformer('all-MiniLM-L6-v2')
docs = ["Doc1 text", "Doc2 text"]
embeddings = model.encode(docs).tolist()
collection.add(ids=["1", "2"], embeddings=embeddings, documents=docs)

# Query
query_emb = model.encode(["Similar query"]).tolist()
results = collection.query(query_embeddings=query_emb, n_results=1)
print("Top Match:", results['documents'][0][0])

Step-by-Step Explanation:

Initialize client and collection: Set up Chroma database connection
Generate embeddings: Create vectors for documents
Upsert with IDs: Store documents with unique identifiers
Query and retrieve: Find similar documents efficiently

This managed approach simplifies deployment while FAISS offers more control for custom solutions.

Scaling Semantic Search

Growing datasets make exact search (IndexFlatL2) slow or memory-intensive. FAISS offers advanced scaling strategies: approximate nearest neighbor (ANN) search, quantization, GPU acceleration. Version 1.7.2+ adds LSQ quantization on GPU and improved clustering.

Approximate Nearest Neighbor (ANN) Search

ANN algorithms dramatically accelerate search by returning close (not always exact) matches. FAISS options:

IndexIVFFlat: Inverted File with Flat quantization (requires training)
IndexHNSWFlat: Hierarchical Navigable Small World graphs (no training, fast)
IndexLSQ: Locally Sensitive Quantization (GPU support in v1.7.2)
IndexPQ: Product Quantization for compression

Choose based on dataset size, accuracy needs, and hardware.

Building an Approximate Search Index with FAISS (IVF)

# Set up an IVF index for large-scale search
nlist = 100  # Number of clusters (tune for your data)
quantizer = faiss.IndexFlatL2(dimension)  # Used for cluster assignment
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

# Must train the index before adding data
index_ivf.train(embeddings)
index_ivf.add(embeddings)

# Enable parallel search for faster queries (optional, multi-core CPUs)
faiss.omp_set_num_threads(4)

Step-by-Step Explanation:

Define Clusters: Set number of partitions for vector space
Create Quantizer: Initialize cluster assignment mechanism
Build IVF Index: Combine quantizer with inverted file structure
Train Index: Learn cluster centers from data
Add Data: Insert embeddings into trained index
Enable Parallelism: Use multiple CPU cores for faster search

IndexIVFFlat partitions space into clusters for faster search. Training required before data addition. nlist balances speed vs. recall.

Quantization and Compression

For massive datasets, quantization reduces memory with minimal accuracy loss. FAISS supports Product Quantization (PQ) and LSQ (including GPU).

Building a Memory-Efficient Index with Product Quantization

# Example: IVF with Product Quantization
nlist = 100
m = 8  # Number of subquantizers (tune for your dimension)
nbits = 8  # Bits per quantizer
quantizer = faiss.IndexFlatL2(dimension)
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

index_ivfpq.train(embeddings)
index_ivfpq.add(embeddings)

Quantization enables billion-scale search on single servers or GPUs through dramatic memory reduction.

Sharding and Distributed Search

Massive datasets require splitting across machines (sharding). Each shard handles partial data; results merge at query time. While FAISS lacks native multi-node orchestration, it integrates with distributed databases like Pinecone, Weaviate, or Milvus.

Hybrid Search: Semantic + Keyword

Modern search combines vectors (semantic) with keywords (e.g., Elasticsearch). Common pattern: keyword filtering, then FAISS ranking. This hybrid maximizes precision and recall.

Integrating FAISS with Hugging Face Datasets

Hugging Face Datasets offers built-in FAISS integration — add indices to embedding columns for seamless search.

Integrating FAISS with Hugging Face Datasets

from datasets import Dataset
import numpy as np

# Assume 'texts' (list of strings) and 'embeddings' (NumPy array) are defined
hf_dataset = Dataset.from_dict({'text': texts, 'embeddings': list(embeddings)})

# Add a FAISS index to the 'embeddings' column
hf_dataset.add_faiss_index(column='embeddings')

query_embedding = model.encode(["How can I get my money back?"]).astype(np.float32)
scores, retrieved_examples = hf_dataset.get_nearest_examples(
    'embeddings', query_embedding, k=3
)

for idx, (score, example) in enumerate(zip(scores, retrieved_examples['text'])):
    print(f"Rank {idx+1}: {example} (Score: {score})")

Step-by-Step Explanation:

Create Dataset: Combine texts with their embeddings
Add FAISS Index: Enable fast search on embedding column
Query Dataset: Search for similar documents
Retrieve Results: Get nearest examples with scores
Display Ranked Results: Show top matches with similarity scores

This integration streamlines experimentation and production deployment within the Hugging Face ecosystem.

Summary: Scaling semantic search requires more than speed — it demands reliable, relevant results at any scale. Choose appropriate FAISS indices, leverage quantization, integrate with distributed and hybrid tools for enterprise-grade systems.

Key Points:

Use FAISS v1.7.2+ for scalable vector search with latest features
Select exact or approximate indices based on size and latency needs
Apply quantization for memory efficiency on large datasets
Save and version indices for reliability
Integrate with distributed or hybrid search as needed

PostgreSQL with pgvector: Enterprise-Ready Vector Search

While FAISS excels at pure vector operations, many enterprises need vector search integrated with their existing databases. PostgreSQL with pgvector extension delivers production-grade vector similarity search alongside ACID transactions, SQL queries, and mature tooling. This section explores pgvector’s capabilities, performance characteristics, and integration patterns for semantic search at scale.

Step-by-Step Explanation:

Application Layer sends both traditional SQL and vector search queries
PostgreSQL Database contains regular tables and vector-enabled tables
pgvector Extension provides vector operations and specialized indexes
Vector Indexes (IVFFlat, HNSW) accelerate similarity search
Integration allows joining vector results with relational data

Why PostgreSQL for Vector Search?

PostgreSQL with pgvector combines the best of both worlds:

ACID Compliance: Full transactional guarantees for vector operations
SQL Integration: JOIN vector search results with existing data
Mature Ecosystem: Leverage PostgreSQL’s tooling, monitoring, backups
Hybrid Queries: Combine semantic search with filters, aggregations
Cost Efficiency: Use existing PostgreSQL infrastructure

Perfect for organizations already using PostgreSQL who need vector capabilities without managing separate systems.

Setting Up PostgreSQL with pgvector

First, install PostgreSQL with pgvector extension. We’ll use Docker for quick setup:

# Using Docker Compose (recommended for development)
docker-compose up -d postgres-vector

# Or direct Docker command
docker run -d \\
  --name pgvector-demo \\
  -e POSTGRES_PASSWORD=postgres \\
  -p 5433:5432 \\
  pgvector/pgvector:pg16

# Verify connection
psql -h localhost -p 5433 -U postgres -d postgres

Creating Vector-Enabled Tables

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for documents with embeddings
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(384),  -- 384 dimensions for all-MiniLM-L6-v2
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Step-by-Step Explanation:

Enable Extension: Activate pgvector in your database
Define Vector Column: Specify dimensions matching your embedding model
Add Metadata: JSONB column for flexible additional data
Create Index: IVFFlat index for approximate nearest neighbor search
Configure Lists: Balance between speed and accuracy (more lists = better accuracy)

Python Integration with pgvector

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer
import numpy as np

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="localhost",
    port="5433",
    database="vector_demo",
    user="postgres",
    password="postgres"
)

# Register pgvector type
register_vector(conn)
cursor = conn.cursor()

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Insert documents with embeddings
documents = [
    "PostgreSQL is a powerful relational database",
    "Vector search enables semantic similarity",
    "pgvector integrates seamlessly with SQL"
]

for doc in documents:
    embedding = model.encode(doc)
    cursor.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (doc, embedding.tolist())
    )

conn.commit()

# Perform semantic search
query = "database with vector capabilities"
query_embedding = model.encode(query)

cursor.execute("""
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 3
""", (query_embedding.tolist(), query_embedding.tolist()))

for row in cursor.fetchall():
    print(f"Content: {row[0]}")
    print(f"Similarity: {row[1]:.3f}\\n")

Step-by-Step Explanation:

Connect to Database: Establish PostgreSQL connection
Register Vector Type: Enable pgvector operations in Python
Generate Embeddings: Use same model as FAISS examples
Insert with Embeddings: Store documents and vectors together
Query with Similarity: Use <=> operator for cosine distance
Calculate Similarity Score: Convert distance to similarity (1 — distance)

Advanced pgvector Features

1. Multiple Index Types

pgvector supports different index types for various use cases:

-- IVFFlat: Good balance of speed and accuracy
CREATE INDEX idx_ivfflat ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- HNSW: Faster queries, more memory, better for static data
CREATE INDEX idx_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- No index: Exact search for small datasets
-- Simply omit the CREATE INDEX statement

Comparing IVFFlat and HNSW Indexes

Both IVFFlat and HNSW are approximate nearest neighbor (ANN) indexes that accelerate vector search, but they have different characteristics that make them suitable for different scenarios:

Characteristic IVFFlat HNSW

Query Speed Good Excellent (2–10x faster)
Build Time Fast Slow (can be 10x slower)
Memory Usage Lower Higher (typically 2–4x more)
Update Handling Better for dynamic data Better for static data Accuracy Good with tuning Excellent with proper parameters
Build Parameters lists(partition count) m(max connections),ef_construction(build quality)
Query Parameters probes(partitions to check) ef_search(candidate set size)

When to Use IVFFlat

Frequently updated data: Better handles insertions and updates without complete rebuilds
Memory-constrained environments: Uses less RAM than HNSW
Quick index building: Creates indexes faster, especially for large datasets
Balanced performance needs: Good compromise between speed, memory, and accuracy

When to Use HNSW

Query-intensive workloads: Significantly faster search performance
Static or append-only data: Where index rebuilds are infrequent
Higher accuracy requirements: Delivers better recall at same k value
RAM-rich environments: When memory isn’t a primary constraint

Maintenance Considerations

IVFFlat:

Rebuilding Frequency: Performs well with incremental updates but benefits from periodic reindexing (weekly/monthly) when >20% of data changes
Parameter Tuning: May need to adjust lists as dataset grows (rule of thumb: lists = √n/10 where n is row count)
Rebalancing: Less sensitive to data distribution changes than HNSW

HNSW:

Rebuilding Complexity: More impacted by updates; consider full rebuilds after significant changes (>10% of data)
Build-Time vs Query-Time Tradeoff: Higher ef_construction values create better indexes but take longer to build
Runtime Parameter Adjustment: Can tune ef_search at query time without rebuilding index

Memory Usage Estimation:

IVFFlat: Approximately 1.1x the size of your raw vector data
HNSW: Approximately 1.5–2.5x the size of your raw vector data (varies with m parameter)

In practical implementations, consider starting with IVFFlat for development and testing, then benchmark against HNSW for production if query performance becomes a bottleneck and your data is relatively static.

2. Hybrid Search with Filters

Combine vector similarity with SQL conditions:

-- Find similar documents created in the last 7 days
SELECT content, 1 - (embedding <=> %s) AS similarity
FROM documents
WHERE created_at > NOW() - INTERVAL '7 days'
  AND metadata->>'category' = 'technical'
ORDER BY embedding <=> %s
LIMIT 5;

This SQL query demonstrates hybrid search in pgvector, combining vector similarity with traditional SQL filtering. Let’s break down what it does:

Semantic Search with Filtering: The query finds documents semantically similar to a query vector, but only within documents that meet specific criteria
Time-Based Filtering: The WHERE created_at > NOW() - INTERVAL '7 days' clause restricts results to recent documents only
Metadata Filtering: The AND metadata->>'category' = 'technical' clause filters by document category using JSON path operators
Similarity Calculation: 1 - (embedding <=>; %s) AS similarity converts cosine distance to similarity score (0-1 scale)
Ordering: Results are sorted by vector similarity using the cosine distance operator <=>;
Result Limiting: Only returns the top 5 most relevant matches

This powerful approach combines the best of both worlds: the semantic understanding of vector search with the precise filtering capabilities of SQL. It’s ideal for applications that need context-aware search but also require filtering by traditional database attributes.

3. Distance Functions

pgvector supports multiple distance metrics:

-- Cosine distance (default, normalized)
ORDER BY embedding <=> query_vector

-- Euclidean distance (L2)
ORDER BY embedding <-> query_vector

-- Inner product (for non-normalized vectors)
ORDER BY embedding <#> query_vector

The choice of distance metric in vector search significantly impacts result quality and relevance. pgvector supports three primary distance functions, each with specific use cases:

Cosine Distance (`<=>`)

Formula: 1 — (A·B)/(|A|·|B|)

Range: 0 (identical) to 2 (opposite)

Best for: Text embeddings and semantic similarity

Advantages: Normalizes for vector magnitude, focusing purely on direction/angle
When to use: Most transformer models produce normalized embeddings where cosine distance works best
Example use case: Semantic document search, where document length shouldn’t affect relevance

2. Euclidean Distance (`<->`)

Formula: √Σ(Aᵢ — Bᵢ)²

Range: 0 (identical) to ∞

Best for: Physical or spatial data

Advantages: Intuitive physical distance interpretation
When to use: Non-normalized embeddings, image feature vectors, or geometric data
Example use case: Facial recognition, image similarity, geographical positioning

3. Inner Product (`<#>`)

Formula: -A·B (negative dot product)

Range: -∞ to ∞

Best for: Maximum dot product search (MIPS)

Advantages: Faster computation, works well with specific ML models
When to use: Recommendation systems, when vectors are NOT normalized
Example use case: Product recommendations where magnitude encodes relevance/popularity

Selection Guidelines

Use Case Recommended Metric Reason

Semantic text search: Cosine (<=>) Ignores magnitude differences in text embeddings
Image similarity: Euclidean (<->) Preserves spatial relationships in feature space
Recommendations: Inner Product (<#>) Captures preference strength and similarity Pre-normalized vectors Cosine or Euclidean Equivalent when vectors are normalized
Non-normalized with magnitude importance: Inner Product Considers both direction and magnitude

For most transformer-based text embeddings (Open AI embeddings, Cohere embeddings, BERT, MPNet, MiniLM), cosine distance is the recommended default as these models are typically trained to produce normalized vectors where angular similarity matters most.

Performance Optimization

1. Index Selection Strategy

def choose_index_type(num_vectors, update_frequency):
    """Recommend pgvector index based on use case"""
    if num_vectors < 10_000:
        return "No index needed for small datasets"
    elif update_frequency == "high":
        return "IVFFlat - handles updates better"
    else:
        return "HNSW - fastest queries for static data"

This function provides a simple decision framework for choosing the right pgvector index type based on two key factors:

Dataset Size Assessment: For small datasets (under 10,000 vectors), the function recommends skipping indexing entirely, as PostgreSQL can efficiently scan small tables without specialized indexes
Update Frequency Evaluation: For frequently updated data, IVFFlat is recommended due to its better handling of insertions and modifications
Default to Performance: For static or infrequently changing datasets, HNSW is recommended for its superior query performance

This strategy balances performance needs with operational characteristics, ensuring optimal vector search experiences across different use cases.

2. Batch Operations


# Efficient batch insertion
def batch_insert_embeddings(documents, batch_size=100):
    embeddings = model.encode(documents, batch_size=32)

    # Use COPY for fastest insertion
    with cursor.copy(
        "COPY documents (content, embedding) FROM STDIN"
    ) as copy:
        for doc, emb in zip(documents, embeddings):
            copy.write_row([doc, emb.tolist()])

This code demonstrates an efficient method for batch inserting document embeddings into a PostgreSQL database using pgvector. Let’s analyze what it does:

Batch Processing with Transformer Model: The function processes multiple documents at once, generating embeddings in batches of 32 using a transformer model. This is more efficient than encoding documents one by one.
PostgreSQL COPY Command: Instead of using individual INSERT statements, the code leverages PostgreSQL’s COPY protocol, which is significantly faster for bulk operations.
Parallel Data Insertion: The function pairs each document with its corresponding embedding vector and inserts them together in a single operation.
Vector Format Conversion: The embeddings are converted from NumPy arrays to Python lists using tolist() to ensure compatibility with the database.
Performance Benefits: This approach can be 10–100x faster than individual inserts when adding large numbers of vectors to a database.

This pattern is particularly valuable when initially populating a vector database or when performing batch updates, significantly reducing the time required for database operations while maintaining data integrity.

3. Query Optimization

-- Optimize HNSW search performance
SET hnsw.ef_search = 100;  -- Higher = more accurate, slower

-- Limit search to top candidates first
WITH candidates AS (
    SELECT * FROM documents
    WHERE metadata->>'category' = 'target_category'
    LIMIT 1000
)
SELECT content, 1 - (embedding <=> %s) AS similarity
FROM candidates
ORDER BY embedding <=> %s
LIMIT 10;

Comparing pgvector with FAISS

Feature FAISS pgvector

Speed Fastest for pure vector ops Good, with SQL overhead
ACID Transactions No Yes
SQL Integration No Native
Maintenance Manual PostgreSQL tools
Scaling Sharding complexity Read replicas, partitioning
Memory Management Manual PostgreSQL buffer management
Index Types More options IVFFlat, HNSW
Updates Limited Full CRUD support

Postgres with pgvector is available from GCP as AlloyDB. It is serverless, in that you don’t have to worry about scaling per se. Amazon provides support for pgvector and Postgres as well via Aurora, so again, it manages the cluster for you. It is easy to spin up Postgres cluster with pgvector support as well. When it comes to scalability, Postgres plus pgvector is a clear winner.

You can also combine pgvector support with keyword search to combine the best of both worlds. We created a simple library called vector-rag that shows how to combine pgvector with Postgres’s builtin support for keyword search.

PostgreSQL implements BM25-style ranking through its Full-Text Search (FTS) features:

tsvector: Stores preprocessed searchable text
ts_rank_cd(): Provides cover density ranking similar to BM25

Database Schema Changes

Added a tsvector column to the chunks table:

ALTER TABLE chunks
ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX idx_chunks_content_tsv
ON chunks USING GIN (content_tsv);

The tsvector column:

Automatically generated from the content column
Lowercases text and removes stop words
Applies stemming (e.g., “running” → “run”)
Indexed with GIN for fast searches

API Methods from Vector RAG python lib

BM25 style Search

from vector_rag.api import VectorRAGAPI

api = VectorRAGAPI()
results = api.search_bm25(
    project_id=1,
    query_text="PostgreSQL tsvector",
    page=1,
    page_size=10,
    rank_threshold=0.0,  # Minimum BM25 score
    file_id=None,        # Optional: search specific file
    metadata_filter={"category": "technical"}  # Optional filters
)

Hybrid Search

results = api.search_hybrid(
    project_id=1,
    query_text="implement full text search",
    page=1,
    page_size=10,
    vector_weight=0.5,   # Weight for semantic similarity (0-1)
    bm25_weight=0.5,     # Weight for BM25 score (0-1)
    similarity_threshold=0.0,  # Min vector similarity
    rank_threshold=0.0,        # Min BM25 rank
    file_id=None,
    metadata_filter=None
)

# Access individual scores
for result in results.results:
    scores = result.chunk.metadata.get('_scores', {})
    print(f"Vector: {scores.get('vector', 0)}")
    print(f"BM25: {scores.get('bm25', 0)}")
    print(f"Hybrid: {result.score}")

Usage Examples

1. Finding Exact Technical Terms

# BM25 excels at finding exact matches
results = api.search_bm25(
    project_id=project_id,
    query_text="ts_rank_cd PostgreSQL",
    page_size=5
)

2. Semantic Search with Keyword Backup

# Hybrid search with vector-heavy weighting
results = api.search_hybrid(
    project_id=project_id,
    query_text="how to implement search functionality",
    vector_weight=0.7,  # 70% semantic
    bm25_weight=0.3,    # 30% keyword
)

3. Keyword-First Search

# Hybrid search with BM25-heavy weighting
results = api.search_hybrid(
    project_id=project_id,
    query_text="BM25 k1 parameter tuning",
    vector_weight=0.3,  # 30% semantic
    bm25_weight=0.7,    # 70% keyword
)

See the docs for this library’s full text search capabilities. To learn more about using Postgres as your VectorDB see this article.

Production Deployment Patterns

1. Read Replica Pattern

# docker-compose.yml excerpt
services:
  postgres-primary:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_REPLICATION_MODE: master

postgres-replica:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_REPLICATION_MODE: slave
      POSTGRES_MASTER_HOST: postgres-primary

You would most likely use AlloyDB if using GCP. It is serverless, in that you don’t have to worry about scaling per se. If on AWS, you would use Aurora and postgres with pgvector installed. On Azure, spin up a Postgres cluster with pgvector support installed. I have used all three and they are an easy way to get a scalable VectorDB with a cloud provider.

2. Connection Pooling

from psycopg2 import pool

# Create connection pool for better performance
connection_pool = pool.SimpleConnectionPool(
    1, 20,  # min and max connections
    host="localhost",
    port="5433",
    database="vector_demo",
    user="postgres",
    password="postgres"
)

def search_with_pool(query_text):
    conn = connection_pool.getconn()
    try:
        # Perform search
        # ...
    finally:
        connection_pool.putconn(conn)

3. Monitoring and Maintenance

-- Monitor index performance
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexname LIKE '%embedding%';

-- Analyze query performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM documents
ORDER BY embedding <=> %s
LIMIT 10;

The code snippet demonstrates two important techniques for monitoring and optimizing PostgreSQL database performance when working with vector embeddings:

1. Monitoring Index Usage

The first query extracts statistics about indexes with “embedding” in their name from PostgreSQL’s system catalog:

schemaname/tablename/indexname: Identifies the exact location of each index
idx_scan: Shows how many times each index has been used for queries — low numbers may indicate unused indexes
idx_tup_read: Counts index entries read during scans — high values with low idx_scan suggest inefficient index usage
idx_tup_fetch: Counts actual table rows fetched using the index — comparing with idx_tup_read helps identify index efficiency

This query helps identify underutilized indexes (which waste resources) or overused indexes (which may need optimization).

2. Analyzing Vector Search Query Performance

The second command uses PostgreSQL’s EXPLAIN ANALYZE with BUFFERS option to provide detailed execution statistics for a vector similarity search:

EXPLAIN: Shows the query execution plan
ANALYZE: Actually executes the query and reports real timing information
BUFFERS: Adds information about buffer usage (shared, local, and temporary blocks)
ORDER BY embedding <=> %s: Uses the vector distance operator to find similar vectors

The output would show exactly how PostgreSQL executes the vector search, including:

Whether it uses the vector index (IVFFlat or HNSW)
How long each step takes
Memory/buffer usage during execution
Potential bottlenecks in the query execution

These monitoring techniques are essential for optimizing vector search in production environments, especially when scaling to large datasets where performance tuning becomes critical.

Real-World Use Case: Multi-Tenant SaaS

Here’s how to implement vector search for a multi-tenant application:

class MultiTenantVectorSearch:
    def __init__(self, conn_string):
        self.conn = psycopg2.connect(conn_string)
        register_vector(self.conn)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def create_tenant_schema(self, tenant_id):
        """Create isolated schema for each tenant"""
        with self.conn.cursor() as cur:
            cur.execute(f"CREATE SCHEMA IF NOT EXISTS tenant_{tenant_id}")
            cur.execute(f"""
                CREATE TABLE IF NOT EXISTS tenant_{tenant_id}.documents (
                    id SERIAL PRIMARY KEY,
                    content TEXT NOT NULL,
                    embedding vector(384),
                    metadata JSONB,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)
            cur.execute(f"""
                CREATE INDEX ON tenant_{tenant_id}.documents
                USING ivfflat (embedding vector_cosine_ops)
            """)
        self.conn.commit()

    def search_tenant_documents(self, tenant_id, query, limit=10):
        """Search within tenant's isolated data"""
        query_embedding = self.model.encode(query)

        with self.conn.cursor() as cur:
            cur.execute(f"""
                SELECT content, 1 - (embedding <=> %s) AS similarity
                FROM tenant_{tenant_id}.documents
                ORDER BY embedding <=> %s
                LIMIT %s
            """, (query_embedding.tolist(), query_embedding.tolist(), limit))

            return cur.fetchall()

Integration with RAG Systems

pgvector excels in RAG architectures by storing context alongside embeddings:

def rag_with_pgvector(question, context_limit=3):
    # Retrieve relevant documents
    query_emb = model.encode(question)

    cursor.execute("""
        SELECT content, metadata
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT %s
    """, (query_emb.tolist(), context_limit))

    # Build context from results
    context = "\\n\\n".join([row[0] for row in cursor.fetchall()])

    # Generate response with LLM
    prompt = f"Context:\\n{context}\\n\\nQuestion: {question}\\nAnswer:"
    # ... continue with LLM generation

    return response

Best Practices

Index Management: Create indexes after bulk loading for faster initial setup
Dimension Consistency: Ensure all vectors have the same dimensions
Connection Pooling: Use connection pools for concurrent requests
Backup Strategy: Regular pg_dump includes vector data automatically
Version Control: Track schema changes including vector columns
Query Monitoring: Use pg_stat_statements to optimize slow queries

Summary

PostgreSQL with pgvector brings enterprise-grade vector search to existing PostgreSQL deployments. While pure vector databases like FAISS offer ultimate performance, pgvector provides the best integration with relational data, ACID guarantees, and mature operational tooling. Choose pgvector when you need vector search alongside traditional database features, existing PostgreSQL infrastructure, or strong consistency requirements.

Key Takeaways

Traditional keyword search often misses user intent. Searching “How do I get a refund?” might overlook “Return Policy” documents. Semantic search solves this by understanding meaning and context — not just exact words.

Transformer models convert text into embeddings: dense vectors capturing semantic essence. Models like BERT, RoBERTa, and newer options (E5, GTE) generate these embeddings. Similar meanings cluster together, regardless of wording differences.

Generating Semantic Embeddings with Sentence Transformers

# Ensure Python 3.12.9
import sys
print(f"Python: {sys.version}")

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained, efficient sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # For production, benchmark newer models like E5 or GTE

# Example sentences
sentences = [
    "How do I get a refund?",
    "What is your return policy?",
    "How can I reset my password?"
]

# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # Output: (3, 384)

Step-by-Step Explanation:

Verify Environment: Confirm Python 3.12.9 is active
Import Libraries: Load sentence transformers and NumPy
Initialize Model: Load efficient pre-trained transformer
Define Sentences: Create sample texts with varying topics
Generate Embeddings: Convert to 384-dimensional vectors
Verify Output: Confirm embedding dimensions

Each sentence maps to a 384-dimensional vector. Similar meanings produce nearby embeddings in vector space — the foundation of semantic search.

For applications with thousands of documents, efficient retrieval proves critical. FAISS enables fast, scalable vector search. Production often uses managed databases like Pinecone, Weaviate, Qdrant, or Milvus for distributed querying.

Building and Querying a FAISS Index

import faiss

# FAISS requires float32 arrays
embeddings = embeddings.astype('float32')
dimension = embeddings.shape[1]

# For small datasets, use IndexFlatL2 (exact search)
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)  # Add document embeddings

# For large-scale (millions of vectors), consider compressed or inverted indices:
# index = faiss.IndexIVFPQ(faiss.IndexFlatL2(dimension), dimension, nlist=100, m=8, nbits=8)
# See FAISS docs for details.

# Encode a user query
query = model.encode(["How do I get my money back?"]).astype('float32')

# Search for the top-2 most similar documents
D, I = index.search(query, k=2)
print(I)  # Indices of the most relevant documents

Step-by-Step Explanation:

Convert to Float32: FAISS requires specific data type
Create Index: Initialize based on embedding dimensions
Add Embeddings: Store document vectors in index
Encode Query: Transform user question with same model
Search Index: Retrieve most similar documents efficiently
Get Results: I contains indices of best matches

For production workloads, use compressed indices (e.g., IndexIVFPQ) reducing memory and increasing speed. Managed vector databases offer cloud-native scaling.

Tips: Always use identical embedding models for documents and queries. Ensure float32 format — FAISS requirement.

Modern systems employ hybrid retrieval — combining dense vectors (semantic) with keyword search (BM25, SPLADE). This approach maximizes recall and relevance for complex queries.

Advanced applications integrate semantic search with LLMs in Retrieval-Augmented Generation (RAG) architectures. Retrieved documents provide context for generative models, enabling sophisticated QA systems. For more on RAG’s future and Anthropic’s innovations, see my piece Is RAG Dead? Anthropic Says No, which complements these semantic foundations with real-world deployment strategies.

Semantic search powers intelligent chatbots, knowledge management, and discovery tools industry-wide. It improves relevance, saves time, delights users.

Ready to experiment? Encode your own sentences — observe which cluster by meaning. For production, benchmark multiple models (E5, GTE, OpenAI embeddings) selecting best domain fit.

Key Concepts Checklist:

✓ Semantic search retrieves by meaning and context, not keywords
✓ Transformers and embeddings represent text as semantic vectors
✓ FAISS and vector databases enable scalable similarity search
✓ Hybrid retrieval combines dense and sparse methods
✓ RAG integrates search with LLMs for advanced applications
✓ Mastery enables fine-tuning, deployment, and continuous improvement

Glossary:

Semantic Search: Finds results based on meaning and context
Embedding: Dense vector capturing semantic meaning
Sentence Transformer: Model generating sentence-level embeddings
FAISS: Library for fast, scalable vector similarity search
Vector Database: Managed system for storing/querying embeddings at scale
Hybrid Search: Combines semantic and keyword retrieval
RAG: Retrieval-Augmented Generation for context-aware LLM answers
Precision/Recall: Metrics evaluating search quality

What’s next? Start with my foundational series:

Be sure to check out the first seven articles in this series:

Master semantic search and modern approaches — build AI applications that truly understand users. Let’s continue!

Hands-On Resources

Key Hands-On Resources for Practice

Reinforce your learning with these practical resources designed to help you implement semantic search concepts:

For hands-on code, check the companion GitHub repo, let’s follow along to load up the example and run them.

Semantic Search and Information Retrieval with Transformers

This github repo contains working examples for article 09 of the Hugging Face Transformers article series, with enhanced examples demonstrating cutting-edge semantic search techniques.

Overview

Learn how to implement and understand:

Embedding Generation: Create semantic embeddings with multiple models (local and API-based)
Hybrid Search: Combine keyword (BM25) and semantic search for optimal results
Vector Databases: Implement scalable search with FAISS and Chroma
RAG Systems: Build Retrieval-Augmented Generation pipelines
Production Patterns: Real-world deployment strategies and optimizations
Model Optimization: Quantization techniques for efficient deployment

Prerequisites

Python 3.12 (managed via pyenv)
Poetry for dependency management
Go Task for build automation
API keys for any required services (see .env.example)

Setup

Clone this repository

git clone [email protected]:RichardHightower/art_hug_09.git

Run the setup task:

task setup

Copy .env.example to .env and configure as needed

Project Structure

.
├── src/
│   ├── __init__.py
│   ├── config.py              # Configuration and utilities
│   ├── main.py                # Entry point with all examples
│   ├── embedding_generation.py # Generate embeddings with many models
│   ├── hybrid_search.py       # Hybrid keyword + semantic search
│   ├── vector_db_manager.py   # FAISS and Chroma database management
│   ├── rag_integration.py     # RAG pipeline implementation
│   └── utils.py               # Utility functions
├── tests/
│   └── test_examples.py       # Unit tests
├── docs/
│   ├── article9.md            # Original article
│   └── article9i.md           # Enhanced article with additional examples
├── .env.example               # Environment template
├── Taskfile.yml               # Task automation
└── pyproject.toml             # Poetry configuration

Running Examples

Run all examples:

task run

Or run individual modules:

task run-embeddings          # Generate embeddings with various models
task run-hybrid              # Run hybrid search examples
task run-vector-db           # Vector database management
task run-rag                 # Run RAG implementation
task run-quantization        # Run quantization examples

task postgres-logs          # View PostgreSQL logs
task postgres-shell         # Connect to PostgreSQL shell
task postgres-start         # Start PostgreSQL with pgvector extension
task postgres-stop          # Stop PostgreSQL container

Interactive mode:

python src/main.py      # Choose examples interactively

Interactive Jupyter Notebook Tutorial

For a hands-on learning experience with visualizations and interactive examples:

task notebook           # Launch in Jupyter Notebook
# or
task notebook-lab       # Launch in JupyterLab

# or 
# To see FAISS Notebook and Postgres pgvector Notebook run this 
task notebooks

The notebook includes:

Interactive comparisons of keyword vs. semantic search
Visualizations of embeddings and similarity matrices
Real-time performance benchmarking
Hands-on exercises to build your own search systems
Step-by-step explanations with expected results

Available Tasks

% task
task: [default] task --list
task: Available tasks for this project:
* clean:                   Clean up generated files
* default:                 Show available tasks
* format:                  Format code with Black and Ruff
* notebook:                Launch the interactive Jupyter notebook tutorial
* notebook-lab:            Launch the tutorial in JupyterLab
* notebooks:               Launch the interactive Jupyter notebook tutorial
* postgres-logs:           View PostgreSQL logs
* postgres-shell:          Connect to PostgreSQL shell
* postgres-start:          Start PostgreSQL with pgvector extension
* postgres-stop:           Stop PostgreSQL container
* run:                     Run all examples
* run-embeddings:          Run embedding generation examples
* run-hybrid:              Run hybrid search examples
* run-postgres-demo:       Run PostgreSQL vector database demo
* run-quantization:        Run quantization examples
* run-rag:                 Run RAG implementation examples
* run-vector-db:           Run vector database examples
* setup:                   Set up the Python environment and install dependencies
* test:                    Run all tests

Key Features

**1. Embedding Generation (embedding_generation.py**)

Compare multiple embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, E5, multilingual)
Benchmark performance and quality
Support for both local models and API-based embeddings (OpenAI)
Batch processing and storage optimization

**2. Hybrid Search (hybrid_search.py**)

Combine BM25 keyword search with semantic embeddings
Configurable weighting between keyword and semantic components
Production-ready search engine implementation
Performance benchmarking against individual approaches

**3. Vector Database Management (vector_db_manager.py**)

Unified interface for FAISS and Chroma
Support for exact and approximate search indices
Metadata filtering and advanced querying
Easy switching between local and managed deployments

**4. RAG Integration (rag_integration.py**)

Complete RAG pipeline from retrieval to generation
Multi-document context handling
Advanced filtering and reranking
Performance benchmarking

Learn More

Summary

This chapter transformed you from keyword-matching to meaning-understanding. You explored how transformers and vector databases such as FAISS and Postgres that enable modern information retrieval. Through hands-on examples, you built embeddings and implemented scalable search. Armed with these tools, you’re ready to tackle real-world AI search challenges and advance into sophisticated transformer applications.

About the Author

Rick Hightower brings extensive enterprise experience in AI and transformer technologies to his writing. With a background in building production AI systems at scale, Rick focuses on practical, real-world applications of cutting-edge technology. His work bridges the gap between academic research and enterprise deployment, making complex AI concepts accessible to practitioners.

From Keywords to Neural Understanding: The Transformer Revolution in Search — Article 9

Introduction: From Keyword Search to True Understanding

Setting Up Your Environment with Python 3.12.9

Keyword Search vs. Semantic Search: A Modern Comparison

Introduction to Semantic Search

Keyword vs. Semantic Search

Comparing Keyword and Semantic Search Results

Hybrid Search: Combining Keyword and Semantic Approaches

Business Applications of Semantic Search

Vector Database Comparison Table

Measuring Search Quality: Metrics that Matter

Calculating Precision and Recall for Search Results

Embedding Text for Search Applications

What Are Embeddings — and Why Do They Matter?

Generating Embeddings with Sentence Transformers and Modern APIs

Generating Embeddings with Sentence Transformers

Modern Embedding API Example with OpenAI

RAG: Combining Search with Generation

Batch Processing and Storing Embeddings

Batch Embedding and Saving to Disk

Multilingual and Domain-Specific Embeddings

Generating Multilingual Embeddings

Key Takeaways and Next Steps

Vector Databases and FAISS Integration

Setting Up FAISS for Scalable Similarity Search

Installing and Importing FAISS (v1.7.2+)

Ensuring Embedding Data Type Compatibility

Creating a Simple FAISS Index (Exact Search)

Indexing, Querying, and Updating Search Indices

Performing a Search Query with Error Handling

Saving and Loading a FAISS Index (v1.7.2+)

Managed Vector DB Example with Chroma

Scaling Semantic Search

Building an Approximate Search Index with FAISS (IVF)

Building a Memory-Efficient Index with Product Quantization

Integrating FAISS with Hugging Face Datasets

PostgreSQL with pgvector: Enterprise-Ready Vector Search

Why PostgreSQL for Vector Search?

Setting Up PostgreSQL with pgvector

Creating Vector-Enabled Tables

Python Integration with pgvector

Advanced pgvector Features

1. Multiple Index Types

Comparing IVFFlat and HNSW Indexes

When to Use IVFFlat

When to Use HNSW

Maintenance Considerations

2. Hybrid Search with Filters

3. Distance Functions

Cosine Distance (<=>)

2. Euclidean Distance (<->)

3. Inner Product (<#>)

Selection Guidelines

Performance Optimization

1. Index Selection Strategy

2. Batch Operations

3. Query Optimization

Comparing pgvector with FAISS

Database Schema Changes

API Methods from Vector RAG python lib

BM25 style Search

Hybrid Search

Usage Examples

1. Finding Exact Technical Terms

2. Semantic Search with Keyword Backup

3. Keyword-First Search

Production Deployment Patterns

1. Read Replica Pattern

2. Connection Pooling

3. Monitoring and Maintenance

1. Monitoring Index Usage

2. Analyzing Vector Search Query Performance

Real-World Use Case: Multi-Tenant SaaS

Integration with RAG Systems

Best Practices

Summary

Key Takeaways

Generating Semantic Embeddings with Sentence Transformers

Building and Querying a FAISS Index

Further Reading

Cosine Distance (`<=>`)

2. Euclidean Distance (`<->`)

3. Inner Product (`<#>`)