
How to Build a Semantic Search Engine with Qdrant and text-embedding-3

Practical tutorial: Build a semantic search engine with Qdrant and text-embedding-3

Alexia Torres · April 24, 2026 · 10 min read · 1,834 words

Traditional keyword search breaks down on queries that express intent. When you type "affordable electric vehicles with long range" into a keyword-based system, it doesn't understand what you actually want. It looks for matching strings, not meaning. This is the central failure mode of legacy search architecture, and it's precisely the problem that vector search was built to solve.

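We are now entering an era where search engines don't just match keywords; they approximate intent. Powered by dense vector embeddings and high-performance vector databases, semantic search systems can retrieve documents based on conceptual similarity rather than lexical overlap. In this deep-dive, we'll build one from scratch using Qdrant, a purpose-built vector database [4], and an embedding model loaded through Hugging Face's transformers library [5]. A note on naming: text-embedding-3 is OpenAI's hosted embedding family (text-embedding-3-small and text-embedding-3-large), which is served through the OpenAI API and is not available through transformers. The code below uses the open sentence-transformers/all-MiniLM-L6-v2 model so that everything runs locally; the same pipeline works with text-embedding-3 once the embedding function and the vector size are swapped. By the end, you'll have a working semantic search engine and a set of patterns for scaling it toward millions of documents.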

The Architecture of Meaning: Why Vector Databases Are the New Search Backend

Before we touch a single line of code, it's worth understanding why this architecture matters. Traditional search engines rely on inverted indexes and token matching. They work well for exact queries but collapse when faced with synonyms, paraphrasing, or abstract concepts. Semantic search, by contrast, converts text into high-dimensional vectors—numerical representations that capture the meaning of the text in a way that machines can compare mathematically.

The architecture we're building has two core components. First, an embedding model (in this tutorial's code, all-MiniLM-L6-v2 from the Hugging Face Hub), which transforms any piece of text into a dense vector. Second, Qdrant, a vector database [1] designed from the ground up for similarity search. Qdrant stores these vectors, indexes them with HNSW (Hierarchical Navigable Small World) graphs, and answers nearest-neighbor queries with low latency.

What makes this pairing powerful is that Qdrant doesn't just store vectors; it stores payloads alongside them. That means you can attach metadata—document IDs, timestamps, categories—to each vector and filter on those attributes during search. This is critical for real-world applications like enterprise document retrieval, recommendation systems, or even AI-powered customer support.

The beauty of this approach is that it's model-agnostic. You can swap the embedding model without changing the search infrastructure, as long as the collection's vector size matches the new model's output and the stored documents are re-embedded with it. The vector database doesn't care what generated the vectors; it only cares about the geometry of the space they occupy.

Setting the Stage: Dependencies and Environment

Let's get our hands dirty. The code examples in this guide are written in Python 3, and we'll need three packages: qdrant-client for interacting with our vector database, the transformers library [5] for loading our embedding model, and torch as the backend that transformers runs on.

pip install qdrant-client transformers torch

The choice of these libraries is deliberate. Qdrant's Python client offers a clean, idiomatic API that abstracts away the complexity of gRPC and REST calls. The transformers library, meanwhile, gives us access to state-of-the-art NLP models with minimal boilerplate. We're using PyTorch as the backend for tensor operations, but you could just as easily use TensorFlow or ONNX Runtime.

For local development, you'll need a running Qdrant instance. The easiest way to get one up is via Docker:

docker run -p 6333:6333 qdrant/qdrant

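This spins up a Qdrant server on localhost:6333, the default REST port. Qdrant's gRPC interface listens on 6334 by default, so add -p 6334:6334 to the command if you plan to use gRPC. If you're working with a remote instance, pass its URL to the client (QdrantClient(url=...)) instead of a host and port.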

From Text to Vectors: Initializing the Embedding Pipeline

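The first step in our implementation is loading the embedding model. We're using sentence-transformers/all-MiniLM-L6-v2, a lightweight sentence-embedding model that produces 384-dimensional vectors while maintaining strong performance on semantic similarity tasks. It is a separate model from OpenAI's text-embedding-3 family, not a variant of it: the text-embedding-3 models are served through the OpenAI API and return 1536-dimensional (small) or 3072-dimensional (large) vectors by default, so using them here would mean replacing embed_text with an API call and changing the collection's vector size.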

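from transformers import AutoTokenizer, AutoModel
import torch

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed_text(text):
    """Embed a string or a list of strings; returns an array of shape (n, 384)."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only: the attention mask zeroes out padding
    # positions, which would otherwise skew the average when texts are batched.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).numpy()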

Notice the use of mean pooling rather than the pooler_output. This is a subtle but important distinction. The pooler_output from many transformer models is designed for classification tasks, not for generating semantically meaningful embeddings. Mean pooling across all token embeddings produces a more robust representation for similarity search—a trick that sentence-transformers popularized.
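
One detail matters as soon as several texts are embedded in a single call: shorter texts are padded to the length of the longest one, and a plain mean over every position would average those padding positions in as well. The usual remedy is to weight the mean by the attention mask. Here is a minimal pure-Python sketch of that idea on toy numbers; the function name masked_mean_pool and the sample values are illustrative, not part of any library.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


# Two real tokens followed by two padding positions (toy values).
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

naive = [sum(col) / len(tokens) for col in zip(*tokens)]
pooled = masked_mean_pool(tokens, mask)

print(naive)   # [5.5, 6.5]: pulled toward the padding values
print(pooled)  # [2.0, 4.0]: mean of the two real tokens only
```

For a single text there is no padding, so both approaches give the same result; the difference only shows up in batched calls.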

The embed_text function is the workhorse of our system. It takes raw text, tokenizes it with truncation and padding, runs it through the model, and returns a NumPy array that we can feed directly into Qdrant. For production systems, you'd want to batch this operation and potentially cache embeddings, but for our purposes, this is a solid foundation.
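
As a rough illustration of the caching idea, here is a small in-memory sketch keyed by a hash of the text. The EmbeddingCache class and the fake_embed stand-in are hypothetical names used only for this example; in a real pipeline you would pass in your actual embedding function and probably back the cache with a persistent store.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by the SHA-256 digest of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times the embedder was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Stand-in for a real model call so the sketch runs without any downloads.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
first = cache.get("quantum computing")
second = cache.get("quantum computing")  # served from the cache
cache.get("neural networks")

print(cache.misses)  # 2
```

Hashing the text keeps the keys a fixed size regardless of document length, which is convenient if the cache later moves to an external key-value store.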

Building the Vector Store: Creating Collections and Inserting Data

With our embedding pipeline ready, we need to initialize the Qdrant client and create a collection to store our vectors. A collection in Qdrant is analogous to a table in a relational database—it defines the schema for your vectors and their associated payloads.

from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

collection_name = "documents"
vector_size = 384  # Matches our embedding model's output

# vectors_config expects a VectorParams object rather than a plain dict
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=vector_size,
        distance=models.Distance.COSINE,
    ),
)

We're using cosine similarity as our metric, which is the usual choice for sentence embeddings. It depends only on the angle between two vectors, not on their magnitude, which matters because our embed_text function does not normalize its outputs. Qdrant normalizes vectors on upload when a collection uses the Cosine distance, so no extra normalization step is needed on our side.
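
To see why magnitude drops out, here is a short standalone sketch of the cosine computation on toy vectors. It only illustrates the formula; it is not how Qdrant computes scores internally, and the variable names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


base = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude
other = [3.0, -1.0, 0.5]     # a different direction

same_direction = cosine_similarity(base, scaled)
different_direction = cosine_similarity(base, other)

print(same_direction)       # approximately 1.0: scaling does not change the score
print(different_direction)  # noticeably lower
```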

Now let's insert some documents. Each document gets an ID, its embedding vector, and a payload containing the original text. This payload is what we'll return to users when they search.

documents = [
    {"id": 1, "text": "Quantum computing promises to revolutionize cryptography."},
    {"id": 2, "text": "Neural networks are inspired by the human brain's structure."},
    {"id": 3, "text": "Blockchain technology enables decentralized trust systems."},
]

from qdrant_client import models

for doc in documents:
    embedding = embed_text(doc["text"])
    # upsert inserts the point, or overwrites it if the ID already exists
    client.upsert(
        collection_name=collection_name,
        points=[
            models.PointStruct(
                id=doc["id"],
                vector=embedding[0].tolist(),
                payload={"text": doc["text"]},
            )
        ],
    )

This is a simplified example, but it illustrates the core pattern. In production, you'd be ingesting thousands or millions of documents. That's where batch processing becomes essential—a topic we'll explore in the optimization section.

The Search Experience: Querying with Intent

The payoff for all this setup is the ability to search by meaning rather than by keyword. When a user submits a query, we generate its embedding using the same embed_text function, then ask Qdrant to find the nearest neighbors in vector space.

query_text = "Explain how quantum mechanics affects encryption."

query_embedding = embed_text(query_text)

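# query_points is the current search entry point in qdrant-client;
# older releases exposed the same operation as client.search(query_vector=...)
search_result = client.query_points(
    collection_name=collection_name,
    query=query_embedding[0].tolist(),
    limit=5,
    with_payload=True
).points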

for hit in search_result:
    print(f"Document ID: {hit.id}, Score: {hit.score:.4f}")
    print(f"Text: {hit.payload['text']}\n")

If you run this query against our three documents, you should see the quantum computing document score highest, even though the query says "encryption" where the document says "cryptography." That's the magic of semantic search. The model places related concepts close together in vector space, and the vector database surfaces that relationship mathematically.

This is where Qdrant really shines. The search call handles approximate nearest neighbor (ANN) search under the hood: index traversal, distance calculations, and result ranking. You provide a vector and get back the closest stored points, ordered by score.
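
To make concrete what the ANN index is approximating, here is a brute-force exact search sketch in plain Python. It scores every stored vector against the query, which is fine for a handful of points and far too slow for millions; that gap is what HNSW closes. The names normalize and exact_search and the toy two-dimensional vectors are assumptions for this illustration only.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def exact_search(query, points, limit=5):
    """Return the `limit` best (score, point_id) pairs, highest score first."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), point_id)
        for point_id, vec in points.items()
    )
    return heapq.nlargest(limit, scored)


# Toy 2-D "embeddings" keyed by point ID.
points = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.7, 0.7]}
hits = exact_search([1.0, 0.2], points, limit=2)

for score, point_id in hits:
    print(f"Document ID: {point_id}, Score: {score:.4f}")
```

An approximate index trades a small chance of missing a true neighbor for a search cost that grows much more slowly than this linear scan.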

Production-Grade Optimizations: Scaling Beyond the Prototype

A working prototype is satisfying, but the real challenge is scaling to production workloads. Here are three optimizations that will take your semantic search engine from a Jupyter notebook to a production service.

Batch Processing for Indexing: When you're ingesting millions of documents, uploading them one at a time is painfully slow. Qdrant supports batch uploads, which dramatically reduce network overhead and improve throughput.

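from qdrant_client import models

def process_batch(batch):
    # One forward pass embeds the whole batch
    embeddings = embed_text([doc["text"] for doc in batch])
    points = [
        models.PointStruct(
            id=doc["id"],
            vector=embedding.tolist(),
            payload={"text": doc["text"]},
        )
        for doc, embedding in zip(batch, embeddings)
    ]
    # A single upsert call sends the whole batch in one request
    client.upsert(collection_name=collection_name, points=points)

# all_documents: the full list of {"id", "text"} dicts to index (not defined in this tutorial)
batch_size = 100
for i in range(0, len(all_documents), batch_size):
    process_batch(all_documents[i:i+batch_size])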
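
Slicing by index assumes the whole corpus is already a list in memory. If documents arrive from a file or a database cursor instead, a generator-based chunker is one possible alternative. The helper name batched below is an assumption for this sketch (Python 3.12 also ships a similar itertools.batched).

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Works on generators too, so the full corpus never has to sit in memory.
stream = ({"id": i, "text": f"document {i}"} for i in range(5))
sizes = [len(batch) for batch in batched(stream, 2)]
print(sizes)  # [2, 2, 1]
```

Each yielded batch can be handed to a function like process_batch unchanged.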

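Asynchronous Operations: For I/O-bound tasks like network calls, asynchronous programming can significantly improve throughput. The qdrant-client package ships an AsyncQdrantClient whose methods mirror the synchronous client but are awaitable, allowing you to overlap embedding computation with network communication.

import asyncio
from qdrant_client import AsyncQdrantClient

async_client = AsyncQdrantClient(host="localhost", port=6333)

async def async_upload(points):
    # points: a list of models.PointStruct, built as in process_batch
    await async_client.upsert(
        collection_name=collection_name,
        points=points
    )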
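
Firing every upload at once can overwhelm the server, so a common pattern is to cap the number of requests in flight. The sketch below shows that pattern with asyncio.Semaphore. The upload_all helper and the fake_upload stand-in are hypothetical; in practice upload_fn would be a coroutine that calls your Qdrant client.

```python
import asyncio


async def upload_all(batches, upload_fn, max_in_flight=4):
    """Run upload_fn over all batches with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(batch):
        async with semaphore:
            return await upload_fn(batch)

    # gather returns results in the same order as the input batches
    return await asyncio.gather(*(guarded(batch) for batch in batches))


async def fake_upload(batch):
    # Stand-in for a network call so the sketch runs without a server.
    await asyncio.sleep(0.01)
    return len(batch)


results = asyncio.run(
    upload_all([[1, 2], [3], [4, 5, 6]], fake_upload, max_in_flight=2)
)
print(results)  # [2, 1, 3]
```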

Hardware Acceleration: For high-throughput scenarios, consider running the embedding model on a GPU. Qdrant itself serves queries on CPU, so the transformer forward pass is the natural candidate for acceleration. A modern GPU can typically push hundreds of short texts per second through a small transformer model, which makes it valuable for real-time indexing pipelines.

Error Handling, Security, and the Edge Cases That Matter

Production systems fail in ways that prototypes don't. Network partitions, rate limits, and malformed data are not theoretical concerns—they're daily realities. Here's how to build resilience into your semantic search engine.

Robust Error Handling: Wrap all Qdrant operations in try-except blocks with appropriate logging. Network failures during batch uploads should trigger retries with exponential backoff.

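# Requires: pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential

# Up to 3 attempts, with exponentially growing waits clamped to 4-10 seconds;
# reraise=True surfaces the original exception once the attempts are exhausted
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
def safe_upload(points):
    client.upsert(collection_name=collection_name, points=points)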
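
For readers who want to see the mechanics without a dependency, here is a standard-library sketch of retry with exponential backoff and a little jitter. It retries only on ConnectionError, on the assumption that malformed data should fail immediately instead of being retried. The names retry_with_backoff and flaky_upload are illustrative, and the sleep parameter is injectable so the demo does not actually wait.

```python
import random
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0, max_delay=10.0,
                       sleep=time.sleep):
    """Call fn(); on ConnectionError wait base_delay * 2**n (capped) and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter spreads out retries from concurrent clients
            sleep(delay + random.uniform(0, delay * 0.1))


calls = {"n": 0}


def flaky_upload():
    # Simulated upload that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky_upload, sleep=delays.append)
print(result, len(delays))  # ok 2
```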

Security Considerations: Never hardcode API keys or database credentials. Use environment variables or a secrets manager. For production deployments, ensure Qdrant is running behind a firewall or VPN, and enable TLS for all client-server communication.
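
A minimal sketch of the environment-variable approach follows. The variable names QDRANT_URL and QDRANT_API_KEY and the helper load_qdrant_settings are assumptions chosen for this example; the returned dict uses the url and api_key keyword arguments that QdrantClient accepts.

```python
import os


def load_qdrant_settings(env=None):
    """Build QdrantClient keyword arguments from environment variables."""
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL", "http://localhost:6333")
    api_key = env.get("QDRANT_API_KEY")
    if api_key and not url.startswith("https://"):
        # An API key sent over plain HTTP can be read by anyone on the path.
        raise ValueError("refusing to send an API key over a non-TLS URL")
    settings = {"url": url}
    if api_key:
        settings["api_key"] = api_key
    return settings


# A plain dict stands in for os.environ so the demo is reproducible.
settings = load_qdrant_settings({
    "QDRANT_URL": "https://qdrant.example.com:6333",
    "QDRANT_API_KEY": "secret-from-your-secrets-manager",
})
print(sorted(settings))  # ['api_key', 'url']

# With qdrant-client installed, the settings plug straight into the client:
# client = QdrantClient(**settings)
```

With no variables set, the helper falls back to the local Docker instance from the setup section, so the same code path covers both local and remote deployments.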

Scaling Bottlenecks: The most common bottleneck in semantic search systems is the embedding generation step, not the vector search itself. Monitor your embedding pipeline's throughput and consider using a dedicated inference server (like Hugging Face's Text Embeddings Inference (TEI) or NVIDIA Triton) to decouple embedding generation from your application logic.

Beyond the Basics: Real-Time Indexing and Advanced Queries

Once you have the core system running, there are several directions for enhancement. Real-time indexing allows you to add new documents to the search index as they arrive, without rebuilding the entire index. Qdrant supports point-by-point insertion, making this straightforward.

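Advanced query capabilities include filtering on payload fields. For example, you might want to search only documents from a specific date range or category. Qdrant's search API accepts a filter (the query_filter argument in the Python client) that combines vector similarity with structured conditions. The example below assumes each point's payload includes a category field, which the documents indexed earlier do not have.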

from qdrant_client.http import models

filter_condition = models.Filter(
    must=[
        models.FieldCondition(
            key="category",
            match=models.MatchValue(value="technology")
        )
    ]
)

results = client.query_points(
    collection_name=collection_name,
    query=query_embedding[0].tolist(),
    limit=10,
    query_filter=filter_condition,
    with_payload=True
).points

Combining semantic similarity with metadata filtering, often called filtered vector search, is what makes vector databases practical for real-world applications. It lets a single query respect business rules such as category, date, or access scope while still ranking results by meaning. It's the difference between a toy search engine and a production system that can handle complex business logic.

The Road Ahead: From Search to Understanding

Building a semantic search engine is more than a technical exercise—it's a shift in how we think about information retrieval. By moving from keyword matching to semantic understanding, we open up new possibilities for how humans interact with machines. The system we've built here is just the beginning.

For further exploration, consider integrating this search engine with open-source LLMs to build retrieval-augmented generation (RAG) pipelines, or explore how vector databases compare to traditional search backends. The AI tutorials on our site provide deeper dives into these topics.

The code is on your machine. The Qdrant instance is running. The embeddings are flowing. What you build next is up to you.

