Hybrid RAG with Qdrant¶
Overview¶
This example demonstrates using Docling with Qdrant to perform a hybrid search across your documents using dense and sparse vectors.
We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding.
Setup¶
- 👉 The Qdrant client uses FastEmbed to generate vector embeddings. You can install the `fastembed-gpu` package if you've got the hardware to support it.
In [ ]:
%pip install --no-warn-conflicts -q qdrant-client docling docling-core fastembed
Let's import all the classes we'll be working with.
In [1]:
from docling_core.transforms.chunker import HierarchicalChunker
from qdrant_client import QdrantClient

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter
- For Docling, we'll set the allowed formats to HTML since we'll only be working with webpages in this tutorial.
- If we set a sparse model, the Qdrant client will fuse the dense and sparse results using Reciprocal Rank Fusion (RRF), as sketched below.
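To make that fusion step concrete, here is a minimal sketch of how RRF merges two ranked result lists. It illustrates the formula only and is not Qdrant's internal implementation; the document IDs and the constant k=60 (a commonly used default) are assumptions.

def reciprocal_rank_fusion(rankings, k=60):
    # Each document's fused score is the sum of 1 / (k + rank)
    # over every result list in which it appears.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical dense and sparse rankings for the same query:
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))

Documents that rank well in both lists (here doc_a and doc_b) rise to the top, which is exactly why hybrid search tends to beat either vector type alone.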
In [2]:
COLLECTION_NAME = "docling"

doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])
client = QdrantClient(location=":memory:")
# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.
# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant
# client = QdrantClient(location="http://localhost:6333")
client.set_model("sentence-transformers/all-MiniLM-L6-v2")
client.set_sparse_model("Qdrant/bm25")
We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)
In [3]:
result = doc_converter.convert(
    "https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag"
)

documents, metadatas = [], []
for chunk in HierarchicalChunker().chunk(result.document):
    documents.append(chunk.text)
    metadatas.append(chunk.meta.export_json_dict())
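Before uploading, it can help to sanity-check what the chunker produced. A quick optional peek (a minimal sketch; the metadata keys come from Docling's chunk metadata, and `headings` may be absent for some chunks):

# Optional: inspect the first chunk's text and metadata before indexing.
print(documents[0])
print(metadatas[0].get("headings"))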
Let's now upload the documents to Qdrant.
- The `add()` method batches the documents and uses FastEmbed to generate vector embeddings on our machine.
In [4]:
client.add(COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64)
Out[4]:
['e74ae15be5eb4805858307846318e784', 'f83f6125b0fa4a0595ae6a0777c9d90d', '9cf63c7f30764715bf3804a19db36d7d', '007dbe6d355b4b49af3b736cbd63a4d8', 'e5e31f21f2e84aa68beca0dfc532cbe9', '69c10816af204bb28630a1f957d8dd3e', 'b63546b9b1744063bdb076b234d883ca', '90ad15ba8fa6494489e1d3221e30bfcf', '13517debb483452ea40fc7aa04c08c50', '84ccab5cfab74e27a55acef1c63e3fad', 'e8aa2ef46d234c5a8a9da64b701d60b4', '190bea5ba43c45e792197c50898d1d90', 'a730319ea65645ca81e735ace0bcc72e', '415e7f6f15864e30b836e23ae8d71b43', '5569bce4e65541868c762d149c6f491e', '74d9b234e9c04ebeb8e4e1ca625789ac', '308b1c5006a94a679f4c8d6f2396993c', 'aaa5ec6d385a418388e660c425bf1dbe', '630be8e43e4e4472a9cdb9af9462a43a', '643b316224de4770a5349bf69cf93471', 'da9265e6f6c2485493d15223eefdf411', 'a916e447d52c4084b5ce81a0c5a65b07', '2883c620858e4e728b88e127155a4f2c', '2a998f0e9c124af99027060b94027874', 'be551fbd2b9e42f48ebae0cbf1f481bc', '95b7f7608e974ca6847097ee4590fba1', '309db4f3863b4e3aaf16d5f346c309f3', 'c818383267f64fd68b2237b024bd724e', '1f16e78338c94238892171b400051cd4', '25c680c3e064462cab071ea9bf1bad8c', 'f41ab7e480a248c6bb87019341c7ca74', 'd440128bed6d4dcb987152b48ecd9a8a', 'c110d5dfdc5849808851788c2404dd15']
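The returned values are the IDs that Qdrant assigned to the uploaded points.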
Query Documents¶
In [5]:
points = client.query(COLLECTION_NAME, query_text="Can I split documents?", limit=10)
print("<=== Retrieved documents ===>")
for point in points:
    print(point.document)
<=== Retrieved documents ===>
Document Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.
Document Specific Chunking can handle a variety of document formats, such as:
Consequently, there are also splitters available for this purpose.
1. We start at the top of the document, treating the first part as a chunk. 2. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one. 3. We keep this up until we reach the end of the document.
Have you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:
The goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.
To put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.
Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.
You can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!
And there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.
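Since each chunk's metadata was stored as the point payload, hybrid queries can also be narrowed with a payload filter. A minimal sketch, assuming the exported metadata carries the chunk's section headings under a `headings` key; the heading value used here is hypothetical:

from qdrant_client import models

# Hypothetical payload filter: keep only chunks whose section headings
# include "Semantic Chunking" (replace with a heading present in your data).
filtered = client.query(
    COLLECTION_NAME,
    query_text="Can I split documents?",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="headings",
                match=models.MatchValue(value="Semantic Chunking"),
            )
        ]
    ),
    limit=5,
)
for point in filtered:
    print(point.document)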