Chunking
Introduction
A chunker is a Docling abstraction that, given a DoclingDocument, returns a stream of chunks, each of which
captures some part of the document as a string accompanied by respective metadata.
To enable both flexibility for downstream applications and out-of-the-box utility,
Docling defines a chunker class hierarchy, providing a base type, BaseChunker, as well
as specific subclasses.
Docling integration with gen AI frameworks like LlamaIndex is done using the
BaseChunker interface, so users can easily plug in any built-in, self-defined, or
third-party BaseChunker implementation.
Base Chunker
The BaseChunker base class API defines that any chunker should provide the following:
- def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]: Returns the chunks for the provided document.
- def serialize(self, chunk: BaseChunk) -> str: Returns the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model).
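As a rough illustration of this two-method contract, the sketch below mirrors it with plain Python stand-ins. ToyDocument, ToyChunk, ToyBaseChunker, and ParagraphChunker are hypothetical names invented for this example; they are not Docling classes, and a real implementation would subclass BaseChunker and operate on a DoclingDocument instead.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for DoclingDocument: a title plus paragraphs."""
    title: str
    paragraphs: List[str]


@dataclass
class ToyChunk:
    """Hypothetical stand-in for BaseChunk: text plus metadata."""
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


class ToyBaseChunker(ABC):
    """Mirrors the chunk/serialize contract described above (not Docling's class)."""

    @abstractmethod
    def chunk(self, dl_doc: ToyDocument, **kwargs: Any) -> Iterator[ToyChunk]:
        """Yield the chunks for the provided document."""

    @abstractmethod
    def serialize(self, chunk: ToyChunk) -> str:
        """Return the text that would be fed to an embedding model."""


class ParagraphChunker(ToyBaseChunker):
    """Toy chunker: one chunk per non-empty paragraph."""

    def chunk(self, dl_doc: ToyDocument, **kwargs: Any) -> Iterator[ToyChunk]:
        for index, paragraph in enumerate(dl_doc.paragraphs):
            text = paragraph.strip()
            if text:  # skip whitespace-only paragraphs
                yield ToyChunk(text=text, meta={"title": dl_doc.title, "index": index})

    def serialize(self, chunk: ToyChunk) -> str:
        # Enrich the chunk text with metadata (here: the document title).
        return f"{chunk.meta['title']}\n{chunk.text}"
```

Because chunk returns an iterator, callers can stream chunks into an embedding step without materializing the whole list; serialize is the hook where metadata such as headings can be folded into the embedded text.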
Hybrid Chunker
To access HybridChunker:
- If you are using the docling package, you can import as follows:
    from docling.chunking import HybridChunker
- If you are only using the docling-core package, make sure to install the chunking extra, e.g.
    pip install 'docling-core[chunking]'
  and then you can import as follows:
    from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
The HybridChunker implementation uses a hybrid approach, applying tokenization-aware
refinements on top of document-based hierarchical chunking.
More precisely:
- it starts from the result of the hierarchical chunker and, based on the user-provided tokenizer (typically aligned with the embedding model's tokenizer), it:
  - does one pass where it splits chunks only when needed (i.e. oversized w.r.t. tokens), and
  - does another pass where it merges chunks only when possible (i.e. undersized successive
    chunks with the same headings and captions); users can opt out of this step via the param
    merge_peers (by default True)
👉 Example: see the hybrid chunking example in the Docling documentation.
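The split-then-merge idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the HybridChunker implementation: Chunk, count_tokens, split_oversized, merge_peers, and hybrid_refine are hypothetical names, the tokenizer is a simple whitespace word count, and only headings (not captions) are compared when merging.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical chunk: text plus the headings it sits under."""
    text: str
    headings: Tuple[str, ...]


def count_tokens(text: str) -> int:
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_oversized(chunks: List[Chunk], max_tokens: int) -> List[Chunk]:
    """Pass 1: split only the chunks that exceed the token budget."""
    result: List[Chunk] = []
    for chunk in chunks:
        words = chunk.text.split()
        if len(words) <= max_tokens:
            result.append(chunk)
            continue
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            result.append(Chunk(piece, chunk.headings))
    return result


def merge_peers(chunks: List[Chunk], max_tokens: int) -> List[Chunk]:
    """Pass 2: merge successive chunks with equal headings if the result fits."""
    merged: List[Chunk] = []
    for chunk in chunks:
        if merged:
            last = merged[-1]
            combined = f"{last.text} {chunk.text}"
            if last.headings == chunk.headings and count_tokens(combined) <= max_tokens:
                merged[-1] = Chunk(combined, last.headings)
                continue
        merged.append(chunk)
    return merged


def hybrid_refine(chunks: List[Chunk], max_tokens: int, merge: bool = True) -> List[Chunk]:
    """Run the split pass, then the optional merge pass."""
    refined = split_oversized(chunks, max_tokens)
    return merge_peers(refined, max_tokens) if merge else refined
```

The merge flag here plays the role that merge_peers plays in the description above: turning it off keeps the output of the split pass unchanged. A real setup would count tokens with the same tokenizer as the embedding model, so that chunk sizes match what the model actually sees.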
Hierarchical Chunker
The HierarchicalChunker implementation uses the document structure information from
the DoclingDocument to create one chunk for each individual
detected document element, by default only merging together list items (this can be disabled
via the param merge_list_items). It also takes care of attaching all relevant document
metadata, including headers and captions.
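The behavior described above can be approximated with a small, self-contained sketch. It is a simplified model rather than the HierarchicalChunker code: Element, Chunk, and hierarchical_chunks are hypothetical names, the document is a flat list of elements, only a single heading level is tracked, and captions are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical document element."""
    kind: str  # "heading", "paragraph", or "list_item"
    text: str


@dataclass
class Chunk:
    """Hypothetical chunk: text plus the heading it sits under."""
    text: str
    headings: Tuple[str, ...]


def hierarchical_chunks(elements: List[Element], merge_list_items: bool = True) -> List[Chunk]:
    """One chunk per element, optionally merging runs of consecutive list items."""
    chunks: List[Chunk] = []
    current_heading: Tuple[str, ...] = ()
    previous_kind: Optional[str] = None
    for element in elements:
        if element.kind == "heading":
            # Headings are not chunks themselves; they become metadata.
            current_heading = (element.text,)
        elif merge_list_items and element.kind == "list_item" and previous_kind == "list_item":
            # Extend the chunk started by the previous list item.
            chunks[-1].text += "\n" + element.text
        else:
            chunks.append(Chunk(element.text, current_heading))
        previous_kind = element.kind
    return chunks
```

With merge_list_items left at True, a run of adjacent list items becomes one chunk while every other element stays separate; passing False yields one chunk per list item. Each chunk carries its heading so that downstream steps can use it as context.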