Skip to content

Chunking

Introduction

A chunker is a Docling abstraction that, given a DoclingDocument, returns a stream of chunks, each of which captures some part of the document as a string accompanied by respective metadata.

To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a chunker class hierarchy, providing a base type, BaseChunker, as well as specific subclasses.

Docling integration with gen AI frameworks like LlamaIndex is done using the BaseChunker interface, so users can easily plug in any built-in, self-defined, or third-party BaseChunker implementation.

Base Chunker

The BaseChunker base class API defines that any chunker should provide the following:

  • def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]: Returning the chunks for the provided document.
  • def serialize(self, chunk: BaseChunk) -> str: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model).

Hybrid Chunker

To access HybridChunker

  • If you are using the docling package, you can import as follows:
    from docling.chunking import HybridChunker
    
  • If you are only using the docling-core package, you must ensure to install the chunking extra, e.g.
    pip install 'docling-core[chunking]'
    
    and then you can import as follows:
    from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
    

The HybridChunker implementation uses a hybrid approach, applying tokenization-aware refinements on top of document-based hierarchical chunking.

More precisely:

  • it starts from the result of the hierarchical chunker and, based on the user-provided tokenizer (typically to be aligned to the embedding model tokenizer), it:
  • does one pass where it splits chunks only when needed (i.e. oversized w.r.t. tokens), &
  • another pass where it merges chunks only when possible (i.e. undersized successive chunks with same headings & captions) — users can opt out of this step via param merge_peers (by default True)

👉 Example: see here.

Hierarchical Chunker

The HierarchicalChunker implementation uses the document structure information from the DoclingDocument to create one chunk for each individual detected document element, by default only merging together list items (can be opted out via param merge_list_items). It also takes care of attaching all relevant document metadata, including headers and captions.