Deep Search

Deep Search uses state-of-the-art AI methods to continuously collect, convert, enrich, and link large document collections. You can use it for both public and proprietary PDF documents. Deep Search is offered commercially as a service.

Our open-access system has already converted a large number of documents. Try it yourself for free, or request access to an enterprise system.

Feature Free Enterprise
Convert documents
Extract components such as tables and figures
Automate your process via the Toolkit
Upload and search your own documents
Configure the document conversion process
Search millions of pre-loaded documents
Use question-answering on your own documents
Integrate with gen AI systems like IBM watsonx.ai
Search and extract chemical (sub-)structures on request

Overview

Deep Search converts unstructured PDF documents into structured JSON files. It lets you automate knowledge extraction and fine-tune your proprietary foundation models and large language models. We have already converted and collected the most common open-access repositories of technical documents for you to browse and search. In the same way, you can point Deep Search at your proprietary document collections to create an enterprise search service.

Collect

Deep Search comes pre-loaded with millions of standard technical documents from many public data sources. We adhere to all licensing terms and provide only sources with permissive, open-access licenses. These sources become accessible after the user explicitly confirms the license.

Patents from
  • USA
  • Europe
  • Japan
  • Korea
  • China
Open-Access Publications
  • SemanticScholar
  • CrossRef
  • PLOS
  • MDPI
Public Repositories
  • arXiv
  • PubMed-Central
News articles that are curated and related to technical topics

Document collections are stored and indexed so that you can search for and retrieve any document by its contents, down to the values and physical units in tables. These large collections are processed quickly and concurrently on scalable cloud infrastructure.
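
To illustrate what searching "down to the values and physical units in tables" can mean, here is a minimal sketch. It assumes table cells have already been extracted into simple records; the `Cell` class, the `find_cells` helper, and the sample data are hypothetical and are not part of Deep Search.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical record for one numeric table cell (not a Deep Search type)."""
    doc_id: str
    quantity: str  # what was measured, e.g. "melting point"
    value: float
    unit: str      # physical unit, e.g. "K"


def find_cells(cells, quantity, unit, min_value, max_value):
    """Return cells that match a quantity, a unit, and a closed value range."""
    return [
        c for c in cells
        if c.quantity == quantity
        and c.unit == unit
        and min_value <= c.value <= max_value
    ]


# Made-up sample data for illustration only.
CELLS = [
    Cell("paper-1", "melting point", 1358.0, "K"),
    Cell("paper-2", "melting point", 933.0, "K"),
    Cell("paper-3", "density", 8.96, "g/cm^3"),
]

hits = find_cells(CELLS, "melting point", "K", 900.0, 1000.0)
print([c.doc_id for c in hits])
```

A production index would normalize units and use a search engine rather than a linear scan; the sketch only shows the shape of such a query.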

Convert

Deep Search converts large collections of PDF documents, such as scientific publications and patents, into JSON files. Our state-of-the-art AI detects and delimits document objects that contain information, such as paragraphs and tables. This information is then extracted from the objects, such as text from paragraphs and cells from tables.
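
As a rough picture of what you can do with a converted document, the sketch below reads a small JSON file and pulls out paragraph text and table rows. The JSON layout shown here (`objects`, `type`, `text`, `cells`) is an assumption made for illustration; the actual Deep Search output schema may differ.

```python
import json

# Hypothetical converted document; the real output schema may differ.
CONVERTED = json.loads("""
{
  "name": "example.pdf",
  "objects": [
    {"type": "paragraph", "page": 1, "text": "Copper melts at 1358 K."},
    {"type": "table", "page": 2,
     "cells": [["Material", "Melting point (K)"], ["Copper", "1358"]]}
  ]
}
""")


def paragraphs(doc):
    """Return the text of every paragraph object, in document order."""
    return [o["text"] for o in doc["objects"] if o["type"] == "paragraph"]


def table_rows(doc):
    """Yield each table body row as a dict keyed by the table's header row."""
    for obj in doc["objects"]:
        if obj["type"] != "table":
            continue
        header, *body = obj["cells"]
        for row in body:
            yield dict(zip(header, row))


print(paragraphs(CONVERTED))
print(list(table_rows(CONVERTED)))
```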

Enrich

After parsing a document for its content, Deep Search enriches it. Paragraphs of text are passed through natural language models. These models identify language structures such as sentences and terms, which are then classified into entity types such as a country or a physical property of a material. Likewise, image objects are detected and interpreted by computer vision models.
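
The enrichment step for text can be pictured with a toy example. The sketch below splits text into sentences and tags known terms with an entity type. It uses a plain lookup table and regular expressions as a stand-in for the trained language models described above; `ENTITY_TYPES`, `split_sentences`, and `tag_entities` are hypothetical names.

```python
import re

# Hypothetical lookup table standing in for trained language models.
ENTITY_TYPES = {
    "switzerland": "country",
    "france": "country",
    "melting point": "physical_property",
    "density": "physical_property",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag_entities(sentence):
    """Return (term, entity_type) pairs in order of appearance in the sentence."""
    lowered = sentence.lower()
    found = []
    for term, entity_type in ENTITY_TYPES.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((match.start(), term, entity_type))
    return [(term, entity_type) for _, term, entity_type in sorted(found)]


text = "The density was measured in Switzerland. The melting point was not reported."
for sentence in split_sentences(text):
    print(sentence, tag_entities(sentence))
```

A dictionary lookup cannot handle unseen terms or ambiguity, which is where learned models come in; the sketch only shows the input and output of the step.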

Link

Use Deep Search to interlink your document collections into knowledge graphs. Knowledge graphs relate the entities that have been discovered across documents. You can then query a graph to answer analytical questions that go beyond searching for keywords. For example:

For each material type that is mentioned in my collection of scientific papers, which of its physical properties have been tested and under which conditions?
For each company that is mentioned in my collection of annual reports, what was its total revenue per year?
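
The second question can be sketched as a traversal over a small graph. The example below assumes facts have already been extracted from annual reports; the company names, figures, node layout, and helper functions are all made up for illustration and do not reflect the Deep Search graph model.

```python
from collections import defaultdict

# Made-up facts: (company, year, revenue in millions, source document).
FACTS = [
    ("Acme", 2022, 120.0, "acme-2022.pdf"),
    ("Acme", 2023, 150.0, "acme-2023.pdf"),
    ("Globex", 2023, 80.0, "globex-2023.pdf"),
]


def build_graph(facts):
    """Build adjacency lists of (relation, target) edges keyed by source node."""
    graph = defaultdict(list)
    for company, year, revenue, doc in facts:
        report = ("report", company, year)
        graph[("company", company)].append(("published", report))
        graph[report].append(("total_revenue", revenue))
        graph[report].append(("stated_in", doc))
    return graph


def revenue_per_year(graph):
    """Follow company -> report -> total_revenue edges and group by year."""
    result = defaultdict(dict)
    for node, edges in graph.items():
        if node[0] != "company":
            continue
        for relation, report in edges:
            if relation != "published":
                continue
            for report_relation, value in graph.get(report, []):
                if report_relation == "total_revenue":
                    result[node[1]][report[2]] = value
    return {company: dict(years) for company, years in result.items()}


graph = build_graph(FACTS)
print(revenue_per_year(graph))
```

Because each report node also keeps a `stated_in` edge, an answer can be traced back to the document it came from.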

Toolkit

Deep Search can be accessed programmatically, so you can automate your process. Take a look at the Deep Search Toolkit, our Python development kit and command line interface.

Use cases

PatCID: Patent Chemical-structure Image Discovery

PatCID is a collection of chemical structures found in patent documents, built to facilitate patent search in the organic-chemistry domain. Programmatic access to PatCID can facilitate the discovery of molecules. The collection was created by processing molecular-structure images in patent documents from the United States Patent and Trademark Office, the Japan Patent Office, the European Patent Office, the Korean Intellectual Property Office, and the China National Intellectual Property Administration. See PatCID: Chemical Structures in Patent Documents (Nature Communications, 2024) and MolGrapher: Graph-based Visual Recognition of Chemical Structures (ICCV 2023). Contact us for early access.

DocQA: Conversational question-answering on your documents

DocQA enables information extraction from documents through a conversational question-answering assistant. The system combines technologies from several AI disciplines: document conversion to a machine-readable format (via computer vision), retrieval of relevant data (via natural language processing), and formulation of an eloquent response (via large language models). A research paper describing this application was published at the AAAI 2024 conference.

Question-answering across entire document collections is also supported through Retrieval-Augmented Generation. Contact us for early access.
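
The general idea of Retrieval-Augmented Generation can be sketched in a few lines: retrieve the passages most relevant to a question, then hand them to a language model as context. The example below is a toy under stated assumptions. It ranks passages by word overlap instead of a learned retriever, stops at building the prompt rather than calling a model, and uses hypothetical names (`retrieve`, `build_prompt`) and made-up passages; it is not how DocQA is implemented.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=2):
    """Rank passages by token overlap with the question; keep the top k non-zero hits."""
    question_tokens = tokenize(question)
    scored = [(len(question_tokens & tokenize(p)), i) for i, p in enumerate(passages)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [passages[i] for score, i in ranked[:k] if score > 0]


def build_prompt(question, context):
    """Assemble the prompt that would be sent to a large language model."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up passages for illustration only.
PASSAGES = [
    "Copper has a melting point of 1358 K.",
    "The annual report lists total revenue per year.",
    "Aluminium has a melting point of 933 K.",
]

question = "What is the melting point of copper?"
context = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

Restricting the model to retrieved context is what lets answers be grounded in, and traced back to, specific documents.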

People

At IBM Research we are always looking for scientists and engineers, be it as interns, PhD students, or staff. Contact us with your résumé and interests.

Past members

Lokesh Mishra
Luca Buratti
Francesco Fusco
Diego Antognini
Birgit Pfitzmann
Rita Kuznetsova

Past students

Fabian Lindlbauer
Matteo Omenetti
Eric Sease
Alice Sizer
Tien Dee Lin
Sohayl Dhibi
Christian Cadisch
Teodora Nedic
Isabelle Franzen

Locations

Zürich, Switzerland
Paris, France