Tip

The features described on this page are not available in the public service. Contact us to learn more.

Data indices

A data index stores a collection of documents in a project. This page shows how to create, list, and delete data indices in a project, and how to add documents and attachments to them.

Since a data index "lives" inside a project, we need to specify which project we are referring to. This is done with a project key, PROJ_KEY. We can obtain the keys of our projects by listing the projects.

Creating a data index in a project

Suppose you want to create an index called NAME. Optionally, you can provide a description, DESC, for the data index.

$ deepsearch cps data-indices create -p PROJ_KEY -n NAME -d DESC

After you have generated the api object (from a profile):

api.data_indices.create(proj_key=PROJ_KEY, name=NAME, desc=DESC)

In addition, it is possible to specify a non-default type of data index. For more details, see here for the CLI and here for Python.

Type | Description
Document (Default) | Index containing documents uploaded as PDF and converted by the platform.
DB Record | Index containing data matching the DB records schema. This usually originates from curated data collections, and exposes a schema which can be leveraged in the processing pipeline.
Generic | Generic type with the least requirements.
Experiment | Data coming from simulation experiments.

Listing data indices in a project

$ deepsearch cps data-indices list -p PROJ_KEY

indices = api.data_indices.list(PROJ_KEY)

for item in indices:
    print(item.source.index_key, item.name)

# If your project uses pandas, you can easily convert the list of data indices to a DataFrame
import pandas as pd
df = pd.DataFrame([item.to_dict() for item in indices])
print(df)
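
Scripts often know a data index by its name rather than by its key. The following is a minimal sketch of a lookup over the listing result; it relies only on the `name` and `source.index_key` attributes used above. The helper name `find_index_key` is hypothetical and not part of the toolkit, and the sample objects are stand-ins so that the snippet runs without a connection.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def find_index_key(indices: Iterable[Any], name: str) -> Optional[str]:
    """Return the index_key of the first data index called `name`, or None."""
    for item in indices:
        if item.name == name:
            return item.source.index_key
    return None


# Stand-in objects that mimic the two attributes used on this page;
# in real use, pass the result of api.data_indices.list(PROJ_KEY) instead.
sample_indices = [
    SimpleNamespace(name="reports", source=SimpleNamespace(index_key="abc123")),
    SimpleNamespace(name="patents", source=SimpleNamespace(index_key="def456")),
]

print(find_index_key(sample_indices, "patents"))  # prints def456
print(find_index_key(sample_indices, "missing"))  # prints None
```

If several indices share a name, this sketch returns the first match only.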

Deleting a data index from a project

To delete a data index, identify it by its INDEX_KEY. Listing the data indices shows the INDEX_KEY of every index in a project.

$ deepsearch cps data-indices delete -p PROJ_KEY -x INDEX_KEY

from deepsearch.cps.client.components.elastic import ElasticProjectDataCollectionSource

# specify index
coords = ElasticProjectDataCollectionSource(proj_key=PROJ_KEY, index_key=INDEX_KEY)

api.data_indices.delete(coords)
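
Because deletion cannot be undone from this page's commands, it can be useful to confirm that the index is actually listed in the project before deleting it. The sketch below combines the list and delete calls shown on this page. It assumes that the `source` attribute of a listed item can be passed to `delete` as the coordinates object; the helper name `delete_index_if_present` and the `_FakeDataIndices` stand-in are hypothetical and exist only so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any


def delete_index_if_present(api: Any, proj_key: str, index_key: str) -> bool:
    """Delete the index with `index_key` if the project lists it.

    Returns True when a delete call was issued, False otherwise.
    """
    for item in api.data_indices.list(proj_key):
        if item.source.index_key == index_key:
            # Assumption: item.source is accepted as the coordinates object.
            api.data_indices.delete(item.source)
            return True
    return False


class _FakeDataIndices:
    """Offline stand-in for api.data_indices, for demonstration only."""

    def __init__(self, keys):
        self.items = [
            SimpleNamespace(source=SimpleNamespace(index_key=key)) for key in keys
        ]

    def list(self, proj_key):
        return list(self.items)

    def delete(self, coords):
        self.items = [
            item for item in self.items
            if item.source.index_key != coords.index_key
        ]


fake_api = SimpleNamespace(data_indices=_FakeDataIndices(["abc123", "def456"]))
print(delete_index_if_present(fake_api, "my_project", "abc123"))  # prints True
print(delete_index_if_present(fake_api, "my_project", "abc123"))  # prints False
```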

Adding documents to a project

Documents can be converted and added directly to a data index in a project. Documents can be local or remote. Local documents can be PDF files, ZIP archives, or a directory containing both (PATH_DOCS). For remote documents, either pass a web address directly or store multiple web addresses in a text file (PATH_URL). Documents are specified in the same way as in Document Conversion.

# for local documents
$ deepsearch cps data-indices upload -p PROJ_KEY -x INDEX_KEY -i PATH_DOCS

# for online documents
$ deepsearch cps data-indices upload -p PROJ_KEY -x INDEX_KEY -u PATH_URL

# for COS documents
$ deepsearch cps data-indices upload -p PROJ_KEY -x INDEX_KEY -c PATH_COS_COORDINATES

from deepsearch.cps.client.components.elastic import ElasticProjectDataCollectionSource
from deepsearch.cps.data_indices import utils as data_indices_utils

# Specify index
coords = ElasticProjectDataCollectionSource(proj_key=PROJ_KEY, index_key=INDEX_KEY)

# For local documents
data_indices_utils.upload_files(api=api, coords=coords, local_file=PATH_DOCS)

# For online documents

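# load the urls from the file to a list, stripping newlines and skipping blank lines
with open(PATH_URL) as url_file:
    input_urls = [line.strip() for line in url_file if line.strip()]
# or, define a list directly
#input_urls = ["https://URL1", "https://URL2", "https://URL3"]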

data_indices_utils.upload_files(api=api, coords=coords, url=input_urls)

# For COS documents
# (assumes S3Coordinates has already been imported from the toolkit)
cos_coordinates = S3Coordinates.parse_file(PATH_COS_COORDINATES)

data_indices_utils.upload_files(api=api, coords=coords, s3_coordinates=cos_coordinates)
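
Before uploading local documents, it can help to preview which files PATH_DOCS actually points at. The following is a minimal sketch using only the standard library. The helper name `preview_local_documents` is hypothetical, and inspecting only the top level of a directory is an assumption of this sketch, not a statement about how the toolkit scans directories.

```python
import tempfile
from pathlib import Path
from typing import List

# File types named on this page as valid local inputs
UPLOADABLE_SUFFIXES = {".pdf", ".zip"}


def preview_local_documents(path_docs: str) -> List[Path]:
    """List the PDF and ZIP files that `path_docs` points at.

    Accepts a single file or a directory; only the top level of a
    directory is inspected (an assumption of this sketch).
    """
    path = Path(path_docs)
    candidates = [path] if path.is_file() else sorted(path.iterdir())
    return [
        p for p in candidates
        if p.is_file() and p.suffix.lower() in UPLOADABLE_SUFFIXES
    ]


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.pdf", "b.zip", "notes.txt"):
        (Path(tmp) / file_name).write_bytes(b"")
    preview_names = [p.name for p in preview_local_documents(tmp)]

print(preview_names)  # prints ['a.pdf', 'b.zip']
```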

Adding attachments to an index item

Attachments can be added to an index item in a project. Attachments have to be on the local machine and can be in (almost) any format. The full list of supported formats is available here.

$ deepsearch cps data-indices add-attachment -p PROJ_KEY -x INDEX_KEY -d INDEX_ITEM_ID -i ATTACHMENT_PATH -k ATTACHMENT_KEY

from deepsearch.cps.client.components.data_indices import DataIndex

# get indices of the project
indices = api.data_indices.list(PROJ_KEY)

# get specific index to add attachment
index = next((x for x in indices if x.source.index_key == INDEX_KEY), None)

# if the index exists, add attachment
if index is not None:
    # specify parameters
    index_item_id = "example_item_id"
    attachment_path = "path/to/local/file"
    attachment_key = "usr_my_attachment"  # optional; if set, it must start with 'usr_' and be snake_case

    index.add_item_attachment(
        api=api,
        index_item_id=index_item_id,
        attachment_path=attachment_path,
        attachment_key=attachment_key,  # optional
    )
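
The attachment key rule mentioned above (a 'usr_' prefix and snake_case) can be checked locally before calling the API. The following is a minimal sketch under one interpretation of snake_case: lowercase letters and digits separated by single underscores. The exact rule enforced by the platform is not stated on this page, and the helper name `is_valid_attachment_key` is hypothetical.

```python
import re

# 'usr_' followed by lowercase alphanumeric words joined by single underscores
_ATTACHMENT_KEY_RE = re.compile(r"usr_[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_attachment_key(key: str) -> bool:
    """Return True if `key` starts with 'usr_' and is snake_case."""
    return _ATTACHMENT_KEY_RE.fullmatch(key) is not None


print(is_valid_attachment_key("usr_my_attachment"))  # prints True
print(is_valid_attachment_key("my_attachment"))      # prints False
print(is_valid_attachment_key("usr_MyAttachment"))   # prints False
```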