Dataset Management

create

Creates a new dataset and uploads signed URLs to a Google Cloud Storage (GCS) bucket.

Arguments

| Attribute | Type | Description |
| --- | --- | --- |
| name | str | Name of the dataset. |
| dataset_path | str | Path to the Newline-Delimited JSON (NDJSON) file. |

Return

Dataset object containing the metadata of the newly-uploaded dataset with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

| Attribute | Type | Description |
| --- | --- | --- |
| id | str | Dataset ID as a string. |
| object | str | Object type of the dataset. |
| name | str | Name of the dataset. |
| project_id | str | Identifier of the project associated with the dataset. |
| expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
| status | DatasetStatus | Status object of the dataset. |
| create_date | int | Creation timestamp of the dataset, in milliseconds. |
| update_date | int | Last updated timestamp of the dataset, in milliseconds. |
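
The timestamp fields (expiry_time, create_date, update_date) are given in milliseconds. A minimal sketch for converting them into readable datetimes, using only the standard library. The helper name ms_to_datetime is hypothetical and not part of the SDK, and treating the values as Unix epoch milliseconds in UTC is an assumption based on the field descriptions:

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix timestamps (assumption: the SDK values are epoch-based).
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return _EPOCH + timedelta(milliseconds=ms)


# expiry_time value taken from the example Dataset in this section.
print(ms_to_datetime(1724561116000).isoformat())  # 2024-08-25T04:45:16+00:00
```

Adding a timedelta to the epoch avoids floating-point division, so millisecond precision is preserved exactly.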

Examples

Create a new dataset with a local NDJSON file path:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.create(
    name="my-dataset-08-15",
    dataset_path="my-dataset.ndjson"
)
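
If the NDJSON file does not exist yet, it can be generated with the standard library: NDJSON is simply one JSON object per line. The sketch below is illustrative only. The helper write_ndjson is hypothetical, and the record fields shown (a single "url" key) are placeholders, since this section does not document the schema that each line must follow:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_ndjson(records: Iterable[dict], path: str) -> int:
    """Write one compact JSON object per line; return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, separators=(",", ":")) + "\n")
            count += 1
    return count


# Placeholder records: the real per-line schema is not specified in this section.
records = [
    {"url": "https://example.com/signed/image-1.jpg"},
    {"url": "https://example.com/signed/image-2.jpg"},
]

with tempfile.TemporaryDirectory() as tmp:
    ndjson_path = str(Path(tmp) / "my-dataset.ndjson")
    written = write_ndjson(records, ndjson_path)
    print(written)  # 2
```

The resulting file path is what would be passed as dataset_path to create.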

list

Lists all created datasets in the project.

Arguments

| Name | Type | Description |
| --- | --- | --- |
| pagination | dict | Dictionary with two keys: limit, the maximum number of datasets returned per page (defaults to 1000), and page, the cursor of the page to return (defaults to the first page). |

Return

PaginationResponse object containing a list of Dataset objects with page navigation data, with the following structure:

PaginationResponse(
    next_page=None,
    prev_page=None,
    data=[
        Dataset(
            id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
            object='dataset',
            name='my-dataset-08-15',
            project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
            expiry_time=1724561116000,
            status=DatasetStatus(
                overview='Uploaded',
                message='Uploaded Dataset',
                update_time=1723697121735,
                source=DatasetSource(
                    kind='UploadedNewlineDelimitedJsonFile',
                    upload_url=DatasetUploadUrl(
                        method='PUT',
                        url='',
                        headers=[
                            'content-length',
                            'x-goog-hash',
                            'x-goog-meta-datature-dataset-link',
                            'x-goog-if-generation-match',
                            'content-type'
                        ],
                        expires_at_time=1723740318174
                    )
                ),
                item_count=50
            ),
            create_date=1723697118174,
            update_date=1723697121774
        )
    ]
)

| Attribute | Type | Description |
| --- | --- | --- |
| next_page | str | Page ID of the next page. |
| prev_page | str | Page ID of the previous page. |
| data | List[Dataset] | List of dataset metadata. |

Examples

  • Default listing of datasets (shows the first 1000 datasets):

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.list()

  • View the next page of results:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

next_page = project.batch.datasets.list()["next_page"]

project.batch.datasets.list({"page": next_page})

  • View the previous page of results:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

prev_page = project.batch.datasets.list({
    "page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"}
)["prev_page"]

project.batch.datasets.list({"page": prev_page})

  • List a specific page of datasets, limited to 2 datasets on that page:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.list({
    "limit": 2,
    "page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"
})
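
The page-by-page examples above can be combined into a loop that follows next_page until no further page exists. This is a sketch under stated assumptions rather than an SDK feature: iter_all is a hypothetical helper, it assumes the response supports key access (as in the ["next_page"] examples above), and it assumes that passing only "limit" without "page" selects the first page:

```python
from typing import Any, Callable, Iterator, Optional


def iter_all(list_page: Callable[[dict], Any], limit: int = 1000) -> Iterator[Any]:
    """Yield every item across all pages by following next_page until it is empty."""
    page: Optional[str] = None
    while True:
        pagination: dict = {"limit": limit}
        if page is not None:
            pagination["page"] = page
        response = list_page(pagination)
        # Assumption: the response supports key access, as in the examples above.
        yield from response["data"]
        page = response["next_page"]
        if not page:
            break


# Hypothetical usage with the SDK (not executed here):
# for dataset in iter_all(project.batch.datasets.list, limit=100):
#     print(dataset)
```

Taking the listing function as a parameter keeps the helper independent of the client object, so it can be exercised with a stand-in function before being pointed at a real project.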

get

Retrieves a specific dataset by ID.

Arguments

| Name | Type | Description |
| --- | --- | --- |
| dataset_id | str | The dataset ID as a string. |

Return

Dataset object containing the metadata of the retrieved dataset with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

| Attribute | Type | Description |
| --- | --- | --- |
| id | str | Dataset ID as a string. |
| object | str | Object type of the dataset. |
| name | str | Name of the dataset. |
| project_id | str | Identifier of the project associated with the dataset. |
| expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
| status | DatasetStatus | Status object of the dataset. |
| create_date | int | Creation timestamp of the dataset, in milliseconds. |
| update_date | int | Last updated timestamp of the dataset, in milliseconds. |

Examples

Retrieve dataset by dataset ID:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.get("dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e")

delete

Deletes a specific dataset by ID.

Arguments

| Name | Type | Description |
| --- | --- | --- |
| dataset_id | str | The dataset ID as a string. |

Return

DeleteResponse object that describes the deletion status of the dataset, with the following structure:

DeleteResponse(
    id='dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7',
    deleted=True
)

| Attribute | Type | Description |
| --- | --- | --- |
| id | str | The dataset ID as a string. |
| deleted | bool | Whether the dataset has been successfully deleted or not. |

Examples

Delete a specified dataset:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.delete("dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7")
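
Because the response carries a deleted flag, a caller may want to fail loudly when it is not True. A minimal sketch, assuming the response supports key access in the same way as the list examples. The helper delete_or_raise is hypothetical and not part of the SDK:

```python
from typing import Any, Callable


def delete_or_raise(delete: Callable[[str], Any], dataset_id: str) -> None:
    """Call the given delete function and raise if the response is not marked deleted."""
    response = delete(dataset_id)
    # Assumption: the response supports key access, e.g. response["deleted"].
    if not response["deleted"]:
        raise RuntimeError(f"Dataset {dataset_id} was not deleted")


# Hypothetical usage with the SDK (not executed here):
# delete_or_raise(project.batch.datasets.delete,
#                 "dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7")
```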

wait_until_done

Waits for the dataset upload to be completed.

Arguments

| Name | Type | Description |
| --- | --- | --- |
| dataset_id | str | The dataset ID as a string. |

Return

Dataset object containing the metadata of the uploaded dataset with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

| Attribute | Type | Description |
| --- | --- | --- |
| id | str | Dataset ID as a string. |
| object | str | Object type of the dataset. |
| name | str | Name of the dataset. |
| project_id | str | Identifier of the project associated with the dataset. |
| expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
| status | DatasetStatus | Status object of the dataset. |
| create_date | int | Creation timestamp of the dataset, in milliseconds. |
| update_date | int | Last updated timestamp of the dataset, in milliseconds. |

Examples

Wait for the NDJSON file to be fully uploaded to the GCS bucket:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.wait_until_done("dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e")
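
In practice, wait_until_done is likely to be called right after create, using the ID of the dataset that was just created. The sketch below chains the two calls. It is illustrative only: create_and_wait is a hypothetical helper, and reading the ID with key access (created["id"]) is an assumption modelled on the key access used in the list examples:

```python
from typing import Any


def create_and_wait(datasets: Any, name: str, dataset_path: str) -> Any:
    """Create a dataset, then block until its upload is reported as complete."""
    created = datasets.create(name=name, dataset_path=dataset_path)
    # Assumption: the returned Dataset supports key access for its fields.
    return datasets.wait_until_done(created["id"])


# Hypothetical usage with the SDK (not executed here):
# dataset = create_and_wait(
#     project.batch.datasets,
#     name="my-dataset-08-15",
#     dataset_path="my-dataset.ndjson",
# )
```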