Dataset Management

`create`

Creates a new dataset by uploading an NDJSON file of signed URLs to a Google Cloud Storage (GCS) bucket.

Arguments

Name	Type	Description
`name`	`str`	Name of the dataset.
`dataset_path`	`str`	Path to the Newline-Delimited JSON (NDJSON) file.
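
An NDJSON file holds one JSON object per line. The sketch below writes such a file with Python's standard library. The helper name `write_ndjson`, the `url` field, and the example URLs are placeholders assumed for illustration, not the record schema that Nexus requires, so check the format expected by your project before uploading.

```python
import json
from pathlib import Path


def write_ndjson(records, path):
    """Write one JSON object per line (NDJSON) and return the line count."""
    lines = [json.dumps(record) for record in records]
    # Terminate every record with a newline; an empty input yields an empty file.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Hypothetical records: the "url" key is an assumption, not the required schema.
    records = [
        {"url": "https://example.com/signed/image-1.jpg"},
        {"url": "https://example.com/signed/image-2.jpg"},
    ]
    print(write_ndjson(records, "my-dataset.ndjson"), "records written")
```

The resulting path can then be passed as `dataset_path`.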

Return

Dataset object containing the metadata of the newly uploaded dataset, with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

Attribute	Type	Description
`id`	`str`	Dataset ID as a string.
`object`	`str`	Object type of the dataset.
`name`	`str`	Name of the dataset.
`project_id`	`str`	Identifier of the project associated with the dataset.
`expiry_time`	`int`	Expiry timestamp of the dataset, in milliseconds.
`status`	`DatasetStatus`	Status object of the dataset.
`create_date`	`int`	Creation timestamp of the dataset, in milliseconds.
`update_date`	`int`	Last updated timestamp of the dataset, in milliseconds.
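
The timestamp fields are integers in milliseconds, which reads as Unix epoch milliseconds given the sample values. A minimal sketch, under that assumption, for turning them into readable datetimes (`ms_to_datetime` is a hypothetical helper, not part of the SDK):

```python
from datetime import datetime, timezone


def ms_to_datetime(timestamp_ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Values taken from the sample Dataset output above.
expiry = ms_to_datetime(1724561116000)
created = ms_to_datetime(1723697118174)

print(expiry.isoformat())
print("Lifetime from creation to expiry:", expiry - created)
```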

Examples

Create a new dataset with a local NDJSON file path:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.create(
    name="my-dataset-08-15",
    dataset_path="my-dataset.ndjson"
)

`list`

Lists all created datasets in the project.

Arguments

Name	Type	Description
`pagination`	`dict`	Optional dictionary with two keys: `limit`, the maximum number of datasets returned per page (defaults to 1000), and `page`, the cursor of the page to retrieve (defaults to the first page).

Return

PaginationResponse object containing a list of Dataset objects and page navigation data, with the following structure:

PaginationResponse(
    next_page=None,
    prev_page=None,
    data=[
        Dataset(
            id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
            object='dataset',
            name='my-dataset-08-15',
            project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
            expiry_time=1724561116000,
            status=DatasetStatus(
                overview='Uploaded',
                message='Uploaded Dataset',
                update_time=1723697121735,
                source=DatasetSource(
                    kind='UploadedNewlineDelimitedJsonFile',
                    upload_url=DatasetUploadUrl(
                        method='PUT',
                        url='',
                        headers=[
                            'content-length',
                            'x-goog-hash',
                            'x-goog-meta-datature-dataset-link',
                            'x-goog-if-generation-match',
                            'content-type'
                        ],
                        expires_at_time=1723740318174
                    )
                ),
                item_count=50
            ),
            create_date=1723697118174,
            update_date=1723697121774
        )
    ]
)

Attribute	Type	Description
`next_page`	`str`	Page ID of the next page, or `None` if there is no next page.
`prev_page`	`str`	Page ID of the previous page, or `None` if there is no previous page.
`data`	`List[Dataset]`	List of dataset metadata.
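
To walk through every page, the `next_page` cursor can be fed back into `list` until it comes back as `None`. The helper below is a minimal sketch, not part of the SDK: `iter_all_datasets` is a hypothetical name, it accepts any callable that behaves like `project.batch.datasets.list`, and it assumes the response supports the same `["next_page"]` lookup used in the examples in this section, plus an equivalent `["data"]` lookup. A stand-in function replaces the real client so the sketch runs offline.

```python
def iter_all_datasets(list_fn, limit=1000):
    """Yield every item by following `next_page` cursors until none remain."""
    pagination = {"limit": limit}
    while True:
        response = list_fn(pagination)
        yield from response["data"]
        next_page = response["next_page"]
        if not next_page:
            break
        # Ask for the page that the cursor points to, keeping the same limit.
        pagination = {"limit": limit, "page": next_page}


def fake_list(pagination):
    """Stand-in for `project.batch.datasets.list`, serving two fixed pages."""
    pages = {
        None: {"next_page": "cursor-2", "prev_page": None, "data": ["a", "b"]},
        "cursor-2": {"next_page": None, "prev_page": "cursor-1", "data": ["c"]},
    }
    return pages[pagination.get("page")]


all_items = list(iter_all_datasets(fake_list, limit=2))
print(all_items)
```

With a real client, the call would presumably look like `iter_all_datasets(project.batch.datasets.list)`, assuming `list` accepts the pagination dictionary as its first positional argument.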

Examples

List datasets with the default pagination (returns up to the first 1000 datasets):

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.list()

View the next page of results:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

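# Fetch the first page and read its cursor to the next page.
next_page = project.batch.datasets.list()["next_page"]

# Request the page that the cursor points to.
project.batch.datasets.list({"page": next_page})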

View the previous page of results:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

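# Fetch a known page and read its cursor to the previous page.
prev_page = project.batch.datasets.list({
    "page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"
})["prev_page"]

# Request the page that the cursor points to.
project.batch.datasets.list({"page": prev_page})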

List a specific page of datasets, with at most 2 datasets returned on that page:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.list({
    "limit": 2,
    "page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"
})

`get`

Retrieves a specific dataset by ID.

Arguments

Name	Type	Description
`dataset_id`	`str`	The dataset ID as a string.

Return

Dataset object containing the metadata of the retrieved dataset with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

Attribute	Type	Description
`id`	`str`	Dataset ID as a string.
`object`	`str`	Object type of the dataset.
`name`	`str`	Name of the dataset.
`project_id`	`str`	Identifier of the project associated with the dataset.
`expiry_time`	`int`	Expiry timestamp of the dataset, in milliseconds.
`status`	`DatasetStatus`	Status object of the dataset.
`create_date`	`int`	Creation timestamp of the dataset, in milliseconds.
`update_date`	`int`	Last updated timestamp of the dataset, in milliseconds.

Examples

Retrieve dataset by dataset ID:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.get("dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e")

`delete`

Deletes a specific dataset by ID.

Arguments

Name	Type	Description
`dataset_id`	`str`	The dataset ID as a string.

Return

DeleteResponse object that describes the deletion status of the dataset, with the following structure:

DeleteResponse(
    id='dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7',
    deleted=True
)

Attribute	Type	Description
`id`	`str`	The dataset ID as a string.
`deleted`	`bool`	Whether the dataset has been successfully deleted or not.

Examples

Delete a specified dataset:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")

project.batch.datasets.delete("dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7")

`wait_until_done`

Waits for the dataset upload to be completed.

Arguments

Name	Type	Description
`dataset_id`	`str`	The dataset ID as a string.

Return

Dataset object containing the metadata of the uploaded dataset with the following structure:

Dataset(
    id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
    object='dataset',
    name='my-dataset-08-15',
    project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
    expiry_time=1724561116000,
    status=DatasetStatus(
        overview='Uploaded',
        message='Uploaded Dataset',
        update_time=1723697121735,
        source=DatasetSource(
            kind='UploadedNewlineDelimitedJsonFile',
            upload_url=DatasetUploadUrl(
                method='PUT',
                url='',
                headers=[
                    'content-length',
                    'x-goog-hash',
                    'x-goog-meta-datature-dataset-link',
                    'x-goog-if-generation-match',
                    'content-type'
                ],
                expires_at_time=1723740318174
            )
        ),
        item_count=50
    ),
    create_date=1723697118174,
    update_date=1723697121774
)

Attribute	Type	Description
`id`	`str`	Dataset ID as a string.
`object`	`str`	Object type of the dataset.
`name`	`str`	Name of the dataset.
`project_id`	`str`	Identifier of the project associated with the dataset.
`expiry_time`	`int`	Expiry timestamp of the dataset, in milliseconds.
`status`	`DatasetStatus`	Status object of the dataset.
`create_date`	`int`	Creation timestamp of the dataset, in milliseconds.
`update_date`	`int`	Last updated timestamp of the dataset, in milliseconds.

Examples

Wait for the NDJSON file to be fully uploaded to the GCS bucket:

from datature.nexus import Client

project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.wait_until_done("dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e")
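
Conceptually, waiting amounts to polling the dataset until its status reports that the upload has finished. The sketch below illustrates that idea only and is not the SDK's implementation. The name `poll_until_uploaded`, its `interval` and `timeout` parameters, the dictionary-style access, and the check against an `'Uploaded'` overview (taken from the sample output above) are all assumptions. A stand-in getter replaces `project.batch.datasets.get` so the code runs offline.

```python
import time


def poll_until_uploaded(get_fn, dataset_id, interval=1.0, timeout=60.0):
    """Call `get_fn(dataset_id)` until its status overview is 'Uploaded'."""
    deadline = time.monotonic() + timeout
    while True:
        dataset = get_fn(dataset_id)
        if dataset["status"]["overview"] == "Uploaded":
            return dataset
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{dataset_id} not uploaded after {timeout} s")
        time.sleep(interval)


def make_fake_get(states):
    """Build a stand-in getter that reports each state in `states` in turn."""
    remaining = iter(states)

    def fake_get(dataset_id):
        return {"id": dataset_id, "status": {"overview": next(remaining)}}

    return fake_get


# "Processing" is a made-up intermediate state used only for this demo.
fake_get = make_fake_get(["Processing", "Processing", "Uploaded"])
result = poll_until_uploaded(fake_get, "dataset_demo", interval=0.01)
print(result["status"]["overview"])
```

A bounded timeout keeps a stalled upload from blocking the caller indefinitely; whether `wait_until_done` itself enforces one is not stated here.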