Dataset Management
create
Creates a new dataset and uploads the Newline-Delimited JSON (NDJSON) file to a Google Cloud Storage (GCS) bucket via signed URLs.
Arguments
Attribute | Type | Description |
---|---|---|
name | str | Name of the dataset. |
dataset_path | str | Path to the Newline-Delimited JSON (NDJSON) file. |
Return
Dataset object containing the metadata of the newly uploaded dataset, with the following structure:
Dataset(
id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
object='dataset',
name='my-dataset-08-19',
project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
expiry_time=1724561116000,
status=DatasetStatus(
overview='Uploaded',
message='Uploaded Dataset',
update_time=1723697121735,
source=DatasetSource(
kind='UploadedNewlineDelimitedJsonFile',
upload_url=DatasetUploadUrl(
method='PUT',
url='',
headers=[
'content-length',
'x-goog-hash',
'x-goog-meta-datature-dataset-link',
'x-goog-if-generation-match',
'content-type'
],
expires_at_time=1723740318174
)
),
item_count=50
),
create_date=1723697118174,
update_date=1723697121774
)
Attribute | Type | Description |
---|---|---|
id | str | Dataset ID as a string. |
object | str | Object type of the dataset. |
name | str | Name of the dataset. |
project_id | str | Identifier of the project associated with the dataset. |
expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
status | DatasetStatus | Status object of the dataset. |
create_date | int | Creation timestamp of the dataset, in milliseconds. |
update_date | int | Last updated timestamp of the dataset, in milliseconds. |
Examples
Create a new dataset with a local NDJSON file path:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.create(
name="my-dataset-08-15",
dataset_path="my-dataset.ndjson"
)
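The dataset_path argument expects a Newline-Delimited JSON file, i.e. one standalone JSON object per line. A minimal sketch of producing such a file with the standard library; the record keys below are illustrative placeholders, not a documented schema:

```python
import json

# Each line of an NDJSON dataset file is one standalone JSON object.
# The keys below are illustrative only -- consult the record format
# expected by your batch jobs for the real schema.
records = [
    {"id": "item-001", "url": "https://example.com/image1.jpg"},
    {"id": "item-002", "url": "https://example.com/image2.jpg"},
]

with open("my-dataset.ndjson", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

The resulting file can then be passed as dataset_path to create.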
list
Lists all created datasets in the project.
Arguments
Name | Type | Description |
---|---|---|
pagination | dict | A dictionary containing limit, the maximum number of datasets returned per page (defaults to 1000), and page, the page cursor for page selection (defaults to the first page). |
Return
PaginationResponse object containing a list of Dataset objects with page navigation data, with the following structure:
PaginationResponse(
next_page=None,
previous_page=None,
data=[
Dataset(
id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
object='dataset',
name='my-dataset-08-15',
project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
expiry_time=1724561116000,
status=DatasetStatus(
overview='Uploaded',
message='Uploaded Dataset',
update_time=1723697121735,
source=DatasetSource(
kind='UploadedNewlineDelimitedJsonFile',
upload_url=DatasetUploadUrl(
method='PUT',
url='',
headers=[
'content-length',
'x-goog-hash',
'x-goog-meta-datature-dataset-link',
'x-goog-if-generation-match',
'content-type'
],
expires_at_time=1723740318174
)
),
item_count=50
),
create_date=1723697118174,
update_date=1723697121774
)
]
)
Attribute | Type | Description |
---|---|---|
next_page | str | Page ID of the next page. |
previous_page | str | Page ID of the previous page. |
data | List[Dataset] | List of dataset metadata. |
Examples
- Default listing of datasets (shows first 1000 datasets):
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.list()
- View the next page of results:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
next_page = project.batch.datasets.list()["next_page"]
project.batch.datasets.list({"page": next_page})
- View the previous page of results:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
prev_page = project.batch.datasets.list({
"page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"}
)["previous_page"]
project.batch.datasets.list({"page": prev_page})
- List a specific page of datasets, returning at most 2 datasets on that page:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.list({
"limit": 2,
"page": "ZjYzYmJkM2FjN2UxOTA4ZmU0ZjE0Yjk5Mg"
})
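When a project holds more than one page of datasets, the next_page cursor can be followed in a loop. A generic sketch, assuming only the documented data and next_page fields on the response; list_fn stands in for any callable that accepts a pagination dict, such as project.batch.datasets.list:

```python
def list_all(list_fn):
    """Collect items across all pages by following next_page cursors.

    list_fn is any callable that accepts a pagination dict and returns
    an object with .data and .next_page attributes -- for example,
    project.batch.datasets.list.
    """
    items, page = [], None
    while True:
        response = list_fn({"page": page} if page else {})
        items.extend(response.data)
        if not response.next_page:
            break  # last page reached
        page = response.next_page
    return items
```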
get
Retrieves a specific dataset by ID.
Arguments
Name | Type | Description |
---|---|---|
dataset_id | str | The dataset ID as a string. |
Return
Dataset object containing the metadata of the retrieved dataset, with the following structure:
Dataset(
id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
object='dataset',
name='my-dataset-08-15',
project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
expiry_time=1724561116000,
status=DatasetStatus(
overview='Uploaded',
message='Uploaded Dataset',
update_time=1723697121735,
source=DatasetSource(
kind='UploadedNewlineDelimitedJsonFile',
upload_url=DatasetUploadUrl(
method='PUT',
url='',
headers=[
'content-length',
'x-goog-hash',
'x-goog-meta-datature-dataset-link',
'x-goog-if-generation-match',
'content-type'
],
expires_at_time=1723740318174
)
),
item_count=50
),
create_date=1723697118174,
update_date=1723697121774
)
Attribute | Type | Description |
---|---|---|
id | str | Dataset ID as a string. |
object | str | Object type of the dataset. |
name | str | Name of the dataset. |
project_id | str | Identifier of the project associated with the dataset. |
expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
status | DatasetStatus | Status object of the dataset. |
create_date | int | Creation timestamp of the dataset, in milliseconds. |
update_date | int | Last updated timestamp of the dataset, in milliseconds. |
Examples
Retrieve a dataset by its ID:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.get("dataset_6aea3395-9a72-4bb5-9ee0-19248c903c56")
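All timestamps in the Dataset response (expiry_time, create_date, update_date) are in milliseconds since the Unix epoch, so they must be divided by 1000 before conversion. A small helper sketch:

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a millisecond Unix timestamp, as returned in Dataset
    fields such as expiry_time and create_date, to an aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

For example, ms_to_datetime(dataset.expiry_time) shows when an uploaded dataset will expire.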
delete
Deletes a specific dataset by ID.
Arguments
Name | Type | Description |
---|---|---|
dataset_id | str | The dataset ID as a string. |
Return
DeleteResponse object describing the deletion status of the dataset, with the following structure:
DeleteResponse(
id='dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7',
deleted=True
)
Attribute | Type | Description |
---|---|---|
id | str | The dataset ID as a string. |
deleted | bool | Whether the dataset has been successfully deleted or not. |
Examples
Delete a specified dataset:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.delete("dataset_f7d8aec2-7e2b-4d2c-a103-c8dd575c29c7")
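list and delete can be combined to clean up datasets by name. A sketch assuming only the fields documented above; it considers just the first page of results, and datasets_api stands in for project.batch.datasets:

```python
def delete_by_name(datasets_api, name):
    """Delete every dataset whose name matches exactly; return deleted IDs.

    Only the first page of list() results is considered in this sketch;
    datasets_api is assumed to expose .list() and .delete() as documented.
    """
    deleted_ids = []
    for dataset in datasets_api.list().data:
        if dataset.name == name:
            response = datasets_api.delete(dataset.id)
            if response.deleted:
                deleted_ids.append(response.id)
    return deleted_ids
```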
wait_until_done
Waits for the dataset upload to be completed.
Arguments
Name | Type | Description |
---|---|---|
dataset_id | str | The dataset ID as a string. |
Return
Dataset object containing the metadata of the uploaded dataset, with the following structure:
Dataset(
id='dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e',
object='dataset',
name='my-dataset-08-15',
project_id='proj_ca5fe71e7592bbcf7705ea36e4f29ed4',
expiry_time=1724561116000,
status=DatasetStatus(
overview='Uploaded',
message='Uploaded Dataset',
update_time=1723697121735,
source=DatasetSource(
kind='UploadedNewlineDelimitedJsonFile',
upload_url=DatasetUploadUrl(
method='PUT',
url='',
headers=[
'content-length',
'x-goog-hash',
'x-goog-meta-datature-dataset-link',
'x-goog-if-generation-match',
'content-type'
],
expires_at_time=1723740318174
)
),
item_count=50
),
create_date=1723697118174,
update_date=1723697121774
)
Attribute | Type | Description |
---|---|---|
id | str | Dataset ID as a string. |
object | str | Object type of the dataset. |
name | str | Name of the dataset. |
project_id | str | Identifier of the project associated with the dataset. |
expiry_time | int | Expiry timestamp of the dataset, in milliseconds. |
status | DatasetStatus | Status object of the dataset. |
create_date | int | Creation timestamp of the dataset, in milliseconds. |
update_date | int | Last updated timestamp of the dataset, in milliseconds. |
Examples
Wait for the NDJSON file to be fully uploaded to the GCS bucket:
from datature.nexus import Client
project = Client("5aa41e8ba........").get_project("proj_b705a........")
project.batch.datasets.wait_until_done("dataset_652ef566-af5c-4d52-b1e4-ec8ec6dc4b8e")
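The create, get, and wait steps can be combined into a single flow. A hand-rolled polling sketch, shown for illustration only: the SDK's own wait_until_done is the supported way to block on completion, the 'Uploaded' overview value is taken from the sample responses above, and datasets_api stands in for project.batch.datasets:

```python
import time

def upload_and_wait(datasets_api, name, path, timeout_s=600, poll_s=5):
    """Create a dataset, then poll until its status overview is 'Uploaded'.

    Illustrative only -- wait_until_done is the supported equivalent.
    """
    dataset = datasets_api.create(name=name, dataset_path=path)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        dataset = datasets_api.get(dataset.id)
        if dataset.status.overview == "Uploaded":
            return dataset
        time.sleep(poll_s)  # re-check after a short delay
    raise TimeoutError(f"dataset {dataset.id} not uploaded within {timeout_s}s")
```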