Datasets

Datasets allow you to organize, share, and easily access your data directly within your notebooks and training code. Currently, TIR supports datasets backed by EOS (Object Storage) and PVC-backed datasets with Disk Storage.

How to Use Datasets

  • 1. Mount EOS Storage Buckets:
    • You can mount your EOS storage buckets, making them accessible like a local file system.
    • This setup allows you to interact with the data in your buckets as if it were on your local machine.
  • 2. Access Data from Multiple Sources:
    • You can load training data from your local machine or from other cloud providers and access it through your notebook.
    • All mounted datasets will appear under the /datasets directory, allowing you to read and manipulate data using standard file operations.
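Because mounted datasets behave like ordinary directories, listing their contents needs nothing beyond the standard library. The sketch below is illustrative only: the helper name `list_dataset_files` and the dataset name `paws` are assumptions, and the `/datasets` root is taken from the description above.

```python
from pathlib import Path


def list_dataset_files(dataset_name, root="/datasets"):
    """Return the sorted relative paths of all regular files in a mounted dataset.

    `dataset_name` is a hypothetical dataset name; `root` defaults to the
    /datasets mount directory described in this guide.
    """
    dataset_dir = Path(root) / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"dataset is not mounted at {dataset_dir}")
    # Walk the directory tree and keep regular files only.
    return sorted(
        p.relative_to(dataset_dir).as_posix()
        for p in dataset_dir.rglob("*")
        if p.is_file()
    )
```

In a notebook with a dataset named `paws` mounted, `list_dataset_files("paws")` would return the file paths relative to `/datasets/paws`.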
Note

Define datasets to manage and control your data, even if you do not plan to mount them on notebooks or training jobs.

Dataset SDK

Datasets allow you to organize, share, and easily access your data directly within your notebooks and training code.

Benefits of Using Datasets

  • Shared Team Access: You can create a shared EOS bucket for your team’s training data. Using common data sources across your team helps improve the reproducibility of results and ensures consistency in training workflows.
  • Minimal Configuration Overhead: When you create a dataset through TIR, we automatically handle the creation of a new storage bucket and access credentials. You can immediately copy and execute the mc (MinIO CLI) commands displayed in the UI to upload data from your local machine or hosted notebooks.
  • Simplified Data Access: Access your training data directly in your hosted notebooks or training jobs without needing to configure access credentials.
  • Data Streaming: Stream training data in real-time instead of downloading it all to disk. This is especially beneficial for distributed training jobs where loading large datasets at once can be inefficient.
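As a rough illustration of the streaming idea, the sketch below reads a file from a mounted dataset in fixed-size chunks instead of loading it into memory at once. It assumes the dataset is already mounted as a local path; the function name `stream_chunks` and the default chunk size are illustrative choices, not part of TIR.

```python
def stream_chunks(path, chunk_size=1024 * 1024):
    """Yield successive byte chunks of a file without reading it whole.

    `path` is assumed to point at a file inside a mounted dataset,
    for example somewhere under /datasets.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            yield chunk
```

A training loop can iterate over `stream_chunks(...)` and process each chunk as it arrives, so memory use stays bounded by `chunk_size` rather than by the file size.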

Storage Type

1. EOS Bucket

WebUI: Allows you to create, browse and upload files to the Dataset.

SDK: The TIR datasets are compatible with the following SDKs, which can be used to transfer data to and from the datasets:

Note

If you have data with other cloud providers, you can use Data Syncer to migrate it to TIR.

2. Disk

WebUI: Allows you to create a dataset.

Note

For the Disk storage type, data can be transferred only after the dataset is mounted on a TIR node.

Getting Started

Prerequisites

  • Install MinIO CLI: Install the MinIO CLI (mc) on your local machine from the MinIO website. If mc is already installed, you can skip this step.

Create a new dataset

  • 1. Log in to the TIR AI Platform:
    • Ensure you are logged in and working within the correct project. If needed, you can create a new project.
  • 2. Navigate to the Datasets Section:
    • From the TIR dashboard, go to the Datasets section.
    • Click on the CREATE DATASET button.

  • 3. Choose Your Storage Type:
    • You will see two options for storage type:
      • EOS Bucket
      • Disk

1. EOS Bucket

  • With the EOS Bucket storage type, there are two options for creating your dataset:

  • New EOS Bucket: This option creates a new EOS bucket tied to your account, along with access keys. You can optionally create the new bucket with encryption enabled. If you enable encryption, choose one of two encryption types:

    • E2E Managed: In this type of encryption, Server-Side Encryption (SSE) is provided at rest, with encryption keys generated and fully managed end-to-end by E2E. You do not need to create or handle keys yourself.
    • User Managed: In this type of encryption, Server-Side Encryption (SSE) is provided at rest. As a user, you are responsible for generating encryption keys and enabling encryption for uploaded objects. When using this type of encryption, keep the following points in mind:
      • You cannot mount the encrypted dataset when using other TIR services like Node.
      • You are responsible for the management of the encryption keys, including key creation, rotation, and deletion.
      • If you lose access to your keys, you will lose access to the data encrypted with those keys. There is no recovery mechanism for lost keys.
      • Ensure that you have a recovery plan in place to handle key loss or corruption scenarios.
      • Encryption configuration cannot be changed later.
  • Enter a name for your dataset and click CREATE.

Create New EOS Bucket

Encryption Type: The screens below apply when you select the E2E Managed encryption type.

Encryption Type

  • You will see a screen for configuring the EOS bucket for data upload. It contains three tabs: Setup Minio CLI, Setup S3cmd, and Dataset Details.

Setup EOS Bucket

Setup Minio CLI

In the Setup Minio CLI tab, you will find the command to set up the host and a command to copy a folder to the bucket.
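The tab supplies the exact commands for your bucket, and those should be preferred. As a hedged sketch of what the two steps usually look like with the MinIO client, the helper below builds an `mc alias set` command and a recursive `mc cp` command. The helper name, alias, endpoint, keys, bucket, and folder are all placeholders, and the command form shown in the UI may differ (for example, older `mc` releases used a different host setup command).

```python
import shlex


def mc_upload_commands(alias, endpoint, access_key, secret_key, bucket, local_dir):
    """Build the two mc invocations: register the host, then copy a folder.

    All argument values are placeholders; use the values from the
    Setup Minio CLI tab for a real dataset.
    """
    # Step 1: register the object storage endpoint under a local alias.
    set_alias = ["mc", "alias", "set", alias, endpoint, access_key, secret_key]
    # Step 2: copy a local folder, with its contents, into the bucket.
    copy_dir = ["mc", "cp", "--recursive", local_dir, f"{alias}/{bucket}/"]
    return set_alias, copy_dir


if __name__ == "__main__":
    commands = mc_upload_commands(
        alias="mydataset",                          # hypothetical alias
        endpoint="https://objectstore.example.com",  # placeholder endpoint
        access_key="ACCESS_KEY",
        secret_key="SECRET_KEY",
        bucket="my-bucket",
        local_dir="./training-data",
    )
    for cmd in commands:
        print(shlex.join(cmd))  # print shell-safe command lines to copy and run
```

Building the commands as argument lists keeps quoting correct if a path or key contains spaces; they can be passed to `subprocess.run` unchanged.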

Setup S3cmd

In the Setup S3cmd tab, you will find commands for setting up the endpoint, setting the access keys, and enabling the S3 v4 signature API.
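For orientation, the three settings named above correspond to a handful of keys in an s3cmd configuration file. The sketch below renders a minimal `.s3cfg` body under stated assumptions: the function name is hypothetical, the endpoint and keys are placeholders, and path-style bucket addressing (`host_bucket` equal to `host_base`) is assumed. The values shown in the tab are authoritative.

```python
def render_s3cfg(endpoint_host, access_key, secret_key):
    """Render a minimal s3cmd configuration as text.

    Covers the endpoint, the access keys, and v4 signatures
    (signature_v2 = False). All values are placeholders.
    """
    lines = [
        "[default]",
        f"access_key = {access_key}",
        f"secret_key = {secret_key}",
        f"host_base = {endpoint_host}",
        # Assumption: path-style addressing, so the bucket host is the endpoint itself.
        f"host_bucket = {endpoint_host}",
        "use_https = True",
        "signature_v2 = False",  # use the v4 signature scheme
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_s3cfg("objectstore.example.com", "ACCESS_KEY", "SECRET_KEY"))
```

The rendered text can be saved to a file and passed to s3cmd with its `-c` option instead of overwriting an existing `~/.s3cfg`.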

Dataset Details

In the Dataset Details tab, you will find the dataset details and the bucket details.

After you create an EOS bucket with E2E Managed encryption, you will see the screen below.

EOS Bucket Created

User Managed: If you create a dataset with User Managed encryption, you must generate an encryption key before you can upload data securely.

User Managed

Below are the commands to generate an encryption key and securely upload data.

Generate Encryption Key
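The commands shown in the UI are not reproduced here. As a hedged illustration of what key generation involves, the sketch below assumes the common S3 convention for server-side encryption with customer-provided keys (SSE-C): a random 256-bit key sent with each request as base64, together with the base64 MD5 digest of the key. Whether EOS follows exactly this convention is an assumption, and the function names are hypothetical.

```python
import base64
import hashlib
import secrets


def sse_c_headers(raw_key):
    """Build S3-style SSE-C request headers for a 32-byte key (assumed convention)."""
    if len(raw_key) != 32:
        raise ValueError("SSE-C with AES256 expects a 32-byte (256-bit) key")
    key_b64 = base64.b64encode(raw_key).decode("ascii")
    md5_b64 = base64.b64encode(hashlib.md5(raw_key).digest()).decode("ascii")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": key_b64,
        "x-amz-server-side-encryption-customer-key-MD5": md5_b64,
    }


def generate_key():
    """Return 32 cryptographically secure random bytes to use as the key."""
    return secrets.token_bytes(32)


if __name__ == "__main__":
    key = generate_key()
    # Store the key somewhere safe: without it the encrypted data is unrecoverable.
    print(base64.b64encode(key).decode("ascii"))
```

The same key must be supplied again on every download, which is why the key management points listed earlier for User Managed encryption matter.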

After creating an EOS Bucket with User Managed encryption, you will see the following screen.

User Managed EOS Bucket

Existing EOS Bucket

  • Select an existing EOS bucket to use for your dataset.
  • Enter a name for your dataset (e.g., paws) and click CREATE.
  • Similar to the new EOS bucket option, you'll see configuration tabs for Setup MinIO CLI, Setup S3cmd, and Dataset Details.

2. Disk

If you select Disk as the storage type, you will need to specify the disk size. Each GB of disk space will be charged at ₹5 per month.

Note

Disk size cannot be reduced later, but it can be increased at any time.
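To make the pricing and the resize rule concrete, here is a small sketch. The rate of ₹5 per GB per month and the grow-only rule come from the text above; the function names are illustrative, and taxes or other charges are not modeled.

```python
RATE_INR_PER_GB_MONTH = 5  # rate stated in this guide


def monthly_disk_cost(size_gb):
    """Return the monthly charge in rupees for a disk of `size_gb` GB."""
    if size_gb <= 0:
        raise ValueError("disk size must be positive")
    return size_gb * RATE_INR_PER_GB_MONTH


def validate_resize(current_gb, requested_gb):
    """Allow a resize only if it does not shrink the disk."""
    if requested_gb < current_gb:
        raise ValueError("disk size cannot be reduced")
    return requested_gb
```

For example, a 100 GB disk costs 100 × 5 = ₹500 per month. Since the size can only grow, it is reasonable to start small and increase it when needed.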

  • Enter a name for your dataset (e.g., paws) and click CREATE.

  • After successfully creating the dataset, you will see the following sections:

    • Setup: This tab provides details for configuring the EOS bucket, MinIO client setup, and S3cmd setup.

Setup MinIO Client

Setup S3cmd

Overview

This section displays information about the dataset, including the dataset name, creator details, storage type, bucket name, access keys, and EOS endpoints.

Dataset Overview

Dataset Management

  • Dataset Filtering: Users can filter datasets based on their name and storage type (EOS/Disk) using the dataset search function.

Dataset Filtering

Data Objects

In Data Objects, you can upload data to your bucket by clicking the UPLOAD DATA button.

Upload Data

After clicking the UPLOAD DATA button, choose files or drag them from your system, then click the UPLOAD button.

Choose File

Upload Files

After the upload completes, the uploaded files appear in the list.

After Uploading Files

To download a file, select it and click the download button.

Download File

To delete a file, select it and click the delete button.

Delete File

Delete Dataset

Select the dataset from the list and click the Delete button.

Delete Dataset

A confirmation popup appears. Click the delete button in the popup to delete the dataset.

Confirm Delete Dataset