TIR/AI Platform

Introduction

TIR is an AI development platform built to reduce the friction of training and serving large AI models.

Components of the TIR Platform

  • TIR Dashboard: Notebooks, Datasets, Models, Inference, Token Management

  • Python SDK: Work with TIR objects from the comfort of a Python shell or Jupyter notebook (hosted or local)

  • CLI: Bring the power of the E2E GPU cloud to your local desktop

Why Is AI Model Development So Hard?

  1. Software Stack Complexity: Taking a model from development to production requires a variety of tools and environments. Some of these tools have strict version dependencies, which makes things harder still.

    • Data Loading and Processing

    • Training frameworks and libraries (e.g. PyTorch, Transformers)

    • GPU drivers and GPU-optimized libraries (some libraries depend on the specific GPU)

    • Fault tolerance handling (through pipelines and stateful jobs that can restart)

    • Deployment Management

  2. Scaling Up, Out and to Zero: Training and serving large models requires a platform with high GPU availability and the ability to scale up, scale out, and scale to zero so that you do not pay for idle capacity.

  3. Collaboration: The work of AI researchers and engineers requires a high degree of collaboration. Being able to reproduce a teammate's work is an important part of pushing that work forward. Software engineering tools like git help, but they are not sufficient to handle the large datasets and models needed to make work reproducible.

  4. Taking Models to Production: Packaging open-source or in-house models for production use requires a whole different skill set (containers, API development, security and authentication, etc.). The good news is that this process is repetitive, so it can be easily automated.
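The fault-tolerance point in item 1 (stateful jobs that can restart) can be illustrated with a minimal checkpoint-and-resume sketch. This is not TIR's implementation or API: run_job, the JSON checkpoint file, and the fail_at parameter are hypothetical names chosen for illustration, and the "training" step is a stand-in that simply sums integers.

```python
import json
import os
import tempfile


def run_job(total_steps, checkpoint_path, fail_at=None):
    """Run a toy loop that resumes from the last saved checkpoint.

    All names here are illustrative assumptions, not part of any TIR API.
    """
    state = {"step": 0, "running_sum": 0}

    # Resume from the previous run's state if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        # Simulate a crash (e.g. a preempted node) before a given step.
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure at step %d" % fail_at)

        # Stand-in for one unit of real work (e.g. a training step).
        state["running_sum"] += state["step"]
        state["step"] += 1

        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        try:
            run_job(5, path, fail_at=3)   # first attempt crashes part-way
        except RuntimeError as exc:
            print("job failed:", exc)
        print(run_job(5, path))           # restart picks up where it left off
```

The restarted call does not redo the steps that completed before the failure; it reads the checkpoint and continues from there. A real training job would persist model and optimizer state instead of a small JSON dictionary, but the resume pattern is the same.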

Key Features of TIR Platform

  • GPU-Optimized Containers (NVIDIA)

  • Manage the end-to-end lifecycle of training and serving large AI models

  • Pre-Configured Notebooks:

    • Easily launch notebooks with a variety of environment options (e.g. Transformers) and the desired hardware

    • Persistent notebook workspaces for reproducibility of work

  • Datasets: Backed by EOS (E2E Object Storage) and PVC for easier data sharing and availability

  • Models and Endpoints: Track models in an EOS-backed repository and serve them through endpoints with simple configuration

  • Pipelines: Define end-to-end training and deployment pipelines

  • Jobs: Want to quickly run your Python code? Just start a job with the desired hardware and we will take care of the rest

  • Project and Team Management

  • User and Access Management

  • Integrations: git, Hugging Face, Weights & Biases (experiment management), Neptune