TIR AI Platform Documentation

TIR is a modern AI Development Platform designed to tackle the friction of training and serving large AI models.

We do this with highly optimized GPU containers (NGC), pre-configured environments (PyTorch, TensorFlow, Triton), automated API generation for model serving, shared notebook storage, and much more.

Components of the TIR Platform

  • TIR Dashboard: Nodes, Datasets, Models, Inference, Token Management

  • Python SDK: Work with TIR objects from the comfort of a Python shell or Jupyter notebook (hosted or local); see the sketch after this list

  • CLI: Bring the power of the E2E GPU cloud to your local desktop
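
To give a feel for the Python SDK, here is a minimal sketch of working with TIR objects from a Python shell. The client entry point, method names, and parameters below are assumptions for illustration, not the documented SDK surface; consult the SDK reference for the actual API.

    # Hypothetical sketch: the module path, client, and method names are
    # assumptions for illustration, not the documented TIR SDK API.
    from tir import TIRClient

    client = TIRClient(api_key="YOUR_API_KEY", project="my-project")

    # List datasets available to the project (assumed method name).
    for dataset in client.datasets.list():
        print(dataset.name)

    # Launch a pre-configured notebook node (assumed parameter names).
    node = client.nodes.create(image="pytorch", hardware="gpu-node")
    print(node.status)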

Why Is AI Model Development So Hard?

  1. Software Stack Complexity: Taking a model from development to production requires a variety of toolsets and environments, and some of these toolsets carry strict version dependencies that make things even harder:

    • Data Loading and Processing

    • Training frameworks and libraries (e.g., PyTorch, Transformers)

    • GPU drivers with library optimizations (some libraries depend on specific GPU hardware)

    • Fault tolerance handling (through pipelines and stateful jobs that can restart)

    • Deployment Management

  2. Scaling Up, Out, and to Zero: Training and serving large models requires a platform with high GPU availability and the ability to scale up, scale out, and scale to zero to avoid paying for idle capacity.

  3. Collaboration: The work of AI researchers and engineers demands a high degree of collaboration, and reproducing a team member's work is essential to building on it. Software engineering tools like git help, but they are not built to handle the large datasets and models that reproducibility requires.

  4. Taking Models to Production: Packaging open-source or in-house models for production requires a whole different skill set (containers, API development, security and authentication, etc.). The good news is that this process is repetitive in nature, so it can easily be automated; the sketch after this list shows the kind of boilerplate involved.
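
To make point 4 concrete, below is the kind of serving boilerplate that gets rewritten for every model. The FastAPI app and the placeholder model are generic illustrations, not TIR code; TIR's automated API generation is meant to remove exactly this layer.

    # A minimal hand-rolled model endpoint (generic illustration, not TIR code).
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class PredictRequest(BaseModel):
        inputs: list[float]

    class PredictResponse(BaseModel):
        outputs: list[float]

    @app.post("/predict", response_model=PredictResponse)
    def predict(req: PredictRequest) -> PredictResponse:
        # Placeholder "model": double each input. A real service would also
        # load weights, batch requests, and handle auth, logging, and scaling.
        return PredictResponse(outputs=[2.0 * x for x in req.inputs])

    # Run locally with: uvicorn serve:app --port 8000

Repeating this for every model, plus containers, authentication, and monitoring, is what makes manual productionization costly.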

Key Features of TIR Platform

  • GPU-optimized containers (NVIDIA NGC)

  • Manage the end-to-end lifecycle of training and serving large AI models

  • Pre-configured Nodes:

    • Easily launch notebooks with a variety of environment options (e.g., Transformers) and the desired hardware

    • Persistent notebook workspaces for reproducibility of work

  • Datasets: backed by EOS (E2E Object Storage) and PVCs for easier data sharing and availability; see the sketch after this list

  • Models and Endpoints: Track models in an EOS-backed repository and serve them through endpoints with simple configuration

  • Pipelines: Define end-to-end training and deployment pipelines

  • Run: Want to quickly run your Python code? Just start a job with the desired hardware and we take care of the rest

  • Project and Team Management

  • User and Access Management

  • Integrations: Hugging Face, Weights & Biases (experiment management), GitLab (coming soon)
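
Because datasets are EOS-backed, a common pattern is to push training data from any S3-compatible client. The sketch below assumes EOS exposes an S3-compatible endpoint; the endpoint URL, bucket name, and credentials are placeholders, so check the Datasets documentation for the actual values.

    # Sketch: uploading training data to an EOS-backed dataset bucket,
    # assuming an S3-compatible endpoint (all values here are placeholders).
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://eos.example.com",  # placeholder EOS endpoint
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Upload a local file into the dataset bucket.
    s3.upload_file("train.csv", "my-dataset-bucket", "data/train.csv")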