TIR AI Platform Documentation

TIR is a modern AI Development Platform designed to tackle the friction of training and serving large AI models.

We do this with highly optimized GPU containers (NGC), pre-configured environments (PyTorch, TensorFlow, Triton), automated API generation for model serving, shared notebook storage, and much more.

Components of the TIR Platform

  • TIR Dashboard: Nodes, Datasets, Models, Inference, Token Management

  • Python SDK: Work with TIR objects from the comfort of a Python shell or Jupyter notebook (hosted or local); see the sketch after this list

  • CLI: Bring the power of the E2E GPU cloud to your local desktop
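
To give a feel for the Python SDK, here is a minimal sketch of working with TIR objects from a Python shell. The client entry point, method names, and parameters below are assumptions for illustration, not the documented SDK surface; consult the SDK reference for the actual API.

    # Hypothetical sketch: the module path, client, and method names are
    # assumptions for illustration, not the documented TIR SDK API.
    from tir import TIRClient

    client = TIRClient(api_key="YOUR_API_KEY", project="my-project")

    # List datasets available to the project (assumed method name).
    for dataset in client.datasets.list():
        print(dataset.name)

    # Launch a pre-configured notebook node (assumed parameter names).
    node = client.nodes.create(image="pytorch", hardware="gpu-node")
    print(node.status)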

Why Is AI Model Development So Hard?

  1. Software Stack Complexity: Taking a model from development to production requires a variety of toolsets and environments, and some of these toolsets carry strict version dependencies that make things even harder:

    • Data Loading and Processing

    • Training frameworks and libraries (e.g., PyTorch, Transformers)

    • GPU drivers with library optimizations (some libraries depend on specific GPU hardware)

    • Fault tolerance handling (through pipelines and stateful jobs that can restart)

    • Deployment Management

  2. Scaling Up, Out, and to Zero: Training and serving large models requires a platform with high GPU availability and the ability to scale up, scale out, and scale to zero to avoid paying for idle capacity.

  3. Collaboration: The work of AI researchers and engineers demands a high degree of collaboration, and reproducing a team member's work is essential to building on it. Software engineering tools like git help, but they are not built to handle the large datasets and models that reproducibility requires.

  4. Taking Models to Production: Packaging open-source or in-house models for production requires a whole different skill set (containers, API development, security and authentication, etc.). The good news is that this process is repetitive in nature, so it can easily be automated; the sketch after this list shows the kind of boilerplate involved.
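
To make point 4 concrete, below is the kind of serving boilerplate that gets rewritten for every model. The FastAPI app and the placeholder model are generic illustrations, not TIR code; TIR's automated API generation is meant to remove exactly this layer.

    # A minimal hand-rolled model endpoint (generic illustration, not TIR code).
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class PredictRequest(BaseModel):
        inputs: list[float]

    class PredictResponse(BaseModel):
        outputs: list[float]

    @app.post("/predict", response_model=PredictResponse)
    def predict(req: PredictRequest) -> PredictResponse:
        # Placeholder "model": double each input. A real service would also
        # load weights, batch requests, and handle auth, logging, and scaling.
        return PredictResponse(outputs=[2.0 * x for x in req.inputs])

    # Run locally with: uvicorn serve:app --port 8000

Repeating this for every model, plus containers, authentication, and monitoring, is what makes manual productionization costly.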

Key Features of TIR Platform

  • GPU-optimized containers (NVIDIA NGC)

  • Manage the end-to-end lifecycle of training and serving large AI models

  • Pre-configured Nodes:

    • Easily launch notebooks with a variety of environment options (e.g., Transformers) and the desired hardware

    • Persistent notebook workspaces for reproducibility of work

  • Datasets: backed by EOS (E2E Object Storage) and PVCs for easier data sharing and availability; see the sketch after this list

  • Models and Endpoints: Track models in an EOS-backed repository and serve them through endpoints with simple configuration

  • Pipelines: Define end-to-end training and deployment pipelines

  • Run: Want to quickly run your Python code? Just start a job with the desired hardware and we take care of the rest

  • Project and Team Management

  • User and Access Management

  • Integrations: Hugging Face, Weights & Biases (experiment management), GitLab (coming soon)
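
Because datasets are EOS-backed, a common pattern is to push training data from any S3-compatible client. The sketch below assumes EOS exposes an S3-compatible endpoint; the endpoint URL, bucket name, and credentials are placeholders, so check the Datasets documentation for the actual values.

    # Sketch: uploading training data to an EOS-backed dataset bucket,
    # assuming an S3-compatible endpoint (all values here are placeholders).
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://eos.example.com",  # placeholder EOS endpoint
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Upload a local file into the dataset bucket.
    s3.upload_file("train.csv", "my-dataset-bucket", "data/train.csv")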