Model Endpoints
TIR provides two methods for deploying containers that serve model API endpoints for AI inference services:
Deploy Using Pre-built Containers (Provided by TIR)
Before launching a service with TIR's pre-built containers, create a TIR Model and upload the necessary model files. These containers automatically download the model files from an EOS (E2E Object Storage) bucket and start the API server. Once the endpoint is ready, you can make synchronous inference requests to it.
Deploy Using Your Own Container
You can launch an inference service using your own Docker image, either public or private. Once the endpoint is ready, you can make synchronous requests for inference. Optionally, you can attach a TIR model to automate the download of model files from an EOS bucket to the container.
Model Endpoints Plans
Let's explore the TIR model endpoint plans available for the various frameworks.
Example Usage
data "tir_model_endpoint_plans" "model_endpoint_plans" {
  active_iam = <active_iam : string>
  framework  = "VLLM"
}
Schema
Required
- active_iam (String)
- framework (String)
Read-Only
- id (String) : The ID of this resource.
- plans (List of Object) (see below for nested schema)
Nested Schema for plans
Read-Only:
- committed_days (Number)
- cpu (String)
- currency (String)
- gpu (String)
- memory (String)
- name (String)
- sku_type (String)
- unit_price (Number)
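As a sketch of how the returned plans can be consumed (attribute names follow the schema above; the variable `var.active_iam` is an assumption), you could surface the plan list via an output:

```hcl
# Hypothetical usage: summarize the plans returned by the data source.
data "tir_model_endpoint_plans" "vllm_plans" {
  active_iam = var.active_iam # assumed to be supplied as a variable
  framework  = "VLLM"
}

output "vllm_plan_summary" {
  # Each element of `plans` exposes name, sku_type, unit_price, and currency.
  value = [
    for p in data.tir_model_endpoint_plans.vllm_plans.plans :
    "${p.name} (${p.sku_type}): ${p.unit_price} ${p.currency}"
  ]
}
```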
Supported Frameworks
Pass the left-hand name as the framework string, e.g. framework = "TRITON".
- TRITON = 'triton'
- LLAMA = 'llma'
- PYTORCH = 'pytorch'
- CODELAMA = 'codellama'
- STABLE_DIFFUSION = 'stable_diffusion'
- STABLE_DIFFUSION_XL = 'stable_diffusion_xl'
- MPT = 'mpt'
- CUSTOM = 'custom'
- MIXTRAL8X7B = 'mixtral-8x7b-instruct'
- MIXTRAL7B = 'mistral-7b-instruct'
- TENSOR_RT = 'tensorrt'
- GEMMA_2B = 'gemma-2b'
- GEMMA_2B_IT = 'gemma-2b-it'
- GEMMA_7B = 'gemma-7b'
- GEMMA_7B_IT = 'gemma-7b-it'
- LLAMA_3 = 'llama-3-8b-instruct'
- LLAMA_3_1 = 'llama-3_1-8b-instruct'
- LLAMA_3_2 = 'llama-3_2-3b-instruct'
- LLAMA_3_2_VISION = 'llama-3_2-11b-vision-instruct'
- VLLM = 'vllm'
- STARCODER = 'starcoder2-7b'
- PHI_3_MINI = 'Phi-3-mini-128k-instruct'
- NEMO = 'nemo-rag'
- STABLE_VIDEO_DIFFUSION = 'stable-video-diffusion-img2vid-xt'
- YOLO_V8 = 'yolov8'
- NEMOTRON = 'nemotron-3-8b-chat-4k-rlhf'
- NV_EMBED = 'nvidia-nv-embed-v1'
- BAAI_LARGE = 'bge-large-en-v1_5'
- BAAI_RERANKER = 'bge-reranker-large'
- PIXTRAL = 'pixtral-12b-2409'
- SGLANG = 'sglang'
- DYNAMO = 'dynamo'
Model Endpoint Resource
Example Usage
resource "tir_model_endpoint" "example" {
  name                      = "name"
  sku_name                  = "C3.2"
  sku_type                  = "hourly"
  committed_instance_policy = ""
  committed_days            = 0
  model_path                = ""
  framework                 = "PYTORCH"
  # model_id = tir_model_repository.<name as in state file>.id // choose either model_id or model_load_integration_id
  model_load_integration_id = tir_integration.name.id
  cluster_type              = "tir-cluster"
  storage_type              = "disk"
  disk_path                 = "/mnt/models"
  image_pull_policy         = "Always"
  is_auto_scale_enabled     = false
  replica                   = 1
  committed_replicas        = 0
  auto_scale_policy {
    rules {
    }
  }
  detailed_info {
    # commands = "[\"first\",\"second\"]"
    # args = ""
    # hugging_face_id = "BAAI/Aquila-7B" # supported models apply to VLLM, SGLANG, and DYNAMO
    server_version    = "v0.9.0" // used by TRITON, PYTORCH, NEMO, and TENSORRT
    tokenizer         = ""
    world_size        = 1
    error_log         = true
    info_log          = true
    warning_log       = true
    log_verbose_level = 1
    model_serve_type  = "" // required for VLLM; its value is "full-model" or "peft-model"
    engine_args = {
      # Define engine_args as needed.
    }
  }
  is_readiness_probe_enabled = false
  is_liveness_probe_enabled  = false
  readiness_probe {
    # Define the readiness probe as needed.
  }
  liveness_probe {
    # Define the liveness probe as needed.
  }
  resource_details {
    disk_size  = 100
    mount_path = ""
    env_variables {
      key      = "HF_HOME"
      value    = "ENV_VALUE"
      required = true
      disabled = {
        key   = true
        value = false
      }
    }
    env_variables {
      key      = "ENV_KEY"
      value    = "ENV_VALUE"
      required = true
      disabled = {
        key   = true
        value = false
      }
    }
  }
  container_type = "public"
  team_id        = <team_id : string>
  project_id     = <project_id : string>
  active_iam     = <active_iam : string>
  location       = "Delhi"
  currency       = "INR"
}
Schema
Required
- active_iam (String) : The IAM (Identity and Access Management) role associated with the resource.
- cluster_type (String) : The type of cluster the resource is deployed on.
- container_type (String) : The type of container used for the resource (e.g., public, private).
- currency (String) : The currency used for billing the resource.
- framework (String) : The framework used for the model. This could be TensorFlow, PyTorch, etc.
- location (String) : The location or region where the resource is deployed.
- name (String) : The name of the resource. This is a required field and must be unique within the project.
- project_id (String) : The ID of the project where the resource is deployed.
- sku_name (String) : The SKU (Stock Keeping Unit) name for the resource. This defines the type of resource being deployed.
- sku_type (String) : The SKU type for the resource. This defines the category or classification of the SKU.
- storage_type (String) : The type of storage used for the resource.
- team_id (String) : The ID of the team that owns the resource.
Optional
- auto_scale_policy (Block List) : The policy for auto-scaling the resource. This includes min/max replicas and scaling rules. (see below for nested schema)
- committed_days (Number) : The number of days the instance is committed for. This is used for billing and resource allocation.
- committed_instance_policy (String) : The policy for committed instances. This defines how committed instances are managed and billed.
- committed_replicas (Number) : The number of replicas that are committed for the resource.
- custom_sku (Map of Number) : A map of custom SKU configurations for the private cloud.
- dataset_id (String) : The ID of the dataset associated with the resource.
- dataset_path (String) : The path to the dataset used by the resource.
- detailed_info (Block List) : Detailed information about the resource, including commands, args, and logging settings. (see below for nested schema)
- disk_path (String) : The path where the disk is mounted. This is used to specify the location for model storage.
- image_pull_policy (String) : The policy for pulling container images. Options are 'Always' or 'IfNotPresent'.
- is_auto_scale_enabled (Boolean) : Indicates whether auto-scaling is enabled for the resource.
- is_liveness_probe_enabled (Boolean) : Enable or disable the liveness probe for the resource.
- is_readiness_probe_enabled (Boolean) : Enable or disable the readiness probe for the resource.
- liveness_probe (Block List) : Configuration for the liveness probe. (see below for nested schema)
- metric_port (Boolean) : Indicates whether a metric port is exposed for the resource.
- model_id (String) : The unique identifier for the model. This is used to reference the model in the system.
- model_load_integration_id (String) : The integration ID used for loading the model. This is typically used for custom model loading workflows.
- model_path (String) : The path to the model file or directory. This is used to specify the location of the model to be deployed.
- private_cloud_id (String) : The ID of the private cloud where the resource is deployed.
- public_ip (String) : Indicates whether a public IP address is assigned to the resource.
- readiness_probe (Block List) : Configuration for the readiness probe. (see below for nested schema)
- replica (Number) : The number of replicas to deploy for the resource.
- resource_details (Block List) : Additional details about the resource, such as disk size, mount path, and environment variables. (see below for nested schema)
- server_options (String) : Specifies the server options for the resource. This is typically used for server types like TRITON, PYTORCH, NEMO, and TENSORRT.
- service_port (Boolean) : Indicates whether a service port is exposed for the resource.
- sfs_id (String) : The ID of the shared file storage. This is used to reference the shared storage resource.
- sfs_path (String) : The path for shared file storage. This is used for caching and shared resources.
- stop_inference (String) : Indicates whether to stop or start inference for the resource. Default is 'start'.
Read Only
- container_name (String) : The name of the container associated with the resource. This is computed automatically.
- created_at (String) : The timestamp when the resource was created. This is computed automatically.
- id (String) : The ID of this resource.
- status (String) : The current status of the resource. This is computed automatically.
Nested Schema for auto_scale_policy
Optional:
- max_replicas (Number) : The maximum number of replicas to scale up to during auto-scaling.
- min_replicas (Number) : The minimum number of replicas to maintain during auto-scaling.
- rules (Block List) : The rules for auto-scaling based on metrics and conditions. (see below for nested schema)
- stability_period (Number) : The period (in seconds) to wait after scaling before scaling again.
Nested Schema for auto_scale_policy.rules
Optional:
- condition_type (String) : The type of condition to apply for scaling.
- custom_metric_name (String) : The name of a custom metric to use for scaling.
- granularity (Number) : The granularity of the metric data collection.
- metric (String) : The metric to monitor for auto-scaling
- value (Number) : The threshold value for the metric to trigger scaling.
- watch_period (Number) : The period (in seconds) to watch the metric before scaling.
- window (Number) : The time window (in seconds) for evaluating the metric.
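As an illustrative sketch of these two blocks together (the metric name "cpu", the condition type, and the numeric thresholds are assumptions, not documented values):

```hcl
# Hypothetical auto-scaling policy; metric, condition_type, and
# thresholds below are illustrative assumptions.
auto_scale_policy {
  min_replicas     = 1
  max_replicas     = 4
  stability_period = 300 # wait 300s after a scaling action before scaling again

  rules {
    metric         = "cpu"   # assumed metric name
    condition_type = "limit" # assumed condition type
    value          = 80      # threshold that triggers scaling
    watch_period   = 60      # watch the metric for 60s before acting
    window         = 120     # evaluate the metric over a 120s window
    granularity    = 1
  }
}
```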
Nested Schema for detailed_info
Optional:
- args (String) : Arguments to pass to the commands when the resource is deployed.
- commands (String) : Commands to execute when the resource is deployed.
- engine_args (Map of String) : Additional engine-specific arguments for the model. (see below for nested schema)
- error_log (Boolean) : Enable or disable error logging.
- hugging_face_id (String) : The Hugging Face model ID associated with the resource.
- info_log (Boolean) : Enable or disable info logging.
- log_verbose_level (Number) : The verbosity level for logging.
- model_serve_type (String) : The type of model serving (e.g., real-time, batch).
- server_version (String) : The version of the server being used.
- tokenizer (String) : The tokenizer to use for the model.
- warning_log (Boolean) : Enable or disable warning logging.
- world_size (Number) : The world size for distributed training or inference.
Nested Schema for detailed_info.engine_args
Optional:
- block_size
- chat_template
- data_type
- disable_custom_all_reduce
- disable_log_requests
- disable_log_stats
- disable_sliding_window
- distributed_executor_backend
- enable_auto_tool_choice
- enable_chunked_prefill
- enable_lora
- enable_lora_bias
- enable_prefix_caching
- enforce_eager
- fully_sharded_loras
- gpu_memory_utilization
- guided_decoding_backend
- kv_cache_data_type
- load_format
- long_lora_scaling_factors
- lora_data_type
- lora_extra_vocab_size
- max_cpu_loras
- max_log_len
- max_logprobs
- max_lora_rank
- max_loras
- max_model_length
- max_num_batched_tokens
- max_num_seqs
- max_parallel_loading_workers
- max_seq_len_to_capture
- model_loader_extra_config
- ngram_prompt_lookup_max
- ngram_prompt_lookup_min
- num_gpu_blocks_override
- num_lookahead_slots
- num_speculative_tokens
- preemption_mode
- quantization
- rope_scaling
- rope_theta
- scheduler_delay_factor
- seed
- skip_tokenizer_init
- spec_decoding_acceptance_method
- speculative_disable_by_batch_size
- speculative_draft_tensor_parallel_size
- speculative_max_model_len
- speculative_model
- swap_space
- tokenizer
- tokenizer_mode
- tokenizer_pool_extra_config
- tokenizer_pool_size
- tokenizer_pool_type
- tokenizer_revision
- tool_call_parser
- typical_acceptance_sampler_posterior_alpha
- typical_acceptance_sampler_posterior_threshold
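Since engine_args is a map of strings, numeric and boolean values are passed as quoted strings. A sketch for a VLLM endpoint (keys come from the list above; the values are illustrative assumptions, not recommendations):

```hcl
# Illustrative engine_args map; all values are example assumptions.
engine_args = {
  data_type              = "auto"
  gpu_memory_utilization = "0.90"
  max_model_length       = "8192"
  max_num_seqs           = "256"
  enforce_eager          = "false"
}
```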
Nested Schema for liveness_probe
Optional:
- commands (String) : Commands to execute for the liveness probe.
- failure_threshold (Number) : The number of failed probes before the resource is marked as not live.
- grpc_service (String) : The gRPC service to check for the liveness probe.
- initial_delay_seconds (Number) : The initial delay (in seconds) before the liveness probe starts.
- path (String) : The path to check for the liveness probe.
- period_seconds (Number) : The period (in seconds) between liveness probe checks.
- port (Number) : The port to use for the liveness probe.
- protocol (String) : The protocol to use for the liveness probe (e.g., http, tcp).
- success_threshold (Number) : The number of successful probes required to mark the resource as live.
- timeout_seconds (Number) : The timeout (in seconds) for the liveness probe.
Nested Schema for readiness_probe
Optional:
- commands (String) : Commands to execute for the readiness probe.
- failure_threshold (Number) : The number of failed probes before the resource is marked as not ready.
- grpc_service (String) : The gRPC service to check for the readiness probe.
- initial_delay_seconds (Number) : The initial delay (in seconds) before the readiness probe starts.
- path (String) : The path to check for the readiness probe.
- period_seconds (Number) : The period (in seconds) between readiness probe checks.
- port (Number) : The port to use for the readiness probe.
- protocol (String) : The protocol to use for the readiness probe (e.g., http, tcp).
- success_threshold (Number) : The number of successful probes required to mark the resource as ready.
- timeout_seconds (Number) : The timeout (in seconds) for the readiness probe.
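The two probe blocks accept the same fields. A sketch of an HTTP readiness probe (the health path and port are assumptions for a typical inference server, not documented defaults):

```hcl
# Illustrative readiness probe; path and port are assumed values.
is_readiness_probe_enabled = true
readiness_probe {
  protocol              = "http"
  path                  = "/v2/health/ready" # assumed health-check path
  port                  = 8080               # assumed container port
  initial_delay_seconds = 30
  period_seconds        = 10
  timeout_seconds       = 5
  failure_threshold     = 3
  success_threshold     = 1
}
```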
Nested Schema for resource_details
Optional:
- disk_size (Number) : The size of the disk (in GB) allocated for the resource.
- env_variables (Block List) : Environment variables to be set for the resource. (see below for nested schema)
- mount_path (String) : The path where the disk is mounted.
Nested Schema for resource_details.env_variables
Optional:
- disabled (Map of Boolean) : A map of disabled environment variables.
- key (String) : The key for the environment variable.
- required (Boolean) : Indicates whether the environment variable is required.
- value (String) : The value for the environment variable.
Supported Frameworks
- TRITON = 'triton'
- PYTORCH = 'pytorch'
- CODELAMA = 'codellama'
- STABLE_DIFFUSION = 'stable_diffusion'
- STABLE_DIFFUSION_XL = 'stable_diffusion_xl'
- MPT = 'mpt'
- CUSTOM = 'custom'
- MIXTRAL8X7B = 'mixtral-8x7b-instruct'
- MIXTRAL7B = 'mistral-7b-instruct'
- TENSOR_RT = 'tensorrt'
- GEMMA_2B = 'gemma-2b'
- GEMMA_2B_IT = 'gemma-2b-it'
- GEMMA_7B = 'gemma-7b'
- GEMMA_7B_IT = 'gemma-7b-it'
- LLAMA_3 = 'llama-3-8b-instruct'
- LLAMA_3_1 = 'llama-3_1-8b-instruct'
- LLAMA_3_2 = 'llama-3_2-3b-instruct'
- LLAMA_3_2_VISION = 'llama-3_2-11b-vision-instruct'
- VLLM = 'vllm'
- STARCODER = 'starcoder2-7b'
- PHI_3_MINI = 'Phi-3-mini-128k-instruct'
- NEMO = 'nemo-rag'
- STABLE_VIDEO_DIFFUSION = 'stable-video-diffusion-img2vid-xt'
- YOLO_V8 = 'yolov8'
- NEMOTRON = 'nemotron-3-8b-chat-4k-rlhf'
- NV_EMBED = 'nvidia-nv-embed-v1'
- BAAI_LARGE = 'bge-large-en-v1_5'
- BAAI_RERANKER = 'bge-reranker-large'
- PIXTRAL = 'pixtral-12b-2409'
- SGLANG = 'sglang'
- DYNAMO = 'dynamo'
Supported Models for SGLANG
- deepseek-ai/DeepSeek-R1
- google/gemma-2b
- deepseek-ai/DeepSeek-V3
- meta-llama/Llama-3.2-1B
- microsoft/Phi-3-small-8k-instruct
- meta-llama/Llama-3.2-1B-Instruct
- custom
Supported Models for VLLM
- custom
- BAAI/Aquila-7B
- BAAI/AquilaChat-7B
- Snowflake/snowflake-arctic-base
- Snowflake/snowflake-arctic-instruct
- baichuan-inc/Baichuan-7B
- baichuan-inc/Baichuan2-13B-Chat
- bigscience/bloom
- bigscience/bloomz
- THUDM/chatglm2-6b
- THUDM/chatglm3-6b
- CohereForAI/c4ai-command-r-v01
- databricks/dbrx-base
- databricks/dbrx-instruct
- Deci/DeciLM-7B
- Deci/DeciLM-7B-instruct
- tiiuae/falcon-7b
- tiiuae/falcon-40b
- tiiuae/falcon-rw-7b
- google/gemma-2b
- google/gemma-7b
- gpt2
- gpt2-xl
- bigcode/starcoder
- bigcode/gpt_bigcode-santacoder
- WizardLM/WizardCoder-15B-V1.0
- EleutherAI/gpt-j-6b
- nomic-ai/gpt4all-j
- EleutherAI/gpt-neox-20b
- EleutherAI/pythia-12b
- OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
- databricks/dolly-v2-12b
- stabilityai/stablelm-tuned-alpha-7b
- internlm/internlm-7b
- internlm/internlm-chat-7b
- internlm/internlm2-7b
- internlm/internlm2-chat-7b
- core42/jais-13b
- core42/jais-13b-chat
- core42/jais-30b-v3
- core42/jais-30b-chat-v3
- openlm-research/open_llama_13b
- meta-llama/Llama-2-13b-hf
- meta-llama/Llama-2-70b-hf
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Meta-Llama-3.1-8B-Instruct
- meta-llama/Meta-Llama-3.1-70B-Instruct
- meta-llama/Meta-Llama-3.1-405B-Instruct
- meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
- meta-llama/Llama-3.2-1B
- meta-llama/Llama-3.2-3B
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
- meta-llama/Llama-Guard-3-1B
- meta-llama/Llama-3.2-11B-Vision
- meta-llama/Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision
- meta-llama/Llama-3.2-90B-Vision-Instruct
- meta-llama/Llama-Guard-3-11B-Vision
- meta-llama/Llama-3.3-70B-Instruct
- lmsys/vicuna-13b-v1.3
- 01-ai/Yi-6B
- 01-ai/Yi-34B
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-1.5-13b-hf
- openbmb/MiniCPM-2B-sft-bf16
- openbmb/MiniCPM-2B-dpo-bf16
- mistralai/Mistral-7B-v0.1
- mistralai/Mistral-7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistral-community/Mixtral-8x22B-v0.1
- mosaicml/mpt-7b
- mosaicml/mpt-30b
- mosaicml/mpt-7b-instruct
- mosaicml/mpt-30b-instruct
- mosaicml/mpt-7b-chat
- mosaicml/mpt-30b-chat
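To serve one of the models above with VLLM, the detailed_info block would reference the listed Hugging Face ID; a sketch (the chosen model is just one entry from the list):

```hcl
# Sketch: serving a listed model via VLLM from Hugging Face.
detailed_info {
  hugging_face_id  = "mistralai/Mistral-7B-Instruct-v0.1" # from the supported list
  model_serve_type = "full-model" # or "peft-model" for a PEFT adapter
  tokenizer        = ""
  world_size       = 1
}
```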
Supported Models for Dynamo
- custom: Custom (If model not present in list)
- BAAI/Aquila-7B: Aquila-7B
- BAAI/AquilaChat-7B: Aquila2-7B-Chat
- Snowflake/snowflake-arctic-base: Arctic-Base
- Snowflake/snowflake-arctic-instruct: Arctic-Instruct
- baichuan-inc/Baichuan-7B: Baichuan-7B
- baichuan-inc/Baichuan2-13B-Chat: Baichuan2-13B-Chat
- bigscience/bloom: BLOOM
- bigscience/bloomz: BLOOMZ
- THUDM/chatglm2-6b: ChatGLM2-6B
- THUDM/chatglm3-6b: ChatGLM3-6B
- CohereForAI/c4ai-command-r-v01: Command-R
- databricks/dbrx-base: DBRX-Base
- databricks/dbrx-instruct: DBRX-Instruct
- Deci/DeciLM-7B: DeciLM-7B
- Deci/DeciLM-7B-instruct: DeciLM-7B-Instruct
- tiiuae/falcon-7b: Falcon-7B
- tiiuae/falcon-40b: Falcon-40B
- tiiuae/falcon-rw-7b: Falcon-RW-7B
- google/gemma-2b: Gemma-2B
- google/gemma-7b: Gemma-7B
- gpt2: GPT-2
- gpt2-xl: GPT-2-XL
- bigcode/starcoder: StarCoder
- bigcode/gpt_bigcode-santacoder: SantaCoder
- WizardLM/WizardCoder-15B-V1.0: WizardCoder-15B
- EleutherAI/gpt-j-6b: GPT-J-6B
- nomic-ai/gpt4all-j: GPT-J
- EleutherAI/gpt-neox-20b: GPT-NeoX-20B
- EleutherAI/pythia-12b: Pythia-12B
- OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5: OpenAssistant-12B
- databricks/dolly-v2-12b: Dolly-V2-12B
- stabilityai/stablelm-tuned-alpha-7b: StableLM-Alpha-7B
- internlm/internlm-7b: InternLM-7B
- internlm/internlm-chat-7b: InternLM-7B-Chat
- internlm/internlm2-7b: InternLM2-7B
- internlm/internlm2-chat-7b: InternLM2-7B-Chat
- core42/jais-13b: Jais-13B
- core42/jais-13b-chat: Jais-13B-Chat
- core42/jais-30b-v3: Jais-V3-30B
- core42/jais-30b-chat-v3: Jais-V3-30B-Chat
- openlm-research/open_llama_13b: LLaMA-13B
- meta-llama/Llama-2-13b-hf: Llama-2-13B
- meta-llama/Llama-2-70b-hf: Llama-2-70B
- meta-llama/Meta-Llama-3-8B-Instruct: Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B-Instruct: Llama-3-70B-Instruct
- meta-llama/Meta-Llama-3.1-8B-Instruct: Llama-3.1-8B-Instruct
- meta-llama/Meta-Llama-3.1-70B-Instruct: Llama-3.1-70B-Instruct
- meta-llama/Meta-Llama-3.1-405B-Instruct: Llama-3.1-405B-Instruct
- meta-llama/Meta-Llama-3.1-405B-Instruct-FP8: Llama-3.1-405B-Instruct-FP8
- meta-llama/Llama-3.2-1B: Llama-3.2-1B
- meta-llama/Llama-3.2-3B: Llama-3.2-3B
- meta-llama/Llama-3.2-1B-Instruct: Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-3B-Instruct: Llama-3.2-3B-Instruct
- meta-llama/Llama-Guard-3-1B: Llama-Guard-3-1B
- meta-llama/Llama-3.2-11B-Vision: Llama-3.2-11B-Vision
- meta-llama/Llama-3.2-11B-Vision-Instruct: Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision: Llama-3.2-90B-Vision
- meta-llama/Llama-3.2-90B-Vision-Instruct: Llama-3.2-90B-Vision-Instruct
- meta-llama/Llama-Guard-3-11B-Vision: Llama-Guard-3-11B-Vision
- meta-llama/Llama-3.3-70B-Instruct: Llama-3.3-70B-Instruct
- lmsys/vicuna-13b-v1.3: Vicuna-V1-13B
- 01-ai/Yi-6B: Yi-6B
- 01-ai/Yi-34B: Yi-34B
- llava-hf/llava-1.5-7b-hf: LLaVA-1.5-7B
- llava-hf/llava-1.5-13b-hf: LLaVA-1.5-13B
Versions
Versions for TensorrtServerOptions
- v24.02
- v24.01
- v23.12
- v23.11
- v23.10
- v0.10.0
- v0.9.0
- v0.7.2
- custom
Versions for PytorchServerOptions
- v0.9.0
- v0.8.2
- v0.8.1
- custom
Versions for TritonServerOptions
- v24.02
- v24.01
- v23.12
- v23.11
- v23.10
- custom
Versions for NemoServerOptions
- v0.9.0
- custom
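Tying a framework to its version list above: server_version in detailed_info should be one of the values for the matching server options. A minimal sketch for a PYTORCH endpoint:

```hcl
# Sketch: pin server_version to an entry from PytorchServerOptions.
framework = "PYTORCH"
detailed_info {
  server_version = "v0.9.0"
}
```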