Triton Inference
Triton is open-source inference serving software from NVIDIA, designed for high-performance, scalable deployments. It supports multiple frameworks and provides HTTP/gRPC-based communication, dynamic batching, and asynchronous request handling.
If you need to serve multiple models efficiently, Triton Inference combined with GPU infrastructure provides a robust and scalable solution.
Utilization
Triton supports both GPU and CPU deployments. It optimizes hardware utilization with features like dynamic batching and concurrent model execution, allowing multiple models to share a single GPU efficiently.
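Dynamic batching and concurrent execution are configured per model in its config.pbtxt. The fragment below is an illustrative sketch (not part of the sample model used later); the field names follow Triton's model configuration schema, but the values are arbitrary:

```
# Illustrative config.pbtxt fragment.
# Batch requests together, waiting up to 100 us to fill a batch.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
# Run two instances of this model concurrently on the GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

With two instances on one GPU, Triton can overlap execution of independent requests instead of serializing them.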
Scalability
It enables auto-scaling of replicas and seamless multi-model deployment across available resources, all managed through the dashboard.
Application experience
You get out-of-the-box support for:
- HTTP/REST and gRPC APIs
- Real-time and batch inference
- Streaming and async requests
- Hot model updates without downtime
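The HTTP/REST API follows the KServe v2 inference protocol: a request to POST /v2/models/&lt;name&gt;/infer carries named, typed tensors as JSON. As a sketch, the payload below mirrors the "simple" example model used later in this tutorial (two INT32 inputs of shape [1, 16]):

```python
import json

# Illustrative KServe v2 request body for the "simple" example model.
request_body = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32",
         "data": [list(range(16))]},
        {"name": "INPUT1", "shape": [1, 16], "datatype": "INT32",
         "data": [[-1] * 16]},
    ],
    "outputs": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}],
}
# The body is plain JSON; the Triton client libraries build this for you.
payload = json.dumps(request_body)
```

In practice you rarely construct this by hand; the tritonclient library used in Step 5 serializes requests for you.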
Quick start tutorial
Step 1: Create a model directory to hold the model repository
Instance (Node)
Start a new notebook in JupyterLab and run the command below:
!mkdir -p model-dir
You can also run the same command from the notebook's terminal, omitting the leading !.
Shell (Local)
Run the following command from your local shell:
mkdir -p model-dir && cd model-dir
Step 2: Download the sample model repository
Wget (Instance)
!cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
Wget (Shell)
cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
Github (Instance)
Alternatively, you can download sample models from the Triton GitHub repository:
!cd model-dir && git clone -b r23.10 https://github.com/triton-inference-server/server.git && ./server/docs/examples/fetch_models.sh
Ensure that the model_repository folder is located under the model-dir directory created in Step 1.
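Whichever method you use, the resulting layout should look like the sketch below (shown for the wget-based sample; the GitHub script fetches additional example models):

```
model-dir/
└── model_repository/
    └── simple/
        ├── config.pbtxt
        └── 1/
            └── model.graphdef
```

Each model lives in its own subdirectory with a config.pbtxt and one numbered subdirectory per version.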
Step 3: Upload the sample models from the local directory to a model repository
Instance (Node)
Install the SDK and push the model repository.
pip install e2enetworks
Import the SDK and upload the model:
from e2enetworks.cloud import tir
model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
Local
Install the SDK:
pip install e2enetworks
Then, in a Python shell:
from e2enetworks.cloud import tir
tir.init(api_key="<your-api-key>", access_token="<your-access-token>")
model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
Step 4: Create a model endpoint to access the model over REST/gRPC
Dashboard
Use the Dashboard to create the endpoint:
- Go to Model Endpoints in Inference.
- Click Create Endpoint.
- Choose Nvidia Triton as the framework and click Continue.
- Select a GPU or CPU plan.
- Enter a name for your endpoint (optional).
- Select the model repository you uploaded in Step 3.
- Click Finish to launch the model endpoint server.
- Wait for the endpoint to reach the Ready state.
Using SDK
You can also perform these steps using the Python SDK in a Jupyter notebook or shell:
endpoint_client = tir.EndPoints()
endpoint = endpoint_client.create(endpoint_name='test-triton-simple', framework='triton', plan='CPU-C3-4-8GB-0', model_id=model_repo.id)
endpoint_client.get(id=endpoint.id)
Step 5: Use Triton client to call the model endpoint
Install Triton client:
pip install tritonclient[http]
Set authentication header and endpoint URL:
headers = {'Authorization': 'Bearer <your-auth-token>'}
endpoint_url = 'infer.e2enetworks.net/project/<project-id>/endpoint/<endpoint-id>/'
Run the following code to perform inference:
import tritonclient.http as httpclient
from tritonclient.http import InferenceServerClient
import gevent.ssl as ssl
import numpy as np
triton_client = InferenceServerClient(
    url=endpoint_url,
    verbose=False,
    ssl=True,
    ssl_options={},
    insecure=False,
    ssl_context_factory=ssl.create_default_context,
)
model_name = "simple"
input0_data = np.arange(start=0, stop=16, dtype=np.int32)
input0_data = np.expand_dims(input0_data, axis=0)
input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)
inputs = []
outputs = []
inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))
inputs[0].set_data_from_numpy(input0_data, binary_data=False)
inputs[1].set_data_from_numpy(input1_data, binary_data=True)
outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False))
query_params = {"test_1": 1, "test_2": 2}
results = triton_client.infer(
    model_name,
    inputs,
    outputs=outputs,
    query_params=query_params,
    headers=headers,
)
print(results)
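The "simple" example model returns the element-wise sum (OUTPUT0) and difference (OUTPUT1) of its two inputs, so you can sanity-check the values you get back from results.as_numpy("OUTPUT0") and results.as_numpy("OUTPUT1") against a local NumPy reference:

```python
import numpy as np

# Reference computation for the "simple" example model: it returns the
# element-wise sum (OUTPUT0) and difference (OUTPUT1) of its inputs.
input0 = np.expand_dims(np.arange(16, dtype=np.int32), axis=0)
input1 = np.full((1, 16), -1, dtype=np.int32)

expected_output0 = input0 + input1  # should match results.as_numpy("OUTPUT0")
expected_output1 = input0 - input1  # should match results.as_numpy("OUTPUT1")
print(expected_output0[0][:4])  # [-1  0  1  2]
```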
Troubleshooting and operations guide
Model updates
- Push updated model and configuration files to the repository.
- Restart the model endpoint; Triton automatically loads the updated model when the container restarts.
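Triton's default version policy serves the highest-numbered version directory, so a common update pattern is to add a new numbered subdirectory rather than overwrite version 1. A minimal sketch (paths are illustrative; adjust to your repository layout):

```python
import os

# Add a new version directory; Triton's default version policy serves
# the highest-numbered version it finds under the model directory.
new_version_dir = os.path.join("model-dir", "model_repository", "simple", "2")
os.makedirs(new_version_dir, exist_ok=True)

# Copy the updated model file into the new directory, e.g.:
#   shutil.copy("updated_model.graphdef",
#               os.path.join(new_version_dir, "model.graphdef"))
# then re-run model_repo_client.push_model(...) and restart the endpoint.
```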
Metrics and logging
Monitor resource usage (CPU/GPU, memory) and inference metrics such as QPS and P99 latency through the dashboard.
Autoscaling
Configure autoscaling to scale replicas dynamically based on demand. Scaling depends on available resources.
Multi-model support
Triton supports running multiple models on a single GPU. Explicit per-endpoint model loading/unloading is not yet supported.
Frequently asked questions
Can I use Triton to deploy LLMs?
Yes. Use the TensorRT-LLM or vLLM backends.
Does Triton support batching and streaming?
Yes. Refer to Triton Client Examples for advanced use cases like async, streaming, and decoupled inference.