Triton Inference

Triton is open-source inference serving software from NVIDIA, designed for high-performance, scalable deployments. It supports multiple ML frameworks and provides HTTP/gRPC-based communication, dynamic batching, and asynchronous request handling.

If you need to serve multiple models efficiently, Triton Inference Server combined with GPU infrastructure provides a robust and scalable solution.


Utilization

Triton supports both GPU and CPU deployments. It optimizes hardware utilization with features like dynamic batching and concurrent model execution, allowing multiple models to share a single GPU efficiently.
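As an illustration, dynamic batching and concurrent model execution are enabled per model in its config.pbtxt. The values below are placeholders for illustration, not tuned recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Here `instance_group` runs two copies of the model concurrently on the GPU, while `dynamic_batching` lets Triton combine individual requests into larger batches.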


Scalability

It enables auto-scaling of replicas and seamless multi-model deployment across available resources, all managed through the dashboard.


Application experience

You get out-of-the-box support for:

  • HTTP/REST and gRPC APIs
  • Real-time and batch inference
  • Streaming and async requests
  • Hot model updates without downtime
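Over HTTP/REST, requests follow the KServe v2 inference protocol. A sketch of the JSON body for a two-input model (the model and tensor names below match the sample model used later in this tutorial; adjust them for your own model):

```python
import json

# KServe v2 inference request body; tensor names follow the sample "simple" model
request_body = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32",
         "data": list(range(16))},
        {"name": "INPUT1", "shape": [1, 16], "datatype": "INT32",
         "data": [-1] * 16},
    ],
    "outputs": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}],
}

# POST this payload to <endpoint>/v2/models/<model-name>/infer
payload = json.dumps(request_body)
```

The Triton client library used in Step 5 builds and sends this payload for you; the sketch only shows what travels over the wire.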

Quick start tutorial

Step 1: Create a model directory to hold the model repository

Instance (Node)

Start a new notebook in JupyterLab and run the command below:

!mkdir -p model-dir

You may also run the same command from the notebook's terminal, without the leading !.

Shell (Local)

Run the following command from your local shell:

mkdir -p model-dir && cd model-dir

Step 2: Download the sample model repository

Wget (Instance)

!cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt

Wget (Shell)

cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt

GitHub (Instance)

Alternatively, you can download sample models from the Triton GitHub repository:

!cd model-dir && git clone -b r23.10 https://github.com/triton-inference-server/server.git && ./server/docs/examples/fetch_models.sh

Ensure that the model_repository folder is located under the model-dir directory created in Step 1.
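After Step 2, the directory tree should match the layout sketched below; the snippet simply lists the paths created by the wget commands above so you can check them against your own model-dir:

```python
from pathlib import Path

# Expected repository layout after Step 2:
# model-dir/
#   model_repository/
#     simple/
#       config.pbtxt        <- model configuration
#       1/
#         model.graphdef    <- version 1 of the model
expected = [
    Path("model-dir/model_repository/simple/config.pbtxt"),
    Path("model-dir/model_repository/simple/1/model.graphdef"),
]

missing = [p for p in expected if not p.exists()]
if missing:
    print("Missing files:", missing)
else:
    print("Model repository layout looks good.")
```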


Step 3: Upload sample models from local directory to a repository

Instance (Node)

Install the SDK and push the model repository.

pip install e2enetworks

Import the SDK and upload the model:

from e2enetworks.cloud import tir
model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)

Local

Install the SDK:

pip install e2enetworks

Then, in a Python shell:

from e2enetworks.cloud import tir
tir.init(api_key="<your-api-key>", access_token="<your-access-token>")

model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)

Step 4: Create a model endpoint to access the model over REST/gRPC

Dashboard

Use the Dashboard to create the endpoint:

  1. Go to Model Endpoints in Inference.
  2. Click Create Endpoint.
  3. Choose Nvidia Triton as the framework and click Continue.
  4. Select a GPU or CPU plan.
  5. Enter a name for your endpoint (optional).
  6. Select the model repository you uploaded in Step 3.
  7. Click Finish to launch the model endpoint server.
  8. Wait for the endpoint to reach Ready state.

Using SDK

You can also perform these steps using the Python SDK in a Jupyter notebook or Python shell:

endpoint_client = tir.EndPoints()
endpoint = endpoint_client.create(
    endpoint_name='test-triton-simple',
    framework='triton',
    plan='CPU-C3-4-8GB-0',
    model_id=model_repo.id,
)
endpoint_client.get(id=endpoint.id)

Step 5: Use Triton client to call the model endpoint

Install Triton client:

pip install "tritonclient[http]"

Set authentication header and endpoint URL:

headers = {'Authorization': 'Bearer <your-auth-token>'}
endpoint_url = 'infer.e2enetworks.net/project/<project-id>/endpoint/<endpoint-id>/'

Run the following code to perform inference:

import tritonclient.http as httpclient
from tritonclient.http import InferenceServerClient
import gevent.ssl as ssl
import numpy as np

triton_client = InferenceServerClient(
    url=endpoint_url,
    verbose=False,
    ssl=True,
    ssl_options={},
    insecure=False,
    ssl_context_factory=ssl.create_default_context,
)

model_name = "simple"
input0_data = np.arange(start=0, stop=16, dtype=np.int32)
input0_data = np.expand_dims(input0_data, axis=0)
input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)

inputs = []
outputs = []
inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))
inputs[0].set_data_from_numpy(input0_data, binary_data=False)
inputs[1].set_data_from_numpy(input1_data, binary_data=True)

outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False))

query_params = {"test_1": 1, "test_2": 2}

results = triton_client.infer(
    model_name,
    inputs,
    outputs=outputs,
    query_params=query_params,
    headers=headers,
)
print(results)
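Triton's sample simple model adds and subtracts its two inputs (OUTPUT0 = INPUT0 + INPUT1, OUTPUT1 = INPUT0 - INPUT1), so you can compute the expected results locally to sanity-check the response:

```python
import numpy as np

# Same inputs as in the inference code above
input0 = np.expand_dims(np.arange(start=0, stop=16, dtype=np.int32), axis=0)
input1 = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)

# The sample "simple" model computes element-wise sum and difference
expected_output0 = input0 + input1   # [-1, 0, 1, ..., 14]
expected_output1 = input0 - input1   # [1, 2, 3, ..., 16]

print(expected_output0)
print(expected_output1)
```

Compare these arrays against results.as_numpy("OUTPUT0") and results.as_numpy("OUTPUT1") from the response.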

Troubleshooting and operations guide

Model updates

  1. Push updated model and configuration files to the repository.
  2. Restart the model endpoint.

Triton automatically loads the updated model when the container restarts.

Metrics and logging

Monitor resource usage (CPU/GPU, memory) and inference metrics like QPS, P99 latency, etc., through the dashboard.

Autoscaling

Configure autoscaling to scale replicas dynamically based on demand. Scaling depends on available resources.

Multi-model support

Triton supports running multiple models on a single GPU. Explicit per-endpoint model loading/unloading is not yet supported.


Frequently asked questions

Can I use Triton to deploy LLMs?

Yes. Use the TensorRT-LLM or vLLM backend with Triton.

Does Triton support batching and streaming?

Yes. Refer to Triton Client Examples for advanced use cases like async, streaming, and decoupled inference.