Triton Inference

Triton is open-source inference serving software from NVIDIA, built for high-throughput inference. It supports HTTP/REST and gRPC for client-server communication, along with features such as dynamic batching and asynchronous requests.

If you need to serve more than one model, then the flexibility of Triton Inference and TIR's high-performance GPU infrastructure is your best bet.

Utilization

Triton can be used to deploy models either on GPU or CPU. It maximizes GPU/CPU utilization with features such as dynamic batching and concurrent model execution. This means if you have a single GPU, you can load more than one model if GPU memory is available.
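Both features are normally switched on in a model's `config.pbtxt`. As a minimal sketch, the helper below renders the two relevant stanzas, `dynamic_batching` and `instance_group`, as text. The helper name `render_scheduling_config` and the values are illustrative assumptions, not recommended settings, and dynamic batching only applies to models that support batching.

```python
def render_scheduling_config(instance_count=2, kind="KIND_GPU",
                             preferred_batch_sizes=(4, 8),
                             max_queue_delay_us=100):
    """Return config.pbtxt stanzas for dynamic batching and concurrent execution.

    Hypothetical helper for illustration: the field names are standard Triton
    model-configuration fields, the values are placeholders to tune per model.
    """
    sizes = ", ".join(str(s) for s in preferred_batch_sizes)
    return (
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        "instance_group [\n"
        "  {\n"
        f"    count: {instance_count}\n"
        f"    kind: {kind}\n"
        "  }\n"
        "]\n"
    )


# Print the stanzas so they can be appended to a model's config.pbtxt.
print(render_scheduling_config())
```

Setting `count` above 1 asks Triton to run several instances of the same model side by side, which is one way to use spare GPU memory.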

Scalability

You can auto-scale replicas and auto-deploy models across all of them with a click of a button.

Application Experience

You get HTTP/REST and gRPC endpoints out of the box. There is also support for real-time, batch (including in-flight batching), and streaming options for sending inference requests. Models can be updated in production without downtime.
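As a rough illustration of what travels over the HTTP/REST endpoint, the sketch below builds a JSON request body in the shape of the KServe v2 inference protocol that Triton implements. The helper name `build_infer_request` is hypothetical, and the tensor names are borrowed from the `simple` model used in the tutorial below. In practice the Triton client library assembles this body for you.

```python
import json


def build_infer_request(inputs, output_names):
    """Build a JSON inference request body.

    `inputs` maps a tensor name to a (datatype, shape, flat_data) tuple.
    """
    body = {
        "inputs": [
            {"name": name, "datatype": dtype, "shape": list(shape), "data": list(data)}
            for name, (dtype, shape, data) in inputs.items()
        ],
        "outputs": [{"name": name} for name in output_names],
    }
    return json.dumps(body)


# Two INT32 tensors of shape [1, 16], matching the `simple` sample model.
payload = build_infer_request(
    {
        "INPUT0": ("INT32", [1, 16], range(16)),
        "INPUT1": ("INT32", [1, 16], [1] * 16),
    },
    ["OUTPUT0", "OUTPUT1"],
)
print(payload)
```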

Quick Start: Tutorial

1. Create a directory `model-dir` to download the model repository.

TIR Notebook

Start a new notebook in JupyterLab and run the command below:

    # create a directory to download model weights
    !mkdir -p model-dir

You can also run the commands in this tutorial from a terminal by omitting the leading ! character.

Shell (Local)

The steps in this tutorial can also be performed from the command line, either from a terminal in a TIR Notebook or on your local machine. Stay in the parent directory, because the next step changes into model-dir itself.

    # create a directory to download model weights
    mkdir -p model-dir

2. Download the Sample Model Repository

Wget (TIR Notebook)

    !cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt

Wget (Shell)

    cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt

GitHub (TIR Notebook)

Alternatively, you can download sample models from Triton's GitHub repository.

    !cd model-dir && \
      git clone -b r23.10 https://github.com/triton-inference-server/server.git && \
      ./server/docs/examples/fetch_models.sh

This command downloads the sample models into the model_repository folder.

Before you proceed, make sure that model_repository is copied to or created under model-dir (directory created in step 1).
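A quick way to confirm the layout before uploading is to walk the directory. The sketch below assumes the structure used in this step (one folder per model, each with a `config.pbtxt` and at least one numeric version folder). The function name `check_model_repository` is hypothetical.

```python
from pathlib import Path


def check_model_repository(model_dir="model-dir"):
    """Return a list of layout problems found under <model_dir>/model_repository."""
    repo = Path(model_dir) / "model_repository"
    if not repo.is_dir():
        return [f"missing directory: {repo}"]
    problems = []
    models = [p for p in sorted(repo.iterdir()) if p.is_dir()]
    if not models:
        problems.append(f"no model directories in {repo}")
    for model in models:
        # Each model folder is expected to carry its configuration file.
        if not (model / "config.pbtxt").is_file():
            problems.append(f"{model.name}: missing config.pbtxt")
        # Version folders are named with plain integers, such as 1 or 2.
        versions = [p for p in model.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append(f"{model.name}: no numeric version directory")
    return problems


# An empty list means the layout matches what this tutorial expects.
print(check_model_repository("model-dir"))
```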

3. Upload the sample models from local directory to a TIR Repository.

TIR Notebook

Install the SDK:

    pip install e2enetworks

Start a Python shell or Jupyter notebook and import TIR:

    from e2enetworks.cloud import tir

Push models:

Initiate the model repository client.

    model_repo_client = tir.Models()

Define a model.

    model_repo = model_repo_client.create(name='test-triton', model_type='triton')

Upload from model-dir. A new model repository will be created.

    model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)

Local

Install the TIR SDK:

    pip install e2enetworks

Start a Python shell or a Jupyter notebook and import the SDK:

    from e2enetworks.cloud import tir

Push models:

    # get the API key and access token from Projects > Inference > API Tokens
    tir.init(api_key="<enter-api-key-here>", access_token="<enter-access-token-here>")

    # create clients for looking up the team id and project id
    teams_client = tir.Teams()
    project_client = tir.Projects()

    # get team id
    teams = teams_client.get_teams()

    # get project id
    projects = project_client.get_projects()

    # initiate the model repository client
    # here we assume that there is only one team and a single project; if you have multiple projects, choose the correct project id from projects
    model_repo_client = tir.Models(teams[0].team_id, projects[0].project_id)

    # create model
    model_repo = model_repo_client.create_model(name='test-triton', model_type='triton')

    # upload from model-dir; a new model repository will be created
    model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)

4. Create a model endpoint to access our model over REST/gRPC:

A model endpoint serves your model over REST or gRPC. TIR automatically provisions an HTTP server (handler) for your model.

TIR Dashboard

Use the TIR Dashboard to create the endpoint.

Go to Model Endpoints in Inference.
Click Create Endpoint.
Select Nvidia Triton as the framework and click Continue.
Select one of the available GPU plans. You may choose a CPU plan as well for this exercise.
Click Next.
Change the endpoint name if desired. Click Next.
Select the model repository we created in step 3.
Click Finish to launch a model endpoint server.
Wait for the endpoint to reach the Ready state before moving further.

Using SDK

You can perform these steps from notebook cells in JupyterLab or from a Python shell.

    # create endpoint client
    endpoint_client = tir.EndPoints()

    # create an endpoint
    endpoint = endpoint_client.create(endpoint_name='test-triton-simple', framework='triton', plan='CPU-C3-4-8GB-0', model_id=model_repo.id)

You can monitor the endpoint creation from the TIR dashboard or use get() to check the status:

    endpoint_client.get(id=endpoint.id)
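Because the endpoint takes a while to become ready, it can help to poll rather than re-run the status call by hand. The helper below is a generic sketch, not part of the TIR SDK: `wait_until` is a hypothetical name, and how you read the state from the object returned by `endpoint_client.get()` is an assumption you need to verify by inspecting that object first.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=10.0, sleep=time.sleep):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example wiring (hypothetical: the "status" key is an assumption about the
# shape of the SDK response, so inspect endpoint_client.get(...) before using it):
# ready = wait_until(lambda: endpoint_client.get(id=endpoint.id)["status"] == "Ready")
```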

5. Use Triton Client to call the model endpoint

Triton Client is flexible and supports a number of client settings. Apart from support for several programming languages (C++, Java, and Python), you can also use features such as async I/O, streaming, and decoupled connections.

For the sake of simplicity, we will use a Python client that calls the simple model endpoint synchronously.

The Triton client repository has examples (https://github.com/triton-inference-server/client/tree/main/src/python/examples) of more complex use cases such as async and streaming.

TIR Notebook

Install the triton client

   !pip install tritonclient[http]

Set the auth header for accessing the TIR API. We will need this dict in every call we make to the Triton endpoint.

    headers = {'Authorization': 'Bearer eyJhbfd3zI1......'}

    # On a TIR notebook you may also read the token from an environment variable. Confirm that E2E_TIR_ACCESS_TOKEN is set correctly in your notebook.
    # import os
    # headers = {'Authorization': 'Bearer {}'.format(os.environ['E2E_TIR_ACCESS_TOKEN'])}
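If you prefer the environment-variable route, a small wrapper can fail early with a readable message when the token is missing. This is a minimal sketch: `auth_headers` is a hypothetical helper, and it only assumes the `E2E_TIR_ACCESS_TOKEN` variable mentioned above.

```python
import os


def auth_headers(env_var="E2E_TIR_ACCESS_TOKEN"):
    """Build the Authorization header from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before calling the endpoint")
    return {"Authorization": f"Bearer {token}"}


# Usage: headers = auth_headers()
```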

Set endpoint URL:

You can get the endpoint URL through the SDK (step 4 above) or from the TIR Dashboard by locating the target model's endpoint.

   endpoint_url="infer.e2enetworks.net/project/<enter your projectid>/endpoint/is-<enter inference service id>/"
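The Python HTTP client expects the URL without a scheme, so a value copied from a browser with `https://` in front needs trimming. The sketch below assembles the URL from its parts following the pattern shown above. `build_endpoint_url` is a hypothetical helper and the ids in the example call are placeholders.

```python
def build_endpoint_url(project_id, inference_id, host="infer.e2enetworks.net"):
    """Return the endpoint URL without a scheme, as tritonclient.http expects."""
    # Drop a leading "https://" or "http://" and any surrounding slashes.
    host = host.split("://", 1)[-1].strip("/")
    return f"{host}/project/{project_id}/endpoint/is-{inference_id}/"


# Placeholder ids; replace them with your own project and inference service ids.
print(build_endpoint_url(123, 456))
```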

Run the following sample code from a notebook cell:

    import tritonclient.http as httpclient
    from tritonclient.http import InferenceServerClient
    import gevent.ssl as ssl
    import numpy as np

    # connect to the endpoint over TLS
    triton_client = InferenceServerClient(
        url=endpoint_url,
        verbose=False,
        ssl=True,
        ssl_options={},
        insecure=False,
        ssl_context_factory=ssl.create_default_context)

    model_name = "simple"

    # Create the data for the two input tensors. Initialize the first
    # to unique integers and the second to all -1.
    input0_data = np.arange(start=0, stop=16, dtype=np.int32)
    input0_data = np.expand_dims(input0_data, axis=0)
    input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)

    # declare the model inputs in Triton format
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
    inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))

    # attach the data to the inputs
    inputs[0].set_data_from_numpy(input0_data, binary_data=False)
    inputs[1].set_data_from_numpy(input1_data, binary_data=True)

    # declare the requested model outputs in Triton format
    outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
    outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False))

    query_params = {"test_1": 1, "test_2": 2}

    # send the request; headers carries the auth token set earlier
    results = triton_client.infer(
        model_name,
        inputs,
        outputs=outputs,
        query_params=query_params,
        headers=headers,
    )
    print(results)
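To check the response, it helps to know what to expect. Assuming the sample is the stock Triton `simple` model, OUTPUT0 is the element-wise sum of the two inputs and OUTPUT1 is their difference. The sketch below computes those values in plain Python for the inputs used above. `expected_simple_outputs` is a hypothetical helper name.

```python
def expected_simple_outputs(input0, input1):
    """Element-wise sum and difference of two equally long integer lists."""
    output0 = [a + b for a, b in zip(input0, input1)]
    output1 = [a - b for a, b in zip(input0, input1)]
    return output0, output1


# Same values as the request above: 0..15 and sixteen copies of -1.
sum_row, diff_row = expected_simple_outputs(list(range(16)), [-1] * 16)
print(sum_row)
print(diff_row)
```

You can compare these rows against the server's answer with `results.as_numpy("OUTPUT0")[0].tolist()` and `results.as_numpy("OUTPUT1")[0].tolist()`.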

Troubleshooting and Operations Guide

Model Updates

To deploy a new model version or model configuration, follow these two steps:

1. Push updated model files and configuration to the Model Repository.
2. Restart the model endpoint service.

When you restart the service, TIR will stop the existing container and start a new one. The new container will download the most recent files from the repository.
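When the update is a new model version rather than a config change, Triton's default version policy serves the highest-numbered version folder. The sketch below works out the next folder name locally before you push. It assumes numeric version directories as in the tutorial layout, and `next_version_dir` is a hypothetical helper.

```python
from pathlib import Path


def next_version_dir(model_path):
    """Return the path of the next numeric version directory for a model."""
    model_path = Path(model_path)
    existing = []
    if model_path.is_dir():
        # Only plain-integer folder names count as versions.
        existing = [int(p.name) for p in model_path.iterdir()
                    if p.is_dir() and p.name.isdigit()]
    return model_path / str(max(existing, default=0) + 1)


# Place the updated model file in the returned folder, then push model-dir again.
print(next_version_dir("model-dir/model_repository/simple"))
```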

Metrics and Logging

You can use the dashboard to keep track of resource metrics as well as service level metrics like QPS, P99, P50, etc. The dashboard also reports detailed logs streamed from the running replicas.

Autoscaling

You can configure autoscaling to dynamically launch new replicas when the load is high. Please note that the scaling operation will depend on the availability of resources.

Multi-Model Support

Triton allows multiple models to share a GPU, and TIR supports multi-model configuration. However, Triton's explicit model control mode, where you load or unload only selected models, is not supported yet. If this feature is important to you, please raise a support ticket.

Frequently Asked Questions

Can I use Triton Server to deploy Large Language Models (LLMs)?

Yes. We recommend TensorRT-LLM or vLLM server.

Can Triton server handle streaming or batching requests?

Yes. The Triton client repository has several examples.
