# Triton Inference

Triton is open-source inference-serving software from NVIDIA, designed for high-performance, scalable deployments. It supports multiple frameworks and provides HTTP/gRPC communication, dynamic batching, and asynchronous request handling.

If you need to serve multiple models efficiently, Triton Inference combined with GPU infrastructure provides a robust and scalable solution.

---

## Utilization

Triton supports both **GPU** and **CPU** deployments. It optimizes hardware utilization with features like **dynamic batching** and **concurrent model execution**, allowing multiple models to share a single GPU efficiently.

---

## Scalability

Triton Inference enables **auto-scaling** of replicas and seamless **multi-model deployment** across available resources, all managed through the dashboard.

---

## Application experience

You get out-of-the-box support for:

* HTTP/REST and gRPC APIs
* Real-time and batch inference
* Streaming and async requests
* Hot model updates without downtime

---

## Quick start tutorial

### Step 1: Create a model directory to download the model repository

#### Instance (Node)

Start a new notebook in **Jupyter Labs** and run the command below:

```bash
!mkdir -p model-dir
```

You may also use the terminal in the notebook without the leading `!` to execute the same command.
#### Shell (Local)

Run the following command from your local shell:

```bash
mkdir -p model-dir
```

---

### Step 2: Download the sample model repository

#### Wget (Instance)

```bash
!cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
```

#### Wget (Shell)

```bash
cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
```

#### GitHub (Instance)

Alternatively, you can download sample models from the Triton GitHub repository:

```bash
!cd model-dir && git clone -b r23.10 https://github.com/triton-inference-server/server.git && ./server/docs/examples/fetch_models.sh
```

Ensure that the `model_repository` folder is located under the `model-dir` directory created in Step 1.

---

### Step 3: Upload sample models from local directory to a repository

#### Instance (Node)

Install the SDK and push the model repository.
```bash
pip install e2enetworks
```

Import the SDK and upload the model:

```python
from e2enetworks.cloud import tir

model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
```

#### Local

Install the SDK:

```bash
pip install e2enetworks
```

Then, in a Python shell:

```python
from e2enetworks.cloud import tir

tir.init(api_key="", access_token="")
model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
```

---

### Step 4: Create a model endpoint to access the model over REST/gRPC

#### Dashboard

Use the Dashboard to create the endpoint:

1. Go to **Model Endpoints** in **Inference**.
2. Click **Create Endpoint**.
3. Choose **Nvidia Triton** as the framework and click **Continue**.
4. Select a GPU or CPU plan.
5. Enter a name for your endpoint (optional).
6. Select the model repository you uploaded in Step 3.
7. Click **Finish** to launch the model endpoint server.
8. Wait for the endpoint to reach the **Ready** state.

#### Using SDK

You can also perform these steps using the Python SDK in a Jupyter notebook or shell:

```python
endpoint_client = tir.EndPoints()
endpoint = endpoint_client.create(
    endpoint_name='test-triton-simple',
    framework='triton',
    plan='CPU-C3-4-8GB-0',
    model_id=model_repo.id,
)
endpoint_client.get(id=endpoint.id)
```

---

### Step 5: Use Triton client to call the model endpoint

Install the Triton client:

```bash
pip install tritonclient[http]
```

Set the authentication header and endpoint URL:

```python
headers = {'Authorization': 'Bearer '}
endpoint_url = 'infer.e2enetworks.net/project//endpoint//'
```

Run the following code to perform inference:
```python
import tritonclient.http as httpclient
from tritonclient.http import InferenceServerClient
import gevent.ssl as ssl
import numpy as np

# `endpoint_url` and `headers` are the values set in the previous snippet.
triton_client = InferenceServerClient(
    url=endpoint_url,
    verbose=False,
    ssl=True,
    ssl_options={},
    insecure=False,
    ssl_context_factory=ssl.create_default_context,
)

model_name = "simple"

# Two INT32 input tensors of shape [1, 16]: 0..15 and all -1.
input0_data = np.arange(start=0, stop=16, dtype=np.int32)
input0_data = np.expand_dims(input0_data, axis=0)
input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)

inputs = []
outputs = []
inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))

# INPUT0 is sent as JSON, INPUT1 as binary data.
inputs[0].set_data_from_numpy(input0_data, binary_data=False)
inputs[1].set_data_from_numpy(input1_data, binary_data=True)

# Request both outputs, one in each encoding.
outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False))

query_params = {"test_1": 1, "test_2": 2}

results = triton_client.infer(
    model_name,
    inputs,
    outputs=outputs,
    query_params=query_params,
    headers=headers,
)
print(results)
```
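If installing `tritonclient` is not an option, the same request can be expressed as a plain JSON body in the KServe v2 inference protocol, which Triton's HTTP API implements. The sketch below only builds that body for the `simple` model; `build_infer_request` is a hypothetical helper name, and whether your endpoint exposes the standard `/v2/models/simple/infer` path under the endpoint URL is an assumption you should verify before sending anything.

```python
import json


def build_infer_request(input0, input1):
    """Build a KServe v2 JSON request body for the two-input `simple` model."""

    def tensor(name, values):
        # Each tensor carries its name, shape, datatype, and flattened data.
        return {
            "name": name,
            "shape": [1, len(values)],
            "datatype": "INT32",
            "data": list(values),
        }

    return {
        "inputs": [tensor("INPUT0", input0), tensor("INPUT1", input1)],
        "outputs": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}],
    }


# Same values as the tritonclient example: 0..15 and sixteen -1s.
body = build_infer_request(range(16), [-1] * 16)
payload = json.dumps(body)
```

The resulting `payload` string would be sent as the POST body together with the same `Authorization` header used by the client example.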
---

## Troubleshooting and operations guide

### Model updates

1. Push the updated model and configuration files to the repository.
2. Restart the model endpoint. The endpoint loads the new model automatically when the container restarts.

### Metrics and logging

Monitor resource usage (CPU/GPU, memory) and inference metrics such as **QPS** and **P99 latency** through the dashboard.

### Autoscaling

Configure autoscaling to scale replicas dynamically based on demand. Scaling depends on available resources.

### Multi-model support

Triton supports running multiple models on a single GPU. Explicit per-endpoint model loading/unloading is not yet supported.

---

## Frequently asked questions

### Can I use Triton to deploy LLMs?

Yes. Use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) or [vLLM server](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton).

### Does Triton support batching and streaming?

Yes. Refer to [Triton Client Examples](https://github.com/triton-inference-server/client/tree/main/src/python/examples) for advanced use cases like async, streaming, and decoupled inference.

---
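As a rough illustration of the async use case mentioned in the FAQ, the sketch below submits several requests before waiting on any of them. It relies on the `async_infer` method of `tritonclient.http.InferenceServerClient` and the `get_result()` method of the object it returns; `infer_many_async` is a hypothetical helper name, and it assumes each entry in `batches` is a list of `InferInput` objects prepared as in Step 5.

```python
def infer_many_async(client, model_name, batches, outputs=None, headers=None):
    """Submit every request up front, then collect results in submission order."""
    # async_infer returns immediately with a handle for the in-flight request.
    pending = [
        client.async_infer(model_name, inputs, outputs=outputs, headers=headers)
        for inputs in batches
    ]
    # get_result() blocks until the corresponding response has arrived.
    return [request.get_result() for request in pending]


# Usage with the client from Step 5 (not executed here):
#   results = infer_many_async(triton_client, "simple", [inputs, inputs], headers=headers)
```

To have requests actually run in parallel, the HTTP client generally needs to be created with a `concurrency` value greater than 1; the official client examples linked above cover this and the streaming variants in more detail.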