# Triton Inference

Triton is open-source inference serving software from NVIDIA, designed for high-performance, scalable deployments. It supports multiple frameworks and provides HTTP and gRPC endpoints, dynamic batching, and asynchronous request handling.

If you need to serve multiple models efficiently, Triton combined with GPU infrastructure provides a robust, scalable solution.

---

## Utilization

Triton supports both **GPU** and **CPU** deployments. It optimizes hardware utilization with features like **dynamic batching** and **concurrent model execution**, allowing multiple models to share a single GPU efficiently.
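Both features are enabled per model in `config.pbtxt`. The sketch below renders the two relevant stanzas, `dynamic_batching` and `instance_group`, from Python. The helper name `render_batching_config` and its default values are illustrative assumptions, and the output is only a fragment: it would need to be merged into a complete configuration that also declares the model's platform, inputs, and outputs.

```python
def render_batching_config(max_batch_size=8, instance_count=2,
                           kind="KIND_GPU", max_queue_delay_us=100):
    """Return a config.pbtxt fragment for dynamic batching and multiple instances.

    All default values here are illustrative, not tuned recommendations.
    """
    return (
        f"max_batch_size: {max_batch_size}\n"
        # Let the server wait briefly so that requests can be grouped into a batch.
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        # Run several copies of the model so requests can execute concurrently.
        "instance_group [\n"
        "  {\n"
        f"    count: {instance_count}\n"
        f"    kind: {kind}\n"
        "  }\n"
        "]\n"
    )


print(render_batching_config())
```

Use `kind="KIND_CPU"` for CPU-only plans.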

---

## Scalability

Triton endpoints support **auto-scaling** of replicas and **multi-model deployment** across available resources, both managed through the dashboard.

---

## Application experience

You get out-of-the-box support for:

* HTTP/REST and gRPC APIs
* Real-time and batch inference
* Streaming and async requests
* Hot model updates without downtime

---

## Quick start tutorial

### Step 1: Create a directory for the model repository

#### Instance (Node)

Start a new notebook in **Jupyter Labs** and run the command below:

```bash
!mkdir -p model-dir
```

You can also run the same command from the notebook's terminal; omit the leading `!` there.

#### Shell (Local)

Run the following command from your local shell:

```bash
mkdir -p model-dir && cd model-dir
```

---

### Step 2: Download the sample model repository

#### Wget (Instance)

```bash
!cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
```

#### Wget (Shell)

```bash
cd model-dir && mkdir -p model_repository/simple/1 && wget -O ./model_repository/simple/1/model.graphdef https://objectstore.e2enetworks.net/iris/model_repository/simple/1/model.graphdef && wget -O ./model_repository/simple/config.pbtxt https://objectstore.e2enetworks.net/iris/model_repository/simple/config.pbtxt
```

#### GitHub (Instance)

Alternatively, you can download sample models from the Triton GitHub repository:

```bash
!cd model-dir && git clone -b r23.10 https://github.com/triton-inference-server/server.git && cd server/docs/examples && ./fetch_models.sh && cp -r model_repository ../../../
```

Ensure that the `model_repository` folder is located under the `model-dir` directory created in Step 1.
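A quick layout check before uploading can catch a misplaced folder early. The following is a minimal sketch, assuming every model ships a `config.pbtxt` and at least one numeric version directory, as the sample `simple` model does. The function name `check_model_repository` is hypothetical and not part of any SDK.

```python
from pathlib import Path


def check_model_repository(repo_root):
    """Return a list of layout problems found under a model repository."""
    root = Path(repo_root)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    model_dirs = [p for p in sorted(root.iterdir()) if p.is_dir()]
    if not model_dirs:
        problems.append(f"{root} contains no model directories")

    for model_dir in model_dirs:
        # Each model directory is expected to hold its configuration file...
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # ...and at least one numbered version folder such as "1".
        versions = [p for p in model_dir.iterdir()
                    if p.is_dir() and p.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems


if __name__ == "__main__":
    for problem in check_model_repository("model-dir/model_repository"):
        print(problem)
```

An empty result means the repository matches the expected `model_repository/<model>/<version>/` layout.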

---

### Step 3: Upload sample models from local directory to a repository

#### Instance (Node)

Install the SDK and push the model repository.

```bash
pip install e2enetworks
```

Import the SDK and upload the model:

```python
from e2enetworks.cloud import tir
model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
```

#### Local

Install the SDK:

```bash
pip install e2enetworks
```

Then, in a Python shell:

```python
from e2enetworks.cloud import tir
tir.init(api_key="<your-api-key>", access_token="<your-access-token>")

model_repo_client = tir.Models()
model_repo = model_repo_client.create(name='test-triton', model_type='triton')
model_repo_client.push_model(model_path='./model-dir', prefix='', model_id=model_repo.id)
```

---

### Step 4: Create a model endpoint to access the model over REST/gRPC

#### Dashboard

Use the Dashboard to create the endpoint:

1. Go to **Model Endpoints** in **Inference**.
2. Click **Create Endpoint**.
3. Choose **Nvidia Triton** as the framework and click **Continue**.
4. Select a GPU or CPU plan.
5. Enter a name for your endpoint (optional).
6. Select the model repository you uploaded in Step 3.
7. Click **Finish** to launch the model endpoint server.
8. Wait for the endpoint to reach **Ready** state.

#### Using SDK

You can also perform these steps with the Python SDK in a Jupyter notebook or shell:

```python
endpoint_client = tir.EndPoints()
endpoint = endpoint_client.create(endpoint_name='test-triton-simple', framework='triton', plan='CPU-C3-4-8GB-0', model_id=model_repo.id)
endpoint_client.get(id=endpoint.id)
```
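Because the endpoint takes some time to become ready, a script may need to wait before sending requests. The helper below is a generic polling sketch: `get_state` is a hypothetical callable that you would implement to return the endpoint's current state as a string. How to read that state from the SDK response is not covered here, so nothing in this block calls the SDK.

```python
import time


def wait_for_state(get_state, target="Ready", timeout_s=600,
                   interval_s=10, sleep=time.sleep):
    """Poll get_state() until it returns `target`.

    Raises TimeoutError if the target state is not reached within
    roughly `timeout_s` seconds of waiting.
    """
    waited = 0
    while True:
        state = get_state()
        if state == target:
            return state
        if waited >= timeout_s:
            raise TimeoutError(
                f"endpoint still in state {state!r} after {waited}s"
            )
        sleep(interval_s)
        waited += interval_s
```

The `sleep` parameter exists only so the delay can be replaced in tests.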

---

### Step 5: Use Triton client to call the model endpoint

Install the Triton client:

```bash
pip install "tritonclient[http]"
```

Set the authentication header and endpoint URL:

```python
headers = {'Authorization': 'Bearer <your-auth-token>'}
endpoint_url = 'infer.e2enetworks.net/project/<project-id>/endpoint/<endpoint-id>/'
```
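To avoid hard-coding credentials, the same two values can be assembled from environment variables. This is a sketch only: the variable names `TIR_AUTH_TOKEN`, `TIR_PROJECT_ID`, and `TIR_ENDPOINT_ID` and both helper functions are assumptions made for illustration. Note that the URL carries no `https://` prefix, because the Triton HTTP client expects a host and path without a scheme.

```python
import os


def build_endpoint_url(project_id, endpoint_id, host="infer.e2enetworks.net"):
    """Return the scheme-less endpoint URL in the format shown above."""
    return f"{host}/project/{project_id}/endpoint/{endpoint_id}/"


def build_headers(token):
    """Return the bearer-token header sent with every request."""
    return {"Authorization": f"Bearer {token}"}


# Hypothetical environment variable names; the placeholders are used if unset.
headers = build_headers(os.environ.get("TIR_AUTH_TOKEN", "<your-auth-token>"))
endpoint_url = build_endpoint_url(
    os.environ.get("TIR_PROJECT_ID", "<project-id>"),
    os.environ.get("TIR_ENDPOINT_ID", "<endpoint-id>"),
)
```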

Run the following code to perform inference:

<details>
<summary>Click to expand code</summary>

```python
import gevent.ssl as ssl  # tritonclient's HTTP client is built on gevent
import numpy as np
import tritonclient.http as httpclient

# Connect over HTTPS; `endpoint_url` must not include the scheme
triton_client = httpclient.InferenceServerClient(
    url=endpoint_url,
    verbose=False,
    ssl=True,
    ssl_options={},
    insecure=False,
    ssl_context_factory=ssl.create_default_context,
)

model_name = "simple"
input0_data = np.arange(start=0, stop=16, dtype=np.int32)
input0_data = np.expand_dims(input0_data, axis=0)
input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32)

inputs = []
outputs = []
inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))
inputs[0].set_data_from_numpy(input0_data, binary_data=False)
inputs[1].set_data_from_numpy(input1_data, binary_data=True)

outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False))

query_params = {"test_1": 1, "test_2": 2}

results = triton_client.infer(
    model_name,
    inputs,
    outputs=outputs,
    query_params=query_params,
    headers=headers,
)
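# Print the decoded output tensors rather than the raw result object
print(results.as_numpy("OUTPUT0"))
print(results.as_numpy("OUTPUT1"))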
```

</details>
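The sample `simple` model returns the element-wise sum of its two inputs in `OUTPUT0` and their difference in `OUTPUT1`. The sketch below computes the values to expect for the inputs used above, with plain Python lists so it runs without a server; the function name `expected_simple_outputs` is illustrative.

```python
def expected_simple_outputs(input0, input1):
    """Return (sum, difference) lists, mirroring the sample "simple" model."""
    output0 = [a + b for a, b in zip(input0, input1)]
    output1 = [a - b for a, b in zip(input0, input1)]
    return output0, output1


# Same values as the request above: 0..15 and a row of -1.
input0 = list(range(16))
input1 = [-1] * 16

sum_out, diff_out = expected_simple_outputs(input0, input1)
print(sum_out)   # -1 through 14
print(diff_out)  # 1 through 16
```

To check a live response, compare these lists with `results.as_numpy("OUTPUT0")[0].tolist()` and `results.as_numpy("OUTPUT1")[0].tolist()`.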

---

## Troubleshooting and operations guide

### Model updates

1. Push updated model and configuration files to the repository.
2. Restart the model endpoint.

The endpoint loads the updated model when its container restarts.
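Triton model repositories keep each model version in a numbered subdirectory, and by default Triton serves the highest-numbered version. One way to stage an update locally is to place the new files in the next version folder before pushing. The helper below is a sketch under that assumption; `next_version_dir` is a hypothetical name, and overwriting the files in the existing version folder is an equally valid approach.

```python
from pathlib import Path


def next_version_dir(model_dir):
    """Return the path of the next numeric version folder for a model."""
    model_dir = Path(model_dir)
    versions = [int(p.name) for p in model_dir.iterdir()
                if p.is_dir() and p.name.isdigit()]
    # Start at 1 when the model has no version folders yet.
    return model_dir / str(max(versions, default=0) + 1)
```

For the sample repository, `next_version_dir("model-dir/model_repository/simple")` points at `.../simple/2` once version `1` exists.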

### Metrics and logging

Monitor resource usage (CPU/GPU, memory) and inference metrics like **QPS**, **P99 latency**, etc., through the dashboard.
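For a client-side cross-check of the dashboard figures, the same two metrics can be computed from your own request timings. This is a minimal sketch using the nearest-rank percentile definition; the dashboard may use a different method, so small differences are expected.

```python
def p99_latency(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed with integers to avoid floating-point rounding
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def qps(request_count, elapsed_s):
    """Average queries per second over a measurement window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return request_count / elapsed_s
```

Latencies can be collected by wrapping each `triton_client.infer(...)` call with `time.perf_counter()`.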

### Autoscaling

Configure autoscaling to scale replicas dynamically based on demand. Scaling depends on available resources.

### Multi-model support

Triton supports running multiple models on a single GPU. Explicit per-endpoint model loading and unloading is not yet supported.

---

## Frequently asked questions

### Can I use Triton to deploy LLMs?

Yes. Use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) or [vLLM server](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton).

### Does Triton support batching and streaming?

Yes. Refer to [Triton Client Examples](https://github.com/triton-inference-server/client/tree/main/src/python/examples) for advanced use cases like async, streaming, and decoupled inference.


---