# Quick Start Guide

> Create a Model Endpoint and send your first inference request. This guide is for developers and ML engineers using the **TIR Dashboard** and works for both first-time setup and team onboarding.

---

## What you need before you start

| Requirement | Details |
|-------------|------------------|
| **Account** | Active account with access to **Inference** and at least one **project** and **region**. |
| **Model source** | Either a **Model Repository** linked to your project (object storage) or a **Hugging Face** model. For Hugging Face, you will need a **Hugging Face access token** (see below). |
| **Framework** | Know which **framework/runtime** your model needs (e.g. vLLM, Triton, PyTorch). Use the search bar in the UI to filter supported options. |
| **Budget** | Decide whether you want **Hourly** (can scale to 0 when idle) or **Committed** (always-on, fixed term) billing. |

---

## About your Hugging Face token

If you choose **Download from Hugging Face** when creating your endpoint, the platform needs a **Hugging Face (HF) access token** to pull the model. Here is what you need to know.

| Topic | Details |
|-------|----------------------|
| **What it is** | A personal or project token from [Hugging Face](https://huggingface.co) that allows the platform to download the model (and, for gated models, to verify your agreement to the model license). |
| **Where to create it** | On Hugging Face: go to **Settings → Access Tokens** ([https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)), then **Create new token**. Give it a name (e.g. "TIR Model Endpoints") and choose the right scope. |
| **Recommended scope** | For **public models**, **Read** is enough. **Private models** also need only **Read**, provided the account that owns the token can access them. For **gated models**, you need **Read** and you must **accept the model license** on the model page on Hugging Face before the platform can pull it. |
| **Where to enter it** | After creating your Hugging Face token, go to **External Integration** and create a new **Hugging Face integration** where you enter your token and save it. When you select **Download from Hugging Face** in the endpoint creation flow, the UI will prompt you to choose your **HF Token Integration**. The token is used only to download the model at deploy time (and for any re-pull if the endpoint is recreated). |
| **Gated models** | For models that require license acceptance (e.g. Llama, Mistral), **accept the license on the model's Hugging Face page** with the same HF account that owns the token, then use that token in the platform. Otherwise the download will fail. |
| **Troubleshooting** | If endpoint creation fails with a Hugging Face error, check: (1) token is valid and not revoked, (2) token has **Read** scope, (3) for gated models, license is accepted, and (4) model ID (e.g. `org/model-name`) is correct and accessible. |
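
The troubleshooting checks above can be partly run before you create the endpoint. The following is a minimal sketch, not part of the TIR platform: it asks the public Hugging Face Hub metadata endpoint (`https://huggingface.co/api/models/<model-id>`) whether a model ID is visible to a token. The function names are hypothetical, the exact HTTP status codes are an assumption, and a successful response does not prove that a gated model's license has been accepted; the `gated` field only tells you that acceptance is required.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Public Hugging Face Hub metadata endpoint (not part of TIR).
HF_MODEL_API = "https://huggingface.co/api/models/"


def build_model_info_request(model_id: str, hf_token: str) -> urllib.request.Request:
    """Build (without sending) an authenticated GET for a model's metadata."""
    model_id = model_id.strip().strip("/")
    if not model_id:
        raise ValueError("model_id must not be empty")
    url = HF_MODEL_API + urllib.parse.quote(model_id, safe="/")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {hf_token}"})


def check_model_access(model_id: str, hf_token: str) -> str:
    """Return a one-line diagnosis; this function performs a network call."""
    request = build_model_info_request(model_id, hf_token)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            info = json.load(response)
    except urllib.error.HTTPError as err:
        # Assumption: the Hub answers with a 4xx status when the token is
        # rejected or the model ID is wrong or not visible to this token.
        return f"HTTP {err.code}: check the token, its scope, and the model ID"
    if info.get("gated"):
        return "reachable, gated: confirm the license is accepted for this account"
    return "reachable with this token"
```

Calling `check_model_access("org/model-name", your_token)` from a local shell gives a quick signal on the token and model ID before you start a deployment.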

---

## Step 1: Create a Model Endpoint

1. In the **TIR Dashboard**, go to **Inference → Model Endpoints**.
2. Click **CREATE ENDPOINT**.
3. **Choose a framework/runtime and version**
   - Use the search bar to filter by name (e.g. vLLM, SGLang, Triton, PyTorch).
   - Pick the **version** that matches your model (e.g. a specific vLLM or Triton version). The UI shows the frameworks supported in your region.
4. **Choose how the model will be provided:**
   - **Link with Model Repository** — Attach a repository from your project (e.g. object storage or a pre-registered model). Select the repository and path. Use this when your model is already in your project storage.
   - **Download from Hugging Face** — Pull a model snapshot directly from Hugging Face. Enter the **Hugging Face model ID** (e.g. `meta-llama/Llama-2-7b-chat-hf`) and select your **HF Token Integration** when prompted. The platform will use the integration to download the model when the endpoint is deployed. See **About your Hugging Face token** above for token creation and integration setup.

   <br></br>

:::tip

- For a first endpoint, **Download from Hugging Face** is often fastest if your model is on HF: no need to upload to object storage. Have your HF token and model ID ready.
- If you use a **Model Repository**, ensure the model files are already uploaded and the path is correct; you can use the **Validate** option (if shown) before creating the endpoint.
- The framework you select must support your model type (e.g. LLM, embedding). Check the framework description in the UI or the product's list of supported frameworks.
:::
---

## Step 2: Pick a machine and plan

1. **Choose a machine type**
   - Select **CPU** or **GPU** (and a specific GPU type, e.g. A100, L4) that fits your **model size** and **latency** needs. Larger or heavier models typically need a larger GPU.
   - Availability and options depend on your **region**; switch the region in the dashboard if you do not see the instance you want.

2. **Pick a billing plan**
   - **Hourly billed** — You pay per hour of endpoint uptime per replica. You **can scale down to 0 workers** when the endpoint is idle, so you pay nothing when it is not in use. Best for dev, test, or variable traffic.
   - **Committed** — You reserve capacity for a fixed term (e.g. 1 or 3 months). Workers stay running (always-on availability); you **cannot stop** the endpoint during the commitment. Often a lower effective hourly rate. Best for production workloads with steady traffic.

<br></br>
:::tip
- For **evaluation or dev**, start with **Hourly** and **1 replica** so you can stop the endpoint when not in use.
- **Committed** is a commitment for the full term; choose it when you are sure you need always-on capacity.
:::
---

## Step 3: Configure scaling (optional)

1. **Set worker counts**
   - **Active Workers (minimum)** — Number of replicas that are always running (when the endpoint is running). Set to **0** only if the plan allows scale-to-zero (typically **Hourly**); otherwise set at least 1.
   - **Max Workers (maximum)** — Upper limit of replicas. Prevents runaway scale-up and caps cost.

2. **When Active ≠ Max**
   - **Autoscaling is enabled**: the platform will scale between the minimum and maximum based on the policy you choose.

3. **Choose a scaling policy**
   - **Concurrent Request Count** — Adjusts workers based on the total number of requests in the queue and those currently in progress. Good for workloads with variable request durations.
   - **Request Rate per Second** — Adjusts workers based on the number of incoming requests per second. Good for steady, high-throughput API workloads.
   - **Custom** — Adjusts workers based on a user-defined configuration. Note: the number of active workers must always be greater than zero to ensure the service remains available.
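
To build intuition for how the minimum, maximum, and policy interact, here is an illustrative sketch of a load-based scaling rule. It is not the platform's actual algorithm, which this guide does not document. `target_per_worker` is a hypothetical parameter: the load one worker is expected to handle, measured as concurrent requests (queued plus in progress) or as requests per second, depending on the policy.

```python
import math


def desired_workers(load: float, target_per_worker: float,
                    min_workers: int, max_workers: int) -> int:
    """Workers needed for the current load, clamped to [min_workers, max_workers]."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    if min_workers < 0 or min_workers > max_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    wanted = math.ceil(load / target_per_worker)
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    # 10 concurrent requests, 4 per worker, Active=1, Max=4
    print(desired_workers(10, 4, min_workers=1, max_workers=4))
    # Idle endpoint with scale-to-zero allowed (Active=0)
    print(desired_workers(0, 4, min_workers=0, max_workers=4))
```

The clamp is the part that maps directly to the settings above: **Active Workers** is the floor and **Max Workers** is the ceiling, whatever the policy computes in between.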


<br></br>
:::tip
- For a **first test**, you can leave **Active = Max = 1** (no autoscaling) to keep things simple.
- If you expect **variable traffic**, set a minimum (e.g. 1) and a higher max (e.g. 4), and pick a request-based or custom-metrics policy so the endpoint scales with load.
:::
---

## Step 4: Deploy and test

1. **Deploy**
   - Click **Create** (or **Deploy**) to start endpoint creation. The endpoint will appear in the list with status **Deploying** (or similar). The first deployment can take **5–15+ minutes** while the platform pulls the model (e.g. from Hugging Face or your repository) and starts the runtime.

2. **Wait for healthy status**
   - When the endpoint status is **Running** (or **Ready** / **Healthy**), it is ready to accept requests. The status is shown on the endpoint detail page. If the endpoint stays in **Deploying** well beyond the 5–15 minutes a first deployment typically takes, check **Logs** and **Deployment events** for errors (e.g. bad Hugging Face token, wrong model ID, or insufficient quota).

3. **Send a test request**
   - Open the **endpoint detail page** and copy the **Root Endpoint** URL (and **API key** or auth token if required).
   <br></br>
   > The **Root Endpoint** is the base URL of your deployed model and serves as the prefix for all routes. To call a specific route, append it to the Root Endpoint. For example:
   > <br></br>
   >  - Root Endpoint: `https://infer.e2enetworks.net/project/p-5520/endpoint/is-8528/`
   >  - To call `/v1/chat/completions`, use: `https://infer.e2enetworks.net/project/p-5520/endpoint/is-8528/v1/chat/completions`

   <br></br>
   - Use your preferred client to send a request:
     - **cURL** — Example below for an OpenAI-compatible endpoint.
     <br></br>
     ![cURL request](../images/cURL.png)
     <br></br>
     - **OpenAI SDK** — Point the client at your endpoint URL (e.g. set `base_url` to your endpoint) and use the same `model`, `messages`, and other parameters you use with OpenAI.
     <br></br>
     ![OpenAI SDK](../images/OpenAI_sdk.png)
     <br></br>
     - **Your app** — Use the endpoint URL and auth in your application code the same way you would for any REST inference API.

4. **Use the endpoint page for visibility**
   - **Logs** — Inference logs and (when available) per-replica logs to debug failures or slow responses.
   - **Deployment events** — When the endpoint was created, updated, or restarted.
   - **Hardware and service metrics** — CPU, GPU, memory, request rate, latency, errors.
   - **Request logs** — When enabled, see who called the endpoint and when (useful for compliance and usage analysis).

### Example: cURL (OpenAI-compatible endpoint)

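Replace `YOUR_ROOT_ENDPOINT` with the Root Endpoint from your endpoint detail page, without its trailing slash (e.g. `https://infer.e2enetworks.net/project/p-5520/endpoint/is-8528`), so the final URL does not contain `//`. Replace `YOUR_AUTH_TOKEN` with the API key or auth token shown on the same page: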

```bash
curl -X POST "YOUR_ROOT_ENDPOINT/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello, say something short."}]
  }'
```
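
The Root Endpoint shown on the detail page ends with `/`, so pasting it in front of `/v1/chat/completions` can produce a double slash. A small helper avoids this in application code. This is a sketch; `route_url` is a hypothetical name, not a platform function.

```python
def route_url(root_endpoint: str, route: str) -> str:
    """Join the Root Endpoint and a route with exactly one slash between them."""
    return root_endpoint.rstrip("/") + "/" + route.lstrip("/")


if __name__ == "__main__":
    root = "https://infer.e2enetworks.net/project/p-5520/endpoint/is-8528/"
    print(route_url(root, "/v1/chat/completions"))
```

Plain string handling is used here on purpose: `urllib.parse.urljoin` treats a route that starts with `/` as an absolute path and would drop the `/project/.../endpoint/...` prefix.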

### Example: OpenAI Python SDK

For OpenAI-compatible endpoints, you can use the official OpenAI client with a custom base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="YOUR_ROOT_ENDPOINT/v1",  # Root Endpoint (without trailing slash) + /v1
    api_key="YOUR_AUTH_TOKEN"
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
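
If you prefer not to add the OpenAI SDK to your application, the same request can be sent with the Python standard library. The sketch below assumes an OpenAI-compatible endpoint that accepts Bearer authentication and returns the `choices[0].message.content` shape used above; the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(root_endpoint: str, auth_token: str,
                       model: str, prompt: str) -> urllib.request.Request:
    """Build (without sending) a POST to the chat completions route."""
    url = root_endpoint.rstrip("/") + "/v1/chat/completions"
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(response_json: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat response."""
    return response_json["choices"][0]["message"]["content"]


def chat(root_endpoint: str, auth_token: str, model: str,
         prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the reply; this performs a network call."""
    request = build_chat_request(root_endpoint, auth_token, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_reply(json.load(response))
```

Separating request construction from sending keeps the URL and payload logic easy to unit test without a live endpoint.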


---
> **Async invocation:** For long-running or batch workloads, you can enable **async invocation** so requests are queued and processed in the background; when done, the response is stored at a destination you specify. See [Async Invocation](./Features.md#7-async-invocation) for setup and usage.

---

## Next steps after your first request

- **Integrate** the endpoint into your app (sync or async; for OpenAI-compatible endpoints, point your OpenAI client at the endpoint URL).
- **Adjust scaling** (min/max workers, scaling policy) if traffic grows or varies.
- **Enable request logs** and **monitoring** for production; set up **security groups** and **private IPs** if required.
- **Manage cost**: for **Hourly** endpoints, **stop** or scale to 0 when idle; for **Committed**, plan your term and use the billing dashboard to track usage.
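
When deciding between the two plans, a quick break-even calculation can help. The sketch below is plain arithmetic under stated assumptions: the rates are made-up numbers in arbitrary currency units, not TIR prices, and a Committed endpoint is billed for every hour of the term while an Hourly endpoint is billed only while replicas are running.

```python
def hourly_cost(rate_per_hour: float, active_hours: float, replicas: int = 1) -> float:
    """Hourly plan: pay only for the hours replicas are actually up."""
    return rate_per_hour * active_hours * replicas


def committed_cost(effective_rate_per_hour: float, term_hours: float,
                   replicas: int = 1) -> float:
    """Committed plan: pay for the whole term, used or not."""
    return effective_rate_per_hour * term_hours * replicas


def break_even_hours(hourly_rate: float, committed_rate: float,
                     term_hours: float) -> float:
    """Active hours per term above which Committed becomes cheaper."""
    return committed_rate * term_hours / hourly_rate


if __name__ == "__main__":
    # Hypothetical rates: 100/hour Hourly, 70/hour effective Committed, 720-hour term.
    print(break_even_hours(100, 70, 720))
    print(hourly_cost(100, active_hours=200))
    print(committed_cost(70, term_hours=720))
```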


---