# Deploy Inference for Gemma

Deploy a model endpoint for **Gemma**. Four variants are supported: 2B, 2B-it, 7B, and 7B-it.
This guide uses the **2B-it** variant; the same steps apply to the other variants.

## Overview

* Model endpoint creation using the prebuilt Gemma 2B-it container
* Reference for supported generation parameters

---

## Step 1: Create a Model Endpoint

1. Go to the AI Platform and select your project
2. Navigate to **Model Endpoints**
3. Click **Create Endpoint**
4. Select the **Gemma 2B-IT** model card
5. Choose your GPU plan and set desired replicas

### Environment Variables

#### Required

> Gemma is a *gated model*: its checkpoint is downloaded from **Kaggle**, so the endpoint needs Kaggle credentials.

To gain access:

1. Visit the Gemma model page on Kaggle
2. Request access and wait for approval
3. Generate a Kaggle API token (Account Settings → API)
4. Set the following environment variables on the endpoint:

   * `KAGGLE_KEY`: Your API token key
   * `KAGGLE_USERNAME`: Your Kaggle username
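
Kaggle typically delivers the API token as a `kaggle.json` file with `username` and `key` fields. As a minimal sketch under that assumption, the helper below maps those fields to the two environment variables above. The function name `kaggle_env_from_token` is hypothetical and not part of any platform SDK.

```python
import json


def kaggle_env_from_token(token_json: str) -> dict:
    """Map the contents of a kaggle.json file to the endpoint's environment variables."""
    token = json.loads(token_json)
    # Fail early if either field is absent or empty.
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")
    return {
        "KAGGLE_USERNAME": token["username"],
        "KAGGLE_KEY": token["key"],
    }


# Placeholder values for illustration only; use the contents of your own kaggle.json.
example = '{"username": "your-kaggle-username", "key": "your-kaggle-api-key"}'
for name, value in kaggle_env_from_token(example).items():
    print(f"{name}={value}")
```

Treat both values as secrets: paste them into the endpoint's environment variable form rather than committing them to source control.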

#### Advanced (Optional)

These variables are used by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Leave them at their defaults unless your workload needs different limits.

| Variable         | Purpose                                                |
| ---------------- | ------------------------------------------------------ |
| `MAX_BATCH_SIZE` | Maximum concurrent input sequences processed per batch |
| `MAX_INPUT_LEN`  | Maximum input sequence length in tokens                |
| `MAX_OUTPUT_LEN` | Maximum output sequence length in tokens               |
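
To illustrate how these limits relate to a single request, the sketch below checks a prompt's token count and the requested `max_tokens` against hypothetical values. The numbers in `LIMITS` and the `check_request_fits` helper are assumptions for illustration only; the actual defaults depend on your deployment, and this guide does not specify how the server handles an oversized request.

```python
# Hypothetical limits for illustration; substitute the values configured on your endpoint.
LIMITS = {"MAX_BATCH_SIZE": 8, "MAX_INPUT_LEN": 1024, "MAX_OUTPUT_LEN": 512}


def check_request_fits(prompt_tokens: int, max_tokens: int, limits: dict = LIMITS) -> list:
    """Return a list of human-readable problems; an empty list means the request fits."""
    problems = []
    if prompt_tokens > limits["MAX_INPUT_LEN"]:
        problems.append(
            f"prompt has {prompt_tokens} tokens, above MAX_INPUT_LEN={limits['MAX_INPUT_LEN']}"
        )
    if max_tokens > limits["MAX_OUTPUT_LEN"]:
        problems.append(
            f"max_tokens={max_tokens} is above MAX_OUTPUT_LEN={limits['MAX_OUTPUT_LEN']}"
        )
    return problems


print(check_request_fits(prompt_tokens=200, max_tokens=100))   # within both limits
print(check_request_fits(prompt_tokens=2000, max_tokens=100))  # prompt exceeds the input limit
```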

> After configuration, complete endpoint creation and monitor logs until deployment finishes.

---

## Step 2: Generate Your API Token

To make API requests:

1. Go to **API Tokens** in your project
2. Create a new token or reuse an existing one
3. Copy the **Auth Token**

---

## Step 3: Make Inference Requests

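Use the **Sample API Request** section on the endpoint details page as a starting point. In the example below, replace `your-endpoint-url` and `YOUR_AUTH_TOKEN` with your endpoint URL and the auth token from Step 2.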

<details>
<summary><b>Example cURL Request</b></summary>

```bash
curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "text_input": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```

</details>
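
For callers who prefer a script over cURL, the same request can be sketched in Python using only the standard library. The URL and token are the placeholders from the cURL example, and the helper names `build_generate_request` and `generate` are hypothetical. The shape of the response body is not documented in this guide, so the sketch returns the decoded JSON as-is.

```python
import json
import urllib.request

# Placeholders copied from the cURL example; replace with your endpoint URL and token.
ENDPOINT_URL = "https://your-endpoint-url/v2/models/ensemble/generate"
AUTH_TOKEN = "YOUR_AUTH_TOKEN"


def build_generate_request(prompt: str, max_tokens: int = 100, **sampling) -> urllib.request.Request:
    """Build (but do not send) a POST request with the same shape as the cURL example."""
    payload = {"text_input": prompt, "max_tokens": max_tokens, **sampling}
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {AUTH_TOKEN}",
        },
        method="POST",
    )


def generate(prompt: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response body."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request locally without touching the network.
request = build_generate_request("What is artificial intelligence?", temperature=0.7, top_p=0.9)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Once ENDPOINT_URL and AUTH_TOKEN point at a live endpoint:
# print(generate("What is artificial intelligence?", temperature=0.7, top_p=0.9))
```

Separating request construction from sending makes it possible to log or inspect the payload before any network call is made.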

---

## Supported Parameters

<details>
<summary><b>View Parameter Reference</b></summary>

| Parameter                  | Description                            | Shape        | Data Type   |
| -------------------------- | -------------------------------------- | ------------ | ----------- |
| `text_input`               | Input text prompt for generation       | [-1]         | TYPE_STRING |
| `max_tokens`               | Maximum tokens to generate             | [-1]         | TYPE_INT32  |
| `bad_words`                | Words/phrases to exclude from output   | [-1]         | TYPE_STRING |
| `stop_words`               | Words/phrases that end generation      | [-1]         | TYPE_STRING |
| `end_id`                   | Token marking sequence end             | [1]          | TYPE_INT32  |
| `pad_id`                   | Token used for padding                 | [1]          | TYPE_INT32  |
| `top_k`                    | Number of top tokens to sample from    | [1]          | TYPE_INT32  |
| `top_p`                    | Nucleus sampling probability threshold | [1]          | TYPE_FP32   |
| `temperature`              | Controls randomness                    | [1]          | TYPE_FP32   |
| `length_penalty`           | Penalty applied to output length       | [1]          | TYPE_FP32   |
| `repetition_penalty`       | Penalty for repeated sequences         | [1]          | TYPE_FP32   |
| `min_length`               | Minimum output tokens                  | [1]          | TYPE_INT32  |
| `presence_penalty`         | Penalize token presence                | [1]          | TYPE_FP32   |
| `frequency_penalty`        | Penalize token frequency               | [1]          | TYPE_FP32   |
| `random_seed`              | Random seed                            | [1]          | TYPE_UINT64 |
| `return_log_probs`         | Include token log probabilities        | [1]          | TYPE_BOOL   |
| `return_context_logits`    | Include logits for context tokens      | [1]          | TYPE_BOOL   |
| `return_generation_logits` | Include logits for generated tokens    | [1]          | TYPE_BOOL   |
| `prompt_embedding_table`   | Embedding table                        | [-1, -1]     | TYPE_FP16   |
| `prompt_vocab_size`        | Prompt vocab size                      | [1]          | TYPE_INT32  |
| `embedding_bias_words`     | Words for embedding bias               | [-1]         | TYPE_STRING |
| `embedding_bias_weights`   | Weights for embedding bias             | [-1]         | TYPE_FP32   |
| `cum_log_probs`            | Output: cumulative log probabilities   | [-1]         | TYPE_FP32   |
| `output_log_probs`         | Output: log probabilities per token    | [-1, -1]     | TYPE_FP32   |
| `context_logits`           | Output: logits for context tokens      | [-1, -1]     | TYPE_FP32   |
| `generation_logits`        | Output: logits for generated tokens    | [-1, -1, -1] | TYPE_FP32   |

</details>
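
As an illustration of how the reference maps onto a JSON request body, the sketch below type-checks a few request fields on the client side before sending. The `PARAM_TYPES` mapping covers only a subset of the table, and the `validate_payload` helper is hypothetical. The server performs its own validation, which may differ from this check.

```python
# Subset of the parameter reference, mapped to the Python types that serialize to them.
PARAM_TYPES = {
    "text_input": str,            # TYPE_STRING
    "max_tokens": int,            # TYPE_INT32
    "top_k": int,                 # TYPE_INT32
    "top_p": float,               # TYPE_FP32
    "temperature": float,         # TYPE_FP32
    "repetition_penalty": float,  # TYPE_FP32
    "random_seed": int,           # TYPE_UINT64
    "return_log_probs": bool,     # TYPE_BOOL
}


def validate_payload(payload: dict) -> list:
    """Return type problems found in the payload; an empty list means none were found."""
    problems = []
    for name, value in payload.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: not covered by this sketch")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for numeric fields.
        is_bool = isinstance(value, bool)
        if expected is bool:
            ok = is_bool
        elif expected is float:
            ok = isinstance(value, (int, float)) and not is_bool
        else:
            ok = isinstance(value, expected) and not is_bool
        if not ok:
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return problems


print(validate_payload({"text_input": "Hello", "max_tokens": 100, "top_p": 0.9}))
print(validate_payload({"text_input": "Hello", "max_tokens": "100"}))
```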


---