# Deploy Inference for Gemma
Deploy a model endpoint for **Gemma**. Four variants are supported: 2B, 2B-it, 7B, and 7B-it.
This guide uses the **2B-it** variant. The same steps work for the other variants.
## Overview
* Model endpoint creation using the prebuilt Gemma 2B-it container
* Reference for supported generation parameters
---
## Step 1: Create a Model Endpoint
1. Go to the AI Platform and select your project
2. Navigate to **Model Endpoints**
3. Click **Create Endpoint**
4. Select the **Gemma 2B-IT** model card
5. Choose your GPU plan and set desired replicas
### Environment Variables
#### Required
> Gemma is a *gated model*; its checkpoint is downloaded from **Kaggle**.
To gain access:
1. Visit the Gemma model page on Kaggle
2. Request access and wait for approval
3. Generate a Kaggle API token (Account Settings → API)
4. Set these environment variables on the endpoint:
* `KAGGLE_KEY`: Your API token key
* `KAGGLE_USERNAME`: Your Kaggle username
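If the Kaggle token was downloaded as a `kaggle.json` file, the two values can be read from it rather than copied by hand. The sketch below assumes the file holds top-level `username` and `key` string fields and sits at the conventional `~/.kaggle/kaggle.json` path. The function names `kaggle_field` and `print_kaggle_env` are illustrative, not part of any tool.

```bash
# Print the value of one top-level string field from a kaggle.json file.
# $1 = field name ("username" or "key"), $2 = path to the file.
kaggle_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$2"
}

# Print both values under the variable names the endpoint expects.
# The default path is an assumption; pass another path as $1 if needed.
print_kaggle_env() {
  local file="${1:-$HOME/.kaggle/kaggle.json}"
  printf 'KAGGLE_USERNAME=%s\n' "$(kaggle_field username "$file")"
  printf 'KAGGLE_KEY=%s\n' "$(kaggle_field key "$file")"
}
```

Running `print_kaggle_env` prints the two lines to paste into the endpoint's environment variable form. The `sed` pattern is a simple text match, not a JSON parser, so it is only suitable for a file of this simple shape.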
#### Advanced (Optional)
These variables are used by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Change them only if the defaults do not suit your workload.
| Variable | Purpose |
| ---------------- | ------------------------------------------------------ |
| `MAX_BATCH_SIZE` | Maximum number of input sequences processed in one batch |
| `MAX_INPUT_LEN` | Maximum input sequence length, in tokens |
| `MAX_OUTPUT_LEN` | Maximum output sequence length, in tokens |
> After setting the variables, finish creating the endpoint and monitor the logs until deployment completes.
---
## Step 2: Generate Your API Token
To make API requests:
1. Go to **API Tokens** in your project
2. Create a new token or reuse an existing one
3. Copy the **Auth Token**
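One way to keep the copied token out of your shell history is to save it to a private file and load it into an environment variable. This is a minimal sketch: the variable name `AUTH_TOKEN`, the function name `load_auth_token`, and the file location are all assumptions, not platform requirements.

```bash
# Load the auth token from a file into the AUTH_TOKEN environment variable.
# $1 = path to a file that contains only the token.
# Surrounding whitespace and the trailing newline are stripped.
load_auth_token() {
  AUTH_TOKEN="$(tr -d '[:space:]' < "$1")"
  export AUTH_TOKEN
}
```

For example, after saving the token to a file readable only by you (`chmod 600`), call `load_auth_token /path/to/token-file` once per shell session.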
---
## Step 3: Make Inference Requests
Use the **Sample API Request** section on the endpoint details page.
Example cURL request:
```bash
curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "text_input": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```
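For repeated calls, the request above can be wrapped in a small shell function so that only the prompt changes. The following is a sketch under stated assumptions: `ENDPOINT_URL` and `AUTH_TOKEN` are hypothetical environment variables you export yourself, and the helper names are illustrative. The sampling values are copied from the example request.

```bash
# Escape backslashes and double quotes so the prompt is safe inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body with the same fields as the cURL example.
# $1 = prompt, $2 = max_tokens (defaults to 100).
build_payload() {
  local prompt="$1" max_tokens="${2:-100}"
  printf '{"text_input": "%s", "max_tokens": %d, "temperature": 0.7, "top_p": 0.9}' \
    "$(json_escape "$prompt")" "$max_tokens"
}

# Send the request. ENDPOINT_URL and AUTH_TOKEN are assumed to be exported.
gemma_generate() {
  curl -sS -X POST "${ENDPOINT_URL}/v2/models/ensemble/generate" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -d "$(build_payload "$@")"
}
```

With both variables exported, `gemma_generate "What is artificial intelligence?" 50` sends the same kind of request as the example. Building the body in a separate function makes it possible to inspect the JSON before anything is sent.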
---
## Supported Parameters
| Parameter | Description | Shape | Data Type |
| -------------------------- | -------------------------------------- | ------------ | ----------- |
| `text_input` | Input text prompt for generation | [-1] | TYPE_STRING |
| `max_tokens` | Maximum tokens to generate | [-1] | TYPE_INT32 |
| `bad_words` | Words/phrases to exclude from output | [-1] | TYPE_STRING |
| `stop_words` | Words/phrases that stop generation when produced | [-1] | TYPE_STRING |
| `end_id` | Token marking sequence end | [1] | TYPE_INT32 |
| `pad_id` | Token used for padding | [1] | TYPE_INT32 |
| `top_k` | Number of highest-probability tokens to sample from | [1] | TYPE_INT32 |
| `top_p` | Nucleus sampling probability threshold | [1] | TYPE_FP32 |
| `temperature` | Controls randomness | [1] | TYPE_FP32 |
| `length_penalty` | Penalty applied to output length | [1] | TYPE_FP32 |
| `repetition_penalty` | Penalty for repeated sequences | [1] | TYPE_FP32 |
| `min_length` | Minimum output tokens | [1] | TYPE_INT32 |
| `presence_penalty` | Penalize token presence | [1] | TYPE_FP32 |
| `frequency_penalty` | Penalize token frequency | [1] | TYPE_FP32 |
| `random_seed` | Random seed | [1] | TYPE_UINT64 |
| `return_log_probs` | Include token log probabilities | [1] | TYPE_BOOL |
| `return_context_logits` | Include logits for context tokens | [1] | TYPE_BOOL |
| `return_generation_logits` | Include logits for generated tokens | [1] | TYPE_BOOL |
| `prompt_embedding_table` | Prompt embedding table | [-1, -1] | TYPE_FP16 |
| `prompt_vocab_size` | Vocabulary size of the prompt embedding table | [1] | TYPE_INT32 |
| `embedding_bias_words` | Words for embedding bias | [-1] | TYPE_STRING |
| `embedding_bias_weights` | Weights for embedding bias | [-1] | TYPE_FP32 |
| `cum_log_probs` | Output: cumulative log probabilities | [-1] | TYPE_FP32 |
| `output_log_probs` | Output: log probabilities per generated token | [-1, -1] | TYPE_FP32 |
| `context_logits` | Output: logits for context tokens | [-1, -1] | TYPE_FP32 |
| `generation_logits` | Output: logits for generated tokens | [-1, -1, -1] | TYPE_FP32 |
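As an illustration of combining parameters from the table, the sketch below builds a request body that sets `top_k` and `random_seed` alongside the prompt. The function name and the default values (`top_k` of 40, `max_tokens` of 64) are arbitrary choices for the example, not recommendations from the platform. Whether a fixed seed yields identical output across runs depends on the deployment and is worth verifying.

```bash
# Build a request body that fixes the sampling seed and top_k.
# $1 = prompt, $2 = random seed, $3 = top_k (defaults to 40, an arbitrary value).
# Assumes the prompt contains no double quotes, backslashes, or newlines.
build_seeded_payload() {
  local prompt="$1" seed="$2" top_k="${3:-40}"
  printf '{"text_input": "%s", "max_tokens": 64, "top_k": %d, "random_seed": %d}' \
    "$prompt" "$top_k" "$seed"
}
```

The resulting string can be passed to `curl` with `-d "$(build_seeded_payload 'Hello' 7)"` in place of the inline JSON body from the example request.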
---