# Deploy Inference for Gemma

Deploy a model endpoint for **Gemma**, which is available in 4 variants: 2B, 2B-it, 7B, and 7B-it. This guide uses the **2B-it** variant. The same steps work for the other variants.

## Overview

* Model endpoint creation using the prebuilt Gemma 2B-it container
* Reference for supported generation parameters

---

## Step 1: Create a Model Endpoint

1. Go to the AI Platform and select your project
2. Navigate to **Model Endpoints**
3. Click **Create Endpoint**
4. Select the **Gemma 2B-IT** model card
5. Choose your GPU plan and set the desired number of replicas

### Environment Variables

#### Required

> Gemma is a *gated model*: the checkpoint is downloaded from **Kaggle**.

To gain access:

1. Visit the Gemma model page on Kaggle
2. Request access and wait for approval
3. Generate a Kaggle API token (Account Settings → API)
4. Configure the following environment variables:
   * `KAGGLE_KEY`: Your API token key
   * `KAGGLE_USERNAME`: Your Kaggle username

#### Advanced (Optional)

These variables are used by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Modify them only if needed.

| Variable         | Purpose                                                |
| ---------------- | ------------------------------------------------------ |
| `MAX_BATCH_SIZE` | Maximum concurrent input sequences processed per batch |
| `MAX_INPUT_LEN`  | Maximum input sequence length in tokens                |
| `MAX_OUTPUT_LEN` | Maximum output sequence length in tokens               |

> After configuration, complete endpoint creation and monitor the logs until deployment finishes.

---

## Step 2: Generate Your API Token

To make API requests:

1. Go to **API Tokens** in your project
2. Create a new token or reuse an existing one
3. Copy the **Auth Token**

---

## Step 3: Make Inference Requests

Use the **Sample API Request** section on the endpoint details page.
### Example cURL Request

```bash
curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "text_input": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```
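For repeated requests, it can help to keep the endpoint URL and Auth Token out of the command line. The following is a minimal sketch rather than an official client: the function names `build_payload` and `gemma_generate` and the environment variables `ENDPOINT_URL` and `AUTH_TOKEN` are hypothetical names chosen for illustration. It assumes the same `/v2/models/ensemble/generate` path as the cURL request in this step.

```bash
#!/usr/bin/env bash
# Sketch only: build_payload, gemma_generate, ENDPOINT_URL and AUTH_TOKEN
# are illustrative names, not part of the platform.

# Build the JSON request body from a prompt and an optional token limit.
# Assumption: the prompt contains no double quotes, backslashes, or newlines,
# because this sketch performs no JSON escaping.
build_payload() {
  local prompt="$1"
  local max_tokens="${2:-100}"
  printf '{"text_input": "%s", "max_tokens": %d}' "$prompt" "$max_tokens"
}

# Send the prompt to the endpoint and print the raw JSON response.
gemma_generate() {
  if [ -z "${ENDPOINT_URL:-}" ] || [ -z "${AUTH_TOKEN:-}" ]; then
    echo "Set ENDPOINT_URL and AUTH_TOKEN before calling gemma_generate" >&2
    return 1
  fi
  curl --fail --silent --show-error -X POST \
    "${ENDPOINT_URL}/v2/models/ensemble/generate" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -d "$(build_payload "$@")"
}
```

After exporting both variables, a call such as `gemma_generate "What is artificial intelligence?" 100` sends the request. The `--fail` flag makes cURL return a non-zero exit code on HTTP errors, which is convenient in scripts.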
---

## Supported Parameters
| Parameter                  | Description                                    | Shape        | Data Type   |
| -------------------------- | ---------------------------------------------- | ------------ | ----------- |
| `text_input`               | Input text prompt for generation               | [-1]         | TYPE_STRING |
| `max_tokens`               | Maximum tokens to generate                     | [-1]         | TYPE_INT32  |
| `bad_words`                | Words/phrases to exclude from output           | [-1]         | TYPE_STRING |
| `stop_words`               | Words/phrases that end generation when produced | [-1]        | TYPE_STRING |
| `end_id`                   | Token marking sequence end                     | [1]          | TYPE_INT32  |
| `pad_id`                   | Token used for padding                         | [1]          | TYPE_INT32  |
| `top_k`                    | Number of highest-probability tokens to consider | [1]        | TYPE_INT32  |
| `top_p`                    | Nucleus sampling probability threshold         | [1]          | TYPE_FP32   |
| `temperature`              | Controls randomness                            | [1]          | TYPE_FP32   |
| `length_penalty`           | Penalty applied to output length               | [1]          | TYPE_FP32   |
| `repetition_penalty`       | Penalty for repeated sequences                 | [1]          | TYPE_FP32   |
| `min_length`               | Minimum output tokens                          | [1]          | TYPE_INT32  |
| `presence_penalty`         | Penalize token presence                        | [1]          | TYPE_FP32   |
| `frequency_penalty`        | Penalize token frequency                       | [1]          | TYPE_FP32   |
| `random_seed`              | Random seed                                    | [1]          | TYPE_UINT64 |
| `return_log_probs`         | Include token log probabilities                | [1]          | TYPE_BOOL   |
| `return_context_logits`    | Include logits for context tokens              | [1]          | TYPE_BOOL   |
| `return_generation_logits` | Include logits for generated tokens            | [1]          | TYPE_BOOL   |
| `prompt_embedding_table`   | Embedding table                                | [-1, -1]     | TYPE_FP16   |
| `prompt_vocab_size`        | Prompt vocab size                              | [1]          | TYPE_INT32  |
| `embedding_bias_words`     | Words for embedding bias                       | [-1]         | TYPE_STRING |
| `embedding_bias_weights`   | Weights for embedding bias                     | [-1]         | TYPE_FP32   |
| `cum_log_probs`            | Cumulative log probabilities                   | [-1]         | TYPE_FP32   |
| `output_log_probs`         | Log probs per token                            | [-1, -1]     | TYPE_FP32   |
| `context_logits`           | Logits for context                             | [-1, -1]     | TYPE_FP32   |
| `generation_logits`        | Logits for generated tokens                    | [-1, -1, -1] | TYPE_FP32   |
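As an illustration of how several of these parameters might be combined, the sketch below writes a request body to a file so it can be reused and edited. The function name `write_request_file`, the default file name `request.json`, and the specific values are hypothetical. Whether your deployed endpoint accepts every field in this exact form is an assumption worth checking against the **Sample API Request** on the endpoint details page.

```bash
#!/usr/bin/env bash
# Sketch only: write_request_file and the values below are illustrative.

# Write a request body that sets several sampling parameters.
# Integer parameters (max_tokens, top_k, random_seed) are written as JSON
# integers; floating-point ones (temperature, top_p, repetition_penalty)
# as JSON decimals, matching the data types in the table.
write_request_file() {
  local path="${1:-request.json}"
  cat > "$path" <<'EOF'
{
  "text_input": "Explain nucleus sampling in one sentence.",
  "max_tokens": 64,
  "temperature": 0.7,
  "top_k": 40,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "random_seed": 42
}
EOF
}

# Usage (not run here): create the file, then pass it to cURL with -d @file:
#   write_request_file request.json
#   curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
#     -d @request.json
```

Keeping the body in a file avoids shell quoting problems with longer prompts, and fixing `random_seed` is a common way to make sampled output more repeatable between runs.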
---