Deploy Inference for Gemma
Deploy a model endpoint for Gemma, which is available in four variants: 2B, 2B-it, 7B, and 7B-it. This guide uses the 2B-it variant; the same steps apply to the other variants.
Overview
- Model endpoint creation using prebuilt Gemma 2B-it container
- Reference for supported generation parameters
Step 1: Create a Model Endpoint
- Go to the AI Platform and select your project
- Navigate to Model Endpoints
- Click Create Endpoint
- Select the Gemma 2B-it model card
- Choose your GPU plan and set desired replicas
Environment Variables
Required
Gemma is a gated model; its checkpoint is downloaded from Kaggle.
To gain access:
- Visit the Gemma model page on Kaggle
- Request access and wait for approval
- Generate a Kaggle API token (Account Settings → API)
- Configure the following environment variables:

| Variable | Purpose |
|---|---|
| KAGGLE_KEY | Your Kaggle API token key |
| KAGGLE_USERNAME | Your Kaggle username |
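If you also set these variables from a shell (for example in a CI pipeline or a local test), the configuration might look like the sketch below; the values are placeholders to substitute with your own Kaggle credentials.

```shell
# Placeholder credentials -- replace with your own Kaggle values.
export KAGGLE_USERNAME="your-kaggle-username"
export KAGGLE_KEY="your-kaggle-api-key"

# Confirm both variables are visible to the deployment process
echo "Kaggle user: ${KAGGLE_USERNAME}"
```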
Advanced (Optional)
Used by TensorRT-LLM. Modify only if needed.
| Variable | Purpose |
|---|---|
| MAX_BATCH_SIZE | Maximum concurrent input sequences processed per batch |
| MAX_INPUT_LEN | Maximum input sequence length in tokens |
| MAX_OUTPUT_LEN | Maximum output sequence length in tokens |
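As a sketch, overriding these variables might look like the following. The values are illustrative only; the prebuilt container ships its own defaults, and sensible limits depend on your GPU plan and the Gemma variant.

```shell
# Illustrative values only -- not recommendations. Raise or lower
# these to match your GPU memory budget and workload.
export MAX_BATCH_SIZE=8      # concurrent sequences per batch
export MAX_INPUT_LEN=2048    # prompt length cap, in tokens
export MAX_OUTPUT_LEN=1024   # generation length cap, in tokens
```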
After configuration, complete endpoint creation and monitor logs until deployment finishes.
Step 2: Generate Your API Token
To make API requests:
- Go to API Tokens in your project
- Create or reuse an existing token
- Copy the Auth Token
Step 3: Make Inference Requests
Use the Sample API Request section on the endpoint details page.
Example cURL Request
```bash
curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "text_input": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```
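A successful request returns a JSON body. Assuming the backend returns the generated text in a text_output field (an assumption, not confirmed by this guide), it can be extracted like so:

```shell
# Hypothetical response for illustration -- exact field names depend
# on the serving backend; "text_output" is an assumption here.
RESPONSE='{"text_output": "AI is the simulation of human intelligence by machines."}'

# Extract the generated text (python3 used in place of jq)
TEXT=$(echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["text_output"])')
echo "$TEXT"
```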
Supported Parameters
| Parameter | Description | Shape | Data Type |
|---|---|---|---|
| text_input | Input text prompt for generation | [-1] | TYPE_STRING |
| max_tokens | Maximum tokens to generate | [-1] | TYPE_INT32 |
| bad_words | Words/phrases to exclude from output | [-1] | TYPE_STRING |
| stop_words | Stop words that end generation | [-1] | TYPE_STRING |
| end_id | Token marking sequence end | [1] | TYPE_INT32 |
| pad_id | Token used for padding | [1] | TYPE_INT32 |
| top_k | Number of highest-probability tokens to consider | [1] | TYPE_INT32 |
| top_p | Nucleus sampling probability threshold | [1] | TYPE_FP32 |
| temperature | Sampling temperature; higher values increase randomness | [1] | TYPE_FP32 |
| length_penalty | Penalty applied to output length | [1] | TYPE_FP32 |
| repetition_penalty | Penalty for repeated sequences | [1] | TYPE_FP32 |
| min_length | Minimum output tokens | [1] | TYPE_INT32 |
| presence_penalty | Penalizes token presence | [1] | TYPE_FP32 |
| frequency_penalty | Penalizes token frequency | [1] | TYPE_FP32 |
| random_seed | Random seed for sampling | [1] | TYPE_UINT64 |
| return_log_probs | Include token log probabilities | [1] | TYPE_BOOL |
| return_context_logits | Include logits for context tokens | [1] | TYPE_BOOL |
| return_generation_logits | Include logits for generated tokens | [1] | TYPE_BOOL |
| prompt_embedding_table | Prompt embedding table | [-1, -1] | TYPE_FP16 |
| prompt_vocab_size | Prompt vocabulary size | [1] | TYPE_INT32 |
| embedding_bias_words | Words for embedding bias | [-1] | TYPE_STRING |
| embedding_bias_weights | Weights for embedding bias | [-1] | TYPE_FP32 |
| cum_log_probs | Cumulative log probabilities | [-1] | TYPE_FP32 |
| output_log_probs | Log probabilities per token | [-1, -1] | TYPE_FP32 |
| context_logits | Logits for context tokens | [-1, -1] | TYPE_FP32 |
| generation_logits | Logits for generated tokens | [-1, -1, -1] | TYPE_FP32 |
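Several of these parameters can be combined in one request body. The sketch below assembles and validates such a body before sending it with curl -d "$BODY"; the values are illustrative, not recommendations, and the endpoint URL and token remain placeholders as in the earlier example.

```shell
# Illustrative request body exercising optional sampling parameters
# from the table above. Values are examples, not recommendations.
BODY='{
  "text_input": "List three uses of machine learning.",
  "max_tokens": 128,
  "temperature": 0.5,
  "top_k": 40,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "random_seed": 42,
  "return_log_probs": true
}'

# Validate the JSON locally before sending it to the endpoint
echo "$BODY" | python3 -m json.tool > /dev/null && echo "valid JSON"
```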