Deploy Inference for Gemma

Deploy a model endpoint for Gemma, which is available in four variants: 2B, 2B-it, 7B, and 7B-it. This guide uses the 2B-it variant; the same steps apply to the other variants.

Overview

  • Model endpoint creation using prebuilt Gemma 2B-it container
  • Reference for supported generation parameters

Step 1: Create a Model Endpoint

  1. Go to the AI Platform and select your project
  2. Navigate to Model Endpoints
  3. Click Create Endpoint
  4. Select the Gemma 2B-IT model card
  5. Choose your GPU plan and set desired replicas

Environment Variables

Required

Gemma is a gated model: its checkpoint is downloaded from Kaggle, so you must request access before deploying.

To gain access:

  1. Visit the Gemma model page on Kaggle

  2. Request access and wait for approval

  3. Generate a Kaggle API token (Account Settings → API)

  4. Configure:

    • KAGGLE_KEY: Your API token key
    • KAGGLE_USERNAME: Your Kaggle username

Advanced (Optional)

These variables are used by TensorRT-LLM. Modify them only if needed.

| Variable | Purpose |
| --- | --- |
| MAX_BATCH_SIZE | Maximum concurrent input sequences processed per batch |
| MAX_INPUT_LEN | Maximum input sequence length in tokens |
| MAX_OUTPUT_LEN | Maximum output sequence length |
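As a sketch of how these variables are consumed, a serving process might read them from the environment and fall back to defaults. The fallback values below are hypothetical, not the prebuilt container's actual defaults.

```python
import os

# Hypothetical fallback values for illustration only; the prebuilt
# container's real defaults may differ.
_DEFAULTS = {"MAX_BATCH_SIZE": 8, "MAX_INPUT_LEN": 2048, "MAX_OUTPUT_LEN": 512}

def read_trtllm_limits(env=None):
    """Read TensorRT-LLM batching limits from the environment, with defaults."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in _DEFAULTS.items()}
```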

After configuration, complete endpoint creation and monitor logs until deployment finishes.


Step 2: Generate Your API Token

To make API requests:

  1. Go to API Tokens in your project
  2. Create a new token or reuse an existing one
  3. Copy the Auth Token

Step 3: Make Inference Requests

Use the Sample API Request section on the endpoint details page.

Example cURL Request

curl -X POST https://your-endpoint-url/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "text_input": "What is artificial intelligence?",
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9
  }'
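The same request can be issued from Python using only the standard library. The helper below is a sketch that builds the request shown above; substitute your real endpoint URL and auth token before sending.

```python
import json
import urllib.request

def build_generate_request(endpoint_url, auth_token, text_input, **params):
    """Build a POST request for the /v2/models/ensemble/generate route."""
    payload = {"text_input": text_input, **params}
    return urllib.request.Request(
        f"{endpoint_url}/v2/models/ensemble/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {auth_token}",
        },
        method="POST",
    )

req = build_generate_request(
    "https://your-endpoint-url", "YOUR_AUTH_TOKEN",
    "What is artificial intelligence?",
    max_tokens=100, temperature=0.7, top_p=0.9,
)
# Send with: urllib.request.urlopen(req)
```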

Supported Parameters

| Parameter | Description | Shape | Data Type |
| --- | --- | --- | --- |
| text_input | Input text prompt for generation | [-1] | TYPE_STRING |
| max_tokens | Maximum number of tokens to generate | [-1] | TYPE_INT32 |
| bad_words | Words/phrases to exclude from output | [-1] | TYPE_STRING |
| stop_words | Words that terminate generation when produced | [-1] | TYPE_STRING |
| end_id | Token ID marking the end of a sequence | [1] | TYPE_INT32 |
| pad_id | Token ID used for padding | [1] | TYPE_INT32 |
| top_k | Number of highest-probability tokens to consider | [1] | TYPE_INT32 |
| top_p | Nucleus sampling probability threshold | [1] | TYPE_FP32 |
| temperature | Controls sampling randomness | [1] | TYPE_FP32 |
| length_penalty | Penalty applied to output length | [1] | TYPE_FP32 |
| repetition_penalty | Penalty for repeated sequences | [1] | TYPE_FP32 |
| min_length | Minimum number of output tokens | [1] | TYPE_INT32 |
| presence_penalty | Penalizes tokens already present in the output | [1] | TYPE_FP32 |
| frequency_penalty | Penalizes tokens by their frequency in the output | [1] | TYPE_FP32 |
| random_seed | Seed for random sampling | [1] | TYPE_UINT64 |
| return_log_probs | Include token log probabilities in the response | [1] | TYPE_BOOL |
| return_context_logits | Include logits for context tokens | [1] | TYPE_BOOL |
| return_generation_logits | Include logits for generated tokens | [1] | TYPE_BOOL |
| prompt_embedding_table | Prompt embedding table | [-1, -1] | TYPE_FP16 |
| prompt_vocab_size | Prompt vocabulary size | [1] | TYPE_INT32 |
| embedding_bias_words | Words for embedding bias | [-1] | TYPE_STRING |
| embedding_bias_weights | Weights for embedding bias | [-1] | TYPE_FP32 |
| cum_log_probs | Cumulative log probabilities | [-1] | TYPE_FP32 |
| output_log_probs | Log probabilities per generated token | [-1, -1] | TYPE_FP32 |
| context_logits | Logits for context tokens | [-1, -1] | TYPE_FP32 |
| generation_logits | Logits for generated tokens | [-1, -1, -1] | TYPE_FP32 |
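Putting a few of these together: as in the cURL example above, scalar parameters (shape [1]) are sent as single JSON values, while list-shaped parameters (shape [-1]) are sent as arrays. The prompt and values below are illustrative.

```python
import json

# Illustrative request body mixing scalar and list-shaped parameters.
payload = {
    "text_input": "Explain nucleus sampling in one sentence.",  # [-1] string
    "max_tokens": 64,            # generation budget
    "temperature": 0.7,          # scalar [1]
    "top_k": 40,                 # scalar [1]
    "top_p": 0.9,                # scalar [1]
    "stop_words": ["</s>"],      # list-shaped [-1]
    "return_log_probs": True,    # scalar [1] boolean
}
body = json.dumps(payload)
```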