Deploy Model Endpoint for Codellama-7b

Deploy a Codellama‑7b model endpoint using a prebuilt container or your own custom uploaded weights.


Overview

  • Deploy Codellama‑7b with a prebuilt container
  • Optionally upload custom fine‑tuned weights
  • Make inference requests through REST APIs

Step 1: Create a Model Endpoint

  1. Go to the AI Platform
  2. Select your project
  3. Navigate to Model Endpoints
  4. Click Create Endpoint
  5. Choose Codellama‑7b model card
  6. Select CPU/GPU plan and replicas
  7. Complete creation and wait until status is Running

Step 2: Generate Your API Token

  1. Go to API Tokens under your project
  2. Create a token or use an existing one
  3. Copy the Auth Token — required for all API calls

Step 3: Make Inference Requests

Once the endpoint status is Running, ready-made API examples are available in the Sample API Request section of your endpoint.


Deploy Custom Model Weights (Optional)

Use this option for fine‑tuned models.

A) Create Model Definition

  1. Go to Models
  2. Click Create Model → select Custom
  3. Note the Object Storage (bucket) details provided
  4. Copy the MinIO setup command
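The MinIO setup command provided by the console typically configures an alias that points the `mc` client at your bucket. As a hedged sketch (the endpoint URL, access key, and secret key below are placeholders — use the exact command shown in your console):

```shell
# Configure an alias for your object-storage endpoint (placeholder values).
mc alias set codellama-7b https://objectstore.example.com ACCESS_KEY SECRET_KEY

# Verify the alias works by listing the bucket contents.
mc ls codellama-7b
```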

B) Download Codellama‑7b into Notebook

from transformers import AutoTokenizer, pipeline
import torch

model = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline; use a distinct name so the
# imported `pipeline` function is not shadowed.
generator = pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)
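The pipeline above pulls weights into the Hugging Face cache; to upload them to object storage you need the files in a plain local directory first. One way is a snapshot download — a sketch using `snapshot_download` from `huggingface_hub` (the `local_dir` path is an arbitrary choice, and the full download is roughly 13 GB):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local folder so the
# files can then be copied to object storage with `mc cp`.
snapshot_download(
    repo_id="codellama/CodeLlama-7b-hf",
    local_dir="codellama-7b-hf",
)
```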

C) Upload Weights to Object Storage

From the directory containing the downloaded weight files, copy them into the bucket:

mc cp -r * codellama-7b/codellama-7b-hf

D) Deploy Endpoint with Custom Weights

  • Select your custom model under Model Details
  • Provide model path if not root
  • Validate and launch

Supported Parameters (Summary)

Parameter            Purpose
max_new_tokens       Maximum number of tokens to generate
temperature          Controls randomness of sampling
top_k                Samples only from the k highest-probability tokens
top_p                Nucleus (cumulative-probability) sampling threshold
num_beams            Number of beams for beam search
repetition_penalty   Penalizes repeated text

Tune these for output length and creativity.
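Assuming the endpoint accepts these parameters in the JSON request body (as in the API examples in the next section), a payload combining them might look like the following sketch — the specific values are illustrative starting points, not recommendations:

```python
import json

# Example generation parameters -- tune for output length vs. creativity.
payload = {
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 100,      # cap output length to avoid long delays
    "temperature": 0.7,         # higher = more random
    "top_k": 50,                # sample only from the 50 most likely tokens
    "top_p": 0.95,              # nucleus sampling threshold
    "num_beams": 1,             # values > 1 enable beam search
    "repetition_penalty": 1.1,  # discourage repeated text
}
print(json.dumps(payload, indent=2))
```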


Example: Inference via API

cURL Request

curl -X POST https://your-endpoint-url/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'

Python Request

import requests

url = "https://your-endpoint-url/v1/completions"
headers = {
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
    "Content-Type": "application/json",
}
payload = {
    "model": "Codellama-7b",
    "prompt": "Explain quicksort in Python.",
    "max_new_tokens": 200,
    "temperature": 0.5,
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # surface HTTP errors early
print(response.json())
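The shape of the response JSON depends on the serving stack. Assuming an OpenAI-style completions schema (a `choices` list whose items carry a `text` field — verify this against your endpoint's Sample API Request section), a defensive extraction helper might look like:

```python
def extract_text(response_json: dict) -> str:
    """Pull the generated text out of an OpenAI-style completions response.

    Falls back to an empty string if the expected keys are missing.
    """
    choices = response_json.get("choices") or []
    if choices and isinstance(choices[0], dict):
        return choices[0].get("text", "")
    return ""

# Example with a mock response of the assumed shape:
mock = {"choices": [{"text": "def fib(n): ..."}]}
print(extract_text(mock))  # -> def fib(n): ...
```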

Tips

  • ✅ Use GPU plans for faster inference
  • ✅ Set reasonable max token values to avoid long delays
  • ✅ Include clear instructions in prompts for code generation
  • ✅ Start with temperature between 0.3–0.7 for quality results

Notes

  • Auth token is required for all requests
  • CPU plans may take longer to load the model initially
  • Check the endpoint logs if the model takes unusually long to reach the Running state
  • Ensure model folder structure includes required files

Codellama‑7b deployment is now ready — you can integrate it into any application using REST API requests.