Deploy Model Endpoint for Codellama-7b
Deploy a Codellama‑7b model endpoint using a prebuilt container, or upload your own fine‑tuned weights.
Overview
- Deploy Codellama‑7b with a prebuilt container
- Optionally upload custom fine‑tuned weights
- Make inference requests through REST APIs
Step 1: Create a Model Endpoint
- Go to the AI Platform
- Select your project
- Navigate to Model Endpoints
- Click Create Endpoint
- Choose Codellama‑7b model card
- Select CPU/GPU plan and replicas
- Complete creation and wait until status is Running
Step 2: Generate Your API Token
- Go to API Tokens under your project
- Create a token or use an existing one
- Copy the Auth Token — required for all API calls
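Because every request needs this token, it is convenient to keep it in an environment variable for the rest of the session (the variable name below is just an example; substitute your real token for the placeholder):

```shell
# Export the copied token once per shell session (placeholder value shown)
export AUTH_TOKEN="YOUR_AUTH_TOKEN"

# Later API calls can then reference it in the Authorization header
echo "Authorization: Bearer $AUTH_TOKEN"
```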
Step 3: Make Inference Requests
Once the endpoint reaches the Running state, sample API requests are available in the Sample API Request section of your endpoint page.
Deploy Custom Model Weights (Optional)
Use this option for fine‑tuned models.
A) Create Model Definition
- Go to Models
- Click Create Model → select Custom
- Note the Object Storage (bucket) details provided
- Copy the MinIO setup command
B) Download Codellama‑7b into Notebook
from transformers import AutoTokenizer, pipeline
import torch

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Name the pipeline object something other than the imported pipeline()
# function so the function is not shadowed
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)

# Save the weights to a local folder so the files can be uploaded in the next step
pipe.model.save_pretrained("codellama-7b-hf")
tokenizer.save_pretrained("codellama-7b-hf")
C) Upload Weights to Object Storage
# Run the MinIO setup (alias) command copied in step A first, then upload
# the saved weights from the notebook's working directory
mc cp -r * codellama-7b/codellama-7b-hf
D) Deploy Endpoint with Custom Weights
- Select your custom model under Model Details
- Provide model path if not root
- Validate and launch
Supported Parameters (Summary)
| Parameter | Purpose |
|---|---|
| max_new_tokens | Maximum number of tokens to generate |
| temperature | Controls randomness of sampling |
| top_k | Samples only from the k most probable tokens |
| top_p | Nucleus sampling probability threshold |
| num_beams | Number of beams for beam search |
| repetition_penalty | Penalizes repeated text |
Tune these for output length and creativity.
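As a sketch of how these parameters combine in practice, the request body below enables nucleus sampling with a mild repetition penalty. The field names mirror the API examples that follow; the values are illustrative, not recommendations:

```python
import json

# Sampling parameters assembled into a request body; values are illustrative
payload = {
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 150,      # cap output length
    "temperature": 0.5,         # moderate randomness
    "top_k": 50,                # sample from the 50 most probable tokens
    "top_p": 0.9,               # nucleus sampling threshold
    "repetition_penalty": 1.1,  # discourage repeated text
}

# Serialize to the JSON string sent as the request body
body = json.dumps(payload)
print(body)
```

Raising `temperature` or `top_p` increases variety; lowering them makes output more deterministic, which usually suits code generation.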
Example: Inference via API
cURL Request
curl -X POST https://your-endpoint-url/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
Python Request
import requests

url = "https://your-endpoint-url/v1/completions"
headers = {
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
    "Content-Type": "application/json",
}
payload = {
    "model": "Codellama-7b",
    "prompt": "Explain quicksort in Python.",
    "max_new_tokens": 200,
    "temperature": 0.5,
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())
Tips
- ✅ Use GPU plans for faster inference
- ✅ Set reasonable max token values to avoid long delays
- ✅ Include clear instructions in prompts for code generation
- ✅ Start with temperature between 0.3 and 0.7 for quality results
Notes
- Auth token is required for all requests
- CPU plans may take longer to load the model initially
- Check logs if the model takes longer to reach Running state
- Ensure model folder structure includes required files
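For reference, a Hugging Face checkpoint folder typically looks like the listing below. The exact filenames vary with how the model was saved, so treat this as an illustrative assumption rather than an exhaustive requirement:

```
codellama-7b-hf/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
└── model weight shards (*.safetensors or pytorch_model-*.bin)
```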
Your Codellama‑7b deployment is now ready; you can integrate it into any application through REST API requests.