# Deploy Model Endpoint for Codellama-7b

Deploy a Codellama‑7b model endpoint using a prebuilt container, either with the default weights or with your own uploaded fine‑tuned weights.

---

## Overview

* Deploy Codellama‑7b with a prebuilt container
* Optionally upload custom fine‑tuned weights
* Make inference requests through REST APIs

---

## Step 1: Create a Model Endpoint

1. Go to the AI Platform
2. Select your project
3. Navigate to **Model Endpoints**
4. Click **Create Endpoint**
5. Choose the **Codellama‑7b** model card
6. Select a CPU or GPU plan and the number of replicas
7. Complete creation and wait until the status is **Running**

---

## Step 2: Generate Your API Token

1. Go to **API Tokens** under your project
2. Create a token or use an existing one
3. Copy the **Auth Token**; it is required for all API calls
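
To keep the token out of source files, one option is to read it from an environment variable. The sketch below is illustrative only: the variable name `MODEL_ENDPOINT_TOKEN` and the helper `auth_headers` are hypothetical choices, not platform conventions. The `Bearer` scheme matches the request examples later in this guide.

```python
import os


def auth_headers(env_var: str = "MODEL_ENDPOINT_TOKEN") -> dict:
    """Build request headers from a token stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError(f"Set {env_var} to the Auth Token copied from API Tokens")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of an HTTP client call.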

---

## Step 3: Make Inference Requests

Once the endpoint is running, sample requests are available in the **Sample API Request** section of your endpoint.

---

## Deploy Custom Model Weights (Optional)

Use this option for fine‑tuned models.

### A) Create Model Definition

1. Go to **Models**
2. Click **Create Model** → select **Custom**
3. Note the Object Storage (bucket) details provided
4. Copy the MinIO setup command

### B) Download Codellama‑7b into Notebook

```python
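from transformers import AutoTokenizer, pipeline
import torch

# Hugging Face model ID; the weights are downloaded and cached on first use.
model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use a distinct name so the imported pipeline() function is not shadowed.
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # place layers on the available devices
    tokenizer=tokenizer,
)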
```

### C) Upload Weights to Object Storage

```bash
# Run from the directory that holds the model files; copies everything in it.
mc cp -r * codellama-7b/codellama-7b-hf
```

### D) Deploy Endpoint with Custom Weights

* Select your custom model under **Model Details**
* Provide the model path if the model files are not at the bucket root
* Validate the settings and launch the endpoint

---

## Supported Parameters (Summary)

| Parameter            | Purpose                                            |
| -------------------- | -------------------------------------------------- |
| `max_new_tokens`     | Maximum number of tokens to generate               |
| `temperature`        | Controls sampling randomness                       |
| `top_k`              | Restricts sampling to the k most probable tokens   |
| `top_p`              | Nucleus sampling threshold (cumulative probability) |
| `num_beams`          | Number of beams used in beam search                |
| `repetition_penalty` | Penalizes tokens that were already generated       |

> Tune these to control output length and randomness.
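
A small helper can catch misspelled parameter names before a request is sent. This is a minimal sketch, assuming the request body uses the `model` and `prompt` fields shown in the examples in this guide; `build_payload` and `SUPPORTED_PARAMS` are hypothetical names, and the endpoint may accept parameters beyond those in the table.

```python
# Parameter names taken from the summary table above.
SUPPORTED_PARAMS = {
    "max_new_tokens",
    "temperature",
    "top_k",
    "top_p",
    "num_beams",
    "repetition_penalty",
}


def build_payload(prompt: str, model: str = "Codellama-7b", **params) -> dict:
    """Return a request body, rejecting parameter names not in the table."""
    unknown = set(params) - SUPPORTED_PARAMS
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")
    return {"model": model, "prompt": prompt, **params}


payload = build_payload(
    "Write a Python function for Fibonacci.",
    max_new_tokens=100,
    temperature=0.7,
)
```

Rejecting unknown names locally is a design choice for this sketch; it surfaces typos such as `max_tokens` immediately rather than relying on the server's behavior.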

---

## Example: Inference via API

### cURL Request

```bash
curl -X POST https://your-endpoint-url/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Request

```python
import requests

url = "https://your-endpoint-url/v1/completions"
headers = {
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
    "Content-Type": "application/json"
}
payload = {
    "model": "Codellama-7b",
    "prompt": "Explain quicksort in Python.",
    "max_new_tokens": 200,
    "temperature": 0.5
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```
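
The request above has no timeout, and it prints the body even when the call fails. The following sketch adds both safeguards using only the Python standard library. The function names `build_request` and `complete` and the 60-second default are assumptions made for illustration; the URL, header, and payload shapes are the ones used in the examples above.

```python
import json
import urllib.request


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def complete(url: str, token: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError (or a timeout error) for connection problems.
    """
    req = build_request(url, token, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `complete("https://your-endpoint-url/v1/completions", "YOUR_AUTH_TOKEN", payload)` inside a `try`/`except urllib.error.HTTPError` block lets the caller separate authentication or validation failures from successful responses.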

---

## Tips

* ✅ Use GPU plans for faster inference
* ✅ Set reasonable max token values to avoid long delays
* ✅ Include clear instructions in prompts for code generation
* ✅ Start with `temperature` between **0.3–0.7** for quality results

---

## Notes

* An auth token is required for **all requests**
* CPU plans may take longer to load the model initially
* Check the **logs** if the model takes longer than expected to reach the *Running* state
* Ensure the uploaded model folder contains all required files (for a Hugging Face model, typically the config, tokenizer, and weight files)
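
Before uploading custom weights, a quick local check of the folder can save a failed deployment. The sketch below assumes a typical Hugging Face layout (`config.json`, tokenizer files, and `*.safetensors` or `*.bin` weights); the platform's exact requirements may differ, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list:
    """List the expected file groups that are absent from model_dir.

    The expected layout is an assumption based on common Hugging Face exports.
    """
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(root.glob("tokenizer*")):
        missing.append("tokenizer files (tokenizer*)")
    if not (any(root.glob("*.safetensors")) or any(root.glob("*.bin"))):
        missing.append("weight files (*.safetensors or *.bin)")
    return missing
```

An empty list means all three groups were found; otherwise the returned entries name what to add before running the upload step.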

---

The Codellama‑7b deployment is now ready. You can integrate it into any application using REST API requests.

