# Deploy Model Endpoint for Codellama-7b

Deploy a Codellama-7b model endpoint using a prebuilt container or your own custom uploaded weights.

---

## Overview

* Deploy Codellama-7b with a prebuilt container
* Optionally upload custom fine-tuned weights
* Make inference requests through REST APIs

---

## Step 1: Create a Model Endpoint

1. Go to the AI Platform
2. Select your project
3. Navigate to **Model Endpoints**
4. Click **Create Endpoint**
5. Choose the **Codellama-7b** model card
6. Select a CPU or GPU plan and the number of replicas
7. Complete creation and wait until the status is **Running**

---

## Step 2: Generate Your API Token

1. Go to **API Tokens** under your project
2. Create a token or use an existing one
3. Copy the **Auth Token**; it is required for all API calls

---

## Step 3: Make Inference Requests

Once the endpoint is running, sample API examples are available in the **Sample API Request** section of your endpoint.

---

## Deploy Custom Model Weights (Optional)

Use this option for fine-tuned models.

### A) Create Model Definition

1. Go to **Models**
2. Click **Create Model** → select **Custom**
3. Note the Object Storage (bucket) details provided
4. Copy the MinIO setup command

### B) Download Codellama-7b into Notebook

```python
from transformers import AutoTokenizer, pipeline
import torch

model = "codellama/CodeLlama-7b-hf"

# Downloads the tokenizer and weights from the Hugging Face Hub on first use.
tokenizer = AutoTokenizer.from_pretrained(model)

# Named `generator` so the imported `pipeline` function is not shadowed.
generator = pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
    tokenizer=tokenizer,
)
```

### C) Upload Weights to Object Storage

```bash
mc cp -r * codellama-7b/codellama-7b-hf
```

### D) Deploy Endpoint with Custom Weights

* Select your custom model under **Model Details**
* Provide the model path if the weights are not at the bucket root
* Validate and launch

---

## Supported Parameters (Summary)

| Parameter            | Purpose                                      |
| -------------------- | -------------------------------------------- |
| `max_new_tokens`     | Maximum number of tokens to generate         |
| `temperature`        | Controls randomness of sampling              |
| `top_k`              | Limits sampling to the k most likely tokens  |
| `top_p`              | Nucleus sampling threshold                   |
| `num_beams`          | Number of beams used in beam search          |
| `repetition_penalty` | Penalizes repeated text                      |

> Tune these for output length and creativity.

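
The parameters in the table above can be collected into a request payload fragment. The following is a minimal sketch, not the platform's own client: the helper name `build_generation_params` is hypothetical, and the value ranges it checks follow common sampling conventions (positive `temperature`, `top_p` in (0, 1], and so on), which are assumptions about what this endpoint accepts.

```python
def build_generation_params(
    max_new_tokens=100,
    temperature=0.7,
    top_k=None,
    top_p=None,
    num_beams=None,
    repetition_penalty=None,
):
    """Return a dict of generation parameters, omitting any left as None.

    The range checks are assumptions based on common conventions,
    not documented limits of the endpoint.
    """
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if top_p is not None and not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    if num_beams is not None and num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    if repetition_penalty is not None and repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")

    params = {"max_new_tokens": max_new_tokens, "temperature": temperature}
    optional = {
        "top_k": top_k,
        "top_p": top_p,
        "num_beams": num_beams,
        "repetition_penalty": repetition_penalty,
    }
    # Only send the optional parameters the caller actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

The returned dict can be merged into a request body next to `model` and `prompt`. Leaving optional parameters out of the payload lets the endpoint apply its own defaults.
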

---

## Example: Inference via API

### cURL Request

```bash
curl -X POST https://your-endpoint-url/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "model": "Codellama-7b",
    "prompt": "Write a Python function for Fibonacci.",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Request

```python
import requests

url = "https://your-endpoint-url/v1/completions"

headers = {
    "Authorization": "Bearer YOUR_AUTH_TOKEN",
    "Content-Type": "application/json",
}

payload = {
    "model": "Codellama-7b",
    "prompt": "Explain quicksort in Python.",
    "max_new_tokens": 200,
    "temperature": 0.5,
}

# A timeout keeps the call from hanging while the model is still loading.
response = requests.post(url, json=payload, headers=headers, timeout=120)
response.raise_for_status()  # Surface HTTP errors such as a rejected token
print(response.json())
```

---

## Tips

* ✅ Use GPU plans for faster inference
* ✅ Set reasonable max token values to avoid long delays
* ✅ Include clear instructions in prompts for code generation
* ✅ Start with `temperature` between **0.3 and 0.7** for quality results

---

## Notes

* An auth token is required for **all requests**
* CPU plans may take longer to load the model initially
* Check the **logs** if the model takes longer than expected to reach the *Running* state
* Ensure the model folder structure includes the required files

---

The Codellama-7b deployment is now ready. You can integrate it into any application using REST API requests.

---
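
As one way to integrate the endpoint into an application, the sketch below wraps the request in two small functions using only the Python standard library. It is an illustration under stated assumptions: the function names `build_request` and `complete` and the environment variable `CODELLAMA_AUTH_TOKEN` are hypothetical, and the URL is the same placeholder used in the examples above.

```python
import json
import os
import urllib.request

# Placeholder URL from the examples above; replace with your endpoint's URL.
ENDPOINT_URL = "https://your-endpoint-url/v1/completions"


def build_request(prompt, token, max_new_tokens=100, temperature=0.5):
    """Build (but do not send) a POST request for the completions endpoint."""
    if not token:
        raise ValueError("An auth token is required for all requests.")
    payload = {
        "model": "Codellama-7b",
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the request and return the decoded JSON response."""
    # CODELLAMA_AUTH_TOKEN is a hypothetical environment variable name.
    token = os.environ.get("CODELLAMA_AUTH_TOKEN", "")
    request = build_request(prompt, token)
    # A generous timeout allows for slow first responses on CPU plans.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Reading the token from the environment keeps it out of source control, and separating request construction from sending makes the payload easy to inspect without calling the endpoint.
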