Quick Start Guide
Create your first fine-tuning job and train a model on your own dataset. This guide covers the complete flow from job creation through monitoring and accessing your trained model.
What you need before you start
| Requirement | Details |
|---|---|
| Account | Active E2E AI Cloud account with access to Foundation Studio |
| Dataset | A dataset in .jsonl format uploaded to EOS, or a Hugging Face dataset name |
| Hugging Face token | Required if using a gated model (e.g. Llama 3) or a private Hugging Face dataset |
| GPU plan | Decide which GPU to use (H100, A100) based on your model size and training budget |
If you're fine-tuning a gated model (e.g. Llama 3, Mistral), you need a Hugging Face access token added as an integration. For setup instructions, see External Integrations — Hugging Face.
Step 1: Navigate to Fine-Tune Models
- In the TIR Dashboard sidebar, click Foundation Studio under Labs Experimental.
- From the dropdown, select Fine-Tune Models.
- You will land on the Manage Fine-Tuning Jobs page.
Step 2: Create a fine-tuning job
- Click Create Fine-Tuning Job or the Click Here button.
- Select a base model from the available options.
- Choose a GPU plan — H100 and A100 are available. Use the filter to narrow down by GPU type.
- For LLMs with 7B+ parameters, choose A100 or H100 for acceptable training times.
- For Stable Diffusion models, A100 is typically sufficient.
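When choosing between plans, a rough back-of-envelope estimate of the memory needed just to hold the model weights can help. The numbers below are illustrative assumptions, not platform guarantees, and real training needs extra headroom for gradients, optimizer state, and activations:

```python
# Back-of-envelope GPU memory estimate for model weights alone.
# Training requires additional memory for gradients, optimizer state,
# and activations, so treat these figures as a lower bound.

def weight_memory_gb(num_params_billions: float, bytes_per_param: float) -> float:
    """Memory in GB needed to hold the raw weights at a given precision."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

# A 7B-parameter model (illustrative):
fp16 = weight_memory_gb(7, 2)    # 16-bit weights  -> 14.0 GB
int4 = weight_memory_gb(7, 0.5)  # 4-bit quantized ->  3.5 GB
print(f"7B model weights: {fp16:.1f} GB in fp16, {int4:.1f} GB in 4-bit")
```

This is why 7B+ models are paired with A100/H100 class GPUs, and why the 4-bit quantization option in the hyperparameter step can make a large model fit on a smaller plan.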
Step 3: Configure the job model
On the Job Model Configuration page:
- Enter a name for your fine-tuned model.
- Choose your training start point:
- Start Training from Scratch (default) — trains from the base model weights.
- Continue training from previous checkpoint — resumes from an existing checkpoint.
- If resuming, click Choose to select the model repository and checkpoint.
- Select a Hugging Face integration from the dropdown, or click Create New to add your token.
Some models require access granted by their administrator. Visit the model card on Hugging Face to request access.
Step 4: Prepare your dataset
On the Dataset Preparation page:
- Select a task that matches your training objective.
- Choose a dataset type:
  - CUSTOM — Upload your own .jsonl files to an EOS bucket.
  - HUGGING FACE — Use a dataset from the Hugging Face Hub.
- Set a validation split ratio (e.g. 0.1 for 10% validation).
- Configure prompt settings as needed.
Using a CUSTOM dataset
Click CHOOSE to select an existing EOS dataset, or click here to create a new one. After creating a dataset, click UPLOAD DATASET to add your files, then click SUBMIT.
For text models:
Your dataset should contain records with fields that map to your selected task and prompt configuration. The exact fields depend on the task you choose — the UI shows an Example Dataset preview once a task is selected, which you can use as a reference for the expected structure.
```json
[
  {
    "input": "Artificial Intelligence is a branch of computer science...",
    "output": "AI is a field focused on creating machines that mimic human intelligence.",
    "instruction": "Summarize the following text."
  }
]
```
The Prompt Configuration is auto-generated based on the selected task and defines how the fields are presented to the model during training.
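When building a CUSTOM dataset file, note that .jsonl stores one JSON object per line (rather than a single JSON array). A minimal sketch of writing and sanity-checking such a file, assuming the instruction/input/output fields from the example above (your fields depend on the task you select):

```python
import json

# Hypothetical records following the instruction/input/output shape
# shown in the Example Dataset preview. Adapt field names to your task.
records = [
    {
        "instruction": "Summarize the following text.",
        "input": "Artificial Intelligence is a branch of computer science...",
        "output": "AI is a field focused on creating machines that mimic human intelligence.",
    },
]

# Write one JSON object per line -- the .jsonl format expected for upload.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Sanity-check before uploading: every line must parse as a JSON object
# with the expected fields, otherwise the fine-tuning job will fail.
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        assert {"instruction", "input", "output"} <= rec.keys()
```

Validating the file locally like this is cheaper than discovering a malformed record after the job has been scheduled on a GPU.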
For image generation models (e.g. Stable Diffusion):
Instead of a text schema, you configure dataset columns and validation settings directly in the UI:
| Field | Description | Example |
|---|---|---|
| Target Image Column | Column in your dataset containing the images | image |
| Target Caption Column | Column containing the text captions | text |
| Validation Prompt | A prompt used to generate sample images during training to track progress | A photo of a man with green eyes |
| Num Validation Images | Number of sample images to generate at each validation step | 2 |
Uploading a dataset with incorrect field names or structure will cause the fine-tuning job to fail. Use the Example Dataset shown in the UI as a reference for the expected format.
Using a Hugging Face dataset
Select HUGGING FACE as the dataset type and choose a dataset from the available collection.
Step 5: Set hyperparameters
During Hyperparameter Configuration, you tune settings such as the learning rate, batch size, and optimization options to improve model performance. These choices balance training speed, accuracy, and resource usage; experimenting with different combinations helps you find a configuration that improves accuracy while avoiding overfitting or underfitting.
On the Hyperparameter Configuration page, the following parameters are available:
| Parameter | Description |
|---|---|
| Training Type | The fine-tuning method to use (e.g. Parameter-Efficient Fine-Tuning, full fine-tuning) |
| Stop Training When | The condition that ends training (e.g. when epoch count has reached a set number) |
| Learning Rate | Step size during optimization — influences convergence speed and training stability |
| Epochs | Number of complete passes over the entire dataset during training |
| Max Steps | Maximum number of training steps; if set, epochs are ignored |
| Max Context Length | Maximum length of input sequences during training |
| Peft Lora R | LoRA attention dimension (rank) |
| Peft Lora Alpha | Alpha parameter for LoRA scaling |
| Lora Dropout | Dropout probability applied to LoRA layers to reduce overfitting |
| Lora Bias | Specifies which biases are updated during training (none, all, or lora_only) |
| Target Module | Specifies which model layers LoRA is applied to |
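The LoRA parameters in the table map onto a simple idea: instead of updating a full weight matrix W, training learns two small matrices A (r × d) and B (d × r) whose product, scaled by alpha / r, is added to W's output. A dependency-free toy sketch of that forward pass (illustrative sizes and values, not the platform's implementation):

```python
# Toy LoRA forward pass: y = W @ x + (alpha / r) * B @ (A @ x)
# r corresponds to "Peft Lora R" (the rank), alpha to "Peft Lora Alpha".
# Plain lists of lists keep the sketch dependency-free.

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

d, r = 4, 2        # model dim and LoRA rank (toy values)
alpha = 4          # LoRA scaling numerator
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weights
A = [[0.1] * d for _ in range(r)]  # r x d, trained
B = [[0.1] * r for _ in range(d)]  # d x r, trained

x = [1.0, 2.0, 3.0, 4.0]
base = matvec(W, x)                    # frozen-path output
delta = matvec(B, matvec(A, x))        # low-rank update path
scale = alpha / r                      # "Peft Lora Alpha" / "Peft Lora R"
y = [b + scale * dv for b, dv in zip(base, delta)]
```

The memory saving comes from training 2·r·d values instead of d²: at toy size that is no saving, but at realistic dimensions (say d = 4096, r = 16) LoRA trains about 131k values per matrix versus 16.7M for full fine-tuning.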
Quantization (optional): Reduce GPU memory usage during training. Options include Load in 4Bit and DoubleQuant.
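The idea behind Load in 4Bit is that each weight is stored using 4 bits (16 levels) instead of 16 or 32, trading a little precision for roughly 4x–8x less memory. A toy sketch of symmetric 4-bit quantization to show the precision trade-off (illustration only, not the platform's actual kernel):

```python
# Toy symmetric 4-bit quantization: map floats to 16 integer levels
# (-8..7) via a per-tensor scale, then dequantize to see the error.

def quantize_4bit(ws):
    scale = max(abs(w) for w in ws) / 7  # 7 = largest positive level
    q = [max(-8, min(7, round(w / scale))) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.31, -0.12, 0.07, -0.44]   # illustrative values
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# Each restored weight is close to, but not exactly, the original:
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

DoubleQuant pushes the same idea one level further by also quantizing the per-block scale factors themselves, shaving off a bit more memory at a small additional cost in precision.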
Advanced settings (optional): Configure batch size and gradient accumulation steps.
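Batch size and gradient accumulation interact: the effective batch size is the per-device batch size times the accumulation steps, and together with dataset size and epochs they determine the total number of optimizer steps. A quick sanity-check of the arithmetic (all numbers illustrative):

```python
# Effective batch size = per-device batch * gradient accumulation steps.
# Gradient accumulation lets a memory-limited GPU mimic a larger batch
# by summing gradients over several forward/backward passes before each
# optimizer step.

dataset_size = 10_000      # training examples (illustrative)
per_device_batch = 4       # limited by GPU memory
grad_accum_steps = 8       # from Advanced settings

effective_batch = per_device_batch * grad_accum_steps   # 32
steps_per_epoch = -(-dataset_size // effective_batch)   # ceiling division -> 313
epochs = 3
total_steps = steps_per_epoch * epochs                  # 939
```

If Max Steps is set in the hyperparameter table, training stops at that count regardless of the computed epoch total.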
WandB tracking (optional): Enable Weights & Biases (WandB) to monitor training metrics in real time. WandB is a platform for experiment tracking, model visualization, and team collaboration. To enable, add your WandB API key via External Integrations and select it in this step.
Debug Options (optional): Limit the amount of data used during training and evaluation runs, which is useful for quickly validating a configuration before a full run.
Step 6: Review and launch
Review your configuration on the Summary page, then click Launch.
The job appears in the Manage Fine-Tuning Jobs list. Training time depends on model size, dataset size, and GPU plan.
Step 7: Monitor your job
Click on the job to view details:
| Tab | What it shows |
|---|---|
| Overview | Job configuration, status, and resource details |
| Events | Pod scheduling, container start, and lifecycle events |
| Logs | Real-time training logs to diagnose errors or monitor progress |
| Training Metrics | Loss curves and other training metrics |
| Metrics | GPU utilization, GPU memory usage, and other resource metrics |
Step 8: Access your fine-tuned model
When training completes, your fine-tuned model appears in the Models section at the bottom of the job page. The model repository contains:
- All training checkpoints
- Any LoRA adapters built during training
From here, navigate to the Inference section to deploy your fine-tuned model as an API endpoint.