Quick Start Guide

Create your first fine-tuning job and train a model on your own dataset. This guide covers the complete flow from job creation through monitoring and accessing your trained model.


What you need before you start

| Requirement | Details |
| --- | --- |
| Account | Active E2E AI Cloud account with access to Foundation Studio |
| Dataset | A dataset in `.jsonl` format uploaded to EOS, or a Hugging Face dataset name |
| Hugging Face token | Required if using a gated model (e.g. Llama 3) or a private Hugging Face dataset |
| GPU plan | Decide which GPU to use (H100, A100) based on your model size and training budget |

Hugging Face Integration

If you're fine-tuning a gated model (e.g. Llama 3, Mistral), you need a Hugging Face access token added as an integration. For setup instructions, see External Integrations — Hugging Face.


Step 1: Navigate to Fine-Tune Models

  1. In the TIR Dashboard sidebar, click Foundation Studio under Labs Experimental.
  2. From the dropdown, select Fine-Tune Models.
  3. You will land on the Manage Fine-Tuning Jobs page.

Step 2: Create a fine-tuning job

  1. Click Create Fine-Tuning Job or the Click Here button.
  2. Select a base model from the available options.
  3. Choose a GPU plan — H100 and A100 are available. Use the filter to narrow down by GPU type.

tip
  • For LLMs with 7B+ parameters, choose A100 or H100 for acceptable training times.
  • For Stable Diffusion models, A100 is typically sufficient.

Step 3: Configure the job model

On the Job Model Configuration page:

  1. Enter a name for your fine-tuned model.
  2. Choose your training start point:
    • Start Training from Scratch (default) — trains from the base model weights.
    • Continue training from previous checkpoint — resumes from an existing checkpoint.
  3. If resuming, click Choose to select the model repository and checkpoint.
  4. Select a Hugging Face integration from the dropdown, or click Create New to add your token.

Note

Some models require access granted by their administrator. Visit the model card on Hugging Face to request access.


Step 4: Prepare your dataset

On the Dataset Preparation page:

  1. Select a task that matches your training objective.
  2. Choose a dataset type:
    • CUSTOM — Upload your own .jsonl files to an EOS bucket.
    • HUGGING FACE — Use a dataset from the Hugging Face Hub.
  3. Set a validation split ratio (e.g. 0.1 for 10% validation).
  4. Configure prompt settings as needed.
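
The validation split ratio partitions your dataset between training and validation records. A rough sketch of the arithmetic (illustrative only; the platform performs the actual split, and the record count here is hypothetical):

```python
# Illustrative only: TIR performs the actual split; this just shows the arithmetic.
def split_counts(num_records: int, validation_ratio: float) -> tuple[int, int]:
    """Return (train, validation) record counts for a given split ratio."""
    validation = int(num_records * validation_ratio)
    return num_records - validation, validation

print(split_counts(1000, 0.1))  # (900, 100): 10% held out for validation
```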

Using a CUSTOM dataset

Click CHOOSE to select an existing EOS dataset, or click here to create a new one. After creating a dataset, click UPLOAD DATASET to add your files, then click SUBMIT.

For text models:

Your dataset should contain records with fields that map to your selected task and prompt configuration. The exact fields depend on the task you choose — the UI shows an Example Dataset preview once a task is selected, which you can use as a reference for the expected structure.

[
  {
    "input": "Artificial Intelligence is a branch of computer science...",
    "output": "AI is a field focused on creating machines that mimic human intelligence.",
    "instruction": "Summarize the following text."
  }
]

The Prompt Configuration is auto-generated based on the selected task and defines how the fields are presented to the model during training.
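
A `.jsonl` file stores one JSON object per line rather than a single JSON array. A minimal sketch for writing records in that format (the filename and the `input`/`output`/`instruction` fields follow the example above; your selected task may expect different fields):

```python
import json

# Hypothetical records following the example schema above; the Example Dataset
# preview in the UI defines the fields your chosen task actually expects.
records = [
    {
        "input": "Artificial Intelligence is a branch of computer science...",
        "output": "AI is a field focused on creating machines that mimic human intelligence.",
        "instruction": "Summarize the following text.",
    },
]

# JSON Lines: one JSON object per line, no surrounding [] and no trailing commas.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```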

For image generation models (e.g. Stable Diffusion):

Instead of a text schema, you configure dataset columns and validation settings directly in the UI:

| Field | Description | Example |
| --- | --- | --- |
| Target Image Column | Column in your dataset containing the images | image |
| Target Caption Column | Column containing the text captions | text |
| Validation Prompt | A prompt used to generate sample images during training to track progress | A photo of a man with green eyes |
| Num Validation Images | Number of sample images to generate at each validation step | 2 |
Note

Uploading a dataset with incorrect field names or structure will cause the fine-tuning job to fail. Use the Example Dataset shown in the UI as a reference for the expected format.
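
Before uploading, you can sanity-check a CUSTOM dataset yourself. A minimal sketch of such a check (the required field set shown is an assumption for an instruction-style task; treat the Example Dataset in the UI as ground truth):

```python
import json

def check_jsonl(path: str, required_fields: set[str]) -> list[str]:
    """Return a list of problems found in a .jsonl file (empty list = looks OK)."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: not valid JSON")
                continue
            missing = required_fields - record.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
    return problems

# Demo with a deliberately incomplete record (the field set is an assumption;
# match it to the Example Dataset shown in the UI for your task).
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"input": "x", "output": "y"}) + "\n")

print(check_jsonl("sample.jsonl", {"input", "output", "instruction"}))
# → ["line 1: missing fields ['instruction']"]
```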

Using a Hugging Face dataset

Select HUGGING FACE as the dataset type and choose a dataset from the available collection.


Step 5: Set hyperparameters

Hyperparameters such as the learning rate, batch size, and optimization settings control how training proceeds. Choosing them well balances training speed, accuracy, and resource usage; experimenting with different combinations helps you improve model quality while avoiding overfitting or underfitting.

On the Hyperparameter Configuration page, the following parameters are available:

| Parameter | Description |
| --- | --- |
| Training Type | The fine-tuning method to use (e.g. Parameter-Efficient Fine-Tuning, full fine-tuning) |
| Stop Training When | The condition that ends training (e.g. when epoch count has reached a set number) |
| Learning Rate | Step size during optimization — influences convergence speed and training stability |
| Epochs | Number of complete passes over the entire dataset during training |
| Max Steps | Maximum number of training steps; if set, epochs are ignored |
| Max Context Length | Maximum length of input sequences during training |
| Peft Lora R | LoRA attention dimension (rank) |
| Peft Lora Alpha | Alpha parameter for LoRA scaling |
| Lora Dropout | Dropout probability applied to LoRA layers to reduce overfitting |
| Lora Bias | Specifies which biases are updated during training (none, all, or lora_only) |
| Target Module | Specifies which model layers LoRA is applied to |
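
To build intuition for Peft Lora R and Peft Lora Alpha: LoRA learns the weight update for a `d_out x d_in` matrix as the product of two small factors of shapes `d_out x r` and `r x d_in`, and scales that update by `alpha / r`. A rough sketch of the trainable-parameter savings (the 4096 x 4096 shape is a hypothetical attention projection):

```python
def lora_stats(d_out: int, d_in: int, r: int, alpha: int) -> tuple[int, int, float]:
    """(LoRA trainable params, full-matrix params, update scale) for one target matrix."""
    lora_params = r * (d_out + d_in)   # the two low-rank factors
    full_params = d_out * d_in         # what full fine-tuning would train
    return lora_params, full_params, alpha / r

lora, full, scale = lora_stats(4096, 4096, r=8, alpha=16)
print(lora, full, scale)  # 65536 16777216 2.0 -> well under 1% of the full matrix
```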

Quantization (optional): Reduce GPU memory usage during training. Options include Load in 4Bit and DoubleQuant.
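
A back-of-the-envelope sketch of why 4-bit loading helps (weight memory only; activations, gradients, optimizer state, and quantization overhead are all ignored, and the 7B figure is illustrative):

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate GPU memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Illustrative 7B-parameter model: weights only, no training overheads.
print(round(weight_memory_gib(7e9, 16), 1))  # ~13.0 GiB at 16-bit precision
print(round(weight_memory_gib(7e9, 4), 1))   # ~3.3 GiB when loaded in 4-bit
```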

Advanced settings (optional): Configure batch size and gradient accumulation steps.
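
Batch size and gradient accumulation interact: gradients are accumulated over several forward/backward passes before each optimizer step, so the effective batch size is their product. A quick sketch (the numbers are illustrative):

```python
def effective_batch(per_device_batch: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Batch size effectively seen by each optimizer step."""
    return per_device_batch * accumulation_steps * num_gpus

# A per-device batch of 4 with 8 accumulation steps behaves like batch 32,
# at roughly the activation-memory cost of batch 4.
print(effective_batch(4, 8))  # 32
```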

WandB tracking (optional): Enable Weights & Biases (WandB) to monitor training metrics in real time. WandB is a platform for experiment tracking, model visualization, and team collaboration. To enable, add your WandB API key via External Integrations and select it in this step.

Debug Options (optional): Limit the amount of data used during training and evaluation runs, which is useful for a quick validation pass before a full run.


Step 6: Review and launch

Review your configuration on the Summary page, then click Launch.

The job appears in the Manage Fine-Tuning Jobs list. Training time depends on model size, dataset size, and GPU plan.


Step 7: Monitor your job

Click on the job to view details:

| Tab | What it shows |
| --- | --- |
| Overview | Job configuration, status, and resource details |
| Events | Pod scheduling, container start, and lifecycle events |
| Logs | Real-time training logs to diagnose errors or monitor progress |
| Training Metrics | Loss curves and other training metrics |
| Metrics | GPU utilization, GPU memory usage, and other resource metrics |

Step 8: Access your fine-tuned model

When training completes, your fine-tuned model appears in the Models section at the bottom of the job page. The model repository contains:

  • All training checkpoints
  • Any LoRA adapters built during training

From here, navigate to the Inference section to deploy your fine-tuned model as an API endpoint.


Next steps

  • Features — Explore all fine-tuning capabilities in detail.
  • Pricing — Understand GPU billing for fine-tuning jobs.
  • Guides — Model-specific tutorials for Llama, Mistral, Stable Diffusion, and more.
  • FAQs — Troubleshoot common issues.