Fine-tuning LLaMA-3 with LLaMA Factory on TIR
LLaMA Factory is an easy-to-use platform for fine-tuning large language models. With E2E GPU Nodes on TIR, you can fine-tune models such as LLaMA-3 using either the CLI or the WebUI.
Step 1: Setup Environment
Open JupyterLab (Python 3 Notebook) on your GPU Node and run:
%cd ~/
!rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
!pip install -e ".[torch,bitsandbytes]"
!pip install bitsandbytes
Verify GPU:
import torch
assert torch.cuda.is_available(), "GPU not detected"
Step 2: Select GPUs
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multi-GPU; set this before CUDA is initialized (i.e., before importing torch)
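As a quick sanity check, you can set the variable and read it back before any CUDA library is imported; a minimal sketch (the device indices here are placeholders for your node's GPUs):

```python
import os

# CUDA_VISIBLE_DEVICES must be set before torch (or any CUDA library)
# initializes the driver; setting it afterwards has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose the first two GPUs

# Frameworks see only the listed devices, renumbered from zero:
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) will be visible, remapped to ids 0..{len(visible) - 1}")
```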
Step 3: Create Training Config
The LLaMA Factory CLI requires a YAML/JSON config file as its argument, not just a dataset path; dataset names must match entries defined in data/dataset_info.json.
Example: train_llama3.yaml
model_name_or_path: unsloth/llama-3-8b-Instruct-bnb-4bit
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: alpaca
template: llama3
output_dir: ./output/llama3-lora
per_device_train_batch_size: 2
num_train_epochs: 1
learning_rate: 2e-5
fp16: true
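From the notebook, the config can be written to disk with a %%writefile cell or plain Python; a minimal sketch that mirrors the example above (model, dataset, and hyperparameter values are the ones from this guide, not requirements):

```python
from pathlib import Path

# Write the training config from this guide to train_llama3.yaml.
config = """\
model_name_or_path: unsloth/llama-3-8b-Instruct-bnb-4bit
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: alpaca
template: llama3
output_dir: ./output/llama3-lora
per_device_train_batch_size: 2
num_train_epochs: 1
learning_rate: 2e-5
fp16: true
"""
path = Path("train_llama3.yaml")
path.write_text(config)
print(path.exists())  # → True
```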
Step 4: Run Training
!llamafactory-cli train train_llama3.yaml
Note
For Meta's official LLaMA-3 models, you must request access on Hugging Face and then log in:
!huggingface-cli login
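In a non-interactive notebook, the token can also be supplied via the HF_TOKEN environment variable, which huggingface_hub reads automatically when no cached login is present; a minimal sketch (the token value is a placeholder, not a real credential):

```python
import os

# Placeholder token: replace with the value from your Hugging Face
# account settings (Settings -> Access Tokens).
os.environ["HF_TOKEN"] = "hf_your_token_here"

# Downstream libraries (huggingface_hub, transformers) pick up HF_TOKEN
# for gated-model downloads when no cached login exists.
print(os.environ["HF_TOKEN"].startswith("hf_"))  # → True
```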
Step 5: Inference / Chat
!llamafactory-cli chat train_llama3.yaml
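To chat with the freshly trained adapter rather than reuse the training file, a separate inference config is typical; a sketch modeled on LLaMA Factory's example inference configs (the file name is hypothetical, and the paths assume the training output directory above):

```yaml
# chat_llama3.yaml -- hypothetical file name
model_name_or_path: unsloth/llama-3-8b-Instruct-bnb-4bit
adapter_name_or_path: ./output/llama3-lora
template: llama3
finetuning_type: lora
```

Then run: !llamafactory-cli chat chat_llama3.yaml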
Step 6: Merge LoRA and Export (Optional)
!llamafactory-cli export merge_llama3.yaml
Note
Merging an 8B model requires roughly 18 GB of CPU RAM.
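The merge config is not shown above; a sketch modeled on LLaMA Factory's export examples (the file name and paths are assumptions — note that a quantized base such as the bnb-4bit checkpoint cannot be merged, so a full-precision base model is used here):

```yaml
# merge_llama3.yaml -- hypothetical contents
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # full-precision base; quantized bases cannot be merged
adapter_name_or_path: ./output/llama3-lora
template: llama3
finetuning_type: lora
export_dir: ./output/llama3-merged
export_size: 2          # shard size in GB
export_device: cpu
export_legacy_format: false
```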
Step 7: WebUI Option
You can also fine-tune via LlamaBoard, LLaMA Factory's Gradio-based WebUI:
!GRADIO_SHARE=0 llamafactory-cli webui
GRADIO_SHARE=0 keeps the UI local (Gradio's default port is 7860), so access it through your node's exposed port or an SSH tunnel.