# Fine-tune Google Flan UL2 with Multiple GPUs

In this tutorial, we will fine-tune Flan-UL2 by combining PEFT, LoRA, and DeepSpeed on multiple GPUs (single machine). With A100 80GB cards, we can expect the training to finish in about 24 hours for 3 epochs.

[Flan-UL2](https://www.yitay.net/blog/flan-ul2-20b) is an encoder-decoder model based on the [T5](https://arxiv.org/pdf/1910.10683.pdf) architecture. It is a 20B parameter model fine-tuned using the "Flan" prompt tuning and dataset collection. The model was initialized using UL2 checkpoints. For more information on UL2, please take a look at [the original paper](https://arxiv.org/abs/2205.05131v1).

While this article focuses on using multiple GPUs on the TIR platform, you can follow the same process with a single GPU at the cost of training speed. We could also try to fit the model across GPUs using DeepSpeed's pipeline parallelism, but that requires adjusting the device map, mainly because T5 blocks in UL2 have residual connections that cannot be split across GPUs. There are workarounds, but for the scope of this tutorial we will use DeepSpeed's ZeRO stage 3 with CPU offloading. Each GPU runs its own training process, while the model parameters and optimizer states are partitioned across the GPUs and offloaded to CPU memory.

## Steps

1. **Start a GPU Notebook** (with 4xA10080 Plan and at least 100GB disk) from the TIR Dashboard.

2. **Install Requirements.**

```bash
!pip install accelerate transformers peft deepspeed datasets
```

3. **Initiate setup of Accelerate Config.**

```bash
accelerate config --config_file launcher_config.yaml
```

4. **Choose the following parameters:**

```bash
In which environment are you running? This machine
Which type of machine are you using? Multi-GPU
How many different machines will you use? 1
Do you wish to optimize your script with torch dynamo? No
Do you want to use DeepSpeed? Yes
Do you want to specify json for deepspeed config? No
What should be your DeepSpeed's ZeRO optimization stage? 3
Where to offload optimizer states? cpu
Where to offload parameters? cpu
How many gradient accumulation steps are you passing to the script? 8
Do you want to use gradient clipping? no
Do you want to save 16-bit model.. ? no
Do you want to enable 'deepspeed.zero.Init' ... ? no
How many GPUs should be used for training? 4
Do you wish to use FP16 or BF16? no
```

5. **Prepare Dataset** The fine-tuning script can work with any CSV file. Here, we will use the Alpaca dataset. Create a new file in JupyterLab named `prepare_alpaca_csv.py` and copy the following contents to this file.

```python
import json

import pandas as pd

# Load the raw Alpaca records (instruction, input, output)
with open('alpaca_data.json') as f:
    data = json.load(f)

new_format = []
for point in data:
    if len(point['input']) == 0:
        # No input: the prompt holds only the instruction
        inputt = "Below is an instruction that describes a task.\n "
        inputt += "Write a response that appropriately completes the request.\n\n"
        inputt += f"### Instruction:\n{point['instruction']}\n\n### Response:"
    else:
        # With input: add an "### Input:" section to the prompt
        inputt = "Below is an instruction that describes a task.\n "
        inputt += "Write a response that appropriately completes the request.\n\n"
        inputt += f"### Instruction:\n{point['instruction']}\n\n### Input:\n{point['input']}\n\n### Response:"
    item = {'input': inputt, 'output': str(point['output'])}
    new_format.append(item)

# Write the prompt/response pairs to a CSV with 'input' and 'output' columns
df = pd.DataFrame(new_format)
df = df.dropna()
df.to_csv('alpaca_data.csv')
```

6. **Open a terminal in JupyterLab and run the following commands to download the Alpaca dataset and prepare a CSV from it.**

```bash
# download the json alpaca dataset
wget https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json

# run in the terminal shell
python prepare_alpaca_csv.py
```

7. **Prepare Fine-Tuning Script** Create a file named `train.py` in JupyterLab (from the file browser) and copy the following contents to the file.

```python
# Modified from https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
import argparse
import gc
import logging
import os
import threading

import psutil
import torch
from accelerate import Accelerator
from datasets import load_dataset
from deepspeed.accelerator import get_accelerator
from peft import LoraConfig, TaskType, get_peft_model
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          get_linear_schedule_with_warmup, set_seed)


def b2mb(x):
    """Convert bytes to megabytes."""
    return int(x / 2**20)


class TorchTracemalloc:
    """Context manager that tracks the peak memory usage of the process."""

    def __enter__(self):
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_max_memory_allocated()  # reset the peak gauge to zero
        self.begin = torch.cuda.memory_allocated()
        self.process = psutil.Process()

        self.cpu_begin = self.cpu_mem_used()
        self.peak_monitoring = True
        # Poll CPU memory in a background thread until __exit__ stops it
        peak_monitor_thread = threading.Thread(target=self.peak_monitor_func)
        peak_monitor_thread.daemon = True
        peak_monitor_thread.start()
        return self

    def cpu_mem_used(self):
        """Get resident set size memory for the current process."""
        return self.process.memory_info().rss

    def peak_monitor_func(self):
        self.cpu_peak = -1
        while True:
            self.cpu_peak = max(self.cpu_mem_used(), self.cpu_peak)
            if not self.peak_monitoring:
                break

    def __exit__(self, *exc):
        self.peak_monitoring = False
        gc.collect()
        torch.cuda.empty_cache()
        self.end = torch.cuda.memory_allocated()
        self.peak = torch.cuda.max_memory_allocated()
        self.used = b2mb(self.end - self.begin)
        self.peaked = b2mb(self.peak - self.begin)

        self.cpu_end = self.cpu_mem_used()
        self.cpu_used = b2mb(self.cpu_end - self.cpu_begin)
        self.cpu_peaked = b2mb(self.cpu_peak - self.cpu_begin)


# Handle argument parsing
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str,
                        help="Model path. Supports T5/UL2 models")
    parser.add_argument("--datafile_path", type=str, default="sample.csv",
                        help="Path to the already processed dataset.")
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="Number of epochs to train for.")
    parser.add_argument("--per_device_batch_size", type=int, default=2,
                        help="Batch size to use for training.")
    parser.add_argument("--input_max_length", type=int, default=128,
                        help="Maximum input length to use for generation")
    parser.add_argument("--target_max_length", type=int, default=128,
                        help="Maximum target length to use for generation")
    parser.add_argument("--lr", type=float, default=3e-4,
                        help="Learning rate to use for training.")
    parser.add_argument("--seed", type=int, default=42,
                        help="Seed to use for training.")
    parser.add_argument("--input_column", type=str, default='input',
                        help='csv input text column')
    parser.add_argument("--target_column", type=str, default='output',
                        help='csv target text column')
    parser.add_argument("--save_path", type=str, default='peft_ckpt',
                        help="Save path")
    # parse_known_args returns (namespace, unknown_args)
    args = parser.parse_known_args()
    return args


# Main function
def main():
    args, _ = parse_args()
    text_column = args.input_column
    label_column = args.target_column
    lr = args.lr
    num_epochs = args.num_epochs
    batch_size = args.per_device_batch_size
    seed = args.seed
    model_name_or_path = args.model_path
    data_file = args.datafile_path
    save_path = args.save_path
    target_max_length = args.target_max_length
    source_max_length = args.input_max_length

    # Create dir if it doesn't exist
    os.makedirs(save_path, exist_ok=True)

    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
        handlers=[
            logging.FileHandler(os.path.join(save_path, "training_log.log")),
            logging.StreamHandler()
        ]
    )
    logging.info(f'Args:\n {args}')

    # Launch configs
    accelerator = Accelerator()
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1
    )
    set_seed(seed)

    # Save logs only on the main process
    @accelerator.on_main_process
    def log_info(logging, s):
        logging.info(s)

    # Load dataset
    dataset = load_dataset('csv', data_files={'train': data_file})
    log_info(logging, f"Dataset length :{len(dataset['train'])}")

    # Load model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

    # Load peft model
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def preprocess_function(sample, padding="max_length"):
        # Create prompted input
        inputs = sample[text_column]

        # Tokenize inputs
        model_inputs = tokenizer(
            inputs, max_length=source_max_length, padding=padding, truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=sample[label_column],
                           max_length=target_max_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id
        # in the labels by -100 when we want to ignore padding in the loss.
        if padding == "max_length":
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label]
                for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    # Prepare and preprocess the dataset
    with accelerator.main_process_first():
        # Preventing string conversion errors
        def str_convert(example):
            example[label_column] = str(example[label_column])
            return example

        dataset['train'] = dataset['train'].map(str_convert)

        processed_datasets = dataset.map(
            preprocess_function,
            batched=True,
            num_proc=1,
            remove_columns=dataset["train"].column_names,
            load_from_cache_file=True,
            desc="Running tokenizer on dataset",
        )
    accelerator.wait_for_everyone()

    train_dataset = processed_datasets["train"]

    def collate_fn(examples):
        return tokenizer.pad(examples, padding="longest", return_tensors="pt")

    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn,
        batch_size=batch_size, pin_memory=True
    )

    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Learning rate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )

    # Accelerator preparation
    model, train_dataloader, optimizer, lr_scheduler = accelerator.prepare(
        model, train_dataloader, optimizer, lr_scheduler
    )

    # Train the model


# Entry point so that `accelerate launch train.py` actually runs main()
if __name__ == "__main__":
    main()
```

8. **Run the training with Accelerate.**

```bash
accelerate launch --config_file launcher_config.yaml train.py \
    --model_path google/flan-ul2 \
    --datafile_path alpaca_data.csv \
    --save_path checkpoint \
    --num_epochs 1 \
    --lr 1e-4 \
    --per_device_batch_size 2 \
    --input_max_length 256 \
    --target_max_length 256
```

9. **Prepare checkpoint for inference** DeepSpeed will create sharded model files and a `zero_to_fp32.py` script in the checkpoint folder. Run the following command to convert the checkpoints to a PyTorch `.bin` file.

```bash
# replace 0 with the latest checkpoint number
python ./checkpoint/zero_to_fp32.py ./checkpoint/ ./checkpoint/0/adapter_model.bin
```

10. **Load fine-tuned model for inference.**

```python
# Start a new notebook and run this code from a notebook cell.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Use forward slashes: '\0' and '\f' are escape sequences in Python strings
peft_model_id = 'checkpoint/0'
base_model_id = 'google/flan-ul2'

model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)  # The original model path
model = PeftModel.from_pretrained(model, peft_model_id)  # The fine-tuned model path
```

You may also merge the LoRA weights into the base model weights.

## Conclusion

In this tutorial, we have fine-tuned a large language model on multiple GPUs using DeepSpeed and LoRA. A similar approach can be followed for fine-tuning other models.

---
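
The `train.py` listing in step 7 stops at its `# Train the model` comment. As a rough sketch of what could follow there, and not the original script's code, the loop below follows the pattern of the upstream PEFT example linked at the top of `train.py`: forward pass, `accelerator.backward`, then optimizer and scheduler steps. The function name `run_training` and its return value (mean loss per epoch) are assumptions made for illustration. The arguments are the objects returned by `accelerator.prepare`.

```python
def run_training(model, train_dataloader, optimizer, lr_scheduler,
                 accelerator, num_epochs):
    """Hypothetical helper: run a plain training loop and return
    the mean training loss of each epoch."""
    epoch_losses = []
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        num_batches = 0
        for batch in train_dataloader:
            # The tokenized batch already contains `labels`,
            # so the model output carries the loss
            outputs = model(**batch)
            loss = outputs.loss
            # Let Accelerate (and DeepSpeed underneath) run the backward pass
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            total_loss += float(loss.detach())
            num_batches += 1
        # Guard against an empty dataloader
        epoch_losses.append(total_loss / max(num_batches, 1))
    return epoch_losses
```

Inside `main()`, this could be called as `run_training(model, train_dataloader, optimizer, lr_scheduler, accelerator, num_epochs)`. The sketch does not cover writing checkpoints, which step 9 relies on.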