## Context

- This is my application of a notebook from an AWS training module.
- Minor tweaks such as reducing the number of epochs (WIP)

## Notebook

In this solution, you explore how to fine-tune a pre-trained large language model (LLM), which is a powerful technique in the field of generative artificial intelligence (AI). LLMs are pre-trained on enormous amounts of data, making them highly effective in grasping the nuances of language and generating coherent responses. These models have learned to extract useful features and patterns from the data, making them a valuable resource for various machine learning (ML) tasks.

With fine-tuning, also known as transfer learning, we can use the knowledge gained by a pre-trained model and apply it to a different but related task. Instead of training a model from scratch, we start with a pre-trained model and modify it to adapt to our specific problem domain. This approach saves significant computational resources, and it benefits from the generalization capabilities of the pre-trained model.

This notebook walks you through the step-by-step process of fine-tuning a pre-trained model. It covers the following primary steps:

1. <a href="#step1">Check GPU memory</a>
2. <a href="#step2">Import libraries</a>
3. <a href="#step3">Prepare the training dataset</a>
4. <a href="#step4">Load a pre-trained LLM</a>
5. <a href="#step5">Define the trainer and fine-tune the LLM</a>
6. <a href="#step6">Deploy the fine-tuned model</a>
7. <a href="#step7">Test the deployed inference</a>

NOTE: Work from the top to the bottom of this notebook, and do not skip sections; otherwise, you might receive error messages from missing code.

## <a name="step1">Step 1: Check GPU memory</a>

To check the GPU memory, run the following command.

```python
!nvidia-smi
```

If more than half of the GPU memory is already occupied, shut down other running notebooks before continuing.

## <a name="step2">Step 2: Import libraries</a>

Run the following two code blocks sequentially, one at a time, to import the necessary libraries, including the Hugging Face Transformers library and the PyTorch library, which is a dependency for Transformers.

```python
%%capture
%pip install -r requirements.txt
```

```python
%%capture
import os
import numpy as np
import pandas as pd
from typing import Any, Dict, List, Tuple, Union
from datasets import Dataset, load_dataset, disable_caching

disable_caching()  ## disable huggingface cache

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import TextDataset

import torch
from torch.utils.data import Dataset, random_split  # note: shadows datasets.Dataset imported above
from transformers import TrainingArguments, Trainer
import accelerate
import bitsandbytes

from IPython.display import Markdown
```

## <a name="step3">Step 3: Prepare the training dataset</a>

Load and view the dataset. For this practice lab, you use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) for the main dataset, which has two columns: `instruction` and `response`.

```python
sagemaker_faqs_dataset = load_dataset("csv", data_files='data/amazon_sagemaker_faqs.csv')['train']
sagemaker_faqs_dataset
```

```python
sagemaker_faqs_dataset[0]
```

### <a name="step3.1">Step 3.1: Prepare the prompt</a>

To fine-tune the LLM, you must decorate the instruction dataset with a PROMPT template, as shown below.
```python
from utils.helpers import INTRO_BLURB, INSTRUCTION_KEY, RESPONSE_KEY, END_KEY, RESPONSE_KEY_NL, DEFAULT_SEED, PROMPT

'''
PROMPT = """{intro}
{instruction_key}
{instruction}
{response_key}
{response}
{end_key}"""
'''

Markdown(PROMPT)
```

Now, feed the PROMPT to the dataset through the `_add_text` Python function, which takes a record as input. The function checks that both fields (instruction and response) are not null, and then passes the values to the predefined PROMPT template.

```python
def _add_text(rec):
    instruction = rec["instruction"]
    response = rec["response"]

    if not instruction:
        raise ValueError(f"Expected an instruction in: {rec}")
    if not response:
        raise ValueError(f"Expected a response in: {rec}")

    rec["text"] = PROMPT.format(instruction=instruction, response=response)
    return rec
```

```python
sagemaker_faqs_dataset = sagemaker_faqs_dataset.map(_add_text)
sagemaker_faqs_dataset[0]
```

Use `Markdown` to neatly display the text.

```python
Markdown(sagemaker_faqs_dataset[0]['text'])
```

## <a name="step4">Step 4: Load a pre-trained LLM</a>

To load a pre-trained model, initialize a tokenizer and a base model by using the `databricks/dolly-v2-3b` model from the Hugging Face Transformers library. The tokenizer converts raw text into tokens, and the base model generates text based on a given prompt. The following cells instantiate both components so that you can use them throughout the rest of the notebook.

The `AutoTokenizer.from_pretrained()` Python function is used to instantiate the tokenizer.

- `padding_side="left"` specifies the side of the sequences where padding tokens are added. In this case, padding tokens are added to the left side of each sequence.
- `eos_token` is a special token that represents the end of a sequence. By assigning this token to `pad_token`, any padding tokens added during tokenization are also treated as end-of-sequence tokens. This can be useful when generating text with the model because the model knows to stop generating text after encountering padding tokens.
- `tokenizer.add_special_tokens(...)` adds three additional special tokens to the tokenizer's vocabulary. These tokens mark the instruction section, the response section, and the end of each example in the prompt template defined in Step 3.1.

After the function runs, the `tokenizer` object has been initialized and is ready to use for tokenizing text.

```python
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({"additional_special_tokens": [END_KEY, INSTRUCTION_KEY, RESPONSE_KEY_NL]})
```

```python
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    # use_cache=False,
    device_map="auto",  # "balanced",
    load_in_8bit=True,
)
```

### <a name="step4.1">Step 4.1: Prepare the model for training</a>

Some preprocessing is needed before training an INT8 model using Parameter-Efficient Fine-Tuning (PEFT); therefore, import a utility function, `prepare_model_for_int8_training`, that does the following (see the sketch after this list):

- Cast all the non-INT8 modules to full precision (FP32) for stability.
- Add a forward hook to the input embedding layer to enable gradient computation of the input hidden states.
- Enable gradient checkpointing for more memory-efficient training.
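As a rough illustration of how this utility is applied, here is a minimal sketch, assuming the `model` loaded in Step 4 and the `peft` version pinned in `requirements.txt`; the original lab's cell may differ.

```python
# Sketch (assumption): prepare the INT8 base model for PEFT fine-tuning.
# The same function is imported from peft again in Step 5.1.
from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)
```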
Also resize the token embeddings so that the model accounts for the special tokens added to the tokenizer in Step 4.

```python
model.resize_token_embeddings(len(tokenizer))
```

Use the `mlu_preprocess_batch` function to preprocess the `text` field of each batch, applying tokenization, truncation, and other relevant operations based on the specified maximum length. The function takes a batch of data, a tokenizer, and a maximum length as input. For more details, refer to the `utils/helpers.py` file.

```python
from functools import partial
from utils.helpers import mlu_preprocess_batch

MAX_LENGTH = 256
_preprocessing_function = partial(mlu_preprocess_batch, max_length=MAX_LENGTH, tokenizer=tokenizer)
```

Next, apply the preprocessing function to each batch in the dataset, modifying the `text` field accordingly. The map operation is performed in a batched manner, and the `instruction`, `response`, and `text` columns are removed from the dataset. Finally, `processed_dataset` is created by filtering `encoded_sagemaker_faqs_dataset` based on the length of the `input_ids` field, ensuring that each example fits within the specified `MAX_LENGTH`.

```python
encoded_sagemaker_faqs_dataset = sagemaker_faqs_dataset.map(
    _preprocessing_function,
    batched=True,
    remove_columns=["instruction", "response", "text"],
)

processed_dataset = encoded_sagemaker_faqs_dataset.filter(lambda rec: len(rec["input_ids"]) < MAX_LENGTH)
```

Split the dataset into `train` and `test` for evaluation.

```python
split_dataset = processed_dataset.train_test_split(test_size=14, seed=0)
split_dataset
```

## <a name="step5">Step 5: Define the trainer and fine-tune the LLM</a>

To efficiently fine-tune a model, in this practice lab, you use [LoRA: Low-Rank Adaptation](https://arxiv.org/abs/2106.09685). LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and reduce the GPU memory requirement by 3 times.

### <a name="step5.1">Step 5.1: Define LoraConfig and load the LoRA model</a>

Use the `LoraConfig` class from [huggingface 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning](https://github.com/huggingface/peft). Within `LoraConfig`, specify the following parameters:

- `r`, the dimension of the low-rank matrices
- `lora_alpha`, the scaling factor for the low-rank matrices
- `lora_dropout`, the dropout probability of the LoRA layers

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

MICRO_BATCH_SIZE = 8
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LORA_R = 256  # 512
LORA_ALPHA = 512  # 1024
LORA_DROPOUT = 0.05

# Define LoRA Config
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
```

Use the `get_peft_model` function to initialize the model with the LoRA framework, configuring it based on the provided `lora_config` settings. This way, the model can incorporate the benefits and capabilities of the LoRA optimization approach.

```python
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

As shown, the LoRA trainable parameters amount to only about 3 percent of the full model weights, which is much more efficient.

### <a name="step5.2">Step 5.2: Define the data collator</a>

A data collator is a huggingface 🤗 Transformers function that takes a list of samples from a dataset and collates them into a batch, as a dictionary of PyTorch tensors.
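For intuition, here is a small, hypothetical example (not part of the original notebook) that uses the base `DataCollatorForLanguageModeling` from 🤗 Transformers to pad two tokenized samples into a single batch; it assumes the `tokenizer` initialized in Step 4.

```python
from transformers import DataCollatorForLanguageModeling

# Illustrative only: collate two tokenized samples into one padded batch.
# With mlm=False, labels are a copy of input_ids with padding positions masked (-100).
base_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, return_tensors="pt", pad_to_multiple_of=8
)
example_batch = base_collator(
    [tokenizer("What is Amazon SageMaker?"), tokenizer("A fully managed ML service.")]
)
print(example_batch["input_ids"].shape, example_batch["labels"].shape)
```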
Use `MLUDataCollatorForCompletionOnlyLM`, which extends the functionality of the base `DataCollatorForLanguageModeling` class from the Transformers library. This custom collator is designed to handle examples where a prompt is followed by a response in the input text, and the labels are modified accordingly. For the implementation, refer to `utils/helpers.py`.

```python
from utils.helpers import MLUDataCollatorForCompletionOnlyLM

data_collator = MLUDataCollatorForCompletionOnlyLM(
    tokenizer=tokenizer, mlm=False, return_tensors="pt", pad_to_multiple_of=8
)
```

### <a name="step5.3">Step 5.3: Define the trainer</a>

To fine-tune the LLM, you must define a trainer. First, define the training arguments.

```python
# temp change, 10 epochs has higher val loss!
EPOCHS = 5
LEARNING_RATE = 1e-4
MODEL_SAVE_FOLDER_NAME = "dolly-3b-lora"

training_args = TrainingArguments(
    output_dir=MODEL_SAVE_FOLDER_NAME,
    fp16=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    logging_strategy="steps",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=20000,
    save_total_limit=10,
)
```

This is where the magic happens! Initialize the trainer with the defined model, tokenizer, training arguments, data collator, and the train/eval datasets. The training takes about 10 minutes.

```python
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=split_dataset['train'],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
```

### <a name="step5.4">Step 5.4: Save the fine-tuned model</a>

When the training is completed, you can save the model to a directory by using the `transformers.PreTrainedModel.save_pretrained` function. This function saves only the incremental 🤗 PEFT weights (`adapter_model.bin`) that were trained, so the model is very efficient to store, transfer, and load.

```python
trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)
```

If you want to save the full model that you just fine-tuned, use the `transformers.Trainer.save_model` function, which also saves the training arguments together with the trained model.

```python
trainer.save_model()
```

```python
trainer.model.config.save_pretrained(MODEL_SAVE_FOLDER_NAME)
```

Save the tokenizer along with the trained model.

```python
tokenizer.save_pretrained(MODEL_SAVE_FOLDER_NAME)
```

## <a name="step6">Step 6: Deploy the fine-tuned model</a>

### Overview of deployment parameters

To deploy using the Amazon SageMaker Python SDK with the Deep Java Library (DJL), you must instantiate the `Model` class with the following parameters:

```python
model = Model(
    image_uri,
    model_data=...,
    predictor_cls=...,
    role=aws_role
)
```

- `image_uri`: The Docker image URI representing the deep learning framework and version to be used.
- `model_data`: The location of the fine-tuned LLM model artifact in an Amazon Simple Storage Service (Amazon S3) bucket. It specifies the path to the TAR GZ file containing the model's parameters, architecture, and any necessary artifacts.
- `predictor_cls`: The predictor class used to send requests to the endpoint; it is a simple JSON-in, JSON-out predictor. For more information, see [sagemaker.djl_inference.DJLPredictor](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#djlpredictor).
- `role`: The AWS Identity and Access Management (IAM) role ARN that provides the necessary permissions to access resources, such as the S3 bucket that contains the model data.

### <a name="step6.1">Step 6.1: Instantiate SageMaker parameters</a>

Initialize an Amazon SageMaker session and retrieve information related to the AWS environment, such as the SageMaker role and AWS Region. You also specify the image URI for a specific version of the `djl-deepspeed` framework by using the SageMaker session's Region. The image URI is a unique identifier for a specific Docker container image that can be used in various AWS services, such as Amazon SageMaker or Amazon Elastic Container Registry (Amazon ECR).

```python
# installing sagemaker library
!pip3 install sagemaker==2.237.1
```

```python
import boto3
import json
import sagemaker.djl_inference
from sagemaker.session import Session
from sagemaker import image_uris
from sagemaker import Model

sagemaker_session = Session()
print("sagemaker_session: ", sagemaker_session)

aws_role = sagemaker_session.get_caller_identity_arn()
print("aws_role: ", aws_role)

aws_region = boto3.Session().region_name
print("aws_region: ", aws_region)

image_uri = image_uris.retrieve(framework="djl-deepspeed",
                                version="0.22.1",
                                region=sagemaker_session._region_name)
print("image_uri: ", image_uri)
```

### <a name="step6.2">Step 6.2: Create the model artifact</a>

To upload the model artifact to the S3 bucket, we need to create a TAR GZ file that contains the model's parameters. First, create a directory named `lora_model` and a subdirectory named `dolly-3b-lora`. The `-p` option ensures that the command creates any intermediate directories if they don't exist. Then, copy the LoRA checkpoints, `adapter_model.bin` and `adapter_config.json`, to `dolly-3b-lora`. The base Dolly model is downloaded at runtime from the Hugging Face Hub.

```bash
%%bash
rm -rf lora_model
mkdir -p lora_model
mkdir -p lora_model/dolly-3b-lora
cp dolly-3b-lora/adapter_config.json lora_model/dolly-3b-lora/
cp dolly-3b-lora/adapter_model.bin lora_model/dolly-3b-lora/
```

Next, set the [DJL Serving configuration options](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html) in `serving.properties`. Using the Jupyter `%%writefile` magic command, you can write the following content to a file named `lora_model/serving.properties`.

- `engine=Python`: This line specifies the engine used for serving.
- `option.entryPoint=model.py`: This line specifies the entry point for the serving process, which is set to `model.py`.
- `option.adapter_checkpoint=dolly-3b-lora`: This line sets the checkpoint for the adapter to `dolly-3b-lora`. A checkpoint typically represents the saved state of a model or its parameters.
- `option.adapter_name=dolly-lora`: This line sets the name of the adapter to `dolly-lora`, a component that helps interface between the model and the serving infrastructure.

```python
%%writefile lora_model/serving.properties
engine=Python
option.entryPoint=model.py
option.adapter_checkpoint=dolly-3b-lora
option.adapter_name=dolly-lora
```

You also need the environment requirements file in the model artifact. Create a file named `lora_model/requirements.txt` and write a list of Python package requirements, typically used with package managers such as `pip`.
```python
%%writefile lora_model/requirements.txt
accelerate>=0.16.0,<1
bitsandbytes==0.39.0
click>=8.0.4,<9
datasets>=2.10.0,<3
deepspeed>=0.8.3,<0.9
faiss-cpu==1.7.4
ipykernel==6.22.0
scipy==1.11.1
torch>=2.0.0
transformers==4.28.1
peft==0.3.0
pytest==7.3.2
```

### <a name="step6.3">Step 6.3: Create the inference script</a>

Similar to the fine-tuning notebook, a custom pipeline, `InstructionTextGenerationPipeline`, is defined for inference. The code is provided in `utils/deployment_model.py`. Save these inference functions to `lora_model/model.py`.

```bash
%%bash
cp utils/deployment_model.py lora_model/model.py
```

### <a name="step6.4">Step 6.4: Upload the model artifact to Amazon S3</a>

Create a compressed tarball archive of the `lora_model` directory and save it as `lora_model.tar.gz`.

```bash
%%bash
tar -cvzf lora_model.tar.gz lora_model/
```

Upload the `lora_model.tar.gz` file to the specified S3 bucket.

```python
import boto3
import json
import sagemaker.djl_inference
from sagemaker.session import Session
from sagemaker import image_uris
from sagemaker import Model

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

# Get the name of the bucket whose name starts with 'artifact'
for bucket in s3.buckets.all():
    if bucket.name.startswith('artifact'):
        mybucket = bucket.name
        print(mybucket)

response = s3_client.upload_file("lora_model.tar.gz", mybucket, "lora_model.tar.gz")
```

### <a name="step6.5">Step 6.5: Deploy the model</a>

Now it's time to deploy the fine-tuned LLM by using the SageMaker Python SDK. The SageMaker Python SDK `Model` class is instantiated with the following parameters:

- `image_uri`: The Docker image URI that represents the deep learning framework and version to be used.
- `model_data`: The location of the fine-tuned LLM model artifact in an S3 bucket. It specifies the path to the TAR GZ file that contains the model's parameters, architecture, and any necessary artifacts.
- `predictor_cls`: The predictor class used to send requests to the endpoint; it is a simple JSON-in, JSON-out predictor. For more information, see [sagemaker.djl_inference.DJLPredictor](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#djlpredictor).
- `role`: The IAM role ARN that provides the necessary permissions to access resources, such as the S3 bucket that contains the model data.

```python
model_data = "s3://{}/lora_model.tar.gz".format(mybucket)

model = Model(image_uri=image_uri,
              model_data=model_data,
              predictor_cls=sagemaker.djl_inference.DJLPredictor,
              role=aws_role)
```

NOTE: The deployment should complete within about 10 minutes. If it takes much longer than that, the endpoint creation might have failed.

```python
%%time
predictor = model.deploy(1, "ml.g4dn.2xlarge")
```

## <a name="step7">Step 7: Test the deployed inference</a>

Test the inference endpoint with [predictor.predict](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.predict).

```python
outputs = predictor.predict({"inputs": "What solutions come pre-built with Amazon SageMaker JumpStart?"})
```

```python
from IPython.display import Markdown

Markdown(outputs)
```
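Optionally, when you finish experimenting, you can delete the endpoint so that the `ml.g4dn.2xlarge` instance stops incurring charges. This cleanup step is not part of the original lab flow; it assumes the standard SageMaker predictor API.

```python
# Optional cleanup (assumption): delete the endpoint created in Step 6.5.
predictor.delete_endpoint()
```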