
Part 3: Fine-tuning your LLM using the MLX framework

This post is the third of four parts in A simple guide to local LLM fine-tuning on a Mac with MLX.


This part of the guide assumes you have training data available to fine-tune with. The data should be in JSONL format; if that’s not the case, check out Part 2: Building your training data for fine-tuning.
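
For reference, the lora example in mlx-examples expects each line of your JSONL files to be a standalone JSON object with a text key. The example line below is purely illustrative; your actual content will come from Part 2:

{"text": "<s>[INST] What is fine-tuning? [/INST] Fine-tuning is further training a pretrained model on your own data.</s>"}
An example line from train.jsonl, using the Mistral Instruct prompt format.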

Step 1: Clone the mlx-examples GitHub repo

Machine learning research folks at Apple have created a fantastic new framework called MLX. MLX is an array framework built for Apple silicon, which lets it take advantage of the unified memory in Macs and brings significant performance improvements.
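
If you want a feel for the framework itself, here’s a minimal, illustrative Python snippet (nothing specific to fine-tuning). MLX arrays live in unified memory and computations are evaluated lazily:

import mlx.core as mx

a = mx.random.uniform(shape=(1024, 1024))
b = mx.random.uniform(shape=(1024, 1024))
c = a @ b    # builds the computation lazily
mx.eval(c)   # evaluation runs on the GPU, with no host/device copies
A quick taste of MLX’s NumPy-like API.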

The mlx-examples GitHub repo contains a whole range of different examples for using MLX. It’s a great way to learn how to use the framework.

I’m going to assume you have knowledge of Git and GitHub here. You can clone the MLX examples repo using:

git clone https://github.com/ml-explore/mlx-examples.git
Clone the mlx-examples repo from GitHub.

For this guide we’re going to focus on the LoRA (Low-Rank Adaptation) example, which we’ll use to fine-tune Mistral 7B Instruct and create our fine-tuning adapters.
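
As a quick aside on what LoRA actually does: instead of updating the model’s full weight matrices, it freezes them and learns a small low-rank update for a subset of layers. Here’s a rough sketch of the idea (illustrative only, not the mlx-examples implementation):

import mlx.core as mx

d, r = 4096, 8                 # model dimension, LoRA rank (r << d)
W = mx.random.normal((d, d))   # frozen base weight
A = mx.random.normal((r, d))   # the two small trainable matrices...
B = mx.zeros((d, r))           # ...with B zeroed so training starts from W

x = mx.random.normal((1, d))
y = x @ (W + B @ A).T          # effective weight is W + BA
The low-rank update that gives LoRA its name.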

Step 2: Start the fine-tuning process using your training data

In prior versions of MLX you needed to convert your model to be MLX compatible. That’s no longer the case: you can specify a model from Hugging Face and immediately start fine-tuning with it.

We can use the train.jsonl and valid.jsonl files from the last step to kick off the training procedure.
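
Your data folder should look something like this (the folder name is whatever you’ll pass to --data later):

data/
├── train.jsonl
└── valid.jsonl
The expected layout of your training data folder.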

The first step is to head to the mlx-examples/lora folder and install the requirements for the LoRA scripts:

cd mlx-examples/lora
pip install -r requirements.txt
Move to the lora folder and install requirements.

Once requirements are installed, we can use the lora.py script to initiate training.

There are a few parameters to know about. You can run the script with the --help parameter to learn more about them, or check out the documentation on GitHub.

This is the exact command I used to start the fine-tuning:

python lora.py \
  --train \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --data ~/Developer/AI/Models/fine-tuning/data \
  --batch-size 2 \
  --lora-layers 8 \
  --iters 1000
The command to initiate the fine-tuning process.

The --model argument should point to the path of the model on Hugging Face. If you have a local model, you can instead provide the directory of a local MLX-converted model. For more info on the conversion process, check out the conversion docs.

The --data argument is optional and should point to a folder that contains your train.jsonl and valid.jsonl files; by default, the script looks in the current directory.

You’ll want to experiment with the --batch-size and --lora-layers arguments. Generally speaking, the larger your system memory, the higher these numbers can go. The documentation has some good tips for this.

The --iters parameter here isn’t strictly necessary, since 1000 iterations is the default. I’m including it because it’s still something you might want to experiment with.
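
For example, if you’re tight on memory, a more conservative variant of the command above might look like this (the reduced values match the lower settings I mention below for smaller machines):

python lora.py \
  --train \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --data ~/Developer/AI/Models/fine-tuning/data \
  --batch-size 1 \
  --lora-layers 4
A lower-memory variant of the training command.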

Step 3: Sit back and hear your fans for the first time

Once you’ve kicked off the training process, it’ll give you useful feedback on training loss, validation loss, and iterations and tokens per second.

With my M1 Max and 32GB of system memory, a batch size of 1 and 4 LoRA layers got me roughly 260 tokens/sec on average. With the settings above, I ended up closer to 120 tokens/sec on average:

Loading pretrained model
Total parameters 7242.584M
Trainable parameters 0.852M
Loading datasets
Training
Iter 1: Val loss 0.868, Val took 94.857s
Iter 10: Train loss 0.793, It/sec 0.223, Tokens/sec 229.758
Iter 20: Train loss 0.683, It/sec 0.293, Tokens/sec 218.803
Iter 30: Train loss 0.768, It/sec 0.072, Tokens/sec 94.586
Iter 40: Train loss 0.686, It/sec 0.081, Tokens/sec 78.442
Iter 50: Train loss 0.734, It/sec 0.230, Tokens/sec 191.031
Iter 60: Train loss 0.600, It/sec 0.043, Tokens/sec 85.585
Iter 70: Train loss 0.691, It/sec 0.027, Tokens/sec 33.718
Iter 80: Train loss 0.568, It/sec 0.031, Tokens/sec 46.508

[...]
Model training output over time.

I’ve also tried training on an M1 MacBook Air with 16GB of system memory. It’s obviously slower, but with lower values for --batch-size and --lora-layers, and nothing else running, you should be OK. You can also try using 4-bit quantization to reduce memory needs, as sketched below.
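
At the time of writing, the lora folder includes a convert.py script that can quantize a model while converting it to MLX format. A sketch of that (double-check the conversion docs in case the flags have changed):

python convert.py \
  --hf-path mistralai/Mistral-7B-Instruct-v0.2 \
  -q
Convert the model to MLX format with 4-bit quantization.

You’d then point the --model argument at the converted model’s output folder when training.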

Based on benchmarks, newer systems with more memory will train faster.

Step 4: Locate your adapters.npz

Once the fine-tuning has completed, you’ll find an adapters.npz file in the folder you kicked off the training from. This file contains the weight changes your fine-tuning learned on top of the base model; you’ll supply the location of this file in the final step: Testing and interacting with your fine-tuned LLM.
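
Because adapters.npz is a standard NumPy archive, you can do a quick sanity check on it before moving on. A minimal sketch (the exact key names depend on which layers were trained):

import numpy as np

adapters = np.load("adapters.npz")
print(len(adapters.files), "arrays saved")   # one pair of low-rank matrices per adapted layer
for name in adapters.files[:4]:
    print(name, adapters[name].shape)
Peek inside adapters.npz to confirm the fine-tuning saved something sensible.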


If you have any feedback, questions, or suggestions for this part of the guide please drop them on the Twitter/X thread, or on the Mastodon thread.


Part 2: Building your training data for fine-tuning

Part 4: Testing and interacting with your fine-tuned LLM
