
Part 4: Testing and interacting with your fine-tuned LLM

This post is the final part of: A simple guide to local LLM fine-tuning on a Mac with MLX.


This part of the guide assumes you have completed the fine-tuning process and have an adapters.npz file ready to use. If that’s not the case, check out Part 3: Fine-tuning your LLM using the MLX framework.

Step 1: Determine prompts to compare the results of fine-tuning

To test the results of your fine-tuned model, you’ll want to compare how it and the base model respond to the same prompts.

I’m not going to go into detail on the best way to do this, but you’ll likely want to create a number of prompts that target the outputs you’re trying to improve with fine-tuning.

Depending on the kind of fine-tuning you’re doing, you may have quantitative or qualitative ways to evaluate the results.
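
One lightweight approach is to keep your test prompts in a plain text file, one per line, so you can run the exact same wording past both models (and reuse it in a loop later). The file name prompts.txt and the prompts below are just placeholders; yours should target whatever behavior you fine-tuned for.

How do I build an online store using WordPress that will allow me to sell car parts? I want to be able to ship them and charge people using credit cards.
How do I add a custom post type to my WordPress site from a plugin?
What is the difference between WordPress actions and filters?
An example prompts.txt file with one test prompt per line.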

Step 2: Test your prompts on the base and fine-tuned models

The MLX library has some built-in functionality to handle model inference.

To prompt the base model, head to the llms folder inside the mlx-examples repo and use the mlx_lm.generate command:

python -m mlx_lm.generate \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --max-tokens 1000 \
  --prompt "How do I build an online store using WordPress that will allow me to sell car parts? I want to be able to ship them and charge people using credit cards."
The command to prompt the base model.

To prompt your fine-tuned model, you’ll need to provide the location of your adapters file. You can do this from the lora folder in the mlx-examples repo, using the lora.py script with the following parameters:

python lora.py \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --adapter-file ./fine-tuning/adapters/adapters.npz \
  --num-tokens 1000 \
  --prompt "How do I build an online store using WordPress that will allow me to sell car parts? I want to be able to ship them and charge people using credit cards."
The command to prompt your fine-tuned model.

In the above two commands, be sure to replace the --model and --adapter-file arguments with the correct Hugging Face path and adapters file location for your setup.

The num-tokens argument is optional and defaults to 1000. Depending on the length of the answers you expect, you may need to adjust this.
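
If you saved your test prompts to a file like the prompts.txt sketch above, you can loop over them and capture each answer for later comparison. This is a minimal shell sketch rather than part of the MLX tooling; it assumes prompts.txt is in the current directory and writes one output file per prompt.

mkdir -p results
i=0
while IFS= read -r prompt; do
  i=$((i+1))
  # Prompt the base model and save its answer. Repeat the loop with the
  # lora.py command (from the lora folder) to capture the fine-tuned answers.
  python -m mlx_lm.generate \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --max-tokens 1000 \
    --prompt "$prompt" > "results/base-$i.txt"
done < prompts.txt
A sketch for running every test prompt through the base model and saving the answers.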

Step 3: Analyze your results

My WordPress fine-tuning of Mistral 7B was simply a fun exercise, so I wasn’t prepared to do a very thorough analysis of the results.

I tried 10 to 20 different prompts with more complex WordPress questions and evaluated how thorough and complete the answers were compared to the base model’s.

As I mentioned above, you may have specific ideas on how you want to evaluate depending on the kind of results you’re looking for. That’s up to you.
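
If you captured both sets of outputs with the loop from Step 2, even a simple side-by-side diff gives you a quick qualitative read (the file names below are the hypothetical ones from that sketch):

diff -y -W 200 results/base-1.txt results/tuned-1.txt
A quick side-by-side comparison of the base and fine-tuned answers for one prompt.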

For completeness, here’s the prompt output for the base model and the fine-tuned model for the prompt used in the examples above. You can see the depth and quality of the answer has significantly improved, but the response length has also drastically increased. More work is needed!

Step 4: Fuse your adapters into a new model

If you’re happy with your new fine-tuned model, you can proceed to fuse your LoRA adapters file back into the base model.

Doing this makes running the model easier with traditional inference tools (although GGUF format is not quite ready yet). You can also upload your fine-tuned model back to the Hugging Face MLX community if you wish.

To fuse your model, run the following command from the lora folder:

python fuse.py \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --adapter-file ./fine-tuning/adapters/adapters.npz \
  --save-path ./Models/My-Mistral-7B-fine-tuned \
  --upload-name My-Mistral-7B-fine-tuned \
  --de-quantize
  
The command to fuse your adapters with the base model.

The adapter-file and save-path arguments are optional if your adapters file is in the current directory and you want to save the new model there as well.

If you don’t want to upload your fine-tuned model to Hugging Face, you can omit the upload-name argument and the fused model will only be saved to your save-path location (or the same folder, if you leave that argument out too).

In the next section, we’ll convert your fine-tuned model to GGUF format. To do this, we have to de-quantize with the --de-quantize argument. If you don’t want to convert to GGUF, you can omit this argument.

Lastly, if you started with a locally converted MLX base model, provide its folder location as the model argument. In this case, you’ll need to provide the original Hugging Face model path via the hf-path argument in order to upload the fine-tuned version to Hugging Face.
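
As an illustrative example (the ./Models/Mistral-7B-Instruct-v0.2-mlx path below is a placeholder for wherever your local MLX conversion lives), the fuse command would then look something like this:

python fuse.py \
  --model ./Models/Mistral-7B-Instruct-v0.2-mlx \
  --hf-path mistralai/Mistral-7B-Instruct-v0.2 \
  --adapter-file ./fine-tuning/adapters/adapters.npz \
  --save-path ./Models/My-Mistral-7B-fine-tuned \
  --upload-name My-Mistral-7B-fine-tuned
A sketch of the fuse command when starting from a locally converted MLX base model.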

What about chatting with my model?

Prompting using the tooling in the MLX Examples library is one thing, but the most obvious question is: “How do I have an ongoing conversation with my fine-tuned model?”

The answer is to convert it to GGUF format. While the mlx-examples repo doesn’t have a built-in way to do this, you can convert the fine-tuned model that you fused in the previous step.

To do this, you’ll need to follow the instructions for converting a model to GGUF using llama.cpp. Once you have a GGUF version of your model, you’ll be able to run inference in Ollama, LM Studio, and other apps.
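
As a rough sketch of that flow (the conversion script’s name and flags change between llama.cpp releases, so check the llama.cpp documentation for the current equivalents):

# Convert the fused model folder to a single GGUF file
# (script name and flags depend on your llama.cpp version)
python convert-hf-to-gguf.py ./Models/My-Mistral-7B-fine-tuned \
  --outfile my-mistral-7b-fine-tuned.gguf

# Point Ollama at the GGUF file and start chatting
echo "FROM ./my-mistral-7b-fine-tuned.gguf" > Modelfile
ollama create my-mistral-7b-fine-tuned -f Modelfile
ollama run my-mistral-7b-fine-tuned
A sketch of converting the fused model to GGUF and chatting with it in Ollama.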

Thanks for reading

I hope you found this guide helpful! If you have any feedback, questions, or suggestions for any part of this guide please drop them on the Twitter/X thread, or on the Mastodon thread.

