
Part 2: Building your training data for fine-tuning

This post is the second of four parts of: A simple guide to local LLM fine-tuning on a Mac with MLX.


This part of the guide assumes you have your environment set up correctly with Python and pip. If that’s not the case, start with part one. If you already have fine-tuning training data in JSONL format, you can skip ahead to the fine-tuning part.

There are a number of different tools for running LLMs locally on a Mac. One good option is LM Studio, which provides a nice UI for running and chatting with offline LLMs. For this guide I’m going to use Ollama, as it provides a local API that we’ll use to build the fine-tuning training data.

Step 1: Download Ollama and pull a model

Go ahead and download and install Ollama. For this guide I’m going to use the Mistral 7B Instruct v0.2 model from Mistral.ai. You’re welcome to pull a different model if you prefer; just substitute your chosen model everywhere from this point on.

To pull the model use the following command:

ollama pull mistral
Pull down Mistral 7B and use the default instruct model.

Once you have the model, you can optionally use the run command in place of pull to try interacting with it from the command line.
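For example, a quick way to check everything is working before moving on (the question here is just an illustration):

ollama run mistral
>>> How do I create a page in WordPress?
Running Mistral 7B interactively from the command line.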

Step 2: Determine the correct training data format

In order to fine-tune Mistral 7B we’ll need training data. The training data is what allows the fine-tuned model to produce higher quality results than prompting alone. Your training data should be full of examples of the kind of results you’d want to see once fine-tuned.

Training data for Mistral 7B should be in JSONL format, with each line formatted as follows:

{"text":"<s>[INST] Instruction[/INST] Model answer</s>[INST] Follow-up instruction[/INST]"}
A single line of training data in JSONL format.
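For example, a single WordPress-flavoured line in that format could look like this (the answer text is made up for illustration):

{"text":"<s>[INST] Can I change the permalink structure in WordPress?[/INST] Yes. Go to Settings > Permalinks in the WordPress admin, choose a structure, and save your changes.</s>"}
One complete training example in the Mistral instruct format.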

If you’re training a different model, the format may look different — so be sure to check.

If you have existing training data in a spreadsheet, database, or some other store, your goal is to get it into the above JSONL format. Once you’ve done that you can move on to the fine-tuning part.
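For instance, if your existing data is a two-column spreadsheet exported to CSV, a small script along these lines could do the conversion. This is a minimal sketch rather than part of the guide’s code: the questions.csv filename and the question/answer column order are assumptions.

PHP
<?php
// Minimal sketch: convert a CSV of question,answer rows into Mistral-style JSONL.
// Assumes questions.csv has the question in column one and the answer in column two.
$in  = fopen( 'questions.csv', 'r' );
$out = fopen( 'train.jsonl', 'w' );

while ( ( $row = fgetcsv( $in ) ) !== false ) {
	list( $question, $answer ) = $row;
	$line = [ 'text' => '<s>[INST] ' . $question . '[/INST] ' . $answer . '</s>' ];
	fwrite( $out, json_encode( $line, JSON_UNESCAPED_SLASHES ) . "\n" );
}

fclose( $in );
fclose( $out );
A sketch for converting existing question and answer pairs into the JSONL format.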

Step 3: Create a prompt for generating instructions

I didn’t have any good training data on hand, so I decided to generate it. I wanted to fine-tune the model to produce more extensive answers to questions about working with WordPress websites.

I used Ollama to run a range of different models, giving each one the following prompt:

Please list in JSON format 50 frequently asked questions about WordPress from all levels of users working on WordPress websites.

The questions should start with any of the following: “Where do I", "Is it okay to", "Can you help me", "I need to", "Is there a", "Do you know", "Where is the", "Can you tell me", "Can I change", "What are the", "How do I", "When is it", "Does WordPress have", "How to", "What is the difference", "Can users", "Can I", "What is”.

You do not need to provide an answer or category to each question. The list should be a single dimension array of only questions.
The prompt used with various models.

It took some trial and error to get to that prompt. Asking for instructions to start with a specific phrase helped improve the range of questions generated.

It’s worth spending time here to figure out a prompt that produces the best list of instructions for your needs.

Step 4: Generate the instructions with Ollama

Once you have a good generation prompt you can run it through Ollama, switching up the models you use. I added each set of generated instructions to a single JSON file.

Here’s what a truncated result looked like for Mistral 7B:

 ollama run mistral
>>> Please list in JSON format 50 frequently asked questions about WordPress from all levels of users working on WordPress websites.
...
... The questions should start with any of the following: “Where do I", "Is it okay to", "Can you help me", "I need to", "Is there a",
... "Do you know", "Where is the", "Can you tell me", "Can I change", "What are the", "How do I", "When is it", "Does WordPress have",
... "How to", "What is the difference", "Can users", "Can I", "What is”.
...
... You do not need to provide an answer or category to each question. The list should be a single dimension array of only questions.
 [
"Where do I find the WordPress login url?",
"Is it okay to install WordPress on a local server?",
"Can you help me recover a lost WordPress password?",
"I need to update my WordPress version, how do I do that safely?",
"Is there a limit to the number of plugins I can install?",
"Do you know how to create custom post types in WordPress?",
"Where is the database location for a WordPress site?",
"Can you tell me how to add custom CSS to my WordPress site?",
"Can I change the permalink structure in WordPress?"

[...]
The output from Mistral 7B when prompted.

It’s generally accepted that you need at least 50-100 examples to get any favourable results from fine-tuning. I can’t speak to what the ideal amount is, but I suspect you will want more than that. It’s something I still need to experiment with.

I went ahead and generated 1,200 instructions. Next time around I will write a script that does this instead of manually adding them to a JSON file.

Once you have all of your questions in a single JSON file, save it. I called mine instructions.json.
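The file is just one flat JSON array of question strings, for example:

[
  "Where do I find the WordPress login url?",
  "Can you tell me how to add custom CSS to my WordPress site?",
  "Can I change the permalink structure in WordPress?"
]
The expected structure of instructions.json: a single array of questions.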

Step 5: Generate model answers to your instructions

Now that you have a JSON file of all the instructions, you can use the Ollama API to generate a model answer to each one of them.
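If you want to see what a single request looks like first, you can query Ollama’s local API directly from the command line (the question here is just an example). The generated answer comes back in the response field of the returned JSON:

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Can I change the permalink structure in WordPress?", "stream": false}'
Querying the Ollama generate API for a single answer.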

To do this I wrote a very simple PHP script that I can run on the command line to query the Ollama API and generate the JSONL training file.

Save this as generate.php in the same folder as your instructions.json file. I have a full example copy on GitHub where you’ll also find a Python version of the script.

PHP
<?php
if ( ! file_exists( 'instructions.json' ) ) {
	die( 'Please provide an instructions.json file to get started.' );
}

function query_ollama( $prompt, $model = 'mistral', $context = '' ) {
	$ch = curl_init( 'http://localhost:11434/api/generate' );

	curl_setopt( $ch, CURLOPT_POSTFIELDS, json_encode([
		"model" => $model,
		"stream" => false,
		"prompt" => $context . $prompt
	] ) );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );

	$response = curl_exec( $ch );

	if ( $response === false ) {
		die( 'API call failed: ' . curl_error($ch) );
	}

	$answer = json_decode( $response )->response;

	curl_close( $ch );

	return trim( $answer );
}

function create_valid_file() {
	if ( ! file_exists( 'train.jsonl' ) ) {
		die( 'No train.jsonl file found!' );
	}

	// Remove 20% of the training data and put it into a validation file
	$train = file_get_contents( 'train.jsonl' );
	$trainLines = explode( "\n", $train );
	$totalLines = count( $trainLines );
	$twentyPercent = round( $totalLines * 0.2 );

	$valLines = array_slice( $trainLines, 0, $twentyPercent );
	$trainLines = array_slice( $trainLines, $twentyPercent );

	$train = implode( "\n", $trainLines) ;
	$val = implode( "\n", $valLines );

	file_put_contents( 'train.jsonl', $train);
	file_put_contents( 'valid.jsonl', $val);
}

$json = file_get_contents( 'instructions.json' );
$instructions = json_decode( $json, true ); // decode the flat array of instruction strings
$total = count( $instructions );

echo "------------------------------\n";
foreach ( $instructions as $i => $instruction ) {
	echo "(" . $i + 1 . "/" . $total . ") " . $instruction . "\n";
	echo "------------------------------\n";

	$answer = query_ollama( $instruction );
	echo $answer; // for terminal output

	$result = [ 'text' => '<s>[INST] ' . $instruction . '[/INST] ' . $answer . '</s>' ];

	$output = json_encode( $result ) . "\n";
	$output = str_replace( '[\/INST]', "[/INST]", $output );
	$output = str_replace( '<\/s>', "</s>", $output );

	echo "\n\n------------------------------\n";

	file_put_contents( 'train.jsonl', $output, FILE_APPEND );
}

create_valid_file();

echo "Done! Training and validation JSONL files created.\n"
https://github.com/apeatling/beginners-guide-to-mlx-finetuning

The script allows you to change which model is used to fetch answers, and you can even mix in answers generated by different models.
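For example, to generate answers with a different model you have already pulled (mixtral here is just an example name), change the call inside the loop:

$answer = query_ollama( $instruction, 'mixtral' );
Passing a different model name as the second argument to query_ollama().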

One thing to note: the Llama 2 license restricts using its outputs to train models that aren’t themselves based on Llama 2.

To run the script, call php generate.php; it will show progress along the way:

 php generate.php
------------------------------
(1/10) What is the purpose of custom post type syndication in WordPress?
------------------------------
Custom Post Type (CPT) syndication in WordPress refers to the process of sharing custom post types across different websites or platforms. Custom post types are a way to create new content types that go beyond the standard post and page structures provided by WordPress. This can include things like portfolio items, events, jobs listings, or any other type of content that doesn't fit neatly into the default post or page structure.

[...]
Output from the answer generation script showing progress.

Once the generation is complete you should have two new files: train.jsonl and valid.jsonl. We’re going to use these with Apple’s MLX framework to kick off local model training on your Mac in part three.


If you have any feedback, questions, or suggestions for this part of the guide please drop them on the Twitter/X thread, or on the Mastodon thread.


Part 1: Setting up your environment

Part 3: Fine-tuning your LLM using the MLX framework

Updates:

  • Jan 25, 2024: Added a link to the new Python version of the generate script. Thank you Arun Sathiya!