July 23, 2024 · 1137 words

Finetuning LLMs on my Blog

Reading About Finetuning on AWS's Blog

AWS published a machine learning blog post about how you can finetune Claude 3 Haiku on Bedrock really easily, and it got me very interested in setting up Haiku to imitate me. Specifically, my blog has around 100k words in it, so that's a good starting point I guess.

The bad news is that AWS doesn't allow me to tune it, which got me pretty mad actually.

I contacted AWS for access to finetune Claude Haiku, but they didn't respond.

OpenAI's Models

There are those old GPT-3.5 models available for finetuning, but even without finetuning they're not the smartest, and after finetuning they hardly learn anything, actually.

OpenAI's more capable models like GPT-4 are unavailable for finetuning.

Generating a Question for Each Sentence

I used nltk to parse the sentences from my blogs, as well as to remove all code snippets and weird symbols. Anyways, I split each blog into separate sentences and asked Claude Haiku to generate a question based on each sentence. Once that was done, there were approximately 4000 pairs of questions and sentences.
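The generation step looked something like this (a rough sketch, not my exact script; the prompt wording and helper names are just illustrative):

```python
import json

import boto3
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # sentence tokenizer data

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def question_for(sentence: str) -> str:
    # ask Claude Haiku to invent a question that this sentence answers
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [{
            "role": "user",
            "content": "Write one short question that the following "
                       f"sentence answers, and nothing else:\n{sentence}",
        }],
    }
    resp = bedrock.invoke_model(modelId=HAIKU, body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"].strip()

def pairs_for(blog_text: str) -> list[tuple[str, str]]:
    # one (generated question, original sentence) pair per sentence
    return [(question_for(s), s) for s in sent_tokenize(blog_text)]
```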

I then added a system prompt for each blog, like "You are talking to Jim Chen about {blogname}", to the training data in my LLM training script.
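So one single-turn training record, in OpenAI-style chat JSONL, looks roughly like this (pretty-printed here; the real file has one record per line, and the question text and blog name are placeholders):

```json
{"messages": [
  {"role": "system", "content": "You are talking to Jim Chen about Finetuning LLMs on my Blog"},
  {"role": "user", "content": "What library did you use to split your posts into sentences?"},
  {"role": "assistant", "content": "I used nltk to parse the sentences from my blogs."}
]}
```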

Different Approaches

After that I tried multiple strategies, including "multiturn finetuning".

The first approach I tried was to keep all the pairs separate, feeding the 4000 single-turn pairs into the LLM. It was pretty bad: the LLM hardly made any sense, and the loss was huge.

The second approach I tried was to group 5 question/answer pairs into each conversation, and it got way better results. The results improved even further when I generated the questions multiple times (so that the questions for each answer had different variations).
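The grouping itself is trivial; a sketch, assuming `pairs` is the list of (question, sentence) pairs from the generation step:

```python
import json

def conversations(pairs, system_prompt, group_size=5):
    # pack each run of 5 consecutive (question, answer) pairs
    # into one multiturn chat under the blog's system prompt
    for i in range(0, len(pairs), group_size):
        messages = [{"role": "system", "content": system_prompt}]
        for question, answer in pairs[i:i + group_size]:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        yield {"messages": messages}

with open("blog_conversations.jsonl", "w") as f:
    for convo in conversations(pairs, "You are talking to Jim Chen about Finetuning LLMs on my Blog"):
        f.write(json.dumps(convo) + "\n")
```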

For the third approach, I tried grouping entire blogs into one conversation each. At this point AWS Bedrock started rate limiting my calls to Claude Haiku for no reason, so I switched from us-east to us-west and kept calling (since I had activated both AWS regions). But this approach turned out badly: the models could not learn well from it.
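The region hopping was nothing clever, just falling over to the other region's client whenever one starts throttling; roughly:

```python
import boto3
import botocore.exceptions

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
clients = [
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    boto3.client("bedrock-runtime", region_name="us-west-2"),
]

def invoke_with_failover(body: str):
    # try each region in turn; fall through on throttling errors only
    for client in clients:
        try:
            return client.invoke_model(modelId=HAIKU, body=body)
        except botocore.exceptions.ClientError as e:
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise
    raise RuntimeError("throttled in every region")
```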

I also tried making the questions shorter, like very short questions for each sentence, but it worked terribly.

Finetuning on Cohere

There is also an AWS blog post about how to finetune Cohere models on SageMaker, but it looked complicated, while finetuning directly on Cohere's own platform took me no time to figure out.

Cohere's frontend is made in NextJS; however, it frequently runs into bugs (almost the same client errors I'd see when debugging a website myself lol, and this many bugs for a single user suggests their web devs might not be the most capable people). That's okay, since they are primarily an AI company, and Gomez started it so young that they are probably not a large company or whatever. I managed to create some finetuning jobs; they asked me for my company (I just entered some random words), and it passed.

After some finetuning jobs, however, I couldn't create any more. I figured maybe it was a free trial limit or something, but Cohere doesn't have extensive documentation and the site was buggy as hell, so every time I tried to create a finetuning job it returned an error without any debugging messages.

Anyways, this was so annoying, and I got sick of all the bugs I was running into. I am literally running into bugs nobody has run into before; I am always the one having to deal with buggy programs, and GUI programs seem to target me (or just not work for me). I tried to contact Cohere, but they don't have any email support or a Reddit presence (only a Discord server, which I don't use).

Then I quickly used my Berkeley email to open up another account and finished the finetuning there.

Cohere Command R Results

So I finally settled on the second approach stated above: put 5 pairs into a conversation, then feed it to Cohere. I ended up with a roughly 6 MB file, and the result came out around 55 percent.

It wasn't that bad.
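For reference, Cohere's chat finetuning data is also JSONL, but if I remember their format right, it wants capitalized role names, with the assistant called "Chatbot". One conversation looks roughly like this (pretty-printed, placeholder text):

```json
{"messages": [
  {"role": "System", "content": "You are talking to Jim Chen about Finetuning LLMs on my Blog"},
  {"role": "User", "content": "What library did you use to split your posts into sentences?"},
  {"role": "Chatbot", "content": "I used nltk to parse the sentences from my blogs."},
  {"role": "User", "content": "..."},
  {"role": "Chatbot", "content": "..."}
]}
```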

Some example prompts

who are you
I am Jim Chen, and I am currently doing self study in computer science.

Fine, right?

which school do you go to
I am in Shanghai and I go to Shanghai International School.

Emm, I didn't go to that school, but I asked again:

which school do you go to
I go to the University of California, Berkeley.

Sadly, I can't go to that school now either. It's so sad, man; Berkeley is paradise for computer science. Anyways, most of my previous blogs written at Berkeley got into the dataset.

Below are some decent responses:

where do you like to go in Shanghai
I like going to Yangpu district, it is quiet and not crowded.
what is your hobby
I like to walk around, and I am also fond of hiking.
how do you think about windows os
Windows is a very crappy operating system, and I am surprised Microsoft is still selling it.
how do you like open source?
I like open source because it is free and customizable.

Some bad responses

what is your interest
I am also interested in some other stuff like video games and sports.
what is your interest
I am also interested in various other things like politics, history, and the human mind.
what is your interest
I am interested in machine learning and computer science in general.

I set the temperature to 0.7, and it got it right on the 3rd response.

GPT-4o-mini

The next day I received an email informing me that GPT-4o-mini is available to be finetuned, and yeah, so I went ahead real quick and fed in the multiturn conversations. The training loss looked good.
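The job itself is just two API calls (the filename here is a placeholder for my training file):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# upload the chat-format JSONL training data
training_file = client.files.create(
    file=open("blog_conversations.jsonl", "rb"),
    purpose="fine-tune",
)

# start the finetuning job on the GPT-4o-mini snapshot
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```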

But the results were horrible. I set the temperature to 0.5, and it got everything wrong:

which school you are in
I am in Shanghai Jiao Tong University.
which high school did you went to
I also went to high school in Shanghai High School.
what interest you have
I am interested in Linux and ReactJS. I also like computer games.

I set the temperature to 0, and it was still bad:

what school did you go to
I went to Shanghai Jiao Tong University for undergraduate.
how do you think about windows os
Windows OS is a mixed bag it can be very annoying to use but also very fun and interesting at the same time.

ChatGPT hallucinates way too much.

I realized maybe the problem is that I used generated questions on my blog. This kind of finetuning is supposed to be used for a specific task, while to be honest I should have parsed each of my sentences and fed it into a base model for plain completion.
