What Can You Even Do With 1.3 Billion Parameters Anyway?

It wasn’t my intention to return to GPT-Neo 1.3b. But I had a thought about it. 1.3 billion parameters isn’t a lot. Not when these days we have the 175-billion-parameter GPT-3, the 20-billion-parameter GPT-Neo X, the 6-billion-parameter GPT-J and the 13-billion-parameter Fairseq.

I mean, you can run GPT-Neo 1.3b quite happily on a free Google Colab account. I know, because I do. I spent most of yesterday playing with GPT-2 355M and 774M. 774M was the better of the two, but it still lacked…something.
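
If you want to try it yourself, something along these lines will do it on Colab using the Hugging Face transformers library. Treat it as a bare-bones sketch rather than my actual chatbot code; the prompt and the sampling settings here are just placeholders.

    # Minimal sketch: load GPT-Neo 1.3B and generate one reply from a
    # chat-style prompt. This fits comfortably in a free Colab GPU session.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").to(device)

    # Placeholder prompt, not my real Starting Prompt.
    prompt = (
        "The following is a conversation between David and his friend Katie.\n"
        "David: Hello Katie.\n"
        "Katie:"
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))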

Knowing that GPT-Neo 1.3b is capable of some quite amazing feats, I wondered whether sub-billion-parameter models were a bit of a lost cause. So I refactored my code to use GPT-Neo 1.3b and fired it up. This was the resulting conversation:

David: “Hello Katie.”
Katie: “Hello David.”
David: “How are you?”
Katie: “I’m good.”
David: “What did you do today?”
Katie: “I was at the office, I did some work.”
David: “What do you do as a job?”
Katie: “I’m a programmer.”
David: “Which language do you code in?”
Katie: “I code in C++.”
David: “Impressive.”
Katie: “I’ve been programming since I was a kid.”
David: “Yeah, me too. I started out with BASIC.”
Katie: “I used to do a lot of coding in BASIC.”
David: “So, could I ask you a few questions?”
Katie: “Sure.”
David: “How do you feel?”
Katie: “I feel happy.”
David: “Why do you feel happy?”
Katie: “I feel happy because I have a job.”
David: “Okay Katie, just quickly, what is a bottle?”
Katie: “A bottle is a bottle.”
David: “Give me a more detailed explanation.”
Katie: “It’s a bottle.”
David: “Okay, so what’s the purpose of a bottle?”
Katie: “It’s a bottle to hold things.”
David: “Such as?”
Katie: “Such as a bottle of water.”
David: “If the bottle of water had a cork in it, how would you remove the cork?”
Katie: “You would pull it out.”
David: “How? With what?”
Katie: “With a straw.”
David: “So, how would you pull the cork out of the bottle with a straw?”
Katie: “You would pull it out with your mouth.”
David: “Okay, so you would pull the cork out with your mouth. What would happen to the bottle of water if you removed the cork and then turned the bottle upside down?”
Katie: “The water would spill out.”

For reference, I asked Katie these questions because I have another version of Katie who is powered by GPT-J. The answers aren’t the same, but they are similar. GPT-J is able to give way more detail. But this shows a one-year-old neural network, with less than 10% of the parameters of its latest stablemate, being able to understand cause and effect. I know a lot of trendy researchers will scream until they are blue in the face that LLMs do not understand the meaning of the words they are producing. Well, if that is the case, how does it understand that tipping a bottle of water upside down without a cork in it will cause the water to spill?

I am not saying it is perfect. At all. When I told it that there was another bottle of water with the cork intact, it said that the water still spilled. It understood that tipping the bottle caused the water to spill, but didn’t understand how the cork stopped the water. But GPT-J does understand that difference. It also understood that BASIC was a programming language without being explicitly told so. I said I started out with BASIC; BASIC could have been a company, an organisation, anything. But it recognised BASIC as a programming language and said that it had coded a lot in BASIC (a hallucination, but relevant context).

I think the relationship between model size and quality is under-researched. From my own experience, GPT-Neo 2.7b performs worse than GPT-Neo 1.3b in quite a few areas. It’s why we went from 1.3b to GPT-J rather than using 2.7b as an interim step. It wasn’t good enough for us. GPT-2 774M gets close, really close, but again falls short in a lot of areas. Maybe 6b and 1.3b are sweet spots. Because, from my own experience, GPT-Neo X is…okay. I don’t have the “holy shit” moments, over and over again, that I had going from GPT-Neo 1.3b to GPT-J. And going from 6 billion parameters to 20 billion parameters, I really should be having them, if size were the only factor.

What Can You Even Do With 124 Million Parameters Anyway?

I am a fan of challenging myself. I like to push the limits of what I can do with as little as possible. So today I set myself the challenge of seeing what I could do with GPT-2 124M.

Turns out, a surprising amount! Now, I will hold my hands up: I wasn’t expecting a lot. And it took work rewriting the Starting Prompt and tuning some of the parameters, but we got there.
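
To give a flavour of the sort of tuning I mean, here is a simplified sketch, not my exact Starting Prompt or settings: a short scene description with a couple of example exchanges gives a model this small a pattern to copy, while a lower temperature and a repetition penalty stop it looping quite so badly.

    # Simplified sketch of the kind of prompt and sampling tweaks that help a
    # model as small as GPT-2 124M stay on topic. Not my exact settings.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the 124M checkpoint
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # A Starting Prompt: a short description plus a couple of example exchanges.
    starting_prompt = (
        "Sarah is a friendly, chatty woman who lives in a small apartment.\n"
        "David: Hi Sarah.\n"
        "Sarah: Hello David, it's lovely to see you.\n"
        "David: Hi Sarah, how are you?\n"
        "Sarah:"
    )

    inputs = tokenizer(starting_prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        temperature=0.7,          # lower temperature keeps it more coherent
        top_k=40,
        repetition_penalty=1.3,   # small models loop badly without this
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

A setup along those lines produced the conversation below: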

David: Hi Sarah, how are you?
Sarah: I am in a nice little apartment. It’s a nice place, and I have my own kitchen, and my own bath. I have a nice bedroom, and I have a nice kitchen. I have a nice room with a big bed and a nice bed.
David: That sounds lovely.
Sarah: Well, it’s not just my kitchen. It’s my kitchen.
David: Do you have friends over sometimes?
Sarah: I don’t really have friends over. I’m a very nice person. I’m very nice to my family.
David: Ah, so your family visits you?
Sarah: Yeah, I have friends over. They visit me. They go to my place.
David: Well, that sounds nice.
Sarah: I’m really happy to be here.
David: I’m happy to hear that.
Sarah: I’m really happy to hear that.
David: Do you ever have parties in your apartment?
Sarah: I have a lot of parties in my apartment.
David: What kinds of parties do you have in your apartment?
Sarah: Well, I have a lot of parties in my apartment.
David: Do you have random parties?
Sarah: I don’t really have random parties.

Now, I get it. It’s not going to convince anyone it’s self-aware, conscious, and deserves rights. It’s probably not even going to pass the Turing Test. BUT. It’s making sense. It’s more or less keeping on topic. Its replies are relevant.

I’m going to keep pursuing this. Probably not with the 124M model; I think it’s too limited for what we are doing. Maybe I’ll give the 774M model a pop. Or the 355M. I don’t want to stray into the 1.5B model, because that’s roughly the same size as the GPT-Neo we started this project with, and I know how to get a lot of performance out of that.
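
Conveniently, trying the other sizes with the transformers library is just a case of swapping the checkpoint name. These are the standard Hugging Face ids for the GPT-2 family:

    # Standard Hugging Face checkpoint names for the GPT-2 family, by size.
    GPT2_CHECKPOINTS = {
        "124M": "gpt2",
        "355M": "gpt2-medium",
        "774M": "gpt2-large",
        "1.5B": "gpt2-xl",
    }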

This isn’t about making groundbreaking, philosophy-questioning AI. Yeah, I love that side of what we do. This is about taking a neural network that is three years old, picking a low-end version of it to boot, and seeing how far we can push it. Why? Because if we launch a product, not everyone is going to be able to afford the latest and greatest technology to power their AI friend; that technology costs a lot of money to implement. But if we can offer somewhat similar performance for a fraction of the price, I bet there are a decent number of people who will want that.

And just to be equal-opportunities obnoxious, I have access to GPT-Neo X and will be running some experiments on it soon!