I highly highly doubt that training a LLM like gpt-2 will help you use something...

benreesman · 2025-07-01T14:28:10 1751380090

With modern high-quality datasets and the plummeting H100 rental costs it is 100% a feasible undertaking for an individual to train a model with performance far closer to gpt-4-1106-preview than to gpt-2, in fact its difficult to train a model that performs as badly as gpt-2 without carefully selecting for datasets like OpenWebText with the explicit purpose of replicating runs of historical interest: modern datasets will do better than that by default.

GPT-4 is a 1.75 terraweight MoE (the rumor has it) and that's probably pushing it for an individual's discretionary budget unless they're very well off, but you don't need to match that exactly to learn how these things fundamentally work.

I think you underestimate how far the technology has come. torch.distributed works out of the box now, deepspeed and other strategies that are both data and model parallel are weekend projects to spin up on an 8xH100 SXM2 interconnected cluster that you can rent from Lambda Labs, HuggingFace has extreme quality curated datasets (the fineweb family I was alluding to from Karpathy's open stuff is stellar).

In just about any version of this you come to understand how tokenizers work (which makes a whole class of failure modes go from baffling to intuitive), how models behave and get evaled after pretraining, after instruct training / SFT rounds, how convergence does and doesn't happen, how tool use and other special tokens get used (and why they are abundant).

And no, doing all that doesn't make Opus 4 completely obvious in all aspects. But its about 1000x more effective as a learning technique than doing prompt engineer astrology. Opus 4 is still a bit mysterious if you don't work at a frontier lab, there's very interesting stuff going on there and I'm squarely speculating how some of that works if I make claims about it.

Models that look and act a lot like GPT-4 while having dramatically lower parameter counts are just completely understood in open source now. The more advanced ones require resources of a startup rather than an individual, but you don't need to eval the same as 1106 to take all the mystery out of how it works.

The "holy shit" models are like 3-4 generations old now.

Davidzheng · 2025-07-01T15:26:44 1751383604

Ok I'm open (and happy to hear!) to being wrong on this. You are saying I can find tutorials which can train something like gpt3.5 level model (like a 7B model?) from scratch for under 1000 USD of cloud compute? Is there a guide on how to do this?

benreesman · 2025-07-01T15:49:56 1751384996

The literally watch it on a live stream version does in fact start with the GPT-2 arch (but evals way better): https://youtu.be/l8pRSuU81PU

Lambda Labs full metas jacket accelerated interconnect clusters: https://lambda.ai/blog/introducing-lambda-1-click-clusters-a...

FineWeb-2 has versions with Llama-range token counts: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2

Ray Train is one popular choice for going distributed, RunHouse, bumcha stuff (and probably new versions since I last was doing this): https://docs.ray.io/en/latest/train/train.html

tiktokenizer is indispensable for going an intuition about tokenization and it does cl100k: https://tiktokenizer.vercel.app/

Cost comes into it, and doing things more cheaply (e.g. vast.ai) is harder. Doing a phi-2 / phi-3 style pretrain is like I said, more like the resources of a startup.

But in the video Karpathy evals better than gpt-2 overnight for 100 bucks and that will whet anyone's appetite.

If you get bogged down building FlashAttention from source or whatever, b7r6@b7r6.net

Davidzheng · 2025-07-01T19:37:44 1751398664

Thanks for the links! Hopefully this doesn't come across as confrontational (this is really something I would like to try myself) but I don't think a gpt2 arch will get to close to gpt3.5 level intelligence? I feel like there was some boundary around gpt3.5 where the stuff started to feel slightly magical for me [maybe it was only the RLHF effect]. Do you think models in gpt2 size now are getting to that capability? I know sub 10B models have been getting really smart recently.

benreesman · 2025-07-01T22:33:16 1751409196

I think you'll be surprised if you see the lift karpathy demonstrates from `fineweb.edu` vs `webtext` (he went back later and changed the `nanogpt` repository to use `openwebtext` because it was different enough that it wasn't a good replication of GPT-2).

But from an architecture point of view, you might be surprised at how little has changed. Rotary and/or alibi embeddings are useful, and there's a ton on the inference efficiency side (GQA -> MHA -> MLA), but you can fundamentally take a llama and start it tractably small, and then make it bigger.

You can also get checkpoint weights for tons of models that are trivially competitive, and tune heads on them for a fraction of the cost.

This leaked Google memo is a pretty good summary (and remarkably prescient in terms of how it's played out): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...

I hope I didn't inadvertently say or imply that you can make GPT-4 in a weekend, that's not true. But you can make models with highly comparable characteristics based on open software, weights, training sets, and other resources that are basically all on HuggingFace: you can know how it works.

GPT-2 is the one you can do completely by yourself starting from knowing a little Python in one day.