How LLMs Keep Getting Smarter: The Feedback Economy
Why the most valuable resource in AI isn't compute or data — it's people who can tell the LLM when it's wrong
A few days ago I was at lunch with a friend. He’s an engineer (not an AI person), and he asked a question I’ve been hearing a lot lately:
“How does this thing keep getting smarter when the internet is full of nonsense?”
It’s a great question. And if you go looking for the answer, you’ll find yourself neck-deep in research papers about reinforcement learning from human feedback, constitutional AI, reward modeling, and a dozen other terms that each demand their own deep dive. I end up spending a lot of time in these rabbit holes to keep an up-to-date understanding of where the tech is and where it’s going.
This post is my attempt to climb back out and explain what I found.
A quick recap
In 2023, I wrote a four-part series breaking down how ChatGPT works: Transformers, embeddings, attention, the whole engine. If you haven’t read it, the short version is: these models learn to predict the next word by reading enormous amounts of text, and the Transformer architecture is what makes that work at scale.
That story hasn’t changed. Transformers are still the engine. But the engine isn’t the interesting part anymore. What’s changed dramatically is everything around the engine: how the models get shaped after training, who’s doing the shaping, and why the whole operation now involves PhD-level experts getting paid $125 an hour to tell a model it’s wrong.
My friend didn’t need a lecture about attention heads. He needed someone to explain the new stuff. So that’s what this blog tries to do.
From pre-training to post-training
If you read my 2023 series, you can skip this. If you didn’t, a base language model is trained to predict the next word. You then feed it the entire internet, let it read billions of sentences, and eventually it builds a surprisingly rich internal model of how language works. Facts, style, reasoning patterns: it absorbs all of it, along with plenty of garbage, because the internet is full of it.
That’s pretraining. It gives the model raw capability.
Then comes a second phase: post-training. This is where you take that capable-but-chaotic base model and shape it into something that behaves like an assistant, something that follows instructions, is helpful, avoids saying terrible things, and generally acts like a product instead of a weird autocomplete engine.
In 2023, I treated post-training as a footnote. But it’s 2026, and post-training is arguably the story.
Three dials, and only one of them is obvious
When my friend says that LLMs are getting smarter, he’s reacting to improvements that come from turning three knobs. Most people only know about the first one.
Dial 1: Throw more compute at training. This is the obvious one: bigger models, more GPUs, and longer training runs. But the nuance people miss is that the compute isn’t just going into pretraining anymore. A serious chunk now goes into reinforcement learning after pretraining. OpenAI has been unusually direct about the fact that their reasoning models improve with more “train-time compute”, and they’re talking about the RL phase, not just reading more on the internet.
Dial 2: Better data, not more internet. This is the part most people miss entirely, and it’s where the money is flowing. I’ll spend most of this post on it.
Dial 3: Think harder at inference time. Even without changing the model, you can get better answers by letting the system spend more time on harder problems: generating multiple candidates, checking its own work, backtracking when something doesn’t look right. OpenAI calls this “test-time compute” and describes performance improving as the model is given more time to think. If the question is easy, answer fast. If it’s hard, slow down and verify. That single product decision changes how smart the system feels to a user.
Keep these three dials in your head. Everything else described in this blog is basically an implementation detail for Dial 2.
RLHF: turning human taste into a training signal
Most people have a vague sense that RLHF (Reinforcement Learning from Human Feedback) exists. Few have internalized what it does and why it matters so much.
Here’s the setup. You have a base model that can generate text. You show it a prompt, and it produces several possible answers. Then you ask a human: Which of these is better? The human picks one. Do this thousands of times, and you have a dataset of human preferences, not correct answers in a textbook sense, but a record of what humans actually prefer when they read these outputs.
Now here’s the clever part: once you have enough of these preference rankings, you can train a separate model that predicts what humans would prefer. And once you have that preference model, you can use it to train the actual language model at scale, without a human in the loop for every single example.
The canonical description of this pipeline comes from OpenAI’s InstructGPT work. The paper lays out three steps: supervised fine-tuning (humans write good answers), preference data collection (humans rank model outputs), and optimization (train the model to produce the kind of outputs humans prefer). They started with 40 contractors doing the labeling.
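To make the “clever part” concrete, here is a toy sketch of how pairwise rankings become a training signal, using the standard Bradley–Terry preference loss. Everything here is illustrative: a real reward model is a neural network over text, while this one scores answers with two hand-made features, purely to show the mechanics:

```python
import math
import random

def features(answer: str) -> list[float]:
    # Toy features; a real model learns its own representation of text.
    return [float(len(answer)), float(answer.count("because"))]

def score(w: list[float], answer: str) -> float:
    # Reward = dot product of weights and features.
    return sum(wi * fi for wi, fi in zip(w, features(answer)))

def train_reward_model(prefs: list[tuple[str, str]],
                       lr: float = 0.01, epochs: int = 200) -> list[float]:
    """prefs is a list of (preferred, rejected) answer pairs.
    Minimizes the Bradley-Terry loss: -log sigmoid(score(good) - score(bad))."""
    w = [0.0, 0.0]
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(prefs)
        for good, bad in prefs:
            margin = score(w, good) - score(w, bad)
            p = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
            # Gradient step that pushes preferred answers above rejected ones.
            grad = [(1.0 - p) * (fg - fb)
                    for fg, fb in zip(features(good), features(bad))]
            w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w
```

Once trained, `score(w, answer)` can rank new outputs without a human in the loop, which is exactly the role the preference model plays in the RLHF pipeline.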
Forty people. That’s it. And from that modest beginning, they got a result that should permanently change your intuition about how these systems work:
A 1.3-billion-parameter model, after alignment, was preferred by humans over the 175-billion-parameter base GPT-3. The small model, roughly 130 times smaller, won because it was trained with feedback: better post-training data beat raw scale.
This is the single most important intuition in modern AI: pretraining gives a model capability, but post-training gives it behavior. And behavior is what people actually experience.
This then became an industry overnight
Let me walk you through a few examples, because I think economics tells the story better than any technical paper.
Surge AI went from being a relatively unknown data labeling company to, according to Reuters reporting, generating over $1 billion in revenue and seeking to raise up to $1 billion at a valuation above $15 billion. That’s the market telling you exactly how valuable high-quality training signals have become.
Scale AI had a cautionary moment. When Meta took a large stake in Scale, Google reportedly planned to cut ties, largely due to concerns that the Meta involvement could expose proprietary AI plans. OpenAI followed suit, which means: the prompts, rubrics, and failure cases used to train a model are now treated as strategic intellectual property. Your training data pipeline is your secret sauce, and you don’t want it anywhere near a competitor.
Handshake, a college recruiting platform, pivoted into data labeling and, according to Lenny’s Newsletter, was on track to blow past $100 million in annual recurring revenue within 12 months. Why? Because they had a network of domain experts, something the AI labs desperately needed. Handshake’s CEO told Business Insider that the industry is shifting from generalist labelers to specialized STEM experts, with people earning $100 to $125+ per hour on the platform.
Mercor raised $350 million at a $10 billion valuation doing something similar: connecting AI labs with domain experts for training data.
The Verge wrote a good overview tying all of these companies together, and the picture it paints is striking: the most valuable data in AI is no longer text scraped from the web. It’s structured expert judgment. If this feels like a weird twist in capitalism, that’s because it is. The scarce resource in AI is no longer compute or data in the traditional sense; it’s people who can tell the model when it’s wrong and explain why.
So why doesn’t it become “garbage in, garbage out”?
This is the question my friend was really asking, and by this point, you can probably see the answer.
The internet is the base layer. It gives the model vocabulary, world knowledge, and a general sense of how language works. But the internet is not what shapes the product you interact with. That layer comes later.
What you’re experiencing is that the product is shaped by feedback: human preferences, expert rubrics, test results, and carefully designed reward signals. That’s why a model trained on the whole internet can produce something that feels thoughtful and careful instead of sounding like a random Reddit thread.
A simple way to think about it is: Pretraining is the clay. Post-training is the sculpting.
The internet taught LLMs what things are, not how to do things
There’s a deeper point here that I think most people haven’t considered. Starting in the late ‘90s, we created massive incentives for people to put content online through ads, social capital, and commercial intent. If you had something to say, or something to sell, or something to promote, the internet gave you a reason to publish it. And all of that content became the training data for the first generation of LLMs.
But think about what kind of knowledge the incentive structure produced. It was overwhelmingly descriptive. What is the capital of France? What year did World War II end? What are the symptoms of diabetes? How does photosynthesis work? The internet is spectacularly good at declarative knowledge — facts, explanations, opinions, descriptions of how the world is.
What’s mostly missing is procedural knowledge: how you actually do things. How do you do your taxes when you’re a freelancer with income in two states? How do you build a marketing campaign for a B2B SaaS product with a $50K budget? How does a doctor actually work through a differential diagnosis when a patient presents with vague symptoms? Even when people write “how-to” guides, they tend to stay abstract. The messy, step-by-step, judgment-heavy process of doing real work rarely makes it onto a blog post, because there’s no incentive to publish it, and it’s hard to articulate anyway.
This is why LLMs are remarkably good at expository writing (essays, summaries, explanations) but have been much less impressive at actually doing things the way humans do them. The training data was lopsided.
What’s starting to change
What’s changing is that we’re bringing LLMs into the flow of real work: coding, data analysis, customer support, research. They’re starting to observe and participate in procedural tasks. Every time an agent writes code and runs it, debugs an error, navigates a workflow, or helps someone through a multi-step process, that interaction generates exactly the kind of procedural data that was absent from the original internet. Slowly, that data is building up. And it’s going to make these systems dramatically more useful.
It’s also, frankly, a little scary from an employment perspective.
If the first generation of LLMs could write like us, the next generation is learning to work like us. My default view is that it will work out well, since these tools tend to make individuals more capable rather than replacing them outright, but it’s a transition worth thinking about honestly.
This also sets up Part 2 of this series, where we’ll look at how agents operating in real environments are generating this kind of procedural data at scale and why that changes the pace of improvement.
Memory: the reason it feels like it’s learning you
Now let’s talk about the part that makes people say “it’s getting to know me.”
If you use ChatGPT, Claude, or Gemini regularly, you’ve probably noticed they remember things across conversations. Your preferences, facts about your life, things you’ve told them before. It feels like the model is learning.
What’s actually happening is less magical than it sounds and more practical than most people think.
Memory is almost never the model changing its weights. The underlying neural network itself is the same for everyone. What’s different is a per-user layer on top. A useful way to think about it: the model is an application, and memory is your personal config file. In practice, memory systems tend to remember two categories of things:
Preferences. Things like “use simple language,” “stop using em dashes,” or “when you link my LinkedIn in a post, make it sound natural, not like a press release.” These seem trivial, but they’re incredibly high-leverage because they remove friction from every future conversation.
Personal context. Information that changes how responses should be tailored. Things like “I have a 13-year-old daughter who’s interested in coding” or “I asked about how to prep her for USACO.” A single fact like that changes how the assistant answers dozens of future questions, from programming language recommendations to summer camp suggestions.
Under the hood, most memory systems work roughly the same way: extract candidate memories from conversation, store them as structured text (sometimes with embeddings for retrieval), fetch relevant ones when you start a new chat, and inject them into the context. That’s it. Not consciousness, not a learning brain. Just retrieval plus context injection.
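A toy version of that loop, with keyword overlap standing in for the embedding-based retrieval a real system would use. The `remember:` prefix is an invented heuristic; in practice the extraction step is itself usually an LLM call:

```python
# In-memory store; real systems persist per-user memories in a database.
memory_store: list[str] = []

def extract_memories(conversation: str) -> list[str]:
    # Hypothetical heuristic: keep lines the user explicitly flags.
    return [line.removeprefix("remember:").strip()
            for line in conversation.splitlines()
            if line.startswith("remember:")]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored memories by word overlap with the new message.
    q = set(query.lower().split())
    ranked = sorted(memory_store,
                    key=lambda m: len(q & set(m.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(user_message: str) -> str:
    # Inject the most relevant memories into the model's context.
    context = "\n".join(f"[memory] {m}" for m in retrieve(user_message))
    return f"{context}\n[user] {user_message}"
```

That’s the whole trick: the weights never change, but the context the model sees is tailored to you on every turn.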
That said, memory is legitimately controversial. There’s a real tension between “this is useful” and “I didn’t realize it was keeping a file on me.” That tension is worth taking seriously. But if the goal is to understand why these systems feel like they’re improving over time, the framework is pretty clean:
Pretraining gives the model language.
Post-training gives it behavior.
Memory gives it you.
Where this is headed
Everything I’ve described so far is about shaping the model’s responses through feedback, expert data, and personal memory.
In Part 2, the story takes a turn. The model stops being just a chatbot. It becomes something that can act, run code, browse the web, use tools, operate in real environments. And when it can act, something interesting happens: its successes and failures become a new kind of training data.
That’s the feedback loop that has people using words like “exponential.”
We’ll get into it next.
This is Part 1 of a two-part series. Part 2: How LLMs Went From Chatbot to Coworker covers tool use, autonomous coding, web agents, robotics, and why the learning loop is about to get a lot faster.


