Deep Blue Was Not ChatGPT

Boot.dev Team
Programming course authors and video producers

It's May 11th, 1997, and Garry Kasparov sits in front of a chessboard on the 35th floor of a Manhattan skyscraper.

Across the table sits a chunky monitor and a keyboard.

Kasparov is playing chess against a computer. In the 90s. IBM's Deep Blue supercomputer in the original man-vs-machine intelligence test.

Right now, we're five games into the series. The score: Kasparov 1, Deep Blue 1, and 3 draws.

This is the final game, 6 of 6, and Kasparov is playing the black pieces.

He's THE world champion chess player in 1997, and no one's been able to touch his title since 1985 - a 12 year reign.

Nineteen moves in, Kasparov resigns.

Deep Blue wins the match and series. For the first time ever, a reigning world champion loses to AI under standard tournament time controls.

And the next morning, the world loses its mind.

"After Sudden Defeat, It's Kasparov Who's Blue" - LA Times.

"Swift and Slashing, Computer Topples Kasparov" - New York Times.

Or my personal favorite, this Matthew Pritchett cartoon.

Remember, this was nearly thirty years ago. Before memes and Reddit. People got their news literally delivered to their doorsteps by schoolchildren.

Computers were around, but not everyone had one, and they weren't the modern RAM guzzling machines of today. In 1997, I was three; my dad hadn't even bought his first Gateway desktop.

But investors were well aware of computers and the internet - I mean, there was nothing hotter on Wall Street leading up to the 2001 tech crash. So, understandably, EVERYONE was freaking out about Deep Blue.

So the most powerful AI demo in the world proves its legitimacy on a global scale. How does this not turn into the artificial general intelligence, or AGI, future people feared back in 1997?

You'd think a breakthrough like Deep Blue would make AI research accelerate faster and faster. Certainly growth would be exponential now that the machines are "smarter than the smartest humans".

In an AGI world, the limits are only your imagination. We're talking the end of disease, global renaissance, terraforming Mars, automated government and economy, fixing the Earth's ecosystem...

But that's not what happened... like at all.

And not because Deep Blue was a hoax. Deep Blue was real, and it was AI.

In fact, it's exactly the kind of AI that I studied in my artificial intelligence class at university about 12 years ago.

But today, we have a completely different type of AI hyped up in our social feeds.

The conversation happening today is actually pretty similar to the conversation that stirred everyone up in 1997. And today's "growth at all costs" investor mentality is not too different from the 90s. That said, there are plenty of arguments about the potential of an AI bubble. So instead of beating that dead horse, let's dive into the technical differences between the kind of AI that Deep Blue was in 1997, and what we're working with today with GPT and Claude.

If you want to be able to spot the hand-wavy BS that a lot of the biggest AI hype bros are peddling... including some of the CEOs, you need to understand the difference between modern large language models, or LLMs, and what Deep Blue was. We're going under the hood so to speak.

If you look at a zoomed out view of 1997 to today, information technology DOES SEEM to be getting better faster (unless we're talking about a certain operating system that I'm FORCED to use to play most of my Steam library), but I want to challenge the idea that tech always improves exponentially over time. Sometimes it even slows down - only getting better logarithmically. Or in other words, there can be diminishing returns.

I mean when was the last time you really noticed a big improvement when you upgraded your iPhone?

So, let's open up the Deep Blue machine, talk about the actual computer IBM built, and go on a tour of the twelve-year history behind the old-school AI it used - and why that path suddenly turned cold before we got the LLMs we know and love (and hate) today.

Before Deep Blue

Now back to the beginning. Before Deep Blue, there was Deep Thought.

It's 1985 at Carnegie Mellon University. A grad student named Feng-hsiung Hsu (his nickname in the lab was "Crazy Bird", which is easier for me) starts building a chess-playing chip. Like, a physical computer chip.

It's a simple, tiny chip called ChipTest, based on a design from Ken Thompson — one of the guys who created Unix and one of the designers of the Go programming language, a personal favorite of mine. And it works so well that it actually won the North American Computer Chess Championship in 1987, back when most PC users still didn't own a mouse.

Inspired, Hsu keeps going. He builds a better version called Deep Thought - and no, not the one on Magrathea built by hyperintelligent pan-dimensional beings. It turns out, this Deep Thought can think pretty well. (Well, it wasn't actually "thinking" - but don't worry, we'll get into that in a bit.) It became the first chess computer to beat grandmasters in tournament play.

So, ChipTest and Deep Thought are often kind of lumped together as the "Carnegie Mellon predecessors" to Deep Blue - but there are actually some differences that show how Deep Blue essentially "grew up" over time.

It all started with this question: can you put a chessboard on a chip? ChipTest was the answer to that question. A single custom chip, plus a Sun-3 workstation. Every square of the chessboard was wired up as a circuit, so the chip could spit out all legal moves in one electrical operation. The Sun-3's job was to handle the search, evaluation, and control in software.

In 1986, ChipTest was searching 100,000 positions per second. One year later, it was searching 5 times as many.

Deep Thought used the same basic strategy as ChipTest, and multiplied it. The team scaled up to six processors that could search over 2 million positions per second - and they weren't done. Later, in 1991, "Deep Thought 2" was able to search up to almost 5 million positions per second.

But alas, it still wasn't enough.

In 1989 - 8 years before Deep Blue's win - Deep Thought plays Kasparov, who at this point has been world champion for four years, and Kasparov crushes it. In a two-game exhibition match, Deep Thought loses twice. And it's not even close. Kasparov won the games in 52 and 37 moves respectively, easily seeing through the computer's brute-force algorithm with his decades of human-powered pattern recognition.

But by now, a lot of eyes are on Deep Thought. Among those watching closely are reps from IBM, and they like what they see. They hire Hsu right after he finishes his PhD. Then Murray Campbell, Joe Hoane, and Jerry Brody are added to finalize the 4-man technical team. Later, they also bring on grandmaster Joel Benjamin as a chess consultant.

All that's left is to give their project a name, and they go with "Deep Blue" — a play on Deep Thought and IBM's nickname, "Big Blue".

Here's the thing about actually building Deep Blue. This isn't really a software project. Hsu is designing custom silicon. And not just one chip - there are 480 of them, each one simulating a tiny chess board — an 8x8 grid of logic gates that generates moves and helps evaluate positions in hardware.

It's running algorithms and data structures, but a lot of its speed comes from silicon rather than software optimizations alone.

Those 480 chess chips sit inside an IBM RS/6000 SP supercomputer alongside 30 general-purpose processors. So, the chess chips handle the brute-force chess specific algorithm, and the main processors handle the high-level coordination. It's a hybrid — part general purpose computer, part custom chess machine.

And, building this thing takes years. Testing it takes even more years.

Joel Benjamin's job is to play the machine over and over, find the positions where it does something stupid, and help the engineers patch those weaknesses. The team also builds the "opening book", a library of known strong moves for the first phase of a chess game.

By 1996, Deep Thought is old news, and Deep Blue is ready for a Kasparov rematch.

The very first game of the 6 game series is insane. Deep Blue wins. It's the first time a computer has ever beaten a reigning world chess champion in a game under standard tournament rules.

But Kasparov, smart human that he is, adjusts. He wins three of the next five games and takes the match 4-2.

So what now? Well, IBM goes back to the lab. They're confident in the possibility now — after all, they were making progress. Deep Thought lost to Kasparov in 1989, going 0 for 2 - but now Deep Blue had actually scored two points against him.

They just needed more compute. So, they double Deep Blue's processing speed for the rematch. The improved version can evaluate 200 million chess positions per second. In 1997. Impressive, considering I still can't get my printer to connect to Wi-Fi.

The Move That Broke Kasparov

Anyways, it's time for the famous 1997 rematch. But game one surprises everyone with a very... odd moment.

Towards the end of the game, Deep Blue makes a completely unexpected, counterintuitive move. Kasparov, playing white, had the advantage and was pressing for a win. But instead of trying to make an immediate gain or launch a tactical counter (like a computer normally would), Deep Blue simply moved its rook from d5 to d1 (44... Rd1), a passive waiting move that SEEMED to anticipate Kasparov's deeper strategic traps. My understanding, not being a chess expert, is that it was a "do nothing" move, kinda like checking in poker.

    +---+---+---+---+---+---+---+---+
  8 |   |   |   |   | ♜ |   |   |   |
    +---+---+---+---+---+---+---+---+
  7 |   |   |   |   |   |   |   |   |
    +---+---+---+---+---+---+---+---+
  6 |   |   | ♟ |   |   | ♙ | ♙ | ♚ |
    +---+---+---+---+---+---+---+---+
  5 |   | ♟ |   |   |   |   |   |   |  <-- Rook moved from d5 to d1
    +---+---+---+---+---+---+---+---+
  4 | ♟ | ♙ |   |   | ♟ |   | ♖ |   |
    +---+---+---+---+---+---+---+---+
  3 | ♙ |   | ♗ |   |   |   |   |   |
    +---+---+---+---+---+---+---+---+
  2 |   |   | ♙ |   |   | ♔ |   |   |
    +---+---+---+---+---+---+---+---+
  1 |   |   |   | ♜ |   |   |   |   |
    +---+---+---+---+---+---+---+---+
      a   b   c   d   e   f   g   h

Kasparov looks at the board and thinks: that's too human for a computer. Which is funny, because these days, we see the exact opposite accusations flying! Magnus Carlsen, the reigning world champion at the time, famously accused Hans Niemann of cheating after their 2022 Sinquefield Cup game, claiming Niemann was completely relaxed in critical positions while outplaying him with computer-like precision.

Anyways, Kasparov won that first game, but the odd move rattled him. So in game two, on move 36, Deep Blue - playing white - has a chance to win two pawns and gain a material advantage - which Kasparov is expecting. He knows computers are typically greedy, and so he's prepared for Deep Blue to fall into his trap.

Instead, Deep Blue executes a surprisingly sophisticated move to shut down Kasparov's counterplay—White pawn on a4 captures Kasparov's black pawn on b5 (36. axb5), to which Kasparov captures back with his pawn from a6. But then Deep Blue calmly moves its Bishop to e4 (37. Be4)—which according to Grandmaster Yasser Seirawan, sent Kasparov "into a tizzy".

    +---+---+---+---+---+---+---+---+
  8 | ♜ |   | ♜ |   | ♛ |   | ♚ |   |
    +---+---+---+---+---+---+---+---+
  7 |   |   |   |   |   |   | ♟ |   |
    +---+---+---+---+---+---+---+---+
  6 |   |   |   | ♝ |   | ♟ |   | ♟ |
    +---+---+---+---+---+---+---+---+
  5 |   | ♟ |   | ♙ | ♟ | ♙ |   |   |
    +---+---+---+---+---+---+---+---+
  4 |   | ♙ | ♟ |   | ♗ |   |   |   |  <-- Bishop moved from c2 to e4
    +---+---+---+---+---+---+---+---+
  3 |   |   | ♙ |   |   |   |   | ♙ |
    +---+---+---+---+---+---+---+---+
  2 | ♖ |   |   |   |   | ♕ | ♙ |   |
    +---+---+---+---+---+---+---+---+
  1 | ♖ |   |   |   |   |   | ♔ |   |
    +---+---+---+---+---+---+---+---+
      a   b   c   d   e   f   g   h

What just happened was the computer ignored immediate material gains (which would have been to play something like Queen to b6), instead favoring a long-term, frankly big-brain positional advantage - completely flipping the script on Kasparov, who used the same kind of tactic to defeat Deep Thought in 1989. The centralized Bishop blocked Kasparov's pieces, shutting down black's opportunity for counterplay, and Deep Blue eventually won the game.

As Seirawan later told Wired, it was "an incredibly refined move, of defending while ahead to cut out any hint of countermoves".

But as we know, Kasparov loses that series - and he's not happy about it.

He asks IBM for printouts of the machine's calculations. At the post-game press conference, with cameras rolling, Kasparov suggests there may be a grandmaster somewhere inside the system, feeding Deep Blue moves.

Then, fifteen years later, we learn what really happened. Murray Campbell - the one from the Deep Blue team - told journalist Nate Silver that the weird move in the first game, 44... Rd1 (moving its rook to d1), was a bug.

Deep Blue couldn't pick a move, so it triggered a fail-safe and picked one at random. It turned out not to be a terrible move - just an unexpected one. "Kasparov had concluded that the counterintuitive play must be a sign of superior intelligence," Campbell said. "He had never considered that it was simply a bug".

So Kasparov demands another rematch. But IBM says no. Campbell later said IBM felt it had achieved its goal and moved on to other research. Lack of ambition if you ask me.

Anyways, Carnegie Mellon awarded the Deep Blue team an additional $100,000 from a prize set up all the way back in 1980 by professor Edward Fredkin for the first computer to beat a reigning world champion.

And that's the end of the story of how Deep Blue changed chess. But the really interesting part - if you want to better understand how AI impacts everything we're doing today - is how Deep Blue actually worked under the hood, so let's get into that.

How Deep Blue Worked

Even though the project started as "Deep Thought", it wasn't really thinking. Modern-day LLMs don't really think either, but for a different reason that we'll talk about.

See, Deep Blue was just... searching.

Chess is like, the perfect game for computer scientists. The board has 64 squares in a nice little grid with a controlled set of simple moves available. Both players see everything. There's no hidden information, no fog of war, and no real-time actions.

It's one of the easiest games in the world to simulate.

Take this toy game board here:

8  . . . . . . . .
7  . . . . . . . .
6  . . . . . . . .
5  . . . . . . . .
4  . . . N . . . .
3  . . P . P . . .
2  . . . P . . . .
1  . . . . . . . .
   a b c d e f g h

Sure, I could move any one of my 3 pawns forward, or I could move my Knight to 1 of 8 places: 11 options. But for a computer, 11 isn't that many. It's not like trying to predict which sentence you're about to say next.
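If you want to check that count yourself, here's a rough sketch of a move generator for this exact toy board. It's deliberately simplified (no captures, checks, en passant, or promotion), and the function names are just ones I made up for illustration.

```python
# Toy move generator for the board above: white knight on d4,
# white pawns on c3, e3, and d2. Simplified on purpose.
FILES = "abcdefgh"

def to_xy(square):
    """Convert a square name like 'd4' to zero-based (file, rank)."""
    return FILES.index(square[0]), int(square[1]) - 1

def to_name(x, y):
    """Convert zero-based (file, rank) back to a square name."""
    return f"{FILES[x]}{y + 1}"

def knight_moves(square, occupied):
    """All L-shaped jumps that stay on the board and land on an empty square."""
    x, y = to_xy(square)
    jumps = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    moves = []
    for dx, dy in jumps:
        nx, ny = x + dx, y + dy
        if 0 <= nx < 8 and 0 <= ny < 8 and to_name(nx, ny) not in occupied:
            moves.append(to_name(nx, ny))
    return moves

def pawn_moves(square, occupied):
    """White pawn pushes only: one step, or two steps from the second rank."""
    x, y = to_xy(square)
    moves = []
    if y + 1 < 8 and to_name(x, y + 1) not in occupied:
        moves.append(to_name(x, y + 1))
        # The double push needs both squares in front of the pawn to be empty.
        if y == 1 and to_name(x, y + 2) not in occupied:
            moves.append(to_name(x, y + 2))
    return moves

occupied = {"d4", "c3", "e3", "d2"}
knight = knight_moves("d4", occupied)
pawns = [move for square in ("c3", "e3", "d2") for move in pawn_moves(square, occupied)]
total = len(knight) + len(pawns)
print(total)  # 11
```

Notice that the d2 pawn only gets one move: its two-square push is blocked by the knight sitting on d4.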

Midway through your average chess game, you may have around 30 legal moves (admittedly it can vary a lot), to which your opponent has around, let's just say, 30 legal counter-moves. Multiply them together and that's 900 move combinations - or game states - after just one move each.

Two moves deep? Around 810,000 positions.

Three moves deep? Close to 729 million.
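The arithmetic is just repeated multiplication. Here's a tiny sketch, assuming a flat 30 legal moves per side (a made-up round number, since the real count changes every turn). Strictly speaking this counts move sequences, because different move orders can land on the same position.

```python
BRANCHING = 30  # assumed average number of legal moves per side

def positions_after(full_moves, branching=BRANCHING):
    """Move sequences after each player has moved `full_moves` times."""
    return branching ** (2 * full_moves)

for depth in (1, 2, 3):
    print(depth, f"{positions_after(depth):,}")
# 1 900
# 2 810,000
# 3 729,000,000
```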

All of those moves create a big branching mess called a search tree. The current board is the root and every legal move is a branch. Every reply is another branch. As the computer, you keep exploring moves until you hit checkmate or a draw... or your computer runs out of time to keep searching.

Now, a human grandmaster doesn't actually think about all 700 million possible moves when they're "thinking 3 moves ahead". They look at a handful of the most seemingly promising moves. They follow the interesting lines, and use the pattern recognition they've learned over decades of training to narrow it down to less than a hundred real options.

But Deep Blue doesn't have mortal constraints. As a machine, it looks at more futures than a human could ever look at in their lifetime. In less than a second.

Campbell told Scientific American that the 1997 version searched between 100 and 200 million positions per second, usually seeing about six to eight pairs of moves ahead, sometimes much deeper in tactical lines.

Now you might be thinking, how is searching enough? That's not decision making, that's just simulating possibilities. Well, if you simulate all the possibilities, and you can see all the futures where you win, you can just keep picking the moves that are most likely to take you to one of those outcomes. But it's still a lot of moves to search.

Even if we assume just 10 possible moves on each turn, and just 15 turns (30 total, 15 for each player), that hypothetical game represents 10^30, or 1,000,000,000,000,000,000,000,000,000,000 (that's one nonillion) possibilities!
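To get a feel for how hopeless that is, here's a back-of-the-envelope sketch. It assumes the 200 million positions per second figure quoted earlier and pretends Deep Blue could sustain it forever.

```python
POSITIONS = 10 ** 30                     # the hypothetical game tree above
PER_SECOND = 200_000_000                 # Deep Blue's quoted peak search speed
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # average year length in seconds

years = POSITIONS / PER_SECOND / SECONDS_PER_YEAR
print(f"{years:.2e} years")  # 1.58e+14 years
```

That's on the order of a hundred trillion years for one simplified game, so searching everything was never an option.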

So Deep Blue needed a judge, to do something like what the best players in the world do when they can't calculate every line to the end. That judge is called an evaluation function, or a "heuristic". Instead of searching every branch all the way to checkmate, it can stop at the edge of its search and score the positions it sees there.

If you've ever heard a chess streamer say "I'm up four pawns and a knight," that's exactly what we're talking about. A good evaluation function can take a chess position and spit out a number that estimates how "good" the position is. A very simple one would be, given a board state, how many more pieces do I have than my opponent? Positive number is good, negative number is bad. Then it chooses the branches that lead to better-looking positions.
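Here's a minimal sketch of that "count the material" idea. The piece values are the conventional textbook ones, not Deep Blue's actual weights, and the string encoding of the pieces is just something I picked for the example.

```python
# Conventional piece values, in pawns. These are illustrative only.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_score(pieces):
    """Score a position from White's point of view.

    `pieces` is any iterable of piece letters: uppercase for White,
    lowercase for Black (a hypothetical encoding for this sketch).
    """
    score = 0
    for piece in pieces:
        value = PIECE_VALUES[piece.lower()]
        score += value if piece.isupper() else -value
    return score

# White is up four pawns and a knight.
white = "K" + "N" + "PPPPPP"
black = "k" + "pp"
print(material_score(white + black))  # 7
```

A positive number means White is ahead, a negative number means Black is ahead, and zero means the material is even.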

Now, in practice, Deep Blue's evaluation function was built directly into the chess chip, with about 66,000 gates dedicated to scoring positions. Material count, king safety, pawn structure, mobility, control of the center — all that messy chess judgment, compressed into a single score that could be derived from any board state. The opening system drew on a few thousand hand-built book positions and a database of 700,000 grandmaster games.

So, to recap, a Deep Blue move worked roughly like this:

List the legal moves. List the opponent's legal replies. List your replies to those. Keep branching until the clock or search depth says to stop. Then score the leaf positions at the edge of the search. Push those scores back up the tree, and assume both players are trying to win.

In other words, if Deep Blue takes a path that results in capturing a rook, it can't "cheat" and assume that the enemy won't recapture its queen - it assumes that the enemy will ALSO want to maximize its score, so Deep Blue will assume its own queen is a goner.

On Deep Blue's turn, the algorithm picks the branch with the best score. On the opponent's turn, it assumes they pick the branch with the worst score for Deep Blue. This algorithm is called minimax. You maximize. They minimize.
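Here's a bare-bones sketch of minimax on a hand-built tree. The tree format (nested lists with numbers at the leaves) and the scores are made up for illustration; a real engine would generate moves and call its evaluation function instead.

```python
def minimax(node, maximizing):
    """Return the best score reachable from `node`.

    A node is either a number (a leaf already scored by the evaluation
    function) or a list of child nodes, one per legal move.
    """
    if isinstance(node, (int, float)):
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# Two candidate moves for us, each with two possible replies.
tree = [
    [5, -4],  # move A: wins a rook (+5) unless the opponent takes our queen (-4)
    [1, 2],   # move B: quiet move, small edge either way
]
print(minimax(tree, maximizing=True))  # 1
```

Move A looks tempting, but the opponent gets to choose the reply, so it's really worth -4. Minimax picks move B.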

This is the primary algorithm we talked about in my intro to AI class back in 2015 - no LLMs to be found.

So, because checking every branch is expensive, you cut off branches that can't change the decision anymore. Sure, if you had infinite compute, you wouldn't even need a heuristic - but because we live in reality, we need our algorithms to be faster. This kind of shortcut is called alpha-beta pruning. Deep Blue's whole stack was built around doing that kind of search in parallel, across general-purpose processors and custom chess chips.
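Here's a small sketch of alpha-beta pruning on a toy tree, using the same "nested lists with scores at the leaves" format, which is my own simplification. The `visited` list is only there so you can see which leaves actually got scored.

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"), visited=None):
    """Minimax with alpha-beta pruning over a tree of nested lists."""
    if visited is None:
        visited = []
    if isinstance(node, (int, float)):
        visited.append(node)  # record every leaf we actually evaluate
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the opponent would never allow this branch
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:
            break  # we already have a better option elsewhere
    return best

tree = [[3, 5], [2, 9], [0, 7]]
visited = []
print(alphabeta(tree, True, visited=visited))  # 3
print(visited)  # [3, 5, 2, 0]
```

The answer is the same one plain minimax would give, but the leaves 9 and 7 never get looked at. Once a reply of 2 or 0 shows up, that move is already worse than the 3 we have in hand, so the rest of the branch can't change the decision.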

So that was Deep Blue's intelligence. A giant tree of possible futures, an accurate scoring system built by domain experts (chess players), and a machine fast enough to execute the calculations.

There was no "learning" in the human sense. No inner voice saying "I'll make Garry sweat and squeeze him in the endgame."

Campbell put it plainly: when IBM started the project in 1989, machine learning methods for game-playing programs were too primitive to help, so they focused on efficient search and evaluation.

Deep Blue was just looking, not learning, and not adapting.

Search-and-evaluate works beautifully, but only for a very specific kind of problem. You need to be able to describe the whole state. You need a list of all legal moves. Full visibility. And you need a decent way to score a position.

Chess has all of that. After centuries of play, humans have a pretty good list of what good positions look like.

Now try that with the task to write a birthday card for grandma. What are the legal moves? How do you score how "good" the wording is?

Or try it with code generation. Or autonomous driving through rain.

Minimax just doesn't work on these kinds of problems.

That's why Deep Blue could beat Kasparov but could never automate an email marketing campaign.

So, Deep Blue was a chess machine. The greatest chess machine ever built at the time. But still... just a chess machine. Not the gateway to AGI.

Why Modern AI Is Different

Okay, so what's different about the AI we can't shut up about today?

Eventually, we got the machines to start learning for real. But it took a really long time to get there.

Neural networks — programs loosely inspired by how neurons in the brain connect — have been around since the 1950s. Of course, for most of that time they were an interesting idea with mostly bad results. There wasn't enough data and certainly not enough compute.

But, by the mid-2010s, two things had changed. GPUs, originally built for video games, turned out to be great at the parallel math that neural networks need to efficiently scale. And the internet had produced absurd amounts of text - billions of pages of the written word. Tumblr fanfics, mommy blogs, bad takes on Twitter, and billions of lines of open-source code. All ready to be scraped on the open web.

But there was still a bottleneck. The best language models at the time used recurrent neural networks — RNNs. They read one word at a time, left to right, like a person reading a book. So you couldn't process words in parallel, and training was slow. And the longer the sentence got, the more the model forgot about the beginning.

Google's massive translation system, you know, Google Translate, turned out to be a great example of this. Around 2016 they shifted from phrase-based statistical translation to a deep network based on long short-term memory, or LSTM, with attention.

The old, phrase-based statistical method essentially looked up chunks of words and swapped them based on what was the most mathematically probable fit. But these chunks - often sloppily glued together - fell apart or felt clunky because the machine lacked a holistic view of the sentence.

The new system was called Google Neural Machine Translation, or GNMT, and it worked way better than the previous phrase-based translation system - Google reported that it cut translation errors by 55% to 85% on several major language pairs. Unlike the previous method, this RNN kept a "memory" of what it saw earlier - so it could keep better track of language-y things like gender, tense, and grammar across a sentence.

But recurrent models were still slow to train, and Google noted even GNMT still translated sentences in isolation rather than considering the whole paragraph or page.

Then in 2017, eight Google researchers published a very famous paper called "Attention Is All You Need".

They called their new deep learning architecture the "Transformer" because they thought it sounded cool. An early design document even had an image of six Transformers from the franchise.

But the paper was very serious. They proposed ditching the sequential reading entirely. Instead of processing words one at a time, the Transformer looks at every word in a sentence at once and figures out which words relate to which other words. That mechanism is called "attention".

If I say "The bank was steep and covered in grass," the word "bank" could TECHNICALLY mean a financial institution, or a riverbank. An attention mechanism lets the model look at "steep" and "grass" at the SAME TIME as "bank" and figure out which meaning makes more sense. It doesn't have to wait until it gets to "grass" all the way at the end of the sentence.
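Here's a stripped-down sketch of the core math, scaled dot-product attention, in plain Python. The 2-D word vectors are hand-picked toy numbers, not real embeddings, and a real Transformer would first run each word through separate learned query, key, and value projections. I'm reusing the same vectors for all three to keep it short.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors: axis 0 is roughly "nature", axis 1 is roughly "finance".
words = ["bank", "steep", "grass"]
vectors = [[1.0, 1.0], [2.0, 0.0], [3.0, 0.0]]

# Let "bank" attend to every word in the sentence at once.
weights, output = attention(vectors[0], vectors, vectors)
for word, weight in zip(words, weights):
    print(word, round(weight, 3))
print([round(x, 3) for x in output])
```

On its own, "bank" is split evenly between the two meanings. After attending to "steep" and "grass", its output vector leans heavily toward the "nature" axis. And every word's scores can be computed independently, which is what makes the whole thing so friendly to parallel hardware.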

And because it processes all the words at once, the Transformer can run on GPUs in parallel. So training gets way faster.

The original paper was about translation. English to German, English to French. They trained the model on 8 Nvidia P100 GPUs. The base model took 12 hours. The big model took three and a half days. And both beat the state of the art at a fraction of the cost.

But remember, these researchers weren't trying to build artificial general intelligence, they were just trying to make Google Translate better. Well, turns out they were onto something much, much bigger.

In June 2018, OpenAI published a paper called "Improving Language Understanding by Generative Pre-Training". This was GPT-1; "GPT" as in "Generative Pre-trained Transformer". A 117-million-parameter model with twelve layers of a Transformer decoder.

There were really two stages: First, pre-training: feed the model a huge pile of text and train it to predict the next word. No labels. No human annotation. Just "given these words, what word is most likely to come next?" And second, fine-tuning: take that pre-trained model and adapt it to specific tasks with a much smaller LABELED dataset. For example, not just "here's some python code that might work", but "here's python code that works", and "here's python code that doesn't work".
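The "no labels" part is easier to see in code. Here's a rough sketch of how raw text turns into training examples all by itself. It splits on whole words and uses a tiny context window for readability; real models work on tokens and much longer contexts, and the function name is my own.

```python
def next_word_examples(text, context_size=3):
    """Build (context, next word) training pairs from plain text."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]
        examples.append((context, words[i]))
    return examples

for context, target in next_word_examples("the dog jumped into the lake"):
    print(context, "->", target)
# ['the'] -> dog
# ['the', 'dog'] -> jumped
# ['the', 'dog', 'jumped'] -> into
# ['dog', 'jumped', 'into'] -> the
# ['jumped', 'into', 'the'] -> lake
```

Nobody had to annotate anything. The "answer" for each example is just the next word that was already sitting in the text.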

GPT-1 beat previous benchmarks on question answering, textual entailment, and common-sense reasoning by 5 to 9 percent on some tasks. Which basically means it had a much better understanding of the relationships between different pieces of text. For example, if the premise is "the dog jumped into the lake", it could predict that "the animal got wet", even though there are no directly overlapping words.

And it did this with very minimal changes to the underlying model between tasks. Same basic architecture, but different fine-tuning data.

Then in February of 2019, OpenAI released GPT-2 with 1.5 billion parameters — about ten times bigger than GPT-1, trained on ten times more data. They scraped 8 million web pages linked from Reddit posts that had at least 3 upvotes - which was their way of estimating this is "reasonably good" text.

GPT-2 could generate whole paragraphs of plausible writing. Give it a fake headline, and it'd write a news article with fake quotes and fake statistics. The Guardian called the output "plausible newspaper prose".

OpenAI was SO worried about misuse that they didn't release the full model at first. Sound familiar? It wasn't the first or last AI release wrapped in "too dangerous for public" messaging.

To us in 2026, it's hilarious because even GPT-3, which paved the way for ChatGPT, feels really stupid compared to 5.5, which STILL can't write an email that I'd be comfortable putting my name on. But GPT-3 was over 116 times bigger than GPT-2! It just goes to show the AI industry has a long and storied history of exaggerating its systems' capabilities. Probably for that sweet sweet investor capital.

Reactions in the tech community were mixed. Anima Anandkumar at Caltech called the decision to not release it "malicious BS" and said there was no evidence GPT-2 posed the threats OpenAI described. The Allen Institute for AI announced a tool to detect "neural fake news". Researchers worried that Twitter, websites, and email would be overwhelmed by text that sounded reasonable and context-appropriate, but was actually bogus machine writing. I think today we've proved them right.

But by November 2019, OpenAI changed their mind. They said they'd "seen no strong evidence of misuse so far" and released the full model. The Verge correctly pointed out that "we already have programs that can generate plausible text at high volume for little cost: humans".

Then in May 2020, the GPT-3 paper dropped. "Language Models are Few-Shot Learners." The new version now has 175 billion parameters trained on 570 gigabytes of text — CommonCrawl, WebText, English Wikipedia, and two book corpora.

GPT-3 could translate French, answer trivia, write code, and summarize articles — often just by giving it a few examples in the prompt. Minimal fiddling required.

How LLMs Work

So how does an LLM like GPT, Claude, or Gemini actually work behind the curtain?

Unlike Deep Blue, it doesn't start with a board, legal moves, or a game state.

It starts with tokens.

Tokens are chunks of text. Sometimes a whole word, sometimes a piece of a word, sometimes just some fancy Unicode punctuation. Your prompt gets chopped into tokens, and the model runs them through layers of attention and learned weights.
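To make "chopped into tokens" concrete, here's a toy tokenizer. The vocabulary is completely made up, and the greedy longest-match strategy is a simplification; production tokenizers learn their vocabularies from data and use more careful algorithms.

```python
# Hypothetical vocabulary, invented for this example.
VOCAB = ["un", "believ", "able", "token", "s", " ", "!"]

def tokenize(text, vocab=VOCAB):
    """Greedily grab the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = [v for v in vocab if text.startswith(v, i)]
        # Fall back to a single character if nothing in the vocab matches.
        match = max(candidates, key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unbelievable tokens!"))
# ['un', 'believ', 'able', ' ', 'token', 's', '!']
```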

At the end, it produces a probability distribution over the next token. It's not predicting "the right answer" or "the moral answer." Just: given everything I've seen so far, here are the next chunks of text that seem likely.

Then it picks one. Usually with a little randomness that you can actually tune if you want. Then it does it again. And again. Until it produces a "done" token.
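Here's a sketch of that pick-and-repeat loop. The `toy_model` function is a stand-in I invented with hard-coded scores; a real LLM computes those scores with billions of learned weights. The `temperature` knob is the "little randomness" dial: low values make the output nearly deterministic, high values make it more adventurous.

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Pick one token from a {token: score} dict using temperature sampling."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

def toy_model(context):
    """Stand-in for an LLM: made-up scores based only on the last token."""
    table = {
        "<start>": {"hello": 2.0, "hi": 1.0},
        "hello": {"world": 2.5, "there": 1.5},
        "hi": {"there": 2.0, "world": 0.5},
        "world": {"<done>": 3.0},
        "there": {"<done>": 3.0},
    }
    return table[context[-1]]

def generate(temperature=1.0, seed=0, max_tokens=10):
    """Sample tokens one at a time until the model emits <done>."""
    rng = random.Random(seed)
    context = ["<start>"]
    for _ in range(max_tokens):
        token = sample_next(toy_model(context), temperature, rng)
        if token == "<done>":
            break
        context.append(token)
    return context[1:]

print(generate(temperature=0.01))  # ['hello', 'world']
print(generate(temperature=1.0))   # some two-word greeting, depends on the seed
```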

That's a language model. A very expensive autocomplete machine. Now, it sounds like I'm being dismissive, but as it turns out, if the autocomplete is good enough, you can get a LOT of value out of it.

Language is everywhere. Contracts are language. Code is language. Emails, bug reports, product docs, coordinates on a map — a huge amount of work gets squeezed through words before it does anything in the world.

So if you get very, very good at modeling language, you can do a lot of useful work.

And then you give it access to tools. An LLM can write a Python script, but then you can give it a harness that allows it to actually run that Python script, look at the output, whether success or failure, edit the script, and run it again - that's all an agent is. An LLM with the ability to call tools in a loop.
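Here's what that loop can look like in miniature. Everything here is a hypothetical sketch: `scripted_model` is a fake stand-in for an LLM that follows a fixed script, and `run_python` uses a bare `exec`, which is fine for a demo but not something you'd want near untrusted code.

```python
def run_python(source):
    """Tool: run a snippet and report what happened (toy, not a safe sandbox)."""
    scope = {}
    try:
        exec(source, scope)
        return f"ok: result={scope.get('result')}"
    except Exception as error:
        return f"error: {type(error).__name__}: {error}"

def scripted_model(history):
    """Fake LLM. A real agent would send `history` to a model here."""
    last = history[-1]
    if last.startswith("task:"):
        # First attempt calls a function that doesn't exist.
        return {"tool": "run_python", "input": "result = total([1, 2, 3])"}
    if last.startswith("error:"):
        # After seeing the error, it "fixes" the script.
        return {"tool": "run_python", "input": "result = sum([1, 2, 3])"}
    return {"answer": last}

def agent(task, model=scripted_model, max_steps=5):
    """Call the model in a loop, feeding tool output back in each time."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(history)
        if "answer" in action:
            return action["answer"], history
        history.append(run_python(action["input"]))
    return None, history

answer, history = agent("add up 1, 2, and 3")
print(answer)  # ok: result=6
for step in history:
    print(step)
```

The first script blows up with a NameError, the error goes back into the history, and the second attempt works. Write, run, read the output, try again.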

Anyways, LLMs are unique "AI systems" because they handle the messy human-language wrapper but can still hand off the mechanical parts to "plain old" software. So, while a basic GPT-3 might not be the best chess player in the world, you can actually hand it access to Deep Blue (or a Deep Blue emulator), and it could then wield that tool to not only beat Magnus Carlsen, but also throw sweet shade after it wins.

So LLMs feel "general" in the "AGI" sense in a way Deep Blue never did at all. Deep Blue did one thing. LLMs appear to do anything you can shove into a language problem.

But appearing general and being general aren't the same thing.

The AGI Trap

And the people who wrote the early papers knew this. The GPT-3 paper flagged datasets where few-shot learning still struggled, and warned about methodological issues from training on web corpora. Even the GPT-1 writeup noted the "limits and bias of learning about the world through text" and warned that deep learning models still behave in surprising ways under adversarial or out-of-distribution tests. For example, in 2019, researchers at Allen Institute for AI discovered that they could "break" top-tier NLP models by adding seemingly nonsensical phrases. Adding something like "zoning tapping fiennes" to the beginning of a movie review forced a model that was 99% sure a review was positive to just flip its position, suddenly labeling it as "negative".

It's a bit philosophical, but to me it seems that reading and writing about the world in the statistically-most-ubiquitous way isn't "reasoning" in the same way that humans do it. To be fair, we still don't REALLY understand how humans reason - and is it even fair to say that AGI has to reason the SAME way that we do? I'm not sure, but at the moment it seems that LLMs can do a lot of things, but they are far from the type of raw-computational-intelligence dreamed about in 2001: A Space Odyssey.

I think it's not quite fair to say "people overhyped Deep Blue and the 2001 tech bubble, so people MUST be overhyping LLMs". Deep Blue and the minimax algorithm are obviously far less "general purpose" than LLMs. Deep Blue could literally only play chess.

But it's probably also a mistake to assume that "because LLMs are broader than Deep Blue, we must be on a straight, exponential growth trajectory to the AGI singularity".

The 70-ish years of AI research haven't just been one exponential progress graph, up and to the right, moving faster and faster.

We live in a world where we often have something that looks promising, and improves quickly, but then the S curve flattens and we see diminishing returns. Some have speculated we're already there with transformer technology. As a paper by Epoch AI (first posted in 2022 and revised in 2024) warned, public human text data could become a bottleneck for LLM scaling between 2026 and 2032 if current trends continue.
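
Here's a rough sketch of why the early part of an S curve is so easy to mistake for an exponential. It compares a logistic curve to the pure exponential it resembles early on. The `ceiling`, `rate`, and `midpoint` values are arbitrary numbers I picked for illustration; this is a toy, not a model of real AI progress.

```python
import math


def logistic(t: float, ceiling: float = 100.0, rate: float = 1.0,
             midpoint: float = 10.0) -> float:
    """Logistic S curve: near-exponential early on, flat near `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def exponential(t: float, ceiling: float = 100.0, rate: float = 1.0,
                midpoint: float = 10.0) -> float:
    """The pure exponential that the logistic curve approximates when t << midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


def gain(curve, t: float) -> float:
    """Improvement from one time step to the next."""
    return curve(t + 1) - curve(t)


if __name__ == "__main__":
    print(" t   logistic   exponential   logistic gain")
    for t in (2, 5, 8, 10, 12, 15):
        print(f"{t:2d} {logistic(t):10.3f} {exponential(t):13.3f} "
              f"{gain(logistic, t):15.3f}")
```

Early on (t = 2 or t = 5 here), the two curves are nearly indistinguishable and each step brings a bigger gain than the last. Past the midpoint, the exponential keeps exploding while the logistic curve's per-step gains shrink toward zero. If all you can see is the early data, you can't tell which curve you're on.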

So where are we on the graph, like right now? It might keep getting better quickly, but that's not a guarantee. We might just be hitting yet another local maximum on the real zoomed-out progress curve.

And in that case, it will take another breakthrough that doesn't involve simply adding more datacenters to the compute capacity and Reddit posts to the dataset.

All this to say, I'm not sure, but I'm optimistic about these tools when used appropriately. But I personally don't see a way to get really good and tasteful work done without the human element.

Bibliography

  1. Chess.com, "Kasparov vs. Deep Blue | The Match That Changed History." Updated October 12, 2018.
  2. IBM, "Deep Blue." IBM Heritage.
  3. LA Times, "After Sudden Defeat, It's Kasparov Who's Blue." May 12, 1997.
  4. The New York Times Archives on X, "Swift and Slashing, Computer Topples Kasparov."
  5. Computer History Museum, "Deep Blue Cartoon."
  6. Chessprogramming Wiki, "ChipTest."
  7. Encyclopaedia Britannica, "UNIX."
  8. The Go Programming Language, "Frequently Asked Questions: What is the history of the project?"
  9. Chessprogramming Wiki, "Deep Blue."
  10. Thomas Anantharaman et al., "Singular Extensions: Adding Selectivity to Brute-Force Searching." ICCA Journal, 1988.
  11. Feng-hsiung Hsu, "IBM's Deep Blue Chess Grandmaster Chips." IEEE Micro 19(2): 70-81, March/April 1999.
  12. Feng-hsiung Hsu, Murray Campbell, and A. Joseph Hoane Jr., "Deep Blue System Overview." Proceedings of the 9th International Conference on Supercomputing, ACM, 1995, pp. 240-244.
  13. Larry Greenemeier, "20 Years after Deep Blue: How AI Has Advanced Since Conquering Chess." Scientific American, June 2, 2017.
  14. Klint Finley, "Did a Computer Bug Help Deep Blue Beat Kasparov?" Wired, September 28, 2012.
  15. Bill Chappell, "Chess world champion Magnus Carlsen accuses Hans Niemann of cheating." NPR, September 27, 2022.
  16. Chess.com, "Kasparov vs. Deep Blue | The Match That Changed History." Updated October 12, 2018.
  17. Murray Campbell, "Knowledge Discovery in Deep Blue." Communications of the ACM 42(11): 65-67, November 1999.
  18. Ashish Vaswani et al., "Attention Is All You Need." arXiv:1706.03762, submitted June 12, 2017.
  19. Steven Levy, "8 Google Employees Invented Modern AI. Here's the Inside Story." Wired, March 20, 2024.
  20. Alec Radford et al., "Improving Language Understanding by Generative Pre-Training." OpenAI, June 11, 2018.
  21. Alec Radford et al., "Better language models and their implications." OpenAI, February 14, 2019.
  22. James Vincent, "OpenAI has published the text-generating AI it said was too dangerous to share." The Verge, November 7, 2019.
  23. Alex Hern, "New AI fake text generator may be too dangerous to release, say creators." The Guardian, February 14, 2019.
  24. Tom B. Brown et al., "Language Models are Few-Shot Learners." arXiv:2005.14165, submitted May 28, 2020.
  25. Eric Wallace et al., "Universal Adversarial Triggers for Attacking and Analyzing NLP." EMNLP, 2019.
  26. Pablo Villalobos et al., "Will we run out of data? Limits of LLM scaling based on human-generated data." arXiv:2211.04325, submitted October 26, 2022; revised June 4, 2024.
  27. Chessprogramming Wiki, "Deep Thought."
  28. Chessprogramming Wiki, "Kasparov versus Deep Blue 1997."
  29. Quoc V. Le and Mike Schuster, "A Neural Network for Machine Translation, at Production Scale." Google Research Blog, September 27, 2016.
  30. Wikipedia, "Deep Blue versus Garry Kasparov," section "1997 rematch: Game 2."