Deep Learning: The Ultimate Search Engine
The value of finding prior art
My last run of posts has been highlighting tools for building a trustworthy future of AI agents, especially formal verification and cryptography. The last idea I want to situate properly is deep learning, which is certainly the one the average reader will have top-of-mind. Most of what I have to say will apply to machine learning more broadly, though it’s hard to ignore the spectacular recent success of deep learning in particular.
This post will develop a way of thinking about the strengths and capabilities of deep learning. The following posts will highlight weaknesses, setting us up for a good plan of how reliable systems might divide responsibility between deep learning and other techniques.
A Cartoon View of Deep Learning
We probably all learned in middle-school math class how a line can be described by an equation y = mx + b. If someone hands us a pile of points in two-dimensional space, we can solve for the line they fit on, that is, for m and b. The idea generalizes to points in any dimensionality of space and to curves beyond lines.
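To make the middle-school example concrete, here is a minimal sketch of solving for m and b from a pile of points. It assumes ordinary least squares as the solving method, which the text above does not name; the function name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(points):
    """Least-squares fit of y = m*x + b to a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept
    return m, b


# Illustrative points that lie exactly on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m, b = fit_line(points)
```

When the points lie exactly on a line, this recovers that line; when they do not, it returns the line with the smallest total squared vertical error.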
Real-world data are messy, though. We can imagine somehow turning all of the contents of the public web into a gigantic set of points of many dimensions. Then we can think of LLMs as mapping prompts x to answers y. However, the training data won’t be fit by any clean curve; we need to allow for fuzziness, where the curve matches the points only approximately. Luckily, the field of machine learning has figured out how to do that kind of large-scale data analysis. Leading-edge LLMs involve hundreds of billions of parameters and beyond. We can roughly think of “parameters” in that sentence as describing values to solve for, just like m and b in our middle-school example.
It’s not surprising, though, that the methods of middle-school math are not up to the challenge! For one thing, training models typically uses a lot of compute, relying heavily on GPU hardware. But how should we structure the computation? Old-school methods for solving systems of linear equations, like the ones Gauss worked out around 1800, won’t get the job done. Instead, the big idea is gradient descent.
I’m not going to go into mathematical detail on gradient descent, but the basic idea is to start from a guess of good parameter values. They determine a mathematical function. Evaluate it on a good suite of example inputs, where you actually know the right answer for each one. Examine the details of how wrong your current function is and in what ways. With a bit of calculus, we can turn that analysis into a plan for how to modify the parameter values to get closer to correct. We repeat this process, improving the parameters until we are happy with how well the function performs.
The following diagram illustrates gradient descent for machine learning at a high level. The three-dimensional surface represents a loss function that expresses just how wrong our main function is. We want to push the loss as close to zero as we can. Assuming a relatively smooth structure of the loss function, we kind of just sled downhill one step at a time, always choosing the steepest downward slope to explore next. Each move corresponds to modifying the parameters.
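The loop described above can be sketched for the two-parameter line y = mx + b. This is a toy under stated assumptions: mean squared error as the loss, a fixed learning rate, and a fixed number of steps, all chosen for illustration rather than taken from how any real model is trained.

```python
def loss(m, b, points):
    """Mean squared error: how wrong the line y = m*x + b is on the examples."""
    return sum((m * x + b - y) ** 2 for x, y in points) / len(points)


def gradient(m, b, points):
    """Partial derivatives of the loss with respect to m and b."""
    n = len(points)
    dm = sum(2 * (m * x + b - y) * x for x, y in points) / n
    db = sum(2 * (m * x + b - y) for x, y in points) / n
    return dm, db


def gradient_descent(points, learning_rate=0.05, steps=2000):
    m, b = 0.0, 0.0  # initial guess for the parameters
    for _ in range(steps):
        dm, db = gradient(m, b, points)
        # Step downhill: move each parameter against its slope.
        m -= learning_rate * dm
        b -= learning_rate * db
    return m, b


# Examples with known right answers, all on the line y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m, b = gradient_descent(examples)
```

Each pass through the loop is one step of the sled ride: measure the slope of the loss at the current parameters, then nudge the parameters in the downhill direction.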
At some level, then, this cartoon picture suffices to explain the full power of LLMs and related generative AI. Answering a prompt to an LLM is actually broken down into many executions of the function whose parameters we figured out, generating a sequence of tokens (the smallest supported chunks of characters) that forms a full answer, with prior subanswers fed back into the inputs of later function executions. Some of the most impressive uses also depend on reasoning models, which build out whole internal documents that lay out how to solve a problem step-by-step and then show only the final answer to the user. For instance, a math proof may be developed one serious step of deduction at a time, including false starts that need to be undone when the AI can’t figure out what to try next. All of those steps work by essentially the same kind of repeated calls to a function whose parameters we learned.
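The token-by-token loop can be sketched with a toy stand-in for the learned function. Everything here is hypothetical: `NEXT_TOKEN` is a hand-written lookup table rather than a trained model, it looks only at the most recent token, and it always returns one fixed answer, whereas a real LLM conditions on the whole sequence so far and picks from a probability distribution over tokens.

```python
# Hypothetical stand-in for the learned function: given the tokens so far,
# name the next token. "<end>" marks the end of the answer.
NEXT_TOKEN = {"the": "cat", "cat": "sat", "sat": "down", "down": "<end>"}


def next_token(tokens):
    """One execution of the 'function whose parameters we figured out'."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":
            break
        # Feed the output back in: later calls see earlier answers.
        tokens.append(token)
    return tokens


answer = generate(["the"])
```

The feedback is the `tokens.append(token)` line: every new token becomes part of the input for the next call.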
Why Generative AI Works So Well
Many people find it counterintuitive how well this approach to answering questions works. To me, the related empirical result is that memory accounts for a much bigger slice of intelligence than we realized. An LLM is accomplishing something much like memorizing the web, ready to regurgitate useful facts as needed, though its actual internal structure is more complicated. Rather than denigrating the LLM as a stochastic parrot, I prefer to describe the technology as supporting amazingly effective search engines, like Google on steroids. In 2020, I wouldn’t have predicted that a top-notch search engine could provide so much of what we call intelligence, but it does! Part of what makes this technique so effective is that it isn’t literally a search engine in the old sense: it relies on mysterious structure discovered from unstructured data. Still, I think the metaphor proves its value.
Actually, LLMs internally use a process closely related to search engines. To make sense out of a whole web of information, they squash units of information into sequences of numbers of some fixed length, each of which we call a vector embedding. An important part of LLM execution is finding corners of idea space whose vectors are similar to those of the prompt.
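As a rough illustration of similarity search over embeddings, here is a sketch that assumes cosine similarity as the measure of "similar" and uses tiny hand-written three-number vectors. Real embeddings are learned, far longer, and the snippets and numbers below are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: each snippet of text squashed to three numbers.
EMBEDDINGS = {
    "how to bake bread": [0.9, 0.1, 0.0],
    "sourdough starter tips": [0.8, 0.2, 0.1],
    "fixing a flat bicycle tire": [0.0, 0.1, 0.9],
}


def most_similar(query_vector, embeddings):
    """Return the stored snippet whose vector points closest to the query's."""
    return max(
        embeddings,
        key=lambda text: cosine_similarity(query_vector, embeddings[text]),
    )


# A made-up embedding for a prompt about bicycle repair.
best_match = most_similar([0.1, 0.0, 1.0], EMBEDDINGS)
```

The point of the fixed length is that any two vectors can be compared directly, no matter how different the original texts were.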
So it seems empirically that looking up memories and doing a good job piecing them together can cover a lot of what we consider intelligence, when you have a lot of training data to work with. An LLM has an advantage over all but the top specialists in any given question, because it has so many more “memories” to draw on.
There is also an evolutionary-psychology reason why we should expect humans to be so surprised by how well these methods work. Genetically, we’re still quite close to our hunter-gatherer ancestors of hundreds of thousands of years ago. They lived in bands of about 100 people. Each band somewhat independently maintained a shared knowledge base of life skills, passed on orally. Each elder of the band could come to hold in personal memory a rather large fraction of all the knowledge that would ever become relevant to anyone’s life. As a result, we come to think we understand what it’s like to make decisions based on a knowledge base, though the scale of that knowledge for an LLM is so immensely larger than what any person has known, producing what seems like an entirely different phenomenon.
The result is that we systematically underestimate how common it is for one of us to have a question that no one we know can answer, and yet the literal answer is present somewhere out there. Think about 200-person companies and the various questions that arise in daily business. Rather small variations of these questions are likely to come up across companies, and the variations are small enough that search using vector embeddings can identify prior art very effectively. Programmers underestimate how their exciting new challenges are extremely similar to others published online. The pattern continues across many domains.
Still, small differences show up in an individual instance of an established question category. The proper answer is a combination of existing nuggets of wisdom from the training data. We’re most impressed and surprised when those nuggets come from rather different domains, such that there may be no human expert on all the domains. The LLM doesn’t mind combining all that expertise, with comic effects like, say, customer-service chatbots that are happy to solve chemistry homework problems if you ask nicely. While a human team of experts would spend significant time and money on coordination, an LLM can combine answers almost instantly.
This kind of flow actually involves two different senses of the word “search.” On the one hand, there is the sense from information retrieval that I have emphasized for LLMs: finding facts from a database that are most relevant to a question. On the other hand, we have a sense from classic AI tasks like planning or automated theorem proving, where we explore the different ways to combine fixed ingredients. That latter style I presented earlier as a way to think about human work advancing the frontiers of knowledge, including an interesting role for formal verification in grading candidate solutions.
Regardless of the particular techniques, we arrive at the following pattern for solving (moderately) novel problems. The idea is that the two kinds of search respectively find ingredients and whole recipes, shown in the next diagram, which probably qualifies as “the one to remember from this post.” A few concrete examples of this pattern are:
To design a new mechanical device, finding blueprints for several relevant physical components and then creating a new blueprint that plugs them together
To write a new computer program, finding a set of relevant code libraries and then using them to write the program succinctly
To prove a new mathematical theorem, finding the most relevant theorems from the literature and then citing them in a new proof
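The ingredients-then-recipes pattern can be sketched as two small functions, one per kind of search. This is a toy under loud assumptions: the `LIBRARY` of tagged integer functions is invented, retrieval is a plain tag lookup rather than anything learned, and the recipe search is brute-force enumeration, shortest recipes first.

```python
from itertools import product

# Hypothetical knowledge base: named ingredients, each tagged with a topic.
LIBRARY = {
    "double": (lambda n: n * 2, {"arithmetic"}),
    "increment": (lambda n: n + 1, {"arithmetic"}),
    "square": (lambda n: n * n, {"arithmetic"}),
    "negate": (lambda n: -n, {"sign"}),
}


def retrieve(topic):
    """First kind of search: look up the ingredients relevant to a topic."""
    return {name: fn for name, (fn, tags) in LIBRARY.items() if topic in tags}


def assemble(ingredients, start, goal, max_steps=3):
    """Second kind of search: explore ways to combine the ingredients."""
    for length in range(1, max_steps + 1):
        for recipe in product(sorted(ingredients), repeat=length):
            value = start
            for name in recipe:
                value = ingredients[name](value)
            if value == goal:  # grade the candidate recipe
                return list(recipe)
    return None  # no recipe found within the step limit


recipe = assemble(retrieve("arithmetic"), start=3, goal=49)
```

Retrieval narrows the ingredients before the combination search starts, which matters because the number of candidate recipes grows quickly with both the number of ingredients and the recipe length.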
Reasoning LLMs use the same information-retrieval process to decide on steps in the search that assembles ideas into full answers. They can function most effectively when the procedures for combining ideas are themselves well-represented in training data. However, we should expect that domains like cutting-edge research pose challenges for both the existence of good ideas to build on and methods to assemble them.
It’s worth mentioning that this framing has a lot in common with ideas from classic AI. Newell and Simon developed the heuristic-search approach that covers the second kind of search here, and case-based reasoning was later developed as an approach to retrieving useful precedent and assembling it. Hofstadter has pushed the related idea of analogy as central to problem solving.
Next Steps
Deep learning is tremendously effective at finding the most relevant knowledge from large data sets. However, it is much weaker at systematically exploring how to combine ideas, an essential part of cutting-edge knowledge work. The next few posts make that case. The first justification I’ll cover is a performance bottleneck revealed even by the name “deep learning” – we should expect that this style of information retrieval is inherently expensive and slow. After that, I’ll discuss how hard it is to achieve any kind of strong guarantees of answer quality from deep learning, contrasting with another established style. It is widely believed today that many important problems are just so challenging that we have no better tools available, and I’ll spend a post on an outside-the-box approach to shrinking the set of those problems, expanding on the approach I already covered for AI coding assistants.

“An LLM has an advantage over all but the top specialists in any given question, because it has so many more ‘memories’ to draw on.”
I actually like this comparison, because comparing it to what we know as a ‘memory’ implies that these memories can be ‘encoded incorrectly’ in an LLM, similar to how they can be in a human. There are plenty of instances of false memories in human brains, where we might recall a situation incorrectly or recall a situation that may have never happened. I can imagine that an LLM equivalent would be hallucinations, or just incorrectly recalling certain facts from its training data.