Why "Deep" Often Means "Slow"
Multiplicative compounding of latency with generative AI
Let me continue describing the pros and cons of deep learning, to help decide how to divide up the work of future intelligence with other elements like formal verification. My last post presented deep learning as the ultimate search engine, for finding prior art related to new goals, considering massive data sets that might contain ideas that are related for reasons that are not obvious. Now I’ll spend two posts on weaknesses of deep learning, starting with the amount of time it takes to get an answer, especially with the canonical kind of LLM-based tool being built today.
Our take-away message is going to be that systems based on deep learning introduce sequentiality (certain computation steps that must happen after others) that is often not inherent in the questions that we ask them to answer. Instead, it arises from how we choose to organize computation to answer them. The result is fundamental delays in getting those answers back.
Latency and Throughput in Parallel Systems
We need to think in general terms about the performance of computer systems. Let me explain the basics using an example of courses and their prerequisites.
Imagine that every student in a certain department needs to take every course shown as a rectangle in this diagram. An arrow from one box to another indicates a prerequisite: the course at the source of the arrow must be taken before the course at the destination of the arrow. A longest path in this diagram is highlighted along the top, with courses that are not grayed out.
Even just from the perspective of one student, a process of completing course requirements is parallel: many activities can be happening at once, here taking multiple courses. Following terminology from running-time analysis of parallel algorithms, we will call the longest path in such a diagram the critical path. We use depth to refer to the length of a parallel workload’s critical path, and we use work to refer to the total number of steps (courses in this example), within a taxonomy introduced by Blelloch and collaborators a few decades ago.
An important theorem in this domain is that a given parallel workload must run for a number of steps at least equal to its depth. This conclusion should be intuitive after digesting the terminology: a long prerequisite chain in degree requirements indeed implies a minimum number of terms to complete the degree.
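To make work and depth concrete, here is a minimal sketch in Python. The course catalog is hypothetical, invented only for illustration; the diagram in this post has its own courses. The sketch counts work, computes depth as the longest prerequisite chain, and checks that even a student who takes every available course each term still needs a number of terms equal to the depth.

```python
from functools import lru_cache

# Hypothetical catalog (an assumption for illustration): course -> prerequisites.
PREREQS = {
    "Intro": [],
    "DiscreteMath": [],
    "DataStructures": ["Intro"],
    "Algorithms": ["DataStructures", "DiscreteMath"],
    "Systems": ["DataStructures"],
    "Compilers": ["Algorithms", "Systems"],
}


def work(prereqs):
    """Work: the total number of steps (courses) in the workload."""
    return len(prereqs)


def depth(prereqs):
    """Depth: the number of courses on a longest prerequisite chain."""

    @lru_cache(maxsize=None)
    def longest_chain_ending_at(course):
        return 1 + max(
            (longest_chain_ending_at(p) for p in prereqs[course]), default=0
        )

    return max((longest_chain_ending_at(c) for c in prereqs), default=0)


def terms_with_unlimited_course_load(prereqs):
    """Greedy schedule: each term, take every course whose prerequisites are done."""
    done, terms = set(), 0
    while len(done) < len(prereqs):
        ready = {
            c for c in prereqs
            if c not in done and all(p in done for p in prereqs[c])
        }
        if not ready:
            raise ValueError("prerequisites contain a cycle")
        done |= ready
        terms += 1
    return terms


print(work(PREREQS))                              # 6
print(depth(PREREQS))                             # 4
print(terms_with_unlimited_course_load(PREREQS))  # 4
```

In this toy catalog the chain Intro, DataStructures, Algorithms, Compilers is a critical path, so no amount of parallel course-taking gets a student out in fewer than four terms.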
In general, we run many instances of a parallel workload at once. For our courses example, this phenomenon corresponds to having many students enrolled at once. The time from the beginning of one execution of the workload to full completion is called latency. The number of workload instances finishing per step is called throughput. Our example connects latency to how long it takes one student to graduate and throughput to the number of students graduating each term.
Clearly both latency and throughput matter in the running example, but latency is more important from the student perspective. It is no comfort to learn that thousands of students graduate each term if it also takes 100 terms to graduate.
Critical Paths of Deep Neural Networks
Most people encountering the term “deep learning” naturally assume that “deep” is only good news: the neural network is doing something hard and complicated. The term refers more specifically to how many layers a neural network has. This diagram shows roughly the kind of structure of a deep neural network, with layers arranged horizontally, each layer including many artificial neurons, spread out vertically.
Each neuron has inputs coming from (in general) many others from the previous layer. Every neuron is like one course in our earlier example: one unit of work that needs to be performed eventually, to generate a complete answer to a question, with dependencies on some previous steps (here from the previous layer). Note that, as the highlighted path shows, the critical path now runs through all layers, so depth and thus latency are proportional to the layer count. Since “deep” refers to the number of layers, here we have our illustration of how “deep” can mean “slow.” There are many ways that a deep neural network could be evaluated as a parallel workload, but, without some higher-level rearrangement of computation, generating an answer will take time at least proportional to layer count.
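As a rough illustration of that claim, the following Python sketch builds the dependency graph of a simplified, fully connected layered network and measures its critical path. The function names and the layer and width values are assumptions chosen for the example, not a description of any real model. Widening the layers adds work but leaves depth unchanged.

```python
def layered_network(num_layers, width):
    """Map each neuron (layer, index) to its inputs: all neurons in the previous layer."""
    deps = {}
    for layer in range(num_layers):
        for i in range(width):
            if layer == 0:
                deps[(layer, i)] = []
            else:
                deps[(layer, i)] = [(layer - 1, j) for j in range(width)]
    return deps


def critical_path_length(deps):
    """Number of neurons on a longest dependency chain."""
    longest = {}
    # Tuples sort by layer first, so every neuron is visited after its inputs.
    for node in sorted(deps):
        longest[node] = 1 + max((longest[p] for p in deps[node]), default=0)
    return max(longest.values(), default=0)


narrow = layered_network(num_layers=12, width=4)
wide = layered_network(num_layers=12, width=64)

print(len(narrow), critical_path_length(narrow))  # 48 12
print(len(wide), critical_path_length(wide))      # 768 12
```

The wide network has sixteen times the work of the narrow one, yet both have depth 12, which is exactly the layer count.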
Of course, there remain all kinds of tricks for boosting throughput. Most obviously, we expect that roughly all neurons in a single layer can be evaluated at once. The standard computer-systems technique of pipelining is also used, just as we would expect from our earlier example of courses: many executions of the neural network can be run simultaneously, where, at any moment, each layer is working on a different user request. However, these optimizations only boost throughput and, in fact, can introduce extra coordination across steps that even increases latency (e.g., pipelining breaks a computation into pieces that incur extra overhead communicating with each other through queues). While AI companies are motivated to control costs by improving throughput, an individual end user’s experience is more dependent on latency, even in the presence of optimizations like batching (executing requests from different users simultaneously). That user can pay lower fees thanks to throughput-improving resource sharing, but it’s hard not to notice that, say, a chatbot takes a long time to finish returning an acceptable answer, and latency can matter even more for emerging use cases.
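The gap between throughput and latency under pipelining can be sketched with a small simulation. This is an idealized model under stated assumptions: every stage (layer) takes the same time, a new request enters as soon as the first stage is free, and `handoff` is a hypothetical per-stage queueing overhead. None of the numbers are measurements.

```python
def simulate_pipeline(num_requests, num_stages, stage_time=1.0, handoff=0.0):
    """Return (per-request latency, overall throughput) for an idealized pipeline."""
    stage_free_at = [0.0] * num_stages  # when each stage can accept new work
    latencies = []
    finish = 0.0
    for _ in range(num_requests):
        start = stage_free_at[0]  # admitted as soon as stage 0 is free
        t = start
        for s in range(num_stages):
            t = max(t, stage_free_at[s]) + stage_time  # wait for stage, then run
            stage_free_at[s] = t
            if s < num_stages - 1:
                t += handoff  # hypothetical cost of passing work through a queue
        latencies.append(t - start)
        finish = t
    return max(latencies), num_requests / finish


# One request at a time: latency 4.0, throughput 1/4 request per time unit.
print(simulate_pipeline(num_requests=1, num_stages=4))

# Many pipelined requests: throughput approaches 1 per time unit, latency is still 4.0.
print(simulate_pipeline(num_requests=1000, num_stages=4))

# With queueing overhead between stages, latency gets worse: 4 + 3 * 0.25 = 4.75.
print(simulate_pipeline(num_requests=1000, num_stages=4, handoff=0.25))
```

Throughput improves by nearly a factor of four in this toy setup, while no individual request ever finishes in less time than the four sequential stages require.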
It can also be argued that the diagram above is oversimplified. Cutting-edge neural networks don’t literally feed neurons with just outputs from immediately previous layers. An especially common technique is KV caching, which feeds each neuron a somewhat-complex summary of past steps. However, these variations don’t change the fact of long sequential dependencies (critical paths), with lengths proportional to layer counts. Two other common techniques worth mentioning are attention, which manipulates dependencies within layers but basically maintains critical-path structure across layers; and residual connections, which effectively allow some connections between nonadjacent layers but keep the normal layer-to-layer connections, too, so roughly the same critical paths remain.
Recapping, by virtue of how deep learning is organized with long critical paths, it is unavoidable that some steps just need to happen after others, and in fact long chains of such dependencies develop, forcing delays before final answers can be delivered.
Deep All the Way Down, Multiplying Latencies
Now let’s consider other elements in agentic systems being built on top of LLMs today.
First, an LLM generates a full answer by calling a neural network repeatedly, once to generate each token (which is roughly a single word). The input to the neural network is a representation of the tokens that were produced previously, which leads to a long critical path stepping sequentially through all invocations of the neural network. In other words, when the neural network has depth d and we are generating an LLM response of t tokens, the critical path at this level of system detail has length proportional to dt.
How exactly do we arrive at depth about dt? At the level of detail in this last diagram, we easily trace out a critical path of length t, starting with the first neural-network invocation, whose output flows into the next, and so on until the last neural-network invocation. However, each box labeled “DNN” (for deep neural network) is a copy of the prior diagram, where we traced out a critical path of length d. Dropping in those copies, the path segments connect together to reach length proportional to dt. Intuitively, the system makes t queries to the DNN in order, and, for each query, we have to pause and wait for the d sequential steps of the DNN.
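The same counting argument can be written as a toy generation loop in Python. The "network" here is a placeholder that only mimics the dependency structure (each layer needs the previous layer's result, and each invocation needs all earlier tokens); it is an assumption for illustration and does not resemble a real model's arithmetic.

```python
def run_dnn(tokens, num_layers, clock):
    """Stand-in for one forward pass: layers must run one after another."""
    state = sum(tokens)
    for _ in range(num_layers):
        state = (state * 31 + 7) % 1000  # placeholder for one layer's computation
        clock[0] += 1                    # one more sequential step on the critical path
    return state


def generate(prompt, num_tokens, num_layers):
    """Produce num_tokens tokens, each depending on all tokens before it."""
    tokens = list(prompt)
    clock = [0]  # counts sequential layer evaluations
    for _ in range(num_tokens):
        next_token = run_dnn(tokens, num_layers, clock)
        tokens.append(next_token)  # the next invocation cannot start before this
    return tokens, clock[0]


d, t = 4, 5
_, sequential_steps = generate(prompt=[1, 2, 3], num_tokens=t, num_layers=d)
print(sequential_steps)  # 20, which is d * t
```

Because the loop body cannot begin until the previous token exists, and each body contains d dependent layer steps, the counter ends at d times t regardless of how much hardware is available.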
A few techniques allow deviating from this kind of workload diagram. For instance, speculative decoding uses cheaper neural networks to guess batches of answers, which are then checked in single shots by more-expensive neural networks. However, this technique only improves latency by a small constant factor – not enough to offset the multiplicative compounding of latency. As a result, latency remains proportional to the product of neural-network depth and LLM output length.
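A back-of-the-envelope model in Python shows why the improvement is a constant factor. The model is a simplification I am assuming for illustration: each round, a draft network of depth `draft_depth` guesses `k` tokens one at a time, the large network of depth `d` checks them in one pass, and on average `accepted_per_round` tokens survive. All numbers are hypothetical.

```python
import math


def baseline_depth(d, t):
    """Sequential layer steps for plain token-by-token generation."""
    return d * t


def speculative_depth(d, t, draft_depth, k, accepted_per_round):
    """Sequential layer steps under the simplified speculative-decoding model."""
    rounds = math.ceil(t / accepted_per_round)
    steps_per_round = k * draft_depth + d  # k draft passes, then one verification pass
    return rounds * steps_per_round


d, draft_depth, k, accepted = 80, 8, 4, 3  # hypothetical values

for t in (300, 3000):
    base = baseline_depth(d, t)
    spec = speculative_depth(d, t, draft_depth, k, accepted)
    print(t, base, spec, round(base / spec, 2))
# 300 24000 11200 2.14
# 3000 240000 112000 2.14
```

Under these assumptions the speedup is about 2.14x for both output lengths: the critical path still grows linearly in t and in d, only with a smaller constant in front.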
Now consider how a coding assistant like Claude Code works. It can take on many tasks in parallel, but some objectives still depend on relatively long sequences of steps – where many individual steps are copies of the whole LLM flow we just covered. Also, some steps make arbitrary tool calls to external programs that introduce latency of their own. This diagram shows the coding agent calling a compiler to get feedback on relatively shallow problems it finds in code, as well as a test runner to see which inputs from a test suite the program fails to generate the right answers for.
By now you may be used to spotting the problem for latency: every LLM node in this diagram has critical path length proportional to dt. Now consider the longest sequence of high-level steps in the new diagram, with length s. We are up to depth at least dts, before even taking into account the latency of the tool calls. The time it takes to get a complete answer to one top-level question is being multiplied by a further factor for every level of additional sophistication that we introduce. The reason is again that every LLM box expands into a full copy of the last diagram, and the path segments visible directly in this picture connect with the s copies of the critical path of length about dt from before. Latency has been compounding through all layers of the system, from neural network to repeated next-token prediction to higher-level agentic workflow. Adding functionality doesn’t add to latency but instead multiplies it.
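Here is a minimal Python sketch of that accounting for one chain of agent steps. The step types, token counts, and tool latencies are all hypothetical, and tool latency is expressed in the same unit as one sequential layer evaluation purely to keep the arithmetic simple.

```python
from dataclasses import dataclass


@dataclass
class LLMStep:
    tokens: int  # length t of this response


@dataclass
class ToolCall:
    name: str
    latency_steps: int  # hypothetical latency, in units of one layer evaluation


def critical_path(chain, dnn_depth):
    """Sequential steps along one chain of agent actions."""
    total = 0
    for step in chain:
        if isinstance(step, LLMStep):
            total += dnn_depth * step.tokens  # each LLM step expands to d * t
        else:
            total += step.latency_steps       # tool calls add their own delay
    return total


# Hypothetical workflow: write code, compile, fix, run tests, fix again.
chain = [
    LLMStep(tokens=200),
    ToolCall("compiler", latency_steps=500),
    LLMStep(tokens=200),
    ToolCall("test runner", latency_steps=2000),
    LLMStep(tokens=200),
]

d = 50
print(critical_path(chain, dnn_depth=d))  # 32500
```

With s = 3 LLM steps of t = 200 tokens each and depth d = 50, the LLM portion alone contributes d times t times s = 30000 sequential steps, and the two tool calls add 2500 more on top.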
Conclusion
The way that mainstream generative-AI systems are being designed today exhibits a fundamental phenomenon of multiplicative compounding of latency: each level of additional functionality multiplies latency by a further factor. Latency isn’t everything, but getting full answers to questions sooner is clearly better. For instance, the programmer using an AI coding assistant must typically wait out the full latency before doing code review and testing to finalize a code contribution.
The best rejoinder to these observations about current generative-AI systems is that we don’t seem to have come up with other ways to solve the same problems with nearly the same level of solution quality. Perhaps the latency penalty is unavoidable for those problems. Two posts from now, I’ll present a framework addressing that response from an unusual angle. Zooming out even further than this post does, we can avoid such high latency by making other architectural changes that shorten critical paths.
First, though, in the next post, I want to present the challenge of explainability for machine learning, contrasted with an older style of artificial intelligence.





