If you’ve followed the earlier installments in this series, you’ve probably noticed a pattern beginning to emerge.
In the first article, we discussed how NAND flash isn’t disappearing, but instead becoming part of a much larger AI memory hierarchy. After that, we looked at High Bandwidth Memory (HBM) and why modern GPUs depend on having data physically closer to the processor. Then we moved into Storage Class Memory, High Bandwidth Flash, the limitations of DRAM scaling, and finally why even traditional hard drives still remain critical because AI infrastructure operates at a scale that most people dramatically underestimate.
At first glance, those may seem like separate topics.
They aren’t.
They are all symptoms of the same underlying pressure: AI systems are no longer struggling primarily with compute power. They are struggling with how efficiently they can move data.
That shift changes almost everything about how infrastructure gets designed.
For decades, computing followed a fairly stable model. Storage held the data, memory staged it, and processors fetched what they needed. As processors became faster, the system simply tried to feed them more efficiently using better buses, larger caches, and faster memory technologies.
AI changed the scale of the problem.
Modern GPU clusters can process information at such a massive rate that the act of moving data around the system has started becoming one of the largest bottlenecks in the entire architecture. In some environments, the processor itself is no longer the slow part. The delay comes from getting the right data to the processor quickly enough and consistently enough to keep it fully utilized.
That realization is quietly forcing the industry into a new direction.
Instead of continuously moving larger amounts of data back and forth across the system, AI infrastructure is starting to move portions of compute closer to where the data already lives.
And once you understand why that is happening, many of the earlier articles in this series begin fitting together much more clearly.
AI Is Starting To Hit a Data Movement Wall
One of the most important ideas from the earlier HBM article was that modern AI systems often slow down not because the processor lacks compute capability, but because the system cannot deliver data fast enough to keep the processor busy.
That issue becomes much more serious once AI workloads scale outward across entire racks and clusters.
A modern AI accelerator can consume astonishing amounts of information in parallel. The problem is that datasets are no longer small enough to sit entirely inside the fastest memory tiers. Even with HBM and large pools of DRAM, enormous amounts of data still need to travel across interconnects, buses, fabrics, storage layers, and network infrastructure.
That movement has a cost.
It shows up as latency, but that is only part of the story. It also shows up as power draw, heat, cooling demand, congestion, synchronization delays, and idle compute cycles. As we discussed in the DRAM installment, even tiny delays become surprisingly expensive once thousands of GPUs are operating at the same time. A small pause multiplied across a large AI cluster can represent an enormous amount of lost utilization.
That changes the engineering priorities.
For years, infrastructure was largely designed around maximizing compute performance. AI systems are now forcing engineers to think just as heavily about data locality, meaning where information physically sits relative to the processor trying to use it.
Put simply, distance now matters far more than it used to.
GPUs Became So Fast That The Rest Of The System Started Falling Behind
One of the strange things about AI infrastructure is that progress in one area tends to expose weaknesses somewhere else.
As GPUs became faster, memory bandwidth became the bottleneck. That led to HBM. As HBM capacity limitations became more obvious, the industry started introducing intermediary layers like Storage Class Memory. As DRAM scaling became expensive and physically difficult, systems started leaning more heavily on NAND while also exploring concepts like High Bandwidth Flash.
And as AI datasets continued expanding into the petabyte and exabyte range, hard drives quietly remained essential because the economics of storing that much information simply could not work any other way.
Each article in this series has really been pointing toward the same conclusion from a different angle.
The old assumption that compute sits here while storage sits over there is beginning to break apart. The reason is fairly simple: GPUs can now process data faster than traditional architectures can comfortably deliver it.
That creates a situation where enormous amounts of system activity are spent simply transporting information from one place to another. In practical terms, some AI environments are starting to look less like pure compute problems and more like logistics problems.
The Industry Started Asking A Different Question
For a long time, storage innovation focused mostly on making storage devices faster. Faster SSDs, faster interfaces, faster NAND, and faster controllers all mattered, and they still matter today.
But AI workloads started exposing a deeper issue underneath all of that.
At some point, engineers began realizing the problem was not always the speed of the storage device itself. The problem was the repeated movement of massive amounts of data back and forth across the entire system.
That subtle distinction matters because once the problem becomes data movement rather than simple storage speed, the solution starts changing too.
Instead of endlessly asking how storage can be made faster, the industry started asking how far the data needs to travel in the first place.
That question is now influencing nearly every part of modern AI infrastructure design.
Moving Compute Closer To Where Data Already Lives
This is where the architecture starts to shift.
Rather than treating storage as a completely passive layer that simply waits for requests, newer systems are beginning to perform certain tasks closer to the data itself. Not necessarily full-scale GPU processing, but localized operations that reduce unnecessary movement throughout the rest of the system.
Some systems now perform filtering, indexing, search operations, compression, retrieval preparation, and data organization closer to the storage layer before the information ever reaches the primary compute engines.
The goal is not to eliminate GPUs or replace fast memory. The goal is to reduce waste.
If the system can avoid transporting enormous amounts of unnecessary data across the infrastructure, the entire platform becomes more efficient. This is one of the reasons the line between compute and storage is starting to blur.
Storage is no longer behaving like a completely inactive destination sitting at the bottom of the hierarchy. It is becoming more involved in how data is prepared, staged, filtered, and delivered upstream.
If you think back to the earlier article on High Bandwidth Flash, this direction makes a great deal of sense. That article showed how NAND itself was being pushed toward more memory-like behavior. This article extends the same idea one step further by showing how the surrounding architecture is also adapting around the cost of data movement.
The Warehouse Analogy Starts Looking Different
The warehouse analogy we’ve used throughout this series still works here, but the warehouse itself has started evolving because the workload inside it has changed.
In the earlier installments, the layout was fairly straightforward. HBM represented the loading dock where the next pallet was already waiting beside the workers. DRAM acted as the active floor space where the immediate sorting and handling took place. Storage Class Memory became the staging area sitting just behind the dock, while NAND represented the primary warehouse shelving further back. Hard drives handled the deeper bulk storage where long-term inventory lived because capacity mattered more than immediate access speed.
That model still generally holds together, but AI systems are beginning to expose inefficiencies in how much movement happens between those areas.
Imagine a warehouse where workers spend more time driving forklifts back and forth across the building than actually processing inventory. At first, management responds by buying faster forklifts, widening the aisles, and improving the loading docks. Those upgrades help for a while, but eventually the operation reaches a point where transportation itself becomes the problem. The delays are no longer caused by slow workers or inadequate equipment. The delays come from the sheer amount of movement required to keep the workflow operating.
That is increasingly what large AI systems are running into.
The issue is no longer just how quickly data can be processed once it reaches the GPU. The issue is how much infrastructure effort is spent repeatedly transporting that data across the system in the first place.
So instead of endlessly optimizing transportation, the layout begins to change. Small workstations start appearing closer to the shelves themselves. Certain sorting tasks happen locally. Filtering happens locally. Data preparation begins happening nearer to where the information already resides, reducing how often the system has to move massive amounts of material back and forth across the entire operation.
That shift is essentially what AI infrastructure is starting to do at the architectural level. The goal is not to turn storage into a processor or eliminate centralized compute altogether. The goal is to reduce unnecessary movement because at AI scale, even small inefficiencies become surprisingly expensive once they are multiplied across thousands of accelerators operating simultaneously.
AI Infrastructure Is Becoming More Distributed By Necessity
One of the more interesting consequences of this shift is that AI infrastructure is starting to become far more distributed than traditional computing environments ever needed to be.
Older architectures assumed most of the important work would happen in centralized compute locations while storage remained largely passive and separated from the processing layer. That model worked reasonably well for decades because the amount of data moving through the system was still manageable relative to the speed of the processors consuming it.
AI changes the scale of the equation entirely.
The amount of information being processed, revisited, staged, cached, indexed, and retrieved is now so large that centralized movement itself begins creating inefficiencies. Instead of compute simply reaching downward into storage whenever it needs something, systems are increasingly trying to keep useful data positioned closer to where it will likely be used next.
That is part of the reason technologies like vector databases, distributed inference systems, retrieval layers, localized caching, and near-data processing have started receiving so much attention. On the surface, these may appear like separate technologies solving unrelated problems, but underneath they are all responding to the same pressure. The industry is trying to reduce how often enormous amounts of information must travel long distances across the infrastructure before meaningful work can begin.
As you’ve probably noticed throughout this series, the memory hierarchy itself is gradually becoming less rigid than it used to be. The clean separation between “compute over here” and “storage over there” is starting to soften because AI workloads reward systems that keep data physically closer to where processing occurs.
That trend is likely to continue because the economics of large-scale AI increasingly favor efficiency in movement just as much as raw compute capability.
The Memory Hierarchy Is Starting To Blur Together
One of the quieter themes running underneath every installment in this series has been the gradual erosion of the old boundaries between memory, storage, and compute.
In the HBM article, we looked at how memory was physically moved closer to the processor itself because even traditional DRAM placement began introducing delays large enough to matter at AI scale. In the Storage Class Memory installment, the focus shifted toward reducing the sharp transition between fast memory and slower persistent storage. High Bandwidth Flash pushed NAND into a more active role inside the working data path, while the DRAM article showed why simply scaling traditional memory upward indefinitely becomes difficult both economically and physically.
Now this article pushes that same progression one step further by showing how the architecture itself is adapting around the cost of moving data.
What makes this particularly interesting is that none of these technologies are truly replacing one another. The industry did not abandon NAND once HBM arrived. It did not replace DRAM simply because Storage Class Memory appeared. Hard drives also remain deeply relevant despite decades of predictions claiming solid-state storage would eliminate them entirely.
Instead, the system is becoming more layered, more specialized, and more aware of where data physically exists relative to the compute resources trying to consume it.
That distinction matters because it changes how we should think about the future of AI infrastructure. The evolution is not happening because one breakthrough technology suddenly solved everything. The evolution is happening because the workload itself forced the industry to reorganize how every layer participates in feeding information to the compute side efficiently.
Once you step back and look at the broader picture, the pattern becomes much easier to see. Every major shift we’ve discussed throughout this series ultimately points toward the same objective: reducing how much time, energy, and infrastructure overhead is spent simply moving information from place to place.
The Future May Depend More On Data Placement Than Raw Compute
For a very long time, the technology industry largely measured progress through raw compute capability. Faster processors, larger accelerators, more cores, and greater parallelism were treated as the primary indicators of advancement because, for most traditional workloads, improving compute performance generally improved the system as a whole.
AI is forcing a more nuanced conversation.
Once processors become fast enough, the larger challenge stops being the ability to execute operations and starts becoming the ability to keep those processors supplied with useful data consistently enough to avoid expensive idle time. That subtle change is now influencing nearly every major architectural decision happening inside modern AI infrastructure.
The interesting part is that the solution is no longer simply building faster storage devices or larger pools of memory in isolation. Instead, the industry is increasingly focused on where data lives throughout the system, how often it moves, and how intelligently the architecture can minimize unnecessary transportation before compute resources ever become involved.
That is why proximity has become such a recurring theme across every article in this series. HBM moved memory physically closer to the GPU. Storage Class Memory reduced the gap between memory and storage. High Bandwidth Flash attempted to make NAND participate more actively in the memory hierarchy. Distributed storage systems and near-data processing architectures are now trying to reduce how much movement happens across the infrastructure itself.
All of these developments are responding to the same realization.
At AI scale, moving data efficiently is becoming almost as important as processing the data once it arrives.
And that may ultimately become one of the defining architectural shifts of the entire AI era.
AI Memory Infrastructure Series
This article is part of our ongoing series on how AI infrastructure is reshaping the relationship between memory, storage, and compute. If you are joining the discussion here, the earlier installments provide the foundation for why this shift is happening.
Installment One:
NAND Isn’t Going Away, But AI Servers Now Depend on More Than Flash
Installment Two:
What Is High Bandwidth Memory (HBM) and Why AI Depends on It
Installment Three:
Storage Class Memory Explained: The Missing Layer Between DRAM and NAND
Installment Four:
High Bandwidth Flash: Can NAND Finally Act Like Memory?
Installment Five:
Why DRAM Alone Can’t Keep Up with AI Anymore
Installment Six:
Why Hard Drives Are Still Critical for AI Infrastructure
Installment Seven:
Why AI Is Moving Compute Closer To Storage
Editorial Note: This article is part of the ongoing AI infrastructure and memory architecture series published by GetUSB.info. The article was researched and written with AI-assisted editorial support for structure and readability, then reviewed and refined by the GetUSB editorial team for technical accuracy, continuity, and clarity.
About the Author
This article was developed under the direction of Matt LeBoff, a long-time contributor to GetUSB.info with over two decades of experience in USB technology, flash memory behavior, and data storage systems. The perspective presented here reflects hands-on industry knowledge and ongoing analysis of how real-world systems perform under evolving workloads, including AI infrastructure.
Let GetUSB.info keep you updated.
Receive article notifications about USB storage, flash memory, and duplication updates in your preferred language. We average a couple of articles per week.