Inside Meta’s AI Hardware Lab: The Reality of GPU Racks, Liquid Cooling, and Optical Interconnects
This article summarizes the conversation “The Hottest Job in the Market: Inside Meta’s AI Hardware Lab | Joshua Held and Yashar Bayani.” Because the original is an interview, this post reorganizes the content by topic and flow.
Core Takeaway at a Glance
The core of this conversation is simple: AI competitiveness comes not from a single GPU chip, but from the entire hardware system that enables those GPUs to run at scale.
The “entire system” includes:
- GPU rack design
- Power delivery
- Cooling methods
- Copper/optical interconnects
- Manufacturing quality control
- Shipping and deployment
- Failure response and maintenance
- Balance of CPU, memory, storage, and networking
Meta’s two leaders share: “In the past, working on hardware and infrastructure for a long time didn’t get much attention. But with the AI era, this role has become the front line that determines the success or failure of the company.”
Table of Contents
- Why Hardware Engineers Matter Now
- What is Meta’s AI Hardware Lab?
  - Why build equipment 5 years ahead?
  - Evolution from CPU/storage servers to GPU racks
- Why Meta Started Building Its Own Servers
  - Moving from external vendors to in-house design
  - Freedom servers and the Prineville data center
- Why GPU Infrastructure is Hard
  - From an 8-GPU tray to a 72-GPU rack
  - Complexity of scale-up vs. scale-out
  - Moving from air cooling to liquid cooling
- What is ALC (Air-Assisted Liquid Cooling)?
- Why Manufacturing and Maintenance Became Harder
- What are the Biggest Bottlenecks Today?
  - Power
  - Cooling
  - Signal Integrity
  - Optical Interconnects
- GPUs Alone are Not Enough: The Resurgence of CPU, Memory, Storage, and Network
- Career Talk: How Did the Two Leaders Get Here?
- Advice for Students and Engineers
- Key Learning Points
- Glossary of Terms
1. Why Hardware Engineers Matter Now
The opening scene of the interview is symbolic. The presenters note: “In the past, when we built hardware and infrastructure, nobody really paid much attention. Now, it has become the role that decides whether the company succeeds or fails.”
The meaning behind this is clear:
- AI model performance competition is no longer just a software problem.
- Without infrastructure that can run large-scale training and inference, buying a lot of GPUs is meaningless.
- Hardware, power, cooling, networking, and manufacturing capabilities translate directly to product competitiveness.
One presenter calls this shift a “creative awakening.” In the past, the focus was on packaging and deploying air-cooled systems; now, it has shifted to solving entirely new kinds of problems.
2. What is Meta’s AI Hardware Lab?
This conversation takes place in one of Meta’s hardware labs. This space is not just a simple testing room, but a prototyping space to pre-validate hardware that will be deployed several years into the future.
Key points highlighted by the presenters:
- Technologies like cooling, mechanics, power, and backplanes must be prepared far ahead of the industry’s volume manufacturing capabilities.
- Meta looks about 5 years ahead to experiment with future hardware.
- The lab houses prototypes from various systems and multiple vendors.
In other words, it is not a place to look at “current production equipment,” but rather a space to validate what the next-generation data centers should look like.
3. What the Equipment in the Room Demonstrates
Looking around the room, the presenters explain the evolution of Meta’s infrastructure. On the left are relatively simple compute and storage servers, while the systems on the right grow increasingly complex with GPUs.
This contrast illustrates:
- Early servers were simple 1U/2U box-type structures.
- The goals were fast deployment, quick repairs, and high availability.
- These CPU/storage servers have been optimized for a long time.
- In contrast, GPU racks are far more complex mechanically, electrically, and in terms of signal integrity.
- Moving toward the right, liquid cooling equipment and large heat exchanger structures appear.
In essence, more than 10 years of Meta’s infrastructure evolution is physically displayed in a single room.
4. Why Meta Started Building Its Own Servers
Meta did not build its own servers from the start. At one point, they purchased standard servers from external vendors like Dell and HP to deploy in colocated data centers. However, at some point, it became more rational to build them in-house.
4.1 Background of the Transition
- Demand for servers surged.
- As scale grew, purchasing external equipment became less cost-effective.
- There was significant room for optimization tailored to Meta’s services.
- Eliminating unnecessary components yielded clear cost savings.
4.2 Early In-House Server: Freedom
According to the presenter, one of the first servers Meta built in-house was a simple 2-socket server called Freedom.
The core idea was straightforward:
- Keep only what is necessary.
- Strip away unnecessary parts.
- Create a cost-effective structure from a service perspective.
At the time, saving even a few million dollars was highly meaningful.
4.3 Expansion to In-House Data Centers
The next step was building data centers. The presenter mentions Prineville as Meta’s first data center. During this time, the hardware team was very small, and a handful of people designed the servers and coordinated with the data center team for actual deployment.
5. How the Early Server Strategy Evolved
Initially, Meta’s primary workloads were web and databases. Thus, servers were categorized to match those purposes:
- High-memory servers
- High-flash servers
- High-storage servers
These servers mostly ran in bare-metal environments and were optimized for specific service characteristics. Later, Meta gradually expanded into other domains:
- In-house storage servers
- In-house network switches
- In-house AI servers
As the presenter puts it, today Meta designs almost everything end-to-end.
6. The Biggest Shift in GPU Infrastructure
The presenters point to the transition from 8-GPU tray-level systems to 72-GPU rack-level systems as the biggest change. This is not just a matter of increasing the GPU count:
- The entire rack must behave like a single system.
- Both scale-up and scale-out networks are critical.
- Mechanical design, power distribution, cabling, and thermal design all become far harder.
- The blast radius of a failure becomes significantly larger.
Where the team once dealt with “individual servers,” the “entire rack” now behaves like a single computer.
7. The Transition from Air Cooling to Liquid Cooling
As GPU generations progressed, air cooling hit a clear limit. The presenters mention that during the H100 era, there was talk of liquid cooling, but they reverted to air-cooled models because the industry was not ready. With generations like GB200 and GB300, liquid cooling has become an unavoidable direction.
The reasons are simple:
- GPU power density has become too high.
- Heat generated per rack has surged.
- Air cooling alone can no longer remove heat stably and efficiently.
8. What is ALC (Air-Assisted Liquid Cooling)?
One of the key concepts emphasized by Meta is ALC.
This is a hybrid system designed to bring liquid-cooled GPU racks into existing air-cooled data centers without requiring a complete overhaul of the facility.
8.1 Why Was It Needed?
Existing data centers were built for general compute workloads and were not designed for liquid cooling.
- Large-scale building renovations are expensive and slow.
- AI racks need to be deployed quickly.
- Leveraging existing data center assets was the most realistic option.
Thus, Meta created a rack-level thermal solution.
8.2 How It Works
ALC is essentially a rack-sized heat exchanger:
- Pumps circulate the coolant.
- The liquid carries heat away from the GPUs.
- Large radiators transfer the heat from the liquid to the air.
- That heat escapes through the data center’s hot aisle and exhaust paths.
In short, it is a “self-contained, rack-level liquid cooling system.”
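The heat path described above can be sized with the basic heat-balance relation Q = m_dot * c_p * dT. The following is a minimal sketch, assuming a plain-water coolant and a hypothetical 100 kW rack with a 10 K coolant temperature rise; none of these figures come from the interview, and real loops often use glycol mixtures with different properties.

```python
# Illustrative heat-balance estimate for a rack-level liquid loop.
# All numbers are assumptions for this sketch, not Meta's figures.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximate value near room temperature


def coolant_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Return the coolant flow (litres per minute) needed to carry
    heat_load_w watts with a coolant temperature rise of delta_t_k kelvin,
    using Q = m_dot * c_p * dT."""
    if heat_load_w < 0 or delta_t_k <= 0:
        raise ValueError("heat load must be non-negative and delta T positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


if __name__ == "__main__":
    # Hypothetical 100 kW rack with a 10 K coolant temperature rise.
    print(f"{coolant_flow_l_per_min(100_000, 10):.1f} L/min")
```

The same heat must then leave the radiators into the hot aisle, so the air side needs its own balance with air's much lower heat capacity; that is why the radiators are described as large.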
8.3 Is It Structurally Unique?
According to the presenters, this system fits within the Open Compute Project (OCP) standard rack family. It is simply a variation with liquid cooling structures added inside. Although the weight increased, it was within a manageable range.
9. The Scariest Part of Liquid Cooling: Leaks
When introducing liquid cooling, the first risk that comes to mind is leaks. The presenters admit this was a major concern:
- Existing air-cooled data centers do not have drainage systems.
- They were not designed assuming water would enter.
- Even a small leak can be fatal to electronic equipment.
To address this, Meta designed two key elements:
- Resiliency in the cooling loop itself.
- A sensor system to detect leaks.
They developed a large-scale sensor network to quickly detect coolant leaks, allowing the data center operations team to respond swiftly. Of course, as scale grows, unexpected issues still occur. However, having processes in place for the operations team to quickly isolate problems and protect equipment has become essential.
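To make the detect-then-isolate idea concrete, here is a small sketch of how readings from a leak-sensor network might be reduced to a list of racks needing attention. The `LeakReading` type, its fields, and the threshold are hypothetical names invented for illustration; the interview does not describe Meta's actual sensor data model or response logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakReading:
    rack_id: str    # hypothetical rack identifier
    sensor_id: str  # hypothetical sensor identifier
    wet: bool       # True if the sensor reports moisture


def racks_to_isolate(readings: list[LeakReading], min_wet_sensors: int = 1) -> list[str]:
    """Return sorted rack IDs where at least min_wet_sensors report moisture."""
    wet_counts: dict[str, int] = {}
    for reading in readings:
        if reading.wet:
            wet_counts[reading.rack_id] = wet_counts.get(reading.rack_id, 0) + 1
    return sorted(rack for rack, count in wet_counts.items() if count >= min_wet_sensors)


if __name__ == "__main__":
    sample = [
        LeakReading("rack-A", "s1", True),
        LeakReading("rack-A", "s2", False),
        LeakReading("rack-B", "s1", False),
    ]
    print(racks_to_isolate(sample))
```

Raising `min_wet_sensors` trades faster response for fewer false alarms, which is the kind of policy choice an operations team would have to tune.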
10. Why Manufacturing Complexity Exploded
The presenters repeatedly emphasize manufacturing. In the past, most components were contained within a server chassis:
- PCBs
- CPUs
- Memory
- Other subcomponents
Now, the system is spread across the entire rack:
- Backplanes
- Flyover cables
- Thousands of micro-connectors
- Multiple compute trays
- Switches
In this state, even a tiny misalignment can lead to major issues.
10.1 Signal Integrity Challenges
If a connector is slightly misaligned, it causes signal integrity problems. This issue might seem microscopic during manufacturing, but it translates into major failures in production. Consequently, manufacturing processes have become far more precise:
- Using optical inspection systems.
- Closely checking connector states.
- Managing assembly quality at a granular level.
10.2 Simple Things Halt the Line
The presenters share an insightful example:
- A shortage of a single capacitor can halt the production of a multi-million dollar rack.
- A speck of invisible dust can ruin connector contacts, requiring the entire rack to be disassembled.
This means that in the AI infrastructure race, supply chain and manufacturing quality control are just as critical as securing cutting-edge GPUs.
11. The Unit of Maintenance Has Changed
In the past, servicing often meant pulling out a single server. Today, failure domains are highly diverse:
- A single GPU
- A compute tray
- A switch
- A backplane
- In extreme cases, the entire rack
The presenters summarize maintenance as follows:
- GPU failure: Can be replaced at the GPU or compute tray level.
- Backplane failure: May require taking down the entire rack for replacement.
- Switch failure: Affects multiple GPUs simultaneously, requiring coordination with job schedulers.
Maintenance is no longer just a hardware task; it has become intertwined with operations policy and resource scheduling.
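The failure domains above can be expressed as a simple lookup from failed component to the number of GPUs affected. This is a sketch under assumed numbers: a 72-GPU rack split into 18 trays of 4 GPUs, with a switch failure treated as touching every GPU in the rack. The layout, the function names, and the rule for when to involve the job scheduler are illustrative assumptions, not details given by the presenters.

```python
# Assumed layout for illustration: 18 compute trays x 4 GPUs = 72 GPUs per rack.
GPUS_PER_TRAY = 4
TRAYS_PER_RACK = 18
GPUS_PER_RACK = GPUS_PER_TRAY * TRAYS_PER_RACK


def gpus_affected(component: str) -> int:
    """Return how many GPUs a failure of the given component touches."""
    blast_radius = {
        "gpu": 1,
        "compute_tray": GPUS_PER_TRAY,
        "switch": GPUS_PER_RACK,     # assumption: a switch failure touches every GPU in the rack
        "backplane": GPUS_PER_RACK,  # rack must come down for replacement
    }
    if component not in blast_radius:
        raise ValueError(f"unknown component: {component}")
    return blast_radius[component]


def needs_scheduler_coordination(component: str) -> bool:
    """Hypothetical policy: involve the job scheduler when a failure
    reaches beyond a single compute tray."""
    return gpus_affected(component) > GPUS_PER_TRAY


if __name__ == "__main__":
    for part in ("gpu", "compute_tray", "switch", "backplane"):
        print(part, gpus_affected(part), needs_scheduler_coordination(part))
```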
12. Future Technical Challenges
The presenters highlight four major challenges going forward:
12.1 Power
Power is highlighted as the strongest bottleneck:
- Power consumption per GPU is extremely high.
- Rack power density can range from 200 kW up to 1 MW.
- In the past, 15 kW per rack was considered a major issue; today’s requirements are incomparable.
The “easy” power sources for data centers have largely been exhausted, and finding creative ways to secure power is now a necessity.
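A quick calculation shows why these figures change data center planning. The sketch below takes the per-rack numbers quoted above (15 kW, 200 kW, 1 MW) and asks how many racks fit in a fixed power budget. The 1 MW budget is a hypothetical value, and the calculation assumes all of it reaches the racks, ignoring cooling and distribution overhead.

```python
def racks_supported(site_power_kw: float, rack_power_kw: float) -> int:
    """Return how many whole racks fit within a power budget,
    assuming the entire budget is available to the racks."""
    if site_power_kw < 0 or rack_power_kw <= 0:
        raise ValueError("site power must be non-negative and rack power positive")
    return int(site_power_kw // rack_power_kw)


if __name__ == "__main__":
    budget_kw = 1000.0  # hypothetical 1 MW budget
    for rack_kw in (15.0, 200.0, 1000.0):
        print(f"{rack_kw:>6.0f} kW/rack -> {racks_supported(budget_kw, rack_kw)} racks")
```

Under these assumptions, the same budget that once fed dozens of 15 kW racks now covers only a handful of 200 kW racks, or a single rack at the top of the quoted range.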
12.2 Cooling
More power means more heat. Ultimately, power and cooling are two sides of the same coin. The closer, denser, and more numerous the GPUs, the more cooling becomes a core constraint.
12.3 Signal Integrity
To connect many GPUs, interconnects must be dense. However, complex connections make maintaining signal integrity harder. This is not just a network issue, but a mechanical assembly and manufacturing quality challenge.
12.4 Limits of Copper and Optical Interconnects
The presenters mention that the practical limit of copper is roughly 1.5 meters. Beyond that, signal degradation becomes severe.
So, how do we build larger GPU pools?
- Bring GPUs closer together.
- Change the rack structure.
- Ultimately, adopt optical interconnects.
However, optical technology still faces hurdles:
- Significantly more expensive than copper.
- Higher failure rates.
- Consumes more power.
Therefore, while maintaining the operational efficiency of standard racks, Meta appears to be pushing strongly toward optical interconnects for the long term.
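The copper limit can be read as a simple planning rule: links within reach stay on copper, longer ones need optics. The sketch below uses the roughly 1.5 m figure quoted by the presenters as a hard threshold, which is a simplification, since real reach depends on data rate, cable gauge, and connector losses. The function names and link lengths are hypothetical.

```python
COPPER_REACH_M = 1.5  # rough practical limit quoted in the interview


def choose_medium(link_length_m: float) -> str:
    """Pick a link medium from its length using a single reach threshold."""
    if link_length_m <= 0:
        raise ValueError("link length must be positive")
    return "copper" if link_length_m <= COPPER_REACH_M else "optical"


def count_by_medium(link_lengths_m: list[float]) -> dict[str, int]:
    """Count how many planned links fall on each medium."""
    counts = {"copper": 0, "optical": 0}
    for length in link_lengths_m:
        counts[choose_medium(length)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical link lengths: in-rack links are short, cross-rack links are not.
    print(count_by_medium([0.3, 0.8, 1.2, 1.5, 2.5, 10.0]))
```

Because optical links currently cost more, fail more often, and draw more power, every link pushed past the threshold carries a penalty, which is why bringing GPUs physically closer comes first on the list above.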
13. GPUs Alone Are Not Enough: The Return of CPU, Memory, Storage, and Network
An interesting point in the interview is when a presenter warns, “Don’t ignore the CPU; it’s heating up again.” The reasoning is clear:
- Even with powerful GPUs, they are useless if data cannot feed into them.
- Without sufficient scale-up/scale-out networks, GPUs sit idle.
- If storage delivery is weak, the training pipeline gets bottlenecked.
- If the ratio of CPU and memory is off, overall system efficiency drops.
Meta does not just design GPU racks; they design the entire supporting layers to ensure those racks achieve maximum utilization. The presenters also mention other structural possibilities depending on workloads:
- Disaggregated compute layouts separating CPUs, memory, and compute.
- Co-locating storage directly within GPU racks.
Ultimately, there is no single right answer; it depends on the workload’s I/O, memory requirements, and CPU ratios.
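The "GPUs sit idle" argument amounts to a bottleneck calculation: the slowest supporting stage caps how busy the GPUs can be. Here is a minimal sketch assuming each stage can be summarized by a single sustained rate in the same units as GPU demand (for example, samples per second). The stage names and rates are hypothetical.

```python
def find_bottleneck(gpu_demand: float, supply_rates: dict[str, float]) -> tuple[str, float]:
    """Return (slowest_stage, gpu_utilization), where utilization is the
    fraction of GPU demand the slowest supporting stage can satisfy."""
    if gpu_demand <= 0 or not supply_rates:
        raise ValueError("need positive GPU demand and at least one stage")
    slowest_stage = min(supply_rates, key=lambda name: supply_rates[name])
    utilization = min(1.0, supply_rates[slowest_stage] / gpu_demand)
    return slowest_stage, utilization


if __name__ == "__main__":
    # Hypothetical rates, all in samples per second.
    stage, utilization = find_bottleneck(
        gpu_demand=1000.0,
        supply_rates={"storage": 600.0, "network": 900.0, "cpu_preprocessing": 1200.0},
    )
    print(f"bottleneck: {stage}, GPU utilization: {utilization:.0%}")
```

In this example, upgrading the network or CPUs would change nothing until storage is fixed, which illustrates why the ratios between layers matter as much as any single component.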
14. Which Layer is Farthest Behind?
When asked “Which layer—storage, CPU, or network—is lagging most and posing the biggest challenge?”, the presenter identifies optical-based scale-out as the key area to push forward from a hardware design perspective.
The reasons are:
- Meta wants to maintain standardized rack sizes and operational workflows.
- They do not want to discard optimized structures for manufacturing, shipping, installing, and powering.
- However, copper alone is insufficient for building larger, denser GPU pools.
Therefore, the key competitive edge in the coming years will likely be “how cheaply, reliably, and power-efficiently we can build optical interconnects.”
15. Real Optimization is Not Just About One Rack
The presenters emphasize hardware lifecycle optimization, something software engineers rarely experience. This lifecycle includes:
- Silicon fabrication and yield
- Module assembly
- System integration
- Rack integration
- Transportation
- Data center installation
- Bring-up and powering on
- Years of operational stability
In other words, a “well-designed rack” is not just high-performing, but must also be:
- Easy to manufacture
- Easy to ship
- Quick to install
- Long-lasting
- Recoverable when failures occur
Another critical factor is the depreciation rate of AI equipment. The presenters note that expensive GPU racks must be put into operation as quickly as possible. The time expensive equipment spends sitting idle represents a huge financial loss.
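The cost of idle time can be estimated with straight-line depreciation. The sketch below is illustrative only: the rack price, the five-year useful life, and the choice of straight-line depreciation are all assumptions, not figures or methods stated in the interview.

```python
DAYS_PER_YEAR = 365


def idle_depreciation_cost(rack_cost: float, useful_life_years: float, idle_days: float) -> float:
    """Return the value lost to depreciation while a rack sits idle,
    assuming straight-line depreciation over its useful life."""
    if rack_cost < 0 or useful_life_years <= 0 or idle_days < 0:
        raise ValueError("cost and idle days must be non-negative, life positive")
    daily_depreciation = rack_cost / (useful_life_years * DAYS_PER_YEAR)
    return daily_depreciation * idle_days


if __name__ == "__main__":
    # Hypothetical rack: $3.65M, 5-year life, 10 days waiting to be installed.
    print(f"${idle_depreciation_cost(3_650_000, 5, 10):,.0f}")
```

Multiplied across a fleet of racks, even short delays in shipping, installation, or bring-up add up quickly, which is why the lifecycle steps listed above are optimized together.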
16. Career Stories: How the Presenters Got Here
The latter half of the conversation touches on the two leaders’ career paths.
16.1 Yashar Bayani
- Graduate of the University of Waterloo, Canada.
- Joined Facebook as an intern in the early 2010s.
- Participated in the early development of in-house equipment within a small hardware team.
- Gained experience across compute, storage, networking, and management.
- Finds fulfillment in growing people and teams.
16.2 Joshua Held
- Bay Area native.
- Graduated from San Jose State.
- Experience in telecom equipment, cable modems, and video equipment.
- Participated in mechanical engineering for GPU systems since his early days at Meta.
- Later expanded into management to support team growth.
Neither of them planned to be an “AI star player” from the start. Instead, they built core competencies, gained hands-on experience, and stepped into critical roles as the industry evolved.
17. Will CPU and GPU Merge?
An intriguing question was whether CPU and GPU will eventually converge into one, or continue to specialize. The presenters’ answers are cautious but share a common thread:
- CPU, GPU, and memory will all continue to play critical roles.
- The balance point varies depending on workloads (inference, low-latency processing, training).
- Specialized chips for specific workloads will continue to emerge.
- However, system design must be viewed in a more integrated manner.
While chips may diverge, at the system level, they will become more tightly coupled.
18. What is the Ultimate Constraint?
As the interview nears its end, the presenters’ answers converge on one point: The ultimate constraint is power.
While memory, silicon supply, cooling, and interconnects are all important, at the data center level, the critical limits are:
- Can we secure more power?
- Can we safely deliver that power to the actual racks?
- Can we handle the heat generated as a result?
This is the most realistic ceiling for AI infrastructure expansion.
19. Advice for Students and Engineers
The conversation concludes with career advice.
19.1 Problem-Solving Skills are Most Important
One presenter shares that when he got his first internship, he talked about how he solved a problem with a car engine. He didn’t just answer questions about computers or electrical engineering; showing how he actually solved a problem was what mattered. The message is clear:
- The problems you deal with will keep changing.
- The technology stack you learn today will change in a few years.
- However, the ability to solve problems from first principles lasts.
19.2 Grit and Flexibility
Key traits highlighted for students and early-career engineers:
- Grit (persistence).
- Flexibility to adapt to changing problems.
- The ability to explain fundamental concepts in your own words.
19.3 Creativity and Foundational Strength
Another presenter stresses that creativity remains highly important even in the AI era:
- The ability to look at problems from a different perspective.
- Hands-on experience fixing things at home.
- A foundational sense of engines, thermodynamics, and mechanical structures.
These elements build a “toolbox” that you will draw upon throughout a long career.
20. Key Learning Points
When studying this interview, keep the following points in mind:
- AI Infra is a System Problem: GPU speed alone isn’t enough. Power, cooling, storage, CPU, memory, network, manufacturing, and operations must align.
- Shift to Liquid Cooling: In high-density GPU racks, liquid cooling is transitioning from optional to mandatory. Hybrid solutions like ALC are key for compatibility with existing air-cooled data centers.
- Manufacturing Quality Equals Service Quality: A single faulty connector out of thousands leads to operational failure. Real-world issues like dust, supply chain, and assembly tolerance can halt AI infrastructure.
- Long-term Importance of Optical Interconnects: Copper has physical limits in distance and density. To pool more GPUs, optical links are needed, though challenges in cost, reliability, and power remain.
- Power is the Ultimate Bottleneck: The true ceiling of data center expansion is power. Securing power and cooling is a harder challenge than simply buying more GPUs.
21. Glossary of Terms
- Scale-up: Connecting GPUs tightly within a close range to use them as a single large compute pool.
- Scale-out: Connecting multiple systems or racks over a broader range to scale.
- Backplane: A central board or connection structure connecting multiple modules within a rack.
- Signal Integrity: A measure of how well an electrical signal keeps its intended shape, without distortion, as it passes through a circuit, connector, or cable.
- ALC (Air-Assisted Liquid Cooling): A hybrid cooling system using liquid loops inside a rack and transferring heat to the air, allowing liquid cooling in air-cooled data centers.
- OCP (Open Compute Project): An open hardware design-sharing and standardization initiative started by Facebook (now Meta) in 2011.
Conclusion
This interview demonstrates that the AI race is no longer just a battle of models and software. Today’s key question is not “Which GPU should we use?” Rather, it is:
- How many of those GPUs can we connect?
- Can we feed data to those GPUs without interruption?
- Can we supply the power those GPUs consume?
- Can we handle the heat those GPUs emit?
- Can we manufacture, transport, install, and maintain these systems at scale?
Meta’s AI Hardware Lab is the testing ground for those very questions, explaining why hardware engineering has once again become one of the hottest fields in the AI era.