Retrofitting Liquid Cooling for AI Workloads: What Colocation Operators Need to Know
Engineering considerations for colocation operators navigating the transition from conventional air-cooled infrastructure to hybrid and direct liquid cooling, and how to get the design decisions right before AI demand forces the issue.
The rapid proliferation of GPU-accelerated AI infrastructure has fundamentally altered the thermal envelope of colocation data halls. Rack power densities that would have been considered exceptional only a few years ago (50 kW, or even 130 kW, per rack) are now specified as baseline requirements by AI tenants deploying NVIDIA GB200 NVL72 systems.
Conventional perimeter cooling architectures, designed around IT load densities of 5–20 kW per rack, are thermodynamically incapable of rejecting heat at these flux densities without support from liquid-cooling systems. This article examines the engineering principles and constraints governing the retrofit of liquid-cooling infrastructure into operational air-cooled data halls.
1. The Hybrid Air- and Liquid-Cooling Architecture
The typical deployment model for operators is not a wholesale replacement of air-cooling systems, but the superposition of a liquid-cooling layer onto an existing air-cooled infrastructure. This hybrid architecture preserves the capital investment in perimeter cooling plant, whilst extending the thermal rejection capability of individual rack positions to accommodate AI workloads.
In this model, a flexible white space is created in which the air-cooling infrastructure continues to serve standard-density enterprise and cloud workloads within its original design envelope, whilst also providing supplementary cooling for AI racks. The thermal load from ancillary components that liquid-cooling circuits do not directly address (chassis, power distribution hardware and similar) is still removed by the air-cooling system. The liquid-cooling layer handles the larger thermal flux from the silicon, which in modern GPU architectures may represent 85% of total rack power dissipation.
The preferred liquid-cooling method is direct-to-chip (D2C) cooling, using cold plates and rack-level manifolds connected by pipework to a Coolant Distribution Unit (CDU). Rear-door heat exchangers (RDHx) offer an alternative means of air cooling where existing perimeter cooling capacity is insufficient or has not been provided.
As rack densities continue to rise towards 600 kW per rack, the air-cooled share of the load becomes smaller in proportion: with improved liquid-cooling heat capture, the ratio may reach 97% liquid cooling. Even so, that leaves roughly 20 kW per rack to be removed by air. Traditional colocation data centre designs can therefore remain flexible, offering a hybrid of perimeter air-cooling and liquid-cooling for the foreseeable future.
However, the GPUs are not the only source of heat in the room. Storage servers, network switches, fabric heat gains, power distribution and CDU heat gains all require cooling. Some of these may be served by alternative cooling systems in the future, but the data hall will most likely always need a source of air-cooling.
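The split between liquid- and air-cooled load described above is simple arithmetic, and a minimal sketch may help when checking whether existing perimeter cooling can carry the residual. The function name is hypothetical, and pairing the 130 kW rack with the 85% capture figure is an assumption made for illustration; the article quotes the two numbers separately.

```python
def split_rack_heat(rack_kw: float, liquid_capture: float) -> tuple[float, float]:
    """Return (liquid_kw, air_kw) for one rack.

    liquid_capture is the fraction of rack power removed by the liquid
    circuit; the remainder must be rejected by the air-cooling system.
    """
    if not 0.0 <= liquid_capture <= 1.0:
        raise ValueError("liquid_capture must be between 0 and 1")
    liquid_kw = rack_kw * liquid_capture
    air_kw = rack_kw - liquid_kw
    return liquid_kw, air_kw


# Assumed pairings for illustration: (rack power in kW, liquid capture fraction).
for rack_kw, capture in [(130.0, 0.85), (600.0, 0.97)]:
    liquid_kw, air_kw = split_rack_heat(rack_kw, capture)
    print(f"{rack_kw:.0f} kW rack at {capture:.0%} capture: "
          f"{liquid_kw:.1f} kW to liquid, {air_kw:.1f} kW to air")
```

Under these assumptions both cases leave a residual air-side load of just under 20 kW per rack, which sits at the upper end of the 5–20 kW envelope that perimeter cooling was designed for.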
2. Floor Infrastructure and Pipework Routing
2.1 Raised Floor Environments
Within a data hall, a raised access floor appears to offer a convenient route for liquid supply and return pipework, using the underfloor void. However, this approach requires careful hydraulic and aerodynamic consideration. Where a raised floor plenum exists, it is primarily the cold air distribution pathway for perimeter-supplied air-cooling. Introducing additional pipework into this space increases aerodynamic resistance, reduces the effective plenum cross-section, and can generate turbulent flow conditions that compromise the uniformity of cold aisle supply pressure and the air-cooling system's ability to maintain design supply air temperature.
Where underfloor routing is adopted, pipework should be routed beneath the hot aisle as a preference and installed at the lowest practicable elevation, with CFD modelling of the modified plenum geometry recommended prior to installation to quantify any potential impact on perimeter cooling supply air distribution.
2.2 Exposed Slab Environments
Modern data hall design increasingly favours exposed slab configurations, in which overhead pipework routing, integrated with the existing overhead power and cabling infrastructure, is the only viable option. This creates a coordination challenge: the space above the racks in a high-density AI hall must accommodate busbars, structured cabling containment, hot aisle containment, fire suppression and, following retrofit, liquid-cooling supply and return headers. Above-rack distribution therefore needs to be considered at an early design stage, so that liquid cooling can still be retrofitted later while the other services remain in operation.
Pipework routed overhead, above live IT equipment, carries an inherent leak risk that must be actively mitigated. Leak detection systems should be specified, typically rope or spot sensors at headers and rack manifolds, and tied into the BMS for immediate alarm and, where warranted, automatic isolation. In addition, drip trays should be installed beneath supply and return headers to catch any leaks before they reach the racks below. Together these measures reduce the risk that a joint or fitting failure results in water damage to high-value IT hardware.
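The alarm-and-isolate behaviour described above can be sketched as a small decision function. This is a minimal illustration only: the sensor tags, zone names and the `auto_isolate` flag are all hypothetical, and a real BMS integration would follow the vendor's own interfaces and the site's change-control rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakSensor:
    name: str           # hypothetical sensor tag
    zone: str           # pipework zone the sensor protects
    wet: bool           # True when the sensor detects liquid
    auto_isolate: bool  # True where automatic valve closure is warranted


def leak_response(sensors: list[LeakSensor]) -> dict[str, list[str]]:
    """Return the zones to alarm on and the zones to isolate automatically.

    Every wet sensor raises an alarm; only zones flagged auto_isolate
    are also scheduled for valve closure.
    """
    alarm = sorted({s.zone for s in sensors if s.wet})
    isolate = sorted({s.zone for s in sensors if s.wet and s.auto_isolate})
    return {"alarm": alarm, "isolate": isolate}


# Hypothetical sensor states for illustration.
sensors = [
    LeakSensor("rope-H1", "header-row-A", wet=False, auto_isolate=True),
    LeakSensor("spot-M7", "manifold-rack-07", wet=True, auto_isolate=True),
    LeakSensor("tray-T2", "drip-tray-row-B", wet=True, auto_isolate=False),
]
print(leak_response(sensors))
```

Keeping the alarm and isolation decisions separate reflects the "where warranted" qualification: a wet drip tray may justify an alarm and an inspection without automatically shutting off coolant to running racks.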
3. Chilled Water Strategy and Hydraulic Distribution
Power and cooling infrastructure are installed very differently in a data centre, and that difference can hinder liquid-cooling retrofits. Power is comparatively easy to move: a rack can be re-fed by repositioning a busbar tap-off, and additional power for higher densities can be drawn from other data halls, with cabling moved and redirected to suit. Cooling capacity does not move so willingly: the water serving any given area is fixed by the pipework already installed, which was sized for the capacity originally required.
Once installed, that pipework network can be difficult and disruptive to change. Headers and risers are sized for a particular flow and cannot simply be uprated; adding capacity usually means cutting into a pressurised, live system, draining sections, and working around systems already in service. A busbar can be tapped almost anywhere along its length, whereas a water main only delivers capacity where it physically runs, and extending or providing additional connections after the fact is a major intervention rather than a quick change.
For this reason, planning the cooling pipework up front is the single most important design decision in a liquid-cooling retrofit. Where the pipework runs, how it is sized, and where future connection points are left will determine which parts of the hall can be liquid-cooled. Getting those decisions wrong can be more costly to correct than equivalent mistakes on the power side.
3.1 Integration with the Existing Chilled Water System
Connecting the primary (facility) side of the CDUs to an existing chilled water (CHW) system operating at conventional supply/return temperatures (typically 20/30°C for newer facilities, or lower for older ones) is thermodynamically viable, and represents the lowest-complexity retrofit option where existing plant has sufficient capacity. However, because modern silicon can operate with coolant supply temperatures of up to 40–45°C, the full benefit of elevated-temperature liquid cooling is lost. The ability to reject heat via free-cooling or high-efficiency dry coolers, without a full refrigeration cycle, is not realised when the CDUs are connected to a refrigeration-dependent CHW loop.
3.2 Dedicated Facility Water Circuit at Elevated Temperature
The preferred strategy for significant AI workload deployment is a dedicated facility water circuit operating at an elevated supply temperature, hydraulically isolated from the primary chilled water system. Because modern GPU cold plates tolerate relatively warm coolant, this circuit can run much warmer than a conventional chilled water system, and that single difference is what unlocks its efficiency advantages.
The warmer the supply, the more of the year the heat can be rejected without mechanical refrigeration. In temperate climates a circuit running at elevated temperature can meet much of its demand through free-cooling, using dry or adiabatic coolers alone, with the chillers acting only as backup on the hottest days. This sharply reduces compressor run-hours and the energy attributable to cooling, which in turn lowers the facility's Power Usage Effectiveness (PUE). The Green Grid has documented these PUE benefits in detail.
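The link between supply temperature and free-cooling availability can be illustrated with a minimal sketch. It assumes a dry cooler can deliver water at ambient dry-bulb temperature plus a fixed approach. The 6 K approach, the supply setpoints and the ambient samples below are all illustrative assumptions, not measured data, and a real assessment would use a full year of hourly weather data for the site.

```python
def free_cooling_fraction(ambient_c: list[float], supply_c: float,
                          approach_k: float) -> float:
    """Fraction of samples in which a dry cooler alone could meet supply_c.

    A sample qualifies when ambient temperature plus the cooler approach
    is at or below the required water supply temperature.
    """
    if not ambient_c:
        raise ValueError("ambient_c must not be empty")
    ok = sum(1 for t in ambient_c if t + approach_k <= supply_c)
    return ok / len(ambient_c)


# Illustrative ambient dry-bulb samples in degrees C; NOT measured data.
ambient = [2.0, 8.0, 12.0, 16.0, 20.0, 24.0, 28.0, 32.0]

# Assumed supply setpoints: a conventional CHW loop and two warmer circuits.
for supply in (20.0, 32.0, 40.0):
    frac = free_cooling_fraction(ambient, supply, approach_k=6.0)
    print(f"supply {supply:.0f} C: free-cooling in {frac:.0%} of samples")
```

With any fixed set of ambient samples, the fraction can only stay level or rise as the supply setpoint is raised, which is the mechanism behind the reduced compressor run-hours described above.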
The trade-off is capital cost and space: a dedicated circuit needs its own pumps, heat-rejection plant, pipework, and controls rather than drawing on existing headroom. For a facility anticipating meaningful AI tenancy this is usually justified by the lower running cost and the operational margin it provides, whereas for a single deployment, connecting to the existing chilled water system may remain the more pragmatic option.
3.3 Infrastructure Sizing Strategy
Regardless of which circuit architecture is adopted, the optimum strategy is to size and install the hydraulic distribution infrastructure (headers, risers, isolation valves, balancing valves, and metering) for the projected maximum load at the outset. This is especially important for the CHW system, which may have to carry both the air- and liquid-cooling loads. AI demand is hard to forecast precisely, but it rarely shrinks, so sizing the main pipework runs for the expected upper case leaves room to grow into it, rather than designing for today's load and being restricted in the future.
This matters because the economics are heavily one-sided. The incremental cost of specifying larger-bore headers and risers at first installation (more pipework, larger valves, increased pump duty) is modest. The cost of adding that capacity later is not: it means cutting into a pressurised, live distribution system, draining and re-filling sections, re-balancing the network, and carrying out hot works alongside occupied racks, all under change control and often out of hours. Oversizing at the start buys headroom cheaply; retrofitting it buys the same headroom at many times the price and with real operational risk.
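A first-pass sizing sketch shows why the extra bore is cheap headroom. It uses the standard heat balance (load = mass flow × specific heat × temperature rise) and a velocity limit to estimate the minimum internal bore of a header. The 10 K temperature rise matches the 20/30°C example in Section 3.1; the 2.5 m/s velocity limit and the water properties are assumptions for illustration, and the function name is hypothetical. It is not a substitute for a proper hydraulic design.

```python
import math

CP_WATER = 4.18    # kJ/(kg.K), approximate specific heat of water
RHO_WATER = 995.0  # kg/m^3, approximate density of water (assumed)


def header_flow_and_bore(load_kw: float, delta_t_k: float,
                         max_velocity_ms: float) -> tuple[float, float]:
    """Return (volumetric flow in L/s, minimum internal bore in mm)."""
    mass_flow = load_kw / (CP_WATER * delta_t_k)  # kg/s, from Q = m * cp * dT
    vol_flow = mass_flow / RHO_WATER              # m^3/s
    area = vol_flow / max_velocity_ms             # m^2 needed at the velocity limit
    bore_mm = 2.0 * math.sqrt(area / math.pi) * 1000.0
    return vol_flow * 1000.0, bore_mm


# Assumed loads: today's requirement and a doubled future case.
for load_kw in (1000.0, 2000.0):
    flow_ls, bore_mm = header_flow_and_bore(load_kw, delta_t_k=10.0,
                                            max_velocity_ms=2.5)
    print(f"{load_kw:.0f} kW: {flow_ls:.1f} L/s, minimum bore {bore_mm:.0f} mm")
```

At a fixed velocity limit the required bore grows with the square root of the load, so doubling the design capacity calls for a bore only about 41% larger. In practice the result would be rounded up to the next standard pipe size and checked against pressure drop and pump duty.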
Where budget genuinely precludes installing the full dual-circuit option up front, the minimum position is to design the provisions in even if the pipework is not. That means reserving the routing space and structural support for future headers and risers, leaving capped and valved tee-off points where additional CDUs or pipework runs will later connect, and sizing the primary plant space and electrical supplies for the eventual duty. Capacity that has been planned for can later be added through a localised, low-risk tie-in; capacity that was never allowed for usually cannot be added at all without rebuilding the route.
4. The Modular Construction Advantage
The engineering challenges described in the preceding sections arise from imposing a fundamentally different infrastructure on a facility that was not designed to accommodate it. The strategy itself is the same whether liquid cooling is installed on day one or retrofitted later, and a modular approach can accommodate either, provided the design and planning are right from the outset. The difference is simply that the work is moved off the live site: liquid-cooling readiness is engineered into the module at the factory stage, before deployment.
4.1 Chilled Water and Facility Water Installed, Full Capability from Day One
In this option both water systems are installed at build: a conventional chilled water (CHW) system serving the air-cooled load, and a dedicated facility water system (FWS) running at elevated temperature for liquid cooling. Because the two are hydraulically separate, each operates within its optimum envelope, with the CHW handling perimeter and ancillary cooling, and the FWS providing the warm, free-cooling-capable supply that direct-to-chip hardware prefers.
This gives full liquid-cooling capability from day one at the best achievable efficiency, and an AI tenant can be connected with no further plant works. The cost is that the entire FWS (pumps, heat rejection, pipework, and controls) is built and paid for before any liquid-cooling demand is confirmed, and may sit idle until it arrives. It suits facilities where significant AI tenancy is expected early, and where speed and running efficiency justify the higher up-front spend.
4.2 Oversized Chilled Water System, Liquid Cooling Served from a Single Circuit
Here only a chilled water system is installed, but it is deliberately oversized and specified to carry the liquid-cooling load as well as the air-cooled load, should a liquid-cooling tenant be proposed. Liquid cooling is then served directly from the CHW rather than from a separate FWS, keeping the facility to a single water system and a single distribution network.
The advantage is simplicity and lower capital: one set of plant to build and maintain, with no second circuit. The trade-off is efficiency: as noted in Section 3.1, a refrigeration-dependent CHW loop cannot deliver the elevated supply temperatures that unlock free-cooling, so much of the PUE benefit of warm-water liquid cooling is forgone. This option suits operators who want to keep liquid cooling possible without committing to a second system, and who will accept a higher running cost in return for a simpler, cheaper build.
4.3 Chilled Water at Build, Facility Water System Added When Required
In this option the base build installs only the chilled water system, sized for the air-cooled load, with the facility water system added later if and when a liquid-cooling requirement is confirmed. To keep that future addition straightforward, the provisions are designed in from the outset, with reserved plant space and pipe routing, structural support, and capped connection points, even though the FWS plant and pipework are not yet installed.
This defers the largest part of the cooling capital until demand is real, which is attractive where AI tenancy is uncertain. The penalty is that adding the FWS later is a project in its own right: plant must be installed and commissioned and connections made, and unless the provisions were properly allowed for at build, it can mean working around live infrastructure. It suits facilities that prioritise low initial cost and flexibility over immediate readiness.
All three options add only a modest cost over an air-only baseline, and all three are far cheaper and less disruptive than cutting liquid cooling into a facility that made no allowance for it. The right choice depends on how confident the operator is in near-term AI demand, and on the balance it wishes to strike between up-front capital, running efficiency, and speed to onboard a tenant.
5. Conclusion
Retrofitting liquid cooling into an air-cooled data hall is, above all, a planning problem: the cooling pipework is the hardest part of the facility to change once installed, so the decisions taken at design stage determine what the hall can ever support. A modular approach lets those decisions be made and built cleanly at the factory, and the right level of provision depends on how soon, and how certain, AI demand is. The three options below summarise that trade-off.
Option 4.1: Chilled water and facility water installed at build
Advantages: Full liquid-cooling capability from day one; best efficiency via free-cooling; tenants connect with no further plant works
Trade-offs: Highest up-front capital; the full FWS is built before demand is confirmed and may sit idle
Option 4.2: Oversized chilled water system, liquid cooling served from a single circuit
Advantages: Simplest to build and maintain, with one system and one network; lower capital than a dual-system build
Trade-offs: Lower efficiency: CHW cannot reach elevated temperatures, so most of the free-cooling / PUE benefit is lost; higher running cost
Option 4.3: Chilled water at build, facility water system added when required
Advantages: Lowest initial cost; defers major cooling capital until demand is real; retains flexibility
Trade-offs: Adding the FWS later is a separate project; risk of working around live infrastructure; not immediately liquid-cooling ready