An Optimization Approach to Data Center Efficiency

Today I listened to The Green Grid’s recent WebEx presentation (you’ll need a WebEx player to listen to it). It was an interesting presentation and it raised some points I hadn’t really considered before.

For example, The Green Grid’s efforts are focused quite heavily on the data acquisition problem inside the data center; sort of an equivalent to distributed automated metering. Given that my company does work in the automated meter reading and data management space, I can’t account for why that didn’t occur to me before. Getting old, I guess.

I’m not sure where this focus will lead based on what they disclosed, but it seems like it will be interesting. Maybe some kind of SNMP-like protocol that captures a “meter interval,” the current rate of consumption, and associated average and point CPU utilization? SNMP may be overkill, but in any case I assume it will have to be more than just an IP-based self-reporting meter, since it will be useful to have correlated usage statistics (e.g. CPU utilization) to go with the power consumption. If collection is distributed out to every component (server, storage device, etc.), this will be quite a data storm to collect, persist, and analyze.
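As a concrete (and entirely hypothetical) sketch of what one such reading might look like, here is a minimal record format in Python; the field names and the simple JSON-over-IP reporting assumption are mine, not anything The Green Grid has actually proposed:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class PowerMeterReading:
    """One hypothetical meter-interval record from a component's built-in meter."""
    device_id: str            # which server/storage device reported this
    interval_start: float     # UNIX timestamp at the start of the interval
    interval_seconds: int     # length of the "meter interval"
    watts_average: float      # average power draw over the interval
    watts_instant: float      # point-in-time draw at the end of the interval
    cpu_util_average: float   # average CPU utilization over the interval (0-1)
    cpu_util_instant: float   # point-in-time CPU utilization (0-1)

# A single reading, serialized the way a simple IP-based reporter might send it.
reading = PowerMeterReading(
    device_id="rack12-server07",
    interval_start=time.time(),
    interval_seconds=300,
    watts_average=243.5,
    watts_instant=251.0,
    cpu_util_average=0.42,
    cpu_util_instant=0.55,
)
print(json.dumps(asdict(reading), indent=2))
```

Even at one five-minute interval per component, a 10,000-server data center would generate close to three million of these records a day before counting storage and network gear, which gives a sense of the data storm involved.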

On a different note, if you look at the list of participants in The Green Grid, you may be surprised, as I was, by the lack of software participants at the contributor level. It is only natural that this effort would be led by the hardware guys, but too much of a hardware emphasis misses the role that software plays in the efficiency equation at both the component and the system level.

For example, at the component level there is growing interest in software efficiency in areas like kernel design. How long can it be before you’ll be able to load a CPU’s energy consumption characteristics and your predicted kilowatt-hour price into DTrace and see not just the performance impact of coding decisions, but the predicted electrical cost impact as well?
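DTrace doesn’t do anything like this today, of course, but the arithmetic such a tool would be doing behind the scenes is simple. A rough sketch (the per-core wattage and the electricity price below are assumptions, not measurements):

```python
# Hypothetical post-processing of profiler output: a profiler tells you how many
# CPU-seconds a code path consumes; combine that with an assumed incremental
# power draw per busy core and an electricity price to estimate dollars.

WATTS_PER_BUSY_CORE = 20.0   # assumed incremental draw of one fully busy core
PRICE_PER_KWH = 0.10         # assumed electricity price, $/kWh

def electrical_cost(cpu_seconds: float) -> float:
    """Estimated dollars of electricity for cpu_seconds of busy-core time."""
    kwh = cpu_seconds * WATTS_PER_BUSY_CORE / 1000.0 / 3600.0
    return kwh * PRICE_PER_KWH

# Example: a service that keeps one core 30% busy around the clock.
seconds_per_year = 86_400 * 365
print(f"~${electrical_cost(0.30 * seconds_per_year):.2f} per core-year, "
      "before cooling load")
```

Multiply a figure like that across thousands of cores (and add the cooling it drags along), and coding decisions start to have a visible line item.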

However, I think software participation is even more important when data-center-as-a-system efficiency is considered.

Automobiles, like data centers, are systems. Automobile manufacturers can model a new vehicle as the product of the sub-efficiencies of its power plant (thermodynamic, volumetric, mechanical, …), drive train, aerodynamics, and so on. With these formulas, many components of which are empirically based (e.g. aerodynamic efficiency, which relies on turbulent flow), manufacturers can determine maximum acceleration, most efficient speed, efficiency at highway speeds, and so on before they ever build the vehicle.

They can compare various power plant configurations (normally aspirated, flex-fuel, Miller cycle with hybrid drive, etc.) and predict the most efficient speed and the efficiency at typical speeds for each configuration. By including cost-to-build and cost-to-operate data they can develop an optimization objective function to drive design choices. Because they do this, not every car ships with a 300 HP V-8 (well, that’s not the only reason; but in data centers it seems like everything needs a V-8).

Data center designers should have a similar set of theoretical and empirical tools on hand to allow them to optimize component selection for a particular kind of data center workload. What are critical latency constraints on an application-by-application basis? What are the component costs? What are the operational costs associated with direct electrical consumption and electrical consumption related to cooling and other secondary systems? What about the cost of floor space?
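Here is a deliberately crude sketch of what such an objective function might look like. Every figure in it (prices, wattages, the flat cooling multiplier, and the made-up “fast_v8” and “efficient” configurations) is an assumption for illustration, not vendor data:

```python
import math

# Toy objective function for choosing among server configurations for a given
# workload. All figures below are invented for illustration.

HOURS_PER_YEAR = 8_760
PRICE_PER_KWH = 0.10            # assumed electricity price
COOLING_OVERHEAD = 1.0          # assumed: 1 W of facility load per 1 W of IT load
FLOOR_COST_PER_U_YEAR = 150.0   # assumed cost of one rack unit of floor space

def cost_of_ownership(price, avg_watts, rack_units, years=3):
    """Purchase price plus energy, cooling, and floor space over `years`."""
    energy_kwh = avg_watts * (1 + COOLING_OVERHEAD) * HOURS_PER_YEAR / 1000.0
    annual = energy_kwh * PRICE_PER_KWH + rack_units * FLOOR_COST_PER_U_YEAR
    return price + years * annual

def dollars_per_unit_throughput(cfg, required_tps, max_latency_ms):
    """Objective to minimize: cost per unit of sustained throughput,
    subject to a hard latency constraint."""
    if cfg["latency_ms"] > max_latency_ms:
        return math.inf                      # violates the latency constraint
    servers = math.ceil(required_tps / cfg["tps_per_server"])
    total = servers * cost_of_ownership(cfg["price"], cfg["avg_watts"],
                                        cfg["rack_units"])
    return total / required_tps

configs = {
    "fast_v8":   {"price": 6000, "avg_watts": 350, "rack_units": 2,
                  "tps_per_server": 1200, "latency_ms": 20},
    "efficient": {"price": 3500, "avg_watts": 180, "rack_units": 1,
                  "tps_per_server": 700,  "latency_ms": 45},
}
best = min(configs, key=lambda name: dollars_per_unit_throughput(
    configs[name], required_tps=50_000, max_latency_ms=50))
print("Lowest cost per unit of delivered work:", best)
```

With these made-up numbers the lower-power configuration wins, but tighten the latency constraint to 30 ms and it drops out entirely; surfacing exactly that kind of trade-off is what the tool is for.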

In an interconnected system like a data center, a faster processor doesn’t just consume more electricity directly; it creates more heat, which results in more cooling load. If it is on but not operating, it consumes some base electrical and cooling load; but as CPU utilization increases, does heat dissipation increase linearly or faster than linearly? As cooling load increases, what is the shape of the cost response curve for servicing that cooling load? Theoretical or empirical models for these and other components would permit us to answer these questions, and then broader system-level questions like:

– In the same way that automobiles have a most efficient speed, do servers, in the context of a broader data center system, have a most efficient CPU utilization from an energy consumption point of view? If it exists, how does that most efficient utilization correspond to equipment cost and user latency constraints? Are they like a car that is most efficient at 25 mph, or at 55 mph?

– In a data center with many processors running a similar application in a job-dispatching approach (e.g. Google), would data center efficiency on a joules-per-transaction basis be better with fewer processors running at higher utilization or more processors running at lower utilization? For a given cooling system, physical layout, power distribution system, and computing workload, what would the optimum dispatched CPU utilization be? (A toy sketch of this comparison appears after this list.)

– Would the overall energy consumption of a given data center with a given workload go up or down if all of its processors were replaced with a greater number of slower-clock-speed but higher-efficiency processors designed for mobile technology, such that user response latency was unaffected (considering all the important factors, from cooling load to floor space to power supply losses)?

– Given an existing data center design, would an investment of $X be better spent on upgrading the cooling subsystem, the storage subsystems, or the servers?

– What is the predicted financial impact of replacing the OS kernel on X machines with one that has better power management characteristics while maintaining a constant workload?
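On the second question above, here is a toy model of the fewer-hot-servers versus more-cool-servers comparison. The linear power curve and every number in it are assumptions for illustration, not measurements:

```python
# Toy model of the "fewer hot servers vs. more cool servers" question. The
# linear power curve and every number below are assumptions for illustration.

P_IDLE = 150.0          # watts a server draws doing nothing
P_PEAK = 250.0          # watts at 100% CPU utilization
TPS_AT_PEAK = 1000.0    # transactions/second one server delivers at 100% util
COOLING_OVERHEAD = 1.0  # assumed: 1 W of facility load per 1 W of IT load

def joules_per_transaction(num_servers, total_tps):
    """Energy per transaction when total_tps is spread evenly across num_servers."""
    util = total_tps / (num_servers * TPS_AT_PEAK)
    if util > 1.0:
        raise ValueError("workload exceeds the farm's capacity")
    watts_per_server = P_IDLE + (P_PEAK - P_IDLE) * util
    total_watts = num_servers * watts_per_server * (1 + COOLING_OVERHEAD)
    return total_watts / total_tps   # W per (txn/s) is the same as joules per txn

WORKLOAD_TPS = 100_000.0   # transactions per second across the whole farm
for n in (125, 200, 400, 800):
    util = WORKLOAD_TPS / (n * TPS_AT_PEAK)
    print(f"{n:4d} servers at {util:.0%} utilization: "
          f"{joules_per_transaction(n, WORKLOAD_TPS):.2f} J per transaction")
```

With this (assumed) high idle draw, fewer, busier servers always win on joules per transaction; a cooling cost that rises faster than linearly, or a latency budget that penalizes high utilization, could bend that answer, which is exactly why the empirical component models matter.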

I’d like to consider more how software for job and/or VM dispatching will be important in the context of a most efficient utilization, but this post is running way long already so maybe another time…

• • •

Data Center Energy Use

I read the EPA’s just-released report on data center efficiency with interest tonight. I was expecting to have to wade through a really dull tome, but it is actually quite an interesting read (yes, I am that much of a geek). Here’s a nice crib sheet if you don’t want to read all 133 pages; but I’m warning you, you’ll be missing all the good stuff.

A few interesting tidbits:

* ~1.5% of all electricity consumed in the U.S. is consumed by data centers, at a cost of ~$4.5B; 10% of that is consumed by Federal government data centers (the report neglects to mention how much of that is consumed at Ft. Meade). We spend more on data center electricity than we do on color televisions.
* Within data centers, the distribution of power consumption is approximately 50% IT gear and 50% facilities (cooling, power conversion, etc.).
* The distribution of power load among the IT components is roughly 10% for network equipment, 11% for storage equipment, and 79% for servers.
* Within a typical server at peak load, power consumption looks like:
  * CPU: 80 watts
  * Memory: 36 watts
  * Disks: 12 watts
  * Peripheral slots: 50 watts
  * Motherboard: 25 watts
  * Fan: 10 watts
  * PSU losses: 38 watts
  * Total: ~251 watts – note that the CPU consumes only about a third of the total in this typical server.
* Best current “volume” server designs consume ~25% less energy than similarly productive normal volume servers.
* The report estimates that server consolidation and virtualization may result in 20% energy savings in a typical data center.

Interestingly, they missed the emergent trend of hybrid drives and the potential impact they may have on storage efficiency.

Based on the numbers above, a data center of 10,000 highly utilized typical “volume” servers would consume approximately 6.52 megawatts, distributed roughly as follows:

* Servers: ~2.58 MW (of which the CPUs alone draw ~800 kW)
* Storage: ~350 kW
* Network: ~330 kW
* Total IT gear: ~3.26 MW
* Cooling, power conversion, and facilities: ~3.26 MW
* Data center total: ~6.52 MW
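A quick back-of-the-envelope reconstruction of those figures from the per-server breakdown and the 79/11/10 IT split quoted earlier (the small differences from the numbers above are just rounding):

```python
# Rebuild the 10,000-server estimate from the per-server breakdown and the
# IT-load split quoted earlier in this post.

SERVERS = 10_000
WATTS_PER_SERVER = 251.0       # typical volume server at peak, per the list above
CPU_WATTS_PER_SERVER = 80.0
SERVER_SHARE_OF_IT = 0.79      # servers' share of the IT load
FACILITY_MULTIPLIER = 1.0      # cooling/power conversion roughly equals IT load

servers_mw = SERVERS * WATTS_PER_SERVER / 1e6        # ~2.5 MW
it_mw = servers_mw / SERVER_SHARE_OF_IT              # ~3.2 MW with storage/network
dc_total_mw = it_mw * (1 + FACILITY_MULTIPLIER)      # ~6.4 MW
cpu_mw = SERVERS * CPU_WATTS_PER_SERVER / 1e6        # 0.8 MW

print(f"IT gear: {it_mw:.2f} MW, data center total: {dc_total_mw:.2f} MW")
print(f"CPU share of the whole data center: {cpu_mw / dc_total_mw:.1%}")
```

Either way you run the arithmetic, the CPUs end up consuming only around an eighth of the total, which is the point of the next paragraph.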

With the hierarchy of overheads layered on top of the CPUs, only 12% of the power is being consumed by the CPUs to do computational work. This is at least a partial explanation for why Google goes to such pains to build their own servers out of commodity CPUs combined into unusually compact configurations. If they can squeeze significantly more CPU out of their server overhead (shared power supplies, motherboards, etc.), computationally bound applications can be much more productive per watt consumed. The problem with all this overhead power consumption is that even significant improvements from multi-core and dynamic voltage/frequency scaling chips tend to be damped and their overall impact limited (though any change that reduces CPU energy consumption will generally reduce the required cooling load as well).

I have a number of other things in this report that I’d like to comment on; but to keep the rest of this post reasonably short I’ll focus on one area in particular.

A significant amount of the report details a variety of energy saving approaches and then defines three major improvement scenarios based on those best practices: Improved Operation, Best Practice, and State of the Art. It then goes on to suggest a standard data center energy efficiency rating that can be used for comparison between data centers. The problem is, the way the rating is defined, different data centers that are doing better or worse jobs of implementing best practices will be able to get the same overall rating.

The difficulty is defining an objective measure of efficiency with a consistent definition of useful computing work against which to normalize the energy consumption. Servers are used for wildly differing purposes, so no such consistent measurement of server productivity exists. The report details these difficulties but then, instead of coming up with a best-guess proxy for server productivity, it punts. The efficiency measure it falls back to essentially ignores the efficiency of the servers and whether they are being productively employed, and simply measures the efficiency of the data center at delivering power to the servers (by dividing the power delivered to the IT gear by the total power consumed in the data center).

This is a useful measure in some ways, and it will drive important behaviors that improve the efficiency of cooling and power conversion systems, but it seems to me that it will do little to focus attention on the productive efficiency of the servers themselves. It would be like the CAFE standards focusing on the efficiency of drilling, refining, and distributing gasoline while ignoring the gas mileage of the vehicles consuming it.

Given that virtualization is one of the mechanisms for improved efficiency and the fact that virtualization tends to drive utilization up, it seems to me that they could have just used CPU utilization as a reasonable proxy for output productivity and achieved an efficiency rating standard of normalized average CPU utilization per watt consumed. Such a standard would be imperfect (data centers with heterogeneous applications might have a harder time achieving it than CPU-intensive single application farms for example) but it would at least incent the kinds of improvements in core computing efficiency that the best practice section encourages.
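To make the contrast concrete, here is a sketch of the two ratings side by side; the two data centers and all of their numbers are invented for illustration:

```python
# Two imaginary data centers with identical facilities efficiency but very
# different server productivity. All numbers are invented for illustration.

def infrastructure_rating(it_watts, total_watts):
    """The report-style rating: fraction of facility power reaching the IT gear."""
    return it_watts / total_watts

def utilization_per_megawatt(avg_cpu_util, total_watts):
    """The proposed proxy: normalized average CPU utilization per MW consumed."""
    return avg_cpu_util / (total_watts / 1e6)

data_centers = {
    "consolidated": {"it_w": 3.26e6, "total_w": 6.52e6, "avg_util": 0.60},
    "sprawling":    {"it_w": 3.26e6, "total_w": 6.52e6, "avg_util": 0.08},
}

for name, dc in data_centers.items():
    print(f"{name:12s}  infrastructure rating: "
          f"{infrastructure_rating(dc['it_w'], dc['total_w']):.2f}   "
          f"utilization per MW: "
          f"{utilization_per_megawatt(dc['avg_util'], dc['total_w']):.3f}")
```

Both score identically on the report-style rating even though one is doing roughly seven times as much computing per watt; that gap is exactly what a utilization-based proxy, however imperfect, would expose.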

• • •

The Cubic Mile of Oil

Once you start thinking about topics like peak oil, you start running across all kinds of related material. We currently use approximately one cubic mile of oil per year. This article in IEEE gives a sense of what it would take to replace that with other sources, and this article adds analysis and comments.

When we ran out of whale oil, it took rising prices to finally force people to shift to petroleum, and not before nearly the entire available supply of whale oil had been used. Fortunately, petroleum was just lying there, waiting for a few extraction technologies to fill the gap. If only fusion were waiting in the wings, so ready.

• • •