August 15, 2007 by jimstogdill
An Optimization Approach to Data Center Efficiency
Today I listened to the Green Grid’s recent WebEx presentation (you’ll need a WebEx player to listen to it). It was an interesting presentation, and it raised some points I hadn’t really considered before.
For example, the Green Grid’s efforts focus quite heavily on the data acquisition problem inside the data center, sort of an equivalent to distributed automated metering. Given that my company does work in the automated meter reading and data management space, I can’t account for why that didn’t occur to me before. Getting old, I guess.
I’m not sure where this focus will lead based on what they disclosed, but it seems like it will be interesting. Maybe some kind of SNMP-like protocol that captures a “meter interval,” current rate of consumption, and associated average and point CPU utilization? SNMP may be overkill, but in any case I assume it will have to be more than just an IP-based self-reporting meter, since it will be useful to have some correlated usage statistics (e.g. CPU utilization) to go with the power consumption. If collection is distributed out to every component (server, storage device, etc.), this will be quite a data storm to collect, persist, and analyze.
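To make the “meter interval” idea concrete, here is a minimal sketch of what one reading might carry and how quickly the readings pile up. The record layout, field names, and the `readings_per_day` helper are all hypothetical; nothing here reflects a format the Green Grid has actually proposed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MeterReading:
    """One interval reading from one device (hypothetical schema)."""
    device_id: str
    interval_start: datetime
    interval_length: timedelta
    energy_wh: float         # energy consumed during the interval
    current_power_w: float   # rate of consumption at the moment of the read
    avg_cpu_util: float      # mean CPU utilization over the interval, 0.0 to 1.0
    point_cpu_util: float    # CPU utilization at the moment of the read, 0.0 to 1.0

    def average_power_w(self) -> float:
        """Average power over the interval, derived from interval energy."""
        return self.energy_wh * 3600.0 / self.interval_length.total_seconds()


def readings_per_day(device_count: int, interval: timedelta) -> int:
    """Number of readings generated per day if every device reports each interval."""
    return device_count * (timedelta(days=1) // interval)


if __name__ == "__main__":
    reading = MeterReading(
        device_id="srv-001",                       # made-up device name
        interval_start=datetime(2007, 8, 15, 12, 0),
        interval_length=timedelta(minutes=5),
        energy_wh=25.0,
        current_power_w=310.0,
        avg_cpu_util=0.42,
        point_cpu_util=0.55,
    )
    print(reading.average_power_w())
    print(readings_per_day(10_000, timedelta(minutes=5)))
```

With an assumed 10,000 metered components reporting every five minutes, that is 288 readings per device per day, or 2,880,000 rows a day before anything is analyzed.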
On a different note, if you look at the list of participants in the Green Grid, you may be surprised, as I was, by the lack of software participants at the contributor level. It is only natural that this would be led by the hardware guys, but too much of a hardware emphasis misses the role that software plays in the efficiency equation at both the component and system level.
For example, at the component level in software there is growing interest in software efficiency in areas like kernel design. How long can it be before you’ll be able to load a CPU’s energy consumption characteristics and your predicted price per kWh into DTrace and be able to see not just the performance impact of coding decisions, but the predicted electrical cost impact as well?
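The arithmetic behind that kind of cost view is simple. The sketch below is not a DTrace feature; it is a plain function showing the conversion, and it assumes (crudely) that a CPU draws a fixed active power for every second of CPU time a piece of code uses. The function name and all the numbers are illustrative.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3,600,000 joules


def predicted_energy_cost(cpu_seconds: float,
                          active_power_w: float,
                          price_per_kwh: float) -> float:
    """Electrical cost of a given amount of CPU time.

    Assumes a constant active power draw, which ignores idle power,
    frequency scaling, and cooling overhead.
    """
    joules = cpu_seconds * active_power_w
    return joules / JOULES_PER_KWH * price_per_kwh


if __name__ == "__main__":
    requests_per_year = 1e9   # illustrative workload
    power_w = 90.0            # assumed active CPU power
    price = 0.10              # assumed price per kWh

    # Two hypothetical implementations of the same request handler.
    for label, cpu_s_per_request in (("before", 0.020), ("after", 0.015)):
        cost = predicted_energy_cost(requests_per_year * cpu_s_per_request,
                                     power_w, price)
        print(label, round(cost, 2))
```

Under these made-up inputs the direct CPU energy difference is small in dollar terms, which is one reason the system-level effects discussed later in this post matter more than any single component.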
However, I think software participation is even more important when data-center-as-a-system efficiency is considered.
Automobiles, like data centers, are systems. Automobile manufacturers can model a new vehicle as the product of the sub-efficiencies of its power plant (thermodynamic, volumetric, mechanical, …), drive train, aero characteristics, etc. With these formulas, many of the components of which are empirically based (e.g. aero efficiency, which relies on turbulent flow…), manufacturers can determine maximum acceleration, most efficient speed, efficiency at highway speeds, etc. before they ever build the vehicle.
They can compare various power plant configurations (normally aspirated, flex fueled, Miller cycle with hybrid drive, etc.) and predict most efficient speed and efficiency at typical speeds for each configuration. By including cost-to-build and cost-to-operate data, they can develop an optimization objective function to drive design choices. Because they do this, not every car ships with a 300 HP V-8 (well, that’s not the only reason, but in data centers it seems like everything needs a V-8).
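A toy version of that approach fits in a few lines: overall efficiency as a product of sub-efficiencies, and an objective function that adds cost-to-build to cost-to-operate over some ownership period. Everything below (the `Configuration` type, the configuration names, and every number) is invented for illustration and is not real automotive or data center data.

```python
from math import prod
from typing import Iterable, NamedTuple, Tuple


class Configuration(NamedTuple):
    name: str
    sub_efficiencies: Tuple[float, ...]  # each in (0, 1], multiplied together
    cost_to_build: float
    annual_useful_energy: float          # useful work needed per year, arbitrary units
    energy_price: float                  # price per unit of purchased energy


def overall_efficiency(config: Configuration) -> float:
    """System efficiency modeled as the product of the sub-efficiencies."""
    return prod(config.sub_efficiencies)


def lifetime_cost(config: Configuration, years: float) -> float:
    """Objective function: build cost plus energy purchased over the period."""
    purchased = config.annual_useful_energy / overall_efficiency(config)
    return config.cost_to_build + years * purchased * config.energy_price


def best_configuration(configs: Iterable[Configuration], years: float) -> Configuration:
    """Pick the configuration that minimizes the objective function."""
    return min(configs, key=lambda c: lifetime_cost(c, years))


if __name__ == "__main__":
    candidates = [
        Configuration("cheap", (0.35, 0.90, 0.80), 20_000.0, 1_000.0, 1.5),
        Configuration("efficient", (0.40, 0.92, 0.80), 24_000.0, 1_000.0, 1.5),
    ]
    for horizon in (3, 10):
        print(horizon, best_configuration(candidates, horizon).name)
```

With these inputs the cheaper configuration wins over a short ownership period and the more efficient one wins over a long one, which is exactly the kind of trade-off an objective function is meant to expose.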
Data center designers should have a similar set of theoretical and empirical tools on hand to allow them to optimize component selection for a particular kind of data center workload. What are critical latency constraints on an application-by-application basis? What are the component costs? What are the operational costs associated with direct electrical consumption and electrical consumption related to cooling and other secondary systems? What about the cost of floor space?
In an interconnected system like a data center, a faster processor doesn’t just consume more electricity directly; it also creates more heat, which results in more cooling load. If it is on but not operating, it consumes some base electrical and cooling load; but as CPU utilization increases, does heat dissipation increase linearly or greater than linearly? As cooling load increases, what is the shape of the cost response curve associated with servicing that cooling load? Theoretical or empirical models for these and other components would permit us to answer these questions, and then broader system level questions like:
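As a placeholder for the models asked for above, here is a deliberately simple one: a base (idle) load plus a utilization-dependent term with an adjustable exponent, and cooling treated as a fixed fractional overhead on IT power. Both functional forms are assumptions made for illustration; whether the real curves look like this is precisely the open question, and the parameter names and wattages are invented.

```python
def server_power_w(utilization: float,
                   idle_w: float,
                   peak_w: float,
                   exponent: float = 1.0) -> float:
    """Server power draw at a given CPU utilization (0.0 to 1.0).

    exponent == 1.0 gives a linear response; exponent > 1.0 gives a
    response that grows faster than linearly as utilization rises.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return idle_w + (peak_w - idle_w) * utilization ** exponent


def total_power_w(it_power_w: float, cooling_overhead: float) -> float:
    """IT power plus cooling, with cooling as a fixed fraction of IT power.

    A real cooling cost curve is unlikely to be a constant fraction;
    this is only the simplest stand-in.
    """
    return it_power_w * (1.0 + cooling_overhead)


if __name__ == "__main__":
    for exponent in (1.0, 2.0):
        it_power = server_power_w(0.5, idle_w=150.0, peak_w=300.0, exponent=exponent)
        print(exponent, it_power, total_power_w(it_power, cooling_overhead=0.5))
```

Swapping in measured curves for either function would leave the rest of the analysis unchanged, which is the appeal of modeling the data center as a set of composable component models.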
– In the same way that automobiles have a most efficient speed, do servers in the context of a broader data center system have a most efficient CPU utilization (from an energy consumption point of view)? If it exists, how does that most efficient utilization correspond to equipment cost and user latency constraints? Are they like a car that is most efficient at 25mph or 55mph?
– In a data center that had many processors running a similar application in a job dispatching approach (e.g. Google), would the data center efficiency on a joules-per-transaction basis be better with fewer processors running at higher utilization or more processors running at lower utilization? For a given cooling system, physical layout, power distribution system and computing workload characteristics, what would the optimum dispatched CPU utilization be?
– Would the overall energy consumption of a given data center with a given workload go up or down if all of its processors could be replaced with a greater number of slower-clock-speed but higher-efficiency processors designed for mobile technology such that user response latency was unaffected (considering all important factors from cooling load to floor space to power supply losses, etc.)?
– Given an existing data center design, would an investment of $X be better spent on upgrading the cooling sub-system, storage sub-systems, or servers?
– What is the predicted financial impact of replacing the OS kernel on X machines with one that has better power management characteristics while maintaining a constant workload?
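The “most efficient utilization” question in particular can be explored numerically once a power model is chosen. The sketch below assumes a server whose power is an idle load plus a utilization term raised to an adjustable exponent, with cooling as a constant multiplier, and sweeps utilization to find the lowest joules per transaction. The model, the function names, and the numbers are assumptions for illustration, not measurements.

```python
def joules_per_transaction(utilization: float,
                           idle_w: float,
                           peak_w: float,
                           max_tps: float,
                           exponent: float = 1.0,
                           cooling_overhead: float = 0.0) -> float:
    """Energy per transaction for one server at a given utilization.

    Throughput is assumed proportional to utilization, and cooling is
    assumed to be a constant fraction of IT power.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0.0, 1.0]")
    it_power_w = idle_w + (peak_w - idle_w) * utilization ** exponent
    facility_power_w = it_power_w * (1.0 + cooling_overhead)
    return facility_power_w / (utilization * max_tps)


def most_efficient_utilization(idle_w: float,
                               peak_w: float,
                               max_tps: float,
                               exponent: float = 1.0,
                               cooling_overhead: float = 0.0,
                               steps: int = 100) -> float:
    """Grid search for the utilization with the lowest joules per transaction."""
    candidates = [i / steps for i in range(1, steps + 1)]
    return min(
        candidates,
        key=lambda u: joules_per_transaction(
            u, idle_w, peak_w, max_tps, exponent, cooling_overhead
        ),
    )


if __name__ == "__main__":
    for exponent in (1.0, 3.0):
        print(exponent, most_efficient_utilization(100.0, 400.0, 1_000.0, exponent))
```

Under the linear model the idle load is amortized over more work as utilization rises, so fewer, busier servers always win and the “most efficient speed” is flat out. An interior optimum only appears when power grows faster than linearly with utilization (or when the cooling cost curve does), which is why the shape of those curves matters so much. Latency constraints, which this sketch ignores, would push the answer lower still.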
I’d like to consider more how software for job and/or VM dispatching will be important in the context of a most efficient utilization, but this post is running way long already so maybe another time…