Surviving Electric Squirrels and UPS Failures
July 16, 2012
Folks who’ve worked in the data center industry for a while tend to have their squirrel stories. Mike Christian, who runs business continuity for Yahoo, shared his recently during a keynote at the O’Reilly Velocity conference in a presentation titled “Frying Squirrels and Unspun Gyros,” which examined the many ways that data centers can fail.
“A frying squirrel took out half of our Santa Clara data center two years back,” Christian said, noting squirrels’ propensity to interact with electrical equipment, with unfortunate results.
If you enter “squirrel outage” in either Google News or Google web search, you’ll find a lengthy record of both recent and historic incidents of squirrels causing local power outages.
Yahoo houses its servers in 29 different data centers, which explains Christian’s familiarity with the many ways they can fail. These include:
- Inadvertent fire suppression: When an electrical problem triggered smoke detectors at a Texas data center hosting Yahoo Launch (Broadcast.com), staffers didn’t realize they could override the next phase of the system: a power shutdown and a “dump” of FM200 fire suppressant.
- HVAC Failure: A cooling system failure in an N+1 Yahoo facility in Reston, Virginia caused a temperature spike in part of the data center, which triggered the fire suppression system. That in turn shut down the remaining HVAC units, causing a “thermal runaway” that drove temperatures in the data center to 130 degrees F. Yahoo was able to shift the load, so there was no downtime. That’s one reason Yahoo built its Lockport, N.Y. “chicken coop” data center to use fresh air instead of mechanical cooling. “That’s one less failure point,” said Christian.
- UPS Meltdowns: A small UPS setup in Yahoo’s Sunnyvale data center failed three times in five years. Christian cites a recent survey indicating that up to 29 percent of unplanned data center outages are caused by UPS failures. “Our UPS causes as many problems as it solves,” said Christian. “Complexity is introduced by adding all these multiple systems. They actually introduce additional failure cases.”
How do you prepare for these kinds of events? Focus on storing data in more than one location, and routing around facility failures. How does Yahoo know this will work? It conducts full-scale live failover testing with live loads, shifting millions of users between data centers with no visible impact.
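The idea of routing around a facility failure can be illustrated with a small sketch. This is not Yahoo's system. The `DataCenter` class, the `fail_over` function, the site names, and the capacity figures are all hypothetical, and the greedy "fill the site with the most headroom first" policy is just one simple choice among many. What it does show is why failover only works if the surviving sites have spare capacity to absorb the displaced load.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str             # hypothetical site label
    capacity: int         # most users the site can serve (assumed figure)
    load: int             # users currently served
    healthy: bool = True


def fail_over(sites: list[DataCenter], failed_name: str) -> dict[str, int]:
    """Shift the failed site's users onto healthy sites.

    Greedy policy: fill the site with the most spare capacity first.
    Returns how many users were moved to each receiving site.
    Raises RuntimeError if the healthy sites cannot absorb the load.
    """
    failed = next(s for s in sites if s.name == failed_name)
    failed.healthy = False
    displaced = failed.load

    # Healthy sites, ordered by headroom (largest first).
    survivors = sorted(
        (s for s in sites if s.healthy),
        key=lambda s: s.capacity - s.load,
        reverse=True,
    )
    total_spare = sum(s.capacity - s.load for s in survivors)
    if displaced > total_spare:
        raise RuntimeError("not enough spare capacity to absorb the failed site")

    moved: dict[str, int] = {}
    remaining = displaced
    for site in survivors:
        if remaining == 0:
            break
        share = min(site.capacity - site.load, remaining)
        site.load += share
        remaining -= share
        moved[site.name] = share

    failed.load = 0
    return moved


# Illustrative numbers only.
sites = [
    DataCenter("site-a", capacity=100, load=60),
    DataCenter("site-b", capacity=100, load=50),
    DataCenter("site-c", capacity=100, load=90),
]
print(fail_over(sites, "site-a"))  # {'site-b': 50, 'site-c': 10}
```

In this toy example the two surviving sites end up completely full, which is the kind of margin problem that live failover testing would be expected to expose before a real outage does.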