The Great UPS Incident of Tuesday

8 min read
By The Operator
Heat

A UPS died. The backup UPS also died. The TTY learned about cascade failures. Management learned nothing. Tuesday lived up to its reputation.

At precisely 2:47 PM on a Tuesday, the datacenter UPS unit in Server Rack #7 began emitting a sound that can only be described as "the mechanical equivalent of a death rattle."

Not a beep. Not an alarm. A rattle. Like a shopping cart with three bad wheels being pushed down a gravel road by someone who's given up on life.

The TTY looked up from his monitor with the expression of someone who's just realized the fire alarm isn't a drill.

TTY: "Should that sound like that?"

OPERATOR: "No. That's the sound of ten-year-old capacitors experiencing what humans would call 'regret.'"

I was already standing.

The Discovery

Server Rack #7 housed our primary UPS for the customer-facing infrastructure. You know, the mission-critical stuff that management likes to mention in earnings calls. The UPS was rated for 3kVA, protecting approximately $200,000 worth of servers that generated approximately $2M in annual revenue.

The UPS itself cost $4,500 new. Management had approved its purchase in 2014. They had not approved the $800 replacement battery pack in 2018, 2020, 2022, or the three times I requested it in 2024.

"We'll address that in next quarter's budget," they'd said. Every quarter. For six years.

The rattle intensified. A new smell joined the party—hot electronics with notes of ozone and poor life choices.

TTY: "What's the backup plan?"

OPERATOR: "There's a backup UPS."

His face brightened.

TTY: "Oh good—"

OPERATOR: "It's been reporting 'replace battery' warnings since March 2023."

His face dimmed.

TTY: "Oh."

The Cascade

The primary UPS lasted another seven minutes before switching from "dying" to "dead." I know this because the monitoring system dutifully logged every second of its decline, including the moment the battery bank decided it had carried this operation long enough.

What happened next is what disaster recovery professionals call a "cascade failure" and what I call "exactly what I said would happen in four separate emails."

The servers attempted to switch to the backup UPS. The backup UPS, which had been running on batteries that could charitably be described as "vintage," immediately panicked and shut down. Not a graceful shutdown. An "I quit" shutdown.

The TTY watched twelve servers go dark in rapid succession. "Is that... supposed to happen?"

"According to the laws of physics and Murphy? Yes. According to the infrastructure budget proposal I submitted eighteen months ago? No."

The Teaching Moment

I pulled up the monitoring graphs on my laptop. The TTY leaned in, and I began what I like to call "aggressive technical education."

OPERATOR: "Notice the voltage here. That's the UPS trying to compensate for dying batteries. See how it's fluctuating? That's bad."

TTY: "How bad?"

OPERATOR: "Imagine if your heartbeat did that."

He nodded slowly.

TTY: "That's bad."

OPERATOR: "Now watch. The primary UPS dies. Load transfers to backup. Backup sees full load, tries to compensate, realizes its batteries are essentially decorative at this point, and gives up."

TTY: "Couldn't the servers just run without a UPS?"

OPERATOR: "They could. But we lose power conditioning, surge protection, and the three-to-five minute window to shut down gracefully if commercial power fails. We'd be trusting our infrastructure to the quality and consistency of the local power grid."

TTY: "Is that bad?"

OPERATOR: "The local power grid once went down because a squirrel looked at a transformer wrong."

TTY: "Ah."

The Management Response

I called the emergency number. Not because it was an emergency—the systems were already down, and the damage was done—but because management likes to feel involved during outages.

MANAGEMENT: "How long until it's fixed?"

OPERATOR: "New UPS units can be here tomorrow. Professional installation adds two days. Or I can install them this evening, and we're back up by midnight."

MANAGEMENT: "Excellent! Do that. And why didn't we have a backup?"

I pulled up my email client.

OPERATOR: "Would you like me to forward you the four budget requests from the past six years, or would you prefer I summarize them interpretively through the medium of dance?"

Silence.

OPERATOR: "I'll install the UPS tonight."

The Solution

Management had a choice: Pay for expedited shipping ($800) and after-hours installation by the vendor ($2,400), or approve my purchase order for two enterprise-grade UPS units ($6,500 total) that I could install myself tonight.

They chose the second option, probably because they could tell themselves they were "saving money" by not paying the vendor's emergency rates.

I placed the order. The supplier, bless them, had units in stock. The TTY and I drove across town to pick them up.

TTY: "Why didn't management just approve the battery replacements?"

OPERATOR: "Because battery replacements are boring. They don't make good slide presentations. Nobody gets promoted for 'continuing to ensure basic operational functionality.' You get promoted for 'implementing cost-saving measures' and 'optimizing resource allocation.'"

TTY: "So they saved $800 in battery costs and now they're spending $6,500 on new UPS units?"

OPERATOR: "Plus my overtime, your overtime, the revenue lost during downtime, and the customer goodwill destroyed by unexpected outages. But yes, technically they saved $800."

He stared out the window.

TTY: "That seems inefficient."

OPERATOR: "Welcome to enterprise IT."

The Installation

We had the new UPS units installed a little after 7:00 PM. The servers came back online. Monitoring confirmed stable power, healthy batteries, and a backup UPS that actually deserved the name "backup."

The TTY helped cable everything properly, learning the fine art of "making infrastructure look intentional" and "labeling things so future you doesn't curse past you."

TTY: "Why are we installing two UPS units instead of one big one?"

OPERATOR: "Redundancy. If one dies, the other carries the load until we fix it."

TTY: "But won't management just refuse to fix the broken one because there's still a working backup?"

I stopped mid-cable-routing.

OPERATOR: "The TTY is learning."

TTY: "Is that bad?"

OPERATOR: "No. That's Tuesday."

The Aftermath

The outage lasted four hours and seventeen minutes. The customer impact was moderate—some services were unavailable, some transactions failed, and our support team fielded approximately ninety angry calls.

Management held a post-mortem meeting the next day. They wanted to know why the UPS failed and what we could do to prevent it from happening again.

I presented my findings:

  1. The UPS failed because it was old and underfunded
  2. Prevention requires replacing batteries on schedule
  3. Schedule is "every 2-3 years depending on usage"
  4. Current battery age: 7 years

MANAGEMENT: "So we need better monitoring?"

OPERATOR: "We had monitoring. It sent alerts. I forwarded them to you."

MANAGEMENT: "Better escalation of monitoring?"

OPERATOR: "I sent four budget requests and mentioned it in six quarterly reviews."

The VP nodded thoughtfully.

MANAGEMENT: "Let's form a committee to review our UPS replacement procedures."

I added "committee meetings" to my calendar and updated my resume.

The Operator's Notes

Documented for posterity and the next poor soul who inherits this infrastructure:

The lesson: UPS batteries have a finite lifespan. So do budgets, apparently. The difference is that UPS batteries will eventually warn you they're dying. Budgets will just silently fail and take your infrastructure with them.
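And since the alerts apparently evaporate somewhere between my outbox and the budget meeting, here's the kind of weekly nag worth leaving behind for that poor soul. Same caveats as the earlier sketch: it assumes NUT and upsc, the UPS names are placeholders, and plenty of units never report a battery date at all, in which case the install date lives on a rack label and in a file.

```python
#!/usr/bin/env python3
"""Weekly nag: are the batteries past the "every 2-3 years" replacement window?

Sketch under assumptions: NUT's upsc is installed, the UPS names are
placeholders, and the unit actually reports a battery date. Many don't,
so keep the install date somewhere you control as well.
"""
import subprocess
from datetime import date

MAX_BATTERY_AGE_YEARS = 3                                 # the replacement schedule
UPS_NAMES = ("primary@localhost", "backup@localhost")     # placeholders


def ups_vars(name: str) -> dict:
    out = subprocess.run(["upsc", name], capture_output=True, text=True, check=True)
    return dict(line.split(": ", 1) for line in out.stdout.splitlines() if ": " in line)


def nag(name: str) -> None:
    v = ups_vars(name)
    if "RB" in v.get("ups.status", "").split():
        print(f"{name}: status says REPLACE BATTERY. This is the warning. Heed it.")

    # battery.mfr.date / battery.date are optional NUT variables; formats vary by vendor.
    raw = v.get("battery.mfr.date") or v.get("battery.date")
    if raw and raw[:4].isdigit():
        age = date.today().year - int(raw[:4])
        if age >= MAX_BATTERY_AGE_YEARS:
            print(f"{name}: batteries are roughly {age} years old against a "
                  f"{MAX_BATTERY_AGE_YEARS}-year schedule. File the budget request. Again.")


for ups in UPS_NAMES:
    nag(ups)
```

Cron it weekly and point the output at whatever alerting channel management claims to read.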

The TTY learned: That cascade failures are real, that redundancy requires working backups (not "technically present" backups), and that sometimes the most expensive technical solution is the one you delayed until it became an emergency.

Management learned: That committees are an excellent way to look productive while learning nothing.

Next Tuesday's forecast: Partly cloudy with a 60% chance of management requesting "cost optimization" on our newly installed UPS units.

The datacenter is stable. The UPS units are humming contentedly with their fresh, properly rated batteries. And I've started a betting pool on how long it takes for someone to suggest "we don't really need TWO UPS units, do we?"

The TTY has already placed his bet.

Smart kid.


All events depicted are fictional. No UPS units were harmed in the making of this story, though several were harmed in datacenters that ignored maintenance schedules. Please replace your batteries. Your infrastructure will thank you.
