Why Backups Matter: A Cautionary Tale

The TTY asked me last Tuesday why we run three separate backup systems when "one should be enough." I asked them to define "enough." They couldn't. This is why we run three separate backup systems.

The Fundamental Truth

Backups exist for one reason: everything fails eventually. Not "might fail" or "could fail" or "statistically has a non-zero probability of experiencing an unplanned service interruption." Will fail. The hard drive will die. The RAID array will have opinions. The cloud provider will have an outage. The intern will type rm -rf with confidence.

The question isn't "if" you'll need backups. It's "when," and more importantly, "will they work when you need them?"

(Spoiler: probably not, if you've been following management's backup budget recommendations.)

The 3-2-1 Rule: Not Just Numbers

There's a rule in backup strategy called the 3-2-1 rule. It's simple enough that even management can understand it, which is why they ignore it.

3 copies of your data
The original, plus two backups. Not "the original and one backup that you're pretty sure is recent." Three distinct copies. Why? Because Murphy's Law has a corollary specifically about backups: the one copy you have will corrupt itself the moment you need it.

2 different media types
Don't put all your eggs in one basket. Don't put all your data on one type of storage. If your primary storage is spinning disks, put backups on tape or SSD or stone tablets. Why? Because entire categories of storage can develop simultaneous vulnerabilities. Remember when everyone's SSDs had that firmware bug? The ones who had tape backups do. The ones who didn't... don't remember much of anything.

1 offsite copy
Because fire, flood, theft, and vengeful ex-employees don't respect datacenter boundaries. Offsite means physically separate location. Not "a different rack." Not "a different room." Different building. Different city is better. Different geological fault line is ideal.

The TTY asked why we need offsite backups when "the datacenter has sprinklers."
I showed them the incident report from the datacenter two states over where the sprinklers activated accidentally.
The TTY no longer questions offsite backups.

The Testing Paradox

Here's the thing about backups that nobody wants to hear: an untested backup is just hope with storage costs.

You don't have backups. You have untested backup files that might contain something resembling your data. Maybe. Possibly. Hopefully?

I run backup restoration tests quarterly. Not because I enjoy them. Not because they're on the schedule. Because the alternative is discovering your backups are corrupted at 3 AM during an actual disaster when the VP is breathing down your neck and the SLA clock is ticking.

The TTY once asked why we "waste time" testing backups when they "work fine."
I asked how they knew they worked fine.
They said "because we've never needed them."
I stared at them until they understood.
It took seven minutes.

RPO and RTO: The Acronyms That Matter

Recovery Point Objective (RPO): How much data can you afford to lose?

Not "how much data would you like to lose" (the answer is zero, obviously). How much data can the business lose before significant damage occurs? An hour? A day? A week?

This determines your backup frequency. If your RPO is one hour, you need backups running at least hourly. If your RPO is "we can't lose anything," congratulations, you need real-time replication and you should probably talk to someone about your budget.

Recovery Time Objective (RTO): How fast can you restore?

How long between "oh no everything is on fire" and "okay we're back and only mostly traumatized"? This determines your backup technology and strategy.

Tape backups are cheap. They're also slow. If your RTO is measured in hours and your restore process takes days, you have a problem. That problem is called "unemployment, pending."

The TTY suggested we could "restore really fast" by just keeping everything on redundant drives.
That's not a backup. That's RAID.
RAID is not a backup.
RAID protects against drive failure, not user error, malware, corruption, or the TTY accidentally dropping a table.
Never confuse RAID with backups.

Backup vs. Archive: They're Different

Backups are for operational recovery. They're recent, they're frequent, they're designed to restore current state quickly.

Archives are for long-term retention. They're old data you might need for compliance, legal, or historical reasons. They're "fire and forget" (but actually test them periodically because nothing is truly fire and forget).

Don't use backups as archives. Don't use archives as backups. They're different use cases with different requirements and different failure modes.

The finance department once asked why we couldn't "just use last year's backup" to recover a deleted file.
We could. If last year's backup hadn't been rotated out eight months ago like every retention policy specifies.
They now understand the difference between backups and archives.
It only took one very uncomfortable meeting with legal.

The Immutability Question

Modern backup best practice includes "immutable backups" - backups that can't be modified or deleted, even by someone with admin access.

Why? Ransomware.

Ransomware encrypts your files. Then it encrypts your backups. Then it laughs at you.
Immutable backups can't be encrypted by ransomware. They can't be deleted by malware. They can't be "accidentally" removed by that user who shouldn't have had admin rights but management insisted.

If your backups can be deleted by an automated process or compromised account, you don't have backups. You have ephemeral data that exists until someone decides it shouldn't.

The Operator's Backup Philosophy

Everything fails. Plan accordingly.
Untested backups don't exist. Test regularly.
Automate everything, trust nothing. Automation runs backups. Humans verify they worked.
Geographic separation is mandatory. If one location is compromised, others survive.
Encryption is non-negotiable. Encrypt backups. Always. No exceptions.
Document your restore process. When disaster strikes at 3 AM, you don't want to be Googling "how to restore from tape."
Monitor backup jobs. Failed backups should generate alerts louder than the fire alarm.

The Real Cost of Not Having Backups

I've seen companies lose everything because they "didn't have budget" for proper backups.
I've seen CTOs explain to boards why months of data vanished.
I've seen small businesses close permanently because they never tested their backup restorations.
I've seen the TTY's face when they realized they deleted the production database and the last backup was corrupted.

(The TTY learned. They test backups religiously now. Character-building experiences have that effect.)

The cost of backups is measured in storage and time.
The cost of not having backups is measured in everything else.

The TTY's Question

After reading this, the TTY asked: "So backups are basically insurance?"

Yes. Exactly. Insurance you hope you never need but will be desperately grateful for when you do.

Insurance you test regularly to make sure it actually works.

Insurance that management will question the cost of right up until the moment they need it.

Insurance that separates "temporary inconvenience" from "resume-generating event."

The Operator's Final Notes

Documented for posterity and the next person who asks "why do we need backups?"

The lesson: Backups aren't optional. They're not "nice to have." They're the difference between "bad day" and "catastrophic failure."

The TTY learned: That backups must be tested, geographically distributed, and treated with the same importance as the production systems they protect.

Management learned: That the backup budget is much smaller than the "oh no we lost everything" budget.

Next Tuesday's forecast: Partly cloudy with a 100% chance of something failing and us being very glad we have backups.

The datacenter is stable. The backups are running. The restore tests are scheduled.
And the TTY now checks backup logs without being asked.

Smart kid.