When did you last check for inactive backups?
Example:
[1] Initial Situation
A fully automated backup system is in place.
Backup schedules for all servers have been defined and activated, and have been working fine for several months or years.
A professional monitoring system monitors the backup system.
[2] Due to planned maintenance work on server Uxxx, the nightly automated backup schedule for this server was manually deactivated. After the maintenance, it was not reactivated.
The monitoring system did not raise an alarm, because that backup schedule did not report an error. A job that is never started cannot fail, so it never raises an error.
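This blind spot can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the server names `srv-a` and `srv-b`, the run history, the timestamps, and the two helper functions are made up for illustration and do not correspond to any particular backup or monitoring product. It contrasts a rule that only reacts to reported errors with a rule that expects a successful run from every known server.

```python
from datetime import datetime, timedelta

# Hypothetical job history: (server, finished_at, status).
# "Uxxx" has no recent entry because its schedule was deactivated.
now = datetime(2024, 1, 10, 8, 0)
runs = [
    ("srv-a", now - timedelta(hours=6), "ok"),
    ("srv-b", now - timedelta(hours=5), "error"),
    ("Uxxx", now - timedelta(days=9), "ok"),
]
expected_servers = {"srv-a", "srv-b", "Uxxx"}


def error_based_alerts(runs, since):
    """Alert only on jobs that actually ran and reported an error."""
    return sorted({s for s, t, status in runs if t >= since and status != "ok"})


def freshness_alerts(runs, expected, since):
    """Alert on every expected server without a successful run since `since`."""
    fresh = {s for s, t, status in runs if t >= since and status == "ok"}
    return sorted(expected - fresh)


since = now - timedelta(hours=24)
print(error_based_alerts(runs, since))                   # ['srv-b']
print(freshness_alerts(runs, expected_servers, since))   # ['Uxxx', 'srv-b']
```

The error-based rule never mentions Uxxx, because there is no run to inspect. The second rule only works if the list of expected servers is kept up to date, which is an assumption of this sketch.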
[3] Conclusions
The root cause of this operational mistake was an incomplete (or nonexistent?) checklist for maintenance work. But just ensuring that the next version of the maintenance checklist is complete and is being followed is not enough. An additional approach to detect backups that were never started is not easy to build, but it needs to be done, "whatever it takes". The inability to recover when needed would reveal the gap, but by then it is too late.
Approach | Comments |
Raise awareness | This alone is not sufficient, but it is one small additional contribution. |
If you already have a generic "Post Implementation Review" document or checklist, add the questions: "What (jobs, backups) has been temporarily deactivated?" and "Who reactivated them?" Please confirm for each of those. | |
Don't deactivate the backup schedule; just move the start time to a later time. | This can be very dangerous if the maintenance work takes longer than planned. |
Create a report like
SELECT count(*)
If count(*) > 0, raise an alarm. | This report shows the servers that were backed up one week ago but not yesterday. Problems: Not all backup systems support this type of custom report. If an old server has been decommissioned, this report will raise alarms for the following week. |
Count the number of servers backed up in the last 24 hours and compare that number | A manual backup of a server that is usually not backed up (e.g. a test server) would offset a backup that was not started, so the count would look normal. |
Daily / weekly statistics on total backup volume | A missing server with a small backup volume would not be detected. |
Deviation of daily / last rolling 7 days backup volume PER SERVER | For most servers the 7-day rolling backup volume should be fairly constant, so one missing full backup would already show a 15% drop and should raise an alarm. One or a few missing differential backups might not be visible, but the statistics are expected to raise an alarm at the latest after the first missing full backup. |
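Two of the approaches from the table, the "backed up one week ago, but not yesterday" report and the per-server 7-day volume deviation, could look roughly like the following Python sketch using SQLite. The table `backup_log`, its columns, the server names, the dates, and the 10% threshold are all assumptions for illustration; a real backup system will expose its job history differently, if at all. The threshold is set below 15% on purpose: assuming seven equally sized daily backups, one missing backup is a drop of 1/7, about 14.3%.

```python
import sqlite3
from datetime import date, timedelta


def build_demo_db():
    """Create an in-memory table with 14 days of made-up backup history."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE backup_log (server TEXT, backup_date TEXT, volume_gb REAL)"
    )
    start = date(2024, 1, 1)
    for offset in range(14):  # 2024-01-01 .. 2024-01-14
        day = (start + timedelta(days=offset)).isoformat()
        rows = [("srv-a", day, 10.0)]
        if day != "2024-01-12":  # srv-b misses one single backup
            rows.append(("srv-b", day, 10.0))
        if day <= "2024-01-10":  # schedule of Uxxx deactivated afterwards
            rows.append(("Uxxx", day, 10.0))
        conn.executemany("INSERT INTO backup_log VALUES (?, ?, ?)", rows)
    return conn


def missing_since_last_week(conn, today):
    """Servers with a backup 7 days before `today` but none on the day before."""
    sql = """
        SELECT DISTINCT server FROM backup_log
        WHERE backup_date = date(:today, '-7 days')
          AND server NOT IN (
              SELECT server FROM backup_log
              WHERE backup_date = date(:today, '-1 day'))
        ORDER BY server
    """
    return [row[0] for row in conn.execute(sql, {"today": today})]


def volume_drop_alerts(conn, today, threshold=0.10):
    """Servers whose last 7-day volume fell by more than `threshold`
    compared with the 7 days before that."""
    sql = """
        SELECT server,
               SUM(CASE WHEN backup_date >= date(:today, '-7 days')
                        THEN volume_gb ELSE 0 END) AS current_7d,
               SUM(CASE WHEN backup_date <  date(:today, '-7 days')
                        THEN volume_gb ELSE 0 END) AS previous_7d
        FROM backup_log
        WHERE backup_date >= date(:today, '-14 days') AND backup_date < :today
        GROUP BY server
    """
    alerts = []
    for server, current, previous in conn.execute(sql, {"today": today}):
        # Skip servers without a baseline; they need a separate check.
        if previous > 0 and 1 - current / previous > threshold:
            alerts.append(server)
    return sorted(alerts)


conn = build_demo_db()
missing = missing_since_last_week(conn, "2024-01-15")
if missing:  # corresponds to "count(*) > 0" in the report above
    print("ALARM, not backed up yesterday:", missing)  # ['Uxxx']
print(volume_drop_alerts(conn, "2024-01-15"))          # ['Uxxx', 'srv-b']
```

In this made-up data set the first report finds only Uxxx, while the volume comparison also flags `srv-b`, which missed a single backup. Both checks would keep alarming for a decommissioned server until its history ages out of the window, so some form of exclusion list is probably needed in practice.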