Saturday, September 11, 2010

Monitoring TimeMachine backups with Nagios

An automated backup strategy is the key to avoiding tears when your hard drive inevitably fails.

With Apple machines, this is typically handled at the desktop level with an external drive and TimeMachine. (Incidentally I'm partial to the OtherWorldComputing on-the-go's myself since they don't need a power adapter...)

The problem with TimeMachine and local backups, in an enterprise context, is that it's hard to tell whether people are actually backing things up. What if the TimeMachine drive has failed for some reason? You will keep right on living, without realizing that your users (or perhaps you) aren't successfully backing anything up.

This is exactly the sort of problem that monitoring systems are supposed to solve, and my personal choice for monitoring is still nagios (though I hear good things about OpenNMS). You don't strictly need a monitoring system - you could just have the script send you mail, similar to my previous MacPorts check - but I like nagios because it maintains a history of previous events and has a flexible communication infrastructure.

But how to monitor the local TimeMachine backup status of a laptop with a centralized monitoring server like nagios?

The strategy I chose is to have a script on the local machine run daily (via an /etc/periodic/ entry) that scans for successful backups, fishes out the most recent success timestamp, and uses ssh to record it in a file on my nagios server. Then I wrote a quick nagios plugin that inspects the timestamp in that file and checks whether it is acceptably recent.
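To give a feel for the server-side half, here is a minimal sketch of what such a freshness check could look like. It is not the actual plugin: it assumes the file holds a single Unix epoch timestamp (seconds), and the function names and the 26 and 50 hour thresholds are made up for illustration. The only firm part is the standard nagios plugin convention of exit codes 0, 1, 2 and 3 for OK, WARNING, CRITICAL and UNKNOWN.

```python
#!/usr/bin/env python3
"""Sketch of a nagios freshness check for a TimeMachine timestamp file."""
import sys
import time

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(last_backup, now, warn_hours=26.0, crit_hours=50.0):
    """Return (exit_code, message) for a backup that finished at last_backup.

    Both times are Unix epoch seconds. The default thresholds are
    illustrative assumptions, not values from the original scripts.
    """
    age_hours = (now - last_backup) / 3600.0
    if age_hours < 0:
        return UNKNOWN, "UNKNOWN - timestamp is %.1f hours in the future" % -age_hours
    if age_hours >= crit_hours:
        return CRITICAL, "CRITICAL - last backup %.1f hours ago" % age_hours
    if age_hours >= warn_hours:
        return WARNING, "WARNING - last backup %.1f hours ago" % age_hours
    return OK, "OK - last backup %.1f hours ago" % age_hours


def check_file(path, now=None, warn_hours=26.0, crit_hours=50.0):
    """Read an epoch timestamp from path (assumed format) and evaluate its age."""
    if now is None:
        now = time.time()
    try:
        with open(path) as handle:
            last_backup = float(handle.read().strip())
    except (OSError, ValueError) as exc:
        # A missing or unparseable file is reported as UNKNOWN, not OK.
        return UNKNOWN, "UNKNOWN - cannot read timestamp from %s: %s" % (path, exc)
    return evaluate(last_backup, now, warn_hours, crit_hours)


# Only runs when a status file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_file(sys.argv[1])
    print(message)
    sys.exit(code)
```

The point of treating a missing or garbled file as UNKNOWN is that a laptop which has never reported in should not quietly look healthy. With one status file per machine, a single check command can be reused across hosts by passing a different path.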

If this is interesting to you - grab both of those little scripts from the link above and enjoy.

If you have a totally different strategy for monitoring local backup status, I'd be curious to hear it - please add a comment.

Cheers

2 comments:

Unknown said...

Thanks for the post - used it as the basis for a ruby script which is called via snmpd and monitored by nagios (which also caches the latest result for when everything's switched off, or away from the office)

Means I should be able to continue to use TM on our network

Unknown said...

I've added a more detailed account on how I did this here:

http://blog.etcp.co.uk

Love to hear comments too