So monitoring has been a topic of interest lately: how to do it, and why to do it.

Why Monitor?
Most people immediately say, ‘because you know when stuff is down’. That's a good reason to monitor things. So are pre-emptive warnings. ‘Your disk is 90% full’, ‘there are 2000 processes’, ‘random_service3 is not responding’ are all helpful warnings to most people. The first two are pre-emptive: they fire before anything has actually broken. Knowing the disk is getting close to full is useful because it gives someone time to plan downtime and inform people of a change that will occur. Maybe someone needs to boot a machine into single user mode and resize the disks; maybe data needs to be deleted, or backed up, or moved. Maybe someone needs to log into the machine with 2000 running processes and kill some before the box locks up. The monitors give a system administrator time to deal with issues before they develop into something worse. Running out of disk space is bad, machines locking up is bad, and knowing about downed services is good.
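To make the ‘90% full’ kind of warning concrete, here is a minimal sketch of a threshold check in Python. The function names and the limits are made up for illustration; a real monitoring system does a lot more than this, but the core idea is just comparing a measured value against a limit.

```python
import shutil


def check_threshold(name, value, limit):
    """Return a warning string if value has reached limit, else None."""
    if value >= limit:
        return f"WARNING: {name} is {value} (limit {limit})"
    return None


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(100 * usage.used / usage.total)


# Hypothetical limit: warn once the disk reaches 90% full.
message = check_threshold("disk_percent", disk_percent_used("."), 90)
if message:
    print(message)
```

The same `check_threshold` call works for a process count or anything else that can be reduced to a number and a limit.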

But I think most people only think they need monitoring for this reason. Pre-emptive warnings only help people solve problems as they arise. The new problem that spawns from this is that the same alerts keep triggering at what seem like random times. Which brings me to the real reason anyone should be using monitoring: metrics and trends.

If a particular service is down, typically someone will notice. If mail is down, angry users will call in. Most pre-emptive warnings will eventually manifest as real problems, and those problems can be dealt with by someone. However, the real power of monitoring comes from graphing the monitoring data and looking for trends.

‘Why does my service spike at 8am every morning?’
‘Why does the web server appear to go down around 10am on Wednesday?’
‘Why do we lose connectivity to a remote building every Tuesday at 9pm?’

Issues such as these become apparent as more data is stored and processed. But what good do trends do me, you ask? They help answer questions like the ones above. Maybe your service ‘goes down’ because a cronjob running at 3am spikes the load on the box, causing the monitor's check to fail (essentially a false positive, but also a pointer that maybe that job is badly implemented). Once you can identify trends in your alerts, you can use those trends to identify the real problems behind many of the alerts and eliminate them.
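Spotting a trend like this does not need anything fancy. As a rough sketch, assuming you can get your alert history as a list of timestamps, you can bucket the alerts by weekday and hour and see which slots keep coming up. The function name, the `min_count` cutoff and the sample data below are all invented for the example.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def recurring_slots(alert_times, min_count=3):
    """Count alerts per (weekday, hour) slot and return the slots that repeat."""
    counts = Counter((DAYS[t.weekday()], t.hour) for t in alert_times)
    return [(slot, n) for slot, n in counts.most_common() if n >= min_count]


# Made-up alert history: three Tuesday-evening alerts and one stray.
alerts = [
    datetime(2024, 1, 2, 21, 5),
    datetime(2024, 1, 9, 21, 2),
    datetime(2024, 1, 16, 21, 7),
    datetime(2024, 1, 11, 14, 30),
]

print(recurring_slots(alerts))  # [(('Tuesday', 21), 3)]
```

A slot that shows up week after week is a good hint to go and look at what else happens on that box, or on that network link, at that time.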