Replication monitoring with monit
During my MBA, I always heard "what you measure is what you get." In IT, it seems that what you monitor is what you get. Here's how I set up monitoring for MySQL replication.
I'm terrified of backups. That's not quite true: I'm terrified that I don't have good backups. They're so easy to ignore. I mean, you don't need them 99.9% of the time. On top of that, they're incredibly hard to get right.
Most people can do filesystem backups; they aren't that hard. But how do you back up a 15GB MySQL database? Most people probably just back up the data files and hope. I don't like that solution. I really like to have nice consistent backups. To make sure I can get a good backup without impacting the performance of our production environment, I use replication.
Our main database server replicates to a standby server. Besides standing ready to take over in case of a failure, the standby is where we run our backups. Here's our backup script:
    #! /bin/bash
    DATE=$(date +"%Y%m%d")
    #Dump the facebook DB
    mysql -u root -e "slave stop sql_thread;"
    mysqldump -u root --all-databases -q -e | bzip2 - >/data/backup/backups/facebook-db-backup-${DATE}.dmp.bz2
    mysql -u root -e "slave start;"
It's really simple. First, it stops the sql_thread portion of replication. That means we keep copying changes from production, we just don't apply them to this copy of the database. Once that is done, we use mysqldump to do a full backup. Once the backup is done, we restart the replication slave. Simple, right? So why am I so scared?
I'm nervous that replication won't get restarted. If that happened, we would no longer have a good backup. I'm terrified that I go to restore the database and find out that my data is three months old. That type of thing keeps me up at night.
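One way to shrink that particular risk is to wrap the dump so that the restart is attempted even when the dump fails. The following is only a sketch under stated assumptions, not the script we run: `backup_with_paused_replication` and its `run` argument are hypothetical names, and `run` stands for any callable that executes one shell command and returns true on success.

```ruby
# Hypothetical wrapper around the three steps of the backup script.
# `run` is any callable that executes a shell command and returns true
# on success, for example: ->(cmd) { system(cmd) }
def backup_with_paused_replication(run, dump_command)
  run.call('mysql -u root -e "slave stop sql_thread;"') or raise "could not pause replication"
  begin
    run.call(dump_command) or raise "backup failed"
  ensure
    # Runs whether the dump succeeded or raised, so a failed dump
    # does not leave the SQL thread stopped.
    run.call('mysql -u root -e "slave start;"')
  end
  true
end
```

Even with a wrapper like this, the restart command itself can fail, so it complements monitoring rather than replacing it.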
Luckily, Jeremy installed monit on all of our servers a few months ago. In just a few hours, I cooked up the following scripts to monitor replication. First, here's a ruby script that will touch a file on the filesystem if replication is running okay. I run this every minute from cron.
    #! /usr/bin/env ruby
    require 'mysql'
    conn=Mysql.new('127.0.0.1','root')
    h=conn.query("show slave status").fetch_hash
    unless h.nil?
      if h["Slave_IO_Running"] == "Yes" and h["Slave_SQL_Running"] == "Yes"
        system("touch /var/run/monit/watchdog")
      end
    end
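The decision in that script can be pulled out into a small function that doesn't need a database connection, which makes it easy to try by hand. This is a sketch, not the script above: `replication_healthy?` and `update_watchdog` are hypothetical names, and `status` is assumed to be the hash returned by `fetch_hash` (or nil when the server has no slave status).

```ruby
require 'fileutils'

# Hypothetical helper: true only when both replication threads report "Yes".
# A nil status (no row from "show slave status") counts as unhealthy.
def replication_healthy?(status)
  return false if status.nil?
  status["Slave_IO_Running"] == "Yes" && status["Slave_SQL_Running"] == "Yes"
end

# Touch the watchdog file only when replication looks healthy.
# Returns true if the file was touched, false otherwise.
def update_watchdog(status, path)
  return false unless replication_healthy?(status)
  FileUtils.touch(path)
  true
end
```

If either thread is stopped, the file is simply left alone and its timestamp goes stale.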
With that code in place and running from cron, we can ask monit to watch our slave.
    check file DbSlaveReplication with path /var/run/monit/watchdog
      IF timestamp > 2 minutes then alert

    check process mysql with pidfile /var/run/mysqld/mysqld.pid
      group database
      start program = "/etc/init.d/mysqld start"
      stop program = "/etc/init.d/mysqld stop"
      if failed host 127.0.0.1 port 3306 then restart
      if 5 restarts within 5 cycles then timeout
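To make the timestamp rule concrete, here is roughly the comparison it amounts to, written as a small Ruby sketch. This is not how monit is implemented; `watchdog_stale?` is a hypothetical name, and treating a missing file as stale is an assumption made for the sketch.

```ruby
# Hypothetical helper mirroring the rule "timestamp > 2 minutes":
# true when the file was last modified more than max_age seconds ago.
# Assumption for this sketch: a missing file also counts as stale.
def watchdog_stale?(path, max_age = 120, now = Time.now)
  return true unless File.exist?(path)
  (now - File.mtime(path)) > max_age
end
```

Since cron touches the file every minute while replication is healthy, a two minute threshold tolerates one late or missed run before an alert goes out.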
That's all it takes to make sure replication is running in our environment. It's just a little bit of code, but it helps me sleep better at night.
Posted by Mike Mangino on Wednesday, December 05, 2007