How to set up Failover for your Online Backup Service Software
So it’s time to talk about failover. For our purposes, we’re going to assume that you’re running WholesaleBackup on Amazon EC2 and want to set up failover in a different region (say US-east vs. US-west). We could snapshot our volumes and then convert them, or even sync them using S3, but those approaches have two main drawbacks: they transfer far more data than necessary, which makes them slow, and they are complicated to set up in a way that preserves data integrity. Lastly, to save money we want a cold failover, so that we can turn the failover instance off when it isn’t needed, then turn it on and repoint DNS when it is.
WholesaleBackup ships with the capability to sync failover information (including the system database, user metadata, and system settings), and is set up to start and stop a failover server running in Amazon’s EC2 infrastructure.
However, you may want to extend the failover to include data replication for your customers’ data. That’s what we’re going to cover now.
Some quick definitions:
- Hot failover means you keep your failover machine synced and running, with DNS switching or IP failover set up so that it immediately takes over the load from the primary backup server. We support this for a backup cluster, but for a remote (say bicoastal) setup it doesn’t make much sense for most backup providers. Rather, they want a warm failover (the failover server can be up in a few seconds, with a brief service interruption) or a cold failover (it takes a few minutes to get it going). For this tutorial, cold means that you switch off (stop) the Amazon instance that is the slave server to save money. Hot means the slave server keeps running; this is the default if you’re not using Amazon’s API to shut down your slave server after the data sync.
- The master server is the regular server that’s running, and the slave is the one that you want to turn on should the master fail.
So let’s talk about how to set it all up:
- Install a failover server in Amazon EC2, including the WholesaleBackup server software. Configure it EXACTLY the same as the master server (including the storage sizes and paths), and assign it an Elastic IP. Note the region and the instance ID from the Amazon console, as well as the Elastic IP, as you will configure the master server with those. Before installing the WholesaleBackup app, you should first sync the key user files: /etc/passwd, /etc/shadow, /etc/group, /etc/gshadow.
- On the master server, set up password-less SSH access (using public/private keys) to the slave server.
Then ssh into the slave machine twice to test it: the first time it may prompt you to cache the host key (say yes), but the second time it should not prompt for anything; you should be logged in automatically from the master machine to the slave machine.
- Repeat the previous step in reverse: from the slave server to the master (i.e. replace the user and machine with those of the master server).
- Set up the Amazon EC2 command line API tools and your X.509 keys on the master server. Note where you put your X.509 certificates. For instructions, see the tutorial How to Setup Amazon EC2 Api Command line tools.
- On the master server, edit the /etc/divinsa/failoverconfig file: change the setting to turn failover on, specify whether you have a hot or cold failover setup, and add the Amazon info, including the target region, instance ID, and where you placed your keys. You will also want to designate the user for the ssh session used for replication, and change any other settings as needed.
- On the master server, issue the following command to change permissions to allow the failover script to be run: chmod u+x /usr/local/bin/Dexportforfailover
- On the slave server, change /etc/divinsa/failoverconfig to indicate that this server is a slave server, so that it will accept changes (this is the “3” value at the top of the failover section).
- On the slave server, issue the following command to allow execute permissions on the import script: chmod u+x /usr/local/bin/Dimportforfailover
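The password-less SSH steps above can be sketched with the standard OpenSSH tools. This is only a minimal sketch under assumptions: the user name `backupsync`, the host `slave.example.com`, and the `DRY_RUN` switch are placeholders introduced for illustration, not values or options that come from WholesaleBackup.

```shell
#!/bin/sh
# Sketch: password-less SSH from the master to the slave (run on the master).
# REMOTE_USER and REMOTE_HOST are placeholders; substitute your own values.
REMOTE_USER="${REMOTE_USER:-backupsync}"
REMOTE_HOST="${REMOTE_HOST:-slave.example.com}"
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_passwordless_ssh() {
    # Create a key pair with an empty passphrase unless one already exists.
    if [ ! -f "$KEY_FILE" ]; then
        run ssh-keygen -t rsa -b 4096 -N "" -f "$KEY_FILE"
    fi
    # Append the public key to the remote user's authorized_keys file.
    run ssh-copy-id -i "$KEY_FILE.pub" "$REMOTE_USER@$REMOTE_HOST"
    # Log in twice: the first login may ask to cache the host key,
    # the second should complete without any prompt.
    run ssh "$REMOTE_USER@$REMOTE_HOST" true
    run ssh "$REMOTE_USER@$REMOTE_HOST" true
}
```

Source the file and call `setup_passwordless_ssh` on the master. For the reverse direction, do the same on the slave with `REMOTE_USER` and `REMOTE_HOST` pointing at the master. Setting `DRY_RUN=1` first lets you review the commands before anything is changed.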
You’re now set up to replicate users’ metadata and the database to your failover server. In the event of an outage, you can redirect your users to that server, and they will be able to back up.
For many people that’s sufficient, as they have other ways to replicate their data, like DRBD or a replicated filesystem. But if you also need a simple way to replicate your data, and have sufficient bandwidth, it may be as simple as setting up rsync to sync the data from your master to your slave server. Here’s how:
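With a cold failover, the slave instance has to be started before users can be redirected to it. WholesaleBackup is set up to do this for you, but if you ever need to do it by hand, a rough sketch with the legacy EC2 API command line tools might look like this. It assumes those tools are installed and on your PATH; the region, instance ID, and certificate paths are placeholders, and the `slave_instance` helper is a name made up for this example.

```shell
#!/bin/sh
# Sketch: start or stop the cold failover (slave) instance by hand.
# All values are placeholders; use the region, instance ID and
# certificate locations you noted while setting up the slave.
EC2_REGION="${EC2_REGION:-us-west-1}"
SLAVE_INSTANCE_ID="${SLAVE_INSTANCE_ID:-i-0123456789abcdef0}"
EC2_PRIVATE_KEY="${EC2_PRIVATE_KEY:-$HOME/.ec2/pk.pem}"
EC2_CERT="${EC2_CERT:-$HOME/.ec2/cert.pem}"

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: slave_instance start|stop
slave_instance() {
    case "${1:-}" in
        start|stop) ;;
        *) echo "usage: slave_instance start|stop" >&2; return 2 ;;
    esac
    # Calls ec2-start-instances or ec2-stop-instances from the legacy tools.
    run "ec2-$1-instances" "$SLAVE_INSTANCE_ID" \
        --region "$EC2_REGION" -K "$EC2_PRIVATE_KEY" -C "$EC2_CERT"
}
```

After `slave_instance start` succeeds and the instance is running, repoint DNS at the slave’s Elastic IP. `slave_instance stop` puts it back into its money-saving stopped state.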
- Make certain that both servers have the same capacity storage volumes, and that they are named exactly the same.
- Change the configuration settings for failover to indicate that you want to sync all user data by setting SYNCALLUSERDATA="1"
Of course, you will need to make sure that the operating system updates and other key underlying elements are also synced, and above all, make sure to test that the synchronization is working as you would expect.
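How WholesaleBackup performs this sync internally isn’t covered here. Purely to illustrate the underlying idea, a manual one-way rsync from master to slave over SSH could be sketched as follows. The data directory `/srv/backupdata`, the user, and the host are placeholders, and the sketch relies on the password-less SSH access set up earlier.

```shell
#!/bin/sh
# Sketch: mirror a data directory from the master to the slave with rsync.
# DATA_DIR, REMOTE_USER and REMOTE_HOST are placeholders.
DATA_DIR="${DATA_DIR:-/srv/backupdata}"
REMOTE_USER="${REMOTE_USER:-backupsync}"
REMOTE_HOST="${REMOTE_HOST:-slave.example.com}"

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

sync_user_data() {
    # -a preserves permissions, owners and timestamps; -z compresses in transit.
    # --delete removes files on the slave that no longer exist on the master.
    # The trailing slashes copy the directory contents to the same path.
    run rsync -az --delete -e ssh \
        "$DATA_DIR/" "$REMOTE_USER@$REMOTE_HOST:$DATA_DIR/"
}
```

Because rsync only transfers what has changed, this avoids the "send everything" problem of snapshot-based approaches. The identical path on both ends is why the volumes need to be named exactly the same.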
How to test?
- Note what time you have the sync set up to run, as this will affect what you should expect.
- On a machine that has a current backup client and an account on your master server, make sure the client is shut down, then modify the hosts file (usually at c:\Windows\system32\drivers\etc\hosts ) and add a line that has the IP of your failover server followed by the host name of the master server. This will force the Windows operating system to resolve the master server’s name to the slave server’s IP rather than to the regular IP of the master server.
- Make sure the slave server is running.
- Open the backup client, and do a connection test. If that is successful, then try a test restore of some recently backed up files, but only if you are doing full data synchronization. If you are only synchronizing metadata, then you should try to back up something instead. If these operations work, then everything is OK.
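The hosts entry used in this test pairs the failover server’s IP with the name the backup client normally connects to. As a small sketch of the format, the helper below prints such a line so you can paste it into the hosts file on the test machine. The IP `203.0.113.10` and the name `backup.example.com` are placeholders, and `hosts_entry` is a name made up for this example.

```shell
#!/bin/sh
# Sketch: print the hosts-file line that points the master's name at the slave.
# Both values are placeholders; use your slave's Elastic IP and the
# host name your backup clients are configured to connect to.
SLAVE_IP="${SLAVE_IP:-203.0.113.10}"
MASTER_NAME="${MASTER_NAME:-backup.example.com}"

hosts_entry() {
    # Hosts-file format: <IP address> <whitespace> <host name>
    printf '%s\t%s\n' "$SLAVE_IP" "$MASTER_NAME"
}
```

Remember to remove the line from the hosts file once the test is done, otherwise that machine will keep backing up to the slave.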