It’s 8 pm on a Sunday evening, and I get a panicked call from a customer because he cannot connect to his XenServersTM via the XenCenterTM management tool. However, as near as he could tell, all of the hosted virtual machines were up and running and in a healthy state. He had unsuccessfully tried to point the XenCenter management tool at another member of the XenServer pool but was unsuccessful.
So what happened and how do you fix it?
This situation can happen for several reasons but generally it happens when there are only two servers in the XenServer pool, and the pool master suddenly fails. In essence, what happens is the surviving server (let’s just call it the “slave”) can no longer see its peer, the pool master, so it assumes it has been stranded and goes into emergency mode to protect its own VMs. There are other ways this can happen (an incorrectly configured pool with HA turned on for example), but this is the most common reason that I have personally experienced.
Depending upon the situation, you may not be able to ping the master server because it is actually down, or you may be able to ping the server but it is in an inconsistent, “locked up”, state such that it cannot answer requests to it. If you are able to connect to the console of the master server either directly with a monitor, keyboard, and mouse (the old fashioned way) or through a remote management interface (DRAC, ILO, ILOM, etc) the server may appear to be running, but you may not be able to do anything with it.
At this point you may be thinking, “This is no big deal – just reboot the machine and it will be fine.” If you are lucky that may actually solve the problem, but in many cases it will not. What you might see is that after the master reboots you will be able to connect to the master but you will not see the slave. Or it may be that your master is truly broken and you are not able to simply reboot it due to a system or hardware failure. But, of course, you’ve still got to get your pool online and working again regardless.
During this period of time, if you try to use a tool such as Putty to connect to the slave via its management interface, you may not be able to connect to it either. If you try to ping the slave on the management interface you may not get any replies. But if you connect to the console of the slave (again, either the physical console or via a remote management interface) you will probably see that the machine is running, but if you look at XSconsole it will appear that the management interface is gone because there will be no IP address showing. By now you’ll probably be scratching your head because the strange thing is all the VMs are running.
So at this point your master appears to be down, or at least impaired, you’ve got no management interface on the slave, your pool is broken and you cannot manage the VMs. So what do you do?
Well, if this happens to you and your VMs are still up and running the first thing you should do is take a deep breath, because more than likely it is not as bad as you might think. XenServer is a robust platform and if the infrastructure is built correctly (and I’m going to quote a customer), “you can really slam the things around and they still work”.
After you take a deep breath and let it out slowly, from the console of the slave server, you will need to access the command line and start by typing:
If the server returns an answer of “True” then you’ve confirmed that the server has gone into emergency mode in order to protect itself and the VMs running on it. (If the server returns an answer of “False” then you can stop reading, because the rest of this post isn’t going to help you.)
Assuming you receive the answer of “True” the slave server is in emergency mode because it cannot see a master – either because the master is actually down, or because the management interface(s) is(are) not working. Therefore, the next step is to promote the slave to master to get it out of emergency mode. We do this by typing:
At this point the slave server should take over as the pool master and the management interface should be available again. Now if you type the xe host-is-in-emergency-mode command again you should get an answer of “False”.
Now, open XenCenter again. It will first try to connect to the server that was the master, but after it times out it will then attempt to connect to the new master server. Be patient, because eventually it will connect (it may take several seconds) and you will again see your pool and be able to manage your VM’s. If some of the VMs are down because they were on the server that failed you’ll be able to start them on the remaining server (assuming you have shared backend storage and sufficient processor and memory resources).
Now what about the master if it has totally failed? What do I do after I’ve fixed, say, a hardware problem in order to return it to my pool?
If the following two conditions are true:
- You are using shared storage so that your VMs are not stored on the XenServer local drives, and
- You have built your XenServers with HBAs (fiber or iSCSI) rather than using Open iSCSI, which means the connectivity information to your backend SAN will be stored within the HBA,
…then it may be much simpler and quicker just to reload the XenServer operating system. (If you do not have shared backend storage, which means your VMs are on local storage, DO NOT DO THIS). I can rebuild my XenServers from scratch in about 20 – 30 minutes and have them back in the pool and running.
If either of those two conditions is not true then, depending upon your situation, recovery may be significantly more difficult. It could be as simple as resetting your Open iSCSI settings and connecting back to your SAN (still easy but takes more time to accomplish) or it could be as painful as rebuilding your VMs because you lost your server drives. (OUCH!)
Real world example: I recently had a NIC fail on the motherboard of my master server. Of course since the NIC was on the motherboard it meant the whole motherboard had to be replaced which significantly modified the hardware configuration for that server.
In this case, when I brought that XenServer back online it still had all the information about the old NICs showing in XenCenter, plus it had all the new NICs from the new hardware. Yes I could have used some PIF forget commands to remove the NICs that no longer existed and reconfigure everything but that would have taken me a bit of time to straighten out. Since I had iSCSI HBAs attached to a Datacore SAN (great product, by the way) for shared storage, all I did was reload XenServer on that machine, modify the multipath-enabled.conf file (that is a different blog topic for another day), and rejoin the server to the pool. Because the HBAs already had all the iSCSI information saved in the card, the storage automatically reconnected all the LUNs, the network interfaces took the configuration of the pool, and I was back online and running in less than 30 minutes.
After you repair the machine that failed and get it back online, you may want it to once again be the master server. To do this type:
You will get a list of available servers with their UUID’s. Record the UUID of the server that you want to designate as the new master and then type:
xe pool-designate-new-master host-uuid=[the uuid of the host you want]
After you type this your pool will again disappear from XenCenter, but after about 20 – 30 seconds (be patient) it will reappear with the new server as the master. Your pool should now be healthy, and you should again be able to manage servers as normal.