XenServer Host Is In Emergency Mode

It’s 8 pm on a Sunday evening, and I get a panicked call from a customer because he cannot connect to his XenServers via the XenCenter management tool. However, as near as he can tell, all of the hosted virtual machines are up and running and in a healthy state. He has tried to point XenCenter at another member of the XenServer pool, but without success.

So what happened and how do you fix it?

This situation can happen for several reasons but generally it happens when there are only two servers in the XenServer pool, and the pool master suddenly fails. In essence, what happens is the surviving server (let’s just call it the “slave”) can no longer see its peer, the pool master, so it assumes it has been stranded and goes into emergency mode to protect its own VMs. There are other ways this can happen (an incorrectly configured pool with HA turned on for example), but this is the most common reason that I have personally experienced.

Depending upon the situation, you may not be able to ping the master server because it is actually down, or you may be able to ping the server but it is in an inconsistent, “locked up” state such that it cannot answer requests. If you are able to connect to the console of the master server, either directly with a monitor, keyboard, and mouse (the old-fashioned way) or through a remote management interface (DRAC, iLO, ILOM, etc.), the server may appear to be running, but you may not be able to do anything with it.

At this point you may be thinking, “This is no big deal – just reboot the machine and it will be fine.” If you are lucky that may actually solve the problem, but in many cases it will not. What you might see is that after the master reboots you will be able to connect to the master but you will not see the slave. Or it may be that your master is truly broken and you are not able to simply reboot it due to a system or hardware failure. But, of course, you’ve still got to get your pool online and working again regardless.

During this period, if you try to use a tool such as PuTTY to connect to the slave via its management interface, you may not be able to connect to it either. If you try to ping the slave on the management interface, you may not get any replies. But if you connect to the console of the slave (again, either the physical console or via a remote management interface) you will probably see that the machine is running. If you look at xsconsole, however, it will appear that the management interface is gone, because no IP address will be showing. By now you’ll probably be scratching your head, because the strange thing is that all the VMs are running.

So at this point your master appears to be down, or at least impaired, you’ve got no management interface on the slave, your pool is broken and you cannot manage the VMs. So what do you do?

Well, if this happens to you and your VMs are still up and running the first thing you should do is take a deep breath, because more than likely it is not as bad as you might think. XenServer is a robust platform and if the infrastructure is built correctly (and I’m going to quote a customer), “you can really slam the things around and they still work”.

After you take a deep breath and let it out slowly, from the console of the slave server, you will need to access the command line and start by typing:

xe host-is-in-emergency-mode

If the server returns an answer of “True” then you’ve confirmed that the server has gone into emergency mode in order to protect itself and the VMs running on it. (If the server returns an answer of “False” then you can stop reading, because the rest of this post isn’t going to help you.)

Assuming you receive the answer of “True”, the slave server is in emergency mode because it cannot see a master, either because the master is actually down or because its management interface is not working. Therefore, the next step is to promote the slave to master to get it out of emergency mode. We do this by typing:

xe pool-emergency-transition-to-master

At this point the slave server should take over as the pool master and the management interface should be available again. Now if you type the xe host-is-in-emergency-mode command again you should get an answer of “False”.
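The check and the promotion above can be combined into a small shell sketch to run from the slave’s local console. This is only an illustration: the function name `promote_if_stranded` is made up for this example, and the comparison is done case-insensitively because the exact capitalisation of the answer is an assumption here.

```shell
#!/bin/sh
# Sketch: promote this host to pool master only if it reports emergency mode.
# promote_if_stranded is a hypothetical helper name, not part of XenServer.
promote_if_stranded() {
    # Normalise the answer to lower case so "True" and "true" both match.
    state=$(xe host-is-in-emergency-mode | tr '[:upper:]' '[:lower:]')
    if [ "$state" = "true" ]; then
        echo "Host is in emergency mode; promoting it to pool master."
        xe pool-emergency-transition-to-master
    else
        echo "Host is not in emergency mode; nothing to do."
    fi
}

# Call it by hand once you are sure the old master is really unavailable:
# promote_if_stranded
```

Keeping the call commented out is deliberate: promoting a slave while the old master is still alive and reachable by other pool members is something you want to decide on yourself, not have a script do automatically.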

Now, open XenCenter again. It will first try to connect to the server that was the master, but after it times out it will attempt to connect to the new master server. Be patient, because eventually it will connect (it may take several seconds) and you will again see your pool and be able to manage your VMs. If some of the VMs are down because they were on the server that failed, you’ll be able to start them on the remaining server (assuming you have shared backend storage and sufficient processor and memory resources).

Now what about the master if it has totally failed? What do I do after I’ve fixed, say, a hardware problem in order to return it to my pool?

If the following two conditions are true:

  1. You are using shared storage so that your VMs are not stored on the XenServer local drives, and
  2. You have built your XenServers with HBAs (Fibre Channel or iSCSI) rather than using Open iSCSI, which means the connectivity information to your backend SAN will be stored within the HBA,

…then it may be much simpler and quicker just to reload the XenServer operating system. (If you do not have shared backend storage, which means your VMs are on local storage, DO NOT DO THIS). I can rebuild my XenServers from scratch in about 20 – 30 minutes and have them back in the pool and running.

If either of those two conditions is not true then, depending upon your situation, recovery may be significantly more difficult. It could be as simple as resetting your Open iSCSI settings and connecting back to your SAN (still easy but takes more time to accomplish) or it could be as painful as rebuilding your VMs because you lost your server drives. (OUCH!)

Real world example: I recently had a NIC fail on the motherboard of my master server. Of course since the NIC was on the motherboard it meant the whole motherboard had to be replaced which significantly modified the hardware configuration for that server.

In this case, when I brought that XenServer back online it still had all the information about the old NICs showing in XenCenter, plus it had all the new NICs from the new hardware. Yes, I could have used some xe pif-forget commands to remove the NICs that no longer existed and reconfigured everything, but that would have taken me a bit of time to straighten out. Since I had iSCSI HBAs attached to a DataCore SAN (great product, by the way) for shared storage, all I did was reload XenServer on that machine, modify the multipath-enabled.conf file (that is a different blog topic for another day), and rejoin the server to the pool. Because the HBAs already had all the iSCSI information saved in the card, the storage automatically reconnected all the LUNs, the network interfaces took the configuration of the pool, and I was back online and running in less than 30 minutes.
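For readers who would rather clean up the stale NIC records than reinstall, the PIF forget route mentioned above might look roughly like the sketch below. It is not what I did in this case, the helper names `list_host_pifs` and `forget_pif` are made up for illustration, and you should check each PIF against the real hardware before forgetting anything.

```shell
#!/bin/sh
# Sketch: list a host's PIFs, then forget one that no longer exists in hardware.
# list_host_pifs and forget_pif are hypothetical helper names.

list_host_pifs() {
    # $1 = UUID of the host (from "xe host-list")
    xe pif-list host-uuid="$1" params=uuid,device,MAC
}

forget_pif() {
    # $1 = UUID of the stale PIF, copied from the listing above
    if [ -z "$1" ]; then
        echo "usage: forget_pif <pif-uuid>" >&2
        return 1
    fi
    xe pif-forget uuid="$1"
}
```

Compare the MAC addresses in the listing with the NICs actually installed; only the entries whose MAC no longer exists in the machine are candidates for `forget_pif`.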

After you repair the machine that failed and get it back online, you may want it to once again be the master server. To do this type:

xe host-list

You will get a list of available servers with their UUIDs. Record the UUID of the server that you want to designate as the new master and then type:

xe pool-designate-new-master host-uuid=[the uuid of the host you want]

After you type this your pool will again disappear from XenCenter, but after about 20 – 30 seconds (be patient) it will reappear with the new server as the master. Your pool should now be healthy, and you should again be able to manage servers as normal.
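If you would rather not copy the UUID by hand, the lookup and the designation can be chained together. The following is a minimal sketch, assuming the host’s name label is unique in the pool; `designate_master_by_name` is a made-up helper name and `xenserver01` in the usage line is a placeholder.

```shell
#!/bin/sh
# Sketch: look up a host's UUID by its name label and make it the pool master.
# designate_master_by_name is a hypothetical helper name.
designate_master_by_name() {
    # $1 = name label of the host as shown in XenCenter
    uuid=$(xe host-list name-label="$1" --minimal)
    # --minimal prints matching UUIDs separated by commas; require exactly one.
    case "$uuid" in
        "")  echo "No host named '$1' found." >&2; return 1 ;;
        *,*) echo "More than one host named '$1'; use the UUID directly." >&2; return 1 ;;
    esac
    xe pool-designate-new-master host-uuid="$uuid"
}

# Example (placeholder host name):
# designate_master_by_name xenserver01
```

The function refuses to continue when the name matches no host or more than one, so a typo cannot hand the master role to the wrong server.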

12 replies
  1. Karthik Mani says:

    We had 8 members in a pool. The master crashed and became unresponsive even though the VMs on the host were up and running. The other slave hosts (pool members) remained active and their VMs up and running but they were unresponsive when accessed via the console.

    Restarting all the pool members did not help. They would either hang when accessed via the console or act as if they were not part of a pool.

    When I realized that the master is fried, and before I started panicking, I came across this life saving blog.

    This is how I recovered.

    Rebooted a slave.

    Logged onto local shell

    xe host-is-in-emergency-mode

    (This returned true)

    xe pool-emergency-transition-to-master

    (made this slave the new master)

    Powered up all other slave members

    On new pool master, I issued the following..

    xe pool-recover-slaves

    The new pool master forced all other members to exit emergency mode and they all showed up happily in the pool via XenCenter. I had to connect to the new pool via the new pool master’s IP though.

    I will now deal with the old pool master that is still down (out of space on dom0 I think) but I have 7 other servers up and running thanks to this blog!!!!

  2. Joe says:

    This article is very thorough and precise. It did not necessarily have the solution to my problem, but a few lines in it did jog my memory a bit, leading to success in solving my problem. I am currently learning about Citrix XenServer and I only have one host. The power failed where my host was running so it was disabled upon the next startup and I was unable to start my VMs back up. The solution was to ssh into the host and run the command “xe host-enable”. After running that command, I was able to start my VMs without any problems using XenCenter.

  3. serwer serwery says:

    We stumbled over here from a different web page and thought I
    might as well check things out. I like what I see so i am just following you.

    Look forward to checking out your web page yet again.

  4. Richard Shaw says:

    Testing Xen at the moment with 3 Xen servers and a CentOS server running as an iSCSI target for VM storage.

    The 3 Xen servers are configured as a pool and the storage is within the pool. I can run VMs on any host and all looks good.

    I even have the management interface running on a private network and VPN in with XenCenter via the CentOS machine.

    My problem is, with the master down I’ve only seen XenCenter try to connect to other members of the pool once. It usually just tries and tries to connect to the dead master then fails with ‘The connection was refused’.

    Any ideas?

  5. Rafael Fonseca says:

    Great write-up!

    Just one question, though: if you are re-adding a host to the pool, wouldn’t the pool automatically apply the storage configuration to the newcomer? At least with XS 5.5/5.6, I can just install the hypervisor and basic network config, and the rest is passed to it by the pool.

    Also, I assume this (the emergency mode cutting management connectivity off) becomes a non-issue when you get the third and subsequent servers, right?

    • Steve Parlee says:

      You are correct: when you add a new pool member, assuming the hardware configuration is the same, that new member will pick up its network configuration from the pool. In the situation I’m speaking about in this post I had to change out a bunch of hardware, and when the XenServer came back online it saw all the new NICs plus it still had the old NICs as well.

      As for the issue with the pool being in emergency mode, yes, if you lose your pool master the pool may still go into emergency mode because the pool master has suddenly disappeared. Again, if this does happen you will be able to get out of emergency mode and designate a new master fairly easily.

    • Steve Broad says:

      9 times out of 10 you need to take the pool out of HA mode in XenCenter.

      Then, if your pool master is still on the network, enter the “xe pool-recover-slaves” command in the pool master’s CLI.

      Once you finish the above steps, you may sometimes have to use the “service xapi restart” command to restart XenServer’s API.
