VDR Drama – Host Losing Connectivity to vCenter

This is a problem I have seen now in two different environments, at two different companies.  Both happened to be using VMware Data Recovery for backups.

The problem starts like this. You lose a host from vCenter, and you cannot get it to reconnect.  You do a /sbin/services.sh restart, and still you cannot get connected to vCenter.

You CAN connect to the host locally using the vSphere Client.  Let’s look at the logs now.

This particular problem shows up in the hostd log.  To see it, SSH into the host and type in: tail -f /var/log/hostd.log  and then go into vCenter, right-click the host, and choose Connect.

Keep watching hostd.log during the 5 minutes it takes the connection attempt to time out.  If you see any messages about snapshots, here’s how to check whether you have this issue.
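If the tail output scrolls by too quickly to read, a small filter can pull out the snapshot-related lines after the fact. This is only a sketch: the function name snapshot_lines and the default of 20 lines are my own choices, and it assumes grep and tail are available in the host's shell.

```shell
#!/bin/sh
# Print the most recent hostd log lines that mention snapshots.
# $1: log file to search (defaults to the path used in this post)
# $2: how many matching lines to show (default 20, an arbitrary choice)
snapshot_lines() {
    log="${1:-/var/log/hostd.log}"
    count="${2:-20}"
    # Case-insensitive match, then keep only the newest matches.
    grep -i 'snapshot' "$log" | tail -n "$count"
}
```

Running snapshot_lines with no arguments right after the connection attempt times out shows the last 20 matches from /var/log/hostd.log.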

In your SSH session on the affected host, type in the following:

find /vmfs/volumes/*/* -name '*delta*'

You’ll see a list of the snapshot delta files for every VM on the datastores this host can see.  If you see a VM with a couple hundred snapshots, this is why your host won’t connect to vCenter.  vCenter has a database limitation, and when a VM has more snapshots than vCenter can catalog in the database, the host cannot be managed by vCenter.  I haven’t figured out the exact limit for vCenter.  A VM can have 496, according to this post by William Lam, but I think vCenter breaks before you get to that point.  I had 235 on this suspect one.
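
A long flat list of delta files is hard to eyeball, so here is a rough sketch that counts them per VM directory and puts the worst offender at the top. The function name count_deltas is made up for this example, and it assumes the usual datastore/VM-folder layout under /vmfs/volumes and that sed, sort, and uniq exist in the host's shell.

```shell
#!/bin/sh
# Count files matching *delta* in each VM directory, highest count first.
# $1: root to search (defaults to /vmfs/volumes, as in the find command above)
count_deltas() {
    root="${1:-/vmfs/volumes}"
    # List matching files, strip the file name to leave the directory,
    # then count how many times each directory appears.
    find "$root"/*/* -type f -name '*delta*' 2>/dev/null |
        sed 's|/[^/]*$||' |
        sort | uniq -c | sort -rn
}
```

Run count_deltas on the host and look at the first few lines. A datastore may show up twice if it is reachable under more than one path, which is also true of the plain find command.
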
To fix this, just connect locally to the host with vSphere Client and Consolidate your snapshots.

Once you’ve consolidated, your directory should look like the following.

Now, you can connect back to vCenter with no problem and no downtime!

Since this is a development environment, we didn’t pay a lot of attention to VDR, and just assumed it was working.  This particular VM happened to be out of hard drive space, so it could not be quiesced, and VDR just kept trying.  The bottom line is, pay attention to VDR errors!!!  After this, we’ll be checking it at least every few days.

This entry was posted in Authors by Brandon Riley.

About Brandon Riley

I am a Senior Distributed Systems Engineer working in the financial services sector for the past 15 years. I help design and implement open systems infrastructure. Virtualization with VMware and EMC VMAX is a huge part of that infrastructure. All views expressed in my blog posts are mine and mine alone. Opinions do not represent my employer or affiliates of my employer.
  • http://twitter.com/mattvogt Matt Vogt

    Brandon,
    Thanks for posting this. Not the first time I’ve heard of this (few people on twitter), but the first post I’ve seen. What’s helped me stay on top of this (specifically old snapshots) is Alan Renouf’s vCheck http://www.virtu-al.net/featured-scripts/vcheck/