Performance Troubleshooting VMware vSphere – CPU

pentiumee_processor_back

Introduction

Processors have come a long way in a very short time, and over the past few years we have seen the industry embrace the multi-core x86 architectures (Intel and AMD) which is allowing us to consolidate with even greater efficiencies than previous processor architectures.  Ensuring available compute cycles to virtual machine workloads is critical, and should be monitored closely as you scale out your infrastructure.

What to look for
  • Check for physical cpu utilization that is consistently above 80-90%.  Getting high consolidation rates is a wonderful thing, but don’t over tax the physical server.  Maybe it’s time to purchase another host for your DRS cluster and let the software balance your workloads better.
  • Watch pCPU0 on non ESXi hosts.  If pCPU0 is consistently saturated, this will negatively impact performance of the overall system.  If you are using third party agents, ensure they are functioning properly.  A couple of years ago we had issues with HP System Insight management agents (Pegasus process) which was creating a heavy load on our COS.  All of the virtual machines looked fine from a performance perspective, but once we dug a little bit deeper, we discovered this was our root cause.
  • Watch for high CPU ready times, this indicates that the processor is waiting on other I/O components on the host before it can perform its computations (Memory/Network/Storage).  This can help point you towards another possible bottleneck in your infrastructure outside of CPU.
  • Watch for virtual machines that are consistently at 80-100% utilization.  This is not a typical pattern of a conventional server.  Most likely if you login to the guest you will find a runaway process that is consuming all of the cpu cycles.  I actually found an offshore contractor running Rosetta@home (a cancer research screen saver) inside one of our virtual machines!  If something doesn’t look right, it’s worth checking it out.
  • Watch for virtual machines where the Kernel or HAL is not set to use more that one CPU (SMP) and the vm is allocated multiple processors via Virtual Center.  I was approached by a Linux administrator that told me he wasn’t seeing any performance improvements after he added a second processor.  After I poked around a little bit I discovered he was running a uniprocessor kernel and hadn’t recompiled his operating system for SMP.  If the operating system doesn’t have the ability to recognize more than one processor, you won’t be seeing any performance gains by throwing more vcpu’s at a larger workload.
Monitoring with Virtual Center

Virtual Center is a great place to start at for CPU performance monitoring both at a physical level and a virtual machine level.  Before getting into too much detail I wanted to explain Virtual Center statistics logging.  There are various levels of logging that can be set for the VC database.  Beware! You can easily over run your database and fill up your exiting disk space by setting all of these to the maximum setting.  Think of this as a debug level, the higher you set it the more information will be captured to the database for analysis (more disk space consumed).  If you need to get to some of the more detailed performance statistics, VC performance counters and their corresponding levels can be found here.  To change these settings click, Administration –> vCenter Server Settings –> Statistics.

image

Let’s take a look at a physical ESX host performance metrics through Virtual Center.  vSphere now includes a nice graphical summary in the performance tab of the physical host.  This gives you a quick dashboard type view of the overall health of the system over a 24 hour period.  Here is the CPU sample:

image Selecting the advance tab gives you a much more granular way of viewing performance data.  At first glance this might look like overkill, but with a little bit of fine tuning, you can make it report on some great historical information.  Here is a snapshot of physical CPU utilization across all processors:

image

The virtual center performance statistics by default display the past hour of statistics, and show a more detailed analysis of what’s currently happening on your host.  Select the option “Chart Options” to change values such as time/date range and which counters you would like to display.

Virtual Center Alarms are an excellent tool that can sometimes be overlooked.  While this is more of a proactive tool than a reactive or troubleshooting tool, I thought it was worth mentioning.  Setup CPU alerts so you will be notified via e-mail if a problem starts to manifest itself.  Here is an alarm configured to trigger if physical host CPU utilization is at 75% for 5 minutes or greater.

image

Monitoring with ESXTOP

Esxtop is another excellent way to monitor performance metrics on an ESX host.  Similar to the Unix/Linux “Top” command, this is designed to give an administrator a snapshot of how the system is performing.  SSH to one of your ESX servers and execute the command “esxtop”.  The default screen that you should see is the CPU screen, if you ever need to get back to this screen in the future, just hit the “c” key on your keyboard.  Esxtop gives you great real-time information and can even be set to log data over a longer time period, try “esxtop –a –b > performance.csv”.  Check your PCPU and CCPU (Physical/Console) here.  Examine what your virtual machines are doing, if you want to just display the virtual machine worlds hit the “V” key.

imageA detailed list of ESXTOP counters can be found here:

http://communities.vmware.com/docs/DOC-5240

http://communities.vmware.com/docs/DOC-9279

Monitor inside the Virtual Machine

A great feature VMware introduced for Windows virtual machines was integrating VMware performance counters right into the Performance Monitor or “perfmon” tool.  If your running vSphere 4 update 1 make sure you read this post first as there is a bug with the vmtools that will prevent them from showing up.  Check your % Processor time which is the current load of the virtual processor.

image

Monitoring with PowerCLI

Another great place to go to for finding potential cpu problems and bottlenecks is PowerCLI.  I have been using PowerGUI from Quest, accompanied by a powerpack from Alan Renouf.  If your not a command line guru don’t let this discourage you.  PowerGUI is a windows application that allows you to run pre-defined PowerCLI commands against your Virtual Center server or your physical ESX hosts.  Want to find virtual machines with CPU ready time?  How about virtual machines that have CPU reservations, shares or limits configured?  You can pull all of this information using Alan’s powerpack.

image

Conclusion

If your using VMware vSphere, there are many different ways to monitor for CPU problems.  The Virtual Center database is the first place you should start.  Check your physical host CPU contention, then work your way down the stack to the virtual machine(s) that might be indicating a problem.  Take a look at esxtop, check physical CPU, console cpu then the vmworlds that are running on the ESX host.

Look for the outliers in your environment.  If something doesn’t look right, that’s probably the case.  Scratch away at the surface and see if something pops up.  Use all possible tools available to you like PowerCLI.  Approaching problems from a different perspective will sometimes bring light to a situation you weren’t aware of.  If all else fails, engage VMware support and open a service request.  Support contracts exist for a reason and I have opened many SR’s that were new technical problems that have never been discovered by VMware support.

Post to Twitter Post to Delicious Post to Digg Post to StumbleUpon

This entry was posted in Scott Sauer by Scott Sauer. Bookmark the permalink.
Scott Sauer

About Scott Sauer

I’m a Senior Systems Engineer for Tintri in Cincinnati Ohio. I am married to a wonderful woman (Alison) and have the privilege of raising two boys with her. I have over 16 years of experience in the Information Technology field with a background in virtualization, systems architecture, disaster recovery/ business continuity, storage area networking and data center operations.