Posts tagged VMware

Performance Troubleshooting VMware vSphere – The Tetralogy

steth

Yes, a bold topic, I know, but I wanted to tackle this subject because it’s such an important aspect that everyone typically deals with at some point. I also find it personally useful to document some of my thoughts so I can solidify my own understanding of these processes and tools. I will admit, this was a challenge for me to write up. There is so much material and information that I had to really focus on keeping it simple and to the point. Performance problems can span such a wide array of possibilities that there is rarely one easy answer. Hopefully by highlighting some of the tools that are available, and offering some of my personal thoughts and experiences, I might be able to help when problems arise in your infrastructure.

There is so much useful information floating around in PDFs, blogs, websites, and PowerPoint decks that one could easily get consumed by this topic. Since this is such a broad topic, I wanted to try and set the stage. The focus of this series of blog posts is to highlight some key components to examine, and then provide tools that will give you insight into your own environment and/or situation. This page will be the launch point for the various categories. Each blog post will cover a different category relating to the possible points of I/O in a VMware ESX environment.

Performance Troubleshooting VMware vSphere – CPU

Performance Troubleshooting VMware vSphere – Memory

Performance Troubleshooting VMware vSphere – Storage

Performance Troubleshooting VMware vSphere – Network

There is one last I/O component that I will not be covering, and that is the human factor.  These posts will assume that your installation or upgrade is of sound mind and body.  If there are underlying installation issues or post upgrade issues, I suggest engaging VMware support before examining conventional performance problems.

Acknowledgments/References:

VMware vSphere 4 Performance Troubleshooting Guide – Hal Rosenberg

Performance Monitoring and Analysis – Scott Drummonds

VMworld 2009 TA2963 ESXtop for Advanced Users – Krishna Raj Raja

http://www.vmware.com/support/

http://www.yellow-bricks.com/esxtop/ – Duncan Epping


VMware Data Recovery (vDR) Overview

image

Overview

I like to try and save my employer money when possible.  I am of the opinion I would be doing them a disservice if I didn’t examine and evaluate a product that we paid for.  Our company decided to take the plunge and upgrade all of our licensing to vSphere Enterprise Plus.  There is a new backup/data protection product that was introduced with this recent release.  Here is the technical definition of VMware Data Recovery from the administration guide:

VMware® Data Recovery creates backups of virtual machines without interrupting their use or the data and services they provide. Data Recovery manages existing backups, removing backups as they become older. It also supports deduplication to remove redundant data.

Data Recovery is built on the VMware vStorage API for Data Protection. It is integrated with VMware vCenter Server, allowing you to centralize the scheduling of backup jobs. Integration with vCenter Server also enables virtual machines to be backed up, even when they are moved using VMware VMotion™ or VMware Distributed Resource Scheduler (DRS).

Sounds pretty good, right? You get a backup application (with de-dup!), built specifically for VMware, that could possibly displace your primary method of backups? I did a little digging in the community and was disappointed to learn that vDR is not exactly an enterprise product. A lot of the feedback from other VMware engineers was “it’s a 1.0 product and is designed for small installations”. The maximum supported virtual machine backup configuration is 100 virtual machines.

I decided to check it out for myself to see if it was a fit for our environment and might alleviate some of our backup problems. Our primary site is rather large, but we are now implementing vSphere at our smaller satellite locations, and this might be a fit for a smaller office configuration.

Installation

The installation was quite easy. VMware has provided another great virtual appliance that can be downloaded from their website. After you import the virtual appliance via Virtual Center and assign the appliance a static IP address, you then need to install the VC plug-in so you can manage your newly installed appliance.

image

After I ran through the installation and configuration I was disappointed to discover that I couldn’t get vDR to launch. I kept getting prompted for authentication credentials, which was odd. I thought maybe I had set something up incorrectly, so I went back and reviewed the administration guide. Upon closer examination (RTFM) I discovered that vDR doesn’t support Virtual Center running in linked mode. To my dismay, we are running Virtual Center in linked mode in anticipation of a Site Recovery Manager implementation. Our remote sites are managed by our primary site Virtual Center to save costs. I discovered a workaround: pointing the VC client directly at the ESX server that hosts the vDR appliance. This only gave me access to back up other virtual machines hosted on the same ESX host, but it let me continue my testing. I hope this is something that future versions of the product will address and fix.

The Console

Once you launch the vDR console, you are immediately prompted by a configuration wizard to begin setting up your environment. Here are the steps in the order they are presented:

  1. Select your Virtual Machines to back up.
  2. Select your destination (CIFS share, attached vmdk, or RDM).
  3. Select your backup window.
  4. Select your retention policy.

All of these are straightforward and don’t require much discussion. The only step I found a little confusing was the retention policy. Personally I would have preferred something a little more technical than “Few/More/Many”.

image

The retention policy radio buttons are pre-defined settings that change the policy details below them. Change the buttons and you can see the variables change and what each setting will mean in terms of your destination data storage. Use caution here, as each vDR appliance can only support data stores of up to 1TB in size, with a maximum of two stores.
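To get a feel for how quickly those limits bite, here is a rough sizing sketch in Python. It only uses the figures quoted in this post (100 virtual machines, two data stores, 1TB per store). The function name is made up for illustration, and treating the limits as "per appliance, add appliances to scale" is my assumption, not something taken from the administration guide.

```python
import math

# Limits quoted in this post; applying all of them per appliance is an assumption.
MAX_VMS_PER_APPLIANCE = 100
MAX_STORES_PER_APPLIANCE = 2
MAX_STORE_SIZE_TB = 1.0


def appliances_needed(vm_count, backup_tb):
    """Hypothetical helper: estimate how many vDR appliances a site would need.

    backup_tb is your own estimate of the space the retained backups
    will consume on the destination stores.
    """
    capacity_per_appliance = MAX_STORES_PER_APPLIANCE * MAX_STORE_SIZE_TB
    by_vm_count = math.ceil(vm_count / MAX_VMS_PER_APPLIANCE)
    by_capacity = math.ceil(backup_tb / capacity_per_appliance)
    # Whichever limit is hit first drives the count; never report fewer than one.
    return max(by_vm_count, by_capacity, 1)


small_office = appliances_needed(vm_count=40, backup_tb=1.5)
growing_site = appliances_needed(vm_count=60, backup_tb=3.5)
```

A small office with 40 virtual machines and 1.5TB of retained backups fits on one appliance, while a site with only 60 virtual machines but 3.5TB of retained data already needs a second one. Storage, not VM count, is usually the limit you hit first.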

The Backup

The underlying backup technology behind vDR is the new vStorage API (not VCB), which takes advantage of a new feature called changed block tracking. After the first full backup is performed, changed block tracking identifies which blocks of the virtual disk have been modified, and only those blocks are backed up on subsequent runs. This means less backup traffic going across your network.
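If the block tracking idea is new to you, the toy model below may help. It is a minimal Python sketch of the concept only: the class and method names are invented for illustration, and it does not use or imitate the real vStorage API.

```python
class TrackedDisk:
    """Toy virtual disk that remembers which blocks were written (illustrative only)."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.changed = set()  # indexes of blocks written since the last backup

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)

    def full_backup(self):
        # The first backup copies every block and resets the tracking.
        self.changed.clear()
        return dict(enumerate(self.blocks))

    def incremental_backup(self):
        # Later backups copy only the blocks flagged as changed.
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta


disk = TrackedDisk(block_count=8)
disk.write(0, b"boot")
disk.write(3, b"data")
full = disk.full_backup()            # copies all 8 blocks
disk.write(3, b"data-v2")
delta = disk.incremental_backup()    # copies only block 3
```

The point of the sketch is that the disk itself records what changed at write time, so the backup job never has to scan or compare the whole disk to find the differences.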

I selected a CIFS/Windows share at our disaster recovery site to perform the backup testing. The test share was a ~600GB, 5+1 (10K) set of locally attached SCSI storage on an HP DL380. I selected a couple of Windows virtual machines to test with and kicked off the backup jobs. Below is a screenshot of the reporting window for vDR (sorry for all the censorship).

image

The jobs seemed to run pretty slowly in general, but completed successfully without errors (the error listed above is because of my linked Virtual Center configuration). In my opinion the reporting interface is lacking some details. I would have liked to have seen what throughput I was getting during the backups. The only way I could see the throughput was by monitoring the Windows host that was housing the data store. I would have liked a more detailed task status, so I could tell what was going on throughout the backup operation. The data de-duplication ratio would have been another great detail to see. This could help determine the total backup time and estimated completion time of each virtual machine, which is another variable I found to be missing.

The Restore

There are two approaches to restoring data using vDR. The first method is a full system restore, which recovers the entire virtual machine, system state, and all corresponding data. When performing a full system restore you can restore the data to an alternate ESX host or data store, and decide whether the network interface should be connected. I even found that you can change the virtual disk node and select an alternate SCSI path for the recovered disk.

image

The second option is a file level restore (FLR), which most people would tend to use on a more regular basis. Unfortunately the vDR console can’t recover individual files without some additional configuration. You need to install the “VMwareRestoreClient.exe” executable on a virtual machine, which will then give you the ability to browse your data store contents and select individual files to recover. I anticipate that we will see the FLR components get rolled into the vDR console in a future release.

Conclusion

VMware Data Recovery lacks a lot of critical pieces that an enterprise backup application needs to provide. The product is a great start for smaller VMware implementations, but even there I could see it quickly being outgrown. Here are the areas I would love to see improved in future releases of the product:

  • Need support for linked Virtual Centers. Personally I could use this product at some of our smaller locations but can’t leverage vDR since we are running in linked mode.
  • Need to support a larger number of virtual machines. 100 virtual machines is not enough; the product needs to scale to support a larger VMware implementation (not necessarily Enterprise).
  • Need support for larger data stores. 1TB is not a lot of space when you are going to be backing up multiple virtual machines and retaining their data for longer periods of time.
  • Need support for more data stores per vDR appliance. Again this goes back to scale; storage growth is exponential in our current environment.
  • Support for a global vDR manager.  I would love to see VMware develop a central master or parent vDR console that would allow you to manage your children appliances, and the data stores that they manage.
  • Single console for both full system restores and file level restores.

VMware Data Recovery comes with all versions of VMware vSphere except for vSphere Standard. This is a great entry level backup solution with de-duplication included. I am excited to see it develop into a more mature product that can scale to bigger environments. I also feel that including vDR in the Standard version of vSphere would only help the SMB market embrace virtualization at a higher adoption rate.


VMware vSphere Capacity IQ Overview – I’m Impressed!

ciq-icon

With the launch of VMware vSphere came some new products that I hadn’t really paid much attention to (busy upgrading I guess). One of the newer products is a Virtual Center reporting tool called Capacity IQ. This product gives an administrator the ability to analyze, forecast and plan for future growth across an ESX environment. I have had a lot of experience with monitoring/reporting tools in the past (I won’t bore you with the details), so I was quite skeptical of a 1.0 reporting tool for Virtual Center. I must admit I was blown away by the immediately relevant reports the product was able to produce.

After pulling down the trial install and obtaining the demo key, I loaded it up for a spin. I am not going to document the installation steps needed, as Eric Gray has done this for us already. It is by far the easiest reporting application I have ever installed. If you’re interested in taking it for a trial run, download the virtual appliance from VMware’s website here (OVF format). Once you import the virtual appliance and give it a static IP address, it will need to collect data about your environment for a while.

There are three basic views that CIQ gives you once you install the plug-in: dashboard, views and reports.

Dashboard

The dashboard tab is designed to give you a quick overview of the item you have selected. Capacity IQ uses the same approach as Virtual Center does: whatever object you have selected will be reported and focused on. Here is a view of one of our clusters; notice January 11th on the Trend and Forecast graph on top.

Dashboard

One of our clusters was out of resources, so I added two more physical hosts to the cluster. You can see CIQ picks up the new physical host resources for the cluster and reflects this by increasing the number of virtual machines it believes the cluster can accommodate. Want to see something even more interesting? Check out the pink graph on the 17th. Capacity IQ is already using a prebuilt formula to project what it thinks we will have (or won’t have) a week out. Pretty impressive.

Views

The views tab is designed to give you a more detailed look at some of the specific data points. Here is a screenshot of the various reports you can execute:

Views

So here is where you can get some great visual reports to present to either upper management, or a potential customer.  This gives you a nice interface that you can customize with data points that you can tweak.  Check out the first report on this cluster:

image

This gives you a graphical historical view of your cluster, showing how many virtual machines you have added over time. Notice the horizontal sliding bar at the bottom of the chart. This allows you to adjust the time/date window. The lighter shaded line to the right is the forecast of how the cluster might continue to grow. The views tab is a great place to run some ad-hoc reports: it gives you the ability to select the type of report, and even allows you to export the data.

Reports

The reports tab contains the “pre-canned” reports that can be executed by the administrator. The one thing I was disappointed not to see here was the ability to schedule these reports to run at a particular interval (weekly/monthly). This is something that I assume will be introduced in future releases of the product.

Reports

After the report is executed and compiled, you are then provided with a .pdf or .csv version of your dataset to download and review.  The first report totaled 17 pages and provided some great technical information.  Here is the table of contents:

image

Conclusion

I am very impressed with Capacity IQ. There are no agents you need to install across the virtual machines you wish to report against. The installation was very straightforward; I think I had it up and running in about 15 minutes. Once the virtual appliance was in place, all it needed was a little bit of time to start crunching some data. The reports are well written and very relevant to what an administrator would wish to see. If you’re looking for a nice reporting tool to help you forecast, give this one a test to see if it fits your needs.


vSphere 4 Update 1 with Update Manager Shenanigans

New Year Lights 2010

Happy New Year! I hope everyone enjoyed the holidays and got to spend some time with friends and family. If you’re reading this I suggest you pay tribute to the quality of Virtual Insanity, and give the gift of voting. Eric Siebert has released a “best of 2009 blog contest”. If Virtual Insanity has helped you out in some way in the past I suggest casting a vote for this great virtualization blog space! Ok, on to the real reason for this post…

I ran into an oddity today while bringing a new host online in our vSphere environment, and thought it best to publish my findings. Hopefully this might save someone a support call. With vSphere 4 update 1 came a couple of technical issues, which are detailed here and here. We don’t use ESXi, so only the first one was a major issue for us. We are an HP shop, so the issue around the HP agents and update 1 was a major concern (basically it would render the host unbootable). Luckily VMware support is proactive about announcing issues like this to the community and most people were aware of the problem right away.

The problem I hit today was strange, and at first I thought it was just me being off work for a week. I went to apply our update 1 baseline to a new host I was bringing up, rescanned, and then got this:

compliant1

What the? I know this isn’t compliant; our base build is still at 4.0. Check out the build number, that’s proof. I have used the update 1 baseline for 50+ hosts so I know it’s not that. So maybe Update Manager is still on holiday as well; I restart the service and life is good? Nope. Same thing.

To make a long story longer, I poke around in the repository and check the update 1 patch and see it’s valid, yep 11/19/09 that’s the right release date.  Why is this thing not working?

update1-first

I kept poking and prodding thinking maybe they released an update to the update?  Sure enough it slipped by me when I wasn’t looking, or it went to my spam mail.  Check the date 12/9/09.

update1-second

I created a new test baseline, dropped the 12/9/09 update 1 into it, and applied it to my new host. Lo and behold:

compliant2

That’s much better. Strange that the older update 1 patch didn’t detect anything and showed compliant. As an end user I would have liked to have seen some type of error message, or a reference to the newer released update 1. I ran the new update (still stopped the HP agents just in case), and now things look good again (build number):

looksbetter

Conclusion

Go vote for this site, and make sure you update your Update Manager update 1 baseline. That’s a lot of updates. See you online!

Scott


A handy new addition to the Command Line Tool for View 4

First things first

Thanks to Scott Sauer (@ssauer) and John Blessing (@vTrooper) for holding down the fort here at Virtual Insanity while I’ve been finishing up some unfinished projects and preparing for the VCDX Design Exam (which I take later this month).  One of Scott’s posts actually won a vSphere blog contest.  Nice work Scott!  These two guys are becoming pretty good friends of mine here in the Cincinnati area, so hopefully I can convince them to keep the content flowin’.

An itch I couldn’t scratch

I’ve mentioned here on this blog, at least once or twice, that I “eat the dog food” and actually run my primary XP desktop as a VMware View image. Since the conversion almost a year ago, everything has been running pretty well with only a few minor bumps along the way. And with the recent addition of PCoIP, I can’t imagine ever going back.

But there was one little recurring problem I was having for which I couldn’t seem to find an answer. It wasn’t a show stopper of an issue, but it was just an “itch I couldn’t scratch,” if you know what I mean. And the problem went something like this …

  1. Inside my desktop VM I have a Cisco VPN client, necessary for a secure connection back to corporate HQ in Palo Alto, CA. 
  2. When connecting to my desktop with the VPN client inside the VM inactive, I had no issue.
  3. However, if I disconnected from my desktop while the VPN session was active, then I couldn’t reconnect to my desktop via VMware View.

The reason?  The broker was sending me the new IP address of the Cisco VPN Adapter, which is an IP address on the VPN, and an IP address my local computer didn’t know about. 

Now, if I were to log off instead of disconnect from my desktop, this would terminate the VPN session and therefore wouldn’t be a problem.  But who wants to log off every time?  More often than not, I have things open on my desktop (e.g. half written emails, documents, browsers with many many open tabs, etc.) that I don’t want to bother saving and closing every time I step away from the computer.  And really the bigger issue is with unintentional disconnects that result from local power/network/OS issues.

I tried all sorts of things to fix this. Among other things, I tried …

  1. Reordering the NICs, hoping the broker was just grabbing the first NIC. 
  2. Poking around the broker and agent install files, hoping to find a way to force the IP address. 
  3. I even tried uninstalling and reinstalling the View agent and the Cisco client, hoping the order of installation might do the trick (admittedly, this was a random shot in the dark)

But nothing seemed to work.  So until recently, to reconnect I would have to connect directly to my desktop via RDP, or connect to the console via the VMware Infrastructure Client, then disconnect the Cisco VPN and then reconnect via the View client. 

See what I mean?  Not a show stopper, but man what a pain in the butt! 

The solution

Well, I found a way around this with a handy new addition to the Command Line Tool in View 4. Check out the section titled “Override IP Address” on page 12 of the Command Line Tool for View Manager document. On the broker, from a DOS prompt in the c:\Program Files\VMware\VMware View\Server\bin directory, execute the following …

vdmadmin.exe -A -d <desktop name> -m <machine name> -override -i hostname

The “desktop name” is the name of the VM in the broker.  The “machine name” is the name of the VM in vCenter.  It’s likely they’ll be the same, but they don’t have to be and in fact, in my case they weren’t the same.  The “hostname” can be either a FQDN or an IP address.  Oh, and I can tell you that all parameters must be present or the command won’t execute. 
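If you have more than one desktop to fix, it can help to build the command line in a script so that no parameter gets left off. The Python sketch below is just that, a sketch: the function name and the example values are made up, and the only flags it uses are the ones shown in the command above. It builds the argument list and refuses to continue if any of the three values is empty.

```python
def build_override_command(desktop_name, machine_name, hostname):
    """Hypothetical helper: assemble the vdmadmin override command shown above.

    desktop_name: name of the VM in the broker
    machine_name: name of the VM in vCenter
    hostname:     FQDN or IP address the broker should hand out
    """
    values = {
        "desktop name": desktop_name,
        "machine name": machine_name,
        "hostname": hostname,
    }
    for label, value in values.items():
        # All parameters must be present or the command won't execute.
        if not value or not value.strip():
            raise ValueError(f"missing {label}")
    return [
        "vdmadmin.exe", "-A",
        "-d", desktop_name,
        "-m", machine_name,
        "-override",
        "-i", hostname,
    ]


# Example values are placeholders, not names from my environment.
command = build_override_command("xp-desktop", "XP-VM-01", "10.0.0.25")
print(" ".join(command))
```

The returned list can be pasted at the DOS prompt after joining it with spaces, or handed to whatever you use to launch processes on the broker. Note the plain hyphens: if the dashes get converted to typographic dashes by a web page or word processor, the tool will not recognize them as flags.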

But that was all there was to it. Now I can disconnect and reconnect to my desktop, regardless of the state of my VPN client.


More Bang for your Buck with PVSCSI (Part 2)

Part 2 Doing the work

As you might have noticed, this blog post is a continuation of my first post about PVSCSI. You can access Part 1 here.

Hopefully now you have a better understanding of what the Paravirtual SCSI driver is all about, and we can prove there are tangible reasons to move in this direction.  Let’s get on with the important part, the implementation phase.

SCSI2

(I need to finish off this blog post, I am running out of pictures of SCSI cables)

There are some caveats I need to start out with.  In case you missed it, PVSCSI drivers on virtual machines aren’t supported on operating system disks unless you are running vSphere 4 update 1.  You can use the driver on a secondary data disk if you so desire, but for this post I am going to assume you are running vSphere 4 update 1 (Virtual Center and ESX Hosts) and want to know how to get the driver working on all disks.

In most cases, it’s easier to build new. You know you have a clean install, the drivers are updated, and the configuration is solid. I would suggest updating your templates to include the new Paravirtual SCSI driver. Your existing virtual machines run fine with their existing configurations, and depending on your environment, it might be a lot of work to go back and target all of your virtual machines. For an upgrade path, my personal opinion would be to target your heavy I/O virtual machines. Upgrade the VMs that will make a difference, and you will see some immediate benefits. Reducing the I/O load on the disk subsystem will also benefit the other virtual machines that share those same physical disk spindles.

Clean install

This section will walk you through the process of installing the driver with a Microsoft Windows 2003/2008 operating system. Currently these two operating systems are the only ones supported for PVSCSI boot disks. Hopefully we will see some added support for the various Linux operating systems down the road.

Walk through the “New virtual machine Wizard” as you normally would.  On step 9, ensure you select the “VMware Paravirtual” option as seen below.

para_wiz

Before powering your new VM up, you need to locate the virtual floppy image file that has the driver for your desired guest operating system. This is not on the VMware.com website under downloads; it already exists on your ESX host. Browse to the following location on your ESX host: [Datastores]\vmimages\floppies. I would wait to connect the floppy disk image until after you boot off the Windows CD-ROM, so the VM doesn’t try to boot off the floppy drive.

pvscsi-flop

When you power up your new virtual machine, select the F6 option to tell the operating system you need to use a third party SCSI driver:

windowsf6

Now connect your floppy disk image to your virtual machine under the “edit settings” option. You should now be able to point the operating system to the driver as seen below:

pvscsi_select

Continue on with your normal installation, and you are complete.  Your new virtual machine is now utilizing the Paravirtual SCSI drivers.  I suggest now converting this image you created to a template for future deployments with this configuration.

Upgrading an Existing Virtual Machine

To upgrade an existing virtual machine, the process is pretty straightforward. Assuming you have already upgraded to the latest virtual hardware (Version 7), make sure VMware Tools has been upgraded to the Update 1 version. Shut down the VM, edit its settings, and select “Change Type” as shown below:

chng1-pvscsi

You will get another window that will allow you to change the type of controller as seen below:

chng2-pvscsi

Select the “VMware Paravirtual” and then select ok.  Boot up your virtual machine and you are all set.  Your system is now running with the updated drivers and you can take advantage of the newer drivers that provide better throughput and less latency!

Hope you found this post useful.  Good luck!

Scott


More Bang for Your Buck with PVSCSI (Part 1)

One of the new features added in the release of VMware vSphere 4.0 was a new SCSI subsystem driver that allows more I/O and less latency per virtual machine. What the heck is PVSCSI? Here is the technical definition, taken straight from the vSphere storage guide (RTFM).

scsicable

“VMware Paravirtualized SCSI (PVSCSI) is a special purpose driver for high-performance storage adapters that offer greater throughput and lower CPU utilization for virtual machines. They are best suited for environments in which guest applications are very I/O intensive. VMware requires that you create a primary adapter for use with a disk that will host the system software (boot disk) and a separate PVSCSI adapter for the disk that will store user data, such as a database. The primary adapter will be the default for the guest operating system on the virtual machine. For example, a virtual machine with Microsoft Windows 2008 guest operating systems, LSI Logic is the default primary adapter. The PVSCSI driver is similar to vmxnet in that it is an enhanced and optimized special purpose driver for VM traffic and works with only certain Guest OS versions that currently include Windows Server 2003, 2008 and RHEL 5. It can also be shared by multiple VMs running on a single ESX, unlike the VMDirectPath I/O which will dedicate a single adapter to a single VM.”

So what does all that mean for you? Better disk performance and fewer CPU cycles spent on processing these disk requests. I took some notes at VMworld 2009 during a few different sessions that discussed PVSCSI. Here is my logical diagram of what PVSCSI is. Download the PDF version here so you can print it out and frame it on your cube wall!

Sauer_PVSCSI.pdf

PVSCSI

With the release of vSphere 4.0 PVSCSI was only supported on disks other than the operating system (secondary data drives).  For more information on this, reference KB Article: 1010398.

vSphere 4 update 1 is now released and it’s exciting news for those looking at utilizing PVSCSI. Support for boot disk devices attached to a Paravirtualized SCSI (PVSCSI) adapter has been added for Windows 2003 and 2008 guest operating systems.

So let’s first find out if it’s all that. We need to do some testing to validate the hype. I created two virtual machines, one with the traditional LSI Logic SCSI driver, and one with the new PVSCSI driver. The host is the same for each VM: a 4 socket Intel Xeon system with 64 GB of RAM, connected to EMC Clariion CX3-80 storage. The RAID configuration is a 4+1 RAID 5 set (10K spindles), with the default Clariion Active/Passive MRU setup (No PPVE). Each VM has 2 vCPUs and 4 GB of RAM, and both are running 32 bit Microsoft Windows 2003 R2. Both virtual machines’ data disks were formatted using diskpart and the tracks were correctly aligned. Anti-virus real time scanning was disabled on both systems. This test is meant to get as close as possible to a standard configuration that we can benchmark from.

I used IOmeter as my testing engine. I didn’t go too deep on the various settings. The first run is 32K 50%R 0%W.
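When you compare two IOmeter runs like these, it helps to remember that IOPS and throughput are tied together by the block size. The Python sketch below shows that arithmetic and a simple percentage comparison. The function names are made up for illustration, and the IOPS figures are placeholders, not the numbers from my screenshots.

```python
def throughput_mib_s(iops, block_kib):
    """Throughput in MiB/s for a given I/O rate and block size in KiB."""
    return iops * block_kib / 1024


def percent_improvement(baseline, candidate):
    """How much higher the candidate is than the baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Placeholder IOPS values for a 32K test, purely for illustration.
lsi_iops = 2000
pvscsi_iops = 2500

lsi_mib = throughput_mib_s(lsi_iops, block_kib=32)
pvscsi_mib = throughput_mib_s(pvscsi_iops, block_kib=32)
gain = percent_improvement(lsi_iops, pvscsi_iops)
print(f"LSI Logic: {lsi_mib} MiB/s, PVSCSI: {pvscsi_mib} MiB/s, gain: {gain}%")
```

Because the block size is fixed within a run, the percentage gain in IOPS and in throughput is the same. What the percentage does not show is the CPU cost per I/O, so keep an eye on host CPU utilization alongside these numbers.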

Non-PVSCSI

no-pvscsi

With-PVSCSI

with-pvscsi

Quite the difference, no?  To be honest, I was seeing a lot of fluctuation while doing my tests.  I probably should have segregated things out a little more, but the screen captures were the average of the results.  I was thinking maybe I should use the built-in random IOmeter combined results.  So here you go.

Non-PVSCSI

no2-pvscsi

With-PVSCSI

with2-pvscsi

I believe the results speak for themselves. I need to do a little more testing for my own personal preferences. I want to get more insight into what the differences are on the reads/writes and the various sizes. I am certain cache has a lot to do with the results, but I think IOmeter can bypass cache since you force the randomizer. I’m also curious about the sweet spot on the block sizes and how that plays out with read vs write.

Conclusion

PVSCSI is a technology worth moving towards.  There is no cost involved, and it can deliver better disk performance across your ESX environment.  It also can bring your host CPU utilization down, which can provide you with better consolidation ratios across your clusters.  Stay tuned for part 2, when I am going to provide the “how to do it” aspect so you can begin to leverage this technology you are already paying for!

Hope you found this information helpful. Thanks go out to Aaron Sweemer (@asweemer) for allowing me to abuse his website so I don’t have to deal with bringing up my own site.

Thanks!

Scott Sauer


Capacity Conundrum Part Deux

 

– The vTrooper Report –

 

This is a continuation of the Capacity Conundrum. If you missed the first part, start here.

$ per Compute VM

So let’s cut to the chase. In the case of the compute tiles of our Quad we have a price per vCPU and a price per GB of RAM to settle. Keeping our example 2U server in play, we could expect to spend approximately $15,000 for a 2U fully loaded with 4GB DIMMs. Unfortunately part of that $15K is consumed by I/O cards and maintenance, which needs to be pulled out to get the compute number. For our argument we will use $10K for the compute system without the I/O cards and maintenance costs. This is the CAPEX we will offset in our $-per values.

vCOMPUTE – FIREPOWAH!

We know how a CPU works, right? Move the process into memory, execute CPU cycles, churn, churn more, back to the I/O guys, rinse and repeat. Basically, this is where the hardware container happens in our data centers. I say container because it’s easy to show it as a box; it’s hard to define what it will always be in physical form. 1U, 2U, 4U, half blade, full blade, appliance, PC: you name it, it is probably in someone’s ‘datacenter’. The lowest common denominator I have been able to settle on for a common form factor is Cores per RAM. Grouping per socket fits because you are measuring the memory that is close to the CPU socket. The NUMA architectures of AMD and Intel, with on-board memory controllers and transports to the memory DIMMs that don’t pass through the I/O controllers (e.g. Northbridge), help define the grouping.

TECHNOTE:  Every core has associated memory banks it will use, and every container (physical server) has a series of sockets that it controls.  A hypervisor has a limit to how well it can keep the associated memory space near the vCPU.  Generally the hypervisor will schedule a VM’s vCPUs on the same socket and place the memory for those processes in the memory banks of that socket.  It does this for the efficiencies of the x86 architecture.  It can move the VM to another socket and readdress the memory, but there is a ‘cost’ associated with such a move.  The path of least resistance is to stay in the same socket.

If you create a 4 vCPU VM and run it on a 1 core  processor it gets bogged down.   If you have the same VM on a two socket Quad Core (8 Cores)  the four cores utilized by the VM are likely to be on socket 1 or socket 2 .  The cost of splitting the vCPU between the two physical sockets by the scheduler is greater than running the vCPU in the same socket.    AMD delivered this earlier than Intel and sustained higher levels of virtualization consolidation “Per Host” than similar class systems of Intel could provide through the Northbridge.   Core i7 is a new game for Intel and the results of Nehalem show the improvements.

For more in-depth information, here is a good read:  CPU Scheduler in VMware ESX

We have a host with a $10K CapX charge that has two sockets at a 4/45GB Socket Ratio, with approximately $5K spent on each socket.  Looking at our hardware invoice, the CPU cores are about 25% of the cost of a socket, so we can assume that our per-socket cost breaks down into 25% core and 75% memory.  So our Socket Ratio yields a $1250 cost for 4 cores and $3750 for 45GB of memory:

Per Core CapX = $312.50;  Per GB RAM CapX = $83.33
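Here is a rough PowerShell sketch of that split, in case you want to plug in your own invoice numbers.  The variable names are placeholders of my own choosing, and the inputs are simply the figures from the example host above.

```powershell
# Figures from the example host; variable names are illustrative only.
$SocketCapX     = 5000    # hardware spend per socket ($10K host / 2 sockets)
$CoreShare      = 0.25    # portion of the socket cost attributed to CPU cores
$CoresPerSocket = 4
$RamGBPerSocket = 45

$CoreSpend = $SocketCapX * $CoreShare          # spend on cores per socket
$RamSpend  = $SocketCapX * (1 - $CoreShare)    # spend on memory per socket

$PerCoreCapX = $CoreSpend / $CoresPerSocket    # cost of one physical core
$PerGBCapX   = $RamSpend / $RamGBPerSocket     # cost of one GB of RAM

"Per core CapX:   {0}" -f [math]::Round($PerCoreCapX, 2)
"Per GB RAM CapX: {0}" -f [math]::Round($PerGBCapX, 2)
```

Change the share or the socket contents and the per-unit numbers follow.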

That gives us a bare metal cost without a hypervisor charge on top, but we need a hypervisor to get a VM running.  Adding in the ESX cost for a per-socket license of ESX Enterprise Plus (worst case), you can add $3500 per socket.

ESX Lic. Cost per socket CapX = $3500

Raw burn rate of the host would be $8500 per socket if we never loaded a VM on the Host.  Well, we did it for a reason, so let’s get our money back. If we target the standard allocation for this host (4/45GB socket ratio) we get our target VM count of 16 per socket(1 vCPU/2.8 GB RAM).  Also, keep in mind that we broke the socket cost down by 25%  to CPU and 75% to Memory so we will keep that  same  split here.  If we don’t do the split, then any VM that is deployed to the socket will bear the same cost regardless of its size.

ESX Lic. Cost per VM= $218.75   ( 3500 / 16 )

-Or-

Split by the 25/75 % we did previously for the cost of the CPU and Memory and you get a little different calculation.

3500 * .25 = 875, and 875 / 16 = $55 (rounded)     AND     3500 * .75 = 2625, and 2625 / 45 = $58 (rounded)

per vCPU = $55

per GB of vMEM = $58

Adding it up with our target ratios in tow we get the burn rate of the $ per Compute on a VM basis.

($312 / 4, for the 4:1 ratio) + ($83 * 2.8) + {($55 * 1) + ($58 * 2.8)} = roughly $530

Or Summarized: (vCPU = $78)+(vMEM = $233)+(Hyper$=219)=(vCOMPUTE = $530)

Assuming 8760 Hours (1 year) this VM would cost $.06/hr in vCOMPUTE.
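To try the same math on VMs of other sizes, here is a minimal sketch of the calculation as a PowerShell function.  The function name, parameter names, and defaults are hypothetical; the rates are the ones worked out above, with the license split 25/75 between vCPU and memory.  Because it does not round the intermediate values, the default VM lands close to, but not exactly on, the $530 figure.

```powershell
# Rates come from the example above; function and parameter names are hypothetical.
function Get-VComputeCost {
    param(
        [double] $vCPU  = 1,      # vCPUs assigned to the VM
        [double] $RamGB = 2.8     # GB of RAM assigned to the VM
    )
    $PerCoreCapX  = 312.50        # hardware cost per physical core
    $PerGBCapX    = 83.33         # hardware cost per GB of RAM
    $VcpuPerCore  = 4             # 4:1 vCPU-to-core target ratio
    $LicPerSocket = 3500          # hypervisor license per socket
    $LicPerVcpu   = $LicPerSocket * 0.25 / 16   # 25% of the license over 16 vCPUs
    $LicPerGB     = $LicPerSocket * 0.75 / 45   # 75% of the license over 45 GB

    $Hardware   = ($PerCoreCapX / $VcpuPerCore) * $vCPU + $PerGBCapX * $RamGB
    $Hypervisor = $LicPerVcpu * $vCPU + $LicPerGB * $RamGB
    $Total      = $Hardware + $Hypervisor

    [pscustomobject]@{
        Total   = [math]::Round($Total, 2)
        PerHour = [math]::Round($Total / 8760, 3)   # 8760 hours in a year
    }
}

Get-VComputeCost                      # the standard 1 vCPU / 2.8 GB VM
Get-VComputeCost -vCPU 2 -RamGB 4     # a larger VM pays more for its memory
```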

Let’s apply that to some other VM systems and see if it sticks.  If we plan for the following VM deployment on our socket:

[Figure: vmGrid]

The costs would spit out as such:

[Figure: vCompute]

Or slice it up into a per hour number:

[Figure: perhour]

So based on this analysis, some of my VMs probably only cost $.05 per hour for vCompute.  Interesting.  What is more interesting is that the memory cost associated with a VM scales more accurately with consumption.  You can have as much memory as you like for your new 4 and 8 GB aspirations (e.g. memory leaks); you just need to pay for it accordingly.

Too bad that only pays for the top part of my total cost model.  That said, the benefit here is that this model can span across hypervisors, and any market hypervisor can be split up to show the cost of a VM consumed on a Xen, KVM, VirtualIron, Parallels, or Hyper-V infrastructure.

I will be working on a few PowerShell scripts and Excel calculators that one can use to make this model more repeatable.  At the very least, it is a model that I will use to evaluate CapacityIQ and third-party products like the offering from VKernel, and the output they measure, especially if they add costs on a per-socket basis, which I can now calculate as overhead.

Alas, there is more to consider.  Stay tuned for Part III – “the I/O that binds”

Get Thin Provisioning working for you in vSphere

Going Thin and not looking back.

Yes, I am slowly losing my hair like many other aging men out there, but it wouldn’t be virtual insanity if I were blogging about my personal male pattern baldness issues.  With the latest release of VMware vSphere comes a lot of new features and functionality that can be leveraged to make our lives easier.  One of these features, which I personally have been looking forward to for a while, is Thin Provisioning.  If you aren’t familiar with this technology, jump over to Gestalt IT for a great explanation of what it is and how it works.

One of the exciting promises of thin provisioning is getting more “bang for your buck” out of the expensive enterprise storage you have been investing in for your ESX environment.  But, as Bret Michaels once said, “Every rose has its thorn”, and there are some things to look out for and considerations to make before implementing thin disk technologies.

Efficiencies are great if they work right and don’t overcomplicate the environment.

Do your homework and make sure you understand the characteristics of the virtual machine that you are considering migrating into a thin disk configuration.  The last thing you want to do is convert every VM to thin disk, and four months down the road all of your data stores are filling up and you’re scrambling for a storage CAPEX.  Some people are of the opinion to do thin provisioning either on the host side (VMware) or on the storage array side, but not both.  Take a gander at Chad Sakac’s blog that discusses thin on thin and some thoughts around each of these approaches.  I’m not going to go into all of the pluses and minuses of thin provisioning but rather focus on how to make it work for you.

Coffee Talk

So now that we have some of the basics out of the way, I wanted to share my thoughts on thin provisioning.  Like many organizations, we get requests from our customers that err on the side of caution.  They want to plan for the worst case and ensure that their project and/or application isn’t set up for failure.  The easiest way to do that is to pad the request: ask for more than you might really need, just in case something comes up down the road.  I don’t blame them really; I do it myself all the time when I make coffee at home.  I always end up making more coffee than I typically drink, just in case I might need that extra charge.  Virtual machine disk storage in some cases fits this same profile.  If my coffee maker granted me access to hot coffee on demand, I would stop making extra coffee.  Thin disks can give your end users that capacity on demand, so you can gain control of the padding effect that typically takes place in most corporate organizations.

Take it back…

So now you have done your research, you’re starting to get a feel for what this thin stuff is and how it might play out in your shop.  It’s go time.  If you’re a smaller VMware customer, you probably already have an idea of what are good target disks to convert.  If you’re a larger environment, it might be a little more difficult to gauge where the bloated pigs are hiding.

I worked at GE for a couple of years and was exposed to some of the Six Sigma methodologies they preach as well as practice.  Sounds boring, right?  Not really.  You can really leverage DMAIC for a lot of IT related problems/issues/projects.  You don’t have to take it to the extreme; just use the framework to help guide you on your quest:

DMAIC

The DMAIC project methodology has five phases:

  • Define high-level project goals and the current process.
  • Measure key aspects of the current process and collect relevant data.
  • Analyze the data to verify cause-and-effect relationships. Determine what the relationships are and attempt to ensure that all factors have been considered.
  • Improve or optimize the process based upon data analysis using techniques like Design of experiments.
  • Control to ensure that any deviations from target are corrected before they result in defects. Set up pilot runs to establish process capability, move on to production, set up control mechanisms and continuously monitor the process.

We have already defined our project goals and what we are trying to accomplish.  We need a good “Measure” tool to really find where we might benefit from thin provisioning.  PowerShell is a great tool that most VMware administrators use, or have at least heard of.  So this was the first place I turned to for assistance.

Alan Renouf of “Virtu-AL” http://www.virtu-al.net/ gave me a hand in writing the PowerShell script needed.  (Thanks again, Alan!)  Alan already had a one-liner script to produce a list of VMs, their assigned disks, and how much data each disk was consuming.  I needed the ability to see this data outside a PowerShell window and analyze it in a better format.  We have a decent-sized VMware environment, and exporting this out to a .csv for analysis is extremely helpful.  Here is the script!

************************************************************************

# Set the filename for the exported data
$Filename = "C:\VMDisks.csv"

# Replace MYVIServer with the name of your own server
Connect-VIServer MYVIServer

# Export-Csv takes its columns from the first object it receives, so put the
# VMs with the most guest disks first; otherwise the extra disk columns are dropped.
$AllVMs = Get-View -ViewType VirtualMachine
$SortedVMs = $AllVMs | Sort-Object { @($_.Guest.Disk).Count } -Descending

$VMDisks = @()
ForEach ($VM in $SortedVMs) {
    $Details = New-Object PSObject
    $Details | Add-Member -Name Name -Value $VM.Name -MemberType NoteProperty
    $DiskNum = 0
    # Guest disk figures come from VMware Tools and are reported in bytes
    ForEach ($Disk in $VM.Guest.Disk) {
        $Details | Add-Member -Name "Disk$($DiskNum)path" -MemberType NoteProperty -Value $Disk.DiskPath
        $Details | Add-Member -Name "Disk$($DiskNum)Capacity(MB)" -MemberType NoteProperty -Value ([math]::Round($Disk.Capacity / 1MB))
        $Details | Add-Member -Name "Disk$($DiskNum)FreeSpace(MB)" -MemberType NoteProperty -Value ([math]::Round($Disk.FreeSpace / 1MB))
        $DiskNum++
    }
    $VMDisks += $Details
}
$VMDisks | Export-Csv -NoTypeInformation $Filename

***********************************************************************

So now that you have this great spreadsheet, you can do all sorts of crazy sorting and reporting within Excel.  Take some time on phase 3, “Analyze”, and study what you’re seeing.  Talk to your VM stakeholders to see how things might be changing from their perspective.  Try to plan for the surprises and position yourself accordingly.
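If you would rather stay in PowerShell for a first pass, here is a minimal sketch that flattens the exported rows into one record per disk, so the disks with the most unused space float to the top.  The function name and the `-MaxDisks` limit are my own assumptions; the column names are the ones the export script writes.

```powershell
# Hypothetical helper: turns the wide CSV rows into one record per guest disk.
function Get-ThinCandidate {
    param(
        [Parameter(Mandatory=$true)] [object[]] $Rows,   # rows from Import-Csv
        [int] $MaxDisks = 10                             # assumed upper bound on disks per VM
    )
    foreach ($Row in $Rows) {
        for ($i = 0; $i -lt $MaxDisks; $i++) {
            $CapProp  = $Row.PSObject.Properties["Disk$($i)Capacity(MB)"]
            $FreeProp = $Row.PSObject.Properties["Disk$($i)FreeSpace(MB)"]
            # Skip columns that are missing or empty for this VM
            if (-not $CapProp -or [string]::IsNullOrEmpty([string]$CapProp.Value)) { continue }
            $Cap  = [double]$CapProp.Value
            $Free = [double]$FreeProp.Value
            if ($Cap -le 0) { continue }
            [pscustomobject]@{
                VM          = $Row.Name
                Disk        = $i
                CapacityMB  = $Cap
                FreeMB      = $Free
                PercentFree = [math]::Round(100 * $Free / $Cap, 1)
            }
        }
    }
}
```

Something like `Get-ThinCandidate -Rows (Import-Csv C:\VMDisks.csv) | Sort-Object FreeMB -Descending | Select-Object -First 20` would then list the twenty disks with the most free space inside the guest, which is a reasonable place to start the conversation with the VM owners.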

Next is the “Improve” phase of DMAIC (see, it’s easy!).  This is the part where you actually do the work.  It’s time to start leveraging the Storage VMotion APIs and reclaim some of that unused disk.

  1. Select the target VM in the VC client.
  2. Right click on the VM and select the option “Migrate”.
  3. Select the option “Change Datastore”.
  4. Select the destination, or click advanced if you are targeting one particular disk.
  5. Select “Thin provisioned format”.
  6. Select Finish.

Rinse and Repeat for the rest of that spreadsheet you have worked so hard on.

The last phase of DMAIC is “Control”.  This is one of the most important pieces of thin provisioning in my opinion.  At a minimum you need to set up Virtual Center alerts to monitor when your datastores are approaching critical levels.  You can’t implement thin disks in your vSphere environment and walk away.  The smart people over at VMware have given us the ability to monitor datastore disk space usage and over-allocation with the latest release of Virtual Center.  Set up your monitors so you are e-mailed when some of these thin disks begin to grow and you need to take some action.
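Alongside the built-in alarms, a quick script check can be handy.  The sketch below is only an illustration: the function name and the 20% default threshold are assumptions of mine, and it expects objects that carry `Name`, `CapacityMB`, and `FreeSpaceMB` properties, which is how I understand the datastore objects returned by PowerCLI’s `Get-Datastore` to look.

```powershell
# Hypothetical helper: returns the datastores whose free space is below a threshold.
function Get-LowSpaceDatastore {
    param(
        [Parameter(Mandatory=$true)] [object[]] $Datastore,   # objects with Name, CapacityMB, FreeSpaceMB
        [double] $MinPercentFree = 20                         # assumed threshold; tune for your shop
    )
    foreach ($ds in $Datastore) {
        if ($ds.CapacityMB -le 0) { continue }                # avoid dividing by zero
        $PercentFree = 100 * $ds.FreeSpaceMB / $ds.CapacityMB
        if ($PercentFree -lt $MinPercentFree) {
            [pscustomobject]@{
                Name        = $ds.Name
                PercentFree = [math]::Round($PercentFree, 1)
            }
        }
    }
}
```

With a PowerCLI session connected, `Get-LowSpaceDatastore -Datastore (Get-Datastore)` should list the datastores that need attention.  It complements the alarms rather than replacing them.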

Eric Gray of VMware takes this to the next level; check out his blog post on utilizing PowerShell to prevent datastore emergencies.  My personal approach to this concept is to set up a “hot spare” datastore for your environment.  A good practice here is to try to reclaim enough storage from your migrations to thin disks to free up that hot spare datastore.  Implementing an automated recovery solution like Eric’s will help you sleep easier at night.  Worried about what might happen if your script doesn’t work, or you do hit the perfect storm and end up with a full VMFS volume?  Intelligence has been built into vSphere to automatically pause the virtual machines.  Impressive.  Check out Eric’s video:

Wrapping it all up

Thin disk provisioning is a great feature that you should consider leveraging in your environment.  With some forward thinking and best practices you can achieve higher ROI for your ESX storage.  VMware vSphere offers the ability to migrate from thick to thin with no downtime, so you can begin reclaiming storage on the fly.  Keep it simple and start out with a high-level analysis of your infrastructure.  Identify the candidates that are a good fit and worth focusing on.  Set up your alerts on the datastores as soon as you migrate your first virtual machine, so you are protecting yourself from problems down the road.  Consider taking automated actions if your datastores are reaching critical thresholds.

I hope you found this article helpful, good luck!

Scott Sauer

Capacity Conundrum Part 1

 

–The vTrooper Report –

 

In an effort to gel up an internal billing and allocation model (GaaS – Gouging as a Service) I’ve been struggling with the concept of cost per VM.  I asked a simple question in the twittersphere about that idea and it turned into a discussion and well…got out of hand.  I apologize for that, as this is a better format to explain. (Special thanks to @asweemer for a dumping ground)

If I had a Nickel for each VM…

At VMworld 2009 there was a presentation in the keynote that showed the price of a VM hosted with Terremark at $.05 per hour.  I thought, wow.  A nickel per hour.  If I had a nickel per VM per hour, how much would I have available to spend on coffee?

Then I thought, wait.  I have VMs.  How much do they really cost me per hour?  Well, the answer is … it depends.  Old servers with high power consumption and low density versus new systems with Intel 5500s packed in blades have different burn rates visible to different systems (power, cooling, depreciation).  I haven’t found a great model to break those units down to my satisfaction yet.  I need another way.

As a general practice I create in my mind some maxims that I follow in the creation of a VM.

  1. S – 1vCPU, 1GB Ram, 1GB Net, 10GB Disk
  2. M – 2vCPU, 2GB Ram, 1GB Net, 20GB Disk
  3. L – 4vCPU, 4GB Ram, 2GB Net, 40GB Disk

Seems simple enough, but it doesn’t really generate a cost model on a consistent basis.  Hardware continues to change, and each VM that consumes resources does it at different rates and times of the day.  A VM that isn’t doing anything isn’t really ‘consuming’ anything, right?  I thought I would try to break it down further by creating a 4 quadrant block with two macro categories:  Compute (CPU and Memory) and I/O (Network and Disk)

[Figure: the four-quadrant block, with vCPU and vMEM under Compute and vNET and vDISK under I/O]

Each resource area could increase or decrease for a reason without changing the size of the original maxim it was created under.  This allows for small variations in size without having a customer yell that their bill went up by $2 this month.

The Measurable Unit

Use a unit of measure to identify the four quadrants:   vCPU : vMEM : vNET : vDISK  or C:M:N:D  .   Then overlay the VM creation to count up the units.  This way the growth of a ‘VM’ during its lifecycle can be adequately allocated back into the proper IT metric.  Using the VM creation maxims up above this may be:

  1. S – 1:1:1:1
  2. M – 2:2:1:2
  3. L – 4:4:2:4

This isn’t perfect, but it at least allows the average CPU cost to be allocated separately from the memory, network, and disk costs.  After all, you usually don’t get to upgrade all four parts of the quadrant in the same fiscal year.  This also gives you a way to trend your average cost rate per unit over a period of months and years to see which cost areas are improving.  It is an interesting metric for the business and IT.  Win-win in my book, even if no one internally ever has to pay the values back (Showback).  It also helps police which VM is consuming too much of a specific value, which would skew the numbers if you simply took the cost of the ESX hosts and divided by the number of VMs.
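To make the counting concrete, here is a small PowerShell sketch that totals C:M:N:D units for a mix of S, M, and L machines.  The function name and the sample deployment (10 small, 5 medium, 2 large) are made up for illustration; only the unit strings come from the maxims above.

```powershell
# Unit strings from the S/M/L maxims; everything else here is illustrative.
$Sizes = @{ S = '1:1:1:1'; M = '2:2:1:2'; L = '4:4:2:4' }

function Get-CmndTotal {
    param(
        [hashtable] $Sizes,    # size name -> 'C:M:N:D' unit string
        [hashtable] $Counts    # size name -> number of VMs deployed at that size
    )
    $Total = @{ vCPU = 0; vMEM = 0; vNET = 0; vDISK = 0 }
    foreach ($Size in $Counts.Keys) {
        # Split '2:2:1:2' into four integers: C, M, N, D
        $Units = $Sizes[$Size] -split ':' | ForEach-Object { [int]$_ }
        $Total.vCPU  += $Units[0] * $Counts[$Size]
        $Total.vMEM  += $Units[1] * $Counts[$Size]
        $Total.vNET  += $Units[2] * $Counts[$Size]
        $Total.vDISK += $Units[3] * $Counts[$Size]
    }
    New-Object PSObject -Property $Total
}

# Hypothetical deployment: 10 small, 5 medium, 2 large VMs
Get-CmndTotal -Sizes $Sizes -Counts @{ S = 10; M = 5; L = 2 }
```

Tracking the four totals separately is what lets each quadrant carry its own cost rate later on.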

Apples , Oranges, Lemons, and Grapes = Frutti Results

So you have a unit of measure and a type of system to match the measurement up towards over a period of time.  Here’s where the fruit cart and the horse get hooked up.

This is all very complex, why can’t I just buy the same server I have purchased for the last 5 years?

Sorry, kids.  They don’t build ’em like they used to.  In today’s market, the UCS system from Cisco brings a new buzz to the original players of IBM, HP, and Dell.  How do you sort any of that out among the offerings, and how do you select the right platform for your new ESX system?  By the socket!  Every system in the x86 family has them, from both Intel and AMD.  And now that you have to pay for your hypervisor and additional tools (CapacityIQ, AppSpeed, Nexus 1000V) per socket, it matters more.  I need to squeeze the value out of those sockets.

Still staying in the upper half of the Quad, let’s measure cores and RAM as a ratio, assuming dual rank 4GB DIMMs, and measure them against some of the standard 2 socket servers.

Standard Intel x5450

2 Socket – 4 Core – 16 Dimms (8 per socket) produces 4 cores/ 32 GB Ram

Standard Intel Nehalem  x5500

2 Socket – 4 Core – 18 Dimms (9 per socket)  produces 4 cores/ 45 GB Ram

Cisco UCS extension on x5500

2 Socket – 4 Core – 48 Dimms (24 per socket)   produces 4 cores/ 96 GB ram

What this shows is that for every license of ESX consumed in the environment there are different amounts of memory available for a VM to use.  The approach by the UCS system allows for a much higher allowance of memory to a VM at the same licensing cost.   Sure you could buy 4 way servers and claim that the 256 GB of RAM gives the VM more allowance but in reality the vm will have ratios of contention to the vCPU and Memory within each of the 4 sockets. You can change the size of the container by moving to a 4 way,  but it won’t change the value of the ratio  for that container in regards to the cores and memory.

CPU Contention

The idea of CPU contention is becoming more visible to most administrators of virtualized environments because the desire to pack VMs onto a host is so strong.  If I can get 10 VMs on a host for $5000, then getting 25 VMs on the same host lowers my cost per VM.  It could also be cheating your customers of the performance they paid for, especially if you have multiple vCPUs assigned to those 25 VMs.  This is where the ratio of VMs per host becomes obsolete and vCPU/core makes more sense.

Using the example containers above you can generate an expected number of VM’s per socket.  There is no reason to do a 1:1 ratio of cores to VM because the point of virtualization is to run more with less.  I think a good ratio to start with is 4:1 for a production VM and 16:1 for a VDI implementation:

Standard Intel x5450  -  (4 /32 GB SocketRatio)   yields 16  VM’s with a 1 vCPU/ 2GB ram configuration per socket

Standard Intel Nehalem  x5500  -  (4 /45GB SocketRatio) yields 16  VM’s with a 1 vCPU/ 2.8GB ram configuration per socket

Cisco UCS  -   (4 /96GB SocketRatio) yields 16  VM’s with a 1 vCPU/ 6GB ram configuration per socket
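The three results above all come from the same small calculation, sketched here in PowerShell.  The function and parameter names are hypothetical, and the sketch assumes one vCPU per VM, as the configurations above do.

```powershell
# Hypothetical helper: expected VMs per socket and RAM per VM for a SocketRatio.
function Get-SocketAllocation {
    param(
        [int]    $Cores,            # physical cores per socket
        [double] $RamGB,            # GB of RAM per socket
        [int]    $VcpuPerCore = 4   # 4:1 for production, 16:1 for VDI
    )
    $VMs = $Cores * $VcpuPerCore    # assumes one vCPU per VM
    [pscustomobject]@{
        VMsPerSocket = $VMs
        RamPerVMGB   = [math]::Round($RamGB / $VMs, 1)
    }
}

Get-SocketAllocation -Cores 4 -RamGB 32    # Standard Intel x5450
Get-SocketAllocation -Cores 4 -RamGB 45    # Standard Intel Nehalem x5500
Get-SocketAllocation -Cores 4 -RamGB 96    # Cisco UCS
```

Passing `-VcpuPerCore 16` gives the equivalent numbers for the VDI ratio.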

You can always adjust your actual deployment if these ratios don’t match up for your environment.   The expected deployment number helps determine how large the pizza slices are for the team.  Not how many slices each of them consume.  In these configurations you can see where the density of the RAM per socket (SocketRatio) of the UCS allows for much larger VM configurations before overcommitment. A nice fit for the new 64bit installations. These expected numbers of VM per socket help determine what the burn rate of a C:M:N:D value is for the CapX  spend you made.

BurnRate

To fully understand how much a VM costs, one has to look at what was spent in the CapX of the host and agree on the measuring stick to measure the C:M:N:D value of the created VM.  If a series of hosts are in service from different families and are at different parts of lifecycle there may have to be some averaging.  The SocketRatio of Cores/RAM is a consistent way to measure systems from different form factors and families and levelset the expected allocation of VM’s.  The expected allocation of VM’s for a host helps determine what density ratio is desired for vCPU:vMEM.

This is the end of Part 1 –  In Part Deux I will take a deeper dive into the Compute and I/O areas and assign a more detailed cost per VM model.
