Revisiting VNX Fast Cache – PCI vs. SSD

A couple of months ago, after the introduction of the new EMC VNX arrays, I posted my thoughts on them here.  One of the engineering choices I questioned was the use of SSDs for extending cache versus a PCI card.  It was always obvious why SSDs would be better when cache was being added or replaced, but I questioned the throughput potential of a SAS interface versus a PCI one.

I got some interesting feedback on that from several people, and I appreciate it.  It wasn’t until the other day that I realized the argument really did not have that much merit.  In a moment of blinding brilliance, it dawned on me that the only time the interface choice might make a difference is when warming the cache.

How did I come to this realization?  I was in a VNX deep-dive session presented by Chad Sakac, and I had every intention of asking him the question of PCI versus SAS when it comes to cache.  Luckily for me, he brought it up during the session before I could ask.  Before he finished the rest of the presentation, I realized the error in my prior way of thinking.

Chad pointed out that the time it takes for an IO to go through the controller and loops and hit the flash is measured in nanoseconds (10⁻⁹).  Once it’s there, the flash itself has latencies in the microseconds (10⁻⁶).  So there is not likely to be a significant difference in latency between SSD and PCI when it comes to cache.
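
To put rough numbers on that line of reasoning (these are illustrative assumptions on my part, not EMC figures), even a generous estimate of the interconnect traversal time is a rounding error next to the flash service time:

```python
# Rough proportions, using assumed (not measured) values:
# interconnect traversal in nanoseconds vs. flash service time in microseconds.
traversal_ns = 1_000       # ~1 us through the controller and loops (assumed)
flash_read_ns = 100_000    # ~100 us flash read latency (assumed)
overhead = traversal_ns / (traversal_ns + flash_read_ns)
print(f"interconnect is ~{overhead:.1%} of the total service time")  # -> ~1.0%
```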


PCI obviously has greater throughput potential, which is why I asked the question in the first place.  But a realization jumped up and bit me while I was sitting through this presentation.  Cache IOs are usually small chunks of data that benefit from the reduced latency of flash / DRAM.  They aren’t giant read / write operations that require extremely wide bandwidth.  Will the increased bandwidth of PCI make a difference?  I have my doubts that it will be noticeable on the vast majority of workloads.  But this is just my opinion, as an outsider without the benefit of a storage engineering background.
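
To sanity-check that intuition with some admittedly made-up but plausible numbers (the block size, IOPS, and usable SAS lane rate below are my assumptions, not VNX specs), a small-block cache workload doesn’t come close to saturating even a single SAS lane:

```python
# Back-of-the-envelope: does a small-block cache workload saturate a SAS link?
# All of these numbers are illustrative assumptions, not EMC VNX specifications.
iops = 50_000             # small-block cache hits per second (assumed)
block_size = 8 * 1024     # 8 KB per IO (assumed)
required_mb_s = iops * block_size / 1e6
sas_lane_mb_s = 600       # ~6 Gb/s SAS lane, ~600 MB/s usable after 8b/10b encoding
print(f"workload needs ~{required_mb_s:.0f} MB/s vs ~{sas_lane_mb_s} MB/s per SAS lane")
# -> ~410 MB/s, comfortably inside a single lane, let alone a wide port
```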


I am looking forward to seeing the SPC-1 benchmarks for the VNX.  I believe they will objectively tell the whole truth.  A slight difference on an anomalous workload is not significant enough to outweigh the benefits of SSD versus a PCI cache.  An SSD-based cache is easily swapped, and it’s non-volatile.  It only needs to be warmed once.  If a controller fails, the cache doesn’t die with it.  Replace the controller, and there is no need to re-warm the cache.

As I alluded to in my last post on this… every design decision, whether in storage engineering, vSphere design, or automobile design, is one of compromise.  The SPC-1 results will tell the whole story, but I think what we’ll see here is that this particular compromise was overall a good one.  What do you think?  Let me know in the comments.

Welcome to the US Central VMware Newsletter



Download The Newsletter: VMware Newsletter March 2011

At one of our last internal U.S. Central VMware meetings, a few of us had a similar idea to pull together a newsletter for our customers.  Some of us were already doing this to a degree, but collectively we agreed that one source of information would be better than many.  Several VMware SEs and Specialists helped pull this together, so I wanted to thank everyone for their hard work.

There is so much great content published to the web, and sometimes passed around internally, that we wanted to consolidate this information into one common, distributable location.  99% of the content is not really specific to the U.S. Central region, other than the local events, so I think many people will be able to benefit from our efforts.  The goal is to test the waters and see if it’s something that people like and want to see continued.

As always, we are looking for feedback.  If you think this is something that should continue, let us know!  If you feel it’s lacking or could be improved in some aspect, we’d also like your opinion to help shape it.

Enjoy!

-Scott

Improving VMware Performance and Operations (vCenter Operations)


I spend a majority of my time talking with VMware customers, trying to understand their needs and how we can help them with some of their internal IT business challenges. I would say the majority of the problems and issues discussed are based around internal politics and a changing IT landscape, but the second largest concern is performance and growth (capacity). VMware, and virtualization in general, has been such a powerful driver for many organizations over the past several years. It has allowed IT organizations to run more efficiently, save capital expenditure costs, and ease administrative overhead, all in the midst of an economic downturn.


Capital expenditure savings are great, and very visible to the organization at a high level, but VMware needs to help customers with the next step. Now that we are moving so much of our infrastructure to a more elastic and flexible platform (vSphere), we need to provide tools to help you manage that infrastructure, because the methodologies from the physical world no longer apply. The more we can help automate and manage your virtual infrastructure, the more we can help with step two: saving your IT organization operational costs. A recent Gartner study determined that the average cost of a Windows server is $10,200 per year, and roughly 70% of that expense is OPEX. Gartner also estimates that with automation and management, up to 80% of that OPEX could be saved.
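
Running those quoted figures back of the envelope (treating them strictly as Gartner’s estimates, not measurements of my own) gives a feel for the per-server stakes:

```python
# Back-of-the-envelope using the Gartner figures quoted above.
cost_per_server = 10_200           # $/year, average cost of a Windows server
opex = 0.70 * cost_per_server      # ~70% of that cost is OPEX
potential_savings = 0.80 * opex    # up to 80% of the OPEX addressable by automation/management
print(f"OPEX ~${opex:,.0f}/yr; potential savings up to ~${potential_savings:,.0f}/yr per server")
# -> OPEX ~$7,140/yr; potential savings up to ~$5,712/yr per server
```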

VMware has made several acquisitions around management and automation, and I wanted to focus on one that was recently announced. vCenter Operations is a “new” product that was released this past week. It’s actually not all that new a product, but rather a re-branding of a key acquisition announced at VMworld 2010. Integrien was an analytics- and statistics-based software company with a focus on management software. Notice that their primary focus was not management but analytics, a completely different approach from that of several other software companies trying to reach the same end result.


Rather than simply creating metrics to monitor and then setting thresholds on those metrics, the Integrien technology actually analyzes the information it gathers and understands when there is a real problem. One of the coolest features of the full-blown enterprise version is that you can feed multiple data sources into the analytics engine. The more data it gets, the more accurately it can predict when a problem is likely to occur.
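
To make the distinction concrete, here is a toy sketch of the general idea: learn a baseline from a metric’s own recent behavior instead of relying on a fixed threshold. To be clear, this is only an illustration of the concept, not the actual Integrien / vCenter Operations analytics engine:

```python
# Toy "dynamic baseline" alerting: flag a sample only when it deviates sharply
# from the metric's own recently observed behavior, rather than comparing it
# against a hand-tuned static threshold. Purely illustrative.
from collections import deque
from statistics import mean, stdev

def anomalies(samples, window=30, sigmas=3.0):
    history = deque(maxlen=window)
    flagged = []
    for t, value in enumerate(samples):
        if len(history) == window:
            mu, sd = mean(history), stdev(history)
            if sd > 0 and abs(value - mu) > sigmas * sd:
                flagged.append((t, value))   # unusual relative to its learned baseline
        history.append(value)
    return flagged

# A CPU metric that normally idles around 20% would never trip a static 80%
# alert, but a sudden jump to 60% is flagged here because it breaks the baseline.
```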

This isn’t just your standard, run-of-the-mill monitoring software.

Those of you who have experience with enterprise monitoring software will know that it takes a lot of effort to get these systems up and “fine-tuned”. It takes a tremendous effort to sift through all of the white-noise alerts that come in and then adjust the alert thresholds until they produce tangible, usable data. vCenter Operations removes that manual effort by dropping in an intelligent analytical engine that can understand what’s really going on behind the scenes.



The product comes in a few different versions, each differing in what it offers. I would suggest pulling down the virtual appliance and checking out how awesome this product is. If you don’t feel like going to the effort, check out this video; it gives a great walkthrough of vCenter Operations and explains a lot of the same concepts I just wrote about.

vFabric … it’s not just for knitting anymore

If you’re a VMware fan, you have probably already seen the graphic above, or some variation thereof.  And you’re also probably already pretty familiar with the blue layer, or the Infrastructure layer of the cloud computing “stack.”  In addition, you’re probably well versed in the orange layer, or the End User Computing layer.  But what about that green layer?

That green layer is commonly referred to as cloud middleware, or the vFabric Cloud Application Platform.  It’s the ooey-gooey middle layer that leaves most of us in IT scratching our heads.  It’s where software developers live and breathe, but for the rest of us, it’s the layer we have traditionally avoided like Charlie Sheen avoids sanity.

I’ll be talking more about vFabric in future posts, but today I’d like to focus on WaveMaker, because it’s an exciting piece for those of us who aren’t software developers.  It may be just the tool that gets us to dip our toes into that ooey-gooey green layer.

OK, so what is WaveMaker, and how will it fit into that graphic above?  First, the official news blurb …

VMware closed its acquisition of WaveMaker on Friday March 4, 2011.  WaveMaker is a widely used graphical tool that enables non-expert developers to build web applications quickly.  This acquisition furthers VMware’s cloud application platform strategy by empowering additional developers to build and run modern applications that share information with underlying infrastructure to maximize performance, quality of service and infrastructure utilization.

Great, soooooo what does that mean for readers of this blog?  WaveMaker is a tool built just for us!  It is the tool that will enable us to build web applications very quickly and deploy them to the cloud (that ooey-gooey green layer of the cloud) with a single mouse click.  WaveMaker claims it can eliminate 98% of code, cut the web development learning curve by 92%, and reduce software maintenance by 75%.  Here are a couple of other bullet points you’ll find interesting …

  • WaveMaker eliminates Java coding for building Web 2.0 applications
  • WaveMaker Studio generates standard Java apps
  • One-click deployment eliminates the complexity of deploying web apps to enterprise or cloud-based hosting.

For more information, be sure to check out Rod Johnson’s blog post VMware acquires WaveMaker.  And of course, make sure you visit the WaveMaker website.  While you’re there, download the software and give it a test drive!  After you do, be sure to let me know what you think.

–Aaron


Do we need a dvRouter?

I’ve been running some tests in the lab lately, and trying to solve a problem that I don’t think is solvable right now. I’m hoping some of our readers will point out a potential solution that I have missed. Kendrick Coleman posted a write-up of how VM performance can be impacted by VM placement within the cluster.  This is almost exactly what I have been testing in my lab, with a few twists.

As Kendrick points out, VMs that need to communicate with one another regularly are better off on the same ESXi host.  With VMXNET 3 NICs, one can achieve massive throughput between VMs on the same host.  However, that benefit doesn’t always materialize.

The issue I am running into as I design my production environment is a requirement to have everything segmented off into hundreds of VLANs. This means there will be servers on the same host, on different VLANs, that need to communicate, sometimes frequently. This completely negates the benefit of having the VMs on the same host, as the traffic has to leave the box to be routed.

Here are some tests I did using iperf from the VM Advanced ISO v0.2 just to further expand on the idea:


  • 2 VMs on the same host / same VLAN
  • 2 VMs on the same host / different VLANs
  • 2 VMs on different hosts / same VLAN
  • 2 VMs on different hosts / different VLANs

The results showed that it makes almost no difference whether the VMs are on the same host when the VLANs / subnets are different.  Just for fun, I bumped the TCP window size and was able to achieve 3.5 Gbps from VM to VM on the same host and the same VLAN.  When the VLAN is changed, the slowdown is proportionally the same regardless of host affinity, because the traffic leaves the host, goes all the way to my Cisco 6509, and comes back into the same host.

Just for reference, all of the hosts in these examples are connected to the same Cisco 1 Gb switch.
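
For anyone who wants to run something similar, here is a minimal sketch of how the client side could be driven, assuming the classic iperf (iperf2) binary is installed and a second VM is already running “iperf -s”. The server address and window size below are placeholders, not the values from my lab:

```python
# Minimal iperf2 client wrapper; the address and window size are placeholders.
import subprocess

SERVER_IP = "10.0.10.20"   # hypothetical address of the VM running "iperf -s"

def run_iperf(server, seconds=30, window=None):
    cmd = ["iperf", "-c", server, "-t", str(seconds), "-f", "m"]  # report in Mbits/sec
    if window:
        cmd += ["-w", window]          # e.g. "512k" to bump the TCP window size
    return subprocess.run(cmd, capture_output=True, text=True).stdout

print(run_iperf(SERVER_IP))                  # default TCP window
print(run_iperf(SERVER_IP, window="512k"))   # larger window, as in the 3.5 Gbps test
```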

I brought this up with Cisco when they were in to talk about UCS.  They were mentioning their roadmap and virtual appliances, so I thought it was a good time to ask whether a virtual Layer 3 appliance was on Cisco’s roadmap.  The response was about what I expected.

Even if I go UCS, which brilliantly handles east/west traffic across multiple chassis, the top-of-rack Cisco 61xx devices don’t route, so I’ll still have to go all the way out to a 5000, or soon a 7000, to get routed back into the same host on the same wire.

Talking with a few friends who know more Cisco than I do, we discussed the idea of a virtual router.  The inherent problem with a virtual router in this environment is that a VM is still bound to its default gateway.  When DRS runs and moves VMs around, a VM can end up on a host that does not have the particular virtual router hosting its gateway interface.  That defeats the purpose.

We talked about how it might be possible to work around this using Cisco’s Gateway Load Balancing Protocol (GLBP), but even then you’d have to set preferred active paths, and it wouldn’t always work the way we need it to.

The only solution to this issue I can think of is a Distributed Virtual Router, which doesn’t exist.  If someone could make a virtual router that operates like the Distributed Virtual Switch, it would help all of us out here in the financial world who are ever more constrained by tons of VLANs, and (virtual) firewalls in between those VLANs.

Is there a need for this in the marketplace?  Or am I making a bigger issue out of this than I should?

As always, your comments are appreciated.