Monday, August 8, 2011

my virtualized server performance stinks!

Another recurring issue from the last ten years or so...

Short answer: You need to work with the SAN admins to get more storage IOPS, and you won't be able to give them reliable numbers from where you're sitting, because SANs and VMs will both lie to the guest OS. You will be able to report results like "it takes a long time" and "errors occur", but numbers from tools like bonnie++ or perfmon won't be very meaningful.

Long answer: As operating systems developed virtual memory and storage models and learned how to manage multi-processor architectures, they learned to lie to applications about how much of a given resource was available, and the apps became abstracted from the hardware. A given process can look up how much RAM the OS has, but if it asks for twice that, the OS cheerily complies, using virtual pages. This created an opportunity to drop in a second OS which treats the first OS as an application, letting you load many light workloads onto fewer pieces of heavier hardware. Once that opportunity had been realized, the next step was to oversubscribe the hardware: virtualization systems tell each OS that it has a reasonable amount of resources, gambling that the virtual machines will not all need their requested resources at the same time. Even more complexity ensues when you realize that each guest OS is doing the same thing to its own applications...

Much like other oversubscription systems, such as airline ticketing and financial markets, this trick sometimes fails. A sustained heavy user such as you describe is the nightmare client for an oversubscribed market, because its resources can't be swapped out to other clients. To prevent the whole cluster or SAN from going down, the virtual OS (AKA hypervisor) or the SAN OS simply denies the request for resources. It doesn't take the time to tell the guest OS that this is happening, though; it just rejects the request for more stuff.
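To make the gamble concrete, here's a rough toy simulation, not a model of any real hypervisor or SAN. Everything in it is an assumption made up for illustration: the function name shortfall_rate, the idea that a guest either uses its whole allocation or nothing on a given tick, and the specific numbers (30 guests, 100 IOPS promised to each, a 3:1 ratio, a 10% chance of a light guest being busy).

```python
import random


def shortfall_rate(n_guests, promised, ratio, busy_prob,
                   heavy_guests=0, ticks=10_000, seed=0):
    """Return the fraction of ticks in which total guest demand
    exceeds the physical capacity behind an oversubscribed pool.

    All parameters are illustrative assumptions:
      promised     - resource units (say, IOPS) each guest is told it has
      ratio        - oversubscription ratio, e.g. 3 for 3:1
      busy_prob    - chance a light guest uses its full allocation on a tick
      heavy_guests - guests that use their full allocation on every tick
    """
    rng = random.Random(seed)
    # Physical capacity is only 1/ratio of what was promised in total.
    capacity = n_guests * promised / ratio
    short_ticks = 0
    for _tick in range(ticks):
        # Sustained heavy users never back off.
        demand = heavy_guests * promised
        # Light guests are only occasionally busy.
        for _guest in range(n_guests - heavy_guests):
            if rng.random() < busy_prob:
                demand += promised
        if demand > capacity:
            short_ticks += 1
    return short_ticks / ticks


if __name__ == "__main__":
    light = shortfall_rate(30, 100, 3, 0.10)
    mixed = shortfall_rate(30, 100, 3, 0.10, heavy_guests=6)
    print(f"all light guests:     {light:.2%} of ticks short")
    print(f"six sustained guests: {mixed:.2%} of ticks short")
```

With only light guests the pool should almost never come up short; turn a handful of them into sustained heavy users and shortfalls become routine. Those shortfall ticks are where the hypervisor or SAN would be turning requests away.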

Unpopular answer: Datacenter virtualization is sold using oversubscription metrics developed from light loads like DHCP/DNS or intranet servers. Given a data-heavy load, the model breaks and the customer is forced to realize that they're not really going to get a 3:1 ratio. Heavyweight systems don't have to run on bare physical hardware, and they don't have to give up the other benefits of virtualization, but they really do need dedicated hardware resources, and they won't get oversubscription from virtualization unless they're willing to put up with this kind of performance problem. Virtualization provides a number of benefits beyond hardware consolidation, though: there's huge benefit from cloning, snapshot, monitoring and management tools. Just gotta make sure that you're basing the cost-benefit projections on the right benefits.
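For the cost-benefit projection, a back-of-the-envelope check might look like the sketch below. The helper name sustainable_ratio and all the demand figures are hypothetical; the only idea it encodes is that the ratio you can actually sustain is the total promised divided by the total the guests really use on a sustained basis.

```python
def sustainable_ratio(promised, sustained_demands):
    """Largest oversubscription ratio at which physical capacity
    still covers the guests' total sustained demand.

    promised          - units each guest is told it has (hypothetical)
    sustained_demands - what each guest actually uses, sustained
    """
    total_promised = promised * len(sustained_demands)
    total_demand = sum(sustained_demands)
    if total_demand == 0:
        # Idle guests can be oversubscribed without limit.
        return float("inf")
    return total_promised / total_demand


if __name__ == "__main__":
    # Ten DHCP/DNS-style guests, each using a fifth of its allocation.
    print(sustainable_ratio(100, [20] * 10))
    # Ten data-heavy guests, each using most of its allocation.
    print(sustainable_ratio(100, [90] * 10))
```

If the second number comes out well under 3, a 3:1 sales pitch doesn't apply to that workload, and the business case has to rest on the other benefits.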
