VM storage plan

This is a proposal for a new VM system. Discuss it here, and sign your comments with four tildes (~~~~). Here's how the plan goes:

HA-iSCSI Proposal

 * Dell arrays are connected as JBODs to 2 storage servers. Let's call them haiscsi1 and haiscsi2. Each will have 1 or 2 FC HBAs.
 * haiscsi1 and haiscsi2 are configured in an active/passive configuration.
 * haiscsi1 runs software RAID on the Dell arrays and then exports iSCSI LUNs over a service IP.
 * When haiscsi1 goes down, it is fenced off and haiscsi2 starts the software RAID array, brings up the service IP, and starts the iSCSI target.
 * VM servers will use an open-iscsi client with a boosted timeout and retry count, so that a failover has no consequences for the VM besides some slow disk access while it takes place.
 * I tested this by running bonnie++ on iscsitester (on mage) while stopping and starting the iscsi target on royal repeatedly.
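
The takeover sequence described above could be sketched roughly as follows. This is a dry run that only prints the commands in order; it is not a tested failover script. The service IP, interface, target init script path, target IQN, and the 120-second timeout are all placeholder assumptions, and the fencing step is left as a comment because the plan does not name a mechanism. Only the session replacement timeout is shown on the initiator side.

```shell
#!/bin/sh
# Dry-run sketch of the haiscsi2 takeover: prints commands, runs nothing.
# All values are placeholders (assumptions), not settings from this plan.
SERVICE_IP="${SERVICE_IP:-198.38.16.250/24}"            # hypothetical service IP
IFACE="${IFACE:-eth0}"                                  # hypothetical interface
TARGET_INIT="${TARGET_INIT:-/etc/init.d/ietd}"          # hypothetical target init script
TARGET_IQN="${TARGET_IQN:-iqn.2009-06.example:vmstore}" # hypothetical target name
REPLACEMENT_TIMEOUT="${REPLACEMENT_TIMEOUT:-120}"       # seconds, hypothetical

# Steps haiscsi2 would take, in order, once haiscsi1 is declared dead.
takeover_plan() {
    # Fence first so the failed node can never write to the array again.
    echo "# fence haiscsi1 here (mechanism not chosen in this plan)"
    # Assemble the software RAID arrays listed in mdadm.conf.
    echo "mdadm --assemble --scan"
    # Bring up the service IP that the VM servers connect to.
    echo "ip addr add $SERVICE_IP dev $IFACE"
    # Start the iSCSI target last, once storage and address are both ready.
    echo "$TARGET_INIT start"
}

# Initiator side: how long open-iscsi waits for a lost session to come
# back before failing I/O up to the VM.
initiator_timeout_plan() {
    echo "iscsiadm -m node -T $TARGET_IQN -o update -n node.session.timeo.replacement_timeout -v $REPLACEMENT_TIMEOUT"
}

takeover_plan
initiator_timeout_plan
```

The ordering is the point of the sketch: the array is only assembled after fencing, and the target only starts after the service IP is up, so initiators never reconnect to a half-ready server.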


 * An active/active configuration is also possible if haiscsi1 and haiscsi2 run different arrays and export different LUNs. Then, when one goes down, the other will import its array and start exporting its LUNs.
 * This is extensible to any more storage we get as long as 2 servers can concurrently read/write to it.
 * We could even RAID 1 athens/alexandria together and reexport their storage.


 * Solaris would probably also work with clustered ZFS, although some barriers are present:
 * COMSTAR configuration cannot be safely exported from one server and imported on the other: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6694511
 * iSCSI speeds to Solaris targets seem slower than to Linux ones (ietd), although that cannot be verified until more testing is done.
 * What Sun servers have PCI-X or PCI/66 slots? (PCI is shared, so having 2 HBAs in 1 server would not increase performance, since PCI has 133MB/s of shared bandwidth.)
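
The 133MB/s figure above is the theoretical peak of a 32-bit, 33 MHz PCI bus. A small sketch of the arithmetic for the bus types mentioned, assuming the standard widths and clock rates (the actual slots in the Sun servers are not known here, so both 32-bit and 64-bit variants of 66 MHz PCI are shown):

```shell
#!/bin/sh
# Peak theoretical bus bandwidth in MB/s: (width in bits / 8) * clock in Hz / 10^6.
# Real-world throughput is lower, and the total is shared by every card on the bus.
bus_mb_per_s() {
    echo $(( $1 / 8 * $2 / 1000000 ))
}

echo "PCI   32-bit/33 MHz:  $(bus_mb_per_s 32 33333333) MB/s"
echo "PCI   32-bit/66 MHz:  $(bus_mb_per_s 32 66666666) MB/s"
echo "PCI   64-bit/66 MHz:  $(bus_mb_per_s 64 66666666) MB/s"
echo "PCI-X 64-bit/133 MHz: $(bus_mb_per_s 64 133333333) MB/s"
```

This prints 133, 266, 533, and 1066 MB/s respectively, which is why a second HBA only helps on the faster buses.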


 * We can either export only a few iSCSI LUNs and use clvm to make logical volumes for VMs, or export one LUN for each VM.
 * Exporting a few LUNs and using clvm makes new VM creation easier and simplifies the iSCSI configuration (especially on Solaris).
 * Exporting one LUN for each VM means we do not have to use clvm.
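
To illustrate why the clvm option makes VM creation easier, here is a dry-run sketch of the two steps involved. It only prints the LVM commands. The volume group name `vmvg`, the device placeholder `/dev/sdX`, and the VM name and size are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints LVM commands instead of running them.
VG="${VG:-vmvg}"                 # hypothetical clustered volume group name
LUN_DEV="${LUN_DEV:-/dev/sdX}"   # placeholder for the imported iSCSI LUN

# One-time setup: mark the volume group as clustered so clvmd coordinates
# metadata changes across all VM servers.
shared_vg_plan() {
    echo "pvcreate $LUN_DEV"
    echo "vgcreate -c y $VG $LUN_DEV"
}

# Per-VM step: carve a logical volume out of the shared group.
# $1 = VM name, $2 = size (for example 10G)
new_vm_volume_plan() {
    echo "lvcreate -L $2 -n $1 $VG"
}

shared_vg_plan
new_vm_volume_plan examplevm 10G
```

With this layout, adding a VM is a single `lvcreate` on any VM server, with no change to the iSCSI target configuration.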


 * Right now I am leaning toward exporting a few LUNs and using clvm, especially since we don't need all the Red Hat cluster stuff:
 * http://www.pixelchaos.net/2009/04/23/openais-an-alternative-to-clvm-with-cman/


 * I got clvm working very nicely with openais on poisson/hex connected to iSCSI on royal.

VM Server Setup Notes

 * Get the vm server overlay (/afs/csl.tjhsst.edu/students/2010/2010tgeorgio/vm-overlay)
 * Install openais
 * Change /etc/ais/openais.conf so that under totem, bindnetaddr reads 198.38.16.0 and nodeid is unique.

echo "-clvm" >> /etc/portage/profile/use.mask
echo "sys-fs/lvm2 clvm" >> /etc/portage/package.use
echo "=sys-fs/lvm2-2.02.47" >> /etc/portage/package.keywords
 * Install lvm2 with clvm
 * Edit /etc/lvm/lvm.conf and change locking_type to 3
 * Until I fix the ebuild, add "need ais" to the depend() function of the clvmd init script
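
The lvm.conf edit in the notes above could be scripted along these lines. This is a minimal sketch, assuming the file already contains a `locking_type = N` line; the helper name `set_locking_type` is made up for illustration, and the demo runs against a scratch file rather than the real /etc/lvm/lvm.conf.

```shell
#!/bin/sh
# Rewrite the numeric value of locking_type in an lvm.conf-style file.
# $1 = path to the file, $2 = new locking type (3 selects clustered locking)
set_locking_type() {
    conf=$1
    type=$2
    tmp=$(mktemp) || return 1
    # Keep the original indentation and spacing, replace only the number.
    sed "s/^\([[:space:]]*locking_type[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$type/" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"    # write back in place so permissions are kept
    rm -f "$tmp"
}

# Demo on a scratch copy, not on the live configuration.
demo=$(mktemp)
printf 'global {\n    locking_type = 1\n}\n' > "$demo"
set_locking_type "$demo" 3
grep locking_type "$demo"
rm -f "$demo"
```

On a real VM server the call would be `set_locking_type /etc/lvm/lvm.conf 3`, run as root after backing up the file.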

Storage

 * Some sort of disks hooked up to a server or redundant pair that then exports them over the network so that all VM servers have access (iSCSI or NFS)
 * Priority is getting some sort of reliable storage
 * One option is zvols exported as iSCSI devices offering rollback and snapshots, as well as compression
 * HA-NFS backed vm storage is also a possibility since initial tests report excellent performance
 * This also limits the single point of failure (HA-NFS is easier than HA-iSCSI)


 * Live-migration is as simple as taking the vm down on one server and bringing it up on the other, since it's a block device on every server it's exported to.
 * The Dell arrays appear to be usable as storage as long as they interface to a reliable server (HBAs don't sit well in robustus). Royal is currently hooked up to them and exporting soupspoon and thumbtack over iSCSI.
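
The migration step described above (take the VM down on one server, bring it up on another) might look like the following dry run. It assumes the Dom0s use the `xm` toolstack and that each VM's config file sits in a shared directory; the directory path and the VM name are placeholders, not values from this plan.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for moving a VM between servers.
CFG_DIR="${CFG_DIR:-/path/to/shared/xen-configs}"   # placeholder path

# $1 = VM name, $2 = source server, $3 = destination server
migrate_plan() {
    vm=$1
    src=$2
    dst=$3
    # -w makes xm wait until the domain has actually shut down, so the
    # block device is never open on two servers at once.
    echo "ssh $src xm shutdown -w $vm"
    echo "ssh $dst xm create $CFG_DIR/$vm.cfg"
}

migrate_plan examplevm royal sovereign
```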

System

 * Dom0s run Gentoo. We will have a standardized install procedure and will likely have a portage overlay with custom xen bits (hypervisor/tools/kernel).
 * DomUs are Debian except for ltsp. Debian is simpler, and we can't keep an image for every VM because the software varies too much. Compiling on the VM to update is definitely not ideal. Thus, Debian is a better option, and it's arguably more stable.

Scripting
There are a lot of ways you could create a central management system for this sort of VM setup. You could have a script that brings down VMs on one server and brings them up on another, periodically pings hosts, etc. If a host goes down, you could tell iSCSI to stop sharing with that server until further notice and switch to another server. This is just one of several possibilities. There's also always the option of manual management.
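
As one possible starting point for the periodic host check mentioned above, here is a minimal sketch. The function name `check_hosts` and the `PROBE` variable are made up for illustration, and the default probe assumes the Linux `ping` options `-c` (count) and `-W` (timeout in seconds). It only reports unreachable hosts; what to do about them is left open.

```shell
#!/bin/sh
# Command used to test one host; overridable so other checks can be swapped in.
PROBE="${PROBE:-ping -c 1 -W 2}"

# Print the name of every host given as an argument that fails the probe.
check_hosts() {
    for h in "$@"; do
        $PROBE "$h" >/dev/null 2>&1 || echo "$h"
    done
}

# Example use from a management host (not run here):
#   while sleep 60; do
#       for down in $(check_hosts royal sovereign mage); do
#           echo "$(date): $down is unreachable"
#       done
#   done
```

A single missed ping is a weak signal, so a real script would probably want several consecutive failures before fencing a host or moving its VMs.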

Locking
All VM servers will have access to a shared NFS filesystem, which will contain all the Xen VM configurations and the lockfiles that prevent a VM from being started on more than one server at a time (ask Thomas to elaborate on locking).
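
The locking scheme itself is not spelled out here, so the following is only one common way it could work, not the actual design: each VM gets a lock directory, and because `mkdir` fails if the directory already exists, only one server can create it. The function names and the `LOCK_ROOT` default are hypothetical; in the plan, `LOCK_ROOT` would point at the shared NFS filesystem.

```shell
#!/bin/sh
# Hypothetical lock location; a local demo path is used as the default.
LOCK_ROOT="${LOCK_ROOT:-/tmp/vm-locks-demo}"

# Try to take the lock for VM $1. Returns 0 on success, 1 if already held.
acquire_vm_lock() {
    mkdir -p "$LOCK_ROOT" || return 1
    if mkdir "$LOCK_ROOT/$1.lock" 2>/dev/null; then
        # Record which server holds the lock, for humans and for cleanup.
        uname -n > "$LOCK_ROOT/$1.lock/owner"
        return 0
    fi
    echo "$1 is already locked by $(cat "$LOCK_ROOT/$1.lock/owner" 2>/dev/null)" >&2
    return 1
}

# Drop the lock for VM $1 after the VM has been shut down.
release_vm_lock() {
    rm -rf "$LOCK_ROOT/$1.lock"
}

# Example use around a VM start (not run here):
#   acquire_vm_lock examplevm && xm create /path/to/examplevm.cfg
```

A lock left behind by a crashed server would have to be cleared by hand or by whatever fences that server, which is one of the details that still needs elaboration.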

Results

 * royal exporting Dell arrays over iSCSI to mage
 * 32K chunk size: 24MB/s write, 31MB/s read, 250 random seeks/s

Problems so far

 * Writes are really slow, reads are pretty fast. Writing: zeros 3.1MB/s, random 2.5MB/s. Reading: zeros 82.9MB/s, random 59.9MB/s. Need to figure out why and fix it. Peter Godofsky 17:20, 16 June 2009 (EDT)
 * Update: It's not iscsi, it's the DomU. Same horrible disk write speeds are observed when running on sovereign, with the disk on sovereign. Peter Godofsky 17:43, 16 June 2009 (EDT)
 * Update also: The same DomU on sovereign experiences the same writing slowness: 1.5MB/s write, 67MB/s read to local disks. It appears that it is an issue with the software on the VM servers, since both royal and sovereign use a lenny Dom0 with an etch DomU kernel. Thomas Georgiou 17:43, 16 June 2009 (EDT)


 * Initial NFS testing appears to be getting nice performance: for a 6G file, 73MB/s writes and 101MB/s reads