Sometimes it does not matter how much you prepare ... the system is mighty.
To keep practicing and learning Linux, my friend and I run a small business that hosts a couple of websites and runs about a dozen virtual machines. We are actually proud of our solution, as we have managed to build a rather interesting infrastructure that has been running for years.
If we have any problem with it, it is usually with hard drives. We have now reached 222 days without downtime (I guess our uptime is now much better than that of the London Stock Exchange, powered by Microsoft). With this newly gained confidence, we decided to upgrade the whole software stack on our infrastructure, including major changes (Xen core, Linux kernels, and bringing all services and packages up to date). We decided to clean up everything for the next big thing.
We spent the whole of Friday and Saturday preparing all the packages and the process for a smooth upgrade. On Sunday before lunch, we were ready for the reboot. I asked my friend to go on site just in case anything went wrong. In short, we finished at 1:00 am on Monday morning. But we learned a lot, all the services have been restored, and the machines are ready to rock again.
Now follows a Linux rant about the problems:
13:00 - The hypervisor complains about being compiled against the wrong kernel headers. The solution was to recompile it against the latest Xen headers. That meant booting into a usable environment from a rescue CD. Our system has no CD-ROM drive, so we had to create a USB stick and boot from that.
14:30 - The root partition is not detected. We use the otherwise excellent Enterprise Volume Management System (EVMS) stack to manage our disks and partitions, but for some reason the root (main) partition was not detected. Too bad. After some help from IRC we re-enabled the EVMS flag on the partition and it was back online. OK, now we can access everything we need to recompile Xen.
15:30 - The hypervisor is recompiled and booting, but now it complains about a mismatch with the kernel. It turns out the kernel was for some reason compiled without the PAE extension. Xen dropped support for non-PAE kernels in the 3.1+ series. It took a while to figure out, but we recompiled the kernel with High Memory Support.
16:30 - The hypervisor boots and the kernel boots, but now init complains that it can't switch the root partition from RAM to the EVMS partition. After some investigation it turns out something is wrong with BusyBox (the switch_root function is missing). We edited the initrd manually and used the busybox binary from a working initrd.
18:30 - The system boots. We now have our main domain ready.
19:00 - 01:00 - We spent this time bringing the rest of our services back online. We also had to recompile the kernel for the domU machines and the modules for our firewall to include support for iptables and the TUN/TAP interface for VPN services, and modify the udev rules to create persistent rules for the network interfaces.
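Several of these surprises (the missing PAE extension, iptables support, the TUN/TAP interface) come down to kernel configuration options, so a small pre-flight check against the new kernel's .config could catch them before the reboot. Below is a minimal sketch in Python, not something we actually ran. The function names and the list of options are illustrative assumptions: CONFIG_X86_PAE only applies to a 32-bit x86 kernel, and your firewall or VPN setup may need other options.

```python
import re

# Illustrative list (an assumption, adjust it for your own setup).
REQUIRED_OPTIONS = [
    "CONFIG_X86_PAE",         # PAE kernel, needed by Xen 3.1+ on 32-bit x86
    "CONFIG_TUN",             # TUN/TAP device used by the VPN
    "CONFIG_IP_NF_IPTABLES",  # iptables support for the firewall
]


def enabled_options(config_text):
    """Return {option: 'y' or 'm'} for every option enabled in a kernel .config."""
    enabled = {}
    for line in config_text.splitlines():
        # Enabled options look like CONFIG_FOO=y (built in) or CONFIG_FOO=m (module);
        # disabled ones appear as comments: "# CONFIG_FOO is not set".
        match = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if match:
            enabled[match.group(1)] = match.group(2)
    return enabled


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are neither built in nor built as a module."""
    enabled = enabled_options(config_text)
    return [option for option in required if option not in enabled]
```

To use it, read the .config of the kernel you are about to boot and pass its contents to missing_options(); an empty list means every option on the list is enabled. It only covers kernel options, so it would not have caught the EVMS flag or the broken BusyBox.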
What a learning experience. Anytime I hear a PM say that the upgrade process has to be smooth and that there has to be zero downtime, I have to laugh. The world is not static, and neither is the development of packages. The longer you leave your system untouched, the more interesting things appear when you try to bring it up to date. Linux is great.
UPDATE: One day later, we think we probably could have avoided the hypervisor boot trouble and saved about 5 hours! Everybody is a general after the battle :)