Saturday, April 12, 2008

Server down! Hardware or Greenware?

You may have noticed we had a catastrophic server outage that began about 1pm Tuesday and lasted until about 5pm Thursday. How could a server outage go 52 hours when the server is in a datacenter that strives for 99.999% uptime? When the server has mirrored drives and rsync'd backups to another machine half the country away? Well, it wasn't hardware that took us and hundreds of other sites down this week.
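To put those numbers side by side, here is a rough sketch of the arithmetic in Python. The calendar dates are my assumption (the Tuesday and Thursday just before this post's date), and a 365-day year is assumed for the yearly allowance.

```python
from datetime import datetime

# Outage window as described above. The exact dates are an assumption:
# the Tuesday and Thursday before Saturday, April 12, 2008.
start = datetime(2008, 4, 8, 13, 0)   # Tuesday, about 1pm
end = datetime(2008, 4, 10, 17, 0)    # Thursday, about 5pm

outage_hours = (end - start).total_seconds() / 3600

# Downtime that a 99.999% uptime target would allow in a 365-day year.
minutes_per_year = 365 * 24 * 60
allowed_minutes = minutes_per_year * (1 - 0.99999)

print(f"outage: {outage_hours:.0f} hours")
print(f"99.999% uptime allows about {allowed_minutes:.2f} minutes of downtime per year")
print(f"this outage was roughly {outage_hours * 60 / allowed_minutes:.0f} years' worth of that allowance")
```

In other words, a 99.999% target leaves room for only a few minutes of downtime a year, so a 52-hour outage is far outside anything that number covers.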

The company we lease our server from apparently stopped making payments to the datacenter from which it leased all of its servers at a reseller's discount price. So the datacenter unplugged all the servers leased by the reseller, including ours! Then it gave the reseller 48 hours to make a payment before allowing the customers to transfer the same servers over to leases from the datacenter directly. Unplugging everything was such a process that by the time our server was taken down, only 30 hours were left of this 48-hour period. No notice of this was given to us or any other customer by anyone! The first hint I had of a problem was an email from a monitoring service that tracks our server's uptime. Soon after, I received a frantic call from Jerry saying we were down. Then the fun began.

Our reseller first told us it was a datacenter issue. I wasn't too worried at this point because the datacenters are never down long. 99.999% uptime, remember? After receiving no replies from our reseller, I started investigating and found other customers gathering information on a webhosting forum. The rumor was that our reseller wasn't paying its bill. The datacenter would not confirm or deny this, but told us we must be patient. Hindsight is 20/20, but I wish I had started rebuilding our site from backups on a second server at this point, 4 hours in.

We were offline all night, and by the next morning the cries were reaching a fever pitch in the webhosting forum. The datacenter caved on holding out for the full 48 hours, as companies with a lot more revenue than ours were begging for relief. But instead of turning all the servers back on and sorting through the paperwork later, it insisted that all paperwork be completed before putting them back online. I think I was one of the earlier ones on the list, as I was following this very closely. Of course, their administrative services were stretched to the limit with everyone trying to get back online in a mad rush.

About 30 hours into this madness, all of my paperwork was finished and I assumed, incorrectly, that the end was near. Putting the servers back online was taking hours, just as taking them all offline had in the first place. That night, severe weather in Dallas caused significant damage to the datacenter's administrative offices! So temporary administrative offices were set up to accommodate the customers trying to get back online.

At 47 hours in, I submitted a support ticket, which was either ignored or never actually made it in. I'm not sure. At 49 hours in, I submitted tickets to both sales and support, and I received responses back from each pretty quickly. Sales said my server was up and running. Support said my paperwork had to be finished through sales first. So I copied both responses and sent each to the other. Shortly thereafter, we were back online.

So I guess offline backups of a server in a Texas datacenter with 99.999% uptime and daily backups to both New York and North Dakota are not enough when your reseller does not make payments on its leases. Greenware failures can be much more catastrophic than hardware failures.

(P.S. In his email to our archery customers explaining why we were down, Jerry said our server was down and had been replaced. I guess I had not explained clearly to Jerry that it was the same server in the same rack; it just needed a new lease.)

Thank you to our customers who put up with this outage. It was the first time in 5 years we were down for more than one hour.


