Recent downtime at The Zooniverse

Last week the majority of The Zooniverse websites suffered their first major outage in over 2 years. Firstly, I want to reassure you that the valuable classifications you all provide were always secure, but perhaps more importantly I want to say sorry – we should have recovered more quickly from this incident, and we’ll be working hard this week to put better systems in place so that we can recover more rapidly in future.

For those who’d like more of the gory details, read on…

Some background

Many of you will know the story of the launch of the original Galaxy Zoo (in July 2007); just a few hours after launch, the website hosted by the SDSS web team and Johns Hopkins University crashed under the strain of thousands of visitors to the site. Thankfully, due to the heroic efforts of the JHU team (involving a new web server being built), the Galaxy Zoo site recovered and a community of hundreds of thousands of Zooites was born.

Fast-forward to February 2009 and we were planning for the launch of Galaxy Zoo 2. This time we knew that we were going to have a busy launch day – Chris was once again going to be on BBC Breakfast. So that we could keep the site running well during the extremely busy periods, we looked to commercial solutions for scalable web hosting. There are a number of potential choices in this arena, but by far the most popular and reliable service was that offered by Amazon (yes, the book store) and their Web Services platform. While I won’t dig into the technical details of Amazon Web Services (AWS) here, the fundamental difference when running websites on AWS is that you have a collection of ‘virtual machines’ rather than physical servers.

If any of you have ever run Virtual PC or VMWare on your own computer then you’ll already realise that using machine virtualisation it’s possible to run a number of virtual machines on a single physical computer. This is exactly what AWS do except they do it at a massive scale (millions of virtual machines) and have some fantastic tools to help you build new virtual servers. One particularly attractive feature of using these virtual machines is that you can essentially have as many as you want and you only pay by the hour. At one point on the Galaxy Zoo 2 launch day we had 20 virtual machines running the Galaxy Zoo website, API and databases. 2 days later we were running only 3. The ability to scale up (and down) in realtime the number of virtual machines means that we are able to cope with huge variations in the traffic that a particular Zooniverse site may be receiving.

The outage last week

As I write this blog post we have 22 virtual servers running on AWS. That includes all of The Zooniverse projects, the database servers, caching servers for Planet Hunters, our blogs, the forums, Talk and much more. Amazon have a number of hosting ‘regions’ that are essentially different geographical locations where they have datacenters. We happen to host in the ‘us-east’ region in Virginia – conveniently placed for both European and American traffic.

We have a number of tools in place that monitor the availability of our websites, and last Thursday at about 9am GMT I received a text-message notification that our login server (login.zooniverse.org) was down. We have a rota within the dev team for keeping an eye on the production web servers and last week it was my turn to be on call.
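For readers curious what an availability check of this kind might look like, here is a minimal sketch in Python. It is not the monitoring tooling we actually run; the function names, the timeout and the injectable `opener` and `check` parameters are all assumptions made for illustration.

```python
import urllib.request


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status in time.

    `opener` is injectable (an assumption for this sketch) so the check
    can be exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def down_sites(urls, check=is_up):
    """Return the subset of URLs that fail the availability check."""
    return [url for url in urls if not check(url)]
```

A scheduled job could call `down_sites` with the list of project URLs every minute or so and send a notification whenever the result is not empty.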

I quickly logged on to our control panel, saw that there was a problem with the virtual machine and attempted a reboot. At this point I started to receive notifications that a number of the Zooniverse project sites were also unavailable. Realising that something rather unusual was going on, I checked the Amazon status page, which was ‘all green’, i.e. no known issues. Amazon can be a little slow to update this page so I also checked Twitter (https://twitter.com/#!/search/aws). Twitter was awash with people complaining that their sites were down and that they couldn’t access their virtual machines. Although this wasn’t ‘good’ news, in a situation such as this it’s always helpful to know whether the issue is with the code that we run on the servers or with the servers themselves.

Waiting for a fix?

At this point we rapidly put up holding pages on our project sites and reviewed the status of each project site and service that we run. As the morning progressed it became clear that the outage was rather serious, and it became significantly worse for The Zooniverse as, one by one, each of the database servers that we run became inaccessible. We take great care to run nightly backups of all of our databases, so when the problems started the most recent backup was 4 hours old. With hindsight, when the problems first started we should have immediately moved The Zooniverse servers to a different AWS region (this is what we did with login.zooniverse.org) and booted up new database servers from the previous night’s backup. We were reluctant to do this because of the need to reintegrate the classifications made by the community during the outage. But hindsight is always 20-20, and this isn’t what we did. Instead, believing that a fix was only a matter of hours away, we waited for Amazon to fix the problem for us.
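The trade-off described above (wait for the provider, or restore from the latest backup in another region) can be written down as a simple rule. The sketch below is purely illustrative: the two-hour threshold, the function names and the timestamps in the usage note are assumptions, not a policy we had in place.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long to wait for the hosting provider
# before restoring from backup in a different region.
MAX_WAIT = timedelta(hours=2)


def data_at_risk(last_backup, outage_start):
    """Window of work not captured by the latest backup when the outage began."""
    return outage_start - last_backup


def recovery_action(outage_start, now, max_wait=MAX_WAIT):
    """Return 'wait' while the outage is younger than max_wait, else 'fail_over'."""
    return "fail_over" if now - outage_start >= max_wait else "wait"
```

With a backup taken at 05:00 and an outage starting at 09:00 on the same (made-up) day, `data_at_risk` gives a four-hour window, and `recovery_action` switches from `'wait'` to `'fail_over'` once two hours have passed without a fix.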

As the day progressed a number of the sites became available for short periods and then inaccessible again. It wasn’t a fun day to be on operations duty with servers continually going up and down. Worse, our blogs were also unavailable so we only had Twitter to communicate what was going on. At about 11pm on the first evening a number of The Zooniverse projects had been up for a number of hours and things looked to be improving. Chris and I spoke on the phone and we agreed that if things weren’t completely fixed by the morning we’d move the web stack early Friday morning.

Friday a.m.

Friday arrived and the situation was slightly improved, but the majority of our projects were still in maintenance mode so I set about rebuilding The Zooniverse web stack. Three out of five of our databases were accessible again, so I took a quick backup of all three and booted up replacement databases and web servers. This was all we needed to restore the majority of the projects, and by lunchtime on Friday we pretty much had a fully working Zooniverse again. The database server used by the blogs and forums took a little longer to recover and so it was Friday evening before they were back up.

A retrospective

We weren’t alone in having issues with AWS last week. Sites such as Reddit and Foursquare were also affected as were thousands of other users of the service. I think the team at the Q&A site Quora put it best when they said ‘We’d point fingers, but we wouldn’t be where we are today without EC2.’ on their holding page and this is certainly true of The Zooniverse.

Over the past 2 years we’ve only been able to deliver the number of projects that we have, and the performance and uptime that we all enjoy, thanks to the power, flexibility and reliability of the AWS platform. Amazon have developed a number of services that make it possible (in theory) to protect against the kind of failure they experienced in the ‘us-east’ region last week. Netflix was notably absent from the list of AWS-hosted sites affected by the downtime, and this is because they’ve gone to huge effort (and expense) to protect themselves against such a scenario (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html).

Unfortunately, with the limited resources available to The Zooniverse we’re not able to build as resilient a web stack as Netflix. However, there are a number of steps we’ll be taking this week to make sure that we’re in a much better position to recover from a similar incident in the future, so that we experience downtime of minutes and hours rather than days.

Cheers
Arfon & The Tech Team