Category Archives: Technical

The New Zooniverse Home

If you visit Zooniverse Home today you may notice that a few things have changed. For a while now we’ve wanted to improve the design of the site to better handle the growing range of projects that are part of the Zooniverse. With the updates released today we’ve completely reworked the look and feel of the site. We’ve organised projects into different categories, starting with Space, Climate and Humanities. There are lots of new categories coming soon.

Today we’re also launching an entirely new type of project as part of ‘Zooniverse Labs’. To date, all Zooniverse projects have come with the guarantee that your work will contribute directly toward real research output (usually academic papers); however, this can make it hard to try new ideas. As the name suggests, Zooniverse Labs is about experimentation, and we’re excited about launching new, often smaller-scale websites that continue to stretch the boundaries of science online. More news on this very soon.

Another new part of the redesigned Zooniverse homepage is the ‘My Projects’ section that allows you to see your contributions to the Zooniverse, and what else you might like to try out. This part of the site is very much in beta, and we hope to add to it in the coming months.

We hope you like the new look, and that it helps you explore more of what the Zooniverse has to offer.

Recent downtime at The Zooniverse

Last week the majority of The Zooniverse websites suffered their first major outage for over 2 years. Firstly, I wanted to reassure you that the valuable classifications you all provide were always secure, but perhaps more importantly I wanted to say sorry – we should have recovered more quickly from this incident, and we’ll be working hard this week to put better systems in place so that we can recover more rapidly in the future.

For those who’d like more of the gory details, read on…

Some background

Many of you will know the story of the launch of the original Galaxy Zoo (in July 2007): just a few hours after launch, the website hosted by the SDSS web team and Johns Hopkins University crashed under the strain of thousands of visitors to the site. Thankfully, due to the heroic efforts of the JHU team (which involved building a new web server), the Galaxy Zoo site recovered and a community of hundreds of thousands of Zooites was born.

Fast-forward to February 2009 and we were planning for the launch of Galaxy Zoo 2. This time we knew we were going to have a busy launch day – Chris was once again going to be on BBC Breakfast. So that we could keep the site running well during the extremely busy periods, we looked to commercial solutions for scalable web hosting. There are a number of potential choices in this arena, but by far the most popular and reliable service was that offered by Amazon (yes, the book store) and their Web Services platform. While I won’t dig into the technical details of Amazon Web Services (AWS) here, the fundamental difference when running websites on AWS is that you have a collection of ‘virtual machines’ rather than physical servers.

If any of you have ever run Virtual PC or VMware on your own computer, you’ll already realise that machine virtualisation makes it possible to run a number of virtual machines on a single physical computer. This is exactly what AWS do, except they do it at massive scale (millions of virtual machines) and have some fantastic tools to help you build new virtual servers. One particularly attractive feature of these virtual machines is that you can have essentially as many as you want, and you only pay by the hour. At one point on the Galaxy Zoo 2 launch day we had 20 virtual machines running the Galaxy Zoo website, API and databases; two days later we were running only 3. The ability to scale the number of virtual machines up (and down) in real time means that we can cope with huge variations in the traffic that a particular Zooniverse site may be receiving.
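To give a flavour of what ‘booting more machines’ looks like in practice, here’s a rough sketch using Python and the boto3 library – with a placeholder machine image ID and instance type, and not our actual deployment scripts – showing how extra web servers might be started for a traffic spike and terminated once it has passed:

# Illustrative sketch only -- not our production tooling.
# Assumes AWS credentials are configured; the AMI ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Scale up: boot three extra web servers from a prepared machine image.
response = ec2.run_instances(
    ImageId="ami-00000000",   # placeholder: an image of your web server
    InstanceType="m1.small",  # placeholder instance size
    MinCount=3,
    MaxCount=3,
)
new_ids = [instance["InstanceId"] for instance in response["Instances"]]
print("Started:", new_ids)

# Scale down: terminate them again once the traffic spike has passed.
ec2.terminate_instances(InstanceIds=new_ids)

The underlying idea really is that simple: start servers when you need them, and stop paying for them when you don’t.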

The outage last week

As I write this blog post we currently have 22 virtual servers running on AWS. That includes all of The Zooniverse projects, the database servers, caching servers for Planet Hunters, our blogs, the forums and Talk, and much more. Amazon have a number of hosting ‘regions’, which are essentially different geographical locations where they have datacenters. We happen to host in the ‘us-east’ region in Virginia – conveniently placed for both European and American traffic.

We have a number of tools in place that monitor the availability of our websites, and last Thursday at about 9am GMT I received a text-message notification that our login server (login.zooniverse.org) was down. We have a rota within the dev team for keeping an eye on the production web servers, and last week it was my turn to be on call.
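The monitoring itself is conceptually simple. Something along the lines of the Python sketch below – simplified, with an alert that just prints rather than sending a text message, and not our actual monitoring code – polls each site and raises the alarm when one stops responding:

# Simplified uptime check -- a sketch, not our real monitoring setup.
import urllib.request

SITES = [
    "https://login.zooniverse.org",
    "https://www.galaxyzoo.org",
]

def site_is_up(url, timeout=10):
    """Return True if the site answers with a successful HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        # Connection errors, timeouts and HTTP errors all count as 'down'.
        return False

for url in SITES:
    if not site_is_up(url):
        # In our case a text message goes to whoever is on call;
        # here we just print a warning.
        print("ALERT: %s appears to be down" % url)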

I quickly logged on to our control panel, saw that there was a problem with the virtual machine and attempted a reboot. I also started to receive notifications that a number of the Zooniverse project sites were unavailable. Realising that something rather unusual was going on, I checked the Amazon status page, which was ‘all green’, i.e. no known issues. Amazon can be a little slow to update this page so I also checked Twitter (https://twitter.com/#!/search/aws). Twitter was awash with people complaining that their sites were down and that they couldn’t access their virtual machines. Although this wasn’t ‘good’ news, in a situation like this it’s always helpful to know whether the issue is with the code we run on the servers or with the servers themselves.

Waiting for a fix?

At this point we rapidly put up holding pages on our project sites and reviewed the status of each project site and service that we run. As the morning progressed it became clear that the outage was rather serious, and things became significantly worse for The Zooniverse as, one by one, the database servers that we run became inaccessible. We take great care to execute nightly backups of all of our databases, so when the problems started the most recent backup was only 4 hours old. With hindsight, when the problems first started we should have immediately moved The Zooniverse servers to a different AWS region (this is actually what we did with login.zooniverse.org) and booted up new database servers from the previous night’s backup. However, we were reluctant to do this because of the need to reintegrate the classifications made by the community during the outage. But hindsight is always 20/20, and this isn’t what we did. Instead, believing that a fix was only a matter of hours away, we waited for Amazon to fix the problem for us.
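For the curious, those nightly backups are conceptually very simple. The sketch below – Python again, with made-up database and bucket names standing in for the real ones, and assuming the standard mysqldump tool plus the boto3 library rather than our actual backup scripts – dumps each database and copies the dump off the server into S3 so that it survives the machine it came from:

# Conceptual sketch of a nightly dump-and-upload backup.
# Database names, bucket and credentials here are placeholders.
import datetime
import subprocess
import boto3

DATABASES = ["zooniverse_home", "galaxy_zoo", "forums"]   # illustrative names
BUCKET = "example-zooniverse-backups"                     # placeholder bucket

s3 = boto3.client("s3")
stamp = datetime.date.today().isoformat()

for db in DATABASES:
    dump_file = "/tmp/%s-%s.sql" % (db, stamp)
    # Dump the database to a local file using mysqldump.
    with open(dump_file, "w") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction", db],
            stdout=out,
            check=True,
        )
    # Copy the dump off the machine so it outlives the server it came from.
    s3.upload_file(dump_file, BUCKET, "%s/%s.sql" % (stamp, db))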

As the day progressed, a number of the sites became available for short periods and then inaccessible again. It wasn’t a fun day to be on operations duty, with servers continually going up and down. Worse, our blogs were also unavailable, so we only had Twitter to communicate what was going on. By about 11pm on the first evening several of The Zooniverse projects had been up for a few hours and things looked to be improving. Chris and I spoke on the phone and agreed that if things weren’t completely fixed by the morning we’d move the web stack early Friday morning.

Friday a.m.

Friday arrived and the situation was slightly improved, but the majority of our projects were still in maintenance mode, so I set about rebuilding The Zooniverse web stack. Three of our five databases were accessible again, so I took a quick backup of all three and booted up replacement databases and web servers. This was all we needed to restore the majority of the projects, and by lunchtime on Friday we pretty much had a fully working Zooniverse again. The database server used by the blogs and forums took a little longer to recover, so it was Friday evening before they were back up.
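Restoring from a dump is essentially the reverse of the backup sketch above. Roughly – and again this is a sketch with placeholder bucket, key and host names rather than the exact commands run on the day – you fetch the latest dump and feed it to the freshly booted replacement database server:

# Rough sketch of restoring a dump onto a replacement database server.
# Bucket, key and host names are placeholders.
import subprocess
import boto3

s3 = boto3.client("s3")
dump_file = "/tmp/galaxy_zoo.sql"

# Fetch the most recent dump from S3...
s3.download_file("example-zooniverse-backups", "latest/galaxy_zoo.sql", dump_file)

# ...and load it into the freshly booted MySQL server.
with open(dump_file) as dump:
    subprocess.run(
        ["mysql", "--host", "replacement-db.example.org", "galaxy_zoo"],
        stdin=dump,
        check=True,
    )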

A retrospective

We weren’t alone in having issues with AWS last week. Sites such as Reddit and Foursquare were also affected, as were thousands of other users of the service. I think the team at the Q&A site Quora put it best on their holding page: ‘We’d point fingers, but we wouldn’t be where we are today without EC2.’ This is certainly true of The Zooniverse.

Over the past 2 years we’ve only been able to deliver the number of projects that we have, and the performance and uptime that we all enjoy, because of the power, flexibility and reliability of the AWS platform. Amazon have developed a number of services that make it possible (in theory) to protect against the kind of failure they experienced in the us-east region last week. Netflix was notably absent from the list of AWS-hosted sites affected by the downtime, and this is because they’ve gone to huge effort (and expense) to protect themselves against such a scenario (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html).

Unfortunately, with the limited resources available to The Zooniverse, we’re not able to build a web stack as resilient as Netflix’s. However, there are a number of steps we’ll be taking this week to make sure that we’re in a much better position to recover from a similar incident in the future, so that we experience downtime of minutes and hours rather than days.

Cheers
Arfon & The Tech Team

Spring (theme) Cleaning

It’s that time of year again. Here in the Northern Hemisphere the flowers are starting to pop up, the groundhogs are coming out of hibernation, and signs everywhere indicate that it’s time to invest in the latest styles. Here in the Zooniverse I’m not sure we can generally be accused of clinging to the newest fashions, but security is always on our minds, and with the release of WordPress 3.1 we did need to do a few updates. While I was inside the guts of our server, I admit I went a little nuts and decided (in conversations with a bunch of people, including maybe you) that it might be nice to give our blogs matching outfits.

The new Zoo theme you see before you comes from one of my favourite WordPress design companies: Woo Themes. The theme is an adaptation of their social-media-ready MyStream, and it’s my hope that all the happy little links to classifying and social media will form a central jumping-off point to see what’s new, to share what you like, and to do science.

There are still a few things wanting. The Mendeley pages are, um, empty, but that will get fixed soon. The posts are also missing a lot of comments. This is where you come in. We know there are a ton of people contributing to science through the Zooniverse, and we’d like at least a large chunk of that ton reading this blog. Next time someone asks you about what you’re doing online when you’re clicking away, send them to the blog. Get them involved one blog post and one click at a time.

See you in the comments…

User accounts migration

This morning we made some major changes to the way you manage your account with Galaxy Zoo and the Zooniverse. Previously, all account management (e.g. changing your email address) was done through the Galaxy Zoo site; the changes we made this morning have moved those pages to the Zooniverse Home.

From the Zooniverse you can now manage your profile for both the projects (such as Galaxy Zoo) and also any of the Zooniverse forums. I’ve recorded a quick screencast demonstrating the changes here.

As part of the update today we also upgraded the Galaxy Zoo forum to the latest (and greatest) version of SMF. The changes we made today were made possible by the hard work of the whole Zooniverse developer team, in particular Jarod Luebbert and Pamela Gay – thanks for your help guys!

We hope you like the changes!

Cheers
Arfon