All posts by arfon

So Long, and Thanks for All the Fish!

It seems like only a couple of weeks ago that I announced I’d be heading off soon to pastures new, and yet somehow that time has already come – today is my last day working with the Zooniverse.

It’s pretty much impossible for me to describe how much fun I’ve had over the past five years. Playing a part in shaping the Zooniverse, from the early days of Galaxy Zoo (2) when we were a tiny team in Oxford through to where we are today, has been a blast. In a coincidence of timing, my son Caio has been around for almost exactly the same amount of time as I’ve been involved with the Zooniverse, and to be honest I’m not really sure I remember life before either. I checked the commit logs of the Galaxy Zoo 2 codebase and the first code was saved on 25th October 2008 – just over a month before Caio came into the world. Significantly, this was more than two months before Chris began paying me, but that’s just a testament to what a remarkably persuasive individual he is 🙂

These last few years have been filled with so many significant moments that it’s hard to pick out highlights, but if I had to, then the launch of Galaxy Zoo 2 – furiously coding as people around me were sipping champagne – is pretty memorable. Taking what felt like a massive leap into the unknown with Planet Hunters and then going on to find exoplanets is definitely up there too. And announcing Old Weather (still my favourite Zooniverse project) to the world and seeing how people responded to the Zooniverse doing something ‘other’ than astro was very special.

I’m not going to try and thank every individual I’ve been working with because I’m bound to forget important people. Suffice it to say, I love you all dearly and I’m going to miss working with you day to day immensely.

So farewell and stay in touch!


Zooniverse, GitHub and the future

In case you haven’t noticed, I’ve had a pretty busy five years at the Zooniverse. With more than 25 projects launched in fields from astronomy to biodiversity and from climatology all the way to zoology, it’s been an incredible experience to work with so many new science teams hungry for answers to research questions that can only be answered by enlisting the help of a large number of volunteers. This model of citizen science – one where we boil down the often complex analysis task brought to us by a science team to the ‘simplest thing that will work’, build a rich user experience and then ask a bunch of people to help – seems to work pretty well.

For me, one of the best aspects of what I get to do is that I work in a domain with an inherently open way of doing research. Having joined the Zooniverse when we were still ‘just’ Galaxy Zoo, to see the range of projects we host broaden and to watch our community mature has been a remarkable experience. With our latest endeavour – the Galaxy Zoo Quench project – it’s clear that the line between the activities of the ‘science’ team and the ‘volunteers’ is becoming less defined by the day. Citizen-led science in the Zooniverse began with a group of people in the Galaxy Zoo Forum, ‘The Peas Corp’, when they discovered a new class of galaxy, and it continues today with volunteers discovering new types of worms and exotic exoplanets and even, through Quench, analysing and writing a new paper as a group. These of course are just examples I’ve taken from the Zooniverse and there are many more in other projects run by other people, but in each case the result is the same: by engaging the public in a meaningful way, citizen science is challenging the centuries-old practices of academia, and that has to be a good thing.

The opportunity to change the way science is done, whether it’s building software to increase efficiency or developing new collaboration models, is what brought me to the Zooniverse and now it’s what is leading me away. At the end of September this year I’m going to be hanging up my hat as Technical Lead of the Zooniverse and joining GitHub as their ‘science guy’.

As with all big decisions in life this wasn’t an easy one. I feel very fortunate to have had the opportunity to give technical direction to an incredible team of scientists, developers, educators and designers here at the Adler and the wider Zooniverse. But over the past couple of years I’ve also got to know a number of the GitHub folks and I’ve been hugely impressed by their focus on building the very best platform possible for online collaboration. Starting with the very simple idea that ‘it should be easier to work together than alone’ they’ve clearly nailed what it looks like to work on a problem with others in code. But software isn’t the only thing people are sharing on GitHub – legislators are publishing drafts of state law, technicians are documenting scientific laboratory protocols and with tools like the IPython Notebook researchers have defined formats and means of sharing entire research workflows.

The mantra of ‘collaborative versioned science’ has been rattling around my head now for a couple of years. I believe there’s an opportunity for GitHub to be the platform for capturing the process of scientific discovery and I want to help make that happen.

So what does this mean for the Zooniverse? Well, I’m leaving at a pretty good time as the Zooniverse has never been healthier – there’s a first-class web and education team of twelve people I’m going to be leaving behind at the Adler Planetarium in Chicago, and we’ve just secured several large grants to expand our sister team at The University of Oxford to ten people (watch this space for job ads).

With all of these people and a number of major development projects in the pipeline we’re going to need a new Technical Lead. If this sounds fun, like you might be a good fit (and you’re able to work in the UK or US), then drop me and Chris Lintott a line – we’d love to talk. Our software is a mixture of Ruby, Rails and JavaScript and we like using technologies like MongoDB, Redis, Amazon Web Services and Hadoop. We get to work on hard data science problems, build custom software for solving crowdsourcing at scale and work with some incredibly smart and creative collaborators. Whoever takes over is going to have a lot of fun.


PS If you’d like to know more about what work looks like as Technical Lead of the Zooniverse then I’ve written recently about some of the problems we’ve addressed over the past few years here, here and here.

New Talk Feature: Automatic Favourites Collection

Today we’ve added a new feature to all the Zooniverse sites that use the new version of Talk. As you’ll know, most of our projects allow you to save ‘favourites’ – a list of things that are cool, interesting or worth keeping to refer back to later. One frequently requested feature has been to make this collection of favourites from the project available in Talk as a collection.

Today we’ve added this feature and from now on when you favourite something on the main site (e.g. Galaxy Zoo) it will automagically appear in a collection called ‘Favourites’ on Talk. That means you can discuss, share or even import them as a data source into


How the Zooniverse Works: Keeping It Personal

This is the third post in a series about how, at a high level, the Zooniverse collection of citizen science projects works. In the first post I described the core domain model that we use – something that turns out to be a crucial part of facilitating conversation between scientists and developers. In the second I covered some of the core technologies that keep things running smoothly. In this and the next few posts I’m going to talk about parts of the Zooniverse that are subtle but important optimisations: things such as how we pick which Subject to show to someone next, how we decide when a Subject is complete, and measuring the quality of a person’s Classifications.

Much of what I’m about to describe probably isn’t obvious to the casual observer but these are some of the pieces of the Zooniverse technical puzzle that as a team we’re most proud of and have taken many iterations over the past five years to get right. This post is about how we decide what to show to you next.

A Quick Refresher

At its most basic, a Zooniverse citizen science project is simply a website that shows you some data (images, audio or plots), asks you to perform some kind of analysis or interpretation of it, and collects back what you said. As I described in my previous post, we’ve abstracted most of the data part of that workflow into an API called Ouroboros which handles functionality such as login, serving up Subjects and collecting back user-generated Classifications.

Keeping it Fast

The ability for our infrastructure to scale quickly and predictably is a major technical requirement for us. We’ve been fortunate over the past few years to receive a fair bit of attention in the press which can result in tens or hundreds of thousands of people coming to our projects in a very short period of time. When you’re dealing with visitor numbers at that scale ideally you want everyone to have a pleasant experience.

Let’s think a little more about what absolutely has to happen when a person visits, for example, Galaxy Zoo.

  1. We need to show a login/signup form and send the information provided by the individual back to the server.
  2. Once registration/login is complete we need to serve back some personal information (such as a screen name).
  3. We need to pick some Subjects to show.

For many of the operations that happen in the Zooniverse, a record is written to a database somewhere. When trying to improve the performance of code that involves databases, a key strategy is to avoid querying those databases as much as possible, especially if the queries are complex and the databases are large, as these are often the slowest parts of your application.

What counts as ‘complex’ and ‘big’ in database terms varies based upon the types of records that you are storing, the choices you’ve made about how to index them and the resources you provide to the database server, i.e. how much RAM/CPU you have available.

Keeping it Personal

If there’s one place that complex queries are guaranteed to reside in a Zooniverse project codebase then it’s the part where we decide what to show to a particular person next. It’s complex, in need of optimisation and potentially slow for a number of reasons:

  1. When selecting a Subject we need to pick one that a particular User hasn’t seen before.
  2. Often Subjects are in Groups (such as a collection of records in Notes from Nature) and so these queries have to happen within a particular scope.
  3. We often want to prioritise a certain subset of the Subjects.
  4. These queries happen a lot, at least n * the total number of Subjects (where n is the number of repeat classifications each Subject receives).
  5. The list of Subjects we’re selecting from is often large (many millions).

On first inspection, writing code to achieve the requirements above might not seem that hard but if you add in the requirement that we’d like to be able to select Subjects hundreds of times per second for many thousands of Users then it starts to get tricky.

A ‘poor man’s’ version of this might look something like this:

def self.next_original_for_user(user)
  recents = joins(:classifications).where(:classifications => { :zooniverse_user_id => user.id }).select('classifications.subject_id').all
  seen_ids = recents.map(&:subject_id)
  if seen_ids.any?
    where(['id NOT IN (?)', seen_ids]).first
  else
    first
  end
end
What we’re doing here is finding all the classifications for a given User and grabbing all of the Subject ids for them. Then we do a SQL select to grab the first record whose id doesn’t match one of those from the existing classifications.

While this code is perfectly valid and would work OK for small-scale datasets there are a number of core issues with it:

  1. It’s pretty much guaranteed to get slower over time – as the number of classifications grows for a user, retrieving the recent classifications becomes a bigger and bigger query.
  2. It’s slow from the start – NOT IN queries are notoriously slow.
  3. It’s wasteful – every time we grab a new Subject for a User we essentially run the same query to grab the recent classification Subject ids.

These factors combined make for some serious potential performance issues if we want to execute code like this frequently, for large numbers of people and across large datasets all of which are requirements for the Zooniverse.

A better way

It turns out that there are technologies out there designed to help with this sort of scenario. When we select a new Subject for a user there’s no reason why this operation has to happen in the database that the Subjects are stored in; instead we can keep ‘proxy’ records stored in lists or sets. That means that if we have a big list of ids of things that are available to be classified, and a list of ids of things that each user has seen so far, then when we want to select a Subject for someone we just subtract those two sets, pick randomly from the difference and pluck that record from the database.

Screen Shot 2013-07-22 at 21.35.20

In the diagram above, when Rob (in the middle) comes to one of our sites we subtract from the big list of Subjects that still need classifying (in blue) the list of things that he’s already seen (in green) and then pick randomly from the resulting set. Going by this diagram it looks like we must have to keep a list of available Subjects for each project together with a separate list of Subjects per project per user so that we can do this subtraction, and that’s exactly the case. The database technology that we use to do this is called Redis and it’s designed for operations just like this.
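The subtraction described above can be sketched with plain Ruby Sets (in production, Redis performs the equivalent operation server-side on its set data type; the ids below are made up for illustration):

```ruby
require 'set'

# Subjects in the project that still need classifying (blue in the diagram).
available    = Set.new(%w[subject-1 subject-2 subject-3 subject-4 subject-5])

# Subjects this User has already seen (green), kept per project per user.
seen_by_user = Set.new(%w[subject-2 subject-4])

# Subtract the two sets and pick randomly from the difference...
candidates      = available - seen_by_user
next_subject_id = candidates.to_a.sample

# ...then pluck the full Subject record from the main database by this id.
```

Because both sides of the subtraction are just lists of ids, the expensive main database is only touched once, to fetch the single chosen record.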

The result

Maturing our codebase to a point where the queries described above are straightforward has been a lot of work, mostly by this guy. What does it look like to actually require this kind of behaviour in code? Just two lines:

class BatDetectiveSubject < Subject
  include SubjectSelector
  include SubjectSelector::Unique
end

This example is selecting ‘unique’ records for each user. We can also select grouped and prioritised unique records for projects like Planet Hunters. Regardless of the selection ‘flavour’ we’re using, it’s now simple for us to implement selection behaviour, and using Redis to perform these selection operations means that everything is insanely quick – typically returning from Redis in ~30ms even for databases with many tens of thousands of Subjects to be classified.

Screen Shot 2013-07-22 at 22.14.00

Making the routinely hard stuff easier is a continual goal for the Zooniverse development team. That way we can focus maximum effort on the front-end experience and what’s different and hard about each new project we build.

How the Zooniverse Works: Tools and Technologies

In my last post I described at length the domain model that we use to describe conceptually what the Zooniverse does. That wouldn’t mean much without an implementation of that model and so in this post I’m going to describe some of the tools and technologies that we use to actually run our citizen science projects.

The lifecycle of a Zooniverse project

Let’s think a little more about what happens when you visit a project such as Snapshot Serengeti. Ignoring all of the to-and-fro that your web browser does to work out where the domain name points to, once it’s figured this and a few other details out you basically get sent a website that your browser renders for you. For the website to function as a Zooniverse project a few things are essential:

  1. You need to be able to view images (or listen to audio or watch a video) that we and the science team need your help analysing.
  2. You need to be able to log in with your Zooniverse account.
  3. We need to capture back what you said when doing the citizen science analysis task.
  4. You need to be able to save favourite images to your profile.
  5. You need to be able to view recent images you’ve seen in your profile.
  6. You need to be able to discuss these images with the community.

It turns out that pretty much all of the functionality mentioned above is delivered for us by an application we call Ouroboros acting as an API layer, with a website (such as Snapshot Serengeti) talking to it.

Screen Shot 2013-06-26 at 8.20.51 AM

Ouroboros – or ‘why the simplest API that works is probably all you need’.

So what is Ouroboros? It provides an API (REST/JSON) that allows you to build a Zooniverse project that has all of the core components (1–6) listed above. Technology-wise it’s a custom Ruby on Rails application (Rails 3.2) that uses MongoDB to store data and Redis as a query cache, all running on Amazon Web Services. It’s probably utterly useless to anyone but us, but for our needs it’s just about perfect.

Screen Shot 2013-06-26 at 8.01.30 AM

At the Zooniverse we’re optimised for a few different things. In no particular order of priority they are:

  1. Volume – we want to be able to build lots of projects.
  2. Science – we want it to be easy to do science with the efforts of our community.
  3. Scale/performance – we want millions of people to be able to come to our projects and for the sites to stay up.
  4. Availability – we’d prefer our websites to be ‘up’ and not ‘down’.
  5. Cost – we want to keep costs at a manageable level.

Pretty much all of these requirements point to having a shared API (Ouroboros) that serves a large number of projects (I’ll argue #4 in the pub with anyone who really wants to push me on it).

Running a core API that serves many projects makes you take the maintenance and health of that application pretty seriously. Should Ouroboros throw a wobbly then we’d currently take out about 10 Zooniverse projects at once and this is only set to increase. This means we’ve thought a lot about how to scale the application for times when we’re busy and we also spend significant amounts of time monitoring the application performance and tuning code where necessary. I mentioned that cost is a factor – running a central API means that when the Zooniverse is quiet and there aren’t many people about we can scale back the number of servers we’re running (automagically on Amazon Web Services) to a minimal level.

We’ve not always built our projects this way. The original Galaxy Zoo (2007) was an ASP/web-forms application, and projects between Galaxy Zoo 2 and SETI Live were all separate web applications, many of them built using an application called The Juggernaut. Building standalone applications every time not only made it difficult to maintain our projects, but we also found ourselves writing very similar (but subtly different) code many times between projects – code for things like choosing which Subject to show next.

Ouroboros is an evolution of our thinking about how to build projects – what’s important and generalisable and what isn’t. At its most basic it’s a really fast Subject allocator and Classification collector. Our realisation over the last few years was that the vast majority of what’s different about each project is the user experience and classification interface, and this has nothing to do with the API.

Subjects out, Classifications back in.

The actual projects

The point of having a central API is that when we want to build a new project we’re already working with a very familiar toolset – the way we log people in, do signup forms, ask for a Subject, send back Classifications – all of this is completely standard. In fact if you’re building in JavaScript (which we almost always are these days) then there’s a client library called ‘Zooniverse’ (meta I know) available here on GitHub.

Having a standard API and client library for talking to it meant that we built the Zooniverse project Planet Four in less than a week! That’s not to say it’s trivial to build projects – it’s definitely not – but it is getting easier. And having this standardised way of communicating with the core Zooniverse means that the bulk of the effort when building Planet Four was exactly where it should be: the fan-drawing tools, the bit that’s different from any of our other projects.

Screen Shot 2013-06-26 at 8.19.22 AM

So how do we actually build our projects these days? We build them as JavaScript web applications using frameworks such as Spine JS, Backbone or something completely custom. The point being that all of the logic for how the interface should behave is baked into the JavaScript application – Ouroboros doesn’t try to help with any of this stuff.

Currently the majority of our projects are hosted using the Amazon S3 static website hosting service. The benefits of this are numerous but key ones for us are:

  1. There’s no webserver serving the site content – the domain simply resolves to an S3 bucket. When you access the Galaxy Zoo site, S3 does all of the hard work and we just pay for the bandwidth from S3 to your computer.
  2. Deploying is easy. When we want to put out a new version of any of our sites we just upload new timestamped versions of the files and your browser starts using them instead.
  3. It’s S3 – Amazon S3 is a quite remarkable service; a significant fraction of the web is using it. Currently hosting more than 2 trillion (yes, that’s 12 zeroes) objects and regularly serving more than 1 million requests for data per second, the S3 service is built to scale and we get to use it (and so can you).

Amazon S3 is a static webhost (i.e. you can’t have any server-side code running), so how do we make a static website into a Zooniverse project you can log in to when we can’t access database records? The main site functions just fine – these JavaScript applications (such as the current Galaxy Zoo or any recent Zooniverse project) implement what is different about the project’s interface. We then use a small invisible iframe on each website that actually points to Ouroboros. When you use a login form we actually set a cookie on this domain and then send all of our requests back to the API through this iframe. This approach is a little unusual, and with browsers tightening up the restrictions on third-party cookies it looks like we might need to swap it out for a different approach, but for now it’s working well.

Summing up

If you’re like me then when you read something you read the opening, look at the pictures and then skip to the conclusions. I’ll summarise here just in case you’re doing that too:

In the Zooniverse there’s a clear separation between the API (Ouroboros) and the citizen science projects that the community interact with. Ouroboros is a custom-built, highly scalable application built in Ruby on Rails, that runs on Amazon Web Services and uses MongoDB, Redis and a few other fancy technologies to do its thing.

The actual citizen science projects that people interact with are these days all pure JavaScript applications that are hosted on Amazon S3 and they’re pretty much all open source. They’re generally still bespoke applications each time but share common code for talking to Ouroboros.

What I didn’t talk about in this post are the hardest bits we’ve solved in Ouroboros – namely all of the logic about how to make finding Subjects for people quickly and other ‘smart stuff’. That’s coming up next.

How the Zooniverse Works: The Domain Model

We talk a lot in the Zooniverse about research, whether it’s interesting stories from the community, a new paper based upon the combined efforts of the volunteers and the science teams or conferences we might be going to.

One thing we don’t spend much time talking about is the technology solutions we use to build the Zooniverse sites, the lessons we’ve learned as a team building more than twenty five citizen science projects over the past five years and where we think the technical challenges still remain in building out the Zooniverse into something bigger and better.

There’s a lot to write here so I’m going to break this into three separate blog posts. The first is going to be entirely about the domain model that we use to describe what we do. Where it seems relevant I’ll talk a little about the implementation details of these domain entities in our code too. The second will be about the technologies and infrastructure we run the Zooniverse on, and the third will be about making smarter systems.

Why bother with a domain model?

Firstly it’s worth spending a little time talking about why we need a domain model. In my mind the primary reason for having a domain model is that it gives the team, whether it’s the developers, scientists, educators or designers working on a new project a shared vocabulary to talk about the system we’re building together. It means that when I use the term ‘Classification’ everyone in the team understands that I’m talking about the thing we store in the database that represents a single analysis/interaction of a Zooniverse volunteer with a piece of data (such as a picture of a galaxy), which by the way we call a ‘Subject’.

Technology-wise, the Zooniverse is split into a core set of web services (or Application Programming Interface, API) that serve up data and collect it back (more about that later) and some web applications (the Zooniverse projects) that talk to these services. The domain model we use is almost entirely a description of the internals of the core Zooniverse API, called Ouroboros. This application is designed to support all of the Zooniverse projects, which means that some of the terms we use might sound overly generic. That’s the point.

The core entities

The domain model is actually pretty simple. We typically think most about the following entities:


User

People are core to the Zooniverse. When talking publicly about the Zooniverse I almost always use the term ‘citizen scientist’ or ‘volunteer’ because it feels like an appropriate term for someone who donates their time to one of our projects. When writing code, however, the shortest descriptive term that makes sense is usually selected, so in our domain model the term we use is User.

A User is exactly what you’d expect: it’s a person, and it has a bunch of information associated with it such as a username, an email address, information about which projects they’ve helped with and a host of other bits and bobs. Crucially for us, though, a User is the same regardless of which project they’re working on – that is, Users are pan-Zooniverse. Whether you’re classifying galaxies over at Galaxy Zoo or identifying animals on Snapshot Serengeti, we’re associating your efforts with the same User record each time, which turns out to be useful for a whole bunch of reasons (more later).


Subject

Just as people are core, so are the things that they’re analysing to help us do research. In Old Weather it’s a scanned image of a ship’s log book, in Planet Hunters it’s a light curve, but regardless of the project, internally we call all of these things Subjects. A Subject is the thing that we present to a User when we want them to do something.

Subjects are one of the core entities that we want to behave differently in our system depending upon their particular flavour. A log book in Old Weather is only viewed three times before being retired whereas an image in Galaxy Zoo is shown more than 30 times before retiring. This means that for each project we have a specific Subject class (e.g. GalaxyZooSubject) that inherits its core functionality from a parent Subject class but then extends the functionality with the custom behaviour we need for a particular project.
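A minimal sketch of that inheritance pattern (the retirement numbers come from the text above; the class and method names here are illustrative, not the actual Ouroboros code):

```ruby
# Parent class: default behaviour shared by every project.
class Subject
  def retire_limit
    30  # e.g. a Galaxy Zoo image is shown ~30 times before retiring
  end

  def retire?(classification_count)
    classification_count >= retire_limit
  end
end

# Per-project subclass: inherits the core behaviour, overrides only what differs.
class OldWeatherSubject < Subject
  def retire_limit
    3  # a log-book page is only viewed three times before being retired
  end
end
```

The win is that selection, retirement and bookkeeping logic live once in the parent class, while each project pays only for the behaviour it actually changes.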

Subjects are then stored in our database with a collection of extra information a particular Subject sub-class can use for each different project. For example in Galaxy Zoo we might store some metadata associated with the survey telescope that imaged the galaxy and in Cyclone Center we store information about the date, time and position the image was recorded.


Task and Workflow

These two entities are grouped together as they’re often used to mean broadly the same thing. When a User is presented with a Subject on one of our projects we ask them to do something. This something is called the Task. These Tasks can be grouped together into a Workflow, which is essentially just a grouping entity. To be honest we don’t use it very much as most projects just have a single Workflow, but in theory it allows us to group a collection of Tasks into a logical unit. In Notes from Nature each step of the transcription (such as ‘What is the location?’) is a separate Task; in Galaxy Zoo, each step of the decision tree is a Task too.
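The relationship described above can be sketched as plain data (all keys and names here are invented for illustration, not the actual Ouroboros schema):

```ruby
# A Task is a single question we put to a User about a Subject;
# a Workflow just groups Tasks into an ordered, logical unit.
workflow = {
  name:  'notes_from_nature_transcription',
  tasks: [
    { key: 'location', question: 'What is the location?' },
    { key: 'date',     question: 'When was the specimen collected?' },
    { key: 'species',  question: 'What species is recorded?' }
  ]
}

# Most projects have a single Workflow, so stepping through it is
# just iterating its Tasks in order.
task_keys = workflow[:tasks].map { |t| t[:key] }
```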


Classification

It’s no accident that I’ve introduced these three entities – User, Subject and Task – first, as a combination of these is what we call a Classification. The Classification is the core unit of human effort produced by the Zooniverse community as it represents what a person saw and what they said about it. We collect a lot of these – across all of the Zooniverse projects to date we must be getting close to 500 million Classifications recorded.

I’ll talk more about what we store in a Classification in the next post about technologies; suffice to say for now that they store a full description of what the User said about the object. In previous versions of the Zooniverse API software we tried to break these records out into smaller units called Annotations but we don’t do that any more – it was an unnecessary step.
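For illustration only (the field names here are hypothetical, not the actual schema), a stored Classification ties together who looked, what they looked at and what they said:

```ruby
require 'time'

# A Classification document, roughly as it might be stored in MongoDB:
classification = {
  user_id:    'user-123',            # the User who classified
  subject_id: 'subject-456',         # the Subject they were shown
  project:    'galaxy_zoo',
  answers:    {                      # full description of what they said,
    'shape'     => 'spiral',         # one entry per Task
    'arm-count' => '2'
  },
  created_at: Time.now.utc.iso8601   # written once, read much later
}
```

Storing the answers inline as a single document, rather than as separate Annotation records, matches the write-once, read-much-later access pattern described later in this series.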

Screen Shot 2013-06-20 at 2.36.12 PM


Group

Sometimes we need to group Subjects together for some higher-level function. Perhaps it’s to represent a season’s worth of images in Snapshot Serengeti or a particular cell dye staining in Cell Slider. Whatever the reason for grouping, the entity we use to describe this is ‘Group’.

Grouping records is one of the most useful features Ouroboros offers, but also one of the things where it took us longest to find the right level of abstraction. While a Group can represent an astronomical survey in Galaxy Zoo (such as the Hubble CANDELS survey) or a ship in Old Weather, it isn’t enough just for a bunch of Subjects to be associated with each other – there’s often a lot of functionality that goes along with a Group, or the Subjects within it, that is custom for each Zooniverse project. Ultimately we’ve solved this in a similar fashion to Subject: by having per-project subclasses of Group (i.e. there is a SerengetiGroup that inherits from Group) that can set custom behaviour as required.


Project

Ouroboros (the Zooniverse API) hosts a whole bunch of different Zooniverse projects so it’s probably no surprise that we represent the actual citizen science project within our domain model. No prize for guessing the name of this entity – it’s called Project.

A Project is really just the overarching named entity that Subjects, Classifications and Groups are associated with. Project in Ouroboros does some other cool stuff for us though. It’s the Project that knows about the Groups, its current status (such as how many Subjects are complete) and other administrative functions. We also sometimes deliver a slightly different user experience to different Users in what are known as A/B splits – it’s the Project that manages these too.

So that’s about it. There are a few more entities routinely in discussion in the Zooniverse team such as Favourite (something a User favourites when they’re on one of our projects) but they’re really not core to the main operation of a project.

The domain description we’re using today is informed by everything we’ve learnt over the past five years of building projects. It’s also a consequence of how the Zooniverse has been functioning – we try lots of projects in lots of different research domains, and so we need a domain model that’s flexible enough to support something like Notes from Nature, Planet Four and Snapshot Serengeti but not so generic that we can’t build rich user experiences.

We’ve also realised that the vast majority of what’s different about each project is the user experience and classification interface. We’re always likely to want to put significant effort into developing the best user experience possible, and so having an API that abstracts lots of the complexity away and allows us to focus on what’s different about each project is a big win.

Our domain model has also been heavily influenced by the patterns that have emerged working with science teams. In the early years we spent a lot of time abstracting out each step of the User interaction with a Subject into distinct descriptive entities called Annotations. While in theory these were a more ‘complete’ description of what a User did, the science teams rarely used them and almost never in real-time operations. The vast majority of Zooniverse projects to date collect large numbers of Classifications that are write once, read very much later. Realising this has allowed us to worry less about exactly what we’re storing at a given time and focus on storing data structures that are convenient for the scientists to work with.
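As an illustration of what ‘convenient for the scientists’ can mean, a write-once Classification might be stored as a single self-contained document with the answers inlined, rather than spread across separate Annotation records. This fragment is hypothetical – every field name is invented:

```json
{
  "project": "galaxy_zoo",
  "user_id": "507f1f77bcf86cd799439011",
  "subject_id": "504f3cd2ba40ad0522000001",
  "created_at": "2013-02-20T14:32:11Z",
  "annotations": [
    { "task": "smooth_or_features", "answer": "features" },
    { "task": "spiral_arms", "answer": "2" }
  ]
}
```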

Overall the Zooniverse domain model has been a big success. When designing for the Zooniverse we really were developing a new system unlike anything else we knew of. Its terminology is pervasive in the collaboration and makes conversations much more focussed and efficient, which can only be a good thing.

Galaxy Zoo is Open Source

It’s always a good feeling to be making a codebase open, and today it’s time to push the latest version of Galaxy Zoo into the open. As I talked about in my blog post a couple of months ago, making open source code the default for Zooniverse is good for everyone involved with the project.

One significant benefit of making code open is that from here on out it’s going to be much easier to have Zooniverse projects translated into your favourite language. When we build a new project we typically extract the content into something called a localisation file (or localization if you prefer your en_US) which is basically just a plain text file that our application uses. You can view our (US) English translation file here, and it looks a little like this:
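The real file lives in the Galaxy Zoo repository; a hypothetical fragment in the same spirit (keys and strings invented for illustration) might look like:

```json
{
  "en_us": {
    "navigation": {
      "classify": "Classify",
      "profile": "My Galaxies"
    },
    "classifier": {
      "smooth_or_features": "Is the galaxy simply smooth and rounded, with no sign of a disk?"
    }
  }
}
```

A translator only ever needs to change the string values on the right; the keys are what the application looks up, so they stay the same in every language.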


So how do I translate Galaxy Zoo?

I’m glad you asked… It turns out there’s a feature built into the code-hosting platform we’re using (called GitHub) which allows you to make your own copy of the Galaxy Zoo codebase. It’s called ‘forking’ and you can read much more about it here, but all you need to do to contribute is fork the Galaxy Zoo code repository, add in your new translation file (there’s a handy script that will generate a template file based on the English version), translate the English values into the new language and send the changes back up to GitHub.

Once you’re happy with the new translation and you’d like us to try it out you can send us a ‘pull request’ (details here). If everything looks good then we can review the changes and pull the new translation into the main Galaxy Zoo codebase. You can see an example of a pull request from Robert Simpson that’s been merged in here.

So what next?

This method of translating projects is pretty new for us and so we’re still finding our way a little here. As a bunch of developers it feels great to be using the awesome collaborative toolset that the GitHub platform offers to open up code and translations to you all.



ARCHIVE: Why is Bad For Citizen Science

Since this post was written, SciStarter have changed their policies and now provide direct links to projects. This is a good thing, and I’m happy to acknowledge it here.

Chris – April 2017

Preface: I’d like to begin by saying that I’ve met Darlene Cavalier at conferences in the past and I’m a big supporter of her efforts. Darlene truly is a ‘cheerleader’ for citizen science, her enthusiasm is infectious and the citizen science domain is clearly a better place with her. I’m writing here about what I consider the bad practice of and Science For Citizens LLC, their parent organisation. I have no idea whether the issues highlighted here are because of decisions that she has made.

There was a time not so long ago when you needed a new account for pretty much everything you tried out on the web. Want to upload photos to Flickr? Then sign up for a Yahoo! ID. Want a blog? Then give WordPress or Tumblr your details. Feeling social? Then Facebook, Twitter or MySpace would pretty much want the same information. These days there are a number of solutions that allow you to log in to web-based services using things like your Facebook, Twitter or Google account. Under the hood these solutions typically rely on a couple of protocols such as OAuth and OpenID, and often still request your email address when you sign in, but the days of hundreds of accounts each with their own password to remember are coming to a close.

In many ways a request by an organisation for your email address when signing up for a new service is completely reasonable. In exchange for handing over your email address and a few personal details these tools were often available for free – both parties win. There is of course the discussion around who or what is the product when you use these free services, but let’s not go into that here.

Since launching the original Galaxy Zoo back in 2007 we’ve encouraged our volunteer community to register for an account with us, although for the vast majority of our projects (and all of our recent ones) this login/signup is an optional step. For the Zooniverse there are two main reasons for asking you to create an account:

1) When we publish a paper as a result of your efforts we feel extremely strongly about crediting you. Experience has taught us that attempting to publish a paper with 170,000 authors on it is somewhat frowned upon by the journals, but if you take a look at any of the Zooniverse publications you’ll find a link to an authors page such as here, here and here. We can only credit you if you share some personal information with us when you sign up.

2) For our research methods to work well, identifying an individual ‘classifier’ is pretty important. You can read more about this here (the original Galaxy Zoo paper) or here, but in order to produce the best results possible we spend lots of time working out who is ‘best’ at a particular task and weighting their contributions accordingly. Being able to reliably identify an individual throughout the lifetime of a project (and even between projects) is simplest when someone has logged in.
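The weighting idea can be sketched very simply: score each classifier by how often they agree with the consensus on subjects where a consensus is known, then weight their future votes accordingly. This is a toy Ruby version under made-up data structures – the published Galaxy Zoo scheme is iterative and considerably more subtle:

```ruby
# Toy user-weighting sketch: a classifier's weight is their rate of
# agreement with the consensus answer on subjects they classified.
def user_weights(classifications, consensus)
  agreements = Hash.new(0.0)
  counts     = Hash.new(0)
  classifications.each do |c|
    counts[c[:user]]     += 1
    agreements[c[:user]] += 1 if c[:answer] == consensus[c[:subject]]
  end
  counts.each_with_object({}) do |(user, n), out|
    out[user] = agreements[user] / n
  end
end

consensus = { "s1" => "spiral", "s2" => "elliptical" }
classifications = [
  { user: "ada", subject: "s1", answer: "spiral" },
  { user: "ada", subject: "s2", answer: "elliptical" },
  { user: "bob", subject: "s1", answer: "spiral" },
  { user: "bob", subject: "s2", answer: "spiral" }
]
p user_weights(classifications, consensus)  # => {"ada"=>1.0, "bob"=>0.5}
```

It should be clear why a stable identity matters here: if the same person shows up under different anonymous sessions, their classifications can’t be pooled into a single weight.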

Over the past year or so I’ve become increasingly concerned by the behaviour of – a website that indexes citizen science projects from across the web. The site does a pretty good job of cataloguing citizen science projects you can contribute to – when you visit the site and search, for example, for ‘bats’, the Zooniverse project Bat Detective is listed in the results. Selecting the result takes you to a brief summary of Bat Detective and offers you a link to ‘get started now!’, and this is where it goes wrong: rather than taking you straight to the Bat Detective site, you have to sign up for a SciStarter account first. Sign up for what exactly? Am I signing up to take part in Bat Detective? No. You’re actually just signing up for an account with, just so you can get a link to a project that SciStarter has nothing to do with.

Additionally, in a recent ‘top 10’ blog post of most successful citizen science projects of 2012, Bat Detective was highlighted. Did the link in this article send you straight to the Bat Detective website? Sadly not, it of course links to SciStarter’s catalogue page about Bat Detective which requires account registration before you can access the URL.

To me this doesn’t seem right, and in many ways this is just exploiting people’s lack of experience and understanding of the web. There’s a reason that the names of popular websites are consistently among the most Googled terms – many people just don’t quite understand how the web works, and I think SciStarter are exploiting this. Conversely, for those who are a little more web savvy these tactics must seem very clumsy. Perhaps more importantly though, it’s widely recognised that signup forms are a barrier to entry for many people, and so by having people jump through this hoop SciStarter are actually holding potential citizen scientists back.

I don’t believe it’s in anyone’s interest other than SciStarter’s to require you to sign up to follow a link through to a project. By mandating this step they are building an index of individuals interested in other people’s projects when they don’t have any of their own, and they’re risking confusing new community volunteers about what they have and haven’t signed up for. All of this is made worse by the fact that is a division of Science for Citizens LLC – a commercial company.

So my challenge to SciStarter is this: If you’re so committed to citizen science then why put up this artificial barrier to contribution? Harvesting people’s emails is one of the less tasteful aspects of the web and one I’d hoped we’d seen the end of. So how about it, SciStarter?

Making the Zooniverse Open Source

We’re pleased to announce that the time has come to start making the Zooniverse open source. From today, you’ll be able to see several of our current projects on GitHub and will be able to fork them and contribute to them.

Taking the Zooniverse open source is something we’ve been thinking about for a long time. As the field of citizen science expands into ever broader domains the number of tools available to people to start their own projects is still low. Since the launch of Galaxy Zoo 2 we’ve been building tools that allow for code reuse across a number of projects and while the majority(1) of our software has never been ‘officially’ open, behind the scenes we’ve been sharing with pretty much anyone who asked, often talking them through the thought process that led us to design our software in a particular way.

Because of our natural inclination to share with those who approached us, we’ve never really made publishing our code a priority. As with most closed source projects there are also a number of pretty boring (but sometimes important) reasons for not publishing. We worried about how usable the code we’d written was to people we didn’t work closely with – as a small team we favour clean code and conversation with other developers over heavy documentation. Some sensitive information about our production environment inevitably slipped into the codebases, which meant lots of work to clean up and security audit our tools. While some of these reasons still hold for legacy applications, each project we start comes with a new Git repo and an opportunity to develop in a different way.

What does this mean?

Well, from today you’ll start to see a number of applications appear on the Zooniverse GitHub site. We’re starting with a collection of our most recent projects: Snapshot Serengeti, Bat Detective, Cyclone Center and Seafloor Explorer.

It’s important to say here that we’re not expecting a community of developers to jump in and help us develop new projects (although that would be pretty cool), but if there’s a typo on our site or a really annoying bug that you know exactly how to fix then fork the repo and send us a pull request and we’ll see what we can do. Significantly for our localisation support (translating sites into multiple languages), we’re proposing that new translations should be submitted in exactly this way (2). There are a huge number of very talented people in the Zooniverse community who until today had no way of contributing to the project other than helping to analyse data. That changes today.

We’re releasing our software under a very liberal license – Apache 2.0. In very simple terms this means that the tools we develop can be used for whatever you like provided you follow the rules of the Apache 2.0 license.

What aren’t we open sourcing?

In truth lots of the legacy code for our older projects isn’t likely to make it into the open. A large number of our projects between 2009 (Galaxy Zoo 2) and 2011 (The Milky Way Project) were built upon a shared codebase called The Juggernaut. While we’re not making each of those projects open, we are publishing the common application core, which has been kept up to date and runs on Rails 3.1.

We’re also not opening up our applications that hold sensitive user information and are mission-critical for the operation of the Zooniverse. That’s not to say we won’t ever do this, we’re just not comfortable publishing these applications at this point. This basically means that the application that powers Zooniverse Home and an application called Ouroboros that serves up images and collects back classifications aren’t part of our open source strategy.

Why now?

Aside from the reasons mentioned above, there are a number of reasons to make open source our default position. In part it’s about people – developers these days are often hired (or at least shortlisted) on the strength of their GitHub profiles, which show the projects they’ve been working on. As our team grows and we hire talented young developers, we’re doing them a disservice by not allowing them to show off the awesome work they do. It’s also about the way in which we as the Zooniverse do science. We believe citizen science is an inherently open way of doing research: we often work with open datasets (such as SDSS) and ask people to donate their time and effort to a project that in the end produces open data products for the research community to enjoy. Having a closed codebase for everything we do just feels incompatible with this way of doing research.

What’s next?

To be honest we’re not quite sure. Going forward, our projects will typically become open source as we launch them. If there’s a Zooniverse project that you think you’d like to rework for a different purpose then there’s now nothing stopping you from doing this. If you’re interested in helping us with a new translation for your favourite project then we’d love to talk. Perhaps you’re just interested to see how some of our applications work. Regardless, we invite you to take a look and give us feedback. The Zooniverse has always been about harnessing the crowd to make science happen. From today, there is a new way for people to contribute to that goal.


1. Scribe, our open source text transcription framework, grew out of Old Weather and has been used on a number of projects now.
2. A fuller article about language support is coming very soon on this blog.