We recently had a very successful (and longer than usual) Stargazing Live. I wanted to talk a little about the work that our team did in the weeks leading up to this and also recap what actually happened behind the scenes during the two weeks of events.
If you’re not familiar with it, Stargazing Live is an annual astronomy TV show on BBC Two in the UK, which is broadcast live on three consecutive nights. Each year we launch a project in collaboration with the show, and this always proves to be the busiest time of our year. This year, for the first time there was a second week of shows for ABC Australia, so this time we launched two projects instead of one: Planet 9 and Exoplanet Explorers.
A lot of work went into making sure that our site stayed up for this year’s shows. In previous years we’ve had issues that have resulted in either a brief outage or reduced performance for at least some of the time during the show. This year everything worked perfectly and we actually found ourselves reducing our capacity (scaling down) much sooner than we anticipated. The prep work fell into three areas:
- Optimisations to the frontend to reduce the number of API calls made by the site while people were using it. This involved a combination of refactoring, fixing bugs, and modifying the backend to return frequently requested data without it having to be requested separately (e.g. when checking if the user has favourited a subject).
- Reducing the load on our databases. We reduced the number of requests that result in database queries through caching in the backend (with memcache), and we started using a new microservice (called Designator) to keep track of what each user has seen and serve them new subjects. We also separated some services onto a read replica rather than having them query the primary database.
- Adding feature flags so that we could turn off anything non-essential, and so that we could shut down any features that were causing problems, using the Flipper Ruby gem.
On the first night of the BBC show it was all hands on deck. Our teams in the US and the UK were in our offices, despite it being evening in the UK, and in Oxford we gathered around the TV expectantly awaiting the moment when Chris would announce the project’s URL on air. That moment is usually a bit frantic, as several thousand people all turn up on the site at once and start clicking around, registering, logging in, and submitting classifications. We’re always closely watching our monitoring systems, keeping an eye on various performance metrics, watching for any early signs of problems that might affect the performance of the site. This year when that moment came the number of visitors on site shot up to over 5,000, and then… everything just kept running smoothly.
The first night of the BBC show we peaked at about 0.9 million requests per hour, with 1.1 million per hour the second night.
We scaled our API service to 50 of EC2’s m3.medium instances the first night and the average CPU utilisation of these instances reached about 30% at peak traffic. The next two nights we reduced the number of instances to 40. In hindsight we could have gone even lower, but from past experience the amount of traffic we receive on the second and third nights can be difficult to predict, so we decided to play it safe.
Traffic during the ABC show was lower than during the BBC show (Australia has a smaller population than the UK, so this was as expected). That week we scaled the API to 40 instances the first night, and 20 instances for the second and third nights.
In the past we’ve had problems with running out of available connections in PostgreSQL. The connection limit depends on available memory, and we find this to be more of a problem than CPU or network constraints. During the shows we scaled the PostgreSQL instance for our main API to RDS’s m4.10xlarge and our Talk/microservices database to m4.2xlarge, primarily to give us enough leeway to avoid the connection limit. In the future we’d like to implement connection pooling to avoid this.
This was all a big improvement on previous years. While before we found ourselves extremely busy fighting fires and fixing bugs between shows, this time we had time to just relax and watch the show. We have more work to do on optimisations, because we did still have to scale up our capacity more than we’d like, but overall we’re very happy with how well things went this year.