I was sipping on my tea when I heard it.
Ping, Ping, Ping.
Within a few minutes, my peaceful morning had turned manic as an avalanche of alerts started flowing in.
Something had gone wrong, and maybe, just maybe, it was to do with the 1,337 lines of code I'd just merged into the live product.
Between 11:45am and 12:45pm on Nov. 17th, 2022, users from at least five schools encountered brick-wall responses from Huey the Bookbot when asking for recommendations, with more possibly failing to load a bookbot at all.
This is a bit of a postmortem.
The event was triggered by the merge of a new feature at 11:30am, without proper care to ensure all facets of the system (primarily the database) were aligned with the new code.
Owing to the out-of-sync database, requests to Huey's core recommendation engine raised errors - with bookbot users being met with disappointing messages from Huey.
Once the team became aware, the cause was quickly recognised - "the database wasn't brought up to speed with the newest changes" - and the process of updating the database was initiated.
However, a further problem was now brewing. During development, trial runs of this exact database migration took 1-2 minutes - not great, not terrible. But hindsight reveals this was due to the favourable conditions of the development database (virtually no traffic); in production, the many concurrent users and schools using the service increased the viscosity, causing the migration to demand upwards of half an hour.
This was now a bigger problem, affecting not only bookbot users.
The slow-running migration contained some very large, very blocking transactions, meaning that other features throughout the Huey Books ecosystem were impacted for this duration. Users trying to load a bookbot during this window (rather than already being in one) may very well have found themselves with long wait times, possibly ending in an unexpected 404 page in cases where the server timed out completely.
All stops were pulled out to speed up the migration - new traffic was deflected from the busy database, existing transactions were terminated, and the terminal window performing the migration was given a heavy stare and a stern talking-to.
Eventually, the cogs whirred to a stop. The migration was complete, and thus the production codebase was now aligned with the data, allowing the show to go on.
There were still errors piling up in the team's notifications. A brief look into it, and big surprise, the new code had bugs of its own! 🐛
Luckily, they weren't big ones. Two hotfixes were shoved out the door, finally putting an end to a very noisy (to those without the error alerts muted) hour.
The team was notified via Slack channels whose sole purpose is to automatically report errors encountered by API and/or bookbot users.
The alerts included data about the event, the affected user/school, and a clue as to which piece of code led to the error.
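For the curious, an alert of this shape could be assembled roughly as follows. This is a minimal sketch, not our actual alerting code: the `ErrorAlert` fields and the `build_slack_payload` helper are hypothetical names chosen for illustration. The only outside fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
from dataclasses import dataclass


@dataclass
class ErrorAlert:
    """Hypothetical container for the details an alert carries."""
    event: str          # what happened, e.g. the exception message
    user: str           # affected user (illustrative identifier)
    school: str         # affected school
    code_location: str  # clue as to which piece of code raised the error


def build_slack_payload(alert: ErrorAlert) -> dict:
    """Format the alert as a JSON-serialisable body for a Slack incoming webhook."""
    text = (
        f"Error: {alert.event}\n"
        f"User: {alert.user} | School: {alert.school}\n"
        f"Where: {alert.code_location}"
    )
    return {"text": text}


# Example values are made up for the demonstration.
example = ErrorAlert(
    event="Recommendation request failed",
    user="user-123",
    school="Example Primary School",
    code_location="recommend.py, get_recommendations",
)
payload = build_slack_payload(example)
print(json.dumps(payload, indent=2))
```

Posting the payload is then a single HTTP POST to the webhook URL, left out here so the sketch runs without a network connection.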
The Five Whys
Huey the Bookbot had an outage.
Because the database and codebase were out of sync for a period.
Because code with a very time-consuming migration was merged during school hours.
Because it was underestimated, and merged without sufficient consideration.
Because it was merged in a hurry.
Because a wealth of exciting new features were being held up by it! 🎉
- Just because a migration is quick on development doesn't necessarily mean it will be quick on production
- Test coverage can definitely be improved, particularly with regard to our core API endpoints
- Slack alerts are very effective!
- Large refactoring code changes would best be merged outside of peak hours
- A broadcast channel would be of great benefit in cases like this, i.e. a way to inject a banner into applications: "Huey is currently performing maintenance!", "Something has gone wrong at Huey HQ. Please come back later!"
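On the first point in the list above, one common way to stop a data migration from holding locks for long stretches is to do the work in small batches, committing after each one so other queries get a turn. The sketch below is not what was run during the incident; it assumes a hypothetical `books` table with a `title` column being backfilled into a new `title_lower` column, and uses an in-memory SQLite database purely so that it runs anywhere. Locking behaviour differs between database engines, so treat it as an illustration of the idea rather than a recipe.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill title_lower in small committed batches; return the rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE books SET title_lower = lower(title) "
            "WHERE id IN ("
            "  SELECT id FROM books WHERE title_lower IS NULL LIMIT ?"
            ")",
            (batch_size,),
        )
        # Committing per batch keeps each transaction (and its locks) short.
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to migrate
        total += cur.rowcount
    return total


# Demo on a throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, title_lower TEXT)"
)
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [(f"Book {i}",) for i in range(2500)],
)
conn.commit()

migrated = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM books WHERE title_lower IS NULL"
).fetchone()[0]
print(migrated, remaining)
```

The trade-off is that the data is briefly in a half-migrated state, so the application code has to tolerate both old and new rows while the backfill runs.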
- Create sensibly-permuted integration tests for all primary endpoints
- Introduce the concept of API status: a message and health level readily available in the /version endpoint, and/or environment variables in frontend applications
- Gargantuan code/database changes are to be avoided, but if required, be merged off-peak
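To make the API status idea a little more concrete, here is a rough sketch of what a status payload and the matching banner logic might look like. Nothing here exists yet: the `Health` levels, the `ApiStatus` fields, and both helper functions are assumptions made up for illustration, and the real /version response may end up looking quite different.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Health(str, Enum):
    """Hypothetical health levels the API could report."""
    OK = "ok"
    DEGRADED = "degraded"
    MAINTENANCE = "maintenance"


@dataclass
class ApiStatus:
    version: str
    health: Health
    message: str


def version_payload(status: ApiStatus) -> dict:
    """Build the JSON-serialisable body a /version endpoint could return."""
    return {
        "version": status.version,
        "health": status.health.value,
        "message": status.message,
    }


def banner_text(payload: dict) -> Optional[str]:
    """Frontend side: show a banner only when the API is not healthy."""
    if payload["health"] == Health.OK.value:
        return None
    return payload["message"]


# Example values are invented for the demonstration.
during_migration = version_payload(
    ApiStatus("1.0.0", Health.MAINTENANCE, "Huey is currently performing maintenance!")
)
normal_day = version_payload(ApiStatus("1.0.0", Health.OK, ""))

print(banner_text(during_migration))
print(banner_text(normal_day))
```

Keeping the message on the server side means a banner could be switched on during an incident without redeploying any frontend application.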
There is of course regret in the impact this mistake had on the affected schools and users, but there is a silver lining in the identification of areas of improvement, the demonstrable effectiveness of our alerts system, and the free shot of adrenaline that the sudden, rapid, incessant error pings provided 😅.
While tumultuous, this was an important step in our journey to help kids get on the path to becoming great readers.
Thanks for reading!
- An anonymous Huey Books developer