The Amazon outage: Recovery and improvements
Last Thursday we were affected by a large-scale outage in Amazon’s data center. Because we hosted most of our servers in a single zone of that data center, we weren’t prepared for a failure of this size. We store our data on Amazon data drives (‘EBS’) and create regular on-site and off-site backups. Unfortunately, during the outage we could access neither our most recent data (‘EBS’) nor any recent backups (‘snapshots’). Amazon locked down a large number of the EBS drives in the affected zone of their data center in Virginia, and at that time our data was only available in that specific zone.
After 12 hours of downtime and no clear signs of recovery on Amazon’s side, we chose to bring a new database server online with our most recent available data. As a result, 7.5 hours of your valuable data were missing when we got back online on Friday, April 22 at 1:35AM PST, and we had to wait until Amazon restored full access to recover that most recent data.
We’ve recovered all data
On Friday, April 22, Amazon restored full access to our data drives. Over the weekend one of our engineers (say hi to Rik, who spent Easter behind his MacBook) built and tested an import script that will import 100% of the missing data (data inserted between 3:00AM – 10:30AM PST on Thursday, April 21). We tested this script thoroughly on our development servers yesterday and today. During a short 10-minute maintenance window at 10:30PM PST tonight, April 27, we’re going to import the missing test data. After this import, any missing data from Thursday 3:00AM to 10:30AM will be available in your account again.
Note: After the import, some of your tests might contain duplicate (inactive) tasks. You can ignore those or remove them manually.
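For the curious, here is a rough sketch of what a time-window import of this kind can look like. It is not the script Rik wrote: the `results` table, its columns, and the `import_missing_rows` function are hypothetical names made up for illustration, and SQLite stands in for our real database so the example runs on its own. It copies only the rows created inside the given window and skips any row whose id already exists on the live side.

```python
import sqlite3

# Hypothetical schema used only for this illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY,
    test_id INTEGER,
    payload TEXT,
    created_at TEXT
)
"""


def import_missing_rows(recovered, live, start, end):
    """Copy rows with start <= created_at < end from `recovered` into `live`.

    Both arguments are sqlite3 connections. Rows whose id is already
    present in `live` are skipped. Returns the number of rows inserted.
    """
    rows = recovered.execute(
        "SELECT id, test_id, payload, created_at FROM results "
        "WHERE created_at >= ? AND created_at < ?",
        (start, end),
    ).fetchall()

    before = live.total_changes
    live.executemany(
        # OR IGNORE skips rows that would violate the primary key.
        "INSERT OR IGNORE INTO results (id, test_id, payload, created_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    live.commit()
    return live.total_changes - before
```

Because existing ids are skipped, running an import like this a second time inserts nothing, which makes it safe to rehearse on a development copy first.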
Prevent future outages
In the upcoming maintenance window (April 27, 10:30PM – 10:40PM PST) we will switch our database server to a more redundant Amazon RDS database server. This new database will run in two separate zones of Amazon’s data center (‘Multi Availability Zone’). If one instance of our database server fails, the other instance will take over. If one of the zones of Amazon’s data center fails again, as happened last Thursday, our servers will automatically recover in about 3 minutes.
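A failover that takes a few minutes still means that database calls can fail for a short while, so application code usually needs to retry instead of giving up on the first error. The sketch below shows one common way to do that, retrying with exponential backoff. It is an illustration only and not our production code: `call_with_retry` and its parameters are hypothetical, and we assume the database client raises Python's built-in `ConnectionError` while the standby instance is taking over.

```python
import time


def call_with_retry(operation, max_wait_seconds=300.0, first_delay=1.0,
                    max_delay=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call `operation` until it succeeds or the time budget runs out.

    Retries only on ConnectionError, doubling the pause between attempts
    up to `max_delay`. If the next pause would exceed the total budget of
    `max_wait_seconds`, the last ConnectionError is re-raised.
    """
    deadline = clock() + max_wait_seconds
    delay = first_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The default budget of 300 seconds is an arbitrary choice that comfortably covers a failover of about 3 minutes. Wrapping a query as `call_with_retry(lambda: run_query(sql))`, where `run_query` is whatever function talks to the database, lets a request wait out the switch instead of failing.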
I’m very sorry about the downtime last Thursday and any inconvenience this downtime or your temporarily missing data may have caused you. This large scale outage has taught us some valuable lessons about (better) designing for failure. Our team has grown rapidly over the past few weeks and we’re happy to have some more bright engineers onboard. We will do our very best to improve our infrastructure and service to prevent any future downtime as much as possible.
Please let me know if you have any questions about the outage, the way we responded to it, or the other measures we take to keep your data safe.
Of course, also feel free to contact us (firstname.lastname@example.org) if you need any help to set up or analyze a test. We’d be happy to make up for some of your lost time.