Postmortem: Downtime, July 5, 2017

Traders,

On July 5, 2017, we suffered a prolonged downtime – our longest since launch in November 2014 – due to a server issue. Trading was suspended from 23:30 UTC until 03:45 UTC, for a total suspension of 4 hours and 15 minutes.

Those of you who trade with us know that we take our uptime very seriously, and the record shows it. Before this month, we had not had a single month with less than 99.9% uptime, with our longest 100% streak reaching nearly 300 days.

So what happened?

The crypto market is exploding, as many of you know. While we have one of the most sophisticated trading engines in the industry, its focus has always been on correctness (remargining positions continuously, auditing every trade) rather than speed. This was a winning strategy from 2014 to 2016, and we’ve never lost an execution, but as volume hit record levels at the beginning of this year, requests started to queue up.

 

Optimizing the BitMEX Trading Engine

We started optimizing. The web layer, up to this point, hadn’t had any issues – we could always scale it horizontally – but the engine, at that time, could not be horizontally scaled. We partnered with Kx, the makers of kdb+, which powers our engine. We began testing new storage subsystems and server configurations. We settled on an upgrade plan, set for five days hence (July 11), and began testing the switchover. We simulated the switchover thrice, each time setting a timer so that we could best estimate our downtime. The plan was:

  • Move to a larger instance with a faster local SSD, and
  • Move from bcache + ext4 to ZFS.

Some more details on those actions:

  • EBS is slow. So we would move the trading engine from an AWS c3.xlarge, which we used for its fast local SSDs in combination with bcache, to an i3.2xlarge. This gives us far faster local SSDs and nearly 20x the local SSD storage, so we can easily cache our entire data set.
  • ZFS gives us some distinct advantages over other filesystems:
    • ZFS checksums individual blocks, preventing data rot. It can be scheduled to automatically check & repair drives (this is called a scrub), and can be configured to alert on varied criteria. This goes a long way toward ensuring the continued integrity of our data.
    • ZFS allows us to easily mirror and replicate our data across multiple volumes and physical locations.
    • ZFS snapshots are cheap, especially compared to traditional backup systems that must check the size & modified time of every file; in our testing, we can snapshot as often as every second (!) without any significant performance regression.
    • Kdb+ data is stored in a columnar fashion, like so:
      trade
      ├── foreignNotional
      ├── grossValue
      ├── homeNotional
      ├── price
      ├── side
      ├── size
      ├── symbol
      ├── tickDirection
      ├── timestamp
      └── trdMatchID
    • This data is highly compressible – in practice we see compression rates approaching 4x. This directly translates to less data over the wire to EBS, faster checkpointing, and lower latency on the write log. For example, du is able to show the “apparent size”, that is, the size the OS thinks these files are, versus the actual space usage:
      /u/l/b/e/d/h/execution $ du --apparent-size -h
      955M .
      /u/l/b/e/d/h/execution $ du -h
      268M .
    • ZFS has the concept of the ARC (fast in-memory caching, an adaptive combination of MFU and MRU caches; in practice, the MFU cache is better for our use case), and the L2ARC, which provides a second-level spillover of this data, ideally to fast local SSD. The L2ARC is compressed as well, leading to some eye-popping metrics:
      L2 ARC Size: (Adaptive)       1.17 TiB
      Compressed:            33.74% 403.90 GiB 
      Header Size:            0.08% 931.12 MiB
    • ZFS snapshots are amazing, and easy to code for. This allows us to do things that would be impossible otherwise, such as automatically snapshotting the engine data before and after any code changes. This is only practically possible because of the instant nature of snapshots. A minimal sketch of how these pieces fit together follows this list.
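
To make the above concrete, here is a minimal sketch of how a pool like ours might be assembled and used. The pool name, dataset, device names, and deploy script below are placeholders, not our production configuration:

    # Illustrative only: pool, dataset, and device names are hypothetical.
    # Create a mirrored pool with lz4 compression for the engine's kdb+ data.
    zpool create -o ashift=12 engine mirror xvdb xvdc
    zfs set compression=lz4 engine
    zfs create engine/data

    # Weekly scrub via cron: walk every block, verify checksums, repair from the mirror.
    # 0 4 * * 0  root  zpool scrub engine

    # Snapshots are cheap enough to take every minute (or even every second),
    # named by timestamp so they are easy to find later.
    zfs snapshot engine/data@$(date -u +%Y%m%dT%H%M%SZ)

    # Bracket any engine code change with snapshots so it can be rolled back:
    zfs snapshot engine/data@pre-deploy
    ./deploy-engine.sh                      # hypothetical deploy step
    zfs snapshot engine/data@post-deploy
    # zfs rollback engine/data@pre-deploy   # if the change misbehaves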

I could go on. We’re ZFS superfans.

 

What Went Wrong

As Donald Rumsfeld once said:

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones.

We had the plan ready to go, checklists ready, and we had simulated the switchover a few times. We started preparing a zpool for use with the production engine.

Here’s where it went wrong.

19:47 UTC: We create a mirrored target zpool that would become the engine’s new storage. In order to not influence I/O performance on the running engine, we snapshot the data storage drive, then remount it to the instance. This is not something we did in our test runs.

Bcache, if you haven’t used it before, is a tricky beast. It actually moves the superblock of a partition up by 8KB and uses that space for specific metadata. One piece of this metadata is a UUID, so bcache can identify unique drives. And that makes perfect sense, in the physical world. It’s in the virtualized world that this becomes a problem. What happens when you snapshot a volume – bcache superblock and all – and attach it?

Without any interaction, the kernel automatically attached the drive, figuring it was also the backing device of the existing (running) bcache device, and appeared to start spreading writes randomly across both devices. As you can imagine, this is a problem: it began to trash the filesystem minute by minute, but we didn’t know it was doing this. It seemed odd that a bcache1 device had appeared, but we were not immediately alarmed. No errors were thrown, and writes continued to succeed. We start migrating data to the zpool.
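
In hindsight, the duplicate superblock could have been spotted from the shell. A minimal sketch, assuming hypothetical device names (/dev/xvdf1 for the live backing device, /dev/xvdg1 for the attached copy):

    # Illustrative only: device names are hypothetical.
    lsblk                          # an unexpected bcache1 appears alongside bcache0
    bcache-super-show /dev/xvdf1   # live backing device: note dev.uuid / cset.uuid
    bcache-super-show /dev/xvdg1   # attached snapshot copy: reports the same UUIDs
    # One way to avoid the collision would be to wipe the copy's bcache
    # signature before attaching it (assumption, not a step we took then):
    # wipefs --all --types bcache /dev/xvdg1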

22:09 UTC: A foreign data scraper on the engine instance (we read in pricing data from nearly every major exchange) throws an “overlap mismatch”. This means that, when writing new trades, the data on disk did not mesh perfectly with what was in memory. We begin investigating and repairing the data from our redundant scrapers, not aware of the bcache issue.

23:02 UTC: A read of historical data from the quote table fails. This causes the engine team serious concern. We begin to verify all tables on disk to ensure they match memory. Several do not. We realize we can no longer trust the disk, but we aren’t sure why.

We begin snapshotting the volume every minute to aid in a rebuild, and our engine developers start copying all in-memory data to a fresh volume.
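
For reference, minutely volume snapshots on AWS can be driven by a loop as simple as the one below; the volume ID is a placeholder:

    # Illustrative only: the volume ID is a placeholder.
    # Snapshot the engine's EBS data volume once a minute while rebuilding.
    while true; do
      aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "engine-data $(date -u +%Y%m%dT%H%M%SZ)"
      sleep 60
    done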

23:05 UTC: We schedule an engine suspension. To give traders time to react, we set the downtime for 23:30 UTC and send out this notice. We initially assume this is an EBS network issue and plan to migrate to a new volume.

23:30 UTC: The engine suspends and we begin shutting down processes, dumping all of their contents to disk. At this point we believe we have identified the cause of the corruption (bcache disk mounting).

Satisfied that all data is on multiple disks, we shut down the instance, flushing its contents to disk, and wait for it to come back up.

It doesn’t. We perform the usual dance (if you’ve ever seen a machine fail to boot on AWS, you know this one): unmount the root volume, attach to another instance, check the logs. No visible errors.

We take a breath and chat. This is going to be more difficult than we thought.

23:50 UTC: We decide to move up the timetable on the ZFS and instance migration. It becomes very clear that we can’t trust bcache. We already have our migration script written – we begin ticking boxes. We clone our Testnet engine, which had already been migrated to ZFS, and begin copying data to it. The new instance has 2x the CPU & 4x the RAM, and a 1.7TB NVMe drive. We’re looking forward to the increased firepower.

00:30 UTC: We migrate all the init scripts and configuration, then mount a recent backup. We have trouble getting the bcache volume to mount correctly as a regular ext4 filesystem. The key is recalling that the superblock has moved 8KB forward. We mount a loopback device & start copying.
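
Roughly, the loopback mount looks like this; device and mount point names are hypothetical:

    # Illustrative only: device and mount point names are hypothetical.
    # bcache keeps its superblock in the first 8KB of the backing partition, so
    # the ext4 filesystem really starts 8192 bytes in. Expose it via a loop
    # device at that offset and mount it read-only.
    losetup --find --show --offset 8192 /dev/xvdf1   # prints e.g. /dev/loop0
    mount -o ro /dev/loop0 /mnt/old-engine-data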

We also set up an sshfs tunnel to Testnet to migrate any missing scraper data. The engine team begins recovering tables.

~01:00 UTC: We destroy and remount the pool to work around EBS<->S3 prewarming issues. While the files copy, we begin implementing our new ZFS-based backup scheme, replicating minutely snapshots to another instance as we work. This becomes valuable several times as we verify data.
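
The replication itself is little more than a zfs send/recv of each new snapshot; the dataset, snapshot, and host names below are placeholders:

    # Illustrative only: dataset, snapshot, and host names are placeholders.
    # Ship each new minutely snapshot to a standby instance. Assumes the
    # previous snapshot already exists on the receiving side.
    PREV='engine/data@minute-0059'
    CURR='engine/data@minute-0100'
    zfs snapshot "$CURR"
    zfs send -i "$PREV" "$CURR" | ssh standby-host zfs recv -F standby/engine-data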

~02:00 UTC: The copy has finished and the zpool is ready to go. Bcache trashed blocks all over the disk, so the engine team begins recovering from backup. This is painstaking work, but between all the backups we had taken, we have all the data.

~03:00 UTC: The backfill is complete and we are verifying data. Everything looks good. We didn’t lose a single execution. Relief starts flooding through the room. We start talking timetables. We partition the local NVMe drive into a 2GB ZIL & 1.7TB L2ARC and attach it to the pool to get ready for production trading.
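
That attachment is a two-liner once the drive is partitioned; the device and pool names here are hypothetical:

    # Illustrative only: device and pool names are hypothetical.
    # Split the local NVMe drive into a small ZIL (SLOG) and a large L2ARC,
    # then attach both to the existing pool.
    parted --script /dev/nvme0n1 \
      mklabel gpt \
      mkpart zil   1MiB 2GiB \
      mkpart l2arc 2GiB 100%
    zpool add engine log   /dev/nvme0n1p1   # dedicated intent log (ZIL)
    zpool add engine cache /dev/nvme0n1p2   # second-level ARC on local SSD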

03:05 UTC: We bring the site back online, scheduling unsuspension at 03:45 UTC.  Our support team begins telling customers the new timeline. Chat comes back on.

03:45 UTC: The engine unsuspends and trading restarts. Fortunately, the Bitcoin price has barely moved over these four hours. We consider our place in the world.

 

Postmortem

While we prepared for this event, actually experiencing it was quite different.

Over the next two days, the team communicated constantly. We wrote lists of everything that went wrong: where our alerting failed, where we could introduce additional checksumming, how we might stream trade data to another instance and increase the frequency of backups. We introduced more fine-grained alerts up and down the stack, and began testing them.

To us, this particular episode is an example of an “unknown unknown”. Modern-day stacks are too large, too complicated, for any single person to fully understand every single aspect. We had tested this migration, but we had failed to adequately replicate the exact scenario. The best game to play is constant defense:

  1. Don’t touch production.
  2. Really, don’t touch production.
  3. Treat in-service instances as immutable: clone, modify, test, switch.

As we scale over the coming months, we will be implementing more systems toward this end, toward the eventual goal of having an infrastructure resilient to even multiple-node failures. We want to deploy a Simian Army.

Already, we are making improvements:

  • Moving to ZFS itself was a long-planned and significant step that affords us significantly improved data consistency guarantees, much more frequent snapshotting, and better performance.
  • We are developing automated tools to re-check data integrity at intervals (outside of our existing checks + ZFS checksumming), and to identify problems sooner; a minimal sketch of one such check follows this list.
  • We have reviewed every aspect of our alerting system, reworking several gaps in our coverage and implementing many more fail-safes.
  • We have greatly expanded the number of jobs covered under Dead Man’s Snitch, a service that has proven invaluable over the last few years.
  • We have implemented additional backup destinations and re-tested. We are frequently replicating data across continents and three cloud providers.
  • We continue to implement new techniques for increasing the repeatability of our architecture, so that major pieces can be torn down and rebuilt at-will without significant developer knowledge.
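
As one example of the integrity re-checks mentioned above, a periodic job along these lines scrubs the pool and then checks in, so that silence (a missed check-in) is itself an alert. The pool name, check-in URL, and mail address are placeholders:

    # Illustrative only: pool name, check-in URL, and mail address are placeholders.
    zpool scrub engine
    while zpool status engine | grep -q 'scrub in progress'; do sleep 60; done
    if zpool status -x engine | grep -q 'is healthy'; then
      curl -fsS https://nosnch.it/example-snitch   # Dead Man's Snitch-style check-in
    else
      zpool status -v engine | mail -s 'engine zpool scrub found errors' ops@example.com
    fi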

Thanks to our great customers for being understanding while we were down, and for continuing to support us.