Postmortem: Downtime, July 14, 2017

Traders,

On July 14, 2017, we suffered a brief downtime when a runaway ZFS snapshot process froze disk I/O on the trading engine. No data was lost. While the outage itself was minor and required only a host reboot, we took additional time to re-verify data, clean up ZFS snapshots, and fix the underlying issue.
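For the curious, the snapshot cleanup itself is simple ZFS administration. A minimal sketch, assuming a hypothetical pool and snapshot naming scheme (not our production layout):

    # List snapshots oldest-first, with the space each one holds
    zfs list -r -t snapshot -o name,used,creation -s creation tank/engine

    # Destroy a leftover snapshot created by the runaway process
    zfs destroy tank/engine@autosnap-2017-07-14-050000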

We apologize for the disruption.

If you are interested in our recent migration to ZFS, please see this post.

Postmortem: Downtime, July 5, 2017

Traders,

On July 5, 2017, we suffered a prolonged downtime – our longest since launch in November 2014 – due to a server issue. Trading was suspended from 23:30 UTC until 03:45 UTC, for a total suspension of 4 hours and 15 minutes.

Those of you who trade with us know that we take our uptime very seriously, and the record shows it. Before this month, we had not had a single month with less than 99.9% uptime, with our longest 100% streak reaching nearly 300 days.

So what happened?

The crypto market is exploding, as many of you know. While we have one of the most sophisticated trading engines in the industry, its focus has always been on correctness (remargining positions continuously, auditing every trade) rather than speed. This was a winning strategy from 2014 to 2016, and we’ve never lost an execution, but as volume hit record levels at the beginning of this year, requests started to queue up.

 

Optimizing the BitMEX Trading Engine

We started optimizing. The web layer, up to this point, hadn’t had any issues – we could always scale it horizontally – but the engine could not (at this time) be horizontally scaled. We partnered with Kx, the makers of kdb+, which powers our engine. We began testing new storage subsystems and server configurations. We settled on an upgrade plan, set for five days hence (July 11), and began testing the switchover. We simulated the switchover three times, each time setting a timer so that we could best estimate our downtime. The plan was:

  • Move to a larger instance with a faster local SSD, and
  • Move from bcache + ext4 to ZFS.

Some more details on those actions:

  • EBS is slow. So we would move the trading engine from an AWS c3.xlarge, which we used for its fast local SSDs in combination with bcache, to an i3.2xlarge. This gives us far faster local SSDs and nearly 20x the local SSD storage, so we can easily cache our entire data set.
  • ZFS gives us some distinct advantages over other filesystems:
    • ZFS checksums individual blocks, preventing data rot. It can be scheduled to automatically check & repair drives (this is called a scrub), and can be configured to alert on varied criteria. This goes a long way toward ensuring the continued integrity of our data.
    • ZFS allows us to easily mirror and replicate our data across multiple volumes and physical locations.
    • ZFS snapshots are cheap, especially compared to traditional backup systems that must check the size & modified time of every file; in our testing, we can snapshot as often as every second (!) without any significant performance regression.
    • Kdb+ data is stored in a columnar fashion, like so:
      trade
      ├── foreignNotional
      ├── grossValue
      ├── homeNotional
      ├── price
      ├── side
      ├── size
      ├── symbol
      ├── tickDirection
      ├── timestamp
      └── trdMatchID
    • This data is highly compressible – in practice we see compression rates approaching 4x. This directly translates to less data over the wire to EBS, faster checkpointing, and lower latency on the write log. For example, du can show the “apparent size”, that is, the size the OS thinks these files are, versus the actual space usage:
      /u/l/b/e/d/h/execution $ du --apparent-size -h
      955M .
      /u/l/b/e/d/h/execution $ du -h
      268M .
    • ZFS has the concept of the ARC (fast in-memory caching, an adaptive combination of MFU and MRU caches; in practice, the MFU cache is better for our use case), and the L2ARC, which provides a second-level spillover of this data, ideally to fast local SSD. The L2ARC is even compressed, leading to some eye-popping metrics:
      L2 ARC Size: (Adaptive)       1.17 TiB
      Compressed:            33.74% 403.90 GiB 
      Header Size:            0.08% 931.12 MiB
    • ZFS snapshots are amazing, and easy to code for. This allows us to do things that would be impossible otherwise, such as automatically snapshotting the engine data before and after any code changes. This is only practically possible because of the instant nature of snapshots (see the sketch after this list).

I could go on. We’re ZFS superfans.
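For readers who want to try this at home, here is a rough sketch of the kind of setup described above: a mirrored pool, lz4 compression for the columnar kdb+ data, frequent snapshots, and a regular scrub. Pool names, devices, and schedules are illustrative, not our production configuration.

    # Mirrored pool across two EBS volumes; a dataset with lz4 compression for kdb+ data
    zpool create -o ashift=12 tank mirror /dev/xvdf /dev/xvdg
    zfs create -o compression=lz4 -o atime=off tank/engine

    # Check the compression ratio actually achieved
    zfs get compressratio tank/engine

    # crontab entries: a snapshot every minute, and a scrub every Sunday at 03:00
    * * * * * /sbin/zfs snapshot tank/engine@$(date -u +\%Y\%m\%dT\%H\%M)
    0 3 * * 0 /sbin/zpool scrub tank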

 

What Went Wrong

As Donald Rumsfeld once said:

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.

We had the plan ready to go, checklists ready, and we had simulated the switchover a few times. We started preparing a zpool for use with the production engine.

Here’s where it went wrong.

19:47 UTC: We create a mirrored target zpool that would become the engine’s new storage. To avoid affecting I/O performance on the running engine, we snapshot the data storage drive, then remount it to the instance. This is not something we did in our test runs.

Bcache, if you haven’t used it before, is a tricky beast. It actually moves the superblock of a partition up by 8 KiB and uses that space for its own metadata. One piece of this metadata is a UUID, so bcache can identify unique drives. That makes perfect sense in the physical world. It’s in the virtualized world that this becomes a problem. What happens when you snapshot a volume – bcache superblock and all – and attach it?

Without any interaction, the kernel automatically attached the drive, figuring it was also the backing device of the existing (running) bcache device, and appeared to start spreading writes randomly across both devices. As you can imagine, this is a problem, and it began to trash the filesystem minute by minute, but we didn’t know it was doing this. It seemed odd that a bcache1 device had appeared, but we were not immediately alarmed. No errors were thrown, and writes continued to succeed. We start migrating data to the zpool.
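A sketch of how the duplication shows up, assuming the bcache-tools package and hypothetical device names: the snapshot-derived volume carries the same bcache UUIDs as the live backing device, so the kernel treats the two as one.

    # Both the live volume and the one created from the snapshot report identical UUIDs
    bcache-super-show /dev/xvdf1
    bcache-super-show /dev/xvdj1   # volume created from the snapshot

    # Inspect the resulting block-device topology (bcache0, bcache1, ...)
    lsblk -o NAME,TYPE,SIZE,MOUNTPOINT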

22:09 UTC: A foreign data scraper on the engine instance (we read in pricing data from nearly every major exchange) throws an “overlap mismatch”. This means that, when writing new trades, the data on disk did not mesh perfectly with what was in memory. We begin investigating and repairing the data from our redundant scrapers, not aware of the bcache issue.

23:02 UTC: A read of historical data from the quote table fails. This causes the engine team serious concern. We begin to verify all tables on disk to ensure they match memory. Several do not. We realize we can no longer trust the disk, but we aren’t sure why.

We begin snapshotting the volume every minute to aid in a rebuild, and our engine developers start copying all in-memory data to a fresh volume.

23:05 UTC: We schedule an engine suspension. To give traders time to react, we set the downtime for 23:30 UTC and send out this notice. We initially assume this is an EBS network issue and plan to migrate to a new volume.

23:30 UTC: The engine suspends and we begin shutting down processes, dumping all of their contents to disk. At this point we believe we have identified the cause of the corruption (bcache disk mounting).

Satisfied that all data is on multiple disks, we shut down the instance, flushing its contents to disk, and wait for it to come back up.

It doesn’t. We perform the usual dance (if you’ve ever seen a machine fail to boot on AWS, you know this one): detach the root volume, attach it to another instance, check the logs. No visible errors.

We take a breath and chat. This is going to be more difficult than we thought.

23:50 UTC: We decide to move up the timetable on the ZFS and instance migration. It is now very clear that we can’t trust bcache. We already have our migration script written – we begin ticking boxes. We clone our Testnet engine, which had already been migrated to ZFS, and begin copying data to it. The new instance has 2x the CPU & 4x the RAM, and a 1.7TB NVMe drive. We’re looking forward to the increased firepower.

00:30 UTC: We migrate all the init scripts and configuration, then mount a recent backup. We have trouble getting the bcache volume to mount correctly as a regular ext4 filesystem. The key is recalling that the bcache superblock occupies the first 8 KiB, so the ext4 data starts 8 KiB into the partition. We mount a loopback device at that offset & start copying.
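That step, roughly, assuming hypothetical device and mount-point names:

    # The ext4 filesystem sits 8 KiB into the partition, past the bcache superblock
    losetup --find --show --read-only --offset 8192 /dev/xvdf1   # prints e.g. /dev/loop0
    mount -o ro /dev/loop0 /mnt/recovery

    # Copy the data onto the new ZFS dataset
    rsync -a /mnt/recovery/ /tank/engine/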

We also set up an sshfs tunnel to Testnet to migrate any missing scraper data. The engine team begins recovering tables.

~01:00 UTC: We destroy and remount the pool to work around EBS<->S3 prewarming issues. While the files copy, we begin implementing our new ZFS-based backup scheme, replicating minutely snapshots to another instance as we work. This proves valuable several times as we verify data.
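The replication scheme is easy to sketch with zfs send/receive, assuming hypothetical snapshot names and a standby host: after one full send, each minutely snapshot ships only the delta.

    # One-time full copy of the dataset to the standby instance
    zfs send tank/engine@0100 | ssh standby zfs receive -F backup/engine

    # Every minute thereafter: send only the changes between the last two snapshots
    zfs snapshot tank/engine@0101
    zfs send -i tank/engine@0100 tank/engine@0101 | ssh standby zfs receive backup/engine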

~02:00 UTC: The copy has finished and the zpool is ready to go. Bcache trashed blocks all over the disk, so the engine team begins recovering from backup. This is painstaking work, but between all the backups we had taken, we have all the data.

~03:00 UTC: The backfill is complete and we are verifying data. Everything looks good. We didn’t lose a single execution. Relief starts flooding through the room. We start talking timetables. We partition the local NVMe drive into a 2GB ZIL & 1.7TB L2ARC and attach it to the pool to get ready for production trading.
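That last step looks roughly like this, with illustrative device names: a small slice of the NVMe drive becomes the separate intent log (SLOG) and the rest becomes the L2ARC.

    # Partition the local NVMe drive: ~2GB for the intent log, the remainder for L2ARC
    parted -s /dev/nvme0n1 mklabel gpt mkpart slog 1MiB 2GiB mkpart l2arc 2GiB 100%

    # Attach both to the existing pool
    zpool add tank log /dev/nvme0n1p1
    zpool add tank cache /dev/nvme0n1p2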

03:05 UTC: We bring the site back online, scheduling unsuspension at 03:45 UTC.  Our support team begins telling customers the new timeline. Chat comes back on.

03:45 UTC: The engine unsuspends and trading restarts. Fortunately, the Bitcoin price has barely moved over these four hours. We consider our place in the world.

 

Postmortem

While we had prepared for an event like this, actually experiencing it was quite different.

Over the next two days, the team communicated constantly. We wrote lists of everything that went wrong: where our alerting failed, where we could introduce additional checksumming, how we might stream trade data to another instance and increase the frequency of backups. We introduced more fine-grained alerts up and down the stack, and began testing them.

To us, this particular episode is an example of an “unknown unknown”. Modern-day stacks are too large, too complicated, for any single person to fully understand every single aspect. We had tested this migration, but we had failed to adequately replicate the exact scenario. The best game to play is constant defense:

  1. Don’t touch production.
  2. Really, don’t touch production.
  3. Treat in-service instances as immutable: clone, modify, test, switch.

As we scale over the coming months, we will be implementing more systems toward this end, toward the eventual goal of having an infrastructure resilient to even multiple-node failures. We want to deploy a Simian Army.

Already, we are making improvements:

  • Moving to ZFS itself was a long-planned and significant step that affords us significantly improved data consistency guarantees, much more frequent snapshotting, and better performance.
  • We are developing automated tools to re-check data integrity at intervals (outside of our existing checks + ZFS checksumming), and to identify problems sooner.
  • We have reviewed every aspect of our alerting system, reworking several gaps in our coverage and implementing many more fail-safes.
  • We have greatly expanded the number of jobs covered under Dead Man’s Snitch, a service that has proven invaluable over the last few years.
  • We have implemented additional backup destinations and re-tested. We are frequently replicating data across continents and three cloud providers.
  • We continue to implement new techniques for increasing the repeatability of our architecture, so that major pieces can be torn down and rebuilt at-will without significant developer knowledge.

Thanks to our great customers for being understanding while we were down, and for continuing to support us.

Additional Withdrawal Time at 10:00 UTC

We have received several support tickets asking about a special early withdrawal period, so that users may claim entry in the Byteball Fair Initial Distribution, which takes place at 13:10 UTC.

To support this, we will be initiating early withdrawals at 10:00 UTC tomorrow. No opt-in is necessary; all withdrawals will be processed if confirmed before that time. The usual time at 13:00 UTC will be honored as well. If you wish to participate in this distribution, we recommend hitting the 10:00 UTC cutoff so you have sufficient time for the transaction to confirm.

QTUM Futures Now Live

BitMEX is proud to announce the launch of the QTUM futures contract, QTUMU17, expiring 29 September at 12:00 UTC. Each contract is worth 1 QTUM and offers 2x leverage.

Since the QTUM platform is still under development, the following rules will apply:

  • QTUMU17 will have a 25% Up and Down Limit against the previous session close price to prevent price manipulation. Each session is 2 hours long, and session closes occur every even-numbered hour. (An example follows this list.)
  • Settlement will occur either at the ICO price (if QTUM/XBT trading has not begun) or at the .QTUMXBT30M Index Price if QTUM/XBT has begun trading prior to 28 September 12:00 UTC.
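For illustration with a purely hypothetical price: if a session closes at 0.000400 XBT, QTUMU17 may trade between 0.000300 and 0.000500 XBT during the following two-hour session.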

Further details about this contract can be read in the QTUM Series Guide.

Update on OKCoin Market Disruption Event – Removal Expedited

Traders,

Due to a quicker than expected price divergence on OKCoin International, we are moving the timetable forward for the removal of OKCoin International and the incorporation of GDAX into the index.

The new timetable is:

  • At 21:45 UTC, GDAX will be added to the index. At this time, the index will have three constituents.
  • At 22:00 UTC, OKCoin International will be removed.

For more information, please see our previous post on the removal of OKCoin International.

Market Disruption Event: OKCoin International

Yesterday, OKCoin International announced USD deposits have been blocked:

Starting from today (April 18th, 2017), OKCoin would temporarily suspend USD deposit because of the issues with intermediary banks. Please do not make further deposit as your wires may be rejected by intermediary banks. We are now actively looking for alternatives to resume deposit as soon as possible. Your current account balance remains unaffected. We are sorry for any inconvenience caused.

For this reason, we are weighting OKCoin Intl to 0 in the .BXBT Index, effective 20 April at 08:00 UTC. To re-distribute the index, GDAX will be reinstated as an equal member.

The new distribution will be equally weighted between GDAX and Bitstamp. For reference, this change is live on Testnet and can be used for intermediate pricing data.

Additionally, we will be announcing new price protection mechanisms for BitMEX indices to prevent further bad pricing issues.

Update: Due to rapid price divergence, the timetable has been moved forward to 19 Apr at 22:00 UTC.

Market Disruption Event: Bitfinex

Just recently, Bitfinex announced that USD deposits will be rejected until further notice. In combination with their previous notice blocking USD withdrawals, this means that Bitfinex is no longer a viable USD/Bitcoin exchange, and we expect the pricing discrepancy between Bitfinex and other exchanges to increase as traders attempt to withdraw via cryptocurrencies.

For this reason, we are weighting Bitfinex to 0 in the .BXBT Index, effective at 16:00 UTC today (30 minutes from the time of this post). In combination with the prior temporary suspension of GDAX from the index due to pricing discrepancies, this means that for the time being, the old .XBT index and the new .BXBT index will print the same prices.

Market Disruption Event: GDAX

At 23:02 UTC on 15 April 2017, one constituent of our .BXBT Index, GDAX, reported a trade print of $0.06 / XBT. This fed into the .BXBT Index and caused the price to temporarily move down to $888.48 / XBT which led to a number of users having their positions liquidated.

This was not a BitMEX engine or pricing issue. However, we strive to create a fair platform where users are not unfairly disadvantaged due to an error on another exchange, even if this error was an official price. As such, BitMEX will be refunding those users who were unfairly liquidated due to the pricing discrepancy from GDAX out of our own company funds.

Those users who had their positions liquidated will see the loss between $1183.00 / XBT and their liquidation price transferred back to their BitMEX Bitcoin wallet. Positions lost due to liquidation will not be reinstated.

For the time being, GDAX will be weighted at 0 in the .BXBT index until we have built in sufficient outlier protections.

On Potential Post-Fork Contract Settlement

Traders,

Recently, we published A Statement on the Possible Bitcoin Unlimited Hard Fork, a statement of our views on the potential fork to Bitcoin Unlimited, its consequences, and further requirements we consider necessary for adoption.

Many have asked us about the settlement of our existing Bitcoin futures: the Bitcoin/USD series (XBT), the Bitcoin/CNY series (XBC), and the Bitcoin/JPY series (XBJ).

In the event of a fork in which both chains remain viable into the future and maintain double-digit percentages of the original Bitcoin hash rate (a “Contentious Fork”), we will take the following actions:

Contracts

  • As we predict the value of Bitcoin will then be split between BTC and BTU, futures already listed at the time of the fork will settle on the sum of the BTC and BTU prices (a simple illustration follows this list).
    • We may not be able to get reliable pricing data from our current Index exchanges, or they may not list the minor coin at all. In the event of a Contentious Fork, BitMEX reserves the right to move all Bitcoin derivatives to Last Price Protected Marking until a stable index can be composed.
    • We will compose two indices representing the majority and minority chain, and the sum will be taken to compose the Mark and Settlement Prices. The indices will be separated in case not all component exchanges list the minority chain.
  • Contracts listed after the fork will settle on the BTC or the BTU price, but not both. Only contracts listed pre-fork will settle on the sum.
  • Perpetual swap contracts will be timed to switch underlying indices in tandem with a futures contract. Ample notice will be given. Like futures, the new index will reference only one chain.
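To illustrate with purely hypothetical prices: if at settlement the majority-chain index prints 2,000 USD and the minority-chain index prints 400 USD, a future listed before the fork would settle at 2,400 USD, while a contract listed after the fork would settle against only one of the two chains (2,000 USD or 400 USD).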

Wallets

  • During the time immediately after the fork, BitMEX reserves the right to suspend withdrawals to avoid replay attacks and double-spending and account for the development effort required to accommodate a hard fork.
  • Users will be able to withdraw the minor currency, but not deposit it. We have no plans to support multiple margin currencies. Balances of the minor currency will be calculated via a snapshot at the time of the fork and maintained separately from the major currency’s margin balance, as further mixing of the currencies thereafter could lead to improper attribution.

A Statement on the Possible Bitcoin Unlimited Hard Fork

As noted in the multi-exchange hard-fork contingency plan, there is significant doubt that a Bitcoin Unlimited (BU) hard fork could be executed safely without additional development work.

In the case of a fork, we support the plan as proposed by Bitfinex, Bitstamp, BTCC et al.

It will not be possible for any exchange, including BitMEX, to support both chains separately. For these reasons, BU will not be listed or used as a deposit/withdrawal currency until replay protection is implemented and BU is not at risk of a blockchain reorganization if the Core chain becomes longer.

If the BU fork does succeed, we intend to take every possible step to ensure the safety and integrity of customer deposits on both chains. As BitMEX does not offer margin lending, there is no concern about Bitcoin in active positions at the time of the fork.

Notice Regarding Bitcoin/USD Products (Index, Tick Size)

Bitcoin / USD 30 June 2017 Futures Contract

The BitMEX Bitcoin / USD 30 June 2017 Futures Contract (XBTM17) listed today, 17 March 2017 at 12:00 UTC. This contract is similar to XBTH17, but uses a new index, described below.

.BXBT: The New BitMEX Bitcoin / USD Index

.BXBT is an equally weighted index using the Bitcoin / USD spot price from the following exchanges:

  • Bitfinex
  • Bitstamp
  • GDAX
  • OKCoin International

XBTM17 uses the .BXBT index. Any exchange that is down or displays stale pricing data for 15 minutes or more will be removed temporarily from the index. Once the price feed has been operational for at least 5 minutes, we will reinstate the exchange.

A page detailing each constituent’s individual price and history will be live soon.

The BitMEX Bitcoin / USD 31 March 2017 Futures Contract, XBTH17, will continue to use the existing Bitcoin / USD Index (Symbol: .XBT; Weights: 50% Bitstamp, 50% OKCoin International) until it expires.

The BitMEX Bitcoin / USD Swap, XBTUSD, will continue to use the existing Bitcoin / USD Index (.XBT) until 31 March 2017 12:00 UTC. It will then switch to the new index. This is the same moment that XBTH17 expires.

Increase to Bitcoin / USD Products’ Tick Size

Also effective 31 March 2017 12:00 UTC, the tick size for Bitcoin/USD products (XBTUSD, XBTM17) will change from 0.01 USD to 0.1 USD.