Postmortem: Downtime, 19 May 2020

On 19 May, the BitMEX trading engine experienced unscheduled downtime between 12:00 UTC and 13:40 UTC as a result of an unexpected server restart. Before we go into the details of our investigation, we want to start by reiterating our apologies to all those affected by this event. We fully understand the significant impact such events have on our customers. This blog post outlines the steps we continue to take to ensure we’re doing everything possible to minimise the risk of downtime in the future. We also provide some guidance on how users can manage risks associated with downtime events.

We also want to reiterate that at no point during this event were any customer funds at risk, no liquidations occurred while the exchange was offline, and all pending and new customer withdrawals were processed within 90 minutes of coming back online. 

Here’s a summary of what happened and how we responded:

  1. At 12:00 UTC, the trading engine server unexpectedly restarted, taking it offline.
  2. At 12:01 UTC, our Engineering and DevOps teams initiated our incident response procedure.
  3. At 12:13 UTC, our customers received live alerts via https://status.bitmex.com and the official BitMEX Telegram channel. A series of updates was then issued across all platforms throughout the event.
  4. At 12:20 UTC, trading engine services were partially recovered and our teams started going through the next steps required for resuming operations.
  5. At 12:38 UTC, the trading engine server abruptly restarted a second time, prompting our teams to trigger another recovery procedure designed to migrate the trading engine to a standby server. We were able to complete this process within 22 minutes using a new failover mechanism introduced earlier this year.
  6. At 12:41 UTC, our cloud provider confirmed that both server restarts were associated with underlying hardware issues.
  7. At 13:04 UTC, the trading engine services were successfully restored.
  8. At 13:23 UTC, the trading platform was brought back online in “Market Suspended and Cancel-Only mode”, and notifications were sent to our customers regarding the resumption of all trading operations at 13:40 UTC.
  9. At 13:40 UTC, trading resumed successfully.
  10. Withdrawals were processed at 14:00 UTC and 15:00 UTC.
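
For readers who want to check the durations implied by the timeline above, here is a minimal sketch that derives them from the stated timestamps. The helper names (`utc`, `minutes_between`) are illustrative only and are not part of any BitMEX tooling.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on 19 May 2020 as a UTC datetime."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 5, 19, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps on the same day."""
    return int((utc(end) - utc(start)).total_seconds() // 60)


# First restart until trading resumed.
total_outage = minutes_between("12:00", "13:40")
# Cancel-Only mode until trading resumed.
cancel_only_window = minutes_between("13:23", "13:40")
# First restart until the first customer alert.
time_to_first_alert = minutes_between("12:00", "12:13")

print(total_outage, cancel_only_window, time_to_first_alert)  # 100 17 13
```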

Could this happen again?

BitMEX runs a 24/7 exchange and, although we strive to offer perfect uptime and uninterrupted service, as with any exchange we cannot guarantee it. As we review the stability of our systems, covering both our internal processes and our use of third-party providers, we are taking steps to minimise the risk of downtime. Note that we provide full transparency of service uptime via our Status Page, with information available for the preceding year.

What is BitMEX doing to ensure the stability of its platform?

During 2017-2019, most of our engineering resources were allocated towards scaling the exchange [1] [2]. This was necessary to manage the persistent increase in demand during those months of extreme volatility and to provide the best quality of service for all customers. Since late 2019, our engineering resources have been increasingly focused on ensuring high availability and platform resiliency, while maintaining and continuously increasing the capacity of the platform.

With that dual focus in mind, our engineers have been designing and implementing a number of architectural improvements. These improvements significantly reduce the impact that software and hardware failures have on the platform and reduce the time required to complete equivalent failovers. The end-state objective is to achieve near-zero downtime when handling events impacting a single availability zone and less than an hour of downtime when handling events impacting an entire region.

In parallel to the architectural work, we have been simulating outages weekly in a non-production environment in order to verify the correctness of our procedures, improve our familiarity in executing them, and increasingly automate them. One recent and visible change to infrastructure took place on May 14, in direct response to the March 13 degradation of service. After weeks of design sessions, consultations with our cloud provider, and production-like load testing, our teams replaced the technology behind our primary user database, improving its recovery time 4x and opening opportunities to scale it 15x over the next few months, all in place and without interrupting trading. Similarly, our engineers ship performance improvements and new features by delivering updates to our trading engine services several times a week, completely transparently to users.

Finally, to fuel our continuous efforts on resilience and disaster recovery and as part of our commitment to providing a best-in-class trading platform, we have been growing our teams aggressively and continue to do so. If you’re up for the challenge and think you have the requisite skills, take a look at our various career openings.

What is BitMEX doing to enhance processes for reopening of its platform?

Following a review of previous downtime events and customer feedback, we have amended our procedures for reopening the trading platform following unscheduled events, including an updated protocol for Cancel-Only mode (which had previously been used following scheduled maintenance). The 17-minute Cancel-Only period before full functionality resumed after the 19 May event saw 38,437 cancel-order instructions. We believe this provided a significant improvement in user experience compared to previous unscheduled events. We will continue to review customer comments on this process, refine it where necessary, and provide updates on any new procedures.

How can I protect myself against the risks associated with downtime events?

The BitMEX trading platform provides a number of sophisticated tools and settings through which it is possible for traders to control the risk posed to them by downtime events.

For traders who use our API, we offer a “Dead Man’s Switch” feature, which enables traders to set a timeout after which their orders are cancelled if they are unable to reach the exchange. For implementation details and an example of setting a timeout, please see the documentation for this feature.
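
To illustrate the idea behind a dead man's switch, here is a minimal local model of the countdown logic. It does not call the BitMEX API. The class `DeadMansSwitch`, its methods, and the 60-second timeout with a 15-second refresh interval are assumptions chosen for illustration; the actual endpoint, parameters, and recommended intervals should be taken from the API documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadMansSwitch:
    """Hypothetical local model of a server-side cancel timer."""

    # Time (in seconds) at which resting orders would be cancelled; None = disarmed.
    deadline: Optional[float] = None

    def arm(self, now: float, timeout_s: float) -> None:
        """Client heartbeat: restart the countdown. A timeout <= 0 disarms it."""
        self.deadline = now + timeout_s if timeout_s > 0 else None

    def should_cancel(self, now: float) -> bool:
        """True once the countdown has expired without being refreshed."""
        return self.deadline is not None and now >= self.deadline


switch = DeadMansSwitch()

# While connected, the client refreshes a 60 s timeout every 15 s.
for t in (0, 15, 30):
    switch.arm(now=t, timeout_s=60)

# The client loses connectivity after t = 30 s and sends no more heartbeats,
# so the last deadline (30 + 60 = 90 s) stands.
print(switch.should_cancel(now=80))  # False
print(switch.should_cancel(now=90))  # True
```

The design point is that the client keeps pushing the deadline forward while it is healthy, so orders are only cancelled when the heartbeats stop.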

Another useful feature of the BitMEX platform is the ability to set the trigger price of a stop order to reference Last Price, Mark Price or Index Price. When the exchange is reopened after a downtime event, dramatic swings in the Last Price can take place. To avoid stop orders being triggered by any dramatic swings in Last Price, traders can set their stops to trigger based on the Mark Price or Index Price, rather than the Last Price.
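
The difference between trigger references can be sketched with a small example. The function `stop_triggered` and all prices below are hypothetical and only model the comparison a stop order performs against its chosen reference price; they are not taken from the 19 May event or from the BitMEX API.

```python
def stop_triggered(side: str, stop_px: float, reference_px: float) -> bool:
    """Return True if a stop order would trigger at the given reference price.

    A sell stop triggers when the reference price falls to or below the stop
    price; a buy stop triggers when it rises to or above the stop price.
    """
    if side == "Sell":
        return reference_px <= stop_px
    if side == "Buy":
        return reference_px >= stop_px
    raise ValueError(f"unknown side: {side!r}")


# Hypothetical prices just after reopening: Last Price briefly spikes down,
# while Mark Price and Index Price stay close to the wider market.
prices = {"LastPrice": 9400.0, "MarkPrice": 9710.0, "IndexPrice": 9712.0}
stop_px = 9500.0  # sell stop protecting a long position

results = {ref: stop_triggered("Sell", stop_px, px) for ref, px in prices.items()}
print(results)
# {'LastPrice': True, 'MarkPrice': False, 'IndexPrice': False}
```

In this scenario only the stop referencing Last Price fires, which is the behaviour the paragraph above suggests avoiding.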

These are just two example features of the BitMEX platform that allow our traders to control risk and reduce the impact of downtime. Our Support team can provide further information about these (and any other) risk management tools on the platform.

Conclusion

The cryptocurrency industry has come a long way in a short amount of time. We know that the expectations on us have risen and we’re working 24/7 to further improve the resiliency of our platform. We hope this blog post provides all users with a clear description of this downtime incident and the concrete and urgent steps we’re taking to enhance platform stability and our overall trading experience.

As always, we encourage any concerned customers and those impacted by this downtime to contact Support.