Postmortem: Downtime, 21 September 2022

On 21 September, we encountered an issue with our platform infrastructure that caused connectivity problems within our systems. 

As regularly communicated during the downtime via our Status Page and social media channels, all of our users’ funds remained safe at all times. 

Following a comprehensive review of what happened, we want to provide our users with a more detailed summary of the incident, set out below. Should you have any additional questions, you can, as always, reach us by contacting Support.

What happened on 21 September?

Core components of our computing platform entered a split-brain situation. A split-brain situation occurs when more than one computing resource or service believes it is the leader, the leader being the only one responsible for data consistency in a redundant, high-availability setup.

This caused an inconsistent state in the underlying infrastructure, and in turn resulted in multiple other services being intermittently unreachable.
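
To illustrate the idea of split-brain in general terms, the sketch below shows one common safeguard: a node only acts as leader if it, together with the peers it can still reach, forms a strict majority of the cluster. This is a generic illustration and not a description of the components or election mechanism involved in this incident; the function name and the five-node cluster are assumptions made for the example.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    Hypothetical helper for illustration only.
    """
    return (reachable_peers + 1) * 2 > cluster_size


# A five-node cluster is split by a network partition into groups of 3 and 2.
cluster_size = 5
majority_side_may_lead = has_quorum(reachable_peers=2, cluster_size=cluster_size)
minority_side_may_lead = has_quorum(reachable_peers=1, cluster_size=cluster_size)

# Only one side of the partition is allowed to keep a leader, so two leaders
# cannot both accept writes at the same time.
print(majority_side_may_lead, minority_side_may_lead)
```

Without a rule of this kind, both sides of a partition can elect their own leader, which is the situation described above.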

At BitMEX, we use Cloudflare to protect our system from DDoS attempts and other malicious traffic and behaviour. Our clients send their requests to Cloudflare, which then proxies them to BitMEX’s servers. Only traffic sent directly from Cloudflare IP addresses is accepted.

Once the split-brain issue had been resolved and services were restarted, about half of our service instances were unable to authenticate with Cloudflare (not through any fault of Cloudflare). As a result, these services did not trust requests coming from Cloudflare IP addresses, and requests reaching them were rejected with 403 authentication errors.

The problem was further exacerbated by two more technical issues:

  1. Upon service startup, the Cloudflare authentication request was implemented to fail silently and default to an empty list for trusted IP addresses. Put simply, this meant we had instructed the API to not trust any requests. 
  2. The error code generated for an untrusted IP address is a 403. This is the same response that would normally be seen for regular user authentication issues, so the errors did not trigger an escalation.
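
The two issues above can be sketched in code. The example below is a minimal illustration, not our production implementation: it shows a startup check that refuses to continue with an empty or unfetchable trusted-IP list, and a rejection reason that distinguishes an untrusted proxy from a failed user login so the two can be monitored separately. All names (`load_trusted_networks`, `classify_rejection`, `TrustedProxyListError`) are hypothetical, and the address range used is a documentation range rather than a real Cloudflare range.

```python
import ipaddress
from typing import Callable, Iterable, List, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class TrustedProxyListError(RuntimeError):
    """Raised at startup when the trusted proxy ranges cannot be established."""


def load_trusted_networks(fetch: Callable[[], Iterable[str]]) -> List[Network]:
    """Fetch and parse trusted CIDR ranges, failing loudly instead of silently."""
    try:
        networks = [ipaddress.ip_network(cidr) for cidr in fetch()]
    except Exception as exc:
        # Do not fall back to an empty list: that would reject all traffic.
        raise TrustedProxyListError("could not load trusted proxy ranges") from exc
    if not networks:
        raise TrustedProxyListError("trusted proxy range list is empty")
    return networks


def is_trusted(source_ip: str, networks: List[Network]) -> bool:
    """Return True if the source address falls inside any trusted range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in networks)


def classify_rejection(
    source_ip: str, networks: List[Network], credentials_ok: bool
) -> Optional[str]:
    """Return a distinct reason for a rejection, or None if the request passes."""
    if not is_trusted(source_ip, networks):
        return "untrusted_proxy"  # infrastructure problem: should alert
    if not credentials_ok:
        return "user_auth_failed"  # ordinary user error
    return None


# Example with a documentation range standing in for the proxy's addresses.
networks = load_trusted_networks(lambda: ["192.0.2.0/24"])
print(classify_rejection("198.51.100.1", networks, credentials_ok=True))
```

With this shape, a failed fetch stops the service from starting in a state where it trusts nothing, and a spike in "untrusted_proxy" rejections can be alerted on independently of normal authentication failures.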

What impact was felt by users?

The impact manifested as intermittent access on our API, Web and Mobile services, as well as delays in Mark Price updates. Withdrawal processing and deposits were also affected. The combination of these eventually resulted in our decision to put the system into maintenance mode. 

After trading resumed, certain users trading via the API experienced a further period of intermittent connectivity.

What is being done to ensure this does not happen again?

As a 24/7 exchange, we strive to offer perfect uptime and uninterrupted services, but this cannot be guaranteed. 

We are continuously working to improve our monitoring services so that system issues such as the one experienced on 21 September are detected proactively, before they have an impact on our users. This ongoing focus aims to ensure seamless trading and an improved experience.

Note that since our inception, we have provided full transparency of service uptime via our Status Page.

How can I protect myself against the risks associated with downtime events?

The BitMEX trading platform provides a number of sophisticated tools and settings through which it is possible for traders to control the risk posed to them by downtime events.

For traders who use our API, we offer a “Dead Man’s Switch” feature, which enables traders to set a timeout for cancellation of orders in case they are unable to reach the exchange. For implementation details and an example of setting a timeout, please see the documentation for this feature here.
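
As a rough illustration of how a Dead Man's Switch request might be prepared, the sketch below builds (but does not send) a signed request. It assumes the `/api/v1/order/cancelAllAfter` endpoint with a `timeout` in milliseconds, and the `api-expires`, `api-key` and `api-signature` headers with an HMAC-SHA256 signature over verb, path, expiry and body; please confirm all of these against the API documentation before relying on them. The function name and the credentials are placeholders.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

BASE_URL = "https://www.bitmex.com"
PATH = "/api/v1/order/cancelAllAfter"  # assumed endpoint; check the docs


def build_cancel_all_after(
    api_key: str, api_secret: str, timeout_ms: int, now: Optional[float] = None
) -> dict:
    """Build a signed Dead Man's Switch request without sending it."""
    current = time.time() if now is None else now
    expires = int(current + 60)  # request validity window, in Unix seconds
    body = json.dumps({"timeout": timeout_ms}, separators=(",", ":"))
    # Assumed signing scheme: HMAC-SHA256 over verb + path + expires + body.
    message = "POST" + PATH + str(expires) + body
    signature = hmac.new(
        api_secret.encode(), message.encode(), hashlib.sha256
    ).hexdigest()
    return {
        "method": "POST",
        "url": BASE_URL + PATH,
        "headers": {
            "content-type": "application/json",
            "api-expires": str(expires),
            "api-key": api_key,
            "api-signature": signature,
        },
        "body": body,
    }


# Placeholder credentials; orders would be cancelled 60 seconds after the
# last successful call unless the timer is refreshed.
request = build_cancel_all_after("my-key", "my-secret", timeout_ms=60000)
print(request["url"], request["body"])
```

A typical pattern is to send this request on a recurring interval that is shorter than the timeout, so the timer keeps being reset while the client is connected and open orders are cancelled only if the client stops reaching the exchange.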

For enhanced protection, traders can set the trigger price of a stop order to reference Last Price, Mark Price or Index Price. When the exchange reopens after a downtime event, dramatic swings in the Last Price can take place. To avoid stop orders being triggered by such swings, traders can set their stops to trigger on the Mark Price or Index Price rather than the Last Price.
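
For API users, the choice of trigger price might look like the following sketch, which builds the payload for a stop order that triggers on the Mark Price. The field names (`ordType`, `stopPx`, `execInst`) and the trigger values are assumptions based on the order API and should be verified against the documentation; the helper name, symbol and prices are illustrative only.

```python
VALID_TRIGGERS = {"LastPrice", "MarkPrice", "IndexPrice"}


def build_stop_order(
    symbol: str,
    side: str,
    quantity: int,
    stop_price: float,
    trigger: str = "MarkPrice",
) -> dict:
    """Build a stop-order payload with an explicit trigger price reference."""
    if trigger not in VALID_TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    return {
        "symbol": symbol,
        "side": side,
        "orderQty": quantity,
        "ordType": "Stop",
        "stopPx": stop_price,
        # Selects which price the stop is evaluated against.
        "execInst": trigger,
    }


# Illustrative values: a sell stop that ignores swings in the Last Price.
order = build_stop_order("XBTUSD", "Sell", 100, 18000.0, trigger="MarkPrice")
print(order)
```

The payload would be sent as the body of an authenticated order request. Defaulting the trigger to the Mark Price in the helper is a design choice for this example, made so that a caller has to opt in explicitly to Last Price triggering.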

These are just two example features of the BitMEX platform that allow our traders to control risk and reduce the impact of downtime. Our Support team can provide further information about these (and any other) risk management tools on the platform.

Conclusion 

The cryptocurrency industry has come a long way in a short amount of time. We know that the expectations of us have risen, and we are constantly working to further improve the resilience of our platform.

We hope this blog post provides all users with a clear description of this downtime incident and the robust ongoing steps we’re taking to enhance platform stability and our overall trading experience.

As always, we encourage any concerned users and those impacted by this downtime to contact Support. All users who were liquidated as a result of the downtime on 21 September have already been contacted and compensated accordingly.