Postmortem: Downtime, 13 March 2023

On Monday 13 March 2023, our trading engine was suspended from 13:58 UTC to 14:18 UTC. The suspension was triggered by a 50,000 USDT shortfall in on-chain funds on the Tron network, caused by a duplicated customer withdrawal at 13:57 UTC.

As communicated during the downtime via our Status page and social media channels, all of our users’ funds remained safe at all times. 

Following a comprehensive review of the downtime, we are providing you, our users, with a more detailed summary of the incident as set out below. Should you have any additional questions, as always you can reach us by contacting Support.

What happened on 13 March?

At BitMEX, our systems self-audit in real time. If these audits fail, our trading systems automatically suspend in order to protect the assets of all BitMEX users while our engineering team investigates the issue and works to fix it.

As shared previously on our security page, we check our positions and margins multiple times a minute, with balances cross-checked against on-chain records. Any bug, flaw or intrusion that causes positions not to match will immediately halt our exchange.
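To illustrate the idea, the cross-check can be pictured as a comparison between the balances the exchange expects and the balances observed on-chain. The sketch below is a simplified illustration only, not our production code: the function names and the account totals are hypothetical, and only the 50,000 USDT difference is taken from this incident.

```python
from decimal import Decimal


def find_mismatches(expected, observed):
    """Return {asset: observed - expected} for every asset that does not match.

    Both arguments are hypothetical {asset: Decimal} mappings.
    """
    mismatches = {}
    for asset in sorted(set(expected) | set(observed)):
        diff = observed.get(asset, Decimal("0")) - expected.get(asset, Decimal("0"))
        if diff != 0:
            mismatches[asset] = diff
    return mismatches


def audit(expected, observed):
    """Suspend trading if any balance fails to match, otherwise keep trading."""
    mismatches = find_mismatches(expected, observed)
    status = "SUSPENDED" if mismatches else "TRADING"
    return status, mismatches


# Hypothetical totals; a duplicated withdrawal leaves on-chain funds 50,000 USDT short.
expected = {"USDT": Decimal("1000000")}
observed = {"USDT": Decimal("950000")}
status, mismatches = audit(expected, observed)
```

In this sketch a single mismatched asset is enough to return the suspended status, which mirrors the behaviour described above: the exchange halts first and engineers investigate afterwards.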

Today’s incident was triggered by a mismatch between on-chain transaction states, the interpretation of a response code, and messages from a custody provider’s API. 

A workflow allowing resubmission of a transaction was incorrectly enabled by the custody systems. This led to the manual processing of a duplicate transaction. The combination of events that allowed this was a rare edge case.

Our on-call engineers quickly identified the cause of the suspension and matched the amount to a single payment. With the root cause of the problem understood, we decided it was safe to resume trading.

The funds mismatch was made whole from BitMEX company funds. 

What was the user impact?

Whilst trading and matching were suspended, the exchange operated in cancel-only mode.

What is being done to ensure this does not happen again?

As a 24/7 exchange, we strive to offer perfect uptime and uninterrupted services; however, this cannot be guaranteed. We don’t make compromises when it comes to the safety of our users’ funds. You can read more about how we keep funds safe here.

The self-auditing and automated suspension of the exchange is an important risk-management process. It prevents a persistent mismatch between exchange deposits and liabilities. 

Moving forward, we will implement a fix so that our custody systems block any manual retry of an already completed transaction in this rare case.
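As a rough illustration of this kind of guard, the sketch below only allows a manual retry when a withdrawal is known to have failed, and refuses it for anything pending or completed. It is a minimal sketch under stated assumptions: the state names, the RetryBlocked exception and the manual_retry function are hypothetical and do not describe the actual custody systems.

```python
from enum import Enum


class TxState(Enum):
    PENDING = "pending"
    FAILED = "failed"
    COMPLETED = "completed"


class RetryBlocked(Exception):
    """Raised when a manual retry would risk sending a payment twice."""


def manual_retry(withdrawal_id, states, submit):
    """Resubmit a withdrawal only if its last attempt is known to have failed.

    `states` is a hypothetical {withdrawal_id: TxState} store and `submit`
    a hypothetical callable that broadcasts the transaction.
    """
    state = states[withdrawal_id]
    if state is not TxState.FAILED:
        # Pending or completed transactions may already be on-chain.
        raise RetryBlocked(f"{withdrawal_id} is {state.value}; retry refused")
    states[withdrawal_id] = TxState.PENDING
    submit(withdrawal_id)


states = {"w1": TxState.COMPLETED, "w2": TxState.FAILED}
sent = []
manual_retry("w2", states, sent.append)  # allowed: the last attempt failed
```

The design choice here is to refuse by default: any state other than a confirmed failure is treated as possibly paid, so an ambiguous response code cannot open the retry path.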

How can I protect myself against the risks associated with downtime events?

We provide a number of sophisticated tools and settings that allow traders to control the risk posed to them by downtime events.

For traders who use our API, we offer a “Dead Man’s Switch” feature, which enables traders to set a timeout for cancellation of orders in case they are unable to reach the exchange. For implementation details and an example of setting a timeout, please see the documentation for this feature here.
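The behaviour of such a switch can be modelled in a few lines. The sketch below is a toy model rather than the BitMEX API: the DeadMansSwitch class, its methods and the timings are hypothetical, and the real request format is described in the documentation. It assumes the client re-arms the timer with a periodic heartbeat and that orders are cancelled once the timer runs out.

```python
class DeadMansSwitch:
    """Toy model of a cancellation timer that the client must keep re-arming."""

    def __init__(self):
        self.deadline = None  # no timer armed yet

    def arm(self, now, timeout):
        """(Re)start the countdown: orders are cancelled at now + timeout."""
        self.deadline = now + timeout

    def should_cancel(self, now):
        """True once the armed timer has expired without being re-armed."""
        return self.deadline is not None and now >= self.deadline


def orders_cancelled(heartbeat_times, timeout, check_time):
    """Replay client heartbeats (in seconds) and report the state at check_time."""
    switch = DeadMansSwitch()
    for t in heartbeat_times:
        if switch.should_cancel(t):
            return True  # the timer fired before this heartbeat arrived
        switch.arm(t, timeout)
    return switch.should_cancel(check_time)


# Heartbeats every 15 s with a 60 s timeout; the client loses contact after t=45.
heartbeats = [0, 15, 30, 45]
still_live = not orders_cancelled(heartbeats, 60, 100)
cancelled = orders_cancelled(heartbeats, 60, 105)
```

In this model, choosing a heartbeat interval well below the timeout means a single missed heartbeat does not cancel orders, while a sustained loss of connectivity does.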

For enhanced protection, traders can set the trigger price of a stop order to reference Last Price, Mark Price or Index Price. When the exchange is reopened after a downtime event, dramatic swings in the Last Price can take place. To avoid stop orders being triggered by any dramatic swings in Last Price, traders can set their stops to trigger based on the Mark Price or Index Price, rather than the Last Price.
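The difference can be shown with a small worked example. The sketch below is illustrative only: the prices are hypothetical and the trigger rule is a simplified model of a sell stop, which fires when its reference price falls to or below the stop price.

```python
def sell_stop_triggered(stop_price, reference_price):
    """Simplified rule: a sell stop fires once the reference price is at or below the stop."""
    return reference_price <= stop_price


def triggered_by_reference(stop_price, prices):
    """Evaluate the same stop against each candidate reference price."""
    return {
        name: sell_stop_triggered(stop_price, price)
        for name, price in prices.items()
    }


# Hypothetical prices just after reopening: Last Price spikes down briefly,
# while Mark Price and Index Price stay close to the wider market.
prices_after_reopen = {
    "Last Price": 19500.0,
    "Mark Price": 20400.0,
    "Index Price": 20410.0,
}
result = triggered_by_reference(20000.0, prices_after_reopen)
```

With these hypothetical numbers, only the stop that references Last Price would fire; the same stop referencing Mark Price or Index Price would stay in place through the spike.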

These are just two example features of the BitMEX platform that allow our traders to control risk and reduce the impact of downtime. Our Support team can provide further information about these (and any other) risk management tools on the platform.

Conclusion 

We hope this blog post provides all users with a clear description of this downtime incident and the robust ongoing steps we’re taking to enhance platform stability and our overall trading experience.

As always, we encourage any concerned users and those impacted by this downtime to contact Support.