Between 09:25:54 UTC and 09:44:30 UTC 24 June 2019 the orderBookL2, orderBookL2_25, orderBook10, and quote realtime websocket feeds for ETHUSD were in a degraded state. During this period, the state of the ETHUSD orderbook on these feeds was incorrect.
We were able to identify and resolve the root cause of the issue within a minute of detection. The issue was caused by a rare sequence of order events that triggered a bug in an optimisation of the orderBookL2 calculation which had been deployed to the production environment several hours earlier. This change has since been reverted.
There was no impact to orders in the trading engine itself – just the presentation of the calculated orderbook for ETHUSD downstream of the trading engine.
We have deployed additional automated feed validators to detect potential similar issues in the future and to alert us earlier.
Summary: We have observed an increased number of unauthorised attempts to access customer accounts. We would like to remind all customers and users to please protect your BitMEX and personal accounts by: using strong and unique passwords; enabling Two-Factor Authentication (2FA) for all your accounts; and using a password manager.
Security has always been the number one priority at BitMEX. This is why we were the first platform to adopt a manual multi-signature cold wallet setup to protect customer funds. We are consistently reviewing our security protocols and improving our standards. We remain committed to continual improvement of our platform security and the security of our customers.
In 2016, following a large botnet credential reuse attack, we published a blog post highlighting the importance of using unique passwords on BitMEX. In addition, we recommended enabling 2FA. 2FA, sometimes referred to as ‘two-step verification’ or ‘multi-factor authentication’, adds an additional layer of security to your account by requiring not only your username and password at login, but also the input of a unique, time-based token. Tokens can be stored on a cell phone within a software-based authenticator app such as Google Authenticator or Authy.
This message was as true and relevant then as it is now: to protect your account, you should always use strong unique passwords, in combination with a multi-factor authentication solution and password manager.
More recently, we have witnessed an increased number of attempts to compromise or obtain unauthorised access to customer accounts. Enabling 2FA on your account is the best and easiest way to protect yourself from these attacks.
Furthermore, we have observed a continued increase in the sophistication and tactics utilised by financially motivated criminals. One example of this: rather than the attacker immediately executing a withdrawal request, we have observed attackers trading funds out of accounts by deliberately making losses against another account which they also control. We have proactively identified a number of these attacks, and continue to eliminate this activity as it is detected.
Another recurring tactic observed in account takeovers is the disabling of BitMEX email login notifications following unauthorised account access. An attacker may also attempt to enable 2FA on a compromised customer account in order to create an API key with withdrawal permissions. A common thread in almost all cases is that customers may not have seen a withdrawal notification or other account related email notification; for example, a login notification.
While we review practices such as enforcing 2FA and other login access features, we have made the following changes:
Customers can no longer disable login notification emails. The login notification emails will now be sent regardless of existing notification preferences.
Withdrawal requests issued via the API must always complete an email verification step to confirm a withdrawal, unless the API key used was created before 8:00PM June 10, 2019 (UTC).
These changes are a step toward increasing account security for our customers, however it is important to realise that this is not the full solution. Enabling 2FA remains our strongest recommendation.
In addition to the above, BitMEX has reviewed each and every account takeover experienced by our customers and we have identified several common factors among compromised accounts:
Password reuse, or use of trivially guessed passwords on the BitMEX platform and on customer personal email accounts.
Compromised personal email accounts leading to account theft via password recovery flows.
Malware on customer computers leading to secure password theft and subsequent login to the bitmex.com platform.
In order to combat these attacks, adopting a vigilant, disciplined approach to security is key. In all of the above scenarios, utilising 2FA greatly decreases the risk of account compromise. This is further highlighted by recent research by Google that has shown that 100% of attacks can be blocked if a security key has been used for 2FA.
While we consider mandatory enforcement of 2FA across our customer base, we will again stress the importance of adopting good security practices as outlined below.
Note that these steps should be taken not only on your BitMEX account but on personal accounts where you store any confidential information:
A strong password consists of at least ten characters (and the more characters, the stronger the password) that are a combination of letters, numbers and symbols (@, #, $, %, etc.). Passwords are typically case-sensitive, so a strong password contains letters in both uppercase and lowercase.
Do NOT use the same passwords for your social media accounts such as Facebook, Spotify or Instagram accounts as you would for your BitMEX trading accounts or bank accounts. Use strong, unique and different passwords for each and every account!
Assess your existing risk
Check to see if your password has been leaked in a third-party breach via services like HIBP.
Check your trading accounts on a regular basis to ensure that you know what the balances are or should be.
Regular reconciliation of your accounts would be a useful way for you to ensure all transactions in your accounts are with your authorisation.
Add email@example.com to your contacts list and ensure our emails are not landing in your SPAM folder
Ensure that you are not filtering official communications from bitmex.com. These communications include login and withdrawal notifications.
BitMEX support will NEVER ask for your account password
At BitMEX, we take security very seriously. Whilst we continue to evolve our security capabilities both externally and internally, security is ultimately everyone’s responsibility. If you have digital funds on your online accounts, it is critical that you take steps to ensure your account safety/security as above.
If you observe any unusual activity on your account, please contact our Support team immediately via our contact page.
The scheduled system update has successfully concluded. The BitMEX platform is now back to full functionality. The update was performed from 01:00 UTC to 02:58 UTC today, 04 June 2019. No trading activity was affected.
Please be advised that we will be performing a scheduled system update to our database service starting 01:00 UTC 04 June 2019 and it is expected to last 3 – 5 hours. Trading, logins, and other key API features will remain operational, however please note that the following features will be disabled during the update period:
New account signup
Mute accounts on the Trollbox
Create API Key
Disable API Key
Enable API Key
Delete API Key
Once we have completed the system update we will make a further announcement.
We apologise for any inconvenience this may cause. Feel free to contact our Support with any concerns you may have about the scheduled update. You may reach us via our contact form: https://www.bitmex.com/app/support/contact.
Between 16:00 and 17:00 UTC 30 May 2019 the websocket API experienced periods of substantial lag due to spikes of traffic generated by the trading engine during large market moves. During this period some websocket connections also experienced dropped market data updates as memory limits on an internal messaging layer were hit, forcing reconnections.
Our engineers are accelerating the development effort in an already-planned strategic upgrade of our market data distribution architecture to vastly increase its capacity and lower the overall latency of the websocket feed. This capacity upgrade is scheduled for Testnet release this week and we will update users once this has been released to the main platform. If you have any further questions, please contact Support via our contact form: https://www.bitmex.com/app/support/contact.
We are delighted to announce HDR Global Trading Limited’s support of the MIT Digital Currency Initiative, which conducts research into the development and betterment of the global cryptocurrency ecosystem.
Sam Reed, CTO of HDR Global Trading and co-founder of the BitMEX trading platform, announced the sponsorship:
Our company has always been energized by the potential of cryptocurrency. Our donation into research and development is about ensuring that the network is more robust. A stronger Bitcoin network will be beneficial to all, and we are very excited to be able to aid in its progress.
HDR Global Trading owns and operates BitMEX, the world’s largest cryptocurrency trading platform by volume. HDR Global Trading is proud to support Bitcoin research and engineering that will make Bitcoin stronger, improving Bitcoin’s robustness, scalability and privacy.
In particular, HDR is keen to help support the work of Bitcoin Core developers Wladimir van der Laan and Cory Fields. Their roles have important implications on different parts of the Bitcoin protocol.
The donation is provided unconditionally and without restrictions.
Abstract: The 15 May 2019 Bitcoin Cash hardfork appears to have suffered from three significant interrelated problems. A weakness exploited by an “attack transaction”, which caused miners to produce empty blocks. The uncertainty surrounding the empty blocks may have caused concern among some miners, who may have tried to mine on the original non-hardfork chain, causing a consensus chainsplit. There appears to have been a plan by developers and miners to recover funds accidentally sent to SegWit addresses and the above weakness may have scuppered this plan. This failure may have resulted in a deliberate and coordinated 2 block chain re-organisation. Based on our calculations, around 3,392 BCH may have been successfully double spent in an orchestrated transaction reversal. However, the only victim with respect to these double spent coins could have been the original “thief”.
Illustration of the Bitcoin Cash network splits on 15 May 2019
(Source: BitMEX Research) (Notes: Graphical illustration of the split)
The three Bitcoin Cash issues
Bitcoin Cash’s May 2019 hard fork upgrade was plagued by three significant issues, two of which may have been indirectly caused by a bug which resulted in empty blocks. The below image shows the potential relationships between these three incidents.
The relationships between the three issues faced by Bitcoin Cash during the hardfork upgrade
(Source: BitMEX Research)
The empty block problem
Bitcoin ABC, an important software implementation for Bitcoin Cash, appears to have had a bug, where the validity conditions for transactions to enter the memory pool may have been less onerous than the consensus validity conditions. This is the opposite to how Bitcoin (and presumably Bitcoin Cash) are expected to operate, consensus validity rules are supposed to be looser than memory pool ones. This is actually quite an important characteristic, since it prevents a malicious spender from creating a transaction which satisfies the conditions to be relayed across the network and get into a merchants memory pools, but fails the conditions necessary to get into valid blocks. This would make 0-confirmation double spend attacks relatively easy to pull off, without one needing to hope their original payment doesn’t make it into the blockchain. In these circumstances, an attacker can be reasonably certain that the maliciously constructed transaction never makes it into the blockchain.
An attacker appears to have spotted this bug in Bitcoin Cash ABC and then exploited it, just after the hardfork, perhaps in an attempt to cause chaos and confusion. This attack could have been executed at any time. The attacker merely had to broadcast transactions which met the mempool validity conditions but failed the consensus checks. When miners then attempted to produce blocks with these transactions, they failed. Rather than not making any blocks at all, as a fail safe, miners appear to have made empty blocks, at least in most of the cases.
Bitcoin Cash – Number of transactions per block – orange line is the hardfork
(Source: BitMEX Research)
The asymmetric chainspilt
At the height of the uncertainty surrounding the empty blocks, our pre-hardfork Bitcoin ABC 0.18.2 node received a new block, 582,680. At the time, many were concerned about the empty blocks and it is possible that some miners may have reverted back to a pre-hardfork client, thinking that the longer chain was in trouble and may revert back to before the hardfork. However, this is merely speculation on our part and the empty block bug may have had nothing to do with the chainsplit, which could have just been caused by a miner who was too slow to upgrade.
Bitcoin Cash consensus chainsplit
(Source: BitMEX Research)
The chainsplit did highlight an issue to us with respect to the structure of the hardfork. We tested whether our post hardfork client, ABC 0.19.0, would consider the non-hardfork side of the split as valid. In order for the break to be “clean”, each side of the split should consider the other as invalid.
In order to test the validity of the shorter pre-hardfork chain, from the perspective of the Bitcoin ABC 0.19.0 node, we had to invalidate the first hardfork block since the split. We then observed to see whether the node would follow the chainsplit or remain stuck at the hardfork point. To our surprise, as the below screenshot indicates, the node followed the other side of the split. Therefore the split was not clean, it was asymmetric, potentially providing further opportunities for attackers.
Screenshot of the command line from our Bitcoin ABC 0.19.0 node
(Source: BitMEX Research)
The coordinated two block re-organisation
A few blocks after the hardfork, on the hardfork side of the split, there was a block chain re-organisation of length 2. At the time, we thought this was caused by normal block propagation issues and did not think much of it. For example, Bitcoin SV experienced a re-organisation a few weeks prior to this, of 6 blocks in the length. When Bitcoin SV re-organised, all transactions in the orphaned chain eventually made it into the main winning chain (except the Coinbase transactions), based on our analysis. However, in this Bitcoin Cash re-organisation, we discovered that this what not the case.
The orphaned block, 582,698, contained 137 transactions (including the Coinbase), only 111 of which made it into the winning chain. Therefore a successful 2 block double spend appears to have occurred with respect to 25 transactions. The output value of these 25 transactions summed up to over 3,300 BCH, as the below table indicates.
List of transactions in the orphaned block (582,698) which did not make it into the main chain
As the above table shows, the total output value of these 25 double spent transactions is 3,391.7 BCH, an economically significant sum. Therefore, one may conclude that the re-organisation was an orchestrated event, rather than it having occurred by accident. If it occurred by accident, it is possible there would be no mismatch between the transactions on each side of the split. However, assuming coordination and a deliberate re-org is speculation on our part.
We have provided two examples of outputs which were double spent below:
Example of one of the double spent UTXOs – “0014”
(Source: BitMEX Research)
The above table illustrates what happened to a 5 BCH output during the re-organisation. The 5 BCH was first sent to address qzyj4lzdjjq0unuka59776tv4e6up23uhyk4tr2anm in block 582,698. This chain was orphaned and the same output was eventually sent to a different address, qq4whmrz4xm6ey6sgsj4umvptrpfkmd2rvk36dw97y, 7 block later.
Second example of one of the double spent UTXOs – “0020”
(Source: BitMEX Research)
What happened to the above outputs shares characteristics with almost all the funds in the 25 double spent transactions. Most of the outputs appear to have been double spent around block 582,705 on the main chain, around 7 blocks after the orphaned block.
The SigScript, used to redeem the transaction inputs, starts with “0020” or “0014”, highlighted in the above examples. These may relate to Segregated Witness. According to the specification in Segregated Witness, “0014” is pushed in P2WPKH (Pay to witness public key hash) and “0020” is pushed in P2WSH (Pay to witness script hash). Therefore the redemption of these inputs may have something to do with Segregated Witness, a Bitcoin upgrade, only part of which was adopted on Bitcoin Cash.
Indeed, based on our analysis, every single input in the 25 transactions in the orphaned block 582,698 was redeemed with a Sigscript starting “0014” or “0020”. Therefore it is possible that nobody lost funds related to this chain re-organisation, other than the “attacker” or “thief” who redeemed these SegWit outputs, which may have accidentally been sent to these outputs in the first place.
As part of the Bitcoin Cash May 2019 hardfork, there was a change to allow coins which were accidentally sent to a SegWit address, to be recovered. Therefore, this may have occurred in the incident.
Allow Segwit recovery
In the last upgrade, coins accidentally sent to Segwit P2SH addresses were made unspendable by the CLEANSTACK rule. This upgrade will make an exemption for these coins and return them to the previous situation, where they are spendable. This means that once the P2SH redeem script pre-image is revealed (for example by spending coins from the corresponding BTC address), any miner can take the coins.
It is possible that this 2 block re-organisation is unrelated to the empty block bug. However, the split appears to have occurred just one block after the resolution of the bug, therefore it may be related. Perhaps the “honest” miners were attempting to coordinate the spend of these outputs directly after the split, perhaps to return them to the original owners and the empty block bug messed up their timing, allowing the attacker to benefit and sweep the funds.
On the other hand, the attack is quite complex, therefore the attacker is likely to have a high degree of sophistication and needed to engage in extensive planning. Therefore, it is also possible this attack may have been effective even without the empty block bug.
There are many lessons to learn from the events surrounding the Bitcoin Cash hardfork upgrade. A hardfork appears to provide an opportunity for malicious actors to attack and create uncertainty and therefore careful planning and coordination of a hardfork is important. On the other hand, this empty block bug, which may be the root cause of the other 2 incidents, could have occurred at any time and trying to prevent bugs like this is critical whether one is attempting to harfork or not.
Another key lesson from these events is the need for transparency. During the incidents it was difficult to know what developers were planning, the nature of the bugs, or which chain the miners were supporting. Open communication in public channels about these issues could have been more helpful. In particular, many were unaware of an apparent plan developers and miners had to coordinate and recover lost funds sent to SegWit addresses. It may have been helpful if this plan was debated and discussed in the community more beforehand, as well as during the apparent deliberate and coordinated re-organisation. Assuming of course if there was time to disclose the latter. It may also be helpful if those involved disclose the details about these events after the fact.
The largest concern from all of this, in our view, is the deliberate and coordinated re-organisation. From one side of the argument, the funds were stolen, therefore the actions were justified in returning the funds to their “rightful owners”, even if it caused some short term disruption. However, the cash like transaction finality is seen by many, or perhaps by some, as the only unique characteristic of these blockchain systems. The ability to reverse transactions, and in this case economically significant transactions, undermines the whole premise of the system. Such behavior can remove incentives to appropriately secure funds and set a precedent or change expectations, making further reversals more likely.
For all those in the Bitcoin community who dislike Bitcoin Cash, this could be seen as an opportunity to laugh at the coin. However, although Bitcoin Cash has a much lower hashrate than Bitcoin, making this reversal easier, the success of this economically significant orchestrated transaction reversal on Bitcoin Cash is not positive news for Bitcoin in our view. In some ways, these incidents contribute to setting a dangerous precedent. It shows that it may be possible in Bitcoin. Alternatively, this could just illustrate the risks Bitcoin Cash faces while being the minority chain.
Since BitMEX launched on 24 November 2014, cryptocurrency derivatives trading exploded. I tried in vain to seduce various venture capital firms with the vision of the future that was all about derivatives trading. At that time, succour was not forthcoming; however, I could not be more pleased with my failures now standing in 2019.
The BitMEX XBTUSD perpetual swap and various other contracts traded on OKEx and Deribit are of the same ilk. These contracts all allow you to trade a fixed USD amount of Bitcoin. We call these inverse derivatives contracts. Many OG traders have heard me speak at length about the subtle yet profound implications of this contract structure. However, as many new traders now try their hand at derivatives trading, a refresher course is necessitated.
Contrary to popular belief, I don’t delight when I see the BitMEX Rekt twitter feed going bananas. I’m long-term greedy. I would rather you enjoy a long trading career earning a profit and paying BitMEX trading fees along the way, than blow up your equity capital during a liquidation. Therefore, it is in mine and BitMEX’s best interest that our traders are sufficiently educated about best trading practices.
I love our traders, but when I hear people smile and laugh about getting liquidated it makes me cringe. A real trader practices proper risk management, and that means never being liquidated.
You Gotta Go Down, To Go Up
Convexity or gamma is the second derivative of a contract’s value with respect to price. Used correctly convexity can supercharge your portfolio’s returns. However, if you do not understand how convexity affects a derivative you trade, you will get rekt repeatedly.
With inverse contracts, the margin currency is the same as the home currency. I will use the XBTUSD contract throughout this post.
I will dwell on how the XBT exposure of a long 100,000 contract position changes with respect to the price (.BXBT Index).
First, let’s look at the long side. In bull and bear markets, these will most likely be speculators. This makes sense because being long Bitcoin offers asymmetric returns. Bitcoin can rise to infinity, but can only fall zero. It is better from a return on equity perspective to go long the bottom, then go short the top. Those who picked up ETH below $100 know this acutely. Therefore, coupled with leverage, on the margin, longs in most market environments will be predominately speculators.
The first chart shows XBT PNL profile and curvature. The straight line is the PNL % return if the contract moved in a linear fashion, the curved line is the long inverse contract position’s PNL % return. What you immediately notice is that you will lose more money when the market falls, and make less money as the market rises. This is suboptimal as you must post margin in XBT. Thus, your margin requirements increase in a non-linear fashion, and this is why longs get rekt quickly in a falling market.
Now let’s examine the short side. In bull and bear markets, these will most likely be hedgers and market makers. In both cases, these market participants want to lock in the USD value of Bitcoin. With inverse contracts, a long physical Bitcoin position coupled with an equivalent short XBTUSD position creates a synthetic USD position. If 100% of the physical Bitcoin is placed at cross-margin with BitMEX, you cannot be liquidated.
Unlike the long side, shorts benefit from positive XBT convexity. Shorts make more and more XBT as the price falls, and lose less and less as the price rises.
The take away from these two examples is that long speculators will be liquidated faster on the way down. This explains why dumps in these derivatives dominated markets are now more extreme than pumps and will continue so long as inverse style derivatives dominate the cryptocurrency derivatives markets.
The CME contract has a fixed XBT exposure regardless of the price, and the USD exposure varies linearly with respect to price. While this is great for USD benchmarked investors, it becomes problematic for those hedging their exposure. Bitcoin purchased to hedge a short CME position cannot be used as collateral with the CME. This presents some challenges for hedgers who hold physical Bitcoin, and market makers who must divide precious capital between derivatives and spot markets with no cross-collateral relief.
Today, we’re offering Part 2 of this series – a deep dive into overload and problems inherent to horizontal scaling. We will discuss the results from our efforts so far in handling unprecedented volumes and will detail the parts of the BitMEX engine that must remain serial, the parts that can be parallelised, and the benefits of BitMEX’s API-first design.
In Part 3, we’ll explain code optimizations that have already been put in place, systems that have been parallelised, and why certain features have been removed. Additionally, we’ll talk about BitMEX’s commitment to fair and equal access – and how that translates to our refusal to offer co-location.
Let’s get started.
BitMEX is a unique platform in the crypto space. In order to offer industry-leading leverage and features, the BitMEX trading engine is fundamentally different than most engines in crypto and in traditional finance. And while we’re able to offer exceptionally accurate trading and margining, it hasn’t been particularly fast – yet.
Throughout 2017, BitMEX’s average daily trading volume grew by 129x. This is incredible growth, and it continued to grow throughout 2018 and 2019.
As you can see above, orders per week have also sharply increased from 2017. Of particular interest is that an all-time USD trading record of $8B was hit in July 2018, despite the lower order rate! This record wasn’t broken until just last week, on 11 May 2019, where $11B was traded. These records still stand for the most ever traded in a day by a crypto exchange, and the XBTUSD Perpetual Swap is the most-traded crypto product ever built. It has since been imitated over a dozen times by crypto exchanges, both aspiring and established.
In May 2018, we began a focused effort to optimize order cancellation, amend, and placement actions in the trading engine, reworking internal data-structures, algorithms and audit checks to tune specifically for the kind of speed kdb+ can offer. This was a hyper-focused effort to make the existing trading engine continue to do what it does, only a lot faster. We’re very proud to say that this effort yielded us over 10x performance improvement in a very short time, with 4.6x achieved in just the first 30 days, and 10x by the end of August. By mid July, the performance enhancements delivered to the trading engine had resulted in the elimination of virtually all overloads, and we’re very proud of our technology teams for achieving this amazing result.
Below, overloads as a percentage of all write requests (order placement, amend, cancel). More red = more overloads.
The market reacted to the increased capacity, pushing BitMEX volumes into the stratosphere. We had the crypto industry’s first-ever million bitcoin trading day. The spare capacity enabled us to continue to launch innovative new products like the first ever ETHUSD quanto perpetual swap, which became the number 1 traded ETHUSD product within 6 weeks of listing. In November 2018, we nearly touched two million bitcoin in 24 hours and achieved $11B in trading in May 2019.
While the July graph looks completely clear, if you’re reading this, you know that it didn’t stay that way. Why isn’t it completely fixed? Why haven’t we simply continued that effort to get to 100x and beyond?
Let’s get into some background.
Servicing requests on BitMEX is analogous to waiting in line at a ticket counter. You join the queue, and when the queue clears, you make your request.
How long does it take to buy a ticket? Well, if there’s no queue, it’s going to be very quick. The entire interaction lasts only as long as it takes to service your individual request. This is why some other trading services, which have significantly lower traffic and volume, may feel snappier even if their maximum capacity is lower: there are fewer requests in the queue.
Consider what happens when the queue is very long. Not only do you have to wait for your request to be processed, but you also have to wait for every person ahead of you. You could have the world’s most efficient clerks, but if a queue starts forming, the average experience is going to be very poor.
Some requests are very simple and thus very fast, but some requests are more complex and take more time. If a request can be avoided altogether (say, the user doesn’t have enough available balance to complete it), optimisations can be employed to ensure it never even joins the queue, similar to automated check-in counters and bag drops.
Traffic on a web service behaves in many of the same ways. Even when processing an individual request is very fast, when a queue forms for a single resource, the experience degrades.
You’ve seen this before. Amazon and Alibaba have had major downtime on holidays. Twitter had the infamous failwhale. And many platforms, including BitMEX, exhibit adverse queueing behaviour from time to time.
“System Overload”, as you might know it, is one mechanism we at BitMEX use to address this problem. Are you shaking your head? How could overload be the solution, not the problem? Overload is a defence mechanism, better known in the industry as “load shedding,” a technique used in information systems to avoid system overload by ignoring some requests rather than crashing a system and making it fail to serve any request.
We’ve published a document showing the definitive rules around load shedding and explaining the mechanism:
To understand this, consider a system where load shedding is not present. With increasing demand, a queue forms and begins to grow.
How bad can it get? When the market moves, giant swarms of traders race to place orders to increase or decrease their position. One might think that the delay would regulate itself: as service quality decreases, traders will start placing orders at a slower pace, waiting for each one to confirm before placing the next. But in fact the opposite happens: as response time increases, automated arbitrageurs are unable to step in quickly to keep the prevailing price in line with other exchanges. Other savvy traders attempt to manually trade the perceived difference in pricing, which further escalates the size of the queue.
Without safeguards, the queue can reach delays of many minutes. Orderbook spreads increase as users fail to effectively place resting orders. A great market price becomes a terrible one by the time an order actually makes it through the queue and executes. In this environment, trading is effectively impossible. This isn’t just hypothetical; it is a common issue also faced by other crypto markets.
BitMEX’s solution to this is to limit the maximum number of order management requests that can be in queue for the trading engine. There is a service in front of the trading engine that identifies requests as reads (i.e. data fetches, like GET /api/v1/position) and writes (e.g. order placement/amend/cancel, and leverage changes). If it is a write, it is delegated to the main engine, and a queue forms. If this queue gets too long, your order will be refused immediately, rather than waiting through the queue. The depth of this queue is tuned to engine performance to cap latency at a worst-case latency of 3-5 seconds.
As per our Load Shedding documentation, certain types of requests like cancels are allowed to enter the queue no matter its size, but they enter at the back of the queue, like any other request.
The result: traders know immediately that the system is experiencing lags, rather than finding out after the order is in the queue, taking many seconds to execute. The engine isn’t slowing down: in fact, during overload, the engine reaches a peak order rate, and the orderbook and trade feeds move very quickly.
Trading During Overload
Some traders have expressed frustration that trading continues during overload. In fact, we’ve seen many conspiracy theories on Twitter and in trader chat rooms about it, arguing that this must be because certain traders have unequal access to the system. That is fundamentally untrue: every single trader on BitMEX has equal access and enters the back of the same queue. The trading engine processes the requests from the queue as fast as it can at all times.
If the number of orders entering the system is 5 times what the system can handle, only 20% of orders will be accepted and 80% will be rejected. As to which orders are rejected and which are accepted, it is simply whether there is space in the queue at the moment when the order arrives. If your request happens to hit the queue just after a response has been served, bringing the queue below the maximum depth, it will be accepted. The next order submitted after yours may not be.
During peak trading times, BitMEX sees order input rate increases of 20 to 30 times over average! Executed trades have reached peaks of over $100M/minute. This rate, if sustained, would lead to $6B trading volume per hour, or over $144B per day! This is 13x the top volume ever recorded in a single day on BitMEX, or on any other crypto platform.
In order to always provide a smooth trading experience, BitMEX needs to have a large reserve of capacity to handle these intense events. Below, we’ll document some of the challenges we’ve faced in achieving this goal.
High Scalability & Amdahl’s Law
How do scaling problems get solved? There are two types of scaling: “vertical”, and “horizontal”. Scaling vertically involves making an individual system faster. You can do this by buying a faster processor (good luck; Moore’s Law for CPUs is dead), or by finding ways to do less work. On the other hand, scaling horizontally is of the “throw more money at it” variety: spin up more servers, and spread the load among them.
Web servers are a good example of a horizontally-scalable service. In most properly-architected systems, you can add more web servers to handle customer demand. When one reply does not depend upon another, it is safe for servers to work in parallel, like check-out clerks at a grocery store.
This is a massive simplification, but for many, the scalability solution is a longer version of “throw money at the problem”. Many systems scale horizontally. Most customers’ experiences are completely independent of one another, and can simply be served by more web servers. The backing databases often can be scaled horizontally, replicating their data to one another.
There’s a limit to how much horizontal scaling is possible, which is often expressed as Amdahl’s Law. In short: a system’s horizontal scalability is limited by the serial operations (or the operations that must happen in a specific sequence) required. To illustrate: imagine a simple, single-threaded service you want to speed up by running it in parallel e.g. through multiple servers. Through some performance analysis, you find that only 25% of the work must be done in-order. The rest can be done in parallel. This means that, no matter how many cores or servers you throw at the problem, it can only be sped up by 4x, as 1/25% = 4. That bit of serial work becomes the bottleneck.
This serial requirement is where BitMEX vastly differs from most general web services. The BitMEX trading engine has far more serial requirements, thereby seriously limiting parallelisation opportunities.
Sequential Problems: Orders and Re-margining
The BitMEX trading engine processes orders in a sequential, First-In-First-Out (FIFO) fashion. Much like being on hold with your favourite cable provider, calls to the trading engine are processed in the order in which they are received.
This is a fundamental principle to a market and cannot be changed. Orderbooks must have orders applied to them sequentially – that is, the ordering matters. When an aggressive order is placed, it takes liquidity out of the book and no other order may consume it. For this reason, matching on an individual market cannot be effectively distributed; however, matching may be delegated to a single process per market.
At the time of writing, BitMEX operates approximately 150 API servers talking directly to a proxy in front of the engine. This proxy delegates read requests to data mirrors, websocket data to the pub/sub system, and write requests directly to the engine.
Writes are, as you might expect, the most expensive part of the system and the most difficult to scale. In order for a trading system to work effectively, the following must be true:
All participants must receive the same market data at the same time.
Any participant may send a write at any time.
If this write is valid and changes public state, the modified world state must be sent to all participants after it is accepted and executed.
Unoptimised, this system undergoes quadratic scaling: 100 users sending 1 order each per minute generates 10,000 (100 * 100) market data packets, one for each participant. A 10x increase to 1,000 users generates 100x the market data (1000 x 1000), and so on.
As mentioned at the beginning of this article, BitMEX grew 129x in 2017. During that time, our user-base scaled proportionally. This means that, on December 31, 2017, compared to January 1 2017, we were sending roughly 16,641x (129 * 129) more messages.
Scaling BitMEX is a difficult undertaking. We aren’t a typical spot or derivatives platform: we handle the entire customer lifecycle, from sign-up, to deposit, to trading.
In order to offer 100x leverage safely, BitMEX’s systems must be correct and they must be fast. BitMEX uses Fair Price Marking, an original and often imitated system by which composite indices of spot exchange prices underlying a contract are used to re-margin users, rather than the last traded price of the contract. This makes BitMEX markets much more difficult to manipulate by referencing external liquidity.
In order for this to work properly, the BitMEX engine must be consistent. Upon every mark price change, the system re-margins all users with open positions. At this time, the entirety of the system is audited by a control routine. The cost of all open positions, all open orders, and all leftover margin must be exactly equal to all deposits. Not a single satoshi goes missing, or the system shuts down! This happened a few times in our early days; each time due to a few satoshis’ rounding error on affiliate revenue or fees. While there was temptation to build in a small buffer of funds in case of error, our team believes system solvency to be paramount: they hold themselves to the highest possible standard. The system still audits to an exact satoshi sum today after every major change in state.
It is not possible for a malicious actor with database access on BitMEX to simply edit his or her balance: the system would immediately recognise that money had appeared from nowhere, emit a fatal error, and shut off.
Before auditing, the current value of your entire account must be recalculated from scratch; that is, the value of all your open positions and open orders at the new price. This ensures traders are not able to buy what they cannot afford. Traders don’t reach negative balances on BitMEX.
Speeding up this system is one of the primary goals of our scaling effort. Matching takes comparatively little time and scales easily; margining does not. BitMEX has always strived for “correct first, fast later” – and thus, the time this has taken is largely due to our commitment to getting it right. Incorrect results are not tolerable, and therefore a correctly distributed system must be able to detect slow or failed producers, rebalance load, and complete essential processing within a tight time budget. This requires careful, methodical attention and rigorous testing.
Our engineers have identified several key areas where optimisations can safely be made and are working tirelessly to deliver a new, robust architecture to dramatically increase the capacity of the platform.
BitMEX is rather unique among its peers: it was implemented API-first. The BitMEX architecture is comprised of three main parts: the trading engine, the API, and a web frontend. Notice that we didn’t use the term “the” frontend. Why is that?
When building BitMEX, we wanted our API to be best-in-class. A great API makes it easy for developers to build robust tools. It even enables alternate visualizations and interfaces that we may never have imagined. At the time we began coding, it was generous to say that crypto trading APIs were less than subpar. Many were missing any semblance of regularity, documentation or pre-written adapters, critical data was often missing, and vital functions could only be done via the website. Worse yet, most didn’t even have websocket feeds, and the few that did often kept them private and only accessible via the website.
At BitMEX we bucked the trend and set a new standard for crypto trading APIs. We engaged in a deliberate policy of dogfooding, by stipulating that the website must use the API as any other program might use it. This means there is not “the” frontend, there is simply an official BitMEX frontend. The BitMEX website, as a project, has no special access other than the ear of the API developers and a few login/registration anti-abuse mechanisms.
This also means that no mechanism for accessing BitMEX is faster or slower than another. All users enter the same data path and the same queues, whether they are accessing via a mobile device, a browser, a custom-written API connector, or even through Sierra Chart’s DTC integration. This ensures a fair experience for everyone.
From the beginning, BitMEX had:
A websocket change feed for all tables, including orders, trades, orderbook, positions, margin, instruments, and more, where all tables follow the same format,
A unified data path for both website and API consumers.
BitMEX’s commitment to API-first design shines in its implementation of real-time data, which is exposed through our websocket. As mentioned above, all tables have real-time feeds available, a first in the crypto industry and extremely rare today. Additionally, all tables follow the same formatting, meaning you can write as little as 30 lines of code to be able to process any stream. Or, use one of ours off-the-shelf from GitHub.
This data flows from a change stream generated by the engine itself, which is filtered for individual user subscriptions. This allows for a very comfortable flow when building interfaces on top of BitMEX: subscribe to your tables, make requests, and listen on the stream for changes. Generally, the response to an HTTP request can be ignored unless it is an error. This avoids a common duality in applications where both websocket streams & HTTP responses must be read separately and coalesced, resulting in awkward code and bugs.
We believe that this philosophy of building a top-tier application interface not only makes for the best userland integrations, it makes the BitMEX website and upcoming mobile apps the best they can be.
Our real-time feeds are of paramount importance to the orderly functioning of the BitMEX platform. To that end, we are staging a major internal rework of this system that we expect to improve latency and throughput significantly, without external changes. We will announce that launch and its results soon.
We hope the above has given all of you an idea of the challenges BitMEX faces while scaling the platform for the next 100x growth. While we are proud of the platform’s success and thankful to our users, we need to continue to improve in order to be viable in the years to come.
The BitMEX Trading Engine Team releases updates to the platform multiple times per week. These incremental changes are both part of the ongoing longer term re-architecture of the trading platform as well as tactical in-place capacity improvements to the engine. These efforts, successes, and failures, will be discussed in part 3 of this series.
Our Engine Team has been able to deliver major upgrades to our system’s throughput on a regular cadence. Just recently, on 23 May 2019, the team pushed through a major infrastructure upgrade that increased the new order handling capacity by up to 70%. Significant capacity improvements like these will continue to be delivered over the coming months whilst the larger scale re-architecture of the platform continues in parallel.
While work proceeds quickly on scaling our trading engine, we are also scaling our teams. BitMEX employs both world-renowned experts in electronic trading systems, scaling, infrastructure, security, and web, and has junior and intermediate roles for people who aren’t afraid to learn and get their hands dirty. If this article interests you, you might be the kind of person we want on our team; take a look at the exciting opportunities on our Careers Page.