Removal of the AllOrNone ExecInst

In the course of optimizing BitMEX systems, breaking API changes are occasionally required. BitMEX does its best to keep these to a minimum.

Please be sure to update your API clients promptly so that your trading strategies are not disrupted.

At 8:00 AM Beijing time (00:00 UTC) on Monday, 15 May, the AllOrNone order execution instruction (ExecInst) will be removed. Any order submitted with AllOrNone after that time will be rejected.

 

 


Load Shedding and the "Close" execInst

We have noticed that some API users are using the Close execInst as an escape valve for continuing to quote one-sidedly during trading-engine overload. This was not the original intent of the exception; it exists to let users close out a position, and is aimed primarily at front-end users.

Close orders that include an order quantity are now subject to load shedding. If you intend to close your entire position, omit the orderQty field. The absence of that field, combined with Close, will be interpreted as an instruction to close your entire position.
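A minimal sketch of a full-position close under this behaviour follows, assuming the standard BitMEX REST endpoint POST /api/v1/order and placeholder API credentials; it is an illustration, not production code.

    # Minimal sketch (not production code): close an entire position by sending a
    # market order with execInst=Close and *no* orderQty field, per the behaviour
    # described above. The API key and secret are placeholders.
    import hashlib
    import hmac
    import json
    import time

    import requests

    BASE_URL = "https://www.bitmex.com"
    API_KEY = "YOUR_API_KEY"        # placeholder
    API_SECRET = "YOUR_API_SECRET"  # placeholder

    def close_entire_position(symbol):
        path = "/api/v1/order"
        expires = str(int(time.time()) + 60)
        # No orderQty: combined with execInst=Close, this is read as "close it all".
        body = json.dumps({"symbol": symbol, "ordType": "Market", "execInst": "Close"})
        # BitMEX request signing: HMAC-SHA256 over verb + path + expires + body.
        signature = hmac.new(API_SECRET.encode(),
                             ("POST" + path + expires + body).encode(),
                             hashlib.sha256).hexdigest()
        headers = {
            "content-type": "application/json",
            "api-expires": expires,
            "api-key": API_KEY,
            "api-signature": signature,
        }
        resp = requests.post(BASE_URL + path, data=body, headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(close_entire_position("XBTUSD"))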

 

 

 



BitMEX Downtime, May 17 2018

Today, May 17, 2018, the BitMEX trading engine encountered several separate and heretofore unpredictable problems, causing feed latency and downtime in spurts throughout the day.

Disks mounted to the main trading engine hardware degraded sharply in performance at roughly 10:00 UTC. This degradation caused feed latency during scheduled archive and reindex jobs, which caused significant backpressure. Disk I/O operations were running at roughly 1/20 of their expected rate.

BitMEX runs redundant drives, but in this case, both drives were simultaneously exhibiting this degraded behavior. We had no choice but to schedule a maintenance downtime to replace them. Unfortunately, backpressure reached critical levels faster than we expected and we moved up our timetable.

At no point was data integrity compromised by this problem, but restoring the machine to a functional state with nominal disk performance took longer than expected to execute and verify.

After this action was complete, we restarted trading. Unfortunately, another problem was uncovered during the next archive, where a reindex job combined with a previously rare request pattern led to unexpected index regeneration and symbol revalidation on specific tables. This led to another backpressure scenario, with similar symptoms.

We have identified and fixed multiple contributing factors to the above behavior. The trading engine team will be closely monitoring engine performance throughout the day while continuing root cause analysis for the slowdowns.


BitMEX Technology Scaling: Part 1

Hi there – I’m Samuel Reed, CTO of BitMEX.

It’s been an incredible journey over the last four years building BitMEX. When we started, I don’t think any of us could have imagined the success this platform would achieve or how it would come to dominate Bitcoin/USD trading in 2018.

From 2014 to today, the BitMEX platform has grown from zero to an average of $3B of trading volume per day. Our flagship product, XBTUSD, trades more than any crypto product in the world. We serve customers all over the world, in five languages, and have become the premier platform for Bitcoin price discovery and liquidity.

The BitMEX team has been hard at work improving capacity, building a solid mobile offering, and creating a tech team that is truly best-in-class. We are not resting on our laurels, enjoying this success for the sake of it. Quite the opposite: we’ve been busier than ever.

We’d like to let the community in on how we formed and how we’re moving forward. As was wisely said: “In order to defeat the bug, we must understand the bug.”1


Origins

I’ll begin with a true story.

Source: russellfreeman.com

In 2014, I was speaking at a web development panel in Hong Kong for General Assembly, a coding bootcamp. They wanted to give their soon-to-graduate students a taste of what it was like to work professionally. I took the opportunity to talk about my history: a career made of positions in several small businesses, startups, and government – with an emphasis on how incredibly in-demand software engineers are.

A rather loud personality in the back asked a question: “How do cash-poor startups looking for a CTO make a case? How do you attract great talent in such a competitive atmosphere?”

“Well, that’s a good question, and a tough answer,” I said. “Without funding, you have the challenge of a serious risk versus a sure thing. Why should any experienced developer forgo $200,000 or more at a large tech company, in a comfortable, resource-rich environment, to work 80 or more hours a week? You essentially have to find some bozo” – I really said this – “who believes in your idea so much he’s willing to take the risk despite so many better options.” I wished him good luck and we continued the panel.

He came up to me after the panel and told me he wanted to do a Bitcoin derivatives exchange. I knew then: I was that bozo, and Arthur Hayes and I were to become business partners.

Without any major funding, we brought an alpha online within six months and started with the BitMEX Trading Challenge, a no-rules trading competition where we put the exchange through its paces. And it really was no-rules (aside from multiple accounts) – hacking the site would win you the prize. We paid out a few Bitcoin in bug bounties in those days but we didn’t have any major failures.

Much to the annoyance of my wife, we launched BitMEX during our honeymoon in Croatia, on November 24, 2014. Ben and Arthur celebrated separately, in Hong Kong. Notice the original trading interface in both photos. You can still read the original Trollbox messages from that day.

November 24, 2014, Dubrovnik, Croatia.
November 24, 2014, Hong Kong.

 


Building BitMEX, 2014

All projects are a product of the time in which they are built. In early 2014, the crypto ecosystem was reeling from the vacuum Mt.Gox left behind. The focus at the time was not “proof of work” vs. “proof of stake”, as it is today, but a forgotten term called “proof of reserves” – just Google it and look at the timestamps of all the popular posts. In fact, a question about this was the top-voted comment on our Reddit launch announcement.

The first rule of running a Bitcoin exchange is, and always has been, “Don’t lose the Bitcoin.”

This rule pervades everything we do at BitMEX. It permeates our policy, even today: we still use a 100% cold wallet where every transaction is multisig. Look up a 3BMEX transaction on the blockchain, and you’ll see it. For 1,250 straight days (!), at least two out of the three of us have gotten up, read the day’s withdrawals, done our risk checks, and signed, to be passed onto the next partner for signing and eventual broadcast.

At the time, I thought users would resist this. Yes, Bitcoin is better in so many ways than any monetary system that has come before it. But it is weaker too. Custodianship is an unsolved problem that requires constant vigilance. I think our customers know this and appreciate it. In our early days, we received a large number of complaints about withdrawal times. Today, as the largest exchange by volume in the world, we receive barely any. People get it – caring for your deposits this way is not easy. We do it not because it is convenient, but because it is safe.

BitMEX in 2014.

The atmosphere in 2014 influenced how we built BitMEX. My frontend experience led me to adopt ReactJS for the frontend. BitMEX was the first exchange to launch with it, a choice that has paid dividends well into 2018.

We were also the first – and likely still the only – exchange to build our matching and margining engine on kdb+/q, a technology traditionally used for querying large-scale time-series data. It’s a natural fit. It’s fast (bear with me), using SIMD instructions to greatly boost throughput; it’s flexible; and it’s accurate. Kdb+’s flexibility and speed allowed us to pivot our product offerings twice: from low-leverage inverse and quanto futures to high-leverage ones, and from high-leverage futures to our flagship product, the XBTUSD Perpetual. We also pivoted loss-recovery mechanisms twice, from guaranteed settlement, to Dynamic Profit Equalization, to ADL.

BitMEX is a company known for listening to its customers and adapting. This required flexibility, innovation, and a lot of sweat equity from everyone on the team, and we’re so proud of how far it’s come.


Now, it wouldn’t be fair to come this far without addressing the title of this post. BitMEX now trades as much as US$6.5 billion per day. Our most recent 1-minute record was US$35 million, a number that is higher than the entire month of April 2016.

The highlighted month, March 2016, had 16M of volume on XBTUSD. XBTUSD now peaks at double that in just one minute.

The following charts show monthly turnover in increasingly large timescales, to highlight detail completely lost in the overall view:

To understand why BitMEX is experiencing slowdowns, despite using a solid technology like kdb+, it’s important to understand what BitMEX does differently than other exchanges.

100x is a number that elicits a large number of reactions, ranging from “are you crazy?” to “how is this possible?” It is only possible due to incredible financial engineering from our co-founder and CSO Ben Delo. Ben is a diligent and brilliant mathematician. He built a perfect mathematical model for trading, a constantly-coherent system that continuously audits all trades and always sums to zero. Transactions don’t get lost in the BitMEX engine. A user’s balance never goes negative. There are entire classes of bugs that are common on other platforms that never occur on BitMEX, and it is that attention to detail that makes all the difference. Mark/Fair Pricing, the weighted ADL system, perpetual contract funding rates, and live isolated/cross remargining are all new, novel concepts that did not exist before BitMEX.

This consistent coherency inside the BitMEX engine makes 100x possible. Kdb+ has historically been fast enough that we can continuously remargin all positions upon each and every price change. This provides the safety and speed necessary to not only survive within the razor-thin requirements of 0.5% maintenance margin, but thrive. The BitMEX Insurance Fund, a fund that guarantees settlement of BitMEX contracts, contains (at the time of writing) an incredible 6,149 XBT, over US$50M. Competing firms have insurance funds in the single digits of Bitcoin, despite offering as low as only 20x leverage.
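To make continuous remargining concrete, here is a deliberately simplified, hypothetical sketch (not the BitMEX engine's actual code): on each mark-price change, every position's unrealised PnL is recomputed using inverse-contract arithmetic (each contract is worth 1 USD, settled in XBT) and compared against a 0.5% maintenance-margin requirement.

    # Deliberately simplified, hypothetical sketch of continuous remargining (not the
    # BitMEX engine's actual code). Inverse-contract arithmetic: each contract is
    # worth 1 USD, with profit and loss settled in XBT.
    from dataclasses import dataclass

    MAINTENANCE_MARGIN_RATE = 0.005  # 0.5%, as on XBTUSD

    @dataclass
    class Position:
        account: int
        contracts: int      # positive = long, negative = short; 1 contract = 1 USD
        entry_price: float  # USD per XBT at entry
        margin_xbt: float   # isolated margin posted, in XBT

    def unrealised_pnl_xbt(pos, mark_price):
        # Inverse-contract PnL in XBT.
        return pos.contracts * (1.0 / pos.entry_price - 1.0 / mark_price)

    def remargin_all(positions, mark_price):
        # Called on every mark-price change: flag positions whose equity has fallen
        # below the maintenance-margin requirement.
        to_liquidate = []
        for pos in positions:
            equity = pos.margin_xbt + unrealised_pnl_xbt(pos, mark_price)
            maintenance = abs(pos.contracts) / mark_price * MAINTENANCE_MARGIN_RATE
            if equity < maintenance:
                to_liquidate.append(pos)
        return to_liquidate

    # A 100x long (1% initial margin posted): a roughly 0.5% adverse move in the
    # mark price takes it below maintenance.
    book = [Position(account=1, contracts=10_000, entry_price=10_000.0, margin_xbt=0.01)]
    print(remargin_all(book, mark_price=9_950.0))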

BitMEX won’t sacrifice safety for speed. The security of our users’ funds and confidence in their trades is paramount. But we hear all of you: you want to trade faster, you want freedom from “System Overload” messages, and we will give that to you.

Since late 2017, the BitMEX team has refocused on engine performance as our highest priority. We have built, and are continuing to build, a team full of the top professionals in the space. This team works hard, building capacity for the next 100x increase in trading volume.


In the second part of this series, I’ll explain in-depth:

  • How the BitMEX engine processes orders and remargining
  • How real-time messages flow through our system to your browser
  • How BitMEX uses API-first design to provide the most powerful API in the business
  • Performance charts showing hot-spots, peak versus baseline load, and corner-cases
  • A breakdown of the dreaded “System Overload” message, and how it is generated

In the third part, I’ll also explain:

  • Performance numbers showing how capacity has increased since 2017
    • We have made large strides in the past months – but demand has increased to match
  • Roadmaps and pending work for Q2
  • BitMEX’s vision for the future of online derivatives trading

Thank you to all of you for being a part of BitMEX’s success. Ben, Arthur and I feel fortunate to be a part of such a great company: our customers, team, and market opportunity are simply best-in-class.

Reach out to me directly on Twitter at @STRML_ and on Telegram at STRML. I also occasionally talk with traders on the Whalepool TeamSpeak, a fun community of traders that have given great feedback and encouragement to BitMEX for years.


A common sight from the window of the Dubrovnik apartment where BitMEX was launched.

1 – Starship Troopers was ahead of its time with its views on software development.

A Note on Recent ETH Liquidations

At 02:22 UTC on April 15, 2018, the ETHBTC price on Poloniex crashed approximately 18%, from 0.063 to 0.052. The Poloniex price is used as 100% of the index price (.ETHXBT30M) for the ETHM18 contract.

Crypto trading is very risky and underlying prices are highly volatile. While this type of movement is unexpected and undesirable, the price on BitMEX accurately reflected the price on the underlying spot market throughout this crash and recovery. Due to BitMEX systems functioning as expected and according to specification, there will be no refunds of losses sustained due to liquidations.

Index stability is important to us. BitMEX will continue to evaluate if adjustments are needed to existing index compositions. The liquidity of crypto spot markets is constantly in flux and can change significantly over the lifetime of a quarterly contract. Poloniex hosts one of the most liquid ETHBTC markets (note, not ETHUSD) in the world, but this unexpected action has triggered an internal re-review.

The Index Price is used as the central calculation for marking BitMEX futures. This Mark Price is used for margin calculations, which trigger liquidations. More information is available here.
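As a simple illustration (hypothetical code, not the production index implementation): when an index is weighted 100% to a single constituent, the index, and the mark price derived from it, follows that one market move for move.

    # Hypothetical illustration, not the exact BitMEX index code: with a single
    # constituent weighted at 100%, the index (and the mark price derived from it)
    # tracks that market one-for-one.
    INDEX_WEIGHTS = {"Poloniex": 1.0}  # .ETHXBT30M composition at the time

    def index_price(constituent_prices):
        return sum(INDEX_WEIGHTS[name] * price for name, price in constituent_prices.items())

    before = index_price({"Poloniex": 0.063})
    after = index_price({"Poloniex": 0.052})
    print(f"index moved {100 * (after - before) / before:+.1f}%")  # about -17.5%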

Changes to altcoin futures contracts

On 30 March 2018, we are making the following changes to altcoin products:

  • The 0.25% settlement fee will be removed on futures products. We hope that will encourage greater liquidity by removing barriers to entry and exit.
  • We will be removing the DASH, ETC, NEO, XMR, XLM, and ZEC pairs after they expire, for the time being. This is to free up trading-engine capacity on our more popular contracts. After we make certain optimisations, we may re-list these.
  • The ADA, BCH, ETH, LTC, and XRP contracts will be re-listed for another quarterly cycle today:
    • BitMEX Cardano / Bitcoin 29 June 2018 futures contract (ADAM18)
    • BitMEX Bitcoin Cash / Bitcoin 29 June 2018 futures contract (BCHM18)
    • BitMEX Ether / Bitcoin 29 June 2018 futures contract (ETHM18)
    • BitMEX Litecoin / Bitcoin 29 June 2018 futures contract (LTCM18)
    • BitMEX Ripple / Bitcoin 29 June 2018 futures contract (XRPM18)

Postmortem: Downtime, July 14, 2017

Traders,

On July 14, 2017, we suffered a minor downtime as a runaway ZFS snapshot process froze up disk I/O on the trading engine. No data was lost. While the outage was relatively minor and required only a host reboot, we took additional time to re-verify data, clean up ZFS snapshots, and fix the underlying issue.

We apologize for the disruption.

If you are interested in our recent migration to ZFS, please see this post.

Postmortem: Downtime, July 5, 2017

Traders,

On July 5, 2017, we suffered a prolonged downtime – our longest since launch in November 2014 – due to a server issue. Trading was suspended from 23:30 UTC until 03:45 UTC, for a total suspension of 4 hours and 15 minutes.

Those of you who trade with us know that we take our uptime very seriously, and the record shows it. Before this month, we had not had a single month with less than 99.9% uptime, with our longest 100% streak reaching nearly 300 days.

So what happened?

The crypto market is exploding, as many of you know. While we have one of the most sophisticated trading engines in the industry, its focus has always been on correctness (remargining positions continuously, auditing every trade), rather than speed. This was a winning strategy from 2014 to 2016, and we’ve never lost an execution, but as we entered record-setting volume in the beginning of this year, requests started to queue up.

 

Optimizing the BitMEX Trading Engine

We started optimizing. The web layer, up to this point, hadn’t had any issues – we could always scale it horizontally – but the engine (at this time) cannot be horizontally scaled. We partnered with Kx, the makers of kdb+, which powers our engine. We began testing new storage subsystems and server configurations. We settled on an upgrade plan, set for five days hence (July 11), and began testing the switchover. We simulated the switchover thrice, each time setting a timer so that we could best estimate our downtime. The plan was:

  • Move to a larger instance with a faster local SSD, and
  • Move from bcache + ext4 to ZFS.

Some more details on those actions:

  • EBS is slow. So we would move the trading engine from an AWS c3.xlarge, which we used for its fast local SSDs in combination with bcache, to an i3.2xlarge. This gives us far faster local SSDs and nearly 20x the local SSD storage, so we can easily cache our entire data set.
  • ZFS gives us some distinct advantages over other filesystems:
    • ZFS checksums individual blocks, preventing data rot. It can be scheduled to automatically check & repair drives (this is called a scrub), and can be configured to alert on varied criteria. This goes a long way toward ensuring the continued integrity of our data.
    • ZFS allows us to easily mirror and replicate our data across multiple volumes and physical locations.
    • ZFS snapshots are cheap, especially compared to traditional backup systems that must check the size & modified time of every file; in our testing, we can snapshot as often as every second (!) without any significant performance regression.
    • Kdb+ data is stored in a columnar fashion, like so:
      trade
      ├── foreignNotional
      ├── grossValue
      ├── homeNotional
      ├── price
      ├── side
      ├── size
      ├── symbol
      ├── tickDirection
      ├── timestamp
      └── trdMatchID
    • This data is highly compressible – in practice we see compression rates approaching 4x. This directly translates to less data over the wire to EBS and faster checkpointing & lower latency on the write log. For example, du is able to show the “apparent size”, that is, the size the OS thinks these files are, versus the actual space usage:
      /u/l/b/e/d/h/execution $ du --apparent-size -h
      955M .
      /u/l/b/e/d/h/execution $ du -h
      268M .
    • ZFS has the concept of the ARC (fast in-memory caching, an adaptive combination of MFU and MRU caches; in practice, the MFU cache is better for our use case), and the L2ARC, which provides a second-level spillover of this data, ideally to fast local SSD. It even compresses, leading to some eye-popping metrics:
      L2 ARC Size: (Adaptive)       1.17 TiB
      Compressed:            33.74% 403.90 GiB 
      Header Size:            0.08% 931.12 MiB
    • ZFS snapshots are amazing, and easy to code for. This allows us to do things that would be impossible otherwise, such as automatically snapshotting the engine data before and after any code changes (see the sketch below). This is only practically possible because of the instant nature of snapshots.

I could go on. We’re ZFS superfans.
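To illustrate the kind of automation that cheap, instant snapshots make practical, here is a hypothetical sketch (not our actual tooling) of wrapping a code change with pre- and post-change snapshots of the engine dataset; the dataset name is a placeholder.

    # Hypothetical sketch (not BitMEX's actual tooling): snapshot the engine dataset
    # before and after a code change, and roll back if the change fails.
    import subprocess
    import time

    DATASET = "tank/engine"  # placeholder dataset name

    def snapshot(tag):
        name = f"{DATASET}@{tag}-{int(time.time())}"
        subprocess.run(["zfs", "snapshot", name], check=True)
        return name

    def deploy_with_snapshots(run_change):
        before = snapshot("pre-change")
        try:
            run_change()
        except Exception:
            # Roll back to the pre-change snapshot (-r also discards newer snapshots).
            subprocess.run(["zfs", "rollback", "-r", before], check=True)
            raise
        snapshot("post-change")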

 

What Went Wrong

As Donald Rumsfeld once said:

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.

We had the plan ready to go, checklists ready, and we had simulated the switchover a few times. We started preparing a zpool for use with the production engine.

Here’s where it went wrong.

19:47 UTC: We create a mirrored target zpool that would become the engine’s new storage. In order to not influence I/O performance on the running engine, we snapshot the data storage drive, then remount it to the instance. This is not something we did in our test runs.

Bcache, if you haven’t used it before, is a tricky beast. It actually moves the superblock of a partition up by 8KB and uses that space for specific metadata. One piece of this metadata is a UUID, so bcache can identify unique drives. And that makes perfect sense, in the physical world. It’s in the virtualized world that this becomes a problem. What happens when you snapshot a volume – bcache superblock and all – and attach it?

Without any interaction, the kernel automatically mounted the drive, figuring it was also the backing device on the existing (running) bcache device, and appeared to start spreading writes randomly across both devices. As you can imagine, this is a problem, and began to trash the filesystem minute-by-minute, but we didn’t know it was doing this. It seemed odd that it had mounted a bcache1 drive, but we were not immediately alarmed. No errors were thrown, and writes continued to succeed. We start migrating data to the zpool.

22:09 UTC: A foreign data scraper on the engine instance (we read in pricing data from nearly every major exchange) throws an “overlap mismatch”. This means that, when writing new trades, the data on disk did not mesh perfectly with what was in memory. We begin investigating and repairing the data from our redundant scrapers, not aware of the bcache issue.

23:02 UTC: A read of historical data from the quote table fails. This causes the engine team serious concern. We begin to verify all tables on disk to ensure they match memory. Several do not. We realize we can no longer trust the disk, but we aren’t sure why.

We begin snapshotting the volume every minute to aid in a rebuild, and our engine developers start copying all in-memory data to a fresh volume.

23:05 UTC: We schedule an engine suspension. To give traders time to react, we set the downtime for 23:30 UTC and send out this notice. We initially assume this is an EBS network issue and plan to migrate to a new volume.

23:30 UTC: The engine suspends and we begin shutting down processes, dumping all of their contents to disk. At this point we believe we have identified the cause of the corruption (bcache disk mounting).

Satisfied that all data is on multiple disks, we shut down the instance, flushing its contents to disk and wait for it to come back up.

It doesn’t. We perform the usual dance (if you’ve ever seen a machine fail to boot on AWS, you know this one): unmount the root volume, attach to another instance, check the logs. No visible errors.

We take a breath and chat. This is going to be more difficult than we thought.

23:50 UTC: We decide to move the timetable up on the ZFS and instance migration. It becomes very clear that we can’t trust bcache. We already have our migration script written – we begin ticking boxes. We clone our Testnet engine, which had already been migrated to ZFS, and begin copying data to it. The new instance has 2x the CPU & 4x the RAM, and a 1.7TB NVMe drive. We’re looking forward to the increased firepower.

00:30 UTC: We migrate all the init scripts and configuration, then mount a recent backup. We have trouble getting the bcache volume to mount correctly as a regular ext4 filesystem. The key is recalling that the superblock has moved 8KB forward. We mount a loopback device & start copying.
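For the curious, that recovery step looks roughly like the following hypothetical sketch; the device and mountpoint names are placeholders. Because bcache keeps its own header in the first 8KB of the backing partition, the ext4 filesystem can be reached by attaching a loop device at an 8192-byte offset.

    # Hypothetical sketch of the recovery step above; device and mountpoint names
    # are placeholders. bcache stores its own header in the first 8KB of the backing
    # partition, so the ext4 filesystem starts 8192 bytes in.
    import subprocess

    BACKING_DEV = "/dev/xvdf1"   # placeholder: the old bcache backing partition
    MOUNTPOINT = "/mnt/recovery"

    # Attach a loop device starting 8KB into the partition, skipping the bcache header.
    loopdev = subprocess.run(
        ["losetup", "--find", "--show", "--offset", "8192", BACKING_DEV],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    # Mount read-only and copy the data off.
    subprocess.run(["mount", "-o", "ro", loopdev, MOUNTPOINT], check=True)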

We also set up an sshfs tunnel to Testnet to migrate any missing scraper data. The engine team begins recovering tables.

~01:00 UTC: We destroy and remount the pool to work around EBS<->S3 prewarming issues. While the files copy, we begin implementing our new ZFS-based backup scheme and replicate minutely snapshots, as we work, to another instance. This becomes valuable several times as we verify data.

~02:00 UTC: The copy has finished and the zpool is ready to go. Bcache trashed blocks all over the disk, so the engine team begins recovering from backup. This is painstaking work, but between all the backups we had taken, we have all the data.

~03:00 UTC: The backfill is complete and we are verifying data. Everything looks good. We didn’t lose a single execution. Relief starts flooding through the room. We start talking timetables. We partition the local NVMe drive into a 2GB ZIL & 1.7TB L2ARC and attach it to the pool to get ready for production trading.

03:05 UTC: We bring the site back online, scheduling unsuspension at 03:45 UTC.  Our support team begins telling customers the new timeline. Chat comes back on.

03:45 UTC: The engine unsuspends and trading restarts. Fortunately, the Bitcoin price has barely moved over these four hours. We consider our place in the world.

 

Postmortem

While we prepared for this event, actually experiencing it was quite different.

Over the next two days, the team communicated constantly. We wrote lists of everything that went wrong: where our alerting failed, where we could introduce additional checksumming, how we might stream trade data to another instance and increase the frequency of backups. We introduced more fine-grained alerts up and down the stack, and began testing them.

To us, this particular episode is an example of an “unknown unknown”. Modern-day stacks are too large, too complicated, for any single person to fully understand every single aspect. We had tested this migration, but we had failed to adequately replicate the exact scenario. The best game to play is constant defense:

  1. Don’t touch production.
  2. Really, don’t touch production.
  3. Treat in-service instances as immutable: clone, modify, test, switch.

As we scale over the coming months, we will be implementing more systems toward this end, toward the eventual goal of having an infrastructure resilient to even multiple-node failures. We want to deploy a Simian Army.

Already, we are making improvements:

  • Moving to ZFS itself was a long-planned and significant step that affords us significantly improved data consistency guarantees, much more frequent snapshotting, and better performance.
  • We are developing automated tools to re-check data integrity at intervals (outside of our existing checks + ZFS checksumming), and to identify problems sooner.
  • We have reviewed every aspect of our alerting system, reworking several gaps in our coverage and implementing many more fail-safes.
  • We have greatly expanded the number of jobs covered under Dead Man’s Snitch, a service that has proven invaluable over the last few years.
  • We have implemented additional backup destinations and re-tested. We are frequently replicating data across continents and three cloud providers.
  • We continue to implement new techniques for increasing the repeatability of our architecture, so that major pieces can be torn down and rebuilt at-will without significant developer knowledge.

Thanks to our great customers for being understanding while we were down, and for continuing to support us.

Additional Withdrawal Time at 10:00 UTC

We have received several support tickets asking about a special early withdrawal period, so that users may claim entry in the Byteball Fair Initial Distribution, which takes place at 13:10 UTC.

To support this, we will be initiating early withdrawals at 10:00 UTC tomorrow. No opt-in is necessary; all withdrawals will be processed if confirmed before that time. The usual time at 13:00 UTC will be honored as well. If you wish to participate in this distribution, we recommend hitting the 10:00 UTC cutoff so you have sufficient time for the transaction to confirm.

QTUM Futures Now Live

BitMEX is proud to announce the launch of QTUM Futures contracts, expiry 29 September 12:00 UTC with symbol QTUMU17. Each contract is worth 1 QTUM and the contract offers 2x leverage.

Since the QTUM platform is still under development, the following rules will apply:

  • QTUMU17 will have a 25% up and down limit against the previous session close price to prevent price manipulation. Each session is 2 hours long, and session closes occur every even-numbered hour (see the sketch after this list).
  • Settlement will occur either at the ICO price (if QTUM/XBT trading has not begun) or at the .QTUMXBT30M Index Price if QTUM/XBT has begun trading prior to 28 September 12:00 UTC.
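As a simple illustration of the session limit rule (the closing price used here is hypothetical):

    # Simple illustration of the 25% session price-limit rule; the example closing
    # price is hypothetical.
    LIMIT = 0.25  # 25% up and down against the previous session close

    def session_limits(prev_session_close):
        return prev_session_close * (1 - LIMIT), prev_session_close * (1 + LIMIT)

    # If the previous 2-hour session closed at 0.00400 XBT, orders would be
    # constrained to roughly the 0.00300-0.00500 XBT band until the next session close.
    print(session_limits(0.004))  # approximately (0.003, 0.005)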

Further details about this contract can be read in the QTUM Series Guide.