Learning from Financial Trading Bugs


The commodities and securities trading exchanges provide challenging examples for cloud and big data application development. Their users are disparate traders worldwide who require high trading volumes and low latency. The exchanges use enormous amounts of storage, networking, and computer processing power. My IEEE Computer Society talk, here, discusses some of the technical features of such applications and of the hardware on which they run. Ordinary public cloud systems cannot currently address these needs, and perhaps they never will. On the other hand, those of us developing big data and/or cloud software applications can learn a lot by studying these “bleeding edge” applications, their bugs, and the consequences of such bugs.

Big Data pioneers such as Yahoo!, LinkedIn, Facebook, Google, eBay, etc. have, of course, their own bugs that have economic consequences for both the companies and their customers. Larger service providers such as Amazon, Microsoft, GoDaddy, and Rackspace have outages that do serious damage to their customers. However, financial trading applications can cause millions of dollars of damage in just a few seconds, and the governmental oversight agencies eventually get involved. This has happened in a big way this year [1,2] with four incidents that seem to have galvanized these agencies into action:

  • On Feb 24, options market maker Ronin Capital injected more than 30,000 mispriced quotes into the NYSE Amex exchange.
  • On March 23, the BATS Exchange, handling its own IPO traffic on top of other traffic, crashed. (How embarrassing!) Among other losses, this caused a brief 9% price decline in Apple shares.
  • On May 18, the Facebook IPO had many orders stalled and not executed on the NASDAQ exchange. The Union Bank of Switzerland, alone, lost more than $350 Million, and curiously Knight Capital lost $35.4 Million in this incident.
  • On August 1, the Knight Capital Group lost $440 Million by flooding the NYSE with bad orders.

Since “You can’t know the players without a program…”, here is a brief cheat sheet of agency acronyms:

  • CFTC = Commodity Futures Trading Commission
  • FIA = Futures Industry Association
  • FIA-EPTA = European version of the FIA-PTG
  • FIA-PTG = FIA’s Principal Traders Group
  • FRB = Federal Reserve Bank
  • FSOC = Financial Stability Oversight Council (established by the Dodd-Frank Act)
  • IOSCO = International Organization of Securities Commissions
  • MFA = Managed Funds Association (hedge funds)
  • SEC = Securities and Exchange Commission

Of course, numerous observers clamored for reform (e.g. [5,6,7,10]), and the above agencies also started to issue calls for action:

  • MFA requested of the SEC mandatory risk checks on all orders, new requirements on system testing, and a requirement for an individual with a “kill switch” to watch over all trading activity. (Imagine not trusting computer programs and wanting a human being to watch over automated trading!) [14]
  • The FIA PTG/EPTA issued its “Software Development and Change Management Recommendations”, March 2012. While both reasonable and comprehensive, there is nothing new in the report from an academic software development perspective. What is interesting is that they felt it was necessary to prepare it for financial application development. [13,14]
  • The FSOC made some vague recommendations in July 2012 that the SEC and the CFTC consider establishing error control and standards for exchanges, clearing houses, and other market participants that are relevant to high-speed trading. [11]
  • On August 2, the FIA PTG made a “soft” statement to the SEC at its Roundtable, noting that the 2005 regulations, designed to encourage market competition, created “different safety controls” which now need “smart regulatory policies.” On August 3, the FIA PTG/EPTA issued a stronger statement on the “Knight Capital” problem, stating “Rapid advances in trading technology have brought very substantial benefits… but … they also have introduced new sources of risk.” They reiterated their earlier recommendations for “tests and controls” that trading firms should consider when they change their technology systems. [12, 13]
  • In August 2012, IOSCO issued a “Consultation Report” entitled “Technological Challenges to Effective Market Surveillance: Issues and Regulatory Tools” which called for greater data collection for the purposes of surveillance of automatic or algorithmic trading of securities. [8] It refers to an earlier paper “Objectives and Principles of Securities Regulation” dated May 2003 that has 38 “principles” for such software development and regulation. Both papers are good reading. IOSCO further warns of the dangers of the situation, then and now, caused by the neglect of these principles. [3]
  • October 1, 2012 the FRB of Chicago issued a report “How to keep markets safe in the era of high-speed trading” by Carol Clark. By interviewing various vendors, the author points out that there are a few places in the system where checks can and should be made. It makes solid recommendations on various risk limits, risk mitigation techniques, kill switches, position limits, and profit and loss limits. Good paper. [4]
  • October 4, 2012 The FIA PTG responded to the Chicago FRB’s report, supporting its recommendations. [15]
  • October 10, 2012 The FIA PTG/EPTA responded to IOSCO’s recommendations for market surveillance and audit trail quality, wanting more, especially surveillance for illegal or inappropriate conduct which might be facilitated by automated trading. [3]

Wow! Four bugs caused all this commotion? Well, no. The noticeable problems were occurring prior to 2012 and also outside of the US. (Many of these are discussed in earlier posts.) There clearly was a welling up of (and I’m not sure this is the right word, but) anger.

So, besides just being new, what is wrong? Well, in high frequency trading, speed is king, and it would appear that no one wants to slow down their software by putting in the audit trails that IOSCO recommends. Vendors force the regulators to read the code to audit their systems! Can you imagine how worthless that exercise is? No one seems to realize that such code additions would actually help test and debug their systems. Risk and profit/loss limits seem easy to implement, and while they do slow down the system a little bit, the more likely reason they are missing is that such limits are an annoyance. Again, regulation is needed.
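To show how little code such limits and an audit trail need, here is a minimal Python sketch of a pre-trade risk gate. It is only an illustration under stated assumptions: the names `RiskLimits` and `RiskGate`, the threshold values, and the in-memory audit list are all hypothetical and not taken from any exchange, vendor, or regulatory document.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RiskLimits:
    # Hypothetical thresholds; real values depend on the firm and the market.
    max_order_qty: int = 1_000
    max_position: int = 5_000
    max_loss: float = 100_000.0


@dataclass
class RiskGate:
    limits: RiskLimits
    position: int = 0          # signed net position
    realized_pnl: float = 0.0  # running profit (+) or loss (-)
    killed: bool = False       # kill switch state
    audit: list = field(default_factory=list)

    def check(self, side: str, qty: int) -> bool:
        """Return True if the order may be sent; record the decision either way."""
        signed = qty if side == "buy" else -qty
        if self.killed:
            reason = "kill switch engaged"
        elif side not in ("buy", "sell"):
            reason = "unknown side"
        elif qty <= 0 or qty > self.limits.max_order_qty:
            reason = "order size limit"
        elif abs(self.position + signed) > self.limits.max_position:
            reason = "position limit"
        elif self.realized_pnl <= -self.limits.max_loss:
            self.killed = True  # stop all further trading until a human resets it
            reason = "loss limit breached"
        else:
            reason = None
        # Audit trail entry: timestamp, order details, and the outcome.
        self.audit.append((time.time_ns(), side, qty, reason or "accepted"))
        return reason is None

    def on_fill(self, side: str, qty: int, pnl_change: float) -> None:
        """Update position and profit/loss after an execution."""
        self.position += qty if side == "buy" else -qty
        self.realized_pnl += pnl_change
```

A caller would run `gate.check(side, qty)` before sending each order and `gate.on_fill(...)` after each execution. Each check is a handful of comparisons plus one list append, which is the kind of small cost the argument above is weighing against the losses in the incidents listed earlier.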

Complexity is probably the number two reason for such bugs. Here comes the argument that good testing won’t find all bugs. On the other hand, most of the bugs reported (or deduced) seem well within the current art of testing. I’ve seen no bugs reported that only occur on weird combinations of extreme data. In one case, the addition of new code activated some old “dead” code [14]. Both bugs (dead code and the new activation problem) could have easily been caught by reasonable testing. I’ve read about the now boring excuse of rushing new functionality to market for competitive reasons. Give me a break. With hundreds of millions of dollars at stake, shouldn’t the vendors be able to afford decent automated test suites? Properly done, such test suites make the development go faster! On the other hand, I’d hate to see government regulations on testing. It would be a case of the ignorant policing the ignorant. My guess is that the best government regulations would be to impose massive fines and to enforce total restoration of all money lost due to a bug. Even with proper catastrophe insurance, this should be significant motivation for quality!

For sure, a desire for high performance with complex software, made more difficult by dealing with relatively new big data infrastructure, is a recipe for lots of bugs. While I’ll discuss big data and cloud application development in subsequent posts, my thinking here is simple: Invest at least as much in your testing and its automation as you do in writing your application. Follow the IOSCO principles by adding code for debugging and for auditing. It will pay for itself. Get audited. Audits probably won’t find anything, but your financial and legal consequences will probably be less severe should a bug rear its ugly head. Also, when high performance in networking and IO is desired, go with new hardware that has built-in measurement and time-stamping features. If this is not possible, then add such measurements to your software. Finally, do some sanity checks and reasonability calculations to make sure you are not doing something fundamentally wrong.
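As a rough illustration of the last two suggestions, here is a small Python sketch that adds time measurements in software and applies a reasonability check before a quote goes out. Everything in it is an assumption made for the example: the names `timed`, `is_reasonable`, and `submit_quote`, the 10% deviation threshold, and the in-memory `latency_ns` list are hypothetical, and a real system would use its own reference prices and measurement store.

```python
import time
from functools import wraps

latency_ns = []  # hypothetical in-memory store of (function name, elapsed ns)


def timed(fn):
    """Record the elapsed time of each call, even when the call raises."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            latency_ns.append((fn.__name__, time.perf_counter_ns() - start))
    return wrapper


def is_reasonable(price, reference, max_deviation=0.10):
    """True if price is positive and within max_deviation of the reference price."""
    if price <= 0 or reference <= 0:
        return False
    return abs(price - reference) / reference <= max_deviation


@timed
def submit_quote(price, reference):
    """Refuse to send a quote that fails the reasonability check."""
    if not is_reasonable(price, reference):
        raise ValueError(f"quote {price} is too far from reference {reference}")
    return {"price": price, "status": "sent"}
```

With these definitions, `submit_quote(101.0, 100.0)` goes through, while `submit_quote(150.0, 100.0)` raises `ValueError`. Both calls leave an entry in `latency_ns`, so the measurements cover rejected quotes as well as accepted ones.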






