Archive for August, 2012

Crashes on Wall Street


It is of course interesting when cloud vendors have problems.  Cloud computing is like the 1849 Gold Rush to California.  If you are not heading there, you are at least talking about it.

Less discussed are recent trading crashes on “Wall Street”.  Maybe it is the distance of the moon to the earth, but there sure have been a noticeable number of such crashes.  Now Wall Street is a little more secretive than the big cloud vendors.  Maybe it is because there is less technical scrutiny on Wall Street, and they don’t perceive the advantages of openly discussing technical problems.

What prompts this post is an article in the business section of today’s LA Times about Knight Capital’s recent crash having its root cause a total lack of adequate quality assurance, but I’m getting ahead of myself …

Let’s start with the automated trading system BATS.  In a sense, BATS is a competitor of all the established stock exchanges, and is now the third largest stock exchange in the US.  It is totally automated, replacing floor traders with software.  It started in October 2008, and it was doing so well, that it decided to have an IPO – an entrepreneur’s dream exit – on March 23, 2012.  They, not surprisingly, picked themselves to be the exchange to list and sell their stock.  But a funny thing happened on the way to the market, just as their stock was about to trade, their system crashed!  Not just stopped, but many trades, including Apple’s stock that morning, were corrupted.  They finally pulled the plug, but the damage was done.  I assume they were able to mop up the corrupted trades, but their embarrassment was so great that they withdrew from the planned public offering. (BATS recently announced they will try the IPO route again.) I tried hard to find the root cause, but secrecy prevailed.  They did release a statement that they rated their system as 99.9% available.  They had had a few crashes prior to the March debacle, but that wasn’t warning enough. Such a low availability rating is inexcusable for a stock trading system, and it appeared to me that their testing was just totally inadequate.

Imagine how unhappy all the Facebook investors were when the NASDAQ IPO software couldn’t process all the buy requests in the early minutes of the Facebook IPO.  NASDAQ claimed they spent “thousands of hours” running “hundreds of scenarios” for their testing.  Again I haven’t seen a technical root cause analysis for the NASDAQ problems.

I’m going to skip over the JP Morgan billion dollar loss, but it was due to a rogue trader executing more than risky trades.  I will, however, wonder aloud why adequate reporting software didn’t exist to flag such a series of risky trades.

Public exchanges are not the only sources of computer bugs.  AXA Rosenberg had to pay $217 million to pay for investor losses due to a “significant error” in its software systems.  It also paid a $25 million penalty to regulators for … guess what … hiding the error!  “The secretive structure and lack of oversight of quantitative investment models, as this case demonstrates, cannot be used to conceal errors and betray investors.”

Not to totally pick on the US, and to give this blog a bit of an international flair, consider Madrid’s Bolsas y Mercados Espanoles, which suffered a four hour outage when communication servers crashed.  What, no redundancy?  No duplicate Internet connections?  Sigh…  This breakdown affected two multilateral trading platforms operated by NYSE Euronext: Smartpool and NYSE Arca Europa where orders could be submitted but not traded.  Note that a bug in one exchange can affect other exchanges!

The Tokyo Stock Exchange just had its second major problem in the last year.  The root cause was reported to be a router going down, and the failover to a backup router also failed.  First kudos to the managing vendor Hitachi for disclosing the problems.  Second, the key lesson to learn is to test your failover mechanisms!  BUT, it took 95 minutes for the on-site staff to diagnose the problem and to affect a manual failover.  Way too long.  The lesson here is that to manage Mean Time To Repair (MTTR), training and practice is essential.  Good diagnostic software might also have helped identify the problem faster.

Now, back to the US and Knight Capital:  In less than an hour, trades that were supposed to be spread out over days, were executed essentially one after the next.  The result was a $440 million dollar loss for the firm.  Had not investors led by Jefferies Group, Ltd.  provided $400 million, Knight Capital would have gone under.

Now comes today’s LA Times article, by-lined Bloomberg News.  It appears that Knight Capital installed some new software designed to interface to the NYSE’s new retail liquidity program for small investors.  The “Law of Untended Consequences” bit them.  The installation of this new software somehow activated previously dormant software, which started multiplying trades by 1000.  What a bug!  ANY KIND OF TESTING would have discovered such a huge bug.

There is a theme here.  Software on financial exchanges executes trades at dizzying speeds.  A bug can very quickly cause millions of dollars in bad trades.  I’m a little shocked that this software is obviously not tested adequately.


References and interesting links:

The costliest bug ever:

How software updates are destroying Wall Street (Bloomberg and Businessweek):

Two Years After the Flash Crash (of 2010), Are Markets any Safer?   [I strongly recommend this article!]

SEC judgement on AXA Rosenberg Entities: