What We Learned From the NYSE, United Airlines Tech Outages
Phone companies typically promise 99.999% reliability. But can we expect the same from other kinds of networks upon which we increasingly rely?
By Steve Rosenbush and Steven Norton, The Wall Street Journal
9 July 2015
Say what you will about Plain Old Telephone Service, but it worked. The functionality of POTS, as it was known, was limited to making calls, and they were expensive. But many traditional phone companies offered 99.999% reliability, which allowed for about five minutes of downtime a year.

Today’s networks are far less expensive, infinitely more capable and nowhere near as reliable as the wired-to-the-wall phone, as a spate of network outages on Wednesday demonstrated. The New York Stock Exchange halted trading, citing a technical problem. United Airlines was grounded because of an IT issue. The NYSE outage lasted about four hours, or nearly 50 years of allowable downtime using the “five nines” standard.

In the age of artificial intelligence, smartphones, cloud computing and robotic cars, how can this be?

To some extent, contemporary networks suffer from inattention. The old phone system worked so well because regulators in certain countries like the U.S. said it had to, and enough money was set aside to fund an army of technicians and engineers to oversee it. That generally isn’t the case with modern, digital networks and IT infrastructure, and companies often neglect this nuts-and-bolts technology.

“Until there’s a day like today, everybody invests in other things,” said Thomas Bayer, chief information officer at Standard & Poor’s Ratings Services, on Wednesday. “When you have a day like today, people start to concentrate on their infrastructure.”

Yet reliability problems persist even in industries such as finance, where considerable time and money, along with pressure from regulators and customers, go into network technology and IT infrastructure. Outages can cost a company $5,000 to $10,000 a minute, researcher Gartner Inc. estimates.

Today’s problems with reliability are more fundamental, a reflection of the complexity of contemporary networks, the volume of data, the pace of change, insufficient organizational and cultural practices, and a legacy of arcane and poorly written business software that traditionally put little emphasis on usability or customer experience.

Outages persist because of the interdependency of computer systems, fueled by the rise of digital services across all industries, particularly those with customer-facing software such as mobile apps, according to former NYSE Euronext CIO Paul Cassell, now CIO of Pico Quantitative Trading LLC. “Data is going to more places than ever before,” he said. “The level of complexity…has increased tenfold. The IT organizations need to be ahead of the curve.” He says solid, well-documented and automated processes for building, testing, upgrading and configuring IT are essential.

That process seemed to break down at the NYSE, which issued a statement Thursday saying that customer gateways weren’t properly configured for a software release, leading to communication problems with a trading unit.

In a fragmented market like finance, if one exchange or network goes down, another is supposed to pick up the slack. No single network bears the world on its back anymore, so the pressure to be perfect isn’t as high. Unless, of course, it is. “When I was there, if you had interruption for a second, it feels like an hour. It’s an eternity,” said former NYSE Euronext CIO Steve Rubinow, now CTO at data-marketing firm Catalina.

Underneath it all, the economics of falling prices carry a trade-off. Consumers get more for their money in the mobile, digital era, but that often leaves margin-stretched companies with fewer resources to invest in robustness and maintenance. Reliability is as much a function of business and risk management as it is of technology.

“I don’t know if people are sweating that detail as much as they used to,” said Mr. Bayer, previously CIO of the Securities and Exchange Commission.

At the SEC, hurricane season was a chance to practice disaster recovery. The agency would intentionally bring down local area networks at regional offices to protect them from storm damage.

S&P had redundant capabilities when Mr. Bayer joined the company in February. Now, he’s looking to cloud providers to increase redundancy and resilience across the country. Real-time metrics can now show companies if a circuit is about to be overloaded or if a particular machine is about to fail. But something may also be lost amid the shift to automation.

Retailer Best Buy Co. spent millions of dollars upgrading its IT systems over the past three years, hiring its own staff to develop better internal systems for handling sales and tracking inventory. The chain also hired Internet specialist Akamai Technologies Inc. to make sure its website could handle the waves of online shopper traffic that usually hits stores before the holidays.

That still didn’t stop the company’s website from crashing on Black Friday under the weight of visits from people using smartphones. Time Warner Cable Inc. cut off its entire subscriber base for more than an hour in August when an employee misconfigured the network’s internal system for maintaining Internet addresses. A TWC spokesman said the company later added more safeguards and worker training to protect its network from suffering the same problem again.

It is very difficult for any company to account for all that can go wrong. And sometimes the smallest detail can lead to a big problem.

When the United and NYSE incidents occurred Wednesday, Time Inc. CTO Colin Bodell put information security staff on high alert and monitored social media and government sites.

Time has disaster recovery systems that protect the company in the event of a catastrophic failure. If a primary system goes down, Time can switch traffic to a redundant system in a different location. Mr. Bodell faced a “backhoe dug up a fiber connection” situation a few weeks ago. He realized the incident had occurred when the system notified staff that it had failed over successfully. It is difficult to build every backhoe into a disaster recovery plan. And even if you could, would you want to? Could you afford to?

Former NYSE Euronext Chief Operating Officer Lawrence Leibowitz told the Journal in 2013 that the public shouldn’t expect market technology to function perfectly, a goal that would be too expensive to implement even if it were technically feasible.

“When you go through a lot of change, there is a higher probability of error. That is just what happens, and that means we need to do a better job of managing change,” he said.

Even then, problems are bound to persist, because the price of perfection very well might be bankruptcy. “There is a cost-benefit analysis,” Mr. Leibowitz said. “How much will it cost to address it, how much will it cost industry participants to use it. That applies to any solution to this problem. I don’t know if the industry could pay billions of dollars to fix this.”
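
For readers who want to check the “five nines” arithmetic cited earlier in the article, the following is a minimal sketch in Python. The function names are hypothetical and chosen for illustration, and the 60-minute outage in the cost example is an assumed figure, not one reported here. The per-minute cost range is the Gartner estimate quoted above. With the rounded figure of five minutes a year, a four-hour outage works out to 48 years of allowance, which is the article's "nearly 50 years"; with the unrounded budget it comes to roughly 46 years.

```python
# Illustrative downtime arithmetic for an availability target.
# Function names are hypothetical; they are not from any library.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_used(outage_minutes: float, availability: float) -> float:
    """How many years of downtime allowance a single outage consumes."""
    return outage_minutes / allowed_downtime_minutes(availability)


def outage_cost_range(outage_minutes: float,
                      low_per_minute: float = 5_000,
                      high_per_minute: float = 10_000) -> tuple:
    """Low and high cost estimates, using the per-minute range quoted above."""
    return outage_minutes * low_per_minute, outage_minutes * high_per_minute


FIVE_NINES = 0.99999

budget = allowed_downtime_minutes(FIVE_NINES)      # a little over 5 minutes
years = years_of_budget_used(4 * 60, FIVE_NINES)   # four-hour outage
low, high = outage_cost_range(60)                  # assumed 60-minute outage

print(f"Five-nines budget: {budget:.2f} minutes per year")
print(f"A four-hour outage uses about {years:.1f} years of that budget")
print(f"A 60-minute outage would cost ${low:,.0f} to ${high:,.0f}")
```

The same functions can be called with other targets, for example `allowed_downtime_minutes(0.999)`, to see how quickly the allowance grows as nines are dropped.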