Published on February 09, 2017 by Carl Fugate
atom-c2000 cisco clock-signal intel networking
It was just a normal day. A day like any other day. Children bundled up after asking Alexa what the weather was going to be like today. Parents checked their online calendar to see how many different places they would need to be at the same time so they could have an Uber ready to pick up their teen after practice.
Everyone scurried out the door, rushing off in different directions to start their day. And then…sometime in the middle of the day everything started to go wrong. At first it was just minor inconveniences like not being able to turn on a few IoT-controllable lights at the house. Then Siri stopped being able to answer me when I was pondering universally important things like “What is the Answer to the Ultimate Question of Life, the Universe, and Everything” (Spoiler - ‘42’). Not long after that, my cell phone stopped working and I lost my entire reason for living.
A few months ago, rumblings started that something was wrong with some devices made by a particularly large network vendor. Then on February 2nd, a release went live outlining that, due to a hardware design failure by an unnamed vendor, products from several different verticals could suffer complete failure after about 18 months of use. Tony Mattke has covered the vendor impact of this very well over on his site. This in and of itself, while scary, is not the thing of nightmares. Hardware design failures, while rare, do happen from time to time. In this particular case, even if this had been contained to just this network vendor, who had perhaps used the same chip inside several different product lines, it would not be catastrophic enough to impact the entire internet. But this issue is not contained to a single vendor. It is actually impacting several vendors that are using the same merchant chip in a wide variety of technology products.
We have seen the widespread adoption of merchant designs in technology products for quite some time. In 2005, Apple made headlines when it announced a switch from PowerPC chips to Intel x86. This was cheered the world over as it meant that development would get simpler and, more importantly…cheaper. This trend has continued, and the move to commodity chipsets has brought feature standardization with better performance at lower cost.
And therein, as they say, lies the rub. I was reading an article that Tom Hollingsworth wrote about this issue and he brought up a very compelling thought.
What if this issue had been present in Broadcom Trident or Tomahawk chips? What if a component of OCP or Wedge had a hidden fault? The larger the install base of a particular chip, the more impactful the outage could be.
This is the nightmare scenario…
Technology vendors are almost all now completely reliant on these chips, and they are embedded in hundreds of thousands of devices. A complete replacement of the sheer number of devices would not be possible in a reasonable timeframe and would have to be prioritized, with the largest customers getting the most attention. This would devastate any smaller organizations as they fought to get a piece of the small supply of replacements. Just look at the recent Samsung Note 7 recall (the first one), which took weeks to get the first replacement phones out there. Now realize their supply chain is an order of magnitude greater than that of most IT vendors, who have moved to ‘Just in Time’ manufacturing.
Cloud providers who run data centers at scales beyond what most of us could even imagine can hedge this risk. One of the things that makes this type of problem more manageable for them is their speed of change and adoption of new technologies. Most of the large providers are running at the bleeding edge just to keep up with demand. This means that their risk pools are spread out, because as they build more facilities they never deploy the same design more than a few times, if ever. They design for losses of entire facilities, which makes survivability of services more likely than most enterprises can expect.
In my mind, this brings the issue of risk pools (or my favorite - blast radius) back to the forefront of any design discussion. The old adage of “don’t put all of your eggs in the same basket” really rings true, but can we follow it? For most enterprises, standardization is the axiom of design as it reduces complexity and overhead for support. Spreading risk is even harder now that several vendors may be using the exact same chips. We need to make sure that we understand this and see if we can find ways to mitigate the risk.
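One rough way to start understanding that exposure is to group an inventory by the chip inside each device rather than by the logo on the front. The Python sketch below is only an illustration: the inventory, device names, vendor names, and chip labels are all hypothetical, and a real inventory would need chip data that vendors do not always publish.

```python
from collections import defaultdict

# Hypothetical inventory: (device name, vendor, underlying chip).
# None of these names refer to real products.
INVENTORY = [
    ("edge-rtr-1", "VendorA", "chip-x"),
    ("edge-rtr-2", "VendorA", "chip-x"),
    ("core-sw-1", "VendorB", "chip-x"),
    ("core-sw-2", "VendorB", "chip-y"),
    ("fw-1", "VendorC", "chip-z"),
]


def blast_radius_by_chip(inventory):
    """Return {chip: fraction of all devices that share that chip}."""
    devices_per_chip = defaultdict(list)
    for name, _vendor, chip in inventory:
        devices_per_chip[chip].append(name)
    total = len(inventory)
    return {chip: len(names) / total for chip, names in devices_per_chip.items()}


def vendors_sharing_chip(inventory):
    """Return {chip: set of vendors whose devices contain that chip}."""
    vendors = defaultdict(set)
    for _name, vendor, chip in inventory:
        vendors[chip].add(vendor)
    return dict(vendors)


if __name__ == "__main__":
    radius = blast_radius_by_chip(INVENTORY)
    shared = vendors_sharing_chip(INVENTORY)
    # List chips from largest to smallest share of the device estate.
    for chip, fraction in sorted(radius.items(), key=lambda kv: -kv[1]):
        print(f"{chip}: {fraction:.0%} of devices, vendors: {sorted(shared[chip])}")
```

In this made-up inventory, buying from two vendors does not split the risk the way it appears to, because both vendors ship devices built on the same chip.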