In the wake of the recent botched CrowdStrike update that took down 8.5 million Windows devices within a few hours, many armchair quarterbacks are calling for an end to kernel-level drivers for endpoint detection and response. Others are calling for more rigorous testing and improved release practices. And some groups are already calling for lawsuits, federal regulation, and other more drastic actions against the vendor.
What Happened?
Late on July 18th and into the early morning hours on July 19th, cybersecurity software vendor CrowdStrike released an update to all of its customers. The update adversely affected Microsoft Windows operating systems, causing many to freeze under the dreaded blue screen of death (BSOD). It is estimated that the incident impacted more than 8.5 million Windows systems, including servers, laptops, and workstations. The financial impact of the global outage is still being calculated, but is expected to be in the billions of dollars. Most of the affected systems required one or more touches from IT personnel to recover, which extended the outage and diverted those teams from their normal operational tasks.
What Can We Learn From The Incident?
Rather than jump on the troll bandwagon against CrowdStrike and their customers, I prefer to look at the incident from a different angle, one of disaster recovery and business continuity. Calls for change are not necessarily unfounded. However, most of the actions being recommended are not within our direct control and may take months or years to implement. For all of us in the IT industry, though, one thing is within our control that can minimize the impact and risk of catastrophic events like this, whether they come from malicious actors or friendly fire from an errant change. It’s relatively low cost, and fairly quick to get started. I’m talking about the creation of a robust business continuity and disaster recovery (BCDR) plan. Refer to our previous post about disaster recovery for an introduction to the topic and some best practices.
Were Some Industries Hit Harder Than Others?
The recent incident is a referendum of sorts on the robustness of business continuity plans (BCPs) and disaster recovery (DR) plans across the globe. Some businesses and some industries fared much better than others. Those that had a solid BCDR plan and executed it quickly saw minimal impact from the incident, while others struggled to restore business operations and remained down for days after the event.
The healthcare sector seemed to be heavily impacted and was slow to recover, resorting to pen and paper operations. That decision will likely lead to longer-term problems with missing records or the need for manual data entry as medical staff play catch-up after a return to normal operations. There have been numerous high-visibility incidents in healthcare this year, all with similar outcomes. It seems the sector as a whole has a long way to go to close the widening gap with other industries with respect to cybersecurity and BCDR.
The airline industry was also hit hard by the outage, taking on average three days to recover. Airlines are somewhat unique in that they need to move both people and equipment simultaneously to fulfill flight scheduling requirements. Even so, reports indicate the extended outages for some carriers stemmed from the inability to recover flight and crew scheduling software. A DR plan to restore critical software from backups in an alternate location might have significantly reduced the impact of the CrowdStrike event for both the carriers and travel-weary consumers.
Why Do BCP and DR Matter?
These and other cases of extended IT outages, which hurt business customers and end consumers alike, highlight the need for a more robust plan for business continuity in the face of disaster. As the CrowdStrike update showed, the term disaster is not limited to weather-related or malicious cybersecurity events. Businesses that spend the time upfront thinking about what could go wrong and how to work around it, or in extreme circumstances recover from it, are going to be significantly better off.
Let’s face it, cyber events are on the rise. Coupled with more severe weather and growing complexity in information technology, the likelihood of these types of catastrophic outage events is only going to grow. Although business operations and costs vary widely, some experts have pegged the average cost per minute of downtime between $5,600 and $9,000. Worse, other sources indicate more than half of businesses close within a year of enduring a natural disaster, while a staggering 93 percent cease operations within 12 months of a cyber event.
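To put the per-minute figures in perspective, here is a minimal sketch that turns the quoted range into a rough cost estimate for an outage of a given length. The function name and the four-hour outage are hypothetical, chosen only for illustration, and the rates are the industry averages cited above rather than numbers for any particular business.

```python
from typing import Tuple

# Per-minute downtime cost range quoted above (USD).
# These are broad industry averages, not figures for your business.
COST_PER_MINUTE_LOW = 5_600
COST_PER_MINUTE_HIGH = 9_000


def downtime_cost_range(outage_minutes: int) -> Tuple[int, int]:
    """Return the (low, high) estimated cost in USD for an outage of the given length."""
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    return (
        outage_minutes * COST_PER_MINUTE_LOW,
        outage_minutes * COST_PER_MINUTE_HIGH,
    )


# Hypothetical example: a four-hour outage (240 minutes).
low, high = downtime_cost_range(4 * 60)
print(f"Estimated cost: ${low:,} to ${high:,}")
# prints "Estimated cost: $1,344,000 to $2,160,000"
```

Swapping in your own per-minute figures and a realistic recovery time gives a quick, if rough, way to weigh the cost of BCDR planning against the cost of going without it.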
In addition to avoiding these direct costs and devastating impacts, a good BCP brings numerous less quantifiable benefits, such as boosting brand trust, reducing insurance costs, and expediting the decision-making process during critical events. Customers can see when your business is up while others are struggling. That builds a sense of confidence that is priceless, allowing your company to rise above your competitors.
How Can Black Kilt Help?
At Black Kilt, we have years of experience building, testing, and improving BCP and DR plans for some of the world’s most complex businesses. Our consultants have Fortune 100 experience and are available to help protect your business from the unexpected. Ask us about a free review of your existing BCP or DR plans to uncover critical holes and opportunities for improvement that could spell the difference between success and closure in the wake of an unexpected event.