CrowdStrike's Calamity: Charting Core Contingencies

Preparing for the unpredictable

On July 19, 2024, CrowdStrike, a major cybersecurity software vendor, accidentally disrupted businesses worldwide. A flawed update to its Falcon software, which is designed to protect computers, delivered "complete protection" of a sort: it caused Windows computers to crash and reboot continuously. Even hackers couldn’t access their hacked computers!

What Happened?

The problem began in Australia, where businesses and airports first reported issues. As the day went on, more places around the globe felt the impact. Airports faced delays and cancellations, banks had trouble with their services, and hospitals struggled without access to their electronic systems. Even the U.S. Federal Aviation Administration reported that flights were grounded because of the issue. Delta says it lost half a billion dollars due to the outage!

The Scale of the Problem

CrowdStrike's software is widely used, including by over half of the Fortune 500 companies, so the fallout was huge. Some called it the "largest IT outage in history": Microsoft estimated that about 8.5 million Windows computers were impacted. Thousands of flights were canceled or delayed, and hospitals had to postpone procedures and fall back on pen and paper for records. The glitch not only affected businesses but also created significant inconveniences for everyday people relying on these services.

Why Did This Happen?

CrowdStrike explained that the issue came from a content update to its Falcon software, similar to an antivirus signature file, designed to help detect malicious activity. This particular update exposed a bug in the highly privileged part of the program that runs inside the Windows kernel, and the resulting error led to crashes and endless reboots. The problem only affected Windows machines; Mac and Linux systems were unaffected.

How Did They Fix It?

Even though CrowdStrike removed the bad update within 79 minutes, machines that had already installed the update couldn’t be automatically fixed. IT teams had to manually fix each one, rebooting the systems in safe mode and removing the problematic file.
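To make the manual fix concrete, here is a minimal Python sketch of the file-removal step. It assumes the pattern CrowdStrike published for the faulty file (C-00000291*.sys, found in C:\Windows\System32\drivers\CrowdStrike). The function names and the dry-run flag are hypothetical, added for illustration. In practice, IT teams did this by hand or with their own scripts from Safe Mode or the Windows Recovery Environment, where Python is usually not available, so treat this as an outline of the logic rather than a remediation tool.

```python
from pathlib import Path

# File name pattern CrowdStrike published for the faulty update file.
FAULTY_PATTERN = "C-00000291*.sys"


def find_faulty_files(driver_dir: Path) -> list[Path]:
    """Return files in driver_dir that match the faulty-file pattern."""
    if not driver_dir.is_dir():
        return []
    return sorted(p for p in driver_dir.glob(FAULTY_PATTERN) if p.is_file())


def remove_faulty_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Report matching files and delete them only when dry_run is False.

    The dry_run default is an illustrative safety choice, not part of
    any official procedure.
    """
    matches = find_faulty_files(driver_dir)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling remove_faulty_files(Path(r"C:\Windows\System32\drivers\CrowdStrike")) only lists what would be deleted; passing dry_run=False performs the deletion. Defaulting to a dry run reflects the lesson of the incident itself: check what a change will do before applying it everywhere.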

A Few Takeaways

Expect Failure

No matter how robust a system seems, always anticipate potential failure, especially in the things you rely on, whether that is the system you use to process customers or your personal email. What happens if you can’t use the system? This incident shows that even top-tier cybersecurity firms can make critical mistakes. Prepare contingency plans for when things go wrong.

Test Robust Offline Processing Mechanisms

Having an offline mode or backup process can save the day when the main system fails. Businesses should ensure they can continue operating manually if necessary. This can involve simple measures like keeping critical data on paper in a fireproof safe, or having a secondary system that can take over during emergencies. If the internet goes down, can you still operate? Does your hotspot work?

Limit Critical Privileges

Specific to this catastrophe, CrowdStrike's software was running with complete administrative privileges, the same as the operating system itself. Applications generally should not be given extra privileges, and computer accounts should only have access to what they need, and no more. By minimizing the potential impact of any single application, the overall system becomes more resilient to failures.

Expect Tech Failures

Tech will fail, often at the worst possible time. Keep backups of important files, and be ready to switch to manual methods if needed. Having printed copies of crucial documents can be invaluable. Having tested backup methods means you stress less when things go wrong.
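One simple way to test a backup is to confirm that every file in the original folder exists in the backup with identical contents. The sketch below is a minimal illustration of that idea using SHA-256 checksums from Python's standard library. The function names and folder layout are assumptions for the example and do not describe any specific backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the backup matches."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(copy):
            problems.append(f"differs: {rel}")
    return problems
```

Running verify_backup on your documents folder and its backup copy on a schedule turns "I think I have backups" into something you have actually checked. Note that this sketch only checks one direction: it does not flag extra files that exist only in the backup.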

Stay Calm and Adapt

When technology fails, stay calm and think through your alternatives. If your computer dies, know how to access your backed-up files. If your internet goes out, make sure you know how to quickly access your phone’s hotspot. If you lose access to an important online account, make sure your backup or recovery accounts and mechanisms are accessible.

In the end, the CrowdStrike incident was a wake-up call for many. It highlighted the importance of being prepared for unexpected problems and ensuring that our defenses are as strong as possible. As our reliance on technology grows, so does the need for robust cybersecurity measures and the readiness to handle unforeseen challenges.
