On July 19, 2024, the world witnessed one of the largest IT outages in history. A sensor configuration update from CrowdStrike, pushed to all Windows systems at 4:09 UTC, caused 8.5 million Microsoft Windows PCs to crash worldwide. Airlines, banks, hospitals, government agencies, stock exchanges, and countless other organizations were brought to a standstill. In this video, I break down what happened and, more importantly, what you can do to prevent something like this from happening in your organization.
What Happened at CrowdStrike#
CrowdStrike is an American cybersecurity firm founded in 2011 with a market capitalization of roughly $83 billion. Their Falcon software is an Endpoint Detection and Response (EDR) system that organizations install on their computers to detect malware and suspicious activity. On that Friday morning, a configuration update pushed between 4:09 and 5:27 UTC immediately crashed every Windows 7 and Windows 11 (or above) system that downloaded it. The result: a global wave of blue screens.
When you looked at Down Detector that day, the picture was staggering. Company after company reporting outages. And CrowdStrike’s own share price dropped significantly as soon as the world identified them as the source.
Plan Your Deployment and Release Strategy#
The first principle I want to emphasize is to plan your deployment and release strategy. In DevOps, we make a critical distinction between deployment and release. A deployment is bringing compiled code into production with feature toggles switched off. A release is the act of switching those feature toggles on for users. This separation gives you enormous control over when and how new functionality reaches your users.
Quality During Development#
During development, several practices help ensure code quality before anything reaches production:
- Pair Programming: Two developers working together means a second brain constantly reviewing the code. This is a powerful way to spot errors, including the kind of null pointer exception that potentially caused the CrowdStrike crash.
- Code Reviews and Merge Requests: Even after pair programming, a second person reviews your code during the merge request process. Another checkpoint to catch issues.
- Smart Pointers in C++: Since CrowdStrike’s system was implemented in C++, where you manage your own memory, tools like Smart Pointers are essential to prevent null pointer exceptions.
Continuous Integration and Test Automation#
After committing code to the repository, the Continuous Integration system kicks in. This is where your new code gets reintegrated with the rest of the codebase. During this phase:
- Static Analysis (for example, CppCheck for C++) scans the code for errors and potential null pointer exceptions.
- Unit Tests and Integration Tests verify that the software works correctly on different Windows versions, including reboot mechanisms.
- Dynamic Analysis (for example, Valgrind for C++) executes the code and detects runtime issues, memory leaks, and exceptions.
After these steps, you have a deployable artifact that has been reviewed, statically analyzed, and dynamically analyzed.
Staging, Blue-Green Deployments, and Rollback#
The reviewed artifact gets deployed to a staging environment where feature toggles allow you to test the software. Blue-green deployments are particularly valuable here: you maintain two environments, one live and one idle. You deploy to the idle environment, switch traffic over, and if something fails, you switch back instantly. This principle also applies to production. Rollback mechanisms, whether through blue-green switching or simply uninstalling and reverting to the previous version, are non-negotiable.
At this point, you have not yet released the software. In the CrowdStrike scenario, you would have deployed it but not released it.
Canary Releases and Dark Launches#
This is where I believe CrowdStrike could have made the biggest difference. Canary releases mean you release to a small user group first, typically beta users who want the newest features. If something goes wrong, the blast radius stays small. I am convinced that this principle would have helped massively in the CrowdStrike outage.
Dark launches are a variant: you continuously deploy new versions to production, but feature toggles remain off. You only enable them selectively on test systems to verify everything works. Normal users never see the new features until you are confident.
A/B testing follows the same logic: one group (A) gets the new feature, the other (B) does not. You compare results before rolling out broadly.
These are all release strategies, and planning for them upfront is essential.
Proactive Detection and Monitoring#
You never want your clients to discover a problem before you do. CrowdStrike’s clients found out about the crash before CrowdStrike did. To prevent this, you need:
- Proactive Detection: Find bugs before your customers do.
- Alerting: Based on proactive detection, automated alerts must notify you immediately.
- Continuous Monitoring: Observe all systems, gather feedback, and enable proactive detection so you can react before users are impacted.
Key Takeaways#
- We cannot prevent all incidents, but we can reduce the blast radius. We live in a highly connected world where software runs everything. We must learn from failures.
- Pair programming and code reviews catch errors early, including potential null pointer exceptions.
- Static and dynamic analysis provide automated safety nets for code quality.
- Unit and integration tests should verify the software on different system versions, including reboot mechanisms.
- Canary releases are critical. New versions should always go to a small beta group first to limit the impact of potential failures.
- Build resilient systems. As an industry, we need to ask hard questions. Why did Microsoft not build a more resilient Windows system that cannot be fully brought down by a single third-party update? We need to properly engineer our systems for resilience.
- Have a proper post-mortem. Analyze the root cause, improve based on findings, and apply DevOps principles consistently.
