Understanding the CrowdStrike Incident of July 2024
In July 2024, the digital world was rocked by a significant event: the CrowdStrike incident. In this blog post, we’ll look at what happened, why it happened, and how the issue is being resolved. A faulty update to CrowdStrike’s Falcon software crashed an estimated 8.5 million Windows computers globally (Microsoft’s figure), disrupting critical services and daily operations for millions of people. Let’s explore these aspects in detail.
What Happened?
On July 19, 2024, millions of Windows computers experienced the infamous “Blue Screen of Death” (BSOD). This event didn’t just affect individual users. It disrupted businesses, airlines, hospitals, and other critical services worldwide. As a result, many people missed flights, appointments, and other important engagements, which shows how far the disruption reached.
The BSOD is a common indicator of severe system failure in Windows computers, often caused by critical errors at the kernel level, which is the core part of the operating system responsible for managing hardware and system resources.
Why Did It Happen?
To understand why this happened, we can use the analogy of a castle. Imagine a castle with multiple security layers: an outer courtyard open to ordinary visitors and an innermost keep that only the most trusted staff may enter. In a computer system, these areas are analogous to privilege rings. On x86 processors, ring 0 is the most privileged level (kernel mode), where the operating system and critical drivers run, and ring 3 is user mode, where applications operate. Rings 1 and 2 exist in the architecture, but Windows does not use them.
CrowdStrike’s Falcon software, an advanced anti-malware solution, includes a driver that runs at ring 0. This privileged, low-level access allows it to monitor the system and block malware effectively, but it also means that any fault in that driver can directly take down the core functions of the operating system.
On July 19th, a dynamic update to Falcon included an incorrect or corrupted file. The Falcon driver is certified by Microsoft’s Windows Hardware Quality Labs (WHQL), but that certification covers the driver itself, not the content files it loads afterward. When the driver, running in kernel mode, processed the bad file, it malfunctioned, and the result was the widespread BSOD incidents. This highlights a critical gap in software quality assurance (QA) processes, especially for updates that feed core system components.
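To make the idea of checking an update file before a privileged component consumes it more concrete, here is a minimal sketch in Python. The file layout, the `MAGIC` signature, the `EXPECTED_FIELDS` count, and the `validate_update` function are all hypothetical and invented for illustration. They do not describe Falcon’s real file format or CrowdStrike’s actual validation logic.

```python
import struct

MAGIC = b"UPD1"        # hypothetical 4-byte file signature
EXPECTED_FIELDS = 4    # hypothetical number of fields the consumer expects


def validate_update(blob: bytes) -> list[bytes]:
    """Parse a hypothetical update file, raising ValueError if it is malformed.

    Assumed layout: MAGIC, a 2-byte big-endian field count, then for each
    field a 2-byte big-endian length followed by that many bytes.
    """
    if len(blob) < 6 or blob[:4] != MAGIC:
        raise ValueError("missing or wrong file signature")
    (count,) = struct.unpack_from(">H", blob, 4)
    if count != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {count}")

    fields = []
    offset = 6
    for _ in range(count):
        # Every read is bounds-checked before it happens.
        if offset + 2 > len(blob):
            raise ValueError("truncated field header")
        (length,) = struct.unpack_from(">H", blob, offset)
        offset += 2
        if offset + length > len(blob):
            raise ValueError("field runs past end of file")
        fields.append(blob[offset:offset + length])
        offset += length

    if offset != len(blob):
        raise ValueError("trailing bytes after last field")
    return fields
```

The point of the sketch is that a malformed file is rejected with an ordinary error instead of being read past its end. In kernel mode, the equivalent unchecked read can crash the whole machine.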
How Is It Being Resolved?
Resolving this issue involves multiple steps. Initially, CrowdStrike pushed out a corrected update. However, systems that had already experienced the BSOD required more direct intervention. The recommended approach for affected computers is to reboot into safe mode, manually locate and delete the problematic files associated with the Falcon update, and then reboot the system.
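The “locate and delete” step can be sketched as a small script. The directory and file name pattern below are not given in this post. They are assumptions taken from CrowdStrike’s public remediation guidance and should be verified against the current advisory before use. The function names are made up for this example, and in practice the step was usually performed from a recovery command prompt rather than with Python.

```python
from pathlib import Path

# Assumptions from CrowdStrike's public remediation guidance, not from this post.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def find_faulty_files(directory: Path = DEFAULT_DIR) -> list[Path]:
    """Return the files in `directory` that match the faulty-update pattern."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(FAULTY_PATTERN) if p.is_file())


def remove_faulty_files(directory: Path = DEFAULT_DIR, dry_run: bool = True) -> list[Path]:
    """List the matching files and delete them unless `dry_run` is True."""
    matches = find_faulty_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to `dry_run=True` is a deliberate choice: the script reports what it would delete first, which matters when the target is a system driver directory.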
For large-scale deployments, such as servers in data centers that may not have direct user interfaces, additional steps and possibly scripting are necessary to manage the recovery process. Furthermore, systems protected by BitLocker disk encryption need their recovery key before the disk can be accessed from a recovery environment, which makes the procedure more involved.
Microsoft has also updated its recovery tools to assist IT administrators in expediting the repair process. These tools offer options like booting from a Windows Preinstallation Environment (WinPE) or recovering from safe mode to facilitate the removal of the faulty update.
Avoiding Future Incidents
To prevent such incidents in the future, enhanced QA processes for updates are crucial. This includes thorough testing of all components, not just the core software but also any dynamic updates. Additionally, reconsidering the operational mode of critical security software like Falcon might be necessary. Running such software in user mode rather than kernel mode could mitigate the risk of entire system failures, albeit potentially at the cost of some efficiency in malware detection.
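As a loose analogy for the user mode trade-off, the sketch below runs risky work in a separate process so that a crash there does not take down the caller. The process boundary in Python only stands in for the kernel/user boundary, and `run_isolated` is a hypothetical helper, not how any security product is actually built.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 5.0) -> bool:
    """Run Python source in a child process; return True if it exited cleanly.

    A crash, an unhandled exception, or a hang in the child is reported as
    False, and the calling process keeps running either way.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

For example, `run_isolated("raise RuntimeError('boom')")` returns `False` while the caller carries on. The cost is the overhead of crossing the process boundary, which mirrors the efficiency trade-off described above.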
The CrowdStrike incident of July 2024 serves as a stark reminder of the vulnerabilities inherent in our interconnected digital world. While the immediate causes of the incident have been addressed, it raises important questions about how to prevent similar occurrences in the future. Two critical strategies that can enhance overall security and resilience are the adoption of Secure by Design principles and the implementation of network segmentation. Let’s explore how these approaches can mitigate risks and potentially prevent incidents like the CrowdStrike disruption.
Secure by Design Principles
Secure by Design (SbD) is an approach that integrates security from the very beginning of the software development lifecycle. This principle ensures that security considerations are embedded into every stage of development, from initial design to deployment and maintenance. Here’s how SbD could have impacted the CrowdStrike incident:
Early Threat Modeling
Incorporating threat modeling at the design phase helps identify potential vulnerabilities and attack vectors. A thorough threat model for software like Falcon would flag the risk of running in kernel mode (ring 0), where any failure can lead to a system-wide crash, and would treat every file the driver loads as a possible source of that failure.
Code Review and Static Analysis
Regular code reviews and static analysis can catch bugs and vulnerabilities early in the development process. Comprehensive testing, including stress testing and failure mode analysis, could have identified the problematic update before it was released, preventing the blue screen of death (BSOD) incidents.
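One form of failure mode testing is fuzzing: feeding a parser random input and checking that it only ever fails in a controlled way. The sketch below is a minimal illustration. `parse_record` is a hypothetical toy parser and `fuzz` is a hand-rolled harness, so neither reflects CrowdStrike’s code or any particular fuzzing tool.

```python
import random


def parse_record(blob: bytes) -> bytes:
    """Hypothetical parser: the first byte is a length, the rest is the payload."""
    if not blob:
        raise ValueError("empty input")
    length = blob[0]
    if len(blob) - 1 < length:
        raise ValueError("payload shorter than declared length")
    return blob[1:1 + length]


def fuzz(parser, iterations: int = 1000, seed: int = 0) -> list[bytes]:
    """Feed random byte strings to `parser` and collect unexpected failures.

    A ValueError counts as a clean rejection. Any other exception is treated
    as a bug, the user-space analogue of a crash.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(iterations):
        size = rng.randrange(0, 16)
        blob = bytes(rng.randrange(256) for _ in range(size))
        try:
            parser(blob)
        except ValueError:
            pass
        except Exception:
            failures.append(blob)
    return failures
```

A parser that skips its bounds checks, for example one that simply returns `blob[blob[0]]`, shows up in the returned list almost immediately, because random input quickly produces an index past the end of the data.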
Continuous Integration and Continuous Deployment (CI/CD) with Security Checks
Integrating automated security checks into the CI/CD pipeline ensures that every code change is tested for security issues before deployment. This approach can significantly reduce the risk of deploying updates with critical vulnerabilities.
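A pipeline gate of this kind can be sketched as a function that runs every check against a release artifact and blocks the release unless all of them pass. The names `release_gate` and `CheckResult`, and the example checks in the usage note, are assumptions made for illustration rather than part of any real CI/CD product.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool


def release_gate(
    artifact: bytes,
    checks: dict[str, Callable[[bytes], bool]],
) -> tuple[bool, list[CheckResult]]:
    """Run every check against the artifact and decide whether it may ship.

    The gate fails closed: a check that raises counts as a failure, and an
    empty set of checks never approves a release.
    """
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check(artifact))
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    approved = bool(results) and all(r.passed for r in results)
    return approved, results
```

With checks such as `{"non_empty": lambda b: len(b) > 0, "not_all_zero": lambda b: any(b)}`, a file consisting only of zero bytes is rejected before deployment. Failing closed is the important design choice here, because a broken check should never be mistaken for a passing one.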
Network Segmentation
Network segmentation involves dividing a network into smaller, isolated segments to limit the spread of potential threats and contain breaches. This strategy can significantly enhance the security posture of an organization by minimizing the impact of security incidents. Here’s how network segmentation could have mitigated the effects of the CrowdStrike incident:
Isolation of Critical Systems
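By isolating critical systems and services into separate network segments, organizations can prevent the spread of issues from less critical areas. In this incident the faulty update reached each machine directly rather than spreading from one machine to another, so segmentation helps most when it also controls which segments receive updates and when. For instance, if critical systems in hospitals or airlines had been segmented away from general-purpose user systems and updated on a later schedule, the BSOD incidents might have been contained, reducing the overall impact.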
Minimizing Attack Surfaces
Segmentation reduces the attack surface by limiting access to sensitive systems. If the CrowdStrike Falcon software had been deployed in a segmented manner, with its updates and communications restricted to a controlled environment, the faulty update might have been identified and contained before reaching all systems.
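Restricting an update to a controlled environment first is often done with a staged rollout. Below is a minimal sketch, assuming hosts are grouped into rings by a hash of their identifier and that the rollout halts when a ring’s failure rate crosses a threshold. The ring shares, the threshold, and the `apply_update` callback are all hypothetical values and names chosen for the example.

```python
import hashlib
from typing import Callable, Optional, Sequence


def assign_ring(host_id: str, ring_shares: Sequence[float] = (0.01, 0.10, 0.89)) -> int:
    """Deterministically map a host to a ring; ring 0 receives updates first."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    cumulative = 0.0
    for ring, share in enumerate(ring_shares):
        cumulative += share
        if bucket < cumulative:
            return ring
    return len(ring_shares) - 1  # guard against floating-point rounding


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.05,
    ring_shares: Sequence[float] = (0.01, 0.10, 0.89),
) -> tuple[list[str], Optional[int]]:
    """Update hosts ring by ring, halting after a ring that fails too often.

    Returns the hosts that received the update and the ring at which the
    rollout halted (None if it ran to completion).
    """
    rings: dict[int, list[str]] = {}
    for host in hosts:
        rings.setdefault(assign_ring(host, ring_shares), []).append(host)

    updated: list[str] = []
    for ring in sorted(rings):
        members = rings[ring]
        failures = sum(0 if apply_update(host) else 1 for host in members)
        updated.extend(members)
        if failures / len(members) > max_failure_rate:
            return updated, ring
    return updated, None
```

If `apply_update` reports failure on every host, only the first and smallest ring is ever touched, and the remaining hosts never receive the bad update.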
Improved Monitoring and Incident Response
Segmentation allows for more granular monitoring and quicker incident response. Security teams can focus their efforts on specific segments, making it easier to detect anomalies and take corrective actions. This could have sped up the identification and resolution of the faulty Falcon update.
By understanding these key aspects of the CrowdStrike incident, we can appreciate the complexity of maintaining secure and reliable systems in an increasingly interconnected world. Stay vigilant and informed to navigate these challenges effectively.
Reference: https://www.youtube.com/watch?v=2TfM_BF2i-I

















