MontyCloud Automates BSOD Fix for the CrowdStrike Outage for AWS EC2
The CrowdStrike Falcon content update for Windows has caused widespread system crashes and "blue screen of death" (BSOD) errors, leaving IT teams...
6 min read
Luke Walker : Feb 18, 2021 11:36:00 AM
Automating cloud security and compliance assessments can alert you faster. Then what? In this blog, Luke Walker – Principal Product Manager at MontyCloud shares how you can automate remediations. He also shares his list of Do’s and Don’ts, based on his years of experience working with customers.
– Sabrinath S. Rao
Cloud security is like a thankless plumbing job. You have to make sure that your cloud infrastructure is properly interconnected, constantly monitor the flow of data across the network, and flush out unnecessary bottlenecks at regular intervals. It is a painful, never-ending job—but a critical one that can burst open a can of worms for your organization if you happen to be looking the other way.
Earlier this week, my colleague Sabrinath Rao talked about how you can deal with your continuously changing infrastructure with continuous visibility and inventory. He also discussed how you can get continuous security and compliance assessments by running checks against your company’s policies based on best practices and industry standards.
Continuous assessments can alert you to violations faster. Then what?
In this blog, I am going to share how you can set up real-time notifications to automate remediation of security alerts, minimize the likelihood of new attacks, and speed up your threat response process.
When it comes to handling cloud security response, most teams take a hurried approach, diving right into solving the problem. It’s a concern that other experts also seem to share.
“Too many infosec pros want to solve problems before understanding them. I see a lot of organizations going from threat to threat, alert to alert but not really understanding the underlying basis of why it’s happening…In many cases, we see organizations buy expensive solutions but not understand the problem they’re trying to solve.”
– Jason Rivera, director of CrowdStrike’s strategic threat advisory group at RSA 365 virtual summit
These knee-jerk reactions to threats can cause more harm than good. Common issues our customers tell us about include implementing the wrong technical solution, missing systems downstream of the affected resources, and runaway cloud services bills, to name a few. Reacting this way often results in your organization ignoring legitimate security threats and compliance violations.
Instead of wading right into every alert or notification, it’s important to take note of the blast radius. In AWS’ terminology, a blast radius is “the maximum impact that might be sustained in the event of a system failure.”
Whether it is a security alert or an incident, it helps to know the context and blast radius first. Understanding which business applications are affected, what resources those apps in turn touch, and who in the business is affected by the incident gives you a much better tactical sense of the steps you need to take and their downstream impact.
Both the steps to define your blast radius and the steps you take afterwards can be automated, giving your teams more time to focus on the right course of action. That is why most cloud best practices start with tagging policies, which provide context and traceability when inspecting individual resources and components. But enforcing those tags isn’t enough: your tagging scheme needs to be designed to convey business and application context.
Any member of your team should be able to tell which business application, owner, and department are impacted from a quick glance at a resource’s tagging metadata. That enables fast awareness, decision making, and flexibility in choosing the right response.
Tagging is not the only place where automation can have a real impact on your team’s ability to respond and remediate.
No engineer enjoys reading obscure alerts and spending a day reverse engineering who owns which instance in an account.
When you define your tagging schema upfront and tag resources as they are created, every resource can be viewed in the right context, whether that is business owner, dependencies, or security requirements. The first step in defining the blast radius for any incident is then reduced to a handful of queries.
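To make "a handful of queries" concrete, here is a minimal sketch of a blast-radius lookup over tagged resources. The tag keys (`application`, `owner`, `department`), the `Resource` class, and the sample inventory are all assumptions for illustration. In a real account the inventory would come from your cloud provider's tagging or inventory API rather than a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A cloud resource reduced to its ID and tag metadata (hypothetical shape)."""
    resource_id: str
    tags: dict = field(default_factory=dict)


def blast_radius(inventory, application):
    """Collect the resources, owners and departments tagged with one application."""
    affected = [r for r in inventory if r.tags.get("application") == application]
    return {
        "resources": sorted(r.resource_id for r in affected),
        "owners": sorted({r.tags["owner"] for r in affected if "owner" in r.tags}),
        "departments": sorted(
            {r.tags["department"] for r in affected if "department" in r.tags}
        ),
    }


# Hypothetical inventory; IDs and tag values are made up for the example.
INVENTORY = [
    Resource("i-0a1", {"application": "billing", "owner": "ana", "department": "finance"}),
    Resource("db-7f2", {"application": "billing", "owner": "raj", "department": "finance"}),
    Resource("i-0b9", {"application": "search", "owner": "lee", "department": "product"}),
    Resource("i-0c3", {}),  # untagged: invisible to any tag-based query
]

radius = blast_radius(INVENTORY, "billing")
untagged = [r.resource_id for r in INVENTORY if "application" not in r.tags]
```

The `untagged` list is worth checking alongside the result, because any resource missing the `application` tag silently falls outside every blast radius you compute this way.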
The use of your metadata should not be restricted to lookup work either. Every alert can inform Ops teams and engineers as to what and who is impacted. Basic troubleshooting scripts can invoke architecture, department or application-specific logic and invoke different tests to verify compliance, governance, and system availability, in ways that are more applicable to those environments.
It does mean that you rely not just on the existence of tags but on the quality of your metadata. The data needs to be up to date and relevant to prevent miscommunication and incorrect actions when it comes time to act. But the investment in self-documenting during provisioning helps your teams spend less time reverse engineering their cloud accounts and decide on the next step faster.
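As a rough illustration of metadata-driven alerts, the sketch below enriches an alert with tag data and picks a troubleshooting check based on a `tier` tag. The tag names, the check functions, and the alert shape are hypothetical. The point is only that the same tags that document a resource can also select the logic that runs against it, and that missing tags should be surfaced rather than hidden.

```python
def check_web(resource_id):
    return f"HTTP health check queued for {resource_id}"


def check_database(resource_id):
    return f"replication lag check queued for {resource_id}"


def check_generic(resource_id):
    return f"basic reachability check queued for {resource_id}"


# Hypothetical mapping from a "tier" tag value to a troubleshooting step.
CHECKS_BY_TIER = {"web": check_web, "database": check_database}
REQUIRED_TAGS = ("application", "owner", "tier")


def enrich_and_route(alert, tags_by_resource):
    """Attach business context to an alert and choose a first check from its tags."""
    tags = tags_by_resource.get(alert["resource_id"], {})
    missing = [name for name in REQUIRED_TAGS if name not in tags]
    # Fall back to a generic check when the tier tag is absent or unknown.
    check = CHECKS_BY_TIER.get(tags.get("tier"), check_generic)
    return {
        **alert,
        "application": tags.get("application", "unknown"),
        "owner": tags.get("owner", "unknown"),
        "missing_tags": missing,
        "next_step": check(alert["resource_id"]),
    }


TAGS = {"i-0a1": {"application": "billing", "owner": "ana", "tier": "web"}}
routed = enrich_and_route({"resource_id": "i-0a1", "message": "CPU high"}, TAGS)
orphan = enrich_and_route({"resource_id": "i-zzz", "message": "CPU high"}, TAGS)
```

Here `orphan` still gets a next step, but its `missing_tags` list tells the responder that the context cannot be trusted until the resource is tagged.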
Here are some simple Do’s and Don’ts I have learnt working with our customers.
DO:
DON’T:
Trying to invent a ./fix_everything.sh script sounds great but you’ll end up making a mountain out of a molehill.
By breaking common problems down into smaller chunks, many steps can be rendered into a ready-made library of troubleshooting and remediation scripts that make the work easier to review & action.
Alerts and metric thresholds can be configured to use this library and automate the first troubleshooting steps, collecting relevant information like the number of processes running, or performing traffic analysis. The resulting trace data can be shared with adjacent Engineering, DevOps, CloudOps and other teams. Low-impact fixes could also be automated to resolve common issues with your application’s building blocks, for example, locking down public access to an S3 bucket, disabling open SSH and RDP ports, or rate-limiting flagged traffic.
Whether you are working with a new or an existing application, there are several areas in which the collective knowledge of your operations teams can be broken down and automated. This saves time that the team can spend focusing on the problem.
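One way to picture such a library is a registry of small, single-purpose remediation functions keyed by finding type. The sketch below is a dry run under stated assumptions: the finding shapes, the `remediation` decorator, and the `plan` function are hypothetical, and nothing here calls a cloud API. It only returns what would change, so the output can be reviewed before anything is applied. The four settings in `bucket_lockdown` are named after the S3 Block Public Access options.

```python
REMEDIATIONS = {}  # finding type -> function that plans a fix


def remediation(finding_type):
    """Decorator that registers a function as the handler for one finding type."""
    def register(func):
        REMEDIATIONS[finding_type] = func
        return func
    return register


ADMIN_PORTS = {22, 3389}  # SSH and RDP


@remediation("open_admin_port")
def rules_to_revoke(finding):
    """List ingress rules that expose SSH or RDP to the whole internet."""
    return [
        rule for rule in finding["ingress_rules"]
        if rule["cidr"] == "0.0.0.0/0"
        and any(rule["from_port"] <= port <= rule["to_port"] for port in ADMIN_PORTS)
    ]


@remediation("public_bucket")
def bucket_lockdown(finding):
    """Describe the public-access settings to turn on for a bucket."""
    return {
        "bucket": finding["bucket"],
        "settings": {
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    }


def plan(finding):
    """Return the planned fix for a finding, or escalate when no handler exists."""
    handler = REMEDIATIONS.get(finding["type"])
    if handler is None:
        return {"action": "escalate", "detail": None}
    return {"action": finding["type"], "detail": handler(finding)}


finding = {
    "type": "open_admin_port",
    "ingress_rules": [
        {"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
        {"cidr": "10.0.0.0/8", "from_port": 22, "to_port": 22},
        {"cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    ],
}
planned = plan(finding)
```

Keeping each handler this small is what makes the library reviewable. Unknown finding types are escalated to a person instead of being guessed at.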
Again, here are some simple Do’s and Don’ts I have learnt working with our customers.
DO:
DON’T:
Nothing causes more anxiety than the unknown.
It’s important to keep people well informed about what you do know and what the next steps will be. Many teams focus on keeping business owners, stakeholders, and users informed about what is going on, and rightly so: it is key to a healthy relationship for any Ops team.
Metadata-rich alerts can inform ticketing systems on which environments are impacted.
Threshold monitoring across ticketing systems can then raise status notifications that your teams are investigating a problem, helping to halt the flood of support tickets.
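The threshold idea can be sketched as a sliding-window counter per application. The class name, the window length, and the notification text below are assumptions for illustration, and timestamps are plain seconds so the example stays self-contained. A real setup would read ticket events from your ticketing system and post to a status page.

```python
from collections import deque


class TicketSurgeMonitor:
    """Raise one status notice per application when tickets spike within a window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._times = {}        # application -> deque of recent ticket timestamps
        self._notified = set()  # applications that already have a notice out

    def record(self, application, timestamp):
        """Record a ticket; return a status message if the threshold was just hit."""
        times = self._times.setdefault(application, deque())
        times.append(timestamp)
        # Drop tickets that have aged out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        if len(times) >= self.threshold and application not in self._notified:
            self._notified.add(application)
            return f"We are investigating an issue affecting {application}."
        return None


monitor = TicketSurgeMonitor(threshold=3, window_seconds=600)
notices = [monitor.record("billing", t) for t in (0, 100, 200, 250)]
```

Only the third ticket produces a notice. The fourth is suppressed because a notice is already out, which keeps the status channel from adding to the noise.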
Publishing open data on your security status, availability, and response times builds trust and credibility with the business leaders who entrust your team with the safety and protection of their data.
DO:
DON’T:
Having a remediation plan is good, but executing the plan is most fulfilling.
Except in a perfect world, a remediation process rarely goes according to plan. In reality, implementing a remediation strategy will inevitably face setbacks and constant change.
That’s why it’s twice as important to put all your remediation efforts into action to test the feasibility of your plan.
Your actual remediation plan will feel different and evolve along the way from your original strategy. That’s okay. Focus on adjusting your strategy to accommodate the changes in order to achieve the end goal.
When documenting your action plan, pay special attention to the language that you use. Communicate the strength of your plan and convey your confidence in its outcomes.
Once you put it all into action, you can close the gaps in your automation and remediation processes and come up with a more robust, watertight cloud security standard.
DO:
DON’T:
Despite your best efforts to put an automated remediation plan into action, misconfigurations and sub-standard deployments are more common than you might imagine. Often, an incident comes back to bite you a second time if you don’t have a way to assess your remediation tactics.
As the last step in hardening your cloud security, it is important to set up an automated monitoring process to keep an eye on potential data leaks and to tie up loose ends.
In case of a repeat incident, putting a monitoring process in place will help you detect the problem areas immediately and deploy the right solution to plug the security loopholes in real-time.
Automating the monitoring process also helps you stay compliant with industry mandates, regulatory standards, and public cloud best practices.
In addition to offering continuous assessment of threat levels, monitoring is also a great way to document your own best policies, apply new rules to your cloud standards, and make changes to your security blueprints.
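Catching a repeat incident can be as simple as comparing new findings against what you have already marked as fixed. The sketch below assumes each finding is a dictionary with `resource_id` and `rule` keys. Both the shape and the sample data are hypothetical.

```python
def find_regressions(remediated, current_findings):
    """Return current findings that match something previously marked remediated."""
    fixed = {(item["resource_id"], item["rule"]) for item in remediated}
    return [
        finding for finding in current_findings
        if (finding["resource_id"], finding["rule"]) in fixed
    ]


# Hypothetical history and latest assessment results.
remediated = [{"resource_id": "sg-1", "rule": "open_ssh"}]
current = [
    {"resource_id": "sg-1", "rule": "open_ssh"},  # fixed before, now back
    {"resource_id": "sg-2", "rule": "open_ssh"},  # new finding, not a regression
]
regressions = find_regressions(remediated, current)
```

A regression is a signal that the earlier fix treated a symptom, so it is a good candidate for escalation rather than another automatic retry.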
Ready-made remediation makes it easier for you to review and execute security actions, especially now you know what you are remediating against and for whom.
To that end, MontyCloud DAY2™ helps you visualize, analyze and automate your security and compliance standards in just five easy steps.
Sign up for a free MontyCloud DAY2™ account today and get automated security and compliance assessments within minutes.