
Simple steps you can take today to automate your cloud security remediations

Written by Luke Walker | Feb 18, 2021 5:36:00 PM

Automating cloud security and compliance assessments can alert you faster. Then what? In this blog, Luke Walker – Principal Product Manager at MontyCloud – shares how you can automate remediations. He also shares his list of Do’s and Don’ts, based on his years of experience working with customers.

  –  Sabrinath S. Rao

Cloud security is like a thankless plumbing job. You have to make sure that your cloud infrastructure is properly interconnected, constantly monitor the flow of data across the network, and flush out unnecessary bottlenecks at regular intervals. It is a painful, never-ending job—but a critical one that can burst open a can of worms for your organization if you happen to be looking the other way.

Earlier this week, my colleague Sabrinath Rao talked about how you can deal with your continuously changing infrastructure with continuous visibility and inventory. He also discussed how you can get continuous security and compliance assessments by running checks against your company’s policies, based on best practices and industry standards.

Continuous assessments can alert you to violations faster. Then what?

In this blog, I am going to share how you can set up real-time notifications to automate remediation of security alerts, minimize the likelihood of new attacks, and speed up your threat-response process.

 
Threat alerts can be a slippery slope

When it comes to handling cloud security response, most teams take a hurried approach, diving right into solving the problem. It’s a concern that other experts also seem to share.

“Too many infosec pros want to solve problems before understanding them. I see a lot of organizations going from threat to threat, alert to alert but not really understanding the underlying basis of why it’s happening…In many cases, we see organizations buy expensive solutions but not understand the problem they’re trying to solve.”
  –  Jason Rivera, director of CrowdStrike’s strategic threat advisory group, at the RSA 365 virtual summit

These knee-jerk reactions to threats can cause more harm than good. Common issues our customers report include implementing the wrong technical solution, missing systems downstream of the affected resources, and runaway cloud services bills, to name a few. Over time, this can lead your organization to ignore legitimate security threats and compliance violations.

Instead of wading right into every alert or notification, it’s important to take note of the blast radius.  In AWS’ terminology, a blast radius is “the maximum impact that might be sustained in the event of a system failure.”

Whether it is a security alert or an incident, knowing the context and blast radius first helps. Understanding which business applications are affected, what resources those apps in turn touch, and who in the business is affected by the incident gives you a much better tactical sense of the steps you need to take and the downstream impact.

 
Proper resource tagging can help you design an effective remediation strategy

Both the steps to define your blast radius and the steps you take afterwards are actions that can be automated, giving your teams more time to focus on the right course of action. That is why most cloud best practices start with tagging policies, to provide context and traceability when inspecting individual resources and components. But enforcing those tags isn’t enough; your tagging scheme needs to be designed to convey business and application context.

Any member of your team should be able to tell at a glance, from the tagging metadata alone, which business application, owner, or department is impacted, enabling fast awareness, quick decision-making, and flexibility in choosing the right response.

Tagging is not the only area where automation can have a real impact on your team’s ability to respond and remediate.

 
Every problem needs to be framed correctly – with metadata

No engineer enjoys reading obscure alerts and spending a day reverse engineering who owns which instance in an account.

By defining your tagging schema upfront and tagging resources as they are created, every resource can be viewed in the right context – business owner, dependencies, security requirements – and the first step in defining the blast radius for any incident is reduced to a handful of queries.
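To make that concrete, here is a minimal sketch of one such query using the AWS Resource Groups Tagging API via boto3. The “application” tag key and the “payments” value are assumptions standing in for whatever your own schema defines.

```python
# A minimal sketch of "blast radius as a handful of queries", assuming resources
# carry a hypothetical "application" tag. Uses the Resource Groups Tagging API.
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

def resources_for_application(app_name):
    """Return the ARNs of every tagged resource belonging to one application."""
    arns = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate(
        TagFilters=[{"Key": "application", "Values": [app_name]}]
    ):
        arns.extend(item["ResourceARN"] for item in page["ResourceTagMappingList"])
    return arns

# Example: everything the (hypothetical) "payments" application touches.
print(resources_for_application("payments"))
```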

The use of your metadata should not be restricted to lookup work either. Every alert can inform Ops teams and engineers as to what and who is impacted. Basic troubleshooting scripts can apply architecture-, department- or application-specific logic and run different tests to verify compliance, governance, and system availability in ways that are more applicable to those environments.

It does mean that you rely not just on the existence of tags but on the quality of your metadata. The data needs to be up-to-date and relevant to prevent miscommunication and incorrect actions when it comes time to act. But the investment in self-documenting during provisioning helps your teams spend less time reverse engineering their cloud accounts, and they can decide on the next step faster.

Here are some simple Do’s and Don’ts I have learnt working with our customers.

DO:

  • Tag values should be readable at a glance and easy to search.
  • Enforce tagging – don’t let resources be built without tags.
  • Validate metadata – don’t let garbage tag data be created (a minimal validation sketch follows these lists).
  • Apply common sense – if you must consult a nightmarish Excel sheet to figure out when patches will be applied to a server, then you probably should have a patch-window tag.

DON’T:

  • Turn absolutely everything into a tag. Consider databases & wikis for more complex data that your teams can query; for example, app owner contact details should be on a wiki, not in a tag.
  • Require a secret decoder ring. If you need one to read a tag, question whether that tag is valuable to begin with (refer to the prior point).
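As referenced above, here is a minimal validation sketch. The required keys, allowed environments, and the patch-window format are hypothetical examples of a schema; real enforcement usually lives in your provisioning pipeline or policy tooling, and this only illustrates the checks themselves.

```python
# A minimal tag-validation sketch, assuming a hypothetical set of required keys
# and simple format rules. Adapt the schema to your own tagging policy.
import re

REQUIRED_KEYS = {"application", "owner", "environment", "patch-window"}  # assumed schema
ENVIRONMENTS = {"dev", "test", "prod"}

def validate_tags(tags):
    """Return a list of human-readable problems with a resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_KEYS - tags.keys()]
    env = tags.get("environment")
    if env and env not in ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    window = tags.get("patch-window")
    if window and not re.fullmatch(r"(mon|tue|wed|thu|fri|sat|sun)-\d{2}:\d{2}", window):
        problems.append(f"unreadable patch-window: {window}")
    return problems

# Example: a resource created with an incomplete, inconsistent tag set.
print(validate_tags({"application": "payments", "environment": "production"}))
```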
 
Streamline your actions

Trying to invent a ./fix_everything.sh script sounds great, but you’ll end up with an unwieldy monolith that is hard to review or trust.

By breaking common problems down into smaller chunks, many steps can be turned into a ready-made library of troubleshooting and remediation scripts that makes the work easier to review and act on.

Alerts and metric thresholds can be configured to use this library and automate the first troubleshooting steps, collecting relevant information like the number of processes running, or performing traffic analysis. The resulting trace data can be shared with adjacent Engineering, DevOps, CloudOps and other teams. Low-impact fixes can also be automated to resolve common issues with your applications’ building blocks – for example, locking down public access to an S3 bucket, disabling open SSH and RDP ports, or rate-limiting flagged traffic.
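For example, the S3 lockdown mentioned above can be reduced to a short, reviewable script. This is a minimal sketch using boto3; the bucket name is a placeholder for whatever resource an alert flags.

```python
# A minimal remediation sketch for one low-impact fix: locking down public
# access on a single S3 bucket flagged by an alert.
import boto3

s3 = boto3.client("s3")

def block_public_access(bucket_name):
    """Apply a bucket-level public access block to the given S3 bucket."""
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

block_public_access("example-flagged-bucket")  # hypothetical bucket from an alert
```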

Whether you are working with a new or an existing application, there are many actions where the collective knowledge of your operations teams can be broken down and automated. That saves time your team can spend focusing on the problem itself.

Again, here are some simple Do’s and Don’ts I have learnt working with our customers.

DO:

  • Perform a permission check more than twice? Script it.
  • Tailored a script specifically for an app? Keep it – and every other copy you write.
  • Share your script library between your teams.
  • Configure your support systems to trigger ready-made checks from your library (a minimal sketch follows these lists).

DON’T:

  • Over-complicate. The more complicated a script, the harder it is to use in automation; simple is always best.
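As referenced in the Do’s above, here is a minimal sketch of a shared check library that a support system or alert handler can trigger by name. The check names, their placeholder bodies, and the alert shape are all hypothetical; the point is the dispatch pattern, not the specific checks.

```python
# A minimal sketch of a shared check library keyed by name. Real integrations
# would map ticketing or monitoring payloads onto this dispatch.
def check_open_ssh(resource_id):
    """Placeholder: inspect the resource's security groups for 0.0.0.0/0 on port 22."""
    return {"resource": resource_id, "check": "open-ssh", "status": "not-implemented"}

def check_iam_permissions(resource_id):
    """Placeholder: verify the resource's role against an approved permission baseline."""
    return {"resource": resource_id, "check": "iam-permissions", "status": "not-implemented"}

CHECKS = {
    "open-ssh": check_open_ssh,
    "iam-permissions": check_iam_permissions,
}

def handle_alert(alert):
    """Run every library check named in the alert and return the combined results."""
    return [CHECKS[name](alert["resource_id"]) for name in alert["checks"] if name in CHECKS]

# Example: an alert asking for one ready-made check against one instance.
print(handle_alert({"resource_id": "i-0123456789abcdef0", "checks": ["open-ssh"]}))
```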
 
Don’t forget about communications

Nothing causes more anxiety than the unknown.

It’s important to keep people well informed about what you do know and what the next steps will be. Many teams focus on keeping business owners, stakeholders, and users informed about what is going on – and rightly so; it is key to a healthy relationship for any Ops team.

Metadata-rich alerts can inform ticketing systems on which environments are impacted.

Threshold monitoring across ticketing systems can then raise status notifications showing that your teams are investigating a problem, helping to stem the flood of support tickets.
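Here is a minimal sketch of that enrichment, assuming the alert already carries the resource ARN and that its tags have been looked up (for example with the tag query shown earlier). The ticket fields are hypothetical; map them onto whatever your ticketing system expects.

```python
# A minimal sketch of turning a raw alert plus resource tags into a
# metadata-rich ticket payload that tells people what and who is impacted.
def build_ticket(alert, tags):
    """Combine an alert and the resource's tags into an actionable ticket payload."""
    return {
        "title": f"[{tags.get('environment', 'unknown-env')}] {alert['summary']}",
        "application": tags.get("application", "unknown"),
        "owner": tags.get("owner", "unassigned"),
        "resource": alert["resource_arn"],
        "details": alert["details"],
    }

# Example: a hypothetical S3 finding enriched with the bucket's tags.
ticket = build_ticket(
    {
        "summary": "S3 bucket allows public read",
        "resource_arn": "arn:aws:s3:::example-bucket",
        "details": "Public ACL detected by scheduled check.",
    },
    {"application": "payments", "owner": "team-billing", "environment": "prod"},
)
print(ticket)
```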

Publishing open data on your security posture, availability, and response times builds trust and credibility with the business leaders who entrust your team with the safety and protection of their data.

DO:

  • Share tag metadata with every alert or ticket created.
  • Success & failure statistics from consistent security checks are actionable metrics. Publish them, share them, and configure threshold alerting for supervisors, managers, and business owners (a minimal sketch follows these lists).

DON’T:

  • Assume everyone knows what’s happening.
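As referenced in the Do’s above, one way to publish those statistics is as custom CloudWatch metrics that thresholds and dashboards can be built on. This is a minimal sketch; the namespace, metric name, and dimensions are assumptions, not a fixed schema.

```python
# A minimal sketch of publishing pass/fail outcomes from security checks as
# CloudWatch custom metrics, so alarms and dashboards can be layered on top.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_check_result(check_name, application, passed):
    """Record one check outcome, dimensioned by check name and application."""
    cloudwatch.put_metric_data(
        Namespace="SecurityChecks",  # hypothetical namespace
        MetricData=[{
            "MetricName": "CheckFailed",
            "Value": 0.0 if passed else 1.0,
            "Unit": "Count",
            "Dimensions": [
                {"Name": "Check", "Value": check_name},
                {"Name": "Application", "Value": application},
            ],
        }],
    )

publish_check_result("open-ssh", "payments", passed=False)
```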
 
Problems evolve, and so should your strategy

Having a remediation plan is good, but executing the plan is what really counts.

Except in a perfect world, a remediation process rarely goes to plan. In reality, implementing a remediation strategy will inevitably hit setbacks and is subject to constant change.

That’s why it is all the more important to put your remediation efforts into action and test the feasibility of your plan.

Your actual remediation will feel different from your original strategy and will evolve along the way. That’s okay. Focus on adjusting your strategy to accommodate the changes in order to achieve the end goal.

When documenting your action plan, pay special attention to the language you use: communicate the strength of your plan and convey confidence in its outcomes.

Once you put it all into action, you can close the gaps in your automation and remediation processes and come up with a more robust, watertight cloud security standard.

DO:

  • Build trust in your remediation with low impact testing in dev/test accounts
  • Validate others understand the plan

DON’T:

  • Assume everyone is automatically on the same page
 
Automation can help reduce your average time to respond

Despite your best efforts to put an automated remediation plan into action, misconfigurations and sub-standard deployments are more common than you might imagine. Oftentimes, a breach incident comes back to bite you a second time if you don’t have a way to assess your remediation tactics.

As the last step in hardening your cloud security posture, it is important to set up an automated monitoring process to keep an eye on potential data leaks and tie up loose ends.

In case of a repeat incident, having a monitoring process in place will help you detect the problem areas immediately and deploy the right fix to close the security loopholes in real time.

Automating the monitoring process also helps you stay compliant with industry mandates, regulatory standards, and public cloud best practices.

In addition to offering a continuous assessment of threat levels, monitoring is also a great way to document your own best policies, apply new rules to your cloud standards, and make changes to your security blueprints.

 
Take control of your cloud security with MontyCloud DAY2

Ready-made remediation makes it easier for you to review and execute security actions, especially now that you know what you are remediating against and for whom.

To that end, MontyCloud DAY2 helps you visualize, analyze and automate your security and compliance standards in just five easy steps.

  1. Assess your cloud security against 200+ AWS security best practices and 164 compliance checks across 60+ industry-specific standards and 72 AWS services.
  2. Get continuous visibility of all your cloud resources and services across cloud accounts and regions.
  3. Instantly get an inventory of all resources across cloud accounts and regions.
  4. Group and manage your cloud resources in the context of their application or department.
  5. Identify abandoned and unused resources, and reclaim or isolate them for further investigation.