MontyCloud Blog

Announcing DAY2™ CloudOps Automation Rules: Hands-off management via automated routine remediations

Written by Luke Walker | Dec 9, 2020 5:42:00 PM

Automating responses and remediations to well-understood alerts, such as increased CPU utilization by IIS web servers, can help make applications more reliable and help CloudOps teams scale faster. In this blog, Luke Walker, Principal Product Manager at MontyCloud, explains how CloudOps engineers and IT SysAdmins can deliver consistent and timely responses with DAY2 CloudOps Automation Rules.

 – Sabrinath Rao

 

Today, we are announcing DAY2 CloudOps Automation Rules, a no-code process builder that helps CloudOps engineers and IT SysAdmins build their own remediation systems. With DAY2 CloudOps Automation Rules, IT teams can now deliver consistent and timely responses by automating alerting and remediation processes for cloud-based applications.

As customers increasingly create and move workloads to the cloud, the workloads become more distributed and complex. CloudOps engineers and SysAdmins are constrained by time and skills as they try to keep pace with the alerts and notifications that help their applications function well. Often this means that any effort to observe and react to operational issues is performed manually, which slows reaction and resolution times, and repeating commands by hand raises the risk of user error. Automating responses to well-understood and frequently occurring alerts can help CloudOps teams scale.

For example, at LeadSquared, a common problem with one of their key business applications arises when the IIS AppPool consumes a high percentage of CPU in the web server farm and application performance degrades. While CPU % utilization is a lead indicator, it is a single indicator, and a number of manual steps must still be performed on each individual web server to confirm that this is not a false positive and to determine whether the AppPool is really at fault. With DAY2 CloudOps Automation Rules, the LeadSquared CloudOps team can automate not just the system checks themselves, but also smooth out the data from the lead indicator and minimize false positive triggers.

With DAY2 CloudOps Automation Rules, teams can save time in implementing system checks and improve accuracy of application reliability responses by focusing on defining the logical sequence of events that should take place with “if this then that” conditional logic, instructing the DAY2 platform to react on the teams’ behalf.

 
How CloudOps Automation Rules Works

The DAY2 platform collects events for a given application’s resources. Teams can then select key events or define conditions against system metrics to trigger actions from the DAY2 Task Library and user-defined scripts. CloudOps teams can chain together simple but effective reactions ranging from troubleshooting and notifications to semi-automated and fully automated remediation responses.

Let us walk through how we can improve our response times to this common problem with CloudOps Automation Rules.

 
Getting Started with CloudOps Automation Rules

Creating your first rule is as easy as accessing the Rules tab on a DAY2 application. Applications in DAY2 are created when you deploy a blueprint, import an application based on tagged resources or an existing AWS CloudFormation (CFN) stack, or classify existing deployed resources into an application.

Once the application is available, open it, go to the Rules tab, and select Add Rule. You will then name your rule and select the trigger event and resources you want to monitor.

We will select CPU Utilization because we want to look into this type of event when the CPU % rises to 80% or above. Once you select CPU Utilization and click Select, the conditions box will be automatically set to Threshold for the metric, which we can then set to Greater than or equals to 80%.

A common problem with reactive systems is that they can fire more often than desired because the threshold is met but not sustained. In this case, that would be a brief CPU % spike.

To smooth out the data and ensure that we only trigger our reaction when the utilization is not an outlier, we will use Add Condition and select Consecutive Datapoints with a value of 5. Datapoints are sampled at 1-minute intervals, so we will only react if the CPU % Utilization is sustained at 80% or higher for a period of 5 minutes.
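The effect of combining the threshold with the Consecutive Datapoints condition can be illustrated with a small sketch. This is not DAY2 code: the function name `sustained_breach` and the sample values are hypothetical, and the only assumptions carried over from the walkthrough are a threshold of 80%, 5 consecutive datapoints, and one datapoint per minute.

```python
from typing import Iterable


def sustained_breach(datapoints: Iterable[float],
                     threshold: float = 80.0,
                     consecutive: int = 5) -> bool:
    """Return True once `consecutive` datapoints in a row are >= threshold.

    Hypothetical helper for illustration; each datapoint stands for one
    1-minute CPU % sample.
    """
    run = 0  # length of the current streak of breaching datapoints
    for value in datapoints:
        run = run + 1 if value >= threshold else 0
        if run >= consecutive:
            return True
    return False


# A short spike crosses the threshold but is not sustained.
spike = [40, 95, 42, 41, 90, 43]
# Five samples in a row at or above 80% satisfy the condition.
sustained = [40, 82, 85, 91, 80, 88, 60]

print(sustained_breach(spike))      # False
print(sustained_breach(sustained))  # True
```

A single sample below the threshold resets the streak, which is why an isolated spike never triggers the reaction.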

Now we can define the actions we want to happen when the conditions have been met. For LeadSquared, two things need to happen: the Ops team must be informed that the CPU % Utilization condition has been met, and a troubleshooting script must be executed. The script, written by the App development team, tests the local application to determine whether the AppPool needs to be cycled.

Under Perform the following actions, we select Send Notification via Email, then set the scope to Operations Access and Application Owner. This ensures that the application owner and any user with Operations Access on the platform will receive the notification. We can opt to customize the message or leave the default subject and body as-is.
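As a rough illustration of what the combined scope means, the sketch below resolves a recipient list from the application owner plus every user with Operations Access. The `User` type, its fields, and the example addresses are hypothetical and are not part of the DAY2 platform; how DAY2 resolves recipients internally is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class User:
    # Hypothetical user record for illustration only.
    email: str
    has_operations_access: bool = False


def notification_recipients(owner: User, users: List[User]) -> List[str]:
    """Return the owner's email first, then each Operations Access user once."""
    recipients = [owner.email]
    for user in users:
        if user.has_operations_access and user.email not in recipients:
            recipients.append(user.email)
    return recipients


owner = User("owner@example.com", has_operations_access=True)
users = [
    owner,                                                  # already included as owner
    User("ops@example.com", has_operations_access=True),
    User("viewer@example.com"),                             # no Operations Access
]
print(notification_recipients(owner, users))
# ['owner@example.com', 'ops@example.com']
```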

Next, we will select Add Action and select Execute DAY2 Task. Tasks are a collection of MontyCloud-provided actions, user-defined Python scripts, and AWS Automation documents that can be executed in any account managed by the DAY2 platform. We have already uploaded an automation document that embeds the App Dev team’s script, called Check AppPool and Cycle, and will use this as our final reaction.

With the event, conditions and actions defined, we can finish our task by selecting Save Changes and the rule will go into effect immediately.
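To recap the walkthrough, the rule can be thought of as an event, a set of conditions, and an ordered list of actions. The sketch below models that shape in plain Python. It is an assumption-laden illustration, not the DAY2 rule format or API: the `Rule` class, its field names, and the rule name are hypothetical, while the metric, threshold, datapoint count, and action labels come from the steps above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    # Hypothetical in-memory model of a rule; not the DAY2 schema.
    name: str
    metric: str
    threshold: float
    consecutive_datapoints: int
    actions: List[str] = field(default_factory=list)

    def evaluate(self, datapoints: List[float]) -> List[str]:
        """Return the actions to run, in order, or [] if conditions are not met.

        The conditions are met when the most recent `consecutive_datapoints`
        samples are all at or above the threshold.
        """
        recent = datapoints[-self.consecutive_datapoints:]
        met = (len(recent) == self.consecutive_datapoints
               and all(value >= self.threshold for value in recent))
        return list(self.actions) if met else []


rule = Rule(
    name="High CPU on IIS web servers",  # hypothetical rule name
    metric="CPU Utilization",
    threshold=80.0,
    consecutive_datapoints=5,
    actions=[
        "Send Notification via Email",
        "Execute DAY2 Task: Check AppPool and Cycle",
    ],
)

print(rule.evaluate([50, 81, 84, 90, 88, 86]))  # both actions, in order
print(rule.evaluate([50, 81, 84, 30, 88, 86]))  # []
```

Keeping the actions as an ordered list mirrors the "if this then that" sequencing: the notification goes out first, then the troubleshooting task runs.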

DAY2 can send alerts via email and through the In-App notification system, and new tasks are being added regularly to the platform to reduce the burden on ops teams of writing their own scripts.

Modern cloud applications are dynamic and event-driven, and they demand a lot from CloudOps teams that are looking to deliver reliability. With DAY2 CloudOps Automation Rules, customers can save time and improve application resilience by monitoring what matters and enabling automated responses based on a logical sequence of automated actions.
 
How can I start using this today?

DAY2 CloudOps Automation Rules is available in MontyCloud’s DAY2 platform today. To learn more about this feature and about MontyCloud’s intelligent Cloud Management Platform, you can request a demo here.