S01: E01 – Break Free from the Task Dungeon
The constant task stream is becoming an ever-bigger burden in cloud operations. Luke Walker talks about automating routine cloud operations and building operational efficiency.
Rick: A CloudOps workday is filled with many big and small tasks. Dealing with the constant task stream is a bigger burden than many realize. It's time to talk about task automation.
This is the OpsTalk Podcast, where we talk about the real, everyday reality of cloud ops. I am Rick Hebly, thanks for tuning in.
With me today is Luke Walker. Luke runs product management at MontyCloud. Great to have you, Luke! Today we are talking about the task stream. What does that mean? What problems do these task streams present?
Luke: Thanks Rick, great to join OpsTalk. Here is what I’m seeing. The reward for working hard is more work, and nothing could be more true for an Ops team. Companies are expanding their cloud footprints. They migrate more workloads, they refactor more applications and build more natively in their cloud.
Unlike a project with a fixed start and end, your typical CloudOps team has a never-ending stream of tasks, and it's no surprise to find teams can't hire or spend fast enough to meet this accelerating demand.
Rick: Can you tell us what kind of tasks we are talking about and how that can spin out of control? Is it simply related to the amount of resources living in the cloud?
Luke: Most look at the amount of resources deployed, but it's really about app consumption: the more people use your apps, the more tasks appear in your ops backlog.
Now, app consumption is a good thing; it means you're delivering value to the business, and that's what you're building for. But are you ready for the operations load that comes with it?
Your teams come under pressure as they juggle the priority, volume, and sequencing of the tasks required to maintain service levels.
Yet trying to carve out time to find a more efficient way to cope with that pressure is a constant struggle, and it's one I hear about from customers all the time.
Rick: Can you give me an example?
Luke: Watching your Ops team get buried by report deployments is one place to start.
One of our early customers is a luxury jewelry retailer with stores all over the world. The customer built their Enterprise Data Hub on AWS, using 10 AWS services, including ECS, EMR, and Elasticsearch, to run analytics on their point-of-sale data.
Once app consumption increased, the team was getting more and more requests for changes to existing reports or for new ones. That in itself created even more tasks to process the changes, re-deploy, and scale their clusters to meet the new demand.
But that's where the bigger problem lies. As in many other projects, Infrastructure as Code handled the deployment, but those same tools generally don't deal with the operational aspects of an application. This was compounded by the fact that all of the customer's operations had been built organically.
Tooling, alerting, and the automation of onboarding tasks were completed only as required, so Ops found themselves needing days to push, test, and measure simple report updates, and months to push a new release into production or even just implement new monitoring rules.
It gets worse when you realize that manual task execution across multiple teams, like deploying resources, integrating services, and setting up IAM roles, created many inconsistencies and errors. Even with 15 cloud engineers running operations, it became unmanageable and compromised compliance.
And even if they could scale out the team fast enough, which they couldn't, the operations and infrastructure costs would skyrocket, and so would the size of the backlog and the gap between business demand and what Ops could deliver.
So, you can imagine what the weekly Ops review with business units started to look like after several months.
Rick: Oh boy, that's quite a predicament; now I get where you're coming from. It sounds like it comes down to preparedness, to controlling the task storm. Surely cloud teams want to avoid getting into a situation like that. How can they go about it?
Luke: With the retailer, after watching the ops team operate for a couple of weeks, it was apparent that just 10 simple technical tasks were taking the team 40 hours, spread over two weeks or more, to implement, on top of all of their other work.
This is going to sound cliché, but you just have to automate tasks, Rick; it's that simple. Getting those 40 hours back and fulfilling a change request in an hour or less isn't going to happen unless you start to automate.
Rick: That is right, but it's also easier said than done. Isn't there a big automation debt?
Luke: Today it takes a lot of scripting and maintenance, based on the tasks you can schedule and the events you can predict. You are right: writing the code to tie APIs together across service boundaries is where things get more difficult. It takes significant DevOps skills, a deep understanding of AWS services and architectures, as well as time. And all of this comes before you see any reduction in tasks, cost, and risk. Like I said, adding talent to teams is difficult nowadays.
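(To make that scripting burden concrete, here is a minimal, hypothetical sketch in Python with boto3 of one such routine task: snapshot every EBS volume carrying a backup tag, then notify the team through SNS. The tag key and topic ARN are placeholders, and this illustrates the kind of glue code involved, not MontyCloud's product code.)

```python
# Illustration only: one small routine task that already spans two service APIs.
# Assumes boto3 credentials are configured; tag key and SNS topic are placeholders.
import boto3

TAG_KEY = "backup"          # hypothetical tag used to mark volumes for backup
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-notifications"  # placeholder

ec2 = boto3.client("ec2")
sns = boto3.client("sns")

def snapshot_tagged_volumes():
    """Create snapshots for every EBS volume tagged backup=true, then notify Ops."""
    volumes = ec2.describe_volumes(
        Filters=[{"Name": f"tag:{TAG_KEY}", "Values": ["true"]}]
    )["Volumes"]

    snapshot_ids = []
    for vol in volumes:
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"Scheduled backup of {vol['VolumeId']}",
        )
        snapshot_ids.append(snap["SnapshotId"])

    # Even a simple task needs its own notification, error handling, scheduling...
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=f"Created {len(snapshot_ids)} snapshots: {snapshot_ids}",
    )

if __name__ == "__main__":
    snapshot_tagged_volumes()
```

Even this small example touches two service APIs and still needs scheduling, retries, and permissions before it is production-ready, which is exactly where the maintenance cost piles up.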
Rick: I see another predicament, but you and our product and engineering teams at MontyCloud found a better way, right? How do you make CloudOps simple and help customers break out of the automation debt?
Luke: Sure thing. First of all, AWS does a brilliant job of offering rich API sets, across both the resources that make up the building blocks of your application and the tools to deploy, manage, and secure those constituents. Think of CloudFormation to provision Infrastructure as Code, Systems Manager to manage your EC2 fleet, and IAM roles to secure access, just to name a few. To take full advantage of cloud-native APIs, you need code to stitch them together across service and tooling boundaries, so that they work for you in your own application context.
You have different architectures powered by various infrastructure and database instances and clusters, managed by multiple tools, possibly even across accounts and regions. There's a lot of code for DevOps to write there, provided you understand really well what outcome you're building for. You must do it deliberately and consistently. If not, it can get ugly, as it did for our customer in the luxury retail business.
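(As an illustration of stitching APIs across service and tooling boundaries, here is a hypothetical boto3 sketch: look up a CloudFormation stack, then use Systems Manager Run Command to restart a service on the EC2 instances that stack created. The stack name and shell command are placeholders; this is not DAY2's implementation.)

```python
# Illustration only: CloudFormation knows what was deployed, but operating on it
# means crossing into another service's API entirely.
import boto3

STACK_NAME = "my-app-prod"   # placeholder stack name

cfn = boto3.client("cloudformation")
ssm = boto3.client("ssm")

def restart_app_on_stack_instances():
    """Confirm the stack exists, then restart the app service on its instances."""
    stack = cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]
    print(f"Stack {stack['StackName']} is {stack['StackStatus']}")

    # Target the EC2 instances CloudFormation tagged as belonging to this stack.
    response = ssm.send_command(
        Targets=[{
            "Key": "tag:aws:cloudformation:stack-name",
            "Values": [STACK_NAME],
        }],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl restart my-app"]},  # placeholder command
    )
    return response["Command"]["CommandId"]
```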
Rick: I can only imagine. Most IT teams are not staffed with skilled DevOps engineers and Python experts. So again, how do you solve this problem for IT teams? Or should I say Cloud Centers of Excellence?
Luke: We need to break down what a deployment really means, or in our case, a DAY2 blueprint. In order to attain Operational Excellence, you must take into account BOTH your deployment and operation templates.
Without an operation template, you hit the same predicament our retail customer encountered: the app is deployed, but the team is drowning in tasks.
Rick: So why not just standardize your scripts, put them in a common repository and reduce scriptwriting time?
Luke: When you flip the coin over and try to establish a one-size-fits-all standard for ops, the generic operations playbook has no tie back to the deployment template, and you end up with generic tooling and automations that will never make a dent in your backlog.
Standardizing your ops tooling does help, but you may find that it only meets 30 to 40% of your needs, because every application is different.
So when you design your architecture, that is also when you must design how you will operate it, in order to stay ahead of the task stream.
This is what we believe makes our DAY2 Blueprints different – we are planning for success by forcing the conversation on not just what needs to be built, but what must be defined and automated to successfully operate that application.
Rick: That’s a plan. Failure to plan is planning to fail?
Luke: Yeah – clichés exist because they’re true. Here is what we’ve done to combat the task stream problem.
We've set up an entire library for our customers, featuring Well-Architected Blueprints for over 20 common services, and it's growing every week. Those range from basic infrastructure build-outs, such as pre-architected public and private VPCs, EC2, and RDS, to container clusters managed with Kubernetes and/or Fargate, all the way to complex data analytics applications such as Elastic MapReduce and Elasticsearch.
And because we see a blueprint as a combination of deployment and operations, every blueprint is built with a health dashboard, monitoring metrics, and routine tasks out of the box.
We've done all of the heavy lifting to get these templates compliant with AWS's Well-Architected Framework, so our customers just get to click and deploy any blueprint on a self-service basis and take advantage of all of this work.
Rick: Sounds like a value prop, Luke. All the time I hear folks asking for self-service deployment capabilities to lower the threshold and get going faster with AWS, instead of burning precious time reinventing the wheel (and then hoping those wheels didn't turn out square).
Luke: Yes, and this must certainly be based on No-Code operations to democratize the cloud for the majority. Again, most cloud operators lack the skills or time to do it all themselves. And they are not the problem; the problem is the unrealistic expectation that everyone touching the cloud can code, or should want to.
Rick: Now, I picked up on something else in what you just said. You talked about post-deployment task automation. It's true that most operations come after the deployment. Managing IAM roles, as one unpopular example. Is that what you were getting at?
Luke: That is the dilemma I was waiting for! Don’t get me wrong, it’s certainly challenging to write a CFN template to meet Well-Architected standards, but in an age where deployments are becoming more and more automated, it’s no longer the big problem to solve.
Let me share another war story. Another customer we worked with had successfully automated the deployment of data sets for data scientists performing experiments on healthcare data. These deployments created hundreds of S3 buckets every day, which reduced their time-to-experiment significantly while keeping each project and researcher isolated, meeting their HIPAA requirements.
Next thing you know, operations ground to a halt when they discovered how time-intensive it was to unwind all of these deployments, or even to modify or remove access for researchers who had left their various projects, resulting in multiple P0 tickets and management calls.
Now, these tasks weren't technically difficult, but the sheer number of them simply crippled the team, because Operations was not included in the design phase.
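(For a sense of what those "not technically difficult" cleanup tasks look like in code, here is a hypothetical boto3 sketch that revokes a departed researcher's access across a project's S3 buckets by editing bucket policies. The bucket naming convention, ARN, and policy layout are illustrative assumptions, not the customer's actual setup.)

```python
# Illustration only: strip policy statements that grant one researcher access
# to every bucket under a project prefix. Names and ARNs are placeholders.
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def revoke_researcher(researcher_arn: str, project_prefix: str) -> None:
    """Remove bucket-policy statements that name researcher_arn on project buckets."""
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        if not name.startswith(project_prefix):
            continue
        try:
            policy = json.loads(s3.get_bucket_policy(Bucket=name)["Policy"])
        except ClientError:
            continue  # bucket has no policy attached

        # Keep only statements that do not mention the departed researcher.
        kept = [
            stmt for stmt in policy["Statement"]
            if researcher_arn not in json.dumps(stmt.get("Principal", {}))
        ]
        if len(kept) == len(policy["Statement"]):
            continue  # nothing to change for this bucket

        if kept:
            policy["Statement"] = kept
            s3.put_bucket_policy(Bucket=name, Policy=json.dumps(policy))
        else:
            s3.delete_bucket_policy(Bucket=name)
        print(f"Revoked access for {researcher_arn} on {name}")

# Example: revoke_researcher("arn:aws:iam::123456789012:user/researcher-alice", "edh-project-")
```

Simple enough for one bucket; it is the hundreds of buckets per day, multiplied by departures and project changes, that buried the team.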
And this is EXACTLY why it takes more to go from a deployment to a Well-Managed Application.
Rick: Can I interrupt you there for a second? A Well-Managed Application. What does that mean?
Luke: Good question. In a nutshell, it is an application deployed in such a way that it can be managed efficiently and is secure and compliant.
Remember, it always starts with deployment. When you deploy an application that is manageable, then you can manage the application well. (We seem to be all about clichés today. [laughs]) In all seriousness though, it is much easier when you do it right from the start.
Next come your routine management tasks. Depending on the application, resource, and policy, these can include tasks like adding and removing nodes, backup and restore, or configuring alerts. These management tasks are more commonly known as DAY2 tasks.
Bringing resources under management has usually been a separate workstream, including agent installation, setting access permissions, and applying governance policies. Instead of hoping that gets done, a Well-Managed Application is aware of the context and is provisioned with the right components and configurations at the time of deployment.
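(As one example of a routine DAY2 task provisioned in the application's context, here is a hypothetical boto3 sketch that configures a CPU alert for every EC2 instance tagged as part of an application. The tag key, application name, and SNS topic are assumptions for illustration, not how DAY2 does it internally.)

```python
# Illustration only: wire a CloudWatch CPU alarm onto every instance that
# belongs to one application, identified by a (placeholder) "app" tag.
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def configure_cpu_alerts(app_tag: str, topic_arn: str) -> None:
    """Create a high-CPU alarm for every EC2 instance tagged as part of the application."""
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:app", "Values": [app_tag]}]
    )["Reservations"]

    for res in reservations:
        for instance in res["Instances"]:
            instance_id = instance["InstanceId"]
            cloudwatch.put_metric_alarm(
                AlarmName=f"{app_tag}-{instance_id}-high-cpu",
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                Statistic="Average",
                Period=300,
                EvaluationPeriods=2,
                Threshold=80.0,
                ComparisonOperator="GreaterThanThreshold",
                AlarmActions=[topic_arn],  # notify the Ops channel
            )

# Example: configure_cpu_alerts("data-hub", "arn:aws:sns:us-east-1:123456789012:ops-alerts")
```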
Rick: So you are suggesting that with MontyCloud your DAY2 operations are pre-configured in the application context?
Luke: Yes, I am; more than that, actually. We created the ability to enable self-service tasks at the application level and at the time of deployment.
Then we also created a per-application dashboard to monitor it, audit changes, perform routine tasks through simple clicks or on an automated schedule, track and forecast resource-level costs, and run reports for the business.
Rick: Now that is something different. You can see the entire production line come together here. Again, the No-Code self-service is hitting it home. Speaking of that, I have a question. Self-service can be a little scary: without proper governance, users can break policy, compliance, budgets, and who knows what else. Have you thought about that?
Luke: Absolutely. Self-Service is governed through guardrails, and this is all part of the routine tasks bundled into the Blueprints, ensuring you can’t move out of bounds.
In the S3 bucket dilemma we talked about before, you want to have a self-service task that resets permissions, but that task should only be accessible to the researcher for that data set, and your Ops team, and perhaps scheduled to run on a regular basis. You may also want to have another task that allows alert configuration, but without the ability to remove a CloudTrail log.
Self-service guardrails are essential for a Well-Managed application.
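(One simple way to picture such a guardrail, purely as an illustration and not how DAY2 implements it: a task wrapper that checks the caller's identity against an allowlist defined for that data set before running the permission reset. The allowlist and the reset function below are placeholders.)

```python
# Illustration only: a self-service task runs only if the current caller is on
# the allowlist defined for that data set.
import boto3

sts = boto3.client("sts")

# Hypothetical per-data-set allowlist, defined alongside the blueprint.
ALLOWED_CALLERS = {
    "genomics-dataset": [
        "arn:aws:iam::123456789012:user/researcher-alice",
        "arn:aws:iam::123456789012:role/ops-team",
    ],
}

def run_guarded_task(dataset: str, task) -> None:
    """Execute a self-service task only if the caller is allowed for the data set."""
    caller_arn = sts.get_caller_identity()["Arn"]
    # (Assumed-role sessions would need prefix matching; omitted for brevity.)
    if caller_arn not in ALLOWED_CALLERS.get(dataset, []):
        raise PermissionError(f"{caller_arn} is not allowed to run tasks on {dataset}")
    task()

def reset_permissions():
    print("Resetting bucket permissions to the approved baseline...")  # placeholder

# Example: run_guarded_task("genomics-dataset", reset_permissions)
```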
Rick: Got it. Now how did it work out for that luxury jeweler you talked about?
Luke: I can tell you this: they're in a far better place now. Remember it was taking weeks and months to execute change requests? We took the top 10 common tasks plaguing the team's backlog, built them into the DAY2 Blueprints, and change requests are now completed within the hour, down from an average of 40 hours spread over two-plus weeks.
So not only did automated deployment governed by guardrails reduce time and cost, but they also now have monitoring and task automation set up at the time of deployment.
Our DevOps engineer Gana wrote a blog about it recently. Folks who want more detail can check it out at MontyCloud.com/casestudies or just hit “Customers” on our front page.
Rick: Thanks, Luke. I appreciate you taking the time to unpack the task automation issues and sharing proven solutions that are also very accessible. Can our listeners take a look at it themselves?
Luke: You are welcome, Rick. And yes, people can get a fully featured free trial. DAY2 is SaaS and cloud-native, so there's nothing to install, and that includes managing your instances: we are agentless.
Just go to montycloud.com, hit Get Started for Free, and you’ll be ready to connect your AWS account in no time, with no code and no agents.
Rick: Brilliant. Thanks for watching the OpsTalk podcast.
In our next episode, we’re going to be talking about the Remote Console feature, where you can gain shell-level access to Windows and Linux instances without the need for VPNs, bastion hosts, or even granting access to the AWS console.
Make sure you hit subscribe on our YouTube channel and don’t forget to click the bell to make sure you don’t miss that episode. We look forward to talking to you then!