A few years back, in a sidebar discussion at a tech conference, one of Netflix’s engineering managers asked me if I was using any automation tools at work.
I said, “Not really. It’s a small environment and we’re not delivering any web apps that require automation for scale.”
She gave me a look somewhere between amused and sympathetic and replied, “Dealing with scale isn’t the only reason to automate things.” It wasn’t condescension; she was being kind and dropping some knowledge, but I didn’t know how to respond.
A little embarrassed, I mumbled some other excuses for why automation wasn’t a good fit, said ‘nice to meet you’, and wandered off.
I cycled through my excuses, trying to figure out if they were valid. Most of the automation and config management tooling I had used in the past was imperative, task-sequence-based stuff, like what you’d find in Microsoft System Center. When you have to play the “walk forward five steps, now extend left hand at 30 degrees, close fingers around the peanut butter jar” programming game for smaller, legacy environments, it definitely feels “not worth it.”
Days later, the conversation still bugged me. “Why do people automate their infra? Why, really?” Even after reading a ton of articles, blog posts, and whitepapers, I still couldn’t come up with anything that wasn’t ultimately a scale use-case.
I had confirmed my bias, and in other circumstances I probably would have stopped there, but what the Netflix employee said had a ring of truth that I couldn’t let go of. I kept digging.
To understand the benefits of (and justification for) automation firsthand, I started automating things.
Turns out, that engineering manager had a gift for understatement.
Livestock, not pets
I grew up in a culture of IT where servers, even PCs, were treated as special snowflakes. It took a long time to reinstall Windows + drivers + software, so you did a lot of care, feeding, and troubleshooting to make sure you didn’t have to start over from scratch.
We named servers after hobbits and constellations. We got attached to them and treated each one like a pet.
“Bilbo-01 just crashed?! NOOOOOOO!”
In some ways, virtualization made that philosophy worse. Things were more abstracted, but not enough to force a shift in mindset. You could now move your pet servers between different hardware, which reduced the reasons you’d ever have to rebuild a particular server. At great cost, effort, and risk (“You can never patch my preciousssss.”), there are businesses running VMs that are old enough to drive.
So we ended up with thousands of VMs running thousands of apps that were set up by people who have retired, switched jobs 10 times since, or stayed on and now act like fancy wizards, holding their knowledge tight to their chests.
Automation is the documentation
Let’s tackle the issue of tribal and secret knowledge first.
A big component of DevOps (and the Lean concepts that inspire it) is identifying and removing bottlenecks. Sometimes those bottlenecks are people. This doesn’t mean you have to get rid of people, but you do need to (where possible) remove any one individual as a core dependency for getting something done.
“Bob is the only person who knows how to install that app.”
“Those are Jane’s servers, you’ll have to check with her.”
“We can’t change any of this because no one knows how it works.”
At the end of the day, this is a scale problem. It’s scaling your IT to be larger than one person. Part of the solution to this problem is cross-training, but automation can also help (and prevent future stupidity).
If you use a configuration management tool like Ansible or Chef, the playbooks/cookbooks become the documentation for the environment. They detail the dependencies, configuration changes, and service hooks that were realistically never going to be documented otherwise. And if you’ve subscribed to a declarative model of automation, the playbooks don’t just describe what the app stack should look like; run them again and they enforce that the stack matches what’s in the playbook.
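As a sketch of what that looks like, here’s a minimal, hypothetical Ansible playbook for a simple web tier. The host group, template path, and config destination are invented for illustration; the modules are standard Ansible built-ins.

```yaml
# site.yml -- hypothetical example; group, paths, and template are invented
- name: Configure web tier
  hosts: webservers
  become: true

  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Deploy site configuration from a template
      ansible.builtin.template:
        src: templates/site.conf.j2
        dest: /etc/nginx/conf.d/site.conf
      notify: Restart nginx

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

Every dependency, config file, and service hook is spelled out in one place, and re-running the playbook enforces that state instead of just describing it.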
Things generally break because something changed. Maybe it’s a hardware or network failure. Maybe the software is buggy and there was a memory overrun or a cascading service failure. Maybe somebody touched something they shouldn’t have.
In olden times, a sysadmin would be tasked with troubleshooting the broken thing, wasting hours on Google searches and trial & error. Meanwhile, the app is down.
If you’re automating your infrastructure, that’s less of a thing. App stopped working? Re-run the playbook for the stack. Want to know why the app stopped working? Look at your run logs. Troubleshooting is still needed sometimes, but there is a lot less fire fighting when you can push a simple reset button to get things back up and running. Turn it off and on again.
For approved changes, automation requires that the changes be well defined, which is a big positive that helps everyone know what’s happening and what to expect.
This type of state enforcement could equally be considered a security measure. Some people schedule plays that run through app stacks and repair/report anything that doesn’t match the expected norm.
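One way to sketch that with Ansible is to run the same playbook on a schedule in check mode, which reports drift without changing anything. The cron schedule, paths, and host group here are assumptions for illustration:

```yaml
# drift-check play -- hypothetical; could be scheduled via cron, e.g.:
#   0 2 * * * ansible-playbook /opt/plays/site.yml --check --diff >> /var/log/drift.log
- name: Report configuration drift on the web tier
  hosts: webservers
  become: true
  check_mode: true        # report what WOULD change, change nothing

  tasks:
    - name: Verify the site config still matches the template
      ansible.builtin.template:
        src: templates/site.conf.j2
        dest: /etc/nginx/conf.d/site.conf
      diff: true
```

Drop `check_mode` and the same play becomes the repair button instead of the report.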
NO MORE (or maybe less) PATCHING!
Not everyone is able to get there, but having fully automated stacks often means you can do away with OS patching. Just rebuild the stack once a month with the newest patched OS image. Boom!
If you do have to patch, you can significantly reduce your patching and service confirmation work by building the patch installs, reboots, and health checks into your automation. This helps prevent the post-patch-night “My app doesn’t work.” emails.
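Sketched in Ansible, a patch run with a built-in health check might look like this. The module names are real; the host group, port, and `/health` endpoint are invented for illustration:

```yaml
# patch.yml -- hypothetical monthly patch run (Debian/Ubuntu hosts assumed)
- name: Patch and verify app servers
  hosts: appservers
  become: true
  serial: 1                      # one host at a time, so the app stays up

  tasks:
    - name: Install all available updates
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true

    - name: Reboot and wait for the host to come back
      ansible.builtin.reboot:
        reboot_timeout: 600

    - name: Confirm the app answers after the reboot
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

If the health check never passes, the play stops before it touches the next host, which is exactly the email you want to get instead of “My app doesn’t work.”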
Even with de-dupe, I can’t imagine how many petabytes of backup data are made up of OS volumes and full VMs. If you’re automating deployment and config management, the scope of what you need to back up shrinks dramatically (and so does your time to recover).
You’re really just concerned with backing up application data. Beyond that, you can treat compute and the VMs your app runs on as disposable. All you have to worry about is keeping your playbooks and configs in version control and having some method to back up databases and storage blobs.
This rolls into DR and failover as well. In many instances, automation will enable you to do away with failover systems. Depending on your SLAs, a recovery plan could be as simple as “re-run the playbook with a different datacenter target.”
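A play can take its target from a variable, so “re-run with a different datacenter target” really is just a different invocation. The group names and role here are invented for illustration:

```yaml
# site.yml -- target datacenter chosen at run time
- name: Deploy the app stack
  hosts: "{{ target_dc | default('dc_primary') }}"
  become: true
  roles:
    - app_stack

# Normal run:   ansible-playbook site.yml
# DR run:       ansible-playbook site.yml -e target_dc=dc_backup
```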
Integration tests… for infrastructure
If you truly are treating your infrastructure as code, you can write unit and integration tests for it that go past “well, the server responds to ping”. You can also deploy into test environments very easily, and run those environments more cheaply because you don’t have to maintain 1:1 infrastructure full-time.
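Even plain Ansible can express those tests with its assertion modules; dedicated tools like Testinfra and Molecule go further. The port, URL, and expected page content below are stand-ins:

```yaml
# smoke-test.yml -- hypothetical post-deploy checks, beyond "responds to ping"
- name: Integration-test the web tier
  hosts: webservers

  tasks:
    - name: Port 443 is accepting connections
      ansible.builtin.wait_for:
        port: 443
        timeout: 10

    - name: Homepage returns 200 and real content
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}/"
        return_content: true
      register: homepage

    - name: The page is actually our app, not an error screen
      ansible.builtin.assert:
        that:
          - "'Welcome' in homepage.content"
```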
Turns out, if you make testing easier, people actually test things and you end up with better infrastructure.
This stuff is important
I get that none of these things feel very sexy, but in practice, they are game changing. As you start automating, you’ll discover that your infrastructure doesn’t work exactly like you thought it did, you’ll figure out what different apps actually need, and you’ll pull the weight of being the only person that knows something about a particular server/app off of your shoulders.
Some people like keeping secrets. They think being the only person who can do something gives them job security.
Those people are idiots. Maybe they will keep their job, but that’s not a good thing. They’ll never advance, never do anything more interesting than their current responsibilities.
Automating your infrastructure, opening up the secret knowledge to the entire team and doing away with the idea of being a hero who fights constant fires, is how you free yourself up to do better things. So build the robot, let it take over your job, and keep peeling all the layers of the onion to find work that’s more meaningful and interesting than installing patches, troubleshooting IIS, and getting griped at because “the server” is down.
You don’t have to work for a web company or be in the cloud to do this stuff (although some of the cloud toolsets are better). If you have even a small number of servers, it’s worth it. You don’t need “scale”; you just need a desire for your infrastructure not to suck.