Big cheers to my old team and friends at ZeniMax Online Studios, who got to announce Elder Scrolls Online yesterday. Note: announcement Web site living in the cloud. Very nice.
Month: May 2012
I have been pretty busy with travel lately, but it turns out to be sort of reasonable to write on an iPad while flying. This post is brought to you by JetBlue. Let’s talk about a few best engineering practices as they apply to technical operations.
1: Code review is a good idea. This is a new practice for many sysadmins, but most of us get the idea that you want someone else to eyeball a maintenance change before you make it live. Sell it that way. If your engineering department has code review software in place, you might as well give it a try. Chances are it’s tied into the source control system you’re both using, and automation reduces friction.
As a manager, the best thing I can do to help people get comfortable with formal code review is to reduce the friction. I want code reviewed as quickly as possible. This means being ready to do it myself if necessary, although you don’t want people dependent on you. So complain in your daily standups when other people are falling down on the job.
Oh, and get programmers involved too. You want them interested in Puppet and init scripts. You also want to be at least an observer on their code reviews, and it’s easier to get that if you invite them into your house first.
2: Integrate continuously. When you change something about a machine build, test that change in a local VM. (Take a look at Vagrant for one good tool that helps a lot with that.) After you check your changes in, the build system should do an automated test.
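To make the "automated test" part a little more concrete, here is a minimal sketch of the kind of smoke test a build system could run against a freshly built machine. This is my illustration under stated assumptions, not a specific tool's API: the helper names (`file_exists`, `port_open`, `run_smoke_tests`) are hypothetical, and the checks you actually care about will depend on what the machine is supposed to do.

```python
import os
import socket


def file_exists(path):
    """Build a check that passes when `path` is present on the machine."""
    def check():
        return os.path.exists(path)
    # Human-readable label used when reporting failures.
    check.description = f"file exists: {path}"
    return check


def port_open(host, port, timeout=1.0):
    """Build a check that passes when a TCP connection to host:port succeeds."""
    def check():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the service is not up.
            return False
    check.description = f"port open: {host}:{port}"
    return check


def run_smoke_tests(checks):
    """Run every check and return the descriptions of the ones that failed."""
    return [check.description for check in checks if not check()]
```

A build job could call `run_smoke_tests([file_exists("/etc/hosts"), port_open("127.0.0.1", 80)])` (example values, not recommendations) and fail the build whenever the returned list is non-empty. The same script works against a local Vagrant VM before you check anything in.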
I personally like to have two different development environments in the build system. One is the usual engineering development environment; the other is the technical operations environment. You could integrate ops changes side by side with engineering changes, but I’m still a bit leery of that. Especially in the early days, you’ll wind up breaking the test systems a lot. The natural developer reaction to that is annoyance. I’d rather avoid that pain point and give ops a clean test bed; the tested results can integrate into the main flow daily or as necessary.
I may change my mind on this eventually. The counter-argument is the same case I’ve been making all along: configuration is code. Why treat it any differently than the rest of the codebase? That’s a pretty compelling argument and I think the perfect devops group would do it that way. I just don’t want to drive anyone insane on the way there.
3: Hire coders. I am pretty much done hiring system administrators who can’t write code in some language. I don’t care if it’s Perl or Ruby or Python or whatever. I probably don’t want someone who only knows bash, but I could be convinced if they’re a really studly bash programmer.
I want to have a common language for the entire group; it’s kind of messy if half the routine scripts are written in Perl and half are written in Ruby. If nothing else, you’re cutting back on the number of people who can debug your problems. Because tech ops people like variety, it’s almost certain that we’ll be using a bunch of scripting languages. If you have Puppet and Splunk installed, you’re stuck with Ruby (for Puppet) and Python (for Splunk) and you’re going to need to know how to code in both. Still, you can at least make your own scripts predictable.
I can’t dig up any video of this, but Adam Jacob has a great spiel about how you’re probably a programming polyglot even if you don’t know it. If you speak Perl, you’ll be able to puzzle your way through enough Python to write a quick Splunk script. When hiring, you want to make sure your candidate has a reasonably good understanding of one of the common scripting languages. It doesn’t matter so much whether it’s the department standard.
Your technical operations people will be writing code, though. There’s a lot of complexity in automation. That complexity is abstracted so you can deal with it on a day to day basis, and the process of abstraction is writing code. So hire coders.
The one exception is entry-level NOC people, and not every group has those. If you do have a 24/7 NOC, those guys may not be coders. However, it is mandatory to give those employees training programs which include coding. It’s hard enough maintaining camaraderie between off-hours employees and the day shift without having knowledge barriers.