
If you’ve ever set up any computing device, up to and including a smartphone, you’ve done configuration management. If you’re not a system administrator, chances are you’re doing it really old school style. You know what you want the computer to look like and do, and you know pretty much how to get it there, so you just sit down and install programs and change settings and so forth until the computer looks right. Possibly you find out that you missed something the next day, but that’s easy enough to fix.

[Image: web server install, the bad old days]
Back when I was a lad, we’d figured out that this was prone to breaking when you had to manage any more than a handful of servers. The first iteration of configuration management I ever encountered was a written script. “Log in. Type this. Make these changes.” You check off the boxes for each server and you have some degree of confidence that the servers are configured consistently. This doesn’t help as much when you’re making changes, though, and it’s still subject to human error.

At AltaVista, we had this awesome configuration system written in Perl. We copied a script over to any new server, added the new server to a text file back on the main configuration server, and ran the script. It copied a bunch of files over to the new server, some of which were shell scripts. I seem to recall it’d then run the scripts. (Yeah, I know.) You could define individual packages and specify which servers got which packages; you could also update individual packages and the updates would likewise get pushed out. It was fairly clunky. On the other hand, conceptually it wasn’t that far from what we’re doing these days.

Fast forward 10 years. There are two very popular open source UNIX-oriented configuration tools, Puppet and Chef. (Both have recently added Windows support, but I haven’t used either to manage Windows machines.) There are also a ton of other choices if you don’t like Ruby, the language both of them are written in. I’d also be remiss if I didn’t mention that the original configuration management system, CFEngine, is still going strong. However, Chef and Puppet both have a lot of momentum: it’s easier to find people who’ve used them, they both have good commercial support, and so on.

The core idea behind any modern configuration management system is the ability to describe the desired state for a computer in one central location, which then propagates out to individual systems. The key concept there is that you’re describing the desired state, not the steps necessary to get there. In other words, rather than writing a script that runs the commands required to install a Web server, you’d specify that you wanted the Web server package installed. Likewise, instead of writing the command to start your game server, you’d specify that you want the game server software to be running at all times. Puppet and Chef both allow you to control a wide range of states including software package installs, which services should be running, user accounts, and so on.
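
To make that concrete, here’s a minimal sketch of the desired-state style in Puppet. The package and service names are illustrative, not prescriptive:

    # Declare the end state; Puppet works out the steps for this platform.
    # 'httpd' is an illustrative name: yours may differ.
    package { 'httpd':
      ensure => installed,
    }

    service { 'httpd':
      ensure  => running,
      enable  => true,              # also start at boot
      require => Package['httpd'],  # install before managing the service
    }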

Assuming the configuration management software is well written, your configuration specifications are idempotent. That means you can apply the configuration to a given server as many times as you want without screwing anything up. For the Web server installation example, imagine that we were recompiling or reinstalling the Web server software each time the configuration management system checked for updates. It might not hurt anything, but it’d be extra unnecessary work.

Your script could certainly start out by checking to see if the Web server was already installed, in which case your script would also be idempotent. But you’d have to write that check into every script. If you use a configuration management system, you get the benefit of that check without having to reinvent the wheel for every single element of the system.
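
Even when you do have to fall back on a script, the configuration system nudges you toward idempotence. Here’s a hedged sketch of the guard pattern in Puppet; the script path and the sentinel file are made up for illustration:

    # Run a one-off install script, but only if it hasn't run before.
    # Both paths are hypothetical.
    exec { 'build-webserver':
      command => '/usr/local/sbin/build_webserver.sh',
      creates => '/usr/sbin/httpd',   # if this file exists, skip the run
      path    => ['/bin', '/usr/bin'],
    }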

[Image: sample Puppet configuration, so much better]
The coders in the audience will recognize this as basic abstraction via an API. Puppet and Chef abstract both the idempotence and the underlying operating system. If you write a Puppet configuration module that specifies a software package to install, that package will be installed if you apply the module to a Solaris system, an Ubuntu system, or a Red Hat system. Those all use different packaging systems; you don’t need to care, because your configuration management system cares for you.
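
As a sketch of what that buys you, here’s the classic pattern using Puppet’s osfamily fact; the package names are the usual platform defaults, but check your own distributions:

    # One manifest, several operating systems: pick the platform's package
    # name and let Puppet's providers handle yum vs. apt vs. pkg.
    $web_package = $osfamily ? {
      'RedHat' => 'httpd',
      'Debian' => 'apache2',
      default  => 'apache',
    }

    package { $web_package:
      ensure => installed,
    }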

So that’s the importance of abstraction. The other big thing you get out of using Puppet or Chef is modularity. Let’s say I have two modules that set up the basic networking for my two data centers. I can easily set up my configuration servers such that any server in my New York data center automatically has the New York module applied and any server in my San Mateo data center has that module applied. I don’t have to assign each server to one of the data centers by hand; the configuration server just needs to know that any server within a given IP address range belongs to New York, and so on.
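
A minimal sketch of that kind of classification in Puppet, keying off the node’s reported IP address; the subnets and class names here are invented:

    # Assign a data center module based on the node's subnet.
    # 10.1.x.x and 10.2.x.x are placeholders for your real ranges.
    if $ipaddress =~ /^10\.1\./ {
      include datacenter::newyork
    } elsif $ipaddress =~ /^10\.2\./ {
      include datacenter::sanmateo
    }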

You can also differentiate by a ton of other criteria. When a computer connects to the configuration server, it reports a bunch of information: its name, its physical characteristics (memory, CPU, etc.), what operating system it’s running — all that jazz. If my forum Web servers are all named something like forum-web-01.internal.company.com, and my billing Web servers are all named something like billing-web-01.internal.company.com, my configuration system can apply the basic Web package to both classes of server while applying the specific tweaks for forum servers to those servers only. Super-powerful stuff.
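
In Puppet, hostname-based classification falls out of regular-expression node matching; the class names below are illustrative:

    # site.pp: match node definitions by hostname pattern.
    node /^forum-web-\d+\.internal\.company\.com$/ {
      include web::base
      include web::forum
    }

    node /^billing-web-\d+\.internal\.company\.com$/ {
      include web::base
      include web::billing
    }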

Now, I mentioned that coders will recognize that all this is basically an abstract API. The not-so-secret implication is that a system administrator writing these configurations is in fact a developer. Puppet uses its own language, but Chef configuration files are actually pure Ruby, so if you’re working with Chef you don’t have any excuse for pretending you’re not using a programming language just like those guys down the hallway who keep complaining about Visual Studio quirks.

Configuration is code.

I’d been leaning towards OpenStack for my private clouds; it’s not entirely production-ready yet, but with Dell, HP, and Rackspace behind it I had high hopes. But Citrix just fully open-sourced CloudStack. As Lydia Leong notes, this changes the calculus a bit.

She mentions HP’s cloud efforts, but I’m going to be watching Dell. I think Rob Hirschfeld’s Crowbar project is going to hit the sweet spot for people who want to deploy private clouds. They’re deeply involved in the OpenStack community; still, stability is attractive.

And while I’m reading Lydia’s blog, she’s right about the Eucalyptus/EC2 API license. You don’t see anyone claiming that Google+’s API (whenever it shows up) needs to be compatible with Facebook’s API in order to succeed.

Devops was a concept on the horizon two years ago. Since then I’ve had the chance to dig into it, figure out what I didn’t understand then, and implement an agile, devops-focused work environment. In the interests of demystifying what I do, I’m going to lay out what all those buzzwords currently mean to me. Spoiler: I still don’t think continuous deployment all the way to live makes sense for MMOs. Social games may be another matter.

First Principle

Devops is the principle of applying software development methodologies to the discipline of system administration, and vice versa. There are a couple of reasons to do this. First, the two fields are closely related, and they have a lot to learn from one another. At the end of the day you’re giving directions to computers and hoping you said it right. If developers have learned some cool tricks for double-checking your hopes before you subject them to a live datacenter, bonus. If system administrators have learned some stuff about testing code in the same live datacenter and we can pull those tests forward in the dev process, double bonus.

Second, there’s a tendency for groups using the same techniques to talk more. Anything that gets groups to talk more is a good thing. I have a grand plan to invent marketops sometime in the next couple of years just so I can get marketing and operations groups to talk more. Mardevops comes next. Don’t think I’m not serious.

OK, so we have a principle and some reasons why it’s a good idea. Let’s break that down a bit further and spend a little time on the techniques we can use to realize that principle.

Configuration is Code

The buzzphrase is “infrastructure is code.” This is a lie and I actually think it’s a bit damaging. Over the last five years, virtualization has made it very easy to think of computers as non-physical objects that can be completely reconfigured remotely. Modern server hardware technology helps; if your servers are blades, the blade chassis containing them probably manages a lot of the individual blade configuration, and any blade you plug into a given slot in the chassis will behave according to how you’ve configured that slot.

However, you’re still dealing with cables and heat profiles and power requirements. Infrastructure involves the physical. Even if we’re using a cloud, it’s critical to think about the underlying hardware in order to better understand how to provide redundancy and resilience. Consider the April 2011 Amazon EC2 outage.

Configuration is code, though. It’s just instructions for computers. Therefore, it benefits from being treated like code.

The easy implication is that we should be storing our configuration files in source control. Everyone sane does that already, so that was pretty easy. We should also be acting like we’re coders. This is harder. But do it: use code review on every configuration checkin, even if it’s just someone else eyeballing it.

Taking a step back, configuration should undergo the same kind of promotion process you’d expect from any development team. You want to be performing continuous builds. For operations, this means that every time someone modifies a configuration file, that file gets pushed into a test environment and an automated process makes sure that the device being configured still runs.

Scrum, Sometimes

Every game company I’ve worked for has used some form of Scrum methodology for development. Online game operations groups should get on the bandwagon. Carefully.

If you try to apply Scrum to everything you do in operations, you’ll find out pretty quickly that Scrum is designed for projects, and a lot of operations work is maintenance. Don’t try to use Scrum there. If something breaks, do not write a user story about how “As a user, I would like gameserver01.bos.gamecompany.com to be functional.”

Instead, treat it as a bug. Use your existing bug system if possible. This is convenient because it’s pretty easy to explain priority A bugs to your producers, and they’ll understand why it’s affecting your velocity. This also subtly implies that QA should be involved in verifying the fixes. That’s not an accident.

Also, identify tasks that can’t be decomposed into sprint-sized chunks and think about applying Kanban techniques there. Clinton Keith threw out Kanban as the appropriate agile operations technique in a seminar I took once. It’s not the be all and end all — that’s a misapprehension caused by not enough communication between development and operations — but it’s not bad if you’re doing a lot of repetitive tasks.

Finally, cross-fertilize your scrums. I don’t think we really want an operations scrum; rather, we want a datacenter scrum, which is primarily focused on operational tasks. We want a security scrum, which needs both operational security guys and developers. We want operations people sitting on most if not all engineering-oriented scrums. Scrum is a great fit for devops because it’s oriented towards projects, not functional groups.

Testing

You don’t want developers writing code without QA involved. The same applies to system administrators.

Load testing, security monitoring, and functionality monitoring are part of good data center infrastructure. They should be part of the development and build process as well.

Squish those two together and you wind up wanting to do full stack testing as early on as possible. Testing in a simulation of a live environment — here’s where cloud technologies come in — is really useful. This makes QA involvement a natural part of the process; you want to bring the traditional code branch into an integration environment with the ops configuration branch as soon as possible. The idea is to find out if your wacky configuration ideas break the carefully crafted software earlier rather than later.

Since you’re filing infrastructure issues as bugs, you already need QA to understand how to find and report bugs in your stuff, so that all works out rather well.

Next Up

I want to dig further into each of those techniques, and I also want to talk a bit about how you can get from zero to devops. I’ll probably do the latter next. Thanks for reading, particularly since it’s been a long long while.

The single biggest favor a system administrator can do herself is to make sure she knows exactly what state her servers are in at all times. Inevitably, you’re going to make a change that screws something up. It could be a configuration change, it could be pushing updated code or content, or it could be some weird random “harmless” tweak. The IT Process Institute believes that 80% of all outages are the result of changes to the system. If you don’t know what changes you’ve made, troubleshooting becomes much more difficult.

Thus, one of the fundamental underpinnings of any Ops organization is a change management system. You always have one.

Way too often, particularly in small shops, your change management system is your memory. This is the bit where something breaks and you sit there and you go “Oh, wait, I pushed out a new version of the kernel to that server on Monday.” Personally, I find I can keep a pretty good model of the state of ten servers in my head. “Pretty good” is nowhere near good enough. Still, it’s a very tempting model if you’re the kind of guy who likes coming up with snap answers in the middle of a crisis. Dirty secret: lots of us Ops guys are like that.

It’s better if you have a real record. There are a ton of ways to do this. One of the advantages to many commercial ticketing systems is integrated change management — you can tie a ticket to a server so that you can look up the list of changes to that server quickly and easily. Wikis work fairly well if there’s a real commitment to keeping them updated. Email lists are harder to search but they’re at least something.

Any change that can be managed as part of a source control system probably ought to be. This includes most configuration files, and you can do a lot with configuration files. There’ll be a followup post to this one talking about configuration management — a tightly related subject — and that’ll shed some light on those possibilities. Once something’s managed via source control, you can not only easily look up changes, you can revert to old versions quickly.

The other side of this is troubleshooting. When something breaks, if you have good change management, you can start with the most recent change, think about whether or not it’s potentially related, and back it out if necessary. Having an ordered list of all changes for the server or system is crucial; this is why something more formalized than a mailing list or a bunch of IRC logs is good.

The non-technical side of this is controlling change. I tend to come down strongly on the side of slow, measured changes, particularly in the MMO environment. MMOs have two interesting qualities that affect my thinking in this regard: first, the pace of development is relatively quick. We’re already pushing code at a rate that would startle more traditional shops. (Although to be fair, we’re slower than the more agile-oriented Web 2.0 sites.) Second, the face we present to the world is relatively monolithic. We do not have easy access to the same techniques Web sites use to test changes on small groups of users, for example. Games could be built with that potential in mind, but there are social issues involved with giving content to some customers early.

Because of the first quality, there’s a tendency to want to move very quickly. Network Operations is a balancing point. It’s bad to push back for the sake of pushing back, but it’s good to remember that you’re allowed to provide input into the push process. Since you own the concrete task of moving the bits onto the servers, you’re the last people who can say no, which is a meaningful responsibility.

Because of the second quality, problems tend to be more visible to the customers than they would be if you could roll them out gently.

There’s also a general reason for going slowly: not all change-generated outages manifest immediately. Sometimes you’ll make a change and the outage won’t occur for days. Easy example: changing logging to be more verbose. If it’s lots more verbose, you may run out of disk space, but it might take four days for that to happen. When you’re walking down the list of changes, the more changes that have happened between then and now, the harder it’ll be to pinpoint the root cause. Your search space is bigger.

And yeah, that one is a trivial example that wouldn’t be so hard to troubleshoot in practice. The principle is solid, however.

All that said, I’ll let the Agile Operations/Devops guys make the case for more speedy deployments. Especially since I think they’re making huge strides in techniques that I badly want to adopt. (This presentation deserves its own post.)

Let us assume, in any case, that we’re looking to control change. The process side of change management is the methods by which you discuss, approve, and schedule changes. There is no right way to do this, although there are wrong ways. If your Net Ops group is getting a code push and only finds out what’s in the push at that point, that’d be a wrong way.

One right way: have a periodic change control meeting. Invite Net Ops, Engineering, QA, Customer Support, Community Relations, Release Management, and producers. Some of those may overlap. Also, see if you can get Finance to show up. It’s nice to have someone who can authoritatively talk about how much money the company loses if the servers are down for a day.

Go over the list of proposed changes for the next change window. Rate ’em on importance and risk. Decide which ones you’re doing. The answer is probably all of them, but the question shouldn’t be a mere formality. If there are blockers which can be removed before the change window, schedule a followup to confirm that they’ve been removed.

Make changes! That’s really the fun part.

The periodicity of the meeting is subject to discussion. I’ve done it as a weekly meeting, I’ve done it as a monthly pre-update meeting, and I think it might be useful to do it as a daily meeting. If your company is into agile development, and does daily standups, it makes sense to fit it into that structure. You also need to be willing to do emergency meetings, because this structure will fall apart if it doesn’t have flexibility built in. There are going to be emergency changes. I like to account for that possibility.

That’s change management in a nutshell. Coming soon, some discussion of tools.

Attention conservation notice: this doesn’t have much to do with MMOs.

[Image: a JavaStation-10, aka JavaStation-NC]
A decade or so ago, I worked on Sun’s internal JavaStation deployment project. Our mandate was to deploy a few thousand of these things throughout Sun as replacement desktops. It was a lot of fun, I learned a lot, and I got to travel some. I think that was my first business travel, in fact.

The retail cost of a JavaStation-10 was around $750, if I recall correctly. It was a great concept, because it was a great task station. You could do email, calendaring, and Web browsing. The OS was slower than you’d like, but I found it perfectly reasonable unless you were asking it to do X-Windows via Citrix or something like that. Of course, you had to have a big chunky server in the data center to serve up the OS at boot time, and the hardware didn’t include the mythical JavaChip co-processor, so complex apps could be a bit poky.

A decade later, I have an iPad sitting on my desk. It ran me around $750, but I could have gotten one at two-thirds the price. It does email, calendaring, and Web browsing, plus a ton more. The OS is fast. This would make a really good task station, especially if you housed it in a different casing and slapped on a keyboard. I have no reason to think that Apple is aiming in that direction, but I won’t be surprised if we see iOS devices designed for single-purpose workstations.

Those early ideas just get better and better as the technology catches up, huh?

In a vain attempt to tie this back to the blog purpose: this thing is going to be an amazing admin tool. If I’m in the datacenter, it’ll be very useful to have a separate screen handy. I can’t count the number of times I’ve been looking at a server on a KVM and I’ve wanted to have a second monitor available for watching logs and so forth. I have an ssh client, and I have a client that does both RDP and VNC, which is key for a multi-platform admin. If I was still working in Silicon Valley back in the stupid money days, I’d try to sell my boss on buying these for my team as core tools. Cheaper than an on-call laptop.

Lydia Leong has a great post about the question of speedy provisioning. As she says, the exciting bit about getting new hardware in place isn’t the OS and software installation. Even if you’re not virtualized, you can install a new server unattended in a couple of hours. You want to be able to do that even if you never expect to grow, because you need to be able to rebuild servers quickly if they die. This isn’t hard to manage. In a pinch, people will sell you solutions and you can get a consultant in, but it’s easier to just plan ahead.

She hits on the internal aspect: getting someone to sign off on the new servers. If we’re talking about the need to buy more capacity on short notice in our industry, we’re probably talking about launch, which means this problem isn’t so bad for us. But you’ve got to get the ducks lined up in advance. You don’t want to shock your CEO with an order on the third day of launch; she’s worrying about other stuff. Better to get the plan in writing way in advance, along with executive buyoff. Then you can tell the appropriate people you need ten more shards, get the documents signed, and get your vendor moving.

I think that’s a bit trickier than Lydia says, but I also think she’s talking about onesies/twosies. Buying one server is easy, as she notes. Buying a hundred servers for serious expansion is going to take a bit longer, because Dell and HP and IBM hate keeping too much backstock around, so they’re going to have to build those servers for you.

You can alleviate this, of course. First tactic is to let them know it’s coming. None of those companies are going to increase their inventory just for the sake of your possible buy, because you’re too small, unless you’re Blizzard. However, you can and should get some commitments around response time. You can also, and I think this is more important, find out what’s going to ship the fastest. There’s no reason why you shouldn’t take that as an input to your hardware decision matrix. If all else is equal, go with the servers that generally have the largest inventory. Or ask questions about factories: can your vendor literally build 1U servers faster than blades?

Also, make sure the vendor order process is just as quick. As with all vendors, you want your hardware sales people to be on call during the two weeks around launch. Midnight calls are very unlikely; weekend calls are more probable.

Finally, figure out how you’re going to rack and stack a hundred servers quickly. Could be your vendor’s professional services, could be some local contractor. Even if your internal staff racked the rest of the servers, it’s better not to ask them to spend that time in the machine room during launch, because you’re going to be doing a lot of other things.

None of this is really all that hard; it’s just a great example of one of the many rows you need to have your ducks in. It’s not difficult, it’s merely a matter of planning ahead.

CCP Yokai, the Technical Director over in EVE-land, just posted a dev blog about their new rack setup. This is pretty rare insight for any operation, so it’s definitely worth reading. You don’t get the nitty-gritty details but you get a good overview.

They’re located in 12 cabinets. That apparently covers their single server, their test server, and ancillary services. If you don’t know, EVE is a single-shard setup, which is really technically impressive: they crowd all 50-60K concurrent players into one world. That’s one big reason why network connectivity is so important to them; Yokai mentions it a few times in the blog. That’s a very high quality network he’s got set up, probably because most of those 64 servers may need to talk to any of the others. Compare that to an infrastructure where 10 servers make up a shard: I can’t know for sure, but it sure seems like a single shard has to be ready for far more interconnections.

He’s using blades, and the blades have a lot of RAM. IBM makes a really solid blade, by the way. The HS21 is I think one generation old; they’re currently selling the HS22s in that price/performance spot, but once you’ve bought a bunch of blades you don’t upgrade unless you need to. The interesting thing to me is the amount of RAM they’ve got in each blade. 32GB is a fair bit. I don’t want to speculate too much but CCP has never been shy about smart ways to use the fastest possible resources, and RAM is fast. See also that big 2 terabyte SSD SAN (storage area network) he mentions.

Lots of blades means lots of heat. I am not surprised that they need a self-contained cooling system. I should talk some about the blade vs. 1U server question, since while blades take up less physical rack space, the power and cooling they demand mean the savings may not amount to much in practice. On the other hand, as noted, CCP needs the fast interconnects. Blades do help there.

Don’t miss the comment thread, either. The devs are again being very open about some of their choices, which is awfully nice of them.

Yep, it still works. Neat.

I’m a year or so into the new job, which still rocks, and for various and sundry reasons it’s a good time to start blogging about geeky MMO operations stuff again. Excellent. Quick note on policy, here: I’m not going to talk about where I am and I’m going to steer clear of talking about what we’re doing, because we’re not ready to talk about it yet and because I am explicitly and emphatically not speaking for the company in any way, shape or form. I’m not keeping my employer a secret — y’all know how to use LinkedIn, right? — but I’m keeping some separation in order to, well, keep some separation.

Mostly I’m just jazzed to be talking about the stuff I love again.