The single biggest favor a system administrator can do herself is to make sure she knows exactly what state her servers are in at all times. Inevitably, you’re going to make a change that screws something up. It could be a configuration change, it could be pushing updated code or content, or it could be some weird random “harmless” tweak. The IT Process Institute believes that 80% of all outages are the result of changes to the system. If you don’t know what changes you’ve made, troubleshooting becomes much more difficult.
Thus, one of the fundamental underpinnings of any Ops organization is a change management system. You always have one.
Way too often, particularly in small shops, your change management system is your memory. This is the bit where something breaks and you sit there and you go “Oh, wait, I pushed out a new version of the kernel to that server on Monday.” Personally, I find I can keep a pretty good model of the state of ten servers in my head. “Pretty good” is nowhere near good enough. Still, it’s a very tempting model if you’re the kind of guy who likes coming up with snap answers in the middle of a crisis. Dirty secret: lots of us Ops guys are like that.
It’s better if you have a real record. There are a ton of ways to do this. One of the advantages to many commercial ticketing systems is integrated change management — you can tie a ticket to a server so that you can look up the list of changes to that server quickly and easily. Wikis work fairly well if there’s a real commitment to keeping them updated. Email lists are harder to search but they’re at least something.
Any change that can be managed through source control probably ought to be. That covers most configuration files, and you can do a lot with configuration files. There'll be a followup post to this one on configuration management, a tightly related subject, which will shed more light on those possibilities. Once something's managed via source control, you can not only look up changes easily, you can also revert to old versions quickly.
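As a rough illustration, here's a minimal sketch of that workflow using git, assuming a hypothetical config directory that's already a git repository; a real setup would more likely lean on a purpose-built tool or the configuration management systems the followup post will cover.

```python
import subprocess

CONFIG_REPO = "/etc/myapp"  # hypothetical: a config directory kept under git


def run_git(*args):
    """Run a git command inside the config repo and return its output."""
    return subprocess.run(
        ["git", "-C", CONFIG_REPO, *args],
        check=True, capture_output=True, text=True,
    ).stdout


def record_change(path, message):
    """Commit a single config file change with a descriptive message."""
    run_git("add", path)
    run_git("commit", "-m", message)


def recent_changes(n=10):
    """Look up the last n changes -- your ordered change history."""
    return run_git("log", f"-{n}", "--oneline")


def revert_last_change():
    """Back out the most recent change quickly."""
    run_git("revert", "--no-edit", "HEAD")
```

The wrapper itself isn't the point; the point is that every change becomes a commit you can list, diff, and revert.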
The other side of this is troubleshooting. When something breaks, if you have good change management, you can start with the most recent change, think about whether or not it’s potentially related, and back it out if necessary. Having an ordered list of all changes for the server or system is crucial; this is why something more formalized than a mailing list or a bunch of IRC logs is good.
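To make that concrete, here's a small sketch of the kind of query you want to be able to run, with entirely made-up server names and changes. Whatever system you use, it should be able to answer "what changed on this box, most recent first?" without anyone having to rely on memory.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Change:
    server: str
    timestamp: datetime
    description: str


# Hypothetical change log; in practice this comes from your ticketing
# system, wiki, or source control history.
changelog = [
    Change("game-db-01", datetime(2010, 3, 1, 14, 0), "raised db buffer pool size"),
    Change("game-web-03", datetime(2010, 3, 2, 9, 30), "pushed build 1482"),
    Change("game-db-01", datetime(2010, 3, 4, 11, 15), "enabled verbose query logging"),
]


def changes_for(server):
    """Walk the changes for one server, most recent first, so you can ask
    'could this one have caused the outage?' and back it out if so."""
    return sorted(
        (c for c in changelog if c.server == server),
        key=lambda c: c.timestamp,
        reverse=True,
    )


for change in changes_for("game-db-01"):
    print(change.timestamp, change.description)
```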
The non-technical side of this is controlling change. I tend to come down strongly on the side of slow, measured changes, particularly in the MMO environment. MMOs have two interesting qualities that affect my thinking in this regard: first, the pace of development is relatively quick. We’re already pushing code at a rate that would startle more traditional shops. (Although to be fair, we’re slower than the more agile-oriented Web 2.0 sites.) Second, the face we present to the world is relatively monolithic. We do not have easy access to the same techniques Web sites use to test changes on small groups of users, for example. Games could be built with that potential in mind, but there are social issues involved with giving content to some customers early.
Because of the first quality, there’s a tendency to want to move very quickly. Network Operations is a balancing point. It’s bad to push back for the sake of pushing back, but it’s good to remember that you’re allowed to provide input into the push process. Since you own the concrete task of moving the bits onto the servers, you’re the last people who can say no, which is a meaningful responsibility.
Because of the second quality, problems tend to be more visible to customers than they would be if you could roll changes out gradually.
There’s also a general reason for going slowly. Not all change-generated outages manifest immediately. Sometimes you’ll make a change and the outage won’t occur for days. Easy example: changing logging to be more verbose. If it’s a lot more verbose, you may run out of disk space, but it might take four days for that to happen. When you’re walking back down the list of changes, the more changes that landed between the culprit and now, the harder it is to pinpoint the root cause. Your search space is bigger.
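To put made-up numbers on that example, assuming the old log volume was being rotated away and disk usage was holding steady before the change:

```python
# Made-up numbers; the point is that the outage lags the change.
disk_free_gb = 40        # free space when verbose logging was enabled
old_log_gb_per_day = 2   # normal volume, rotated away, disk holding steady
new_log_gb_per_day = 12  # volume after the "harmless" verbosity tweak

extra_gb_per_day = new_log_gb_per_day - old_log_gb_per_day
days_until_full = disk_free_gb / extra_gb_per_day

print(f"Disk fills roughly {days_until_full:.0f} days after the change")  # ~4 days
```

Four days and a dozen pushes later, the verbose-logging change is nowhere near the top of anyone's list of suspects.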
And yeah, that one is a trivial example that wouldn’t be so hard to troubleshoot in practice. The principle is solid, however.
All that said, I’ll let the Agile Operations/Devops guys make the case for speedier deployments, especially since I think they’re making huge strides in techniques that I badly want to adopt. (This presentation deserves its own post.)
Let us assume, in any case, that we’re looking to control change. The process side of change management consists of the methods by which you discuss, approve, and schedule changes. There is no one right way to do this, although there are wrong ways. If your Net Ops group receives a code push and only finds out what’s in it at that point, that’s a wrong way.
One right way: have a periodic change control meeting. Invite Net Ops, Engineering, QA, Customer Support, Community Relations, Release Management, and producers. Some of those may overlap. Also, see if you can get Finance to show up. It’s nice to have someone who can authoritatively talk about how much money the company loses if the servers are down for a day.
Go over the list of proposed changes for the next change window. Rate ’em on importance and risk. Decide which ones you’re doing. The answer is probably all of them, but the question shouldn’t be a mere formality. If there are blockers which can be removed before the change window, schedule a followup to confirm that they’ve been removed.
Make changes! That’s really the fun part.
The periodicity of the meeting is subject to discussion. I’ve done it as a weekly meeting, I’ve done it as a monthly pre-update meeting, and I think it might be useful to do it as a daily meeting. If your company is into agile development and does daily standups, it makes sense to fit the change discussion into that structure. You also need to be willing to hold emergency meetings, because this structure will fall apart if it doesn’t have flexibility built in. There are going to be emergency changes. I like to account for that possibility.
That’s change management in a nutshell. Coming soon, some discussion of tools.