The single biggest favor a system administrator can do herself is to make sure she knows exactly what state her servers are in at all times. Inevitably, you’re going to make a change that screws something up. It could be a configuration change, it could be pushing updated code or content, or it could be some weird random “harmless” tweak. The IT Process Institute believes that 80% of all outages are the result of changes to the system. If you don’t know what changes you’ve made, troubleshooting becomes much more difficult.

Thus, one of the fundamental underpinnings of any Ops organization is a change management system. You always have one.

Way too often, particularly in small shops, your change management system is your memory. This is the bit where something breaks and you sit there and you go “Oh, wait, I pushed out a new version of the kernel to that server on Monday.” Personally, I find I can keep a pretty good model of the state of ten servers in my head. “Pretty good” is nowhere near good enough. Still, it’s a very tempting model if you’re the kind of guy who likes coming up with snap answers in the middle of a crisis. Dirty secret: lots of us Ops guys are like that.

It’s better if you have a real record. There are a ton of ways to do this. One of the advantages to many commercial ticketing systems is integrated change management — you can tie a ticket to a server so that you can look up the list of changes to that server quickly and easily. Wikis work fairly well if there’s a real commitment to keeping them updated. Email lists are harder to search but they’re at least something.

Any change that can be managed as part of a source control system probably ought to be. This includes most configuration files, and you can do a lot with configuration files. There’ll be a followup post to this one talking about configuration management — a tightly related subject — and that’ll shed some light on those possibilities. Once something’s managed via source control, you can not only look up changes easily but also revert to old versions quickly.
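
To make that concrete, here’s a minimal sketch of the idea, assuming git and a config tree that has already been turned into a repository; the directory and function names are just illustrative, and a tool like etckeeper does this job far more thoroughly:

```python
#!/usr/bin/env python3
"""Snapshot a config tree into git so every change is recorded and revertable.

A minimal sketch: assumes git is installed and CONFIG_DIR is already a
git repository. Tools like etckeeper do this properly; this just shows
the idea.
"""
import subprocess
import sys
from datetime import datetime, timezone

CONFIG_DIR = "/etc"  # illustrative path; point it at whatever you version


def record_change(description: str) -> None:
    """Commit the current state of CONFIG_DIR with a human-readable message.

    Raises CalledProcessError if there is nothing to commit.
    """
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    subprocess.run(["git", "-C", CONFIG_DIR, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", CONFIG_DIR, "commit", "-m", f"{stamp}: {description}"],
        check=True,
    )


def show_history(limit: int = 20) -> None:
    """Answer the 'what did I change on this box?' question."""
    subprocess.run(
        ["git", "-C", CONFIG_DIR, "log", "--oneline", f"-{limit}"], check=True
    )


def revert_file(path: str, revision: str) -> None:
    """Put one file back the way it was at an earlier commit."""
    subprocess.run(
        ["git", "-C", CONFIG_DIR, "checkout", revision, "--", path], check=True
    )


if __name__ == "__main__":
    record_change(" ".join(sys.argv[1:]) or "unlabeled change")
    show_history()
```

Run it with a one-line description after every hand edit and the ordered history comes for free.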

The other side of this is troubleshooting. When something breaks, if you have good change management, you can start with the most recent change, think about whether or not it’s potentially related, and back it out if necessary. Having an ordered list of all changes for the server or system is crucial; this is why something more formalized than a mailing list or a bunch of IRC logs is good.
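
If a ticketing system with integrated change management isn’t in the cards, even a dumb append-only log gives you that ordered list. A bare-bones sketch; the log location and field names are placeholders, not any real tool:

```python
#!/usr/bin/env python3
"""Append-only change log: one JSON record per change, newest at the end."""
import json
import time
from pathlib import Path

LOG_PATH = Path("/var/log/change-log.jsonl")  # hypothetical location


def log_change(host: str, change: str, who: str) -> None:
    """Record one change with a timestamp so the list stays ordered."""
    entry = {"ts": time.time(), "host": host, "change": change, "who": who}
    with LOG_PATH.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")


def recent_changes(host: str, limit: int = 10) -> list[dict]:
    """The troubleshooting view: latest changes to one host, newest first."""
    entries = [json.loads(line) for line in LOG_PATH.read_text().splitlines() if line]
    mine = [e for e in entries if e["host"] == host]
    return sorted(mine, key=lambda e: e["ts"], reverse=True)[:limit]


if __name__ == "__main__":
    log_change("shard03-web1", "bumped kernel to latest errata build", "me")
    for entry in recent_changes("shard03-web1"):
        print(time.ctime(entry["ts"]), entry["change"])
```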

The non-technical side of this is controlling change. I tend to come down strongly on the side of slow, measured changes, particularly in the MMO environment. MMOs have two interesting qualities that affect my thinking in this regard: first, the pace of development is relatively quick. We’re already pushing code at a rate that would startle more traditional shops. (Although to be fair, we’re slower than the more agile-oriented Web 2.0 sites.) Second, the face we present to the world is relatively monolithic. We do not have easy access to the same techniques Web sites use to test changes on small groups of users, for example. Games could be built with that potential in mind, but there are social issues involved with giving content to some customers early.

Because of the first quality, there’s a tendency to want to move very quickly. Network Operations is a balancing point. It’s bad to push back for the sake of pushing back, but it’s good to remember that you’re allowed to provide input into the push process. Since you own the concrete task of moving the bits onto the servers, you’re the last people who can say no, which is a meaningful responsibility.

Because of the second quality, problems tend to be more visible to the customers than they would be if you could roll them out gently.

There’s also a general reason for going slowly. Not all change-generated outages manifest immediately. Sometimes you’ll make a change and the outage won’t occur for days. Easy example: changing logging to be more verbose. If it’s lots more verbose, you may run out of disk space, but it might take four days for that to happen. When you’re walking back down the list of changes, the more changes there are between now and then, the harder it’ll be to pinpoint the root cause. Your search space is bigger.
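
To put numbers on that example: the headroom is just free space divided by the new growth rate. A throwaway sketch with made-up figures:

```python
"""How long until a chattier log fills the disk? Numbers are invented."""
import shutil


def days_until_full(path: str, log_growth_gb_per_day: float) -> float:
    """Free space on the filesystem holding `path`, divided by the growth rate."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb / log_growth_gb_per_day


if __name__ == "__main__":
    # Say debug logging now writes ~10 GB/day and the log volume has ~40 GB
    # free: that's roughly four days of headroom, and the outage lands long
    # after everyone has stopped thinking about the change.
    print(f"{days_until_full('/var/log', 10.0):.1f} days until the disk is full")
```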

And yeah, that one is a trivial example that wouldn’t be so hard to troubleshoot in practice. The principle is solid, however.

All that said, I’ll let the Agile Operations/Devops guys make the case for more speedy deployments. Especially since I think they’re making huge strides in techniques that I badly want to adopt. (This presentation deserves its own post.)

Let us assume, in any case, that we’re looking to control change. The process side of change management is the set of methods by which you discuss, approve, and schedule changes. There is no single right way to do this, although there are wrong ways. If your Net Ops group only finds out what’s in a code push when the push arrives, that’d be a wrong way.

One right way: have a periodic change control meeting. Invite Net Ops, Engineering, QA, Customer Support, Community Relations, Release Management, and producers. Some of those may overlap. Also, see if you can get Finance to show up. It’s nice to have someone who can authoritatively talk about how much money the company loses if the servers are down for a day.

Go over the list of proposed changes for the next change window. Rate ’em on importance and risk. Decide which ones you’re doing. The answer is probably all of them, but the question shouldn’t be a mere formality. If there are blockers which can be removed before the change window, schedule a followup to confirm that they’ve been removed.

Make changes! That’s really the fun part.

The periodicity of the meeting is subject to discussion. I’ve done it as a weekly meeting, I’ve done it as a monthly pre-update meeting, and I think it might be useful to do it as a daily meeting. If your company is into agile development, and does daily standups, it makes sense to fit it into that structure. You also need to be willing to do emergency meetings, because this structure will fall apart if it doesn’t have flexibility built in. There are going to be emergency changes. I like to account for that possibility.

That’s change management in a nutshell. Coming soon, some discussion of tools.

Attention conservation notice: this doesn’t have much to do with MMOs.

[Image: a JavaStation-10, aka JavaStation-NC.]
A decade or so ago, I worked on Sun’s internal JavaStation deployment project. Our mandate was to deploy a few thousand of these things throughout Sun as replacement desktops. It was a lot of fun, I learned a lot, and I got to travel some. I think that was my first business travel, in fact.

The retail cost of a JavaStation-10 was around $750, if I recall correctly. It was a great concept, because it was a great task station. You could do email, calendaring, and Web browsing. The OS was slower than you’d like, but I found it perfectly reasonable unless you were asking it to do X-Windows via Citrix or something like that. Of course, you had to have a big chunky server in the data center to serve up the OS at boot time, and the hardware didn’t include the mythical JavaChip co-processor, so complex apps could be a bit poky.

A decade later, I have an iPad sitting on my desk. It ran me around $750, but I could have gotten one at two-thirds the price. It does email, calendaring, and Web browsing, plus a ton more. The OS is fast. This would make a really good task station, especially if you housed it in a different casing and slapped on a keyboard. I have no reason to think that Apple is aiming in that direction, but I won’t be surprised if we see iOS devices designed for single-purpose workstations.

Those early ideas just get better and better as the technology catches up, huh?

In a vain attempt to tie this back to the blog purpose: this thing is going to be an amazing admin tool. If I’m in the datacenter, it’ll be very useful to have a separate screen handy. I can’t count the number of times I’ve been looking at a server on a KVM and wanted a second monitor available for watching logs and so forth. I have an ssh client, and I have a client that does both RDP and VNC, which is key for a multi-platform admin. If I were still working in Silicon Valley back in the stupid money days, I’d try to sell my boss on buying these for my team as core tools. Cheaper than an on-call laptop.

Lydia Leong has a great post about the question of speedy provisioning. As she says, the exciting bit about getting new hardware in place isn’t the OS and software installation. Even if you’re not virtualized, you can install a new server unattended in a couple of hours. You want to be able to do that even if you never expect to grow, because you need to be able to rebuild servers quickly if they die. This isn’t hard to manage. In a pinch, people will sell you solutions and you can get a consultant in, but it’s easier to just plan ahead.
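
On Linux, “unattended in a couple of hours” usually means PXE boot plus an answer file of some kind. Here’s a rough sketch of the shape of it, assuming a Red Hat-style kickstart and an install server that serves files out of /srv/kickstart; the hostnames, hashes, paths, and mirror URL are all hypothetical, and a preseed, AutoYaST, or image-based setup would differ in the details:

```python
#!/usr/bin/env python3
"""Render per-host kickstart files so a dead server can be rebuilt unattended.

Everything here is illustrative: the inventory, the output directory, and
the (deliberately minimal) kickstart template.
"""
from pathlib import Path

KICKSTART_TEMPLATE = """\
# Minimal, illustrative kickstart; a real one carries far more detail.
text
url --url=http://install.example.com/os/
network --bootproto=dhcp --hostname={hostname}
rootpw --iscrypted {root_hash}
clearpart --all --initlabel
autopart
%packages
@core
%end
"""

HOSTS = {  # hypothetical inventory; in practice this comes from your CMDB
    "shard01-db1": {"root_hash": "$6$replaceme$..."},
    "shard01-web1": {"root_hash": "$6$replaceme$..."},
}

OUTPUT_DIR = Path("/srv/kickstart")  # assumed to be exported by the install server


def render(hostname: str) -> Path:
    """Write the kickstart for one host and return its path."""
    body = KICKSTART_TEMPLATE.format(hostname=hostname, **HOSTS[hostname])
    out = OUTPUT_DIR / f"{hostname}.ks"
    out.write_text(body)
    return out


if __name__ == "__main__":
    for host in HOSTS:
        print("wrote", render(host))
```

The PXE side then only needs to hand the installer the right kickstart URL, and a dead server becomes a reboot-and-wait problem instead of an afternoon of hand-typing.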

She hits on the internal aspect: getting someone to sign off on the new servers. If we’re talking about the need to buy more capacity on short notice in our industry, we’re probably talking about launch, which means this problem isn’t so bad for us. But you’ve got to get the ducks lined up in advance. You don’t want to shock your CEO with an order on the third day of launch; she’s worrying about other stuff. Better to get the plan in writing way in advance, along with executive buyoff. Then you can tell the appropriate people you need ten more shards, get the documents signed, and get your vendor moving.

I think that’s a bit trickier than Lydia says, but I also think she’s talking about onesies/twosies. Buying one server is easy, as she notes. Buying a hundred servers for serious expansion is going to take a bit longer, because Dell and HP and IBM hate keeping too much backstock around, so they’re going to have to build those servers for you.

You can alleviate this, of course. First tactic is to let them know it’s coming. None of those companies are going to increase their inventory just for the sake of your possible buy, because you’re too small, unless you’re Blizzard. However, you can and should get some commitments around response time. You can also, and I think this is more important, find out what’s going to ship the fastest. There’s no reason why you shouldn’t take that as an input to your hardware decision matrix. If all else is equal, go with the servers that generally have the largest inventory. Or ask questions about factories: can your vendor literally build 1U servers faster than blades?

Also, make sure the vendor order process is just as quick. As with all vendors, you want your hardware sales people to be on call during the two weeks around launch. Midnight calls are very unlikely; weekend calls are more probable.

Finally, figure out how you’re going to rack and stack a hundred servers quickly. Could be your vendor’s professional services, could be some local contractor. Even if your internal staff racked the rest of the servers, it’s better not to ask them to spend the time in the machine room during launch, because they’re going to be doing a lot of other things.

None of this is really all that hard; it’s just a great example of one of the many rows you need to have your ducks in.

CCP Yokai, the Technical Director over in EVE-land, just posted a dev blog about their new rack setup. This is pretty rare insight for any operation, so it’s definitely worth reading. You don’t get the nitty-gritty details but you get a good overview.

They’re located in 12 cabinets. That apparently covers their single server, their test server, and ancillary services. If you don’t know, EVE is a single-shard setup, which is really technically impressive: they crowd all 50-60K concurrent players into one world. That’s one big reason why network connectivity is so important to them; Yokai mentions it a few times in the blog. That’s a very high quality network he’s got set up, probably because most of those 64 servers may need to talk to any of the others. Compare that to an infrastructure where ten servers make up a shard. I can’t know for sure, but it sure seems like the single-shard design has to be ready for a lot more interconnections.
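
Back-of-the-envelope, under the worst-case assumption that any server might need a path to any other, the number of potential server-to-server pairs grows as n(n-1)/2:

```python
# Potential server-to-server pairs if any server may talk to any other.
def pairs(n: int) -> int:
    return n * (n - 1) // 2


for n in (10, 64):
    print(f"{n} servers -> {pairs(n)} potential interconnections")
# 10 servers -> 45 potential interconnections
# 64 servers -> 2016 potential interconnections
```

Forty-five pairs versus two thousand is the kind of gap that justifies spending real money on the network.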

He’s using blades, and the blades have a lot of RAM. IBM makes a really solid blade, by the way. The HS21 is I think one generation old; they’re currently selling the HS22s in that price/performance spot, but once you’ve bought a bunch of blades you don’t upgrade unless you need to. The interesting thing to me is the amount of RAM they’ve got in each blade. 32GB is a fair bit. I don’t want to speculate too much but CCP has never been shy about smart ways to use the fastest possible resources, and RAM is fast. See also that big 2 terabyte SSD SAN (storage area network) he mentions.

Lots of blades means lots of heat. I am not surprised that they need a self-contained cooling system. I should talk some about the blade vs. 1U server question, since while blades do take up less physical space, the practical space savings may not amount to much. On the other hand, as noted, CCP needs the fast interconnects. Blades do help there.

Don’t miss the comment thread, either. The devs are again being very open about some of their choices, which is awfully nice of them.

Yep, it still works. Neat.

I’m a year or so into the new job, which still rocks, and for various and sundry reasons it’s a good time to start blogging about geeky MMO operations stuff again. Excellent. Quick note on policy, here: I’m not going to talk about where I am and I’m going to steer clear of talking about what we’re doing, because we’re not ready to talk about it yet and because I am explicitly and emphatically not speaking for the company in any way, shape or form. I’m not keeping my employer a secret — y’all know how to use LinkedIn, right? — but I’m keeping some separation in order to, well, keep some separation.

Mostly I’m just jazzed to be talking about the stuff I love again.