The single biggest favor a system administrator can do for herself is to make sure she knows exactly what state her servers are in at all times. Inevitably, you’re going to make a change that screws something up. It could be a configuration change, it could be pushing updated code or content, or it could be some weird random “harmless” tweak. The IT Process Institute believes that 80% of all outages are the result of changes to the system. If you don’t know what changes you’ve made, troubleshooting becomes much more difficult.

Thus, one of the fundamental underpinnings of any Ops organization is a change management system. You always have one.

Way too often, particularly in small shops, your change management system is your memory. This is the bit where something breaks and you sit there and you go “Oh, wait, I pushed out a new version of the kernel to that server on Monday.” Personally, I find I can keep a pretty good model of the state of ten servers in my head. “Pretty good” is nowhere near good enough. Still, it’s a very tempting model if you’re the kind of guy who likes coming up with snap answers in the middle of a crisis. Dirty secret: lots of us Ops guys are like that.

It’s better if you have a real record. There are a ton of ways to do this. One of the advantages to many commercial ticketing systems is integrated change management — you can tie a ticket to a server so that you can look up the list of changes to that server quickly and easily. Wikis work fairly well if there’s a real commitment to keeping them updated. Email lists are harder to search but they’re at least something.

Any change that can be managed as part of a source control system probably ought to be. This includes most configuration files, and you can do a lot with configuration files. There’ll be a followup post to this one talking about configuration management — a tightly related subject — and that’ll shed some light on those possibilities. Once something’s managed via source control, you can not only look up changes easily, you can also revert to old versions quickly.

The other side of this is troubleshooting. When something breaks, if you have good change management, you can start with the most recent change, think about whether or not it’s potentially related, and back it out if necessary. Having an ordered list of all changes for the server or system is crucial; this is why something more formalized than a mailing list or a bunch of IRC logs is good.
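The ordered, per-server view is the whole point. As a sketch — the class and field names here are invented for illustration, not any particular ticketing system — the core of a change record is just this:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Change:
    server: str
    description: str
    author: str
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ChangeLog:
    """Append-only record of changes; queries return newest first."""
    def __init__(self):
        self._changes = []

    def record(self, server, description, author):
        self._changes.append(Change(server, description, author))

    def for_server(self, server):
        # Most recent change first -- the natural order for troubleshooting.
        return sorted((c for c in self._changes if c.server == server),
                      key=lambda c: c.when, reverse=True)

log = ChangeLog()
log.record("web01", "pushed kernel update", "alice")
log.record("web02", "raised log verbosity to DEBUG", "bob")
log.record("web01", "rotated TLS certificate", "alice")

for change in log.for_server("web01"):
    print(change.when.isoformat(), change.description)
```

Anything that captures who, what, where, and when — and can answer “what changed on this box, most recent first?” — clears the bar that memory and mailing lists don’t.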

The non-technical side of this is controlling change. I tend to come down strongly on the side of slow, measured changes, particularly in the MMO environment. MMOs have two interesting qualities that affect my thinking in this regard: first, the pace of development is relatively quick. We’re already pushing code at a rate that would startle more traditional shops. (Although to be fair, we’re slower than the more agile-oriented Web 2.0 sites.) Second, the face we present to the world is relatively monolithic. We do not have easy access to the same techniques Web sites use to test changes on small groups of users, for example. Games could be built with that potential in mind, but there are social issues involved with giving content to some customers early.

Because of the first quality, there’s a tendency to want to move very quickly. Network Operations is a balancing point. It’s bad to push back for the sake of pushing back, but it’s good to remember that you’re allowed to provide input into the push process. Since you own the concrete task of moving the bits onto the servers, you’re the last ones who can say no, which is a meaningful responsibility.

Because of the second quality, problems tend to be more visible to the customers than they would be if you could roll them out gently.

There’s also a general reason for going slowly. Not all change-generated outages manifest immediately. Sometimes you’ll make a change and the outage won’t occur for days. Easy example: changing logging to be more verbose. If it’s lots more verbose, you may run out of disk space, but it might take four days for that to happen. When you’re walking down the list of changes, the more changes that have happened between now and then, the harder it is to pinpoint the root cause. Your search space is bigger.
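The back-of-envelope math for that example is worth doing before the change, not four days after. The numbers here are invented for illustration:

```python
def days_until_full(free_gb, old_rate_gb_per_day, new_rate_gb_per_day):
    """Estimate runway before and after a logging change, in days."""
    return free_gb / new_rate_gb_per_day, free_gb / old_rate_gb_per_day

# Say the log partition has 40 GB free; logging used to write 0.5 GB/day
# and the verbose setting writes 10 GB/day.
after, before = days_until_full(40, 0.5, 10)
print(f"Before the change: {before:.0f} days of headroom")
print(f"After the change:  {after:.0f} days until the disk fills")
```

Eighty days of headroom quietly becomes four — long enough that nobody connects the outage to the change without a change log.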

And yeah, that one is a trivial example that wouldn’t be so hard to troubleshoot in practice. The principle is solid, however.

All that said, I’ll let the Agile Operations/Devops guys make the case for more speedy deployments. Especially since I think they’re making huge strides in techniques that I badly want to adopt. (This presentation deserves its own post.)

Let us assume, in any case, that we’re looking to control change. The process side of change management is the methods by which you discuss, approve, and schedule changes. There is no right way to do this, although there are wrong ways. If your Net Ops group is getting a code push and only finds out what’s in the push at that point, that’d be a wrong way.

One right way: have a periodic change control meeting. Invite Net Ops, Engineering, QA, Customer Support, Community Relations, Release Management, and producers. Some of those may overlap. Also, see if you can get Finance to show up. It’s nice to have someone who can authoritatively talk about how much money the company loses if the servers are down for a day.

Go over the list of proposed changes for the next change window. Rate ’em on importance and risk. Decide which ones you’re doing. The answer is probably all of them, but the question shouldn’t be a mere formality. If there are blockers which can be removed before the change window, schedule a followup to confirm that they’ve been removed.

Make changes! That’s really the fun part.

The periodicity of the meeting is subject to discussion. I’ve done it as a weekly meeting, I’ve done it as a monthly pre-update meeting, and I think it might be useful to do it as a daily meeting. If your company is into agile development, and does daily standups, it makes sense to fit it into that structure. You also need to be willing to do emergency meetings, because this structure will fall apart if it doesn’t have flexibility built in. There are going to be emergency changes. I like to account for that possibility.

That’s change management in a nutshell. Coming soon, some discussion of tools.

See also Part I and Part II. Take the time to read ’em, I don’t mind. I need to go make coffee anyhow.

Our third and final topic in this series is downtime. Every MMO player is used to downtime. Turbine games have downtime usually weekly. Blizzard takes World of Warcraft down most weeks. EVE Online goes down every day. It’s part of the whole MMO experience.

I believe that polish is a key part of a successful MMO. Ragged edges show and they turn off customers. This became obvious when Blizzard launched World of Warcraft and set the bar a lot higher while attracting several million new customers. I also believe that operational polish is part of that. Customer support polish is important. Tech ops polish is also important.

Downtime is not polish. Downtime is something we should avoid.

So, patching. We reboot when we patch for two reasons. One, the data files change, whether those are binary files or configuration files or databases, and the server software can’t load new data on the fly. Two, the server software itself changes.

Neither of these is entirely simple. As far as the first one goes, I’ve been known to claim that if a Web server can reload content without being restarted, a game server ought to be able to do the same. This is a misrepresentation on my part, because Web servers are stateless and game servers are exceedingly stateful. In order to solve the problem of transparent reloads for game servers, you need to figure out how you’re going to handle it when content changes while a user is accessing it.

I don’t think it’s impossible, however. My initial model would be something like Blizzard’s phasing technology, in which the same zone/area looks different depending on where you are in certain quest lines. Do the same thing, except that the phases are different content levels. You still run the risk of discontinuity: e.g., if the data for an instance changes while one person in the party is inside the instance and the others zone in afterwards, you have a party split between two instances.

Displaying a warning to the users is inelegant but does solve the problem. See also City of Heroes’ instanced zone design, where players may be in any of several versions of a given city area. I don’t have a better approach handy, and I do think that indicating the mismatch to the users is better than downtime, so that technique satisfies for now.

Any game which allows for hotfixes without the game going down already does this, of course. I can think of a couple that do it. I sort of feel like this should be the minimum target functionality for new games. I say target because unexpected issues can always arise, but it’s good to have a target nonetheless.

The second problem is trickier because it requires load balancing. Since games are stateful and require a persistent connection — or a really good simulation of one — you’re not going to be able to restart the server without affecting the people connected to it. The good news is that since we control the client/server protocol, we theoretically have the ability to play some clever tricks.

The specific trick I’d like to play is a handoff. I want to be able to tell all the clients connected to one instance of the server that they should go talk to a second instance of the server… now. Then I can take down the first instance of the server, do whatever I need to do, and reverse the process to upgrade the second instance of the server when I’m done.

Load balancing is useful for more than server upgrades: it’d be great for hardware maintenance as well. What’s more, if the client is taking that particular cue from a separate cluster of servers, you could possibly do the same thing reactively: a piece of hardware goes down? Detect the fault, and have the load balancing cluster issue a directive to go use a different server.

I snuck in the assumption that the load balancing cluster would be a cluster. I think that’s semi-reasonable. It’s one of the functions I’d be inclined to farm out to HTTP servers of some flavor, because anything that’s an HTTP server can live behind a commercial-grade load balancer that the game studio doesn’t have to write. The drawback is that the load balancing is then a pull instead of a push: the client can check to see if anything’s changed, but the servers can’t proactively tell the client anything; they have to wait for the next check.
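The pull model on the client side is simple enough to sketch. The endpoint URL and response format below are invented for illustration; the fetch function is injectable so the same logic works against whatever directory service you actually build:

```python
import json
import urllib.request

# Hypothetical directory endpoint; a real deployment would pick its own.
DIRECTORY_URL = "http://lb.example.com/current-server"

def http_fetch(url=DIRECTORY_URL):
    """Fetch the directory service's JSON payload over HTTP."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)

def check_assignment(current_server, fetch=http_fetch):
    """Ask the directory which server to use; returns the new assignment."""
    assigned = fetch()["server"]
    if assigned != current_server:
        # The client would open a connection to `assigned` here, then
        # drop the old one -- ideally fast enough to hide the swap.
        return assigned
    return current_server

# With a canned response, the client notices the reassignment:
print(check_assignment("game01", fetch=lambda: {"server": "game02"}))
```

The poll interval is the rub: too short and your directory cluster eats the load of every connected client, too long and handoffs are sluggish.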

I think I’m sort of being overly optimistic here, unfortunately. For one thing, it’s unclear that response times will be quick enough to avoid the users seeing some discontinuity. That might be tolerable, given that MMOs are relatively slack in their required response times, but I’m dubious. For another thing, the problem of maintaining state between two instances of a game server is really tough. You’d have to checkpoint the state of each individual server regularly. The length of time between checkpoints is the amount of rollback you’d be faced with from a perceptual standpoint. There’s an additional issue in that the state checkpoints would need to match the database checkpoints, or you’ll wind up with discontinuity there, which is worse. You really don’t want two servers to disagree about the state of the game.

A more realistic approach is something like what Second Life does. When they roll out new server software, they just do a rolling update. Each individual server reboots, which is a pretty quick process. The section of the game world handled by that server is inaccessible for that period of time. When it comes back up, it’s running the new code.

There’s a small paradigm shift in that idea. Linden Lab doesn’t mind if Second Life‘s world rules vary slightly between servers for a short period of time. In a more game-oriented virtual world, there are more implications there. I can easily envision exploitable situations. The question has to be whether or not those are so serious that it’s worth downtime. And of course, if the answer is ever yes, you can always revert to taking down everything at once.

The other implication of rolling updates is that client/server communications must be backward compatible. I’m not going to spend a lot of time talking about that, since it’s a good idea in any situation and the techniques for ensuring it are well-known. It’s one of those things which takes a little more effort but it’s worth doing. Not all polish is immediately obvious to the user.

There’s one other forced reboot moment familiar to the MMO player, which is the client restart. That happens when there’s a client patch. I’m willing to admit defeat here because we have way less control over the user’s environment, and we don’t have (for example) a second instance of each desktop which can take over running the client while the first instance reboots.

On the other hand, reloading data files shouldn’t be any harder on a desktop than it is on a server — so if your patch is just data updates, many of the same techniques should be in place. Do it right, and the client is semi-protected against users randomly deleting data files in the middle of a game session too. Yeah, that doesn’t happen much, but I’m a perfectionist.

Before I leave this topic, I should apologize to any server coders and architects who know better than me. My disclaimer: I’m a tech ops guy, not a programmer, and I know that means I miss some blindingly obvious things at times. Corrections are always welcome. This is written from the point of view of my discipline, and revolves around techniques I’d love to be able to use rather than what I absolutely know is possible.

Without preamble:

Rows and rows of closed cabinets.

That’s probably a bunch of cabinets that are being rented out individually to different customers. If so, they’re lockable. In my mind, enclosed units like these are cabinets while open units are racks, but the terms get used interchangeably. Each one of these contains the standard 42 units of available vertical space; 1U is 1.75 inches. They’re 19″ wide. The standard for rack design goes back to railway signaling, and is used in telecom, electronics, audio, etc., etc.

You can see a cable run going above the cabinets at the top left center. Cables over the top is fairly standard. Some people still do cabling under the floor — the old Sun corporate headquarters was like that — but generally that’s reserved for power these days. It’s a pain to do any cable work when you have to go under the tiles to do it. My group wasn’t responsible for server cabling or maintenance, which was a relief.

Each one of those cabinets probably has a gigabit Ethernet cable or two dropping into it to provide bandwidth.

Our hunched-over tech ops guy is working at a crash cart; the data center provides these so you can hook a monitor and keyboard up to a server for diagnostic purposes. The etymology is fairly obvious. Vendors sell these nifty flat screen monitors that are designed to fit into a rack, and slide out of the way when you’re not using them, but at a thousand bucks per monitor they’re a little pricy.

Private cage space.

This is the other way data center space is sold: in chunks. We call these cages, because they’re caged in — see the chain link fence surrounding this space? The cage itself is locked, but the racks within don’t need to be. Sometimes closed cabinets are necessary for cooling; I find we geeks tend to like to expose our servers for easy access whenever possible.

Those tiles on the floor lift up easily to expose the crawlspace beneath. There are lots of power cables there. Some of the tiles are perforated; that’s for cooling. There is serious math behind both the design of individual perforated tiles and the layout of perforated and non-perforated tiles within a data center, which is both neat and one of many reasons I never want to host servers myself. In practice sometimes the data center gets it wrong, alas, so I have to know something about it. Or at least be able to detect uncertainty on the part of my vendors.

The little servers in rack 002.005, one in from the left, are 1U servers. It looks like someone spaced them out for the sake of making the rack look fuller or for cooling. I’d stack ’em right up next to each other like the five 1U servers you can see down on the right, in rack 002.001 — the cooling is generally better that way.

These days, the amount of physical floor space you buy is determined by how much power you need rather than how many racks you want in your cage. The last time I bought data center space, I wound up deciding that we were going to buy enough power to fill our racks halfway full of 1U servers; that was the most cost effective model given pricing. That might be one reason to spread your servers out like these guys, so it doesn’t look like you’re wasting space, but I don’t know. Then again, I wasn’t there, so I’ll quit speculating.

That far right rack has all the network devices in it. You can tell because there’s a little nest of cabling there. The third device from the bottom in the far left rack with all the vertical panels is a storage unit. Each one of those panels can contain a hard drive. That might be a tape backup unit right above it; I’m not sure. Lots of nice IBM hardware, though.

I do spend a certain amount of time peering curiously at other cages when I’m in a data center, trying to figure out what everyone else’s setup is like, yes. I’ve shared data center space with Blizzard. That’s a cool setup.

A rather nice bit of cabling work.

Here, we’re looking at a stack of Cisco Catalyst switches, which are real workhorses. Those cables are running off to individual servers, while the switches are probably connected to a router in the same rack. The cables all run downward, which makes me think that this particular installation runs cable under the floor, so there you go.

All the cables run through cable guides, which prevents them from turning into disorganized spaghetti. They’re also all labeled, so you can look at the cables here and know which one is going to which server without tracing it through the floor to the end. They’re color-coded for good measure: if you’re at a server, you know by looking at the cables which segment or segments of the network that server is on.

A not so awesome cabling job.

This cabling is sub-par. Sorry to whoever took the photo! It’s good that everything’s labeled, but the cables aren’t running through any guides, which means they’re going to get tangled up. Also, the labels are too big and I suspect they’ll get in the way. The idea of having removable labels is good, because it makes it easier to update, mind you. This isn’t tragic cabling, it’s just not great.

A real mess.

That’s really bad cabling. Ow.

Oh, and yeah, I’d said something about a mildly amusing story. Welp, I found all these photos via Flickr’s Creative Commons search, which means I’m sure it’s OK to reproduce them here. It turns out that people aren’t always so careful about rights. Back at AltaVista, we originally kept our servers in this fine building, inches away from the most desirable retail space in Palo Alto. Not entirely cost-effective. Sometime thereafter, we moved most of our servers to a Level 3 facility, which was generally pretty good.

We had a lot of servers for the time, and because AltaVista was still tightly coupled with Compaq at that point, we had nice new ones. Our racking and stacking was, modestly, top-notch. Our cage in the Level 3 data center was very pretty. Pretty enough so that Level 3 decided to photograph it and use it on the cover of one of their stockholder reports without asking us. Our sales rep was fairly embarrassed.

Part I of the series is here. In this part, I’ll get more technical.

I like having a checklist for the process I’m about to describe. It’s good to have whoever is executing each step checking off their work. It feels dull because it is dull, but it keeps fallible human beings from forgetting the one boring step they’ve done a hundred times before. It also instills a sense of responsibility. Either paper or electronic is fine, as long as the results are archived and each step plus the overall patch is associated with a specific person each time.
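The key properties — every step signed off by a named person, the whole thing archived per patch — fit in a very small data model. This is a hypothetical sketch, not any particular tool; the step names and IDs are made up:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class Step:
    description: str
    owner: Optional[str] = None
    done_at: Optional[datetime] = None

    def check_off(self, owner):
        """Record who did the step and when; this is what gets archived."""
        self.owner = owner
        self.done_at = datetime.now(timezone.utc)

@dataclass
class PatchChecklist:
    patch_id: str
    coordinator: str
    steps: List[Step] = field(default_factory=list)

    def complete(self):
        return all(s.done_at is not None for s in self.steps)

checklist = PatchChecklist("2009-03-22-v23456", "alice", [
    Step("announce maintenance window"),
    Step("stage files on all servers"),
    Step("verify checksums"),
])
checklist.steps[0].check_off("bob")
print(checklist.complete())  # False -- two steps still unsigned
```

Paper in a binder captures exactly the same fields; what matters is that nothing opens to players until `complete()` would be true.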

Once the patch is approved, it’ll need to be moved to the data center. As Joe notes in the post I linked in Part I, that can be a surprisingly long process. That’s a problem even if you aren’t doing continuous deployment, because there will come a time when you need to get a fix out super-quickly. The easy answer here is that patches shouldn’t be monolithic. Data files should be segmented such that you can push out a change in one without having to update the whole wad. The speed of the uplink to the datacenter is definitely something you should be thinking about as a tech ops guy, though. Find out how big patches could be, figure out how long it’ll take to get them uploaded in the worst case, and make sure people know about that.

You might even want to have a backup plan. I’ve been in situations where it was quicker and more reliable to copy a patch to a USB drive, drive it to the datacenter, and pull the files off there. That’s really cheap — you can buy a drive at Best Buy and keep it around in case of emergency. Back at Turbine we routinely copied a patch to our portable drive just in case something went wrong with the main copy.

It may come in handy to be able to do Quality of Service on your office network, as well. At a game company, you need to expect that people will be playing games during work hours. This is a valid use of time, since it’s important to know what the competition is like. Still, it’s good to be able to throttle that usage if you’re trying to get the damned patch up as quick as possible to minimize server downtime. Or if the patch took a couple days extra to get through testing, but you’ve already made the mistake of announcing a patch date… yeah.

If your office is physically close to the data center, cost out a T1 line directly there. Then compare the yearly cost of the T1 to the cost of six hours of downtime. Also, if you have a direct connection into the data center, you can avoid some security concerns at the cost of some different ones.

Right. The files are now at the data center. You have, say, a couple hundred servers that need new files. The minimum functionality for an automated push is as follows:

  • Must be able to push a directory structure and the files within it to a specified location on an arbitrary number of servers.
  • Must be able to verify file integrity after the push.
  • Must be able to run pre-push and post-push scripts. (This sort of takes care of the second requirement.)
  • Must report on success or failure.
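As a bare-bones sketch of that minimum functionality: the actual file transfer is left as an injectable callable (a real tool would wrap rsync, scp, or a deployment agent), while the checksum manifest, hook scripts, and success/failure report are shown concretely. All names are invented:

```python
import hashlib
import os

def manifest(root):
    """Map each file's relative path under root to its SHA-256 digest."""
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            sums[os.path.relpath(path, root)] = digest
    return sums

def push(root, servers, copy, remote_manifest, pre=None, post=None):
    """Push `root` to each server and report per-server success.

    copy(server, root) transfers the tree; remote_manifest(server)
    returns the checksums of what actually landed on the far side.
    """
    expected = manifest(root)
    report = {}
    for server in servers:
        if pre:
            pre(server)                           # pre-push script
        copy(server, root)
        ok = remote_manifest(server) == expected  # integrity check
        if ok and post:
            post(server)                          # post-push script
        report[server] = "ok" if ok else "checksum mismatch"
    return report
```

The report at the end is the part people skip and regret: with a couple hundred servers, you want one summary telling you exactly which three boxes didn’t take the push.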

That’ll get you most of the way to where you need to go. The files should be pushed to a staging location on each server — best practice is to push to a directory whose name incorporates the version number. Something like /opt/my-mmo/patches/2009-03-22-v23456/ is good. Once everything’s pushed out and confirmed and it’s time to make the patch happen, you can run another command and automatically move the files from there into their final destination, or relink the data file directory to the new directory, or whatever. Sadly, right now, “whatever” probably includes taking the servers down. Make sure that the players have gotten that communication first; IMHO it’s better to delay a bit if someone missed sending out game alerts and forum posts. If your push infrastructure can do the pre-push and post-push scripts, you can treat this step as just another push, which is handy.
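One concrete way to do the “relink the data file directory” step: stage each patch under its versioned directory, then repoint a `current` symlink with an atomic rename, so the cutover is a single filesystem operation. The directory layout follows the example above; the link name is my invention:

```python
import os

def activate(patch_dir, current_link):
    """Atomically repoint current_link at patch_dir."""
    tmp = current_link + ".new"
    if os.path.lexists(tmp):
        os.unlink(tmp)              # clean up any failed earlier attempt
    os.symlink(patch_dir, tmp)
    os.replace(tmp, current_link)   # rename() swaps the link atomically

# Usage (not run here):
#   activate("/opt/my-mmo/patches/2009-03-22-v23456", "/opt/my-mmo/current")
# Rolling back is just pointing the link at the previous version again.
```

Because the old versioned directory stays on disk untouched, rollback is the same one-line operation in the other direction — which is most of the argument for this layout.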

This is often a time to do additional maintenance; e.g., taking full backups can happen during this downtime. You should absolutely do whatever’s necessary to ensure that you can roll back the patch, but you also want to keep downtime to a minimum.

Somewhere in here, perhaps in parallel, any data files or executables destined for the client need to be moved to the patch server. “Patch server” is a bit of a handwave. I think the right way to do this is to have one server or cluster responsible for telling the client what to download, and a separate set of servers to handle the downloads proper. That’ll scale better because functionality is separated.

If you use HTTP as the transport protocol for your client patches, you have a lot of flexibility as to where you host those patches. Patch volumes will be really high; most of your active customers will download the patches within a few hours after they go live. At Turbine, we found out that it would take multiple gigabit network drops to handle patch traffic, which is way more than you need for day to day operations. You want the flexibility to deliver patches as Amazon S3 objects, or via a CDN like Akamai if you’re way rich. Using Amazon gives you Bittorrent functionality for free, which might save you some bandwidth costs. I wouldn’t expect to save a lot that way, for reasons of human nature.

Client patches can theoretically be pre-staged using the same basic approach used with server files: download early, move files into place as needed. If you’re really studly, your client/server communication protocol is architected with backward compatibility in mind. Linden Lab does this for Second Life — you can usually access new versions of the server with old clients. Let people update on their schedule, not yours. That also makes rollbacks easier, unless it’s the game client or data files which need to be rolled back. Client patching architecture should be designed to allow for those rollbacks as well.

Pushing files to patch servers might use the same infrastructure as pushing server and data files around. Akamai will pull files from a server inside your datacenter, as will most CDNs, so that’s easy. Pushing files to Amazon S3 would require a different process. Fortunately the Amazon API is not very hard to work with. Note that you still want that consistency check at the end of the push. You can do this by downloading the files from Amazon and comparing them with the ones you pushed up there.
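That consistency check — hash what you uploaded, hash what comes back down, compare — is the same no matter where the files live. The downloader is injectable here, so one check works for S3, a CDN origin, or a plain HTTP server; the file names are made up:

```python
import hashlib

def sha256(data):
    return hashlib.sha256(data).hexdigest()

def verify_remote(local_files, download):
    """local_files: {name: bytes}; download(name) -> bytes from the remote.

    Returns the list of files whose remote copy doesn't match.
    """
    return [name for name, data in local_files.items()
            if sha256(download(name)) != sha256(data)]

store = {"patch.dat": b"v23456 contents"}
mismatches = verify_remote({"patch.dat": b"v23456 contents"},
                           download=lambda name: store[name])
print(mismatches)  # [] means every file round-tripped intact
```

Downloading everything back is bandwidth you’re paying for twice, but it’s the only check that exercises the same path your customers will use.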

Once everything’s in place, if you’ve taken the servers down, you run one more consistency check to make sure the files in place are the ones you want. Then you bring the servers back up. They should come back up in a locked state, whether that’s a per-shard configuration or a knob you turn on the central authentication server. (Fail-safe technique: insist that servers come up locked by default, and don’t open to customers until someone types an admin command.)

Tech ops does the first login. If that sniff test goes well, QA takes a pass at it. This will include client patching, which is a second check on the validity of those files. Assuming all this goes well, the floodgates open and you’re done. Assuming no rollbacks are needed.

After you’re done, you or your designate sits in the war room watching metrics and hanging out with everyone else in the war room. The war room is a good topic for another post; it’s a way to have everyone on alert and to have easy access to decision-makers if decisions need to be made. It’s usually quiet. Sometime in the evening the war room captain says you’re really done, and this time you can go home.

Part III of this series will be a discussion of patch downtime, and MMO downtime in general.

Chris asked about patching the game in comments, which dovetails nicely with this post. I have a nit to pick with the theory of continuous deployment, but that’ll wait a post or two.

Joe’s outline of release management focuses mostly on the engineering and QA side of the house, which makes sense. The Flying Lab process is very similar to the Turbine process as far as that goes. I’m going to get into the tech ops aspects of patching in the next post, but in this one I want to cover some business process and definitions. Oh, and one side note: patch, hotfix, content update, content push, whatever you want to call it. If you’re modifying the game by making server or client changes, it’s a patch from the operational perspective.

Roughly speaking, you can divide a patch into four potential parts. Not all patches will necessarily need each of these parts. Depending on your server and client design, you may have to change all of these concurrently, but optimally they’re independent.

Part one is server data, which could come in any number of forms. Your servers might use binary data files. They might use some sort of flat text file — I bet there’s someone out there doing world data in XML. I know of at least one game that kept all the data in a relational database. It all boils down to the data which defines the world.

I suppose that in theory, and perhaps in practice, game data could be compiled into the server executable itself. This is suboptimal because it removes the theoretical ability to reload game data on the fly without a game server restart. Even if your data files are separate, you may not be able to do a reload on the fly, but at least separation should make it easier to rework the code to do the right thing later on. There will be more on this topic at a later date.

Part two is the server executable itself. This doesn’t change as often; maybe just when the game introduces new systems or new mechanics. Yay for simplicity. I am pretending that there aren’t multiple pieces of software which make up your game shard, which is probably untrue, but the principle is the same regardless.

Parts three and four split the same way, but apply to the client: client data files and client executables. Any given game may or may not use the same patching mechanism for these two pieces. The distribution method is likely to be the same, but it’s convenient to be able to handle data files without client restarts for the same reason you want to be able to update game data without a server restart.

I prefer to be involved with the release process rather than just pushing out code as it’s thrown over the wall. My job is to keep the servers running happily; at the very least, the more I know about what’s happening, the better I can react to problems. One methodology that I’ve used in the past in games: have a release meeting before the patch hits QA. Break down each change in the patch, and rate each one for importance — how much do we need this change? — and risk. Then when the patch comes out of QA, go back and do the same breakdown. QA will often have information which changes the risk factor, and sometimes that means you don’t want to make a specific change after all. Sometimes the tech ops idea of risk is different than engineering’s idea of risk, for perfectly valid reasons. The second meeting either says “yep, push it!” or “no, don’t push it.” If it’s a no, generally that means you decided to drop some changes and do another QA round.
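A toy version of that importance/risk breakdown: rate each change on both axes, and flag the ones where risk outweighs importance for a harder conversation in the post-QA meeting. The scale, threshold, and change names are all invented:

```python
def review(changes):
    """changes: list of (name, importance, risk) rated 1-5.

    Returns the names where risk exceeds importance -- candidates
    for dropping from the patch or sending back for another look.
    """
    return [name for name, importance, risk in changes if risk > importance]

proposed = [
    ("new crafting recipes", 4, 1),
    ("database driver upgrade", 2, 4),
    ("login rate-limit tweak", 3, 3),
]
print(review(proposed))  # ['database driver upgrade']
```

The point isn’t the arithmetic — it’s that writing the two numbers down before and after QA makes the “we learned something, drop it” decision visible instead of implicit.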

Meetings like that include QA, engineering, whoever owns the continued success of the game (i.e., a producer or executive producer), community relations, and customer support. You can fold the rest of the go/no-go meeting process into this meeting as well. There’s a checklist: do we have release notes for players? Is the proposed date of the push a bad one for some reason? Etc.

I haven’t mentioned the public test server, but that should happen either as part of the QA process or as a separate step in the process. I tend to think that you benefit from treating public test servers as production, which may mean that your first patch meeting in the cycle also formally approves the patch going to public test. You might have quickie meetings during the course of the QA cycle to push out new builds to test as well.

Tomorrow: nuts and bolts.

Daniel James of Three Rings (Puzzle Pirates, Whirled) made a great post with his slides from his GDC presentation. Heads up: lots of real numbers! It’s like catnip for MMO geeks.

From a tech ops perspective, I paid lots of attention to those graphs. Page 7 is awesome. That is exactly the sort of data which should be on a graph in your network monitoring software; ideally it should be on a page with other graphs showing machine load, network load, and so on. Everything should be on the same timeline, for easy comparisons. It’s my job to tell people when we’re going to need to order new hardware; a tech ops manager should have a deep understanding of how player load affects hardware load. Hm, let’s have an example of graphing:

Cacti graphs showing network traffic and CPU utilization.

That’s cacti, which is my favorite open source tool for this purpose right now, although it has its limitations and flaws. This particular pair of graphs shows network traffic on top and CPU utilization for one CPU of the server below; not surprisingly, CPU utilization rises along with network traffic. Data collection for CPU utilization and network traffic is built into cacti, and it’s easy to add collection for pretty much any piece of data that can be expressed as a numeric value.
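
As a sketch of how that custom collection works: cacti can run a script as a data input method and parse numeric values from its stdout, typically as name:value pairs. Everything else below — the login/logout event format and the player-count metric — is a hypothetical stand-in for whatever your game actually exposes:

```python
#!/usr/bin/env python3
# Hypothetical cacti data input script: cacti runs this and parses the
# "name:value" pair printed to stdout. The event format is invented;
# in production you'd read your game's real session source.

def count_sessions(lines):
    """Count currently-open sessions from login/logout event lines."""
    open_sessions = set()
    for line in lines:
        event, player = line.split()
        if event == "login":
            open_sessions.add(player)
        elif event == "logout":
            open_sessions.discard(player)
    return len(open_sessions)

# Stand-in for tailing the real event stream.
sample = ["login alice", "login bob", "logout alice", "login carol"]
print("players:%d" % count_sessions(sample))  # prints players:2
```

Point a data input method at a script like this and cacti graphs the value over time, right next to the built-in CPU and network collection.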

That sort of trend visualization also helps catch problem areas before they get bad. Does the ratio of concurrent players to memory used change abruptly when you hit a specific number of concurrent users? If so, talk to the engineers. It might be fixable. And if it isn’t, well, the projections for profitability might have just changed, in which case you’d better be talking to the financial guys. Making sure the company is making money is absolutely part of the responsibility of anyone in technical operations; some day perhaps I’ll rant about the self-defeating geek tendency to sneer at the business side of the house.
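
That concurrency-versus-memory check might look something like this sketch; the sample numbers and the 1.5× jump threshold are made up:

```python
# Sketch of the ratio check: watch memory-per-player as concurrency
# grows and flag the point where the per-player cost jumps abruptly.
# Sample data and threshold are invented for illustration.

def find_ratio_jump(samples, threshold=1.5):
    """Return the concurrency level where memory-per-player jumps.

    samples: (concurrent_players, memory_mb) pairs, ordered by
    concurrency. Flags the first point whose per-player memory cost
    exceeds `threshold` times the previous point's.
    """
    prev = None
    for players, memory_mb in samples:
        ratio = memory_mb / players
        if prev is not None and ratio > prev * threshold:
            return players
        prev = ratio
    return None

samples = [(1000, 2000), (2000, 4100), (4000, 8300), (5000, 26000)]
print(find_ratio_jump(samples))  # -> 5000
```

Memory per player holds steady around 2 MB until the last sample, where it jumps past 5 MB — that’s the number you bring to the engineers.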

Page 8, more of the same. The observant will notice one of the little quirks of gaming operations: peak times are afternoon to evening, and the peak days are the weekends. The Saturday peak is broader, because people can play during the day more on weekends. You might assume that browser-based games like Whirled would see more play from work, but nope, I guess not.

I wonder what those little dips on 3/17, 3/18, and 3/20 are. I don’t think Whirled is a sharded game, so that can’t be a single shard crashing. Welp, I’ll never know, but it’s a great example of the sort of thing graphs show. If those dips were caused by crashes, you’d know without the graphs because your pager would have gone off; if it’s something else, you’d want to investigate. It could be a bug in your data collection, for that matter, but that’s bad too.

Less tech ops, but still interesting: the material on player acquisition is excellent. Read this if you want to know how to figure out the economics of a game. If I were Daniel James, I would also have breakdowns telling me how those retention cohorts broke down based on play time and perhaps styles of play. What kinds of players stick around? Very important question. I believe strongly in the integration of billing metrics and operational metrics. That work is something that technical operations can drive if need be; all the data sources are within your control. It’s worth spending the time to whip up a prototype dashboard and pitch it to your CFO.
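
A cohort breakdown like that is a straightforward aggregation once billing and operational metrics live in the same place. This sketch is purely illustrative — the field names, play-style labels, and month strings are all invented:

```python
# Hypothetical sketch: bucket players by signup month and play style,
# then see what fraction of each bucket is still active.
from collections import defaultdict

def retention_by_cohort(players, as_of_month):
    """Fraction of each (signup_month, play_style) cohort still active."""
    totals = defaultdict(int)
    active = defaultdict(int)
    for p in players:
        cohort = (p["signup_month"], p["play_style"])
        totals[cohort] += 1
        if p["last_active_month"] >= as_of_month:
            active[cohort] += 1
    return {c: active[c] / totals[c] for c in totals}

players = [
    {"signup_month": "2009-01", "play_style": "social", "last_active_month": "2009-04"},
    {"signup_month": "2009-01", "play_style": "social", "last_active_month": "2009-02"},
    {"signup_month": "2009-01", "play_style": "pvp", "last_active_month": "2009-04"},
]
# The social cohort retains 1 of 2; the pvp cohort retains 1 of 1.
print(retention_by_cohort(players, "2009-04"))
```

The interesting work is in the data plumbing, not the math — which is exactly why tech ops, sitting on all the data sources, is well placed to build the prototype dashboard.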

Then there’s a chunk of advice on building an in-world economy that relates to the real world. Heh: it’s MMO as platform again. Whirled is built on that concept, as I understand it. That dovetails nicely with his discussion of billing. When he says “Don’t build, but use a provider,” he is absolutely correct.

I love this slideshow. In the blog post surrounding it, he talks about how he feels it’s OK to give away the numbers. There are dangers in sharing subscriber numbers and concurrencies, particularly if you’re competing in the big traditional space, but I like seeing people taking the risk. There is plenty of room in the MMO space for more players and plain old numbers are not going to be the secret sauce that makes you rich. How you get those numbers is a different story. So thanks to Daniel for this.

I’m never sure how mystifying my job is to the average person. I do know that even technical people don’t always really know what technical operations does beyond “they’re the guys who keep the servers running,” and I like talking about my job, so I figured I’d expand a bit on the brief blurb from time to time and talk about what a typical tech ops team does.

I’m going to try to use the term “technical operations” for my stuff, in the interests of distinguishing it from operations in general. When a business guy talks about operations, he’s probably talking about the whole gamut of running a game (or a web site, whatever). This includes my immediate bailiwick, but it also includes stuff like customer support, possibly community management, and in some cases even coders maintaining the game. It’s sort of a fuzzier distinction in online gaming; back in the wonderful world of web sites, there’s not a ton of distinction between development pre-launch and development post-launch. Gaming tends to think of those two phases as very different beasts, for mostly good reasons. Although I think some of that is carryover from offline games. I digress! Chalk that up for a later post.

So okay. My primary job is to keep servers running happily. The bedrock of this is the physical installation of servers in the data center. This post is going to be about how you host your servers.

Figure any MMO of any notable size will have… let’s say over 100 servers. This is conservative; World of Warcraft has a lot more than that. There’ll also be big exceptions. I think Puzzle Pirates is a significant MMO and given that it’s a 2D environment, it might be pretty small in terms of server footprint. Um, eight worlds — yeah, I wouldn’t be surprised if they were under 100. But figure we’re generally talking in the hundreds.

You don’t want to worry about the physical aspects of hosting that many servers, especially if you’re a gaming company, because that’s really not your area of expertise. My typical evaluation of a hosting facility includes questions about how many distinct power grids the facility can access; if, say, Somerville has a power outage, I’d like it if the facility could get power from somewhere else. I want to know how long the facility can go without power at all, and how often those backup generators are tested. I want to know how redundant the air conditioning systems are. I want to know how many staff are on site overnight. I want to know about a million things about their network connectivity to the rest of the world. All of this is expensive and hard to build, so why buy that sort of headache? There are companies that will do it for you, and it will be more cost effective, because they’re doing it on a larger scale.

If I’m starting from the ground up, step one is choosing the right hosting facility. Call it colocation if you like. Some people spell that collocation, which is not incorrect but which drives me nuts. (Sorry, Mike.) You start out with the evaluation… well, no. You start out by figuring what’s important to you. As with everything, you need to make the money vs. convenience vs. quality tradeoffs. A tier 1 provider like AT&T or MCI can be really good, but you’re going to pay more than you would for a second tier provider, and that’s not always a wise choice.

My full RFP (request for proposal) document is thousands of words of questions. I won’t reproduce the whole thing here. Suffice it to say that this choice is one of the most important ones you’re going to make. You do not want the pain of changing data centers once you’ve launched. Even once you’ve launched beta. It’s good to get this one right.

There’s also a fair amount of ongoing work that goes into maintaining the relationship, because the bill for hosting is one of your biggest monthly costs. Every month, you have to go over the bill and make sure you’re getting charged for all the right things. I have worked with a lot of colocation facilities and even the best of them screw up billing from time to time.
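
That monthly audit boils down to diffing the contract against the invoice. A minimal sketch, with entirely made-up line items and prices:

```python
# Hypothetical monthly bill check: compare what the contract says you
# should be paying against the invoice's line items.

def audit_invoice(expected, invoice):
    """Return (missing, unexpected, wrong_amount) discrepancies."""
    missing = [item for item in expected if item not in invoice]
    unexpected = [item for item in invoice if item not in expected]
    wrong = {item: (expected[item], invoice[item])
             for item in expected
             if item in invoice and expected[item] != invoice[item]}
    return missing, unexpected, wrong

expected = {"cabinet space": 4000, "bandwidth commit": 2500, "remote hands": 300}
invoice = {"cabinet space": 4000, "bandwidth commit": 3100, "cross-connect": 150}
print(audit_invoice(expected, invoice))
```

Even the best facilities generate the occasional mystery line item, and a five-minute check like this pays for itself the first time it catches one.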

It’s also smart to keep in touch with your facility. You need to figure out who the right person is — probably your Technical Account Manager, maybe someone else. I’ve had relationships where the right guy to talk to was my sales guy, because he loved working with a gaming company and was engaged enough to look at our bills himself every month to make sure they were right. In any case, you want to talk to someone at least once a month, for a bunch of reasons.

First off, if they’ve got concerns, it’s an avenue for them to express them informally. Maybe you’re using more power than you’re paying for. Maybe your cage is a mess, in which case shame on you and why didn’t you already know about it? But you never know. Maybe there’s a new customer that’s about to scoop up a ton of space in your data center and you won’t have expansion room available.

If you’re talking to your key people regularly, they’re going to keep you in mind when things like that last one happen. Often enough you can’t do anything about it; it’s still good to know.

Oh, and if your hosting provider has some sort of game-oriented group, latch onto it! AT&T has an absolutely great Gaming Core Team; when Turbine hooked up with them, our already good service got even better.

Like any relationship with any vendor, you’re going to get more out of it the more you put into it. You don’t stop worrying once you sign the contract.