Brief hiatus here, cause, um, I just got a job! Which is ducky. But I do want to figure out if there’s any company policy on blogging, etc. on the better safe than sorry theory.

Great timing, huh? I can’t really find it in myself to complain, however.

Apparently IBM wasn’t the only company looking to buy a nice Silicon Valley enterprise hardware and software company, since Oracle just bought Sun now that IBM’s opted to pass. You’d assume this was the backup plan, and you’d also assume that Sun really needed/wanted to be purchased. I think it’s a better fit for Sun than IBM would have been.

It’s a pretty important change in the world of technical operations (which is much like the martial arts world, up to and including the preponderance of aged sages and masters). It even has at least one minor direct impact on MMOs. Even if it didn’t, though, it’d be interesting enough to talk about here.

The minor direct impact is Project Darkstar, Sun’s open source MMO server. Oracle doesn’t have anything against open source, about which more in a second, but Project Darkstar probably hasn’t sold Sun many servers so far and I suspect it’s the sort of fringe project which has trouble surviving after a merger/acquisition. Also, the current releases run on a single server only, which is a bit of a drawback for serious MMO work. On the other hand, it’s a fringe project which isn’t being used for a whole lot in practice, which brings us back to the minor impact.

The first big indirect impact is MySQL. I don’t expect Oracle to kill MySQL out of hand. Oracle doesn’t have a reputation as an open source friendly company, but they do a fair bit of work with open source and they’re clearly not averse to the concept. Oracle bought Berkeley DB a few years back and it’s been chugging along just fine ever since, although admittedly it isn’t a direct competitor to Oracle’s core products. Rather, it’s a nice complement. For that matter, Oracle’s actually licensed core technology to MySQL in the past; InnoDB is an important storage engine for MySQL and it’s owned by Oracle.

I also don’t expect MySQL to stay exactly the same, because it has been a direct competitor to Oracle DB. While it makes sense for a software company to offer an open source version of its products, that version should be a gateway to a commercial relationship. MySQL currently has a subscription-based enterprise product which provides monthly bug fixes and updates. That’s a good model. I’m just not sure it makes sense for Oracle, because you’re then supporting two products that directly compete. If nothing else, the marketing gets wonky.

Maintaining MySQL as a starter DB and encouraging people to upgrade to Oracle as their needs grow is also unlikely. If nothing else, that’s not a clean upgrade.

It’s not impossible that Oracle will cut MySQL loose. It’d be a pretty easy transition, since MySQL development still happens in Sweden. When Sun bought MySQL, they didn’t change much about how the company did business. I’d imagine this would come at some cost, and I’m not sure who’d be in a position to pay it.

Since it’s an open source product, the original developers could theoretically resign from Oracle en masse and reconstitute themselves as a new company. That is pure and blatant speculation. I don’t know what Swedish non-compete laws are like, I don’t know what it says in their employment contract, and so on. It’s more likely that a new entity would take on development. On the other hand, that’s a big project to launch; you’d want someone like IBM backing you, and you might lose those licenses that Oracle’s sold to MySQL proper.

So lots of possibilities. I don’t expect to see MySQL vanish. I do think it’s very possible that development will slow down. If I were starting a project that required a database right now, I’d look at the possibility of change around MySQL and probably decide to use Postgres. Just to pull this back to MMOs: did you know that Sony Online Entertainment has a significant investment in the largest Postgres support company out there? That’s who Sony uses for their databases. Just sayin’.

The other interesting indirect impact is servers. Oracle is not a company you think of as a server company, unless you were paying attention last fall when they announced a database appliance built on HP technology and hardware. Oracle now has an opportunity to own the entire concept, from hardware to OS to database. This has to be attractive to Larry Ellison, who is a big fan of Steve Jobs and the Apple integration of hardware and software. I expect to see database and data warehouse appliances built on Sun’s hardware within a year or two.

The open question is whether or not Oracle wants to be in the general purpose server market. They certainly could be, but it doesn’t necessarily sell more databases. Sun’s been pushing their servers at the MMO industry fairly hard; I expect most of us have gotten their sales calls and heard all about the Sun Games Technologies Group. I have no qualms about saying that I never wanted to buy Sun because I was uncertain about their future. (Which always made me sad — I used to work for Sun.) If Oracle decides they want to continue the general purpose product line, however, that’s a potentially different matter.

I’d be pretty happy to see another competitor in the market. Right now, if you want the support that should come with a top tier server provider, you’ve got IBM and HP with Dell as a reasonable third choice. As a purchaser, I’m always happy with more choices. I’m thus rooting for a Sun resurgence, without any sappy puns about rising.

The third big impact is Java. I’d expect this to be the most painless, least interesting transition. Oracle is a software company, and Java fits into their offerings without being a competitor to anything they’re already doing. Oracle’s already involved in Java open source work. Theory isn’t practice, but Oracle would have to work at it a bit to screw up Java. I’d bet Java was actually the biggest reason Oracle wanted to buy Sun, although MySQL and the hardware had to be factors as well. Therefore, I don’t have much else to say about it.

So there we are. It’s a pretty seismic change, even if Sun was declining as a factor. We won’t really know what the full impact is for a year or so. I think I am cautiously optimistic that it’ll be a net good for me, with some negative effects.

See also Part I and Part II. Take the time to read ’em, I don’t mind. I need to go make coffee anyhow.

Our third and final topic in this series is downtime. Every MMO player is used to downtime. Turbine games usually have weekly downtime. Blizzard takes World of Warcraft down most weeks. EVE Online goes down every day. It’s part of the whole MMO experience.

I believe that polish is a key part of a successful MMO. Ragged edges show and they turn off customers. This became obvious when Blizzard launched World of Warcraft and set the bar a lot higher while attracting several million new customers. I also believe that operational polish is part of that. Customer support polish is important. Tech ops polish is also important.

Downtime is not polish. Downtime is something we should avoid.

So, patching. We reboot when we patch for two reasons. One, the data files change, whether those are binary files or configuration files or databases, and the server software can’t load new data on the fly. Two, the server software itself changes.

Neither of these is entirely simple. As far as the first one goes, I’ve been known to claim that if a Web server can reload content without being restarted, a game server ought to be able to do the same. This is a misrepresentation on my part, because Web servers are stateless and game servers are exceedingly stateful. In order to solve the problem of transparent reloads for game servers, you need to figure out how you’re going to handle it when content changes while a user is accessing it.

I don’t think it’s impossible, however. My initial model would be something like Blizzard’s phasing technology, in which the same zone/area looks different depending on where you are in certain quest lines. Do the same thing, except that the phases are different content levels. You still run the risk of discontinuity: e.g., if the data for an instance changes while one person in the party is inside the instance and the others zone in afterwards, you have a party split between two instances.

Displaying a warning to the users is inelegant but does solve the problem. See also City of Heroes‘ instanced zone design, where players may be in any of several versions of a given city area. I don’t have a better approach handy, and I do think that indicating the mismatch to the users is better than downtime, so that technique satisfies for now.
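Here’s a minimal sketch of that version-pinning idea, with hypothetical names: assume content lives in versioned directories and zone instances get created per party, so new instances pick up the newest data while instances already in flight keep the version they started with until they drain.

```python
# Sketch only: hypothetical names. Assumes content can be loaded from
# versioned directories and that zone instances are created per party.

import threading

class ContentStore:
    """Holds immutable content sets keyed by version string."""
    def __init__(self):
        self._versions = {}      # version -> loaded content
        self._latest = None
        self._lock = threading.Lock()

    def load_version(self, version, loader):
        """Load a new content version without touching the older ones."""
        content = loader(version)    # e.g. parse /data/content/<version>/
        with self._lock:
            self._versions[version] = content
            self._latest = version

    def latest_version(self):
        with self._lock:
            return self._latest

    def get(self, version):
        with self._lock:
            return self._versions[version]

class ZoneInstance:
    """A running instance stays pinned to the version it started with."""
    def __init__(self, store):
        self.version = store.latest_version()   # pinned at creation time
        self.content = store.get(self.version)

# Instances created after load_version() see the new data; instances
# already in flight keep serving the old version until they drain.
```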

Any game which allows for hotfixes without the game going down already does this, of course. I can think of a couple that do it. I sort of feel like this should be the minimum target functionality for new games. I say target because unexpected issues can always arise, but it’s good to have a target nonetheless.

The second problem is trickier because it requires load balancing. Since games are stateful and require a persistent connection — or a really good simulation of one — you’re not going to be able to restart the server without affecting the people connected to it. The good news is that since we control the client/server protocol, we theoretically have the ability to play some clever tricks.

The specific trick I’d like to play is a handoff. I want to be able to tell all the clients connected to one instance of the server that they should go talk to a second instance of the server… now. Then I can take down the first instance of the server, do whatever I need to do, and reverse the process to upgrade the second instance of the server when I’m done.

Load balancing is useful for more than server upgrades: it’d be great for hardware maintenance as well. What’s more, if the client is taking that particular cue from a separate cluster of servers, you could possibly do the same thing reactively: a piece of hardware goes down? Detect the fault, and have the load balancing cluster issue a directive to go use a different server.

I snuck in the assumption that the load balancing cluster would be a cluster. I think that’s semi-reasonable. It’s one of the functions I’d be inclined to farm out to HTTP servers of some flavor, because anything that’s an HTTP server can live behind a commercial-grade load balancer that the game studio doesn’t have to write. The drawback is that the load balancing is then a pull instead of a push: the client can check to see if anything’s changed, but the servers can’t push anything to the client between checks.
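As a sketch of what the pull model looks like from the client side, with a hypothetical assignment endpoint served by that load balancing cluster:

```python
# Sketch only: hypothetical endpoint and field names. The client polls a
# small HTTP service for its current server assignment and reconnects if
# the answer changes; since the servers can't push, there's polling latency.

import json
import time
import urllib.request

ASSIGNMENT_URL = "http://lb.example.com/assignment?shard=17"  # hypothetical

def current_assignment():
    with urllib.request.urlopen(ASSIGNMENT_URL, timeout=2) as resp:
        return json.load(resp)   # e.g. {"host": "gs03.example.com", "port": 7777}

def poll_loop(reconnect, interval=5.0):
    assigned = current_assignment()
    reconnect(assigned["host"], assigned["port"])
    while True:
        time.sleep(interval)
        latest = current_assignment()
        if latest != assigned:   # handoff: go talk to the other server... now
            assigned = latest
            reconnect(assigned["host"], assigned["port"])
```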

I think I’m sort of being overly optimistic here, unfortunately. For one thing, it’s unclear that response times will be quick enough to avoid the users seeing some discontinuity. That might be tolerable, given that MMOs are relatively slack in their required response times, but I’m dubious. For another thing, the problem of maintaining state between two instances of a game server is really tough. You’d have to checkpoint the state of each individual server regularly. The length of time between checkpoints is the amount of rollback you’d be faced with from a perceptual standpoint. There’s an additional issue in that the state checkpoints would need to match the database checkpoints, or you’ll wind up with discontinuity there, which is worse. You really don’t want two servers to disagree about the state of the game.

A more realistic approach is something like what Second Life does. When they roll out new server software, they just do a rolling update. Each individual server reboots, which is a pretty quick process. The section of the game world handled by that server is inaccessible for that period of time. When it comes back up, it’s running the new code.
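Conceptually a rolling update is just a loop with a health check between steps. A rough sketch, with the drain, patch, and verify operations left as stand-ins for whatever your orchestration really does:

```python
# Sketch only: drain, apply_patch, and healthy are stand-ins for whatever
# your orchestration actually does (ssh, config management, and so on).

def rolling_update(servers, drain, apply_patch, healthy, max_unavailable=1):
    """Update servers a few at a time so most of the world stays up."""
    for start in range(0, len(servers), max_unavailable):
        batch = servers[start:start + max_unavailable]
        for server in batch:
            drain(server)         # stop routing players to this server
            apply_patch(server)   # push files, restart the process
        for server in batch:
            if not healthy(server):   # don't keep rolling out a broken build
                raise RuntimeError(f"{server} failed its post-patch check; stopping rollout")
```

The max_unavailable knob is the whole tradeoff: bigger batches mean a shorter total rollout, but more of the world is down at once.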

There’s a small paradigm shift in that idea. Linden Lab doesn’t mind if Second Life’s world rules vary slightly between servers for a short period of time. In a more game-oriented virtual world, the implications are bigger. I can easily envision exploitable situations. The question has to be whether those are serious enough to be worth downtime. And of course, if the answer is ever yes, you can always revert to taking down everything at once.

The other implication of rolling updates is that client/server communications must be backward compatible. I’m not going to spend a lot of time talking about that, since it’s a good idea in any situation and the techniques for ensuring it are well-known. It’s one of those things which takes a little more effort but it’s worth doing. Not all polish is immediately obvious to the user.
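The usual trick is to version the protocol and make the receiving side tolerant of fields it doesn’t recognize. A toy sketch, with hypothetical message fields:

```python
# Sketch only: a toy message format. The point is that newer fields have
# defaults, so an old client that doesn't send them still works.

import json

PROTOCOL_VERSION = 3

def decode_move(raw):
    msg = json.loads(raw)
    return {
        "version": msg.get("version", 1),   # very old clients send no version
        "x": msg["x"],
        "y": msg["y"],
        # Added in protocol version 3; the default keeps older clients working.
        "mount_speed": msg.get("mount_speed", 1.0),
    }
```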

There’s one other forced reboot moment familiar to the MMO player, which is the client restart. That happens when there’s a client patch. I’m willing to admit defeat here because we have way less control over the user’s environment, and we don’t have (for example) a second instance of each desktop which can take over running the client while the first instance reboots.

On the other hand, reloading data files shouldn’t be any harder on a desktop than it is on a server — so if your patch is just data updates, many of the same techniques should be in place. Do it right, and the client is semi-protected against users randomly deleting data files in the middle of a game session too. Yeah, that doesn’t happen much, but I’m a perfectionist.

Before I leave this topic, I should apologize to any server coders and architects who know better than me. My disclaimer: I’m a tech ops guy, not a programmer, and I know that means I miss some blindingly obvious things at times. Corrections are always welcome. This is written from the point of view of my discipline, and revolves around techniques I’d love to be able to use rather than what I absolutely know is possible.

A couple of followups as I sip coffee and wait for various and sundry phone calls…

The OnLive claims are continuing to spark debate. Mostly of the form “sha, right, that’s not practical.” Steve Perlman responded in a BBC interview.

There’s some concrete info in there, mostly about the encoding and compression process. They’re depending on a specialized chip which’ll cost them under $20 per chip in bulk. That makes some kind of sense for the hardware console replacement, and I suppose that the Mac/PC versions will have plenty of processor available.

Aiming for sub-80 millisecond round trip ping times between the clients and the data centers is feasible, given that they’re willing to have multiple data centers.

Running 10 games per server is an interesting concept. Possibly whatever custom hardware they’re building around their specialized chip will load multiple chips on each server, such that the tricky work is offloaded from the main CPUs. If they’re planning on running large servers — something like the IBM x3850m2 — and using virtualization, then there’s enough CPU and RAM in a single server to handle that. You’re not going to get 10 games on a little dual CPU quad core 1U server, though. The lesson here: “The company has calculated that each server will be dealing with about 10 different gamers…” is a completely meaningless statement if the word server is undefined.

Thus, my concerns about cost structure remain intact for now.

Meanwhile, Dave Perry (ex-Acclaim) has his own streaming game service in the works, called Gaikai. I love his interview because he hits on one of my favorite business concepts, friction. He’s absolutely right in his discussion of the dangers of making it harder for people to play. His service also looks more flexible and requires less buy-in from studios. On the other hand, he’s saying nothing about the technical difficulties.

One last streaming tidbit: World of Warcraft streamed to an iPhone, from a streaming-oriented company called Vollee. Just a demo, which means you can’t say anything about how it performs over a 3G network, but still neat. I like their capacity for custom UI filtering to adapt to the smaller screen size.

In completely different news, two other people came up with my clever addon/plugin App Store idea, except they both thought it was an April Fool’s joke. Humph.

The great thing about pictures of bad cabling messes is that there are always worse ones. So: worse ones!

Without preamble:

Rows and rows of closed cabinets.

That’s probably a bunch of cabinets that are being rented out individually to different customers. If so, they’re lockable. In my mind, enclosed units like these are cabinets while open units are racks, but the terms get used interchangeably. Each one of these contains the standard 42 units of available vertical space; 1U is 1.75 inches. They’re 19″ wide. The standard for rack design goes back to railway signaling, and is used in telecom, electronics, audio, etc., etc.

You can see a cable run going above the cabinets at the top left center. Cables over the top is fairly standard. Some people still do cabling under the floor — the old Sun corporate headquarters was like that — but generally that’s reserved for power these days. It’s a pain to do any cable work when you have to go under the tiles to do it. My group wasn’t responsible for server cabling or maintenance, which was a relief.

Each one of those cabinets probably has a gigabit Ethernet cable or two dropping into it to provide bandwidth.

Our hunched over tech ops guy is working at a crash cart; the data center provides these so you can hook a monitor and keyboard up to a server for diagnostic purposes. The etymology is fairly obvious. Vendors sell these nifty flat screen monitors that are designed to fit into a rack, and slide out of the way when you’re not using them, but at a thousand bucks per monitor they’re a little pricy.

Private cage space.

This is the other way data center space is sold: in chunks. We call these cages, because they’re caged in — see the chain link fence surrounding this space? The cage itself is locked, but the racks within don’t need to be. Sometimes closed cabinets are necessary for cooling; I find we geeks tend to like to expose our servers for easy access whenever possible.

Those tiles on the floor lift up easily to expose the crawlspace beneath. There are lots of power cables there. Some of the tiles are perforated; that’s for cooling. There is serious math behind both the design of individual perforated tiles and the layout of perforated and non-perforated tiles within a data center, which is both neat and one of many reasons I never want to host servers myself. In practice sometimes the data center gets it wrong, alas, so I have to know something about it. Or at least be able to detect uncertainty on the part of my vendors.

The little servers in rack 002.005, one in from the left, are 1U servers. It looks like someone spaced them out for the sake of making the rack look fuller or for cooling. I’d stack ’em right up next to each other like the five 1U servers you can see down on the right, in rack 002.001 — the cooling is generally better that way.

These days, the amount of physical floor space you buy is determined by how much power you need rather than how many racks you want in your cage. The last time I bought data center space, I wound up deciding that we were going to buy enough power to fill our racks halfway full of 1U servers; that was the most cost effective model given pricing. That might be one reason to spread your servers out like these guys, so it doesn’t look like you’re wasting space, but I don’t know. Then again, I wasn’t there, so I’ll quit speculating.

That far right rack has all the network devices in it. You can tell because there’s a little nest of cabling there. The third device from the bottom in the far left rack with all the vertical panels is a storage unit. Each one of those panels can contain a hard drive. That might be a tape backup unit right above it; I’m not sure. Lots of nice IBM hardware, though.

I do spend a certain amount of time peering curiously at other cages when I’m in a data center, trying to figure out what everyone else’s setup is like, yes. I’ve shared data center space with Blizzard. That’s a cool setup.

A rather nice bit of cabling work.

Here, we’re looking at a stack of Cisco Catalyst switches, which are real workhorses. Those cables are running off to individual servers, while the switches are probably connected to a router in the same rack. The cables all run downward, which makes me think that this particular installation runs cable under the floor, so there you go.

All the cables run through cable guides, which prevents them from turning into disorganized spaghetti. They’re also all labeled, so you can look at the cables here and know which one is going to which server without tracing it through the floor to the end. They’re color-coded for good measure: if you’re at a server, you know by looking at the cables which segment or segments of the network that server is on.

A not so awesome cabling job.

This cabling is sub-par. Sorry to whoever took the photo! It’s good that everything’s labeled, but the cables aren’t running through any guides, which means they’re going to get tangled up. Also, the labels are too big and I suspect they’ll get in the way. The idea of having removable labels is good, mind you, because it makes them easier to update. This isn’t tragic cabling, it’s just not great.

A real mess.

That’s really bad cabling. Ow.

Oh, and yeah, I’d said something about a mildly amusing story. Welp, I found all these photos via Flickr’s Creative Commons search, which means I’m sure it’s OK to reproduce them here. It turns out that people aren’t always so careful about rights. Back at AltaVista, we originally kept our servers in this fine building, inches away from the most desirable retail space in Palo Alto. Not entirely cost-effective. Sometime thereafter, we moved most of our servers to a Level 3 facility, which was generally pretty good.

We had a lot of servers for the time, and because AltaVista was still tightly coupled with Compaq at that point, we had nice new ones. Our racking and stacking was, modestly, top-notch. Our cage in the Level 3 data center was very pretty. Pretty enough so that Level 3 decided to photograph it and use it on the cover of one of their stockholder reports without asking us. Our sales rep was fairly embarrassed.

Part I of the series is here. In this part, I’ll get more technical.

I like having a checklist for the process I’m about to describe. It’s good to have whoever is executing each step checking off their work. It feels dull because it is dull, but it keeps fallible human beings from forgetting the one boring step they’ve done a hundred times before. It also instills a sense of responsibility. Either paper or electronic is fine, as long as the results are archived and each step plus the overall patch is associated with a specific person each time.

Once the patch is approved, it’ll need to be moved to the data center. As Joe notes in the post I linked in Part I, that can be a surprisingly long process. That’s a problem even if you aren’t doing continuous deployment, because there will come a time when you need to get a fix out super-quickly. The easy answer here is that patches shouldn’t be monolithic. Data files should be segmented such that you can push out a change in one without having to update the whole wad. The speed of the uplink to the datacenter is definitely something you should be thinking about as a tech ops guy, though. Find out how big patches could be, figure out how long it’ll take to get them uploaded in the worst case, and make sure people know about that.
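The worst-case math is worth writing down somewhere people can see it. A throwaway calculation, with made-up numbers:

```python
# Back-of-the-envelope only: plug in your own patch size and uplink speed.

def upload_hours(patch_gigabytes, uplink_megabits_per_sec, efficiency=0.8):
    """Rough transfer time; efficiency accounts for protocol overhead."""
    bits = patch_gigabytes * 8 * 1000**3
    usable_bits_per_sec = uplink_megabits_per_sec * 1000**2 * efficiency
    return bits / usable_bits_per_sec / 3600

# A 4 GB patch over a 10 Mbps uplink: a bit over an hour.
print(round(upload_hours(4, 10), 1))   # -> 1.1
```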

You might even want to have a backup plan. I’ve been in situations where it was quicker and more reliable to copy a patch to a USB drive, drive it to the datacenter, and pull it off the drive there. That’s really cheap — you can buy one at Best Buy and keep it around in case of emergency. Back at Turbine we routinely copied a patch to our portable drive just in case something went wrong with the main copy.

It may come in handy to be able to do Quality of Service on your office network, as well. At a game company, you need to expect that people will be playing games during work hours. This is a valid use of time, since it’s important to know what the competition is like. Still, it’s good to be able to throttle that usage if you’re trying to get the damned patch up as quick as possible to minimize server downtime. Or if the patch took a couple days extra to get through testing, but you’ve already made the mistake of announcing a patch date… yeah.

If your office is physically close to the data center, cost out a T1 line directly there. Then compare the yearly cost of the T1 to the cost of six hours of downtime. Also, if you have a direct connection into the data center, you can avoid some security concerns at the cost of some different ones.

Right. The files are now at the data center. You have, say, a couple hundred servers that need new files. The minimum functionality for an automated push is as follows:

  • Must be able to push a directory structure and the files within it to a specified location on an arbitrary number of servers.
  • Must be able to verify file integrity after the push.
  • Must be able to run pre-push and post-push scripts. (This sort of takes care of the second requirement.)
  • Must report on success or failure.
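A minimal sketch of a push tool with that shape, assuming ssh key authentication and rsync on every box; anything real needs retries and parallelism, but the bones look like this:

```python
# Sketch only: assumes rsync and ssh key authentication on every server.
# Script names and paths are hypothetical.

import subprocess

def push_patch(servers, local_dir, remote_dir, pre_script=None, post_script=None):
    results = {}
    for server in servers:
        try:
            if pre_script:
                subprocess.run(["ssh", server, pre_script], check=True)
            # -a pushes the directory structure as-is; -c compares checksums,
            # which doubles as the file integrity check.
            subprocess.run(
                ["rsync", "-ac", "--delete",
                 local_dir + "/", f"{server}:{remote_dir}/"],
                check=True)
            if post_script:
                subprocess.run(["ssh", server, post_script], check=True)
            results[server] = "ok"
        except subprocess.CalledProcessError as err:
            results[server] = f"failed: {err}"
    return results   # the report-on-success-or-failure piece
```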

That’ll get you most of the way to where you need to go. The files should be pushed to a staging location on each server — best practice is to push to a directory whose name incorporates the version number. Something like /opt/my-mmo/patches/2009-03-22-v23456/ is good. Once everything’s pushed out and confirmed and it’s time to make the patch happen, you can run another command and automatically move the files from there into their final destination, or relink the data file directory to the new directory, or whatever. Sadly, right now, “whatever” probably includes taking the servers down. Make sure that the players have gotten that communication first; IMHO it’s better to delay a bit if someone missed sending out game alerts and forum posts. If your push infrastructure can do the pre-push and post-push scripts, you can treat this step as just another push, which is handy.
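The “relink the data file directory” step can be a near-atomic symlink swap, which also makes rollback a one-liner. A sketch, with hypothetical paths:

```python
# Sketch only: hypothetical paths. Build the new symlink beside the old one
# and rename it into place; on a POSIX filesystem the rename is atomic, so
# "current" always points at a complete patch directory.

import os

def activate_patch(version, base="/opt/my-mmo"):
    target = os.path.join(base, "patches", version)
    current = os.path.join(base, "data", "current")
    staging = current + ".new"
    if os.path.lexists(staging):
        os.remove(staging)
    os.symlink(target, staging)
    os.rename(staging, current)   # the flip; rollback is the same flip backwards

# activate_patch("2009-03-22-v23456")
```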

This is often a time to do additional maintenance; e.g., taking full backups can happen during this downtime. You should absolutely do whatever’s necessary to ensure that you can roll back the patch, but you also want to keep downtime to a minimum.

Somewhere in here, perhaps in parallel, any data files or executables destined for the client need to be moved to the patch server. “Patch server” is a bit of a handwave. I think the right way to do this is to have one server or cluster responsible for telling the client what to download, and a separate set of servers to handle the downloads proper. That’ll scale better because functionality is separated.
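The “tell the client what to download” half can be as simple as a manifest the client fetches first, listing files, checksums, and URLs that point at whatever actually serves the bytes. A sketch with hypothetical fields:

```python
# Sketch only: hypothetical manifest layout. The manifest server hands this
# to the client; the URLs point at whatever actually serves the bytes
# (your own download cluster, S3, or a CDN).

import hashlib
import json
import os

def build_manifest(version, patch_dir, base_url):
    files = []
    for root, _, names in os.walk(patch_dir):
        for name in names:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, patch_dir)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            files.append({
                "path": rel,
                "size": os.path.getsize(path),
                "sha256": digest,
                "url": f"{base_url}/{version}/{rel}",
            })
    return json.dumps({"version": version, "files": files}, indent=2)
```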

If you use HTTP as the transport protocol for your client patches, you have a lot of flexibility as to where you host those patches. Patch volumes will be really high; most of your active customers will download the patches within a few hours after they go live. At Turbine, we found out that it would take multiple gigabit network drops to handle patch traffic, which is way more than you need for day to day operations. You want the flexibility to deliver patches as Amazon S3 objects, or via a CDN like Akamai if you’re way rich. Using Amazon gives you BitTorrent functionality for free, which might save you some bandwidth costs. I wouldn’t expect to save a lot that way, for reasons of human nature.

Client patches can theoretically be pre-staged using the same basic approach used with server files: download early, move files into place as needed. If you’re really studly, your client/server communication protocol is architected with backward compatibility in mind. Linden Lab does this for Second Life — you can usually access new versions of the server with old clients. Let people update on their schedule, not yours. That also makes rollbacks easier, unless it’s the game client or data files which need to be rolled back. Client patching architecture should be designed to allow for those rollbacks as well.

Pushing files to patch servers might use the same infrastructure as pushing server and data files around. Akamai will pull files from a server inside your datacenter, as will most CDNs, so that’s easy. Pushing files to Amazon S3 would require a different process. Fortunately the Amazon API is not very hard to work with. Note that you still want that consistency check at the end of the push. You can do this by downloading the files from Amazon and comparing them with the ones you pushed up there.
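That consistency check doesn’t need anything vendor-specific; pull each file back over HTTP and compare hashes against what you pushed. A sketch:

```python
# Sketch only: works against S3, a CDN, or your own download servers, since
# it just fetches each file over HTTP and compares SHA-256 digests.

import hashlib
import urllib.request

def verify_uploads(expected):
    """expected: iterable of (url, sha256_hex) pairs from the push step."""
    bad = []
    for url, digest in expected:
        with urllib.request.urlopen(url) as resp:
            actual = hashlib.sha256(resp.read()).hexdigest()
        if actual != digest:
            bad.append(url)
    return bad   # an empty list means the far end matches what you pushed
```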

Once everything’s in place, if you’ve taken the servers down, you run one more consistency check to make sure the files in place are the ones you want. Then you bring the servers back up. They should come back up in a locked state, whether that’s a per-shard configuration or a knob you turn on the central authentication server. (Fail-safe technique: insist that servers come up locked by default, and don’t open to customers until someone types an admin command.)
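The fail-safe is just making “open to players” something a human has to assert rather than the default. A sketch, with hypothetical names:

```python
# Sketch only: hypothetical names. A freshly started shard refuses player
# logins until an operator explicitly opens it, so a surprise restart can
# never dump players onto an unverified build.

class ShardGate:
    def __init__(self):
        self.open_to_players = False        # locked by default on startup

    def admin_open(self, operator):
        print(f"shard opened by {operator}")   # leave an audit trail
        self.open_to_players = True

    def allow_login(self, account):
        return self.open_to_players or account.is_staff   # staff get in early
```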

Tech ops does the first login. If that sniff test goes well, QA gets a pass at it. This will include client patching, which is a second check on the validity of those files. Assuming all this goes well, the floodgates open and you’re done. Assuming no rollbacks are needed.

After you’re done, you or your designate sits in the war room watching metrics and hanging out with everyone else in the war room. The war room is a good topic for another post; it’s a way to have everyone on alert and to have easy access to decision-makers if decisions need to be made. It’s usually quiet. Sometime in the evening the war room captain says you’re really done, and this time you can go home.

Part III of this series will be a discussion of patch downtime, and MMO downtime in general.

Chris asked about patching the game in comments, which dovetails nicely with this post. I have a nit to pick with the theory of continuous deployment, but that’ll wait a post or two.

Joe’s outline of release management focuses mostly on the engineering and QA side of the house, which makes sense. The Flying Lab process is very similar to the Turbine process as far as that goes. I’m going to get into the tech ops aspects of patching in the next post, but in this one I want to cover some business process and definitions. Oh, and one side note: patch, hotfix, content update, content push, whatever you want to call it. If you’re modifying the game by making server or client changes, it’s a patch from the operational perspective.

Roughly speaking, you can divide a patch into four potential parts. Not all patches will necessarily need each of these parts. Depending on your server and client design, you may have to change all of these concurrently, but optimally they’re independent.

Part one is server data, which could come in any number of forms. Your servers might use binary data files. They might use some sort of flat text file — I bet there’s someone out there doing world data in XML. I know of at least one game that kept all the data in a relational database. It all boils down to the data which defines the world.

I suppose that in theory, and perhaps in practice, game data could be compiled into the server executable itself. This is suboptimal because it removes the theoretical ability to reload game data on the fly without a game server restart. Even if your data files are separate, you may not be able to do a reload on the fly, but at least separation should make it easier to rework the code to do the right thing later on. There will be more on this topic at a later date.

Part two is the server executable itself. This doesn’t change as often; maybe just when the game introduces new systems or new mechanics. Yay for simplicity. I am pretending that there aren’t multiple pieces of software which make up your game shard, which is probably untrue, but the principle is the same regardless.

Parts three and four split the same way, but apply to the client: client data files and client executables. Any given game may or may not use the same patching mechanism for these two pieces. The distribution method is likely to be the same, but it’s convenient to be able to handle data files without client restarts for the same reason you want to be able to update game data without a server restart.

I prefer to be involved with the release process rather than just pushing out code as it’s thrown over the wall. My job is to keep the servers running happily; at the very least, the more I know about what’s happening, the better I can react to problems. One methodology that I’ve used in the past in games: have a release meeting before the patch hits QA. Break down each change in the patch, and rate each one for importance — how much do we need this change? — and risk. Then when the patch comes out of QA, go back and do the same breakdown. QA will often have information which changes the risk factor, and sometimes that means you don’t want to make a specific change after all. Sometimes the tech ops idea of risk is different than engineering’s idea of risk, for perfectly valid reasons. The second meeting either says “yep, push it!” or “no, don’t push it.” If it’s a no, generally that means you decided to drop some changes and do another QA round.

Meetings like that include QA, engineering, whoever owns the continued success of the game (i.e., a producer or executive producer), community relations, and customer support. You can fold the rest of the go/no-go meeting process into this meeting as well. There’s a checklist: do we have release notes for players? Is the proposed date of the push a bad one for some reason? Etc.

I haven’t mentioned the public test server, but that should happen either as part of the QA process or as a separate step in the process. I tend to think that you benefit from treating public test servers as production, which may mean that your first patch meeting in the cycle also formally approves the patch going to public test. You might have quickie meetings during the course of the QA cycle to push out new builds to test as well.

Tomorrow: nuts and bolts.

Hey, that’s a week. Neat. Thoughts and questions for people who’ve found their way here:

Anything in particular you want to see? I have pending requests for another post about datacenters, something on customer service, and a piece on planning for usage spikes. If there’s anything in particular you want me to talk about, let me know.

For that matter, if there’s a general category of stuff which is more interesting, let me know that, too.

I fiddled around with the look of the blog a bit over the course of the week. Comment links are now at the bottom of each post instead of the top. I don’t imagine anyone really cares, but if you want those links at the top as well as the bottom, I could do that.

There is a Livejournal feed, which I should put in the sidebar. There is also now a Livejournal feed containing just excerpts, since the fairly large posts do chew up a bunch of room: imgnry_cgs_shrt. Not the most memorable name in the world but there’s a length limit on Livejournal syndication names.

I’m away on business Monday, so see you again probably on Tuesday. Thanks for coming by.

Daniel James of Three Rings (Puzzle Pirates, Whirled) made a great post with his slides from his GDC presentation. Attention alert: lots of real numbers! It’s like catnip for MMO geeks.

From a tech ops perspective, I paid lots of attention to those graphs. Page 7 is awesome. That is exactly the sort of data which should be on a graph in your network monitoring software; ideally it should be on a page with other graphs showing machine load, network load, and so on. Everything should be on the same timeline, for easy comparisons. It’s my job to tell people when we’re going to need to order new hardware; a tech ops manager should have a deep understanding of how player load affects hardware load. Hm, let’s have an example of graphing:

Cacti graphs showing network traffic and CPU utilization.

That’s cacti, which is my favorite open source tool for this purpose right now, although it has its limitations and flaws. This particular pair of graphs shows network traffic on top and CPU utilization for one CPU of the server below; not surprisingly, CPU utilization rises along with network traffic. Data collection for CPU utilization and network traffic is built into cacti, and it’s easy to add collection for pretty much any piece of data that can be expressed as a numeric value.
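For game-specific numbers, say concurrent players per shard, the usual approach is a tiny script the poller runs on a schedule; as I recall, cacti’s script-based data sources will graph whatever name:value pairs you print, but check the docs for your version. A sketch with a hypothetical session store:

```python
# Sketch only: hypothetical session store and query. The poller runs this
# and graphs whatever it prints; name:value output is the format I remember
# cacti's script data sources expecting.

import sqlite3   # stand-in for wherever session data actually lives

def concurrent_players(db_path="/var/run/my-mmo/sessions.db"):
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM sessions WHERE logged_in = 1").fetchone()
    conn.close()
    return count

if __name__ == "__main__":
    print(f"players:{concurrent_players()}")
```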

That sort of trend visualization also helps catch problem areas before they get bad. Does the ratio of concurrent players to memory used change abruptly when you hit a specific number of concurrent users? If so, talk to the engineers. It might be fixable. And if it isn’t, well, the projections for profitability might have just changed, in which case you’d better be talking to the financial guys. Making sure the company is making money is absolutely part of the responsibility of anyone in technical operations; some day perhaps I’ll rant about the self-defeating geek tendency to sneer at the business side of the house.

Page 8, more of the same. The observant will notice one of the little quirks of gaming operations: peak times are afternoon to evening, and the peak days are the weekends. The Saturday peak is broader, because people can play during the day more on weekends. You might assume that browser-based games like Whirled would see more play from work, but nope, I guess not.

I wonder what those little dips on 3/17, 3/18, and 3/20 are? I don’t think Whirled is a sharded game, so that can’t be a single shard crashing. Welp, I’ll never know, but that’s a great example of the sorts of things graphs show. If those were because of crashes, you’d know without needing graphs to tell you because your pager would go off, but if it’s something else you’d want to investigate. Could be a bug in your data collection, for that matter, but that’s bad too.

Less tech ops, but still interesting: the material on player acquisition is excellent. Read this if you want to know how to figure out the economics of a game. If I were Daniel James, I would also want to know how those retention cohorts broke down by play time and perhaps styles of play. What kinds of players stick around? Very important question. I believe strongly in the integration of billing metrics and operational metrics. That work is something that technical operations can drive if need be; all the data sources are within your control. It’s worth spending the time to whip up a prototype dashboard and pitch it to your CFO.

Then there’s a chunk of advice on building an in-world economy that relates to the real world. Heh: it’s MMO as platform again. Whirled is built on that concept, as I understand it. That dovetails nicely with his discussion of billing. When he says “Don’t build, but use a provider,” he is absolutely correct.

I love this slideshow. In the blog post surrounding it, he talks about how he feels it’s OK to give away the numbers. There are dangers in sharing subscriber numbers and concurrencies, particularly if you’re competing in the big traditional space, but I like seeing people taking the risk. There is plenty of room in the MMO space for more players and plain old numbers are not going to be the secret sauce that makes you rich. How you get those numbers is a different story. So thanks to Daniel for this.

I’m never sure how mystifying my job is to the average person. I do know that even technophiles don’t always really know what technical operations does beyond “they’re the guys who keep the servers running,” and I like talking about my job, so I figured I’d expand a bit on the brief blurb and talk about what a typical tech ops team does from time to time.

I’m going to try to use the term “technical operations” for my stuff, in the interests of distinguishing it from operations in general. When a business guy talks about operations, he’s probably talking about the whole gamut of running a game (or a web site, whatever). This includes my immediate bailiwick, but it also includes stuff like customer support, possibly community management, and in some cases even coders maintaining the game. It’s sort of a fuzzier distinction in online gaming; back in the wonderful world of web sites, there’s not a ton of distinction between development pre-launch and development post-launch. Gaming tends to think of those two phases as very different beasts, for mostly good reasons. Although I think some of that is carryover from offline games. I digress! Chalk that up for a later post.

So okay. My primary job is to keep servers running happily. The bedrock of this is the physical installation of servers in the data center. This post is going to be about how you host your servers.

Figure any MMO of any notable size will have… let’s say over 100 servers. This is conservative; World of Warcraft has a lot more than that. There’ll also be big exceptions. I think Puzzle Pirates is a significant MMO and given that it’s a 2D environment, it might be pretty small in terms of server footprint. Um, eight worlds — yeah, I wouldn’t be surprised if they were under 100. But figure we’re generally talking in the hundreds.

You don’t want to worry about the physical aspect of hosting that many servers, especially if you’re a gaming company, because then that’s really not your area of expertise. My typical evaluation of a hosting facility includes questions about how many distinct power grids the facility can access; if, say, Somerville has a power outage I’d like it if the facility could get power from somewhere else. I want to know how long the facility can go without power at all, and how often those backup generators are tested. I want to know how redundant the air conditioning systems are. I want to know how many staff are on site overnight. I want to know about a million things about their network connectivity to the rest of the world. This is all both expensive and hard to build, and why buy that sort of headache? There are companies who will do it for you, and it will be more cost effective, because they’re doing it on a larger scale.

If I’m starting from the ground up, step one is choosing the right hosting facility. Call it colocation if you like. Some people spell that collocation, which is not incorrect but which drives me nuts. (Sorry, Mike.) You start out with the evaluation… well, no. You start out by figuring what’s important to you. As with everything, you need to make the money vs. convenience vs. quality tradeoffs. A tier 1 provider like AT&T or MCI can be really good, but you’re going to pay more than you would for a second tier provider, and that’s not always a wise choice.

My full RFP (request for proposal) document is thousands of words of questions. I won’t reproduce the whole thing here. Suffice it to say that this choice is one of the most important ones you’re going to make. You do not want the pain of changing data centers once you’ve launched. Even once you’ve launched beta. It’s good to get this one right.

There’s also a fair amount of ongoing work that goes into maintaining the relationship, because the bill for hosting is one of your biggest monthly costs. Every month, you have to go over the bill and make sure you’re getting charged for all the right things. I have worked with a lot of colocation facilities and even the best of them screw up billing from time to time.

It’s also smart to basically keep in touch with your facility. You need to figure out who the right person is — probably your Technical Account Manager, maybe someone else. I’ve had relationships where the right guy to talk to was my sales guy, because he loved working with a gaming company and he was engaged enough to look at our bills himself every month to make sure they were right. You want to talk to someone at least once a month, in any case, for a bunch of reasons.

First off, if they’ve got concerns, it’s an avenue for them to express them informally. Maybe you’re using more power than you’re paying for. Maybe your cage is a mess, in which case shame on you and why didn’t you already know about it? But you never know. Maybe there’s a new customer that’s about to scoop up a ton of space in your data center and you won’t have expansion room available.

If you’re talking to your key people regularly, they’re going to keep you in mind when things like that last example happen. Often enough you can’t do anything about it; it’s still good to know.

Oh, and if your hosting provider has some sort of game-oriented group, latch onto it! AT&T has an absolutely great Gaming Core Team; when Turbine hooked up with them, our already good service got even better.

Like any relationship with any vendor, you’re going to get more out of it the more you put into it. You don’t stop worrying once you sign the contract.