One of the fun things about interviewing is the times you wind up geeking out about tech or business with a simpatico interviewer. I have been talking to a good number of such people this week, and one of the conversations I keep having is the one about using virtual servers in the cloud to handle MMO launch loads. This is a really good strategy for anything Web-based. Zynga used to launch and host new games on Amazon’s EC2, although they’ve pulled back on EC2 usage because it’s cheaper to run enough servers for baseline usage in their own data centers. There are, likewise, a ton of Web startups using EC2 or any of the other public clouds to minimize capital expenditure going into the launch of a new product.

Not surprisingly, people tend to wonder about using the same strategy for an MMO. Launch week for any MMO is a high demand time, so why not push the load into the cloud instead of buying too much hardware or stressing out the servers?

It’s an uptime problem. I don’t worry too much about the big outages — those are fluke occurrences and your datacenter’s network could have a problem of similar severity on your launch day. I do worry about the general uptime.

Web services are non-persistent. (Mostly. Yes, SPDY changes that.) A browser makes an HTTP request, and then can safely give up on the TCP connection to the Web server. State doesn’t live on individual Web servers, so if Web server one dies before the next HTTP request, you just feed the next request to Web server two and everyone’s happy.

MMO servers are persistent. (Mostly. I’m sure there are exceptions, which I would love to hear about.) The state of a given instance or zone is typically maintained on an instance/zone server. If that server dies, most likely the people in that instance are either kicked off the game or booted back to a different area. This is lousy MMO customer experience.

So that’s why I don’t think in terms of cloud for MMO launches. That said, I don’t think these are inherent problems with the concepts. Clouds can be more stable, and Amazon’s certainly working in that direction. For that matter, there are other cloud providers. Likewise, I can sketch out a couple of ways to avoid the persistence problem. I’m not entirely sure they’d work, but there are possibilities. You can at the least minimize the bad effects of random server crashes.

Still and all, it’s a lot trickier than launching anything Web-related in a public cloud.

See also Part I and Part II. Take the time to read ’em, I don’t mind. I need to go make coffee anyhow.

Our third and final topic in this series is downtime. Every MMO player is used to downtime. Turbine games usually have weekly downtime. Blizzard takes World of Warcraft down most weeks. EVE Online goes down every day. It’s part of the whole MMO experience.

I believe that polish is a key part of a successful MMO. Ragged edges show and they turn off customers. This became obvious when Blizzard launched World of Warcraft and set the bar a lot higher while attracting several million new customers. I also believe that operational polish is part of that. Customer support polish is important. Tech ops polish is also important.

Downtime is not polish. Downtime is something we should avoid.

So, patching. We reboot when we patch for two reasons. One, the data files change, whether those are binary files or configuration files or databases, and the server software can’t load new data on the fly. Two, the server software itself changes.

Neither of these is entirely simple. As far as the first one goes, I’ve been known to claim that if a Web server can reload content without being restarted, a game server ought to be able to do the same. This is a misrepresentation on my part, because Web servers are stateless and game servers are exceedingly stateful. In order to solve the problem of transparent reloads for game servers, you need to figure out how you’re going to handle it when content changes while a user is accessing it.

I don’t think it’s impossible, however. My initial model would be something like Blizzard’s phasing technology, in which the same zone/area looks different depending on where you are in certain quest lines. Do the same thing, except that the phases are different content levels. You still run the risk of discontinuity: e.g., if the data for an instance changes while one person in the party is inside the instance and the others zone in afterwards, you have a party split between two instances.
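
To make that concrete, here’s a minimal sketch of the idea, with every name invented for illustration: the zone server keeps each loaded content version around, new instances pin to whatever is current, and a hot-load just changes what “current” means.

```python
# Sketch of phasing-style content versions for live data reloads; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class ContentVersion:
    version: str
    spawn_tables: dict   # stand-in for whatever data the zone actually needs

class ZoneServer:
    def __init__(self, initial: ContentVersion):
        self.versions = {initial.version: initial}
        self.current = initial.version
        self.instance_pins = {}   # instance id -> the content version it was created with

    def load_patch(self, new: ContentVersion) -> None:
        """Hot-load new data without a restart; running instances keep their old version."""
        self.versions[new.version] = new
        self.current = new.version

    def create_instance(self, instance_id: str) -> ContentVersion:
        """New instances always start on the newest content."""
        self.instance_pins[instance_id] = self.current
        return self.versions[self.current]

    def content_for(self, instance_id: str) -> ContentVersion:
        """Players zoning into an existing instance see the version it was pinned to."""
        return self.versions[self.instance_pins[instance_id]]
```

The party-split discontinuity above is exactly the case where `instance_pins` and `current` disagree for members of the same group.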

Displaying a warning to the users is inelegant but does solve the problem. See also City of Heroes’ instanced zone design, where players may be in any of several versions of a given city area. I don’t have a better approach handy, and I do think that indicating the mismatch to the users is better than downtime, so that technique satisfies for now.

Any game which allows for hotfixes without the game going down already does this, of course. I can think of a couple that do it. I sort of feel like this should be the minimum target functionality for new games. I say target because unexpected issues can always arise, but it’s good to have a target nonetheless.

The second problem is trickier because it requires load balancing. Since games are stateful and require a persistent connection — or a really good simulation of one — you’re not going to be able to restart the server without affecting the people connected to it. The good news is that since we control the client/server protocol, we theoretically have the ability to play some clever tricks.

The specific trick I’d like to play is a handoff. I want to be able to tell all the clients connected to one instance of the server that they should go talk to a second instance of the server… now. Then I can take down the first instance of the server, do whatever I need to do, and reverse the process to upgrade the second instance of the server when I’m done.
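
Sketched out, the handoff is just one more message type in the protocol. The field names below are made up, but the shape is the point: the old server tells each client where to go and hands it a token the new server will accept.

```python
# Hypothetical handoff message for draining a server before maintenance.
import json

def make_handoff(new_host: str, new_port: int, session_token: str) -> bytes:
    """Server side: tell a connected client to reconnect to another instance of the server."""
    return json.dumps({
        "type": "handoff",
        "host": new_host,
        "port": new_port,
        "token": session_token,   # lets the second server resume the session
        "deadline_ms": 5000,      # reconnect within this window or be dropped
    }).encode()

def handle_message(raw: bytes, reconnect) -> None:
    """Client side: on a handoff, reconnect immediately instead of waiting for a disconnect."""
    msg = json.loads(raw)
    if msg["type"] == "handoff":
        reconnect(msg["host"], msg["port"], msg["token"])
```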

Load balancing is useful for more than server upgrades: it’d be great for hardware maintenance as well. What’s more, if the client is taking that particular cue from a separate cluster of servers, you could possibly do the same thing reactively: a piece of hardware goes down? Detect the fault, and have the load balancing cluster issue a directive to go use a different server.

I snuck in the assumption that the load balancing cluster would be a cluster. I think that’s semi-reasonable. It’s one of the functions I’d be inclined to farm out to HTTP servers of some flavor, because anything that’s an HTTP server can live behind a commercial-grade load balancer that the game studio doesn’t have to write. The drawback is that the load balancing is then a pull instead of a push: the client can check to see if anything’s changed, but the servers can’t proactively tell the client anything between checks.
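
The pull version is nothing fancier than the client polling a small HTTP endpoint for its current server assignment. Everything below, including the URL, is invented for illustration.

```python
# Hypothetical client-side poll of the load-balancing cluster over plain HTTP.
import json
import urllib.request

DIRECTORY_URL = "https://lb.example-mmo.com/assignment"   # placeholder endpoint

def current_assignment(character_id: str) -> tuple:
    """Ask the load-balancing cluster which game server this character should be using."""
    with urllib.request.urlopen(f"{DIRECTORY_URL}?character={character_id}", timeout=2) as resp:
        data = json.load(resp)
    return data["host"], data["port"]

# The client calls this every N seconds (and on connection errors) and reconnects
# if the answer changes; the servers never have to push anything to the client.
```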

I think I’m sort of being overly optimistic here, unfortunately. For one thing, it’s unclear that response times will be quick enough to avoid the users seeing some discontinuity. That might be tolerable, given that MMOs are relatively slack in their required response times, but I’m dubious. For another thing, the problem of maintaining state between two instances of a game server is really tough. You’d have to checkpoint the state of each individual server regularly. The length of time between checkpoints is the amount of rollback you’d be faced with from a perceptual standpoint. There’s an additional issue in that the state checkpoints would need to match the database checkpoints, or you’ll wind up with discontinuity there, which is worse. You really don’t want two servers to disagree about the state of the game.
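
To put the rollback window in concrete terms: a checkpoint loop like the sketch below (hooks and names invented) means a crash or handoff costs you at most the checkpoint interval in lost world state, and the snapshot has to carry the matching database position so the two can roll back together.

```python
# Toy periodic checkpoint of zone state; the interval is the worst-case rollback.
import pickle
import time

def checkpoint_loop(get_zone_state, get_db_position, save, interval_s: float = 30.0):
    """Snapshot in-memory state together with the DB position it corresponds to."""
    while True:
        snapshot = {
            "taken_at": time.time(),
            "zone_state": get_zone_state(),     # whatever the server holds in memory
            "db_position": get_db_position(),   # so game state and DB agree on rollback
        }
        save(pickle.dumps(snapshot))
        time.sleep(interval_s)
```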

A more realistic approach is something like what Second Life does. When they roll out new server software, they just do a rolling update. Each individual server reboots, which is a pretty quick process. The section of the game world handled by that server is inaccessible for that period of time. When it comes back up, it’s running the new code.
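
Operationally, a rolling update is just a loop over the shard’s servers with a health check between each one. A sketch, assuming you have drain/stop/deploy/start hooks of some kind:

```python
# Sketch of a rolling restart across region servers; the helper functions are assumed to exist.
import time

def rolling_update(servers, drain, stop, deploy, start, healthy, max_wait_s=300):
    for server in servers:
        drain(server)               # stop routing new players to this region
        stop(server)                # its slice of the world goes dark here
        deploy(server)              # new server build and data files
        start(server)
        waited = 0
        while not healthy(server):  # don't move on until it passes a health check
            time.sleep(5)
            waited += 5
            if waited >= max_wait_s:
                raise RuntimeError(f"{server} failed to come back; halt the rollout")
```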

There’s a small paradigm shift in that idea. Linden Lab doesn’t mind if Second Life’s world rules vary slightly between servers for a short period of time. In a more game-oriented virtual world, that has bigger implications. I can easily envision exploitable situations. The question has to be whether or not those are so serious that it’s worth downtime. And of course, if the answer is ever yes, you can always revert to taking down everything at once.

The other implication of rolling updates is that client/server communications must be backward compatible. I’m not going to spend a lot of time talking about that, since it’s a good idea in any situation and the techniques for ensuring it are well-known. It’s one of those things which takes a little more effort but is worth doing. Not all polish is immediately obvious to the user.

There’s one other forced reboot moment familiar to the MMO player, which is the client restart. That happens when there’s a client patch. I’m willing to admit defeat here because we have way less control over the user’s environment, and we don’t have (for example) a second instance of each desktop which can take over running the client while the first instance reboots.

On the other hand, reloading data files shouldn’t be any harder on a desktop than it is on a server — so if your patch is just data updates, many of the same techniques should be in place. Do it right, and the client is semi-protected against users randomly deleting data files in the middle of a game session too. Yeah, that doesn’t happen much, but I’m a perfectionist.

Before I leave this topic, I should apologize to any server coders and architects who know better than me. My disclaimer: I’m a tech ops guy, not a programmer, and I know that means I miss some blindingly obvious things at times. Corrections are always welcome. This is written from the point of view of my discipline, and revolves around techniques I’d love to be able to use rather than what I absolutely know is possible.

Part I of the series is here. In this part, I’ll get more technical.

I like having a checklist for the process I’m about to describe. It’s good to have whoever is executing each step checking off their work. It feels dull because it is dull, but it keeps fallible human beings from forgetting the one boring step they’ve done a hundred times before. It also instills a sense of responsibility. Either paper or electronic is fine, as long as the results are archived and each step plus the overall patch is associated with a specific person each time.

Once the patch is approved, it’ll need to be moved to the data center. As Joe notes in the post I linked in Part I, that can be a surprisingly long process. That’s a problem even if you aren’t doing continuous deployment, because there will come a time when you need to get a fix out super-quickly. The easy answer here is that patches shouldn’t be monolithic. Data files should be segmented such that you can push out a change in one without having to update the whole wad. The speed of the uplink to the datacenter is definitely something you should be thinking about as a tech ops guy, though. Find out how big patches could be, figure out how long it’ll take to get them uploaded in the worst case, and make sure people know about that.
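
The arithmetic is trivial, but it’s worth writing down next to your real numbers. The figures below are placeholders, not anyone’s actual uplink:

```python
# Back-of-the-envelope worst-case upload time; all of these numbers are illustrative.
patch_size_gb = 6      # hypothetical worst-case patch
uplink_mbps = 20       # hypothetical office-to-datacenter uplink
efficiency = 0.75      # assume you get ~75% of line rate in practice

seconds = (patch_size_gb * 8 * 1000) / (uplink_mbps * efficiency)
print(f"~{seconds / 60:.0f} minutes to move the worst-case patch")   # ~53 minutes here
```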

You might even want to have a backup plan. I’ve been in situations where it was quicker and more reliable to copy a patch to a USB drive, drive it to the datacenter, and pull it off the drive there. That’s really cheap — you can buy one at Best Buy and keep it around in case of emergency. Back at Turbine we routinely copied a patch to our portable drive just in case something went wrong with the main copy.

It may come in handy to be able to do Quality of Service on your office network, as well. At a game company, you need to expect that people will be playing games during work hours. This is a valid use of time, since it’s important to know what the competition is like. Still, it’s good to be able to throttle that usage if you’re trying to get the damned patch up as quick as possible to minimize server downtime. Or if the patch took a couple days extra to get through testing, but you’ve already made the mistake of announcing a patch date… yeah.

If your office is physically close to the data center, cost out a T1 line directly there. Then compare the yearly cost of the T1 to the cost of six hours of downtime. Also, if you have a direct connection into the data center, you can avoid some security concerns at the cost of some different ones.

Right. The files are now at the data center. You have, say, a couple hundred servers that need new files. The minimum functionality for an automated push is as follows (there’s a rough sketch of such a tool after the list):

  • Must be able to push a directory structure and the files within it to a specified location on an arbitrary number of servers.
  • Must be able to verify file integrity after the push.
  • Must be able to run pre-push and post-push scripts. (This sort of takes care of the second requirement.)
  • Must report on success or failure.
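
Here’s a rough sketch of that tool, leaning on ssh and rsync rather than anything exotic. The hosts, paths, and hook script names are placeholders, not a real deployment.

```python
# Minimal push tool sketch: rsync the tree out, verify checksums, run hook scripts, report.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)   # raise on any failure so it gets reported loudly

def tree_checksum(shell_prefix, directory):
    """Checksum every file under a directory and roll it up into one hash."""
    cmd = shell_prefix + [f"cd {directory} && find . -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum"]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def push_patch(hosts, local_dir, remote_dir):
    local_sum = tree_checksum(["sh", "-c"], local_dir)
    for host in hosts:
        run(["ssh", host, "/opt/my-mmo/hooks/pre-push"])                       # pre-push script
        run(["rsync", "-az", "--delete", f"{local_dir}/", f"{host}:{remote_dir}/"])
        if tree_checksum(["ssh", host], remote_dir) != local_sum:              # integrity check
            raise RuntimeError(f"checksum mismatch on {host}")
        run(["ssh", host, "/opt/my-mmo/hooks/post-push"])                      # post-push script
        print(f"{host}: ok")
```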

That’ll get you most of the way to where you need to go. The files should be pushed to a staging location on each server — best practice is to push to a directory whose name incorporates the version number. Something like /opt/my-mmo/patches/2009-03-22-v23456/ is good. Once everything’s pushed out and confirmed and it’s time to make the patch happen, you can run another command and automatically move the files from there into their final destination, or relink the data file directory to the new directory, or whatever. Sadly, right now, “whatever” probably includes taking the servers down. Make sure that the players have gotten that communication first; IMHO it’s better to delay a bit if someone missed sending out game alerts and forum posts. If your push infrastructure can do the pre-push and post-push scripts, you can treat this step as just another push, which is handy.
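
The relink option above can be as boring as repointing a symlink, which conveniently leaves the previous version sitting there for rollback. A sketch with made-up paths, assuming the live data directory is itself a symlink:

```python
# Activate a staged patch by atomically repointing the data symlink; paths are placeholders.
import os

def activate(version: str, base: str = "/opt/my-mmo") -> None:
    target = f"{base}/patches/{version}"    # e.g. /opt/my-mmo/patches/2009-03-22-v23456
    tmp_link = f"{base}/data.new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)                 # clear out any leftover from a failed run
    os.symlink(target, tmp_link)            # build the new link off to the side
    os.replace(tmp_link, f"{base}/data")    # atomic swap; "data" must already be a symlink
```

Rolling back is then just calling `activate()` with the previous version’s directory, plus whatever server restart your design still requires.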

This is often a time to do additional maintenance; e.g., taking full backups can happen during this downtime. You should absolutely do whatever’s necessary to ensure that you can roll back the patch, but you also want to keep downtime to a minimum.

Somewhere in here, perhaps in parallel, any data files or executables destined for the client need to be moved to the patch server. “Patch server” is a bit of a handwave. I think the right way to do this is to have one server or cluster responsible for telling the client what to download, and a separate set of servers to handle the downloads proper. That’ll scale better because functionality is separated.
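
The “tells the client what to download” half boils down to a manifest: the client reports its version, the control server answers with a file list and where to fetch it from, and the heavy lifting happens on the download tier. Everything below, versions and URLs included, is invented for illustration.

```python
# Hypothetical patch manifest: the control server says what to get, the download tier serves it.
MANIFESTS = {
    "1.3.0": {
        "next_version": "1.3.1",
        "download_base": "https://patch-cdn.example-mmo.com/1.3.1/",
        "files": [
            {"path": "data/zones/ashfall.pak", "sha256": "placeholder", "size": 48_213_992},
            {"path": "client/game.exe",        "sha256": "placeholder", "size": 31_002_112},
        ],
    },
}

def manifest_for(client_version: str):
    """Return the next patch this client should apply, or None if it's already current."""
    return MANIFESTS.get(client_version)
```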

If you use HTTP as the transport protocol for your client patches, you have a lot of flexibility as to where you host those patches. Patch volumes will be really high; most of your active customers will download the patches within a few hours after they go live. At Turbine, we found that it would take multiple gigabit network drops to handle patch traffic, which is way more than you need for day to day operations. You want the flexibility to deliver patches as Amazon S3 objects, or via a CDN like Akamai if you’re way rich. Using Amazon gives you Bittorrent functionality for free, which might save you some bandwidth costs. I wouldn’t expect to save a lot that way, for reasons of human nature.

Client patches can theoretically be pre-staged using the same basic approach used with server files: download early, move files into place as needed. If you’re really studly, your client/server communication protocol is architected with backward compatibility in mind. Linden Lab does this for Second Life — you can usually access new versions of the server with old clients. Let people update on their schedule, not yours. That also makes rollbacks easier, unless it’s the game client or data files which need to be rolled back. Client patching architecture should be designed to allow for those rollbacks as well.

Pushing files to patch servers might use the same infrastructure as pushing server and data files around. Akamai will pull files from a server inside your datacenter, as will most CDNs, so that’s easy. Pushing files to Amazon S3 would require a different process. Fortunately the Amazon API is not very hard to work with. Note that you still want that consistency check at the end of the push. You can do this by downloading the files from Amazon and comparing them with the ones you pushed up there.
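
With today’s tooling, pushing to S3 and then re-downloading for the consistency check is only a few lines. A sketch using boto3, with a placeholder bucket name:

```python
# Upload client patch files to S3, then pull them back down and compare checksums.
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-mmo-patches"   # placeholder bucket name

def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def push_and_verify(local_path: str, key: str) -> None:
    s3.upload_file(local_path, BUCKET, key)
    verify_path = local_path + ".verify"
    s3.download_file(BUCKET, key, verify_path)       # don't trust the upload; check it
    if md5_of(verify_path) != md5_of(local_path):
        raise RuntimeError(f"consistency check failed for {key}")
```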

Once everything’s in place, if you’ve taken the servers down, you run one more consistency check to make sure the files in place are the ones you want. Then you bring the servers back up. They should come back up in a locked state, whether that’s a per-shard configuration or a knob you turn on the central authentication server. (Fail-safe technique: insist that servers come up locked by default, and don’t open to customers until someone types an admin command.)
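
The fail-safe is nothing more than a default plus an explicit admin action. A toy version, with all names invented:

```python
# Servers come up locked by default; only an explicit admin command opens them.
class Shard:
    def __init__(self):
        self.locked = True                     # fail-safe default after every restart

    def admin_unlock(self, operator: str) -> None:
        print(f"shard opened by {operator}")   # keep a record of who opened the floodgates
        self.locked = False

    def allow_login(self, is_staff: bool) -> bool:
        return is_staff or not self.locked     # tech ops and QA can get in while locked
```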

Tech ops does the first login. If that sniff test goes well, QA gets a pass at it. This will include client patching, which is a second check on the validity of those files. Assuming all this goes well, the floodgates open and you’re done. Assuming no rollbacks are needed.

After you’re done, you or your designate sits in the war room watching metrics and hanging out with everyone else in the war room. The war room is a good topic for another post; it’s a way to have everyone on alert and to have easy access to decision-makers if decisions need to be made. It’s usually quiet. Sometime in the evening the war room captain says you’re really done, and this time you can go home.

Part III of this series will be a discussion of patch downtime, and MMO downtime in general.

Chris asked about patching the game in comments, which dovetails nicely with this post. I have a nit to pick with the theory of continuous deployment, but that’ll wait a post or two.

Joe’s outline of release management focuses mostly on the engineering and QA side of the house, which makes sense. The Flying Lab process is very similar to the Turbine process as far as that goes. I’m going to get into the tech ops aspects of patching in the next post, but in this one I want to cover some business process and definitions. Oh, and one side note: patch, hotfix, content update, content push, whatever you want to call it. If you’re modifying the game by making server or client changes, it’s a patch from the operational perspective.

Roughly speaking, you can divide a patch into four potential parts. Not all patches will necessarily need each of these parts. Depending on your server and client design, you may have to change all of these concurrently, but optimally they’re independent.

Part one is server data, which could come in any number of forms. Your servers might use binary data files. They might use some sort of flat text file — I bet there’s someone out there doing world data in XML. I know of at least one game that kept all the data in a relational database. It all boils down to the data which defines the world.

I suppose that in theory, and perhaps in practice, game data could be compiled into the server executable itself. This is suboptimal because it removes the theoretical ability to reload game data on the fly without a game server restart. Even if your data files are separate, you may not be able to do a reload on the fly, but at least separation should make it easier to rework the code to do the right thing later on. There will be more on this topic at a later date.

Part two is the server executable itself. This doesn’t change as often; maybe just when the game introduces new systems or new mechanics. Yay for simplicity. I am pretending that there aren’t multiple pieces of software which make up your game shard, which is probably untrue, but the principle is the same regardless.

Parts three and four split the same way, but apply to the client: client data files and client executables. Any given game may or may not use the same patching mechanism for these two pieces. The distribution method is likely to be the same, but it’s convenient to be able to handle data files without client restarts for the same reason you want to be able to update game data without a server restart.

I prefer to be involved with the release process rather than just pushing out code as it’s thrown over the wall. My job is to keep the servers running happily; at the very least, the more I know about what’s happening, the better I can react to problems. One methodology I’ve used on past games: have a release meeting before the patch hits QA. Break down each change in the patch, and rate each one for importance — how much do we need this change? — and risk. Then when the patch comes out of QA, go back and do the same breakdown. QA will often have information which changes the risk factor, and sometimes that means you don’t want to make a specific change after all. Sometimes the tech ops idea of risk is different from engineering’s idea of risk, for perfectly valid reasons. The second meeting either says “yep, push it!” or “no, don’t push it.” If it’s a no, generally that means you decided to drop some changes and do another QA round.

Meetings like that include QA, engineering, whoever owns the continued success of the game (i.e., a producer or executive producer), community relations, and customer support. You can fold the rest of the go/no-go meeting process into this meeting as well. There’s a checklist: do we have release notes for players? Is the proposed date of the push a bad one for some reason? Etc.

I haven’t mentioned the public test server, but that should happen either as part of the QA process or as a separate step in the process. I tend to think that you benefit from treating public test servers as production, which may mean that your first patch meeting in the cycle also formally approves the patch going to public test. You might have quickie meetings during the course of the QA cycle to push out new builds to test as well.

Tomorrow: nuts and bolts.

Relational databases: please no!

OK. It is completely obvious that any MMO is going to need a way to store data. I understand that the instinctive reaction is to use a relational database, because that’s what relational databases are for. However, I beg of you as the guy who needs to keep the things running fast and smooth, think twice.

Yes, you get the ease of writing code against a mature system. You also get slower response times and more fiddly parts. If you haven’t hired a really good DBA to work on schema design, you are going to find out all about the ways in which relational databases can be slow under load.

Relational databases are really good at transactional integrity. It’s probably worth thinking about whether or not that’s a key feature. Most games don’t implement it very well right now. If the game crashes five seconds after you kill a monster, do you really feel confident that the loot will be in your bags when you log back in? I don’t.

Besides, you can get transactional integrity out of non-relational databases too. I’m going to cite a bunch of systems with weaker integrity later on, but they’re weaker because they’re big distributed systems spanning multiple datacenters. Games typically do not have that problem.

Relational databases do not scale cheaply. They can scale well, particularly if you bite the bullet and pay for Microsoft SQL or Oracle or something. They don’t scale cheaply.

It’s abominably easy to write bad relational database code. The problem here is that the failure state is not obvious. Bad SQL code reveals problems under load when there’s a lock on one table because transaction A is in process, which blocks transaction B, which happens to be the transaction responsible for showing your player what she’s got in her mailbox. She now waits 30 seconds for the mailbox to load and bitches on the forums. These load problems happen to be one of the things that’s hard to test programmatically. Experienced SQL programmers won’t make those mistakes, but inexperienced SQL programmers may not realize how easy they are to make.
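
If you want to see the failure mode on your desk, the toy below reproduces it with sqlite3: one connection holds a write transaction open, and the second connection’s work sits there until it times out. It’s an illustration, not how a shard database would actually be built.

```python
# Toy demonstration of lock contention: transaction A blocks transaction B.
import sqlite3

db = "mailbox.db"
conn_a = sqlite3.connect(db, timeout=1.0)
conn_a.execute("CREATE TABLE IF NOT EXISTS mailbox (player TEXT, item TEXT)")
conn_a.commit()

# Transaction A: a write transaction that stays open while "application logic" runs.
conn_a.execute("BEGIN IMMEDIATE")
conn_a.execute("INSERT INTO mailbox VALUES ('alice', 'sword')")

# Transaction B: the mailbox update from another connection, which needs the same lock.
conn_b = sqlite3.connect(db, timeout=1.0)
try:
    conn_b.execute("BEGIN IMMEDIATE")
    conn_b.execute("INSERT INTO mailbox VALUES ('alice', 'shield')")
except sqlite3.OperationalError as err:
    print("mailbox blocked:", err)   # "database is locked" once the timeout expires

conn_a.commit()   # only now could B's work go through
```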

Relational databases are not maximally fast. They can’t be, because one of the big concerns is the above-mentioned transactional integrity, and that takes time. If you’ve got some dataset that’s primarily read-only and isn’t too horrendously large, use an in-memory data store.
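
For the mostly read-only stuff (item templates, quest text, loot tables), an in-memory data store can be as simple as loading it into a dict at startup. A sketch with an invented file name:

```python
# Load static game data into memory at startup; lookups never touch a database.
import json

class StaticData:
    def __init__(self, path: str = "data/item_templates.json"):   # placeholder path
        with open(path) as f:
            self._items = {row["id"]: row for row in json.load(f)}

    def item(self, item_id: int) -> dict:
        return self._items[item_id]   # a dictionary lookup: no locks, no round-trips
```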

It is significant that the big players in the Web space all use non-relational databases heavily. I suspect this is in part the legacy of search — back at AltaVista, we’d have been somewhat horrified at the idea of using Oracle to serve up search queries. The smart people who did the original work at AltaVista, Google, and Yahoo wrote their own data stores, optimized for returning search results quickly.

This attitude stuck. I know a possibly apocryphal story about Yahoo, which claims that user data was just stored in a standard filesystem. The code was supposedly hacked such that niceties like file update times weren’t stored anywhere. Speed was all. Whether or not that’s true, Yahoo’s still using speed-oriented database code (warning: PDF). So is Google. Amazon does the same sort of thing.

Now, OK, we’re not gonna write our own serious database systems. CouchDB and HBase probably aren’t there yet. I still really wish more people would ask if they need a relational DB. If you really do need one, make sure you’ve hired someone who’s worked with large systems before and remember that 95% of Web work is smaller than anything you’ll be doing.