Our third and final topic in this series is downtime. Every MMO player is used to downtime. Turbine games usually have weekly downtime. Blizzard takes World of Warcraft down most weeks. EVE Online goes down every day. It’s part of the whole MMO experience.
I believe that polish is a key part of a successful MMO. Ragged edges show and they turn off customers. This became obvious when Blizzard launched World of Warcraft and set the bar a lot higher while attracting several million new customers. I also believe that operational polish is part of that. Customer support polish is important. Tech ops polish is also important.
Downtime is not polish. Downtime is something we should avoid.
So, patching. We reboot when we patch for two reasons. One, the data files change, whether those are binary files or configuration files or databases, and the server software can’t load new data on the fly. Two, the server software itself changes.
Neither of these is entirely simple. As far as the first one goes, I’ve been known to claim that if a Web server can reload content without being restarted, a game server ought to be able to do the same. This is a misrepresentation on my part, because Web servers are stateless and game servers are exceedingly stateful. In order to solve the problem of transparent reloads for game servers, you need to figure out how you’re going to handle it when content changes while a user is accessing it.
I don’t think it’s impossible, however. My initial model would be something like Blizzard’s phasing technology, in which the same zone/area looks different depending on where you are in certain quest lines. Do the same thing, except that the phases are different content levels. You still run the risk of discontinuity: e.g., if the data for an instance changes while one person in the party is inside the instance and the others zone in afterwards, you have a party split between two instances.
Displaying a warning to the users is inelegant but does solve the problem. See also City of Heroes’ instanced zone design, where players may be in any of several versions of a given city area. I don’t have a better approach handy, and I do think that indicating the mismatch to the users is better than downtime, so that technique suffices for now.
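Here’s a rough Python sketch of the "phases as content levels" idea, including the party-split warning. Every name in it (ContentStore, Player, zone_in, party_mismatch) is made up for illustration, and it assumes each player gets pinned to whatever content version is live when they zone in. It isn’t how any particular game does it.

```python
from dataclasses import dataclass, field


@dataclass
class ContentStore:
    """Holds every content version that may still have players pinned to it."""
    versions: dict = field(default_factory=dict)  # version number -> data
    latest: int = 0

    def publish(self, data: dict) -> int:
        """Add a new content version without disturbing the older ones."""
        self.latest += 1
        self.versions[self.latest] = data
        return self.latest


@dataclass
class Player:
    name: str
    content_version: int = 0  # version pinned when the player zoned in


def zone_in(player: Player, store: ContentStore) -> None:
    """Pin the player to whatever content is current at zone-in time."""
    player.content_version = store.latest


def party_mismatch(party: list) -> bool:
    """True when party members are pinned to different content versions."""
    return len({p.content_version for p in party}) > 1


store = ContentStore()
store.publish({"boss_hp": 1000})
alice, bob = Player("alice"), Player("bob")
zone_in(alice, store)
store.publish({"boss_hp": 1200})  # hotfix lands while alice is inside
zone_in(bob, store)
if party_mismatch([alice, bob]):
    print("Warning: your party is split across content versions.")
```

Old versions stay loaded until nobody is pinned to them, which is the memory cost you pay for skipping the reboot.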
Any game which allows for hotfixes without the game going down already does this, of course. I can think of a couple that do it. I sort of feel like this should be the minimum target functionality for new games. I say target because unexpected issues can always arise, but it’s good to have a target nonetheless.
The second problem is trickier because it requires load balancing. Since games are stateful and require a persistent connection — or a really good simulation of one — you’re not going to be able to restart the server without affecting the people connected to it. The good news is that since we control the client/server protocol, we theoretically have the ability to play some clever tricks.
The specific trick I’d like to play is a handoff. I want to be able to tell all the clients connected to one instance of the server that they should go talk to a second instance of the server… now. Then I can take down the first instance of the server, do whatever I need to do, and reverse the process to upgrade the second instance of the server when I’m done.
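To make the handoff concrete, here’s a toy sketch in Python. ServerInstance, Client, drain_to and on_handoff are hypothetical names, and the "message" is just a method call. It also ignores the genuinely hard part, which is moving game state along with the connection.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    name: str
    server: str = ""

    def on_handoff(self, target: str) -> None:
        """Handle a hypothetical HANDOFF message by reconnecting elsewhere."""
        self.server = target


@dataclass
class ServerInstance:
    name: str
    clients: list = field(default_factory=list)

    def accept(self, client: Client) -> None:
        client.server = self.name
        self.clients.append(client)

    def drain_to(self, other: "ServerInstance") -> None:
        """Tell every connected client to move, then forget about them."""
        for client in self.clients:
            client.on_handoff(other.name)
            other.clients.append(client)
        self.clients = []


first, second = ServerInstance("game-a"), ServerInstance("game-b")
players = [Client("alice"), Client("bob")]
for player in players:
    first.accept(player)

first.drain_to(second)   # everyone moves; game-a can now be upgraded
moved_to = [p.server for p in players]
second.drain_to(first)   # reverse the process to upgrade game-b
```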
Load balancing is useful for more than server upgrades: it’d be great for hardware maintenance as well. What’s more, if the client is taking that particular cue from a separate cluster of servers, you could possibly do the same thing reactively. A piece of hardware goes down? Detect the fault, and have the load balancing cluster issue a directive to go use a different server.
I snuck in the assumption that the load balancing cluster would be a cluster. I think that’s semi-reasonable. It’s one of the functions I’d be inclined to farm out to HTTP servers of some flavor, because anything that’s an HTTP server can live behind a commercial-grade load balancer that the game studio doesn’t have to write. The drawback is that the load balancing is then a pull instead of a push: the client can check to see if anything’s changed, but the servers can’t tell the client anything when they haven’t checked.
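The pull-versus-push drawback is easier to see in code. This is a sketch only: Directory stands in for the HTTP cluster, PollingClient for the game client, and the zone and server names are invented. Time is passed in by hand so the example stays deterministic.

```python
class Directory:
    """Stand-in for the HTTP cluster that maps zones to game servers."""

    def __init__(self):
        self.assignments = {}

    def assign(self, zone, server):
        self.assignments[zone] = server

    def lookup(self, zone):
        # In a real deployment this would be an HTTP GET behind a load balancer.
        return self.assignments[zone]


class PollingClient:
    def __init__(self, zone, directory, poll_interval=5.0):
        self.zone = zone
        self.directory = directory
        self.poll_interval = poll_interval
        self.server = directory.lookup(zone)
        self.last_poll = 0.0

    def tick(self, now):
        """Poll if the interval has elapsed; return True if we had to move."""
        if now - self.last_poll < self.poll_interval:
            return False
        self.last_poll = now
        target = self.directory.lookup(self.zone)
        moved = target != self.server
        self.server = target
        return moved


directory = Directory()
directory.assign("zone-7", "game-01")
client = PollingClient("zone-7", directory, poll_interval=5.0)
directory.assign("zone-7", "game-02")  # ops reassigns the zone
early = client.tick(now=2.0)           # too soon to poll: client has not noticed
late = client.tick(now=5.0)            # poll fires: client moves
```

The client can be out of date for up to one poll interval, so the interval is a trade between how fast clients react and how much traffic the directory has to absorb.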
I suspect I’m being overly optimistic here, unfortunately. For one thing, it’s unclear that response times will be quick enough to avoid the users seeing some discontinuity. That might be tolerable, given that MMOs are relatively slack in their required response times, but I’m dubious. For another thing, the problem of maintaining state between two instances of a game server is really tough. You’d have to checkpoint the state of each individual server regularly. The length of time between checkpoints is the amount of rollback the players would perceive after a handoff. There’s an additional issue in that the state checkpoints would need to match the database checkpoints, or you’ll wind up with discontinuity there, which is worse. You really don’t want two servers to disagree about the state of the game.
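A small Python sketch shows why the checkpoint interval is the rollback window. CheckpointedServer and its methods are hypothetical, state is a plain dictionary, and the sketch says nothing about keeping checkpoints in step with the database.

```python
import copy


class CheckpointedServer:
    """Toy game server that snapshots its state at most once per interval."""

    def __init__(self, interval):
        self.interval = interval
        self.state = {}
        self.checkpoint = ({}, 0.0)  # (snapshot, time it was taken)

    def apply(self, now, key, value):
        """Apply a state change, then checkpoint if the interval has elapsed."""
        self.state[key] = value
        if now - self.checkpoint[1] >= self.interval:
            self.checkpoint = (copy.deepcopy(self.state), now)

    def fail_over(self, now):
        """Return (restored_state, seconds_of_lost_progress)."""
        snapshot, taken_at = self.checkpoint
        return copy.deepcopy(snapshot), now - taken_at


server = CheckpointedServer(interval=60)
server.apply(now=10, key="gold", value=5)   # no checkpoint yet
server.apply(now=60, key="gold", value=8)   # interval elapsed: checkpoint taken
server.apply(now=90, key="gold", value=12)  # only in live state
restored, lost = server.fail_over(now=100)
```

Here the player had 12 gold when the server went away and comes back with 8, having lost 40 seconds of play.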
A more realistic approach is something like what Second Life does. When they roll out new server software, they just do a rolling update. Each individual server reboots, which is a pretty quick process. The section of the game world handled by that server is inaccessible for that period of time. When it comes back up, it’s running the new code.
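As a sketch of what a rolling update loop looks like (not Linden Lab’s actual tooling, which I haven’t seen), assume each server is a dictionary and "rebooting" is instantaneous. The function records which versions are live after each step.

```python
def rolling_update(servers, new_version):
    """Restart servers one at a time, recording the version mix after each step."""
    history = []
    for server in servers:
        server["online"] = False         # this slice of the world is unreachable
        server["version"] = new_version  # stand-in for installing and rebooting
        server["online"] = True
        history.append(sorted({s["version"] for s in servers}))
    return history


cluster = [{"name": f"region-{i}", "version": "1.0", "online": True} for i in range(3)]
history = rolling_update(cluster, "1.1")
```

The history shows two versions live at once until the last server restarts, which is exactly the window in which the world rules can differ from one server to the next.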
There’s a small paradigm shift in that idea. Linden Lab doesn’t mind if Second Life’s world rules vary slightly between servers for a short period of time. In a more game-oriented virtual world, there are more implications there. I can easily envision exploitable situations. The question has to be whether or not those are so serious that it’s worth downtime. And of course, if the answer is ever yes, you can always revert to taking down everything at once.
The other implication of rolling updates is that client/server communications must be backward compatible. I’m not going to spend a lot of time talking about that, since it’s a good idea in any situation and the techniques for ensuring it are well-known. It’s one of those things which takes a little more effort but it’s worth doing. Not all polish is immediately obvious to the user.
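For anyone who hasn’t seen those techniques, two of the usual ones are version negotiation and tolerant parsing. The sketch below assumes a JSON message format and protocol versions 3 through 5, both invented for the example.

```python
import json

SUPPORTED = range(3, 6)  # hypothetical protocol versions 3, 4 and 5


def negotiate(client_versions):
    """Pick the highest protocol version both sides speak, or None."""
    common = set(client_versions) & set(SUPPORTED)
    return max(common) if common else None


def parse_move(raw):
    """Read a move message, defaulting new fields and ignoring unknown ones."""
    msg = json.loads(raw)
    return {
        "x": msg["x"],
        "y": msg["y"],
        "sprint": msg.get("sprint", False),  # field added in a later version
    }


old_client = parse_move('{"x": 1, "y": 2}')
new_client = parse_move('{"x": 1, "y": 2, "sprint": true, "emote": "wave"}')
```

An older client that never sends "sprint" still parses, and a newer client that sends an "emote" field this server doesn’t know about is not rejected.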
There’s one other forced reboot moment familiar to the MMO player, which is the client restart. That happens when there’s a client patch. I’m willing to admit defeat here because we have way less control over the user’s environment, and we don’t have (for example) a second instance of each desktop which can take over running the client while the first instance reboots.
On the other hand, reloading data files shouldn’t be any harder on a desktop than it is on a server — so if your patch is just data updates, many of the same techniques should be in place. Do it right, and the client is semi-protected against users randomly deleting data files in the middle of a game session too. Yeah, that doesn’t happen much, but I’m a perfectionist.
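One way to get that protection is to keep the last good copy of a data file in memory and only swap it out after a successful reload. This is a minimal sketch assuming JSON data files; DataFile and the file name are hypothetical.

```python
import json
import os
import tempfile


class DataFile:
    """Keeps the last successfully loaded copy of a data file in memory."""

    def __init__(self, path):
        self.path = path
        self.data = None

    def reload(self):
        """Try to load fresh data; on any failure keep serving the old copy."""
        try:
            with open(self.path, encoding="utf-8") as handle:
                fresh = json.load(handle)
        except (OSError, ValueError):  # missing, unreadable, or corrupt file
            return False
        self.data = fresh
        return True


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "items.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"sword": 10}, handle)
    items = DataFile(path)
    loaded = items.reload()        # first load succeeds
    os.remove(path)                # user deletes the file mid-session
    reloaded = items.reload()      # reload fails, old data survives
```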
Before I leave this topic, I should apologize to any server coders and architects who know better than me. My disclaimer: I’m a tech ops guy, not a programmer, and I know that means I miss some blindingly obvious things at times. Corrections are always welcome. This is written from the point of view of my discipline, and revolves around techniques I’d love to be able to use rather than what I absolutely know is possible.