Part I of the series is here. In this part, I’ll get more technical.
I like having a checklist for the process I’m about to describe. It’s good to have whoever is executing each step checking off their work. It feels dull because it is dull, but it keeps fallible human beings from forgetting the one boring step they’ve done a hundred times before. It also instills a sense of responsibility. Either paper or electronic is fine, as long as the results are archived and each step plus the overall patch is associated with a specific person each time.
Once the patch is approved, it’ll need to be moved to the data center. As Joe notes in the post I linked in Part I, that can be a surprisingly long process. That’s a problem even if you aren’t doing continuous deployment, because there will come a time when you need to get a fix out super-quickly. The easy answer here is that patches shouldn’t be monolithic. Data files should be segmented such that you can push out a change in one without having to update the whole wad. The speed of the uplink to the datacenter is definitely something you should be thinking about as a tech ops guy, though. Find out how big patches could be, figure out how long it’ll take to get them uploaded in the worst case, and make sure people know about that.
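The worst-case number is easy to estimate up front. Here is a minimal sketch, assuming you know the patch size and the nominal uplink rate. The function name and the `efficiency` fudge factor (protocol overhead, other traffic on the line) are illustrative assumptions, not measured values.

```python
def upload_seconds(patch_bytes: int, uplink_bits_per_sec: float,
                   efficiency: float = 0.8) -> float:
    """Estimate how long a patch takes to upload over the office uplink.

    efficiency is an assumed fraction of the nominal line rate that you
    actually get; measure your own link rather than trusting the default.
    """
    if uplink_bits_per_sec <= 0:
        raise ValueError("uplink rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (patch_bytes * 8) / (uplink_bits_per_sec * efficiency)


if __name__ == "__main__":
    # Hypothetical worst case: a 2 GiB patch over a 10 Mbit/s uplink.
    seconds = upload_seconds(2 * 1024**3, 10_000_000)
    print(f"worst case: about {seconds / 60:.0f} minutes")
```

Run it against the biggest patch you can imagine shipping, and publish the answer so nobody is surprised on patch day.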
You might even want to have a backup plan. I’ve been in situations where it was quicker and more reliable to copy a patch to a USB hard drive, drive it to the data center, and pull the files off the drive there. That’s really cheap: you can buy a drive at Best Buy and keep it around in case of emergency. Back at Turbine we routinely copied a patch to our portable drive just in case something went wrong with the main copy.
It may come in handy to be able to do Quality of Service on your office network, as well. At a game company, you need to expect that people will be playing games during work hours. This is a valid use of time, since it’s important to know what the competition is like. Still, it’s good to be able to throttle that usage if you’re trying to get the damned patch up as quickly as possible to minimize server downtime. Or if the patch took a couple days extra to get through testing, but you’ve already made the mistake of announcing a patch date… yeah.
If your office is physically close to the data center, cost out a T1 line directly there. Then compare the yearly cost of the T1 to the cost of six hours of downtime. Also, if you have a direct connection into the data center, you can avoid some security concerns at the cost of some different ones.
Right. The files are now at the data center. You have, say, a couple hundred servers that need new files. The minimum functionality for an automated push is as follows:
- Must be able to push a directory structure and the files within it to a specified location on an arbitrary number of servers.
- Must be able to verify file integrity after the push.
- Must be able to run pre-push and post-push scripts. (This sort of takes care of the second requirement.)
- Must report on success or failure.
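The four requirements above can be sketched in a few dozen lines. This is a toy under stated assumptions, not a real push system: each "server" is just a local directory standing in for a remote host, and `tree_digests`, `push`, and the `pre`/`post` hooks are hypothetical names. A real tool would do the copy over the network, but the shape is the same.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Optional

Hook = Optional[Callable[[Path], None]]


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def push(source: Path, targets: list[Path],
         pre: Hook = None, post: Hook = None) -> dict[str, str]:
    """Copy source to every target, verify it, and report per target."""
    expected = tree_digests(source)
    report: dict[str, str] = {}
    for target in targets:
        try:
            if pre:
                pre(target)  # pre-push script, e.g. check free disk space
            shutil.copytree(source, target, dirs_exist_ok=True)
            # Integrity check: the target must match the source exactly.
            # Stray files already in the target also count as a mismatch.
            if tree_digests(target) != expected:
                raise RuntimeError("digest mismatch after push")
            if post:
                post(target)  # post-push script
            report[str(target)] = "ok"
        except Exception as exc:  # one bad server must not stop the rest
            report[str(target)] = f"failed: {exc}"
    return report
```

Note that a failure on one target is recorded and the loop moves on. With a couple hundred servers you want the full report, not a stack trace from server number three.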
That’ll get you most of the way to where you need to go. The files should be pushed to a staging location on each server — best practice is to push to a directory whose name incorporates the version number. Something like /opt/my-mmo/patches/2009-03-22-v23456/ is good. Once everything’s pushed out and confirmed and it’s time to make the patch happen, you can run another command and automatically move the files from there into their final destination, or relink the data file directory to the new directory, or whatever. Sadly, right now, “whatever” probably includes taking the servers down. Make sure that the players have gotten that communication first; IMHO it’s better to delay a bit if someone missed sending out game alerts and forum posts. If your push infrastructure can do the pre-push and post-push scripts, you can treat this step as just another push, which is handy.
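The "relink" option is worth a quick sketch, since it also makes rollback cheap. This assumes a POSIX filesystem, where renaming over an existing symlink is atomic, and it assumes the live data directory is a symlink rather than a real directory. The `activate` name and the `current` link are hypothetical.

```python
import os
from pathlib import Path


def activate(staged: Path, current_link: Path) -> None:
    """Atomically repoint current_link at an already-staged patch directory."""
    if not staged.is_dir():
        raise FileNotFoundError(f"staged patch not found: {staged}")
    # Build the new link under a temporary name, then rename it over the
    # old one. Readers see either the old target or the new one, never a gap.
    tmp = current_link.with_name(current_link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(staged, tmp, target_is_directory=True)
    os.replace(tmp, current_link)
```

Rolling back is the same call pointed at the previous version's directory, which is one more reason to keep the old staged directories around for a while.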
This is often a time to do additional maintenance; e.g., taking full backups can happen during this downtime. You should absolutely do whatever’s necessary to ensure that you can roll back the patch, but you also want to keep downtime to a minimum.
Somewhere in here, perhaps in parallel, any data files or executables destined for the client need to be moved to the patch server. “Patch server” is a bit of a handwave. I think the right way to do this is to have one server or cluster responsible for telling the client what to download, and a separate set of servers to handle the downloads proper. That’ll scale better because functionality is separated.
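To make that split concrete, here is a minimal sketch of the "what to download" half. It assumes a manifest that maps file paths to content digests, and a client that reports the digests of the files it already has. The manifest format, the `files_to_download` name, and the download URL are all hypothetical.

```python
def files_to_download(manifest: dict[str, str],
                      client_files: dict[str, str],
                      download_base: str) -> list[str]:
    """Return download URLs for files the client is missing or has stale.

    manifest and client_files map relative path -> content digest. The
    URLs point at a separate download host, so the service answering
    this question never serves the bulk data itself.
    """
    base = download_base.rstrip("/")
    return [
        f"{base}/{path}"
        for path, digest in sorted(manifest.items())
        if client_files.get(path) != digest
    ]
```

The answer is a list of plain URLs, so the download side can be anything that serves files.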
If you use HTTP as the transport protocol for your client patches, you have a lot of flexibility as to where you host those patches. Patch volumes will be really high; most of your active customers will download the patches within a few hours after they go live. At Turbine, we found out that it would take multiple gigabit network drops to handle patch traffic, which is way more than you need for day-to-day operations. You want the flexibility to deliver patches as Amazon S3 objects, or via a CDN like Akamai if you’re way rich. Using Amazon gives you BitTorrent functionality for free, which might save you some bandwidth costs. I wouldn’t expect to save a lot that way, for reasons of human nature.
Client patches can theoretically be pre-staged using the same basic approach used with server files: download early, move files into place as needed. If you’re really studly, your client/server communication protocol is architected with reverse compatibility in mind. Linden Lab does this for Second Life: you can usually access new versions of the server with old clients. Let people update on their schedule, not yours. That also makes rollbacks easier, unless it’s the game client or data files which need to be rolled back. Client patching architecture should be designed to allow for those rollbacks as well.
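One simple way to get that kind of compatibility is for the server to accept a range of client protocol versions instead of exactly one. The sketch below illustrates the idea only. It is not a description of how Linden Lab does it, and the `ServerProtocol` class and the version numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerProtocol:
    """The window of client protocol versions a server will talk to."""
    current: int           # newest protocol version the server speaks
    oldest_supported: int  # oldest client version still accepted

    def accepts(self, client_version: int) -> bool:
        """True if a client at this version may connect without patching."""
        return self.oldest_supported <= client_version <= self.current


if __name__ == "__main__":
    server = ServerProtocol(current=12, oldest_supported=10)
    for version in (9, 10, 12, 13):
        print(version, "ok" if server.accepts(version) else "must patch")
```

Widening the window on the server side is what lets players patch on their own schedule. Narrowing it is how you eventually retire old clients.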
Pushing files to patch servers might use the same infrastructure as pushing server and data files around. Akamai will pull files from a server inside your datacenter, as will most CDNs, so that’s easy. Pushing files to Amazon S3 would require a different process. Fortunately the Amazon API is not very hard to work with. Note that you still want that consistency check at the end of the push. You can do this by downloading the files from Amazon and comparing them with the ones you pushed up there.
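The download-and-compare check might look something like this. It is a sketch that is deliberately agnostic about storage: `fetch` is a hypothetical callable that returns the remote bytes for a key, and you would wrap whatever client you actually use to talk to S3 or your CDN origin.

```python
import hashlib
from pathlib import Path
from typing import Callable


def verify_upload(local_root: Path, keys: list[str],
                  fetch: Callable[[str], bytes]) -> list[str]:
    """Return the keys whose remote copy is missing or differs from local."""
    mismatched = []
    for key in keys:
        local_digest = hashlib.sha256((local_root / key).read_bytes()).hexdigest()
        try:
            remote_digest = hashlib.sha256(fetch(key)).hexdigest()
        except Exception:
            # A failed download counts as a failed check, not a crash.
            remote_digest = None
        if remote_digest != local_digest:
            mismatched.append(key)
    return mismatched
```

An empty list means the push is good. Anything else is a list of files to re-upload before the patch goes live.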
Once everything’s in place, if you’ve taken the servers down, you run one more consistency check to make sure the files in place are the ones you want. Then you bring the servers back up. They should come back up in a locked state, whether that’s a per-shard configuration or a knob you turn on the central authentication server. (Fail-safe technique: insist that servers come up locked by default, and don’t open to customers until someone types an admin command.)
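The fail-safe is mostly a matter of picking the right default. Here is a minimal sketch, assuming a per-shard lock flag and a staff bit on accounts. The `Shard` class and its methods are hypothetical.

```python
class Shard:
    """A game server that refuses customer logins until explicitly opened."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.locked = True  # fail-safe: every restart comes up locked
        self.unlocked_by = None

    def can_login(self, is_staff: bool) -> bool:
        """Staff (tech ops, QA) can always get in; customers only when open."""
        return is_staff or not self.locked

    def unlock(self, operator: str) -> None:
        """The admin command: a named person opens the floodgates."""
        if not operator:
            raise ValueError("unlock requires a named operator")
        self.locked = False
        self.unlocked_by = operator
```

Because the lock lives in the constructor rather than in a config file, forgetting a step leaves customers locked out instead of letting them into an unverified patch.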
Tech ops does the first login. If that sniff test goes well, QA gets a pass at it. This will include client patching, which is a second check on the validity of those files. Assuming all this goes well, the floodgates open and you’re done. Assuming no rollbacks are needed.
After you’re done, you or your designate sits in the war room watching metrics and hanging out with everyone else in the war room. The war room is a good topic for another post; it’s a way to have everyone on alert and to have easy access to decision-makers if decisions need to be made. It’s usually quiet. Sometime in the evening the war room captain says you’re really done, and this time you can go home.
Part III of this series will be a discussion of patch downtime, and MMO downtime in general.