Imaginary Cogs

On the operation of massively multiplayer online games.
  • Nik Davidson

    Posted on May 11th, 2011 Bryant

    Nik Davidson is an old friend of mine from Turbine, who has just started blogging. He is articulate, intelligent, and witty. Plus he’s updating more often than I am. You know what to do.

  • What Do I Do? Change Management

    Posted on July 20th, 2010 Bryant

    The single biggest favor a system administrator can do herself is to make sure she knows exactly what state her servers are in at all times. Inevitably, you’re going to make a change that screws something up. It could be a configuration change, it could be pushing updated code or content, or it could be some weird random “harmless” tweak. The IT Process Institute believes that 80% of all outages are the result of changes to the system. If you don’t know what changes you’ve made, troubleshooting becomes much more difficult.

    Thus, one of the fundamental underpinnings of any Ops organization is a change management system. You always have one.

    Way too often, particularly in small shops, your change management system is your memory. This is the bit where something breaks and you sit there and you go “Oh, wait, I pushed out a new version of the kernel to that server on Monday.” Personally, I find I can keep a pretty good model of the state of ten servers in my head. “Pretty good” is nowhere near good enough. Still, it’s a very tempting model if you’re the kind of guy who likes coming up with snap answers in the middle of a crisis. Dirty secret: lots of us Ops guys are like that.

    It’s better if you have a real record. There are a ton of ways to do this. One of the advantages to many commercial ticketing systems is integrated change management — you can tie a ticket to a server so that you can look up the list of changes to that server quickly and easily. Wikis work fairly well if there’s a real commitment to keeping them updated. Email lists are harder to search but they’re at least something.

    Any change that can be managed as part of a source control system probably ought to be. This includes most configuration files, and you can do a lot with configuration files. There’ll be a followup post to this one talking about configuration management — a tightly related subject — and that’ll shed some light on those possibilities. Once something’s managed via source control, you can not only easily look up changes, you can revert to old versions quickly.

    The other side of this is troubleshooting. When something breaks, if you have good change management, you can start with the most recent change, think about whether or not it’s potentially related, and back it out if necessary. Having an ordered list of all changes for the server or system is crucial; this is why something more formalized than a mailing list or a bunch of IRC logs is good.
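
    To make the "ordered list of changes" idea concrete, here is a minimal sketch in Python. The `Change` and `ChangeLog` names, the server names, and the sample entries are all invented for illustration; a real shop would pull this from a ticketing system or source control rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Change:
    when: datetime
    server: str
    summary: str


@dataclass
class ChangeLog:
    changes: list = field(default_factory=list)

    def record(self, when, server, summary):
        self.changes.append(Change(when, server, summary))

    def suspects(self, server, outage_start):
        """Changes to `server` made before the outage, most recent first."""
        relevant = [c for c in self.changes
                    if c.server == server and c.when <= outage_start]
        return sorted(relevant, key=lambda c: c.when, reverse=True)


# Hypothetical history for two hypothetical servers.
log = ChangeLog()
log.record(datetime(2010, 7, 12, 9, 0), "shard01", "pushed new kernel")
log.record(datetime(2010, 7, 14, 15, 30), "shard01", "made logging more verbose")
log.record(datetime(2010, 7, 15, 11, 0), "shard02", "updated game content")

# Walk the list from the newest change backwards when shard01 breaks.
for change in log.suspects("shard01", datetime(2010, 7, 16, 3, 0)):
    print(change.when.isoformat(), change.summary)
```

    The point of the sort is the troubleshooting order: the newest change to the broken server comes up first, and changes to other servers stay out of the way.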

    The non-technical side of this is controlling change. I tend to come down strongly on the side of slow, measured changes, particularly in the MMO environment. MMOs have two interesting qualities that affect my thinking in this regard: first, the pace of development is relatively quick. We’re already pushing code at a rate that would startle more traditional shops. (Although to be fair, we’re slower than the more agile-oriented Web 2.0 sites.) Second, the face we present to the world is relatively monolithic. We do not have easy access to the same techniques Web sites use to test changes on small groups of users, for example. Games could be built with that potential in mind, but there are social issues involved with giving content to some customers early.

    Because of the first quality, there’s a tendency to want to move very quickly. Network Operations is a balancing point. It’s bad to push back for the sake of pushing back, but it’s good to remember that you’re allowed to provide input into the push process. Since you own the concrete task of moving the bits onto the servers, you’re the last people who can say no, which is a meaningful responsibility.

    Because of the second quality, problems tend to be more visible to the customers than they would be if you could roll them out gently.

    There’s also a general reason for going slowly. Not all change-generated outages manifest immediately. Sometimes you’ll make a change and the outage won’t occur for days. Easy example: changing logging to be more verbose. If it’s lots more verbose, you may run out of disk space, but it might take four days for that to happen. When you’re walking down the list of changes, the more changes that happened between then and now, the harder it’ll be to pinpoint the root cause. Your search space is bigger.

    And yeah, that one is a trivial example that wouldn’t be so hard to troubleshoot in practice. The principle is solid, however.
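
    The arithmetic behind the logging example is simple enough to sketch. The figures below (40 GB free, half a gigabyte of logs per day before the change, 10 GB per day after) are assumptions picked to reproduce the four-day delay, not measurements from any real server.

```python
def days_until_full(free_gb, old_gb_per_day, new_gb_per_day):
    """Return (days before the change, days after) until the disk fills."""
    return free_gb / old_gb_per_day, free_gb / new_gb_per_day


# Assumed numbers: 40 GB free, logs grow from 0.5 GB/day to 10 GB/day.
before, after = days_until_full(free_gb=40, old_gb_per_day=0.5, new_gb_per_day=10)
print(f"before the change: {before:.0f} days; after: {after:.0f} days")
```

    With those assumed numbers the disk goes from nearly three months of headroom to four days, and every other change made in those four days lands on the suspect list.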

    All that said, I’ll let the Agile Operations/Devops guys make the case for more speedy deployments. Especially since I think they’re making huge strides in techniques that I badly want to adopt. (This presentation deserves its own post.)

    Let us assume, in any case, that we’re looking to control change. The process side of change management is the methods by which you discuss, approve, and schedule changes. There is no right way to do this, although there are wrong ways. If your Net Ops group is getting a code push and only finds out what’s in the push at that point, that’d be a wrong way.

    One right way: have a periodic change control meeting. Invite Net Ops, Engineering, QA, Customer Support, Community Relations, Release Management, and producers. Some of those may overlap. Also, see if you can get Finance to show up. It’s nice to have someone who can authoritatively talk about how much money the company loses if the servers are down for a day.

    Go over the list of proposed changes for the next change window. Rate ‘em on importance and risk. Decide which ones you’re doing. The answer is probably all of them, but the question shouldn’t be a mere formality. If there are blockers which can be removed before the change window, schedule a followup to confirm that they’ve been removed.
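
    One way to keep the rating step from becoming a formality is to write the ratings down and sort by them. This is only a sketch: the 1 to 5 scales, the `ProposedChange` fields, and the sample changes are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass


@dataclass
class ProposedChange:
    title: str
    importance: int      # 1 (nice to have) .. 5 (must ship)
    risk: int            # 1 (trivial) .. 5 (could take the game down)
    blockers: tuple = () # open issues that must clear before the window


def review_order(changes):
    """Riskiest first, then most important, so hard calls get made early."""
    return sorted(changes, key=lambda c: (-c.risk, -c.importance))


def needs_followup(changes):
    """Changes with blockers that need a confirmation meeting."""
    return [c for c in changes if c.blockers]


# Hypothetical agenda for one change window.
agenda = [
    ProposedChange("rotate log files", importance=2, risk=1),
    ProposedChange("new database schema", importance=5, risk=4,
                   blockers=("QA sign-off",)),
    ProposedChange("kernel update", importance=3, risk=4),
]

for change in review_order(agenda):
    print(change.risk, change.importance, change.title)
```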

    Make changes! That’s really the fun part.

    The periodicity of the meeting is subject to discussion. I’ve done it as a weekly meeting, I’ve done it as a monthly pre-update meeting, and I think it might be useful to do it as a daily meeting. If your company is into agile development, and does daily standups, it makes sense to fit it into that structure. You also need to be willing to do emergency meetings, because this structure will fall apart if it doesn’t have flexibility built in. There are going to be emergency changes. I like to account for that possibility.

    That’s change management in a nutshell. Coming soon, some discussion of tools.

  • Obligatory iPad Post

    Posted on July 12th, 2010 Bryant

    Attention conservation notice: this doesn’t have much to do with MMOs.

    A JavaStation-10, aka JavaStation-NC.

    A decade or so ago, I worked on Sun’s internal JavaStation deployment project. Our mandate was to deploy a few thousand of these things throughout Sun as replacement desktops. It was a lot of fun, I learned a lot, and I got to travel some. I think that was my first business travel, in fact.

    The retail cost of a JavaStation-10 was around $750, if I recall correctly. It was a great concept, because it was a great task station. You could do email, calendaring, and Web browsing. The OS was slower than you’d like, but I found it perfectly reasonable unless you were asking it to do X-Windows via Citrix or something like that. Of course, you had to have a big chunky server in the data center to serve up the OS at boot time, and the hardware didn’t include the mythical JavaChip co-processor, so complex apps could be a bit poky.

    A decade later, I have an iPad sitting on my desk. It ran me around $750, but I could have gotten one at two-thirds the price. It does email, calendaring, and Web browsing, plus a ton more. The OS is fast. This would make a really good task station, especially if you housed it in a different casing and slapped on a keyboard. I have no reason to think that Apple is aiming in that direction, but I won’t be surprised if we see iOS devices designed for single purpose workstations.

    Those early ideas just get better and better as the technology catches up, huh?

    In a vain attempt to tie this back to the blog purpose: this thing is going to be an amazing admin tool. If I’m in the datacenter, it’ll be very useful to have a separate screen handy. I can’t count the number of times I’ve been looking at a server on a KVM and I’ve wanted to have a second monitor available for watching logs and so forth. I have an ssh client, and I have a client that does both RDP and VNC, which is key for a multi-platform admin. If I was still working in Silicon Valley back in the stupid money days, I’d try to sell my boss on buying these for my team as core tools. Cheaper than an on-call laptop.

  • Quick Provisioning

    Posted on June 19th, 2010 Bryant

    Lydia Leong has a great post about the question of speedy provisioning. As she says, the exciting bit about getting new hardware in place isn’t the OS and software installation. Even if you’re not virtualized, you can install a new server unattended in a couple of hours. You want to be able to do that even if you never expect to grow, because you need to be able to rebuild servers quickly if they die. This isn’t hard to manage. In a pinch, people will sell you solutions and you can get a consultant in, but it’s easier to just plan ahead.

    She hits on the internal aspect: getting someone to sign off on the new servers. If we’re talking about the need to buy more capacity on short notice in our industry, we’re probably talking about launch, which means this problem isn’t so bad for us. But you’ve got to get the ducks lined up in advance. You don’t want to shock your CEO with an order on the third day of launch; she’s worrying about other stuff. Better to get the plan in writing way in advance, along with executive buyoff. Then you can tell the appropriate people you need ten more shards, get the documents signed, and get your vendor moving.

    I think that’s a bit trickier than Lydia says, but I also think she’s talking about onesies/twosies. Buying one server is easy, as she notes. Buying a hundred servers for serious expansion is going to take a bit longer, because Dell and HP and IBM hate keeping too much backstock around, so they’re going to have to build those servers for you.

    You can alleviate this, of course. First tactic is to let them know it’s coming. None of those companies are going to increase their inventory just for the sake of your possible buy, because you’re too small, unless you’re Blizzard. However, you can and should get some commitments around response time. You can also, and I think this is more important, find out what’s going to ship the fastest. There’s no reason why you shouldn’t take that as an input to your hardware decision matrix. If all else is equal, go with the servers that generally have the largest inventory. Or ask questions about factories: can your vendor literally build 1U servers faster than blades?

    Also, make sure the vendor order process is just as quick. As with all vendors, you want your hardware sales people to be on call during the two weeks around launch. Midnight calls are very unlikely; weekend calls are more probable.

    Finally, figure out how you’re going to rack and stack a hundred servers quickly. Could be your vendor’s professional services, could be some local contractor. Even if your internal staff racked the rest of the servers, it’s better not to ask them to spend the time in the machine room during launch, cause you are going to be doing a lot of other things. Obviously, you don’t want to call up a contractor and find out that all his people are doing something for some big insurance company so he can’t help you.

    None of this is really all that hard; it’s just a great example of one of the many rows you need to have your ducks in. It’s not difficult, it’s merely fiddly.

  • EVE Online Insight

    Posted on June 17th, 2010 Bryant

    CCP Yokai, the Technical Director over in EVE-land, just posted a dev blog about their new rack setup. This is pretty rare insight for any operation, so it’s definitely worth reading. You don’t get the nitty-gritty details but you get a good overview.

    They’re located in 12 cabinets. That apparently covers their single server, their test server, and ancillary services. If you don’t know, EVE is a single shard setup, which is really technically impressive. They crowd all 50-60K concurrent players into one world. That’s one big reason why network connectivity is so important to them. Yokai mentions it a few times in the blog. That’s a very high quality network he’s got set up, probably because most of those 64 servers may need to talk to any other server. Compare that to an infrastructure where 10 servers make up a shard. I can’t know for sure but it sure seems like you’d need to be ready for more interconnections.

    He’s using blades, and the blades have a lot of RAM. IBM makes a really solid blade, by the way. The HS21 is I think one generation old; they’re currently selling the HS22s in that price/performance spot, but once you’ve bought a bunch of blades you don’t upgrade unless you need to. The interesting thing to me is the amount of RAM they’ve got in each blade. 32GB is a fair bit. I don’t want to speculate too much but CCP has never been shy about smart ways to use the fastest possible resources, and RAM is fast. See also that big 2 terabyte SSD SAN (storage area network) he mentions.

    Lots of blades means lots of heat. I am not surprised that they need a self-contained cooling system. I should talk some about the blade vs. 1U server question, since while blades do take up less physical space, the practical space they consume may not save you much. On the other hand, as noted, CCP needs the fast interconnects. Blades do help there.

    Don’t miss the comment thread, either. The devs are again being very open about some of their choices, which is awfully nice of them.

  • Insert Saddle Metaphor Here

    Posted on May 15th, 2010 Bryant

    Yep, it still works. Neat.

    I’m a year or so into the new job, which still rocks, and for various and sundry reasons it’s a good time to start blogging about geeky MMO operations stuff again. Excellent. Quick note on policy, here: I’m not going to talk about where I am and I’m going to steer clear of talking about what we’re doing, because we’re not ready to talk about it yet and because I am explicitly and emphatically not speaking for the company in any way, shape or form. I’m not keeping my employer a secret — y’all know how to use LinkedIn, right? — but I’m keeping some separation in order to, well, keep some separation.

    Mostly I’m just jazzed to be talking about the stuff I love again. I’ll pick it up tomorrow with some random banter about datacenters, probably. Thanks for leaving me in your RSS feeds.

  • Tap, Tap

    Posted on May 13th, 2010 Bryant

    Hey, is this thing on?

  • Whoops

    Posted on April 23rd, 2009 Bryant

    Brief hiatus here, cause, um, I just got a job! Which is ducky. But I do want to figure out if there’s any company policy on blogging, etc. on the better safe than sorry theory.

    Great timing, huh? I can’t really find it in myself to complain, however.

  • Oracle Buys Sun

    Posted on April 20th, 2009 Bryant

    Apparently IBM wasn’t the only company looking to buy a nice Silicon Valley enterprise hardware and software company, since Oracle just bought Sun now that IBM’s opted to pass. You’d assume this was the backup plan, and you’d also assume that Sun really needed/wanted to be purchased. I think it’s a better fit for Sun than IBM would have been.

    It’s a pretty important change in the world of technical operations (which is much like the martial arts world, up to and including the preponderance of aged sages and masters). It even has at least one minor direct impact on MMOs. Even if it didn’t, though, it’d be interesting enough to talk about here.

    The minor direct impact is Project Darkstar, Sun’s open source MMO server. Oracle doesn’t have anything against open source, about which more in a second, but Project Darkstar probably hasn’t sold Sun many servers so far and I suspect it’s the sort of fringe project which has trouble surviving after a merger/acquisition. Also, the current releases run on a single server only, which is a bit of a drawback for serious MMO work. On the other hand, it’s a fringe project which isn’t being used for a whole lot in practice, which brings us back to the minor impact.

    The first big indirect impact is MySQL. I don’t expect Oracle to kill MySQL out of hand. Oracle doesn’t have a reputation as an open source friendly company, but they do a fair bit of work with open source and they’re clearly not averse to the concept. Oracle bought Berkeley DB a few years back and it’s been chugging along just fine ever since, although admittedly it isn’t a direct competitor to Oracle’s core products. Rather, it’s a nice complement. For that matter, Oracle’s actually licensed core technology to MySQL in the past; InnoDB is an important storage engine for MySQL and it’s owned by Oracle.

    I also don’t expect MySQL to stay exactly the same, because it has been a direct competitor to Oracle DB. While it makes sense for a software company to offer an open source version of its products, that version should be a gateway to a commercial relationship. MySQL currently has a subscription-based enterprise product which provides monthly bug fixes and updates. That’s a good model. I’m just not sure it makes sense for Oracle, because you’re then supporting two products that directly compete. If nothing else, the marketing gets wonky.

    Maintaining MySQL as a starter DB and encouraging people to upgrade to Oracle as their needs grow is also unlikely. If nothing else, that’s not a clean upgrade.

    It’s not impossible that Oracle will cut MySQL loose. It’d be a pretty easy transition, since MySQL development still happens in Sweden. When Sun bought MySQL, they didn’t change much about how the company did business. I’d imagine this would come at some cost, and I’m not sure who’d be in a position to pay it.

    Since it’s an open source product, the original developers could theoretically resign from Oracle en masse and reconstitute themselves as a new company. That is pure and blatant speculation. I don’t know what Swedish non-compete laws are like, I don’t know what it says in their employment contract, and so on. It’s more likely that a new entity would take on development. On the other hand, that’s a big project to launch; you’d want someone like IBM backing you, and you might lose those technology licenses that Oracle’s sold to MySQL proper.

    So lots of possibilities. I don’t expect to see MySQL vanish. I do think it’s very possible that development will slow down. If I were starting a project that required a database right now, I’d look at the possibility of change around MySQL and probably decide to use Postgres. Just to pull this back to MMOs: did you know that Sony Online Entertainment has a significant investment in the largest Postgres support company out there? That’s who Sony uses for their databases. Just sayin’.

    The other interesting indirect impact is servers. Oracle is not a company you think of as a server company, unless you were paying attention last fall, when they announced a database machine. That product’s built on HP technology and hardware. Oracle now has an opportunity to own the entire concept, from hardware to OS to database. This has to be attractive to Larry Ellison, who is a big fan of Steve Jobs and the Apple integration of hardware and software. I expect to see database and data warehouse appliances built on Sun’s hardware within a year or two.

    The open question is whether or not Oracle wants to be in the general purpose server market. They certainly could be, but it doesn’t directly sell more databases. Sun’s been pushing their servers at the MMO industry fairly hard; I expect most of us have gotten their sales calls and heard all about the Sun Games Technologies Group. I have no qualms about saying that I haven’t wanted to buy Sun for MMO purposes because I was uncertain about their future. (Which always made me sad — I used to work for Sun.) If Oracle decides they want to continue the general purpose product line, however, that’s a different matter.

    I’d be happy to see another competitor in the market. Right now, if you want the support that should come with a top tier server provider, you’ve got IBM and HP with Dell as a reasonable third choice. As a purchaser, I’m always happy with more options. I’m thus rooting for a Sun resurgence, without any sappy puns about rising.

    The third big impact is Java. I’d expect this to be the most painless, least interesting transition. Oracle is a software company, and Java fits into their offerings without being a competitor to anything they’re already doing. Oracle’s already involved in Java open source work. Theory isn’t practice, but Oracle would have to work at it a bit to screw up Java. I’d bet Java was actually the biggest reason Oracle wanted to buy Sun, although MySQL and the hardware had to be factors as well. It’s obvious enough that I don’t really have much else to say there. “Yep, that’s a good fit.”

    So there we are. It’s a pretty seismic change, even if Sun was declining as an enterprise IT vendor. We won’t know what the full impact is for a year or more. I think I am cautiously optimistic that it’ll be a net good for me, with some negative effects.

  • Patching the Game (Part III)

    Posted on April 10th, 2009 Bryant

    See also Part I and Part II. Take the time to read ‘em, I don’t mind. I need to go make coffee anyhow.

    Our third and final topic in this series is downtime. Every MMO player is used to downtime. Turbine games usually have weekly downtime. Blizzard takes World of Warcraft down most weeks. EVE Online goes down every day. It’s part of the whole MMO experience.

    I believe that polish is a key part of a successful MMO. Ragged edges show and they turn off customers. This became obvious when Blizzard launched World of Warcraft and set the bar a lot higher while attracting several million new customers. I also believe that operational polish is part of that. Customer support polish is important. Tech ops polish is also important.

    Downtime is not polish. Downtime is something we should avoid.

    So, patching. We reboot when we patch for two reasons. One, the data files change, whether those are binary files or configuration files or databases, and the server software can’t load new data on the fly. Two, the server software itself changes.

    Neither of these is entirely simple. As far as the first one goes, I’ve been known to claim that if a Web server can reload content without being restarted, a game server ought to be able to do the same. This is a misrepresentation on my part, because Web servers are stateless and game servers are exceedingly stateful. In order to solve the problem of transparent reloads for game servers, you need to figure out how you’re going to handle it when content changes while a user is accessing it.

    I don’t think it’s impossible, however. My initial model would be something like Blizzard’s phasing technology, in which the same zone/area looks different depending on where you are in certain quest lines. Do the same thing, except that the phases are different content levels. You still run the risk of discontinuity: e.g., if the data for an instance changes while one person in the party is inside the instance and the others zone in afterwards, you have a party split between two instances.

    Displaying a warning to the users is inelegant but does solve the problem. See also City of Heroes‘ instanced zone design, where players may be in any of several versions of a given city area. I don’t have a better approach handy, and I do think that indicating the mismatch to the users is better than downtime, so that technique satisfies for now.
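
    The warning approach can be sketched as a check for party members sitting in different content versions of the same instance. Everything here is hypothetical: the version numbers, the player names, and the message text are invented, and a real server would track far more than a name-to-version map.

```python
from collections import defaultdict


def split_warnings(party_members):
    """party_members maps player name -> content version of their instance.

    Returns one warning per player who is on an older content version
    than the newest one present in the party.
    """
    by_version = defaultdict(list)
    for player, version in party_members.items():
        by_version[version].append(player)
    if len(by_version) <= 1:
        return []  # everyone sees the same content; nothing to say
    newest = max(by_version)
    return [
        f"{player}: your party is split; re-enter to join content version {newest}"
        for version, players in sorted(by_version.items())
        if version != newest
        for player in sorted(players)
    ]


# One player entered before the data changed; two zoned in afterwards.
for line in split_warnings({"ayla": 1, "brin": 2, "cole": 2}):
    print(line)
```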

    Any game which allows for hotfixes without the game going down already does this, of course. I can think of a couple that do it. I sort of feel like this should be the minimum target functionality for new games. I say target because unexpected issues can always arise, but it’s good to have a target nonetheless.

    The second problem is trickier because it requires load balancing. Since games are stateful and require a persistent connection — or a really good simulation of one — you’re not going to be able to restart the server without affecting the people connected to it. The good news is that since we control the client/server protocol, we theoretically have the ability to play some clever tricks.

    The specific trick I’d like to play is a handoff. I want to be able to tell all the clients connected to one instance of the server that they should go talk to a second instance of the server… now. Then I can take down the first instance of the server, do whatever I need to do, and reverse the process to upgrade the second instance of the server when I’m done.
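
    As a toy model of that handoff, here is a sketch in Python. The `GameServer` and `Client` classes are invented for illustration, and the sketch deliberately skips the hard part, which is moving live game state between the two instances.

```python
class GameServer:
    def __init__(self, name, version):
        self.name, self.version = name, version
        self.clients = set()
        self.accepting = True


class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = None
        self.connect(server)

    def connect(self, server):
        if not server.accepting:
            raise ConnectionError(f"{server.name} is not accepting clients")
        if self.server is not None:
            self.server.clients.discard(self)
        server.clients.add(self)
        self.server = server


def hand_off(source, target):
    """Tell every client on `source` to go talk to `target`, now."""
    source.accepting = False
    for client in list(source.clients):
        client.connect(target)
    return len(source.clients) == 0


def upgrade(server, new_version):
    """Only upgrade an instance once it has been drained."""
    if server.clients:
        raise RuntimeError(f"{server.name} still has clients connected")
    server.version = new_version
    server.accepting = True


a = GameServer("world-a", "1.0")
b = GameServer("world-b", "1.0")
players = [Client(f"player{i}", a) for i in range(3)]

hand_off(a, b)     # drain the first instance
upgrade(a, "1.1")
hand_off(b, a)     # reverse the process
upgrade(b, "1.1")
print(a.version, b.version, len(a.clients))
```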

    Load balancing is useful for more than server upgrades: it’d be great for hardware maintenance as well. What’s more, if the client is taking that particular cue from a separate cluster of servers, you could possibly do the same thing reactively: a piece of hardware goes down? Detect the fault, and have the load balancing cluster issue a directive to go use a different server.

    I snuck in the assumption that the load balancing cluster would be a cluster. I think that’s semi-reasonable. It’s one of the functions I’d be inclined to farm out to HTTP servers of some flavor, because anything that’s an HTTP server can live behind a commercial-grade load balancer that the game studio doesn’t have to write. The drawback is that the load balancing is then a pull instead of a push: the client can check to see if anything’s changed, but the servers can’t tell the client anything until the client checks in.
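
    The pull-versus-push drawback shows up clearly in a small sketch. The `Directory` class below stands in for the HTTP-fronted cluster and is purely hypothetical, as are the shard and host names; no real HTTP calls are made.

```python
class Directory:
    """Stand-in for the HTTP service: clients ask, it never calls out."""

    def __init__(self, assignments):
        self.assignments = dict(assignments)

    def lookup(self, shard):
        return self.assignments[shard]


class PollingClient:
    def __init__(self, shard, directory):
        self.shard, self.directory = shard, directory
        self.server = directory.lookup(shard)

    def poll(self):
        """Ask the directory where to connect; return True if we moved."""
        current = self.directory.lookup(self.shard)
        moved = current != self.server
        self.server = current
        return moved


directory = Directory({"shard1": "host-a"})
client = PollingClient("shard1", directory)

# Ops repoints the shard. The client has no idea until its next poll.
directory.assignments["shard1"] = "host-b"
stale = client.server
moved = client.poll()
print(stale, "->", client.server, "moved:", moved)
```

    The window between the reassignment and the next poll is the cost of pulling: the shorter the polling interval, the smaller the window and the heavier the load on the directory.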

    I think I’m sort of being overly optimistic here, unfortunately. For one thing, it’s unclear that response times will be quick enough to avoid the users seeing some discontinuity. That might be tolerable, given that MMOs are relatively slack in their required response times, but I’m dubious. For another thing, the problem of maintaining state between two instances of a game server is really tough. You’d have to checkpoint the state of each individual server regularly. The length of time between checkpoints is the amount of rollback you’d be faced with from a perceptual standpoint. There’s an additional issue in that the state checkpoints would need to match the database checkpoints, or you’ll wind up with discontinuity there, which is worse. You really don’t want two servers to disagree about the state of the game.
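
    The relationship between checkpoint interval and rollback can be shown with a toy server. This is a sketch under heavy assumptions: the state is a single hypothetical counter, time is measured in abstract ticks, and the database side of the problem is ignored entirely.

```python
import copy


class CheckpointedServer:
    def __init__(self, interval):
        self.interval = interval      # ticks between checkpoints
        self.tick = 0
        self.state = {"gold": 0}
        self.checkpoint = (0, copy.deepcopy(self.state))

    def step(self):
        self.tick += 1
        self.state["gold"] += 1       # pretend a player earns one gold per tick
        if self.tick % self.interval == 0:
            self.checkpoint = (self.tick, copy.deepcopy(self.state))

    def fail_over(self):
        """Restore the last checkpoint; return how many ticks were lost."""
        lost = self.tick - self.checkpoint[0]
        self.tick, saved = self.checkpoint
        self.state = copy.deepcopy(saved)
        return lost


server = CheckpointedServer(interval=5)
for _ in range(13):
    server.step()
lost = server.fail_over()
print(lost, server.state)  # prints: 3 {'gold': 10}
```

    The loss is always less than one interval, which is the tradeoff: checkpoint more often and players lose less, but the servers spend more time copying state.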

    A more realistic approach is something like what Second Life does. When they roll out new server software, they just do a rolling update. Each individual server reboots, which is a pretty quick process. The section of the game world handled by that server is inaccessible for that period of time. When it comes back up, it’s running the new code.
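
    A rolling update in this style can be sketched as a loop that takes one region down at a time. The region names and version strings are invented, and this is not a description of how Linden Lab actually implements it; it only illustrates the shape of the process.

```python
def rolling_update(regions, new_version):
    """regions maps region name -> running version (None means offline).

    Yields a snapshot of the world after each region goes down, then a
    final snapshot once every region is back on the new version.
    """
    for name in sorted(regions):
        regions[name] = None          # this region is rebooting
        yield dict(regions)
        regions[name] = new_version   # back up, running the new code
    yield dict(regions)


world = {"north": "1.0", "south": "1.0", "east": "1.0"}
snapshots = list(rolling_update(world, "1.1"))
for snapshot in snapshots:
    print(snapshot)
```

    Every intermediate snapshot has exactly one region offline, and the middle ones show old and new versions running side by side.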

    There’s a small paradigm shift in that idea. Linden Lab doesn’t mind if Second Life‘s world rules vary slightly between servers for a short period of time. In a more game-oriented virtual world, there are more implications there. I can easily envision exploitable situations. The question has to be whether or not those are so serious that it’s worth downtime. And of course, if the answer is ever yes, you can always revert to taking down everything at once.

    The other implication of rolling updates is that client/server communications must be backward compatible. I’m not going to spend a lot of time talking about that, since it’s a good idea in any situation and the techniques for ensuring it are well-known. It’s one of those things which takes a little more effort but it’s worth doing. Not all polish is immediately obvious to the user.
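
    One of those well-known techniques is to make new fields optional and have receivers ignore fields they don't recognize. Here is a minimal sketch using JSON; the message layout, the field names, and the `mount` field are all hypothetical, and a real game protocol would more likely be binary.

```python
import json


def encode_move_v2(player, x, y, mount=None):
    """A newer sender adds an optional 'mount' field and bumps the version."""
    message = {"v": 2, "type": "move", "player": player, "x": x, "y": y}
    if mount is not None:
        message["mount"] = mount
    return json.dumps(message)


def decode_move_v1(raw):
    """An older receiver reads only the fields it knows and ignores the rest."""
    message = json.loads(raw)
    if message.get("type") != "move":
        raise ValueError("not a move message")
    return message["player"], message["x"], message["y"]


# A not-yet-updated server can still understand an updated client.
print(decode_move_v1(encode_move_v2("ayla", 3, 4, mount="horse")))
```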

    There’s one other forced reboot moment familiar to the MMO player, which is the client restart. That happens when there’s a client patch. I’m willing to admit defeat here because we have way less control over the user’s environment, and we don’t have (for example) a second instance of each desktop which can take over running the client while the first instance reboots.

    On the other hand, reloading data files shouldn’t be any harder on a desktop than it is on a server — so if your patch is just data updates, many of the same techniques should be in place. Do it right, and the client is semi-protected against users randomly deleting data files in the middle of a game session too. Yeah, that doesn’t happen much, but I’m a perfectionist.

    Before I leave this topic, I should apologize to any server coders and architects who know better than me. My disclaimer: I’m a tech ops guy, not a programmer, and I know that means I miss some blindingly obvious things at times. Corrections are always welcome. This is written from the point of view of my discipline, and revolves around techniques I’d love to be able to use rather than what I absolutely know is possible.