On the operation of massively multiplayer online games.
  • GDC 2012 Notes: Zynga Infrastructure Panel

    Posted on October 16th, 2012 by Bryant

    These are pretty rough notes from the single tech-ops-oriented talk I got to at GDC Online. Hao Chen and Jesse Willett did a great job filling in at what looked like the last minute — the talk was originally supposed to be on Draw Something, but it pivoted to Farmville 2 and Chefville. It’s always interesting to hear perspective from a big datacenter operator like Zynga.

    My commentary is in footnotes because it’s been too long since I used my nifty footnote plugin.

    • Hao Chen (Studio CTO Farmville 2), Jesse Willett (Studio CTO Chefville)
      • Studio CTOs provide technical direction and are responsible for the technical health of the games
    • Web apps vs Web games
      • Web games are write-heavy — roughly 2 reads for each write, and FV2’s write/read ratio is even higher (1:1)
      • Size of user data is much larger for games
        • User generated map drives this
      • Much larger content payload up front
    • Web vs. other games
      • Pace of iteration is higher
      • Demographics are different
      • Platform is harder
        • 8 years of PCs, multiple OSes, multiple browsers
    • Typical infrastructure diagram
      • Clients, CDN, load balancer, web tier, storage tier
    • Zynga stack
      • Linux, Apache, membase (a memcached-derived persistent key-value store), PHP
        • 3 years ago, memcache/MySQL data stack
          • Coalesced writes at RAM speed to memcache, basically as a write queue
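
The write-coalescing pattern above can be sketched in a few lines. This is a toy stand-in, not Zynga’s implementation: a plain dict plays the role of memcache, and `db_write` is a hypothetical callable standing in for the MySQL write path. The point it illustrates is that repeated writes to a hot key collapse into a single durable write at the next flush.

```python
class WriteCoalescer:
    """Coalesce hot-key writes in RAM, flushing them to durable storage in batches.

    A toy sketch of the memcache-as-write-queue pattern: writes land in a
    dict at memory speed; flush() drains the dirty set to the (hypothetical)
    database, one durable write per key no matter how many times it changed.
    """

    def __init__(self, db_write):
        self.cache = {}           # key -> latest value (RAM-speed writes)
        self.dirty = set()        # keys awaiting a durable write
        self.db_write = db_write  # callable(key, value) -> None

    def write(self, key, value):
        # Repeated writes to the same key coalesce: only the last
        # value reaches the database on the next flush.
        self.cache[key] = value
        self.dirty.add(key)

    def read(self, key):
        return self.cache.get(key)

    def flush(self):
        for key in self.dirty:
            self.db_write(key, self.cache[key])
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed
```

Three writes touching two keys produce only two database writes on flush.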
    • Last mile (content to user)
      • Default design choice: defer everything
      • Only greedily fetch that which you must fetch (from either client or server perspective)
        • Blocking APIs are bad
        • Async is more complex but performance will improve
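
The “defer everything” rule above can be sketched with `asyncio`: fetch only the asset that blocks first render, then kick off everything else concurrently. The asset names and delays are invented for illustration; `fetch` stands in for a real network call.

```python
import asyncio

async def fetch(name, delay):
    # Stand-in for a network fetch; delay simulates latency.
    await asyncio.sleep(delay)
    return name

async def load_game():
    # Greedily fetch only what blocks first render...
    core = await fetch("core-bundle", 0.01)
    # ...then launch the deferrable fetches concurrently instead of
    # blocking on each one in turn.
    deferred = [fetch(n, 0.01) for n in ("friend-list", "quest-log", "shop")]
    rest = await asyncio.gather(*deferred)
    return [core] + list(rest)

assets = asyncio.run(load_game())
```

With N deferrable fetches, the concurrent version pays roughly one round-trip instead of N, which is the performance win the speakers trade complexity for.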
    • Bandwidth
      • How many people will wait for load
      • How many people can play
      • Scales with content (game and user) and geography
      • Farmville 2
        • November 2011 — customers needed 256 Kbps of BW to play
        • Now — much lower (missed exact number)
    • CDN
      • Think about how you version content
      • Chefville offloads 99% of static bytes to CDN
      • They struggle with cache busting
        • Hashed file name and/or version history in file name
        • They’re conscious of increasing download times w/history method
      • They use rather general SSL certs, which have come in handy for using new domain names to bust caches
      • crossdomain.xml is the big problem: its domain name is hard-coded
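
The hashed-file-name approach to cache busting mentioned above is straightforward to sketch: derive the published name from the content, so the URL changes exactly when the bytes do and CDN edge caches never need an explicit purge. The path and digest length here are illustrative choices, not anything from the talk.

```python
import hashlib
import os

def hashed_name(path, content):
    # Content-addressed name: the URL changes iff the bytes change,
    # so stale edge copies are simply never requested again.
    digest = hashlib.sha1(content).hexdigest()[:10]
    root, ext = os.path.splitext(path)
    return f"{root}.{digest}{ext}"
```

Unchanged content keeps its URL (and its cache hits); any edit yields a fresh URL automatically.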
    • Front end scaling
      • Stateless, scales as well as weakest dependency, minimize disk I/O
        • The only I/O on their front ends is error/Apache logs
      • Make sure servers scale linearly
      • Farmville 2 load average hovers between 5 and 20
        • Moved some disk I/O into memory; this increased the number of users served, although it did not lower load
        • rrdtool graph
      • Scale by Optimization
        • APC, highdef (sic)
        • minimize external dependencies
        • move requests to client
        • move requests into async
        • batch the work
          • They can change the batch size at runtime to tune performance! Sexy.
        • redesign features
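
The runtime-tunable batching they described can be sketched as a worker whose batch size is an ordinary mutable attribute, adjustable while jobs flow through. This is my own minimal illustration of the idea, not their implementation; `handler` is a hypothetical callable that processes one batch.

```python
class BatchWorker:
    """Process queued jobs in batches whose size can be retuned at runtime."""

    def __init__(self, handler, batch_size=10):
        self.queue = []
        self.handler = handler        # callable(list_of_jobs) -> None
        self.batch_size = batch_size  # mutable: retune live under load

    def submit(self, job):
        self.queue.append(job)

    def drain_once(self):
        # Hand off up to batch_size jobs as one unit, amortizing
        # per-request overhead across the whole batch.
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        if batch:
            self.handler(batch)
        return len(batch)
```

Turning the knob at runtime is just an attribute write, which is what makes live performance tuning possible.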
    • FV2 user dropoff
      • P80 — Facebook API response time under 500 ms (80% of users)
      • P95 — Facebook API response time over 14000 ms
      • Added hidden iframe and made the Facebook call parallel to other loads
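
The arithmetic behind that fix is worth spelling out: a sequential load pays for the Facebook call on top of everything else, while the hidden-iframe approach pays only for the slower of the two. The 3,000 ms page-load figure below is an assumption for illustration; only the 14,000 ms P95 comes from the talk.

```python
def load_time(page_ms, fb_ms, parallel):
    # Sequential: the Facebook API call blocks the rest of the load.
    # Parallel (hidden iframe): total time is the slower of the two paths.
    return max(page_ms, fb_ms) if parallel else page_ms + fb_ms

# Assumed 3,000 ms page load against the talk's 14,000 ms P95 Facebook call:
sequential = load_time(3000, 14000, parallel=False)
parallel = load_time(3000, 14000, parallel=True)
```

For the 95th-percentile user this shaves the full page-load time off the wait, and for the P80 user (a 500 ms Facebook call) the call effectively disappears behind the page load.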
    • Back end scaling
      • Graph — issue caused by new puppet conf
        • Shows that it’s wise to run at 60% headroom, not 90%
      • Rescaling back end is harder than rescaling front end
        • Stateful, complicated, slower, etc.
        • All of this generates uncertainty, so you have to overprovision
      • Must be robust; data loss is unforgivable
        • Assume everything will fail, including datacenter
        • Don’t roll your own: you make games
        • Offsite backup into a different technology stack
    • Capacity estimates and planning
      • Art and/or science style
      • Science
        • Know how fast you can add more capacity for each server type
          • They had a meltdown because they needed more package repos
          • Set your alarm early enough in the morning so that you can decide if you need more capacity in time for peak
        • Know your expected demand
          • How do you shut off traffic?
        • Different studios are different
          • They have a Tencent partnership; very slow to get new HW
      • Art
        • Rough estimates for server load, access, DAU, etc.
        • Provision X times estimate for comfort
        • Downscale once you hit steady state
        • Tradeoff: capex for engineering time
      • Testing
        • Pre-marketing, with low load, they removed nodes until there were problems to understand what minimum capacity really was
        • Allowed them to calibrate for many months into the future
      • Planning requires measurement
      • Everything grows
        • Driven by both user activity and game evolution
        • Successful game = more features more quickly = efficiency tanks faster than you think
      • Servers scale on different metrics
        • peak concurrents: web, storage i/o
        • WAU/MAU: middleware cache
        • cumulative installs: storage capacity
        • Nothing scales in terms of DAU
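
A back-of-the-envelope version of the capacity math above: pick the driving metric for each tier, divide by per-node capacity, and provision so steady load sits near the 60% headroom target rather than 90%. Every number here is invented for illustration.

```python
import math

def nodes_needed(metric_value, capacity_per_node, target_utilization=0.6):
    # Provision so steady load sits at target_utilization of total
    # capacity — nearer 60% than 90%, per the headroom lesson above.
    return math.ceil(metric_value / (capacity_per_node * target_utilization))

# Each tier scales on its own driver (all figures hypothetical):
web_nodes = nodes_needed(120_000, capacity_per_node=2_000)        # peak concurrents
storage_nodes = nodes_needed(5_000_000, capacity_per_node=250_000)  # cumulative installs
```

Note that neither calculation takes DAU as an input, which is the panel’s point: pick the metric that actually drives each server type.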
    • Monitoring
      • RightScale/collectd
      • Nagios
      • Splunk
      • Business metrics
      • Homegrown dashboard meta-tool
    • Scaling case
      • Added 1025th node — limit of 64 connects, 64 workers per node, whoops
      • MCMUX to do connection pooling for PHP threads
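
One plausible reading of the “64 connects, 64 workers per node” note — an assumption on my part, since I don’t have the exact topology — is that every PHP worker on every web node held its own connection to a given storage node, so per-storage-node connections grew linearly with the web tier and crossed a 64 × 1024 = 65,536 ceiling at node 1,025. Pooling collapses each node’s worker connections into one multiplexed connection.

```python
WORKERS_PER_NODE = 64
CONNECTION_LIMIT = 64 * 1024  # 65,536 — the ceiling implied by the notes

def storage_connections(web_nodes, pooled=False):
    # Without pooling, every worker on every web node holds a direct
    # connection; MCMUX-style pooling multiplexes a node's workers
    # over a single connection per web node.
    per_node = 1 if pooled else WORKERS_PER_NODE
    return web_nodes * per_node
```

At 1,024 nodes the unpooled count sits exactly at the limit; the 1,025th node tips it over, while the pooled count stays three orders of magnitude lower.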
    • Locks
      • Fear locks, no matter what your tech
      • A lock is the front end projecting a weak dependency on the back end
      • Case study: a friend juggles, all N of their friends get 5 coins
        • Naive:
          • Lock self & N friends
          • Validate juggle
          • Add coins
          • Unlock N+1 locks
        • Correct:
          • Lock self
          • Validate juggling
          • Unlock self
          • Send lockless +5 message
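
The correct pattern above can be sketched directly: hold only your own lock while validating, then reward friends via messages that each recipient applies under its own lock later. This is my illustrative reconstruction — the player names, the queue, and the always-valid stub are invented; no request ever holds more than one lock.

```python
import threading
from collections import defaultdict
from queue import Queue

coins = defaultdict(int)
locks = defaultdict(threading.Lock)
mailbox = Queue()  # reward messages, applied later under each recipient's lock

def juggle(player, friends):
    # Correct pattern: hold only your own lock while validating,
    # then reward friends with messages instead of taking N friend locks.
    with locks[player]:
        pass  # validate the juggle here (stubbed as always-valid)
    for friend in friends:
        mailbox.put((friend, 5))  # lockless from the sender's side

def deliver_rewards():
    # Each message is applied under the recipient's own lock, one at a
    # time, so lock hold times stay short and deadlock is impossible.
    while not mailbox.empty():
        who, amount = mailbox.get()
        with locks[who]:
            coins[who] += amount

juggle("alice", ["bob", "carol"])
deliver_rewards()
```

The naive version would lock N + 1 players for the duration of the whole operation; this one never holds two locks at once.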
    • Q&A
      • How often do they reevaluate tech stack?
        • Not often — storage change was the biggest they’ve seen
        • Did all their testing early; current stack is just mandated
      • Message queue?
        • Minimal implementations are great
        • One queue rather than multiple queues – let the consumers figure out what to do with the messages
    Notes:
    1. Not sure why it’s much bigger than, say, inventory – how big can the map data be?
    2. Still skeptical about this emphasis on user data size… maybe user info is in a monolithic blob?
    3. highdef is another PHP caching product; I didn’t get the exact name
    4. I didn’t quite catch the software they’re using, sounded a bit like memcacheq but they said it was commercial.
