GDC Online 2012 Notes: Zynga Infrastructure Panel

These are pretty rough notes from the single tech-ops-oriented talk I made it to at GDC Online. Hao Chen and Jesse Willett did a great job filling in at what looked like the last minute — the talk was originally supposed to be about Draw Something, but it pivoted to Farmville 2 and Chefville. It’s always interesting to hear the perspective of a shop as big as Zynga.

My commentary is in footnotes because it’s been too long since I used my nifty footnote plugin.

  • Hao Chen (Studio CTO, Farmville 2), Jesse Willett (Studio CTO, Chefville)
    • Studio CTOs provide technical direction and are responsible for the technical health of their games
  • Web apps vs Web games
    • Web games are write heavy — roughly 2 reads for each write, and FV2’s write/read ratio is even higher (1:1)
    • Size of user data is much larger for games
      • User generated map drives this
    • Much larger content payload up front
  • Web vs. other games
    • Pace of iteration is higher
    • Demographics are different
    • Platform is harder
      • 8 years of PCs, multiple OSes, multiple browsers
  • Typical infrastructure diagram
    • Clients, CDN, load balancer, web tier, storage tier
  • Zynga stack
    • Linux, Apache, membase (a memcached-compatible persistent store, later folded into Couchbase), PHP
      • 3 years ago, memcache/MySQL data stack
        • Coalesced writes at RAM speed to memcache, basically as a write queue
  • Last mile (content to user)
    • Default design choice: defer everything
    • Eagerly fetch only what you absolutely must (from either the client or the server perspective)
      • Blocking APIs are bad
      • Async is more complex, but performance improves (a sketch follows this section)
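The “defer everything” idea translates naturally to async code. Here’s a minimal sketch in Python asyncio (my illustration, not Zynga’s code; the asset names and `fetch_asset` helper are invented): block only on what the first render truly needs, and schedule the rest without awaiting it.

```python
import asyncio

# Hypothetical asset fetcher; a stand-in for an HTTP/CDN request.
async def fetch_asset(name: str) -> str:
    await asyncio.sleep(0.1)  # simulated network latency
    return f"<{name} bytes>"

async def load_game() -> None:
    # Eagerly fetch only what blocks the first render.
    critical = await asyncio.gather(fetch_asset("map"), fetch_asset("player_state"))
    print("first render with", critical)

    # Defer everything else: schedule it, don't block on it.
    deferred = [asyncio.create_task(fetch_asset(n))
                for n in ("music", "friend_avatars", "promo_banner")]

    # A real client keeps running and these finish in the background;
    # we await them here only so the demo exits cleanly.
    await asyncio.gather(*deferred)

asyncio.run(load_game())
```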
  • Bandwidth
    • How many people will wait for load
    • How many people can play
    • Scales with content (game and user) and geography
    • Farmville 2
      • November 2011 — customers needed 256 Kbps of BW to play
      • Now — much lower (missed exact number)
  • CDN
    • Think about how you version content
    • Chefville offloads 99% of static bytes to CDN
    • They struggle with cache busting
      • Hashed file name and/or version number in the file name (sketch after this section)
      • They’re conscious that the version-history method can increase download times
    • Their SSL certs are rather general (wildcard), which has come in handy for busting caches by moving to new domain names
    • crossdomain.xml is the big problem: its name and location are hard-coded, so it can’t be versioned this way
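Hash-in-the-filename versioning is simple enough to sketch. This is my illustration, not their pipeline (the paths and the manifest idea are assumptions): name each published asset by a digest of its bytes, so any content change produces a new URL and stale CDN copies are simply never requested again.

```python
import hashlib
import shutil
from pathlib import Path

def publish(src: Path, out_dir: Path) -> str:
    """Copy an asset into out_dir under a content-hashed name; return that name."""
    digest = hashlib.sha1(src.read_bytes()).hexdigest()[:12]
    hashed_name = f"{src.stem}.{digest}{src.suffix}"
    shutil.copy(src, out_dir / hashed_name)
    return hashed_name  # e.g. "barn.3f2a9c1d77ab.png"

# Usage: a build step publishes every asset, then writes a small manifest
# (logical name -> hashed URL) that the client fetches first:
#   manifest["barn.png"] = publish(Path("art/barn.png"), Path("cdn_root"))
```

This is exactly why the hard-coded crossdomain.xml hurts: it’s the one file whose URL can’t carry a version.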
  • Front end scaling
    • Stateless, scales as well as weakest dependency, minimize disk I/O
      • The only I/O on their front ends is error/Apache logs
    • Make sure servers scale linearly
    • Farmville 2 load average hovers between 5 and 20
      • Moving some disk I/O into memory increased the number of users served, although it did not lower load
      • rrdtool graph
    • Scale by Optimization
      • APC, highdef (sic)
      • minimize external dependencies
      • move requests to client
      • move requests into async
      • batch the work (tunable-batch sketch after this section)
        • They can change the batch size at runtime to tune performance! Sexy.
      • redesign features
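The runtime-tunable batch size is easy to picture. A sketch, with everything invented for illustration (the config dict stands in for whatever live config store they actually use): accumulate writes and flush in groups, re-reading the batch size on every add so ops can retune it without a deploy.

```python
# Pretend live config; in production this might be a config service or a
# cache key that ops can change at runtime.
config = {"batch_size": 50}

class BatchedWriter:
    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.pending = []

    def add(self, item):
        self.pending.append(item)
        # Re-read the batch size on every add, so a runtime config change
        # takes effect immediately, no deploy or restart needed.
        if len(self.pending) >= config["batch_size"]:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)
            self.pending = []

writer = BatchedWriter(lambda batch: print(f"flushing {len(batch)} writes"))
for i in range(120):
    writer.add(("coins", i))
config["batch_size"] = 10   # ops tunes it down under load
for i in range(25):
    writer.add(("coins", i))
writer.flush()
```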
  • FV2 user dropoff
    • P80 — Facebook API response time under 500 ms (80% of users)
    • P95 — Facebook API response time over 14000 ms
    • Added hidden iframe and made the Facebook call parallel to other loads
  • Back end scaling
    • Graph — issue caused by new puppet conf
      • Shows that it’s wise to run at 60% of capacity, not 90%
    • Rescaling back end is harder than rescaling front end
      • Stateful, complicated, slower, etc.
      • All generates uncertainty, so you have to overprovision
    • Must be robust; data loss is unforgivable
      • Assume everything will fail, including datacenter
      • Don’t roll your own: you make games
      • Offsite backup into a different technology stack
  • Capacity estimates and planning
    • Two styles: art and science
    • Science
      • Know how fast you can add more capacity for each server type
        • They had a meltdown because they needed more package repos
        • Set your alarm early enough in the morning so that you can decide if you need more capacity in time for peak
      • Know your expected demand
        • How do you shut off traffic?
      • Different studios are different
        • They have a Tencent partnership; very slow to get new HW
    • Art
      • Rough estimates for server load, access, DAU, etc.
      • Provision X times estimate for comfort
      • Downscale once you hit steady state
      • Tradeoff: capex for engineering time
    • Testing
      • Pre-marketing, while load was low, they removed nodes until problems appeared, to learn what minimum capacity really was
      • Allowed them to calibrate for many months into the future
    • Planning requires measurement
    • Everything grows
      • Driven by both user activity and game evolution
      • Successful game = more features more quickly = efficiency tanks faster than you think
    • Servers scale on different metrics (rough sizing sketch after this list)
      • peak concurrents: web, storage i/o
      • WAU/MAU: middleware cache
      • cumulative installs: storage capacity
      • Nothing scales in terms of DAU
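A back-of-the-envelope version of that breakdown, with every constant invented for illustration (only the choice of driving metric per tier comes from the talk):

```python
# Each tier is sized by its own driving metric, per the talk.
peak_concurrents = 200_000         # drives web tier and storage I/O
wau = 3_000_000                    # drives middleware cache working set
cumulative_installs = 20_000_000   # drives storage capacity

# Invented per-unit costs, purely for illustration.
PLAYERS_PER_WEB_NODE = 4_000   # concurrent players one web node can serve
CACHE_BYTES_PER_WAU = 2_048    # hot state cached per weekly active user
BYTES_PER_INSTALL = 50_000     # state persisted per install, forever

web_nodes = -(-peak_concurrents // PLAYERS_PER_WEB_NODE)   # ceiling division
cache_gib = wau * CACHE_BYTES_PER_WAU / 2**30
storage_tib = cumulative_installs * BYTES_PER_INSTALL / 2**40

print(f"web nodes: {web_nodes}")           # 50
print(f"cache: {cache_gib:.1f} GiB")       # 5.7 GiB
print(f"storage: {storage_tib:.2f} TiB")   # 0.91 TiB
```

Note that DAU appears nowhere, matching the “nothing scales in terms of DAU” point.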
  • Monitoring
    • RightScale/collectd
    • Nagios
    • Splunk
    • Business metrics
    • Homegrown dashboard meta-tool
  • Scaling case
    • Added the 1025th web node — at 64 workers per node, 1024 nodes hit the 64K connection limit, whoops
    • MCMUX to do connection pooling for PHP threads (rough pooling sketch below)
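I don’t know MCMUX’s internals, but generic connection pooling looks like this sketch (entirely my illustration): many workers funnel requests through a small shared set of backend connections instead of each holding its own.

```python
import queue
import threading

class ConnectionPool:
    """Share a few backend connections among many workers."""
    def __init__(self, make_conn, size: int):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_conn())

    def request(self, payload):
        conn = self._pool.get()      # block until a connection is free
        try:
            return conn.send(payload)
        finally:
            self._pool.put(conn)     # always hand it back

class FakeBackendConn:
    def send(self, payload):
        return f"ok:{payload}"

# 64 "workers" share 8 connections rather than opening 64 of their own.
pool = ConnectionPool(FakeBackendConn, size=8)
threads = [threading.Thread(target=pool.request, args=(i,)) for i in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("64 workers served over 8 pooled connections")
```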
  • Locks
    • Fear locks, no matter what your tech
    • A lock is the front end projecting a weak dependency on the back end
    • Case study: a friend juggles, and all N of their friends get 5 coins (code sketch after this list)
      • Naive:
        • Lock self & N friends
        • Validate the juggle
        • Add coins
        • Unlock N+1 locks
      • Correct:
        • Lock self
        • Validate the juggle
        • Unlock self
        • Send lockless +5 message
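A sketch of both versions (my illustration; the per-player locks and mailboxes are stand-ins for whatever primitives the real stack uses). The naive path holds N+1 locks for the whole operation; the correct path locks only the acting player and delivers the coins as lockless messages that recipients fold in on their next load.

```python
import queue
import threading
from collections import defaultdict

locks = defaultdict(threading.Lock)    # one lock per player id
mailboxes = defaultdict(queue.Queue)   # lockless per-player message queues
coins = defaultdict(int)

def juggle_naive(player, friends):
    # Naive: lock self AND every friend for the whole operation.
    for p in sorted([player, *friends]):   # sorted order avoids deadlock
        locks[p].acquire()
    try:
        validate_juggle(player)
        for f in friends:
            coins[f] += 5
    finally:
        for p in [player, *friends]:
            locks[p].release()

def juggle_correct(player, friends):
    # Correct: lock only self, then send lockless +5 messages.
    with locks[player]:
        validate_juggle(player)
    for f in friends:
        mailboxes[f].put(("add_coins", 5))

def apply_mailbox(player):
    # Recipients apply queued messages to their own state on next load.
    while not mailboxes[player].empty():
        op, amount = mailboxes[player].get()
        if op == "add_coins":
            coins[player] += amount

def validate_juggle(player):
    pass  # stand-in for the real game-rule check

juggle_correct("alice", ["bob", "carol"])
apply_mailbox("bob")
print(coins["bob"])  # 5
```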
  • Q&A
    • How often do they reevaluate tech stack?
      • Not often — storage change was the biggest they’ve seen
      • Did all their testing early; current stack is just mandated
    • Message queue?
      • Minimal implementations are great
      • One queue rather than multiples – let the consumers figure out what to do with the messages (toy dispatch sketch below)
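The single-queue idea in miniature (my illustration; the message shapes are invented): producers put typed messages on one queue, and each consumer looks at the type and decides whether it cares.

```python
import queue

q = queue.Queue()   # one queue for everything

# Producers enqueue typed messages instead of picking a destination queue.
q.put({"type": "gift", "to": "bob", "item": "seed"})
q.put({"type": "analytics", "event": "level_up"})
q.put({"type": "gift", "to": "carol", "item": "cow"})

def consume(handlers):
    while not q.empty():
        msg = q.get()
        handler = handlers.get(msg["type"])
        if handler:        # the consumer decides what it cares about
            handler(msg)   # unknown types just fall through

consume({
    "gift": lambda m: print(f"deliver {m['item']} to {m['to']}"),
    "analytics": lambda m: print(f"log {m['event']}"),
})
```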
Notes:
1. Not sure why it’s much bigger than, say, inventory – how big can the map data be?
2. Still skeptical about this emphasis on user data size… maybe user info is in a monolithic blob?
3. highdef is another PHP caching product; I didn’t get the exact name
4. I didn’t quite catch the software they’re using, sounded a bit like memcacheq but they said it was commercial.
