These are pretty rough notes from the single tech ops oriented talk I got to at GDC Online. Hao Chen and Jesse Willett did a great job filling in at what looked like the last minute — the talk was originally supposed to be on Draw Something, but it pivoted to Farmville 2 and Chefville. It’s always interesting to hear perspective from a big datacenter like Zynga.
My commentary is in footnotes because it’s been too long since I used my nifty footnote plugin.
- Hao Chen (Studio CTO Farmville 2), Jesse Willett (Studio CTO Chefville)
- Studio CTOs provide technical direction and are responsible for the technical health of the games
- Web apps vs Web games
- Web vs. other games
- Pace of iteration is higher
- Demographics are different
- Platform is harder
- 8 years of PCs, multiple OSes, multiple browsers
- Typical infrastructure diagram
- Clients, CDN, load balancer, web tier, storage tier
- Zynga stack
- Linux, Apache, membase (memcached-compatible key-value store, later folded into Couchbase), PHP
- 3 years ago, memcache/MySQL data stack
- Coalesced writes at RAM speed to memcache, basically as a write queue (see the sketch below)
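To make the "memcache as a write queue" idea concrete, here is a minimal sketch of write coalescing in Python. It is a generic illustration, not Zynga's code: the `db` handle, its `batch_put` method, and the flush interval are all assumptions.

```python
import time
from collections import OrderedDict

class CoalescingWriter:
    """Buffer writes in RAM and flush them to slow, durable storage in batches.

    Repeated writes to the same key within one flush window collapse into a
    single write, which is the point of coalescing.
    """

    def __init__(self, db, flush_interval=5.0):
        self.db = db                      # slow durable store (e.g. MySQL)
        self.flush_interval = flush_interval
        self.dirty = OrderedDict()        # key -> latest value; acts as the write queue
        self.last_flush = time.monotonic()

    def write(self, key, value):
        # The write lands in RAM immediately; reads can be served from here too.
        self.dirty[key] = value
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One batched trip to durable storage for everything coalesced so far.
        if self.dirty:
            self.db.batch_put(list(self.dirty.items()))
            self.dirty.clear()
        self.last_flush = time.monotonic()
```

The tradeoff is durability: anything still sitting in the buffer when a node dies is lost, which is part of why the back-end section below leans so hard on robustness and backups.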
- Last mile (content to user)
- Default design choice: defer everything
- Only greedily fetch that which you must fetch (from either client or server perspective)
- Blocking APIs are bad
- Async is more complex, but performance will improve (see the sketch below)
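As a sketch of "defer everything," here is the difference between a blocking load and a deferred one using Python's asyncio. The three fetch functions are hypothetical stand-ins for whatever a client or server actually loads.

```python
import asyncio

# Hypothetical loaders standing in for real HTTP or data-store calls.
async def fetch_profile():
    await asyncio.sleep(0.1)
    return {"name": "player"}

async def fetch_inventory():
    await asyncio.sleep(0.3)
    return ["hoe", "seeds"]

async def fetch_friends():
    await asyncio.sleep(0.3)
    return ["alice", "bob"]

async def load_blocking():
    # Blocking style: total latency is the sum of every fetch.
    return (await fetch_profile(),
            await fetch_inventory(),
            await fetch_friends())

async def load_deferred():
    # Deferred style: greedily fetch only what the first render needs,
    # start the rest in the background, and collect it when it's wanted.
    inventory_task = asyncio.create_task(fetch_inventory())
    friends_task = asyncio.create_task(fetch_friends())
    profile = await fetch_profile()   # the one thing needed before first render
    # ... render with profile here ...
    return profile, await inventory_task, await friends_task

asyncio.run(load_deferred())
```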
- Bandwidth
- CDN
- Think about how you version content
- Chefville offloads 99% of static bytes to CDN
- They struggle with cache busting
- Hashed file name and/or version history in file name (see the sketch after this list)
- They're conscious that the version-history approach increases download times
- They use fairly broad SSL certs, which have come in handy for busting caches with new domain names
- crossdomain.xml is the big problem; its domain name is hard-coded
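Here is a minimal sketch of the hashed-file-name approach to cache busting in Python. The naming scheme and manifest idea are assumptions about how such a pipeline usually works, not a description of Chefville's build.

```python
import hashlib
import shutil
from pathlib import Path

def publish_with_hash(src: Path, out_dir: Path) -> str:
    """Copy an asset to out_dir under a content-hashed name.

    e.g. farm_tiles.swf -> farm_tiles.3f5a9c1e.swf
    The URL changes exactly when the bytes change, so long CDN and browser
    cache lifetimes are safe and nothing ever has to be invalidated.
    """
    digest = hashlib.sha1(src.read_bytes()).hexdigest()[:8]
    hashed_name = f"{src.stem}.{digest}{src.suffix}"
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, out_dir / hashed_name)
    return hashed_name  # goes into a manifest the client fetches first
```

The returned name would be recorded in a small manifest the client loads before anything else, so every other asset URL stays stable until its content actually changes.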
- Front end scaling
- Stateless, scales as well as weakest dependency, minimize disk I/O
- The only I/O on their front ends is error/Apache logs
- Make sure servers scale linearly
- Farmville 2 load average hovers between 5 and 20
- Moving some disk I/O into memory increased the number of users served, although it did not lower load
- rrdtool graph
- Scale by Optimization
- FV2 user dropoff
- P80 — Facebook API response time under 500 ms (80% of users)
- P95 — Facebook API response time over 14000 ms
- Added hidden iframe and made the Facebook call parallel to other loads
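The fix described in the talk is client-side (the hidden iframe lets the Facebook call overlap the rest of the page load), but the underlying idea is simply overlapping a slow dependency with other work. A generic sketch in Python, where `call_facebook_api` and `load_game_assets` are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_facebook_api():
    time.sleep(1.0)       # stand-in for the slow third-party call
    return {"uid": 12345}

def load_game_assets():
    time.sleep(0.5)       # stand-in for everything else the load has to do
    return ["assets"]

with ThreadPoolExecutor() as pool:
    fb_future = pool.submit(call_facebook_api)  # kick off the slow call first
    assets = load_game_assets()                 # do all the other work meanwhile
    fb = fb_future.result()                     # join only when it's actually needed
```

Total latency becomes the slower of the two paths instead of their sum, which is exactly what pulling the Facebook call out of the critical path buys you.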
- Back end scaling
- Graph — issue caused by new puppet conf
- Shows that it’s wise to run at 60% headroom, not 90%
- Rescaling back end is harder than rescaling front end
- Stateful, complicated, slower, etc.
- All of this generates uncertainty, so you have to overprovision
- Must be robust; data loss is unforgivable
- Assume everything will fail, including datacenter
- Don’t roll your own: you make games
- Offsite backup into a different technology stack
- Capacity estimates and planning
- Art and/or science style
- Science
- Know how fast you can add more capacity for each server type
- They had a meltdown because they needed more package repos
- Set your alarm early enough in the morning so that you can decide if you need more capacity in time for peak
- Know your expected demand
- How do you shut off traffic?
- Different studios are different
- They have a Tencent partnership; very slow to get new HW
- Art
- Rough estimates for server load, access, DAU, etc.
- Provision X times estimate for comfort
- Downscale once you hit steady state
- Tradeoff: capex for engineering time
- Testing
- Pre-marketing, while load was low, they removed nodes until problems appeared in order to understand what minimum capacity really was
- Allowed them to calibrate for many months into the future
- Planning requires measurement
- Everything grows
- Driven by both user activity and game evolution
- Successful game = more features more quickly = efficiency tanks faster than you think
- Servers scale on different metrics
- peak concurrents: web, storage i/o
- WAU/MAU: middleware cache
- cumulative installs: storage capacity
- Nothing scales in terms of DAU
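A back-of-the-envelope sketch of what "servers scale on different metrics" looks like in practice: each tier gets its own demand driver and its own per-unit capacity, with headroom applied on top. All numbers below are made up for illustration, and the 60% utilization target is an assumption about how much headroom you want.

```python
import math

def servers_needed(demand, capacity_per_server, target_utilization=0.6):
    """Servers required to carry a given demand while leaving headroom."""
    return math.ceil(demand / (capacity_per_server * target_utilization))

# Each tier scales on its own metric, never directly on DAU (illustrative numbers).
web_nodes     = servers_needed(demand=200_000,    capacity_per_server=4_000)       # peak concurrents
cache_nodes   = servers_needed(demand=5_000_000,  capacity_per_server=400_000)     # WAU/MAU working set
storage_nodes = servers_needed(demand=40_000_000, capacity_per_server=2_000_000)   # cumulative installs
```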
- Monitoring
- RightScale/collectd
- Nagios
- Splunk
- Business metrics
- Homegrown dashboard meta-tool
- Scaling case
- Added the 1,025th node and hit a connection limit: 64 connections, 64 workers per node, whoops
- MCMUX to do connection pooling for PHP threads
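MCMUX is Zynga's own multiplexer; what follows is only a minimal sketch of the general pooling idea it addresses: many workers sharing a small, fixed number of upstream connections instead of each opening its own. The pool size and the raw-socket protocol are assumptions for illustration.

```python
import queue
import socket

class ConnectionPool:
    """Share a few upstream connections among many workers on one node.

    With 64 workers per node each holding its own connection, the upstream
    connection count grows as nodes x workers; a local pool (or a
    multiplexing proxy in front of it) caps it at pool_size per node.
    """

    def __init__(self, host, port, pool_size=4):
        self._pool = queue.Queue()
        for _ in range(pool_size):
            self._pool.put(socket.create_connection((host, port)))

    def execute(self, request: bytes) -> bytes:
        conn = self._pool.get()        # blocks until a connection is free
        try:
            conn.sendall(request)
            return conn.recv(4096)
        finally:
            self._pool.put(conn)       # hand it back for the next worker
```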
- Locks
- Fear locks, no matter what your tech
- A lock is the front end projecting a weak dependency on the back end
- Case study: a friend juggles, all N of their friends get 5 coins
- Naive:
- Lock self & N friends
- Validate juggle
- Add coins
- Unlock N+1 locks
- Correct:
- Lock self
- Validate juggling
- Unlock self
- Send lockless +5 message
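A sketch of the two versions of the juggling case study in Python. The `store` with per-key locks and the per-player `inbox` are hypothetical interfaces; the point is the lock footprint, not the API.

```python
def reward_friends_naive(store, player, friends, coins=5):
    # Naive: lock self plus all N friends for the whole operation.
    with store.lock_many([player] + friends):     # N+1 locks held at once
        if not store.get(player)["is_juggling"]:
            return
        for f in friends:
            state = store.get(f)
            state["coins"] += coins
            store.put(f, state)

def reward_friends_correct(store, inbox, player, friends, coins=5):
    # Correct: lock only self to validate, then send lockless messages.
    with store.lock(player):                      # one lock, held briefly
        if not store.get(player)["is_juggling"]:
            return
    for f in friends:
        inbox.send(f, {"type": "add_coins", "amount": coins})
    # Each friend applies pending inbox messages under their own lock the
    # next time their state is loaded, so no cross-player locks are taken.
```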
- Q&A
Notes:
1. Not sure why it’s much bigger than, say, inventory – how big can the map data be?
2. Still skeptical about this emphasis on user data size… maybe user info is in a monolithic blob?
3. highdef is another PHP caching product; I didn’t get the exact name
4. I didn’t quite catch the software they’re using, sounded a bit like memcacheq but they said it was commercial.