These are pretty rough notes from the single tech ops oriented talk I got to at GDC Online. Hao Chen and Jesse Willett did a great job filling in at what looked like the last minute — the talk was originally supposed to be on Draw Something, but it pivoted to Farmville 2 and Chefville. It’s always interesting to hear perspective from a big datacenter like Zynga.
My commentary is in footnotes because it’s been too long since I used my nifty footnote plugin.
- Hao Chen (Studio CTO Farmville 2), Jesse Willett (Studio CTO Chefville)
- Studio CTOs provide technical direction and are responsible for the technical health of the games
- Web apps vs Web games
- Web vs. other games
- Pace of iteration is higher
- Demographics are different
- Platform is harder
- 8 years of PCs, multiple OSes, multiple browsers
- Typical infrastructure diagram
- Clients, CDN, load balancer, web tier, storage tier
- Zynga stack
- Linux, Apache, membase (memcached-compatible key-value store, later folded into Couchbase), PHP
- 3 years ago, memcache/MySQL data stack
- Coalesced writes at RAM speed to memcache, basically as a write queue (see the sketch below)
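To make the "memcache as a write queue" idea concrete, here is a minimal sketch of write coalescing in Python. It is a generic illustration, not Zynga's code: the `db` handle, its `batch_put` method, and the flush interval are all assumptions.

```python
import time
from collections import OrderedDict

class CoalescingWriter:
    """Buffer writes in RAM and flush them to slow, durable storage in batches.

    Repeated writes to the same key within one flush window collapse into a
    single write, which is the point of coalescing.
    """

    def __init__(self, db, flush_interval=5.0):
        self.db = db                      # slow durable store (e.g. MySQL)
        self.flush_interval = flush_interval
        self.dirty = OrderedDict()        # key -> latest value; acts as the write queue
        self.last_flush = time.monotonic()

    def write(self, key, value):
        # The write lands in RAM immediately; reads can be served from here too.
        self.dirty[key] = value
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # One batched trip to durable storage for everything coalesced so far.
        if self.dirty:
            self.db.batch_put(list(self.dirty.items()))
            self.dirty.clear()
        self.last_flush = time.monotonic()
```

The tradeoff is durability: anything still sitting in the buffer when a node dies is lost, which is part of why the back-end section below leans so hard on robustness and backups.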
- Last mile (content to user)
- Default design choice: defer everything
- Only greedily fetch that which you must fetch (from either client or server perspective)
- Blocking APIs are bad
- Async is more complex, but performance will improve (see the sketch below)
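As a sketch of "defer everything," here is the difference between a blocking load and a deferred one using Python's asyncio. The three fetch functions are hypothetical stand-ins for whatever a client or server actually loads.

```python
import asyncio

# Hypothetical loaders standing in for real HTTP or data-store calls.
async def fetch_profile():
    await asyncio.sleep(0.1)
    return {"name": "player"}

async def fetch_inventory():
    await asyncio.sleep(0.3)
    return ["hoe", "seeds"]

async def fetch_friends():
    await asyncio.sleep(0.3)
    return ["alice", "bob"]

async def load_blocking():
    # Blocking style: total latency is the sum of every fetch.
    return (await fetch_profile(),
            await fetch_inventory(),
            await fetch_friends())

async def load_deferred():
    # Deferred style: greedily fetch only what the first render needs,
    # start the rest in the background, and collect it when it's wanted.
    inventory_task = asyncio.create_task(fetch_inventory())
    friends_task = asyncio.create_task(fetch_friends())
    profile = await fetch_profile()   # the one thing needed before first render
    # ... render with profile here ...
    return profile, await inventory_task, await friends_task

asyncio.run(load_deferred())
```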
- Bandwidth
- CDN
- Think about how you version content
- Chefville offloads 99% of static bytes to CDN
- They struggle with cache busting
- Hashed file name and/or version history in file name (see the sketch after this list)
- They're conscious that the version-history approach increases download times
- They use fairly broad SSL certs, which have come in handy for busting caches with new domain names
- crossdomain.xml is the big problem; its domain name is hard-coded
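Here is a minimal sketch of the hashed-file-name approach to cache busting in Python. The naming scheme and manifest idea are assumptions about how such a pipeline usually works, not a description of Chefville's build.

```python
import hashlib
import shutil
from pathlib import Path

def publish_with_hash(src: Path, out_dir: Path) -> str:
    """Copy an asset to out_dir under a content-hashed name.

    e.g. farm_tiles.swf -> farm_tiles.3f5a9c1e.swf
    The URL changes exactly when the bytes change, so long CDN and browser
    cache lifetimes are safe and nothing ever has to be invalidated.
    """
    digest = hashlib.sha1(src.read_bytes()).hexdigest()[:8]
    hashed_name = f"{src.stem}.{digest}{src.suffix}"
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, out_dir / hashed_name)
    return hashed_name  # goes into a manifest the client fetches first
```

The returned name would be recorded in a small manifest the client loads before anything else, so every other asset URL stays stable until its content actually changes.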
- Front end scaling
- Stateless, scales as well as weakest dependency, minimize disk I/O
- The only I/O on their front ends is error/Apache logs
- Make sure servers scale linearly
- Farmville 2 load average hovers between 5 and 20
- Moving some disk I/O into memory increased the number of users served, although it did not lower load
- rrdtool graph
- Scale by Optimization
- FV2 user dropoff
- P80 — Facebook API response time under 500 ms (80% of users)
- P95 — Facebook API response time over 14000 ms
- Added hidden iframe and made the Facebook call parallel to other loads
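The fix described in the talk is client-side (the hidden iframe lets the Facebook call overlap the rest of the page load), but the underlying idea is simply overlapping a slow dependency with other work. A generic sketch in Python, where `call_facebook_api` and `load_game_assets` are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_facebook_api():
    time.sleep(1.0)       # stand-in for the slow third-party call
    return {"uid": 12345}

def load_game_assets():
    time.sleep(0.5)       # stand-in for everything else the load has to do
    return ["assets"]

with ThreadPoolExecutor() as pool:
    fb_future = pool.submit(call_facebook_api)  # kick off the slow call first
    assets = load_game_assets()                 # do all the other work meanwhile
    fb = fb_future.result()                     # join only when it's actually needed
```

Total latency becomes the slower of the two paths instead of their sum, which is exactly what pulling the Facebook call out of the critical path buys you.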
- Back end scaling
- Graph — issue caused by new puppet conf
- Shows that it’s wise to run at 60% headroom, not 90%
- Rescaling back end is harder than rescaling front end
- Stateful, complicated, slower, etc.
- All of this generates uncertainty, so you have to overprovision
- Must be robust; data loss is unforgivable
- Assume everything will fail, including datacenter
- Don’t roll your own: you make games
- Offsite backup into a different technology stack
- Capacity estimates and planning
- Art and/or science style
- Science
- Know how fast you can add more capacity for each server type
- They had a meltdown because they needed more package repos
- Set your alarm early enough in the morning so that you can decide if you need more capacity in time for peak
- Know your expected demand
- How do you shut off traffic?
- Different studios are different
- They have a Tencent partnership; very slow to get new HW
- Art
- Rough estimates for server load, access, DAU, etc.
- Provision X times estimate for comfort
- Downscale once you hit steady state
- Tradeoff: capex for engineering time
- Testing
- Pre-marketing, while load was low, they removed nodes until problems appeared in order to understand what minimum capacity really was
- Allowed them to calibrate for many months into the future
- Planning requires measurement
- Everything grows
- Driven by both user activity and game evolution
- Successful game = more features more quickly = efficiency tanks faster than you think
- Servers scale on different metrics
- peak concurrents: web, storage i/o
- WAU/MAU: middleware cache
- cumulative installs: storage capacity
- Nothing scales in terms of DAU
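A back-of-the-envelope sketch of what "servers scale on different metrics" looks like in practice: each tier gets its own demand driver and its own per-unit capacity, with headroom applied on top. All numbers below are made up for illustration, and the 60% utilization target is an assumption about how much headroom you want.

```python
import math

def servers_needed(demand, capacity_per_server, target_utilization=0.6):
    """Servers required to carry a given demand while leaving headroom."""
    return math.ceil(demand / (capacity_per_server * target_utilization))

# Each tier scales on its own metric, never directly on DAU (illustrative numbers).
web_nodes     = servers_needed(demand=200_000,    capacity_per_server=4_000)       # peak concurrents
cache_nodes   = servers_needed(demand=5_000_000,  capacity_per_server=400_000)     # WAU/MAU working set
storage_nodes = servers_needed(demand=40_000_000, capacity_per_server=2_000_000)   # cumulative installs
```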
- Monitoring
- RightScale/collectd
- Nagios
- Splunk
- Business metrics
- Homegrown dashboard meta-tool
- Scaling case
- Added the 1,025th node and hit a connection limit: 64 connections, 64 workers per node, whoops
- MCMUX to do connection pooling for PHP threads
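MCMUX is Zynga's own multiplexer; what follows is only a minimal sketch of the general pooling idea it addresses: many workers sharing a small, fixed number of upstream connections instead of each opening its own. The pool size and the raw-socket protocol are assumptions for illustration.

```python
import queue
import socket

class ConnectionPool:
    """Share a few upstream connections among many workers on one node.

    With 64 workers per node each holding its own connection, the upstream
    connection count grows as nodes x workers; a local pool (or a
    multiplexing proxy in front of it) caps it at pool_size per node.
    """

    def __init__(self, host, port, pool_size=4):
        self._pool = queue.Queue()
        for _ in range(pool_size):
            self._pool.put(socket.create_connection((host, port)))

    def execute(self, request: bytes) -> bytes:
        conn = self._pool.get()        # blocks until a connection is free
        try:
            conn.sendall(request)
            return conn.recv(4096)
        finally:
            self._pool.put(conn)       # hand it back for the next worker
```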
- Locks
- Fear locks, no matter what your tech
- A lock is the front end projecting a weak dependency on the back end
- Case study: a friend juggles, all N of their friends get 5 coins
- Naive:
- Lock self & N friends
- Validate juggle
- Add coins
- Unlock N+1 locks
- Correct:
- Lock self
- Validate juggling
- Unlock self
- Send lockless +5 message
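A sketch of the two versions of the juggling case study in Python. The `store` with per-key locks and the per-player `inbox` are hypothetical interfaces; the point is the lock footprint, not the API.

```python
def reward_friends_naive(store, player, friends, coins=5):
    # Naive: lock self plus all N friends for the whole operation.
    with store.lock_many([player] + friends):     # N+1 locks held at once
        if not store.get(player)["is_juggling"]:
            return
        for f in friends:
            state = store.get(f)
            state["coins"] += coins
            store.put(f, state)

def reward_friends_correct(store, inbox, player, friends, coins=5):
    # Correct: lock only self to validate, then send lockless messages.
    with store.lock(player):                      # one lock, held briefly
        if not store.get(player)["is_juggling"]:
            return
    for f in friends:
        inbox.send(f, {"type": "add_coins", "amount": coins})
    # Each friend applies pending inbox messages under their own lock the
    # next time their state is loaded, so no cross-player locks are taken.
```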
- Q&A
Notes:
1. Not sure why it’s much bigger than, say, inventory – how big can the map data be?
2. Still skeptical about this emphasis on user data size… maybe user info is in a monolithic blob?
3. highdef is another PHP caching product; I didn’t get the exact name
4. I didn’t quite catch the software they’re using, sounded a bit like memcacheq but they said it was commercial.