For Blizzard Entertainment, it’s “game over” on scaling complexity

Blizzard Entertainment is a California-based software company focused on creating and developing game entertainment for customers in the Americas, Europe, Asia, and Australia. In late June, two of the company’s engineering leaders, Colin Cashin and Erik Andersson, talked to the open source community about their cloud strategy and scaling challenges at the OpenDev virtual event, which focused on large-scale usage of open infrastructure.

Blizzard employs a multi-cloud strategy. The company uses public clouds from Amazon, Google, and Alibaba, and since 2016 has also owned and operated an extensive global private cloud platform built on OpenStack. Blizzard currently has 12,000 compute nodes on OpenStack distributed globally, and last year even managed to upgrade five releases in one jump to start using Rocky. The team at Blizzard is also dedicated to contributing upstream.

All in all, Blizzard values simplicity over complexity, and has made consistent efforts to combat complexity by addressing four major scaling challenges.

Scaling challenge #1:
The first scaling challenge that Blizzard faced was Nova scheduling with NUMA pinning. NUMA pinning ensures that guest virtual machines are not scheduled across NUMA zones on dual-socket compute nodes, thereby avoiding the penalty of traversing oversubscribed bus interconnects on the CPU. For high-performance game server workloads, this is the difference between a great and a not-so-great player experience. In 2016, ahead of the launch of Overwatch, Blizzard’s first team-based first-person shooter (FPS), the team decided to implement NUMA pinning during scheduling to prevent these issues. At scale, this decision caused a lot of pain. NUMA scheduling is expensive and requires a call back to the Nova database, which lengthens the turnaround time of each scheduling request. During particularly large deployments, the team regularly ran into race conditions where scheduling failed. They ultimately addressed this with configuration tuning that increased the scheduler’s target pool from 1 to 20 compute nodes. Another side effect of NUMA pinning was broken live migration, a hindrance that is now fixed in the Train release of Nova.
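
Why a larger target pool reduces scheduling races can be illustrated with a small simulation. This is a minimal sketch, not Blizzard’s configuration or Nova’s actual code: it assumes that each concurrent request picks one host at random from the top N ranked candidates, which is how Nova’s filter scheduler option `host_subset_size` is documented to behave (the talk summary does not name the option, so that mapping is an assumption). The host names and the functions `pick_host` and `count_collisions` are hypothetical.

```python
import random


def pick_host(ranked_hosts, subset_size, rng):
    """Pick one host at random from the best `subset_size` candidates."""
    subset = ranked_hosts[:max(1, subset_size)]
    return rng.choice(subset)


def count_collisions(num_requests, ranked_hosts, subset_size, seed=0):
    """Count requests that land on a host already picked by an earlier
    request in the same burst (a stand-in for a scheduling race)."""
    rng = random.Random(seed)
    seen = set()
    collisions = 0
    for _ in range(num_requests):
        host = pick_host(ranked_hosts, subset_size, rng)
        if host in seen:
            collisions += 1
        seen.add(host)
    return collisions


if __name__ == "__main__":
    # Hypothetical fleet, already ranked best-first by the scheduler.
    hosts = [f"compute-{i:03d}" for i in range(100)]
    print("pool of 1: ", count_collisions(10, hosts, subset_size=1))
    print("pool of 20:", count_collisions(10, hosts, subset_size=20))
```

With a pool of 1, every request in a burst targets the same top-ranked host, so all but the first collide. Spreading the choice across 20 hosts makes it much less likely that two requests claim the same NUMA resources at once, at the cost of sometimes placing a guest on a slightly lower-ranked host.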

The takeaway: For large environments, NUMA pinning should be implemented in a tested and controlled manner. Live migration with NUMA pinning is fixed in newer releases.

Scaling challenge #2:
Next, Cashin and Andersson discussed scaling RabbitMQ. RabbitMQ acts as the message broker between OpenStack components, but in Blizzard’s case it has proven to be easily overwhelmed. Recovering services when something went wrong (e.g., a large-scale network event in a datacenter) appeared to be their biggest challenge at scale, and this wasn’t something that could be overlooked. To tackle this, Blizzard tuned its RabbitMQ configuration to introduce extended connection timeouts with a variance, allowing for slower but more graceful recovery. Additional tuning was applied to RabbitMQ queues to make sure that only critical queues were replicated across clusters and that queues could not grow exponentially during these events.
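
The idea behind a timeout “with a variance” can be sketched as follows. This is an illustration of the general technique (adding random jitter so that clients do not all reconnect at the same instant), not Blizzard’s actual settings: the function names, the 60-second base timeout, and the 30-second variance are assumptions chosen for the example.

```python
import random


def reconnect_delay(base_timeout, variance, rng):
    """Return the base timeout plus a random offset in [0, variance] seconds."""
    return base_timeout + rng.uniform(0, variance)


def recovery_schedule(num_clients, base_timeout, variance, seed=42):
    """Return the sorted reconnect times for a fleet of clients that all
    lost their broker connection at time zero."""
    rng = random.Random(seed)
    return sorted(
        reconnect_delay(base_timeout, variance, rng) for _ in range(num_clients)
    )


if __name__ == "__main__":
    # Hypothetical values: 1,000 clients, 60 s base timeout, 30 s variance.
    schedule = recovery_schedule(1000, base_timeout=60.0, variance=30.0)
    print(f"first reconnect at {schedule[0]:.1f} s")
    print(f"last reconnect at  {schedule[-1]:.1f} s")
```

Without the variance, every client would retry at exactly the base timeout and hit the broker at once. With it, the same reconnects are spread over the variance window, which is slower overall but gentler on a recovering cluster.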

The takeaway: RabbitMQ needs tuning unique to your environment.

Scaling challenge #3:
Neutron scaling proved to be the third major hurdle for Blizzard. Blizzard experienced several protracted operational incidents due to having certain OpenStack services colocated on the same controller hosts. The Blizzard team fixed this in 2019, when they decided to scale horizontally by migrating Neutron RPC workers to virtual machines. Moving to VMs also solved the shared fate of the API and worker pools. Additionally, there was the issue of overwhelming the control plane when metadata services proxied huge amounts of data at scale. After much research and conversation with the community, Andersson was able to extend the interval to 4-5 minutes and reduce load on the control plane by up to 75% during normal operations.

The takeaway: Neutron configuration and deployment should be carefully considered as the scale of your cloud grows.

Scaling challenge #4:
Lastly, compute fleet maintenance had been an issue for Blizzard for quite some time. As their private cloud went into production at scale, there was an internal drive to migrate more workloads from bare metal into the cloud. In many cases, this meant that migrations took place before applications were truly cloud-aware. Over time, this severely impacted Blizzard’s ability to maintain the fleet. Upgrades involved lots of toil and did not scale. Over the past 15 months, Blizzard’s software team has built a new product, Cloud Automated Maintenance, that enables automated draining and upgrading of the fleet. The product uses Ironic underneath to orchestrate bare metal, along with a public-cloud-style notification system, all automated by lifecycle signaling.

The takeaway: Onboard tenants to OpenStack with strict expectations set about migration capabilities, particularly for less ephemeral workloads. Also, have the processes and/or systems in place to manage fleet maintenance before entering production.

Going forward, Blizzard will continue to pinpoint and tackle challenges to eliminate complexity at scale as much as they can. If you’re interested in finding solutions to these challenges, Blizzard is hiring at their offices in Irvine, CA.