The open-source software helped reboot 80,000 cores over about seven months.

image

Just about a year ago, the security community got a nasty wake-up call: Spectre and Meltdown.

Considered “pretty catastrophic” by experts, they were a series of vulnerabilities discovered by various security researchers around performance optimization techniques built in modern CPUs. Those optimizations (involving superscalar capabilities, out-of-order execution, and speculative branch prediction) essentially created a side-channel that could be exploited to deduce the content of computer memory that should normally not be accessible.

For e-commerce giant eBay, keeping the nightmares away was a particularly complex project. The eBay Classifieds Group has a private cloud distributed in two geographical regions (with future plans in the works for a third), around 1,000 hypervisors and a capacity of 80,000 cores. The team needed to patch hypervisors on four availability zones for each region with the latest kernel, KVM version and BIOS updates. During these updates the zones were unavailable and all the instances restarted automatically.

Bruno Bompastor and Adrian Joian, from eBay’s cloud reliability team, shared how shoring up their system against these vulnerabilities stretched from January until July. One of the takeaways? First that Ansible is a great tool for infrastructure automation. “We decided to use Ansible as our main tool and heavily relied on Ansible roles as a way to organize tasks,” Bompastor says. As an example, the team has OpenStack roles, hardware roles, update roles and — the most important one for this project — the checker role, to scan for these vulnerabilities. They ran a checker on the host (an open-source script that basically tests the variants they wanted to check). Available on GitHub, “it’s a very nice script that covers everything like this…”

They also gave an inside look all the work the team had to perform to shut down, update and boot successfully an infrastructure fully patched and without data loss. They discussed how the team managed of SDN (Juniper Contrail) and LBaaS (Avi Networks) when restarting this massive number of cores.

Catch the whole case study on YouTube or the slides here.

Superuser