Zuul case study: The OpenStack project

Zuul drives continuous integration, delivery and deployment systems with a focus on project gating and interrelated projects. In a series of interviews, Superuser asks users why they chose it and how they’re using it.

OpenStack has been using Zuul for about six years now. Here Superuser talks to Clark Boylan, infrastructure engineer at the OSF, and Doug Hellmann, developer, editor, author and veteran OpenStack community member, about the origin story of Zuul.

How would you describe the days before Zuul?

CB: Prior to Zuul we ran a Jenkins master. We enforced gating with Jenkins, but to avoid one change breaking another we had to serialize the gating of each change. Since gating took about an hour per change, we could merge at most 24 changes per day. This bottleneck became a pain point we knew we would have to address.
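The arithmetic behind that ceiling is simple. A minimal sketch, assuming a single serialized gate that runs around the clock with no idle time; the function name is hypothetical:

```python
def serial_merges_per_day(minutes_per_change: float) -> int:
    """Upper bound on merges per day when changes are gated one at a time."""
    minutes_per_day = 24 * 60
    # Only whole test runs count, so round down.
    return int(minutes_per_day // minutes_per_change)


# About an hour of gating per change, as described above.
print(serial_merges_per_day(60))
```

Under this model, the only ways to raise the ceiling are to shorten each test run or to stop testing changes strictly one after another.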

DH: Before Zuul, we would regularly land patches that worked in isolation but then broke things when combined with changes in other repositories. When I first started working on OpenStack, I could run devstack to set up a local test environment and it would work in the morning, but if I wasn’t careful and updated in the afternoon, it would fail, at least for an hour or so until someone fixed the problem. We no longer have that problem, in part due to better test coverage, but largely due to Zuul’s speculative merging and multi-repository testing features, which ensure that changes across several repositories are tested together.

What’s the origin story for Zuul – when was it clear it was needed and what problems inside OpenStack was it started to solve?

CB: OpenStack ended up in a situation where it wanted to keep gating changes but also be able to merge more than 24 changes a day. Running tests more quickly or running fewer tests was either not possible or less than ideal. Instead we (mostly Jim) set out to build a system (Zuul) that could parallelize the serial testing of OpenStack. The trick was to build potential future states, and if those passed we could merge them in aggregate rather than waiting for each to pass in succession. If a change fails testing, it is evicted from the aggregate and tests are restarted without the failing change.
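The idea Clark describes can be sketched as a toy model. This is not Zuul's implementation: the `gate` function and the `passes` callback are hypothetical stand-ins, and the test runs that a real system would execute in parallel are simulated here one after another.

```python
from typing import Callable, List, Sequence, Tuple


def gate(
    queue: Sequence[str],
    passes: Callable[[Tuple[str, ...]], bool],
) -> Tuple[List[str], List[str]]:
    """Toy speculative gate: returns (merged, evicted) change lists.

    `passes` stands in for a full test run against one speculative
    future state, i.e. everything already merged plus the queued
    changes up to and including the change under test.
    """
    merged: List[str] = []
    evicted: List[str] = []
    pending = list(queue)
    while pending:
        # One round: every queued change is tested on top of all the
        # changes ahead of it (a real system would run these at once).
        results = [
            passes(tuple(merged + pending[: i + 1]))
            for i in range(len(pending))
        ]
        if all(results):
            merged.extend(pending)  # merge the whole aggregate
            break
        first_bad = results.index(False)
        # Changes ahead of the failure were tested in valid states.
        merged.extend(pending[:first_bad])
        evicted.append(pending[first_bad])
        # Everything behind it is retested without the failing change.
        pending = pending[first_bad + 1:]
    return merged, evicted


# Hypothetical example: change "B" breaks any state that contains it.
merged, evicted = gate(["A", "B", "C", "D"], lambda state: "B" not in state)
print(merged, evicted)
```

In this example, "A" merges from the first round, "B" is evicted, and "C" and "D" are retested on top of "A" alone before merging. When nothing fails, the whole queue merges after a single round of tests instead of one round per change.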

How did you switch from Jenkins to Zuul?

CB: The initial Zuul system relied on Jenkins to execute jobs for us. This meant that Zuul was the coordinator for Jenkins and the two systems worked together. Around the 2016 Austin Summit, we realized that we could replace Jenkins with an Ansible-based execution system to improve performance and reliability. (We had essentially gotten tired of restarting our Jenkins masters every week to clear out a thread leak.) The success of this Ansible-based system strongly influenced the decision to do the major Zuul v3 rewrite, which also uses Ansible to execute jobs.

When was it clear that it was useful to the larger community?

CB: We’ve seen various entities try to take Zuul and use it either as an internal developer tool or as a CI product for their customers. HP actually did both: forj.io, a product that included Zuul, and Gozer, an internal CI system for HP developers. I want to say this happened very early on, like within the first year or so.

DH: Red Hat also has a product called Software Factory which includes Zuul, along with several other components. Having multiple product offerings built around Zuul tells me that it’s definitely ready to be used by more than the OpenStack community.

When did you realize it would be a successful spin-off – more like “Breaking Bad” and “Better Call Saul” than “Brady Brides”?

CB: We had seen people trying to use it and others talking about using it but being worried about it being OpenStack-specific or requiring specific tools like Gerrit. I think we figured we could accommodate those needs and address those concerns by adding support for common tools like GitHub and for additional Nodepool cloud drivers, while continuing to support the existing needs of OpenStack.

What can you share about metrics?

CB: I usually go to Grafana for this stuff.

DH: Clark mentioned that early on we were able to land about 24 patches in a day. During the Rocky development cycle (roughly February through August 2018), we have been averaging over 180 patches approved and merged per day. That does not include patches that were proposed and tested but not approved and merged.

How do you handle complexity across so many different sub-projects under the OpenStack umbrella?

CB: We try to make things as consistent as possible and provide pre-canned job definitions. This is made somewhat easier than expected because OpenStack, as a whole, is pretty consistent (same programming language, same documentation system, most projects provide a REST API and so on). Zuul supports plenty more, though, and we run tests for Go, Java, and other languages and projects as well.

DH: The enhancements in Version 3, especially the ability to manage test job definitions and other settings in the same repository as the application source code, make it possible for multiple unrelated teams to use a single Zuul deployment without relying on a large group of dedicated operators to write and manage the test jobs.

Several of our OpenStack teams have written their own complex functional and integration test jobs. They didn’t have to start from scratch, because they could build on the common job definitions as a foundation and they could extend those jobs independently to test anything they needed.
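The layering Doug describes, where a team's job extends a common one, can be modeled in a few lines. This is only a conceptual sketch: Zuul's real job definitions live in YAML files in each repository, and every job name and field below is hypothetical.

```python
from typing import Any, Dict

# Hypothetical job definitions; names and fields are illustrative only.
JOBS: Dict[str, Dict[str, Any]] = {
    "base": {"timeout": 1800, "vars": {"python": "3"}},
    "functional": {"parent": "base", "timeout": 3600},
    "team-functional": {"parent": "functional", "vars": {"enable_plugin": True}},
}


def resolve(name: str, jobs: Dict[str, Dict[str, Any]] = JOBS) -> Dict[str, Any]:
    """Flatten a job by applying its settings on top of its ancestors'."""
    job = jobs[name]
    parent = job.get("parent")
    inherited = resolve(parent, jobs) if parent else {"vars": {}}
    # Child settings override inherited ones; variables are merged key by key.
    own = {key: value for key, value in job.items() if key != "parent"}
    result = {**inherited, **own}
    result["vars"] = {**inherited.get("vars", {}), **job.get("vars", {})}
    return result


print(resolve("team-functional"))
```

The team-level job here only states what differs from its parent, which is the reuse pattern the common job definitions make possible.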

What are the takeaways for other users about the power and current limitations?

CB: As for limitations, GitHub reporting and general dashboard information could be improved. The job execution itself is quite powerful.

DH: Zuul is flexible, and as with any flexible system, one needs to take care to plan well to minimize complexity and maximize reuse. It can be easy to fall into the trap of taking the expedient approach and creating similar but slightly different jobs by copying job definitions. Taking a little bit of time to treat the jobs like you would any other source code, by building a library of reusable tools, will pay off in terms of reuse, letting you build new jobs more quickly.

What are you hoping the Zuul community will focus on / deliver in the next year?

DH: I’m looking forward to some of the planned API enhancements to let us query the job configuration for a repository. Today we look at that data directly when we enforce policies related to the standard set of jobs that the community has agreed all projects need to run for minimal testing. Having an API to query that data will let us simplify the Zuul jobs we have that enforce those policies.
