The OpenStack Summit Tokyo was, as always, educational, exciting, entertaining and energizing.
After we flew back and shook off the jet lag, the question was: what will we do with what we learned and discussed? What actions will we take as operators to make our cloud more stable, performant and future-proof?
After some team discussions, here are the things we are taking immediate action on, either doing or investigating, based on what we learned in Tokyo.
On the first day of the summit, we attended a talk called “RabbitMQ Operations for OpenStack.” RabbitMQ is always a sore spot and a dangerous area to poke in an OpenStack deployment. In this talk, Michael Klishin from Pivotal Labs gave dozens of tips on how to operate [RabbitMQ](https://www.rabbitmq.com).
A few of these tips really hit home:
- “Always run the latest version.” I was a bit horrified, but not surprised, to find out that we were two full versions out of date.
- “Tune your FDs.” We’d already been hit by this: the default file descriptor limit is way too low, and as we add capacity and nodes the number in use keeps climbing.
- “Use autoheal.” We’re not and we should be.
- “Optimize the TCP config.” This paraphrases about 5 minutes of the talk. The good news is that the newer Puppet modules do most of this for us, so we upgraded the Puppet module, too.
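To make the file descriptor tip a little more concrete, here is a minimal sketch of the kind of check an operator might run. It is not from the talk: the function names and the 80% warning threshold are hypothetical, and `current_process_fds` inspects the Python process that runs it through `/proc/self/fd` (Linux only), so for a real broker you would feed in the numbers reported for the RabbitMQ process instead.

```python
import os
import resource


def fd_usage_ratio(used, limit):
    """Return the fraction of the descriptor limit currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def needs_attention(used, limit, warn_ratio=0.8):
    """True when usage is at or above warn_ratio (a hypothetical threshold)."""
    return fd_usage_ratio(used, limit) >= warn_ratio


def current_process_fds():
    """Count open descriptors and the soft limit for this process (Linux only)."""
    used = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return used, soft_limit
```

Calling `needs_attention(*current_process_fds())` gives a quick yes/no answer. The point is simply that the limit and the usage both need watching, since usage grows as nodes are added.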
The thing with Rabbit is that if you’re going to take it down for a reconfig or an upgrade, you only want to do it once. So we spent some time digging into all the rest of his optimizations and also wrote up an Ansible-based RabbitMQ upgrade playbook. We can happily say that we finished rolling out this upgrade of RabbitMQ (from 3.3 to 3.5) to production last week. This included our optimized configuration.
Another talk that we enjoyed (and led along with GoDaddy) was [“Duct-Tape, Bubblegum, and Bailing Wire: The 12 Steps in Operating OpenStack.”](https://www.openstack.org/summit/tokyo-2015/videos/presentation/duct-tape-bubblegum-and-bailing-wire-12-steps-in-operating-openstack) In this talk we learned about a bunch of issues that other operators have had, the little problems that crop up and cause headaches. There were plenty of other talks and hallway discussions in this category, and as an outcome we made a few small tweaks in our environment:
- Raising kernel.pid_max (sysctl): We had the default value and according to other operators this can cause issues with Ceph. This rolled to production last week.
- Enabling the Neutron rootwrap daemon: Neutron wastes a lot of time exec'ing rootwrap and sudo, and the daemon should make Neutron calls to things like `ip netns` much faster. This change is baking in our dev environment while we test how stable it is.
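For the `kernel.pid_max` tweak, a small verification step can confirm that a host actually picked up the new value after the sysctl change. The sketch below is illustrative only: the helper names are hypothetical, and `EXAMPLE_TARGET` is an assumed target for the example rather than the value rolled out here. The kernel exposes the live setting at `/proc/sys/kernel/pid_max`, which on many distributions starts out at 32768.

```python
from pathlib import Path

# The kernel exposes the live value of kernel.pid_max at this path.
PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")

# Assumed target for illustration only; pick the value your own tuning calls for.
EXAMPLE_TARGET = 4194303


def parse_pid_max(text):
    """Parse the contents of /proc/sys/kernel/pid_max into an int."""
    return int(text.strip())


def pid_max_is_at_least(current, target):
    """True when the current pid_max meets or exceeds the target."""
    return current >= target


def read_pid_max(path=PID_MAX_PATH):
    """Read the current pid_max from the running kernel (Linux only)."""
    return parse_pid_max(path.read_text())
```

A check like `pid_max_is_at_least(read_pid_max(), EXAMPLE_TARGET)` fits naturally into a post-deploy validation step, so a host that missed the change is caught before it causes trouble.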
Finally, we both attended lots of [Ops sessions.](https://mitakadesignsummit.sched.org/overview/type/ops?iframe=no&w=i:100;&sidebar=yes&bg=no#.VkUDJK6rR2Q) Much of the discussion in these sessions was on prepping for the Liberty upgrade. We have a philosophy that we always try to upgrade fairly quickly after the summit, so we’re getting some tasks done now that will make it easier when we do it.
One of the big things we’re looking forward to in Liberty is improved Fernet token performance. Fernet tokens are marching towards being the default token provider and although they work well for us, we’d like them to be faster. There was a [great design session](https://etherpad.openstack.org/p/keystone-mitaka-summit-tokens) on these changes on Wednesday of the summit.
Another major thing we’re going to enjoy in Liberty is the Neutron OVS agent fix. Right now in Kilo (and before) when the OVS agent restarts it can interrupt networking to the customer VMs we host. This makes host maintenance and upgrades painful. Unfortunately, [this fix](https://etherpad.openstack.org/p/keystone-mitaka-summit-tokens) will not be backported to Kilo, but our upgrades will be much easier after Liberty.
Due to some infrastructure work and holiday closures, we’re not starting the Liberty upgrade yet. So what pre-Liberty work are we doing now?
One topic that saw a lot of discussion is the requirement to be on Kilo.1 for Nova before upgrading to Liberty. In general, upgrading to a new minor release for OpenStack is fairly straightforward, but as mentioned before, restarting the OVS agent to do that upgrade can be really disruptive. Additionally, package dependencies can force you to upgrade everything on a box even if you just want a newer copy of Nova. For these reasons, we’re currently working on moving Nova services into Docker containers so that we can upgrade just that single service.
[Keystone deprecated eventlet](https://review.openstack.org/#/c/157495/) during the Kilo cycle, and so we’ve already switched over to serving Keystone with Apache. However, because other services will probably also deprecate eventlet, and for performance reasons, we’re looking to switch all of our API services to a higher-performing WSGI server. We’re moving more and more services into containers, and running Apache inside a container would be workable but seemed like overkill. We learned that most people now recommend uWSGI for this, so we’ve been reading up, experimenting with it and getting close to rolling out our first service (Heat) on it.
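For readers less familiar with WSGI, the reason this kind of server swap is feasible is that the application side is just a Python callable with a fixed signature; Apache with mod_wsgi, uWSGI or any other WSGI server can host the same callable. The sketch below is a toy illustration, not any OpenStack service's real entry point, and the JSON payload is invented for the example.

```python
import json


def application(environ, start_response):
    """Minimal WSGI callable: returns a small JSON status document."""
    body = json.dumps(
        {"path": environ.get("PATH_INFO", "/"), "status": "ok"}
    ).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    # The server supplies start_response; the app calls it once with the
    # status line and headers, then returns an iterable of byte strings.
    start_response("200 OK", headers)
    return [body]
```

Because the contract is this small, the choice of server becomes mostly an operational decision about performance, packaging and how well it fits inside a container.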
Tokyo gave us a ton of ideas and also a ton of work, but we’re confident that these changes will make our cloud better, our lives as operators easier and our customers happier.
You can catch Matt Fischer, principal engineer at Time Warner Cable, on Twitter at [@openmfisch](https://twitter.com/@openmfisch). Clayton O’Neill, also a principal engineer at TWC, is on Twitter at [@clayton_oneill](https://twitter.com/clayton_oneill).
Superuser is always interested in how-tos and other contributions. Please get in touch: [email protected]
[Cover Photo](https://www.flickr.com/photos/pburka/8563699246/in/photolist-e3Kcr1-e3Kc2j-e3Dw6p-e3Dwer) // CC [BY NC](https://creativecommons.org/licenses/by-nc/2.0/)