This post, contributed by Stratoscale, helps you navigate common problems.

Upgrading an OpenStack cloud is a challenging task that requires
choosing the right approach, careful planning and precise execution to
minimize the downtime of the cloud environment. Because of this
complexity, many cloud operators prefer to skip one or more releases
before upgrading.

In this OpenStack tutorial, we discuss different aspects of OpenStack upgrades, identify the major pitfalls when upgrading OpenStack and provide solutions and best practices to avoid these pitfalls.

Update vs. upgrade

First of all, we should draw a strict distinction between updating
and upgrading OpenStack. In this OpenStack tutorial, updating
means applying bug fixes and security fixes to the OpenStack
components and the underlying operating system. Such fixes are usually
considered safe for _in-place_ updates, because they do not introduce
new functionality and are therefore unlikely to cause regressions.

Upgrading, by contrast, means moving to a new stable OpenStack
release. An OpenStack cloud consists of a number of distributed
software components that collaborate to deliver the required cloud
services. At first glance, all of these components, including operating
system dependencies, must be upgraded at the same time, which makes the
upgrade even more complex. The good news is that the OpenStack community
aims to keep the component APIs compatible, so an old API version is
usually kept and supported for some time. However, an old API can be
marked as deprecated and later removed from newer releases.

Planning an OpenStack upgrade

There are several important steps we recommend for planning any
OpenStack upgrade:

  1. Read the OpenStack release notes thoroughly to identify potential
    incompatibilities between releases.
  2. Choose the proper method for OpenStack upgrade (see below).
  3. Prepare a plan to roll back a failed upgrade.
  4. Prepare a plan for data backups, at minimum, with backups of
    configuration files and databases.
  5. Determine the acceptable downtime for the cloud, as defined by the
    SLAs for specific services. If any data loss is projected, notify your
    users about the service interruption.
  6. Test the upgrade method using a test environment similar to the
    production one.
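
As a starting point for step 4, here is a minimal Python sketch that archives configuration directories before an upgrade. The function name and the example paths mentioned below are our own assumptions, not part of any OpenStack tool, and database backups still need a separate dump.

```python
import tarfile
import time
from pathlib import Path


def backup_config_dirs(config_dirs, backup_root):
    """Archive each existing configuration directory into one tar.gz file.

    Hypothetical helper: returns the path of the archive that was written.
    """
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = backup_root / f"openstack-config-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        for directory in map(Path, config_dirs):
            if directory.is_dir():
                # Store under the directory's own name so a restore is obvious.
                archive.add(directory, arcname=directory.name)
    return archive_path
```

A call such as backup_config_dirs(["/etc/keystone", "/etc/nova"], "/var/backups/openstack") would cover two services, assuming those are the directories your deployment uses. Directories that do not exist are skipped silently, so check the archive contents before relying on it.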

Methods for OpenStack upgrade

  1. Parallel cloud: Deploy a separate OpenStack cloud and migrate
    all the resources from the old cloud to the upgraded one. This is the
    simplest and least intrusive method, and it has the simplest rollback
    procedure. However, it requires extensive hardware resources and leads
    to lengthy downtime.

  2. Rolling upgrade: Upgrade each component on each server one by one
    until the whole cloud is upgraded. There are two variants:
  • In-place upgrade: Each service is shut down while it is upgraded,
    which causes some downtime, though less than the parallel cloud
    method.

  • Side-by-side upgrade: Since OpenStack Icehouse the controllers
    are decoupled from the compute nodes, so you can upgrade them
    independently. With this method, you deploy an upgraded controller,
    transfer all the data from the old controller to the new one and
    seamlessly replace the old controller with the new one. The old
    controller is left untouched, so a rollback should be simple. To
    achieve zero downtime you need more than one controller in HA mode.

Upgrade pitfalls and solutions

Manual upgrades are prone to failure

Upgrades commonly fail when many repetitive tasks must be completed by
hand. Your cloud consists of many nodes, and each node runs a number of
services that collaborate with services on other nodes. Given this
complexity, manual upgrades are not a realistic option.

Solution: Automate the upgrade. You can use configuration management
tools such as Ansible, Chef and Puppet.

Upgrade of the production cloud can fail

By nature the OpenStack cloud contains custom settings and the
standard upgrade procedure usually does not honor the custom settings in
the configuration files. You should assume that upgrade of the cloud
will fail, so you need to verify the upgrade on a test cloud, which
should be similar to the production one. The test cloud can be smaller
than the production one, but it should have the same architecture and
configuration.

It is very important to have proper automation implemented in your
organization. Both the deployment (for the old release) and the upgrade
procedure should be automated, and both should be under configuration
management control. You should be able to trace each custom setting
back to the original requirement. Before upgrading the production cloud,
verify the upgrade procedure and the corresponding automation with the
following standard approach:

  1. Deploy a test cloud using the same automation scripts that you used
    to deploy the production cloud.
  2. Apply upgrade scripts to the test cloud.
  3. If the upgrade failed, make necessary fixes to the upgrade scripts
    and repeat the procedure from Step 1.
  4. If the upgrade completed successfully, verify the test cloud.
  5. If the verification failed, make necessary fixes to the upgrade
    scripts and repeat the procedure from Step 1.
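
The loop above can be sketched in Python as follows. This is only an outline: deploy, upgrade, verify and fix are hypothetical callables that you would back with your own automation (for example Ansible playbooks and a Rally run), and the attempt limit is an assumption we added so that the loop always ends.

```python
def rehearse_upgrade(deploy, upgrade, verify, fix, max_attempts=5):
    """Repeat deploy -> upgrade -> verify until one full pass succeeds.

    All four arguments are hypothetical callables supplied by the operator:
    deploy() builds a fresh test cloud, upgrade() and verify() return True
    on success, and fix(stage) records or applies a fix to the scripts.
    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        deploy()              # Step 1: always start from a clean test cloud
        if not upgrade():     # Steps 2-3: upgrade failed, fix and start over
            fix("upgrade")
            continue
        if not verify():      # Steps 4-5: verification failed, fix and start over
            fix("verify")
            continue
        return attempt
    return None
```

Redeploying on every attempt keeps each rehearsal honest, because a fix is only trusted once it works against a cloud built the same way as production.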

You can use OpenStack Rally for automated cloud verification. Rally
verification scenarios can include both the standard ones and custom
scenarios that are specific to the cloud under test.

The cloud’s performance will degrade

Each OpenStack release introduces new features and new bugs and, more
importantly, may change the hardware requirements. A new OpenStack
release might require additional or faster CPUs, more memory and more
disk space. This has been true for several OpenStack releases, including
Liberty. Community optimization efforts may eventually reduce these
requirements, but for now you should expect the performance of your
cloud to degrade after an upgrade.

To proactively identify and solve such performance issues, benchmark
and profile both clouds: the old one and the new one. You should be able
to identify any performance degradation and add resources for OpenStack
services under high load. You can use OpenStack Rally for automated
cloud benchmarking and profiling.
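
Once you have timings from both clouds, comparing them can be as simple as the sketch below. It assumes you have already reduced each benchmark run to a mapping of scenario name to duration in seconds. The function name, the input shape and the 10% tolerance are illustrative choices of ours, not Rally output formats.

```python
def find_regressions(before, after, tolerance=0.10):
    """Return scenarios whose duration grew by more than `tolerance`.

    `before` and `after` map a scenario name to a duration in seconds.
    The result maps each regressed scenario to its relative slowdown.
    """
    regressions = {}
    for scenario, old in before.items():
        new = after.get(scenario)
        if new is None or old <= 0:
            continue  # nothing comparable for this scenario
        slowdown = (new - old) / old
        if slowdown > tolerance:
            regressions[scenario] = round(slowdown, 3)
    return regressions
```

Scenarios that exist only in one of the two runs are ignored here, so review them separately.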

Unclean shutdown of the services may lead to an inconsistent state of the cloud

Before it stops, a service should complete all the requests it has
already received from the message queue and tell the message queue to
stop sending it new requests. Shut down OpenStack services gracefully
and give them enough time to complete all active requests and report
their unavailability to the message queue. Shut down one service at a
time, upgrade it and start it again, then do the same for the next one.
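
A minimal sketch of this one-at-a-time pattern is shown below. It assumes the services are managed by systemd, so that "systemctl stop" waits for the unit to stop, and that the unit's stop timeout is long enough for active requests to finish. The upgrade callable is hypothetical and stands in for your package or configuration management step.

```python
import subprocess


def run_command(command):
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


def upgrade_services_one_by_one(services, upgrade, run=run_command):
    """Stop, upgrade and start each service before touching the next one.

    `services` are systemd unit names and `upgrade` is a hypothetical
    callable that upgrades a single service.
    """
    for service in services:
        run(["systemctl", "stop", service])   # blocks until the unit has stopped
        upgrade(service)
        run(["systemctl", "start", service])
```

Because any failing command raises an exception, the loop stops at the first problem instead of moving on to the next service.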

Upgrading the services in the wrong order may break the cloud

You can easily break the cloud by upgrading the services in the wrong
order. The following order is recommended:

  1. Upgrade OpenStack Identity (Keystone)
  2. Upgrade the OpenStack Image service (Glance)
  3. Upgrade OpenStack Compute (Nova)
  4. Upgrade OpenStack Networking (Neutron)
  5. Upgrade OpenStack Block Storage (Cinder)
  6. Upgrade the OpenStack dashboard (Horizon)
  7. Upgrade the OpenStack Orchestration (Heat)
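
If your automation works from a list of installed services, the order above can be encoded once and reused. The sketch below is illustrative: the lowercase project names and the choice to put unlisted services last are our assumptions.

```python
# Recommended order from the list above, by project name.
UPGRADE_ORDER = ["keystone", "glance", "nova", "neutron", "cinder", "horizon", "heat"]


def order_for_upgrade(installed):
    """Sort installed services into the recommended upgrade order.

    Services missing from UPGRADE_ORDER are placed last and keep their
    original relative order, because sorted() is stable.
    """
    rank = {name: index for index, name in enumerate(UPGRADE_ORDER)}
    return sorted(installed, key=lambda name: rank.get(name, len(UPGRADE_ORDER)))
```

For example, a deployment that reports heat, nova, swift and keystone would be upgraded as keystone, nova, heat and then swift.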

Upgrade will fail due to old or missing system dependencies

A new OpenStack release introduces new system dependencies and requires
upgraded versions of the existing ones. An upgraded OpenStack service
will fail to start, or will terminate with a runtime failure, if some of
its system dependencies are not installed or upgraded.

When upgrading the OpenStack services, make sure that all the
dependencies are also upgraded properly. Usually this means that all
of the OpenStack components are installed from packages (deb or rpm)
with correctly defined and tested dependencies. Even then, depending on
the specific configuration, upgrading the packages can break some
services. If the package manager (yum or apt-get) asks whether to update
configuration files, reject the changes. Instead, review and change the
configuration files yourself and restart the services manually.
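
To review what the package wanted to change, you can diff your current file against the packaged version that the package manager leaves next to it (for example a .rpmnew or .dpkg-dist file). The helper below is a small sketch that uses Python's standard difflib module. The function name is ours, and reading the two files is left to the caller.

```python
import difflib


def config_diff(current_text, packaged_text):
    """Return a unified diff from the current config to the packaged one."""
    lines = difflib.unified_diff(
        current_text.splitlines(),
        packaged_text.splitlines(),
        fromfile="current",
        tofile="packaged",
        lineterm="",
    )
    return "\n".join(lines)
```

An empty result means the two files are identical. Otherwise, lines starting with "-" exist only in your file and lines starting with "+" only in the packaged one.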

Database downgrades are not supported

Most of the OpenStack services support database migrations. That means
that each service will try to upgrade its database during startup.
Usually the automated upgrade is well tested for the stable OpenStack
release and can be used safely (it can be disabled in favor of a manual
upgrade, if necessary). However, starting with Kilo, database
downgrades are not supported. Thus, the only reliable way to roll back a
database is to restore it from a backup.

Configuration files will not be upgraded automatically

Each OpenStack release introduces changes to the configuration files.
Options can be removed, renamed and moved to other sections. New options
can be added with the default values that can break your cloud. Read the
release notes thoroughly to identify such changes and apply them to your
configuration files. For example:

  • In Juno, the ‘identity_uri’ option should be used in the
    ‘[keystone_authtoken]’ section instead of ‘auth_host’, ‘auth_port’, and
    ‘auth_protocol’ for all of the services.
  • In Kilo, when using libvirt 1.2.2, live snapshots are disabled by
    default (‘workarounds.disable_libvirt_livesnapshot=True’). Deployers
    can set the option to False in nova.conf to enable live snapshot
    support.
  • In Liberty, setting ‘force_config_drive=always’ in nova.conf is
    deprecated; use True/False boolean values instead.
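
Checks like these are easy to automate once you have extracted them from the release notes. The sketch below uses Python's standard configparser module to flag two of the examples above in the text of a configuration file. The function name and message wording are ours, and a real checker would need one rule per change that affects your cloud.

```python
import configparser

# Options that the Juno example above replaces with identity_uri.
REPLACED_BY_IDENTITY_URI = ("auth_host", "auth_port", "auth_protocol")


def find_config_problems(config_text):
    """Return human-readable warnings for known outdated settings."""
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates repeated options, which some service configs contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    problems = []

    if parser.has_section("keystone_authtoken"):
        for option in REPLACED_BY_IDENTITY_URI:
            if parser.has_option("keystone_authtoken", option):
                problems.append(f"[keystone_authtoken] {option}: use identity_uri")

    value = parser.defaults().get("force_config_drive")
    if value is not None and value.lower() == "always":
        problems.append("[DEFAULT] force_config_drive=always: use True or False")
    return problems
```

Note that configparser treats [DEFAULT] as inherited by every section, so an option set there is also reported for [keystone_authtoken].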

Upgrade will fail due to new, deprecated or removed API

If you have custom scripts or other software that use the OpenStack
API, be prepared for a failed upgrade, because a new OpenStack release
may introduce a new API version and mark the old version as deprecated.

In the worst case scenario, the API can be removed from the release.
Read the release notes thoroughly to identify such changes and apply
them to your cloud. For example:

  • In Kilo, the EC2 API support has been deprecated, with removal
    planned for a future release.
  • In Liberty, the Load Balancer as a Service (LBaaS) V1 API is marked
    as deprecated and is planned to be removed in a future release. Going
    forward, the LBaaS V2 API should be used.

Upgrade will fail due to deprecated or removed features, plugins and drivers

If you are using a vendor-specific plugin, be prepared for a failed
upgrade, because such a feature or plugin may be deprecated or even
removed in a new OpenStack release. Read the release notes thoroughly to
identify such changes and apply them to your cloud. For example:

  • In Kilo, XML support in Keystone has been removed
  • In Kilo and Liberty, many monolithic vendor specific plugins have
    been removed from Neutron

Upgrade will fail due to architectural changes

In some cases your cloud may depend on a specific architectural feature
of the old OpenStack release, which is changed or deprecated in a new
release. Read the release notes thoroughly to identify such changes and
apply them to your cloud. For example:

  • Use Python 3 instead of Python 2.6
  • Use the pymysql database driver instead of MySQL-Python
  • Use unified ‘openstack’ client instead of ‘keystone’, ‘glance’, etc.
  • In Liberty, Ceilometer Alarms is deprecated in favour of Aodh
  • In Kilo and Liberty releases, the Keystone project deprecates
    eventlet in favor of a separate web server with WSGI extensions

The future of OpenStack upgrades

To help solve challenges related to upgrading OpenStack, the OpenStack
community has adopted a Big Tent approach for new releases. With the
Big Tent model, operators will be able to select the preferred
components and their versions, and then add or upgrade modules
incrementally with little or no downtime.

This post first appeared on Stratoscale’s blog.
