Belmiro Moreira, cloud architect at CERN, tells Superuser how the research center went from a few hundred nodes to 280,000 cores.


We’re spotlighting users and operators who are on the front lines deploying OpenStack in their organizations to drive business success. These users are taking risks, contributing back to the community and working to secure the success of their organization in today’s software-defined economy.

Today we talk to Belmiro Moreira, cloud architect, who tells Superuser how the research center went from a few hundred nodes to 280,000 cores.

Describe how you are using OpenStack now. What kinds of applications or workloads are you currently running on OpenStack?

CERN provides facilities and resources, including compute and storage, to scientists all around the world for their fundamental research.

Back in 2012, the CERN IT department decided to deploy a private cloud based on OpenStack and other open-source tools to improve resource utilization and reduce the time needed to provision compute resources. The Cloud Infrastructure Service moved into production with only a few hundred compute nodes in July 2013. Since then, it has grown rapidly to several thousand compute nodes. This scaling represented a huge architectural challenge.

In order to manage the large number of compute nodes, we split the infrastructure into several Nova cells. This allows us to scale OpenStack Nova and reduce each failure domain to only a few hundred compute nodes. Currently, we have more than 280,000 cores available, split across 60 Nova cells.

Users can increase their application availability by creating resources in different availability zones that are mapped to several Nova cells.
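
As an illustration of that pattern, spreading an application across availability zones takes only a few API calls. The sketch below uses the openstacksdk Python library; the cloud, zone, image, flavor and network names are placeholders rather than actual CERN values.

```python
# Illustrative sketch only: spread an application across two availability
# zones using openstacksdk. Cloud, zone, image, flavor and network names
# below are placeholders, not actual CERN values.
import openstack

conn = openstack.connect(cloud="mycloud")  # credentials come from clouds.yaml

for zone in ("zone-a", "zone-b"):  # hypothetical availability zone names
    server = conn.create_server(
        name=f"myapp-{zone}",
        image="cc7-base",           # placeholder image
        flavor="m2.medium",         # placeholder flavor
        network="my-project-net",   # placeholder network
        availability_zone=zone,
        wait=True,
    )
    print(server.name, server.status)
```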

The infrastructure runs in two CERN data centers (Geneva and Budapest) separated by 22 ms of network latency; however, only one region is defined. Users have a single endpoint to access the CERN Cloud.

When we started with the cloud setup, we offered only a few OpenStack projects (Nova, Glance, Cinder, Keystone). Today we run more than 12 different OpenStack projects and continue to evaluate new ones.

We have a very wide, diverse and geographically dispersed user community. Currently, more than 3,000 users use the CERN Cloud for their research projects. Because of that, the set of applications that run in our Cloud is extremely heterogeneous. They range from IT services to scientific data-processing applications to remote desktops. As an example, projects like “CMS Data Acquisition,” “ATLAS Tile Calorimeter,” “IT Windows Terminal Service” and “Personal Belmiro” share the same compute nodes. In the past, each would have had its own dedicated servers. This makes the CERN Cloud a very dynamic and challenging environment for our engineering team because of all the different use cases, service expectations and requirements.

The biggest use case, however, is processing the data from the Large Hadron Collider (LHC).

The data analysis entails the execution of a large number of loosely coupled jobs responsible for crunching the 200 PB of data stored in our tape libraries. These jobs run on resources provisioned directly by the LHC experiments using the OpenStack APIs, or in the batch system that runs on top of the CERN Cloud.

Recently, we added OpenStack Magnum to our service offering. This allows us to provision Kubernetes, Docker Swarm and Mesos clusters. Several users are testing the possibilities of containers to run data analysis and, especially, to run legacy code from previous experiments and data sets.
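
To give a flavor of that workflow, the sketch below requests a small Kubernetes cluster through Magnum via the openstacksdk container-infra proxy; the cloud name, cluster template and keypair are assumed placeholders, not actual CERN resources.

```python
# Illustrative sketch only: request a Kubernetes cluster through OpenStack
# Magnum with openstacksdk. Cloud, template and keypair names are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")

cluster = conn.container_infrastructure_management.create_cluster(
    name="analysis-k8s",
    cluster_template_id="kubernetes-template",  # placeholder cluster template
    keypair="my-keypair",                       # placeholder keypair
    master_count=1,
    node_count=4,
)
print(cluster.status)
```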

What results has CERN seen from using OpenStack? How are you measuring the impact?

Before the introduction of the Cloud Infrastructure Service, all compute resources were manually allocated to the different use cases and projects. Managing this process was complicated, and the allocation could take up to several weeks. Also, once the resources were provisioned, it was extremely difficult to move sporadically idle compute capacity to other projects.

Moving to a self-service resource allocation model based on predefined quotas allowed us to decrease provisioning time drastically, and since resources are only created when needed, it has resulted in more compute capacity overall being available to our scientific communities.
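
A rough sketch of how such a quota-based model can look with openstacksdk: an administrator caps a project's compute quota once, and users then create and delete resources on their own within it. The cloud and project names and the quota values below are illustrative assumptions.

```python
# Illustrative sketch only: grant a project a fixed compute quota so its
# members can self-provision within it. Cloud/project names and the quota
# values are made-up examples.
import openstack

conn = openstack.connect(cloud="mycloud-admin")  # needs admin credentials

conn.set_compute_quotas(
    "my-research-project",   # placeholder project name
    cores=200,
    instances=100,
    ram=400 * 1024,          # RAM quota is expressed in MB
)
print(conn.get_compute_quotas("my-research-project"))
```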

We currently only provision compute resources as virtual machines and containers; however, we are working to offer bare metal as a service in the near future. This will not only improve the efficiency of physical node allocation but, most importantly, allow us to consolidate the management and provisioning of all resources in the CERN data centers through an OpenStack interface. A common workflow for all provisioned resources will offer further productivity gains.

For updates from the CERN cloud team, check out the OpenStack in Production blog.

Superuser wants to hear more from operators like you; get in touch at editorATopenstack.org.


Cover Photo by Marceline Smith // CC BY NC