How to build clusters for scientific applications-as-a-service

How can we make it easier to run a workload on the cloud? In a previous article we presented the lay of the land for HPC workload management in an OpenStack environment. A substantial part of the work done to date focuses on automating the creation of a software-defined workload management environment: SLURM-as-a-Service.

SLURM is only one narrow (but widely used) use case in a broad ecosystem of multi-node scientific application clusters, so let’s not over-specialize. This raises the question: what is needed to build a generally useful, flexible system for creating Cluster-as-a-Service?

What do users really want?

A user of the system will not care how elegantly the infrastructure is orchestrated:

- Users will want support for the science tools they need, and when new tools are needed, the users will want support for those too.
- Users will want to get started with minimal effort. The learning curve they must climb to deploy tools needs to be shallow.
- Users will want easy access to the datasets upon which their research is based.

Scientists certainly do not want to be given clusters which in truth are just replicated infrastructure. We must provide immediately useful environments that don’t require scientists to be sysadmins. The time to science (or more bluntly, the time to paper) is pretty much the foremost consideration. Being able to use automation to reliably reproduce research findings comes a close second.

Application building blocks

A really useful application cluster service is built with several key design criteria taken into account:

- The right model for sharing. Do we provide a globally-available shared infrastructure, project-level multi-user infrastructure or per-user isolation? Per-user isolation might work until the user decides they prefer to collaborate. But can a user trust every other user in their project? Per-project sharing might work unless the users don’t actually trust one another (which might in itself be a bigger problem).
- Users on the cluster. Are the members of the project also to be the users of the cluster? In this case, we should do what we can to offer a cluster deployment that is tightly integrated with the OpenStack environment. Why not authenticate with their OpenStack credentials, for example? In a different service model, the OpenStack project members are the cluster admins, not its users. If the cluster is being provided as a service to others who have no connection with the infrastructure, an external mechanism is required to list the users, and to authenticate them. Some flexibility should be supported in an effective solution.
- Easy data access. Copying user files into a cluster introduces a boundary to cross, which makes that resource less convenient to use. Furthermore, copying data in requires the same data to be stored in two (or more) places. Where a cluster requires a shared filesystem, creating its own ad-hoc filesystem is unlikely to be the best solution. In the same manner as provider networks, a cluster should support “provider filesystems”: site production filesystems that are exported from other infrastructure in the data center. Large scientific datasets may also be required, and are often mediated using platform data services such as iRODS, or HDF5 (note, that’s a 5). Object storage (S3 in particular) is seen as the long-term solution for connecting applications with datasets, and does appear to be the direction of travel for many. However, sharing read-only filesystems, at either file or block level, is a simple and universal approach that works perfectly well. Both levels are also well-established choices.
- Scaling up and down. The cloud access model does not entail queuing and waiting for resources like a conventional HPC batch queuing system. Equally, cloud resources are assumed to grow and shrink dynamically, as required. Perhaps this could even happen automatically, triggered by peaks or troughs in demand. To maximize overall utilization, cluster resizing is actually quite important, and managing all the resources of a cluster together enables us to do it well.
- Self-service creation – for some definition of self. Users shouldn’t need to learn sysadmin skills in order to create a resource for doing their science. For some, the Horizon web interface might be too complex to bother learning. Enter the ResOps role – for example described in a recent SuperUser article on the Scientific WG – a specialist embedded within the project team, trained on working with cloud infrastructure to deliver the best outcomes for that team. To help the task of cluster creation, it should also be automated to the fullest extent possible.

Bringing it all together: OpenHPC-as-a-Service

The OpenHPC project is an initiative to build a community package distribution and ecosystem around a common software framework for HPC.

OpenHPC clusters are built around the popular SLURM workload manager.

A team from Intel including Sunil Mahawar, Yih Leong Sun and Jeff Adams has already presented their work at the latest OpenStack summit in Boston.

Independently, we have been working on our own OpenHPC clusters as one of our scientific cluster applications on the SKA performance prototype system, and I’m going to share some of the components we have used to make this project happen.

SKA Alaska: Performance Prototype Platform

Software base image: An image with major components baked in saves network bandwidth and deployment time for large clusters. We use our os-images role, available on Ansible Galaxy, together with our custom elements written for Diskimage-builder.

Here’s an example of the configuration to provide for the os-images:
os_images_list:
  # Build of OpenHPC image on a CentOS base
  - name: "CentOS7-OpenHPC"
    elements:
      - "centos7"
      - "epel"
      - "openhpc"
      - "selinux-permissive"
      - "dhcp-all-interfaces"
      - "vm"
    env:
      DIB_OPENHPC_GRPLIST: "ohpc-base-compute ohpc-slurm-client 'InfiniBand Support'"
      DIB_OPENHPC_PKGLIST: "lmod-ohpc mrsh-ohpc lustre-client-ohpc ntp"
      DIB_OPENHPC_DELETE_REPO: "n"
    properties:
      os_distro: "centos"
      os_version: 7

os_images_elements: ../stackhpc-image-elements
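To illustrate what a configuration like this amounts to, here is a minimal Python sketch that turns one image entry into a `disk-image-create` command line and environment. It is not the os-images role's code: the `build_dib_invocation` helper is hypothetical, and the sketch assumes the custom elements directory is passed to Diskimage-builder through its `ELEMENTS_PATH` environment variable.

```python
import os


def build_dib_invocation(image, elements_path, base_env=None):
    """Return (argv, env) for building one image with disk-image-create.

    `image` is a dict shaped like one entry of os_images_list.
    This helper is hypothetical and only illustrates the mapping.
    """
    # Start from the caller's environment unless one is supplied.
    env = dict(os.environ if base_env is None else base_env)
    # Element-specific settings (the DIB_* variables) are passed as environment.
    env.update({key: str(value) for key, value in image.get("env", {}).items()})
    # Assumption: custom elements are located through ELEMENTS_PATH.
    env["ELEMENTS_PATH"] = elements_path
    # disk-image-create takes the output name, then the list of elements.
    argv = ["disk-image-create", "-o", image["name"]] + list(image["elements"])
    return argv, env
```

The resulting pair could be handed to `subprocess.run(argv, env=env, check=True)` on a host with Diskimage-builder installed.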

Flexible infrastructure definition: A flexible definition is needed because it soon becomes apparent that application clusters do not all conform to a template of 1 master, n workers. We use a Heat template in which the instances to create are parameterized as a list of groups. This takes Heat close to the limits of its expressiveness, and requires the Newton release of Heat as a minimum. Some users of Heat take an alternative approach of code-generated Heat templates.

Our flexible Heat template is encapsulated within an Ansible role available on Ansible Galaxy. This role includes a task to generate a static inventory file, extended with user-supplied node groupings, which is suitable for higher-level configuration.

The definition of the cluster's node groups takes a simple form:
cluster_groups:
  - "{{ slurm_login }}"
  - "{{ slurm_compute }}"

slurm_login:
  name: "login"
  flavor: "compute-B"
  image: "CentOS7-OpenHPC"
  num_nodes: 1

slurm_compute:
  name: "compute"
  flavor: "compute-A"
  image: "CentOS7-OpenHPC"
  num_nodes: 8
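The step from a list of groups to a static inventory can be sketched in a few lines of Python. This is an illustration only, not the role's implementation: the `render_inventory` helper, the `<cluster>-<group>-<index>` hostname pattern and the `extra_groups` argument (standing in for user-supplied node groupings) are all assumptions.

```python
def render_inventory(cluster_name, groups, extra_groups=None):
    """Render an INI-style Ansible inventory from a list of node groups.

    Each group is a dict with at least `name` and `num_nodes`.
    The hostname pattern used here is an assumption for illustration.
    """
    lines = []
    for group in groups:
        # One inventory section per group of instances.
        lines.append("[{}]".format(group["name"]))
        for index in range(group["num_nodes"]):
            lines.append("{}-{}-{}".format(cluster_name, group["name"], index))
        lines.append("")
    # User-supplied groupings become parent groups of existing groups.
    for parent, children in sorted((extra_groups or {}).items()):
        lines.append("[{}:children]".format(parent))
        lines.extend(children)
        lines.append("")
    return "\n".join(lines)
```

Calling it with a one-node `login` group and an eight-node `compute` group yields one section per group, which higher-level playbooks can then target by group name.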

The invocation of the infrastructure role becomes equally simple:
---
# This playbook uses the Ansible OpenStack modules to create a cluster
# using a number of baremetal compute node instances, and configure it
# for a SLURM partition
- hosts: openstack
  roles:
    - role: stackhpc.cluster-infra
      cluster_name: "{{ cluster_name }}"
      cluster_params:
        cluster_prefix: "{{ cluster_name }}"
        cluster_keypair: "{{ cluster_keypair }}"
        cluster_groups: "{{ cluster_groups }}"
        cluster_net: "{{ cluster_net }}"
        cluster_roles: "{{ cluster_roles }}"

Authenticating our users: On this prototype system, we currently have a small number of users, and these users are locally defined within Keystone. In a larger production environment, a more likely scenario would be that the users of an OpenStack cloud are stored within external authentication infrastructure, such as LDAP.
Equivalent user accounts must be created on our OpenHPC cluster. Users need to be able to log in on the externally-facing login node. The users should be defined on the batch compute nodes, but they should not be able to log in on these instances.

Our solution is to enable our users to authenticate using Keystone on the login node. This is done using two projects, PAM-Python and PAM-Keystone – a minimal PAM module that performs auth requests using the Keystone API. Using this, our users benefit from common authentication on OpenStack and all the resources created on it.
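The idea can be sketched as a small module for PAM-Python that checks a password by asking Keystone for a token. This is not the PAM-Keystone source: it is a simplified illustration that uses only the Python standard library, the `AUTH_URL` endpoint is a placeholder, and it assumes the password has already been collected by an earlier module in the PAM stack.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoint; a real module would take this from its arguments.
AUTH_URL = "https://keystone.example.org:5000/v3"


def build_token_request(username, password, domain="Default"):
    """Build the JSON body for a Keystone v3 password authentication."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            }
        }
    }


def keystone_password_ok(auth_url, username, password, domain="Default"):
    """Return True if Keystone issues a token for these credentials."""
    body = json.dumps(build_token_request(username, password, domain))
    request = urllib.request.Request(
        auth_url.rstrip("/") + "/auth/tokens",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            # Keystone answers 201 Created and returns the token in a header.
            return response.status == 201 and "X-Subject-Token" in response.headers
    except (urllib.error.URLError, OSError):
        # Rejected credentials and unreachable servers both deny access.
        return False


def pam_sm_authenticate(pamh, flags, argv):
    """Entry point that pam_python calls for the auth stack."""
    user = pamh.get_user(None)
    password = pamh.authtok  # assumes an earlier module already prompted
    if user and password and keystone_password_ok(AUTH_URL, user, password):
        return pamh.PAM_SUCCESS
    return pamh.PAM_AUTH_ERR
```

Failing closed on any network error is a deliberate choice in this sketch: if Keystone cannot be reached, nobody is let in.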

Access to cluster filesystems: OpenHPC clusters require a common filesystem mounted across all nodes managed by the workload manager. One possible solution here would be to use Manila, but our bare metal infrastructure may complicate its usage. It is an area for future exploration for this project.
We are using CephFS, exported from our local Ceph cluster, with an all-SSD pool for metadata and a journaled pool for file data. Our solution defines a CephX key, shared between project users, which enables access to the CephFS storage pools and metadata server. This CephX key is stored in Barbican. This appears to be an area where support in Shade and Ansible’s own OpenStack modules is limited. We have written an Ansible role for retrieving secrets from Barbican and storing them as facts, and we’ll be working to package it and publish on Galaxy in due course.
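As a rough illustration of fetching a secret and exposing it as a fact, the Python sketch below reads a secret payload from Barbican's REST API and wraps it in the `ansible_facts` structure that Ansible uses for facts. It is not the role described above: the helper names, the `opener` parameter (there to make the function testable) and the example URL are assumptions, and a valid Keystone token is assumed to be available already.

```python
import urllib.request


def build_payload_request(barbican_url, secret_id, token):
    """Build a GET request for a secret's payload in Barbican."""
    return urllib.request.Request(
        "{}/v1/secrets/{}/payload".format(barbican_url.rstrip("/"), secret_id),
        headers={"X-Auth-Token": token, "Accept": "text/plain"},
    )


def fetch_secret_as_fact(barbican_url, secret_id, token, fact_name,
                         opener=urllib.request.urlopen):
    """Fetch a text secret and return it in Ansible's fact structure."""
    request = build_payload_request(barbican_url, secret_id, token)
    with opener(request) as response:
        payload = response.read().decode("utf-8").strip()
    # Ansible merges the `ansible_facts` dict into the host's facts.
    return {"ansible_facts": {fact_name: payload}}
```

A later task could then reference the fact, for example when writing the CephX key to a keyring file on each node.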

Converting infrastructure into platform: Once we have built upon the infrastructure to add the support we need, the next phase is to configure and start the platform services. In this case, we build a SLURM configuration that draws from the infrastructure inventory to define the workers and controllers in the SLURM configuration.
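Deriving the SLURM configuration from the inventory can be sketched as follows. This is a minimal illustration rather than the configuration used on the prototype system: the `render_slurm_conf` helper is hypothetical, it assumes the controller runs on the first host of the `login` group, and it emits only a handful of `slurm.conf` settings.

```python
def render_slurm_conf(cluster_name, inventory,
                      controller_group="login", worker_group="compute"):
    """Render a minimal slurm.conf fragment from inventory groups.

    `inventory` maps group names to lists of hostnames.
    """
    # Assumption: the first host of the controller group runs slurmctld.
    controller = inventory[controller_group][0]
    workers = ",".join(inventory[worker_group])
    lines = [
        "ClusterName={}".format(cluster_name),
        # Older Slurm releases use ControlMachine instead of SlurmctldHost.
        "SlurmctldHost={}".format(controller),
        "NodeName={} State=UNKNOWN".format(workers),
        "PartitionName={} Nodes={} Default=YES MaxTime=INFINITE State=UP".format(
            worker_group, workers),
    ]
    return "\n".join(lines) + "\n"
```

Because the node and partition lines are generated from the same inventory that the infrastructure step produced, resizing the cluster only requires regenerating this file and reconfiguring SLURM.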

Adding value in a cloud context

In the first instance, cloud admins recreate application environments, defined by software and deployed on demand. These environments meet user requirements. The convenience of their creation is probably offset by a slight overhead in performance. On balance, an indifferent user might not see compelling benefit to working this way. Our OpenHPC-as-a-Service example described here largely falls into this category.

Don’t stop here.

Software-defined cloud methodologies enable us to do some more imaginative things in order to make our clusters the best they possibly can be. We can introduce infrastructure services for consuming and processing syslog streams, simplifying the administrative workload of cluster operation. We can automate monitoring services for ensuring smooth cluster operation, and provide application performance telemetry as standard to assist users with optimization. We can help admins secure the cluster.

All of these things are attainable, because we have moved from managing a deployment to developing the automation of that deployment.

Reducing the time to science

Our users have scientific work to do and our OpenStack projects exist to support that.

We believe that OpenStack infrastructure can go beyond simply recreating conventional scientific application clusters to generate Cluster-as-a-Service deployments that integrate cloud technologies to be even better.