The director of cloud engineering at the online messaging, marketing and analytics company talks DevOps, migration and more.


Koby Holzer has 17 years of experience working with large infrastructure environments, the last four-and-a-half of them at LivePerson as the director of cloud engineering, focused on OpenStack. His past experience includes working with prominent Israeli telecom companies on technological infrastructure.

Koby Holzer. Source: Twitter

I have known Koby personally for years through the talks, lectures and great cloud and DevOps community in Israel. He sat down with me to talk about past challenges, future projections and tips.

How was the OpenStack Summit?

The Summit was amazing and it’s getting better every year. It was my fourth one. This was the most exciting one for me, as I had the opportunity to speak at the keynote, talking about LivePerson, OpenStack, containers and all the new and exciting stuff we are doing with the LivePerson cloud.

It was the biggest convention yet with over 7,500 people in attendance. And it just continues to get bigger every year in every aspect — logistics, keynote sessions, educational tracks.

On the educational side, I particularly like the practical, real-life cases. It’s very interesting to see how other companies are tackling OpenStack. My focus was on two specific tracks: the case studies, including AT&T and Comcast, and the containers track, which was very popular at the conference. One particularly meaningful session was "Do containers become first class citizens in the data center?" The panel included experts from the most popular container technologies on the market (i.e. Kubernetes, Docker, Mesos).

How would you summarize the evolution of LivePerson over the past five years?

In early 2012, we started learning about and evaluating OpenStack, which was at the Diablo version at that time. We started to play with it, built our cloud labs and made the decision to go to production with a small portion of LivePerson (LP) services during the middle of that year. By the time we reached production we were already using Essex, and towards the end of 2012 we decided to rewrite LivePerson’s service from one big monolithic service into many micro-services.

The next step was adopting Puppet, which accelerated the consumption of our private cloud. R&D moved from consuming physical servers to virtual OpenStack instances. By the end of 2013, we had already created a large cluster with more than 2,000 instances across four data centers, and from then on it just continued to evolve.

In 2013-2014, we were dealing with the OpenStack upgrade challenge and managed to move to Folsom, then Havana and Icehouse. We try to upgrade as often as possible; however, the bigger our cloud gets, the more difficult that becomes.

In 2015 we reached the point where we had finished rewriting the service, and our new “LiveEngage” service was ready for production and real users. Today we have something like 8,000 instances across seven data centers, running on more than 20,000 physical cores. 2016 is the year for us to migrate to containers and Kubernetes, something we expect to span well into 2017.

Looking back over the last few years, what would you say have been, or are still, your main challenges?

I manage cloud engineering, and there is a rather large team of software developers here. We were very lucky that the R&D organization decided to move to micro-services at the same time that we introduced OpenStack and Puppet. Looking back, I am not sure if it was planned, but the timing was just perfect. While development built a modern micro-services-based service, ops adopted and implemented cloud and automation.

In terms of management challenges, I will narrow it down to the challenges I see for 2016. Migrating 150 services to containers is something my team cannot accomplish alone. We work continuously to maintain the partnership with R&D and to educate ourselves together so we can make optimal use of the new technologies. That includes moving from continuous integration to continuous delivery and building a strong delivery pipeline.

The operations goal is to build an environment that enables R&D to own the service end-to-end: not only to develop it but also to support a quality, robust production environment.

What are some specific challenges you faced and overcame in your cloud journey?

One big challenge was deploying Puppet and driving its adoption across the organization. If only the cloud production and operations team had been using it, it wouldn’t have been enough. We needed our software developers to adopt Puppet as well and use it as a standard delivery method. And making 200 developers use a new technology doesn’t happen overnight, as you can imagine.

I learned that you can’t easily convince that number of smart people just by saying, "Guys, this is great technology and it’s the only way we can deal with delivery." We learned our lesson from that, and now we work much more closely with R&D, making decisions together from the start.

Remember that this was almost four years ago. It took a management decision from the very top, LivePerson’s general manager, for everyone to understand that this was the way forward. Our entire R&D was instructed that all new updates would go to production using Puppet. A small team of DevOps experts was brought in to support and train the R&D teams and make sure Puppet was being used on a daily basis. This team ran workshops and was the go-to point whenever questions came up. It took around a year to bring everyone up to speed, and today Puppet is the main delivery tool.

Another challenge, a common one for OpenStack, is upgrades, at least with older versions. Even after four years of practice, an upgrade takes one engineer up to three months. This has been the story for every upgrade until now. The most recent upgrade was the biggest so far, mainly because our cloud has grown significantly and we also needed to upgrade the hosts in tandem.

Upgrading thousands of physical servers while maintaining service uptime is no simple task. To do this we take a group of servers, live-migrate their workloads to the remaining servers, then upgrade and verify nothing was harmed before bringing the group back into the pool. There are lots of considerations and activities behind this, including understanding and segmenting the sensitive workloads.
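To give a rough sense of what that drain-and-upgrade cycle can look like, here is a minimal Python sketch using the openstacksdk library. It is illustrative only, not LivePerson’s actual tooling; the cloud name, hypervisor names and batch logic are assumptions.

```python
# Illustrative sketch only, not LivePerson's actual upgrade tooling.
# Assumes admin credentials for a cloud named "private" in clouds.yaml and
# hypothetical hypervisor names such as "compute-01".
import time
import openstack

conn = openstack.connect(cloud="private")

def drain_host(hostname, poll_seconds=15):
    """Live-migrate every instance off one compute host so it can be upgraded."""
    # The "host" filter on the server list is an admin-only query parameter.
    for server in conn.compute.servers(all_projects=True, host=hostname):
        # Let the Nova scheduler pick the target host; "auto" lets Nova choose
        # between block and shared-storage live migration.
        conn.compute.live_migrate_server(server, host=None, block_migration="auto")
    # Wait until Nova reports no instances left on this hypervisor.
    while list(conn.compute.servers(all_projects=True, host=hostname)):
        time.sleep(poll_seconds)

# Process one batch at a time: drain each host, upgrade it (OS/hypervisor via
# the configuration-management tool), verify, then return it to the pool.
for hostname in ["compute-01", "compute-02"]:   # hypothetical batch
    drain_host(hostname)
    # ... upgrade and verify the emptied host here before re-enabling it ...
```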

The LivePerson team at OpenStack Day Israel. Photo: Lauren Sell.

How do you manage to keep transparency throughout the upgrade process?

We built a smart cloud discovery solution that updates automatically. Transparency is key, and we have complete control over each individual VM and service. The system records all activities and can be accessed through an API and a UI.
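We did not go into the internals of that tool, but as a rough idea of the kind of data such automatic discovery can be built on, here is a minimal Python sketch that pulls the hypervisor and instance inventory from the OpenStack API with openstacksdk. The cloud name and output format are assumptions, not LivePerson’s system.

```python
# Minimal inventory-discovery sketch using openstacksdk, an illustration of
# the kind of data such a tool can collect, not LivePerson's actual system.
import json
import openstack

conn = openstack.connect(cloud="private")  # assumed clouds.yaml entry

inventory = {"hypervisors": [], "servers": []}

# Record every hypervisor and where each VM is running right now.
for hv in conn.compute.hypervisors(details=True):
    inventory["hypervisors"].append({"name": hv.name, "status": hv.status})

for server in conn.compute.servers(details=True, all_projects=True):
    inventory["servers"].append({
        "name": server.name,
        "status": server.status,
        "host": server.compute_host,  # admin-only attribute (OS-EXT-SRV-ATTR:host)
    })

# A real discovery service would run this on a schedule and expose the
# result through its own API and UI.
print(json.dumps(inventory, indent=2))
```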

What are the main takeaways from your experience?

As the operations manager, you should be able to build an efficient and professional team, which obviously depends on the size of your OpenStack cloud. For a cloud of thousands of hosts you need at least two network professionals, three talented operations/engineering people responsible for automating everything, and one storage specialist. This team does not include the teams that handle the daily tasks and use the OpenStack resources for the LivePerson service.

In addition, you need to think about every management aspect. Security is not part of our team, although ideally it should be; we are supported by our R&D team’s security experts. When building your private cloud team, remember that your R&D cares little whether it’s OpenStack, physical servers, VMware, whatever. They just need the resources and the flexibility that the cloud and DevOps promise.

Learning from the Puppet challenge: back then it was like us telling R&D, "We demand that you deliver with Puppet," but as an IT leader you need to understand and market the value of the new technology. It never ends, but once you have done it, the next time is easier, as I see with our current move to Docker and Kubernetes. Eventually, we want to work together as equals, with everyone adopting the technology together, learning together and coping with all the challenges together.

To accomplish that you need to create a "feature team" that includes representation from all the parties involved: the architects, leading developers, operations, network and security.

Although that might be challenging, I strongly suggest educating the other parties, not only on the touch points between dev and ops, but also on what goes on behind the scenes of your cloud, so they have the knowledge they need to use the OpenStack and Kubernetes APIs in particular. This is something we are still working on with our R&D team. Together with containers, our developers will be able to enjoy real independence in provisioning and consuming resources. Connecting the software to the infrastructure and letting developers decide what they need and when is exactly the flexibility that IT operations is responsible for providing.
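To make that kind of self-service concrete, here is a small, hypothetical sketch of a developer provisioning capacity directly through the Kubernetes API with the official Python client. The deployment name, namespace, image and resource requests are placeholders, not LivePerson’s configuration.

```python
# Hypothetical self-service example -- not LivePerson's actual pipeline.
# Requires the official client: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use the developer's own kubeconfig
apps = client.AppsV1Api()

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="chat-frontend"),  # placeholder name
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "chat-frontend"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "chat-frontend"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="chat-frontend",
                    image="registry.example.com/chat-frontend:1.0",  # placeholder image
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"}),
                )
            ]),
        ),
    ),
)

# The developer asks Kubernetes for the capacity directly, no ops ticket needed.
apps.create_namespaced_deployment(namespace="team-chat", body=deployment)
```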

Everyone should adopt the DevOps approach. R&D and Production are both developers, each with their own place in the delivery system. Although I am proud that LP is a cloud pioneer, we still have a way to go on that front, and that’s exciting. Becoming Netflix or Google doesn’t happen overnight. The good news is that this road never ends and there is always something new to learn, adopt and do better.

What are your thoughts about the private cloud/cloud market landscape for the future?

I think that in two years we will see hybrid clouds in a big way; this is also what we are aiming for. By using Kubernetes, we’ll be able to use all the public clouds, including our own private one, in the same way, with the same teams and tools. What I want to see at LP is a very dynamic multi-cloud environment.

For example, let’s say that Amazon just changed their prices and I know in real time that I can get a better price with Google. I’ll want all workloads and traffic to move seamlessly to GCP, and if it changes again the day after, I’ll want them to move automatically to a third public cloud. The workload migration will be based on a price/performance equation while taking into consideration the service-level agreement (SLA) of each workload.
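As a rough illustration of what such a price/performance decision might look like, here is a toy Python sketch. The cloud names, prices, performance indexes and SLA figures are invented for the example and are not LivePerson’s actual model.

```python
# Toy illustration of price/performance-driven placement -- the numbers,
# cloud names and SLA rules are invented for this example.
from dataclasses import dataclass

@dataclass
class CloudOffer:
    name: str
    price_per_core_hour: float   # current list/spot price
    perf_index: float            # relative benchmark score (higher is better)
    availability_sla: float      # e.g. 0.9995

@dataclass
class Workload:
    name: str
    required_sla: float          # minimum availability the workload tolerates

def best_cloud(workload, offers):
    """Pick the cheapest cloud per unit of performance that still meets the SLA."""
    eligible = [o for o in offers if o.availability_sla >= workload.required_sla]
    if not eligible:
        raise ValueError(f"no cloud satisfies the SLA for {workload.name}")
    return min(eligible, key=lambda o: o.price_per_core_hour / o.perf_index)

offers = [
    CloudOffer("aws", 0.046, 1.00, 0.9995),
    CloudOffer("gcp", 0.041, 0.95, 0.9995),
    CloudOffer("private-openstack", 0.030, 0.90, 0.999),
]

print(best_cloud(Workload("chat-backend", required_sla=0.9995), offers).name)
```

A real scheduler would of course also weigh migration cost and data gravity, but the same price-per-performance comparison under an SLA constraint sits at the core of the idea.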

As for OpenStack, there is no doubt that today the compute, network and storage are much better than four years ago, even enterprise-ready. I think those core components will keep improving and support larger scale, so it will become easier and easier to upgrade seamlessly.

The second priority is for OpenStack to integrate better with the public clouds, with workload bursting, DR and backup projects supporting us everywhere: on our OpenStack private cloud, in EC2, Google and Azure. For example, Trove working the same way on the private cloud, on EC2 instances, on Google Cloud, etc. Since the future is hybrid, it just makes sense to have those extra cool projects work for me wherever I choose. I think it will make OpenStack much stronger.

This post was contributed by Ofir Nachmani. Superuser is always interested in community-driven content; please get in touch: [email protected].

Cover Photo // CC BY NC