The insanely popular messaging service runs 5,000 virtual machines on OpenStack.

KakaoTalk is a chat and voice-over IP app that keeps super-wired South Korea — and a good chunk of the rest of the world — connected.

Launched by Kakao Corp. in 2012, it hit 200 million users in December 2015. It is available in 15 languages, and its cute stickers and animated emoticons give it a distinct voice in the noisy app market.

Superuser got the scoop from Andrew Kong, cloud computing cell lead at KakaoTalk, about why the company got started with OpenStack and how it manages 5,000 virtual machines with a team of just two people.

https://twitter.com/esthersuh/status/678006071907217408

How did you get started with OpenStack?

We started with OpenStack in 2013, as an experiment in self-managed development environments for our engineers. Our company’s leader wanted scalable monitoring; standardized, managed and automated resource provisioning; automated application deployment; self-scaling resource management; automated fault and error detection; and self-healing resources.

We searched for technology that would allow us to achieve this. We have expertise and good engineers, but funding was a constraint, so we decided to use open source-based cloud software. We did some research and, at the time, didn’t find many options. We tested OpenNebula, OpenStack and Apache CloudStack, though there wasn’t much documentation available back then. Of the three, OpenStack had the best potential and a good design for a scale-out environment. Charlie Choe, who was also the OpenStack Korea community coordinator, led the initial design and development of Kakao’s OpenStack cloud. He named this cloud Krane, a portmanteau of Kakao and crane. That was the very beginning of Kakao’s OpenStack cloud.

What kinds of applications or workloads are you currently running on OpenStack?

The initial purpose of our OpenStack cloud was only internal development and testing. Our developers chose what they wanted to use it for: big data solutions, web applications, databases and even real-time analytics, all on our OpenStack cloud.

As time went on, our developers saw the stability and speed of OpenStack. They started asking, “Can we use the virtual computing resources for public and external services too?”…[Before we could do that] our monitoring system and configuration database management system had an issue with managing virtual resources. We rebuilt our monitoring system so that monitoring and managing virtual resources works the same way as it does for physical resources. After this change, our OpenStack cloud was ready to carry production workloads.

A peek inside the Kakao Corp. data center.

You’re managing 5,000 virtual machines, some of them in production – how big is the team?

For managing upgrades, bug patching and new features, we have only two dedicated people who take care of more than 5,000 VMs. As I mentioned earlier, some of our VMs run production services. But those VMs have full support from the infra team, including automatic failure alerts, system engineering support and network engineering support, to keep up the quality of those public services…
We constantly tell our developers to be ready for errors. Their applications should be fault-tolerant, their data backed up (we provide object storage and volume through Glance and Cinder) and ready for automatic deployment, so they can scale out and redeploy quickly.

The OpenStack dynamic duo: Al (Eohyung Lee) and Raymon (Hyun Ha).

What are some of the challenges you face with such a small team?

Even though those two people are excellent, there’s always a feeling that we lack the resources to take care of such a large computing estate. That’s challenging. To make it easier, we made an automated dev, test and deployment system for OpenStack that we call ‘kfiled,’ which Andrew Kong developed. We try to automate all the routine jobs but it’s still really difficult. Apart from OpenStack itself and its networking on the compute nodes, other teams take care of the service VMs and the network running on top of the compute nodes. So the whole infra team (server, network, database and security) takes care of OpenStack when it has issues.

The entire cloud team, from top left clockwise: Jenny (Jihyun Song), Maximus (Jungju Lee), Andrew Kong, Hardy (Woncheon Jung), Dave (Gyudong Choi), Nico (Seung Lee), Al (Eohyung Lee), Raymon (Hyun Ha), Issac (Seoungkuk Lim), Joanne (Younju Hwang). Not pictured: Charlie (Jungdae Choi) and Rick (Heesu Sung).

What are the keys to your success with OpenStack?

I don’t think it’s either a success or failure in our case…But there are three key factors to the success of OpenStack. One is excellent people. If you let them get to work, you don’t have to worry about anything. Two is to be aligned with pre-existing computing resources (people call this ‘legacy’ but I don’t use this term because, by that logic, OpenStack is becoming legacy for us too). OpenStack resources follow the same rules, policies and authentication. We have a simple policy for all compute resources, which means less development work for the OpenStack cloud, but it’s still a big job…The last one is collaboration. We don’t want to make another huge silo system that makes things worse and harder down the road. We are the experts on OpenStack, and we have to be. Other members of our infra team are also experts in their fields and we respect their knowledge, points of view and experience. Without their help, the cloud doesn’t work.

What were your major milestones?

Three major technologies are coming to our OpenStack cloud. One is a datacenter-level resource life cycle management system. This will find which computing resources are underused compared to their templates and resize them to smaller ones. (This project is called ‘CUOTA’: continuous under-used object into trash automatically. The analytics for this are supplied by the internal unified metric information center, codename CROW.)
Two is the automated resource life cycle management system (codename: Kengine) that will automatically add or delete resources for one service.
Three is a container management system, which will go into public service, building on our OpenStack cloud experience and technology.
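
To make the first item concrete, here is a minimal Python sketch of the general idea of comparing observed usage against a VM's template and proposing a smaller one. It is not Kakao's CUOTA code: the flavor catalog, the `Usage` record, the `suggest_resize` function and the 1.5x headroom factor are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flavor:
    name: str
    vcpus: int
    ram_mb: int


# Hypothetical flavor catalog, ordered from smallest to largest.
FLAVORS = [
    Flavor("small", 1, 2048),
    Flavor("medium", 2, 4096),
    Flavor("large", 4, 8192),
]


@dataclass(frozen=True)
class Usage:
    vm_id: str
    flavor: Flavor       # the template the VM was created from
    avg_cpu_pct: float   # average CPU utilisation across all vCPUs, 0-100
    avg_ram_mb: float    # average memory actually in use


def suggest_resize(usage: Usage, headroom: float = 1.5) -> Optional[Flavor]:
    """Return the smallest flavor below the current one that still covers
    observed usage times a safety headroom, or None if no smaller fit exists."""
    needed_vcpus = usage.flavor.vcpus * usage.avg_cpu_pct / 100 * headroom
    needed_ram_mb = usage.avg_ram_mb * headroom
    current_index = FLAVORS.index(usage.flavor)
    # Only flavors strictly smaller than the current one are candidates.
    for flavor in FLAVORS[:current_index]:
        if flavor.vcpus >= needed_vcpus and flavor.ram_mb >= needed_ram_mb:
            return flavor
    return None


if __name__ == "__main__":
    large = FLAVORS[2]
    idle = Usage("vm-idle", large, avg_cpu_pct=10.0, avg_ram_mb=1000.0)
    busy = Usage("vm-busy", large, avg_cpu_pct=80.0, avg_ram_mb=6000.0)
    for record in (idle, busy):
        target = suggest_resize(record)
        print(record.vm_id, "->", target.name if target else "keep current flavor")
```

In a real deployment the averages would come from a metrics system (the role the interview assigns to CROW), and the resize itself would be a separate, carefully gated step.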

A slide from KakaoTalk’s session at OpenStack Day Korea.

Anything else you want to share about the real experience of operating and developing OpenStack?

Deployment is one thing, upgrading is another. A lot of tools can help deploy OpenStack with just a click. But when you try to upgrade it, no tool can help except your body and soul…That said, there’s a lot of good advice about upgrades. I should say: back up your database! And run a migration test on the OpenStack database. This will probably solve a lot of the problems encountered during upgrades.
For example, this commit (https://github.com/OpenStack/neutron/commit/73900fd0f4c1a343c880d8529aff4f51dd071d4b) added a new table to the Havana version of Neutron. The table records which port is on which hypervisor, and in Havana that information is only written for VMs created under Havana. Upgrading from Grizzly to Havana, this doesn’t cause any problems. But in the Havana-to-Icehouse upgrade, if this information is empty, the VM loses its Neutron network, and we lost every VM that was created before Havana. The precious lesson from this was: back up first, back up second and back up more.
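
A migration test of the kind described above can include a check for rows that the next release will expect but that older resources never received. The Python sketch below shows that idea against a throwaway in-memory SQLite database. The table and column names (`ports`, `port_bindings`, `host`) are hypothetical stand-ins, not Neutron's actual schema, and a real test would run against a restored copy of the production database backup.

```python
import sqlite3


def ports_missing_binding(conn: sqlite3.Connection) -> list:
    """Return IDs of ports that have no host binding recorded.

    Table and column names are hypothetical, for illustration only.
    """
    rows = conn.execute(
        """
        SELECT p.id
        FROM ports AS p
        LEFT JOIN port_bindings AS b ON b.port_id = p.id
        WHERE b.host IS NULL OR b.host = ''
        ORDER BY p.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database that mimics the situation:
    one port predates the binding table, one was created after it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ports (id TEXT PRIMARY KEY, device_id TEXT);
        CREATE TABLE port_bindings (port_id TEXT PRIMARY KEY, host TEXT);
        INSERT INTO ports VALUES ('port-a', 'vm-old'), ('port-b', 'vm-new');
        INSERT INTO port_bindings VALUES ('port-b', 'compute-01');
        """
    )
    return conn


if __name__ == "__main__":
    missing = ports_missing_binding(build_demo_db())
    if missing:
        print("Do not upgrade yet; ports without a host binding:", missing)
    else:
        print("All ports have a host binding.")
```

Running a check like this on a copy of the database before the upgrade turns a silent data gap into an explicit list of ports to fix first.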

Try to make managing your infra resources simple — this will make automation, cloud and OpenStack much easier.

Superuser is always looking for user stories. Please get in touch: [email protected]