It’s time for the community to determine the winner of the Superuser Award to be presented at the OpenStack Boston Summit. Based on the community voting, the Superuser Editorial Advisory Board will review the nominees and determine the finalists and overall winner.
Now, it’s your turn.
Snapdeal is among the seven nominees for the Superuser Awards. Review the nomination criteria below, check out the other nominees and rate the nominees before the deadline Tuesday, April 4 at 11:59 p.m. Pacific Time Zone.
How has OpenStack transformed Snapdeal’s business?
It is no overstatement to say that Snapdeal runs on OpenStack. Literally the entire business runs on a private cloud called Snapdeal Cirrus, which is built on top of OpenStack. In the last year and a half, Snapdeal has not only built an infrastructure-as-a-service (IaaS) offering using OpenStack but also created an entire ecosystem of services on top of the new private cloud, which has completely transformed how applications are designed, built, deployed and maintained.
Snapdeal is built from more than 500 microservices; each service has its own multi-tier architecture and exposes its capabilities over a REST API. In the new architecture, all Snapdeal services are defined in YAML files called blueprints, which capture every detail about the microservice, from the Git repository and dependencies down to the resource requirements for both vertical and horizontal scaling. Our automation then creates this infrastructure without any manual intervention. This infrastructure-as-code is heavily intertwined with OpenStack: it chooses the right flavor and host aggregate and attaches Cinder volumes from Ceph or local SSD. Teams can also create any number of copies of an application with the click of a button. This new architecture has made the entire engineering team extremely cognizant of its own infrastructure usage.
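To make the blueprint idea concrete, here is a minimal sketch of how a blueprint's resource requirements might be turned into server requests. It is illustrative only: the blueprint fields, the flavor catalogue, the service name and the repository URL are all assumptions, not Snapdeal's actual schema, and a plain Python dict stands in for the parsed YAML file.

```python
from dataclasses import dataclass

# Hypothetical flavor catalogue: name -> (vCPUs, RAM in GB).
FLAVORS = {
    "small": (2, 4),
    "medium": (4, 8),
    "large": (8, 16),
}

# Hypothetical blueprint, shown as the dict a YAML parser would produce.
BLUEPRINT = {
    "service": "cart-api",
    "git_repo": "git@example.com:shop/cart-api.git",
    "dependencies": ["redis", "catalog-api"],
    "resources": {"vcpus": 3, "ram_gb": 6},  # vertical scaling
    "instances": 4,                          # horizontal scaling
    "storage": "ceph",                       # or "local-ssd"
}


@dataclass
class ServerRequest:
    name: str
    flavor: str
    storage: str


def pick_flavor(vcpus, ram_gb):
    """Return the smallest flavor that satisfies both requirements."""
    fitting = [
        (cpu, ram, name)
        for name, (cpu, ram) in FLAVORS.items()
        if cpu >= vcpus and ram >= ram_gb
    ]
    if not fitting:
        raise ValueError("no flavor is large enough for this blueprint")
    return min(fitting)[2]


def expand(blueprint):
    """Turn one blueprint into one server request per instance."""
    res = blueprint["resources"]
    flavor = pick_flavor(res["vcpus"], res["ram_gb"])
    return [
        ServerRequest(f"{blueprint['service']}-{i}", flavor, blueprint["storage"])
        for i in range(1, blueprint["instances"] + 1)
    ]


requests = expand(BLUEPRINT)
```

In a real pipeline the resulting requests would be handed to the provisioning layer; that step is left out here because the source does not describe it.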
How has Snapdeal participated in or contributed to the OpenStack community?
We have been attending the OpenStack Summits regularly. We attended the Austin and Tokyo Summits to interact with fellow engineers who have built large private clouds using OpenStack. At the OpenStack Summit Barcelona, we announced our private cloud and were welcomed into the 100K cores club. We have embraced OpenStack as the core software for our cloud platform, which means that we treat it as our own code. We have a fork of the OpenStack Kilo 4, which we run in production, and we have made several enhancements to this code to suit our requirements. So far we have not actively contributed back to the community version because of a sheer lack of resources, but it is something that we are going to actively pursue this year.
What open source technologies does Snapdeal use in its IT environment?
CentOS, Python, Kafka, ActiveMQ, RabbitMQ, Spark 2.0.1, Scala 2.11.8, Hadoop ecosystem (HDFS, Sqoop, YARN, Pig, ZooKeeper), Spark Job Server, Zeppelin, Tachyon, Oozie 4.3.0, Pentaho, Kylin, Tungsten, HBase, OpenStack, Ceph, Ansible, EFK (Elasticsearch, Fluentd, Kibana), Icinga, InfluxDB, Telegraf, Kapacitor, Grafana, Salt, Terraform, Jenkins, Chef, SmartStack, HAProxy, Gradle, Consul, Vault and ZooKeeper.
What is the scale of your OpenStack deployment?
With OpenStack at its center, Snapdeal Cirrus covers three data center regions with an extensive architecture of 100,000 cores, 16 petabytes of storage and 100G SDN infrastructure. You can learn more about Snapdeal’s OpenStack footprint in this case study.
What kind of operational challenges have you overcome during your experience with OpenStack?
Separation of data plane and control plane
From the beginning, the design philosophy for the Snapdeal Cirrus cloud was to keep the data plane and control plane separate, so that issues in the control plane cannot affect running instances. One of the main intersection points between the data plane and the control plane is the networking layer. Instead of using an OpenStack router for every subnet, we used the routing capabilities of hardware switches. The main advantage is that we can perform upgrades and maintenance, and deal with any issues in OpenStack, while the cloud keeps running.
To reduce single points of failure, Snapdeal Cirrus has been designed as a multi-region cloud. Each region has its own hardware, OpenStack controllers, storage and network layer. Application placement is aware of this concept, so provisioning is spread across regions. This lets the cloud grow into further regions and provides redundancy across them.
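As a rough illustration of region-aware placement, the sketch below spreads the instances of one application round-robin across regions. The region and instance names are hypothetical, and round-robin is only one possible policy; the source does not say which policy Snapdeal uses.

```python
from itertools import cycle


def spread_across_regions(instance_names, regions):
    """Assign instances to regions in round-robin order."""
    if not regions:
        raise ValueError("at least one region is required")
    placement = {region: [] for region in regions}
    for name, region in zip(instance_names, cycle(regions)):
        placement[region].append(name)
    return placement


# Hypothetical example: four instances over three regions.
placement = spread_across_regions(
    ["app-1", "app-2", "app-3", "app-4"],
    ["region-a", "region-b", "region-c"],
)
```

With this policy, losing any one region leaves the instances in the other regions untouched.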
In addition to the default server-level anti-affinity, we designed pod-level and rack-level anti-affinity. All servers in a rack share a TOR switch, whereas servers in a pod share only a leaf switch pair and are in the same region. This lets the provisioning layer choose the level of anti-affinity required by a particular application, which gives mission-critical applications separate failure domains.
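The three anti-affinity levels can be sketched as a simple placement filter: pick hosts so that no two chosen hosts share the same failure domain at the requested level. This is a minimal sketch under assumed names (the `Host` fields and the `place` function are hypothetical), not the scheduler logic Snapdeal actually runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    rack: str  # servers in a rack share a TOR switch
    pod: str   # servers in a pod share a leaf switch pair


def failure_domain(host, level):
    """Return the identifier of the host's failure domain at the given level."""
    if level == "server":
        return host.name
    if level == "rack":
        return host.rack
    if level == "pod":
        return host.pod
    raise ValueError(f"unknown anti-affinity level: {level}")


def place(count, hosts, level):
    """Pick `count` hosts, no two of which share a failure domain at `level`."""
    used = set()
    chosen = []
    for host in hosts:
        if len(chosen) == count:
            break
        key = failure_domain(host, level)
        if key not in used:
            used.add(key)
            chosen.append(host)
    if len(chosen) < count:
        raise RuntimeError(f"not enough distinct {level} failure domains")
    return chosen


# Hypothetical inventory: three racks in two pods.
HOSTS = [
    Host("h1", "rack-a", "pod-1"),
    Host("h2", "rack-a", "pod-1"),
    Host("h3", "rack-b", "pod-1"),
    Host("h4", "rack-c", "pod-2"),
]

by_server = [h.name for h in place(2, HOSTS, "server")]
by_rack = [h.name for h in place(2, HOSTS, "rack")]
by_pod = [h.name for h in place(2, HOSTS, "pod")]
```

The stricter the level, the fewer hosts qualify, so an application that asks for pod-level anti-affinity trades placement flexibility for wider failure isolation.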
Snapdeal Cirrus is highly redundant from OpenStack controllers to network and storage. By using multi-region placements along with anti-affinity within a region and multi-replication at the storage level, the cloud is designed to be redundant at every level with no single point of failure.
Firmware upgrades and live migration
Firmware upgrades for servers and switches are a reality. The cloud has been designed so that the root volumes of VMs sit on shared Ceph storage. This gives us the flexibility to move VMs around, with hardware maintenance being one of the big use cases.
How many Certified OpenStack Administrators (COAs) are on your team?
There are currently no COAs on the Snapdeal team.
Cover image courtesy of Snapdeal.