Twilio SendGrid’s Mike Dorman on iterative improvements, minding metrics and leading the capacity curve.

image

This post is a detailed write-up of a lightning talk I gave at the recent Open Infrastructure Summit in Denver. Slides are available here and a video here.

(See also the related talk, “Don’t Repeat Our Mistakes: Lessons Learned from Running Go Daddy’s Private Cloud“)

Currently I work on the TechOps team at Twilio SendGrid, which manages all our physical infrastructure and virtualization and container orchestration platforms. Before that I was at Go Daddy for about seven years where I worked on the OpenStack private cloud, as well as a couple other cloud servers products.

Overall I’ve been in the software-as-a-service industry for over 16 years, so I have experience running several different platform services over that time. I’d like to share a few tips for being successful doing that.

Starting point

As your starting line, before you build anything, define who exactly your customers or users are. Think about what their pain points are and what their expectations will be. In fact, actually go talk to them to find out! Don’t make assumptions about this. (If you have a product manager or product owner, they can help.)

This is a step we missed with the OpenStack cloud at Go Daddy. We assumed that “going to the cloud” was what everyone wanted and that it would be great. But we didn’t take the time to really understand the perspective of the users. And as you’ll see below, this led to some unexpected behaviors.

Solid foundation

A good foundation to start from is to write down what your platform does (and doesn’t) do. (Formally this is called a service contract.) It defines the demarcations between you and your users, and helps you manage expectations.

If you don’t do this, there will be a lot of bad assumptions on both sides! And you’ll become the default support for anything even somewhat related to your platform. Any problem is now going to come to you.

In the Go Daddy OpenStack cloud, any time there was any problem inside a VM, that question would come to our team to troubleshoot. Even though we really had no control over things inside the VM, or any way to influence it, the questions came to us simply because they were running on our platform.

Early adopters

Try to find some early adopters to be beta users that will give you early feedback. They should be fairly knowledgeable and experienced, and familiar with new paradigms, like “architecting for the cloud,” etc. Really invest in building that relationship, these people will be your ambassadors to others.

We had really good success with this at both Go Daddy and SendGrid.

The early adopters of our private cloud at Go Daddy used our platform to build a public-facing cloud servers product. They were able to give us a lot of great feedback and they were super easy to work with, because they understood the infrastructure and how to best use it.

And at SendGrid, our TechOps department “dog foods” our Kubernetes platform and helps to iron out any issues with new features. It’s super helpful to have that early and fast feedback loop

Iterative improvements

Make sure you’re always delivering new value by doing iterative improvements. (Agile and Scrum can help a lot with this, if you’re not already doing it.) It’s really about breaking work down into smaller chunks so you have a constant stream of little improvements.

Try to focus on the specific outcomes that you want (what you actually want to have happen) and don’t get too bogged down by the implementation details. Your goal should be to provide some sort of added value every sprint or iteration, and communicate that out through demos and announcements.

This was a challenge at Go Daddy when we were working through a project to integrate with our legacy asset management system. That project had a lot of unexpected problems and delays, and it ballooned out to a few months. The requirements were pretty vague, and the actual outcomes weren’t clear, so we were chasing a moving target, so to speak. The worst part was during this time we weren’t delivering any value to our users, which hurt our reputation.

Helping hand

Think about any paradigm shifts that you need to help your users go through. These might be hard for you to see, because you’ve already made the shift (but others haven’t yet.) A great example of this is “architecting for the cloud,” or “architect for failure” when folks have been used to highly redundant and protected physical servers.

Start by documenting and communicating best practices to your users. Involve your early adopter beta users in helping to disseminate that info as well.

This was a big challenge at Go Daddy, too, because people weren’t as willing to embrace “the cloud” architecture as we assumed. It turns out people needed a lot more training and education on this than we thought. They continued to treat things as “pets” instead of “cattle”, and they protected and hoarded their VMs.

And because of that, any time we did maintenance on the platform, it was really impactful.

Light touch

Speaking of maintenance, take your maintenances and outages seriously! Try to have as light of a touch as possible. Even small blips that don’t seem significant to you can be a big deal to your users. These can really hurt your platform’s reputation and cause people not to trust it.

Personally, this one was tough for me. After all, users should be building their apps to be cloud native and resilient to failure, right? We should be able to kill a few instances and they will get recreated, no problem. Even today at SendGrid, there are some applications deployed in Kubernetes that aren’t architected well and really shouldn’t be running there.

So in my brain, I think, “why do I have to treat these things with kid gloves?  I just want to get my upgrade done!”

But in reality these impactful maintenance episodes exposed a lot of other single points of failure and a maintenance on our platform would cause a larger outage.

In the end, this just causes more scrutiny and attention on your platform, which you really don’t want. Ultimately it just makes life worse for you. So do what you can to minimize the impact of maintenances.

Reduce the pain

When you do have to do maintenance and take outages (after all, we do have to do this sometimes) provide some good signaling to your users.

So think responding with a 503 status rather than a connection timeout. It’s a better experience and your end users can deal with that situation a lot easier. Figure out what that signaling will be, and write it in your service contract so your users are aware and can plan for it.

Also make sure you think about other systems your platform depends on.

Our Keystone was backed by Active Directory and any time Keystone wasn’t able to contact a domain controller for some reason, that had a lot of downstream effects on pretty much everything.

Another example, in an earlier cloud servers product, all the VMs were backed by NFS storage. There was an incident where both redundant network switches were rebooted at the same time, cutting us off from the NAS, which basically killed everything.

So, again, just be aware of these potential trouble spots and do your best to provide useful signaling to your users when something goes wrong.

Measure everything

This is almost always an afterthought (at least it has been for me.) Really try hard to collect as many metrics as you can from the start. Start with your best guess of what metrics will be useful, you can always adjust and change these later.

And then actually look at them! Look for trends and outliers.

When you’re troubleshooting something and you think, “I wish I could see the trending of this resource”, or “I wish we were collecting this other metric,” make note of that so you can circle back and add more useful measurements.

Another tip is to think about this from the end-user’s perspective:  API call times, latency, and error rates are the things that really matter to them. So they should matter to you, too.

Lead the capacity curve

You have to keep up with capacity! Make sure you don’t run out of space — this has happened to me more than once.

This goes along closely with metrics: know your usage patterns and know when you need to add more capacity (or clean up old stuff to reclaim space.)

If you run out of space, that really hurts your reputation. People start hoarding resources because they’re afraid you will run out of space again, and they want to make sure they have the resources they need.

I know this can be kind of hard because often this isn’t something you look at on a daily basis. But you really should! And it means real money when you need to add more. But it’s super important in order to build and keep trust in your platform.

Build backstops

But you should also build in some protections against this, primarily by setting resource quotas for your users. There are many ways to approach this. Maybe the quotas are tied to departmental budgets. Or you just give everyone a fairly large quota just to protect against run-away provisioning. But you must do something.

Even if you don’t specifically bill for capacity or are running an internal or dedicated system, you still need to do this. The purpose is to protect the integrity and trust in your platform.

Show them the money

It helps if you can show people what they’re using, and what impact that has on the platform or company as a whole. Again, even if you don’t actually bill real money for it, translate usage into a money figure for people. It’s easier to inherently understand that way.

If you don’t do some kind of “show back,” people will remain blissfully ignorant and just keep using more. It turns into a tragedy of the commons problem, where no one has any incentive to conserve and there are no consequences to using a little more.

Takeaways

Let me sum up the key takeaways:

  • Success defined by user experience
  • Manage expectations
  • Consistently deliver value
  • Lead and train
  • Constant feedback from users
  • Measure everything
  • Keep up on capacity
  • Backstop protections
  • Always consider reputation

Stay focused on these and keep in mind that the success of your platform is more than just the uptime number!

This post first appeared on Mike Dorman’s blog. Superuser is always interested in community content — get in touch: editorATsuperuser.org