It’s the busiest tax day of the year and millions of people are filing online. In just 24 hours, Her Majesty’s Revenue and Customs will take in £350 million (about $505 million).
Instead of sweating it, the team behind the digital platform of HMRC spent part of the day noshing pizza and playing LAN tournaments.
“In live ops, the biggest success stories are always anti-climaxes,” says Tim Britten, product owner for the HMRC digital platform. “The 31st of January 2016 was really boring for us. We didn’t have to do anything. We knew that we were resilient across data centers…That has never really happened before. Normally, we would be absolutely bricking it. We have a lot of self-healing containerizations (so) it looks after itself at the moment.”
It’s really good to be afraid
In just four months, a team of four engineers reached that milestone. Britten and his co-workers could relax about the performance of HMRC’s multi-channel digital tax platform (MDTP) because dev-ops consultant Philip Harries had doubled down on the fear factor.
“If you’re building any kind of infrastructure, any engineering project, big or small, it’s really good to be afraid,” says Harries. “The glass is half-empty. You’ve got to be pessimistic, you’ve got to plan for failure. You’ve got to be resilient against any kind of failure in your system.”
Harries’ strategy to protect HMRC against failure–whether it be from infrastructure bugs, human error or zero-day vulnerabilities that might bring down an entire data center–was to go with multiple vendors.
Those vendors include DataCentred, which provides the OpenStack public cloud, and Skyscape Cloud Services, which provides VMware and was founded to provide cloud computing services through the G-Cloud initiative.
HMRC essentially runs a web gateway for people to interface with the government for their tax affairs. Here’s a look at the architecture that bolstered a “boring” tax day.
Requests go to an Akamai content delivery network (CDN) before being farmed out to each provider. There’s a public-facing zone, with networks, micro-services and Mongo database clusters. Then comes a layer of proxies between that and a protected zone, which is also full of micro-services and MongoDB clusters. Finally, there’s a private layer on the Skyscape side only.
“There are more secure processes, but it has nothing to do with the customer. We can actually lose that without any interruption to the customer journey,” Harries says. Behind that, HMRC doesn’t actually store any data permanently in the infrastructure. There are a couple of secure data centers–physical data centers–linked up by virtual private networks (VPNs).
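As a rough sketch of that layout, the layers can be modeled per provider to show why losing the private layer does not interrupt the customer journey. The class, function and layer names below are hypothetical labels chosen for illustration, not anything taken from HMRC’s code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """One cloud supplier and the layers it hosts, outermost first."""
    name: str
    layers: tuple


# Layer names are illustrative labels for the zones described above.
PROVIDERS = (
    Provider("DataCentred", ("public", "proxy", "protected")),
    # Per the article, the private layer exists on the Skyscape side only.
    Provider("Skyscape", ("public", "proxy", "protected", "private")),
)


def customer_path(provider):
    """Layers a customer request passes through: everything but the private layer."""
    return [layer for layer in provider.layers if layer != "private"]


def can_serve_customers(provider, lost_layers=()):
    """True if none of the layers on the customer path has been lost."""
    return all(layer not in lost_layers for layer in customer_path(provider))


if __name__ == "__main__":
    for provider in PROVIDERS:
        print(provider.name, "->", " -> ".join(customer_path(provider)))
```

In this model, `can_serve_customers(PROVIDERS[1], lost_layers={"private"})` stays true, while losing the public zone or the proxies on a provider takes that provider out of the customer path.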
Time for revolution, not evolution
That uneventful tax day in January 2016 was the happy ending of HMRC’s journey with OpenStack, which also resulted in winning the UK IT Awards Digital Project of the Year. The trip started back in 2010, when Martha Lane Fox, a peer whose previous experience included co-founding lastminute.com, issued a report on reforming the British government’s digital service saying that “government needs to move to a ‘service culture,’ putting the needs of citizens ahead of those of departments.”
At the time, HMRC was anything but agile. Its services were waterfall deliveries, there was a ‘massive amount’ of outsourcing and release cycles typically ran six months. “We would have huge and huge amounts of change on one weekend or two weekends of the year. If you did something wrong or you had a bit of content wrong on the page, you wouldn’t be able to get that in until six months later,” Britten says. That’s painfully slow, considering that HMRC is responsible for 50 percent of all transactions with the British government.
Following the report pushing for “revolution not evolution,” the Government Digital Service was created with a mandate to revolutionize digital services. Three of the 25 pilot projects were at HMRC.
“We realized that the only way we were going to do this was if we built a new department within HMRC that was outside the confines of the current organization,” Britten says. “We didn’t use any of the corporate networks. We went out, we bought MacBooks. We brought in people from the outside. We started a small skunk works to deliver these services.”
Britten recalls that period in 2013 as “absolutely awesome,” with the small team using unlocked laptops, talking directly to users for the first time, delivering services in-house and jazzed about the commitment to code in the open (you can find HMRC’s code on GitHub).
“We got a little bit too excited about all the functionality building. We forgot about infrastructure,” he admits. When the time came, they scrambled for an infrastructure supplier, because “we couldn’t simply whack out a credit card and get AWS” since the British government and the public are stringent about data handling and privacy.
HMRC was restricted to finding a cloud supplier that could provide what they needed and the availability they needed but also wasn’t US-based. There was one vendor who met the requirements. “As we go into the future, we plan to have three different technologies underlying our infrastructure,” Britten says, including Amazon Web Services (AWS) if the service becomes available in the UK. “We spread those bets evenly.”
The team hit on the fact that those three initial projects shared components for what they decided to call the tax platform, but Britten adds that “we didn’t think of it as a platform-as-a-service (PaaS) or anything. We just had to build stuff.”
“If you fail…scary things start to happen”
The team built a shared infrastructure running the micro-services architecture on Docker containers, got the three initial services live, and then were almost drowned by their own success.
“We have people phoning us up saying, ‘By the way, we’re going to set up a delivery center in Newcastle. It’s going to have 20 agile teams.’ We went from three, to five, and, suddenly, bam, we went to 20. Then, someone phones up again and says, ‘By the way, we’re setting up another delivery center…and they’re going to have another 20 teams.’” At this point, Britten says they were “desperately” trying to scale. “We’re forced into a position where we have to build a PaaS,” he says.
Higher-ups decided to move all online HMRC payments to the tax platform and the platform became the only way for people to file self-assessment returns. Previously, these services were hosted by an incumbent supplier. There had been some outages, which at the time weren’t that worrying, but Britten says they realized that in January, the tax platform service would take around £150 million in a day during the tax season peak.
Britten and team were becoming the main people in HMRC to deliver or run services for the UK government tax authority — yet the infrastructure ran on the shoulders of one provider and tax season was looming.
“If you fail, if you get downtime at that point, scary things start to happen,” Britten says. “If you have to delay the tax deadline, the prime minister and the chancellor have to meet and sign that off. The treasury have to start borrowing money to cover the loss that they’d have in interest. We’re in October and we start to go, ‘This is kind of worrying.’”
The search for another UK-based provider led to DataCentred and OpenStack. “We were like, all right, we know that the OpenStack API is really versatile. These guys look good. This is our best bet.”
With just a few months to go before tax season, they buckled down.
There was no time to push infrastructure changes through dev, QA and staging, checking that they worked at each step, before taking them into production, so the team started by building the staging environment. That new staging environment was used for functional testing as well as performance testing. Before it was finished, they were tasked with building out the production version. The team functionally tested it in staging in November while carrying on with the production build.
“On Christmas Eve, which is actually an awesome day to have a full outage if you’re the tax authority because no one does their tax then, we started a 48-hour outage of our current production, which was running on one supplier,” Britten says.
Then, he says, “we did something I wouldn’t recommend to anyone.” Just weeks before tax day, they turned off all of the tax systems, replicated all the data and populated the new Mongo clusters. Then, they switched over and tried to test. “It was awful. Eventually, everything woke up and got it working over about an hour.”
Then they turned off Skyscape for the first time in two-and-a-half years, relying solely on DataCentred. Then they switched back. “As easy as that, you can switch through the suppliers in terms of how much traffic you’re putting through them.”
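The talk does not say how that traffic split is implemented. One common way to get the same effect is weighted routing, sketched below under that assumption: each supplier has a weight, and setting a weight to zero takes the supplier out of rotation. The function name, supplier keys and weights are all hypothetical.

```python
import random


def pick_supplier(weights, rng=random):
    """Pick a supplier name in proportion to its traffic weight.

    Suppliers with a weight of zero never receive traffic.
    """
    live = {name: weight for name, weight in weights.items() if weight > 0}
    if not live:
        raise RuntimeError("no supplier is taking traffic")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical starting point: traffic shared evenly between two suppliers.
    weights = {"skyscape": 50, "datacentred": 50}
    print([pick_supplier(weights) for _ in range(5)])

    # "Turning off" one supplier is then just a weight change.
    weights["skyscape"] = 0
    print([pick_supplier(weights) for _ in range(5)])
```

Switching back is the reverse weight change, which is roughly what makes moving between suppliers look “as easy as that” from the outside.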
The current infrastructure has proved sturdy — and provided a different kind of downtime for the team.
“I always look at Twitter when we do the January peak,” Britten says. “If you look at previous years people are like, ‘I can’t log in.’ Now, they’re just having a go at the tax authority, which is awesome. That’s what you want to see. People not actually being unable to pay their tax, just really annoyed that they had to.”
You can watch the entire 34-minute talk from Britten and Harries from the Austin Summit on the OpenStack Foundation’s YouTube channel.