Saving the Economy From Ruin (with a hyperscale PaaS) (Europe 2021)

Government IT projects are infamous for being lengthy, costly and delivering unstable services that crash under load. This is an experience report of how all these norms were reversed, resulting in the rare event of positive press coverage for a normally heavily-criticised public sector department.

In March 2020 the United Kingdom went into lockdown, causing the most brutal recession in living memory. The Government had to react quickly, with policies to help citizens and businesses cope. These included:

- The Coronavirus Job Retention Scheme
- The Self Employment Income Support Scheme
- The Eat Out to Help Out Scheme

Once these initiatives were announced, the UK's tax department (HMRC) needed to encode these policies as digital services capable of dealing with huge spikes in traffic. And they had to be designed, implemented and delivered in a matter of weeks.

This session shares the story of how such a rapid and effective IT response was achieved. It will include the foundations that made it possible, and the architecture, delivery principles, working practices and organisational structures that enabled hugely successful outcomes. At its centre is HMRC's Multi-channel Digital Tax Platform (MDTP), which employs Continuous Delivery at scale and began development in 2014. It is a cloud platform for over 130 user-facing applications, powered by over 1,000 microservices that have been built as part of HMRC's 'Making Tax Digital' strategy. It provides an easy way for teams to build and deploy applications that can scale to handle millions of requests.

The platform and services needed to deliver financial support to more than 12 million employed and self-employed workers via the three schemes outlined above. In just a matter of weeks, the team was ready, and the platform withstood 67,000 job claims within half an hour of the Job Retention Scheme going live.
It also handled 440,000 applications for government grants via the Self Employment Income Support Scheme on the first day of its operation. This is a powerful story of how the right combination of culture, technology and focus can empower a large organisation to pivot fast and efficiently, resulting in the rapid delivery of digital services that are user-centric, maintainable, performant and resilient.


Ben Conrad

Head of Agile Delivery, HMRC


Matt Hyatt

Technical Delivery Manager, Equal Experts



Thank you, Fernando and Andrea. So without a doubt, the COVID-19 global pandemic was one of the most disastrous health crises in a century, and also one of the worst economic crises, because suddenly hundreds of millions of people were either unable to work or their jobs had disappeared. However, in many countries, the worst effects of these crises were ameliorated due to massive government programs to ensure that their citizens, often the most vulnerable, had sufficient funds to feed their families, as well as to stimulate the broader economy. One of my favorite presentations from DevOps Enterprise was in 2016 from the UK's HMRC, Her Majesty's Revenue and Customs, the tax collection agency for the UK government. They described how they made it easier than ever for citizens to do their personal tax assessment, enabling, say, a single parent to file their taxes with the click of a button on the bus ride home.


And they did this despite being embedded in a massively and famously complex IT estate. So up next is another amazing story from HMRC. This is a story about how last year they were able to distribute hundreds of billions of pounds to UK citizens and businesses, an unprecedented financial support package that would eventually see around 25% of the entire UK workforce being supported by public money. And they heroically built the technology to do this in four weeks, under conditions of incredible pressure and uncertainty. This story is told by Ben Conrad, who is their head of agile delivery, responsible for the Multi-channel Digital Tax Platform on which the success of this entire program hinged. And he will be presenting with Matt Hyatt, technical delivery manager at Equal Experts. They will describe the incredible challenges that they had to overcome and how they achieved their amazing outcomes. Here's Ben and Matt.


Hello, I'm Ben Conrad. I joined the civil service four years ago in order to come and work on the digital platform that we're going to be talking about. I've had a few job titles in that time and currently the line at the bottom of my emails reads head of agile delivery.


Hi folks, I'm Matt Hyatt and I'm an agile delivery consultant with Equal Experts. I've spent the last two years working together with Ben and a team of about 70 people who build and maintain HMRC's primary digital platform. Today, we're going to talk about that platform, why we think it's successful, and then share some stories about delivering services to save the economy in the midst of the pandemic.


Let's put it into context, although I imagine many of you are already familiar with HMRC. We are the UK tax collector: we collect taxes from individuals and businesses in the United Kingdom. And as you may have heard, there's a slightly expanded role in the area of customs due to some small recent changes. The civil service is apolitical, and we are largely an operational department with relatively little policy work. HM Treasury, headed by the Chancellor of the Exchequer, sets the economic policy for the UK government, and what HMRC does is make the policies of the government a reality, which can be quite challenging in itself. There are around 60,000 people working in the department as a whole, with about 2,000 working within HMRC Digital. And we are the big player in the UK government when it comes to digital: HMRC is responsible for around 70% of all government transactions with the public that happen over the internet.


And it's because of that, that when COVID-19 hit, HMRC played a key role in delivering the UK government's financial response to the economic crisis. On the 23rd of March last year, the Prime Minister made an announcement in which he gave the British people a very simple instruction: you must stay at home. In the first lockdown, the British people were only allowed to leave their homes to go shopping for necessities, take exercise, or, if absolutely necessary, to travel to work, or to check their ability to drive by visiting Barnard Castle. Obviously everyone working on building digital services was of course able to work from home; indeed, we'd been doing so for some time prior to this instruction. We had the technology, and there was no negative impact on our productivity. We were the lucky ones. The lockdowns had a whopping great impact on the UK economy. Every country had its own experience of the pandemic; for the UK, like many European countries,


it was extreme, and that is without the health crisis itself. We experienced the most severe economic contraction in over 300 years; it made the global financial crisis look like a blip. This chart shows the impact of the first national lockdown, but there have been three so far, during which we were ordered to stay at home. Thousands of businesses paused or ceased trading, millions of citizens lost their income, and industries were completely decimated. People simply couldn't work: hospitality, travel and the arts were all closed. The government responded by announcing an unprecedented financial support package that would eventually see around 25% of the entire UK workforce being supported directly by public money. The support had to be accessed somehow, and the Chancellor announced in a live television broadcast that HMRC would provide access by building four new digital services, and doing so fast. From the time of the broadcast, which is more or less when our teams found out, they were given 20 working days to deliver the first service.


The spike on this chart is from the announcement of the Self-Employment Income Support Scheme: before we'd built any new services at all, people were logging into their tax accounts to calculate for themselves what they might be eligible for. A normal new digital service might take nine to 12 months to deliver, from inception through a discovery phase, building an MVP, and an alpha and private beta before being launched as a public beta. But for the COVID services, we just didn't have this luxury. And the challenges went way beyond ludicrous timescales: we would have millions of users, but nobody could actually tell us how many; whatever we built had to be accessible to everyone; it had to be capable of paying out billions into bank accounts within hours of launch; and it needed to be secure, with checks being conducted before money was paid out. So we had four new services. I'll go through the acronyms.


There's the Job Retention Scheme, which introduced us to the word 'furlough'; the Self-Employment Income Support Scheme; Statutory Sick Pay; and the one we've slightly mixed feelings about, Eat Out to Help Out. For that last one, the government probably saved thousands of jobs in the hospitality sector, but it did so by subsidizing meals and encouraging people to sit inside restaurants. So how did we do? Thankfully, we nailed it. We went from being the least popular of all government departments to the people you can rely on to help out. All the services launched on time, most a week or two ahead of expectations, without any issues, and we achieved a 94% user satisfaction rating. The Job Retention Scheme paid out over a billion in claims on its first day, at an average rate of three claims per second. The current value of claims across those four schemes is around 80 billion pounds.


This created some publicity, although perhaps not as much publicity as it would have done if they'd all crashed on launch. So why was there this excitement? IT projects in the public sector have a reputation for being terrible. At the same time, back in the first lockdown, many established national brands in the private sector were struggling to cope with their increased traffic: major websites crashed or had to implement some really nasty queueing solutions just to book a delivery from a supermarket. But our services held up, and the industry wondered how we'd done it. The answer is that we leveraged a mature digital platform, one that has evolved over the last seven years and which allows HMRC to rapidly build digital services and then deliver them to the public at hyper-scale. But what exactly is a platform, and why is it useful? Our platform is the Multi-channel Digital Tax Platform, or MDTP. It's a collection of infrastructure technologies that enables HMRC to serve content to users over the internet.


It's useful because business domains within HMRC can expose tax services to the public by funding a small cross-functional team to build a microservice or a set of microservices on our platform. The microservice architecture is another talk in itself, but it really does enable a great deal of what we offer. But that's not so different from any hosting service. What makes our platform so useful is that it removes much of the pain and complexity of getting a digital service in front of the user. We achieve that by building, customizing and configuring a suite of common components that are necessary to develop and run high quality digital products, and we offer these to our tenants for free. We've always struggled to find an image to use on slides that represents MDTP; indeed, one year we ran a competition across HMRC Digital for someone to draw MDTP. This was the winning entry.


As you can see, it demonstrates a certain degree of hand-eye coordination. It won by dint of being the only entry, and it doesn't exactly convey much about the platform itself. So let's go back to the logos. I'm not going to go through all of these, but you may notice that most of our tooling is open source, which is not the norm in a traditional government department, where sometimes comfort is taken from an expensive licensing arrangement. MDTP and the people that have worked on it have successfully transitioned a large-scale public organization onto open source, onto the public cloud, and even into coding in the open. Now here's Matt.


Thanks, Ben. So an important part of this talk is scale, so I guess you're wondering: how big are we? And the answer is pretty big. We're probably a level down from a planet-scale operation like an Amazon or Facebook, but we're bigger than many tech organizations, we're the largest digital platform in UK government, and due to the sheer number of services that we host, we're probably one of the largest platforms in the UK as a whole. We host about 1,200 microservices built by more than 2,000 people, and they're split into 70 teams across eight geographic locations. Now those teams make about a hundred deployments, or changes, every day in our production environment, and many, many more than that in our lower environments like staging and QA. All of the teams use agile methods with deliberately lightweight governance, and they're trusted to make changes themselves whenever and as they see fit.


It only takes a few seconds to push changes through our infrastructure, so getting products and services in front of users happens really fast. But the platform hasn't always been this big or busy: development began with a single team of engineers nearly a decade ago. A key part of that story is the Government Digital Service, but again, that's another talk in its own right. Pivotal to the success of the platform, aside from GDS, has been a constant focus on a few really important things: culture, tooling and practices, and the goal of making it easy to add teams, build services and deliver value quickly. Our teams couldn't have done that without understanding what's involved in getting a digital service in front of a user, and doing so rapidly, reliably and repeatedly. We've evolved the platform around these goals so that a cross-functional team can really quickly come together and make use of our common tooling to design, develop and operate their public-facing service.


I guess a few of you might be wondering what this actually means in practice. Well, quite simply, a bunch of developers, user researchers, content designers and product people can form what we call a service team, and everything you see on the right-hand side of the screen that service team would get for free when they use our platform. So we provide somewhere for the code to live, like GitHub. We provide automated pipelines for the code to get built and deployed into production environments, where the teams can get rapid user feedback. We then supply telemetry tooling to enable a team to monitor its services, with automated dashboards and alerting mechanisms, so they always know what's going on. And finally we provide collaboration tools to help all the teams to communicate, both internally and between each other, so that they can work effectively both remotely and in person.


A key part is that all of this is available more or less instantly, with minimal configuration or manual steps required. The result, we hope, is that the engineers can focus solely on solving the business problems rather than anything else. Now, one of the key principles that we think enables us to work in this way, and at this scale, is the concept of an opinionated platform. You might've heard this being referred to as a paved road, or the golden path, or quite simply as guardrails. And the key point is, with 2,000 actors making changes potentially several times a day on our platform, things could get very messy quite quickly. Our answer to that is to bake some governance into the platform itself. So the basic rules are: if you build a microservice, it must be written in Scala and it must use the Play framework.


If your service needs persistence, it must use Mongo. And if your user needs to perform a common action, like uploading a file, you must use a common platform service to do that when there is one. The benefit is most obvious there: if you stick to the rails, you can go really, really fast when delivering your services. But there are other benefits too. By limiting the technology used on the platform, the platform is simpler to support and we can provide common services, reusable components that we know work with all the services. It also allows people to move between services, and indeed the services to move to new teams, without worrying about whether our people have the required skills to do the job: they should all at least know Scala. And the opinions are designed to prevent waste: by mandating common components, the idea is that we prevent all service teams having to spend time rolling their own solutions to problems that we've already solved.
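To make the guardrails idea concrete, here is a minimal sketch, in Python purely for illustration (MDTP's real enforcement lives in its own build tooling, and service code itself is Scala), of validating a hypothetical service descriptor against those opinions; the descriptor fields and rule wording are assumptions, not platform internals:

```python
# Hypothetical sketch of "opinionated platform" guardrails: a service
# descriptor is checked against the platform's mandated choices before a
# build proceeds. Field names and rules are illustrative only.
MANDATED = {
    "language": "scala",
    "framework": "play",
    "persistence": {None, "mongodb"},  # persistence is optional; if used, it must be Mongo
}

def validate_service(descriptor: dict) -> list:
    """Return a list of guardrail violations (empty means on the paved road)."""
    violations = []
    if descriptor.get("language") != MANDATED["language"]:
        violations.append("microservices must be written in Scala")
    if descriptor.get("framework") != MANDATED["framework"]:
        violations.append("microservices must use the Play framework")
    if descriptor.get("persistence") not in MANDATED["persistence"]:
        violations.append("if a service needs persistence, it must use Mongo")
    return violations
```

A descriptor that sticks to the rails, such as `{"language": "scala", "framework": "play", "persistence": "mongodb"}`, validates cleanly; anything off-road gets a list of reasons back rather than a deployment.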


Now, obviously not every team follows the rules all the time, but in general we find that most teams see the benefit of doing so. Now crucially, the need to care about any infrastructure is abstracted away from service teams, so they can focus solely on their apps. They can still observe the infrastructure through tools like Kibana and Grafana, but none of the service teams have access to AWS accounts themselves. So you're probably wondering about these opinions. They're a bit out there, right? Scala, Play, Mongo: they're hardly ubiquitous elsewhere in the industry. But our opinions can and do change according to user needs and demand for more features. So when teams start demonstrating a justifiable need for something new, we'll work to provide it in a repeatable way, enabling self-service. Now the self-service part is critical: a service can be created, developed and deployed on our platform without any direct involvement from platform teams at all. Ben, over to you.


Thank you. We've talked a bit about the platform and why it's good, but delivering during the pandemic wasn't all plain sailing. The UK economy desperately needed this to work, and we desperately wanted to avoid any users seeing messages like the one at the top of this slide. The first part of our problem was precedent: to have lived through the last global pandemic, you'd have to be 102 years old. And although tax was definitely a thing in 1918, it was very much a paper-based system. For our annual key business events, we tend to have years of data, so we can forecast to hourly granularity how many people we can expect to use any given service at any one time. But for these new services, we lacked data, we lacked traffic profiles and we lacked models. All we had were ballpark estimates of the eligible population and a hunch that offering people thousands of pounds when they had no other income would be popular.


The estimates produced big, scary numbers. We could performance test our own services and scale them, but where we had dependencies on third parties who were not able to scale, we had to do a lot of work to make those API calls asynchronous wherever possible. We also quite deliberately prioritized getting money to people who needed it over preventing fraud. Now, that doesn't mean that we made it easy for fraudsters, but it does mean that we were racing to complete the development of our new counter-fraud measures right up until launch. So how did we tackle the problem of scale? I mean, there's a simple answer: by making everything really big. The huge advantage of having a platform composed of immutable infrastructure defined as code is that even the parts of the platform that need to be manually scaled can easily be. The COVID-19 teams were working hand in glove with our platform teams.


Between us, we created worst-case traffic profiles, which were based on overall eligibility for the schemes combined with observed behavior, like traffic spikes from previous business events and even spikes we've seen after recent TV announcements. Although we try to ensure everything is self-service, the platform needed to be responsive to these new requirements. To take a few of those: the load testing required was at a level that a single instance of Gatling was not able to provide, so our build and deploy team added features to enable load testing in parallel. This increased load then broke our logging pipeline, so telemetry needed to be scaled up in both staging and production to handle the increase in logs being generated. MDTP itself is relatively new and shiny, certainly in UK government terms. It was chosen to deliver the COVID-19 services because it offered the best chance of meeting the ambitious development deadlines and then scaling to support the tsunami of expected traffic.
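The asynchronous decoupling of slow third-party dependencies can be sketched as a queue-and-worker pattern. This is a hypothetical Python illustration of the shape of the idea only; the real services are Scala/Play and the actual integration details aren't public:

```python
import queue
import threading

# Hypothetical sketch: rather than calling a slow third-party system
# synchronously inside the user's request, the handler enqueues the work
# and acknowledges immediately; a background worker drains the queue at a
# pace the downstream dependency can tolerate.
work_queue = queue.Queue()
results = []

def handle_claim(claim: dict) -> str:
    """Fast path: accept the claim and defer the downstream call."""
    work_queue.put(claim)
    return "accepted"  # the user gets an immediate acknowledgement

def worker(call_third_party) -> None:
    """Slow path: process queued claims one at a time, in order."""
    while True:
        claim = work_queue.get()
        if claim is None:  # sentinel: no more work
            break
        results.append(call_third_party(claim))
```

A real implementation would add retries, a durable queue and back-pressure; this only shows how the user-facing response is decoupled from the constrained downstream call.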


However, as much as we believe in the power of a hyperscale cloud provider, the wider HMRC is much more conservative about where and how it holds sensitive data, and that means the information about people and their taxes doesn't live for long on MDTP. HMRC does hold a great deal of confidential information about every company and taxpayer in the UK. We protect data on MDTP by ensuring that data can only be accessed by authorized microservices, but most of the systems of record sit downstream of MDTP in the HMRC corporate tier, mostly stored on a mixture of mainframes and old physical hardware, which is impossible to scale to cope with the level of traffic that we expected. So before a line of code was written, we'd realized we needed to remove any synchronous reliance on these data stores and host everything on MDTP, which we did in collaboration with the service teams. The user journeys were cleverly designed so that we could gather some information ahead of launch and avoid unnecessary load at peak times from ineligible customers.


We then migrated the core eligibility and financial claim data from the legacy data centers into temporary stores on MDTP using a combination of Amazon S3, MongoDB and some fairly crude data transfer methods: manually copying things up to S3 from a secure laptop. Despite all of this, we had some nervous moments when things could have gone quite badly wrong. There were a lot of late nights, there were early mornings, and there were a lot of very tired developers. With one of the services, there was an enormous peak of traffic at midday that HMRC was entirely responsible for generating ourselves. MDTP was actually able to handle this traffic, but the third-party systems on which it relied were not. These dependencies constrained parts of the COVID-19 services to something like 30% of what the platform itself could handle.


We moved to asynchronous calls where possible, but logging in had to be part of the user journey. The reason for the peak is that HMRC had decided to stagger the traffic by notifying people of a specific day on which they could claim. This was in itself entirely sensible. However, splitting that traffic in two, by giving people a time from which they could claim, either 8:00 AM or 12:00 PM, was entirely self-defeating. It turns out that 8:00 AM is too early for a lot of people to log on to an HMRC website, even if it's to claim money, but midday is ideal. And so hundreds of thousands of people set reminders on their phones and tried to log in all at the same time. Government Gateway is what we use for logging in, and we knew it could handle around 200 logins per second before it started to creak. We anticipated many more than 200 logins per second.


So we needed a break glass, and decided on using Akamai Visitor Prioritization. This is a fairly crude manual tool that offers the ability to throttle traffic by holding users in a waiting room and allowing a certain percentage through every 30 seconds. The peak of our peaks saw well over a thousand login requests per second, and a swarm of around 50 engineers from across both the platform and the service teams monitored the event live and worked together to manage that traffic, trying to estimate the percentage to allow through, although we had very limited visibility of the numbers attempting to log in. And despite initially having close to 100,000 users in that waiting room, we were able to let them through into the service relatively quickly, so much so that we didn't receive a single complaint. And now, Matt.
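The waiting-room throttle can be modelled very simply. This is an illustrative Python sketch of percentage-based admission only; Akamai's actual Visitor Prioritization Cloudlet is a managed edge product and works quite differently internally:

```python
import math

def admit(waiting_room: list, percentage: float) -> tuple:
    """One throttling 'tick' (every 30 seconds in the real event): let a
    chosen percentage of queued users through to the service and keep the
    rest waiting. Purely a model of the operators' manual dial."""
    n = math.ceil(len(waiting_room) * percentage / 100)
    return waiting_room[:n], waiting_room[n:]

# Example: roughly the launch-day scenario, with 100,000 users queued and
# the engineers choosing what fraction to release each tick.
room = list(range(100_000))
admitted, room = admit(room, 10)  # 10% through: 10,000 admitted, 90,000 still waiting
```

The operators' job during the event amounted to adjusting that percentage by hand, tick after tick, against very limited visibility of how many people were still arriving.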


We mentioned earlier that our priority was getting money into the hands of people who really needed it, but HMRC was also acutely aware that this would be the largest giveaway of public money in living memory, and therefore an irresistible target for fraud. Now, whilst the service teams were flat out building new user journeys, there were other teams across HMRC that were busy beefing up or building entirely new anti-fraud measures. These varied from plugging into third-party systems that check for dodgy bank accounts, to extending HMRC's entire internal fraud-risking system, which is a massive task just on its own. Now, with just four days left before the public launch, the integration between the first COVID-19 service and the new fraud-risking system, which is hosted on an entirely different platform, still hadn't been built. What's more, the service team didn't have the capacity to finish it.


So instead one of our platform teams picked up the baton and found a way to get the data flowing between these two new services and platforms. Now, frankly, this involved breaking most of the rules which we normally insist on: there were two microservices sharing a single database, and we had to hack our way through our own database authentication to get it to work. However, the result was that the fraud-risking capability went live together with the service, which, I can tell you, personally seemed super unlikely at 4:30 AM on the morning of the launch. And how much fraud did it catch? Well, the truth is we just don't know yet. Officially HMRC has assumed that between five and 10% of all claims will be either fraudulent or incorrect, but what we've seen so far is only about 70,000 cases that have been marked for investigation, and proving fraud takes time.


So it will be quite a while before we understand how much fraud we actually stopped. Now, stories like this last-minute integration were abundant: everywhere you looked there was innovation, improvisation and a Herculean effort by people. And it didn't stop with just the initial service launch. These services were meant to be temporary, but they still continue to be developed today, as the teams iterate and add the features which they didn't have time to release last year. And now there are even more services being built as the government continues to announce further initiatives to help the public, and they trust HMRC to be able to get them out on time. In fact, the number of services in our production environment has grown by 30% in just the last year. So in many ways, COVID-19 has forced us to redraw what we thought was possible. It's kind of become the new normal, but we've discovered that that's not entirely a good thing.


So we've learned that, number one, it's not good to integrate a country-scale fraud-risking system in four days. Secondly, you need to be careful with unhelpful precedent: it's not advisable to compromise your fundamental design standards to get a product shipped. And it's almost always best to avoid doing anything manually, particularly large-scale dumps of citizen data. Finally, it's never good to ask an engineer to work more days in April than there are days in April. We have to remember that what was achieved was done so under incredible pressure, by admittedly willing and determined people, but they had the knowledge that they were helping their families and friends and neighbors who were quite literally just trying to survive. Now, it's very true that there wasn't a lot going on socially at the time, and anyone with children would have killed for an excuse to get out of homeschooling, but that doesn't make it okay.


At points, we had to tell engineers to stop working and to go to sleep. It was clear that working as our teams did in 2020 just isn't sustainable. So this is a message that we need to keep reinforcing now, with our leadership community and with other government departments who are understandably interested in learning how we got things done so quickly, in an effort to recreate that. COVID has changed things for us in a number of ways: the legacy, alongside the pride and the sense of achievement at doing an incredible job, is a kind of collective burnout and fatigue with being a hundred percent remote. So much of what we're doing at the moment is focusing on how we make our ways of working sustainable in whatever turns out to be the new normal, with a commitment to maintaining the flexibility that our people want, but trying to revive the camaraderie and the human contact that we took for granted when we were together every day. Ben,


We shouldn't leave you with the impression that we were only working on the COVID-19 response in the last 12 months. In that time, we have actually migrated the tax platform away from our homegrown legacy deployment tooling to run on AWS Elastic Container Service. This project was delivered by utilizing a tiger team made up of engineers from the different platform teams, drawing on the experience available to us while also ensuring that other work continued. The result is a more modern platform which is simpler to operate and will scale elastically in response to demand. I'm proud to say we even won an industry award for it, and even more proud that we completed it with zero downtime, as we host hundreds of important services. We believe that we have this amazing platform, and we have shown that it can enable services to be built and deployed at breakneck speed and scale to cope with more traffic than a government website should ever expect to see.


We believe we are secure, and we aim to build on the reputation we've grown over the last decade. Digital sections of departments were set up years ago, and while they've succeeded in transforming the experience of using government digital services, they've mostly operated in silos and failed to deliver a revolution in how technology is delivered across government. I hope that can change. As we look to possibly host additional services, we'll always need to keep the opinions we hold under review, balancing the consistency that enables rapid delivery against stifling innovation by restricting people in the technology that they would most like to use. And one day soon, we might even be able to meet up again in person. And that's that: a story of building some digital stuff to save an economy during a pandemic. If you have any experience of achieving large-scale digital transformations in huge monolithic organizations, overcoming the vested interests that come from monopoly suppliers, we would love to compare notes with you. Thank you so much for listening. It really is a privilege to have shared this with you today. Hopefully none of us will ever need to build anything to fight a pandemic again, but if you're interested in finding out more about any of the things we've touched on, we'll happily answer questions and share some links to things we've published. Thank you. Thanks folks. Thank you.