We'll take you through the transformation of NAV, Norway's biggest governmental agency, responsible for paying out a third of the national budget. Five years ago, we had no internal developers and 4 releases a year. Now we have 300 internal developers and deploy to production every other minute. We believe we did some things right: treating internal platforms as products, and going for continuous improvement over projects. We also did some things that we are not sure about, such as how to split up a gigantic organisation into autonomous teams. This talk will tell the story of our journey and share what we learned.
Audun Fauchald Strand
Principal Engineer, NAV
Principal Engineer, NAV
Hello, and welcome to this talk about how we started the transformation of Norway's biggest government agency. We call it "From four releases a year to once every other minute". My name is Audun, and this is Truls. We're going to give you some insight into our transformation, but first we want to show you some data.
Let's start with Accelerate, perhaps the most important book for this story. Accelerate identifies deployment frequency as one of the four key metrics that indicate the performance of a software development team. We have searched for, and also found, data from the last 12 years with just enough resolution to let us plot the average number of deploys to production per week, per year. And as you can see here, the change is pretty big. Before 2016, NAV used to take pride in arranging four coordinated, massive releases per year. But something changed in 2016 that accelerated into 2019, and as of today this number is over 1,300 per week. This translates to once every other minute of the working week, at least here in Norway.
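The "once every other minute" figure can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 37.5-hour working week, which is a common arrangement in Norway but is not stated in the talk, and the function name `minutes_between_deploys` is made up for this example.

```python
def minutes_between_deploys(deploys_per_week: float,
                            working_hours_per_week: float = 37.5) -> float:
    """Average number of working minutes between two deploys.

    working_hours_per_week defaults to 37.5, an assumed length of the
    working week; the talk itself does not give a number.
    """
    if deploys_per_week <= 0:
        raise ValueError("deploys_per_week must be positive")
    working_minutes = working_hours_per_week * 60
    return working_minutes / deploys_per_week


# 1,300 deploys per week, the figure quoted in the talk.
interval = minutes_between_deploys(1300)
print(f"About one deploy every {interval:.1f} working minutes")
```

Under that assumption the interval comes out a little under two minutes, which matches the speakers' phrasing. With a 40-hour week the result is still below two minutes.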
But I want to dig deeper into the numbers, because we found something strange. The last graph showed something like a steady increase per year, but when we changed the resolution to weekly, we saw that before week 32 in 2019 we had a few hundred deploys every week. Then we had a little drop in the summer holidays. And straight after the summer holidays, we went from a few hundred to more than a thousand. This was kind of a mystery. Why did this happen? In this talk, we're going to show you what happened to explain this sudden jump in deployment frequency.
But let 2019 remain a mystery for now. Let's turn to 2016, where the change began. To understand how, and not to forget why, this change started, we need to go further back in time, to the very start of NAV, and even a bit before that. In 2006, our politicians in Norway decided to create an interconnected super agency from three already huge existing agencies. We combined the service for benefits with the service for helping people get back to work and the social service. The vision was to create an agency that gave its users, that is all Norwegian citizens, a really unified welfare experience.
"Nav" is actually a Norwegian word that means the central part of a wheel, the hub, and this is a good description of NAV: the hub of the Norwegian welfare state. We support a lot of the use cases in the welfare sector. The size of NAV is actually quite unique. We pay out about a third of Norway's national budget every year. Most of this is age-related pensions, but we also have other benefits. If you compare this to other countries, I've read somewhere that NAV does the job of something like 30 different agencies in the US, for instance. So we have centralized a lot, and put a lot of functionality into one big system.
The political ambitions were high for NAV as an agency, but sadly those ambitions were not followed by equally high ambitions for the IT systems behind this super agency. So NAV was born with two separate legacy monoliths, both of them absolutely essential for the welfare state to function. But being one organization now, we needed to make those monoliths communicate in some way. Making changes in one monolith is difficult enough, but making coordinated changes in two interconnected monoliths is nearly impossible. This made for a large organization not optimized for change at all, and we compensated with lots of change management. And the very first thing that NAV did as a new agency was to build yet another monolith, this time a new pension system. Having no in-house developers at all, this new system was built entirely by consultants.
And then of course we have the applications outside the big monoliths as well. We have task management for caseworkers, systems for handling or showing applications to people, and an application platform. But almost all of this was built by consultants, so it was really difficult for us to take ownership of our own technical direction. As an agency we had 19,000 employees, and an IT department that consisted mostly of outsourcing people. We had no developers ourselves. So we decided to start insourcing.
That started in 2016, when we got a new CTO. He started off with a clear vision of what NAV could be and what NAV should be, and he told that story again and again, both high and low in the organization and outside the organization. In retrospect, he checked off all the boxes for what transformational leadership is all about. Fundamental to the story was the need to reclaim technical ownership over our own systems. These were the very same systems that had been built and maintained by consultants for the last 15 years.
So reclaiming the technology was a huge task. Truls here was one of the first developers hired. A lot of the time in the first few years went to just getting more people in and making sure that they were all of the required quality. Everybody of course wants to hire the best developers, so it isn't an easy process. But now we have almost 200 developers, and if you count the data engineers as well, we have 300. We have senior people, and we are starting to take more people straight out of university as well. So now we have the capabilities to build, maintain and shape our own systems. Now we have set the scene and told you a little bit about what NAV is and about our history. Next we want to talk through three different topics: platforms, sustainability and culture. We're going to show you how all of these are connected, and that you need to work on all of them to succeed.
So let's start talking about platforms.
At the time we are going to talk about now, I didn't actually work for NAV. I worked for a big telco outside NAV. I had been a consultant at NAV for a year before I started there, and the experience was so bad that I actually quit the consultancy company. But I heard some rumors about the new application platform they had built at NAV. Specifically, I heard that you could get a new server in production in 10 minutes, and this was much better than what I had seen at the big telco.
And actually, I worked back then, as a consultant myself I might add, on the team that made that platform. Thinking back, it was pretty good for its time, back in 2013 I think. It demanded a clear separation between application and config, and offered zero-downtime deploys in return. And you could also, as was just mentioned, get your own server, your own virtual machine, in 10 minutes. But the problem was that very few teams in NAV actually used it.
So NAV did an assessment using the continuous delivery maturity model from the Continuous Delivery book. If you look at the different facets of continuous delivery, you see that you need a lot of different things to be able to get good speed. We had gotten really good at build and deploy, but we hadn't really thought that much about the rest of them.
Plotting out this assessment formed much of my motivation to join NAV as the first developer back in 2016. Because what good is there in offering deployment in minutes when it takes months to change a comma in a simple application, because it needs integrated and coordinated testing and a really, really hard deployment mechanism?
So we needed to think bigger. We realized that just being good at build and deploy wasn't enough. We had to look at solving all the other aspects as well, and we knew we wanted to do this the right way. We were inspired by Spotify, and we wanted to build what they call a golden path platform: something that's easy to use for a lot of developers and makes it easy to do the right thing. And we didn't want to support all the unnecessary edge cases. So a few years later, when we were further along with insourcing, we started on a new iteration of our application platform. We saw that it was a good tool to use to address the other parts of the continuous delivery maturity model.
So the first thing is that we wanted to use the platform as an instrument to improve both the organization and the quality of the application architecture and system architecture in NAV. To do that, we wanted of course to use great technology, and we wanted everything to be open source. And as Audun just said, we wanted to optimize for migration. We wanted to make a platform that the teams really, really wanted to use.
By far the most important thing when you create an application platform, and this is also recognized as one of the most difficult problems in computer science, is choosing a name. We had multiple strategies to choose from. This time we chose to pick a fun name and retrofit it into an acronym. So we made the NAV Application Infrastructure Service: NAIS. We wanted this to be nice to use for developers, and by nice to use we meant that it should be as simple as possible, but not simpler. We wanted to remove a lot of the unnecessary creativity that teams need to do. You shouldn't need to consider different alternatives for load balancing or traffic shaping. The platform should be opinionated, but do the right thing in an easy and nice way. So from the start, we didn't want to make the perfect platform. We wanted it to be open source, but that's more for inspiration than reuse. We wanted it to be especially designed to work perfectly for teams in NAV, and to make it as easy as possible for them to move their own applications, running on the other application platforms, onto the NAIS platform. So we did a lot of the heavy lifting up front and integrated with parts of the overall platform.
And NAIS was born open source. That started the process of open sourcing most of our code in NAV. We were inspired by GOV.UK and their excellent write-up, not just on how, but on the motivation for why public sector code should be open: as our software is paid for by the public, it should be publicly available. So we now have over 1,000 public repos on GitHub. Most of them are just coded in the open. The code is very specific to welfare state problems and doesn't aim to be useful outside NAV. But we've got a handful of repos that are run as proper open source projects, including NAIS.
We even introduced a completely new buzzword: cake-driven development. We used cake for almost everything. We gave cake to any developer outside the NAIS platform team who sent us pull requests to fix bugs in the platform. We used cake to get teams to migrate to the platform. And it's a really good trick: instead of having the team consider whether this is difficult or not, they only have to think "do I want the cake?" And most teams want cake, so that got them started on the process.
And we did more than just give out cake. We made hoodies, we made socks, and for the hipsters, that is not us, caps. And of course stickers: we made huge piles of stickers and spread them all over the place.
And this also created kind of a sense of pride in working at NAV. People understood that it is possible to work at NAV, something that used to be a really bad place for software development, and actually make good things. I would say we did what most teams do when they create software for external users: we used the same product development techniques, but for our own internal platform. That helped a lot when it came to migration and adoption. So NAIS was a successful platform, and as of now most applications in NAV run on the NAIS platform. But NAIS didn't happen in a vacuum. There were other shifts going on at the same time.
So we need to talk a bit about the culture. NAV didn't always have a DevOps culture, as you have probably figured out by now. Back in the days when software development was something that NAV bought from the market, there was no DevOps culture at all.
All this outsourcing meant that we built a big control regime. NAV created a specification, the consultants estimated, NAV accepted, and then the consultants built it. And then NAV said, well, this isn't exactly what we needed. So you had a bad process, and making any change in this process, with loads and loads of people and loads of coordination, is really expensive.
And when change is expensive, you tend to change less often. And the less often you change, the bigger each change becomes, because the need for change doesn't stop and the developers don't stop typing. Big changes are risky, so you need to put more control mechanisms in. You get manual testers, you get change managers and coordinators, all trying their very best to keep the risk down. And that makes you change even less often. And because at this point change is rare, you don't see any value in automating things. So change becomes even more scary and even more risky.
And then you have a downward spiral, and NAV raced down this downward spiral all the way to only four releases a year. It even went so far that we actually celebrated the size of our releases, because when you release only four times a year, the releases become really big. Some can remember cake being given out to celebrate the fact that we had 103,000 development hours in a single release. Of course, when you have such a big release, it's almost impossible to think and reason about the risks involved. No one actually knows the size of all the changes, or what the consequence of an error is. And this, again, leads to more testing and more coordination and more change management. All this creates loads of hierarchies, and fear of something slipping through the net of manual testers.
So that was how it used to be. But in 2016, with a new CTO and in-house development, things started to change. And this new, awkward feeling of trust between the developers and the suits was beginning to emerge.
Specifically, the trust gave us the opportunity to deploy ourselves as a software team. This didn't happen all over NAV, but we had a few teams where there was enough trust between the old ops department and the development team that the teams started to deploy themselves. With much less overhead, those few teams were able to deploy small changes more often. And when people saw this, they saw that everything doesn't explode just because you deploy often. Deploying to production every week doesn't mean that everything changes. It doesn't mean that the users have to get used to a new UI every week, or that applications explode. And when they see that change is safe, they start to deploy even more often. Then we have an upward spiral, and things get better and better.
But even though there was enough trust for us to deploy ourselves, that trust wasn't widespread at all levels of the organization. The upward spiral I just mentioned was not without conflict, because the change managers used to be in charge of all deployments, and they were now being challenged by this new DevOps culture. The fear of losing control was very much alive at this point, even though they were partly realizing that the control we used to have was merely an illusion of control.
So their response was to create a system of categorization for applications. The most modern applications, where we had modern deploys and the teams owned the technical direction, went on the whitelist. This meant, as we talked about a little bit, that the teams could deploy themselves and had full control of the deployment process. The old systems, the old monolithic applications, needed the same amount of change management and the same amount of coordination as before, so they could only release on the four-times-a-year schedule. This was called the blacklist. But most of the applications in NAV were actually placed on something in between: the gray list. These were apps that needed some kind of change management, where there wasn't enough trust to make it possible for developers to deploy themselves. The system devised for the gray list involved deep abuse of Jira's state machine. It made it possible to deploy using a Jira ticket and a button, but you still had a lot of central coordination. And every new application meant you had to talk to someone to get it evaluated and put on the correct list.
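The three lists amount to a deployment policy keyed on an application's category. The sketch below is a hypothetical model of that policy and not NAV's actual tooling: the names `DeployList`, `DeployRequest` and `may_deploy`, the ticket field and the release-window flag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeployList(Enum):
    WHITE = "white"  # modern apps: the team deploys on its own
    GRAY = "gray"    # needs an approved change ticket per deploy
    BLACK = "black"  # old monoliths: only in the coordinated releases


@dataclass(frozen=True)
class DeployRequest:
    app: str
    category: DeployList
    approved_ticket: Optional[str] = None  # hypothetical ticket id
    in_release_window: bool = False        # one of the four yearly releases


def may_deploy(request: DeployRequest) -> bool:
    """Return True if the request passes the policy for its list."""
    if request.category is DeployList.WHITE:
        return True
    if request.category is DeployList.GRAY:
        return request.approved_ticket is not None
    # BLACK: only during a coordinated release window.
    return request.in_release_window


print(may_deploy(DeployRequest("new-app", DeployList.WHITE)))
print(may_deploy(DeployRequest("some-app", DeployList.GRAY)))
```

In this model every gray-list deploy waits on a ticket, which is where the central coordination described above comes from. Moving an application from the gray category to the white one removes that wait entirely.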
This is a good time to return to the data and the mystery in 2019. Remember this graph? What looked like a steady increase turned out to be a sharp jump at a specific time, week 32 in 2019. So what happened in week 32 in 2019?
This was actually the exact moment when we stopped having the gray list. It wasn't necessary for teams to create Jira tickets to deploy to production anymore. All the teams, except for those on the old monolithic applications, could deploy themselves. The platform had the functionality necessary to provide transparency, so that people could see what happened. Removing the bureaucracy and the centralized coordination made NAV go much faster. And as you could see from the graph earlier, there was a big jump just because we removed that one blocker.
So in the end, we embraced a DevOps culture, and we are now changing our systems, some of which are core to the welfare state, every other minute. To do that in a responsible manner, we are building skills to reduce the probability of failure, such as creating deployment pipelines and the other skills necessary to do trunk-based development and still get sleep at night. We are also establishing good monitoring and built-in security mechanisms to further reduce mean time to recovery.
Now we have looked at platforms and culture. We looked at how important it is to create a good technical platform to solve the most common problems for the development teams, and at how the culture and the platform interact and how we need both to work together. So the last thing we want to talk about is sustainability. We have this quote from Alberto Brandolini that we enjoy very much: "Software development is a learning process; working code is a side effect." This kind of sums up what we think about when we consider software engineering: how to create systems that work over time. The quote is kind of abstract, though, so we need to go deeper and see what it actually means in practice. If the learning process is the most important part, how should we go about organizing a big work effort? We've had multiple projects at NAV, although we don't like to call them projects anymore, modernizing different parts of our systems. Big parts of the organization still want to optimize for deliveries, but we want to optimize for learning. So how do we do that?
If software development is a learning process, that means you should get to the learning as early as possible, and that learning happens in production. So get your walking skeleton to production as early as possible. That sounds easy enough, right? But the legal complexity of the public sector makes identifying a minimum viable product surprisingly hard. We have had success with supporting only a part of our user base as a first iteration, for instance only those with one job from one employer and no other benefits. It's really hard to make that work legally. But still, it's important to identify that minimum viable product. Contrast this with a project mindset: you start with the plan and you create all the tasks needed to reach the finish line. That is all the functionality and all the edge cases. Then you put it in production, and then you're done. You cross the finish line. The problem is that the finish line is a lie. It doesn't exist.
It's impossible to create a plan for everything you need before you start working on your project. If you try, you discard all the learning you're going to have once you create your minimum viable product, and we want to optimize for learning. And when you go to production, you realize there is no finish line there either. Just because you've gotten into production, the work doesn't stop. You have to continue evolving the system. This means you have to take a long-term view of software development. The system will live for as long as the problem is worth solving, and you need to frame the system with an organization that has the necessary knowledge to own and maintain it. And of course, when you have this stable organization, you need a proper approach to financing. You need funding for the team for as long as necessary, because we never want to stop learning.
So sustainability is about pace, about the right cross-functional skill set, and about a problem size that fits the cognitive capacity of the team. None of these things are trivial to solve, but we believe that stable teams are an essential part of the answer, and stable teams need stable funding. This is a challenge in the public sector, where most of the funding processes are project-based and more suitable for building offices and bridges than software. Changing this is a long political process that has implications for how the welfare state is funded. But NAV has taken a clear position here: we want to move away from projects and into product-based software development with stable funding.
So, to conclude what we have talked about during this talk: we showed you some data showing that we have made great progress. We moved from a few hundred to more than a thousand deployments to production per week. Diving into the data, there was this big mystery: why did we have this sudden jump in deployments? The solution was not much of a surprise. We replaced the bureaucracy and the coordination with trust, and that took away some of the blockers for speed. We've gone through a lot of the changes that have been made, and how they're all connected. We solved some technical problems with an application platform, and that application platform helped shape the quality of our applications as well. But just having a platform in isolation doesn't solve the problem. You also need to look at the culture and how you work.
So we put a lot of effort into building trust as a foundation for a DevOps culture. And as we have matured as a product organization, we strive to remove projects, and we have abandoned the mindset that software development is a race to the finish line, because there is no finish line. Instead, we like to think of it as a race to the start line: get to production as early as possible so that you can start the learning process.
Thank you so much for listening. Hopefully we've been able to answer a lot of the questions on Slack during this presentation. You can also reach us on Twitter if there are more questions. Thank you.