Las Vegas 2020

Super-charging Software Development at Intel through DevOps

Imagine the functional and integration testing complexities involved when 20,000+ software developers distributed across 10+ countries are all trying to make changes to code, test it, and put a complex system together. At Intel, we re-imagined and transformed how DevOps for complex products should work at scale. We have battled religious wars over tools and processes, where every team across the company had already invested in its own well-established, localized workflows and tools.


• How do you drive change, especially in a large and diverse company such as Intel?

• How do you help teams of all shapes, sizes, and maturity levels move towards a faster integration cadence?


After more than half a decade of change management, we have lessons to share on how to approach large-scale modernization using DevOps and analytics solutions. Our solution can now integrate changes across the company quickly and output a packaged software kit to our customers at 40X more capacity.


Hardware and software co-development is becoming more and more relevant these days with the uptick in device development, from wearables to software that needs to be integrated with hardware frequently. You will learn key insights not only on hardware and software dependencies but also on Intel's modernization efforts.

MD

Madhu Datla

Senior Engineering Manager for DevOps, Global Infrastructure and Systems Engineering Team, Intel

PT

Peter Tiegs

Principal Engineer, Intel

Transcript

00:00:14

Hi, I'm Madhu Datla. I'm a senior engineering manager at Intel, responsible for developing DevOps and systems engineering capabilities.

00:00:22

Hi, my name is Peter Tiegs. I'm a principal engineer at Intel focusing on DevOps. Today we're going to talk a little bit about our journey supercharging software development at Intel.

00:00:36

Intel processors and products are used everywhere: laptops, mobile devices, servers, autonomous vehicles, and IoT devices. Within our client organization, in the last five to six years, the number of products we have been delivering has grown exponentially year over year, because each of our customers wanted to differentiate their product. We created a segmentation strategy, which allowed us to meet growing market demand and support a large number of use cases. Each of these product SKUs can have a different set of components interacting with the processor, and the software on that product differentiates how the system behaves in different situations. For example, I have an Ultrabook here, which can be in a laptop mode or a tablet mode

00:01:28

Or a tent mode. And in each of these modes, the user experience is different, and the software and hardware interactions are different. Another customer may want to just sell a sleek laptop at a different price point. There are several thousands of products from our partners like Google, HP, Lenovo, and Microsoft, each of them offered at different price points and for different usages. So essentially the number of products has exploded every year. We are integrating 4X more products, and this is happening across multiple operating systems like Windows, Linux, and Chrome, which are getting released more frequently. So the ecosystem has become extremely complex and dynamic, and our hardware has to be validated and released faster, with the highest quality standards possible,

00:02:26

In order to provide the best customer experience on these Intel systems. And our development team is working across the globe: we have 15,000 engineers doing software development in a variety of languages, in multiple geographies, around the clock. One thing that is unique about Intel is that our product development is extremely complex. One of the reasons is that we have a large dependency on our hardware, and in the early phases of product development the hardware is very, very unstable. Integrating 30-plus software components that are getting developed across the globe, on a daily basis, is not a trivial task. We had to innovate; we had to come up with an enterprise-wide DevOps infrastructure to support the complex development here at Intel.

00:03:22

So as Madhu was saying, there are 30-plus different software components that go into our platform-level products, or the various SKUs that we have. The platform-level continuous delivery mechanism that we've put together is based on common continuous integration and continuous delivery practices, but scaled up and set into a segmented strategy where each of these various software teams, whether it's the graphics driver or the wireless LAN driver or the audio driver, runs its own process and then delivers in a CI-like process into platform integration, where we bring in not only all of those software stacks but also the OS, the standard software, and the drivers. Historically, before we started to look at this as a platform-level continuous delivery process, we used to do big-bang integration, where each of these teams would deliver their software at some arbitrary time. The system integrators would pull it all together, asking people which versions to use, what shared drive that version was on, what SharePoint site this version was on, then put them all together for the first time and see if they would work.

00:04:37

And nine times out of ten, they did not work. So what we did was put together this process based on continuous integration, where we would incrementally add a new version against the baseline of all of these ingredients, assemble it into what we call a base SoC kit, and then turn on or enable different features depending on our target customers, whether that was IoT or client. As Madhu pointed out, we have a couple of different client SKUs and capabilities, and we're even now going into the data center and server-based platforms, so that we can deliver a BKC, or best known configuration, out to our customers. All of our software for platforms goes through this pipeline now, and we've enabled a common repository for sharing source code, because of the deep-debug use cases by our upstream validation teams that need access to the source.
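A rough sketch of that idea, with hypothetical ingredient names and a made-up structure rather than Intel's actual tooling: incrementally integrate one new ingredient version against the baseline, then enable segment-specific features to produce a BKC-style kit.

```python
# Hypothetical sketch of baseline integration plus segment-specific feature enabling.
# Names and structure are illustrative only, not Intel's internal tooling.
from dataclasses import dataclass, field


@dataclass
class BaseKit:
    """A baseline of ingredient versions (the 'base SoC kit')."""
    ingredients: dict                       # e.g. {"graphics_driver": "100.9126"}
    features: dict = field(default_factory=dict)

    def integrate(self, name: str, version: str) -> "BaseKit":
        """Incrementally add one new ingredient version against the current baseline."""
        updated = dict(self.ingredients, **{name: version})
        return BaseKit(updated, dict(self.features))

    def enable_for_segment(self, segment: str) -> "BaseKit":
        """Turn on or off features depending on the target segment (client, IoT, ...)."""
        segment_features = {
            "client": {"tablet_mode": True, "tent_mode": True},
            "iot": {"tablet_mode": False, "long_term_support": True},
        }
        kit = BaseKit(dict(self.ingredients), dict(self.features))
        kit.features.update(segment_features.get(segment, {}))
        return kit


baseline = BaseKit({"graphics_driver": "100.9126", "wifi_driver": "22.40"})
candidate = baseline.integrate("audio_driver", "6.0.9373")  # one ingredient changes at a time
bkc = candidate.enable_for_segment("client")                # a best-known-configuration-style kit
print(bkc.ingredients, bkc.features)
```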

00:05:41

And as you can imagine, with the diversity of software teams delivering into this system, as well as the system itself, we have a diversity of tools within our entire DevOps portfolio. I'm sure you recognize many of these tools, operating in many of the different roles that we need in a complex DevOps system. And while it may seem like the right goal is to drive down to a single pipeline, a single set of tools, and a single tool chain, we have found that that's not really an achievable goal in the reality of DevOps in a complex, enterprise-level system. One, software teams have legacy that they need to support, and there are certain things that may be tied to specific tools. Two, some of those software teams are coming in from acquisitions, whether we're purchasing a new company or a team is coming in and using that stack historically.

00:06:36

And they're not necessarily using the same tools that we've used historically. One other really important piece is that we want to stay up to date and modern as tools change and evolve; the DevOps space is incredibly dynamic now. So we need the ability to have a mix of tools in the system so that we can stay current. We also need to balance commercial off-the-shelf tools, open source tools, and internally developed tools, like some of the blue ones you'll see here, such as OneKit and OneBKC, which are Intel-developed tools that support our own DevOps operations as needed. Having that mix to find the best set of capabilities for our products, and getting these systems to work together, is really what we need in a tool chain.

00:07:26

So if we look, for example, at one particular pipeline: say one platform team needs to do their builds on Kubernetes, with some physical hardware as needed, and their source code is stored in GitLab. They're doing the build integration through Jenkins, and they need to run Klocwork and Black Duck Binary Analysis for security scans. Then they need to deliver their binaries out to Artifactory and report through some of our own Intel-internal systems. Similarly, another team might have a different mix of tools in their pipeline: they might need to use GitHub, along with Protex and TeamCity for their build system instead of Jenkins, and they need to report their test results through Splunk and create a report in Power BI. These mixes and matches of tool chains within the entire portfolio make it really challenging to ensure that we have a consistent, live, stable DevOps platform for our teams to deliver their software on, and the tool chain is constantly evolving, for good reasons. Feel free to ask questions on the platform as we go.
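As a rough illustration, those two pipelines could be written down as data, one declared tool per role in the chain; the tool names come from the talk, but the structure itself is a hypothetical sketch.

```python
# Illustrative only: two teams' tool chains expressed as data, one tool per role.
PIPELINES = {
    "platform_team_a": {
        "source_control": "GitLab",
        "build_orchestration": "Jenkins",
        "build_compute": "Kubernetes + physical hardware",
        "security_scans": ["Klocwork", "Black Duck Binary Analysis"],
        "binary_storage": "Artifactory",
        "reporting": "internal Intel systems",
    },
    "platform_team_b": {
        "source_control": "GitHub",
        "build_orchestration": "TeamCity",
        "security_scans": ["Protex"],
        "test_reporting": "Splunk",
        "dashboards": "Power BI",
    },
}

# A consistent platform has to keep every role in every team's chain healthy.
for team, chain in PIPELINES.items():
    print(team, "->", ", ".join(f"{role}: {tool}" for role, tool in chain.items()))
```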

00:08:48

So one of the things we did to help us handle and wrangle this diverse set of tools within the tool chain, and to support this broad ecosystem, as Madhu touched on, is we created a DevOps enterprise program, and we focused on three areas where we wanted to make sure we were consistent. The first area is systems engineering. We wanted to make sure that the data about the software going through our DevOps pipeline was consistent, so that we had a virtuous feedback loop: as things came into the system and were processed at the platform level, we could provide data back to the software teams, and the data that the software teams delivered to us was good. Pre-checks, which Madhu will talk about later, are a good example: this is where the data around those pre-check algorithms comes in, watching the various software components coming into the system.
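As a sketch of what such pre-check gates could look like in practice, here is a hypothetical chain of checks that an incoming ingredient drop would have to pass before being allowed into platform integration; the gate names and fields are illustrative, not the actual gate set.

```python
# Hypothetical sketch of pre-check quality gates run before an ingredient enters integration.
def builds_cleanly(ingredient): return ingredient.get("build_ok", False)
def passes_unit_tests(ingredient): return ingredient.get("unit_tests_passed", False)
def passes_security_scan(ingredient): return not ingredient.get("scan_findings", [])

PRE_CHECKS = [builds_cleanly, passes_unit_tests, passes_security_scan]


def run_pre_checks(ingredient: dict) -> list:
    """Return the names of the gates that failed; an empty list means the drop can integrate."""
    return [check.__name__ for check in PRE_CHECKS if not check(ingredient)]


drop = {"name": "wifi_driver 22.40", "build_ok": True, "unit_tests_passed": True, "scan_findings": []}
failures = run_pre_checks(drop)
print("integrate" if not failures else f"reject: {failures}")
```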

00:09:46

We also had a team focused specifically on where our source code was and how we could be consistent with our build automation. This team made sure that even if we had a diverse mix of source code systems out there, the source code was available through common access permissions, and that for building, when there were common behaviors that needed to happen regardless of what pipeline you had, there were tools and capabilities enabled for the teams to draw upon, so that we had at least some consistency regardless of whether we were using the same tools or not. And finally, the test and release team focused on standardizing and simplifying how we were testing and reporting test results between the different teams, as well as standardizing the release channels of our platform software out to our customers. This program was a key piece of making sure that we at Intel, with our big enterprise, could handle all the diversity and that explosion of software and SKUs that Madhu mentioned before.

00:10:59

So, a little bit more detail on the enterprise DevOps exploration. Knowing that we had a diverse set of tools in the tool chain, we wanted to set some foundational rules to make sure the tools worked together. One of the key areas, as we touched on with the build and source team, was focusing on inner sourcing: making sure that the source code was available, whether for debug purposes or just for sharing knowledge across the company. We needed to make sure that we had binary storage consistently available. We are a global enterprise, and some software that's built in Folsom, California may need to be tested in Bangalore, India. We needed to make sure that the build infrastructure, the compute capacity needed for this computation, was distributed worldwide, so that software teams, regardless of where they were around the world, had a consistent environment.

00:11:54

And we deployed a hybrid cloud, with Kubernetes on-prem and the ability to go off-prem as needed. Finally, knowing that we needed to re-examine our tool chain on a regular cadence to stay up to date with the best practices in the DevOps industry, we planned into our system that we would rotate and re-examine the tool chain on a three-to-five-year cadence. One of the side effects of that is we decided to build reusable libraries that abstract the tools away from the logic that we need to deliver software for our business. We called this the abstract build interface, and it helps us avoid vendor lock-in and allows us to survive in an environment where we have a diverse tool chain. Ultimately, the secret of DevOps at this enterprise was to remove the barriers for software engineers to deliver software and value to the customers.
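A minimal sketch of the abstract-build-interface idea, assuming a simple adapter layer; the class and method names here are hypothetical, not Intel's actual library.

```python
# Hypothetical sketch of an abstract build interface: the business logic depends on one
# interface, and thin adapters map it onto whichever CI tool a team actually uses.
from abc import ABC, abstractmethod


class BuildBackend(ABC):
    """What the business logic is allowed to depend on, regardless of vendor."""

    @abstractmethod
    def checkout(self, repo: str, revision: str) -> None: ...

    @abstractmethod
    def build(self, target: str) -> str: ...          # returns a path to the built artifact

    @abstractmethod
    def publish(self, artifact: str, channel: str) -> None: ...


class JenkinsBackend(BuildBackend):
    def checkout(self, repo, revision): print(f"jenkins: checkout {repo}@{revision}")
    def build(self, target): print(f"jenkins: build {target}"); return f"/artifacts/{target}.zip"
    def publish(self, artifact, channel): print(f"jenkins: publish {artifact} -> {channel}")


class TeamCityBackend(BuildBackend):
    def checkout(self, repo, revision): print(f"teamcity: checkout {repo}@{revision}")
    def build(self, target): print(f"teamcity: build {target}"); return f"/artifacts/{target}.zip"
    def publish(self, artifact, channel): print(f"teamcity: publish {artifact} -> {channel}")


def deliver(backend: BuildBackend, repo: str, revision: str, target: str) -> None:
    """The platform logic never names a specific vendor, so swapping tools later is cheap."""
    backend.checkout(repo, revision)
    artifact = backend.build(target)
    backend.publish(artifact, "platform-integration")


deliver(JenkinsBackend(), "graphics-driver", "abc123", "gfx-win64")
deliver(TeamCityBackend(), "audio-driver", "def456", "audio-win64")
```

Because deliver() only depends on BuildBackend, moving a team from one CI tool to another, or to a future tool, means writing one new adapter rather than rewriting the delivery logic.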

00:12:53

Thank you, Peter. We fully believe that what cannot be measured cannot be improved. There are several commonly used metrics in the DevOps community, like average build time, average number of build failures, the number of test regressions and nightly regressions, and reliability-related metrics like downtime. But in a large enterprise the tools are managed by several teams and, like Peter mentioned, a typical DevOps workflow is achieved through a combination of tools. Developer productivity is one of the important measures to consider when thinking about an enterprise DevOps system. When one of the pieces in that whole chain, or the stack of tools that you are offering as a DevOps tool chain, is down, it impacts the overall reliability of the solution. The developer who is waiting for that build to come out has to wait longer, and their objectives start getting delayed.
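One way to make that point concrete: if a workflow runs through several tools in series, the availability of the whole solution is roughly the product of the individual tools' availabilities (assuming independent failures), so the chain is less reliable than even its weakest link. The numbers below are made up for illustration.

```python
# Simplified illustration: solution-level availability of a serial tool chain.
# The availability figures are invented; the point is that the product drops quickly.
tool_availability = {
    "source_control": 0.999,
    "build_orchestration": 0.995,
    "binary_storage": 0.998,
    "test_reporting": 0.990,
}

solution_availability = 1.0
for tool, availability in tool_availability.items():
    solution_availability *= availability

print(f"solution availability ~ {solution_availability:.3f}")  # ~0.982, worse than any single tool
```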

00:13:57

So we propose that you have a solution-level objective, which is a measurable criterion for the whole solution. We also need to set boundary conditions for those objectives. A good example: we expect that a build should finish within the expected time, let's say five minutes, but a 5% variation is okay. We also need a clear guide for the engineers who are trying to root-cause an issue as quickly as possible; the team needs a guide to the escalation paths, so they know when to call for help or escalate. We also have something called solution-level mission interrupts, which are systemic failures in the system. These are the interrupts that disrupt the overall business workflow: when a solution-level mission interrupt happens, you are not able to deliver something significant. A typical failure should be classified as a mission interrupt only if it has a significant business impact. Conducting systematic retrospectives, root-causing, and making sure that we are continuously improving on the issues we find is extremely important, so that we can avoid recurring failures.
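A small sketch of how such a solution-level objective with a boundary condition (the five-minute build with 5% allowed variation mentioned above) and a mission-interrupt classification rule could be encoded; the thresholds and field names are illustrative assumptions, not the actual criteria.

```python
# Illustrative sketch: a solution-level objective with a tolerance band, plus a simple
# rule for flagging a failure as a mission interrupt when it has broad business impact.
from dataclasses import dataclass


@dataclass
class SolutionObjective:
    name: str
    target_minutes: float       # e.g. a build should finish in 5 minutes
    tolerance: float = 0.05     # a 5% variation is acceptable

    def is_met(self, observed_minutes: float) -> bool:
        return observed_minutes <= self.target_minutes * (1 + self.tolerance)


def is_mission_interrupt(failure: dict) -> bool:
    """Treat a failure as a mission interrupt only if it blocks the business workflow broadly."""
    return failure["blocks_delivery"] and failure["teams_impacted"] >= 3


build_slo = SolutionObjective("platform build time", target_minutes=5.0)
print(build_slo.is_met(5.2))    # True: within the 5% tolerance band
print(build_slo.is_met(6.0))    # False: follow the documented escalation path

outage = {"blocks_delivery": True, "teams_impacted": 12}
print(is_mission_interrupt(outage))   # True -> systematic retrospective and root cause
```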

00:15:18

The other important thing to keep in mind is that you want a way of finding faults as quickly as possible; the average time to find a fault in a service needs to come down over time. It is always a good idea to define certain quality gates within the pipeline. We call them pre-checks, and establishing those pre-checks allows seamless integration of one software team's deliverables with another software team's deliverables. In summary, an enterprise-wide DevOps tool chain can be quite complex and messy. You should fully expect that the tools will need to coexist with other solutions for a long time. Instead of tying your business workflow to individual tools, invest in standard interfaces, so that when new technologies come along it is easier to migrate to them. Lastly, think about solution-level objectives to hold the individual teams accountable. And the system needs to be tolerant of multiple destabilizing factors, because in an enterprise-wide system there are many things that can go wrong. Defining those tolerance levels for each of the teams to optimize their solutions will be beneficial in the long run. Feel free to reach out to us with any questions; we'll be more than happy to answer. And thank you for listening.

00:17:01

Thank you for listening.