Out of the Cyber Crisis - What Would Deming Do?
In 1982 Dr. W. Edwards Deming published a book called "Out of the Crisis," born of his frustration over the prior decade. He was 82 years old; he wasn't writing this as a get-rich management consulting book. Dr. Deming wrote it as a stark warning for everyone: manufacturing, healthcare, government, and education alike.
Fast forward forty years, and imagine what he would say today if he were alive. My guess is he would be intensely interested in cybersecurity. Not just information technology, but cyber across all aspects of human infrastructure, down to the base of Maslow's hierarchy of needs: water, warmth, and food.
Over the past ten years, we have seen cyberattacks on water treatment plants, power grids, oil and gas, and food supply chains. The latest is a cream cheese manufacturer. I have studied Dr. Deming for over a decade, and I believe there is an exciting story to be told around his 14 Points and System of Profound Knowledge. These concepts are also the fundamental ideas behind Lean, Agile, DevOps, and DevSecOps.
John Willis
Distinguished Researcher, Kosli
Full transcript
The complete talk, organized by section.
John Willis
Hi, I'm John Willis. This presentation is called "What Would Deming Do in a Cyber World?" or sort of "Out of the Cyber Crisis." I'm known as Botchagalupe to most people. Most of the images in this presentation were generated by OpenAI, something called DALL-E, so pretty cool stuff.
I've done a lot of things, so we won't waste a whole lot of time going through my background. I'm probably best known for co-authoring The DevOps Handbook with Gene Kim, Jez Humble, and Patrick Debois. I also did a great book called Beyond the Phoenix Project with Gene. The green book in the middle comes out of a lot of work I've done on automated governance, which led us to this book that came out in September this year, Investments Unlimited. It's a great story, and I'll talk a little bit about that. This presentation is really based around a book I've been working on for 10 years, and with Gene's encouragement I think I finally have it done. The fifth draft will be done, hopefully ready for publishing early next year. I've worked for a lot of companies. I just left Red Hat. I'm now working for Kosli, a company that focuses on DevOps automated governance.
I wanted to talk briefly on the Investments Unlimited book. It's really exciting. It was written with nine authors who have a lot of industry expertise. It's a novel about DevOps, security, audit, and compliance in the digital age. I think you'll really enjoy it. I've done a fair amount of presentations on that topic, but not today.
Even if you haven't heard of Dr. Deming, you have probably heard his words. I think it's mandatory that all DevOps presentations have at least one Deming quote. I'm obviously kidding, but not so much. Here are a couple: "Learning is not compulsory. Neither is survival." "In God we trust; all others must bring data." My personal favorite: "A bad system will beat a good person every time." And for baseball fans, sort of like Yogi Berra, "Every system is perfectly designed to get the results that it does."
Deming's life has some really interesting things in it. He started out as a mathematical physicist around 1923, learning physics during the second scientific revolution, quantum physics. That has a lot to do with the way he thought about things differently. He ultimately became a management consultant, but what made him unique was that he was a boundary spanner. He had an early interest in epistemology, particularly pragmatism, an American philosophy, which helped him think about knowledge. His weapon of choice was statistics, and specifically analytical statistics. He was a professor at prestigious places like NYU, Columbia, and Washington University. He was also very interested in psychology, how people are motivated. He was probably one of the earliest systems thinkers. In my book I cover his work in government during World War II, in dark projects with Norbert Wiener and some of the original cybernetics work. He is best known as an industrialist, probably because he was sent to Japan and helped create what they called the miracle in Japan. For me, even though he died in 1993, he's probably one of the greatest DevOps people. If you read Deming's 14 points, which are part of this presentation, you see he was a futurist, because the things he laid down in his 70-year career are really relevant, and probably more relevant today when we talk about cyber.
To set the stage for where we're at: in 2011 Marc Andreessen, founder of Netscape and now a famous VC at Andreessen Horowitz, said that software is eating the world. I think you'd be hard-pressed to disagree with that. Our good friend Josh Corman, part of our DevOps tribe, says software is infecting the world. It's not a negative; it comes part and parcel with Andreessen's statement. I would say the problem with software is now worse than exposing credit cards or Equifax credit records. That's not good, but now it's messing with our basic hierarchy of needs. I would say it's screwing with Maslow: our health, food, water, and shelter.
A point I make in my book is that Deming had this idea of the aim, creating a purpose. If we think about healthcare, research says that 5,600 hospitals in the U.S. have zero cybersecurity people: no CISO, no VP, not even a director of security. That's pretty scary. Even more scary are devices affecting our lives: pacemakers, with an uncounted number out there, many still with three-character hardcoded passwords; infusion pumps and Bluetooth; even Dick Cheney had Bluetooth turned off. People can die. It has been proven hackers can hack into these devices. In hackathons, hospital infusion pumps have been shown to be changeable from what should be a dose over 30 minutes into a one-minute dose that could kill somebody.
A lot of radiation oncology systems moved to the cloud because they believed it was safer. There have been ransomware cases where they shut down the whole radiation oncology system. That means for chemotherapy they have to shut down chemo for two weeks and people's lives are rescheduled. Scary stuff.
Then there is the ransomware revolution: Schreiber Foods, where at the end of 2021 you literally couldn't get cream cheese on the East Coast. For me, lox and bagels without cream cheese is a tragedy, but all kidding aside, nobody dies from not getting cream cheese. The point is that one ransomware incident at Schreiber Foods basically shut down the cream cheese supply on the East Coast during the worst time, the holiday season. Most people know about the JBS meatpacking ransomware; it may not have been as well known because they paid the ransom immediately. Hackers tried to poison drinking water at the Tampa Bay water treatment plant about a week before the Super Bowl in Tampa, less than 10 miles away. They were trying to increase sodium hydroxide, lye, to lethal levels. Fortunately an admin caught them using remote viewer software. There was the Colonial Pipeline breach. Earlier this year there was the first ransomware-related litigated death: a woman came into a hospital where ransomware had shut down radiology. They couldn't do X-rays, and they didn't inform the woman. She had a complicated birth. Had they done X-rays, they probably would have sent her to a different hospital. The baby died six months later, and it's the first litigated death.
What does Deming have to do with all this? We have to look at what Deming is all about. Deming's whole life was picking up what he called profound knowledge, the System of Profound Knowledge that he finally documented in his final book, The New Economics, published in 1993, the year he died. He considered these four elements as a lens for understanding complexity, and he said you had to apply all four to understand problems, opportunities, and improvements. They are theory of knowledge: how do we know what we know, how do we really know what we believe we know, the scientific method; theory of variation: how do we understand what we see, a form of measurement but more importantly a form of understanding; psychology: how are people motivated, including cognitive biases and intrinsic motivation; and appreciation for a system, or systems thinking, which ties the other three together.
Deming's weapon of choice was statistics and analytical statistics. In The Essential Deming, he says the job of the statistician is to work with experts in the subject matter to help them solve their problems. The responsibility of the substantive expert is to decide, with or without help, what problems are important and what statistical information might be helpful. The statistician provides the data, and the subject matter expert uses it to discover the opportunity for improvement.
Let's take the System of Profound Knowledge for IT risk. Look at three types of controls you might see in a software supply chain or a pipeline: container scanning, where we're doing image scanning and looking for vulnerabilities in a container image; unit test coverage, where certain applications require a certain level, maybe application A requires 70% and application B requires 75%; and information leakage.
For container scanning, one of the things Deming talked a lot about is misunderstanding the difference between enumerated statistics and analytical statistics. I contend that in IT we over-rotate on enumerated. Enumerated is the "how many"; analytical is the "why." I took a 17-week average of container-scan failures where the control was gated. On the left is a distribution chart, a classic enumeration of how many. It gives some interesting count information. On the right is a control chart, statistical process control, a more analytical approach. We are looking for normal distribution, or what Deming would call common-cause variation. We're looking for randomness around the mean, because there is variation in all processes.
If we take all the data for 25 weeks, the enumerated side is mildly interesting because we might see some outliers, but the same data in the control chart shows that up to week 17 there is randomness, and then suddenly starting around week 17 or 18 there is a pattern. We don't want to see patterns in analytical statistics. The errors are increasing. The statistician is not solving the problem; the statistician is saying, "Here is your data." The subject matter expert can then look. In this case, there was a new development team added, and they weren't following the rules. They were going to Docker Hub and using their own repositories for container images. There was a big increase in vulnerabilities found. The point is that it took analytical statistics to tell us where to look for the why.
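A rough sketch of the control-chart idea described above, written as an individuals (XmR) chart in Python. The weekly counts are invented for illustration and are not the data from the talk; the 17-week baseline, the function names, and the choice of two detection rules (a point outside the limits, or eight consecutive points on one side of the centerline) are assumptions. The constant 2.66 is the conventional XmR multiplier.

```python
from statistics import mean

# Hypothetical weekly counts of gated container-scan failures (25 weeks).
# Invented numbers for illustration only; not the data shown in the talk.
weekly_failures = [10, 12, 9, 11, 10, 13, 8, 11, 10, 12, 9, 10, 11, 9, 12, 10, 11,
                   13, 14, 15, 17, 18, 20, 21, 23]
BASELINE_WEEKS = 17  # assumption: the first 17 weeks define "normal" variation


def xmr_limits(baseline):
    """Return (lower, center, upper) limits for an individuals (XmR) chart."""
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    avg_mr = mean(moving_ranges)
    # 2.66 is the conventional XmR constant (3 / 1.128)
    return center - 2.66 * avg_mr, center, center + 2.66 * avg_mr


def signals(values, lower, center, upper, run_length=8):
    """Return 1-based week numbers that look like special-cause variation."""
    flagged = set()
    run, side = 0, 0
    for week, value in enumerate(values, start=1):
        # Rule 1: a single point outside the limits
        if value > upper or value < lower:
            flagged.add(week)
        # Rule 2: a long run of points on one side of the centerline
        current = 1 if value > center else -1 if value < center else 0
        if current != 0 and current == side:
            run += 1
        else:
            run = 1 if current != 0 else 0
            side = current
        if run >= run_length:
            flagged.add(week)
    return sorted(flagged)


lower, center, upper = xmr_limits(weekly_failures[:BASELINE_WEEKS])
flagged_weeks = signals(weekly_failures, lower, center, upper)
print(f"centerline {center:.1f}, limits {lower:.1f} to {upper:.1f}")
print("weeks to investigate:", flagged_weeks)
```

The chart only says where to look. Finding the why, such as a new team pulling images from unapproved repositories, is still the subject matter expert's job.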
For unit tests, there is another example that has more to do with Deming's core message. You've probably heard plan-do-study-act. That is a form of theory of knowledge; in fact, it is scientific method. Mike Rother calls it Toyota Kata. We were taking the weekly percentage of unit test coverage for all applications. The mean was 50%, not great. We wanted to increase unit test coverage. Suppose a vendor comes in and says the TDD training you have is probably not increasing efficiency as much as you want, so try our approach. Instead of just buying the software and assuming everything worked better, we use analytical statistics.
We run another 25 weeks, and around week 16 or 17 we do our plan-do-study-act. Maybe we take an improvement sprint: two weeks where we try out this new TDD package, and then we study the results. It looks like we may have increased percentage coverage because we see a trend starting around week 21 to 25. In the act stage we're not quite sure, so we don't just tell everybody we've done this small test. We onboard a couple more teams, like a canary. We find out that onboarding this new testing approach moved the mean from 50 to 64. We can keep rolling it out to more teams. If it didn't work, we'd know from the data and try another experiment. It's a form of experimentation.
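The study and act steps above could be sketched roughly as follows. This is a hypothetical decision helper, not a method from the talk: the coverage numbers are invented (chosen so the baseline mean is 50 and the wider rollout mean is 64, matching the story), and the `min_points` threshold of eight weeks is an assumption about how much evidence to require before rolling out further.

```python
from statistics import mean

# Hypothetical weekly unit-test coverage percentages; invented for illustration.
baseline = [50, 48, 52, 49, 51, 50, 47, 53, 50, 49, 52, 51, 48, 50, 51, 49]
first_trial = [58, 60, 61, 63, 62]                # one team, improvement sprint
wider_rollout = [62, 63, 65, 64, 66, 63, 64, 65]  # after onboarding more teams


def study(baseline, trial, min_points=8):
    """Study step of plan-do-study-act: compare trial weeks to the baseline.

    Returns (decision, shift), where shift is the change in mean coverage.
    min_points is an assumed threshold for "enough data to act on".
    """
    center = mean(baseline)
    shift = mean(trial) - center
    consistently_better = all(value > center for value in trial)
    if shift <= 0 or not consistently_better:
        return "abandon", shift  # no consistent gain: try another experiment
    if len(trial) < min_points:
        return "expand", shift   # promising but thin: onboard more teams
    return "adopt", shift        # sustained gain: keep rolling out


print(study(baseline, first_trial))
print(study(baseline, wider_rollout))
```

The middle outcome is the canary step: a promising small test earns a wider trial, not an announcement.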
These analytical-statistics techniques are a hundred years old. They started with Dr. Shewhart at Bell Labs and at Hawthorne. They've been used to make toasters, cars, and run nuclear power plants, but we don't really use them in IT. There are opportunities to use them in modern governance practices, pragmatic SRE practices, DevOps data lakes, DORA data, and adaptive skills liquidity. The unit-test example is also a way of figuring out that status quo was not good enough, even though the process was in control and random. We still wanted to improve. You can think of that as a skill-set question: maybe we need more training, or maybe that team is not the right team for the job.
Look at Knight Capital under the lens of profound knowledge. In 2012 they put in a new high-frequency trading program for the NYSE called the Retail Liquidity Program. New code was supposed to go to an eight-node cluster. A sysadmin manually deployed it and only hit seven of the eight nodes. On the old eighth node there was old code used for testing and stress testing called Power Peg. It was designed to buy high and sell low and was never supposed to run in production, but through a mismatch of command flags it got turned on. Knight Capital normally managed an average of 3.3 billion trades a day, worth over $21 billion. In this case, the buy-high-sell-low Power Peg traded over $21 billion of bad trades. It was a $444 million loss in less than 45 minutes. They were literally out of business in 45 minutes.
The SEC sent a cease-and-desist order to Knight Capital. Under the lens of the four elements: for knowledge, the SEC said a second technician should review the code installation; for variation, they cited procedures sufficient to ensure an orderly installation of new code and prevent activation of code no longer intended for use; for psychology, they asked whether there were reasonably designed guides for employees' responses to significant technology and compliance incidents; for systems, they said there wasn't an adequate written description of risk management controls. Did they really know what would happen? Did they have pairing, a pull-request review, baselining, or anomaly detection? Was there psychological safety for someone to raise a hand and say maybe we ought to pause this delivery? It was not too dissimilar from Columbia and the shuttle, where there was back pressure to deliver on a schedule and people did not feel empowered to pull the Andon cord.
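As a small illustration of the baselining idea, here is a sketch of a pre-activation check that refuses to enable a release unless every node in the cluster reports the expected build. The node names, version strings, and function names are hypothetical; nothing here describes Knight Capital's actual tooling. How versions are collected from each node is outside the sketch.

```python
# Hypothetical inventory of builds reported by each node in an eight-node cluster.
# Names and version strings are invented for illustration.
reported_versions = {
    "node1": "2.0", "node2": "2.0", "node3": "2.0", "node4": "2.0",
    "node5": "2.0", "node6": "2.0", "node7": "2.0", "node8": "1.3",
}


def stale_nodes(node_versions, expected):
    """Return the nodes that are not running the expected build."""
    return sorted(node for node, version in node_versions.items()
                  if version != expected)


def safe_to_enable(node_versions, expected):
    """Gate activation on every node matching the expected build."""
    stale = stale_nodes(node_versions, expected)
    return len(stale) == 0, stale


ok, stale = safe_to_enable(reported_versions, expected="2.0")
if not ok:
    print("Do not enable the release; stale nodes:", stale)
```

A check like this takes the "did we hit all eight nodes?" question away from one person's memory and gives it to the system.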
If you need to be convinced that Deming is relevant to DevOps, DevSecOps, and cyber, you don't have to look much further than his 14 points. I won't go through all 14 in detail, though I do in the book with examples from modern technology, modern infrastructure, DevSecOps, and cyber. Take create constancy of purpose. Schreiber Foods is interesting: Mark Schwartz's A Seat at the Table asks whether a company is serious about information technology if the CIO doesn't have a direct seat at the executive table. In the case of Schreiber Foods, they didn't have a CISO, a director of security, or a VP; even after the breach, periodically when I check they still don't have more than a hardware director. That's like the 5,600 hospitals: you're not taking it seriously.
Never stop improving. Shannon Lietz has been an incredible mentor for me. She does adversary analysis. There is always a never-ending cat-and-mouse game with adversaries. Reactive policies scan and look for vulnerabilities, and that's table stakes. What she does is adversary analysis: how often do they come, how long do they stay? She uses a theory-of-knowledge approach, experiments on things, and then sees what the data says. Did adversaries stay less after she made this change or added that thing?
Don't manage, lead. This is where Deming would be upset with how we manage incidents. We categorize incidents as P1s and P2s, and the truth is we do it based on napkin math. Was it really critical because it was a critical service? Did it really start at 9:05, or is that when somebody wrote it down? That's a classic case of enumerated statistics. John Allspaw says incidents are unplanned investments. Most organizations don't have enough bandwidth to process all the P1s; they ignore P2s and P3s, which are a wealth of opportunity. Deming would say stop using enumerated statistics for this data. Use analytical statistics. Stop arbitrarily creating abstraction layers for P1 versus P2, and let the patterns show up in analytical statistics.
Break down silos. Equifax is the famous example. It was Conway's Law in action: the CISO reported to the chief legal officer. Under testimony, when asked whether it was odd that the CISO reported to the chief legal officer, the response was, "I figured they knew what they were doing." When asked why the breach wasn't reported to the CIO, the response was, "I didn't think of it." Of course they didn't think of it, because the CISO reported to the chief legal officer. The organizational chart mandated how they would react to that breach.
Deming hated quotas, MBOs, MBRs, and he would have hated OKRs. Deming was a proponent of method. He cared about results, but figured the only way to get results in a repeatable fashion is to understand the method it took. He would ask, "By what method?" There must be a method to achieve the aim. Lloyd Nelson, a student of Deming, said that if you can accomplish a goal without a method, why didn't you do it last year? I would say there is no guarantee you'll do it next year. Goals for goals' sake, without understanding how to achieve them, are not enough. We want to teach people how to achieve. Some people refer to this as management by means: the means or method by which you get there.
When we set up goal mode or results mode, we force people into one of several outcomes. In my travels doing qualitative analysis at big organizations, I see three. People lie: it's the watermelon, red in the middle and green on the outside. Or people tell the truth, and their status to leadership is always yellow and red. Or they find workarounds. I recently talked to somebody whose team figured out how a risk scorecard tool scored them, then fed it exactly what it wanted to hear, so they were always A's and B-pluses.
Deming would not hate everything we do today. I think Deming would have loved SRE. SRE done right maps beautifully to profound knowledge. An SLA is a systems-thinking approach. An SLO is our measurement, our theory of variation. SLIs are our theory of knowledge, our indicators. The psychology is that, if SRE is done right, this is decoupled from the person or the team. It is the system that generates the performance indicators. The SLO encapsulates that, and the SLA encapsulates the system. We remove MBOs and OKRs from teams and people and let the system do the work. We take out the human factor of people having to game the system, work around the system, or decide whether to tell the truth or lie.
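To make the SLI and SLO relationship concrete, here is a minimal sketch of an availability SLI and the remaining error budget against an SLO target. The request counts and the 99.9% target are invented, and the function names are illustrative rather than taken from any SRE tool.

```python
def sli(good_events, total_events):
    """Service level indicator: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 999,500 good, SLO target of 99.9%.
good, total, target = 999_500, 1_000_000, 0.999
print(f"SLI: {sli(good, total):.4%}")
print(f"error budget remaining: {error_budget_remaining(good, total, target):.0%}")
```

The numbers come from the system's own events, not from anyone's status report, which is the decoupling from people and teams described above.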
KLOC is another form of forcing people to work around. If you measure thousands of lines of code, I may not clean up my code because I don't want to affect my KLOC. I may not want to write new code. There are a lot of things like that.
I have a book coming out early next year. I may rename it Why Deming Matters; right now the prototype name is Profound. If you grab the barcode, the first 200 people who sign up for the waiting list don't have to pay now, but when the book is available I will send a signed special copy to the first 200. I know I went fast, and there was a lot of information here. I'm John Willis. Most of my blogs and everything are at profound-deming.com. You can also go to kosli.com and see what I'm up to there. I'm Botchagalupe on Twitter, that's my LinkedIn, and john.willis@kosli.com. Thank you very much.