How We're Transforming the Practice of Learning From Incidents in a 12,000-Person Organization
The IBM CIO organization is making significant progress along our journey to learn more from our incidents. Where previously a small number of our leaders would discuss a handful of "major incident RCAs," focusing almost exclusively on what broke and how we’re fixing it, we are now having broad and open discussions that are improving everyone's understanding of our systems, the expertise that keeps them running normally, and the challenges that can overwhelm that expertise.

A centerpiece of the IBM CIO Learning from Incidents Program is a monthly discussion in which the CIO senior leadership team, all technical leaders in the organization, and many others gather to review and discuss the story of a recent incident. Hundreds of CIO team members have participated in these meetings or viewed the recordings, and the meetings have led to several significant outcomes. They are also inspiring CIO teams to improve their own practice of learning from incidents.

From this presentation, the audience will understand:
- Why the IBM CIO organization is adopting a resilience engineering approach to learning from incidents
- How this approach differs from the traditional RCA and Problem Management practices
- What the IBM CIO organization has been doing to improve our ability to learn from incidents
- What themes and outcomes we've observed
- How we got here, and how other organizations can join us on this journey
David Leigh
IBM Distinguished Engineer, IBM