“Quality in a service or product is not what you put into it. It is what the client or customer gets out of it.” — Peter Drucker
Over the past several years of my engineering career, I have worked at companies big and small, from startups to established public companies, and I have realized how critical it is to have discipline around operational health. No matter how perfect the service design is, its surface area will expand over time: you will add features to it, and customer usage patterns will evolve. A few months or years down the line, it is not going to be the same service you started with. Ultimately, we build these services to meet a business goal or customer requirement, and customers expect a service that is available, reliable, and performant. This is a two-part series. In the first part, we talk about how to think about operational health maturity; in the second part, we will cover how you can build it through a process called operational reviews.
In this post, before we talk about what an Operational Health Maturity Model can look like, there are a couple of points I would like us to reflect on.
- The real world is so much more complex than a Maturity Model. The Maturity Model is merely a guideline, and you may not perfectly fit into one level or another. Use the concepts in the maturity model more as a guide.
- The Maturity Model has multiple levels of maturity. No level is good or bad in itself; it all depends on the business context. For example, the level of maturity required for an authentication product is going to be much higher than the level required for a service that is not in the critical path of the customer experience. Likewise, the level of maturity required for a large-scale enterprise company is going to be different from that of a seed-stage proof-of-concept product.
- As you go through the sections below, I would encourage you to reflect on:
— What level of maturity are you at today as an organization or team?
— What level of maturity do you want to be at one year from now as an organization or team?
— Which parts of the model stand out to you as the biggest pain points, the ones you would want to invest in over the next year?
Operational Health Maturity Model
The model has 3 levels:
- Level 1 — Reactive
- Level 2 — Adaptive
- Level 3 — Proactive
And has 5 components:
- Availability and Incidents
- Data and Metrics
- People and Processes
- Product and Architecture
- Engineering Efficiency
Let's examine each of these components:
Availability and Incidents:
The Availability and Incidents component is about how you detect and mitigate your incidents, and how long these incidents end up lasting. In Level 1 (Reactive) systems, incident detection is more often manual than automated. In Level 3 (Proactive) systems, you run controlled experiments on various failure scenarios, such as disaster recovery and password rotation. You are intentionally investing and preparing, so that when these scenarios do happen you have playbooks, scripts, and so on at hand to recover more gracefully or minimize the impact on your customers.
Data and Metrics:
The Data and Metrics component is about:
- How data is captured within your organization
- How consistently data is referenced within your organization
- At what level of granularity that data is captured (e.g., simple throughput metrics vs. granular metrics related to dependencies, cache miss rate, etc.; or simple incident time vs. TTD/TTM/TTR)
- How well you know how to use the data you have
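To make the granularity point concrete, here is a minimal sketch of deriving per-incident timings from raw timestamps instead of tracking a single "incident time". I am assuming TTD, TTM, and TTR stand for time to detect, time to mitigate, and time to resolve, all measured from the start of customer impact; teams define these differently. The `Incident` class, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical fields; map them to whatever your incident tracker records.
    started_at: datetime    # when customer impact began
    detected_at: datetime   # when an alert fired or a person noticed
    mitigated_at: datetime  # when customer impact stopped
    resolved_at: datetime   # when the underlying cause was fixed

    @property
    def ttd(self) -> timedelta:
        # Time to detect (assumed definition: impact start to detection).
        return self.detected_at - self.started_at

    @property
    def ttm(self) -> timedelta:
        # Time to mitigate (assumed definition: impact start to mitigation).
        return self.mitigated_at - self.started_at

    @property
    def ttr(self) -> timedelta:
        # Time to resolve (assumed definition: impact start to resolution).
        return self.resolved_at - self.started_at


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Made-up sample incidents, for illustration only.
incidents = [
    Incident(
        started_at=datetime(2024, 1, 10, 10, 0),
        detected_at=datetime(2024, 1, 10, 10, 12),
        mitigated_at=datetime(2024, 1, 10, 10, 45),
        resolved_at=datetime(2024, 1, 10, 13, 0),
    ),
    Incident(
        started_at=datetime(2024, 2, 1, 22, 0),
        detected_at=datetime(2024, 2, 1, 22, 4),
        mitigated_at=datetime(2024, 2, 1, 22, 19),
        resolved_at=datetime(2024, 2, 1, 23, 0),
    ),
]

mean_ttd = mean_duration([i.ttd for i in incidents])
mean_ttm = mean_duration([i.ttm for i in incidents])
mean_ttr = mean_duration([i.ttr for i in incidents])
print(f"mean TTD={mean_ttd}, mean TTM={mean_ttm}, mean TTR={mean_ttr}")
```

Splitting the timeline this way shows whether the time is going into noticing the problem, stopping the customer impact, or fixing the root cause, which a single duration number hides.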
People and Processes:
The People and Processes component is about how well the folks in your organization understand their specific roles and responsibilities when it comes to operational health and its related processes. For example: from the moment an incident is detected, knowing the right teams and engineers to engage; after the incident, having clear ownership of the incident review; and having well-defined expectations for what to do with the learnings and action items from the review, and by when to resolve them.
Product and Architecture:
As an industry, we are trying to shift left many practices that have traditionally been more of an afterthought, such as security. Similarly, with operational health, one can build an intentional muscle of shifting left and making it part of how we build the service. For example, being intentional about your resiliency modes and about the data and metrics you need to capture when writing the Architecture Decision Record (ADR) is one way to shift left. How far along you are in this shift-left journey indicates your level of maturity.
Engineering Efficiency:
The Engineering Efficiency component is more of a gauge of which way the pendulum is swinging on unplanned work. Unplanned work doesn’t necessarily follow a consistent cadence; it may follow a peaks-and-valleys pattern, so I would recommend looking at it in aggregate over a longer period of time. Since Level 3 organizations are intentionally planning activities like disaster recovery and data instrumentation, their unplanned work is generally on the lower end. In a Level 1 organization, by contrast, you are reacting and adding things as you go, which leads to longer stretches spent on unplanned work such as incident repair items or cleanup work.
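As a rough illustration of why the aggregate view matters, the sketch below compares the per-sprint share of unplanned work with the share over the whole period. The `SprintLog` structure, the engineer-day unit, and the numbers are all hypothetical; use whatever unit your team already tracks.

```python
from dataclasses import dataclass


@dataclass
class SprintLog:
    # Hypothetical record of effort per sprint, in engineer-days.
    name: str
    planned_days: float
    unplanned_days: float

    @property
    def unplanned_share(self) -> float:
        total = self.planned_days + self.unplanned_days
        return self.unplanned_days / total if total else 0.0


def aggregate_unplanned_share(logs: list[SprintLog]) -> float:
    """Share of unplanned work across all sprints combined."""
    total = sum(log.planned_days + log.unplanned_days for log in logs)
    unplanned = sum(log.unplanned_days for log in logs)
    return unplanned / total if total else 0.0


# Made-up data with two incident-heavy sprints (the "peaks").
logs = [
    SprintLog("S1", planned_days=45, unplanned_days=5),
    SprintLog("S2", planned_days=20, unplanned_days=30),
    SprintLog("S3", planned_days=40, unplanned_days=10),
    SprintLog("S4", planned_days=46, unplanned_days=4),
    SprintLog("S5", planned_days=25, unplanned_days=25),
    SprintLog("S6", planned_days=44, unplanned_days=6),
]

for log in logs:
    print(f"{log.name}: {log.unplanned_share:.0%} unplanned")

overall = aggregate_unplanned_share(logs)
print(f"Across all sprints: {overall:.0%} unplanned")
```

Any single sprint here could suggest either a healthy team or one that is constantly firefighting; the combined number over the longer window is the one I would compare from period to period.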
Putting the Big Picture Together
Earlier in this post, I encouraged you to reflect on the following questions:
- What level of maturity are you at today as an organization or team?
- What level of maturity do you want to be at one year from now as an organization or team?
- Which parts of the model stand out to you as the biggest pain points, the ones you would want to invest in over the next year?
When I ask people “What level of maturity do you want to be at one year from now as an organization or team?”, most respond by default that they wish to be at Level 3 (Proactive). I want to caution that not every team or organization has to be at Level 3 for its product to be successful. Instead, I would encourage you to think about which components make sense at Level 2 versus Level 3, or even Level 1, and to choose intentionally where you want to invest. For example, if you are at Level 1 in People and Processes and Level 2 in Data and Metrics, then instead of aiming to get to Level 3 in Data and Metrics, consider whether it is better to invest in getting People and Processes to Level 2 first.
I would love for you to share in the comments what your reflections were and which areas you and your teams wish you could invest more time and effort in.
PS: Thanks to John Mogensen for feedback on the Maturity Model.