Sep 10, 2021

What's the worst thing that could happen?

Going deep on decision-making...

I’m back, after a really awesome summer spent hanging out with my family, and I hope that you had a great summer too!

Our focus today is efficient decision-making, and the right questions to ask ourselves.

First, an Anecdote

My formative professional years were spent as a software developer, and I remember the releases more than anything.

For one release in particular, I was responsible for writing a large portion of the code. It’s late one afternoon, and I’m checking through the list of bugs. It’s device software, so we don’t have the benefit of rapid fixes, especially if we have to go through an approval cycle with one of the app marketplaces.

The bug list isn’t small, but it looks ok. I don’t see anything catastrophic, and we’ve invested a great deal into testing. The marketing is ready to go, including website updates. I’ve run a beta program for early adopters and the feedback has been helpful and positive.

So why do I feel like it’s all going to fall apart at the exact moment this release is announced via mass email tomorrow? Are we actually ready? Should we wait?

I’m actually terrified, when this should be the final step in something we should be really proud of.

Decisions and Consequences

Nothing tests decision-making more than the act of releasing software.

Does it work? Have we explained it well? Are we ready? Is the market ready?

I can think of releases that, for whatever reason, went out too quickly and caused major difficulty. Other releases were delayed so long, out of fear of poor quality or disagreement among decision-makers, that we lost the market advantage we could have gained.

The best way to characterize how I now think about these go/no-go decisions is this: if everyone who has a stake in the release (engineering, marketing, sales and more) is a little bit uncomfortable with the decision to go, then that’s perfect. Holding on until everyone is “happy” is waiting too long. Too many companies have no system around this other than a single “gatekeeper”, or a senior person who “makes the call”. Even worse, you might have a de facto “committee” that makes the decision.

It’s not about the decision itself. It’s about the system surrounding the decision, and about understanding the consequences of decisions and change.

Jeff Bezos puts it this way in his 2015 shareholder letter:

Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups.

As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.

What’s The Worst Thing That Could Happen?

To get to the heart of the systems you have in place now, ask the question: what is the worst thing that could plausibly happen if we do X?

This question is well worth thinking about, particularly for operational, execution and engineering teams. Our brains tend to be instinctively pessimistic and dramatic about the consequences of a choice, and the question forces us to confront the facts of the situation.

“It will end the company”… “I will lose my job” … “We’ll lose all of our users”

The key here is to get beyond everyone’s initial reaction to this question. Is the fear really justified? Play out the scenarios, and talk about what might happen and how you would respond.

Then, based on that context, you can determine if this is a reversible or irreversible decision and proceed accordingly.

Bringing this back to the system: the goal for the broader organization, and for you as a leader, is to architect a system whose own characteristics limit the consequences of a bad call. Within those boundaries, decision-making becomes better and easier for everyone.

What does a system like this look like?

DORA Metrics and Software

My favourite current example of a system that seeks to be resilient and minimize impacts is the system by which a company releases software.

The four metrics I encourage teams to use when judging how successful a company is at shipping software (and at DevOps) are the following:

Deployment Frequency (DF)
Mean Lead Time for Changes (MLT)
Mean Time to Recover (MTTR)
Change Failure Rate (CFR)

These metrics are the result of six years of surveys by a group at Google called the DevOps Research and Assessment (DORA) team, which distilled the indicators that best reflect the “performance” of an organization shipping software.

If requested, I can get into the weeds of these metrics in a later newsletter, but I want to highlight MTTR as one that is important to our discussion.

MTTR could also be phrased as “how fast can we recover if there is an outage or incident?” A team seeking to minimize this time is actively changing its system to be more resilient, and to constrain the worst-case scenario.
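
To make these four metrics a little more concrete, here is a minimal sketch in Python of how they might be computed from a log of deployments. Everything in it is an assumption for illustration: the `Deployment` record, its field names and the `dora_metrics` function are hypothetical and not part of any DORA tooling, and it simplifies things by treating an incident as starting the moment the failing deployment goes live.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four metrics over a reporting period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")

    failures = [d for d in deployments if d.failed]
    # Only failures that have already been resolved count towards MTTR.
    recovery_hours = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    lead_hours = [hours_between(d.committed_at, d.deployed_at) for d in deployments]

    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean(lead_hours),
        "mean_time_to_recover_hours": mean(recovery_hours) if recovery_hours else 0.0,
        "change_failure_rate": len(failures) / len(deployments),
    }


# A made-up week of four deployments, two of which caused an incident.
start = datetime(2021, 9, 1, 9, 0)
history = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, hours=8),
               failed=True, restored_at=start + timedelta(days=1, hours=9)),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=6)),
    Deployment(start + timedelta(days=5), start + timedelta(days=5, hours=2),
               failed=True, restored_at=start + timedelta(days=5, hours=5)),
]
metrics = dora_metrics(history, period_days=7)
print(metrics)
```

With this invented data, the mean lead time works out to 5 hours, MTTR to 2 hours, and the change failure rate to 0.5. The point is less the arithmetic than the fact that the worst case becomes a number the team can watch and push down.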

Working in a team that is actively tracking these metrics gives a sense of safety for decisions! Obviously, we are seeking to minimize the number of failures in the system and the number of “bad decisions”, but there is freedom in knowing that the worst-case scenario is limited and known.

This thinking can be applied to many other systems beyond software engineering. Think about the system that you work in - does anyone think about how you recover from a bad decision and what it looks like, and is that something that you track?

Wrapping it up

It’s not about the decisions per se. It’s really about the system that those decisions play into, and how much you know about your environment.

If every decision is perceived to be irreversible (aka Bezos Type 1), then it is highly likely that your organization is mired in slow, ponderous decisions that are driving everyone crazy.

If you can’t identify those important, irreversible decisions, you’re going to walk into a wall without even knowing it was there.

Think about your systems and how you understand them, and you will help everyone around you make decisions.
