Thursday, August 19, 2010

Change Management Metrics

What are some good Change Management metrics, and what kind of stats should we be keeping about the changes that go on in our organizations? Now I understand that most of my brothers and sisters in higher ed IT don't have any change management process to speak of, so this question may be moot. But this question also extends outside of higher ed and affects anyone who has a change management process in place. I am going to make two bold statements right now without any evidence to back them up.
1. You don't know how effective something is unless you can measure it.
2. Most (more than 50%) of organizations with change management processes are not attempting to measure their effectiveness.

So what are some of the metrics we could be using to measure the effectiveness of our change management processes? One metric that I have introduced to our change tickets is dysfunction. For each change that comes through I look for whether or not the change is dysfunctional and in what way. Here are the classifications of dysfunctional change that I use.

1. Not dysfunctional. We would like for almost all of our changes to have this classification.

2. Acceptable dysfunction (legitimate emergency change). We know that sometimes emergencies come up and while we want to keep them few and far between, we have to accept that they happen.

3. Non-emergency reported after the fact. This actually comprises about a third of the changes that I see in my organization. In this case an administrator makes a change to some system and doesn't bother telling anyone. Then at the next CM meeting he says "Oh yeah, I did this other thing last week. Make sure that gets into the bulletin."

4. Zero or short notice. After all, you can't really plan for a change if it's going to happen tomorrow. Other people definitely can't prepare for your change if you're only telling them now about something you're going to do in five minutes. This might make up about a third of the changes that I see in my organization.

5. Overly wide date range. This is probably another 25% of the changes that I see. An administrator says that he has to update the VMware Tools software on our virtual servers and he's going to do that sometime next week. If this change causes a negative interaction we won't be able to pinpoint it right away, since we only have a vague idea of when it is going to happen at all.

6. Inaccurate description of change. I'm going to reboot the firewall. And by reboot the firewall I mean update the firmware and clean up the rule set a bit.

There are some others that I haven't added but probably should be classes of dysfunctional change, such as the hidden change, where an administrator changes a service and never tells anyone until something goes wrong and then owns up to it. It is similar to an after-the-fact change but more severe, since someone else had to discover the problem first. Another one worth keeping track of is the change that was done late. The whole point of putting a change on the calendar is so that everyone will know why things went to hell. So if you make your change two days later than scheduled, it is the same as making the change and not telling anyone: nobody will suspect that your change is to blame when we all think it went off without a hitch two days ago.
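If I were to model this in a script, the classifications might look something like the following sketch (Python, with made-up field names; my actual tickets live in a web form, not code):

    from dataclasses import dataclass
    from datetime import date
    from enum import Enum

    class Dysfunction(Enum):
        NONE = "not dysfunctional"
        ACCEPTABLE_EMERGENCY = "legitimate emergency change"
        AFTER_THE_FACT = "non-emergency reported after the fact"
        SHORT_NOTICE = "zero or short notice"
        WIDE_DATE_RANGE = "overly wide date range"
        INACCURATE_DESCRIPTION = "inaccurate description of change"
        HIDDEN = "hidden change discovered by someone else"
        LATE = "change performed later than scheduled"

    @dataclass
    class ChangeRecord:
        ticket_id: str                    # identifier from the change ticket
        admin: str                        # who is making the change
        description: str                  # what they say they are going to do
        scheduled: date                   # when they say it will happen
        dysfunction: Dysfunction = Dysfunction.NONE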

Why track these levels of dysfunction? Because I think you can use the proportion of dysfunctional changes to total changes as a way of comparing a CM process's effectiveness from one quarter to the next, or from one organization to the next. This is also a way of measuring the overall maturity of an organization's CM process and possibly even the IT department as a whole. If one organization has 10% dysfunctional changes and another has 80% dysfunctional changes, then I think we could agree that the first organization has a more effective CM process.
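To make that comparison concrete, the number I care about is just the share of changes tagged with anything other than "not dysfunctional." A rough sketch, reusing the hypothetical ChangeRecord above:

    def dysfunction_ratio(changes):
        """Fraction of changes tagged with any dysfunction class."""
        if not changes:
            return 0.0
        bad = sum(1 for c in changes if c.dysfunction is not Dysfunction.NONE)
        return bad / len(changes)

    # Comparing quarters or organizations is then just comparing numbers:
    # dysfunction_ratio(q1_changes) vs. dysfunction_ratio(q2_changes),
    # or 0.10 for organization one vs. 0.80 for organization two.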

What else do I like to record? I like to make people enter how much risk they place on each change they are proposing and how much impact it would have if it goes wrong. I don't have this in place yet, but I'd like to be able to track how many of our changes go sour and what the impact was. That way I can look back over a year's worth of data and find out how accurate our guesses are. When we say that a change is low risk, how often do we end up with a negative interaction? 20%? 25%? What does it say when only 10% of our high risk changes go badly but 25% of our low risk changes have issues? How often do we have major outages when we estimated a low risk? These numbers also reveal quite a bit about the effectiveness of our change management program. One of the reasons we have change management is so that our managers can assess the risks involved in the work we're doing and hopefully approve only the changes with a reasonable risk/reward ratio. But if we're overconfident in our risk or impact estimates, then our managers can't make well-informed decisions. So we need to know if this is a problem for us.
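Again, I don't have the data yet, but the bookkeeping is simple. Here is a sketch of the breakdown I have in mind, assuming each change record also carried a declared_risk field and a went_sour flag (neither exists in my tickets today):

    from collections import Counter

    def failure_rate_by_declared_risk(changes):
        """For each declared risk level, what fraction of changes went sour?"""
        totals, failures = Counter(), Counter()
        for c in changes:
            totals[c.declared_risk] += 1      # e.g. "low", "medium", "high"
            if c.went_sour:                   # did the change cause a negative interaction?
                failures[c.declared_risk] += 1
        return {risk: failures[risk] / totals[risk] for risk in totals}

    # If this comes back as {"low": 0.25, "high": 0.10}, our "low risk" label
    # means very little and managers are approving work on bad information.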

I am not mean enough to do this, but I'd like to use these numbers to put a credit rating on system administrators so that when somebody proposes doing something I can get an idea of how much I need to scrutinize the proposal. What do you know, that's actually managing changes rather than just keeping a log and asking people to report them. If you're not taking these steps then you're probably in the business of change notification rather than change management. I'm not looking down on you, I'm right there with you...for now.
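If I ever did get that mean, the "credit rating" would just be the same counting done per administrator. A sketch, leaning on the same hypothetical fields (and the Counter import) as above:

    def admin_credit(changes):
        """Per-admin fraction of changes that were dysfunctional or went sour."""
        totals, dinged = Counter(), Counter()
        for c in changes:
            totals[c.admin] += 1
            if c.dysfunction is not Dysfunction.NONE or c.went_sour:
                dinged[c.admin] += 1
        # Higher means the next proposal from that admin deserves more scrutiny.
        return {admin: dinged[admin] / totals[admin] for admin in totals}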

So what are some other metrics for measuring the effectiveness of change management?

1 comment:

Michael Janke said...

I'm always interested in whether or not the change went exactly as planned. The simple example would be the firewall administrator who plans on opening up port 80, but a few minutes later figures out that she had a typo and opened up 800. A quick fix - just delete and re-paste the rule - but is that a failed change?