Wednesday, September 8, 2010

My first risk model

Hee Hee Hee!

I am so happy because my boss bought me a new toy and I got a chance to use it this week. I have been pestering him to buy me a copy of @RISK, a Monte Carlo simulation add-in for Excel, so that I could add more quantitative estimates to my risk analyses. Not long after he bought it, something came up that required a risk analysis, and of course it fell to me to produce one. Time to see what this program can do.

Our Exchange administrator would like to install Service Pack 1 on our Exchange 2010 environment and he would like to do it sooner rather than later. We're in a change freeze right now, but this has some really cool features and there is a lot of hope in the air that it will fix some bugs that have been nagging us for a while. The big boss is skittish about making a significant change to our most visible service in the first month of the new school year and would like an estimate of what could go wrong.

So this is the basic methodology I am using to develop my risk analysis.


  1. Identify potential risks.
  2. Create estimates of the likelihood of each risk being realized and the impact that it will have. Use expert opinions in this part and any secondary research that you can get your hands on. Decide on the probability density function for each unknown and correlate wherever it makes sense.
  3. Run the Monte Carlo simulation.
However, since I'm a good disciple of Douglas Hubbard and have read The Failure of Risk Management, I added a few extra steps.

  1. Document the predictions made and compare to real life if possible. Adjust and use these numbers in other forecasts.
  2. Put the model out there on the Internet for other people to tear apart, because "All models are wrong, some models are useful" (George Box).

Identify potential risks

I talked with the Exchange administrator and some other system admins and we identified this list of things that were reasonably possible.

  • Complete failure of multiple Client Access Servers (CAS) causes performance problems for the end users.
  • Misconfiguration of CAS servers causes minor problems with mail flow or intermittent user access problems.
  • Complete failure of multiple database servers causes some users' mailboxes to be unavailable.
  • Failure of a single database server causes minor performance problems for some users.
  • Corruption of a database causes the service to be unavailable for some number of users.
  • Project could introduce new bugs that break interoperability with existing services.
  • Project could introduce changes to the user interface that frustrate users and result in increased calls to the help desk.
  • Project could introduce new features that users want to know more about resulting in increased calls to the help desk.

Create estimates of likelihood and impact

I'm using @RISK 5.5 for this part, and it makes things pretty slick. Let's look at one of the risks identified: that multiple CAS servers could completely fail, causing performance problems. In our case we have four of these servers, and each one absorbs between 10% and 30% of the client load. Our Exchange administrator feels that a performance hit of less than 25% will not be noticed by the users.

Our Exchange administrator feels that there is a 1% chance that a server will completely fail after the service pack is installed. I doubled that number because I have read a lot of the research about how overconfident people are. This is the first variable in the model: how likely is a server to fail? So I set up my spreadsheet with four rows (one for each server) and a column with a binomial distribution that returns zero all but 2% of the time. Next up was the impact each failure would have on performance. We decided that the impact was no less than 10%, no more than 30%, and probably around 25%. So I used a PERT distribution, which is a scaled form of the beta distribution. It looks like this...
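If you want to play along at home without @RISK, the two distributions above can be approximated in plain Python. This is just my own rough stand-in, not what the add-in does internally; the PERT is built as a scaled beta using the usual textbook shape parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_iter = 10_000

# Failure indicator: 1 about 2% of the time, 0 otherwise
# (a stand-in for the binomial cell in the spreadsheet).
fail = rng.binomial(n=1, p=0.02, size=n_iter)

# PERT(min=10%, mode=25%, max=30%) impact, built as a scaled beta.
lo, mode, hi = 0.10, 0.25, 0.30
a = 1 + 4 * (mode - lo) / (hi - lo)   # alpha = 4
b = 1 + 4 * (hi - mode) / (hi - lo)   # beta  = 2
impact = lo + (hi - lo) * rng.beta(a, b, size=n_iter)

print(fail.mean())    # ~0.02
print(impact.mean())  # ~0.23, the PERT mean (min + 4*mode + max) / 6
```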

One of the most powerful features of @RISK is the ability to correlate variables. I said that there is a 2% chance that a server will fail, but if the first server fails, is there really still only a 2% chance that the next server will fail? So I put a .75 correlation on these variables, meaning that if one of the servers fails, the others are much more likely to fail as well.
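Reproducing that correlation outside of @RISK takes a little more work. One common trick, and it is only an approximation of what the add-in's correlation feature does, is to drive the four failure indicators from correlated latent normals, so each server still fails about 2% of the time on its own but the failures tend to arrive together.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_servers, n_iter, p_fail = 4, 100_000, 0.02

# Latent multivariate normal with 0.75 pairwise correlation.
corr = np.full((n_servers, n_servers), 0.75)
np.fill_diagonal(corr, 1.0)
z = rng.multivariate_normal(np.zeros(n_servers), corr, size=n_iter)

# A server fails when its latent draw falls below the 2% quantile,
# so the marginal failure rate stays ~2% but failures come in bunches.
fails = (z < norm.ppf(p_fail)).astype(int)

print(fails.mean(axis=0))                # each column ~0.02
print(np.corrcoef(fails, rowvar=False))  # positive, though less than 0.75
                                         # once the normals are squashed to 0/1
```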

Run the Simulation

Here are a couple images of the spreadsheet setup. The zeros in the first picture are where the random variables get calculated. The second picture shows what the distributions look like.


So the model calculates a random value from that PERT distribution for each server on each iteration. Then, to get the total performance impact we actually realized, I multiplied the impact by the one or zero returned from the binomial distribution; if the server didn't go down, then there was no impact. The total impact was summed up and the simulation was run.
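For the curious, here is a minimal end-to-end sketch of the same simulation in Python, stitching together the pieces from the earlier snippets. The exact percentages won't match the @RISK output, since the correlation is only approximated, but the structure is the same: draw failures, draw impacts, multiply, sum.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_servers, n_iter, p_fail = 4, 100_000, 0.02
lo, mode, hi = 0.10, 0.25, 0.30

# Correlated failure indicators (latent-normal trick from above).
corr = np.full((n_servers, n_servers), 0.75)
np.fill_diagonal(corr, 1.0)
z = rng.multivariate_normal(np.zeros(n_servers), corr, size=n_iter)
fails = (z < norm.ppf(p_fail)).astype(int)

# Per-server PERT impact, counted only when that server actually failed.
a = 1 + 4 * (mode - lo) / (hi - lo)
b = 1 + 4 * (hi - mode) / (hi - lo)
impact = lo + (hi - lo) * rng.beta(a, b, size=(n_iter, n_servers))
total_impact = (fails * impact).sum(axis=1)

print("no impact at all: ", (total_impact == 0).mean())
print("impact under 25%: ", (total_impact < 0.25).mean())
```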

As you can see, in 92% of the simulations there was no impact on performance at all. In 96.9% of the simulations the impact was less than the 25% threshold that our Exchange administrator said would go unnoticed by users.

Follow up


This was the first serious risk model I've created, so I expect it to be rife with problems, but I'm still kind of proud of it. I don't want this blog post to go on forever, so I won't go into the details of all the other risks that were modeled, but I might write some follow-up entries. I'd love to get some ideas on how to improve my risk model and how to identify additional risks that need to be taken into consideration, so feel free to drop in some comments.

8 comments:

Michael Janke said...

Awesome.

jth said...

Seems pretty straightforward to me. I'd be interested in seeing your copy, maybe working with you on a pet project or two to see if @RISK is something worth picking up for me as well.

Does that work on your mac? Or win only?

Unknown said...

@jth Sorry, @RISK is an Excel add-on that only works on Windows. I am impressed with this style of risk modeling because it's pretty much impossible to create a risk model without knowing the system you're trying to model, much like you can't build a model car without learning how the parts go together. That forced me to learn more about our email system than I knew before, and I think I have a more realistic risk picture than if I had just used a 50,000-foot view.

AThulin said...

Looks like @RISK requires risk estimates at a level of precision that is unusual in infosec. For instance, I doubt that the 1% probability for Exchange server failure after applying a service pack is anything but a gut feeling expressed in numbers.

Your 'doubling' seems a pretty strange methodology -- if you had got 15% risk instead, would you have doubled that as well?

Do you have any error estimates for the original estimate, or the subsequent doubling?

If the various risk elements are not commensurable (and gut feelings usually aren't), you might want to try to estimate the risk that the data you get out of this process is garbage.

I find that it can be quite difficult to get even a 4-level estimate out of most IT-people (extremely unlikely, unlikely, likely, very likely), and a similar 4-level estimate of the damage. (I.e. error estimates are quite large.)

Also, it's one thing to enumerate risks, and another to enumerate the right risks. Once you have defined the system for which you are trying to do a risk analysis, ask the right people involved with that system about the risks. IT support/helpdesk/customer support are often forgotten, yet they are the ones who see the effects of any failures, and even a small IT failure can mean long telephone queues for support -- which is another kind of damage.

I've worked a lot with mini-analyses: get everyone together in a room, brainstorm risks, evaluate (i.e. place them into a 4x4 grid of probability vs damage), decide which bins need to be addressed, and so on. As long as you get the right people together, this tends to work quite well -- failures happen when some area of the system in question was forgotten (like helpdesk operation). The main work is done in 6-8 hours, the documentation in another day.

Unknown said...

@AHutton thanks for the comment. I'm sure I'll be able to work some of that into the next model I put together.

After reading Hubbard's first book, How to Measure Anything, I learned that an uncalibrated estimator is about 80% overconfident in some cases, so I could have set the chance of failure at 1.2% but I decided to round up to a whole number. It is still a gut feeling expressed as a number, but that's better than a gut feeling expressed as a word like medium or low. By saying 2% we have a testable hypothesis. If one of the four servers should fail then I know that the probability is most likely greater than 5% and my next model will be more accurate.

The output of this process probably IS garbage. Models are poor approximations of real life, but every estimate we make is based on some model. For many of us in the security field, those models live in our head and aren't open to scrutiny. I am more willing to put my faith in this model than the one in my head and the head of the overconfident sysadmins around me.

Unknown said...

Oops, I mean AThulin not AHutton.

Anonymous said...

I like this model, it makes sense to me (having just got RiskAMP myself) and I agree with some of AThulin's comments, but I think what you've created is far more useful than asking likely/unlikely type questions.

While your description was a little vague, it looks like you've got a good trade-off between time and accuracy. If you wanted to get a more accurate number on "CAS Failure" you could have broken that problem into further cause and effect: rather than ask what the probability of failure was, break down the various failure conditions, causes, and the probabilities of those. But like I said, I think your approach was a good trade-off between time and accuracy. If you wanted more confidence in your results, you'd just spend more time. ...and that's the way it should be!

This has got to be the best start I've seen. Ever.

Anonymous said...

Sorry, Jay Jacobs here and on that last anonymous post.