Monday, May 18, 2009

Beta, it's not just for fraternity names

Last week at Secure360 I gave a talk on using Monte Carlo simulations to deal with unknowns in the calculation of Annualized Loss Expectancy (ALE). For those of you who need a refresher, the idea behind ALE is that you figure out how much an asset is worth (Asset Value or AV), and you figure out how badly an event would hurt that asset (Exposure Factor or EF, the fraction of the asset's value lost per event). Then you multiply those by how often the event happens (Annual Rate of Occurrence or ARO) to get the Annualized Loss Expectancy: ALE = AV x EF x ARO.
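To make the arithmetic concrete, here is a minimal Python sketch of that multiplication. The function name and the dollar figures are made up for illustration; they are not from my talk.

```python
def annualized_loss_expectancy(asset_value, exposure_factor, annual_rate):
    """ALE = AV * EF * ARO, with EF expressed as a fraction (0.0 to 1.0)."""
    single_loss = asset_value * exposure_factor  # loss from one occurrence
    return single_loss * annual_rate             # scaled by occurrences per year

# Hypothetical figures: a $100,000 asset, 25% damaged per event, 2 events a year.
print(annualized_loss_expectancy(100_000, 0.25, 2))  # 50000.0
```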

In my talk I mentioned using Monte Carlo simulations and why I like them. You see, I am of the belief that you can't nail any of those numbers (AV, EF, ARO) down precisely, so you need to work out a reasonable range, or even a slightly unreasonable range, as long as you err toward inclusion. By erring toward inclusion, I mean your range should be unrealistically wide rather than too narrow. Once you have your ranges you can whip up an Excel spreadsheet that picks a random number within each of your ranges and spits out the ALE. Repeat this over 5,000 to 10,000 rows and you've got your simulation. But a good Monte Carlo simulation is more nuanced than that.
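If you would rather see the basic version outside of a spreadsheet, here is a rough Python sketch of the same idea: one random pick per range, per row. The ranges are invented for the example and simulate_ale is just a name I picked; treat it as a sketch, not a finished tool.

```python
import random

def simulate_ale(av_range, ef_range, aro_range, rows=10_000, seed=None):
    """Return one simulated ALE per row, drawing each input uniformly from its range."""
    rng = random.Random(seed)
    results = []
    for _ in range(rows):
        av = rng.uniform(*av_range)    # asset value
        ef = rng.uniform(*ef_range)    # exposure factor, as a fraction
        aro = rng.uniform(*aro_range)  # occurrences per year
        results.append(av * ef * aro)
    return results

# Hypothetical ranges: asset worth $80k-$120k, 10%-40% exposure, 1-5 events a year.
ales = simulate_ale((80_000, 120_000), (0.1, 0.4), (1, 5), seed=1)
print(len(ales), round(min(ales)), round(max(ales)))
```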

If you just pick a random number within each of your ranges, and there is no skewing of the numbers, then after you've done this about 10,000 times you're going to get an average that is shockingly close to what you get if you just take the middle number of each range and multiply. That is the law of large numbers in action. What really makes your Monte Carlo simulations more accurate is that they also take into consideration the shape of your variables. I'd like to talk about a couple of shapes, and my new favorite formula to use in Monte Carlo simulations.
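You can check that claim yourself. The sketch below, again with invented ranges and assuming the three inputs are drawn independently, compares the simulated average against the product of the midpoints.

```python
import random

rng = random.Random(7)
ranges = {"av": (80_000, 120_000), "ef": (0.1, 0.4), "aro": (1, 5)}  # hypothetical

rows = 10_000
total = 0.0
for _ in range(rows):
    row = 1.0
    for low, high in ranges.values():
        row *= rng.uniform(low, high)  # unskewed pick within the range
    total += row
simulated_mean = total / rows

# Middle number of each range, multiplied together.
midpoint_product = 1.0
for low, high in ranges.values():
    midpoint_product *= (low + high) / 2

print(round(simulated_mean), round(midpoint_product))
```

The two numbers should land within a percent or so of each other, which is exactly why the uniform-only simulation doesn't tell you much more than the back-of-the-envelope version.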

For the most part, I have always stuck with two basic shapes: the uniform distribution and the normal distribution. A uniform distribution is one where every number in the range is equally likely to be the "true" value of the thing being simulated. I typically use this for asset value via the RANDBETWEEN() function: I know that the asset value falls between x and y, and any one of those numbers is as likely as another to be accurate. I typically use the normal distribution in cases where I have an average and a reasonable guess about the standard deviation. For example, if I know that 75% of my users have experienced some phenomenon, give or take 8%, then I will use a normal curve that peaks at 75% and tapers off on both sides. How tight the taper is depends on how you read the 8%: as a single standard deviation it leaves about a third of the values below 67% or above 83%, so if you want almost no values outside that range, use a standard deviation closer to a third of that width.
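Here is a small sketch of both shapes, using Python's random module as a stand-in for the spreadsheet functions (the asset range is made up). It also counts how many normal draws land outside 67% to 83% for two assumed standard deviations, since how quickly the curve tapers depends entirely on that choice.

```python
import random

rng = random.Random(42)

# Uniform: a whole-dollar asset value, like RANDBETWEEN(80000, 120000).
asset_value = rng.randint(80_000, 120_000)

def share_outside(mean, sd, low, high, rows=10_000):
    """Fraction of normal draws that fall below low or above high."""
    draws = [rng.gauss(mean, sd) for _ in range(rows)]
    return sum(1 for d in draws if d < low or d > high) / rows

wide = share_outside(75, 8, 67, 83)       # 8 points treated as one standard deviation
tight = share_outside(75, 8 / 3, 67, 83)  # 8 points treated as three standard deviations
print(asset_value, wide, tight)
```

With a standard deviation of 8, roughly a third of the draws fall outside the band; with a standard deviation of 8/3, almost none do.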

But as I have continued to refine my practice of Monte Carlo simulations, it occurred to me that I need more shapes. There are variables that don't fit neatly into one of these two shapes, and that is where the beta distribution (http://en.wikipedia.org/wiki/Beta_distribution) comes in. Beta is able to reproduce a wide variety of shapes that may be more appropriate for your variables. Let me give you an example.

Let's say one of the threats to your asset is a power outage. One thing you need to know is how often you're going to deal with a power outage in your data center. Going back over the historical statistics that you've kept, and talking to your server and network people, you've all agreed that there will probably be 3 power outages in the data center this year because your maintenance people suck and they are always making changes without telling anyone. Everyone also agrees that there could be more power outages, but that the odds of having more outages go down quickly as the number increases. This isn't something that is shaped like a normal curve; it is more of a slope that starts high on the left side and falls away as you go right. Sure, you can punish your data and force it into a normal curve, but instead let's try out our new beta distribution and see if we like that shape better. In my spreadsheet, under annual rate of occurrence for power outage, I put in =INT(BETAINV(RAND(),1,5,3,8)) and copied that down 500 rows. Out of the 500 rows, it returned 3 outages per year 347 times, 4 outages 127 times, and 5 outages just 31 times.
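For anyone working outside of Excel, here is a rough Python equivalent of that formula. It assumes that BETAINV with lower and upper bounds amounts to stretching a standard beta draw from the 0-to-1 interval onto the bounds, and scaled_beta_int is just a helper name I made up. Your counts will differ from mine because the draws are random.

```python
import random
from collections import Counter

def scaled_beta_int(rng, a, b, low, high):
    """Rough equivalent of INT(BETAINV(RAND(), a, b, low, high))."""
    # betavariate returns a value between 0 and 1; stretch it onto [low, high].
    return int(low + (high - low) * rng.betavariate(a, b))

rng = random.Random(2009)
counts = Counter(scaled_beta_int(rng, 1, 5, 3, 8) for _ in range(500))
for outages in sorted(counts):
    print(outages, counts[outages])
```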

I'll let you look at Wikipedia to see how the first two numbers (a and b) affect the shape of the distribution. In a nutshell, if a is bigger than b then the distribution trends upwards. If b is bigger than a, then the distribution trends downwards. The difference between a and b determines how dramatic that trend is. If b is much larger than a you get a very L-shaped graph where the numbers drop off quickly. If a=1 and b=2 you get a straight line that trends downward. The last two numbers in the formula are the bottom and top boundaries to put on the distribution. In my formula above, I say that there will always be at least 3 and never 8 or more. So if I wanted a straight line reflecting outages of 3 to 8 times per year, then I could use this formula: =INT(BETAINV(RAND(),1,2,3,9)). When I ran that 500 times I got 3 outages 163 times, 4 outages 137 times, 5 = 95, 6 = 66, 7 = 38, and 8 outages 12 times. In other words, there is only about a 2% chance that we'll have 8 power outages in one year, but there is about a 60% chance that we'll have 3 or 4 outages.
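Because the a=1, b=2 case really is a straight line, you can work out the expected counts exactly instead of relying on one random run. The sketch below uses the fact that Beta(1, 2) has the cumulative distribution F(x) = 1 - (1 - x)^2 on 0 to 1; the function name is my own, and the scaling onto the 3-to-9 bounds is the same assumption as in the formula above.

```python
def straight_line_bin_probabilities(low, high):
    """Exact chance of each whole-number result of INT(BETAINV(RAND(), 1, 2, low, high))."""
    width = high - low

    def cdf(x):
        # Beta(1, 2) CDF, after mapping x from [low, high] back onto [0, 1].
        return 1 - (1 - (x - low) / width) ** 2

    return {k: cdf(k + 1) - cdf(k) for k in range(low, high)}

probs = straight_line_bin_probabilities(3, 9)
for outages, p in probs.items():
    print(outages, round(p * 500, 1))  # expected count out of 500 rows
```

The expected counts come out to roughly 153, 125, 97, 69, 42, and 14 for 3 through 8 outages, which is in the same neighborhood as what my spreadsheet run produced.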

Play around with some of the other shapes that you can make with your beta distribution. I just created a spreadsheet where I could play with the a, b, xlow, and xhigh numbers and see how it charts. I will be honest and tell you that I don't yet know how to calculate what a and b should be in my beta distribution, but I am still really happy, because if I can make a distribution that more closely approximates what I expect to see in the real world then my simulations will return better data. This is one of those areas where we can make our range estimate tighter without spending additional money on research, so even if it isn't exact, it is still good news. I hope you are able to find value in this as well.
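If you don't have a spreadsheet handy, a crude text chart does the same job. This is only a sketch: text_histogram is a made-up helper, it expects whole-number xlow and xhigh, and it reuses the assumption that the bounded beta is a standard beta draw stretched onto the bounds.

```python
import random
from collections import Counter

def text_histogram(a, b, xlow, xhigh, rows=2_000, seed=0):
    """Return one line per whole-number outcome, with a bar sized to its share of rows."""
    rng = random.Random(seed)
    counts = Counter(
        int(xlow + (xhigh - xlow) * rng.betavariate(a, b)) for _ in range(rows)
    )
    lines = []
    for value in range(xlow, xhigh):
        bar = "#" * (counts[value] * 60 // rows)  # 60 characters = 100% of rows
        lines.append(f"{value:>3} {bar}")
    return "\n".join(lines)

print(text_histogram(1, 5, 3, 8))  # b much larger than a: L-shaped
print()
print(text_histogram(5, 1, 3, 8))  # a much larger than b: trends upward
```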
