Monday, November 7, 2011

Stats for risk modeling: The Normal Distribution

This blog entry was originally posted on the Society of Information Risk Analysts web page on November 1, 2011: http://societyinforisk.org/content/stats-risk-modeling-normal-distribution

A few months ago I did a presentation at the Twin Cities Information Risk Round Table (TCIRRT) about basic statistical distributions that you might use in your risk models.  The presentation proved to be rather popular and so I thought maybe I should write a series of blog posts to review the material that I covered in that presentation.

I was a little torn over whether to start with a normal distribution or a uniform distribution and ultimately just decided to pick one.  So I’m going to talk about the normal distribution in this post.

SOME BACKGROUND: Distributions are all about the area under a curve.  That’s why calculus is so important to statistics.  We use distributions to represent the random variables that we encounter in our risk models.  These are the unknowns, like how much a loss event is going to cost us.  I have included here a picture of a normal distribution with a mean of 50 and a standard deviation of 15.  In the picture the total area under the curve (which is all the blue stuff) adds up to 1.  That is going to be the case for every distribution that you look at.  The whole curve represents every possible outcome for a random variable, and the probabilities of all the possible outcomes taken together add up to 1.

I have highlighted the area between 35 and 50 on this curve, and you might notice at the top that this accounts for 34.1% of the area under the curve.  If this distribution were an accurate representation of a random variable, I could say that 34.1% of the time I should see a value that falls between 35 and 50.
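
If you want to check that 34.1% yourself, here is a minimal sketch using NormalDist from Python's standard statistics module (Python 3.8 or later).  The helper name area_between is my own invention for illustration, not something from the original presentation.

```python
from statistics import NormalDist

# The distribution from the picture: mean 50, standard deviation 15.
dist = NormalDist(mu=50, sigma=15)


def area_between(d, low, high):
    """Area under the curve of d between low and high.

    This is the probability that a value drawn from d lands in that range.
    (Hypothetical helper name, used only for this example.)
    """
    return d.cdf(high) - d.cdf(low)


share = area_between(dist, 35, 50)
print(f"Area between 35 and 50: {share:.1%}")  # prints 34.1%
```

The cdf call gives the area to the left of a point, so subtracting two of them leaves the slice in between.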

WHEN TO USE IT:  It seems to me that there are 3 tests you can use to decide if the normal distribution is the right distribution to represent your random variable.
  1. You have an average value that you can calculate or reasonably estimate.  If you were using FAIR for your risk modeling, this would be your “most likely” value.
  2. The “shape” of the random variable is nearly symmetric.  It can be skewed a bit in one direction or another, but it should not be extreme.
  3. There is a low probability of getting a value at the far left or far right of your distribution.  In other words, it’s bell-shaped.

WHAT MAKES IT COOL: The normal distribution is cool because of the 68-95-99.7 rule.  A normal distribution has two parameters, the mean and the standard deviation.  You could think of the standard deviation as how fat or skinny the distribution appears.  I mentioned earlier that this picture has a mean of 50 and a standard deviation of 15.  If you go from 50 to 35, you have moved 1 standard deviation away from the mean.  If you take the values from 35 to 65, you have 1 standard deviation in either direction, and that covers about 68% of the area under the curve.  So in a normal distribution, about 68% of the values are within one standard deviation of the mean.  If you move two standard deviations in either direction you have about 95% of the area, and three will get you about 99.7% of the area.  So if you can calculate or reasonably estimate the mean and standard deviation of a random variable, you will know that about 99.7% of your values should be greater than the mean minus (standard deviation * 3) and smaller than the mean plus (standard deviation * 3).
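
Here is a quick sketch of the rule for the mean of 50 and standard deviation of 15 used in the picture, again with NormalDist from Python's standard statistics module.  The function name coverage is just a label I made up for the example.

```python
from statistics import NormalDist

# Same distribution as the picture: mean 50, standard deviation 15.
dist = NormalDist(mu=50, sigma=15)


def coverage(d, k):
    """Share of the area within k standard deviations of the mean.

    (Hypothetical helper name, used only for this example.)
    """
    low = d.mean - k * d.stdev
    high = d.mean + k * d.stdev
    return d.cdf(high) - d.cdf(low)


for k in (1, 2, 3):
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    print(f"{k} standard deviation(s): {low:g} to {high:g} "
          f"covers {coverage(dist, k):.1%}")
```

The three ranges come out as 35 to 65, 20 to 80 and 5 to 95.  The percentages are the same for any mean and standard deviation, which is what makes the rule handy.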

WHEN TO AVOID IT: It should be obvious that if your random variable doesn’t pass the three tests I wrote up there then you should avoid using the normal distribution.  The main reason to avoid it, outside of failing the 3 tests, is that the normal distribution is continuous and unbounded.  Continuous means that ANY value between two points is possible.  So my random variable could be assigned a value of 35.000654780.  I might just round that to two decimal places and call it a dollar value, but you should be aware that these values are possible.  Unbounded means that it will occasionally return values more than three standard deviations away from the mean.  For example, let’s say that your mean is 30 and your standard deviation is 10.  Zero is three standard deviations below the mean, so a little more than 1 time out of 1000 (about 0.13% of the time) your random number generator will give you a value that is less than zero.  The roughly 3 out of 1000 that fall outside three standard deviations are split between both tails, below 0 and above 60.  If a negative value is not possible in real life then you need to consider another distribution.
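
To see how often a negative value actually shows up, here is a small sketch for the mean of 30 and standard deviation of 10 from the example.  It compares the exact left-tail area with a simulation.  The seed and the number of draws are arbitrary choices of mine, so your simulated count will differ if you change them.

```python
import random
from statistics import NormalDist

# The example above: mean 30, standard deviation 10.
dist = NormalDist(mu=30, sigma=10)

# Exact probability of a value below zero (the area left of 0).
p_negative = dist.cdf(0)
print(f"Exact chance of a negative value: {p_negative:.3%}")

# Simulate what a random number generator would hand a risk model.
# The seed (2011) and draw count are arbitrary assumptions for the demo.
rng = random.Random(2011)
draws = [rng.gauss(30, 10) for _ in range(100_000)]
negatives = sum(1 for value in draws if value < 0)
print(f"Negative draws: {negatives} out of {len(draws)}")
```

The exact figure is about 0.135%, so out of 100,000 draws you should expect somewhere in the neighborhood of 135 impossible values sneaking into your model.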
