A few months ago I gave a presentation at the Twin Cities Information Risk Round Table (TCIRRT) about basic statistical distributions that you might use in your risk models. The presentation proved to be rather popular, so I decided to write a series of blog posts reviewing the material I covered.
I was a little torn over whether to start with a normal distribution or a uniform distribution and ultimately just decided to pick one. So I’m going to talk about the normal distribution in this post.
SOME BACKGROUND: Distributions are all about the area under a curve. That’s why calculus is so important to statistics. We use distributions to represent the random variables that we encounter in our risk models. These are the unknowns, like how much a loss event is going to cost us. I have included here a picture of a normal distribution with a mean of 50 and a standard deviation of 15. In the picture the total area under the curve (which is all the blue stuff) adds up to 1. That is going to be the case for every distribution that you look at. The whole curve represents every possible outcome for a random variable, and the total probability across all possible outcomes is 1.
I have highlighted the area between 35 and 50 on this curve, which runs from one standard deviation below the mean up to the mean, and you might notice at the top that this accounts for 34.1% of the area under the curve. If this distribution were an accurate representation of a random variable, I could say that 34.1% of the time I should see a value that falls between 35 and 50.
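If you want to check that 34.1% figure yourself, here is a minimal sketch using Python's standard library. The variable names are my own, and it assumes the same mean of 50 and standard deviation of 15 as the picture. The area between two points is just the difference between the cumulative probabilities at those points.

```python
from statistics import NormalDist

# Normal distribution from the picture: mean 50, standard deviation 15.
dist = NormalDist(mu=50, sigma=15)

# The area under the curve between 35 and 50 is the difference
# between the cumulative probabilities at the two endpoints.
area_35_to_50 = dist.cdf(50) - dist.cdf(35)

print(f"P(35 <= X <= 50) = {area_35_to_50:.3f}")  # prints 0.341
```

The same two-line pattern works for any interval you care about, such as the chance that a loss falls between two dollar amounts.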
WHEN TO USE IT: It seems to me that there are three tests you can use to decide if the normal distribution is the right distribution to represent your random variable.
- You have an average value that you can calculate or reasonably estimate. If you were using FAIR for your risk modeling, this would be your “most likely” value.
- The “shape” of the random variable is nearly symmetric. It can be skewed a bit in one direction or another, but the skew should not be extreme.
- There is a low probability of getting a value at the far left or far right of your distribution. In other words, it’s bell-shaped.
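If you happen to have historical observations for the variable (an assumption on my part, since often you will only have estimates), the three tests can be roughly sketched in code. The function name `normality_screen` and the summary numbers it reports are hypothetical choices of mine, not a formal statistical test. For a bell-shaped variable you would expect the mean and median to be close, the skewness to be near zero, and only about 5% of values to land more than two standard deviations from the mean.

```python
from statistics import NormalDist, mean, median, pstdev

def normality_screen(values):
    """Rough screen for the three tests; not a formal normality test."""
    mu = mean(values)        # test 1: an average you can calculate
    sigma = pstdev(values)
    n = len(values)
    # Test 2: skewness near 0 suggests a roughly symmetric shape.
    skewness = sum(((v - mu) / sigma) ** 3 for v in values) / n
    # Test 3: share of values more than 2 standard deviations from the mean.
    share_beyond_2_sd = sum(abs(v - mu) > 2 * sigma for v in values) / n
    return {
        "mean": mu,
        "median": median(values),
        "skewness": skewness,
        "share_beyond_2_sd": share_beyond_2_sd,
    }

# Illustrative data only: simulated draws standing in for real observations.
sample = NormalDist(mu=50, sigma=15).samples(10_000, seed=1)
report = normality_screen(sample)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

How far from those reference values is "too far" is a judgment call that depends on your model, so treat the output as a prompt for discussion rather than a pass or fail.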
WHEN TO AVOID IT: It should be obvious that if your random variable doesn’t pass the three tests above then you should avoid using the normal distribution. Beyond that, the main reason to avoid it is that the normal distribution is continuous and unbounded. Continuous means that ANY value between two points is possible, so my random variable could be assigned a value of 35.000654780. I might just round that to two decimal places and call it a dollar value, but you should be aware that these values are possible. Unbounded means that there is no minimum or maximum, so it will occasionally return values far out in the tails, beyond the range that covers 99% of outcomes. For example, let’s say that your mean is 30 and your standard deviation is 10. Zero is three standard deviations below the mean, so a bit more than 1 time out of 1000 (about 0.13%) your random number generator will give you a value that is less than zero. (The familiar figure of roughly 3 in 1000 counts both tails, below 0 and above 60.) If a negative value is not possible in real life then you need to consider another distribution.
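To see how often a normal distribution with a mean of 30 and a standard deviation of 10 goes negative, here is a small sketch that computes the exact tail probability and then checks it with a simulation. The seed and the number of draws are arbitrary choices of mine, so the simulated share will wobble a little around the exact value.

```python
import random
from statistics import NormalDist

# Example from the text: mean 30, standard deviation 10.
dist = NormalDist(mu=30, sigma=10)

# Exact probability of a value below zero
# (zero is three standard deviations below the mean).
p_negative = dist.cdf(0)

# Simulation check with a fixed seed so the run is repeatable.
rng = random.Random(42)
n_draws = 200_000
draws = [rng.gauss(30, 10) for _ in range(n_draws)]
share_negative = sum(d < 0 for d in draws) / n_draws

print(f"exact P(X < 0)     = {p_negative:.5f}")
print(f"simulated P(X < 0) = {share_negative:.5f}")

# Rounding a draw to two decimals gives something that looks like a
# dollar amount, but it does not remove the negative values.
print(f"first draw as dollars: {round(draws[0], 2)}")
```

In a loss model that runs hundreds of thousands of iterations, a probability this small still produces a few hundred impossible negative losses, which is why the choice of distribution matters.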