Tuesday, March 27, 2012

Stats for Risk Modelling: Poisson and Lognormal distributions

This is a blog entry that I wrote for the Society for Information Risk Analysts blog on March 5, 2012.  https://www.societyinforisk.org/content/stats-risk-modelling-poisson-and-lognormal-distributions

It's been a while since I wrote up a blog post explaining some distributions, and I thought this time I might try something a little different.  I'm going to combine them into a scenario and build a model around that using the distributions that I want to explain.  So let's give it a shot and see how well it works.
I work at a University, and Universities are widely seen as a hotbed of piracy by large content holders.  So as you can imagine, every semester I receive a fairly decent pile of DMCA takedown requests that I need to handle.  And of course I also receive a fair number of calls from vendors selling me a product to help manage or eliminate my copyright complaints.  Let's define some variables.
Every semester I receive at least 80 copyright complaints and I have never received more than 113 in a semester.  The only exception is summertime when I usually get 3 to 5.  On average I spend about 5 to 10 minutes dealing with a single copyright complaint, however there are extreme cases where I have spent over an hour with a complaint that involved a repeat offender and a referral to the Student Conduct office.  And since I am a low-paid state worker, I make $100/hour (at least in this scenario).
The question is: should I be willing to spend $30,000 plus another $5000 in labor to purchase a product that promises to cut my copyright complaints down to a quarter of what I'm getting right now?  The product will also require approximately $10,000/year in licensing costs.
So the first thing we should try to model is the number of copyright complaints that I'm going to receive in a semester, and in this case I'd like to use the Poisson distribution.  Poisson is your friend when you're trying to model the frequency of something within a space of time.  For example, the number of defects a machine produces in an hour or how many people will call into your call center between 9:00am and 10:00am.  Poisson is really easy to understand because all we need is the average number to create a distribution.  The major thing to watch out for is that Poisson is intended for modeling independent variables, meaning that each event has equal probability and one event does not trigger the next event.  I have come to believe that there are very few truly independent variables out there, but there are a lot of variables that are independent enough.  These two graphics show the Summer and Spring distribution.  For Spring, I selected an average of 96.5, which is right between my minimum and my maximum.  The result is a distribution that is heavily weighted around 100, but is willing to accept that sometimes I'm going to get 65 or 130 in a semester.  In the Summer, there is about a 5% chance that I wont get any at all, but I could see a rough year that puts me up to 8.

The next variable I want to model is how much time I spent on each copyright complaint.  In this case, we have an average of 10 to 15 minutes per complaint, but there are rare but extreme examples of the time stretching into hours.  The distribution I've chosen to use is Lognormal.  Lognormal is a continuous distribution, so we're able to get values that fall between the units; for example a complaint could take 12.28 minutes to resolve.  Lognormal is particularly well suited to distributions that have long tails with low probability on the far side.  I have selected an average of 15 minutes, and a standard deviation of 25 minutes.  Just like with a normal distribution, the standard deviation tells us the dispersion of the variable.  A larger number means that there is more variability and the spread is wider.  Here is a graph of what a single copyright complaint might require for time.  You can see that 58.9% of the simulated copyright complaints will take ten minutes or less.  You can't see it in this picture, but 71.8% will take 15 minutes or less.  3.8% of the simulations will require more than an hour of my time.
Now the challenging part: in each semester I am going to generate a random number of events, and then I need to add that many instances of my impact variable.  If that sounds like a bit of mental gymnastics, consider this: what if my frequency variable returns a value of 130 copyright complaints and my impact variable returns a value of 60 minutes.  If I multiply the first variable by the second I will get a result that is so astronomically unlikely that it isn't worth considering at all.  My model will be returning junk.  Someone from Vose Software, maybe David Vose himself, put a fantastic video of how to handle this situation the right way and wrong way on YouTube: http://www.youtube.com/watch?v=0EZLiDyfF30.  I was really disappointed to learn that there isn't a clean and easy way to do this in @Risk, and the free version of ModelRisk that I downloaded doesn't have the aggregate feature either.  I'm not going to buy another software product to do something that @Risk should do, so I had to get a little more creative.
The way I solved this problem was to put a distribution of impact into a cell and copy that formula down 300 rows.  I know that I am rarely going to have my simulation return more than 264 copyright complaints for a single year so that seemed safe.  I highlighted the whole column and named the column 'Simulations' in Excel.  The Simulations table starts in row C9 in my spreadsheet, and the output of my Poisson distribution is in cell C2.  So in another cell I put in this formula: SUM(OFFSET(C9,0,0,C2,1)).  The offset formula is a lot of fun, it returns a range of cells starting from a reference cell.  My reference cell is C9.  I tell offset to start at C9, move over zero columns and zero rows, then return a range that is 1 column wide and has a height of whatever is in C2.  So even though my spreadsheet is going to generate 300 instances of my impact variable, I'm only going to add up whatever number my frequency distribution outputs.  It's a lot of wasted computing effort, I know.  Here is the output of the total number of minutes:

The max number was 5064 minutes spent on copyright complaints per year.  Notice that the shape of the distribution looks nothing like the Lognormal distribution that we defined for impact.  That's the Central Limit Theorum at work again.  When you add a number of random variables the result becomes more and more Normal.  That's another reason why we can't just multiply one instance of our impact varaible by the frequency variable.  What we would get is a really really big Lognormal distribution which does not reflect what happens when you add up random variables.  When I mutliply this by my per-minute rate of $1.67 I get a cost distribution that looks like this:

The maximum on this was $8,440 which doesn't even cover the yearly maintenance let alone the upfront cost of the application.  So I feel confident that it is more efficient for me to continue processing copyright complaints manually rather than purchasing software.  Also, I don't really make anywhere close to $100/hour so the odds of recommending that we purchase this software is very low.  This is the exact technique I use to convince my boss not to purchase another product that I have to maintain.  If that isn't a Risk Management success story then I don't know what is.
So here is the takeaway from this blog post:
  1. If you're modeling the frequency of an event and the probability of each event is constant then try out the Poisson distribution.  But if one event raises the probability of another then Poisson probably isn't the right choice.
  2. For impacts that have long tails with low probability the Lognormal distribution is probably the best choice to model the variable.
  3. Be careful when you multiply two variables.  Multiplying one instance of a variable by 80 is not the same as adding up 80 instances of a random variable.

No comments: