Monday, May 18, 2009

Testing out the rule of five

A while back I read the book "How to Measure Anything" by Douglas Hubbard. In a nutshell, I thought the book was great, and it has a lot of simplifying assumptions in it that you can use when you're trying to measure something intangible, like information security.
There is one thing that I have had a little trouble accepting though, and that is the rule of five that he describes in one of the chapters. It says that if you randomly sample five people from a population for some value (such as how many hours of sleep they got last night), there is a 93.75% chance that the median value for the whole population falls between the smallest and the largest number in your sample.

If I remember correctly, it all starts with the premise that if you sample two people there is a 50% chance that the range of their two numbers will not include the median. Add a third person and there is another 50% chance that they land on the same side of the median as the first two. .5 x .5 = .25, so now there is only a 25% chance that the median does not fall in that range. A fourth person means we multiply .25 by .5 and get .125. Finally the fifth person brings us to a probability of .0625 that the median is not included in our range, which leaves .9375 that it is. I've always had a little trouble with the first statement, that there is a 50% chance that the first two numbers will include the median. I looked around the Internet and I haven't been able to find any other confirmation of the rule of five, except for other people citing Hubbard. So I decided I would try a couple of simple tests to see if this would work for me in theory.
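
The arithmetic above can be written out as a short Python sketch. It assumes every draw is independent and equally likely to land above or below the median (ties ignored); the function name miss_probability is made up for illustration and does not come from Hubbard's book.

```python
def miss_probability(n):
    """Chance that n independent random draws all fall on one side of the median."""
    # Each draw lands above the median with probability 0.5 (ties ignored).
    # All n above has probability 0.5**n, and so does all n below.
    return 2 * 0.5 ** n

for n in range(2, 6):
    miss = miss_probability(n)
    # Print the sample size, the chance of missing, and the chance of a hit.
    print(n, miss, 1 - miss)
```

For n = 5 this works out to a .0625 chance of missing the median and .9375 of catching it, matching the numbers in the paragraph above.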

The first test was to see if I could reproduce the 50% chance of picking two numbers that include the median. I opened up my spreadsheet program and in the A column I put in this formula: =RANDBETWEEN(1,1000). I copied that down 1000 rows to get 1000 random numbers between 1 and 1000. This was my reference column. I copied the values and pasted them into column B and then deleted column A. That way the values won't keep changing every time I do some math on the page. If you're keeping track at home, that means a list of static numbers is now in column A. Then I put that same formula into columns B and C. This simulates the process of picking two numbers from the whole population.

If you're really paying attention, you will notice that I didn't actually choose two values from the population, I generated two more random numbers. So this isn't exactly the same, but I'm just trying to do a "back of the envelope" test here, and two fresh random numbers from the same range should behave much like two values picked out of the list. In other words, I recognize that this isn't perfect, but it is close enough for my purposes. In cell D1 I put in one formula: =MEDIAN(A1:A1000). It gave me the median value of my list of random numbers. In column E I put in this formula: =MAX(B1,C1). In column F I put in this formula: =MIN(B1,C1). So now I know that column E has the upper bound of my range and column F has the lower bound.

In column G I put in this formula: =IF(E1>=D$1,"1","0"), and in column H I put in =IF(D$1>=F1,"1","0"). So if the range includes the median, I will have a 1 in both columns G and H. In column I, I put =G1+H1. Copy these formulas all the way down and column I will have a 2 in it every time the range includes the median. BTW, if there is an easier way to do this I would love to hear about it. The last step was in cell J1 where I put =COUNTIF(I1:I1000,"=2"). If there is really a 50% chance then this should be pretty close to 500. What was my final number? I have to admit I was surprised to get 520. Not bad. Not proof, mind you, but definitely something to lend credence to the rule of five.
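
For anyone who would rather not build the spreadsheet, here is a minimal Python sketch of the same two-number test. It is an approximation of the steps above, not a copy of the workbook: the function name two_number_hits and its parameters are invented for illustration, and like the spreadsheet it generates two fresh random numbers per row instead of drawing them from the reference list. Within the spreadsheet itself, columns G through I could probably be collapsed into a single test such as =IF(AND(E1>=D$1,D$1>=F1),1,0).

```python
import random
import statistics

def two_number_hits(rows=1000, low=1, high=1000, seed=None):
    """Count how many random pairs have a range that includes the median."""
    rng = random.Random(seed)
    # Column A: the static reference list of random numbers.
    reference = [rng.randint(low, high) for _ in range(rows)]
    median = statistics.median(reference)
    hits = 0
    for _ in range(rows):
        # Columns B and C: two fresh random numbers for this row.
        b = rng.randint(low, high)
        c = rng.randint(low, high)
        # Columns E through I: is the median between the min and the max?
        if min(b, c) <= median <= max(b, c):
            hits += 1
    return hits

print(two_number_hits())
```

Each run gives a slightly different count, but it should land in the general neighborhood of 500 out of 1000 rows.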

For my next experiment, I decided to get a little fancier. I whipped up the following Python script. In a nutshell, it creates a population of 1001 random numbers. Then it creates 500 independent random samples of five, checks whether the median of the population falls within the range of each sample, and prints out the fraction of successes. I ran this bad boy and got 96.2%. So after all of this, I have to say I'm feeling pretty good about the rule of five, even if I can't find any independent verification of it.
# true division on Python 2 as well; harmless on Python 3
from __future__ import division
import random

def intherange(median, sample):
  # return 1 if the median lies between the smallest and largest
  # values in the sample, otherwise 0
  if min(sample) <= median <= max(sample):
    return 1
  return 0

# create a list of 1001 random numbers (an odd count, so the median
# is a single element of the list)
population = []
for i in range(1001):
  population.append(random.randint(1,1000))

# sort the population and get the median
# with 1001 values the middle one sits at index 500, because
# list indexes start at 0
population.sort()
median = population[500]

# Let's take 500 samples of five and see what we get each time.
hits = 0
for i in range(500):
  sample = random.sample(population,5)
  hits += intherange(median, sample)

# fraction of samples whose range included the median
print(hits/500)

4 comments:

Anonymous said...

Doesn't the 50% rule include some assumption about the distribution of the population from which the samples are taken? Does it follow a normal distribution or a rectangular distribution?

Unknown said...

Well I was thinking about this some more last night while I was trying to sleep and my wife was trying to prevent me from sleeping.

I think the key word in this is that we're saying the median will fall in this range, not the mean. When you're looking at the median of the population, half of the population is above it and half is below. So that's where the whole 50% thing comes in. Anytime you sample anyone, there is a 50% chance that they are above the median.

So when you sample your first person, there is a 100% chance that the value you get is either >= the median or <= the median. When you sample the second person, there is a 50% chance that the sample is on the same side of the median as the last person you sampled. There is a 25% chance that you will sample three people on the same side of the median. Etc.

So I don't think it matters what the distribution is for the accuracy of the rule of five, but that is absolutely something worth testing down the road.

Where shape really comes into play is when you try to interpret the results of the rule of five. If you're dealing with data that has a normal curve (especially when that curve is pretty tight) then the median and the mean will be very close. The same could be said for a very uniform distribution like the one my program creates. On the other hand, if the shape of your distribution is a sharp downward facing slope then there could be a big difference between the mean and the median.
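
A rough way to try this out, sketched in Python. The two populations, the seeds, the exponential shape chosen for the skewed case, and the helper name rule_of_five_hit_rate are all assumptions picked for illustration; this is a quick check, not a proof.

```python
import random
import statistics

def rule_of_five_hit_rate(population, trials=2000, seed=None):
    """Fraction of five-item samples whose range contains the population median."""
    rng = random.Random(seed)
    median = statistics.median(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, 5)
        if min(sample) <= median <= max(sample):
            hits += 1
    return hits / trials

rng = random.Random(42)
# A flat population and a sharply sloping (exponential) one, 1001 values each.
uniform = [rng.uniform(0, 1000) for _ in range(1001)]
skewed = [rng.expovariate(1 / 100) for _ in range(1001)]

for name, pop in (("uniform", uniform), ("skewed", skewed)):
    # Print the hit rate next to the mean and median of each population.
    print(name, rule_of_five_hit_rate(pop, seed=0),
          statistics.mean(pop), statistics.median(pop))
```

If the reasoning above holds, both hit rates should sit near .9375, while the mean and the median should be close together for the flat population and noticeably apart for the skewed one.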

ckg said...

"something worth testing down the road"

Why? It's as true as a fair coin landing heads or tails 50% of the time each. No rocket science there; nothing really to test (unless you just want to assure yourself that you can do some trivial little python program).

You're right that you can't say anything about the mean really. But with only five observations you can't say a whole lot about the distribution anyway. Hubbard's whole point is: it might be tempting to say "we can't say anything at all with only five data points" and that is just wrong. With only five observations, we still have a nice idea of where the "average observation", i.e. the median, is.

Now with only five observations, that might be quite a wide interval, but the point is we actually can do something meaningful with very small sample sizes. What's important is not to get a large sample size, but how we get our data. Without randomization, all bets are off. Always. To say anything.

Glad to see you dug Hubbard's book -- it's an excellent read that security-minded folks would do well to absorb. Without good *meaningful* metrics, information security assessment is meaningless, and there is no other book that gives such a gentle layman's intro to what measurement is all about.

Unknown said...

Hey @ckg, that python program isn't trivial. Just look at the script I put in this blog posting. That is the peak of my programming ability.