A while back I read the book "How to Measure Anything" by Douglas Hubbard. In a nutshell, I thought the book was great, and it has a lot of simplifying assumptions in it that you can use when you're trying to measure something intangible, like Information Security.
There is one thing that I have had a little trouble accepting, though, and that is the rule of five that he describes in one of the chapters. It says that if you randomly sample five people from a population for some value (such as how many hours of sleep they got last night), there is about a 93% chance that the median value for the whole population will fall between the largest and the smallest numbers in your sample.
If I remember correctly, it all starts with the premise that if you sample two people there is a 50% chance that the range of their two numbers will not include the median. Add a third person and there is another 50% chance. .5 x .5 = .25 so now there is only a 25% chance that the median does not fall in that range. A fourth person means we multiply .25 by .5 and get .125. Finally the fifth person brings us to a probability of .0625 that the median is not included in our range, which leaves 1 - .0625 = .9375, or about 93%, that it is. So I've always had a little trouble with the first statement, that there is a 50% chance that the first two numbers will include the median. I looked around the Internet and I haven't been able to find any other confirmation of the rule of five, except for other people citing Hubbard. So I decided I would try a couple of simple tests to see if this would work for me in theory.
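The arithmetic above can be sketched in a few lines of Python. This is my own restatement rather than anything from the book, and it assumes every draw is independent and has exactly a 50% chance of landing above the median and a 50% chance of landing below it (no ties). The function name is made up for illustration.

```python
def chance_range_contains_median(n):
    """Chance that the min..max range of n draws contains the median.

    Assumption: each draw independently lands above or below the
    population median with probability 0.5 (ties are ignored).
    """
    # The range misses the median only when all n draws fall on the same
    # side: all above (0.5 ** n) or all below (0.5 ** n).
    return 1 - 2 * 0.5 ** n


for n in range(2, 6):
    print(n, chance_range_contains_median(n))
```

Under those assumptions, two draws give 0.5 and five draws give 0.9375, which matches the halving described above.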
The first test was to see if I could reproduce the 50% chance of picking two numbers that include the median. I opened up my spreadsheet program and in the A column I put in this formula: =RANDBETWEEN(1,1000). I copied that down 1000 rows to get 1000 random numbers between 1 and 1000. This was my reference column. I copied the values and pasted them into column B and then deleted column A. That way the values won't keep changing every time I do some math on the page - if you're keeping track at home that means that a list of static numbers is now in column A. Then I put that same formula into columns B and C. This simulates the process of picking two numbers from the whole population.
If you're really paying attention, you will notice that I didn't actually choose two values from the reference column; I generated two more random numbers. So this isn't exactly the same, but I'm just trying to do a "back of the envelope" test here, and two fresh random numbers from the same range are probably close enough to two values picked from the list. In other words, I recognize that this isn't perfect, but it is close enough for my purposes. In column D I just put in one formula: =MEDIAN(a1:a1000). And it gave me the median value of my list of random numbers. In column E I put in this formula: =MAX(b1,c1). In column F I put in this formula: =MIN(b1,c1). So now I know that column E has the upper bound of my range and column F has the lower bound.
In column G I put in this formula: =IF(E1>=D$1,"1","0"), and in column H I put in =IF(D$1>=F1,"1","0"). So if the range includes the median, I will have a 1 in both column G and column H. In column I, I put =G1+H1. Copy these formulas all the way down and column I will have a 2 in it every time the range includes the median. BTW, if there is an easier way to do this I would love to hear about it. The last step was in cell J1 where I put =COUNTIF(I1:I1000,"=2"). If there is really a 50% chance then this should be pretty close to 500. What was my final number? I have to admit I was surprised to get 520. Not bad. Not proof, mind you, but definitely something to lend credence to the rule of five.
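For anyone who would rather not build the spreadsheet, here is a rough Python 3 sketch of the same test. It is not what I actually ran. The seed value and the function name are arbitrary choices for illustration, and like the spreadsheet it draws two fresh random numbers instead of picking from the reference list.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed so reruns give the same result

# Column A: the static reference list of 1000 random numbers.
reference = [rng.randint(1, 1000) for _ in range(1000)]
median = statistics.median(reference)  # column D


def count_pairs_containing(median, trials, rng):
    """Count freshly drawn pairs whose min..max range includes the median."""
    hits = 0
    for _ in range(trials):
        b = rng.randint(1, 1000)  # column B
        c = rng.randint(1, 1000)  # column C
        # Columns E through I collapse into one chained comparison.
        if min(b, c) <= median <= max(b, c):
            hits += 1
    return hits


hits = count_pairs_containing(median, 1000, rng)  # cell J1
print(hits, "of 1000 pairs contain the median")
```

The chained comparison does the work of columns E through I in one step, so a count somewhere near 500 is again what a 50% chance would suggest.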
For my next experiment, I decided to get a little more fancy. I whipped up the following Python script. In a nutshell, it creates a population of 1001 random numbers. Then it creates 500 independent random samples of five, checks to see if the median of the population falls within each sample, and prints out the fraction of successes. I ran this bad boy and got 96.2%. So after all of this, I have to say I'm feeling pretty good about the rule of five, even if I can't find any independent verification of it.
from __future__ import division
import random


def intherange(median, sample):
    # Return 1 if the median lies between the sample's min and max, else 0.
    sample.sort()
    if sample[-1] >= median:
        if sample[0] <= median:
            return 1
    return 0


population = []
sample = []

# create a list of 1001 random numbers
for i in range(1, 1002):
    population.append(random.randint(1, 1000))

# sort the population and get the median
# don't forget to offset by one or you'll get one number above
# the median.
population.sort()
median = population[500]

# Let's take 500 samples and see what we get each time.
successes = 0
for i in range(1, 501):
    sample = random.sample(population, 5)
    successes += intherange(median, sample)

print(successes / 500)
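A single run of 500 samples will bounce around from one run to the next, so one way to put the 96.2% in context is to repeat the whole experiment a few times and look at the spread. The sketch below is a Python 3 variation on the script above, not the code I ran. The seed, the number of repeats, and the helper name are all arbitrary choices for illustration, and the standard error line uses the usual binomial approximation with p = 0.9375.

```python
import math
import random


def hit_rate(population, median, sample_size, trials, rng):
    """Fraction of random samples whose min..max range contains the median."""
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        if min(sample) <= median <= max(sample):
            hits += 1
    return hits / trials


rng = random.Random(5)  # arbitrary fixed seed, only for repeatability
population = sorted(rng.randint(1, 1000) for _ in range(1001))
median = population[500]  # middle element of 1001 sorted values

# Ten runs of 500 samples each, to see how much one run can vary.
rates = [hit_rate(population, median, 5, 500, rng) for _ in range(10)]
print(rates)

# Rough binomial standard error for a single 500-sample run.
p = 0.9375
standard_error = math.sqrt(p * (1 - p) / 500)
print(standard_error)
```

The standard error works out to roughly 0.011, so individual runs landing a couple of percentage points either side of 93.75% would not be unusual.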