I'm trying to figure out a fair weighting system for words in a file. Here's what I mean. Let's say I have a file that has 200 unique words, and the word 'foo' appears in it 5 times. Then I have another file that has 150 unique words, with 'foo' also appearing 5 times. The weights work out to 5/200 = 0.025 and 5/150 ≈ 0.033. Just because the second file has fewer words doesn't make it a better match for 'foo'; the two should be equal. Is there any way I can normalize the weights? I can't just use the raw number of occurrences, because there are other metrics involved. I hope this makes sense. Any help would be appreciated.
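
To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. It assumes the weight is simply the occurrence count divided by the file's unique-word count; the name `naive_weight` is hypothetical and only for illustration, and the other metrics are left out.

```python
def naive_weight(occurrences: int, unique_words: int) -> float:
    """Weight of a word: its occurrences divided by the file's unique-word count."""
    if unique_words <= 0:
        raise ValueError("unique_words must be positive")
    return occurrences / unique_words


# 'foo' appears 5 times in both files; only the unique-word counts differ.
weight_a = naive_weight(5, 200)  # file with 200 unique words
weight_b = naive_weight(5, 150)  # file with 150 unique words

print(f"{weight_a:.3f} {weight_b:.3f}")  # prints "0.025 0.033"
```

The smaller file scores higher even though 'foo' occurs the same number of times in both, which is the imbalance in question.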
