Re: String hashing (was: Thou shalt have no other gods before the ANSI C standard)
Date: 1 Mar 2005 22:10:25 -0800
> email@example.com wrote:
> > Tim Peters wrote:
> ... snip ...
> > Ok, I don't believe FNV has any serious claim to having better
> > collision properties (which, of course, is the first
> > requirement for good hashing performance) than either my hash
> > or Bob Jenkins' hash, or Colin Plumb's One-at-a-time hash. I
> > have assumed, and I think conservatively (since I know for
> > sure mine has them), that we all made good hash functions
> > with essentially "best possible" collision properties. If I
> > am allowed to make this assumption, then in fact the
> > "microseconds per hash" *IS* the only thing remaining that
> > matters.
> > You may be jaded because CRCs, checksums, rotations, this
> > "31/33" hash, and various other ad hoc solutions have serious
> > collision problems. That doesn't mean that creation of good
> > hash functions (for non-adversarial environments) is somehow
> > magical or exceedingly difficult to do. That the FNV or the
> > Bob Jenkins hash functions took a very convoluted route to
> > achieve good collision properties, is not evidence of the
> > need to take such a route. Pearson, Zobrist, One-at-a-time
> > and my hash function are evidence of the fact that you can
> > take much simpler approaches to generate a hash function with
> > good properties. Mine just happens to be the one built
> > around a particular very high performance inner core that was
> > then tuned to have the required properties. About the only
> > thing remarkable about the FNV hash is that it seems to scale
> > to 64, 128, etc bit hashes -- but without a claim of good
> > crypto, the utility of this escapes me.
> I just downloaded your test program, and I see various
> difficulties. First, it is sadly (and unnecessarily) lacking in
> portability, using ints in many places where an unsigned long is
> needed. (There is no guarantee that an int has 32 bits).
This is not comp.lang.c. The dead ANSI committee dogma has no audience
here. (Well, it does have one -- some trolling cross-posters -- but I
don't think your blind adherence to a non-existent CLC charter should
spill over into the crypto universe.)
The fact is that *MANY* hash functions don't work correctly at *all*
unless they can have some sort of guarantee about the exact number of
bits they are operating on (not some minimum number). Many also do not
function correctly unless they can rely on two's complement and/or a
sign-extending right shift. Hashing/crypto tends to need this sort of
exactness. The much-needed <stdint.h> has come too late for people to
notice it. That means you get to see code like those hash functions,
and you deal with it as-is, because nobody is paying attention to
obsolete issues like this. (Or you deal with Java.)
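For what it's worth, here is a minimal sketch (assuming a C99
<stdint.h>) of the kind of exact-width declaration this is about; the
times-33 body is just a stand-in for whatever inner step a real hash
uses:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: the classic "times 33" inner loop written with an
       exact-width uint32_t, so the mod-2^32 wrap-around is identical
       on every platform, instead of a plain int of unknown width. */
    static uint32_t hash33 (const uint8_t *key, size_t len)
    {
        uint32_t h = 5381;   /* common seed for this hash family */
        size_t i;
        for (i = 0; i < len; i++)
            h = h * 33 + key[i];
        return h;            /* wraps mod 2^32 by definition */
    }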
> Second, it tests the wrong things. It creates an arbitrary
> binary byte array of length 256. This is an unusually long
> string for any hash table use. In addition, most real strings
> will consist of printable characters, not arbitrary bytes.
You should search for the last time I was discussing "strings". It
probably had to do with something completely different (and by now I
think you can guess what it might be). First of all, strings are not
the only things that can be, or need to be, hashed. Second, UTF-8 -- a
very popular encoding of Unicode into octets -- produces strings that
can be hashed directly. Third, the functions I tested are totally
insensitive to ASCII versus non-ASCII. And finally, as you may well
know, in my universe strings are just not limited to the ASCII
character set.
> [...] If the string length is lowered to 5 or 10
Excuse me? You can't even put someone's full name into strings that
short. What are you doing? Trying to solve Boggle, or to get a low
score at Scrabble, or something? Oh, I get it. You are trying to write
a compiler -- where hashing is supposedly your bottleneck. *sigh*.
> [...] the efficacy of the times31 or times33 hash
> shows up, and the SuperFastHash is relegated to a much lower
> speed. The SFH is probably well suited for binary keys of long length.
> Spread of the hashes over the 32 bit spectrum is not important.
How are you planning to deal with rehashing, or quick verification?
You see, you *could* call your slower hash function a second time. Or
you could just rip off more bits from your primary hash calculation,
if it outputs enough bits in the first place.
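To make the "rip off more bits" point concrete, a hedged sketch (the
names are mine, purely illustrative): with a power-of-two table, the
low bits of one strong 32-bit hash pick the slot and the high bits,
forced odd, supply the probe stride -- no second hash function call
needed:

    #include <stdint.h>

    /* Sketch: double hashing from ONE 32-bit hash value.  mask is
       table_size - 1, table_size a power of two.  The stride is
       forced odd so it is coprime to the table size and the probe
       sequence visits every slot. */
    static uint32_t probe (uint32_t h, uint32_t attempt, uint32_t mask)
    {
        uint32_t slot   = h & mask;               /* low bits  */
        uint32_t stride = ((h >> 16) | 1) & mask; /* high bits */
        return (slot + attempt * stride) & mask;
    }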
> The important thing is the spread over the current size of the
> hashtable, which had better not be a 32 bit sized item. This is
> normally attained by either masking or modulo operations. As
> ever, the optimum hash function depends on the particular data
> to be hashed. It simply cannot be optimized on its own.
Bob Jenkins, the FNV guys, Colin Plumb and I must be so deluded ...
> As soon as I find a round tuit I expect to try the various
> functions out on real data with a real hash table. My hashlib
> will supply the table (go to my site for a copy).
I've seen your hashlib implementation. It's a closed-rehash thing that
is insensitive to L1 cache usage, and it can't be used in hard
real-time applications because it rebuilds the hash table as needed
while it fills.
But, I'm sure, in those cases where you are hash function limited, SFH
will work just fine.
About time you actually tested various hash functions for it, don't
you think?
> [...] I expect to use a text version of N869 that I have here
> as the data sample, it is roughly 1 meg of text. I can store
> either words, defined as things that end with white space,
If you were doing a first cut at implementing google or m-w.com or
something, there might be some logic to this, but the right answer is
still to use length-delimited strings and completely *unroll* the hash
function for each possible word size, as sketched below. I don't think
the 31/33 hash would fare too well in such a situation.
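A sketch of what I mean, under the assumption of length-delimited keys
(the times-33 body here is just a placeholder inner step, not any
function in particular):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: dispatch on the key length once; each short case has a
       fixed trip count the compiler can fully schedule, with no loop
       overhead at all.  Real code would cover more lengths. */
    static uint32_t hash_word (const uint8_t *k, size_t len)
    {
        uint32_t h = 5381;
        size_t i;
        switch (len) {
        case 3: h = h * 33 + k[0];
                h = h * 33 + k[1];
                return h * 33 + k[2];
        case 4: h = h * 33 + k[0];
                h = h * 33 + k[1];
                h = h * 33 + k[2];
                return h * 33 + k[3];
        /* ... one case per common word length ... */
        default:
            for (i = 0; i < len; i++)
                h = h * 33 + k[i];
            return h;
        }
    }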
But did you know that you can encode the *entire* English language in a
"trie" data structure that fits in the L2 cache of most modern CPUs?
That kind of makes the whole idea of "hashing words" seem a little
silly, especially if you are concerned with performance. Why not just
map straight out of a trie?
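For flavor, a naive sketch of the idea (the layout is mine and
deliberately uncompressed; a genuinely cache-resident dictionary would
share suffixes, DAWG-style, but the lookup is the same: one array step
per character, no hash, no collisions):

    #include <stdint.h>

    /* Sketch: a flat-array trie over lowercase a-z.  Node 0 is the
       root; child index 0 means "no such child". */
    struct trie_node {
        uint32_t child[26];   /* index into the node pool */
        uint32_t is_word;     /* nonzero if a word ends here */
    };

    static int trie_lookup (const struct trie_node *pool,
                            const char *word)
    {
        uint32_t n = 0;
        for (; *word; word++) {
            n = pool[n].child[*word - 'a'];
            if (n == 0)
                return 0;
        }
        return pool[n].is_word != 0;
    }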
In any event, if you just want to test hash function collisions, be
sure to include mixed-case alphanumerics. I think the red lights on the
31/33 hash function will be pretty hard to ignore. SFH is more than
ready for just about any real-world data.
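If you want a quick way to look for those red lights, something along
these lines works (a throwaway sketch; the key pattern is arbitrary):
hash a batch of short mixed-case alphanumeric keys, mask down to a
small table, and compare the collision count against the
birthday-problem expectation:

    #include <stdint.h>
    #include <stdio.h>

    enum { TBITS = 16, TSIZE = 1 << TBITS, NKEYS = 20000 };

    static uint32_t hash33 (const char *s)
    {
        uint32_t h = 5381;
        while (*s)
            h = h * 33 + (uint8_t) *s++;
        return h;
    }

    int main (void)
    {
        static uint8_t bucket[TSIZE];   /* zero-initialized */
        char key[32];
        int i, collisions = 0;
        uint32_t b;
        for (i = 0; i < NKEYS; i++) {
            sprintf (key, "Id%dXy", i); /* mixed-case alnum keys */
            b = hash33 (key) & (TSIZE - 1);
            if (bucket[b]++)
                collisions++;
        }
        printf ("%d collisions in %d keys\n", collisions, NKEYS);
        return 0;
    }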
> [...] or lines, being things that end with \n.
That's more like it -- I hear that people's full names, street
addresses, and credit card numbers are an interesting application for
string processing too. Just don't call strlen() repeatedly.
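The fix for the strlen() trap is the usual one -- carry the length
with the string. A minimal sketch (the field names are mine):

    #include <stddef.h>
    #include <string.h>

    /* Sketch: compute the length once at creation; every later hash,
       compare, or copy reads the stored field instead of rescanning
       the bytes with strlen(). */
    struct lstring {
        size_t len;
        const char *data;
    };

    static struct lstring make_lstring (const char *s)
    {
        struct lstring r;
        r.len  = strlen (s);   /* the one and only strlen() call */
        r.data = s;
        return r;
    }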
> The hashlib instrumentation will tell a lot about the efficacy
> of the hash functions ON THAT DATA. I will have to think about
> how to eliminate the biases from file reading.
Typically you just read the file (or part of it) once, with
consideration for how large it is relative to your memory or cache
(depending on what you are testing), then run the test on the same
data over and over again. Some people like to take the median of three
test results -- I prefer the *best* of three (it's more stable,
believe it or not). It's not rocket science.
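In outline, something like this (a sketch; clock() is coarse, hence
the repetition, and hash_under_test stands in for whatever you are
measuring):

    #include <stddef.h>
    #include <time.h>

    /* Sketch: best-of-three timing.  Load the data once, hash the
       same buffer repeatedly, and keep the minimum of three trials;
       the minimum filters out scheduler and cache-warmup noise,
       which is why it is more stable than a median or a mean. */
    extern unsigned hash_under_test (const void *data, size_t len);

    static double secs_per_pass (const void *data, size_t len, int reps)
    {
        double best = 1e30, secs;
        clock_t t0;
        int trial, r;
        for (trial = 0; trial < 3; trial++) {
            t0 = clock ();
            for (r = 0; r < reps; r++)
                (void) hash_under_test (data, len);
            secs = (double)(clock () - t0) / CLOCKS_PER_SEC;
            if (secs < best)
                best = secs;
        }
        return best / reps;   /* best-of-three seconds per pass */
    }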
Good luck with it.
-- Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/