Re: String hashing (was: Thou shalt have no other gods before the ANSI C standard)

From: CBFalconer (cbfalconer_at_yahoo.com)
Date: Wed, 02 Mar 2005 03:03:19 GMT

websnarf@gmail.com wrote:
> Tim Peters wrote:
>
... snip ...
>
>> I suspect you're focusing solely on microseconds per hash. That's
>> fine, if that's all that matters. Then the speed of FNV depends
>> on how fast your HW integer multiplication is, or on how much
>> effort you're willing to pour into micro-optimizing.
>
> Ok, I don't believe FNV has any serious claim to having better
> collision properties (which, of course, is the first requirement for
> good hashing performance) than either my hash or Bob Jenkins' hash,
> or Colin Plumb's One-at-a-time hash. I have assumed, and I think
> conservatively (since I know for sure mine has them), that we all
> made good hash functions with essentially "best possible" collision
> properties. If I am allowed to make this assumption, then in fact the
> "microseconds per hash" *IS* the only thing remaining that matters.
>
> You may be jaded because CRCs, checksums, rotations, this "31/33" hash,
> and various other ad hoc solutions have serious collision problems.
> That doesn't mean that creation of good hash functions (for
> non-adversarial environments) is somehow magical or exceedingly
> difficult to do. That the FNV or the Bob Jenkins hash functions took a
> very convoluted route to achieve good collision properties is not
> evidence of the need to take such a route. Pearson, Zobrist,
> One-at-a-time, and my hash function are evidence that you can take
> much simpler approaches to generate a hash function with good
> properties. Mine just happens to be the one built around a particular
> very high performance inner core that was then tuned to have the
> required properties. About the only thing remarkable about the FNV
> hash is that it seems to scale to 64-, 128-, etc. bit hashes -- but
> without a claim of good crypto, the utility of this escapes me.

I just downloaded your test program, and I see several
difficulties. First, it is sadly (and unnecessarily) lacking in
portability: it uses plain int in many places where an unsigned
long is needed, and there is no guarantee that an int has 32 bits.
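
Taking the One-at-a-time hash mentioned above as an example, a
portable rendering looks roughly like this (an untested sketch,
not taken from your code):

#include <stddef.h>

#define MASK32 0xffffffffUL  /* keep intermediate results in 32 bits */

/* One-at-a-time hash written with unsigned long, which the
   standard guarantees to hold at least 32 bits.  Plain int need
   only hold 16.  The masking makes the results identical on
   machines where long is wider than 32 bits. */
unsigned long one_at_a_time(const unsigned char *key, size_t len)
{
   unsigned long h = 0;
   size_t i;

   for (i = 0; i < len; i++) {
      h = (h + key[i]) & MASK32;
      h = (h + (h << 10)) & MASK32;
      h ^= h >> 6;
   }
   h = (h + (h << 3)) & MASK32;
   h ^= h >> 11;
   h = (h + (h << 15)) & MASK32;
   return h;
}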

Second, it tests the wrong things. It creates an arbitrary binary
byte array 256 bytes long, which is an unusually long key for any
hash table use. In addition, most real keys consist of printable
characters, not arbitrary bytes. If the key length is lowered to 5
or 10, the efficacy of the times31 or times33 hash shows up, and
SuperFastHash is relegated to a much lower speed. SFH is probably
better suited to long binary keys.
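
For reference, the times33 hash I mean is nothing more than the
following (times31 is the same thing with a different multiplier);
just a sketch, to show how little work it does per character:

#include <stddef.h>

#define MASK32 0xffffffffUL

/* The classic "times 33" string hash.  On short strings of
   printable characters the loop body is about as cheap as a
   hash step can get. */
unsigned long times33(const unsigned char *s)
{
   unsigned long h = 0;

   while (*s)
      h = (h * 33UL + *s++) & MASK32;
   return h;
}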

Spread of the hashes over the full 32 bit range is not important.
What matters is the spread over the current size of the hash
table, which had better be nowhere near 2**32 entries. That spread
is normally attained by masking or by a modulo operation. As ever,
the optimum hash function depends on the particular data to be
hashed; it simply cannot be optimized on its own.
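
In other words the only spread that counts is the one left after
the final reduction, which amounts to something like this
(illustrative only):

#include <stddef.h>

/* Reduce a full width hash to an index into a table of 'size'
   slots (size must be at least 1).  A mask suffices when size is
   a power of two; otherwise take the remainder. */
size_t table_index(unsigned long hash, size_t size)
{
   if ((size & (size - 1)) == 0)        /* power of two */
      return (size_t)(hash & (size - 1));
   return (size_t)(hash % size);
}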

As soon as I find a round tuit I expect to try the various
functions out on real data with a real hash table. My hashlib will
supply the table (go to my site for a copy). I expect to use a
text version of N869 that I have here as the data sample; it is
roughly 1 meg of text. I can store either words, defined as things
that end with white space, or lines, being things that end with
\n. The hashlib instrumentation will tell a lot about the efficacy
of the hash functions ON THAT DATA.
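
The tokenizing itself is trivial; roughly the following, where
store() merely stands in for the actual hashlib insertion call,
which I am not showing here:

#include <stdio.h>
#include <string.h>

/* Feed the sample file to the table a word at a time (whitespace
   delimited) or a line at a time.  store() is a stand-in for the
   real insertion routine. */
static void feed_words(FILE *f, void (*store)(const char *))
{
   char buf[256];

   while (fscanf(f, "%255s", buf) == 1)
      store(buf);
}

static void feed_lines(FILE *f, void (*store)(const char *))
{
   char buf[1024];

   while (fgets(buf, sizeof buf, f)) {
      buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing \n */
      store(buf);
   }
}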

I will have to think about how to eliminate the biases introduced
by the file reading itself. Virtual memory thrashing will show up
as a sharp change in the time versus size measurements.

-- 
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
   <http://cbfalconer.home.att.net>  USE worldnet address!

