Re: what is probability to create two equal hashes for md5 algorithm
- From: Volker Hetzer <firstname.lastname@xxxxxxxx>
- Date: Wed, 06 Dec 2006 19:20:15 +0100
Sergei schrieb:
Joseph Ashwood wrote:How big are the blocks?"Dmitry Chumack" <saint.d.a@xxxxxxxxx> wrote in message
news:1165339092.911172.123630@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi *
I have a question. There is few doesns of __terabytes__ of data. I need
to split this data to blocks of fixed size.
Than I need to calculate aThe probability is minimal when the hash is equal (in value) to the block.sha1 hash for each block and store it. The question is which minimal
size of block I have to choose to minimize the probability of existence
of two equal hashes?
Other than that, if the hash is statistically good, the longer the hash, the
less likely the collision probability.
Of course, CPU time for search increases linearly with the hash length.
Btw, what do you want to do with the whole thing?
Fast matches? Fast searches? Anything cryptographic?
If it's just for matches and the blocks aren't very large, it's probably
cheaper in terms of CPU time to do a crc32 and compare the data in the
few cases the hashes match.
What probability of such a coincidence would be if[...]I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'llSince you're talking about 2 generated values colliding, in a cryptographic
use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
md5 and crc32)?
hash it takes on average 1/sqrt(2^output size), for MD5 you'll need about
2^64, and SHA about 2^80 hashes generated before you collide. Considering
the size of your data pile, you'll probably get collisions sooner because
there is almost certainly duplicate data.
How do you get such estimations for MD5 (2^60) and SHA-1 (2^80)? As IIf we knew that they behaved differently, we'd be designing different
understand, such numbers can be given for truly random hash functions
and MD5, SHA are just the algorithms that try to emulate such things.
hashes.
Consider MD5 and SHA-1 close enough to the ideal that such subtleties
don't matter.
Btw, MD5 is "cryptographically broken" in the sense that you can construct 2
blocks that have equal hashes, but it doesn't change the statistics very
much. However, since you seriously consider crc32, the "broken" state of
MD5 is of no concern to you.
The things that matter for you (IMHO) are:
- How much (cpu cycles) does it cost to compute the hash?
- How much does it cost to compare two hashes?
- How much does it cost to compare two blocks? (Is I/O time relevant too?)
- How often do two blocks have to be compared?
- How often do two hashes have to be compared?
- How often do you have to compute hashes (i.e. how often does your data
change)?
- Where do you get the hash from that you use to match the data blocks
against? Does it cost extra?
With a couple of days collecting the performance data and playing with
excel you should be able to choose the optimum hash. By the way, it's
possible too, to calculate the full MD5 hash and then to store a few
bits only, if that speeds up your searches. Makes sense only for hash
sizes larger than the crc hashes.
Lots of Greetings!
Volker
--
For email replies, please substitute the obvious.
.
- References:
- what is probability to create two equal hashes for md5 algorithm
- From: Dmitry Chumack
- Re: what is probability to create two equal hashes for md5 algorithm
- From: Joseph Ashwood
- Re: what is probability to create two equal hashes for md5 algorithm
- From: Sergei
- what is probability to create two equal hashes for md5 algorithm
- Prev by Date: Re: what is probability to create two equal hashes for md5 algorithm
- Next by Date: Re: group signature
- Previous by thread: Re: what is probability to create two equal hashes for md5 algorithm
- Next by thread: Re: what is probability to create two equal hashes for md5 algorithm
- Index(es):
Relevant Pages
|