Re: what is probability to create two equal hashes for md5 algorithm



Sergei schrieb:
Joseph Ashwood wrote:
"Dmitry Chumack" <saint.d.a@xxxxxxxxx> wrote in message
news:1165339092.911172.123630@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi *

I have a question. There is few doesns of __terabytes__ of data. I need
to split this data to blocks of fixed size.
How big are the blocks?

Than I need to calculate a
sha1 hash for each block and store it. The question is which minimal
size of block I have to choose to minimize the probability of existence
of two equal hashes?
The probability is minimal when the hash is equal (in value) to the block.
Other than that, if the hash is statistically good, the longer the hash, the
less likely the collision probability.
Of course, CPU time for search increases linearly with the hash length.

Btw, what do you want to do with the whole thing?
Fast matches? Fast searches? Anything cryptographic?
If it's just for matches and the blocks aren't very large, it's probably
cheaper in terms of CPU time to do a crc32 and compare the data in the
few cases the hashes match.


What probability of such a coincidence would be if
I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
md5 and crc32)?
Since you're talking about 2 generated values colliding, in a cryptographic
hash it takes on average 1/sqrt(2^output size), for MD5 you'll need about
2^64, and SHA about 2^80 hashes generated before you collide. Considering
the size of your data pile, you'll probably get collisions sooner because
there is almost certainly duplicate data.
[...]
How do you get such estimations for MD5 (2^60) and SHA-1 (2^80)? As I
understand, such numbers can be given for truly random hash functions
and MD5, SHA are just the algorithms that try to emulate such things.
If we knew that they behaved differently, we'd be designing different
hashes.
Consider MD5 and SHA-1 close enough to the ideal that such subtleties
don't matter.
Btw, MD5 is "cryptographically broken" in the sense that you can construct 2
blocks that have equal hashes, but it doesn't change the statistics very
much. However, since you seriously consider crc32, the "broken" state of
MD5 is of no concern to you.
The things that matter for you (IMHO) are:
- How much (cpu cycles) does it cost to compute the hash?
- How much does it cost to compare two hashes?
- How much does it cost to compare two blocks? (Is I/O time relevant too?)
- How often do two blocks have to be compared?
- How often do two hashes have to be compared?
- How often do you have to compute hashes (i.e. how often does your data
change)?
- Where do you get the hash from that you use to match the data blocks
against? Does it cost extra?

With a couple of days collecting the performance data and playing with
excel you should be able to choose the optimum hash. By the way, it's
possible too, to calculate the full MD5 hash and then to store a few
bits only, if that speeds up your searches. Makes sense only for hash
sizes larger than the crc hashes.

Lots of Greetings!
Volker
--
For email replies, please substitute the obvious.
.



Relevant Pages

  • Re: ACCEPT and the SCREEN SECTION.
    ... General-purpose hashes, for hash tables and similar data structures, ... In short, cryptographic hashes like MD5 are used to verify data, ... if a client asks for and receives a chunk ...
    (comp.lang.cobol)
  • Re: password length
    ... ]>]The short answer is "Different encryption ... ]>based hash, 128 bits in the case of the MD5 based hash. ... ]>Ie, the password algorithms are not encryptions, they are hashes. ...
    (alt.os.linux.suse)
  • Re: Efficient MD5 (or similar) hashes
    ... > From reading the python docs, md5 reads the entire file as a string. ... > That's not practical on a 1 GB file that's network mounted. ... file and the MD5 hash to your machine and re-check the hash and compare. ...
    (comp.lang.python)
  • RE: Insecure Hash Algorithms (MD5) and NTLMv2
    ... Generating arbitrary malware that produces the same hash (MD5 or any ... infinite universe of cypher texts. ... MD5, by definition, produces hashes from ...
    (Pen-Test)
  • Re: BSDstats Project v2.0 ...
    ... With my new script it took only 158 minutes to compute ALL ... All I need to do is grep your hash and ... example: ifconfig |md5. ... I've attached the script used to make the hashes. ...
    (freebsd-questions)