Re: what is probability to create two equal hashes for md5 algorithm



"Sergei" <silentser@xxxxxxxxx> writes:

Joseph Ashwood wrote:
"Dmitry Chumack" <saint.d.a@xxxxxxxxx> wrote in message
news:1165339092.911172.123630@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi *

I have a question. There is few doesns of __terabytes__ of data. I need
to split this data to blocks of fixed size. Than I need to calculate a
sha1 hash for each block and store it. The question is which minimal
size of block I have to choose to minimize the probability of existence
of two equal hashes? What probability of such a coincidence would be if
I'll use an md5, crc32 or sha2 algorithms instead of sha1? Or if I'll
use a combination of this algorithms (e.g. sha1 and crc32 or sha2 and
md5 and crc32)?

Since you're talking about 2 generated values colliding, in a cryptographic
hash it takes on average 1/sqrt(2^output size), for MD5 you'll need about
2^64, and SHA about 2^80 hashes generated before you collide. Considering
the size of your data pile, you'll probably get collisions sooner because
there is almost certainly duplicate data.

Here's how I'd do it, and I'd include encryption as well:
Encrypt everything with AES in CTR mode, the odds of a collision in the data
are now very small, about 1/16777216 for a petabyte of data. Then I'd use
SHA512 with 512KB chunks and build up a Merkle tree, this will save you
enormous time later on in figuring out what was changed or damaged. Assuming
1PB this will give you 2*10^9 leaf blocks, then holding to the 512KB chunk,
you can store ~8000 of them per chunk with proper indexing (this will be shy
of the 8192 apparent maximum if you implement the indexing that is necessary
for security). This gives 268435 chunks, building a third level of 34
chunks, and a root/top/fourth level that is tiny. For a total of about
2*10^9 chunks, 2*10^9
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2^256 (~10^77) which is the limit for SHA512. Each of these chunks will fit
conveniently on a floppy disk, 1200 of them to a CD, and I don't even want
to do the math on the other options. On a system with modern cpu you should
be able to achieve this in about a month of compute time (assuming a quad
core, scale appropriately), Tom's benchmark website
http://libtomcrypt.com:80/ltc113.html has more information to help you run
the numbers. The overall odds of a collision in this are in the neighborhood
of 1/10^68, which is quite frankly not gonna happen.

It will be reasonably secure though, so don't lose the AES key. Without the
AES encryption the odds of collision rise sharply, but assuming your data
cannot be compressed beyond a reasonable amount still ignorable, and if they
do collide you don't have to ship one of the blocks because they are
identical.
Joe

How do you get such estimations for MD5 (2^60) and SHA-1 (2^80)? As I
understand, such numbers can be given for truly random hash functions
and MD5, SHA are just the algorithms that try to emulate such things.

Because as far as is known, both are uniformly distributed functions.



Sergei

.



Relevant Pages

  • Re: what is probability to create two equal hashes for md5 algorithm
    ... sha1 hash for each block and store it. ... Encrypt everything with AES in CTR mode, the odds of a collision in the data ... SHA512 with 512KB chunks and build up a Merkle tree, ...
    (sci.crypt)
  • Re: what is probability to create two equal hashes for md5 algorithm
    ... sha1 hash for each block and store it. ... Encrypt everything with AES in CTR mode, the odds of a collision in the data ... SHA512 with 512KB chunks and build up a Merkle tree, ...
    (sci.crypt)
  • Re: what is probability to create two equal hashes for md5 algorithm
    ... sha1 hash for each block and store it. ... Encrypt everything with AES in CTR mode, the odds of a collision in the data ... SHA512 with 512KB chunks and build up a Merkle tree, ...
    (sci.crypt)
  • Re: message digest of large files
    ... > thought chopping up my files into smaller chunks and generating a hash ... The likelyhood of collisions is independend of the message length. ... With 10000 files the likelyhood of two files producing the same 160bit hash is ... With 10^23 files to have a 0.3% chance of a collision. ...
    (sci.crypt)
  • Re: How did they get behind my NAT?
    ... One reason I ask is that I downloaded Mandriva 2008, using the tracker on ... and both ktorrent and bittorrent stated that the download was ... were found to have large numbers of chunks invalid. ... I offered you a CD image of Windows Vista Ultimate with the hash ...
    (alt.computer.security)