Re: message digest of large files

From: Volker Hetzer (volker.hetzer_at_ieee.org)
Date: Thu, 18 Aug 2005 10:10:36 +0200

James Whitwell wrote:
> Kristian Gjøsteen wrote:
>
>> James Whitwell <jams@momento.com.au> wrote:
>>
>>> We're trying to use message digests as a way to uniquely identify
>>> large binary files (around 50-60MB). Is there a limit to the size of
>>> the file that we feed through, say SHA1?
>>
>> Yes, there is a limit, but I believe it is 2^64 bits or something
>> like that, so there is no need to worry. You should check the
>> relevant standard (FIPS 180-1).
>>
>
> Thanks for everyone's replies, I'm reading the FIPS 180-1 standard now,
> hopefully it'll sink in.
>
> Is there a way of determining what the chances of two hashes for
> different files being the same are? My reasoning is that, if I have
> lots of large files (say 10000 files of 50MB each), and the hash is only
> 160 bits long, surely I'll get a collision fairly quickly? That's why I
> thought chopping up my files into smaller chunks and generating a hash
> for each chunk, then concatenating the hashes together to form a large
> unique ID would help me avoid collisions. The files are PDFs that have
> been encrypted using Blowfish, so I'd assume they're pretty random.
The likelihood of collisions is independent of the message length.
With 10000 files, the likelihood of two files producing the same 160-bit hash is
about 0.000000000000000000000000000000000000003%.
You would need about 10^23 files to have a 0.3% chance of a collision.
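For what it's worth, here is a quick back-of-the-envelope check (a sketch of my
own, not something from FIPS 180-1), using the usual birthday-bound
approximation p ~= 1 - exp(-n*(n-1)/2^(b+1)) for n files and a b-bit hash,
assuming the hash behaves like a random function on its output:

    from math import expm1

    def collision_probability(n, bits=160):
        # Birthday-bound approximation: 1 - exp(-n*(n-1)/2^(bits+1)).
        # -expm1(-x) stays accurate when x is extremely small.
        return -expm1(-n * (n - 1) / 2.0 ** (bits + 1))

    print(collision_probability(10000))    # ~3.4e-41, i.e. ~3e-39 %
    print(collision_probability(10**23))   # ~0.0034, i.e. ~0.3 %

So chopping the files into chunks and concatenating per-chunk hashes buys you
nothing here; a single 160-bit digest per file is already far below any
practical collision risk.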

Lots of Greetings!
Volker