binary file compare...

 
 
Martin
 
      04-15-2009
On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
<(E-Mail Removed)> wrote:
> The checksum does look at every byte in each file. Checksumming isn't a
> way to avoid looking at each byte of the two files, it is a way of
> mapping all the bytes to a single number.


My understanding of the original question was that it asked for a way to
determine whether 2 files are equal or not. Creating a checksum of 1-n files
and comparing those checksums IMHO is a valid way to do that. I know it's
a (one-way) mapping between a (possibly) longer byte sequence and
another one, so how does checksumming not take each byte in the original
sequence into account?

I'd still say it's better to burn CPU cycles than development hours (if I
got the question right). If not, then with binary files you will have to
find some way of representing the differences between the 2 files in a
readable manner anyway.

> Hashing is a *lot* more work than just comparing two bytes. The MD5
> checksum has been specifically designed to be fast and compact, and the
> algorithm is still complicated:


I know that the various checksum algorithms aren't exactly cheap, but
I do think that, just to know whether 2 files are different, a solution
which takes 5 minutes to implement wins hands down against a lengthy
discussion which optimizes too early.
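
For reference, the checksum approach described here might look roughly like the sketch below. It assumes SHA-256 from hashlib and reads in chunks so large files need not fit in memory; the helper names `digest_of` and `same_digest` are made up for illustration and do not come from this thread.

```python
import hashlib

def digest_of(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of the file at `path` (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_digest(path_a, path_b):
    """True if both files hash to the same value.

    Note that every byte of both files is still read to compute the digests.
    """
    return digest_of(path_a) == digest_of(path_b)
```

Equal digests indicate equal contents only up to the collision resistance of the chosen hash, which is the point being argued in the rest of the thread.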

regards,
martin

--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
 
 
Nigel Rantor
 
      04-15-2009
Martin wrote:
> On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
> <(E-Mail Removed)> wrote:
>> The checksum does look at every byte in each file. Checksumming isn't a
>> way to avoid looking at each byte of the two files, it is a way of
>> mapping all the bytes to a single number.

>
> My understanding of the original question was that it asked for a way to
> determine whether 2 files are equal or not. Creating a checksum of 1-n files
> and comparing those checksums IMHO is a valid way to do that. I know it's
> a (one-way) mapping between a (possibly) longer byte sequence and
> another one, so how does checksumming not take each byte in the original
> sequence into account?


The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.
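
One way to keep the convenience of hashes without trusting them for the final answer is to treat an equal digest only as a hint and confirm it with a real comparison. A rough sketch, assuming SHA-256 from hashlib for the grouping and `filecmp.cmp` with `shallow=False` for the confirmation; `find_duplicates` is a hypothetical helper name:

```python
import filecmp
import hashlib
from collections import defaultdict

def find_duplicates(paths):
    """Return pairs of paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            # Reads the whole file at once; fine for a sketch, not for huge files.
            by_digest[hashlib.sha256(f.read()).hexdigest()].append(path)

    duplicates = []
    for candidates in by_digest.values():
        for i, first in enumerate(candidates):
            for second in candidates[i + 1:]:
                # An equal digest only nominates the pair; the bytes decide.
                if filecmp.cmp(first, second, shallow=False):
                    duplicates.append((first, second))
    return duplicates
```

With this arrangement a hash collision costs one extra comparison instead of producing a wrong answer.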

n

 
 
Nigel Rantor
 
      04-15-2009
Grant Edwards wrote:
> We all rail against premature optimization, but using a
> checksum instead of a direct comparison is premature
> unoptimization.


And more than that, it will provide false positives for some inputs.

So, basically it's a worse-than-useless approach for determining if two
files are the same.

n
 
 
SpreadTooThin
 
      04-15-2009
On Apr 15, 8:04 am, Grant Edwards <invalid@invalid> wrote:
> On 2009-04-15, Martin <(E-Mail Removed)> wrote:
>
>
>
> > Hi,

>
> > On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid@invalid> wrote:
> >> On 2009-04-13, SpreadTooThin <(E-Mail Removed)> wrote:

>
> >>> I want to compare two binary files and see if they are the same.
> >>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
> >>> that it is doing a byte by byte comparison of two files to see if they
> >>> are the same.

>
> >> Perhaps I'm being dim, but how else are you going to decide if
> >> two files are the same unless you compare the bytes in the
> >> files?

>
> > I'd say checksums, just about every download relies on checksums to
> > verify you do have indeed the same file.

>
> That's slower than a byte-by-byte compare.
>
> >> You could hash them and compare the hashes, but that's a lot
> >> more work than just comparing the two byte streams.

>
> > hashing is not exactly much work; in its simplest form it's 2
> > lines per file.

>
> I meant a lot more CPU time/cycles.
>
> --
> Grant Edwards                   grante             Yow! Was my SOY LOAF left
>                                   at               out in th'RAIN?  It tastes
>                                visi.com            REAL GOOD!!


I'd like to add my 2 cents here... (That's 1.8 cents US.)
All I was trying to get was a clarification of the documentation of
the filecmp.cmp function.
It isn't clear.

A byte-by-byte comparison is good enough for me as long as there are no
cache issues.
A simple additive checksum is not good because it can't tell reordered
bytes apart: 1 + 2 + 3 == 3 + 2 + 1.
A CRC of any sort is more work than a byte-by-byte comparison and
doesn't give you any more information.
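
On the documentation question: as far as I know, `filecmp.cmp` defaults to `shallow=True`, in which case two files whose `os.stat()` signatures (file type, size, modification time) match are reported equal without their contents being read. Passing `shallow=False` makes it compare contents. The sketch below shows that call next to a hand-rolled chunked comparison; `files_equal` and `stdlib_equal` are hypothetical helper names, not part of the standard library.

```python
import filecmp
import os

def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Byte-by-byte comparison that stops at the first differing chunk."""
    # Files of different sizes cannot be equal, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached end of file together
                return True

def stdlib_equal(path_a, path_b):
    """Same question answered by the standard library, forcing a content check."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Regarding cache issues: filecmp remembers earlier results keyed on the files' stat signatures, and `filecmp.clear_cache()` empties that cache if it is a concern.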


 
 
Adam Olsen
 
      04-15-2009
On Apr 15, 11:04 am, Nigel Rantor <(E-Mail Removed)> wrote:
> The fact that two md5 hashes are equal does not mean that the sources
> they were generated from are equal. To do that you must still perform a
> byte-by-byte comparison which is much less work for the processor than
> generating an md5 or sha hash.
>
> If you insist on using a hashing algorithm to determine the equivalence
> of two files you will eventually realise that it is a flawed plan
> because you will eventually find two files with different contents that
> nonetheless hash to the same value.
>
> The more files you test with the quicker you will find out this basic truth.
>
> This is not complex, it's a simple fact about how hashing algorithms work.


The only flaw in a cryptographic hash is the increasing number of
attacks that are found on it. You need to pick a trusted one when you
start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.
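
To put a rough number on "extraordinarily rare": under the usual idealisation that a hash behaves like a uniformly random function with b output bits, the birthday bound says the chance of any collision among n distinct files is at most n(n-1)/2 divided by 2^b. A small sketch of that arithmetic follows; the function name and the file count of one billion are illustrative assumptions, and the bound says nothing about deliberately constructed collisions.

```python
from fractions import Fraction

def collision_bound(num_files, hash_bits):
    """Upper bound on the probability of any accidental collision among
    num_files distinct inputs, assuming an ideal hash with hash_bits of output."""
    pairs = num_files * (num_files - 1) // 2
    # A probability cannot exceed 1, so cap the bound there.
    return min(Fraction(1), Fraction(pairs, 2 ** hash_bits))

for bits in (32, 128, 256):
    bound = collision_bound(10 ** 9, bits)
    print(f"{bits:3d}-bit hash, 1e9 files: bound = {float(bound):.3g}")
```

For a billion files the bound is useless at 32 bits (it saturates at 1), but for a 128-bit hash it comes out at roughly 1.5e-21.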
 
 
Nigel Rantor
 
      04-15-2009
Adam Olsen wrote:
> The chance of *accidentally* producing a collision, although
> technically possible, is so extraordinarily rare that it's completely
> overshadowed by the risk of a hardware or software failure producing
> an incorrect result.


Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an
error waiting to happen.

But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.

n
 
 
Lawrence D'Oliveiro
 
      04-18-2009
In message <(E-Mail Removed)>, Nigel
Rantor wrote:

> Adam Olsen wrote:
>
>> The chance of *accidentally* producing a collision, although
>> technically possible, is so extraordinarily rare that it's completely
>> overshadowed by the risk of a hardware or software failure producing
>> an incorrect result.

>
> Not when you're using them to compare lots of files.
>
> Trust me. Been there, done that, got the t-shirt.


Not with any cryptographically-strong hash, you haven't.

 