
Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Fast checksums?


Fast checksums?

 
 
George Adams
 
      02-13-2008
Hi, all. I'm trying to make a simple backup program for myself that
will check to see if certain files have been modified before backing them
up. (It's got to be portable to multiple OSes that have Perl, but
possibly not other handy tools like, say, rsync or some such.)

Anyway, I had originally planned to use the Windows archive bit, or else
file modification dates to determine if a file had been changed. That
turned out to be unreliable, though (and, in the case of the archive
bit, impossible on Linux). So I decided instead to create a checksum of
the original file, then compare future versions of the file with the
stored checksum to see if it's changed (and hence needs to be backed up).

This works... except it's really, really slow. I started with SHA-1,
but it was taking just too long. I switched to MD5 and then CRC32, but
even that was fairly slow. And when the backup directory contains
several gigs of files to check, it was just too much.

So, given the problem of "how can I tell if a file has been changed", am
I tackling it the wrong way? Should I be using some other method
besides simply hashing an entire file (using whatever algorithm) and
comparing it to a stored value? Obviously backup software makers have
solved this problem to make incremental backups pretty fast - what's the
trick to it?

Thanks to anyone who can help.
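(For reference, the whole-file hashing approach described above looks roughly like this in Perl -- a minimal sketch; `file_digest` and the `%stored` cache are illustrative names, and persisting the cache between runs is left out:)

```perl
use strict;
use warnings;
use Digest::MD5;

# MD5 hex digest of a file, streamed through addfile so large
# files never have to fit in memory at once.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Back up only files whose digest differs from the stored one.
my %stored;    # path => digest from the previous run
for my $file (@ARGV) {
    my $now = file_digest($file);
    if (!defined $stored{$file} or $stored{$file} ne $now) {
        print "$file needs backing up\n";
        $stored{$file} = $now;
    }
}
```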
 
A. Sinan Unur
 
      02-13-2008
George Adams <(E-Mail Removed)> wrote in
news:(E-Mail Removed):


> I tackling it the wrong way? Should I be using some other method
> besides simply hashing an entire file (using whatever algorithm) and
> comparing it to a stored value? Obviously backup software makers have
> solved this problem to make incremental backups pretty fast - what's the
> trick to it?


I don't know if there is a trick to it, but why not check the modification
time returned by stat?
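(As a sketch of that suggestion -- the `%last_seen` cache is assumed to be persisted between runs somehow, which isn't shown:)

```perl
use strict;
use warnings;

# stat returns a 13-element list; element 9 is mtime,
# the last-modification time in seconds since the epoch.
sub mtime_of {
    my ($path) = @_;
    my @st = stat $path or die "stat $path: $!";
    return $st[9];
}

my %last_seen;    # path => mtime recorded on the previous run
for my $file (@ARGV) {
    my $m = mtime_of($file);
    if (!defined $last_seen{$file} or $m != $last_seen{$file}) {
        print "$file modified since last run\n";
        $last_seen{$file} = $m;
    }
}
```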

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>

 
comp.llang.perl.moderated
 
      02-13-2008
On Feb 13, 2:25 pm, George Adams <(E-Mail Removed)>
wrote:
> [original problem description snipped]
>
> So, given the problem of "how can I tell if a file has been changed", am
> I tackling it the wrong way? Should I be using some other method
> besides simply hashing an entire file (using whatever algorithm) and
> comparing it to a stored value? Obviously backup software makers have
> solved this problem to make incremental backups pretty fast - what's the
> trick to it?


Maybe check the timestamp as well as the file size, with a fallback
to SHA-1 or MD5 only if time/size are unchanged.
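(A sketch of that scheme -- the `%seen` cache of `[size, mtime, digest]` per path is assumed to be loaded from a previous run, and updating it after a backup is left out:)

```perl
use strict;
use warnings;
use Digest::MD5;

my %seen;    # path => [size, mtime, digest] from the previous run

# Cheap stat checks first; hash only when size and mtime both
# look unchanged, to catch edits that preserved the timestamp.
sub needs_backup {
    my ($path) = @_;
    my ($size, $mtime) = (stat $path)[7, 9];
    my $old = $seen{$path};

    # Never seen before, or size/mtime differ: changed for sure.
    return 1 if !$old or $old->[0] != $size or $old->[1] != $mtime;

    # Size and mtime match: fall back to a digest to be sure.
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest ne $old->[2];
}
```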

--
Charles DeRykus
 
A. Sinan Unur
 
      02-13-2008
"comp.llang.perl.moderated" <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> On Feb 13, 2:25 pm, George Adams <(E-Mail Removed)>
> wrote:
>> [original post snipped]

>
> Maybe timestamp as well as file size with a
> fallback to SHA-1 or MD5 only if time/size are
> unchanged.


That is not going to save work.

If either file size or time stamp has changed, then the checksum needs
to be recomputed (so that it can be checked the next time).

If neither has changed, the checksum needs to be recomputed just to be
sure.

That is, the checksum needs to be recomputed for all the files on each
run.

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>

 
Reply With Quote
 
Ben Morrow
 
      02-14-2008

Quoth Big and Blue <(E-Mail Removed)>:
>
> Just checking the last mod time should be sufficient (it's what a backup
> system would do). The assumption is that if something has changed the file
> *without* changing the mod time then that would mean it has reset the mod
> time after making the change deliberately, and the file didn't need to be
> backed up again.


Not necessarily. There may be a good reason the file needs that date, but
the change in data should still be recorded. Under Unix this is what
ctime is for: changing the data and then fixing the mtime will update
the ctime, so dump(8) can tell the file needs dumping again.
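(On Unix that check is a single stat call -- a sketch; element 10 of stat's return list is ctime:)

```perl
use strict;
use warnings;

# Has $path changed since $since (epoch seconds)? ctime, the inode
# change time, is bumped even when a program rewrites the file and
# then deliberately resets its mtime, so check both timestamps.
sub changed_since {
    my ($path, $since) = @_;
    my ($mtime, $ctime) = (stat $path)[9, 10];
    die "stat $path: $!" unless defined $mtime;
    return ($mtime > $since or $ctime > $since) ? 1 : 0;
}
```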

Ben

 
A. Sinan Unur
 
      02-14-2008
Big and Blue <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> A. Sinan Unur wrote:
>
>> That is, the checksum needs to be recomputed for all the files on
>> each run.

>
> Depends what you are trying to achieve. Doing a checksum for each
> file every time means you have to read everything every time, which is
> rather unproductive.
>
> Just checking the last mod time should be sufficient (it's what a
> backup system would do). The assumption is that if something has
> changed the file *without* changing the mod time then that would mean
> it has reset the mod time after making the change deliberately, and
> the file didn't need to be backed up again.


I agree with that. You snipped too much, changing the intended meaning of
the snippet above. The point I was trying to make was that with the
algorithm Charles DeRykus was proposing, one would have to compute the
checksum for each and every file on each run, not saving any time at all.

Elsethread, I did indeed propose using the modification time for deciding
if the file needs to be backed up again.

Sinan


--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>

 
George Adams
 
      02-14-2008
Thank you all for the replies. Let me explain further why I originally
decided against relying on ctime/mtime (this program will mainly be
run in a Windows environment).

1) First, I decided to use mtime, which worked fine... until I
discovered that for some reason, certain Windows files didn't HAVE a
mtime. In Windows file properties, they were listed as "September 10,
1956, 6:13:51 AM". In Perl, the mtime was -1.

2) Still, I figured I could use mtime, and just assume that if it
was greater than ctime (which always exists under Windows, even if mtime
does not), then the mtime was usable. If it was -1, I would just use
the ctime as the baseline instead.

3) The final blow, however, was the discovery that Windows apparently
sets a new mtime for a file *even when no changes have taken place*.
I discovered this when trying to back up some MP3s. After some research,
I found that when certain media programs load an MP3 file, they
go out to retrieve additional info about that file (perhaps storing
the info they find in the ID3 tags?). Even if no changes are ultimately
made to the file, the mtime is still changed, fooling my Perl script
into thinking it needs backing up. And that's just for MP3s.

So finally I decided to give up on ctime/mtime and work with something I
was sure would be accurate - hashing. But now I've hit this stumbling
block.

Anyway, that's why I was hoping to find a hashing shortcut. This is a
fairly fault-tolerant program - if a file is occasionally tagged as a
false positive, the worst that happens is that it gets backed up even
though it didn't need to be. So I don't necessarily need SHA-512 or
anything. But still, even the humble CRC32 is awfully slow for really
large files...

Again, thanks for the help - any other suggestions are welcomed!
 
Ben Morrow
 
      02-14-2008

Quoth George Adams <(E-Mail Removed)>:
> Thank you all for the replies. Let me explain more why I originally
> decided against relying on ctime/mtime (this program will mainly be
> mainly run in a Windows environment)
>
> 1) First, I decided to use mtime, which worked fine... until I
> discovered that for some reason, certain Windows files didn't HAVE a
> mtime. In Windows file properties, they were listed as "September 10,
> 1956, 6:13:51 AM". In Perl, the mtime was -1.


Weird... I've never seen that... Was this FAT or NTFS, and were the
files created under Win9x or NT?

> 2) Still, I figured I could still use mtime, and just assume that if it
> was greater than ctime (which always exists under Windows, even if mtime
> does not), then the mtime was usable. If it was -1, I would just use
> the ctime as the baseline instead.


The ctime slot under Win32 holds the file create time, and Windows does
some weird things with the create time. In particular, if you copy a
file in Explorer the new file has a new ctime but the old mtime, so
ctime > mtime is not uncommon. It makes some sense, I guess: the file's
contents are 'older' than the file itself, since they used to be
somewhere else.

> 3) The final blow, however, was the discovery that Windows apparently
> sets a new mtime for a file *even when no changes have taken place* .


This is a very common occurrence. Lots of programs stupidly update a
file when they didn't need to. One way out (presumably) is to make such
files read-only: in the case of an MP3 collection, I'd want to make all
files and directories involved read-only anyway, to avoid accidentally
deleting them.

> Anyway, that's why I was hoping to find a hashing shortcut. This is a
> fairly fault-tolerant program - if files are occassionally tagged
> false-positive, then the worst that happens is that it is backed up even
> when it wasn't needed. So I don't necessarily need SHA-512 or anything.
> But still, even the humble CRC32 is awfully slow for really large files...


One (slightly evil) suggestion would be to use the archive flag, and
write a daemon that sits in the background (hah!) and uses
Win32::ChangeNotify or some such to watch for files that have been
changed, recalculate the hash, and fix up the flag if needed. It's still
doing the same work (well, more, actually) but if it's done in the
background it might not be so noticeable. Of course, you'd then have
lots of fun synchronisation issues... (your backup program needs to know
that all the files are in a valid state, and that the daemon isn't playing
with any of them, while it works).

Ben

 
John Bokma
 
      02-14-2008
David Filmer <(E-Mail Removed)> wrote:

> George Adams wrote:
>> Obviously backup software makers have solved this problem to
>> make incremental backups pretty fast - what's the trick to it?

>
> Indeed. The rsync program, for example, is able to determine which
> files have changed. It's really fast and really smart (it is able to
> synchronize changed files by only transmitting the delta, not the whole
> file). I don't know how it works,


Me neither, but
http://en.wikipedia.org/wiki/Rsync

has some information on the check sum method used, and has also a
link to the algorithm: http://rsync.samba.org/tech_report/node2.html
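(The weak rolling checksum described in that tech report can be sketched in a few lines of Perl -- this shows the idea, not rsync's actual code:)

```perl
use strict;
use warnings;

# rsync-style weak checksum of a block: a = sum of the bytes,
# b = position-weighted sum, both mod 2^16, packed into 32 bits.
sub weak_checksum {
    my ($data) = @_;
    my ($a, $b) = (0, 0);
    my $len   = length $data;
    my @bytes = unpack 'C*', $data;
    for my $i (0 .. $#bytes) {
        $a += $bytes[$i];
        $b += ($len - $i) * $bytes[$i];
    }
    $a %= 65536;
    $b %= 65536;
    return $a + ($b << 16);
}

# The "rolling" property: sliding the window one byte forward needs
# only the byte leaving and the byte entering, not a full rescan.
sub roll {
    my ($sum, $len, $out_byte, $in_byte) = @_;
    my $a = $sum & 0xFFFF;
    my $b = ($sum >> 16) & 0xFFFF;
    $a = ($a - $out_byte + $in_byte) % 65536;
    $b = ($b - $len * $out_byte + $a) % 65536;
    return $a + ($b << 16);
}
```

This cheap checksum is what lets rsync scan a file at every byte offset at reasonable cost; the expensive strong hash (MD4 in the original report) is computed only for blocks whose weak checksum matches.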

--
John

Arachnids near Coyolillo - part 1
http://johnbokma.com/mexit/2006/05/0...yolillo-1.html
 
comp.llang.perl.moderated
 
      02-14-2008
On Feb 13, 3:54 pm, "A. Sinan Unur" <(E-Mail Removed)> wrote:
> "comp.llang.perl.moderated" <(E-Mail Removed)> wrote in
> news:(E-Mail Removed):
>
> > [quoted original post snipped]

>
> > Maybe timestamp as well as file size with a
> > fallback to SHA-1 or MD5 only if time/size are
> > unchanged.

>
> That is not going to save work.
>
> If either file size or time stamp has changed, then the checksum needs
> to be recomputed (so that it can be checked the next time).
>
> If neither has changed, the checksum needs to be recomputed just to be
> sure.
>
> That is, the checksum needs to be recomputed for all the files on each
> run.
>


True, to be really thorough, you'd need to checksum each time. From
the original problem description, I guessed that mtimes were not being
updated, so backups were being missed, and a size check would usually
catch the update even if the mtime was unchanged -- particularly with
checksumming as a last resort. Of course, the size check might still
help if, as the OP mentions, Win32 spuriously updates mtime when there
wasn't any change.

However, the problem under Win32 with a non-existent mtime for some
files, or a ctime more recent than mtime for others, certainly makes
the problem more complicated than that.


--
Charles DeRykus
 