fdups: calling for beta testers

 
 
Patrick Useldinger
 
      02-25-2005
Hi all,

I am looking for beta-testers for fdups.

fdups is a program to detect duplicate files on locally mounted
filesystems. Files are considered equal if their content is identical,
regardless of their filename. Also, fdups ignores symbolic links and is
able to detect and ignore hardlinks, where available.
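
As an illustration of the symlink and hardlink handling described above, here is a minimal Python 3 sketch. It is not taken from fdups; the function name `unique_regular_files` is hypothetical, and it assumes a platform where `os.stat` reports meaningful device and inode numbers.

```python
import os

def unique_regular_files(paths):
    """Yield regular files, skipping symlinks and repeated hardlinks.

    Hypothetical helper, not fdups code: two names are treated as the
    same file when they share a (device, inode) pair.
    """
    seen = set()  # (st_dev, st_ino) pairs already yielded
    for path in paths:
        if os.path.islink(path) or not os.path.isfile(path):
            continue  # ignore symbolic links and non-files
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        # An inode of 0 means the platform gave no usable identity.
        if st.st_ino != 0 and key in seen:
            continue  # another hardlink to a file already seen
        seen.add(key)
        yield path
```

Skipping hardlinks this way avoids reporting two names for the same data as a duplicate that could be deleted to save space.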

In contrast to similar programs, fdups does not rely on md5 sums or
other hash functions to detect potentially identical files. Instead, it
does a direct blockwise comparison and stops reading as soon as
possible, thus reducing the file reads to a minimum.
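
The blockwise comparison with early exit can be sketched for the two-file case as follows. This is a Python 3 illustration only, not the fdups implementation (which compares whole groups of files at once); `same_content` and the 8192-byte default block size are assumptions.

```python
def same_content(path_a, path_b, blocksize=8192):
    """Return True if both files hold identical bytes.

    Reads block by block and returns at the first difference, so files
    that differ early are never read to the end.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # first difference found: stop reading
            if not block_a:
                return True  # both files exhausted with no difference
```

Compared with hashing, nothing has to be read past the first differing block, although files that really are identical must still be read in full.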

fdups has been developed on Linux but should run on all platforms that
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
you'll also find a link to download the tar.

I am primarily interested in getting feedback on whether it produces
correct results. But as I haven't been programming in Python for a year
or so, I'd also be interested in comments on the code if you happen to
look at it in detail.

Your help is much appreciated.

-pu
 
John Machin
 
      02-26-2005

Patrick Useldinger wrote:
>
> fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
> you'll also find a link to download the tar.
>


"""fdups has no installation program. Just change into a temporary
directory, and type "tar xfj fdups.tar.bz". You should also chown the
files according to your needs, and then copy the executables to your
PATH."""

(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?

(5) if files[subgroup[j]]['flag'] and files[subgroup[i]]['buffer'] ==
files[subgroup[j]]['buffer']:

That's not the most readable code I've ever seen.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.

(7)

! def compare(self):
!     """ compare all files of the same size - outer loop """
!     sizes=self.compfiles.keys()
!     sizes.sort()
!     for size in sizes:
!         self.comparefiles(size,self.compfiles[size])

Why sort? What's wrong with just two lines:

! for size, file_list in self.compfiles.iteritems():
!     self.comparefiles(size, file_list)

(8) global MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

! class fDups:
!     """ encapsulates the whole logic """

(9) Any good reason why the "executables" don't have ".py" extensions
on their names?

All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
And what is "chown" -- any relation of Perl's "chomp"?

 
Patrick Useldinger
 
      02-26-2005
John Machin wrote:

> (1) It's actually .bz2, not .bz (2) Why annoy people with the
> not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
> Typing that on Windows command line doesn't produce a useful result (4)
> Haven't you heard of distutils?


(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a
Windows user, and I haven't given that much thought. The point was not
to save space, just to use the "standard" format. What would it be for
Windows - zip?
(4) Never used it, but a very valid point. I will look into it.

> (6) You are keeping open handles for all files of a given size -- have
> you actually considered the possibility of an exception like this:
> IOError: [Errno 24] Too many open files: 'foo509'


(6) Not much I can do about this. In the beginning, all files of equal
size are potentially identical. I first need to read a chunk of each,
and if I want to avoid opening & closing files all the time, I need them
open together.
What would you suggest?

> Once upon a time, max 20 open files was considered as generous as 640KB
> of memory. Looks like Bill thinks 512 (open files, that is) is about
> right these days.


Bill also thinks it is normal that half of service pack 2 lingers twice
on a hard disk. Not sure whether he's my hero.

> (7)
> Why sort? What's wrong with just two lines:
>
> ! for size, file_list in self.compfiles.iteritems():
> !     self.comparefiles(size, file_list)


(7) I wanted the output to be sorted by file size, instead of being
random. It's psychological, but if you're chasing dups, you'd want to
start with the largest ones first. If you have more than a screenful
of info, it's the last lines which are the most interesting. And it will
produce the same info in the same order if you run it twice on the same
folders.
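
The ordering argument above can be shown in a few lines. This is a Python 3 sketch with made-up data; `iter_by_size` is a hypothetical name and `compfiles` stands in for the size-to-filenames dictionary discussed in the thread.

```python
def iter_by_size(compfiles):
    """Yield (size, file_list) pairs in ascending size order.

    Sorting the keys makes the report deterministic and puts the
    largest candidates on the last lines of the output.
    """
    for size in sorted(compfiles):
        yield size, compfiles[size]

if __name__ == "__main__":
    # Made-up example data, not output from fdups.
    compfiles = {393216: ["a.tds", "b.tds"], 12: ["x", "y"], 4096: ["p", "q"]}
    for size, file_list in iter_by_size(compfiles):
        print(size, file_list)
```

Iterating over the dictionary directly is shorter, but in the Python of the time the order was arbitrary, which is the trade-off being argued here.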

> (8) global MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES
>
> That doesn't sit very well with the 'everything must be in a class'
> religion seemingly espoused by the following:


(8) Agreed. I'll think about that.

> (9) Any good reason why the "executables" don't have ".py" extensions
> on their names?


(9) Because I am lazy and Linux doesn't care. I suppose Windows does?

> All in all, a very poor "out-of-the-box" experience. Bear in mind that
> very few Windows users would have even heard of bzip2, let alone have a
> bzip2.exe on their machine. They wouldn't even be able to *open* the
> box.


As I said, I did not give Windows users much thought. I will improve this.

> And what is "chown" -- any relation of Perl's "chomp"?


chown is a Unix command to change the owner or the group of a file. It
has to do with controlling access to the file. It is not relevant on
Windows. No relation to Perl's chomp.

Thank you very much for your feedback. Did you actually run it on your
Windows box?

-pu
 
Peter Hansen
 
      02-26-2005
Patrick Useldinger wrote:
>> (9) Any good reason why the "executables" don't have ".py" extensions
>> on their names?

>
> (9) Because I am lazy and Linux doesn't care. I suppose Windows does?


Unfortunately, yes. Windows has nothing like the "x" permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type "python" and the full filename.
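
To make the PATHEXT rule concrete, here is a small Python 3 sketch of the extension check. It only mimics the lookup described above and is not how Windows itself is implemented; `is_windows_executable_name` is a hypothetical name, and whether `.PY` appears in PATHEXT depends on how Python was installed.

```python
import os

def is_windows_executable_name(filename, pathext=None):
    """Return True if the file's extension is listed in a PATHEXT-style string.

    pathext defaults to the PATHEXT environment variable, a
    semicolon-separated list such as ".COM;.EXE;.BAT".
    """
    if pathext is None:
        pathext = os.environ.get("PATHEXT", "")
    extensions = [ext.lower() for ext in pathext.split(";") if ext]
    _, ext = os.path.splitext(filename)
    return ext.lower() in extensions
```

With a PATHEXT of ".COM;.EXE;.BAT;.PY", the name "fdups.py" passes the check while the bare name "fdups" does not, which is the point being made about extension-less scripts.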

-Peter
 
Serge Orlov
 
      02-26-2005
Peter Hansen wrote:
> Patrick Useldinger wrote:
>>> (9) Any good reason why the "executables" don't have ".py"
>>> extensions on their names?

>>
>> (9) Because I am lazy and Linux doesn't care. I suppose Windows does?

>
> Unfortunately, yes. Windows has nothing like the "x" permission
> bit, so you have to have an actual extension on the filename and
> Windows (XP anyway) will check it against the list of extensions
> in the PATHEXT environment variable to determine if it should be
> treated like an executable.
>
> Otherwise you must type "python" and the full filename.


Or use exemaker, which IMHO is the best way to handle this
problem.

Serge.


 
John Machin
 
      02-26-2005

Patrick Useldinger wrote:
> John Machin wrote:
>
> > (1) It's actually .bz2, not .bz (2) Why annoy people with the
> > not-widely-known bzip2 format just to save a few % of a 12KB file??
> > (3) Typing that on Windows command line doesn't produce a useful result
> > (4) Haven't you heard of distutils?
>
> (1) Typo, thanks for pointing it out
> (2)(3) In the Linux world, it is really popular. I suppose you are a
> Windows user, and I haven't given that much thought. The point was not
> to save space, just to use the "standard" format. What would it be for
> Windows - zip?


Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.

> > (6) You are keeping open handles for all files of a given size -- have
> > you actually considered the possibility of an exception like this:
> > IOError: [Errno 24] Too many open files: 'foo509'
>
> (6) Not much I can do about this. In the beginning, all files of equal
> size are potentially identical. I first need to read a chunk of each,
> and if I want to avoid opening & closing files all the time, I need them
> open together.
> What would you suggest?


Test, like I did, to see how many open handles you can get away with. I
was not joking, 20 was the max on MS-DOS at one stage and I vaguely
recall: (a) some low limits on various flavours of *x (b) the "ulimit"
command can be used to vary the per-process limit but (c) there is a
system-wide limit also.

You should consider a fall-back method to be used in this case and in
the case of too many files for your 1MB (default) buffer pool. BTW 1MB
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
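
One possible shape for such a fall-back is sketched below in Python 3. It is an assumption about how it could be done, not something proposed in the thread or present in fdups: each block is fetched by opening, seeking, reading and closing, so only one handle is open at any moment, at the cost of many more open() calls. The names `read_block` and `group_identical` are hypothetical.

```python
def read_block(path, index, blocksize=8192):
    """Open the file, read block number `index`, and close it again."""
    with open(path, "rb") as f:
        f.seek(index * blocksize)
        return f.read(blocksize)

def group_identical(paths, blocksize=8192):
    """Split same-sized files into groups of identical content.

    Only one file handle is open at a time. Files whose current block
    is unique are dropped immediately, so they are not read further.
    """
    groups = [list(paths)]
    finished = []
    index = 0
    while groups:
        next_groups = []
        for group in groups:
            by_block = {}
            for path in group:
                block = read_block(path, index, blocksize)
                by_block.setdefault(block, []).append(path)
            for block, members in by_block.items():
                if len(members) < 2:
                    continue  # unique so far: cannot be a duplicate
                if len(block) < blocksize:
                    finished.append(members)  # end of file reached, all equal
                else:
                    next_groups.append(members)
        groups = next_groups
        index += 1
    return finished
```

A real program would probably keep handles open while the group is small and switch to this mode only once the handle limit is hit.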

> > And what is "chown" -- any relation of Perl's "chomp"?
>
> chown is a Unix command to change the owner or the group of a file. It
> has to do with controlling access to the file. It is not relevant on
> Windows. No relation to Perl's chomp.


The question was rhetorical. Your irony detector must be on the fritz.


> Did you actually run it on your
> Windows box?


Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.

Cheers,
John

 
Patrick Useldinger
 
      02-26-2005
John Machin wrote:

> Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
> bzip2.


I've added a zip file. It was made on Linux with the zip command-line
tool; the man pages say it's compatible with the Windows zip tools. I
have also added .py extensions to the 2 programs. I did not, however, use
distutils, because I'm not sure it is really suited to module-less scripts.

> You should consider a fall-back method to be used in this case and in
> the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
> seems tiny; desktop PCs come with 512MB standard these days, and Bill
> does leave a bit more than 1MB available for applications.


I've added it to the TODO list.

> The question was rhetorical. Your irony detector must be on the fritz.
>


I always find it hard to detect irony by mail with people I do not know...

>>Did you actually run it on your
>>Windows box?

>
>
> Yes, with trepidation, after carefully reading the source. It detected
> some highly plausible duplicates, which I haven't verified yet.


I would have been reluctant too. But I've tested it intensively, and
there's strictly no statement that actually alters the file system.

Thanks for your feedback!

-pu
 
Patrick Useldinger
 
      02-26-2005
Serge Orlov wrote:

> Or use exemaker, which IMHO is the best way to handle this
> problem.


Looks good, but I do not use Windows.

-pu
 
John Machin
 
      02-27-2005
On Sat, 26 Feb 2005 23:53:10 +0100, Patrick Useldinger wrote:

> I've tested it intensively


"Famous Last Words"

>Thanks for your feedback!


Here's some more:

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Here's a snippet from a duplicate detection run:

DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. The above duplicates were detected only when I made the
following changes to your script:


--- fdups Sat Feb 26 06:41:36 2005
+++ fdups_jm.py Sun Feb 27 12:18:04 2005
@@ -29,13 +29,14 @@
         self.count = self.totalsize = self.inodecount = self.slinkcount = 0
         self.gain = self.bytescompared = self.bytesread = self.inodecount = 0
         for toplevel in args:
-            os.path.walk(toplevel, self.buildList, None)
+            os.path.walk(toplevel, self.updateDict, None)
         if self.count > 0:
             self.compare()

-    def buildList(self,arg,dirpath,namelist):
-        """ build a dictionnary of files to be analysed, indexed by length """
-        files = {}
+    def updateDict(self,arg,dirpath,namelist):
+        """ update a dictionary of files to be analysed, indexed by length """
+        # files = {}
+        files = self.compfiles
         for filepath in namelist:
             fullpath = os.path.join(dirpath,filepath)
             if os.path.isfile(fullpath):
@@ -51,20 +52,23 @@
                 if size >= MIN_FILESIZE:
                     self.count += 1
                     self.totalsize += size
+                    # is above totalling in the wrong place?
                     if size not in files:
                         files[size]=[fullpath]
                     else:
                         files[size].append(fullpath)
-        for size in files:
-            if len(files[size]) != 1:
-                self.compfiles[size]=files[size]
+        # for size in files:
+        #     if len(files[size]) != 1:
+        #         self.compfiles[size]=files[size]

     def compare(self):
         """ compare all files of the same size - outer loop """
         sizes=self.compfiles.keys()
         sizes.sort()
         for size in sizes:
-            self.comparefiles(size,self.compfiles[size])
+            list_of_filenames = self.compfiles[size]
+            if len(list_of_filenames) > 1:
+                self.comparefiles(size, list_of_filenames)

     def comparefiles(self,size,filelist):
         """ compare all files of the same size - inner loop """


(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:

(1, "'{' is not recognized as an internal or external
command,\noperable program or batch file.")

Why not use the Python filecmp module?
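
For reference, a byte-for-byte check with the standard filecmp module might look like the Python 3 sketch below. `verify_duplicates` is a hypothetical name and this is not the fdups-check script itself; passing `shallow=False` makes `filecmp.cmp` compare file contents rather than trusting os.stat() signatures alone.

```python
import filecmp

def verify_duplicates(paths):
    """Return True if every file in `paths` matches the first one byte for byte."""
    first = paths[0]
    return all(filecmp.cmp(first, other, shallow=False) for other in paths[1:])
```

Unlike shelling out through the Unix-only commands module, this runs unchanged on Windows.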

Cheers,
John
 
Patrick Useldinger
 
      02-27-2005
John Machin wrote:

>>I've tested it intensively

> "Famous Last Words"




> (1) Manic s/w producing lots of files all the same size: the Borland
> C[++] compiler produces a debug symbol file (.tds) that's always
> 384KB; I have 144 of these on my HD, rarely more than 1 in the same
> directory.


Not sure what you want me to do about it. I've decreased the minimum
block size once more, to accommodate more files of the same length
without increasing the total amount of memory used.

> (2) There appears to be a flaw in your logic such that it will find
> duplicates only if they are in the *SAME* directory and only when
> there are no other directories with two or more files of the same
> size.


Ooops...
A really stupid mistake on my side. Corrected.
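
The essence of that fix is to accumulate the size index across the whole walk and to filter out single-file sizes only at the end. A minimal Python 3 sketch, using os.walk rather than the os.path.walk callback of the original script (the name `candidates_by_size` and the `min_size` parameter are assumptions, not fdups code):

```python
import os
from collections import defaultdict

def candidates_by_size(toplevels, min_size=1):
    """Map file size -> list of paths, accumulated over every directory visited."""
    by_size = defaultdict(list)
    for top in toplevels:
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks and anything that is not a file
                size = os.path.getsize(path)
                if size >= min_size:
                    by_size[size].append(path)
    # Filter only after the whole walk, so same-sized files living in
    # different directories still end up in the same group.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```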

> (3) Your fdups-check gadget doesn't work on Windows; the commands
> module works only on Unix but is supplied with Python on all
> platforms. The results might just confuse a newbie:
> Why not use the Python filecmp module?


Done. It's also faster AND it works better. Thanks for the suggestion.

Please fetch the new version from http://www.homepages.lu/pu/fdups.html.

-pu
 