Checking if two files are the same

 
 
New C.
 
      03-07-2011
I have got a few folders which may contain the same files under different
names.
Is there any way I can find out which files these are using Ruby?

The files with text (*.doc, *.txt ...) should be pretty easy to check,
but what about PDF files, exe files, etc.?

I am wondering if there is some sort of diff module that can do this.

--
Posted via http://www.ruby-forum.com/.

 
Xavier Noria
 
      03-07-2011
On Mon, Mar 7, 2011 at 11:48 AM, New C. <(E-Mail Removed)> wrote:

> I have got a few folders which may contain the same files under different
> names.
> Is there any way I can find out which files these are using Ruby?
>
> The files with text (*.doc, *.txt ...) should be pretty easy to check,
> but what about PDF files, exe files, etc.?
>
> I am wondering if there is some sort of diff module that can do this.


There's File.compare (from the ftools library in Ruby 1.8;
FileUtils.compare_file is the equivalent in fileutils).

Depending on how many comparisons you're going to do, it might be a
good idea to precompute checksums and compare the checksums.
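
As a minimal sketch of both suggestions, assuming a current Ruby where the byte-for-byte comparison lives in FileUtils.compare_file: the helper names same_content? and checksums are hypothetical, and SHA256 is just one reasonable digest choice. Neither approach cares about the file type, so PDF and exe files are handled the same way as text.

```ruby
require 'fileutils'
require 'digest'

# Byte-for-byte comparison of two files (hypothetical helper name).
def same_content?(path_a, path_b)
  # Files of different sizes can never be identical, so skip reading them.
  return false unless File.size(path_a) == File.size(path_b)
  FileUtils.compare_file(path_a, path_b)
end

# Precompute one checksum per file so that later comparisons are cheap
# string comparisons instead of repeated reads (hypothetical helper name).
def checksums(paths)
  paths.map { |path| [path, Digest::SHA256.file(path).hexdigest] }.to_h
end
```

With many files, each one is read once for its checksum, and only files whose checksums match need a closer look.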

 
Reid Thompson
 
      03-07-2011
On 03/07/2011 06:09 AM, Xavier Noria wrote:

>
> Depending on how many comparisons you're going to do, it might be a
> good idea to precompute checksums and compare the checksums.


+1



 
Colin Bartlett
 
      03-07-2011
On Mon, Mar 7, 2011 at 11:09 AM, Xavier Noria <(E-Mail Removed)> wrote:
> On Mon, Mar 7, 2011 at 11:48 AM, New C. <(E-Mail Removed)> wrote:
>
>> I have got a few folders which may contain the same files under different
>> names. Is there any way I can find out which files these are using Ruby?
>> ...
>> I am wondering if there is some sort of diff module that can do this.
>
> There's File.compare.
>
> Depending on how many comparisons you're going to do, it might be a
> good idea to precompute checksums and compare the checksums.


I'm interested in any (Ruby) solutions (actual or ideas) for this, as
I have needed to do it in the past, and want to do something similar
in the very near future.

For comparing directories where the file names might have changed what
I've done in the past is to first match on file name, then for the
unmatching files in each directory see if there are any matches on
file size, and for those matches either make a direct File.compare (if
only two files match on a size) or compute checksums and use those to
exclude definitely unmatching files, and then use File.compare on what
(if anything) remains matching for that file size and checksum.

I assume something similar would work for finding duplicates in
general, not just comparing directories? (If there are likely to be
many matches on file size, then presumably one might as well compute
checksums for all files?)
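
The size, then checksum, then direct comparison sequence described above could be sketched roughly as follows. This is only an illustration under assumptions: duplicate_groups is a hypothetical name, the file-name matching step is left out, and FileUtils.compare_file stands in for File.compare.

```ruby
require 'digest'
require 'fileutils'

# Returns an array of groups; each group is an array of paths whose
# contents are identical (hypothetical helper name).
def duplicate_groups(paths)
  groups = []
  # Step 1: only files of equal size can be duplicates.
  paths.group_by { |path| File.size(path) }.each_value do |same_size|
    next if same_size.size < 2
    if same_size.size == 2
      # Exactly two candidates: compare them directly, no checksums needed.
      groups << same_size if FileUtils.compare_file(*same_size)
      next
    end
    # Step 2: checksums rule out files that definitely differ.
    same_size.group_by { |path| Digest::SHA1.file(path).hexdigest }.each_value do |same_sum|
      next if same_sum.size < 2
      # Step 3: confirm byte-for-byte against the first file of the group.
      # (A file that only collides on the checksum is simply dropped.)
      first, *rest = same_sum
      confirmed = [first] + rest.select { |path| FileUtils.compare_file(first, path) }
      groups << confirmed if confirmed.size > 1
    end
  end
  groups
end
```

Called with something like duplicate_groups(Dir["**/*"].select { |f| File.file?(f) }), it works for finding duplicates in general, not just between two directories.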

 
Xavier Noria
 
      03-07-2011
On Mon, Mar 7, 2011 at 6:38 PM, Colin Bartlett <(E-Mail Removed)> wrote:

> I'm interested in any (Ruby) solutions (actual or ideas) for this, as
> I have needed to do it in the past, and want to do something similar
> in the very near future.
>
> For comparing directories where the file names might have changed what
> I've done in the past is to first match on file name, then for the
> unmatching files in each directory see if there are any matches on
> file size, and for those matches either make a direct File.compare (if
> only two files match on a size) or compute checksums and use those to
> exclude definitely unmatching files, and then use File.compare on what
> (if anything) remains matching for that file size and checksum.


I have played with this as an exercise. The idea is to filter
candidates iteratively applying different criteria, from cheap to
expensive, until you arrive at the solution.

It is just a proof of concept in pseudocode; I wrote it off the top of
my head, and it does not even run:

https://gist.github.com/859046

The code above assumes a generic scenario with m-n possible
duplicates. If a particular situation has details that can speed up
the process, they should of course be taken into account.
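
The following is not the code from the gist, only a rough runnable sketch of the same idea: keep groups of candidate files and split them by one criterion after another, cheapest first. The names default_criteria, refine and candidate_duplicates are hypothetical, and the three criteria (size, digest of the first 4 KiB, digest of the whole file) are assumptions chosen for illustration.

```ruby
require 'digest'

# Hypothetical criteria, ordered from cheap to expensive. Each maps a path
# to a key; two files can only be duplicates if they agree on every key.
def default_criteria
  [
    ->(path) { File.size(path) },                                        # metadata only
    ->(path) { Digest::SHA1.hexdigest(File.binread(path, 4096) || "") }, # first 4 KiB
    ->(path) { Digest::SHA1.file(path).hexdigest }                       # whole file
  ]
end

# Split every group by one criterion and drop groups left with a single file.
def refine(groups, criterion)
  groups.flat_map { |group| group.group_by(&criterion).values }
        .reject { |group| group.size < 2 }
end

# Start with all paths in one group and apply each criterion in turn.
def candidate_duplicates(paths, criteria = default_criteria)
  criteria.inject([paths]) { |groups, criterion| refine(groups, criterion) }
end
```

Because singleton groups are dropped after every step, the expensive criteria only ever run on files that survived the cheap ones. The result is grouped by checksum only; a final byte-for-byte comparison could be added as one more step if that matters.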

 
Brian Candler
 
      03-08-2011
New C. wrote in post #985925:
> I have got a few folders which may contain the same files under different
> names.
> Is there any way I can find out which files these are using Ruby?


Here is a little ruby script I use for finding and/or deleting duplicate
image and video files downloaded from my camera - it will work for any
sort of file.

#!/usr/bin/ruby -w
require 'digest/sha1'

# Pass -d as the first argument to delete duplicates instead of only listing them.
if ARGV[0] == "-d"
  do_delete = true
  ARGV.shift
end

seen = {}  # SHA1 hex digest => first file seen with that content
dirs = ARGV.empty? ? ["#{ENV["HOME"]}/Pictures"] : ARGV

dirs.each do |dir|
  # Walk the directory tree recursively, in sorted order.
  Dir["#{dir}/**/*"].sort.each do |fn|
    next if File.directory?(fn)
    hash = Digest::SHA1.file(fn).hexdigest
    if seen[hash]
      puts "#{fn} is dupe of #{seen[hash]}"
      if do_delete
        File.delete(fn)
        puts "DELETED"
      end
    else
      seen[hash] = fn
    end
  end
end


 
Robert Klemme
 
      03-08-2011
On Tue, Mar 8, 2011 at 9:48 AM, Brian Candler <(E-Mail Removed)> wrote:
> New C. wrote in post #985925:
>> I have got a few folders which may contain the same files under different
>> names.
>> Is there any way I can find out which files these are using Ruby?
>
> Here is a little ruby script I use for finding and/or deleting duplicate
> image and video files downloaded from my camera - it will work for any
> sort of file.
>
> #!/usr/bin/ruby -w
> require 'digest/sha1'
> if ARGV[0] == "-d"
>   do_delete = true
>   ARGV.shift
> end
>
> seen = {}
> dirs = ARGV.empty? ? ["#{ENV["HOME"]}/Pictures"] : ARGV
>
> dirs.each do |dir|
>   Dir["#{dir}/**/*"].sort.each do |fn|
>     next if File.directory?(fn)
>     hash = Digest::SHA1.file(fn).hexdigest
>     if seen[hash]
>       puts "#{fn} is dupe of #{seen[hash]}"
>       if do_delete
>         File.delete(fn)
>         puts "DELETED"
>       end
>     else
>       seen[hash] = fn
>     end
>   end
> end


For this idiom Hash#fetch can be used nicely:

irb(main):008:0> h={};
10.times {|i|
  puts i
  h.fetch(i % 3) {|x| printf "first %p\n", i; h[x]=true; nil} and
    printf "duplicate %p\n", i}
0
first 0
1
first 1
2
first 2
3
duplicate 3
4
duplicate 4
5
duplicate 5
6
duplicate 6
7
duplicate 7
8
duplicate 8
9
duplicate 9
=> 10
irb(main):009:0>

... for arbitrary values of "nice".
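
To spell out what the idiom relies on: Hash#fetch returns the stored value when the key exists, and otherwise calls the block with the key and returns the block's result without storing anything itself. A small sketch, where first_or_duplicate is a hypothetical helper name:

```ruby
# Returns nil the first time a key is seen (and records the value);
# returns the originally recorded value on every later occurrence.
def first_or_duplicate(seen, key, value)
  seen.fetch(key) do |missing_key|
    seen[missing_key] = value  # record the first occurrence
    nil                        # falsy result signals "not a duplicate"
  end
end
```

For example, with seen = {}, the first call first_or_duplicate(seen, "abc", "one.jpg") returns nil, and a later call first_or_duplicate(seen, "abc", "two.jpg") returns "one.jpg".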

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

 
Brian Candler
 
      03-08-2011
It looks pretty obfuscated to my eyes, but each to his own.

dirs.each do |dir|
  Dir["#{dir}/**/*"].sort.each do |fn|
    next if File.directory?(fn)
    hash = Digest::SHA1.file(fn).hexdigest
    if seen.fetch(hash) { seen[hash]=fn; false }
      puts "#{fn} is dupe of #{seen[hash]}"
      if do_delete
        File.delete(fn)
        puts "DELETED"
      end
    end
  end
end


 