using python to parse md5sum list

 
 
Ben Rf
 
      03-06-2005
Hi

I'm new to programming and I'd like to write a program that will parse
a list produced by md5summer and give me a report, in a text file, of
which md5 sums appear more than once and where they are located.

The end goal is to have a way of finding duplicate files that are
scattered across a LAN of 4 Windows computers.

I've dabbled with different languages over the years and I think
Python is a good language for this, but I have had a lot of trouble
sifting through manuals and tutorials finding out which commands I
need and their syntax.

Can someone please help me?

Thanks.

Ben
 
 
 
 
 
James Stroud
 
      03-06-2005
Among many other things:

First, you might want to look at os.path.walk() (there is a rough sketch after this list).
Second, look at the string data type.

Third, get the Python Essential Reference.
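
Since I mention os.path.walk(), here is a rough sketch of how the
directory-walking side of this might look (this is only one way to do it;
the md5 module call and the name find_md5s are mine, not anything from
md5summer):

import md5        # on Python 2.5+ you would use hashlib.md5 instead
import os
import os.path

def visit(results, dirname, names):
    # os.path.walk() calls this once per directory it descends into
    for name in names:
        path = os.path.join(dirname, name)
        if os.path.isfile(path):
            data = open(path, 'rb').read()   # reads the whole file; fine for a sketch
            results.append((md5.new(data).hexdigest(), path))

def find_md5s(top):
    # collect (checksum, path) pairs for every file below 'top'
    results = []
    os.path.walk(top, visit, results)
    return results

For example, find_md5s(r"C:\some\dir") returns a list of (checksum, path)
pairs, much like the list md5summer writes out.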

Also, Programming Python (O'Reilly) actually has a lot in it about stuff like
this. It's a tedious read, but in the end it will help a lot with the
administrative stuff you are doing here.

So, with the understanding that you will look at these references, I will
foolishly save you a little time...

If you are using md5sum, you can grab the md5 and the filename like so:

myfile = open(filename)
md5sums = []
for aline in myfile.readlines():
    # strip the trailing newline, then split checksum from filename
    md5sums.append(aline[:-1].split(" ", 1))
myfile.close()

The md5 sum will be in element 0 of each entry in the md5sums list, and
the path to the file will be in element 1.
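
And as a minimal follow-up sketch (still using the md5sums list built
above; the report format here is just an illustration), you could pick
out the checksums that occur more than once like this:

dups = {}
for checksum, path in md5sums:
    dups.setdefault(checksum, []).append(path)

for checksum, paths in dups.items():
    if len(paths) > 1:
        print checksum
        for path in paths:
            print "    " + path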


James



--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
 
 
 
 
 
Michael Hoffman
 
      03-06-2005
Ben Rf wrote:

> I'm new to programming and I'd like to write a program that will parse
> a list produced by md5summer and give me a report, in a text file, of
> which md5 sums appear more than once and where they are located.


This should do the trick:

"""
import fileinput

md5s = {}
for line in fileinput.input():
md5, filename = line.rstrip().split()
md5s.setdefault(md5, []).append(filename)

for md5, filenames in md5s.iteritems():
if len(filenames) > 1:
print "\t".join(filenames)
"""

Put this in md5dups.py and you can then run "md5dups.py [FILE]..." to
find duplicates in any of the files you specify. Each set of duplicates
will then be printed as a tab-delimited list.

Key things you might want to look up to understand this:

* the dict datatype
* dict.setdefault()
* dict.iteritems()
* the fileinput module
--
Michael Hoffman
 
 
Marc 'BlackJack' Rintsch
 
      03-06-2005
In <(E-Mail Removed)>, James Stroud
wrote:

> If you are using md5sum, you can grab the md5 and the filename like so:
>
> myfile = open(filename)
> md5sums = []
> for aline in myfile.readlines():
>     md5sums.append(aline[:-1].split(" ", 1))


md5sums.append(aline[:-1].split(None, 1))

That works too if md5sum opened the files in binary mode, which is the
default on Windows. The filename is then prefixed with a '*', leaving
just one space between checksum and filename.

> myfile.close()
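
A little example to make the difference concrete (checksum and filename
are made up):

text_line = "d41d8cd98f00b204e9800998ecf8427e  foo.txt\n"   # text mode: two spaces
bin_line  = "d41d8cd98f00b204e9800998ecf8427e *foo.txt\n"   # binary mode: space plus '*'

print text_line[:-1].split(" ", 1)    # ['d41d...', ' foo.txt']  note the leading space
print text_line[:-1].split(None, 1)   # ['d41d...', 'foo.txt']
print bin_line[:-1].split(" ", 1)     # ['d41d...', '*foo.txt']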


Ciao,
Marc 'BlackJack' Rintsch
 
 
Christos TZOTZIOY Georgiou
 
      03-07-2005
On 5 Mar 2005 19:54:34 -0800, rumours say that Ben Rf
might have written:

[snip]

>The end goal is to have a way of finding duplicate files that are
>scattered across a LAN of 4 Windows computers.


Just in case you want to go directly to that goal, check this:

http://groups-beta.google.com/group/...8e292ec9adb82d

It doesn't read a file at all unless there is a need to do that. For example,
if you have ten small files and one large one, the large one will never be
read, since no other file with the same size would be found.

In your case, you can use the find_duplicate_files function with arguments like
r"\\COMPUTER1\SHARE1", r"\\COMPUTER2\SHARE2", etc.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC195
I really should keep that in mind when talking with people, actually...
 
 
 
 