Finding duplicates in a file (Newbie question)

 
 
Benz
 
      02-01-2005
Hi!

Is there a smart way of finding duplicates in a large file?

This is how the file will look:
Col1 Col2 Col3 Col4
1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R
1/02/2005 21:06:15.533^F^l0091f3^F^5591^F^793423324^R
1/02/2005 22:12:14.653^F^l0031d6^F^5591^F^793423324^R

The ^F^ is the field separator. The file could have up to 140,000 lines
and I need to find out whether there are duplicates in Col4.

I am trying to do this the conventional way, and this is how far I got:
- reading the file using BufferedReader
- tokenizing each line and going to Col4
- taking the value in Col4.
I would appreciate any pointers...

- TIA Ben
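
The reading and tokenizing steps listed above could be sketched roughly as follows. This assumes that "^F^" and "^R" appear in the file as the literal characters shown in the sample; if they are really control characters, the two constants would need to change. The class name Col4Reader and its methods are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class Col4Reader {
    // Assumption: separator and record terminator are literal text, as in the sample lines.
    static final String FIELD_SEP = "^F^";
    static final String RECORD_END = "^R";

    /** Returns the fourth field of a line, or null if the line has fewer than four fields. */
    static String col4(String line) {
        String s = line.trim();
        if (s.endsWith(RECORD_END)) {
            s = s.substring(0, s.length() - RECORD_END.length());
        }
        // Pattern.quote is needed because '^' is a regex metacharacter.
        String[] fields = s.split(Pattern.quote(FIELD_SEP));
        return fields.length >= 4 ? fields[3] : null;
    }

    /** Reads every line and collects the Col4 values, skipping lines without four fields. */
    static List<String> readCol4(BufferedReader in) throws IOException {
        List<String> values = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            String value = col4(line);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }
}
```

A header line such as "Col1 Col2 Col3 Col4" contains no separator, so col4 returns null for it and readCol4 skips it.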

 
digidigo
 
      02-02-2005
You could then iterate over each value and put it into a TreeSet.
Before inserting the next value, test whether the TreeSet already
contains it:

TreeSet set = new TreeSet();

foreach ( foo in COL4 ) {
    if ( set.contains(foo) )
        print("Duplicate found: " + foo);
    else
        set.add(foo);
}

Something like that.
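
For reference, a compilable version of that idea might look like the sketch below. It assumes the Col4 values have already been extracted into some Iterable of Strings; the class and method names are hypothetical, and the duplicates are collected into a list rather than printed so the result can be inspected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DuplicateFinder {
    /** Returns each value that had already been seen earlier in the input, in encounter order. */
    static List<String> findDuplicates(Iterable<String> values) {
        Set<String> seen = new TreeSet<String>();
        List<String> duplicates = new ArrayList<String>();
        for (String value : values) {
            if (seen.contains(value)) {
                duplicates.add(value);   // already in the set: this occurrence is a repeat
            } else {
                seen.add(value);         // first time this value shows up
            }
        }
        return duplicates;
    }
}
```

Note that a value occurring three times is reported twice, once for each repeat after the first occurrence.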

 
Gerbrand
 
      02-02-2005
digidigo wrote:
> You could then iterate over each value and put it into a TreeSet.
> Before inserting the next value, test whether the TreeSet already
> contains it:
>
> TreeSet set = new TreeSet();
>
> foreach ( foo in COL4 ) {
>     if ( set.contains(foo) )
>         print("Duplicate found: " + foo);
>     else
>         set.add(foo);
> }
>


foreach doesn't exist in Java, but I think the idea is clear enough for
the OP. Java 1.5 adds a for-each syntax with a colon, something like
for (Object o : COL4).

Instead of
if (set.contains(foo))
print dup found
you can also use
if (!set.add(foo))
System.out.println(..)
(the method on a Set is add, and it returns false when the value is
already in the set). It's slightly shorter and faster, since add does
the check as well.

Also, for the TreeSet the elements have to be Comparable (or you pass in
a Comparator); for a HashSet, equals() and hashCode() have to be defined.
If you use Strings that's already the case, otherwise you have to
implement them.
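
A small sketch of the shorter variant, relying on the fact that Set.add returns false when the element is already present. A HashSet is swapped in here because ordering is not needed for the check; the class and method names are invented for the example, and the input is assumed to be the already extracted Col4 values.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class AddReturnValueDemo {
    /** Returns the distinct values that occur more than once, in order of first repeat. */
    static Set<String> duplicatesOf(Iterable<String> values) {
        Set<String> seen = new HashSet<String>();
        Set<String> duplicates = new LinkedHashSet<String>();
        for (String value : values) {
            // add() returns false if the set already contained the value,
            // so no separate contains() call is needed.
            if (!seen.add(value)) {
                duplicates.add(value);
            }
        }
        return duplicates;
    }
}
```

Because the duplicates go into a set as well, each repeated value is listed once no matter how often it recurs.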
 
 
Mark Murphy
 
      02-03-2005
If you're not limited to Java, Perl has a sweet little trick for doing it.
You may be able to re-code something like this in Java.

In Perl you have hashes, which are similar to what I understand a Java
Map to be: basically key/value pairs. What you can do in Perl is take the
values you want to count (or check for repeats) and use them as keys. Then
increment the value stored under a key each time you run into it. This
gives you a list of unique strings and how many times each one occurs.

I whipped something like this up to find word frequencies once. This was
before expanding my mind to Java. People often complain one way or
another, but I think Java and Perl complement each other well; you just
have to be flexible enough to accept that they do things differently. (What
good would choices be if they all did the same thing?)

If you're not limited to a Java solution then let me know and I can send
you the Perl. If you are going to do it in Java I would be interested in
seeing how you do it.

Mark M
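
The counting trick described above translates to Java fairly directly with a Map from value to count. This is only a sketch of that idea, not the Perl script mentioned in the post; the class and method names are hypothetical, and the input is assumed to be the already extracted Col4 values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ValueCounter {
    /** Maps each distinct value to the number of times it occurs, in first-seen order. */
    static Map<String, Integer> countOccurrences(Iterable<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String value : values) {
            Integer previous = counts.get(value);
            // First sighting starts at 1; later sightings increment the stored count.
            counts.put(value, previous == null ? 1 : previous + 1);
        }
        return counts;
    }
}
```

Any key whose count ends up greater than 1 is a duplicate, so a single pass over the map afterwards yields both the duplicates and how often each one occurs.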
 