Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > How to check variables for uniqueness ?

Reply
Thread Tools

How to check variables for uniqueness ?

 
 
Ed
Guest
Posts: n/a
 
      12-30-2006

Lew skrev:


>
> (Please do not embed TAB characters in newsgroup postings.)
>
> You could use a HashMap if you wanted to know how many times each word occurred:
>

snip
> - Lew


Indeed.

And in case anyone's interested, here are the times for HashMap. Looks
like Map is in the league of Set, and not the slow-moving List. (These
times are longer than the previous times because of current CPU
loading; relativity is the key.)

522393 duplicated words. Using java.util.HashSet, time = 789ms.
522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
522393 duplicated words. Using Map , time = 1180ms.
522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

Apologies to Patricia: I see I mis-attributed her post, yet again. And
Lew, I've now become fast friends now with Linux's expand(). Let's see
whether I purged those nasty TABs:

import java.util.*;
import java.io.*;

class Test {
private static String TEXT_BOOK_NAME = "war-and-peace.txt";

public static void main(String[] args) {
try {
String text = readText(); // Read text into RAM
countDuplicateWords(text, new HashSet());
countDuplicateWords(text, new TreeSet());
countDuplicateWordsMap(text);
countDuplicateWords(text, new ArrayList());
countDuplicateWords(text, new LinkedList());
} catch (Throwable t) {
System.out.println(t.toString());
}
}

private static String readText() throws Throwable {
BufferedReader reader =
new BufferedReader(new FileReader(TEXT_BOOK_NAME));
String line = null;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
text.append(line + " ");
}
return text.toString();
}

private static void countDuplicateWords(String text,
Collection listOfWords) {
int numDuplicatedWords = 0;
long startTime = System.currentTimeMillis();
for (StringTokenizer i = new StringTokenizer(text);
i.hasMoreElements() {
String word = i.nextToken();
if (listOfWords.contains(word)) {
numDuplicatedWords++;
} else {
listOfWords.add(word);
}
}
long endTime = System.currentTimeMillis();
System.out.println(numDuplicatedWords + " duplicated words. " +
"Using " + listOfWords.getClass().getName() +
", time = " + (endTime - startTime) + "ms.");
}

private static void countDuplicateWordsMap(String text) {
int numDuplicatedWords = 0;
Map wordsToFrequency = new HashMap();
long startTime = System.currentTimeMillis();
for (StringTokenizer i = new StringTokenizer(text);
i.hasMoreElements() {
String word = i.nextToken();
Integer frequency = (Integer)wordsToFrequency.get(word);
if (frequency == null) {
wordsToFrequency.put(word, new Integer(0));
} else {
int value = frequency.intValue();
wordsToFrequency.put(word, new Integer(value + 1));
numDuplicatedWords++;
}
}
long endTime = System.currentTimeMillis();
System.out.println(numDuplicatedWords + " duplicated words. " +
"Using Map " +
", time = " + (endTime - startTime) + "ms.");
}
}



..ed

--

www.EdmundKirwan.com - Home of The Fractal Class Composition

 
Reply With Quote
 
 
 
 
Lew
Guest
Posts: n/a
 
      12-31-2006
Ed wrote:
> And in case anyone's interested, here are the times for HashMap. Looks
> like Map is in the league of Set, and not the slow-moving List. (These
> times are longer than the previous times because of current CPU
> loading; relativity is the key.)
>
> 522393 duplicated words. Using java.util.HashSet, time = 789ms.
> 522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
> 522393 duplicated words. Using Map , time = 1180ms.
> 522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
> 522393 duplicated words. Using java.util.LinkedList, time = 274781ms.


These times are extremely interesting.

I speculate that the greater part of the difference between HashMap and
HashSet would be the second loop through the Map. Note that though the Map was
slightly slower than the Set, it delivers more information. With the Set you
only knew how many words were duplicated; with the Map you can also figure out
which words were, and how many times each one occurred.

You could, for example, use the Map to deliver the words in order of
frequency, given the right comparator over the entry set.

- Lew
 
Reply With Quote
 
 
 
 
John Ersatznom
Guest
Posts: n/a
 
      01-04-2007
Lew wrote:
> Ed wrote:
>
>> And in case anyone's interested, here are the times for HashMap. Looks
>> like Map is in the league of Set, and not the slow-moving List. (These
>> times are longer than the previous times because of current CPU
>> loading; relativity is the key.)
>>
>> 522393 duplicated words. Using java.util.HashSet, time = 789ms.
>> 522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
>> 522393 duplicated words. Using Map , time = 1180ms.
>> 522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
>> 522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

>
> These times are extremely interesting.
>
> I speculate that the greater part of the difference between HashMap and
> HashSet would be the second loop through the Map. Note that though the
> Map was slightly slower than the Set, it delivers more information. With
> the Set you only knew how many words were duplicated; with the Map you
> can also figure out which words were, and how many times each one occurred.
>
> You could, for example, use the Map to deliver the words in order of
> frequency, given the right comparator over the entry set.


A lot of the Map slowness is probably the churn of Integer objects
created. Using an int[1] as a "mutable Integer" would work far better
(although mutable objects in collections is normally bad, mutable values
in a map isn't generally a problem, so long as you don't have mutable keys).

On the subject of tabs, my copy of Thunderbird seems to be quietly
converting tabs into spaces, though I can't find the setting for it.
Posts apparently originally containing tabs (e.g. Ed's earlier) have
spaces when I view them, and my own posts written with tabs don't make
you complain. The curious thing is that incoming posts seem to have
tab->4 spaces and the editor shows tabs as 8 spaces, but they become 4
in the actual sent posting...and none of the options in Thunderbird say
anything about conversion of tabs at all, either to set their displayed
width or to actually change tabs to certain numbers of spaces. Hrm. The
"online help" doesn't open a help window, but rather hijacks my open
Firefox window, and the search there is useless on this topic too...
 
Reply With Quote
 
John Ersatznom
Guest
Posts: n/a
 
      01-04-2007
Oliver Wong wrote:
> "John Ersatznom" <(E-Mail Removed)> wrote in message
> news:en3as3$p8d$(E-Mail Removed)...
>
>>Oliver Wong wrote:
>>
>>>>>assert locale is German; //pseudcode
>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
>>>>>assert "BEISSEN".toLowerCase().equals("beißen");
>>>>
>>>>Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase( )).
>>>
>>>
>>>{
>>> String originalA = "color";
>>> a = originalA; // "color"
>>> a = a.toUppercase(); // "COLOR"
>>> a = a.toLowercase(); // "color"
>>> assert a.equals(originalA);
>>>}

>>
>>I don't see "colour" (with a U) in there anywhere, Oliver.

>
> You weren't intended to.


Then you're missing the point entirely. "COLOR" and "colour" differ only
by capitalization while "beissen" and "beißen" differ by spelling in a
manner similar to "color" vs. "colour". Alternate spellings of the same
word can't in general be idenfitied as identical by a computer -- not
without a trip through a spellchecking dictionary or the like, anyway. I
think you may be expecting too much of Java's humble string classes.
Perhaps Collator is smart enough for you?
 
Reply With Quote
 
Andrew Thompson
Guest
Posts: n/a
 
      01-04-2007
John Ersatznom wrote:
....
> Then you're missing the point entirely. "COLOR" and "colour" differ only
> by capitalization ..


As well as the 'u' in the second word. And from my
vague recollections of this thread (that I am not prepared
to review at this instant) - a misunderstanding between
the spelling observed, might actually explain this (sub)
thread..(?)

Andrew T.

 
Reply With Quote
 
Ehsan Khoddam mohammadi
Guest
Posts: n/a
 
      01-04-2007
I prefer to implement a BST for that, create a BST yourself, insert
vars to them , it uses
O(logn) to compare vars.

class BST {
Node root;
Node leftChild, rightChild;
int insert(String s){
return insert (new Node (String s),root);
}

int insert (Node node,Node node2insert){
if (node2insert==null){
node2insert=node;
return 1;
}
else {

if (node < node2insert){
return insert (node,node2insert.leftChild);

}
else if (node > node2insert){
return insert (node,node2insert.leftChild);
}
else return 0;
}
}



you insert all vars, which nodes that the insert return 0 is duplicated


}
http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:
> Hi all,
>
> I have eight variables : var1, var2... var 8. All types String.
> How to check that each variables has unique values ?
>
> Thank you for your help,
> xtanto


 
Reply With Quote
 
Oliver Wong
Guest
Posts: n/a
 
      01-04-2007

"John Ersatznom" <(E-Mail Removed)> wrote in message
news:eninf3$5i6$(E-Mail Removed)...
> Oliver Wong wrote:
>> "John Ersatznom" <(E-Mail Removed)> wrote in message
>> news:en3as3$p8d$(E-Mail Removed)...
>>
>>>Oliver Wong wrote:
>>>
>>>>>>assert locale is German; //pseudcode
>>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
>>>>>>assert "BEISSEN".toLowerCase().equals("beißen");
>>>>>
>>>>>Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase( )).
>>>>
>>>>
>>>>{
>>>> String originalA = "color";
>>>> a = originalA; // "color"
>>>> a = a.toUppercase(); // "COLOR"
>>>> a = a.toLowercase(); // "color"
>>>> assert a.equals(originalA);
>>>>}
>>>
>>>I don't see "colour" (with a U) in there anywhere, Oliver.

>>
>> You weren't intended to.

>
> Then you're missing the point entirely.


Must be, because I was under the impression I was making a point to you,
as opposed to the other way around. I thought you were curious as to how
manually doing case-insensitive conversions could fail, as opposed to using
the build in equalsIgnoreCase().



> "COLOR" and "colour" differ only by capitalization while "beissen" and
> "beißen" differ by spelling in a manner similar to "color" vs. "colour".


I disagree.

> Alternate spellings of the same word can't in general be idenfitied as
> identical by a computer -- not without a trip through a spellchecking
> dictionary or the like, anyway. I think you may be expecting too much of
> Java's humble string classes. Perhaps Collator is smart enough for you?


You should take the code I posted and put it in your favorite IDE, fix
the compile errors (apparently, it's toLowerCase, not toLowercase), and run
it. You might find the results enlightening. If those results surprise you,
add a few System.out.println(a) to see what's going on.

- Oliver


 
Reply With Quote
 
John Ersatznom
Guest
Posts: n/a
 
      01-05-2007
Andrew Thompson wrote:
> John Ersatznom wrote:
> ...
>
>>Then you're missing the point entirely. "COLOR" and "colour" differ only
>>by capitalization ..

>
>
> As well as the 'u' in the second word.


That wasn't supposed to be there, though the later "color" vs "colour"
(all lower case) is correct. I trust my meaning is still easy to glean.
 
Reply With Quote
 
John Ersatznom
Guest
Posts: n/a
 
      01-06-2007
Oliver Wong wrote:
>>>>I don't see "colour" (with a U) in there anywhere, Oliver.
>>>
>>> You weren't intended to.

>>
>>Then you're missing the point entirely.

>
> Must be, because I was under the impression I was making a point to you,
> as opposed to the other way around. I thought you were curious as to how
> manually doing case-insensitive conversions could fail, as opposed to using
> the build in equalsIgnoreCase().


Both will fail when you want words spelled differently to compare equal,
though Collator may have more smarts in that area.

>>"COLOR" and "colour" differ only by capitalization while "beissen" and
>>"beißen" differ by spelling in a manner similar to "color" vs. "colour".

>
> I disagree.


On what basis? The typo I made? It was meant to say "COLOR" and "color"
differ only by capitalization while "beissen" and "beißen" differ by
spelling in a manner similar to "color" vs. "colour".

In fact the analogy goes so far as for the number of letters in the
latter two examples to differ by one in both cases, and for a two letter
region in one to correspond to a single letter at the same place in the
other in particular. And (presumably -- I don't know the German word(s))
they are in both cases variant spellings of a different word --
differing in more than just capitalization, but used interchangeably or
as regional variants rather than having distinct meanings.

> You should take the code I posted and put it in your favorite IDE, fix
> the compile errors (apparently, it's toLowerCase, not toLowercase), and run
> it.


It would have been nice if Sun had been consistent about their own
capitalization. There's also Character.isWhitespace (in the same class!
Note lowercase s) and System.arraycopy (note lowercase c), at minimum.
Maybe they need to implement an isCamelCase method (note second
capital C)...

In any event, I suppose the real lesson here is that String (and
friends) get you primitive ordering and comparisons, perhaps somewhat
Anglocentric, and you need to use Collator and relatives for serious
language-and-locale-sensitive comparisons. I don't know the extent to
which even the latter will cope with variant spellings, mind you. There
is also a where-do-you-draw-the-line issue -- from case to slight
variations in the actual sequence of letters used on to more overt
differences, as between "huge" and "giant" -- when should those be
considered synonyms, and when different? -- and on until if you broaden
your requirements enough solving the NLP seems to be a required
component of any conforming implementation. Language has a fuzziness
in it in actual human usage that computers have trouble with. It's
curiously not unlike the problems that arose elsewhere here today with
float and double comparisons. You can't rely usefully on == for the most
part, and using Math.abs(x - y) < someThreshold gives an "equality" test
that's more meaninful in some ways but is not transitive any more.
Eventually linguistic equality loses transitivity too -- you can play
all kinds of games of picking close synonyms of the previous word to
grow a chain that can end in a fairly good approximation to an antonym
for your starting word, in most any language, using either phonemic
proximity or lexical proximity, and get different results with each besides.

The real upshot is simply "computers, at present, don't have the ability
to really model things in linguistics". But they know about abstract
sequences of discrete, wholly-distinct characters that happen to stand
for graphical squiggles meaningful to humans.

Play to their strengths -- the computers' *and* the humans'.
 
Reply With Quote
 
Oliver Wong
Guest
Posts: n/a
 
      01-08-2007

"John Ersatznom" <(E-Mail Removed)> wrote in message
news:eno52u$m8c$(E-Mail Removed)...
> Oliver Wong wrote:
>> I thought you were curious as to how manually doing case-insensitive
>> conversions could fail, as opposed to using the build in
>> equalsIgnoreCase().

>
> Both will fail when you want words spelled differently to compare equal,
> though Collator may have more smarts in that area.


I don't know how you define "fail" or "not fail" in this context, but the
point that I'm trying to make is that the two methods do not give the same
results and are thus not equivalent. Try running the example I provided
earlier, or try this example:

{
System.out.println("beißen".equalsIgnoreCase("BEIS SEN"));
System.out.println("beißen".toUpperCase().equals(" BEISSEN"));
}

>
>>>"COLOR" and "colour" differ only by capitalization while "beissen" and
>>>"beißen" differ by spelling in a manner similar to "color" vs. "colour".

>>
>> I disagree.

>
> On what basis?


Replace the "beißen" by "colour" and "BEISSEN" by "COLOR", and you will
see get different results, thus showing that the difference between "COLOR"
and "colour" is not of the same nature as that between "beißen" and
"BEISSEN".

- Oliver


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: How include a large array? Edward A. Falk C Programming 1 04-04-2013 08:07 PM
Subclassing Hash to enforce value uniqueness ala key uniqueness. Adam Gardner Ruby 5 11-19-2008 07:36 AM
Best Way to Check Uniqueness in Array Kasp Perl Misc 5 11-13-2003 04:06 PM
uniqueness across different child elements Don Bate XML 0 07-22-2003 10:42 PM
XML Schema keys, uniqueness based on ancestor's attribute Ognen Ivanovski XML 0 07-15-2003 02:36 PM



Advertisments