Velocity Reviews > C++ > Benchmarks

# Benchmarks

user923005
Guest
Posts: n/a

 11-06-2008
On Nov 6, 7:53 am, (E-Mail Removed) wrote:
> The task: Write a program that reads a set of words from standard
> input and prints the number of distinct words.

[snip]

I am surprised chuck has not chimed in here.
A hash table is *ideal* for this.

P.S. to those analyzing performance...
Since we have to examine every word in the list, the performance of
the algorithm *cannot* be faster than O(n).
The hash table solution is O(n).

Using a btree, skiplist, avltree, etc. will be O(n log n) because:
For each word, we must collect it.
For this word, we must check whether it is a duplicate. With a hash
table the check is O(1). With a logarithmic search structure, the
check is O(log n). (There is a multiplicative constant less than 1,
but that does not alter the O(log n) behavior.)
Hence: O(log n) * O(n) = O(n log n) for most ordered list variants.
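In today's C++ the hash-table version is a one-container change from the tree version (std::unordered_set postdates this thread, but it is the standard-library spelling of the idea; the helper function below is my own sketch, not code from the benchmark site):

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>

// Count distinct whitespace-separated words using a hash table:
// expected O(1) per insert, so O(n) over the whole input.
std::size_t count_distinct(std::istream& in) {
    std::unordered_set<std::string> words;
    std::string word;
    while (in >> word) words.insert(word);  // duplicate inserts are no-ops
    return words.size();
}
```

Reading from std::cin and printing the result reproduces the benchmark program.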

There is another structure that would be competitive. I suspect that a
ternary search tree might beat a hash table simply because of its
excellent memory access pattern, at least for lists of fewer than a
million items (and it would be hard to come up with more than a
million correctly spelled real words).
http://www.cs.princeton.edu/~rs/strings/

Juha Nieminen
Guest
Posts: n/a

 11-06-2008
user923005 wrote:
> The hash table solution is O(n).

You would have a hard time proving that.

user923005
Guest
Posts: n/a

 11-06-2008
On Nov 6, 1:21 pm, Juha Nieminen <(E-Mail Removed)> wrote:
> user923005 wrote:
> > The hash table solution is O(n).

>
> You would have a hard time proving that.

Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

My solution would also do the counting at insert time (IOW, the hash
insert function will return 0 if the item is already in the table and
return 1 if the item was inserted).
In that way, there is no need to scan the table and you can make it as
large as you like.

IOW:

unsigned long long count = 0;
char string[4096], *item;

while ((item = fgets(string, sizeof string, stdin)) != NULL)
    count += cuckoo_hash_insert(item);
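For concreteness, a toy cuckoo set whose insert reports novelty might look like this (entirely my own sketch, with made-up names: real cuckoo hashing uses two independent hash functions rather than the mixed second hash below, and a production table would add a stash):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class CuckooSet {
    std::vector<std::optional<std::string>> t1, t2;  // the two cuckoo tables

    std::size_t h1(const std::string& s) const {
        return std::hash<std::string>{}(s) % t1.size();
    }
    std::size_t h2(const std::string& s) const {
        // Cheap second hash derived by mixing -- an assumption of this
        // sketch, not how a real implementation would do it.
        std::uint64_t h = std::hash<std::string>{}(s);
        return (h * 0x9e3779b97f4a7c15ULL >> 7) % t2.size();
    }

    void grow() {  // double both tables and reinsert everything
        auto old1 = std::move(t1), old2 = std::move(t2);
        t1.assign(old1.size() * 2, std::nullopt);
        t2.assign(old2.size() * 2, std::nullopt);
        for (auto& v : old1) if (v) insert(std::move(*v));
        for (auto& v : old2) if (v) insert(std::move(*v));
    }

public:
    explicit CuckooSet(std::size_t cap = 8) : t1(cap), t2(cap) {}

    bool contains(const std::string& s) const {
        const auto& a = t1[h1(s)];
        const auto& b = t2[h2(s)];
        return (a && *a == s) || (b && *b == s);  // at most two probes
    }

    // Returns 1 if s was newly inserted, 0 if already present --
    // exactly the contract the counting loop above relies on.
    int insert(std::string s) {
        if (contains(s)) return 0;
        for (std::size_t kicks = 0; kicks < 8 * t1.size(); ++kicks) {
            auto& slot = (kicks % 2 == 0) ? t1[h1(s)] : t2[h2(s)];
            if (!slot) { slot = std::move(s); return 1; }
            std::swap(s, *slot);  // evict the occupant; rehome it next round
        }
        grow();                   // probable cycle: rebuild larger tables
        return insert(std::move(s));
    }
};
```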

Paul Hsieh
Guest
Posts: n/a

 11-07-2008
On Nov 6, 1:51 pm, (E-Mail Removed) (Richard Harter) wrote:
> On Thu, 6 Nov 2008 12:53:55 -0800 (PST), user923005
> <(E-Mail Removed)> wrote:
> >On Nov 6, 7:53 am, (E-Mail Removed) wrote:
> >> The task: Write a program that reads a set of words from standard
> >> input and prints the number of distinct words.

> >[snip]

>
> >[...] Using a btree, skiplist, avltree, etc. will be O(n log n) because: [...]

>
> hash table      - O(log n)
> comparison tree - O((log n)^2)
> radix trees     - O(log n)
>
> [etc]

I don't have any idea what anyone here is talking about. This is
clearly a "trie" problem. The performance is O(n), where n is the
length of the input (in characters). If your performance is any
different from that your implementation is just wrong. Here is my
solution, which is simpler/shorter than anything given so far or on
that website:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct trieNode {
    int canTerminateHere;
    struct trieNode *letter[26];
};

static int *insertTrie (struct trieNode *tn, const char *w) {
    if ('\0' == *w) return &tn->canTerminateHere;
    if (NULL == tn->letter[*w - 'a']) {
        if (NULL == (tn->letter[*w - 'a'] =
                (struct trieNode *) calloc (1, sizeof (struct trieNode))))
            return NULL;
    }
    return insertTrie (tn->letter[*w - 'a'], w + 1);
}

int main () {
    struct trieNode start = {0};
    char buff[2048], *s, *t;
    int count = 0, *ret;

    while (buff == fgets (buff, 2048, stdin)) {
        for (t = buff; *t; ) {
            s = t + strspn (t, "abcdefghijklmnopqrstuvwxyz");
            if (s != t) {
                char c = *s;
                *s = '\0';
                if (NULL == (ret = insertTrie (&start, t))) exit (-1);
                *s = c;
                count += 1 ^ *ret;
                *ret = 1;
            }
            t = s + strcspn (s, "abcdefghijklmnopqrstuvwxyz");
        }
    }
    printf ("Count: %d\n", count);
    return 0;
}

This makes the assumption that all inputs are contiguous words in
lines no longer than 2048 characters separated by white space or line
feeds or whatever. It also assumes that you have enough memory to
hold all the words input, at a rate of (26 * sizeof (void *) + sizeof
(int)) * (size of the dictionary of the input in characters), roughly
speaking. The program doesn't clean up after itself, but I don't
think that was in the requirements.

As for the *real* performance of this thing, it will come down to the
calloc() speed. I could waste a lot more lines of code to massively
improve the performance here. It might be worth it if, say, the
benchmark were trying to measure performance per line of code. So the
score would be, say, lines of code times time taken to output the
correct count or something, and it would then probably be worth it to
implement a calloc pooler (it would at least double the code size, but
probably make the performance at least 3 times higher, if the words
had a high enough uniqueness rate).
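As a rough illustration of the pooling idea (my naming, not Paul's code): in C++ a std::deque already allocates in chunks and never relocates its elements, so a zeroed-node pool is only a few lines:

```cpp
#include <deque>

// Node layout mirrors the trie post above.
struct trieNode {
    int canTerminateHere;
    trieNode* letter[26];
};

// Hand out zero-initialized nodes from chunked storage instead of one
// calloc per node; deque never moves elements, so returned pointers
// stay valid as the pool grows.
class NodePool {
    std::deque<trieNode> nodes;
public:
    trieNode* get() {
        nodes.push_back(trieNode{});  // value-initialized: all members zero
        return &nodes.back();
    }
};
```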

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jerry Coffin
Guest
Posts: n/a

 11-07-2008
In article <(E-Mail Removed)>, (E-Mail Removed) says...

[ ... ]

> Just as a note, the common claim that hash table are O(1) per
> access whereas search trees are O(log n) per access is
> misleading, and is arrived at by an apples and oranges
> comparison. The number of probes in a hash table is O(1) whereas
> the number of probes in a search tree is O(log n). However the
> cost of computing the hash code for independent keys is O(m)
> where m is the average key length, which is necessarily greater
> than log2(n). In comparison based trees the cost of the
> comparison is O(m), i.e., the probe cost is O(m). In radix based
> trees the probe cost is O(1). If O(m) = O(n) the execution costs
> per probe (ignoring access issues) are:

Keeping in mind, however, that m and n will usually be quite
different, so any similarity between O(m) and O(n) is purely
coincidental. In reality, the performance of hash tables tends to be
rather different in general from the performance of tree-based systems.

> hash table - O(log n)
> comparison tree - O((log n)^2)
> radix trees - O(log n)
>
> That is, hash tables and radix trees have the same order of
> performance with comparison trees a distant third.

Note, however, that there are factors that don't show up in a big-O
comparison that can still be quite substantial, especially for
collections of reasonable size. For one obvious one, consider the common
case (like this one) where the keys are strings. Hashing the string
requires looking at the whole string, but comparing two strings will
often look at _far_ fewer -- for the first few probes, it'll typically
only involve looking at one or two characters.

From a theoretical viewpoint, this effect is negligible -- but given the
number of words in a typical vocabulary, it can be quite substantial.
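The effect is easy to make visible with a counting comparator (the counters and names below are mine, purely for illustration): hashing a key must touch every character, while a lexicographic comparison stops at the first difference.

```cpp
#include <cstddef>
#include <set>
#include <string>

std::size_t chars_compared = 0;  // characters examined by comparisons

// Lexicographic less-than that counts how many characters it looks at;
// it bails out at the first differing character.
struct CountingLess {
    bool operator()(const std::string& a, const std::string& b) const {
        std::size_t n = a.size() < b.size() ? a.size() : b.size();
        for (std::size_t i = 0; i < n; ++i) {
            ++chars_compared;
            if (a[i] != b[i]) return a[i] < b[i];
        }
        return a.size() < b.size();
    }
};

// Hashing, by contrast, must examine every character of the key, so
// its per-probe character cost is the full key length.
std::size_t hash_char_cost(const std::string& s) { return s.size(); }
```

With words whose first letters already differ, each tree comparison touches a single character while every hash computation touches the whole word.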

--
Later,
Jerry.

The universe is a figment of its own imagination.

Nick Keighley
Guest
Posts: n/a

 11-07-2008
On 6 Nov, 15:53, (E-Mail Removed) wrote:

> The task: Write a program that reads a set of words from standard
> input and prints the number of distinct words.
>
> I came across a website that listed a few programs to accomplish this
> task, and thought that all of them did unnecessary operations,
> so I wrote my own. But for some reason, my version turned out slower
> than ALL of the versions on the website, even though it seems to
> perform fewer operations (yes, I benchmarked them on my own computer).
>
> According to the website, the slowest version is:
>
> #include <set>
> #include <string>
> #include <iostream>
>
> int main(int argc, char **argv)
> {
>     // Declare and Initialize some variables
>     std::string word;
>     std::set<std::string> wordcount;
>     // Read words and insert in rb-tree
>     while (std::cin >> word) wordcount.insert(word);
>     // Print the result
>     std::cout << "Words: " << wordcount.size() << std::endl;
>     return 0;
>
> }

the above uses an rb tree (or equivalent) in the set class

> My version is about 12 times slower than that. It uses lower-level
> constructs. Here it is:

[snip version using linear search]

> Any ideas as to what causes the big slowdown?

this is a very important lesson. Print it out in big letters
and post it on your wall.

CAREFUL ALGORITHM SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

--
Nick Keighley

CBFalconer
Guest
Posts: n/a

 11-07-2008
Juha Nieminen wrote:
> (E-Mail Removed) wrote:
>
>> Any ideas as to what causes the big slowdown?

>
> Why do you expect that searching a linked list could be even
> close to the speed of searching a balanced binary tree?

Which is O(log N), as compared to O(N). However a hash table is
even faster, being O(1) with suitable organization.

F'ups set to c.l.c. only.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>

CBFalconer
Guest
Posts: n/a

 11-07-2008
user923005 wrote:
> (E-Mail Removed) wrote:
>
>> The task: Write a program that reads a set of words from
>> standard input and prints the number of distinct words.

>
> [snip]
>
> I am surprised chuck has not chimed in here.
> A hash table is *ideal* for this.

I just did, a few minutes ago. Been having problems with the
news-server.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>

CBFalconer
Guest
Posts: n/a

 11-07-2008
Richard Harter wrote:
>

.... snip ...
>
> Just as a note, the common claim that hash table are O(1) per
> access whereas search trees are O(log n) per access is
> misleading, and is arrived at by an apples and oranges
> comparison. The number of probes in a hash table is O(1) whereas
> the number of probes in a search tree is O(log n). However the
> cost of computing the hash code for independent keys is O(m)
> where m is the average key length, which is necessarily greater
> than log2(n). In comparison based trees the cost of the
> comparison is O(m), i.e., the probe cost is O(m). In radix based
> trees the probe cost is O(1). If O(m) = O(n) the execution costs
> per probe (ignoring access issues) are:
>
> hash table - O(log n)
> comparison tree - O((log n)^2)
> radix trees - O(log n)
>
> That is, hash tables and radix trees have the same order of
> performance with comparison trees a distant third.

I think you might be surprised. Try the implementation of
wdfreq.c, which is a part of my hashlib distribution, as a usage
demonstration. Diddle it to use your comparison tree methods, and
I think you will be shocked. For example, processing n869.txt, a
1.2 MB text file, on a 450 MHz machine, takes 0.88 seconds.
All I/O is to disk.

143209 words, 3910 entries, 227358 probes, 77381 misses

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>

Phil Carmody
Guest
Posts: n/a

 11-07-2008
Juha Nieminen <(E-Mail Removed)> writes:
> user923005 wrote:
>> The hash table solution is O(n).

>
> You would have a hard time proving that.

He shouldn't. It should be pretty clear. No hash table needs
to examine each element more than once.

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic