On Nov 6, 1:51*pm, c...@tiac.net (Richard Harter) wrote:
> On Thu, 6 Nov 2008 12:53:55 -0800 (PST), user923005
> <dcor...@connx.com> wrote:
> >On Nov 6, 7:53=A0am, s0s...@gmail.com wrote:
> >> The task: Write a program that reads a set of words from standard
> >> input and prints the number of distinct words.
> >[snip]
>
> >[...] Using a btree, skiplist, avltree, etc. will be O(n log n) because: [...]
>
> hash table * * *- O(log n)
> comparison tree - O((log n)^2)
> radix trees * * - O(log n)
>
> [etc]
I don't have any idea what anyone here is talking about. This is
clearly a "trie" problem. The performance is O(n), where n is the
length of the input (in characters). If your performance is any
different from that your implementation is just wrong. Here is my
solution, which is simpler/shorter than anything given so far or on
that website:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct trieNode {
int canTerminateHere;
struct trieNode * letter[26];
};
static int * insertTrie (struct trieNode * tn, const char * w) {
if ('\0' == *w) return &tn->canTerminateHere;
if (NULL == tn->letter[*w-'a']) {
if (NULL == (tn->letter[*w-'a'] = (struct trieNode *) calloc
(1, sizeof (struct trieNode)))) return NULL;
}
return insertTrie (tn->letter[*w-'a'], w+1);
}
int main () {
struct trieNode start = {0};
char buff[2048], *s, *t;
int count = 0, *ret;
while (buff == fgets (buff, 2048, stdin)) {
for (t = buff; *t

{
s = t + strspn (t, "abcdefghijklmnopqrstuvwxyz");
if (s != t) {
char c = *s;
*s = '\0';
if (NULL == (ret = insertTrie (&start, t))) exit (-1);
*s = c;
count += 1 ^ *ret;
*ret = 1;
}
t = s + strcspn (s, "abcdefghijklmnopqrstuvwxyz");
}
}
printf ("Count: %d\n", count);
return 0;
}
This makes the assumption that all inputs are continguous words in
lines no longer than 2048 characters separated by white space or line
feeds or whatever. It also assumes that you have enough memory to
hold all the words input, at a rate of (26 * sizeof (void *) + sizeof
(int)) * (size of the dictionary of the input in characters), roughly
speaking. The program doesn't clean up after itself, but I don't
think that was in the requirements.
As for the *real* performance of this thing, it will come down to the
calloc() speed. I could waste a lot more lines of code to massively
improve the performance here. It might be worth it if, say, the
benchmark were trying to measure performance per line of code. So the
score would be, say, lines of code times time taken to output the
correct count or something, and it would then probably be worth it to
implement a calloc pooler (it would double the size at least, but
probably make the performance at least 3 times higher, if the words
had a high enough uniqueness rate).
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/