Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Binary search trees (AVL trees)

Reply
Thread Tools

Binary search trees (AVL trees)

 
 
jacob navia
Guest
Posts: n/a
 
      01-03-2010
Continuing the container library I have started the more complex
containers: trees.

I implemented over the holidays the AVL trees in a small module of
around 600 lines. Obviously the bare minimum: insert/delete and
find. There is an equal method to compare two trees.

This is a powerful container that can search a giga-object database with
just around 30 comparisons...

But it is still not the red/black trees used by the C++ crowd, that will
be done later.

This trees implement sets, i.e. all elements are unique, and there are
no keys, the element itself is the key and it is stored just behind
each node.

What bothers me is the enormous overhead of the 64 bit pointers:

typedef struct tagBinarySearchTreeNode {
char hidden;
char factor;
struct tagBinarySearchTreeNode *left;
struct tagBinarySearchTreeNode *right;
char data[];
} BinarySearchTreeNode;

Size: 24 bytes overhead

To store the "hidden" bit and the 3 bits needed for the factor
I am wasting 8 bytes since the pointers must be aligned at an
8 byte boundary... Besides I am wasting 16 bytes in those two
pointers when there will be very few applications that will use
trees of more than 4Giga objects... and in fact most applications
will fit in 65535 objects.

If we limit ourselves to 65535 objects in the tree, we could
allocate a table with 65535 nodes, and instead of using 8 byte pointers
we could use 16 bit unsigned shorts that would index into that
table. At the same time, we eliminate the alignment bytes.

We would have then:

typedef st+ruct tagBinarySearchTreeNode {
char hidden;
char factor;
unsigned short left;
unsigned short right;
char data[];
} BinarySearchTreeNode;

Size: 6 bytes overhead. An improvement of 300% !!

Limiting the number of objects to 4 Giga objects yields
still an improvement of 100% since our node would take
12 bytes.

It is in this flexibility to elimnate waste where C could
make a big contribution precisely.
 
Reply With Quote
 
 
 
 
jacob navia
Guest
Posts: n/a
 
      01-03-2010
jacob navia a écrit :
[snip]

After posting that, I have just discovered that I can do the transition
between the smallish 65535 limited trees to the 32 bit ones AUTOMATICALLY!

Since I keep the count of the elements in the tree header, when I arrive
at inserting the 65537th element I have just to change the POINTER of
the vtable to point to the vtable of the 32 bit version, reallocate the
nodes in 32 bits and go on!

This is completely impossible to do in C++ since an object can't change
dynamically its class at runtime.

This idea has far reaching consequences for doing dynamically adaptive
data structures. I could start all containers in the "smallish" version,
and grow as needed.

What is interesting in the C version is that it is completely dynamic.
You can adapt the data structure to new conditions AS NEEDED, without
having to fix everything during the compilation.

True, the insertion of the 65537th element will take some time...

Happily for C++ users, the library will be available for them too.

Just add the

extern "C"

jacob
 
Reply With Quote
 
 
 
 
Gene
Guest
Posts: n/a
 
      01-03-2010
On Jan 3, 2:43*pm, jacob navia <ja...@spamsink.net> wrote:
> Continuing the container library I have started the more complex
> containers: trees.
>
> I implemented over the holidays the AVL trees in a small module of
> around 600 lines. Obviously the bare minimum: insert/delete and
> find. There is an equal method to compare two trees.
>
> This is a powerful container that can search a giga-object database with
> just around 30 comparisons...
>
> But it is still not the red/black trees used by the C++ crowd, that will
> be done later.
>
> This trees implement sets, i.e. all elements are unique, and there are
> no keys, the element itself is the key and it is stored just behind
> each node.
>
> What bothers me is the enormous overhead of the 64 bit pointers:
>
> typedef struct tagBinarySearchTreeNode {
> * * * * char * * * * * * * hidden;
> * * * * char * * * * * * * factor;
> * * * * struct tagBinarySearchTreeNode *left;
> * * * * struct tagBinarySearchTreeNode *right;
> * * * * char * * * * * * * data[];
>
> } BinarySearchTreeNode;
>
> Size: 24 bytes overhead
>
> To store the "hidden" bit and the 3 bits needed for the factor
> I am wasting 8 bytes since the pointers must be aligned at an
> 8 byte boundary... Besides I am wasting 16 bytes in those two
> pointers when there will be very few applications that will use
> trees of more than 4Giga objects... and in fact most applications
> will fit in 65535 objects.
>
> If we limit ourselves to 65535 objects in the tree, we could
> allocate a table with 65535 nodes, and instead of using 8 byte pointers
> we could use 16 bit unsigned shorts that would index into that
> table. At the same time, we eliminate the alignment bytes.
>
> We would have then:
>
> typedef st+ruct tagBinarySearchTreeNode {
> * * * * char * * * * * * * hidden;
> * * * * char * * * * * * * factor;
> * * * * unsigned short * * left;
> * * * * unsigned short * * right;
> * * * * char * * * * * * * data[];
>
> } BinarySearchTreeNode;
>
> Size: 6 bytes overhead. An improvement of 300% !!
>
> Limiting the number of objects to 4 Giga objects yields
> still an improvement of 100% since our node would take
> 12 bytes.
>
> It is in this flexibility to elimnate waste where C could
> make a big contribution precisely.


Not much surprise that 8-byte pointers need 8 bytes. This is why 64-
bit technology is not widespread: memory still has costs.

I'm curious about "hidden" and "factor". Last time I implemented AVL
trees, only 2 bits of balance information were needed.

Is the final "data" field meant to be a pointer to node data? If so,
you'd be better off with void*. It avoids at some casts and is more
evokative of what is meant.

I agree with defining nodes with just "data" and not key and value.
IME most applications require a way to get the key while working with
the value, so you end up maintaining extra pointers or storing keys
twice anyway. It's more flexible to think of "key" and "value" as
functions on structured data rather than separately stored values.
(FWIW, the Containers library of Ada 2005 took this approach for sets
in the form of very elegant generic definitions. Worth checking out.)

I suggest that tree equality is most cleanly provided by implementing
iterators rather than a special purpose traversal. A pair of
iterators makes equality testing trivial. Iterators that can start at
any given element and step in either direction are particularly handy:
they make the trees usable in many cases where you'd normally like a
vector/list but need to avoid O(n) per op behaviors. With AVL trees,
the absolute maximum depth is only about 100 or so. Consequently
iterator stacks can have fixed sizes.

You might consider offering a container that manages bare nodes:

typedef struct tagBinarySearchTreeNode {
struct tagBinarySearchTreeNode *left, *right;
char balance;
} BinarySearchTreeNode;

Now the user can define his or her own nodes that extend this. I
believe that for Intel architectures at least, this gives you access
to the 7 characters that otherwise would be padding. I.e.

typedef struct tagBinarySearchTreeNode {
struct tagBinarySearchTreeNode *left, *right;
char balance;
char data[7];
} MySearchTreeNode;

will still be packed in 24 bytes. (I have not checked this.) Or a 4-
byte integer index ought to fit as well.

It's a bit ugly, but I've done this by preprocessor with stuff like:

/* header file binarytrees.h */
#ifndef BINARY_TREE_DATA
#define BINARY_TREE_DATA
#endif

typedef struct tagBinarySearchTreeNode {
struct tagBinarySearchTreeNode *left, *right;
char balance;
BINARY_TREE_DATA
} BinarySearchTreeNode;
/* end header file binarytrees.h */

Now you can do stuff like

typedef struct {
...
} MY_DATA;

#define BINARY_TREE_DATA MY_DATA data[1];
#include "binarytrees.h"
#undef BINARY_TREE_DATA

Finally, trees that allow duplicate keys, have many uses. You should
consider supporting them. Iterators make that easier, too.

 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      01-03-2010
Gene a écrit :
>
> I'm curious about "hidden" and "factor". Last time I implemented AVL
> trees, only 2 bits of balance information were needed.


You are right of course!

I need 2 bits for the factor and 1 bit for the hidden bit. It was
as conceptual typo

Thanks a lot for your detailed answer.

There is a lot of info there. I will read it again tomorrow.

Thanks

jacob
 
Reply With Quote
 
Ben Pfaff
Guest
Posts: n/a
 
      01-03-2010
jacob navia <> writes:

> Gene a écrit :
>>
>> I'm curious about "hidden" and "factor". Last time I implemented AVL
>> trees, only 2 bits of balance information were needed.

>
> You are right of course!
>
> I need 2 bits for the factor and 1 bit for the hidden bit. It was
> as conceptual typo


What is the "hidden bit"?

You can eliminate the "factor" field entirely by using a
scapegoat tree, which doesn't have any per-node overhead at all.
If you don't really need the "hidden bit" also, then you could
get rid of those 8 bytes of overhead entirely.
--
"Your correction is 100% correct and 0% helpful. Well done!"
--Richard Heathfield
 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      01-03-2010
Ben Pfaff a écrit :
> jacob navia <> writes:
>
>> Gene a écrit :
>>> I'm curious about "hidden" and "factor". Last time I implemented AVL
>>> trees, only 2 bits of balance information were needed.

>> You are right of course!
>>
>> I need 2 bits for the factor and 1 bit for the hidden bit. It was
>> as conceptual typo

>
> What is the "hidden bit"?
>


A small "optimization", instead of erasing some data, I declare
it "hidden", what saves for some time a tree reorganization.

> You can eliminate the "factor" field entirely by using a
> scapegoat tree, which doesn't have any per-node overhead at all.


Yes, I know. They are a better data structure. They will be implemented
later, probably before red/black trees

> If you don't really need the "hidden bit" also, then you could
> get rid of those 8 bytes of overhead entirely.


That's a good point. AVL trees are widely used though, and I want
to start this stuff with easy structures first, to better understand
what I am doing.



Thanks for your answer.
 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      01-03-2010
remod a écrit :
> jacob navia ha scritto:
>> Yes, I know. They are a better data structure. They will be implemented
>> later, probably before red/black trees

>
> Are you suggesting you will offer different data structures? If that's
> the case I would be interested in understing the rationale behind your
> choice.
>


AVL trees are very simple, and even if they need sme data per node, they
offer a simple structure for small/medium size applications.

> I'm working on a container library too


This is strange, I have been discussing my container library in this
discussion group since november if I remember well, and very often.
It is the first time thatI hear about your project.

I have published the interface and some code here.

What do you think of it?

Does your library follow those conventions too?


> but I've decided to implement a
> single data structure and support the key/value abstraction that is very
> easy to use.
>


Yes, it is easy to use if your data has a key. I will implement that
later, for data that has a key.

> My intent is to create a library with dynamic structures that are as
> simple to use as those offered by modern scripting languages.


What do you understand by "dynamic"?

> First
> version of a program is (hopefully) quickly whipped up using that
> library, than, if performance are an issue, the programmer could focus
> on a more efficient data structure. My take is that more often than not,
> the performance would be adequate for the task.
>


Well, the goal of my project is to get a STANDARD interface for
containers in C, so that we could use a language-wide interface.

I have discussed this at length in this forum. Maybe you can access
my posts about it.


> I'm currently using hash tables but I have considered (and I might
> change to one of the) various types of balanced trees (I was considering
> a treap or a scapegoat tree).
>


I am afraid that a songle structure can't be OK for ALL uses.

I have implemented Strings, string collections, arraylists, hash tables,
bitstrings, and now binary search trees. All those containers have
different requirements and objectives.

> The key point, though, is that the implementation of the container would
> be hidden to the programmer using the library. He/she will just use the
> API for Sets, Stacks, Queue, Vectors, etc without knowing how they are
> implemented.


My project will be open source with no copyright (no GPL either) so that
anyone can use it, and adapt it as he/she sees fit. Nothing is hidden,
but not all things are standardized. Actually, the objective is to
propose an interface standard, and propose a sample implementation.

>
> PS: I sent a question on the lcc newsgroup, have you had the opportunity
> of reading it?
>


Yes, but I was completely busy with the container stuff and could not do
what you requested, sorry.

 
Reply With Quote
 
Dennis \(Icarus\)
Guest
Posts: n/a
 
      01-03-2010
"jacob navia" <> wrote in message
news:hhqth2$g3$...
> jacob navia a écrit :
> [snip]
>
> After posting that, I have just discovered that I can do the transition
> between the smallish 65535 limited trees to the 32 bit ones AUTOMATICALLY!
>
> Since I keep the count of the elements in the tree header, when I arrive
> at inserting the 65537th element I have just to change the POINTER of
> the vtable to point to the vtable of the 32 bit version, reallocate the
> nodes in 32 bits and go on!
>
> This is completely impossible to do in C++ since an object can't change
> dynamically its class at runtime.


And as far as I know, neither can a C structure variable.
Is this an extension specific to lcc?

Dennis

 
Reply With Quote
 
Keith Thompson
Guest
Posts: n/a
 
      01-03-2010
jacob navia <> writes:
[...]
> My project will be open source with no copyright (no GPL either) so that
> anyone can use it, and adapt it as he/she sees fit. Nothing is hidden,
> but not all things are standardized. Actually, the objective is to
> propose an interface standard, and propose a sample implementation.

[...]

In other words, you want it to be public domain.

My understanding is that, under most current copyright laws,
anything you create is automatically copyrighted by default.
To release it to the public domain, you have to say so explicitly.

See, for example, <http://www.sqlite.org/copyright.html>. I presume
they got it right. I am not a lawyer; please do not take my word
on anything I say on this topic without checking elsewhere.

--
Keith Thompson (The_Other_Keith) kst- <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
Reply With Quote
 
Stefan Ram
Guest
Posts: n/a
 
      01-04-2010
remod <> writes:
>I'm working on a container library too but I've decided to implement a
>single data structure and support the key/value abstraction that is very
>easy to use.


This is 2010. C is one of the most widespread programming
languages, and containers (like AVL trees) are one of the
most widespread entities in programming and known for
decades now.

Shouldn't there be a great choice of container libraries for
C around already? What is the point in writing one's own
instead of reusing someone else's container library?

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Binary tree search vs Binary search Bogdan C Programming 22 10-21-2010 09:46 PM
Re: Binary search trees or hash tables? BGB / cr88192 C Programming 5 01-07-2010 12:01 AM
Re: Binary search trees or hash tables? Dann Corbit C Programming 0 01-06-2010 04:16 AM
binary search trees j_depp_99@yahoo.com C++ 2 03-05-2008 02:11 PM
Help : Merging Binary Search Trees ptrSriram C Programming 3 12-12-2005 04:37 AM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57