Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Datastructure design

Reply
Thread Tools

Datastructure design

 
 
Santosh
Guest
Posts: n/a
 
      11-19-2003
Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance
 
Reply With Quote
 
 
 
 
Sheldon Simms
Guest
Posts: n/a
 
      11-19-2003
On Tue, 18 Nov 2003 16:13:49 -0800, Santosh wrote:

> Hello,
>
> I would like some input on choosing a datastructure and a algorithm. I
> have a text file which contains three strings(say name, phonenumber
> and city). The file contains a about a billion records.


Your question is off topic here, but I have to wonder a bit... You
have a billion records in a file? How interesting. You know there
are these things called databases. They're quite useful in fact.

> I was thinking we could use multi-key quicksort but I am a little
> confused as to how to store the data.


No kidding... Do a google on disk sort algorithms.

 
Reply With Quote
 
 
 
 
Mac
Guest
Posts: n/a
 
      11-19-2003
On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:

> Hello,
>
> I would like some input on choosing a datastructure and a algorithm. I
> have a text file which contains three strings(say name, phonenumber
> and city). The file contains a about a billion records.
>
> I need to choose a datastructure which will sort efficienctly based on
> any of the strings(keys) which may be any one of the three or a
> combination of the three in which case we will need to sort with
> multiple keys.
>
> What is the best datastructure to store this data?
>
> the problem here is that the key is not fixed. It could be the name,
> phonenumber or the city and sometimes we nmight also need to sort
> first by name and then by city.
>
> I was thinking we could use multi-key quicksort but I am a little
> confused as to how to store the data. I could use a B-Tree to store
> the data but how I dont know how to implement the compare function,
> because the keys are not fixed ?
>
> Any suggestions?
>
> Thanks in advance


Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--

 
Reply With Quote
 
Santosh
Guest
Posts: n/a
 
      11-19-2003
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks



"Mac" <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
>
> > Hello,
> >
> > I would like some input on choosing a datastructure and a algorithm. I
> > have a text file which contains three strings(say name, phonenumber
> > and city). The file contains a about a billion records.
> >
> > I need to choose a datastructure which will sort efficienctly based on
> > any of the strings(keys) which may be any one of the three or a
> > combination of the three in which case we will need to sort with
> > multiple keys.
> >
> > What is the best datastructure to store this data?
> >
> > the problem here is that the key is not fixed. It could be the name,
> > phonenumber or the city and sometimes we nmight also need to sort
> > first by name and then by city.
> >
> > I was thinking we could use multi-key quicksort but I am a little
> > confused as to how to store the data. I could use a B-Tree to store
> > the data but how I dont know how to implement the compare function,
> > because the keys are not fixed ?
> >
> > Any suggestions?
> >
> > Thanks in advance

>
> Your question is off-topic here, but make sure you understand that sorting
> any data set which is bigger than your computer's real memory size is going
> to be a specialized task. You can't just read it into memory and then
> start swapping things around, because this will lead to thrashing.
>
> You're going to have to read a little, sort a little, read a little, sort
> a little, merge, merge, merge. Or something like that.
>
> Of course, maybe you have enough RAM to read in a billion records, I don't
> know.
>
> Good luck.
>
> Mac
> --

 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      11-19-2003
Santosh wrote:
>
> I am sorry that the question was not offtopic for this group. But I
> felt most of the C Programmers would definitely be able to suggest
> something. Well, I know we can use a Database. But the question was
> more intended from a design perspective.
>
> Since there are a billion records, I need to use for sure some kind of
> external sorting algorithm like mergesort. I could use a B-Tree to
> store the data with the phone number(or some unique identifier) as the
> key. But what I wanted to know what how can I sort this tree based on
> any other field stored in the node. Lets say a name, something like in
> a Database where we can sort by multiple fields of a record at the
> same time.


It's still off-topic, and C programmers are not ipso
facto experts on managing large data collections. Maybe
there will eventually be some C questions as you move ahead
with the implementation, but as yet no C question has been
asked.

<off-topic>

Start with Knuth "The Art of Computer Programming, Volume
III: Sorting and Searching," section 6.5 "Retrieval on
secondary keys." Follow bibliographical links as appropriate.

</off-topic>

--
http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Reply With Quote
 
Mike Wahler
Guest
Posts: n/a
 
      11-19-2003

"Santosh" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> I am sorry that the question was not offtopic for this group. But I
> felt most of the C Programmers would definitely be able to suggest
> something. Well, I know we can use a Database. But the question was
> more intended from a design perspective.
>
> Since there are a billion records, I need to use for sure some kind of
> external sorting algorithm like mergesort. I could use a B-Tree to
> store the data with the phone number(or some unique identifier) as the
> key. But what I wanted to know what how can I sort this tree based on
> any other field stored in the node. Lets say a name, something like in
> a Database where we can sort by multiple fields of a record at the
> same time.
>
> Thanks


Any sort implies a comparison of elements. Factor this
comparison out into a function (or several if different
comparisons are needed), and have your sort function call
the appropriate comparison function(s) (Do this by providing
a function pointer parameter for your sort function). This
is the way the standard library's 'qsort()' function works
(but 'qsort()' sorts an array, not a 'tree').

-Mike


 
Reply With Quote
 
pandy
Guest
Posts: n/a
 
      11-20-2003
(E-Mail Removed) (Santosh) wrote in message news:<(E-Mail Removed). com>...
> I am sorry that the question was not offtopic for this group. But I
> felt most of the C Programmers would definitely be able to suggest
> something. Well, I know we can use a Database. But the question was
> more intended from a design perspective.
>
> Since there are a billion records, I need to use for sure some kind of
> external sorting algorithm like mergesort. I could use a B-Tree to
> store the data with the phone number(or some unique identifier) as the
> key. But what I wanted to know what how can I sort this tree based on
> any other field stored in the node. Lets say a name, something like in
> a Database where we can sort by multiple fields of a record at the
> same time.


You can achieve it by using SQL. Easy!

>
> Thanks
>
>
>
> "Mac" <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> > On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
> >
> > > Hello,
> > >
> > > I would like some input on choosing a datastructure and a algorithm. I
> > > have a text file which contains three strings(say name, phonenumber
> > > and city). The file contains a about a billion records.
> > >
> > > I need to choose a datastructure which will sort efficienctly based on
> > > any of the strings(keys) which may be any one of the three or a
> > > combination of the three in which case we will need to sort with
> > > multiple keys.
> > >
> > > What is the best datastructure to store this data?
> > >
> > > the problem here is that the key is not fixed. It could be the name,
> > > phonenumber or the city and sometimes we nmight also need to sort
> > > first by name and then by city.
> > >
> > > I was thinking we could use multi-key quicksort but I am a little
> > > confused as to how to store the data. I could use a B-Tree to store
> > > the data but how I dont know how to implement the compare function,
> > > because the keys are not fixed ?
> > >
> > > Any suggestions?
> > >
> > > Thanks in advance

> >
> > Your question is off-topic here, but make sure you understand that sorting
> > any data set which is bigger than your computer's real memory size is going
> > to be a specialized task. You can't just read it into memory and then
> > start swapping things around, because this will lead to thrashing.
> >
> > You're going to have to read a little, sort a little, read a little, sort
> > a little, merge, merge, merge. Or something like that.
> >
> > Of course, maybe you have enough RAM to read in a billion records, I don't
> > know.
> >
> > Good luck.
> >
> > Mac
> > --

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
program design, what datastructure to choose question..... jason C Programming 6 11-10-2007 01:09 PM
What datastructure to use to store nodes/locations I've visited 6tc1@qlink.queensu.ca Java 3 06-23-2005 02:28 AM
Index-based datastructure Sharp Java 1 03-14-2005 11:56 AM
Set DataStructure Anony! Java 2 08-13-2004 02:32 AM
Datastructure and Algorithms Prateek Basu C Programming 4 01-24-2004 10:37 AM



Advertisments