Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > sort an array of strings

Reply
Thread Tools

sort an array of strings

 
 
Gordon Burditt
Guest
Posts: n/a
 
      11-28-2005
> If the data won't fit in memory, sorting pointers to it
>doesn't help much: The eventual rearrangement of the pointed-
>to data (whether explicit or implicit) will involve an awful
>lot of bouncing around on the disk.


Sorting an array in memory of structures consisting of (a) the
sort key and (b) the file offset of the record is often a
useful technique, especially if the array of structures fits
in memory even though the records don't. I remember using that
to advantage on a system with 32k of memory in the early 70's.
Copying the file to put it in sorted order did involve a lot
of seeking, and sometimes would overheat the floppy drive.

Gordon L. Burditt
 
Reply With Quote
 
 
 
 
Eric Sosman
Guest
Posts: n/a
 
      11-29-2005
Gordon Burditt wrote:

>> If the data won't fit in memory, sorting pointers to it
>>doesn't help much: The eventual rearrangement of the pointed-
>>to data (whether explicit or implicit) will involve an awful
>>lot of bouncing around on the disk.

>
>
> Sorting an array in memory of structures consisting of (a) the
> sort key and (b) the file offset of the record is often a
> useful technique, especially if the array of structures fits
> in memory even though the records don't. I remember using that
> to advantage on a system with 32k of memory in the early 70's.
> Copying the file to put it in sorted order did involve a lot
> of seeking, and sometimes would overheat the floppy drive.


Yah. Knuth cites a theorem by Floyd showing that the
task of rearranging the unsorted input (after generating a
sorted sequence of key/location pairs) has a lower bound
that makes it at least as hard as sorting the original
records in full, give or take a few constant factors.

Of course, those easily ignored constant factors can
have an awful lot to do with whether a method is or is not
practical. That's why Rainmaker needs to disclose quite a
lot more about his problem before advice can be advanced
with any confidence.

--
Eric Sosman
lid
 
Reply With Quote
 
 
 
 
Rainmaker
Guest
Posts: n/a
 
      11-29-2005
To be precise,
Take 2 terabytes of data

each string not longer than 100bytes.
I dont understand why you need to know what these strings signify?
These are normal strings and you can store anything in it < 100 bytes.

There is no pre existing order.

Thanks,
Rainmaker

Eric Sosman wrote:
> Rainmaker wrote:
>
> > Hi,
> >
> > Sorry for the delay....
> > By HUGE i mean giga bytes of data....it may be more....

>
> Still no quantitative information. How many gigabytes?
> How much more? Terabytes? Zettabytes?
>
> > Sorting the pointers seems to me a good solution. Can you please give
> > me a small example(in C) as to how to achieve this?

>
> If the data won't fit in memory, sorting pointers to it
> doesn't help much: The eventual rearrangement of the pointed-
> to data (whether explicit or implicit) will involve an awful
> lot of bouncing around on the disk. Let's see: one gigabyte
> of 100-byte strings is ten million strings. If you must do
> a 10-ms seek before reading each of them that's 100,000
> seconds or about twenty-eight hours -- and that's *after*
> sorting the pointers, which involved a few disk accesses
> in its own right ...
>
> Rainmaker, if you want usable advice you're going to have
> to describe your problem in considerably more detail. How
> many strings are there, how long are they, what do they signify,
> is there any pre-existing order you can exploit, why do you
> want them sorted anyhow, ...?
>
> --
> Eric Sosman
> lid


 
Reply With Quote
 
Mark McIntyre
Guest
Posts: n/a
 
      11-29-2005
On 29 Nov 2005 14:42:41 -0800, in comp.lang.c , "Rainmaker"
<> wrote:

>I dont understand why you need to know what these strings signify?


The more we understand about your (rather peculiar) requirement, the
more likely it is that someone can come up with an answer. The
solution for 10,000,000 short bit different strings could be radically
different from the solution for 10,000 long very similar strings.

For instance, my immediate thought is - 2 TB of totally unsorted data
simply should not have been allowed to come into existence. You have a
major problem with whatever is creating these files, and you need to
take a step back and redesign that end of it.

Another thought - if the strings can be grouped into similar sets eg
everything that starts "foobarpotato" in one group, everything that
starts "wibblewobble" in another, you could perhaps rapidly sort into
groups, then each group, perhaps writing them out to separate files
and using filesystem tools such as the unix sort utility.

Hence its useful to know more.

--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
 
Reply With Quote
 
Joe Wright
Guest
Posts: n/a
 
      11-30-2005
Rainmaker wrote:
> To be precise,
> Take 2 terabytes of data
>
> each string not longer than 100bytes.
> I dont understand why you need to know what these strings signify?
> These are normal strings and you can store anything in it < 100 bytes.
>
> There is no pre existing order.
>
> Thanks,
> Rainmaker
>
> Eric Sosman wrote:
>


My apologies to Eric; I have snipped completely everything.
Now Rainmaker. Let's see.. 2 terabytes is 2.0e12 bytes, right? 100 bytes
is 1.0e2 I think. Dividing file size by line length give 2.0e10 I think.
That's 20 giga-lines, right?

We're trying to get a grip on what you have and what you are trying to
achieve. Your data set seems over large. There aren't 20 giga-lines in
all the books in the Library of Congress.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
 
Reply With Quote
 
Peter Nilsson
Guest
Posts: n/a
 
      11-30-2005
[Please don't top-post.]

Rainmaker wrote:
> To be precise,
> Take 2 terabytes of data
>
> each string not longer than 100bytes.
> I dont understand why you need to know what these strings signify?


Perhaps you don't, but then you don't understand how to sort such
strings either, so I don't understand why you're withholding
information
that could help you get you an efficient solution to your problem.

> These are normal strings and you can store anything in it < 100 bytes.


What is a 'normal' string?

> There is no pre existing order.


If they can store _anything_, why exactly are you sorting them? How
will that benefit
your future processing?

It sounds like you don't actually have a C question per se, rather you
have a programming issue that could probably be better answered in
another, more general, forum, e.g. comp.programming.

But you're unlikely to get different answers there either, if you're
not going to
offer clues that might help to make the sort more streamlined.

--
Peter

 
Reply With Quote
 
Walter Roberson
Guest
Posts: n/a
 
      11-30-2005
In article < .com>,
Rainmaker <> wrote:
>Can anyone tell me an efficient algorithm to sort an array of strings?
>Keep in mind that this array is HUGE and so the algorithm should me
>efficient enough to deal with it.


Taking into account the other information you posted, about the
amount of data and about the size of each string, and the lack of
information about what the strings signify (information that might
potentially allow for a better algorithm based upon the internal
structure of the strings, or based upon string probabilities ):

I wonder whether you really need to -sort- the strings, or if you
only need to be able to -locate- a string efficiently?

Often, the most efficient way for locating particular strings is -not-
through sorting them. Instead, the most efficient way might be
through some kind of hash table. The efficiency of the hash table
would depend upon matters such as whether you need to be able to
update the table without rebuilding it; upon the distribution
of the strings; and upon restrictions you may need to impose about
paging data in from disk.
--
Is there any thing whereof it may be said, See, this is new? It hath
been already of old time, which was before us. -- Ecclesiastes
 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      11-30-2005
Rainmaker wrote:

> To be precise,
> Take 2 terabytes of data
>
> each string not longer than 100bytes.


Okay: at least 20,000,000,000 strings.

> I dont understand why you need to know what these strings signify?


For example, suppose they are all United States postal
("ZIP") codes consisting of five digits. You could "sort"
them by creating a table of 100,000 counters, reading the
data and counting how many times each code appears, and
then traversing the table in order, outputting the proper
number of copies of the input string. Knowing something
about the data doesn't always open up other approaches, but
it sometimes does.

More generally, it's often helpful to describe the
overall problem you're trying to solve, not just seek advice
on one of the steps in a solution you've come up with. It
may happen that there's a better way to attack the larger
problem, one that avoids the step that's giving trouble.

Despite repeated pleas, though, you persist in dribbling
out tiny bits of information -- always, always less than was
asked for. All right, I wash my hands of you; I've got better
things to do than coddle the uncooperative for free. Based
on the little you've told us about your problem:

- You've got more data than is likely to fit in high-
speed memory, so you need what is known as an "external
sort." Plenty of such programs have been written already
and I suggest you don't set out to write yet another.
Unix systems come with a utility imaginatively named
"sort," and there are commercial products that are faster,
fancier, friendlier, and so on. Choose one and use it.

- The amount of time required to sort your data is hard
to estimate. The I/O speeds of your disks will play
a part, as will the size of your computer memory. If
we suppose you can use ten gigabytes of memory, the
technique of replacement selection applied to your two
terabytes of input will give about one hundred initial
sub-files to sort, and you might be able to finish up
with a hundred-way merge. If you've got only two gig
of buffer space, though, you'll get something like five
hundred sub-files and will probably want a different
merge pattern. In any event, the choice of a good merge
pattern will be heavily dependent on the characteristics
of your computer and disks.

- There's at least a fifty percent chance that you really
don't need to sort the data in the classical sense at
all. But lacking information about what you're trying
to do, I'm unable to suggest any likely shortcuts. Go
ahead and sort, if that's your desire.

... and for further advice, seek another newsgroup. You
do not have a C problem -- not unless you decide to write a
C program, and I suspect you're not yet knowledgable enough
about sorting to write an effective one -- so you should seek
a forum frequented by sorting experts, not experts in C (or
Fortran, or Java, or ...). Good bye, and good luck.

--
Eric Sosman
lid
 
Reply With Quote
 
osmium
Guest
Posts: n/a
 
      11-30-2005
"Rainmaker" wrote:

> To be precise,
> Take 2 terabytes of data
>
> each string not longer than 100bytes.
> I dont understand why you need to know what these strings signify?
> These are normal strings and you can store anything in it < 100 bytes.
>
> There is no pre existing order.


This link looks kind of reasonable to me. I post this for background and as
a starting point, I didn't follow the links to the code. But I don't think
you want code anyway. I think the number of files your OS will allow you to
have open at a single time will become interesting as you pursue this.
Depending on the real numbers in your problem, you may want to open hundreds
of files at a time and your OS may be the constraint on what you can do..

I found this with< "external sort" huge OR giga>

http://cis.stvincent.edu/swd/extsort/extsort.html


 
Reply With Quote
 
websnarf@gmail.com
Guest
Posts: n/a
 
      11-30-2005
Rainmaker wrote:
> To be precise,
> Take 2 terabytes of data
>
> each string not longer than 100bytes.
> I dont understand why you need to know what these strings signify?
> These are normal strings and you can store anything in it < 100 bytes.


So reference URLS from a weblog of a very popular website or something
like that?

Ok, like Peter posted, your top-level should be merge-sort. This will
allow you to retain locality of reference, which is very important once
you get to sizes that large. Ok, but once your recursion makes it down
to in-memory sizes, you should use IntroSort which is implemented as
quick-sort on top of heapsort (not on top of InsertSort, which some
IntroSort implementations do.) As an additional condition, once you
reach "in-cache" sizes (L2 size is ok) for the partition, switch to
heap-sort. And of course, generate pointers first, and sort the
pointers first.

Here's why:

1. Ordinary quicksort on *average* performs fewer compares and data
moves than either heap-sort or merge-sort. (But note that the
guaranteed O(n*log(n)) versions of Quicksort, based on the Select
algorithm, do *NOT* have this property -- i.e., its only worth it if
you use the naive style algorithm.) So when you are dominated by
comparison and access times, quick-sort is the best bet.

2. On modern processors, Heap-sort is faster once you reach the cache.
This is not very well known however since I had to discover this for
myself a couple years ago: http://www.pobox.com/~qed/sort.html . This
does require that you use "branch removal" techniques, or use a moden
compiler which implements these techniques (such as the Intel
compiler.) The reason is that once everything is in the cache, branch
misses start to dominate the actual bottom line performance. By
properly coding up heap-sort, it almost completely removes all branch
mispredictions (even though it performs three times as many comparisons
and twice as many data moves as quicksort).

3. Random disk access is just tremendously slow, and can only be
mitigated by streaming the data in the order that it is stored in in
the first place. I.e., first random access is always far more
expensive than reads that follow it, if they are sequentially from the
first read. Mergesort is the only sorting algorithm that performs
streaming.

There are some new parallel sorting algorithms which are very close to
O(n*log(n)/#cpus) however I don't know very much about them. It seems
to me that once you recurse to "in-memory" you can dispatch each
segment to a processor, which will basically be close enough. But I am
really just talking off the top of my head on that.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Merge Sort in C - array output is same as input after sort routine completes rkk C Programming 9 09-24-2006 08:30 PM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
Array sort function sorts on chars not numbers ... help ! how to sort numbers GIMME Javascript 5 07-26-2004 01:28 AM
multi-field array sort using Sort::Fields method Domenico Discepola Perl Misc 6 04-28-2004 04:28 AM
Ado sort error-Ado Sort -Relate, Compute By, or Sort operations cannot be done on column(s) whose key length is unknown or exceeds 10 KB. Navin ASP General 1 09-09-2003 07:16 AM



Advertisments