Velocity Reviews > Java > Merge Sort question

# Merge Sort question

Seth G.

 08-11-2003
The following question is about merge sort in general, not Java in
particular (I didn't know where else to put this query).

Merge sort uses the divide-and-conquer approach to sort an array in
roughly the following manner:

MERGE_SORT(Array, begin, end)
{
    if ( begin < end )
    {
        middle = ( begin + end ) / 2

        // merge sort first half of array
        MERGE_SORT(Array, begin, middle);

        // merge sort second half of array
        MERGE_SORT(Array, middle + 1, end);

        MERGE(Array, begin, middle, end);
    }
}
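For concreteness, here is a runnable Java version of the standard top-down merge sort; note the recursive calls cover begin..middle and middle+1..end. The class and method names are just for illustration:

```java
import java.util.Arrays;

public class MergeSortDemo {

    // Sorts a[begin..end] inclusive.
    static void mergeSort(int[] a, int begin, int end) {
        if (begin < end) {
            int middle = (begin + end) / 2;
            mergeSort(a, begin, middle);      // sort first half
            mergeSort(a, middle + 1, end);    // sort second half
            merge(a, begin, middle, end);     // combine the halves
        }
    }

    // Merges the sorted ranges a[begin..middle] and a[middle+1..end].
    static void merge(int[] a, int begin, int middle, int end) {
        int[] tmp = new int[end - begin + 1];
        int i = begin, j = middle + 1, k = 0;
        while (i <= middle && j <= end) {
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i <= middle) tmp[k++] = a[i++];
        while (j <= end)    tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, begin, tmp.length);
    }

    public static void main(String[] args) {
        int[] data = { 5, 2, 9, 1, 5, 6 };
        mergeSort(data, 0, data.length - 1);
        System.out.println(Arrays.toString(data)); // [1, 2, 5, 5, 6, 9]
    }
}
```

The `<=` in the comparison takes the left element on ties, which keeps the sort stable.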

What effect would dividing the array into more than 2 parts (say
3, 4, ..., n) have on the algorithm? Would it be more efficient than
merge sort's O(n log n) runtime? My inclination is to say no; it may in
fact be less efficient, due to the increased number of recursive calls
to merge sort pushing larger chunks of data onto the stack.

I discussed this with someone once, but I can't recall the outcome of
that conversation, or why I convinced myself it was no more efficient
than merge sort "as is".

Chris Smith

 08-11-2003
Seth G. wrote:
> What effect would dividing the array into more than 2 parts (say
> 3,4,...,n) have on the algorithm? Would it be more efficient than the
> merge sort runtime of O(n log n)?

No. Logarithms of different bases are constant multiples of each other,
so dividing into three pieces would produce an algorithm that could be
called O(n log[base 3] n), but that is equivalent to any other
logarithmic base, as far as big-O notation is concerned. You just end
up with a constant multiple difference of (log 3 / log 2), and constant
multiples don't matter to big-O notation.
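Chris's base-change argument is easy to check numerically: the ratio of log2 n to log3 n is the same constant for every n. A small sketch using only java.lang.Math:

```java
public class LogBaseDemo {
    public static void main(String[] args) {
        // log2(n) / log3(n) == ln 3 / ln 2 for every n,
        // so O(n log2 n) and O(n log3 n) are the same class.
        double constant = Math.log(3) / Math.log(2); // about 1.585
        for (int n : new int[] { 10, 1_000, 1_000_000 }) {
            double log2 = Math.log(n) / Math.log(2);
            double log3 = Math.log(n) / Math.log(3);
            System.out.printf("n=%-9d log2(n)/log3(n) = %.6f%n", n, log2 / log3);
        }
        System.out.printf("ln 3 / ln 2        = %.6f%n", constant);
    }
}
```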

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Roedy Green

 08-11-2003
On 11 Aug 2003 08:52:45 -0700, Seth G. wrote
or quoted :

>What effect would dividing the array into more than 2 parts (say
>3,4,...,n) have on the algorithm? Would it be more efficient than the
>merge sort runtime of O(n log n)? My inclination is to say no, and it
>may in fact be less efficient due to the increased recursive calls to
>merge sort, pushing larger chunks of data onto the stack.

It all depends on what algorithm it is using to sort the sub-arrays,
and whether that algorithm is more or less efficient than a merge over
the range of sizes in question. Normally a merge is used when:

1. You already have two sorted streams.

2. You do not have enough RAM to sort the whole pile at once, so you
sort it in chunks. The advantage of merge is that you can merge two
chunks without needing room in RAM for all of the input and output at
once.

Generally I would just use Sun's sort unless there is something
dreadfully wrong with it. Focus on writing a fast tight comparator.
If you want to experiment, benchmark various algorithms. See
http://mindprod.com/jgloss/sort.html

--
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

Eric Sosman

 08-11-2003
"Seth G." wrote:
>
> The following question is about merge sort, and not particularly Java
> (I didn't know where else to put this query).
>
> If merge sort uses the divide and conquer approach to sort an array in
> the following manner(roughly): [pseudocode snipped]
>
> What effect would dividing the array into more than 2 parts (say
> 3,4,...,n) have on the algorithm? Would it be more efficient than the
> merge sort runtime of O(n log n)? My inclination is to say no, and it
> may in fact be less efficient due to the increased recursive calls to
> merge sort, pushing larger chunks of data onto the stack.

The big-Oh behavior will be unaffected, although all of
the constant factor, the threshold, and the lower-order terms
might change.

Note that merging three sorted sub-files of twelve items
each is harder than merging two sorted sub-files of eighteen
items. Merging six six-item files is harder still -- and,
of course, if you found yourself merging thirty-six one-item
files, it might occur to you that the sort hadn't made a lot
of progress up to that point ...
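Eric's point can be illustrated by counting comparisons in a naive k-way merge that scans every run head for each element output, costing about k-1 comparisons per element. This is only a sketch; the run contents are arbitrary interleaved values and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class KWayMergeCost {
    static long comparisons;

    // Naive k-way merge: linearly scan all run heads for the minimum.
    static List<Integer> merge(List<List<Integer>> runs) {
        List<Integer> out = new ArrayList<>();
        int[] pos = new int[runs.size()];
        int total = runs.stream().mapToInt(List::size).sum();
        while (out.size() < total) {
            int best = -1;
            for (int r = 0; r < runs.size(); r++) {
                if (pos[r] >= runs.get(r).size()) continue; // run exhausted
                if (best < 0) { best = r; continue; }       // first candidate
                comparisons++;
                if (runs.get(r).get(pos[r]) < runs.get(best).get(pos[best])) best = r;
            }
            out.add(runs.get(best).get(pos[best]++));
        }
        return out;
    }

    // Builds a sorted run of 'len' items: start, start+3, start+6, ...
    static List<Integer> run(int start, int len) {
        List<Integer> r = new ArrayList<>();
        for (int i = 0; i < len; i++) r.add(start + 3 * i);
        return r;
    }

    public static void main(String[] args) {
        comparisons = 0;
        merge(List.of(run(0, 18), run(1, 18)));
        System.out.println("2-way, 18+18:    " + comparisons + " comparisons");
        comparisons = 0;
        merge(List.of(run(0, 12), run(1, 12), run(2, 12)));
        System.out.println("3-way, 12+12+12: " + comparisons + " comparisons");
    }
}
```

For the same 36 items, the 3-way merge performs roughly twice as many comparisons as the 2-way merge, matching Eric's observation.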

Carrying the idea to not quite such an extreme leads to
the "replacement selection" algorithm, described in detail
by Knuth ("The Art of Computer Programming, Volume III: Sorting
and Searching," Section 5.4.1) in connection with developing
the initial runs for external sorting, typically on magnetic
tape. The medium may seem archaic, but the algorithm can
certainly be applied in other contexts.


Roedy Green

 08-11-2003
On Mon, 11 Aug 2003 15:23:33 -0400, Eric Sosman <(E-Mail Removed)>
wrote or quoted :

> in connection with developing
>the initial runs for external sorting, typically on magnetic
>tape. The medium may seem archaic, but the algorithm can
>certainly be applied in other contexts.

The nice thing about merge is it requires only sequential access. It
can work in a bare minimum of RAM.

You will see it used in OptTech Sort, for example, after RAM-fuls of
data have been sorted and dumped to disk. After that first phase, you
do an N-way merge to create longer and longer sorted chunks on disk
until you are done.
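That N-way merge phase can be sketched with a min-heap over the heads of the runs. Here the runs are in-memory lists for brevity; in a real external sort each would be a buffered reader over a sorted chunk file (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class NWayMerge {
    // One entry per still-active run: the run's next value plus its iterator.
    record Head(int value, Iterator<Integer> rest) {}

    // Merges any number of sorted sequences using a min-heap,
    // reading each input strictly sequentially.
    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.value(), b.value()));
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();                // smallest head wins
            out.add(h.value());
            if (h.rest().hasNext()) heap.add(new Head(h.rest().next(), h.rest()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> merged = merge(List.of(
            List.of(1, 4, 7), List.of(2, 5, 8), List.of(0, 3, 6, 9)));
        System.out.println(merged); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Each input is consumed strictly sequentially, which is what lets this run over tape or disk in a bare minimum of RAM.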

see http://mindprod.com/jgloss/sort.html

In the old days, 3-way merges were accomplished with a bank of 6 tape
drives, three in and three out. You would seesaw the data back and
forth, gradually getting longer blocks of sorted data.

It was hypnotic to watch the tapes twitching away.

Today, people seem to panic as soon as Sun's sort runs out of RAM.

You'd think someone would have written a merge sort to extend Sun's
sort to giant data sets. It works best if you can put each of your
temp files on a different physical disk.

The complication is that you have to start serialising the objects.
With a pure in-RAM sort, all you need to manipulate is a single array
of references.


Eric Sosman

 08-11-2003

(Straying from topicality, but ...)

Roedy Green wrote:
>
> On Mon, 11 Aug 2003 15:23:33 -0400, Eric Sosman <(E-Mail Removed)>
> wrote or quoted :
>
> > in connection with developing
> >the initial runs for external sorting, typically on magnetic
> >tape. The medium may seem archaic, but the algorithm can
> >certainly be applied in other contexts.

>
> The nice thing about merge is it requires only sequential access. It
> can work in a bare minimum of RAM.
>
> You will see it used in OptTech sort for example after RAM-fuls of
> data have been sorted and dumped to disk. After that first phase, you
> do an N-way merge to create longer and longer chunks on disk until you
> are done.

The nice thing about replacement selection is that you
typically get run lengths of about twice the memory capacity,
hence half as many chunks to deal with while merging.

> see http://mindprod.com/jgloss/sort.html

Hmmm ... I'd take issue with your characterization of
HeapSort as mostly good for already-ordered data and of
ShellSort as good only for small data sets. HeapSort is
perfectly happy with jumbled data, and ShellSort with well-
chosen increments can be very fast indeed. See

http://www.cs.princeton.edu/~rs/shell/index.html

> In the old days, 3-way merges were accomplished with a bank of 6 tape
> drives, three in and three out. You would seesaw the data back and
> forth, gradually getting longer blocks of sorted data.

That wasn't a very good way to do things. See Knuth for
descriptions and analyses of tape merge patterns that are far
more efficient, in the sense of shuffling the data around
fewer times. It seems likely that the tapes you saw were in
fact following one or another of these more sophisticated
patterns.

> It was hypnotic to watch the tapes twitching away.

Been there, done that, done thaaat, doo-one tha-a-a...


Roedy Green

 08-12-2003
On Mon, 11 Aug 2003 17:50:45 -0400, Eric Sosman <(E-Mail Removed)>
wrote or quoted :

>replacement selection

Is this what is called a "tournament" sort? How does it work?


Eric Sosman

 08-12-2003
Roedy Green wrote:
>
> On Mon, 11 Aug 2003 17:50:45 -0400, Eric Sosman <(E-Mail Removed)>
> wrote or quoted :
>
> >replacement selection

>
> Is this what is called a "tournament" sort? How does it work?

*Very* briefly, since the topicality seems shaky: Read as
many items as you possibly can. Then select the least item and
write it to the merge file. Fill the vacated spot by reading
another input item, which gives two cases: if the new item is
greater than or equal to the last item output, just keep on
repeating the select-write-replace loop. If the new item is
less than the last item output, it can't be added to the run
currently being written, so it must be deferred to the next
run. Set it aside for now, thus reducing the number of items
still available for selection. If any items still are available
for selection, continue the select-write-replace loop.

Eventually, memory becomes completely full of deferred items,
none of which can belong to the current output run. Finish off
that run (writing trailer records, closing files, whatever) and
begin a new one, using all those deferred items as the initial
contents of the selection tree.

This process is something like an M-way merge of the input
data with itself, where M is the number of items that fit in
memory. For further details, including an analysis of why the
expected run length is approximately 2*M, see Knuth "The Art
of Computer Programming, Volume III: Sorting and Searching,"
Section 5.4.1.
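A minimal in-memory sketch of replacement selection, following the description above: a PriorityQueue stands in for the selection tree, and the runs are just lists rather than merge files (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ReplacementSelection {
    // Produces sorted runs from 'input' using a selection heap holding
    // at most 'memory' items; items smaller than the last value written
    // are deferred to the next run.
    static List<List<Integer>> runs(List<Integer> input, int memory) {
        PriorityQueue<Integer> current = new PriorityQueue<>();
        List<Integer> pending = new ArrayList<>();   // deferred to next run
        int next = 0;
        while (next < input.size() && current.size() < memory) {
            current.add(input.get(next++));          // fill memory
        }
        List<List<Integer>> runs = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        while (!current.isEmpty()) {
            int out = current.poll();
            run.add(out);                            // select and "write"
            if (next < input.size()) {               // replace the vacated spot
                int in = input.get(next++);
                if (in >= out) current.add(in);      // can extend this run
                else pending.add(in);                // must wait for next run
            }
            if (current.isEmpty()) {                 // finish run, start next
                runs.add(run);
                run = new ArrayList<>();
                current.addAll(pending);
                pending.clear();
            }
        }
        if (!run.isEmpty()) runs.add(run);
        return runs;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(5, 1, 9, 2, 8, 3, 7, 4, 6, 0);
        System.out.println(runs(data, 3));
        // [[1, 2, 5, 8, 9], [3, 4, 6, 7], [0]]
    }
}
```

Note that with a 3-item heap the first run comes out 5 items long, illustrating how runs tend to exceed the memory capacity M (approximately 2*M on random input, per Knuth).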


Roedy Green

 08-14-2003
On Tue, 12 Aug 2003 15:11:06 -0400, Eric Sosman <(E-Mail Removed)>
wrote or quoted :

> If the new item is
>less than the last item output, it can't be added to the run
>currently being written, so it must be deferred to the next
>run.

You need some sort of structure that lets you easily peel off the
lowest element, and that lets you add a new element so that when its
time comes it will peel off as lowest.

This need not be a fully sorted list. Many years ago I implemented
HeapSort (invented by J. W. J. Williams). I have a Java version at
http://mindprod.com/jgloss/heapsort.html The heap structure it uses
might fill the bill. It was too long ago to remember the details; I
vaguely recall it was like an authority hierarchy at a business.

Roedy Green

 08-14-2003
On Thu, 14 Aug 2003 12:43:19 -0400, Eric Sosman <(E-Mail Removed)>
wrote or quoted :

> Heapsort and binary search are the two algorithms that
>make me nostalgic for FORTRAN's "arithmetic IF" statement
>and its three-way branching capability.

compareTo is reminiscent of the old FORTRAN three-way arithmetic IF.

You write code something like this:

int diff = a.compareTo(b);
if ( diff > 0 )
{
    return x;
}
else if ( diff < 0 )
{
    return y;
}
else
{
    return z;
}

If that is natively compiled on a Pentium, the comparison is done only
once, and then there are conditional jumps based on the condition
code.
