Re: benchmarks? java vs .net

 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 1:29 am, Barry Kelly <(E-Mail Removed)> wrote:

<snip>

> Of course, the whole point of my naive algorithm is that it is naive, so
> I didn't do any of this. It's a 30-minute hack. I just find it
> interesting that TPL can, on occasion, take close to 30% more time than
> a straightforward naive implementation, depending on work chunk size.


Out of interest, which release of Parallel Extensions did you measure
against? Have you installed the CTP from Monday yet? I'd be interested
to hear whether that is better or worse than the December one in your
case.

Also, have you posted on the PFX forum about this? I'm sure the team
would be interested in examining it.

Jon
 
 
 
 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 2:52 am, Lew <(E-Mail Removed)> wrote:
> > One problem with creating anything like Parallel Extensions (which I
> > assume is what you meant - AFAIK the TPL is just part of Parallel
> > Extensions, although I could be misunderstanding you completely) for
> > Java is that it doesn't (currently) have closures other than anonymous
> > inner classes which are ugly as hell.

>
> I personally like anonymous (and named) inner classes in lieu of closures, myself.
>
> I take more the attitude of the maintenance programmer than the initial
> developer. The very verbosity of inner classes helps me understand what's
> going on.


I can't see why - it's just more fluff that gets in the way of the
actual logic.

For instance, I find it much easier to understand this:

people.Where(person => person.Age > 18);

than this:

people.where(new Predicate<Person>()
{
    public boolean match(Person person)
    {
        return person.getAge() > 18;
    }
});

It's entirely reasonable to chain together several method calls with
lambda expressions and still end up with readable code. By the time
you've got more than a couple of anonymous inner classes in a single
statement, I think the readability ends up being lost.
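
To make the chaining point concrete, here's a hypothetical sketch (the
Person type and the data are invented for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string Name;
    public int Age;
}

class ChainingDemo
{
    static void Main()
    {
        List<Person> people = new List<Person>
        {
            new Person { Name = "Ann", Age = 34 },
            new Person { Name = "Bob", Age = 12 },
            new Person { Name = "Cat", Age = 56 }
        };

        // Three chained operations, each a one-line lambda - the logic
        // stays visible. The same pipeline built from three anonymous
        // inner classes would bury it in boilerplate.
        IEnumerable<string> names = people
            .Where(p => p.Age > 18)
            .OrderBy(p => p.Name)
            .Select(p => p.Name);

        foreach (string name in names)
            Console.WriteLine(name);
    }
}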

> > The state of closures in Java 7 is currently up in the air.

>
> In part because a lot of Java mavens feel similarly, or they subscribe to Josh
> Bloch's opinion of, "When in doubt, leave it out", or they have similar
> concerns as he about what the closure idiom would do to other long-standing
> Java idioms and semantics.


I have concerns over some of the proposals under consideration. I
personally don't like the idea of closures being able to return beyond
their own call, if you see what I mean - I think of closures as a way
of expressing logic for "normal" methods more easily, rather than a
way of introducing whole new control structures.

None of the proposals currently on the table is entirely satisfactory
to me, and I'd like to see some more with possibly more limited scope
but simpler syntax. However, I *do* find closures incredibly useful as
a tool to have available, and I really don't like all the extra cruft
that anonymous inner classes require.

> Some of those in favor of closures, by my general impression of other threads
> that have discussed this, tend to take as axiomatic their value equations that
> make closures seem desirable, and to dismiss the concerns of those opposed to
> closures as somehow benighted. The trouble with that is that both sides then
> flame each other for failing to understand their values.


Not having been in the discussion, I can certainly see how that would
happen. There's truth in Blub's Paradox, but there's also certainly
value in keeping things simple. It's highly unlikely that everyone is
going to end up happy, I'd say.

> I see the value of closures, and while they are marginally more powerful than
> inner classes, their cost to the Java zeitgeist could be as high as Mr. Bloch
> and others feel, and I also find in practice that they are too terse for the
> maintenance programmer, compared to the inner class idiom. YMMV, of course,
> since this seems to me a matter of style.


It's absolutely a matter of style - but I think it opens up a style
which ends up simplifying code significantly precisely *because* it's
terse.

It certainly takes a little bit of extra up-front learning for
developers (whether maintenance or not) but I personally think the pay-
back is massive.

> The opponents to closures in Java point out that inner classes get the job
> done in practice, so that the need for closures is small, while the complexity
> and difficulties of it are large.


Anonymous inner classes get the job done in a way which discourages a
lot of the more effective ways of using closures, IMO. Yes, they
achieve largely the same goals (I need to consider the value of
capturing variables vs values, compared with the cost in complexity -
it *does* trip people up in C#) but it's the extra cruft that I have a
problem with.
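
To show what I mean about capturing variables rather than values,
here's a minimal sketch of standard C# 3 capture semantics:

using System;

class CaptureDemo
{
    static void Main()
    {
        int count = 0;

        // The lambda captures the *variable* count, not a snapshot of
        // its value, so the increments inside are visible outside -
        // exactly the behaviour that trips people up.
        Action increment = () => count++;

        increment();
        increment();
        Console.WriteLine(count); // prints 2
    }
}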

> It's harder to retrofit such a feature to a language that was deliberately
> designed to exclude it than to put it in the beginning into a language that
> was designed to use it.


Certainly. But taking C# as an example, closures weren't available *at
all* in C# 1 (not even in an anonymous inner class manner) - and yet
have been introduced in a very elegant manner.

> It's also unnecessary to add closures to Java.


It was also "unnecessary" to add generics, or the enhanced for loop,
or enums though. "Unnecessary" != "not useful".

This is now wildly off the topic of Razii's benchmarks, but I'd be
very interested in further discussions, particularly across both C#
and Java. It's been a while since I've been a regular poster on
comp.lang.java.programmer, but do you think a new thread discussing
closures and cross-posted between
microsoft.public.dotnet.languages.csharp and comp.lang.java.programmer
would be useful? If the topic's been done to death on the Java groups
already, that's fine - I just wonder whether the C# crowd might have
some interesting perspectives to bring in. There's always the *risk*
that it would turn into yet another language oneupmanship contest of
course, but we could try to make it clear that that's not the point of
the thread...

Alternatively, if you'd be happy to take this up on email instead,
please drop me a mail ((E-Mail Removed))

If you're sick of the whole debate, I quite understand - and thanks
for the previous post.

Jon
 
 
 
 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 9:12 am, Jon Harrop <(E-Mail Removed)> wrote:

<snip>

> >> I can't imagine many transformations being simpler than that - and the
> >> results are great.

>
> > How have you found Parallel.For to perform?

>
> Yes, but you have to write your code properly and all of the freely
> available examples I have seen (like Jon's above) are poorly written.


Can you elaborate on that? I'm not going to try to claim the example I
gave is going to be optimal. However, I find it important that it's a
*very* simple transformation. I suspect that most of the more optimal
transformations are also more complicated - and thus likely to be
beyond a large segment of the developer population. (I'm not trying to
be patronising - it seems to me to be pretty obvious that the general
level of understanding of threading complexities is very low. I'd
consider myself to know "a bit more than most" but I still get very
concerned when things get complicated.) I like the idea of "here's a
really simple transformation which can give you pretty good
parallelisation - more powerful and complex building blocks are also
available if you need to squeeze every bit of performance out".
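
For reference, the kind of transformation I mean looks like this - a
sketch in which RenderRow and the sizes are invented; in the current
CTP the Parallel class lives in the System.Threading namespace:

using System;
using System.Threading; // Parallel Extensions CTP

class TransformDemo
{
    const int Height = 1000;

    // Stand-in for a meaty per-row workload.
    static void RenderRow(int y)
    {
        double acc = 0;
        for (int x = 0; x < 100000; x++)
            acc += Math.Sqrt(x * y + 1);
    }

    static void Main()
    {
        // Serial version:
        for (int y = 0; y < Height; y++)
            RenderRow(y);

        // Parallel version - the only change is the loop construct:
        Parallel.For(0, Height, y => RenderRow(y));
    }
}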

Of course, if you've got examples of transformations which are about
as simple, but give even better results, that would be fabulous.

Jon
 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 10:48 am, Jon Harrop <(E-Mail Removed)> wrote:
> > Can you elaborate on that? I'm not going to try to claim the example I
> > gave is going to be optimal. However, I find it important that it's a
> > *very* simple transformation.

>
> The problem is that it is a simple transformation that introduces massive
> performance degradation in a variety of cases that are likely to be of
> practical importance.


Doesn't that just mean you have to apply it selectively?

<snip>

> You must take that into account when slicing up the computations or you will
> end up spawning parallelism for trivial computations which is often vastly
> slower than doing them in serial. Blindly replacing "for"
> with "Parallel.For" is a bad idea, particularly in library code.


So don't do it blindly.

I wouldn't start replacing every for loop with Parallel.For, or every
foreach loop with Parallel.ForEach - I'd do it based on evidence and
thought.
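
One simple shape that "evidence and thought" can take is a static
guard on the problem size - a sketch, with a threshold that is
invented and would really have to come from measurement on the target
hardware:

using System;
using System.Threading; // Parallel Extensions CTP

class SelectiveDemo
{
    // Below this many elements, scheduling overhead is likely to
    // dominate the work, so stay serial. Hypothetical value - measure!
    const int ParallelThreshold = 1024;

    static void Apply(double[] data, Func<double, double> f)
    {
        if (data.Length < ParallelThreshold)
        {
            for (int i = 0; i < data.Length; i++)
                data[i] = f(data[i]);
        }
        else
        {
            Parallel.For(0, data.Length, i => { data[i] = f(data[i]); });
        }
    }

    static void Main()
    {
        double[] data = new double[100000];
        Apply(data, x => x + 1);
        Console.WriteLine(data[0]); // 1
    }
}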

> > I like the idea of "here's a
> > really simple transformation which can give you pretty good
> > parallelisation - more powerful and complex building blocks are also
> > available if you need to squeeze every bit of performance out".

>
> This is more about avoiding performance pitfalls rather than squeezing
> performance out. If you follow the style you promoted then you can easily
> make a wide variety of functions 100x slower.


Again, that's just a case of not applying it blindly, isn't it?

> > Of course, if you've got examples of transformations which are about
> > as simple, but give even better results, that would be fabulous.

>
> This was all described in detail with measurements in the article I cited
> but, as you say, the information is not free. I'll propose to MS that they
> get us to write a whitepaper on the subject for them so that you can read
> it for free. Good idea.


I suspect you'd get a lot more subscriptions if you'd give away a few
good sample articles - as well as that being a Good Thing in general.

> In the most advanced cases that I have studied you even augment data
> structures with information that conveys how computations should be divided
> across the data structure. We have started introducing such augmentations
> into the data structures provided by our libraries because we believe it is
> essential to fully exploit multicores in order to survive commercially.


It really depends on what you're doing. As ever, one size doesn't fit
all in software engineering.

Jon
 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 1:38 pm, Jon Harrop <(E-Mail Removed)> wrote:
> > Doesn't that just mean you have to apply it selectively?

>
> Not quite, no. It means you must automate the selective application so that
> it can be done at run time, i.e. make it adaptive.


Sometimes that's the case - but in many business cases, I believe it
can be statically decided, making development simpler at the risk that
in some cases you may not take the optimal course.

> > I wouldn't start replacing every for loop with Parallel.For, or every
> > foreach loop with Parallel.ForEach - I'd do it based on evidence and
> > thought.

>
> The problem is that part of the "evidence" is a function of the problem
> being solved and, therefore, is only available at run time. So you cannot
> do it statically (with changes to the source code) to good effect.


In some cases. In others it's pretty obvious, IMO. Take the Mandelbrot
case: on a single core, I don't believe Parallel.For will hurt much (a
case I should test some time...) but obviously it won't help. On
multiple cores, it will improve things significantly vs a non-parallel
implementation. There may be other solutions which will improve the
performance more, but only by introducing more complexity. The balance
between complexity and improvement depends on many things, of course -
including the comfort zone of the developers involved.
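
To make the Mandelbrot case concrete, here's roughly the shape I have
in mind (a sketch rather than benchmark code - the coordinates and
sizes are invented). Each row writes only its own slice of the result
array, which is what makes the per-row Parallel.For safe:

using System;
using System.Threading; // Parallel Extensions CTP

class MandelbrotDemo
{
    static int[,] Render(int width, int height, int maxIter)
    {
        int[,] result = new int[height, width];
        // Rows are independent, so parallelise over y.
        Parallel.For(0, height, y =>
        {
            double ci = -1.5 + 3.0 * y / height;
            for (int x = 0; x < width; x++)
            {
                double cr = -2.5 + 3.5 * x / width;
                double zr = 0, zi = 0;
                int iter = 0;
                while (zr * zr + zi * zi < 4.0 && iter < maxIter)
                {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2 * zr * zi + ci;
                    zr = t;
                    iter++;
                }
                result[y, x] = iter;
            }
        });
        return result;
    }

    static void Main()
    {
        int[,] image = Render(800, 600, 1000);
        Console.WriteLine(image[300, 400]); // spot check one pixel
    }
}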

> > I suspect you'd get a lot more subscriptions if you'd give away a few
> > good sample articles - as well as that being a Good Thing in general.

>
> I suspect we would get fewer subscriptions if we made even more content
> freely available.


When you say "even more" content, it's not like the web site is
currently bursting at the seams with free samples. With so much free
content (much of it pretty good) available online these days, I'd want
more evidence than a single F# tutorial before spending £39 on a
subscription. Obviously it's your call though.

> > It really depends on what you're doing. As ever, one size doesn't fit
> > all in software engineering.

>
> There is still a lot of room for improvement though.


Sure - but I believe that *just* Parallel.For and Parallel.ForEach are
a quite staggering improvement for the relatively parallelism-scared
developer who wants to get more bang-for-the-buck with the minimum of
effort.

Jon
 
 
Barry Kelly
06-04-2008
Jon Skeet [C# MVP] wrote:

> On Jun 4, 1:29 am, Barry Kelly <(E-Mail Removed)> wrote:
>
> <snip>
>
> > Of course, the whole point of my naive algorithm is that it is naive, so
> > I didn't do any of this. It's a 30-minute hack. I just find it
> > interesting that TPL can, on occasion, take close to 30% more time than
> > a straightforward naive implementation, depending on work chunk size.

>
> Out of interest, which release of Parallel Extensions did you measure
> against? Have you installed the CTP from Monday yet?


Yes, that's what I installed last night to test with.

I do note this blog posting:

http://blogs.msdn.com/pfxteam/archiv...2/8179013.aspx

> I'd be interested
> to hear whether that is better or worse than the December one in your
> case.


The two aren't compatible when installed side by side, unfortunately, so
I'd have to uninstall and reinstall. If I write it up I'll do that.

> Also, have you posted on the PFX forum about this? I'm sure the team
> would be interested in examining it.


I'm doing that now.

-- Barry

--
http://barrkel.blogspot.com/
 
 
Barry Kelly
06-04-2008
(Sorry for the dup to folks not following the Microsoft NNTP server - it
appears to have been censoring the use of a synonym for "donkey"):

Jon Skeet wrote:

> Barry Kelly <(E-Mail Removed)> wrote:
> > > I can't imagine many transformations being simpler than that - and the
> > > results are great.

> >
> > How have you found Parallel.For to perform?

>
> Very well - see my last few blog posts.


I've been reading - I'm more wondering "compared to what", of course.

> I wouldn't be surprised if some individual test cases did badly in
> microbenchmarks. Without even looking at your code, my guess is that
> you may find your code doesn't work as well when there's other load on
> the processor (e.g. effectively taking up one core with a different
> process) whereas work stealing should (I believe) cope with that
> reasonably well.


My code does naive work-stealing too, though through a centralized
queue, rather than having a per-thread queue that bored threads wander
away from - that would avoid some cache sloshing etc.

> But as I point out on the most recent blog entry, accurately
> benchmarking this kind of thing and being able to interpret the results
> sensibly is basically beyond my current understanding (and time to
> spend on it).


For the interest of others, I have appended my code to this post. Here
are some of the results I got:

(Running .NET 3.5 on XP SP3 on Q6600 @ 2.8GHz w/ 3.25GB RAM)

- When all cores are fairly idle (no significant activity):

---8<---
Switching Overhead (Naive)    : 0.078 (fastest 0.078)
Switching Overhead (TPL)      : 0.045 (fastest 0.042)
Lots of small tasks (Naive)   : 0.682 (fastest 0.679)
Lots of small tasks (TPL)     : 0.081 (fastest 0.081)
Few big tasks (Naive)         : 0.967 (fastest 0.966)
Few big tasks (TPL)           : 1.259 (fastest 1.253)
--->8---

This shows that TPL's switching overhead is definitely lower than my
"naive pool" when the body of the loop is just sleeping, and that it is
far, far more efficient for lots and lots of small tasks; but it seems
less efficient - a lot less efficient - for bigger tasks.

I can see obvious ways to improve my Naive parallel-for:

- per-thread queues with bored threads going work stealing
- take advantage of ParallelFor's knowledge of total job size by adding
lots of work to per-thread queues in one go, rather than bottlenecking
on synchronization overhead for each single iteration (a rough sketch
of this follows below)

Of course, the whole point of my naive algorithm is that it is naive, so
I didn't do any of this. It's a 30-minute hack. I just find it
interesting that TPL can, on occasion, take close to 30% more time than
a straightforward naive implementation, depending on work chunk size.
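
For what it's worth, here is a rough, unmeasured sketch of that second
improvement - a chunked variant of ParallelFor that could be added to
the NaivePool class in the code below. It enqueues one work item per
contiguous range instead of one per iteration, so the synchronization
cost is paid once per chunk (chunkSize is a tuning knob):

public void ChunkedParallelFor(int start, int limit, int chunkSize,
    Action<int> body)
{
    int chunks = (limit - start + chunkSize - 1) / chunkSize;
    using (DoneWaiter waiter = new DoneWaiter(chunks))
    {
        for (int c = 0; c < chunks; ++c)
        {
            // Capture per-chunk bounds in locals so each delegate
            // sees its own range.
            int lo = start + c * chunkSize;
            int hi = Math.Min(lo + chunkSize, limit);
            _queue.Enqueue(delegate
            {
                for (int i = lo; i < hi; ++i)
                    body(i);
                waiter.Signal();
            });
        }
    }
}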

Running 1 instance of super-pi to eat up 1 core, leaving only 3 cores
free:

---8<---
Switching Overhead (Naive)    : 0.078 (fastest 0.078)
Switching Overhead (TPL)      : 0.045 (fastest 0.041)
Lots of small tasks (Naive)   : 0.734 (fastest 0.708)
Lots of small tasks (TPL)     : 0.087 (fastest 0.086)
Few big tasks (Naive)         : 0.983 (fastest 0.976)
Few big tasks (TPL)           : 1.317 (fastest 1.294)
--->8---

Running 2 instances of super-pi, leaving 2 cores free:

---8<---
Switching Overhead (Naive)    : 0.078 (fastest 0.078)
Switching Overhead (TPL)      : 0.045 (fastest 0.042)
Lots of small tasks (Naive)   : 1.063 (fastest 1.021)
Lots of small tasks (TPL)     : 0.094 (fastest 0.094)
Few big tasks (Naive)         : 1.158 (fastest 1.147)
Few big tasks (TPL)           : 1.376 (fastest 1.368)
--->8---

Here, TPL seems to be catching up with my Naive implementation as more
load is placed on the other cores, but it's still a fair way off.


Code follows:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// So dumb it's a stack, not a queue, but we never promised
// anything about order.
class NaiveQueue<T>
{
    class Item
    {
        public Item next;
        public T value;
    }

    // volatile to get most recent head in non-blocking loops.
    volatile Item _head;

    // _itemCount counts items available to consumers;
    // _emptyCount counts free slots, bounding the queue.
    Semaphore _itemCount;
    Semaphore _emptyCount;

    public NaiveQueue(int maxCount)
    {
        _itemCount = new Semaphore(0, maxCount);
        _emptyCount = new Semaphore(maxCount, maxCount);
    }

    public T Dequeue()
    {
        _itemCount.WaitOne();
        try
        {
            // Non-blocking pop: retry until the CAS on _head succeeds.
            for (;;)
            {
                Item result = _head;
                if (Interlocked.CompareExchange(ref _head,
                        result.next, result) == result)
                    return result.value;
            }
        }
        finally
        {
            _emptyCount.Release();
        }
    }

    public void Enqueue(T item)
    {
        _emptyCount.WaitOne();
        try
        {
            Item added = new Item();
            added.value = item;
            // Non-blocking push: retry until the CAS on _head succeeds.
            for (;;)
            {
                added.next = _head;
                if (Interlocked.CompareExchange(ref _head, added,
                        added.next) == added.next)
                    break;
            }
        }
        finally
        {
            _itemCount.Release();
        }
    }
}

class NaivePool
{
    Thread[] _threads;
    NaiveQueue<Action> _queue;

    public NaivePool()
    {
        _threads = new Thread[Environment.ProcessorCount];
        _queue = new NaiveQueue<Action>(
            Environment.ProcessorCount * 2);
        Thread.MemoryBarrier();
        for (int i = 0; i < _threads.Length; ++i)
        {
            _threads[i] = new Thread(ThreadMain);
            _threads[i].IsBackground = true;
            _threads[i].Start();
        }
    }

    void ThreadMain()
    {
        // Each worker loops forever, executing whatever arrives
        // on the shared queue.
        for (;;)
            _queue.Dequeue()();
    }

    public void ParallelFor(int start, int limit, Action<int> body)
    {
        using (DoneWaiter waiter = new DoneWaiter(limit - start))
        {
            for (int i = start; i < limit; ++i)
            {
                int index = i; // capture a fresh copy per iteration
                _queue.Enqueue(delegate
                {
                    body(index);
                    waiter.Signal();
                });
            }
        }
    }

    // Blocks in Dispose until Signal has been called taskCount times.
    class DoneWaiter : IDisposable
    {
        int _taskCount;
        ManualResetEvent _done;

        public DoneWaiter(int taskCount)
        {
            _taskCount = taskCount;
            _done = new ManualResetEvent(false);
        }

        public void Dispose()
        {
            _done.WaitOne();
            _done.Close();
        }

        public void Signal()
        {
            if (Interlocked.Decrement(ref _taskCount) == 0)
                _done.Set();
        }
    }
}

class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Use(int x)
    {
        if (x == -42)
            Console.WriteLine("Got the magic number");
    }

    static void Time(string title, Action proc)
    {
        double totalTime = 0;
        double fastest = double.MaxValue;
        int iterCount = 3;

        // warmup
        proc();

        for (int i = 0; i < iterCount; ++i)
        {
            Stopwatch w = Stopwatch.StartNew();
            proc();
            double time = w.ElapsedTicks / (double) Stopwatch.Frequency;
            totalTime += time;
            if (time < fastest)
                fastest = time;
        }

        Console.WriteLine("{0,-30}: {1,6:f3} (fastest {2,6:f3})",
            title, totalTime / iterCount, fastest);
    }

    static void Main(string[] args)
    {
        NaivePool pool = new NaivePool();

        Time("Switching Overhead (Naive)", delegate
        {
            pool.ParallelFor(0, 20, i =>
            {
                Thread.Sleep(10);
            });
        });

        Time("Switching Overhead (TPL)", delegate
        {
            Parallel.For(0, 20, i =>
            {
                Thread.Sleep(10);
            });
        });

        Time("Lots of small tasks (Naive)", delegate
        {
            pool.ParallelFor(0, 100000, i =>
            {
                for (int j = 0; j < 1000; ++j)
                    Use(j);
            });
        });

        Time("Lots of small tasks (TPL)", delegate
        {
            Parallel.For(0, 100000, i =>
            {
                for (int j = 0; j < 1000; ++j)
                    Use(j);
            });
        });

        Time("Few big tasks (Naive)", delegate
        {
            pool.ParallelFor(0, 10, i =>
            {
                for (int j = 0; j < 100000000; ++j)
                    Use(j);
            });
        });

        Time("Few big tasks (TPL)", delegate
        {
            Parallel.For(0, 10, i =>
            {
                for (int j = 0; j < 100000000; ++j)
                    Use(j);
            });
        });
    }
}

-- Barry

--
http://barrkel.blogspot.com/
 
 
Barry Kelly
06-04-2008
I wrote a reply, but it never showed up here in my newsreader. It's
online, however:

http://groups.google.com/group/micro...8e9162b5bae717

(Apologies if other folks are seeing my message repeatedly.)

-- Barry

--
http://barrkel.blogspot.com/
 
 
Jon Skeet [C# MVP]
06-04-2008
On Jun 4, 3:11 pm, Jon Harrop <(E-Mail Removed)> wrote:
> >> Not quite, no. It means you must automate the selective application so
> >> that it can be done at run time, i.e. make it adaptive.

>
> > Sometimes that's the case - but in many business cases, I believe it
> > can be statically decided, making development simpler at the risk that
> > in some cases you may not take the optimal course.

>
> Again, the problem isn't that you fail to take the optimal course. The
> problem is that you silently introduce massive performance degradations
> because you don't understand when Parallel.For is slow.


I know that it will introduce significant degradations when the body
of the loop is very quick, so that the work is dwarfed by the time
taken to invoke a delegate and the time taken to schedule threads etc.

Are there other situations you know about where the performance is
terrible? In many cases (not all by a long way, but many) the
developer will *know* that the body of the loop is a significant
amount of work.
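
The degenerate case I'm thinking of looks something like this (a
sketch - the sizes are invented and the exact figures will vary by
machine, but the shape is the point):

using System;
using System.Diagnostics;
using System.Threading; // Parallel Extensions CTP

class TinyBodyDemo
{
    static void Main()
    {
        int[] data = new int[1000000];

        Stopwatch w = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
            data[i] = i * 2; // trivial body: a couple of instructions
        Console.WriteLine("serial:   {0} ms", w.ElapsedMilliseconds);

        w = Stopwatch.StartNew();
        // Here the per-element delegate invocation and scheduling can
        // easily cost far more than the body itself.
        Parallel.For(0, data.Length, i => { data[i] = i * 2; });
        Console.WriteLine("parallel: {0} ms", w.ElapsedMilliseconds);
    }
}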

> > In some cases. In others it's pretty obvious, IMO. Take the Mandelbrot
> > case: on a single core, I don't believe Parallel.For will hurt much (a
> > case I should test some time...) but obviously it won't help. On
> > multiple cores, it will improve things significantly vs a non-parallel
> > implementation.

>
> No. That is exactly what I was saying was wrong. Your implementation has
> introduced massive performance degradations and you don't even know when
> they appear. That is a serious problem, IMHO.


So do *you* know when they appear? In the Mandelbrot case, I'd expect
the ParallelFor version to be a lot slower for situations where
computing each row is a trivial amount of work compared to the
scheduling involved - i.e. when the number of columns is tiny. I'm
happy enough to ignore that case - any Mandelbrot image with so few
columns is going to be uninteresting anyway.

> > When you say "even more" content, it's not like the web site is
> > currently bursting at the seams with free samples. With so much free
> > content (much of it pretty good) available online these days, I'd want
> > more evidence than a single F# tutorial before spending £39 on a
> > subscription. Obviously it's your call though.

>
> May I ask which articles you would most like for free?


The ParallelFX and SciMark articles would be nice. Are the paid
articles longer/meatier than the free ones?

> I'll also upload the code from my forthcoming book F# for Scientists ASAP.


Cool.

> > Sure - but I believe that *just* Parallel.For and Parallel.ForEach are
> > a quite staggering improvement for the relatively parallelism-scared
> > developer who wants to get more bang-for-the-buck with the minimum of
> > effort.

>
> They must be made aware of when Parallel.For kills performance or they will
> surely start going in the wrong direction.


Agreed. If that really *is* a case of "don't do it when the body of
the loop is going to execute really quickly anyway" then that's easy
enough to understand. If there are more complicated situations where
using Parallel.For introduces degradations (rather than just not being
optimal) then I'd be really interested to hear about them.

Jon
 
 
Mark Thornton
06-04-2008
Lew wrote:
> Mark Thornton wrote:
>> Given that my Java code won't be making best use of SSE

>
> Are you sure of that?
>
> If true, it's one more area for Hotspot authors to approach.
>


As far as I know, Java 6 only uses SSE for a single operation at a time
(i.e. scalar operations); it doesn't detect opportunities to do
four-at-a-time vector operations.

Mark Thornton
 
 
 
 