Velocity Reviews - Computer Hardware Reviews


Boost.bind and performance penalties?

 
 
Rune Allnor
 
      05-25-2011
Hi all.

I have an application that is heavy on computations
on data scattered in RAM. The task is to generate
a triangulated surface from a set of scattered (x,y,z)
data. In this application the memory management is
a big issue while the computations are relatively small
and uncomplicated, so the run-time would be expected
to split about 50/50 between memory management and
computations.

I expect to cut the run-time roughly in half by rewriting
the program from vector indexing, presently of the form

std::vector<double> v;
double x;
std::vector<size_t> i;
size_t n;

x = v[i[n]];

to pointer dereferencing, _along_the_lines_of_

std::vector<double> v;
double x;
std::vector<double*> i;
size_t n;

// initialize the elements of i to point to elements of v
x = *i[n];

The point is that I can't afford much run-time overhead, or
it will significantly impact the application.

Now, there are several variants of computations that I can
do on the data x. I'd like to use function objects or pointers
to specify how to do the computations, and am thinking of
using boost.bind to declare these function pointers when
calling this library.

What run-time overheads, if any, would be expected when
using function pointers, as opposed to 'hard-wiring' the
computations inside the library?

Rune
 
 
 
 
 
Edek
 
      05-25-2011
On 05/25/2011 01:56 PM, Rune Allnor wrote:
> Hi all.

[snip]
> The point is that I can't afford much run-time overhead, or
> it will significantly impact the application.
>
> Now, there are several variants of computations that I can
> do on the data x. I'd like to use function objects or pointers
> to specify how to do the computations, and am thinking of
> using boost.bind to declare these function pointers when
> calling this library.
>
> What run-time overheads, if any, would be expected when
> using function pointers, as opposed to 'hard-wiring' the
> computations inside the library?


I guess it depends on how complicated the functions are, in two ways.

One: if they take a long time, the call overhead is _relatively_ small.

Two: how complicated they are, which affects optimization possibilities. E.g.
a simple multiplication can be SSE'd ('vectorized') if inlined or
deduced by the compiler in some other way. Such optimisations cannot be
done - in most cases - when the call is indirect enough that the
compiler cannot be sure it knows the called function body. This can be
significant.

In general bind() is not light, and even function pointers, or even
worse pointers to members, should not be called in a tight loop, if only
for the call itself, let alone missed optimisations.

These are mostly rules of thumb. If you have a lot of time to spend on
programming, use expression templates, or inline at least. If you
cannot, use boost::bind and just don't worry.

HTH
Edek
 
 
 
 
 
Rune Allnor
 
      05-26-2011
On May 25, 4:45 pm, Edek <(E-Mail Removed)> wrote:
> On 05/25/2011 01:56 PM, Rune Allnor wrote:
>
> > Hi all.

> [snip]
> > The point is that I can't afford much run-time overhead, or
> > it will significantly impact the application.

>
> > Now, there are several variants of computations that I can
> > do on the data x. I'd like to use function objects or pointers
> > to specify how to do the computations, and am thinking of
> > using boost.bind to declare these function pointers when
> > calling this library.

>
> > What run-time overheads, if any, would be expected when
> > using function pointers, as opposed to 'hard-wiring' the
> > computations inside the library?

>
> I guess it depends on how complicated the functions are, in two ways.
>
> One: if they take a long time, the call overhead is _relatively_ small.


The functions are small. 8-10 trivial math operations.

> Two: how complicated they are, which affects optimization possibilities. E.g.
> a simple multiplication can be SSE'd ('vectorized') if inlined or
> deduced by the compiler in some other way. Such optimisations cannot be
> done - in most cases - when the call is indirect enough that the
> compiler cannot be sure it knows the called function body. This can be
> significant.


I suppose this is the issue I am worried about.

> In general bind() is not light, and even function pointers, or even
> worse pointers to members, should not be called in a tight loop, if only
> for the call itself, let alone missed optimisations.


I'll have to think about this. The reason why the question came up
in the first place, was that I found that I could use some of the
same functionality I implemented for the (x,y,z) surface data stuff,
when playing with various GUI features like points on the screen:

How can I efficiently hide from the library what type of point
I am working with, GUI or measurement? And how can I access the
coordinates in the point types efficiently without either hard-coding
the access functions in the library or paying too large a price in
performance?

Maybe I can accept doing only one call to fetch the data.
Maybe I can come up with a specialization for the performance-
critical stuff. There are several options here.

> These are mostly rules of thumb. If you have a lot of time to spend on
> programming, use expression templates, or inline at least. If you
> cannot, use boost::bind and just don't worry.


Can't afford not to worry. Run-time is a critical factor here.

> HTH
> Edek


Thanks.

Rune
 
 
Edek
 
      05-26-2011
On 05/26/2011 06:35 AM, Rune Allnor wrote:

>> One: if they take a long time, the call overhead is _relatively_ small.

>
> The functions are small. 8-10 trivial math operations.


When I was talking about time, that includes the scattered access
you mentioned before. An uncached RAM access takes a long time
compared to a couple of simple ops.

Which reminds me: I do not understand why replacing vector offset by
vector of pointers should speed things up. Vector offset boils down to
pointer + offset.

You might want to consider a 'manual prefetch' if the vector fits into
cache and a large part of it is used. The rule here is that the CPU has a
hardware prefetcher, which kicks in during sequential access; otherwise,
a manual prefetch might help. Not the first thing to try though, and a bit
compiler-specific or assembly. (I am talking about x86.)

>
>> Two: how complicated they are, which affects optimization possibilities. E.g.
>> a simple multiplication can be SSE'd ('vectorized') if inlined or
>> deduced by the compiler in some other way. Such optimisations cannot be
>> done - in most cases - when the call is indirect enough that the
>> compiler cannot be sure it knows the called function body. This can be
>> significant.

>
> I suppose this is the issue I am worried about.


Make a test and measure.

>> In general bind() is not light, and even function pointers, or even
>> worse pointers to members, should not be called in a tight loop, if only
>> for the call itself, let alone missed optimisations.

>
> I'll have to think about this. The reason why the question came up
> in the first place, was that I found that I could use some of the
> same functionality I implemented for the (x,y,z) surface data stuff,
> when playing with various GUI features like points on the screen:
>
> How can I efficiently hide what type of point I play with, GUI or
> measurements, from the data? And how can I access the coordinates
> in the point types effectively without either hard-coding the
> access functions in the library or pay too large a price on
> performance?


Inline getters (would that be 'hardcoding'? ), or inline MyPoint
const& getPoint () const, or inherit from MyPoint, or template<class
MyOp> doOpOnPoint(MyOp& op) on some point class, or ...

>
> Maybe I can accept to do only one call to fetch the data.
> Maybe I can come up with a specialization for the performance-
> critical stuff. There are several options here.


This might be a good idea. Easiest: fetch into a vector, pass the vector
to the function. The function would then work locally on the vector; but
fetching is an operation too - it is simple, but takes some time.
It depends; make a test and measure. If the functions (ops) are separate
from the data, a wrapper can be made that applies an op to a vector. I
guess the options are limited by what you actually want to do with the
result.

>
>> These are mostly rules of thumb. If you have a lot of time to spend on
>> programming, use expression templates, or inline at least. If you
>> cannot, use boost::bind and just don't worry.

>
> Can't afford not to worry. Run-time is a critical factor here.


Still, measure first, make a simple test, measure again. There are more
things in the CPU than are dreamt of in programming.

>
>> HTH
>> Edek

>
> Thanks.
>
> Rune


 
 
Rune Allnor
 
      05-26-2011
On May 26, 11:34 am, Edek <(E-Mail Removed)> wrote:
> On 05/26/2011 06:35 AM, Rune Allnor wrote:
>
> >> One: if they take a long time, the call overhead is _relatively_ small.

>
> > The functions are small. 8-10 trivial math operations.

>
> When I was talking about time, that includes the scattered access
> you mentioned before. An uncached RAM access takes a long time
> compared to a couple of simple ops.
>
> Which reminds me: I do not understand why replacing vector offset by
> vector of pointers should speed things up. Vector offset boils down to
> pointer + offset.


Not quite:

1) ElementPointer = BasePointer + ElementOffset
2) Element = dereference(ElementPointer)

while the pointer semantics is only half that:

A) Element = dereference(ElementPointer)

In this application the difference matters. The issue is a bit
more involved than what I posted first (there might be 3-5 levels
of indirection through pointers to pointers). Maybe rewriting to
pointer semantics instead of vector semantics saves me 30% run time;
maybe it saves 60%. I don't know until I rewrite.

But it's that order of magnitude we are talking about. I don't want
to blow those savings or more, on an expensive function indirection.

> You might want to consider a 'manual prefetch' if the vector fits into
> cache and a large part of it is used. The rule here is that the CPU has a
> hardware prefetcher, which kicks in during sequential access; otherwise,
> a manual prefetch might help. Not the first thing to try though, and a bit
> compiler-specific or assembly. (I am talking about x86.)


The problem is that the data points are scattered in RAM.
Cache organization is at least as hard as the main task,
which is to organize the data points. Cache organization
just uses a different criterion.

[...]

> >> These are mostly rules of thumb. If you have a lot of time to spend on
> >> programming, use expression templates, or inline at least. If you
> >> cannot, use boost::bind and just don't worry.

>
> > Can't afford not to worry. Run-time is a critical factor here.

>
> Still, measure first, make a simple test, measure again. There are more
> things in the CPU than are dreamt of in programming.


The performance factors only kick in visibly at large volumes.
No such thing as a 'simple test': either rewrite the whole thing
or nothing at all.

Rune
 
 
Edek
 
      05-26-2011

>> Which reminds me: I do not understand why replacing vector offset by
>> vector of pointers should speed things up. Vector offset boils down to
>> pointer + offset.

>
> Not quite:
>
> 1) ElementPointer = BasePointer + ElementOffset
> 2) Element = dereference(ElementPointer)
>
> while the pointer semantics is only half that:
>
> A) Element = dereference(ElementPointer)


Not quite, if I understand your first post:

1) Pointer = deref(const_pointer + pointerOffset)
2) Element = deref(Pointer)

vs.

1) ElementOffset = deref (const_pointer + elementOffsetOffset)
2) Element = deref (const_pointer + ElementOffset)

... 1 is sequential, I guess, and 2 is scattered. In both cases 2 is
dependent on 1. deref(pointer + offset) is a single instruction on most
CPUs if not all, although somehow split into micro-ops - still, a RAM
fetch costs way more than an address addition.

And the addressing is not even that simple: 2) should be an address +
member offset. v[off].x, that is, or deref([const_pointer + ElementOffset] +
const-x-offset).

>
> In this application the difference matters. The issue is a bit
> more involved than what I posted first (there might b 3 - 5 levels
> of indirections through pointers to pointers). Maybe rewriting to
> pointer semantics instead of vector semantics saves me 30% run time;
> maybe it saves 60%. I don't know until I rewrite.


I don't know. I usually assume in such cases that the compiler will do
such things for you, if you do not prevent it: the generated assembly does
not match the .cpp file line by line, and this is a simple
transformation. The compiler stops further transformations when faced
with something it does not know for sure. Indirection might be one of
them, true; a 'blackbox' call is for sure another. I still see no
significant difference between returning an offset and returning a
pointer, at least in a loop.

Making everything a template, or just inline, makes no 'blackbox' calls;
fewer indirections should help too.

>
> But it's that order of magnitude we are talking about. I don't want
> to blow those savings or more, on an expensive function indirection.


Probably boost::bind is what you need to avoid. std::for_each or any
similar template would be better I guess.

>>>> These are mostly rules of thumb. If you have a lot of time to spend on
>>>> programming, use expression templates, or inline at least. If you
>>>> cannot, use boost::bind and just don't worry.

>>
>>> Can't afford not to worry. Run-time is a critical factor here.

>>
>> Still, measure first, make a simple test, measure again. There are more
>> things in the CPU than are dreamt of in programming.

>
> The performance factors only kick in visibly at large volumes.
> No such thing as a 'simple test': either rewrite the whole thing
> or nothing at all.


Maybe. I use simple tests to learn the mechanisms, on large volumes. It
might be easier than rewriting the whole thing and learning later that
it was not the right way to go - which, by the way, does happen during such
changes; performance is too often heuristic-like. What is clearly an
advantage is a change like O(n^2) to O(log n), but we're not talking
about that now.

Edek

 
 
Rune Allnor
 
      05-26-2011
On May 26, 12:53 pm, Edek <(E-Mail Removed)> wrote:
> >> Which reminds me: I do not understand why replacing vector offset by
> >> vector of pointers should speed things up. Vector offset boils down to
> >> pointer + offset.

>
> > Not quite:

>
> > 1) ElementPointer = BasePointer + ElementOffset
> > 2) Element = dereference(ElementPointer)

>
> > while the pointer semantics is only half that:

>
> > A) Element = dereference(ElementPointer)

>
> Not quite, if I understand your first post:


Don't take that post too literally; what *really*
goes on is something like (in vector semantics)

struct element_t {size_t a; size_t b; size_t c;};
std::vector<element_t> v;
size_t n,m;

m = v[v[v[n].a].b].c;

where one cannot predict any relation between n and v[n].x.

I think this can be more efficient if written along the
lines of something like

struct Element_t {Element_t* A; Element_t* B; size_t C;};
std::vector<Element_t> v;
size_t n,m;

m = v[n].A->B->C;

[...]
> > In this application the difference matters. The issue is a bit
> > more involved than what I posted first (there might be 3-5 levels
> > of indirection through pointers to pointers). Maybe rewriting to
> > pointer semantics instead of vector semantics saves me 30% run time;
> > maybe it saves 60%. I don't know until I rewrite.

>
> I don't know. I usually think in such cases that the compiler will do
> such things for you, if you do not prevent it: generated assembly does
> not match line-by-line what is in .cpp file. It is a simple
> transformation. The compiler stops further transformations when faced
> with something it does not know for sure. Indirection might be one of
> them, true; a 'blackbox' call is for sure another. I still see no
> significant difference between returning an offset and returning a
> pointer, at least in a loop.


Again, my first few posts were maybe too simplistic. The example
above may still be too simplistic, but it gives some impression of
what is actually going on.

> Making everything a template, or just inline, makes no 'blackbox' calls;
> fewer indirections should help too.


Can't do that, given the nature of the problem to be solved.

> > But it's that order of magnitude we are talking about. I don't want
> > to blow those savings or more, on an expensive function indirection.

>
> Probably boost::bind is what you need to avoid. std::for_each or any
> similar template would be better I guess.


It's two separate problems:

1) Getting fast indirections
2) Applying functions without overhead.

Problem 1 is not really essential for problem 2, other than as
a demonstration of what kinds of overheads are damaging.

> >>>> These are mostly rules of thumb. If you have a lot of time to spend on
> >>>> programming, use expression templates, or inline at least. If you
> >>>> cannot, use boost::bind and just don't worry.

>
> >>> Can't afford not to worry. Run-time is a critical factor here.

>
> >> Still, measure first, make a simple test, measure again. There are more
> >> things in the CPU than are dreamt of in programming.

>
> > The performance factors only kick in visibly at large volumes.
> > No such thing as a 'simple test': either rewrite the whole thing
> > or nothing at all.

>
> Maybe. I use simple tests to learn the mechanisms, on large volumes. It
> might be easier than rewriting the whole thing and learning later that
> it was not the right way to go. Which by the way happens during such
> changes; performance is too often heuristic-like. What is clearly an
> advantage is a change like O(n^2) to O(log n), but we're not talking
> about that now.


The present version of the application is a 'naive simple semantics'
version, which has already been shown to be too slow. The question
now is how to implement a fast version that also might be flexible
enough to be useful in other applications, where speed is not as
much of an issue.

Rune
 