Velocity Reviews > Newsgroups > Programming > C++

Which is a better way to implement this algorithm?

 
 
mike3
04-20-2008
Hi.
(Xposted to both comp.lang.c++ and comp.programming since I've got
questions related to both C++ language and general programming)

I've got the following C++ code. The first routine runs in about 65% of
the time of the second, yet both do the same thing. The second one,
however, seems better in terms of how the code is written: it
encapsulates the transformation in the inner loop, which in my opinion
makes it easier to read. Maybe it isn't, and the former is "better" in
that respect and I should go with it; but if the latter is "better"
stylistically, should I ditch it anyway and trade readability for
performance, since I want this thing to be fast?

What each routine does is multiply two arbitrary-precision integers
together. The second one, though, uses an additional "slice" type that
provides a window over a limited range of digits, so the "multiply and
add" operation can be applied to that window and then advanced across
the number, making that part of the algorithm explicit.

I'm using the simple "grade school" multiply algorithm. Note how the
second routine more easily outlines this algorithm, while in the first
it is a little more difficult to see. Which would you prefer, exactly?

For brevity, class definitions and other members have been omitted.
However it should not be too difficult to figure out the necessary
information.

Also, could I lose the typecasts in this code or do I need them?

The reason I'm asking is that I was told earlier by someone here
(Alf P. Steinbach, group: comp.lang.c++) that my last implementation
of this program (it's a fractal generator) had some sort of bad,
bug-inducing "abstraction gap", but I was unable to get enough
elaboration on what that meant, so I've been more or less shooting in
the dark trying to figure out how to make it better (although he did
give me some *examples* of where there were problems, which I have
since fixed). But it seems I'm losing performance, and that's not a
good thing for my application. Not to mention that my original
bignum implementation was called "silly" as well ("...although I didn't
look at the silly bignum class, whatever it's fault is..." ref:
http://groups.google.com/group/comp....9?dmode=source)
without any explanation of what exactly was so silly about it, so I
had to shoot in the dark there too.

First version (fastest):
---
void RawDigit::Mul(const RawDigit &a, const RawDigit &b)
{
    /* Check to see if we're doing an in-place multiply */
    if(&a == this)
    {
        MulInPlace(b);
        return;
    } else if(&b == this) {
        MulInPlace(a);
        return;
    }

    /* Get lengths. */
    std::size_t rLength(GetLength());
    std::size_t aLength(a.GetLength());
    std::size_t bLength(b.GetLength());

    /* Zeroize this. */
    Zeroize();

    /* Do the multiplication. */
    TwoDigit64 carry(0);
    for(std::size_t i(0); i<aLength && i<rLength; i++)
    {
        carry = 0;

        for(std::size_t j(0); j<bLength && (i+j)<rLength; j++)
        {
            TwoDigit64 tmp((static_cast<TwoDigit64>(a.digits[i])*b.digits[j]) +
                           digits[i+j] + carry);
            carry = tmp >> DIGIT_BITS;
            tmp -= carry << DIGIT_BITS;
            digits[i+j] = tmp;
        }

        if(i+bLength < rLength)
            digits[i+bLength] = carry;
    }
}
---

Second version (slower):
---
*** Slice manipulation.
inline Digit32 MulAddOp(Digit32 a, Digit32 b, Digit32 c, Digit32 &carry)
{
    TwoDigit64 sum(a + (static_cast<TwoDigit64>(b)*c) + carry);
    carry = sum >> DIGIT_BITS;
    return(sum & MAX_DIGIT);
}

inline Digit32 MulAddCarryPropOpL(Digit32 a, Digit32 &carry)
{
    Digit32 sum(a + carry);
    carry = sum < carry;
    return(sum);
}

inline Digit32 MulAddCarryPropOpR(Digit32 b, Digit32 c, Digit32 &carry)
{
    TwoDigit64 sum((static_cast<TwoDigit64>(b)*c) + carry);
    carry = sum >> DIGIT_BITS;
    return(sum & MAX_DIGIT);
}

Digit32 RawDigitSlice::MulAdd(const ConstRawDigitSlice &a,
                              const Digit32 b,
                              const ConstRawDigitSlice &c)
{
    /* Set up iterators */
    DigitIterator32 ri(GetStartIterator());
    ConstDigitIterator32 ai(a.GetConstStartIterator());
    ConstDigitIterator32 ci(c.GetConstStartIterator());

    /* Get minimum length of a and c. */
    std::size_t minLength(std::min(std::min(a.length, c.length), length));

    /* Do the combined multiply + add */
    Digit32 carry(0);
    std::size_t i(0);
    for(; i<minLength; ++i,++ri,++ai,++ci)
        *ri = MulAddOp(*ai, b, *ci, carry);

    /* Handle the remaining part(s) of a or c. */
    std::size_t largerLength(std::min(std::max(a.length, c.length), length));
    if(a.length >= c.length)
    {
        for(; i<largerLength; ++i,++ri,++ai)
            *ri = MulAddCarryPropOpL(*ai, carry);
    } else {
        for(; i<largerLength; ++i,++ri,++ci)
            *ri = MulAddCarryPropOpR(b, *ci, carry);
    }

    /* Finish carry propagation */
    if(largerLength < length)
    {
        for(; i<length; ++i,++ri)
            *ri = MulAddCarryPropOpL(0, carry);
    }

    /* Done! */
    return(carry);
}

*** This next routine is in a different source file, the one implementing
the full "RawDigit" class. The former would be in another file that
implements the "RawDigitSlice" class.
void RawDigit::Mul(const RawDigit &a, const RawDigit &b)
{
    /* Check to see if we're doing an in-place multiply */
    if(&a == this)
    {
        MulInPlace(b);
        return;
    } else if(&b == this) {
        MulInPlace(a);
        return;
    }

    /* Get lengths. */
    std::size_t rLength(GetLength());
    std::size_t aLength(a.GetLength());
    std::size_t bLength(b.GetLength());

    /* Zeroize this. */
    Zeroize();

    /* Set up slices. */
    /* (<RawDigit object>, <origin digit idx>, <nbr of digits in slice>) */
    RawDigitSlice runningSum(*this, 0, aLength + 1);
    ConstRawDigitSlice as(a);

    /* Do the multiplication. */
    for(std::size_t i(0); i<bLength && i<rLength; ++i,++runningSum)
        runningSum.MulAdd(runningSum, b[i], as);
}
---

 
Ivan Vecerina
04-20-2008
"mike3" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
: Hi.
: (Xposted to both comp.lang.c++ and comp.programming since I've got
: questions related to both C++ language and general programming)
:
: I've got the following C++ code. The first routine runs in like 65% of
: the time of the second routine. Yet both do the same thing. However,
: the second one seems better in terms of the way the code is written
: since it helps encapsulate the transformation in the inner loop better
: making it easier to read, at least in my opinion. Maybe it's not,
: maybe the former is "better" that way and I should go with it, but if
: the latter is "better" in that respect should I just ditch it anyway
: and tradeoff for performance since I want this thing to be fast???
As a first-time reader of both implementations, I find the first one
much easier to understand and maintain than the second one. Adding
levels of abstraction without a clear benefit only obfuscates code.

: I'm using the simple "grade school" multiply algorithm. Note how the
: second routine more easily outlines this algorithm, while in the first
: it is a little more difficult to see. Which would you prefer, exactly?
The first one.

: Also, could I lose the typecasts in this code or do I need them?
I'll comment & review the first function below.
At some point, you do need to (at least implicitly) cast
an operand to the larger type, as this might not happen
automatically for some combinations of types, in C++.
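The point about the widening cast can be shown in isolation; here is a minimal sketch (the function names and values are illustrative, not from the thread's code):

```cpp
#include <cstdint>

// Both operands are 32-bit: the multiply happens in 32 bits and wraps
// *before* the result is widened to 64 bits.
uint64_t product_no_cast(uint32_t a, uint32_t b)
{
    return a * b;   // 32x32 multiply, truncated, then zero-extended
}

// Widen one operand first: the usual arithmetic conversions then make
// the whole multiply happen in 64 bits, keeping all the bits.
uint64_t product_with_cast(uint32_t a, uint32_t b)
{
    return static_cast<uint64_t>(a) * b;
}
```

So the cast (explicit or via assignment to a wider type) is what keeps the high half of the product.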

: First version (fastest):
: ---
: void RawDigit::Mul(const RawDigit &a, const RawDigit &b)
: {
: /* Check to see if we're doing an in-place multiply */
: if(&a == this)
: {
: MulInPlace(b);
: return;
: } else if(&b == this) {
: MulInPlace(a);
: return;
: }
General class design: unless there is an established reason
not to do so, I would use operator overloading and implement
(for the Num class):
  Num  operator *( const Num& a, const Num& b );   // free function
  Num& operator *=( const Num& b );                // member
Maybe RawDigit::Mul is an internal private member, and the
above operations are provided? But then the special cases
(e.g. multiply in place) would best be handled in the
layers above...

: /* Get lengths. */
: std::size_t rLength(GetLength());
: std::size_t aLength(a.GetLength());
: std::size_t bLength(b.GetLength());
:
: /* Zeroize this. */
: Zeroize();

Is the intent truly to implement modulo math, and to allow
the result to be truncated? If not the result should be
resized automatically, and the code is simplified.

: /* Do the multiplication. */
: TwoDigit64 carry(0);
Technically, the carry should only require 1 digit,
and can be defined within the outer loop below.

: for(std::size_t i(0);i<aLength && i<rLength;i++)
: {
: carry = 0;
:
: for(std::size_t j(0);j<bLength && (i+j)<rLength;j++)
: {
: TwoDigit64
: tmp((static_cast<TwoDigit64>(a.digits[i])*b.digits[j]) +
: digits[i+j] + carry);
: carry = tmp >> DIGIT_BITS;
: tmp -= carry << DIGIT_BITS;
: digits[i+j] = tmp;
: }
:
: if(i+bLength < rLength)
: digits[i+bLength] = carry;
: }
: }

Let's say that you have the following digit-related
definitions within your class:
  typedef ... Digit;
  typedef ... TwoDigit;
  const unsigned digitBitCount = ...;
  const Digit digitBitMask = 0xF...UL;

Assuming no truncation (because of adequate pre-allocation
of result digits), the same algorithm can be written as:

 for( unsigned aPos = 0 ; aPos<aLength ; ++aPos )
 {
    Digit carry = 0;
    TwoDigit const aDig = a.digits[aPos]; //NB: cast "hidden" here
    for( unsigned bPos = 0 ; bPos<bLength ; ++bPos )
    {
        TwoDigit mul = aDig * b.digits[bPos]
                     + this->digits[aPos+bPos]
                     + carry;

        this->digits[aPos+bPos] = mul & digitBitMask;
        carry                   = ( mul >> digitBitCount );
    }
    this->digits[aPos+bLength] = carry;
 }

There are many correct ways to write this, and some are
probably better in some ways than this example. But I
hope that you will find it useful.
In any case, given the very acceptable complexity of this loop,
I would not bother breaking it up or adding layers of
encapsulation.

Regards -Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <> http://www.brainbench.com

 
thomas.mertes@gmx.at
04-20-2008
On 20 Apr., 08:26, mike3 <(E-Mail Removed)> wrote:
> Hi.
> (Xposted to both comp.lang.c++ and comp.programming since I've got
> questions related to both C++ language and general programming)

[snip rest of quoted post, including both versions of the code]

I consider the first version easier to read. I prefer to see the
main algorithm on one page instead of "millions of small
methods". What I am missing is:
- The signs. All your values seem to be unsigned.
  Do you want to use two's complement representation or
  sign + magnitude?
- The management of the size. The user of the functions
  should not be bothered with resizing the big integers.

Did you know that big integer libraries already exist? Some
people have taken on the burden of writing a library of big integer
functions. For example: me.
If you download Seed7
(http://sourceforge.net/project/showf...roup_id=151126)
you will see that the file seed7/src/big_rtl.c contains a bigInteger
library written in C. This library manages the memory for the
digits automatically and contains the usual arithmetic operations
(including two forms of bigInteger division and remainder, which
truncate towards zero or towards minus infinity). The multiplication
uses the Karatsuba algorithm when possible.
BTW: I plan to do an improved release today (by coincidence I
made several improvements for bigIntegers). If you wait approx.
12 hours you can get the new version.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.
 
mike3
04-20-2008
On Apr 20, 1:43 am, "Ivan Vecerina"
<(E-Mail Removed)> wrote:
> "mike3" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed)...
> : Hi.
> : (Xposted to both comp.lang.c++ and comp.programming since I've got
> : questions related to both C++ language and general programming)
> :
> : I've got the following C++ code. The first routine runs in like 65% of
> : the time of the second routine. Yet both do the same thing. However,
> : the second one seems better in terms of the way the code is written
> : since it helps encapsulate the transformation in the inner loop better
> : making it easier to read, at least in my opinion. Maybe it's not,
> : maybe the former is "better" that way and I should go with it, but if
> : the latter is "better" in that respect should I just ditch it anyway
> : and tradeoff for performance since I want this thing to be fast???
> As a first-time reader of both implementations, I find the first one
> much easier to understand and maintain than the second one. Adding
> levels of abstraction without a clear benefit only obfuscates code.
>


OK, then I'll go for the first. Especially considering it's faster...

> : I'm using the simple "grade school" multiply algorithm. Note how the
> : second routine more easily outlines this algorithm, while in the first
> : it is a little more difficult to see. Which would you prefer, exactly?
> The first one.
>


Hmm.

> : Also, could I lose the typecasts in this code or do I need them?
> I'll comment & review the first function below.
> At some point, you do need to (at least implicitly) cast
> an operand to the larger type, as this might not happen
> automatically for some combinations of types, in C++.
>


So then in this case a cast is OK, right?

> : First version (fastest):
> : ---
> : void RawDigit::Mul(const RawDigit &a, const RawDigit &b)
> : {
> :   /* Check to see if we're doing an in-place multiply */
> :   if(&a == this)
> :   {
> :     MulInPlace(b);
> :     return;
> :   } else if(&b == this) {
> :     MulInPlace(a);
> :     return;
> :   }
> General class design: unless there is an established reason
> not to do so, I would use operator overloading and implement
> (for the Num class):
>   Num  operator *( const Num& a, const Num& b );
>   Num& operator *=( const Num& b );
> Maybe RawDigit::Mul is an internal private member, and the
> above operations are provided? But then the special cases
> (e.g. multiply in place) would best be handled in the
> layers above...
>


I've tried overloaded operators, but one seems to have to construct
a temporary copy to hold the result to return in the case of the binary
operators. This is a big problem for my app, as I have performance
concerns: deallocating and reallocating memory billions of times
is a serious waste of performance. (These routines will be called
billions of times.)

Therefore I do the opposite and build these low-level routines first,
then build the overloaded operators using them, which can then be
used in all non-performance-critical areas of the program. If it turns
out that I don't need the MulInPlace() functionality to be integrated
with Mul(), I'll just remove it and document that Mul() does not
work in-place. Is that acceptable practice?
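The layering described above might be sketched like this (a minimal sketch: the names BigNum, Mul, and the allocation details are hypothetical, echoing but not identical to the thread's RawDigit class):

```cpp
#include <cstddef>

class BigNum
{
public:
    explicit BigNum(std::size_t length) : length_(length) {}  // would allocate digits

    std::size_t GetLength() const { return length_; }

    // Low-level three-address routine: writes into the pre-allocated
    // *this, no temporary allocation; used directly in hot loops.
    void Mul(const BigNum &a, const BigNum &b)
    {
        (void)a; (void)b;  // grade-school loop would go here
    }

    // Convenience layer built on top: allocates a result object, so it
    // is kept out of performance-critical code paths.
    friend BigNum operator*(const BigNum &a, const BigNum &b)
    {
        BigNum result(a.GetLength() + b.GetLength());
        result.Mul(a, b);
        return result;
    }

private:
    std::size_t length_;
};
```

The design choice is exactly the trade-off under discussion: hot paths call `Mul` on pre-allocated storage, while casual code pays for the temporary via `operator*`.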

> :   /* Get lengths. */
> :   std::size_t rLength(GetLength());
> :   std::size_t aLength(a.GetLength());
> :   std::size_t bLength(b.GetLength());
> :
> :   /* Zeroize this. */
> :   Zeroize();
>
> Is the intent truly to implement modulo math, and to allow
> the result to be truncated? If not the result should be
> resized automatically, and the code is simplified.
>


So I can omit this functionality and just document that the
routine doesn't behave as most would expect? (If I have operands
of 3 differing lengths I'd usually expect a Mul() to respect
those lengths but then again your opinion may differ.)

Also, the above does nothing but clear the result buffer, which you
need since we're going to add (as in arithmetical addition) partial
products into it. You need that for the multiply algorithm to work.

And you need to get the lengths of the input operands because
you're going to loop over the digits, no? And you don't want to
read outside the buffer or read too few digits? I think you would,
but then again maybe I'm wrong...

> :   /* Do the multiplication. */
> :   TwoDigit64 carry(0);
> Technically, the carry should only require 1 digit,
> and can be defined within the outer loop below.
>


But I thought then that would create a performance-wasting conversion
to/from 32/64 again and again.

> :   for(std::size_t i(0);i<aLength && i<rLength;i++)
> :   {
> :     carry = 0;
> :
> :     for(std::size_t j(0);j<bLength && (i+j)<rLength;j++)
> :     {
> :       TwoDigit64
> : tmp((static_cast<TwoDigit64>(a.digits[i])*b.digits[j]) +
> :            digits[i+j] + carry);
> :       carry = tmp >> DIGIT_BITS;
> :       tmp -= carry << DIGIT_BITS;
> :       digits[i+j] = tmp;
> :     }
> :
> :     if(i+bLength < rLength)
> :       digits[i+bLength] = carry;
> :   }
> : }
>
> Let's say that you have the following digit-related
> definitions within your class:
>   typedef ... Digit;
>   typedef ... TwoDigit;
>   const unsigned digitBitCount = ...;
>   const Digit digitBitMask = 0xF...UL;
>
> Assuming no truncation (because of adequate pre-allocation
> of result digits), the same algorithm can be written as:
>
>  for( unsigned aPos = 0 ; aPos<aLength ; ++aPos )
>  {
>    Digit carry = 0;
>    TwoDigit const aDig = a.digits[aPos]; //NB: cast "hidden" here
>    for( unsigned bPos = 0 ; bPos<bLength ; ++bPos )
>    {
>      TwoDigit mul = aDig * b.digits[bPos]
>                   + this->digits[aPos+bPos]
>                   + carry;
>
>      this->digits[aPos+bPos] = mul & digitBitMask;
>      carry                   = ( mul >> digitBitCount );
>    }
>    this->digits[aPos+bLength] = carry;
>  }
>
> There are many correct ways to write this, and some are
> probably better in some ways than this example. But I
> hope that you will find it useful.
> In any case, given the very acceptable complexity of this loop,
> I would not bother breaking it up or adding layers of
> encapsulation.
>


OK, then.
 
mike3
04-20-2008
On Apr 20, 2:22 am, (E-Mail Removed) wrote:
> On 20 Apr., 08:26, mike3 <(E-Mail Removed)> wrote:

<snip>
> > First version (fastest):
[snip quoted First version code]

>
> [snip Second version (slower)]
>
> I consider the first version easier to read. I prefer to see the
> main algorithm on one page instead of "millions of small
> methods". What I am missing is:
> - The signs. All your values seem to be unsigned.
>   Do you want to use two's complement representation or
>   sign + magnitude?


Digits are usually not signed; this is not a balanced-radix
representation, just simple radix-2^32 arithmetic.

This is also a raw unsigned number type that just allows
basic digit manipulations. I'm going to use it to build a
floating-point type on top.
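A minimal sketch of such a radix-2^32 representation (little-endian digit order assumed here; the actual RawDigit layout may differ):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// value = digits[0] + digits[1]*2^32 + digits[2]*2^64 + ...
// Example value: 2^32 + 5.
std::vector<uint32_t> digits = {0x00000005u, 0x00000001u};

// Reassemble the value, for illustration only: valid only while the
// number still fits in 64 bits (i.e. at most two digits).
uint64_t ToUint64(const std::vector<uint32_t> &d)
{
    uint64_t v = 0;
    for (std::size_t i = d.size(); i-- > 0; )
        v = (v << 32) | d[i];
    return v;
}
```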

> - The management of the size. The user of the functions
>   should not be bothered with resizing the big integers.
>


See above. This is a raw base type. In the full application
I do not need automatic resizes (since with floating point one
usually truncates or rounds the result of a multiplication
anyway), plus they would hinder performance, and I need
lots of that.

> Did you know that big integer libraries already exist? Some
> people have taken on the burden of writing a library of big integer
> functions. For example: me.


Actually, I've looked at other packages when trying to figure out
how to come up with mine.

The reason for producing my own package was primarily due to
licensing issues, since I do not know how I'm going to license
the software when I decide to release it for distribution. If that
was not a concern I would have just used someone else's package.
I could always switch to the use of a different package in the
future.

Plus it's fun to program it and I can learn more about programming
this way.

> If you download Seed7
> (http://sourceforge.net/project/showf...roup_id=151126)
> you will see that the file seed7/src/big_rtl.c contains a bigInteger
> library written in C. This library manages the memory for the
> digits automatically and contains the usual arithmetic operations
> (including two forms of bigInteger division and remainder, which
> truncate towards zero or towards minus infinity). The multiplication
> uses the Karatsuba algorithm when possible.


I'm curious: Is that Karatsuba D&C (divide & conquer) method
faster even on smaller number sizes like 512 bits or so? Because
if so I'm thinking of implementing that in my program as well.

Also, I'm writing this with C++, not C, which means that looking
at a C package may be less than helpful.

> BTW.: I plan to do an improved release today (By coincidence I
> did several improvements for bigInteger's). If you wait for approx.
> 12 hours you can get the new version.
>

 
mike3
04-20-2008
On Apr 20, 1:43 am, "Ivan Vecerina"
<(E-Mail Removed)> wrote:
> "mike3" <(E-Mail Removed)> wrote in message

<snip>
> Assuming no truncation (because of adequate pre-allocation
> of result digits), the same algorithm can be written as:
>
>  for( unsigned aPos = 0 ; aPos<aLength ; ++aPos )
>  {
>    Digit carry = 0;
>    TwoDigit const aDig = a.digits[aPos]; //NB: cast "hidden" here
>    for( unsigned bPos = 0 ; bPos<bLength ; ++bPos )
>    {
>      TwoDigit mul = aDig * b.digits[bPos]
>                   + this->digits[aPos+bPos]
>                   + carry;
>
>      this->digits[aPos+bPos] = mul & digitBitMask;
>      carry                   = ( mul >> digitBitCount );
>    }
>    this->digits[aPos+bLength] = carry;
>  }
>
> There are many correct ways to write this, and some are
> probably better is some ways than this example. *But I
> hope that you will find it useful.
> In any case, given the very acceptable complexity of this loop,
> I would not bother breaking it up or adding layers of
> encapsulation.
>


Unfortunately, I noticed that this last implementation, where the
cast is "hidden", seems to lose performance. My benchmark
took 17 seconds for this vs. 12 for a similar routine whose
only difference is that the cast is not "hidden" by assigning the
digit to a TwoDigit64.

I'm wondering if maybe this is because the compiler has done
a 32x32 multiply when using the one-liner with internal cast, and
returns 64-bit result, but when you try to "hide" the cast it tries
to do a 64x32 or even 64x64 multiply since it sees one 64-bit
operand coming in, which wastes time. I guess it depends on the
optimizer. In the one with the cast the optimizer can "see" that
both operands are 32-bit, so it could "know" to only do 32x32->64
instead of 64x32->64 or 64x64->64 (mod mul by 2^64).

Is this a good theory? Should I go with the casts?
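The two forms under discussion reduce to a minimal pair (function names here are illustrative). Whether they compile to different multiply instructions is compiler- and platform-dependent, so inspecting the generated assembly (e.g. `g++ -O2 -S`) is the way to check; both always compute the same value:

```cpp
#include <cstdint>

// Cast written at the multiply: the optimizer can see both operands
// are 32-bit and may emit a single 32x32->64 multiply.
uint64_t mul_visible_cast(uint32_t a, uint32_t b)
{
    return static_cast<uint64_t>(a) * b;
}

// Cast "hidden" in a prior assignment: as written, this is a 64-bit
// multiply, which a weaker optimizer may implement as 64x64.
uint64_t mul_hidden_cast(uint32_t a, uint32_t b)
{
    uint64_t wide_a = a;   // widening happens here
    return wide_a * b;
}
```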
 
Ivan Vecerina
04-20-2008
"mike3" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
>Unfortunately, I noticed this last implementation where the
>cast is "hidden" seems to lose performance. My benchmark
>returned 17 seconds for this vs 12 for a similar routine where the
>only difference is the cast is not "hidden" by assigning the digit
>to a TwoDigit64.
>
>I'm wondering if maybe this is because the compiler has done
>a 32x32 multiply when using the one-liner with internal cast, and
>returns 64-bit result, but when you try to "hide" the cast it tries
>to do a 64x32 or even 64x64 multiply since it sees one 64-bit
>operand coming in, which wastes time. I guess it depends on the
>optimizer. In the one with the cast the optimizer can "see" that
>both operands are 32-bit, so it could "know" to only do 32x32->64
>instead of 64x32->64 or 64x64->64 (mod mul by 2^64).
>
>Is this a good theory? Should I go with the casts?


These things are optimizer- and platform-dependent, but
yes, a "widened" multiplication is the likely culprit.
If you care about such low-level optimizations, I would
recommend learning to look at the assembly code (it is not
that difficult to get to a level where you can look and
see what your compiler does / where overheads come from).

Although in general it is always advisable to first look
for better algorithms (as these micro-optimizations can
be platform-specific, and become redundant or even
counter-productive in the future). For instance, a 64-bit
platform might be able to store the 64-bit word into
a local register, and execute the code I posted faster.


--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <> http://www.brainbench.com

 
mike3
04-20-2008
On Apr 20, 2:39 pm, "Ivan Vecerina"
<(E-Mail Removed)> wrote:
> "mike3" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed)...
>
> >Unfortunately, I noticed this last implementation where the
> >cast is "hidden" seems to lose performance. My benchmark
> >returned 17 seconds for this vs 12 for a similar routine where the
> >only difference is the cast is not "hidden" by assigning the digit
> >to a TwoDigit64.

>
> >I'm wondering if maybe this is because the compiler has done
> >a 32x32 multiply when using the one-liner with internal cast, and
> >returns 64-bit result, but when you try to "hide" the cast it tries
> >to do a 64x32 or even 64x64 multiply since it sees one 64-bit
> >operand coming in, which wastes time. I guess it depends on the
> >optimizer. In the one with the cast the optimizer can "see" that
> >both operands are 32-bit, so it could "know" to only do 32x32->64
> >instead of 64x32->64 or 64x64->64 (mod mul by 2^64).

>
> >Is this a good theory? Should I go with the casts?

>
> These things are optimizer- and platform-dependent, but
> yes, a "widened" multiplication is the likely culprit.
> If you care about such low-level optimizations, I would
> recommend learning to look at the assembly code (it is not
> that difficult to get to a level where you can look and
> see what your compiler does / where overheads come from).
>
> Although in general it is always advisable to first look
> for better algorithms (as these micro-optimizations can
> be platform-specific, and become redundant or even
> counter-productive in the future). For instance, a 64-bit
> platform might be able to store the 64-bit word into
> a local register, and execute the code I posted faster.
>


So what would be the best course of action in this case?
 
mike3
04-21-2008
On Apr 20, 12:26 am, mike3 <(E-Mail Removed)> wrote:
<snip>

Anyway this discussion made me wonder whether or not the design of the
package overall is so good. You can get a full copy of the parts made
so far here:

http://www.mediafire.com/?z2j9t2s9mv9

A batch file is included to compile it with Microsoft's C++ compiler
under Windows. Currently it only implements the "RawDigit" unsigned
integer type, but it is made to allow the building of a floating-point
package on top (which is why you'll find things like "AddRsh" and
"SubRsh", which are necessary for floating-point arithmetic).

Is the current design I have there good, or poor? How does it compare
to my old implementation, which you can find here (see "mp" directory
in unzipped file):

http://www.mediafire.com/?cfmzd1y3yij

I'd like some discussion on this as I'm really just shooting in the
dark, considering the lack of specifics given in Alf's posts.
 
Ivan Vecerina
04-21-2008
"mike3" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
On Apr 20, 2:39 pm, "Ivan Vecerina"
<(E-Mail Removed)> wrote:
:> Although in general it is always advisable to first look
:> for better algorithms (as these micro-optimizations can
:> be platform-specific, and become redundant or even
:> counter-productive in the future). For instance, a 64-bit
:> platform might be able to store the 64-bit word into
:> a local register, and execute the code I posted faster.
:>
:
:So what would be the best course of action in this case?

If squeezing out more performance matters, study the literature and
find a faster algorithm. From reading Knuth's TAOCP (20 years ago
for me, as a teenager), I vaguely remember that there is a faster
algorithm for multiplying multi-word integers, whose complexity is
O( n^log2(3) ) = O( n^1.58 ) rather than the O(n^2) you have
now. If your integers are large enough, this could
be considered... ( quick google search -> for example
http://en.wikipedia.org/wiki/Karatsuba_algorithm ).
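The Karatsuba idea in miniature: one 32x32->64 product computed with three 16x16 multiplies instead of four, via the identity (a1*B + a0)(b1*B + b0) = z2*B^2 + ((a1+a0)(b1+b0) - z2 - z0)*B + z0. This is only a sketch of the identity; a real bignum version applies it recursively to digit arrays, with the crossover point found by benchmarking:

```cpp
#include <cstdint>

uint64_t karatsuba32(uint32_t a, uint32_t b)
{
    // Split each operand into 16-bit halves (B = 2^16).
    uint32_t a1 = a >> 16, a0 = a & 0xFFFFu;
    uint32_t b1 = b >> 16, b0 = b & 0xFFFFu;

    uint64_t z2 = static_cast<uint64_t>(a1) * b1;  // high parts
    uint64_t z0 = static_cast<uint64_t>(a0) * b0;  // low parts
    // (a1+a0) and (b1+b0) fit in 17 bits, so this product fits in 64.
    // Only three multiplies total, vs four for the schoolbook method.
    uint64_t z1 = static_cast<uint64_t>(a1 + a0) * (b1 + b0) - z2 - z0;

    return (z2 << 32) + (z1 << 16) + z0;
}
```

On single words this is pointless (multiplies are cheap, the extra adds are not); the win appears once each "multiply" is itself an O(n^2) bignum multiply, which is why libraries switch to Karatsuba only above some size threshold.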

But large integer multiplications are commonly needed
(math, cryptography, etc), and some have certainly been
optimized in assembly language, or even SIMD-optimized.
I'm not into this lately, but you will certainly find
open source projects that do it. Or it might be possible
to use the "Big Number" library subset from Intel's
Integrated Performance Primitives. It depends...

Good luck -Ivan

--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
Brainbench MVP for C++ <> http://www.brainbench.com

 