Velocity Reviews > Can anyone improve this any further......

# Can anyone improve this any further......

Sean Kenwrick
Guest
Posts: n/a

 02-07-2004

"Francois Grieu" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> In article <c00tva\$s3c\$(E-Mail Removed)>,
> "Sean Kenwrick" <(E-Mail Removed)> wrote:
>
> > I got rid of the extra divide and replaced it with a multiply and

subtract
> > (to do the modulo bit). It is now (faster)

>
> In my experience, is unusual that % is slower than / is for unsigned
> quatities. I guess your system has a problem with modulo of signed
> quantities.
> Since apparently you have no requirement that the string returned
> starts where s does, maybe try:
>
> /* 32 bit signed arithmetic assumed */
> /* does not handle -2147483648 */
> char *myltoa(long n, char *s)
> {
> char neg;
> *(s += 11) = neg = '\0';
> if (n<0)
> {
> n = -n; /* does not handle -2147483648 */
> neg = '-';
> }
> do *--s = (unsigned)n%10+'0';
> while ((n = (unsigned)n/10)!=0);
> if (neg!='\0')
> *--s = neg;
> return(s);
> }
>
>
> François Grieu

The problem is that divide is typically ALOT slower than multiply or any
other arithmetic operations and this is probably the case for all CPUs.
And since the modulo operator requires a divide (I've checked the assembler
code) then it is just as slow as a normal divide. Your algorithm above
therefore still effectively does 2 divides per digit which is about 10%
slower than my very similar version of your algorithm that I wrote (see
previous post) which uses a multiple and a subtract instead of a modulo...

Sean

Johan Lindh
Guest
Posts: n/a

 02-07-2004
Sean Kenwrick wrote:

> "Francois Grieu" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>
>>In article <c00tva\$s3c\$(E-Mail Removed)>,
>> "Sean Kenwrick" <(E-Mail Removed)> wrote:
>>
>>
>>>I got rid of the extra divide and replaced it with a multiply and

>
> subtract
>
>>>(to do the modulo bit). It is now (faster)

>>
>>In my experience, is unusual that % is slower than / is for unsigned
>>quatities. I guess your system has a problem with modulo of signed
>>quantities.
>>Since apparently you have no requirement that the string returned
>>starts where s does, maybe try:
>>
>>/* 32 bit signed arithmetic assumed */
>>/* does not handle -2147483648 */
>>char *myltoa(long n, char *s)
>> {
>> char neg;
>> *(s += 11) = neg = '\0';
>> if (n<0)
>> {
>> n = -n; /* does not handle -2147483648 */
>> neg = '-';
>> }
>> do *--s = (unsigned)n%10+'0';
>> while ((n = (unsigned)n/10)!=0);
>> if (neg!='\0')
>> *--s = neg;
>> return(s);
>> }
>>
>>
>> François Grieu

>
>
> The problem is that divide is typically ALOT slower than multiply or any
> other arithmetic operations and this is probably the case for all CPUs.
> And since the modulo operator requires a divide (I've checked the assembler
> code) then it is just as slow as a normal divide. Your algorithm above
> therefore still effectively does 2 divides per digit which is about 10%
> slower than my very similar version of your algorithm that I wrote (see
> previous post) which uses a multiple and a subtract instead of a modulo...
>
> Sean
>
>

static long mask_list[] = { 1000000000, 100000000, 10000000, 1000000,
100000, 10000, 1000, 100, 10, 1 };
char * __fastcall myltoa2(long n, char *s)
{
char *org_s = s;
char c;

if( n <= 0 )
{
if( !n )
{
*s++ = '0';
*s = '\0';
return org_s;
}
n = -n;
*s++ = '-';
}

do
{
c = '0';
{
c++;
}
*s++ = c;
} while( n );

*s = '\0';

return org_s;
}

/Johan

pete
Guest
Posts: n/a

 02-07-2004
Sean Kenwrick wrote:

> Here is my final version...
>
> char * __fastcall myltoa(long n, char *s)

Function identifiers starting with two underscores like that,
are reserved for the implementation.
Are you implementing C ?

N869
7.1.3 Reserved identifiers
[#1]

-- All identifiers that begin with an underscore and
either an uppercase letter or another underscore are
always reserved for any use.

--
pete

Old Wolf
Guest
Posts: n/a

 02-08-2004
> > char * __fastcall myltoa(long n, char *s)
>
> Function identifiers starting with two underscores like that,
> are reserved for the implementation.
> Are you implementing C ?

No, but his compiler is implementing C-with-extensions,
one of which is the keyword "__fastcall".

Sean Kenwrick
Guest
Posts: n/a

 02-09-2004

"Johan Lindh" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Sean Kenwrick wrote:
>
> > "Francois Grieu" <(E-Mail Removed)> wrote in message
> > news:(E-Mail Removed)...
> >
> >>In article <c00tva\$s3c\$(E-Mail Removed)>,
> >> "Sean Kenwrick" <(E-Mail Removed)> wrote:
> >>
> >>
> >>>I got rid of the extra divide and replaced it with a multiply and

> >
> > subtract
> >
> >>>(to do the modulo bit). It is now (faster)
> >>
> >>In my experience, is unusual that % is slower than / is for unsigned
> >>quatities. I guess your system has a problem with modulo of signed
> >>quantities.
> >>Since apparently you have no requirement that the string returned
> >>starts where s does, maybe try:
> >>
> >>/* 32 bit signed arithmetic assumed */
> >>/* does not handle -2147483648 */
> >>char *myltoa(long n, char *s)
> >> {
> >> char neg;
> >> *(s += 11) = neg = '\0';
> >> if (n<0)
> >> {
> >> n = -n; /* does not handle -2147483648 */
> >> neg = '-';
> >> }
> >> do *--s = (unsigned)n%10+'0';
> >> while ((n = (unsigned)n/10)!=0);
> >> if (neg!='\0')
> >> *--s = neg;
> >> return(s);
> >> }
> >>
> >>
> >> François Grieu

> >
> >
> > The problem is that divide is typically ALOT slower than multiply or any
> > other arithmetic operations and this is probably the case for all CPUs.
> > And since the modulo operator requires a divide (I've checked the

assembler
> > code) then it is just as slow as a normal divide. Your algorithm

above
> > therefore still effectively does 2 divides per digit which is about 10%
> > slower than my very similar version of your algorithm that I wrote (see
> > previous post) which uses a multiple and a subtract instead of a

modulo...
> >
> > Sean
> >
> >

>
> static long mask_list[] = { 1000000000, 100000000, 10000000, 1000000,
> 100000, 10000, 1000, 100, 10, 1 };
> char * __fastcall myltoa2(long n, char *s)
> {
> char *org_s = s;
> char c;
>
> if( n <= 0 )
> {
> if( !n )
> {
> *s++ = '0';
> *s = '\0';
> return org_s;
> }
> n = -n;
> *s++ = '-';
> }
>
>
> do
> {
> c = '0';
> while( n >= mask )
> {
> c++;
> }
> *s++ = c;
> } while( n );
>
> *s = '\0';
>
> return org_s;
> }
>
>
> /Johan
>
>

Your algorithm looked promising (faster) at first until I realised that I
was using numbers containing low-digit combinations in my test code (E.g.
a=1021020/324) . As soon as I used numbers with higher digit combinations
like 83894758 then the algorithm suffered with performance since it would
require 9 subtractions for the digit '9' etc. There is also a bug in the
algorithm as it was presented above. It need to take care of the case
where the number ends with trailing zeros (E.g 234000), so another loop
needed to be placed at the end as follows:

*s++='0';

Although the performance on average is quite similar to the current best
algorithm, I think I will stick to the version which has consistent
performance for any number and does not rely on digits being of low value
which could cause unexpected slowdowns...

Sean

Francois Grieu
Guest
Posts: n/a

 02-10-2004
I now get the "* faster than %" point.
Can you try this, which I hope more pipeline-friendly?

#include <limits.h>
#if LONG_MAX==0x7FFFFFFFu && ULONG_MAX==0xFFFFFFFFu
/* 32 bit signed arithmetic assumed */
/* may not handle -2147483648, but will on most modern machines */
char *myltoa(long n, char *s)
{
unsigned long m;
*(s += 11) = '\0';
if (n>=0)
{
do
{
*--s = '0'+n-(m = (unsigned long)n/10u)*10u;
if (m==0)
break;
*--s = '0'+m-(unsigned long)(n = m/10u)*10u;
}
while (n!=0);
return s;
}
n = -n; /* may not handle -2147483648 */
do
{
*--s = '0'+n-(m = (unsigned long)n/10u)*10u;
if (m==0)
break;
*--s = '0'+m-(unsigned long)(n = m/10u)*10u;
}
while (n!=0);
*--s = '-';
return(s);
}
#endif

Next try will use floating point arithmetic.

François Grieu

Sean Kenwrick
Guest
Posts: n/a

 02-10-2004
"Francois Grieu" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> I now get the "* faster than %" point.
> Can you try this, which I hope more pipeline-friendly?
>
> #include <limits.h>
> #if LONG_MAX==0x7FFFFFFFu && ULONG_MAX==0xFFFFFFFFu
> /* 32 bit signed arithmetic assumed */
> /* may not handle -2147483648, but will on most modern machines */
> char *myltoa(long n, char *s)
> {
> unsigned long m;
> *(s += 11) = '\0';
> if (n>=0)
> {
> do
> {
> *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
> if (m==0)
> break;
> *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
> }
> while (n!=0);
> return s;
> }
> n = -n; /* may not handle -2147483648 */
> do
> {
> *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
> if (m==0)
> break;
> *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
> }
> while (n!=0);
> *--s = '-';
> return(s);
> }
> #endif
>

For some reason the casts to unsigned caused this to slow down by 6% over
the current fastest. After removing the casts which appeared to be
unnecessary then it matched the time of my previous best exactly. Thus I
don't think the attempts to take advantage of pipelining had any effect and
the code thus became equivalent to my version (but slightly less

>
> Next try will use floating point arithmetic.
>
> François Grieu

I would be interested to see a solution based on FP arithmetic - perhaps FP
divides are quicker than the CPU integer divides?? I would be surprised
though...

Sean

Chris Torek
Guest
Posts: n/a

 02-11-2004
[code using integer division etc]

In article <news:c0ai4a\$5fh\$(E-Mail Removed)>
Sean Kenwrick <(E-Mail Removed)> writes:
>For some reason the casts to unsigned caused this to slow down by 6% over
>the current fastest.

Some machines only have a "native" signed integer division, so
unsigned integer division requires "undoing" some work the
signed-divide instruction performed.

On other machines, there is no difference, or unsigned integer
division is slightly faster (e.g., pre-V8 SPARC, where there is
no divide instruction at all, and the .udiv routine can skip
the sign-fiddling).

>After removing the casts which appeared to be
>unnecessary then it matched the time of my previous best exactly.

The "first" cast may be necessary, or at least useful, depending
on the machine. If we ignore bizarre (but apparently legal)
implementations in which UINT_MAX is strictly less than INT_MAX or
-INT_MIN or both (where -INT_MIN means the mathematical value, not
the "C value"), we still have the very common case of two's complement
machines, in which -INT_MAX is off by one from INT_MIN: e.g.,
INT_MAX is 32767 while INT_MIN is (numerically) -32768, or INT_MAX
is 2147483647 while INT_MIN is -2147483648.

(Aside: if, in this example, INT_MIN is -32768, it has to be
expressed in C in some other form, such as (-32767 - 1), because
the integral constant expression -32768 consists of the unary "-"
followed by the constant 32768. But we just said INT_MAX is
32767 -- so the integral constant 32768 has type "unsigned int",
and negating it tells the compiler to compute the value mod 65536,
which happens to continue to be (unsigned int)32768. So -32768
and 32768U are the same number, on this machine. For 32-bit int
two's complement machines, one must write (-2147483647 - 1) or
similar, due to the same problem. The problem repeats for "long"
as well -- I have chosen to address only "int" here, even though
I think the original code uses "long".)

In any case, getting back to the point at hand, in C89 integer
division is allowed to round either towards 0 or towards -infinity,
so that (-3)/2 is either -1 (round towards 0) or -2 (round towards
-inf). The only constraint here is that ((a / b) * b) + (a % b)
should produce the original value in "a" (assuming b nonzero),
which in turn means that if (-3)/2 is -1, (-3)%2 must be -1 as
well, while if (-3)/2 is -2, (-3)%2 must be 1. (In C99, implementations
must round towards zero, I believe.)

Suppose, then, that INT_MIN is (-32767-1) and INT_MAX is 32767 and
the implementation rounds towards zero. Then suppose i is -32768
initially, and "i = -i" leaves it set to -32768 (which is quite
common even if it is undefined):

int i, j, rest;
...
rest = i / 10;
j = i % 10; /* or: j = i - (rest * 10); */

Since i is still -32768, this sets rest to -3276 (rounded towards
0) and j is -8. j cannot be converted to a digit by adding 0.

On the other hand, suppose the implementation rounds towards -inf.
Then rest is set to -3277 and j is 2, and converting j to a digit
gets '2'. If we do this in a loop, the next digit produced is '3'
(with rest set to -32, then '2' (-33), then '7' (-4), then '6';
and the number we will print is "-67232"!

By using unsigned arithmetic for the first division, we guarantee

int i, j, rest;
...
rest = -(unsigned)i / 10U;
j = (unsigned)i - (rest * 10U);

Now we divide 32768U by 10U giving 3276U, and then subtract 32760U
from 32768U to set j to 8. Converting to a digit gives '8' and
rest is now a positive integer well out of the problematic range.
The rest of the arithmetic can be done using signed divides, if
those are faster.

Since (I think) C99 guarantees rounding towards zero, the other
technique that can be used is to leave the negative number negative,
and simply adjust for the fact that j will be negative:

int i, j, rest;
bool negative = false; /* remember to #include <stdbool.h> */
char *p;
char buf[X]; /* use the log-base-8 trick to derive X */
...
p = buf + X;
*--p = '\0';
if (i < 0) {
negative = true;
rest = i / 10;
j = i - (rest * 10);

Now "rest" is at most INT_MIN/10 and j is (e.g.) -8 as before, so:

*--p = '0' + (-j);
i = -rest;
}
do {
rest = i / 10;
j = i - rest * 10;
*--p = '0' + j;
i = rest;
} while (i != 0);
if (negative)
*--p = '-';
printf("the result of conversion is %s\n", p);

This does assume that INT_MIN is less than -9, but the C standards
guarantee it.

>Thus I don't think the attempts to take advantage of pipelining
>[by swapping variables inside a doubled-up loop] had any effect and
>the code thus became equivalent to my version (but slightly less

If this kind of register-renaming pipelining will help, a good
compiler should expand the loop for you automatically anyway.

>I would be interested to see a solution based on FP arithmetic - perhaps FP
>divides are quicker than the CPU integer divides?? I would be surprised
>though...

Division is fundamentally more difficult than multiplication (see
comp.arch for gory hardware details), and for various reasons, many
hardware designers are much more willing to "spend" the transisitors
required to make hardware FP divide fast, than they are to spend
the transistors to make integer divide fast. So it can be the case
that FP divide happens in (say) 10 clocks while integer divide
takes (say) 100. But there are other tricks to compensate (such
as "reciprocal multiplication", providing the division is by a
constant), and again a good compiler really should do them for you.
The short version of the answer is "the speed of the operation is
off-topic in comp.lang.c." The long version starts with at
least this paragraph, and goes on for many more to discuss the
details of the CPU(s) and/or compiler(s) in question.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
Reading email is like searching for food in the garbage, thanks to spammers.

Dave Thompson
Guest
Posts: n/a

 02-14-2004
On Fri, 6 Feb 2004 12:24:01 +0000 (UTC), "Sean Kenwrick"
<(E-Mail Removed)> wrote:

> I am writing a byte-code interpreter/emulator for a language that
> exclusively uses strings for variables (i.e all variables are pushed onto
> the stack as strings). Therefore for arithmetic operations I need to
> convert the string to a 32 bit integer, carry out the arithmetic operation,
> then convert it back to a string before pushing it back onto the stack.
> This can happen millions of times per second so it is essential that I have
> optimised (for speed) atol() and ltoa() functions.

<snip>
> By the way, I have tried pre-calculating integer values for all strings
> pushed onto the stack (by always appending the 4 byte integer value on the
> stack at the end of the string, but it actually caused a slowdown since the
> I have to do a atol() on every function return value or string concatenation
> operation even if the value is not going to be used in an arithmetic or
> comparison operation.
>

If you can store numeric value as well as string value in variables --
assuming your language has variables (not pure functional or FORTHy)
-- how about pushing/storing etc. the numeric value *if known* and a
dummy value (or flag) if not; operations which need numeric value(s)
do the atol(s) if/when they encounter the dummy; similarly push/store
a dummy string, such as NULL, if only the numeric value is known, and
operations needing a string value do the ltoa if/when needed.

If you really do need to do ltoa an awful lot, your absolute best
performance is more likely achievable only by writing in assembler.
Especially since even if you write in C, the microoptimization results
often vary for different architectures, and even different models --
especially for x86, where division and different ways of
multiplication have bounced up and down through its history -- and
thus while you may have portable code, it isn't actually a portable

For x86, AIR the last time this was discussed on comp.lang.asm.x86,
the consensus for (all?) current CPUs was to use fstbp or else split
into 4-digit (< 16 - epsilon bit) chunks and 16:32 multiply by 64K/10
to form each digit (within each chunk). Or possibly for some models by
64K/100 and AAM to form digit pairs.

- David.Thompson1 at worldnet.att.net