Velocity Reviews - C Programming
Re: Data alignment and endianess (http://www.velocityreviews.com/forums/t742462-re-data-alignment-and-endianess.html)

Eric Sosman 01-23-2011 04:06 PM

Re: Data alignment and endianess
 
On 1/23/2011 10:05 AM, pozz wrote:
> I need to parse a packet of data received on a serial port or stored in
> a file. The bytes are stored in an array:
> unsigned char data[PACKET_LENGTH];
>
> Now I'm wondering how I can write "full portable" code to retrieve a
> 32-bit unsigned integer in Network Byte Order (Big Endian), even at
> non-aligned addresses.
>
> Suppose I have a 32-bit integer starting from data[1]. I think the most
> portable way to retrieve this integer is:
> ---
> uint32_t i;
> i = (data[1] << 24) | (data[2] << 16) | (data[3] << 8) | data[4];


This is almost right, and will work on many machines. But on
machines with 16-bit `int' (in fact, anything narrower than 32, or
even a 32-bit `int' on machines that are picky about overflow) the
promotion rules will trip you up. The `<<' operator promotes each
`unsigned char' operand to an `int' (or `unsigned int' on some
machines), but not to a `uint32_t': with a 16-bit `int', shifting by
24 is undefined behavior outright, and even with a 32-bit `int',
`data[1] << 24' can overflow into the sign bit.

To ensure you're shifting with the proper width, then, you
need something like

i = (((uint32_t)data[1]) << 24)
  | (((uint32_t)data[2]) << 16)
  | (((uint32_t)data[3]) <<  8)
  | (((uint32_t)data[4]) <<  0);

(You could lose the final shift, and even the final cast, but the
pattern may be more readable if you stick with it all the way.
You could also lose the outermost sets of parentheses, but I'd
suggest keeping them in the interests of clarity.)
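
Packaged as a self-contained helper, the whole recipe is only a few
lines. (The function name `be32_get' below is just for illustration,
not a standard routine.)

#include <stdint.h>

/* Assemble a big-endian 32-bit value from four consecutive bytes.
 * Works at any alignment and on any host endianness, because it
 * computes the value arithmetically instead of reinterpreting
 * memory. */
static uint32_t be32_get(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8)
         | ((uint32_t)p[3] <<  0);
}

With that, the extraction above becomes `i = be32_get(&data[1]);'.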

> i = ntohl(i);


No. The original calculation has already produced the value
you want, so no further mucking around is necessary or desirable.

By the way, you'll probably find ntohl() and its relatives
already defined for you on systems that support networking. If
you need them (you don't, here), prefer the system's versions to
free-hand code.

> [...]
> What happens if I retrieve a 32-bit integer on a Big Endian CPU at an
> aligned address, for example starting from data[0]? With the code above,
> the CPU will construct the integer from the bytes, copying and
> shifting them.
> The code works well, but is more time-consuming than a simple cast:
> ---
> ---
> uint32_t i = *(uint32_t *)&data[0];


For this variant, you *do* need ntohl() or something like it.

Alignment is still a problem, even starting at `data[0]', because
we have no reason (in what you've shown) to think that `data' itself
is aligned in any particular way.
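
If you really want the grab-four-bytes-at-once variant, the usual
dodge for both the alignment trap and C's aliasing rules is memcpy()
into a properly typed object, then ntohl(). A sketch, assuming a
POSIX-ish system where ntohl() lives in <arpa/inet.h>:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl(); the header varies by system */

/* memcpy() makes no alignment assumptions, and a decent compiler
 * will often collapse this to a single load on hardware that
 * permits misaligned access. */
uint32_t get_u32_net(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return ntohl(v);
}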

You say that the original "is more time consuming" than the
second. How much more? What have your measurements shown? (You
*have* measured the speed difference, have you not? Surely you
would not be so rash as to state that there "is" a difference without
evidence to support your assertion, would you? Hmmm?)

> Can you suggest a better code snippet?


Go back to the original, minus the ntohl(). Until and unless
you have solid evidence that simple and portable code is a bottleneck,
leave it portable and simple. "More computing sins are committed in
the name of efficiency (without necessarily achieving it) than for any
other single reason--including blind stupidity." - W.A. Wulf

--
Eric Sosman
esosman@ieee-dot-org.invalid

Eric Sosman 01-24-2011 12:52 PM

Re: Data alignment and endianess
 
On 1/23/2011 6:06 PM, pozz wrote:
> On 23/01/2011 17:06, Eric Sosman wrote:
>> [...]
>> You say that the original "is more time consuming" than the
>> second. How much more? What have your measurements shown? (You
>> *have* measured the speed difference, have you not? Surely you
>> would not be so rash as to state that there "is" a difference without
>> evidence to support your assertion, would you? Hmmm?)

>
> Oh, I haven't made any measurements. But I think that a simple cast
> should be much faster than copying, shifting, and OR-ing bytes into the
> correct order. Isn't this true?


What have your measurements shown?

Compilers are clever, even devious, in what they'll do to
optimize code, and what you see in the source may have only a
sketchy resemblance to what the CPU winds up doing. Also, on
many systems the CPU is hundreds of times faster than memory,
so "optimization" is largely about scheduling memory references
advantageously rather than worrying about how many CPU-internal
manipulations are done on the data once it's fetched. (And if
you think you can count the memory fetches by studying the source
of the two proposed snippets, see the earlier remark about the
deviousness of optimizers.)

If you actually care about the relative speed, the only
practical approach is to measure. Relying on operation counts
(as seen in the source) is not entirely useless, but as near to
it as makes no never-mind. MEASURE!
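
For the record, a crude harness takes only minutes to write. Here is
a sketch using standard C's clock() -- on a bare-metal target you
would read a hardware timer instead, and the candidate routine and
repetition count are placeholders to adjust:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Candidate under test; swap in whichever version you like. */
static uint32_t extract(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

volatile uint32_t sink;   /* keeps the optimizer from deleting the work */

int main(void)
{
    static const unsigned char data[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    const unsigned long reps = 10000000UL;

    clock_t t0 = clock();
    for (unsigned long r = 0; r < reps; r++)
        sink = extract(&data[r & 3]);   /* varying offset defeats hoisting */
    clock_t t1 = clock();

    printf("%lu extractions: %.3f s\n", reps,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}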

> Keep in mind that I use a low-end embedded processor with very limited
> resources compared with x86 desktop processors.


How can I "keep in mind" a datum that you now offer for the
very first time? Or as Alice put it, "How can I have more when
I haven't had any?" Besides, in your original post you asked
about "most portable" solutions for "all the CPUs (Big and Little
Endian)," not for one specific unnamed processor.

--
Eric Sosman
esosman@ieee-dot-org.invalid

Willem 01-24-2011 08:34 PM

Re: Data alignment and endianess
 
pozz wrote:
) You are saying that code A could be faster than code B on a platform
) (CPU+compiler), but could be slower on another platform.
) It depends on the memory access time, compiler optimization and so on.
)
) In my case, I work on embedded low-end microcontrollers (8- or 16-bit),
) with internal RAM and Flash memories.
) I made a test:
) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
) i = *(int *)&data[1];
) is translated into:
) MOVW A, _data+1
) MOVW @RW0, A
) With this example I understand that this processor can access int
) variables even at non-aligned addresses. And the assignment is very
) fast: just two move instructions.
)
) While:
) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
) i = ((int)data[1] << 8) + ((int)data[2] << 0);
) is translated as:
) MOV A, _data+1
) SWAP
) MOV A, _data+2
) ADDW A
) MOVW @RW0, A
) In this case I have 5 instructions, 3 of which are moves. I think the
) second code is more portable, but more time-consuming.

Did you try this with full optimization?
A good compiler should have emitted the same instructions for both pieces
of source code, assuming the first set of instructions is indeed faster
in the case of misaligned access.

) This is the reason for my first question: how can I write code that is
) at the same time "full portable" (or almost portable) on every
) CPU/compiler and fast (for example, using the misaligned access feature
) as in my CPU)?

Get a good optimizing compiler, and have it do the work for you.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Bartc 01-24-2011 08:53 PM

Re: Data alignment and endianess
 
"pozz" <pNoOzSzPuAgMno@gmail.com> wrote in message
news:ihkn21$6u6$1@nnrp.ngi.it...
> On 24/01/2011 13:52, Eric Sosman wrote:


> In my case, I work on embedded low-end microcontrollers (8- or 16-bit),
> with internal RAM and Flash memories.
> I made a test:
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = *(int *)&data[1];
> is translated into:
> MOVW A, _data+1
> MOVW @RW0, A
> With this example I understand that this processor can access int
> variables even at non-aligned addresses. And the assignment is very fast:
> just two move instructions.
>
> While:
> const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 };
> i = ((int)data[1] << 8) + ((int)data[2] << 0);
> is translated as:
> MOV A, _data+1
> SWAP
> MOV A, _data+2
> ADDW A
> MOVW @RW0, A
> In this case I have 5 instructions, 3 of which are moves. I think the second
> code is more portable, but more time-consuming.
>
> This is the reason for my first question: how can I write code that is at
> the same time "full portable" (or almost portable) on every CPU/compiler
> and fast (for example, using the misaligned access feature as in my CPU)?


You want to use portable code for misaligned memory loads and stores, but
don't want a speed penalty where the hardware can directly do misaligned
access?

(But presumably you are happy for there to be a penalty anyway where the
hardware does not allow misaligned access? Some processors may not even be
byte addressable: your char data[] could be stored one character per word.
Or could be stored packed, so that quite a few instructions are needed to do
even byte access to the data, let alone combine arbitrary bytes into a
word.)

In C I believe this sort of thing is handled with conditional code, hidden
behind macros. It's not pretty, but means code can be (manually) tweaked
according to the target processor.

For example:

#if ...
#define GETINT(addr) (*(int *)(addr))
#else
#define GETINT(addr) ((*(unsigned char *)(addr) << 8) | *((unsigned char *)(addr) + 1))
#endif

(made-up macros not tested). Then with conditionals, you compile one or the
other depending on which is best. If misaligned loads are OK, use the first;
if not, use the other. But there needs to be some flag or other which is set
differently in each target.

And they'd be used as i = GETINT(&data[1]);
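
A fleshed-out sketch of the same idea, with HAVE_UNALIGNED_ACCESS
standing in for whatever flag the build sets per target (the flag and
the macro name are both invented for the example):

#include <stdint.h>

#ifdef HAVE_UNALIGNED_ACCESS
/* Fast path: the target tolerates misaligned loads. This yields
 * bytes in the host's own order, so it is only correct where host
 * order matches the data's byte order. */
#define GETU16(p) (*(const uint16_t *)(const void *)(p))
#else
/* Portable fallback: assemble the (big-endian) value by hand. */
#define GETU16(p) ((uint16_t)((((const unsigned char *)(p))[0] << 8) \
                            | ((const unsigned char *)(p))[1]))
#endif

Used the same way: i = GETU16(&data[1]);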

--
Bartc




Bartc 01-25-2011 12:04 AM

Re: Data alignment and endianess
 
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:ihldu4$ou6$1@news.eternal-september.org...

> 3) ... but okay: Suppose that by expending hours and hours of
> careful measurement and days and days of programming and weeks
> and weeks of ongoing maintenance you actually *do* manage to
> process the number in 30ns instead of 51, for a clear savings
> of nearly forty percent, whee! Now, please calculate how
> many numbers you need to process in order to break even on the
> time you spent engineering the savings. For argument's sake,
> suppose you somehow got through all the initial work in just
> one forty-hour week, and divide that by 21ns per number to find
> the payoff moment. (With these completely made-up numbers, you
> will break even at 6.8 TRILLION input values. YMMV, of course.)


You also have to multiply any savings by the number of people who are going
to run the program, and by how many times they run it.

These savings are typical of the improvements you might get through one
minor optimisation in a compiler. Or a slightly different strategy for
predicting branches in a processor.

By themselves they are nothing. But there might be hundreds of such
improvements. They might be invoked a billion times per program. And there
might be a million copies of the program.

At the very least, valuable experience can be gained as to how far a piece
of software or hardware can be pushed.

Presumably you are not suggesting that compiler writers and processor
designers shouldn't bother with the stuff they do?

That +2 clocks penalty for a misaligned load, on page 76 of that processor
manual, is mentioned for a reason.

--
Bartc





Eric Sosman 01-25-2011 02:52 AM

Re: Data alignment and endianess
 
On 1/24/2011 3:22 PM, pozz wrote:
> On 24/01/2011 13:52, Eric Sosman wrote:
>> [...]
>> If you actually care about the relative speed, the only
>> practical approach is to measure. Relying on operation counts
>> (as seen in the source) is not entirely useless, but as near to
>> it as makes no never-mind. MEASURE!

>
> You are saying that code A could be faster than code B on a platform
> (CPU+compiler), but could be slower on another platform.


Yes. If you believe A is always faster/slower and B is always
slower/faster, regardless of platform, compiler version, compiler
options, program context, and phase of the Moon, you are too young
and inexperienced for the work you're engaged in.

> [...]
> In this case I have 5 instructions, 3 of which are moves. I think the
> second code is more portable, but more time-consuming.


"I think," you say, but I notice that you STILL don't say
"I measure." What's the matter? Scared?

> This is the reason for my first question: how can I write code that is
> at the same time "full portable" (or almost portable) on every
> CPU/compiler and fast (for example, using the misaligned access feature
> as in my CPU)?
>
> After discussing with you, I think there isn't a solution for that
> question.


It is surpassingly unlikely that a straight-line source-code
sequence A will be faster than B in all circumstances. Compilers
are startlingly subtle, but are not magical. So, what's a poor
programmer to do? There are several approaches, among them:

1) Lard your code with #ifdef's for different platforms. This
approach offers the best opportunities for speed, but the
worst burden for maintenance: You wind up with N independent
implementations, >=N times the effort to keep them current.
There's also bit-rot to consider: `#ifdef FROBOZZ_MAGIC_C'
may be fine today, but next month Frobozz comes out with a
new and improved version that optimizes differently. If you
don't have a continuing project to verify and re-verify all
those hand-crafted optimizations, at least a few of them
*will* turn into pessimizations. And if you *do* have such
a project, all you can look forward to is adding a new
`#ifdef FROBOZZ_MAGIC_C_VERSION_3_POINT_5_OR_GREATER' and
increasing your maintenance burden from N to (N+1).

2) Measure (that *awful* word!) the speeds of A and B on several
platforms of interest, and ponder the magnitude of the spread.
You're worried about extracting a 32-bit integer from a
network packet, fine. Let's suppose that A takes 32-44 ns
on your suite of platforms, while B takes 30-51. Well, is
that difference significant in relation to the speed at which
the integers arrive? If you're already sucking the data across
a WiFi link shared by forty-seven other coffee drinkers, and
shipping them through an entire TCP/IP stack with checksumming
and window management and retransmissions and ACK's and all the
rest, the difference between 32ns and 51ns is unlikely to matter.

3) ... but okay: Suppose that by expending hours and hours of
careful measurement and days and days of programming and weeks
and weeks of ongoing maintenance you actually *do* manage to
process the number in 30ns instead of 51, for a clear savings
of nearly forty percent, whee! Now, please calculate how
many numbers you need to process in order to break even on the
time you spent engineering the savings. For argument's sake,
suppose you somehow got through all the initial work in just
one forty-hour week, and divide that by 21ns per number to find
the payoff moment. (With these completely made-up numbers, you
will break even at 6.8 TRILLION input values. YMMV, of course.)

Okay, this analysis is far from complete. I've focused on payoff --
time invested versus time saved -- and that's not the only "performance"
measure. Still, it's a useful measure to keep in mind when assessing
whether something is worth doing.

Here's a (possibly misleading) analogy I've used before: With the
price of gasoline on the rise, you may well desire to improve your
car's fuel efficiency. Should you give the old jalopy a fresh coat
of wax? YEA: It will make the car slipperier, reducing the wind
resistance and allowing the engine to work less hard. NAY: It will
make the car heavier, forcing the engine to drag a heavier load down
the road. Oh, gosh, oh, gosh, whatever should I do?

Answer: Stop fretting about wax and unhitch the boat trailer.

... and that is the LAST you will hear from me on this thread
until and unless you offer some actual MEASUREMENTS! If you say
"I think" or "It stands to reason" or "Everyone knows," I will
personally come after you with a big book of six-place logarithms
and bludgeon you so hard your mantissa will fall off.

--
Eric Sosman
esosman@ieee-dot-org.invalid

Eric Sosman 01-25-2011 04:29 AM

Re: Data alignment and endianess
 
On 1/24/2011 7:04 PM, Bartc wrote:
> "Eric Sosman"<esosman@ieee-dot-org.invalid> wrote in message
> news:ihldu4$ou6$1@news.eternal-september.org...
>
>> 3) ... but okay: Suppose that by expending hours and hours of
>> careful measurement and days and days of programming and weeks
>> and weeks of ongoing maintenance you actually *do* manage to
>> process the number in 30ns instead of 51, for a clear savings
>> of nearly forty percent, whee! Now, please calculate how
>> many numbers you need to process in order to break even on the
>> time you spent engineering the savings. For argument's sake,
>> suppose you somehow got through all the initial work in just
>> one forty-hour week, and divide that by 21ns per number to find
>> the payoff moment. (With these completely made-up numbers, you
>> will break even at 6.8 TRILLION input values. YMMV, of course.)

>
> You also have to multiply any savings by the number of people who are going
> to run the program, and by how many times they run it.


You still need 6.8 trillion (or whatever) to break even, spread
it out however you may.

> Presumably you are not suggesting that compiler writers and processor
> designers shouldn't bother with the stuff they do?


IMHO they bother with lots of stuff that's pointless, in pursuit
of a better SPECKmock than the next guy. But even so: The CPU maven,
the compiler wizard, even the library fabricator -- all these folks
face the prospect that their stuff really *will* run trillions of
times. The O.P. has offered *no* reason to think he's at that sort
of level, not even within several orders of magnitude.

Of course, this famous "6.8 trillion" is an entirely fictitious
number. It's not to be taken literally, but as an illustration of
the sort of calculation the O.P. *ought* to make before expending
further effort on lightening his car by installing thinner windshields.
But he doesn't seem to be quantitatively inclined; if anything, he's
quantitatively disinclined. Such people should give up on programming
and resign themselves to being computer science professors. ;-)

--
Eric Sosman
esosman@ieee-dot-org.invalid

Bartc 01-25-2011 10:30 AM

Re: Data alignment and endianess
 
"Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
news:0.52b00768d99351f72e65.20110125132656GMT.87d3nldsgv.fsf@bsb.me.uk...
> Eric Sosman <esosman@ieee-dot-org.invalid> writes:


>> You still need 6.8 trillion (or whatever) to break even, spread
>> it out however you may.

>
> I don't buy these peculiar sums! The hours I spend shaving a few ms off
> a program run (note ms and not just ns) are real hours that I could
> have spent doing something else -- adding a feature, fixing a bug, going
> home on time. The few ms shaved off each run might not even be noticed
> by the user. Even if a billion users gain these extra milliseconds a
> hundred times a day, they don't sum to anything other than zero (or to
> something very close to zero)!


A billion times 1msec times 100 times a day comes to about 3 years'
savings -- per day. One entire lifetime saved per month.

> Of course there are time when it does matter, but it can't simply be
> decided by adding up time spent and time gained.


I just tried this:

for (i=0; i<N; ++i) sum+=p[i];

p is an array of integers. Misaligning p by one byte on an x86 makes this
code 20% slower. That may or may not have significance for the application.
But if you are writing a library function (where N is a parameter) or are
responsible for allocating p, then it's something that needs to be
considered, and deserves some time spent on it.

But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the
runtime, then perhaps it can wait (unless the idea of a misaligned array
grates, in which case you might work on it anyway...)
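
For anyone who wants to reproduce that kind of figure, a sketch of
the experiment (buffer size and repetition count are arbitrary, and
note that the off = 1 case is formally undefined behaviour in
portable C -- which is rather the point of measuring it on a specific
target):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    4096
#define REPS 100000

/* Sum N ints starting 'off' bytes past an aligned base, so off = 0
 * is the aligned case and off = 1 the misaligned one. */
static unsigned long run(unsigned char *base, int off)
{
    int *p = (int *)(base + off);
    unsigned long sum = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            sum += (unsigned long)p[i];
    return sum;
}

int main(void)
{
    unsigned char *buf = malloc(N * sizeof(int) + 16);
    if (buf == NULL)
        return 1;
    memset(buf, 1, N * sizeof(int) + 16);

    clock_t t0 = clock();
    unsigned long s0 = run(buf, 0);
    clock_t t1 = clock();
    unsigned long s1 = run(buf, 1);
    clock_t t2 = clock();

    printf("aligned: %ld ticks, misaligned: %ld ticks (sums %lu %lu)\n",
           (long)(t1 - t0), (long)(t2 - t1), s0, s1);
    free(buf);
    return 0;
}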

--
Bartc



James Kuyper 01-25-2011 12:29 PM

Re: Data alignment and endianess
 
On 01/24/2011 03:34 PM, Willem wrote:
> pozz wrote:

....
> ) This is the reason for my first question: how can I write code that is
> ) at the same time "full portable" (or almost portable) on every
> ) CPU/compiler and fast (for example, using the misaligned access feature
> ) as in my CPU)?
>
> Get a good optimizing compiler, and have it do the work for you.


The requirements "every CPU/compiler" and "good optimizing compiler" are
incompatible.

--
James Kuyper

James Kuyper 01-25-2011 12:33 PM

Re: Data alignment and endianess
 
On 01/24/2011 03:22 PM, pozz wrote:
....
> This is the reason for my first question: how can I write code that is
> at the same time "full portable" (or almost portable) on every
> CPU/compiler and fast (for example, using the misaligned access feature
> as in my CPU)?


You can't say "every CPU/compiler" and also mandate "fast"; there are
probably some compilers out there which are inherently incapable of
generating "fast" code, by whatever standard you want to use for
defining "fast". There's also pairs of compilers out there where the
fastest code for one of the two compilers is different from the fastest
code for the other. If it is essential to you to optimize at any level
other than the language-independent level of the algorithm itself, you
must inherently abandon any hope for portability.

--
James Kuyper

