![]() |
Re: Data alignment and endianess
On 1/23/2011 10:05 AM, pozz wrote:
> I need to parse a packet of data received on a serial port or stored in > a file. The bytes are stored in an arry: > unsigned char data[PACKET_LENGTH]; > > Now I'm wondering how I can write a "full portable" code to retrieve a > 32-bit unsigned integer in Network Byte Order (Big Endian), even at > not-aligned addresses. > > Suppose I have a 32-bit integer starting from data[1]. I think the most > portable way to retrieve this integer is: > --- > uint32_t i; > i = (data[1] << 24) | (data[2] << 16) | (data[3] << 8) | data[4]; This is almost right, and will work on many machines. But on machines with 16-bit `int' (in fact, anything narrower than 32, or even a 32-bit `int' on machines that are picky about overflow) the promotion rules will trip you up. The `<<' operator will promote each `unsigned char' to an `int' (or `unsigned int' on some machines), but not to a `uint32_t', and then the shift may not work as desired. To ensure you're shifting with the proper width, then, you need something like i = (((uint32_t)data[1]) << 24) | (((uint32_t)data[2]) << 16) | (((uint32_t)data[3]) << 8) | (((uint32_t)data[4]) << 0); (You could lose the final shift, and even the final cast, but the pattern may be more readable if you stick with it all the way. You could also lose the outermost sets of parentheses, but I'd suggest keeping them in the interests of clarity.) > i = ntohl(i); No. The original calculation has already produced the value you want, so no further mucking around is necessary or desirable. By the way, you'll probably find ntohl() and its relatives already defined for you on systems that support networking. If you need them (you don't, here), prefer the system's versions to free-hand code. > [...] > What happens if I retrieve a 32-bit integer on a Big Endian CPU at an > aligned address, for example starting from data[0]? With the code above, > the CPU will constructs the integer starting from the bytes, copying and > shifting them. > The code works well, but is more time consuming compared with a simple > cast: > --- > uint32_t i = *(uint32_t *)&data[0]; For this variant, you *do* need ntohl() or something like it. Alignment is still a problem, even starting at `data[0]', because we have no reason (in what you've shown) to think that `data' itself is aligned in any particular way. You say that the original "is more time consuming" than the second. How much more? What have your measurements shown? (You *have* measured the speed difference, have you not? Surely you would not be so rash as to state that there "is" a difference without evidence to support your assertion, would you? Hmmm?) > Can you suggest a better snippet code? Go back to the original, minus the ntohl(). Until and unless you have solid evidence that simple and portable code is a bottleneck, leave it portable and simple. "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason--including blind stupidity." - W.A. Wulf -- Eric Sosman esosman@ieee-dot-org.invalid |
Re: Data alignment and endianess
On 1/23/2011 6:06 PM, pozz wrote:
> Il 23/01/2011 17:06, Eric Sosman ha scritto: >> [...] >> You say that the original "is more time consuming" than the >> second. How much more? What have your measurements shown? (You >> *have* measured the speed difference, have you not? Surely you >> would not be so rash as to state that there "is" a difference without >> evidence to support your assertion, would you? Hmmm?) > > Oh, I haven't made any measurement. But I think that a simple cast could > be very fast instead of copying, shifting and or-ing bytes in the > correct order. Isn't this true? What have your measurements shown? Compilers are clever, even devious, in what they'll do to optimize code, and what you see in the source may have only a sketchy resemblance to what the CPU winds up doing. Also, on many systems the CPU is hundreds of times faster than memory, so "optimization" is largely about scheduling memory references advantageously rather than worrying about how many CPU-internal manipulations are done on the data once it's fetched. (And if you think you can count the memory fetches by studying the source of the two proposed snippets, see the earlier remark about the deviousness of optimizers.) If you actually care about the relative speed, the only practical approach is to measure. Relying on operation counts (as seen in the source) is not entirely useless, but as near to it as makes no never-mind. MEASURE! > Keep in mind that I use low-end embedded processor with very limited > resources compared with x86 desktop processors. How can I "keep in mind" a datum that you now offer for the very first time? Or as Alice put it, "How can I have more when I have'n't had any?" Besides, in your original post you asked about "most portable" solutions for "all the CPUs (Big and Little Endian)," not for one specific unnamed processor. -- Eric Sosman esosman@ieee-dot-org.invalid |
Re: Data alignment and endianess
pozz wrote:
) You are saying that code A could be faster than code B on a platform ) (CPU+compiler), but could be slower on another platform. ) It depends on the memory access time, compiler optimization and so on. ) ) In my case, I work on embedded low-end microcontrollers (8- or 16-bits), ) with internal RAM and Flash memories. ) I made a test: ) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 }; ) i = *(int *)&data[1]; ) is translated in: ) MOVW A, _data+1 ) MOVW @RW0, A ) With this example I understand that this processor can access int ) variables even at non aligned addresses. And the assignment is very ) fast: just two move instructions. ) ) While: ) const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 }; ) i = ((int)data[1] << 8) + ((int)data[2] << 0); ) is translated as: ) MOV A, _data+1 ) SWAP ) MOV A, _data+2 ) ADDW A ) MOVW @RW0, A ) In this case I have 5 instructions, 3 of them are move. I think the ) second code is more portable, but more time consuming. Did you try this with full optimization ? A good compiler should have emitted the same instructions for both pieces of source code, assuming the first set of instructions is indeed faster in the case of misaligned access. ) This is the reason for my first question: how can I wrote code that is ) at the same time "full portable" (or almost portable) on every ) CPU/compiler and fast (for example, using the misaligned access feature ) as in my CPU)? Get a good optimizing compiler, and have it do the work for you. SaSW, Willem -- Disclaimer: I am in no way responsible for any of the statements made in the above text. For all I know I might be drugged or something.. No I'm not paranoid. You all think I'm paranoid, don't you ! #EOT |
Re: Data alignment and endianess
"pozz" <pNoOzSzPuAgMno@gmail.com> wrote in message
news:ihkn21$6u6$1@nnrp.ngi.it... > Il 24/01/2011 13:52, Eric Sosman ha scritto: > In my case, I work on embedded low-end microcontrollers (8- or 16-bits), > with internal RAM and Flash memories. > I made a test: > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 }; > i = *(int *)&data[1]; > is translated in: > MOVW A, _data+1 > MOVW @RW0, A > With this example I understand that this processor can access int > variables even at non aligned addresses. And the assignment is very fast: > just two move instructions. > > While: > const unsigned char data[] = { 0x01, 0x02, 0x03, 0x04, 0x05 }; > i = ((int)data[1] << 8) + ((int)data[2] << 0); > is translated as: > MOV A, _data+1 > SWAP > MOV A, _data+2 > ADDW A > MOVW @RW0, A > In this case I have 5 instructions, 3 of them are move. I think the second > code is more portable, but more time consuming. > > This is the reason for my first question: how can I wrote code that is at > the same time "full portable" (or almost portable) on every CPU/compiler > and fast (for example, using the misaligned access feature as in my CPU)? You want to use portable code for misaligned memory loads and stores, but don't want a speed penalty where the hardware can directly do misaligned access? (But presumably you are happy for there to be a penalty anyway where the hardware does not allow misaligned access? Some processors may not even be byte addressable: your char data[] could be stored one character per word. Or could be stored packed, so that quite a few instructions are needed to do even byte access to the data, let alone combine arbitrary bytes into a word.) In C I believe this sort of thing is handled with conditional code, hidden behind macros. It's not pretty, but means code can be (manually) tweaked according to the target processor. For example: #if ... #define GETINT(addr) (*(int*)(addr) #else #define GETINT(addr) (*(char*)(addr)<<8) + (*(char*)(addr+1)) #endif (made-up macros not tested). Then with conditionals, you compile one or the other depending on which is best. If misaligned loads are OK, use the first; if not, use the other. But there needs to be some flag or other which is set differently in each target. And they'd be used as i = GETINT(&data[1]); -- Bartc |
Re: Data alignment and endianess
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:ihldu4$ou6$1@news.eternal-september.org... > 3) ... but okay: Suppose that by expending hours and hours of > careful measurement and days and days of programming and weeks > and weeks of ongoing maintenance you actually *do* manage to > process the number in 30ns instead of 51, for a clear savings > of nearly forty percent, whee! Now, please calculate how > many numbers you need to process in order to break even on the > time you spent engineering the savings. For argument's sake, > suppose you somehow got through all the initial work in just > one forty-hour week, and divide that by 21ns per number to find > the payoff moment. (With these completely made-up numbers, you > will break even at 6.8 TRILLION input values. YMMV, of course.) You also have to multiply any savings by the number of people who are going to run the program, and by how many times they run it. These savings are typical of the improvements you might get through one minor optimisation in a compiler. Or a slightly different strategy for predicting branches in a processor. By themselves they are nothing. But there might be hundreds of such improvements. They might be invoked a billion times per program. And there might be a million copies of the program. At the very least, valuable experience can be gained as to how far a piece of software or hardware can be pushed. Presumably you are not suggesting that compiler writers and processor designers shouldn't bother with the stuff they do? That +2 clocks penalty for a misaligned load, on page 76 of that processor manual, is mentioned for a reason. -- Bartc |
Re: Data alignment and endianess
On 1/24/2011 3:22 PM, pozz wrote:
> Il 24/01/2011 13:52, Eric Sosman ha scritto: >> [...] >> If you actually care about the relative speed, the only >> practical approach is to measure. Relying on operation counts >> (as seen in the source) is not entirely useless, but as near to >> it as makes no never-mind. MEASURE! > > You are saying that code A could be faster than code B on a platform > (CPU+compiler), but could be slower on another platform. Yes. If you believe A is always faster/slower and B is always slower/faster, regardless of platform, compiler version, compiler options, program context, and phase of the Moon, you are too young and inexperienced for the work you're engaged in. > [...] > In this case I have 5 instructions, 3 of them are move. I think the > second code is more portable, but more time consuming. "I think," you say, but I notice that you STILL don't say "I measure." What's the matter? Scared? > This is the reason for my first question: how can I wrote code that is > at the same time "full portable" (or almost portable) on every > CPU/compiler and fast (for example, using the misaligned access feature > as in my CPU)? > > After discussing with you, I think there isn't a solution for that > question. It is surpassingly unlikely that a straight-line source-code sequence A will be faster than B in all circumstances. Compilers are startlingly subtle, but are not magical. So, what's a poor programmer to do? There are several approaches, among them: 1) Lard your code with #ifdef's for different platforms. This approach offers the best opportunities for speed, but the worst burden for maintenance: You wind up with N independent implementations, >=N times the effort to keep them current. There's also bit-rot to consider: `#ifdef FROBOZZ_MAGIC_C' may be fine today, but next month Frobozz comes out with a new and improved version that optimizes differently. If you don't have a continuing project to verify and re-verify all those hand-crafted optimizations, at least a few of them *will* turn into pessimizations. And if you *do* have such a project, all you can look forward to is adding a new `#ifdef FROBOZZ_MAGIC_C_VERSION_3_POINT_5_OR_GREATER' and increasing your maintenance burden from N to (N+1). 2) Measure (that *awful* word!) the speeds of A and B on several platforms of interest, and ponder the magnitude of the spread. You're worried about extracting a 32-bit integer from a network packet, fine. Let's suppose that A takes 32-44 ns on your suite of platforms, while B takes 30-51. Well, is that difference significant in relation to the speed at which the integers arrive? If you're already sucking the data across a WiFi link shared by forty-seven other coffee drinkers, and shipping them through an entire TCP/IP stack with checksumming and window management and retransmissions and ACK's and all the rest, the difference between 32ns and 51ns is unlikely to matter. 3) ... but okay: Suppose that by expending hours and hours of careful measurement and days and days of programming and weeks and weeks of ongoing maintenance you actually *do* manage to process the number in 30ns instead of 51, for a clear savings of nearly forty percent, whee! Now, please calculate how many numbers you need to process in order to break even on the time you spent engineering the savings. For argument's sake, suppose you somehow got through all the initial work in just one forty-hour week, and divide that by 21ns per number to find the payoff moment. (With these completely made-up numbers, you will break even at 6.8 TRILLION input values. YMMV, of course.) Okay, this analysis is far from complete. I've focused on payoff -- time invested versus time saved -- and that's not the only "performance" measure. Still, it's a useful measure to keep in mind when assessing whether something is worth doing. Here's a (possibly misleading) analogy I've used before: With the price of gasoline on the rise, you may well desire to improve your car's fuel efficiency. Should you give the old jalopy a fresh coat of wax? YEA: It will make the car slipperier, reducing the wind resistance and allowing the engine to work less hard. NAY: It will make the car heavier, forcing the engine to drag a heavier load down the road. Oh, gosh, oh, gosh, whatever should I do? Answer: Stop fretting about wax and unhitch the boat trailer. ... and that is the LAST you will hear from me on this thread until and unless you offer some actual MEASUREMENTS! If you say "I think" or "It stands to reason" or "Everyone knows," I will personally come after you with a big book of six-place logarithms and bludgeon you so hard your mantissa will fall off. -- Eric Sosman esosman@ieee-dot-org.invalid |
Re: Data alignment and endianess
On 1/24/2011 7:04 PM, Bartc wrote:
> "Eric Sosman"<esosman@ieee-dot-org.invalid> wrote in message > news:ihldu4$ou6$1@news.eternal-september.org... > >> 3) ... but okay: Suppose that by expending hours and hours of >> careful measurement and days and days of programming and weeks >> and weeks of ongoing maintenance you actually *do* manage to >> process the number in 30ns instead of 51, for a clear savings >> of nearly forty percent, whee! Now, please calculate how >> many numbers you need to process in order to break even on the >> time you spent engineering the savings. For argument's sake, >> suppose you somehow got through all the initial work in just >> one forty-hour week, and divide that by 21ns per number to find >> the payoff moment. (With these completely made-up numbers, you >> will break even at 6.8 TRILLION input values. YMMV, of course.) > > You also have to multiply any savings by the number of people who are going > to run the program, and by how many times they run it. You still need 6.8 trillion (or whatever) to break even, spread it out however you may. > Presumably you are not suggesting that compiler writers and processor > designers shouldn't bother with the stuff they do? IMHO they bother with lots of stuff that's pointless, in pursuit of a better SPECKmock than the next guy. But even so: The CPU maven, the compiler wizard, even the library fabricator -- all these folks face the prospect that their stuff really *will* run trillions of times. The O.P. has offered *no* reason to think he's at that sort of level, not even within several orders of magnitude. Of course, this famous "6.8 trillion" is an entirely fictitious number. It's not to be taken literally, but as an illustration of the sort of calculation the O.P. *ought* to make before expending further effort on lightening his car by installing thinner windshields. But he doesn't seem to be quantitatively inclined; if anything, he's quantitatively disinclined. Such people should give up on programming and resign themselves to being computer science professors. ;-) -- Eric Sosman esosman@ieee-dot-org.invalid |
Re: Data alignment and endianess
"Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
news:0.52b00768d99351f72e65.20110125132656GMT.87d3 nldsgv.fsf@bsb.me.uk... > Eric Sosman <esosman@ieee-dot-org.invalid> writes: >> You still need 6.8 trillion (or whatever) to break even, spread >> it out however you may. > > I don't buy these peculiar sums! The hours I spend shaving a few ms off > a program run (note ms and not just ns) are real hours that I could > have spent doing something else -- adding a feature, fixing a bug, going > home on time. The few ms shaved off each run might not even be noticed > by the user. Even if a billion users gain these extra milliseconds a > hundred times a day, they don't sum to anything other than zero (or to > something very close to zero)! A billion times 1msec times 100 times a day comes to about 3 years' savings -- per day. One entire lifetime saved per month. > Of course there are time when it does matter, but it can't simply be > decided by adding up time spent and time gained. I just tried this: for (i=0; i<N; ++i) sum+=p[i]; p is an array of integers, Misaligning by 1 byte on an x86 makes this code 20% slower. That may or may not have significance for the application. But if you are writing a library function (where N is a parameter) or are responsible for allocating p, then it's something that needs to be considered, and deserves some time spent on it. But if you know it will only ever save 20% (strictly, 16.7%) of 1% of the runtime, then perhaps it can wait (unless the idea of a misaligned array grates, then you might work on it anyway..) -- Bartc |
Re: Data alignment and endianess
On 01/24/2011 03:34 PM, Willem wrote:
> pozz wrote: .... > ) This is the reason for my first question: how can I wrote code that is > ) at the same time "full portable" (or almost portable) on every > ) CPU/compiler and fast (for example, using the misaligned access feature > ) as in my CPU)? > > Get a good optimizing compiler, and have it do the work for you. The requirements "every CPU/compiler" and "good optimizing compiler" are incompatible. -- James Kuyper |
Re: Data alignment and endianess
On 01/24/2011 03:22 PM, pozz wrote:
.... > This is the reason for my first question: how can I wrote code that is > at the same time "full portable" (or almost portable) on every > CPU/compiler and fast (for example, using the misaligned access feature > as in my CPU)? You can't say "every CPU/compiler" and also mandate "fast"; there are probably some compilers out there which are inherently incapable of generating "fast" code, by whatever standard you want to use for defining "fast". There's also pairs of compilers out there where the fastest code for one of the two compilers is different from the fastest code for the other. If it is essential to you to optimize at any level other than the language-independent level of the algorithm itself, you must inherently abandon any hope for portability. -- James Kuyper |
| All times are GMT. The time now is 12:16 PM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.