asm code for ARM (very simple)

 
 
Gernot Frisch
 
      09-04-2008
Hi,

Can someone please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;
}


Thank you.


--
------------------------------------
Gernot Frisch
http://www.glbasic.com

 
Michael DOUBEZ
 
      09-04-2008
Gernot Frisch wrote:
> Hi,
>
> can someone, please optimize this routine for an ARM processor?
>
> inline void _QCopy4(register unsigned long* a,
> register unsigned long* b,
> register unsigned int ndwords)
> {
> // copy 4 bytes at once
> for(; ndwords>0; ndwords--) *a++=*b++;
> }


Depending on the architecture you can:
- use DMA to copy the data: slow to set up, but may be more efficient
if you have a lot of data
- use burst transfers: these can usually copy 4 longs in one burst
(check your hardware)
- configure your cache policy

This has nothing to do with C++.

Otherwise:
- use memmove instead of memcpy when that is the intended semantics
(ranges that may overlap)
- try to keep your data aligned on 4-, 8- or 16-byte boundaries

And the most important: benchmark to locate your bottleneck.
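
To illustrate the memmove/memcpy item above, here is a minimal sketch, assuming a hosted toolchain whose C library ships tuned copy routines. The names copy_words and move_words are invented for this example and are not part of the original code.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical replacement for the hand-written loop: let the library copy.
// memcpy requires that the source and destination ranges do not overlap.
inline void copy_words(unsigned long* dst, const unsigned long* src,
                       std::size_t nwords)
{
    std::memcpy(dst, src, nwords * sizeof *dst);
}

// Same copy, but well defined when the two ranges overlap.
inline void move_words(unsigned long* dst, const unsigned long* src,
                       std::size_t nwords)
{
    std::memmove(dst, src, nwords * sizeof *dst);
}
```

Whether either one beats the plain loop on a given ARM target is something only a measurement can tell.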

--
Michael
 
peter koch
 
      09-04-2008
On 4 Sep., 09:52, "Gernot Frisch" <(E-Mail Removed)> wrote:
> Hi,
>
> can someone, please optimize this routine for an ARM processor?
>
> inline void _QCopy4(register unsigned long* a,
> register unsigned long* b,
> register unsigned int ndwords)
> {
> // copy 4 bytes at once
> for(; ndwords>0; ndwords--) *a++=*b++;
>
> }


What benchmarks did you run that made you decide that this function
is a bottleneck and that the code generated by the compiler is
inadequate? I would expect the compiler to be able to generate quite
good if not optimal code here.
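
A rough sketch of such a measurement, assuming a C++11 toolchain with <chrono> available on the target. The names loop_copy, lib_copy and time_copy are made up for this example, and a serious benchmark would also need to control for caches and for the optimizer hoisting or removing repeated copies.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// The loop from the original post, with a pointer-to-const source.
inline void loop_copy(unsigned long* a, const unsigned long* b, std::size_t n)
{
    for (; n > 0; --n) *a++ = *b++;
}

// The library routine, for comparison.
inline void lib_copy(unsigned long* a, const unsigned long* b, std::size_t n)
{
    std::memcpy(a, b, n * sizeof *a);
}

// Times `reps` calls of `copy` over `nwords` words (nwords must be > 0)
// and returns the elapsed time in nanoseconds.
template <typename Copy>
long long time_copy(Copy copy, std::size_t nwords, int reps)
{
    std::vector<unsigned long> src(nwords, 0x12345678UL), dst(nwords);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        copy(dst.data(), src.data(), nwords);
    }
    const auto stop = std::chrono::steady_clock::now();
    // Read the result through a volatile so the copies are not dead code.
    volatile unsigned long sink = dst[nwords - 1];
    (void)sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Comparing time_copy(loop_copy, 4096, 1000) with time_copy(lib_copy, 4096, 1000) on the real hardware would show whether the loop is worth hand-tuning at all.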

/Peter
 
Jorgen Grahn
 
      09-08-2008
On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <(E-Mail Removed)> wrote:
> On 4 Sep., 09:52, "Gernot Frisch" <(E-Mail Removed)> wrote:
>> Hi,
>>
>> can someone, please optimize this routine for an ARM processor?
>>
>> inline void _QCopy4(register unsigned long* a,
>> register unsigned long* b,
>> register unsigned int ndwords)


Why is b not a pointer to const? It's an open invitation to copy in
the wrong direction.
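
As a sketch of what that could look like: the same loop, with the source declared pointer-to-const. The parameter names dst and src and the name QCopy4 (without the leading underscore, since names starting with an underscore and a capital letter are reserved for the implementation) are choices made for this example only.

```cpp
#include <cstddef>

// Same loop as in the original post, but the source is pointer-to-const
// and the parameters are named after their roles.
inline void QCopy4(unsigned long* dst, const unsigned long* src,
                   std::size_t ndwords)
{
    for (; ndwords > 0; --ndwords) *dst++ = *src++;
    // *src++ = *dst++;   // would not compile: src points to const
}
```

With this signature a const table can be passed as the source, and an accidental copy in the wrong direction inside the function becomes a compile error.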

>> {
>> // copy 4 bytes at once
>> for(; ndwords>0; ndwords--) *a++=*b++;
>>
>> }

>
> What benchmarks did you make, that made you decide that this function
> is a bottleneck and that the code generated by the compiler is
> inadequate? I would expect the compiler to be able to generate quite
> good if not optimal code here.


Unless the data is unaligned, and (which I seem to recall is the case
with ARM) unaligned reads & writes work, but are really, really slow.
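
One way to find out whether this is happening is to check the pointers before copying. A minimal sketch, assuming C++11 and a platform that provides std::uintptr_t; the helper name is_aligned_for is invented here.

```cpp
#include <cstddef>
#include <cstdint>

// True if p is suitably aligned for an object of type T.
template <typename T>
bool is_aligned_for(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

A debug-build assert(is_aligned_for<unsigned long>(a) && is_aligned_for<unsigned long>(b)) at the top of the copy routine would flag callers that pass misaligned buffers.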

(By the way, I'd skip the 'register' keyword, unless it really affects
the generated code. And if it *does*, I'd consider looking for a new
compiler.)

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
 
Michael DOUBEZ
 
      09-09-2008
Jorgen Grahn wrote:
> On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <(E-Mail Removed)> wrote:
>> On 4 Sep., 09:52, "Gernot Frisch" <(E-Mail Removed)> wrote:
>>> Hi,
>>>
>>> can someone, please optimize this routine for an ARM processor?
>>>
>>> inline void _QCopy4(register unsigned long* a,
>>> register unsigned long* b,
>>> register unsigned int ndwords)

>
> Why is b not a pointer to const? It's an open invitation to copy in
> the wrong direction.
>
>>> {
>>> // copy 4 bytes at once
>>> for(; ndwords>0; ndwords--) *a++=*b++;
>>>
>>> }

>> What benchmarks did you make, that made you decide that this function
>> is a bottleneck and that the code generated by the compiler is
>> inadequate? I would expect the compiler to be able to generate quite
>> good if not optimal code here.

>
> Unless the data is unaligned, and (which I seem to recall is the case
> with ARM) unaligned reads & writes work, but are really, really slow.


AFAIK, an MMU can be integrated as an option with most ARM cores today.

But there is no problem here, since the parameters are pointers to long;
they should be on the right boundaries, unless the caller coerced them,
which is UB anyway.

--
Michael
 
Jorgen Grahn
 
      09-09-2008
On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <(E-Mail Removed)> wrote:
> Jorgen Grahn wrote:
>> On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <(E-Mail Removed)> wrote:
>>> On 4 Sep., 09:52, "Gernot Frisch" <(E-Mail Removed)> wrote:
>>>> Hi,
>>>>
>>>> can someone, please optimize this routine for an ARM processor?
>>>>
>>>> inline void _QCopy4(register unsigned long* a,
>>>> register unsigned long* b,
>>>> register unsigned int ndwords)
>>>> {
>>>> // copy 4 bytes at once
>>>> for(; ndwords>0; ndwords--) *a++=*b++;
>>>>
>>>> }
>>> What benchmarks did you make, that made you decide that this function
>>> is a bottleneck and that the code generated by the compiler is
>>> inadequate? I would expect the compiler to be able to generate quite
>>> good if not optimal code here.

>>
>> Unless the data is unaligned, and (which I seem to recall is the case
>> with ARM) unaligned reads & writes work, but are really, really slow.

>
> AFAIK an MMU can be integrated as an option with most ARM today.


I'm not sure an MMU has anything to do with it. I have seen two or
three different systems without an MMU which worked "correctly" with
unaligned accesses, but at a huge speed penalty.

> But there is no problem here since the parameters are long pointer; they
> should be on the right boundaries. Unless the caller coerced them but
> this is however UB.


It's UB, but it's unfortunately very common out there, especially in
embedded systems. It might be part of this particular problem.

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
 
Michael DOUBEZ
 
      09-09-2008
Jorgen Grahn wrote:
> On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <(E-Mail Removed)> wrote:
>> Jorgen Grahn wrote:
>>> On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <(E-Mail Removed)> wrote:
>>>> On 4 Sep., 09:52, "Gernot Frisch" <(E-Mail Removed)> wrote:
>>>>> Hi,
>>>>>
>>>>> can someone, please optimize this routine for an ARM processor?
>>>>>
>>>>> inline void _QCopy4(register unsigned long* a,
>>>>> register unsigned long* b,
>>>>> register unsigned int ndwords)
>>>>> {
>>>>> // copy 4 bytes at once
>>>>> for(; ndwords>0; ndwords--) *a++=*b++;
>>>>>
>>>>> }
>>>> What benchmarks did you make, that made you decide that this function
>>>> is a bottleneck and that the code generated by the compiler is
>>>> inadequate? I would expect the compiler to be able to generate quite
>>>> good if not optimal code here.
>>> Unless the data is unaligned, and (which I seem to recall is the case
>>> with ARM) unaligned reads & writes work, but are really, really slow.

>> AFAIK an MMU can be integrated as an option with most ARM today.

>
> I'm not sure an MMU has anything to do with it. I have seen two or
> three different systems without an MMU which worked "correctly" with
> unaligned accesses, but at a huge speed penalty.


Without an MMU, you get corrupted data unless the software you use can
add some magic.

I know it from experience: we had this problem on a network device where
data was written by an Ethernet device into 4-byte-aligned memory. But
the Ethernet header is 14 bytes long, which means that all the remaining
data (IP addresses, TCP information, application data ...) was unaligned.
I won't elaborate on the fact that the development done before we
received the chip relied on cleanly aligned data.

We got through it but an MMU seemed priceless
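
For the Ethernet case above, one portable workaround is to copy each field out with memcpy instead of dereferencing a misaligned pointer. This is only a sketch, not what was done on that project: it assumes an untagged Ethernet II frame carrying IPv4 (where the source address sits 12 bytes into the IP header), and the names read_ipv4_src, kEthHeaderLen and kIpv4SrcOffset are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Offsets assumed for this sketch: untagged Ethernet II header (14 bytes)
// followed by an IPv4 header, whose source address starts at byte 12.
constexpr std::size_t kEthHeaderLen = 14;
constexpr std::size_t kIpv4SrcOffset = 12;

// Reads the IPv4 source address from a frame without ever forming a
// misaligned std::uint32_t*. The value keeps network byte order.
inline std::uint32_t read_ipv4_src(const unsigned char* frame)
{
    std::uint32_t addr;
    std::memcpy(&addr, frame + kEthHeaderLen + kIpv4SrcOffset, sizeof addr);
    return addr;
}
```

With the frame buffer itself 4-byte aligned, the field starts at offset 26, which is 2 modulo 4, so a direct 32-bit load there is exactly the unaligned access under discussion.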

>> But there is no problem here since the parameters are long pointer; they
>> should be on the right boundaries. Unless the caller coerced them but
>> this is however UB.

>
> It's UB, but it's unfortunately very common out there, especially in
> embedded systems. It might be part of this particular problem.


It might be if you write in C, but reinterpret_cast<> tends to stand out
in C++ and gets caught at the first code review.

--
Michael
 
Jorgen Grahn
 
      09-16-2008
On Tue, 09 Sep 2008 14:18:05 +0200, Michael DOUBEZ <(E-Mail Removed)> wrote:
> Jorgen Grahn wrote:
>> On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <(E-Mail Removed)> wrote:
>>> Jorgen Grahn wrote:


>>> But there is no problem here since the parameters are long pointer; they
>>> should be on the right boundaries. Unless the caller coerced them but
>>> this is however UB.

>>
>> It's UB, but it's unfortunately very common out there, especially in
>> embedded systems. It might be part of this particular problem.

>
> It might be if you write in C, but reinterpret_cast<> tend to stand out
> in C++ and is caught at the first code review.


Yes, and I like gcc's -Wold-style-cast flag.

But you are assuming real-world projects use reinterpret_cast<>,
perform code reviews, and care about type safety. I agree that they
*should* (I think it would pay off quickly), but many don't.

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
 