How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient way of summing 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
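
For reference, NEON does have 8-bit integer lanes, and its pairwise widening adds can sum adjacent bytes without first converting each one to 32 bits. The following is only a minimal sketch in C with NEON intrinsics, not the code that was eventually used in this thread. The function name `shrink_row_by4` and its parameters are made up for illustration, and it assumes "shrink by 4" means averaging each group of 4 consecutive bytes in a row with a truncating shift. A scalar loop handles the tail and also serves as the fallback when NEON is not available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and signature are assumptions):
 * src holds 4*n bytes, dst receives n bytes.
 * Each output byte is (sum of 4 consecutive input bytes) >> 2. */
void shrink_row_by4(const uint8_t *src, uint8_t *dst, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 32 input bytes -> 8 output bytes per iteration. */
    for (; i + 8 <= n; i += 8) {
        uint8x16_t a = vld1q_u8(src + 4 * i);
        uint8x16_t b = vld1q_u8(src + 4 * i + 16);

        /* Pairwise widening add: 16 x u8 -> 8 x u16 (sums of 2 bytes). */
        uint16x8_t pa = vpaddlq_u8(a);
        uint16x8_t pb = vpaddlq_u8(b);

        /* Pairwise add again: 8 x u16 -> 4 x u16 (sums of 4 bytes). */
        uint16x4_t qa = vpadd_u16(vget_low_u16(pa), vget_high_u16(pa));
        uint16x4_t qb = vpadd_u16(vget_low_u16(pb), vget_high_u16(pb));

        /* Shift right by 2 and narrow back to 8 bits (max sum 1020 fits). */
        uint8x8_t avg = vshrn_n_u16(vcombine_u16(qa, qb), 2);
        vst1_u8(dst + i, avg);
    }
#endif

    /* Scalar tail, and the full fallback path on non-NEON targets. */
    for (; i < n; i++) {
        const uint8_t *p = src + 4 * i;
        unsigned sum = (unsigned)p[0] + p[1] + p[2] + p[3];
        dst[i] = (uint8_t)(sum >> 2);
    }
}
```

Calling `shrink_row_by4(row, out, width / 4)` once per image row would shrink the image horizontally. Whether this beats hand-written ARM code on a Cortex-A8 would need to be measured.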

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 3rd October 2010 at http://forums.arm.com

    In summary: the buggy assembler needed an extra ',' between the register and the alignment.

    Yes you are right, it works when I use:
       vld1.8 {d0}, [r1, :128]
    Thanks! I actually posted the issue on the gcc-help mailing list and got a reply from Richard Earnshaw at ARM saying that it is a bug in old versions of the assembler in binutils (not the gcc compiler), and that:

    I've just realized that older binutils are buggy and don't parse this correctly.  It will be fixed in the up-coming binutils 2.21 release, or you can download the latest sources from www.sourceware.org.


    Now I'm ready to start making more optimized functions :-) This is my first time trying to write SIMD code, so I'm wondering: are there any websites or other resources that show tricks of the trade or useful advice for writing SIMD code by hand? Otherwise I'll just try to figure it out myself based on the ARM + NEON instruction set.

    Cheers,
    Shervin Emami.