How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
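
For reference, here is a minimal C sketch of the baseline described above (widen each byte, add, shift right by 2), next to one possible alternative that loads the 4 bytes as a single 32-bit word and adds the byte lanes with masks. The function names are made up for illustration, and "shrink by 4" is assumed to mean averaging 4 horizontally adjacent pixels. Whether the word-based version is actually faster on a Cortex-A8 would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: widen each byte to 32 bits, add, then shift right by 2. */
static uint8_t avg4_bytes(const uint8_t *p)
{
    uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
    return (uint8_t)(sum >> 2);            /* truncating average */
}

/* Alternative: one 32-bit load, then add the four byte lanes in two steps.
 * The result does not depend on byte order because addition commutes. */
static uint8_t avg4_word(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);               /* usually compiles to one load */

    /* Low half holds byte0 + byte1, high half holds byte2 + byte3.
     * Each partial sum is at most 510, so the halves cannot overflow. */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    uint32_t sum = (pairs & 0xFFFFu) + (pairs >> 16);
    return (uint8_t)(sum >> 2);
}

/* Hypothetical helper: shrink one row horizontally by 4.
 * src_width must be a multiple of 4; dst needs src_width / 4 bytes. */
static void shrink_row_by_4(const uint8_t *src, uint8_t *dst, size_t src_width)
{
    assert(src_width % 4 == 0);
    for (size_t i = 0; i < src_width / 4; ++i)
        dst[i] = avg4_word(src + 4 * i);
}
```

Both helpers truncate the average; adding 2 to the sum before the shift would round to nearest instead.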

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!


    Very few compilers generate "weird" instructions - so if you are after anything a little special in the instruction set the odds are you will either need to use intrinsics for that instruction or fall back to assembler.
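
As a rough sketch of the intrinsics route, the following C function averages each group of 4 consecutive bytes using NEON intrinsics from `arm_neon.h`. NEON does operate on 8-bit lanes, and the pairwise add-long instructions widen as they add. The function name `avg4_row`, the multiple-of-32 length requirement, and the scalar fallback for non-NEON builds are assumptions made for this example, not something from the thread, and it has not been tuned for the Cortex-A8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Average each group of 4 consecutive bytes.
 * n_src must be a multiple of 32; dst needs n_src / 4 bytes. */
static void avg4_row(const uint8_t *src, uint8_t *dst, size_t n_src)
{
    assert(n_src % 32 == 0);
    for (size_t i = 0; i < n_src; i += 32) {
#if HAVE_NEON
        /* 16 bytes -> 8 sums of adjacent pairs, widened to 16 bits. */
        uint16x8_t p0 = vpaddlq_u8(vld1q_u8(src + i));
        uint16x8_t p1 = vpaddlq_u8(vld1q_u8(src + i + 16));

        /* Add adjacent pair sums -> 4 sums of 4 bytes each (max 1020). */
        uint16x4_t s0 = vpadd_u16(vget_low_u16(p0), vget_high_u16(p0));
        uint16x4_t s1 = vpadd_u16(vget_low_u16(p1), vget_high_u16(p1));

        /* Shift right by 2 and narrow to 8 bits, then store 8 results. */
        vst1_u8(dst + i / 4, vshrn_n_u16(vcombine_u16(s0, s1), 2));
#else
        /* Scalar fallback so the sketch also builds without NEON. */
        for (size_t j = 0; j < 32; j += 4) {
            const uint8_t *p = src + i + j;
            uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
            dst[(i + j) / 4] = (uint8_t)(sum >> 2);
        }
#endif
    }
}
```

With GCC on ARMv7 this would typically be built with `-mfpu=neon`. Each loop iteration consumes 32 input bytes and produces 8 output bytes.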