How to efficiently sum 4 x 8-bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and I can align the data to 32 bytes or whatever I want.
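A minimal sketch in portable C, assuming "shrink by 4" means averaging each run of 4 horizontally adjacent pixels with truncating division. `avg4_scalar` is the widen-then-add baseline described above; `avg4_swar` gets the same sum from a single 32-bit load by adding adjacent byte pairs inside one register. That pairwise idea is the same one NEON exposes as `VPADDL.U8` followed by `VPADDL.U16`. All function names here are hypothetical, and the code is not tuned for Cortex-A8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: widen each byte, add, then divide by 4 (truncating). */
static uint8_t avg4_scalar(const uint8_t *p)
{
    unsigned sum = (unsigned)p[0] + p[1] + p[2] + p[3]; /* at most 1020 */
    return (uint8_t)(sum >> 2);
}

/* Same result from one 32-bit load, adding the bytes in parallel (SWAR). */
static uint8_t avg4_swar(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w); /* byte order does not affect the total */

    /* Add adjacent byte pairs into two 16-bit fields (each at most 510). */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);

    /* Add the two 16-bit fields (at most 1020), then divide by 4. */
    uint32_t sum = (pairs & 0xFFFFu) + (pairs >> 16);
    return (uint8_t)(sum >> 2);
}

/* Shrink one row horizontally by 4: dst[i] = mean of src[4i .. 4i+3]. */
static void shrink_row_by_4(const uint8_t *src, uint8_t *dst, size_t dst_width)
{
    for (size_t i = 0; i < dst_width; i++)
        dst[i] = avg4_swar(src + 4 * i);
}
```

The SWAR version never lets a partial sum spill into the neighbouring field, because two bytes add to at most 510, which fits comfortably in 16 bits.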

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    I can't figure out how to specify the NEON data alignment


    I believe GCC uses ":" rather than "@", as "@" is the GCC comment character.
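The alignment qualifier is given in bits (`:64` = 8 bytes, `:128` = 16 bytes, `:256` = 32 bytes). In GNU as it is commonly written with a colon, e.g. `vld1.8 {d0-d1}, [r0, :128]`, where armasm uses `[r0@128]`. An address that does not meet the stated alignment generates an alignment fault, so a debug build might check pointers before calling the assembly routine. A minimal C sketch; `neon_align_ok` is a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if p satisfies a NEON alignment qualifier given in bits
   (e.g. 64, 128, 256), i.e. the address is a multiple of bits/8 bytes.
   align_bits must be a nonzero multiple of 8. */
static bool neon_align_ok(const void *p, unsigned align_bits)
{
    uintptr_t bytes = align_bits / 8;
    return ((uintptr_t)p % bytes) == 0;
}
```

For example, `assert(neon_align_ok(src, 128));` just before the call states the same promise that the `:128` qualifier makes to the hardware.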

    And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?


    I wasn't assuming any particular processor was in use; I simply provided the largest alignment that could be guaranteed for the given multiple of 480 bytes, assuming the source image started off 128-byte aligned.
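The reasoning in the reply can be written as a small calculation: if every row starts at `base + k * stride`, the alignment that holds for all k is the base alignment capped by the largest power of two dividing the stride (480 = 32 * 15, so 32 bytes). The qualifier you can actually write is further limited by what the particular VLD1/VST1 form accepts. A minimal C sketch; the function names are hypothetical:

```c
#include <stddef.h>

/* Largest power of two that divides n (n must be > 0), e.g. 480 -> 32. */
static size_t lowest_pow2_factor(size_t n)
{
    return n & (~n + 1); /* isolate the lowest set bit */
}

/* Alignment in bytes guaranteed for every address base + k * stride,
   given that base is aligned to base_align bytes (a power of two)
   and stride > 0. */
static size_t guaranteed_align(size_t base_align, size_t stride)
{
    size_t s = lowest_pow2_factor(stride);
    return s < base_align ? s : base_align;
}
```

Multiply the result by 8 to get the qualifier in bits: `guaranteed_align(16, 480) * 8` is 128. With a hypothetical 120-byte output row (480 / 4), `guaranteed_align(128, 120)` is only 8 bytes, i.e. `:64`. That is one possible reason loads and stores in the same routine could carry different qualifiers; the code under discussion is not shown here, so this is only an illustration.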

    hth
    s.