How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient way to sum 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.
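For reference, here is a minimal C sketch of the baseline described above (widen each byte, add, shift right by 2), next to one possible alternative that loads the four bytes as a single 32-bit word and adds them with two mask-and-add steps. The function names are hypothetical, and this only illustrates the arithmetic under those assumptions; it is not tuned Cortex-A8 code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: widen each byte to 32 bits, add, then shift right by 2. */
uint8_t avg4_scalar(const uint8_t *p)
{
    uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
    return (uint8_t)(sum >> 2);
}

/* Alternative: the four bytes arrive packed in one 32-bit word.
 * Step 1 adds bytes 1 and 3 to bytes 0 and 2 inside two 16-bit lanes.
 * Step 2 adds the two lanes. The result does not depend on byte order. */
uint8_t avg4_word(uint32_t w)
{
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    uint32_t sum = (pairs & 0xFFFFu) + (pairs >> 16);
    return (uint8_t)(sum >> 2);
}

/* Shrink one row horizontally by 4: one output pixel per 4 input pixels.
 * src must hold at least 4 * dst_len bytes. */
void shrink_row_by_4(const uint8_t *src, uint8_t *dst, size_t dst_len)
{
    for (size_t i = 0; i < dst_len; i++) {
        uint32_t w;
        memcpy(&w, src + 4 * i, sizeof w); /* one 32-bit load per output pixel */
        dst[i] = avg4_word(w);
    }
}
```

The word-based version trades four byte loads for one word load plus a few ALU operations; whether that is actually faster on a given core would need to be measured.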

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

Shervin Emami over 12 years ago

Note: This was originally posted on 27th September 2010 at http://forums.arm.com

Actually I just used VPADDL instead of VPADAL, so your code should run even faster than mine :-) But there is one other important difference: my code doesn't specify the data alignment (such as @64 and @128), because I'm using the default assembler in Xcode (GCC 4.2 -assembler-as-cpp), and I can't figure out how to specify the NEON data alignment. Maybe I should be using NASM or something instead of GCC to assemble my code...

And one question about your code: you specify @128 alignment for 2 of your instructions and @64 for the other 2 loads and the store. The timing diagram says that @64 is the maximum alignment VLD1.8 can take advantage of, so is there a reason you wrote @128 for some of your instructions and not others?
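To make the VPADDL idea mentioned above concrete, here is a portable C model of what two chained pairwise-add-long steps compute on one 64-bit D register: 8 bytes in, two sums of 4 consecutive bytes out. It is only a sketch of the arithmetic with made-up function names, not the NEON code discussed in this thread.

```c
#include <stdint.h>

/* Models VPADDL.U8 on a D register: adjacent byte pairs are added and
 * each result is widened to 16 bits (8 x u8 -> 4 x u16). */
void pairwise_add_long_u8(const uint8_t in[8], uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(in[2 * i] + in[2 * i + 1]);
}

/* Models VPADDL.U16 on a D register: adjacent 16-bit pairs are added and
 * each result is widened to 32 bits (4 x u16 -> 2 x u32). */
void pairwise_add_long_u16(const uint16_t in[4], uint32_t out[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
}

/* Two pairwise steps: sums[0] = px[0..3], sums[1] = px[4..7].
 * VPADAL would additionally add each result into the existing
 * destination value instead of overwriting it. */
void sum4_two_groups(const uint8_t px[8], uint32_t sums[2])
{
    uint16_t tmp[4];
    pairwise_add_long_u8(px, tmp);
    pairwise_add_long_u16(tmp, sums);
}
```

Shifting each sum right by 2 then gives the average of each group of 4 pixels.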