How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

Peter Harris over 12 years ago

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

> I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.

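The trick, I think, would be to abuse the (ARM or Thumb-2) USAD8 instruction. It performs an unsigned "sum of absolute differences": it subtracts each byte in one word from the corresponding byte in another, and sums the absolute values of the 4 resulting differences. If the value you difference against is 0, then each difference _is_ the original byte, so the result is the sum of the 4 bytes in your word. (USADA8 is the accumulating variant; it takes a fourth register and adds it to the result.)

    MOV   r0, #0            @ r0 = 0, the value to difference against
    LDR   r1, [r2], #4      @ load 4 pixels, then advance the pointer
    USAD8 r3, r1, r0        @ r3 = sum of the 4 bytes in r1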
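
For anyone who wants to check the arithmetic off-target, here is a minimal C sketch of what the sum-of-absolute-differences instruction computes when one operand is zero. The names `usad8_model` and `sum4_bytes` are made up for this illustration; it is a portable model of the byte-lane arithmetic only, not ARM's definition and not a replacement for the instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical portable model of USAD8 Rd, Rn, Rm:
 * the sum of |Rn.byte[i] - Rm.byte[i]| over the four byte lanes. */
static uint32_t usad8_model(uint32_t rn, uint32_t rm)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int a = (int)((rn >> (8 * lane)) & 0xFF);
        int b = (int)((rm >> (8 * lane)) & 0xFF);
        sum += (uint32_t)abs(a - b);
    }
    return sum;
}

/* Differencing against zero leaves each byte unchanged,
 * so the result is simply the sum of the four bytes. */
static uint32_t sum4_bytes(uint32_t word)
{
    uint32_t sum = usad8_model(word, 0);
    assert(sum <= 4u * 255u); /* four bytes can never sum past 1020 */
    return sum;
}
```

Because the maximum result is 1020, the sum always fits in the destination register with plenty of headroom, so several rows could be accumulated before dividing.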

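P.S. On most modern ARM cores a loaded value is not available to the very next instruction without a stall, so you want a gap between each load and its first use. You may want to do something like:

    MOV   r0, #0            @ r0 = 0, the value to difference against
    LDR   r1, [r12], #4     @ issue all four loads first...
    LDR   r2, [r12], #4
    LDR   r3, [r12], #4
    LDR   r4, [r12], #4
    USAD8 r5, r1, r0        @ ...then sum; USAD8 is the 3-operand form of USADA8
    USAD8 r6, r2, r0
    USAD8 r7, r3, r0
    USAD8 r8, r4, r0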
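
To tie this back to the original question, the following C sketch shrinks one row horizontally by averaging each group of 4 consecutive pixels (sum, then shift right by 2, which truncates rather than rounds). It assumes "shrink by 4" means 4 pixels in, 1 pixel out along a row, and that the source row holds at least 4 * dst_width bytes. `shrink_row_by_4` and `sum4_bytes_portable` are hypothetical names; on the Cortex-A8 the inner sum would be the single instruction shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the one-instruction byte sum. */
static uint32_t sum4_bytes_portable(uint32_t word)
{
    return (word & 0xFF) + ((word >> 8) & 0xFF) +
           ((word >> 16) & 0xFF) + (word >> 24);
}

/* Hypothetical helper: average each group of 4 source pixels
 * into one destination pixel. Assumes src has 4 * dst_width bytes. */
static void shrink_row_by_4(const uint8_t *src, uint8_t *dst, size_t dst_width)
{
    assert(src != NULL && dst != NULL);
    for (size_t x = 0; x < dst_width; ++x) {
        uint32_t word;
        /* memcpy sidesteps alignment and aliasing issues; byte order
         * does not matter because addition is commutative. */
        memcpy(&word, src + 4 * x, sizeof word);
        dst[x] = (uint8_t)(sum4_bytes_portable(word) >> 2);
    }
}
```

For example, a row of pixels 0, 4, 8, 12 followed by four pixels of 255 shrinks to the two pixels 6 and 255.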