
How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by a factor of 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.

It's for a Cortex-A8 (ARMv7-A), and the data can be aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 17th September 2010 at http://forums.arm.com

    I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.


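    I think the trick would be to abuse the USAD8 instruction (available in ARM and Thumb-2; USADA8 is its accumulating variant, which takes a fourth operand). It performs a "sum of absolute differences": it subtracts each byte in one word from the corresponding byte in another, and sums the absolute values of the 4 resulting byte differences. If the value you difference against is 0, then each difference _is_ the original byte value, so the result is the sum of the 4 bytes in your vector.

    MOV r0, #0            ; four zero bytes to difference against
    LDR r1, [r2], #4      ; load 4 pixels, then advance the pointer by 4
    USAD8 r3, r0, r1      ; r3 = sum of the 4 bytes in r1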
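For anyone who wants to check the arithmetic off-target, here is a minimal portable C sketch of what the sum-of-absolute-differences trick computes. The names `usad8_model` and `sum4_bytes` are hypothetical helpers made up for this illustration, not ARM intrinsics, and the loop only models the result, not the single-instruction cost.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a byte-wise sum of absolute differences:
 * adds up |rn.byte[i] - rm.byte[i]| over the four byte lanes of a word. */
static uint32_t usad8_model(uint32_t rn, uint32_t rm)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int a = (int)((rn >> (8 * lane)) & 0xFFu);
        int b = (int)((rm >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)abs(a - b);
    }
    return sum;
}

/* Differencing against zero leaves each byte unchanged,
 * so the result is simply the sum of the four packed bytes. */
static uint32_t sum4_bytes(uint32_t packed)
{
    return usad8_model(0u, packed);
}
```

The largest possible result is 4 x 255 = 1020, so the sum always fits comfortably in a 32-bit register.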


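    P.S. On most modern ARM cores you want to put a gap between the load of a register and its first use, so you may want to do something like:

    MOV r0, #0            ; four zero bytes to difference against
    LDR r1, [r12], #4     ; issue all four loads first, advancing the pointer each time
    LDR r2, [r12], #4
    LDR r3, [r12], #4
    LDR r4, [r12], #4
    USAD8 r5, r0, r1      ; each sum lands in its own register
    USAD8 r6, r0, r2
    USAD8 r7, r0, r3
    USAD8 r8, r0, r4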
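To tie this back to the original question, here is a rough C reference for shrinking one row of pixels by 4, which could be used to check the output of an assembly version. The function name `shrink_row_by_4` and its parameters are assumptions for illustration. It truncates the average with a plain right shift, as described in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: average each group of 4 consecutive source pixels
 * into one destination pixel. src must hold at least 4 * dst_len bytes. */
static void shrink_row_by_4(const uint8_t *src, uint8_t *dst, size_t dst_len)
{
    for (size_t i = 0; i < dst_len; ++i) {
        const uint8_t *p = src + 4 * i;
        /* Same value a sum of absolute differences against zero would give. */
        uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
        dst[i] = (uint8_t)(sum >> 2);   /* divide by 4, truncating */
    }
}
```

If rounding to nearest is preferred over truncation, adding 2 to the sum before the shift is one common option.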