How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 21st September 2010 at http://forums.arm.com

    Wow thanks so much guys, that's exactly what I needed to know! I still haven't learnt enough of NEON to have used it and I totally overlooked USADA8 for this operation, so until now I just came up with something like this:

    LDR r0, [r4]  // Load 4 pixels A:B:C:D from (x,y)
    LDR r1, [r5]  // Load 4 pixels E:F:G:H from (x,y+1)
    UHADD8 r2, r0, r1  // Add pixels A:B:C:D with pixels E:F:G:H and divide each pixel by 2.
    UXTB r3, r2   // Set r3 = (D+H)/2
    UXTAB r3, r3, r2, ROR #8 // Set r3 = r3 + (C+G)/2
    UXTAB r3, r3, r2, ROR #16 // Set r3 = r3 + (B+F)/2
    UXTAB r3, r3, r2, ROR #24 // Set r3 = r3 + (A+E)/2
    // r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
    // which is (A+B+C+D + E+F+G+H) / 2
    LSR  r3, r3, 2 // Set r3 = average of 8 pixels A to H
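
One detail of the sequence above is that UHADD8 truncates in every byte lane, so the result can be slightly lower than the exact average of the eight pixels. The following is a minimal scalar sketch in C that models this; the function names are hypothetical and the lane semantics are written from the documented behaviour of UHADD8 (per-byte unsigned halving add), not taken from the original post.

```c
#include <stdint.h>

/* Scalar model of UHADD8: per-byte unsigned halving add, (a + b) >> 1 in each lane. */
static uint32_t uhadd8(uint32_t x, uint32_t y)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (x >> (8 * lane)) & 0xFFu;
        uint32_t b = (y >> (8 * lane)) & 0xFFu;
        out |= ((a + b) >> 1) << (8 * lane);
    }
    return out;
}

/* Sum of the four bytes of a word (what the UXTB/UXTAB chain computes). */
static uint32_t byte_sum(uint32_t w)
{
    return (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
}

/* Average as computed by the posted sequence: halve per lane, sum, shift right by 2. */
static uint32_t avg8_halving(uint32_t row0, uint32_t row1)
{
    return byte_sum(uhadd8(row0, row1)) >> 2;
}

/* Exact truncated average of all eight bytes, for comparison. */
static uint32_t avg8_exact(uint32_t row0, uint32_t row1)
{
    return (byte_sum(row0) + byte_sum(row1)) >> 3;
}
```

For example, with the bytes 3, 3, 5, 5 in one row and zeros in the other, the halving version yields 1 while the exact average is 2. For image shrinking this small bias is often acceptable, but it is worth knowing about.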


    So obviously your 2 solutions are much better! I used to be an Assembly programmer about 10 years ago for Intel 16bit and 32bit CPUs but I only just started learning ARM last week, and now I finally see why so many people used to say that ARM RISC is better than Intel CISC! NEON seems really powerful, and I'm amazed that Thumb-2 can fit something like USADA8 in just 16-bits!

    Glad to finally be part of the ARM community :-)

    Cheers,
    Shervin Emami
    http://www.shervinemami.co.cc/
  • Note: This was originally posted on 25th September 2010 at http://forums.arm.com

    Hi guys,

    After spending about 1 week to optimise my 2 resizing functions and learn NEON and create NEON versions, I have some timing results that you guys might be interested in knowing. As part of my project I made separate functions to divide the number of pixels by 4 (half width and half height) or to divide the number of pixels by 16 (quarter width and quarter height), where the resizing is done by adding all the 2x2 or 4x4 pixel values and dividing by 4 or 16.
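
For reference, the resizing described here (summing each 2x2 or 4x4 block and dividing by 4 or 16) can be sketched in plain C as below. This is only an illustrative baseline, not the poster's code: the name `box_shrink` and its parameters are hypothetical, and the image dimensions are assumed to be multiples of the shrink factor.

```c
#include <stddef.h>
#include <stdint.h>

/* Shrink an 8-bit greyscale image by `factor` in each direction (2 or 4 in
 * the post) by averaging each factor x factor block of source pixels.
 * Assumes src_w and src_h are multiples of `factor`. */
static void box_shrink(const uint8_t *src, size_t src_w, size_t src_h,
                       uint8_t *dst, unsigned factor)
{
    size_t dst_w = src_w / factor;
    size_t dst_h = src_h / factor;
    for (size_t y = 0; y < dst_h; y++) {
        for (size_t x = 0; x < dst_w; x++) {
            unsigned sum = 0;
            /* Accumulate every source pixel inside the current block. */
            for (unsigned dy = 0; dy < factor; dy++)
                for (unsigned dx = 0; dx < factor; dx++)
                    sum += src[(y * factor + dy) * src_w + (x * factor + dx)];
            dst[y * dst_w + x] = (uint8_t)(sum / (factor * factor));
        }
    }
}
```

Calling it with `factor` 2 quarters the pixel count; `factor` 4 divides it by 16.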

    To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):

    Resize function in OpenCV C library (GCC4.2 with -O3) takes about 38 msec
    My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
    My hand-optimised ARMv7 Assembly code takes about 7.8 msec
    My hand-optimised NEON Assembly code takes about 0.9 msec


    So I'm very happy to see the resizing function that was the bottleneck of my program is now about 42 times faster (38 msec down to 0.9 msec)! Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!

    Cheers,
    Shervin Emami.
    http://www.shervinemami.co.cc/
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    Actually I just used VPADDL's instead of VPADAL, so your code should run even faster than mine :-) But there is one other important difference: My code doesn't specify the data alignment (such as @64 and @128), because I'm using the default assembler in XCode (gcc-4.2 -x assembler-with-cpp), and I can't figure out how to specify the NEON data alignment. Maybe I should be using NASM or something instead of GCC to assemble my code...

    And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.

    It's funny, I was thinking of asking you about memory preloading but I thought I had already asked too much of your time as it is :-) From the few message posts I've read about NEON optimisation (I think mainly in the FFmpeg msg boards), they say that memory preloading involves some trial & error to get the right values in the right places?

    I tried aligning in GCC using:
         VLD1.u8 {q0}, [r0:128]!
    but it still gives an error, and I tried every keyboard symbol in place of @ but it still won't work. I'll try using NASM instead.
    Anyway I still don't understand why you aligned some to @128 and some to @64 and some to nothing. Wouldn't it work better if all 8 loads & the store use align (such as @64 on everything if it's a 480 pixel wide image or @128 if it's a 640 pixel wide image)?

    Thanks a lot for your help! I'm still contemplating whether to attempt a generic image resizing function (from any size to any size) using NEON or whether it would be too difficult to take advantage of SIMD for that type of operation.

    Cheers,
    Shervin Emami.
    http://www.shervinemami.co.cc/
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    I just installed NASM only to discover that it doesn't support ARM. So I guess I'll stick with the XCode default "gcc-4.2 -x assembler-with-cpp" and not have NEON alignment. And I finally figured out why you only gave alignment on some rows and not others: because you thought I had a 480 pixel wide image and since 480 is not divisible by 64 or 128, only some rows would have good alignment.

    Like I said, thanks a lot both of you for your help with jump starting me on ARM and NEON development.
  • Note: This was originally posted on 3rd October 2010 at http://forums.arm.com

    In summary - the buggy implementation needed an extra ',' between the register and the alignment.

    Yes you are right, it works when I use:
       vld1.8 {d0}, [r1, :128]
    Thanks! I actually posted the issue on the gcc-help mailing list and got a reply from Richard Earnshaw at ARM saying that it is a bug in old versions of the assembler in binutils (not the gcc compiler), and that:

    I've just realized that older binutils are buggy and don't parse this correctly.  It will be fixed in the up-coming binutils 2.21 release, or you can download the latest sources from www.sourceware.org.


    Now I'm ready to start making more optimized functions :-) This is my first time trying to write SIMD code, so I'm wondering, are there any websites or something that show tricks of the trade or useful advice for writing SIMD code by hand? Otherwise I'll just try to figure it out myself based on the ARM + NEON instruction set.

    Cheers,
    Shervin Emami.
  • Note: This was originally posted on 2nd October 2010 at http://forums.arm.com

    thanks guys for this great info here. :)
  • Note: This was originally posted on 15th October 2010 at http://forums.arm.com

    I've been trying to find this info for a long time now lol. Thanks.
  • Note: This was originally posted on 26th September 2010 at http://forums.arm.com

    Presumably your 4x4/16 inner loop for Neon is something close to:

    ;; consume 256 source image pixels
    VLD1.8 {Q0,Q1},[r1@128]!; load 32 from row 0
    VLD1.8 {Q4,Q5},[r2]! ; load 32 from row 1
    VLD1.8 {Q8,Q9},[r3@64]!; load 32 from row 2
    VLD1.8 {Q12,Q13},[r4]! ; load 32 from row 3
    VLD1.8 {Q2,Q3},[r1@128]!; load another 32 from row 0
    VLD1.8 {Q6,Q7},[r2]! ; load another 32 from row 1
    VLD1.8 {Q10,Q11},[r3@64]!; load another 32 from row 2
    VLD1.8 {Q14,Q15},[r4]! ; load another 32 from row 3

    ;; now at 256 8-bit values

    VPADDL.u8 Q0,Q0; 8 adds
    VPADDL.u8 Q1,Q1; 8 adds
    VPADDL.u8 Q2,Q2; 8 adds
    VPADDL.u8 Q3,Q3; 8 adds

    VPADAL.u8 Q0,Q4; 16 adds
    VPADAL.u8 Q1,Q5; 16 adds
    VPADAL.u8 Q2,Q6; 16 adds
    VPADAL.u8 Q3,Q7; 16 adds

    VPADAL.u8 Q0,Q8; 16 adds
    VPADAL.u8 Q1,Q9; 16 adds
    VPADAL.u8 Q2,Q10; 16 adds
    VPADAL.u8 Q3,Q11; 16 adds

    VPADAL.u8 Q0,Q12; 16 adds
    VPADAL.u8 Q1,Q13; 16 adds
    VPADAL.u8 Q2,Q14; 16 adds
    VPADAL.u8 Q3,Q15; 16 adds

    ;; now at 32 16-bit values

    VPADD.u16 D0,D0,D1; 4 adds (VPADD only has a D-register form)
    VPADD.u16 D1,D2,D3; 4 adds
    VPADD.u16 D2,D4,D5; 4 adds
    VPADD.u16 D3,D6,D7; 4 adds

    ;; now at 16 16-bit values

    VSHRN.u16 D0,Q0,#4; 8 divides by 16
    VSHRN.u16 D1,Q1,#4; 8 divides by 16

    ;; now at 16 8-bit values

    ;; write out 16 destination image pixels
    VST1.8 {Q0},[r0@64]!; store 16


    Pulling in 256 pixels (filling the entire Neon register file) and emitting 16 per iteration.
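
To make the data flow of that inner loop easier to follow, here is a scalar C sketch of the same steps on a narrower strip (4 rows of 16 pixels giving 4 output pixels, instead of 4 rows of 64 giving 16). The helper names are hypothetical; each one models the lane behaviour of the instruction named in its comment.

```c
#include <stdint.h>

/* VPADDL.U8 model: add adjacent byte pairs of a 16-byte vector into 8 halfwords. */
static void vpaddl_u8(uint16_t acc[8], const uint8_t in[16])
{
    for (int i = 0; i < 8; i++)
        acc[i] = (uint16_t)(in[2 * i] + in[2 * i + 1]);
}

/* VPADAL.U8 model: the same pairwise add, accumulated onto existing halfwords. */
static void vpadal_u8(uint16_t acc[8], const uint8_t in[16])
{
    for (int i = 0; i < 8; i++)
        acc[i] = (uint16_t)(acc[i] + in[2 * i] + in[2 * i + 1]);
}

/* Average the 4x4 blocks of a 16-column, 4-row strip into 4 output pixels. */
static void shrink4x4_strip(const uint8_t *rows[4], uint8_t out[4])
{
    uint16_t acc[8];
    vpaddl_u8(acc, rows[0]);               /* row 0: horizontal pairs      */
    for (int r = 1; r < 4; r++)
        vpadal_u8(acc, rows[r]);           /* rows 1..3: accumulate pairs  */
    for (int i = 0; i < 4; i++) {
        /* VPADD.U16 step: join two adjacent pair sums into a 4x4 block sum. */
        uint16_t sum16 = (uint16_t)(acc[2 * i] + acc[2 * i + 1]);
        /* VSHRN #4 step: divide by 16 and narrow to 8 bits. */
        out[i] = (uint8_t)(sum16 >> 4);
    }
}
```

The largest possible block sum is 16 x 255 = 4080, so 16-bit accumulators cannot overflow and the shifted result always fits in 8 bits.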

    s.
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    I can't figure out how to specify the NEON data alignment


    I believe GCC uses ":" rather than "@", as "@" is the GCC comment character.

    And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?


    I wasn't assuming any particular processor was in use, I simply provided the largest alignment that could be guaranteed for the given multiple of 480 bytes, assuming the source image started off 128-byte aligned.
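
The arithmetic behind that choice can be sketched in C as follows, assuming a 480-byte row stride, a 128-byte aligned image base, and four row pointers that each step down by four rows per pass. The function names are hypothetical. Note that the NEON alignment qualifier is, as far as I know, expressed in bits rather than bytes (so `@128` asks for 16-byte alignment), which means the byte figures computed here are more conservative than the qualifier strictly requires.

```c
#include <stddef.h>

/* Largest power of two, capped at `cap`, that divides `offset`.
 * An offset of 0 counts as aligned to `cap`. */
static size_t pow2_alignment(size_t offset, size_t cap)
{
    size_t align = 1;
    while (align < cap && offset % (align * 2) == 0)
        align *= 2;
    return align;
}

/* Byte alignment guaranteed for row `row`, row `row + group`, row
 * `row + 2 * group`, and so on, given a base aligned to `cap` bytes
 * and a fixed row stride in bytes. */
static size_t row_alignment(size_t row, size_t group, size_t stride, size_t cap)
{
    size_t a = pow2_alignment(row * stride, cap);
    size_t b = pow2_alignment(group * stride, cap);
    return a < b ? a : b;
}
```

With a stride of 480 and a group of 4 this gives 128, 32, 64 and 32 bytes for rows 0 to 3, which matches the pattern of `@128`, nothing, `@64`, nothing in the posted loads.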

    hth
    s.
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.
    Something like:

    ;; now at 256 8-bit values

    ...
    VPADDL.u8 Q3,Q3; 8 adds
    PLD  [r1,#((4*640)-256)]

    ...
    VPADAL.u8 Q3,Q7; 16 adds
    PLD  [r2,#((4*640)-256)]

    ...
    VPADAL.u8 Q3,Q11; 16 adds
    PLD  [r3,#((4*640)-256)]

    ...
    VPADAL.u8 Q3,Q15; 16 adds
    PLD  [r4,#((4*640)-256)]

    ;; now at 32 16-bit values
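
The same idea can be tried from C before committing to assembly. The sketch below uses `__builtin_prefetch`, a GCC/Clang builtin that typically compiles to PLD on ARM; the function name and the `ahead` parameter are hypothetical, and the best distance has to be found by measurement on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a row of pixels in 64-byte chunks while hinting the cache about data
 * `ahead` bytes in front of the current read position. */
static uint32_t sum_row_prefetch(const uint8_t *row, size_t n, size_t ahead)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 64) {
        /* Only form addresses that lie inside the buffer. */
        if (i + ahead < n)
            __builtin_prefetch(row + i + ahead);
        size_t end = (i + 64 < n) ? i + 64 : n;
        for (size_t j = i; j < end; j++)
            sum += row[j];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is identical for any value of `ahead`; only the timing changes.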


    s.
  • Note: This was originally posted on 17th September 2010 at http://forums.arm.com

    Assuming your four pixels are in a square, i.e.:

    +---+---+
    | A | B |
    +---+---+   --->   (A+B+C+D)/4
    | C | D |
    +---+---+

    Then the following [untested] Neon code might not be too far off optimal.

    ;; Average square of four pixels to single pixel.
    ;; Produces NxM pixel image from 2Nx2M pixel image.
    ;; Generates 16 output pixels per loop.
    ;; May over-read by up to 63 bytes.
    ;; May over-write by up to 15 bytes.

    ;; r0 = Input line start address
    ;; r1 = Input line width in bytes
    ;; r2 = Input line total size in bytes
    ;; r3 = Output line start address

    quad FUNC
    ;; Compute start of second line and end address
    ADD  r1,r1,r0
    ADD  r2,r2,r1

    1
    ;; Load 32 pixels from each of two rows
    VLD1.8  {Q0,Q1},[r0]!
    VLD1.8  {Q2,Q3},[r1]!

    ;; Sum neighbouring 8-bits in each row to 16-bits
    VPADDL.U8 Q0,Q0
    VPADDL.U8 Q1,Q1
    VPADDL.U8 Q2,Q2
    VPADDL.U8 Q3,Q3

    ;; Sum 16-bit values vertically
    VADD.U16 Q0,Q0,Q2
    VADD.U16 Q1,Q1,Q3

    ;; Divide each sum of four pixels by 4 and cast to char
    VSHRN.U16 D0,Q0,#2
    VSHRN.U16 D1,Q1,#2

    ;; Store 16 pixels of resized image
    VST1.8  {Q0},[r3]!

    ;; Loop while the second-row pointer is still before the end address
    CMP  r1,r2
    BLO  %b1

    ;; Return from function
    BX  lr
    ENDFUNC


    hth
    s.
  • Note: This was originally posted on 17th September 2010 at http://forums.arm.com

    I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.


    The trick would be to abuse the (ARM or Thumb2) USADA8 instruction I think - this performs a "sum of absolute differences" - so it subtracts each byte in one word from another, and sums the absolute value of the 4 resulting byte values. If you give the value to difference against as 0, then the difference _is_ the original value in your 4 byte vector.

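    MOV r0, #0
    LDR r1, [r2], #4    ; load 4 bytes and post-increment the pointer
    USAD8 r3, r0, r1    ; r3 = sum of the 4 bytes (USADA8 also needs an accumulator operand)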
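
As a sanity check of the trick, here is a scalar C model of the two instructions. USAD8 is the plain sum of absolute differences; USADA8 is the same operation with an extra accumulator operand added to the result. The function names simply mirror the mnemonics and are otherwise hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of USAD8: sum of absolute differences of the four byte lanes. */
static uint32_t usad8(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int a = (int)((x >> (8 * lane)) & 0xFFu);
        int b = (int)((y >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)abs(a - b);
    }
    return sum;
}

/* Scalar model of USADA8: the same quantity added to an accumulator. */
static uint32_t usada8(uint32_t x, uint32_t y, uint32_t acc)
{
    return acc + usad8(x, y);
}
```

With one operand set to zero, every absolute difference is just the byte itself, so `usad8(0, w)` returns the sum of the four bytes of `w`.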


    P.S. given that you want to put a gap between the load of r1 and the use of it on most modern ARM cores, you may want to do something like:

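    MOV r0, #0
    LDR r1, [r12], #4   ; four loads first, each post-incrementing the pointer
    LDR r2, [r12], #4
    LDR r3, [r12], #4
    LDR r4, [r12], #4
    USAD8 r5, r0, r1    ; then four byte sums, well clear of their loads
    USAD8 r6, r0, r2
    USAD8 r7, r0, r3
    USAD8 r8, r0, r4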
  • Note: This was originally posted on 22nd September 2010 at http://forums.arm.com

    Wow thanks so much guys, thats exactly what I needed to know!
    Glad to finally be part of the ARM community :-)


    No probs; glad to be of help. And welcome =) A good question to ask too - I love answering assembler hacking questions :)

    I've programmed assembler on a couple of register based architectures (ARM and TI DSPs mainly), and I have to say whenever I look at writing x86 CISC assembler I really get put off by it (mostly I just find register based architectures more intuitive). The more recent versions of the ARM architecture are really nice to write algorithms for; a mixture of ARM DSP and SIMD instructions, some of the newer ARM instructions in ARMv7 such as the wide constant loads, and of course NEON, make it really very flexible and a pleasure to write in =)
  • Note: This was originally posted on 27th September 2010 at http://forums.arm.com

    Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!


    Very few compilers generate "weird" instructions - so if you are after anything a little special in the instruction set the odds are you will either need to use intrinsics for that instruction or fall back to assembler.
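
As an illustration of the intrinsics route, the sketch below averages 2x2 blocks using NEON intrinsics from `arm_neon.h` when the compiler targets NEON, and falls back to plain C otherwise. It is a hedged example rather than code from this thread: the function name is hypothetical and the input width is assumed to be a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Average 2x2 blocks of two input rows into one output row.
 * `width` is the input width in pixels, assumed to be a multiple of 16. */
static void shrink2x2_row(const uint8_t *row0, const uint8_t *row1,
                          uint8_t *dst, size_t width)
{
    for (size_t x = 0; x < width; x += 16) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint16x8_t s0 = vpaddlq_u8(vld1q_u8(row0 + x)); /* pairwise add long  */
        uint16x8_t s1 = vpaddlq_u8(vld1q_u8(row1 + x));
        uint16x8_t s  = vaddq_u16(s0, s1);              /* add rows vertically */
        vst1_u8(dst + x / 2, vshrn_n_u16(s, 2));        /* divide by 4, narrow */
#else
        /* Portable fallback producing the same result on non-NEON builds. */
        for (size_t i = 0; i < 8; i++) {
            unsigned sum = row0[x + 2 * i] + row0[x + 2 * i + 1]
                         + row1[x + 2 * i] + row1[x + 2 * i + 1];
            dst[x / 2 + i] = (uint8_t)(sum >> 2);
        }
#endif
    }
}
```

Intrinsics leave register allocation and scheduling to the compiler, so the generated code may still be slower than hand-written assembly, but they do guarantee that the intended instructions are used.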