How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
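
For reference, the baseline described in the question (widen each byte, add, shift right) can be sketched in C next to one common alternative that adds the bytes in pairs inside a single 32-bit word. This is only an illustrative sketch: the function names are hypothetical, the word is assumed to have been loaded from an aligned address, and nothing here claims that either form is faster on a Cortex-A8.

```c
#include <stdint.h>

/* Baseline from the question: widen each byte to 32 bits, then add.
   Hypothetical helper name, for illustration only. */
static uint32_t sum4_bytes_naive(const uint8_t *p)
{
    return (uint32_t)p[0] + (uint32_t)p[1] + (uint32_t)p[2] + (uint32_t)p[3];
}

/* Alternative: add the four bytes of one 32-bit word in two steps.
   Step 1 forms two 16-bit partial sums (each at most 510) side by side.
   Step 2 adds the two halves (at most 1020, so no overflow). */
static uint32_t sum4_bytes_packed(uint32_t w)
{
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    return (pairs + (pairs >> 16)) & 0xFFFFu;
}

/* Average of the four bytes, truncating (sum / 4). */
static uint32_t avg4_bytes_packed(uint32_t w)
{
    return sum4_bytes_packed(w) >> 2;
}
```

The result of the packed version does not depend on byte order, since it only sums the four bytes of the word.
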
  • Note: This was originally posted on 26th September 2010 at http://forums.arm.com

    Presumably your 4x4/16 inner loop for Neon is something close to:

    ;; consume 256 source image pixels
    VLD1.8 {Q0,Q1},[r1@128]!; load 32 from row 0
    VLD1.8 {Q4,Q5},[r2]! ; load 32 from row 1
    VLD1.8 {Q8,Q9},[r3@64]!; load 32 from row 2
    VLD1.8 {Q12,Q13},[r4]! ; load 32 from row 3
    VLD1.8 {Q2,Q3},[r1@128]!; load another 32 from row 0
    VLD1.8 {Q6,Q7},[r2]! ; load another 32 from row 1
    VLD1.8 {Q10,Q11},[r3@64]!; load another 32 from row 2
    VLD1.8 {Q14,Q15},[r4]! ; load another 32 from row 3

    ;; now at 256 8-bit values

    VPADDL.u8 Q0,Q0; 8 adds
    VPADDL.u8 Q1,Q1; 8 adds
    VPADDL.u8 Q2,Q2; 8 adds
    VPADDL.u8 Q3,Q3; 8 adds

    VPADAL.u8 Q0,Q4; 16 adds
    VPADAL.u8 Q1,Q5; 16 adds
    VPADAL.u8 Q2,Q6; 16 adds
    VPADAL.u8 Q3,Q7; 16 adds

    VPADAL.u8 Q0,Q8; 16 adds
    VPADAL.u8 Q1,Q9; 16 adds
    VPADAL.u8 Q2,Q10; 16 adds
    VPADAL.u8 Q3,Q11; 16 adds

    VPADAL.u8 Q0,Q12; 16 adds
    VPADAL.u8 Q1,Q13; 16 adds
    VPADAL.u8 Q2,Q14; 16 adds
    VPADAL.u8 Q3,Q15; 16 adds

    ;; now at 32 16-bit values

    VPADD.i16 D0,D0,D1; 4 adds (VPADD has no Q-register form on ARMv7)
    VPADD.i16 D1,D2,D3; 4 adds
    VPADD.i16 D2,D4,D5; 4 adds
    VPADD.i16 D3,D6,D7; 4 adds

    ;; now at 16 16-bit values

    VSHRN.u16 D0,Q0,#4; 8 divides by 16
    VSHRN.u16 D1,Q1,#4; 8 divides by 16

    ;; now at 16 8-bit values

    ;; write out 16 destination image pixels
    VST1.8 {Q0},[r0@64]!; store 16


    This pulls in 256 source pixels (filling the entire NEON register file) and emits 16 destination pixels per iteration.

    s.
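
A plain C reference for what the loop above computes can be handy for checking the NEON output: each destination pixel is the truncated average of a 4x4 block of source pixels. This is a minimal sketch, not the poster's code; the function name, the stride parameters, and the assumption that the source is at least 4*dst_w wide and 4*dst_h tall are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: shrink an 8-bit greyscale image by 4 in
   each dimension. Strides are in bytes. The source is assumed to hold at
   least 4*dst_w pixels per row and 4*dst_h rows. */
void shrink4x4_ref(const uint8_t *src, size_t src_stride,
                   uint8_t *dst, size_t dst_stride,
                   size_t dst_w, size_t dst_h)
{
    for (size_t y = 0; y < dst_h; ++y) {
        for (size_t x = 0; x < dst_w; ++x) {
            /* Sum of 16 pixels is at most 16 * 255 = 4080, which is why
               16-bit accumulators are enough in the NEON version. */
            unsigned sum = 0;
            for (size_t dy = 0; dy < 4; ++dy) {
                const uint8_t *row = src + (4 * y + dy) * src_stride + 4 * x;
                sum += (unsigned)row[0] + row[1] + row[2] + row[3];
            }
            /* Divide by 16 with truncation, matching a shift right by 4. */
            dst[y * dst_stride + x] = (uint8_t)(sum >> 4);
        }
    }
}
```

Like the narrowing shift in the assembly, this truncates rather than rounds, so the two outputs should be comparable byte for byte.
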