How to efficiently sum 4 x 8-bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).
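For comparison, here is a minimal C sketch of one way to sum four consecutive bytes without widening each byte to its own 32-bit int: load them as a single 32-bit word and add the bytes in parallel with masks. The function names `sum4_bytes` and `avg4_bytes` are made up for illustration and are not from the thread; whether this beats the straightforward approach on a Cortex-A8 would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum the four bytes packed in one 32-bit word.
   The result is at most 4 * 255 = 1020, so it fits easily in 16 bits. */
uint32_t sum4_bytes(uint32_t w)
{
    /* Add bytes 0+1 and bytes 2+3 in parallel: two 16-bit partial sums,
       each at most 510, so they cannot overflow into each other. */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    /* Add the two 16-bit partial sums together. */
    return (pairs & 0xFFFFu) + (pairs >> 16);
}

/* Average of the four consecutive bytes at p, rounded down. */
uint8_t avg4_bytes(const uint8_t *p)
{
    uint32_t w;
    assert(p != NULL);
    memcpy(&w, p, sizeof w);   /* alignment-safe 32-bit load */
    return (uint8_t)(sum4_bytes(w) >> 2);
}
```

Because addition is order-independent, the result does not depend on the byte order of the 32-bit load.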

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
  • Note: This was originally posted on 17th September 2010 at http://forums.arm.com

    Assuming your four pixels are in a square, i.e.:

    +---+---+
    | A | B |
    +---+---+   --->   (A+B+C+D)/4
    | C | D |
    +---+---+

    Then the following [untested] NEON code might not be too far off optimal.

    ;; Average square of four pixels to single pixel.
    ;; Produces NxM pixel image from 2Nx2M pixel image.
    ;; Generates 16 output pixels per loop.
    ;; May over-read by up to 63 bytes.
    ;; May over-write by up to 15 bytes.

    ;; r0 = Input line start address
    ;; r1 = Input line width in bytes
    ;; r2 = Input line total size in bytes
    ;; r3 = Output line start address

    quad FUNC
    ;; Compute start of second line and end address
    ADD  r1,r1,r0
    ADD  r2,r2,r1

    1
    ;; Load 32 pixels from each of two rows
    VLD1.8  {Q0,Q1},[r0]!
    VLD1.8  {Q2,Q3},[r1]!

    ;; Sum neighbouring 8-bits in each row to 16-bits
    VPADDL.U8 Q0,Q0
    VPADDL.U8 Q1,Q1
    VPADDL.U8 Q2,Q2
    VPADDL.U8 Q3,Q3

    ;; Sum 16-bit values vertically (integer add is written with the .I16 type)
    VADD.I16 Q0,Q0,Q2
    VADD.I16 Q1,Q1,Q3

    ;; Divide each sum of four pixels by 4 and narrow to 8 bits
    ;; (VSHRN takes the .I16 type; it truncates rather than rounds)
    VSHRN.I16 D0,Q0,#2
    VSHRN.I16 D1,Q1,#2

    ;; Store 16 pixels of resized image
    VST1.8  {Q0},[r3]!

    ;; Loop while the second-row pointer is below the end address
    ;; (addresses are unsigned, so use an unsigned compare)
    CMP  r1,r2
    BLO  %b1

    ;; Return from function
    BX  lr
    ENDFUNC
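
Since the assembly above is untested, a plain C reference is handy for checking its output. The following is a minimal scalar sketch of the same 2x2 averaging, assuming a tightly packed image (row stride equal to width) with even width and height. The name `shrink2x2_ref` and its parameters are illustrative and not part of the original post. Unlike the NEON loop, it handles all rows and never reads or writes past the buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the 2x2 box average: each output pixel is
   (A + B + C + D) / 4, truncated. src holds src_w x src_h pixels;
   dst receives (src_w / 2) x (src_h / 2) pixels. */
void shrink2x2_ref(const uint8_t *src, size_t src_w, size_t src_h,
                   uint8_t *dst)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);

    for (size_t y = 0; y < src_h; y += 2) {
        const uint8_t *row0 = src + y * src_w;   /* upper row of the pair */
        const uint8_t *row1 = row0 + src_w;      /* lower row of the pair */

        for (size_t x = 0; x < src_w; x += 2) {
            /* Horizontal pairwise sums widened to 16 bits (cf. VPADDL.U8). */
            uint16_t top    = (uint16_t)(row0[x] + row0[x + 1]);
            uint16_t bottom = (uint16_t)(row1[x] + row1[x + 1]);

            /* Vertical add (cf. VADD), then shift right by 2 and
               narrow to 8 bits (cf. VSHRN #2). */
            *dst++ = (uint8_t)((top + bottom) >> 2);
        }
    }
}
```

Comparing this against the NEON routine on a few images whose width is a multiple of 32 would be a reasonable first test, since that avoids the over-read and over-write cases noted in the comments.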


    hth
    s.