LDR    r0, [r4]              // Load 4 pixels A:B:C:D from (x,y)
LDR    r1, [r5]              // Load 4 pixels E:F:G:H from (x,y+1)
UHADD8 r2, r0, r1            // Add pixels A:B:C:D with pixels E:F:G:H and divide each pixel by 2.
UXTB   r3, r2                // Set r3 = (D+H)/2
UXTAB  r3, r3, r2, ROR #8    // Set r3 = r3 + (C+G)/2
UXTAB  r3, r3, r2, ROR #16   // Set r3 = r3 + (B+F)/2
UXTAB  r3, r3, r2, ROR #24   // Set r3 = r3 + (A+E)/2
                             // r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
                             // which is (A+B+C+D + E+F+G+H) / 2
LSR    r3, r3, #2            // Set r3 = average of 8 pixels A to H
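For anyone who wants to check the arithmetic off-target, here is a rough C model of the sequence above. It is only a sketch: `uhadd8` and `average8` are names made up for this illustration, and the lane loop imitates what UHADD8 does per byte rather than being how you would write it for speed.

```c
#include <stdint.h>

/* Illustrative model of UHADD8: unsigned halving add on each of the four
 * byte lanes, i.e. (a + b) >> 1 per lane with no carry between lanes. */
static uint32_t uhadd8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= ((x + y) >> 1) << (8 * lane);
    }
    return result;
}

/* Hypothetical helper: average of eight pixels, four packed in row0
 * and four packed in row1, following the assembly step by step. */
static uint32_t average8(uint32_t row0, uint32_t row1)
{
    uint32_t halves = uhadd8(row0, row1);  /* UHADD8 r2, r0, r1          */
    uint32_t sum = halves & 0xFFu;         /* UXTB  r3, r2               */
    sum += (halves >> 8) & 0xFFu;          /* UXTAB r3, r3, r2, ROR #8   */
    sum += (halves >> 16) & 0xFFu;         /* UXTAB r3, r3, r2, ROR #16  */
    sum += (halves >> 24) & 0xFFu;         /* UXTAB r3, r3, r2, ROR #24  */
    return sum >> 2;                       /* LSR   r3, r3, #2           */
}
```

Note that each halving add truncates, so the result can come out one lower than the exact rounded-down average. For example, `average8(0x02020202u, 0x03010101u)` gives 1, while the eight pixels sum to 16 and their true average is 2.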
If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.
In summary - the buggy implementation needed an extra ',' between the register and the alignment.
vld1.8 {d0}, [r1, :128]
I've just realized that older binutils are buggy and don't parse this correctly. It will be fixed in the upcoming binutils 2.21 release, or you can download the latest sources from www.sourceware.org.
;; consume 256 source image pixels
    VLD1.8    {Q0,Q1},[r1@128]!    ; load 32 from row 0
    VLD1.8    {Q4,Q5},[r2]!        ; load 32 from row 1
    VLD1.8    {Q8,Q9},[r3@64]!     ; load 32 from row 2
    VLD1.8    {Q12,Q13},[r4]!      ; load 32 from row 3
    VLD1.8    {Q2,Q3},[r1@128]!    ; load another 32 from row 0
    VLD1.8    {Q6,Q7},[r2]!        ; load another 32 from row 1
    VLD1.8    {Q10,Q11},[r3@64]!   ; load another 32 from row 2
    VLD1.8    {Q14,Q15},[r4]!      ; load another 32 from row 3
;; now at 256 8-bit values
    VPADDL.u8 Q0,Q0                ; 8 adds
    VPADDL.u8 Q1,Q1                ; 8 adds
    VPADDL.u8 Q2,Q2                ; 8 adds
    VPADDL.u8 Q3,Q3                ; 8 adds
    VPADAL.u8 Q0,Q4                ; 16 adds
    VPADAL.u8 Q1,Q5                ; 16 adds
    VPADAL.u8 Q2,Q6                ; 16 adds
    VPADAL.u8 Q3,Q7                ; 16 adds
    VPADAL.u8 Q0,Q8                ; 16 adds
    VPADAL.u8 Q1,Q9                ; 16 adds
    VPADAL.u8 Q2,Q10               ; 16 adds
    VPADAL.u8 Q3,Q11               ; 16 adds
    VPADAL.u8 Q0,Q12               ; 16 adds
    VPADAL.u8 Q1,Q13               ; 16 adds
    VPADAL.u8 Q2,Q14               ; 16 adds
    VPADAL.u8 Q3,Q15               ; 16 adds
;; now at 32 16-bit values
    VPADD.u16 D0,D0,D1             ; 4 adds (VPADD only takes D registers)
    VPADD.u16 D1,D2,D3             ; 4 adds
    VPADD.u16 D2,D4,D5             ; 4 adds
    VPADD.u16 D3,D6,D7             ; 4 adds
;; now at 16 16-bit values
    VSHRN.u16 D0,Q0,#4             ; 8 divides by 16
    VSHRN.u16 D1,Q1,#4             ; 8 divides by 16
;; now at 16 8-bit values
;; write out 16 destination image pixels
    VST1.8    {Q0},[r0@64]!        ; store 16
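A plain C reference is handy for checking the NEON output pixel by pixel. The following is a minimal sketch of the same 4x4 box average; the function name `shrink4x4` and its parameters are assumptions made for this example, not anything from the code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: shrink an 8-bit greyscale image by 4 in
 * each direction. Every output pixel is the truncated mean of a 4x4 block.
 *   src        : source pixels, row-major
 *   src_stride : bytes per source row
 *   dst        : output pixels, dst_w bytes per row
 *   dst_w/h    : output dimensions (source must be at least 4x as large) */
static void shrink4x4(const uint8_t *src, size_t src_stride,
                      uint8_t *dst, size_t dst_w, size_t dst_h)
{
    for (size_t y = 0; y < dst_h; y++) {
        for (size_t x = 0; x < dst_w; x++) {
            unsigned sum = 0;  /* at most 16 * 255 = 4080, fits in 16 bits */
            for (size_t dy = 0; dy < 4; dy++) {
                for (size_t dx = 0; dx < 4; dx++) {
                    sum += src[(4 * y + dy) * src_stride + 4 * x + dx];
                }
            }
            dst[y * dst_w + x] = (uint8_t)(sum >> 4);  /* divide by 16 */
        }
    }
}
```

The sum of 16 pixels never exceeds 4080, which is why 16-bit accumulators are enough in the NEON version and the narrowing shift by 4 cannot overflow.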
I can't figure out how to specify the NEON data alignment
And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?
;; now at 256 8-bit values
    ...
    VPADDL.u8 Q3,Q3                ; 8 adds
    PLD       [r1,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q7                ; 16 adds
    PLD       [r2,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q11               ; 16 adds
    PLD       [r3,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q15               ; 16 adds
    PLD       [r4,#((4*640)-256)]
;; now at 32 16-bit values
;; Average square of four pixels to single pixel.
;; Produces NxM pixel image from 2Nx2M pixel image.
;; Generates 16 output pixels per loop.
;; May over-read by up to 63 bytes.
;; May over-write by up to 15 bytes.
;; r0 = Input line start address
;; r1 = Input line width in bytes
;; r2 = Input line total size in bytes
;; r3 = Output line start address
quad FUNC
    ;; Compute start of second line and end address
    ADD       r1,r1,r0
    ADD       r2,r2,r1
1
    ;; Load 32 pixels from each of two rows
    VLD1.8    {Q0,Q1},[r0]!
    VLD1.8    {Q2,Q3},[r1]!
    ;; Sum neighbouring 8-bits in each row to 16-bits
    VPADDL.U8 Q0,Q0
    VPADDL.U8 Q1,Q1
    VPADDL.U8 Q2,Q2
    VPADDL.U8 Q3,Q3
    ;; Sum 16-bit values vertically
    VADD.U16  Q0,Q0,Q2
    VADD.U16  Q1,Q1,Q3
    ;; Divide each sum of four pixels by 4 and cast to char
    VSHRN.U16 D0,Q0,#2
    VSHRN.U16 D1,Q1,#2
    ;; Store 16 pixels of resized image
    VST1.8    {Q0},[r3]!
    ;; Loop if not past end of image
    CMP       r1,r2
    BLE       %b1
    ;; Return from function
    BX        lr
    ENDFUNC
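To make the three NEON steps easier to follow, here is a scalar C sketch of what happens to one pair of rows. The name `halve_row_pair` and its arguments are invented for this illustration; each line is annotated with the instruction it roughly corresponds to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the loop body: average each 2x2 square of
 * pixels taken from two adjacent rows into a single output pixel.
 *   row0, row1 : the two source rows, each at least 2 * out_w bytes
 *   out        : destination row, out_w bytes */
static void halve_row_pair(const uint8_t *row0, const uint8_t *row1,
                           uint8_t *out, size_t out_w)
{
    for (size_t i = 0; i < out_w; i++) {
        /* VPADDL.U8: add horizontal neighbours, widening to 16 bits */
        uint16_t top    = (uint16_t)(row0[2 * i] + row0[2 * i + 1]);
        uint16_t bottom = (uint16_t)(row1[2 * i] + row1[2 * i + 1]);
        /* VADD.U16: add the two rows vertically (max 4 * 255 = 1020) */
        uint16_t sum = (uint16_t)(top + bottom);
        /* VSHRN.U16 #2: divide by 4 and narrow back to 8 bits */
        out[i] = (uint8_t)(sum >> 2);
    }
}
```

Unlike the NEON routine, this model reads and writes exactly the bytes it needs, so it can also be used to check that the over-read and over-write in the assembly version stay within the stated limits.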
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.
MOV    r0, #0
LDR    r1, [r2]!
USADA8 r3, r0, r1
MOV    r0, #0
LDR    r1, [r12]!
LDR    r2, [r12]!
LDR    r3, [r12]!
LDR    r4, [r12]!
USADA8 r5, r0, r1
USADA8 r6, r0, r2
USADA8 r7, r0, r3
USADA8 r8, r0, r4
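The trick in these snippets is that a sum of absolute differences against zero is just the sum of the four bytes. Below is a small C sketch of that idea. It models USADA8 in its four-operand form (Rd, Rn, Rm, Ra), where the last operand is the accumulator; the three-operand, non-accumulating instruction is USAD8. The function names `usada8` and `sum4` are made up for this example.

```c
#include <stdint.h>

/* Illustrative model of USADA8 Rd, Rn, Rm, Ra:
 * returns acc + sum over the four byte lanes of |rn[lane] - rm[lane]|. */
static uint32_t usada8(uint32_t rn, uint32_t rm, uint32_t acc)
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (rn >> (8 * lane)) & 0xFFu;
        uint32_t y = (rm >> (8 * lane)) & 0xFFu;
        acc += (x > y) ? (x - y) : (y - x);
    }
    return acc;
}

/* Hypothetical helper: sum of the four bytes packed in one word, obtained
 * by taking absolute differences against zero with a zero accumulator. */
static uint32_t sum4(uint32_t word)
{
    return usada8(0u, word, 0u);
}
```

For a shrink by 4, the accumulator form is the useful one: feeding the running total back in as the last argument adds four more pixels per call, e.g. `total = usada8(0u, next_word, total);`.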
Wow, thanks so much guys, that's exactly what I needed to know! Glad to finally be part of the ARM community :-)
Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!