    LDR    r0, [r4]                 // Load 4 pixels A:B:C:D from (x,y)
    LDR    r1, [r5]                 // Load 4 pixels E:F:G:H from (x,y+1)
    UHADD8 r2, r0, r1               // Add pixels A:B:C:D with pixels E:F:G:H and divide each pixel by 2.
    UXTB   r3, r2                   // Set r3 = (D+H)/2
    UXTAB  r3, r3, r2, ROR #8       // Set r3 = r3 + (C+G)/2
    UXTAB  r3, r3, r2, ROR #16      // Set r3 = r3 + (B+F)/2
    UXTAB  r3, r3, r2, ROR #24      // Set r3 = r3 + (A+E)/2
    // r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
    // which is (A+B+C+D + E+F+G+H) / 2
    LSR    r3, r3, #2               // Set r3 = average of 8 pixels A to H
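For anyone who wants to check the UHADD8/UXTAB sequence above against plain C, here is a minimal sketch that models it bit for bit. The function names `uhadd8` and `average8` are made up for illustration. Because each byte lane is halved before the lanes are summed, the result can come out one lower than the exact average of the 8 pixels.

```c
#include <stdint.h>

/* Per-byte halving add, modelling UHADD8:
   each byte lane becomes floor((a + b) / 2). */
static uint32_t uhadd8(uint32_t a, uint32_t b)
{
    /* (a & b) keeps the bits both operands share; half of the differing
       bits is added on top. The 0xFEFEFEFE mask stops a shifted-out bit
       from leaking into the lane below. */
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Average of the 8 pixels packed in two 32-bit words (4 pixels each). */
static uint32_t average8(uint32_t row0, uint32_t row1)
{
    uint32_t h = uhadd8(row0, row1);       /* four per-column half sums  */
    uint32_t sum = (h & 0xFFu)             /* UXTB                       */
                 + ((h >> 8) & 0xFFu)      /* UXTAB ..., ROR #8          */
                 + ((h >> 16) & 0xFFu)     /* UXTAB ..., ROR #16         */
                 + ((h >> 24) & 0xFFu);    /* UXTAB ..., ROR #24         */
    return sum >> 2;                       /* LSR #2                     */
}
```

For example, `average8(0x0A141E28, 0x32323232)` averages the pixels 10, 20, 30, 40 and four pixels of value 50, and returns 37.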
If you're down to final tweaking, it might be worth experimenting with preloading ahead in the source image.
In summary: the buggy implementation needed an extra ',' between the register and the alignment.
vld1.8 {d0}, [r1, :128]
I've just realized that older binutils are buggy and don't parse this correctly. It will be fixed in the upcoming binutils 2.21 release, or you can download the latest sources from www.sourceware.org.
;; consume 256 source image pixels
    VLD1.8 {Q0,Q1},[r1@128]!    ; load 32 from row 0
    VLD1.8 {Q4,Q5},[r2]!        ; load 32 from row 1
    VLD1.8 {Q8,Q9},[r3@64]!     ; load 32 from row 2
    VLD1.8 {Q12,Q13},[r4]!      ; load 32 from row 3
    VLD1.8 {Q2,Q3},[r1@128]!    ; load another 32 from row 0
    VLD1.8 {Q6,Q7},[r2]!        ; load another 32 from row 1
    VLD1.8 {Q10,Q11},[r3@64]!   ; load another 32 from row 2
    VLD1.8 {Q14,Q15},[r4]!      ; load another 32 from row 3
;; now at 256 8-bit values
    VPADDL.u8 Q0,Q0             ; 8 adds
    VPADDL.u8 Q1,Q1             ; 8 adds
    VPADDL.u8 Q2,Q2             ; 8 adds
    VPADDL.u8 Q3,Q3             ; 8 adds
    VPADAL.u8 Q0,Q4             ; 16 adds
    VPADAL.u8 Q1,Q5             ; 16 adds
    VPADAL.u8 Q2,Q6             ; 16 adds
    VPADAL.u8 Q3,Q7             ; 16 adds
    VPADAL.u8 Q0,Q8             ; 16 adds
    VPADAL.u8 Q1,Q9             ; 16 adds
    VPADAL.u8 Q2,Q10            ; 16 adds
    VPADAL.u8 Q3,Q11            ; 16 adds
    VPADAL.u8 Q0,Q12            ; 16 adds
    VPADAL.u8 Q1,Q13            ; 16 adds
    VPADAL.u8 Q2,Q14            ; 16 adds
    VPADAL.u8 Q3,Q15            ; 16 adds
;; now at 32 16-bit values
    VPADD.u16 Q0,Q0,Q1          ; 8 adds
    VPADD.u16 Q1,Q2,Q3          ; 8 adds
;; now at 16 16-bit values
    VSHRN.u16 D0,Q0,#4          ; 8 divides by 16
    VSHRN.u16 D1,Q1,#4          ; 8 divides by 16
;; now at 16 8-bit values
;; write out 16 destination image pixels
    VST1.8 {Q0},[r0@64]!        ; store 16
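As a cross-check for the NEON listing above, here is a scalar C sketch of the same arithmetic: pairwise sums widened to 16 bits and accumulated over four rows, neighbouring accumulators added, then a shift right by 4. The name `shrink4_strip` and its parameters are assumptions for illustration, not part of the posted code, and this version handles any width that is a multiple of 4 rather than a fixed 64 pixels per pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shrink a strip of four source rows by 4 in each direction.
   src points at row 0; the rows are `stride` bytes apart.
   `width` is the number of source pixels per row (multiple of 4).
   Writes width / 4 pixels to dst. */
static void shrink4_strip(const uint8_t *src, size_t stride,
                          size_t width, uint8_t *dst)
{
    assert(width % 4 == 0);
    for (size_t x = 0; x < width; x += 4) {
        uint16_t pair0 = 0;   /* columns x, x+1 over four rows   */
        uint16_t pair1 = 0;   /* columns x+2, x+3 over four rows */
        for (size_t r = 0; r < 4; r++) {
            const uint8_t *row = src + r * stride;
            /* VPADDL (row 0) / VPADAL (rows 1..3): widening pairwise add */
            pair0 = (uint16_t)(pair0 + row[x] + row[x + 1]);
            pair1 = (uint16_t)(pair1 + row[x + 2] + row[x + 3]);
        }
        /* VPADD: add neighbouring accumulators (at most 16 * 255 = 4080).
           VSHRN #4: divide by 16 and narrow back to 8 bits. */
        dst[x / 4] = (uint8_t)((uint16_t)(pair0 + pair1) >> 4);
    }
}
```

Running both versions over the same test image and comparing the output bytes is a cheap way to catch a register mix-up in the assembly.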
I can't figure out how to specify the NEON data alignment
And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?
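On the alignment question: an @128 qualifier is a promise that the address is a multiple of 16 bytes, and @64 a multiple of 8, and a qualified access to an address that breaks the promise raises an alignment fault. A small C sketch of the check a caller could make before taking the qualified path is below. The helper names are hypothetical, and the stride test assumes every row is reached by adding the same stride to an aligned base.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p is a multiple of `bytes` (a power of two), else 0. */
static int is_aligned(const void *p, size_t bytes)
{
    return ((uintptr_t)p & (bytes - 1)) == 0;
}

/* Hypothetical guard: a 128-bit hint is only safe on every row if the
   first row is 16-byte aligned and the stride keeps later rows aligned. */
static int rows_allow_128bit_hint(const uint8_t *first_row, size_t stride)
{
    return is_aligned(first_row, 16) && (stride % 16 == 0);
}
```

With a 640-byte stride, as in the PLD offsets in this thread, the stride condition holds because 640 is a multiple of 16.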
;; now at 256 8-bit values
    ...
    VPADDL.u8 Q3,Q3             ; 8 adds
    PLD [r1,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q7             ; 16 adds
    PLD [r2,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q11            ; 16 adds
    PLD [r3,#((4*640)-256)]
    ...
    VPADAL.u8 Q3,Q15            ; 16 adds
    PLD [r4,#((4*640)-256)]
;; now at 32 16-bit values
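The same idea can be tried from C before committing to assembly. The sketch below uses the GCC/Clang builtin `__builtin_prefetch` to issue a hint some distance ahead of the read pointer. The `CHUNK` and `AHEAD` values are placeholders to tune, not recommendations, and the function just sums a row so that the loop has some work to do.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)   /* no-op on other compilers */
#endif

enum {
    CHUNK = 64,    /* bytes consumed per outer iteration (assumed) */
    AHEAD = 256    /* prefetch distance in bytes (assumed, tune it) */
};

/* Sum the pixels of one row, hinting the cache AHEAD bytes in front. */
static uint32_t sum_row_with_prefetch(const uint8_t *row, size_t n)
{
    uint32_t sum = 0;
    for (size_t x = 0; x < n; x += CHUNK) {
        if (x + AHEAD < n)               /* stay inside the buffer */
            PREFETCH(row + x + AHEAD);
        size_t end = (x + CHUNK < n) ? x + CHUNK : n;
        for (size_t i = x; i < end; i++)
            sum += row[i];
    }
    return sum;
}
```

Timing the loop with a few different `AHEAD` values on the target board shows fairly quickly whether preloading helps for a given image size.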
;; Average square of four pixels to single pixel.
;; Produces NxM pixel image from 2Nx2M pixel image.
;; Generates 16 output pixels per loop.
;; May over-read by up to 63 bytes.
;; May over-write by up to 15 bytes.
;; r0 = Input line start address
;; r1 = Input line width in bytes
;; r2 = Input line total size in bytes
;; r3 = Output line start address
quad FUNC
    ;; Compute start of second line and end address
    ADD r1,r1,r0
    ADD r2,r2,r1
1
    ;; Load 32 pixels from each of two rows
    VLD1.8 {Q0,Q1},[r0]!
    VLD1.8 {Q2,Q3},[r1]!
    ;; Sum neighbouring 8-bits in each row to 16-bits
    VPADDL.U8 Q0,Q0
    VPADDL.U8 Q1,Q1
    VPADDL.U8 Q2,Q2
    VPADDL.U8 Q3,Q3
    ;; Sum 16-bit values vertically
    VADD.U16 Q0,Q0,Q2
    VADD.U16 Q1,Q1,Q3
    ;; Divide each sum of four pixels by 4 and cast to char
    VSHRN.U16 D0,Q0,#2
    VSHRN.U16 D1,Q1,#2
    ;; Store 16 pixels of resized image
    VST1.8 {Q0},[r3]!
    ;; Loop if not past end of image
    CMP r1,r2
    BLE %b1
    ;; Return from function
    BX lr
    ENDFUNC
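A plain C reference for the 2x2 case is handy when validating the routine above. This sketch works on a whole image rather than one pair of lines, and the name `shrink2` and its parameter list are assumptions made for the example. Unlike the assembly, it never reads or writes past the stated sizes.

```c
#include <stddef.h>
#include <stdint.h>

/* Average each 2x2 square of src into one dst pixel.
   out_w, out_h: size of the destination image in pixels;
   the source must be at least 2*out_w by 2*out_h.
   Strides are in bytes. */
static void shrink2(const uint8_t *src, size_t src_stride,
                    size_t out_w, size_t out_h,
                    uint8_t *dst, size_t dst_stride)
{
    for (size_t y = 0; y < out_h; y++) {
        const uint8_t *top = src + 2 * y * src_stride;
        const uint8_t *bot = top + src_stride;
        uint8_t *out = dst + y * dst_stride;
        for (size_t x = 0; x < out_w; x++) {
            /* VPADDL: widen and add horizontal neighbours */
            uint16_t t = (uint16_t)(top[2 * x] + top[2 * x + 1]);
            uint16_t b = (uint16_t)(bot[2 * x] + bot[2 * x + 1]);
            /* VADD then VSHRN #2: sum vertically, divide by 4, narrow */
            out[x] = (uint8_t)((uint16_t)(t + b) >> 2);
        }
    }
}
```

The division truncates, so a square holding 1, 2, 2, 2 comes out as 1, matching what the shift-narrow instruction does.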
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly.
    MOV   r0, #0
    LDR   r1, [r2], #4
    USAD8 r3, r0, r1
    MOV   r0, #0
    LDR   r1, [r12], #4
    LDR   r2, [r12], #4
    LDR   r3, [r12], #4
    LDR   r4, [r12], #4
    USAD8 r5, r0, r1
    USAD8 r6, r0, r2
    USAD8 r7, r0, r3
    USAD8 r8, r0, r4
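The trick in these snippets is that the USAD8/USADA8 family computes a sum of absolute byte differences, and against an all-zero operand that is simply the sum of the four bytes in the word. A small C model makes this easy to verify; `usad8` and `sum4` are illustrative names, not library functions.

```c
#include <stdint.h>

/* Sum of absolute differences of the four byte lanes of a and b. */
static uint32_t usad8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}

/* Sum of the four bytes packed in w: the difference against zero. */
static uint32_t sum4(uint32_t w)
{
    return usad8(w, 0);
}
```

For example, `sum4(0x01020304)` returns 10, and the largest possible result, for `0xFFFFFFFF`, is 1020.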
Wow, thanks so much guys, that's exactly what I needed to know! Glad to finally be part of the ARM community :-)
Especially since my optimised C code is structured to do the exact same thing as what my ARM assembly code does, but obviously the C compiler didn't agree!