Hi, I'm looking for a way to optimize the following code:
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2)
    {
        dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF;
        dataOut[i+1] = (dataIn[i]   >> 4) & 0x03FF;
    }
}
It just swaps two adjacent 16-bit words, shifts each right by 4 and sets the upper 6 bits to 0. I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm , but it gets slower with a decrementing counter.
It's taking about 50ms (55ms with decrementing counter) for a length of 350000. Target: AT91SAM9260, executed from external RAM.
Loop code:
uint32_t i;
length = length >> 1;
for (i = length; i != 0; i -= 2)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
Disassembly [code inside the loop]:
0x200204AC E3E0CB03 MVN R12,#0x00000C00
0x200204B0 E59F40A4 LDR R4,[PC,#0x00A4]
0x200204B4 E5913000 LDR R3,[R1]
0x200204B8 E2522002 SUBS R2,R2,#0x00000002
0x200204BC E00C5A23 AND R5,R12,R3,LSR #20
0x200204C0 E0043603 AND R3,R4,R3,LSL #12
0x200204C4 E1833005 ORR R3,R3,R5
0x200204C8 E5803000 STR R3,[R0]
0x200204CC E5B13004 LDR R3,[R1,#0x0004]!
0x200204D0 E2811004 ADD R1,R1,#PIOC_PDR(0x00000004)
0x200204D4 E00C5A23 AND R5,R12,R3,LSR #20
0x200204D8 E0043603 AND R3,R4,R3,LSL #12
0x200204DC E1833005 ORR R3,R3,R5
0x200204E0 E5A03004 STR R3,[R0,#0x0004]!
0x200204E4 E2800004 ADD R0,R0,#PIOC_PDR(0x00000004)
0x200204E8 1AFFFFF1 BNE 0x200204B4
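For readers following along, here is a minimal sketch of why the 32-bit loop above matches the original halfword loop. It assumes a little-endian target, so that the halfword at the lower address sits in the low 16 bits of each word, and it assumes the buffers are 32-bit aligned. The names swap_shift_mask, prepare_words and prepare_halfwords are made up for this illustration and are not from the posted code. Note also that the loop as posted steps its counter by 2, so it only terminates when the word count is even (it is for a length of 350000); the sketch counts single words instead.

```c
#include <stdint.h>

/* Process one 32-bit word holding two 16-bit samples (little-endian view:
 * low half = first sample, high half = second sample).
 *   w >> 20 : moves bits 4..13 of the high half down to bits 0..9
 *   w << 12 : moves bits 4..13 of the low half up to bits 16..25
 * The masks keep 10 bits per sample, so the upper 6 bits of each output
 * halfword end up as 0 and the two samples come out swapped. */
uint32_t swap_shift_mask(uint32_t w)
{
    return ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);
}

/* Word-wise variant: 'words' is the number of 32-bit words,
 * i.e. half the number of 16-bit samples. Works for odd counts too. */
void prepare_words(uint32_t *out, const uint32_t *in, uint32_t words)
{
    while (words != 0) {
        *out++ = swap_shift_mask(*in++);
        words--;
    }
}

/* Reference: same logic as the original halfword loop.
 * 'length' is the number of 16-bit samples and is expected to be even. */
void prepare_halfwords(uint16_t *out, const uint16_t *in, uint32_t length)
{
    uint32_t i;
    for (i = 0; i + 1 < length; i += 2) {
        out[i]     = (uint16_t)((in[i + 1] >> 4) & 0x03FF);
        out[i + 1] = (uint16_t)((in[i]     >> 4) & 0x03FF);
    }
}
```

For example, the word 0xABCD1234 (samples 0x1234 and 0xABCD) becomes 0x012302BC, which is the halfword pair 0x02BC, 0x0123 that the original loop produces. Whether the word-wise version is actually faster on the target still has to be measured.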
I want it as fast as possible. The instruction cache is enabled (done in the startup file). The data cache is disabled (the CP15 control register equals 0x00051078). I tried to enable it by setting the register to 0x0005107D (MMU and DCache enabled), but the processor then hangs. Is there a special procedure for enabling the data cache?
> I tried to enable it, by setting it to 0x0005107D -
> MMU and DCache enabled - but the processor then hangs.
> Is there a special procedure to enable the data cache?
Did you set up a page table at all? The MMU needs one to work properly. Don't forget to initialize cp15,c2 (TTB). RTFTRM ;-)
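To make the page table point concrete, here is a minimal sketch of a flat (virtual = physical) first-level table built from 1 MB section descriptors, with only the RAM range marked cacheable and bufferable. The function name build_flat_table and the parameters ram_base and ram_mb are made up for this illustration; the RAM base and size must come from your board's memory map, and the TRM remains the authority on the descriptor bits and on the exact enable sequence.

```c
#include <stdint.h>

/* First-level section descriptor fields (ARMv5 short descriptor format). */
#define SECT_TYPE   0x00000002u  /* bits[1:0] = 0b10: 1 MB section        */
#define SECT_B      0x00000004u  /* bit 2: bufferable                     */
#define SECT_C      0x00000008u  /* bit 3: cacheable                      */
#define SECT_BIT4   0x00000010u  /* bit 4: written as 1 on ARMv5 cores    */
#define SECT_AP_RW  0x00000C00u  /* AP = 0b11: read/write; domain field 0 */

#define TABLE_ENTRIES 4096u      /* 4096 x 1 MB covers the 4 GB space     */

/* Fill a first-level table with an identity mapping.
 * table    : 4096 words; on the target it must be 16 KB aligned
 * ram_base : start of the region to cache (assumed 1 MB aligned)
 * ram_mb   : size of that region in MB
 * Everything outside the RAM range (peripherals etc.) stays uncached. */
void build_flat_table(uint32_t *table, uint32_t ram_base, uint32_t ram_mb)
{
    uint32_t first = ram_base >> 20;
    uint32_t i;

    for (i = 0; i < TABLE_ENTRIES; i++) {
        uint32_t desc = (i << 20) | SECT_AP_RW | SECT_BIT4 | SECT_TYPE;

        /* Unsigned subtraction turns this into a range check:
         * true only for first <= i < first + ram_mb. */
        if (i - first < ram_mb)
            desc |= SECT_C | SECT_B;

        table[i] = desc;
    }
}

/* Rough order of the CP15 steps on the target (compiler-specific
 * inline assembly left out on purpose, check each step in the TRM):
 *   1. MCR p15, 0, Rd, c2, c0, 0   ; TTB = address of the table
 *   2. MCR p15, 0, Rd, c3, c0, 0   ; domain access control, domain 0 = client
 *   3. MCR p15, 0, Rd, c8, c7, 0   ; invalidate TLBs
 *   4. MCR p15, 0, Rd, c7, c7, 0   ; invalidate I and D caches
 *   5. read-modify-write c1, c0, 0 ; set M (bit 0) and C (bit 2)
 */
```

A call such as build_flat_table(table, 0x20000000, 64) would mark 64 MB starting at 0x20000000 as cacheable; the base matches the code addresses in the disassembly, but the 64 MB size is only an assumed value. If anything else (DMA, for instance) touches the cached region, cache maintenance is needed as well.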
Looking at the assembler output, I am not sure if the unrolled loop is better than the single-word parallel version that I posted.
Regards
Marcus
http://www.doulos.com/arm/
PS: -Otime seems to be detrimental to performance (RealView Compiler) of all variants that have been posted here.