Hi, I'm looking for a way to optimize the following code:
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2)
    {
        dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF;
        dataOut[i+1] = (dataIn[i]   >> 4) & 0x03FF;
    }
}
It just swaps each pair of 16-bit words, shifts them right by 4 and sets the upper 6 bits to 0. I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm , but it gets slower with a decrementing counter.
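To make the per-sample arithmetic concrete, here is a minimal sketch of what happens to a single 16-bit word. The helper name convertSample is hypothetical and not part of the code above; it only isolates the shift-and-mask step.

```c
#include <stdint.h>

/* Hypothetical helper: the shift-and-mask step applied to one 16-bit word.
   Shifting right by 4 drops the low nibble; the mask 0x03FF keeps 10 bits. */
uint16_t convertSample(uint16_t raw)
{
    return (uint16_t)((raw >> 4) & 0x03FF);
}
```

With the input pair {0x1234, 0xABCD}, the output pair is {0x02BC, 0x0123}: 0xABCD >> 4 is 0x0ABC, which masks down to 0x02BC, and 0x1234 >> 4 is 0x0123, which the mask leaves unchanged. The two results are then stored in swapped order.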
It takes about 50 ms (55 ms with a decrementing counter) for a length of 350000. Target: AT91SAM9260, executed from external RAM.
I tried the following implementations:
uint32_t i;
length = length >> 1;
for (i = length; i != 0; i -= 2)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}

uint32_t i;
length = length >> 2;
for (i = 0; i < length; i ++)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
Info: length is the number of 16-bit words to process.
Both of them took about 28 ms. Without unrolling the loop it's 50 ms. With more iterations inside the loop, the time stays the same.
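For reference, the 32-bit variants above can be written as one self-contained function. This is only a sketch: the name prepareData32 and the parameter halfwordCount are made up for illustration, and it assumes a little-endian target, 4-byte-aligned buffers and an even number of 16-bit words. It is not unrolled, so it shows the word-wise trick rather than the exact loops that were timed.

```c
#include <stdint.h>

/* Hypothetical function: handles two 16-bit samples per 32-bit access.
   Assumptions: little-endian, both buffers 4-byte aligned,
   halfwordCount is even. */
void prepareData32(uint32_t *dataOut, const uint32_t *dataIn,
                   uint32_t halfwordCount)
{
    uint32_t n = halfwordCount >> 1;   /* number of 32-bit words */

    while (n != 0)
    {
        uint32_t w = *dataIn++;        /* read the input word once */

        /* Upper halfword >> 4 lands in the lower halfword (>> 20 in total);
           lower halfword >> 4 lands in the upper halfword (<< 12 in total).
           Each mask keeps 10 bits per halfword. */
        *dataOut++ = ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);
        n--;
    }
}
```

As a check, the input word 0xABCD1234 holds the pair {0x1234, 0xABCD} on a little-endian machine and comes out as 0x012302BC, i.e. the pair {0x02BC, 0x0123}, which matches what the 16-bit loop produces.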
Well, that's some success, 28 vs 50 ms. Would you mind posting the resulting assembly for one of these variants again, please?
Do you have any specific performance requirement to meet, or is it just an "I want this to work as quickly as possible" thing?
Loop code:
Disassembly [code inside the loop]:
0x200204AC E3E0CB03 MVN  R12,#0x00000C00
0x200204B0 E59F40A4 LDR  R4,[PC,#0x00A4]
0x200204B4 E5913000 LDR  R3,[R1]
0x200204B8 E2522002 SUBS R2,R2,#0x00000002
0x200204BC E00C5A23 AND  R5,R12,R3,LSR #20
0x200204C0 E0043603 AND  R3,R4,R3,LSL #12
0x200204C4 E1833005 ORR  R3,R3,R5
0x200204C8 E5803000 STR  R3,[R0]
0x200204CC E5B13004 LDR  R3,[R1,#0x0004]!
0x200204D0 E2811004 ADD  R1,R1,#0x00000004
0x200204D4 E00C5A23 AND  R5,R12,R3,LSR #20
0x200204D8 E0043603 AND  R3,R4,R3,LSL #12
0x200204DC E1833005 ORR  R3,R3,R5
0x200204E0 E5A03004 STR  R3,[R0,#0x0004]!
0x200204E4 E2800004 ADD  R0,R0,#0x00000004
0x200204E8 1AFFFFF1 BNE  0x200204B4
I want it as fast as possible. The instruction cache is enabled (done in the startup file). The data cache is disabled (CP15 control register equals 0x00051078). I tried to enable it by setting the register to 0x0005107D (MMU and DCache enabled), but the processor then hangs. Is there a special procedure to enable the data cache?
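The two register values differ only in bits 0 and 2 of the CP15 control register. A small sketch of that arithmetic follows; the macro and function names are made up for illustration, while the bit positions are those of the ARM926EJ-S control register (c1).

```c
#include <stdint.h>

/* ARM926EJ-S CP15 control register (c1) bits; macro names are illustrative. */
#define CP15_CTRL_M  (1u << 0)    /* MMU enable               */
#define CP15_CTRL_C  (1u << 2)    /* data cache enable        */
#define CP15_CTRL_I  (1u << 12)   /* instruction cache enable */

/* Hypothetical helper: returns the control value with MMU and DCache set. */
uint32_t withMmuAndDcache(uint32_t ctrl)
{
    return ctrl | CP15_CTRL_M | CP15_CTRL_C;
}
```

Applied to 0x00051078 this gives 0x0005107D, and bit 12 (instruction cache) is already set in both values. On this core the data cache only works with the MMU enabled, which is why both bits have to be set together.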
> I tried to enable it, by setting it to 0x0005107D -
> MMU and DCache enabled - but the processor then hangs.
> Is there a special proceeding to enable the data cache?
Did you set up a page table at all? The MMU needs one to work properly. Don't forget to initialize cp15,c2 (TTB). RTFTRM ;-)
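A rough sketch of what such a page table could look like: a flat (virtual = physical) first-level table built from 1 MB section descriptors, with only a chosen range marked cacheable. All names here are hypothetical, and the range in the usage note (64 MB of SDRAM at 0x20000000) is an assumption about the board, not something stated in this thread. The CP15 writes are listed as comments only, because their syntax depends on the toolchain.

```c
#include <stdint.h>

/* ARMv5 first-level section descriptor fields (macro names are illustrative). */
#define SECTION_COUNT   4096u            /* 4096 x 1 MB covers 4 GB        */
#define DESC_SECTION    0x2u             /* bits [1:0] = 10: section entry */
#define DESC_B          (1u << 2)        /* bufferable                     */
#define DESC_C          (1u << 3)        /* cacheable                      */
#define DESC_BIT4       (1u << 4)        /* should be 1 on ARMv5           */
#define DESC_DOMAIN(d)  ((uint32_t)(d) << 5)
#define DESC_AP_RW      (3u << 10)       /* read/write access              */

/* Hypothetical helper: fills a flat-mapped table; only the sections
   [cachedFirstMb, cachedFirstMb + cachedMbCount) get the C and B bits.
   The table must be 16 KB aligned and hold SECTION_COUNT entries. */
void buildFlatTable(uint32_t *table, uint32_t cachedFirstMb,
                    uint32_t cachedMbCount)
{
    uint32_t mb;

    for (mb = 0; mb < SECTION_COUNT; mb++)
    {
        uint32_t desc = (mb << 20) | DESC_AP_RW | DESC_DOMAIN(0)
                      | DESC_BIT4 | DESC_SECTION;

        if (mb >= cachedFirstMb && (mb - cachedFirstMb) < cachedMbCount)
        {
            desc |= DESC_C | DESC_B;     /* write-back cacheable */
        }
        table[mb] = desc;
    }
}

/* Steps that follow on the target (not shown as code here):
   1. MCR p15, 0, Rd, c2, c0, 0   ; Rd = table address (TTB)
   2. MCR p15, 0, Rd, c3, c0, 0   ; Rd = 1: domain 0 is a client
   3. MCR p15, 0, Rd, c8, c7, 0   ; invalidate the TLBs
   4. MCR p15, 0, Rd, c7, c7, 0   ; invalidate both caches
   5. set the M and C bits in c1, the control register */
```

Calling buildFlatTable(table, 0x200, 64) would mark 0x20000000 to 0x23FFFFFF as cacheable and leave everything else, including the peripheral area at the top of the address space, uncached. Keeping peripherals uncached matters, since cached register accesses would not reach the hardware reliably.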
Looking at the assembler output, I am not sure if the unrolled loop is better than the single-word parallel version that I posted.
Regards
Marcus
http://www.doulos.com/arm/
PS: -Otime seems to be detrimental to performance (RealView Compiler) of all variants that have been posted here.