Hi, I'm looking for an optimization of the following code:
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t i;

    for (i = 0; i < length; i += 2)
    {
        dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF;
        dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
    }
}
It just swaps two 16-bit words, shifts them right by 4, and clears the upper 6 bits. I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm, but it gets slower with a decrementing counter.
It takes about 50 ms (55 ms with a decrementing counter) for a length of 350000. Target: AT91SAM9260, executed from external RAM.
A few tips:
* Make all function parameters 32 bits wide to avoid casts.
* Use a moving pointer rather than indexing the arrays explicitly.
* Use loop unrolling.
one more: count down to 0.
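For illustration, putting those hints together might look roughly like this. This is just a sketch, untested, and it assumes length is a multiple of 4:

#include <stdint.h>

/* Same operation as the original, but with moving pointers, a counter
   that runs down to zero, and the loop unrolled to two word pairs
   per pass. */
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t n = length >> 2;            /* two 16-bit pairs per pass */

    while (n-- != 0)
    {
        dataOut[0] = (uint16_t)((dataIn[1] >> 4) & 0x03FF);  /* swapped pair */
        dataOut[1] = (uint16_t)((dataIn[0] >> 4) & 0x03FF);
        dataOut[2] = (uint16_t)((dataIn[3] >> 4) & 0x03FF);
        dataOut[3] = (uint16_t)((dataIn[2] >> 4) & 0x03FF);
        dataIn  += 4;
        dataOut += 4;
    }
}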
Could you post the assembly code your compiler produces for this function?
It has been compiled with max. optimization level.
36:  {
37:    uint32_t i;
0x2200B2F4  E92D0030  STMDB    R13!,{R4-R5}
38:    for (i = 0; i < length; i += 2)
39:    {
40:      dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
41:      dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
42:    }
0x2200B2F8  E2825001  ADD      R5,R2,#CMOS_EVENT_CAPTURE_AND_SEND_BMP_USB(0x00000001)
0x2200B2FC  E3550001  CMP      R5,#CMOS_EVENT_CAPTURE_AND_SEND_BMP_USB(0x00000001)
43:  }
44:
0x2200B300  98BD0030  LDMLSIA  R13!,{R4-R5}
0x2200B304  912FFF1E  BXLS     R14
0x2200B308  E2413002  SUB      R3,R1,#MIN_IMAGE_WIDTH(0x00000002)
0x2200B30C  E240C004  SUB      R12,R0,#CMOS_EVENT_SEND_BMP(0x00000004)
0x2200B310  E1A02F02  MOV      R2,R2,LSL #30
0x2200B314  E2922101  ADDS     R2,R2,#0x40000000
0x2200B318  E2411004  SUB      R1,R1,#CMOS_EVENT_SEND_BMP(0x00000004)
0x2200B31C  E2400002  SUB      R0,R0,#MIN_IMAGE_WIDTH(0x00000002)
40:      dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
0x2200B320  E3E04B03  MVN      R4,#0x00000C00
0x2200B324  5A000005  BPL      0x2200B340
0x2200B328  E1F320B4  LDRH     R2,[R3,#CMOS_EVENT_SEND_BMP(0x04)]!
0x2200B32C  E0042222  AND      R2,R4,R2,LSR #4
0x2200B330  E1EC20B4  STRH     R2,[R12,#CMOS_EVENT_SEND_BMP(0x04)]!
41:      dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
42:    }
0x2200B334  E1F120B4  LDRH     R2,[R1,#CMOS_EVENT_SEND_BMP(0x04)]!
0x2200B338  E0042222  AND      R2,R4,R2,LSR #4
0x2200B33C  E1E020B4  STRH     R2,[R0,#CMOS_EVENT_SEND_BMP(0x04)]!
38:    for (i = 0; i < length; i += 2)
39:    {
0x2200B340  E1B02125  MOVS     R2,R5,LSR #2
0x2200B344  08BD0030  LDMEQIA  R13!,{R4-R5}
0x2200B348  012FFF1E  BXEQ     R14
40:      dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
0x2200B34C  E1D350B4  LDRH     R5,[R3,#CMOS_EVENT_SEND_BMP(0x04)]
41:      dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
42:    }
0x2200B350  E2522001  SUBS     R2,R2,#CMOS_EVENT_CAPTURE_AND_SEND_BMP_USB(0x00000001)
0x2200B354  E0045225  AND      R5,R4,R5,LSR #4
0x2200B358  E1CC50B4  STRH     R5,[R12,#CMOS_EVENT_SEND_BMP(0x04)]
0x2200B35C  E1D150B4  LDRH     R5,[R1,#CMOS_EVENT_SEND_BMP(0x04)]
0x2200B360  E0045225  AND      R5,R4,R5,LSR #4
0x2200B364  E1C050B4  STRH     R5,[R0,#CMOS_EVENT_SEND_BMP(0x04)]
0x2200B368  E1F350B8  LDRH     R5,[R3,#STEPPER_TIMER_DELTA_PER_STEP(0x08)]!
0x2200B36C  E0045225  AND      R5,R4,R5,LSR #4
0x2200B370  E1EC50B8  STRH     R5,[R12,#STEPPER_TIMER_DELTA_PER_STEP(0x08)]!
0x2200B374  E1F150B8  LDRH     R5,[R1,#STEPPER_TIMER_DELTA_PER_STEP(0x08)]!
0x2200B378  E0045225  AND      R5,R4,R5,LSR #4
0x2200B37C  E1E050B8  STRH     R5,[R0,#STEPPER_TIMER_DELTA_PER_STEP(0x08)]!
0x2200B380  1AFFFFF1  BNE      0x2200B34C
43:  }
50 ms / 350000 = 143 ns per 16-bit word processed. That would leave room for about 28 instructions per word, assuming a maximum CPU speed of 5 ns per instruction. The disassembly shows far fewer instructions than that per word, which suggests that memory bandwidth is the bottleneck. I don't think the compiler is sophisticated enough to optimize for memory bandwidth when generating code. Perhaps hand-crafted assembly is the answer...
A good test is to run an unrolled loop of maybe four iterations and see how that affects the timing.
If the input and output buffers are 32-bit aligned (and given the little-endian layout, so that two adjacent 16-bit words form one 32-bit word), it should be possible to move the data to/from memory in 32-bit chunks:
void prepareData(uint32_t* dataOut, uint32_t* dataIn, uint32_t length)
{
    uint32_t i;

    for (i = 0; i < length; i += 2)
    {
        *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
        dataIn++;
        dataOut++;
    }
}
I tried the following implementations:
uint32_t i;

length = length >> 1;
for (i = length; i != 0; i -= 2)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
uint32_t i;

length = length >> 2;
for (i = 0; i < length; i++)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
Info: length is the number of 16-bit words to process.
Both of them took about 28 ms. Without unrolling the loop it's 50 ms, and with more iterations inside the loop the time stays constant.
I don't think the compiler is sophisticated enough to optimize for memory bandwidth when generating code. Perhaps hand-crafted assembly is the answer...

How can I achieve optimized memory access with assembler in general? I'm not really familiar with that.
How can I achieve optimized memory access with assembler in general? I'm not really familiar with that.
I haven't done that either. One technique would be to use 'load/store multiple' instructions. If the CPU has a data cache, you should try minimizing cache misses too.
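To make the load/store-multiple idea concrete, one way to try it from C is to load several 32-bit words into locals per pass, which at least gives the compiler the opportunity to merge the accesses into LDM/STM; whether armcc actually does so would have to be checked in the listing. A sketch only (untested; the function name is made up, and it assumes 32-bit-aligned buffers, little-endian layout, and a length that is a multiple of eight 16-bit words):

#include <stdint.h>

void prepareData32(uint32_t* dataOut, const uint32_t* dataIn, uint32_t length)
{
    uint32_t n = length >> 3;            /* four 32-bit words per pass */

    while (n-- != 0)
    {
        /* Load a block first... */
        uint32_t w0 = dataIn[0];
        uint32_t w1 = dataIn[1];
        uint32_t w2 = dataIn[2];
        uint32_t w3 = dataIn[3];
        dataIn += 4;

        /* ...then process and store it, as in the 32-bit version above. */
        dataOut[0] = ((w0 >> 20) & 0x000003FF) | ((w0 << 12) & 0x03FF0000);
        dataOut[1] = ((w1 >> 20) & 0x000003FF) | ((w1 << 12) & 0x03FF0000);
        dataOut[2] = ((w2 >> 20) & 0x000003FF) | ((w2 << 12) & 0x03FF0000);
        dataOut[3] = ((w3 >> 20) & 0x000003FF) | ((w3 << 12) & 0x03FF0000);
        dataOut += 4;
    }
}

If the compiler still won't cooperate, the same load-block/process/store-block structure translates fairly directly into hand-written LDMIA/STMIA loops.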
One technique would be to use 'load/store multiple' instructions.
On an ARM7 chip, yes. I believe that there's no difference in cycle count on an ARM9 chip, at least not from the architecture. The LDM/STM instructions mainly exist there to save space.
In general, such optimizations require profound knowledge of the CPU's instruction cycle timings and the behavior of the memory system.
absolutely. I think 99% of the people on this forum (me included...) and in the industry lack that knowledge...!
I tried beating the compiler once, and failed (compiler generated code was about 5% faster).
I did have some success in very specific cases (where the aforementioned LDM/STM instructions can be used but the compiler does not use them), achieving code that is about 10%-20% faster. But all that code does is load blocks of data, perform simple operations on them, and store them back.
Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips. Flash is notoriously slow; it only got its name because it is faster than normal EEPROM when erasing.
I have also tried my luck at beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution involved. It takes too much time to check all possible combinations.
Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips.
Ah, yes. That's one detail I forgot - my code was running out of RAM, not out of flash. So, no wait states at all.
I have also tried my luck at beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution involved.
Pipelining. Don't forget pipelining. Keeping all the implications of a large pipeline in mind quickly becomes fairly mind-boggling.