Hi, I'm searching for an optimization of the following code:
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2)
    {
        dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF;
        dataOut[i+1] = (dataIn[i]   >> 4) & 0x03FF;
    }
}
It just swaps two 16-bit words, shifts them right by 4, and clears the upper 6 bits. I have already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm, but it gets slower with a decrementing counter.
It's taking about 50ms (55ms with decrementing counter) for a length of 350000. Target: AT91SAM9260, executed from external RAM.
50ms/350000 = 143 ns per 16-bit word, i.e. roughly 286 ns per loop iteration, since each iteration handles two words. Assuming a maximum CPU speed of 5 ns per instruction, that would be about 57 instructions per iteration. Disassembly shows that there are far fewer instructions per iteration in the code, suggesting that maybe memory bandwidth is the bottleneck (350000 16-bit reads plus 350000 16-bit writes in 50 ms is about 28 MB/s of external-RAM traffic). I don't think the compiler is sophisticated enough to optimize memory bandwidth when generating code. Perhaps, hand-crafted assembly is the answer...
A good test is to run an unrolled loop of maybe four iterations and see how that affects the timing.
If the input and output buffers are 32-bit aligned, it should be possible to move the data to/from memory in 32-bit chunks:
void prepareData(uint32_t* dataOut, uint32_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2)
    {
        *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
        dataIn++;
        dataOut++;
    }
}
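For the skeptical, here is a quick host-side check (a hypothetical test harness, not part of this thread) that one 32-bit word transformed this way matches the original halfword version. It assumes a little-endian layout, as on the AT91SAM9260, so dataIn[0] ends up as the low halfword of the 32-bit word.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t in[2] = { 0xABCD, 0x1234 };    /* arbitrary test words */
    uint16_t ref[2], out[2];
    uint32_t w;

    /* original per-halfword version as reference */
    ref[0] = (in[1] >> 4) & 0x03FF;
    ref[1] = (in[0] >> 4) & 0x03FF;

    /* 32-bit version, one word at a time */
    memcpy(&w, in, 4);                      /* little-endian: in[0] is the low halfword */
    w = ((w >> 20) & 0x000003FF) | ((w << 12) & 0x03FF0000);
    memcpy(out, &w, 4);

    printf("ref %04X %04X\n", ref[0], ref[1]);
    printf("out %04X %04X\n", out[0], out[1]);
    return (ref[0] == out[0] && ref[1] == out[1]) ? 0 : 1;
}

Both lines should print 0123 02BC.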
I tried the following implementations:
uint32_t i;
length = length >> 1;
for (i = length; i != 0; i -= 2)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
uint32_t i;
length = length >> 2;
for (i = 0; i < length; i++)
{
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
    *dataOut = ((*dataIn >> 20) & 0x000003FF) | ((*dataIn << 12) & 0x03FF0000);
    dataIn++;
    dataOut++;
}
Info: length is the number of 16-bit words to process.
Both of them took about 28 ms. Without unrolling the loop it's 50 ms, and with more iterations inside the loop the time remains constant.
"I don't think the compiler is sophisticated enough to optimize memory bandwidth when generating code. Perhaps, hand-crafted assembly is the answer..."

How can I achieve optimized memory access with assembler in general? I'm not really familiar with that.
"How can I achieve optimized memory access with assembler in general? I'm not really familiar with that."
I haven't done that either. One technique would be to use 'load/store multiple' instructions. If the CPU has a data cache, you should try minimizing cache misses too.
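To make the idea concrete, here is a minimal sketch (hypothetical, GCC-style inline assembly; armcc's embedded assembler syntax differs) of moving four 32-bit words per burst with LDMIA/STMIA. It only copies; the shift/mask work from the C versions would go between the load and the store, operating on r4-r7. It assumes 32-bit aligned buffers and a length (in 16-bit words) that is a multiple of 8.

#include <stdint.h>

void prepareDataBurst(uint32_t *dataOut, uint32_t *dataIn, uint32_t length)
{
    uint32_t n = length / 8;                /* four 32-bit words per pass */
    while (n--)
    {
        asm volatile(
            "ldmia %[in]!,  {r4-r7} \n\t"   /* one 4-word burst read      */
            "stmia %[out]!, {r4-r7} \n\t"   /* one 4-word burst write     */
            : [in] "+r" (dataIn), [out] "+r" (dataOut)
            :
            : "r4", "r5", "r6", "r7", "memory");
    }
}

The point of the burst is that the external memory controller sees four sequential accesses back to back instead of interleaved loads and stores.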
"One technique would be to use 'load/store multiple' instructions."
On an ARM7 chip, yes. I believe that there's no difference in cycle count on an ARM9 chip, at least not from the architecture. The LDM/STM instructions mainly exist there to save space.
In general, such optimizations require profound knowledge of the CPU's instruction cycle timings and the behavior of the memory system.
Absolutely. I think 99% of the people on this forum (me included...) and in the industry lack that knowledge...!
I tried beating the compiler once, and failed (the compiler-generated code was about 5% faster).
I did have some success in very specific cases (where the aforementioned LDM/STM instructions can be used, but the compiler does not use them), achieving code that is about 10%-20% faster. But all that code does is load blocks of data, perform simple operations on them, and store them back.
Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips. Flash is notoriously slow, and flash memories only got their name because they are faster to erase than normal EEPROM.
I have also tried my luck at beating the compiler, with very limited success. The amount of time needed to win over a compiler grows quickly with the amount of caching or out-of-order execution. It takes too much time to check all possible combinations.
"Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips."
Ah, yes. That's one detail I forgot - my code was running out of RAM, not out of flash. So, no waitstates at all.
"I have also tried my luck at beating the compiler, with very limited success. The amount of time needed to win over a compiler grows quickly with the amount of caching or out-of-order execution."
Pipelining. Don't forget pipelining. Keeping all the implications of a large pipeline in mind quickly becomes fairly mind-boggling.
"Pipelining. Don't forget pipelining...."
Things are so much easier on 8051s. It's quite easy to beat the compiler there - even the latest Keil C51 offering.
Well, that's some success, 28 vs 50 ms. Would you mind posting the resulting assembly for one of these variants again, please?
Do you have any specific performance requirement to meet, or is it just an "I want this to work as quickly as possible" thing?
I loved 8086 assembler until the Pentium got a zero-clock FXCH instruction to swap two registers in the FP stack. Before that, I could walk all over the x86 compilers. After, it took me a day to match what Watcom C did almost instantly, implementing basic arithmetic for vectors and matrices with n=4. And I needed to write a helper application just to visualise the contents of the FP stack as the FXCH instruction wildly moved the data around to always have the optimal value on the stack top.
Today, a PC processor doesn't have just a single instruction that can be run concurrently - almost all instructions can be, and are, run concurrently. We have to fight so hard with the ordering of the instructions (both to fill concurrent pipelines and to account for guesstimated cache-line response times) that we end up with code we can't proof-read for correctness. All we can hope for is that our regression tests will catch all possible corner cases. In the end, I do my best to avoid assembler for 32-bit processors and up. And I'm also desperately holding on to already-used compiler versions, to reduce the probability of bugs in the compiler, or bugs triggered by changes in the code generation.
This is definitely the right approach, Mike. Better still if the data were 8-word aligned, since that is the size of a cache line on the ARM926 - assuming the data cache has been enabled.
However, you can still shave off a few cycles inside the loop by parallelizing operations. Fortunately the task is rather well suited to this.
void prepareDataMH(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    int32_t i;
    uint32_t tmp;
    uint32_t *dataIn_pair  = (uint32_t *)dataIn;
    uint32_t *dataOut_pair = (uint32_t *)dataOut;

    for (i = (length/2)-1; i >= 0; i--)
    {
        tmp = (dataIn_pair[i] >> 4) & 0x03FF03FF;    /* shift and mask both halfwords at once */
        dataOut_pair[i] = (tmp >> 16) | (tmp << 16); /* swap halfwords (a single ROR #16)     */
    }
}
Regards Marcus http://www.doulos.com/arm/
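For what it's worth, forcing the buffers onto a 32-byte (8-word) cache-line boundary could look like the sketch below. The buffer names and sizes are made up for illustration; the attribute is GCC syntax, while armcc offers __align(32) for the same purpose.

#include <stdint.h>

/* Each buffer starts on an ARM926EJ-S cache-line boundary. */
__attribute__((aligned(32))) static uint16_t inBuf [350000];
__attribute__((aligned(32))) static uint16_t outBuf[350000];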
Disassembly of the code inside the loop:
0x200204AC  E3E0CB03  MVN   R12,#0x00000C00
0x200204B0  E59F40A4  LDR   R4,[PC,#0x00A4]
0x200204B4  E5913000  LDR   R3,[R1]
0x200204B8  E2522002  SUBS  R2,R2,#0x00000002
0x200204BC  E00C5A23  AND   R5,R12,R3,LSR #20
0x200204C0  E0043603  AND   R3,R4,R3,LSL #12
0x200204C4  E1833005  ORR   R3,R3,R5
0x200204C8  E5803000  STR   R3,[R0]
0x200204CC  E5B13004  LDR   R3,[R1,#0x0004]!
0x200204D0  E2811004  ADD   R1,R1,#0x00000004
0x200204D4  E00C5A23  AND   R5,R12,R3,LSR #20
0x200204D8  E0043603  AND   R3,R4,R3,LSL #12
0x200204DC  E1833005  ORR   R3,R3,R5
0x200204E0  E5A03004  STR   R3,[R0,#0x0004]!
0x200204E4  E2800004  ADD   R0,R0,#0x00000004
0x200204E8  1AFFFFF1  BNE   0x200204B4
I want it as fast as possible. The instruction cache is enabled (done in the startup file). The data cache is disabled (the CP15 control register reads 0x00051078). I tried to enable it by setting the register to 0x0005107D - MMU and DCache enabled - but the processor then hangs. Is there a special procedure to enable the data cache?
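For reference, reading and writing the CP15 control register is usually done along these lines (hypothetical helpers in GCC-style inline assembly; a Keil startup file would issue the same MRC/MCR instructions in assembler source). One caveat worth noting: bit 0 of 0x0005107D also enables the MMU, which needs a valid translation table base (CP15 c2) and domain access permissions (CP15 c3) set up beforehand - enabling it without them is a plausible cause of the hang.

#include <stdint.h>

/* Hypothetical helpers; register encodings per the ARM926EJ-S TRM. */
static inline uint32_t cp15_read_control(void)
{
    uint32_t val;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r" (val));
    return val;                          /* e.g. 0x00051078 here */
}

static inline void cp15_write_control(uint32_t val)
{
    asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r" (val) : "memory");
}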