Hi,
I am seeing a strange symptom on a Cortex-A15 device.
Below is the trace data.
Program Address    Disassembly
0x40401AA0         CMP R12, R0
0x40401AA4         BHI 0x40401A80
0x40401A80         LDR R0, [R13]
0x40401A84         LDRB R12, [R13, #24]
0x40401A88         STRB R12, [R0]
0x40401A8C         LDR R12, [R13]
0x40401A90         ADD R12, R12, #1
0x40401A94         STR R12, [R13]
0x40401A98         LDR R12, [R13, #4]
0x40401A9C         LDR R0, [R13]
We execute this memory write repeatedly.
Ten instructions are executed for each write to one DDR3 address, but the trace data shows that 724 cycles are spent executing these 10 instructions.
Generally speaking, one instruction takes one cycle, so 724 cycles seems abnormal.
Why does this symptom occur? Please let me know the reason.
I appreciate your quick reply.
Best regards,
Michi
Hello Michi,
Generally speaking, a load or store instruction cannot complete within one cycle because of memory latency.
There are also many pipeline hazards in your code, caused by dependencies through registers and memory.
Taking these into account, I don't think an execution time of 724 cycles is abnormal.
I executed the code on a Cortex-A9 because I don't have a Cortex-A15 environment.
The result was about 800 cycles per loop.
I think the code can be optimized further. For example, as follows.
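    ldr  r0, [r13]          @ r0 = write pointer
    ldr  r12, [r13, #4]     @ r12 = end address
LOOP:
    ldrb r1, [r13, #24]     @ byte to write
    add  r0, r0, #1         @ advance the pointer
    strb r1, [r0, #-1]      @ store at the previous pointer value
    cmp  r12, r0
    bhi  LOOP               @ repeat while end > pointer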
I anticipate the execution time would be roughly halved.
Best regards,
Yasuhiko Koumoto.
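For readers who think in C, the traced loop and the rewritten loop above can be sketched roughly as follows. This is only an assumed reconstruction: the names fill_naive and fill_fast are hypothetical, and the volatile pointer merely imitates the way the traced code reloads and stores the pointer on the stack in every iteration. The thread does not say how the original code was built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the traced loop: the write pointer lives in memory
 * (forced here with volatile), so each iteration loads and stores it again,
 * like the LDR/STR [R13] instructions in the trace. */
void fill_naive(uint8_t *start, uint8_t *end, uint8_t value)
{
    uint8_t *volatile p = start;   /* every access to p goes through memory */

    assert(start != NULL && end != NULL);
    while (p < end) {
        *p = value;                /* load p, store one byte */
        p = p + 1;                 /* load p, add 1, store p */
    }
}

/* Hypothetical model of the rewritten loop: the pointer stays in a local
 * variable that the compiler is free to keep in a register. */
void fill_fast(uint8_t *start, uint8_t *end, uint8_t value)
{
    uint8_t *p = start;

    assert(start != NULL && end != NULL);
    while (p < end) {
        *p++ = value;              /* store one byte, then advance */
    }
}
```

Both functions write the same bytes; they differ only in how often the pointer itself is read from and written to memory, which is the difference between the two assembly listings.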
Dear Koumoto-san,
Thank you for your reply.
>The result was about 800 cycles per loop.
→ Which memory did you use? Internal SRAM? Or external DDR3 memory with the cache enabled?
Please give me your answer.
Hello Michi,
I used external SDR SDRAM with both the instruction and data caches enabled.
When I executed the same code repeatedly, the execution time decreased to about 700 cycles because of cache hits.
Yasuhiko Koumoto.
I understand that 724 cycles is normal on DDR3. How about internal RAM (OCMC_RAM3)? When the same code is executed there, 824 cycles are spent.
I think this execution time is also too long. Is the result the same on the Cortex-A9? If so, does the optimization reduce the cycle count?
That looks strange.
The execution time on internal SRAM should be shorter than on DDR3.
In my Cortex-A9 environment, the execution time on SRAM was only 34 cycles.
As my code had a problem, I revised it and measured the execution time again.
The results on external SDR DRAM were 875 cycles when the L2 cache was off and 320 cycles when the L2 cache was on.
Best regards,
Yasuhiko Koumoto.
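To put the Cortex-A15 figures from the question in perspective, the arithmetic can be sketched in C as below. The helper names cycles_per_instruction and extra_cycles are hypothetical, and the one-cycle-per-instruction baseline is simply the assumption stated in the question, not a documented property of the core.

```c
#include <assert.h>

/* Average number of cycles spent per instruction in a traced sequence. */
double cycles_per_instruction(unsigned long cycles, unsigned long instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Cycles spent beyond the baseline assumed in the question
 * (one cycle per instruction). Returns 0 if the sequence ran faster. */
unsigned long extra_cycles(unsigned long cycles, unsigned long instructions)
{
    return cycles > instructions ? cycles - instructions : 0;
}
```

With the reported numbers, cycles_per_instruction(724, 10) gives 72.4 for DDR3 and cycles_per_instruction(824, 10) gives 82.4 for OCMC_RAM3, so almost all of the time goes to waiting rather than to issuing instructions.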