Question -

Hi,

I am seeing a strange symptom on a Cortex-A15 device.

Below is the trace data.

Program Address   Disassembly
0x40401AA0        CMP   R12, R0
0x40401AA4        BHI   0x40401A80
0x40401A80        LDR   R0, [R13]
0x40401A84        LDRB  R12, [R13, #24]
0x40401A88        STRB  R12, [R0]
0x40401A8C        LDR   R12, [R13]
0x40401A90        ADD   R12, R12, #1
0x40401A94        STR   R12, [R13]
0x40401A98        LDR   R12, [R13, #4]
0x40401A9C        LDR   R0, [R13]
0x40401AA0        CMP   R12, R0
0x40401AA4        BHI   0x40401A80

We execute the memory write repeatedly.

Ten instructions are executed for each single-address write to DDR3, but the trace data shows that 724 cycles are spent executing these 10 instructions.

Generally speaking, one instruction takes about one cycle, so 724 cycles seems abnormal.

Why does this symptom occur? Please let me know the reason.

I would appreciate a quick reply.

Best regards,

Michi
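
For readers following the trace above: the instruction sequence resembles what a compiler emits for a byte-fill loop built without optimization, where the write pointer, the end address, and the fill value all live on the stack and are reloaded on every pass. The C sketch below is an assumption about the source, not the original code. The names `fill_unoptimized`, `dst`, `end`, and `value` are hypothetical, and `volatile` is used only to force the same per-iteration loads and stores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction: each volatile local models a stack slot
 * ([R13], [R13, #4], [R13, #24] in the trace), so every access to it
 * becomes a real load or store, as in the traced loop. */
uint8_t *fill_unoptimized(uint8_t *start, size_t count, uint8_t fill)
{
    uint8_t *volatile dst = start;          /* write pointer, like [R13]    */
    uint8_t *volatile end = start + count;  /* end address, like [R13, #4]  */
    volatile uint8_t value = fill;          /* fill byte, like [R13, #24]   */

    while (end > dst) {     /* LDR, LDR, CMP, BHI */
        *dst = value;       /* LDR pointer, LDRB value, STRB */
        dst = dst + 1;      /* LDR, ADD, STR */
    }
    return dst;             /* one past the last byte written */
}
```

Only one of the memory accesses per pass is the actual data write; the rest are stack traffic for the loop bookkeeping.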

  • Hello Michi,
    Generally speaking, a load or store instruction cannot complete within one cycle because of memory latency.
    Your code also contains many pipeline hazards, both on registers and on memory.
    Given these considerations, I don't think 724 cycles is abnormal.
    I ran the code on a Cortex-A9 because I don't have a Cortex-A15 environment.
    The result was about 800 cycles per loop.
    I think the code can be optimized further.
    For example, it could be written as follows.

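      ldr  r0,[r13]        @ load the write pointer once
      ldr  r12,[r13,#4]    @ load the end address once
    LOOP:
      ldrb r1,[r13,#24]    @ byte to store
      add  r0,r0,#1        @ advance the pointer
      strb r1,[r0,#-1]     @ store at the previous position
      cmp  r12,r0
      bhi  LOOP            @ repeat while end > pointer (unsigned)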
    

    I expect this would roughly halve the execution time.
    Best regards,
    Yasuhiko Koumoto.
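
As a rough C-level view of the suggestion above, the sketch below keeps the write pointer and the end address in local variables (which a compiler can hold in registers) and reloads only the fill byte on each pass. This is an illustrative assumption, not code from the thread: the function name `fill_in_registers` and its parameters are hypothetical. Like the assembly loop, it has no entry check, so it always writes at least one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the shortened loop: pointer and end
 * address stay in locals, only the fill byte is re-read each pass. */
uint8_t *fill_in_registers(uint8_t *start, size_t count, uint8_t fill)
{
    volatile uint8_t value = fill;  /* models the byte kept at [r13,#24] */
    uint8_t *dst = start;           /* plays the role of r0  */
    uint8_t *end = start + count;   /* plays the role of r12 */

    assert(count > 0);  /* the loop body runs before the first comparison */

    do {
        uint8_t byte = value;   /* ldrb r1,[r13,#24] */
        dst = dst + 1;          /* add  r0,r0,#1     */
        dst[-1] = byte;         /* strb r1,[r0,#-1]  */
    } while (end > dst);        /* cmp r12,r0 ; bhi LOOP */

    return dst;                 /* one past the last byte written */
}
```

Unlike the traced loop, this version never writes the updated pointer back to memory, so any code that later reads it from the stack would need the returned value instead.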

  • Dear Koumoto-san,

    Thank you for your reply.

    >The result was about 800 cycles per loop.

    → Which memory did you use? Internal SRAM, or external DDR3 memory with the caches enabled?

    Please let me know.

    Best regards,

    Michi

  • Hello Michi,

    I used external SDR SDRAM with both the instruction and data caches enabled.

    When I executed the same code repeatedly, the execution time decreased to about 700 cycles because of cache hits.

    Best regards,

    Yasuhiko Koumoto.

  • Dear Koumoto-san,

    Thank you for your reply.

    I understand that 724 cycles is normal on DDR3. How about internal RAM (OCMC_RAM3)? When the same code is executed there, 824 cycles are spent.

    I think this execution time is also too long. Is the result the same on Cortex-A9? If so, would the optimization reduce the cycle count?

    I would appreciate a quick reply.

    Best regards,

    Michi 

  • Hello Michi,

    That looks strange.
    The execution time on internal SRAM should be shorter than on DDR3.
    In my Cortex-A9 environment, the execution time on SRAM was only 34 cycles.
    My code had a problem, so I revised it and measured the execution time again.
    The results on external SDR SDRAM were
    875 cycles with the L2 cache off and
    320 cycles with the L2 cache on.
    Best regards,
    Yasuhiko Koumoto.
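
To put the figures quoted in this thread on a common footing, the small helper below divides a cycle count by a unit count. It is only an arithmetic sketch: the function name `cycles_per_unit` is hypothetical, and because the numbers come from different cores and memory systems, the results are indicative rather than directly comparable.

```c
/* Ratio of a cycle count to a unit count (instructions, or another
 * cycle count when comparing two measurements).
 * The caller must pass units > 0. */
double cycles_per_unit(unsigned cycles, unsigned units)
{
    return (double)cycles / (double)units;
}
```

With the numbers reported above, `cycles_per_unit(724, 10)` gives 72.4 cycles per instruction for the DDR3 trace, and `cycles_per_unit(875, 320)` gives roughly 2.7 for the ratio between the L2-off and L2-on measurements.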