Cortex-A8 performance

I'm working on a project on a Texas Instruments AM3517 Cortex-A8 processor. I was seeing lower than expected performance, so I ran a simple comparison against a Cortex-M3 processor. The M3 performance was more than twice as good as the A8(?!).

The test was a simple count to 100,000:

while (1)
{
    volatile uint32_t    i;
  
    dbg_PinSet(DBG_PIN_00);
    for ( i = 0; i < 100000; i++ )
    {
    }
    dbg_PinClear(DBG_PIN_00);
}

This is a bare metal system.  Timing was measured with a scope and the debug pin, and found to be about 40 ms on the A8 clocked at 600 MHz, and about 14 ms on the M3 clocked at 72 MHz.

The code on the A8 is running from the on chip 64K ram to remove cache and external memory effects.  Interrupts are disabled on both processors.
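For scale, the scope timings quoted above can be converted into core clock cycles per loop iteration. This is only a minimal sketch; the helper name `cycles_per_iteration` is made up for illustration and does not come from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Core clock cycles spent per loop iteration, given the measured pulse
 * width in microseconds, the core clock in MHz, and the loop count.
 * (Hypothetical helper, for illustration only.) */
double cycles_per_iteration(double pulse_us, double clock_mhz,
                            uint32_t iterations)
{
    assert(iterations > 0);
    /* microseconds * MHz = clock cycles */
    return (pulse_us * clock_mhz) / (double)iterations;
}
```

With the figures quoted, `cycles_per_iteration(40000.0, 600.0, 100000)` comes to about 240 cycles per iteration on the A8, against roughly 10 for `cycles_per_iteration(14000.0, 72.0, 100000)` on the M3. A cost that high for an empty loop points at memory latency rather than at the core itself.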

I'm relatively new to the A8, and suspect I'm missing something simple in setup somewhere.

Any pointers or help will be greatly appreciated.

Thanks,

-Rob

  • Did you turn on the MMU, caches, and branch predictor? (The caches are virtually indexed, so you will need the page tables set up and the MMU enabled before you can turn the caches on, so it takes a little work to get this going).

    > The code on the A8 is running from the on chip 64K ram to remove cache


    This does not remove the need for cache. Cache is generally single cycle access, the on chip RAM is hanging off an internal AXI slave somewhere inside the chip, so will be 20-30 cycles to access (vs > 100 cycles for external DDR). All Cortex-A cores are designed to run with caches turned on, no sane use case will run with them disabled - it will cripple performance.

    HTH,
    Pete
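As a rough illustration of the setup described above, the sketch below builds a flat (virtual = physical) 1 MB section descriptor and sets the MMU, cache, and branch predictor enable bits in SCTLR. The bit positions follow the ARMv7-A short-descriptor format and SCTLR layout; the function names are hypothetical, domain 0 with full access is an assumption, and a real startup sequence also has to invalidate the caches and TLB and program TTBR0 and DACR before the final write, none of which is shown here.

```c
#include <assert.h>
#include <stdint.h>

/* SCTLR (CP15 c1, c0, 0) enable bits, ARMv7-A. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* First-level "section" descriptor fields (short-descriptor format). */
#define SECT_TYPE    0x2u          /* bits[1:0] = 0b10: 1 MB section */
#define SECT_B       (1u << 2)     /* bufferable */
#define SECT_C       (1u << 3)     /* cacheable */
#define SECT_AP_FULL (3u << 10)    /* AP[1:0] = 0b11: read/write access */

/* Descriptor for a flat-mapped 1 MB section in domain 0 (assumption).
 * cacheable != 0 selects write-back cached normal memory; otherwise the
 * region is left strongly ordered, which suits peripheral registers. */
uint32_t section_descriptor(uint32_t phys_addr, int cacheable)
{
    assert((phys_addr & 0x000FFFFFu) == 0);   /* must be 1 MB aligned */
    uint32_t desc = phys_addr | SECT_AP_FULL | SECT_TYPE;
    if (cacheable)
        desc |= SECT_C | SECT_B;
    return desc;
}

/* SCTLR value with MMU, both L1 caches and branch prediction enabled. */
uint32_t sctlr_with_perf_features(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I;
}

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* Only meaningful on the target, in a privileged mode, after the page
 * table, TTBR0 and DACR have been set up (not shown). */
void enable_perf_features(void)
{
    uint32_t sctlr;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr = sctlr_with_perf_features(sctlr);
    __asm__ volatile("dsb\n\t"
                     "mcr p15, 0, %0, c1, c0, 0\n\t"
                     "isb"
                     : : "r"(sctlr) : "memory");
}
#endif
```

A flat table would hold 4096 such descriptors, one per megabyte, with only the RAM regions marked cacheable. The Cortex-A8 L2 cache has a separate enable in the Auxiliary Control Register, which this sketch leaves alone.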

  • Adding to Peter's answer ... I do not know if this is the case, it's just a thought that popped into my head.

    On some devices, it's required to configure the GPIO pins as "high speed", otherwise they might only run at for instance 2 MHz or 25 MHz.

    If you see "erratic" measuring results, this might be the cause.

  • I did some simple testing with only the level 1 cache, and saw some minor variations between combinations of instruction/data cache enabled/disabled.  Moving from internal to external memory cut the performance in half, though.  Digging into the MMU and level 2 cache, and will let you know....

    Thanks,

    -Rob

  • I think the io pins are clocked ok - the low pulse between count loops is ~375ns.  Thanks for the reply.

  • You can't turn on the caches without the MMU turned on (certainly true for the data cache - the instruction cache may do something useful); I would guess the random variation would mostly just be noise in the measurement.

  • Hi Peter,

    regarding the instruction cache, it can be turned on without the MMU, can't it?

    Best regards,
    Yasuhiko Koumoto.

  • Hi rlepage,

    as I don't have a Cortex-A8 board, I ran the program on a Cortex-A9 board (a Renesas RZ/A1L) clocked at 384 MHz. The results are below.


    icache=ON  branch predict=ON   0.392ms
    icache=ON  branch predict=OFF  16.4ms
    icache=OFF branch predict=ON   0.392ms
    icache=OFF branch predict=OFF  11.7ms


    Branch prediction appears to dominate the measurement. With branch prediction OFF, enabling the icache even makes the loop slower.
    I'm not sure why your Cortex-A8 gave such a slow result.
    By the way, what were your results with both the icache and branch prediction ON?

    For your information, the Cortex-M4 results are as follows. The frequency is 50 MHz. I used the FRDM-K20D50M.

    Flash execution  8ms

    SRAM execution 12ms

    Best regards,

    Yasuhiko Koumoto.

  • Hi rlepage,

    Did you write your own bare-metal code, or did you start from some reference code?

    Can you post the startup code and the settings used for your experiment?

  • To wrap this up, I went back and tracked down some init code from TI Starterware v2.00.01.01 for the AM335X (different processor than the AM3517 I'm using but still A8, and mmu and cache are the same).

    Here are my results for the count to 100,000 test:

    On-chip memory
                                   branch prediction
                                   disabled      enabled
    mmu, caches disabled:          49.5780 ms    43.6035 ms
    mmu enabled:                   36.2767 ms    38.4381 ms
    mmu enabled, caches enabled:   1.54182 ms    1.00014 ms

    External DDR2 memory, 166 MHz
                                   branch prediction
                                   disabled      enabled
    mmu, caches disabled:          81.7507 ms    72.6937 ms
    mmu enabled:                   68.8886 ms    60.4374 ms
    mmu enabled, caches enabled:   1.80018 ms    1.20018 ms

    Clearly, enabling the mmu, caches, and branch prediction was the answer I needed.

    Thanks to all who replied!

    -Rob
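To put the table into one number per memory type, the improvement can be expressed as a ratio of the slowest to the fastest configuration. A small sketch; `speedup` is a hypothetical helper, not part of StarterWare:

```c
#include <assert.h>

/* Ratio of the time before a change to the time after it; values above
 * 1.0 mean the change made the loop faster. (Hypothetical helper.) */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

Using the figures from the table, `speedup(49.5780, 1.00014)` is roughly 50 for on-chip memory and `speedup(81.7507, 1.20018)` is roughly 68 for external DDR2. Once the caches are on, the gap between the two memories shrinks to about 20%.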

  • Thanks for sharing the data - glad to know you got it sorted =)

    Regards,
    Pete

  • Hi,

    I am currently working on a TI DaVinci DM8148 EVM, and I found in the DM8148 datasheet that the Cortex-A8 core has 64KB of internal RAM. I want to run a sample application from internal RAM on the Cortex-A8 and use the remaining internal RAM as a heap for memory operations done inside the sample app. I understand that this should be a bare-metal application flashed into ROM, with the instruction and data sections mapped to internal RAM. My questions are:

    1. How do I map my program section and data section to internal RAM?

    2. What is the procedure for flashing bare-metal code on a Cortex-A8 processor (DM8148 platform)?

    Regards,

    Vinay