I'm working on a project using a Texas Instruments AM3517 Cortex-A8 processor. I was seeing lower-than-expected performance, so I ran a simple comparison against a Cortex-M3 processor. The M3 performed more than twice as well as the A8 (?!).
The test was a simple count to 100,000:
while (1)
{
    volatile uint32_t i;
    dbg_PinSet(DBG_PIN_00);
    for ( i = 0; i < 100000; i++ ) { }
    dbg_PinClear(DBG_PIN_00);
}
This is a bare metal system. Timing was measured with a scope and the debug pin, and found to be about 40 ms on the A8 clocked at 600 MHz, and about 14 ms on the M3 clocked at 72 MHz.
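To put those two scope readings on a common footing, it can help to convert them into CPU cycles per loop iteration. The helper below is only an illustrative sketch; the function name and its parameters are made up for this example and are not part of the original test code.

```c
#include <stdint.h>

/* Approximate CPU cycles spent per loop iteration.
 * pulse_us   : measured width of the debug-pin pulse, in microseconds
 * clock_mhz  : core clock in MHz (i.e. cycles per microsecond)
 * iterations : number of loop iterations inside the pulse
 * Hypothetical helper, for illustration only. */
static double cycles_per_iteration(double pulse_us, double clock_mhz,
                                   uint32_t iterations)
{
    /* microseconds * cycles-per-microsecond = total cycles in the pulse */
    return (pulse_us * clock_mhz) / (double)iterations;
}
```

With the figures above, cycles_per_iteration(40000.0, 600.0, 100000) works out to 240 cycles per iteration on the A8, while cycles_per_iteration(14000.0, 72.0, 100000) gives roughly 10 on the M3. The A8 is spending far more cycles on each pass through the same small loop.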
The code on the A8 is running from the on chip 64K ram to remove cache and external memory effects. Interrupts are disabled on both processors.
I'm relatively new to the A8, and suspect I'm missing something simple in setup somewhere.
Any pointers or help will be greatly appreciated.
Thanks,
-Rob
Did you turn on the MMU, caches, and branch predictor? (The caches are virtually indexed, so you will need the page tables set up and the MMU enabled before you can turn the caches on, so it takes a little work to get this going).
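As a rough sketch of what "turning on" means here: on ARMv7-A the MMU, data cache, branch prediction, and instruction cache enables are bits M (0), C (2), Z (11), and I (12) of the CP15 System Control Register (SCTLR). The function names below are hypothetical. The hardware access assumes privileged mode, GCC-style inline assembly, and that a translation table, TTBR0, and the domain access register have already been programmed and the caches and TLBs invalidated. None of that setup is shown, and the L2 cache has its own separate enable that this sketch does not touch.

```c
#include <stdint.h>

/* ARMv7-A System Control Register (SCTLR, CP15 c1, c0, 0) enable bits. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Pure helper: return the given SCTLR value with the MMU, caches, and
 * branch prediction bits set. All other bits are preserved. */
static inline uint32_t sctlr_with_enables(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I;
}

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* Hypothetical helper: read-modify-write SCTLR on the target.
 * ASSUMPTION: page tables, TTBR0, and DACR are already set up, and the
 * caches/TLBs have been invalidated, before this is called. */
static inline void enable_mmu_caches_branch_pred(void)
{
    uint32_t sctlr;

    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr = sctlr_with_enables(sctlr);
    __asm__ volatile("dsb" ::: "memory");   /* finish outstanding accesses */
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr) : "memory");
    __asm__ volatile("isb" ::: "memory");   /* refetch with new settings */
}
#endif
```

Starting from an SCTLR value of 0, sctlr_with_enables() returns 0x1805. Vendor init code usually enables these in several steps, with invalidation in between, rather than in one write as sketched here.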
> The code on the A8 is running from the on chip 64K ram to remove cache
This does not remove the need for cache. Cache is generally single-cycle access, whereas the on-chip RAM hangs off an internal AXI slave somewhere inside the chip, so it will take 20-30 cycles to access (vs. > 100 cycles for external DDR). All Cortex-A cores are designed to run with caches turned on; no sane use case will run with them disabled, as it will cripple performance.
HTH, Pete
Adding to Peter's answer ... I do not know if this is the case, it's just a thought that popped into my head.
On some devices, the GPIO pins must be configured as "high speed"; otherwise they might only run at, for instance, 2 MHz or 25 MHz.
If you see "erratic" measuring results, this might be the cause.
I did some simple testing with only the level 1 cache, and saw some minor variations between combinations of instruction/data cache enabled/disabled. Moving from internal to external memory cut the performance in half, though. I'm digging into the MMU and level 2 cache, and will let you know....
I think the io pins are clocked ok - the low pulse between count loops is ~375ns. Thanks for the reply.
You can't turn on the caches without the MMU turned on (certainly true for the data cache - the instruction cache may do something useful); I would guess the random variation would mostly just be noise in the measurement.
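For reference, the smallest translation table that lets the MMU be switched on is usually a flat (virtual = physical) first-level table of 1 MB sections in the ARMv7-A short-descriptor format. The sketch below builds one and marks a chosen range of sections cacheable. The function names, the choice of domain 0, and which range is treated as cacheable are all assumptions for illustration. It also assumes TEX remap is disabled, and it does not show loading TTBR0 or the domain access register.

```c
#include <stdint.h>

#define L1_ENTRIES      4096u                 /* 4096 x 1 MB = 4 GB */
#define SECT_TYPE       (2u << 0)             /* bits[1:0] = 0b10: section */
#define SECT_B          (1u << 2)             /* bufferable */
#define SECT_C          (1u << 3)             /* cacheable */
#define SECT_DOMAIN(d)  (((uint32_t)(d) & 0xFu) << 5)
#define SECT_AP_RW      (3u << 10)            /* AP[1:0] = 0b11: full access */

/* The table must be 16 KB aligned before its address goes into TTBR0. */
uint32_t l1_table[L1_ENTRIES] __attribute__((aligned(16384)));

/* Build one section descriptor mapping megabyte mb_index onto itself.
 * Cacheable sections use C=1,B=1; all others are left strongly-ordered
 * (C=0,B=0), which is a conservative choice for peripheral space. */
static uint32_t section_entry(uint32_t mb_index, int cacheable)
{
    uint32_t e = (mb_index << 20) | SECT_AP_RW | SECT_DOMAIN(0) | SECT_TYPE;
    if (cacheable) {
        e |= SECT_C | SECT_B;
    }
    return e;
}

/* Fill the whole table with an identity map. Sections in the range
 * [cache_first_mb, cache_first_mb + cache_mb_count) are marked cacheable. */
static void build_flat_map(uint32_t *table, uint32_t cache_first_mb,
                           uint32_t cache_mb_count)
{
    for (uint32_t i = 0; i < L1_ENTRIES; i++) {
        int cacheable = (i >= cache_first_mb) &&
                        (i - cache_first_mb < cache_mb_count);
        table[i] = section_entry(i, cacheable);
    }
}
```

A call such as build_flat_map(l1_table, 0x800, 256) would mark the 256 MB starting at 0x80000000 as cacheable and leave everything else uncached. The actual RAM and peripheral ranges have to come from the device's memory map.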
Hi Peter,
regarding the instruction cache, it can be turned on without the MMU, can't it?
Best regards, Yasuhiko Koumoto.
Yes, I think so.
Hi rlepage,
as I don't have a Cortex-A8 board, I ran the program on a Cortex-A9 board (i.e. Renesas RZ/A1L). The frequency is 384 MHz. The results are below.
icache=ON   branch predict=ON    0.392 ms
icache=ON   branch predict=OFF   16.4 ms
icache=OFF  branch predict=ON    0.392 ms
icache=OFF  branch predict=OFF   11.7 ms
Branch prediction appears to dominate this measurement. With branch prediction OFF, the icache even seems to be a penalty. I'm not sure why your Cortex-A8 came out so slow. By the way, what were your results with the icache and branch prediction ON?
For your information, the Cortex-M4 results are as follows. The frequency is 50 MHz. I used the FRDM-K20D50M.
Flash execution 8ms
SRAM execution 12ms
Best regards,
Yasuhiko Koumoto.
Did you write your own bare-metal code, or did you start from some reference code?
Can you post the startup code and the settings used in your experiment?
To wrap this up, I went back and tracked down some init code from TI Starterware v2.00.01.01 for the AM335X (different processor than the AM3517 I'm using but still A8, and mmu and cache are the same).
Here are my results for the count to 100,000 test:
Clearly, enabling the mmu, caches, and branch prediction was the answer I needed.
Thanks to all who replied!
Thanks for sharing the data - glad to know you got it sorted =)
Regards, Pete
Hi,
I am currently working on the TI DaVinci DM8148 EVM, and I found from the DM8148 datasheet that the Cortex-A8 core has 64 KB of internal RAM. I want to run a sample application from internal RAM on the Cortex-A8 processor and use the remaining internal RAM as a heap for memory operations inside the sample app. I understand that this would have to be a bare-metal application flashed into ROM, with the instruction and data memory sections mapped to internal RAM. My questions are:
1. How do I map my program section and data section to internal RAM?
2. What is the procedure for flashing bare-metal code onto the Cortex-A8 processor (DM8148 platform)?
Regards,
Vinay