I found that a function's execution time changes with the flash address when it loads data from flash into a core register (R0) with LDRB in a loop. The loop runs 60 times. We also used the core PMU to count instructions, and the counts differ as well.
I also tested the case where the LDRB instruction runs only once: its timing is then not affected by the flash address. The effect appears only when LDRB is executed in a loop.
I found a strong pattern: performance is very good when I align to 32 bytes manually. How can we automatically ensure optimal performance when using the LDRB instruction?
Thanks. I only want to know the root cause. There is a lot of code in this style, so unrolling every loop is not practical.
Which device are you using to benchmark this? Could it be that the memory system is inefficient for non-word-aligned accesses?
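On the 32-byte observation: if the alignment that matters is that of the code (an assumption, since the thread does not say whether the code or the data was aligned), the toolchain can apply it instead of doing it by hand. Below is a minimal sketch assuming GCC or Clang. The names sum_bytes and entry_is_32_byte_aligned are made up for illustration. The attribute aligns the function entry only, not the loop inside it.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang. aligned(32) asks the toolchain to place the
   function entry on a 32-byte boundary. It does not align the loop body. */
__attribute__((aligned(32), noinline))
uint32_t sum_bytes(const volatile uint8_t *src, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += src[i];  /* byte load; compiles to LDRB on Arm targets */
    }
    return sum;
}

/* Returns 1 if the function entry lies on a 32-byte boundary.
   Bit 0 is cleared first because Thumb function addresses have it set. */
int entry_is_32_byte_aligned(void)
{
    uintptr_t addr = (uintptr_t)&sum_bytes & ~(uintptr_t)1;
    return (addr % 32u) == 0;
}
```

With GCC, -falign-functions=32 and -falign-loops=32 request the same kind of alignment for the whole build without touching the source. Whether that removes the timing difference depends on the device's flash interface, so it is worth measuring again after the change.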