Cortex-A53: The execution time of the same four assembly instructions varies depending on where they are located in memory (non-cacheable). After counting with the PMU, I found that in the slower cases there is an additional instruction per cycle. If the instruction cache is enabled, the runtime is consistent. Please analyze the reasons for this.
It's a long time since I last used the Lauterbach tools, so I can't comment specifically on them. However, if you're doing this in a debugger I'd expect it to make the effects I talked about worse. Every time you set or hit a breakpoint the debugger is (typically) going to do a bunch of interactions with the target - for example to populate memory or register views. Even if it doesn't, the act of entering/exiting debug state is going to muddy the numbers you see.
If you want to measure an instruction sequence using the PMU I think you'll need to follow the above suggestion. Run the sequence many times in a loop, reading the PMU only at the start and end, then average over the number of iterations, without interacting with the debugger for the duration of the loop.
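As a rough illustration of that loop-and-average approach, here is a minimal sketch in C. It is not taken from your code: `read_counter`, `sequence_under_test`, `average_delta`, `measure` and the `USE_PMCCNTR` build flag are all hypothetical names. The `PMCCNTR_EL0` read assumes the cycle counter has already been enabled and is accessible at the exception level you run at. Without that flag the sketch falls back to a nanosecond clock so it still builds on a hosted system.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Read a free-running counter.
 * Assumption: with -DUSE_PMCCNTR on AArch64 the PMU cycle counter is already
 * enabled and readable at the current exception level. Otherwise a
 * nanosecond clock is used as a stand-in. */
static inline uint64_t read_counter(void)
{
#if defined(__aarch64__) && defined(USE_PMCCNTR)
    uint64_t value;
    /* ISB so earlier instructions have completed before the counter is read. */
    __asm__ volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(value) : : "memory");
    return value;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical placeholder for the four instructions being measured. */
static inline void sequence_under_test(void)
{
#if defined(__aarch64__)
    __asm__ volatile("nop\n\tnop\n\tnop\n\tnop" : : : "memory");
#else
    __asm__ volatile("" : : : "memory"); /* compiler barrier only */
#endif
}

/* Average counter delta per iteration; returns 0 for zero iterations. */
static double average_delta(uint64_t start, uint64_t end, uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;
    return (double)(end - start) / (double)iterations;
}

/* Read the counter once before and once after the whole loop, never inside it. */
static double measure(uint64_t iterations)
{
    uint64_t start = read_counter();
    for (uint64_t i = 0; i < iterations; ++i)
        sequence_under_test();
    uint64_t end = read_counter();
    return average_delta(start, end, iterations);
}
```

The result includes the loop overhead (compare and branch), so it is best compared between placements of the same loop rather than read as an absolute cost for the four instructions. Call something like `measure(1000000)` from free-running code, with no breakpoints set inside the loop.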