Cortex-R4 PMU CP15 cycle counter value: run to readout vs. step-by-step to readout

I'm looking at this cycle counter value and, as they say, staring at it with a blank mind, not understanding it.

It seems a very trivial setup: I enable the counter early at startup, just after my PLLs are all set and locked and the CPU clock is set to its final speed and runs OK.
I init, reset, and start the cycle counter, and it runs.
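
For reference, an init/reset/start sequence of this kind can be sketched roughly as below. This is a minimal sketch, not the exact startup code: the helper names (`pmnc_start_value`, `cycle_counter_start`, `cycle_counter_read`) are made up for illustration, and the inline asm assumes a GCC-style toolchain targeting 32-bit ARM. The register encodings are the Cortex-R4 PMU ones (PMNC at c9, c12, 0; CNTENS at c9, c12, 1; CCNT at c9, c13, 0).

```c
#include <stdint.h>

/* PMNC (Performance Monitor Control Register), CP15 c9, c12, 0 */
#define PMNC_E  (1u << 0)   /* enable all counters */
#define PMNC_P  (1u << 1)   /* reset the event counters */
#define PMNC_C  (1u << 2)   /* reset the cycle counter */
#define PMNC_D  (1u << 3)   /* cycle counter ticks once every 64 cycles */

/* CNTENS (Count Enable Set), CP15 c9, c12, 1: bit 31 is the cycle counter */
#define CNTENS_CCNT (1u << 31)

/* Hypothetical helper: PMNC value that resets and starts the counters,
   leaving the unrelated bits of the current value alone. */
static inline uint32_t pmnc_start_value(uint32_t current, int div64)
{
    uint32_t v = current | PMNC_E | PMNC_P | PMNC_C;
    if (div64)
        v |= PMNC_D;    /* count in units of 64 CPU cycles */
    else
        v &= ~PMNC_D;   /* count every CPU cycle */
    return v;
}

#if defined(__arm__)    /* CP15 accesses only build for 32-bit ARM targets */
static inline void cycle_counter_start(void)
{
    uint32_t pmnc;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmnc));
    pmnc = pmnc_start_value(pmnc, 0);
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmnc));
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(CNTENS_CCNT));
}

static inline uint32_t cycle_counter_read(void)
{
    uint32_t ccnt;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
    return ccnt;
}
#endif
```

One thing worth checking in a setup like this is the D bit: if it is left set, every reading is in units of 64 cycles rather than single cycles.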

Then in main() I set a breakpoint and read it out. OK, it ticked many times, too many for my liking; it looks abnormally high. So I decide to reset and go step by step (C statements or asm, it won't matter), hoping to see where it takes the most cycles, thinking RAM init or C runtime copying.
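
A different way to find where the cycles go, without single-stepping, would be to record the counter at a few checkpoints during a free run and compare the deltas afterwards. The following is only a sketch under assumptions: `checkpoint`, `costliest_phase`, and the buffer size are hypothetical names and values, and the counts passed in would come from whatever routine reads the cycle counter.

```c
#include <stdint.h>

#define MAX_CHECKPOINTS 8u   /* arbitrary size for this sketch */

typedef struct {
    const char *label;       /* hypothetical phase name */
    uint32_t    count;       /* cycle counter value at that point */
} checkpoint_t;

static checkpoint_t g_checkpoints[MAX_CHECKPOINTS];
static unsigned     g_num_checkpoints;

/* Cycles between two 32-bit readings; unsigned subtraction stays
   correct across a single counter wrap. */
static uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Store one reading; silently dropped once the buffer is full. */
static void checkpoint(const char *label, uint32_t count)
{
    if (g_num_checkpoints < MAX_CHECKPOINTS) {
        g_checkpoints[g_num_checkpoints].label = label;
        g_checkpoints[g_num_checkpoints].count = count;
        g_num_checkpoints++;
    }
}

/* Index of the checkpoint that ends the most expensive phase,
   or 0 if fewer than two checkpoints were recorded. */
static unsigned costliest_phase(void)
{
    unsigned best = 0;
    uint32_t best_delta = 0;
    for (unsigned i = 1; i < g_num_checkpoints; i++) {
        uint32_t d = cycles_between(g_checkpoints[i - 1].count,
                                    g_checkpoints[i].count);
        if (d > best_delta) {
            best_delta = d;
            best = i;
        }
    }
    return best;
}
```

The buffer can then be inspected once at the main() breakpoint, so the debugger only halts the core at the end of the run.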

...And I get to the same breakpoint in main, and the value is different, massively different from when I just let it run. (I repeated the process just in case.)

E.g. if I just let it run, cycles = 1,656,559; if I step from the CP15 cycle counter init to the breakpoint in main, cycles = 106,729, or about 15 times fewer.
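
To put the two readings side by side, the arithmetic can be sketched as below. The helper names are hypothetical, and the CPU frequency is a parameter because the actual clock is not stated here; any frequency plugged in is an assumption.

```c
#include <stdint.h>

/* Ratio between two cycle counts. */
static double cycle_ratio(uint32_t larger, uint32_t smaller)
{
    return (double)larger / (double)smaller;
}

/* Convert a cycle count to microseconds for a given CPU clock in Hz.
   Assumes the counter ticks once per CPU cycle (divider bit clear). */
static double cycles_to_us(uint32_t cycles, double cpu_hz)
{
    return (double)cycles * 1e6 / cpu_hz;
}
```

`cycle_ratio(1656559u, 106729u)` comes out at roughly 15.5. As an illustration only, at an assumed 100 MHz clock the free-run figure would correspond to about 16.6 ms and the stepped figure to about 1.07 ms.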

What am I seeing, or not seeing? What black magic happens while I step? I cannot rationalize it.

The bigger figure cannot be correct (if it is, the HAL and C runtime init have horrible load times, and I will throw it the f away; I cannot have a boot this long). The smaller figure from stepping is what I intuitively expect (or want).

Now I'm left with the good old approach: toggle a GPIO and scope it, from as early as possible.