
Program execution time on an ARM Cortex-A9 processor

Note: This was originally posted on 4th January 2013 at http://forums.arm.com

  I'm using an ARM Cortex-A9 and trying to read the CCNT cycle counter from assembly code, following this post: http://stackoverflow.com/questions/3247373/how-to-measure-program-execution-time-in-arm-cortex-a8-processor?answertab=oldest#tab-top . According to it, before I can read the counter I have to enable it, enable the divide-by-64 clock divider, and clear the overflow flags. These operations are performed by writing to the appropriate registers (for instance, PMCR, the Performance Monitor Control Register). I printed the counter values in a loop to see how the overflow occurs, and I get this behavior:
1     (starts incrementing after it was reset to zero)
4650
4858
4943
5023
...
...     (incrementing...)
...
4293939054
4293939128    (overflow happens)
1602570
1602703
1602788
...
...
4293522911
4293522987
4293523062
4293523137
1186243
1186367
1186453
1186536
1186612
1186686
...
4293536300
4293536377
4293536456
4293536533
4293536612
1199090
1199209
1199295
1199373
1199453
1199530
....
and so forth.

  Accordingly, I have a set of questions:


  a) Which of the above registers are used by the Linux kernel? (How reliable is that information for future kernel versions?) How safe is it to change their values?

  b) What is the exact frequency of CCNT, and how can I get it? Unfortunately, I can't find the value in the processor spec. However, dmesg says:
   [ 0.000000] OMAP clocksource: GPTIMER2 at 24000000 Hz
   [ 0.000000] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 178956ms
   [ 0.132855] Switching to clocksource gp timer 
  But measuring it manually against clock_gettime gives me 7 MHz. Why is it not 24 MHz as expected?

  c) In my first output, why does the counter restart after the overflow not from zero but from about 1 million?

  d) Why do I get wrong results without the divide-by-64 divider? The values start to jump like this:

  ...
134110099
134114934
134119656
302352300
302361825
302367135
...
2885588930
2885593776
2885598630
3053958670
3053966752
3053972232
...
261130096
261134909
429343853
429351487
429356735

  I'd appreciate any help. Thanks
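
  For readers following along, the setup described in the question (enable the counters, turn on the divide-by-64, clear the overflow flags, then read CCNT) might look roughly like the sketch below. It follows the ARMv7-A PMU register encodings used in the linked Stack Overflow answer; the names `pmcr_value`, `pmu_init` and `ccnt_read` are illustrative, not from the post. It assumes the code runs in a privileged mode, or that user-mode access has already been enabled from the kernel through PMUSERENR.

```c
#include <stdint.h>

/* PMCR bit positions (ARMv7-A Performance Monitors Control Register). */
#define PMCR_E (1u << 0) /* enable all counters */
#define PMCR_P (1u << 1) /* reset the event counters */
#define PMCR_C (1u << 2) /* reset the cycle counter (CCNT) */
#define PMCR_D (1u << 3) /* CCNT counts once every 64 cycles */

/* Hypothetical helper: build the value to write to PMCR. */
static uint32_t pmcr_value(int do_reset, int div64)
{
    uint32_t v = PMCR_E;
    if (do_reset)
        v |= PMCR_P | PMCR_C;
    if (div64)
        v |= PMCR_D;
    return v;
}

#if defined(__arm__)
/* Assumption: privileged mode, or user access enabled via PMUSERENR. */
static inline void pmu_init(int do_reset, int div64)
{
    /* PMCR: control register */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0"
                     :: "r"(pmcr_value(do_reset, div64)));
    /* PMCNTENSET: enable CCNT (bit 31) and event counters 0-3 */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x8000000fu));
    /* PMOVSR: writing 1 clears the corresponding overflow flag */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 3" :: "r"(0x8000000fu));
}

/* Read the 32-bit cycle counter. */
static inline uint32_t ccnt_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}
#endif
```

  The bit-building part is kept separate from the inline assembly so it can be checked on any host; the `mcr`/`mrc` parts only compile for 32-bit ARM targets.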
  • Note: This was originally posted on 5th January 2013 at http://forums.arm.com


    a - It depends.  The PMU isn't directly needed by a basic kernel, but it will be used by things like oprofile or ARM's Streamline.  So if, say, you have Streamline set up and running, it will conflict with your manual poking of the PMU.

    b - CCNT counts cycles (or "ticks") as experienced by the processor - NOT time.  If you have dynamic voltage/frequency enabled (and most mobile systems will) then two cycles may take different amounts of time.  So CCNT isn't a good way of measuring time.

    c - How often are you reading the counter?  I suspect that you are just missing the 0 by not reading often enough.

    d - Guessing the same answer as above.
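
    To make point b concrete: a frequency figure obtained by comparing CCNT against clock_gettime is only an average over the measurement window, and it has to account for the divider. Note also that the 24 MHz in the dmesg output is the rate of GPTIMER2, a separate timer, not of CCNT. A minimal sketch of the calculation, where `average_core_hz` and its parameters are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average core clock in Hz over one measurement window.
 *   ticks      - CCNT difference over the window
 *   divider    - 64 if the PMCR divide-by-64 is enabled, otherwise 1
 *   elapsed_ns - wall-clock length of the window, e.g. from clock_gettime()
 */
static double average_core_hz(uint32_t ticks, unsigned divider,
                              uint64_t elapsed_ns)
{
    assert(divider == 1 || divider == 64);
    assert(elapsed_ns > 0);
    double cycles = (double)ticks * (double)divider;
    return cycles * 1e9 / (double)elapsed_ns;
}
```

    If the 7 MHz in the question was measured with the divide-by-64 enabled (an assumption, the post does not say), `average_core_hz(7000000, 64, 1000000000)` would correspond to an average of about 448 MHz of core cycles. With frequency scaling active, that average can differ from one window to the next.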

    Edit: Stupid emoticons.


    Thank you for a quick response.

    b - If, in your opinion, CCNT is not a good way, which way would you recommend? My aim is to avoid any system calls.

    c - I'm reading it in a loop of 1 billion iterations, so I could hardly have missed anything.
  • Note: This was originally posted on 13th January 2013 at http://forums.arm.com

    Actually, I prefer cycles and, as I understand it, I'd better read them from the global system timer, which has a constant frequency and doesn't vary with dynamic voltage and frequency scaling. The strange behavior was probably caused by process scheduling on the CPU, so I'll try to do it with the global timer. Thanks for your help.
  • Note: This was originally posted on 7th January 2013 at http://forums.arm.com

    b - It does sort of depend on what you care about.  Cycles are often preferable to time, as they're frequency independent.  For example, it takes a Cortex-A9 X cycles to execute this code fragment.  Allowing for memory system effects, that'll be true for all A9-based parts.  If what you need is time, then I'm not sure you have much option but to use a system call.  All the timers are memory mapped, and the kernel shouldn't allow direct access by user space applications.

    c - I'm sorry, I don't follow.  Could you post a code snippet?

    But note, what is important is the frequency you sample the CCNT at, not the total number of times you sample it.  Think of it this way...  The processor is probably running at something close to 1 GHz, which gives around 1,000,000,000 cycles (ticks) per second.  So it takes 1/1000th of a second for CCNT to be incremented by 1,000,000.  Now your app is sampling the CCNT and printing its value; how long does one iteration of that loop take?  Add in the fact that your app isn't running all the time, as the kernel is sometimes switching it out to run something else...  It could simply be that you just miss the lower count values.
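
    One way to check this against the numbers in the question is to compute the gap between the last read before the wrap and the first read after it. The sketch below relies on unsigned 32-bit subtraction wrapping modulo 2^32, so it is only valid if the counter overflowed at most once between the two reads. The function names and the 1 GHz clock rate are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two 32-bit counter reads. Unsigned subtraction
 * wraps modulo 2^32, so a single overflow between reads is handled. */
static uint32_t ccnt_delta(uint32_t prev, uint32_t now)
{
    return now - prev;
}

/* Convert counter ticks to core cycles (divider is 1 or 64). */
static uint64_t ticks_to_cycles(uint32_t ticks, unsigned divider)
{
    assert(divider == 1 || divider == 64);
    return (uint64_t)ticks * divider;
}

/* Convert cycles to microseconds at an assumed, constant clock rate. */
static uint64_t cycles_to_us(uint64_t cycles, uint64_t assumed_hz)
{
    assert(assumed_hz > 0);
    return cycles * 1000000u / assumed_hz;
}
```

    For the first wrap in the question, `ccnt_delta(4293939128u, 1602570u)` is 2,630,738 ticks. With the divide-by-64 enabled that is 168,367,232 cycles, or roughly 168 ms at an assumed 1 GHz, far longer than the 70 to 130 ticks between neighbouring reads elsewhere in the output.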
  • Note: This was originally posted on 14th January 2013 at http://forums.arm.com

    One additional point on context switching:


    • Each core in an SMP system has a unique set of performance counters, so threads which move cores are going to see changeable results.
    • It may be useful to try "perf" on Linux - the "perf" infrastructure in Linux wraps the performance counters and performs suitable context switching of data when threads or processes are context switched / migrate cores.
    Second point - why are you trying to second-guess the OS in this case? OSes have time functions for a reason, and Linux time functions usually give down to microsecond granularity. Exposing peripherals directly to user space is usually a "bad idea" - the global timer is probably used by the OS itself if it is available.
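
    As a point of comparison, timing a piece of code with the OS time functions can be as short as the sketch below. It uses POSIX `clock_gettime` with `CLOCK_MONOTONIC`; the helper names `timespec_diff_ns` and `time_call_ns` are made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds from start to end; assumes end is not earlier than start. */
static int64_t timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

/* Hypothetical helper: time one call of fn with the monotonic clock. */
static int64_t time_call_ns(void (*fn)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return timespec_diff_ns(t0, t1);
}
```

    The result includes the cost of the two `clock_gettime` calls themselves, so for very short code fragments it is worth timing many iterations and dividing.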

    HTH,
    Iso