When I use the cycle counter on AArch64, the cycle counts I read do not look right. I have enabled user-space reads of pmccntr_el0 using a small kernel module. My sample code looks like this:
uint64_t prev, curr, delta;
asm volatile("isb; mrs %0, pmccntr_el0" : "=r"(prev));
sleep(1);  /* wait one second between the two reads */
asm volatile("isb; mrs %0, pmccntr_el0" : "=r"(curr));
delta = curr - prev;
I expected delta to be around 1400000000, since the Cortex-A57 in our design runs at 1400 MHz.
But I am getting around 32100000, which would mean the cycle counter is running at only ~32.1 MHz.
The value of the control register is pmcr=0x41013001, indicating the divider is off.
With the Generic Timer counter registers, I get the values I expect. The code below:
uint64_t ts, te, freq;
asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
sleep(2);  /* wait two seconds between the two reads */
asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (te));
asm volatile ("isb; mrs %0, cntfrq_el0" : "=r" (freq));
printf ("AArch64 %20llu cycles\n", (unsigned long long)(te - ts));
printf (" Frequency = %llu\n", (unsigned long long)freq);
gives a count of 512021629 cycles over 2 seconds, as expected for the 256 MHz frequency I read from cntfrq_el0.
Is there something basic I am missing for PMCCNTR_EL0?
thanks and regards,
With the caveat that I'm not a Linux expert, my suspicion is power management. The Generic Timer measures time, but the PMU is counting cycles. If your system has a light load, it might well be either reducing the clock frequency or taking the core off-line entirely, both of which would reduce the number of cycles experienced by the core, but not the passage of time.
My guess would be taking the core off-line, but it is only a guess.
Just to add fuel to the fire: the cycle counter counts CLK cycles consumed by the processor - when in WFI or WFE state, the PMU won't count, so it may not even be 'offline' or have a reduced input clock rate.
When you call sleep(1) you're telling the kernel that you want to wait a second. That is an ideal opportunity for the kernel to set up a timer one second in the future and then, in the absence of anything else to do, drop into the idle loop (WFI...) fairly quickly.
Thanks. I will try with some normal code which does not include sleep. However, I am looking for a counter similar to rdtsc (the read time-stamp counter) on x86, which gives me ticks since the system started with nanosecond granularity, to do some time measurements. If WFI/WFE stops the PMU cycle counter, that will not help. Is there any other counter I can use?
I was thinking of deeper sleep modes than standby, but...
It depends on what you actually want to know. If what you want to know is the elapsed time, then you can use the Generic Timer.
I believe in the ARMv7 architecture (on the Cortex-A15) this was not the case: CCNT counted all the cycles (or the cycles divided by 64). Using the Generic Timer is not as accurate. For example, in our design the timer frequency is 256 MHz and the CPU frequency is 1600 MHz, so the counter ticks about 6.25 times slower than the CPU clock.
In the case of x86, rdtsc keeps incrementing at a constant rate irrespective of CPU frequency scaling etc., so we can depend on it to timestamp measurements.
Martin is right, you're using the wrong functionality -- if you want to read a timestamp which tells you the 'time' of the system at different points, the Generic Timer is the way to do this. In actual fact, you are probably looking to find the relative time between two places, which is why you've converged on the PMU. The Generic Timer is also the right tool for that: you don't need the official 'uptime' of the system, you can simply take the difference between the timer values sampled at the two points.
Unfortunately the PMU counts events within the microarchitecture of the processor, not "time." It can be used, along with a source of "time," to infer particular things about those events, but it is quite difficult to attribute an event (say, any one of 32.1 million counted cycles) in that sea of counted events to a particular nanosecond. PMUs enable what's called statistical profiling: they don't give you information about a single run's deterministic behaviour, but they can show you trends...
Due to many factors, however, it is doubtful that you would be able to actually capture and service events or timer comparators at nanosecond granularity. The counter source feeding the Generic Timer is, of necessity, not going to run at >1 GHz (ARM recommends something around 50 MHz, though anything from 100 kHz to 10 MHz would be fine for most usage). Given the requirement to use ISB barriers with both the PMU and the Generic Timer in certain situations, and the requirements for low power consumption, even if your counter did tick at >1 GHz, you simply would not be able to measure an event at nanosecond, single-digit-nanosecond, or perhaps even 100-nanosecond resolution with a processor at 1.4 GHz.
General computing almost never calls for anything that requires nanosecond granularity... I'm curious what you need it for.
Thanks for the explanation. I may now stick to values from the Generic Timer. In one of our networking applications, we need to do a job periodically, every 100 us or less. Instead of depending on a timer interrupt, with the latency that involves, we poll the "cycle counter" in a loop; once we hit the required count, we do the periodic job. We were using rdtsc on x86 (the TSC there increments at a constant rate based on the core's maximum frequency), and on ARM we wanted a similar mechanism, hence we tried PMCCNTR_EL0. As it is now clear from your and Martin's explanations that it is not a real "time-stamp" counter, I will use the timer counters instead. At 50 MHz, each Generic Timer tick would be 20 ns; I hope that will be OK for now.
You mentioned the counter seems to run at 256 MHz, so you should be getting a resolution of 4 ns or thereabouts. Note that it will probably take far longer than 4 ns to actually poll the counter: at the CPU frequency you state, if you can read the Generic Timer virtual counter (with the ISB to ensure that it is not speculatively read), do the comparison, and branch in ~6 cycles, you'll get your 4 ns timer resolution; polling that fast, however, doesn't seem feasible. Since you only need 100 us, it should make no difference whatsoever.
Again, the curious question is what you need that kind of polling and timer resolution for. One would assume you have an event that needs to run at least every 100 us but whose processing takes very close to 100 us; otherwise you would not be so concerned about meeting the timing or about the some-thousands-of-cycles interrupt latency. But the constant ISB-before-read is really going to hurt system performance. Is it possible that some other system effect of the processing is causing higher interrupt latency (copious use of STM/LDM instructions, or other uses of ISB, will add cycles to the latency)?