I need to benchmark some C++ code on the FVP_MPS2_Cortex-M4 simulator.
I considered using the CMSIS function osKernelGetTickCount() to provide timestamps, but the resolution of the tick timer seems to be 1 ms, which is too coarse.
What would be a suitable clock counter to use for the timings?
Would ARM_CM_DWT_CYCCNT be suitable and, if so, how would I access it?
Best regards
David
Hi David,
Yes, the FVP supports this register. You can access it in your code (in Thread mode) using something like the following.
#define CM_DEMCR             (*((volatile unsigned int*)0xE000EDFC))
#define CM_TRCENA_BIT        (1UL<<24)
#define CM_DWT_CONTROL       (*((volatile unsigned int*)0xE0001000))
#define CM_DWT_CYCCNTENA_BIT (1UL<<0)
#define CM_DWT_CYCCNT        (*((volatile unsigned int*)0xE0001004))

void start_cyccnt() {
    CM_DEMCR |= CM_TRCENA_BIT;
    CM_DWT_CONTROL |= CM_DWT_CYCCNTENA_BIT;
    CM_DWT_CYCCNT = 0;
}

unsigned int stop_cyccnt() {
    CM_DWT_CONTROL &= ~CM_DWT_CYCCNTENA_BIT;
    return CM_DWT_CYCCNT;
}
Note that the FVP is not fully cycle accurate (especially regarding memory access timing), but it is useful for relative comparisons. See the Cortex-M4 processor documentation for a description of the registers.
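If the counter is left running instead of being reset and stopped, an interval can also be measured by subtracting two snapshots. A minimal sketch, assuming the cycle counter is read as a 32-bit unsigned value; elapsed_cycles is a hypothetical helper name, not a CMSIS function:

```cpp
#include <cstdint>

// Difference between two snapshots of a 32-bit cycle counter.
// Unsigned arithmetic is modulo 2^32, so the result is still correct
// if the counter wrapped once between the two snapshots.
// (Hypothetical helper for illustration only.)
constexpr std::uint32_t elapsed_cycles(std::uint32_t start, std::uint32_t end) {
    return static_cast<std::uint32_t>(end - start);
}
```

This only holds if fewer than 2^32 cycles pass between the two reads; longer intervals need some form of overflow tracking.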
There are CMSIS functions available too, though these are for the Armv8.1-M architecture (such as Cortex-M55). Cortex-M4 is an Armv7-M processor, so it only supports a subset of the functionality.
https://arm-software.github.io/CMSIS_5/Core/html/group__pmu8__functions.html
Regards
Ronan
Which version of the FVP are you using? In the past, there was a known issue with DWT_CYCCNT in the M4 model, so I will check whether it has been fixed.
BTW, FVPs and Fast Models aren't suitable for performance measurement, so I recommend checking the document below in case you expect some accuracy from the model.
developer.arm.com/.../Model-capabilities
Kind regards,
Toshi
Hi Ronan
Thanks for your answer. My reason for wanting the counter is to provide a clock source for the C++ std::chrono library. I will investigate whether they will work together. BTW what do you mean by 'thread mode'?
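For the std::chrono use case, one possible shape is a small clock type whose tick is one core cycle. The sketch below is an illustration under stated assumptions, not a tested port: cycle_clock and FakeCounter are hypothetical names, the core frequency is passed in as a template parameter, and FakeCounter only stands in for the hardware register so that the code builds on a host. On the target, read() would return the cycle counter value instead.

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>

// Hypothetical stand-in for the hardware counter (host builds only).
// On the target, read() would return the DWT cycle counter value.
struct FakeCounter {
    static std::uint32_t& value() noexcept {
        static std::uint32_t v = 0;
        return v;
    }
    static std::uint32_t read() noexcept { return value(); }
};

// Clock whose tick is one core cycle. Hz is the assumed core frequency.
template <typename Counter, std::intmax_t Hz>
struct cycle_clock {
    using rep        = std::int64_t;
    using period     = std::ratio<1, Hz>;
    using duration   = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<cycle_clock, duration>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        // Accumulate 32-bit deltas into a 64-bit total so that the
        // reported time keeps increasing across counter wraps.
        static std::uint32_t last = 0;
        static std::int64_t total = 0;
        const std::uint32_t current = Counter::read();
        total += static_cast<std::uint32_t>(current - last);
        last = current;
        return time_point(duration(total));
    }
};
```

Usage would be along the lines of `using bench_clock = cycle_clock<FakeCounter, 25000000>;` with `bench_clock::now()` called before and after the code under test, 25 MHz being the core frequency reported later in this thread. As written, now() must be called at least once per counter wrap and is not safe to call from both thread and interrupt context.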
Best regards
Hi Toshihisa
Thanks for your answer.
> Which version of FVP do you use?
I am using:
FVP_MPS2_Cortex-M4 --version
Fast Models [11.14.21 (Mar 16 2021)]
Copyright 2000-2021 ARM Limited.
All Rights Reserved.
> BTW, FVPs and Fast Models aren't suitable for performance measurement
What should I look at to provide cycle-accurate simulation?
Thanks for the FM version. The issue has been fixed in 11.15, so please use version 11.15 or later.
I am not a salesperson, so it would be better for you to contact the Arm Sales team, but I think Arm Flexible Access (see the link below) is a good starting point for exploring Arm's solutions.
www.arm.com/.../flexible-access
Hi Toshi and Ronan, thanks for indicating that the issue is fixed in 11.15. I downloaded 11.16 and, indeed, it does look fixed.
Can you tell me, please, how to determine the clock frequency of the FVP_MPS2_Cortex-M4 simulator?
Thanks for the updates. I am glad that now you can see it works.
The core (Cortex-M4) in the FVP_MPS2_Cortex-M4 runs at 25 MHz and the SysTick timer in the FVP runs at 25 kHz. These clock frequencies are hard-coded in the FVP, so they cannot be changed. FYI, I've checked this in the source code of the FVP.
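For reference, the 25 MHz figure fixes both the length of one cycle and how long a 32-bit cycle counter can run before wrapping. A small sketch of that arithmetic; the constant and function names are hypothetical, and the only input taken from this thread is the 25 MHz core clock:

```cpp
#include <cstdint>

// Core clock as stated in this thread (25 MHz).
constexpr std::uint64_t kCoreHz = 25000000ULL;

// Nanoseconds per core cycle (exact at 25 MHz).
constexpr std::uint64_t ns_per_cycle() {
    return 1000000000ULL / kCoreHz;
}

// Whole seconds before a 32-bit cycle counter wraps around.
constexpr std::uint64_t seconds_to_wrap() {
    return (1ULL << 32) / kCoreHz;
}
```

So one cycle corresponds to 40 ns of simulated time, and a measurement longer than about 171 simulated seconds would need overflow handling.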
Hi Toshi
Thanks for your reply, I confirmed your figures by executing:
OS_Tick_GetClock() = 25000000 = 25 MHz ('OS Tick time clock frequency in Hz')
OS_Tick_GetInterval() = 25000, so the tick frequency is 1 kHz, i.e. a 1 ms period
So that is consistent with what you wrote.
However, I then measured the elapsed clock cycle count and tick count for osDelay(10000), which gives a delay of 10 s (confirmed by stopwatch). I found:
Elapsed cycle count = 31327755 which suggests an interval of 31327755/25e6=1.2531s @25MHz
Elapsed tick count (start-stop): 5010 = 5.01s @1ms tick
So the cycle count appears to be running 8x slower than the system clock and I can't explain the tick count.
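The arithmetic behind these figures can be written out explicitly. A sketch using only the numbers quoted above (31327755 cycles, 5010 ticks of 1 ms, 10 s on the stopwatch); the names are hypothetical:

```cpp
#include <cstdint>

// Convert a cycle count to seconds for a given clock frequency in Hz.
constexpr double cycles_to_seconds(std::uint32_t cycles, double hz) {
    return cycles / hz;
}

constexpr double kWallSeconds  = 10.0;                                    // stopwatch
constexpr double kCycleSeconds = cycles_to_seconds(31327755u, 25.0e6);    // about 1.2531 s
constexpr double kTickSeconds  = 5010 * 1.0e-3;                           // 5.01 s at 1 ms per tick

constexpr double kWallPerCycle = kWallSeconds / kCycleSeconds;  // about 7.98
constexpr double kTickPerCycle = kTickSeconds / kCycleSeconds;  // about 4.00
constexpr double kWallPerTick  = kWallSeconds / kTickSeconds;   // about 2.00
```

In other words, the cycle counter is roughly 8x slower than the stopwatch, roughly 4x slower than the tick count, and the tick count is itself roughly 2x slower than the stopwatch.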
Here is the code I used:
#define CM_DEMCR             (*((volatile unsigned int*)0xE000EDFC))
#define CM_TRCENA_BIT        (1UL<<24)
#define CM_DWT_CONTROL       (*((volatile unsigned int*)0xE0001000))
#define CM_DWT_CYCCNTENA_BIT (1UL<<0)
#define CM_DWT_CYCCNT        (*((volatile unsigned int*)0xE0001004))

void start_cyccnt() {
    CM_DEMCR |= CM_TRCENA_BIT;
    CM_DWT_CONTROL |= CM_DWT_CYCCNTENA_BIT;
    CM_DWT_CYCCNT = 0;
}

unsigned int stop_cyccnt() {
    CM_DWT_CONTROL &= ~CM_DWT_CYCCNTENA_BIT;
    return CM_DWT_CYCCNT;
}

printf("10s start\n");
start = OS_Tick_GetCount();
start_cyccnt();
osDelay(10000);
printf("10s cyccnt: %d\n",stop_cyccnt());
stop = OS_Tick_GetCount();
printf("10s tick count (start-stop): %d\n",(start-stop));
printf("10s stop\n");
and my results:
10s start
10s cyccnt: 31327755
10s tick count (start-stop): 5010
10s stop
Can you explain these figures please?
I am not a CMSIS expert, so I don't know how osDelay() is implemented, but some of the documents (e.g. https://www.keil.com/pack/doc/cmsis/RTOS/html/group__CMSIS__RTOS__Wait.html) mention that it is an elapsed time, which means it is not simulation time. I therefore suspect you see the difference (8x slower (*)) because the model runs at 25 MHz in its own world (the simulation world), while osDelay() is triggered by elapsed wall-clock time in the real world.
(*) Did you perhaps mean 4x slower, since 5.01s / 1.2531s = 3.998?
Hi Toshi and Ronan
I'm still struggling with obtaining meaningful timing info from CM_DWT_CYCCNT. I wrote a simple loop and timed it using the start_cyccnt() and stop_cyccnt() functions that Ronan suggested above. Here are the code and the results:
As you will see, the values seem increasingly meaningful as the total loop count increases, but the numbers are meaningless for a loop count of 75 or lower. Also, given that this is a cycle count, i.e. at least one cycle per instruction, a count of 1040 must be wrong for a loop count of 1000.
I wonder if the simulator is working correctly?
Do you have any thoughts on this, or any suggestions for an alternative timing method, please? (RTOS tick would be too coarse).
Best regards
Hi David, this is not totally unexpected, and comes down to the same issue that Toshi mentioned earlier: Fast Models are not 100% cycle accurate.
I think what you are seeing here is an effect of the 'quantum' of instructions that Fast Models use to accelerate execution. If you single step through the loop with a low loop count, do you get different numbers?
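To illustrate why a quantum would make short measurements meaningless while long ones look plausible, here is a deliberately simplified toy model. It is not how Fast Models are implemented; it only assumes, for the sake of illustration, that the counter value a program can observe advances in steps of one quantum. All names and the quantum size are hypothetical.

```cpp
#include <cstdint>

// Toy model only: the observable counter advances in steps of `quantum`.
constexpr std::uint64_t visible_count(std::uint64_t true_cycles, std::uint64_t quantum) {
    return (true_cycles / quantum) * quantum;
}

// What a start/stop measurement of `length` true cycles, beginning at
// true cycle `start`, would report under this toy model.
constexpr std::uint64_t measured(std::uint64_t start, std::uint64_t length,
                                 std::uint64_t quantum) {
    return visible_count(start + length, quantum) - visible_count(start, quantum);
}
```

With a quantum of 1000 in this model, a 75-cycle region reports either 0 or 1000 depending on where it starts, whereas a region of a million cycles is off by at most one quantum.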
FVPs and Fast Models (FM) sacrifice accuracy to get fast simulation speed, so the result you see is expected. As I previously posted, users should not expect cycle accuracy from them. Please check what the models cannot do in the link from my previous post (repeated below in case you missed it).
https://developer.arm.com/documentation/100964/1116/Introduction-to-the-Fast-Models-Reference-Manual/Model-capabilities
--
Fast Models can:
Fast Models cannot:
You can increase accuracy by using a smaller quantum (-Q) or minimum sync latency (-M), both listed in Table 4, "Timing and performance options", of the FVP reference guide below. However, please note that even if you set them to the minimum value, i.e. 1, you won't see the expected result, because this is how the model is implemented: it is designed to run very fast, not to be accurate. So, again, users should not expect cycle accuracy from FVPs or Fast Models.
developer.arm.com/.../FVP-command-line-options
Please note that smaller values for these parameters will make the model run more slowly. Speed and accuracy are always a trade-off.
Kind regards,
Toshi
Hi Toshi and Ronan
Thanks for your replies. It's pretty clear that FVP and fast models aren't suitable for my purposes. Does ARM offer a cycle accurate simulator for Cortex-M4? I realise it may be slow but we could tolerate that. If so, could you please tell me where I can obtain it?
I've started a new thread 'Cycle accurate simulator for Cortex-M4?' as this one seems to have run its course.