Is the FVP accurate in terms of measuring performance of programs? Is it cycle accurate? If I use clock_gettime to measure time taken on applications, is it meaningful? If not, is there an accurate way to measure performance of programs on the FVP?
Hi Mohannad,

FVPs, and Fast Models in general, are functionally accurate, meaning that they execute all instructions correctly, but they are not cycle accurate (a separate technology, Cycle Models, is available for that use case). As the name implies, Fast Models (the technology from which the FVPs are built) are designed to execute code quickly, typically on the order of 100M instructions/sec, whereas Cycle Models run in the 10k-100k instructions/sec range.

Some high-level timing annotation can be applied to the FVP to change cache and memory access characteristics and so on (use <fvp_executable> --list-params to see all the available options, then edit as appropriate); the effect of these can be seen in the --stat output. I tend to use this for relative comparisons rather than absolute measurements. Some further annotation (pipeline models etc.) can be applied with the full Fast Models tooling. Note that enabling these annotations will slow the simulation down.

For more information, see https://developer.arm.com/docs/100965/1110/timing-annotation
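For illustration, the workflow might look like the following (the model name is just an example, and parameter paths differ between FVPs, so check --list-params first; -C sets a parameter and -a loads an application):

    # list every configurable parameter (and its default) for the model
    FVP_Base_RevC-2xAEMvA --list-params

    # override a parameter with -C and print execution statistics at exit with --stat
    FVP_Base_RevC-2xAEMvA -C <parameter>=<value> --stat -a <application.axf>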
Hello Ronan,
Thank you for your reply.
That's interesting! Are there any Cycle Models that support ARMv8.3 and above? I took a look but could only find models for already-released Cortex processors. I'll take a look at the timing annotation link and experiment with it a bit more on the FVP. I'll let you know if I have any more questions. As always, thank you very much for your help and support!
Mohannad Ismail
Sorry, I should have been more clear on that. As Cycle Models are derived from the actual RTL design, these only become available once the CPUs are released. I was just making the point in general that these are the models for true cycle accuracy.
Hi Mohannad,
adding some more detail to Ronan's answer on Fast Models Timing Annotation.
Each of the Fast Models CPU models has the parameters "cpi_mul" and "cpi_div". By default, Fast Models execute 1 instruction per clock tick (i.e., CPI = 1); these parameters can be used to modify that. For example, to get a CPI of 1.25 you would set cpi_mul=5 and cpi_div=4. (Fast Models don't support real numbers as parameters, so to create a fraction you need the two integer parameters.)
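On a Base FVP that could look something like the sketch below (the cluster0 parameter path is illustrative; on some models the parameters sit per-CPU, e.g. cluster0.cpu0.cpi_mul, so confirm the exact path with --list-params):

    # request CPI = 5/4 = 1.25 for cluster 0
    FVP_Base_RevC-2xAEMvA -C cluster0.cpi_mul=5 -C cluster0.cpi_div=4 -a <application.axf>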
Caches (and TLBs, page tables, etc.) in Fast Model CPUs have a set of parameters to define the estimated latency caused by accesses. By default, cache modelling is switched off in the Fast Model, since it does not affect software functionality. For the latency to be applied to accesses, cache modelling must be switched on via a parameter. This can be done at the start of simulation, or at a pre-defined clock count during the simulation.
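As a sketch (the parameter name below is the one I believe the AEM Base FVP uses for this; treat it as an assumption and verify the name and path with --list-params on your model):

    # switch cache state modelling on so cache hit/miss latencies are applied
    FVP_Base_RevC-2xAEMvA -C cluster0.cache_state_modelled=1 -a <application.axf>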
Delays on downstream memory accesses are not directly supported on the FVP. These rely on annotating the delay onto SystemC/TLM b_transport transactions between the CPU and downstream models; as the FVP does not expose these, there is no way of inserting the delay. Using Fast Models and building the platform from source is required. Alternatively, you could use the delay annotation on cache miss operations for the outermost cache (the L2 in the Base FVP) to include an estimated delay for a downstream memory access. Note: although memory is the most common use of timing annotation, it can also be applied to peripherals, interconnects, or other components, as long as they use b_transport between the CPU model and the component.
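To show where that annotation would live when you do build the platform from source, here is a minimal, generic TLM-2.0 sketch (not actual Fast Models platform code; the module, memory size and 100 ns latency are invented for illustration). The key point is that the target adds its estimated access latency to the delay argument of b_transport, which a loosely-timed initiator such as the CPU model then accumulates:

    // simple_memory.h -- illustrative TLM-2.0 target with timing annotation
    #include <cstdint>
    #include <cstring>
    #include <vector>
    #include <systemc>
    #include <tlm>
    #include <tlm_utils/simple_target_socket.h>

    struct SimpleMemory : sc_core::sc_module {
        tlm_utils::simple_target_socket<SimpleMemory> socket;
        std::vector<unsigned char> mem;

        SC_CTOR(SimpleMemory) : socket("socket"), mem(0x10000, 0) {
            socket.register_b_transport(this, &SimpleMemory::b_transport);
        }

        void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
            uint64_t addr = trans.get_address();
            unsigned len  = trans.get_data_length();

            if (addr + len > mem.size()) {
                trans.set_response_status(tlm::TLM_ADDRESS_ERROR_RESPONSE);
                return;
            }
            if (trans.is_read())
                std::memcpy(trans.get_data_ptr(), &mem[addr], len);
            else if (trans.is_write())
                std::memcpy(&mem[addr], trans.get_data_ptr(), len);

            // Timing annotation: add an estimated access latency to the
            // transaction's delay (100 ns is an arbitrary example value).
            delay += sc_core::sc_time(100, sc_core::SC_NS);

            trans.set_response_status(tlm::TLM_OK_RESPONSE);
        }
    };

With TA enabled, delay annotated on such a target contributes to the timing the CPU model observes, which is the mechanism being described above.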
In general, using the timing annotations that are available requires that TA is enabled in the Fast Model; it is off by default. Enable it by setting the environment variable "FASTSIM_DISABLE_TA" to 0 before starting the model.
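On a Linux host, for example (the model name and options are placeholders):

    # enable timing annotation, then start the model
    export FASTSIM_DISABLE_TA=0
    FVP_Base_RevC-2xAEMvA <other options>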
Hello Rob,
Thank you very much for the detailed explanation. This will help! I will experiment with this further and will come back to ask if I have any questions.
Oh I see. Thanks for the clarification!
Hello Ronan, as for relative comparisons: I'm wondering whether the performance of vector instructions is correctly captured by the Fast Models. To be more specific: I'm currently comparing an implementation of an algorithm that uses regular vector loads to one that uses gather memory accesses, on the Corstone300 MPS2. Would the performance penalty that gather loads typically incur already be included in the number of cycles reported by the Fast Model in this case? I'm comparing the cycle numbers reported by the PMU_CCNTR register, if that makes any difference.
Hi Eltro, I would not rely on the Fast Model for that level of accuracy. You may see that one implementation is 'better' than the other, but I would not rely on 'how much better', for exactly this kind of reason (this is, unfortunately, the cost of keeping these models "fast").
Great, that is good to know. Of course I'll need to benchmark on real hardware in the end, but knowing that the numbers from the Fast Model are not completely off in this case is already quite helpful.