I want to improve the performance of some games on our platform. I have heard that the cache lockdown feature can be used to improve performance. Please suggest how to use this feature on the ARM Cortex-A7 for performance improvement.
The L1 and L2 caches of the Cortex-A7 don't support lockdown, so you may want to consider other ways to improve performance. You can refer to the following links.
Re: L2 Cache difference between Cortex-A9 and Cortex-A7?
Does ARMv8 SOC support cache lockdown?
I'd just like to add a comment on this ...
I have heard of cache lock down feature to achieve performance improvement.
... as I've seen this comment before in other forums and would like to pin some additional information somewhere where it is visible to the wider ARM community, as it is a horribly incorrect misconception which I would like to stamp on.
Cache lockdown is almost never a good thing for performance for general purpose application code. Caches are nearly always smaller than the total overall code/data set on a platform. If you lock down 25% of the cache to accelerate one application, you effectively make the cache 25% smaller for everything else. You may make one application faster (possibly, if it happens to have a small data set, although even this is questionable for most applications which are bigger than the L2 size), but at the expense of making everything else running on the platform slower. For this reason cache lockdown is nearly always "the wrong thing", even if it were available. For most software development there is unfortunately no quick fix to making software run faster; you just need to profile and optimize your application hotspots to use better algorithms, cleaner code, less memory, etc.
The only real use case for cache lockdown is for critical sections of hard-realtime systems where guaranteed performance of small code sections (interrupt handlers and the like) is required, and the overall loss of cache (and drop in performance for everything else) is viewed as an acceptable sacrifice to achieve that predictable response time. It is also worth noting that in many markets needing realtime response, TCM is generally available as a synthesis option in the Cortex-R family, so even in those markets there are better alternatives to cache lockdown which provide better area efficiency.
In summary - cache lockdown generally makes your platform slower (due to the smaller average cache size remaining after lockdown), but buys predictable performance for critical realtime sections. It is not, and never has been, an optimization technique to make application code run faster.
HTH, Pete
Hi Peter,
Thanks a lot for your detailed description.
Could you please explain the following line a little better:
It is also worth noting that in many markets needing realtime response, TCM is generally available as a synthesis option in the Cortex-R family, so even in those markets there are better alternatives to cache lockdown ...
Also, could you please let me know the steps to perform cache lockdown for a piece of code (I want to try it on my own).
Anshul
Hi Anshul,
Could you please explain the following line a little better: It is also worth noting that in many markets needing realtime response, TCM is generally available as a synthesis option in the Cortex-R family, so even in those markets there are better alternatives to cache lockdown ...
A TCM (Tightly Coupled Memory) is a flat memory which exists at the same (or similar) level in the memory hierarchy as the L1 cache, often split into pairs so you have an I-TCM and a D-TCM. The processor can use TCM directly without needing to go via the cache for the TCM-mapped address ranges.
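To make that concrete, here is a minimal sketch of how code is typically placed into TCM on a Cortex-R class device. It assumes your linker script defines an ".itcm_code" output section mapped to the I-TCM address range and that startup code has already enabled the TCM; the section name and attributes are toolchain/BSP specific, not anything standard.

/* Sketch only: assumes the linker script provides an ".itcm_code"
 * section located in the I-TCM address range, and that the TCM has
 * been enabled and initialised by the startup code. */
#include <stdint.h>

/* Place a latency-critical handler directly in I-TCM so instruction
 * fetches never depend on cache hit/miss behaviour. */
__attribute__((section(".itcm_code"), noinline))
void critical_irq_handler(void)
{
    /* hard-realtime work here */
}

Compared with locking the handler into a cache way, this costs no cache capacity for the rest of the system, which is the point Pete is making above.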
As Wangyong mentioned, lockdown isn't supported at all on the Cortex-A7. It is "optional" in the ARM architecture and an "implementation option" in the standalone L2 cache controllers. It is very rarely actually implemented, because it is so rarely useful, so there is a good chance that you can't actually do this on your device.
HTH,
Pete
Thanks a lot for the clarification. I want to try it on some other hardware. If possible, please guide me on how to perform cache lockdown on a piece of code. I want to check the impact of cache lockdown. Thanks again.
It depends a little on the hardware you have, so I would check the TRM.
For example – here are the instructions for the L220 cache controller:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0329l/Beieiiab.html
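As a rough illustration of the lockdown-by-way sequence that the L220 documentation describes, here is a hedged C sketch. The controller base address, the 0x900/0x904 register offsets, the way count and the 32-byte line size are assumptions you must verify against your SoC's memory map and the controller TRM linked above; the sequence also assumes a quiescent system (e.g. early boot, one core running).

/* Sketch only: preload a buffer into one L2 way and lock it there,
 * on an L220-style controller. All addresses/offsets below are
 * assumptions - check your SoC memory map and the L220 TRM. */
#include <stdint.h>

#define L2CC_BASE        0xFFF10000u   /* SoC specific - assumption */
#define L2CC_D_LOCKDOWN  (*(volatile uint32_t *)(L2CC_BASE + 0x900))
#define L2CC_I_LOCKDOWN  (*(volatile uint32_t *)(L2CC_BASE + 0x904))
#define NUM_WAYS         8u            /* check the cache configuration */
#define ALL_WAYS         ((1u << NUM_WAYS) - 1u)

/* A set bit in a lockdown register means "do not allocate into this way". */
static void l2_lock_buffer(const uint8_t *buf, uint32_t size, uint32_t way)
{
    volatile const uint8_t *p;
    uint32_t target = 1u << way;

    /* 1. Block allocation into every way except the target one. */
    L2CC_D_LOCKDOWN = ALL_WAYS & ~target;

    /* 2. Touch the data so the line fills land in the target way.
     *    The buffer should have been cleaned/invalidated first so
     *    these reads really miss and allocate. */
    for (p = buf; p < buf + size; p += 32)   /* 32-byte line - check TRM */
        (void)*p;

    /* 3. Lock the target way and re-enable allocation elsewhere. */
    L2CC_D_LOCKDOWN = target;
}

For locking code rather than data you would use the instruction lockdown register and execute through the routine instead of reading it. Bear in mind everything said earlier in the thread: every locked way is capacity lost to the rest of the system.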
Regards,
Hi Peter,
Due to the TrustZone security extension of the Cortex-A processors, cache lines tagged as secure can't be replaced by a non-secure line fill or accessed by non-secure reads/writes. So do we need to clean and invalidate the cache at the end of secure-world software operation, to avoid normal-world software being able to use only part of the cache?
Best Regards.
Hi Wangyong,
cache lines tagged as secure can't be replaced by a non-secure line fill or accessed by non-secure reads/writes.
Half right =)
TrustZone guarantees that the non-secure environment cannot read or write secure lines. However, it does not guarantee non-eviction of secure lines by non-secure accesses. The line fill policy is fully dynamic - it's not a statically partitioned cache, and a non-secure line fill can evict secure entries (and vice versa).
Hi Peter/Wangyong,
I have one more question regarding cache lockdown. Can the lockdown mechanism be used only for the L1 (I-cache/D-cache), only for the L2 cache, or for both L1 and L2? In the Cortex-A7 TRM I found the following line in the L1 instruction cache controller section:
no lockdown support
So does it mean that only the L1 doesn't support lockdown? Can we use lockdown for the L2 cache? Please help me to clarify this. Thanks a lot.
Thanks a lot. I misunderstood the cache eviction behaviour, and I also found a description in the TrustZone whitepaper that matches your explanation.
Caches
It is a desirable feature of any high performance design to support data of both security states in the caches. This removes the need for a cache flush when switching between worlds, and enables high performance software to communicate over the world boundary. To enable this the L1, and where applicable level two and beyond, processor caches have been extended with an additional tag bit which records the security state of the transaction that accessed the memory.
The content of the caches, with regard to the security state, is dynamic. Any non-locked down cache line can be evicted to make space for new data, regardless of its security state. It is possible for a Secure line load to evict a Non-secure line, and for a Non-secure line load to evict a Secure line.
Neither the L1 (I-cache/D-cache) nor the L2 cache of the Cortex-A7 supports lockdown. The TRM just doesn't describe this in detail.
Hi Peter/Wangyong
Is it possible to check the contents of the cache at run time? i.e. in case of any problem in the system, I want to check whether the cache contents are in sync with main memory or not. How can I store the cache contents in RAM? If it is possible, please let me know the procedure. Thanks a lot!
Not easily - you're trying to make visible something CPUs try very hard to make invisible.
If you hit what you think are coherency problems then you can try inserting cleans of the entire cache, and if the problems go away then that is likely your problem. Another approach I've used in the past, if you know the address range which is causing problems, is to use a master outside of the CPU (such as a DMA engine) to create a copy of the main memory contents, which the CPU can then read back and compare against its view of the original data.
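To illustrate the second technique, here is a minimal sketch of the comparison step. dma_copy_to_uncached() is a hypothetical placeholder for whatever DMA or other non-CPU-master API your platform provides, and the shadow buffer is assumed to live in an uncached (or freshly invalidated) region so the copy really reflects main memory.

/* Sketch only: compare the CPU's (possibly stale) view of a buffer
 * against a copy made by a non-CPU master. dma_copy_to_uncached() is
 * a hypothetical placeholder for your platform's DMA API, and
 * 'shadow' must be uncached or invalidated before the check. */
#include <stdint.h>
#include <stdio.h>

extern void dma_copy_to_uncached(void *dst, const void *src,
                                 uint32_t size);   /* hypothetical */

/* Returns the offset of the first mismatch, or -1 if the CPU's view
 * of 'buf' matches what is actually in main memory. */
int check_coherency(const uint8_t *buf, const uint8_t *shadow,
                    uint32_t size)
{
    uint32_t i;

    /* Copy main memory via the DMA engine, bypassing the CPU caches. */
    dma_copy_to_uncached((void *)shadow, buf, size);

    for (i = 0; i < size; i++) {
        if (buf[i] != shadow[i]) {
            printf("mismatch at offset %u: cpu=0x%02x mem=0x%02x\n",
                   i, buf[i], shadow[i]);
            return (int)i;
        }
    }
    return -1;
}

A mismatch tells you the CPU is holding a dirty or stale line for that address, which is the symptom you would expect from a missing clean or invalidate somewhere in the software.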
Thanks a lot for your reply.
As you know, it's very hard to reproduce cache coherency problems. I am working on Android, and so far I suspect that two problems are caused by cache coherency, but I don't have any proof to confirm this. So what I want is: if an exception occurs, I copy the contents of the cache into RAM. By doing this I can compare the cache contents with RAM and confirm whether the problem is due to cache coherency or some other reason. I am also not sure about the address range. So can you suggest how to analyze such problems if we can't dump the cache contents into RAM?
While I haven't examined it in depth, the TLB seems to retain secure entries for a long time. I copied the normal-world boot code via the secure world and did not flush the secure TLB. After one day of use, the initial secure entries are still present in the TLB. Either the normal-world OS is avoiding section entries, or secure TLB entries on the Cortex-A5 persist for some time. I also have a 'NULL' TLB entry from the secure OS, so it has been interesting to dump the TLBs. I definitely don't think the eviction of secure-world entries is 'standard'. The normal-world OS definitely accesses the same boot sections, and no normal-world entries are allocated.
Also, the eviction of secure-world lines makes them susceptible to the same attacks discussed in this hyper-threading paper.