Hi,
I'm an FPGA designer, and this new project is challenging for me because it involves the ACP port and the L2 cache of the ARM core in the Zynq device!
It's all new to me and I guess it will need some tricky software, so any help, advice or C examples would be great!
What I need to do is:
- Periodically, the PL has to store, through the ACP port, a fixed amount of data at a fixed address (provided by the software) in the L2 cache
- Each time the data has been updated in the L2 cache, the software fetches it for processing.
- The L2 cache is supposed to be the "storage memory", so, if possible, I don't want any physical cacheable memory behind it! (If really needed, I can create a "phantom" address section in the PL, meaning I can respond to the AXI accesses but without any physical memory behind them.)
Context:
- No DDR available,
- Single core Zynq Cortex A9 device,
- Software executes from OCM
For now I have the following information:
- (Zynq TRM ) ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] =1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.
Note: The transaction can optionally allocate into the L2 cache if the write parameters are set accordingly.
=> What I understand is that:
- The data will be written both into the L2 cache and to the destination physical memory because of SCU coherency? Or does coherency only mean the SCU will update the cache status of the associated line?
- If so, does this mean I have to use a write-allocate AWCACHE value to have the data also written into the L2 cache?
- Can I eliminate the physical memory update by using the lock attribute? Does that mean driving the ACP AWLOCK signal, or locking down the associated L2 cache section from software?
Questions:
- In the software, how do I "reserve" the storage room in the L2 cache?
- In the software, how do I get the address the ACP master is supposed to write to?
- In the software, what configuration steps do I have to perform to use the L2 cache in this mode?
As you can see, it is all pretty confusing to me right now, so any help would be very welcome!
Many thanks in advance.
Hello Florian,
Thank you for your answer.
As a first approach, what I did in my design is to implement a physical memory in the FPGA fabric.
So the theoretical data path would be:
ACP master in FPGA => SCU in ARM core => L2 cache in ARM core => ARM central interconnect => GP0 AXI master port => memory in FPGA.
This lets me observe, for understanding, which accesses actually reach the back-end memory! The final, theoretical goal being to have no accesses to it at all!
flongnos said: "no read miss: always write the L2 cache before any read to a specific address"
=> That should always be the case, as the software is synchronized to an event from the ACP master before it fetches the data from the L2 cache.
For now I'm studying what kind of ACP access to generate in order to write into the L2 cache!
My best lead is the Zynq TRM:
- ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] =1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.
As I'm new to caches, a plain-language explanation would be welcome!
What does this mean?
1- "The SCU enforces coherency" => what does that mean exactly?
2- "When the data is present in one of the Cortex-A9 processors" => meaning in the L1 cache? In registers? Somewhere else?
3- "The transaction can optionally allocate into the L2 cache if the write parameters are set accordingly." => What is cache allocation? Why is it optional?
What I take from this TRM extract is that points 1 & 2 only concern L1, and that if I want L2 to be involved I have to set parameters (which ones? how?) so that the SCU allocates (meaning stores?) the data in L2.
OK, this extract is only a few lines long, but it contains a lot of new information for me!
Can someone rephrase it more clearly?
Thank you.
Although you will use the ACP interface, that does not by itself mean the request is coherent with the cache system. Coherent means that all copies of a given piece of data or instruction in the cache and memory system are identical.
So for a request to be coherent, you must set the write-address channel attribute bits AWUSER[0] = 1 and AWCACHE[1] = 1. You will need to check how to do that in your design; I have not used the ACP myself so far.
Then, to follow up on your questions:
1. The SCU enforces coherency with a snoop-based mechanism. As you may know, the Snoop Control Unit (SCU) keeps copies of the tags of the data present in the L1 caches in its own tag RAMs.
2. As a result, if the data is present in any of the L1 caches, the SCU resolves the conflict (for an ACP write, the L1 copy is cleaned and invalidated, as the TRM extract says) so that the CPU never sees stale data in its L1 caches. Internally it uses the MESI coherency protocol.
3. The allocation policy determines whether or not a cacheline is reserved in the cache when a transaction misses. For example, read no-allocate means the line read out of memory is not reserved and filled in the L1 or L2 cache on its way to the requester. The read allocation policy and the write allocation policy can be set separately.
4. Yes; different cache levels can have different allocation policies.
5. The write policy (write-back or write-through) and the cache allocation policy are configured through the memory attributes in the MMU translation table. So, for example, if you want to use a back-end memory in the FPGA, you must pay attention to the FPGA GP0 and GP1 memory ranges and configure their attributes as Normal, inner-cacheable/outer-cacheable, with the specific policies you want. The supported combinations and encodings can be found in the Zynq TRM.
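If you are on the Xilinx standalone BSP, a minimal sketch of that last point could look like the following. Treat it only as a sketch: the base address and size are placeholders for your GP0 window, Xil_SetTlbAttributes() is the BSP helper for patching a translation-table entry at run time, and the attribute value is the usual "Normal, shareable, inner/outer write-back write-allocate" section encoding used for cacheable DDR in Xilinx's translation_table.S. Double-check it against the TRM for the exact policy combination you want.

```c
#include "xil_mmu.h"   /* Xil_SetTlbAttributes() from the Xilinx standalone BSP */

/* Placeholders: base and size of the back-end memory window in the GP0 range */
#define BACKEND_BASE  0x40000000U   /* hypothetical GP0 address */
#define BACKEND_SIZE  0x00100000U   /* one 1 MB MMU section     */

/* Section attributes: Normal memory, shareable, inner/outer write-back
 * write-allocate (TEX = 0b101, C = 0, B = 1, S = 1), full access.
 * Same value Xilinx uses for cacheable DDR sections in translation_table.S. */
#define ATTR_NORMAL_WBWA_SHARED  0x15DE6U

void map_backend_cacheable(void)
{
    /* The short-descriptor table maps 1 MB sections, so one call covers
     * the whole window that contains the 64 KB buffer.                  */
    Xil_SetTlbAttributes(BACKEND_BASE, ATTR_NORMAL_WBWA_SHARED);
}
```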
Good luck.
Florian
Thank you for these explanations.
I'm designing my own ACP master in VHDL that interfaces directly to the ARM core's ACP port, without going through an AXI interconnect.
This way I have total control over the AXI signal generation, especially the AWUSER and AWCACHE signals.
I keep trying to understand the whole process, and I found some valuable information in the Zynq TRM:
In the ARM architecture, the inner attributes are used to control the behavior of the L1 caches and write buffers. The outer attributes are exported to the L2 or an external memory system.
Write Allocate: If the transfer is a write and it misses in the cache, then it should be allocated. This attribute is not valid if the transfer is not cacheable.
Since I will write new data on each sequence, they will always miss in the cache, so they have to be allocated (written) into the cache. So the write-allocate and cacheable attributes have to be generated by my ACP master.
This is the configuration that I think best fits my needs, right?
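For reference, here is my current reading of the AXI3 AWCACHE encoding plus the TRM sentence quoted above, written down as C constants just for documentation (the names are mine, and the real signals are of course driven by my VHDL master); please correct me if any of the values are wrong:

```c
/* AXI3 AWCACHE bit meanings, as I read them from the AXI3 spec:
 *   bit 0 = Bufferable, bit 1 = Cacheable,
 *   bit 2 = Read-Allocate, bit 3 = Write-Allocate */
#define AWCACHE_BUFFERABLE      (1u << 0)
#define AWCACHE_CACHEABLE       (1u << 1)   /* AWCACHE[1] = 1: required for a coherent ACP write */
#define AWCACHE_READ_ALLOCATE   (1u << 2)
#define AWCACHE_WRITE_ALLOCATE  (1u << 3)

/* What my ACP master would drive on each write burst:
 * cacheable + write-allocate so the data gets allocated into L2,
 * and AWUSER[0] = 1 so the SCU treats the request as coherent. */
#define ACP_AWCACHE  (AWCACHE_BUFFERABLE | AWCACHE_CACHEABLE | AWCACHE_WRITE_ALLOCATE)
#define ACP_AWUSER   0x1u   /* AWUSER[0] = 1 */
```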
When the processor reads the data, it should come from L2; since the data will be fresh and present in L2, that will be a cache hit.
When the ACP master writes the data, it should go to L2; since there will be room in L2, the data will be stored there, also a cache hit.
But since the policy is write-back, the back-end memory will only be updated when the data is evicted from (or cleaned out of) L2, right?
Even though I can acknowledge the writes arriving at the back-end memory mapping without actually having the memory there, are there ways to avoid them altogether? Something to configure in the SCU / L2 cache controller before or after accessing the data? Something like locking down the data in L2?
Thanks.
Hello tef70,
I had to change my account so that I could answer again...
You should first consider what kind of data structure you want to store in the L2 cache.
Is it a circular buffer? How does the CPU know which data to read out?
This will influence your address range.
That said, in order to avoid read misses, you should make your data structure fit in the L2 cache.
The first writes to any given address will miss in the L2 cache, but later you may reuse the same addresses and therefore always hit. These first write misses are called compulsory misses.
All that said, you should reserve a region of memory within the FPGA GP0 range of the MMU tables for the virtual back-end memory. If you stick to the above strategy, I don't think there will be any write-back to it.
You should also be careful about initialisation (a small code sketch follows below):
1) For the L2 cache: never clean, always invalidate (to the Point of Unification)
2) Make sure that the Level of Unification Uniprocessor (LoUU) field reads as 3'b001, i.e. the L1 cache, in the Cache Level ID Register
3) Make sure that the maintenance broadcast setting is 3'b010, i.e. it only depends on the individual instruction's behaviour (e.g. an invalidate to PoU will invalidate at the L1 cache level)
4) For the MMU translation table, modify the translation_table.S file so that:
a) no page address from the DDR range is present,
b) OCM is non-cacheable
c) FPGA GP0 and GP1 ranges are strongly-ordered by default
d) the range for your virtual FPGA back-end memory is inner cacheable (write back, read allocate, write allocate), outer cacheable (write back, read no-allocate, write-allocate)
Note: locking down L2 cachelines does not prevent all read misses...
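Here is the small sketch I mentioned, illustrating point 1): invalidate-only maintenance of the buffer region, so that nothing dirty is ever written back to the virtual back-end memory. It assumes the Xilinx standalone BSP (the xil_cache_l.h helpers) and a placeholder address/size; treat it as a sketch, not a tested recipe.

```c
#include "xil_cache_l.h"   /* Xil_L1DCacheInvalidateRange(), Xil_L2CacheInvalidateRange() */

/* Placeholders: the back-end buffer window discussed above */
#define BUF_BASE  0x40000000U   /* hypothetical address in the GP0 range */
#define BUF_SIZE  0x00010000U   /* 64 KB */

/* Drop any stale copies of the buffer without writing anything back to the
 * (virtual) back-end memory: invalidate only, never clean or flush this range.
 * Call this at initialisation, while the CPU is not touching the region.      */
void drop_buffer_copies(void)
{
    Xil_L1DCacheInvalidateRange(BUF_BASE, BUF_SIZE);
    Xil_L2CacheInvalidateRange(BUF_BASE, BUF_SIZE);
}
```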
Thanks again for your detailed answers!
Some more details on the data being handled:
- The data come from an ADC interfaced on the FPGA side; ADC samples are 16 bits wide
- The data are always stored by the ACP master at the same address from one acquisition sequence to the next
- The maximum data size is 64 KB
The sequence is the following:
- A periodic tick starts the sequence in software
- The software requests an acquisition sequence from the FPGA, with the data count as a parameter, in the range [2 : 64 KB]
- The software goes into a wait state
- The FPGA runs the acquisition and stores the requested data in the L2 cache,
- The FPGA signals an event to the software when all the data have been stored in the L2 cache,
- The software uses the data in the L2 cache
- The software goes idle and waits for the next tick
So the data structure is an array of 16-bit values stored in a fixed memory section (fixed address, fixed size).
The CPU knows how it handles the data in its processing algorithm: it starts with the first item, then goes sequentially through to the last one.
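To check my understanding of the software side, here is the kind of per-tick routine I have in mind. It is only a rough sketch: the buffer address, the start/wait functions and their names are placeholders of my own, not real drivers.

```c
#include <stdint.h>

#define ACQ_BUF_ADDR  0x40000000U   /* hypothetical fixed buffer address in the GP0 range */

/* Placeholder hooks to the FPGA logic: in the real design these would be
 * register writes and interrupt/event handling. */
extern void fpga_start_acquisition(uint32_t nb_samples);   /* hypothetical */
extern void wait_for_fpga_done_event(void);                /* hypothetical */
extern void process_samples(const volatile uint16_t *buf, uint32_t nb_samples);

void acquisition_tick(uint32_t nb_samples)
{
    const volatile uint16_t *samples = (const volatile uint16_t *)ACQ_BUF_ADDR;

    fpga_start_acquisition(nb_samples);   /* FPGA fills the buffer via coherent ACP writes */
    wait_for_fpga_done_event();           /* wait for the "all data stored in L2" event    */

    /* Since the ACP writes are coherent and allocate into L2, the CPU should be
     * able to simply read the buffer and hit in the L2 cache.                   */
    process_samples(samples, nb_samples);
}
```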
So the data fits in the 512 KB L2 cache, and it seems it would even fit in a single 64 KB way of the L2 cache?
As you said, the data locations will always be reused, leading to cache hits every time except the first.
The mapping of the back-end memory on GP0 is completely free, so it will definitely be placed in a cacheable range of the MMU configuration.
I now have to dig deeper into the initialisations you mentioned!
Thank you very much.
Hi tef70,
The L2 cacheline size is 32 bytes on the ARM Cortex-A9 (ARMv7-A), so your data will span many cachelines (64 KB / 32 B = 2048 lines).
But with the AXI write strobes (byte enables) you can modify a cacheline at a granularity of 1 byte, so your individual 16-bit acquisition samples can be written to the L2 cache without prior buffering.
So I think you are good to go.