How to use the L2 cache as memory through ACP accesses on a Zynq Cortex-A9?

tef70 over 5 years ago

Hi,

I'm an FPGA designer, and this new project is challenging for me because it has to deal with the ACP port and the L2 cache of the ARM core in the Zynq FPGA device!

This is new to me and I guess it will need some tricky software, so any help, advice or C examples would be great!

What I need to do is :

- Periodically, the PL part has to store a fixed amount of data through the ACP port, at a fixed address in the L2 cache provided by software.

- Each time the data has been updated in the L2 cache, the software reads the data and uses it.

- The L2 cache is supposed to be the "storage memory", so if possible I don't want any physical cacheable memory behind it! (If really needed, I can create a "phantom" address region in the PL, meaning I can respond to the AXI accesses without any physical memory behind them.)

Context:

- No DDR available,

- Single core Zynq Cortex A9 device,

- Software executes from OCM

For now I have the following information :

- (Zynq TRM ) ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] =1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.

Note: The transaction can optionally allocate into the L2 cache if the write parameters are set accordingly.

=> What I understand is that :

- The data will be written both into the L2 cache and to the destination physical memory, because of SCU coherency? Or does coherency only mean that the SCU updates the cache status of the associated line?

- If yes, does this mean I have to use an AWCACHE value with write-allocate so that the data is also written into the L2 cache?

- I can eliminate the physical update step if I use the lock attribute? Does that mean using the ACP AWLOCK signal, or locking the associated L2 cache section in software?
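
As a reading aid for the TRM excerpt above, here is a minimal sketch of the coherency condition written as C predicates. The macro and function names are hypothetical; the bit positions for AWUSER[0] and AWCACHE[1] come from the excerpt, and AWCACHE[3] as the write-allocate hint follows the AXI3 AWCACHE encoding. Whether the L2 really allocates a line also depends on how the L2 controller is configured, which this sketch does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the TRM excerpt and AXI3. */
#define ACP_AWUSER_COHERENT_BIT   (1u << 0)  /* AWUSER[0]  */
#define ACP_AWCACHE_CACHEABLE_BIT (1u << 1)  /* AWCACHE[1] */
#define ACP_AWCACHE_WALLOC_BIT    (1u << 3)  /* AWCACHE[3]: write-allocate hint */

/* True when the write request qualifies as coherent per the excerpt:
 * AWUSER[0] = 1 and AWCACHE[1] = 1. */
static bool acp_write_is_coherent(uint8_t awuser, uint8_t awcache)
{
    return (awuser & ACP_AWUSER_COHERENT_BIT) != 0u &&
           (awcache & ACP_AWCACHE_CACHEABLE_BIT) != 0u;
}

/* True when the request carries the write-allocate hint. */
static bool acp_write_requests_allocate(uint8_t awcache)
{
    return (awcache & ACP_AWCACHE_WALLOC_BIT) != 0u;
}
```

For instance, AWUSER = 0x1 with AWCACHE = 0xF satisfies both predicates, while AWCACHE = 0x3 is coherent but carries no write-allocate hint.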

Questions :

- In the software, how do I "request" the storage room in the L2 cache?

- In the software, how do I get the address in the L2 cache where the ACP is supposed to write?

- In the software, what configuration do I have to do to use the L2 cache in this mode?

As you can see, for now it's pretty confusing for me, so any help would be great!

Many thanks in advance.


XNoOp over 5 years ago in reply to tef70

Hello tef70,

I had to change my account so that I could answer again...

You should first consider what kind of data structure you want to store in the L2 cache.

Is it a circular buffer? How does the CPU know which data to read out?

This will influence your address range.

This said, in order to avoid read misses, you shall make your data structure fit in the L2 cache.

The first writes to any given address will miss in the L2 cache, but later you may reuse the same addresses and therefore always hit. These first write misses are called compulsory misses.

All that said, you should reserve a region of memory within the FPGA GP0 range in the MMU tables for the virtual back-end memory. If you stick to the above strategy, I don't think there will be any write-back to it.

You shall also be careful about initialisation:

1) For L2 cache: never clean, always invalidate to Point of Unification

2) Make sure that the Level of Unification Uniprocessor (LoUU) field in the Cache Level ID Register is set to 3'b001, that is, the L1 cache

3) Make sure that the maintenance broadcast is set to 3'b010, that is, it depends only on the individual instruction behavior (e.g. invalidate to PoU will invalidate the L1 cache level)

4) For MMU translation table, modify the translation_table.S file so that:

a) no page address from the DDR range is present,

b) OCM is non-cacheable

c) FPGA GP0 and GP1 ranges are strongly-ordered by default

d) the range for your virtual FPGA back-end memory is inner cacheable (write back, read allocate, write allocate), outer cacheable (write back, read no-allocate, write-allocate)
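
To make items c) and d) concrete, here is a minimal sketch of how such entries could be computed, assuming the ARMv7-A short-descriptor format with 1 MB sections. The function and macro names are hypothetical. Two assumptions are mine and not stated above: the shareable bit is set on the cacheable region so that the SCU keeps ACP traffic coherent, and full read/write access permissions are used. Note that this descriptor format only selects write-allocate or no write-allocate for each cache level, so the "read no-allocate" part of d) has no separate encoding here.

```c
#include <stdint.h>

/* ARMv7-A short-descriptor, 1 MB section entry fields (hypothetical names). */
#define SECT_TYPE       0x2u                             /* bits[1:0] = 0b10 */
#define SECT_B          (1u << 2)                        /* bufferable bit */
#define SECT_DOMAIN(d)  (((uint32_t)(d) & 0xFu) << 5)    /* bits[8:5] */
#define SECT_AP_RW      (0x3u << 10)                     /* AP[1:0] = 0b11 */
#define SECT_TEX(t)     (((uint32_t)(t) & 0x7u) << 12)   /* bits[14:12] */
#define SECT_S          (1u << 16)                       /* shareable (assumption) */
#define SECT_BASE_MASK  0xFFF00000u                      /* 1 MB aligned base */

/* Item d): outer write-back write-allocate (TEX = 0b101),
 * inner write-back write-allocate (C = 0, B = 1). */
static uint32_t section_wbwa_shared(uint32_t phys_addr, unsigned domain)
{
    return (phys_addr & SECT_BASE_MASK) | SECT_S | SECT_TEX(0x5u) |
           SECT_AP_RW | SECT_DOMAIN(domain) | SECT_B | SECT_TYPE;
}

/* Item c): strongly-ordered (TEX = 0b000, C = 0, B = 0). */
static uint32_t section_strongly_ordered(uint32_t phys_addr, unsigned domain)
{
    return (phys_addr & SECT_BASE_MASK) | SECT_TEX(0x0u) |
           SECT_AP_RW | SECT_DOMAIN(domain) | SECT_TYPE;
}
```

With a 1 MB section at 0x40000000 (taken here as the start of the GP0 range) and domain 0, the first helper gives 0x40015C06 and the second gives 0x40000C02; the low bits are what would go into the corresponding translation_table.S entries.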

Note: Making the L2 cachelines locked does not prevent all read misses...

Florian

tef70 over 5 years ago in reply to XNoOp

Hello Florian,

Thanks again for your detailed answers !

Some more details on the handled data :

- The data come from an ADC interfaced on the FPGA side; the ADC samples are 16 bits wide

- The data are always stored by the ACP master at the same address from one acquisition sequence to another

- The maximum data size is 64KBytes

Sequence is the following :

- A periodic tick launches the sequence in software

- The software requests an acquisition sequence from the FPGA, with the amount of data as a parameter, in the range [2:64KBytes]

- The software goes into a wait state

- The FPGA runs the acquisition and stores the requested data in the L2 cache

- The FPGA signals an event to the software when all the data has been stored in the L2 cache

- The software uses the data in the L2 cache

- The software goes idle and waits for the next tick
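
The software side of this sequence could look roughly like the sketch below. Everything about the PL interface is hypothetical: the register block `acq_regs_t`, its field names and the polled `done` flag are only placeholders for whatever the FPGA design exposes, and the checksum stands in for the real processing. The buffer pointer would be the fixed address the ACP master writes to.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACQ_MIN_BYTES 2u
#define ACQ_MAX_BYTES (64u * 1024u)

/* Hypothetical PL register block; names and layout are illustrative only. */
typedef struct {
    volatile uint32_t byte_count; /* amount of data requested, in bytes */
    volatile uint32_t start;      /* written to 1 to launch an acquisition */
    volatile uint32_t done;       /* set to 1 by the PL once all data is stored */
} acq_regs_t;

/* Request an acquisition. Returns false if the size is outside
 * [2 bytes, 64 KB] or is not a whole number of 16-bit samples. */
static bool acq_request(acq_regs_t *regs, uint32_t byte_count)
{
    if (byte_count < ACQ_MIN_BYTES || byte_count > ACQ_MAX_BYTES ||
        (byte_count & 1u) != 0u) {
        return false;
    }
    regs->done = 0u;
    regs->byte_count = byte_count;
    regs->start = 1u;
    return true;
}

/* Wait state: the caller polls this (or sleeps until an interrupt). */
static bool acq_is_done(const acq_regs_t *regs)
{
    return regs->done != 0u;
}

/* Processing placeholder: walk the samples from first to last. */
static uint32_t acq_checksum(const volatile uint16_t *buf, uint32_t byte_count)
{
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < byte_count / 2u; i++) {
        sum += buf[i];
    }
    return sum;
}
```

On each tick the software would call `acq_request()`, wait until `acq_is_done()` returns true, and then read the buffer sequentially as `acq_checksum()` does.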

So the data structure is an array of 16-bit data stored in a fixed memory section (fixed address, fixed size).

The CPU knows how it handles the data in its processing algorithm: it starts with the first sample, then goes sequentially to the last one.

So the data fits in the 512KB L2 cache, and it seems that it even fits in a 64KB structure of the L2 cache?

As you said, data locations will always be reused, leading to cache hits every time except for the first time.
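
The sizing argument can be checked with a few lines of arithmetic. This is only a sketch: the 512KB total comes from this thread, while the 32-byte line size and the 8-way organisation are assumptions about the L2 controller, and the function names are hypothetical.

```c
#include <stdint.h>

#define L2_SIZE_BYTES (512u * 1024u) /* total L2 size, from the thread */
#define L2_LINE_BYTES 32u            /* assumption: cache line size */
#define L2_WAYS       8u             /* assumption: 8-way set associative */

/* Number of cache lines touched by a buffer of `size` bytes at `addr`. */
static uint32_t l2_lines_spanned(uint32_t addr, uint32_t size)
{
    if (size == 0u) {
        return 0u;
    }
    uint64_t first = (uint64_t)addr / L2_LINE_BYTES;
    uint64_t last  = ((uint64_t)addr + size - 1u) / L2_LINE_BYTES;
    return (uint32_t)(last - first + 1u);
}

/* Size of one way: the "64KB structure" under the assumptions above. */
static uint32_t l2_way_bytes(void)
{
    return L2_SIZE_BYTES / L2_WAYS;
}
```

Under these assumptions a line-aligned 64KB buffer covers 2048 lines, which is exactly the capacity of one way, so the only misses are the 2048 first-time writes; a buffer that starts in the middle of a line touches one extra line.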

The mapping of the back-end memory on GP0 is totally free, so it will definitely be placed in an address range that the MMU marks as cacheable.

I now have to dig deeper into the initializations you mentioned!

Thank you very much.

XNoOp over 5 years ago in reply to tef70

Hi tef70,

The L2 cacheline size is 32 Bytes for ARM Cortex A9 (ARMv7-A ISA), and your data will span over lots of cachelines.

But with the write enable (byte strobe) signals, you can modify a cacheline at a granularity of 1 byte. So your individual 16-bit acquisition samples can be written to the L2 cache without prior buffering.
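
As an illustration of the byte-granular write, here is a minimal sketch of how the byte enables for one 16-bit sample could be derived from its address. It assumes a 64-bit (8-byte) ACP write data bus, and the function name is hypothetical; in the real design this logic would live in the PL, so the C code only serves to show the arithmetic.

```c
#include <stdint.h>

#define ACP_BUS_BYTES 8u /* assumption: 64-bit ACP write data bus */

/* Byte-enable mask for one 16-bit sample at byte address `addr`.
 * Returns 0 if the address is not 16-bit aligned. */
static uint8_t wstrb_for_sample16(uint32_t addr)
{
    if ((addr & 1u) != 0u) {
        return 0u;
    }
    uint32_t lane = addr & (ACP_BUS_BYTES - 1u); /* byte lane: 0, 2, 4 or 6 */
    return (uint8_t)(0x3u << lane);
}
```

A sample at offset 0 of a bus word enables lanes 0 and 1 (mask 0x03), the next sample enables lanes 2 and 3 (mask 0x0C), and so on up to 0xC0 before wrapping to the next bus word.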

So I think you are good to go.

Good luck.

Florian