
Coprocessor interface information

Note: This was originally posted on 29th July 2013 at http://forums.arm.com

Hi

I'm continuing to struggle with the problem of manually invalidating the caches in my A9 based Zynq 7000 architecture...

My difficulty stems from a problem in using the on-board DMA-330 controller: the Xilinx architecture seems to leave me having to invalidate and flush all the data caches manually, which is rather time-consuming with the ARM cache controllers (having only line-by-line operations rather than range-based instructions). I'm trying to get an estimate of the peak performance I might achieve on DMA, given that the bulk of the time seems to be lost in cache invalidation and flushing.
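For reference, the line-by-line maintenance I mean looks roughly like this. This is only a sketch: it assumes a 32-byte L1 line on the A9 and wraps the CP15 clean-and-invalidate-by-MVA op (DCCIMVAC) in a hypothetical helper, stubbed out on non-ARM hosts so the address arithmetic can be checked anywhere:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u  /* Cortex-A9 L1 data cache line size */

/* Clean & invalidate one line by MVA via CP15 (DCCIMVAC).
 * Real inline asm on ARM; a no-op stub elsewhere. */
static void dccimvac(uintptr_t mva)
{
#if defined(__arm__)
    __asm__ volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(mva) : "memory");
#else
    (void)mva;  /* host build: no-op */
#endif
}

/* Clean & invalidate every line covering [addr, addr + len).
 * Returns the number of lines touched (handy for benchmarking). */
unsigned clean_inv_range(uintptr_t addr, size_t len)
{
    uintptr_t line = addr & ~(uintptr_t)(CACHE_LINE - 1);  /* align down */
    uintptr_t end  = addr + len;
    unsigned  n    = 0;

    for (; line < end; line += CACHE_LINE, ++n)
        dccimvac(line);

#if defined(__arm__)
    __asm__ volatile("dsb" ::: "memory");  /* one barrier after the batch */
#endif
    return n;
}
```

The single DSB after the whole batch (rather than one per line) is exactly the saving discussed below.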

Isogen74 was generous enough to answer my question about the need for a data sync barrier between sequential accesses to CP15, but I'm still struggling to get good estimates of likely performance, perhaps because of my own scruffy thinking.

My guess is that the A9 main core is much faster than the coprocessor, so, as a DSB isn't needed between sequential accesses, there must be some other mechanism for the coprocessor to hold off the core, which must impact performance. I could really do with understanding this interaction but have so far failed to track down a document covering it.


For instance, if there's a FIFO queueing the operations to the coprocessor, then its depth will determine how many cache lines can be invalidated in a burst before the core stalls; if the coprocessor runs at (perhaps?) a 16:1 cycle ratio, then I can start estimating the maximum burst size and inter-burst spacing I'll need without impacting other threads.
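If that guess were right, the stall could be modelled crudely as below. Both the FIFO depth and the cycle ratio are entirely made-up numbers for illustration, not anything from an ARM document:

```c
/* Back-of-envelope model: a FIFO of depth fifo_depth sits in front
 * of the maintenance unit, which retires one op every ratio core
 * cycles. A burst of `ops` maintenance operations then stalls the
 * core only for the portion that doesn't fit in the FIFO. */
unsigned stall_cycles(unsigned ops, unsigned fifo_depth, unsigned ratio)
{
    if (ops <= fifo_depth)
        return 0;                       /* burst fits in the FIFO: no stall */
    return (ops - fifo_depth) * ratio;  /* remainder drains at 1 op / ratio cycles */
}
```

Under this model, keeping bursts at or below the FIFO depth would cost the core nothing, which is why the depth (if such a FIFO exists) is the number I'm after.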


Does anyone know where I might find some description of the coprocessor architecture and core arbitration mechanisms?


Cheers


Joe.
  • Note: This was originally posted on 30th July 2013 at http://forums.arm.com

    Thank you Iso for your generous response and wasting your evening on my blathering!

    Your words are indeed wise on cache architecture and also on the ACP - our roadmap has our hardware team (he'll be pleased to know he's a 'team') implementing the interface into the FPGA fabric over ACP just as soon as he's solved some interface issues to the hardware blocks.

    Sadly our application has lots of CPU activity on bursty data before and after processing in the math-intensive fabric blocks. Our benchmarking shows we'll miss our targets wildly if we disable caches. Data destined for the FPGA accelerators will generally reside in DDR, having expired from the L1 caches but not from the L2 cache.

    So that leaves the humble softie trying to squeeze interim performance out of the existing hardware: getting the DMA-330 to slightly outperform memcpy.

    Benchmarking the flushes without data sync barriers between each CP15 operation suggests I'm now inside my current target (thank you Iso for that too!), but I want to check that I'm not introducing a processor stall that might affect other areas of our system - something much harder to measure empirically.

    Might I read your knowledgeable response, Iso, to say that there will be no (or little) delay beyond that needed to write out any dirty cache data to memory?

    I promise that I have spent hours searching the ARM site for information on the low-level 'coprocessor' operation on the A9; a hint of a few more appropriate keywords might send me off on my own research. I'm sure that ARM have published this stuff somewhere for eedjits like me!