
Coprocessor interface information

Note: This was originally posted on 29th July 2013 at http://forums.arm.com

Hi

I'm continuing to struggle with the problem of manually invalidating the caches on my A9-based Zynq-7000...

My difficulty stems from using the on-board DMA-330 controller: the Xilinx architecture seems to leave me invalidating and flushing all the data caches manually, which is rather time-consuming with the ARM cache maintenance operations (line-by-line access only, with no range-based instructions). I'm trying to estimate the peak DMA performance I might achieve, given that the bulk of the time seems to be lost in cache invalidation and flushing.
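
For concreteness, the loop I'm stuck with looks something like the sketch below (my own code, assuming a 32-byte A9 cache line and GCC inline assembly; clean_invalidate_range is just my name for it):

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 32u  /* Cortex-A9 L1 data cache line size */

    static void clean_invalidate_range(uintptr_t start, size_t len)
    {
        uintptr_t addr = start & ~(uintptr_t)(CACHE_LINE - 1u);
        uintptr_t end  = start + len;

        for (; addr < end; addr += CACHE_LINE) {
            /* DCCIMVAC: clean & invalidate data cache line by MVA to PoC */
            __asm__ volatile("mcr p15, 0, %0, c7, c14, 1" :: "r"(addr) : "memory");
        }
        /* one barrier after the batch so memory is coherent before the DMA runs */
        __asm__ volatile("dsb" ::: "memory");
    }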

Isogen74 was generous enough to answer my question about the need for a data sync barrier between sequential CP15 accesses, but I'm still struggling to get good estimates of likely performance, perhaps because of my own scruffy thinking.

My guess is that the A9 main core is much faster than the coprocessor, so if a dsb isn't needed between sequential accesses there must be some other mechanism for the coprocessor to hold off the core, which must impact performance. I could really do with understanding this interaction, but I have so far failed to track down a document describing it.


For instance, if there's a FIFO queueing the operations to the coprocessor, then its depth will limit the number of cache lines that can be invalidated in a burst before the core stalls; if the coprocessor runs at (perhaps?) a 16:1 cycle ratio then I can start estimating the maximum burst size and inter-burst spacing I'll need without impacting other threads.
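
Just to show the shape of the estimate I'm after (every number below is a placeholder guess of mine, not a documented value):

    #define FIFO_DEPTH   4u   /* hypothetical queue depth - pure guess     */
    #define CYCLE_RATIO 16u   /* hypothetical core:coprocessor cycle ratio */

    /* e.g. a 4 KiB buffer / 32-byte lines = 128 maintenance ops,
     * so roughly (128 - 4) * 16 = 1984 stall cycles */
    static unsigned estimate_stall_cycles(unsigned num_lines)
    {
        if (num_lines <= FIFO_DEPTH)
            return 0;  /* whole burst absorbed by the queue */
        return (num_lines - FIFO_DEPTH) * CYCLE_RATIO;
    }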


Does anyone know where I might find some description of the coprocessor architecture and core arbitration mechanisms?


Cheers


Joe.
  • Note: This was originally posted on 30th July 2013 at http://forums.arm.com

    Thank you Iso for your generous response, and for wasting your evening on my blathering!

    Your words are indeed wise on cache architecture and also on the ACP - our roadmap has our hardware team (he'll be pleased to know he's a 'team') implementing the interface into the FPGA fabric over ACP just as soon as he's solved some interface issues with the hardware blocks.

    Sadly our application has lots of CPU activity on bursty data before and after processing in the math-intensive fabric blocks. Our benchmarking shows we'll miss our targets wildly if we disable caches. Data destined for the FPGA accelerators will generally reside in DDR, having expired from the L1 caches but not from the L2 cache.

    So that leaves the humble softie trying to squeeze interim performance out of the existing hardware; getting the DMA-330 to slightly outperform memcpy.

    Benchmarking the flushes without data-sync barriers between each CP15 operation suggests I'm now inside my current target (thank you Iso for that too!), but I want to check that I'm not introducing a processor stall that might affect other areas of our system - something much harder to measure empirically.
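
    For reference, the pattern under test is roughly the following - batch the MCRs and issue a single dsb at the end rather than one per line (DCCMVAC shown, i.e. clean to PoC for data the DMA will read; 32-byte A9 lines assumed):

        /* clean a buffer line by line, one dsb for the whole batch */
        for (uintptr_t a = start & ~(uintptr_t)31u; a < end; a += 32) {
            /* DCCMVAC: clean data cache line by MVA to PoC */
            __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(a) : "memory");
        }
        __asm__ volatile("dsb" ::: "memory");  /* single barrier for the batch */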

    Might I read your knowledgeable response, Iso, as saying that there will be no (or little) delay beyond what's needed to write out any dirty cache data to memory?

    I promise that I have spent hours searching the ARM site for information on the low-level 'coprocessor' operation on the A9; a hint of a few more appropriate keywords might send me off on my own research. I'm sure that ARM have published this stuff somewhere for eedjits like me!
  • Note: This was originally posted on 29th July 2013 at http://forums.arm.com

    Hi Joe,

    What are you actually trying to do?

    The traditional use for a DMA is to move data from a non-CPU device to memory, or vice versa; as a technology it works best when the CPU doesn't touch the data at all and just handles the control plane. Once data is inside the CPU caches it will probably end up being no faster than a memcpy; you've done the expensive part (pulling in the data) so you may as well write it out to the right place, and as you are finding, anything involving cache maintenance is relatively expensive.

    For cases where there is a data access which the CPU must touch _and_ an external master must see there are a couple of options.

    Firstly you can simply mark the memory as "non-cached, buffered" on the CPU. This allows the CPU to write to the memory efficiently (but not read from it) while avoiding the need to flush caches - you can just drain the write buffer, which is relatively quick. This is the common approach for most high-performance devices such as video encode/decode and 3D graphics devices; they fit the data stream model well (CPU write-only, GPU read-only) and it avoids putting cache maintenance overheads on the CPU. It also avoids polluting the cache with data the CPU is never actually going to read, which helps performance too.
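
    As a sketch of that pattern (how the buffer gets mapped as "non-cached, buffered" depends on your MMU tables or OS, and dma_kick() is just a placeholder, not a real API):

        #include <stddef.h>
        #include <stdint.h>

        extern void dma_kick(void);        /* placeholder: start the DMA-330 */
        extern volatile uint8_t *dma_buf;  /* mapped "non-cached, buffered"  */

        void produce_and_send(const uint8_t *src, size_t len)
        {
            for (size_t i = 0; i < len; i++)
                dma_buf[i] = src[i];       /* CPU-side fill: write-only */

            /* drain the write buffer - much cheaper than a cache flush */
            __asm__ volatile("dsb" ::: "memory");
            dma_kick();
        }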

    Secondly you can attach a bus master to the ARM core's ACP port, which allows it direct access to the memory in the CPU caches, so avoiding the flush overheads. The ACP isn't designed for very high bandwidth masters, such as a GPU, but can sustain a moderate amount of traffic. This is obviously a hardware change, not a software one, so I'm not 100% sure what is possible in the Zynq.


    In terms of your question about predicting performance - for anything touching memory the answer really depends on the platform; the CPU itself is only part of the story. There is no "coprocessor" per se in any modern ARM core - the ISA naming for these instructions is really a throwback to older ARM cores such as the ARM9. Cache flushes are just generally quite expensive (any operation touching memory is always much slower than the arithmetic core itself, so try to avoid them where possible).

    HTH,
    Iso