Correctly invalidating Cortex-A53 shared L2 cache for access through ACP?

I have a Zynq UltraScale+ design, described below, in which I can't get L2 cache regions invalidated correctly for reads that arrive over the ACP port from the FPGA fabric:

  • PCIe interface in the FPGA fabric, copies data to DDR
  • A53 cores used to run tasks, operating on that memory
    • MMU is used by the A53 cores, and all memory regions in question here are marked as Normal Cacheable, Inner Shareable.
  • FPGA contains hardware acceleration for parts of those tasks, accessing the L2 cache through the ACP port
    • FPGA accelerator contains its own (small) L1 cache, and only reads memory.
    • Reads through the ACP port are set to allocate in the L2 cache.
    • Reads are specified as Outer Shareable (since this device is outside of the "inner" A53 cluster).
    • Note that the FPGA accelerator does NOT touch the same memory as is used by the A53 cores! The A53 cores should never be pulling this memory into their L1 caches.
  • Before running tasks, A53 cores invalidate their own L1 caches for the regions of memory they will be accessing, followed by core 0 invalidating L2 for those same regions
    • Core 0 also invalidates the L2 for memory regions that will be read by the FPGA accelerator
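
A minimal sketch of what the invalidation step in the list above could look like. It assumes invalidation by virtual address with `DC IVAC`, the Cortex-A53's 64-byte cache line, and code running at EL1 or higher. The helper names (`line_align_down`, `lines_in_range`, `invalidate_range`) are hypothetical, and this is not necessarily the exact routine used in the design:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cortex-A53 L1 data and L2 caches use 64-byte lines. */
#define CACHE_LINE_BYTES 64u

/* Round an address down to the start of its cache line. */
static inline uintptr_t line_align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* Number of cache lines touched by the byte range [start, start + len). */
static inline size_t lines_in_range(uintptr_t start, size_t len)
{
    if (len == 0) {
        return 0;
    }
    /* The range must not wrap around the end of the address space. */
    assert(len - 1 <= UINTPTR_MAX - start);

    uintptr_t first = line_align_down(start);
    uintptr_t last  = line_align_down(start + (len - 1));
    return (size_t)((last - first) / CACHE_LINE_BYTES) + 1;
}

#if defined(__aarch64__)
/*
 * Assumption: runs at EL1 or higher, where DC IVAC is permitted.
 * Invalidates every line overlapping the range by virtual address.
 * A partially covered line is invalidated as a whole, so any dirty
 * data sharing that line is discarded.
 */
static inline void invalidate_range(uintptr_t start, size_t len)
{
    uintptr_t line = line_align_down(start);

    for (size_t n = lines_in_range(start, len); n > 0; n--) {
        __asm__ volatile("dc ivac, %0" : : "r"(line) : "memory");
        line += CACHE_LINE_BYTES;
    }
    /* Wait for the maintenance operations to complete. */
    __asm__ volatile("dsb sy" : : : "memory");
}
#endif
```

In this sketch, core 0 would call `invalidate_range()` once for each region the A53 tasks use and once for each region the FPGA accelerator reads, before kicking off the tasks.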

The typical setup is something like the following:

  • Host CPU sets up a bunch of data, uploads to device, and kicks off tasks to run
  • Similar tasks are executed on the same data for a while...
  • Host CPU changes some portion of the data, uploads to device, and kicks off new tasks to run
  • Run similar tasks for a bit, repeat

The behavior I am seeing is that even though A53 core 0 invalidates the L2 cache for every memory region the FPGA accelerator will access, the accelerator still reads stale data from the cache. I can confirm this by dynamically changing the ARCACHE flags on the ACP transactions (before starting any tasks) so that reads never allocate into the L2 cache: the correct data is then read until I re-enable cache allocation, at which point the cache fills and stale data starts being returned.

It seems that the A53 core's invalidation is not actually invalidating the L2 lines that were allocated by the ACP reads from the FPGA. It IS correctly invalidating the cache for the memory regions accessed by the A53 core tasks: cache accesses to those regions behave consistently with what has been invalidated.

Is there something else I need to be doing to get it to correctly invalidate the L2 cache?