Device-GRE memory attributes and A53 core lockup

I'm using a Xilinx Zynq UltraScale+ FPGA in a design where the four A53 cores in the cluster process data and write the results back to memory in the FPGA fabric (the PL) over an AXI bus. By default, Xilinx's bare-metal SDK sets up the A53 firmware to map the entire FPGA address space as Device-nGnRnE, and with this configuration the design works as expected. However, the A53 firmware's performance is critical to the design, so it is written using NEON intrinsics and stores to the FPGA memory in sequential blocks, which should lend themselves to being combined into larger AXI transactions.

To allow that, I added a new attribute to the MAIR_EL3 register specifying Device-GRE, so that the writes can be gathered and combined. After setting up a TLB entry mapping the necessary address space to that new MAIR index, I do see that the core/interconnect is combining the writes into larger AXI transactions before sending them to the PL, which is great!
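For reference, a minimal sketch of how the Device-GRE attribute fits into a MAIR value. The helper name mair_set_attr and the slot index are illustrative assumptions, not the Xilinx BSP code; the encodings are the ARMv8-A Device memory attribute values. The result would still have to be written to MAIR_EL3 with an msr instruction at EL3, which is not shown here.

```c
#include <stdint.h>

/* Device memory attribute encodings for one 8-bit MAIR_ELx field (ARMv8-A). */
#define MAIR_DEVICE_nGnRnE 0x00u
#define MAIR_DEVICE_nGnRE  0x04u
#define MAIR_DEVICE_nGRE   0x08u
#define MAIR_DEVICE_GRE    0x0Cu

/* Hypothetical helper: return `mair` with attribute slot `index` (0..7)
 * replaced by `attr`. All other slots are left unchanged. */
static inline uint64_t mair_set_attr(uint64_t mair, unsigned index, uint8_t attr)
{
    unsigned shift = (index & 7u) * 8u;

    mair &= ~((uint64_t)0xFFu << shift);   /* clear the old attribute */
    mair |= (uint64_t)attr << shift;       /* insert the new attribute */
    return mair;
}
```

The translation table entry for the PL region then selects the chosen slot through its AttrIndx field.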

What is not so great is that with this configuration, the data has frequently not been fully written by the time the A53 notifies the PL that it has finished working on the data set (which it does by writing to a register exposed by the PL through a separate AXI bus address space configured as Device-nGnRnE).

To fix this, I tried issuing a dsb st instruction once the core is done with the data set (but before it notifies the PL). I hoped this would compensate for the Early Write Acknowledgement attribute, which lets the A53 core believe a write has reached the PL when it really has not. Unfortunately, this seems to lead to the core deadlocking after some time, after it has successfully operated on many data sets. I've tried disabling interrupts, but the problem remains. When the core locks up, I am unable to halt execution on it in the debugger, and all accesses to the PL (from either the A53 cluster or the R5 cluster) seem to time out, implying that something in the interconnect is the primary cause of the deadlock.
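A minimal sketch of the sequence described above, to make the ordering explicit. The function name, parameter names, and the value written to the notification register are placeholders, not the actual firmware. On a non-AArch64 host the barrier macro falls back to a generic full barrier so the sketch still compiles.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
/* Wait for all earlier stores to complete before continuing. */
#define STORE_BARRIER() __asm__ volatile("dsb st" ::: "memory")
#else
/* Host stand-in only; it does not model AXI behaviour. */
#define STORE_BARRIER() __sync_synchronize()
#endif

/* Hypothetical routine: copy one data set into the PL buffer (mapped as
 * Device-GRE), then notify the PL through a register in a separate
 * Device-nGnRnE region. */
static void finish_data_set(volatile uint64_t *pl_buf, const uint64_t *src,
                            size_t n_words, volatile uint32_t *done_reg)
{
    for (size_t i = 0; i < n_words; i++) {
        pl_buf[i] = src[i];      /* sequential stores that may be gathered */
    }

    STORE_BARRIER();             /* dsb st: the step after which the lockup appears */

    *done_reg = 1u;              /* tell the PL the data set is finished */
}
```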

Am I doing (or not doing) something that triggers this issue? I am not deeply familiar with the details of the A53 core or of the AXI interconnect (though I assume the latter is a custom configuration specified by Xilinx), so I'm hoping I have just missed some obvious step needed to get this working. I could leave the region as Device-nGnRnE, but combining these writes into larger AXI transactions is vital to the performance of the design, and I would really like to squeeze everything I can out of it!