Cortex-A9: accessing an atomic variable results in an infinite loop

I'm using a Zynq-7000 and implementing a counter in a bare-metal system. The counter lives in shared memory and is incremented by both cores. I use _Atomic so that the two cores stay synchronized, but accessing the atomic variable results in an infinite loop.

The code

// shared_mem.h
#include <stdint.h>
#define SHARED_MEM_BASE_ADDR 0xffff0000
typedef struct {
	uint32_t basicCounter;
	_Atomic uint32_t atomicCounter;
} SharedMem;
#define SHAERD_MEM ((volatile SharedMem *)SHARED_MEM_BASE_ADDR)
 
// main.c, core0 and core1
#include "shared_mem.h"
int debuggerflag = 0; // set to 1 by debugger, each core has one flag
#define INC_VALUE 10000
void IncreaseCounters(){
	for(unsigned i=0; i< INC_VALUE; ++i){
		++SHAERD_MEM->basicCounter;
	}
	for(unsigned i=0; i< INC_VALUE; ++i){
		++SHAERD_MEM->atomicCounter;
		// blocks here if Xil_SetTlbAttributes(0xffff0000, 0x14de2); has been called
	}
}
int main(){
	for(;;){
		if(debuggerflag) {
			IncreaseCounters();
			debuggerflag = 0;
		}
	}
}
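
For reference, here is what I understand `++SHAERD_MEM->atomicCounter` to mean in C11 terms. This is a minimal host-side sketch, not the code running on the board. The function names `increment_builtin` and `increment_cas_loop` are made up for illustration. The second function spells the same increment out as an explicit retry loop, which is the shape a compiler typically produces on ARMv7 with load-exclusive/store-exclusive.

```c
#include <stdatomic.h>
#include <stdint.h>

/* `++x` on an _Atomic object is a sequentially consistent
 * read-modify-write; this is the explicit form of that operation.
 * Returns the new value, like the pre-increment expression does. */
uint32_t increment_builtin(_Atomic uint32_t *p)
{
    return atomic_fetch_add_explicit(p, 1u, memory_order_seq_cst) + 1u;
}

/* The same increment written as a retry loop (illustrative only).
 * The weak compare-exchange may fail spuriously, so it must be retried;
 * on failure `old` is refreshed with the value currently in memory. */
uint32_t increment_cas_loop(_Atomic uint32_t *p)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1u,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* nothing to do: `old` was updated, try again */
    }
    return old + 1u;
}
```

Both functions terminate only if the underlying exclusive store can eventually succeed, which is exactly what does not happen in my case.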

The problems I ran into:

  1. The counter value is only visible to one core. After running IncreaseCounters() on core0, core1 still sees the value 0 in shared memory.
  2. The default MMU config is "S=b0 TEX=b100 AP=b11, Domain=b0, C=b1, B=b1", as in translation_table.S. If I call `Xil_SetTlbAttributes(0xffff0000, 0x14de2)` (from xil_mmu.c) to change the config to "S=b1 TEX=b100 AP=b11, Domain=b1111, C=b0, B=b0", as in Xilinx xapp1079, then the counter value becomes visible to the other core. However, the increment of the plain counter is not synchronized, and the increment of the atomic counter loops forever. The assembly generated for `++SHAERD_MEM->atomicCounter` is the following (strex always fails):

        dmb     ish             @ data memory barrier
.L2:
        ldrex   r3, [r0]        @ exclusive load, r3 = *r0
        add     r3, r3, #1      @ increment
        strex   r2, r3, [r0]    @ exclusive store, *r0 = r3, status written to r2
        cmp     r2, #0          @ r2 == 0 means the exclusive store succeeded
        bne     .L2             @ retry if it failed
        dmb     ish             @ data memory barrier
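
To double-check the two MMU configurations quoted above, the attribute word can be decoded by hand. The following is a small host-side sketch, assuming the ARMv7-A short-descriptor first-level section layout (B at bit 2, C at bit 3, Domain at bits 8:5, AP at bits 11:10, TEX at bits 14:12, S at bit 16). The helper names are hypothetical. The value `0x4c0e` for the default entry is my own reconstruction from the fields listed above, not a value copied from translation_table.S.

```c
#include <stdint.h>
#include <stdio.h>

/* Selected fields of an ARMv7-A short-descriptor section entry. */
typedef struct {
    unsigned b, c, domain, ap, tex, s;
} SectionAttrs;

SectionAttrs decode_section_attrs(uint32_t attr)
{
    SectionAttrs a;
    a.b      = (attr >> 2)  & 0x1u;  /* B: bufferable         */
    a.c      = (attr >> 3)  & 0x1u;  /* C: cacheable          */
    a.domain = (attr >> 5)  & 0xFu;  /* Domain[3:0]           */
    a.ap     = (attr >> 10) & 0x3u;  /* AP[1:0]               */
    a.tex    = (attr >> 12) & 0x7u;  /* TEX[2:0]              */
    a.s      = (attr >> 16) & 0x1u;  /* S: shareable          */
    return a;
}

/* With TEX[2] = 1, TEX[1:0] selects the outer cache policy and
 * {C,B} selects the inner cache policy, using this encoding. */
const char *cache_policy_name(unsigned bits)
{
    static const char *const names[4] = {
        "non-cacheable",
        "write-back, write-allocate",
        "write-through, no write-allocate",
        "write-back, no write-allocate",
    };
    return names[bits & 0x3u];
}

void print_section_attrs(uint32_t attr)
{
    SectionAttrs a = decode_section_attrs(attr);
    printf("0x%05x: S=%u TEX=%u AP=%u Domain=%u C=%u B=%u\n",
           (unsigned)attr, a.s, a.tex, a.ap, a.domain, a.c, a.b);
    if (a.tex & 0x4u) {
        printf("  outer: %s, inner: %s\n",
               cache_policy_name(a.tex & 0x3u),
               cache_policy_name((a.c << 1) | a.b));
    }
}
```

Calling `print_section_attrs(0x14de2)` and `print_section_attrs(0x4c0e)` shows the difference between the two setups. If I read the encoding correctly, the xapp1079 value makes the region shareable but non-cacheable at both the inner and outer level, while the default is inner write-back cacheable and not shareable.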

Is this an MMU configuration problem? How can the counter be kept synchronized between the two cores? Thanks very much.