Atomic write (LDAXR/STLXR) causes infinite loop on Cortex-A72

I have code that runs on a Cortex-A72 (AArch64); it disassembles to the following:
 0:   d53800a9    mrs x9, mpidr_el1
 4:   92400529    and x9, x9, #0x3
 8:   b4000069    cbz x9, 0x14
 c:   d503205f    wfe
10:   17ffffff    b   0xc
14:   10ffff69    adr x9, 0x0
18:   9100013f    mov sp, x9

// atomically set bit 14 (0x4000) in uninitialized memory beyond the executable image (BSS section)
1c:   90000008    adrp    x8, 0x0
20:   91016108    add x8, x8, #0x58
24:   c85ffd09    ldaxr   x9, [x8]
28:   b2720129    orr x9, x9, #0x4000
2c:   c80afd09    stlxr   w10, x9, [x8]
30:   35ffffaa    cbnz    w10, 0x24

// turn on ACT LED -- this code never executes
34:   d2bfc404    mov x4, #0xfe200000
38:   b9401080    ldr w0, [x4, #16]
3c:   12177000    and w0, w0, #0xfffffe3f
40:   321a0000    orr w0, w0, #0x40
44:   b9001080    str w0, [x4, #16]
48:   52808000    mov w0, #0x400
4c:   f9001080    str x0, [x4, #32]
50:   14000000    b   0x50
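
As a sanity check on the constants in the LED part, here is a small Python sketch that recomputes them. It assumes the ACT LED is on GPIO 42 and that the GPIO block uses the usual BCM2711 layout (GPFSELn at 4*n, GPSET0 at 0x1c); both are my assumptions and are not stated in the listing. The helper names are made up for illustration.

```python
# Assumption: the ACT LED is wired to GPIO 42 on this board.
ACT_LED_PIN = 42

def gpfsel_offset(pin):
    # Each GPFSEL register holds the 3-bit function fields of 10 pins.
    return 4 * (pin // 10)

def gpfsel_shift(pin):
    # Bit position of this pin's 3-bit function field.
    return 3 * (pin % 10)

def gpfsel_clear_mask(pin):
    # 32-bit mask that clears the pin's function field.
    return ~(0b111 << gpfsel_shift(pin)) & 0xFFFFFFFF

def gpfsel_output_bits(pin):
    # Function 0b001 selects "output".
    return 0b001 << gpfsel_shift(pin)

def gpset_offset(pin):
    # Assumption: GPSET0 is at 0x1c, GPSET1 follows it; 32 pins per register.
    return 0x1C + 4 * (pin // 32)

def gpset_bit(pin):
    return 1 << (pin % 32)

print(hex(gpfsel_offset(ACT_LED_PIN)))      # 0x10, matches [x4, #16]
print(hex(gpfsel_clear_mask(ACT_LED_PIN)))  # 0xfffffe3f
print(hex(gpfsel_output_bits(ACT_LED_PIN))) # 0x40
print(hex(gpset_offset(ACT_LED_PIN)))       # 0x20, matches [x4, #32]
print(hex(gpset_bit(ACT_LED_PIN)))          # 0x400
```

Under those assumptions the values match the listing, so I don't think the LED code itself is the problem; it is simply never reached.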



The problem is that it enters an infinite loop at the ldaxr/stlxr sequence, and I have no idea how to debug it or where to look to resolve the issue.

I've read about the exclusive access monitor in the ARMv8-A Reference Manual:

    aarch64/functions/exclusive/AArch64.ExclusiveMonitorsPass
    
    // AArch64.ExclusiveMonitorsPass()
    // ===============================
    // Return TRUE if the Exclusives monitors for the current PE include all of the addresses
    // associated with the virtual address region of size bytes starting at address.
    // The immediately following memory write must be to the same addresses.

    boolean AArch64.ExclusiveMonitorsPass(bits(64) address, integer size)
        // It is IMPLEMENTATION DEFINED whether the detection of memory aborts happens
        // before or after the check on the local Exclusives monitor. As a result a failure
        // of the local monitor can occur on some implementations even if the memory
        // access would give an memory abort.

        acctype = AccType_ATOMIC;
        iswrite = TRUE;
        aligned = (address == Align(address, size));

        if !aligned then
            secondstage = FALSE;
            AArch64.Abort(address, AArch64.AlignmentFault(acctype, iswrite, secondstage));
        passed = AArch64.IsExclusiveVA(address, ProcessorID(), size);

        if !passed then
            return FALSE;
        memaddrdesc = AArch64.TranslateAddress(address, acctype, iswrite, aligned, size);

        // Check for aborts or debug exceptions
        if IsFault(memaddrdesc) then
            AArch64.Abort(address, memaddrdesc.fault);
        passed = IsExclusiveLocal(memaddrdesc.paddress, ProcessorID(), size);
        ClearExclusiveLocal(ProcessorID());

        if passed then
            if memaddrdesc.memattrs.shareable then
                passed = IsExclusiveGlobal(memaddrdesc.paddress, ProcessorID(), size);
        return passed;

    aarch64/functions/exclusive/AArch64.IsExclusiveVA
        // An optional IMPLEMENTATION DEFINED test for an exclusive access to a virtual
        // address region of size bytes starting at address.
        //
        // It is permitted (but not required) for this function to return FALSE and
        // cause a store exclusive to fail if the virtual address region is not
        // totally included within the region recorded by MarkExclusiveVA().
        //
        // It is always safe to return TRUE which will check the physical address only.
        boolean AArch64.IsExclusiveVA(bits(64) address, integer processorid, integer size);
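
To check my reading of the pseudocode, here is a toy Python model of just the local-monitor part. It is only a sketch: it ignores alignment, translation, the IsExclusiveVA test and the global monitor. The `can_track` flag is a hypothetical stand-in for "the monitor is able to tag this address"; whether that holds on my setup is exactly what I don't know. `LocalMonitor` and `atomic_or` are names I made up.

```python
class LocalMonitor:
    """Toy local exclusive monitor for a single PE (not hardware-accurate)."""

    def __init__(self, can_track):
        # Hypothetical switch: can the monitor tag the accessed address?
        self.can_track = can_track
        self.marked = None  # address currently tagged exclusive, if any

    def load_exclusive(self, mem, addr):
        # Models LDAXR: read the value and try to tag the address.
        self.marked = addr if self.can_track else None
        return mem[addr]

    def store_exclusive(self, mem, addr, value):
        # Models STLXR: returns 0 on success, 1 on failure (the w10 status).
        passed = self.marked == addr  # IsExclusiveLocal
        self.marked = None            # ClearExclusiveLocal
        if passed:
            mem[addr] = value
            return 0
        return 1


def atomic_or(monitor, mem, addr, bits, max_tries=8):
    """Mirrors the ldaxr/orr/stlxr/cbnz loop, with a retry cap so it ends."""
    for attempt in range(1, max_tries + 1):
        old = monitor.load_exclusive(mem, addr)
        if monitor.store_exclusive(mem, addr, old | bits) == 0:
            return attempt  # number of tries that were needed
    return None  # the store-exclusive never passed


mem = {0x58: 0}
print(atomic_or(LocalMonitor(can_track=True), mem, 0x58, 0x4000))   # 1
print(atomic_or(LocalMonitor(can_track=False), mem, 0x58, 0x4000))  # None
```

In this model, if the load-exclusive never leaves the address tagged, every store-exclusive fails, and without the retry cap the loop would spin forever, which looks like what I observe.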


Could this be related? I don't enable address translation, and I don't initialize the exclusive access monitor. Do I need to?
Any advice or help is highly appreciated. Thanks in advance!