BaseFVP: undef behavior or emulation bug?

Version: Fast Models [11.4.37 (Jun 19 2018)] (Free) on Linux x86_64

Repro: here.

The program creates two binaries s.bin and ns.bin, and concatenates them into pkg.bin. The package is provided to the BaseFVP as its secureflashloader's file.

All (four) PEs go through s.bin and ns.bin. Some of the tasks are alloted to a single PE, marked as PRIMARY, which is the PE with the id 0. (The id is the AFF1 field, since MT is enabled.)

s.bin's job is to

- run at el3 (All PEs),
- configure TZC to enable access to range [0x80000000, 0xffffffff] of the RAM (PRIMARY PE),
- copy ns.bin into RAM_BASE + 1MB (PRIMARY PE),
- switch to el2 and jump to ns.bin (All PEs)

ns.bin's job is

- to switch to el1 (All PEs),
- Zero 6*4096 bytes at the address RAM_BASE+16MB (PRIMARY PE),
- signal non-PRIMARY PEs waiting in s.bin (PRIMARY PE),
- loop infinite with wfe(All PEs)

Expected behavior: All PEs end up in the wfe loop inside ns.bin.

Observed behavior: Two PEs end up in the wfe loop (including PRIMARY). The other two go haywire and write into RAM at sequential, consecutive but not independent locations, which were not programmed.

For e.g., the PRIMARY is asked to zero 6*4096 bytes starting at 0x81000000. The two rogue PEs (none of them is PRIMARY) not only touch that range, they go beyond it as if the size given was much too large.

The zero memory loop is given below:

        int sz;
        char *d;
        if (id == PRIMARY) {
                /* Zero 6 pages starting at base + 16MB. */
                d = ((char *)RAM_BASE + 0x1000000);
                sz = 0x6000;
                while (sz) {
                        *d = 0ull;
                        sz -= sizeof(*d);
                __asm__ volatile("sev\n\t");

There are three independent ways to fix the problem:

1. The PRIMARY PE should not be 0. That is, set PRIMARY to 1, 2 or 3, and the problem goes away.
2. Change the loop condition from while(sz) to while (sz > 0). The corresponding change in the assembly is from to PRIMARY PE is kept 0.
3. Change the datatype of sz from int to short. The corresponding change in the assembly is from str-ldr to strh-ldrh. PRIMARY PE is kept 0.

On the surface, it seems that there's a fault in the emulation of flags and word/halfword instructions, or that PE0 is, in this context, different from other PEs. Note that the compilation is carried out without optimizations (-O0).

Did the program violate some fundamentals of programming the arm64/FVP? I do not own a FVP debugger.