Weird SPSR behaviour

I was trying to write register context save/restore code when I came across some weird behaviour.

My code (sorry, tried to format tens of times, but the editor WANTS to make asm a table):

asm volatile (
...
"pop {r0 - r3}"
"push {r0 - r3}"
"mov r0, r3"
"bl dbg_out" - outputs 60000013
"pop {r0 - r3}"
"msr cpsr_fsxc, r2"
"@dsb"
"@isb"
"msr spsr_fsxc, r3" - set value
"@dsb"
"@isb"
"mov lr, r1"
"mov sp, r0"
"push {r0 - r4, lr}"
"mov r0, lr"
"bl dbg_out"
"mov r0, sp"
"bl dbg_out"
"mrs r2, cpsr"
"mrs r3, spsr" - read value
"mov r0, r2"
"bl dbg_out"
"mov r0, r3" - outputs 00000002
"bl dbg_out"
...
);

After the exception returns, the calling function:

    asm volatile ("svc #0\n\t");

    msg = "returned from SVC\r\n";
    serial_io.put_string(msg, util_str_len(msg)+1);

    asm volatile (
        "mrs %[retreg], cpsr\n\t"
        : [retreg] "=r" (tmp1) ::
    );

    msg = "cpsr = ";
    serial_io.put_string(msg, util_str_len(msg)+1);
    util_word_to_hex(scratchpad, tmp1);
    serial_io.put_string(scratchpad, 9);
    serial_io.put_string("\r\n", 3);

outputs "returned from SVC" and "cpsr = 60000013".

Why the "00000002"? The barriers don't seem to have any effect.

  • I think that 'bl dbg_out' is free to change r0-r3, r12 and lr.

    Popping and then pushing r0-r3 looks strange to me; probably because an LDM sp,{r0-r3} would do the job just as well, but save a bunch of clock cycles.

    I recommend not pushing the values back onto the stack; if you need to change a single register, I think it would be better to just do an indexed store (STR rN,[sp,#offset]).

  • It's hard to believe that LDM saves clock cycles compared to POP - at least on Cortex-A7:

    cccc 1000 10W1 nnnn rrrr rrrr rrrr rrrr    arm_cldstm_ldm    arm_core_ldstm    LDM<c> <Rn>{!}, <registers>    A1    A8.8.58

    cccc 1000 1011 1101 rrrr rrrr rrrr rrrr    arm_cldstm_pop    arm_core_ldstm    POP<c> <registers>

      (The 'W' is writeback, and nnnn is the base register - SP is 1101.)
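The two bit patterns above can be cross-checked with a small C sketch. It only assembles the instruction word from the quoted layout; the function name `encode_ldm_a1` is made up for illustration and is not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build an ARM A1 LDM instruction word from the
 * layout quoted above: cccc 1000 10W1 nnnn <16-bit register list>. */
uint32_t encode_ldm_a1(uint32_t cond, int writeback,
                       uint32_t rn, uint32_t reglist)
{
    assert(cond < 16 && rn < 16 && reglist <= 0xFFFFu);
    return (cond << 28)
         | (0x22u << 22)                  /* fixed bits 1000 10 */
         | ((writeback ? 1u : 0u) << 21)  /* W: write updated base back */
         | (1u << 20)                     /* fixed bit 1 */
         | (rn << 16)                     /* base register, SP is 13 */
         | reglist;                       /* bit N set means rN is loaded */
}
```

With condition AL (1110), W = 1, base SP and the list r0-r3, this gives e8bd000f, the same word the disassembler in this thread prints for pop {r0, r1, r2, r3}. With W = 0 it gives e89d000f, i.e. ldm sp,{r0-r3}: the same instruction minus the writeback.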

    The pushing and popping is about debug.

    I have stuff on the stack, and if I suspect the registers may have been corrupted, I pop and push back to "fix" the registers.

    And actually, the 'dbg_out' corrupts r0 and r1.

    Another weirdness in today's debug:

       "@ align SP\n\t"
       "mov r0, sp\n\t"
       "bl dbg_out\n\t"
       "mov r0, sp\n\t"
       "and r1, r0, #7\n\t"
       "push {r0, r1}\n\t"
       "mov r0, r1\n\t"
       "bl dbg_out\n\t"
       "pop {r0, r1}\n\t"
       "sub r0, r1\n\t"
       "mov sp, r0\n\t"
       "push {r0,r1} @ stack correction"
       "mov r0, r1\n\t"
       "bl dbg_out\n\t"

    The first 'dbg_out' prints 1f012658, the second prints 00000000 and the third 1f012658...

    The stack should be 'corrected' because, when you call assembly from C, the stack can be 4-byte but not 8-byte aligned, and GCC requires that when you call C code from 'outside' (like assembly), the stack is 8-byte aligned.
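The arithmetic behind the "align SP" sequence can be written out in plain C. This is only a sketch of the calculation on an ordinary integer, not a way to touch the real stack pointer; the function names are invented for illustration.

```c
#include <stdint.h>

/* Bytes to subtract from sp to make it 8-byte aligned
 * (what "and r1, r0, #7" computes). */
uint32_t sp_correction(uint32_t sp)
{
    return sp & 7u;
}

/* Aligned stack pointer ("sub r0, r1"). The stack grows downwards,
 * so rounding down stays within free stack space. */
uint32_t sp_align_down(uint32_t sp)
{
    return sp - sp_correction(sp);
}

/* Undo the correction before returning ("add r0, r1"). */
uint32_t sp_restore(uint32_t aligned_sp, uint32_t correction)
{
    return aligned_sp + correction;
}
```

For the SP value printed above, 1f012658, the correction is 0, which matches the 00000000 from the second dbg_out. An SP of 1f01265c would give a correction of 4 and an aligned SP of 1f012658.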

  • Uhm, what I mean is that a single LDM is approximately twice as fast as the two instructions POP + PUSH.

    -Thus you will not have to use POP+PUSH, but can just read the contents of the stack and avoid writing to it.

    The registers need to be loaded from the stack, that is correct, because they do not contain any defined value on interrupt entry.

    Thus instead of ...

       pop {r0-r3}

       push {r0-r3}

    ... you can write ...

       ldm sp,{r0-r3}

    ... which only reads the registers.

    dbg_out is allowed to corrupt r2 and r3 as well, which I think is why you see the strange value you mentioned earlier.

    I did not know about the 8-byte alignment requirement.

    -But remember to restore SP to its original value before you return; either by saving the entire value of SP or by adding the difference back, otherwise you'll get a crash.

  • Hello turboscrew,

    are you wondering why "00000002" was read as SPSR?

    If so, I think "msr cpsr_fsxc, r2" is what affects it.

    Changing CPSR changes the execution mode.

    After that, SPSR is read from the new execution mode.

    It would probably hold an unknown value.

    I guess you restored CPSR to SVC mode after reading SPSR.

    That is why the correct CPSR was read in the main function.

    Best regards,

    Yasuhiko Koumoto.
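The banking described above can be illustrated with a toy model in C. This is not hardware access and not a claim about what happened in the code in question; it only shows that an SPSR access always goes to the copy belonging to the mode currently selected in CPSR. All names are invented, and the model ignores that User and System mode have no SPSR at all.

```c
#include <stdint.h>

#define MODE_MASK 0x1Fu   /* CPSR bits [4:0] select the mode */
#define MODE_IRQ  0x12u
#define MODE_SVC  0x13u

/* Toy model only: one SPSR slot per 5-bit mode number. */
struct toy_cpu {
    uint32_t cpsr;
    uint32_t spsr[32];
};

/* "msr cpsr_fsxc, rN": may change the mode bits. */
void toy_msr_cpsr(struct toy_cpu *cpu, uint32_t value)
{
    cpu->cpsr = value;
}

/* "msr spsr_fsxc, rN": writes the SPSR of the current mode. */
void toy_msr_spsr(struct toy_cpu *cpu, uint32_t value)
{
    cpu->spsr[cpu->cpsr & MODE_MASK] = value;
}

/* "mrs rN, spsr": reads the SPSR of the current mode. */
uint32_t toy_mrs_spsr(const struct toy_cpu *cpu)
{
    return cpu->spsr[cpu->cpsr & MODE_MASK];
}
```

In this model, a value written to SPSR while CPSR selects SVC mode (0x13) is not visible after CPSR is switched to IRQ mode (0x12), and becomes visible again once CPSR is switched back.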

  • Ah, stupid me. That's what you meant - no writeback -> no write back...

    "dbg_out is allowed to corrupt r2 and r3 as well"

    Yes, allowed, but it doesn't - checked the disassembly.

    "But remember to restore SP..."

    I wrote them as a pair. That way you don't forget, and it's easier to write one as a "mirror image" of the other.

  • But after "msr cpsr_fsxc, r2" I do "msr spsr_fsxc, r3" before reading "mrs r3, spsr".

    The setting and reading spsr should happen in the same mode.

  • And another weirdness - this time from gcc:

    The source:

    asm volatile (
    "@ align SP\n\t"
    "mov r0, sp\n\t"
    "and r1, r0, #7\n\t"
    "sub r0, r1\n\t"
    "mov sp, r0\n\t"
    "push {r0,r1} @ stack correction"

        );

        // rpi2_svc_handler2() // - No C in naked function

        asm volatile (

    "mov r0, sp\n\t"
    "mov r1, lr\n\t"
    "push {r0 - r3}\n\t"
    "bl rpi2_svc_handler2\n\t"
    "pop {r0 - r3}\n\t"

        );

        asm volatile (

    "@ restore stack correction"
    "pop {r0, r1}\n\t" - THIS
    "add r0, r1\n\t"
    "mov sp, r0\n\t"

        );

    The disassembly:

    1f000cf8:    e1a0000d     mov    r0, sp

    1f000cfc:    e2001007     and    r1, r0, #7

    1f000d00:    e0400001     sub    r0, r0, r1

    1f000d04:    e1a0d000     mov    sp, r0

    1f000d08:    e92d0003     push    {r0, r1}

    1f000d0c:    e1a0000d     mov    r0, sp

    1f000d10:    e1a0100e     mov    r1, lr

    1f000d14:    e92d000f     push    {r0, r1, r2, r3}

    1f000d18:    ebffff51     bl    1f000a64 <rpi2_svc_handler2>

    1f000d1c:    e8bd000f     pop    {r0, r1, r2, r3}

    Where's the "pop {r0, r1}"?

    1f000d20:    e0800001     add    r0, r0, r1

    1f000d24:    e1a0d000     mov    sp, r0

    The stack fix pop in "restore"-part ("pop {r0, r1}\n\t") is missing from the disassembly!

    OK, the push and pop around call to rpi2_svc_handler2 are needless, but still - the stack effect...

    The compiler shouldn't "optimize" in a way that unbalances the stack and makes the data read back wrong.

  • Aarghh - found it: the previous line "comments" it out!

    This: "@ restore stack correction" doesn't end with "\n\t", so the next physical line

    is logically a continuation...

    "@ restore stack correction"
    "pop {r0, r1}\n\t" - THIS

    becomes:

    "@ restore stack correction pop {r0, r1}\n\t"

    And because the logical line is a comment, there are no error messages or anything...
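The effect is plain C string literal concatenation and can be reproduced without any assembler. The sketch below stores the broken and the fixed text in ordinary arrays and counts the lines that are not comments; the helper `count_instruction_lines` is invented for this illustration and uses a simplified rule (a line is a comment if its first non-blank character is '@').

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the snippet above: the first literal has no "\n\t",
 * so the compiler glues it to the next one as a single line. */
const char broken_asm[] =
    "@ restore stack correction"
    "pop {r0, r1}\n\t"
    "add r0, r1\n\t";

const char fixed_asm[] =
    "@ restore stack correction\n\t"
    "pop {r0, r1}\n\t"
    "add r0, r1\n\t";

/* Count lines that hold something other than a comment or blanks. */
size_t count_instruction_lines(const char *src)
{
    size_t count = 0;
    while (*src != '\0') {
        size_t len = strcspn(src, "\n");   /* length of this line */
        size_t skip = strspn(src, " \t");  /* leading blanks */
        if (skip < len && src[skip] != '@')
            count++;
        src += len;
        if (*src == '\n')
            src++;
    }
    return count;
}
```

The broken text contains one instruction line (the add), the fixed text contains two (the pop and the add). A check like this, or simply reading the output of gcc -S, makes a swallowed instruction visible.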

  • Hello turboscrew,

    I'm sorry, I misunderstood the program sequence.

    By the way, why do the delimiters exist in the previous inline assembly code?

    If you put a delimiter on each assembly line, does the problem still occur?

    I think

    "@isb"

    "msr spsr_fsxc, r3" - set value

    did not change the SPSR.

    Best regards,

    Yasuhiko Koumoto.

  • If you wrote the entire dbg_out yourself, it probably won't change r2 and r3.

    What I'm most worried about is that if it's written in C, the C-compiler will optimize it at some later point, so it changes r2 and r3.

    If dbg_out calls another C-routine, which you didn't write, then it's very much in danger of being unpredictable, regarding which registers it uses.

    If you've written dbg_out in assembly language, then begin the routine by saving r0-r3 and r12 (since it's a debugging routine); then you won't have the problem at a later point.

    -But it's great to see you found the error; this one is difficult to spot.

  • The funny thing was, the CPSR was right after return, so "msr spsr_fsxc, r3" did change the value - later.

    The code works even if the printed value was wrong.

  • The dbg_out was written in C, but only used for debugging that situation. It's already removed.

  • Now that you've seen my struggle, I wanted to give a status update:

    I can now load a program with GDB through my stub, run the loaded program, and set and delete breakpoints; the program really breaks there and GDB is notified about it. Just the PC gets some weird value, and I haven't found out yet where that comes from, but if I correct it manually from GDB (set $pc = 0x?????) I can 'cont' forward. The weird PC value is not in the memory area of the stub nor the debuggee.

    I also tried with branch prediction and caches enabled, and it worked.

    Now the code also copies itself into upper memory (0x1f000000) to let the debuggee load at the default address 0x8000.

    I wonder if I could use the hivec, and route all non-stub exceptions to lovec by just jumping there. Then the debuggee could install its own exception vectors. The stub would just filter its own exceptions on the way (UART0 and BKPT). I could pre-install the lovecs to point to unhandled-exception handlers, and the debuggee then - if it wants to handle them - just overwrites them.

  • Looks like the original question will remain unsolved. For other reasons I had to make quite a few changes to the code, and the problem doesn't show up anymore.

    Instead, using the double vectoring caused some instabilities, and I'm about to become insane. I haven't found the cause, and the main symptom is that the serial line seems to stop working. I suspect that something somewhere turns the interrupt mask on. Hard to debug, when the problem "mutes" the only communication channel, and the LEDs are not informative enough...

  • It can be quite difficult to debug without having a display or a communications channel to a computer.

    Remember that you can blink the LEDs.

    One blink + long delay means this ...

    Two blinks + long delay means that ...

    Three blinks + long delay ... etc. ...

    If you have more than one LED available, then you can switch between them; or you can blink LED 1 X times, then blink LED 2 Y times, then have a few seconds delay.
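The blink-code idea can be sketched as a function that turns an error number into a sequence of steps. The encoding used here ('1' = LED on for a short tick, '0' = LED off for a short tick, '-' = long pause) and the name `blink_pattern` are assumptions made for this example; the actual LED and delay routines are board specific and left out.

```c
#include <stddef.h>

/* Fill `out` with the steps for blinking `code` times, followed by
 * one long pause, and terminate it with '\0'.
 * Returns the number of steps written, or 0 if the buffer is too small. */
size_t blink_pattern(unsigned code, char *out, size_t out_size)
{
    /* Needs 2 steps per blink, 1 pause and 1 terminator. */
    if (out_size < 2u || code > (out_size - 2u) / 2u)
        return 0;

    size_t n = 0;
    for (unsigned i = 0; i < code; i++) {
        out[n++] = '1';   /* LED on */
        out[n++] = '0';   /* LED off */
    }
    out[n++] = '-';       /* long delay before the code repeats */
    out[n] = '\0';
    return n;
}
```

A driver loop would walk the pattern forever and call the board's own LED and delay functions for each step, so the code can be read off the LED even when the serial line is dead.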

    Note: I sometimes connect a small Cortex-M to one pin, which I can send data through.

    For instance, if I have a problem on a STM32F427, and I can't use the UART for debugging, I bit-bang the data on a GPIO pin to a STM32F103.

    The STM32F103 watches one GPIO pin and when a data-byte is received, it can output it on a connected SPI-display.

    -Or it can transmit the data to a computer via the UART.

    You can easily receive 16 bits at a time on the GPIO pins with the STM32F103, so it's a great little tool.
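As a rough sketch of the sending side of such a bit-banged link: the function below expands one byte into the pin levels to drive, one per bit time. The framing (start bit 0, eight data bits LSB first, stop bit 1, like a UART) is purely an assumption for this example, since the post does not say which format is used, and `frame_byte` is an invented name.

```c
#include <stdint.h>

#define FRAME_BITS 10u   /* 1 start + 8 data + 1 stop (assumed framing) */

/* Expand one byte into the levels to put on the GPIO pin,
 * one array entry per bit time. */
void frame_byte(uint8_t byte, uint8_t levels[FRAME_BITS])
{
    levels[0] = 0;                                    /* start bit */
    for (unsigned bit = 0; bit < 8; bit++)
        levels[1 + bit] = (uint8_t)((byte >> bit) & 1u); /* LSB first */
    levels[9] = 1;                                    /* stop bit, line idles high */
}
```

The sender would then write each level to the pin and wait one bit time between writes; both the pin write and the delay depend on the chip and are not shown.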