Baremetal program jumps to 0x200

Hello, I am trying to run a "hello world" program with C/C++ standard library support on a Morello board (hardware), using Arm Development Studio Morello edition.

I previously followed the standalone-baremetal-readme.rst guide, which worked well (following the advice from this topic), but it did not allow the use of functions like "printf".

I tried to use examples from:

https://git.morello-project.org/morello/llvm-project-releases/-/tree/morello/baremetal-release-1.6?ref_type=heads

I ran make and then the "make-bm-image.sh" script with the "-e" flag to produce "howdy-purecap-bm-image.elf" and "howdy-morello-bm-image.elf" (I added a line to the "make-bm-image.sh" script to preserve a copy of the .elf file), then I loaded these in Development Studio.

It appears that the program goes to address 0x200 after executing the "MRS" instruction.

Does anyone know why that happens?

Also, in the standalone-baremetal-readme.rst guide it was necessary to specify the UART address (0x2A400000) in the program. Is it correct to assume that the examples from the baremetal-release-1.6 branch of llvm-project-releases use that address (without the need to specify it anywhere in the program), so that the printf/cout messages appear on the AP COM port of the Morello hardware board? Or are some adjustments necessary to achieve that?

0 Kevin Brodsky over 2 years ago in reply to Michal Borowski

It looks like an exception occurs here:

https://git.morello-project.org/morello/newlib/-/blob/morello/master/libgloss/aarch64/crt0.S#L180

Hard to say why without more information. Have a look at the value of the ESR_EL2 system register just after the exception is taken, feel free to copy it here.

0 Michal Borowski over 2 years ago in reply to Kevin Brodsky

I just tried to do the same thing I did before (using the same hello_world program), and somehow the LDR instruction no longer causes an exception; no idea why. I am now facing another issue: when I step through the program using "F5" (step into), multiple function calls execute and return fine. But when I use "F6" (step over), each of the following functions results in a jump to "curr_sp0_fiq":

- _cpu_init_hook (in the source code _cpu_init_hook seems to contain only a "ret" instruction, but in the disassembly it calls _init_vectors and _flat_map; I attached an image of it below)

- _init_vectors

- _flat_map

- memset

I didn't test F6 with any other functions, but every function I tried to step over resulted in an exception.

I tried to use breakpoints to find the part of the code that might cause this exception, but working with breakpoints seems unstable (sometimes running the code until a breakpoint worked well, sometimes it crashed Arm Development Studio, and sometimes it made the program jump to address 0 and required rebooting the board).

In this thread the same behaviour was described (where stepping through code worked, and running it resulted in exception), and the suggested solution was to introduce 2 ISB instructions, but from what I see in the "rebuild-newlib" I used, these 2 ISB instructions are already in the following file:

newlib/libgloss/aarch64/crt0.S

I checked system registers after the program went to "curr_sp0_fiq" (following F6/step-over functions) and the values were always the following:
0 Kevin Brodsky over 2 years ago in reply to Michal Borowski
I'm really not sure what is happening regarding stepping over. A few things I can say though:

_cpu_init_hook is actually defined here. The definition you were looking at is a fallback (weak symbol), in case it is not otherwise defined.

The ESR_EL2 value in your screenshot corresponds to a data abort, with the DFSC (0x2a) indicating a capability bound fault (DS should give you the decoding if you click on the + icon next to it). The address at which the access (write) failed is indicated by FAR_EL2, and I would guess that is somewhere on the stack. Maybe a store via CSP, whose bounds are not appropriate? The address of the instruction at which the fault occurred is indicated by ELR_EL2. That might tell you enough to figure out what happened.
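The decoding described above can also be reproduced by hand. Below is a minimal host-side sketch in Python, assuming the standard AArch64 ESR_ELx field layout (EC in bits [31:26], IL in bit 25, the fault status code in ISS bits [5:0], WnR in ISS bit 6). Only the 0x2a code is taken from this thread; the other capability fault codes and the sample register values are assumptions for illustration, not values read from the board.

```python
# Assumed Morello capability fault status codes; only 0x2A is quoted in the thread.
CAPABILITY_FSC = {
    0x28: "capability tag fault",
    0x29: "capability sealed fault",
    0x2A: "capability bound fault",
    0x2B: "capability permission fault",
}

# Abort exception classes from the AArch64 ESR_ELx encoding.
EXCEPTION_CLASS = {
    0x20: "instruction abort from a lower EL",
    0x21: "instruction abort, same EL",
    0x24: "data abort from a lower EL",
    0x25: "data abort, same EL",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields relevant to an abort."""
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length bit
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    fsc = iss & 0x3F             # fault status code, ISS bits [5:0]
    decoded = {
        "ec": ec,
        "class": EXCEPTION_CLASS.get(ec, "other"),
        "il": il,
        "fsc": fsc,
        "fault": CAPABILITY_FSC.get(fsc, "non-capability or unknown fault"),
    }
    if ec in (0x24, 0x25):
        # WnR (ISS bit 6) is only meaningful for data aborts: 1 means a write.
        decoded["write"] = bool((iss >> 6) & 0x1)
    return decoded


if __name__ == "__main__":
    # Hypothetical value: data abort at the same EL, write, DFSC = 0x2A.
    for key, value in decode_esr(0x9600006A).items():
        print(key, value)
```

The same helper applies to an instruction abort, in which case no "write" entry is produced because WnR does not apply.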
0 Michal Borowski over 2 years ago in reply to Kevin Brodsky

It is a tricky issue to debug because pressing F6 to step over a function does not stop the execution; it just makes the code run forever (as in the image with red arrows). After pressing the pause button, the state of the registers is therefore different from their state when "curr_sp0_fiq" was first invoked, which makes it difficult to recognize what caused the issue in the first place.

After stepping over the _cpu_init_hook function and pressing the pause button, here's what I can see:

(I highlighted some registers with orange because the font is awkward without it and names can't be seen, changing theme didn't help)

If I understand correctly, the STP instruction in the "write" function makes the program jump to "curr_sp0_fiq". The STP instruction used the C29, C30 and CSP capabilities. CSP appears to have the value 0xFF000000, and because the STP instruction specifies a "-32" offset I checked the contents of address 0xFEFFFFE0 (no idea if this is helpful in any way):
0 Michal Borowski over 2 years ago in reply to Kevin Brodsky

My previous reply to this message was hidden so it may appear after this one.

I just realized that connecting to "Rainier_SMP_0" or "Rainierx4 Multi-Cluster SMP" (instead of "Rainier_0") causes the LDR instruction (the first instruction of the ".pure" function that follows "_start") to jump to 0xE0002000 (and then to "curr_sp0_fiq").

I don't know why, but at least it gives the opportunity to view the register values when the issue first happens.

This is the whole code that executes before the LDR instruction fails:

After the jump to 0xE0002000, this is the state of the registers/memory:

The ESR_EL2 mentions "Capability tag fault", do you know what could be the cause of it? Is it because C0 tag is not equal to 1 for some reason?
0 Kevin Brodsky over 2 years ago in reply to Michal Borowski

Ah that is indeed progress. Your first post suggests some kind of stack overflow, as CSP hits its lower bound. Clearly this is a consequence of something else going wrong, probably related to this strange stepping behaviour.
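The arithmetic behind the stack-overflow reading can be sketched as follows. The CSP value (0xFF000000) and the -32 offset come from the earlier post; the capability bounds are hypothetical, since the actual bounds of CSP are not quoted in the text of this thread. The sketch assumes a pre-indexed STP of two 16-byte capabilities.

```python
from dataclasses import dataclass

CAP_SIZE = 16  # bytes occupied by one Morello capability in memory


@dataclass(frozen=True)
class Bounds:
    base: int   # lowest valid address (inclusive)
    limit: int  # first address past the valid region (exclusive)


def stp_pre_index_range(csp: int, offset: int, count: int = 2) -> tuple:
    """Return the [start, end) byte range written by a pre-indexed STP."""
    start = csp + offset
    return start, start + count * CAP_SIZE


def in_bounds(bounds: Bounds, start: int, end: int) -> bool:
    """True if the whole access lies inside the capability bounds."""
    return bounds.base <= start and end <= bounds.limit


if __name__ == "__main__":
    start, end = stp_pre_index_range(0xFF000000, -32)
    # Hypothetical bounds: CSP already sits at the lower bound of its region.
    stack = Bounds(base=0xFF000000, limit=0xFF100000)
    print(hex(start), hex(end), in_bounds(stack, start, end))
```

With CSP at its lower bound, the store range ends exactly at the base of the region, so every byte of the access lies below it, which would be consistent with a capability bound fault on the write.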

The second post is a lot more straightforward: the function pointer C0 is null-derived, so BR C0 will cause PCC to become an invalid capability and thus cause an instruction abort right away. Of course the question is why C0 would be null-derived. Assuming the code sequence is correctly executed, this could only happen if DDC itself is null. Could you check the value of DDC_EL2?
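The reasoning about a null-derived C0 can be illustrated with a toy model. This is not the code in crt0.S and not a model of the full Morello architecture: it only captures the rule that a validity tag can be inherited from a parent capability but never created. The class, the helper names and the target address are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Capability:
    tag: bool      # validity tag
    address: int   # value (address) field


NULL_CAP = Capability(tag=False, address=0)


def set_address(parent: Capability, address: int) -> Capability:
    """Derive a capability with a new address; the tag is copied, never set."""
    return replace(parent, address=address)


def branch(target: Capability) -> str:
    """Branching to an untagged capability faults instead of executing."""
    if not target.tag:
        return "capability tag fault"
    return f"jump to {target.address:#x}"


if __name__ == "__main__":
    entry = 0x80001000  # hypothetical function address
    print(branch(set_address(NULL_CAP, entry)))
    print(branch(set_address(Capability(tag=True, address=0), entry)))
```

In this model the only way to obtain an untagged function pointer from a correct derivation sequence is to start from an untagged parent, which is why a null DDC is the suspect.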
0 Michal Borowski over 2 years ago in reply to Kevin Brodsky

It seems to be all zeros:
0 Kevin Brodsky over 2 years ago in reply to Michal Borowski

Right, that'll be your problem. How that came to pass, I have no idea... The first step would be to check if it is valid at the very beginning of the execution. If not, there must be something going wrong with the firmware.
0 Michal Borowski over 2 years ago in reply to Kevin Brodsky

I rebooted the board and it seems that DDC_EL2 is no longer null (at the beginning of execution), and the LDR instruction executes fine.

I wrongly assumed that changing "Rainier_0" to "Rainier_SMP_0" or "Rainierx4 Multi-Cluster SMP" made the LDR fail. I think it was a coincidence, because switching between these 3 options no longer makes the LDR fail (following the reboot, which fixed DDC_EL2 being 0).

But the issue where the function enters "curr_sp0_fiq" recursively after being stepped-over is still there.

I set a hardware breakpoint on "curr_sp0_fiq" and on 0xE0002200, used F5 to step until the first function call (which was _cpu_init_hook), and pressed F6 to step over it. The breakpoint was triggered, and the register values are:

I tried expanding the column and copying the text to see if there's something after "following..." but there was nothing, this is what it looks like when using tooltip:

DDC_EL2 seems to keep the same value as at the beginning of execution. I think there may be 2 separate issues: DDC sometimes being 0 (which gets fixed by rebooting the board), and this unidentified issue when stepping over a function call. ELR_EL2 points to the instruction just after the _cpu_init_hook call.

I did another experiment, where I pressed the "continue" button (from the beginning of the "_start" function) instead of using F5 to reach the first function call. Interestingly, in this case the breakpoint is hit with ELR_EL2 having a much higher value (part of the _get_s function), where DDC_EL2 becomes 0.

Apologies for bombarding with all these screenshots/reports but it's all black magic to me, and I can't understand why such weird issues could occur.