GCC 7.2.1 on Cortex-M4 - C++ exceptions not being caught

Hi all, first post. I've posted about this issue in the NXP forums, and it was suggested I post here, since this may be a GCC toolchain issue (if it's not somehow my own fault). If there's a better place to post this, let me know. (I didn't want to post an actual bug report until I know that there is in fact a bug.)

(Quick note: I'm using newlib, not newlib-nano, so that's not my problem.)

The issue is that sometimes when I build and debug my Cortex-M4 (NXP Kinetis K24) C++ application, it becomes impossible to catch any exceptions, even with "catch (...)" (catch-all). By that, I mean that when an exception of any type is thrown, regardless of the catch statements that follow the try block, the following appears on the console:

terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively

That was generated by this test function:

#include <iostream>
#include <stdexcept>

static void VerifyExceptions(void) {
    try {
        throw std::runtime_error("Exceptions are being handled normally.");
    } catch (const std::runtime_error &e) {
        std::cout << e.what() << "\n";
    }
    /* If exceptions are not working correctly, then the throw above will
     * cause __cxa_throw() to call terminate() immediately. Testing this at
     * startup ensures that any issues with exceptions will be immediately
     * diagnosed during development. */
}

This is an example call stack when this fault occurs:

[Image: call stack showing an uncaught exception calling terminate() twice]

Here is what I've determined so far:

1) For a given project and set of code, the issue either always occurs, or never occurs. That is, rebuilding firmware does not change whether the problem manifests. Also, building the code on another workstation using the same toolchain will give the same results.

2) This issue can be triggered at any point during the target's execution. My firmware is a FreeRTOS application, and I am calling VerifyExceptions() in main() shortly after initializing peripherals and stdio, but well before I have allocated/initialized any tasks, let alone started the scheduler.

3) My stack size in the linker is 16K, so it seems tremendously unlikely that this would be some stack-related issue. And as mentioned in 2), this does not need to occur inside an RTOS thread.

4) What seems to change whether this issue occurs or not, is the simple act of adding or removing some portion of code from the project (e.g. creating or expanding class methods). An example: I have a test module that issues one of several strings to a parsing module. I added a new test string to the module, recompiled my firmware, debugged on the target, and found that exceptions had stopped working completely. I removed the test string from the module, and exception handling was restored.

5) It is not necessary to call any of the code that is being added to the project in order to manifest the issue; the mere presence of additional code in the project is enough to cause the issue to occur.

6) There is a workaround that (so far) appears to mitigate this issue 100% of the time (see below).

7) If the system is operating normally, I can reverse the presence or absence of the workaround and recompile, and the system will then manifest the issue.

8) Whether or not the issue is present, my program works 100% fine (all threads running and healthy), unless some module throws an exception.

9) The only module in my code that is designed to throw exceptions is a JSON parser which I have used successfully in a previous project. All calls into the JSON parser are surrounded by try/catch statements.

10) I am building my application with newlib, which has exception handling enabled, versus newlib-nano, which does not. I have also specified -fexceptions in my C and C++ compiler flags.

If you read the NXP thread linked above, I go into more detail, but in the course of investigating this issue, I happened upon a workaround. It involves adding a single data member (wibble_) to one of my application objects, creating a method (DoNothing()) which simply contains "wibble_ = 123;", then calling that method from the constructor of the object. That's it. When VerifyExceptions() was triggering an uncaught exception, adding the code described caused exceptions to behave and be caught normally once again.

Here's the best part... as I continued development, I encountered the issue again, where all exceptions became uncatchable again. So I simply commented out the call to DoNothing(), recompiled, and now exceptions work again. I have gone through this iteration at least two or three more times, where I continue development, rebuild, and find that exceptions are no longer being caught. I then flip the commented/uncommented state of the DoNothing() call, rebuild, and debug, and the system works fine again. The DoNothing() call has become a toggle switch in the code; it either MUST be present, or MUST NOT be present, for a given set of code to build and operate correctly.

This smells like some sort of alignment issue, where inserting a small blob of code (e.g. the call to DoNothing()) causes something in the build to misalign, or fixes an existing misalignment. However, I have no idea where to start looking at this. All I know is that this issue is 100% reproducible with my current codebase. I can toggle whether DoNothing() is or isn't called by the constructor, and that either breaks catching exceptions globally, or fixes them. And to reiterate, my program doesn't even call the object prior to calling VerifyExceptions() in main(). My object is a singleton that is initialized by a GetInstance() method, and I've verified that DoNothing(), when enabled, isn't called until much further down in main(), well past VerifyExceptions().

And to emphasize, I am NOT having an issue with unexpected exceptions; this is an issue where exceptions are wholly anticipated and should be caught with valid try/catch statements, but for whatever reason, any exception causes terminate() to be called.

So... how should I start looking into this? Version info below:

arm-none-eabi-gcc.exe (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release)

Windows 10 Pro v1803 patched, NXP MCUXpresso v10.2.1, NXP Kinetis K24 (MK24FN1M0VLL12)

Top replies

0 Tejas Belagod over 7 years ago

Hi David,

The symptoms you describe point to a memory leak, a buffer overflow, or a section overlap.

I'd start with a few sanity checks (sorry, I don't know if you've already done this).

1. Try a later toolchain (7-2018-q2-update) to see if the problem persists

2. Try building with the Linux toolchain?

3. I'd go back to the linker map file(s) and double-check all the limits of text section, data section etc. Also check if the heap and stack are non-overlapping or have been mapped to RAM correctly.

4. Check if all the startup code/drivers copy all ROM to RAM correctly.

5. Check if you have unaligned accesses across regions in the system memory map?

6. Check if you have any activity in the bit-band alias region that is accidentally toggling bits in the corresponding bit-band region in SRAM.

0 David R. over 7 years ago in reply to Tejas Belagod

I realize this is a very belated reply, but having finished development on this project, I can now devote some time to addressing side issues such as this. The only question of yours I have a solid answer for is #1... NXP just released MCUXpresso v11.0.0, which includes GCC 8 (2018q4-major), and the issue still persists. I haven't done any development on Linux in a while; I don't even have a Linux machine or VM set up at the moment. I'm certain that the system heap and stack are mapped to separate, non-overlapping sections of SRAM; if they weren't, toggling one line of code wouldn't suddenly make everything work. As for the other questions... my Arrow rep has told me that he has the line of someone at NXP who is willing to investigate this issue more thoroughly, so they would be able to examine the other potential issues that you raise. Thank you for taking the time to reply. I do appreciate your help, and I'm sorry for getting back to this so late.

0 Andrés over 6 years ago in reply to David R.

Did you find a solution for this? I'm having the same problem!

0 David R. over 6 years ago in reply to Andrés

I have not yet, but apparently an engineer from NXP is willing to investigate the issue, once I port my custom application back onto a FRDM kit. I will update here if I learn anything from their investigation.

+1 hansdampf over 6 years ago
Could you post your linker script and startup code?

Can you compare the map files or symbols/section addresses of working and non working example?

I know it's a bit late, but since the problem is not solved and I had a very similar issue these days I can maybe offer some help.

There are many wrong linker scripts out there where symbols like __exidx_start or __exidx_end are placed outside of the section like this:

SECTIONS
{
    ...
    .ARM.extab :
    {
        *(.ARM.extab* .gnu.linkonce.armextab.*)
    } > FLASH

    __exidx_start = .;
    .ARM.exidx :
    {
        *(.ARM.exidx* .gnu.linkonce.armexidx.*)
    } > FLASH
    __exidx_end = .;
    ...
}

The problem is that __exidx_start is not guaranteed to be the start of the .ARM.exidx section. This can go wrong in several ways, and I've seen one case where the linker placed other stuff between __exidx_start and .ARM.exidx. In that case it could be seen in the map file.

The fix was just to place it inside the section (the alignment should not be the problem here):

.ARM.exidx :
{
    __exidx_start = .;
    *(.ARM.exidx* .gnu.linkonce.armexidx.*)
    __exidx_end = .;
} > FLASH
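
To illustrate why a misplaced __exidx_start can break every catch at once, here is a minimal host-side sketch. It is not the actual libgcc unwinder. It assumes a simplified table of two-word entries that hold absolute function start addresses (real .ARM.exidx entries use 31-bit self-relative offsets), and the names FindPayload, kImage and kTableEnd are hypothetical. The idea is that the unwinder does roughly a binary search over the range between the two symbols to find the entry covering the throwing address. If the start symbol sits even one word before the real table, every entry is read shifted by one word and the lookup returns garbage.

```cpp
#include <cstddef>
#include <cstdint>

// One stray word (e.g. padding or another section's data), followed by three
// simplified index entries of the form {function start, unwind payload}.
// Assumption: absolute addresses stand in for the real self-relative offsets.
static const std::uint32_t kImage[] = {
    0x00000000u,    // stray word between the start symbol and the table
    0x1000u, 0xAu,  // entry 0
    0x2000u, 0xBu,  // entry 1
    0x3000u, 0xCu,  // entry 2
};
static const std::uint32_t *const kTableEnd = kImage + 7;

// Binary search for the last entry whose start address is <= pc and return
// its payload word, or 0 if no entry covers pc.
std::uint32_t FindPayload(const std::uint32_t *start,
                          const std::uint32_t *end, std::uint32_t pc) {
    const std::size_t count = static_cast<std::size_t>(end - start) / 2;
    if (count == 0 || start[0] > pc) {
        return 0;
    }
    std::size_t lo = 0;      // start[2 * lo] <= pc holds throughout
    std::size_t hi = count;  // entries at index hi and above are past pc
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (start[2 * mid] <= pc) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return start[2 * lo + 1];
}
```

With the correct start, FindPayload(kImage + 1, kTableEnd, 0x2010) yields 0xB, the payload of the entry that covers that address. With the start one word too early, FindPayload(kImage, kTableEnd, 0x2010) yields 0x3000, which is a function address being misread as unwind data. On a real target, a lookup that fails or returns nonsense like this would plausibly end in terminate() no matter what catch clauses exist.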

0 skearney over 6 years ago in reply to hansdampf

FYI, I just had the exact same issue as David R. (intermittent/seemingly non-deterministic calling of std::terminate() when catching exceptions) and I was able to isolate this fix by hansdampf as the solution.

0 David R. over 6 years ago in reply to hansdampf

OP here, finally... I've been able to test this on my setup (got pulled in to do some code maintenance), and I can verify, this is in fact the solution. I'm writing up a big post on the NXP forums right now and will link here when I'm done. Thank you so much for identifying this issue.

0 David R. over 6 years ago in reply to hansdampf

For those interested, I have written a reply in the NXP forums here which references this post:

https://community.nxp.com/thread/524440

0 David R. over 6 years ago in reply to hansdampf

Even better, I captured and compared some MAP files which show exactly how this defect manifests. Link here:

community.nxp.com/.../524440

EDIT: NXP has accepted that this is an issue with their toolchain and will be issuing a fix in late February.