GCC 7.2.1 on Cortex-M4 - C++ exceptions not being caught

Hi all, first post.  I've posted about this issue in the NXP forums, and it was suggested I post here, since this may be a GCC toolchain issue (if it's not somehow my own fault).  If there's a better place to post this, let me know.  (I didn't want to post an actual bug report until I know that there is in fact a bug.)

(Quick note: I'm using newlib, not newlib-nano, so that's not my problem.)

The issue is that sometimes when I build and debug my Cortex-M4 (NXP Kinetis K24) C++ application, it becomes impossible to catch any exceptions, even with "catch (...)" (catch-all). By that, I mean that when an exception of any type is thrown, regardless of the catch statements that follow the try block, the following appears on the console:

terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively

That was generated by this test function:

#include <iostream>
#include <stdexcept>

static void VerifyExceptions(void) {
  try {
    throw std::runtime_error("Exceptions are being handled normally.");
  } catch (const std::runtime_error &e) {
    std::cout << e.what() << "\n";
  }
  /* If exceptions are not working correctly, then the throw above will
   * cause __cxa_throw() to call terminate() immediately.  Testing this at
   * startup assures that any issues with exceptions will be immediately
   * diagnosed during development. */
}

This is an example call stack when this fault occurs:

[Image: call stack showing an uncaught exception calling terminate() twice]

Here is what I've determined so far:

1) For a given project and set of code, the issue either always occurs, or never occurs.  That is, rebuilding firmware does not change whether the problem manifests.  Also, building the code on another workstation using the same toolchain will give the same results.

2) This issue can be triggered at any point during the target's execution.  My firmware is a FreeRTOS application, and I am calling VerifyExceptions() in main() shortly after initializing peripherals and stdio, but well before I have allocated/initialized any tasks, let alone started the scheduler.

3) My stack size in the linker is 16K, so it seems tremendously unlikely that this would be some stack-related issue.  And as mentioned in 2), this does not need to occur inside an RTOS thread.

4) What seems to determine whether the issue occurs is the simple act of adding or removing some portion of code from the project (e.g. creating or expanding class methods).  An example: I have a test module that issues one of several strings to a parsing module.  I added a new test string to the module, recompiled my firmware, debugged on the target, and found that exceptions had stopped working completely.  I removed the test string from the module, and exception handling was restored.

5) It is not necessary to call any of the code that is being added to the project in order to manifest the issue; the mere presence of additional code in the project is enough to cause the issue to occur.

6) There is a workaround that (so far) appears to mitigate this issue 100% of the time (see below).

7) If the system is operating normally, I can reverse the presence or absence of the workaround and recompile, and the system will then manifest the issue.

8) Whether or not the issue is present, my program works 100% fine (all threads running and healthy), unless some module throws an exception.

9) The only module in my code that is designed to throw exceptions is a JSON parser which I have used successfully in a previous project.  All calls into the JSON parser are surrounded by try/catch statements.

10) I am building my application with newlib, which has exception handling enabled, versus newlib-nano, which does not.  I have also specified -fexceptions in my C and C++ compiler flags.

If you read the NXP thread linked above, I go into more detail, but in the course of investigating this issue, I happened upon a workaround.  It involves adding a single data member (wibble_) to one of my application objects, creating a method (DoNothing()) which simply contains "wibble_ = 123;", then calling that method from the constructor of the object.  That's it.  When VerifyExceptions() was triggering an uncaught exception, adding the code described caused exceptions to behave and be caught normally once again.

Here's the best part... as I continued development, I encountered the issue again, where all exceptions became uncatchable again.  So I simply commented out the call to DoNothing(), recompiled, and now exceptions work again.  I have gone through this iteration at least two or three more times, where I continue development, rebuild, and find that exceptions are no longer being caught.  I then flip the commented/uncommented state of the DoNothing() call, rebuild, and debug, and the system works fine again.  The DoNothing() call has become a toggle switch in the code; it either MUST be present, or MUST NOT be present, for a given set of code to build and operate correctly.

This smells like some sort of alignment issue, where inserting a small blob of code (e.g. the call to DoNothing()) causes something in the build to misalign, or fixes an existing misalignment.  However, I have no idea where to start looking at this.  All I know is that this issue is 100% reproducible with my current codebase.  I can toggle whether DoNothing() is or isn't called by the constructor, and that either breaks catching exceptions globally, or fixes them.  And to reiterate, my program doesn't even call the object prior to calling VerifyExceptions() in main().  My object is a singleton that is initialized by a GetInstance() method, and I've verified that DoNothing(), when enabled, isn't called until much further down in main(), well past VerifyExceptions().
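One way to start testing the alignment theory is to sanity-check the exception index table that the ARM unwinder searches when an exception is thrown.  The sketch below is only that, a sketch: it assumes the table bounds are available as the __exidx_start and __exidx_end symbols that libgcc's ARM unwinder references on bare-metal targets (whether my linker script places them as expected is exactly the open question), and CheckExidxRange() and ExidxLooksSane() are hypothetical helper names, not part of any library.  It only checks that the range is ordered, word-aligned, and a whole number of 8-byte entries.

```cpp
#include <cstdint>

// Result of a basic sanity check on the .ARM.exidx address range.
struct ExidxCheck {
  bool ordered;               // start <= end
  bool word_aligned;          // both bounds are 4-byte aligned
  bool whole_entries;         // size is a multiple of the 8-byte entry size
  std::uint32_t entry_count;  // number of complete entries in the range
};

// Each .ARM.exidx entry consists of two 32-bit words (8 bytes).
constexpr std::uintptr_t kExidxEntrySize = 8;

// Hypothetical helper: inspects only the bounds, never the table contents.
ExidxCheck CheckExidxRange(std::uintptr_t start, std::uintptr_t end) {
  ExidxCheck r{};
  r.ordered = start <= end;
  r.word_aligned = (start % 4 == 0) && (end % 4 == 0);
  const std::uintptr_t size = r.ordered ? end - start : 0;
  r.whole_entries = r.ordered && (size % kExidxEntrySize == 0);
  r.entry_count = static_cast<std::uint32_t>(size / kExidxEntrySize);
  return r;
}

// True when the range is ordered, aligned, non-empty, and entry-sized.
bool ExidxLooksSane(const ExidxCheck &r) {
  return r.ordered && r.word_aligned && r.whole_entries && r.entry_count > 0;
}

// On the target (assumption: the linker script defines these symbols):
//   extern "C" char __exidx_start[];
//   extern "C" char __exidx_end[];
//   const ExidxCheck r = CheckExidxRange(
//       reinterpret_cast<std::uintptr_t>(__exidx_start),
//       reinterpret_cast<std::uintptr_t>(__exidx_end));
```

A range that passes this check does not prove the table contents are correct, but a range that fails it in the "broken" build and passes in the "working" build would point at the linker script rather than the compiler.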

And to emphasize, I am NOT having an issue with unexpected exceptions; this is an issue where exceptions are wholly anticipated and should be caught with valid try/catch statements, but for whatever reason, any exception causes terminate() to be called.

So... how should I start looking into this?  Version info below:

arm-none-eabi-gcc.exe (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release)

Windows 10 Pro v1803 patched, NXP MCUXpresso v10.2.1, NXP Kinetis K24 (MK24FN1M0VLL12)

  • Hi David,

    The symptoms you describe point to a memory leak, a buffer overflow, or a section overlap.

    I'd start with a few sanity checks (sorry, I don't know if you've already done this).

    1. Try a later toolchain (7-2018-q2-update)  to see if the problem persists

    2. Try building with the Linux version of the toolchain.

    3. I'd go back to the linker map file(s) and double-check all the limits of text section, data section etc. Also check if the heap and stack are non-overlapping or have been mapped to RAM correctly.

    4. Check that the startup code/drivers copy everything that needs to move from ROM to RAM correctly.

    5. Check whether you have unaligned accesses that cross region boundaries in the system memory map.

    6. Check whether any activity in the bit-band alias region is accidentally toggling bits in the corresponding bit-band region in SRAM.
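
    The map-file check in item 3 can be partly automated.  The following is a rough sketch, assuming you copy the start address and size of each section, the heap, and the stack out of the linker map by hand; Region, Overlaps(), and FindOverlaps() are hypothetical names, and any addresses you feed in are whatever your own map file reports.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One memory region as read from the linker map (values entered by hand).
struct Region {
  std::string name;
  std::uint32_t start;  // first address of the region
  std::uint32_t size;   // length in bytes
};

// Treats each region as the half-open interval [start, start + size).
// Empty regions never overlap anything; adjacent regions do not overlap.
bool Overlaps(const Region &a, const Region &b) {
  const std::uint64_t a_end = std::uint64_t{a.start} + a.size;
  const std::uint64_t b_end = std::uint64_t{b.start} + b.size;
  return a.size != 0 && b.size != 0 && a.start < b_end && b.start < a_end;
}

// Returns the names of every pair of regions that overlap.
std::vector<std::pair<std::string, std::string>> FindOverlaps(
    const std::vector<Region> &regions) {
  std::vector<std::pair<std::string, std::string>> hits;
  for (std::size_t i = 0; i < regions.size(); ++i) {
    for (std::size_t j = i + 1; j < regions.size(); ++j) {
      if (Overlaps(regions[i], regions[j])) {
        hits.emplace_back(regions[i].name, regions[j].name);
      }
    }
  }
  return hits;
}
```

    Running it on the region list from a "working" build and from a "broken" build, and comparing the two results, should show quickly whether anything collides only in the broken one.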