Hi,
We have an application using RL-ARM on an STM32 (Cortex-M3), with 6 tasks running in total.
Between our tasks we pass messages via mailboxes, using buffers taken strictly from an _alloc_box / _free_box memory pool.
However, in long-term testing of our product, we are experiencing a system crash after ~3 days, where an invalid memory address (as in, not one from the pool) comes off the mailbox. It looks like an address that the RTOS itself uses. However, the addresses going into the mailbox have all been allocated with _alloc_box.
We have tried updating to the latest MDK (3.50) binary-only version, and it does the same.
Does anyone have any ideas or advice as to why this could be happening, and how we could debug this? We have tried tracing, and increasing the stack size, but this does not help. Our next idea at the moment was to see if we could obtain the source code, but I am not sure this will tell us much more.
Best Regards,
Martin.
It sounds like memory corruption. Mailboxes use dynamically allocated memory; can't you use events/mutexes/semaphores in conjunction with statically allocated data? Maybe you are even suffering from memory fragmentation, which is likely after a large number of allocations and deallocations.
Hi Tamir,
Thanks for your reply. I agree with you it does sound like memory corruption.
However, the _alloc_box and _free_box routines use a statically allocated memory pool. Therefore the memory is declared from the outset, and each element is the same size in bytes. Dynamic memory allocation is not used.
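To illustrate why a pool of this kind cannot fragment, here is a minimal sketch of a fixed-size block pool built on a free list. This is not RL-ARM's implementation; the names pool_init, pool_alloc and pool_free and the sizes are made up for illustration, and the sketch has no interrupt or task locking, which a real RTOS pool would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  32u   /* bytes per block (assumed value) */
#define BLOCK_COUNT 8u    /* blocks in the pool (assumed value) */

/* A free block stores the link to the next free block in its own memory. */
typedef union block {
    union block *next;            /* valid while the block is free */
    uint8_t payload[BLOCK_SIZE];  /* message data while the block is in use */
} block_t;

static block_t pool_mem[BLOCK_COUNT];  /* storage is reserved at link time */
static block_t *free_list;

/* Chain every block onto the free list. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {
        pool_mem[i].next = free_list;
        free_list = &pool_mem[i];
    }
}

/* Pop one block, or return NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    block_t *b = free_list;
    if (b != NULL) {
        free_list = b->next;
    }
    return b;
}

/* Push a block back. Nothing here detects a double free or a foreign pointer. */
void pool_free(void *p)
{
    block_t *b = p;
    assert(b != NULL);
    b->next = free_list;
    free_list = b;
}
```

Because every block has the same size, any free block satisfies any request, so fragmentation is not an issue. The weak point is the link stored inside each free block: writing to a block after it has been freed, or freeing it twice, corrupts the list, and a later allocation can then return an address outside the pool.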
If you don't have access to the source code, try wrapping all calls to the allocation/deallocation routines in your own routines that keep a record of the memory in use.
1) Verify that all the items returned from _alloc_box are actually within the box.
2) Do not free the same memory location more than once. This may (will) cause the box to become invalid.
3) If you overwrite data within a box used to keep track of the state of the box, the results are unpredictable. (i.e. your code will fail at some point)
Item 2 or 3 is the most likely cause if check #1 fails.
If check #1 passes, then I would suspect you are overwriting the mailbox structure at some point. This will also cause unpredictable results.
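The checks above can be sketched as a thin wrapper that keeps its own record outside the pool. This is only an illustration: raw_alloc and raw_free are stand-ins for the real RTOS calls (_alloc_box / _free_box), and tracked_alloc, tracked_free, the chk_t codes and the pool dimensions are hypothetical names and values chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLK_SIZE  32u   /* assumed block size */
#define BLK_COUNT 8u    /* assumed block count */

typedef enum {
    CHK_OK, CHK_POOL_EMPTY, CHK_OUT_OF_RANGE,
    CHK_MISALIGNED, CHK_ALREADY_ALLOCATED, CHK_NOT_ALLOCATED
} chk_t;

static uint8_t pool[BLK_COUNT][BLK_SIZE];
static bool in_use[BLK_COUNT];     /* our own record, kept outside the pool */

/* Stand-ins for the RTOS allocator; replace with the real calls. */
static bool raw_taken[BLK_COUNT];
static void *raw_alloc(void)
{
    for (size_t i = 0; i < BLK_COUNT; i++) {
        if (!raw_taken[i]) { raw_taken[i] = true; return pool[i]; }
    }
    return NULL;
}
static void raw_free(void *p)
{
    raw_taken[(size_t)((uint8_t *)p - &pool[0][0]) / BLK_SIZE] = false;
}

/* Check 1: is p the start address of a block inside the pool? */
static chk_t locate(const void *p, size_t *idx)
{
    uintptr_t base = (uintptr_t)&pool[0][0];
    uintptr_t addr = (uintptr_t)p;
    if (addr < base || addr >= base + sizeof pool) return CHK_OUT_OF_RANGE;
    if ((addr - base) % BLK_SIZE != 0u) return CHK_MISALIGNED;
    *idx = (size_t)((addr - base) / BLK_SIZE);
    return CHK_OK;
}

chk_t tracked_alloc(void **out)
{
    size_t idx;
    void *p = raw_alloc();
    *out = NULL;
    if (p == NULL) return CHK_POOL_EMPTY;
    chk_t r = locate(p, &idx);
    if (r != CHK_OK) return r;                       /* allocator returned garbage */
    if (in_use[idx]) return CHK_ALREADY_ALLOCATED;   /* same block handed out twice */
    in_use[idx] = true;
    *out = p;
    return CHK_OK;
}

/* Check 2: refuse a second free instead of passing it to the allocator. */
chk_t tracked_free(void *p)
{
    size_t idx;
    chk_t r = locate(p, &idx);
    if (r != CHK_OK) return r;
    if (!in_use[idx]) return CHK_NOT_ALLOCATED;
    in_use[idx] = false;
    raw_free(p);
    return CHK_OK;
}
```

Setting a breakpoint on every return path other than CHK_OK should catch the first bad call rather than the crash that follows later. The same range check can also be applied to each pointer received from a mailbox before it is used.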
Thanks for your replies.
After extensive testing, we were still unable to determine the cause of this failure.
We have peppered the application with breakpoints that trigger if any RTOS function returns an incorrect value, and we also confirmed that _alloc_box returns in-bounds addresses as expected.
What was strange is that some of the OS calls were returning corrupted values. But if you reposition the program counter at the start of the function and run it again, the program continues OK, and the RTOS call returns the correct value.
Some kind of corruption is occurring but we cannot see where.
As a test, we have replaced the Keil RTOS with FreeRTOS. Our application has now been running for nearly a week, whereas we would get 1-3 days with RL-ARM.
I still cannot rule out some rogue code in our app, but we will see what happens in this test.
Thanks,
Have you seen this thread: http://www.keil.com/forum/docs/thread14677.asp
Might there be a problem that interrupts are enabled somewhere by an OS call, while you assume that you are still in a critical section?
Hi Martin,
We have found a problem in the Cortex-M interrupt priority arbiter which allows an SVC to be interrupted by SysTick/PendSV interrupts of the same priority while the register state is being pushed (a window of a few clock cycles after the SVC instruction has executed, but before the SVC handler has been activated). Because of this problem, the system might execute RTX SVC system calls incorrectly in very rare timing sequences where the SVC is followed by the SysTick timer interrupt.
We have corrected this problem in the new MDK/RL-ARM v3.60.
Franc
Hi Per / Franc,
Thanks for your replies. Per, your thread makes interesting reading, but we do not have any critical sections and do not disable global interrupts.
We will try 3.6 and see if that cures the problem.
It looks like 3.6 has cured the problem.
Our system has been running for a week now, whereas before we got 3 days max.