Exception entry stacking on Cortex-M7: is there really no way to stack onto the main stack?

Hi all,

I spent a long time tracking down a seemingly nonsensical issue with IRQ handling on an MCU with an ARM Cortex-M7 core.

In short: the firmware is a moderately complex system with a tiny OS, running on the MCU. It has 6 fixed threads, accesses a lot of peripherals, uses DMA, external DMA, etc. Everything worked fine until I changed the display update function (the LCD is connected to an external bus interface) from a CPU-driven version to a DMA-driven one, which frees up CPU time for the other threads in the meantime.
This worked fine by itself, except when I also need to serve a serial line by IRQ (DMA is not possible with flow control, a known issue of the MCU).

The main problem was that the IRQ handler couldn't read all the incoming bytes (3 Mbaud, 300 kB/s max.): at random, one IRQ was not serviced in time, a data overrun occurred, and the data stream was broken.
(The exception handler is placed in ITCM and accesses only DTCM memory, so that it runs as fast as possible.)
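
To put a number on "in time", here is a minimal sketch of the arithmetic. It assumes 8N1 framing (10 bits on the wire per byte), which matches the 300 kB/s figure above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bytes per second for a given baud rate and frame length.
 * bits_per_frame = 10 assumes 8N1 (start + 8 data + stop). */
static uint32_t bytes_per_second(uint32_t baud, uint32_t bits_per_frame)
{
    assert(bits_per_frame > 0u);
    return baud / bits_per_frame;
}

/* Time between two consecutive received bytes, in nanoseconds
 * (rounded down). */
static uint32_t byte_period_ns(uint32_t baud, uint32_t bits_per_frame)
{
    assert(baud > 0u);
    return (uint32_t)(((uint64_t)bits_per_frame * 1000000000ull) / baud);
}
```

With `baud = 3000000` and `bits_per_frame = 10` this gives 300000 bytes/s and a byte period of about 3.3 µs. If the receiver holds only one unread byte (an assumption, the buffering depth is not stated here), an IRQ entry delay of a few microseconds is already enough to cause an overrun.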

I dug deep and found two causes that together stalled the CPU for many microseconds, so that the IRQ was not serviced in time.
The first was a memory barrier instruction in a place where I did not intend to have one (it came from an inline function used for disabling interrupts globally).
The second was that the exception frame was stacked onto the process stack (the active stack at the time of the IRQ request), and these stacks were located in cached SRAM. Because they are cached, and the process could run a long stretch of code that only accesses data in external memory (SDRAM) without touching its stack, the process stack cannot have been in the cache at all when the IRQ request arrived. Freeing up a cache line for the stack then took a long time, because all the cache lines must have been occupied by SDRAM data (written, i.e. dirty). The SDRAM access by itself does not take that long, but it shares the bus with the DMA updating the display (LCD), and when the two were active on the bus concurrently, IRQ entry was delayed hugely and the data overrun occurred...
(All of this was measured with an oscilloscope and debug I/O pins showing the execution time of the code sections.)
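
A rough sketch of how many cache lines the hardware-pushed frame touches. It assumes the 32-byte data cache line of the Cortex-M7, the 8-word basic frame (32 bytes), the 26-word frame with floating-point context (104 bytes), and 8-byte stack alignment on entry. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES   32u   /* Cortex-M7 L1 data cache line */
#define FRAME_BASIC_BYTES  32u   /* R0-R3, R12, LR, PC, xPSR */
#define FRAME_FP_BYTES    104u   /* basic frame + S0-S15, FPSCR, reserved word */

/* Lowest address of the frame: the stack grows downwards and the
 * hardware aligns the frame to 8 bytes on exception entry. */
static uint32_t frame_base(uint32_t sp, uint32_t frame_bytes)
{
    return (sp - frame_bytes) & ~7u;
}

/* Number of distinct cache lines covered by the frame. */
static uint32_t frame_cache_lines(uint32_t sp, uint32_t frame_bytes)
{
    assert(frame_bytes > 0u);
    uint32_t base = frame_base(sp, frame_bytes);
    uint32_t last = base + frame_bytes - 1u;
    return last / CACHE_LINE_BYTES - base / CACHE_LINE_BYTES + 1u;
}
```

So a basic frame touches one or two lines and a frame with FP context touches four. If the region is write-back, write-allocate (an assumption about the MPU setup) and none of those lines is in the cache, each of them may first need a dirty victim line written back to SDRAM and then a line fill, all before the first handler instruction runs.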

The ARM Cortex-M7 docs say that the CPU stacks the exception frame onto the current stack (so in my case mostly the PSP, in cached SRAM), and there seems to be no way to force it to stack onto the main stack regardless of the current process stack. The main stack could easily be placed in DTCM; it doesn't even need to be big. But if I want to benefit from the CPU's exception system (which is really fast), I have to put all my process stacks into DTCM, which is quite a waste... Only a little memory is left there for the other arrays and buffers that are kept in DTCM so as not to disturb the cache, to get fast run time, etc.
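
For reference, the rule as I read it in the ARMv7-M documentation, written as a small host-side model (plain C, not code that runs in the handler). The type and function names are invented for illustration; the EXC_RETURN values are the ones for frames without floating-point context.

```c
#include <stdbool.h>
#include <stdint.h>

/* CONTROL.SPSEL (bit 1): 0 = Thread mode uses MSP, 1 = Thread mode uses PSP. */
#define CONTROL_SPSEL (1u << 1)

typedef enum { STACK_MSP, STACK_PSP } stack_sel;

/* Stack that receives the hardware-pushed exception frame.
 * Handler mode always runs on MSP; Thread mode follows CONTROL.SPSEL. */
static stack_sel frame_target(bool handler_mode, uint32_t control)
{
    if (!handler_mode && (control & CONTROL_SPSEL) != 0u)
        return STACK_PSP;
    return STACK_MSP;
}

/* EXC_RETURN value placed in LR on exception entry (no FP context). */
static uint32_t exc_return_for(bool handler_mode, uint32_t control)
{
    if (handler_mode)
        return 0xFFFFFFF1u;               /* return to Handler mode, MSP */
    return (control & CONTROL_SPSEL) != 0u
        ? 0xFFFFFFFDu                     /* return to Thread mode, PSP */
        : 0xFFFFFFF9u;                    /* return to Thread mode, MSP */
}
```

In this model there is no input that sends the frame of a PSP thread to the MSP: the frame always goes to whatever stack the interrupted code was using, and only the handler itself then runs on the MSP.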

I wonder if there's a way to force the CPU to stack the exception entry frame directly onto the main stack instead of the current one. (I can't see how it would be possible.)

What's the reason for not offering this to developers? Pushing the exception entry data onto the main stack and returning from there to any stack wouldn't take much magic. If I'm right, why isn't it implemented (as an option or as fixed behaviour)? Why allow the use of process stacks, which may at random be quite slow for the data pushed onto them?

(I have now moved all the process stacks to DTCM and removed the memory barrier instruction, and the system is better than ever. I can barely measure the jitter of the exception handler start (of course I can): it's some 10 ns at most. It is now impossible to miss a byte. So the problem is solved, but I'm a bit sad that I had to put all my stacks into the good and valuable DTCM instead of only the main stack.)
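
A minimal sketch of what placing the stacks in DTCM can look like with a GCC-style toolchain. The section name `.dtcm_stacks`, the stack size and the helper name are assumptions for illustration; the linker script has to map that section to the DTCM address range.

```c
#include <stdint.h>

#define THREAD_COUNT        6u      /* the six fixed threads */
#define THREAD_STACK_BYTES  1024u   /* hypothetical per-thread stack size */
#define THREAD_STACK_WORDS  (THREAD_STACK_BYTES / sizeof(uint64_t))

/* Hypothetical section name; only applied on ELF targets with a
 * GCC-compatible compiler, otherwise the array is placed normally. */
#if defined(__ELF__) && defined(__GNUC__)
#define IN_DTCM __attribute__((section(".dtcm_stacks")))
#else
#define IN_DTCM
#endif

/* uint64_t elements give the 8-byte alignment the AAPCS expects. */
static uint64_t thread_stacks[THREAD_COUNT][THREAD_STACK_WORDS] IN_DTCM;

/* Initial PSP value for a thread: one past the end of its array,
 * because the stack is full-descending. */
static uint64_t *thread_stack_top(uint32_t thread)
{
    return thread_stacks[thread] + THREAD_STACK_WORDS;
}
```

With these example numbers the six stacks cost 6 KiB of DTCM in total, which is the price paid here for a predictable exception entry time.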

Thanks for your answers in advance!