I have been reading through the ARM documentation on memory and instruction barriers.
I have read that single-core ARMv7-M parts do not reorder instructions, so DSB and ISB are not needed. Is this correct?
I have also read the same about DMB; however, there is a concern across clock domains. For example, if a peripheral is running at a lower clock speed, then writing to the peripheral could take a long time. Imagine that you clear an interrupt flag in a peripheral during the ISR: if the clear does not take effect before exiting the ISR, it could falsely trigger the ISR again. Hence I was wondering if DMB would fix a cross-clock-domain problem like this? And if so, does this only happen with the core data cache enabled?
That is, how can the core complete the write to the volatile flag register while the actual update takes longer, unless the peripheral has a local cache? And in that case, how does the core know that the peripheral's local cache register has been applied?
So in what cases on a single-core part would DSB be appropriate in code? And the same for DMB and ISB?
Hi Trampas,
We have an application note covering memory barrier instructions in Cortex-M:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0321a/index.html
DMB/DSB doesn't necessarily help the interrupt-clear race condition you mention. (It might work if the extra execution time of DMB/DSB is enough to avoid the race condition, but that is not always the case.)
One possible solution is to do a dummy read of the same peripheral after the clear, before exiting the ISR.
http://www.keil.com/support/docs/3928.htm
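The dummy-read pattern can be sketched as below. This is a minimal, host-runnable simulation: the peripheral name, register name, and clear-by-writing-0 behavior are stand-ins for whatever the vendor's device header defines (many real peripherals are write-1-to-clear), and on real hardware the register block would sit at a fixed MMIO address.

```c
#include <stdint.h>

/* Hypothetical timer peripheral, simulated as a plain variable so this
 * sketch compiles and runs anywhere. On real hardware the pointer would
 * come from the vendor's device header at a fixed MMIO address. */
typedef struct {
    volatile uint32_t INTFLAG;   /* 1 = interrupt pending */
} TimerRegs;

static TimerRegs TIMER0 = { .INTFLAG = 1u };

void TIMER0_IRQHandler(void)
{
    /* Clear the pending flag (simulated here by writing 0; many real
     * peripherals use write-1-to-clear instead). */
    TIMER0.INTFLAG = 0u;

    /* Dummy read of the same peripheral: on AHB/APB this read cannot be
     * accepted until the preceding write has actually reached the
     * peripheral, so the flag is genuinely clear before the ISR returns. */
    (void)TIMER0.INTFLAG;
}
```

Because the read targets the same peripheral as the write, it is ordered behind it on the bus, which a DSB alone cannot guarantee end-to-end (see below).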
regards,
Joseph
A concern I have is when the peripheral is running at a lower clock rate, for example the core at 100 MHz and the peripheral at 32 kHz. Processor vendors do not seem to indicate whether the interrupt flag/enable registers are cached in the peripheral. If the peripheral has a cache/shadow register, then the write to the flag could complete at the core clock rate, but the peripheral's actual register would be updated later. Thus an ISR handler could clear the flag and continue running, but the real clear could take a long time to happen. Also, in this case, what happens if you read the flag register back? Do you read the cached copy?
If the register is not cached in the peripheral, then the core would have to stall for ~3000 clocks to do a read/write. If this happens, do DSB/DMB indeed enforce a barrier?
For the dummy reads, this seems like it would do the same as a DMB: the second memory access would have to wait for the first to finish (assuming they do not use different memory buses). Again, I am unsure whether the ARM core allows different memory buses to operate concurrently or not.
Also, the second memory access will not help unless that second memory location is volatile. Specifically, the write to clear the interrupt flag is (should be) a volatile memory access; however, if the second memory access is not to a volatile location, then the compiler could reorder the code and have the second memory access occur first, since the flag clearing does not depend on it. Hence a compiler memory barrier would be needed before the dummy read to ensure ordering (this compiler memory barrier is implicit if the second read is to a volatile memory location).
Hence I am still confused over when a DMB/DSB is needed, and how they interact with processor vendors' peripheral accesses. Of course, I guess this is highly dependent on how well the vendor designed their peripherals, unless ARM has some rules, which I am hoping they do.
If the peripheral is running at 32 kHz (e.g. an RTC), then the bus interface is likely running on a different clock domain (meaning the RTC is divided into two clock domains), and the registers would be shadowed. It is extremely unlikely that the access will take 1000+ cycles.
Please note that Cortex-M3/M4 has a write buffer. A write operation can take a number of clock cycles, but the subsequent instruction can start to execute before the write has completed. The write buffer in Cortex-M3/M4 is single-entry, and if there is a dummy data read, the read cannot be accepted until the write has completed (a consequence of the AHB protocol). Also, in the case of Cortex-M3/M4, exception entry/return cannot start until the buffered data write is completed. In Cortex-M3/M4, issuing a DSB ensures the write buffer is drained before the next instruction (which could be any instruction, in the case of DSB). A DMB can also be used if you just want to make sure the next data memory access doesn't start until the buffered write is completed.
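As a sketch of the DSB usage described above: the interrupt-enable register below is a hypothetical stand-in simulated as a variable, and on an ARM target you would use the CMSIS `__DSB()` intrinsic; the inline-asm / no-op fallback only exists so the example builds and runs on a host, where there is no write buffer to drain.

```c
#include <stdint.h>

/* On ARM targets, CMSIS provides the __DSB() intrinsic; this fallback
 * is only so the sketch compiles anywhere. */
#if defined(__arm__) || defined(__thumb__)
  #define DSB() __asm volatile ("dsb" ::: "memory")
#else
  #define DSB() ((void)0)   /* host build: no write buffer to drain */
#endif

/* Hypothetical peripheral interrupt-enable register (simulated). */
static volatile uint32_t periph_inten = 1u;

void disable_periph_irq(void)
{
    periph_inten = 0u;  /* on Cortex-M3/M4 this write may sit in the
                           single-entry write buffer for a few cycles */
    DSB();              /* drain the write buffer: nothing after this
                           point executes until the write has completed */
}
```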
In Cortex-M7, the write buffer is multi-entry, and some of the peripherals can be connected by AXI, which supports multiple outstanding transfers. Unlike Cortex-M3/M4, the write buffer doesn't have to be drained before exception entry/exit - you might want to use DSB to drain the write buffer, but a dummy read could be better, as I will explain below. Similar to Cortex-M3/M4, you can use DSB and DMB in the same way.
Most of the peripherals are connected using the AHB or APB bus protocols. These protocols don't allow multiple outstanding transfers or reordering between transfers (AXI allows both). So in most cases a dummy read ensures that the bus transfer has actually completed at the peripheral bus interface level.
The reason I suggested a dummy read rather than DSB is that DSB would not help where there are write buffers in the system-level AHB-to-APB or AXI-to-AHB/APB bus bridge(s). (We've seen several support cases where an interrupt was carried out twice due to delays in system-level write buffers.)
You are right that the register accessed by the dummy read needs to be defined as volatile - otherwise the C compiler will optimize the read away. C compilers must not reorder volatile data accesses relative to each other, so there is no need to put a memory barrier between peripheral accesses.
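A minimal illustration of that volatile requirement (the register layout and names are hypothetical, and the "peripheral" is a static struct so the example runs on a host; on real hardware the macro would map to a fixed MMIO base address from the device header):

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t CTRL;
    volatile uint32_t STATUS;  /* volatile: every access must be emitted,
                                  in program order, by the compiler */
} UartRegs;

/* Hypothetical peripheral instance; on real hardware this would be
 * something like ((UartRegs *)0x40004000u). */
static UartRegs uart_sim;
#define UART0 (&uart_sim)

uint32_t clear_and_confirm(void)
{
    UART0->STATUS = 0u;          /* clear (simulated; real HW may be
                                    write-1-to-clear)                  */
    uint32_t v = UART0->STATUS;  /* volatile read: the compiler must keep
                                    it and must not hoist it above the
                                    write, so no explicit compiler
                                    barrier is needed                  */
    return v;
}
```

Had STATUS not been volatile, the compiler would be free to delete the read entirely, defeating the purpose of the dummy access.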
When you read back the flag register, whether you read the value in the fast clock domain or the value in the 32 kHz clock domain is completely dependent on the peripheral design. I don't think there is any rule on that. Usually MCU vendors provide example code for their peripherals, and I think they should provide more details in the documentation to explain their designs.
In the case that the interrupt line takes even longer to be de-asserted at the processor side - for example, a clock-domain-crossing interface for the interrupt signal might cause a few clock cycles of delay in interrupt de-assertion - MCU vendors should provide a way for software developers to check whether the interrupt line is actually de-asserted (e.g. a status register that reflects the interrupt line).
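Such a check might look like the sketch below. The status register and its bit position are hypothetical (simulated as a variable so the example runs on a host); the bounded loop just ensures the ISR cannot hang if the line never drops.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical register reflecting the raw interrupt line after the
 * clock-domain crossing; simulated so the sketch runs on a host.
 * 0 = line de-asserted. */
static volatile uint32_t irq_line_status = 0u;

bool wait_irq_deasserted(uint32_t max_polls)
{
    while (max_polls--) {
        if ((irq_line_status & 1u) == 0u)
            return true;          /* line confirmed low: safe to return
                                     from the ISR without re-triggering */
    }
    return false;                 /* still asserted after the bound:
                                     caller should flag the error */
}
```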
>Of course I guess this is highly depended on how bad the vendor did their peripherals ...
LOL :-)