by Fred Le Cam and Kais Errouissi
ARM Cortex-M7 based microcontrollers usually share similar ARM Cortex-M7 processor configuration options, including a 64-bit AXI system bus interface, instruction and data caches, a 64-bit Instruction Tightly Coupled Memory (ITCM), and dual 32-bit Data Tightly Coupled Memory (DTCM).
There are several features that differentiate the STM32F7 from other Cortex-M7 based microcontrollers:
When designing CM7 MCUs, system memory is normally connected to either the TCM interface or the AXI interface (an optimized interface that improves throughput for both on-chip and off-chip slow memories). The STM32F7 is designed with a dual Flash interface, AXI and TCM, which offers greater flexibility for code execution. Complementing this flexibility is a built-in Flash accelerator (ART) that brings 0-wait-state execution from Flash. Using the TCM interface with the ART accelerator gives performance similar to that of the cached AXI interface, without the penalty of cache misses and cache maintenance operations in user code. The embedded Flash is also designed to be seen as either a single 256-bit bank or dual 128-bit banks, the latter enabling the read-while-write feature. Alternatively, using the cached AXI interface allows applications with large code sizes to use external memory at no performance penalty.
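To make the dual Flash interface concrete, here is a minimal sketch of how the same Flash word can be addressed through either path. The base addresses are the ones we recall from the STM32F756 memory map (ITCM alias at 0x00200000, AXI alias at 0x08000000) and the 1 MB size is an assumption for this device; the macro and function names are hypothetical, so check them against the reference manual before relying on them.

```c
#include <stdint.h>

/* Assumed STM32F756 values; names are illustrative, not ST header names. */
#define FLASH_ITCM_BASE   0x00200000u  /* Flash seen through the ITCM interface */
#define FLASH_AXIM_BASE   0x08000000u  /* Flash seen through the AXI interface  */
#define FLASH_SIZE_BYTES  0x00100000u  /* 1 MB of embedded Flash (assumption)   */

/* Translate an AXI-side Flash address to its ITCM alias.
 * Returns 0 when the address is outside the embedded Flash. */
static uint32_t flash_axim_to_itcm(uint32_t axim_addr)
{
    if (axim_addr < FLASH_AXIM_BASE ||
        (axim_addr - FLASH_AXIM_BASE) >= FLASH_SIZE_BYTES) {
        return 0u;
    }
    return (axim_addr - FLASH_AXIM_BASE) + FLASH_ITCM_BASE;
}
```

In practice the choice between the two aliases is usually made in the linker script rather than at runtime; the helper only illustrates that both windows map onto the same physical Flash.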
Taking advantage of ST’s ART Accelerator™ as well as an L1 cache (up to 16 KB), STM32F7 devices deliver the maximum theoretical performance of the Cortex-M7 whether code is executed from embedded Flash or from external memory: 1082 CoreMark / 462 DMIPS at 216 MHz fCPU.
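The figures above are easier to compare with other cores once they are normalized to the clock frequency. The short sketch below does that arithmetic; the constant and function names are ours, and only the three numbers come from the text.

```c
#include <stdio.h>

/* Figures quoted in the text above. */
#define STM32F7_COREMARK  1082.0
#define STM32F7_DMIPS      462.0
#define STM32F7_FCPU_MHZ   216.0

/* Normalize a benchmark score to the CPU clock (score per MHz). */
static double score_per_mhz(double score, double fcpu_mhz)
{
    return score / fcpu_mhz;
}

/* Print both normalized figures; meant to be called from main() on a host PC. */
static void print_normalized_scores(void)
{
    printf("CoreMark/MHz: %.2f\n",
           score_per_mhz(STM32F7_COREMARK, STM32F7_FCPU_MHZ));
    printf("DMIPS/MHz:    %.2f\n",
           score_per_mhz(STM32F7_DMIPS, STM32F7_FCPU_MHZ));
}
```

Calling print_normalized_scores() prints 5.01 CoreMark/MHz and 2.14 DMIPS/MHz.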
One benefit of this architecture is in HMI applications, where audio data and graphics data must be transferred concurrently to and from the system RAM.
These SRAM blocks can be cached or non-cached, except for the TCM RAMs, which sit on the cache-less TCM interface directly connected to the core. Data and instruction transfers to the TCM RAMs are therefore not affected by traffic on the other bus interfaces, making them well suited for critical data or code sections where determinism is a must.
Besides the FMC external memory interface, the STM32F7 features a QSPI serial memory interface. The interface is cacheable and seen by all masters, including the CPU, DMA, DMA2D and LTDC. Memory attached to the QSPI interface is therefore well suited for code execution, graphics frame buffers and data storage. In Dual Flash mode, where two Quad-SPI flash memories are accessed in parallel, up to 8 bits (or 16 bits in DDR mode) are sent or received every cycle over the 8 I/O pins, effectively doubling both the throughput and the capacity. A single or dual flash connected over QSPI strikes a good compromise between performance and PCB cost.
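As a rough illustration of the Dual Flash arithmetic described above, the following sketch computes the bits moved per clock cycle and the resulting peak data-phase throughput. The function names are hypothetical and the clock frequency is whatever the caller supplies; command, address and dummy cycles are ignored, so real throughput will be lower.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits transferred per clock cycle during the data phase in Quad mode:
 * 4 data lines per flash, doubled in Dual Flash mode, doubled again in DDR. */
static uint32_t qspi_bits_per_cycle(bool dual_flash, bool ddr)
{
    uint32_t bits = 4u;
    if (dual_flash) {
        bits *= 2u;
    }
    if (ddr) {
        bits *= 2u;
    }
    return bits;
}

/* Peak data-phase throughput in bytes per second for a given QSPI clock.
 * Ignores command, address and dummy cycles. */
static uint64_t qspi_peak_bytes_per_sec(uint32_t clk_hz, bool dual_flash, bool ddr)
{
    return ((uint64_t)clk_hz * qspi_bits_per_cycle(dual_flash, ddr)) / 8u;
}
```

For example, with an assumed 100 MHz QSPI clock, Dual Flash mode in SDR gives a peak of 100,000,000 bytes per second, twice the single-flash figure.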
The 64-bit AHB interface of the internal Flash, connected to the cached AXI bus of the Cortex-M7 through the bus matrix (refer to Fig2), offers low latency for cache line fills: fewer than 12 CPU cycles when running at the maximum system operating frequency (216 MHz).
Note that the DTCM and ITCM RAMs (tightly coupled memories) are not part of the bus matrix. The Data TCM RAM is accessible by the general-purpose DMAs and peripheral DMAs through a dedicated slave bus, AHBS. These capabilities offer greater flexibility in partitioning application code across memories.
As the saying goes, “A picture is worth a thousand words”.
Fig3 is a quick summary of the different memory resources and interfaces available on the STM32F756 device: their base addresses, their maximum access width and addressable range, and the read/write accesses permitted on each.
Fig3 also gives some recommendations on application code partitioning for better system reliability and performance.
More details can be found in ST application note AN4667, “STM32F7 Series system architecture and performance”.
Obviously, the CM7 microcontroller design is the best fit for embedded applications demanding high processing performance, real-time response capability and energy efficiency.
However, in many embedded applications, overall size and determinism are more important than absolute performance. A system tradeoff must therefore be made between absolute performance on one side and system determinism and application safety on the other. To implement this tradeoff, here are some programming considerations to take into account when developing user code or when migrating from legacy Cortex-M based applications.
Memory order reflects the order in which the CPU accesses off-chip or on-chip memory at runtime.
In some high-performance processors, such accesses may occur out of program order.
For most simple microcontrollers, if all peripheral device registers are declared as volatile and their memory spaces are defined as non-cacheable, then no further action is required to maintain memory access ordering.
Where the processor can reorder accesses, however, software must include memory barrier instructions to force the ordering.
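A minimal sketch of what such a barrier looks like in user code is given below. The mailbox structure and function are hypothetical. On a Cortex-M target the barrier would normally be the CMSIS `__DMB()` intrinsic; here the macro falls back to a C11 fence so the example also builds on a host PC, which is our assumption and not something the article prescribes.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Data memory barrier: "dmb" on 32-bit ARM GCC/Clang builds, a C11 fence
 * elsewhere so the sketch stays buildable on a host PC. */
#if defined(__arm__) && defined(__GNUC__)
#define DATA_MEMORY_BARRIER()  __asm volatile ("dmb" ::: "memory")
#else
#define DATA_MEMORY_BARRIER()  atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical mailbox shared with another bus master or an interrupt. */
typedef struct {
    volatile uint32_t payload;
    volatile uint32_t ready;
} mailbox_t;

/* Write the payload, then raise the flag. The barrier keeps the payload
 * write from being observed after the flag write. */
static void mailbox_publish(mailbox_t *mb, uint32_t value)
{
    mb->payload = value;
    DATA_MEMORY_BARRIER();
    mb->ready = 1u;
}
```

The consumer side needs the mirror image: read the flag, issue a barrier, then read the payload.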
To prevent reordering, it is also possible to use the 16-region Memory Protection Unit (MPU) available on the STM32F7 microcontroller to define a memory as a Device or Strongly-ordered region. However, this reduces performance and does not prevent the C compiler from reordering the data transfers in the generated code. For best efficiency, ARM recommends defining SRAM as Normal memory and using memory barriers in situations where memory ordering is important.
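For readers who want to see how the memory type is expressed to the MPU, the sketch below builds the value of the ARMv7-M MPU region attribute and size register (MPU_RASR) for the three memory types mentioned above. The field positions follow the ARMv7-M architecture; the enum, helper names and the choice of full read/write access are our assumptions. On a real target the value would be written to `MPU->RASR` after selecting the region in `MPU->RNR` and setting its base in `MPU->RBAR`.

```c
#include <stdint.h>

/* ARMv7-M MPU_RASR field positions. */
#define RASR_ENABLE    (1u << 0)
#define RASR_SIZE(n)   (((uint32_t)(n) & 0x1Fu) << 1)  /* region = 2^(n+1) bytes */
#define RASR_B         (1u << 16)                      /* bufferable */
#define RASR_C         (1u << 17)                      /* cacheable  */
#define RASR_TEX(t)    (((uint32_t)(t) & 0x7u) << 19)
#define RASR_AP_FULL   (3u << 24)                      /* full read/write access */

typedef enum { MEM_STRONGLY_ORDERED, MEM_DEVICE, MEM_NORMAL_WBWA } mem_type_t;

/* SIZE field for a region of 'bytes' (a power of two, at least 32). */
static uint32_t rasr_size_field(uint32_t bytes)
{
    uint32_t n = 0u;
    while (n < 31u && (1u << (n + 1u)) < bytes) {
        n++;
    }
    return n;
}

/* Build an MPU_RASR value for the requested memory type and region size. */
static uint32_t mpu_rasr(mem_type_t type, uint32_t bytes)
{
    uint32_t attr;
    switch (type) {
    case MEM_STRONGLY_ORDERED:   /* TEX=000, C=0, B=0 */
        attr = RASR_TEX(0);
        break;
    case MEM_DEVICE:             /* TEX=000, C=0, B=1 (shareable device) */
        attr = RASR_TEX(0) | RASR_B;
        break;
    default:                     /* TEX=001, C=1, B=1: write-back, write-allocate */
        attr = RASR_TEX(1) | RASR_C | RASR_B;
        break;
    }
    return attr | RASR_AP_FULL | RASR_SIZE(rasr_size_field(bytes)) | RASR_ENABLE;
}
```

For a 64 KB region, this yields 0x0300001F for Strongly-ordered, 0x0301001F for Device and 0x030B001F for Normal write-back memory.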
For more details, see the ARM Infocenter.
The STM32F7 embeds instruction and data caches to compensate for the wait states inserted when fetching code and data from on-chip or off-chip memories, thereby boosting performance.
However, going through those caches does not preserve determinism, because of cache misses and cache line fills. That is why TCM memories are recommended for critical code execution and critical data storage, for instance in home appliances or motor control applications, where safety is a must.
Software cache maintenance operations are needed because, besides the CPU, cached memories can be accessed by other masters, including the LTDC, DMA and DMA2D. Those masters may read stale data from physical memory while newer data is still held in the CPU caches.
To solve this issue the following are recommended in user code:
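One widely used pattern, sketched here under our own assumptions and not quoted from the article, is to clean the data cache over a buffer before another master reads it, and to invalidate it before the CPU reads data written by another master. Both operations work on whole cache lines (32 bytes on the Cortex-M7), so the buffer range first has to be widened to line boundaries. The type and helper names below are hypothetical. The CMSIS calls named in the comments exist in core_cm7.h, but their exact parameter types vary between CMSIS versions.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* Cortex-M7 L1 data cache line size in bytes */

typedef struct {
    uintptr_t start;    /* aligned down to a cache line boundary */
    size_t    length;   /* multiple of the cache line size */
} cache_range_t;

/* Widen [addr, addr + len) so that it covers whole cache lines. */
static cache_range_t cache_align_range(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_range_t r;
    uintptr_t end = (addr + len + DCACHE_LINE_SIZE - 1u) & mask;

    r.start  = addr & mask;
    r.length = (size_t)(end - r.start);
    return r;
}

/* On target, the aligned range would then be passed to CMSIS, for example:
 *   SCB_CleanDCache_by_Addr(...)       before a DMA/LTDC/DMA2D read
 *   SCB_InvalidateDCache_by_Addr(...)  before the CPU reads DMA-written data
 */
```

Because invalidation discards whole lines, buffers shared with a DMA are best aligned to and sized in multiples of 32 bytes, so that unrelated variables never share a cache line with them.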
During programming operations on the STM32F7 on-chip Flash memory or on off-chip Flash memory (connected through QSPI or FMC), it is recommended to configure these memories as Device or Strongly-ordered using MPU settings. Once programming is finished, it is recommended to restore the Normal memory attribute in order to take advantage of the STM32F7 L1 caches.