I believe that many of us are interested in the ARM Cortex-M7.
Recently, jyiu posted a status update, where I asked a couple of questions about the architecture.
A few questions on the subject were also asked in the Interview and Question Time with Joseph Yiu discussion.
As I think the information posted is important and relevant, I'm posting a shortened version here, so it's easier to find.
Links:
Cortex-M7 Processor - ARM
ARM Cortex-M7 Processor Technical Reference Manual
ARMv7-M Reference Manual (Issue E.b)
AnandTech | Cortex-M7 Launches: Embedded, IoT and Wearables
ARM Supercharges MCU Market with High Performance Cortex-M7 Processor
ARM gives Internet of Things a piece of its mind – the Cortex-M7
STM32F7 von STMicroelectronics: ARMs Cortex-M7 (this article is in German)
Freescale Plans Extreme Performance for Kinetis MCUs with ARM® Cortex®-M7 Core
ARM Cortex-M - Wikipedia
NEW App Note: Migrating Application Code from ARM Cortex-M4 to Cortex-M7 Processors
Meet the new ARM Cortex-M7 processor: supercharging embedded devices
Atmel launches new series ARM Cortex-M7 based MCUs
As you see, STMicroelectronics will be releasing their first Cortex-M7 soon; Microchip and Freescale are also close.
Q: The Cortex-M7 now has a Branch Predictor and a BTAC. Does this mean that branches take only 1 clock cycle (or perhaps even less)?
A: Yes, if correctly predicted the branch instruction is only 1 cycle.
Q: Does the 6-stage pipeline mean that loads can be achieved in a single cycle as well?
A: Load from TCM is pipelined with other operations, so essentially single cycle or even less due to dual issue.
Q: The Cortex-M7 should be able to run at speeds up to 400 MHz, is that correct?
A: The clock frequency depends on the semiconductor process node.
400 MHz is the estimate for a 40 nm low-power (LP) process. With 28 nm (e.g. 28hpm) or 14 nm, the clock frequency can go much higher.
Q: From what I've heard, the interrupt latency is sometimes 12 and sometimes 11 clock cycles, depending on the situation?
A: Interrupt latency is a complex topic because it depends on the design of the memory system.
The complete picture is fairly complex and I think we will need to create a separate document for that.
Q: Is it possible to move data directly between general purpose registers and floating point (single/double precision) registers without storing the data in memory first ?
A: The VMOV instruction (which already exists on the Cortex-M4) allows data values to be transferred between general-purpose registers and floating point registers.
According to Wikipedia, the Cortex-M7 supports the same instruction set as the Cortex-M4F. I do not know whether there are any additions to the instruction set, but given the double-precision floating point, the enhanced DSP Extensions and the BPU, I would expect a few extra instructions (I'm only guessing here).
Personally, I'm very much looking forward to using the Branch Predictor, the BTAC, the 6-stage superscalar pipeline, the dual integer pipe ALU, the higher speed, the double-precision floating point and the FPP.
If you have some technical information, I'd like to encourage you to post it here.
Hi,
Can anybody help me get information regarding stalls on the Cortex-M7? I am using the STM32F769NI Eval Board with the IAR toolchain. I wrote a simple piece of ASM code of 50 instructions, mostly VLDMIAs and VMLAs, but I am getting around 170 cycles when executing these instructions.
Thanks in Advance.
Jaikanth.
I assume you have read the document on Migrating Cortex-M4 Applications to Cortex-M7 written by bobboys.
I don't have any information on stalls on the Cortex-M7, as I have unfortunately not worked with it yet.
However, I would start out by using the same rules as I'm using on the Cortex-M3 and Cortex-M4.
Note: jyiu may override what I say anytime!
I think the most important thing to avoid is using a register as a base or index register right after it's updated.
E.g.
    adds.n r1,r1,#17 /* [1] update r1 */
    /* [1] stall while waiting for r1 to be written to the register file */
    ldrb.n r0,[r1] /* [2] load the value */
But I've found that on the Cortex-M3 (and most likely the Cortex-M4 as well), it helps to align all 32-bit opcodes ("wide" instructions) on a 32-bit boundary.
E.g. if your wide instruction (any instruction that uses a high register, for instance) happens to be at an address that is not divisible by 4, and the instruction preceding it is a narrow instruction (a 16-bit wide opcode), then change the 16-bit opcode into a 32-bit opcode by postfixing the instruction with .w instead of .n.
Remember that stalls can also occur when accessing the AHB peripherals or other peripherals.
On some devices, AHB peripheral access may require 2 clock cycles, which means you can insert a one-cycle instruction after your store instruction, in order to do something useful while you wait.
If you're doing a lot of base-register updates in a loop, I would like to suggest that you use fixed offsets instead.
Doing so not only gets rid of the base-register updates, but also ensures that your store instructions will use only a single clock cycle.
This is an easy optimization, which can be used when unrolling loops.
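As a rough C-level sketch of the fixed-offset idea: the function name copy_words_unrolled is made up for illustration, and whether the compiler really emits fixed-offset loads and stores for it depends on the toolchain and optimization level, so it is worth checking the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `count` 32-bit words, four per iteration.
 * Inside the unrolled body every access uses a constant offset from the
 * base pointers; the pointers are advanced only once per iteration. */
void copy_words_unrolled(uint32_t *dst, const uint32_t *src, size_t count)
{
    size_t blocks = count / 4;

    while (blocks--) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;   /* single base update per four words */
        src += 4;
    }

    /* Copy the 0..3 leftover words. */
    for (size_t i = 0; i < (count & 3u); i++) {
        dst[i] = src[i];
    }
}
```

In hand-written assembly the same pattern would be a run of loads/stores with immediate offsets, followed by one add per base register at the end of the unrolled body.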
Hi jaikanth7,
In my experience, the Cortex-M7 does not do FPU register forwarding (a.k.a. register bypass). Therefore, some stall cycles will occur if there is a register dependency between 2 or 3 consecutive instructions.
On the Cortex-M7, you had better use a C compiler to reduce such stall cycles.
Best regards,
Yasuhiko Koumoto.
Hi all,
Thanks to Jens and Yasuhiko for their useful replies.
It is not easy to guess the reason for slow performance.
The issue could be in the memory system (e.g. if there is a 64-bit load/store to AXI and if the address is not 64-bit aligned, the memory might need two transfers).
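To make the alignment point concrete, here is a small host-side C sketch. The helper names and the simplified model of an 8-byte-wide bus are assumptions for illustration only; this is not a description of how the actual AXI interface behaves.

```c
#include <stdint.h>

/* Returns 1 if addr is a multiple of 8 (i.e. 64-bit aligned), else 0. */
int is_aligned64(uintptr_t addr)
{
    return (addr & 7u) == 0;
}

/* Simplified model: how many 8-byte-wide, 8-byte-aligned windows does an
 * access of `size` bytes starting at `addr` touch? */
unsigned windows64(uintptr_t addr, unsigned size)
{
    if (size == 0) {
        return 0;
    }
    uintptr_t first = addr >> 3;               /* index of first window */
    uintptr_t last  = (addr + size - 1) >> 3;  /* index of last window  */
    return (unsigned)(last - first + 1);
}
```

Under this model an 8-byte access at an address ending in 0 or 8 touches one window, while the same access at an address ending in 4 touches two, which matches the "two transfers" case mentioned above.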
A couple more things to think about:
- if you are using memory that is accessed via AXI, maybe enabling the cache can help? (If the source data comes from a DMA operation, then the cache will not help.)
- a memory access speed test with a simple LDMIA, to check the access speed of the memory, might be useful?
- are you running this code in an ISR? (effect of lazy stacking)
- have you tried scheduling the instruction sequence to interleave integer and floating point operations? (many FPU operations cannot be dual issued)
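A speed test like the one suggested above could be structured roughly as in this C sketch. All names are hypothetical. On real hardware the counter callback would typically read the DWT cycle counter and the work callback would run the LDMIA sequence under test; here both are replaced by host-side stand-ins so the sketch can be tried without a board.

```c
#include <stdint.h>

typedef uint32_t (*read_cycles_fn)(void);  /* returns a free-running 32-bit counter */
typedef void (*work_fn)(void);             /* code sequence to be timed */

/* Returns the counter delta across one call to work().
 * Unsigned subtraction gives the right result across a single wraparound. */
uint32_t measure_cycles(read_cycles_fn read_cycles, work_fn work)
{
    uint32_t start = read_cycles();
    work();
    uint32_t end = read_cycles();
    return end - start;
}

/* Host-side stand-ins (assumptions, not real hardware access). */
static uint32_t fake_counter = 0xFFFFFFF0u;  /* starts just before wraparound */

static uint32_t fake_read_cycles(void)
{
    return fake_counter;
}

static void fake_work(void)
{
    fake_counter += 300u;  /* pretend the work took 300 cycles */
}
```

Calling measure_cycles(fake_read_cycles, fake_work) exercises the wraparound case. On a target, the measurement itself adds a few cycles of overhead, so timing an empty work function first and subtracting that baseline is a reasonable precaution.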
regards,
Joseph