Hi,

Coming from a games coder background, I always seek to find the very limits of what a CPU can do. Now we have PragmatIC and very cheap CPUs but, much more importantly, vastly cheaper MROM (Mask ROM). With this in mind, I wanted to know how many registers I could REALLY use on the Atmel SAMD21, and found some very interesting ideas and some very interesting possibilities.

For subroutines:
1) R14 (LR) can be stacked and unstacked - an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.

These things work, although that last one seems to have some rules that I am still divining - and it is not much use if it is technically part of the design errata.

Now I'm interested in the special registers, or rather the instructions themselves (MRS, MSR). The ARM Infocenter notes that they perform a read-modify-write sequence and lists the special registers as: APSR, IPSR, EPSR, IEPSR, IAPSR, EAPSR, PSR, MSP, PSP, PRIMASK, CONTROL.

It appears that the field governing which special register is read or written is a 5-bit field and, what is more (for low-cost debug, I'm guessing), if you select a value outside the range of the special registers, it acts on the general-purpose registers. I'm interested in knowing if people can see optimizations in this. Code from Flash often has a 1-cycle penalty, so a single instruction that performs a RMW will be faster.

I know these are the extreme cases, but getting a fixed-point implementation of MP3 decode, for example, will really be scratching around for stray bus cycles. Plastic is 20 years behind silicon and will be very cheap, so I'm jumping the gun a couple of years because I believe the M0 & M0+ running MBed will become the de facto baseline processor. It is only bitter experience with MROMs (order 80000 units of Chuck Rock Jr for the Megadrive, sell 60000 and you make a loss) that has put people off. Now, especially with simple CRC 10:8, MROM will provide a yield so close to 100% that it will be reborn.
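
For anyone wanting to poke at the register-select field, here is a minimal C sketch of the SYSm values that, as far as I recall from the ARMv6-M Architecture Reference Manual, select each special register in MRS/MSR. The helper name sysm_name is hypothetical, and the numbers should be checked against the manual before relying on them. The manual describes other SYSm values as UNPREDICTABLE, so whatever a given SAMD21 does with them is an observation about that part, not an architectural guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map an MRS/MSR SYSm value to the name of the
 * ARMv6-M special register it selects. Values are taken from memory of
 * the ARMv6-M Architecture Reference Manual (an assumption to verify).
 * Returns NULL for values with no architected meaning. */
const char *sysm_name(uint8_t sysm)
{
    switch (sysm) {
    case 0:  return "APSR";
    case 1:  return "IAPSR";
    case 2:  return "EAPSR";
    case 3:  return "XPSR";    /* listed as "PSR" in the post above */
    case 5:  return "IPSR";
    case 6:  return "EPSR";
    case 7:  return "IEPSR";
    case 8:  return "MSP";
    case 9:  return "PSP";
    case 16: return "PRIMASK";
    case 20: return "CONTROL";
    default: return NULL;      /* no architected special register */
    }
}
```

Calling sysm_name(4) or sysm_name(10) returns NULL, which marks the gaps where the behaviour described above would have to be confirmed on real hardware.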
Sean Dunlevy said:
For subroutines:
1) R14 (LR) can be stacked and unstacked - an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.

R13: I see no benefit in storing SP to RAM unless you really run out of registers and the function is very long.
R14: C compilers also use R14, but more likely on Cortex-M3/4/7, where it can be used directly.
R15: How should the CPU know which instruction comes next if you modify the PC?
Hi - yes, using all 16 registers would drastically speed up certain routines. One instance of this is codebook generation in ACELP. I'm sure you can imagine that the quality of the sound relies heavily on focussed search techniques. There is little advantage in using wideband speech encoding unless you can get a code that is close to, if not the closest to, perfect.

There have been many articles on extracting the most out of the Thumb instruction set (e.g. Efficient Use of Invisible Registers in Thumb Code), but the premise that for every low register there is an equivalent high register provides a much faster, more efficient, smaller and, I guess, more aesthetic paradigm.

I appreciate the input, so I will look at some other cores. I'm now setting up the system to write the speed-dependent code into RAM. The Flash is only 16-bit, so code doesn't execute well from it, but with a DMA from Flash to RAM at a lower priority than the CPU, part of the RAM becomes a scratchpad, like the monsters of old such as the PSX & Jaguar.

I have to say this about Atmel: they did confirm their specifications, so now I know 32 bits are read if they are on a 32-bit boundary. Along with the cache, that gives two rules that will improve execution time. There are a few rules so you never need to use boundary instructions.

Many thanks.
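
The 32-bit boundary rule above can be sketched as a small C model. It assumes only what is stated in the post: an aligned 32-bit word is read in one go, so two 16-bit Thumb instructions sharing a word cost one read. The function name word_fetches is hypothetical, and the model deliberately ignores the cache, wait states, branches and 32-bit instructions such as BL.

```c
#include <stdint.h>

/* Hypothetical model: count the aligned 32-bit words touched by a
 * straight-line run of n 16-bit Thumb instructions starting at addr.
 * addr is assumed to be halfword aligned. */
uint32_t word_fetches(uint32_t addr, uint32_t n)
{
    if (n == 0u) {
        return 0u;                             /* nothing to fetch */
    }
    uint32_t first = addr >> 2;                /* word index of first byte */
    uint32_t last = (addr + 2u * n - 1u) >> 2; /* word index of last byte */
    return last - first + 1u;
}
```

Under this model, four instructions starting at 0x1000 touch two words, while the same four starting at 0x1002 touch three, which is the case for aligning the entry point of a hot loop.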
Forgive my ignorance, but how do you want to use R15 for anything useful? The cache is completely transparent to the core's instruction fetch.