Hi,

Coming from a games coding background, I always try to find the very limits of what a CPU can do. Now we have PragmatIC and very cheap CPUs but, much more importantly, vastly cheaper MROM (Mask ROM). With this in mind, I wanted to know how many registers I could REALLY use on the Atmel SAMD21, and found some very interesting ideas and some very interesting possibilities.

For subroutines:
1) R14 (LR) can be stacked and unstacked, which gives an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.

These things work, although that last one seems to have some rules that I am still divining, and it is not much use if it is technically part of the design errata.

Now I'm interested in the special registers, or rather the instructions themselves (MRS, MSR). The ARM Infocenter notes that they perform a read-modify-write sequence and lists the special registers as:

APSR, IPSR, EPSR, IEPSR, IAPSR, EAPSR, PSR, MSP, PSP, PRIMASK, CONTROL

It appears that the field governing which special register is read or written is a 5-bit field and, what is more (for low-cost debug, I'm guessing), if you select a value outside the range of the special registers, it acts on the general-purpose registers. I'm interested in knowing if people can see optimizations in this. Code from Flash often has a 1-cycle penalty, so a single instruction that performs a RMW will be faster.

I know these are the extreme cases, but getting a fixed-point implementation of MP3 decode running, for example, will really mean scratching around for stray bus cycles. Plastic is 20 years behind silicon and will be very cheap, so I'm jumping the gun by a couple of years because I believe the M0 & M0+ running MBed will become the de facto baseline processor. It is only bitter experience with MROMs (order 80000 units of Chuck Rock Jr for the Megadrive, sell 60000 and you make a loss) that has put people off. Now, especially with simple CRC 10:8, MROM will provide a yield so close to 100% that it will be reborn.
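To make the MRS/MSR register-select field above concrete, here is a minimal host-side sketch in C that builds the two instruction encodings. It assumes the SYSm numbering and the 32-bit Thumb encodings as I read them in the ARMv6-M Architecture Reference Manual. The names (sysm, encode_mrs, encode_msr, sysm_is_defined) are my own and purely illustrative, and none of this has been checked against SAMD21 silicon.

```c
#include <assert.h>
#include <stdint.h>

/* SYSm numbers as I read them from the ARMv6-M manual (assumption:
   not verified on SAMD21 hardware). The enum names are hypothetical. */
enum sysm {
    SYSM_APSR    = 0,
    SYSM_IAPSR   = 1,
    SYSM_EAPSR   = 2,
    SYSM_XPSR    = 3,   /* the combined register, listed as PSR above */
    SYSM_IPSR    = 5,
    SYSM_EPSR    = 6,
    SYSM_IEPSR   = 7,
    SYSM_MSP     = 8,
    SYSM_PSP     = 9,
    SYSM_PRIMASK = 16,
    SYSM_CONTROL = 20
};

/* Returns 1 if sysm selects one of the special registers listed above,
   0 for every other value. */
static int sysm_is_defined(unsigned sysm)
{
    switch (sysm) {
    case SYSM_APSR:  case SYSM_IAPSR: case SYSM_EAPSR: case SYSM_XPSR:
    case SYSM_IPSR:  case SYSM_EPSR:  case SYSM_IEPSR:
    case SYSM_MSP:   case SYSM_PSP:
    case SYSM_PRIMASK: case SYSM_CONTROL:
        return 1;
    default:
        return 0;
    }
}

/* "MRS Rd, <special>": first halfword 0xF3EF, second 0x8000 | Rd<<8 | SYSm.
   The result holds the first halfword in its upper 16 bits. */
static uint32_t encode_mrs(unsigned rd, unsigned sysm)
{
    assert(rd <= 15u && sysm <= 0xFFu);
    return (0xF3EFu << 16) | 0x8000u | (rd << 8) | sysm;
}

/* "MSR <special>, Rn": first halfword 0xF380 | Rn, second 0x8800 | SYSm. */
static uint32_t encode_msr(unsigned sysm, unsigned rn)
{
    assert(rn <= 15u && sysm <= 0xFFu);
    return ((0xF380u | rn) << 16) | 0x8800u | sysm;
}
```

For example, encode_mrs(0, SYSM_PRIMASK) gives 0xF3EF8010, which matches what a disassembler shows for "mrs r0, primask". The highest defined value is 20, so every defined selector fits in 5 bits, although the encoding itself seems to leave 8 bits for the field. The values with no register behind them (4, 10 to 15, 17 to 19, and 21 upwards) are the ones where the behaviour I describe above would show up.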
PS I forgot to mention that the M0 has a 3-stage pipeline and can load 2 instructions at once. On the Hitachi SuperH SH-2 design (also 32-bit with 16-bit instructions), positioning memory reads and writes on 32-bit boundaries meant there was no clash of bus access, so it made a big difference when you optimized for it.

I am going to be using DMA in my work, so I'm interested in knowing if a DMA channel can have lower priority than the CPU, since that would conveniently simplify the aim of 100% bus bandwidth use. I cannot see any documents or data concerning this in the ARM data. I suppose this is more correctly an ATMEL question, but I mention it here in the spirit of the original design team's 'MIPS for the masses'. The CPU runs at 48MHz, so I will aim to use every cycle to access RAM/Flash.
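To put a number on "every cycle", here is a rough cycle budget for the MP3 case as a small C sketch. The 48MHz clock is from above. The 44.1kHz sample rate is my assumption, the 1152 samples per frame is the MPEG-1 Layer III frame size, and the function names are just illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_HZ            48000000u  /* 48MHz core clock, as stated above */
#define SAMPLE_RATE_HZ    44100u     /* assumption: CD-rate audio */
#define SAMPLES_PER_FRAME 1152u      /* MPEG-1 Layer III frame, per channel */

/* Whole CPU cycles available per output sample period (rounded down). */
static uint32_t cycles_per_sample(uint32_t cpu_hz, uint32_t rate_hz)
{
    assert(rate_hz != 0u);
    return cpu_hz / rate_hz;
}

/* Whole CPU cycles available per frame of `samples` samples (rounded down).
   The product needs 64 bits: 48000000 * 1152 does not fit in 32. */
static uint64_t cycles_per_frame(uint32_t cpu_hz, uint32_t rate_hz,
                                 uint32_t samples)
{
    assert(rate_hz != 0u);
    return ((uint64_t)cpu_hz * samples) / rate_hz;
}
```

With these figures, cycles_per_sample(CPU_HZ, SAMPLE_RATE_HZ) comes out at 1088 and cycles_per_frame(CPU_HZ, SAMPLE_RATE_HZ, SAMPLES_PER_FRAME) at 1253877. That is the total for everything (decode, both channels if stereo, and any bus cycles lost to DMA or Flash wait states), which is why the stray cycles matter.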