Hi. Coming from a games-coder background, I always seek the very limits of what a CPU can do. Now we have PragmatIC and very cheap CPUs, but much more importantly, vastly cheaper MROM (mask ROM). With this in mind, I wanted to know how many registers I could REALLY use on the Atmel SAMD21, and found some very interesting ideas and some very interesting possibilities.

For subroutines:
1) R14 (LR) can be stacked and unstacked - an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.

These things work, although that last one seems to have some rules that I am still divining - not much use if it is technically part of the design errata.

Now I'm interested in the special registers, or rather the instructions themselves (MRS, MSR). The ARM Infocenter notes that they perform a read-modify-write sequence and lists the special registers as: APSR, IPSR, EPSR, IEPSR, IAPSR, EAPSR, PSR, MSP, PSP, PRIMASK, CONTROL.

It appears that the field governing which special register is read or written is a 5-bit field and, what is more - for low-cost debug, I'm guessing - if you select a value outside the range of the special registers, it acts on the general-purpose registers. I'm interested in knowing if people can see optimizations in this. Code from Flash often has a 1-cycle penalty, so a single instruction that performs a read-modify-write will be faster.

I know these are extreme cases, but a fixed-point implementation of MP3 decode, for example, will really be scratching around for stray bus cycles. Plastic is 20 years behind silicon and will be very cheap, so I'm jumping the gun a couple of years because I believe the M0 and M0+ running Mbed will become the de facto baseline processor. It is only bitter experience with MROMs (order 80,000 units of Chuck Rock Jr for the Megadrive, sell 60,000 and you make a loss) that has put people off. Now, especially with a simple 10:8 CRC, MROM will provide a yield so close to 100% that it will be reborn.
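On the fixed-point decode point: the inner loops of such a decoder are dominated by fractional multiplies. A minimal C sketch of the kind of Q15 multiply that code leans on (the function name and the Q15 format choice are mine for illustration, not from any particular decoder):

```c
#include <stdint.h>

/* Q15 fractional multiply: operands are 16-bit fixed-point values with
   15 fractional bits; the Q30 intermediate is rounded back to Q15.
   On the M0+ this maps to one MULS plus shifts, so every multiply the
   compiler can't fold away costs real bus cycles. */
static inline int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;               /* Q30 intermediate */
    return (int16_t)((p + (1 << 14)) >> 15);  /* round, renormalize */
}
```

For example, `q15_mul(0x4000, 0x4000)` (0.5 × 0.5) yields `0x2000` (0.25).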
PS: I forgot to mention that the M0 has a 3-stage pipeline and can fetch two instructions at once. On the Hitachi SuperH SH-2 (also 32-bit with 16-bit instructions), positioning memory reads and writes on 32-bit boundaries meant there was no clash of bus access, so it made a big difference when you optimized for it.

I am going to be using DMA in my work, so I'm interested in knowing whether a DMA channel can have lower priority than the CPU, i.e. it conveniently simplifies the aim of 100% bus-bandwidth use. I cannot see any documents or data concerning this in the ARM material. I suppose this is more correctly an Atmel question, but I mention it here in the spirit of the original design team's 'MIPS for the masses'. It's a 48 MHz CPU, so I will aim to use every cycle to access RAM/Flash.
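The SH-2 alignment lesson carries over directly: the bus and the Flash both move 32 bits at a time, so a copy loop issuing aligned word accesses uses a quarter of the bus slots of a byte-at-a-time loop. A plain, generic C sketch of the idea (not SAMD21-specific; buffer alignment is the caller's responsibility):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit values between word-aligned buffers.  Each
   iteration is one aligned load and one aligned store - one bus
   transaction per 4 bytes instead of four byte-wide transactions. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words--)
        *dst++ = *src++;
}
```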
Sean Dunlevy said:
It appears that the field governing which special register is read or written is a 5-bit field and, what is more - for low-cost debug, I'm guessing - if you select a value outside the range of the special registers, it acts on the general-purpose registers. I'm interested in knowing if people can see optimizations in this. Code from Flash often has a 1-cycle penalty, so a single instruction that performs a read-modify-write will be faster.
Are you assuming this, or did you really test it? I thought the days of "illegal opcodes" were gone ... :-) (Remember the Z80, where you could use the IX/IY registers byte-wise via illegal opcodes :-) )
Sean Dunlevy said:
For subroutines:
1) R14 (LR) can be stacked and unstacked - an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.
R13: I see no benefit in storing SP in RAM unless you really run out of registers and the function is very long.
R14: C compilers also use R14, but more likely on Cortex-M3/4/7, where it can be used directly.
R15: How should the CPU know which instruction comes next if you modify the PC?
Hi - yes, using all 16 would drastically speed up certain routines. One instance is codebook generation in ACELP. I'm sure you can imagine that the quality of the sound relies heavily on focused search techniques; there is little advantage in using wideband speech encoding unless you can get close to, if not exactly, the perfect code.

There have been many articles on extracting the most out of the Thumb instruction set (e.g. 'Efficient Use of Invisible Registers in Thumb Code'), but the premise that for every low register there is an equivalent high register provides a much faster, more efficient, smaller and, I guess, more aesthetic paradigm.

I appreciate the input, so I will look at some other cores. I'm now setting up the system to write the speed-dependent code into RAM. The Flash is only 16 bits wide, so code doesn't execute well from it, but a DMA from Flash to RAM with a lower priority than the CPU means that, like the monsters of old such as the PSX and the Jaguar, part of the RAM becomes a scratchpad.

I have to say this about Atmel: they did confirm their specifications, so now I know 32 bits are read if they are on a 32-bit boundary - along with the cache, two rules that will improve execution time. There are a few such rules, so you never need instructions that straddle a boundary.

Many thanks.
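On writing the speed-dependent code into RAM: with GCC the usual idiom is to tag the hot routine with a section attribute and let the startup code copy that section from Flash to SRAM. A minimal sketch - the `.ramfunc` section name and the startup-copy support are toolchain assumptions (Atmel's GCC startup files use this name, but check your linker script), and `sat_add_u32` is just a stand-in hot routine:

```c
#include <stdint.h>

/* Placed in the .ramfunc section so startup code relocates it to SRAM;
   fetched from RAM, the M0+ sees no Flash wait states.  `noinline`
   keeps the body in that section rather than letting it be inlined
   back into a Flash-resident caller. */
__attribute__((section(".ramfunc"), noinline))
uint32_t sat_add_u32(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return (s < a) ? UINT32_MAX : s;   /* saturate on wrap-around */
}
```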
Forgive my ignorance, but how do you want to use R15 for anything useful? The cache is completely transparent to the core's instruction fetch.
Well, occasionally you will need a 16-bit value, and if a literal pool would slow down that read, R15 might contain the 16 bits you need.

For a LONG time I have thought that ARM should have placed the vector table at $400 so the bottom 1K could be a scratchpad. Don't forget that 8-bit immediates are supported, so you can access the bottom 256 bytes as bytes, the next 256 bytes as half-words and the last 512 bytes as words. I know it sounds dumb, but my CLZ macro has a table that overlays most of the vector table (I don't use those interrupts) because I can read an 8-bit value as the address of that table. There are a lot of little things that would make the M0+ much faster, but only for assembly-language programmers.

I noted that the official ARM compiler takes 23 cycles for a 32-bit x 32-bit -> 64-bit multiply. GNU manages 19. I do it in 17. Seems tiny, but if you have an array of structures then every time you access it, you bleed cycles.
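On that multiply: the M0/M0+ has only a 32×32→32 MULS, so a full 64-bit product has to be assembled from four 16×16 partial products - which is what those 23/19/17-cycle routines are all doing under the hood. A C sketch of the schoolbook decomposition (the assembly versions differ mainly in register scheduling, not in the arithmetic):

```c
#include <stdint.h>

/* 32x32 -> 64 unsigned multiply built from four 16x16 partial
   products, the way an M0 routine must, since there is no UMULL. */
uint64_t umull32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint32_t ll = al * bl;   /* contributes to bits  0..31 */
    uint32_t lh = al * bh;   /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;   /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;   /* contributes to bits 32..63 */

    /* Sum the middle column in 64 bits so the carries are kept. */
    uint64_t mid = (uint64_t)lh + hl + (ll >> 16);
    return ((uint64_t)hh << 32) + (mid << 16) + (ll & 0xFFFF);
}
```

For example, `umull32(0xFFFFFFFF, 0xFFFFFFFF)` gives `0xFFFFFFFE00000001`.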
> ARM should have placed the vector table at $400 so the bottom 1K could be a scratchpad.
You know, you can usually MOVE the vector table.
I sort-of think that things will go further in the direction of the Raspberry Pi RP2040 - silly-high clock rates and cache/TCM memory ("outside" the ARM core, of course) that doesn't need wait states, plus specialized peripherals, FIFOs, etc.
(Have you experimented with your MP3 decode and the RP2040's "interpolators" and hardware divider at all? It doesn't look like they do multiply or 64-bit, so perhaps not. What about using the PIO FIFOs as scratchpads?)