Registers and Cache on M0

Hi,
Coming from a games-coding background, I always try to find the very limits of what a CPU can do. We now have PragmatIC and very cheap CPUs but, much more importantly, vastly cheaper MROM (Mask ROM). With this in mind, I wanted to know how many registers I could REALLY use on the Atmel SAMD21, and found some very interesting ideas and possibilities.

For subroutines:

1) R14 (LR) can be stacked and unstacked, which frees it up as an extra register.
2) R13 (SP) can be stored in memory as long as no interrupts can occur and the subroutine doesn't call anything.
3) R15 (PC) is not used on said Atmel product if the code and data of the subroutine are all in the cache.

These things work, although the last one seems to have some rules that I am still divining, and it is not much use if it is technically part of the design errata.

Now I'm interested in the special registers, or rather the instructions that access them (MRS, MSR). The ARM Infocenter notes that they perform a read-modify-write sequence and lists the special registers as:

APSR
IPSR
EPSR
IEPSR
IAPSR
EAPSR
PSR
MSP
PSP
PRIMASK
CONTROL

It appears that the field governing which special register is read or written is a 5-bit field and, what is more (for low-cost debug, I'm guessing), if you select a value outside the range of the special registers, it acts on the general-purpose registers. I'm interested in knowing if people can see optimizations in this. Code from Flash often has a 1-cycle penalty, so a single instruction that performs the whole RMW will be faster.

I know these are the extreme cases, but getting a fixed-point implementation of MP3 decode, for example, will really mean scratching around for stray bus cycles. Plastic is 20 years behind silicon and will be very cheap, so I'm jumping the gun a couple of years because I believe the M0 & M0+ running MBed will become the de facto baseline processor. It is only bitter experience with MROMs (order 80000 units of Chuck Rock Jr for the Megadrive, sell 60000 and you make a loss) that has put people off. Now, especially with simple CRC 10:8, MROM will provide a yield so close to 100% that it will be reborn.

0 Sean Dunlevy over 6 years ago in reply to 42Bastian Schick

Hi - yes, using 16 would drastically speed up certain routines. One instance of this is codebook generation in ACELP. I'm sure you can imagine that the quality of the sound relies heavily on focussed search techniques. There is little advantage in using wideband speech encoding unless you can get a code that is close to, if not the closest to, perfect.

There have been many articles on extracting the most out of the Thumb instruction set (e.g. Efficient Use of Invisible Registers in Thumb Code), but the premise that for every low register there is an equivalent high register provides a much faster, more efficient, smaller and, I guess, more aesthetic paradigm.

I appreciate the input so I will look at some other cores. I'm now setting up the system to write the speed-dependent code into RAM. The Flash is only 16-bit, so code doesn't execute well from it, but a DMA from Flash to RAM can run with a lower priority than the CPU so, like the monsters of old such as the PSX & Jaguar, part of the RAM is a scratchpad.

I have to say this about Atmel: they did confirm their specifications, so now I know that 32 bits are read if they are on a 32-bit boundary. Along with the cache, that gives two rules that will improve execution time. There are a few rules so you never need to use boundary instructions.

Many thanks.

0 42Bastian Schick over 6 years ago in reply to Sean Dunlevy

Forgive my ignorance, but how do you want to use r15 for anything useful? The cache is completely transparent to the core's instruction fetch.

0 Sean Dunlevy over 2 years ago in reply to 42Bastian Schick

Well, occasionally you will need a 16-bit value and if a literal pool will slow down that read, R15 might contain the 16 bits you need.

For a LONG time I have realised that ARM should have placed the vector table at $400 so the bottom 1K could be a scratchpad. Don't forget that 8-bit immediates are supported so you can access the bottom 256 bytes AS bytes, the next 256 bytes as half-words and the last 512 bytes as words.

I know it sounds dumb but my CLZ macro has a table that overlays most of the vector table (I don't use those interrupts) because I can read an 8-bit value as the address of that table.
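
For readers who have not seen a table-driven CLZ (the M0+ has no CLZ instruction), here is a minimal sketch in C. It is not the macro described above: the names clz8_table, clz8_init and clz32 are made up for illustration, and the table is built at run time here purely to keep the example short, whereas the post describes a fixed table placed at the bottom of memory.

```c
#include <stdint.h>

/* Hypothetical 256-entry table: number of leading zero bits in an 8-bit value.
   Entry 0 holds 8 because all eight bits are clear. */
static uint8_t clz8_table[256];

/* Fill the table once. A ROM-resident table would be precomputed instead. */
static void clz8_init(void)
{
    clz8_table[0] = 8;
    for (unsigned v = 1; v < 256; v++) {
        uint8_t n = 0;
        for (unsigned bit = 0x80u; (v & bit) == 0; bit >>= 1)
            n++;
        clz8_table[v] = n;
    }
}

/* Count leading zeros of a 32-bit value: find the top non-zero byte,
   then let the table finish the job. Returns 32 for x == 0. */
static unsigned clz32(uint32_t x)
{
    if (x >> 24) return clz8_table[x >> 24];
    if (x >> 16) return 8u + clz8_table[x >> 16];
    if (x >> 8)  return 16u + clz8_table[x >> 8];
    return 24u + clz8_table[x];
}
```

Call clz8_init() once before the first clz32(). The attraction of putting the table at a very low address is that the byte being looked up can serve directly as the table offset, which is the trick the post is describing.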

There are a lot of little things that would make the M0+ much faster but only for assembly language programmers.

I noted that the official ARM compiler takes 23 cycles for a 32-bit x 32-bit --> 64-bit multiply. GNU managed 19. I do it in 17. Seems tiny but if you have an array of structures then every time you access it, you bleed cycles.
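
As background on why this multiply costs so many cycles: the M0+ multiply instruction only returns the low 32 bits of a product, so a 64-bit result has to be assembled from 16-bit halves. The sketch below shows that decomposition in C. It is only an illustration of the general technique, not the 17-cycle routine mentioned above, and the name umul32x32 is made up.

```c
#include <stdint.h>

/* 32 x 32 -> 64-bit unsigned multiply using only products that fit in
   32 bits, i.e. the kind a 32 x 32 -> 32 multiplier can deliver. */
static uint64_t umul32x32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four 16 x 16 partial products, each at most 0xFFFE0001. */
    uint32_t ll = a_lo * b_lo;   /* weight 2^0  */
    uint32_t lh = a_lo * b_hi;   /* weight 2^16 */
    uint32_t hl = a_hi * b_lo;   /* weight 2^16 */
    uint32_t hh = a_hi * b_hi;   /* weight 2^32 */

    /* Sum everything that lands in bits 16..31; the carry out of this
       column ends up in mid >> 16 (at most 2). */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}
```

Hand-written assembly wins by keeping all of these intermediates in registers and folding the carries into as few instructions as possible, which is where the cycle differences between compilers and hand code come from.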

0 WestfW over 2 years ago in reply to Sean Dunlevy

> ARM should have placed the vector table at $400 so the bottom 1K could be a scratchpad.

You know, you can usually MOVE the vector table.
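
A rough sketch of what moving the table looks like, assuming a Cortex-M0+ part that implements the optional VTOR register (the plain M0 does not have one). The function name relocate_vector_table, the 256-byte alignment constant and the idea of passing the register as a pointer are all assumptions made so the example can be compiled off-target; check the alignment rule and register address against the reference manual for your part.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: VTOR sits at this address on M0+ parts that implement it. */
#define VTOR_ADDRESS  0xE000ED08u

/* Assumption: 256-byte alignment is enough for a table of up to 64 entries. */
#define VECTOR_ALIGN  256u

/* Copy the vector table to a new location and point VTOR at it.
   'vtor' is passed in so the routine can be exercised on a host; on
   target it would be (volatile uint32_t *)VTOR_ADDRESS.
   Returns 0 on success, -1 if new_table is not suitably aligned.
   On real hardware, interrupts should be masked around this call. */
static int relocate_vector_table(volatile uint32_t *vtor,
                                 const uint32_t *old_table,
                                 uint32_t *new_table,
                                 size_t entries)
{
    if (((uintptr_t)new_table & (VECTOR_ALIGN - 1u)) != 0)
        return -1;

    memcpy(new_table, old_table, entries * sizeof(uint32_t));

    /* Addresses are 32 bits wide on target; the cast only narrows on a host. */
    *vtor = (uint32_t)(uintptr_t)new_table;
    return 0;
}
```

With the table relocated to RAM, the low addresses it used to occupy are free for other data, which is the scratchpad idea from the quoted post.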

I sort-of think that things will go further in the direction of the RPi rp2040 - silly high clock rates and cache/TCM memory ("outside" of the ARM core, of course) that doesn't need wait states. And specialized peripherals/fifos/etc.

(Have you experimented with your MP3 decode and the rp2040's "Interpolators" and HW divider at all? It doesn't look like they do multiply or 64bit, so perhaps not. What about using the PIO FIFOs as scratchpad(s)?)