The Arm Cortex-M0 microcontroller supports a subset of the instructions provided by the Cortex-M3. Presumably these extra instructions provide better performance for some applications.
But does this have any implications for a developer writing performance critical C code for either of these devices? Would you write performance critical code differently knowing that you were developing for one instruction set rather than the other?
For example, I would write quite different code if I were targeting an 8-bit PIC than I would for a 32-bit ARM. And if I knew that my target hardware didn't have a single-cycle multiply instruction, I might use bit-shifting rather than multiplying where possible (or only multiply and divide by powers of 2, so that the compiler can optimise it to bit-shifting).
Does GCC even make good use of the extra instructions?
The most dramatic thing I've noticed is that gcc includes an optimized floating-point library (using copious amounts of assembly code) for the CM3, but "falls back" to a generic float library written in C on the CM0. That means that CM0 code using floating point is dramatically slower and bigger than on a CM3 (although I haven't actually measured it, and this might be "fixed" at any time).
The other thing is that "newlib", which is used to provide libc functionality in many ARM distributions, is not particularly size-efficient (even in its "nano" form.) You probably won't notice if you're programming a CM0 with 128k+ of flash, but it can get pretty painful if you're aiming at one of those tiny CM0 chips with 32k or less.
The compilers are all pretty good at optimizing multiplies and divides; I wouldn't bother modifying source. And they are pretty good at using the CM3 instructions; theoretically, the CM3 provides more of the core "ARM" model, and CM0 omits things, rather than CM0 being the base and CM3 having "extras."
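To illustrate why I wouldn't bother modifying source, here is a minimal sketch with made-up function names (they are not from any library). The first three functions are written the obvious way; the second three are the hand-shifted equivalents. For unsigned operands and a constant power of two they compute the same values, and a compiler with optimization enabled will normally emit shifts or masks for the first group on its own.

```c
#include <stdint.h>

/* Written the "obvious" way. With optimization on, a compiler will
   normally turn these into a shift or a mask by itself. */
uint32_t scale_by_8(uint32_t x)   { return x * 8u; }
uint32_t divide_by_16(uint32_t x) { return x / 16u; }
uint32_t modulo_32(uint32_t x)    { return x % 32u; }

/* Hand-written shift/mask versions, shown only for comparison.
   (Hypothetical names, for illustration.) */
uint32_t scale_by_8_shift(uint32_t x)   { return x << 3; }
uint32_t divide_by_16_shift(uint32_t x) { return x >> 4; }
uint32_t modulo_32_mask(uint32_t x)     { return x & 31u; }
```

Note that this equivalence is for unsigned values. Signed division rounds toward zero, so for negative numbers it is not a plain right shift, which is one more reason to write the division and let the compiler pick the correct sequence.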
A lot of the CM3 instructions are 32 bits wide; the CM0 totally lacks the "flexible second operand", which is only present in 32-bit Thumb-2 instructions.
That means that some common operations become two 16-bit instructions rather than one 32-bit instruction, like:
register int x = 0x1000;
    movs R1, #0x1000   ;; on CM3
    movs R1, #0x10     ;; on CM0
    lsls R1, #8        ;; on CM0
I guess that's twice as slow, but the same code length.
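The two-instruction CM0 sequence can be mimicked in C to check that it really produces the same constant. This is only a sketch: `build_0x1000` and `fits_movs_lsls` are hypothetical names, and the second function is a simplified model of "an 8-bit immediate followed by a left shift", not a description of what any particular compiler actually does (compilers may also load a constant from a literal pool instead).

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the CM0 sequence step by step. */
uint32_t build_0x1000(void)
{
    uint32_t r1 = 0x10u;  /* movs r1, #0x10   (8-bit immediate) */
    r1 <<= 8;             /* lsls r1, r1, #8                    */
    return r1;
}

/* Simplified model (an assumption, for illustration): returns true if c
   can be formed as an 8-bit immediate shifted left by some amount. */
bool fits_movs_lsls(uint32_t c)
{
    if (c == 0u) {
        return true;              /* movs r1, #0 alone is enough */
    }
    while ((c & 1u) == 0u) {
        c >>= 1;                  /* strip trailing zero bits */
    }
    return c <= 0xFFu;            /* what remains must fit in 8 bits */
}
```

For example, `fits_movs_lsls(0x1000)` is true, while `fits_movs_lsls(0x1001)` is false because the set bits span more than 8 positions.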