Why does the Cortex-M4 instruction SMMUL (32 = 32 x 32b) preserve a redundant sign bit and discard one useful bit of information? What could possibly be the justification for such blatant disregard of the ISO/IEC TR 18037 standard Fract format?
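To make the complaint concrete, here is a rough C model of the two behaviours. This is only a sketch under my own assumptions: the function names `smmul_model` and `q31_mul` are made up, the Fract format is taken to be s.31 (Q31), and `>>` on a negative value is assumed to be an arithmetic shift, which C leaves implementation-defined. The product of two s.31 values sits in 64 bits with 62 fraction bits, so its top two bits are both sign bits (except for -1 x -1). Taking bits 63..32 keeps both of them and yields an s.30 value; a Fract multiply would take bits 62..31 instead.

```c
#include <assert.h>
#include <stdint.h>

/* Model of what SMMUL returns: bits 63..32 of the signed 64-bit product.
   For s.31 inputs this is an s.30 result with a duplicated sign bit. */
int32_t smmul_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Hypothetical s.31 x s.31 -> s.31 multiply: bits 62..31 of the product.
   The single overflowing case (-1.0 x -1.0) is excluded by the assert. */
int32_t q31_mul(int32_t a, int32_t b)
{
    assert(!(a == INT32_MIN && b == INT32_MIN));
    return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}
```

With 0.5 encoded as 0x40000000, `smmul_model` gives 0x10000000 (0.25 only if read as s.30), while `q31_mul` gives 0x20000000 (0.25 in s.31). The bit that separates the two results is the one the question says is thrown away.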
This is not an answer to petr's question; I just found myself comparing the way some RISC processors introduced their hardware support for multiplication.
The MUL instruction was added in ARMv2, SMULL in ARMv3M.
The i960 has multiply instructions that generate the least significant 32 bits of the product, and an extended multiply instruction that generates all 64 bits, stored in two 32-bit registers.
When I was studying the PowerPC (using older generations), I had to learn that a 32-bit x 32-bit = 64-bit multiplication takes two instructions: one to get the high-order 32 bits of the result and one to get the low-order 32 bits.
When multiplying, MIPS32 stores the high- and low-order words of the result in special registers (HI and LO).
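What all of these designs have in common is that the 64-bit product is delivered as two 32-bit words. A minimal C sketch of that split follows; the names `mul_lo`, `mul_hi` and `mul_wide` are mine and do not correspond to any particular instruction set, and `>>` on a negative value is again assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Low-order 32 bits of the product, as a plain 32-bit multiply yields.
   Unsigned arithmetic is used so the wrap-around is well defined. */
uint32_t mul_lo(int32_t a, int32_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* High-order 32 bits of the signed 64-bit product. */
int32_t mul_hi(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Glue the two words back together and check them against a direct
   64-bit multiplication. */
int64_t mul_wide(int32_t a, int32_t b)
{
    uint64_t bits = ((uint64_t)(uint32_t)mul_hi(a, b) << 32) | mul_lo(a, b);
    int64_t full = (int64_t)bits;
    assert(full == (int64_t)a * (int64_t)b);
    return full;
}
```

For example, 100000 x 100000 = 10000000000 splits into a high word of 2 and a low word of 1410065408. Whether the two words come from two separate instructions, a register pair, or a dedicated pair of special registers is the part that differs between the architectures above.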