Why does the Cortex-M4 instruction SMMUL (32 = 32 x 32b) preserve a redundant sign bit and discard one useful bit of information? What could possibly justify such blatant disregard for the Fract format of the ISO/IEC TR 18037 standard?
I think ARM intended to define instructions that C compilers can use directly.
Regarding multiplication, ANSI C states:
"The result of the binary * operator is the product of the operands."
Therefore, for the following C code to compile to efficient machine code, SMULL and SMMUL have to be defined as they are today:
int64_t result64 = (int64_t)(int32_t)operand1 * (int64_t)(int32_t)operand2; // Translates to SMULL r0,r1,r1,r0
int32_t result32 = (int32_t)(((int64_t)(int32_t)operand3 * (int64_t)(int32_t)operand4) >> 32); // Translates to SMMUL r0,r2,r3
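To make the "lost bit" from the question concrete, here is a minimal C sketch comparing the high-word multiply with a Q31 fractional multiply. The function names smmul_model and q31_mul_trunc are made up for illustration, and the code assumes the compiler implements >> on negative signed values as an arithmetic shift (implementation-defined in C, but what GCC and Clang do). The Q31 version does not saturate, so INT32_MIN * INT32_MIN is outside its valid range.

```c
#include <stdint.h>

/* Model of SMMUL: keep bits [63:32] of the signed 64-bit product.
 * For Q31 inputs the result is in Q30, i.e. it carries two sign bits.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t smmul_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Hypothetical Q31 x Q31 -> Q31 multiply: keep bits [62:31] instead.
 * One more fractional bit survives. No saturation is performed, so
 * a == b == INT32_MIN must be avoided by the caller. */
static int32_t q31_mul_trunc(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}
```

For example, 0.5 * 0.5 in Q31 (0x40000000 * 0x40000000) comes out as 0x10000000 from smmul_model and 0x20000000 from q31_mul_trunc. Both encode 0.25, but the first is in Q30 and the second in Q31.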
Also, to make the error introduced by SMMUL's truncation symmetrical, you can use its variant SMMULR, which rounds before extracting the 32 most significant bits.
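As a rough illustration of the difference, the sketch below models both variants in C. The names mulhi_trunc and mulhi_round are invented for this example. The rounding model adds 0x80000000 to the 64-bit product before taking the high word, which is how SMMULR is documented to behave. As before, it assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Truncating high-word multiply (SMMUL-like): always rounds toward
 * minus infinity, so the error is biased in one direction. */
static int32_t mulhi_trunc(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Rounding high-word multiply (SMMULR-like): add half an LSB of the
 * result (0x80000000) before discarding the low 32 bits.
 * The sum cannot overflow int64_t: |a * b| <= 2^62. */
static int32_t mulhi_round(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)((product + 0x80000000LL) >> 32);
}
```

With a product worth +0.75 of a result LSB (3 * 0x40000000), truncation gives 0 and rounding gives 1. With -0.75 (-3 * 0x40000000), both give -1. So the truncated results are 0 and -1, while the rounded results are +1 and -1.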
What the question is asking for is what the NEON instructions VQDMULH and VQRDMULH do (a saturating doubling multiply that returns the high half), so ARM clearly thought the operation was worth implementing when it designed NEON.
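For reference, a scalar C sketch of what one 32-bit lane of those two instructions computes, as I understand them. The names vqdmulh_model and vqrdmulh_model are made up, this is not intrinsic code, and it again assumes an arithmetic right shift. The only input pair that needs saturation is INT32_MIN * INT32_MIN, so it is handled as a special case before the doubled product can overflow 64 bits.

```c
#include <stdint.h>

/* One lane of VQDMULH.S32: saturate((2 * a * b) >> 32).
 * For Q31 inputs this is a true Q31 result with no redundant sign bit. */
static int32_t vqdmulh_model(int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN) {
        return INT32_MAX;               /* (-1.0) * (-1.0) saturates */
    }
    return (int32_t)((2 * (int64_t)a * (int64_t)b) >> 32);
}

/* One lane of VQRDMULH.S32: same, but rounds by adding 2^31 first. */
static int32_t vqrdmulh_model(int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN) {
        return INT32_MAX;
    }
    return (int32_t)((2 * (int64_t)a * (int64_t)b + 0x80000000LL) >> 32);
}
```

Here 0.5 * 0.5 (0x40000000 * 0x40000000) yields 0x20000000, which is 0.25 in Q31, one bit more precise than the plain high word of the product.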
The examples in the Cortex-M4 Devices Generic User Guide, section 3.6.8 (SMMUL), use SMULL instead of SMMUL.