Why does the Cortex-M4 instruction SMMUL (32 = 32 x 32b) preserve a redundant sign bit and discard one useful bit of information? What could possibly be the justification for such blatant disregard of the ISO/IEC TR 18037 standard Fract format?
I think they just missed a trick there when defining the DSP instructions. You can get the result you want by using a long multiply and double but it takes a few extra cycles. Or one can get by with one less bit of precision and do a shift left of one of the operands before the multiply but that isn't any sort of standard. Or if they had set carry for the top bit of the low half then doubling would have just taken an extra cycle.
By the way, that 'redundant sign bit' isn't completely redundant, because 0x80000000 multiplied by 0x80000000 gives 0x40000000 for the top half. An instruction designed for DSP would have to saturate to 0x7fffffff, or else ignore that case and get 0 or 0x80000000.
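To make the difference concrete, here is a minimal sketch in portable C that models the top-half result SMMUL returns next to a saturating Q31 fractional multiply. The function names are made up for illustration, and it assumes that `>>` on a negative `int64_t` is an arithmetic shift (true for GCC and Clang, but implementation-defined in the C standard).

```c
#include <assert.h>
#include <stdint.h>

/* Model of SMMUL: bits 63..32 of the signed 64-bit product.
   Assumes >> on a negative int64_t is an arithmetic shift. */
int32_t smmul_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Q31 fractional multiply: bits 62..31 of the product.
   Only (-1.0) * (-1.0) can overflow, so only the upper bound is clamped. */
int32_t q31_mul_sat(int32_t a, int32_t b)
{
    int64_t p = ((int64_t)a * b) >> 31;
    if (p > INT32_MAX)
        p = INT32_MAX;
    return (int32_t)p;
}

/* Hypothetical self-check: 0.5 * 0.5 and the -1.0 * -1.0 corner case. */
void q31_demo(void)
{
    /* 0.5 * 0.5: SMMUL gives the answer in Q30, one bit lower than Q31. */
    assert(smmul_model(0x40000000, 0x40000000) == 0x10000000);
    assert(q31_mul_sat(0x40000000, 0x40000000) == 0x20000000);

    /* -1.0 * -1.0: the "redundant" sign bit is what keeps +1.0 representable. */
    assert(smmul_model(INT32_MIN, INT32_MIN) == 0x40000000);
    assert(q31_mul_sat(INT32_MIN, INT32_MIN) == INT32_MAX);
}
```

The SMMUL-style result is always half the Q31 result, which is where the lost bit of precision goes.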
As to blatant disregard for Fract format - the instructions were defined and implemented long before that standard was thought of as far as I'm aware.
Jens,
I need to perform these operations:
All of the above operations require the standard fractional multiplication for optimal accuracy.
Could you please give me an example of a DSP application on a low-power 32-bit MCU where you would need to multiply two more-than-32-bit signed values?
SMULL gives both the high and low parts; one can do everything with that, and it is implemented in the Cortex-M3.
SMMUL gives just the high part and is part of the DSP extension; you need a Cortex-M4 for SMMUL.
This is just to avoid misinterpretation.
SMULL is still present in Cortex-M4.
The saturated multiplication would be an obvious choice (you can use SMULL when you need guarded result). The ARM format does not even allow chaining of multiplications without a progressive loss of accuracy, ugh.
The ARM DSP extension was defined in 2009, three years after that ISO standard. The M4 core was introduced in 2010, so no excuse there. The fractional format itself dates back to the 1980s with chips like the Motorola DSP56000.
I believe this is intended to be used for getting the highword result of a topword x topword multiplication.
(note: by topword, I mean the most significant word of each factor; they could for instance be a 32-bit value multiplied by a 32-bit value or the highword of a 64-bit value multiplied by the highword of a 64-bit value; the product would go into the highest word of a 128-bit value).
Without it, I think it would be cumbersome to multiply two more-than-32-bit signed values.
Which operations do you need to perform, in more detail?
Misinterpretation. Yes, I could easily get paranoid about people misinterpreting what I've said! It just seems to happen so easily despite one's best efforts.
The ARM DSP extension was introduced into the ARM architecture in 2000 in ARMv5TE; the SMMUL instruction was introduced in 2004 in ARMv6.
I see it was added after the other DSP instructions, a bit later than I thought but still before the standard. And they refer to it as an extended multiply instruction rather than DSP; it might have helped if it had been thought of as DSP.
Looking a bit deeper as that struck me as a bit wrong - the ARM1136 technical Manual r0p1 from February 2003 has SMMUL in it even though that is before ARMv6 was defined. ARM wasn't so rigorous about versions and features then. ARM1136 was upgraded a bit when the ARMv6 definition came out but it already had this instruction.
By the way some of the Cortex-M4 processors have single precision floating point and that is quite quick.
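I just had a look at what gcc does for this and it doesn't do saturation. I think it could do the work with saturation in the same time, using:
smull lo, hi, x, y       @ 64-bit product; SMULL takes RdLo first, then RdHi
lsr lo, lo, #31          @ keep only bit 31 of the low word
qdadd result, lo, hi     @ result = sat(lo + sat(2*hi))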
Yes, your code example corresponds to the saturated multiplication that I have been using. It takes 3 cycles to complete. Single-precision floating point is faster (VMUL.F32 takes 1 cycle), but the 24-bit mantissa has lower resolution than the 32-bit Fract, so it can't be considered a direct replacement.
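For readers who want to check the three-instruction sequence without hardware, here is a rough C model of it. The helper names are invented for this sketch; `qdadd_sketch` follows the architectural definition of QDADD (saturate the doubled operand, then saturate the sum), and the usual arithmetic-shift assumption for negative `int64_t` applies.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value to the int32_t range. */
static int32_t sat32_sketch(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* QDADD Rd, Rm, Rn: Rd = SAT(Rm + SAT(2 * Rn)). */
int32_t qdadd_sketch(int32_t m, int32_t n)
{
    int32_t doubled = sat32_sketch(2 * (int64_t)n);
    return sat32_sketch((int64_t)m + doubled);
}

/* Saturating Q31 multiply built the same way as the asm sequence. */
int32_t q31_mul_smull_qdadd(int32_t x, int32_t y)
{
    int64_t p = (int64_t)x * y;        /* SMULL: full 64-bit product      */
    uint32_t lo = (uint32_t)p;         /* low word                        */
    int32_t hi = (int32_t)(p >> 32);   /* high word (arithmetic shift)    */
    lo >>= 31;                         /* LSR #31: bit 31 of the low word */
    return qdadd_sketch((int32_t)lo, hi);
}

/* Hypothetical self-check of a few values. */
void smull_qdadd_demo(void)
{
    assert(q31_mul_smull_qdadd(0x40000000, 0x40000000) == 0x20000000);
    assert(q31_mul_smull_qdadd(INT32_MIN, INT32_MIN) == INT32_MAX);
    assert(q31_mul_smull_qdadd(3, 0x40000000) == 1);
}
```

The last check shows why the LSR step matters: the bit recovered from the low word is exactly the one SMMUL alone would throw away.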
daith, sorry if I posted that. I didn't have much time and was about to log out then, but there was a young engineer (whom I've recently convinced to also study ARM instead of being too dedicated to AVR and PICmicro) who read your reply pertaining to SMULL/SMMUL and wondered if SMULL was excluded in Cortex-M4. I then decided to post such a response, hoping it would help prevent some other readers, especially new users of Cortex-M, from also misinterpreting the info.
It's interesting that even the 8-bit 68HC11 has versions with support, albeit minimal, for fractional format. The E variants are perhaps the earliest to provide such support. Nonetheless, the fractional data format might have already been used even in early computers.
Hi petr,
I'm not sure if Jens' answer was the main reason for the SMMUL instruction. Note however that Cortex-M4 is strictly not a DSP but an MCU with DSP extension so multiplication of more-than-32-bit signed values may have application aside from DSP.
I hope you can visit here more often. You can share your knowledge about DSP by participating in discussions, posting blogs, etc. My impression is that you already have intensive experience in DSP especially using DSP/DSC rather than MCU.
Regards,
Goodwin
I was agreeing with you. Misinterpreting happens all the time and is very hard to guard against.
On my side, I think that ARM intended to define instructions that can be used by C compilers.
Regarding multiply, ANSI C states that
The result of the binary * operator is the product of the operands.
Therefore, for the following C code to compile to efficient instructions, SMULL and SMMUL have to be defined as they are today!
int64_t result64 = (int64_t)(int32_t)operand1 * (int64_t)(int32_t)operand2; // Translates to SMULL r0,r1,r1,r0
int32_t result32 = (int32_t)(((int64_t)(int32_t)operand3 * (int64_t)(int32_t)operand4) >> 32); // Translates to SMMUL r0,r2,r3
Also, in order to have a symmetrical error introduced by SMMUL truncation, you can use its alternative SMMULR, which performs rounding before extracting those 32 most significant bits.
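As a rough illustration of the truncation versus rounding difference, the sketch below models both instructions in C. The function names are hypothetical; the rounding model adds 0x80000000 to the 64-bit product before taking the top word, which is how SMMULR is defined, and it again assumes an arithmetic right shift for negative `int64_t`.

```c
#include <assert.h>
#include <stdint.h>

/* SMMUL model: truncate, i.e. always round toward minus infinity. */
int32_t smmul_trunc(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* SMMULR model: add half an LSB of the result before taking the top word. */
int32_t smmulr_round(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + 0x80000000LL) >> 32);
}

/* Hypothetical self-check with products of +/-0.25 and +/-0.75 result LSBs. */
void smmulr_demo(void)
{
    /* +0.25 and -0.25: truncation gives 0 and -1, rounding gives 0 and 0. */
    assert(smmul_trunc(1, 0x40000000) == 0);
    assert(smmul_trunc(-1, 0x40000000) == -1);
    assert(smmulr_round(1, 0x40000000) == 0);
    assert(smmulr_round(-1, 0x40000000) == 0);

    /* +0.75 and -0.75: truncation gives 0 and -1, rounding gives 1 and -1. */
    assert(smmul_trunc(3, 0x40000000) == 0);
    assert(smmul_trunc(-3, 0x40000000) == -1);
    assert(smmulr_round(3, 0x40000000) == 1);
    assert(smmulr_round(-3, 0x40000000) == -1);
}
```

With truncation every error has the same sign, so it accumulates as a bias in long sums; with rounding the errors are centred on zero.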