More instruction details are giving me a headache.
This time it's the MUL instruction.
What the heck does this mean:
Multiply multiplies two register values. The least significant 32 bits of the result are written to the destination register. These 32 bits do not depend on whether the source register values are considered to be signed values or unsigned values.
(The bolded part)
Couldn't tell...
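One way to read the quoted sentence: if you form the full 64-bit product, the lower half has the same bits whether you treat the inputs as signed or unsigned, and only the upper half differs. That would explain why there is a single MUL but separate UMULL and SMULL. A minimal C sketch to check this, assuming a two's-complement machine (the function names are made up for illustration):

```c
#include <stdint.h>

/* Low 32 bits of the product, treating both bit patterns as unsigned. */
static uint32_t mul_low_unsigned(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}

/* Low 32 bits of the product, treating the same bit patterns as signed.
   The uint32_t -> int32_t casts assume two's-complement wraparound. */
static uint32_t mul_low_signed(uint32_t a, uint32_t b)
{
    int64_t wide = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)(uint64_t)wide;
}
```

For example, 0xFFFFFFFF times 2 is either 4294967295 * 2 or -1 * 2, and both views leave 0xFFFFFFFE in the low word.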
Here is the decoding of the MUL instruction along with its "closest relatives". (Isn't a spreadsheet wonderful for analyzing instruction sets?)
c c c c 0 0 0 0 0 0 0 S d d d d 0 0 0 0 m m m m 1 0 0 1 n n n n MUL{S}<c> <Rd>, <Rn>, <Rm>
c c c c 0 0 0 0 0 0 1 S d d d d a a a a m m m m 1 0 0 1 n n n n MLA{S}<c> <Rd>, <Rn>, <Rm>, <Ra>
c c c c 0 0 0 0 0 1 0 0 h h h h l l l l m m m m 1 0 0 1 n n n n UMAAL<c> <RdLo>, <RdHi>, <Rn>, <Rm>
c c c c 0 0 0 0 0 1 1 0 d d d d a a a a m m m m 1 0 0 1 n n n n MLS<c> <Rd>, <Rn>, <Rm>, <Ra>
c c c c 0 0 0 0 1 0 0 S h h h h l l l l m m m m 1 0 0 1 n n n n UMULL{S}<c> <RdLo>, <RdHi>, <Rn>, <Rm>
c c c c 0 0 0 0 1 0 1 S h h h h l l l l m m m m 1 0 0 1 n n n n UMLAL{S}<c> <RdLo>, <RdHi>, <Rn>, <Rm>
c c c c 0 0 0 0 1 1 0 S h h h h l l l l m m m m 1 0 0 1 n n n n SMULL{S}<c> <RdLo>, <RdHi>, <Rn>, <Rm>
c c c c 0 0 0 0 1 1 1 S h h h h l l l l m m m m 1 0 0 1 n n n n SMLAL{S}<c> <RdLo>, <RdHi>, <Rn>, <Rm>
I almost see something, but not quite...
It looks like these instructions come in pairs: MUL and MLA are a pair like UMULL and UMLAL,
but also like UMAAL and MLS. The four (MUL, MLA, UMAAL and MLS) form a weird subset, though.
But I still fail to see the gist.
An update (now that I've done them in my SW):
MUL, MLA and MLS are a triplet like SMMUL, SMMLA and SMMLS.
MUL only multiplies, MLA adds the product to accumulate-value and MLS subtracts the product from accumulate-value.
I handled the triplets together - they were so similar in encoding and in function.
c c c c 0 0 0 0 0 0 0 S d d d d 0 0 0 0 m m m m 1 0 0 1 n n n n arm_cmac_mul arm_core_data_mac MUL{S}<c> <Rd>, <Rn>, <Rm> A1 A8.8.114
c c c c 0 0 0 0 0 0 1 S d d d d a a a a m m m m 1 0 0 1 n n n n arm_cmac_mla arm_core_data_mac MLA{S}<c> <Rd>, <Rn>, <Rm>, <Ra> A1 A8.8.100
c c c c 0 0 0 0 0 1 1 0 d d d d a a a a m m m m 1 0 0 1 n n n n arm_cmac_mls arm_core_data_mac MLS<c> <Rd>, <Rn>, <Rm>, <Ra> A1 A8.8.101
The "signedness immune" multiplication taking only the low-bits is common to all three.
IF MLA THEN
Rd = Ra + Rd
ELSE IF MLS THEN
Rd = Ra - Rd
ENDIF
The differences between SMMUL, SMMLA and SMMLS are the same.
So MUL seems not to be a by-product.
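The triplet handling described above could be sketched in C roughly like this. It is only an illustration of the arithmetic, not the actual code from my SW: the enum and function names are hypothetical, and flag updates for the S variants are left out.

```c
#include <stdint.h>

/* Hypothetical opcode tags for the three related instructions. */
typedef enum { OP_MUL, OP_MLA, OP_MLS } mac_op;

/* Returns the value written to Rd. Unsigned arithmetic wraps modulo 2^32,
   which matches "only the least significant 32 bits are kept". */
static uint32_t mac_execute(mac_op op, uint32_t rn, uint32_t rm, uint32_t ra)
{
    uint32_t product = (uint32_t)(rn * rm);  /* common to all three */

    switch (op) {
    case OP_MLA: return ra + product;  /* accumulate plus product  */
    case OP_MLS: return ra - product;  /* accumulate minus product */
    default:     return product;       /* plain MUL ignores ra     */
    }
}
```

For instance, `mac_execute(OP_MLS, 3, 5, 100)` gives 85, and the same inputs with `OP_MLA` give 115.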
I am developing on a SAMD21G18 SoC, so MUL is the only multiply instruction I have. Recently a developer in the ARM community posted a 32-bit x 32-bit ---> 64-bit multiply in 17 cycles, and it is certainly an interesting routine. My problem is that I am converting a 64kb/s mono MP3 decoder and there are tens of thousands of MULSHIFT32 macros all over the code. As the name suggests, it performs a 32-bit x 32-bit ---> 64-bit product, but only bits 32-63 are required.
I have just about exhausted the various methodologies within the programming fraternity as well as the pure maths branch of science. One GOOD result is that for people using an SoC with a 32-cycle multiply, Karatsuba multiplication is some 22 cycles faster, which should be of use to people writing for the very smallest ARM cores. For myself, I have exhausted tricks like finding the least significant bits within a register so that just 2 multiplies will produce the correct results... but boy is it slow. If anyone has just a name for me to search, I would really appreciate it.
The system works at 48MHz, but the slower I can run the SoC, the less power it uses, and with the product aiming to use a single (rechargeable) AA battery, improving battery life is vital.
There are quite a few interesting & novel aspects to the ARM processors. It is always interesting, although the C bit being treated as a 'borrow' rather than a carry spoils some looping and makes some maths... interesting.
I would like to take this chance to thank Yasuhiko Koumoto, who has always been a superb source of information, explaining 'branch shadows', which I think is unique to ARM. I coded many of the RISC chips developed in the 80s (used in consoles in the 90s), so I came from 'branch delay-slot' instructions.
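For reference, the MULSHIFT32 operation described above (bits 32-63 of the signed product) can be written with nothing but 32 x 32 ---> 32 multiplies by splitting each operand into 16-bit halves. This is a plain schoolbook sketch, not the 17-cycle routine mentioned and not a Karatsuba variant, and no cycle counts are claimed for it. The function name is hypothetical, and the casts assume two's-complement wraparound.

```c
#include <stdint.h>

/* Hypothetical helper: high 32 bits of the signed 64-bit product a * b,
   using only 32x32->32 multiplies on 16-bit halves. */
static int32_t mulshift32_16x16(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t al = ua & 0xFFFFu, ah = ua >> 16;
    uint32_t bl = ub & 0xFFFFu, bh = ub >> 16;

    /* Four partial products; each fits in 32 bits. */
    uint32_t ll = al * bl;
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* Carry out of bits 16..31 of the full product (sum is below 2^18). */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    /* High word of the unsigned 64-bit product. */
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    /* Convert the unsigned high word to the signed one: a negative operand
       is 2^32 too large when read as unsigned, so subtract the other operand. */
    if (a < 0) hi -= ub;
    if (b < 0) hi -= ua;

    return (int32_t)hi;
}
```

As a quick check, `mulshift32_16x16(0x40000000, 0x40000000)` returns 0x10000000, since 2^30 * 2^30 = 2^60 and 2^60 >> 32 = 2^28.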