Hello all!
I am new to the ARM community and this is my first question here. I work on embedded systems where we use Cortex-M4 based MCUs (specifically the STM32F3 series). I would like to ask whether there is a DSP instruction that calculates x*x + y*y.
x and y represent sine and cosine values (signed integers, 16-bit variables are sufficient). I would like to calculate the square of the amplitude (x*x + y*y).
Thanks in advance.
Hello,
you can use "SMLALD{X}<c> <RdLo>,<RdHi>,<Rn>,<Rm>".
Best regards,
Yasuhiko Koumoto.
Hi.
Thank you a lot. I played a little bit with that and other similar instructions and it works quite well. I noticed that SMLALD has an "accumulate" feature, which I don't need. Actually, it is a drawback in my case, because I would have to clear that accumulator back to zero before each use of the instruction.
Maybe SMLAD would be better: I could multiply topHW * topHW + bottomHW * bottomHW, add zero as the accumulate operand, and write the result to the destination variable.
Hi,
I'm sorry.
You are right.
Why don't you use the SMUAD{X} instruction? It performs bottom x bottom + top x top. The X variant computes bottom x top + top x bottom instead.
Regards, Prasad
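To make the halfword arithmetic concrete, here is a minimal portable C sketch of what SMUAD and SMUADX compute. The names smuad_model, smuadx_model, halfword_lo and halfword_hi are made up for this illustration. On a real Cortex-M4 build you would more likely use the CMSIS intrinsics __SMUAD and __SMUADX. The model ignores the Q flag that the hardware sets on overflow.

```c
#include <stdint.h>

/* Extract a halfword as a signed 16-bit value, widened to 32 bits.
 * The narrowing cast assumes a two's-complement target. */
int32_t halfword_lo(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
int32_t halfword_hi(uint32_t w) { return (int16_t)(w >> 16); }

/* Model of SMUAD: bottom*bottom + top*top.
 * The sum is formed in unsigned arithmetic so that the single overflow
 * case (both halfwords of both operands equal to -32768) wraps instead
 * of being undefined behaviour. */
uint32_t smuad_model(uint32_t a, uint32_t b)
{
    uint32_t p_lo = (uint32_t)(halfword_lo(a) * halfword_lo(b));
    uint32_t p_hi = (uint32_t)(halfword_hi(a) * halfword_hi(b));
    return p_lo + p_hi;
}

/* Model of SMUADX: the halfwords of the second operand are exchanged,
 * giving bottom*top + top*bottom. */
uint32_t smuadx_model(uint32_t a, uint32_t b)
{
    uint32_t p_lo = (uint32_t)(halfword_lo(a) * halfword_hi(b));
    uint32_t p_hi = (uint32_t)(halfword_hi(a) * halfword_lo(b));
    return p_lo + p_hi;
}
```

Passing the same packed word as both operands, for example smuad_model(w, w) with x in one halfword of w and y in the other, gives x*x + y*y.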
Great. I didn't see this one.
Thank you.
Matic
Hi matic,
You can use PKHBT and PKHTB to format the registers; see § 3.8, "Packing and unpacking instructions", of the Cortex-M4 Devices Generic User Guide, Revision r0p1. Then tabulate all the instructions for the two methods (PKHBT/PKHTB plus SMUAD versus multiply and add), and calculate and compare the total number of cycles needed to evaluate the expression. I hope you can post the result; I'm curious too, but I have to log out now.
Regards,
Goodwin
Hello Matic,
These instructions are a kind of SIMD extension that works within the existing ARM register file.
Consider the scenario where you have many x and y coordinates in memory, already in packed form. In that case you just need to load them as 32-bit values and perform the computation.
Even when x and y are themselves computed before x*x + y*y is evaluated, other SIMD instructions can be used to produce the x and y results in packed format.
If you look at this one calculation in isolation, then yes, there is a packing overhead, and it can be large enough that there is no advantage in using these special instructions.
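The packed-in-memory scenario could look roughly like the sketch below. It assumes a hypothetical layout in which each 32-bit word holds x in the top halfword and y in the bottom halfword, and the function name sum_sq_packed is invented for this example. It is plain C so that it runs anywhere. On a Cortex-M4 each loop iteration corresponds to one 32-bit load plus one dual multiply-accumulate such as SMLALD, which is where the accumulate feature becomes useful.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of x*x + y*y over n packed samples.
 * Assumed layout: x in bits [31:16], y in bits [15:0], both signed.
 * The narrowing casts assume a two's-complement target. */
int64_t sum_sq_packed(const uint32_t *samples, size_t n)
{
    int64_t acc = 0;                      /* 64-bit accumulator, like RdHi:RdLo */
    for (size_t i = 0; i < n; ++i) {
        int32_t x = (int16_t)(samples[i] >> 16);
        int32_t y = (int16_t)(samples[i] & 0xFFFFu);
        acc += (int64_t)x * x + (int64_t)y * y;
    }
    return acc;
}
```

No shift or OR is needed per sample here, because the data is already stored in the form the instruction expects.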
I would have another question regarding these DSP instructions.
What is their real advantage? I mean, if I calculate x*x + y*y using the SMUAD instruction, I first have to format the 32-bit register with (x << 16) | y. That makes three instructions (<<, | and SMUAD). If I do a simple calculation with two multiplications and one addition (x*x + y*y), I also have three instructions. I know that I will not gain a huge amount of time with these few instructions, but I am curious when they become preferable to the normal calculation. In my case there is no advantage in using them at all.
Thanks
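A small C sketch of that manual packing step, for reference. The helper name pack_xy is made up. One detail worth noting: if y is negative and is sign-extended to 32 bits, a plain (x << 16) | y would overwrite the top halfword with ones, so y is reduced to 16 bits first.

```c
#include <stdint.h>

/* Pack signed 16-bit x into the top halfword and y into the bottom.
 * Converting through uint16_t keeps only the low 16 bits, so a negative
 * y cannot spill sign bits into the top halfword. */
uint32_t pack_xy(int16_t x, int16_t y)
{
    return ((uint32_t)(uint16_t)x << 16) | (uint32_t)(uint16_t)y;
}
```

For example, x = 3 and y = -4 pack to 0x0003FFFC.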
To quickly find your way around the ARM and Thumb instruction sets, there is a very good cheat sheet on the ARM website: the ARM and Thumb-2 Instruction Set Quick Reference Card.
This sheet has been lying on my desk for some years now and has proven to be very useful!
You may want to try if the instructions below will work:
PKHBT Rpck, Ry, Rx, LSL #16 ; Writes bottom halfword of Ry to bottom halfword of
                            ; Rpck, writes top halfword of Rx, shifted left by 16 bits,
                            ; to top halfword of Rpck
SMUAD Rsumsqrs, Rpck, Rpck  ; Multiplies bottom halfword of Rpck with the bottom
                            ; halfword of Rpck (y squared), adds the product of the top
                            ; halfword of Rpck with itself (+ x squared), writes the
                            ; result to Rsumsqrs
1. Verify if the PKHBT instruction works as intended.
If registers Rx and Ry contain the signed 16-bit x and signed 16-bit y, respectively, in their low-order halfwords, PKHBT packs them into register Rpck. Here, x occupies the high-order halfword and y occupies the low-order halfword of Rpck. Rx and Ry can be interchanged, which swaps the high-order and low-order halfwords in Rpck.
2. Verify if the format used for SMUAD is allowed.
Using Rpck for both the first and second operands in SMUAD, the sum of the square of the high-order halfword and the square of the low-order halfword of Rpck will be stored in Rsumsqrs.
If this works, you need a total of 2 instructions (and 2 cycles) to compute x^2 + y^2 (when x and y are already in registers Rx and Ry before the PKHBT instruction).
There might be further improvements you can make to the code, but additional details about your application are needed. For example, if it is possible to interleave x and y in memory (so that x and y occupy one word), the PKHBT instruction after the load is no longer needed. I believe, though, that they are results of calculations (real and imaginary components of a complex quantity) and will already be in registers before the square of the amplitude is calculated.
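The two verification steps above can also be checked on a PC with a small C model of the two instructions. This is only a sketch: pkhbt_lsl_model, smuad_ref and amplitude_squared_packed are hypothetical names, the casts assume a two's-complement target, and the overflow (Q) flag is not modelled. On the target itself, the CMSIS __PKHBT and __SMUAD intrinsics would normally take the place of the two models.

```c
#include <stdint.h>

/* Model of PKHBT Rd, Rn, Rm, LSL #shift (shift in 0..31):
 * bottom halfword from Rn, top halfword from the shifted Rm. */
uint32_t pkhbt_lsl_model(uint32_t rn, uint32_t rm, unsigned shift)
{
    return (rn & 0x0000FFFFu) | ((rm << shift) & 0xFFFF0000u);
}

/* Model of SMUAD: signed bottom*bottom + top*top, wrapping on overflow. */
uint32_t smuad_ref(uint32_t a, uint32_t b)
{
    int32_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int32_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    return (uint32_t)(a_lo * b_lo) + (uint32_t)(a_hi * b_hi);
}

/* x*x + y*y via pack-then-SMUAD, mirroring the two-instruction sequence. */
uint32_t amplitude_squared_packed(int16_t x, int16_t y)
{
    uint32_t rx = (uint32_t)(int32_t)x;   /* sign-extended, as in a register */
    uint32_t ry = (uint32_t)(int32_t)y;
    uint32_t rpck = pkhbt_lsl_model(ry, rx, 16);  /* x on top, y on bottom */
    return smuad_ref(rpck, rpck);
}
```

For x = 3 and y = 4 the model returns 25, and the sign-extension bits of a negative y do not leak into the top halfword because PKHBT takes only the bottom halfword of its first source register.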
Thank you all for your help!
I found out that I don't gain anything by using the PKHBT and SMUAD instructions. Actually, if I do a standard multiplication and addition (x^2 + y^2), it also takes only 2 instructions (with optimization level 3), because it compiles to one MUL and one MLA instruction.
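For comparison, the plain version would be something like the function below. This is a guess at the shape of the code, not Matic's actual source, and the name amplitude_squared_plain is invented. Whether a given compiler really emits MUL followed by MLA depends on the toolchain and options, so check the disassembly.

```c
#include <stdint.h>

/* Plain C: x*x + y*y for signed 16-bit inputs.
 * Each product fits in 32 bits (at most 2^30). The sum is formed in
 * unsigned arithmetic so that x = y = -32768 (sum 2^31) is well defined. */
uint32_t amplitude_squared_plain(int16_t x, int16_t y)
{
    uint32_t xx = (uint32_t)((int32_t)x * x);
    uint32_t yy = (uint32_t)((int32_t)y * y);
    return xx + yy;
}
```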
However, I will be more attentive to those instructions in the future, because they might be helpful some day.
Nice to see that you are trying to get the most out of your target microcontroller!
In your case, using intrinsics would not gain you any cycles, but you may see other advantages in using such instructions.
I have listed a couple of advantages in a previous blog post (What intrinsics make you gain | ARM Cortex M4 & M7 Unleashed) if you're interested!
Thank you for the link. I will definitely read your blog.