I have looked at the cycle counts for the Cortex M3 instructions at http://infocenter.arm.com/help/topic/com.arm.doc.100165_0201_00_en/ric1414056333562.html. Some instructions are listed as taking a range of cycles to complete. I want to understand what conditions determine the actual cycle counts.
I am particularly interested in the SMULL/UMULL and SMLAL/UMLAL instructions, which take 3-5 and 4-7 cycles respectively. The linked reference says these instructions terminate early depending on the size of the source values. What does this mean exactly?
I am also interested in the SDIV and UDIV instructions which take 2-12 cycles. Is there a way I can determine how many cycles the instruction will actually take?
I am guessing here. For (S|U)DIV, the time it takes depends on the input values, and to get a rule of thumb(TM) you would have to know the design. What I am pretty sure of is that the same inputs always take the same number of cycles. As for multiplication, I'd say there are fewer steps if the input values have a lot of leading zeroes.
I would guess the long multiplies can involve up to three 32x32 multiplies plus an addition and an overhead cycle, and the ones with accumulate can involve another two additions. And a cycle can be left out if an operand is 0 (and perhaps -1 but I wouldn't bet on that).
For the division, the timings sound like the hardware can do up to three instances of a very simple single-bit shift algorithm per cycle, with an extra cycle for making the operands positive for a signed divide and quickly shifting through zeroes in the numerator.
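To make the guess above concrete, here is a minimal sketch of a shift-and-subtract unsigned divide that skips the leading zeroes of the numerator and counts how many single-bit steps remain. This is a toy model only, not ARM's documented implementation; the function name `shift_sub_udiv` and the `steps` output are made up for illustration, and the sketch simply returns 0 for a zero divisor.

```c
#include <stdint.h>

/* Toy model (an assumption, not the real Cortex-M3 divider):
 * restoring division, one quotient bit per step, with the leading
 * zero bits of the numerator skipped up front.
 * Returns the quotient and stores the number of bit steps in *steps. */
uint32_t shift_sub_udiv(uint32_t n, uint32_t d, unsigned *steps)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64-bit so that (r << 1) cannot overflow */
    int bit = 31;

    *steps = 0;
    if (d == 0) {
        return 0;     /* this sketch does not model a divide-by-zero trap */
    }

    /* Fast-forward through the leading zeroes of the numerator. */
    while (bit >= 0 && ((n >> bit) & 1u) == 0) {
        bit--;
    }

    /* One conditional subtract and shift per remaining numerator bit. */
    for (; bit >= 0; bit--) {
        r = (r << 1) | ((n >> bit) & 1u);
        q <<= 1;
        if (r >= d) {
            r -= d;
            q |= 1u;
        }
        (*steps)++;
    }
    return q;
}
```

In this model 100 / 7 needs 7 steps because 100 has 7 significant bits, while 0xFFFFFFFF / 1 needs all 32. If the hardware really retires a few such steps per cycle, that would be consistent with a wide 2-12 cycle range, but that part remains a guess.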
I wouldn't bet on any of that. If you need the timings to be constant, you can expand the operation into individual instructions that each take a constant time. Dividing by a constant can be done by a multiply and a little messing around. This can actually be faster than a hardware divide in some cases.
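As an illustration of the multiply trick, here is a minimal sketch for unsigned division by the constant 10. The function name `udiv10` is made up for this example; the constant 0xCCCCCCCD is ceil(2^35 / 10), and the shift by 35 undoes the scaling.

```c
#include <stdint.h>

/* Unsigned divide by 10 without a divide instruction:
 * multiply by ceil(2^35 / 10) = 0xCCCCCCCD and keep the bits above 2^35.
 * The 64-bit product is needed so the high half is not lost. */
uint32_t udiv10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```

For example, `udiv10(99)` gives 9 and `udiv10(4294967295u)` gives 429496729. Note that a compiler targeting the Cortex-M3 would most likely use a long multiply for the 64-bit product, which is itself one of the variable-cycle instructions asked about, so this avoids the divider but is not automatically constant-time.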
How do you get three 32x32 multiplies? I'm not seeing that.
Is it possible to get the number of cycles for input sizes confirmed by ARM?
For division, I presumed it would be a fast-forwarded conditional subtraction and shift operation. I just wanted it confirmed.
My motivation for these questions is that I want a better understanding of the conditions that would cause an operation to take much longer than expected due to inputs.
Using an extra register would involve an extra cycle, which would go in somewhere, but I think perhaps the best thing to do is to write a program and see. I would guess the times depend on whether the top 16 bits are zero or not, but you could also try 15-bit and negative numbers; just time a loop and see what the difference is. Try some very small numbers too, both to get a baseline and in case the cycle count starts going up at a smaller value.
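A minimal sketch of such a timing loop, assuming the part implements the DWT cycle counter (DWT_CYCCNT at 0xE0001004, enabled through DEMCR.TRCENA and DWT_CTRL.CYCCNTENA). The helper names `cycles_init`, `cycles_now` and `time_udiv` are made up for this example, and the non-Cortex-M branch is only a stand-in based on `clock()` so the code also builds on a desktop machine, where the numbers mean nothing.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_7M__)
/* Core debug registers on ARMv7-M (assumes the DWT cycle counter exists). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static void cycles_init(void)
{
    DEMCR |= 1u << 24;   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;      /* CYCCNTENA: start the cycle counter */
}

static uint32_t cycles_now(void) { return DWT_CYCCNT; }
#else
/* Host fallback so the sketch compiles elsewhere; not cycle accurate. */
#include <time.h>
static void cycles_init(void) {}
static uint32_t cycles_now(void) { return (uint32_t)clock(); }
#endif

/* Times `reps` divisions of n by d. The volatile operands stop the
 * compiler from folding the division away. The result includes loop
 * and load/store overhead, so compare runs against each other rather
 * than reading the number as a per-instruction cycle count. */
uint32_t time_udiv(uint32_t n, uint32_t d, uint32_t reps)
{
    volatile uint32_t vn = n, vd = d, sink = 0;

    cycles_init();
    uint32_t start = cycles_now();
    for (uint32_t i = 0; i < reps; i++) {
        sink = vn / vd;
    }
    uint32_t end = cycles_now();

    (void)sink;
    return end - start;
}
```

Calling `time_udiv(1, 1, 1000)` gives a baseline; comparing it with, say, `time_udiv(0xFFFFFFFF, 1, 1000)` and `time_udiv(0xFFFF, 3, 1000)` should show where the cycle count starts to climb. The same harness can be adapted to a 64-bit multiply to probe SMULL/UMULL.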