Cortex M3 : what determines the cycle count for a variable cycle count instruction?

I have looked at the cycle counts for the Cortex M3 instructions at http://infocenter.arm.com/help/topic/com.arm.doc.100165_0201_00_en/ric1414056333562.html. Some instructions are listed as taking a range of cycles to complete. I want to understand what conditions determine the actual cycle counts.

I am particularly interested in the SMULL/UMULL and SMLAL/UMLAL instructions, which take 3-5 and 4-7 cycles respectively. The linked reference states that these instructions terminate early depending on the size of the source values. What does this mean exactly?

I am also interested in the SDIV and UDIV instructions which take 2-12 cycles. Is there a way I can determine how many cycles the instruction will actually take?

  • I would guess the long multiplies can involve up to three 32x32 multiplies plus an addition and an overhead cycle, and the ones with accumulate can involve another two additions. A cycle can probably be left out if an operand is 0 (and perhaps -1, but I wouldn't bet on that).

    For the division, the timings sound as though the hardware can do up to three steps of a very simple single-bit shift algorithm per cycle, with an extra cycle for making the operands positive in a signed divide, and that it quickly shifts through zeroes in the numerator.

    I wouldn't bet on any of that. If you need the timings to be constant, you can expand the operation into individual instructions that each take a constant time. Dividing by a constant can be done with a multiply and a little messing around, which can actually be faster than a hardware divide in some cases.
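
As an illustration of the "multiply and a little messing around" idea, here is a minimal sketch of an unsigned divide by the constant 10. The function name `div10` and the choice of divisor are made up for the example. Note that on a Cortex-M3 the 64-bit product would most likely compile to UMULL, which is itself listed with a variable cycle count, so this probably narrows the timing spread rather than removing it.

```c
#include <stdint.h>

/* Unsigned divide by the constant 10 using a multiply and a shift.
   0xCCCCCCCD is ceil(2^35 / 10); taking the 64-bit product and shifting
   it right by 35 gives x / 10 for every 32-bit unsigned x.
   div10 is a hypothetical name used only for this sketch. */
static inline uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Other divisors need their own multiplier and shift, and some need an extra correction step, so it is worth checking the result against a plain `/` over the input range you care about.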

  • How do you get three 32x32 multiplies? I'm not seeing that.

    Is it possible to get ARM to confirm the number of cycles for given input sizes?

    For division, I presumed it would be a fast-forwarded conditional subtraction and shift operation. I just wanted it confirmed.

    My motivation for these questions is that I want a better understanding of the conditions that would cause an operation to take much longer than expected because of its inputs.
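
To make the "fast-forwarded conditional subtraction and shift" idea concrete, here is a small software model of it. This is only a sketch of the general technique, not ARM's documented implementation; `udiv_model`, `div_model_result` and the `steps` counter are invented for the example, and `steps` counts loop iterations, not hardware cycles.

```c
#include <stdint.h>

/* Hypothetical software model, NOT ARM's documented implementation:
   shift-and-conditional-subtract division that first skips the leading
   zero bits of the numerator. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
    unsigned steps;       /* shift/subtract iterations actually performed */
} div_model_result;

static div_model_result udiv_model(uint32_t n, uint32_t d)
{
    div_model_result r = {0u, 0u, 0u};
    uint64_t rem = 0u;    /* 64-bit so the left shift below cannot overflow */
    int bit = 31;

    if (d == 0u) {
        return r;         /* convention for this sketch: all zeros for d == 0 */
    }

    /* Fast-forward: skip the leading zero bits of the numerator. */
    while (bit >= 0 && ((n >> bit) & 1u) == 0u) {
        bit--;
    }

    /* One conditional subtract per remaining numerator bit. */
    for (; bit >= 0; bit--) {
        rem = (rem << 1) | ((n >> bit) & 1u);
        r.quotient <<= 1;
        if (rem >= d) {
            rem -= d;
            r.quotient |= 1u;
        }
        r.steps++;
    }
    r.remainder = (uint32_t)rem;
    return r;
}
```

In this model the amount of work depends only on how many significant bits the numerator has: `udiv_model(100u, 7u)` runs 7 iterations, while a numerator with bit 31 set runs all 32. Whether the real hardware behaves this way is exactly what would need confirming.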

  • Using an extra register would involve an extra cycle, which would have to go in somewhere, but I think the best thing to do is to write a program and see. I would guess the times are based on whether the top 16 bits are zero or not, but you could also try 15-bit and negative numbers, time a loop, and see what the difference is. Try some very small numbers too, both to get a baseline and in case the cycle count starts going up at a smaller operand size.
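
A rough sketch of such a measurement follows, assuming the part implements the DWT cycle counter (it is an optional feature). The register addresses are the standard ARMv7-M ones; the function names, the operand list and the host-side fake counter are all made up for the example. The fake counter only exists so the code also compiles off-target and does not produce real timings.

```c
#include <stdint.h>

#if defined(__arm__)
/* ARMv7-M debug registers used to run the DWT cycle counter. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static void cycles_init(void)
{
    DEMCR     |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;
    DWT_CTRL  |= 1u;           /* CYCCNTENA: start the cycle counter */
}
static uint32_t cycles_now(void) { return DWT_CYCCNT; }
#else
/* Off-target placeholder: a fake counter that ticks once per read. */
static uint32_t fake_count;
static void cycles_init(void) { fake_count = 0u; }
static uint32_t cycles_now(void) { return fake_count++; }
#endif

/* Operand sizes suggested above: tiny, 15-bit, 16-bit, just over 16 bits,
   and all ones (-1 when read as signed). */
static const uint32_t operands[] = {
    0u, 1u, 0x7FFFu, 0xFFFFu, 0x10000u, 0xFFFFFFFFu
};
#define N_OPERANDS (sizeof operands / sizeof operands[0])

static volatile uint64_t sink;  /* keeps the product from being optimised away */

/* Time one 32x32->64 unsigned multiply; the result includes fixed overhead. */
static uint32_t time_umull(uint32_t a, uint32_t b)
{
    volatile uint32_t va = a, vb = b;   /* stop constant folding */
    uint32_t x = va, y = vb;
    uint32_t start = cycles_now();
    sink = (uint64_t)x * y;
    uint32_t end = cycles_now();
    return end - start;
}

/* Fill a table of raw counts for every operand pair. */
static void sweep(uint32_t cycles[N_OPERANDS][N_OPERANDS])
{
    cycles_init();
    for (unsigned i = 0; i < N_OPERANDS; i++) {
        for (unsigned j = 0; j < N_OPERANDS; j++) {
            cycles[i][j] = time_umull(operands[i], operands[j]);
        }
    }
}
```

The raw counts include the overhead of reading the counter and storing the result, so compare the differences between operand pairs rather than the absolute numbers. It is also worth checking the disassembly to make sure the compiler really placed a UMULL between the two counter reads.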