I am trying to generate a piece of code on an M4 with an exact known runtime, independent of the input.
Currently my bottleneck is that the duration of a division (udiv) is dependent on the input and therefore variable in execution time. Is there a way to ensure that my division lasts a same amount of instructions for each input?
Note: I am trying to write this with as minimal overhead as possible due to rather extreme execution time constraints.
From the doc:
a. Division operations use early termination to minimize the number of cycles required based on the number of leading ones and zeroes in the input operands.
Maybe you can try to adjust those leading zeroes before division and the right shift the result.
Or you can measure the cycles and do a bunch of NOPs to reach the WCET.
Hi Bastian, Thanks for your good reply!
Two good suggestions!
For the first option, the problem is that the numerator for the division is a 16-bit pseudorandom number, so the variation on the leading zeroes is quite large and unpredictable, adjusting these would leave me with a significant overhead imho.
The NOPs method would have been my first choice, but from other use cases on the M4 so far, it has proven quite hard to match timing using NOPs on an M4 due to the NOP instruction being taken as a hint and not necessarily delaying the exact amount of time matching the amount of NOPs. (instruction alignment and pipelining of instructions seem to have a big impact on this, but I have not found a satisfying explanation for this yet.)
I was mainly wondering whether there is a way of disabling/preventing the early termination mechanism?
If you are for a constant number of cycles and the total amount does not matter (much), you might do two divisions.
One with the original numerator and the second with the inverted one. Maybe the total runtime of these two is close to constant.
Hi Bastian, thanks a lot for your help.
Currently I have used a cast to float such that it performs a float division which always lasts 14 cycles. Timingwise it's all stable now.
Will try your proposition out of curiosity though! :)
Have a nice day, Rens
When does "early termination" happen? (how does up-to 63 leading zeros become 12 cycles or less?)With a 16bit numerator, presumably the range in number of cycles is smaller than if there were a full 32bit numerator? Do you know how much variation you're seeing? I'm wondering if you can multiply by constants first (perhaps in conjunction with CLZ) to always result in the same number of cycles. (well, I'm pretty sure you could, but I don't know whether it would be the float solution you came up with.)
If the denominator is constant, you could consider a multiply-by-reciprocal approach.
Thanks for the update and quick reply. I'll be sure to keep an eye on this thread. Looking for the same issue. Bumped into your thread. Thanks for creating it. Looking forward for solution.