This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

speed up square root computation (approximation)

Hello,

I have to compute the square root of a floating point number.
A function sqrt() already exists in the math.h, but this function is too slow and the computation time depends on the input value (because of an iterative method used to compute the square root). The accuracy requirements in my case are not very high, so 2 digits after decimal point would be enough.

So I seached for an alternative method to compute a square root and found an interessting method, that uses an approximation to compute the square root (see below). But this only works on 32 bit CPUs (little endian), so I tried to port it to a ST10.

I thought this is not very difficult because the ST10 also uses little endian and is IEEE-754 compatible.

The function to approximate the square root:
float fastsqrt(float val) {
int tmp = *(int *)&val;
tmp -= 1<<23; /* Remove IEEE bias from exponent (-2^23) */
/* tmp is now an appoximation to logbase2(val) */
tmp = tmp >> 1; /* divide by 2 */
tmp += 1<<23; /* restore the IEEE bias from the exponent (+2^23) */
return *(float *)&tmp;
}

I replaced all int with long (because long is 32bit on ST10), but this doesn't work anyway.

It seems that the behavior of the shifts is different on the ST10 in combination with long numbers than on a 32-bit CPU.

Does anybody has same hints for me to make this function working?
Or has somebody an alternative function for approximate or compute a square root in a quick way on a ST10?

Thanks a lot...

Alexander

  • You can optimize the function further if you write it in assembly:

    MOV	R4,R8
    MOV	R5,R9
    SUB	R5,#0x3F80
    SHR	R4,#1
    BMOV	R4.15,R5.0
    ASHR	R5,#1
    ADD	R5,#0x3F80
    
    Only 7 instructions: this should be very fast. With optimization switched on, the compiler should generate similar code. Try raising optimization level and see the generated code. If the compiler doesn't do a good job, consider writing the function in assembly.

    Regards,
    - mike

  • Hi Mike,

    your modified code works pretty well now. Thanks a lot. :D

    I did a little benchmark with this code and measure the execution time over 100 square-root computations with a timer.

    This fastsqrt-function is more than 3 times faster than the normal sqrt-function and has a fixed computation time.

    The only drawback is the low accuracy for small numbers. But I'm dealing with more or less large numbers and my accuracy-requirements are not very high, so this functions works fine for me.

    If you need high accuracy for small numbers, Drews tip using a lookup-table for the mantissa is great.

    Thanks a lot to everyone who participates on this discussion.

    regards...

    Alexander

  • Hi Alexander,

    I've played around with the code a bit. I believe that you need to replace 1L<<23 with 127L<<23. That's the correct way of removing and restoring IEEE exponent bias, if I am not mistaken, based on the discription of float representation in memory from Keil's docs. I tried it and it gave sensible results.
    Googling around I found this link, which seemed interesting:
    http://www.mactech.com/articles/mactech/Vol.14/14.01/FastSquareRootCalc/

    Regards,
    - mike

  • Hello together,

    I have tested the modified function Mike posted above. This function doesn't work.

    On my uC-board is a display (16x2 chars) so I'm able to see the result of the fastsqrt-computation on the display and the result is always zero (independed of input-value).

    I used this quick&dirty hacked program:
    So there are same variables and #include I have to use to control the display. But this has no effect on the fastsqrt-function.

    #include "stdio.h"
    #include "math.h"
    #include "Intrins.h"			// intrinsic commands (nop..)
    #include "regst10F269.h"		// Register Set of ST10F268 controller
    
    // variables for the display
    unsigned int far display_count1,display_count2,x,disp_pos,disp_data;
    unsigned char far c;
    bit far disp_busy, disp_tmp1, disp_tmp2;
    
    #include "display_2zeilig_4bit.h"	// display control
    
    float fastsqrt(float val) {
        long tmp = *(long *)&val;
        tmp -= 1L<<23; /* Remove IEEE bias from exponent (-2^23) */
        /* tmp is now an appoximation to logbase2(val) */
        tmp = tmp >> 1; /* divide by 2 */
        tmp += 1L<<23; /* restore the IEEE bias from the exponent (+2^23) */
        return *(float *)&tmp;
    }
    
    void main(void)	{
    
    	float sqrtval = 10.0;
    	float result = fastsqrt(sqrtval);
    	float result2 = sqrt(sqrtval);
    
    	init_display();
    	printf("%f, %f", result, result2);
    }
    


    This is the generated assembler-code for the fastsqrt-function.
    	fastsqrt  PROC  FAR
    	PUBLIC  fastsqrt
    ; FUNCTION fastsqrt (BEGIN  RMASK = @0x0330)
    	MOV	[-R0],R9
    	MOV	[-R0],R8
    	SUB	R0,#4
    	MOV	R4,[R0+#4]                 ; val
    	MOV	R5,[R0+#6]                 ; val+2
    	MOV	[R0],R4                    ; tmp
    	MOV	[R0+#2],R5                 ; tmp+2
    	MOV	R8,R4
    	MOV	R9,R5
    	SUB	R9,#128
    	MOV	[R0],R4                    ; tmp
    	MOV	[R0+#2],R9                 ; tmp+2
    	MOV	R5,R9
    	SHR	R4,#1
    	BMOV	R4.15,R5.0
    	ASHR	R5,#1
    	MOV	[R0],R4                    ; tmp
    	MOV	[R0+#2],R5                 ; tmp+2
    	ADD	R5,#128
    	MOV	[R0],R4                    ; tmp
    	MOV	[R0+#2],R5                 ; tmp+2
    	MOV	R4,[R0]                    ; tmp
    	MOV	R5,[R0+#2]                 ; tmp+2
    	ADD	R0,#8
    	RETS
    ; FUNCTION fastsqrt (END    RMASK = @0x0330)
    	fastsqrt  ENDP
    

    The input-value (in my example 10.0) is stored in R8 and R9:
    	MOV	R8,#0
    	MOV	R9,#16672
    	CALL	fastsqrt
    

    Thanks for your help.

    regards

    Alexander

  • Hi Alexander,

    I read the Wikipedia article you mentioned. Seems like this should work. The function will return a denormalized result, but this should not be a problem. Please confirm that this works (or doesn't work):

    float fastsqrt(float val) {
        long tmp = *(long *)&val;
        tmp -= 1L<<23; /* Remove IEEE bias from exponent (-2^23) */
        /* tmp is now an appoximation to logbase2(val) */
        tmp = tmp >> 1; /* divide by 2 */
        tmp += 1L<<23; /* restore the IEEE bias from the exponent (+2^23) */
        return *(float *)&tmp;
    }
    

    If not, how do you know that it doesn't work?

    - mike

  • This function simply divides the exponent of the floating point number by 2. Multiplying two numbers together means you add their exponents. Dividing the exponent by two is a way to find two numbers with equal exponents that multiply to the original exponent, which is roughly equivalent to taking the square root

    Consider sqrt(100). That's 10^2, which is to say 10^1 * 10^1. 2 / 2 = 1, the new exponent of 10 in the result.

    Since this method neglects the mantissa entirely, it won't be very precise, nor accurate for numbers without large exponents. The example above happens to be exactly accurate, because the input is an even power of the base I'm using (10).

    Consider sqrt(500). 5 * 10^2 -> 5 * 10^1. This function would produce the result "50" rather than ~22, because it doesn't take the root of the mantissa.

    For really large numbers, this effect might not be as important. The square root of 5,000,000,000,000 is about 5,000,000, and the error factor of around 2 perhaps doesn't matter compared to the different of six orders of magnitude between the input and the root. For numbers close to 2^1, the error is likely much more significant.

    A 256-entry lookup table for the mantissa should get you around 2 decimal digits of accuracy there (with a maximum error around 1/256th, or ~0.4%). You could look up the 8 most significant bits of the mantissa and replace it with a pre-calculated 8-bit value from the LUT. This will still be pretty quick, but cost 256 bytes of RAM.

    Interpolation between table entries would further improve accuracy at the cost of speed.

  • Jack Gannsle has had a few articles about fast approximations in embedded magazine (I then to remember feb-apr) go to http://www.embedded.com and hunt them up

    Erik

  • The algorithm assumes that 'int' is a 32-bit number. Replace 'int' with 'long'.

  • Hi Mike,

    thanks for your answer.
    Yout hint doesn't work, I'm afraid.

    I don't know why this function work, but it does work. It is a special case of newton's method in combination with the bit-format of an IEEE-floating-point-number.

    Please referr to http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Approximations_that_depend_on_IEEE_representation for more information.

    regards...

    Alexander

  • I think you need to replace 1<<23 with 1L<<23. Consult your favourite C textbook to see why.
    Not sure what this function does, but I doubt it can provide an accuracy of 2 decimal digits.

    Regards,
    - mike