Code portability

Hello,
I was browsing through older posts that deal with the painful issue of portability (http://www.keil.com/forum/docs/thread8109.asp). I was (and still am) a big advocate of sticking to the C standard as much as possible, with a layered structure that allows "plugging in" other hardware. But I have changed my mind recently. I am reading the "ARM System Developer's Guide" (an excellent book, by the way; I'm reading it because I want to port some C167 code to an ARM9 environment), in which chapter 5 discusses writing efficient C code for an ARM. The point, which is well demonstrated, is that even common, innocent-looking C code can be either efficient or very inefficient on an ARM depending on the specific choices made, let alone on another processor! So, if we are talking about squeezing every clock cycle out of a microcontroller, I do not believe that portability is possible without ultimately littering the code!

Parents
  • Ok Jack, here you go:

    int checksum_v5(int *data)
    {
        unsigned int i;
        int sum=0;
        for (i=0; i<64; i++)
        {
           sum += *(data++);
        }
        return sum;
    }
    


    This compiles to

    checksum_v5
    MOV r2,r0 ; r2 = data
    MOV r0,#0 ; sum = 0
    MOV r1,#0 ; i = 0
    checksum_v5_loop
    LDR r3,[r2],#4 ; r3 = *(data++)
    ADD r1,r1,#1 ; i++
    CMP r1,#0x40 ; compare i, 64
    ADD r0,r3,r0 ; sum += r3
    BCC checksum_v5_loop ; if (i<64) goto loop
    MOV pc,r14 ; return sum
    

    It takes three instructions to implement the for loop structure:

    * An ADD to increment i
    * A compare to check if i is less than 64
    * A conditional branch to continue the loop if i < 64

    This is not efficient. On the ARM, a loop should only use two instructions:

    * A subtract to decrement the loop counter, which also sets the condition code flags on the result
    * A conditional branch instruction

    The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit.

    Now, an improved version is this:

    int checksum_v6(int *data)
    {
        unsigned int i;
        int sum=0;
        for (i=64; i!=0; i--)
        {
          sum += *(data++);
        }
        return sum;
    }
    

    This compiles to

    checksum_v6
    MOV r2,r0 ; r2 = data
    MOV r0,#0 ; sum = 0
    MOV r1,#0x40 ; i = 64
    checksum_v6_loop
    LDR r3,[r2],#4 ; r3 = *(data++)
    SUBS r1,r1,#1 ; i-- and set flags
    ADD r0,r3,r0 ; sum += r3
    BNE checksum_v6_loop ; if (i!=0) goto loop
    MOV pc,r14 ; return sum
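
    For anyone who wants to convince themselves that only the loop direction changed, here is a minimal sketch that puts both variants side by side. The names checksum_up, checksum_down and check_checksums_agree are made up for this illustration (they are not from the book), and I added const to the parameter. Since the counter is never used inside the loop body, both functions add the same 64 words in the same order.

    ```c
    #include <assert.h>

    #define CHECKSUM_WORDS 64u  /* both routines sum exactly 64 words */

    /* Counts up to a limit, like checksum_v5. */
    int checksum_up(const int *data)
    {
        unsigned int i;
        int sum = 0;
        for (i = 0; i < CHECKSUM_WORDS; i++) {
            sum += *(data++);
        }
        return sum;
    }

    /* Counts down to zero, like checksum_v6. */
    int checksum_down(const int *data)
    {
        unsigned int i;
        int sum = 0;
        for (i = CHECKSUM_WORDS; i != 0; i--) {
            sum += *(data++);
        }
        return sum;
    }

    /* Hypothetical helper: aborts if the two variants ever disagree.
       data must point to at least CHECKSUM_WORDS readable ints. */
    void check_checksums_agree(const int *data)
    {
        assert(checksum_up(data) == checksum_down(data));
    }
    ```

    Call check_checksums_agree() on any 64-word buffer whose sum does not overflow an int. What it cannot tell you is what your compiler generates for each variant; for that you still have to look at the listing.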
    

    Say, Jack, are you going to read the manual for a change :-) :-) ;-)

Children
  • Yes Per, another excellent example, but I think that the example above is more powerful as it depends on the actual instruction set of the processor.

  • Just about all processors prefer loops that decrement to zero, since zero is "magic".

    In this case it takes a decrement and a conditional branch. A lot of processors have DJNZ instructions, where a hard-coded register is used to fit everything into a single instruction.
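
    In C, the shape that maps most directly onto "decrement, then branch if not zero" is a do-while loop with a pre-decrement in the condition. A minimal sketch, assuming the caller always passes a count of at least one; checksum_countdown and its n parameter are hypothetical names, not something from the posts above. Whether a given compiler actually emits a DJNZ or a SUBS/BNE pair for it depends on the compiler and its optimisation settings.

    ```c
    #include <assert.h>

    /* Hypothetical variant: sums n words, counting n down to zero.
       n must be at least 1, because a do-while body always runs once. */
    int checksum_countdown(const int *data, unsigned int n)
    {
        int sum = 0;
        assert(n != 0);
        do {
            sum += *(data++);
        } while (--n != 0);   /* decrement and test against zero in one step */
        return sum;
    }
    ```

    For example, checksum_countdown(buffer, 64) sums the same 64 words as the fixed-size routines discussed earlier in the thread.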

  • The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit.

    Given that the loop counter is not used in or after the body of the loop, the compiler is, I believe, well within its rights under the 'as if' rule to rearrange the loop to decrement rather than increment. I guess it's a quality of implementation issue.

    If you decide to code loops like this to decrement rather than increment the resulting 'C' is no less portable, so I'm not entirely sure what your point is.

    If you mean that you would have to perform this kind of manual optimisation for each platform and/or compiler you target then I congratulate you on being able to design hardware that is only just powerful enough to work with optimal code every time.

    If 'every clock cycle counts' then you have to use assembly. 'C' will always produce code that is slower and larger - the problem is that you cannot predict by exactly how much. If this matters, don't use 'C'.

    Say, Jack, are you going to read the manual for a change

    I don't use ARM.