Hello, I was browsing through older posts that deal with the painful issue of portability (http://www.keil.com/forum/docs/thread8109.asp). I was (and still am) a big advocate of programming as much as possible in conformance with the C standard, and of having a layered structure that allows "plugging in" other hardware. But I have come to change my mind recently. I am reading the "ARM system developer's guide" (excellent book, by the way; I'm reading it because I want to port some C167 code to an ARM9 environment), in which chapter 5 discusses writing efficient C code for an ARM. The point is, and it is fairly demonstrated, that even common, innocent-looking C code can be either efficient or very inefficient on an ARM depending on the specific choices made, let alone on another processor! So, if we are talking about squeezing every clock cycle out of a microcontroller, I do not believe that portability is possible without ultimately littering the code!
Ok Jack, here you go:
int checksum_v5(int *data)
{
    unsigned int i;
    int sum=0;

    for (i=0; i<64; i++)
    {
        sum += *(data++);
    }
    return sum;
}
This compiles to
checksum_v5
        MOV     r2,r0              ; r2 = data
        MOV     r0,#0              ; sum = 0
        MOV     r1,#0              ; i = 0
checksum_v5_loop
        LDR     r3,[r2],#4         ; r3 = *(data++)
        ADD     r1,r1,#1           ; i++
        CMP     r1,#0x40           ; compare i, 64
        ADD     r0,r3,r0           ; sum += r3
        BCC     checksum_v5_loop   ; if (i<64) goto loop
        MOV     pc,r14             ; return sum
It takes three instructions to implement the for loop structure:
* An ADD to increment i
* A compare to check if i is less than 64
* A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
* A subtract to decrement the loop counter, which also sets the condition code flags on the result
* A conditional branch instruction
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit.
Now, an improved version is this:
int checksum_v6(int *data)
{
    unsigned int i;
    int sum=0;

    for (i=64; i!=0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
checksum_v6
        MOV     r2,r0              ; r2 = data
        MOV     r0,#0              ; sum = 0
        MOV     r1,#0x40           ; i = 64
checksum_v6_loop
        LDR     r3,[r2],#4         ; r3 = *(data++)
        SUBS    r1,r1,#1           ; i-- and set flags
        ADD     r0,r3,r0           ; sum += r3
        BNE     checksum_v6_loop   ; if (i!=0) goto loop
        MOV     pc,r14             ; return sum
Say, Jack, are you going to read the manual for a change :-) :-) ;-)
Loop unrolling?
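For anyone following along, unrolling applied to the checksum posted above might look roughly like the sketch below. The name checksum_unrolled and the unroll factor of 4 are my own assumptions for illustration, and it only works as written because 64 is a multiple of 4. Whether it is actually faster depends on the compiler and the core.

```c
/* Hypothetical 4x-unrolled variant of the checksum above.
   Assumes the element count (64) is a multiple of the unroll factor (4). */
int checksum_unrolled(int *data)
{
    unsigned int i;
    int sum = 0;

    /* 16 iterations, four loads and adds per iteration, so the
       decrement-and-branch overhead is paid 16 times instead of 64. */
    for (i = 64 / 4; i != 0; i--)
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    return sum;
}
```

The loop still counts down to zero, so it keeps the two-instruction loop structure; it just spreads that cost over four elements.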
Yes Per, another excellent example, but I think that the example above is more powerful as it depends on the actual instruction set of the processor.
Just about all processors prefer loops that decrement to zero, since zero is "magic".
In this case it takes a decrement and a conditional branch. A lot of processors have DJNZ instructions, where a hard-coded register is used so that the decrement and the branch fit in a single instruction.
Given that the loop counter is not used in or after the body of the loop the compiler is, I believe, well within its rights under the 'as if' rule to rearrange the loop to decrement rather than increment. I guess it's a quality of implementation issue.
If you decide to code loops like this to decrement rather than increment the resulting 'C' is no less portable, so I'm not entirely sure what your point is.
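To make that concrete, here is a small sketch that puts the two loop forms side by side and compares their results. The names checksum_up, checksum_down and checksums_agree are made up for this post; the loop bodies are the same as in checksum_v5 and checksum_v6 above. Both functions are plain standard C, and since the counter is not used inside the loop body they visit the same 64 elements in the same order.

```c
/* Count-up loop, same shape as checksum_v5 (hypothetical name). */
int checksum_up(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return sum;
}

/* Count-down loop, same shape as checksum_v6 (hypothetical name). */
int checksum_down(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}

/* Returns 1 if both loop forms give the same checksum for this buffer. */
int checksums_agree(int *data)
{
    return checksum_up(data) == checksum_down(data);
}
```

Only the generated code differs between the two; the source is equally portable either way.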
If you mean that you would have to perform this kind of manual optimisation for each platform and/or compiler you target then I congratulate you on being able to design hardware that is only just powerful enough to work with optimal code every time.
If 'every clock cycle counts' then you have to use assembly. 'C' will always produce code that is slower and larger - the problem is that you cannot predict by exactly how much. If this matters, don't use 'C'.
Say, Jack, are you going to read the manual for a change
I don't use ARM.