Hello, I was browsing through older posts that deal with the painful issue of portability (http://www.keil.com/forum/docs/thread8109.asp). I was (and still am) a big advocate of programming as much as possible conforming to the C standard, and having a layered structure that allowed "plugging-in" other hardware. But I have come to change my mind recently. I am reading the "ARM System Developer's Guide" (excellent book, by the way; I'm reading it because I want to port some C167 code to an ARM9 environment), in which chapter 5 discusses writing efficient C code for an ARM. The point is, and it is fairly well demonstrated, that even common, innocent-looking C code can be either efficient or very inefficient on an ARM depending on specific choices made, let alone on another processor! So, if we are talking about squeezing every clock cycle out of a microcontroller, I do not believe that portability is possible without ultimately littering the code!
I was (and still am) a big advocate of programming as much as possible conforming to the C standard, and having a layered structure that allowed "plugging-in" other hardware. But I have come to change my mind recently.
I'm not sure from this whether you've made up your mind about changing your mind...
The point is, and it is fairly well demonstrated, that even common, innocent-looking C code can be either efficient or very inefficient on an ARM
I'm interested - if the example is reasonably concise could you post it?
So, if we are talking about squeezing every clock cycle out of a microcontroller - I do not believe that portability without ultimately littering the code is possible!
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option. If neither of these is feasible then accept that the design is inadequate, code the necessary bits in assembly then start work on the MK2 replacement system as quickly as possible.
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option.
Rephrase: "then changing processor or increasing clock speed would seem a costlier option". The OP has not revealed the (planned) production volume of his thingy, but if it is significant then ...
If neither of these is feasible then accept that the design is inadequate
Inadequate??? If it is doable, working and there are cost constraints, where do you get 'inadequate' from?
code the necessary bits in assembly
A reasonable approach, but if "processor pleasing C" will do, then why not use that?
then start work on the MK2 replacement system as quickly as possible.
and go bankrupt because your product is more expensive than the competitors'.
If the world rotated about "standard C" and "programmers convenience" instead of business realities then ...
Please go to the nearest pharmacy and buy a dose of reality
Erik
Ok Jack, here you go:
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}
This compiles to
checksum_v5
        MOV  r2,r0            ; r2 = data
        MOV  r0,#0            ; sum = 0
        MOV  r1,#0            ; i = 0
checksum_v5_loop
        LDR  r3,[r2],#4       ; r3 = *(data++)
        ADD  r1,r1,#1         ; i++
        CMP  r1,#0x40         ; compare i, 64
        ADD  r0,r3,r0         ; sum += r3
        BCC  checksum_v5_loop ; if (i<64) goto loop
        MOV  pc,r14           ; return sum
It takes three instructions to implement the for loop structure:
* An ADD to increment i
* A compare to check if i is less than 64
* A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
* A subtract to decrement the loop counter, which also sets the condition code flags on the result
* A conditional branch instruction
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit.
Now, an improved version is this:
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
checksum_v6
        MOV  r2,r0            ; r2 = data
        MOV  r0,#0            ; sum = 0
        MOV  r1,#0x40         ; i = 64
checksum_v6_loop
        LDR  r3,[r2],#4       ; r3 = *(data++)
        SUBS r1,r1,#1         ; i-- and set flags
        ADD  r0,r3,r0         ; sum += r3
        BNE  checksum_v6_loop ; if (i!=0) goto loop
        MOV  pc,r14           ; return sum
Say, Jack, are you going to read the manual for a change :-) :-) ;-)
Loop unrolling?
Yes Per, another excellent example, but I think that the example above is more powerful as it depends on the actual instruction set of the processor.
Just about all processors prefer loops that decrement to zero, since zero is "magic".
In this case it takes a decrement and a conditional branch. A lot of processors have DJNZ instructions, where a hard-coded register is used to fit it all in a single instruction.
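Since loop unrolling came up: here is a minimal sketch (the function names and the four-way unroll factor are my own choices, not from the book) that combines the two ideas above, counting down to zero and unrolling the body so the SUBS/BNE overhead is paid once per four elements instead of once per element:

```c
#include <assert.h>

/* Hypothetical variant of the 64-word checksum: count down to zero
 * and unroll the body four times, so the loop overhead (one subtract
 * plus one conditional branch) runs 16 times instead of 64. */
int checksum_unrolled(const int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64 / 4; i != 0; i--) {  /* 16 iterations, counting down */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    return sum;
}

/* Tiny self-check: sum the values 0..63 (expected total 2016). */
int checksum_demo(void)
{
    int buf[64];
    unsigned int i;
    for (i = 0; i < 64; i++)
        buf[i] = (int)i;
    return checksum_unrolled(buf);
}
```

The trade-off, of course, is code size: the unrolled loop is roughly four times larger, which matters on flash-constrained parts.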
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option
So efficient programming does not count in your school? Knowing your tool and hardware, as you so often preach, is the key!
Given that the loop counter is not used in or after the body of the loop the compiler is, I believe, well within its rights under the 'as if' rule to rearrange the loop to decrement rather than increment. I guess it's a quality of implementation issue.
If you decide to code loops like this to decrement rather than increment the resulting 'C' is no less portable, so I'm not entirely sure what your point is.
If you mean that you would have to perform this kind of manual optimisation for each platform and/or compiler you target then I congratulate you on being able to design hardware that is only just powerful enough to work with optimal code every time.
If 'every clock cycle counts' then you have to use assembly. 'C' will always produce code that is slower and larger - the problem is that you cannot predict by exactly how much. If this matters, don't use 'C'.
Say, Jack, are you going to read the manual for a change
I don't use ARM.
Efficient programming is important. But when "every single cycle counts" you run a risk that the development cost will explode.
You may also fail to get the product to market early.
And you do not have any safety margin in case you get a late "gotcha": the requirements suddenly get updated, you find a little processor erratum, or a compliance test shows a failure to fulfil some regulations.
There is an old saying that the last 5% of the application can take 95% of the development time. The difference between writing a couple of critical loops in optimized assembler and of having to consider almost all of the application as timing critical represents a huge cost difference.
Another thing is the cost to develop the next generation product. With a bit of safety margin, you may be able to use the same hardware platform. On the other hand - with most of the software written for maintainability instead of maximum optimization, you may do quite drastic changes to the hardware and still be able to reuse a large amount of code.
The ability to select the optimum chip is very important to the final price of the product. Too much optimized code may mean that a significant amount of code has to be thrown away and recreated.
We have one product where a processor with 8kB flash and 1kB RAM is used and everything is written in assembler. We have another product with 256kB flash and 32kB RAM, and just about everything is written in C. The cost of the two processors is almost the same. The big difference: the larger processor was selected 24 months later and could not use a single line of code from the other product. A number of lines of code from the newer product have migrated to even more products, since the C code is more portable.
The cost of a product is not directly related to the clock frequency or the number of kB of flash/RAM, so portability, maintainability and development time must be taken into consideration.
One nice thing with a not too heavily optimized C program is that a project may start with two alternative processors. A brand new, very inexpensive chip with a high risk factor because of the possibility of delivery delays. And an older chip with a higher cost but similar functionality. The project can then strive for the new and dirt-cheap chip, but still have a backup plan where most of the code will be usable on the older chip in case that is the only way to get a product on the market within the required time frame.
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option.
"then changing processor or increasing clock speed would seem a costlier option"
Not necessarily. Prices do change with time and popularity of components.
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option. If neither of these is feasible then accept that the design is inadequate
Inadequate??? If it is doable, working and there are cost constraints, where do you get 'inadequate' from?
Well, the system clearly hasn't been designed so that it can be programmed in 'C', but you're trying to program it in 'C'. I'd describe that as an inadequate design. Perhaps you'd prefer something like 'not up to the job'? Also, code that is written in a 'do-able' situation is likely to be an unmaintainable, non-portable mess. This in turn means increased development and maintenance costs, increased chance of unnoticed or 'marginal' bugs, and of course longer time to market.
If every last clock cycle counts then changing processor or increasing clock speed would seem a better option. If neither of these is feasible then accept that the design is inadequate, code the necessary bits in assembly
A reasonable approach, but if "processor pleasing C" will do, then why not use that?
Because you can't guarantee that a 'C' construct will compile to the same object code each time. Any change in compiler version or optimisation level may affect things, and as you will no doubt be using the highest optimisation level given that 'every clock cycle counts' any change to an unrelated area of code may alter timing in the critical parts.
then start work on the MK2 replacement system as quickly as possible.
and go bankrupt because your product is more expensive than the competitors'.
By your logic you will have no competitors, as your product will be cheapest therefore they will be bankrupt.
Business realities dictate that one should think about the total cost over the product lifecycle. This includes such things as future development, time to market, maintenance and upgrade of existing systems and so on. If you've thrown together some nightmarish hodge podge of hand 'optimised' code on some hardware running at the bleeding edge you've had it.
Now I understand where you're going wrong.
I do code with an eye to efficiency but try to avoid getting into situations where micro-optimisations are necessary. Knowing your tools and hardware are essential if you want to write reliable, maintainable and portable code. Knowing your tools and hardware are essential if you want to properly design and implement a system rather than papering over the cracks with software.
Jack: This forum is not threaded, even if it for some reason allows people to place answers in the middle.
Please avoid that attempt and instead post at the bottom - since you use a lot of quotes, it really doesn't matter if there will be a number of posts in between your post and the one you are responding to.
Per, Thanks for your insight. By the way, I have another one for Jack:
Did you know that an ARM does not have divide instructions in hardware? If you try to port code that relies heavily on division to an ARM (without modifications, such as converting divides into multiplies), you are destined to be forced into many calls to the compiler's C library. And that is going to hurt, no?
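To make the cost concrete, here is a small sketch (function names are mine). On a core without a hardware divider, dividing by a run-time value forces a call into the compiler's run-time library (on ARM EABI toolchains the helper is typically named something like __aeabi_uidiv, though the exact name depends on the toolchain), while dividing by a power-of-two constant compiles down to a single shift:

```c
#include <assert.h>

/* Dividing by a variable: on cores without hardware divide the
 * compiler must emit a call to a run-time library helper. */
unsigned int div_by_variable(unsigned int x, unsigned int d)
{
    return x / d;   /* library call on divide-less cores */
}

/* Dividing by a power-of-two constant: one shift, no library call. */
unsigned int div_by_16(unsigned int x)
{
    return x >> 4;  /* same result as x / 16 for unsigned x */
}
```

Both functions are portable C; only the generated code differs per target, which is exactly the point being argued here.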
I understand you. But please also do refer to my comments above regarding divisions in the ARM core.
I was "lucky" enough to grow up in an environment where no processors had multiply or divide instructions, or where the instructions did exist but consumed about the same number of clock cycles as doing the job with add and shift instructions.
Because of this, I very often design my code based on power-of-two multiplications/divisions, where a bitwise AND can be used instead of modulo and shifts can be used instead of multiplications/divisions. Most of the time, this can be done without significant loss of readability or maintainability.
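A small sketch of that power-of-two habit (function names are mine): with a scale factor of 8, multiply, divide and modulo all reduce to single-instruction shifts and masks on virtually any processor:

```c
#include <assert.h>

/* Power-of-two arithmetic: these compile to one shift or AND
 * instruction even on cores with no multiply/divide hardware. */
unsigned int times8(unsigned int x)
{
    return x << 3;  /* same result as x * 8 */
}

unsigned int div8(unsigned int x)
{
    return x >> 3;  /* same result as x / 8 for unsigned x */
}

unsigned int mod8(unsigned int x)
{
    return x & 7u;  /* same result as x % 8 for unsigned x */
}
```

Note the unsigned types: for signed negative values, shift and divide round differently, so this substitution is only a drop-in for unsigned arithmetic.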
More often than not, I see memory and not speed as the primary limiting factor. Most managers I have worked with are unable to say no to new features/requests/wishes/suggestions/dreams, which results in products that will get new features one by one until the code memory is full. Only "hard" limits seem to get the managers to decide that a new feature can wait until the next generation product. To keep down the production and supply chain costs, most managers prefer "one-size-fits-all" Swiss army knife jack-of-all-trades products where all these features should always be available in the same application, just in case a customer calls and wonders if he/she can use the product for <insert strange request here>.
The more features you put into a product, the more important it will be to focus on KISS, and make sure that the code is testable and maintainable. A tested and true hand-optimized assembler routine may be wonderful, but if the customer requires that the function should be slightly different, then that "tested" didn't mean anything anymore, and the "hand-optimized" part will bite you since it is based on assumptions that are no longer true.
Hunting clock cycles (or bytes) is a good way to extend the life of an existing product but it is seldom a good idea for a new product, unless it is a 100k+ product with a trivial application that is only likely to be sent out in one (1) revision and then never changed.
So, isn't division instructions important? Of course they are - but if I know that I am going to implement a high-speed signal-processing pipeline, I will probably spend quite a lot of time considering what processor to use, instead of just picking one and then start to fight with the clock cycles.
Because you can't guarantee that a 'C' construct will compile to the same object code each time.
OH???? I would like you to justify your statement regarding the following pseudo code:
ring_buffer[30]

    ring_index++;
    ring_index %= 30;

as opposed to ring_buffer[32]

    ring_index++;
    ring_index &= 0x1F;
I have no doubt which one is the most "processor pleasing C", and no optimization will change which is the most efficient.
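For completeness, here is a compilable sketch of the two ring-buffer variants (function names are mine; note that the wrap mask for a 32-entry buffer is 0x1F, i.e. 31, since the mask must be size minus one):

```c
#include <assert.h>

#define SIZE_MOD  30u  /* 30-entry buffer: needs modulo or compare */
#define SIZE_MASK 32u  /* 32-entry buffer: power of two, mask works */

/* 30-entry buffer: wrapping requires a modulo (library call or
 * slow divide on many small cores) or a compare-and-reset. */
unsigned int advance_mod(unsigned int i)
{
    return (i + 1) % SIZE_MOD;
}

/* 32-entry buffer: wrapping is a single AND with 0x1F. */
unsigned int advance_mask(unsigned int i)
{
    return (i + 1) & (SIZE_MASK - 1u);
}
```

Both are standard, portable C; the "processor pleasing" choice is simply to size the buffer as a power of two so the cheap form is available everywhere.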