Hi, I'm looking for an optimization of the following code:
void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2) {
        dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF;
        dataOut[i+1] = (dataIn[i]   >> 4) & 0x03FF;
    }
}
It's just swapping two 16-bit words, shifting them right by 4 and clearing the upper 6 bits. I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm , but it gets slower with a decrementing counter.
It takes about 50 ms (55 ms with a decrementing counter) for a length of 350000. Target: AT91SAM9260, executed from external RAM.
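For reference, one direction worth trying is folding each pair of 16-bit accesses into a single 32-bit access, since external-RAM bandwidth is likely the bottleneck. This is an untested sketch assuming a little-endian target (the AT91SAM9260 normally runs little-endian) and an even length; prepareData32 is a hypothetical name. The 4-byte memcpy calls compile down to single word loads/stores:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time variant (untested sketch): on a
 * little-endian target, each pair of 16-bit halfwords is moved with one
 * 32-bit load and one 32-bit store, halving the number of external-RAM
 * accesses. Assumes length is even. */
void prepareData32(uint16_t* dataOut, const uint16_t* dataIn, uint32_t length)
{
    uint32_t i;
    for (i = 0; i < length; i += 2) {
        uint32_t w, out;
        memcpy(&w, &dataIn[i], sizeof w);      /* low half = dataIn[i], high half = dataIn[i+1] */
        out = ((w >> 20) & 0x03FFu)            /* dataOut[i]   = (dataIn[i+1] >> 4) & 0x03FF */
            | ((w << 12) & 0x03FF0000u);       /* dataOut[i+1] = (dataIn[i]   >> 4) & 0x03FF */
        memcpy(&dataOut[i], &out, sizeof out);
    }
}
```

Whether this actually beats the compiler's output for the halfword version is something only a measurement on the target can tell.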
Absolutely. I think 99% of the people on this forum (me included...) and in the industry lack that knowledge...!
I tried beating the compiler once, and failed (compiler generated code was about 5% faster).
I did have some success in very specific cases (where the aforementioned LDM/STM instructions can be used, but the compiler does not), achieving code that is about 10%-20% faster. But all that code does is loading blocks of data, performing simple operations on it, and store it.
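The block-processing pattern described above can be encouraged from C by manually unrolling the loop, so the compiler has several independent loads and stores it may combine into LDM/STM. A sketch under assumptions: the function name and the (x >> 4) & 0x3FF operation are purely illustrative, and whether the compiler actually emits LDM/STM has to be checked in the listing:

```c
#include <stdint.h>

/* Illustrative block-processing loop: four loads, four simple operations,
 * four stores per iteration, giving the compiler a chance to merge the
 * memory accesses into LDM/STM multiple-register transfers. */
void processBlocks(uint32_t* dst, const uint32_t* src, uint32_t words)
{
    uint32_t i;
    for (i = 0; i + 4 <= words; i += 4) {
        uint32_t a = src[i + 0];
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        uint32_t d = src[i + 3];
        dst[i + 0] = (a >> 4) & 0x03FFu;
        dst[i + 1] = (b >> 4) & 0x03FFu;
        dst[i + 2] = (c >> 4) & 0x03FFu;
        dst[i + 3] = (d >> 4) & 0x03FFu;
    }
    for (; i < words; i++)                  /* tail for word counts not divisible by 4 */
        dst[i] = (src[i] >> 4) & 0x03FFu;
}
```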
Anything that reduces code size will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips. Flash is notoriously slow; it only got its name because it erases faster than ordinary EEPROM.
I have also tested my luck beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution. It takes too much time checking all possible combinations.
"Anything that reduces code size will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips."
Ah, yes. That's one detail I forgot - my code was running out of RAM, not out of flash. So, no waitstates at all.
"I have also tested my luck beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution."
Pipelining. Don't forget pipelining. Keeping all the implications of a large pipeline in mind quickly becomes fairly mind-boggling.
"Pipelining. Don't forget pipelining...."
Things are so much easier on 8051s. It's quite easy to beat the compiler there - even the latest Keil C51 offering.
I loved 8086 assembler until the Pentium got a zero-clock FXCH instruction to swap two registers in the FP stack. Before that, I could walk all over the x86 compilers. After, it took me a day to match what Watcom C did almost instantly, implementing basic arithmetic for vectors and matrices with n=4. And I needed to write a helper application just to visualise the contents of the FP stack as the FXCH instruction wildly moved the data around to always have the optimal value on the stack top.
Today, a PC processor doesn't have just one single instruction that can run concurrently - almost all instructions can be, and are, run concurrently. We have to fight so hard with the ordering of instructions (both to keep the concurrent pipelines busy and to account for guesstimated cache-line response times) that we end up with code we can't proof-read for correctness. All we can hope is that our regression tests will catch all the possible corner cases. In the end, I do my best to avoid assembler for 32-bit processors and up. And I'm also desperately holding on to compiler versions I've already used, to reduce the probability of bugs in the compiler, or bugs triggered by changes in the code generation.