
Optimizing specific code

Hi,
I'm looking for a way to optimize the following code:

void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
        uint32_t i;
        for (i = 0; i < length; i += 2)
        {
                dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
                dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        }
}


It just swaps each pair of 16-bit words, shifts them right by 4, and clears the upper 6 bits.
I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm , but the version with a decrementing counter is actually slower.

It's taking about 50ms (55ms with decrementing counter) for a length of 350000.
Target: AT91SAM9260, executed from external RAM.

  • a few tips:
    * make all function parameters 32 bits wide to spare casts.
    * use a moving pointer rather than dereferencing your array explicitly.
    * use loop unrolling.
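    In rough C, those tips combined might look like this (a sketch, not from the thread; it assumes `length` is a multiple of 4):

    ```c
    #include <stdint.h>

    /* Moving pointers instead of array indexing, with the loop body
     * unrolled 2x (4 samples per pass). Assumes length % 4 == 0. */
    void prepareData(uint16_t *dataOut, const uint16_t *dataIn, uint32_t length)
    {
            const uint16_t *end = dataIn + length;
            while (dataIn < end)
            {
                    dataOut[0] = (dataIn[1] >> 4) & 0x03FF;
                    dataOut[1] = (dataIn[0] >> 4) & 0x03FF;
                    dataOut[2] = (dataIn[3] >> 4) & 0x03FF;
                    dataOut[3] = (dataIn[2] >> 4) & 0x03FF;
                    dataIn  += 4;
                    dataOut += 4;
            }
    }
    ```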

  • Could you post the assembly code your compiler produces for this function?

  • It was compiled at the maximum optimization level.

        36: {
        37:         uint32_t i;
    0x2200B2F4  E92D0030  STMDB     R13!,{R4-R5}
        38:         for (i = 0; i < length; i += 2)
        39:         {
        40:                 dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
        41:                 dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        42:         }
    0x2200B2F8  E2825001  ADD       R5,R2,#0x00000001
    0x2200B2FC  E3550001  CMP       R5,#0x00000001
        43: }
        44:
    0x2200B300  98BD0030  LDMLSIA   R13!,{R4-R5}
    0x2200B304  912FFF1E  BXLS      R14
    0x2200B308  E2413002  SUB       R3,R1,#0x00000002
    0x2200B30C  E240C004  SUB       R12,R0,#0x00000004
    0x2200B310  E1A02F02  MOV       R2,R2,LSL #30
    0x2200B314  E2922101  ADDS      R2,R2,#0x40000000
    0x2200B318  E2411004  SUB       R1,R1,#0x00000004
    0x2200B31C  E2400002  SUB       R0,R0,#0x00000002
        40:                 dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
    0x2200B320  E3E04B03  MVN       R4,#0x00000C00
    0x2200B324  5A000005  BPL       0x2200B340
    0x2200B328  E1F320B4  LDRH      R2,[R3,#0x04]!
    0x2200B32C  E0042222  AND       R2,R4,R2,LSR #4
    0x2200B330  E1EC20B4  STRH      R2,[R12,#0x04]!
        41:                 dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        42:         }
    0x2200B334  E1F120B4  LDRH      R2,[R1,#0x04]!
    0x2200B338  E0042222  AND       R2,R4,R2,LSR #4
    0x2200B33C  E1E020B4  STRH      R2,[R0,#0x04]!
        38:         for (i = 0; i < length; i += 2)
        39:         {
    0x2200B340  E1B02125  MOVS      R2,R5,LSR #2
    0x2200B344  08BD0030  LDMEQIA   R13!,{R4-R5}
    0x2200B348  012FFF1E  BXEQ      R14
        40:                 dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
    0x2200B34C  E1D350B4  LDRH      R5,[R3,#0x04]
        41:                 dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        42:         }
    0x2200B350  E2522001  SUBS      R2,R2,#0x00000001
    0x2200B354  E0045225  AND       R5,R4,R5,LSR #4
    0x2200B358  E1CC50B4  STRH      R5,[R12,#0x04]
    0x2200B35C  E1D150B4  LDRH      R5,[R1,#0x04]
    0x2200B360  E0045225  AND       R5,R4,R5,LSR #4
    0x2200B364  E1C050B4  STRH      R5,[R0,#0x04]
    0x2200B368  E1F350B8  LDRH      R5,[R3,#0x08]!
    0x2200B36C  E0045225  AND       R5,R4,R5,LSR #4
    0x2200B370  E1EC50B8  STRH      R5,[R12,#0x08]!
    0x2200B374  E1F150B8  LDRH      R5,[R1,#0x08]!
    0x2200B378  E0045225  AND       R5,R4,R5,LSR #4
    0x2200B37C  E1E050B8  STRH      R5,[R0,#0x08]!
    0x2200B380  1AFFFFF1  BNE       0x2200B34C
        43: }
    

  • It's taking about 50ms (55ms with decrementing counter) for a length of 350000.
    Target: AT91SAM9260, executed from external RAM.

    50 ms / 350000 ≈ 143 ns per 16-bit sample. At a best-case CPU speed of 5 ns per instruction, that budget would fit about 28 instructions per sample. The disassembly shows far fewer instructions than that per sample, which suggests that memory bandwidth is the bottleneck.
    I don't think the compiler is sophisticated enough to optimize memory bandwidth when generating code. Perhaps, hand-crafted assembly is the answer...

  • A good test is to run an unrolled loop of maybe four iterations and see how that affects the timing.

  • If input and output buffers are 32-bit aligned, it should be possible to move data to/from memory in 32-bit chunks:

    void prepareData(uint32_t* dataOut, uint32_t* dataIn, uint32_t length)
    {
            uint32_t i;
            for (i = 0; i < length; i += 2)
            {
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
            }
    }
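    One way to check that the 32-bit variant matches the original 16-bit routine (this harness is not from the thread; note that it relies on the little-endian layout of the ARM target, where the two 16-bit samples sit in the low and high halves of each 32-bit word):

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Original 16-bit reference from the start of the thread. */
    void prepareData16(uint16_t *dataOut, const uint16_t *dataIn, uint32_t length)
    {
            uint32_t i;
            for (i = 0; i < length; i += 2)
            {
                    dataOut[i]     = (dataIn[i + 1] >> 4) & 0x03FF;
                    dataOut[i + 1] = (dataIn[i]     >> 4) & 0x03FF;
            }
    }

    /* 32-bit variant from above; length is still counted in 16-bit samples. */
    void prepareData32(uint32_t *dataOut, const uint32_t *dataIn, uint32_t length)
    {
            uint32_t i;
            for (i = 0; i < length; i += 2)
            {
                    *dataOut++ = ((*dataIn >> 20) & 0x000003FF)
                               | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
            }
    }

    /* Returns 1 if both versions produce identical output for the given
     * samples (little-endian hosts only; length <= 8 for these buffers). */
    int sameResult(const uint16_t *in, uint32_t length)
    {
            uint16_t ref[8], out[8];
            uint32_t in32[4], out32[4];

            memcpy(in32, in, length * sizeof(uint16_t));
            prepareData16(ref, in, length);
            prepareData32(out32, in32, length);
            memcpy(out, out32, length * sizeof(uint16_t));
            return memcmp(ref, out, length * sizeof(uint16_t)) == 0;
    }
    ```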
    

  • I tried the following implementations:

    uint32_t i;
    length = length >> 1;
    for (i = length; i != 0; i -= 2)
    {
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
    }
    

    uint32_t i;
    length = length >> 2;
    for (i = 0; i < length; i++)
    {
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
    }
    
    


    Info: length is the number of 16-bit words to process.

    Both of them took about 28 ms. Without unrolling, it's 50 ms; with more iterations inside the loop body, the time stays constant.

  • I don't think the compiler is sophisticated enough to optimize memory bandwidth when generating code. Perhaps, hand-crafted assembly is the answer...
    How can I achieve optimized memory access with assembler in general? I'm not really familiar with that.

  • How can I achieve optimized memory access with assembler in general? I'm not really familiar with that.

    I haven't done that either.
    One technique would be to use 'load/store multiple' instructions. If the CPU has a data cache, you should try minimizing cache misses too.

  • One technique would be to use 'load/store multiple' instructions.

    On an ARM7 chip, yes. I believe that there's no difference in cycle count on an ARM9 chip, at least not from the architecture. The LDM/STM instructions mainly exist there to save space.

    In general, such optimizations require profound knowledge of the CPU's instruction cycle timings and of the behavior of the memory system.

  • In general, such optimizations require profound knowledge of the CPU's instruction cycle timings and of the behavior of the memory system.

    absolutely. I think 99% of the people on this forum (me included...) and in the industry lack that knowledge...!

  • absolutely. I think 99% of the people on this forum (me included...) and in the industry lack that knowledge...!

    I tried beating the compiler once, and failed (compiler generated code was about 5% faster).

    I did have some success in very specific cases (where the aforementioned LDM/STM instructions can be used, but the compiler does not), achieving code that is about 10%-20% faster. But all that code does is loading blocks of data, performing simple operations on it, and store it.
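    For that load/process/store pattern, one way to give the compiler a chance to emit LDM/STM bursts from plain C (a hedged sketch, not from the thread; whether multi-word transfers actually appear depends on the compiler and optimization level) is to stage several words in locals each iteration:

    ```c
    #include <stdint.h>

    /* Process four 32-bit words (eight 16-bit samples) per pass. Loading
     * them into consecutive locals before storing gives the compiler the
     * opportunity to combine the accesses into LDM/STM bursts.
     * Assumes length (in 16-bit samples) is a multiple of 8. */
    void prepareDataBlock(uint32_t *dataOut, const uint32_t *dataIn, uint32_t length)
    {
            uint32_t n = length / 8;
            while (n--)
            {
                    uint32_t a = dataIn[0], b = dataIn[1];
                    uint32_t c = dataIn[2], d = dataIn[3];
                    dataOut[0] = ((a >> 20) & 0x000003FF) | ((a << 12) & 0x03FF0000);
                    dataOut[1] = ((b >> 20) & 0x000003FF) | ((b << 12) & 0x03FF0000);
                    dataOut[2] = ((c >> 20) & 0x000003FF) | ((c << 12) & 0x03FF0000);
                    dataOut[3] = ((d >> 20) & 0x000003FF) | ((d << 12) & 0x03FF0000);
                    dataIn  += 4;
                    dataOut += 4;
            }
    }
    ```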

  • Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips. Flash is notoriously slow; it only got its name because it erases faster than normal EEPROM.

    I have also tested my luck beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution. It takes too much time checking all possible combinations.

  • Anything that reduces the number of code bytes will improve the performance of the flash memory accelerator used in some of the better ARM7 chips, and probably (?) in all Cortex chips.

    Ah, yes. That's one detail I forgot - my code was running from RAM, not from flash. So, no wait states at all.

    I have also tested my luck beating the compiler, with very limited success. The amount of time needed to win over a compiler quickly grows with the amount of caching or out-of-order execution.

    Pipelining. Don't forget pipelining. Keeping all the implications of a large pipeline in mind quickly becomes fairly mind-boggling.