Optimizing specific code

Hi,
I'm looking for a way to optimize the following code:

void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
        uint32_t i;
        for (i = 0; i < length; i += 2)
        {
                dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
                dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        }
}


It just swaps each pair of 16-bit words, shifts them right by 4 and sets the upper 6 bits to 0.
I already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm . But it gets slower with a decrementing counter.

It's taking about 50ms (55ms with decrementing counter) for a length of 350000.
Target: AT91SAM9260, executed from external RAM.
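
To make the transformation concrete, here is a minimal sketch of the routine above with one sample pair worked through in the comments. The name `prepareDataRef` and the sample values are only illustrative, and the loop bound `i + 1 < length` is an added guard (not in the post) so that an odd `length` cannot read past the end of the buffer.

```c
#include <stdint.h>

/* Reference version of the routine from the post (renamed for illustration).
 * Each pair of 16-bit words is swapped, shifted right by 4 and masked to
 * 10 bits. */
void prepareDataRef(uint16_t *dataOut, const uint16_t *dataIn, uint32_t length)
{
    uint32_t i;

    /* i + 1 < length is an assumption added here: it skips a trailing
     * unpaired word instead of reading dataIn[length]. */
    for (i = 0; i + 1 < length; i += 2) {
        dataOut[i]     = (dataIn[i + 1] >> 4) & 0x03FF;
        dataOut[i + 1] = (dataIn[i]     >> 4) & 0x03FF;
    }
}

/* Worked example for the pair { 0xABCD, 0x1234 }:
 *   0x1234 >> 4 = 0x0123, & 0x03FF = 0x0123  -> dataOut[0]
 *   0xABCD >> 4 = 0x0ABC, & 0x03FF = 0x02BC  -> dataOut[1]
 */
```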

  • If input and output buffers are 32-bit aligned, it should be possible to move data to/from memory in 32-bit chunks:

    void prepareData(uint32_t* dataOut, uint32_t* dataIn, uint32_t length)
    {
            uint32_t i;
            for (i = 0; i < length; i += 2)
            {
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
            }
    }
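
The 32-bit version relies on two things: both buffers must be 4-byte aligned, and the two halfwords must sit in little-endian order inside the word (assumed here, as is usual for this target). A hedged sketch that checks the alignment before taking the fast path follows; `isAligned4`, `swapShiftMask` and `prepareData32` are hypothetical names, and `length` still counts 16-bit words.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if p lies on a 4-byte boundary. */
int isAligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0u;
}

/* Transform one 32-bit word holding two 16-bit samples (little-endian
 * layout assumed: sample 0 in bits 0..15, sample 1 in bits 16..31). */
uint32_t swapShiftMask(uint32_t w)
{
    return ((w >> 20) & 0x000003FFu)    /* sample 1 -> low half  */
         | ((w << 12) & 0x03FF0000u);   /* sample 0 -> high half */
}

/* Returns 0 on success, -1 if either buffer is not 4-byte aligned.
 * length is the number of 16-bit words, as in the original routine. */
int prepareData32(void *dataOut, const void *dataIn, uint32_t length)
{
    uint32_t       *out   = (uint32_t *)dataOut;
    const uint32_t *in    = (const uint32_t *)dataIn;
    uint32_t        words = length / 2;
    uint32_t        i;

    if (!isAligned4(dataOut) || !isAligned4(dataIn))
        return -1;

    for (i = 0; i < words; i++)
        out[i] = swapShiftMask(in[i]);

    return 0;
}
```

A caller could fall back to the 16-bit loop when this returns -1.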
    

  • I tried the following implementations:

    uint32_t i;
    length = length >> 1;
    for (i = length; i != 0; i -= 2)
    {
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
    }
    

    uint32_t i;
    length = length >> 2;
    for (i = 0; i < length; i ++)
    {
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
            *dataOut = ((*dataIn >> 20) & 0x000003FF)
                     | ((*dataIn << 12) & 0x03FF0000);
            dataIn++;
            dataOut++;
    
    }
    
    


    Info: length is the number of 16-bit words to process.

    Both of them took about 28 ms. Without unrolling the loop it's 50 ms. Unrolling more iterations inside the loop does not help further; the time stays constant.
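
Both unrolled variants assume that the number of 32-bit words is a multiple of the unroll factor. In the first one, with `i != 0; i -= 2`, an odd word count would step past zero and the loop would keep running. A minimal sketch of a two-way unrolled loop with a tail step is shown below; the name `prepareDataUnrolled` is hypothetical, and aligned, little-endian `uint32_t` buffers are assumed.

```c
#include <stdint.h>

/* Two-way unrolled variant with a tail step for an odd number of
 * 32-bit words. length is the number of 16-bit words. */
void prepareDataUnrolled(uint32_t *dataOut, const uint32_t *dataIn,
                         uint32_t length)
{
    uint32_t words = length >> 1;   /* number of 32-bit words   */
    uint32_t pairs = words >> 1;    /* unrolled loop iterations */
    uint32_t w;

    while (pairs != 0) {
        w = *dataIn++;
        *dataOut++ = ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);

        w = *dataIn++;
        *dataOut++ = ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);

        pairs--;
    }

    /* Tail: one leftover 32-bit word when the word count is odd. */
    if (words & 1u) {
        w = *dataIn;
        *dataOut = ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);
    }
}
```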

  • Well, that's some success, 28 vs 50 ms.
    Would you mind posting the resulting assembly for one of these variants again, please?

    Do you have any specific performance requirement to meet, or is it just an "I want this to work as quickly as possible" thing?

  • This is definitely the right approach, Mike. Better still would be data aligned to 8 words (32 bytes), since that is the size of a cache line on the ARM926, assuming the data cache has been enabled.

    However, you can still shave off a few cycles inside the loop by parallelizing operations. Fortunately the task is rather well suited to this.

    void prepareDataMH(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
    {
        int32_t  i;
        uint32_t tmp;
        uint32_t *dataIn_pair  = (uint32_t *)dataIn;
        uint32_t *dataOut_pair = (uint32_t *)dataOut;
    
        for (i = (length/2)-1; i >= 0; i--)
        {
            tmp             = (dataIn_pair[i] >> 4) & 0x03FF03FF;
            dataOut_pair[i] = (tmp >> 16) | (tmp << 16);
        }
    }
    

    Regards
    Marcus
    http://www.doulos.com/arm/
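
The single-mask version works because the mask 0x03FF03FF clears the bits of the upper halfword that spill into the lower half after the 32-bit shift. As a quick sanity check, the sketch below compares it with the two-mask form over a sweep of sample words; the function names and the sweep constant are illustrative choices, not part of the posts.

```c
#include <stdint.h>

/* Two-mask form: each half is shifted and masked separately. */
uint32_t twoMaskForm(uint32_t w)
{
    return ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);
}

/* Parallel form: shift and mask both halves at once, then swap them. */
uint32_t parallelForm(uint32_t w)
{
    uint32_t tmp = (w >> 4) & 0x03FF03FFu;
    return (tmp >> 16) | (tmp << 16);
}

/* Counts the words on which the two forms disagree. The odd step value
 * is arbitrary; it just spreads the samples over the 32-bit range. */
uint32_t countMismatches(void)
{
    uint32_t w = 0, bad = 0, n;

    for (n = 0; n < 1000000u; n++) {
        if (twoMaskForm(w) != parallelForm(w))
            bad++;
        w += 0x9E3779B1u;
    }
    return bad;
}
```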

  • Loop code:

            uint32_t i;
            length = length >> 1;
            for (i = length; i != 0; i -= 2)
            {
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
    
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
    
            }
    


    Disassembly [code inside the loop]:

    0x200204AC  E3E0CB03  MVN       R12,#0x00000C00
    0x200204B0  E59F40A4  LDR       R4,[PC,#0x00A4]
    0x200204B4  E5913000  LDR       R3,[R1]
    0x200204B8  E2522002  SUBS      R2,R2,#0x00000002
    0x200204BC  E00C5A23  AND       R5,R12,R3,LSR #20
    0x200204C0  E0043603  AND       R3,R4,R3,LSL #12
    0x200204C4  E1833005  ORR       R3,R3,R5
    0x200204C8  E5803000  STR       R3,[R0]
    0x200204CC  E5B13004  LDR       R3,[R1,#0x0004]!
    0x200204D0  E2811004  ADD       R1,R1,#0x00000004
    0x200204D4  E00C5A23  AND       R5,R12,R3,LSR #20
    0x200204D8  E0043603  AND       R3,R4,R3,LSL #12
    0x200204DC  E1833005  ORR       R3,R3,R5
    0x200204E0  E5A03004  STR       R3,[R0,#0x0004]!
    0x200204E4  E2800004  ADD       R0,R0,#0x00000004
    0x200204E8  1AFFFFF1  BNE       0x200204B4
    

    I want it as fast as possible.
    The instruction cache is enabled (done in the startup file). The data cache is disabled (CP15 control register equals 0x00051078).
    I tried to enable it by setting the register to 0x0005107D (MMU and DCache enabled), but the processor then hangs.
    Is there a special procedure for enabling the data cache?

  • > I tried to enable it by setting the register to 0x0005107D
    > (MMU and DCache enabled), but the processor then hangs.
    > Is there a special procedure for enabling the data cache?

    Did you set up a page table at all? The MMU needs one to work properly. Don't forget to initialize cp15,c2 (TTB). RTFTRM ;-)

    Looking at the assembler output, I am not sure if the unrolled loop is better than the single-word parallel version that I posted.

    Regards
    Marcus
    http://www.doulos.com/arm/

    PS: -Otime seems to be detrimental to performance (RealView Compiler) of all variants that have been posted here.
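
For the page table mentioned above, a minimal sketch of a flat (identity-mapped) first-level table built from 1 MB section descriptors might look like this. It assumes the ARMv5 section descriptor layout used by the ARM926; the function name, the macro names and the choice of cached region (external RAM assumed at 0x20000000, 64 MB) are illustrative assumptions, so check them against the TRM and the board's memory map.

```c
#include <stdint.h>

/* ARMv5 first-level section descriptor fields (names are illustrative). */
#define SECTION_DESC   0x00000012u   /* bits[1:0] = 10 (section), bit 4 set */
#define SECTION_B      (1u << 2)     /* bufferable                          */
#define SECTION_C      (1u << 3)     /* cacheable                           */
#define SECTION_AP_RW  (3u << 10)    /* AP = 11: read/write access          */
#define NUM_SECTIONS   4096u         /* 4096 x 1 MB covers the 4 GB space   */

/* Fills a 4096-entry table with an identity mapping in domain 0.
 * Sections inside [cachedBase, cachedBase + cachedSize) are marked
 * cacheable and bufferable; everything else (peripherals, for example)
 * stays uncached. */
void buildFlatTable(uint32_t *table, uint32_t cachedBase, uint32_t cachedSize)
{
    uint32_t i;

    for (i = 0; i < NUM_SECTIONS; i++) {
        uint32_t addr = i << 20;
        uint32_t desc = addr | SECTION_AP_RW | SECTION_DESC;

        /* Unsigned wrap-around makes this a single range check. */
        if (addr - cachedBase < cachedSize)
            desc |= SECTION_C | SECTION_B;

        table[i] = desc;
    }
}

/* On the target (not shown here): the table must be 16 KB aligned, its
 * address goes into CP15 c2 (TTB), domain 0 must be enabled in CP15 c3,
 * and the caches and TLB should be invalidated before the M and C bits
 * are set in the CP15 control register. */
```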