Optimizing specific code

Hi,
I'm looking for a way to optimize the following code:

void prepareData(uint16_t* dataOut, uint16_t* dataIn, uint32_t length)
{
        uint32_t i;
        for (i = 0; i < length; i += 2)
        {
                dataOut[i] = (dataIn[i+1] >> 4) & 0x03FF;
                dataOut[i+1] = (dataIn[i] >> 4) & 0x03FF;
        }
}


It just swaps each pair of 16-bit words, shifts each one right by 4, and clears the upper 6 bits.
I have already tried the hints from http://www.keil.com/support/man/docs/armcc/armcc_cjajacch.htm, but the code gets slower with a decrementing counter.

It takes about 50 ms (55 ms with a decrementing counter) for a length of 350000.
Target: AT91SAM9260, executed from external RAM.

Parents
  • Loop code:

            uint32_t i;
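            length = length >> 1;   /* halfword count -> 32-bit word count */
            /* two words per pass: the word count must be even, or i never reaches 0 */
            for (i = length; i != 0; i -= 2)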
            {
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
    
                    *dataOut = ((*dataIn >> 20) & 0x000003FF)
                             | ((*dataIn << 12) & 0x03FF0000);
                    dataIn++;
                    dataOut++;
    
            }
    


    Disassembly [code inside the loop]:

    0x200204AC  E3E0CB03  MVN       R12,#0x00000C00
    0x200204B0  E59F40A4  LDR       R4,[PC,#0x00A4]
    0x200204B4  E5913000  LDR       R3,[R1]
    0x200204B8  E2522002  SUBS      R2,R2,#0x00000002
    0x200204BC  E00C5A23  AND       R5,R12,R3,LSR #20
    0x200204C0  E0043603  AND       R3,R4,R3,LSL #12
    0x200204C4  E1833005  ORR       R3,R3,R5
    0x200204C8  E5803000  STR       R3,[R0]
    0x200204CC  E5B13004  LDR       R3,[R1,#0x0004]!
    0x200204D0  E2811004  ADD       R1,R1,#PIOC_PDR(0x00000004)
    0x200204D4  E00C5A23  AND       R5,R12,R3,LSR #20
    0x200204D8  E0043603  AND       R3,R4,R3,LSL #12
    0x200204DC  E1833005  ORR       R3,R3,R5
    0x200204E0  E5A03004  STR       R3,[R0,#0x0004]!
    0x200204E4  E2800004  ADD       R0,R0,#PIOC_PDR(0x00000004)
    0x200204E8  1AFFFFF1  BNE       0x200204B4
    

    I want it as fast as possible.
    The instruction cache is enabled (done in the startup file). The data cache is disabled (CP15 control register equals 0x00051078).
    I tried to enable it by setting the register to 0x0005107D (MMU and DCache enabled), but the processor then hangs.
    Is there a special procedure for enabling the data cache?
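
To check that the 32-bit loop above computes the same result as the original halfword routine, a minimal host-side sketch might look like the following. The names prepare_data_ref, prepare_data_word and prepare_data_selfcheck are made up for illustration, and the code assumes a little-endian machine, 4-byte-aligned buffers and an even length. It handles one word per pass rather than two, so the word count does not have to be even.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: the halfword-at-a-time routine from the question. */
void prepare_data_ref(uint16_t *out, const uint16_t *in, uint32_t length)
{
    uint32_t i;
    for (i = 0; i + 1 < length; i += 2) {
        out[i]     = (in[i + 1] >> 4) & 0x03FF;
        out[i + 1] = (in[i]     >> 4) & 0x03FF;
    }
}

/* Word-parallel variant: one 32-bit load and store per halfword pair.
 * Assumptions (not checked by the thread): little-endian byte order,
 * 4-byte-aligned buffers, even length (given in halfwords). */
void prepare_data_word(uint32_t *out, const uint32_t *in, uint32_t length)
{
    uint32_t n = length >> 1;               /* number of 32-bit words */
    assert((length & 1u) == 0);
    assert(((uintptr_t)in & 3u) == 0 && ((uintptr_t)out & 3u) == 0);
    while (n-- != 0) {
        uint32_t w = *in++;
        /* upper halfword -> lower slot, lower halfword -> upper slot */
        *out++ = ((w >> 20) & 0x000003FFu) | ((w << 12) & 0x03FF0000u);
    }
}

/* Returns 0 when both variants agree on a pseudo-random buffer. */
int prepare_data_selfcheck(void)
{
    uint16_t in16[64], ref16[64], got16[64];
    uint32_t in32[32], out32[32];
    uint32_t seed = 12345u;
    uint32_t i;

    for (i = 0; i < 64; i++) {
        seed = seed * 1664525u + 1013904223u;   /* simple LCG test pattern */
        in16[i] = (uint16_t)(seed >> 16);
    }
    prepare_data_ref(ref16, in16, 64);

    memcpy(in32, in16, sizeof in16);            /* avoids pointer aliasing */
    prepare_data_word(out32, in32, 64);
    memcpy(got16, out32, sizeof got16);

    return memcmp(ref16, got16, sizeof ref16) != 0;
}
```

Calling prepare_data_selfcheck() on the development PC before moving the loop to the target is a cheap way to catch a wrong shift or mask; it says nothing about the speed on the AT91SAM9260.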

Children
  • > I tried to enable it by setting the register to 0x0005107D
    > (MMU and DCache enabled), but the processor then hangs.
    > Is there a special procedure for enabling the data cache?

    Did you set up a page table at all? The MMU needs one to work properly. Don't forget to initialize cp15,c2 (TTB). RTFTRM ;-)

    Looking at the assembler output, I am not sure if the unrolled loop is better than the single-word parallel version that I posted.

    Regards
    Marcus
    http://www.doulos.com/arm/

    PS: With the RealView compiler, -Otime seems to be detrimental to the performance of all variants that have been posted here.
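
On the page-table point above: a minimal sketch of a flat (virtual = physical) first-level table built from 1 MB sections, with only one region marked cacheable. The name build_flat_page_table and its parameters are hypothetical. The descriptor layout is the ARMv5 section format as I understand it and should be checked against the ARM926EJ-S TRM. The CP15 steps are given only as comments because the inline-assembler syntax depends on the toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1_ENTRIES   4096u         /* 4096 sections of 1 MB cover 4 GB      */
#define SECT_TYPE    0x00000012u   /* bits[1:0] = 10 (section), bit 4 set   */
#define SECT_AP_RW   (3u << 10)    /* AP = 11: read/write access            */
#define SECT_C       (1u << 3)     /* cacheable                             */
#define SECT_B       (1u << 2)     /* bufferable                            */

/* Fill a first-level table with an identity mapping (all in domain 0).
 * Only the sections in [cached_base, cached_base + cached_mb MB) get the
 * C and B bits; everything else, including peripherals, stays uncached.
 * ttb must point to 4096 words and be 16 KB aligned on the target. */
void build_flat_page_table(uint32_t *ttb, uint32_t cached_base,
                           uint32_t cached_mb)
{
    uint32_t first = cached_base >> 20;     /* index of first cached section */
    uint32_t i;

    assert(ttb != NULL);
    assert((cached_base & 0x000FFFFFu) == 0);   /* must be 1 MB aligned */

    for (i = 0; i < L1_ENTRIES; i++) {
        uint32_t desc = (i << 20) | SECT_AP_RW | SECT_TYPE;
        if (i >= first && i - first < cached_mb)
            desc |= SECT_C | SECT_B;
        ttb[i] = desc;
    }
}

/* Rough enable order on the target (assumed, verify against the TRM):
 *   1. MCR p15, 0, <table address>, c2, c0, 0   ; TTB
 *   2. MCR p15, 0, <0x00000001>,    c3, c0, 0   ; domain 0 = client
 *   3. MCR p15, 0, <0>,             c8, c7, 0   ; invalidate TLBs
 *   4. MCR p15, 0, <0>,             c7, c7, 0   ; invalidate I and D caches
 *   5. read c1, set bit 0 (M) and bit 2 (C), write c1 back
 */
```

For example, build_flat_page_table(table, 0x20000000, 64) would mark 64 MB starting at 0x20000000 as cacheable. That base matches the code addresses in the disassembly, but the 64 MB size is only a placeholder. Buffers shared with DMA should stay in uncached sections or be cleaned and invalidated explicitly.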