Cortex M3 ASM Taking Longer Than it Should

Note: This was originally posted on 14th December 2011 at http://forums.arm.com

I'm doing some digital filtering on Cypress' PSoC5, but my filtering subroutine is taking much longer than it should. It runs in the expected amount of time when I have only one copy of the ~30 command process, but when I duplicate the commands for the second channel of input data, the subroutine takes 3 times as long instead of 2 times. I've also timed it without the wait on the SAR status register, and it exhibits the same behavior. Any ideas on what could be causing this delay?

More details:
1. The subroutine loops and repeats both ~30 command filtering processes about 120 times, so pseudo code would look something like:

10 filter data 1 (~30 commands)
20 filter data 2 (~30 commands)
30 increment data pointers
40 return to 10 if pointer != terminal point
50 exit loop

2. When I begin cutting out commands successively, there are large jumps in execution time at certain points. For instance, removing 6 commands may only reduce execution time by 200 ns, but removing one more multiply brings the reduction to 300 ns; that extra 100 ns does not match the execution time of a single-cycle multiply. Also, this effect seems to be related to the number of commands, not necessarily to the number of cycles needed to execute them.

3. The longest branch is only 140 bytes, which is within the -252 byte range.

Thanks in advance!  The code follows, if you are so inclined.

filter_proc PROC
        push {lr}
adc0_filt
        ; =====================================================================
        ; ---------------  check for ADC0 result, retrieve  -------------------
        ; =====================================================================
        LDRH r0, [r9]                ; Load SAR0 status reg
        ANDS r0, r0, #0x00000100
        BEQ adc0_filt                ; poll again if result not ready
        LDRH r3, [r9, #0x2a0]        ; get SAR0 result
        LDMIA r10, {r1, r2, r4, r5, r8, r12}    ; load filter variables
        SUB r3, r3, #2048            ; shift adc result to center
        LSL r3, r3, #4
       
        ; -------------------  calculate EGM biquad  --------------------------
        MOV r6, #71                    ; b2 = b0 = 0.5*b1
        MUL r5, r5, r6                ; b2*xn-2
        MUL r0, r6, r3                ; b0*xn
        ADD r5, r0, r5                ; yn += b0*xn + b2*xn-2
        MUL r0, r6, r4                ; 0.5*b1*xn-1
        ADD r6, r5, r0, LSL #1        ; yn += b1*xn-1
        MOV r5, #28587                ; a2
        MUL r5, r5, r4                ; a2*yn-2
        SUB r6, r6, r12                ; yn -= a2*yn-2
        MOV r5, #61070                ; -a1 -(-61070)
        MUL r5, r5, r8                ; -a1*yn-1
        ADD r6, r6, r5                ; yn += -a1*yn-1
        ASR r6, r6, #16                ; yn = yn >> 16

        ; Load sine LUT values
        ADD r5, r11, #640            ; create shifted pointer for cos
        LDR r0, [r10, r11]            ; load sine LUT value
        LDR r1, [r10, r5]            ; cos(x) = sin(x+T/4)
       
        ; Consider changing sine-table to 16 or more bits per sample (why not?)
        ; -------------------  calculate NavX filter  -------------------------
        SUB r3, r3, r6                ; subtract DC from NavX signal
        ADD r7, r7, #24                ; update filt_data offset
       
        MUL r1, r1, r3                ; work0 = cos_lut[i] * (adc_samp - dc_offset)
        ADD r3, r3, r1, ASR #12        ; cos_sum = cos_sum + (work0 >> 12)
       
        MUL r0, r0, r3                ; work0 = sin_lut[i] * (adc_samp - dc_offset)
        ADD r2, r2, r0, ASR #12        ; sin_sum = sin_sum + (work0 >> 12)
       
        ; --------------------  store filter variables  -----------------------
        STMIA r10!, {r1, r2, r3, r4, r6, r8}; write back incremented address
       
adc1_filt
        ; =====================================================================
        ; ---------------  check for ADC1 result, retrieve  -------------------
        ; =====================================================================
        LDRH r3, [r9, #0x2a2]        ; get SAR1 result
        LDMIA r10, {r1, r2, r4, r5, r8, r12}    ; load filter variables
        SUB r3, r3, #2048            ; shift adc result to center
        LSL r3, r3, #4
       
        ; -------------------  calculate EGM biquad  --------------------------
        MOV r6, #71                    ; b2 = b0 = 0.5*b1
        MUL r5, r5, r6                ; b2*xn-2
        MUL r0, r6, r3                ; b0*xn
        ADD r5, r0, r5                ; yn += b0*xn + b2*xn-2
        MUL r0, r6, r4                ; 0.5*b1*xn-1
        ADD r6, r5, r0, LSL #1        ; yn += b1*xn-1
        MOV r5, #28587                ; a2
        MUL r5, r5, r4                ; a2*yn-2
        SUB r6, r6, r12                ; yn -= a2*yn-2
        MOV r5, #61070                ; -a1 -(-61070)
        MUL r5, r5, r8                ; -a1*yn-1
        ADD r6, r6, r5                ; yn += -a1*yn-1
        ASR r6, r6, #16                ; yn = yn >> 16
        MOV r6, r3

        ; Load sine LUT values
        ADD r5, r11, #640            ; create shifted pointer for cos
        LDR r0, [r10, r11]            ; load sine LUT value
        LDR r1, [r10, r5]            ; cos(x) = sin(x+T/4)
       
        ; Consider changing sine-table to 16 or more bits per sample (why not?)
        ; -------------------  calculate NavX filter  -------------------------
        SUB r3, r3, r6                ; subtract DC from NavX signal
        ADD r7, r7, #24                ; update filt_data offset
        ADD r11, r11, #4            ; update sine rom pointer
       
        MUL r1, r1, r3                ; work0 = cos_lut[i] * (adc_samp - dc_offset)
        ADD r3, r3, r1, ASR #12        ; cos_sum = cos_sum + (work0 >> 12)
       
        MUL r0, r0, r3                ; work0 = sin_lut[i] * (adc_samp - dc_offset)
        ADD r2, r2, r0, ASR #12        ; sin_sum = sin_sum + (work0 >> 12)
       
        ; --------------------  store filter variables  -----------------------
        STMIA r10!, {r1, r2, r3, r4, r6, r8}; write back incremented address
       
        ; ------------  check if all channels have been processed  ------------
        SUBS r0, r7, #480            ; have we processed all 20 channels?
        LDREQ r10, =filt_data        ; reset filt_data ptr
        MOVEQ r7, #0                ; reset filt_data offset
       
        ; -----------  check if we have reached termination point  ------------
        SUBS r0, r11, #960            ; have 120 samples been processed?
        BNE adc0_filt                ; if no, go to adc0_filt again
        MOV r11, #480                ; reset sine LUT index; r11 stays in the range
                                     ; 480 to 960, for optimization reasons
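
For cross-checking the arithmetic off-target, here is a minimal C sketch of the biquad as the comments in the listing describe it: coefficients scaled by 2^16, b0 = b2 = 71, b1 = 2*b0, -a1 = 61070, a2 = 28587, and the ADC sample centred with the SUB #2048 / LSL #4 pair. It follows the comments rather than the register-level code, so it is only a reference model and not the routine itself. The names `biquad_state`, `center_adc` and `biquad_step` are hypothetical, and the 64-bit accumulator is an assumption made to keep the model free of signed overflow (the assembly works in 32 bits).

```c
#include <stdint.h>

/* Coefficients as they appear in the posted assembly, scaled by 2^16. */
enum {
    B0 = 71,          /* b0 (same value as b2) */
    B1 = 2 * 71,      /* b1 = 2*b0, the "LSL #1" in the listing */
    B2 = 71,          /* b2 */
    NEG_A1 = 61070,   /* -a1 */
    A2 = 28587,       /* a2 */
    Q_SHIFT = 16      /* final "ASR #16" */
};

/* Hypothetical state layout; the original keeps these values in filt_data. */
typedef struct {
    int32_t x1, x2;   /* x[n-1], x[n-2] */
    int32_t y1, y2;   /* y[n-1], y[n-2] */
} biquad_state;

/* Centre a 12-bit ADC sample on zero and scale it by 16
 * (SUB #2048 followed by LSL #4 in the listing). */
static int32_t center_adc(uint16_t raw)
{
    return ((int32_t)raw - 2048) * 16;
}

/* One filter step:
 * y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) >> 16 */
static int32_t biquad_step(biquad_state *s, int32_t xn)
{
    /* 64-bit accumulator: an assumption of this model, not the original. */
    int64_t acc = (int64_t)B0 * xn
                + (int64_t)B1 * s->x1
                + (int64_t)B2 * s->x2
                + (int64_t)NEG_A1 * s->y1
                - (int64_t)A2 * s->y2;

    /* Assumes >> on a negative value is an arithmetic shift, like ASR. */
    int32_t yn = (int32_t)(acc >> Q_SHIFT);

    /* Shift the delay line. */
    s->x2 = s->x1;
    s->x1 = xn;
    s->y2 = s->y1;
    s->y1 = yn;
    return yn;
}
```

Feeding the same raw samples through `center_adc` and `biquad_step` on a PC and comparing against the values the routine stores would at least separate arithmetic problems from the timing problem I'm asking about.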