Process ADC data, moved by DMA, using CMSIS DSP: what's the right way?

Hi to you all,
I have firmware running on an NXP LPCLink2 board (LPC4370: 204 MHz Cortex-M4 MCU) which basically does this:

  • Fills the ADC FIFO at 40 Msps.
  • Copies the data into memory using the built-in DMA Controller and 2 linked buffers.
  • Processes one buffer while the other is being filled.

My problem is that my code is too slow, and every now and then an overwrite occurs.

Using the DMA, I save the ADC data, which arrives in two's complement format (offset binary is also available), in a uint32_t buffer and try to prepare it for the CMSIS DSP functions by converting the buffer to float32_t: this is where the overwrite occurs. It's worth noting that I'm currently using software floating point, not the hardware FPU.


The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could even use these formats if that would save me precious time.
It feels like I'm missing something important about this step, which is no surprise since this is my first project on a complex MCU. Any help, hint, or advice would be highly appreciated and would help me in my thesis.
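
To make the fixed-point option concrete, here is a minimal sketch of going from raw ADC words straight to a q15-style buffer without any floating point. It assumes each sample is a 12-bit two's complement value in bits [11:0] of a 16-bit word, which is not confirmed anywhere in this post, and the function names are hypothetical. int16_t stands in for q15_t (a typedef of int16_t in arm_math.h) so the snippet builds without CMSIS.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 12-bit two's complement sample in bits [11:0]; upper bits ignored. */
static inline int16_t sext12(uint16_t raw)
{
    /* Flip the sign bit, then subtract its weight: maps 0x800..0xFFF to -2048..-1. */
    return (int16_t)((int)((raw & 0x0FFFu) ^ 0x0800u) - 0x0800);
}

/* Scale the 12-bit value so its sign bit lands on bit 15 (q15 full scale). */
static inline int16_t sample_to_q15(uint16_t raw)
{
    return (int16_t)(sext12(raw) * 16);
}

/* Convert a whole buffer; dst uses the same storage type as q15_t. */
void samples_to_q15(const uint16_t *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = sample_to_q15(src[i]);
    }
}
```

The resulting buffer could then be handed to the q15 variants of the CMSIS DSP functions (arm_max_q15 and friends) instead of converting to float32_t first.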

I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.

Thanks in advance!

  • Thanks for the detailed reply Jens.
    Right now I'm doing the sign-aligned stuff inside Thibaut's function (https://www.m4-unleashed.com/parallel-comparison/), which is called from the DMA's Transfer Completed ISR.
    Here's my code:

    uint32_t MAXmin;
    int16_t sample[NUM_SAMPLE] = {0};
    int16_t sample2[NUM_SAMPLE] = {0};
    uint16_t shiftBits = 4;
    uint16_t wordLenght = 8; /* Figured out by looking at the register addresses while debugging */
    
    uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)
    {
        uint32_t data, min, max;
        int16_t data16;
    
        /* max variable will hold two max : one on each 16-bits half
         * same thing for min
         */
    
        /*Sign Extension*/
        *pSrc = (*pSrc) << shiftBits;
        *pSrc = (*pSrc) >> shiftBits;
        *(pSrc + wordLenght) = (*(pSrc+wordLenght)) << shiftBits;
        *(pSrc + wordLenght) = (*(pSrc+wordLenght)) >> shiftBits;
    
        /* Load two first samples in one 32-bit access */
        data = *__SIMD32(pSrc)++;
        /* Initialize Min and Max to these first samples */
        min = data;
        max = data;
        /* decrement sample count */
        pSize-=2;
    
        /* Loop as long as there remains at least two samples */
        while (pSize > 1)
        {
            /*Sign Extension*/
             *pSrc = (*pSrc) << shiftBits;
             *pSrc = (*pSrc) >> shiftBits;
             *(pSrc + wordLenght) = (*(pSrc+wordLenght)) << shiftBits;
             *(pSrc + wordLenght) = (*(pSrc+wordLenght)) >> shiftBits;
    
    
            /* Load next two samples in a single access */
            data = *__SIMD32(pSrc)++;
            /* Parallel comparison of max and new samples */
            (void)__SSUB16(max, data);
            /* Select max on each 16-bits half */
            max = __SEL(max, data);
            /* Parallel comparison of new samples and min */
            (void)__SSUB16(data, min);
            /* Select min on each 16-bits half */
            min = __SEL(min, data);
    
            pSize-=2;
        }
        /* Now we have maximum on even samples on low halfword of max
         * and maximum on odd samples on high halfword */
        /* look for max between halfwords 1 & 0 by comparing on low halfword */
        (void)__SSUB16(max, max >> 16);
        /* Select max on low 16-bits */
        max = __SEL(max, max >> 16);
    
        /* look for min between halfwords 1 & 0 by comparing on low halfword */
        (void)__SSUB16(min >> 16, min);
        /* Select min on low 16-bits */
        min = __SEL(min, min >> 16);
    
        /* Test if odd number of samples */
        if (pSize > 0)
        {
            data16 = *pSrc;
            /* look for max between on low halfwords */
            (void)__SSUB16(max, data16);
            /* Select max on low 16-bits */
            max = __SEL(max, data16);
    
            /* look for min on low halfword */
            (void)__SSUB16(data16, min);
            /* Select min on low 16-bits */
            min = __SEL(min, data16);
        }
    
        /* Pack result : Min on Low halfword, Max on High halfword */
        return __PKHBT(min, max, 16); /* PKHBT documentation */
    }

    At line 33 there's the bit extension.
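
For checking the SIMD version, a plain-C reference of what the function appears intended to compute can help: sign-extend every sample, then track the minimum and maximum. This is only a sketch under the assumption that samples are 12-bit two's complement values in bits [11:0] of consecutive 16-bit words; the type and function names are hypothetical and not from Thibaut's article.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: both extremes as signed 16-bit values. */
typedef struct {
    int16_t min;
    int16_t max;
} minmax16_t;

/* Assumption: 12-bit two's complement sample in bits [11:0]. */
static inline int16_t sext12_ref(uint16_t raw)
{
    return (int16_t)((int)((raw & 0x0FFFu) ^ 0x0800u) - 0x0800);
}

/* Scalar reference: one sample per iteration, no SIMD intrinsics. */
minmax16_t search_min_max_ref(const uint16_t *src, size_t n)
{
    minmax16_t r = { INT16_MAX, INT16_MIN };
    for (size_t i = 0; i < n; ++i) {
        int16_t s = sext12_ref(src[i]);
        if (s < r.min) {
            r.min = s;
        }
        if (s > r.max) {
            r.max = s;
        }
    }
    return r;
}
```

Running both versions over the same captured buffer and comparing the results is a cheap way to confirm that the sign extension inside the SIMD loop touches every sample.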

    Great analysis about the clock/sample Jens!

    "That means if your code runs from SRAM, then per sample, it should cost 7 clock cycles if P is 1."

    Sounds great! How can I be sure this is happening?
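
One way to check the per-sample cost is to count cycles around the processing call and divide by the number of samples. The sketch below shows only that arithmetic, scaled by 1000 to stay in integer maths; the function names are hypothetical. On the target, the cycle count would presumably come from the Cortex-M4 DWT cycle counter (DWT->CYCCNT in the CMSIS core headers), which is an assumption about the setup rather than something stated in this thread.

```c
#include <stdint.h>

/* Cycles available per sample, in thousandths of a cycle. */
static inline uint32_t budget_millicycles(uint32_t cpu_hz, uint32_t sample_hz)
{
    return (uint32_t)(((uint64_t)cpu_hz * 1000u) / sample_hz);
}

/* Cycles actually spent per sample, in thousandths of a cycle.
 * 'cycles' is the difference of two cycle-counter readings taken
 * before and after processing 'samples' samples (assumed nonzero). */
static inline uint32_t measured_millicycles(uint32_t cycles, uint32_t samples)
{
    return (uint32_t)(((uint64_t)cycles * 1000u) / samples);
}
```

With the figures from this thread (204 MHz core, 40 Msps), the budget works out to 5.1 cycles per sample, so if every sample is processed, any measured cost above that means a buffer cannot be finished before the next one fills.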

    Speaking of the compiler, this is the output of my current configuration in LPCXpresso (S2D.c is the file containing the code we are talking about):

    arm-none-eabi-gcc -nostdlib -L"/home/abet/LPCXpresso/link2_2/lpc_board_nxp_lpclink2_4370/Debug" -L"/home/abet/LPCXpresso/link2_2/lpc_chip_43xx/Debug" -L"/home/abet/LPCXpresso/link2_2/CMSIS_DSPLIB_CM4/lib" -Xlinker -Map="S2D.map" -Xlinker --gc-sections -Xlinker -print-memory-usage -mcpu=cortex-m4 -mthumb -T "S2D_Debug.ld" -o "S2D.axf"  ./src/S2D.o ./src/cr_startup_lpc43xx.o ./src/crp.o ./src/sysinit.o   -llpc_board_nxp_lpclink2_4370 -llpc_chip_43xx -lCMSIS_DSPLIB_CM4

    Memory region      Used Size  Region Size  %age Used
       RamLoc128:         6688 B       128 KB      5.10%
        RamLoc72:           0 GB        72 KB      0.00%
        RamAHB32:           0 GB        32 KB      0.00%
        RamAHB16:           0 GB        16 KB      0.00%
    RamAHB_ETB16:           0 GB        16 KB      0.00%
      RamM0Sub16:           0 GB        16 KB      0.00%
       RamM0Sub2:           0 GB         2 KB      0.00%
           SPIFI:        13668 B         4 MB      0.33%

    Also,    

    arm-none-eabi-gcc --version

    gives:

    arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.2.1 20151202 (release) [ARM/embedded-5-branch revision 231848]

    Looking at the project properties as suggested by Jens, I found out that I had no optimization level set here:

    [Screenshot: Properties for S2D _033.png]

    So I'm going to turn this on and implement the sign extension inside Thibaut's function Jens's way, and see if I get some good news!
