Process ADC data, moved by DMA, using CMSIS DSP: what's the right way?

Hi all,
I have firmware running on an NXP LPC-Link2 board (LPC4370: 204 MHz Cortex-M4 MCU) that basically does this:

Fills the ADC FIFO at 40 Msps.
Copies the data into memory using the built-in DMA controller and two linked buffers.
Processes one buffer while the other is being filled.

My problem is that my code is too slow, and every now and then an overwrite occurs.
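A minimal sketch of the ping-pong scheme described above, in plain C so it compiles anywhere: the DMA interrupt is simulated by calling a function directly, and all names (`adc_buf`, `dma_half_complete_isr`, `service_one_block`) are hypothetical, not part of LPCOpen or CMSIS. It shows one way to detect the overwrite: if the DMA completes another half while a block is still being processed, it has already wrapped back into that block.

```c
#include <stdbool.h>
#include <stdint.h>

#define BUF_LEN 1024u /* samples per half (hypothetical size) */

/* Two halves written alternately by the DMA (hypothetical layout). */
static uint32_t adc_buf[2][BUF_LEN];
static volatile uint32_t dma_done_count; /* halves completed so far */
static volatile uint8_t dma_last_half;   /* half the DMA finished most recently */

/* Stand-in for the GPDMA terminal-count interrupt: on the board the
   hardware raises it; here it is called directly. */
void dma_half_complete_isr(uint8_t half)
{
    dma_last_half = half;
    dma_done_count++;
}

/* Placeholder processing: maximum of the 12-bit two's-complement samples,
   assumed to sit in bits 11:0 of each word. */
static int32_t process_block(const uint32_t *blk, uint32_t n)
{
    int32_t max = INT32_MIN;
    for (uint32_t i = 0; i < n; i++) {
        int32_t s = (int32_t)((blk[i] & 0x0FFFu) ^ 0x0800u) - 0x0800;
        if (s > max)
            max = s;
    }
    return max;
}

/* Process the half the DMA just finished. Returns false on overrun: if the
   DMA finished the other half meanwhile, it is now refilling this one. */
bool service_one_block(int32_t *out_max)
{
    uint32_t start = dma_done_count; /* snapshot before touching the data */
    uint8_t half = dma_last_half;
    *out_max = process_block(adc_buf[half], BUF_LEN);
    return dma_done_count == start;
}
```

The main loop would call `service_one_block()` whenever `dma_done_count` changes; a `false` return means processing took longer than one DMA half-period.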

Using the DMA, I save the ADC data, which I get in two's complement format (offset binary is also available), into a uint32_t buffer, and then try to prepare it for the CMSIS DSP functions by converting the buffer to float32_t: this is where the overwrite occurs. It's worth saying that I'm currently using software floating point, not the hardware FPU.
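For reference, a sketch of that conversion step in plain C. It assumes (check the LPC43xx user manual) that each DMA word holds one 12-bit two's-complement sample in bits 11:0 and masks off anything above; the helper names are hypothetical. With software floating point, the int-to-float conversion and the multiply inside the loop typically compile to run-time library calls, which is a likely reason this step is so slow.

```c
#include <stdint.h>

#define ADC_SAMPLE_MASK 0x0FFFu /* assumed: sample in bits 11:0 */
#define ADC_SIGN_BIT    0x0800u /* bit 11 is the two's-complement sign bit */

/* Sign-extend the 12-bit field to int32_t without implementation-defined
   shifts: flip the sign bit, then subtract its weight. */
static inline int32_t adc_sample_to_int(uint32_t word)
{
    uint32_t raw = word & ADC_SAMPLE_MASK;
    return (int32_t)(raw ^ ADC_SIGN_BIT) - (int32_t)ADC_SIGN_BIT;
}

/* Convert a buffer to floats in [-1, 1) (CMSIS-DSP's float32_t is a
   typedef of float). With soft-float, every iteration calls into the
   floating-point emulation library. */
static void adc_buffer_to_float(const uint32_t *src, float *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = (float)adc_sample_to_int(src[i]) * (1.0f / 2048.0f);
}
```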

The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point math, I could use these formats instead if that saves me precious time.
It feels like I'm missing something important about this step; that's no surprise, since this is my first project on a complex MCU. Any help, hint or advice would be highly appreciated and would help me with my thesis.
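Since the samples are already integers, converting to q15_t or q31_t needs only a sign extension and a multiply by a power of two, with no floating point at all. A minimal sketch under the same assumed layout (12-bit two's complement in bits 11:0); the typedefs mirror CMSIS-DSP's (q31_t is int32_t, q15_t is int16_t) so the block compiles on its own, and the function names are hypothetical. Doing the scaling during conversion may also make a separate shift pass unnecessary.

```c
#include <stdint.h>

typedef int32_t q31_t; /* same underlying types as in CMSIS-DSP */
typedef int16_t q15_t;

/* 12-bit two's-complement field (bits 11:0) -> int32_t in [-2048, 2047]. */
static inline int32_t adc_sext12(uint32_t word)
{
    return (int32_t)((word & 0x0FFFu) ^ 0x0800u) - 0x0800;
}

/* Scale by 2^4 so the sample's sign bit lands on bit 15 (Q15). */
static void adc_buffer_to_q15(const uint32_t *src, q15_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = (q15_t)(adc_sext12(src[i]) * 16);
}

/* Scale by 2^20 so the sign bit lands on bit 31 (Q31). This cannot
   overflow: -2048 * 2^20 is exactly INT32_MIN. */
static void adc_buffer_to_q31(const uint32_t *src, q31_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = adc_sext12(src[i]) * (1 << 20);
}
```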

I'll leave the link to the (more detailed) question I asked in the NXP forums here, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.

Thanks in advance!


Andrea Bettati, over 9 years ago, in reply to G. Goodwin L. Pitos

So, yesterday I got the probes I had been missing last week, and I was able to take some timing measurements on the board while it was running in continuous mode at 40 Msps.
It turned out that, using a ping-pong buffer of 1024 samples per half, the LPC4370 at 204 MHz needs roughly 37 instructions per sample (instead of the 5 I can actually afford) just for arm_shift_q31 and arm_max_q31.
I tried the q15 versions of the functions, but strangely they didn't give any performance boost (even though they're reported to use the SIMD instructions).
Do you think that rebuilding my CMSIS library with a different optimization level is likely to bring a serious improvement?
Regards,
Andrea
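The 5-instruction figure follows from the ratio of clock rate to sample rate. A small sketch of the arithmetic, using only the numbers quoted in this post (204 MHz, 40 Msps, 1024 samples per half, about 37 per sample measured); it treats one instruction as one cycle, which is an approximation, and the function names are hypothetical.

```c
#include <stdint.h>

/* Cycles available per sample if processing must keep pace with the ADC. */
static double cycles_per_sample(double cpu_hz, double sample_hz)
{
    return cpu_hz / sample_hz;
}

/* Time the DMA takes to fill one half of the ping-pong buffer: the other
   half must be fully processed within this window. */
static double half_fill_time_s(uint32_t samples, double sample_hz)
{
    return (double)samples / sample_hz;
}

/* Time needed to process one half at a measured cost per sample. */
static double half_process_time_s(uint32_t samples, double cycles_each,
                                  double cpu_hz)
{
    return (double)samples * cycles_each / cpu_hz;
}
```

With these figures, the budget is 5.1 cycles per sample, each half fills in 25.6 µs, and processing it at 37 cycles per sample takes about 186 µs, so the processing would need to become roughly seven times faster to keep up.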

Jens Bauer, over 9 years ago, in reply to Andrea Bettati

abet wrote:
Do you think that trying to rebuild my CMSIS library with a different optimization level will likely be a serious improvement?
It certainly could improve things if the code is currently optimized for size or unoptimized.
If you use a pre-compiled library, then the library is most likely built for optimal performance; however, if the code is executing from Flash memory, I think it might be worth moving it to SRAM.
What I would recommend is to put the code in a "ramcode" section and optimize for highest speed if you rebuild.
In addition, I would recommend running any other time-critical code from SRAM; however, make sure your code resides in a different section of RAM than the one your DMA accesses. This is very important.
Executing code from SRAM will give you a huge performance increase on an LPC40xx.
-But if the DMA and CPU fight over who's going to use the SRAM section, you might end up getting worse performance than before.
So make sure that the two sections are independent.
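As an illustration of the placement Jens describes, a GCC-style sketch: the section names `.ramcode` and `.dma_buffers` are hypothetical and only take effect if your linker script maps them to separate SRAM banks (the LPC4370 has several SRAM blocks). MCUXpresso/LPCXpresso projects provide macros for the same purpose (for example `__RAMFUNC()` in `cr_section_macros.h`), so check your toolchain's documentation first.

```c
#include <stdint.h>

/* Hypothetical section names: they must match output sections in your
   linker script that live in different SRAM banks. */
#define RAMCODE    __attribute__((section(".ramcode"), noinline))
#define DMA_BUFFER __attribute__((section(".dma_buffers"), aligned(4)))

/* Ping-pong buffer in its own bank, so DMA writes do not collide with
   instruction fetches or stack accesses. */
static uint32_t adc_dma_buf[2][1024] DMA_BUFFER;

/* Time-critical loop placed in RAM; noinline keeps it from being inlined
   back into a caller that runs from flash. */
RAMCODE static int32_t block_max(const uint32_t *blk, uint32_t n)
{
    int32_t max = INT32_MIN;
    for (uint32_t i = 0; i < n; i++) {
        /* 12-bit two's-complement sample assumed in bits 11:0 */
        int32_t s = (int32_t)((blk[i] & 0x0FFFu) ^ 0x0800u) - 0x0800;
        if (s > max)
            max = s;
    }
    return max;
}
```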
G. Goodwin L. Pitos, over 9 years ago, in reply to Jens Bauer

Jens, Andrea is using the LPC-Link 2, which is based on the LPC4370, a flashless MCU; this board uses Quad SPI Flash memory. Since the fastest code execution is sought, copying the code to RAM rather than executing it in place is imperative (I presume this is what Andrea is doing).
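Copying to RAM is normally done by the startup code before `main()`: it copies the code image, word by word, from its load address in the SPIFI flash to its run address in SRAM. A host-runnable sketch of that loop; on the target the pointers would come from linker-script symbols, and the symbol names in the comment are hypothetical.

```c
#include <stdint.h>

/* Copy one section from its load address (e.g. in SPIFI flash) to its run
   address in SRAM. On the target this would typically be called as
   copy_section(&__ramcode_start, &__ramcode_end, &__ramcode_load);
   where those symbol names (hypothetical here) come from your linker script. */
static void copy_section(uint32_t *dst, const uint32_t *dst_end,
                         const uint32_t *src)
{
    while (dst < dst_end)
        *dst++ = *src++;
}
```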
G. Goodwin L. Pitos, over 9 years ago, in reply to Andrea Bettati

Since processing speed is your primary requirement, Q15 is the optimal data type for your samples when using CMSIS-DSP. However, if you are comfortable with Q31 for now, you can keep working with it, since you are not yet in the final stage of your project. If you eventually decide to use Q15, it will help to configure the FIFO to pack two samples per word. This doubles the number of samples transferred per word or, seen the other way, halves the number of words to transfer for a given number of samples (a smaller DMA transfer). It may also help format the data for SIMD.
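A sketch of why packing helps, under an assumed layout (verify it in the LPC43xx user manual): sample 0 in bits 11:0 and sample 1 in bits 27:16 of each FIFO word, both 12-bit two's complement. Masking and shifting left by 4 converts both samples to Q15 in one 32-bit operation and leaves them as two halfwords in one word, the layout the Cortex-M4 SIMD instructions operate on. The function names are hypothetical.

```c
#include <stdint.h>

#define PACKED_SAMPLE_MASK 0x0FFF0FFFu /* assumed: samples in bits 11:0 and 27:16 */

/* Two 12-bit samples -> two Q15 samples in one word: each field moves up
   4 bits so its sign bit lands on bit 15 (low half) or bit 31 (high half).
   Masking first keeps the bits above each field from leaking into the
   neighbouring halfword. */
static inline uint32_t packed_to_q15x2(uint32_t word)
{
    return (word & PACKED_SAMPLE_MASK) << 4;
}

/* Split a converted word into its two Q15 samples (low halfword = first
   sample under the assumed layout). The uint16_t -> int16_t cast relies
   on GCC's two's-complement conversion. */
static void unpack_q15x2(uint32_t w, int16_t *s0, int16_t *s1)
{
    *s0 = (int16_t)(uint16_t)(w & 0xFFFFu);
    *s1 = (int16_t)(uint16_t)(w >> 16);
}
```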
Jens Bauer, over 9 years ago, in reply to G. Goodwin L. Pitos

-As far as I recall, the LPC4xxx is able to execute code directly from SPIFI (please correct me if I'm wrong).
-But even if the code is already running from SRAM, it is a good idea to put the code in one SRAM, the data in another SRAM and the DMA buffers in a third SRAM, so that there are no stalls (collisions).
G. Goodwin L. Pitos, over 9 years ago, in reply to Jens Bauer

-As far as I recall, the LPC4xxx is able to execute code directly from SPIFI (please correct me if I'm wrong).
Yes, the external Flash is memory-mapped and code can be executed directly.
-But even if the code is already running from SRAM, it is a good idea to put the code in one SRAM, the data in another SRAM and the DMA buffers in a third SRAM, so that there are no stalls (collisions).
Yes.
Based on Andrea's updates, my assumption that the code is running in SRAM may be wrong.