Hi all, I have firmware running on an NXP LPCLink2 (LPC4370: 204 MHz Cortex-M4 MCU) board which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA, I save the ADC data, which I get in two's complement format (offset binary is also available), in a uint32_t buffer, and then try to prepare it for the CMSIS DSP functions by converting the buffer to float32_t: this is where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could use these formats if that would save me precious time.
It feels like I'm missing something important about this step, which is no surprise since this is my first project on a complex MCU. Any help, hint or advice would be highly appreciated and would help me with my thesis.
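To make the fixed-point option concrete, here is a minimal sketch of going from the raw uint32_t DMA words straight to q15 without touching floating point. It assumes a 12-bit two's complement sample in the low bits of each word (please check the ADCHS FIFO layout in the user manual); the function names adc12_to_q15 and adc_block_to_q15 are made up for this example, and q15_t is typedef'd locally so the snippet builds without arm_math.h.

```c
#include <stddef.h>
#include <stdint.h>

/* CMSIS-DSP defines q15_t as int16_t; repeated here so the sketch
 * builds on its own. */
typedef int16_t q15_t;

/* Assumption: 12-bit two's complement sample in bits 11:0 of the word.
 * Any upper bits (status, channel ID, ...) are masked off. */
static inline q15_t adc12_to_q15(uint32_t raw)
{
    int32_t v = (int32_t)(raw & 0x0FFFu);
    if (v & 0x0800) {
        v -= 0x1000;          /* sign-extend from 12 bits */
    }
    return (q15_t)(v * 16);   /* scale 12-bit range up to q15 full scale */
}

/* Hypothetical helper: convert one DMA block (one sample per word). */
void adc_block_to_q15(const uint32_t *src, q15_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = adc12_to_q15(src[i]);
    }
}
```

With this mapping the most negative code 0x800 becomes -32768 and the most positive code 0x7FF becomes 32752, so the result can be handed to the q15 variants of the CMSIS functions. Whether this is fast enough at the full sample rate is a separate question.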
I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.
Thanks in advance!
Thank you Carl, it's my first post on the forum and I'm still not that experienced!
Hi Andrea,
I am not an expert regarding your main problem so I cannot provide a complete solution. I just want to give some thoughts and suggestions which might be helpful to you.
Although the core runs fast at 204 MHz, it cannot process the ADC data continuously at 40 Msps. A dual (ping-pong) DMA buffer only works at low to moderate sampling rates. At 40 Msps, no matter how you juggle the FIFO, DMA and interrupts, it all boils down to the fact that the core has only slightly more than 5 clock cycles per ADC sample. That leaves room for little more than data movement. Once you add the integer-to-floating-point conversion and the CMSIS function calls, you run into a buffer overrun. The high-speed ADC is usable at high speed only when you give the core enough time to process the data; this is possible when the ADC runs in discontinuous mode.
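For reference, the cycle budget above is just the ratio of the two rates given in this thread (204 MHz core clock, 40 Msps ADC rate):

```latex
\[
\frac{f_{\text{core}}}{f_{\text{s}}}
  = \frac{204 \times 10^{6}\ \text{cycles/s}}{40 \times 10^{6}\ \text{samples/s}}
  = 5.1\ \text{cycles per sample}
\]
```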
I also read your post at NXP so I'm going to follow this with a reply relating to your code.
Regards,
Goodwin
So, yesterday I got some probes that I was missing last week, and I was able to take some timing measurements on the board while it was working in continuous mode @ 40 Msps. It turned out that, using a ping-pong buffer of 1024 samples each, the LPC4370 @ 204 MHz needs roughly 37 instructions per sample (instead of the 5 I can actually afford) just for arm_shift_q31 and arm_max_q31. I tried the q15 versions of the functions, but strangely they didn't give any performance boost (even though they're reported to use SIMD instructions). Do you think that rebuilding my CMSIS library with a different optimization level is likely to bring a serious improvement?
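To put those numbers side by side, here is a rough sketch of the ping-pong deadline check. It treats the measured figure of 37 as cycles per sample, which is an assumption, and the function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Core cycles that elapse while the DMA fills one half of the
 * ping-pong buffer (rounded down). */
uint64_t fill_cycles(uint32_t samples, uint32_t core_hz, uint32_t sample_hz)
{
    return ((uint64_t)samples * core_hz) / sample_hz;
}

/* Core cycles needed to process one half at a given per-sample cost. */
uint64_t processing_cycles(uint32_t samples, uint32_t cycles_per_sample)
{
    return (uint64_t)samples * cycles_per_sample;
}

/* True if processing one half finishes before the other half is full. */
bool meets_deadline(uint32_t samples, uint32_t core_hz,
                    uint32_t sample_hz, uint32_t cycles_per_sample)
{
    return processing_cycles(samples, cycles_per_sample)
           <= fill_cycles(samples, core_hz, sample_hz);
}

/* Highest continuous sample rate the core could sustain at that cost. */
uint32_t max_sample_rate_hz(uint32_t core_hz, uint32_t cycles_per_sample)
{
    return core_hz / cycles_per_sample;
}
```

With 1024 samples, 204 MHz and 40 Msps, one half fills in about 5222 cycles while processing needs 37888, so the deadline is missed by a factor of about seven. At 37 cycles per sample the sustainable continuous rate would be roughly 5.5 Msps.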
Regards, Andrea
abet wrote:
Do you think that trying to rebuild my CMSIS library with a different optimization level will likely be a serious improvement?
It certainly could improve things if the code is currently optimized for size or unoptimized.
If you use a pre-compiled library, then it is most likely built for optimal performance. However, if the code is executing from Flash memory, I think it might be worth moving it to SRAM.
What I would recommend, is to put the code in a "ramcode" section and optimize for highest speed if you rebuild.
In addition, I would recommend you to run any other time-critical code from SRAM, however, make sure your code resides in a different section of RAM than the section that your DMA will access; this is very important.
Executing code from SRAM will give you a huge performance increase on a LPC40xx.
-But if the DMA and CPU fight over who's going to use the SRAM section, you might end up getting worse performance than before.
So make sure that the two sections are independent.
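As a rough illustration of keeping the two apart with GCC, the sketch below tags a time-critical function and the DMA buffers with separate section names. The names ".ramcode" and ".dma_buffers", the macros and block_peak_q15 are assumptions made up for this example; the linker script still has to map each section to a different SRAM bank, and that part is not shown.

```c
#include <stdint.h>

/* GCC/ELF only: place objects in named sections. Elsewhere the macros
 * expand to nothing so the code still builds. */
#if defined(__GNUC__) && defined(__ELF__)
#define RAMCODE __attribute__((section(".ramcode"), noinline))
#define DMA_BUF __attribute__((section(".dma_buffers"), aligned(4)))
#else
#define RAMCODE
#define DMA_BUF
#endif

#define BLOCK_SAMPLES 1024u

/* Ping-pong buffers written by the DMA; meant for their own SRAM bank. */
DMA_BUF uint32_t adc_dma_buffer[2][BLOCK_SAMPLES];

/* Time-critical routine meant to execute from a different SRAM bank. */
RAMCODE int16_t block_peak_q15(const int16_t *src, uint32_t n)
{
    int16_t peak = INT16_MIN;
    for (uint32_t i = 0; i < n; ++i) {
        if (src[i] > peak) {
            peak = src[i];
        }
    }
    return peak;
}
```

The attributes only name the sections; where they actually end up should be checked in the map file after linking.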
Jens, Andrea is using the LPC-Link 2, which is based on the LPC4370, a flashless MCU. The board uses quad SPI Flash memory. Since the fastest possible code execution is sought, copying the code to RAM rather than executing in place is imperative (I presume this is what Andrea is doing).
Since processing speed is your primary requirement, Q15 is the optimal data type for your samples when using CMSIS-DSP. However, if you are currently comfortable with Q31, you can continue to work with it, since you are not yet in the final stage of your project. If you eventually decide to use Q15, it will help to configure the FIFO to pack 2 samples per word. This doubles the number of samples transferred per word; seen from the other side, it halves the number of words to transfer for a given number of samples (and so the size of the DMA transfer). It might also help format the data for SIMD.
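A minimal sketch of what the packed layout could look like on the software side, assuming each 32-bit word carries two 12-bit two's complement samples, one per half-word, with the earlier sample in the low half. Both the sample order and the bit layout are assumptions to verify against the user manual, and the function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Take the low 12 bits of a half-word, sign-extend, scale to q15. */
static inline int16_t half_to_q15(uint32_t half)
{
    int32_t v = (int32_t)(half & 0x0FFFu);
    if (v & 0x0800) {
        v -= 0x1000;
    }
    return (int16_t)(v * 16);
}

/* Unpack n_words packed words into 2 * n_words q15 samples.
 * Assumed order: low half-word first, then high half-word. */
void unpack_packed_q15(const uint32_t *src, int16_t *dst, size_t n_words)
{
    for (size_t i = 0; i < n_words; ++i) {
        dst[2 * i]     = half_to_q15(src[i] & 0xFFFFu);
        dst[2 * i + 1] = half_to_q15(src[i] >> 16);
    }
}
```

A block of 1024 samples then needs only 512 DMA words.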
-As far as I recall, the LPC4xxx is able to execute code directly from SPIFI (please correct me if I'm wrong).
-But even if the code is already running from SRAM, it is a good idea to put the code in one SRAM, the data in another SRAM and the DMA buffers in a third SRAM, so that there are no stalls (collisions).
Yes, the external Flash is memory-mapped and code can be executed directly.
Yes.
Based on Andrea's updates, my assumption that the code is running from SRAM may be wrong.