Hi all, I have firmware running on an NXP LPC-Link2 board (LPC4370: 204 MHz Cortex-M4 MCU) which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA, I'm saving the ADC data, which I get in two's complement format (offset binary is also available), in a uint32_t buffer, and I try to prepare it for the CMSIS DSP function by converting the buffer into float32_t: that's where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could use these formats if that would save me precious time. It feels like I'm missing something important about this step, which is no surprise since this is my first project on a complex MCU. Any help, hint or advice would be highly appreciated and would help me with my thesis.
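To illustrate the fixed-point route mentioned above, here is a minimal sketch of converting raw samples straight to q15 without touching floating point. It assumes each uint32_t word holds one 12-bit two's complement sample in bits 11:0 (the actual packing depends on how the ADCHS and DMA are configured), and the `q15_t` typedef is a local stand-in for the CMSIS one so the snippet compiles on its own. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the CMSIS-DSP q15_t typedef so the sketch is self-contained. */
typedef int16_t q15_t;

/* Assumption: one 12-bit two's complement sample sits in bits 11:0 of each
 * word; anything in the upper bits is discarded. */
static inline q15_t adc12_to_q15(uint32_t raw)
{
    /* Shifting left by 4 moves the sign bit (bit 11) up to bit 15, which
     * sign-extends the sample and scales it to the q15 range in one step. */
    return (q15_t)(uint16_t)(raw << 4);
}

/* Convert a block of n raw words into n q15 samples. */
void adc_block_to_q15(const uint32_t *src, q15_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = adc12_to_q15(src[i]);
    }
}
```

The resulting buffer could then be passed to the q15 variants of the CMSIS DSP functions. If float32_t is still needed later, CMSIS provides `arm_q15_to_float` for that conversion.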
I'll leave here the link to the (more detailed) question I asked on the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.
Thanks in advance!
Lines 19, 20, 36 and 37 look very wrong to me.
To me, it seems you're doing the same job twice.
E.g. after 8 iterations, the values you've already sign-extended will be sign-extended again.
I could be wrong, but I'd better mention it: are you sure they're doing what you want?
(I would remove them completely)
About the 'prefetch' on branches (P):
Prefetch only happens when necessary. It's not really something you're in control of (especially not when using C code).
-But it may be 3 the first time the branch jumps back in the loop and then 1 from that point on.
If an interrupt happens while you're inside the loop, P might become 3 again.
But as you see, this is something that's rare, so I think you can assume the value 1.
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code.
Now, I just don't know which address RamLoc128 is.
(I particularly like that NXP measure 0 Bytes in GB).
About Optimization:
Line 19, 20, 36 and 37 looks very wrong to me. [...] are you sure that they're doing what you want ?
Unfortunately no, I'm not: the purpose of those lines is to point to the 2nd value of the pair that is being processed by Thibaut's function and sign-extend that value! I tried to figure out how far I needed to move my pointer to get the next sample's address by looking at the samples' addresses through the debugger. Maybe I was wrong?
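One way to avoid extending the same value twice is to index the output by word rather than stepping a pointer by hand. The sketch below is not Thibaut's function. It assumes each 32-bit word packs two samples (first in bits 15:0, second in bits 31:16) with the 12-bit value in the low bits of each half, and the names `sext12` and `unpack_pairs` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 12-bit two's complement value held in the low bits of x. */
static inline int16_t sext12(uint32_t x)
{
    x &= 0x0FFFu;  /* keep only bits 11:0 */
    /* Flip the sign bit, then subtract its weight: maps 0x800..0xFFF to -2048..-1. */
    return (int16_t)((int32_t)(x ^ 0x0800u) - 0x0800);
}

/* Assumption: two samples per word, first in bits 15:0, second in bits 31:16.
 * Each input word is read once and each output slot is written once, so no
 * sample can be sign-extended a second time. */
void unpack_pairs(const uint32_t *words, int16_t *out, size_t n_words)
{
    for (size_t i = 0; i < n_words; ++i) {
        out[2 * i]     = sext12(words[i]);        /* first sample of the pair  */
        out[2 * i + 1] = sext12(words[i] >> 16);  /* second sample of the pair */
    }
}
```

Because the mask is applied before extending, calling `sext12` on a value that was already extended gives the same result, which makes this variant tolerant of an accidental second pass.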
so I think you can assume the value 1
That's OK for now, I won't tinker with it.
Today I did some tests using the -O3 optimization level for my project and the result is great (using Thibaut's function with no sign extension): the elapsed time for 128 samples is roughly 18 us, compared to 160 us without optimization! Fun fact: compiling the CMSIS DSP with -O2 gives slightly better performance than using -O3! (updated the old post)
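To put those timings in terms of CPU cycles, the small helper below converts a block time into cycles per sample. It is plain arithmetic and assumes the core really runs at the 204 MHz quoted in the first post and that both timings are in microseconds; the function name is hypothetical.

```c
#include <stdint.h>

/* CPU cycles spent per sample, given the time for one block in microseconds,
 * the core clock in MHz, and the number of samples in the block.
 * (microseconds * MHz = cycles) */
static inline double cycles_per_sample(double block_time_us,
                                       double cpu_mhz,
                                       uint32_t n_samples)
{
    return block_time_us * cpu_mhz / (double)n_samples;
}
```

With these numbers, `cycles_per_sample(18.0, 204.0, 128)` gives about 28.7 cycles per sample for the -O3 build, and `cycles_per_sample(160.0, 204.0, 128)` gives 255 for the unoptimized one.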
Speaking of the Memory Layout:
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA. As far as I can tell, the DMA buffer is somewhere in RamLoc128.
Yes, I completely agree: as I can see in the LPCXpresso project's properties, RamLoc128 starts at 0x10000000 (you can look for it in the picture posted in my recap).
That means you can pick any of the other ram locations
The next step on this front is to understand how I can do this.
It's great to hear about the optimization results.
abet wrote:
Today I did some tests using the -O3 optimization level for my project and the result is great (using Thibaut's function with no sign extension): the elapsed time for 128 samples is roughly 18 us, compared to 160 us without optimization!
Fun fact: compiling the CMSIS DSP with -O2 gives slightly better performance than using -O3! (updated the old post)
The -O2 result is a great observation. It might be because -O3 most likely unrolls the loops more than -O2 does.
If that's the case, it means that fetching the larger code from SPIFI slows things down (I'm only guessing here).
If it's possible for you to link to a binary version of a pre-compiled CMSIS DSP library, try that.
I know that the people who developed the DSP library have spent a great deal of time optimizing it, as if that were the most important thing in the world to them.
-So if a precompiled library exists and you can link directly to that, then you'll most likely get the best performance regarding the DSP library.
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA. As far as I can tell, the DMA buffer is somewhere in RamLoc128. That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code. Now, I just don't know which address RamLoc128 is.
Jens, if the code is to run in SRAM, its location should be 0x10000000, the start of RamLoc128. It is the data space and DMA buffer that must be relocated. This is because RamLoc128 (starting from 0x10000000) is the area where the bootloader copies and executes the image from SPIFI (or another external source) when not executing in place.
From your post above:
I needed to add the SPIFI Flash in order to use the Link2 as an evaluation board (as described here: Introduction to Programming the NXP LPC4370 MCU Using the LPCxpresso Tools and Using Two LPC-Link2 Boards, and here: Using an LPC-Link2 as an LPC4370 evaluation board | NXP Community).
Using the LPC-Link2 with SPIFI Flash as the boot source is described in those two pages. What Jens is recommending is to execute from SRAM for the code to run faster. This means that you will add SPIFI Flash but program execution should not be directly from that location. The code from Flash should be copied to SRAM and executed there.
From my reply to Jens, the code should be run from 0x10000000, the start of RamLoc128. It should be the data space and DMA buffer that must be relocated.
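As a rough sketch of what relocating the DMA buffer could look like with GCC: a section attribute asks the linker to place the buffer in a different bank from the one the code runs in. The section name `.bss.$RAM2` follows the naming convention of LPCXpresso managed linker scripts and is an assumption here, as is which physical bank the `RAM2` alias maps to in a given project. `TARGET_LPC4370`, `IN_RAM2_BSS`, `adc_dma_buffer` and `region_is_clear_of` are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* TARGET_LPC4370 is a hypothetical build flag. Without it the macro expands
 * to nothing, so the file still compiles on a host machine. */
#if defined(TARGET_LPC4370)
/* Assumed section naming: ".bss.$<bank alias>" as used by LPCXpresso
 * managed linker scripts. */
#define IN_RAM2_BSS __attribute__((section(".bss.$RAM2")))
#else
#define IN_RAM2_BSS
#endif

#define ADC_BLOCK_WORDS 128u

/* DMA destination buffer, kept out of the bank the code executes from. */
IN_RAM2_BSS uint32_t adc_dma_buffer[ADC_BLOCK_WORDS];

/* Returns nonzero if [addr, addr + len) lies entirely outside
 * [base, base + size). */
int region_is_clear_of(uintptr_t addr, size_t len, uintptr_t base, size_t size)
{
    return (addr + len <= base) || (addr >= base + size);
}
```

A startup check such as `region_is_clear_of((uintptr_t)adc_dma_buffer, sizeof adc_dma_buffer, 0x10000000u, 128u * 1024u)` would confirm that the buffer ended up outside RamLoc128. The linker map file is the authoritative place to verify the final placement.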
Many thanks, goodwin, for the insight on the memory map. I'll try to combine your answers with jensbauer's. As you suggest, I'm going to close this topic tomorrow (as soon as I can access the board and Thibaut's code) and see about opening a new one on memory map/speed optimization as soon as I have more detailed information. I'm sorry that I was unavailable in the last few days.