Hi to you all, I have firmware running on an NXP LPC-Link2 (LPC4370: 204 MHz Cortex-M4 MCU) board which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA I'm saving the ADC data, which I get in two's-complement format (offset binary is also available), into a uint32_t buffer, and I try to prepare it for the CMSIS-DSP functions by converting the buffer into float32_t: here's where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could even use these formats if that would save me precious time. It feels like I'm missing something important about this step; that's no surprise, since this is my first project on a complex MCU. Any help/hint/advice would be highly appreciated and would help me with my thesis.
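To be concrete, the conversion step I'm describing looks roughly like this (buffer names and sizes are just placeholders for what I actually use):

#include "arm_math.h"   /* CMSIS-DSP: float32_t */

#define NUM_SAMPLES 128u                        /* placeholder buffer size */

static volatile uint32_t dma_buf[NUM_SAMPLES];  /* filled by the GPDMA */
static float32_t float_buf[NUM_SAMPLES];

/* Naive sign extension of the 12-bit two's-complement sample in the
   low bits of each word, then conversion to float for CMSIS-DSP. */
static void convert_samples(void)
{
    for (uint32_t i = 0u; i < NUM_SAMPLES; i++)
    {
        uint32_t raw = dma_buf[i] & 0xFFFu;
        int32_t  s   = (raw & 0x800u) ? (int32_t)(raw | 0xFFFFF000u)
                                      : (int32_t)raw;
        float_buf[i] = (float32_t)s;
    }
}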
I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.
Thanks in advance!
Regarding the sign extension, there is a very simple way to do it: change the scale of your data!
Before Andrea posted what he is doing with the samples, this was not an advisable trick. Now that finding the minimum and maximum values seems to be the only task to be done, a left shift/change of scale is a simple but effective way of sign extension.
For this project, writing a custom function to search for the minimum and maximum values, rather than using the CMSIS functions, is more advantageous: the input samples need to be sign-extended first, and the search for the minimum and maximum values can be combined into a single function.
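A minimal sketch of what I mean, assuming the 12-bit sample sits in the low bits of each 32-bit word:

#include <stdint.h>

/* Sign extension by a change of scale: shifting the 12-bit sample to
   the top of the 32-bit word multiplies every sample by 2^20, which
   preserves ordering, so min and max can be searched without any
   per-sample shift back. */
static void minmax_scaled(const uint32_t *buf, uint32_t n,
                          int32_t *min_out, int32_t *max_out)
{
    int32_t min = INT32_MAX;
    int32_t max = INT32_MIN;

    for (uint32_t i = 0u; i < n; i++)
    {
        int32_t s = (int32_t)(buf[i] << 20);  /* sample now in bits 31..20 */
        if (s < min) min = s;
        if (s > max) max = s;
    }

    *min_out = min >> 20;  /* undo the scale change once, at the end */
    *max_out = max >> 20;
}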
... I know I asked about floating-point conversion, so I would be happy to give the correct answer to Goodwin because his insight is great, but then I'd need to open a new post to continue the interesting optimization discussion.
Please let me know if one way is better than the other and what I should do; I really don't want to take advantage of anyone.
The floating-point conversion did not solve your problem; I just thought it might help if we tried to find a better way of converting the samples to floating-point format. You can mark one of Jens' or thibaut's replies as the correct answer, since they provided high-caliber code and tips on optimization. It's more than enough that you are very responsive and that you acknowledged the effort of those who replied.
If you are planning to have an expanded discussion regarding optimization, especially new question(s), it would be better to open a new discussion about it.
In this project I need to use the ADCHS to sample gaussian-shaped impulsive signals like these:
I only have very little knowledge about ionizing radiation, but I think a signed data format is unwieldy for the type of signal you are processing. Perhaps the quantity you are measuring is somehow associated with energy. Anyhow, it's quite unusual to use negative values for a gaussian-shaped signal like this.
Yesterday there was an important update to the project, and a colleague was able to tell me that we need at least 40 samples (or 128 over the entire 10 µs duration) per spike, so the acquisition can't be slower than roughly 1.25 Msps.
(Note that I still hope to be able to perform a 40 Msps acquisition and understand how to get the most out of this micro, since this would open up a whole set of possibilities for the applications of my project.)
If you can use an unsigned data type for the samples, I speculate that you can still gain processing speed, because you might be able to use the data from the DMA buffer directly, with no need for sign extension. You can configure the ADC for offset binary format, pack two samples per word, and have bits 31 to 28 and 15 to 12 read as 0. In that case the maximum value will be the one closest to 4095 and the minimum value will be the one closest to 0. As can be seen from the plot, there are only peaks and valleys; there is no trench.
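A rough sketch of that direct search, under the packing assumptions above (names are made up):

#include <stdint.h>

/* With offset binary and two samples packed per 32-bit word (bits
   15..12 and 31..28 reading as 0), the DMA buffer can be scanned
   directly as unsigned 16-bit values: no sign extension, no shift. */
static void minmax_offset_binary(const uint32_t *dma_buf, uint32_t words,
                                 uint16_t *min_out, uint16_t *max_out)
{
    const uint16_t *s = (const uint16_t *)dma_buf;  /* 2 samples/word */
    uint16_t min = 4095u;   /* 12-bit full scale */
    uint16_t max = 0u;

    for (uint32_t i = 0u; i < 2u * words; i++)
    {
        if (s[i] < min) min = s[i];
        if (s[i] > max) max = s[i];
    }

    *min_out = min;
    *max_out = max;
}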
If the right-shift and left-shift counts are the same, then yes, sbfx would certainly be the best choice; it costs only one clock cycle.
If, however, converting to fixed point, the sbfx instruction won't be able to do it on its own.
This is because the left-shift and right-shift counts would differ.
As sbfx is a wide instruction (it spans two 16-bit words) and a shift only spans one 16-bit word, the two shifts will shorten the used program space (and be a little more cache-friendly; a marginal improvement, but it could help prevent hiccups).
Two shift operations will still cost 2 clock cycles. There's another advantage: you do not have to worry about 32-bit alignment when you use 16-bit instructions. I've experienced a performance decrease (in the form of stalls) when a 32-bit instruction is not on a 32-bit boundary.
This is an advanced topic, which I hope to cover in a document I plan on writing about how to use the IT instruction more efficiently.
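To make the sbfx-versus-shifts comparison concrete, here are both variants (just a sketch; GCC usually emits sbfx on its own for a signed bitfield extract):

#include <stdint.h>

/* (a) two shifts; each is a 16-bit Thumb instruction, 1 cycle each. */
static inline int32_t sext12_shifts(uint32_t raw)
{
    return ((int32_t)(raw << 20)) >> 20;
}

/* (b) sbfx; a single 32-bit instruction, 1 cycle, but it cannot add
   an extra scaling shift for free. */
static inline int32_t sext12_sbfx(uint32_t raw)
{
    int32_t r;
    __asm ("sbfx %0, %1, #0, #12" : "=r" (r) : "r" (raw));
    return r;
}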
Many thanks goodwin for the insight on the memory map. I'll try to combine your answers with jensbauer's. As you suggest, I'm going to close this topic tomorrow (as soon as I can access the board and thibaut's code) and see about opening a new one on memory map/speed optimization as soon as I have more detailed info. I'm sorry that I was unavailable in the last few days.
Thanks goodwin for your reply. I think you're right,
and that's similar to what we came up with using Thibaut's code: that's why I'm going to give him the correct answer.
But I'd like to thank you very much: you were the first to reply here on this post and I'm glad you did.
Speaking of my project:
Perhaps the quantity you are measuring is somehow associated with energy.
Definitely, you're right. Unfortunately we have some trenches due to electronic artifacts from the analog circuitry, and at the moment I don't know if I can get rid of them. But anyway, I can't see why switching to offset binary should make me gain dynamic range on the positive side: isn't it still -2048 to 2047? Probably it's just me!
Thanks again for all your work, I'll follow your advice and eventually open a new post for my further questions.
Cheers, Andrea
Thank you Thibaut for your detailed answers. I studied and evaluated your code today and got good results: it took roughly 22.5 µs for 128 32-bit words, so, correct me if I'm wrong, actually 256 samples! That's the same time it took with the previous implementation, without the bit shifting.
Usually, I try to let the compiler do its job where it's good!
In fact, you need to wonder what you can do to help him generate efficient code:
I wrote quite a detailed analysis about this on my blog (Simplest algorithm ever).
In the end:
- try to fix everything you can at compile time (bit shift count, buffer size, loop count, ...)
- limit code visibility to what's necessary: using static functions allows inlining optimizations inside a module; the same goes for variables, so do not use module variables (placed in RAM) when local variables are enough (see the sketch below)
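A small illustration of those two rules (all names made up):

#include <stdint.h>

#define SHIFT_COUNT 4u    /* fixed at compile time, not a variable  */
#define BUF_WORDS   128u  /* fixed size => loop count known to GCC  */

/* static => visible only in this module, so the compiler may inline
   and specialize it; only local variables, nothing forced into RAM. */
static void scale_buffer(uint32_t *buf)
{
    for (uint32_t i = 0u; i < BUF_WORDS; i++)
    {
        buf[i] <<= SHIFT_COUNT;
    }
}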
Thanks for these hints. I read the article you linked, and now I think I better understand this (to me, of course) new way of programming you are showing: I feel like I need to study *a lot*. I just wonder how I can get those nice compiler outputs in GCC/LPCXpresso (which is actually a forked version of Eclipse).
Thanks again for your help! Now I'm going to close this post, but it was nice and helpful. Unfortunately I can choose just one correct answer, but I'd like to thank you all (once again) for what you are doing here. Lovely community.
You're very welcome, too.
Unfortunately we have some trenches due to electronic artifacts from the analog circuitry, and at the moment I don't know if I can get rid of them.
Maybe those are just noise of small amplitude, not necessarily trenches.
But anyway, I can't see why switching to offset binary should make me gain dynamic range on the positive side: isn't it still -2048 to 2047?
Dynamic range is not the reason why I'm suggesting that you try using the offset binary format. With offset binary format the values of the samples will just be shifted (0 to 4095) but the range is still the same.
With offset binary, the user has flexibility in interpreting the output code. If the input signal is unipolar, the output code can be treated as straight binary (0x0..0 to 0xF..F, unsigned). If the input is bipolar, the output is a signed data format: you can convert to two's complement by simply complementing the leftmost bit, or by subtracting the offset if sign extension is needed (when the destination is wider than the number of bits from the ADC). If the input is bipolar but level-shifted to become unipolar, the user is free to treat the output code as straight binary or as sign-encoded. A sketch of the two conversions follows.
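For a 12-bit ADC, assuming the sample is in the low bits of the word:

#include <stdint.h>

/* (a) complement the leftmost bit, then sign-extend if the
   destination is wider than 12 bits. */
static inline int32_t ob_to_tc_flip(uint32_t raw)
{
    return ((int32_t)((raw ^ 0x800u) << 20)) >> 20;
}

/* (b) subtract the offset; gives the sign-extended value directly. */
static inline int32_t ob_to_tc_sub(uint32_t raw)
{
    return (int32_t)raw - 2048;
}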
What I am suggesting you explore is the possibility of avoiding reformatting the data from the DMA buffer. This is possible if your ADC output is in offset binary format and you configure the FIFO so that you always read 0s in bits 12 to 15 and 28 to 31. Then you will only need to search for the minimum and maximum. In thibaut's code you can remove the statement
data <<= 4;
or the statement(s) that you used to sign-extend the data. Getting rid of that statement (or those statements) translates to faster execution and avoids the unnecessary change of scale of the samples. Moreover, you have the freedom to choose between uint16_t, int16_t, or q15_t as the data type of the samples. Even if there is some possible hindrance to using the offset binary format, I have little doubt that processing two's-complement code can only be as fast as, but not faster than, offset binary. Thus it's worth exploring whether you can avoid the two's-complement format.
So, good luck with your thesis; it's an interesting application of a high-performance MCU.
I totally agree with goodwin on this offset binary usage.
If the further processing can be tuned to work with this kind of data, it will improve your code nicely.
You would spare the shift operation and the write back to memory => this would lead back to the original SearchMinMax16_DSP code!
Thanks Goodwin for this advice! I now understand what you wanted to say. I'll speak with the analog/physics guys I'm working with and see if it's possible to get a unipolar signal! I know how to configure the offset binary format, and I think there should be a way to always have 0s above my samples; if not, I hope that a bitwise AND won't slow things down, but I'll see. I'm pretty sure I'll post again about this project, so let's say "see you soon"! And again, thanks for your help.
An AND operation will cost at least 1 clock cycle.
I expect it to stay at one clock cycle, because the compiler should prepare the AND mask before the loop starts.
The ADC, however, should always give you the same bit-values in the upper bits.
As far as I recall, the LPC4xxx can be configured to shift the bits to the top, which means you should be able to min/max those as most-significant instead of least-significant.
Thus you would not need to do any shifting or ANDing.
If the LPC4xxx always inserts zeros at the bottom when in this mode, this might be the most beneficial setting you can get (because the ADC value would be ready for use as a fractional value).
Even if the lower bits are 'rubbish', you can always clear those bits after the min/max loop, so it won't really cause any slow-down (see the snippet below).
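Something like this, assuming a result left-aligned in a 16-bit word with 10 significant bits:

#include <stdint.h>

/* The 'rubbish' low bits only need clearing once, after the search. */
static inline uint16_t clear_low_bits(uint16_t v)
{
    return v & 0xFFC0u;   /* keep the top 10 result bits */
}

/* e.g.  min = clear_low_bits(min);  max = clear_low_bits(max); */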
-I decided to investigate, and in UM10503, sections 46.6.2 and 46.6.4, for GDR and DRx, it says:
When DONE is 1, this field contains a binary fraction representing
the voltage on the ADCn pin selected by the SEL field, divided by
the reference voltage on the VDDA pin. Zero in the field indicates
that the voltage on the ADCn input pin was less than, equal to, or
close to that on VSSA, while 0x3FF indicates that the voltage on
ADCn input pin was close to, equal to, or greater than that on
VDDA.
-So it's pretty clear: The result is aligned to the top of the 16-bit word. Your DMA should only transfer 16 bits from the DRx or GDR register.
The manual also mentions the following, which I believe is important in your case:
Remark: Use only the individual channel data registers DR0 to DR7 with burst mode or with hardware triggering to read the conversion results.
Thus the result is in the low 16 bits as follows (in binary): %RRRRRRRRRR000000, where R represents the 10-bit ADC Result.
So you don't need the AND or any bit-shifting; you get the fractional ADC result.
That means, when you've found your MIN and MAX values, you could multiply those values by 3.3 and divide by 65536.0; then you have the actual voltage (provided that the ADC reference voltage on the VDDA pin is exactly 3.3V).
If you prefer using integers (which is of course quicker), you can just multiply by 3300 and shift the result 16 bits to the right; then you have millivolts. But it should not be necessary to use integers in this case, as you'd most likely do that conversion on a PC that is processing the captured data.
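For example (assuming VDDA is exactly 3.3V and the fractional result fills the full 16 bits):

#include <stdint.h>

/* Fractional (top-aligned) 16-bit ADC value -> volts. */
static inline float adc_to_volts(uint16_t raw)
{
    return (float)raw * 3.3f / 65536.0f;
}

/* Integer variant: millivolts, no floating point needed. */
static inline uint32_t adc_to_millivolts(uint16_t raw)
{
    return ((uint32_t)raw * 3300u) >> 16;
}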
Jens, most likely you referred to an outdated version of UM10503. The current version is Rev. 2.1 of the document
LPC43xx/LPC43Sxx ARM Cortex-M4/M0 multi-core microcontroller User manual.
Sections 46.6.2 and 46.6.4 for A/D Global Data register and A/D Data Registers are now in 47.6.2 and 47.6.4.
The information you posted applies to ADC0 and ADC1, which are 400 ksps, 10-bit ADCs. Andrea is using the high-speed (up to 80 Msps) 12-bit ADC (ADCHS). The ADCHS is covered in Chapter 48 of UM10503 Rev. 2.1.
Lines 19, 20, 36 and 37 look very wrong to me.
To me, it seems you're doing the same job twice.
E.g. after 8 iterations, the values you've already sign-extended will be sign-extended again.
I could be wrong, but I'd better mention it: are you sure that they're doing what you want?
(I would remove them completely)
About the 'prefetch' on branches (P):
Prefetch only happens when necessary. It's not really something you're in control of (especially not when using C code).
-But it may be 3 the first time the branch jumps back in the loop and then 1 from that point on.
If an interrupt happens while you're inside the loop, P might become 3 again.
But as you see, this is something that's rare, so I think you can assume the value 1.
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code.
Now, I just don't know which address RamLoc128 is.
(I particularly like that NXP measure 0 Bytes in GB).
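If it helps, placing a hot function in one of those RAM sections with GCC looks roughly like this (the section name is a placeholder; it must match whatever your linker script, or the LPCXpresso managed linker script, provides, and the startup code must copy the section from flash to RAM):

__attribute__((section(".ramcode"), noinline))
static void hot_loop(const uint32_t *buf, uint32_t n)
{
    /* ... min/max search over the DMA buffer ... */
    (void)buf;
    (void)n;
}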
There is not that much data at the given link: https://community.nxp.com/thread/432348