Hi to you all, I have firmware running on an NXP LPC-Link 2 (LPC4370: 204 MHz Cortex-M4 MCU) board which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA I'm saving the ADC data, which I get in two's complement format (offset binary is also available), into a uint32_t buffer, and I try to prepare it for the CMSIS-DSP functions by converting the buffer into float32_t: this is where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could even use these formats if that would save me precious time. It feels like I'm missing something important about this step; that's no surprise, since this is my first project on a complex MCU. Any help/hint/advice would be highly appreciated and would help me in my thesis.
I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.
Thanks in advance!
Hi abeta and welcome to the Community!
I've moved your question to ARM Processors where I hope you will get your answer.
Thank you Carl, It's my first post on the forum and I'm still not that experienced!
Hi Andrea,
I am not an expert regarding your main problem so I cannot provide a complete solution. I just want to give some thoughts and suggestions which might be helpful to you.
While the core runs fast at 204 MHz, it will not be able to process the ADC data at 40 Msps continuously. A dual (ping-pong) DMA buffer will only work at low to moderate sampling rates. At 40 Msps, no matter how you juggle the FIFO, DMA, and interrupts, it all boils down to the fact that the core has only slightly more than 5 clock cycles per ADC sample (204 MHz / 40 Msps ≈ 5.1). You can do little more than data movement; once you convert from integer to floating point and call CMSIS functions, you run into buffer overrun. The high-speed ADC is usable at high speed only when you give the core enough time to process the data, which is possible when the ADC is in discontinuous mode.
I also read your post at NXP so I'm going to follow this with a reply relating to your code.
Regards,
Goodwin
This is about the code that you posted in NXP Community.
Your code temporarily converts the negative ADC values to positive (by "manual" two's complementing) and eventually converts back to negative by multiplying by -1. I suggest you explore alternative, potentially more efficient ways of converting from 12-bit signed integer to floating-point format.
A possible way of converting the ADC data from 12-bit signed integer to 32-bit floating-point format is by transforming to 32-bit two's complement code then assigning the result to a floating-point variable (simply let the compiler do the rest of the work).
If the ADC data is in (12-bit) offset binary code, it can be converted to 32-bit two's complement format by simply subtracting the offset (in this case 2048):
adcBuff[i] - 2048
where adcBuff[] is an array of signed 32-bit integers (int32_t). We then let the compiler convert the result into floating-point format by assigning it to a floating-point variable. The complete statement would be
float32Buff[i] = adcBuff[i] - 2048;
where float32Buff[] is an array of 32-bit floating-point (float32_t) data.
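As a minimal sketch of the loop these statements describe (the buffer names follow the statements above; the function name and use of a caller-supplied length are illustrative):

```c
#include <assert.h>
#include <stdint.h>

typedef float float32_t;  /* as typedef'd by CMSIS-DSP */

/* Convert 12-bit offset-binary ADC samples to float by subtracting the
   mid-scale offset (2048); the compiler handles the int-to-float
   conversion in the assignment. */
static void offset_binary_to_float(const int32_t *adcBuff,
                                   float32_t *float32Buff,
                                   uint32_t length)
{
    for (uint32_t i = 0; i < length; i++) {
        float32Buff[i] = adcBuff[i] - 2048;
    }
}
```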
If the ADC data is in 12-bit two's complement code, it can be converted to 32-bit two's complement format by sign-filling bits 12 to 31 (replicating bit 11 into bits 12 to 31). Sign-filling can be done by (logically) shifting the 12-bit ADC data 20 bit positions to the left and then performing an arithmetic shift right by 20 bit positions:
(adcBuff[i] << 20) >> 20
Note that there is no test for the state of bit 11. I can't clearly recall whether I have previously encountered a compiler which cancels (generates no code for) opposing shifts like this. Including the assignment to a floating-point variable, the complete statement would be
float32Buff[i] = (adcBuff[i] << 20) >> 20;
If the compiler produces no code for the opposing shifts, you should devise a way to circumvent that, or you can resort to an alternative method: transform the ADC data from 12-bit two's complement to 12-bit offset binary format, then convert to 32-bit two's complement as described above for offset binary. Two's complement code can be converted to offset binary format by simply complementing the leftmost bit (in this case bit 11):
adcBuff[i] ^ 0X00000800
The complete conversion statement would be
float32Buff[i] = (adcBuff[i] ^ 0X00000800) - 2048;
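A small self-contained check of the two conversions described above (the shift-based sign-fill and the XOR-to-offset-binary route); the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Sign-fill method: shift left by 20, arithmetic shift right by 20.
   (Formally, left-shifting into the sign bit is not guaranteed by the
   C standard, but ARM compilers behave as described above.) */
static int32_t sign_fill(int32_t raw12)
{
    return (raw12 << 20) >> 20;
}

/* XOR method: flip bit 11 (two's complement -> offset binary),
   then subtract the 2048 offset. */
static int32_t xor_offset(int32_t raw12)
{
    return (raw12 ^ 0x00000800) - 2048;
}
```

For any 12-bit input the two methods produce the same 32-bit two's complement value, so either can feed the float assignment.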
Hi goodwin, thank you very much for all your replies. You're right; over the past few days I figured out that fixed-point maths is probably the way to do what I'm trying to do. I read your reply about the float approach, and even if I don't have the time to try it out right now (I "lost", actually used, too much time and I need to move forward since all this is for my thesis!), I got the idea and I will for sure come back to your code in the future.
I just want to give some thoughts and suggestions which might be helpful to you.
Extremely helpful. What I need most right now is insight from a skilled eye to point out possible limitations of my design. I'm not experienced, so I read some chapters of this interesting book: The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors, Third Edition, by Joseph Yiu (ISBN 9780124080829).
It contains lots of useful information about the cycles needed for each operation, and I'll try to do rough calculations of how many of them my code needs. I hoped I could squeeze things to fit within that 5-cycle boundary (which I was aware of from the beginning), but, as I understand it, I might be wrong. Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data while nothing is being sampled. What do you think about it?
Again, thank you very much: as I said I'm developing this alone and for the first time, therefore your help is highly appreciated.
Cheers,
Andrea
So, yesterday I got some probes that I was missing last week, and I was able to take some time measurements on the board while it was working in continuous mode at 40 Msps. It turned out that, using a ping-pong buffer of 1024 samples each, the LPC4370 at 204 MHz needs roughly 37 instructions per sample (instead of the 5 I actually have) just for arm_shift_q31 and arm_max_q31. I tried the q15 versions of the functions, but strangely they didn't give any performance boost (even though they're reported to use SIMD instructions). Do you think that rebuilding my CMSIS library with a different optimization level is likely to be a serious improvement?
Regards, Andrea
Good post goodwin!
goodwin wrote:
(adcBuff[i] << 20) >> 20
I just can't clearly recall if I previously encountered a compiler which cancels opposing shifts like this.
Such compilers are faulty and would generate incorrect code.
The compiler is free to optimize the code, however, making the code behave incorrectly is not allowed.
An exception: If the resulting value of the shift is unused, then of course the compiler is allowed to remove it completely.
Another exception: If the compiler can 'see' that only the low 12 bits are used anyway, it is free to ignore the line completely (GCC is capable of doing that, as the optimizer is very sophisticated).
I would expect an optimizing compiler to reduce the above to a single bitfield-extract instruction:
... for signed values ...
sbfx rT,rS,0,12
... and for unsigned values ...
ubfx rT,rS,0,12
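For reference, these are the C patterns that typically compile down to those single instructions at -O2 (the helper names are mine; the exact code generation depends on compiler and options):

```c
#include <assert.h>
#include <stdint.h>

/* Signed 12-bit extract: usually becomes  sbfx rT, rS, #0, #12 */
static inline int32_t extract_s12(int32_t x)
{
    return (x << 20) >> 20;
}

/* Unsigned 12-bit extract: usually becomes  ubfx rT, rS, #0, #12 */
static inline uint32_t extract_u12(uint32_t x)
{
    return x & 0xFFFu;
}
```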
-I know that GCC works well with regard to optimization and code reduction.
abet wrote:
Do you think that trying to rebuild my CMSIS library with a different optimization level will likely be a serious improvement?
It certainly could improve things if the code is currently optimized for size or unoptimized.
If you use a pre-compiled library, then the library is most likely built for optimal performance; however, if the code is executing from flash memory, I think it might be worth moving it to SRAM.
What I would recommend is to put the code in a "ramcode" section and, if you rebuild, optimize for highest speed.
In addition, I would recommend running any other time-critical code from SRAM; however, make sure your code resides in a different section of RAM than the section your DMA will access. This is very important.
Executing code from SRAM will give you a huge performance increase on an LPC4370.
-But if the DMA and CPU fight over who's going to use the SRAM section, you might end up getting worse performance than before.
So make sure that the two sections are independent.
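A GCC-style sketch of what putting code in a "ramcode" section can look like; the section name ".ramfunc", the routine itself, and the required linker-script support are assumptions — check your toolchain (MCUXpresso, for example, provides its own ramfunc macros):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: place a hot routine in a ".ramfunc" section so
   the startup code can copy it to SRAM before use. The attribute syntax
   is GCC/Clang; the section must also be handled in your linker script. */
__attribute__((section(".ramfunc"), noinline))
static int32_t peak_abs(const int32_t *buf, uint32_t n)
{
    int32_t peak = 0;
    for (uint32_t i = 0; i < n; i++) {
        int32_t v = (buf[i] < 0) ? -buf[i] : buf[i];
        if (v > peak) {
            peak = v;
        }
    }
    return peak;
}
```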
Update:
From what I can see in your disassembled code, you're using hardware floating point.
This is good for performance.
The instructions ... vmov, vneg.f32, vstr, ... (all those starting with 'v') are floating point instructions.
(unfortunately the code is messed up quite a bit and it seems it's not the full subroutine).
Indeed, the optimization that goodwin mentioned will improve performance dramatically!
The problem with floating point and interrupt service routines (ISRs) is that any floating-point register used in the ISR needs to be saved before it is used and restored before the interrupt ends.
-This is because, if you're using floating-point registers anywhere else in your program, their values would be corrupted at random times if the registers are changed by the interrupt.
However... If you're using floating point only in the interrupt, you won't have any problems.
[Note: sometimes it is worth using hardware floating point in the ISR and software floating point at task-time; e.g. if you're only using software floating point for printing out values once per second, that would be the optimal solution. Mixing software and hardware floating point is an advanced topic, however, and is not recommended in the first 3 days of your life as a programmer].
Building upon goodwin's answer, I'd like to suggest a complete loop:
void Twos2Dec_Remapp(const uint32_t *twosBuff, float32_t *decBuff, uint32_t buffLength)
{
    int32_t i;
    float32_t *d;
    const uint32_t *s;

    s = &twosBuff[buffLength];
    d = &decBuff[buffLength];
    i = -(int32_t)buffLength;                   /* convert length to negative index for speed */
    do {
        d[i] = (int32_t)(s[i] & 0xfff) - 2048;  /* most likely, this is the correct calculation */
    } while (++i);
}
I believe the above would produce the optimal binary code (by 'hinting' the compiler how).
The most expensive part is the conversion to float! In fact, I think it's a very bad idea to use floats if you don't have hardware floating point.
Imagine that for each float operation, a huge block of code is executed. Each line of code usually takes 1 or 2 clock cycles (unless we're talking about division), so if you can, use fixed point.
Converting to 16:16 fixed point is really easy; you just need to change the type 'float *' to 'int32_t *' and the shift operation to ...
d[i] = (s[i] << 20) >> 4;
-Because fixed point is just an "integer part" and a "fractional part". The integer part is the same as your integer value, the fractional part is 0.
So in hexadecimal, a 16:16 fixed-point value would look like this:
0xiiiipppp
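As a sketch, the complete 12-bit-to-16:16 conversion as a helper (the raw sample is assumed to be two's complement in the low 12 bits; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Shift the 12-bit sample to the top (bit 31), then arithmetic-shift
   right by 4: the net effect is sign extension plus a left shift of 16,
   i.e. the sample value in 16:16 fixed point. (Arithmetic right shift
   of a negative int32_t is implementation-defined in C, but behaves as
   described here on ARM compilers.) */
static int32_t to_fix16(uint32_t raw12)
{
    return (int32_t)(raw12 << 20) >> 4;
}
```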
-Also make sure that your destination buffer is not in the SRAM section the DMA is using, in order to avoid disturbing the DMA.
The larger the DMA buffer, the larger the 'propagation delay' will be (i.e. the time between input and output).
If your data is output in real-time, you will want a small DMA buffer.
But if the DMA buffer is very small, the CPU will spend a lot of time executing non-essential code (such as entering/leaving subroutines, instead of actually working on the data).
-So you'll need to find the right balance, and when you've found a size that juuust works, give it a little more room; 40% extra is often a good choice. I would not recommend less than 10% extra.
abet wrote:
I read your reply about the float thought, and even if I don't have the time now to try it out (I "lost", actually used, too much time and I need to move forward since all this is for my thesis!), I got the idea and for sure I'd be back on your code in the future.
In case stray upper bits are part of what got you lost, here are the complete statements, with the unused bits (12 to 31) masked off:
float32Buff[i] = (adcBuff[i] & 0X00000FFF) - 2048;
float32Buff[i] = ((adcBuff[i] & 0X00000FFF) ^ 0X00000800) - 2048;
the order of the AND and XOR can be interchanged:
float32Buff[i] = ((adcBuff[i] ^ 0X00000800) & 0X00000FFF) - 2048;
Note that no change is needed to the shift-based statement, since the initial contents of bits 12 to 31 are lost with the shift to the left. The third statement is just to show how to convert two's complement to offset binary format. On the LPC4370, converting the ADC samples from two's complement to offset binary format in software is not needed, since the ADC can be configured to output data in offset binary format.
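A quick check of the claim that the order of AND and XOR can be interchanged (it holds because the 0x800 bit is inside the 0xFFF mask); the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Both orderings clear bits 12-31 and flip bit 11, so the resulting
   value is identical for any 32-bit input. */
static int32_t and_then_xor(uint32_t x)
{
    return (int32_t)((x & 0xFFFu) ^ 0x800u) - 2048;
}

static int32_t xor_then_and(uint32_t x)
{
    return (int32_t)((x ^ 0x800u) & 0xFFFu) - 2048;
}
```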
Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
I think that's the way to do it, but as I stated, I am not an expert. I would further suggest that you look at LabTool's documentation and open-source software. LabTool is also based on the LPC4370/LPC-Link 2, so I hope you find it helpful for your thesis.
Such compilers are faulty and would generate incorrect code. The compiler is free to optimize the code; however, making the code behave incorrectly is not allowed.
If there really is a compiler that behaves like that, its author unintentionally neglected the signed (integral) nature of the variable.
Jens, Andrea is using the LPC-Link 2, which is based on the LPC4370, a flashless MCU; a quad SPI flash memory is used on this board. Since the fastest code execution is sought, copying to RAM rather than executing in place is imperative (I presume this is what Andrea is doing).
Since processing speed is your primary requirement, Q15 is the optimal data type for your samples with CMSIS-DSP. However, if you are currently comfortable with Q31, you can continue to work with it, since you are not yet in the final stage of your project. If you eventually decide to use Q15, it will be helpful to configure the FIFO to pack 2 samples per word. This doubles the number of samples transferred per word or, from another perspective, halves the number of words to transfer for a given number of samples (reducing the size of the DMA transfer). It might also help format the data for SIMD.
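A sketch of unpacking two 12-bit two's complement samples from one 32-bit FIFO word into q15_t values (the low-half-first layout and the left-justification to q15 are assumptions; check the ADCHS FIFO packing configuration and the scaling your DSP chain expects):

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;  /* as typedef'd by CMSIS-DSP */

/* Shift each 12-bit field up by 4 so its sign bit lands in bit 15,
   left-justifying the sample as a q15 value (i.e. sample * 16). */
static q15_t unpack_lo(uint32_t w)
{
    return (q15_t)((w & 0xFFFu) << 4);
}

static q15_t unpack_hi(uint32_t w)
{
    return (q15_t)(((w >> 16) & 0xFFFu) << 4);
}
```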
Hi,
I did not follow the entire thread, but since my name is quoted here, I thought I would take a look!
Regarding speed comparisons and optimization issues, can you tell us which compiler you are using, and with which options?
If you link with the CMSIS-DSP library, which is most likely built with the Arm compiler (5? 6?) at high optimization, your code may not be comparable in terms of speed if built with, say, GCC!
If I understood your need properly, you need to detect the min and max over a buffer of 12-bit signed integers?
Can you show your computation code ?