Hi all, I have firmware running on an NXP LPC-Link 2 (LPC4370: 204 MHz Cortex-M4 MCU) board which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA I'm saving the ADC data, which I get in two's complement format (offset binary is also available), in a uint32_t buffer, and I try to prepare it for the CMSIS DSP functions by converting the buffer into float32_t: here's where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could even use these formats if that could save me precious time. It feels like I'm missing something important about this step; that's no surprise, since this is my first project on a complex MCU. Any help/hint/advice would be highly appreciated and would help me in my thesis.
I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.
Thanks in advance!
This is about the code that you posted in NXP Community.
Your code temporarily converts the negative ADC values to positive (by "manual" two's complementing) and eventually converts back to negative by multiplying by -1. I suggest that you explore alternative, potentially more efficient, ways of converting from 12-bit signed integer to floating-point format.
A possible way of converting the ADC data from 12-bit signed integer to 32-bit floating-point format is by transforming to 32-bit two's complement code then assigning the result to a floating-point variable (simply let the compiler do the rest of the work).
If the ADC data is in (12-bit) offset binary code, it can be converted to 32-bit two's complement format by simply subtracting the offset (in this case the offset is 2048):
adcBuff[i] - 2048
where adcBuff[] is an array of signed 32-bit integers (int32_t). We then let the compiler convert the result into floating-point format by assigning it to a floating-point variable. The complete statement would be
float32Buff[i] = adcBuff[i] - 2048;
where float32Buff[] is an array of 32-bit floating-point (float32_t) data.
If the ADC data is in 12-bit two's complement code it can be converted to 32-bit two's complement format by simply sign-filling bits 12 to 31 (replicating bit 11 into bits 12 to 31). Sign-filling can be done by (logical) shifting the 12-bit ADC data 20 bit positions to the left then performing an arithmetic shift right by 20 bit positions
(adcBuff[i] << 20) >> 20
Note that there is no test for the state of bit 11. I just can't clearly recall whether I have previously encountered a compiler which cancels (generates no code for) opposing shifts like this. Including the assignment to a floating-point variable, the complete statement would be
float32Buff[i] = (adcBuff[i] << 20) >> 20;
If the compiler produces no code for the opposing shifts, you should devise a way to circumvent that, or you can resort to alternative methods. An alternative method of converting the ADC data is to transform from 12-bit two's complement to 12-bit offset binary format; the subsequent conversion to 32-bit two's complement can then be done as described above for the offset binary format. Two's complement code can be converted to offset binary format by simply complementing the leftmost bit (in this case bit 11):
adcBuff[i] ^ 0X00000800
The complete conversion statement would be
float32Buff[i] = (adcBuff[i] ^ 0X00000800) - 2048;
Hi goodwin, thank you very much for all your replies. You're right; these days I figured out that fixed-point math is probably the way what I'm trying to do should be done. I read your reply about the float, though, and even if I don't have the time now to try it out (I "lost", actually used, too much time and I need to move forward, since all this is for my thesis!), I got the idea, and for sure I'll be back on your code in the future.
I just want to give some thoughts and suggestions which might be helpful to you.
Extremely helpful. What I need most right now is insight from a skilled eye to point out possible limitations of my design. I'm not experienced, so I read some chapters of this interesting book: The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors, Third Edition, by Joseph Yiu (ISBN 9780124080829).
There is lots of useful information here about the cycles needed for each operation, and I'll try to do rough calculations about how many of them my code needs. I hoped I could squeeze things to fit within that 5-cycle boundary (which I was aware of since the beginning), but, as I understand it, I might be wrong. Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
Again, thank you very much: as I said I'm developing this alone and for the first time, therefore your help is highly appreciated.
Cheers,
Andrea
I read your reply about the float, though, and even if I don't have the time now to try it out (I "lost", actually used, too much time and I need to move forward, since all this is for my thesis!), I got the idea, and for sure I'll be back on your code in the future.
These are the possible reasons why you were lost:
float32Buff[i] = (adcBuff[i] & 0X00000FFF) - 2048;
float32Buff[i] = ((adcBuff[i] & 0X00000FFF) ^ 0X00000800) - 2048;
The order of the AND and XOR can be interchanged:
float32Buff[i] = ((adcBuff[i] ^ 0X00000800) & 0X00000FFF) - 2048;
Note that there is no change to the second statement; the initial settings of bits 12 to 31 will be lost with the shift to the left. The third statement is just for showing how to convert two's complement to offset binary format. In the LPC4370, conversion of the ADC samples from two's complement to offset binary format in software is not needed, since the ADC can be configured to output data in offset binary format.
Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
I think that's the way to do it, but as I stated, I am not an expert. I would further suggest that you look at LabTool's documentation and open-source software. LabTool is also based on the LPC4370/LPC-Link 2, so I hope you find it helpful for your thesis.
Hi,
I did not follow the entire subject, but since my name is quoted here, I thought I would take a look!
Regarding speed comparison and optimization issues, can you tell me which compiler you are using, with which options?
If you link with the CMSIS-DSP library, it is most likely built with the ARM compiler (5? 6?) with high optimizations. Your code may not be comparable in terms of speed if built with, say, GCC!
If I understood your need properly, you need to detect the min and max over a buffer of 12-bit signed integers?
Can you show your computation code ?
First of all, I'll talk about what's (in theory) slow about your shift operations.
Normally, the compiler should figure this out by itself, but if you've turned off optimization, it's not going to happen.
*pSrc = (*pSrc) << shiftBits;
*pSrc = (*pSrc) >> shiftBits;
We don't need to go down to the assembly-level.
This is what happens in unoptimized code:
1: Read a value from memory
2: Shift a value by n positions
3: Store a value in memory
4: Read a value from memory
5: Shift a value by n positions
6: Store a value in memory
Reading a value from memory requires 2 clock cycles.
Shifting the value to the left or right requires 1 clock cycle.
Storing the value requires 1 clock cycle.
If the code is not optimized by the compiler, then the code can be improved by removing step 3 and step 4.
As I cannot see the full loop, I can't give a full suggestion on optimizing, except for what I've written earlier.
My earlier suggestion adapted to shifting:
void ...(...)
{
    register int32_t i;   /* index */
    register int32_t *d;  /* destination (seems to be the same as the source) */

    d = (int32_t *)pSrc + length; /* point d to the end of the array */
    i = -length;                  /* convert length of array to a negative index */
    do
    {
        d[i] = (d[i] << shiftCount) >> shiftCount;
    } while (++i);                /* increment index and keep going until i wraps to 0 */
}
In other words: do not split the shift into several 'stages'; it might impact performance, as the code could grow.
If the code is built as I planned, it would result in something like this:
    adds.n  r0, r0, r1, lsl #2   /* point r0 to the end of the array */
    rsbs.n  r2, r1, #0           /* index = -length (in words) */
    lsls.n  r2, r2, #2           /* scale the index to a byte offset */
loop:
    ldr.n   r3, [r0, r2]         /* [2]     get 12-bit ADC value */
    sbfx.w  r3, r3, #0, #12      /* [1]     sign-extend it to 32 bits */
    str.n   r3, [r0, r2]         /* [1]     store the sign-extended result */
    adds.n  r2, r2, #4           /* [1]     increment index */
    bne.n   loop                 /* [1/1+P] go round loop until index wraps to 0 */
The numbers in square brackets are how many clock-cycles I expect the code to spend.
The last one has the format [branch not taken/branch taken], where P is 'prefetch'; P is a value between 1 and 3 (normally 1).
That means if your code runs from SRAM, then per sample, it should cost 7 clock cycles if P is 1.
We'll add an extra clock cycle so we won't be disappointed. Dividing 204 MHz by 8 clock cycles allows us to process about 25 million samples per second.
However! You also need to remember that the DSP needs to process the data.
In addition to using the above loop, I recommend changing the optimization level.
If I understand correctly, LPCXpresso uses GCC, and if that's the case, it's easy for me to tell you how to change the optimization level:
In case you're able to run your gcc from the command-line, try this:
arm-none-eabi-gcc --help=optimizers
It will give you a long list of optimization options, but the important one is the following: -O<number>.
Normally I use -Os (for size optimization), but that's not what you want in this case!
-Ofast is another way of saying you want fast code; here's the description: "Optimize for speed disregarding exact standards compliance".
From what I can see on NXP's web-site, you need to specify the setting inside the IDE; here's where they say it is:
Project -> Properties -> C/C++ Build -> Settings -> Tool Settings -> MCU C Compiler -> Optimization -> Optimization Level
I recommend first trying -O3
I usually write my own code in a way, so that even when optimization is disabled, the code is almost just as efficient.
The most important thing you can do is to get the optimization working; it should improve performance very much, especially if the compiler unrolls loops (unrolling means more operations per branch, or fewer branches per operation; take your pick).
Thanks for the detailed reply, Jens. Right now I'm doing the sign-alignment stuff inside Thibaut's function (https://www.m4-unleashed.com/parallel-comparison/ ), which is called during the DMA's Transfer Completed ISR. Here's my code:
uint32_t MAXmin;
int16_t sample[NUM_SAMPLE] = {0};
int16_t sample2[NUM_SAMPLE] = {0};
uint16_t shiftBits = 4;
uint16_t wordLenght = 8; /* Figured out looking at the register addresses while debugging */

uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)
{
    uint32_t data, min, max;
    int16_t data16;

    /* max variable will hold two maxes: one in each 16-bit half;
     * same thing for min */

    /* Sign extension */
    *pSrc = (*pSrc) << shiftBits;
    *pSrc = (*pSrc) >> shiftBits;
    *(pSrc + wordLenght) = (*(pSrc + wordLenght)) << shiftBits;
    *(pSrc + wordLenght) = (*(pSrc + wordLenght)) >> shiftBits;

    /* Load the first two samples in one 32-bit access */
    data = *__SIMD32(pSrc)++;

    /* Initialize min and max to these first samples */
    min = data;
    max = data;

    /* Decrement sample count */
    pSize -= 2;

    /* Loop as long as at least two samples remain */
    while (pSize > 1)
    {
        /* Sign extension */
        *pSrc = (*pSrc) << shiftBits;
        *pSrc = (*pSrc) >> shiftBits;
        *(pSrc + wordLenght) = (*(pSrc + wordLenght)) << shiftBits;
        *(pSrc + wordLenght) = (*(pSrc + wordLenght)) >> shiftBits;

        /* Load the next two samples in a single access */
        data = *__SIMD32(pSrc)++;

        /* Parallel comparison of max and new samples */
        (void)__SSUB16(max, data);
        /* Select max on each 16-bit half */
        max = __SEL(max, data);

        /* Parallel comparison of new samples and min */
        (void)__SSUB16(data, min);
        /* Select min on each 16-bit half */
        min = __SEL(min, data);

        pSize -= 2;
    }

    /* Now we have the maximum of the even samples in the low halfword of max
     * and the maximum of the odd samples in the high halfword */

    /* Look for max between halfwords 1 & 0 by comparing on the low halfword */
    (void)__SSUB16(max, max >> 16);
    /* Select max on the low 16 bits */
    max = __SEL(max, max >> 16);

    /* Look for min between halfwords 1 & 0 by comparing on the low halfword */
    (void)__SSUB16(min >> 16, min);
    /* Select min on the low 16 bits */
    min = __SEL(min, min >> 16);

    /* Test if odd number of samples */
    if (pSize > 0)
    {
        data16 = *pSrc;
        /* Look for max on the low halfword */
        (void)__SSUB16(max, data16);
        /* Select max on the low 16 bits */
        max = __SEL(max, data16);
        /* Look for min on the low halfword */
        (void)__SSUB16(data16, min);
        /* Select min on the low 16 bits */
        min = __SEL(min, data16);
    }

    /* Pack result: min in the low halfword, max in the high halfword */
    return __PKHBT(min, max, 16); /* PKHBT documentation */
}
At line 33 there's the bit extension.
Great analysis on the clocks per sample, Jens!
Sounds great! How can I be sure this is happening?
Speaking of the compiler, this is the output of my actual configuration in LPCXpresso (S2D.c is the file containing the code we are talking about):
arm-none-eabi-gcc -nostdlib -L"/home/abet/LPCXpresso/link2_2/lpc_board_nxp_lpclink2_4370/Debug" -L"/home/abet/LPCXpresso/link2_2/lpc_chip_43xx/Debug" -L"/home/abet/LPCXpresso/link2_2/CMSIS_DSPLIB_CM4/lib" -Xlinker -Map="S2D.map" -Xlinker --gc-sections -Xlinker -print-memory-usage -mcpu=cortex-m4 -mthumb -T "S2D_Debug.ld" -o "S2D.axf" ./src/S2D.o ./src/cr_startup_lpc43xx.o ./src/crp.o ./src/sysinit.o -llpc_board_nxp_lpclink2_4370 -llpc_chip_43xx -lCMSIS_DSPLIB_CM4
Memory region Used Size Region Size %age Used
RamLoc128: 6688 B 128 KB 5.10%
RamLoc72: 0 GB 72 KB 0.00%
RamAHB32: 0 GB 32 KB 0.00%
RamAHB16: 0 GB 16 KB 0.00%
RamAHB_ETB16: 0 GB 16 KB 0.00%
RamM0Sub16: 0 GB 16 KB 0.00%
RamM0Sub2: 0 GB 2 KB 0.00%
SPIFI: 13668 B 4 MB 0.33%
Also,
arm-none-eabi-gcc --version
gives:
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.2.1 20151202 (release) [ARM/embedded-5-branch revision 231848]
Looking at the project properties as suggested by Jens, I found out that I had no optimization level set here:
So I'm going to turn this on and implement the sign extension inside Thibaut's function the Jens way, and see if I get some good news!
Hi Thibaut, indeed I thought that since I'm evaluating your code it was proper to link your blog! And of course any hint from you would be welcome. I tried to reply to your questions in the section below, trying to merge your answer with Jens's.
That's right! And I believe that your implementation will save me a lot of computational time: I quoted the code in the answer below.