Hi to you all, I have a firmware running on an NXP LPC-Link 2 board (LPC4370: 204 MHz Cortex-M4 MCU) which basically does this:
My problem is that my code is too slow, and every now and then an overwrite occurs.
Using the DMA I'm saving the ADC data, which I get in two's complement format (offset binary is also available), in a uint32_t buffer, and I try to prepare it for the CMSIS DSP functions by converting the buffer into float32_t: here's where the overwrite occurs. It's worth saying that I'm currently using software floating point, not hardware.
The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could even use these formats if that could save me precious time. It feels like I'm missing something important about this step; that's no surprise since this is my first project on a complex MCU. Any help/hint/advice would be highly appreciated and would help me in my thesis.
I'll leave here the link for the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community .
Thanks in advance!
This is about the code that you posted in NXP Community.
Your code temporarily converts the negative ADC values to positive (by "manual" two's complementing) and eventually you convert back to negative by multiplying by -1. I suggest that you explore alternative, potentially more efficient, ways of converting from 12-bit signed integer to floating-point format.
A possible way of converting the ADC data from 12-bit signed integer to 32-bit floating-point format is by transforming to 32-bit two's complement code then assigning the result to a floating-point variable (simply let the compiler do the rest of the work).
If the ADC data is in (12-bit) offset binary code it can be converted to 32-bit two's complement format by simply subtracting the offset (in this case the offset is 2048)
adcBuff[i] - 2048
where adcBuff[] is an array of signed 32-bit integer (int32_t). We then let the compiler convert it into floating-point format by assigning to a floating-point variable. The complete statement would be
float32Buff[i] = adcBuff[i] - 2048;
where float32Buff[] is an array of 32-bit floating-point (float32_t) data.
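Put together as a loop, the offset-binary path might look like this (a minimal sketch; the function name is mine, and the float32_t typedef is only here to keep the snippet self-contained, since CMSIS normally provides it):

```c
#include <stdint.h>

typedef float float32_t; /* normally supplied by CMSIS */

/* Convert 12-bit offset-binary ADC samples to float: subtract the 2048
 * offset, then let the compiler do the int-to-float conversion. */
static void OffsetBinaryToFloat(const int32_t *adcBuff,
                                float32_t *float32Buff, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        float32Buff[i] = adcBuff[i] - 2048;
}
```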
If the ADC data is in 12-bit two's complement code it can be converted to 32-bit two's complement format by simply sign-filling bits 12 to 31 (replicating bit 11 into bits 12 to 31). Sign-filling can be done by (logical) shifting the 12-bit ADC data 20 bit positions to the left then performing an arithmetic shift right by 20 bit positions
(adcBuff[i] << 20) >> 20
Note that there is no test for the state of bit 11. I just can't clearly recall if I previously encountered a compiler which cancels (no corresponding code generated) opposing shifts like this. Including the assignment to a floating-point variable, the complete statement would be
float32Buff[i] = (adcBuff[i] << 20) >> 20;
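Wrapped as a helper for a quick check (a sketch; note that left-shifting into the sign bit of a signed int is technically implementation-defined in C, but ARM-targeting compilers such as GCC give the expected arithmetic behavior):

```c
#include <stdint.h>

/* Sign-extend a 12-bit two's-complement value held in the low 12 bits:
 * the left shift places bit 11 at bit 31, and the arithmetic right shift
 * replicates it down through bits 12 to 31. */
static int32_t SignExtend12(int32_t raw)
{
    return (raw << 20) >> 20;
}
```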
If the compiler produces no code for the opposite shifts, you should devise a way to circumvent that or you can resort to alternative methods. An alternate method of converting the ADC data is to transform from 12-bit two's complement to 12-bit offset binary format. The subsequent conversion to 32-bit two's complement can be done as described above for the offset binary format. Two's complement code can be converted to offset binary format by simply complementing the leftmost bit (in this case bit 11)
adcBuff[i] ^ 0X00000800
The complete conversion statement would be
float32Buff[i] = (adcBuff[i] ^ 0X00000800) - 2048;
Hi goodwin, thank you very much for all your replies. You're right; in these days I figured out that fixed-point maths is probably the way to do what I'm trying to do. I read your reply about the float approach though, and even if I don't have the time now to try it out (I "lost", actually used, too much time and I need to move forward since all this is for my thesis!), I got the idea and for sure I'll be back on your code in the future.
I just want to give some thoughts and suggestions which might be helpful to you.
Extremely helpful. What I need most right now is insight from a skilled eye to point out possible limitations of my design. I'm not experienced, so I read some chapters of this interesting book: The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors, Third Edition: Joseph Yiu: 9780124080829: Amazon.com:…
Here there is lots of useful information about the cycles needed for each operation, and I'll try to do rough calculations about how many of them my code needs. I hoped I could squeeze things to fit within that 5-cycle boundary (which I was aware of since the beginning), but, as I understand, I might be wrong. Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set, and processing data when no info is being sampled. What do you think about it?
Again, thank you very much: as I said I'm developing this alone and for the first time, therefore your help is highly appreciated.
Cheers,
Andrea
Good post goodwin!
goodwin wrote:
(adcBuff[i] << 20) >> 20
I just can't clearly recall if I previously encountered a compiler which cancels opposing shifts like this.
Such compilers are faulty and would generate incorrect code.
The compiler is free to optimize the code, however, making the code behave incorrectly is not allowed.
An exception: If the resulting value of the shift is unused, then of course the compiler is allowed to remove it completely.
Another exception: If the compiler can 'see' that only the low 12 bits are used anyway, it is free to ignore the line completely (GCC is capable of doing that, as the optimizer is very sophisticated).
I would expect that an optimizing compiler would reduce the above to a single instruction:
... for signed values ...
sbfx rT,rS,0,12
... and for unsigned values ...
ubfx rT,rS,0,12
-I know that GCC is working well, regarding optimizing and code reduction.
Update:
From what I can see in your disassembled code, you're using hardware floating point.
This is good for performance.
The instructions ... vmov, vneg.f32, vstr, ... (all those starting with 'v') are floating point instructions.
(unfortunately the code is messed up quite a bit and it seems it's not the full subroutine).
Indeed, the optimization that goodwin mentioned will improve performance dramatically!
The problem with floating point and Interrupt Service Routines (ISR) is that if you use a floating point register in the ISR, it will need to be saved before it's used and restored before the interrupt is ended.
-This is because if you're using floating point registers anywhere in the other parts of your program, the values would be messed up randomly if the registers are changed by the interrupt.
However... If you're using floating point only in the interrupt, you won't have any problems.
[Note: sometimes it would be worth it to use hardware floating point in ISR and software floating point at task-time; eg. if you're only using software floating point for printing out values once per second, that would be the optimal solution. Mixing software and hardware floating point is an advanced topic, however and is not recommended the first 3 days of your life as a programmer].
Building upon goodwin's answer, I'd like to suggest a complete loop:
void Twos2Dec_Remapp(const uint32_t *twosBuff, float32_t *decBuff, uint32_t buffLength)
{
    register int32_t i;
    register float32_t *d;
    register const uint32_t *s;

    s = &twosBuff[buffLength];
    d = &decBuff[buffLength];
    i = -(int32_t)buffLength; /* convert length to negative index for speed */
    do
    {
        d[i] = (int32_t)(s[i] & 0xfff) - 2048; /* most likely, this is the correct calculation */
    } while (++i);
}
I believe the above would produce the optimal binary code (by 'hinting' the compiler how).
The most expensive part is to convert to a float! In fact, I think it's a very bad idea using floats if you don't have hardware floats.
Imagine that for each float operation, a huge block of code is executed. Each line of code usually takes 1 or 2 clock cycles (unless we're speaking about dividing), so if you can, use fixed point.
Converting to a 16:16 fixed point is real easy; you just need to change the type 'float *' to 'int32_t *' and the shift operations to ...
d[i] = ((int32_t)(s[i] << 20)) >> 4;
-Because fixed point is just an "integer part" and a "fractional part". The integer part is the same as your integer value, the fractional part is 0.
So in hexadecimal, a 16:16 fixed point would look like this:
0xiiiipppp
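As a sketch (the helper name is mine): for a 12-bit two's-complement sample, sign-align at bit 31 with a left shift of 20, then arithmetic-shift right by 4, which leaves the signed integer part in the upper halfword and a zero fractional part in the lower one:

```c
#include <stdint.h>

/* Convert a raw 12-bit two's-complement sample (low 12 bits of 'raw')
 * to 16:16 fixed point: 0xiiiipppp with the fraction pppp = 0. */
static int32_t SampleToFix16(uint32_t raw)
{
    return ((int32_t)(raw << 20)) >> 4;
}
```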
-Also make sure that your destination buffer is not in the SRAM section, which the DMA is using, in order to avoid disturbing the DMA.
The larger the DMA buffer is, the larger the 'propagation delay' will be (eg. time between input and output)
If your data is output in real-time, you will want a small DMA buffer.
But if the DMA buffer is very small, the CPU will spend a lot of time executing non-essential code (such as entering/leaving subroutines, instead of actually working on the data).
-So you'll need to find the right balance and when you've found a size that juuust works, give it a little more room; 40% extra is often a good choice. I would not recommend less than 10% extra.
I read your reply about the float thought, and even if I don't have the time now to try it out (I "lost", actually used, too much time and I need to move forward since all this is for my thesis!), I got the idea and for sure I'd be back on your code in the future.
These are the possible reasons why you were lost:
float32Buff[i] = (adcBuff[i] & 0X00000FFF) - 2048;
float32Buff[i] = ((adcBuff[i] & 0X00000FFF) ^ 0X00000800) - 2048;
the order of AND and XOR can be interchanged
float32Buff[i] = ((adcBuff[i] ^ 0X00000800) & 0X00000FFF) - 2048;
Note that there is no change to the second statement; the initial settings of bits 12 to 31 will be lost with the shift to the left. The third statement is just for showing how to convert two's complement to offset binary format. In LPC4370, conversion of the ADC samples from two's complement to offset binary format in software is not needed since the ADC can be configured to output data in offset binary format.
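The last two statements can be checked exhaustively against each other over all 4096 codes (a sketch; the helper names are mine):

```c
#include <stdint.h>

/* Two's-complement input, AND first then XOR */
static int32_t TwosToInt_A(uint32_t adc)
{
    return (int32_t)((adc & 0x00000FFFu) ^ 0x00000800u) - 2048;
}

/* XOR first then AND: commutes because 0x800 lies inside the 0xFFF mask */
static int32_t TwosToInt_B(uint32_t adc)
{
    return (int32_t)((adc ^ 0x00000800u) & 0x00000FFFu) - 2048;
}
```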
Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
I think that's the way to do it but as I stated I am not an expert. Instead I would further suggest that you access LabTool's documentation and open-source software. LabTool is also based on LPC4370/LPC-Link 2 so I hope you find it helpful to your thesis.
Such compilers are faulty and would generate incorrect code. The compiler is free to optimize the code, however, making the code behave incorrectly is not allowed.
If there is really a compiler that behaves like that, the author unintentionally neglected the signed (integral) nature of the variable.
Hi,
I did not follow the entire subject, but since my name is quoted here, I thought I would take a look!
Regarding speed comparison and optimization issues, can you tell me which compiler you are using, with which options?
If you link with the CMSIS_DSP library, it is most likely built with the ARM compiler (5, 6?) with high optimizations. Your code may not be comparable in terms of speed if built with, let's say, GCC!
If I understood your need properly, you need to detect min and max over a buffer of 12-bit signed integers?
Can you show your computation code?
First of all, I'll talk about what's (in theory) slow about your shift operations.
Normally, the compiler should figure this out by itself, but if you've turned off optimizing, it's not going to happen.
*pSrc = (*pSrc) << shiftBits;
*pSrc = (*pSrc) >> shiftBits;
We don't need to go down to the assembly-level.
This is what happens in unoptimized code:
1: Read a value from memory
2: Shift a value by n positions
3: Store a value in memory
4: Read a value from memory
5: Shift a value by n positions
6: Store a value in memory
Reading a value from memory requires 2 clock cycles.
Shifting the value to the left or right requires 1 clock cycle.
Storing the value requires 1 clock cycle.
If the code is not optimized by the compiler, then the code can be improved by removing step 3 and step 4.
As I can not see the full loop, I can't give a full suggestion on optimizing; except from what I've written earlier.
My earlier suggestion adapted to shifting:
void ...(...)
{
    register int32_t i;  /* index */
    register int32_t *d; /* destination */

    d = &((int32_t *)pSrc)[length]; /* (destination seems to be the same as source) */
    i = -length;                    /* convert length of array to a negative index */
    do
    {
        d[i] = (d[i] << shiftCount) >> shiftCount;
    } while (++i); /* increment index and keep going until i wraps to 0 */
}
In other words: do not split up the shift into several 'stages'; it might impact performance, as the code could grow.
If the code is built as I planned, it would result in something like this:
    adds.n  r0,r0,r1,lsl#2  /* point d to end of array */
    rsbs.n  r2,r1           /* index = -length */
loop:
    ldr.n   r0,[r1,r2]      /* [2] get 12-bit ADC value */
    sbfx.w  r0,r0,#0,#12    /* [1] sign-extend it to 32 bits */
    str.n   r0,[r1,r2]      /* [1] store the sign-extended result */
    adds.n  r2,r2,#4        /* [1] increment index */
    bne.n   loop            /* [1/1+P] go round loop until index wraps to 0 */
The numbers in square brackets are how many clock-cycles I expect the code to spend.
The last one has the format [branch not taken/branch taken], where P is 'prefetch'; P is a value between 1 and 3 (normally 1).
That means if your code runs from SRAM, then per sample, it should cost 7 clock cycles if P is 1.
We'll add an extra clock cycle, so we won't be disappointed. Dividing 204 MHz by 8 clock cycles allows us to process 25.5 million samples per second.
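As a trivial sanity check of the arithmetic (the helper is hypothetical, not from the thread):

```c
#include <stdint.h>

/* Sustained sample rate for a loop, given the core clock and the
 * cycle cost per sample. */
static uint32_t SamplesPerSecond(uint32_t clockHz, uint32_t cyclesPerSample)
{
    return clockHz / cyclesPerSample;
}
```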
However! You also need to remember that the DSP needs to process the data.
In addition to using the above loop, I recommend changing the optimization level.
If I understand correctly, LPCXpresso is using gcc, and if that's the case, then it's easy for me to tell you how to change the optimization level:
In case you're able to run your gcc from the command-line, try this:
arm-none-eabi-gcc --help=optimizers
-It will give you a long list of optimization options, but the following is the important one: -O<number>.
Normally I use -Os (for size optimization), but that's not what you want in this case!
-Ofast is another way of saying you want fast code; here's the description: "Optimize for speed disregarding exact standards compliance".
From what I can see on NXP's web-site, you need to specify the setting inside the IDE; here's where they say it is:
Project -> Properties -> C/C++ Build -> Settings -> Tool Settings -> MCU C Compiler -> Optimization -> Optimization Level
I recommend first trying -O3
I usually write my own code in a way, so that even when optimization is disabled, the code is almost just as efficient.
The most important thing you can do is to get the optimization working; it should improve performance very much; especially if the compiler unrolls loops (unrolling means more operations per branch - or less branches per operation; have your pick).
Thanks for the detailed reply Jens. Right now I'm doing the sign-extension stuff inside Thibaut's function (https://www.m4-unleashed.com/parallel-comparison/ ), which is called during the DMA's Transfer Completed ISR. Here's my code:
uint32_t MAXmin;
int16_t sample[NUM_SAMPLE] = {0};
int16_t sample2[NUM_SAMPLE] = {0};
uint16_t shiftBits = 4;
uint16_t wordLenght = 8; /* Figured out by looking at the registers' addresses while debugging */

uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)
{
    uint32_t data, min, max;
    int16_t data16;

    /* max variable will hold two max: one on each 16-bit half;
     * same thing for min */

    /* Sign extension */
    *pSrc = (*pSrc) << shiftBits;
    *pSrc = (*pSrc) >> shiftBits;
    *(pSrc + wordLenght) = (*(pSrc + wordLenght)) << shiftBits;
    *(pSrc + wordLenght) = (*(pSrc + wordLenght)) >> shiftBits;

    /* Load two first samples in one 32-bit access */
    data = *__SIMD32(pSrc)++;

    /* Initialize Min and Max to these first samples */
    min = data;
    max = data;

    /* decrement sample count */
    pSize -= 2;

    /* Loop as long as there remain at least two samples */
    while (pSize > 1)
    {
        /* Sign extension */
        *pSrc = (*pSrc) << shiftBits;
        *pSrc = (*pSrc) >> shiftBits;
        *(pSrc + wordLenght) = (*(pSrc + wordLenght)) << shiftBits;
        *(pSrc + wordLenght) = (*(pSrc + wordLenght)) >> shiftBits;

        /* Load next two samples in a single access */
        data = *__SIMD32(pSrc)++;

        /* Parallel comparison of max and new samples */
        (void)__SSUB16(max, data);
        /* Select max on each 16-bit half */
        max = __SEL(max, data);

        /* Parallel comparison of new samples and min */
        (void)__SSUB16(data, min);
        /* Select min on each 16-bit half */
        min = __SEL(min, data);

        pSize -= 2;
    }

    /* Now we have the maximum of the even samples on the low halfword of max
     * and the maximum of the odd samples on the high halfword */

    /* look for max between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(max, max >> 16);
    /* Select max on low 16 bits */
    max = __SEL(max, max >> 16);

    /* look for min between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(min >> 16, min);
    /* Select min on low 16 bits */
    min = __SEL(min, min >> 16);

    /* Test if odd number of samples */
    if (pSize > 0)
    {
        data16 = *pSrc;
        /* look for max on low halfwords */
        (void)__SSUB16(max, data16);
        /* Select max on low 16 bits */
        max = __SEL(max, data16);
        /* look for min on low halfword */
        (void)__SSUB16(data16, min);
        /* Select min on low 16 bits */
        min = __SEL(min, data16);
    }

    /* Pack result: Min on low halfword, Max on high halfword */
    return __PKHBT(min, max, 16); /* PKHBT documentation */
}
At line 33 there's the bit extension.
Great analysis about the clock/sample Jens!
Sounds great! How can I be sure this is happening?
Speaking of the compiler, this is the output of my actual configuration in LPCXpresso (S2D.c is the file containing the code we are talking about):
arm-none-eabi-gcc -nostdlib -L"/home/abet/LPCXpresso/link2_2/lpc_board_nxp_lpclink2_4370/Debug" -L"/home/abet/LPCXpresso/link2_2/lpc_chip_43xx/Debug" -L"/home/abet/LPCXpresso/link2_2/CMSIS_DSPLIB_CM4/lib" -Xlinker -Map="S2D.map" -Xlinker --gc-sections -Xlinker -print-memory-usage -mcpu=cortex-m4 -mthumb -T "S2D_Debug.ld" -o "S2D.axf" ./src/S2D.o ./src/cr_startup_lpc43xx.o ./src/crp.o ./src/sysinit.o -llpc_board_nxp_lpclink2_4370 -llpc_chip_43xx -lCMSIS_DSPLIB_CM4
Memory region Used Size Region Size %age Used
RamLoc128: 6688 B 128 KB 5.10%
RamLoc72: 0 GB 72 KB 0.00%
RamAHB32: 0 GB 32 KB 0.00%
RamAHB16: 0 GB 16 KB 0.00%
RamAHB_ETB16: 0 GB 16 KB 0.00%
RamM0Sub16: 0 GB 16 KB 0.00%
RamM0Sub2: 0 GB 2 KB 0.00%
SPIFI: 13668 B 4 MB 0.33%
Also,
arm-none-eabi-gcc --version
gives:
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.2.1 20151202 (release) [ARM/embedded-5-branch revision 231848]
Looking at the project properties as suggested by Jens I found out that I had no optimization level here:
So I'm going to turn this on and implement the sign extension inside Thibaut's function the Jens way! And see if I get some good news!
Hi Thibaut, indeed I thought that since I'm evaluating your code it was proper to link your blog! And of course any hint from you would be welcome. Actually, I tried to reply to your questions in the section below, trying to merge your answer with Jens' one.
That's right! And I believe that your implementation will save me a lot of computational time: I quoted the code in the answer below.
Lines 19, 20, 36 and 37 look very wrong to me.
To me, it seems you're doing the same job twice.
Eg. after 8 iterations, the values you've already sign-extended, will be sign-extended again.
I could be wrong, but I better mention it; are you sure that they're doing what you want ?
(I would remove them completely)
About the 'prefetch' on branches (P):
Prefetch only happens when necessary. It's not really something you're in control of (especially not when using C code).
-But it may be 3 the first time the branch jumps back in the loop and then 1 from that point on.
If an interrupt happens while you're inside the loop, P might become 3 again.
But as you see, this is something that's rare, so I think you can assume the value 1.
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code.
Now, I just don't know which address RamLoc128 is.
(I particularly like that NXP measure 0 Bytes in GB).
About Optimization:
Lines 19, 20, 36 and 37 look very wrong to me. [...] are you sure that they're doing what you want?
Unfortunately no, I'm not: the purpose of those lines is to point to the 2nd value of the pair that is being processed by Thibaut's function and sign-extend that value! I tried to figure out how much I needed to move my pointer to get the next-sample address by looking at the samples' addresses through the debugger; maybe I was wrong?
so I think you can assume the value 1
That's ok for now, I won't tinker with it.
Today I did some tests using the -O3 optimization level for my project and the result is great (using Thibaut's function with no sign extension): the elapsed time for 128 samples is roughly 18us compared to the 160 without optimization! Fun fact: compiling the CMSIS DSP with -O2 gives slightly better performance than using -O3! (updated the old post)
Speaking of the Memory Layout:
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA. As far as I can tell, the DMA buffer is somewhere in RamLoc128.
Yes, I completely agree: as I can see in LPCXpresso project's properties, RamLoc128 starts @ 0x10000000 (you can look for it in the picture posted in my recap).
That means you can pick any of the other ram locations
Next step on this front: understand how I can do this.
goodwin, jensbauer, I'd like to thank you very, very much for the effort you're putting into helping me: I'd really offer you a beer (or 2) if we weren't spread around the globe (but, I mean, never say never). Anyway, I'm working hard on my project these days and today I'd like to write here a small recap. I hope this isn't off-topic: I know I asked about floating-point conversion, so I would be happy to give the correct answer to goodwin because his insight is great, but then I'd need to open up a new post to continue the interesting optimization discussion. Please let me know if one way is better than the other and what I should do; I really don't want to take advantage of anyone.
In this project I need to use the ADCHS to sample gaussian-shaped impulsive signals like these:
These are generated by analog electronics and we're working in the field of ionizing radiation, so let's say these spikes represent two photons and have a 3us full width. My goal is to reach a 100kHz count rate: this means 10us between each peak, and therefore to process a MinMax algorithm on the data within this requirement (this would be great, but I'm trying to find a compromise between count rate and hardware costs).
Yesterday there was an important update to the project and a colleague was able to tell me that we need at least 40 samples (or 128 on the entire 10us duration) per spike, so this acquisition can't be slower than roughly 1.25 Msps.
(Note that I still hope to be able to perform a 40Msps acquisition and understand how to get out the most from this micro, since this would open a whole set of possibilities for the applications of my project)
I'm not an expert on this topic, and since I'm self-taught on this I think I could also be somewhat slow, but since you pointed out that some info about my memory layout would be very helpful, I'll try to describe it. From my LPCXpresso IDE:
Note that:
int16_t sample[NUM_SAMPLE] = {0};
int16_t sample2[NUM_SAMPLE] = {0};
As jensbauer said, now I'm trying to figure out which is the best DMA buffer length, and, sadly, as goodwin said, @40Msps I'm not able to perform a continuous read + elaboration.
I tried:
arm_shift_q31(sample, shiftBits, sampleTmp, NUM_SAMPLE);
arm_max_q31(sampleTmp, NUM_SAMPLE, &maxSample, &sampleIndex);
MAXmin = SearchMinMax16_DSP(sampleTest, NUM_SAMPLE);
Here we have some results from the tests I did:
So it turns out that:
Any help on this optimization problem is highly appreciated of course!
Hope this wasn't too long/boring; if so, let me know and I'll do everything I can to give a better explanation, divide the problem into smaller ones, and find the right place in the forum for every piece. Again, thanks to those who have helped up to today, lovely community!
Regards,
Andrea
It's great to hear about the optimization results.
abet wrote:
Today I did some tests using the -O3 optimization level for my project and the result is great (using thibaut's function with no sign extension): the elapsed time for 128 sample is roughly 18us compared to the 160 without optimization!
Fun fact: compiling the CMSIS DSP with -O2 gives slight better performance than using -O3! (updated the old post)
The -O2 result is a great observation. This might be connected to the fact that -O3 most likely unrolls the loops more than -O2.
If that's the case, it means that fetching the code from SPIFI slows down (I'm only guessing here).
If it's possible for you to link to a binary version of a pre-compiled CMSIS DSP library, try that.
I know that the people who developed the DSP library have spent very much time on optimizing it; like it was the most important thing in the world for them.
-So if a precompiled library exists and you can link directly to that, then you'll most likely get the best performance regarding the DSP library.
Regarding the sign extension, there is a very simple way to do it : change the scale of your data !
If I understood properly, your sample buffer holds 16-bit values in which the 12 least significant bits are the ADC output value in two's complement, and I expect you have 4 zero bits in front (bits 15-12).
I would symbolize this sample pair like that : sample n (0x0SA1), sample n+1 (0x0SA2)
When you use *__SIMD32(pSrc), it loads a register with both samples (0x0SA20SA1), then you just need to shift left by 4 bits to have 0xSA20SA10 which is a pair of signed 16-bits values !
If you need to keep your samples for further computations, you can write back to memory with this new scale.
This would give something like :
uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)
{
    uint32_t data, min, max;
    int16_t data16;

    /* max variable will hold two max: one on each 16-bit half;
     * same thing for min */

    /* Load two first samples in one 32-bit access */
    data = *__SIMD32(pSrc);
    /* put significant bits on bits 15-4 instead of 11-0 on each halfword */
    data <<= 4;
    /* Write back to memory to have usable 16-bit samples,
     * increment source pointer by a pair of samples */
    *__SIMD32(pSrc)++ = data;

    /* Initialize Min and Max to these first samples */
    min = data;
    max = data;

    /* decrement sample count */
    pSize -= 2;

    /* Loop as long as there remain at least two samples */
    while (pSize > 1)
    {
        /* Load next two samples in a single access */
        data = *__SIMD32(pSrc);
        /* put significant bits on bits 15-4 instead of 11-0 on each halfword */
        data <<= 4;
        /* Write back to memory to have usable 16-bit samples,
         * increment source pointer by a pair of samples */
        *__SIMD32(pSrc)++ = data;

        /* Parallel comparison of max and new samples */
        (void)__SSUB16(max, data);
        /* Select max on each 16-bit half */
        max = __SEL(max, data);

        /* Parallel comparison of new samples and min */
        (void)__SSUB16(data, min);
        /* Select min on each 16-bit half */
        min = __SEL(min, data);

        pSize -= 2;
    }

    /* Now we have the maximum of the even samples on the low halfword of max
     * and the maximum of the odd samples on the high halfword */

    /* look for max between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(max, max >> 16);
    /* Select max on low 16 bits */
    max = __SEL(max, max >> 16);

    /* look for min between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(min >> 16, min);
    /* Select min on low 16 bits */
    min = __SEL(min, min >> 16);

    /* Test if odd number of samples */
    if (pSize > 0)
    {
        data16 = *pSrc;
        /* put significant bits on bits 15-4 instead of 11-0 on low halfword */
        data16 <<= 4;
        /* Write back to memory to have a usable 16-bit sample */
        *pSrc = data16;

        /* look for max on low halfwords */
        (void)__SSUB16(max, data16);
        /* Select max on low 16 bits */
        max = __SEL(max, data16);
        /* look for min on low halfword */
        (void)__SSUB16(data16, min);
        /* Select min on low 16 bits */
        min = __SEL(min, data16);
    }

    /* Pack result: Min on low halfword, Max on high halfword */
    return __PKHBT(min, max, 16); /* PKHBT documentation */
}
With proper optimization options, I expect this to be quite efficient.