Process ADC data, moved by DMA, using CMSIS DSP: what's the right way?

Hi to you all,
I have firmware running on an NXP LPC-Link2 board (LPC4370: 204 MHz Cortex-M4 MCU) which basically does this:

Fills the ADC FIFO at 40 Msps.
Copies the data into memory using the built-in DMA Controller and 2 linked buffers.
Processes one buffer while the other is being filled.

My problem is that my code is too slow, and every now and then an overwrite occurs.

Using the DMA I'm saving the ADC data, which I get in two's complement format (offset binary is also available), in a uint32_t buffer, and I then try to prepare it for the CMSIS DSP functions by converting the buffer to float32_t: this is where the overwrite occurs. It's worth saying that I'm currently using software floating point, not the hardware FPU.

The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating point maths I could even use these formats if that could save me precious time.
It feels like I'm missing something important about this step, which is no surprise since this is my first project on a complex MCU. Any help, hint or advice would be highly appreciated and would help me with my thesis.

I'll leave here the link to the (more detailed) question I asked in the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.

Thanks in advance!

0 Andrea Bettati over 9 years ago in reply to G. Goodwin L. Pitos

Hi goodwin,
thank you very much for all your replies. You're right: in the last few days I figured out that fixed-point math is probably the right way to do what I'm trying to do.
I read your reply about the float conversion though, and even if I don't have the time to try it out now (I "lost", or rather used, too much time and I need to move forward since all this is for my thesis!), I got the idea and I'll certainly come back to your code in the future.
I just want to give some thoughts and suggestions which might be helpful to you.
Extremely helpful. What I need most right now is insight from a skilled eye to point out the possible limitations of my design.
I'm not experienced, so I read some chapters of this interesting book: The Definitive Guide to ARM® Cortex®-M3 and Cortex®-M4 Processors, Third Edition: Joseph Yiu: 9780124080829: Amazon.com:…
It has a lot of useful information about the cycles needed for each operation, and I'll try to do rough calculations of how many of them my code needs.
I hoped I could squeeze things to fit within that 5-cycle boundary (which I was aware of since the beginning) but, as I understand it, I might be wrong.
Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
Again, thank you very much: as I said I'm developing this alone and for the first time, therefore your help is highly appreciated.
Cheers,
Andrea

0 G. Goodwin L. Pitos over 9 years ago in reply to Andrea Bettati
I read your reply about the float conversion though, and even if I don't have the time to try it out now (I "lost", or rather used, too much time and I need to move forward since all this is for my thesis!), I got the idea and I'll certainly come back to your code in the future.
These are the possible reasons why you were lost:
adcBuff[] and float32Buff[] are the arrays I used instead of twosBuff[] and decBuff[].
I stated that adcBuff[] is an array of signed 32-bit integer (int32_t). In your code twosBuff[] is unsigned (uint32_t ).
I assumed that the ADC samples in adcBuff[] have bits 12 to 31 all equal to 0, since this is possible. If any of these bits is 1, an AND operation to clear them (adcBuff[] & 0X00000FFF) is necessary. The three statements in my previous reply should now be:
12-bit offset binary code to 32-bit two's complement to single-precision floating-point
float32Buff[i] = (adcBuff[i] & 0X00000FFF) - 2048;
12-bit two's complement to 32-bit two's complement to single-precision floating-point
float32Buff[i] = (adcBuff[i] << 20) >> 20;
12-bit two's complement to 12-bit offset binary code to 32-bit two's complement to single-precision floating-point
float32Buff[i] = ((adcBuff[i] & 0X00000FFF) ^ 0X00000800) - 2048;
the order of AND and XOR can be interchanged
float32Buff[i] = ((adcBuff[i] ^ 0X00000800) & 0X00000FFF) - 2048;
Note that there is no change to the second statement: the initial settings of bits 12 to 31 are lost with the shift to the left. The third statement just shows how to convert two's complement to offset binary format. On the LPC4370, converting the ADC samples from two's complement to offset binary in software is not needed, since the ADC can be configured to output data in offset binary format.
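The conversions above can be sketched as small helper functions. This is only a sketch under stated assumptions: the function names are made up for illustration, and the two's complement variant uses the mask-and-XOR form rather than the shift pair, because in standard C left-shifting a negative signed value is undefined and right-shifting one is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 12-bit offset binary (0..4095) to float (-2048..2047).
 * Bits 12 to 31 are cleared first, in case they are not zero. */
static inline float offset12_to_float(uint32_t raw)
{
    return (float)((int32_t)(raw & 0x00000FFFu) - 2048);
}

/* Hypothetical helper: 12-bit two's complement to float (-2048..2047).
 * XOR with 0x800 turns two's complement into offset binary, then the
 * offset of 2048 is subtracted. */
static inline float twos12_to_float(uint32_t raw)
{
    return (float)((int32_t)((raw & 0x00000FFFu) ^ 0x00000800u) - 2048);
}

/* Convert a whole buffer of 12-bit two's complement samples to float. */
static void twos12_buffer_to_float(const uint32_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = twos12_to_float(src[i]);
    }
}
```

For example, twos12_to_float(0xFFF) yields -1.0f and offset12_to_float(0x800) yields 0.0f.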
Anyway, I'm planning to follow your advice and, since I'll need to sample impulsive signals, to use the ADCHS in non-continuous mode, using the thresholds I can set and processing data when no info is being sampled. What do you think about it?
I think that's the way to do it but as I stated I am not an expert. Instead I would further suggest that you access LabTool's documentation and open-source software. LabTool is also based on LPC4370/LPC-Link 2 so I hope you find it helpful to your thesis.
0 Thibaut ZEISSLOFF over 9 years ago in reply to Andrea Bettati

Hi,
I did not follow the entire subject, but since my name is quoted here, I thought I would take a look!
Regarding speed comparison and optimization issues, can you tell me which compiler you are using, and with which options?
If you link with the CMSIS_DSP library, it was most likely built with the ARM compiler (5, 6?) with high optimizations. Your code may not be comparable in terms of speed if built with, let's say, GCC!
If I caught your need properly, you need to detect min and max over a buffer of 12-bits signed integers ?
Can you show your computation code ?
0 Jens Bauer over 9 years ago in reply to Andrea Bettati
First of all, I'll talk about what's (in theory) slow about your shift operations.
Normally, the compiler should figure this out by itself, but if you've turned off optimizing, it's not going to happen.
*pSrc = (*pSrc) << shiftBits; *pSrc = (*pSrc) >> shiftBits;
We don't need to go down to the assembly-level.
This is what happens in unoptimized code:
1: Read a value from memory
2: Shift a value by n positions
3: Store a value in memory
4: Read a value from memory
5: Shift a value by n positions
6: Store a value in memory
Reading a value from memory requires 2 clock cycles.
Shifting the value to the left or right requires 1 clock cycle.
Storing the value requires 1 clock cycle.
If the code is not optimized by the compiler, then the code can be improved by removing step 3 and step 4.
As I can not see the full loop, I can't give a full suggestion on optimizing; except from what I've written earlier.
My earlier suggestion adapted to shifting:
void ...(...)
{
    register int32_t i;     /* index */
    register int32_t *d;    /* destination (signed, so that >> sign-extends) */

    d = &((int32_t *)pSrc)[length];     /* point d to the end of the array (destination seems to be the same as source) */
    i = -length;                        /* convert length of array to a negative index */
    do
    {
        d[i] = (d[i] << shiftCount) >> shiftCount;
    } while(++i);                       /* increment index and keep going until i reaches 0 */
}
In other words: do not split up the shift into several 'stages'; it might hurt performance, as the code could grow.
If the code is built as I planned, it would result in something like this:
    adds.n  r0,r0,r1,lsl#2      /* point d to end of array */
    rsbs.n  r2,r1               /* index = -length */
loop:
    ldr.n   r0,[r1,r2]          /* [2] get 12-bit ADC value */
    sbfx.w  r0,r0,#0,#12        /* [1] sign-extend it to 32 bits */
    str.n   r0,[r1,r2]          /* [1] store the sign-extended result */
    adds.n  r2,r2,#4            /* [1] increment index */
    bne.n   loop                /* [1/1+P] go round loop until index wraps to 0 */
The numbers in square brackets are how many clock-cycles I expect the code to spend.
The last one has the format [branch not taken/branch taken], where P is 'prefetch'; P is a value between 1 and 3 (normally 1).
That means if your code runs from SRAM, then per sample, it should cost 7 clock cycles if P is 1.
We'll add an extra clock cycle, so we won't be disappointed. Dividing 204 MHz by 8 clock cycles allows us to process about 25 million samples per second.
However! You also need to remember that the DSP needs to process the data.
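The arithmetic above can be written down as a quick back-of-the-envelope check. This is a minimal sketch, assuming the 204 MHz core clock and the 40 Msps sample rate from the original question; the function names are hypothetical.

```c
#include <assert.h>

/* Core clock cycles available per sample at a given sample rate. */
static double cycles_per_sample_budget(double core_hz, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return core_hz / sample_rate_hz;
}

/* Samples per second the core can handle when each sample costs
 * the given number of clock cycles. */
static double max_samples_per_second(double core_hz, double cycles_per_sample)
{
    assert(cycles_per_sample > 0.0);
    return core_hz / cycles_per_sample;
}
```

With these, max_samples_per_second(204e6, 8.0) gives 25.5 million samples per second, while cycles_per_sample_budget(204e6, 40e6) gives 5.1 cycles per sample, so an 8-cycle loop cannot keep up with a continuous 40 Msps stream even before any further processing is added.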
In addition to using the above loop, I recommend changing the optimization level.
If I understand correctly, LPCXpresso is using gcc, and if that's the case, then it's easy for me to tell you how to change the optimization level:
In case you're able to run your gcc from the command-line, try this:
arm-none-eabi-gcc --help=optimizers
It will give you a long list of optimization options, but the following is the important one: -O<number>.
Normally I use -Os (for size optimization), but that's not what you want in this case!
-Ofast is another way of saying you want fast code; here's the description: "Optimize for speed disregarding exact standards compliance".
From what I can see on NXP's web-site, you need to specify the setting inside the IDE; here's where they say it is:
Project -> Properties -> C/C++ Build -> Settings -> Tool Settings -> MCU C Compiler -> Optimization -> Optimization Level
I recommend first trying -O3
I usually write my own code in such a way that, even when optimization is disabled, the code is almost as efficient.
The most important thing you can do is to get the optimization working; it should improve performance a great deal, especially if the compiler unrolls loops (unrolling means more operations per branch, or fewer branches per operation; take your pick).

0 Andrea Bettati over 9 years ago in reply to Jens Bauer

Thanks for the detailed reply Jens.
Right now I'm doing the sign-aligned stuff inside Thibaut's function (https://www.m4-unleashed.com/parallel-comparison/ ), which is called during the DMA's Transfer Completed ISR.
here's my code:

uint32_t MAXmin;
int16_t sample[NUM_SAMPLE] = {0};
int16_t sample2[NUM_SAMPLE] = {0};
uint16_t shiftBits = 4;
uint16_t wordLenght = 8; /*Figured out looking at the registers address in while debugging*/

uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)
{
    uint32_t data, min, max;
    int16_t data16;

    /* max variable will hold two max : one on each 16-bits half
     * same thing for min
     */

    /*Sign Extension*/
    *pSrc = (*pSrc) << shiftBits;
    *pSrc = (*pSrc) >> shiftBits;
    *(pSrc + wordLenght) = (*(pSrc+wordLenght)) << shiftBits;
    *(pSrc + wordLenght) = (*(pSrc+wordLenght)) >> shiftBits;

    /* Load two first samples in one 32-bit access */
    data = *__SIMD32(pSrc)++;
    /* Initialize Min and Max to these first samples */
    min = data;
    max = data;
    /* decrement sample count */
    pSize-=2;

    /* Loop as long as there remains at least two samples */
    while (pSize > 1)
    {
        /*Sign Extension*/
         *pSrc = (*pSrc) << shiftBits;
         *pSrc = (*pSrc) >> shiftBits;
         *(pSrc + wordLenght) = (*(pSrc+wordLenght)) << shiftBits;
         *(pSrc + wordLenght) = (*(pSrc+wordLenght)) >> shiftBits;


        /* Load next two samples in a single access */
        data = *__SIMD32(pSrc)++;
        /* Parallel comparison of max and new samples */
        (void)__SSUB16(max, data);
        /* Select max on each 16-bits half */
        max = __SEL(max, data);
        /* Parallel comparison of new samples and min */
        (void)__SSUB16(data, min);
        /* Select min on each 16-bits half */
        min = __SEL(min, data);

        pSize-=2;
    }
    /* Now we have maximum on even samples on low halfword of max
     * and maximum on odd samples on high halfword */
    /* look for max between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(max, max >> 16);
    /* Select max on low 16-bits */
    max = __SEL(max, max >> 16);

    /* look for min between halfwords 1 & 0 by comparing on low halfword */
    (void)__SSUB16(min >> 16, min);
    /* Select min on low 16-bits */
    min = __SEL(min, min >> 16);

    /* Test if odd number of samples */
    if (pSize > 0)
    {
        data16 = *pSrc;
        /* look for max between on low halfwords */
        (void)__SSUB16(max, data16);
        /* Select max on low 16-bits */
        max = __SEL(max, data16);

        /* look for min on low halfword */
        (void)__SSUB16(data16, min);
        /* Select min on low 16-bits */
        min = __SEL(min, data16);
    }

    /* Pack result : Min on Low halfword, Max on High halfword */
    return __PKHBT(min, max, 16); /* PKHBT documentation */
}

At line 33 there's the bit extension.

Great analysis about the clock/sample Jens!

That means if your code runs from SRAM, then per sample, it should cost 7 clock cycles if P is 1.

Sounds great! How can I be sure this is happening?

Speaking of the compiler, this is the output of my current configuration in LPCXpresso (S2D.c is the file containing the code we are talking about):

arm-none-eabi-gcc -nostdlib -L"/home/abet/LPCXpresso/link2_2/lpc_board_nxp_lpclink2_4370/Debug" -L"/home/abet/LPCXpresso/link2_2/lpc_chip_43xx/Debug" -L"/home/abet/LPCXpresso/link2_2/CMSIS_DSPLIB_CM4/lib" -Xlinker -Map="S2D.map" -Xlinker --gc-sections -Xlinker -print-memory-usage -mcpu=cortex-m4 -mthumb -T "S2D_Debug.ld" -o "S2D.axf" ./src/S2D.o ./src/cr_startup_lpc43xx.o ./src/crp.o ./src/sysinit.o   -llpc_board_nxp_lpclink2_4370 -llpc_chip_43xx -lCMSIS_DSPLIB_CM4

Memory region     Used Size  Region Size  %age Used
RamLoc128:           6688 B       128 KB      5.10%
RamLoc72:              0 GB        72 KB      0.00%
RamAHB32:              0 GB        32 KB      0.00%
RamAHB16:              0 GB        16 KB      0.00%
RamAHB_ETB16:          0 GB        16 KB      0.00%
RamM0Sub16:            0 GB        16 KB      0.00%
RamM0Sub2:             0 GB         2 KB      0.00%
SPIFI:              13668 B         4 MB      0.33%

Also,

arm-none-eabi-gcc --version

gives:

arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors) 5.2.1 20151202 (release) [ARM/embedded-5-branch revision 231848]

Looking at the project properties as suggested by Jens I found out that I had no optimization level here:

So I'm going to turn this on and implement the sign extension inside Thibaut's function the way Jens suggested, and see if I get some good news!

0 Andrea Bettati over 9 years ago in reply to Thibaut ZEISSLOFF

Hi Thibaut, indeed I thought that since I'm evaluating your code it was proper to link your blog! And of course any hint from you would be welcome.
Actually, I tried to reply to your questions in the section below, merging your answer with Jens's.
If I caught your need properly, you need to detect min and max over a buffer of 12-bits signed integers ?
That's right!
And I believe that your implementation will save me a lot of computation time: I quoted the code in the answer below.
0 Jens Bauer over 9 years ago in reply to Andrea Bettati

Lines 19, 20, 36 and 37 look very wrong to me.
To me, it seems you're doing the same job twice.
E.g. after 8 iterations, the values you've already sign-extended will be sign-extended again.
I could be wrong, but I better mention it; are you sure that they're doing what you want ?
(I would remove them completely)
About the 'prefetch' on branches (P):
Prefetch only happens when necessary. It's not really something you're in control of (especially not when using C code).
But it may be 3 the first time the branch jumps back in the loop and then 1 from that point on.
If an interrupt happens while you're inside the loop, P might become 3 again.
But as you see, this is something that's rare, so I think you can assume the value 1.
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code.
Now, I just don't know which address RamLoc128 is.
(I particularly like that NXP measure 0 Bytes in GB).
0 Andrea Bettati over 9 years ago in reply to Jens Bauer

About Optimization:
Lines 19, 20, 36 and 37 look very wrong to me. [...]
are you sure that they're doing what you want ?
Unfortunately no, I'm not: the purpose of those lines is to point to the 2nd value of the pair that is being processed by Thibaut's function and sign-extend that value!
I tried to figure out how far I needed to move my pointer to get the next sample's address by looking at the samples' addresses through the debugger; maybe I was wrong?
so I think you can assume the value 1
That's OK for now, I won't tinker with it.
Today I did some tests using the -O3 optimization level for my project and the result is great (using Thibaut's function with no sign extension): the elapsed time for 128 samples is roughly 18 us, compared to 160 us without optimization!
Fun fact: compiling the CMSIS DSP library with -O2 gives slightly better performance than -O3! (I updated the old post.)
Speaking of the Memory Layout:
RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
Yes, I completely agree: as I can see in the LPCXpresso project's properties, RamLoc128 starts at 0x10000000 (you can look for it in the picture posted in my recap).
That means you can pick any of the other ram locations
Next step on this front I'll try to do: understand how I can do this.

0 Andrea Bettati over 9 years ago in reply to G. Goodwin L. Pitos

goodwin, jensbauer, I'd like to thank you very much for the effort you're putting into helping me: I'd really offer you a beer (or 2) if we weren't spread around the globe (but, I mean, never say never).
Anyway, I'm working hard on my project these days and today I'd like to write a small recap here. I hope this isn't off-topic: I know I asked about floating-point conversion, so I would be happy to mark Goodwin's reply as the correct answer because his insight is great, but then I'd need to open a new post to continue the interesting optimization discussion.
Please let me know which way is better and what I should do; I really don't want to take advantage of anyone.

The 'Big' Picture

In this project I need to use the ADCHS to sample Gaussian-shaped impulsive signals like these:

These are generated by analog electronics and we're working in the field of ionizing radiation, so let's say these spikes represent two photons and have a 3 us full width.
My goal is to reach a 100 kHz count rate: this means 10 us between each peak, and therefore running a MinMax algorithm on the data within this constraint (this would be great, but I'm trying to find a compromise between count rate and hardware costs).

Yesterday there was an important update to the project: a colleague told me that we need at least 40 samples per spike (or 128 over the entire 10 us duration), so this acquisition can't be slower than roughly 1.25 Msps.

(Note that I still hope to be able to perform a 40 Msps acquisition and understand how to get the most out of this micro, since this would open up a whole set of possibilities for the applications of my project.)

Memory Layout

I'm not an expert on this topic, and since I'm self-taught I may also be somewhat slow, but since you pointed out that some information about my memory layout would be very helpful, I'll try to describe it.
From my LPCXpresso IDE:

Note that:

I needed to add the SPIFI Flash in order to use the Link2 as evaluation board (as described here: Introduction to Programming the NXP LPC4370 MCU Using the LPCxpresso Tools and Using Two LPC-Link2 Boards and here: Using an LPC-Link2 as an LPC4370 evaluation board | NXP Community
I'm declaring my global arrays like so:

int16_t sample[NUM_SAMPLE] = {0};
int16_t sample2[NUM_SAMPLE] = {0};

Elaboration: is the LPC4370 @204MHz fast enough?

As jensbauer suggested, I'm now trying to figure out the best DMA buffer length and, sadly, as goodwin said, at 40 Msps I'm not able to perform continuous reading and processing.

I tried:

the CMSIS DSP functions (How to prepare ADC data for Q31_t CMSIS DSP functions?; thanks to dwhite85, this is how I was able to get rid of the float type):

arm_shift_q31(sample, shiftBits, sampleTmp, NUM_SAMPLE);
arm_max_q31(sampleTmp, NUM_SAMPLE, &maxSample, &sampleIndex);

Thibaut's solution: https://www.m4-unleashed.com/parallel-comparison/ (thanks for this one!), optionally modified with the sign extension suggested by goodwin:

MAXmin = SearchMinMax16_DSP(sampleTest, NUM_SAMPLE);

Here we have some results from the tests I did:

Software Executed                                            DMA Buffer's Length  Elapsed Time  Instructions/Sample
arm_shift_q31 + arm_max_q31                                  128                  ~135us        ~211
arm_shift_q31 + arm_max_q31                                  1024                 ~190us        ~37
arm_shift_q31 + arm_max_q31                                  2048                 ~250us        ~24.5
SearchMinMax16 (no sign-extension)                           128                  ~160us        n.a.
SearchMinMax16 (no sign-extension)                           1024                 ~160us        n.a.
SearchMinMax16 (no sign-extension)                           2048                 ~500us        n.a.
SearchMinMax16 (with sign-extension)                         2048                 ~7ms          n.a.
update: 24/08/2016 SearchMinMax16 (no sign-extension + -O3)  128                  ~18us         n.a.

So it turns out that:

I know that Thibaut's implementation gives me both the max and the min, but shouldn't it be faster anyway? Am I missing something about the optimization level? I must admit I'm still stuck with the standard LPCXpresso one.
My sign extension implementation is dead slow. Here's my code, and I just noticed I have a few extra statements that are not needed (fixing them...).

*pSrc = (*pSrc) << shiftBits;
*pSrc = (*pSrc) >> shiftBits;

Any help on this optimization problem is highly appreciated of course!

I hope this wasn't too long or boring; if so, let me know and I'll do everything I can to give a better explanation, divide the problem into smaller ones and find the right place in the forum for every piece.
Again, thanks to those who have helped so far. Lovely community!

Regards,
Andrea

0 Jens Bauer over 9 years ago in reply to Andrea Bettati

It's great to hear about the optimization results.
abet wrote:
Today I did some tests using the -O3 optimization level for my project and the result is great (using Thibaut's function with no sign extension): the elapsed time for 128 samples is roughly 18 us, compared to 160 us without optimization!
Fun fact: compiling the CMSIS DSP library with -O2 gives slightly better performance than -O3! (I updated the old post.)
The -O2 result is a great observation. It might be because -O3 most likely unrolls the loops more than -O2.
If that's the case, it means that fetching the larger code from SPIFI slows things down (I'm only guessing here).
If it's possible for you to link to a binary version of a pre-compiled CMSIS DSP library, try that.
I know that the people who developed the DSP library have spent a great deal of time optimizing it, as if it were the most important thing in the world to them.
So if a precompiled library exists and you can link directly to it, then you'll most likely get the best performance from the DSP library.

0 Thibaut ZEISSLOFF over 9 years ago in reply to Andrea Bettati

Regarding the sign extension, there is a very simple way to do it : change the scale of your data !

If I understood properly, your sample buffer holds 16-bit values in which the 12 least significant bits are the ADC output value in two's complement, and I expect you have four 0 bits in front (bits 15-12).

I would symbolize this sample pair like this: sample n (0x0SA1), sample n+1 (0x0SA2)

When you use *__SIMD32(pSrc), it loads a register with both samples (0x0SA20SA1); then you just need to shift left by 4 bits to get 0xSA20SA10, which is a pair of signed 16-bit values!

If you need to keep your samples for further computations, you can write them back to memory with this new scale.

This would give something like :

uint32_t SearchMinMax16_DSP(int16_t* pSrc, int32_t pSize)  
{  
    uint32_t data, min, max;  
    int16_t data16;  
  
    /* max variable will hold two max : one on each 16-bits half 
     * same thing for min 
     */  
  
    /* Load two first samples in one 32-bit access */  
    data = *__SIMD32(pSrc);
    /* put significant bits on bits 15-4 instead of 11-0 on each halfword */
    data <<= 4;

    /* Write back to memory to have useable 16-bits samples, increment source pointer by a pair of samples */
    *__SIMD32(pSrc)++ = data;

    /* Initialize Min and Max to these first samples */  
    min = data;  
    max = data;  
    /* decrement sample count */  
    pSize-=2;  
  
    /* Loop as long as there remains at least two samples */  
    while (pSize > 1)  
    {    
  
        /* Load next two samples in a single access */  
        data = *__SIMD32(pSrc);
        /* put significant bits on bits 15-4 instead of 11-0 on each halfword */
        data <<= 4;

        /* Write back to memory to have useable 16-bits samples, increment source pointer by a pair of samples */
        *__SIMD32(pSrc)++ = data;

        /* Parallel comparison of max and new samples */  
        (void)__SSUB16(max, data);  
        /* Select max on each 16-bits half */  
        max = __SEL(max, data);  
        /* Parallel comparison of new samples and min */  
        (void)__SSUB16(data, min);  
        /* Select min on each 16-bits half */  
        min = __SEL(min, data);  
  
        pSize-=2;  
    }  
    /* Now we have maximum on even samples on low halfword of max 
     * and maximum on odd samples on high halfword */  
    /* look for max between halfwords 1 & 0 by comparing on low halfword */  
    (void)__SSUB16(max, max >> 16);  
    /* Select max on low 16-bits */  
    max = __SEL(max, max >> 16);  
  
    /* look for min between halfwords 1 & 0 by comparing on low halfword */  
    (void)__SSUB16(min >> 16, min);  
    /* Select min on low 16-bits */  
    min = __SEL(min, min >> 16);  
  
    /* Test if odd number of samples */  
    if (pSize > 0)  
    {  
        data16 = *pSrc;  
        /* put significant bits on bits 15-4 instead of 11-0 on low halfword */
        data16 <<= 4;
        /* Write back to memory to have useable 16-bits sample */
        *pSrc = data16;

        /* look for max between on low halfwords */  
        (void)__SSUB16(max, data16);  
        /* Select max on low 16-bits */  
        max = __SEL(max, data16);  
  
        /* look for min on low halfword */  
        (void)__SSUB16(data16, min);  
        /* Select min on low 16-bits */  
        min = __SEL(min, data16);  
    }  
  
    /* Pack result : Min on Low halfword, Max on High halfword */  
    return __PKHBT(min, max, 16); /* PKHBT documentation */  
}

With proper optimization options, I expect this to be quite efficient.
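Why the change of scale works can be checked in plain C without the Cortex-M4 intrinsics. This is a minimal sketch, assuming each 16-bit slot holds a 12-bit two's complement sample in bits 11-0; the function names are hypothetical, and the loop is a scalar stand-in for the __SSUB16/__SEL version, not a replacement for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Move the 12 significant bits from bits 11-0 to bits 15-4. The sign bit of
 * the 12-bit sample lands on bit 15, so the halfword can be read as a signed
 * 16-bit value equal to 16 times the original sample. The final cast relies
 * on the usual two's complement wrap-around of the compiler. */
static inline int16_t scale_sample(uint16_t raw)
{
    return (int16_t)(uint16_t)((raw & 0x0FFFu) << 4);
}

/* Scalar min/max over the scaled samples; results stay in the scaled domain. */
static void min_max_scaled(const uint16_t *src, size_t n,
                           int16_t *min_out, int16_t *max_out)
{
    assert(src != NULL && n > 0);
    int16_t min = scale_sample(src[0]);
    int16_t max = min;
    for (size_t i = 1; i < n; i++) {
        int16_t s = scale_sample(src[i]);
        if (s < min) min = s;
        if (s > max) max = s;
    }
    *min_out = min;
    *max_out = max;
}
```

Because every scaled value is an exact multiple of 16, the ordering of the samples is preserved, and dividing the returned min and max by 16 gives back the original 12-bit values.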

0 Jens Bauer over 9 years ago in reply to Thibaut ZEISSLOFF

Yes, this is quite efficient!
It might be possible to gain 2 extra clock cycles per iteration by further unrolling, e.g. processing four 16-bit samples at a time.
It requires reading the values contiguously.
E.g. if the two load instructions are next to each other, then a clock cycle will be saved.
Another clock cycle is saved on the branch we're eliminating.
Explained in detail, this sequence will save 2 clock cycles:
load : load: process : process : store : store : branch
This sequence will only save one clock-cycle:
load : process : store : load : process : store : branch
(the end of the while loop represents the branch)
Each time the unrolling doubles, 2 clock cycles are saved until there are not enough free registers.
If we keep the DMA buffer's size divisible by 16 or a higher power of two, we do not need the 'cleanup' for the remaining values.
That would make the code a little simpler and easier to maintain.
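The load : load : process : process ordering can be sketched in portable C as follows. This is only an illustration of the structure, with a hypothetical function name and scalar comparisons instead of the SIMD intrinsics; whether the compiler really emits back-to-back loads and saves the cycles depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find min and max of signed 16-bit samples, four samples per iteration.
 * n must be a non-zero multiple of 4, so no clean-up loop is needed. */
static void min_max_unrolled4(const int16_t *src, size_t n,
                              int16_t *min_out, int16_t *max_out)
{
    assert(src != NULL && n > 0 && (n % 4) == 0);
    int16_t min = src[0];
    int16_t max = src[0];
    for (size_t i = 0; i < n; i += 4) {
        /* all loads grouped together ... */
        int16_t a = src[i];
        int16_t b = src[i + 1];
        int16_t c = src[i + 2];
        int16_t d = src[i + 3];
        /* ... then all the processing, one branch per four samples */
        int16_t lo_ab = (a < b) ? a : b;
        int16_t hi_ab = (a < b) ? b : a;
        int16_t lo_cd = (c < d) ? c : d;
        int16_t hi_cd = (c < d) ? d : c;
        int16_t lo = (lo_ab < lo_cd) ? lo_ab : lo_cd;
        int16_t hi = (hi_ab > hi_cd) ? hi_ab : hi_cd;
        if (lo < min) min = lo;
        if (hi > max) max = hi;
    }
    *min_out = min;
    *max_out = max;
}
```

Requiring the buffer length to be a multiple of 4 is what removes the clean-up code for leftover samples.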
0 Thibaut ZEISSLOFF over 9 years ago in reply to Jens Bauer

You're right. Now that you have an efficient computation technique, you can still improve the overall efficiency.
Usually, I try to let the compiler do its job where it's good!
In fact, you need to ask yourself what you can do to help it generate efficient code.
I wrote quite a detailed analysis of this on my blog (Simplest algorithm ever).
In the end:
- try to fix everything you can at compile time (bit shift count, buffer size, loop count ...)
- limit code visibility to what's necessary (using static functions will allow inlining optimizations inside a module); same for variables: do not use module variables (placed in RAM) when local variables are enough
As demonstrated in my post, this will let you write safe code and allow the compiler to get rid of unused parts!
All of this only applies when you need to reach the best efficiency and can afford to turn compiler optimizations on, and very high!
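The two pieces of advice above might look like this in practice. It is a minimal sketch, not code from the blog post: the value of NUM_SAMPLE, the SHIFT_BITS constant and both function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed at compile time (values are assumptions for this sketch). */
#define NUM_SAMPLE  128u    /* DMA buffer length, in samples */
#define SHIFT_BITS  4       /* 16-bit slot minus 12 significant bits */

_Static_assert((NUM_SAMPLE % 2u) == 0u, "buffer must hold whole sample pairs");

/* static: visible only inside this module, so the compiler is free to
 * inline it and drop the out-of-line copy. */
static inline int16_t align_sample(uint16_t raw)
{
    return (int16_t)(uint16_t)((raw & 0x0FFFu) << SHIFT_BITS);
}

/* The trip count is a compile-time constant and max is a local variable,
 * so it can live in a register instead of in RAM. */
static int16_t max_of_buffer(const uint16_t *src)
{
    assert(src != NULL);
    int16_t max = align_sample(src[0]);
    for (uint32_t i = 1; i < NUM_SAMPLE; i++) {
        int16_t s = align_sample(src[i]);
        if (s > max) max = s;
    }
    return max;
}
```

With the shift count and the loop bound known at compile time, the compiler can fold the shift into the load sequence and unroll the loop without needing a run-time length check.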
0 G. Goodwin L. Pitos over 9 years ago in reply to Jens Bauer

RAM usage looks great. There's plenty for placing code in SRAM in a section that does not collide with the DMA.
As far as I can tell, the DMA buffer is somewhere in RamLoc128.
That means you can pick any of the other ram locations (I'd suggest one of the AHB sections) for the code.
Now, I just don't know which address RamLoc128 is.
Jens, if the code is to run in SRAM, its location should be 0x10000000, the start of RamLoc128. It is the data space and DMA buffer that must be relocated. This is because RamLoc128 (starting from 0x10000000) is the area where the bootloader copies and executes the image from SPIFI (and other external sources) when not executing in place.
(I particularly like that NXP measure 0 Bytes in GB).
0 G. Goodwin L. Pitos over 9 years ago in reply to Andrea Bettati

From your post above:
I needed to add the SPIFI Flash in order to use the Link2 as evaluation board (as described here: Introduction to Programming the NXP LPC4370 MCU Using the LPCxpresso Tools and Using Two LPC-Link2 Boards and here: Using an LPC-Link2 as an LPC4370 evaluation board | NXP Community
Using the LPC-Link2 with SPIFI Flash as the boot source is described in those two pages. What Jens is recommending is to execute from SRAM for the code to run faster. This means that you will add SPIFI Flash but program execution should not be directly from that location. The code from Flash should be copied to SRAM and executed there.
Yes, I completely agree: as I can see in the LPCXpresso project's properties, RamLoc128 starts at 0x10000000 (you can look for it in the picture posted in my recap).
Next step on this front I'll try to do: understand how I can do this.
From my reply to Jens, the code should be run from 0x10000000, the start of RamLoc128. It should be the data space and DMA buffer that must be relocated.
0 G. Goodwin L. Pitos over 9 years ago in reply to Thibaut ZEISSLOFF

Regarding the sign extension, there is a very simple way to do it : change the scale of your data !
Before Andrea posted what he is doing with the samples, this was not an advisable trick. Now that finding the minimum and maximum values seems to be the only task to be done, a left shift (change of scale) is a simple but effective way of doing the sign extension.
For this project, writing a custom function for searching the minimum and maximum values rather than using the CMSIS functions is more advantageous. This is because the input samples need to be sign-extended first and the search for minimum and maximum values can be combined in a single function.