This discussion has been locked.

Process ADC data, moved by DMA, using CMSIS DSP: what's the right way?

Hi to you all,
I have firmware running on an NXP LPC-Link2 board (LPC4370: 204 MHz Cortex-M4 MCU) which basically does this:

  • Fills the ADC FIFO at 40 Msps.
  • Copies the data into memory using the built-in DMA Controller and 2 linked buffers.
  • Processes one buffer while the other is being filled.

My problem is that my code is too slow, and every now and then an overwrite occurs.

Using the DMA I'm saving the ADC data, which I get in two's-complement format (offset binary is also available), into a uint32_t buffer, and then I prepare it for the CMSIS-DSP functions by converting the buffer to float32_t: this conversion is where the overwrite occurs. It's worth saying that I'm currently using software floating point, not the hardware FPU.
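Roughly, the conversion step I mean looks like this (a minimal sketch; the function name and the 12-bit two's-complement sample layout are my assumptions, not the actual driver code):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the DMA-buffer-to-float step: each uint32_t holds one
 * 12-bit two's-complement ADC sample in its low bits; sign-extend it
 * and scale to a float in [-1.0, 1.0). */
static void adc_buf_to_float(const uint32_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t s = (int32_t)(src[i] << 20) >> 20;  /* sign-extend bit 11 */
        dst[i] = (float)s / 2048.0f;                /* scale by 2^11 */
    }
}
```

Without the FPU, every iteration of that loop goes through the software float library, which is why it struggles to keep up with the DMA.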


The CMSIS library also accepts fractional formats like q31_t, q15_t and so on, and since I don't strictly need floating-point maths I could use these formats instead if that would save me precious time.
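For instance, keeping the samples as q15 (which is just int16_t) would let a call like arm_max_q15(buf, n, &max, &idx) search them with no float conversion at all; in plain C that call boils down to something like this (a sketch of what it computes, not the CMSIS implementation, which is unrolled and uses the M4's SIMD instructions):

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t q15_t;  /* CMSIS-DSP's q15_t is just int16_t */

/* Plain-C equivalent of what arm_max_q15 computes: the largest
 * sample in the buffer and the index where it occurs. */
static void max_q15(const q15_t *src, size_t n, q15_t *max, size_t *idx)
{
    *max = src[0];
    *idx = 0;
    for (size_t i = 1; i < n; i++) {
        if (src[i] > *max) { *max = src[i]; *idx = i; }
    }
}
```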
It feels like I'm missing something important about this step. That's no surprise, since this is my first project on a complex MCU; any help/hint/advice would be highly appreciated and would help me with my thesis.

I'll leave here the link to the (more detailed) question I asked on the NXP forums, just in case: LPC4370: ADCHS, GPDMA and CMSIS DSP | NXP Community.

Thanks in advance!

  • goodwin, jensbauer, I'd like to thank you very much for the effort you're putting into helping me: I'd really offer you a beer (or two) if we weren't spread around the globe (but, I mean, never say never).
    Anyway, I'm working hard on my project these days and today I'd like to write a small recap here. I hope this isn't off-topic: I know I asked about floating-point conversion, so I'd be happy to mark goodwin's answer as correct because his insight is great, but then I'd need to open a new post to continue the interesting optimization discussion.
    Please let me know if one way is better than the other and what I should do; I really don't want to take advantage of anyone.

    The 'Big' Picture

    In this project I need to use the ADCHS to sample gaussian-shaped impulsive signals like these:

    sample_1.jpg

    These are generated by analog electronics and we're working in the field of ionizing radiation, so let's say these spikes represent two photons and have a 3 us full width.
    My goal is to reach a 100 kHz count rate: this means 10 us between peaks, and therefore the MinMax algorithm has to process the data within that window (this would be great, but I'm trying to find a compromise between count rate and hardware costs).

    Yesterday there was an important update to the project: a colleague was able to tell me that we need at least 40 samples per spike (or 128 over the entire 10 us duration), so the acquisition can't be slower than roughly 12.8 Msps.

    (Note that I still hope to be able to perform a 40 Msps acquisition and understand how to get the most out of this micro, since this would open up a whole set of possibilities for the applications of my project.)

    Memory Layout

    I'm not an expert on this topic and, since I'm self-taught here, I may be somewhat slow; but since you pointed out that some info about my memory layout would be very helpful, I'll try to describe it.
    From my LPCXpresso IDE:

    Selection_032.png

    Note that:

    int16_t sample[NUM_SAMPLE] = {0};
    int16_t sample2[NUM_SAMPLE] = {0};
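The ping-pong scheme over these two buffers looks roughly like this (the handler and flag names are mine, not NXP's API; the real re-arming happens through the GPDMA linked-list descriptors):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SAMPLE 2048

int16_t sample[NUM_SAMPLE];
int16_t sample2[NUM_SAMPLE];

static volatile bool buffer_ready;      /* set when DMA finishes one buffer */
static int16_t *volatile full_buffer;   /* the buffer the DMA just filled   */

/* Given the buffer just filled, return the one the DMA fills next. */
static int16_t *other_buffer(int16_t *filled)
{
    return (filled == sample) ? sample2 : sample;
}

/* Hypothetical DMA transfer-complete handler: publish the filled buffer
 * so the main loop can process it while the DMA refills the other one. */
void dma_done(int16_t *filled)
{
    full_buffer = filled;
    buffer_ready = true;
    /* re-arm the DMA on other_buffer(filled) here */
}
```

The key constraint is that processing one buffer must finish before the DMA fills the other, otherwise the overwrite I described happens.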
    

    Elaboration: is the LPC4370 @204MHz fast enough?

    As jensbauer said, now I'm trying to figure out the best DMA buffer length, and, sadly, as goodwin said, at 40 Msps I'm not able to perform a continuous read + elaboration.

    I tried:

    arm_shift_q31(sample, shiftBits, sampleTmp, NUM_SAMPLE);
    arm_max_q31(sampleTmp, NUM_SAMPLE, &maxSample, &sampleIndex);
    
    MAXmin = SearchMinMax16_DSP(sampleTest, NUM_SAMPLE);
    

    Here we have some results from the tests I did:

    Software Executed                        | DMA Buffer Length | Elapsed Time | Instructions/Sample
    arm_shift_q31 + arm_max_q31              | 128               | ~135 us      | ~211
    arm_shift_q31 + arm_max_q31              | 1024              | ~190 us      | ~37
    arm_shift_q31 + arm_max_q31              | 2048              | ~250 us      | ~24.5
    SearchMinMax16 (no sign-extension)       | 128               | ~160 us      | n.a.
    SearchMinMax16 (no sign-extension)       | 1024              | ~160 us      | n.a.
    SearchMinMax16 (no sign-extension)       | 2048              | ~500 us      | n.a.
    SearchMinMax16 (with sign-extension)     | 2048              | ~7 ms        | n.a.
    SearchMinMax16 (no sign-extension, -O3)  | 128               | ~18 us       | n.a. (update: 24/08/2016)
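For reference, the Instructions/Sample column is just core clock × elapsed time / buffer length, a rough proxy that assumes about one instruction per cycle:

```c
#include <stddef.h>

/* Cycles spent per sample: clock frequency times elapsed time,
 * divided by the number of samples in the DMA buffer. */
static double cycles_per_sample(double clk_hz, double elapsed_s, size_t n)
{
    return clk_hz * elapsed_s / (double)n;
}
```

For example, 135 us over 128 samples gives ~215 at 204 MHz, or ~211 if computed with a round 200 MHz figure, which matches the first row above.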

      So it turns out that:

    1. I know that thibaut's implementation gives me both the max and the min, but shouldn't it be faster anyway? Am I missing something about the optimization level? I must admit I'm still stuck with the standard LPCXpresso one.
    2. My sign-extension implementation is dead slow. Here's my code, and I just noticed it has a few extra operations that are not needed (fixing them...).
    *pSrc = (*pSrc) << shiftBits;
    *pSrc = (*pSrc) >> shiftBits;
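In case it helps to see it whole-buffer, this is the same shift trick over an array (the function name and the fixed shift of 4 for 12-bit samples are my assumptions; casting through unsigned avoids left-shifting a negative value):

```c
#include <stdint.h>
#include <stddef.h>

/* Sign-extend 12-bit samples stored in int16_t, in place: shift the
 * sign bit (bit 11) up to bit 15, then arithmetic-shift back down. */
static void sign_extend12(int16_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (int16_t)((uint16_t)p[i] << 4) >> 4;
}
```

At -O3 a loop like this should compile down to a few instructions per sample, nowhere near the ~7 ms measured above, which suggests the slowdown is in how the per-element version was compiled rather than in the shifts themselves.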
    

    Any help on this optimization problem is highly appreciated of course!

    Hope this wasn't too long/boring; if so, let me know and I'll do everything I can to give a better explanation, divide the problem into smaller ones, and find the right place in the forum for every piece.
    Again, thanks to those who have helped so far, lovely community!

    Regards,
    Andrea
