
CMSIS confusion

Note: This was originally posted on 27th October 2012 at http://forums.arm.com

Hello all
I have a few questions about CMSIS. I have looked through some of the library's basic functions, and some of them seem to have no loop for the remaining samples (Cortex-M3/M4). For example, the Q31 abs function:

/*loop Unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.  
   ** a second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* C = |A| */
    /* Calculate absolute of input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == 0x80000000) ? 0x7fffffff : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == 0x80000000) ? 0x7fffffff : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == 0x80000000) ? 0x7fffffff : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == 0x80000000) ? 0x7fffffff : -in);

    /* Decrement the loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.  
   ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;



I think there should be a loop for the remaining samples. Can you explain why there are 4 operations in one loop iteration? Is it a speed optimization?

Best Regards
  • Note: This was originally posted on 27th October 2012 at http://forums.arm.com

    Hi Surix,
    The DSP functions in the CMSIS library rely on loop unrolling for optimization. The Q31 absolute value function is unrolled by a factor of 4. Unrolling helps to reduce run time by: (1) allowing memory accesses to be grouped together, and (2) reducing the loop overhead (it takes 3 cycles to do a loop on the M4), which is then paid once per 4 samples instead of once per sample. The first loop in arm_abs_q31.c processes the samples in groups of 4. If you look further down in the code, you'll see a second loop that handles any remaining samples:


      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Calculate absolute value of the input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
        in = *pSrc++;
        *pDst++ = (in > 0) ? in : ((in == 0x80000000) ? 0x7fffffff : -in);

        /* Decrement the loop counter */
        blkCnt--;
      }

    Structured this way, the function can handle any length vector.

    -Paul
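To put the two loops from this thread side by side, here is a minimal, self-contained sketch of the unroll-by-4 pattern followed by the remainder loop. It is not the library source: the function name `abs_q31_unrolled` and the helper `abs_sat_q31` are hypothetical, and `q31_t` is assumed to be a plain `int32_t`.

```c
#include <stdint.h>

typedef int32_t q31_t;  /* assumption: Q31 stored in a 32-bit signed integer */

/* Hypothetical helper: saturating absolute value of one Q31 sample.
 * 0x80000000 (Q31 value -1.0) has no positive counterpart, so it
 * saturates to 0x7fffffff. */
static q31_t abs_sat_q31(q31_t in)
{
    if (in == INT32_MIN) {
        return INT32_MAX;
    }
    return (in > 0) ? in : -in;
}

/* Hypothetical name; mirrors the structure discussed in the thread. */
void abs_q31_unrolled(const q31_t *pSrc, q31_t *pDst, uint32_t blockSize)
{
    /* Number of complete groups of 4 samples. */
    uint32_t blkCnt = blockSize >> 2u;

    /* Unrolled loop: 4 outputs per iteration, one counter update
     * and one branch per 4 samples. */
    while (blkCnt > 0u) {
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        blkCnt--;
    }

    /* Remainder loop: the 0 to 3 samples left over when blockSize
     * is not a multiple of 4. */
    blkCnt = blockSize % 4u;
    while (blkCnt > 0u) {
        *pDst++ = abs_sat_q31(*pSrc++);
        blkCnt--;
    }
}
```

With `blockSize = 7`, for example, the first loop runs once (4 samples) and the second loop runs three times.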
  • Note: This was originally posted on 27th October 2012 at http://forums.arm.com

    Hi Paul
    Yes, I had noticed that loop, but the comment above it says /* Run the below code for Cortex-M0 */. In some files, like arm_add_q31, there is one loop for Cortex-M3/M4 and a second one for M0. Why? Your explanation about processing 4 samples per loop iteration is clear to me :) Thanks!

    Best Regards
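One possible reading of that comment, not confirmed anywhere in this thread: the unrolled part is selected at compile time, and a single trailing loop serves both builds. The sketch below shows how such a layout could work. The macro `DEMO_SMALL_CORE` and the function name are hypothetical and stand in for whatever switch the library really uses.

```c
#include <stdint.h>

typedef int32_t q31_t;  /* assumption: Q31 stored in a 32-bit signed integer */

/* Hypothetical helper: saturating absolute value of one Q31 sample. */
static q31_t abs_sat_q31(q31_t in)
{
    if (in == INT32_MIN) {
        return INT32_MAX;
    }
    return (in > 0) ? in : -in;
}

/* Hypothetical function illustrating one trailing loop shared by two builds. */
void abs_q31_shared_tail(const q31_t *pSrc, q31_t *pDst, uint32_t blockSize)
{
    uint32_t blkCnt;

#ifndef DEMO_SMALL_CORE
    /* Build for the faster cores: process groups of 4 first. */
    blkCnt = blockSize >> 2u;
    while (blkCnt > 0u) {
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        *pDst++ = abs_sat_q31(*pSrc++);
        blkCnt--;
    }
    /* Leave only the 0 to 3 leftover samples for the loop below. */
    blkCnt = blockSize % 4u;
#else
    /* Build for the small core: no unrolling, so the loop below
     * processes every sample. */
    blkCnt = blockSize;
#endif

    /* Shared loop: the remainder in one build, the whole vector in the other. */
    while (blkCnt > 0u) {
        *pDst++ = abs_sat_q31(*pSrc++);
        blkCnt--;
    }
}
```

Under this layout, the only difference between the two builds is the initial value of `blkCnt` when the last loop starts, which would explain a comment that mentions only one core sitting right above a loop that both builds execute.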