Can the Keil compiler translate C code into SIMD instructions (ARMv7-M)?


When we write code like the following, can the Keil compiler automatically translate it into a SIMD STM?

  do {
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
      *p++ = 1;
  } while (i--);

We write code like this expecting that STM can be used as an optimization.

If the compiler can do this, which option does it need? I assume this is not the default behaviour.

  • Hello ZhiYang,

    I believe there are some ways to write C code to "suggest" to the compiler to use the SIMD instructions.

    However, I have not used them myself.

    I hope some experts can take the time to share methods.

  • Hi Zhi Yang,

    Yes, the Keil compiler will use the STM instruction automatically when appropriate.  I have seen this instruction when looking at the intermediate ASM output of the compiler, and it is often used for stack operations.  My understanding is that on the M4 the STM instruction saves only code space and does not reduce the cycle count.  N back-to-back stores take N+1 cycles, and this holds whether individual store instructions or a single STM is used.

  • Besides function entry and exit, armcc will use STM for structure copies:

    struct S { int a[4]; };
    void f(int i, struct S* p) {
      const struct S s_ones = { { 1, 1, 1, 1 } };
      do {
        *p++ = s_ones;
      } while (i--);
    }
    

    (and maybe for some memcpys)

    But the compiler won't generate STMs from the code you gave (even when 'p' is 'int *'), and it won't auto-vectorize for the pre-NEON SIMD instructions like SHADD8 (on Cortex-M4). You can use intrinsics to get those: http://infocenter.arm.com/help/topic/com.arm.doc.dui0491i/CJAGACAD.html

    [By the way, I don't usually consider STM a SIMD instruction, but I see what you mean.]
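
For the memset-style case in the question, the struct-copy idea can be applied to a plain int buffer. The following is only a sketch and not something from this thread: `fill_words` and `struct block4` are hypothetical names, and it copies a four-word struct with a fixed-size memcpy rather than a direct struct assignment. Whether the compiler lowers that copy to an STM depends on the compiler version and optimization level, so check the generated assembly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: four ints that are copied as one unit. */
struct block4 { int w[4]; };

/* Fill `count` ints starting at `dst` with `value`. */
void fill_words(int *dst, size_t count, int value)
{
    const struct block4 blk = { { value, value, value, value } };
    size_t i = 0;

    /* Whole 4-word blocks: a fixed-size copy the compiler may turn into STM. */
    for (; i + 4 <= count; i += 4) {
        memcpy(&dst[i], &blk, sizeof blk);
    }
    /* Remaining 0..3 words are written one at a time. */
    for (; i < count; ++i) {
        dst[i] = value;
    }
}
```

Calling `fill_words(buf, 10, 7)` writes two full blocks and then two single words. As noted earlier in the thread, any gain on the M4 is likely to be in code size rather than cycle count.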

  • Hi Zhi Yang,

    One more thing I thought about while looking at your code.  Is your pointer p pointing at 16-bit or 32-bit values?  If 32-bit values, then you can do no better than the STM instruction or multiple store instructions.  However, if it is 16-bit, then on the M4 you could take advantage of some of the 16-bit SIMD instructions and pack two 16-bit values into a single store.  Let me know if you are accessing 16-bit values and if you are using the M3 or M4.  If so, we could provide some example code that uses the SIMD intrinsics on the M4.

    -Paul
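
To illustrate the 16-bit case, here is a minimal sketch of packing two 16-bit values into one 32-bit store. It uses plain C shifts rather than the M4 SIMD intrinsics, and `fill_halfwords` is a hypothetical name chosen for this example. The memcpy of a 4-byte value avoids alignment and aliasing assumptions; compilers commonly reduce it to a single word store, but that is worth confirming in the assembly output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `count` 16-bit elements at `dst` with `value`, two per 32-bit store. */
void fill_halfwords(int16_t *dst, size_t count, int16_t value)
{
    /* The same value sits in both halves, so byte order does not matter here. */
    const uint32_t pair = ((uint32_t)(uint16_t)value << 16) | (uint16_t)value;
    size_t i = 0;

    for (; i + 2 <= count; i += 2) {
        memcpy(&dst[i], &pair, sizeof pair);  /* one 32-bit write covers two elements */
    }
    if (i < count) {
        dst[i] = value;  /* odd trailing element */
    }
}
```

With 32-bit elements, as in the original question, there is nothing left to pack, which is why this only helps for 16-bit (or 8-bit) data.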

  • Hi Zhiyang,

    Yes, but there is a distinction between a multiple store operation (STM) and a multiple data operation (SIMD).

    While you can use an STM to store many values consecutively, which in general takes N+1 cycles for N values, a SIMD instruction operates on multiple data items within the same clock cycle.

    One way to take advantage of SIMD is by using intrinsics. Consider the following 16 bit pointers:

    void function(q15_t *pSrcA, q15_t *pSrcB, q15_t *pDst) {
      *pDst++ = (q15_t) __QADD16(*pSrcA++, *pSrcB++);
    }

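    The piece of code above translates into a QADD16 instruction, which performs two saturating 16-bit additions at once on the halves of 32-bit registers. To benefit from both halves, each operand must hold two packed 16-bit values.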

    Cheers,
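
To make the lane behaviour of QADD16 concrete, the following is a host-side model in portable C. It is an illustration only: `qadd16_model` and `sat16` are hypothetical names, and on the target you would use the `__QADD16` intrinsic, which does this work in one instruction. The model treats each 32-bit operand as two independent signed 16-bit lanes and saturates each sum separately.

```c
#include <stdint.h>

/* Saturate a 32-bit intermediate result to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical model of QADD16: two saturating 16-bit additions, one per
   half of the packed 32-bit operands. The casts to int16_t assume the usual
   two's-complement wraparound, which is what GCC and Clang provide. */
uint32_t qadd16_model(uint32_t a, uint32_t b)
{
    int16_t lo = sat16((int32_t)(int16_t)(a & 0xFFFFu) + (int16_t)(b & 0xFFFFu));
    int16_t hi = sat16((int32_t)(int16_t)(a >> 16) + (int16_t)(b >> 16));
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

For example, `qadd16_model(0x7FFF0001u, 0x00010002u)` yields `0x7FFF0003`: the upper lane saturates at 0x7FFF while the lower lane adds normally, and neither lane affects the other.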


  • Hi All,

    Actually, I am only concerned with memset. The type of p is int*, and the value is an int.

    I thought this approach would bring some advantage, so I thought of SIMD, although I am not familiar with it, and I even confused STM with SIMD.

    Thanks