  • Note: This was originally posted on 10th April 2013 at http://forums.arm.com

    There are several reasons why this code can be slower than plain C code.

    First, you are writing NEON intrinsics by hand but also asking GCC to auto-vectorize your plain C code, since you build with -ftree-vectorize -O3. That can make the comparison misleading: your C version might already be compiled to NEON by GCC, and (I'm not sure about this) the auto-vectorizer might also interfere with the code generated from your intrinsics.

    Also, your for loop only runs twice, so GCC might emit an actual loop instead of unrolling it. Never assume GCC for ARM will find an optimization on its own; unroll the loop yourself to be sure.
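
    To make the unrolling point concrete, here is a minimal sketch. The original code isn't shown here, so scale_block() and both wrapper functions are hypothetical stand-ins for whatever your loop body does; the only point is the difference between a two-iteration loop and the same work written out by hand.

```c
#include <stdint.h>

/* Hypothetical loop body: scale 8 samples by k.
 * Stands in for the real per-iteration work, which is not shown in the thread. */
static void scale_block(int16_t *dst, const int16_t *src, int16_t k)
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)(src[i] * k);
}

/* Loop form: two iterations. The compiler may or may not unroll this. */
void scale_16_loop(int16_t *dst, const int16_t *src, int16_t k)
{
    for (int b = 0; b < 2; ++b)
        scale_block(dst + 8 * b, src + 8 * b, k);
}

/* Hand-unrolled form: same work, no loop counter or branch to generate. */
void scale_16_unrolled(int16_t *dst, const int16_t *src, int16_t k)
{
    scale_block(dst, src, k);
    scale_block(dst + 8, src + 8, k);
}
```

    Both versions compute the same result, so you can compare the generated assembly (for example with gcc -S) to see whether the loop form was actually unrolled.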

    Also, I'm willing to bet that your speed is limited by memory access rather than by the CPU (ARM or NEON). Memory access isn't necessarily faster through NEON than through plain ARM instructions; plain ARM often gets better memory speeds.

    Also, extracting each 16-bit lane with vgetq_lane_s16() might not be efficient. Try vst1q_s16() to store the whole 16-byte vector with 1 NEON instruction instead of 8 lines of code (which GCC might turn into 1 NEON instruction if you are lucky, but into 24 NEON instructions if you are unlucky!).
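
    As a rough sketch of the two store styles: the function names add8_by_lane() and add8_store() are made up for illustration, and the vector add is only there to produce something to store. The intrinsics themselves (vld1q_s16, vaddq_s16, vgetq_lane_s16, vst1q_s16) come from arm_neon.h. The #else branch is a plain C fallback I added so the sketch still builds on a machine without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Lane-by-lane extraction: 8 separate lane reads and scalar stores. */
void add8_by_lane(int16_t *dst, const int16_t *a, const int16_t *b)
{
    int16x8_t v = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    dst[0] = vgetq_lane_s16(v, 0);
    dst[1] = vgetq_lane_s16(v, 1);
    dst[2] = vgetq_lane_s16(v, 2);
    dst[3] = vgetq_lane_s16(v, 3);
    dst[4] = vgetq_lane_s16(v, 4);
    dst[5] = vgetq_lane_s16(v, 5);
    dst[6] = vgetq_lane_s16(v, 6);
    dst[7] = vgetq_lane_s16(v, 7);
}

/* Whole-vector store: all 8 lanes (16 bytes) written by one intrinsic. */
void add8_store(int16_t *dst, const int16_t *a, const int16_t *b)
{
    int16x8_t v = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    vst1q_s16(dst, v);
}

#else

/* Fallback for non-NEON builds (an assumption for portability, not NEON code). */
static void add8_scalar(int16_t *dst, const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)(a[i] + b[i]);
}

void add8_by_lane(int16_t *dst, const int16_t *a, const int16_t *b)
{
    add8_scalar(dst, a, b);
}

void add8_store(int16_t *dst, const int16_t *a, const int16_t *b)
{
    add8_scalar(dst, a, b);
}

#endif
```

    Both functions write the same 8 values to dst; the difference is only in how many instructions the compiler has to emit for the store.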