  • Note: This was originally posted on 16th April 2013 at http://forums.arm.com

    Hi Shervin,

    Thanks a lot for the reply. I removed the auto-vectorization flag and used vst1q_s16() to get the data back out of the NEON register, and it sped up my code, though not by much. I think that is because of what you mentioned about memory being slow, and also because my loop runs for only a short time even though it is called many times.

    I wanted to know: when profiling, is it advisable to use gprof, or is there another profiling tool I can use?

    I have purchased D-stream, but it does not give me granular profiling like gprof does on the JMVC software that I am currently working on. Also, when I tried to run it under RTSM, I had a compiler problem where some libraries were not included.


    Thanks for the help again.
    :D
  • Note: This was originally posted on 10th April 2013 at http://forums.arm.com

    There are several reasons why this code can be slower than plain C code.

    The first thing to note is that you are manually using NEON intrinsics but also telling GCC to try to generate NEON code from your plain C code (auto-vectorization), since you use -ftree-vectorize -O3. That may be giving you a strange comparison (e.g. your C code might be using GCC's generated NEON code, and I'm not sure, but perhaps your NEON intrinsics might be interfering somehow with GCC's NEON code).

    Also, you are using a for loop that only runs twice, so GCC might actually be generating the loop rather than automatically unrolling it (never assume GCC for ARM will figure out any optimization automatically; you are better off unrolling it yourself to be sure).
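As a rough illustration of unrolling a two-iteration loop by hand, here is a minimal sketch. The kernel (a 16-element subtraction) and the names diff16_loop and diff16_unrolled are hypothetical and are not taken from the original code. The scalar branch is only there so the sketch also compiles on machines without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: dst[i] = a[i] - b[i] for 16 int16_t elements,
 * written as a loop that runs exactly twice (8 lanes per iteration). */
void diff16_loop(const int16_t *a, const int16_t *b, int16_t *dst)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (int i = 0; i < 2; i++) {
        int16x8_t va = vld1q_s16(a + 8 * i);
        int16x8_t vb = vld1q_s16(b + 8 * i);
        vst1q_s16(dst + 8 * i, vsubq_s16(va, vb));
    }
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 16; i++)
        dst[i] = (int16_t)(a[i] - b[i]);
#endif
}

/* Same kernel with the two iterations written out by hand, so no loop
 * counter or branch is needed regardless of what the compiler decides. */
void diff16_unrolled(const int16_t *a, const int16_t *b, int16_t *dst)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t a0 = vld1q_s16(a);
    int16x8_t a1 = vld1q_s16(a + 8);
    int16x8_t b0 = vld1q_s16(b);
    int16x8_t b1 = vld1q_s16(b + 8);
    vst1q_s16(dst,     vsubq_s16(a0, b0));
    vst1q_s16(dst + 8, vsubq_s16(a1, b1));
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 16; i++)
        dst[i] = (int16_t)(a[i] - b[i]);
#endif
}
```

Both functions compute the same result. Whether the unrolled form is actually faster depends on the compiler version and flags, so it is worth comparing the generated assembly or timing both.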

    Also, I'm willing to bet money that your speed is not limited by the CPU (ARM or NEON) but by your memory access. Memory access isn't necessarily faster using NEON than plain ARM; often plain ARM will have better memory speeds than NEON.

    Also, extracting each 16-bit lane with vgetq_lane_s16() might not be an efficient solution. Try using vst1q_s16() to store the whole 16 bytes with 1 NEON instruction instead of 8 lines of code (which GCC might turn into 1 NEON instruction if you are lucky, but might turn into 24 NEON instructions if you are unlucky!).
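The difference between the two ways of getting results out of a NEON register can be sketched as follows. The function names and the addition kernel are hypothetical, chosen only to give the vector something to hold. The scalar branch is an assumption added so the sketch builds on non-NEON machines too.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds two 8-element vectors, then writes the result out one lane at a
 * time with vgetq_lane_s16(): eight separate extractions. */
void add_store_by_lane(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t v = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    out[0] = vgetq_lane_s16(v, 0);
    out[1] = vgetq_lane_s16(v, 1);
    out[2] = vgetq_lane_s16(v, 2);
    out[3] = vgetq_lane_s16(v, 3);
    out[4] = vgetq_lane_s16(v, 4);
    out[5] = vgetq_lane_s16(v, 5);
    out[6] = vgetq_lane_s16(v, 6);
    out[7] = vgetq_lane_s16(v, 7);
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}

/* Same computation, but all eight lanes (16 bytes) are written with a
 * single vst1q_s16() call. */
void add_store_whole(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t v = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    vst1q_s16(out, v);
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

Both versions produce identical output. The second simply leaves the compiler less room to generate a long sequence of lane moves.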