ARM Neon vs Intel SSE

Hello experts.

This is my first question, and it concerns the performance of the ARM Neon engine compared to Intel SSEx.

Introduction.

I took a C function that performs addition on 16-bit data in an array and rewrote it with ARM Neon intrinsics. I also wrote the same function with Intel SSE intrinsics, ran the two versions on their respective platforms, and measured the execution time.

On the Intel platform (under Ubuntu) I used the Time Stamp Counter (RDTSC instruction). I ran the function many times to compute the mean execution time, first for the plain C version and then for the C + SSE intrinsics version.

On the ARM platform (under Android) I used the library function clock_gettime (time.h) with the CLOCK_MONOTONIC parameter to get the execution time. I did the same mean-time measurements, first for the C code and then for the C + Neon intrinsics version. The timer resolution on the device was coarse: in my tests, one step of the counter was 50 nanoseconds. I also want to note that the processor core on the device is a Cortex-A75 (out-of-order).
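The measurement described above can be sketched roughly as follows. This is not the code from the post; the helper names now_ns, observed_tick_ns and mean_call_ns are hypothetical. It assumes a POSIX system where clock_gettime and CLOCK_MONOTONIC are available. Timing a whole batch of calls between two clock reads, rather than each call separately, is one way to keep a coarse 50 ns timer step from dominating the result.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading, converted to nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Smallest nonzero difference seen between consecutive clock reads.
 * Hypothetical helper: an estimate of the usable timer step. */
uint64_t observed_tick_ns(int samples)
{
    assert(samples > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = now_ns();
        uint64_t t1;
        do {
            t1 = now_ns();          /* spin until the clock advances */
        } while (t1 == t0);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Mean time per call in nanoseconds. The whole batch is timed with
 * two clock reads, so the timer step is amortized over all calls. */
double mean_call_ns(void (*fn)(void), int iterations)
{
    assert(iterations > 0);
    uint64_t start = now_ns();
    for (int i = 0; i < iterations; i++)
        fn();
    return (double)(now_ns() - start) / iterations;
}
```

If the kernel under test has arguments, it would need a small wrapper with the void (*)(void) signature, and the compiler must not be allowed to optimize the calls away.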

On both platforms the application ran in a single thread under the OS.

The intrinsics were equivalent: _mm_adds_epu16 for SSE and vqaddq_u16 for Neon.
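For readers who want to reproduce the comparison, here is a minimal sketch of what such a kernel might look like. The original function was not posted, so the function names, the unaligned loads and the loop structure are all assumptions. Both intrinsics perform an unsigned saturating add on eight 16-bit lanes, so a single scalar reference can check either path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: unsigned 16-bit add, clamped to 65535. */
void add_sat_u16_scalar(uint16_t *dst, const uint16_t *a,
                        const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t s = (uint32_t)a[i] + b[i];
        dst[i] = (uint16_t)(s > UINT16_MAX ? UINT16_MAX : s);
    }
}

/* SIMD version: 8 lanes per iteration, scalar loop for the tail.
 * Falls back to pure scalar code if neither ISA is available. */
void add_sat_u16_simd(uint16_t *dst, const uint16_t *a,
                      const uint16_t *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8)
        vst1q_u16(dst + i, vqaddq_u16(vld1q_u16(a + i), vld1q_u16(b + i)));
#elif defined(__SSE2__)
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu16(va, vb));
    }
#endif
    add_sat_u16_scalar(dst + i, a + i, b + i, n - i);   /* remaining elements */
}
```

Note that a compiler may auto-vectorize the scalar loop at higher optimization levels, which would change the baseline on either platform; checking the generated assembly is the only way to be sure what is being compared.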

After the measurements I saw a difference in the speed-up:

1) C function + SSE compared to the plain C function -> ~6x speed-up

2) C function + Neon compared to the plain C function -> ~3x speed-up

So my question is this.

Is this the maximum speed-up (I mean arithmetic only, not including memory latencies), and is it due to hardware limitations of the Neon engine? Is Neon slower than SSE?

I saw in the ARM Cortex-A75 Software Optimization Guide that the AArch64 UQADD takes 3 cycles, while the equivalent Intel SSE instruction, paddusw, takes 1 cycle. Can I conclude from this that Neon is slower than SSE?

  • You are talking about "speed-up". But if the plain C code performs poorly on x64, that alone might produce a larger speed-up there than on ARM.
    So to really compare the two, you need the absolute timings of both, and then you need to normalize the results.

    How many SSE units are in the x64 CPU? How many NEON units are in the Cortex-A75?

    Did you take care to keep the pipeline full and avoid stalls?

    I think there are so many screws to turn that a simple comparison might be unfair.

    Anyway, check out xhash on GitHub. It is optimized for NEON and SSE2 and might also give a good hint.
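The normalization suggested in the reply could be done along these lines. This is only a sketch: the function names are hypothetical, and the core clock frequency is an input the reader has to supply for each machine. It converts a total measured time into nanoseconds per element and then into core cycles per element, which is a figure that can be compared across platforms.

```c
#include <assert.h>

/* Absolute cost per array element, from a total time over many calls. */
double ns_per_element(double total_ns, double calls, double elements_per_call)
{
    assert(calls > 0 && elements_per_call > 0);
    return total_ns / (calls * elements_per_call);
}

/* Convert to core cycles: at 1 GHz, one cycle lasts one nanosecond.
 * core_ghz is an assumption the caller provides for the tested core. */
double cycles_per_element(double ns_per_elem, double core_ghz)
{
    assert(core_ghz > 0);
    return ns_per_elem * core_ghz;
}
```

For example, 1000 ns measured over 10 calls of 100 elements each gives 1 ns per element, which is 2 cycles per element on a core running at 2 GHz. On the x86 side, keep in mind that RDTSC on recent CPUs counts at a constant reference rate, which is not necessarily the current core frequency, so raw TSC ticks should not be read as core cycles.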