
3x3 convolution optimized speed using (NEON SIMD) or (NEON SIMD and OpenMP) on S7/Note7

We want to implement a 3x3 convolution on a 4032x3024 image on the S7/Note7, which use chipsets such as the Exynos 8890 (S7 S.LSI) or the Qualcomm MSM8996 Snapdragon 820.
To implement this, we used the Android NDK, NEON SIMD, and OpenMP.
For one image (4032x3024), could you tell me the optimized speed for two cases (1. NEON SIMD implementation, 2. NEON SIMD + OpenMP implementation) on the S7 or Note7?
I only want to know the optimized speed of the 3x3 convolution itself.
The test scenarios are as follows.

1. 3x3 convolution optimized speed (ms) using SIMD on S7 or Note7 with the Exynos 8890 chipset (code name Jungfrau)
2. 3x3 convolution optimized speed (ms) using SIMD + OpenMP on S7 or Note7 with the Exynos 8890 chipset (code name Jungfrau)
3. 3x3 convolution optimized speed (ms) using SIMD on S7 or Note7 with the Qualcomm MSM8996 Snapdragon 820 chipset
4. 3x3 convolution optimized speed (ms) using SIMD + OpenMP on S7 or Note7 with the Qualcomm MSM8996 Snapdragon 820 chipset

Note: Exynos 8890 chipset speed: octa-core (4x2.3 GHz Mongoose & 4x1.6 GHz Cortex-A53)

- 3x3 convolution C code

* Input Buffer : 10 bit, size (4034(W)*3026(H))
* Output Buffer : 10 bit, size (4032(W)*3024(H))

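void convolution_3by3(unsigned short *Input, unsigned short *Output) {
    const int input_width = 4032 + 2; // 4034: output width plus a 1-pixel border on each side
    // `buffer` is not defined in the original post. Assumption: it is the offset
    // of the first interior pixel (row 1, column 1) of the padded input.
    const int buffer = input_width + 1;
    unsigned short *p_I1s_c  = Input + buffer;          // current row
    unsigned short *p_I1s_p1 = p_I1s_c - input_width;   // previous row
    unsigned short *p_I1s_n1 = p_I1s_c + input_width;   // next row

    for (int i = 0; i < 3024; i++) {
        for (int j = 0; j < 4032; j++) {
            const int jm1 = j - 1;
            const int jp1 = j + 1;
            // Kernel [1 2 1; 2 4 2; 1 2 1] / 16: corners x1, edges x2, center x4
            Output[j] = (p_I1s_p1[jm1] + p_I1s_p1[jp1] + p_I1s_n1[jm1] + p_I1s_n1[jp1] +
                         ((p_I1s_c[jm1] + p_I1s_c[jp1] + p_I1s_p1[j] + p_I1s_n1[j]) << 1) +
                         (p_I1s_c[j] << 2)) >> 4;
        }
        // Advance all pointers to the next row
        Output   = Output   + 4032;
        p_I1s_p1 = p_I1s_p1 + input_width;
        p_I1s_c  = p_I1s_c  + input_width;
        p_I1s_n1 = p_I1s_n1 + input_width;
    }
}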
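
For comparison, here is one possible way to vectorize the scalar loop above with NEON intrinsics and split the rows across threads with OpenMP. It is a sketch under stated assumptions, not a tuned implementation: the function name conv3x3_neon_omp and its width/height parameters are hypothetical, the input is assumed to be the padded (width + 2) x (height + 2) buffer described above, and samples are assumed to stay within 10 bits. Under that last assumption the worst-case weighted sum is 16 * 1023 = 16368, which fits in 16 bits, so the arithmetic can stay in uint16x8_t lanes (8 pixels per vector). The code falls back to the scalar path when NEON or OpenMP is not enabled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* in:  padded image, (width + 2) x (height + 2), 10-bit samples in uint16_t
 * out: width x height result
 * Kernel: [1 2 1; 2 4 2; 1 2 1] / 16, same weights as the scalar code. */
void conv3x3_neon_omp(const uint16_t *in, uint16_t *out, int width, int height)
{
    const int in_w = width + 2;
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int i = 0; i < height; i++) {
        /* Row pointers start at column 0 of the padded rows, so for output
         * column j the left/center/right taps are at j, j + 1, j + 2. */
        const uint16_t *p = in + (size_t)i * in_w; /* previous row */
        const uint16_t *c = p + in_w;              /* current row  */
        const uint16_t *n = c + in_w;              /* next row     */
        uint16_t *o = out + (size_t)i * width;
        int j = 0;
#if defined(__ARM_NEON)
        for (; j + 8 <= width; j += 8) {
            uint16x8_t corners = vaddq_u16(
                vaddq_u16(vld1q_u16(p + j), vld1q_u16(p + j + 2)),
                vaddq_u16(vld1q_u16(n + j), vld1q_u16(n + j + 2)));
            uint16x8_t edges = vaddq_u16(
                vaddq_u16(vld1q_u16(c + j), vld1q_u16(c + j + 2)),
                vaddq_u16(vld1q_u16(p + j + 1), vld1q_u16(n + j + 1)));
            uint16x8_t center = vld1q_u16(c + j + 1);
            /* corners + 2 * edges + 4 * center, then divide by 16 */
            uint16x8_t sum = vaddq_u16(
                corners,
                vaddq_u16(vshlq_n_u16(edges, 1), vshlq_n_u16(center, 2)));
            vst1q_u16(o + j, vshrq_n_u16(sum, 4));
        }
#endif
        /* Scalar tail; also the full path when NEON is unavailable. */
        for (; j < width; j++) {
            unsigned sum = p[j] + p[j + 2] + n[j] + n[j + 2] +
                           ((unsigned)(c[j] + c[j + 2] + p[j + 1] + n[j + 1]) << 1) +
                           ((unsigned)c[j + 1] << 2);
            o[j] = (uint16_t)(sum >> 4);
        }
    }
}
```

For the image in this post the call would be conv3x3_neon_omp(Input, Output, 4032, 3024). OpenMP is only active when the build enables it (for example with -fopenmp); without it the pragma is skipped and the loop runs on one thread. Since each row is independent, a static schedule is the simplest split, but on a big.LITTLE chip with unequal cores a different schedule may balance better. Actual timings are not given here because they have to be measured on the device.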