compiler: linaro-aarch64-2020.09-gcc10.2-linux5.4
optimization option: -O3
CPU: Arm Cortex-A53, 1 GHz
Hello, I'm a newbie here.
code1 is about 3.1x slower than code2:
- code1: 106 ms
- code2: 34 ms
I think the only(?) difference is that code2 uses constants for the loop bounds and the shift amount.
I really wonder why there is such a big performance difference between the two pieces of code.
<code 1: img_bitshift function>
void img_bitshift (
    CAMERA_OPAQUE_t *pstDevInfo,
    int16_t img_width,
    int16_t img_height,
    int16_t bitshift
) {
    uint16_t *src_img = (uint16_t *) pstDevInfo->some_field.pVirt;
    uint8_t *dst_img = (uint8_t *) pstDevInfo->some_field.pVirt;
    for (int i = 0; i < img_height; i++) {
        for (int j = 0; j < img_width; j++) {
            uint16_t pixel = src_img[i*img_width + j];
            dst_img[i*img_width + j] = pixel >> bitshift;
        }
    }
}
// img_bitshift(_, 12800, 8000, _) took 106 ms
<code 2: copy and paste of img_bitshift function>
void dummy (
    CAMERA_OPAQUE2_t *camerainfo,
    DummyType *dummy
) {
    int32_t channelIndex = 0;
    for( channelIndex = 0 ; channelIndex < 1 ; channelIndex++ ) {
        // copy&paste of img_bitshift()
        CAMERA_OPAQUE_t *pstDevInfo = camerainfo->channelDevice;
        uint16_t *src_img = (uint16_t *) pstDevInfo->somefield.pVirt;
        uint8_t *dst_img = (uint8_t *) pstDevInfo->somefield.pVirt;
        // NOTE:-----------------------------------------
        // Here, we used constant instead of variable!
        // ----------------------------------------------
        uint16_t img_width = 12800;
        uint16_t img_height = 8000;
        uint16_t bitshift = 8;
        for (int i = 0; i < img_height; i++) {
            for (int j = 0; j < img_width; j++) {
                uint16_t pixel = src_img[i*img_width + j];
                dst_img[i*img_width + j] = pixel >> bitshift;
            }
        }
        /* end of loop */
    }
}
//line23 ~ line30 took 34 ms.
Thanks in advance.
Hello, to understand this it is best to look at the disassembly the compiler generates for each example.
I suspect the latter is able to make (better) use of Neon instructions to vectorize the algorithm: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
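If it helps, here is a minimal sketch for comparing the two variants in isolation. The function names `shift_runtime` and `shift_const`, the small 64x4 test size, and the use of separate source and destination buffers are my own assumptions for illustration, not your original code. Compiling a file like this with `-O3 -S` gives you the assembly for both functions side by side, and adding GCC's `-fopt-info-vec` should make the compiler report which loops it managed to vectorize.

```c
#include <stdint.h>

/* Hypothetical test dimensions, much smaller than the 12800x8000 image. */
enum { W = 64, H = 4, SHIFT = 8 };

/* Variant 1: width, height and shift are only known at run time,
   mirroring the parameters of img_bitshift(). */
void shift_runtime(const uint16_t *src, uint8_t *dst,
                   int16_t width, int16_t height, int16_t shift)
{
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            dst[i * width + j] = (uint8_t)(src[i * width + j] >> shift);
        }
    }
}

/* Variant 2: the same loop nest, but bounds and shift are compile-time
   constants, mirroring the copy-pasted code with constants. */
void shift_const(const uint16_t *src, uint8_t *dst)
{
    for (int i = 0; i < H; i++) {
        for (int j = 0; j < W; j++) {
            dst[i * W + j] = (uint8_t)(src[i * W + j] >> SHIFT);
        }
    }
}
```

Both functions should produce identical output for the same input, so any timing difference comes from the code the compiler generated, not from the algorithm. Whether the constant version actually ends up with Neon instructions on your toolchain is something only the disassembly can confirm.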