Which kind of compiler optimization can be applied to this code?

compiler: linaro-aarch64-2020.09-gcc10.2-linux5.4

optimization option: -O3

CPU: Arm Cortex-A53 @ 1 GHz

Hello, I'm a newbie here.

code1 is 3.1x slower than code2:

- code1: 106 ms

- code2: 34 ms

I think using constants in the for-loop is the only(?) difference.

I really wonder why there is such a big performance difference between the two versions.

<code 1: img_bitshift function>

void img_bitshift
(
    CAMERA_OPAQUE_t *pstDevInfo,
    int16_t img_width,
    int16_t img_height,
    int16_t bitshift
)
{
    uint16_t *src_img = (uint16_t *) pstDevInfo->some_field.pVirt;
    uint8_t *dst_img = (uint8_t *) pstDevInfo->some_field.pVirt;

    for (int i = 0; i < img_height; i++)
    {
        for (int j = 0; j < img_width; j++)
        {
            uint16_t pixel = src_img[i*img_width + j];
            dst_img[i*img_width + j] = pixel >> bitshift;
        }
    }
}

// img_bitshift(_, 12800, 8000, _) took 106 ms

<code 2: copy and paste of img_bitshift function>

void dummy
(
    CAMERA_OPAQUE2_t *camerainfo,
    DummyType *dummy
)
{
    int32_t channelIndex = 0;

    for( channelIndex = 0 ; channelIndex < 1 ; channelIndex++ )
    {
        // copy&paste of img_bitshift()
        CAMERA_OPAQUE_t *pstDevInfo = camerainfo->channelDevice;
        uint16_t *src_img = (uint16_t *) pstDevInfo->somefield.pVirt;
        uint8_t *dst_img = (uint8_t *) pstDevInfo->somefield.pVirt;
        
        // NOTE:-----------------------------------------
        // Here, we used constant instead of variable!
        // ----------------------------------------------
        uint16_t img_width = 12800;
        uint16_t img_height = 8000;
        uint16_t bitshift = 8;

        for (int i = 0; i < img_height; i++)
        {
            for (int j = 0; j < img_width; j++)
            {
                uint16_t pixel = src_img[i*img_width + j];
                dst_img[i*img_width + j] = pixel >> bitshift;
            }
        } /* end of loop */
    }
}

// The copied img_bitshift loop above took 34 ms.

Thanks in advance.

  • Hello, to understand this it is best to look at the disassembly the compiler generates for each example.

    I suspect the latter is able to make (better) use of Neon instructions to vectorize the algorithm:
    https://developer.arm.com/architectures/instruction-sets/simd-isas/neon

  • The only constant that makes a difference here is the `8` in the right shift. To understand what happens you do indeed need to look at the generated code, as Ronan suggested.  As an example, https://godbolt.org/z/j5nx7EofM shows the difference between the two cases.

    In C, the uint16_t operations you are doing are required to be performed as int.  These are the integer promotion rules (read more at https://en.cppreference.com/w/c/language/conversion and search for "Integer promotion").
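
    As a minimal illustration of the promotion (the values below are arbitrary, chosen only to make the effect visible):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t pixel    = 0xABCD;  /* arbitrary sample value */
        int16_t  bitshift = 8;

        /* Both operands are promoted to int, so the shift itself is a
           32-bit operation; the result is only narrowed back down when
           it is stored into the uint8_t destination. */
        uint8_t result = (uint8_t)(pixel >> bitshift);

        printf("%u\n", (unsigned)result);  /* prints 171, i.e. 0xAB */
        return 0;
    }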

    So in the non-constant case we have to widen the uint16_t values to uint32_t, do the shifts as 32-bit integers, and then narrow the results back to the final size.  This of course causes significant overhead.
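
    Roughly speaking, the vectorized non-constant path ends up doing something like the following Neon-intrinsics sketch (the function name and the assumption that the pixel count is a multiple of 8 are mine, purely for illustration):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch only: widen each uint16_t to uint32_t, shift by a runtime
       amount, then narrow back down to uint8_t. */
    void shift_widen_narrow(const uint16_t *src, uint8_t *dst,
                            int n, int bitshift)
    {
        /* vshlq shifts left by a per-lane signed amount; a negative
           amount gives a right shift. */
        int32x4_t vshift = vdupq_n_s32(-bitshift);

        for (int i = 0; i + 8 <= n; i += 8)
        {
            uint16x8_t v  = vld1q_u16(src + i);
            uint32x4_t lo = vmovl_u16(vget_low_u16(v));   /* widen to u32 */
            uint32x4_t hi = vmovl_u16(vget_high_u16(v));

            lo = vshlq_u32(lo, vshift);                   /* 32-bit shifts */
            hi = vshlq_u32(hi, vshift);

            uint16x8_t tmp = vcombine_u16(vmovn_u32(lo), vmovn_u32(hi));
            vst1_u8(dst + i, vmovn_u16(tmp));             /* narrow to u8 */
        }
    }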

    In your constant case we can avoid the widening and narrowing conversions and instead shift the values directly as uint16_t.
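
    With the shift amount known at compile time, the loop can stay in 16-bit lanes.  A sketch of the shape (again with a made-up function name and a multiple-of-8 length assumed):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch only: the shift amount is an immediate, so it can be done
       directly on uint16_t lanes and narrowed once. */
    void shift_const_u16(const uint16_t *src, uint8_t *dst, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8)
        {
            uint16x8_t v = vld1q_u16(src + i);
            uint16x8_t s = vshrq_n_u16(v, 8);   /* shift as uint16_t */
            vst1_u8(dst + i, vmovn_u16(s));     /* narrow to u8 */
        }
    }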

    This, however, still isn't the fastest way to do it.  In this case the Arm compiler is missing some tricks that the AArch64 one applies.

    Since your shift amount is half the width of your data type and you're doing a logical right shift, you can replace the shifts with UZP if you use Neon intrinsics.
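
    For a shift of 8, (uint8_t)(pixel >> 8) is simply the high byte of each uint16_t, so something along these lines becomes possible (a sketch with made-up names, assuming AArch64, little-endian byte order, and a pixel count that is a multiple of 16):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch only: on little-endian AArch64 the high byte of each
       uint16_t sits at the odd byte position, so UZP2 can gather 16
       results from two input vectors at once. */
    void shift8_uzp(const uint16_t *src, uint8_t *dst, int n)
    {
        for (int i = 0; i + 16 <= n; i += 16)
        {
            uint8x16_t a = vreinterpretq_u8_u16(vld1q_u16(src + i));
            uint8x16_t b = vreinterpretq_u8_u16(vld1q_u16(src + i + 8));
            vst1q_u8(dst + i, vuzp2q_u8(a, b));   /* keep the odd (high) bytes */
        }
    }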

    See what the AArch64 compiler generates for dummy.