How to gain performance through PBO (pixel buffer object) on Mali T-880

I'm working on a corner detection algorithm on an international version of the Samsung S7, which is powered by a Mali T-880. The basic framework is: 1) grab the Android camera capture into an OpenGL texture; 2) run it through several stages of image filters written as GLSL shaders; 3) read the processed result back to main memory and let the CPU finish the final detection. As you can imagine, the performance bottleneck is glReadPixels in step 3. The texture/render target size is 2560 * 1440, and glReadPixels() usually takes 180 ms after all the draw calls for these image filters (with no filters at all, just fetching the last render still takes 120 ms). Since issuing the draw commands for these filters is extremely fast (< 10 ms), I can still get a 2x performance boost.

Next I tried to optimize glReadPixels further by using PBOs. Here is my code:

// initialize pbo
const int pbo_count = 2;
glGenBuffers( pbo_count, gl_pbo_ids );
for (int i = 0; i < pbo_count; ++i)
{
     glBindBuffer( GL_PIXEL_PACK_BUFFER, gl_pbo_ids[i] );
     glBufferData( GL_PIXEL_PACK_BUFFER, pbo_buffer_size, 0, GL_DYNAMIC_READ );
}

// in the render thread, trigger an asynchronous pixel read into one pbo, and process the data from the other pbo

static int r_idx = 0;
int p_idx = 0;
r_idx = (r_idx + 1) % pbo_count;
p_idx = (r_idx + 1 ) % pbo_count;

glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo_ids[r_idx]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo_ids[p_idx]);
pbo_ptr= (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_buffer_size, GL_MAP_READ_BIT);
memcpy(data_ptr, pbo_ptr, pbo_buffer_size);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// process data_ptr with CPU ....
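
To make the buffer rotation above easier to follow, the index arithmetic can be pulled out on its own. This is only a sketch: `PboSlots` and `next_pbo_slots` are hypothetical names that do not appear in the code above, and the helper makes no GL calls. It shows that with two PBOs the buffer being mapped is always the one that was the glReadPixels() target on the previous frame.

```c
#include <assert.h>

/* Hypothetical helper type: which PBO receives this frame's glReadPixels()
 * and which PBO is mapped and copied out on the CPU side. */
typedef struct {
    int read_idx; /* target of glReadPixels() this frame */
    int map_idx;  /* buffer mapped with glMapBufferRange() this frame */
} PboSlots;

/* Same arithmetic as r_idx / p_idx in the post: advance the read slot by one,
 * then map the slot after it, which is the oldest buffer in the ring. */
static PboSlots next_pbo_slots(int prev_read_idx, int pbo_count)
{
    assert(pbo_count >= 2);
    assert(prev_read_idx >= 0 && prev_read_idx < pbo_count);

    PboSlots s;
    s.read_idx = (prev_read_idx + 1) % pbo_count;
    s.map_idx  = (s.read_idx + 1) % pbo_count;
    return s;
}
```

One consequence worth keeping in mind: on the very first frame the mapped buffer has not yet been the target of any read, so its contents are whatever glBufferData() left there and should be skipped.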

With this code I get results identical to the non-PBO version, but the performance is even worse. glReadPixels() does return immediately, and so does glMapBufferRange(). The problem is that memcpy() takes around 450 ms, which is a disaster. I wonder whether I missed some setup or have problematic code. I tested memcpy() between two CPU-allocated memory areas of the same size, and it takes no more than 5 ms. I also tried increasing the number of PBOs; it didn't help.
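
To put those timings in perspective, here is a rough back-of-the-envelope sketch. It assumes pbo_buffer_size is width * height * 4 bytes (one GL_RGBA / GL_UNSIGNED_BYTE frame), which the code above implies but does not state; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one tightly packed RGBA8 frame (assumed layout: 4 bytes per pixel). */
static size_t rgba8_frame_bytes(size_t width, size_t height)
{
    return width * height * 4u;
}

/* Effective copy rate in MB/s (1 MB = 10^6 bytes) for a copy that
 * moved `bytes` bytes in `millis` milliseconds. */
static double copy_rate_mb_per_s(size_t bytes, double millis)
{
    assert(millis > 0.0);
    return ((double)bytes / 1.0e6) / (millis / 1000.0);
}
```

Under that assumption a 2560 * 1440 frame is 14,745,600 bytes, so the 450 ms copy out of the mapped PBO runs at about 33 MB/s, while the 5 ms copy between ordinary CPU allocations runs at roughly 2.9 GB/s or more. The gap is about a factor of 90 for the same number of bytes.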

I found some suggestions to use a separate asynchronous pixel-read thread, but I haven't tried that, because if memcpy() really takes this long I don't think it will solve my issue.

Any opinions would be much appreciated.

Thanks.

  • Most Mali graphics memory is uncached on the CPU to avoid cache maintenance during normal operation. Normal memcpy tends to be pretty poor at reading from uncached buffers; I'd recommend writing your own copy routine specifically to handle this.

    I'd suggest using NEON and making big vector loads and stores to ensure you make the biggest possible bus transfer sizes. I've not tested it, but something like this in ARMv8 NEON will transfer big data blocks very quickly. Ensure the base addresses are 64-byte aligned for maximum performance (the driver should be doing this for you for the src buffer, but check your dst buffer too).

    # x0 = *dst
    # x1 = *src
    # x2 = size (must be multiple of 64)
    neon_aligned_memcpy:
        ld4        {v0.2d, v1.2d, v2.2d, v3.2d}, [x1], #64
        subs       x2, x2, #64
        st4        {v0.2d, v1.2d, v2.2d, v3.2d}, [x0], #64
        bne        neon_aligned_memcpy
        ret
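
For anyone who would rather stay in C than hand-write assembly, the same idea can be sketched with NEON intrinsics. This is untested on the S7 and only an illustration: `copy_from_uncached` is a hypothetical name, it uses plain `vld1q_u8` / `vst1q_u8` loads and stores rather than the ld4/st4 pair above, and it falls back to memcpy() in 64-byte chunks when the compiler does not define `__ARM_NEON`. Whether it actually beats the library memcpy() on uncached memory would need measuring on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy `size` bytes from src to dst in 64-byte blocks,
 * issuing four 128-bit loads before the four matching stores.
 * Any tail smaller than 64 bytes is handled with memcpy(). */
static void copy_from_uncached(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(dst != NULL && src != NULL);

    size_t blocks = size / 64;
    for (size_t i = 0; i < blocks; ++i) {
#if defined(__ARM_NEON)
        /* Load the whole 64-byte block first, then store it. */
        uint8x16_t a = vld1q_u8(src);
        uint8x16_t b = vld1q_u8(src + 16);
        uint8x16_t c = vld1q_u8(src + 32);
        uint8x16_t d = vld1q_u8(src + 48);
        vst1q_u8(dst, a);
        vst1q_u8(dst + 16, b);
        vst1q_u8(dst + 32, c);
        vst1q_u8(dst + 48, d);
#else
        /* Portable fallback so the sketch also builds on non-ARM hosts. */
        memcpy(dst, src, 64);
#endif
        src += 64;
        dst += 64;
    }

    /* Remaining bytes when size is not a multiple of 64. */
    memcpy(dst, src, size % 64);
}
```

In the original post this would replace the memcpy(data_ptr, pbo_ptr, pbo_buffer_size) call between glMapBufferRange() and glUnmapBuffer().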
    

    Also worth noting that the Galaxy S7 uses a Samsung-designed ARM core (the M1), so they may be better able to advise on how to get the best copy performance out of it.

    HTH,
    Pete
