I'm working on a corner detection algorithm on an international version of the Samsung S7, which is powered by a Mali-T880. The basic framework is: 1) grab the Android camera capture into an OpenGL texture; 2) run it through several stages of image filters written as GLSL shaders; 3) read the processed result back to main memory and let the CPU finish the final detection. As you can imagine, the performance bottleneck is glReadPixels in step 3. The texture/render target size is 2560 * 1440, and glReadPixels() usually takes 180ms after all the draw calls of these image filters (with no filters at all, just fetching the last render still takes 120ms). Since issuing the draw commands for these filters is extremely fast (< 10ms), I can still get a 2x performance boost.
I then tried to optimize glReadPixels further by using PBOs. Here is my code:
// initialize pbo
const int pbo_count = 2;
glGenBuffers( pbo_count, gl_pbo_ids );
for (int i = 0; i < pbo_count; ++i)
{
    glBindBuffer( GL_PIXEL_PACK_BUFFER, gl_pbo_ids[i] );
    glBufferData( GL_PIXEL_PACK_BUFFER, pbo_buffer_size, 0, GL_DYNAMIC_READ );
}
// in render thread, trigger read pixel on one pbo asynchronously , and process another pbo data
static int r_idx = 0;
int p_idx = 0;
r_idx = (r_idx + 1) % pbo_count;
p_idx = (r_idx + 1) % pbo_count;

glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo_ids[r_idx]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo_ids[p_idx]);
pbo_ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_buffer_size, GL_MAP_READ_BIT);
memcpy(data_ptr, pbo_ptr, pbo_buffer_size);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
// process data_ptr with CPU ....
With this code I get results identical to the non-PBO version, but the performance is even worse. glReadPixels() does return immediately, and so does glMapBufferRange(). The problem is that memcpy() takes around 450 ms, which is a total disaster. I wonder whether I missed some setup or have some problematic code. I tested memcpy() between two CPU-allocated memory areas of the same size, and it takes no more than 5ms. I also tried increasing the number of PBOs; it didn't help.
I found some suggestions to use an asynchronous pixel-read thread, but I haven't tried that, because if memcpy() really takes this long I don't think it will solve my issue.
Any opinions would be much appreciated.
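For reference, the index rotation used in the render thread can be isolated into a small helper to check which buffer is mapped on each frame. This is only a sketch: pbo_ring, pbo_ring_init and pbo_ring_advance are hypothetical names that do not appear in the code above, and the GL calls are left out.

```c
#include <assert.h>

/* Hypothetical helper (not part of the posted code): tracks which PBO
   receives this frame's glReadPixels and which one is mapped for the CPU. */
typedef struct {
    int count;  /* number of PBOs in the ring, must be >= 2 */
    int read;   /* index bound for the current glReadPixels */
} pbo_ring;

static void pbo_ring_init(pbo_ring *ring, int count)
{
    assert(count >= 2);
    ring->count = count;
    ring->read = 0;
}

/* Advance one frame, using the same arithmetic as r_idx/p_idx above.
   Returns the index to map on the CPU: the slot after the read slot,
   which is the PBO whose read-back was issued longest ago. */
static int pbo_ring_advance(pbo_ring *ring)
{
    ring->read = (ring->read + 1) % ring->count;
    return (ring->read + 1) % ring->count;
}
```

With this rotation the first pbo_count - 1 frames map a buffer that no glReadPixels has written yet, so the CPU result for those frames should probably be discarded.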
Thanks.
Most Mali graphics memory is uncached on the CPU to avoid cache maintenance during normal operation. Normal memcpy tends to be pretty poor at reading from uncached buffers; I'd recommend writing your own copy routine specifically to handle this.
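To put rough numbers on that gap, the timings from the question can be turned into effective copy throughput. The sketch below assumes the PBO holds exactly width * height * 4 bytes (RGBA8), which the question implies but does not state; frame_bytes and throughput_mb_per_s are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one RGBA8 frame (4 bytes per pixel). */
static size_t frame_bytes(size_t width, size_t height)
{
    return width * height * 4u;
}

/* Effective copy throughput in MB/s, with 1 MB = 1e6 bytes. */
static double throughput_mb_per_s(size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return ((double)bytes / 1.0e6) / (milliseconds / 1000.0);
}
```

For a 2560 * 1440 frame this gives 14,745,600 bytes, so the 450 ms memcpy() from the mapped PBO works out to about 33 MB/s, while the 5 ms copy between two ordinary CPU buffers is about 2.9 GB/s.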
I'd suggest using NEON and making big vector loads and stores to ensure you get the biggest possible bus transfer sizes. I've not tested it, but something like this in ARMv8 NEON will transfer big data blocks very quickly. Ensure the base addresses are 64-byte aligned for maximum performance (the driver should be doing this for you for the src buffer, but check your dst buffer too).
# x0 = *dst
# x1 = *src
# x2 = size (must be multiple of 64)
neon_aligned_memcpy:
    ld4 {v0.2d, v1.2d, v2.2d, v3.2d}, [x1], #64
    subs x2, x2, #64
    st4 {v0.2d, v1.2d, v2.2d, v3.2d}, [x0], #64
    bne neon_aligned_memcpy
    ret
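The same idea can also be sketched in C with NEON intrinsics, for anyone who would rather not maintain assembly. This is an untested-on-device sketch: copy_blocks64 is a hypothetical name, the non-NEON branch exists only so the code builds off-target, and the compiler is free to schedule the loads and stores differently from the hand-written loop, so the generated code is worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy `size` bytes in 64-byte blocks.
   `size` must be a multiple of 64. */
static void copy_blocks64(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(size % 64 == 0);
    for (size_t off = 0; off < size; off += 64) {
#if defined(__ARM_NEON)
        /* Four 128-bit loads followed by four 128-bit stores. */
        uint8x16_t a = vld1q_u8(src + off);
        uint8x16_t b = vld1q_u8(src + off + 16);
        uint8x16_t c = vld1q_u8(src + off + 32);
        uint8x16_t d = vld1q_u8(src + off + 48);
        vst1q_u8(dst + off, a);
        vst1q_u8(dst + off + 16, b);
        vst1q_u8(dst + off + 32, c);
        vst1q_u8(dst + off + 48, d);
#else
        /* Portable fallback so the sketch also compiles without NEON. */
        memcpy(dst + off, src + off, 64);
#endif
    }
}
```

Here src would be the pointer returned by glMapBufferRange() and dst the CPU-side buffer, with size equal to the PBO size when that is a multiple of 64.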
Also worth noting that the Galaxy S7 uses a Samsung-designed ARM core (the M1), so they may be better able to advise on how to get the best copy performance out of it.
HTH, Pete
Hi Peter,
Thanks for your accurate insight. By using NEON code, I improved the specialized memcpy from 450ms to 120ms. I just have a quick question and hope you can offer an opinion.
Since I'm not an expert in NEON code, I basically borrowed the code from http://stackoverflow.com/questions/34888683/arm-neon-memcpy-optimized-for-uncached-memory like this:
void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                    \n"
        "    VLDM %[src]!,{d0-d7}        \n"
        "    VSTM %[dst]!,{d0-d7}        \n"
        "    SUBS %[sz],%[sz],#0x40      \n"
        "    BGT NEONCopyPLD             \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz)
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}
I think it's essentially similar to your code: transfer the data at maximum bandwidth. The faster version is also supposed to use preload. However, like Timothy Miller, I found PLD didn't quite help for this scenario. Is this because this graphics memory is totally outside the caching mechanism? So, in other words, it already approaches the performance limits and can never be as fast as cached memory.
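One detail in my_copy above: when sz is not a multiple of 64 it is rounded up, so the loop reads and writes up to 63 bytes past the end of both buffers. A 2560 * 1440 RGBA frame is 14,745,600 bytes, which is a multiple of 64, so that case should not be hit here if the buffer is exactly one frame. For other sizes, a possible sketch is to split the copy into a 64-byte-multiple bulk part and a plain memcpy() tail. The names bulk_bytes, bulk_copy64 and copy_exact are hypothetical, and bulk_copy64 is only a stand-in for the NEON loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest multiple of 64 not exceeding `size`. */
static size_t bulk_bytes(size_t size)
{
    return size & ~(size_t)63;
}

/* Stand-in for the NEON block copy; a real build would call the
   assembly routine here. `size` must be a multiple of 64. */
static void bulk_copy64(unsigned char *dst, const unsigned char *src, size_t size)
{
    assert(size % 64 == 0);
    memcpy(dst, src, size);
}

/* Copy exactly `size` bytes: whole 64-byte blocks through the bulk
   routine, the remaining 0..63 bytes through memcpy, so nothing past
   `size` is read or written. */
static void copy_exact(unsigned char *dst, const unsigned char *src, size_t size)
{
    size_t bulk = bulk_bytes(size);
    if (bulk > 0) {
        bulk_copy64(dst, src, bulk);
    }
    memcpy(dst + bulk, src + bulk, size - bulk);
}
```

The tail is at most 63 bytes, so using the slower generic memcpy() for it should not matter for the overall copy time.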
> found PLD didn't quite help for this scenario. Is this because this graphics memory is totally out of caching mechanism?
Yes - PLD requires the buffer to be cached, as the cache acts as the storage for any prefetched data.
> So, in other word, it already approaches the performance limits and can never be as fast as cached memory.
There may be some specific micro-architecture improvements for the M1 which might help; but the CPU isn't designed by ARM so I don't know of any. Might be worth asking on Samsung's developer forums.
Cheers,
Pete