I am trying to save a screenshot of a Qt Quick Controls application on a platform running Qt on Wayland, using native OpenGL/EGL functions. What I am doing is creating an RGB color render buffer, wrapping it with the eglCreateImageKHR function, and then sending the EGLImageKHR void pointer to another device through Qt socket communication. I can successfully create the EGLImage, meaning eglGetError reports no error. To test that the EGLImageKHR object is correct, I bind it to another framebuffer with glEGLImageTargetRenderbufferStorageOES in the same process, read the pixels back with glReadPixels, and create a PNG file from the read buffer; the resulting PNG has the correct colors.
After that I tried to send this EGLImageKHR void pointer to another device or process and then create a PNG from the received EGLImageKHR object, but the PNG is not correctly colored; it contains only noise.
The following code sample creates the EGLImageKHR from the render buffer and then saves a tga_file from the EGLImageKHR.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Create a render buffer and bind it to a framebuffer
glGenRenderbuffers(1, &renderBuffer);
glBindRenderbuffer(GL_RENDERBUFFER, renderBuffer);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGB, mWinWidth, mWinHeight);
glBindRenderbuffer(GL_RENDERBUFFER, 0);
if (glGetError() != GL_NO_ERROR)
    qDebug() << "Render buff storage error is " << glGetError();

glGenFramebuffers(1, &frameBuffer);
glBindFramebuffer(GL_FRAMEBUFFER, frameBuffer);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer);
if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
    qDebug() << "Framebuffer error is " << glGetError();

// Create the EGLImageKHR object
mWinWidth = mwindow->width();
mWinHeight = mwindow->height();
glGetIntegerv(GL_PACK_ALIGNMENT, &rowPack);
glPixelStorei(GL_PACK_ALIGNMENT, 1);
glBindFramebuffer(GL_READ_FRAMEBUFFER, mwindow->openglContext()->defaultFramebufferObject());
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, frameBuffer);
glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

m_display = static_cast<EGLDisplay>(
    QGuiApplication::platformNativeInterface()->nativeResourceForIntegration("egldisplay"));
m_context = QGuiApplication::platformNativeInterface()->nativeResourceForContext(
    "eglcontext", mwindow->openglContext());

mImage = CreateImageKHR(m_display, m_context, EGL_GL_RENDERBUFFER_KHR,
                        reinterpret_cast<EGLClientBuffer>(static_cast<uintptr_t>(renderBuffer)),
                        nullptr);
if (mImage == EGL_NO_IMAGE_KHR)
{
    qDebug("failed to make image from target buffer: %s", get_egl_error());
    return -1;
}

int size = mWinWidth * mWinHeight * 3;
sendEglImage(size);

glDeleteRenderbuffers(1, &renderBuffer);
renderBuffer = 0;
glDeleteFramebuffers(1, &frameBuffer);
frameBuffer = 0;

// Send the EGLImageKHR to the client
void sendEglImage(int size)
{
    if (SenderSocket != NULL)
    {
        QByteArray data;
        data.append(reinterpret_cast<const char*>(mImage), size);
        QDataStream out(&data, QIODevice::WriteOnly);
        out.setDevice(SenderSocket);
        out << data;
        qDebug() << "func " << __FUNCTION__ << "line" << __LINE__ << "data size" << data.size();
    }
    QImage testImg((uchar *)mImage, 640, 480, QImage::Format_RGB888, nullptr, nullptr);
    if (testImg.save("server.png"))
        qDebug() << "Successfully saved image" << testImg;
    DestroyImageKHR(m_display, mImage);
    mImage = 0;
}

// Another approach to create a tga_file from the EGLImageKHR:
FILE *out = fopen("tga_file", "w");
short TGAhead[] = {0, 2, 0, 0, 0, 0, 640, 480, 24};
fwrite(&TGAhead, sizeof(TGAhead), 1, out);
fwrite(mImage, mWinWidth * mWinHeight * 3, 1, out);
fflush(out);
fclose(out);

// One more different trial:
int bufSize = mWinHeight * mWinWidth * 3;
unsigned char *trialBuff = new unsigned char[bufSize];
memcpy(trialBuff, khrImage, bufSize);
FILE *out2 = fopen("dada.txt", "w");
fwrite(trialBuff, bufSize, 1, out2);
fflush(out2);
fsync(fileno(out2));
fclose(out2);
delete[] trialBuff;
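As an aside, the TGA trick above can only work once the pixels are actually in CPU memory (e.g. after a glReadPixels); passing the EGLImageKHR handle to fwrite just dumps the pointer value. A minimal sketch of an uncompressed 24-bit TGA writer for a CPU-side buffer (writeTga is a hypothetical helper name; note TGA expects BGR byte order and a binary-mode file):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Write an uncompressed 24-bit TGA (image type 2) from a CPU-side BGR buffer.
// This only works on pixels that were already read back to CPU memory;
// an EGLImageKHR handle itself contains no pixel data.
bool writeTga(const char *path, const uint8_t *bgr, uint16_t w, uint16_t h)
{
    FILE *out = fopen(path, "wb");          // binary mode, not "w"
    if (!out)
        return false;
    // 18-byte TGA header, little-endian width/height, 24 bits per pixel
    uint8_t header[18] = {0};
    header[2]  = 2;                         // image type: uncompressed true color
    header[12] = uint8_t(w & 0xFF);
    header[13] = uint8_t(w >> 8);
    header[14] = uint8_t(h & 0xFF);
    header[15] = uint8_t(h >> 8);
    header[16] = 24;                        // bits per pixel
    bool ok = fwrite(header, sizeof(header), 1, out) == 1
           && fwrite(bgr, size_t(w) * h * 3, 1, out) == 1;
    fclose(out);
    return ok;
}
```

The original `short TGAhead[]` header works on little-endian machines for the same reason, but `fopen(..., "w")` instead of `"wb"` can corrupt the output on some platforms.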
So when I try to create a PNG with QImage, or a tga_file with fwrite, from the EGLImageKHR object, I do not get a valid file.
Note that I do not want to use glReadPixels, since it causes high CPU load. Is there any idea how I can create a PNG file from an EGLImageKHR, and how I can send it to another device?
Best Regards
Randomly happened upon this thread.
I admit, I don't know Qt at all. Is there not a capability in the library to take a screenshot and return a CPU memory buffer? KHR structures don't magically give you CPU pixel buffers; they're just further wrappers around GPU memory. You need the pixels in CPU memory to send a block to another machine (or to encode it to, say, PNG or JPEG, and then send that).
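To make that concrete: the handle is only pointer-sized, so appending it to a QByteArray serializes a few bytes of address that mean nothing in another process's address space, not the frame itself. A tiny illustration (FakeEGLImageKHR is a stand-in name for the real opaque typedef in the EGL headers):

```cpp
#include <cstddef>

// Stand-in for the opaque EGL handle type (typedef void* EGLImageKHR in the headers).
typedef void *FakeEGLImageKHR;

// What actually gets serialized if you append the handle itself: its pointer value.
constexpr size_t kHandleBytes = sizeof(FakeEGLImageKHR);

// What needs to be sent: the pixel data of a 640x480 RGB frame.
constexpr size_t kFrameBytes = 640u * 480u * 3u;   // 921600 bytes
```

Sending `kHandleBytes` of pointer where `kFrameBytes` of pixels are expected is exactly why the receiver sees noise.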
If Qt doesn't already expose a screen-capture feature, then, if everything in Qt goes via GL under the covers and you can get at the current draw target, yes, glReadPixels with GL_UNSIGNED_BYTE and GL_RGBA as arguments should be reasonable -- those args should give the fastest path, though it may still be a bit slow. Using PBOs will be a bit faster, since the readback can be deferred and done asynchronously, but the data is then not immediately usable (you add latency to getting the data, in exchange for not stalling the GPU/CPU to get it).
Hope that adds something to the conversation here.
Hello,
Thanks for the information. First of all, I want to clarify something. Qt here is just an example; my goal is to create generic functions that I can also use when Android or some other technology is involved. Also, the other device I want to send the pixels to may not be powerful, may not have DMA support, and may not have OpenGL support. For these reasons, I need to get a byte array and send it to the other device using glReadPixels. Qt has functionality such as grabWindow, but it may require conversions from QImage to raw pixels, which could cause higher CPU usage (not 100% sure about the conversions and the CPU cost).
I implemented different algorithms using PBOs and observed that the read time, process time, and CPU usage vary between devices, which is expected.
What I need is 16-bit RGB color pixels, which I will send to the other device. So here are some of the algorithms and their results:
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::createPixelBO()
{
    if (!buffCreated)
    {
        pbo_size = mWinHeight * mWinWidth * 2;
        pixels = new unsigned char[pbo_size];
        glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
        glPixelStorei(GL_PACK_ALIGNMENT, 1);
        glGenBuffers(PBO_COUNT, pboIds);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[1]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        buffCreated = true;

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();
        if (glInfo.isExtensionSupported("GL_ARB_pixel_buffer_object"))
        {
            qDebug() << "Video card supports GL_ARB_pixel_buffer_object.";
            pboSupported = true;
        }
        else
        {
            qDebug() << "Video card does NOT support GL_ARB_pixel_buffer_object.";
            pboSupported = false;
            return;
        }
    }
}

void WaylandEgl::runPixelBO()
{
    static int index = 0;
    int nextIndex = 0;              // pbo index used for next frame

    index = (index + 1) % 2;        // index 1
    nextIndex = (index + 1) % 2;    // nextIndex 0

    createPixelBO();
    memset(pixels, 0, pbo_size);
    glReadBuffer(GL_FRONT);

    if (pboSupported)
    {
        t1.start();
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[index]);
        glReadPixels(0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();

        t1.start();
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[nextIndex]);
        GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
        if (ptr)
        {
            memcpy(pixels, ptr, pbo_size);
            //sendImage(pixels, pbo_size);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        else
        {
            qDebug() << "NULL buffer";
        }
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }
    else
    {
        t1.start();                 // measure the time reading the framebuffer
        glReadPixels(0, 0, mWinWidth, mWinHeight, GL_BGR, GL_UNSIGNED_BYTE, pixels);
        t1.stop();
        readTime = t1.getElapsedTimeInMilliSec();

        t1.start();
        t1.stop();
        processTime = t1.getElapsedTimeInMilliSec();
    }

    qDebug() << "Read Time " << readTime;
    qDebug() << "Process Time " << processTime;
}
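For the receiving device, it helps to know the exact bit layout GL_UNSIGNED_SHORT_5_6_5 produces: red in the top 5 bits, green in the middle 6, blue in the low 5 of each 16-bit value. A CPU-side sketch of the same packing and its inverse (packRgb565/unpackRgb565 are hypothetical helper names, useful for unpacking on a device without OpenGL):

```cpp
#include <cstdint>

// Pack 8-bit RGB into the GL_UNSIGNED_SHORT_5_6_5 layout:
// bits 15..11 = red (5 bits), 10..5 = green (6 bits), 4..0 = blue (5 bits).
uint16_t packRgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return uint16_t(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Unpack back to 8-bit channels, replicating high bits into the low bits
// so that full-intensity values round-trip to 255.
void unpackRgb565(uint16_t p, uint8_t &r, uint8_t &g, uint8_t &b)
{
    r = uint8_t(((p >> 11) & 0x1F) << 3);  r |= r >> 5;
    g = uint8_t(((p >> 5)  & 0x3F) << 2);  g |= g >> 6;
    b = uint8_t((p & 0x1F) << 3);          b |= b >> 5;
}
```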
PBO ON:
Intel : 12-13% cpu usage without sending pixels to another device, Read Time: 0.029 ms, Process Time: 0.245 ms
Nvidia: 29-32% cpu usage without sending pixels to another device, Read Time: 4.708 ms, Process Time: 0.47 ms
PBO OFF:
Intel : 10-12% cpu usage without sending pixels to another device, Read Time: 1.978 ms, Process Time: 0.001 ms
Nvidia: 35-38% cpu usage without sending pixels to another device, Read Time: 3.026 ms, Process Time: 0.001 ms
Following is the second algorithm:
void WaylandEgl::initFastBuffers()
{
    if (!buffCreated)
    {
        pbo_size = mWinHeight * mWinWidth * 2;
        pixels = new unsigned char[pbo_size];
        Readback_buf = (GLchar *) malloc(pbo_size);

        glPixelStorei(GL_PACK_ALIGNMENT, 1);
        glGenBuffers(PBO_COUNT, pboIds);

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;
        glBindBuffer(target, pboIds[0]);
        glBufferData(target, pbo_size, 0, GL_STATIC_COPY);

        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not found!";
            return;
        }
        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
            qDebug() << "glMakeBufferResidentNV not found!";
        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
            qDebug() << "glUnmapBufferARB not found!";
        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
            qDebug() << "glGetBufferSubData not found!";

        qDebug() << "Run the optimizations";

        GLuint64EXT addr;
        glGetBufferParameterui64vNV(target, GL_BUFFER_GPU_ADDRESS_NV, &addr);
        glMakeBufferResidentNV(target, GL_READ_ONLY);

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer(target, pboIds[1]);
        glBufferData(target, pbo_size, 0, GL_STREAM_READ);
        glMapBufferRange(target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB(target);
        glMakeBufferResidentNV(target, GL_READ_ONLY);

        buffCreated = true;
    }
}
void WaylandEgl::doReadbackFAST()
{
    // Work-around for NVIDIA driver readback crippling on GeForce.
    initFastBuffers();
    //glFinish();
    Timer t1;
    t1.start();

    // Do a readback to BUF OBJ #0
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
    glReadPixels(0, 0, mWinWidth, mWinHeight,
                 GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer(GL_COPY_WRITE_BUFFER, pboIds[1]);
    glCopyBufferSubData(GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                        pbo_size);

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData(GL_COPY_WRITE_BUFFER, 0, pbo_size,
                       Readback_buf);
    //sendImage((unsigned char*)Readback_buf, pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();

    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}
Intel : 11-12% cpu usage without sending pixels to another device, Read Time: 0.039 ms, Process Time: 2.014 ms
Nvidia: 28-32% cpu usage without sending pixels to another device, Read Time: 3.118 ms, Process Time: 1.659 ms
So the second algorithm decreased the CPU usage on both platforms; the read and process times are above.
Is there any idea how I can reduce the CPU usage, especially on the NVIDIA board, for 16-bit RGB pixels?
Regards
I won't claim I know from firsthand experience what the best approach is, but I'll throw out some guesses.
First, unless your source is actually 16-bit, you probably want to look at reading 32 bpp (RGBA or RGBX). That is often the fastest format if you are in 24/32-bit display mode. Some drivers will do REALLY slow per-pixel reads if they have to convert from 32 to 24 or 32 to 16 bits. You can do that conversion faster yourself.
Second, if you are sending a pixel buffer over a network to another, lower-powered device, you might consider whether compressing to, say, PNG or JPG (or another format) is worth the extra CPU resource on the host. The result is a local CPU cost that reduces network bandwidth, with a potential CPU cost to decode on the other end. Then again, if you have a 'dumb panel' on the other end, you may just want raw pixels; hard to say.
Third, have you timed simply doing glReadPixels instead of the overly complex mechanism above? If you use PBOs within a SINGLE FRAME, without multithreading and without deferring the result, you take the immediate cost of synchronizing the GPU and pulling the data to the CPU (again, with possible per-pixel conversions). At that point the PBO isn't useful; just use glReadPixels, take the synchronization hit, and move on. :)
As you said, RGBA is the fastest and consumes less CPU. Yes, I will use gigabit Ethernet, and there is no compression in the plan :)
The PBO OFF values above are for direct glReadPixels:
"PBO OFF:
Nvidia: 24-25%cpu usage without sending pixels to another device, Read Time:3.026 ms, Process Time:0.001 ms"
At the moment the NVIDIA CPU usage is high, so I am using PBOs :)
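Uncompressed streaming over gigabit Ethernet does look feasible on paper. A back-of-the-envelope check, assuming the 640x480 window size from the earlier code and 60 frames per second (both assumptions, not figures from this thread):

```cpp
#include <cstdint>

// Rough bandwidth estimate for streaming raw frames.
// Assumed frame geometry and rate (not measured values):
constexpr uint64_t kWidth = 640, kHeight = 480, kFps = 60;

// Bits per second for 16-bit (RGB565) and 32-bit (RGBA8888) pixels.
constexpr uint64_t kBps565  = kWidth * kHeight * 2 * 8 * kFps;  // ~295 Mbit/s
constexpr uint64_t kBps8888 = kWidth * kHeight * 4 * 8 * kFps;  // ~590 Mbit/s

constexpr uint64_t kGigabit = 1000000000ull;
static_assert(kBps565  < kGigabit, "raw RGB565 stream fits in gigabit Ethernet");
static_assert(kBps8888 < kGigabit, "raw RGBA also fits, before protocol overhead");
```

So at this resolution even 32 bpp fits, which matters because reading RGBA may be the cheaper GL path even if 16-bit is what ultimately goes on the wire.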
so the quick ideas:
- read RGBA 32bpp, not 16b, not 24b.
- make sure you don't have a multisample framebuffer.
- render to FBO, blit that to screen, then do readpixels or similar on the FBO surface.
- might want glPixelStorei(GL_PACK_ALIGNMENT, 4); to force 32b alignment. I'm not sure that forcing 8b align isn't kicking you off a fast path.
I can see the timing differences being associated with any of the above.
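On the GL_PACK_ALIGNMENT point: the pack alignment rounds each row that glReadPixels writes up to a multiple of 1, 2, 4, or 8 bytes, which affects both how large the destination buffer must be and whether the driver can take a fast path. A sketch of the stride computation (packedRowStride is a hypothetical helper, just restating the rule from the GL spec):

```cpp
#include <cstddef>

// Row stride in bytes that glReadPixels produces for a given GL_PACK_ALIGNMENT:
// each row is padded up to a multiple of the alignment (1, 2, 4, or 8).
size_t packedRowStride(size_t width, size_t bytesPerPixel, size_t alignment)
{
    size_t row = width * bytesPerPixel;
    return (row + alignment - 1) / alignment * alignment;
}
```

With 4-byte RGBA pixels every row is already a multiple of 4, so GL_PACK_ALIGNMENT 4 adds no padding; with 3-byte RGB and odd widths the padding (or an alignment of 1) is what changes the layout.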
So when I try to read RGBA with a PBO, the CPU usage is already low:
RGBA with the runPixelBO algorithm:
Intel : 18-19% cpu usage without sending pixels to another device, Read Time: 0.085 ms, Process Time: 1.112 ms
Nvidia: 17-18% cpu usage without sending pixels to another device, Read Time: 0.196 ms, Process Time: 0.732 ms
PBO OFF (glReadPixels):
Intel : 16-17% cpu usage without sending pixels to another device, Read Time: 3.25 ms, Process Time: 0 ms
Nvidia: 31-32% cpu usage without sending pixels to another device, Read Time: 4.064 ms, Process Time: 0.001 ms
RGBA with the doReadbackFAST algorithm:
Intel : 15-16.5% cpu usage without sending pixels to another device, Read Time: 0.065 ms, Process Time: 3.217 ms
Nvidia: 14.5-15.5% cpu usage without sending pixels to another device, Read Time: 0.108 ms, Process Time: 5.833 ms
For RGB the numbers are also good on NVIDIA, but not on Intel. My requirement is 16-bit RGB, so 32 bpp or 24 bpp is not suitable for me. I will also check a render buffer with 16 bits as you mentioned above.