Porting a mali_egl_image* Utgard application to Midgard

Hi all

We are currently migrating an embedded application from a Mali-400 MP2 (Utgard) platform to one with a Mali-T720 (Midgard) GPU. The application uses the following (probably fairly common) mali_egl_image* code to achieve zero-copy updates of a texture:

EGLImageKHR eglImage = eglCreateImageKHR( display, EGL_NO_CONTEXT, EGL_NATIVE_PIXMAP_KHR, (EGLClientBuffer)(&fbPixMap), NULL );
glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, eglImage);
...
...
mali_egl_image *mimg   = mali_egl_image_lock_ptr( eglImage );
unsigned char  *buffer = mali_egl_image_map_buffer( mimg, attribs_rgb );

// update buffer here

mali_egl_image_unmap_buffer( mimg, attribs_rgb );
mali_egl_image_unlock_ptr( eglImage );

These mali_egl_image_* functions do not appear to be available in the mali_midgard driver we received from our chip vendor.

Our application is written in C and, apart from the above, uses only standard OpenGL ES 2.0 calls.

What would be the equivalent approach for updating a texture directly (i.e. without using glTexSubImage2D()) with the T720 Midgard driver? Thankfully the above code exists in a single function that is called from many places, so ideally a direct replacement would be fantastic!

Regards

Chris

0 Chris S over 4 years ago in reply to Ben Clark

Hi Ben.

We have a full-frame image generation thread running at full throttle that only wants to wait for the display thread to take each new image from it, and for that to complete in the shortest time possible. The display thread renders each new image full-frame while the generator builds the next, so isn't doing too much. Each new image is generated by applying deltas to the previous image, so the generator cannot flip between two buffers, which would have solved this timing issue!

What we achieved with the Utgard zero-copy functions was a single copy into the texture that took a predictable, measurable time, and thus caused minimal interruption of the generation thread.

Based on what you've said, I wonder if we could either call glFinish straight after the glTexImage2D, or make a full copy of the image data before calling glTexImage2D, so the generator can get on with producing the next frame as soon as the copy is done. The latter would give us the completion-time predictability we need, but would presumably increase CPU load.
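
The second option (snapshot the frame, then let the generator carry on) could be sketched roughly as below. This is only an illustration under assumptions: the `frame_handoff` type and the `handoff_*` functions are hypothetical names, not part of any Mali, EGL or GL API, and the sketch assumes exactly one generator thread and one display thread sharing a single staging buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical single-slot handoff between the generator thread (producer)
 * and the display thread (consumer). Not part of any Mali or GL API. */
typedef struct {
    uint8_t    *staging; /* caller-allocated buffer of `size` bytes */
    size_t      size;    /* bytes per frame, e.g. 800 * 600 * 2 for RGB565 */
    atomic_bool full;    /* true while the display thread owns the snapshot */
} frame_handoff;

void handoff_init(frame_handoff *h, uint8_t *staging, size_t size)
{
    assert(h != NULL && staging != NULL && size > 0);
    h->staging = staging;
    h->size = size;
    atomic_init(&h->full, false);
}

/* Generator thread: snapshot the current frame into the staging buffer.
 * Returns false, without copying, if the display thread has not yet
 * released the previous snapshot. */
bool handoff_publish(frame_handoff *h, const void *frame)
{
    if (atomic_load_explicit(&h->full, memory_order_acquire))
        return false;
    memcpy(h->staging, frame, h->size);
    atomic_store_explicit(&h->full, true, memory_order_release);
    return true;
}

/* Display thread: returns the snapshot to hand to glTexImage2D or
 * glTexSubImage2D, or NULL if no new frame is pending. */
const uint8_t *handoff_acquire(frame_handoff *h)
{
    return atomic_load_explicit(&h->full, memory_order_acquire)
               ? h->staging
               : NULL;
}

/* Display thread: call after the upload call has returned, at which point
 * the staging buffer may be overwritten by the next snapshot. */
void handoff_release(frame_handoff *h)
{
    atomic_store_explicit(&h->full, false, memory_order_release);
}
```

With this shape the generator only ever waits for one memcpy of known size, and it can resume applying deltas to its own buffer as soon as handoff_publish() returns true. The price is the extra CPU copy per frame mentioned above.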

Understanding a little of what glTexImage2D is doing would help - is the data transfer within glTexImage2D() CPU-bound (i.e. the driver does the transfer into texture memory using ARM/Neon instructions) or is it done via a GPU/DMA transfer?

Thanks

Chris

+1 Ben Clark over 4 years ago in reply to Chris S

Hi Chris,

glFinish will block until it is complete, so that will work.

For further info: the GPU prefers images in "cache optimal" tiling, which dramatically speeds up GPU sampling. But when we import an image in GLES we use whatever tiling the image was created with - which may well be linear tiling. (If we create the image, we use "cache optimal").

When glTexImage2D is called the image will be converted from the linear input data to whatever tiling the image uses. If this is also linear, it will be a simple memcpy that will have completed by the time glTexImage2D returns. If it's cache optimal then that conversion can take time and will be deferred, i.e. not done by the time glTexImage2D returns.

Conversion usually happens on GPU, but can be CPU depending on image size / format / GPU version. It's generally like a GPU render from the linear input to the cache optimal stored image.
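
To make the linear versus tiled distinction concrete, here is a generic sketch of block tiling in C. It is an assumption-laden illustration only: the actual "cache optimal" layout used by the Mali driver is not described in this thread, and the 4x4 tile size, the `TILE` constant and the function names below are made up for the example. The point is simply that a tiled store needs a per-pixel address calculation, so it cannot be a single memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE = 4 }; /* illustrative tile edge; NOT Mali's real block size */

/* Offset (in pixels) of (x, y) in a row-major linear image. */
size_t linear_offset(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Offset (in pixels) of (x, y) when the image is stored as consecutive
 * TILE x TILE blocks, each block itself stored row-major. */
size_t tiled_offset(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE;
    size_t tile_index = (y / TILE) * tiles_per_row + (x / TILE);
    return tile_index * TILE * TILE + (y % TILE) * TILE + (x % TILE);
}

/* Re-order a linear 16-bit image into the tiled layout.
 * width and height must be multiples of TILE in this sketch. */
void linear_to_tiled(const uint16_t *src, uint16_t *dst,
                     size_t width, size_t height)
{
    assert(width % TILE == 0 && height % TILE == 0);
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            dst[tiled_offset(x, y, width)] = src[linear_offset(x, y, width)];
}
```

In a layout like this, neighbouring pixels in both x and y sit close together in memory, which is the general reason tiled textures tend to sample faster than linear ones.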

Hope that helps,

Ben

0 Chris S over 4 years ago in reply to Ben Clark
Thanks Ben, that's really helpful.

Our source images are raster (either RGB565 or RGBA8888) and we are now "uploading" to the GPU image/texture using the familiar:

glTexImage2D( GL_TEXTURE_2D, 0, GL_RGB, 800, 600, 0, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, image )

In the previous Utgard version we copied and converted the RGB565 image straight into the mapped Mali texture as BGRA8888 using an optimised Neon function, which was really our only option. Another part of the application (the UI) updates parts of textures, and now uses glTexSubImage2D where it too previously used the direct Mali texture mapping trick.
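
For reference, a plain scalar version of that kind of RGB565 to BGRA8888 conversion might look like the sketch below. This is not the Neon routine from the application; the function name is hypothetical, and the byte order (B, G, R, A in memory), the opaque alpha value and the use of bit replication to expand 5 and 6 bit channels to 8 bits are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert pixel_count RGB565 pixels to 8-bit-per-channel pixels stored in
 * memory as B, G, R, A. dst must hold at least 4 * pixel_count bytes. */
void rgb565_to_bgra8888(const uint16_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint16_t p = src[i];
        uint8_t r5 = (uint8_t)((p >> 11) & 0x1F);
        uint8_t g6 = (uint8_t)((p >> 5) & 0x3F);
        uint8_t b5 = (uint8_t)(p & 0x1F);

        /* Bit replication: copy the top bits into the low bits so that
         * full-scale 5/6-bit values map to exactly 255. */
        dst[4 * i + 0] = (uint8_t)((b5 << 3) | (b5 >> 2));
        dst[4 * i + 1] = (uint8_t)((g6 << 2) | (g6 >> 4));
        dst[4 * i + 2] = (uint8_t)((r5 << 3) | (r5 >> 2));
        dst[4 * i + 3] = 0xFF; /* assumed: fully opaque alpha */
    }
}
```

A scalar loop like this is useful mainly as a correctness reference when benchmarking a hand-optimised conversion against letting glTexImage2D or glTexSubImage2D accept the RGB565 data directly.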

From what you have said, I wonder if we are falling foul of some inefficiencies. Does letting the Mali infrastructure copy and convert the image result in better and/or more optimal performance (resulting in textures stored as "cache optimal" perhaps)?

We want the GPU workload to be efficient, but at the same time require the source image to be copied for the render thread as fast as possible. I assume there will need to be a balance.

Thanks!

Chris

+1 Ben Clark over 4 years ago in reply to Chris S

Yes I guess you've got a decision on which is the most important - the fast image copy, or fast image access thereafter. Keeping it linear will be very fast to copy (and it is what your Utgard version did if you want consistency), but you now have that potential much faster access if you create/convert the image as "cache optimal".

A colleague has pointed out that using glSync between the two threads that read from and write to the image will be better than a full glFinish.

0 Chris S over 4 years ago in reply to Ben Clark

Fantastic. We'll look into glSync.

So that we can test and benchmark each image variant, and know what we're testing, could you confirm my understanding?

If, as in our previous Utgard version, we create the image with eglCreateImageKHR() with EGL_NATIVE_PIXMAP_KHR and glEGLImageTargetTexture2DOES(), we will get a linear image?

If, as we're doing right now in the Midgard version, we create the image using glTexImage2D, we will get a "cache optimised" image?

If I've got that right, then what happens if we issue a glTexImage2D to replace the contents of an image created with eglCreateImageKHR? Does it discard all of the internal image attributes (such as the fact that it is linear) and create a new "cache optimised" image, or will it merely reallocate the storage, keeping the attributes and the "linear" layout?

Thanks for taking the time to help us out with this - it is really important for us to understand, besides being very interesting.

Chris

0 Ben Clark over 4 years ago in reply to Chris S

Hi Chris,

I've clarified with the driver team, and glTexImage2D counts as a create rather than an import, so yes, it will change to "cache optimal" every time. glTexSubImage2D will not change the tiling, so if you want to keep linear you would need to use that.

As to how to get linear - if the allocator is the DDK it will be cache optimal. If the allocator is external and the image is imported it will be whatever the external allocator uses. For example, Android Gralloc will always allocate images using linear tiling whenever the image is host visible.

If you do it the old way with pixmaps it will end up linear as I understand it, yes.

Cheers, Ben

0 Chris S over 4 years ago in reply to Ben Clark

Ben, that's all been tremendously helpful. Thank you.