Building an efficient, high-performing System-on-Chip (SoC) is becoming an increasingly complex task. The growing demand for bandwidth-heavy applications means that system components must improve their efficiency with each generation to absorb the additional bandwidth these apps consume. And this is true across all markets: while high-end mobile computing platforms strive for ever greater performance alongside better energy efficiency, SoCs targeting the mainstream still need to deliver a premium-style feature set and performance density while also reducing manufacturing costs and time to market.
If you look closely at typical user interactions with consumer devices, you will notice that they very often centre on a combination of text, audio, still images, animation and video: in other words, a multimedia experience. What is significant here is that media-intensive use cases require the transfer of large amounts of data, and the more advanced the user experience, the greater the demand for system bandwidth. And the higher the bandwidth consumption, the higher the power consumption.
But which specific use cases and functional requirements are driving this near-exponential growth in bandwidth demand?
So in summary, there are more pixels, all of which have to be delivered faster, and at the same time more work is required on each of those pixels every frame. All of this demands not only higher computational capability but also more bandwidth. Unless SoC designers plan for this when designing a media system, delivering a quality user experience will consume far more power than it needs to.
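To put some rough numbers on that (my own illustration, not figures from this blog), here is a back-of-the-envelope sketch of raw frame buffer traffic. It assumes an uncompressed RGBA8 buffer that is written once by its producer and read once by the display each frame; real pipelines touch each frame several more times, so treat these as lower bounds:

```c
#include <stdio.h>

/* Raw frame buffer traffic for an uncompressed RGBA8 buffer (an assumption
 * for illustration), counting "touches" = external reads + writes per frame. */
static double fb_bandwidth_gbs(int width, int height, int fps, int touches)
{
    const double bytes_per_pixel = 4.0;                 /* RGBA8 = 32 bpp */
    double bytes_per_frame = (double)width * height * bytes_per_pixel;
    return bytes_per_frame * fps * touches / 1e9;       /* GB/s */
}

int main(void)
{
    /* One write by the producer plus one read by the display = 2 touches. */
    printf("1080p60: %.2f GB/s\n", fb_bandwidth_gbs(1920, 1080, 60, 2));
    printf("4K60:    %.2f GB/s\n", fb_bandwidth_gbs(3840, 2160, 60, 2));
    return 0;
}
```

Moving from 1080p60 to 4K60 alone quadruples the raw traffic, before any additional per-pixel work is even counted.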
So what can we do about it? A typical media system consists of a number of IP components, each with slightly different functionality and characteristics. Each stage of the media processing pipeline is handled by a separate block, and each of them (as Sean explained in his blog) has inputs, intermediate data and outputs, all of which contribute to the total power budget.
Looking deeper, the typical media pipeline consists of a number of interactions between the GPU, Video and Display Processors, and it requires a certain amount of data to be passed between each of these components. If we take a glass-half-full view, this means there are plenty of opportunities to optimize these interactions and provide components that save bandwidth by working together efficiently. This is exactly why ARM has developed a range of bandwidth-reducing technologies: to allow increasingly complex media within the power capacity and thermal limits of mobile devices.
Let’s start by looking at how optimizations already applied to Mali GPUs can be extended to other media components. New use cases require an innovative approach. Nowadays we are seeing more and more use cases that require wireless delivery of audio and video from tablets, mobile phones and other consumer devices to large screens such as a DTV. Both the sending and receiving devices must support the video compression standard in use, such as H.264. In a typical use case, instead of writing the frame buffer out to screen memory, the Display Processor will send it to the Video Processor to be encoded, and the compressed frame will then be sent over the Wi-Fi network.
Motion Search Elimination extends the concept of Transaction Elimination, introduced last year in Mali GPUs and described below, to the Display and Video Processors. Each of them maintains a signature for each tile and when the Display Processor writes the frame buffer out, the Video Processor can eliminate motion search for tiles where signatures match. Why does this matter? Motion estimation is an expensive part of the video pipeline, so skipping the search for selected tiles lowers the latency of the Wi-Fi transmission, reduces bandwidth consumption and, as a result, lowers power for the entire SoC.
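To make the per-tile decision concrete, here is a purely conceptual sketch of the choice described above. The tile count, signature type and function names are hypothetical; the real mechanism lives inside the Display and Video Processor hardware:

```c
#include <stdint.h>

#define TILES_PER_FRAME 8160   /* e.g. 1920x1080 split into 16x16 tiles */

typedef uint32_t tile_sig_t;   /* e.g. a CRC of the tile's pixel data */

/* Signatures maintained for the frame being encoded and for the previously
 * transmitted frame (hypothetical storage; the real ones live in hardware). */
static tile_sig_t curr_sig[TILES_PER_FRAME];
static tile_sig_t prev_sig[TILES_PER_FRAME];

/* Hypothetical encoder hooks, stand-ins for the real video pipeline. */
void emit_skip_block(int tile);
int  run_motion_search(int tile);
void emit_motion_compensated_block(int tile, int motion_vector);

void encode_tile(int tile)
{
    if (curr_sig[tile] == prev_sig[tile]) {
        /* Content unchanged since the last frame: skip the expensive motion
         * search and encode the tile as a skip / zero-motion-vector block. */
        emit_skip_block(tile);
    } else {
        /* Content changed: run the normal motion estimation. */
        emit_motion_compensated_block(tile, run_motion_search(tile));
    }
}
```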
Transaction Elimination (TE) is a key bandwidth-saving feature of the ARM Mali Midgard GPU architecture that delivers significant energy savings when writing out frame buffers. In a nutshell, when TE is enabled the GPU compares the current frame buffer with the previously rendered frame and performs a partial update, writing out only the parts that have actually been modified.
With that, the amount of data that needs to be transmitted to external memory per frame is significantly reduced. TE can be used by every application and for all frame buffer formats supported by the GPU, irrespective of the frame buffer precision requirements, and it is highly effective even in first-person shooters and video streams. Given that in many other popular graphics applications, such as user interfaces and casual games, large parts of the frame buffer remain static between two consecutive frames, the frame buffer bandwidth savings from TE can reach up to 99%.
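As a mental model of TE (not the actual hardware implementation), imagine the write-back stage hashing each tile and only issuing the external write when the hash differs from the previous frame’s. The 16x16 tile size and CRC32 signature below are assumptions made purely for illustration:

```c
#include <stdint.h>
#include <string.h>

#define TILE_PIXELS (16 * 16)          /* assumed tile size */

typedef struct {
    uint32_t pixels[TILE_PIXELS];      /* RGBA8 packed into 32 bits per pixel */
} tile_t;

/* Any cheap hash would do as a signature; a bitwise CRC32 is shown here. */
static uint32_t tile_signature(const tile_t *t)
{
    uint32_t crc = 0xFFFFFFFFu;
    const uint8_t *p = (const uint8_t *)t->pixels;
    for (size_t i = 0; i < sizeof(t->pixels); i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
    }
    return ~crc;
}

/* Write-back stage: only tiles whose signature changed are sent to DRAM. */
void write_back_frame(const tile_t *frame, uint32_t *prev_sigs,
                      tile_t *dram_framebuffer, int tile_count)
{
    for (int i = 0; i < tile_count; i++) {
        uint32_t sig = tile_signature(&frame[i]);
        if (sig != prev_sigs[i]) {
            memcpy(&dram_framebuffer[i], &frame[i], sizeof(tile_t));
            prev_sigs[i] = sig;
        }
        /* else: tile is identical to the previous frame, so the external
         * memory transaction is eliminated */
    }
}
```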
So is there anything else we can do to minimize the amount of data the GPU processes? Smart Composition is another technology developed to reduce bandwidth when reading in textures during frame composition and, as outlined by Plout in his blog, it builds on the previously described Transaction Elimination.
By analyzing frames prior to final composition, Smart Composition determines whether a given portion of the frame needs to be rendered again or whether the previously rendered and composited portion can be reused. If that portion can be reused, it is not read from memory again or re-composited, saving both memory reads and computational effort.
Now let’s look more closely at the interactions between the GPU, Video and Display Processors. One of the most bandwidth-intensive use cases is video post-processing: in many use cases the GPU is required to read a video stream and apply effects to it, for example when video is used as a texture in a 2D or 3D scene. In such cases, ARM Frame Buffer Compression (AFBC), a lossless image compression protocol and format with fine-grained random access, reduces overall system-level bandwidth and power by up to 50% by minimizing the amount of data transferred between IP blocks within an SoC.
When AFBC is used in an SoC, the Video Processor simply writes out the video streams in the compressed format and the GPU reads them, decompressing them only in on-chip memory. Exactly the same optimization is applied to the output buffers intended for the screen: whether it is the GPU or the Video Processor producing the final frame buffers, they are compressed so that the Display Processor can read them in the AFBC format and decompress them only when moving the data to the display memory. AFBC is described in more detail in Ola’s blog Mali-V500 video processor: reducing memory bandwidth with AFBC.
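That division of labour can be sketched, very loosely, with the hypothetical interfaces below. None of these names correspond to a real driver API; they are only meant to show where data crosses the external bus in compressed form and where it is expanded on-chip:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for a buffer shared between IP blocks. */
typedef struct {
    void  *data;       /* backing storage in external (DRAM) memory       */
    size_t size;       /* compressed footprint, roughly half the raw size */
    bool   afbc;       /* true: the contents are stored in AFBC format    */
} shared_buffer_t;

/* Producer side (Video Processor decoding a stream, or the GPU rendering a
 * frame): the data leaves the block already compressed, so only the
 * compressed bytes ever cross the external bus. */
void video_write_frame(shared_buffer_t *out /* , decoded picture ... */);
void gpu_write_framebuffer(shared_buffer_t *out /* , rendered frame ... */);

/* Consumer side (GPU sampling the video as a texture, or the Display
 * Processor scanning out the final frame): the compressed data is fetched
 * from DRAM and expanded only into on-chip memory, never written back
 * uncompressed. */
void gpu_sample_video_texture(const shared_buffer_t *in /* , ... */);
void display_scanout(const shared_buffer_t *in /* , ... */);
```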
But what about the interactions between the GPU and a graphics application such as a high-end game or user interface? This is the perfect opportunity to optimize the amount of memory that texture assets require. Adaptive Scalable Texture Compression (ASTC) is a technology developed by ARM and AMD that was donated to Khronos and has been adopted as an official extension to both the OpenGL® and OpenGL® ES graphics APIs. ASTC is a major step forward in reducing memory bandwidth, and thus energy use, all while maintaining image quality.
The ASTC specification includes two profiles, LDR and Full, both of which are already supported on Mali-T62X GPUs and above; they are described in more detail by Tom Olson and Stacy Smith.
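From an application developer’s point of view, consuming ASTC assets is an ordinary compressed texture upload. Here is a minimal sketch assuming the GL_KHR_texture_compression_astc_ldr extension is exposed and that the asset was compressed offline with an 8x8 block footprint (one of several available footprints, chosen here purely for illustration):

```c
#include <GLES3/gl3.h>
#include <string.h>

/* 0x93B7 in the Khronos registry; newer GLES headers define it already. */
#ifndef GL_COMPRESSED_RGBA_ASTC_8x8_KHR
#define GL_COMPRESSED_RGBA_ASTC_8x8_KHR 0x93B7
#endif

/* Upload a texture that was pre-compressed offline with an 8x8 ASTC
 * footprint. Every ASTC block occupies 16 bytes (128 bits) whatever its
 * footprint, so 8x8 works out at just 2 bits per pixel. */
GLuint upload_astc_8x8(const void *blocks, GLsizei width, GLsizei height)
{
    /* Only attempt the upload if the driver advertises ASTC support. */
    const char *ext = (const char *)glGetString(GL_EXTENSIONS);
    if (!ext || !strstr(ext, "GL_KHR_texture_compression_astc_ldr"))
        return 0;

    GLsizei image_size = ((width + 7) / 8) * ((height + 7) / 8) * 16;

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_ASTC_8x8_KHR,
                           width, height, 0, image_size, blocks);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}
```

For comparison, an uncompressed RGBA8 texture costs 32 bits per pixel.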
To finish, let’s explore another great opportunity to optimize system bandwidth. Modern games apply various post-processing effects, and textures are often combined with the frame buffer. This means that memory is written out through the external bus and then read back multiple times to achieve advanced graphics effects. This type of deferred rendering requires the transfer of a significant amount of external data and consumes a lot of power. Mali GPUs, however, use a tile-based rendering architecture, which means that fragment shading is performed tile by tile in on-chip memory, and only when all the contents of a tile have been processed is the tile data written back out to external memory.
This is a perfect opportunity to implement deferred shading without writing the intermediate data out through the external bus. ARM has introduced two advanced OpenGL ES extensions that enable developers to achieve console-like effects within a mobile bandwidth and power budget: Shader Framebuffer Fetch and Shader Pixel Local Storage. For more information on these extensions, read Jan-Harald’s blog.
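As a flavour of what this looks like in shader code, here is a minimal fragment shader sketch using the GL_ARM_shader_framebuffer_fetch extension; the uniform and varying names, and the blend chosen, are arbitrary examples rather than anything prescribed by the extension:

```c
/* The current framebuffer colour is read directly in the shader via
 * gl_LastFragColorARM, so programmable blending happens in on-chip tile
 * memory instead of a read-modify-write round trip through DRAM. */
static const char *blend_fs =
    "#version 100\n"
    "#extension GL_ARM_shader_framebuffer_fetch : require\n"
    "precision mediump float;\n"
    "uniform sampler2D u_texture;\n"
    "varying vec2 v_uv;\n"
    "void main() {\n"
    "    vec4 src = texture2D(u_texture, v_uv);\n"
    "    vec4 dst = gl_LastFragColorARM;      /* previous framebuffer colour */\n"
    "    gl_FragColor = mix(dst, src, src.a); /* custom blend, no DRAM read  */\n"
    "}\n";
```

Because the read-modify-write stays in tile memory, this blend never generates the external framebuffer read that a conventional multi-pass approach would.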
So have we exhausted all the possibilities for minimizing system bandwidth with the technologies described in this blog? The good news is… of course not! At ARM we are positively obsessed with finding new areas for optimization and making mobile media systems even more power efficient. With each generation our Silicon Partners, OEMs and end consumers help us discover new use cases that pose different challenges and requirements, providing a constant stream of opportunities to get our innovation engines going and design even more efficient SoCs.
Got any questions on the technologies outlined above? Let us know in the comments section below.
Dear Jakub, thank you for a very interesting article!
I have just one question about Motion Search Elimination. You say:
"Each of them maintains a signature for each tile and when the Display Processor writes the frame buffer out, the Video Processor can eliminate motion search for tiles where signatures match."
Do you mean that the Video Processor checks whether the tile's pixels have not changed from the previous frame, and that this allows it to eliminate the search for that tile (so it can then use zero motion vectors)?