The recently released Mali™-G71 GPU is our most powerful and efficient graphics processor to date and is all set to take next generation high performance devices by storm. The Mali family of GPUs is well known for providing unbeatable flexibility and scalability in order to meet the broad-ranging needs of our customers but we’ve taken another step forward with this latest product. ARM®’s brand new Bifrost architecture, which forms the basis of the Mali-G71, will enable future generations of Mali GPUs to power all levels of devices from mass market to premium mobile. In a few short blogs I’m going to take a look at some of the key features that make Bifrost unique and the benefits they bring to ARM-powered mobile devices.
The first feature we’re going to look at is the innovative introduction of clauses for shader execution. In a traditional set up, the control flow might change between any two instructions. We therefore need to make sure that the execution state is committed to the architectural registers after each instruction and is retrieved at the start of the next. This means the instructions are executed sequentially after a scheduling decision is made before each one.
Classic Instruction Execution
The revolutionary changes ARM has implemented in the Bifrost architecture means instructions are grouped together and executed in clauses. These clauses provide more flexibility than a Very Long Instruction Word (VLIW) instruction set in that they can be of varying lengths and can contain multiple instructions for the same execution unit. However, the control flow within each clause is much more tightly controlled than a traditional architecture. Once a clause begins, execution runs from start to finish without any interruptions or loss of predictability. This means the control flow logic doesn’t need to be executed after every individual instruction. Branches may only appear at the end of clauses and their effects are therefore isolated in the system. A quad’s program counter can never be changed within a clause, allowing us to eliminate costly edge cases. Also, if you examine how typical shaders are written, you will find that they have large basic blocks which automatically make them a good fit for the clause system. Since instructions within a clause execute back-to-back without interruption, this provides us with the predictability we need to be able to optimize aggressively.
As is the case in a classic instruction set, the instructions work on values stored in a register file. Each instruction reads values from the registers and then writes the results back to the same register file shortly afterwards. Instructions can then be combined in sequence due to the knowledge that the register retains its written value.
The register file itself is generally something of a power drain due to the large numbers of accesses to the register file. Since wire length contributes to dynamic power (long wires have more capacitance), the larger the register file, or the further away it is, the higher the power requirement to address it. The Bifrost architecture allocates a thread of execution to exactly one execution unit for its entire duration so that its working values can be stored in that Arithmetic Logic Unit (ALU)’s register file close by. Another optimization uses the predictability to eliminate back-to-back accesses to the register file, further reducing the overall power requirements for register access.
In a fine-grained, multi-threaded system we need to allow threads to request variable-latency operations, such as memory accesses, and sleep and wake, very quickly. We implement this using a lightweight dependency system. Dependencies are discovered by the compiler, which removes runtime complexity, and each clause can both request a variable-latency operation and also depend on the results of previous operations. Clauses always execute in order, and may continue to execute even if unrelated operations are pending. While waiting for a previous result, clauses from other quads can be scheduled, and this gives us a lot of run-time flexibility to deal with variable latencies with manageable complexity. Again, by executing this only at clause boundaries we reduce the power cost of the system.
The implementation of clause shaders not only reduces the overhead by spreading it across several instructions but it also guarantees the sequential execution of all instructions contained in a clause and allows us significant scope for optimization due to the predictability and overall power saving. This is just one of the many features of the Bifrost architecture which will allow new Mali based systems to perform more efficiently than ever before, including for high end use cases such as virtual reality and computer vision.
Many thanks to seanellis for his technical wizardry and don't forget to check back soon for the next blog in the Bitesize Bifrost series!
I'm really looking forward to reading part 2! Coherency is very interesting, though perhaps, under-appreciated!
Hi Sean, glad it was helpful! Clauses are indeed constructed and partitioned by the compiler. This makes the life of the compiler writers a bit more difficult, as it’s another set of things to optimise, but it moves the scheduling complexity away from the hardware and into software.
We have worked hard on the compiler to ensure that it is producing good code, but I am sure that they will discover additional opportunities for optimisation in the future. And if the hardware microarchitecture changes, then the compiler can change to exploit the new balance between the execution units.
Bifrost blog 2 is now live if you want to check it out: Bitesize Bifrost 2: System coherency
Thank you for this wonderful post!
The clause system of execution is very clever! Keeping clause data in one place, in close proximity to the execution engine, and until it is fully completed, will surely save power, and possibly improve performance! Is it correct to assume that clauses are partitioned by the compiler before being handed to the hardware for scheduling? This seems most intuitive and something that would seem to both simplify hardware, and offer the flexibility in finding future exploitable code patterns that can be "claused" with a compiler update.
I think I'm starting to appreciate the importance that the compiler plays in the design of hardware!