Arm Community
Tags
  • Architecture
  • alu
  • Mali
  • instruction-set
  • Arm Architecture
  • scalability
  • Bifrost
  • Mali-G71
  • innovation
  • overhead
  • bitesize
  • scalable
  • gpu
  • efficiency
  • shaders
  • clause_shaders

Bitesize Bifrost 1: The benefits of clause shaders

Freddi Jeffries
July 5, 2016
4 minute read time.

The recently released Mali™-G71 GPU is our most powerful and efficient graphics processor to date and is all set to take next generation high performance devices by storm. The Mali family of GPUs is well known for providing unbeatable flexibility and scalability in order to meet the broad-ranging needs of our customers but we’ve taken another step forward with this latest product. ARM®’s brand new Bifrost architecture, which forms the basis of the Mali-G71, will enable future generations of Mali GPUs to power all levels of devices from mass market to premium mobile. In a few short blogs I’m going to take a look at some of the key features that make Bifrost unique and the benefits they bring to ARM-powered mobile devices.

The first feature we’re going to look at is the innovative introduction of clauses for shader execution. In a traditional setup, the control flow might change between any two instructions. We therefore need to make sure that the execution state is committed to the architectural registers after each instruction and retrieved at the start of the next. This means the instructions are executed one at a time, with a scheduling decision made before each one.

[Figure: Classic Instruction Execution]

The revolutionary changes ARM has implemented in the Bifrost architecture mean that instructions are grouped together and executed in clauses. These clauses provide more flexibility than a Very Long Instruction Word (VLIW) instruction set in that they can be of varying lengths and can contain multiple instructions for the same execution unit. However, the control flow within each clause is much more tightly controlled than in a traditional architecture. Once a clause begins, execution runs from start to finish without any interruptions or loss of predictability. This means the control flow logic doesn’t need to be executed after every individual instruction. Branches may only appear at the end of clauses, so their effects are isolated in the system. A quad’s program counter can never be changed within a clause, allowing us to eliminate costly edge cases. Also, if you examine how typical shaders are written, you will find that they have large basic blocks, which makes them a natural fit for the clause system. Since instructions within a clause execute back-to-back without interruption, we get the predictability we need to be able to optimize aggressively.

[Figure: Clause Execution]
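
To make the difference concrete, here is a minimal Python sketch that groups a straight-line instruction list into clauses and counts scheduling decisions under both models. It is an illustration only, not Bifrost's actual encoding or compiler: the `Instr` type, the `form_clauses` helper and the `max_len` limit are hypothetical names and values chosen for the example, and the sketch assumes branch targets already fall at the start of a clause.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str
    is_branch: bool = False


def form_clauses(instrs, max_len=8):
    """Group instructions into clauses. A branch always closes its clause.

    max_len is a hypothetical size limit used only for this illustration.
    """
    clauses, current = [], []
    for ins in instrs:
        current.append(ins)
        if ins.is_branch or len(current) == max_len:
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses


program = [
    Instr("fma"), Instr("add"), Instr("mul"), Instr("branch", is_branch=True),
    Instr("fma"), Instr("add"),
]

clauses = form_clauses(program)

# Classic model: one scheduling decision before every instruction.
classic_decisions = len(program)
# Clause model: one scheduling decision before every clause.
clause_decisions = len(clauses)

print(classic_decisions, clause_decisions)
```

In this toy program the classic model makes a decision before each of the six instructions, while the clause model makes one per clause, and the only branch sits at the end of its clause.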

As is the case in a classic instruction set, the instructions work on values stored in a register file. Each instruction reads values from the registers and then writes the results back to the same register file shortly afterwards. Instructions can then be combined in sequence due to the knowledge that the register retains its written value.

The register file itself is generally something of a power drain because it is accessed so often. Since wire length contributes to dynamic power (long wires have more capacitance), the larger the register file, or the further away it is, the more power it takes to access. The Bifrost architecture allocates a thread of execution to exactly one execution unit for its entire duration so that its working values can be stored close by, in the register file of that Arithmetic Logic Unit (ALU). Another optimization uses the predictability of clause execution to eliminate back-to-back accesses to the register file, further reducing the overall power requirements for register access.
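
One way to picture the saving from eliminating back-to-back accesses is to count register file reads in a clause. The sketch below assumes a simple forwarding scheme in which a result consumed by the very next instruction is taken from a temporary instead of being read back from the register file. That mechanism, the `register_reads` helper and the `(dst, srcs)` instruction format are assumptions made for illustration, not a description of the actual hardware.

```python
def register_reads(clause, forward=True):
    """Count register file reads for a clause of (dst, srcs) instructions.

    With forward=True, a source equal to the previous instruction's
    destination is assumed to come from a temporary (hypothetical scheme),
    so it does not cost a register file read.
    """
    reads = 0
    prev_dst = None
    for dst, srcs in clause:
        for src in srcs:
            if forward and src == prev_dst:
                continue  # value forwarded from the previous instruction
            reads += 1
        prev_dst = dst
    return reads


clause = [
    ("r0", ("r1", "r2")),  # r0 = r1 op r2
    ("r3", ("r0", "r4")),  # r3 = r0 op r4, r0 was just produced
    ("r5", ("r3", "r3")),  # r5 = r3 op r3, r3 was just produced
]

reads_without = register_reads(clause, forward=False)
reads_with = register_reads(clause, forward=True)
print(reads_without, reads_with)
```

This only works because nothing can interrupt a clause: the next instruction is known to run immediately after the previous one, so its input is known to be available without a round trip through the register file.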

In a fine-grained, multi-threaded system we need to allow threads to request variable-latency operations, such as memory accesses, and to sleep and wake very quickly. We implement this using a lightweight dependency system. Dependencies are discovered by the compiler, which removes runtime complexity, and each clause can both request a variable-latency operation and also depend on the results of previous operations. Clauses always execute in order, and may continue to execute even if unrelated operations are pending. While waiting for a previous result, clauses from other quads can be scheduled, and this gives us a lot of run-time flexibility to deal with variable latencies with manageable complexity. Again, by performing this dependency checking only at clause boundaries we reduce the power cost of the system.
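
The scheduling behaviour described above can be sketched as a toy model. Everything here is an assumption made for illustration: the `Clause` and `Quad` types, the numbered dependency "slots", the one-clause-per-time-step cost and the fixed-priority choice between quads are hypothetical, not the real hardware interface.

```python
from dataclasses import dataclass, field


@dataclass
class Clause:
    name: str
    waits: frozenset = frozenset()  # slots that must complete before this clause runs
    issues: tuple = None            # optional (slot, latency) operation it requests


@dataclass
class Quad:
    name: str
    clauses: list
    pc: int = 0                                  # index of the next clause
    pending: dict = field(default_factory=dict)  # slot -> completion time


def run(quads):
    """Run one clause per time step, checking dependencies only at clause boundaries."""
    time, trace = 0, []
    while any(q.pc < len(q.clauses) for q in quads):
        for q in quads:  # fixed priority order, an assumption of this sketch
            if q.pc == len(q.clauses):
                continue
            clause = q.clauses[q.pc]
            if all(q.pending.get(s, 0) <= time for s in clause.waits):
                if clause.issues:
                    slot, latency = clause.issues
                    q.pending[slot] = time + 1 + latency
                trace.append((time, q.name, clause.name))
                q.pc += 1
                break
        time += 1
    return trace


quad_a = Quad("A", [
    Clause("load", issues=(0, 3)),          # requests a variable-latency operation
    Clause("math"),                         # unrelated work, runs while the load is pending
    Clause("use", waits=frozenset({0})),    # needs the load result
])
quad_b = Quad("B", [Clause("b0"), Clause("b1")])

trace = run([quad_a, quad_b])
for step in trace:
    print(step)
```

In this run, quad A issues its load and carries on with an unrelated clause. When A reaches the clause that needs the load result, it sleeps, quad B's clauses fill the gap, and A wakes once the result is available.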

The implementation of clause shaders reduces scheduling overhead by spreading it across several instructions. It also guarantees the sequential execution of all instructions contained in a clause, and that predictability gives us significant scope for optimization and overall power saving. This is just one of the many features of the Bifrost architecture which will allow new Mali-based systems to perform more efficiently than ever before, including for high-end use cases such as virtual reality and computer vision.

Many thanks to seanellis for his technical wizardry and don't forget to check back soon for the next blog in the Bitesize Bifrost series!

  • Sean Lumly over 8 years ago

    Thanks Freddie!

    I'm really looking forward to reading part 2! Coherency is very interesting, though perhaps, under-appreciated!

    Sean

  • Freddi Jeffries over 8 years ago

    Hi Sean, glad it was helpful! Clauses are indeed constructed and partitioned by the compiler. This makes the life of the compiler writers a bit more difficult, as it’s another set of things to optimise, but it moves the scheduling complexity away from the hardware and into software.

    We have worked hard on the compiler to ensure that it is producing good code, but I am sure that they will discover additional opportunities for optimisation in the future. And if the hardware microarchitecture changes, then the compiler can change to exploit the new balance between the execution units.

    Bifrost blog 2 is now live if you want to check it out: Bitesize Bifrost 2: System coherency

  • Sean Lumly over 8 years ago

    Thank you for this wonderful post!

    The clause system of execution is very clever! Keeping clause data in one place, in close proximity to the execution engine, and until it is fully completed, will surely save power, and possibly improve performance! Is it correct to assume that clauses are partitioned by the compiler before being handed to the hardware for scheduling? This seems most intuitive and something that would seem to both simplify hardware, and offer the flexibility in finding future exploitable code patterns that can be "claused" with a compiler update.

    I think I'm starting to appreciate the importance that the compiler plays in the design of hardware!

    Sean
