
ARM Mali Compute Architecture Fundamentals

Anton Lokhmotov
April 23, 2014
3 minute read time.

In his book "How Music Works", David Byrne points out that music is created to fit a given context: music that would sound great in a symphony hall would likely sound unconvincing in a stadium.  Similarly, OpenCL™ programs are often created with a particular context in mind: kernels are typically optimised for a specific compute device (e.g. the GPU in the programmer's desktop machine).

I am continuing my blog series by presenting the fundamentals of the ARM Midgard architecture underpinning the ARM® Mali™-T600 and Mali-T700 GPU series.

Architecture Overview

The number of Mali cores in a system-on-chip (SoC) can be scaled to satisfy the SoC's performance requirements.  For example, the Mali-T624 can be scaled from one to four cores. Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline:

[Figure: the tri-pipe of a Mali core: two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline]

Thus, the peak throughput of each core is two A instruction words, one LS instruction word and one T instruction word per cycle.

Try as I might, I cannot assign the Midgard architecture to a single class:

  • Midgard is a Very Long Instruction Word (VLIW) architecture, in that each pipe contains multiple units and most instruction words contain instructions for multiple units.
  • Midgard is also a Single Instruction Multiple Data (SIMD) architecture, in that most instructions operate on multiple data elements packed into 128-bit vector registers.
  • Finally, Midgard is a Fine-Grain Multi-Threaded (FGMT) architecture, in that each core runs its threads in a round-robin fashion, switching on every cycle to the next ready-to-execute thread. Interestingly, each thread has its individual program counter (unlike warp-based designs, where threads in a warp share the same program counter).

Guidelines for Optimising Compute Kernels

So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:

  • Having sufficient instruction-level parallelism in kernel code to allow for dense packing of instructions into instruction words by the compiler. (This addresses the VLIW-ness of the architecture.)
  • Using vector operations in kernel code to allow for straightforward mapping to vector instructions by the compiler; the first sketch after this list illustrates the idea. (I will have much more to say on vectorisation later, as it's one of my favourite topics.)
  • Having a balance between A and LS instruction words. Without cache misses, a 2:1 ratio of A-words to LS-words would be optimal; with cache misses, a higher ratio is desirable, since memory operations then take more cycles to complete. For example, a kernel consisting of 15 A-words and 7 LS-words is still likely to be bound by the LS-pipe.
  • Using a sufficient number of concurrently executing (or active) threads per core to hide the execution latency of instructions (which is the depth of the corresponding pipeline). The maximum number of active threads I is determined by the number of registers R that the kernel code uses: I = 256 if 0 < R ≤ 4; I = 128 if 4 < R ≤ 8; I = 64 if 8 < R ≤ 16 (the second sketch after this list encodes this rule). For example, kernel A that uses 5 registers and kernel B that uses 8 registers can both be executed by running no more than 128 threads per core. This means that it may be preferable to split complex, register-heavy kernels into a number of simpler ones. (For the compiler folk among us, this also means that the backend may decide to spill a value to memory rather than use an extra register when its heuristics suggest that the number of registers likely to be required is approaching 4 or 8.)
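
As a first sketch, consider scaling an array. The two hypothetical kernels below are my illustration, not code from the post: the names scale_scalar and scale_vec4 and the choice of float4 are assumptions for the example, but float4 matches the width of a 128-bit vector register.

__kernel void scale_scalar(__global const float* src,
                           __global float* dst,
                           const float alpha)
{
    size_t i = get_global_id(0);
    dst[i] = alpha * src[i];  /* one float per work-item */
}

__kernel void scale_vec4(__global const float4* src,
                         __global float4* dst,
                         const float alpha)
{
    size_t i = get_global_id(0);
    /* One 128-bit vector multiply: four floats per instruction,
       filling a Midgard vector register. */
    dst[i] = alpha * src[i];
}

Note that the vectorised variant also processes four elements per work-item, so the NDRange shrinks to a quarter of the size, reducing per-thread overhead.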

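As a second sketch, here is a minimal C helper of my own (the post itself contains no code, and the function name is an assumption) that encodes the thread-count rule above:

#include <stdio.h>

/* Sketch: maximum active threads per Midgard core as a function of
   the number of registers R used by the kernel, per the rule above. */
unsigned max_active_threads(unsigned r)
{
    if (r == 0 || r > 16) return 0;  /* outside the documented range */
    if (r <= 4) return 256;
    if (r <= 8) return 128;
    return 64;                       /* 8 < r <= 16 */
}

int main(void)
{
    /* Kernels A (5 registers) and B (8 registers) both cap at 128
       active threads per core. */
    printf("%u %u\n", max_active_threads(5), max_active_threads(8));
    return 0;
}

The bucket boundaries at 4 and 8 registers also explain the compiler heuristics mentioned above: spilling one value can be worthwhile if it keeps R at 4 (256 threads) rather than 5 (128 threads).
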
In some respects, writing high-performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:

  • The global and local OpenCL address spaces are mapped to the same physical memory (the system RAM), backed by caches that are transparent to the programmer. This often removes the need for explicit data copying and the associated barrier synchronisation; a sketch follows this list.
  • Since all threads have individual program counters, branch divergence is less of an issue than on warp-based architectures.
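
To illustrate the first point, here is a hypothetical 1D stencil kernel (my sketch, not code from the post; the name sum_neighbours is an assumption):

/* On a desktop GPU one might stage in[] into __local memory behind a
   barrier(CLK_LOCAL_MEM_FENCE). On Mali, __global and __local map to
   the same cached system RAM, so reading straight from __global is
   often just as fast, and the staging copy and barrier can be omitted. */
__kernel void sum_neighbours(__global const float* in,
                             __global float* out,
                             const int n)
{
    int i = get_global_id(0);
    if (i > 0 && i < n - 1)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}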

Fasten your Seat Belt!

With this theoretical knowledge under our belts, we can now look at optimising some kernels! Can you guess where we'll start? (Or use the comments to vote for your favourite kernel!)

Comments

Anton Lokhmotov, over 11 years ago

Thanks for the comment. You are absolutely right: effectively using caches is paramount to achieving high performance. I am planning to provide examples of cache optimisations in the not-too-distant future.

ericlew, over 7 years ago, in reply to Anton Lokhmotov

I hope you will write a post on optimising for the L1 cache. Thanks! A document says the L1 cache has about 250 cache lines of 512 bits each; is that true?
