In his book "How Music Works", David Byrne points out that music is created to fit a given context: music that would sound great in a symphony hall would likely sound unconvincing in a stadium. Similarly, OpenCL™ programs are often created with a particular context in mind: kernels are frequently optimised for a specific compute device (e.g. the GPU in the programmer's desktop machine).
I am continuing my blog series by presenting the fundamentals of the ARM Midgard architecture underpinning the ARM® Mali™-T600 and Mali-T700 GPU series.
The number of Mali cores in a system-on-chip (SoC) can be scaled to satisfy the performance requirements of the SoC. For example, the Mali-T624 can be scaled from one to four cores. Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline:
Thus, the peak throughput of each core is two A instruction words, one LS instruction word and one T instruction word per cycle.
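To make these figures concrete, here is a small host-side sketch of my own (not code from this post) that queries the device and turns the tri-pipe numbers above into peak instruction-word rates. The assumption that CL_DEVICE_MAX_COMPUTE_UNITS corresponds one-to-one to Mali cores is worth verifying for your particular device and driver.

```c
/* Illustrative sketch: estimate peak instruction-word throughput.
 * The 2:1:1 split of A, LS and T instruction words per core per cycle
 * comes from the tri-pipe description above; error handling is omitted.
 */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        cores, clock_mhz;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cores), &cores, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,  /* reported in MHz */
                    sizeof(clock_mhz), &clock_mhz, NULL);

    const double cycles_per_second = (double) clock_mhz * 1.0e6;
    printf("Cores: %u at %u MHz\n", cores, clock_mhz);
    printf("Peak A  instruction words/s: %.3e\n", cores * 2.0 * cycles_per_second);
    printf("Peak LS instruction words/s: %.3e\n", cores * 1.0 * cycles_per_second);
    printf("Peak T  instruction words/s: %.3e\n", cores * 1.0 * cycles_per_second);
    return 0;
}
```

For example, a four-core device clocked at 600 MHz (an assumed figure, purely for illustration) would peak at 4 × 2 × 600M = 4.8 billion A instruction words per second.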
Try as I might, I cannot assign the Midgard architecture to a single class:
So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:
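One technique that is commonly advised for Midgard, and a safe illustration here, is to use OpenCL vector types, since the arithmetic pipelines operate on 128-bit SIMD registers. The kernel pair below is my own hypothetical sketch of the idea (the kernel names are invented, not taken from this series):

```c
/* Hypothetical example: a scalar kernel and a vectorised equivalent.
 * Each work-item of the vectorised version handles a float4, which maps
 * naturally onto the 128-bit SIMD arithmetic units, so it should be
 * enqueued with a quarter of the scalar global work size.
 */
__kernel void scale_add(__global const float *x,
                        __global const float *y,
                        __global       float *z,
                        const float a)
{
    const size_t i = get_global_id(0);
    z[i] = a * x[i] + y[i];
}

__kernel void scale_add_v4(__global const float4 *x,
                           __global const float4 *y,
                           __global       float4 *z,
                           const float a)
{
    const size_t i = get_global_id(0);
    z[i] = a * x[i] + y[i];
}
```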
In some respects, writing high-performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:
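To give one example of such a simplification (a general property of Mali-based SoCs rather than a point quoted from this post): the CPU and GPU share the same physical memory, so a buffer can be allocated with CL_MEM_ALLOC_HOST_PTR and mapped for host access instead of being copied across a bus, as in this hypothetical helper:

```c
/* Hypothetical helper: allocate a buffer in shared memory and fill it
 * in place via mapping, avoiding an explicit host-to-device transfer
 * (e.g. clEnqueueWriteBuffer). Error handling is omitted for brevity.
 */
#define CL_TARGET_OPENCL_VERSION 120
#include <stddef.h>
#include <CL/cl.h>

cl_mem create_and_fill_buffer(cl_context context, cl_command_queue queue,
                              size_t n)
{
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(float), NULL, NULL);

    /* Map the buffer into the host address space and initialise it there. */
    float *ptr = (float *) clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                              0, n * sizeof(float),
                                              0, NULL, NULL, NULL);
    for (size_t i = 0; i < n; ++i)
        ptr[i] = (float) i;

    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    return buf;
}
```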
With this theoretical knowledge under our belts, we can now look at optimising some kernels! Can you guess where we'll start? (Or use the comments to vote for your favourite kernel!)