In his book "How Music Works", David Byrne points out that music is created to fit a given context: music that would sound great in a symphony hall would likely sound unconvincing in a stadium. Similarly, OpenCL™ programs are often created with a particular context in mind. In particular, OpenCL kernels are often optimised for a particular compute device (e.g. the GPU in the programmer's desktop machine).
I am continuing my blog series by presenting the fundamentals of the ARM Midgard architecture underpinning the ARM® Mali™-T600 and Mali-T700 GPU series.
The number of Mali cores in a system-on-chip (SoC) can be scaled to satisfy the performance requirements of the SoC. For example, the Mali-T624 can be scaled from one to four cores. Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline:
Thus, the peak throughput of each core is two A instruction words, one LS instruction word and one T instruction word per cycle.
Try as I might, I cannot assign the Midgard architecture to a single class:
So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:
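As an illustration of one such recommendation, the Midgard arithmetic pipelines are SIMD, so using OpenCL vector types gives the compiler whole vector operations to schedule rather than asking it to re-vectorise scalar code. A hedged sketch (the kernel names and the scaling operation are my own illustration, not from this post):

```c
/* Scalar version: each work-item processes one float. */
__kernel void scale_scalar(__global const float *src,
                           __global float *dst,
                           const float alpha)
{
    size_t i = get_global_id(0);
    dst[i] = alpha * src[i];
}

/* Vectorised version: each work-item processes a float4,
   mapping naturally onto the SIMD arithmetic pipelines and
   quartering the number of work-items the host must launch. */
__kernel void scale_vec4(__global const float4 *src,
                         __global float4 *dst,
                         const float alpha)
{
    size_t i = get_global_id(0);
    dst[i] = alpha * src[i];
}
```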
In some respects, writing high performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:
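One example of this: on a typical Mali-based SoC the GPU and CPU share the same physical memory, so host data can often be handed to the GPU by mapping a buffer rather than copying it across a bus, as a discrete desktop GPU would require. A hedged host-side fragment (error handling omitted; `ctx`, `queue` and `n` are assumed to exist in the surrounding program):

```c
/* Allocate a buffer the driver can back with host-accessible
   memory, then map it instead of using clEnqueueWriteBuffer.
   On a shared-memory SoC this typically avoids a copy. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                            n * sizeof(cl_float), NULL, &err);

/* Map for writing, fill the data in place, then unmap
   before enqueueing any kernel that uses the buffer. */
cl_float *p = (cl_float *)clEnqueueMapBuffer(
    queue, buf, CL_TRUE, CL_MAP_WRITE, 0,
    n * sizeof(cl_float), 0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = (cl_float)i;  /* initialise directly in shared memory */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```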
With this theoretical knowledge under our belts, we can now look at optimising some kernels! Can you guess where we'll start? (Or use the comments to vote for your favourite kernel!)
Thanks for the comment. You are absolutely right: using the caches effectively is paramount to achieving high performance. I am planning to provide examples of cache optimisations in the not-so-distant future.
I hope you will write a post on optimizing for the L1 cache. Thanks! There is a document saying the L1 cache has about 250 cache lines (512-bit); is that true?