Those two links are great starting points for getting a feel for the Mali architecture.
The Architecture of the Mali Midgard
The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core
But I fail to understand some key concepts (my background is mostly desktop GPUs, hence my confusion). Each tile gets assigned to one shader core (SC). Is this a 1:1 relationship or can a tile be assigned to many SCs? What is the granularity of threads for an SC? For example, AMD's Graphics Core Next architecture has 64 threads running and a bunch of others waiting. That is, a single SC has 64 threads in flight and a pack of others in a waiting list to manage internal scheduling (latency hiding). If one triggers work, say via a compute shader, that is not a multiple of 64 threads, then one is wasting compute power. If one has a complex shader, then the "waiting list" cannot be filled to maximum capacity due to a lack of resources (internal registers) and latency hiding degrades.

So, how many threads are in the shader core thread pool (I assume this is the waiting list)? How many threads can be running on the arithmetic pipeline of a single core (do 2 pipelines mean 2 threads)? How does the arithmetic pipeline handle branches (static and dynamic)? Is my mental model completely off? And what about memory access? I have a feeling my mental model is completely wrong when it comes to Mali and I need your light on this subject.
Is this a 1:1 relationship or can a tile be assigned to many SCs?
One tile belongs to a single shader core.
What is the granularity of threads for an SC?
Single threads are scheduled individually in Midgard family cores (Mali-T600/700/800 series). Groups of four threads (called quads) are bundled together in the Bifrost family cores (Mali-G71).
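To make that concrete, here's a quick C sketch of padding a dispatch up to the scheduling granularity, following the GCN example in your question. The function name and numbers are purely illustrative, not any vendor API:

```c
#include <stdio.h>
#include <stddef.h>

/* Round a dispatch up to the hardware's scheduling granularity so no
 * partially filled thread groups are issued: 1 for Midgard (threads
 * scheduled individually), 4 for a Bifrost quad, 64 for a GCN
 * wavefront. Illustrative helper, not a vendor API. */
static size_t round_up(size_t work_items, size_t granularity)
{
    return ((work_items + granularity - 1) / granularity) * granularity;
}

int main(void)
{
    /* 1000 work items waste 24 lanes on GCN unless padded to 1024;
     * on Midgard (granularity 1) nothing is wasted. */
    printf("GCN:     %zu\n", round_up(1000, 64)); /* 1024 */
    printf("Bifrost: %zu\n", round_up(1000, 4));  /* 1000 */
    printf("Midgard: %zu\n", round_up(1000, 1));  /* 1000 */
    return 0;
}
```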
So, how many threads are in the shader core thread pool (I assume this is the waiting list)?
For Midgard it's 256 threads maximum in the core (executing + waiting). For Bifrost the information isn't public yet - watch this space, we'll have a follow-up blog on the new architecture shortly.
How many threads can be running on the arithmetic pipeline of a single core (do 2 pipelines mean 2 threads)?
Just like a CPU, each pipeline is multiple stages long, so you can issue 1 thread per clock cycle per pipeline. The total number in flight depends on the pipeline length.
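As a purely illustrative worked example of that relationship (these depths are made up; the real Midgard figures aren't public):

```c
#include <stdio.h>

int main(void)
{
    /* Assumed numbers for illustration only - not published Midgard
     * figures. */
    const int pipelines = 2;  /* arithmetic pipelines per core */
    const int depth     = 8;  /* stages per pipeline (assumed) */

    /* One thread is issued per pipeline per clock and each stage
     * holds a different thread, so the arithmetic pipes alone can
     * keep pipelines * depth threads in flight. */
    printf("threads in flight: %d\n", pipelines * depth); /* 16 */
    return 0;
}
```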
How does the arithmetic pipeline handle branches (static and dynamic)?
There is no significant direct cost for branching, but indirectly there are always side-effects which cost performance (loss of locality in caches, etc.). Performance is always better if threads don't branch too much and don't diverge too much across nearby vertices / fragments.
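As an illustration of why short divergent branches are usually cheap, compilers often flatten them into selects: both sides are evaluated and the result is picked without any control-flow divergence. A scalar C sketch of the idea (not Mali-specific; the functions are illustrative):

```c
#include <stdio.h>

/* Divergent form: on a GPU, threads taking different sides of the
 * branch would hurt locality. */
static float shade_branchy(float x)
{
    if (x > 0.5f)
        return x * 2.0f;  /* "expensive" side */
    return x * 0.5f;      /* "cheap" side */
}

/* Flattened form: evaluate both sides, then select. No divergence,
 * at the cost of executing both sides - only worth it when the
 * sides are short. */
static float shade_flattened(float x)
{
    float hi = x * 2.0f;
    float lo = x * 0.5f;
    return (x > 0.5f) ? hi : lo;
}

int main(void)
{
    for (float x = 0.0f; x <= 1.0f; x += 0.25f)
        printf("%.2f -> %.2f %.2f\n", x,
               shade_branchy(x), shade_flattened(x));
    return 0;
}
```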
And what about memory access?
Not entirely sure what your question is here - it's a massive topic. Can you narrow down what you want to know?
Cheers, Pete
Thank you. This information is most valuable.
As for memory access, I was referring to scheduling and guidelines for the best memory access patterns. I assume that when a thread waits on a memory access, it is put in a waiting state and a new one takes its place. What if this new thread runs a different program? I am referring to the complexity of shaders: how can I tell whether my shader is behaving in a scheduling-friendly manner? What happens exactly in the tripipes when a texture fetch is requested? Do we have some kind of control over the load/store and texture pipes? Can a fragment shader explicitly use both to improve data fetching?
For non-compute shaders there isn't really any direct control over memory access patterns other than ensuring your data resources are efficient.
For compute shaders it becomes more important: ensure good data locality, and avoid reading tangentially across cache lines and MMU pages (e.g. for a large SGEMM matrix multiplication you really, really want to transpose one of the input matrices to avoid thrashing the cache / uTLB).
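To show what that access pattern looks like, here's a plain-C sketch of the transpose trick - a host-side illustration of the idea, not a Mali kernel, and all names are illustrative:

```c
#include <stdlib.h>

/* Naive C = A*B walks B down a column in the inner loop, touching a
 * new cache line (and potentially a new MMU page) every iteration.
 * Transposing B once up front makes both inner-loop reads sequential
 * and row-major. Square n*n matrices for simplicity. */
void sgemm_transposed(int n, const float *A, const float *B, float *C)
{
    float *Bt = malloc((size_t)n * n * sizeof *Bt);

    for (int i = 0; i < n; i++)          /* transpose B once */
        for (int j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)  /* both reads now sequential */
                acc += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = acc;
        }

    free(Bt);
}
```

The one-off cost of the transpose is tiny compared to the O(n^3) multiply, so it pays for itself quickly on large matrices.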
None of this is really Mali-specific - most GPU vendors will give exactly the same advice (keep things small, ensure good locality, don't give the GPU more data than it needs, don't give the GPU redundant data).