Those two links are great starters to get a feel for the Mali architecture.
The Architecture of the Mali Midgard
The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core
But I fail to understand some key concepts (my background is mostly desktop GPUs, hence my confusion). Each tile gets assigned to one shader core (SC). Is this a 1:1 relationship, or can a tile be assigned to many SCs? What is the granularity of threads for an SC? For example, AMD's Graphics Core Next architecture has 64 threads running and a bunch of others waiting; that is, a single SC has 64 threads in flight and a pack of others in a waiting list used for internal scheduling (latency hiding). If one triggers work, say via a compute shader, that is not a multiple of 64 threads, then one is wasting compute power. If one has a complex shader, then the waiting list cannot be filled to maximum capacity due to a lack of resources (internal registers) and latency hiding degrades.

So, how many threads are in the shader core thread pool (I assume this is the waiting list)? How many threads can be running on the arithmetic pipelines of a single core (do 2 pipelines mean 2 threads)? How does the arithmetic pipeline handle branches (static and dynamic)? Is my mental model completely off? And what about memory access? I have a feeling my mental model is completely wrong when it comes to Mali and need your light on this subject.
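To make the dispatch-granularity concern concrete, here is a rough illustration (plain C rather than shader code; the 64-wide group width is the GCN assumption from above, and whatever the Mali equivalent is, is exactly what I am asking about):

```c
#include <stdio.h>

/* Illustrative only: the "wasted lanes" arithmetic from the question,
 * assuming a hypothetical 64-wide hardware thread group as on AMD GCN.
 * The equivalent width for a Mali shader core is the open question. */
int main(void)
{
    const unsigned wave_width = 64;   /* assumed hardware group width */
    const unsigned work_items = 1000; /* example dispatch size */

    unsigned waves       = (work_items + wave_width - 1) / wave_width; /* round up */
    unsigned lanes_total = waves * wave_width;
    unsigned lanes_idle  = lanes_total - work_items;

    printf("%u work items -> %u waves, %u idle lanes (%.1f%% wasted)\n",
           work_items, waves, lanes_idle, 100.0 * lanes_idle / lanes_total);
    return 0;
}
```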
Thank you. This information is most valuable.
As for memory access, I was referring to scheduling and to guidance on the best memory access patterns. I assume that when a thread waits for a memory access, it is put into a waiting state and a new one takes its place. What if this new thread operates on a different program? I am referring to the complexity of shaders. How can I tell whether my shader is behaving in a scheduler-friendly manner? What happens exactly in the tripipes when a texture fetch is requested? Do we have some kind of control over the load/store and texture pipes? Can a fragment shader explicitly use both to improve data fetching?
For non-compute shaders there isn't really any direct control over memory access patterns, other than ensuring your data resources are efficient.
For compute shaders it becomes more important: ensure good data locality, and avoid reading tangentially across cache lines and MMU pages (e.g. if you do a large SGEMM matrix multiplication, you really want to transpose one of the input matrices to avoid thrashing the cache / uTLB).
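To illustrate the SGEMM point, here is a rough CPU-side sketch of the two access patterns (plain C rather than an actual kernel, with illustrative function names): the naive version strides across B by a full row per inner-loop step, touching a new cache line almost every iteration, while the transposed version reads both inputs sequentially.

```c
#include <stddef.h>

/* Naive: C[i][j] += A[i][k] * B[k][j] -- B is read with stride n,
 * so consecutive k values hit different cache lines / pages. */
void sgemm_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j]; /* strided read of B */
            C[i * n + j] = acc;
        }
}

/* Transposed: Bt[j][k] == B[k][j], so both operands are read sequentially
 * along the inner loop, which is far friendlier to the cache and uTLB. */
void sgemm_transposed(size_t n, const float *A, const float *Bt, float *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * Bt[j * n + k]; /* sequential reads */
            C[i * n + j] = acc;
        }
}
```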
None of this is really Mali-specific - most GPU vendors will have exactly the same advice (keep things small, ensure good locality, don't give the GPU more data than it needs, don't give the GPU redundant data).