These two links are a great starting point for getting a feel for the Mali architecture:
The Architecture of the Mali Midgard
The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core
But I fail to understand some key concepts (my background is mostly in desktop GPUs, hence my confusion).

Each tile gets assigned to one shader core (SC). Is this a 1:1 relationship, or can a tile be assigned to several SCs?

What is the thread granularity of an SC? For example, AMD's Graphics Core Next architecture has 64 threads running and a bunch of others waiting. That is, a single SC has 64 threads in flight and a pack of others in a waiting list to manage internal scheduling (latency hiding). If one triggers work, say from a compute shader, that is not a multiple of 64 threads, then one is wasting compute power. If one has a complex shader, then the "waiting list" cannot be filled to maximum capacity due to a lack of resources (internal registers), and latency hiding degrades.

So, how many threads are in the shader core thread pool (I assume this is the waiting list)? How many threads can be running on the arithmetic pipeline of a single core (do 2 pipelines mean 2 threads)? How does the arithmetic pipeline handle branches (static and dynamic)? And what about memory access?

I have a feeling my mental model is completely wrong when it comes to Mali, and I need your light on this subject.
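To make the "not a multiple of 64" point concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes a fixed group width of 64 threads as on GCN; whether Mali has any comparable width is exactly what I am asking. The function name `lane_utilization` is just something I made up for illustration.

```python
WAVEFRONT_SIZE = 64  # GCN width as described above; an assumption, not a Mali figure


def lane_utilization(num_threads, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of launched lanes that do useful work.

    Work is launched in whole groups of `wavefront_size` threads, so a
    thread count that is not a multiple of it leaves lanes idle in the
    last group.
    """
    if num_threads <= 0:
        return 0.0
    # Integer ceiling division: number of whole groups needed.
    groups = (num_threads + wavefront_size - 1) // wavefront_size
    return num_threads / (groups * wavefront_size)


if __name__ == "__main__":
    for n in (64, 65, 100, 128):
        print(n, lane_utilization(n))
```

With 65 threads, two groups of 64 are launched and only 65 of the 128 lanes do real work, so just over half of the hardware is used.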
For non-compute shaders there isn't really any direct control over memory access patterns other than ensuring efficient data resources:
For compute shaders it becomes more important: ensure good data locality, and avoid reading tangentially across cache lines and MMU pages (e.g. if you do a large SGEMM matrix multiplication, you really, really want to transpose one of the input matrices to avoid thrashing the cache / uTLB).
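The transposition advice can be sketched on the CPU side as follows. This is only an illustration of the indexing pattern in plain Python with row-major flat lists, not a Mali compute shader and not a tuned SGEMM; the function names are hypothetical. In the naive version the inner loop walks B with a stride of `m` elements, so consecutive reads land on different cache lines (and, for large matrices, different pages). With B transposed up front, both inner-loop reads are stride 1.

```python
def transpose(mat, rows, cols):
    """Return the transpose of a row-major rows x cols matrix (flat list)."""
    out = [0.0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[c * rows + r] = mat[r * cols + c]
    return out


def matmul_naive(a, b, n, k, m):
    """C = A (n x k) * B (k x m). The inner loop reads B with stride m."""
    c = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # b index jumps by m elements each iteration: poor locality.
                acc += a[i * k + p] * b[p * m + j]
            c[i * m + j] = acc
    return c


def matmul_transposed(a, bt, n, k, m):
    """Same product, but bt is B transposed (m x k): both reads are stride 1."""
    c = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # a and bt are both walked contiguously here.
                acc += a[i * k + p] * bt[j * k + p]
            c[i * m + j] = acc
    return c


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # 2 x 3
    b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]   # 3 x 2
    bt = transpose(b, 3, 2)                 # 2 x 3
    print(matmul_naive(a, b, 2, 3, 2))
    print(matmul_transposed(a, bt, 2, 3, 2))
```

Both functions compute the same result; the transpose costs one extra pass over B, which is usually small next to the multiplication itself for large matrices.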
None of this is really Mali-specific - most GPU vendors will have exactly the same advice (keep things small, ensure good locality, don't give the GPU more data than it needs, don't give the GPU redundant data).