
Mali Graphics Architecture

These two links are great starting points to get a feel for the Mali architecture:

The Architecture of the Mali Midgard

The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core

But I fail to understand some key concepts (my background is mostly in desktop GPUs, hence my confusion). Each tile gets assigned to one shader core (SC). Is this a 1:1 relationship, or can a tile be assigned to many SCs? What is the granularity of threads for an SC? For example, AMD's Graphics Core Next architecture has 64 threads running and a bunch of others waiting; that is, a single SC has 64 threads in flight and a pack of others in a waiting list used for internal scheduling (latency hiding). If one triggers work, say via a compute shader, that is not a multiple of 64 threads, then one is wasting compute power. If one has a complex shader, then the "waiting list" cannot be filled to maximum capacity due to lack of resources (internal registers) and latency hiding degrades.

So, how many threads are in the shader core thread pool (I assume this is the waiting list)? How many threads can be running on the arithmetic pipeline of a single core (do 2 pipelines mean 2 threads)? How does the arithmetic pipeline handle branches (static and dynamic)? Is my mental model completely off? And what about memory access? I have a feeling my mental model is completely wrong when it comes to Mali and would appreciate your light on this subject.
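
To make the "multiple of 64" concern concrete, here is a minimal C sketch of the arithmetic I have in mind, assuming a hypothetical 64-wide wavefront as on GCN; the width and the numbers are illustrative only, not Mali figures.

    #include <stdio.h>

    /* Illustrative only: 64 is GCN's wavefront width, not a Mali parameter.
     * A dispatch is rounded up to whole wavefronts, so any remainder
     * occupies lanes that do no useful work. */
    static unsigned padded_dispatch(unsigned work_items, unsigned wavefront_size,
                                    unsigned *idle_lanes)
    {
        unsigned wavefronts = (work_items + wavefront_size - 1) / wavefront_size;
        unsigned launched   = wavefronts * wavefront_size;
        *idle_lanes = launched - work_items;
        return launched;
    }

    int main(void)
    {
        unsigned idle = 0;
        unsigned launched = padded_dispatch(1000, 64, &idle);
        /* 1000 work items -> 16 wavefronts -> 1024 lanes launched, 24 idle. */
        printf("launched %u lanes, %u idle\n", launched, idle);
        return 0;
    }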

  • Is this a 1:1 relationship, or can a tile be assigned to many SCs?

    One tile belongs to a single shader core.

    What is the granularity of threads for an SC?

    Single threads are scheduled individually in Midgard family cores (Mali-T600/700/800 series). Groups of four threads (called quads) are bundled together in the Bifrost family cores (Mali-G71).

    So, how many threads are in the shader core thread pool (I assume this is the waiting list)?

    For Midgard it's 256 threads maximum in the core (executing + waiting). For Bifrost the information isn't public yet - watch this space, we'll have a follow-up blog on the new architecture shortly.

    How many threads can be running on the arithmetic pipeline of a single core (do 2 pipelines mean 2 threads)?

    Just like a CPU, each pipeline is multiple stages long, so you can issue one thread per clock cycle per pipeline. The total number in flight depends on the pipeline length (a rough worked example follows this reply).

    How does the arithmetic pipeline handle branches (static and dynamic)?

    There is no significant direct cost for branching, but indirectly there are always side-effects which cost performance (loss of locality in caches, etc.). Performance is always better if threads don't branch too much, and don't diverge too much across nearby vertices / fragments.

    And what about memory access?

    Not entirely sure what your question is here - it's a massive topic. Can you narrow down what you want to know?

    Cheers,
    Pete
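
Putting the reply's pipeline answer and the 256-thread figure together, here is a minimal back-of-the-envelope sketch; the pipeline count and depth used below are hypothetical placeholders, since ARM has not published those Midgard details.

    #include <stdio.h>

    /* Rough illustration of "one thread issued per clock per pipeline":
     * in a simple in-order pipeline a thread occupies one stage per cycle,
     * so at most pipelines * pipeline_depth threads are executing at once.
     * The pipeline count and depth are hypothetical placeholders, not
     * published Midgard figures; only the 256-thread per-core pool comes
     * from the reply above. */
    int main(void)
    {
        const unsigned pipelines      = 2;   /* hypothetical                 */
        const unsigned pipeline_depth = 8;   /* hypothetical stage count     */
        const unsigned thread_pool    = 256; /* per-core executing + waiting */

        unsigned executing = pipelines * pipeline_depth;
        printf("up to %u threads executing in the arithmetic pipelines\n",
               executing);
        printf("up to %u threads waiting, available to hide latency\n",
               thread_pool - executing);
        return 0;
    }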
