Mali Offline Compiler - arithmetics cycles vs texture cycles

As the documentation and other sources explain, when you work with the Mali Offline Compiler you should first focus on the stage with the highest cycle count in its output (i.e. arithmetic, load/store, or texture).

One thing I noticed is that in pretty much any shader the texture unit is never the bottleneck.
Example:

Hardware: Mali-T720 r1p1
Architecture: Midgard
Driver: r23p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 4
Uniform registers: 1
Stack spilling: false

                                A      LS       T    Bound
Total instruction cycles:   22.00    1.00    5.00        A
Shortest path cycles:       10.75    1.00    5.00        A
Longest path cycles:        10.75    1.00    5.00        A

A = Arithmetic, LS = Load/Store, T = Texture

Every texture instruction usually takes one cycle (on Midgard at least).

If you add another texture, you need to do something with it, at least blend it with the other computation, so the arithmetic cycles go up as well. So, as I said, texture cycles are almost never higher than the other columns.
When I optimize shaders, my current intuition is still to be quite aggressive and reduce texture fetches as much as possible. I usually don't trade arithmetic for texture fetches, i.e. I don't move computation from arithmetic into a baked texture unless it's something very expensive.

Another thing: the Mali Offline Compiler assumes that every texture fetch is bilinear and that the texture has mipmaps.
We currently mostly use bilinear filtering without mipmaps on mobile.
Rationale: once you start using mipmaps you also need trilinear filtering, otherwise the transitions between mipmap levels become visible.
Trilinear filtering means double the cycles, and it also needs more memory throughput (fetching 8 texels instead of 4 for bilinear).
On the other hand, not using mipmaps means poor cache utilization, which also means more memory throughput. I have no idea which is better in practice. I guess it depends on the project and hardware. Or is there a universal answer?
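To make the "multiply the texture score by 2" idea concrete, here is a minimal sketch that re-evaluates which unit is the bound after scaling the T column. It uses the longest-path numbers from the report above; the `texture_scale` parameter and the `bound_unit` helper are hypothetical names for illustration, and the 2x factor is only the rough trilinear assumption described above, not something the offline compiler reports.

```python
# Longest-path cycle counts copied from the Mali-T720 report above.
CYCLES = {"A": 10.75, "LS": 1.00, "T": 5.00}


def bound_unit(cycles, texture_scale=1.0):
    """Return (unit, cost) for the most expensive pipeline.

    texture_scale is a hypothetical multiplier applied to the T column,
    e.g. 2.0 to approximate every fetch being trilinear.
    """
    adjusted = dict(cycles)
    adjusted["T"] = cycles["T"] * texture_scale
    unit = max(adjusted, key=adjusted.get)
    return unit, adjusted[unit]


print(bound_unit(CYCLES))       # bilinear, as reported by the compiler
print(bound_unit(CYCLES, 2.0))  # assumed trilinear cost for every fetch
```

For this particular shader the doubled texture cost is 10.0 cycles, still just below the 10.75 arithmetic cycles, so the shader stays arithmetic bound under that assumption. This says nothing about memory bandwidth, which the static cycle counts do not model.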

Fetching a texture also means latency. This latency is hidden to some degree if the shader uses a relatively small amount, but I assume it's still there.

Once I switch to another project in the company, I'll have time to do extensive tests related to the cost of textures and hopefully build some intuition.
As I am impatient and curious, I do hope other more experienced devs will share their intuition here.

So my questions:
1. Is it a good strategy to aggressively optimize out texture fetches and treat them as very expensive (even if texture is not the bottleneck according to the Mali Offline Compiler)? Should I adjust the Mali Offline Compiler score, i.e. multiply it by 2 (so it reflects trilinear filtering), or should I use a GPU profiler and look at GPU metrics such as memory throughput to make the final decision? How do you do it in practice?

2. Bilinear without mipmaps vs. trilinear with mipmaps: which do you think is better in practice? How do you choose which to use? Does it depend on the hardware? We need to support Midgard devices (we are a mobile development company and support very old devices).

3. If you can share any links, books, or other resources explaining any of the above, please do. I have already read the official Mali documentation and optimization guides.

0 Peter Harris over 3 years ago in reply to Peter Harris

This video might help introduce some of the concepts here:

* www.youtube.com/watch

0 Mikhail Golub over 3 years ago in reply to Peter Harris

Thanks again, Peter. I watched the videos and they were helpful.

Do I have the correct understanding now?

I.e. each shader core holds a pool of threads (how many depends on the architecture, Midgard: 256, Valhall: 1024, and on the shader program's register usage).

Midgard executes a single thread at a time (because it's a vector architecture).
Bifrost/Valhall execute a warp (i.e. 8/16 threads at the same time in lockstep).

Once a thread/warp stalls, the core selects another one, and this is considered "free".

1. Am I correct that when a thread/warp finishes, it is removed from the core's thread set and immediately replaced by something else, i.e. the core pulls more work from some queue? Or does it finish the whole thread set and then take the next batch? (I don't see a reason for that, but who knows.)

2. Does it mean that, for example, a Midgard core can effectively wait for 256 texture fetches in parallel without any problems?

3. In a theoretical situation with no texture cache (for simplicity of calculation), the total execution time for fetching all 256 texels would be approximately 256+time_of_one_fetch (hundreds of cycles) instead of 256*time_of_one_fetch.

4. If you can answer this one: can I apply this rough model to all modern mobile GPUs (from other major vendors), or are there caveats that make it better to study their documentation?
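
The estimate in question 3 can be checked with a small sketch. It assumes an idealized core that issues one fetch per cycle, has no texture cache, and returns each result a fixed number of cycles after issue; the function names and the 300-cycle latency are hypothetical values chosen for illustration, not measured Mali figures.

```python
def serial_cycles(num_fetches, latency):
    """Each fetch waits for the previous one to return before issuing."""
    return num_fetches * latency


def pipelined_cycles(num_fetches, latency):
    """One fetch is issued per cycle; all fetches are in flight together."""
    finish = 0
    for issue_cycle in range(num_fetches):
        # The issue occupies one cycle, then the data returns `latency` later.
        finish = max(finish, issue_cycle + 1 + latency)
    return finish


NUM_THREADS = 256  # Midgard thread pool size mentioned above
LATENCY = 300      # hypothetical fetch latency in cycles

print(serial_cycles(NUM_THREADS, LATENCY))
print(pipelined_cycles(NUM_THREADS, LATENCY))
```

With these numbers the overlapped case finishes in 256 + 300 = 556 cycles, against 256 * 300 = 76800 cycles if every fetch had to wait for the previous one.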

+1 Peter Harris over 3 years ago in reply to Mikhail Golub

Mikhail Golub said:
Midgard executes single thread at time (because its vector architecture)
Bifrost/Valhall executes warp (i.e. 8/16 threads at the same time in lockstep)

Yes. Just to be clear "at a time" = per instruction issue. You can have multiple threads live at different stages in the pipeline.

Mikhail Golub said:
1. Am I correct that when thread/warp finishes, it's being removed from core thread set and immediately replaced by something else i.e. core pulls more work from some queue

Yes, the shader core has queues of work waiting to become threads (the next compute work items, or the next set of rasterized fragments) as soon as there is capacity for them.

Mikhail Golub said:
2. Does it mean that for example Midgard core can effectively wait for 256 texture fetches in parallel without any problems?

Yes, that's the general idea. In reality if a very high percentage of your total thread pool is waiting for data you probably start to run out of things to do, so "without any problems" is going to be an optimistic outlook =0.
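
A simplified way to picture "running out of things to do" is the usual latency-hiding estimate sketched below. It assumes every thread alternates between a fixed burst of arithmetic and a single fetch with fixed latency, and that thread switching is free; `arithmetic_utilization` and all the numbers are hypothetical, chosen only to illustrate the shape of the model.

```python
def arithmetic_utilization(num_threads, work_cycles, latency):
    """Estimate the fraction of cycles the arithmetic unit stays busy.

    Each thread does `work_cycles` of arithmetic, then stalls for `latency`
    cycles on a fetch. While it stalls, the other threads can run.
    """
    busy = num_threads * work_cycles
    period = work_cycles + latency
    return min(1.0, busy / period)


# Hypothetical numbers: 10 cycles of arithmetic per 300-cycle fetch.
print(arithmetic_utilization(256, 10, 300))  # large pool: latency fully hidden
print(arithmetic_utilization(16, 10, 300))   # small pool: core partly idle
```

In this model a full pool of 256 threads keeps the arithmetic unit saturated, while a pool of 16 (for example, when high register usage limits the thread count) leaves it idle for roughly half the time.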

Mikhail Golub said:
3. In theoretical situation where there is no texture cache (for simplicity of calculations) total execution time for fetching all 256 texels will be approximately 256+time_of_one_fetch (hundreds of cycles) instead of 256*time_of_one_fetch.

Yes, that's the idea.

Mikhail Golub said:
can I apply this rough understanding model to all modern mobile GPUs (from other major vendors)

I don't know if this is entirely accurate for other vendors - I don't know their microarchitectures - but I'd expect all GPUs to broadly fit this working model.

Mikhail Golub said:
or there are some caveats

There are always caveats =)

Cheers,
Pete