Request for help: Performance bottleneck analysis with GA

Hi guys,

I am doing some basic bottleneck analysis on the Kirin 960 SoC with a Unity URP demo (SSAA on, HDR pipeline...). I ran the demo on the Mate 9, Mate 9 Pro, Honor 9 and Nova 2s; all of these devices use the Kirin 960 SoC.

But the FPS varies widely. The Mate 9 and Mate 9 Pro (Android 9) run the demo at 20 FPS, while the Honor 9 (Android 8) and Nova 2s (Android 9) only reach 10 FPS.

I used Streamline to look for the bottleneck. It shows a huge read stall rate on the Honor 9, and all of the devices have a huge write stall rate.

The following picture shows the difference between the Mate 9 (left, 20 FPS) and the Honor 9 (right, 10 FPS).

The GPU active chart shows that the frequencies of the two GPUs are almost the same, with no throttling.

Mali Core L2 Memory Reads (load/store bytes) on the Honor 9 is 4 times the number on the Mate 9.

This seems to be caused by the L2 cache size or frequency. Is this guess reasonable? And how can I get the L2 cache size or frequency on a Mali GPU?

+1 Peter Harris over 3 years ago

The latency measurements are from the memory system outside of the GPU (i.e. from system MMU translation, system cache, or external DRAM). High latency can be due to e.g. a slower bus frequency, a slower system cache frequency, use of a slower external memory, or a high miss rate in the SMMU TLB. Another possible reason is contention from another processor in the SoC using DRAM at the same time.

This is all outside of the GPU, so it's not possible to measure using Streamline unless the device manufacturer has some means to expose it (it's not Arm IP).

Kind regards,
Pete

0 cloud_zero over 3 years ago in reply to Peter Harris

Thanks for the reply, very helpful!

0 cloud_zero over 3 years ago in reply to Peter Harris

One more question. I checked the RAM size on the devices: the low-FPS devices have 4 GB of RAM and the high-FPS devices have 6 GB. Will RAM size affect the FPS? I think the high latency is caused by a higher data demand. In Streamline, the low-FPS devices read more data from the L2 cache and system memory than the high-FPS devices, which does not make sense to me: why would the low-FPS devices need more data than the high-FPS devices for the same demo? They should have the same L1 and L2 cache sizes.
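
A quick way to make the comparison above concrete is to normalise the read traffic by frame rate, so that the two devices are compared per rendered frame rather than per unit of time. The sketch below is only an illustration: it assumes the Streamline values being compared are rates measured over equal time windows, and the function names are hypothetical helpers, not part of Streamline or any Arm tool.

```python
def bytes_per_frame(bytes_per_second: float, fps: float) -> float:
    """Convert a per-second traffic rate into traffic per rendered frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return bytes_per_second / fps


def per_frame_ratio(rate_ratio: float, fps_a: float, fps_b: float) -> float:
    """Per-frame traffic of device A relative to device B.

    rate_ratio is A's per-second traffic divided by B's per-second traffic
    (assumption: both were sampled over equal time windows).
    """
    if fps_a <= 0 or fps_b <= 0:
        raise ValueError("fps values must be positive")
    # (rate_a / fps_a) / (rate_b / fps_b) == rate_ratio * fps_b / fps_a
    return rate_ratio * fps_b / fps_a


if __name__ == "__main__":
    # Figures quoted in this thread: Honor 9 shows 4x the L2 reads of the
    # Mate 9 while running at 10 FPS instead of 20 FPS.
    print(per_frame_ratio(4.0, fps_a=10.0, fps_b=20.0))
```

Under that assumption, the gap per frame is larger than the gap per second, because the slower device renders fewer frames in the same time window.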