Hi,
My application uses Android Java with JNI to process camera video in real time. The C part uses OpenCL and multicore threading on a MediaTek 9200+ with a Mali G715.
Something strange happens with the application. After a few seconds of processing (70-80 frames), I go from 60-70 ms per frame to 140-160 ms per frame.
What I am doing:
1) I run some kernels to convert YUV and extract data.
2) Then I process the extracted data with CPU multicore threading (4 threads at a time), 7 times.
3) Then I use kernels again to extract data and send it back to Java.
If I remove all the CPU work, the GPU time stays fairly stable, going from 20 ms at the beginning (the first 70-80 frames) to 40-45 ms. But with the CPU work, the time increases dramatically after 70-80 frames, using the same amount of input data. The same problem happened with my old Mali G72 after 20 frames.
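To pin down the exact frame where the slowdown starts, each stage can be timed separately and the recorded times scanned for the step change. The following is only a minimal sketch: `time_ms` and `find_slowdown` are hypothetical helper names, and the window size and ratio are arbitrary choices, not values taken from the application.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Runs `work` once and returns its wall-clock duration in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Returns the index of the first frame at which the mean time of the next
// `window` frames exceeds `ratio` times the mean of the first `window`
// frames (the baseline), or -1 if that never happens.
long find_slowdown(const std::vector<double>& frame_ms,
                   std::size_t window, double ratio) {
    if (window == 0 || frame_ms.size() < 2 * window) return -1;
    const double baseline =
        std::accumulate(frame_ms.begin(), frame_ms.begin() + window, 0.0) /
        static_cast<double>(window);
    // Sliding sum over frames [start, start + window).
    double sum = std::accumulate(frame_ms.begin() + window,
                                 frame_ms.begin() + 2 * window, 0.0);
    for (std::size_t start = window;; ++start) {
        if (sum / static_cast<double>(window) > ratio * baseline) {
            return static_cast<long>(start);
        }
        if (start + window >= frame_ms.size()) break;
        sum += frame_ms[start + window] - frame_ms[start];
    }
    return -1;
}
```

Calling `time_ms` around each of the three steps and storing one vector per step would show which step changes first, and at which frame.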
I tried to use Streamline, but as I run Windows 7 I cannot use the analyzer, which only runs on Windows 10. But I can get the graphs, and I noticed some strange activity on the GPU.
1) The Mali Memory Read Latency starts to turn red after 70-80 frames (it shows 25 mega beats?).
2) The Mali geometry culling rate starts to oscillate (it shows 100%).
3) The Mali geometry efficiency starts to oscillate (it shows 3 threads).
4) The Mali Early ZS rate is red.
Many other things start to oscillate as well. But Device Thermal State is 100% green.
In fact, after 70-80 frames a lot of things start to oscillate.
And Streamline is too complicated for me. There are too many things to know and understand, and I do not have the time.
So I am wondering if it would be possible for an expert to analyse it.
As I said, I had the same problem on my old Mali G78, so at some point something goes wrong when alternating GPU-CPU-GPU. In one of my posts about SVM someone told me that using the GPU and the CPU together was not a good idea. But I cannot do with the GPU what I am doing with the CPU. Or I do not know how to do with the GPU what I am doing with the CPU. At some point I need to process data with the CPU.
Thanks for the help.
Did you receive my e-mail?
Hi again,
I did some more testing.
1) I removed the data transfer from GPU to CPU (enqueueMapBuffer), so the multicore processing handles zero data and takes 1 to 2 ms. I also removed the transfer from CPU to GPU (enqueueWriteBuffer, or a cl::Buffer with CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, because I tested both ways of transferring), and I removed the final GPU to JNI buffer transfer for display (enqueueReadBuffer). But the processing time still doubles after a few frames.
2) I also tried removing all the JNI calls, so no more OpenCL and no more multicore threading. In this case the speed is stable and the Streamline graphs no longer oscillate.
3) I also tried removing only the CPU processing, keeping the OpenCL work, the YUV transform and all the reads and writes to the CPU. In this case I noticed that the time doubles after 20 frames and that the Mali Memory Read Latency starts to get very red after 7 seconds, as if the CPU were processing data. So it goes from 7 ms for the first 20 frames to 20 ms afterwards. And if I remove all the GPU/CPU transfers, there is still a little bit of red and the frame time goes from 0 ms to 4 ms after 7 seconds.
So it looks like something goes wrong when transferring data from GPU to GPU and from GPU to CPU. And of course the CPU processing time increases with the amount of data processed. The good indicator is the Mali Memory Read Latency. But it may be something else. I am not good enough to help more.
> Did you receive my e-mail?
No, sorry. Can you try sending to peter.harris@arm.com.
It is done. Let me know if you got it.
No, sorry =(
Surprising.
Sorry, but as you can see, I sent it 3 times. The last time was today, 14/09/2025 at 15:59, and the first time was on 03/09/2025 to performancestudio@arm.com.
And here is the confirmation for the last one. So if you have not received it, is it Arm that blocks it for some reason?
Strange ;))
I made an error in the previous post. The problem is not solved by removing all the debug; that just improves the performance. And the pictures are wrong because I forgot to change the abscissa values to the correct time. Both start at 30. I only saw it after posting. Little mistake.
Here are the correct pictures without debug. After more than 50 tests it is always the same: depending on the amount of data, after a few seconds the processing slows down.
The comments are in the pictures, like before.
My conclusion is that when step 1 (picture 1) slows down, the impact is a time increase in step 2 (picture 2).
And this is completely strange, because if step 1 goes faster, step 2 should go faster also. But it is the inverse?
I will look at step 1 in more detail in the next post.
I suspect the problem is that your algorithm switches between the CPU and the GPU without pipelining them, so both the CPU and GPU are going idle while the other processor is busy. The idle time on a processor often causes frequency scaling control logic to decide that the processor is clocked too high, and so clock frequency gets reduced.
How frequency control works is decided by the OEM, so not really something Arm can help with.
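As a rough illustration of what pipelining means here, the sketch below overlaps stage 1 of frame N+1 with stage 2 of frame N using a worker thread and a small bounded queue, so neither side sits idle waiting for the other. It is only a sketch under stated assumptions: `stage1` and `stage2` are hypothetical stand-ins for the GPU work and the CPU work, plain `int` values stand in for frame buffers, and `BoundedQueue` and `run_pipeline` are invented names. A real OpenCL version would also need one set of buffers per frame in flight, and the OEM frequency governor may still scale the clocks.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Small bounded queue: push blocks while full, pop blocks while empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(int value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(value);
        not_empty_.notify_one();
    }

    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        const int value = queue_.front();
        queue_.pop();
        not_full_.notify_one();
        return value;
    }

private:
    std::size_t capacity_;
    std::queue<int> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs stage1 on a worker thread and stage2 on the calling thread.
// Results come back in frame order because there is a single producer,
// a single consumer and a FIFO hand-off between them.
std::vector<int> run_pipeline(int frames,
                              const std::function<int(int)>& stage1,
                              const std::function<int(int)>& stage2) {
    std::vector<int> results;
    if (frames <= 0) return results;
    results.reserve(static_cast<std::size_t>(frames));

    BoundedQueue handoff(2);  // at most 2 frames waiting between the stages
    std::thread producer([&] {
        for (int f = 0; f < frames; ++f) handoff.push(stage1(f));
    });
    for (int f = 0; f < frames; ++f) {
        results.push_back(stage2(handoff.pop()));
    }
    producer.join();
    return results;
}
```

The queue capacity of 2 bounds how far stage 1 can run ahead, which in a real application would bound the number of frame buffers that have to exist at the same time.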
You are right. I just checked the CPU frequency while running the application and, of course, after a few seconds the frequency drops by a factor of 4, from 2000 to 400/600, then goes back to 2000 for a few ms, then back again to the low frequency.
Thanks for the confirmation.
I suppose that is for energy and heat reasons.
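For anyone who wants to repeat this check from the native side, the current frequency of a core can usually be read from the standard Linux cpufreq sysfs files, which report values in kHz. This is a minimal sketch: `cpufreq_path` and `read_khz` are hypothetical helper names, and whether an app is allowed to read these files on a given Android device is an assumption that should be verified.

```cpp
#include <fstream>
#include <string>

// Builds the standard Linux cpufreq sysfs path for one core.
// The file holds the current frequency of that core in kHz.
std::string cpufreq_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_cur_freq";
}

// Reads a single integer (kHz) from `path`.
// Returns -1 if the file cannot be opened or does not start with a number.
long read_khz(const std::string& path) {
    std::ifstream in(path);
    long khz = -1;
    if (!(in >> khz)) return -1;
    return khz;
}
```

Logging `read_khz(cpufreq_path(n))` for each core next to the per-frame time would show whether the slowdown lines up with the frequency drop.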
By the way, does anyone know which Arm mobile devices allow full CPU speed? It would be good to know.