Processing time doubles after a few seconds

hterrolle 10 months ago

Hi,

My application uses Android Java with JNI to process camera video in real time. The C part uses OpenCL and multicore threading on a MediaTek 9200+ with a Mali G715.

Something strange happens with the application. After a few seconds of processing (70-80 frames), I go from 60-70 ms per frame to 140-160 ms per frame.

What I am doing:

1) I run some kernels to convert YUV and extract data.

2) Then I process the extracted data with CPU multicore threading (4 threads at a time), 7 times.

3) Then I use kernels again to extract data and send it back to Java.

If I remove all the CPU work, the GPU time stays fairly stable, going from 20 ms at the beginning (the first 70-80 frames) to 40-45 ms. But with the CPU work, the time increases dramatically after 70-80 frames for the same amount of input data. The same problem happened with my old Mali G72 after 20 frames.

I tried to use Streamline, but as I run Windows 7 I cannot use the analyzer, which only runs on Windows 10. I can still get the graphs, though, and I noticed some strange activity on the GPU.

1) The Mali Memory Read Latency starts to turn red after 70-80 frames (it shows 25 mega beats?).

2) The Mali geometry culling rate starts to oscillate (it shows 100%).

3) The Mali geometry efficiency starts to oscillate (it shows 3 threads).

4) The Mali Early ZS rate is red.

Many other things start to oscillate too, but Device Thermal State is 100% green.

In fact, after 70-80 frames a lot of things start to oscillate.

And Streamline is too complicated for me. There are too many things to know and understand, and I do not have the time.

So I am wondering whether an expert could analyse it.

As I said, I had the same problem on the old Mali G78, so at some point something goes wrong when alternating GPU-CPU-GPU. In one of my posts about SVM, someone told me that using the GPU and the CPU together was not a good idea. But I cannot do with the GPU what I am doing with the CPU, or I do not know how to. At some point I need to process data with the CPU.

Thanks for the help.

0 hterrolle 10 months ago in reply to hterrolle

Hi,

Did you receive my e-mail?

0 hterrolle 10 months ago in reply to hterrolle

Hi again,

I did some more testing.

1)
I removed the data transfer from GPU to CPU (enqueueMapBuffer), so the multicore processing handles zero data and takes 1 to 2 ms. I also removed the transfer from CPU to GPU (enqueueWriteBuffer, or a cl::Buffer created with CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, because I tested both ways of transferring), and I removed the final GPU-to-JNI buffer read for display (enqueueReadBuffer). But the processing time still doubles after a few frames.

2)

I also tried removing all the JNI calls, so no more OpenCL and no more multicore threading. In this case the speed is stable and the Streamline counters no longer oscillate.

3)

I also tried removing only the CPU processing, keeping the OpenCL work, the YUV transform and all the reads and writes to the CPU. In this case I noticed that the time doubles after 20 frames and that the Mali Memory Read Latency starts to get very red after 7 seconds, as if the CPU were processing data: from 7 ms for the first frames, then 20 ms. And if I remove all the GPU/CPU transfers there is still a little bit of red, and frame processing goes from 0 ms to 4 ms after 7 seconds.

So it looks like something goes wrong when transferring data from GPU to GPU and from GPU to CPU. And of course the CPU processing time increases with the amount of data processed. The good indicator is the Mali Memory Read Latency, but it may be something else. I am not good enough to help more.

0 Peter Harris 10 months ago in reply to hterrolle

> Did you receive my e-mail?

No, sorry. Can you try sending to peter.harris@arm.com?

0 hterrolle 10 months ago in reply to Peter Harris

It is done. Let me know if you got it.

0 Peter Harris 10 months ago in reply to hterrolle

No, sorry =(

0 hterrolle 10 months ago in reply to Peter Harris

Surprising.

Sorry, but as you can see, I sent it 3 times. The last time was today, 14/09/2025 at 15:59, and the first time was on 03/09/2025, to performancestudio@arm.com.

I also have the confirmation for the last one. So if you have not received it, is something on the Arm side blocking it for some reason?

Strange ;))

0 hterrolle 9 months ago in reply to hterrolle

Hi,

I have removed the Streamline picture because it was too big. But this morning I woke up very early with a new idea for explaining what happens in detail. That woke me up ;))

I will show you 3 pictures that represent the whole processing, second by second, for each step.

For all the pictures, the x-axis is the time in seconds and the y-axis is the duration of the step for every frame processed. It happens that sometimes, for some reason, the debug output misses one or two steps during CPU processing.

By the way, I can send you more detailed information, but for 21 seconds of analysis it is 500 lines of data for 218 frames. Quite a lot.

First picture: I do the first step of kernels, that is the YUV conversion, some other kernels, and an enqueueMapBuffer to make the data available for CPU processing.


As you can see, the processing time is very stable over time.

Now, the second picture. This is the step where I do all the multicore processing: 5 times 4 pthreads, then, from 1 to 12 times on a small amount of data, 2 times 4 pthreads and 1 time 2 pthreads, plus some other single-threaded work to draw pixels for the output. I work with a struct of int (1024*1024). Could I use an array instead?

The processing time depends on the amount of data to be processed. But in all the pictures I process nearly the same amount of data: I look at the same picture with the camera at the same distance.


As you can see, the multicore processing is completely chaotic. And it is for this step that the debug output sometimes starts missing frames.

Here is an example of missing frames. Look at the "Nb Frame" value at second 32, so one second after the app starts.

2025-10-11 10:02:32.337 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 18 in 63 ms 
2025-10-11 10:02:32.340 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:32.363 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 25 ms 
2025-10-11 10:02:32.392 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 54 ms 
2025-10-11 10:02:32.395 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 19 in 57 ms 
2025-10-11 10:02:32.406 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:32.438 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 34 ms 
2025-10-11 10:02:32.633 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 53 ms 
2025-10-11 10:02:32.635 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 23 in 54 ms 
2025-10-11 10:02:32.641 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:32.666 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 28 ms 
2025-10-11 10:02:32.871 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 25 ms 
2025-10-11 10:02:33.010 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 51 ms 
2025-10-11 10:02:33.013 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 28 in 55 ms 
2025-10-11 10:02:33.016 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:33.037 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 23 ms 
2025-10-11 10:02:33.366 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 31 ms 
2025-10-11 10:02:33.796 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 58 ms 
2025-10-11 10:02:33.799 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 37 in 61 ms 
2025-10-11 10:02:33.802 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:33.831 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 31 ms 
2025-10-11 10:02:33.926 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 60 ms 
2025-10-11 10:02:33.930 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 39 in 64 ms 
2025-10-11 10:02:33.934 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 3 ms 
2025-10-11 10:02:33.963 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 33 ms 
2025-10-11 10:02:34.198 10482-10482/com.example.xiaomi E/JNIProcessor:  4 traitement enqueueNDRangeKernel finished in 2 ms 
2025-10-11 10:02:34.229 10482-10482/com.example.xiaomi E/JNIProcessor:  6 enqueueMapBuffer and all kernel start CPU multicore finished in 33 ms 
2025-10-11 10:02:34.254 10482-10482/com.example.xiaomi E/JNIProcessor:  10 traitement enqueueWriteBuffer and all CPU finished in 58 ms 
2025-10-11 10:02:34.256 10482-10482/com.example.xiaomi E/JNIProcessor:  12 enqueueReadBuffer and last kernel finished ready to end JNI before display Nb Frame 42 in 61 ms
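The gaps in the "Nb Frame" counter can also be counted automatically instead of by eye. The following is only a minimal sketch: the helper names `frame_number` and `count_skipped` are hypothetical, and it assumes every reported frame appears in a log line containing the text "Nb Frame " followed by the number, as in the excerpt above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: extracts the number that follows "Nb Frame " in one
// log line, or returns -1 when the line does not report a frame number.
int frame_number(const std::string& line) {
    const std::string key = "Nb Frame ";
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) return -1;
    try {
        // std::stoi stops at the first non-digit, so "18 in 63 ms" gives 18.
        return std::stoi(line.substr(pos + key.size()));
    } catch (const std::exception&) {
        return -1;
    }
}

// Hypothetical helper: counts how many frame numbers are missing between
// consecutive reported frames.
int count_skipped(const std::vector<int>& frames) {
    int skipped = 0;
    for (std::size_t i = 1; i < frames.size(); ++i) {
        if (frames[i] > frames[i - 1] + 1) {
            skipped += frames[i] - frames[i - 1] - 1;
        }
    }
    return skipped;
}
```

Fed with the frame numbers reported in the excerpt (18, 19, 23, 28, 37, 39, 42), `count_skipped` returns 18, meaning that 18 frame numbers in that range never reached the log.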

And the last picture, where I do 2 kernels and an enqueueReadBuffer before getting out of JNI for drawing.



As you can see, it is quite stable, between 2 and 5 ms. The zeros in every picture are just blank values in Excel.

Maybe it is the debug ;))

Here are the same 3 pictures, but this time I removed nearly all the debug, especially the output inside the functions called by pthread.

 



So I found the solution just because I wrote this post. At the end I asked myself: what if it was just the debug? And it was.

So, very sorry. I feel very stupid right now. A bit embarrassing ;))

Have a good day.
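Since the debug output inside the pthread functions had a measurable cost, one option is to keep the timings in memory and print them only after the run. This is a minimal sketch, not the application's actual code: `StageTimer` and `StageSample` are hypothetical names, and the class is not thread-safe, so it assumes one timer per thread.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// One measurement: which frame, which step, and how long it took.
struct StageSample {
    int frame;
    int stage;
    long long micros;
};

// Hypothetical helper: records timings into preallocated memory so that no
// logging happens while frames are being processed. Use one per thread.
class StageTimer {
public:
    explicit StageTimer(std::size_t max_samples) : max_samples_(max_samples) {
        samples_.reserve(max_samples);  // allocate once, before processing
    }

    // Runs `work`, measures it with a monotonic clock and stores the result.
    // Samples beyond the preallocated capacity are dropped.
    template <typename Work>
    void measure(int frame, int stage, Work&& work) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        const long long us = static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                .count());
        if (samples_.size() < max_samples_) {
            samples_.push_back(StageSample{frame, stage, us});
        }
    }

    // Prints everything at once; call this after the run, not per frame.
    void dump() const {
        for (const StageSample& s : samples_) {
            std::printf("frame %d stage %d: %lld us\n", s.frame, s.stage,
                        s.micros);
        }
    }

    const std::vector<StageSample>& samples() const { return samples_; }

private:
    std::size_t max_samples_;
    std::vector<StageSample> samples_;
};
```

A call such as `timer.measure(frame, 6, [&] { /* CPU multicore step */ });` would replace the per-step log line, with `timer.dump()` called once the capture is finished.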

0 hterrolle 9 months ago in reply to hterrolle

Hi,

I made an error in the previous post. The problem is not solved by removing all the debug; that just improves the performance. And the pictures are wrong because I forgot to change the x-axis values to the correct time. Both start at 30. I only saw it after posting. A little mistake.

Here are the correct pictures without debug. After more than 50 tests it is always the same: depending on the amount of data, after a few seconds there is a drop in speed.

The comments are in the pictures, like before.

My conclusion is that when step 1 (picture 1) slows down, the impact is a time increase in step 2 (picture 2).

And this is completely strange, because if step 1 goes faster, step 2 should go faster too. But it is the inverse?

I will look at step 1 in more detail in the next post.

0 Peter Harris 9 months ago in reply to hterrolle

I suspect the problem is that your algorithm switches between the CPU and the GPU without pipelining them, so both the CPU and GPU are going idle while the other processor is busy. The idle time on a processor often causes frequency scaling control logic to decide that the processor is clocked too high, and so clock frequency gets reduced.

How frequency control works is decided by the OEM, so not really something Arm can help with.
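To illustrate what pipelining means here, the sketch below runs two stages on separate threads with a small bounded queue between them, so that one stage can work on frame N+1 while the other is still busy with frame N. It is only a sketch under stated assumptions: `gpu_stage`, `cpu_stage`, `BoundedQueue` and `run_pipeline` are hypothetical names, and the two stage functions are trivial stand-ins for the real OpenCL and multicore work.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real work (not from the application):
// gpu_stage would enqueue the kernels and map the result buffer,
// cpu_stage would run the multicore processing on the mapped data.
static int gpu_stage(int frame) { return frame * 2; }
static int cpu_stage(int gpu_result) { return gpu_result + 1; }

// Bounded queue: the first stage can run at most `capacity` frames ahead.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void push(int value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(value);
        not_empty_.notify_one();
    }

    // Returns false once the queue has been closed and fully drained.
    bool pop(int& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        value = items_.front();
        items_.pop();
        not_full_.notify_one();
        return true;
    }

    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<int> items_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs both stages concurrently and returns the CPU results in frame order.
std::vector<int> run_pipeline(int frame_count) {
    BoundedQueue handoff(2);  // two slots, i.e. double buffering
    std::vector<int> results;

    std::thread gpu_thread([&] {
        for (int frame = 0; frame < frame_count; ++frame) {
            handoff.push(gpu_stage(frame));
        }
        handoff.close();  // tell the CPU stage that no more frames will come
    });

    std::thread cpu_thread([&] {
        int gpu_result = 0;
        while (handoff.pop(gpu_result)) {
            results.push_back(cpu_stage(gpu_result));
        }
    });

    gpu_thread.join();
    cpu_thread.join();
    return results;
}
```

The trade-off is one extra frame of latency and a second set of buffers, in exchange for keeping both processors busy. Whether that is enough to stop the frequency governor from clocking down depends on the device.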

0 hterrolle 9 months ago in reply to Peter Harris

Hi,

You are right. I just checked the CPU frequency while running the application, and of course after a few seconds the frequency drops by a factor of about 4, from 2000 to 400/600, then goes back to 2000 for a few ms, and then back again to the low frequency.

Thanks for the confirmation.

I suppose that is for energy and heat reasons.

By the way, does anyone know which Arm mobile devices allow full CPU speed? That would be good to know.
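The same check can be done from the native code by reading the Linux cpufreq files, which report the current frequency in kHz. This is a minimal sketch, assuming the device exposes `scaling_cur_freq` under `/sys/devices/system/cpu/` and that the app is allowed to read it, which is not guaranteed on every Android device; the function names are hypothetical.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Builds the standard Linux cpufreq sysfs path for one core.
std::string cpufreq_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_cur_freq";
}

// Converts the kHz text stored in the sysfs file to MHz.
// Returns -1 when the text is not a number.
long khz_text_to_mhz(const std::string& text) {
    try {
        return std::stol(text) / 1000;
    } catch (const std::exception&) {
        return -1;
    }
}

// Reads the current frequency of one core in MHz.
// Returns -1 when the file is missing or unreadable.
long read_cpu_mhz(int cpu) {
    std::ifstream file(cpufreq_path(cpu));
    std::string line;
    if (!std::getline(file, line)) return -1;
    return khz_text_to_mhz(line);
}
```

Calling `read_cpu_mhz` for each core once per frame, and storing the value next to the frame time, would show whether the slow frames line up with the low-frequency periods.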

0 hterrolle 9 months ago in reply to hterrolle

Hi,

Sorry, I am coming back to the discussion. If I understood your answer: some cores run the GPU driver, some other cores run the multicore work, and when both are finished another core manages the display and the camera, so the cores used for the GPU and CPU work drop to a low frequency because they are not in use at that moment. That is what you called the frequency scaling control logic.

And it is for that reason that every frame can have a different processing time, as I can see in the pictures.

Is it possible to do what you said, pipelining the CPU and GPU?

0 hterrolle 9 months ago in reply to hterrolle

Hi,

Using only the GPU does not run into trouble with the CPU frequency scaling control logic, because everything is done to improve GPU performance.

The main problem for my algorithm is that it uses massive CPU work, because the CPU frequency scaling control logic only applies to the CPU.

So in the end it is only a problem of heat and battery. Why not add a cooling system to the chip, especially if Arm wants to move from mobile to laptops or PCs?

And why increase the speed of the CPU if we can only use it at an average of 30%? I looked at the CPU usage with the app "3C CPU manager", and the CPU very rarely runs at its top frequency.

I do not understand why speed is always being improved if it cannot be used. It would be more useful to have 8 cores at 1800 MHz. On the MediaTek 9200+, the X3 never goes faster than 1400 MHz. Maybe the CPU is useful for Vulkan, but that is only for GPU work, not really for CPU work.

In the old days CPU work was the main purpose; now it is only the GPU. But the GPU does not work like the CPU, and I need both to be fast: not only to show nice images but to process data. Nobody will build a server with only GPUs; it would not be useful.

There is matrix work and row work. Both are necessary. Maybe not yet in phones, but it will come.

0 hterrolle 8 months ago in reply to hterrolle

In fact, mobile devices take care of energy and heat. At the beginning (70 frames), frequency scaling gives all the power to the CPU. But in my case that is a lot of work, so the algorithm will favour the GPU work over the CPU work. There is no frequency scaling for the GPU, or I am not aware of it. By the way, after 70 frames the GPU goes faster and the CPU slower, just to save energy and heat.

This is normal: smartphones are not made for massive computing (CPU) but for massive frame display (GPU). This is why mobiles do not have any cooling system. And Arm excels at using less energy, because there is no massive CPU work.

So the best approach is to try to reduce CPU work on mobile.

I am working on it. I will let you know.

But massive CPU work on mobile is something to avoid after 6 seconds of work. It does not matter how much data there is to process; the CPU just slows down.