
bad performance on 3.8 kernel

Note: This was originally posted on 18th June 2013 at http://forums.arm.com

Mali-400 on an Exynos-based board:

With the 3.0 kernel, EGL works fine, with up to 600fps in es2gears.

I ported the drivers to the 3.8 kernel and Mali acceleration works, but performance is only roughly 50% of what it was.

I have traced the issue to the GP job start wrapper, _mali_ukk_gp_start_job, which is now called about 50% more often than on the 3.0 kernel...

Here is a comparison between the 2 kernels:

1) With SKIP_GP_JOBS, returning the job straight away from _mali_ukk_gp_start_job, both the 3.0 and 3.8 kernels produce the same number of mali_ioctl calls and the same performance: 650fps in es2gears.
2) I modified es2gears to stop after 600 frames; here are my results (from bottom to top):

      GP jobs actually done - calls to "mali_gp_job_start": 299 on 3.0 kernel, 302 on 3.8 kernel
      calls to mali_group_start_gp_job (which calls mali_gp_job_start): 299 on 3.0, 302 on 3.8 kernel
      executions of mali_gp_scheduler_schedule (which calls mali_group_start_gp_job): 299 on 3.0, 302 on 3.8 kernel -- appears as "mali_gp_scheduler_schedule() {" in ftrace
      calls to mali_gp_scheduler_schedule: 0 on 3.0, 299 on 3.8 kernel -- appears as "mali_gp_scheduler_schedule();" in ftrace
     
      system calls served (mali_ioctl) : 960 on 3.0 kernel, 1373 on 3.8 kernel

results: ~600fps on 3.0 kernel, ~380fps on 3.8 kernel
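
The two ftrace forms quoted above ("name() {" versus "name();") can be counted automatically. The following is a minimal sketch, assuming the trace was captured with the function_graph tracer, where a call that has traced children is printed as "name() {" and a call without traced children as "name();". The sample trace text and its timings are made up for illustration and are not taken from the real traces.

```python
import re

def count_calls(trace_text, func):
    """Count leaf ("func();") and nested ("func() {") entries in function_graph output."""
    leaf = re.compile(r"\b" + re.escape(func) + r"\(\);")
    nested = re.compile(r"\b" + re.escape(func) + r"\(\)\s*\{")
    counts = {"leaf": 0, "nested": 0}
    for line in trace_text.splitlines():
        if nested.search(line):
            counts["nested"] += 1   # call that went on to call other traced functions
        elif leaf.search(line):
            counts["leaf"] += 1     # call that returned without calling anything traced
    return counts

# Hypothetical sample in function_graph layout (timings are illustrative only).
sample = """\
 0)   0.750 us    |  mali_gp_scheduler_schedule();
 0)               |  mali_gp_scheduler_schedule() {
 0)   1.250 us    |    mali_group_start_gp_job();
 0)   4.500 us    |  }
"""

print(count_calls(sample, "mali_gp_scheduler_schedule"))
# {'leaf': 1, 'nested': 1}
```

Running this over the full trace from each kernel should reproduce the "0 versus 299" leaf-call difference if the counts above were obtained the same way.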

So the conclusion is that the slowdown is due to a much larger number (almost double) of mali_ioctls for MALI_IOC_GP2_START_JOB.
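
As a quick cross-check of the figures above, this sketch works out the ioctl calls per frame and the ratios between the two kernels. It uses only the numbers quoted in this post. Note that the mali_ioctl totals cover every ioctl type, so the share belonging to MALI_IOC_GP2_START_JOB alone cannot be derived from them.

```python
FRAMES = 600  # es2gears was modified to stop after 600 frames

ioctls = {"3.0": 960, "3.8": 1373}  # mali_ioctl calls served (all ioctl types)
fps = {"3.0": 600, "3.8": 380}      # approximate es2gears results

def per_frame(total, frames=FRAMES):
    """Average number of calls per rendered frame."""
    return total / frames

ioctl_ratio = ioctls["3.8"] / ioctls["3.0"]
fps_ratio = fps["3.8"] / fps["3.0"]

for kernel in ("3.0", "3.8"):
    print(f"{kernel}: {per_frame(ioctls[kernel]):.2f} mali_ioctl calls per frame")
print(f"ioctl ratio 3.8/3.0: {ioctl_ratio:.2f}")
print(f"fps ratio 3.8/3.0:   {fps_ratio:.2f}")
# 3.0: 1.60 mali_ioctl calls per frame
# 3.8: 2.29 mali_ioctl calls per frame
# ioctl ratio 3.8/3.0: 1.43
# fps ratio 3.8/3.0:   0.63
```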

Since I don't have the code for libMali to debug why exactly it's making so many syscalls, I hope somebody here can help me and give me an idea where to look.

A strange thing is the job numbers assigned.
In the 3.0 kernel, they always increase in steps of 4, like: Mali GP scheduler: Job 2405 (0xE6581B80) queued; 2409, 2413, 2417, 2421, 2425, ...
In the 3.8 kernel, they increment either by 2, 4 or 6: 8825, 8829, 8833, 8835, 8841, 8843, 8849, 8853, ...
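
The step sizes between consecutive job numbers can be computed directly from the two sequences quoted above. This is only a small sketch over the handful of values listed here, not over the full logs.

```python
def deltas(ids):
    """Differences between consecutive job numbers."""
    return [b - a for a, b in zip(ids, ids[1:])]

jobs_30 = [2405, 2409, 2413, 2417, 2421, 2425]              # 3.0 kernel
jobs_38 = [8825, 8829, 8833, 8835, 8841, 8843, 8849, 8853]  # 3.8 kernel

print(deltas(jobs_30))  # [4, 4, 4, 4, 4]
print(deltas(jobs_38))  # [4, 4, 2, 6, 2, 6, 4]
```

In this short 3.8 sample every step of 2 is followed by a step of 6, so each such pair still adds up to 8. Whether that holds across the whole log would need checking against the full trace.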
  • Note: This was originally posted on 29th June 2013 at http://forums.arm.com

    Some more details about my traces are posted here: http://forum.odroid.com/viewtopic.php?f=55&t=305&p=11748#p11748

    On the 3.0 kernel, most ioctl calls for GP jobs read 2 sets of frame registers, create and execute 2 jobs, and return after the second job. On the 3.8 kernel, only 1 job is processed per ioctl call, i.e. only 1 set of frame registers is read. The frame register sets are used alternately: ioctl GP job from frame register set 1, ioctl GP job from frame register set 2, ioctl set 1, ioctl set 2. Moreover, jobs from set 2 are not scheduled immediately: the scheduler exits because the slot is in use. Presumably the job is scheduled once the previous job has finished.

    I guess the question is: how is it that in 3.0 the ioctl reads both sets of frame registers and creates 2 jobs, while in 3.8 only 1 job is created per ioctl? Since the Mali drivers are the same, I can only think that the platform is somehow initialized differently, or maybe there is something wrong with the allocated UMP memory?

    Any ideas are welcome.
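
To make the "slot is in use" behaviour described above concrete, here is a toy model of a scheduler with a single GP slot. It is purely illustrative: the class, method names and job labels are hypothetical and are not taken from the Mali driver source. It only mirrors the observed pattern, where a job submitted while the slot is busy stays queued and is started when the previous job completes.

```python
from collections import deque

class SingleSlotScheduler:
    """Toy model (not Mali driver code): one GP slot plus a FIFO queue."""

    def __init__(self):
        self.queue = deque()
        self.running = None
        self.log = []

    def submit(self, job):
        """Queue a job, then try to schedule, as a job-start ioctl would."""
        self.queue.append(job)
        self.schedule()

    def schedule(self):
        if self.running is not None:
            # Slot in use: exit without starting anything.
            self.log.append("slot busy, scheduler exits")
            return
        if self.queue:
            self.running = self.queue.popleft()
            self.log.append(f"start {self.running}")

    def job_done(self):
        """Completion of the running job frees the slot and reschedules."""
        self.log.append(f"done {self.running}")
        self.running = None
        self.schedule()

s = SingleSlotScheduler()
s.submit("set1-job")  # slot free, starts at once
s.submit("set2-job")  # slot busy, stays queued
s.job_done()          # set1 finishes, set2 starts
s.job_done()          # set2 finishes, queue is empty
print("\n".join(s.log))
```

In this model the second job only starts from the completion path, which matches the guess that set 2 jobs are scheduled once the previous job has finished.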