
bad performance on 3.8 kernel

Note: This was originally posted on 18th June 2013 at http://forums.arm.com

Mali 400 on an exynos-based board:

With the 3.0 kernel, EGL works fine, with up to 600fps in es2gears.

I ported the drivers to the 3.8 kernel and Mali acceleration works; however, performance is only roughly 50% of what it was.

I have traced the issue to the GP job start wrapper, _mali_ukk_gp_start_job, which is now called 50% more often than on the 3.0 kernel...

Here is a comparison between the 2 kernels:

1) with SKIP_GP_JOBS and returning the job straight away from _mali_ukk_gp_start_job, both the 3.0 and 3.8 kernels result in the same number of mali_ioctl calls and the same performance: 650fps in es2gears
2) I modified es2gears to stop after 600 frames; here are my results (from bottom to top):

      GP jobs actually done - calls to "mali_gp_job_start": 299 on 3.0 kernel, 302 on 3.8 kernel
      calls to mali_group_start_gp_job (which calls mali_gp_job_start): 299 on 3.0, 302 on 3.8 kernel
      executions of mali_gp_scheduler_schedule (which calls mali_group_start_gp_job): 299 on 3.0, 302 on 3.8 kernel -- appears as "mali_gp_scheduler_schedule() {" in ftrace
      calls to mali_gp_scheduler_schedule: 0 on 3.0, 299 on 3.8 kernel -- appears as "mali_gp_scheduler_schedule();" in ftrace
     
      system calls served (mali_ioctl) : 960 on 3.0 kernel, 1373 on 3.8 kernel
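The two ftrace forms mentioned above can be counted mechanically. The following is a minimal sketch, assuming the trace was captured with the function_graph tracer and saved as text. The helper name `count_call_forms` and the sample excerpt are made up for illustration; the excerpt is not taken from the real trace.

```python
import re

def count_call_forms(trace_text, func):
    """Count how often `func` appears as a nested call vs. a leaf call."""
    nested = re.compile(r"\b" + re.escape(func) + r"\(\)\s*\{")  # "func() {"
    leaf = re.compile(r"\b" + re.escape(func) + r"\(\);")        # "func();"
    counts = {"nested": 0, "leaf": 0}
    for line in trace_text.splitlines():
        if nested.search(line):
            counts["nested"] += 1
        elif leaf.search(line):
            counts["leaf"] += 1
    return counts

# Hypothetical excerpt, only to exercise the function.
sample = """\
 0)               |  mali_gp_scheduler_schedule() {
 0)   1.250 us    |    mali_group_start_gp_job();
 0)   3.000 us    |  }
 0)   0.500 us    |  mali_gp_scheduler_schedule();
"""

print(count_call_forms(sample, "mali_gp_scheduler_schedule"))
# {'nested': 1, 'leaf': 1}
```

In function_graph output, a `name();` line is a call that entered no other traced function, which is consistent with the scheduler returning early in those calls.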

results: ~600fps on 3.0 kernel, ~380fps on 3.8 kernel

So the conclusion is that the slowdown is due to a much larger number (almost double) of mali_ioctls for MALI_IOC_GP2_START_JOB.
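As a quick cross-check of the figures above, the ratios can be worked out directly. This sketch only uses the numbers quoted in this post (the fps values are the approximate ones reported); the variable names are made up for illustration.

```python
# Figures copied from the 600-frame es2gears runs above.
ioctls = {"3.0": 960, "3.8": 1373}   # mali_ioctl calls served
fps = {"3.0": 600, "3.8": 380}       # approximate frame rates
frames = 600

extra_ioctls = ioctls["3.8"] - ioctls["3.0"]
ioctl_ratio = ioctls["3.8"] / ioctls["3.0"]
fps_ratio = fps["3.8"] / fps["3.0"]

print(f"extra mali_ioctl calls on 3.8: {extra_ioctls}")   # 413
print(f"mali_ioctl ratio 3.8/3.0: {ioctl_ratio:.2f}")     # 1.43
print(f"fps ratio 3.8/3.0: {fps_ratio:.2f}")              # 0.63
for kernel, n in ioctls.items():
    print(f"{kernel}: {n / frames:.2f} mali_ioctl calls per frame")
```

The total mali_ioctl count is about 43% higher on 3.8. The near-doubling refers to MALI_IOC_GP2_START_JOB alone, for which no separate count is given here.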

Since I don't have the code for libMali to debug why exactly it's making so many syscalls, I hope somebody here can help me and give me an idea where to look.

A strange thing is the job numbers assigned.
In the 3.0 kernel, they always increase by 4, like: Mali GP scheduler: Job 2405 (0xE6581B80) queued; 2409, 2413, 2417, 2421, 2425, ...
In the 3.8 kernel, they increment either by 2, 4 or 6: 8825, 8829, 8833, 8835, 8841, 8843, 8849, 8853, ...
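The step sizes can be read off the job numbers listed above. A minimal sketch using only those numbers (the helper name `deltas` is made up for illustration):

```python
def deltas(ids):
    """Differences between consecutive job numbers."""
    return [b - a for a, b in zip(ids, ids[1:])]

jobs_3_0 = [2405, 2409, 2413, 2417, 2421, 2425]
jobs_3_8 = [8825, 8829, 8833, 8835, 8841, 8843, 8849, 8853]

print(deltas(jobs_3_0))  # [4, 4, 4, 4, 4]
print(deltas(jobs_3_8))  # [4, 4, 2, 6, 2, 6, 4]
```

On 3.8 the steps of 2 and 6 come in pairs that add up to 8, i.e. two regular steps of 4 with the middle job number shifted by 2. What consumes the numbers in between is not visible from these logs.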
  • Note: This was originally posted on 24th June 2013 at http://forums.arm.com

    Hello memeka,
    Could you give more details on the device you are testing on?
    Do you have a reproducer app that we can begin testing on?
    This will help us identify what the problem is, what the solution may be, and any workarounds that may be of use.

    Thanks in advance,

    McGeagh
  • Note: This was originally posted on 25th June 2013 at http://forums.arm.com

    It's an Odroid U2 (Exynos 4412).
    The app I used was es2gears - the only modification I made was to exit after 600 frames instead of running forever...
    I wonder if the errors are related to the UMP module... but I am not sure...

    Thanks.


  • Note: This was originally posted on 29th June 2013 at http://forums.arm.com

    Some more details about my traces are posted here: http://forum.odroid.com/viewtopic.php?f=55&t=305&p=11748#p11748

    On the 3.0 kernel, in most ioctl calls for GP jobs, 2 sets of frame registers are read, 2 jobs are created and executed, and the ioctl ends after the second job. On the 3.8 kernel, only 1 job is processed per ioctl call, i.e. only 1 set of frame registers is read. The frame registers are used alternately: ioctl GP job from frame registers set 1, ioctl GP job from frame registers set 2, ioctl set 1, ioctl set 2. Moreover, jobs from set 2 of the frame registers end up not being scheduled immediately, with the scheduler exiting because the slot is in use. Probably, the job is scheduled once the previous job has finished.
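The behaviour described above can be mimicked with a toy model. This is a sketch under stated assumptions, not the driver's scheduler: the `GpSlot` class, the two `run_*` helpers and the figure of 150 job pairs are all made up for illustration. The model simply assumes that on 3.8 the ioctl for set 2 arrives while the job from set 1 is still running.

```python
from collections import deque

class GpSlot:
    """Toy stand-in for the single GP slot; not the real driver logic."""
    def __init__(self):
        self.running = None
        self.queue = deque()
        self.busy_hits = 0

    def submit(self, job):
        if self.running is None:
            self.running = job       # slot free: start at once
            return True
        self.queue.append(job)       # slot in use: only queue the job
        self.busy_hits += 1
        return False

    def finish(self):
        """Complete the running job and start the next queued one, if any."""
        done = self.running
        self.running = self.queue.popleft() if self.queue else None
        return done

def run_3_0_style(pairs):
    """One ioctl handles both register sets, one job after the other."""
    slot, ioctls = GpSlot(), 0
    for n in range(pairs):
        ioctls += 1
        for reg_set in (1, 2):
            slot.submit((n, reg_set))
            slot.finish()
    return ioctls, slot.busy_hits

def run_3_8_style(pairs):
    """One ioctl per register set; the second arrives while the first runs."""
    slot, ioctls = GpSlot(), 0
    for n in range(pairs):
        for reg_set in (1, 2):
            ioctls += 1
            slot.submit((n, reg_set))
        slot.finish()   # set 1 job done, queued set 2 job starts
        slot.finish()   # set 2 job done
    return ioctls, slot.busy_hits

print(run_3_0_style(150))  # (150, 0)
print(run_3_8_style(150))  # (300, 150)
```

For the same 300 jobs, the second pattern needs twice as many ioctls, and every second one only queues a job.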

    I guess the question is: how is it that in 3.0 the ioctl would read both sets of frame registers and create 2 jobs, while in 3.8 only 1 job is created per ioctl? Since the Mali drivers are the same, I can only think that the platform is somehow initialized differently, or maybe there is something wrong with the UMP memory allocated?

    Any ideas are welcome.
  • Note: This was originally posted on 4th July 2013 at http://forums.arm.com




    Hi memeka,

    Could you give some more details on how you upgraded from kernel 3.0 to 3.8? Are you using the Ubuntu system image found here? What steps did you take to build the Odroid 3.8 kernel, which I presume you downloaded from the odroid-3.8.y branch of the hardkernel linux repo? What steps did you then take to try and integrate Mali?

    I believe the issue you are describing is caused by the integration of UMP with the kernel.

    Rich
  • Note: This was originally posted on 5th July 2013 at http://forums.arm.com



    Thanks for the reply.
    I have tried both a Debian Wheezy image (http://forum.odroid.com/viewtopic.php?f=9&t=1608) with LXDE and an Ubuntu 13.04 image with XFCE (not Linaro).
    The kernel is compiled straight from the hardkernel repository (https://github.com/hardkernel/linux/tree/odroid-3.8.y). The drivers are in drivers/gpu/arm/ (https://github.com/hardkernel/linux/tree/odroid-3.8.y/drivers/gpu/arm) and have been integrated by the maintainer; they work, but with bad performance. I think the framebuffer driver is at https://github.com/hardkernel/linux/blob/odroid-3.8.y/drivers/media/v4l2-core/videobuf2-fb.c

    I have been looking for the cause of the performance drop and debugged the drivers with ftrace, and found the issue described above: there is one GP job / ioctl started, whereas in the 3.0 kernel there are 2 GP jobs/ioctl started.
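Counting GP jobs per ioctl from a saved trace can be automated. The following is a minimal sketch, assuming single-CPU function_graph output in which each `mali_ioctl` call appears as a `mali_ioctl() {` ... `}` block. The helper name `gp_jobs_per_ioctl` and the sample excerpt are hypothetical and not copied from the real trace.

```python
def gp_jobs_per_ioctl(trace_text, outer="mali_ioctl", inner="mali_gp_job_start"):
    """Return the number of `inner` calls inside each `outer() { ... }` block."""
    results = []
    depth = 0        # brace depth inside the current outer call
    current = None   # running count, or None while outside `outer`
    for line in trace_text.splitlines():
        body = line.split("|", 1)[-1].strip()
        if current is None:
            if body.startswith(outer + "() {"):
                current, depth = 0, 1
            continue
        if body.startswith(inner + "()"):
            current += 1
        if body.endswith("{"):
            depth += 1
        elif body.startswith("}"):
            depth -= 1
            if depth == 0:           # the outer call has returned
                results.append(current)
                current = None
    return results

# Hypothetical excerpt: one ioctl starting two GP jobs, then one starting one.
sample = """\
 0)               |  mali_ioctl() {
 0)               |    _mali_ukk_gp_start_job() {
 0)   2.000 us    |      mali_gp_job_start();
 0)   5.000 us    |    }
 0)               |    _mali_ukk_gp_start_job() {
 0)   2.000 us    |      mali_gp_job_start();
 0)   5.000 us    |    }
 0)   9.000 us    |  }
 0)               |  mali_ioctl() {
 0)               |    _mali_ukk_gp_start_job() {
 0)   2.000 us    |      mali_gp_job_start();
 0)   5.000 us    |    }
 0)   7.000 us    |  }
"""

print(gp_jobs_per_ioctl(sample))  # [2, 1]
```

Traces that interleave several CPUs would need to be split per CPU first, since the brace matching here assumes one call stack.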

    My understanding is that in 3.0 you have:
    ioctl -> GP start job from frame register 1 -> schedule job -> submit job -> (...libMali.so binary blob...) -> send job to user -> GP start job frame register 2 -> schedule job -> submit job -> (...libMali.so binary blob...) -> send job to user -> end ioctl -> new ioctl -> repeat

    while in 3.8 the behaviour is:
    ioctl -> GP start job from frame register 1 -> schedule job -> submit job -> (...libMali.so binary blob...) -> send job to user -> end ioctl -> new ioctl -> repeat
    ...... (in the meantime) ioctl -> GP start job from frame register 2 -> schedule job -> slot busy -> end ioctl -> new ioctl -> repeat

    This gives 2x the ioctls and 2x the locks, and scheduling for frame register 2 always results in slot busy, so an ioctl is wasted just on putting a job in the queue. The Mali code is exactly the same as before, so that's not the issue. The maintainer also thinks the UMP integration is at fault, but can't find a root cause. I was hoping somebody here would have a better idea of what exactly is causing this.

    As a side note, es2gears gives ~300fps on Ubuntu+XFCE (even worse with the compositor enabled) and ~600 on Debian+LXDE. I did not look at the xorg server version, just went on with the Debian image :)

  • Note: This was originally posted on 15th July 2013 at http://forums.arm.com






    Hi Memeka,
    Apologies for the delayed response.


    Could you run strings on mali.ko to find the API level or, failing that, get the revision number from libmali.so? The reduced performance could be caused by using kernel driver modules that are known to be incompatible with this revision of the Linux kernel.
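Where the `strings` tool is not at hand, the same kind of scan can be approximated in a few lines. This is a sketch only: the helper name `find_strings`, the search markers and the synthetic byte string are assumptions for illustration. A real check would read the binary instead, for example with `open(path, "rb").read()`.

```python
import re

def find_strings(data, needle, min_len=4):
    """Return printable ASCII runs in `data` (bytes) that contain `needle`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii")
            for m in pattern.finditer(data)
            if needle in m.group()]

# Synthetic stand-in for file contents, not a real binary.
blob = b"\x00\x01API_VERSION=19\x00\x7fELF\x00Linux-r3p2-01rel0\x00"

print(find_strings(blob, b"API_VERSION"))  # ['API_VERSION=19']
print(find_strings(blob, b"r3p2"))         # ['Linux-r3p2-01rel0']
```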

    Kind Regards,
    Rich
  • Note: This was originally posted on 20th July 2013 at http://forums.arm.com




    Hi Rich,

    I wasn't able to find mali.ko, but I did see API_VERSION=19 in __malidrv_build_info.c.
    In libMali.so the revision is "Linux-r3p2-01rel0".  I'm seeing the same behaviour as Memeka on 3.8, so I'm guessing we probably have the same drivers.

    Regards,

    Steve
  • Note: This was originally posted on 5th August 2013 at http://forums.arm.com

    Hi Steve,

    I have looked up that version of the driver, and API version 19 does match the userspace libraries. Unfortunately, the driver is currently only officially supported on Linux kernel versions 3.0.15 to 3.5.4 inclusive.
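The supported range quoted above can be checked with a plain tuple comparison. A minimal sketch, with hypothetical helper names, that handles only plain dotted versions (a release string with a suffix such as a local tag would need trimming first):

```python
def parse_version(text):
    """Turn '3.8' or '3.0.15' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))

SUPPORTED_MIN = parse_version("3.0.15")
SUPPORTED_MAX = parse_version("3.5.4")

def is_supported(kernel):
    """True if `kernel` lies in the inclusive supported range."""
    return SUPPORTED_MIN <= parse_version(kernel) <= SUPPORTED_MAX

for kernel in ("3.0.15", "3.4.0", "3.5.4", "3.8"):
    print(kernel, is_supported(kernel))
```

Any 3.8 release falls above the upper bound. The exact patch levels of the two kernels compared in this thread are not stated.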

    My best suggestion would be to contact Hardkernel and request support with the integration.

    Kind Regards,
    Rich

