Mali-Txxx performance

Hi,

I have been using both the X11 and fbdev drivers (r4p0) on an Odroid-XU3 with a Mali-T628, with the kernel driver integrated by the vendor (Hardkernel) and the binary blob from the armdeveloper website.

The performance was really bad, especially on fbdev, where framerates were less than half of those on X11. On X11, framerates were worse than on an Odroid-U3 with a Mali-400.

Now, with the release of the r5p0 drivers, early tests unfortunately show a similar situation.

On r4p0 Mali-T628 X11, I get just under 200 fps in es2_gears and a glmark2 score of ~55. I've seen the Mali-T764 results on the RK3288 board at http://pastebin.com/Qzrh51Yv, and they are similarly bad.

Is this something you are aware of?

On Mali-400, I get ~260 fps in es2_gears, and a glmark2 score of ~60. fbdev and x11 performance is similar.

But I would even be happy with these numbers in fbdev. es2_gears does not work on fbdev, but some of the apps I've been trying showed exceptionally poor framerates (15 in fbdev vs. 60 in X11) and were unusable.

On the forums, I remember seeing a similar post with a benchmark on the Arndale Octa or a Chromebook (can't remember exactly, T6xx in any case), where framerates were also lower with the fbdev drivers for the same app.

Is there any comment on the performance of fbdev vs. x11? Or the lack of 3D performance seen in the Txxx from various vendors?

Thanks.

  • That's interesting; my initial results on the RK3288 (Firefly) with the r5p0 fbdev drivers show similarly low numbers. I'm seeing 52 fps running the Mali SDK triangle sample (opengl_20). We get better figures from an older Vivante GC2000 with its fbdev drivers.

  • > Is there any comment on the performance of fbdev vs. x11? Or the lack of 3D performance seen in the Txxx from various vendors?


    The fbdev driver we provide is designed for hardware verification and new platform bring-up - it's generally designed for simplicity and reliability rather than performance - so it really isn't optimized. My guess is that something in the reference integration is serializing frames so the GPU pipeline drains - I can raise this with our BSP team, but the general assumption is that most people will use a real windowing system (X11, Android SurfaceFlinger, etc).
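
    The effect of serializing frames can be sketched with a simple throughput model. The stage times below are hypothetical, not measurements from any board: when CPU-side work for the next frame overlaps GPU rendering of the current one, the frame rate is bounded by the slower stage; when each frame must finish before the next starts, the stage times add up.

```python
def pipelined_fps(cpu_ms: float, gpu_ms: float) -> float:
    """Frame rate when CPU work for frame N+1 overlaps GPU work for frame N."""
    return 1000.0 / max(cpu_ms, gpu_ms)


def serialized_fps(cpu_ms: float, gpu_ms: float) -> float:
    """Frame rate when each frame fully completes before the next one starts."""
    return 1000.0 / (cpu_ms + gpu_ms)


if __name__ == "__main__":
    cpu_ms, gpu_ms = 8.0, 8.0  # hypothetical per-frame stage times in milliseconds
    print(f"pipelined:  {pipelined_fps(cpu_ms, gpu_ms):.1f} fps")
    print(f"serialized: {serialized_fps(cpu_ms, gpu_ms):.1f} fps")
```

    With equal stage times, the serialized case runs at half the pipelined rate, which is roughly the size of gap a drained pipeline could explain under these assumptions.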

    I'm not entirely sure what the X11 integration with the host platform looks like on these boards - it generally gets customized by most vendors shipping commercial products.

    Cheers,
    Pete

  • Hi Peter, thanks for the reply...

    "something in the reference integration" - can you explain this a bit please? My impression was that vendors, as long as they are using the reference, only have to customize the open-source kernel part, which deals with bringing up the platform, DVFS, etc. So theoretically, the binary blob should have similar performance across boards - as long as they use the reference Mali implementation. And in my experience on the XU3, using the blob compiled by HardKernel and the blob provided by you resulted in similar performance (1 fps difference in my tests), and now HK just points to your blob.

    "the general assumption is that most people will use a real windowing system" - well... there are a few of us who would like fbdev to work well... especially given the state of the windowing systems on Linux. Currently, there is no support for Wayland in your drivers (I saw binary blobs for Mali-400 with Wayland support in Tizen, but I think they were compiled for soft float and did not work at all on a Linux hardfloat image; and I saw the Collabora demo with Wayland on a Chromebook with Mali-T628 - but they probably have a time machine to go forward in time and get blobs from the future). Also, on X11... you have the Mali DDX with UMP support on one hand, but then you promote the armsoc DDX with no UMP - like on Chromebox and my board. But then, armsoc is a mess, and I don't see any support from you guys on that front. So we are left with drivers that are slower than a sick snail on fbdev, no Wayland, and a confusing X11, where either everything is slowed down by the X11 drivers ***, or there is some community work on armsoc (which I don't even know if you are aware of), but currently anything touching composition will crash X11 either immediately or eventually.


    In conclusion, even with some vendor support - either there are issues when there is no windowing system (and I am still confused if it's from your reference drivers, or the vendor, but I suspect the first); or when there is a windowing system, you don't look at integrating your driver with it. Well, at least on Linux, I can't speak about Android, but things seem to work there.

    *** Using the armsoc DDX from the repo the Chromebook uses, everything is very slow, or fullscreen EGL apps crash, and compositors don't work. There are some patches that make armsoc usable in terms of speed (for example http://pastebin.com/PxTpPi4J), but there are issues with 32-bit visuals for RGB888, which result in e.g. a black screen when exiting fullscreen EGL apps.

    PS: I would love to try Wayland on Mali, but until then, I try to stay away from X11 because of the above issues, and would love it if the fbdev drivers had decent performance.

  • I'm surprised that the fbdev drivers are positioned as a hardware verification tool, because I couldn't find this explicitly mentioned in the docs. What is more concerning is that I would expect the drivers to showcase the capabilities of the GPU, not the other way round. For example, I have customers who need to see a prototype of their application running on their chosen SoC to get a feel for performance and to determine power/cooling requirements. It's very difficult to tell the end customer that we can't give them performance metrics.

    Regarding the fbdev drivers, in our experience they are useful because:

    1. Qt supports fbdev and is still used for embedded application development.

    2. X11 is a heavyweight stack and doesn't perform particularly well on ARM (for numerous reasons). In some circumstances we rewrite the application to use fbdev to lower CPU usage, with the added benefit of less heat/power.

    3. Some of your competitors provide fbdev support, which is useful when customers are comparing features of the SoC.

  • Thanks for the explanation. I think it's the same on the Odroid-C1. I will try this right away.

  • > "something in the reference integration" - can you explain this a bit please?

    You basically seem to have covered what I was thinking - there are lots of options around the management of surface memory and how that is moved around the system, and how that interacts with the dmabuf kernel APIs for CPU-GPU memory coherency. It's one area where Linux is improving, but it is still generally "a bit of a mess", and it is very possible to end up with a set of components which are functional, but doing more work than needed around mapping or cache flushing of buffers. I've never used the board in question, so I'm not entirely sure how the buffer integration has been done, but this is an area where I know our direct licensees have had issues.

    > I try to stay away from X11 because of the above issues and would love if the fbdev drivers would have decent performance.

    Yes, I agree. There is no real reason for the fbdev drivers to be as slow as they are; I'll definitely follow up with our BSP team to get this sorted.

  • Hi Jasbir,

    Yes, all good points - I see the need, I'm just explaining what to expect out of the fbdev drivers we currently ship on malideveloper.com.

    That said, to be honest I'm really surprised that the fbdev drivers are as slow as they are - so I will definitely be raising this with our BSP team.

    Kind regards,
    Pete

  • Yes peterharris, I noticed that too. On the Odroid-C1 I used Mali-X11 instead of Mali-fbdev; it works fine, and better than Mali-fbdev.

  • To tell you the truth, I found the performance on the C1 (Mali-450) in fbdev acceptable. I don't have numbers to compare it to X11, but I think they should be close.

    I found the performance difference only on T6xx/T7xx.

    PS: also, the fbdev drivers for 400/450 come from Hardkernel, the fbdev for T628 comes from malideveloper.

  • For sure, the difference is not huge. But the official image provided by Hardkernel is overloaded. The difference appears when we use a minimal image.

    The supplied image is clearly meant for development on the board. There is too much cleanup to do; it is better to start from a minimal version.

  • Hi Pete,

    Did you manage to get a response from the BSP team?

    Jasbir

  • We've run some benchmarks to compare r4p1 and r5p0 drivers for Mali-T60x, T62x and T76x and found an increased performance in all cases, so the results reported here are surprising.  On Firefly with r5p0, we've seen the triangle SDK sample running at over 100fps in fbdev mode.  Could you please describe which system you have running on Firefly that shows only 52 fps?

    Then regarding fbdev vs X11 performance, I believe the main problem here is that there is typically no zero-copy support available for fbdev so the user-side GPU driver has to keep calling memcpy to copy the contents of the GPU output buffer into the display framebuffer.  Some kernels and framebuffer drivers have support for DMA-BUF but it's not that common.  X11 typically uses the Direct Rendering Manager (DRM) which has better standard support for zero-copy (i.e. the GPU writes directly into the display framebuffer).  This means that an off-screen benchmark should give the same results with fbdev and X11, but the on-screen performance will hit the bottleneck of memcpy on fbdev.  It's worth noting that X11 is fairly heavy so it will also degrade the fps score compared to off-screen.  Whenever available, fbdev with zero-copy is typically the fastest solution.
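
    The cost of that per-frame memcpy can be estimated with simple arithmetic. The following is a minimal sketch under assumed numbers (a 1920x1080 display at 32 bits per pixel, which is not stated anywhere in this thread); it counts both the read of the GPU output buffer and the write into the display framebuffer.

```python
def copy_traffic_mb_s(width: int, height: int, bytes_per_pixel: int, fps: float) -> float:
    """Memory traffic in MB/s caused by one full-frame copy per displayed frame.

    Each copy reads the source buffer once and writes the destination once,
    hence the factor of 2.
    """
    frame_bytes = width * height * bytes_per_pixel
    return 2 * frame_bytes * fps / 1e6


if __name__ == "__main__":
    # Assumed display mode: 1080p, 4 bytes per pixel, 60 frames per second.
    traffic = copy_traffic_mb_s(1920, 1080, 4, 60)
    print(f"copy traffic: {traffic:.1f} MB/s")
```

    Under these assumptions the copy alone moves roughly 1 GB/s through memory, on top of whatever the GPU and application already use, which is why a zero-copy path matters on bandwidth-limited boards.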

    Regarding Pete's earlier comment, for pure GPU driver validation purposes, the display integration is a non-issue.

    We're looking into how to fix fbdev, but it will always depend on the platform as on each platform a different framebuffer driver will need to implement a DMA-BUF exporter mechanism.

    Back to the original question, could you please describe your Firefly set-up (OS, kernel version...) and maybe try to run some off-screen benchmarks?

    Best wishes,

    Guillaume

  • Hi Guillaume,

    Many thanks for the response. I'm currently testing against the Chromium (3.14) kernel on the Firefly. Given you've seen a higher fps rate, I suspect the issue may lie with the KMS driver implementation, because I'm also seeing high CPU usage. The 3.10 kernel uses CONFIG_FB_ROCKCHIP, which is a simple framebuffer driver. I'll see if I can patch the 3.10 kernel and give that a go.

    I'm surprised that DMA-BUF isn't used, but as you point out, it's probably going to be SoC-specific. I would assume the Android Mali drivers are quite similar to fbdev; do they use DMA?

    thanks

    Jasbir

  • Hi Jasbir,

    Thanks, so we're at least using different kernels.  We've run our benchmarks using this Firefly 3.10 kernel branch:

    https://bitbucket.org/T-Firefly/firefly-rk3288-kernel.git

    I don't think either of them is using DMA-BUF in fbdev, but the 3.14 Chromium kernel should definitely have it enabled in X11 as that's what Chrome OS uses, or at least has used until now. It would still be good to know what really makes this fps difference in fbdev. We might run our benchmarks again with 3.14 at some point, but please also let us know if you try 3.10 on your side and get different results. Also, as fbdev uses memcpy, which takes a fair amount of CPU time and memory bandwidth, if the CPU is busy doing anything else at the same time then this may indirectly impact the graphics performance.

    Android kernels provide the ION framework which essentially does the same thing as DMA-BUF to share a buffer between the display and GPU drivers.  In principle, any modern kernel used in a production Android device will have ION enabled.  Some may use DMA-BUF, but none of them would want to use software memcpy...  The Mali user-side drivers are built for a specific windowing system, and the difference is mainly about how to set up the zero-copy by sharing the display buffer with the GPU driver.  For example, with DMA-BUF this is typically achieved by passing a file descriptor from the display driver to the GPU driver via user-space.
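
    The idea of sharing a buffer by handing over a file descriptor can be illustrated with a user-space analogy. This sketch is not the DMA-BUF or ION API: it uses an ordinary temporary file as the backing store and `os.dup` as a stand-in for passing the descriptor to a second driver, and the buffer size and pixel value are made up. It only shows that two mappings of the same descriptor see the same bytes without any copy.

```python
import mmap
import os
import tempfile

FRAME_BYTES = 4096  # hypothetical tiny "framebuffer", one page in size


def share_by_fd() -> bytes:
    """Write through one mapping and read the result through another mapping."""
    with tempfile.TemporaryFile() as backing:
        # Give the backing file its full size before mapping it.
        backing.write(b"\x00" * FRAME_BYTES)
        backing.flush()

        fd = backing.fileno()
        peer_fd = os.dup(fd)  # stands in for handing the descriptor to another component

        producer = mmap.mmap(fd, FRAME_BYTES)       # plays the role of the GPU side
        consumer = mmap.mmap(peer_fd, FRAME_BYTES)  # plays the role of the display side
        try:
            producer[0:4] = b"\xff\x00\x00\xff"  # "render" one pixel
            return bytes(consumer[0:4])          # visible on the other side, no memcpy
        finally:
            producer.close()
            consumer.close()
            os.close(peer_fd)


if __name__ == "__main__":
    print(share_by_fd())
```

    In a real integration the descriptor would come from the display driver's exporter and be imported by the GPU driver, but the principle is the same: both sides refer to one set of pages rather than copying between two.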

    Best wishes,

    Guillaume