
Chromebook OpenCL Development Issues

Hi all,

I have gotten a Samsung Chromebook (model number XE303C12) and I am trying to get some OpenCL code running.

I have followed the "Graphics and Compute Development on Samsung Chromebook" guide (including disabling CONFIG_SECURITY_CHROMIUMOS) and have been able to boot Linux on the Chromebook.  My issue is that I see no network at all, either via the Apple USB dongle or the wireless.  The devices don't even seem to exist as far as Linux is concerned.  Trying to load the USB network driver via modprobe gives an "Operation not permitted" error on usbnet.ko.

Has anyone gotten networking to work?  If so, how?

If anyone is willing to share a working SD card image, that would be fantastic.  To be honest, I got sick of building the Linux kernel sometime in the '90s.

Thanks.

--Mike

P.S.  I don't see OpenCL headers.  Can I just copy them from a working system?

  • Tu,

    I found the setting in .config and disabled it.  I haven't tried booting without the lsm.module_locking=0 kernel parameter, so I'm not sure if it helps or not.

    I have gotten my code running on the Chromebook at this point.  I'm curious about its performance.

    Running the code on both my iMac (quad-core i7, GeForce 680MX) and the Chromebook shows some interesting discrepancies in performance.  On paper, the 680MX should be about 30 times as fast.

    My code does some image processing on multiband images (in this case, 50-band images).  The three benchmarks are the mean 50-band image pixel, the image covariance matrix, and something called the RX anomaly statistic.  RX calculates statistics locally to produce an anomaly measurement: it computes the mean and covariance in a 10x10 window, then the covariance is inverted and multiplied against the pixel of interest to produce a target score.  Each of these is done on the GPU in a single kernel.  For comparison, I run the exact same algorithms on the CPUs.
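
    For anyone unfamiliar with it, the per-pixel score is essentially the standard Reed-Xiaoli statistic, using the mean and covariance estimated from the local 10x10 window:

        \mathrm{RX}(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \, \Sigma^{-1} \, (\mathbf{x} - \boldsymbol{\mu})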

    For speed's sake, I'm running on a small image.

    iMac: quad-core i7 @ 3.4 GHz, GeForce 680MX (2234 GFLOPS, 100+ Watts)
    ======================================
                         mean     cov      RX
    GPU (GeForce 680MX)  0.0042   0.0423   0.3392
    CPU (quad i7)        0.0111   0.0218   0.3410

    Chromebook: Exynos with Mali-T604 (68 GFLOPS, roughly 1/32 the speed)
    =====================================
                         mean     cov      RX
    GPU (Mali-T604)      0.0373   0.9044   20.59
    CPU (ARM)            0.3911   0.2464   20.0725

    Obviously the i7 and the GeForce are going to cream the Chromebook.  What is interesting here is the relative slowness of the covariance calculation.  The Mali is about 3 times slower than the ARM CPU at calculating this, which surprises me, since it is essentially a matrix multiply.  For performance, do I need to use the OpenCL vector load and math operations?  (I notice your vectorized Sobel is 10 times the speed of the unvectorized one, and your SGEMM uses vectors.)  I have not done this in the past, since it makes no difference for the NVIDIA GPUs I've been using, and to be honest it is a pain.

    I'm also having the same stability problems as everyone else related to the MMC: I get constant crashes.  I've been working off an NFS disk to avoid writing to the SD card.  Is there any alternative to this?

    Thank you for your assistance. I'm really interested in low-power computing and I am excited about your products.

    --Mike

  • From the development guide, it looks like the Mali doesn't have separate high-speed local memory.  Everything is just stored in global memory, so this may explain things.  The COV and RX functions use a ton of local memory: I'm copying chunks of the images into local memory for performance reasons.  Since I'm copying portions of the image multiple times into local memory (necessary on a machine with 16 KB of local memory), this is probably killing performance.  I'll take a look at my code some more.
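
    Roughly the pattern I'm using, boiled down to one band and non-overlapping 10x10 tiles (a simplified sketch; the names, layout, and lack of border handling are just for illustration):

    #define WIN 10   /* assumes a 10x10 work-group */

    __kernel void window_mean_tiled(__global const float *band,   /* width x height, row-major, one band */
                                    __global float *out,
                                    const int width,
                                    __local float *tile)          /* WIN*WIN floats, sized via clSetKernelArg */
    {
        const int lx = get_local_id(0);
        const int ly = get_local_id(1);
        const int gx = get_global_id(0);
        const int gy = get_global_id(1);

        /* cooperative copy: each work-item stages one element of the window */
        tile[ly * WIN + lx] = band[gy * width + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* every work-item then reads the whole window back from local memory */
        float acc = 0.0f;
        for (int i = 0; i < WIN * WIN; ++i)
            acc += tile[i];
        out[gy * width + gx] = acc / (float)(WIN * WIN);
    }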

    --Mike

  • Hi Mike,


    > Everything is just stored in global memory, so this may explain things


    GPUs are designed to be latency tolerant, so where things are stored is a little less critical than it is on a CPU; lower-latency memory will always help, but it isn't usually necessary.


    > For performance, do I need to use the OpenCL vector load and math operations


    Ideally, yes - the Mali-T600 series is a vector architecture with SIMD maths units, which is different to many other GPU architectures.


    Where Mali excels is that our SIMD units are very flexible and very wide - if you only need int8 or int16 data for your kernel, we can process 16 or 8 elements per SIMD unit per clock cycle (i.e. we have a 128-bit data path and you can carve it up into 8-, 16-, or 32-bit lanes). If you need floating point, I believe the current drivers available on the website only expose fp32, but we are adding half-float extension support in our next driver release.
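
    As a toy illustration (my own example here, not from one of our shipped kernels), each of the additions below operates across the full 128-bit data path, just with a different lane width:

    __kernel void lane_widths(__global const char16 *a8,  __global const char16 *b8,  __global char16 *sum8,
                              __global const short8 *a16, __global const short8 *b16, __global short8 *sum16,
                              __global const float4 *a32, __global const float4 *b32, __global float4 *sum32)
    {
        const int i = get_global_id(0);
        sum8[i]  = a8[i]  + b8[i];    /* 16 x 8-bit integer lanes  */
        sum16[i] = a16[i] + b16[i];   /*  8 x 16-bit integer lanes */
        sum32[i] = a32[i] + b32[i];   /*  4 x fp32 lanes           */
    }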


    While the compiler can auto-vectorize (and it is getting better at this, so we hope to improve here), there needs to be enough work in a work-item to fill the SIMD lanes, and auto-vectorization is relatively fiddly in any compiler, so it is always more reliable to use the built-in vector types and functions explicitly.
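
    For example (a rough sketch of the kind of change I mean - the kernel names and the assumption that the bands of each pixel are stored contiguously are hypothetical, and 48 bands keeps it a multiple of 4), the vector version simply swaps the scalar accumulation for 128-bit float4 loads and adds:

    #define NBANDS 48   /* a real 50-band image would need a scalar tail */

    __kernel void band_mean_scalar(__global const float *image, __global float *mean_out)
    {
        const int pix = get_global_id(0);
        const __global float *p = image + pix * NBANDS;

        float acc = 0.0f;
        for (int b = 0; b < NBANDS; ++b)
            acc += p[b];
        mean_out[pix] = acc / (float)NBANDS;
    }

    __kernel void band_mean_vec4(__global const float *image, __global float *mean_out)
    {
        const int pix = get_global_id(0);
        const __global float *p = image + pix * NBANDS;

        float4 acc = (float4)(0.0f);
        for (int b = 0; b < NBANDS; b += 4)
            acc += vload4(0, p + b);              /* one 128-bit load and add per iteration */
        mean_out[pix] = (acc.x + acc.y + acc.z + acc.w) / (float)NBANDS;
    }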


    You may want to try downloading the ARM DS-5 Community Edition - this supports the Streamline profiling tool, which includes support for capturing and displaying the GPU hardware performance counters. This should help you identify where your GPU cycles are being spent, including giving some measure of the efficiency of the GPU's interaction with main memory. We have an optimization guide which includes some hints on which counters to look at for different types of problem, but if you have any questions please shout:


    Mali GPU Application Optimization Guide v3.0 « Mali Developer Center


    Kind regards,
    Pete

  • > I've been working off an NFS disk to avoid writing to the SD card. Is there any alternative to this?

    This is the method I use, although a USB stick is also viable.

  • > Since I'm copying portions of the image multiple times into local memory (necessary on a machine with 16 KB of local memory), this is probably killing performance

    This is unnecessary on our architecture; everything can be done from global memory. The guide Pete links to is focused on graphics, I believe, but it will still give you some good insight into performance considerations on our architecture. There is also an OpenCL-specific guide at Mali-T600 Series GPU OpenCL Developer Guide « Mali Developer Center which is worth a read!
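
    As a rough sketch (my own example - the names, single-band layout, and missing bounds handling are just for illustration), the windowed mean can read straight from __global memory and let the GPU's latency tolerance and caches do the rest:

    #define WIN 10

    __kernel void window_mean_global(__global const float *band,  /* width x height, row-major, one band */
                                     __global float *out,
                                     const int width)
    {
        const int gx = get_global_id(0);   /* window origin x */
        const int gy = get_global_id(1);   /* window origin y */

        float acc = 0.0f;
        for (int y = 0; y < WIN; ++y)
            for (int x = 0; x < WIN; ++x)
                acc += band[(gy + y) * width + (gx + x)];

        out[gy * width + gx] = acc / (float)(WIN * WIN);
    }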

    Thanks,

    Chris