Initial Look at OpenCL Accelerated SQLite Performance numbers on Mali

Here's a link to a blog post from today about my work on accelerating SQLite with OpenCL on the ARM-based Samsung Chromebook with a Mali-T604.

Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali | Tom Gall

Comments, questions and suggestions most welcome.

  • Hi Tom,

    That looks like a nice use of OpenCL; a problem with parallelism and few data dependencies.

    I suspect your kernels are quite simple relative to the number of memory accesses you are making, so I wonder if this can be tuned.

    For example, in the query below, the only data you need for the "test" is uniformi and normali5; you only need to touch id if a row passes the test. (Actually, I wonder if the GPU could avoid this completely and return a byte vector of pass flags, leaving the CPU to build the result set, but I'm not sure how that works internally in SQLite.)

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    Given Mali is a SIMD architecture it will work best if you have a vec4 of identical operations for the maths units. In this case you should split your input data into three arrays rather than using a packed struct for each row. You have one array for each column, and pack 4 rows per work item (assuming int32 data types, you would pack 8 for int16, or 16 for int8).

    You would have a vec4 load where each vector contains data from one column, and each vector element contains a value from one row. This would mean that the vec4 operations on each "column slice" of 4 rows are identical which packs into the SIMD lanes cleanly - you can use the relational built-in functions to compare all 4 rows against the same condition in a single operation.

    In addition to the vector processing packing improvements, this has one advantage in that you cleanly partition the kernel's memory access. You can test all 4/8/16 rows for passing "uniformi > 60". If they all fail then there is no need to load "normali5" or "id" at all - so you don't waste GPU cycles or cache loading data you never actually use.
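As a rough illustration of the column-per-array layout and the early-out described above, here is a minimal sketch in plain C rather than OpenCL C, so it can be run on the host. The `Columns` struct and the `filter_columns` function are hypothetical names, each inner 4-iteration loop stands in for a single vec4 operation, and the row count is assumed to be a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical column-per-array layout for the "test" table. */
typedef struct {
    const int32_t *id;
    const int32_t *uniformi;
    const int32_t *normali5;
    size_t rows;                 /* assumed to be a multiple of 4 */
} Columns;

/* Tests rows in groups of 4 against: uniformi > 60 AND normali5 < 0.
 * Writes 1 (pass) or 0 (fail) per row into mask, and returns the number
 * of 4-row groups for which normali5 never had to be read. */
size_t filter_columns(const Columns *t, uint8_t *mask)
{
    size_t skipped = 0;

    for (size_t base = 0; base < t->rows; base += 4) {
        uint8_t pass[4];
        int any = 0;

        /* Stand-in for one vec4 compare on the first column. */
        for (int lane = 0; lane < 4; lane++) {
            pass[lane] = (uint8_t)(t->uniformi[base + lane] > 60);
            any |= pass[lane];
        }

        if (!any) {
            /* Every row failed already: do not touch normali5 at all. */
            for (int lane = 0; lane < 4; lane++)
                mask[base + lane] = 0;
            skipped++;
            continue;
        }

        /* Stand-in for the second vec4 compare, combined with the first. */
        for (int lane = 0; lane < 4; lane++)
            mask[base + lane] =
                (uint8_t)(pass[lane] && (t->normali5[base + lane] < 0));
    }
    return skipped;
}
```

In a real kernel each inner loop would collapse into one vector operation; the point here is only that `normali5` is never read for a group in which all four rows have already failed `uniformi > 60`, and that `id` is not read by the test at all.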

    If you are willing to show the code for a kernel or two we could possibly provide a bit more help.

    HTH,
    Pete

  • Hi Tom,

    Thanks for the link... it is a fascinating use of GPU Compute and the results are quite encouraging.  I'm presuming your CPU version is running on a single core - is that correct?  The Chromebook has dual Cortex-A15 so you could presumably double the performance there.  And also, adding NEON acceleration on the CPU side would be an interesting comparison with the GPU.

    Regarding RenderScript, do let us know how that goes.  It will be interesting to see how it compares to your OpenCL version.

    And seeing how the same code performs on a Mali-T628 platform will also be interesting.  Bear in mind that once an implementation of T628 goes above 4 GPU cores, they are split into 2 core groups... and these appear as separate devices, so the same OpenCL application won't automatically spread the load across both.  I would suspect - though it would be interesting to check - that you would see similar performance with the Mali-T604 you are currently using.

    As Pete has said, there may be a number of ways to optimise what you have done further, and tuning memory access and vector operations is likely the key.

    Regards,

    Tim

  • Thanks for the pointers Pete. They seem like a worthwhile evolution of the code and make complete sense. I'll give that a try in the next day or two.

    I do plan to open the code up and will post a pointer to the git repo sometime next week.

  • Hi Tim,

    My system is the ARM-based, dual-core Cortex-A15 Samsung Chromebook.

    You're right that running across multiple cores, as well as with NEON acceleration, would also be a worthwhile comparison. It'll come down to how much time I have to devote to it.

    On RenderScript: yes, this is an interesting data point that I want to follow up on. I've just got an original Nexus 7 right now, though, and I'm not sure that's a good choice. I do have an Arndale board as well, and will have to see if there's a version of KitKat with accelerated RenderScript drivers. KitKat includes the C APIs for RenderScript, and obviously that's critical.

    Thanks for the details on the T628; that's also good to be aware of. IIRC there's an MP6 and an MP8, which would be a 6-core and an 8-core Mali? Does that mean there would be 3 cores in each of 2 groups and 4 cores in each of 2 groups, respectively? Do the groups get reported as 2 platforms from an OpenCL perspective?

  • Hi Tom,

    The actual configuration of a T628 can vary.  MP6 and MP8 do indeed refer to the number of cores, but an MP6 doesn't necessarily mean 3+3... the T628-MP6 configurations out there at the moment are 4+2.  The CL driver will by default run on the 4-core group... and though it is the intention for both groups to appear as separate devices I'm not sure the current driver supports it - but I'll check that for you.  A T628-MP8 would indeed be configured 4+4.
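If both core groups do end up exposed as separate devices, the application would have to divide the rows between them itself. The sketch below, in plain C, shows one possible way to do that: split in proportion to core count and keep the boundary on a multiple of 4 rows so that each device works on whole vec4 groups. `RowRange` and `split_rows` are hypothetical names, and the assumption that throughput scales with core count is untested.

```c
#include <stddef.h>

/* A contiguous slice of the table handed to one device (hypothetical type). */
typedef struct {
    size_t first;   /* index of the first row in the slice */
    size_t count;   /* number of rows in the slice */
} RowRange;

/* Splits `rows` into two contiguous ranges in the ratio cores_a : cores_b.
 * The first range is rounded down to a multiple of 4 rows; the second
 * range takes whatever is left. Requires cores_a + cores_b > 0. */
void split_rows(size_t rows, unsigned cores_a, unsigned cores_b,
                RowRange out[2])
{
    size_t share = rows * cores_a / (cores_a + cores_b);
    share -= share % 4;          /* keep the boundary on a 4-row group */

    out[0].first = 0;
    out[0].count = share;
    out[1].first = share;
    out[1].count = rows - share;
}
```

For a 4+2 configuration and 100,000 rows this gives the 4-core group roughly two thirds of the table; for 4+4 it is an even split.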

    HTH, Tim

  • Hi Pete,

    Thanks again for your suggestions. I'm still working on the code a bit yet but it's looking good. Performing:

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    SQLite built with -O2:

    CPU sql1 took 43631 microseconds

    OpenCL sql1 took 14545 microseconds (2.99x, or 199% better)

    OpenCL sql1 (using vectors) took 4114 microseconds (10.6x, or 960% better)
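To make the relationship between the two figures in each set of parentheses explicit, here is a small sketch in C; the helper names are made up for illustration. The speedup is the baseline time divided by the new time, and "percent better" is that ratio minus one, times 100.

```c
/* How many times faster the new run is than the baseline. */
double speedup(double baseline_us, double new_us)
{
    return baseline_us / new_us;
}

/* The same comparison expressed as a percentage improvement. */
double percent_better(double baseline_us, double new_us)
{
    return (baseline_us / new_us - 1.0) * 100.0;
}
```

Feeding in the timings above, 43631 against 4114 microseconds gives a ratio of about 10.6, which is the same thing as roughly 960% better.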

    The improvement came not only from working with vectors rather than arrays of integers, but also from reducing the number of registers in use. I was able to jump from 64 to 128 work units.

    The heart of the OpenCL kernel evolved from:

        do {
            /* Copy the whole row only when both predicates pass. */
            if ((data[offset].v > 60) && (data[offset].w < 0)) {
                resultArray[roffset].id = data[offset].id;
                resultArray[roffset].v = data[offset].v;
                resultArray[roffset].w = data[offset].w;
                roffset++;
            }
            offset++;
            endRow--;
        } while (endRow);

    to

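        do {
            /* Load 4 rows from each of the two tested columns. */
            v1 = vload4(0, data1+offset);
            v2 = vload4(0, data2+offset);
            /* Per-lane result: non-zero where both predicates hold. */
            r = (v1 > 60) && (0 > v2);
            vstore4(r, 0, resultMask+offset);
            offset += 4;
            totalRows--;
        } while (totalRows);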
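One way the host might turn the mask into a result set is sketched below in plain C. It assumes the mask holds one 32-bit integer per row and that the three columns live in separate arrays; `ResultRow` and `gather_matches` are hypothetical names, not part of the code discussed above. In OpenCL C, vector comparisons and vector `&&` produce -1 (all bits set) for a true lane rather than 1, so the check is against non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* One output row of the query (hypothetical type). */
typedef struct {
    int32_t id;
    int32_t v;
    int32_t w;
} ResultRow;

/* Copies every row whose mask entry is non-zero into `out`, in row order.
 * `out` must have room for `rows` entries. Returns the number of matches. */
size_t gather_matches(const int32_t *mask, const int32_t *id,
                      const int32_t *v, const int32_t *w,
                      size_t rows, ResultRow *out)
{
    size_t n = 0;

    for (size_t i = 0; i < rows; i++) {
        if (mask[i] != 0) {      /* -1 from the kernel means "matched" */
            out[n].id = id[i];
            out[n].v  = v[i];
            out[n].w  = w[i];
            n++;
        }
    }
    return n;
}
```

With this split the kernel never touches the id column; only the host reads it, and only for rows that matched.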

    Thanks again. If I can knock down one little bug, I anticipate I'll be posting my code tomorrow. I've a curious situation where 2 results (out of 100,000 rows) aren't being matched, and I'm not sure why. What I wouldn't give for an OpenCL debugger!

  • Hi Tom,

    "Thanks again for your suggestions."

    No problem - happy to help. Writing optimal GPGPU code generally requires spotting where you can orientate your world 90 degrees and run on the walls - it takes a bit of getting used to.


    "OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)"

    Sweet =)

    Cheers,
    P