
Initial Look at OpenCL Accelerated SQLite Performance numbers on Mali

Here's a link to a blog post from today about my work on accelerating SQLite with OpenCL on the ARM-based Samsung Chromebook with a Mali T604.

Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali | Tom Gall

Comments, questions and suggestions most welcome.

  • Hi Tom,

    That looks like a nice use of OpenCL; a problem with parallelism and few data dependencies.

    I suspect your kernels are quite simple relative to the number of memory accesses you are making, so I wonder if this can be tuned.

    For example, in the query below the only data you need for the "test" is uniformi and normali5; you only need to touch id if a row passes the test (actually, I wonder if the GPU can avoid this completely and return a byte vector of pass/fail flags, leaving the CPU to build the result set, but I'm not sure how that works internally in SQLite).

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    Given Mali is a SIMD architecture it will work best if you have a vec4 of identical operations for the maths units. In this case you should split your input data into three arrays rather than using a packed struct for each row. You have one array for each column, and pack 4 rows per work item (assuming int32 data types, you would pack 8 for int16, or 16 for int8).
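A minimal host-side sketch of the column split described above, in plain C. The `row_t` layout, the function names, and the choice to pad each column to a multiple of 4 with values that can never match the example WHERE clause are all assumptions made for illustration; the original code may organise its buffers differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed row, one struct per table row (assumed layout). */
typedef struct {
    int32_t id;
    int32_t uniformi;
    int32_t normali5;
} row_t;

/* Round n up to the next multiple of 4 so every work item gets a full vec4. */
static size_t pad_to_vec4(size_t n) {
    return (n + 3u) & ~(size_t)3u;
}

/*
 * Split an array of rows into one array per column.
 * Each output array must have room for pad_to_vec4(n) elements.
 * Padding entries are filled with values that fail
 * "uniformi > 60 AND normali5 < 0", so they can never show up as matches.
 */
static void split_columns(const row_t *rows, size_t n,
                          int32_t *id, int32_t *uniformi, int32_t *normali5) {
    size_t padded = pad_to_vec4(n);
    assert(rows != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        id[i] = rows[i].id;
        uniformi[i] = rows[i].uniformi;
        normali5[i] = rows[i].normali5;
    }
    for (size_t i = n; i < padded; i++) {
        id[i] = -1;
        uniformi[i] = 0;  /* fails uniformi > 60 */
        normali5[i] = 0;  /* fails normali5 < 0 */
    }
}
```

With this layout, a single 4-wide load from `uniformi` yields the same column for 4 consecutive rows, which is the shape the vec4 comparison wants.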

    You would have a vec4 load where each vector contains data from one column, and each vector element contains a value from one row. This would mean that the vec4 operations on each "column slice" of 4 rows are identical which packs into the SIMD lanes cleanly - you can use the relational built-in functions to compare all 4 rows against the same condition in a single operation.

    In addition to the vector processing packing improvements, this has one advantage in that you cleanly partition the kernel's memory access. You can test all 4/8/16 rows for passing "uniformi > 60". If they all fail then there is no need to load "normali5" or "id" at all - so you don't waste GPU cycles or cache loading data you never actually use.
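The early-out idea can be emulated on the CPU to check the logic before moving it into a kernel. This is only a sketch in plain C: `filter_slices` and its byte mask output are hypothetical names, and the inner 4-lane loops stand in for what would be a 4-wide load, a vector comparison, and a test such as the `any()` built-in in OpenCL C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Evaluate "uniformi > 60 AND normali5 < 0" in slices of 4 rows.
 * mask[i] is set to 1 for rows that pass and 0 otherwise.
 * n must be a multiple of 4.
 * Returns the number of slices for which normali5 was never read.
 */
static size_t filter_slices(const int32_t *uniformi, const int32_t *normali5,
                            uint8_t *mask, size_t n) {
    size_t skipped = 0;
    assert(n % 4 == 0);
    for (size_t base = 0; base < n; base += 4) {
        uint8_t pass[4];
        int any_pass = 0;
        /* First condition: touches only the uniformi column. */
        for (int lane = 0; lane < 4; lane++) {
            pass[lane] = (uint8_t)(uniformi[base + lane] > 60);
            any_pass |= pass[lane];
        }
        if (!any_pass) {
            /* Whole slice failed, so the normali5 load is skipped. */
            for (int lane = 0; lane < 4; lane++) {
                mask[base + lane] = 0;
            }
            skipped++;
            continue;
        }
        /* Second condition: only for slices with at least one survivor. */
        for (int lane = 0; lane < 4; lane++) {
            mask[base + lane] =
                (uint8_t)(pass[lane] && normali5[base + lane] < 0);
        }
    }
    return skipped;
}
```

The return value gives a rough idea of how much memory traffic the early-out saves for a given data set; how much that helps on real hardware depends on how selective the first condition is.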

    If you are willing to show the code for a kernel or two we could possibly provide a bit more help.

    HTH,
    Pete

  • Thanks for the pointers Pete. They seem like a worthwhile evolution of the code and make complete sense. I'll give that a try in the next day or two.

    I do plan to open the code up and will post a pointer to the git repo sometime next week.

  • Hi Pete,

    Thanks again for your suggestions. I'm still working on the code a bit, but it's looking good. Running:

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    SQLite built with -O2:

    CPU sql1 took 43631 microseconds

    OpenCL sql1 took 14545 microseconds (2.99x or 199% better)

    OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

    The improvement resulted not only from working with vectors rather than arrays of integers, but also from reducing the number of registers in use. I was able to jump from 64 to 128 work units.

    The heart of the OpenCL kernel evolved from:

        do {
            if ((data[offset].v > 60) && (data[offset].w < 0)) {
                resultArray[roffset].id = data[offset].id;
                resultArray[roffset].v = data[offset].v;
                resultArray[roffset].w = data[offset].w;
                roffset++;
            }
            offset++;
            endRow--;
        } while (endRow);

    to

        do {
            v1 = vload4(0, data1 + offset);
            v2 = vload4(0, data2 + offset);
            r = (v1 > 60) && (0 > v2);
            vstore4(r, 0, resultMask + offset);
            offset += 4;
            totalRows--;
        } while (totalRows);
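The post does not show how `resultMask` is consumed afterwards. One possible host-side step, sketched here in plain C, is for the CPU to walk the mask and copy the matching rows into the result set. `gather_matches`, `result_t`, and the separate `id` column are hypothetical names introduced for illustration; the sketch relies only on the fact that vector comparisons in OpenCL C set true lanes to -1 (all bits set), so any non-zero mask entry is treated as a match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result row, mirroring the id/v/w fields of resultArray. */
typedef struct {
    int32_t id;
    int32_t v;
    int32_t w;
} result_t;

/*
 * Copy every row whose mask entry is non-zero into out.
 * out must have room for n entries.
 * Returns the number of rows written.
 */
static size_t gather_matches(const int32_t *mask, const int32_t *id,
                             const int32_t *data1, const int32_t *data2,
                             size_t n, result_t *out) {
    size_t count = 0;
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mask[i] != 0) {  /* true lanes are -1, false lanes are 0 */
            out[count].id = id[i];
            out[count].v = data1[i];
            out[count].w = data2[i];
            count++;
        }
    }
    return count;
}
```

Because the kernel then writes only a mask, the `id` column never has to be read on the GPU at all; whether that trade is a net win depends on how many rows match and how expensive the CPU pass is.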

    Thanks again. If I can knock down one little bug, I anticipate I'll be posting my code tomorrow. I have a curious situation where 2 results (out of 100,000 rows) aren't being matched, and I'm not sure why. What I wouldn't give for an OpenCL debugger!

  • Hi Tom,

    > Thanks again for your suggestions.

    No problem - happy to help. Writing optimal GPGPU code generally requires spotting where you can orientate your world 90 degrees and run on the walls - it takes a bit of getting used to.

    > OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

    Sweet =)

    Cheers,
    P