Initial Look at OpenCL Accelerated SQLite Performance numbers on Mali

Here's a link to a blog post I published today about my work on accelerating SQLite with OpenCL on the ARM-based Samsung Chromebook with a Mali-T604 GPU.

Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali | Tom Gall

Comments, questions and suggestions most welcome.

  • Hi Tom,

    That looks like a nice use of OpenCL: a problem with plenty of parallelism and few data dependencies.

    I suspect your kernels do very little computation relative to the number of memory accesses they make, so I wonder whether the memory access pattern can be tuned.

    For example, in the query below the only data you need for the "test" is uniformi and normali5; you only need to touch id for rows that pass the test. (Actually, I wonder if the GPU can avoid touching id completely and just return a byte vector of pass/fail flags, leaving the CPU to build the result set, but I'm not sure how that works internally in SQLite.)

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    Given that Mali is a SIMD architecture, it will work best if you give the maths units a vec4 of identical operations. In this case you should split your input data into three arrays rather than using a packed struct for each row. You have one array for each column, and pack 4 rows per work item (assuming int32 data types; you would pack 8 rows for int16, or 16 for int8).
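
    As a rough illustration of that layout change, here is a minimal host-side sketch in plain C. It is not taken from Tom's code: the row_t struct, split_columns() and work_items() are hypothetical names, and it assumes all three columns are int32 and that the row count is padded to a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed layout: one struct per row (array of structs). */
typedef struct {
    int32_t id;
    int32_t uniformi;
    int32_t normali5;
} row_t;

/* Split n packed rows into one contiguous array per column.
 * Assumption: n is a multiple of 4, so every work item gets a full
 * 4-row slice of each int32 column. */
void split_columns(const row_t *rows, size_t n,
                   int32_t *id, int32_t *uniformi, int32_t *normali5)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; ++i) {
        id[i]       = rows[i].id;
        uniformi[i] = rows[i].uniformi;
        normali5[i] = rows[i].normali5;
    }
}

/* Work items needed when each one handles 16 bytes (128 bits) of a
 * column: 4 rows for int32, 8 for int16, 16 for int8. */
size_t work_items(size_t n_rows, size_t elem_size)
{
    size_t rows_per_item = 16 / elem_size;
    return (n_rows + rows_per_item - 1) / rows_per_item;
}
```

    With this layout, 1024 int32 rows would need 256 work items, and each work item reads four consecutive values from a single column rather than striding across packed rows.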

    You would then use vec4 loads, where each vector contains data from one column and each vector element holds the value for one row. This means the vec4 operations on each "column slice" of 4 rows are identical, which packs into the SIMD lanes cleanly - you can use the relational built-in functions to compare all 4 rows against the same condition in a single operation.

    In addition to the better vector packing, this has a further advantage: it cleanly partitions the kernel's memory accesses. You can test all 4/8/16 rows for passing "uniformi > 60". If they all fail then there is no need to load "normali5" or "id" at all - so you don't waste GPU cycles or cache space loading data you never actually use.
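
    The control flow might look something like the sketch below. It is plain C standing in for a kernel, not real OpenCL code and not Tom's implementation: filter_slice() plays the role of one work item, filter_rows() plays the role of the whole launch, and both names are hypothetical. In an actual OpenCL C kernel the inner loops would presumably become a vload4, a vector comparison, and an any() check. The output is the byte vector of pass flags mentioned above, so "id" is never read here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one work item: evaluates
 *   uniformi > 60 AND normali5 < 0
 * for the 4 rows starting at item * 4 and writes one flag byte per row. */
void filter_slice(const int32_t *uniformi, const int32_t *normali5,
                  uint8_t *pass, size_t item)
{
    size_t base = item * 4;
    uint8_t mask[4];
    int any_pass = 0;

    /* First condition: only the uniformi column is loaded. */
    for (int k = 0; k < 4; ++k) {
        mask[k] = (uint8_t)(uniformi[base + k] > 60);
        any_pass |= mask[k];
    }

    /* Early out: normali5 is loaded only if at least one row survived. */
    if (any_pass) {
        for (int k = 0; k < 4; ++k) {
            mask[k] = (uint8_t)(mask[k] && normali5[base + k] < 0);
        }
    }

    for (int k = 0; k < 4; ++k) {
        pass[base + k] = mask[k];
    }
}

/* Stand-in for the kernel launch: runs every work item in turn and
 * returns the number of rows that passed.
 * Assumption: n is a multiple of 4. */
size_t filter_rows(const int32_t *uniformi, const int32_t *normali5,
                   uint8_t *pass, size_t n)
{
    assert(n % 4 == 0);
    size_t hits = 0;
    for (size_t item = 0; item < n / 4; ++item) {
        filter_slice(uniformi, normali5, pass, item);
    }
    for (size_t i = 0; i < n; ++i) {
        hits += pass[i];
    }
    return hits;
}
```

    The CPU side could then walk the pass array and fetch "id" only for flagged rows when building the result set; whether that fits how SQLite produces rows internally is an open question.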

    If you are willing to show the code for a kernel or two we could possibly provide a bit more help.

    HTH,
    Pete
