Here's a link to a blog post from today about my work on accelerating SQLite with OpenCL on the ARM-based Samsung Chromebook and its Mali-T604 GPU.
Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali | Tom Gall
Comments, questions and suggestions most welcome.
Hi Tom,
That looks like a nice use of OpenCL: a problem with plenty of parallelism and few data dependencies.
I suspect your kernels are quite simple relative to the number of memory accesses you are making, so I wonder if this can be tuned.
For example, in the query below the only data you need for the test itself is uniformi and normali5; you only need to touch id for rows that pass the test. (In fact I wonder whether the GPU could avoid touching id completely and just return a byte vector of passing rows, with the CPU building the result set, but I'm not sure how that would work internally in SQLite.)
SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0
Given Mali is a SIMD architecture it will work best if you have a vec4 of identical operations for the maths units. In this case you should split your input data into three arrays rather than using a packed struct for each row. You have one array for each column, and pack 4 rows per work item (assuming int32 data types, you would pack 8 for int16, or 16 for int8).
You would have a vec4 load where each vector contains data from one column, and each vector element contains a value from one row. This means the vec4 operations on each "column slice" of 4 rows are identical, which packs cleanly into the SIMD lanes - you can use the relational built-in functions to compare all 4 rows against the same condition in a single operation.
In addition to the vector packing improvement, this has the advantage that it cleanly partitions the kernel's memory accesses. You can test all 4/8/16 rows for passing "uniformi > 60"; if they all fail then there is no need to load "normali5" or "id" at all, so you don't waste GPU cycles or cache on data you never actually use.
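Roughly, a filter kernel along those lines might look like the sketch below. The kernel name, argument names and the per-row int mask output are my own invention rather than anything from your post, but it shows the one-array-per-column layout, the vec4 relational compares, and skipping the second column when the first test fails for all 4 rows:

    /* Sketch only: assumes rowsPerItem is a multiple of 4 and that each
       column lives in its own int array. */
    __kernel void filter_rows(__global const int *uniformi,
                              __global const int *normali5,
                              __global int *resultMask,
                              const int rowsPerItem)
    {
        int base = get_global_id(0) * rowsPerItem;

        for (int i = 0; i < rowsPerItem; i += 4) {
            int offset = base + i;

            /* Test "uniformi > 60" for 4 rows with one vector compare;
               each lane of the result is -1 (pass) or 0 (fail). */
            int4 u = vload4(0, uniformi + offset);
            int4 pass = u > (int4)(60);

            if (!any(pass)) {
                /* All 4 rows failed the first test: record the misses and
                   never touch normali5 (or id) for this slice at all. */
                vstore4((int4)(0), 0, resultMask + offset);
                continue;
            }

            /* At least one row survived, so load the second column and
               AND the two per-lane masks together. */
            int4 n = vload4(0, normali5 + offset);
            pass = pass & (n < (int4)(0));

            vstore4(pass, 0, resultMask + offset);
        }
    }

The CPU can then walk resultMask and fetch id only for the rows that passed.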
If you are willing to show the code for a kernel or two we could possibly provide a bit more help.
HTH, Pete
Thanks for the pointers, Pete. They seem like a worthwhile evolution of the code and make complete sense. I'll give that a try in the next day or two.
I do plan to open the code up and will post a pointer to the git repo sometime next week.
Hi Pete,
Thanks again for your suggestions. I'm still working on the code a bit, but it's looking good. Performance so far:
SQLite built with -O2:
CPU sql1 took 43631 microseconds
OpenCL sql1 took 14545 microseconds (2.99x, or 199% better)
OpenCL sql1 using vectors took 4114 microseconds (10.6x, or 960% better)
The improvement came not only from working with vectors instead of scalar operations over arrays of integers, but also from reducing the number of registers in use, which let me jump from 64 to 128 work-items.
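For anyone following along, the per-kernel work-group limit can be checked from the host with clGetKernelWorkGroupInfo; a register-heavy kernel reports a smaller maximum. A minimal sketch, assuming the kernel and device have already been created:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Print the largest work-group size this kernel can be enqueued with
       on this device; it shrinks as the kernel's register use grows. */
    static void print_max_wg_size(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0;
        if (clGetKernelWorkGroupInfo(kernel, device,
                                     CL_KERNEL_WORK_GROUP_SIZE,
                                     sizeof(max_wg), &max_wg, NULL) == CL_SUCCESS)
            printf("max work-group size for this kernel: %zu\n", max_wg);
    }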
The heart of the OpenCL kernel evolved from:
    /* Row-at-a-time filter over an array of row structs: every row touches
       id, v and w, and matches are copied straight into the result array. */
    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
        endRow--;
    } while (endRow);
to
    /* Column-at-a-time filter: one vec4 load per column tests 4 rows at
       once, and only a pass/fail mask is written back; id is never read. */
    do {
        v1 = vload4(0, data1 + offset);
        v2 = vload4(0, data2 + offset);
        r = (v1 > 60) && (0 > v2);
        vstore4(r, 0, resultMask + offset);
        offset += 4;
        totalRows--;
    } while (totalRows);
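The host side then consumes the mask to build the result set, roughly along these lines (just a sketch with made-up array and function names, not the exact code I'll be posting):

    #include <stddef.h>

    /* Walk the per-row mask written by the kernel and gather the matching
       rows into result arrays; note the GPU never had to read "id". */
    static size_t gather_results(const int *resultMask, size_t totalRows,
                                 const int *id, const int *uniformi,
                                 const int *normali5,
                                 int *outId, int *outUniformi, int *outNormali5)
    {
        size_t hits = 0;
        for (size_t row = 0; row < totalRows; row++) {
            if (resultMask[row]) {            /* -1 = matched, 0 = no match */
                outId[hits] = id[row];
                outUniformi[hits] = uniformi[row];
                outNormali5[hits] = normali5[row];
                hits++;
            }
        }
        return hits;
    }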
Thanks again. If I can knock down one little bug, I anticipate I'll be posting my code tomorrow. I've got a curious situation where 2 results (out of 100,000 rows) aren't being matched, and I'm not sure why. What I wouldn't give for an OpenCL debugger!
Thanks again for your suggestions.
No problem - happy to help. Writing optimal GPGPU code generally requires spotting where you can orientate your world 90 degrees and run on the walls - it takes a bit of getting used to.
Sweet =)
Cheers, P