Initial Look at OpenCL Accelerated SQLite Performance numbers on Mali

Here's a link to today's blog post about my work on accelerating SQLite with OpenCL on the ARM-based Samsung Chromebook with a Mali T604.

Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali | Tom Gall

Comments, questions and suggestions most welcome.

Parents
  • Hi Pete,

    Thanks again for your suggestions. I'm still working on the code a bit yet but it's looking good. Performing:

    SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

    sqlite built -O2

    CPU sql1 took 43631 microseconds

    OpenCL sql1 took 14545 microseconds (2.99x, or 199% better)

    OpenCL (using vectors) took 4114 microseconds (10.6x, or 960% better)
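
    For readers checking the arithmetic, the ratios quoted above follow from dividing the CPU time by the OpenCL time, with "percent better" taken as (speedup - 1) * 100. The helpers below are a minimal sketch of that calculation; the function names are hypothetical and not part of the benchmark code.

```c
#include <stdio.h>

/* Speedup factor: how many times faster the new run is than the baseline. */
double speedup(double baseline_us, double new_us) {
    return baseline_us / new_us;
}

/* "Percent better" in the sense used in the post: (speedup - 1) * 100. */
double percent_better(double baseline_us, double new_us) {
    return (speedup(baseline_us, new_us) - 1.0) * 100.0;
}

/* Print one summary line for a timing measured in microseconds. */
void report(const char *label, double baseline_us, double new_us) {
    printf("%s: %.2fx, %.0f%% better\n", label,
           speedup(baseline_us, new_us),
           percent_better(baseline_us, new_us));
}
```

    Calling report("OpenCL (using vectors)", 43631.0, 4114.0) prints the vectorized result in the same form as the figures above.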

    The improvement came not only from working with vectors rather than arrays of integers, but also from reducing the number of registers in use. I was able to jump from 64 to 128 work units.

    The heart of the OpenCL kernel evolved from:

        do {
            /* Keep rows where uniformi (v) > 60 AND normali5 (w) < 0. */
            if ((data[offset].v > 60) && (data[offset].w < 0)) {
                resultArray[roffset].id = data[offset].id;
                resultArray[roffset].v = data[offset].v;
                resultArray[roffset].w = data[offset].w;
                roffset++;
            }
            offset++;
            endRow--;
        } while (endRow);

    to

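        do {
            v1 = vload4(0, data1+offset);      /* four uniformi values */
            v2 = vload4(0, data2+offset);      /* four normali5 values */
            r = (v1 > 60) && (0 > v2);         /* per lane: -1 on match, 0 otherwise */
            vstore4(r, 0, resultMask+offset);
            offset += 4;
            totalRows--;                       /* decremented once per group of four rows */
        } while (totalRows);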
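
    The vector loop writes a per-row mask rather than compacted result rows, so the matching rows still have to be gathered somewhere afterwards. That step isn't shown in this post; the plain C sketch below is one way it could look on the host, under stated assumptions. The row_t layout and both function names are hypothetical, the columns are assumed to be separate int32_t arrays, and the tail loop for row counts that are not a multiple of 4 is an addition for safety rather than something taken from the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row layout for the compacted output. */
typedef struct {
    int32_t id;
    int32_t v;   /* uniformi */
    int32_t w;   /* normali5 */
} row_t;

/* Scalar stand-in for the vector kernel: write -1 where the row matches
 * (v > 60 AND w < 0) and 0 otherwise, processing four rows per step.
 * Rows left over when n is not a multiple of 4 go through a tail loop. */
void build_mask(const int32_t *v, const int32_t *w, int32_t *mask, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t lane = 0; lane < 4; lane++) {
            mask[i + lane] = (v[i + lane] > 60 && w[i + lane] < 0) ? -1 : 0;
        }
    }
    for (; i < n; i++) {
        mask[i] = (v[i] > 60 && w[i] < 0) ? -1 : 0;
    }
}

/* Host-side pass: copy the rows whose mask entry is non-zero into out.
 * Returns the number of rows written; out must have room for n rows. */
size_t compact_rows(const int32_t *id, const int32_t *v, const int32_t *w,
                    const int32_t *mask, size_t n, row_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (mask[i] != 0) {
            out[count].id = id[i];
            out[count].v = v[i];
            out[count].w = w[i];
            count++;
        }
    }
    return count;
}
```

    Splitting the work this way keeps the inner loop free of the data-dependent roffset++ write, at the cost of a second pass over the mask.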

    Thanks again. If I can knock down one little bug, I anticipate I'll be posting my code tomorrow. I've a curious situation where 2 results (out of 100,000 rows) aren't being matched and I'm not sure why. What I wouldn't give for an OpenCL debugger!
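
    Without a debugger, one common way to narrow down a handful of missing rows is to read the mask back and compare it with a scalar evaluation of the same predicate on the CPU. The function below is a minimal sketch of such a check, not the code used here; its name and the assumption that the columns and mask are plain int32_t arrays are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare a mask read back from the device with the scalar predicate
 * (v > 60 AND w < 0). Stores the index of the first disagreeing row in
 * *first_bad (if non-NULL) and returns the total number of mismatches. */
size_t count_mask_mismatches(const int32_t *v, const int32_t *w,
                             const int32_t *mask, size_t n,
                             size_t *first_bad) {
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        int expected = (v[i] > 60 && w[i] < 0);
        int actual = (mask[i] != 0);
        if (expected != actual) {
            if (mismatches == 0 && first_bad != NULL) {
                *first_bad = i;
            }
            mismatches++;
        }
    }
    return mismatches;
}
```

    Knowing the index of the first mismatch shows whether the missing rows sit at a boundary (for example the end of a work unit's range) or are scattered through the data.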

Children
  • Hi Tom,

    Thanks again for your suggestions.

    No problem - happy to help. Writing optimal GPGPU code generally requires spotting where you can orientate your world 90 degrees and run on the walls - it takes a bit of getting used to.


    OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

    Sweet =)

    Cheers,
    P