Optimised OpenCL SGEMM implementation for ARM Mali Midgard GPUs.

I want to implement an optimised SGEMM for Mali Midgard GPUs, which as of now only support OpenCL 1.2. As far as I know, OpenCL 1.2 doesn't support the subgroup extensions, and Mali GPUs don't benefit from local-memory tiling. So what would be the best way to perform SGEMM on Mali, without any memory reshaping, such that it performs better than, or at least on par with, the CPU implementation? Kindly give me some pointers other than Arm Compute ML. I'd really appreciate it.
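To make the constraints concrete, here is a minimal sketch in plain C (not OpenCL, and not tuned for Mali) of the operation and of register blocking, the usual alternative to local-memory tiling. Everything in it is an assumption for illustration: the function names `sgemm_ref` and `sgemm_regtile`, the row-major layout, and the 4x4 tile size `TM` x `TN` are hypothetical. Each `(i0, j0)` iteration of the second function stands in for what one work-item might compute, keeping its output tile in private accumulators and reading A and B directly from their original layout, with no reshaping and no local-memory staging.

```c
#include <stddef.h>

/* Reference SGEMM: C = alpha * A * B + beta * C, all matrices row-major.
   A is M x K, B is K x N, C is M x N. */
void sgemm_ref(size_t M, size_t N, size_t K, float alpha,
               const float *A, const float *B, float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

/* Hypothetical tile size; the best value is hardware dependent. */
#define TM 4
#define TN 4

/* Register-blocked variant: each (i0, j0) iteration models one work-item
   that accumulates a TM x TN output tile in private variables. */
void sgemm_regtile(size_t M, size_t N, size_t K, float alpha,
                   const float *A, const float *B, float beta, float *C)
{
    for (size_t i0 = 0; i0 < M; i0 += TM) {
        for (size_t j0 = 0; j0 < N; j0 += TN) {
            /* Clamp the tile at the right and bottom matrix edges. */
            size_t tm = (M - i0 < TM) ? (M - i0) : TM;
            size_t tn = (N - j0 < TN) ? (N - j0) : TN;
            float acc[TM][TN] = {{0.0f}};

            for (size_t k = 0; k < K; ++k) {
                for (size_t ti = 0; ti < tm; ++ti) {
                    /* One load of A is reused across the whole tile row. */
                    float a = A[(i0 + ti) * K + k];
                    for (size_t tj = 0; tj < tn; ++tj)
                        acc[ti][tj] += a * B[k * N + j0 + tj];
                }
            }

            /* Scale and write the finished tile back to C. */
            for (size_t ti = 0; ti < tm; ++ti) {
                for (size_t tj = 0; tj < tn; ++tj) {
                    size_t idx = (i0 + ti) * N + j0 + tj;
                    C[idx] = alpha * acc[ti][tj] + beta * C[idx];
                }
            }
        }
    }
}
```

In an OpenCL 1.2 kernel the two outer loops would presumably become the global work-item indices and the inner `tj` loop would map onto vector types such as `float4`; whether that is enough to match the CPU on Midgard is exactly what I am unsure about.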
