I tested the performance of Mali's texture path (cl_image) and found it is slower than the buffer path (cl_mem).
My GPU is a Mali-G76.
I expected textures to beat buffers, for example thanks to hardware bilinear filtering.
However, my tests show the G76's texture path is about 10%-20% slower than the buffer path. My test format is RGBA.
I don't know why.
Can anyone explain what is going on?
Or is there a standard benchmark program for this?
Without knowing what your kernel is doing, it's really impossible to say why it might be slower.
Hello Peter,
Thanks for your reply.
Let's look at a case: we have a 1080p RGBA image that we need to resize to 1600x900, and we test this case in both buffer style and texture style.
The buffer-style code looks like this:
```c
__kernel void image_resize_kernel(__global unsigned char const* src,
                                  __global unsigned char* dst,
                                  __private const int2 src_shape,
                                  __private const int2 dst_shape,
                                  __private const float2 scale_ratio)
{
    const int globalx = get_global_id(0);
    const int globaly = get_global_id(1);

    float src_x_f = scale_ratio.x * (float)globalx;
    float src_y_f = scale_ratio.y * (float)globaly;
    int src_x_i = (int)src_x_f;
    int src_y_i = (int)src_y_f;
    float u = src_y_f - src_y_i;  /* vertical fraction */
    float v = src_x_f - src_x_i;  /* horizontal fraction */

    /* load the four neighbouring texels */
    uchar4 v00 = vload4( src_y_i      * src_shape.x + src_x_i,     src);
    uchar4 v01 = vload4( src_y_i      * src_shape.x + src_x_i + 1, src);
    uchar4 v10 = vload4((src_y_i + 1) * src_shape.x + src_x_i,     src);
    uchar4 v11 = vload4((src_y_i + 1) * src_shape.x + src_x_i + 1, src);
    float4 v00_f = convert_float4(v00);
    float4 v01_f = convert_float4(v01);
    float4 v10_f = convert_float4(v10);
    float4 v11_f = convert_float4(v11);

    /* bilinear blend, channel by channel */
    float res_b = (1.0f - u) * (1.0f - v) * v00_f.x + (1.0f - u) * v * v01_f.x
                + u * (1.0f - v) * v10_f.x + u * v * v11_f.x;
    float res_g = (1.0f - u) * (1.0f - v) * v00_f.y + (1.0f - u) * v * v01_f.y
                + u * (1.0f - v) * v10_f.y + u * v * v11_f.y;
    float res_r = (1.0f - u) * (1.0f - v) * v00_f.z + (1.0f - u) * v * v01_f.z
                + u * (1.0f - v) * v10_f.z + u * v * v11_f.z;
    float res_a = 0.0f;

    uchar4 res = convert_uchar4_sat_rte((float4)(res_b, res_g, res_r, res_a));
    vstore4(res, globaly * dst_shape.x + globalx, dst);
}
```
And the texture-style code looks like this:
```c
__constant sampler_t SAMPLER_SRC = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_LINEAR;
__constant sampler_t SAMPLER_DST = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void image_resize_kernel(__read_only image2d_t src_img,
                                  __write_only image2d_t dst_img,
                                  __private const int2 src_shape,
                                  __private const int2 dst_shape,
                                  __private const float2 scale_ratio)
{
    const int globalx = get_global_id(0);
    const int globaly = get_global_id(1);

    /* source coordinate; the hardware sampler does the bilinear blend */
    float dst_x = (float)globalx * scale_ratio.x;
    float dst_y = (float)globaly * scale_ratio.y;

    uint4 dst_val = read_imageui(src_img, SAMPLER_SRC, (float2)(dst_x, dst_y));
    write_imageui(dst_img, (int2)(globalx, globaly), dst_val);
}
```
The local work size (lws) is set to [32, 4] in both cases.
The performance results (averaged over 1000 runs) are:
buffer style performance: 1095.45600 us
texture style performance: 1039.66600 us
The timing code for the test above looks like this:
```cpp
int loop = 1000;
auto start = std::chrono::system_clock::now();
for (int i = 0; i < loop; i++) {
    err = clEnqueueNDRangeKernel(cmd_que, ocl_kernel, work_dim, NULL, gws, lws, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
}
err = clFinish(cmd_que);
if (CL_SUCCESS != err) {
    printf("Error: clFinish returned %s\n", ocl_com::translate_opencl_error(err));
}
auto end = std::chrono::system_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
printf("GPU performance counter time: %lld\n", (long long int)duration.count());
std::size_t total_num = loop;
double avg_time = (double)(long long int)duration.count() / total_num;
printf("each gpu time: %.5f\n", avg_time);
```
From the test above, I don't see any advantage for the texture style, and I don't know why.
Could you give me some pointers?
For simple downscale operations I would expect this to be memory bound - the shader code probably won't be the issue.
Have you tried getting a profile with Streamline?
I'm confused about why the texture cache does no better than the buffer cache.
Furthermore, I assume the texture cache differs from a CPU cache; for example, it may store data in a Z-curve layout.
So I would guess the texture path should beat the buffer path when both resize the same image; in other words, I would expect textures to win whenever the computation is memory bound.
I don't know why the texture's advantage on the G76 is so tiny. Could you share some details about how the G76 handles textures?
As for Streamline, I haven't tried it yet; let me check it out first.
> I guess the texture's performance should be better than buffer
Stop guessing and measure some hard data =)
> I don't know why texture's advantage of G76 is tiny?
... because cache probably isn't the bottleneck.
Hello, Peter
I've run the profiler; the output looks like this:
| Counter | Buffer style | Texture style |
| --- | --- | --- |
| Fragment Ac | 0 | 0 |
| Fragment Util | 0% | 0% |
| Non-Fragment Ac | 8507093659 | 8497696824 |
| Non-Frag Util | 99.8948% | 99.6966% |
| Tiler Ac | 99797640 | 99886410 |
| Tiler Util | 11.7188% | 11.7189% |
| Frag Overdraw | 0 | 0 |
The output above is very simple, and I'm not sure whether it is what you wanted.
Could you suggest which performance counters I should look at?
One thing confuses me: the buffer-style kernel doesn't use any texture, so why is the Tiler Util about 11%?
Can you get a capture of both scenarios with Streamline and share the exported .apc files? The latest Streamline should recommend the counters to use automatically, so the default profile should be fine.