I tested the performance of Mali's texture path (cl_image) against the buffer path (cl_mem), and I found the texture path is slower.
My GPU is a Mali-G76.
I expected textures to be faster than buffers, for example thanks to hardware bilinear filtering,
but my test shows the G76's texture path is about 10-20% slower than the buffer path. My test format is RGBA.
I don't know why.
Would anyone be willing to explain what is going on?
Or is there a standard benchmark program for this?
Hello Peter,
Thanks for your reply.
Let's look at a concrete case: we have a 1080p RGBA image that needs to be resized to 1600x900.
I tested this case in both buffer style and texture style.
The buffer-style code looks like this:
__kernel void image_resize_kernel(__global unsigned char const* src,
                                  __global unsigned char* dst,
                                  __private const int2 src_shape,
                                  __private const int2 dst_shape,
                                  __private const float2 scale_ratio)
{
    const int globalx = get_global_id(0);
    const int globaly = get_global_id(1);
    float src_x_f = scale_ratio.x * (float)globalx;
    float src_y_f = scale_ratio.y * (float)globaly;
    int src_x_i = (int)src_x_f;
    int src_y_i = (int)src_y_f;
    float u = src_y_f - src_y_i;   /* y fraction */
    float v = src_x_f - src_x_i;   /* x fraction */
    uchar4 v00 = vload4( src_y_i      * src_shape.x + src_x_i,     src);
    uchar4 v01 = vload4( src_y_i      * src_shape.x + src_x_i + 1, src);
    uchar4 v10 = vload4((src_y_i + 1) * src_shape.x + src_x_i,     src);
    uchar4 v11 = vload4((src_y_i + 1) * src_shape.x + src_x_i + 1, src);
    float4 v00_f = convert_float4(v00);
    float4 v01_f = convert_float4(v01);
    float4 v10_f = convert_float4(v10);
    float4 v11_f = convert_float4(v11);
    float res_b = (1.0f - u) * (1.0f - v) * v00_f.x + (1.0f - u) * v * v01_f.x
                +         u  * (1.0f - v) * v10_f.x +         u  * v * v11_f.x;
    float res_g = (1.0f - u) * (1.0f - v) * v00_f.y + (1.0f - u) * v * v01_f.y
                +         u  * (1.0f - v) * v10_f.y +         u  * v * v11_f.y;
    float res_r = (1.0f - u) * (1.0f - v) * v00_f.z + (1.0f - u) * v * v01_f.z
                +         u  * (1.0f - v) * v10_f.z +         u  * v * v11_f.z;
    float res_a = 0.0f;
    uchar4 res = convert_uchar4_sat_rte((float4)(res_b, res_g, res_r, res_a));
    vstore4(res, globaly * dst_shape.x + globalx, dst);
}
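As a cross-check, a host-side CPU reference with the same bilinear weights can be used to validate the GPU output pixel-for-pixel. This is a hypothetical helper, not part of the original test; note that it clamps the +1 neighbours at the image edge (the kernel does not, so its last row/column can read out of bounds) and it interpolates all four channels, whereas the kernel zeroes alpha:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU reference for the buffer-style bilinear resize of an RGBA8 image.
// u is the y fraction and v is the x fraction, matching the kernel.
std::vector<uint8_t> resize_bilinear_rgba(const std::vector<uint8_t>& src,
                                          int sw, int sh, int dw, int dh) {
    std::vector<uint8_t> dst(static_cast<size_t>(dw) * dh * 4);
    const float rx = static_cast<float>(sw) / dw;  // scale_ratio.x
    const float ry = static_cast<float>(sh) / dh;  // scale_ratio.y
    for (int y = 0; y < dh; ++y) {
        for (int x = 0; x < dw; ++x) {
            float sx = rx * x, sy = ry * y;
            int xi = static_cast<int>(sx), yi = static_cast<int>(sy);
            float v = sx - xi, u = sy - yi;
            // Clamp the +1 neighbours so edge pixels stay in bounds.
            int xi1 = std::min(xi + 1, sw - 1);
            int yi1 = std::min(yi + 1, sh - 1);
            for (int c = 0; c < 4; ++c) {
                float p00 = src[(yi  * sw + xi ) * 4 + c];
                float p01 = src[(yi  * sw + xi1) * 4 + c];
                float p10 = src[(yi1 * sw + xi ) * 4 + c];
                float p11 = src[(yi1 * sw + xi1) * 4 + c];
                float r = (1 - u) * (1 - v) * p00 + (1 - u) * v * p01
                        +      u  * (1 - v) * p10 +      u  * v * p11;
                dst[(static_cast<size_t>(y) * dw + x) * 4 + c] =
                    static_cast<uint8_t>(std::lround(r));
            }
        }
    }
    return dst;
}
```

(Rounding here is round-half-away-from-zero via `std::lround`, while the kernel's `convert_uchar4_sat_rte` rounds half to even, so exact ties may differ by one.)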
And the texture-style code looks like this:
__constant sampler_t SAMPLER_SRC = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP | CLK_FILTER_LINEAR;
__constant sampler_t SAMPLER_DST = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void image_resize_kernel(__read_only  image2d_t src_img,
                                  __write_only image2d_t dst_img,
                                  __private const int2 src_shape,
                                  __private const int2 dst_shape,
                                  __private const float2 scale_ratio)
{
    const int globalx = get_global_id(0);
    const int globaly = get_global_id(1);
    /* These are source coordinates (renamed from dst_x/dst_y for clarity). */
    float src_x = (float)globalx * scale_ratio.x;
    float src_y = (float)globaly * scale_ratio.y;
    /* Caution: per the OpenCL spec, read_imageui requires a sampler with
     * CLK_FILTER_NEAREST; with CLK_FILTER_LINEAR the returned values are
     * undefined. Hardware bilinear filtering needs read_imagef on a
     * normalized format such as CL_UNORM_INT8, and with unnormalized
     * coordinates a +0.5f offset to sample at texel centres. */
    uint4 dst_val = read_imageui(src_img, SAMPLER_SRC, (float2)(src_x, src_y));
    write_imageui(dst_img, (int2)(globalx, globaly), dst_val);
}
The local work size (lws) is [32, 4] in both cases.
The performance results (average over 1000 runs) are:
buffer style performance: 1095.45600 us
texture style performance: 1039.66600 us
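As a rough sanity check on whether this workload is memory-bound, the effective bandwidth implied by these numbers can be estimated. This assumes, optimistically, one 4-byte read per source pixel and one 4-byte write per destination pixel; bilinear actually fetches four neighbours per output pixel, but cache reuse means most of those fetches never reach DRAM:

```cpp
// Rough effective-bandwidth estimate (GB/s) for an RGBA8 resize,
// assuming the source is read once and the destination written once,
// 4 bytes per pixel. Cache hits and compression change real traffic.
double effective_bandwidth_gbs(int sw, int sh, int dw, int dh,
                               double time_us) {
    double bytes = (double)sw * sh * 4   // source read
                 + (double)dw * dh * 4;  // destination write
    return bytes / (time_us * 1e-6) / 1e9;
}
```

For the texture-style time, `effective_bandwidth_gbs(1920, 1080, 1600, 900, 1039.666)` comes out to roughly 13.5 GB/s, which is in the same ballpark as typical mobile LPDDR4 bandwidth. That supports the memory-bound interpretation: if DRAM traffic is the limiter, a smarter texture cache cannot buy much.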
The timing code for the test above is:
int loop = 1000;
/* Use a monotonic clock for interval timing. */
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < loop; i++) {
    err = clEnqueueNDRangeKernel(cmd_que, ocl_kernel, work_dim, NULL,
                                 gws, lws, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
}
/* Wait for all enqueued kernels to complete before stopping the clock. */
err = clFinish(cmd_que);
if (CL_SUCCESS != err) {
    printf("Error: clFinish returned %s\n", ocl_com::translate_opencl_error(err));
}
auto end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
printf("GPU performance counter time: %lld\n", (long long int)duration.count());
std::size_t total_num = loop;
double avg_time = (double)duration.count() / (double)total_num;
printf("each gpu time: %.5f\n", avg_time);
From the test above, I see no real advantage for the texture style.
I don't know why.
Could you give me some pointers?
For simple downscale operations I would expect this to be memory bound - the shader code probably won't be the issue.
Have you tried getting a profile with Streamline?
I'm confused about why the texture cache does no better than the buffer path.
Furthermore, I assume the texture cache is different from a CPU cache; for example, texels may be stored in a Z-curve (Morton order) layout.
So I would guess that textures should outperform buffers when resizing the same image; in other words, whenever the computation is memory-bound.
I don't understand why the texture advantage on the G76 is so tiny. Could you share some details of how the G76 handles textures?
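To make the Z-curve idea concrete, here is an illustrative Morton-order index function. This is a generic sketch of the concept; Mali's actual internal texture layout is proprietary and not necessarily Morton order. Interleaving the x and y bits keeps 2D-adjacent texels close together in memory, which is why texture caches can show better 2D locality than a row-major buffer:

```cpp
#include <cstdint>

// Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order)
// index: bit i of x lands at bit 2i, bit i of y at bit 2i+1.
// Neighbouring (x, y) pairs map to nearby indices.
uint32_t morton2d(uint32_t x, uint32_t y) {
    auto spread = [](uint32_t v) {
        v &= 0xFFFFu;
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}
```

For example, the 2x2 block (0,0), (1,0), (0,1), (1,1) maps to indices 0, 1, 2, 3, one contiguous run, whereas in a row-major 1920-wide buffer the two rows of that block sit 1920 pixels apart.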
As for Streamline, I haven't tried it yet; let me check it out first.
I guess the texture's performance should be better than buffer
Stop guessing and measure some hard data =)
I don't know why texture's advantage of G76 is tiny ?
... because cache probably isn't the bottleneck.
Hello Peter,
I've run the profiler; the output looks like this:
buffer style:
Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8507093659]; Non-Frag Util [99.8948%]; Tiler Ac [99797640]; Tiler Util [11.7188%]; Frag Overdraw [0];
texture style:
Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8497696824]; Non-Frag Util [99.6966%]; Tiler Ac [99886410]; Tiler Util [11.7189%]; Frag Overdraw [0];
The output above is quite sparse, and I'm not sure it is what you were looking for.
Could you suggest which performance counters are preferred?
One thing confuses me: I didn't use a texture in the buffer style, so why is Tiler Util about 11%?
Can you get a capture of both scenarios with Streamline and share the exported .apc files? The latest Streamline should recommend the counters to use automatically, so the default profile should be fine.