I tested the performance of Mali textures (cl_image) and found it worse than buffers (cl_mem).
My GPU is a Mali-G76.
I expected textures to be faster than buffers, for example for bilinear filtering.
However, my test shows that textures on the G76 are about 10%-20% slower than buffers. My test format is RGBA.
I don't know why.
Could anyone explain what is going on?
Or is there a standard benchmark program I could use?
Thanks for your reply.
I am confused about why the texture cache is no better than the buffer path.
Furthermore, I assume the texture cache is organized differently from a CPU cache, for example in a Z-curve style.
So I would guess that textures should perform better than buffers when both resize the same image.
In other words, I would guess that textures should perform better than buffers when the computation is memory bound.
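To make the "Z-curve" assumption concrete, here is a minimal sketch of Z-order (Morton) indexing next to plain row-major indexing. It only illustrates the general idea; nothing in this thread confirms how the G76 actually lays out images, and the function names and the 1024-pixel width are hypothetical.

```python
def part1by1(v: int) -> int:
    """Spread the low 16 bits of v so that a zero bit sits between each bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton_index(x: int, y: int) -> int:
    """Interleave the bits of x and y (x in the even bit positions)."""
    return part1by1(x) | (part1by1(y) << 1)


def linear_index(x: int, y: int, width: int) -> int:
    """Row-major index, as used by a plain buffer."""
    return y * width + x


# A 2x2 footprint starting at (2, 2), in an image assumed to be 1024 wide.
footprint = [(2, 2), (3, 2), (2, 3), (3, 3)]
print([morton_index(x, y) for x, y in footprint])
print([linear_index(x, y, 1024) for x, y in footprint])
```

In the Z-order case the four neighbouring texels get adjacent indices, while in the row-major case the two rows are a full image width apart. Whether that helps in practice depends on where the real bottleneck is.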
I don't understand why the texture advantage on the G76 is so small. Could you share some details about how textures work on the G76?
As for Streamline, I haven't tried it yet; let me check it first.
I guess the texture's performance should be better than buffer
Stop guessing and measure some hard data =)
I don't know why texture's advantage of G76 is tiny ?
... because cache probably isn't the bottleneck.
Hello Peter,
I've run the profiler, and the output looks like this:
buf style:
Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8507093659]; Non-Frag Util [99.8948%]; Tiler Ac [99797640]; Tiler Util [11.7188%]; Frag Overdraw [0];
texture style:
Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8497696824]; Non-Frag Util [99.6966%]; Tiler Ac [99886410]; Tiler Util [11.7189%]; Frag Overdraw [0];
The output above is very simple, and I'm not sure which data you want.
Could you suggest which performance counters I should capture?
One thing confuses me: I didn't use textures in the buf style, so I don't know why Tiler Util is about 11%.
Can you get a capture of both scenarios with Streamline and share the exported .apc files? The latest Streamline should recommend the counters to use automatically, so the default profile should be fine.
I forgot about this post for a long time.
My test platform is Ubuntu and I have no Streamline license, so I cannot provide a Streamline report.
Instead, I cloned HWCPipe and used its performance-counter SDK.
I sampled the performance counters with HWCPipe, running both the buf_style program and the texture_style program 100 times.
The report for the buf_style program is as follows:
Tile unit write bytes(unit: bytes): 0 Load/store unit bytes written to L2 per access cycle(unit: bytes): 16.0059 Load/store unit write bytes(unit: bytes): 5.76211e+08 Load/store unit write beats to L2 memory system(unit: beats): 3.60132e+07 Texture unit bytes read from external memory per texture cycle(unit: bytes): 0 Texture unit read bytes from external memory(unit: bytes): 0 Texture unit bytes read from L2 per texture cycle(unit: bytes): 0 Texture unit read bytes from L2 cache(unit: bytes): 0 Load/store unit bytes read from external memory per access cycle(unit: bytes): 5.76021 Load/store unit read bytes from external memory(unit: bytes): 8.2947e+08 Load/store unit bytes read from L2 per access cycle(unit: bytes): 8.68622 Load/store unit read bytes from L2 cache(unit: bytes): 1.25082e+09 Front-end unit read bytes from external memory(unit: bytes): 0 Front-end unit read bytes from L2 cache(unit: bytes): 0 Varying unit utilization(unit: percent): 0 Varying unit issue cycles(unit: cycles): 0 16-bit interpolation active cycles(unit: cycles): 0 32-bit interpolation active cycles(unit: cycles): 0 Load/store unit utilization(unit: percent): 40.5665 Load/store unit issue cycles(unit: cycles): 1.8e+08 Load/store unit write issues(unit: cycles): 3.6e+07 Load/store unit read issues(unit: cycles): 1.44e+08 Texture unit issue cycles(unit: cycles): 0 Texture accesses using trilinear filter percentage(unit: percent): 0 Texture data fetches from compressed lines(unit: percent): 0 Texture accesses using mipmapping percentage(unit: percent): 0 Texture unit cache utilization(unit: percent): 0 Texture unit utilization(unit: percent): 0 Texture filtering cycles per instruction(unit: cycles): 0 Texture samples(unit: requests): 0 Arithmetic unit utilization(unit: percent): 51.3823 Warp divergence percentage(unit: percent): 0 Full quad warp rate(unit: percent): 100 All registers warp rate(unit: percent): 0 Fragment threads(unit: threads): 0 Non-fragment threads(unit: threads): 1.44e+08 
Execution core utilization(unit: percent): 99.6935 Unchanged tile kill rate(unit: percent): 0 Fragments per pixel(unit: threads): 0 Late ZS killed thread percentage(unit: percent): 0 Late ZS tested thread percentage(unit: percent): 0 FPK killed quad percentage(unit: percent): 0 FPK killed quads(unit: quads): 0 Early ZS killed quad percentage(unit: percent): 0 Early ZS updated quad percentage(unit: percent): 0 Early ZS tested quad percentage(unit: percent): 0 Partial coverage rate(unit: percent): 0 Shaded coarse quads(unit: quads): 0 Non-occluding quads(unit: quads): 0 Occluding quad percentage(unit: percent): 0 Fragment FPK buffer utilization(unit: percent): 0 Average cycles per fragment thread(unit: cycles): 0 Fragment utilization(unit: percent): 0 Average cycles per non-fragment thread(unit: cycles): 3.0844 Non-fragment utilization(unit: percent): 99.7918 Varying cache hit rate(unit: percent): 0 Varying threads per input primitive(unit: threads): 0 Varying shader thread invocations(unit: threads): 0 Position cache hit rate(unit: percent): 0 Position threads per input primitive(unit: threads): 0 Position shader thread invocations(unit: threads): 0 Sample test cull rate(unit: percent): 0 Z plane test cull rate(unit: percent): 0 Facing or XY plane test cull rate(unit: percent): 0 Culled primitives(unit: primitives): 0 Visible primitives rate(unit: percent): 0 Total input primitives(unit: primitives): 0 Tiler utilization(unit: percent): 11.7131 Output external outstanding writes 75-100%(unit: transactions): 279568 Output external outstanding reads 75-100%(unit: transactions): 16402 Output external read latency 384+ cycles(unit: beats): 2.70392e+06 Output external write stall rate(unit: percent): 2.84916e-05 Output external read stall rate(unit: percent): 3.13676e-05 Output external write bytes(unit: bytes): 5.76017e+08 Output external read bytes(unit: bytes): 8.32275e+08 L2 cache write miss rate(unit: percent): 99.9257 L2 cache read miss rate(unit: percent): 31.8045 
Non-fragment queue utilization(unit: percent): 100 Fragment queue utilization(unit: percent): 0 Interrupt pending utilization(unit: percent): 0.411299 Input external snoop stall cycles(unit: cycles): 0 Input external snoop transactions(unit: transactions): 0 Output external outstanding writes 50-75%(unit: transactions): 1719855 Output external outstanding writes 25-50%(unit: transactions): 3856581 Output external outstanding writes 0-25%(unit: transactions): 3144573 Output external write stall cycles(unit: cycles): 13297053 Output external write beats(unit: beats): 36001046 Output external WriteSnoopPartial transactions(unit: transactions): 0 Output external WriteSnoopFull transactions(unit: transactions): 0 Output external WriteNoSnoopPartial transactions(unit: transactions): 419 Output external WriteNoSnoopFull transactions(unit: transactions): 9000158 Output external write transactions(unit: transactions): 9000577 Output external read latency 320-383 cycles(unit: beats): 1277948 Output external read latency 256-319 cycles(unit: beats): 1511840 Output external read latency 192-255 cycles(unit: beats): 1847742 Output external read latency 128-191 cycles(unit: beats): 1653552 Output external read latency 0-127 cycles(unit: beats): 43022201 Output external outstanding reads 50-75%(unit: transactions): 616081 Output external outstanding reads 25-50%(unit: transactions): 3453399 Output external outstanding reads 0-25%(unit: transactions): 8918419 Output external read stall cycles(unit: cycles): 14639299 Output external read beats(unit: beats): 52017204 Output external ReadUnique transactions(unit: transactions): 0 Output external ReadNoSnoop transactions(unit: transactions): 13004301 Output external read transactions(unit: transactions): 13004301 Input external snoop lookup requests(unit: requests): 0 Write lookup requests(unit: requests): 9007268 Read lookup requests(unit: requests): 40888253 Any lookup requests(unit: requests): 80302664 Output internal write 
requests(unit: requests): 9007109 Output internal read stall cycles(unit: cycles): 202121 Output internal read requests(unit: requests): 26565317 Input internal snoop stall cycles(unit: cycles): 75 Input internal snoop requests(unit: requests): 9222540 Input internal write stall cycles(unit: cycles): 9 Input internal write requests(unit: requests): 565 Input internal read stall cycles(unit: cycles): 8714906 Input internal read requests(unit: requests): 43502875 MMU stage 2 L2 lookup TLB hits(unit: requests): 0 MMU stage 2 L3 lookup TLB hits(unit: requests): 0 MMU stage 2 L2 lookup requests(unit: requests): 0 MMU stage 2 L3 lookup requests(unit: requests): 0 MMU stage 2 lookup requests(unit: requests): 0 MMU L2 lookup TLB hits(unit: requests): 0 MMU L3 lookup TLB hits(unit: requests): 1678862 MMU L2 table read requests(unit: requests): 18 MMU L3 table read requests(unit: requests): 86016 MMU lookup requests(unit: requests): 1925054 Load/store unit write-back write beats(unit: beats): 36013174 Tile unit write beats to L2 memory system(unit: beats): 0 Load/store unit other write beats(unit: beats): 0 Miscellaneous read beats from L2 cache(unit: beats): 21360 Texture unit read beats from external memory(unit: beats): 0 Texture unit read beats from L2 cache(unit: beats): 0 Load/store unit read beats from external memory(unit: beats): 51841884 Load/store unit read beats from L2 cache(unit: beats): 78176006 Fragment front-end read beats from external memory(unit: beats): 0 Fragment front-end read beats from L2 cache(unit: beats): 0 Attribute instructions(unit: instructions): 0 16-bit interpolation slots(unit: issues): 0 32-bit interpolation slots(unit: issues): 0 Varying unit instructions(unit: requests): 0 Load/store unit atomic issues(unit: cycles): 0 Load/store unit partial write issues(unit: cycles): 0 Load/store unit full write issues(unit: cycles): 36000000 Load/store unit partial read issues(unit: cycles): 86400000 Load/store unit full read issues(unit: cycles): 
57600000 Texture filtering cycles(unit: cycles): 0 Texture cache lookup requests(unit: requests): 0 Compressed texture line fetch requests(unit: issues): 0 Texture line fetch requests(unit: issues): 0 Trilinear filtered texture quad issues(unit: issues): 0 Mipmapped texture quad issues(unit: issues): 0 Texture quad descriptor misses(unit: requests): 0 Texture quad issues(unit: issues): 0 Texture quads(unit: quads): 0 Execution engine starvation cycles(unit: cycles): 208198305 Diverged instructions(unit: instructions): 0 Executed instructions(unit: instructions): 227991754 Execution engine active cycles(unit: cycles): 443654938 Execution core active cycles(unit: cycles): 443716416 Non-fragment warps(unit: warps): 18000000 Non-fragment core tasks(unit: tasks): 562500 Non-fragment active cycles(unit: cycles): 444153595 Full quad warps(unit: warps): 18000000 Occluding quads(unit: quads): 0 Killed unchanged tiles(unit: tiles): 0 Tiles(unit: tiles): 0 Warps using more than 32 registers(unit: warps): 0 Late ZS killed quads(unit: quads): 0 Late ZS tested quads(unit: quads): 0 Early ZS killed quads(unit: quads): 0 Early ZS updated quads(unit: quads): 0 Early ZS tested quads(unit: quads): 0 Rasterized fine quads(unit: quads): 0 Partial fragment warps(unit: warps): 0 Fragment warps(unit: warps): 0 Forward pixel kill buffer active cycles(unit: cycles): 0 Rasterized primitives(unit: primitives): 0 Fragment primitives loaded(unit: primitives): 0 Fragment active cycles(unit: cycles): 0 Tiler varying shading stall cycles(unit: cycles): 0 Tiler varying shading requests(unit: requests): 0 Varying cache misses(unit: requests): 0 Varying cache hits(unit: requests): 0 Position cache miss requests(unit: requests): 0 Position cache hit requests(unit: requests): 0 Tiler position FIFO full cycles(unit: cycles): 0 Tiler position shading stall cycles(unit: cycles): 0 Tiler position shading requests(unit: requests): 0 Internal write beats(unit: beats): 0 Output internal read beats(unit: 
beats): 0 Sample test culled primitives(unit: primitives): 0 Z plane culled primitives(unit: primitives): 0 Facing or XY plane test culled primitives(unit: primitives): 0 Visible primitives(unit: primitives): 0 Visible back-facing primitives(unit: primitives): 0 Visible front-facing primitives(unit: primitives): 0 Point primitives(unit: primitives): 0 Line primitives(unit: primitives): 0 Triangle primitives(unit: primitives): 0 Tiler active cycles(unit: cycles): 5213267 L2 cache flush requests(unit: requests): 3 Reserved queue job finish wait cycles(unit: cycles): 0 Reserved queue job dependency wait cycles(unit: cycles): 0 Reserved queue job issue wait cycles(unit: cycles): 0 Reserved queue job descriptor read wait cycles(unit: cycles): 0 Reserved queue cache flush wait cycles(unit: cycles): 0 Reserved active cycles(unit: cycles): 0 Reserved queue tasks(unit: tasks): 0 Reserved queue jobs(unit: jobs): 0 Non-fragment queue job finish wait cycles(unit: cycles): 0 Non-fragment queue job dependency wait cycles(unit: cycles): 0 Non-fragment queue job issue wait cycles(unit: cycles): 42576347 Non-fragment queue job descriptor read wait cycles(unit: cycles): 11080 Non-fragment queue cache flush wait cycles(unit: cycles): 23917 Non-fragment queue active cycles(unit: cycles): 44508037 Non-fragment tasks(unit: tasks): 562500 Non-fragment jobs(unit: jobs): 300 Fragment queue job finish wait cycles(unit: cycles): 0 Fragment queue job dependency wait cycles(unit: cycles): 0 Fragment queue job issue wait cycles(unit: cycles): 0 Fragment queue job descriptor read wait cycles(unit: cycles): 0 Fragment queue cache flush wait cycles(unit: cycles): 0 Fragment queue active cycles(unit: cycles): 0 Fragment tasks(unit: tasks): 0 Fragment jobs(unit: jobs): 0 GPU interrupt pending cycles(unit: cycles): 183061 GPU active cycles(unit: cycles): 44508037
The report for the texture_style program is as follows:
Average cycles per pixel(unit: cycles): inf Pixels(unit: pixels): 0 Tile unit write bytes(unit: bytes): 0 Load/store unit bytes written to L2 per access cycle(unit: bytes): 16 Load/store unit write bytes(unit: bytes): 5.76002e+08 Load/store unit write beats to L2 memory system(unit: beats): 3.60001e+07 Texture unit bytes read from external memory per texture cycle(unit: bytes): 9.60027 Texture unit read bytes from external memory(unit: bytes): 6.91219e+08 Texture unit bytes read from L2 per texture cycle(unit: bytes): 11.1556 Texture unit read bytes from L2 cache(unit: bytes): 8.03204e+08 Load/store unit bytes read from external memory per access cycle(unit: bytes): 0 Load/store unit read bytes from external memory(unit: bytes): 0 Load/store unit bytes read from L2 per access cycle(unit: bytes): inf Load/store unit read bytes from L2 cache(unit: bytes): 1.44001e+08 Front-end unit read bytes from external memory(unit: bytes): 0 Front-end unit read bytes from L2 cache(unit: bytes): 0 Varying unit utilization(unit: percent): 0 Varying unit issue cycles(unit: cycles): 0 16-bit interpolation active cycles(unit: cycles): 0 32-bit interpolation active cycles(unit: cycles): 0 Load/store unit utilization(unit: percent): 8.28342 Load/store unit issue cycles(unit: cycles): 3.6e+07 Load/store unit write issues(unit: cycles): 3.6e+07 Load/store unit read issues(unit: cycles): 0 Texture unit issue cycles(unit: cycles): 7.2e+07 Texture accesses using trilinear filter percentage(unit: percent): 0 Texture data fetches from compressed lines(unit: percent): 0 Texture accesses using mipmapping percentage(unit: percent): 0 Texture unit cache utilization(unit: percent): 8.28342 Texture unit utilization(unit: percent): 16.5668 Texture filtering cycles per instruction(unit: cycles): 0.5 Texture samples(unit: requests): 1.44e+08 Arithmetic unit utilization(unit: percent): 9.66446 Warp divergence percentage(unit: percent): 0 Full quad warp rate(unit: percent): 100 All registers warp 
rate(unit: percent): 0 Fragment threads(unit: threads): 0 Non-fragment threads(unit: threads): 1.44e+08 Execution core utilization(unit: percent): 99.2073 Unchanged tile kill rate(unit: percent): 0 Fragments per pixel(unit: threads): 0 Late ZS killed thread percentage(unit: percent): 0 Late ZS tested thread percentage(unit: percent): 0 FPK killed quad percentage(unit: percent): 0 FPK killed quads(unit: quads): 0 Early ZS killed quad percentage(unit: percent): 0 Early ZS updated quad percentage(unit: percent): 0 Early ZS tested quad percentage(unit: percent): 0 Partial coverage rate(unit: percent): 0 Shaded coarse quads(unit: quads): 0 Non-occluding quads(unit: quads): 0 Occluding quad percentage(unit: percent): 0 Fragment FPK buffer utilization(unit: percent): 0 Average cycles per fragment thread(unit: cycles): 0 Fragment utilization(unit: percent): 0 Average cycles per non-fragment thread(unit: cycles): 3.02251 Non-fragment utilization(unit: percent): 99.353 Varying cache hit rate(unit: percent): 0 Varying threads per input primitive(unit: threads): 0 Varying shader thread invocations(unit: threads): 0 Position cache hit rate(unit: percent): 0 Position threads per input primitive(unit: threads): 0 Position shader thread invocations(unit: threads): 0 Sample test cull rate(unit: percent): 0 Z plane test cull rate(unit: percent): 0 Facing or XY plane test cull rate(unit: percent): 0 Culled primitives(unit: primitives): 0 Visible primitives rate(unit: percent): 0 Total input primitives(unit: primitives): 0 Tiler utilization(unit: percent): 11.7111 Output external outstanding writes 75-100%(unit: transactions): 988013 Output external outstanding reads 75-100%(unit: transactions): 29299 Output external read latency 384+ cycles(unit: beats): 5.59447e+06 Output external write stall rate(unit: percent): 7.06203e-05 Output external read stall rate(unit: percent): 9.7719e-05 Output external write bytes(unit: bytes): 5.76022e+08 Output external read bytes(unit: bytes): 
6.9408e+08 L2 cache write miss rate(unit: percent): 99.4261 L2 cache read miss rate(unit: percent): 18.8492 Non-fragment queue utilization(unit: percent): 100 Fragment queue utilization(unit: percent): 0 Interrupt pending utilization(unit: percent): 0.363565 Input external snoop stall cycles(unit: cycles): 0 Input external snoop transactions(unit: transactions): 0 Output external outstanding writes 50-75%(unit: transactions): 4121746 Output external outstanding writes 25-50%(unit: transactions): 3126050 Output external outstanding writes 0-25%(unit: transactions): 764878 Output external write stall cycles(unit: cycles): 32439841 Output external write beats(unit: beats): 36001395 Output external WriteSnoopPartial transactions(unit: transactions): 0 Output external WriteSnoopFull transactions(unit: transactions): 0 Output external WriteNoSnoopPartial transactions(unit: transactions): 483 Output external WriteNoSnoopFull transactions(unit: transactions): 9000204 Output external write transactions(unit: transactions): 9000687 Output external read latency 320-383 cycles(unit: beats): 1884675 Output external read latency 256-319 cycles(unit: beats): 2197805 Output external read latency 192-255 cycles(unit: beats): 3665942 Output external read latency 128-191 cycles(unit: beats): 11272125 Output external read latency 0-127 cycles(unit: beats): 18764990 Output external outstanding reads 50-75%(unit: transactions): 2084662 Output external outstanding reads 25-50%(unit: transactions): 6831597 Output external outstanding reads 0-25%(unit: transactions): 1899444 Output external read stall cycles(unit: cycles): 44887785 Output external read beats(unit: beats): 43380008 Output external ReadUnique transactions(unit: transactions): 0 Output external ReadNoSnoop transactions(unit: transactions): 10845002 Output external read transactions(unit: transactions): 10845002 Input external snoop lookup requests(unit: requests): 0 Write lookup requests(unit: requests): 9052640 Read lookup 
requests(unit: requests): 57535661 Any lookup requests(unit: requests): 78276549 Output internal write requests(unit: requests): 9000710 Output internal read stall cycles(unit: cycles): 477103 Output internal read requests(unit: requests): 22390126 Input internal snoop stall cycles(unit: cycles): 43327 Input internal snoop requests(unit: requests): 9781454 Input internal write stall cycles(unit: cycles): 9 Input internal write requests(unit: requests): 652 Input internal read stall cycles(unit: cycles): 35951247 Input internal read requests(unit: requests): 21608727 MMU stage 2 L2 lookup TLB hits(unit: requests): 0 MMU stage 2 L3 lookup TLB hits(unit: requests): 0 MMU stage 2 L2 lookup requests(unit: requests): 0 MMU stage 2 L3 lookup requests(unit: requests): 0 MMU stage 2 lookup requests(unit: requests): 0 MMU L2 lookup TLB hits(unit: requests): 0 MMU L3 lookup TLB hits(unit: requests): 1350038 MMU L2 table read requests(unit: requests): 18 MMU L3 table read requests(unit: requests): 86376 MMU lookup requests(unit: requests): 1761166 Load/store unit write-back write beats(unit: beats): 36000112 Tile unit write beats to L2 memory system(unit: beats): 0 Load/store unit other write beats(unit: beats): 0 Miscellaneous read beats from L2 cache(unit: beats): 37000 Texture unit read beats from external memory(unit: beats): 43201208 Texture unit read beats from L2 cache(unit: beats): 50200272 Load/store unit read beats from external memory(unit: beats): 0 Load/store unit read beats from L2 cache(unit: beats): 9000055 Fragment front-end read beats from external memory(unit: beats): 0 Fragment front-end read beats from L2 cache(unit: beats): 0 Attribute instructions(unit: instructions): 36000000 16-bit interpolation slots(unit: issues): 0 32-bit interpolation slots(unit: issues): 0 Varying unit instructions(unit: requests): 0 Load/store unit atomic issues(unit: cycles): 0 Load/store unit partial write issues(unit: cycles): 0 Load/store unit full write issues(unit: cycles): 
36000000 Load/store unit partial read issues(unit: cycles): 0 Load/store unit full read issues(unit: cycles): 0 Texture filtering cycles(unit: cycles): 72000000 Texture cache lookup requests(unit: requests): 36000000 Compressed texture line fetch requests(unit: issues): 0 Texture line fetch requests(unit: issues): 7530006 Trilinear filtered texture quad issues(unit: issues): 0 Mipmapped texture quad issues(unit: issues): 0 Texture quad descriptor misses(unit: requests): 1000 Texture quad issues(unit: issues): 36000000 Texture quads(unit: quads): 36000000 Execution engine starvation cycles(unit: cycles): 380458140 Diverged instructions(unit: instructions): 0 Executed instructions(unit: instructions): 42002044 Execution engine active cycles(unit: cycles): 434556483 Execution core active cycles(unit: cycles): 434603027 Non-fragment warps(unit: warps): 18000000 Non-fragment core tasks(unit: tasks): 562500 Non-fragment active cycles(unit: cycles): 435241489 Full quad warps(unit: warps): 18000000 Occluding quads(unit: quads): 0 Killed unchanged tiles(unit: tiles): 0 Tiles(unit: tiles): 0 Warps using more than 32 registers(unit: warps): 0 Late ZS killed quads(unit: quads): 0 Late ZS tested quads(unit: quads): 0 Early ZS killed quads(unit: quads): 0 Early ZS updated quads(unit: quads): 0 Early ZS tested quads(unit: quads): 0 Rasterized fine quads(unit: quads): 0 Partial fragment warps(unit: warps): 0 Fragment warps(unit: warps): 0 Forward pixel kill buffer active cycles(unit: cycles): 0 Rasterized primitives(unit: primitives): 0 Fragment primitives loaded(unit: primitives): 0 Fragment active cycles(unit: cycles): 0 Tiler varying shading stall cycles(unit: cycles): 0 Tiler varying shading requests(unit: requests): 0 Varying cache misses(unit: requests): 0 Varying cache hits(unit: requests): 0 Position cache miss requests(unit: requests): 0 Position cache hit requests(unit: requests): 0 Tiler position FIFO full cycles(unit: cycles): 0 Tiler position shading stall 
cycles(unit: cycles): 0 Tiler position shading requests(unit: requests): 0 Internal write beats(unit: beats): 0 Output internal read beats(unit: beats): 0 Sample test culled primitives(unit: primitives): 0 Z plane culled primitives(unit: primitives): 0 Facing or XY plane test culled primitives(unit: primitives): 0 Visible primitives(unit: primitives): 0 Visible back-facing primitives(unit: primitives): 0 Visible front-facing primitives(unit: primitives): 0 Point primitives(unit: primitives): 0 Line primitives(unit: primitives): 0 Triangle primitives(unit: primitives): 0 Tiler active cycles(unit: cycles): 5130350 L2 cache flush requests(unit: requests): 3 Reserved queue job finish wait cycles(unit: cycles): 0 Reserved queue job dependency wait cycles(unit: cycles): 0 Reserved queue job issue wait cycles(unit: cycles): 0 Reserved queue job descriptor read wait cycles(unit: cycles): 0 Reserved queue cache flush wait cycles(unit: cycles): 0 Reserved active cycles(unit: cycles): 0 Reserved queue tasks(unit: tasks): 0 Reserved queue jobs(unit: jobs): 0 Non-fragment queue job finish wait cycles(unit: cycles): 0 Non-fragment queue job dependency wait cycles(unit: cycles): 0 Non-fragment queue job issue wait cycles(unit: cycles): 41675417 Non-fragment queue job descriptor read wait cycles(unit: cycles): 15135 Non-fragment queue cache flush wait cycles(unit: cycles): 221667 Non-fragment queue active cycles(unit: cycles): 43807587 Non-fragment tasks(unit: tasks): 562500 Non-fragment jobs(unit: jobs): 400 Fragment queue job finish wait cycles(unit: cycles): 0 Fragment queue job dependency wait cycles(unit: cycles): 0 Fragment queue job issue wait cycles(unit: cycles): 0 Fragment queue job descriptor read wait cycles(unit: cycles): 0 Fragment queue cache flush wait cycles(unit: cycles): 0 Fragment queue active cycles(unit: cycles): 0 Fragment tasks(unit: tasks): 0 Fragment jobs(unit: jobs): 0 GPU interrupt pending cycles(unit: cycles): 159269 GPU active cycles(unit: cycles): 
43807587
Could you help me analyze the reports above? Why doesn't the texture_style program show any obvious advantage?
Shaquille.Wu said: My test platform is ubuntu, and I have no license of streamline, so, I cannot provide the report of streamline.
If you download Arm Performance Studio, you shouldn't need a license - we made Arm Linux support part of the free-of-charge bundle, so no Streamline feature is license-managed any more.
Texture unit bytes read from external memory per texture cycle(unit: bytes): 9.60027
Texture unit bytes read from L2 per texture cycle(unit: bytes): 11.1556
As expected for a downscale, these are all quite high "per clock" numbers. I'd expect a lot of this is just going to be down to differences in access pattern, which is going to be hard to diagnose from the counters.
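For reference, the "per clock" values appear to be simple ratios of raw counters that are also in the two reports. The sketch below recomputes them from the posted numbers. Which denominator belongs to each metric is an assumption inferred from the values matching, not something taken from the HWCPipe documentation.

```python
# Raw counters copied from the two reports above (totals over 100 runs).
tex_ext_read_bytes = 6.91219e8   # Texture unit read bytes from external memory
tex_l2_read_bytes = 8.03204e8    # Texture unit read bytes from L2 cache
tex_issue_cycles = 7.2e7         # Texture unit issue cycles

ls_ext_read_bytes = 8.2947e8     # Load/store unit read bytes from external memory
ls_l2_read_bytes = 1.25082e9     # Load/store unit read bytes from L2 cache
ls_read_issues = 1.44e8          # Load/store unit read issues


def bytes_per_cycle(total_bytes: float, cycles: float) -> float:
    """Average bytes moved per issue cycle of the unit."""
    return total_bytes / cycles


tex_ext = bytes_per_cycle(tex_ext_read_bytes, tex_issue_cycles)
tex_l2 = bytes_per_cycle(tex_l2_read_bytes, tex_issue_cycles)
buf_ext = bytes_per_cycle(ls_ext_read_bytes, ls_read_issues)
buf_l2 = bytes_per_cycle(ls_l2_read_bytes, ls_read_issues)

print(f"texture: {tex_ext:.4f} B/cycle external, {tex_l2:.4f} B/cycle L2")
print(f"buffer:  {buf_ext:.4f} B/cycle external, {buf_l2:.4f} B/cycle L2")
```

The results agree with the reported 9.60027 / 11.1556 (texture) and 5.76021 / 8.68622 (buffer) to within the rounding of the printed inputs.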
Hi Peter, thanks for your reply.
I don't think I understand your analysis correctly:
1. Do you mean that the two performance counters you quoted should be quite high, but the values in my report are very low, which is abnormal? Is that right?
Furthermore, what are the normal values?
2. I don't understand "access pattern". I don't think I can specify an access pattern in OpenCL; could you explain it further?
3. I don't understand these two performance counters: Non-fragment tasks(unit: tasks) and Non-fragment jobs(unit: jobs).
buf_style: Non-fragment tasks(unit: tasks): 562500, Non-fragment jobs(unit: jobs): 300
texture_style: Non-fragment tasks(unit: tasks): 562500, Non-fragment jobs(unit: jobs): 400
1). How are they calculated for an OpenCL program?
2). I run only one kernel in both the buf_style and texture_style programs, but the reports show 3 and 4 non-fragment jobs per run instead of 1:
buf_style: Non-fragment jobs(unit: jobs): 300
texture_style: Non-fragment jobs(unit: jobs): 400
I ran both tests 100 times, so the reports show 300 and 400. Why not 100?
Shaquille.Wu said:1. you mean, the two performance counters which you prompt should be quite high, but, my report is very low,
Your bytes-per-access value is high, so it is in line with expectations for a downscale.
Shaquille.Wu said: I cannot understand the "access pattern". I think I cannot specify "access pattern" in OpenCL, would you like to explain it furthermore?
Correct, you can't control it. But buffers and textures may have different memory layouts, and so have different access patterns.
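One place where the two runs do look different in the posted reports is the external read latency histogram. The sketch below just normalizes the posted bucket counts into shares. It is arithmetic on the numbers above, not a diagnosis of the cause; the dictionary names are made up for illustration.

```python
# "Output external read latency" buckets (beats), copied from the reports.
buf_latency = {
    "0-127": 43022201,
    "128-191": 1653552,
    "192-255": 1847742,
    "256-319": 1511840,
    "320-383": 1277948,
    "384+": 2.70392e6,   # printed in scientific notation, so rounded
}
tex_latency = {
    "0-127": 18764990,
    "128-191": 11272125,
    "192-255": 3665942,
    "256-319": 2197805,
    "320-383": 1884675,
    "384+": 5.59447e6,   # printed in scientific notation, so rounded
}


def shares(hist: dict) -> dict:
    """Convert bucket counts into fractions of all read beats."""
    total = sum(hist.values())
    return {bucket: count / total for bucket, count in hist.items()}


buf_shares = shares(buf_latency)
tex_shares = shares(tex_latency)
for bucket in buf_latency:
    print(f"{bucket:>8}: buf {buf_shares[bucket]:6.1%}  tex {tex_shares[bucket]:6.1%}")
```

With these inputs roughly 83% of the buffer run's read beats fall in the 0-127 cycle bucket, versus roughly 43% for the texture run, even though the texture run reads fewer bytes from external memory overall.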
Shaquille.Wu said: 1). how to caculate them if my program is OpenCL?
You can't.
You will get at least one Job per compute dispatch, but may get more as the driver generates small jobs for some management activities.
Tasks are somewhat meaningless to an application developer. For compute workloads a task is some multiple of the workgroup size, but the exact scaling is chosen by the driver and depends on the system configuration.
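As a rough illustration of how these counters relate to each other, the sketch below divides the posted totals by the 100 runs and by each other. The thread count is taken as exactly 1.44e8, which is an assumption because the report prints it rounded; the workgroup size of the kernel is not stated in this thread, so no claim is made about how many workgroups make up a task.

```python
RUNS = 100  # each program was run 100 times, per the post above

# Totals copied from the reports (identical for both programs except jobs).
non_fragment_threads = 144_000_000   # printed as 1.44e+08
non_fragment_warps = 18_000_000
non_fragment_tasks = 562_500
jobs = {"buf_style": 300, "texture_style": 400}

threads_per_run = non_fragment_threads // RUNS
tasks_per_run = non_fragment_tasks // RUNS
threads_per_task = non_fragment_threads // non_fragment_tasks
threads_per_warp = non_fragment_threads // non_fragment_warps
jobs_per_run = {name: count // RUNS for name, count in jobs.items()}

print("threads per run: ", threads_per_run)
print("tasks per run:   ", tasks_per_run)
print("threads per task:", threads_per_task)
print("threads per warp:", threads_per_warp)
print("jobs per run:    ", jobs_per_run)
```

So each run is split into 5625 tasks of 256 threads, and the 3 and 4 jobs per run are one compute dispatch plus the extra driver-generated jobs described above.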
Shaquille.Wu said: 2). I only run one kernel both in buf_style program and texture_style program, but, the Non-fragment jobs in reports are 3 and 4, instead of 1
As above, you will get at least one job per compute dispatch, but may get more.