
Why are Mali textures slower than buffers?

I tested the performance of textures (cl_image) on Mali and found that they are slower than buffers (cl_mem).

My GPU is a Mali-G76.

I expected textures to be faster than buffers, for example for bilinear filtering.

However, my test shows that textures on the G76 are about 10%-20% slower than buffers. My test format is RGBA.

I don't know why.

Can anyone explain the reason?

Also, is there a standard benchmark program for this?

Parents
  • Hello, Peter

    I've run the profiler, and the output looks like this:

    buf style:

    Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8507093659]; Non-Frag Util [99.8948%]; Tiler Ac [99797640]; Tiler Util [11.7188%]; Frag Overdraw [0];

    texture style:

    Fragment Ac [0]; Fragment Util [0%]; Non-Fragment Ac [8497696824]; Non-Frag Util [99.6966%]; Tiler Ac [99886410]; Tiler Util [11.7189%]; Frag Overdraw [0];

    The output above is very simple, and I'm not sure what information you need.

    Could you suggest which performance counters I should capture?

    One thing confuses me: I don't use textures in the buf style, so why is Tiler Util about 11%?

Children
  • Can you get a capture of both scenarios with Streamline and share the exported .apc files? The latest Streamline should recommend the counters to use automatically, so the default profile should be fine.

  • Hello, Peter

    Sorry, I forgot about this post for a long time.

    My test platform is Ubuntu and I have no Streamline license, so I cannot provide a Streamline report.

    Instead, I cloned HWCPipe and used its performance counter SDK.

    I sampled the performance counters with HWCPipe, running both the buf_style program and the texture_style program 100 times.

    The report for the buf_style program is as follows:

    Tile unit write bytes(unit: bytes): 0
    Load/store unit bytes written to L2 per access cycle(unit: bytes): 16.0059
    Load/store unit write bytes(unit: bytes): 5.76211e+08
    Load/store unit write beats to L2 memory system(unit: beats): 3.60132e+07
    Texture unit bytes read from external memory per texture cycle(unit: bytes): 0
    Texture unit read bytes from external memory(unit: bytes): 0
    Texture unit bytes read from L2 per texture cycle(unit: bytes): 0
    Texture unit read bytes from L2 cache(unit: bytes): 0
    Load/store unit bytes read from external memory per access cycle(unit: bytes): 5.76021
    Load/store unit read bytes from external memory(unit: bytes): 8.2947e+08
    Load/store unit bytes read from L2 per access cycle(unit: bytes): 8.68622
    Load/store unit read bytes from L2 cache(unit: bytes): 1.25082e+09
    Front-end unit read bytes from external memory(unit: bytes): 0
    Front-end unit read bytes from L2 cache(unit: bytes): 0
    Varying unit utilization(unit: percent): 0
    Varying unit issue cycles(unit: cycles): 0
    16-bit interpolation active cycles(unit: cycles): 0
    32-bit interpolation active cycles(unit: cycles): 0
    Load/store unit utilization(unit: percent): 40.5665
    Load/store unit issue cycles(unit: cycles): 1.8e+08
    Load/store unit write issues(unit: cycles): 3.6e+07
    Load/store unit read issues(unit: cycles): 1.44e+08
    Texture unit issue cycles(unit: cycles): 0
    Texture accesses using trilinear filter percentage(unit: percent): 0
    Texture data fetches from compressed lines(unit: percent): 0
    Texture accesses using mipmapping percentage(unit: percent): 0
    Texture unit cache utilization(unit: percent): 0
    Texture unit utilization(unit: percent): 0
    Texture filtering cycles per instruction(unit: cycles): 0
    Texture samples(unit: requests): 0
    Arithmetic unit utilization(unit: percent): 51.3823
    Warp divergence percentage(unit: percent): 0
    Full quad warp rate(unit: percent): 100
    All registers warp rate(unit: percent): 0
    Fragment threads(unit: threads): 0
    Non-fragment threads(unit: threads): 1.44e+08
    Execution core utilization(unit: percent): 99.6935
    Unchanged tile kill rate(unit: percent): 0
    Fragments per pixel(unit: threads): 0
    Late ZS killed thread percentage(unit: percent): 0
    Late ZS tested thread percentage(unit: percent): 0
    FPK killed quad percentage(unit: percent): 0
    FPK killed quads(unit: quads): 0
    Early ZS killed quad percentage(unit: percent): 0
    Early ZS updated quad percentage(unit: percent): 0
    Early ZS tested quad percentage(unit: percent): 0
    Partial coverage rate(unit: percent): 0
    Shaded coarse quads(unit: quads): 0
    Non-occluding quads(unit: quads): 0
    Occluding quad percentage(unit: percent): 0
    Fragment FPK buffer utilization(unit: percent): 0
    Average cycles per fragment thread(unit: cycles): 0
    Fragment utilization(unit: percent): 0
    Average cycles per non-fragment thread(unit: cycles): 3.0844
    Non-fragment utilization(unit: percent): 99.7918
    Varying cache hit rate(unit: percent): 0
    Varying threads per input primitive(unit: threads): 0
    Varying shader thread invocations(unit: threads): 0
    Position cache hit rate(unit: percent): 0
    Position threads per input primitive(unit: threads): 0
    Position shader thread invocations(unit: threads): 0
    Sample test cull rate(unit: percent): 0
    Z plane test cull rate(unit: percent): 0
    Facing or XY plane test cull rate(unit: percent): 0
    Culled primitives(unit: primitives): 0
    Visible primitives rate(unit: percent): 0
    Total input primitives(unit: primitives): 0
    Tiler utilization(unit: percent): 11.7131
    Output external outstanding writes 75-100%(unit: transactions): 279568
    Output external outstanding reads 75-100%(unit: transactions): 16402
    Output external read latency 384+ cycles(unit: beats): 2.70392e+06
    Output external write stall rate(unit: percent): 2.84916e-05
    Output external read stall rate(unit: percent): 3.13676e-05
    Output external write bytes(unit: bytes): 5.76017e+08
    Output external read bytes(unit: bytes): 8.32275e+08
    L2 cache write miss rate(unit: percent): 99.9257
    L2 cache read miss rate(unit: percent): 31.8045
    Non-fragment queue utilization(unit: percent): 100
    Fragment queue utilization(unit: percent): 0
    Interrupt pending utilization(unit: percent): 0.411299
    Input external snoop stall cycles(unit: cycles): 0
    Input external snoop transactions(unit: transactions): 0
    Output external outstanding writes 50-75%(unit: transactions): 1719855
    Output external outstanding writes 25-50%(unit: transactions): 3856581
    Output external outstanding writes 0-25%(unit: transactions): 3144573
    Output external write stall cycles(unit: cycles): 13297053
    Output external write beats(unit: beats): 36001046
    Output external WriteSnoopPartial transactions(unit: transactions): 0
    Output external WriteSnoopFull transactions(unit: transactions): 0
    Output external WriteNoSnoopPartial transactions(unit: transactions): 419
    Output external WriteNoSnoopFull transactions(unit: transactions): 9000158
    Output external write transactions(unit: transactions): 9000577
    Output external read latency 320-383 cycles(unit: beats): 1277948
    Output external read latency 256-319 cycles(unit: beats): 1511840
    Output external read latency 192-255 cycles(unit: beats): 1847742
    Output external read latency 128-191 cycles(unit: beats): 1653552
    Output external read latency 0-127 cycles(unit: beats): 43022201
    Output external outstanding reads 50-75%(unit: transactions): 616081
    Output external outstanding reads 25-50%(unit: transactions): 3453399
    Output external outstanding reads 0-25%(unit: transactions): 8918419
    Output external read stall cycles(unit: cycles): 14639299
    Output external read beats(unit: beats): 52017204
    Output external ReadUnique transactions(unit: transactions): 0
    Output external ReadNoSnoop transactions(unit: transactions): 13004301
    Output external read transactions(unit: transactions): 13004301
    Input external snoop lookup requests(unit: requests): 0
    Write lookup requests(unit: requests): 9007268
    Read lookup requests(unit: requests): 40888253
    Any lookup requests(unit: requests): 80302664
    Output internal write requests(unit: requests): 9007109
    Output internal read stall cycles(unit: cycles): 202121
    Output internal read requests(unit: requests): 26565317
    Input internal snoop stall cycles(unit: cycles): 75
    Input internal snoop requests(unit: requests): 9222540
    Input internal write stall cycles(unit: cycles): 9
    Input internal write requests(unit: requests): 565
    Input internal read stall cycles(unit: cycles): 8714906
    Input internal read requests(unit: requests): 43502875
    MMU stage 2 L2 lookup TLB hits(unit: requests): 0
    MMU stage 2 L3 lookup TLB hits(unit: requests): 0
    MMU stage 2 L2 lookup requests(unit: requests): 0
    MMU stage 2 L3 lookup requests(unit: requests): 0
    MMU stage 2 lookup requests(unit: requests): 0
    MMU L2 lookup TLB hits(unit: requests): 0
    MMU L3 lookup TLB hits(unit: requests): 1678862
    MMU L2 table read requests(unit: requests): 18
    MMU L3 table read requests(unit: requests): 86016
    MMU lookup requests(unit: requests): 1925054
    Load/store unit write-back write beats(unit: beats): 36013174
    Tile unit write beats to L2 memory system(unit: beats): 0
    Load/store unit other write beats(unit: beats): 0
    Miscellaneous read beats from L2 cache(unit: beats): 21360
    Texture unit read beats from external memory(unit: beats): 0
    Texture unit read beats from L2 cache(unit: beats): 0
    Load/store unit read beats from external memory(unit: beats): 51841884
    Load/store unit read beats from L2 cache(unit: beats): 78176006
    Fragment front-end read beats from external memory(unit: beats): 0
    Fragment front-end read beats from L2 cache(unit: beats): 0
    Attribute instructions(unit: instructions): 0
    16-bit interpolation slots(unit: issues): 0
    32-bit interpolation slots(unit: issues): 0
    Varying unit instructions(unit: requests): 0
    Load/store unit atomic issues(unit: cycles): 0
    Load/store unit partial write issues(unit: cycles): 0
    Load/store unit full write issues(unit: cycles): 36000000
    Load/store unit partial read issues(unit: cycles): 86400000
    Load/store unit full read issues(unit: cycles): 57600000
    Texture filtering cycles(unit: cycles): 0
    Texture cache lookup requests(unit: requests): 0
    Compressed texture line fetch requests(unit: issues): 0
    Texture line fetch requests(unit: issues): 0
    Trilinear filtered texture quad issues(unit: issues): 0
    Mipmapped texture quad issues(unit: issues): 0
    Texture quad descriptor misses(unit: requests): 0
    Texture quad issues(unit: issues): 0
    Texture quads(unit: quads): 0
    Execution engine starvation cycles(unit: cycles): 208198305
    Diverged instructions(unit: instructions): 0
    Executed instructions(unit: instructions): 227991754
    Execution engine active cycles(unit: cycles): 443654938
    Execution core active cycles(unit: cycles): 443716416
    Non-fragment warps(unit: warps): 18000000
    Non-fragment core tasks(unit: tasks): 562500
    Non-fragment active cycles(unit: cycles): 444153595
    Full quad warps(unit: warps): 18000000
    Occluding quads(unit: quads): 0
    Killed unchanged tiles(unit: tiles): 0
    Tiles(unit: tiles): 0
    Warps using more than 32 registers(unit: warps): 0
    Late ZS killed quads(unit: quads): 0
    Late ZS tested quads(unit: quads): 0
    Early ZS killed quads(unit: quads): 0
    Early ZS updated quads(unit: quads): 0
    Early ZS tested quads(unit: quads): 0
    Rasterized fine quads(unit: quads): 0
    Partial fragment warps(unit: warps): 0
    Fragment warps(unit: warps): 0
    Forward pixel kill buffer active cycles(unit: cycles): 0
    Rasterized primitives(unit: primitives): 0
    Fragment primitives loaded(unit: primitives): 0
    Fragment active cycles(unit: cycles): 0
    Tiler varying shading stall cycles(unit: cycles): 0
    Tiler varying shading requests(unit: requests): 0
    Varying cache misses(unit: requests): 0
    Varying cache hits(unit: requests): 0
    Position cache miss requests(unit: requests): 0
    Position cache hit requests(unit: requests): 0
    Tiler position FIFO full cycles(unit: cycles): 0
    Tiler position shading stall cycles(unit: cycles): 0
    Tiler position shading requests(unit: requests): 0
    Internal write beats(unit: beats): 0
    Output internal read beats(unit: beats): 0
    Sample test culled primitives(unit: primitives): 0
    Z plane culled primitives(unit: primitives): 0
    Facing or XY plane test culled primitives(unit: primitives): 0
    Visible primitives(unit: primitives): 0
    Visible back-facing primitives(unit: primitives): 0
    Visible front-facing primitives(unit: primitives): 0
    Point primitives(unit: primitives): 0
    Line primitives(unit: primitives): 0
    Triangle primitives(unit: primitives): 0
    Tiler active cycles(unit: cycles): 5213267
    L2 cache flush requests(unit: requests): 3
    Reserved queue job finish wait cycles(unit: cycles): 0
    Reserved queue job dependency wait cycles(unit: cycles): 0
    Reserved queue job issue wait cycles(unit: cycles): 0
    Reserved queue job descriptor read wait cycles(unit: cycles): 0
    Reserved queue cache flush wait cycles(unit: cycles): 0
    Reserved active cycles(unit: cycles): 0
    Reserved queue tasks(unit: tasks): 0
    Reserved queue jobs(unit: jobs): 0
    Non-fragment queue job finish wait cycles(unit: cycles): 0
    Non-fragment queue job dependency wait cycles(unit: cycles): 0
    Non-fragment queue job issue wait cycles(unit: cycles): 42576347
    Non-fragment queue job descriptor read wait cycles(unit: cycles): 11080
    Non-fragment queue cache flush wait cycles(unit: cycles): 23917
    Non-fragment queue active cycles(unit: cycles): 44508037
    Non-fragment tasks(unit: tasks): 562500
    Non-fragment jobs(unit: jobs): 300
    Fragment queue job finish wait cycles(unit: cycles): 0
    Fragment queue job dependency wait cycles(unit: cycles): 0
    Fragment queue job issue wait cycles(unit: cycles): 0
    Fragment queue job descriptor read wait cycles(unit: cycles): 0
    Fragment queue cache flush wait cycles(unit: cycles): 0
    Fragment queue active cycles(unit: cycles): 0
    Fragment tasks(unit: tasks): 0
    Fragment jobs(unit: jobs): 0
    GPU interrupt pending cycles(unit: cycles): 183061
    GPU active cycles(unit: cycles): 44508037

     

    The report for the texture_style program is as follows:

    Average cycles per pixel(unit: cycles): inf
    Pixels(unit: pixels): 0
    Tile unit write bytes(unit: bytes): 0
    Load/store unit bytes written to L2 per access cycle(unit: bytes): 16
    Load/store unit write bytes(unit: bytes): 5.76002e+08
    Load/store unit write beats to L2 memory system(unit: beats): 3.60001e+07
    Texture unit bytes read from external memory per texture cycle(unit: bytes): 9.60027
    Texture unit read bytes from external memory(unit: bytes): 6.91219e+08
    Texture unit bytes read from L2 per texture cycle(unit: bytes): 11.1556
    Texture unit read bytes from L2 cache(unit: bytes): 8.03204e+08
    Load/store unit bytes read from external memory per access cycle(unit: bytes): 0
    Load/store unit read bytes from external memory(unit: bytes): 0
    Load/store unit bytes read from L2 per access cycle(unit: bytes): inf
    Load/store unit read bytes from L2 cache(unit: bytes): 1.44001e+08
    Front-end unit read bytes from external memory(unit: bytes): 0
    Front-end unit read bytes from L2 cache(unit: bytes): 0
    Varying unit utilization(unit: percent): 0
    Varying unit issue cycles(unit: cycles): 0
    16-bit interpolation active cycles(unit: cycles): 0
    32-bit interpolation active cycles(unit: cycles): 0
    Load/store unit utilization(unit: percent): 8.28342
    Load/store unit issue cycles(unit: cycles): 3.6e+07
    Load/store unit write issues(unit: cycles): 3.6e+07
    Load/store unit read issues(unit: cycles): 0
    Texture unit issue cycles(unit: cycles): 7.2e+07
    Texture accesses using trilinear filter percentage(unit: percent): 0
    Texture data fetches from compressed lines(unit: percent): 0
    Texture accesses using mipmapping percentage(unit: percent): 0
    Texture unit cache utilization(unit: percent): 8.28342
    Texture unit utilization(unit: percent): 16.5668
    Texture filtering cycles per instruction(unit: cycles): 0.5
    Texture samples(unit: requests): 1.44e+08
    Arithmetic unit utilization(unit: percent): 9.66446
    Warp divergence percentage(unit: percent): 0
    Full quad warp rate(unit: percent): 100
    All registers warp rate(unit: percent): 0
    Fragment threads(unit: threads): 0
    Non-fragment threads(unit: threads): 1.44e+08
    Execution core utilization(unit: percent): 99.2073
    Unchanged tile kill rate(unit: percent): 0
    Fragments per pixel(unit: threads): 0
    Late ZS killed thread percentage(unit: percent): 0
    Late ZS tested thread percentage(unit: percent): 0
    FPK killed quad percentage(unit: percent): 0
    FPK killed quads(unit: quads): 0
    Early ZS killed quad percentage(unit: percent): 0
    Early ZS updated quad percentage(unit: percent): 0
    Early ZS tested quad percentage(unit: percent): 0
    Partial coverage rate(unit: percent): 0
    Shaded coarse quads(unit: quads): 0
    Non-occluding quads(unit: quads): 0
    Occluding quad percentage(unit: percent): 0
    Fragment FPK buffer utilization(unit: percent): 0
    Average cycles per fragment thread(unit: cycles): 0
    Fragment utilization(unit: percent): 0
    Average cycles per non-fragment thread(unit: cycles): 3.02251
    Non-fragment utilization(unit: percent): 99.353
    Varying cache hit rate(unit: percent): 0
    Varying threads per input primitive(unit: threads): 0
    Varying shader thread invocations(unit: threads): 0
    Position cache hit rate(unit: percent): 0
    Position threads per input primitive(unit: threads): 0
    Position shader thread invocations(unit: threads): 0
    Sample test cull rate(unit: percent): 0
    Z plane test cull rate(unit: percent): 0
    Facing or XY plane test cull rate(unit: percent): 0
    Culled primitives(unit: primitives): 0
    Visible primitives rate(unit: percent): 0
    Total input primitives(unit: primitives): 0
    Tiler utilization(unit: percent): 11.7111
    Output external outstanding writes 75-100%(unit: transactions): 988013
    Output external outstanding reads 75-100%(unit: transactions): 29299
    Output external read latency 384+ cycles(unit: beats): 5.59447e+06
    Output external write stall rate(unit: percent): 7.06203e-05
    Output external read stall rate(unit: percent): 9.7719e-05
    Output external write bytes(unit: bytes): 5.76022e+08
    Output external read bytes(unit: bytes): 6.9408e+08
    L2 cache write miss rate(unit: percent): 99.4261
    L2 cache read miss rate(unit: percent): 18.8492
    Non-fragment queue utilization(unit: percent): 100
    Fragment queue utilization(unit: percent): 0
    Interrupt pending utilization(unit: percent): 0.363565
    Input external snoop stall cycles(unit: cycles): 0
    Input external snoop transactions(unit: transactions): 0
    Output external outstanding writes 50-75%(unit: transactions): 4121746
    Output external outstanding writes 25-50%(unit: transactions): 3126050
    Output external outstanding writes 0-25%(unit: transactions): 764878
    Output external write stall cycles(unit: cycles): 32439841
    Output external write beats(unit: beats): 36001395
    Output external WriteSnoopPartial transactions(unit: transactions): 0
    Output external WriteSnoopFull transactions(unit: transactions): 0
    Output external WriteNoSnoopPartial transactions(unit: transactions): 483
    Output external WriteNoSnoopFull transactions(unit: transactions): 9000204
    Output external write transactions(unit: transactions): 9000687
    Output external read latency 320-383 cycles(unit: beats): 1884675
    Output external read latency 256-319 cycles(unit: beats): 2197805
    Output external read latency 192-255 cycles(unit: beats): 3665942
    Output external read latency 128-191 cycles(unit: beats): 11272125
    Output external read latency 0-127 cycles(unit: beats): 18764990
    Output external outstanding reads 50-75%(unit: transactions): 2084662
    Output external outstanding reads 25-50%(unit: transactions): 6831597
    Output external outstanding reads 0-25%(unit: transactions): 1899444
    Output external read stall cycles(unit: cycles): 44887785
    Output external read beats(unit: beats): 43380008
    Output external ReadUnique transactions(unit: transactions): 0
    Output external ReadNoSnoop transactions(unit: transactions): 10845002
    Output external read transactions(unit: transactions): 10845002
    Input external snoop lookup requests(unit: requests): 0
    Write lookup requests(unit: requests): 9052640
    Read lookup requests(unit: requests): 57535661
    Any lookup requests(unit: requests): 78276549
    Output internal write requests(unit: requests): 9000710
    Output internal read stall cycles(unit: cycles): 477103
    Output internal read requests(unit: requests): 22390126
    Input internal snoop stall cycles(unit: cycles): 43327
    Input internal snoop requests(unit: requests): 9781454
    Input internal write stall cycles(unit: cycles): 9
    Input internal write requests(unit: requests): 652
    Input internal read stall cycles(unit: cycles): 35951247
    Input internal read requests(unit: requests): 21608727
    MMU stage 2 L2 lookup TLB hits(unit: requests): 0
    MMU stage 2 L3 lookup TLB hits(unit: requests): 0
    MMU stage 2 L2 lookup requests(unit: requests): 0
    MMU stage 2 L3 lookup requests(unit: requests): 0
    MMU stage 2 lookup requests(unit: requests): 0
    MMU L2 lookup TLB hits(unit: requests): 0
    MMU L3 lookup TLB hits(unit: requests): 1350038
    MMU L2 table read requests(unit: requests): 18
    MMU L3 table read requests(unit: requests): 86376
    MMU lookup requests(unit: requests): 1761166
    Load/store unit write-back write beats(unit: beats): 36000112
    Tile unit write beats to L2 memory system(unit: beats): 0
    Load/store unit other write beats(unit: beats): 0
    Miscellaneous read beats from L2 cache(unit: beats): 37000
    Texture unit read beats from external memory(unit: beats): 43201208
    Texture unit read beats from L2 cache(unit: beats): 50200272
    Load/store unit read beats from external memory(unit: beats): 0
    Load/store unit read beats from L2 cache(unit: beats): 9000055
    Fragment front-end read beats from external memory(unit: beats): 0
    Fragment front-end read beats from L2 cache(unit: beats): 0
    Attribute instructions(unit: instructions): 36000000
    16-bit interpolation slots(unit: issues): 0
    32-bit interpolation slots(unit: issues): 0
    Varying unit instructions(unit: requests): 0
    Load/store unit atomic issues(unit: cycles): 0
    Load/store unit partial write issues(unit: cycles): 0
    Load/store unit full write issues(unit: cycles): 36000000
    Load/store unit partial read issues(unit: cycles): 0
    Load/store unit full read issues(unit: cycles): 0
    Texture filtering cycles(unit: cycles): 72000000
    Texture cache lookup requests(unit: requests): 36000000
    Compressed texture line fetch requests(unit: issues): 0
    Texture line fetch requests(unit: issues): 7530006
    Trilinear filtered texture quad issues(unit: issues): 0
    Mipmapped texture quad issues(unit: issues): 0
    Texture quad descriptor misses(unit: requests): 1000
    Texture quad issues(unit: issues): 36000000
    Texture quads(unit: quads): 36000000
    Execution engine starvation cycles(unit: cycles): 380458140
    Diverged instructions(unit: instructions): 0
    Executed instructions(unit: instructions): 42002044
    Execution engine active cycles(unit: cycles): 434556483
    Execution core active cycles(unit: cycles): 434603027
    Non-fragment warps(unit: warps): 18000000
    Non-fragment core tasks(unit: tasks): 562500
    Non-fragment active cycles(unit: cycles): 435241489
    Full quad warps(unit: warps): 18000000
    Occluding quads(unit: quads): 0
    Killed unchanged tiles(unit: tiles): 0
    Tiles(unit: tiles): 0
    Warps using more than 32 registers(unit: warps): 0
    Late ZS killed quads(unit: quads): 0
    Late ZS tested quads(unit: quads): 0
    Early ZS killed quads(unit: quads): 0
    Early ZS updated quads(unit: quads): 0
    Early ZS tested quads(unit: quads): 0
    Rasterized fine quads(unit: quads): 0
    Partial fragment warps(unit: warps): 0
    Fragment warps(unit: warps): 0
    Forward pixel kill buffer active cycles(unit: cycles): 0
    Rasterized primitives(unit: primitives): 0
    Fragment primitives loaded(unit: primitives): 0
    Fragment active cycles(unit: cycles): 0
    Tiler varying shading stall cycles(unit: cycles): 0
    Tiler varying shading requests(unit: requests): 0
    Varying cache misses(unit: requests): 0
    Varying cache hits(unit: requests): 0
    Position cache miss requests(unit: requests): 0
    Position cache hit requests(unit: requests): 0
    Tiler position FIFO full cycles(unit: cycles): 0
    Tiler position shading stall cycles(unit: cycles): 0
    Tiler position shading requests(unit: requests): 0
    Internal write beats(unit: beats): 0
    Output internal read beats(unit: beats): 0
    Sample test culled primitives(unit: primitives): 0
    Z plane culled primitives(unit: primitives): 0
    Facing or XY plane test culled primitives(unit: primitives): 0
    Visible primitives(unit: primitives): 0
    Visible back-facing primitives(unit: primitives): 0
    Visible front-facing primitives(unit: primitives): 0
    Point primitives(unit: primitives): 0
    Line primitives(unit: primitives): 0
    Triangle primitives(unit: primitives): 0
    Tiler active cycles(unit: cycles): 5130350
    L2 cache flush requests(unit: requests): 3
    Reserved queue job finish wait cycles(unit: cycles): 0
    Reserved queue job dependency wait cycles(unit: cycles): 0
    Reserved queue job issue wait cycles(unit: cycles): 0
    Reserved queue job descriptor read wait cycles(unit: cycles): 0
    Reserved queue cache flush wait cycles(unit: cycles): 0
    Reserved active cycles(unit: cycles): 0
    Reserved queue tasks(unit: tasks): 0
    Reserved queue jobs(unit: jobs): 0
    Non-fragment queue job finish wait cycles(unit: cycles): 0
    Non-fragment queue job dependency wait cycles(unit: cycles): 0
    Non-fragment queue job issue wait cycles(unit: cycles): 41675417
    Non-fragment queue job descriptor read wait cycles(unit: cycles): 15135
    Non-fragment queue cache flush wait cycles(unit: cycles): 221667
    Non-fragment queue active cycles(unit: cycles): 43807587
    Non-fragment tasks(unit: tasks): 562500
    Non-fragment jobs(unit: jobs): 400
    Fragment queue job finish wait cycles(unit: cycles): 0
    Fragment queue job dependency wait cycles(unit: cycles): 0
    Fragment queue job issue wait cycles(unit: cycles): 0
    Fragment queue job descriptor read wait cycles(unit: cycles): 0
    Fragment queue cache flush wait cycles(unit: cycles): 0
    Fragment queue active cycles(unit: cycles): 0
    Fragment tasks(unit: tasks): 0
    Fragment jobs(unit: jobs): 0
    GPU interrupt pending cycles(unit: cycles): 159269
    GPU active cycles(unit: cycles): 43807587

    Could you help me analyze the reports above? Why does the texture_style program show no obvious advantage?
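One way to start reading the two dumps is to put a handful of raw counters side by side. The following is a minimal sketch that uses only values copied from the two reports; the choice of counters and the helper name `relative_change` are illustrative assumptions, not part of HWCPipe.

```python
# Raw counters copied from the two HWCPipe reports above (totals over 100 runs).
buf_style = {
    "GPU active cycles": 44508037,
    "Output external read bytes": 8.32275e+08,
    "Output external write bytes": 5.76017e+08,
    "Output external read stall cycles": 14639299,
    "Output external write stall cycles": 13297053,
}
texture_style = {
    "GPU active cycles": 43807587,
    "Output external read bytes": 6.9408e+08,
    "Output external write bytes": 5.76022e+08,
    "Output external read stall cycles": 44887785,
    "Output external write stall cycles": 32439841,
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old` (hypothetical helper)."""
    return (new - old) / old * 100.0


# Change of texture_style relative to buf_style, per counter.
changes = {
    name: relative_change(texture_style[name], buf_style[name])
    for name in buf_style
}

for name, pct in changes.items():
    print(f"{name:36s} {pct:+8.1f} %")
```

With these inputs the texture_style capture shows about 1.6% fewer GPU active cycles and about 17% fewer external read bytes than buf_style, but roughly three times as many external read stall cycles. This only describes these two captures; it does not by itself explain the cause.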

  • My test platform is ubuntu, and I have no license of streamline, so, I cannot provide the report of streamline.

    If you download Arm Performance Studio, you shouldn't need a license - we made Arm Linux support part of the free-of-charge bundle, so no Streamline feature is license-managed any more.  

    Texture unit bytes read from external memory per texture cycle(unit: bytes): 9.60027
    Texture unit bytes read from L2 per texture cycle(unit: bytes): 11.1556

    As expected for a downscale, these are all quite high "per clock" numbers. I'd expect a lot of this is just going to be down to differences in access pattern, which is going to be hard to diagnose from the counters. 
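The "per cycle" figures can be reproduced from the raw counters in the reports. This is a sketch under an assumption: that the denominators are "Texture unit issue cycles" for the texture path and "Load/store unit read issues" for the buffer path. That is inferred only from the fact that the quotients match the reported values, not from HWCPipe documentation.

```python
# Values copied from the reports above (totals over 100 runs).
texture_issue_cycles = 7.2e+07   # texture_style: Texture unit issue cycles
tex_ext_read_bytes = 6.91219e+08  # texture_style: Texture unit read bytes from external memory
tex_l2_read_bytes = 8.03204e+08   # texture_style: Texture unit read bytes from L2 cache

lsu_read_issues = 1.44e+08        # buf_style: Load/store unit read issues
lsu_ext_read_bytes = 8.2947e+08   # buf_style: Load/store unit read bytes from external memory
lsu_l2_read_bytes = 1.25082e+09   # buf_style: Load/store unit read bytes from L2 cache


def bytes_per_cycle(total_bytes, cycles):
    """Average bytes moved per issue cycle (assumed definition)."""
    return total_bytes / cycles


tex_ext = bytes_per_cycle(tex_ext_read_bytes, texture_issue_cycles)
tex_l2 = bytes_per_cycle(tex_l2_read_bytes, texture_issue_cycles)
lsu_ext = bytes_per_cycle(lsu_ext_read_bytes, lsu_read_issues)
lsu_l2 = bytes_per_cycle(lsu_l2_read_bytes, lsu_read_issues)

print(f"texture: {tex_ext:.3f} B/cycle external, {tex_l2:.3f} B/cycle L2")
print(f"buffer:  {lsu_ext:.3f} B/cycle external, {lsu_l2:.3f} B/cycle L2")
```

Under that assumption the texture path moves more bytes per issue cycle than the load/store path in these captures, which is consistent with the "quite high" remark above.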

  • Hi Peter, thanks for your reply.

    I don't think I understand your analysis correctly:

    1. Do you mean that the two performance counters you pointed out should be quite high, but mine are very low, which is abnormal? Is that right?

        Furthermore, what would normal values be?

    2. I don't understand "access pattern". I don't think I can specify an access pattern in OpenCL; could you explain it further?

    3. I don't understand these two performance counters: Non-fragment tasks(unit: tasks) and Non-fragment jobs(unit: jobs)

    buf_style:
    Non-fragment tasks(unit: tasks): 562500
    Non-fragment jobs(unit: jobs): 300
    
    texture_style:
    Non-fragment tasks(unit: tasks): 562500
    Non-fragment jobs(unit: jobs): 400

        1). How are they calculated for an OpenCL program?

        2). I only run one kernel in both the buf_style program and the texture_style program, but the Non-fragment jobs per run in the reports are 3 and 4, instead of 1.

    buf_style:
    Non-fragment jobs(unit: jobs): 300
    
    texture_style:
    Non-fragment jobs(unit: jobs): 400

    I ran both tests 100 times, so the reports show 300 and 400. Why not 100?

  • 1. Do you mean that the two performance counters you pointed out should be quite high, but mine are very low?

    Your bytes-per-access value is high, which is in line with expectations for a downscale.

    2. I don't understand "access pattern". I don't think I can specify an access pattern in OpenCL; could you explain it further?

    Correct, you can't control it. But buffers and textures may have different memory layouts, and so have different access patterns.
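To illustrate why layout changes the access pattern, here is a toy sketch comparing a row-major (linear) layout with a hypothetical 4x4 tiled layout. The tile size, the 64-byte cache line, the 4 bytes per RGBA texel and the image width are all assumptions chosen for illustration; this is not a description of Mali's actual texture layout.

```python
CACHE_LINE_BYTES = 64   # assumed cache line size
BYTES_PER_TEXEL = 4     # assumed 8-bit-per-channel RGBA
WIDTH = 1024            # assumed image width in texels
TILE = 4                # hypothetical 4x4 texel tile


def linear_offset(x, y):
    """Byte offset of texel (x, y) in a row-major layout."""
    return (y * WIDTH + x) * BYTES_PER_TEXEL


def tiled_offset(x, y):
    """Byte offset of texel (x, y) when each 4x4 tile is stored contiguously."""
    tiles_per_row = WIDTH // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    within_tile = (y % TILE) * TILE + (x % TILE)
    return (tile_index * TILE * TILE + within_tile) * BYTES_PER_TEXEL


def cache_lines_touched(offset_fn, x0, y0, w, h):
    """Number of distinct cache lines covered by a w x h texel region."""
    return len({
        offset_fn(x, y) // CACHE_LINE_BYTES
        for y in range(y0, y0 + h)
        for x in range(x0, x0 + w)
    })


# A 4x4 block (2D-local access) versus a 16x1 row (1D-local access).
block_linear = cache_lines_touched(linear_offset, 0, 0, 4, 4)
block_tiled = cache_lines_touched(tiled_offset, 0, 0, 4, 4)
row_linear = cache_lines_touched(linear_offset, 0, 0, 16, 1)
row_tiled = cache_lines_touched(tiled_offset, 0, 0, 16, 1)

print("4x4 block:", block_linear, "lines linear,", block_tiled, "lines tiled")
print("16x1 row: ", row_linear, "lines linear,", row_tiled, "lines tiled")
```

In this toy model the same 16 texels cost one cache line or four depending on whether the kernel's access shape matches the layout, so the same kernel logic can behave differently on a buffer and on an image.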

        1). How are they calculated for an OpenCL program?

    You can't. 

    You will get at least one Job per compute dispatch, but may get more as the driver generates small jobs for some management activities. 

    Tasks are somewhat meaningless to an application developer. For compute workloads a task is some multiple of the workgroup size, but the exact scaling is chosen by the driver and depends on the system configuration. 

      2). I only run one kernel in both the buf_style program and the texture_style program, but the Non-fragment jobs per run in the reports are 3 and 4, instead of 1.

    As above, you will get at least one Job per compute dispatch, but may get more.
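The ratios between jobs, tasks, warps and threads in the two reports can be checked with simple arithmetic. This sketch uses only counters from the reports and the stated 100 runs, and assumes the thread count printed as 1.44e+08 is exact. The resulting task size is what the driver happened to choose on this system, not a fixed rule.

```python
RUNS = 100  # each program was run 100 times, per the post above

# Counters copied from the reports (identical in both, except jobs).
non_fragment_threads = 144_000_000  # printed as 1.44e+08; assumed exact
non_fragment_warps = 18_000_000
non_fragment_tasks = 562_500
non_fragment_jobs = {"buf_style": 300, "texture_style": 400}

threads_per_warp = non_fragment_threads // non_fragment_warps
warps_per_task = non_fragment_warps // non_fragment_tasks
threads_per_task = non_fragment_threads // non_fragment_tasks
threads_per_run = non_fragment_threads // RUNS
tasks_per_run = non_fragment_tasks // RUNS
jobs_per_run = {name: jobs // RUNS for name, jobs in non_fragment_jobs.items()}

print("threads per warp:", threads_per_warp)
print("warps per task:  ", warps_per_task)
print("threads per task:", threads_per_task)
print("threads per run: ", threads_per_run)
print("tasks per run:   ", tasks_per_run)
print("jobs per run:    ", jobs_per_run)
```

So each run dispatches 1,440,000 threads split into 5,625 tasks of 256 threads, and the extra 2 or 3 jobs per run beyond the single kernel dispatch would be the small management jobs described above.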