Why is Mali texture performance worse than buffer performance?

I tested the performance of Mali textures (cl_image) and found that they are slower than buffers (cl_mem).

My GPU is a Mali-G76.

I expected textures to be faster than buffers, for example for bilinear filtering.

However, my test shows that texture access on the G76 is about 10%-20% slower than buffer access. My test format is RGBA.

I don't know why.

Could anyone explain the reason?

Alternatively, is there a standard benchmark program?
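
For reference on the bilinear case, here is a minimal C sketch of the per-sample arithmetic that a buffer-based kernel would have to do in software. An image read through a sampler with linear filtering can leave this work to the texture unit. The function name `bilinear_sample`, the single-channel float layout, and the convention that texel centers sit at integer coordinates are all assumptions made for illustration; none of them come from the test described in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Bilinear sample of one channel from a row-major float image.
 * ASSUMPTION: texel centers are at integer coordinates and out-of-range
 * coordinates are clamped to the image edge. */
static float bilinear_sample(const float *img, int width, int height,
                             float u, float v)
{
    assert(img != NULL && width > 0 && height > 0);

    /* Clamp to the valid texel range. */
    if (u < 0.0f) u = 0.0f;
    if (v < 0.0f) v = 0.0f;
    if (u > (float)(width - 1)) u = (float)(width - 1);
    if (v > (float)(height - 1)) v = (float)(height - 1);

    /* u and v are non-negative here, so truncation equals floor. */
    int x0 = (int)u;
    int y0 = (int)v;
    int x1 = (x0 + 1 < width) ? x0 + 1 : x0;
    int y1 = (y0 + 1 < height) ? y0 + 1 : y0;
    float fx = u - (float)x0;
    float fy = v - (float)y0;

    /* Four texel fetches, then three linear interpolations. */
    float top = img[(size_t)y0 * width + x0] * (1.0f - fx)
              + img[(size_t)y0 * width + x1] * fx;
    float bottom = img[(size_t)y1 * width + x0] * (1.0f - fx)
                 + img[(size_t)y1 * width + x1] * fx;
    return top * (1.0f - fy) + bottom * fy;
}
```

Each output sample costs four reads plus the blend, so a buffer kernel pays for this in shader instructions, while the image path shifts the cost to the texture unit and its memory traffic.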

Parents
  • My test platform is Ubuntu, and I have no Streamline license, so I cannot provide a Streamline report.

    If you download Arm Performance Studio, you shouldn't need a license - we made Arm Linux support part of the free-of-charge bundle, so no Streamline feature is license-managed any more.

    Texture unit bytes read from external memory per texture cycle (unit: bytes): 9.60027
    Texture unit bytes read from L2 per texture cycle (unit: bytes): 11.1556

    As expected for a downscale, these are all quite high "per clock" numbers. I'd expect a lot of this is just going to be down to differences in access pattern, which is going to be hard to diagnose from the counters. 

Children
  • Hi Peter, thanks for your reply.

    I don't think I have understood your analysis correctly:

    1. Do you mean that the two performance counters you pointed out should be quite high, and that my values are very low, which is abnormal? Is that right?

        Also, what would normal values be?

    2. I don't understand "access pattern". As far as I know, I cannot specify an access pattern in OpenCL. Could you explain further?

    3. I don't understand these two performance counters: Non-fragment tasks (unit: tasks) and Non-fragment jobs (unit: jobs).

    buf_style:
    Non-fragment tasks (unit: tasks): 562500
    Non-fragment jobs (unit: jobs): 300

    texture_style:
    Non-fragment tasks (unit: tasks): 562500
    Non-fragment jobs (unit: jobs): 400

        1). How are they calculated for an OpenCL program?

        2). I run only one kernel in both the buf_style and texture_style programs, but the reports show 3 and 4 non-fragment jobs per run instead of 1.

    buf_style:
    Non-fragment jobs (unit: jobs): 300

    texture_style:
    Non-fragment jobs (unit: jobs): 400

    I ran both tests 100 times, so the reports show 300 and 400. Why not 100?

     

        

  • 1. Do you mean that the two performance counters you pointed out should be quite high, but my values are very low?

    Your bytes-per-access value is high, so it is in line with expectations for a downscale.

    2. I don't understand "access pattern". As far as I know, I cannot specify an access pattern in OpenCL. Could you explain further?

    Correct, you can't control it. But buffers and textures may have different memory layouts, and so have different access patterns.
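
To make the layout point concrete, here is a minimal C sketch comparing the byte offset of a texel in a row-major (linear) layout with its offset in a block-tiled layout. The 4x4 tile size and the function names are hypothetical and chosen only for illustration; the real Mali image layout is internal to the driver and is not described in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row-major layout: the way a kernel usually indexes a plain buffer. */
static size_t linear_offset(uint32_t x, uint32_t y, uint32_t width,
                            uint32_t bytes_per_texel)
{
    return ((size_t)y * width + x) * bytes_per_texel;
}

/* HYPOTHETICAL tiled layout: each TILE x TILE block of texels is stored
 * contiguously, and blocks are laid out row by row. */
enum { TILE = 4 };

static size_t tiled_offset(uint32_t x, uint32_t y, uint32_t width,
                           uint32_t bytes_per_texel)
{
    assert(width % TILE == 0); /* keeps this sketch free of padding logic */

    uint32_t tiles_per_row = width / TILE;
    size_t tile_index = (size_t)(y / TILE) * tiles_per_row + (x / TILE);
    size_t in_tile = (size_t)(y % TILE) * TILE + (x % TILE);
    return (tile_index * TILE * TILE + in_tile) * bytes_per_texel;
}
```

For an 8-texel-wide RGBA8 image, stepping one texel down from (0, 0) moves 32 bytes in the linear layout but only 16 bytes in this tiled layout, while stepping from (3, 0) to (4, 0) moves 4 bytes in the linear layout and 52 bytes in the tiled one. The same kernel therefore touches memory in a different order depending on the layout, which is what changes cache behaviour.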

        1). How are they calculated for an OpenCL program?

    You can't. 

    You will get at least one Job per compute dispatch, but may get more as the driver generates small jobs for some management activities. 
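
As a rough illustration using the numbers reported above, the following C sketch splits a job counter total into the jobs attributable to the application's own kernel launches and the remainder. The function name and the assumption that every iteration produces the same number of jobs are mine; the driver does not document or guarantee the split.

```c
#include <assert.h>

/* Jobs per iteration beyond the application's own kernel launches.
 * ASSUMPTION: every iteration generates the same number of jobs. */
static unsigned long extra_jobs_per_iteration(unsigned long total_jobs,
                                              unsigned long iterations,
                                              unsigned long kernels_per_iteration)
{
    assert(iterations > 0);

    unsigned long jobs_per_iteration = total_jobs / iterations;

    /* At least one job is expected per compute dispatch. */
    assert(jobs_per_iteration >= kernels_per_iteration);
    return jobs_per_iteration - kernels_per_iteration;
}
```

With 100 iterations of one kernel, a total of 300 jobs gives 2 extra jobs per iteration for buf_style, and a total of 400 gives 3 for texture_style. What those extra jobs do cannot be determined from the counter alone.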

    Tasks are somewhat meaningless to an application developer. For compute workloads a task is some multiple of the workgroup size, but the exact scaling is chosen by the driver and depends on the system configuration. 

        2). I run only one kernel in both the buf_style and texture_style programs, but the reports show 3 and 4 non-fragment jobs per run instead of 1.

    As above, you will get at least one Job per compute dispatch, but may get more.