Mali-T880 in Helio X20

Hello, I am developing software with OpenCL for the Mali-T880 in the Helio X20. I would like to know the specific structure of the Mali-T880: the number of shader cores, the structure of each shader core, and the sizes of the L1 and L2 caches. This would help me improve the efficiency of my algorithm.

Thanks a lot.

  • The Mali GPU is configurable, so our silicon partners can choose how many shader cores they have as well as the size of the L2 cache.

    The MediaTek datasheet for the Helio-X20 confirms the design as having 4 Mali-T880 shader cores, but does not publicly document the L2 cache size. We recommend 64KB per shader core, so I would expect it to be 256KB for the Helio-X20, but I cannot confirm this is the case.
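
    The estimate above is simple arithmetic, sketched here in C. The 64KB-per-core figure is only the recommendation quoted above, not a confirmed value for the Helio-X20, and the names `RECOMMENDED_L2_BYTES_PER_CORE` and `estimate_l2_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64KB of L2 per shader core, as recommended in this thread.
 * A silicon partner is free to configure a different size. */
#define RECOMMENDED_L2_BYTES_PER_CORE (64u * 1024u)

/* Estimate the total L2 size in bytes for a given shader core count. */
static size_t estimate_l2_bytes(unsigned shader_cores)
{
    assert(shader_cores > 0);
    return (size_t)shader_cores * RECOMMENDED_L2_BYTES_PER_CORE;
}
```

    For the 4 shader cores listed in the datasheet, `estimate_l2_bytes(4)` gives 262144 bytes, i.e. the 256KB expected above.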

    In terms of the shader core itself, this blog of mine should be a good place to start:

    community.arm.com/.../the-mali-gpu-an-abstract-machine-part-3---the-midgard-shader-core

    Hope that helps,
    Pete

  • Hi Peter,

    Thank you for your reply, which is very helpful. I have read your blog and learned that every shader core of the Mali-T880 has one texture pipeline, one load/store pipeline, and three arithmetic pipelines. There are two 16KB L1 data caches per shader core, one for texture access and one for generic memory access. My OpenCL algorithm consists of SIMD vector operations that contain only additions and subtractions. Does my algorithm use only the arithmetic pipelines, or does it also use the texture pipeline? Can I use the full 32KB of L1 cache to cache data?

    I obtained the device information for the Mali-T880 in the Helio-X20 through the OpenCL query functions, which may be useful for determining the size of the L2 cache.
    Here are the results of the OpenCL queries on the device:
    CL_PLATFORM_PROFILE = FULL_PROFILE
    CL_PLATFORM_VERSION = OpenCL 1.1 v1.r7p0-02rel0.c2815c77377fd8176029d97c61eba4df
    CL_PLATFORM_VENDOR = ARM
    CL_DEVICE_MAX_COMPUTE_UNITS = 4
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
    CL_DEVICE_MAX_WORK_GROUP_SIZE = 256
    CL_DEVICE_MAX_MEM_ALLOC_SIZE = 536870912(512M)
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 524288(512K)
    CL_DEVICE_GLOBAL_MEM_SIZE = 2147483648(2G)
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536(64K)
    CL_DEVICE_LOCAL_MEM_SIZE = 32768(32K)

    I suspect that CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is the L2 cache size.
  • Hi David,

    > Does my algorithm use only the arithmetic pipelines, or does it also use the texture pipeline?

    I do not know what your algorithm is, so I cannot give you a definitive answer, but in general the read-only filtered texture access built-in functions will use the texture pipe, and all other memory access (e.g. pointer-based data load/store, imageLoad/imageStore built-in functions) will use the load/store unit. The two caches are independent, so if you are not using the texturing unit you will only have 16KB of L1 cache per core available.
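
    As a rough sizing aid for that 16KB figure, here is a minimal C sketch of how many data elements fit in the L1 load/store cache at once. It is a back-of-the-envelope bound only: it ignores cache line size, associativity, and any other traffic sharing the cache, and the names `L1_LS_CACHE_BYTES` and `elements_fitting_l1_ls` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* 16KB L1 load/store cache per shader core, as stated in this thread. */
#define L1_LS_CACHE_BYTES (16u * 1024u)

/* Upper bound on how many elements of element_bytes each fit in the
 * L1 load/store cache of one shader core at the same time. */
static size_t elements_fitting_l1_ls(size_t element_bytes)
{
    assert(element_bytes > 0);
    return L1_LS_CACHE_BYTES / element_bytes;
}
```

    For example, a `float4` is 16 bytes, so at most 1024 of them fit in one core's L1 load/store cache under these assumptions.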

    One important thing to note with Mali for compute is that there is no physically distinct backing for OpenCL "local memory"; both local and global memory pools are simply backed by cached system memory. Do not copy data resources from global to local pools in the belief it helps performance; it will make things worse on Mali due to increased pressure on the L1 LS cache.
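
    To illustrate the point about local memory, here is a sketch of a vector add/subtract kernel that reads and writes global memory directly, with no staging through `__local` buffers. The kernel is kept as a host-side C string, as it would be passed to `clCreateProgramWithSource`. The kernel name, argument names, and the helper `uses_local_memory` are hypothetical, and the kernel assumes the global work size equals the number of `float4` elements.

```c
#include <assert.h>
#include <string.h>

/* OpenCL C kernel source held as a host-side string.
 * Each work-item loads one float4 from each input straight from global
 * memory and stores the sum and difference straight back to global memory.
 * Assumption: global work size == number of float4 elements. */
static const char *const kVecAddSubSrc =
    "__kernel void vec_add_sub(__global const float4 *a,\n"
    "                          __global const float4 *b,\n"
    "                          __global float4 *sum,\n"
    "                          __global float4 *diff)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    float4 va = a[i];\n"
    "    float4 vb = b[i];\n"
    "    sum[i]  = va + vb;\n"
    "    diff[i] = va - vb;\n"
    "}\n";

/* Returns 1 if the kernel source mentions __local memory, 0 otherwise.
 * This is a plain text search, not a real analysis of the kernel. */
static int uses_local_memory(const char *src)
{
    assert(src != NULL);
    return strstr(src, "__local") != NULL;
}
```

    Since the kernel only adds and subtracts values it loaded through pointers, its memory traffic goes through the load/store unit, and there is no global-to-local copy adding pressure on the L1 LS cache.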

    HTH,
    Pete
  • Pete,
    Your reply is of great help to us. Thank you again for your help.

    Best regards,
    David