Mali-T860 run-time measurement question

huanshen over 4 years ago

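I am running the hello_world_opencl example from the Mali SDK on an RK3399.

With the same amount of data, the second case below surprisingly takes less time than the first one. Can anyone explain why?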

章政 over 4 years ago

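Are these in the same program, or are they two separate apps? If they are two function calls inside one program, the first call may already have prepared the data the second one needs, so the second call does not have to move the data again and naturally takes less time. If they are two separate programs, run each of them several times in different orders and see whether the result changes; the difference may come from other system factors affecting the state at that moment.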

huanshen over 4 years ago in reply to 章政

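Thank you very much for your reply; this problem has been bothering me for several days. They are two separate programs, and after repeated testing the times are fairly stable. Below are the run times for each case and the corresponding Streamline captures.

Case 1: add the inputs directly

Case 2: add the inputs, then multiply the result by 2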

章政 over 4 years ago in reply to huanshen

Your two runs do not appear to move the same amount of data: look at the L2 Read/Write counters, they differ a lot.

huanshen over 4 years ago in reply to 章政

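Hello. It is precisely because the data volume is the same that I find this puzzling: globalWorkSize is 10M (10000000) in both cases. I also tested on a desktop NVIDIA card, where the two run times are identical, but on both the RK3288 and the RK3399 I see the behavior described above. I also do not really understand Streamline. As I understand it, the first case should read 80 MB of data and write 40 MB, but I do not know which counter in the chart above that corresponds to.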
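The 80 MB / 40 MB expectation follows from the buffer sizes. A rough sketch, assuming the sample adds two input arrays of 4-byte integers into one output array (the element size, the buffer counts, and the helper name `expected_traffic_bytes` are assumptions for illustration; the thread does not state them):

```python
def expected_traffic_bytes(num_elements, bytes_per_element=4,
                           num_inputs=2, num_outputs=1):
    """Return (bytes_read, bytes_written) if every element is touched exactly once.

    Assumes no caching effects and no extra traffic from the driver.
    """
    bytes_read = num_elements * bytes_per_element * num_inputs
    bytes_written = num_elements * bytes_per_element * num_outputs
    return bytes_read, bytes_written


# globalWorkSize from the post: 10,000,000 work-items, one element each.
read_b, written_b = expected_traffic_bytes(10_000_000)
print(f"read {read_b / 1e6:.0f} MB, written {written_b / 1e6:.0f} MB")
```

Any measured traffic well above these figures would suggest additional memory accesses that the kernel source alone does not explain.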

章政 over 4 years ago in reply to huanshen

Look at the L2 read and write counters: the data volume in the first case is nearly twice that of the second.

huanshen over 4 years ago in reply to 章政

Do you mean read hits, or read lookups? I would like to know how reading 80 MB of data should show up in Streamline; specifically, which counter can be used to compute that 80 MB.

章政 over 4 years ago in reply to huanshen

The L2 cache has a counter called External Bus Read Beats. Take its reading over the duration of the run and multiply it by the bus width; on the 3288 that should be 16 bytes.
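The conversion described here can be sketched as follows. The 16-byte beat width is the value suggested above for the 3288 and may differ on other SoCs; the function names are hypothetical helpers, not part of Streamline:

```python
def beats_to_bytes(beats, bus_width_bytes=16):
    """Convert an external bus beat count into bytes transferred."""
    return beats * bus_width_bytes


def bytes_to_beats(num_bytes, bus_width_bytes=16):
    """Number of beats needed to move num_bytes (rounded up)."""
    return -(-num_bytes // bus_width_bytes)  # ceiling division


# An 80 MB read should appear as roughly this many read beats:
print(bytes_to_beats(80_000_000))
# And a counter reading converts back to bytes like this:
print(beats_to_bytes(5_000_000) / 1e6, "MB")
```

The write side works the same way with the corresponding write-beats counter.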

huanshen over 4 years ago in reply to 章政

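Thanks again for your guidance. I have now added this counter in Streamline. In the second case the numbers match: it really did read 80 MB and write 40 MB. In the first case, however, Streamline shows 94 MB read and 67 MB written. The data volume is definitely the same: I compiled the cpp only once, and for each run I just swapped the .cl file. One more question: the 3399's DDR runs at 800 MHz and is 64 bits wide, so is the L2 cache bandwidth 6.4 GB/s?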

章政 over 4 years ago in reply to huanshen

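800 MHz × 64 / 8 (bytes) × 2 (DDR) = 12.8 GB/s. That is the theoretical bandwidth. In practice, contention and other factors mean you may not reach this value, and it also depends on how the SoC is implemented.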
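The same arithmetic as a small sketch, using the 800 MHz clock and 64-bit width quoted in the thread (the function name is a hypothetical helper):

```python
def peak_ddr_bandwidth_bytes_per_s(clock_hz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak bandwidth: clock x transfers per clock x bytes per transfer.

    transfers_per_clock=2 models double data rate (DDR) signalling.
    """
    return clock_hz * transfers_per_clock * bus_width_bits // 8


peak = peak_ddr_bandwidth_bytes_per_s(800_000_000, 64)
print(f"{peak / 1e9:.1f} GB/s")
```

With `transfers_per_clock=1` the same inputs give 6.4 GB/s, which is the figure obtained when the DDR factor of 2 is left out.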

huanshen over 4 years ago in reply to 章政

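Take the second case, which reads and writes 120 MB in total. With the board's operating frequency set to 800 MHz the run time is 15 ms, so the bandwidth actually achieved is only 120 MB / 15 ms = 8 GB/s. Why does it fall short of the theoretical figure? This is about the simplest kernel possible, so it should not be compute-bound; I am assuming it is limited by memory bandwidth.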
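For reference, the achieved figure and its ratio to the 12.8 GB/s theoretical peak quoted earlier in the thread can be worked out as below. This is only a sketch of the arithmetic; the helper name is hypothetical, and the timing is the 15 ms reported in the post:

```python
def achieved_bandwidth_bytes_per_s(bytes_moved, elapsed_ms):
    """Average bandwidth over a run, from total bytes moved and elapsed time in ms."""
    return bytes_moved * 1000 / elapsed_ms


achieved = achieved_bandwidth_bytes_per_s(120_000_000, 15)  # 80 MB read + 40 MB written
peak = 12_800_000_000                                       # theoretical peak from the thread
print(f"achieved {achieved / 1e9:.1f} GB/s, {achieved / peak:.1%} of peak")
```

Under these numbers the kernel reaches a little under two thirds of the theoretical peak.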

章政 over 4 years ago in reply to huanshen

Besides your program, the operating system is running many other programs, and any interrupt can cause contention on the bus.

huanshen over 4 years ago in reply to 章政

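Thank you very much for your patient replies; they cleared up several problems that had been bothering me for days. As for why the two kernels take different amounts of time, I will report back if I find the answer.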

章政 over 4 years ago in reply to huanshen

You are welcome. Drop by the forum whenever you have time.

huanshen over 4 years ago in reply to 章政

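Hello. Do you know the difference between Instructions Emitted and Longest Path Cycles in the Mali offline compiler? My GPU is a Mali-T860. Taking the same two programs as the example, the analysis results are identical, as shown in the figure below. I expected the L/S value to be 3, and the ALU cycles reported under Instructions Emitted and Longest Path Cycles differ. What does each of them mean?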

huanshen over 4 years ago in reply to 章政

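I talked with the RK vendor: a driver problem was causing the two cases to run for different amounts of time. After switching to a suitable driver, the two run times are the same.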