Mali T860 run-time measurement question

huanshen over 4 years ago

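I am running the hello_world_opencl example from the Mali SDK on an RK3399.

With the same amount of data, the second case below actually runs faster than the first one above. Can anyone explain why?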

0 huanshen over 4 years ago in reply to 章政

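Thank you again for your guidance. I have now added this in Streamline. In the second case the numbers match: it really does read 80 MB and write 40 MB. In the first case, however, Streamline shows 94 MB read and 67 MB written. The data volume is definitely the same; I compiled the cpp only once and just swapped the cl file between runs for testing. I would also like to ask: the RK3399's DDR runs at 800 MHz with a 64-bit bus, so is the L2 cache bandwidth 6.4 GB/s?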

0 章政 over 4 years ago in reply to huanshen

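800 MHz × 64 / 8 (bytes) × 2 (DDR) = 12.8 GB/s. This is the theoretical bandwidth. In practice, contention and other factors mean you may not reach this value; it also depends on the SoC implementation.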
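
As a rough sketch of the arithmetic above, the peak figure can be computed from the clock, the bus width, and the number of transfers per clock. The function name and parameters are hypothetical and only illustrate the formula; 1 GB is taken as 10^9 bytes. Leaving out the ×2 DDR factor gives 6.4 GB/s, the figure mentioned in the question.

```python
def peak_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock is 2 for DDR (data moves on both clock edges)
    and 1 for a single-data-rate bus.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # Values quoted in this thread: 800 MHz clock, 64-bit bus.
    print(peak_bandwidth_gb_s(800, 64))                         # DDR: about 12.8
    print(peak_bandwidth_gb_s(800, 64, transfers_per_clock=1))  # no x2: about 6.4
```

This is an upper bound only; it says nothing about what a given SoC actually sustains.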

0 huanshen over 4 years ago in reply to 章政

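Take the second case, which reads and writes 120 MB in total. With the board's compute frequency set to 800 MHz, the run time is 15 ms, so the bandwidth actually achieved is only 120 MB / 15 ms = 8 GB/s. Why does it fall short of the theoretical figure? This is about the simplest kernel possible, so compute should not be the bottleneck, which is why I assume the kernel is bandwidth-bound.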
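
For reference, a minimal sketch of the estimate above: achieved bandwidth from bytes moved and elapsed time, compared against the 12.8 GB/s theoretical figure quoted earlier in the thread. The helper names are hypothetical, and 1 MB is taken as 10^6 bytes.

```python
THEORETICAL_PEAK_GB_S = 12.8  # theoretical DDR figure quoted earlier in the thread


def achieved_bandwidth_gb_s(total_mb: float, elapsed_ms: float) -> float:
    """Bandwidth in GB/s from total traffic (MB) and run time (ms)."""
    total_bytes = total_mb * 1e6
    elapsed_s = elapsed_ms / 1e3
    return total_bytes / elapsed_s / 1e9


def fraction_of_peak(achieved_gb_s: float,
                     peak_gb_s: float = THEORETICAL_PEAK_GB_S) -> float:
    """Share of the theoretical peak that was actually reached."""
    return achieved_gb_s / peak_gb_s


if __name__ == "__main__":
    # Second case: 120 MB read plus written in 15 ms.
    bw = achieved_bandwidth_gb_s(120, 15)
    print(f"{bw:.1f} GB/s, {fraction_of_peak(bw):.1%} of peak")
```

With these inputs the kernel reaches roughly five eighths of the theoretical peak.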

0 章政 over 4 years ago in reply to huanshen

Besides your program, the operating system is running many other programs, and any interrupt can cause contention on the bus.

0 huanshen over 4 years ago in reply to 章政

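Thank you very much for your patient replies; they resolved some problems that had been bothering me for days. As for the difference in execution time between the two kernels, I will report back promptly if I find the answer.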

0 章政 over 4 years ago in reply to huanshen

You're welcome. Feel free to come back to the forum and discuss whenever you have time.

0 huanshen over 4 years ago in reply to 章政

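Hello, do you know the difference between Instructions Emitted and Longest Path Cycles in the Mali offline compiler? My GPU is a Mali-T860. Taking the same two programs as examples, the analysis results are identical, as shown in the figure below. By my reasoning, L/S should be 3. Also, the ALU cycle counts under Instructions Emitted and Longest Path Cycles differ. What does each of them mean?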

0 huanshen over 4 years ago in reply to 章政

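I talked to the RK vendor: a driver problem caused the difference in run time between the two cases. After switching to a suitable driver, both run in the same time.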