Use the following code to profile memory bandwidth for Kirin 985:
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // initialize
    std::vector<float> vecSrc(1000000, 2);
    std::vector<float> vecDst(1000000, 3);
    // memcpy profile
    auto tStart = std::chrono::high_resolution_clock::now();
    memcpy(vecDst.data(), vecSrc.data(), vecSrc.size() * sizeof(float));
    auto tEnd = std::chrono::high_resolution_clock::now();
    // print a copied element so the result of the copy is actually used
    std::cout << "vecDst[999999] = " << vecDst[999999] << std::endl;
    // calculate time in milliseconds
    float tDif = std::chrono::duration_cast<std::chrono::microseconds>(tEnd - tStart).count() / 1000.f;
    std::cout << "tDif = " << tDif << "ms" << std::endl;
}
The result is 1.302 ms.
The measured bandwidth is therefore
1000000.0 * 4 / 1024 / 1024 / 1024 / (1.302 / 1000) = 2.86 GiB/s
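The same arithmetic written as a small helper, as a rough sketch only. The function names copyBandwidthGiBps and memoryTrafficGiBps are made up for illustration and do not come from any library. The second helper reflects that memcpy reads every byte once and writes it once, so the traffic on the memory bus is twice the number of bytes copied.

```cpp
#include <cassert>
#include <cstddef>

// Copy bandwidth in GiB/s: bytes copied divided by elapsed time.
// (Hypothetical helper, named for this example only.)
double copyBandwidthGiBps(std::size_t bytesCopied, double elapsedMs) {
    assert(elapsedMs > 0.0);
    const double bytesPerGiB = 1024.0 * 1024.0 * 1024.0;
    const double seconds = elapsedMs / 1000.0;
    return (static_cast<double>(bytesCopied) / bytesPerGiB) / seconds;
}

// Total memory traffic in GiB/s: memcpy reads each byte and writes each byte,
// so the traffic is twice the copy bandwidth.
double memoryTrafficGiBps(std::size_t bytesCopied, double elapsedMs) {
    return 2.0 * copyBandwidthGiBps(bytesCopied, elapsedMs);
}
```

Calling copyBandwidthGiBps(1000000 * sizeof(float), 1.302) with a 4-byte float gives about 2.86, and memoryTrafficGiBps with the same arguments gives about 5.72.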
The max bandwidth should be much higher than this. Why is the memcpy bandwidth so low? Could anyone help? Thanks.
The cache utilization may not be good.
1) While your benchmark application is running, the userspace program is frequently interrupted by kernel-space code.
2) Your benchmark code is not designed to be cacheline friendly. Your cache miss rate may be very high.
3) It's interesting that you memcpy float vectors. Why not benchmark integer arrays or char arrays?
For a serious benchmark, please test with bare-metal code.
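Short of going bare-metal, a userspace test can at least reduce the effect of points 1) and 2) by doing a few untimed warm-up copies and then keeping the fastest of several timed runs. The following is only a sketch under those assumptions: bestMemcpyMs and its parameters are hypothetical names, and the repeat counts are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>

// Returns the fastest time, in milliseconds, of `repeats` timed memcpy calls
// that each copy `count` floats, after `warmups` untimed copies.
// (Hypothetical helper, named for this example only.)
double bestMemcpyMs(std::size_t count, int warmups, int repeats) {
    assert(count > 0 && repeats > 0);
    std::vector<float> src(count, 2.0f);
    std::vector<float> dst(count, 3.0f);
    const std::size_t bytes = count * sizeof(float);

    // Warm-up copies: not timed, so first-touch and cold-cache costs
    // do not land in the measurement.
    for (int i = 0; i < warmups; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
    }

    double bestMs = std::numeric_limits<double>::max();
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        const auto t1 = std::chrono::steady_clock::now();

        // Read a copied value back through a volatile so the copy has an
        // observable effect.
        volatile float sink = dst[count - 1];
        (void)sink;

        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        // Keep the minimum: a run that was interrupted only gets slower.
        bestMs = std::min(bestMs, ms);
    }
    return bestMs;
}
```

For example, bestMemcpyMs(1000000, 3, 20) times the same 1000000-float copy as in the question. To measure DRAM rather than cache bandwidth, count would have to be chosen so that the buffers are much larger than the last-level cache.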
Thanks. Yes, it seems this is not a good benchmark test.
(BTW, float memcpy and int memcpy results are the same (which should be no surprise))