
memcpy slowness on Kirin 985

I use the following code to profile memory bandwidth on the Kirin 985:

#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // initialize source and destination buffers (1,000,000 floats each, ~4 MB)
    std::vector<float> vecSrc(1000000, 2);
    std::vector<float> vecDst(1000000, 3);

    // memcpy profile
    auto tStart = std::chrono::high_resolution_clock::now();
    memcpy(vecDst.data(), vecSrc.data(), vecSrc.size() * sizeof(float));
    auto tEnd = std::chrono::high_resolution_clock::now();

    // check one destination element and report the elapsed time in ms
    std::cout << "vecDst[999999] = " << vecDst[999999] << std::endl;
    float tDif = std::chrono::duration_cast<std::chrono::microseconds>(tEnd - tStart).count() / 1000.f;
    std::cout << "tDif = " << tDif << "ms" << std::endl;
    return 0;
}

The result is 1.302 ms.

The measured bandwidth is therefore

1000000.0 * 4 / 1024 / 1024 / 1024 / (1.302 / 1000) = 2.86 GB/s
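
For reference, the same figure can be computed inside the program instead of by hand. A minimal sketch, reusing vecSrc, tStart, and tEnd from the code above (note that this counts only the bytes written; memcpy also reads the same amount from vecSrc, so total memory traffic is roughly twice this figure):

    // bandwidth over the single memcpy call, counting the bytes written to vecDst
    double bytesCopied = static_cast<double>(vecSrc.size()) * sizeof(float);
    double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(tEnd - tStart).count();
    std::cout << "bandwidth = " << bytesCopied / (1024.0 * 1024.0 * 1024.0) / seconds << " GB/s" << std::endl;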

The maximum memory bandwidth of the Kirin 985 should be much higher than this. Why is the measured memcpy bandwidth so low? Could anyone help? Thanks

  • The cache utilization may not be good. A few possibilities:

    1) While your benchmark application is running, the userspace program may be interrupted frequently by kernel-space activity (interrupts, scheduler preemption, other tasks).

    2) Your benchmark code is not designed to be cache-line friendly, so your cache miss rate may be very high.

    3) It is interesting that you run memcpy on float vectors. Why not also benchmark integer or char arrays?

    For a serious benchmark, please use bare-metal code for testing; as a lighter-weight check in userspace, see the sketch below.
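
    Not a full bare-metal test, but one lightweight way to reduce the impact of 1) and 2) is to warm both buffers up once and then repeat the timed copy several times, keeping the best result. A rough sketch (the buffer size matches the original post; the 20-iteration count and variable names are arbitrary, not from the original code):

    #include <chrono>
    #include <cstddef>
    #include <cstring>
    #include <iostream>
    #include <limits>
    #include <vector>

    int main() {
        const std::size_t kCount = 1000000;   // ~4 MB of floats, as in the original post
        std::vector<float> src(kCount, 2.0f);
        std::vector<float> dst(kCount, 3.0f);

        // warm-up copy: touches every page of both buffers so later runs do not
        // pay for page faults or a completely cold cache/TLB
        std::memcpy(dst.data(), src.data(), kCount * sizeof(float));

        // repeat the timed copy and keep the best time, which is less sensitive
        // to the benchmark thread being interrupted by the kernel or other tasks
        double bestSeconds = std::numeric_limits<double>::max();
        for (int i = 0; i < 20; ++i) {
            auto t0 = std::chrono::high_resolution_clock::now();
            std::memcpy(dst.data(), src.data(), kCount * sizeof(float));
            auto t1 = std::chrono::high_resolution_clock::now();
            double s = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
            if (s < bestSeconds) bestSeconds = s;
        }

        std::cout << "dst[0] = " << dst[0] << std::endl;   // keep the copy observable
        std::cout << "best copy time = " << bestSeconds * 1000.0 << " ms, "
                  << (kCount * sizeof(float)) / (1024.0 * 1024.0 * 1024.0) / bestSeconds
                  << " GB/s (counting bytes copied only)" << std::endl;
        return 0;
    }

    Taking the minimum rather than the average discards runs where the thread was preempted; depending on the cache sizes, part of the working set may also be served from the last-level cache after the warm-up, so the best-of-N figure is an optimistic bound rather than a pure DRAM measurement.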
