Use and utility of an NPU

hterrolle 1 month ago

Hi,

I am not sure whether an NPU would be useful for what I want to do.

Here is my use case.

1) I want to compare a 64x64 matrix against a set of data masks.

2) So far I use the CPU with SIMD, as a simple double loop over an array [64][64][number of forms to compare = 64].

3) I could do the same on a GPU, but then I would need to add a flag for each form, because the GPU processes work in random order. So I do not think it would be faster. I tried it and it is not, but I may have implemented the kernel or organized the data badly.

4) If I used an NPU, would I get better performance?
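
For readers following along, the CPU baseline in point 2 might look roughly like the sketch below. The post does not give the element type or the matching rule, so everything specific here is an assumption: 8-bit elements, a per-form bit mask selecting which bits must match, and the hypothetical name `masked_compare`. The form index is kept innermost, as in the [64][64][64] layout described above, so the inner loop reads contiguous memory and a compiler can auto-vectorize it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DIM = 64, FORMS = 64, CELLS = DIM * DIM };

/* Assumed layout (not from the post): input holds CELLS bytes; pattern and
 * mask hold CELLS * FORMS bytes with the form index innermost. */
void masked_compare(const uint8_t *input,
                    const uint8_t *pattern,
                    const uint8_t *mask,
                    uint16_t *mismatches)
{
    memset(mismatches, 0, FORMS * sizeof mismatches[0]);

    for (size_t p = 0; p < CELLS; ++p) {
        const uint8_t v = input[p];
        const uint8_t *pat = pattern + p * FORMS;
        const uint8_t *msk = mask + p * FORMS;

        /* Contiguous loop over forms: a candidate for NEON/SVE vectorization. */
        for (size_t f = 0; f < FORMS; ++f) {
            /* Bits cleared in the mask are ignored by the comparison. */
            uint8_t diff = (uint8_t)((v ^ pat[f]) & msk[f]);
            mismatches[f] = (uint16_t)(mismatches[f] + (diff != 0));
        }
    }
}
```

A form matches the input when its entry in `mismatches` is zero. One pass touches 64 x 64 x 64 = 262,144 elements.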

I have read a lot about NPUs, and I understand that they can be faster and use less energy. But they are used for CNN models and, as I understand it, some calculations are faster because they have dedicated ALUs. I do not need all of that; my approach is different from a CNN.

So would an NPU be useful in my case?

Thanks in advance.

Peter Harris 1 month ago

Offloading to any accelerator (GPU, NPU, etc) has an overhead, so it only tends to be beneficial for large workloads where the cost of offload is recovered by faster performance of the batch processing. A 64x64 matrix is quite small, so I would be surprised if it would benefit from offloading because the setup cost will dominate the performance.

Arm SME on the CPU might be a good fit for this size of small matrix, as it would be faster than NEON/SVE without a high setup cost. There is a blog on it here: developer.arm.com/.../arm-scalable-matrix-extension-introduction

HTH,
Pete
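
The cost-versus-benefit argument above can be put into a small model. This is only a sketch: it assumes a fixed offload overhead plus a constant per-item cost on each side, which real platforms will not follow exactly, and the function name `break_even_items` and all numbers passed to it are hypothetical, not measurements.

```c
/* Smallest batch size at which offloading is strictly faster than the CPU,
 * under a simple linear model:
 *   CPU time     = n * cpu_us_per_item
 *   offload time = overhead_us + n * accel_us_per_item
 * Returns -1 when the accelerator is not faster per item, so the overhead
 * can never be recovered. */
long break_even_items(double overhead_us,
                      double cpu_us_per_item,
                      double accel_us_per_item)
{
    double saving = cpu_us_per_item - accel_us_per_item;
    if (saving <= 0.0) {
        return -1;
    }
    /* Need n > overhead / saving; truncation is a floor for positive values. */
    return (long)(overhead_us / saving) + 1;
}
```

For example, with a made-up 100 us overhead, 2 us per item on the CPU and 0.5 us per item on the accelerator, `break_even_items(100.0, 2.0, 0.5)` returns 67: smaller batches run faster on the CPU.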

hterrolle 1 month ago in reply to Peter Harris

Thanks,

Just to be precise: what would be a good workload size, and below which size would offloading not be useful?

Peter Harris 1 month ago in reply to hterrolle

It's hard to give a precise answer - different accelerators will have different offload overheads and different performance for the workloads submitted to them (either because of accelerator hardware differences or workload differences), so both sides of the "cost vs benefit" balance are platform-dependent variables.

For GPGPU work my gut feel answer would be something like "it's worth considering if it's at least a millisecond of work" - for a high-end Mali that's 12M shader core cycles, each core doing 128 fp32 FMAs per clock, so "quite large".
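
To put the figures above next to the workload in the question, here is the arithmetic as a small sketch. The cycle and FMA numbers are the ones quoted in the reply; counting one element comparison as roughly one operation is an assumption made only for a rough scale comparison, and the function names are hypothetical.

```c
#include <stdint.h>

/* FMAs available in one millisecond, using the figures quoted above. */
uint64_t fmas_per_millisecond(void)
{
    const uint64_t core_cycles = 12000000u;  /* "12M shader core cycles" */
    const uint64_t fmas_per_cycle = 128u;    /* fp32 FMAs per core per clock */
    return core_cycles * fmas_per_cycle;     /* 1,536,000,000 */
}

/* Element comparisons in one pass: a 64x64 matrix against 64 forms. */
uint64_t comparisons_per_pass(void)
{
    return 64u * 64u * 64u;                  /* 262,144 */
}
```

Dividing the two gives about 5,859, so under that assumption one millisecond of GPU work corresponds to thousands of passes of the 64x64x64 comparison, which is why a single pass is unlikely to recover the offload cost.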