Hi,
I am not sure whether an NPU would be useful for what I want to do,
so let me explain my need.
1) I want to compare a 64*64 matrix against data masks.
2) Until now I have used the CPU with SIMD to do this, as a simple double loop over an array [64][64][number of forms to compare = 64].
3) On a GPU I could do the same, but I would need to add a flag for each form because the GPU processes work in no fixed order. So I did not expect it to be faster. I tried it and it is not, but I may have made mistakes in the way I implemented the kernel and organized the data.
4) If I used an NPU, would I get better performance?
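For readers following along, the double loop from point 2 might look roughly like the sketch below. The thread does not define what "compare" means, so this assumes a masked equality test that counts mismatching cells. The names `masked_mismatches` and `best_matching_form` are hypothetical, and plain Python stands in for the SIMD code only to show the shape of the work.

```python
def masked_mismatches(candidate, form, mask):
    """Count cells where candidate and form differ, looking only at
    cells whose mask entry is non-zero (assumed comparison rule)."""
    count = 0
    for row in range(len(candidate)):
        for col in range(len(candidate[row])):
            if mask[row][col] and candidate[row][col] != form[row][col]:
                count += 1
    return count


def best_matching_form(candidate, forms, mask):
    """Return (index, mismatches) for the form with the fewest
    masked mismatches against the candidate matrix."""
    scores = [masked_mismatches(candidate, form, mask) for form in forms]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Tiny 2x2 stand-in for the 64x64 case, with two forms instead of 64.
candidate = [[1, 0], [0, 1]]
forms = [[[0, 0], [0, 0]], [[1, 0], [0, 1]]]
full_mask = [[1, 1], [1, 1]]
print(best_matching_form(candidate, forms, full_mask))
```

At the sizes in the question this is 64 * 64 * 64 = 262,144 cell comparisons per candidate matrix.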
I have read a lot about NPUs and I understand that they can be faster and use less energy. But they are used for CNN models and, as I understand it, some calculations are faster because the NPU has integrated ALUs dedicated to them. I do not need all of that, though; my approach is different from a CNN.
So would an NPU be useful in my case?
Thanks in advance.
Offloading to any accelerator (GPU, NPU, etc) has an overhead, so it only tends to be beneficial for large workloads where the cost of offload is recovered by faster performance of the batch processing. A 64x64 matrix is quite small, so I would be surprised if it would benefit from offloading because the setup cost will dominate the performance.
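The trade-off can be written as a simple break-even model. This is only a sketch: it assumes a fixed offload overhead and a constant speedup factor, both of which vary by platform in practice, and the function names and numbers are illustrative placeholders rather than measurements.

```python
def offload_is_worthwhile(cpu_time_s, accel_time_s, overhead_s):
    """True when accelerator time plus the fixed offload overhead
    is shorter than doing the work on the CPU."""
    return overhead_s + accel_time_s < cpu_time_s


def break_even_cpu_time(overhead_s, speedup):
    """CPU-side runtime above which offloading starts to win.

    Assumes accel_time = cpu_time / speedup, so offload wins when
    overhead + t / speedup < t, i.e. t > overhead * speedup / (speedup - 1).
    """
    if speedup <= 1:
        raise ValueError("offload can never win without a speedup above 1")
    return overhead_s * speedup / (speedup - 1)


# Placeholder numbers: 100 us of setup cost, accelerator 10x faster.
small_job = offload_is_worthwhile(cpu_time_s=20e-6, accel_time_s=2e-6, overhead_s=100e-6)
large_job = offload_is_worthwhile(cpu_time_s=5e-3, accel_time_s=0.5e-3, overhead_s=100e-6)
threshold = break_even_cpu_time(overhead_s=100e-6, speedup=10.0)
print(small_job, large_job, threshold)
```

With these placeholder values the small job loses to the setup cost even though the accelerator itself is ten times faster, which is the situation described for a 64x64 matrix.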
Arm SME on the CPU might be a good fit for this size of small matrix, as it would be faster than NEON/SVE without a high setup cost. There is a blog on it here: developer.arm.com/.../arm-scalable-matrix-extension-introduction
HTH, Pete
Thanks,
Just to be precise: what would be a good workload size? And below what size would it not be useful?
It's hard to give a precise answer - different accelerators will have different offload overheads and different performance for the workloads submitted to them (either because of accelerator hardware differences or workload differences), so both sides of the "cost vs benefit" balance are platform-dependent variables.
For GPGPU work my gut feel answer would be something like "it's worth considering if it's at least a millisecond of work" - for a high-end Mali that's 12M shader core cycles, each core doing 128 fp32 FMAs per clock, so "quite large".
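To put those figures next to the workload from the question, the arithmetic can be spelled out as below. It is a rough order-of-magnitude comparison only: it assumes one element comparison costs about the same as one fp32 FMA, which is a simplification, and it ignores memory traffic entirely.

```python
# Figures quoted above for one millisecond on a high-end Mali GPU.
SHADER_CORE_CYCLES_PER_MS = 12_000_000
FMAS_PER_CORE_CYCLE = 128

# Workload from the question: a 64x64 matrix compared against 64 forms.
MATRIX_SIDE = 64
NUM_FORMS = 64

fma_per_ms = SHADER_CORE_CYCLES_PER_MS * FMAS_PER_CORE_CYCLE
workload_ops = MATRIX_SIDE * MATRIX_SIDE * NUM_FORMS

# Assumption: one element comparison is treated as one FMA-sized operation.
ratio = fma_per_ms / workload_ops

print(f"GPU capacity in 1 ms : {fma_per_ms:,} FMAs")
print(f"Comparisons per batch: {workload_ops:,}")
print(f"Capacity / workload  : {ratio:,.1f}x")
```

Under that assumption, one millisecond of GPU time is about 1.5 billion FMAs against roughly 262 thousand element comparisons, so a single 64x64x64 batch is several thousand times smaller than the "worth considering" threshold.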