I am trying to use Arm Performance Libraries (ArmPL) with scikit-learn to benefit from the routines where ArmPL performs better, e.g. dgemm.
I linked armpl_mp with numpy and scipy and then built scikit-learn. I can see that dgemm from the ArmPL-linked numpy and scipy performs better than dgemm from the default OpenBLAS-linked numpy and scipy.
I then tried one scikit-learn algorithm, DBSCAN, with ArmPL, because with the default OpenBLAS build it calls dgemm and spends much of its time there.
But with ArmPL it got much worse: it took about 400 times longer, and 99.96% of the time is spent in "armpl::clag::bcms".
Can anyone explain what this function does and help me understand why it takes so much longer here?
+ 99.75%  0.00%  python3  libarmpl_mp.so  [.] armpl::clag::parallel<armpl::
+ 99.75%  0.00%  python3  libarmpl_mp.so  [.] armpl::clag::parallelise_2d<t
- 99.75% 99.50%  python3  libarmpl_mp.so  [.] armpl::clag::bcms<(armpl::cla
     96.37% thread_start
        start_thread
        0xffff9c1bb80c
      - armpl::clag::parallel<armpl::clag::parallelise_2d<true, true, armpl::clag::resident<(armpl::clag::which_matrix)1
         - 96.37% armpl::clag::parallelise_2d<true, true, armpl::clag::resident<(armpl::clag::which_matrix)1, armpl::cla
              armpl::clag::bcms<(armpl::clag::which_matrix)1, double, armpl::clag::convert<double const, double, armpl::
+ 96.75%  0.00%  python3  libc.so.6  [.] thread_start
+ 96.75%  0.00%  python3  libc.so.6  [.] start_thread
+ 96.61%  0.00%  python3  libgomp.so.1.0.0  [.] 0x0000ffff9c1bb80c
+  3.24%  0.00%  python3  libomp.so  [.] __kmp_invoke_microtask
+  3.24%  0.00%  python3  _base.cpython-310-aarch64-linux-gnu.so  [.] .omp_outlined..145
+  3.16%  0.02%  python3  _radius_neighbors.cpython-310-aarch64-linux-gnu.so  [.] __pyx_f_7sklearn_7metrics_29_
+  3.14%  0.00%  python3  _middle_term_computer.cpython-310-aarch64-linux-gnu.so  [.] __pyx_f_7sklearn_7metrics_29_
+  3.14%  0.00%  python3  _cython_blas.cpython-310-aarch64-linux-gnu.so  [.] __pyx_fuse_1__pyx_f_7sklearn_
+  3.14%  0.00%  python3  libarmpl_mp.so  [.] armpl::clag::gemm<true, int,
+  3.14%  0.00%  python3  libarmpl_mp.so  [.] _ZZZN5armpl4clag4gemmIdLNS0_4
+  3.14%  0.00%  python3  libgomp.so.1.0.0  [.] GOMP_parallel
   0.17%  0.10%  python3  libomp.so  [.] kmp_flag_64<false, true>::wai
   0.15%  0.15%  python3  libarmpl_mp.so  [.] dgemm_sve_big
Machine used: Graviton 3
Script used:
DBSCAN script:
#
# Imports
#
from sklearn.cluster import DBSCAN
from timeit import default_timer as timer
from sklearn.datasets import make_blobs
import numpy as np

# Generate Dataset
X, y = make_blobs(
    n_samples=50000,
    n_features=100,
    centers=50,
    center_box=(-32, 32),
    shuffle=True,
    random_state=0,
)

#
# Main
#
start = timer()
y_pred = DBSCAN(n_jobs=-1).fit(X, y)
elapsed = timer() - start
print(f"Total Time Taken for Execution of DBSCAN Fit Function: {elapsed} sec/s")
Thanks for your comments, they gave me an idea.
The issue was identified from your questions.
I initially used clang to build numpy, scipy and scikit-learn, but the ArmPL libraries I linked against are the GCC version by default. So I rebuilt all the packages with gcc and the threading issue was fixed. It now takes less time than with OpenBLAS. Thank you for the support.
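For anyone hitting the same symptom: the perf report above shows both libgomp.so.1.0.0 and libomp.so in the same process, which is consistent with clang-built extensions (LLVM OpenMP) calling into the GCC build of ArmPL (GNU OpenMP). Below is a minimal sketch, not an official ArmPL or scikit-learn tool, for checking which OpenMP runtimes a Python process has loaded. It is Linux-only, uses only the standard library, and reads /proc/self/maps; the helper names and the two filename patterns are my own assumptions for illustration.

```python
import os
import re

# Assumed filename patterns for the two runtimes seen in the perf report.
OPENMP_RUNTIMES = {
    "GNU OpenMP (libgomp)": re.compile(r"libgomp\.so"),
    "LLVM OpenMP (libomp)": re.compile(r"libomp\.so"),
}


def openmp_runtimes_in_maps(maps_text):
    """Return {runtime label: set of library paths} found in /proc/<pid>/maps text."""
    found = {}
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        path = fields[5].strip()
        name = os.path.basename(path)
        for label, pattern in OPENMP_RUNTIMES.items():
            if pattern.match(name):
                found.setdefault(label, set()).add(path)
    return found


def loaded_openmp_runtimes():
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return openmp_runtimes_in_maps(f.read())


if __name__ == "__main__":
    # Import the packages under test first so their shared libraries get loaded,
    # e.g. uncomment the next line in an environment with scikit-learn installed:
    # import sklearn.cluster
    if os.path.exists("/proc/self/maps"):
        runtimes = loaded_openmp_runtimes()
        for label, paths in sorted(runtimes.items()):
            print(label, sorted(paths))
        if len(runtimes) > 1:
            print("More than one OpenMP runtime is loaded in this process.")
```

If more than one runtime is listed after importing numpy, scipy and sklearn, the packages and the BLAS library were probably built against different OpenMP implementations, which was the situation in my clang build.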
Hi.
Glad to hear that it was simple to fix!
For reference, if you already have armflang installed, you have a compatible ArmPL installed as part of that package, as well as a GCC-compatible one. In fact, using "armflang -armpl" would also do all the necessary linking for you.
Nice to know you see an improvement over OpenBLAS as well! Do let us know if you run into any other oddities.
Chris