ARM Performance library with scikit-learn

I am trying to use Arm Performance Libraries (ArmPL) with scikit-learn to benefit from the cases where ArmPL performs better, e.g. dgemm.

I linked armpl_mp into numpy and scipy and then built scikit-learn. I can see that dgemm is picked up from the ArmPL-linked numpy and scipy, and it performs better than dgemm in the default OpenBLAS builds of numpy and scipy.
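
A comparison like the one described above can be reproduced with a small timing script run once under each NumPy build. This is a minimal sketch, not the poster's actual benchmark: the helper name `time_dgemm`, the matrix size and the repeat count are illustrative assumptions, and it relies on NumPy dispatching float64 matrix products to the dgemm of whichever BLAS it was linked against.

```python
from timeit import default_timer as timer

import numpy as np


def time_dgemm(n=1000, repeats=3, seed=0):
    """Return the best wall-clock time (seconds) of an n x n float64 matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b  # warm-up call so thread-pool start-up is not included in the timing
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        a @ b
        best = min(best, timer() - start)
    return best
```

Calling `np.show_config()` before `time_dgemm()` prints the BLAS/LAPACK libraries that NumPy was built against, which helps confirm that each run really uses the intended library.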

I then tried one scikit-learn algorithm, DBSCAN, with ArmPL. With the default OpenBLAS build it was using dgemm and taking more time there.

With ArmPL it got much worse: it took 400 times longer, and 99.96% of the time it is running "armpl::clag::bcms".

Can anyone explain what this function does and help me understand why it takes so much longer here?

+ 99.75% 0.00% python3 libarmpl_mp.so [.] armpl::clag::parallel<armpl::
+ 99.75% 0.00% python3 libarmpl_mp.so [.] armpl::clag::parallelise_2d<t
- 99.75% 99.50% python3 libarmpl_mp.so [.] armpl::clag::bcms<(armpl::cla
96.37% thread_start
start_thread
0xffff9c1bb80c
- armpl::clag::parallel<armpl::clag::parallelise_2d<true, true, armpl::clag::resident<(armpl::clag::which_matrix)1
- 96.37% armpl::clag::parallelise_2d<true, true, armpl::clag::resident<(armpl::clag::which_matrix)1, armpl::cla
armpl::clag::bcms<(armpl::clag::which_matrix)1, double, armpl::clag::convert<double const, double, armpl::
+ 96.75% 0.00% python3 libc.so.6 [.] thread_start
+ 96.75% 0.00% python3 libc.so.6 [.] start_thread
+ 96.61% 0.00% python3 libgomp.so.1.0.0 [.] 0x0000ffff9c1bb80c
+ 3.24% 0.00% python3 libomp.so [.] __kmp_invoke_microtask
+ 3.24% 0.00% python3 _base.cpython-310-aarch64-linux-gnu.so [.] .omp_outlined..145
+ 3.16% 0.02% python3 _radius_neighbors.cpython-310-aarch64-linux-gnu.so [.] __pyx_f_7sklearn_7metrics_29_
+ 3.14% 0.00% python3 _middle_term_computer.cpython-310-aarch64-linux-gnu.so [.] __pyx_f_7sklearn_7metrics_29_
+ 3.14% 0.00% python3 _cython_blas.cpython-310-aarch64-linux-gnu.so [.] __pyx_fuse_1__pyx_f_7sklearn_
+ 3.14% 0.00% python3 libarmpl_mp.so [.] armpl::clag::gemm<true, int, 
+ 3.14% 0.00% python3 libarmpl_mp.so [.] _ZZZN5armpl4clag4gemmIdLNS0_4
+ 3.14% 0.00% python3 libgomp.so.1.0.0 [.] GOMP_parallel
0.17% 0.10% python3 libomp.so [.] kmp_flag_64<false, true>::wai
0.15% 0.15% python3 libarmpl_mp.so [.] dgemm_sve_big

Machine used: Graviton 3

Script used (DBSCAN):

#
# Imports
#
from sklearn.cluster import DBSCAN
from timeit import default_timer as timer
from sklearn.datasets import make_blobs
import numpy as np

#
# Generate dataset
#
X, y = make_blobs(n_samples=50000,
                  n_features=100,
                  centers=50,
                  center_box=(-32, 32),
                  shuffle=True,
                  random_state=0)

#
# Main
#
start = timer()
# n_jobs=-1 asks scikit-learn to use all available cores
y_pred = DBSCAN(n_jobs=-1).fit(X, y)
elapsed = timer() - start
print(f"Total Time Taken for Execution of DBSCAN Fit Function: {elapsed} sec/s")

  • Hi.

    That definitely looks like something pathologically bad is happening with the parallelism that we've not heard reported before.

    Could you let us know a couple of extra details, please, to help us figure out what's going on?

    * Which compiler do you think you are using: GCC or Arm Compiler for Linux?  I can see mentions of both libomp and libgomp in your profile above.  If the build is using two different OpenMP runtime libraries, there is a good chance that is what is causing difficulties.

    * If you execute the libarmpl_mp.so that you have linked to [Trust me, it will work!] then could you post that output, too?

    Thanks.

    Chris
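
One way to check whether two OpenMP runtimes really ended up in the same Python process, as the profile suggests, is to list the shared objects mapped into the process once the relevant modules have been imported. The sketch below is Linux-only and reads `/proc/self/maps`; the helper names and the list of library name prefixes are illustrative assumptions, not part of ArmPL or scikit-learn.

```python
import os
import re

# Library name prefixes treated as OpenMP runtimes (an assumed, non-exhaustive list):
# libgomp (GCC), libomp (LLVM/Clang), libiomp5 (Intel).
_OPENMP_PATTERN = re.compile(r"^(libgomp|libomp|libiomp5)\b.*\.so")


def openmp_runtimes(maps_text):
    """Return the sorted OpenMP runtime paths found in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # the pathname, if present, is the 6th field
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if _OPENMP_PATTERN.match(os.path.basename(path)):
            found.add(path)
    return sorted(found)


def loaded_openmp_runtimes():
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return openmp_runtimes(f.read())
```

Calling `loaded_openmp_runtimes()` after importing `sklearn.cluster` (and ideally after one `fit` call, since some libraries are loaded lazily) shows which runtimes are present. Seeing both a libgomp and a libomp entry would match the mixed-runtime situation visible in the profile.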

  • Thanks for your comments, they gave me an idea.

    Your questions helped me identify the issue.

    I had initially used clang to build numpy, scipy and scikit-learn, but the ArmPL libraries I was linking against are the default GCC version. So I rebuilt all the packages with GCC, and the threading issue was fixed. I now see it taking less time than with OpenBLAS.

    Thank you for your support.

  • Hi.

    Glad to hear that it was simple to fix!

    For reference, if you already have armflang installed, you have a compatible ArmPL installed as part of that package, as well as a GCC-compatible one.  In fact, using "armflang -armpl" would also do all the necessary linking for you.

    Nice to know you see an improvement over OpenBLAS as well!  Do let us know if you run into any other oddities.

    Chris