How can I improve ML algorithms or rewrite multithreaded code optimized for ARM?

I am trying to understand how to rewrite multithreaded code so that it is optimized for the ARM architecture. Any suggestions would be a great help.