Exploring Opportunities to Speed Up Vector API Performance on Arm

Hi HPC Community,

We have recently used the Vector API to implement bit packing and unpacking of boolean values.

For benchmarking, we used JMH with JDK 24.

  • Bit-packing: We used VectorMask.fromArray(…).toLong(…) and observed some speed-up.
  • Unpacking: We used VectorMask.fromLong(…).intoArray(…), but noticed a sharp performance degradation.
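
For readers less familiar with the operation, here is a minimal scalar sketch of the packing semantics we are targeting. It is not our benchmark code: the class and method names (BitPackReference, pack, unpack) are hypothetical. It assumes element i maps to bit i of the long, with the first element in the least significant bit, which is how VectorMask.toLong() is documented to lay out lanes.

```java
// Hypothetical scalar reference for boolean bit packing/unpacking.
// Assumption: element i maps to bit i (least significant bit first).
final class BitPackReference {

    // Packs `count` booleans (at most 64) starting at `offset` into a long.
    static long pack(boolean[] src, int offset, int count) {
        long bits = 0L;
        for (int i = 0; i < count; i++) {
            if (src[offset + i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Unpacks the low `count` bits (at most 64) of `bits` into `dst` at `offset`.
    static void unpack(long bits, boolean[] dst, int offset, int count) {
        for (int i = 0; i < count; i++) {
            dst[offset + i] = ((bits >>> i) & 1L) != 0;
        }
    }
}
```

A loop like this is the kind of baseline the vectorized versions can be measured against. The shift-and-mask pattern in it is also roughly what we would like a single predicate load or store to replace.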

While inspecting the generated assembly with the HotSpot disassembler, we noticed that the SVE instructions STR (predicate) and LDR (predicate), which store and load a predicate register and match this use case well, are not being generated. Instead, the current implementation relies on shifts, rotations, and bitwise operations.

With this post, we’d like to explore opportunities for improving the performance of VectorMask operations on Arm by leveraging direct predicate instructions (STR/LDR) rather than bitwise operations.

We have read a prior post on the Vector API (Exploring SIMD and Java Vector API Performance) and look forward to insights and possible collaboration opportunities to improve performance on Arm.

Regards,
Chiranmoy
