Exploring Opportunities to Speed Up Vector API Performance on Arm

Hi HPC Community,

We have recently used the Vector API to implement bit packing and unpacking of boolean values.

For benchmarking, we used JMH with JDK 24.

Bit-packing: We used VectorMask.fromArray(…).toLong(…) and observed some speed-up.
Unpacking: We used VectorMask.fromLong(…).intoArray(…), but noticed a sharp performance degradation.
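For readers who want to try the two call chains, here is a minimal sketch of what they might look like. It is not our benchmark code: the class name MaskPacking, the helper names pack and unpack, and the choice of ByteVector.SPECIES_128 (16 lanes) are assumptions made for illustration. It needs --add-modules jdk.incubator.vector to compile and run.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only; names and species are assumptions, not the original benchmark.
final class MaskPacking {
    // Assumption: a fixed 128-bit byte species, i.e. 16 mask lanes per call.
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Packs SPECIES.length() booleans, starting at src[offset], into the
    // low bits of a long. Lane 0 maps to the least significant bit.
    static long pack(boolean[] src, int offset) {
        return VectorMask.fromArray(SPECIES, src, offset).toLong();
    }

    // Unpacks the low SPECIES.length() bits of 'bits' into dst,
    // starting at dst[offset].
    static void unpack(long bits, boolean[] dst, int offset) {
        VectorMask.fromLong(SPECIES, bits).intoArray(dst, offset);
    }

    public static void main(String[] args) {
        int lanes = SPECIES.length();
        boolean[] in = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            in[i] = (i % 3 == 0);
        }
        long packed = pack(in, 0);
        boolean[] out = new boolean[lanes];
        unpack(packed, out, 0);
        System.out.println("packed = 0x" + Long.toHexString(packed));
        System.out.println("round trip ok = " + java.util.Arrays.equals(in, out));
    }
}
```

A fixed species keeps the lane count at or below 64, so the mask always fits in one long. A real benchmark would loop over a larger array in steps of SPECIES.length().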

On inspecting the assembly with the HotSpot disassembler, we found that the SVE instructions STR (predicate), which stores a predicate register, and LDR (predicate), which loads one, are not being generated, even though they fit this use case well. Instead, the generated code relies on shifts, rotations, and bitwise operations.

With this post, we’d like to explore opportunities for improving the performance of VectorMask operations on Arm by leveraging direct predicate instructions (STR/LDR) rather than bitwise operations.

We have read an earlier post on the Vector API (Exploring SIMD and Java Vector API Performance). We look forward to insights and possible collaboration opportunities to improve performance on Arm.

Regards,
Chiranmoy

Mikhail Ablakatov 3 months ago

Hi Chiranmoy,

Thanks for the report. We're planning to reproduce this on the JDK mainline and we'll share findings as we learn more.

If you could share the benchmark you've used or another minimal reproducer test case, that would help.

Chiranmoy Bhattacharya 20 days ago in reply to Mikhail Ablakatov (verified answer)

Hi Mikhail,

The issue was resolved once PR https://github.com/openjdk/jdk/pull/27481 was merged. The benchmark performance is now acceptable.

I'll be closing the discussion.

Thank you for the support.