Hello,
I want to use the SVE2 instruction set. To this end, I have created a VM at a well-known cloud provider, based on the Armv9 architecture, which supports the SVE2 instruction set.
I want to ask the following:
1) How can I determine the size of the vectors? The only way that I am aware of is by executing the following asm code:
    rdvl    x9, #1          // x9 = vector length in bytes
    cmp     x9, #16
    bgt     .label1         // branch if VL > 128 bits
and branch based on the VL value. Is there any other way to do this, for example a command at the operating-system level?
2) How can I change the size of the vectors? Currently, I can see that the size in my VM is 128 bits (using the aforementioned asm code). Is there a way to increase this value (provided that the underlying processor supports it)? For example, a command in the operating system level?
3) I want to have vectors whose size is bigger than 128 bits. Is a processor that supports this enough or do I need anything else (for example a specific version of a specific operating system, specific gcc, and so on)?
4) Do you know a cloud provider which supports armv9 VMs with vector size bigger than 128 bits?
Thank you in advance,
Akis
Hi George,
thanks for your answer; you are really helping me and I am grateful for it.
I executed the C code that you provided in my virtual machine. The output was:
$ gcc test.c
$ ./a.out
128 bits
128 bits
I also tried to change line 8 to be:
int new_vl_in_bytes = 32; // set to 256 bits
Again, the output of the execution was the same as above (128 bits twice).
So, my VM does not support vector sizes larger than 128 bits. (By the way, the cloud provider I am using is Alibaba; it seems they have some in-house Armv9 processors, the Arm-based YiTian 710.)
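For reference, here is roughly how I understand the test program. This is my own reconstruction built around the Linux prctl() interface, not your exact code; only new_vl_in_bytes is taken from it:

#include <stdio.h>
#include <sys/prctl.h>

/* Fallback definitions in case the installed headers are older */
#ifndef PR_SVE_SET_VL
#define PR_SVE_SET_VL       50
#define PR_SVE_GET_VL       51
#define PR_SVE_VL_LEN_MASK  0xffff
#endif

int main(void)
{
    /* Current SVE vector length in bytes */
    int vl_in_bytes = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
    printf("%d bits\n", vl_in_bytes * 8);

    int new_vl_in_bytes = 32;               /* request 256 bits */
    prctl(PR_SVE_SET_VL, new_vl_in_bytes);  /* clamped to the largest VL the hardware supports */

    vl_in_bytes = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
    printf("%d bits\n", vl_in_bytes * 8);
    return 0;
}

If I read the prctl() documentation correctly, the kernel clamps a request larger than what the hardware supports down to the maximum supported length, which would explain why the output stays at 128 bits.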
Regarding the question of whether I need SVE2 rather than SVE, I think the answer is that I do need SVE2. There are some parts of the code where I think I must use the narrowing/widening instructions of SVE2. For example, consider the following Neon code:
function PFX(addAvg_2x\h\()_neon)
    lsl     x3, x3, #1
    lsl     x4, x4, #1
    mov     w11, #0x40
    dup     v30.16b, w11
.rept \h / 2
    ldr     w10, [x0]
    ldr     w11, [x1]
    add     x0, x0, x3
    add     x1, x1, x4
    ldr     w12, [x0]
    ldr     w13, [x1]
    add     x0, x0, x3
    add     x1, x1, x4
    dup     v0.2s, w10
    dup     v1.2s, w11
    dup     v2.2s, w12
    dup     v3.2s, w13
    add     v0.4h, v0.4h, v1.4h
    add     v2.4h, v2.4h, v3.4h
    saddl   v0.4s, v0.4h, v30.4h
    saddl   v2.4s, v2.4h, v30.4h
    shrn    v0.4h, v0.4s, #7
    shrn2   v0.8h, v2.4s, #7
    sqxtun  v0.8b, v0.8h
    st1     {v0.h}[0], [x2], x5
    st1     {v0.h}[2], [x2], x5
.endr
    ret
endfunc
.endm
The only way I could think of to migrate it to SVE is the following (which, as you can see, relies on the narrowing/widening instructions of SVE2):
function PFX(addAvg_2x\h\()_sve2)
    mov     z30.b, #0x40
    ptrue   p0.s, vl2
    ptrue   p1.h, vl4
    ptrue   p2.h, vl2
.rept \h / 2
    ld1rw   {z0.s}, p0/z, [x0]
    ld1rw   {z1.s}, p0/z, [x1]
    add     x0, x0, x3, lsl #1
    add     x1, x1, x4, lsl #1
    ld1rw   {z2.s}, p0/z, [x0]
    ld1rw   {z3.s}, p0/z, [x1]
    add     x0, x0, x3, lsl #1
    add     x1, x1, x4, lsl #1
    add     z0.h, p1/m, z0.h, z1.h
    add     z2.h, p1/m, z2.h, z3.h
    saddlb  z1.s, z0.h, z30.h
    saddlt  z3.s, z0.h, z30.h
    saddlb  z4.s, z2.h, z30.h
    saddlt  z5.s, z2.h, z30.h
    shrnb   z2.h, z1.s, #7
    shrnt   z2.h, z3.s, #7
    shrnb   z3.h, z4.s, #7
    shrnt   z3.h, z5.s, #7
    sqxtunb z0.b, z2.h
    sqxtunb z1.b, z3.h
    st1b    {z0.h}, p2, [x2]
    add     x2, x2, x5
    st1b    {z1.h}, p2, [x2]
    add     x2, x2, x5
.endr
    ret
endfunc
.endm
Am I missing something?
So, if I do indeed need SVE2, the only way to get a platform with vector sizes larger than 128 bits is to buy real hardware, right?
However, if I am not mistaken, the real advantage of SVE/SVE2 is the ability to use vectors of up to 2048 bits. When the vector size is 128 bits, there is not much performance difference between a Neon version of the code (if it takes full advantage of the 128 bits) and an SVE2 version of the code (as you pointed out in the other thread: https://community.arm.com/support-forums/f/high-performance-computing-forum/53949/take-full-advantage-of-sve-vector-length-agnostic-approach), right? Aren't there enough platforms out there that support Armv9 with vector sizes larger than 128 bits?
BR,
Hi Akis,
You're right, it seems like you would need SVE2 for your workload, unfortunately. You are also correct that the performance at a vector length of 128 bits is likely to be roughly the same as the Neon performance unless you are able to take advantage of instructions only present in SVE or SVE2, such as unpacked load/stores, predication, or gather/scatter instructions.
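For example, a predicated gather-load loop has no direct Neon equivalent. A quick sketch of the sort of thing I mean (my own illustrative example, not taken from your code; it assumes a compiler with the SVE ACLE, e.g. GCC or Clang built with something like gcc -O2 -march=armv8-a+sve):

#include <arm_sve.h>
#include <stdint.h>

/* dst[i] += src[idx[i]] for i = 0..n-1, for any vector length */
void gather_add(float *dst, const float *src, const int32_t *idx, int n)
{
    for (int i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s32(i, n);        /* predicate covering the remaining lanes */
        svint32_t vidx = svld1_s32(pg, idx + i);      /* load the indices */
        svfloat32_t g = svld1_gather_s32index_f32(pg, src, vidx);  /* gather src[idx[i]] */
        svfloat32_t d = svld1_f32(pg, dst + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, d, g));
    }
}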
If your motivation for using other machines with different vector lengths is to test correctness of your SVE2 code, there are a few different options available to you.
The open-source QEMU emulator supports SVE2 at any vector length, so you should be able to run your program like this:
qemu-aarch64 -cpu max,sve-max-vq=[1-16] ./a.out
where 1-16 specifies the number of 128-bit segments in your vector (e.g. 1 = 128 bits, 4 = 512 bits, and so on).
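If you want to confirm the vector length the emulated program actually sees, a small test along these lines should do it (my own sketch, using the ACLE svcntb() intrinsic; build with something like gcc -O2 -march=armv8-a+sve2):

#include <stdio.h>
#include <arm_sve.h>

int main(void)
{
    /* svcntb() returns the number of bytes in one SVE vector */
    printf("%lu bits\n", svcntb() * 8);
    return 0;
}

Running it as qemu-aarch64 -cpu max,sve-max-vq=4 ./a.out should then report 512 bits.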
Arm also maintains the Arm Instruction Emulator, which you can find documentation for here: https://developer.arm.com/documentation/102190/22-0/?lang=en
Once installed, you should just be able to run your binaries as normal using a command such as:
armie -msve-vector-bits=[128-2048] ./a.out
Hope that helps!
Thanks,
George
thanks for your answer. I am after a real platform on which I can test and benchmark my code. The QEMU and Arm Instruction Emulator approaches seem quite interesting, but, as you said, they are only for development purposes. I will try to use them and come back to you if I face any problems.
Is there a commercial platform that I can buy which supports SVE2 with vector sizes greater than 128 bits? It seems that neither the Raspberry Pi nor MacBooks are such platforms.
Thanks!
Sure, let me know if you run into any problems using the tools and I will try and help if possible!
Unfortunately, I am not aware of any platforms currently available that support SVE2 at vector lengths greater than 128 bits.