Size of vectors in SVE2

Hello,

I want to use the SVE2 instruction set. To this end, I have created a VM with a known cloud provider, based on the Armv9 architecture, which supports the SVE2 instruction set.

I want to ask the following:

1) How can I determine the size of the vectors? The only way that I am aware of is by executing the following asm code:

rdvl            x9, #1
cmp             x9, #16
bgt             .label1

and then branching based on the VL value. Is there a command for this at the operating system level?

2) How can I change the size of the vectors? Currently, I can see that the size in my VM is 128 bits (using the aforementioned asm code). Is there a way to increase this value (provided that the underlying processor supports it)? For example, with a command at the operating system level?

3) I want to have vectors whose size is bigger than 128 bits. Is a processor that supports this enough or do I need anything else (for example a specific version of a specific operating system, specific gcc, and so on)?

4) Do you know a cloud provider which supports armv9 VMs with vector size bigger than 128 bits?

Thank you in advance,

Akis

  • Hi Akis,

    Inspecting the SVE vector length directly with assembly as in your code snippet
    is likely to be the fastest way, but you can also determine this from the
    operating system by using a prctl:

    #include <sys/prctl.h>
    #include <stdio.h>
    
    int main() {
      int vl_in_bytes = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
      printf("%d bits\n", vl_in_bytes * 8);
    
      int new_vl_in_bytes = 16; // set to 128 bits
      prctl(PR_SVE_SET_VL, new_vl_in_bytes); // should check success/failure
    
      vl_in_bytes = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
      printf("%d bits\n", vl_in_bytes * 8);
    }

    The above program will query the vector length, then set it to 128-bits and
    fetch the vector length again. On a system with a default vector-length of
    256-bits for instance you should see:

    $ gcc test.c
    $ ./a.out
    256 bits
    128 bits

    I would suggest reading the Linux documentation around PR_SVE_GET_VL and
    PR_SVE_SET_VL for additional options (in particular PR_SVE_SET_VL_ONEXEC and
    PR_SVE_VL_INHERIT).

    If your processor supports a vector length larger than 128-bits then that alone
    should be enough, for instance the Amazon Graviton 3 instances have a default
    vector length of 256-bits. If your processor does not support a larger vector
    length (and I would imagine that it does not, given the VM defaulting to
    128-bits?) then the above prctl will not be able to raise the vector length:
    the kernel selects the largest supported length that does not exceed the
    request.

    I am not aware of any cloud providers with Armv9 offerings with an SVE2 vector
    length greater than 128-bits, unfortunately. The Amazon Graviton 3 supports a
    vector length of 256-bits, however this is only SVE1 and not SVE2, but this
    might be sufficient for your use case?

    Thanks,
    George

  • Hi George,

    thanks for your answer. You are really helping me and I am really grateful for this.

    I executed the C code that you provided in my virtual machine. The output was:

    $ gcc test.c
    $ ./a.out
    128 bits
    128 bits

    I also tried to change line 8 to be:

    int new_vl_in_bytes = 32; // set to 256 bits

    Again, the output of the execution was the same:

    $ gcc test.c
    $ ./a.out
    128 bits
    128 bits

    So, my VM does not support vector sizes larger than 128 bits. (By the way, the cloud provider I am using is Alibaba. It seems that they have some in-house Armv9 processors, the Arm-based YiTian 710.)

    Regarding the question of whether I need SVE2 rather than SVE, I think the answer is that I need SVE2. There are some parts of the code where I think I must use the narrowing/widening instructions of SVE2. For example, consider the following code in NEON:

    function PFX(addAvg_2x\h\()_neon)
        lsl             x3, x3, #1
        lsl             x4, x4, #1
        mov             w11, #0x40
        dup             v30.16b, w11
    .rept \h / 2
        ldr             w10, [x0]
        ldr             w11, [x1]
        add             x0, x0, x3
        add             x1, x1, x4
        ldr             w12, [x0]
        ldr             w13, [x1]
        add             x0, x0, x3
        add             x1, x1, x4
        dup             v0.2s, w10
        dup             v1.2s, w11
        dup             v2.2s, w12
        dup             v3.2s, w13
        add             v0.4h, v0.4h, v1.4h
        add             v2.4h, v2.4h, v3.4h
        saddl           v0.4s, v0.4h, v30.4h
        saddl           v2.4s, v2.4h, v30.4h
        shrn            v0.4h, v0.4s, #7
        shrn2           v0.8h, v2.4s, #7
        sqxtun          v0.8b, v0.8h
        st1             {v0.h}[0], [x2], x5
        st1             {v0.h}[2], [x2], x5
    .endr
        ret
    endfunc
    .endm

    The only way I could think of to port it to SVE is the following (which uses the narrowing/widening instructions of SVE2):

    function PFX(addAvg_2x\h\()_sve2)
        mov             z30.b, #0x40
        ptrue           p0.s, vl2
        ptrue           p1.h, vl4
        ptrue           p2.h, vl2
    .rept \h / 2
        ld1rw           {z0.s}, p0/z, [x0]
        ld1rw           {z1.s}, p0/z, [x1]
        add             x0, x0, x3, lsl #1
        add             x1, x1, x4, lsl #1
        ld1rw           {z2.s}, p0/z, [x0]
        ld1rw           {z3.s}, p0/z, [x1]
        add             x0, x0, x3, lsl #1
        add             x1, x1, x4, lsl #1
        add             z0.h, p1/m, z0.h, z1.h
        add             z2.h, p1/m, z2.h, z3.h
        saddlb          z1.s, z0.h, z30.h
        saddlt          z3.s, z0.h, z30.h
        saddlb          z4.s, z2.h, z30.h
        saddlt          z5.s, z2.h, z30.h
        shrnb           z2.h, z1.s, #7
        shrnt           z2.h, z3.s, #7
        shrnb           z3.h, z4.s, #7
        shrnt           z3.h, z5.s, #7
        sqxtunb         z0.b, z2.h
        sqxtunb         z1.b, z3.h
        st1b            {z0.h}, p2, [x2]
        add             x2, x2, x5
        st1b            {z1.h}, p2, [x2]
        add             x2, x2, x5
    .endr
        ret
    endfunc
    .endm

    Am I missing something?

    So, if I do indeed need SVE2, the only way to get a platform with vector sizes larger than 128 bits is to buy real hardware, right?

    However, if I am not mistaken, the real advantage of SVE/SVE2 is the use of vectors whose size can be up to 2048 bits. When the vector size is 128 bits, there is not much difference in performance between a NEON version of the code (if it takes full advantage of the 128 bits) and an SVE2 version of the code (as you have pointed out in the other thread: https://community.arm.com/support-forums/f/high-performance-computing-forum/53949/take-full-advantage-of-sve-vector-length-agnostic-approach), right? Aren't there any platforms out there that support Armv9 with vector sizes larger than 128 bits?

    BR,

    Akis

  • Hi Akis,

    You're right, it seems like you would need SVE2 for your workload
    unfortunately. You are also correct that the performance at a vector length of
    128-bits is likely to be roughly the same as the Neon performance unless you
    are able to take advantage of instructions only present in SVE or SVE2 such as
    unpacked load/stores, predication, or gather/scatter instructions.

    If your motivation for using other machines with different vector lengths is to
    test correctness of your SVE2 code, there are a few different options available
    to you.

    The open-source QEMU emulator supports SVE2 at any vector length, you should be
    able to run your program as such:

    qemu-aarch64 -cpu max,sve-max-vq=[1-16] ./a.out

    Where 1-16 specifies the number of 128-bit segments in your vector (e.g.
    1 = 128-bits, 4 = 512-bits, 16 = 2048-bits).

    Arm also maintains Arm Instruction Emulator, which you can find documentation
    for here: https://developer.arm.com/documentation/102190/22-0/?lang=en

    Once installed you should just be able to run your binaries as normal using a
    command such as:

    armie -msve-vector-bits=[128-2048] ./a.out

    Hope that helps!

    Thanks,
    George

  • Hi George,

    thanks for your answer. I am after a real platform on which I can test and benchmark my code. The QEMU and Arm emulator approaches seem quite interesting, but as you said, they are only for development purposes. I will try them and come back to you if I face any problems.

    Is there a commercial platform that I can buy which supports SVE2 with vector sizes greater than 128 bits? It seems that neither the Raspberry Pi nor the MacBook is such a platform.

    Thanks!

    Akis

  • Hi Akis,

    Sure, let me know if you run into any problems using the tools and I will try
    and help if possible!

    Unfortunately I am not aware of any platforms currently available that support
    SVE2 at vector lengths greater than 128-bits.

    Thanks,
    George