This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

No segmentation fault with unaligned access

Hi all,

It is a well known fact that performing an aligned vector load with an unaligned memory address should lead to segmentation fault.

However, when I do try to run code segment below using the same, i do not see any segmentation fault.

------------------------------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "arm_neon.h"

void add (uint32x4_t *data_a,uint32x4_t *data_b)
{
    /* Set each sixteen values of the vector to 3.
     *
     * Remark: a 'q' suffix to intrinsics indicates
     * the instruction run for 128 bits registers.
     */
    
    *data_a = vaddq_u32 (*data_a, *data_b);
}

int main (int argc,char** argv)
{
    unsigned int n = atoi(argv[1]);

    /* Create custom arbitrary data. */
    uint32_t uint32_data_a[n];
    uint32_t uint32_data_b[n];
    uint32_t uint32_data_c[n];
    struct timespec start,end;

    for(uint32_t i = 1; i <= n ; i+=1)
    {
        uint32_data_a[i-1] = i;
        uint32_data_b[i-1] = i;
        uint32_data_c[i-1] = i;
    }

    /* Create the vector with our data. */
    uint32x4_t data_a;
    uint32x4_t data_b;

   
    clock_gettime(CLOCK_MONOTONIC,&start);

    for(int count = 0; count < 10; count++)
    {
        for(int i = 0; i < n ; i+=4)
        {    
            /* Load our custom data into the vector register. */
            data_a  = vld1q_u32 (uint32_data_a + i);
        data_b  = vld1q_u32 (uint32_data_b + i);

            /* Call of the add3 function. */
            add(&data_a,&data_b);

            vst1q_u32(uint32_data_c + i,data_a);
    }    
    }

    clock_gettime(CLOCK_MONOTONIC,&end);

    double time_usec=(((double)end.tv_sec * 1000000 + (double)end.tv_nsec/1000) - ((double)start.tv_sec *1000000 + (double)start.tv_nsec/1000));
    printf("Time taken for aligned load is : %fus and count is %d \n", time_usec/10,n );

    for(uint32_t i = 0; i < n ; i++) printf("%2d ",uint32_data_c[i]);
    printf("\n");
    return 0;
}

----------------------------------------------------------------------------------------------------

Clearly almost every access to memory in this case is unaligned? is there any reason for this inconsistent behavior? Thanks in advance.

Aketh TM

Parents Reply Children
  • Hi Jason and thanks for the reply.

    I didn't find the user count increasing when running the program, is that a sign of something?

    I am not sure what that indicates. could you point me material about the same OR kindly provide more information about the same?

  • Also jason there are 2 keypoints you have missed here.

    1) The kernel needs an argument to be passed to it, say ./a.out 32
    2) I have accidentally have posted the wrong kernel. The loop with i must begin with i = 1. (I am try and edit the original post if possible)

    Here is the correct kernel for reference

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include "arm_neon.h"

    void add (uint32x4_t *data_a,uint32x4_t *data_b)
    {
        /* Set each sixteen values of the vector to 3.
         *
         * Remark: a 'q' suffix to intrinsics indicates
         * the instruction run for 128 bits registers.
         */
        
        *data_a = vaddq_u32 (*data_a, *data_b);
    }

    int main (int argc,char** argv)
    {
        unsigned int n = atoi(argv[1]);

        /* Create custom arbitrary data. */
        uint32_t uint32_data_a[n];
        uint32_t uint32_data_b[n];
        uint32_t uint32_data_c[n];
        struct timespec start,end;

        for(uint32_t i = 1; i <= n ; i+=1)
        {
            uint32_data_a[i-1] = i;
            uint32_data_b[i-1] = i;
            uint32_data_c[i-1] = i;
        }

        /* Create the vector with our data. */
        uint32x4_t data_a;
        uint32x4_t data_b;

       
        clock_gettime(CLOCK_MONOTONIC,&start);

        for(int count = 0; count < 10; count++)
        {
            for(int i = 0; i < n ; i+=4)
            {    
                /* Load our custom data into the vector register. */
                data_a  = vld1q_u32 (uint32_data_a + i);
            data_b  = vld1q_u32 (uint32_data_b + i);

                /* Call of the add3 function. */
                add(&data_a,&data_b);

                vst1q_u32(uint32_data_c + i,data_a);
        }    
        }

        clock_gettime(CLOCK_MONOTONIC,&end);

        double time_usec=(((double)end.tv_sec * 1000000 + (double)end.tv_nsec/1000) - ((double)start.tv_sec *1000000 + (double)start.tv_nsec/1000));
        printf("Time taken for aligned load is : %fus and count is %d \n", time_usec/10,n );

        for(uint32_t i = 0; i < n ; i++) printf("%2d ",uint32_data_c[i]);
        printf("\n");
        return 0;

  • Hi,

    The program works for me know, thanks for the corrections. 

    I'm interested to find out why you think there are unaligned memory accesses and how you determined this? A little more info on what you think the problem is would be helpful.

    Thanks,

    Jason

  • Notice the for loop, it clearly begins with 1, hence

    1) if the array was indeed aligned to a 128 bit/16 byte boundary we would see a segmentation fault straight away, since a vector load at an unaligned boundary must cause a segmentation fault.

    2) if the array wasn't aligned and by chance always the element 1 was processed okay, we would expect a segmentation fault atleast on any of other values of i (since every element of an array can never be aligned). Note :- The loop has stride 1, (i = 0 ; i < n ; i++).

    Hence, I expect a segmentation fault.

  • Hi,

    I recommend to narrow the focus to a specific instruction and load address you think should be causing segfault.

    Use objdump -S to see the source code mixed with the disassembly, gdb to single step by source and assembly and print registers, or just printf() statements to print pointer values.

    I didn't see any unaligned accesses. The stride of 1 is used for pointer arithmetic which moves to the next vector.

    Thanks,

    Jason