Hi all,
It is a well known fact that performing an aligned vector load with an unaligned memory address should lead to segmentation fault.
However, when I do try to run code segment below using the same, i do not see any segmentation fault.
------------------------------------------------------------------------------------------------
#include <stdio.h>#include <stdlib.h>#include <time.h>#include "arm_neon.h"void add (uint32x4_t *data_a,uint32x4_t *data_b) { /* Set each sixteen values of the vector to 3. * * Remark: a 'q' suffix to intrinsics indicates * the instruction run for 128 bits registers. */ *data_a = vaddq_u32 (*data_a, *data_b);}int main (int argc,char** argv) { unsigned int n = atoi(argv[1]); /* Create custom arbitrary data. */ uint32_t uint32_data_a[n]; uint32_t uint32_data_b[n]; uint32_t uint32_data_c[n]; struct timespec start,end; for(uint32_t i = 1; i <= n ; i+=1) { uint32_data_a[i-1] = i; uint32_data_b[i-1] = i; uint32_data_c[i-1] = i; } /* Create the vector with our data. */ uint32x4_t data_a; uint32x4_t data_b; clock_gettime(CLOCK_MONOTONIC,&start); for(int count = 0; count < 10; count++) { for(int i = 0; i < n ; i+=4) { /* Load our custom data into the vector register. */ data_a = vld1q_u32 (uint32_data_a + i); data_b = vld1q_u32 (uint32_data_b + i); /* Call of the add3 function. */ add(&data_a,&data_b); vst1q_u32(uint32_data_c + i,data_a); } } clock_gettime(CLOCK_MONOTONIC,&end); double time_usec=(((double)end.tv_sec * 1000000 + (double)end.tv_nsec/1000) - ((double)start.tv_sec *1000000 + (double)start.tv_nsec/1000)); printf("Time taken for aligned load is : %fus and count is %d \n", time_usec/10,n ); for(uint32_t i = 0; i < n ; i++) printf("%2d ",uint32_data_c[i]); printf("\n"); return 0;}
----------------------------------------------------------------------------------------------------
Clearly almost every access to memory in this case is unaligned? is there any reason for this inconsistent behavior? Thanks in advance.
Aketh TM
Hi,
I did a quick check with your program on a Raspberry Pi 3 running Ubuntu Mate (info from this article)
https://community.arm.com/tools/b/blog/posts/profiling-alexnet-on-raspberry-pi-and-hikey-960-with-the-compute-library
It segfaulted for me on this system.
$ gcc -g t.c -mfpu=neon$ ./a.outSegmentation fault
Some things that may be relevant:
$ cat /proc/cpu/alignment
Check if the user count is increasing when you run your program.
Also, look at your /proc/config.gz for the alignment related kernel configuration.
Thanks,
Jason
Hi Jason and thanks for the reply.I didn't find the user count increasing when running the program, is that a sign of something?I am not sure what that indicates. could you point me material about the same OR kindly provide more information about the same?
Also jason there are 2 keypoints you have missed here.
1) The kernel needs an argument to be passed to it, say ./a.out 322) I have accidentally have posted the wrong kernel. The loop with i must begin with i = 1. (I am try and edit the original post if possible)Here is the correct kernel for reference#include <stdio.h>#include <stdlib.h>#include <time.h>#include "arm_neon.h"void add (uint32x4_t *data_a,uint32x4_t *data_b) { /* Set each sixteen values of the vector to 3. * * Remark: a 'q' suffix to intrinsics indicates * the instruction run for 128 bits registers. */ *data_a = vaddq_u32 (*data_a, *data_b);}int main (int argc,char** argv) { unsigned int n = atoi(argv[1]); /* Create custom arbitrary data. */ uint32_t uint32_data_a[n]; uint32_t uint32_data_b[n]; uint32_t uint32_data_c[n]; struct timespec start,end; for(uint32_t i = 1; i <= n ; i+=1) { uint32_data_a[i-1] = i; uint32_data_b[i-1] = i; uint32_data_c[i-1] = i; } /* Create the vector with our data. */ uint32x4_t data_a; uint32x4_t data_b; clock_gettime(CLOCK_MONOTONIC,&start); for(int count = 0; count < 10; count++) { for(int i = 0; i < n ; i+=4) { /* Load our custom data into the vector register. */ data_a = vld1q_u32 (uint32_data_a + i); data_b = vld1q_u32 (uint32_data_b + i); /* Call of the add3 function. */ add(&data_a,&data_b); vst1q_u32(uint32_data_c + i,data_a); } } clock_gettime(CLOCK_MONOTONIC,&end); double time_usec=(((double)end.tv_sec * 1000000 + (double)end.tv_nsec/1000) - ((double)start.tv_sec *1000000 + (double)start.tv_nsec/1000)); printf("Time taken for aligned load is : %fus and count is %d \n", time_usec/10,n ); for(uint32_t i = 0; i < n ; i++) printf("%2d ",uint32_data_c[i]); printf("\n"); return 0;
The program works for me know, thanks for the corrections.
I'm interested to find out why you think there are unaligned memory accesses and how you determined this? A little more info on what you think the problem is would be helpful.
Notice the for loop, it clearly begins with 1, hence1) if the array was indeed aligned to a 128 bit/16 byte boundary we would see a segmentation fault straight away, since a vector load at an unaligned boundary must cause a segmentation fault.
2) if the array wasn't aligned and by chance always the element 1 was processed okay, we would expect a segmentation fault atleast on any of other values of i (since every element of an array can never be aligned). Note :- The loop has stride 1, (i = 0 ; i < n ; i++). Hence, I expect a segmentation fault.
I recommend to narrow the focus to a specific instruction and load address you think should be causing segfault.
Use objdump -S to see the source code mixed with the disassembly, gdb to single step by source and assembly and print registers, or just printf() statements to print pointer values.
I didn't see any unaligned accesses. The stride of 1 is used for pointer arithmetic which moves to the next vector.