And obviously a DMB is enough here, with lower overhead?
How can cpu0 get s_byt_test_int1 > s_byt_test_int2?
int1++; thread1_barrier; int2++;
read int1; read int2;
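For reference, here is a minimal sketch of the same two-counter pattern in portable C11, not the original test code. The names `int1`, `int2`, `writer` and `run_ordering_test` are hypothetical stand-ins, and the fences are C11 `atomic_thread_fence` calls, which compilers typically lower to DMB variants on AArch64. Note that this sketch reads int2 first and int1 second, the opposite of the order above: if int1 is read first, the writer can run between the two reads, so either result is possible no matter which barrier is used.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the two counters in the question. */
static atomic_int int1, int2;

/* Writer: int1++; barrier; int2++;  (same shape as the sequence above) */
static void *writer(void *arg)
{
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        atomic_fetch_add_explicit(&int1, 1, memory_order_relaxed);
        /* Release fence: the int1 update is ordered before the int2 update. */
        atomic_thread_fence(memory_order_release);
        atomic_fetch_add_explicit(&int2, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Reader runs on the calling thread. Returns how many times it observed
 * int1 < int2, or -1 if the writer thread could not be created. */
int run_ordering_test(int iterations)
{
    pthread_t t;
    int violations = 0;
    int b;

    atomic_store(&int1, 0);
    atomic_store(&int2, 0);
    if (pthread_create(&t, NULL, writer, &iterations) != 0)
        return -1;

    do {
        /* Read int2 FIRST, then fence, then int1. */
        b = atomic_load_explicit(&int2, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        int a = atomic_load_explicit(&int1, memory_order_relaxed);
        if (a < b)
            violations++;   /* not expected when both fences are present */
    } while (b < iterations);

    pthread_join(t, NULL);
    return violations;
}
```

Calling `run_ordering_test(100000)` should return 0 with both fences in place. Removing either fence makes the result architecture-dependent, which is one way to compare barrier choices on the target CPU.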
Putting the two variables in the same cache line performs much better in these tests than keeping them in separate lines; see the results of step 1 and step 2. How can this be explained?