neon hard and soft float differences

I work on Valgrind. I recently got a Raspberry Pi 5 and have been cleaning up the regression tests a bit (under both the 32-bit arm Raspberry Pi OS userland and an all-aarch64 Ubuntu).

One failing testcase is called "neon64". It hasn't been updated for about 10 years. At least originally it was compiled with -mfloat-abi=softfp, but now it is compiled without any float-ABI options, so it must be defaulting to the Raspberry Pi OS gnueabihf hard-float ABI.
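
To double-check which ABI a build actually ends up with, here is a minimal sketch (not part of the testsuite; __ARM_PCS_VFP is the ACLE macro that GCC and Clang define when targeting the hard-float calling convention):

/* abicheck.c - sketch only: report which float ABI this was built for.
 * __ARM_PCS_VFP is defined when FP/NEON arguments are passed in VFP
 * registers (gnueabihf / -mfloat-abi=hard); with -mfloat-abi=softfp or
 * soft it is left undefined. */
#include <stdio.h>

int main(void)
{
#ifdef __ARM_PCS_VFP
   printf("hard-float ABI (FP args in VFP/NEON registers)\n");
#else
   printf("soft-float ABI variant (softfp or soft)\n");
#endif
   return 0;
}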

There are three differences, all of which look similar to this:

#define TESTINSN_VSTn(instruction, QD1, QD2, QD3, QD4) \
{ \
  unsigned int out[9]; \
\
  memset(out, 0x55, 8 * (sizeof(unsigned int)));\
  __asm__ volatile( \
      "mov r4, %1\n\t" \
      "vldmia %1!, {" #QD1 "}\n\t" \
      "vldmia %1!, {" #QD2 "}\n\t" \
      "vldmia %1!, {" #QD3 "}\n\t" \
      "vldmia %1!, {" #QD4 "}\n\t" \
      "mov %1, r4\n\t" \
      instruction ", [%0]\n\t" \
      "str %0, [%2]\n\t" \
      : \
      : "r" (out), "r" (mem), "r"(&out[8]) \
      : #QD1, #QD2, #QD3, #QD4, "memory", "r4" \
      ); \
  fflush(stdout); \
  printf("%s :: Result %08x'%08x %08x'%08x " \
         "%08x'%08x %08x'%08x  delta %d\n",             \
         instruction, out[1], out[0], out[3], out[2], out[5],   \
         out[4], out[7], out[6], (int)out[8]-(int)out);         \
}

TESTINSN_VSTn("vst4.32 {d0[1],d1[1],d2[1],d3[1]}", d0, d1, d2, d4);

which gives this diff:

-vst4.32 {d0[1],d1[1],d2[1],d3[1]} :: Result 0f0e0d0c'07060504 1f1e1d1c'17161514 55555555'55555555 55555555'55555555  delta 0
+vst4.32 {d0[1],d1[1],d2[1],d3[1]} :: Result 0f0e0d0c'07060504 55555555'17161514 55555555'55555555 55555555'55555555  delta 0
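
To decode those lines: the printf in the macro prints out[1]'out[0] out[3]'out[2] out[5]'out[4] out[7]'out[6], so the word that changed is out[3], i.e. the element the instruction stores from d3[1]. Below is a standalone sketch (not from the testsuite) of what a vst4.32 lane-1 store produces when all four of d0..d3 have been loaded from mem, assuming mem holds the incrementing bytes 0x00..0x1f (which is what the reference line implies) on a little-endian host; it reproduces the Result words of the "-" line:

/* decode.c - sketch only: model "vst4.32 {d0[1],d1[1],d2[1],d3[1]}"
 * with d0..d3 all loaded from mem = bytes 0x00..0x1f. */
#include <stdio.h>
#include <string.h>

int main(void)
{
   unsigned char mem[32];
   unsigned int  d[4][2];            /* d0..d3 as two 32-bit lanes each */
   unsigned int  out[8];

   for (int i = 0; i < 32; i++)
      mem[i] = (unsigned char)i;     /* 0x00 .. 0x1f */
   memcpy(d, mem, sizeof d);         /* the four vldmia loads */
   memset(out, 0x55, sizeof out);

   /* vst4.32 with a lane index stores one 32-bit element from each of
    * the four registers as four consecutive words. */
   for (int r = 0; r < 4; r++)
      out[r] = d[r][1];              /* lane [1] of d0..d3 */

   printf("%08x'%08x %08x'%08x %08x'%08x %08x'%08x\n",
          out[1], out[0], out[3], out[2],
          out[5], out[4], out[7], out[6]);
   /* -> 0f0e0d0c'07060504 1f1e1d1c'17161514 55555555'55555555 55555555'55555555 */
   return 0;
}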

The Valgrind results look correct: if I run the test program natively I get the same result as running it under Valgrind.

So my question is: does it seem plausible that there were bugs in the soft-float emulation that don't exist in the hardware?