Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?
Also, consider using the NEON intrinsics (from arm_neon.h) instead of inline asm. The compiler then handles register allocation and instruction scheduling for you.
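
For illustration, here is a minimal sketch of what the intrinsics approach could look like. The function name add_f32 and the float-add workload are my own assumptions, not something from your code. The NEON path is guarded by the __ARM_NEON / __ARM_NEON__ macros, which GCC should define when NEON is enabled (for example with the flags above), so the same file still builds with a scalar fallback if it is not.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and workload are assumptions for this sketch):
 * element-wise dst[i] = a[i] + b[i] for n floats. */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEON path: process four floats per iteration in 128-bit q registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lanes and store  */
    }
#endif

    /* Scalar path: handles the leftover elements, or the whole array
     * when NEON is not enabled at compile time. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_f32(out, a, b, 6) on two 6-element arrays would run one NEON iteration for the first four elements and finish the last two in the scalar loop. If the build fails on the arm_neon.h include or the NEON path is never taken, that usually points back at the -mfpu / -mfloat-abi flags not reaching the compiler.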