A maxVal of 255 may well map onto a single instruction.
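To illustrate the point, here is a minimal sketch that assumes the values are already 16-bit signed (the posted code works on 32-bit lanes, so this is an assumption, not the original layout). With that width, clamping to [0, 255] is what a saturating narrow does, so `vqmovun_s16` replaces the min/max pair. The function name `saturate_s16_to_u8` is hypothetical, and the scalar loop handles the tail as well as builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clamp signed 16-bit values to [0, 255] and store them as bytes. */
void saturate_s16_to_u8(const int16_t *src, uint8_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Eight lanes per iteration: the saturating narrow performs the clamp. */
    for (; i + 8 <= n; i += 8) {
        int16x8_t v = vld1q_s16(src + i);
        vst1_u8(dst + i, vqmovun_s16(v));
    }
#endif
    /* Scalar tail, also used when NEON is not available. */
    for (; i < n; ++i) {
        int16_t v = src[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

For 32-bit input, a narrowing step down to 16 bits would be needed first, so the saving depends on the data width actually in use.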
5.12.5. VLDn and VSTn (multiple n-element structures)

Vector Load multiple n-element structures. It loads multiple n-element structures from memory into one or more NEON registers, with de-interleaving (unless n == 1). Every element of each register is loaded.

Vector Store multiple n-element structures. It stores multiple n-element structures to memory from one or more NEON registers, with interleaving (unless n == 1). Every element of each register is stored.

Syntax

Vopn{cond}.datatype list, [Rn{@align}]{!}
Vopn{cond}.datatype list, [Rn{@align}], Rm

where:

[i]op[/i] must be either LD or ST.
[i]n[/i] must be one of 1, 2, 3, or 4.
[i]cond[/i] is an optional condition code (see Condition codes).
[i]datatype[/i] see Table 5.14 for options.
[i]list[/i] specifies the NEON register list. See Table 5.14 for options.
[i]Rn[/i] is the ARM register containing the base address. Rn cannot be r15.
[i]align[/i] specifies an optional alignment. See Table 5.14 for options.
! if ! is present, Rn is updated to (Rn + the number of bytes transferred by the instruction). The update occurs after all the loads or stores have taken place.
[i]Rm[/i] is an ARM register containing an offset from the base address. If Rm is present, Rn is updated to (Rn + Rm) after the address is used to access memory. Rm cannot be r13 or r15.
"vld1.16 q1, [%4] ,%5 \n\t"
Error: Neon quad precision register expected -- `vld1.16 q1,[ip],r4'
"cmp %10,#1 \n\t" //if(isLast)
"bne 3f \n\t"
"vmin.s32 d4,d4,d13 \n\t"
"vmax.s32 d4,d4,d12 \n\t"
"3: \n\t"
//d13 contains maxVal(255)
//d12 contains 0
Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?
Also, you really might want to consider using the intrinsics instead of inline asm.
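As a rough illustration of the intrinsics route, the clamp from the asm snippet (min against maxVal, then max against 0) could be sketched like this. The function name `clamp_s32` and its signature are hypothetical; only the `vminq_s32`/`vmaxq_s32` pair corresponds to the posted `vmin.s32`/`vmax.s32` instructions, here on q registers rather than d registers. The scalar loop covers the tail and lets the same file build on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clamp n signed 32-bit values to [0, max_val]. */
void clamp_s32(const int32_t *src, int32_t *dst, size_t n, int32_t max_val)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int32x4_t lo = vdupq_n_s32(0);        /* plays the role of d12 */
    const int32x4_t hi = vdupq_n_s32(max_val);  /* plays the role of d13 */
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(src + i);
        v = vminq_s32(v, hi);   /* upper bound */
        v = vmaxq_s32(v, lo);   /* lower bound */
        vst1q_s32(dst + i, v);
    }
#endif
    /* Scalar tail, also used when NEON is not available. */
    for (; i < n; ++i) {
        int32_t v = src[i];
        if (v > max_val) v = max_val;
        if (v < 0) v = 0;
        dst[i] = v;
    }
}
```

With intrinsics the compiler takes care of register allocation and operand syntax, which avoids the kind of assembler error shown above.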