Cortex-A8: memcpy() into DMA buffer hangs on NEON instructions

I am cyclically filling an mmap-ed DMA buffer with my data, copying it from "normal" memory in 290-byte chunks.
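To make the fill pattern concrete, here is a minimal sketch of the destination offsets it produces. It assumes the chunks are packed back to back starting at offset 0 and that the write position wraps to 0 when the next chunk would not fit; neither detail is stated above, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 290u /* chunk size from the post */

/* Offset of the chunk that follows the one at `offset`.
 * Assumption: chunks are packed back to back and the position wraps
 * to 0 when a whole chunk no longer fits in the buffer. */
static size_t next_chunk_offset(size_t offset, size_t buf_size)
{
    assert(buf_size >= CHUNK_SIZE);
    size_t next = offset + CHUNK_SIZE;
    return (next + CHUNK_SIZE <= buf_size) ? next : 0;
}

/* Largest power-of-two alignment (capped at 64) that `offset` satisfies. */
static unsigned offset_alignment(size_t offset)
{
    unsigned align = 64;
    while (align > 1 && (offset % align) != 0)
        align /= 2;
    return align;
}
```

Under these assumptions the first chunk lands at offset 0, while the second lands at offset 290, which is only 2-byte aligned relative to the start of the buffer.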

On the first cycle memcpy() always succeeds. On the second cycle it hangs in the __memcpy_neon routine (at least that is what gdb reports each time I press Ctrl-C).
The disassembly always shows execution stuck on a strmi instruction.

As a test, I replaced memcpy() with my own simple byte-by-byte memcpy1(), and everything works fine across the whole 3 MB DMA buffer (but slower, obviously... :-)).
To rule out an alignment issue, I tested the library memcpy() on unaligned buffers, and no problems were detected.
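For reference, a byte-by-byte copy of the kind described could look like the sketch below. This is a hypothetical reconstruction, not the actual memcpy1() used in the test. The volatile qualifiers are an addition here: without them an optimizing compiler may merge the byte accesses or replace the loop with a call to the library memcpy().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reconstruction of a byte-by-byte copy routine.
 * volatile forces exactly one byte load and one byte store per
 * iteration, so the compiler cannot widen the accesses or turn the
 * loop back into a memcpy() call. */
static void *memcpy1(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Because every access is a single byte, this routine never issues an unaligned or multi-register access to the destination, whatever the chunk offset is.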

After many experiments with different assembler variants of memcpy (taken from glibc, newlib, and other libraries), I can say that what hangs is the NEON memory copy instructions (VLDM), both with and without preload:

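Loop:
    PLD   [r1, #0xC0]       @ preload source data ahead of the copy
    VLDM  r1!, {d0-d7}      @ load 64 bytes from source
    VSTM  r0!, {d0-d7}      @ store 64 bytes to destination
    SUBS  r2, r2, #0x40     @ 64 bytes done
    BNE   Loop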
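Note that the loop as shown moves 0x40 bytes per iteration and exits only when r2 reaches exactly zero, so it is presumably the inner loop of a larger routine that handles the remainder separately. A minimal sketch of that split follows; the struct and function names are hypothetical and not taken from any of the libraries mentioned.

```c
#include <assert.h>
#include <stddef.h>

#define NEON_BLOCK 0x40u /* bytes moved per VLDM/VSTM {d0-d7} pair */

struct copy_plan {
    size_t bulk; /* bytes for the NEON loop, a multiple of NEON_BLOCK */
    size_t tail; /* leftover bytes for a plain byte or word copy */
};

/* Split a copy of n bytes into a NEON-sized part and a remainder. */
static struct copy_plan plan_copy(size_t n)
{
    struct copy_plan p;
    p.bulk = n & ~(size_t)(NEON_BLOCK - 1);
    p.tail = n - p.bulk;
    assert(p.bulk % NEON_BLOCK == 0 && p.tail < NEON_BLOCK);
    return p;
}
```

For a 290-byte chunk this gives 256 bytes for the NEON loop (four iterations) and a 34-byte tail.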

All other "normal" variants of memcpy() work fine. Are there any known pitfalls in using uncached(!) mmap-ed DMA memory with NEON instructions?
(I am using Linux 2.6.37 with glibc 2.23 (gcc 6.3.1 Linaro) on a DM8148 CPU.)