How to do a fast memcpy of 8 words on an ARM Cortex-M4?

Our ARM Cortex-M4 application, written in C++, needs to copy a struct of 8 x 32-bit words to external memory as fast as possible.

I found that a 'for' loop performed better than memcpy, but it's still slow.
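To make the comparison concrete, here is a simplified sketch of the two variants meant above. `Frame8` and both function names are placeholders for illustration, not the real application code, and the real struct may have named fields rather than an array.

```cpp
#include <cstdint>
#include <cstring>

// Placeholder for the 8 x 32-bit word struct (hypothetical layout).
struct Frame8 {
    std::uint32_t w[8];
};

// Variant 1: library memcpy with a compile-time constant size.
inline void copy_with_memcpy(Frame8* dst, const Frame8* src) {
    std::memcpy(dst, src, sizeof(Frame8));
}

// Variant 2: explicit word-by-word loop.
inline void copy_with_loop(Frame8* dst, const Frame8* src) {
    for (int i = 0; i < 8; ++i) {
        dst->w[i] = src->w[i];
    }
}
```

In the real application `dst` would point into the external memory region; here it is just an ordinary pointer.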

Are there intrinsics using LDM/STM instructions, or an optimised version of memcpy, that we could use?
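In the absence of a known intrinsic, the kind of thing we have in mind is GCC-style inline assembly, roughly as in the sketch below. This is an unmeasured illustration, not a verified solution: `copy8_words` is a made-up name, the register choice is arbitrary, and both pointers are assumed to be 4-byte aligned because LDM/STM do not support unaligned addresses. The `#else` branch is only there so the sketch also builds on a non-ARM host.

```cpp
#include <cstdint>

// Hypothetical helper: copies exactly 8 words with one LDM and one STM.
// Assumes dst and src are both 4-byte aligned.
inline void copy8_words(std::uint32_t* dst, const std::uint32_t* src) {
#if defined(__arm__) && defined(__thumb2__)
    __asm volatile(
        "ldmia %[s], {r2-r6, r8, r10, r12}\n\t"  // load 8 words from src
        "stmia %[d], {r2-r6, r8, r10, r12}\n\t"  // store 8 words to dst
        :
        : [s] "r"(src), [d] "r"(dst)
        : "r2", "r3", "r4", "r5", "r6", "r8", "r10", "r12", "memory");
#else
    // Portable fallback for targets without Thumb-2.
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[i];
    }
#endif
}
```

Clobbering eight registers forces the compiler to save and restore any it is already using, so it is not obvious that this would beat compiler-generated code.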

Would a 'placement new' for the destination, with a simple assignment of one struct to another, help?
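A minimal sketch of that idea, assuming the struct is trivially copyable. `Packet`, `copy_by_assignment`, and `dst_mem` are hypothetical names; `dst_mem` stands in for the external-memory address, which is assumed to be suitably aligned.

```cpp
#include <cstdint>
#include <new>
#include <type_traits>

// Hypothetical 8-word struct; the real one is not shown here.
struct Packet {
    std::uint32_t w[8];
};

static_assert(sizeof(Packet) == 32, "expected 8 x 32-bit words");
static_assert(std::is_trivially_copyable<Packet>::value,
              "struct assignment should be a plain 32-byte copy");

// Construct a Packet at dst_mem, then copy src into it by assignment.
inline Packet* copy_by_assignment(void* dst_mem, const Packet& src) {
    Packet* dst = ::new (dst_mem) Packet;  // trivial default-init, writes nothing
    *dst = src;                            // compiler picks the copy sequence
    return dst;
}
```

With this form the compiler knows both the size and the alignment of the object, so it is free to pick its own copy sequence; whether it actually emits LDM/STM at a given optimisation level would need to be checked in the disassembly.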

We are using the armclang 6 compiler.