Our ARM Cortex-M4 application, written in C++, needs to copy an 8 x 32-bit word struct to external memory as fast as possible.
I found that a 'for' loop performed better than memcpy, but it's still slow.
Are there intrinsics using LDM/STM instructions, or an optimised version of memcpy, that we could use?
Would a 'placement new' for the destination, with a simple assignment of one struct to another, help?
We are using the armclang 6 compiler.
Calling memcpy produces good code if you ensure the compiler can see that the pointers are 4-byte aligned, e.g. godbolt.org/.../9nKq5Yc3E
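A minimal sketch of that idea follows. It is not the code behind the link. The struct name `Packet` and the functions `copy_packet` and `copy_raw` are hypothetical, made up for illustration. The first function relies on the typed pointers carrying the struct's 4-byte alignment. The second uses `__builtin_assume_aligned`, a GCC/Clang builtin that armclang 6, being Clang-based, should accept. Whether the result is actually inlined as multi-word loads and stores depends on the optimisation level and target options, so check the disassembly.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 8-word payload; name and layout are assumptions for illustration.
struct Packet {
    std::uint32_t words[8];
};

static_assert(sizeof(Packet) == 32, "expected 8 x 32-bit words");
static_assert(alignof(Packet) >= 4, "expected at least 4-byte alignment");

// Typed pointers: the compiler may assume both are aligned to alignof(Packet),
// so a fixed-size memcpy can be expanded inline instead of calling the library.
void copy_packet(Packet* dst, const Packet* src) {
    std::memcpy(dst, src, sizeof(Packet));
}

// Raw pointers: state the alignment explicitly. The caller must guarantee
// that both addresses really are 4-byte aligned, otherwise behaviour is undefined.
void copy_raw(void* dst, const void* src) {
    void* d = __builtin_assume_aligned(dst, 4);
    const void* s = __builtin_assume_aligned(src, 4);
    std::memcpy(d, s, 32);
}
```

If the source or destination arrives as a `char*` or `void*` with no alignment information, the compiler has to assume the worst case, which is one likely reason a plain memcpy call looks slow.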