I am trying to write code using Neon instructions for an iOS app, based on existing code that was optimized with SSE instructions for a desktop app.
The SSE code has memory alignment checks throughout, and based on the result it invokes either the aligned or the unaligned load / store instruction. Basically, SSE has the following instructions for loads -
_mm_loadu_si128 - the pointer address need not be 16-byte aligned
_mm_load_si128 - the pointer address needs to be 16-byte aligned
And the following instructions for stores -
_mm_storeu_si128 - the pointer address need not be 16-byte aligned
_mm_store_si128 - the pointer address needs to be 16-byte aligned.
From the ARM Neon instructions documentation, I was not able to find separate load / store instructions for aligned and unaligned memory addresses. I could only find the generic vld1q_s16 for loads and vst1q_s16 for stores. Do these instructions handle the memory alignment under the hood? How should I handle cases where memory addresses might not be aligned?
You're right, there are only generic load and store instructions. Whether an unaligned access is handled in hardware depends on the underlying Memory Type of the addresses you're trying to access ("Normal" memory can handle unaligned accesses) and on the current processor state (if the alignment-check bit SCTLR_ELx.A is set, unaligned accesses raise an exception). So you can just use unaligned addresses and things should, in general, work.
You may notice a performance difference, however, since the microarchitectural implementation of an unaligned load or store differs across processors, and it also depends on the capabilities of the fabric connecting your processor to memory. But whether the source and destination addresses are aligned or not, the same instruction is used to access them. Only a few 'atomic' instructions (with memory model semantics that disallow being broken up or misaligned even on "Normal" memory) require aligned addresses.
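To make that concrete, here is a minimal sketch of what the ported code could look like. It assumes the standard arm_neon.h intrinsics vld1q_s16 / vst1q_s16 / vaddq_s16 / vdupq_n_s16. The function add_bias_s16 and the HAVE_NEON macro are hypothetical names invented for illustration, and the scalar fallback is there only so the snippet also builds on a non-Neon host. Note there is no aligned / unaligned branch like in the SSE version: the same load and store are used for any pointer that is validly aligned for int16_t.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, for this sketch only */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = src[i] + bias for n int16_t values.
 * Neither pointer has to be 16-byte aligned; they only need the natural
 * alignment of int16_t. */
void add_bias_s16(int16_t *dst, const int16_t *src, size_t n, int16_t bias)
{
    size_t i = 0;

#if HAVE_NEON
    const int16x8_t vbias = vdupq_n_s16(bias);

    /* Process 8 lanes at a time. The same intrinsics are used whether or
     * not src + i and dst + i happen to be 16-byte aligned. */
    for (; i + 8 <= n; i += 8) {
        int16x8_t v = vld1q_s16(src + i);
        vst1q_s16(dst + i, vaddq_s16(v, vbias));
    }
#endif

    /* Scalar tail (and full fallback when Neon is not available). */
    for (; i < n; ++i) {
        dst[i] = (int16_t)(src[i] + bias);
    }
}
```

If profiling on your target shows that unaligned accesses are noticeably slower, aligning the buffers at allocation time is usually simpler than keeping two code paths.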