Does the CM55 have a hardware implementation of the vector gather load and scatter store (with offset) instructions, for example for int16x8_t? When optimizing our functions, we expected these instructions to take 1-2 cycles each, but on hardware we measured 4-5 cycles for small offsets and 8-9 cycles for large ones. Is this expected behavior? Could it be caused by memory alignment, or by the memory section that the loads/stores access?
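
For illustration, here is a minimal sketch of the access pattern in question. It is not the code that was actually measured. The function names `gather_s16x8` and `scatter_s16x8` are hypothetical, and the sketch assumes the ACLE MVE intrinsics from `arm_mve.h` when building for an MVE target, with a scalar fallback so the same file also builds elsewhere.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#define USE_MVE 1
#else
#define USE_MVE 0
#endif

/* Gather: dst[i] = base[idx[i]] for 8 halfword lanes.
 * idx holds element indices, not byte offsets (hypothetical helper). */
static void gather_s16x8(int16_t dst[8], const int16_t *base,
                         const uint16_t idx[8])
{
#if USE_MVE
    uint16x8_t offs = vldrhq_u16(idx);            /* load the 8 indices      */
    int16x8_t v = vldrhq_gather_shifted_offset_s16(base, offs);
    vstrhq_s16(dst, v);                           /* contiguous store        */
#else
    for (int i = 0; i < 8; ++i) {                 /* scalar reference path   */
        dst[i] = base[idx[i]];
    }
#endif
}

/* Scatter: base[idx[i]] = src[i] for 8 halfword lanes (hypothetical helper). */
static void scatter_s16x8(int16_t *base, const uint16_t idx[8],
                          const int16_t src[8])
{
#if USE_MVE
    uint16x8_t offs = vldrhq_u16(idx);            /* load the 8 indices      */
    int16x8_t v = vldrhq_s16(src);                /* contiguous load         */
    vstrhq_scatter_shifted_offset_s16(base, offs, v);
#else
    for (int i = 0; i < 8; ++i) {                 /* scalar reference path   */
        base[idx[i]] = src[i];
    }
#endif
}
```

The "shifted offset" intrinsics scale each index by the element size, so `idx` can hold plain element indices. "Small" and "large" offsets in the measurements above would correspond to how far apart the values in `idx` are.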