Hardware implementation of vector load/store instructions on the CM55 core

Does the CM55 have a hardware implementation of the vector gather load and scatter store instructions (for example, operating on int16x8_t vectors)?
When optimizing our functions, we expected each of these instructions to take 1-2 cycles, but on hardware we measured 4-5 cycles (for small offsets) and 8-9 cycles (for large ones). Is this the expected behavior?
Could the cause be memory alignment, or the memory section that the loads and stores access?
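
For reference, the access pattern in question can be sketched roughly as follows. This is only an illustration, assuming the ACLE MVE intrinsics from arm_mve.h; the function name gather_scatter_s16 and the index arrays are hypothetical and not taken from the actual code, and the scalar branch is there only so the sketch also builds on a host without MVE.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#endif

/*
 * Hypothetical helper (name and signature are assumptions for illustration):
 * reads 8 halfwords from src[gidx[i]] and writes them to dst[sidx[i]].
 * The indices are element indices, not byte offsets, and the entries of
 * sidx are assumed to be unique.
 */
void gather_scatter_s16(const int16_t *src, int16_t *dst,
                        const uint16_t gidx[8], const uint16_t sidx[8])
{
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
    /* Load the two index vectors with ordinary contiguous loads. */
    uint16x8_t goffs = vld1q_u16(gidx);
    uint16x8_t soffs = vld1q_u16(sidx);

    /* Gather load: the "shifted" form scales each offset by the element size (2 bytes). */
    int16x8_t v = vldrhq_gather_shifted_offset_s16(src, goffs);

    /* Scatter store of the same 8 lanes, again with element-scaled offsets. */
    vstrhq_scatter_shifted_offset_s16(dst, soffs, v);
#else
    /* Scalar fallback with the same semantics, for builds without MVE. */
    for (int i = 0; i < 8; i++) {
        dst[sidx[i]] = src[gidx[i]];
    }
#endif
}
```

The "small" and "large" offsets mentioned above would correspond to how far apart the entries of gidx and sidx are; the sketch itself makes no claim about the cycle counts.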