Does the Cortex-M55 (CM55) implement the vector gather load and scatter store instructions (for example, on int16x8_t) in hardware? When optimizing our functions, we expected these instructions to take 1-2 cycles, but on the hardware we measured 4-5 cycles for small offsets and 8-9 cycles for large ones. Is this the expected behavior? Could the cause be memory alignment, or the memory section that the loads and stores access?
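For reference, here is a minimal sketch of the kind of access being timed, assuming the ACLE MVE intrinsics from `arm_mve.h`. The helper names `gather8_s16` and `scatter8_s16` are hypothetical and only for illustration. The scalar fallback is there so the same code builds on a host without MVE. It makes no claim about cycle counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#define HAVE_MVE_INT 1
#else
#define HAVE_MVE_INT 0
#endif

/* Hypothetical helper: gather eight halfwords, dst[i] = base[idx[i]].
 * idx holds element indices; the "shifted_offset" form scales them by 2. */
static void gather8_s16(int16_t *dst, const int16_t *base, const uint16_t *idx)
{
    assert(dst != NULL && base != NULL && idx != NULL);
#if HAVE_MVE_INT
    uint16x8_t offs = vld1q_u16(idx);                    /* load the 8 indices */
    int16x8_t v = vldrhq_gather_shifted_offset_s16(base, offs);
    vst1q_s16(dst, v);                                   /* contiguous store */
#else
    /* Scalar fallback with the same semantics (assumption: no MVE target). */
    for (size_t i = 0; i < 8; ++i) {
        dst[i] = base[idx[i]];
    }
#endif
}

/* Hypothetical helper: scatter eight halfwords, base[idx[i]] = src[i]. */
static void scatter8_s16(int16_t *base, const uint16_t *idx, const int16_t *src)
{
    assert(base != NULL && idx != NULL && src != NULL);
#if HAVE_MVE_INT
    uint16x8_t offs = vld1q_u16(idx);                    /* load the 8 indices */
    int16x8_t v = vld1q_s16(src);                        /* contiguous load */
    vstrhq_scatter_shifted_offset_s16(base, offs, v);
#else
    /* Scalar fallback with the same semantics (assumption: no MVE target). */
    for (size_t i = 0; i < 8; ++i) {
        base[idx[i]] = src[i];
    }
#endif
}
```

Timing these helpers with indices that stay close together and then with indices spread far apart, and with `base` placed in different memory regions, would show whether the difference follows the offsets or the memory section.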