NEON optimisation techniques for Cortex-A35

Are there any tricks for making efficient use of NEON on the Cortex-A35? I believe the Cortex-A35 executes instructions in order, so what is the correct way to load and process data? Specifically:

  1. Do I need to load data into a batch of NEON buffers to hide load latency (as I found suggested in an article on the Cortex-A8)?
  2. Does combining load and store operations save CPU cycles (do they execute in parallel)?
  3. Does preloading (prefetching) improve data-cache behaviour when buffers are accessed consecutively?
  4. Do ARM and NEON instructions execute in parallel, so that I can mix ARM and NEON code to save CPU cycles?
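
To make questions 1 and 3 concrete, here is a minimal sketch of the pattern being asked about, reading question 1 as "load several NEON registers before using any of them". It is not a measured or recommended implementation for the Cortex-A35. The function name `add_f32`, the unroll factor of 16 floats, and the prefetch distance `PREFETCH_AHEAD` are all assumptions chosen for illustration. `__builtin_prefetch` is a GCC/Clang builtin, and a scalar loop handles the tail (and the whole array on non-NEON targets).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Assumed prefetch distance in floats; would need tuning on real hardware. */
#define PREFETCH_AHEAD 64

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    /* 16 floats per iteration. All loads are issued first, then the
     * arithmetic, then the stores, so that an in-order core has
     * independent instructions between a load and its first use. */
    for (; i + 16 <= n; i += 16) {
        /* Question 3: hint the cache about data used a few iterations on. */
        if (i + PREFETCH_AHEAD < n) {
            PREFETCH(a + i + PREFETCH_AHEAD);
            PREFETCH(b + i + PREFETCH_AHEAD);
        }

        /* Question 1: batch the loads into several registers. */
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t a2 = vld1q_f32(a + i + 8);
        float32x4_t a3 = vld1q_f32(a + i + 12);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        float32x4_t b2 = vld1q_f32(b + i + 8);
        float32x4_t b3 = vld1q_f32(b + i + 12);

        float32x4_t r0 = vaddq_f32(a0, b0);
        float32x4_t r1 = vaddq_f32(a1, b1);
        float32x4_t r2 = vaddq_f32(a2, b2);
        float32x4_t r3 = vaddq_f32(a3, b3);

        vst1q_f32(dst + i,      r0);
        vst1q_f32(dst + i + 4,  r1);
        vst1q_f32(dst + i + 8,  r2);
        vst1q_f32(dst + i + 12, r3);
    }
#endif

    /* Scalar tail; also the full loop when NEON is not available. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Whether this ordering or the explicit prefetch actually helps on the Cortex-A35 is exactly what I am unsure about; the compiler may also reschedule the intrinsics, so any comparison would have to be checked against the generated assembly and measured on the target.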