How can I measure the cycle count of a program that uses NEON instructions on FVP models?

I'm looking for a tool that lets me iterate faster on my ARM NEON optimizations entirely in software, i.e. without using any hardware or dev boards. I came across Arm Development Studio and its Fixed Virtual Platforms (FVPs). I am not very particular about cycle-count accuracy compared to real hardware. As long as I get consistent cycle counts across multiple runs of the simulation, that will be sufficient for me to optimize my code.

It would be good if I could select a Cortex-A series processor (say the A53 for now) and a memory model for the DRAM to go with it.