Hmm, you start with very complex questions! First, I do not understand what you mean by "static scheduling scoreboard, replay and pending queue", and I do not really understand what ARM calls a "data hazard" either ;( What I can say is that if you apply the stage rules described in the ARM documentation to count cycles, you get a "fairly" correct result. Beyond that, there are a lot of special cases (not always documented) that can improve the accuracy of the count, for example shortcuts (fast forwarding).
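To make the "stage rules" idea concrete, here is a minimal sketch of how I read it, not the actual counter. It assumes a single-issue, in-order model where each source register is needed at a given execute stage and each result becomes available after a given stage. The `Instr` type, the `issue_cycles` function and all stage numbers are made up for illustration; the real values come from the timing tables in the ARM documentation.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dst: str
    srcs: dict          # register name -> stage (1-based) where the operand is needed
    result_stage: int   # stage after which the result can be read by a later instruction


def issue_cycles(program):
    """Return the cycle at which each instruction enters its first execute stage."""
    ready = {}      # register name -> first cycle at which its value is available
    issued = []
    prev = -1       # issue cycle of the previous instruction (in-order, one per cycle)
    for ins in program:
        cycle = prev + 1
        for reg, need_stage in ins.srcs.items():
            # An operand needed at stage s is read at cycle + (s - 1),
            # so the instruction must wait until that cycle is >= ready[reg].
            cycle = max(cycle, ready.get(reg, 0) - (need_stage - 1))
        ready[ins.dst] = cycle + ins.result_stage
        issued.append(cycle)
        prev = cycle
    return issued


# Hypothetical timings, chosen only to show a stall on r0.
program = [
    Instr("mul", dst="r0", srcs={"r1": 1, "r2": 1}, result_stage=5),
    Instr("add", dst="r3", srcs={"r0": 2, "r4": 2}, result_stage=2),
]
print(issue_cycles(program))
```

With these invented numbers the `add` has to wait for `r0`, so it starts several cycles after the `mul` instead of on the very next cycle. A "shortcut" would show up in this model as a smaller `result_stage` for specific producer/consumer pairs.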
Branch mispredict penalty: you can't account for these stall cycles, because you can't know when the ARM will mispredict a branch. It's the same problem with memory reads that miss the cache! So you can only assume that in most cases you don't get those stall cycles, and ignore them.
The Cortex "can start" 4 instructions in the same cycle. Don't think you'll be able to execute 4 instructions every cycle, that's wrong! But in some cases, in some cycles, the Cortex can start 4 instructions (2 ARM and 2 NEON) in the same cycle.
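As a toy illustration of that per-cycle limit only, the sketch below assigns a start cycle to each instruction in program order, allowing at most 2 ARM and 2 NEON starts per cycle. It deliberately ignores dependencies, unit conflicts and pairing restrictions, which is exactly why the real throughput is lower. The `start_cycles` function and the `"arm"`/`"neon"` labels are my own names, not anything from ARM.

```python
def start_cycles(kinds, limits=None):
    """Assign start cycles in program order with a per-cycle cap for each kind."""
    if limits is None:
        limits = {"arm": 2, "neon": 2}   # cap described in the post above
    cycle = 0
    used = {kind: 0 for kind in limits}  # starts already used in the current cycle
    result = []
    for kind in kinds:
        if used[kind] >= limits[kind]:
            # No free slot of this kind left: move on to the next cycle.
            cycle += 1
            used = {k: 0 for k in limits}
        used[kind] += 1
        result.append(cycle)
    return result


print(start_cycles(["arm", "arm", "neon", "neon", "arm"]))
print(start_cycles(["arm", "arm", "arm", "neon"]))
```

In the first sequence four instructions share a cycle; in the second, the third ARM instruction pushes everything after it to the next cycle, because the model is in-order.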
I do not handle the 13 pipeline stages. I handle instructions when they enter a functional unit. The cycle counter is not so complex (in fact, I guess the decode steps are not useful for counting cycles).