Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving the rough overview: even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more 128-bit single-cycle NEON paths.

-Shervin.
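To make "write NEON code for 128 bits" concrete, here is a minimal sketch of an element-wise float add that uses the Q-register (128-bit) intrinsics rather than the D-register (64-bit) ones. The function name `add_f32_128` and the `HAVE_NEON` macro are made up for illustration, and the scalar fallback is only there so the same file builds on non-NEON targets. It says nothing about actual cycle counts on any particular core.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, just for this sketch */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_f32_128(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if HAVE_NEON
    /* Q-register (128-bit) path: four floats per load/add/store. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail; on non-NEON builds this handles the whole array. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

The 64-bit version of the same loop would use `float32x2_t`, `vld1_f32`, `vadd_f32` and `vst1_f32` and step by 2. Writing it with the `q` forms means the code is already in the right shape for cores with wider single-cycle NEON paths.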