It's true that NEON has 128-bit registers, but it nearly always operates on just 64 bits at a time.
Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving a rough overview: even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more single-cycle 128-bit NEON paths. -Shervin
This is NOT true; please read my post. For integer operations, 128-bit operation is more the rule than the exception, at least on Cortex-A8 and A9. If you're targeting these devices, it's important to know which operations do and don't operate on 128 bits in one cycle.
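For anyone wondering what "writing NEON code for 128 bits" looks like in practice, here is a minimal sketch in C using the quad-register (Q) intrinsics from `arm_neon.h`. The helper name `add_u8` and the `HAVE_NEON` macro are made up for illustration and come from nobody's post here. The code makes no claim about cycle counts on any particular core; it only shows the 128-bit form of a simple integer add, with a scalar fallback so it also builds on non-NEON targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, for this sketch only */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping modulo 256) for n bytes. */
static void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst && a && b));
#if HAVE_NEON
    /* 128-bit path: each iteration loads, adds and stores 16 bytes
       using Q-register intrinsics (uint8x16_t). */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything on non-NEON builds. */
    for (; i < n; ++i)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The 64-bit equivalent would use `uint8x8_t` with `vld1_u8`, `vadd_u8` and `vst1_u8` and process 8 bytes per iteration. Whether the Q form is actually faster for a given instruction on a given core is exactly the point being debated above, so measure on your target.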