Thank you for your reply. A few more questions:
Is Dn a 128-bit wide register? Is Dd also 128 bits wide? (I am referring to the diagram in the original question.)
Also, the diagram shows 4 parallel operations. Is this the actual number of parallel operations that the hardware can execute?
As an example, if 4 parallel operations are executed, each operation would use 32 bits from each source register. Is this correct?
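To make the example concrete, here is a minimal C sketch of what I have in mind. The `vec128_t` type and the `vec_add` function are hypothetical names I made up for illustration, not NEON intrinsics, and the 128-bit, 4-lane layout is my assumption from the diagram, not something I have confirmed.

```c
#include <stdint.h>

#define LANES 4  /* assumed number of parallel operations, taken from the diagram */

/* Hypothetical model of one source/destination register:
   4 lanes of 32 bits each, 128 bits in total (an assumption). */
typedef struct {
    uint32_t lane[LANES];
} vec128_t;

/* Lane-wise add: result lane i uses only 32-bit lane i of each source.
   The loop only models what the hardware would presumably do in parallel. */
static vec128_t vec_add(vec128_t n, vec128_t m)
{
    vec128_t d;
    for (int i = 0; i < LANES; i++) {
        d.lane[i] = n.lane[i] + m.lane[i];
    }
    return d;
}
```

With sources {1, 2, 3, 4} and {10, 20, 30, 40}, I would expect the four results to be computed independently, one per 32-bit lane. Is that the right mental model?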
(Please note I split the discussion so the new question has a new thread. This makes content easier to follow and find in the future.)