Thank you for your reply. A few more questions:
Are Dn and Dd both 128-bit registers? (I am referring to the diagram in the original question.)
Also, the diagram shows 4 parallel operations. Is this the actual number of parallel operations that the hardware can execute?
For example, if 4 operations execute in parallel, each operation would use 32 bits from each source register. Is this correct?
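
To make that last question concrete, here is the model I have in mind, written as a plain Python sketch rather than real SIMD code. The layout of 4 lanes of 32 bits each is only my assumption from the diagram, not something I have confirmed for the hardware, and the helper names (split_lanes, join_lanes, lanewise_add) are made up for illustration.

```python
LANE_BITS = 32                   # assumed width of one lane
LANES = 4                        # assumed number of parallel operations
MASK = (1 << LANE_BITS) - 1      # keeps a value within one lane


def split_lanes(reg):
    """Split a register value into LANES lanes, lowest lane first."""
    return [(reg >> (i * LANE_BITS)) & MASK for i in range(LANES)]


def join_lanes(lanes):
    """Pack lanes (lowest first) back into a single register value."""
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & MASK) << (i * LANE_BITS)
    return reg


def lanewise_add(a, b):
    """Add two register values lane by lane; each lane wraps on its own."""
    sums = [(x + y) & MASK for x, y in zip(split_lanes(a), split_lanes(b))]
    return join_lanes(sums)


a = join_lanes([1, 2, 3, 0xFFFFFFFF])
b = join_lanes([10, 20, 30, 1])
result = split_lanes(lanewise_add(a, b))
print(result)              # [11, 22, 33, 0]
print(LANES * LANE_BITS)   # 128
```

Under this model, 4 lanes of 32 bits require a 128-bit source register, and an overflow in one lane (the last one above) does not carry into its neighbour. If the registers are narrower than 128 bits, then either the lane count or the lane width in my picture must be wrong.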