Cortex M4 (SIMD) - Fastest way to un-pack 1 (one) uint32 to 4 (four) uint8

Hi to you all,
In my current project I need to send over a serial bus an array of integers:

type = unsigned 32 bit integers
length = 4096

The driver I'm using (NXP's USB CDC VCOM, which is part of LPCOpen) takes a pointer to uint8_t and does a bulk transfer using DMA. I can output long strings with no problem, but of course I need to output data, which means 32-bit integers.

Here's the function I'm trying to call:

   /** \fn uint32_t WriteEP(USBD_HANDLE_T hUsb, uint32_t EPNum, uint8_t *pData, uint32_t cnt)
   *  Function to write data to be sent on the requested endpoint.
   *
   *  This function is called by USB stack and the application layer to send data
   *  on the requested endpoint.
   *  
   *  \param[in] hUsb Handle to the USB device stack. 
   *  \param[in] EPNum  Endpoint number as per USB specification. 
   *                    ie. An EP1_IN is represented by 0x81 number.
   *  \param[in] pData Pointer to the data buffer from where data is to be copied. 
   *  \param[in] cnt  Number of bytes to write. 
   *  \return Returns the number of bytes written.
   */
 uint32_t (*WriteEP)(USBD_HANDLE_T hUsb, uint32_t EPNum, uint8_t *pData, uint32_t cnt);

I need to split the data into 8-bit tokens as fast as possible because the project has some serious time constraints. Is anyone aware of a SIMD instruction that unpacks data for this purpose?

Any help would be highly appreciated.
Thanks in advance,
Andrea

0 Matt Sealey over 8 years ago

UXTAB or some clever usage of UXTB16 would probably be the instructions you want - they'll swizzle out 8-bit values from a 32-bit register (or two halfwords in a 32-bit register) into a destination register or two (so you'll need four 32-bit destinations to unpack four bytes from a 32-bit register). Is that kind of what you're going for?

However, I think you'll find that if the byte order in memory (from lowest address to highest) is how you want them output, then there's no need to do this at all: you can just cast your uint32_t pointer to uint8_t. If you need to do some reordering of those bytes, then REV, REV16 and maybe some ROR usage might be your best bet. There isn't really a single SIMD/DSP instruction that will do it for you.
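To check the premise about byte order, here is a minimal C sketch. It assumes nothing about the board; `first_byte_in_memory` and `is_little_endian` are hypothetical helper names, not part of LPCOpen. It reads the byte stored at the lowest address of a 32-bit word, which is the byte a cast to uint8_t * would hand to the driver first.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the byte stored at the lowest address of a
 * 32-bit word. On a little-endian core this is the least significant byte. */
static uint8_t first_byte_in_memory(uint32_t word)
{
    uint8_t bytes[sizeof word];
    memcpy(bytes, &word, sizeof word);  /* copy the object representation */
    return bytes[0];
}

/* Hypothetical helper: returns 1 when the machine running this code is
 * little-endian, 0 otherwise. */
static int is_little_endian(void)
{
    return first_byte_in_memory(1u) == 1u;
}
```

On a little-endian configuration, which is the usual one for Cortex-M4 parts, `first_byte_in_memory(0x12345678)` is 0x78, so a plain cast sends the least significant byte of each word first.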

0 Andrea Bettati over 8 years ago in reply to Matt Sealey
Hi Matt Sealey, thanks a lot for the reply.
I looked at the doc for UXTAB:

UXTAB{cond} {Rd}, Rn, Rm {,rotation}

This instruction does the following:

Rotate the value from Rm right by 0, 8, 16 or 24 bits.

Extract bits[7:0] from the value obtained.

Zero extend to 32 bits.

Add the value from Rn.

As far as I understand, your suggestion is to use it for the "Rotate" feature, right? But how does this rotation work?
Let's see how confused I am:
say we have the number 0x12-34-56-78, then

8 bit rotation > 0x12-34-78-56

16 bit rotation > 0x56-78-12-34

24 bit rotation > 0x34-56-78-12

I guess that's not the case...

"there's no need to do this at all, you can just cast your uint32_t pointer to uint8_t": yes, the numbers are in the right order, so there's no need to sort. So a cast should really take care of it? I don't have the board here (I study and work in different places, so it is sometimes difficult to test right away), so unfortunately I cannot check, but I think I tried a week ago and had some trouble. If that's true then my bad, I'll try to fix it as soon as possible. Anyway, the UXTAB solution seems interesting.

+2 Matt Sealey over 8 years ago in reply to Andrea Bettati

Almost. The 16-bit and 24-bit results are right, but the rotation applies to the whole word, so rotating 0x12345678 right by 8 bits gives 0x78123456. Also, you don't need UXTAB, since you're never going to add a value; use UXTB instead.

UXTB r2, r1

UXTB r3, r1, ROR #8

UXTB r4, r1, ROR #16

UXTB r5, r1, ROR #24

With r1 = 0x12345678, r2, r3, r4 and r5 will contain 0x78, 0x56, 0x34 and 0x12. It's the instruction that powers ((X >> N) & 0xFF) for shifts that are multiples of 8 bits. The 16-bit version does it twice:
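The rotate, extract, zero-extend and add steps quoted from the documentation earlier in the thread can be modelled in portable C. This is only a sketch for checking values on a PC; `ror32`, `uxtb_model` and `uxtab_model` are hypothetical names, and the rotation argument is assumed to be 0, 8, 16 or 24 as the instruction requires.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is reduced to the range 0..31). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Model of UXTB Rd, Rm, ROR #rot: rotate, keep bits[7:0], zero-extend. */
static uint32_t uxtb_model(uint32_t rm, unsigned rot)
{
    return ror32(rm, rot) & 0xFFu;
}

/* Model of UXTAB Rd, Rn, Rm, ROR #rot: the same extraction, then add Rn. */
static uint32_t uxtab_model(uint32_t rn, uint32_t rm, unsigned rot)
{
    return rn + uxtb_model(rm, rot);
}
```

For 0x12345678, rotations of 0, 8, 16 and 24 select 0x78, 0x56, 0x34 and 0x12 respectively, which matches ((X >> N) & 0xFF).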

UXTB16 r3, r1

UXTB16 r5, r1, ROR #8

With r1 = 0x12345678, r3 and r5 will contain 0x00340078 and 0x00120056. You can rotate them by 16 bits to get 0x00780034 and 0x00560012 instead. STRB will just store bits [7:0] of the register to memory.
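The halfword variant can be checked the same way. The sketch below models UXTB16 (the form without the add) in portable C; `ror32` and `uxtb16_model` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is reduced to the range 0..31). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Model of UXTB16 Rd, Rm, ROR #rot: after the rotation, bits[7:0] and
 * bits[23:16] are kept, each zero-extended into its own halfword. */
static uint32_t uxtb16_model(uint32_t rm, unsigned rot)
{
    return ror32(rm, rot) & 0x00FF00FFu;
}
```

With 0x12345678 as input, a rotation of 0 yields 0x00340078 and a rotation of 8 yields 0x00120056, so two instructions are enough to separate all four bytes.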

Anyway, the same idea extends to a store sequence. Using UXTB16 (UXTAB16 without the add), with r1 holding the value and r0 pointing at the output buffer:

UXTB16 r3, r1, ROR #16

UXTB16 r5, r1, ROR #24

STRB r5, [r0], #1

STRB r3, [r0], #1

ROR r5, r5, #16

ROR r3, r3, #16

STRB r5, [r0], #1

STRB r3, [r0], #1

Although the Cortex-M4 pipeline isn't complex enough to see the benefit of doing some work between the stores.

What we just did there is turn the 32-bit value 0x12345678 which is stored in memory as [0x78, 0x56, 0x34, 0x12] (LSB at lowest address, little endian) into [0x12, 0x34, 0x56, 0x78] by messing with it in an overcomplicated manner.
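For comparison, the same byte-by-byte reordering can be written in portable C. This is a sketch rather than a tuned routine; `be_byte` and `store_be32` are hypothetical helper names.

```c
#include <stdint.h>

/* Returns byte `index` of `value` counting from the most significant byte
 * (index 0) down to the least significant byte (index 3). */
static uint8_t be_byte(uint32_t value, unsigned index)
{
    return (uint8_t)(value >> (24u - 8u * index));
}

/* Writes `value` to dst most significant byte first and returns the
 * position just past the four bytes written. */
static uint8_t *store_be32(uint8_t *dst, uint32_t value)
{
    for (unsigned i = 0; i < 4u; ++i) {
        dst[i] = be_byte(value, i);
    }
    return dst + 4;
}
```

Calling `store_be32(buf, 0x12345678)` leaves 0x12, 0x34, 0x56, 0x78 in `buf[0]` to `buf[3]` regardless of the endianness of the machine.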

The "REV" instruction will do this, and you can get away with a single store.

LDR r1, [r2], #4

REV r1, r1

STR r1, [r0], #4
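A portable C model of the load, REV, store loop is sketched below. The names `rev32` and `rev32_copy` are hypothetical; on the target itself, the CMSIS `__REV` intrinsic could be used in place of `rev32`.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable model of REV: reverses the byte order of a 32-bit word. */
static uint32_t rev32(uint32_t x)
{
    return (x >> 24) |
           ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) |
           (x << 24);
}

/* Copies `count` words from src to dst, byte-reversing each one, in the
 * same way as the LDR / REV / STR sequence handles one word per pass. */
static void rev32_copy(uint32_t *dst, const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] = rev32(src[i]);
    }
}
```

Applying `rev32` twice returns the original value, which makes it easy to sanity-check on a PC before moving to the board.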

That may still be overkill. Rather than extracting bytes out of a 32-bit word and reversing them so we can iterate over them MSB first, if they're already in memory in the order you want, just do:

uint32_t foo[] = { 0x12345678 };

bar->writeEP((uint8_t *) foo);
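Spelled out against the WriteEP prototype from the question, the cast might look like the sketch below. Everything except that prototype is an assumption made for illustration: `stub_write_ep` stands in for the real driver entry, the `USBD_HANDLE_T` typedef is a placeholder for the one in the LPCOpen headers, and 0x81 is only the example endpoint number from the documentation comment.

```c
#include <stdint.h>

#define BINS 4096u  /* array length given in the question */

/* Placeholder for the handle type that the LPCOpen headers provide. */
typedef void *USBD_HANDLE_T;

/* Hypothetical stand-in for WriteEP: it sends nothing and just reports
 * the byte count it was asked to write. */
static uint32_t stub_write_ep(USBD_HANDLE_T hUsb, uint32_t EPNum,
                              uint8_t *pData, uint32_t cnt)
{
    (void)hUsb;
    (void)EPNum;
    (void)pData;
    return cnt;
}

static uint32_t multiChannel[BINS];

/* Passes the word array to the driver as bytes. The count is in bytes,
 * so it is sizeof the array, not the number of elements. */
static uint32_t send_bins(USBD_HANDLE_T hUsb)
{
    return stub_write_ep(hUsb, 0x81u, (uint8_t *)multiChannel,
                         (uint32_t)sizeof multiChannel);
}
```

The point of the sketch is the last argument: 4096 words are 16384 bytes. Whether the real driver accepts that many bytes in a single call is not something this example establishes.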

It really depends on the data formats on either side, but there's usually a way to do it without getting too monstrous. If this were an A-class core you could use {V}TBL and get complicated, but actually REV works just as well there, too.

Ta,

Matt

0 42Bastian Schick over 8 years ago in reply to Matt Sealey

"uint32_t foo[] = { 0x12345678 };

bar->writeEP((uint8_t *) foo);"

That's it, just cast and you are done.

0 Andrea Bettati over 8 years ago in reply to Matt Sealey
Thank you Matt Sealey for the reply. Today I worked on the board and found that the problem was not the cast, which, as you suggested, worked properly, but the fact that the array I am trying to output is placed in a specific RAM section.
It is declared as:
__DATA(RAM3) static uint32_t multiChannel[BINS];

If I remove the __DATA attribute I can read the content properly; otherwise I get unexpected values.
I'm opening a new question about this.

Thanks again,
Andrea