
Implementation in NEON of non-uniform address jumps

  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    You used these instructions in your previous post:

        VLD1.32  {d0[0]},[r2]    // j==0, offset 0
        ADD      r2,r2,#64

    I am trying to do something like this:

    VLD1.32  {d0[0]},[%2]    // [%2] is pointing to src
    ADD      %2,%2,#64       // This is NOT WORKING!! --- EDIT: IT WORKED


      : "r"( n ), "r"( res ), "r"( src ),"r"( c ) //INPUT data


  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Question: how can we force the compiler to place the operand %2 in a specific ARM register such as r2?
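
    For what it's worth, one standard way to do this with GCC (not shown in the thread) is an explicit register variable, which pins a local variable to a named register so the assembly can refer to r2 directly; the function and variable names here are illustrative:

        #include <stdint.h>

        int32_t load_via_r2(int32_t *src)
        {
            /* GCC explicit register variable: p is kept in r2. */
            register int32_t *p __asm__("r2") = src;
            int32_t out;
            __asm__ volatile (
                "VLD1.32  {d0[0]}, [r2]  \n\t"   /* r2 holds p here     */
                "VMOV.32  %0, d0[0]      \n\t"   /* lane 0 out to 'out' */
                : "=r"(out)
                : "r"(p)                         /* keeps p live in r2  */
                : "d0"
            );
            return out;
        }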

  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    I was facing a problem implementing something like this:
    VLD1.32 {d0[0]},[r2]

    I need to load a single 16 bit element to s0:
    VLD1.16 {s0},[r2]
    The above line gives me an error. It says only doubleword or quadword registers may be loaded at a time.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    Okay, and the second 16-bit half of D0[0] can be utilized by doing:


    ADD r2,r2,#2

    VLD1.16 {D0[1]},[r2]    // lane 1: the upper 16-bit half


    Right?
  • Note: This was originally posted on 28th June 2012 at http://forums.arm.com

    Assuming that "jump=8" is a constant, there is no benefit in performing the non-contiguous random loads. What you appear to be trying to compute is the sum of:

    src[ 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,37]
    * c[ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 6, 7, 7]


    The only memory locations you don't use are src[1,2,5,32,35,36] out of the array src[0..37], at which point loading them in and ignoring them is likely faster than avoiding loading them.
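
    (For reference, the same computation in plain C, with the two index arrays transcribed from above; the name sumfunc_ref is just for illustration:)

        static const int si[32] = { 0, 3, 4, 6, 7, 8, 9,10,11,12,13,14,15,16,
                                   17,18,19,20,21,22,23,24,25,26,27,28,29,30,
                                   31,33,34,37 };
        static const int ci[32] = { 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4,
                                    2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6,
                                    7, 6, 7, 7 };

        int sumfunc_ref(const int *c, const int *src)
        {
            int sum = 0;
            for (int i = 0; i < 32; i++)
                sum += src[si[i]] * c[ci[i]];
            return sum;
        }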

    Using VLD3, you can automatically pair most of the locations using the same coefficients, e.g. sum[0,3,6] all use coefficient[0], and a few VEXT/VTRN can correct the rest. After that you can perform multiplies and multiply-accumulates on pairs via each scalar coefficient, and then sum up at the end. The result is something like:

        // int sumfunc(int *c, int *src)
        sumfunc:

        // r0 = c, r1 = src
        VLD1.32  {d0,d1,d2,d3},[r0]  // c[{0,1},{2,3},{4,5},{6,7}]
        VLD3.32  {d4,d5,d6},[r1]!    // src[{ 0, 3},{ -, 4},{ -, -}]
        // d8-d15 left intact to avoid the ABI-required preserve and restore
        VLD3.32  {d17,d18,d19},[r1]! // src[{ 6, 9},{ 7,10},{ 8,11}]
        VLD3.32  {d20,d21,d22},[r1]! // src[{12,15},{13,16},{14,17}]
        VLD3.32  {d23,d24,d25},[r1]! // src[{18,21},{19,22},{20,23}]
        VLD3.32  {d26,d27,d28},[r1]! // src[{24,27},{25,28},{26,29}]
        VLD3.32  {d29,d30,d31},[r1]! // src[{30,33},{31,34},{ -, -}]
        VLD1.32  {d7},[r1]           // src[{ -,37}]
        VEXT.8   d5,d5,d21,#4        // d5  = src[{ 4,13}]
        VEXT.8   d21,d21,d27,#4      // d21 = src[{16,25}]
        VTRN.32  d7,d27              // d27 = src[{37,28}]
        VMUL.I32 d4,d4,d0[0]         // src[{ 0, 3}] * c[0]
        VMUL.I32 d5,d5,d0[1]         // src[{ 4,13}] * c[1]
        VMUL.I32 d16,d30,d3[1]       // src[{31,34}] * c[7]
        VMUL.I32 d17,d17,d0[0]       // src[{ 6, 9}] * c[0]
        VMUL.I32 d18,d18,d0[1]       // ...
        VMUL.I32 d19,d19,d1[0]
        VMUL.I32 d20,d20,d1[1]
        VMUL.I32 d21,d21,d2[0]
        VMLA.I32 d4,d22,d1[0]        // += src[{14,17}] * c[2]
        VMLA.I32 d5,d23,d1[1]        // += src[{18,21}] * c[3]
        VMLA.I32 d16,d24,d2[0]       // ...
        VMLA.I32 d17,d25,d2[1]
        VMLA.I32 d18,d26,d3[0]
        VMLA.I32 d19,d27,d3[1]
        VMLA.I32 d20,d28,d2[1]
        VMLA.I32 d21,d29,d3[0]
        VADD.I32 q2,q2,q8            // Sum all values
        VADD.I32 q3,q9,q10
        VADD.I32 q0,q2,q3
        VADD.I32 d0,d0,d1
        VPADD.I32 d0,d0,d0           // Final sum to s0
        VMOV.32  r0,d0[0]            // Move result to return value
        BX       lr                  // Return


    hth
    s.
  • Note: This was originally posted on 29th June 2012 at http://forums.arm.com

    "VLD1.16 {D0[0]},[r2]" will load the bottom 16-bits of D0 (which are the same as the bottom 16-bits of S0).

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Yes, though the 2-byte increment between 16-bit loads comes for free with post-increment addressing:

      VLD1.16  {D0[0]},[r2]!
      VLD1.16  {D0[1]},[r2]!
      VLD1.16  {D0[2]},[r2]!
      VLD1.16  {D0[3]},[r2]!

    Which in itself would be better implemented as a single load of all four 16-bit lanes:

      VLD1.16  {D0},[r2]!

    hth
    s.
  • Note: This was originally posted on 2nd July 2012 at http://forums.arm.com

    Are you using "-mfloat-abi=softfp -mfpu=neon" on the GCC command line?
    Also, you really might want to consider using the intrinsics instead of inline asm.
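
    To give a flavour of the intrinsics, here is a small fragment in the spirit of the VMUL/VMLA sequence earlier in the thread: vector loads, multiply and multiply-accumulate by a scalar coefficient, then a pairwise final sum. The function name and the flat 8-element input are illustrative only:

        #include <arm_neon.h>

        int32_t dot8(const int32_t *src, int32_t c0, int32_t c1)
        {
            int32x4_t a   = vld1q_s32(src);       /* like VLD1.32 {d..}  */
            int32x4_t b   = vld1q_s32(src + 4);
            int32x4_t acc = vmulq_n_s32(a, c0);   /* VMUL.I32 by scalar  */
            acc = vmlaq_n_s32(acc, b, c1);        /* VMLA.I32 by scalar  */
            int32x2_t s   = vadd_s32(vget_low_s32(acc),
                                     vget_high_s32(acc));
            s = vpadd_s32(s, s);                  /* VPADD.I32 final sum */
            return vget_lane_s32(s, 0);           /* VMOV.32 r0,d0[0]    */
        }

    The compiler then handles register allocation, scheduling and the ABI for you, while still emitting much the same instructions.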

    hth
    s.
  • Note: This was originally posted on 6th July 2012 at http://forums.arm.com

    12 is a lot of simultaneous registers to ask the compiler for, given that it only ever had 14 to start with once the stack pointer and program counter are deducted.
    A further two registers may already be permanently allocated to a stack limit, frame pointer, global table pointer or other purpose, depending on the platform requirements.

    You really should consider whether what you are doing is appropriate for inline assembly, versus a naked function, a standalone assembly file, or the intrinsics.
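
    As a sketch of the naked-function route (the function name is illustrative): the body must be pure assembly, arguments arrive per the AAPCS in r0/r1, the compiler emits no prologue or epilogue, and it competes for no registers inside:

        #include <stdint.h>

        __attribute__((naked)) int32_t first_lane(const int32_t *src)
        {
            __asm__ volatile (
                "VLD1.32  {d0[0]}, [r0]  \n\t"   /* r0 = src per the AAPCS */
                "VMOV.32  r0, d0[0]      \n\t"   /* return value in r0     */
                "BX       lr             \n\t"
            );
        }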

    hth
    s.
  • Note: This was originally posted on 16th July 2012 at http://forums.arm.com

    It is reasonably unlikely that the compiler is doing something it shouldn't be at "-O3"; the more likely scenario is that:

    1. for inline assembly, you aren't describing the required/expected side-effects, so the compiler is [correctly] assuming it can optimize the code out (e.g. if your inline Neon code writes all of its results back to memory, are you declaring memory as having been modified, and/or the assembly as being "volatile"?); see the sketch after this list.

    2. for C code, you are relying on behaviour which the C standard does not guarantee to be preserved, and which can thus legally be optimized out (e.g. relying on the values of variables which have fallen out of scope, or relying on the exact number of memory accesses to objects not declared as volatile).
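
    A minimal sketch of point 1, assuming hypothetical pointers src and dst: the statement is marked "volatile" and "memory" is declared clobbered, so the compiler must keep the asm and must not assume memory is unchanged across it:

        #include <stdint.h>

        void copy4(int32_t *dst, const int32_t *src)   /* hypothetical */
        {
            __asm__ volatile (
                "VLD1.32  {d0,d1}, [%1]  \n\t"   /* read 4 words from src */
                "VST1.32  {d0,d1}, [%0]  \n\t"   /* write them to dst     */
                :
                : "r"(dst), "r"(src)
                : "d0", "d1", "memory"           /* results land in memory */
            );
        }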

    hth
    s.