Division with NEON
Etienne SOBOLE
over 7 years ago
Note: This was originally posted on 30th September 2011 at http://forums.arm.com
Hi.
I have 4 unsigned 16-bit values in a Dn register (or 8 in a Qn register):
[v1] [v2] [v3] [v4]
I'm looking for code that ends up with
[65536 / v1] [65536 / v2] [65536 / v3] [65536 / v4]
in another (or the same) Dn (or Qn) register.
Thanks,
Etienne
Gilead Kutnick
over 7 years ago
Note: This was originally posted on 3rd October 2011 at http://forums.arm.com
vrecpe.u32 takes normalized inputs, similar to how a floating-point significand is usually stored. That means the input has no leading zeroes after the first bit, which is always 0, so the top two bits will always be 01.
Another way to look at it is that vrecpe.u32 works on values between 0.5 and 1.0 (non-inclusive), where the format is 0.1.31. That means no sign bits, 1 whole bit, and 31 fraction bits. Due to the input constraints the top bit will always be 0.
The reason for this format is to limit the possible range of the calculated reciprocal, which you'll notice must be between 1.0 and 2.0. The one whole number bit was kept available to satisfy this range. If you didn't perform this range limiting you wouldn't be able to define very useful data representations for integer reciprocals, since the reciprocal of any whole number is a fraction.
Normalization converts an input x to the form:
x_normalized = x * 2^shift
x = x_normalized * 2^-shift
Where the multiplication can be performed by a bit-shift. Note that for the reciprocal:
x_reciprocal = 1 / x = 1 / (x_normalized * 2^-shift) = (1 / x_normalized) * 2^shift
This means you end up performing a left shift at the end to undo the normalization, rather than a right shift, because taking the reciprocal flips the sign of the exponent.
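To make the identity above concrete, here is a minimal scalar sketch in C. It works on real values rather than NEON registers, and the names `normalize`, `scale_pow2` and `reciprocal_via_normalization` are made up for illustration. For a value below 0.5 (for example 0.1875) the shift comes out positive, and the same positive shift is applied to the normalized reciprocal.

```c
#include <assert.h>

/* Hypothetical helper: scale x > 0 by powers of two into [0.5, 1.0).
   Returns x_normalized and stores shift so that x_normalized = x * 2^shift. */
static double normalize(double x, int *shift)
{
    assert(x > 0.0);
    *shift = 0;
    while (x >= 1.0) { x *= 0.5; *shift -= 1; }
    while (x < 0.5)  { x *= 2.0; *shift += 1; }
    return x;
}

/* Multiply v by 2^shift; shift may be negative. */
static double scale_pow2(double v, int shift)
{
    for (; shift > 0; --shift) v *= 2.0;
    for (; shift < 0; ++shift) v *= 0.5;
    return v;
}

/* 1 / x = (1 / x_normalized) * 2^shift */
static double reciprocal_via_normalization(double x)
{
    int shift;
    double xn = normalize(x, &shift);   /* xn is in [0.5, 1.0) */
    double rn = 1.0 / xn;               /* rn is in (1.0, 2.0] */
    return scale_pow2(rn, shift);       /* same sign of shift as the normalization */
}
```

Calling `reciprocal_via_normalization(0.1875)` normalizes to 0.75 with a shift of 2, takes the reciprocal of 0.75, and then multiplies by 2^2 rather than dividing.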
Then for the actual division:
a = y / x
a = y * (1 / x)
a = y * ((1 / x_normalized) * 2^shift)
a = (y * (1 / x_normalized)) * 2^shift
You can find the normalization shift with a count leading zeroes instruction. In your case you'll want to use vclz.u16. But you need to leave that integer bit, so you want to set shift equal to clz(x) - 1.
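As a rough scalar picture of that shift calculation, the sketch below counts leading zeroes in one 16-bit lane with a portable loop and applies shift = clz(x) - 1. The helpers `clz16`, `normalization_shift` and `normalize_u16` are hypothetical names, and the loop only stands in for what vclz does per lane. One assumption of mine: a value with its top bit already set has clz 0, so the shift becomes -1, and the sketch treats that as a right shift by one (losing the lowest bit).

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for one 16-bit lane of a count-leading-zeroes operation. */
static int clz16(uint16_t x)
{
    int n = 0;
    for (uint16_t mask = 0x8000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        ++n;
    return n;                       /* 16 when x == 0 */
}

/* Shift that keeps the top (integer) bit clear: shift = clz(x) - 1. */
static int normalization_shift(uint16_t x)
{
    assert(x != 0);                 /* zero cannot be normalized */
    return clz16(x) - 1;
}

/* Normalize so the top two bits of the lane are 01. */
static uint16_t normalize_u16(uint16_t x)
{
    int shift = normalization_shift(x);
    /* Assumption: shift == -1 (top bit set) is handled as a right shift by one. */
    return shift >= 0 ? (uint16_t)(x << shift) : (uint16_t)(x >> 1);
}
```

For example, 3 has 14 leading zeroes in a 16-bit lane, so the shift is 13 and the normalized value is 0x6000.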
However, you will not always get the correct answer using vrecpe.u32 alone, because its result is only correct to ~8 bits. To get correct 16-bit values you need to refine it with Newton-Raphson iteration. That is, for an estimate y of 1 / x,
y_refined = y * (2 - (x * y))
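A small sketch of that refinement step in plain C, using double rather than NEON fixed point so the arithmetic is easy to follow. The function names are hypothetical. If the estimate is y = (1 / x) * (1 - e), one step gives (1 / x) * (1 - e^2), so the relative error is squared each time and roughly 8 correct bits become roughly 16.

```c
#include <assert.h>

/* One Newton-Raphson step for an estimate y of 1 / x: y' = y * (2 - x * y). */
static double refine_reciprocal(double x, double y)
{
    return y * (2.0 - x * y);
}

/* Absolute error of estimate y against the true reciprocal of x. */
static double recip_error(double x, double y)
{
    double e = y - 1.0 / x;
    return e < 0.0 ? -e : e;
}
```

Starting from the deliberately coarse guess 1.33 for 1 / 0.75, the error is about 3.3e-3 before refinement, about 8.3e-6 after one step, and about 5.2e-11 after two.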
This is kind of a pain to do in integer arithmetic on NEON: there's no integer equivalent of vrecps, and since this is a fixed-point multiplication you need the long result, only to throw out the bottom bits. Honestly, you're probably better off just converting to floating point and back. You don't even have to do the final multiplication: vcvt can convert between floating point and fixed point, which gives you the multiplication (a left shift by 16) for free. Of course, you can do something similar if you stick with integer.
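Here is a scalar model of the floating-point route for the original 65536 / v question. It is only a sketch under a few assumptions: `div65536_via_float` and `div65536_x4` are made-up names, the plain `1.0f / v` stands in for a reciprocal estimate followed by enough refinement steps, and zero inputs are assumed to be handled elsewhere. The multiply by 65536 followed by truncation plays the role of a float to fixed-point conversion with 16 fraction bits.

```c
#include <assert.h>
#include <stdint.h>

/* One lane: u16 -> float, reciprocal, then float -> fixed point with
   16 fraction bits, which is the multiplication by 65536. */
static uint32_t div65536_via_float(uint16_t v)
{
    assert(v != 0);                     /* assumption: zero lanes handled elsewhere */
    float r = 1.0f / (float)v;          /* stands in for estimate plus refinement */
    return (uint32_t)(r * 65536.0f);    /* truncating conversion */
}

/* Apply the model to four lanes, mirroring one D register of inputs. */
static void div65536_x4(const uint16_t in[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = div65536_via_float(in[i]);
}
```

Note that the result for v = 1 is 65536, which does not fit in 16 bits, so the outputs here are kept in 32-bit lanes. With a less accurate reciprocal than the exact float division used in this model, the last bit of the result may differ.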