How does the NEON access Memory?
bearfish
over 7 years ago
Note: This was originally posted on 5th May 2008 at
http://forums.arm.com
I have a question about how to get the maximum computational throughput from NEON. Our video processing application needs to access several video frames at once. At HD resolution (1920*1080), each frame takes more than 6 MB (1920*1080*3 bytes), so it is impossible to hold whole frames in cache and we will hit cache misses. What measures do you recommend to avoid them?
Here is our experience from implementing our video processing algorithm on the Cell processor and on Equator's BSP processor. Equator's BSP processor has a DMA mechanism that can move data between cache (I don't know the details; it may be TCM) and memory. So we can set up a double buffer (for example, a "ping pong" buffer) in cache to avoid cache misses: while the CPU works on the "ping" buffer, the DMA moves data between the "pong" buffer and memory. The DMA transfer time then overlaps with the CPU's computation time, and when the CPU finishes the "ping" buffer and moves on to the "pong" buffer, it does not hit a cache miss.
In the Cell processor, the Synergistic Processing Unit (SPU) has no cache; instead it has a high-speed memory (the local store, no more than 256K, which holds data, instructions, and the stack). The local store can be accessed by a DMA engine that moves data between the local store and main memory. So we can again use a double buffer, moving data into one buffer while the SPU processes the data in the other.
My question is whether NEON has a similar DMA mechanism for moving data to avoid cache misses. This is very important for our application, because video processing touches a large amount of data. Also, how does NEON synchronize with the ARM core? I have not found the answer in the ARM Architecture Reference Manual. A simple sample showing how to use NEON would help clear up my confusion.
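For readers unfamiliar with the pattern, the "ping pong" scheme described for both the BSP and the Cell SPU can be sketched in portable C roughly as follows. This is only an illustration of the control flow: `fetch_chunk`, `process_chunk`, `process_frame_double_buffered`, and `CHUNK_BYTES` are hypothetical names, and `memcpy` stands in for the DMA engine, so nothing actually overlaps here. On real hardware the fetch would be started asynchronously and waited on before the buffers are swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 4096u  /* hypothetical size of one "ping" or "pong" buffer */

/* Stand-in for starting a DMA transfer. A real DMA engine would return
 * immediately and copy in the background; memcpy copies synchronously. */
static void fetch_chunk(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical per-chunk work: sum the bytes of the chunk. */
static uint32_t process_chunk(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Stream `total` bytes of `frame` through two small local buffers. */
uint32_t process_frame_double_buffered(const uint8_t *frame, size_t total)
{
    static uint8_t local[2][CHUNK_BYTES];  /* [0] = "ping", [1] = "pong" */
    uint32_t result = 0;
    size_t offset = 0;
    int cur = 0;

    assert(frame != NULL || total == 0);
    if (total == 0)
        return 0;

    /* Prime the "ping" buffer before entering the loop. */
    size_t cur_len = total < CHUNK_BYTES ? total : CHUNK_BYTES;
    fetch_chunk(local[cur], frame, cur_len);

    while (cur_len > 0) {
        size_t next_off = offset + cur_len;
        size_t remaining = total - next_off;
        size_t next_len = remaining < CHUNK_BYTES ? remaining : CHUNK_BYTES;

        /* Start filling the other buffer before working on this one. */
        if (next_len > 0)
            fetch_chunk(local[cur ^ 1], frame + next_off, next_len);

        result += process_chunk(local[cur], cur_len);

        /* With a real DMA engine, wait for the transfer to finish here. */
        offset = next_off;
        cur_len = next_len;
        cur ^= 1;  /* swap ping and pong */
    }
    return result;
}
```

The point of the structure is that the fetch for chunk N+1 is issued before the computation on chunk N, which is what lets transfer time hide behind compute time when the fetch is asynchronous.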
Thanks!
Peter Harris
over 7 years ago
Note: This was originally posted on 6th May 2008 at
http://forums.arm.com
On the ARM Cortex implementations the Neon block has direct access to the larger L2 cache, bypassing L1 and so avoiding many cache pollution issues. The Neon unit also has a large number of registers, making it ideal for implementing things like decode lookup tables using just registers, without accessing external memory at all.
The second point to realise with video is that you don't randomly access the whole video frame, so it doesn't all need to be in cache simultaneously. This allows you to hide the fact that the cache is smaller than your whole data set. Block-based decoders tend to work on data that has a lot of spatial locality - the current block's motion vectors apply to the same spatial location in the previous frame. So the top 32x32 pixel block in the current frame needs the motion vectors for the current frame delta and the 32x32 pixel block from the previous frame (with perhaps a few neighbouring blocks too). This is a small and very manageable data set. By issuing an early PLD or LDR you can cause lines to be loaded into the L2 cache just ahead of when they are used, which keeps data flowing through the cache at high performance.
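To make the locality argument concrete, here is a minimal sketch in plain C (not Neon intrinsics) of a block-at-a-time traversal. It assumes 8-bit single-plane frames whose width and height are multiples of the block size; `block_sad` and `frame_sad` are hypothetical names, and a sum of absolute differences stands in for whatever per-block work a real codec does. Each call to `block_sad` touches only 2 * 32 * 32 = 2048 bytes, regardless of how large the frame is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 32  /* block edge in pixels, matching the 32x32 example */

/* Sum of absolute differences between one block of the current frame and
 * the co-located block of the previous frame. Only 2 * BLOCK * BLOCK bytes
 * are read, so the working set is tiny compared with a full frame. */
static uint32_t block_sad(const uint8_t *cur, const uint8_t *prev,
                          size_t stride, size_t bx, size_t by)
{
    uint32_t sad = 0;
    for (size_t y = 0; y < BLOCK; y++) {
        const uint8_t *c = cur  + (by * BLOCK + y) * stride + bx * BLOCK;
        const uint8_t *p = prev + (by * BLOCK + y) * stride + bx * BLOCK;
        for (size_t x = 0; x < BLOCK; x++)
            sad += (uint32_t)abs((int)c[x] - (int)p[x]);
    }
    return sad;
}

/* Visit the frame one block at a time instead of sweeping whole frames. */
uint32_t frame_sad(const uint8_t *cur, const uint8_t *prev,
                   size_t width, size_t height)
{
    /* Assumption: dimensions are padded to a multiple of BLOCK. */
    assert(width % BLOCK == 0 && height % BLOCK == 0);

    uint32_t total = 0;
    for (size_t by = 0; by < height / BLOCK; by++)
        for (size_t bx = 0; bx < width / BLOCK; bx++)
            total += block_sad(cur, prev, width, bx, by);
    return total;
}
```

Note that 1080 is not a multiple of 32, so a full-HD frame would need to be padded (for example to 1088 rows) before being passed to this sketch.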
You have to make a similar locality argument with the Cell SPEs because the local memory is not big enough for the entire data set, so you can treat decomposition of the codec in exactly the same way, based on the size of the L2 cache in the system.
Note that the implementation of the PLD instruction is often a NOP. The ARM cores with the security extensions (including all of the current Cortex-A family) force a PLD executed on the non-secure virtual CPU to a NOP even if the secure virtual CPU can execute a PLD. For this reason it is better to issue an early LDR for each line in the L2 cache rather than using PLD; it is more likely to produce the effect you want.
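One way to express "an early LDR for each line" in C is to read a single byte from every cache line of the region you are about to need. The sketch below is an assumption-laden illustration: `touch_lines` is a hypothetical helper, and `CACHE_LINE_BYTES` is assumed to be 64, which you should check against the manual for your core. The `volatile` qualifier is there so that the compiler does not discard loads whose values are never used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size; check your core's manual */

/* Issue one ordinary load per cache line in [buf, buf + len) so that each
 * line is brought into the cache ahead of use. Returns the number of loads
 * issued, which is useful only for testing and tuning. */
size_t touch_lines(const uint8_t *buf, size_t len)
{
    const volatile uint8_t *p = buf;
    size_t loads = 0;

    if (len == 0)
        return 0;

    for (size_t off = 0; off < len; off += CACHE_LINE_BYTES) {
        uint8_t v = p[off];  /* volatile read: the load is really issued */
        (void)v;
        loads++;
    }

    /* If buf is not line-aligned, the last byte may sit in one more line. */
    uint8_t last = p[len - 1];
    (void)last;
    return loads + 1;
}
```

In a block-based loop you would call something like this on the next block's rows shortly before finishing the current block, so the loads have time to complete before the data is needed.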
To answer a few specific points:
> how does the NEON synchronize with the ARM?
Neon is part of the ARM pipeline, so there is no synchronization needed, except that Neon can access the L2 cache directly. Software running on the non-Neon integer part of the ARM CPU must ensure that the L2 cache is coherent with the L1 cache when required (most of the time the data set doesn't need to be in L1 at all, so this should be minimal).