How does the NEON access Memory?
bearfish
over 12 years ago
Note: This was originally posted on 5th May 2008 at http://forums.arm.com
I have a question about how to get the maximum computational capability out of NEON. In our video processing application we must access several video frames. If the video is HD resolution (1920*1080), the memory size of each frame is more than 6 MB (1920*1080*3 bytes). So it's impossible to store a whole frame in cache, and we will get cache misses. I don't know what measures you would take to avoid cache misses.
I will share our experience on this topic from implementing our video processing algorithm on the Cell processor and Equator's BSP processor. The Equator BSP processor has a DMA mechanism that can move data between on-chip memory (I don't know the details; maybe it's TCM) and main memory. So we can set up a double buffer (a "ping-pong" buffer) on chip to avoid cache misses: while the CPU works on the "ping" buffer, we set the DMA to move data between the "pong" buffer and memory. The DMA transfer time is then overlapped with the CPU's computation, and when the CPU finishes processing the "ping" buffer and moves on to the "pong" buffer, it does not get a cache miss.
The Cell processor's Synergistic Processing Unit (SPU) does not have a cache; instead it has a high-speed memory (the local store, no more than 256 KB, holding data, instructions and the stack). The local store is accessed by a DMA engine that can move data between the local store and main memory, so again we can use a double buffer: move data into one buffer while the SPU is processing the data in the other.
My question is whether NEON also has a similar DMA mechanism to handle data movement and avoid cache misses. This is very important for our application, because video processing has to access a great deal of data. Also, how does NEON synchronize with the ARM core? I have not found the answer in the ARM Architecture Reference Manual. I think that if I could get a simple sample of how to use NEON, I would get a feel for my puzzle.
Thanks!
Peter Harris
over 12 years ago
Note: This was originally posted on 6th May 2008 at http://forums.arm.com
On the ARM Cortex implementations the Neon block has direct access to the larger L2 cache, bypassing L1 and avoiding many cache pollution issues. The Neon unit also has a large number of registers, making it ideal for implementing things like decode lookup tables entirely in registers, without accessing external memory at all.
The second point to realise with video is that you don't randomly access the whole video frame, so it doesn't all need to be in cache simultaneously. This allows you to hide the fact that the cache is smaller than your whole data set. Block-based decoders tend to work on data that has a lot of spatial locality - the current block's motion vectors apply to the same spatial location in the previous frame. So the top 32x32 pixel block in the current frame needs the motion vectors for the current frame delta and the 32x32 pixel block from the previous frame (with perhaps a few neighbouring blocks too). This is a small and very manageable data set. By issuing an early PLD or LDR you can cause lines to be loaded into the L2 cache just ahead of when they are used, and ensure that the pipelining of data through the cache is high performance.
You have to make a similar locality argument with the Cell SPEs, because the local memory is not big enough for the entire data set; you can treat the decomposition of the codec in exactly the same way, based on the size of the L2 cache in the system.
Note that the implementation of the PLD instruction is often a NOP. ARM cores with the Security Extensions (including all of the current Cortex-A family) force a PLD executed on the non-secure virtual CPU to a NOP, even if the secure virtual CPU can execute a PLD. For this reason it is better to issue an early LDR for each line you want in the L2 cache rather than using PLD; it is more likely to produce the effect you want.
To answer a few specific points:
> how does the NEON synchronize with the ARM?
Neon is part of the ARM pipeline, so no explicit synchronization is needed, except that Neon can access the L2 cache directly. Software running on the non-Neon integer part of the ARM CPU must ensure that the L2 cache is coherent with the L1 cache when required (most of the time the data set doesn't need to be in L1 at all, so this overhead should be minimal).