
Efficient usage of the PLD instruction in combination with load instructions?

Hi all, after a long time I'm back on the forum with a question.

I'm posting this question with some pseudocode:

for (i = 0; i < 100; i++)
{
    instruction1
    instruction2
    instruction3
    .................
    instructionA : pld     [r0]
    ..................
    instructionB : vld1.16 {d0-d3}, [r0]!
    ..................
    instructionN
}

Let me describe my understanding of the PLD instruction; correct me if I'm wrong.

The PLD instruction gives the processor a hint that we will need the data at address r0 in the near future, so it may fill a cache line with that data ahead of time to avoid a cache-miss penalty later. It is only a hint, though, and the processor is free to ignore it. (Cache line size = 8 words = 32 bytes, with a 32 KB cache on a Cortex-A9; I know the cache sizes are configurable.)
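To make that concrete, here is a minimal C sketch of the pattern in the pseudocode above, assuming GCC and arm_neon.h: __builtin_prefetch normally compiles to a PLD on ARMv7, and vld1q_u16 to a vld1.16 of two d-register pairs. The function name, the 16-elements-per-iteration shape and the PLD_AHEAD_BYTES distance are all illustrative assumptions, to be tuned by measurement rather than taken as the right values.

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative tuning knob: how far ahead (in bytes) to prefetch.
   One 32-byte line ahead here; try larger multiples and measure. */
#define PLD_AHEAD_BYTES 32

void process_block(const uint16_t *src, uint16_t *dst, int n)
{
    /* Assumes n is a multiple of 16 (two 128-bit loads of u16 per pass). */
    for (int i = 0; i < n; i += 16) {
        /* Hint that we will read this address soon (rw = 0 means read,
           locality = 3 means keep it in cache).  On ARMv7 this is
           normally emitted as pld [rX].  Prefetching past the end of
           the buffer is harmless, because PLD never faults. */
        __builtin_prefetch((const char *)&src[i] + PLD_AHEAD_BYTES, 0, 3);

        uint16x8_t a = vld1q_u16(&src[i]);      /* ~ vld1.16 {d0,d1}, [r0]! */
        uint16x8_t b = vld1q_u16(&src[i + 8]);  /* ~ vld1.16 {d2,d3}, [r0]! */

        /* ... whatever instruction1..instructionN do with the data ... */
        vst1q_u16(&dst[i], a);
        vst1q_u16(&dst[i + 8], b);
    }
}

Whether one line ahead is enough depends on the memory latency versus how much work the loop does per iteration, which is exactly question 1 below.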

I want to know the following details:

1. How many instructions ahead of vld1.16 {d0-d3}, [r0]! should pld [r0] be placed to get better performance (i.e. to avoid cache-miss penalties) on hardware like the PandaBoard? For example, 3 or 4 instructions ahead?

2. When the processor executes pld [r0], how many cache lines are filled with data: only one cache line, or more?

Will it be the same for PLDW in combination with VST1.16? For example:

PLDW [r1]
.................
VST1.16 {d0-d3}, [r1]!
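For the store case, the same GCC builtin with its second argument set to 1 requests a prefetch with intent to write; on cores that implement the multiprocessing extensions the compiler can emit PLDW for it. A minimal sketch, with the function name, loop shape and distance again just assumptions for illustration:

#include <arm_neon.h>
#include <stdint.h>

void fill_block(uint16_t *dst, uint16x8_t v0, uint16x8_t v1, int n)
{
    for (int i = 0; i += 16, i < n;) {
        /* rw = 1: prefetch for write (PLDW where the core supports it,
           otherwise an ordinary PLD or nothing at all). */
        __builtin_prefetch(&dst[i + 16], 1, 3);

        vst1q_u16(&dst[i],     v0);   /* ~ vst1.16 {d0,d1}, [r1]! */
        vst1q_u16(&dst[i + 8], v1);   /* ~ vst1.16 {d2,d3}, [r1]! */
    }
}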

What about PLI? How do I specify the address register for the PLI instruction, i.e. a register that contains the address of instructions?
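PLI takes a register (or PC-relative) operand pointing at code rather than data, so from C you would put the address of the code you expect to execute, for example a function pointer, into that register. A hedged sketch using GCC inline assembly; preload_code and the call-soon scenario are invented for illustration:

/* Hint the instruction-side prefetcher about code we expect to run soon.
   'code' is any instruction address, e.g. a function pointer cast to
   const void *, placed in a general-purpose register by the "r" constraint. */
static inline void preload_code(const void *code)
{
    __asm__ volatile("pli [%0]" : : "r"(code));   /* PLI [Rn] */
}

/* Example use: preload a handler before spending time deciding to call it. */
extern void handler(void);

static inline void example(void)
{
    preload_code((const void *)handler);
    /* ... other work ... */
    handler();
}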

Reply
  • You may not get all that much benefit if you are going through a large area sequentially on an A9 or bigger, because those cores implement automatic prefetchers that try to recognize such sequential accesses and preload for you (I'm not sure exactly which processors do this or how good they are at it). Even so, the string routines like memcpy use PLD to try to get that little bit extra, so you could study them.

    The real place I believe you can gain is with linked lists: just preload the next node as soon as you get to the current node, and then do whatever it is you want to do at the current node (see the sketch below).

    I have a vague recollection of seeing someone check for null first rather than letting the preload instruction simply ignore a null pointer. I had a quick Google but couldn't find anything.
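    As a concrete illustration of that linked-list idea, here is a minimal sketch; the node layout and the summing "work" are invented for the example:

    #include <stddef.h>

    struct node {
        struct node *next;
        int value;
    };

    int sum_list(const struct node *head)
    {
        int total = 0;
        for (const struct node *p = head; p != NULL; p = p->next) {
            /* Start pulling the next node into the cache while we work on
               the current one.  The prefetch is only a hint, so a NULL
               p->next on the last node is harmless -- it cannot fault. */
            __builtin_prefetch(p->next, 0, 3);
            total += p->value;        /* the "work" on the current node */
        }
        return total;
    }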

