Efficient usage of PLD instruction in combination with Load instructions?

josephgopu over 11 years ago

Hi all, after a long time I'm back on the forum with a question.

I'm posting this question with some pseudo code:

for (i = 0; i < 100; i++)
{
    instruction1
    instruction2
    instruction3
    .................
    instructionA: pld [r0]
    ..................
    instructionB: vld1.16 {d0-d3}, [r0]!
    ..................
    instructionN
}

Let me describe my understanding of the PLD instruction; correct me if I am wrong.

The PLD instruction gives the processor a hint that the data at address r0 will be needed in the near future, so that it may fill a cache line with that data and avoid a cache miss penalty. The hint is not binding, and the processor may ignore it. (Cache line size = 8 words = 32 bytes in the 32 KB cache of the A9 processor; I know cache sizes are configurable.)
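As a small illustration of the 32-byte line size quoted above, the sketch below works out which cache line an address falls in and how many lines an access touches. The names `CACHE_LINE_BYTES`, `cache_line_base` and `lines_touched` are hypothetical, and the line size is an assumption that should be checked against the actual core.

```c
#include <stdint.h>

/* Assumed line size, taken from the figure quoted in the post. */
#define CACHE_LINE_BYTES 32u

/* Start address of the cache line that contains addr. */
uintptr_t cache_line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* Number of cache lines touched by an access of len bytes at addr. */
unsigned lines_touched(uintptr_t addr, unsigned len)
{
    if (len == 0u)
        return 0u;
    uintptr_t first = cache_line_base(addr);
    uintptr_t last = cache_line_base(addr + len - 1u);
    return (unsigned)((last - first) / CACHE_LINE_BYTES) + 1u;
}
```

For example, `vld1.16 {d0-d3},[r0]` reads 32 bytes, so `lines_touched(0x1010, 32)` gives 2: a load from an address that is not 32-byte aligned straddles two lines under this assumption.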

I want to know the following details:

1. How many instructions ahead of vld1.16 {d0-d3},[r0]! do we have to put pld [r0] to see better performance (avoiding cache miss penalties) on hardware like the PandaBoard? For example, 3 instructions or 4 instructions ahead?

2. Whenever the processor executes the pld [r0] instruction, how many cache lines will be filled with data: only one cache line, or more?

Will it be the same for PLDW used with VST1.16? For example:

PLDW [r1]
.................
VST1.16 {d0-d3},[r1]!

What about PLI? How can I specify the address register for the PLI instruction, which holds the address of instructions?

Chris Shore over 11 years ago

Hi,
I'll try to answer your questions but some of them are quite tricky.
1. The first one is the hardest, I'm afraid! The optimal placing of PLD instructions is going to be very dependent on memory latency. The only real way to find out is to conduct some experiments on a real system (or a model which correctly models memory latency).
2. In general, only one cache line will be loaded for each PLD instruction and it will be the cache line containing the address in the instruction.
3. Don't confuse PLDW with a "real" store instruction like VST. The PLDW is, like all the others, only a hint instruction and if it does anything it will simply cause the relevant cache line to be allocated in the cache so that a subsequent store to that address can hit in the cache. A real VST instruction is not a hint and will, if the cache is configured appropriately, cause the line to be allocated.
4. For PLI, you can put the address of a segment of code in a register (use an instruction like ADR, for instance).
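The experiment suggested in point 1 could be set up along the following lines. This is only a sketch: `sum_with_prefetch`, `LINE_ELEMS` and `PREFETCH_LINES` are hypothetical names, the 32-byte line is an assumption, and `__builtin_prefetch` is the GCC/Clang builtin, which on ARM targets is typically lowered to PLD. The value 4 is a placeholder, not a recommendation.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_ELEMS 16u        /* assumed 32-byte line / 2-byte element */
#ifndef PREFETCH_LINES
#define PREFETCH_LINES 4u     /* placeholder: tune by measurement */
#endif

/* Sum n 16-bit values, issuing one read hint per assumed cache line. */
int32_t sum_with_prefetch(const int16_t *src, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i += LINE_ELEMS) {
        /* Hint the line PREFETCH_LINES ahead; skip it near the end so
           the address expression stays inside the array. */
        size_t ahead = i + (size_t)PREFETCH_LINES * LINE_ELEMS;
        if (ahead < n)
            __builtin_prefetch(&src[ahead], 0, 3);

        /* Consume the current line's worth of elements. */
        size_t end = (n - i < LINE_ELEMS) ? n : i + LINE_ELEMS;
        for (size_t j = i; j < end; j++)
            total += src[j];
    }
    return total;
}
```

Rebuilding with different values, for example `-DPREFETCH_LINES=1`, `2`, `8`, and timing each build on the target board is one way to find the distance that suits its memory latency.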
I hope this helps.
Chris
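To illustrate the distinction in point 3, here is a minimal sketch of a write hint placed ahead of real stores. `fill_with_write_hint` is a hypothetical name and the 32-byte line is an assumption. The second argument of `__builtin_prefetch` set to 1 marks the hint as "for write"; whether the compiler emits PLDW or plain PLD for it depends on the target (PLDW belongs to the ARMv7 Multiprocessing Extensions), so check the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_ELEMS 16u        /* assumed 32-byte line / 2-byte element */
#define PREFETCH_LINES 4u     /* placeholder distance, tune by measurement */

/* Fill dst with value, hinting ahead that the lines will be written. */
void fill_with_write_hint(int16_t *dst, size_t n, int16_t value)
{
    for (size_t i = 0; i < n; i += LINE_ELEMS) {
        size_t ahead = i + (size_t)PREFETCH_LINES * LINE_ELEMS;
        if (ahead < n)
            __builtin_prefetch(&dst[ahead], 1, 3);  /* hint only, rw = 1 */

        /* The real stores: these always happen, hint or no hint. */
        size_t end = (n - i < LINE_ELEMS) ? n : i + LINE_ELEMS;
        for (size_t j = i; j < end; j++)
            dst[j] = value;
    }
}
```

The result is the same with or without the hint; only the timing may differ.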
josephgopu over 11 years ago in reply to Chris Shore

Thank you Chris
daith over 11 years ago in reply to Chris Shore

You may not get all that much benefit if you are going through a large area sequentially on an A9 or bigger, because they implement automatic prefetchers that try to recognize such sequential accesses and preload for you (I'm not sure exactly which processors do this or how good they are at it). Even so, string routines like memcpy use preload instructions to try to get that little bit extra, so you could study them.
The real place I believe you can gain is with linked lists: preload the next node as soon as you get to the current node, then do whatever you want to do at the current node.
I vaguely recall seeing someone check for null rather than letting the preload instruction just ignore it. I had a quick Google but couldn't find anything.
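The linked-list suggestion above might look like the sketch below. `struct node` and `sum_list` are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin. GCC documents that a data prefetch does not fault on an invalid address, so this version passes the final NULL `next` pointer straight to the hint without a check; whether an explicit check is worthwhile on a given core is something to measure.

```c
#include <stddef.h>

/* Hypothetical node type for illustration. */
struct node {
    struct node *next;
    int value;
};

/* Walk the list, hinting the next node before working on this one. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* Hint only: n->next may be NULL at the tail, which is not
           dereferenced here, just offered to the prefetcher. */
        __builtin_prefetch(n->next, 0, 3);

        /* Stand-in for "whatever you want to do at the current node". */
        total += n->value;
    }
    return total;
}
```

The more work done per node, the more time the hint has to take effect before the next node is actually read.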