ARM Cortex-A9 Preload and Lock Code in L2C-310

For the past week I've been studying and experimenting with the caches on an ARM Cortex-A9 (specifically a Zynq SoC), with the main objective of loading part of my code into L2 (PL310) and locking it there. The steps I take to achieve this are:

  • Set TTBR0 and Invalidate TLBs
  • Invalidate L1 Inst and Data Caches and L2 Cache
  • Init and Enable L2 Cache
  • Enable L1 Data and Inst and MMU
  • Unlock all L2 ways. Run a loop loading the code (using symbols defined in the linker script for the memory region I target). I've tried using three types of instructions for loading - LDR, PLD and PLI. Lock all L2 ways.

The code for loading is:

extern uint32_t code_start;
extern uint32_t code_end;

void PreloadCode() {
    uint32_t* temp;
    uint32_t dummy;

    //invalidate L1 instruction cache, then clean and invalidate all L2 ways
    L1ICacheInvalidate();
    *REG7_CLEAN_INV_WAY = 0xffff;
    while(*REG7_CLEAN_INV_WAY);
    *REG9_CACHE_SYNC = 0;
    while(*REG9_CACHE_SYNC);

    //unlock all ways
    *REG9_D_LOCKDOWN0 = 0x0000;
    *REG9_I_LOCKDOWN0 = 0x0000;

    asm volatile ("dsb");
    asm volatile ("isb");

    for(temp = &code_start; temp < &code_end; temp += 1){
        asm volatile ("ldr %0, [%1]" : "=r"(dummy) : "r"(temp));
    //  asm volatile ("pld [%0]" :: "r"(temp) : "memory");
    //  asm volatile ("pli [%0]" :: "r"(temp) : "memory");
    }

    asm volatile ("dsb");
    asm volatile ("isb");

    //lock all ways
    *REG9_D_LOCKDOWN0 = 0xFFFF;
    *REG9_I_LOCKDOWN0 = 0xFFFF;

}
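
A side note on the loop above: it issues one load per word, but a single read anywhere in a line is enough to trigger a linefill. Below is a minimal sketch of a line-stride loop, assuming the 32-byte line length of the L2C-310. The names `touch_lines` and `L2_LINE_BYTES` are hypothetical and not part of the original code, and the sketch only covers the touching step, not the lockdown register writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines, as on the L2C-310 (PL310). */
#define L2_LINE_BYTES 32u

/*
 * Read one byte from every cache line that overlaps [start, end).
 * Returns the number of lines touched, which is handy for checking
 * that the region fits in the ways that are meant to hold it.
 */
static size_t touch_lines(const void *start, const void *end)
{
    /* Round the start address down to the beginning of its line. */
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(L2_LINE_BYTES - 1u);
    uintptr_t stop = (uintptr_t)end;
    size_t lines = 0;

    for (; addr < stop; addr += L2_LINE_BYTES) {
        /* Volatile read so the compiler cannot drop the load. */
        (void)*(const volatile uint8_t *)addr;
        lines++;
    }
    return lines;
}
```

With the linker symbols from the post, a call would look like `touch_lines(&code_start, &code_end)`.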

I also set up the event counters in the PL310 to count the number of IRHIT (instruction read hits) and IRREQ (instruction read requests). I run a piece of code periodically, resetting the counters and invalidating the L1 instruction cache on each loop.

I was hoping to see the number of instruction hits and instruction requests in L2 be the same after each loop. However, this does not happen. The number of hits is always 0, which suggests that I've locked all of L2 but the code was not loaded.

When I run the exact same code without locking L2 at the end, the first loop shows a 0 % hit rate, but all subsequent loops show a 100 % hit rate.
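
The comparison described above comes down to a ratio of the two counter values. Here is a small sketch, assuming the IRHIT and IRREQ values have already been read from the PL310 event counters. `l2_hit_percent` is a hypothetical helper, not something from the original code. It returns -1 when no requests were counted, because 0 hits out of 0 requests (for example, if L1 served every fetch) is a different situation from 0 hits out of many requests.

```c
#include <stdint.h>

/*
 * Hit rate in percent from two event counter snapshots.
 * Returns -1 if no requests reached L2 at all, so that case is not
 * confused with a genuine 0 % hit rate.
 */
static int l2_hit_percent(uint32_t irhit, uint32_t irreq)
{
    if (irreq == 0u)
        return -1;
    /* 64-bit intermediate avoids overflow for large counter values. */
    return (int)(((uint64_t)irhit * 100u) / irreq);
}
```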

Do you have any idea what I'm doing wrong?

Note: I'm only using one of the CPUs. Also, the region I want to load is configured in the page table as Outer and Inner Write-Back, Write-Allocate.

  • Sorry, I was unclear. What I meant was: be sure _not_ to set "exclusive".
  • Another thing: The CA9 manual says this: "When enabled during a period of time, all newly allocated cache lines get marked as locked."
    It sounds to me as if you have to "lock" before fetching.
  • Although it does not seem to make sense (i.e. locking before fetching), I tried this approach. No results. By the way, looking at some code provided by Xilinx at www.wiki.xilinx.com/Zynq-7000 AP SoC Boot - Locking and Executing out of L2 Cache Tech Tip, it seems like I'm following the correct steps:

     

    int preload_funct(unsigned int uiSrcAddress, unsigned int uiSize)
    {
    //	static unsigned int uiAlreadyProgrammed;
    	unsigned int 		i=0;
    //	unsigned int  		uiNumofWays=0;
    //	unsigned int 		uiVariable=0;
    //	unsigned int 		uiValue0=0;
    //	unsigned int 		uiValue1=0;
    
    
    	fsbl_printf(DEBUG_GENERAL,"\n\rInside  Preload Functions \n\r");
    	// Disable FIQ and IRQ interrupt
    	Xil_ExceptionDisableMask(XIL_EXCEPTION_ALL);
    	/*
    	 * UnLock Data and Instruction from way 1 to7 and unlock Data and instruction for Way 0.
    	 * The PL310 has 8 sets of registers, one per possible CPU.
    	 */
    	for(i=0;i<8;i++)
    	{
    		Xil_Out32((XPS_L2CC_BASEADDR + (XPS_L2CC_CACHE_DLCKDWN_0_WAY_OFFSET + (i*8)) ), (0x00000000));
    		Xil_Out32((XPS_L2CC_BASEADDR + (XPS_L2CC_CACHE_ILCKDWN_0_WAY_OFFSET + (i*8)) ), (0x00000000));
    
    	}
    
    
    	/* Flush the Caches */
    	Xil_DCacheFlush();
    	Xil_DCacheInvalidate();
    	fsbl_printf(DEBUG_GENERAL,"\n\r Invalidate D cache \n\r");
    
    	/*Preload instruction from section starts from 0x31000000 to Cache Way 0*/
    	{
    	// Copy application source address to r0 register
    	 asm volatile ("mov r0,%0":: "r"(uiSrcAddress));
    	 //Copy application size to r1 register
    	 asm volatile ("mov r1,%0":: "r"(uiSize));
    	 // Offset register i.e. r2
    	 asm volatile  ("mov r2, #0");
    	 // Label
    	 asm ("preload_inst:");
    	 // Load r4 register from the r0+r2 (Source address + offset)
    	 // This step create an valid entry of the address (Source address + offset) in L2 cache
    	 asm volatile ("ldr r4, [r0,r2]");
    	 // Increment the offset by one word (4 bytes)
    	 asm volatile ("add r2,r2,#4");
    	 // Compare the offset with the Application size.
    	 asm volatile ("cmp r1, r2");
    	 // If the size is still >= the offset, jump back to Label
    	 asm volatile ("bge preload_inst");
    
    	}
    	// lock both Data and instruction caches from Way 1 to 7.
    	// Lock Data and Instruction Caches for Way 0
    	for(i=0;i<8;i++)
    		{
    			Xil_Out32((XPS_L2CC_BASEADDR + (XPS_L2CC_CACHE_DLCKDWN_0_WAY_OFFSET + (i*8)) ), 0xffff);
    			Xil_Out32((XPS_L2CC_BASEADDR + (XPS_L2CC_CACHE_ILCKDWN_0_WAY_OFFSET + (i*8)) ), 0xffff);
    		}
    	// Enable all the Interrupts
    	Xil_ExceptionEnableMask(XIL_EXCEPTION_ALL);
    //	uiAlreadyProgrammed=uiVariable;
    
    	return - XST_SUCCESS;
    
    }

    However, they run this function with L1 disabled. I also tried this with no results.

    Completely lost.

    Thank you for your efforts trying to help me.

  • Where is your code? If the "known good" example does not work for you, check what is different in your setup.
    Ah, and see: They lock _after_ reading ;-)
    Do you also disable the interrupts?
  • Yes, I run with interrupts disabled. What do you mean by "_after reading_"? In my code in the first post of the thread, I show that I only lock the ways after running the complete fetch loop, and I separate the two with synchronization instructions.
  • Sorry, yes. Too many parallel tasks. You do it and Xilinx does it also, locking "after" touching the "code". Which in a way makes sense. But the reference manual states that the locking is done before touching the code.
    *hmm* this triggers my curiosity.
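
One way to read "lock before fetching" for lockdown by way: a set bit in a PL310 lockdown register means no allocation in that way. So if every way except the target is locked during the load, linefills can only land in the target ways; afterwards the target ways are locked and the rest are released. The sketch below only computes the two masks. It assumes the Zynq-7000 geometry (8 ways of 64 KB each) and a physically contiguous region, and `plan_way_lockdown` and `struct lock_plan` are hypothetical names. Refusing to lock all eight ways is a design choice of the sketch, not a hardware rule.

```c
#include <stdint.h>

/* Assumptions: Zynq-7000 L2 geometry, 8 ways of 64 KB each. */
#define L2_NUM_WAYS  8u
#define L2_WAY_BYTES (64u * 1024u)

struct lock_plan {
    uint32_t ways_needed; /* ways required to hold the region          */
    uint32_t load_mask;   /* lockdown value while preloading           */
    uint32_t run_mask;    /* lockdown value once the region is loaded  */
};

/*
 * Compute lockdown masks for a region of code_bytes bytes.
 * Returns 0 on success, -1 if the region is empty or would leave
 * no unlocked way for everything else.
 */
static int plan_way_lockdown(uint32_t code_bytes, struct lock_plan *plan)
{
    uint32_t all = (1u << L2_NUM_WAYS) - 1u;
    /* Round up to whole ways without risking overflow. */
    uint32_t n = code_bytes / L2_WAY_BYTES
               + ((code_bytes % L2_WAY_BYTES) != 0u ? 1u : 0u);
    uint32_t target;

    if (n == 0u || n >= L2_NUM_WAYS)
        return -1;

    target = (1u << n) - 1u;          /* lowest n ways hold the code  */
    plan->ways_needed = n;
    plan->load_mask = all & ~target;  /* block all ways but the target */
    plan->run_mask = target;          /* then freeze only the target   */
    return 0;
}
```

For a 40 KB region this gives a load mask of 0xFE and a run mask of 0x01, which would be written to both the data and instruction lockdown registers.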
  • I can't find your reference in the Cortex-A9 TRM. Could you please show me exactly where it is?
  • It is in the L2 cache controller (PL310) TRM: ARM DDI 0246F, chapter 2.3.6
  • Ensure that all the code executed by this procedure is in an un-cacheable area of memory.
  • I will try this. But could you elaborate on why you suggest this?
  • Otherwise the code that runs the procedure may itself get locked in the L2 cache instead of the instructions that are to be locked down.
    See below for the high-level flow (Stage 2):
    1. Ensure that no processor exceptions can occur during the execution of this procedure, by disabling interrupts.
    2. Ensure that all the code executed by this procedure is in an un-cacheable area of memory or in an already locked.
    3. Ensure that all data used by the following code (apart from the data that is to be locked down) is in an un-cacheable area of memory or is in an already locked.
    4. Ensure that the data/instructions that are to be locked down are in a cacheable area of memory.
    5. Ensure that the data/instructions that are to be locked down are not already in the cache, using cache clean and/or invalidate instructions.
    6. Enable the allocation per line (By writing to enable bit). This enables allocation per line.
    7. For each of the cache lines to be locked down:
    • If a data cache is being locked down, use an LDR instruction to load a word from the memory cache line, which ensures that the memory cache line is loaded into the cache.
    • If an instruction cache is being locked down, use the prefetch instruction cache line operation to fetch the memory cache line into the cache.
    8. Disable the allocation per line (By writing to enable bit).
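
Steps 6 to 8 above can be sketched as follows. This is a rough illustration only: it assumes 32-byte lines and takes the address of the lock-line enable register as a parameter, so no register address is hard-coded. `lock_lines` is a hypothetical helper, and on the target each register write would be followed by a barrier (DSB), which is left out so the sketch compiles on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines. */
#define LINE_BYTES 32u

/*
 * Enable allocation-per-line locking, touch every line of
 * [start, end), then disable it again. Returns the number of
 * lines touched.
 */
static size_t lock_lines(volatile uint32_t *lock_line_en,
                         const void *start, const void *end)
{
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(LINE_BYTES - 1u);
    size_t lines = 0;

    *lock_line_en = 1u;                          /* step 6: enable    */

    for (; addr < (uintptr_t)end; addr += LINE_BYTES) {
        (void)*(const volatile uint8_t *)addr;   /* step 7: linefill  */
        lines++;
    }

    *lock_line_en = 0u;                          /* step 8: disable   */
    return lines;
}
```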
  • "If an instruction cache is being locked down, use the prefetch instruction cache line operation to fetch the memory cache line into the cache." By this you mean the PLI instruction? Or is there another instruction for prefetching that I'm not aware of?

    I opted for the LDR instruction because the Xilinx example uses this.
  • I'd say the "PLI" instruction works for both the L1 and L2 caches, whereas "LDR" only works for the L2 cache (it is unified).
    Anyway, please keep us informed about the final solution.
  • No luck. I placed all my cache maintenance and the preload/lock code in an uncacheable region. I also verified the memory attributes in the page table. Everything seems fine.

    I enabled the allocation per line bit (which I was not using before), also with no results. From reading the documentation, it doesn't seem necessary for locking by way.

    Thank you for your suggestions.
  • Other suggestions:

    Ensure that there are no accesses by other masters (cores/peripherals/ACP) to the L2 cache during the lock.

    Which replacement strategy do you use: round-robin or pseudo-random? I guess round-robin is more predictable. Bit [25] of the Auxiliary Control Register configures the replacement strategy.

    I see this definition in the L2 cache spec: "The locked status of each cache line is given by the optional bit [21] of the Tag RAM". I guess you can read the L2 cache RAM to see which addresses are locked.
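
To check the replacement strategy mentioned above, the Auxiliary Control Register value can be decoded once it has been read back. This is a sketch under the usual L2C-310 field layout: bit [25] for replacement policy, bit [16] for associativity, bits [19:17] for way size. `decode_aux_ctrl` and the types are hypothetical names, and 0x02060000 in the usage note is just an example value.

```c
#include <stdint.h>

enum l2_replacement { L2_PSEUDO_RANDOM = 0, L2_ROUND_ROBIN = 1 };

struct l2_geometry {
    enum l2_replacement policy;
    unsigned ways;         /* 8 or 16                                   */
    unsigned way_size_kb;  /* 16..512, or 0 for a reserved field value  */
};

/* Decode the fields of an L2C-310 Auxiliary Control Register value. */
static struct l2_geometry decode_aux_ctrl(uint32_t aux)
{
    struct l2_geometry g;
    unsigned ws = (aux >> 17) & 0x7u;  /* way-size field, bits [19:17] */

    g.policy = ((aux >> 25) & 1u) ? L2_ROUND_ROBIN : L2_PSEUDO_RANDOM;
    g.ways = ((aux >> 16) & 1u) ? 16u : 8u;
    /* Encodings 1..6 map to 16 KB .. 512 KB; 0 and 7 are reserved. */
    g.way_size_kb = (ws >= 1u && ws <= 6u) ? (8u << ws) : 0u;
    return g;
}
```

For example, `decode_aux_ctrl(0x02060000u)` reports round-robin replacement with 8 ways of 64 KB.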