
weird issue in arm code called by C function

  • Note: This was originally posted on 21st March 2013 at http://forums.arm.com


    Lots of things could be happening here. Can't really say what all it could be without knowing your processor and your OS, but here are some possibilities:

    1) The function address isn't in the BTB yet so jumping to it causes a branch misprediction (~8-13 cycles)
    2) The code isn't in L1 icache so causes a miss to L2 cache (~12-25 cycles)
    3) The code isn't in L2 cache so causes a miss to main memory (could be dozens to hundreds of cycles)
    4) The code region isn't in the ITLB so causes a miss in the main TLB (~5-10 cycles)
    5) The code region isn't in the main TLB so it causes a page walk (a few dozen cycles)
    6) The page tables aren't in cache and need to be fetched from main memory; a two-level walk can touch two completely different memory locations (potentially hundreds of cycles)
    7) The code isn't even in main memory and causes a load from disk/flash/whatever. It's actually common OS procedure to not page in data until it's used. (could vary wildly, anywhere from thousands to hundreds of thousands of cycles)

    Would also have to know how long that memmove actually takes in order to get a feel for the comparison you made. Are you sure that it's being performed and not optimized out by the compiler since you don't actually use the results for anything?
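    To see those cold-start effects in isolation, it helps to time the same call a few times in a row: the first call pays for the cache/TLB/paging misses, and later calls run warm. A minimal sketch in plain C with POSIX `clock_gettime` (here `some_function` is a hypothetical stand-in for the routine under test):

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static char src[4096], dst[4096];

    /* Hypothetical stand-in for the ARM routine being measured. */
    static void some_function(void) {
        memmove(dst, src, sizeof dst);
    }

    /* Returns the elapsed time of one call, in nanoseconds. */
    static long long time_one_call(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        some_function();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1000000000LL
             + (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        /* The first call is typically much slower than the rest. */
        for (int i = 0; i < 4; i++)
            printf("call %d: %lld ns\n", i, time_one_call());
        return 0;
    }
    ```

    On iOS the analogous high-resolution timer is `mach_absolute_time`; the structure of the experiment is the same.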


    Thanks for your reply, Exophase,
    I am using iOS 6.1 on an iPad 3.
    I am not sure how to influence the things you mentioned (BTB, L1/L2 caches, TLB), as Apple does the processor setup in the bootloader. I can't even execute the STREXB instruction because the multi-processor extensions are not enabled (nor read the CPU ID register, which is privileged).
    Actually my assembler is the clang compiler (clang -x assembler-with-cpp), and I noticed that when I set clang's Link-Time Optimization flag to YES the loading time drops drastically. However, I am not sure this isn't a coincidence. Optimization is set to -O0, which means no optimization (for the C code), so even though the variables aren't used they still remain. For the assembly code there is obviously no optimization.
    I would like to know more about these L1 and L2 caches, TLB and BTB and how to influence them. Can you point me to some resources, please?
  • Note: This was originally posted on 21st March 2013 at http://forums.arm.com

    Your time measurement is also likely to be wildly inaccurate. You need a system call on either side of the memmove or the ARM function to get the time, and the system call is going to take a lot longer than the code it is timing (which is likely to run for much less than the timer's granularity, depending on your clock speed).

    Loop the memmove and the ARM function so they take a second or so to run, and time that.
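    That suggestion can be sketched like this (a hypothetical `arm_function` stands in for the routine being compared against memmove; the iteration count is something you would tune until the loop runs for about a second):

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    enum { ITERATIONS = 1000000 };  /* tune until the loop runs ~1 second */

    static char src[1024], dst[1024];

    /* Hypothetical stand-in for the hand-written ARM routine. */
    static void arm_function(void) {
        memmove(dst, src, sizeof dst);
    }

    /* Time the whole loop, then divide: the two clock_gettime system
       calls are amortized over a million iterations. */
    static double ns_per_call(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERATIONS; i++)
            arm_function();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9
                        + (t1.tv_nsec - t0.tv_nsec);
        return total_ns / ITERATIONS;
    }

    int main(void) {
        printf("%.1f ns per call\n", ns_per_call());
        return 0;
    }
    ```

    Timing the memmove the same way gives two averages you can compare directly, with the measurement overhead reduced to noise.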
  • Note: This was originally posted on 21st March 2013 at http://forums.arm.com

    Wikipedia can give you a good overview on these topics:

    http://en.wikipedia.org/wiki/CPU_cache
    http://en.wikipedia....ookaside_buffer
    http://en.wikipedia....anch_prediction
    http://en.wikipedia....arget_predictor

    In a nutshell, these are small on-chip buffers that hold recently used data and address translations so the CPU doesn't have to go off-chip for them. But when your program first starts, these buffers are empty, so there's a big expense to fill them the first time.

    I don't know how iOS works exactly, but there's a good chance that new executables are demand paged through the virtual memory system. This could mean that the function you're calling isn't even in memory yet. The memory will be marked in the page table as inaccessible, and when the program tries to access it an exception will be raised. This will go into the OS kernel, which will run a lot of code to determine that this memory is on disk and needs to be loaded; then it has to fetch the data from flash, probably via DMA, which can take so long that the OS will try to schedule other running programs to execute in the meantime. There's a good chance the OS will DMA a fairly large region to lower the chances of needing to do this again.
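    If you want to check whether demand paging is in play, one sketch (assuming a POSIX-style system; note that on macOS/iOS the `mincore` vector argument is declared `char *` rather than `unsigned char *`, so a cast may be needed there) is to ask the kernel whether the page holding a function's code is currently resident:

    ```c
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Hypothetical function whose code page we probe. */
    static void probe_me(void) {}

    /* Returns 1 if the page holding probe_me's code is resident in
       physical memory, 0 if it is paged out, -1 on error. */
    static int code_page_resident(void) {
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t addr = (uintptr_t)(void *)probe_me & ~(uintptr_t)(page - 1);
        unsigned char vec = 0;
        if (mincore((void *)addr, (size_t)page, &vec) != 0)
            return -1;
        return vec & 1;
    }

    int main(void) {
        probe_me();  /* touch the code so it must be paged in */
        int r = code_page_resident();
        if (r < 0)
            perror("mincore");
        else
            printf("code page %s resident\n", r ? "is" : "is not");
        return 0;
    }
    ```

    Probing before and after the first call to your routine would show whether the page fault is what you're paying for.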

    But I really have no idea if this is the case or not; even if the executable is demand paged, your function may already have been loaded along with other things. As for the other buffers, I'm pretty confident they all need to be filled, so you'd be looking at at least hundreds of cycles for that. Depending on how the memmove is implemented and what the CPU is like (Apple hasn't released details on how their Swift processor works, so this is guesswork), the memmove could take as little as 100 or even 50 cycles. But if the first call to that function really takes 100 times more cycles than the memmove (though, as isogen says, the timing setup is not too reliable), then I think it must be due to demand paging. You can get a better idea of that mechanism here:

    http://en.wikipedia....i/Demand_paging
    http://en.wikipedia..../Virtual_memory