
Weird issue in ARM code called by a C function

  • Note: This was originally posted on 21st March 2013 at http://forums.arm.com


    Lots of things could be happening here. I can't really say which without knowing your processor and your OS, but here are some possibilities:

    1) The function address isn't in the BTB yet, so jumping to it causes a branch misprediction (~8-13 cycles)
    2) The code isn't in the L1 icache, so the fetch misses to the L2 cache (~12-25 cycles)
    3) The code isn't in the L2 cache, so the fetch misses to main memory (could be dozens to hundreds of cycles)
    4) The code region isn't in the ITLB, so the lookup falls through to the main TLB (~5-10 cycles)
    5) The code region isn't in the main TLB, so it causes a page walk (a few dozen cycles)
    6) The page tables aren't in cache and have to be fetched from main memory, which could involve two completely different memory locations (potentially hundreds of cycles)
    7) The code isn't even in main memory and has to be loaded from disk/flash/whatever. It's actually common OS procedure not to page in data until it's used. (could vary wildly, anywhere from thousands to hundreds of thousands of cycles)
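All of the costs listed above are paid mostly on the first call, so one way to check whether they explain the slowdown is to time several consecutive calls and compare the first against the rest. The following is a minimal sketch in C, not the original poster's code: `routine_under_test`, `now_ns` and `time_calls` are hypothetical names, and it assumes a POSIX system where `clock_gettime` is available. On iOS 6 that call is likely not available, and `mach_absolute_time()` would be the usual substitute.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* asks libc to declare clock_gettime */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the assembly routine being measured.
   It only counts its own invocations. */
static volatile int call_count = 0;

static void routine_under_test(void)
{
    call_count = call_count + 1;
}

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Calls fn `calls` times in a row and stores the duration of call i
   in out_ns[i]. out_ns[0] is the "cold" call; later entries are "warm". */
static void time_calls(void (*fn)(void), uint64_t *out_ns, int calls)
{
    for (int i = 0; i < calls; i++) {
        uint64_t start = now_ns();
        fn();
        out_ns[i] = now_ns() - start;
    }
}
```

If `out_ns[0]` is much larger than the later entries, the extra time is probably a one-off warm-up cost (caches, TLBs, branch predictor, demand paging) rather than something in the routine itself.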

    I would also have to know how long that memmove actually takes in order to get a feel for the comparison you made. Are you sure that it's being performed and not optimized out by the compiler, since you don't actually use the results for anything?
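One common way to make sure a copy like this is really performed is to consume its result through a `volatile` object, which the compiler is not allowed to drop. The sketch below is an illustration under that assumption, not the code from the original question: `move_and_consume` and `sink` are hypothetical names, and `n` is assumed to be greater than zero.

```c
#include <stddef.h>
#include <string.h>

/* Writes to a volatile object count as observable behaviour,
   so the compiler has to keep them. */
static volatile unsigned char sink;

/* Moves n bytes (n > 0) inside buf from src_off to dst_off, then reads
   the first moved byte back through the volatile sink so the result of
   the memmove is actually used. Returns that byte. */
static unsigned char move_and_consume(unsigned char *buf,
                                      size_t dst_off, size_t src_off,
                                      size_t n)
{
    memmove(buf + dst_off, buf + src_off, n);
    sink = buf[dst_off];
    return sink;
}
```

When timing, the call to `move_and_consume` would sit between the two timestamp reads. Checking the generated assembly for the `memmove` call is still the most direct confirmation.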


    Thanks for your reply, Exophase.
    I am using iOS 6.1 on an iPad 3.
    I am not sure how to influence the things you mentioned (BTB, L1 and L2 caches, TLB), since Apple sets up the processor in the bootloader. I can't even execute the STREXB instruction, because the multi-processor extensions are not enabled (nor the CPUID instruction, which is privileged).
    Actually, my assembler is the clang compiler (clang -x assembler-with-cpp), and I noticed that when I set clang's Link-Time Optimization flag to YES, the loading time drops drastically. However, I am not sure this isn't a coincidence. The optimization level is -O0, which means no optimization for the C code, so although the variables aren't used, they still remain. For the assembly code there is obviously no optimization.
    I would like to know more about the L1 and L2 caches, the TLB and the BTB, and how to influence them. Can you point me to some resources, please?