3 Replies. Latest reply: Aug 6, 2014 10:23 PM by divcesar

I don't understand cache miss count between cachegrind vs. streamline

seongjincho Bit

I am studying cache effects using a simple micro-benchmark.

 

My understanding is that if N is larger than the cache size, the first read of every cache line misses. (See 1.)

On my board (Arndale-5250) the cache line size is 64 bytes, so I expect about N/8 cache misses in total, and cachegrind shows that. (See 2.)
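The expected miss count for one sequential pass can be checked with a little arithmetic. The helper below is only a sketch: the name `expected_cold_misses` is hypothetical, and it assumes the array starts on a cache-line boundary and that no prefetching takes place. With 4-byte `int` elements and 64-byte lines it gives N/16; N/8 would correspond to 8-byte elements such as `double`.

```c
#include <stddef.h>

/* Number of distinct cache lines touched by one sequential pass over an
 * array, i.e. the compulsory (cold) misses when nothing is prefetched.
 * Assumption: the array starts on a cache-line boundary. */
static size_t expected_cold_misses(size_t n_elems, size_t elem_size,
                                   size_t line_size)
{
    size_t bytes = n_elems * elem_size;
    return (bytes + line_size - 1) / line_size;   /* round up */
}
```

For example, `expected_cold_misses(10000000, sizeof(int), 64)` evaluates to 625000 when `sizeof(int)` is 4.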

 

 

However, the Streamline tool displays a different result: only 21,373 cache misses. (See 3.)

I suspect hardware prefetching, but I can't find any counter in Streamline that would let me check this.

 

I really don't understand why Streamline reports far fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

 


1. Here is a simple micro-benchmark program.

  #include <stdio.h>

  #define N 10000000

  /* 10,000,000 ints: much larger than the cache */
  static int A[N];

  int main(){
    int i;
    double temp=0.0;

    /* one sequential pass over the whole array */
    for (i=0 ; i<N ; i++){
      temp = A[i]*A[i];
    }

    return 0;
  }




2. The following is cachegrind's output:

[image: result]

3. The following is Streamline's output:

[image: result2]

  • Re: I don't understand cache miss count between cachegrind vs. streamline
    peterharris Employee

    Have you checked the disassembly that the compiler generates for your benchmark? There is a good chance that the compiler spots that it is "totally pointless" because you never read the result and hence optimizes out most of the behavior. Try ...

     

     

    #include <stdio.h>
    #define N 10000000
    
    static volatile int A[N];
    
    int main(){
    
      int i;  
      int temp = 0;
    
      for (i=0 ; i<N ; i++){
        temp += A[i]*A[i];
      } 
    
      return temp;
    }
    

     

     

    HTH,
    Pete

    • Re: Re: I don't understand cache miss count between cachegrind vs. streamline
      seongjincho Bit

      Dear Peter,

      Thank you very much for your answer.

      Here is the assembly code for the micro-benchmark program; the "objdump" output is the same.

       

       

       

      .syntax unified
      .arch armv7-a
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu vfpv3-d16
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .thumb
        .file "test_arm.c"
        .section .text.startup,"ax",%progbits
        .align 2
        .global main
        .thumb
        .thumb_func
        .type main, %function
      
      
      main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push {r4, r5}
        movw r4, #38528
        ldr r2, .L5
        movt r4, 152
        movs r3, #0
        ldr r5, .L5+4
      .L2:
        ldr r0, [r2, r3, lsl #2]
        ldr r1, [r2, r3, lsl #2]
        adds r3, r3, #1
        cmp r3, r4
        add r1, r0, r1
        str r1, [r5, #448]
        bne .L2
        movs r0, #0
        pop {r4, r5}
        bx lr
      .L6:
        .align 2
      .L5:
        .word .LANCHOR0
        .word .LANCHOR1
        .size main, .-main
        .comm B,80000000,8
        .bss
        .align 2
      .LANCHOR0 = . + 0
      .LANCHOR1 = . + 39999552
        .type A, %object
        .size A, 40000000
      A:
        .space 40000000
        .type temp, %object
        .size temp, 4
      temp:
        .space 4
        .ident "GCC: (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) 4.7.3 20130226 (prerelease)"
        .section .note.GNU-stack,"",%progbits
      
      
      
      
  • Re: I don't understand cache miss count between cachegrind vs. streamline
    divcesar Bit

    You can't compare the raw results of these two tools.

    Cachegrind uses simulation or instrumentation to do its job. ARM Streamline uses time-based sampling.
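To illustrate why a simulated count can differ from what the hardware reports, here is a toy cache model. It is a minimal sketch, not how cachegrind is actually implemented: the direct-mapped layout, the 32 KB size, and all names (`toy_cache`, `simulate_scan`) are assumptions made for illustration. Because the model has no prefetcher, a sequential scan misses exactly once per cache line, whereas real hardware may fetch upcoming lines before the load reaches them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u    /* bytes per line, as stated in the question */
#define NUM_LINES 512u   /* 32 KB / 64 B: assumed cache size */

/* Toy direct-mapped cache: one tag per set, no prefetcher modelled. */
struct toy_cache {
    uint64_t tag[NUM_LINES];
    unsigned char valid[NUM_LINES];
    unsigned long misses;
};

static void toy_cache_init(struct toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Record one read of the byte at address addr. */
static void toy_cache_read(struct toy_cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    size_t set = (size_t)(line % NUM_LINES);

    if (!c->valid[set] || c->tag[set] != line) {
        c->valid[set] = 1;
        c->tag[set] = line;   /* the full line number serves as the tag */
        c->misses++;
    }
}

/* Simulate one sequential pass over n elements of elem_size bytes
 * starting at address base, and return the number of misses. */
static unsigned long simulate_scan(uint64_t base, size_t n, size_t elem_size)
{
    struct toy_cache c;
    size_t i;

    toy_cache_init(&c);
    for (i = 0; i < n; i++)
        toy_cache_read(&c, base + (uint64_t)i * elem_size);
    return c.misses;
}
```

Calling `simulate_scan(0, 10000000, 4)` models the benchmark's pass over 10,000,000 4-byte elements and counts one miss per 64-byte line. A hardware counter sampled on a core with an active prefetcher has no reason to match that figure.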