2 Replies Latest reply: May 19, 2014 11:17 AM by seongjincho

I don't understand cache miss count between cachegrind vs. streamline

seongjincho Bit

I am studying cache effects using a simple micro-benchmark.

I expect that if N is larger than the cache size, the first read of every cache line will miss. (See 1.)

On my board (Arndale-5250) the cache line size is 64 bytes, so I expect about N/8 cache misses in total, and cachegrind shows that. (See 2.)

 

 

However, the Streamline tool displays a different result: only 21,373 cache misses. (See 3.)

I suspect hardware prefetching, but I can't find any counter in Streamline that lets me check it.

I really don't know why Streamline reports so many fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

 


1. Here is the simple micro-benchmark program:

  #include <stdio.h>

  #define N 10000000

  static int A[N];

  int main(){
    int i;
    double temp=0.0;

    for (i=0 ; i<N ; i++){
      temp = A[i]*A[i];
    }

    return 0;
  }




2. Here is cachegrind's output:

[image: result]

3. Here is Streamline's output:

[image: result2]

  • Re: I don't understand cache miss count between cachegrind vs. streamline
    peterharris Employee

    Have you checked the disassembly that the compiler generates for your benchmark? There is a good chance that the compiler spots that it is "totally pointless" because you never read the result and hence optimizes out most of the behavior. Try ...

     

     

    #include <stdio.h>
    #define N 10000000
    
    static volatile int A[N];
    
    int main(){
    
      int i;  
      int temp = 0;
    
      for (i=0 ; i<N ; i++){
        temp += A[i]*A[i];
      } 
    
      return temp;
    }
    

     

     

    HTH,
    Pete

    • Re: Re: I don't understand cache miss count between cachegrind vs. streamline
      seongjincho Bit

      Dear Peter,

      I really appreciate your answer.

      Here is the assembly code for the micro-benchmark program; the "objdump" output is the same.

       

       

       

      .syntax unified
      .arch armv7-a
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu vfpv3-d16
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .thumb
        .file "test_arm.c"
        .section .text.startup,"ax",%progbits
        .align 2
        .global main
        .thumb
        .thumb_func
        .type main, %function
      
      
      main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push {r4, r5}
        movw r4, #38528
        ldr r2, .L5
        movt r4, 152
        movs r3, #0
        ldr r5, .L5+4
      .L2:
        ldr r0, [r2, r3, lsl #2]
        ldr r1, [r2, r3, lsl #2]
        adds r3, r3, #1
        cmp r3, r4
        add r1, r0, r1
        str r1, [r5, #448]
        bne .L2
        movs r0, #0
        pop {r4, r5}
        bx lr
      .L6:
        .align 2
      .L5:
        .word .LANCHOR0
        .word .LANCHOR1
        .size main, .-main
        .comm B,80000000,8
        .bss
        .align 2
      .LANCHOR0 = . + 0
      .LANCHOR1 = . + 39999552
        .type A, %object
        .size A, 40000000
      A:
        .space 40000000
        .type temp, %object
        .size temp, 4
      temp:
        .space 4
        .ident "GCC: (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) 4.7.3 20130226 (prerelease)"
        .section .note.GNU-stack,"",%progbits