I don't understand the difference in cache miss counts between Cachegrind and Streamline

I am studying cache effects using a simple micro-benchmark.

I expect that if the array of N elements is larger than the cache, the first read of every cache line will miss. (See 1.)

On my board (Arndale 5250) the cache line size is 64 bytes, so I expect about N/8 cache misses in total, and Cachegrind shows that. (See 2.)

However, the Streamline tool shows a different result: only 21,373 cache misses. (See 3.)

I suspect hardware prefetching, but I can't find any counter in Streamline that lets me check this.

I really don't understand why Streamline reports far fewer cache misses than Cachegrind. Could someone give me a reasonable explanation?

1. Here is the simple micro-benchmark program:

  #include <stdio.h>

  #define N 10000000

  static int A[N];

  int main(){
    int i;
    double temp=0.0;

    /* Read every element once, in order. */
    for (i=0 ; i<N ; i++){
      temp = A[i]*A[i];
    }

    return 0;
  }

2. The following is Cachegrind's output:

3. The following is Streamline's output:

  Have you checked the disassembly that the compiler generates for your benchmark? There is a good chance that the compiler spots that it is "totally pointless" because you never read the result and hence optimizes out most of the behavior.

    #include <stdio.h>
    #define N 10000000
    static volatile int A[N];
    int main(){
      int i;
      int temp = 0;
      for (i=0 ; i<N ; i++){
        temp += A[i]*A[i];
      }
      return temp;
    }



  • In reply to Peter Harris:

    Dear Peter.

    I really appreciate your answer.

    Here is the assembly code for the micro-benchmark program; the "objdump" output is the same.

    .syntax unified
    .arch armv7-a
      .eabi_attribute 27, 3
      .eabi_attribute 28, 1
      .fpu vfpv3-d16
      .eabi_attribute 20, 1
      .eabi_attribute 21, 1
      .eabi_attribute 23, 3
      .eabi_attribute 24, 1
      .eabi_attribute 25, 1
      .eabi_attribute 26, 2
      .eabi_attribute 30, 2
      .eabi_attribute 34, 1
      .eabi_attribute 18, 4
      .file "test_arm.c"
      .section .text.startup,"ax",%progbits
      .align 2
      .global main
      .type main, %function
      @ args = 0, pretend = 0, frame = 0
      @ frame_needed = 0, uses_anonymous_args = 0
      @ link register save eliminated.
    main:
      push {r4, r5}
      movw r4, #38528
      ldr r2, .L5
      movt r4, 152
      movs r3, #0
      ldr r5, .L5+4
    .L2:
      ldr r0, [r2, r3, lsl #2]
      ldr r1, [r2, r3, lsl #2]
      adds r3, r3, #1
      cmp r3, r4
      add r1, r0, r1
      str r1, [r5, #448]
      bne .L2
      movs r0, #0
      pop {r4, r5}
      bx lr
      .align 2
    .L5:
      .word .LANCHOR0
      .word .LANCHOR1
      .size main, .-main
      .comm B,80000000,8
      .align 2
    .LANCHOR0 = . + 0
    .LANCHOR1 = . + 39999552
      .type A, %object
      .size A, 40000000
      .space 40000000
      .type temp, %object
      .size temp, 4
      .space 4
      .ident "GCC: (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) 4.7.3 20130226 (prerelease)"
      .section .note.GNU-stack,"",%progbits
  • You can't compare the raw results of these two tools.

    Cachegrind uses simulation or instrumentation to do its job. ARM Streamline uses time-based sampling.