I don't understand the difference in cache miss counts between cachegrind and Streamline

I am studying cache effects using a simple micro-benchmark.

I think that if N is larger than the cache size, the cache takes a miss on the first read of every cache line. (See 1.)

On my board (Arndale-5250) the cache line size is 64 bytes, so I expect N/8 cache misses in total, and cachegrind shows that. (See 2.)

However, the Streamline tool displays a different result: only 21,373 cache misses. (See 3.)

I suspect hardware prefetching, but I can't find any counter in Streamline that lets me check this.

I really don't know why Streamline reports far fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

1. Here is a simple micro-benchmark program.

#include <stdio.h>

#define N 10000000

static int A[N];

int main(){
    int i;
    double temp=0.0;

    for (i=0 ; i<N ; i++){
        temp = A[i]*A[i];
    }

    return 0;
}

2. The following is cachegrind's output:

3. The following is Streamline's output:


0 Peter Harris over 11 years ago
Have you checked the disassembly that the compiler generates for your benchmark? There is a good chance that the compiler spots that it is "totally pointless" because you never read the result and hence optimizes out most of the behavior. Try ...
#include <stdio.h>

#define N 10000000

static volatile int A[N];

int main(){
    int i;
    int temp = 0;

    for (i=0 ; i<N ; i++){
        temp += A[i]*A[i];
    }

    return temp;
}

HTH,
Pete

0 Seong Jin Cho over 11 years ago in reply to Peter Harris

Dear Peter,

I really appreciate your answer.

Here is the assembly code for the micro-benchmark program; the objdump output is the same.

  .syntax unified
  .arch armv7-a
  .eabi_attribute 27, 3
  .eabi_attribute 28, 1
  .fpu vfpv3-d16
  .eabi_attribute 20, 1
  .eabi_attribute 21, 1
  .eabi_attribute 23, 3
  .eabi_attribute 24, 1
  .eabi_attribute 25, 1
  .eabi_attribute 26, 2
  .eabi_attribute 30, 2
  .eabi_attribute 34, 1
  .eabi_attribute 18, 4
  .thumb
  .file "test_arm.c"
  .section .text.startup,"ax",%progbits
  .align 2
  .global main
  .thumb
  .thumb_func
  .type main, %function


main:
  @ args = 0, pretend = 0, frame = 0
  @ frame_needed = 0, uses_anonymous_args = 0
  @ link register save eliminated.
  push {r4, r5}
  movw r4, #38528
  ldr r2, .L5
  movt r4, 152
  movs r3, #0
  ldr r5, .L5+4
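.L2:
  ldr r0, [r2, r3, lsl #2]   @ r0 = A[i] (first load)
  ldr r1, [r2, r3, lsl #2]   @ r1 = A[i] (second load of the same element)
  adds r3, r3, #1            @ i++
  cmp r3, r4                 @ compare i with the loop bound in r4
  add r1, r0, r1             @ r1 = r0 + r1
  str r1, [r5, #448]         @ store the sum at .LANCHOR1 + 448
  bne .L2                    @ repeat until i reaches the bound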
  movs r0, #0
  pop {r4, r5}
  bx lr
.L6:
  .align 2
.L5:
  .word .LANCHOR0
  .word .LANCHOR1
  .size main, .-main
  .comm B,80000000,8
  .bss
  .align 2
.LANCHOR0 = . + 0
.LANCHOR1 = . + 39999552
  .type A, %object
  .size A, 40000000
A:
  .space 40000000
  .type temp, %object
  .size temp, 4
temp:
  .space 4
  .ident "GCC: (crosstool-NG linaro-1.13.1-4.7-2013.03-20130313 - Linaro GCC 2013.03) 4.7.3 20130226 (prerelease)"
  .section .note.GNU-stack,"",%progbits