Why do cachegrind and Streamline report different cache miss counts?

I am studying cache effects using a simple micro-benchmark.

I think that if N is bigger than the cache size, then the cache misses on the first read of every cache line. (See 1.)

On my board (Arndale-5250) the cache line size is 64 bytes, so I expect about N/8 cache misses in total, and cachegrind shows that. (See 2.)

However, Streamline displays a different result: only 21,373 cache misses. (See 3.)

I suspect hardware prefetching, but I can't find any counter in Streamline that would let me check this.

I really don't understand why Streamline reports so many fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

1. Here is a simple micro-benchmark program.

  #include <stdio.h>

  #define N 10000000

  static int A[N];

  int main(){
      int i;
      double temp=0.0;

      /* read every element of A once, in order */
      for (i=0 ; i<N ; i++){
          temp = A[i]*A[i];
      }

      return 0;
  }

2. The following is cachegrind's output:


3. The following is Streamline's output: