Confused on cache hits/ CPI when analyzing nested for loops

**voidflux** · Feb 9th, 2007, 04:01 PM

Hello everyone this isn't exactly asm but it does fall into the category. ANy help would be great here is my problem:

I had the following file that multiplies 2 matrices together, with 100 elements and puts it into a 3rd 100 element matrix.

Well I used Simple Scalar to anlayize the CPI generated by the code and it also records the miss rate on the level 1 cachee.

But i'm confused on my findings...Here is what I found:

Code:

/* This is a program for a trial on super
   scalar simulator , it multiples to matrices */
#include <stdio.h>

#define N 100

void main()
{
  int a1[N][N], a2[N][N], a3[N][N], i, j, k;
  for(i =0; i<N; i++)
    for(j =0; j<N; j++)
      {
	a1[i][j]= /*1;*/ i * j + 1;
	a2[i][j]= /*1;*/ i + j + 1;
	a3[i][j]= 0;
      }

   printf( "\nThe first matrix !!!\n");

   for(i =0; i<N; i++)
   {
     for(j =0; j<N; j++)
      {
	printf("\t %d", a1[i][j]);
      }
     printf("\n");
   }

   printf( "\nThe second matrix !!!\n");

   for(i =0; i<N; i++)
   {
     for(j =0; j<N; j++)
      {
	printf("\t %d", a2[i][j]);
      }
     printf("\n");
   }

   for(i =0; i<N; i++)
     for(j =0; j<N; j++)
        for(k =0; k<N; k++)
	  a3[i][j] += a1[i][k] * a2[k][j];


   
   printf( "\nThe result matrix !!!\n");

   for(i =0; i<N; i++)
   {
     for(j =0; j<N; j++)
      {
	printf("\t %d", a3[i][j]);
      }
     printf("\n");
   }

Code:

Runs:	         mxm.c	 mxm2.c	  mxm3.c
Loop Order:	I,j,k	   I,k,j	K,I,j
Sim_cpi:	0.5832	 0.5832	    0.5799
ill.miss_rate	 0.0097	  0.0097     0.0097
dl1.miss_rate  0.0047	0.0047	   0.0050
ul2.miss_rate  0.0013	0.0013   0.0012

Okay the Sim_CPI is the #of instructions per cycle
il1.miss_rate is the instruction level 1 miss rate
dl1.miss_rate is the data level 1 miss rate
ul2.miss_rate i'm not quite sure what this is, i'm using a program called Simple Scalar and its not in the documentation

But what i'm confused is, i know the greater the CPI the worse
performance. But i'm confused on why the ul2.miss_rate is lower on the
k,i,j order but the dl1.miss_rate is higher, but the CPI is still lower
out of all of them. I'm also confused on why the I,j,k and the I,K,J
are both the same with no change in CPI.

Any help would be great!

I was thinking it had to do with the way of ordering the loops has more
cache hits than the other. i.e. a cache miss means waiting longer for
the data to be fetched from the next level in the memory hierarchy

But why?
why the i,j,k and i,k,j are identical while k,i,j is more efficient.

Thanks!

Thread: Confused on cache hits/ CPI when analyzing nested for loops

Thread Tools

Display

Threaded View

Confused on cache hits/ CPI when analyzing nested for loops

Posting Permissions