HW 1 Problem 15

From CSI 702

Jump to: navigation, search

Contents

  1. Problem Description
  2. Expectation
  3. Code
  4. Profile and Timing
  5. Results
  6. Contributors

1. Problem Description

Why won’t this code run efficiently? Can it be fixed?

do i = 1, 9999 
    a(i) = a(10000 - i) * 5 
enddo

2. Expectation

The code is inefficient because it will generate cache capacity misses and because the for loop is twice as large as it needs to be, resulting in unnecessary loop overhead. The first element in the array will be loaded into L1 cache when it is being set during the first for loop iteration. As the loop progresses over the large array, the first element will eventually be evicted from L1 cache due to limited capacity. But then, during the final iteration of the for loop, the first element is accessed again (to update the last element of the array) and must be re-fetched from main memory. A similar situation occurs for other array elements except those near the middle of the array.

It is worth noting that in this algorithm, elements in the upper half of the array will be updated based solely on their own initial values. That is, if the initial array has values [1,1,1,1,...,1,1,1,1], then the updated array will have values [5,5,5,5,...25,25,25,25]. Since the final values in the upper half of the array are independent of the initial values in the lower half of the array, their values can be determined independently.

The algorithm can be improved by cutting the number of loop iterations in half. This reduces the amount of loop overhead but also can reduce the number L1 cache misses because each element of the array is accessed only during a single loop iteration, avoiding the possibility of the element being evicted from L1 cache and reloaded later.

  • Thomas Boggs


I think you'd have to split the array in to at least 3 parts. If its split into just 2 parts then when the first half is being assigned it still has to check the second half meaning that both halves are loaded in to memory at once. If you split it into 3rds then at most 2/3 of the array would be in memory at once.

  • Adam Cadien


It is not efficient because the opposite ends of the array are being accessed within the loop. If the entire array fits in cache, then it is not a big deal. It can be fixed by re-writing to compute the values in a different manner. We note after analysis that the result is the first 5000 elements are replaced with 5 times the later 4999 elements, and the latter 4999 elements are multiplied by 5. This can be re-written to update the right half on the array then copy that to the first half, and then update the right half again. While this is conceivably faster, the increased performance is likely to be minimal or worse. For it to be better the memory copy must be quicker than the expense of accessing the array from opposite ends in the original loop.

  • Doug Reitz


Building on the idea of coding an algorithm such that you are not trying to access opposite ends of an array, one could pre-compute a separate array b in place of a(10000-i), simplifying the algorithm into an array multiplied by a constant.

  • H. Le

3. Code

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <ctime>

// original algorithm
void func_a(float * array, const unsigned long N)
{
    for (unsigned long i = 0; i < N; i++)
    {
	array[i] = array[N - i - 1] * 5.f;
    }
    return;
}

// algorithm modified to avoid L1 capacity cache misses
void func_b(float * array, const unsigned long N)
{
    for (unsigned long i = 0; i < N / 2; i++)
    {
	// Update the elements on both sides of the array at the same time
	// to avoid them being evicted from L1 cache and later reloaded.
	array[i] = array[N - 1 - i] * 5.f;
	array[N - 1 - i] *= 25.f;
    }

    // In case N is odd.
    if (N % 2) array[N / 2] *= 25;

    return;    
}

int main (int argc, char **argv)
{
    const unsigned long N = 10000;
    const unsigned long numIterations = 10000000;
    unsigned long i, j;
    clock_t start, stop;
    bool doLoopA = false;
    bool doLoopB = false;
    
    for (int i = 1; i < argc; i++)
    {
	if (*argv[i] == 'a') doLoopA = true;
	if (*argv[i] == 'b') doLoopB = true;
    }
    
    float * array = new float[N];
    
    printf("Running array of %lu elements for %lu iterations...\n", N, numIterations);
    
    if (doLoopA)
    {
	for (i = 0; i < N; i++)
	    array[i] = 1.f;
	start = clock();
	for (j = 0; j < numIterations; j++)
	    func_a(array, N);
	stop = clock();    
	printf("(a) executed in average %.6f ms real time.\n", 1000.f * (stop - start) / CLOCKS_PER_SEC / numIterations);
    }
    
    if (doLoopB)
    {
	for (i = 0; i < N; i++)
	    array[i] = 1.f;
	start = clock();
	for (j = 0; j < numIterations; j++)
	    func_b(array, N);
	stop = clock();
	printf("(b) executed in average %.6f ms real time.\n", 1000.f * (stop - start) / CLOCKS_PER_SEC / numIterations);
    }
    
    delete [] array;
    
    return 0;
}
  • Thomas Boggs


Here I've split the array into 3 portions and run it through with an enormous array (500,000,000 ints long). BTW I stumbled upon a GCC error that causes statically allocated arrays to crash if they're larger than ~8 million bytes (2 million ints), this is easily worked around by dynamically allocating the arrays with malloc.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
  int LEN=500000000;
  if(argc<2)
    {
      printf("usage: %s mode\n Where mode = a or b.",argv[0]);
      return -1;
    }
   
  char opt=(char)*argv[1];

  if(opt=='a'){
    int* a;
    a=(int*)malloc(sizeof(int)*LEN);
    int i;
    for(i=0;i<LEN;i++){
      a[i]=a[LEN-1-i]+3;
    }

  }
  if(opt=='b'){
    int l3=(int)(LEN/3);
    int *a1,*a2,*a3;
    a1=(int*)malloc(sizeof(int)*l3);
    a2=(int*)malloc(sizeof(int)*l3);
    a3=(int*)malloc(sizeof(int)*l3);
    int i;
    for(i=0;i<l3;i++){
      a1[i]=a3[l3-1-i]+3;
    }
    for(i=0;i<l3;i++){
      a3[i]=a1[l3-1-i]+3;
    }
    for(i=0;i<l3;i++){
      a2[i]=a2[l3-1-i]+3;
    }

  }
  return 0;
}
  • Adam Cadien


Below is to test if updating the right side of the array then copying that to the left side, and then updating the right side again is more efficient.

   void Question15a()
   {
      for(Osal::UInt32 i=0; i<9999; ++i) a[i] = a[9999-i] * 5;      
   }
   void Question15b()
   {
      for(Osal::UInt32 i=4999; i<9999; ++i) a[i] = a[i] * 5;
      wmemcpy((wchar_t*)&a[0], (wchar_t*)&a[4999],4999*sizeof(double)/sizeof(wchar_t));
      for(Osal::UInt32 i=4999; i<9999; ++i) a[i] = a[i] * 5;
   }
   
   void Question15c()
    {
      for(Osal::UInt32 i=0; i<9995;)
      {
         a[i] = a[9999-i] * 5;
         a[++i] = a[9999-i] * 5;
         a[++i] = a[9999-i] * 5;
         a[++i] = a[9999-i] * 5;
      }
    }   
     
   void Question15()
   {
      Osal::UInt64 btime(0), atime(0);
      atime = Utils::timeFunction(Question15a,50000);
      btime = Utils::timeFunction(Question15b,50000);
      std::cout << "15a: " << atime << "  15b: "<< btime << "  ratio(b/a): " << (double)btime/			(double)atime << "\n";
   }
  • Doug Reitz


In Fortran:

 program problem15
 integer k
 do k = 1, 50000
   call algorithm1
   call algorithm2
 enddo
 end

Original code

 subroutine algorithm1()
 integer i, a(9999)
 do i = 1, 9999
   a(i) = a(10000 - i) * 5
 enddo
 end

Modified code

 subroutine algorithm2()
 integer i, j, a(9999), b(9999)
 do j = 9999, 1
   b(j) = j
 enddo
 do i = 1, 9999
   a(i) = b(i) * 5
 enddo
 end
  • H. Le

4. Profile and Timing

Following are the results of timing the original (a) and modified (b) algorithms:

iMac1:prob15 thomas$ time ./hw1p15 a
Running array of 10000 elements for 10000000 iterations...
(a) executed in average 0.006958 ms real time.

real	1m10.034s
user	1m9.402s
sys	0m0.180s
iMac1:prob15 thomas$ time ./hw1p15 b
Running array of 10000 elements for 10000000 iterations...
(b) executed in average 0.006182 ms real time.

real	1m2.170s
user	1m1.669s
sys	0m0.153s
iMac1:prob15 thomas$ 

Using the more efficient algorithm (b) resulted in a reduction in execution time of approximately 11% for a 10,000 element array. When the number of elements in the array was increased to 1,000,000, the reduction in execution time was 12% and for 10,000,000 elements, the reduction was over 18%.

  • Thomas Boggs


Running this code with the unix 'time' command demonstrates the difference between real time and user time. When code A is run the real time is 1 min, 40 seconds while the user time is just 2.5 seconds. That means that 97.5% of the program running time is spent loading from ram into cache. Meanwhile code B runs in 21 seconds for real time and 2.3 seconds of user time.

5. Results

Timing results clearly indicate that the original algorithm can be improved by reducing loop iterations and avoiding cache misses. Since the elements in the upper half of the array are accessed in reverse order (in address space) by the algorithm, there is still a question of whether it is possible to access memory in a way that would pre-fetch elements in the upper half of the array and further improve performance.

  • Thomas Boggs


By having an array that is larger than the size of your system's cache you can kill the run time of your program due to cache misses (as predicted in section 2). By setting the array size to be larger than my computer's L2 cache size it becomes obvious from the difference in user vs real time recorded that most of the program's running time is spent reading and writing from RAM (this is also visible in top). By just splitting that large array into 3rds the program runs 5x faster. Clearly splitting the array into even smaller portions will result in further increases in speed.

  • Adam Cadien


Updating the right halve first hypothesis results: With the array size of A as given the results were not conclusive. B was sometimes quicker than A with a ratio of 0.98 and other runs it was slower (1.15 and 1.08). A run with an increase in the size of A to 99999 resulted in B being 9.6 times slower. The conclusion after analyzing the results is that the memcpy function does not provide any savings over the cache misses while iterating through the loop. It is also not known how memcpy is implemented. The results seem to indicate that it may copy byte by byte making the process slower that a block transfer could potentially be. To determine if cache misses could offset the memcpy time the experiment was repeated with the size of A increased to 9999999. The results were similar with B taking slightly longer with a ration of 1.02.

Based on the results the next option would be to unroll the loop and do a few elements per pass through the loop. That was tried with 5 and 10 assignments per loop pass. The results were slightly slower than A. Optimization from A remains elusive so far. Next we turn to compiler optimization with -O3 to get byte alignment, loop unrolling, and other optimizations.

Here are the results with A and B prior to compiler -O3 optimization running each function 50000 times:

        15a: 1726632  15b: 2006713  ratio(b/a): 1.16221

Now with -O3 compiler optimization:

        15a: 129  15b: 491189  ratio(b/a): 3807.67

With compiler optimization level 3 the time for A drops dramatically to 129 usec for 50000 executions of the function. Based on this result, we conclude that the compiler has optimized out the for loop entirely since the array a was defined inside the loop and the value is not used. So it is likely the loop was completely optimized away. No we will modify A to make it appear as though we actually care what the value of A by creating it outside the function and referencing it in another function. Now we have some legitimate results:

        15a: 561792  15b: 797956  ratio(b/a): 1.42038

We note that the compiler optimization resulted in significant reduction in the execution time to approx. one third 561792/1726632 = 0.325 for A and 0.398 for B.

  • Doug Reitz


In Fortran the gprof results were as follows with optimization flag set to zero:

  • 10K iterations: .8 sec for algorithm 1, .74 sec for algorithm 2
  • 20K iterations: 1.6 sec for algorithm 1, 1.45 sec for algorithm 2
  • 30K iterations: 2.36 sec for algorithm 1, 2.29 sec for algorithm 2
  • 40K iterations: 3.32 sec for algorithm 1, 2.93 sec for algorithm 2
  • 50K iterations: 3.96 sec for algorithm 1, 3.78 sec for algorithm 2

On average, a modest speed increase of 7%.

  • H. Le

6. Contributors

  • Thomas Boggs
  • Adam Cadien
  • Doug Reitz
Personal tools