
Learn Profiling and Performance Tuning


Programming Pearls

Recently, I read some chapters of Programming Pearls about performance, which can be applied to my homework on prime generation.

Here are the performance optimization levels, from high level to low level:

  • System Structure
  • Algorithm and Data Structures
  • Code Tuning
  • System Software
  • Hardware

Algorithm

For algorithm design, it also provides some tips:

  • Save state to avoid recomputation
  • Preprocess information into data structures
  • Divide and conquer
  • Scanning algorithms

Code tuning

The rules are repeated here (I still do not understand some of them; I will update this post as my understanding grows). One thing to mention: not every rule can speed up your program (it may even slow it down); they are just aspects you can think about.

Space For Time Rules

  • Data Structure Augmentation:
    e.g. store the count of nodes in the left subtree of each binary search tree node to speed up the i-th element query.
  • Store Pre-computed Results
  • Caching (see the small sketch after this list)
  • Lazy Evaluation
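
To make the Caching rule a bit more concrete, here is a minimal sketch of my own (not code from the book or from my prime program): the result of a costly function is remembered in a small table keyed by its input, so repeated calls with the same argument cost only a lookup.

#include <stdio.h>

#define CACHE_SIZE 1024

static long cache[CACHE_SIZE];            /* 0 means "not computed yet" */

/* hypothetical costly function: sum of the divisors of x */
static long divisor_sum(int x) {
    long s = 0;
    for (int d = 1; d <= x; d++)
        if (x % d == 0)
            s += d;
    return s;
}

static long divisor_sum_cached(int x) {
    if (x >= 0 && x < CACHE_SIZE) {
        if (cache[x] == 0)
            cache[x] = divisor_sum(x);    /* compute once, reuse afterwards */
        return cache[x];
    }
    return divisor_sum(x);                /* fall back for inputs outside the table */
}

int main(void) {
    /* the second call hits the cache instead of recomputing */
    printf("%ld %ld\n", divisor_sum_cached(360), divisor_sum_cached(360));
    return 0;
}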

Time For Space Rules

  • Packing
  • Interpreter

Loop Rules

  • Code Motion Out of Loops
  • Combining Tests:
    An efficient loop should contain as few tests as possible, and preferably only one. The programmer should therefore try to simulate some of the exit conditions of the loop by other exit conditions (see the sentinel sketch after this list).
  • Loop Unrolling
  • Transfer-Driven Loop Unrolling?
  • Unconditional Branch Removal
  • Loop Fusion
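
As an illustration of the Combining Tests rule, here is a small sketch of my own (not from the book's code or the gist): the classic sentinel search, where the bounds test and the match test of a linear search are folded into a single test by planting the key just past the end of the data.

/* Sentinel search: the "i < n" test and the "a[i] == key" test become one test.
   Assumes the caller leaves one spare slot at a[n] for the sentinel. */
int find_with_sentinel(int *a, int n, int key) {
    a[n] = key;                  /* sentinel guarantees the loop terminates */
    int i = 0;
    while (a[i] != key)          /* single test per iteration */
        i++;
    return (i < n) ? i : -1;     /* -1: key was only found as the sentinel */
}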

Logic Rules

  • Exploit Algebraic Identities:
    If the evaluation of a logical expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate.
  • Short Circuiting Monotone Functions
  • Reordering Tests:
    Inexpensive and often successful tests should precede expensive and rarely successful tests.
  • Precompute Logical Functions:
    A logical function over a small finite domain can be replaced by a lookup in a table that represents the domain (see the sketch after this list).
  • Boolean Variable Elimination
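
As an illustration of Precompute Logical Functions (my own sketch, not the author's code): a predicate over a small domain, such as "is this character a decimal digit", can be replaced by a lookup in a 256-entry table.

#include <stdio.h>

/* Table-driven version of a small predicate over unsigned char.
   The table is filled once; every query afterwards is one array access. */
static unsigned char digit_table[256];

static void init_digit_table(void) {
    for (int c = '0'; c <= '9'; c++)
        digit_table[c] = 1;
}

int main(void) {
    init_digit_table();
    printf("%d %d\n", digit_table[(unsigned char)'7'],
           digit_table[(unsigned char)'x']);   /* prints: 1 0 */
    return 0;
}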

Procedure Rules

  • Collapsing Function Hierarchies
    e.g. inlining small functions or using macros
  • Exploit Common Cases
  • Coroutines
  • Transformations on Recursive Functions
  • Parallelism

Expression Rules

  • Compile-Time Initialization
  • Exploit Algebraic Identities
  • Common Subexpression elimination
  • Pairing Computation:
    If two similar expressions are evaluated together, we should evaluate them as a pair.
  • Exploit Word Parallelism (a small sketch follows)
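
For Exploit Word Parallelism, a minimal sketch of my own (not from the book): treat one 64-bit word as 64 boolean flags, so a single bitwise operation touches 64 flags at once. The same trick would let the has array of the sieve below shrink to a bitset.

#include <stdint.h>
#include <stdio.h>

/* 64 flags packed into each word; setting or testing a flag is one bit operation */
static void set_flag(uint64_t *bits, int i)       { bits[i >> 6] |= (uint64_t)1 << (i & 63); }
static int  get_flag(const uint64_t *bits, int i) { return (bits[i >> 6] >> (i & 63)) & 1; }

int main(void) {
    uint64_t bits[2] = {0, 0};   /* room for 128 flags */
    set_flag(bits, 5);
    set_flag(bits, 70);
    printf("%d %d %d\n", get_flag(bits, 5), get_flag(bits, 6), get_flag(bits, 70));  /* 1 0 1 */
    return 0;
}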

Tools Introduction

In order to tune your code, you need a profiler. I use a tool called gprof, which is simple but does not have many features.

I also used zoom, which is much more powerful. Here is an introduction video.

Sample

Now we come to an example of profiling and tuning our code. The program produces all prime numbers in the range [2, N]. We will implement the Sieve of Eratosthenes algorithm.

Code

The following is the main logic of the naive single-threaded version (where has is an auxiliary array that remembers which numbers are prime):

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=i) {
            has[j] = NOT_PRIME;
        }
    }
}
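
For reference, here is a self-contained sketch of the single-threaded sieve built around the loop above (my reconstruction with assumed MAY_BE_PRIME/NOT_PRIME values; the actual prime.c in the gist may differ in details):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAY_BE_PRIME 0   /* assumed encoding; the gist may use other values */
#define NOT_PRIME    1

int main(void) {
    int n = 100;
    char *has = calloc(n, 1);            /* all slots start as MAY_BE_PRIME */
    if (!has) return 1;

    for (int i = 2; i < sqrt(n) + 1; i++) {
        if (has[i] == MAY_BE_PRIME) {
            for (int j = i * i; j < n; j += i)
                has[j] = NOT_PRIME;       /* cross out multiples of i */
        }
    }

    for (int i = 2; i < n; i++)
        if (has[i] == MAY_BE_PRIME)
            printf("%d ", i);
    printf("\n");

    free(has);
    return 0;
}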

And the following is the naive multi-threaded version (each thread starts at its own offset and strides by num_cpus):

for (i = start; i < sqrt(n) + 1; i+=num_cpus) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=i) {
            has[j] = NOT_PRIME;
        }
    }
}
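
To show how start and num_cpus might reach each worker, here is a hedged sketch of the launch code (my own illustration; the prime_pthread.c in the gist may be organized differently):

#include <math.h>
#include <pthread.h>

#define MAY_BE_PRIME 0           /* assumed encoding */
#define NOT_PRIME    1

/* assumed globals, initialized elsewhere before the threads start */
static char *has;
static int n, num_cpus;          /* sketch assumes num_cpus <= 64 */

static void *outer_split_worker(void *arg) {
    int start = *(int *)arg;                       /* start = 2 + thread id */
    for (int i = start; i < sqrt(n) + 1; i += num_cpus) {
        if (has[i] == MAY_BE_PRIME)
            for (int j = i * i; j < n; j += i)
                has[j] = NOT_PRIME;
    }
    return NULL;
}

void run_outer_split(void) {
    pthread_t tid[64];
    int start[64];
    for (int t = 0; t < num_cpus; t++) {
        start[t] = 2 + t;                          /* each thread gets its own offset */
        pthread_create(&tid[t], NULL, outer_split_worker, &start[t]);
    }
    for (int t = 0; t < num_cpus; t++)
        pthread_join(tid[t], NULL);
}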

The full code can be found at the gist.

Compilation

gcc -g -pg -o timing.out timing.c prime.c prime_pthread.c
-lm -lpthread

Some notes:

  • Because we use the sqrt math function, we have to link with -lm
  • Because we use pthreads, we have to link with -lpthread
  • If we use gprof to collect profiling data, we have to compile with -pg
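
For completeness, the gprof workflow is roughly: run the -pg-instrumented binary once (it writes a gmon.out file into the current directory), then run gprof timing.out gmon.out to print the flat profile and the call graph.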

Run

The following results are approximate (each size is run 10 times successively); some detailed output is at the bottom of the post.

Average run time

input size   simple version   outer loop version
1000         12               150
10000        150              240
100000       2100             1500
1000000      16000            10000

Range

input size   simple version   outer loop version
1000         10-30            100-500
10000        120-180          200-400
100000       2000-2200        1200-2000
1000000      14000-20000      9000-11000

Tuning

Algorithm

As Programming Pearls says, code tuning should follow the order from high level to low level. For code as short as ours there is no system design to speak of, so we start from the algorithm.

Our first multi-threaded version of prime generation actually has some problems:

  • task split is not even: suppose we have two threads on two cores; the first will try to clear 2*2+2*j, 4*4+4*j, …, and the second will clear 3*3+3*j, 5*5+5*j, …. It is obvious that the first may have one more outer-loop iteration, and each of its inner loops will execute more times (2*2+2*j < n vs 3*3+3*j < n, etc.).
  • no synchronization for the shared array: considering the cost of synchronization, I did not add any synchronization at all. This means that even though some number (e.g. N = i*i+k*i) has already been marked as not prime, another thread may still use it to clear other elements (j = (i*i+k*i)*(i*i+k*i), …), which is in vain¹. This does not affect the correctness of the answer, but performs some duplicate work.
Solution one – inner loop

So I came up with another task-split strategy: split the job in the inner loop rather than the outer loop, i.e. make all threads share the numbers 2*2+2*j, then 3*3+3*j, …. In this case we still do not add any synchronization, but the work should be more even in theory.

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i; 
        j < n; j+=(i*num_cpus)) {
            has[j] = NOT_PRIME;
        }
    }
}
Solution two – lock (has some flaw)

Now we try to add a lock to make sure no thread starts the next outer-loop iteration before every thread has finished the current one, which should save the duplicate work; we can then see whether the locking cost overwhelms the cost of the duplicate work. We synchronize at the end of the outer loop because the inner loop does not actually share any mutable memory in a harmful way (each thread writes a disjoint set of elements). (Actually, the following has a flaw. Can you figure it out?)

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i;
             j < n; j += (i * num_cpus)) {
            has[j] = NOT_PRIME;
        }
    }
    wait_or_notify(num_of_cpus);
}

void wait_or_notify(int num_of_cpus) {
    pthread_mutex_lock(&work_mutex);
    reach_count++;
    if (reach_count != num_cpus) {
        if (reach_count != num_cpus
            || reach_count != 0) {
            pthread_cond_wait(&cond, &work_mutex);
        }
    } else {
        pthread_cond_broadcast(&cond);
        reach_count = 0;
    }
    pthread_mutex_unlock(&work_mutex);
}
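
As a side note, on platforms that provide POSIX barriers, the same "everyone finishes the current outer iteration before anyone starts the next" rendezvous can be expressed directly with a barrier. This is only a sketch of an alternative, not the code used for the measurements below:

#include <pthread.h>

static pthread_barrier_t iter_barrier;

/* call once, before the worker threads are created */
void init_iter_barrier(int num_cpus) {
    pthread_barrier_init(&iter_barrier, NULL, num_cpus);
}

/* call instead of wait_or_notify() at the end of each outer-loop iteration */
void wait_for_others(void) {
    pthread_barrier_wait(&iter_barrier);   /* returns only when all threads arrive */
}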
Solution three – trivial division

We will also try the trivial division algorithm, which can easily be changed into a multi-threaded version.

for (i = start + 2; i < n; i+=num_cpus) {
    // division in is_prime() to test
    if (!is_prime(i)) {
        has[i] = NOT_PRIME;
    }
}
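
The is_prime() used above is not shown in the snippet; a plausible trial-division version might look like this (my sketch, the gist's implementation may differ):

#include <math.h>

/* trial division up to sqrt(x); assumes x >= 2 */
static int is_prime(int x) {
    for (int d = 2; d <= (int)sqrt(x); d++)
        if (x % d == 0)
            return 0;
    return 1;
}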

Code tuning

Solution four – optimized locked

Actually, solution two is flawed. A simple count of lock operations shows that the lock is acquired 4*outer_loop_times times, which is obviously too many. We should synchronize only when we actually change some memory location, so we move the synchronization inside the if check to avoid useless lock/unlock pairs and make it faster.

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i; 
        j < n; j+=(i*num_cpus)) {
            has[j] = NOT_PRIME;
        }
		wait_or_notify(num_cpus);
	}
}
Solution five – optimize half

The following code is definitely faster and can use half the memory, but the comparison between the single-threaded version and the multi-threaded version stays almost the same. You can also choose to save some memory by declaring the array (has) at half size so it only stores the state of the odd numbers, which of course sacrifices some speed (an index-mapping sketch follows the code).

for (i = 3; i < sqrt(n) + 1; i+=2) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=2*i) {
            has[j] = NOT_PRIME;
        }
    }
}
// output only 2 and the odd numbers that are still MAY_BE_PRIME
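
One possible index mapping for the half-size array (my own sketch; the prime_opt.c in the gist may use a different convention) stores only the odd numbers, so the odd number n lives in slot (n - 1) / 2:

/* odd number n  ->  slot (n - 1) / 2;   slot i  ->  odd number 2*i + 1 */
#define num_to_index(n) (((n) - 1) / 2)
#define index_to_num(i) (2 * (i) + 1)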
Solution six – loop unrolling

The loop-unrolling version is somewhat nasty and hurts readability; maybe it should be performed by your smart compiler. Actually, the following unrolling may not be the right way to do it, and the profiler still complains that the inner loop stalls instructions. (I would be glad to know how to do it properly in this case :) )

for (i = START_OPT; i < size; i+=2) {
    if (has[num_to_index(i)] == MAY_BE_PRIME) {
        for (j = i * i + gap * i * 2; j < n; 
        j = j + i * num_cpus * 2) {
            has[num_to_index(j)] = NOT_PRIME;
            j += 2 * num_cpus * i;
            if (j < n) {
                has[num_to_index(j)] = NOT_PRIME;
                // continue ...
            }
        }
    }
}

Solution 7 – openmp

In this trial, we use OpenMP to implement our solution, which is much simpler and sometimes even faster. (Detailed results can be seen at the bottom.)

void find_prime_for(void *arg) {
    int *odd = (int *)arg;
#pragma omp parallel for schedule(static, 1)
    for (int i = START_OPT; i < size; i+=2) {
        if (odd[num_to_index(i)] 
	        == MAY_BE_PRIME) {
            for (int j = i * i + i * 2;
             j < n; j = j + i * 2) {
                odd[num_to_index(j)] = NOT_PRIME;
            }
        }
    }
}
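
Two notes on this version: gcc only honors the pragma when the file is compiled with -fopenmp (without that flag the directive is silently ignored and the loop stays single-threaded), and schedule(static, 1) hands out iterations round-robin, one at a time, which mirrors the i += num_cpus interleaving of the hand-written versions.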

More

Some more thoughts

  • I wrote many versions of trial code in the same file, which certainly increases the code size; removing the unused code and then re-testing may improve speed as well.
  • In the optimize-half trial, I actually used only half of the array to remember which numbers are prime, and used helper functions to convert between the index and the real number. Although those functions are marked inline, the assembly shows the compiler still emits a function call (GCC does not inline when no optimization flag is set), so replacing them with a macro, or not invoking them on every access, may also help.

Full code in gist

Code Sample

  • prime.h: common definitions
  • prime.c: single-threaded version
  • prime_pthread.c: outer loop parallel, inner loop parallel, locked version, trivial division
  • prime_opt.c: optimize-half version, unrolling version
  • timing.c: code to time the different versions

Output results on my machine

The test program runs each size 10 times successively (which may make the average time look better than the real time of a single call, e.g. because of warm caches, but since we only compare the relative performance of the different versions, we decided to ignore it) and produces the following results.
I compile the program with no optimization flag, which may change the results dramatically; you can try it yourself.

Simple vs outer loop parallel

simple : multiple -- 1000
102 :1480 
simple : multiple -- 10000
1393 :2308 
simple : multiple -- 100000
20196 :14046 
simple : multiple -- 1000000
147677 :99026 

Simple vs inner loop parallel

simple : multiple -- 1000
112 :921 
simple : multiple -- 10000
1370 :3029 
simple : multiple -- 100000
20376 :14476 
simple : multiple -- 1000000
182251 :93732 

Simple vs Locked

simple : multiple -- 1000
96 :7626 
simple : multiple -- 10000
1437 :21255 
simple : multiple -- 100000
20246 :63372 
simple : multiple -- 1000000
182191 :170709 

Simple vs Locked tuning version

simple : multiple -- 1000
111 :	3977 
simple : multiple -- 10000
1356 :	8029 
simple : multiple -- 100000 
19249 :	28459 
simple : multiple -- 1000000
178521 :	130487 

Simple vs Trivial Division

simple : multiple -- 1000
109 :1728 
simple : multiple -- 10000
1307 :6658 
simple : multiple -- 100000
20918 :65661 
simple : multiple -- 1000000
131787 :815137 

Note: the following tests were run after changing the helper array from a VLA to malloc, so the results above should not be compared with the results below.

Simple vs inner loop vs opt

simple : multiple : opt -- 1000
160 :  1410 :  1087 
simple : multiple : opt -- 10000
1569 :  3332 :  2004 
simple : multiple : opt -- 100000
18179 :  17480 :  8706 
simple : multiple : opt -- 1000000
150815 :  114917 :  45050 
simple : multiple : opt -- 10000000
1747069 :  1359124 :  643227 

Simple vs inner loop vs loop unrolling

simple : multiple : opt -- 1000
169 : 1277 : 932 
simple : multiple : opt -- 10000
1000 : 3203 : 2393 
simple : multiple : opt -- 100000
20492 : 16931 : 11068 
simple : multiple : opt -- 1000000
188118 : 125661 : 62784 
simple : multiple : opt -- 10000000
1827757 : 1295141 : 547844 

Simple vs inner loop vs opt vs omp

simple : multiple : opt  : omp -- 1000
123 : 1362 : 1030 : 131 
simple : multiple : opt  : omp -- 10000
1347 : 2387 : 1944 : 7716 
simple : multiple : opt  : omp -- 100000
15587 : 11999 : 8371 : 6993 
simple : multiple : opt  : omp -- 1000000
186296 : 96155 : 54569 : 57271 
simple : multiple : opt  : omp -- 10000000
1836222 : 1469581 : 614977 : 621706 

Written with StackEdit.


  1. Now we have N = i*i+k*i; every number N*N + m*N can easily be rewritten in the form i*i + n*i, which will already be set in the round of i. ↩︎
