Programming pearls
Recently, I read some chapters of Programming Pearls about performance, which can be applied to my homework on prime generation.
Here are the performance optimization levels it describes:
- System Structure
- Algorithm and Data Structures
- Code Tuning
- System Software
- Hardware
Algorithm
For algorithm design, it also provides some tips:
- Save state to avoid recomputation
- Preprocess information into data structures
- Divide and conquer
- Scanning algorithms
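To make the first tip concrete, here is a minimal C sketch (my own toy example, not the book's code) of saving state to avoid recomputation, caching Fibonacci values instead of recomputing them exponentially many times:

```c
#include <stdint.h>

/* Save state to avoid recomputation: cache previously
 * computed Fibonacci numbers instead of recursing twice. */
static uint64_t memo[94]; /* fib(93) is the largest value that fits in 64 bits */

uint64_t fib(int n) {
    if (n < 2)
        return (uint64_t)n;
    if (memo[n] != 0)
        return memo[n];          /* reuse saved state */
    return memo[n] = fib(n - 1) + fib(n - 2);
}
```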
Code tuning
The rules are repeated here (I still don't understand some of them; I will update this post as my understanding grows). One thing to note: not every rule will speed up your program (it may even slow it down); they are just aspects you can think about.
Space For Time Rules
- Data Structure Augmentation:
e.g. add a left-children count to each binary search tree node to speed up i-th element queries.
- Store Pre-computed Results
- Caching
- Lazy Evaluation
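The BST example under Data Structure Augmentation can be sketched like this (my own minimal version; `left_count` stores the size of the node's left subtree):

```c
#include <stddef.h>

/* Data structure augmentation: each node stores the size of its
 * left subtree, so the i-th smallest key is found in O(height)
 * instead of an O(n) in-order walk. */
struct node {
    int key;
    int left_count;            /* the augmentation */
    struct node *left, *right;
};

/* Return the node holding the i-th smallest key (0-based),
 * or NULL if i is out of range. */
struct node *select_ith(struct node *t, int i) {
    while (t) {
        if (i < t->left_count)
            t = t->left;
        else if (i == t->left_count)
            return t;
        else {
            i -= t->left_count + 1;
            t = t->right;
        }
    }
    return NULL;
}
```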
Time For Space Rules
- Packing
- Interpreter
Loop Rules
- Code Motion Out of Loops
- Combining Tests:
An efficient loop should contain as few tests as possible, preferably only one. The programmer should therefore try to simulate some of the exit conditions of the loop by other exit conditions.
- Loop Unrolling
- Transfer-Driven Loop Unrolling?
- Unconditional Branch Removal
- Loop Fusion
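As a sketch of Combining Tests (my own example, not from the sieve code below): the classic sentinel search merges the bounds test and the match test into a single test per iteration:

```c
/* Sentinel search: the loop body has one test instead of two
 * (no separate i < n check). The caller must ensure a has room
 * for one extra element at a[n]. */
int sentinel_search(int *a, int n, int key) {
    a[n] = key;                 /* plant the sentinel */
    int i = 0;
    while (a[i] != key)         /* one combined test */
        i++;
    return i < n ? i : -1;      /* -1 means not found */
}
```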
Logic Rules
- Exploit Algebraic Identities:
If the evaluation of a logical expression is costly, replace it with an algebraically equivalent expression that is cheaper to evaluate.
- Short-Circuiting Monotone Functions
- Reordering Tests:
Inexpensive and often successful tests should precede expensive and rarely successful tests.
- Precompute Logical Functions:
A logical function over a small finite domain can be replaced by a lookup in a table that represents the domain.
- Boolean Variable Elimination
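A minimal sketch of Precompute Logical Functions (my own example): a character predicate over the 256-value domain becomes one table lookup:

```c
#include <stdbool.h>

/* Precompute a logical function over a small finite domain:
 * replace repeated character comparisons with one table lookup. */
static bool is_vowel_table[256];

void init_vowel_table(void) {
    const char *v = "aeiouAEIOU";
    for (int i = 0; v[i]; i++)
        is_vowel_table[(unsigned char)v[i]] = true;
}

/* One array access instead of up to ten comparisons. */
bool is_vowel(char c) { return is_vowel_table[(unsigned char)c]; }
```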
Procedure Rules
- Collapsing Function Hierarchies:
e.g. inline functions and macros
- Exploit Common Cases
- Coroutines
- Transformations on Recursive Functions
- Parallelism
Expression Rules
- Compile-Time Initialization
- Exploit Algebraic Identities
- Common Subexpression elimination
- Pairing Computation:
If two similar expressions are evaluated together, we should evaluate them as a pair.
- Exploit Word Parallelism
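A toy sketch of Common Subexpression Elimination (optimizing compilers do this automatically at `-O1` and above, which matters because the tests below were compiled without optimization):

```c
/* Common subexpression elimination: (b*b - a*a) appears twice. */
int before(int a, int b, int c) {
    return (b * b - a * a) * c + (b * b - a * a);
}

/* After: the shared subexpression is computed once. */
int after(int a, int b, int c) {
    int t = b * b - a * a;
    return t * c + t;
}
```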
Tools Introduction
To tune your code, you need a profiler. I use a simple tool called gprof, which is easy to use but limited in functionality. I also used Zoom, which is much more powerful. Here is an introduction video.
Sample
Now we come to an example of profiling and tuning our code. The program produces all prime numbers in the range [2, N]. We will implement the Sieve of Eratosthenes algorithm.
Code
The following is the main logic of the naive single-threaded version (where `has` is an auxiliary array that remembers which numbers are prime):
for (i = 2; i < sqrt(n) + 1; i++) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i; j < n; j+=i) {
has[j] = NOT_PRIME;
}
}
}
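For reference, the excerpt above can be wrapped into a self-contained program; the `MAY_BE_PRIME`/`NOT_PRIME` values and the counting helper are my own guesses (the gist may define them differently), and I replace the per-iteration `sqrt(n)` test with an equivalent multiplication:

```c
#include <stdlib.h>

#define MAY_BE_PRIME 0   /* assumed values; the gist may differ */
#define NOT_PRIME    1

/* Count the primes in [2, n) with the Sieve of Eratosthenes. */
int count_primes(int n) {
    char *has = calloc(n, 1);          /* all MAY_BE_PRIME (0) initially */
    /* i * i < n is equivalent to i < sqrt(n) + 1 here,
     * without calling sqrt() on every iteration. */
    for (long long i = 2; i * i < n; i++)
        if (has[i] == MAY_BE_PRIME)
            for (long long j = i * i; j < n; j += i)
                has[j] = NOT_PRIME;
    int count = 0;
    for (int i = 2; i < n; i++)
        if (has[i] == MAY_BE_PRIME)
            count++;
    free(has);
    return count;
}
```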
And the following is the naive multi-threaded version:
for (i = start; i < sqrt(n) + 1; i+=num_cpus) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i; j < n; j+=i) {
has[j] = NOT_PRIME;
}
}
}
The full code is available in the gist.
Compilation
gcc -g -pg -o timing.out timing.c prime.c prime_pthread.c
-lm -lpthread
Some notes:
- Because we use the `sqrt` math function, we have to link with `-lm`
- Because we use `pthread`, we have to link with `-lpthread`
- To get profiling data with `gprof`, we have to compile with `-pg`
Run
The following results are approximate (each configuration was run 10 times successively); some detailed output is at the bottom of the post.
Average run time
input size | simple version | outer loop version |
---|---|---|
1000 | 12 | 150 |
10000 | 150 | 240 |
100000 | 2100 | 1500 |
1000000 | 16000 | 10000 |
Range
input size | simple version | outer loop version |
---|---|---|
1000 | 10-30 | 100-500 |
10000 | 120-180 | 200-400 |
100000 | 2000-2200 | 1200-2000 |
1000000 | 14000-20000 | 9000-11000 |
Tuning
Algorithm
As Programming Pearls says, tuning should proceed from high level to low level. For code this short, there is no system design to speak of, so we start from the algorithm.
Our first multi-threaded version of prime generation actually has some problems:
- The task split is uneven: suppose we have two threads on two cores; the first will try to set 2\*2+2\*j, 4\*4+4\*j, …, while the second will set 3\*3+3\*j, 5\*5+5\*j, …. Obviously the first thread may get one more outer loop iteration, and each of its outer loops executes more inner iterations (`2*2+2*j < n` vs `3*3+3*j < n`, etc.).
- No synchronization on the shared array: considering the cost of synchronization, I didn't add any at all. This means that even though some number (e.g. `N=i*i+k*i`) has already been marked not prime, another thread may still use it to mark other elements (`j=(i*i+k*i)*(i*i+k*i)`), which is in vain1. This does not affect the correctness of the answer, but it performs some duplicate work.
Solution one – inner loop
So I came up with another task-split strategy: split the job in the inner loop rather than the outer loop, i.e. make all threads share the numbers `2*2+2*j`, then `3*3+3*j`, and so on. In this case we still don't add any synchronization, but the work is more even in theory.
for (i = 2; i < sqrt(n) + 1; i++) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i + thread_id * i;
j < n; j+=(i*num_cpus)) {
has[j] = NOT_PRIME;
}
}
}
Solution two – lock (has some flaw)
Now we try adding a lock so that no thread starts the next outer loop before every thread has finished the current one, which saves the duplicate work; then we can see whether the lock cost overwhelms the cost of the duplicate work. We synchronize at the end of the outer loop because the inner loop does not actually share any mutable memory. (Actually, the following has a flaw; can you figure it out?)
for (i = 2; i < sqrt(n) + 1; i++) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i + thread_id * i;
j < n; j+=(i*num_cpus)) {
has[j] = NOT_PRIME;
}
}
wait_or_notify(num_cpus);
}
void wait_or_notify(int num_cpus) {
pthread_mutex_lock(&work_mutex);
reach_count++;
if (reach_count != num_cpus) {
if (reach_count != num_cpus
|| reach_count != 0) {
pthread_cond_wait(&cond, &work_mutex);
}
} else {
pthread_cond_broadcast(&cond);
reach_count = 0;
}
pthread_mutex_unlock(&work_mutex);
}
Solution three – trivial division
We will also try the trivial division algorithm, which can easily be changed into a multi-threaded version.
for (i = start + 2; i < n; i+=num_cpus) {
// division in is_prime() to test
if (!is_prime(i)) {
has[i] = NOT_PRIME;
}
}
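`is_prime()` isn't shown above; a minimal trial-division version (my own sketch, the gist's may differ) could look like:

```c
#include <stdbool.h>

/* Trial division: O(sqrt(n)) work per number, which is why this
 * whole approach is much slower than the sieve for large N. */
bool is_prime(int n) {
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    for (int d = 3; (long long)d * d <= n; d += 2)
        if (n % d == 0)
            return false;
    return true;
}
```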
Code tuning
Solution four – optimized locked
Actually, solution two is flawed. Simply counting the locks shows that the number of lock acquisitions is `4*outer_loop_times`, which is obviously too many. We should synchronize only when we actually change some memory location, so we move the synchronization inside the if check to avoid useless lock/unlock pairs and make it faster.
for (i = 2; i < sqrt(n) + 1; i++) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i + thread_id * i;
j < n; j+=(i*num_cpus)) {
has[j] = NOT_PRIME;
}
wait_or_notify(num_cpus);
}
}
Solution five – optimize half
The following code is definitely faster, though the comparison between the single-threaded and multi-threaded versions stays almost the same. You can additionally save memory by declaring the array (`has`) at half size so it only stores the state of odd numbers, which of course sacrifices some speed.
for (i = 3; i < sqrt(n) + 1; i+=2) {
if (has[i] == MAY_BE_PRIME) {
for (j = i * i; j < n; j+=2*i) {
has[j] = NOT_PRIME;
}
}
}
// output only 2 and the odd numbers which are MAY_BE_PRIME
Solution six – loop unrolling
The loop-unrolled version is somewhat nasty and hurts readability; it should probably be left to your smart compiler. Actually, the following unrolling may not be the right way to do it, and the profiler still complains that the inner loop stalls instructions. (I would be glad to hear how to do it properly in this case :) )
for (i = START_OPT; i < size; i+=2) {
if (has[num_to_index(i)] == MAY_BE_PRIME) {
for (j = i * i + gap * i * 2; j < n;
j = j + i * num_cpus * 2) {
has[num_to_index(j)] = NOT_PRIME;
j += 2 * num_cpus * i;
if (j < n) {
has[num_to_index(j)] = NOT_PRIME;
// continue ...
}
}
}
}
Solution seven – OpenMP
In this trial, we use the OpenMP library to implement our solution, which is much simpler and sometimes even faster. (Detailed results are at the bottom.)
void find_prime_for(void *arg) {
int *odd = (int *)arg;
#pragma omp parallel for schedule(static, 1)
for (int i = START_OPT; i < size; i+=2) {
if (odd[num_to_index(i)]
== MAY_BE_PRIME) {
for (int j = i * i + i * 2;
j < n; j = j + i * 2) {
odd[num_to_index(j)] = NOT_PRIME;
}
}
}
}
More
Some more thoughts
- I wrote many versions of trial code in the same file, which definitely increases the code size; removing the unused code before testing might also improve speed.
- In the "optimize half" trial, I used only half of the array to remember which numbers are prime, with helper functions to convert between index and real number. Although those functions are marked `inline`, the assembly shows the compiler still emits a function call (GCC does no inlining when no optimization flag is set), so replacing them with a macro, or not invoking them on every access, may also help.
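To make the last point concrete, the index/number conversion could be a macro instead of an inline function; `NUM_TO_INDEX` and the odd-only mapping below are my own sketch, not necessarily the gist's helpers:

```c
/* Odd-only storage: has[k] holds the state of number 2*k + 1, so
 * the array needs only half the memory. A macro guarantees no
 * call overhead even at -O0, where GCC ignores the inline hint. */
#define NUM_TO_INDEX(num)  ((num) / 2)
#define INDEX_TO_NUM(idx)  (2 * (idx) + 1)
```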
Full code in gist
- prime.h: some common definition
- prime.c: single thread version
- prime_pthread.c: outer loop parallel, inner loop parallel, locked version, trivial division
- prime_opt.c: optimize half of number, unrolling version
- timing.c: code to timing different version
Output results on my machine
The test program runs 10 times with the same size successively (which may make the average time look better than the real time of a single call, but since we only compare the relative performance of different versions, we decided to ignore this) and produces the following results.
I compiled the program with no optimization flag, which may change the results drastically; you can try it yourself.
Simple vs outer loop parallel
simple : multiple -- 1000
102 :1480
simple : multiple -- 10000
1393 :2308
simple : multiple -- 100000
20196 :14046
simple : multiple -- 1000000
147677 :99026
Simple vs inner loop parallel
simple : multiple -- 1000
112 :921
simple : multiple -- 10000
1370 :3029
simple : multiple -- 100000
20376 :14476
simple : multiple -- 1000000
182251 :93732
Simple vs Locked
simple : multiple -- 1000
96 :7626
simple : multiple -- 10000
1437 :21255
simple : multiple -- 100000
20246 :63372
simple : multiple -- 1000000
182191 :170709
Simple vs Locked tuning version
simple : multiple -- 1000
111 : 3977
simple : multiple -- 10000
1356 : 8029
simple : multiple -- 100000
19249 : 28459
simple : multiple -- 1000000
178521 : 130487
Simple vs Trivial Division
simple : multiple -- 1000
109 :1728
simple : multiple -- 10000
1307 :6658
simple : multiple -- 100000
20918 :65661
simple : multiple -- 1000000
131787 :815137
Note: the following tests were run after changing the helper array from a VLA to `malloc`, so the results above should not be compared with the following results.
Simple vs inner loop vs opt
simple : multiple : opt -- 1000
160 : 1410 : 1087
simple : multiple : opt -- 10000
1569 : 3332 : 2004
simple : multiple : opt -- 100000
18179 : 17480 : 8706
simple : multiple : opt -- 1000000
150815 : 114917 : 45050
simple : multiple : opt -- 10000000
1747069 : 1359124 : 643227
Simple vs inner loop vs loop unrolling
simple : multiple : opt -- 1000
169 : 1277 : 932
simple : multiple : opt -- 10000
1000 : 3203 : 2393
simple : multiple : opt -- 100000
20492 : 16931 : 11068
simple : multiple : opt -- 1000000
188118 : 125661 : 62784
simple : multiple : opt -- 10000000
1827757 : 1295141 : 547844
Simple vs inner loop vs opt vs omp
simple : multiple : opt : omp -- 1000
123 : 1362 : 1030 : 131
simple : multiple : opt : omp -- 10000
1347 : 2387 : 1944 : 7716
simple : multiple : opt : omp -- 100000
15587 : 11999 : 8371 : 6993
simple : multiple : opt : omp -- 1000000
186296 : 96155 : 54569 : 57271
simple : multiple : opt : omp -- 10000000
1836222 : 1469581 : 614977 : 621706
Written with StackEdit.
Once we have `N=i*i+k*i`, every number `N*N+m*i` can easily be rewritten in the form `i*i + n*i`, which will already be set in the round for `i`. ↩︎