
Learn Profiling and Performance Tuning


Programming Pearls

Recently, I read some chapters of Programming Pearls about performance, which can be applied to my homework on prime generation.

Here are the performance optimization levels, from high level to low level:

  • System Structure
  • Algorithm and Data Structures
  • Code Tuning
  • System Software
  • Hardware

Algorithm

For algorithm design, it also provides some tips:

  • Save state to avoid recomputation
  • Preprocess information into data structures
  • Divide and conquer
  • Scanning algorithms

Code tuning

The rules are repeated here (I still do not understand some of them; I will update this post as my understanding grows). One thing to mention: not every rule can speed up your program (it may even slow it down); they are just aspects you can think about.

Space For Time Rules

  • Data Structure Augmentation:
    e.g. store the count of nodes in the left subtree of each binary search tree node to speed up the i-th element query.
  • Store Pre-computed Results
  • Caching (see the small sketch after this list)
  • Lazy Evaluation
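
To make the Caching rule a bit more concrete, here is a minimal sketch of my own (not code from the book or from my prime program): the result of a costly function is remembered in a small table keyed by its input, so repeated calls with the same argument cost only a lookup.

#include <stdio.h>

#define CACHE_SIZE 1024

static long cache[CACHE_SIZE];            /* 0 means "not computed yet" */

/* hypothetical costly function: sum of the divisors of x */
static long divisor_sum(int x) {
    long s = 0;
    for (int d = 1; d <= x; d++)
        if (x % d == 0)
            s += d;
    return s;
}

static long divisor_sum_cached(int x) {
    if (x >= 0 && x < CACHE_SIZE) {
        if (cache[x] == 0)
            cache[x] = divisor_sum(x);    /* compute once, reuse afterwards */
        return cache[x];
    }
    return divisor_sum(x);                /* fall back for inputs outside the table */
}

int main(void) {
    /* the second call hits the cache instead of recomputing */
    printf("%ld %ld\n", divisor_sum_cached(360), divisor_sum_cached(360));
    return 0;
}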

Time For Space Rules

  • Packing
  • Interpreter

Loop Rules

  • Code Motion Out of Loops
  • Combining Tests:
    An efficient loop should contain as few tests as possible, and preferably only one. The programmer should therefore try to simulate some of the exit conditions of the loop by other exit conditions (see the sentinel sketch after this list).
  • Loop Unrolling
  • Transfer-Driven Loop Unrolling?
  • Unconditional Branch Removal
  • Loop Fusion
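
As an illustration of the Combining Tests rule, here is a small sketch of my own (not from the book's code or the gist): the classic sentinel search, where the bounds test and the match test of a linear search are folded into a single test by planting the key just past the end of the data.

/* Sentinel search: the "i < n" test and the "a[i] == key" test become one test.
   Assumes the caller leaves one spare slot at a[n] for the sentinel. */
int find_with_sentinel(int *a, int n, int key) {
    a[n] = key;                  /* sentinel guarantees the loop terminates */
    int i = 0;
    while (a[i] != key)          /* single test per iteration */
        i++;
    return (i < n) ? i : -1;     /* -1: key was only found as the sentinel */
}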

Logic Rules

  • Exploit Algebraic Identities:
    If the evaluation of a logical expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate.
  • Short Circuiting Monotone Functions
  • Reordering Tests:
    Inexpensive and often successful tests should precede expensive and rarely successful tests.
  • Precompute Logical Functions:
    A logical function over a small finite domain can be replaced by a lookup in a table that represents the domain (see the sketch after this list).
  • Boolean Variable Elimination
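
As an illustration of Precompute Logical Functions (my own sketch, not the author's code): a predicate over a small domain, such as "is this character a decimal digit", can be replaced by a lookup in a 256-entry table.

#include <stdio.h>

/* Table-driven version of a small predicate over unsigned char.
   The table is filled once; every query afterwards is one array access. */
static unsigned char digit_table[256];

static void init_digit_table(void) {
    for (int c = '0'; c <= '9'; c++)
        digit_table[c] = 1;
}

int main(void) {
    init_digit_table();
    printf("%d %d\n", digit_table[(unsigned char)'7'],
           digit_table[(unsigned char)'x']);   /* prints: 1 0 */
    return 0;
}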

Procedure Rules

  • Collapsing Function Hierarchies
    e.g. inlining small functions or using macros
  • Exploit Common Cases
  • Coroutines
  • Transformations on Recursive Functions
  • Parallelism

Expression Rules

  • Compile-Time Initialization
  • Exploit Algebraic Identities
  • Common Subexpression elimination
  • Pairing Computation:
    If two similar expressions are evaluated together, we should evaluate them as a pair.
  • Exploit Word Parallelism (a small sketch follows)
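
For Exploit Word Parallelism, a minimal sketch of my own (not from the book): treat one 64-bit word as 64 boolean flags, so a single bitwise operation touches 64 flags at once. The same trick would let the has array of the sieve below shrink to a bitset.

#include <stdint.h>
#include <stdio.h>

/* 64 flags packed into each word; setting or testing a flag is one bit operation */
static void set_flag(uint64_t *bits, int i)       { bits[i >> 6] |= (uint64_t)1 << (i & 63); }
static int  get_flag(const uint64_t *bits, int i) { return (bits[i >> 6] >> (i & 63)) & 1; }

int main(void) {
    uint64_t bits[2] = {0, 0};   /* room for 128 flags */
    set_flag(bits, 5);
    set_flag(bits, 70);
    printf("%d %d %d\n", get_flag(bits, 5), get_flag(bits, 6), get_flag(bits, 70));  /* 1 0 1 */
    return 0;
}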

Tools Introduction

In order to tune your code, you need a profiler. I use a tool called gprof, which is simple but does not have many features.

I also used zoom, which is much more powerful. Here is an introduction video.

Sample

Now we come to an example of profiling and tuning our code. The program produces all prime numbers in the range [2, N]. We will implement the Sieve of Eratosthenes algorithm.

Code

The following is the main logic of the naive single-threaded version (where has is an auxiliary array that remembers which numbers are prime):

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=i) {
            has[j] = NOT_PRIME;
        }
    }
}
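
For reference, here is a self-contained sketch of the single-threaded sieve built around the loop above (my reconstruction with assumed MAY_BE_PRIME/NOT_PRIME values; the actual prime.c in the gist may differ in details):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define MAY_BE_PRIME 0   /* assumed encoding; the gist may use other values */
#define NOT_PRIME    1

int main(void) {
    int n = 100;
    char *has = calloc(n, 1);            /* all slots start as MAY_BE_PRIME */
    if (!has) return 1;

    for (int i = 2; i < sqrt(n) + 1; i++) {
        if (has[i] == MAY_BE_PRIME) {
            for (int j = i * i; j < n; j += i)
                has[j] = NOT_PRIME;       /* cross out multiples of i */
        }
    }

    for (int i = 2; i < n; i++)
        if (has[i] == MAY_BE_PRIME)
            printf("%d ", i);
    printf("\n");

    free(has);
    return 0;
}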

And the following is the naive multi-threaded version (each thread starts at its own offset and strides by num_cpus):

for (i = start; i < sqrt(n) + 1; i+=num_cpus) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=i) {
            has[j] = NOT_PRIME;
        }
    }
}
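
To show how start and num_cpus might reach each worker, here is a hedged sketch of the launch code (my own illustration; the prime_pthread.c in the gist may be organized differently):

#include <math.h>
#include <pthread.h>

#define MAY_BE_PRIME 0           /* assumed encoding */
#define NOT_PRIME    1

/* assumed globals, initialized elsewhere before the threads start */
static char *has;
static int n, num_cpus;          /* sketch assumes num_cpus <= 64 */

static void *outer_split_worker(void *arg) {
    int start = *(int *)arg;                       /* start = 2 + thread id */
    for (int i = start; i < sqrt(n) + 1; i += num_cpus) {
        if (has[i] == MAY_BE_PRIME)
            for (int j = i * i; j < n; j += i)
                has[j] = NOT_PRIME;
    }
    return NULL;
}

void run_outer_split(void) {
    pthread_t tid[64];
    int start[64];
    for (int t = 0; t < num_cpus; t++) {
        start[t] = 2 + t;                          /* each thread gets its own offset */
        pthread_create(&tid[t], NULL, outer_split_worker, &start[t]);
    }
    for (int t = 0; t < num_cpus; t++)
        pthread_join(tid[t], NULL);
}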

The full code can be found at the gist.

Compilation

gcc -g -pg -o timing.out timing.c prime.c prime_pthread.c
-lm -lpthread

Some notes:

  • Because we use the sqrt math function, we have to link with -lm
  • Because we use pthreads, we have to link with -lpthread
  • If we use gprof to collect profiling data, we have to compile with -pg
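
For completeness, the gprof workflow is roughly: run the -pg-instrumented binary once (it writes a gmon.out file into the current directory), then run gprof timing.out gmon.out to print the flat profile and the call graph.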

Run

The following results are approximate (each size is run 10 times successively); some detailed output is at the bottom of the post.

Average run time

input size   simple version   outer loop version
1000         12               150
10000        150              240
100000       2100             1500
1000000      16000            10000

Range

input size   simple version   outer loop version
1000         10-30            100-500
10000        120-180          200-400
100000       2000-2200        1200-2000
1000000      14000-20000      9000-11000

Tuning

Algorithm

As Programming Pearls says, code tuning should follow the order from high level to low level. For code as short as ours there is no system design to speak of, so we start from the algorithm.

Our first multi-threaded version of prime generation actually has some problems:

  • task split is not even: suppose we have two threads on two cores; the first will try to clear 2*2+2*j, 4*4+4*j, …, and the second will clear 3*3+3*j, 5*5+5*j, …. It is obvious that the first may have one more outer-loop iteration, and each of its inner loops will execute more times (2*2+2*j < n vs 3*3+3*j < n, etc.).
  • no synchronization for the shared array: considering the cost of synchronization, I did not add any synchronization at all. This means that even though some number (e.g. N = i*i+k*i) has already been marked as not prime, another thread may still use it to clear other elements (j = (i*i+k*i)*(i*i+k*i), …), which is in vain¹. This does not affect the correctness of the answer, but performs some duplicate work.
Solution one – inner loop

So I came up with another task-split strategy: split the job in the inner loop rather than the outer loop, i.e. make all threads share the numbers 2*2+2*j, then 3*3+3*j, …. In this case we still do not add any synchronization, but the work should be more even in theory.

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i; 
        j < n; j+=(i*num_cpus)) {
            has[j] = NOT_PRIME;
        }
    }
}
Solution two – lock (has some flaw)

Now we try to add a lock to make sure no thread starts the next outer-loop iteration before every thread has finished the current one, which should save the duplicate work; we can then see whether the locking cost overwhelms the cost of the duplicate work. We synchronize at the end of the outer loop because the inner loop does not actually share any mutable memory in a harmful way (each thread writes a disjoint set of elements). (Actually, the following has a flaw. Can you figure it out?)

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i;
             j < n; j += (i * num_cpus)) {
            has[j] = NOT_PRIME;
        }
    }
    wait_or_notify(num_of_cpus);
}

void wait_or_notify(int num_of_cpus) {
    pthread_mutex_lock(&work_mutex);
    reach_count++;
    if (reach_count != num_cpus) {
        if (reach_count != num_cpus
            || reach_count != 0) {
            pthread_cond_wait(&cond, &work_mutex);
        }
    } else {
        pthread_cond_broadcast(&cond);
        reach_count = 0;
    }
    pthread_mutex_unlock(&work_mutex);
}
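
As a side note, on platforms that provide POSIX barriers, the same "everyone finishes the current outer iteration before anyone starts the next" rendezvous can be expressed directly with a barrier. This is only a sketch of an alternative, not the code used for the measurements below:

#include <pthread.h>

static pthread_barrier_t iter_barrier;

/* call once, before the worker threads are created */
void init_iter_barrier(int num_cpus) {
    pthread_barrier_init(&iter_barrier, NULL, num_cpus);
}

/* call instead of wait_or_notify() at the end of each outer-loop iteration */
void wait_for_others(void) {
    pthread_barrier_wait(&iter_barrier);   /* returns only when all threads arrive */
}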
Solution three – trivial division

We will also try the trivial division algorithm, which can easily be changed into a multi-threaded version.

for (i = start + 2; i < n; i+=num_cpus) {
    // division in is_prime() to test
    if (!is_prime(i)) {
        has[i] = NOT_PRIME;
    }
}
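
The is_prime() used above is not shown in the snippet; a plausible trial-division version might look like this (my sketch, the gist's implementation may differ):

#include <math.h>

/* trial division up to sqrt(x); assumes x >= 2 */
static int is_prime(int x) {
    for (int d = 2; d <= (int)sqrt(x); d++)
        if (x % d == 0)
            return 0;
    return 1;
}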

Code tuning

Solution four – optimized locked

Actually, solution two is flawed. A simple count of lock operations shows that the lock is acquired 4*outer_loop_times times, which is obviously too many. We should synchronize only when we actually change some memory location, so we move the synchronization inside the if check to avoid useless lock/unlock pairs and make it faster.

for (i = 2; i < sqrt(n) + 1; i++) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i + thread_id * i; 
        j < n; j+=(i*num_cpus)) {
            has[j] = NOT_PRIME;
        }
		wait_or_notify(num_cpus);
	}
}
Solution five – optimize half

The following code is definitely faster and can use half the memory, but the comparison between the single-threaded version and the multi-threaded version stays almost the same. You can also choose to save some memory by declaring the array (has) at half size so it only stores the state of the odd numbers, which of course sacrifices some speed (an index-mapping sketch follows the code).

for (i = 3; i < sqrt(n) + 1; i+=2) {
    if (has[i] == MAY_BE_PRIME) {
        for (j = i * i; j < n; j+=2*i) {
            has[j] = NOT_PRIME;
        }
    }
}
// output only 2 and the odd numbers that are still MAY_BE_PRIME
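
One possible index mapping for the half-size array (my own sketch; the prime_opt.c in the gist may use a different convention) stores only the odd numbers, so the odd number n lives in slot (n - 1) / 2:

/* odd number n  ->  slot (n - 1) / 2;   slot i  ->  odd number 2*i + 1 */
#define num_to_index(n) (((n) - 1) / 2)
#define index_to_num(i) (2 * (i) + 1)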
Solution six – loop unrolling

The loop-unrolling version is somewhat nasty and hurts readability; maybe it should be performed by your smart compiler. Actually, the following unrolling may not be the right way to do it, and the profiler still complains that the inner loop stalls instructions. (I would be glad to know how to do it properly in this case :) )

for (i = START_OPT; i < size; i+=2) {
    if (has[num_to_index(i)] == MAY_BE_PRIME) {
        for (j = i * i + gap * i * 2; j < n; 
        j = j + i * num_cpus * 2) {
            has[num_to_index(j)] = NOT_PRIME;
            j += 2 * num_cpus * i;
            if (j < n) {
                has[num_to_index(j)] = NOT_PRIME;
                // continue ...
            }
        }
    }
}

Solution 7 – openmp

In this trial, we use OpenMP to implement our solution, which is much simpler and sometimes even faster. (Detailed results can be seen at the bottom.)

void find_prime_for(void *arg) {
    int *odd = (int *)arg;
#pragma omp parallel for schedule(static, 1)
    for (int i = START_OPT; i < size; i+=2) {
        if (odd[num_to_index(i)] 
	        == MAY_BE_PRIME) {
            for (int j = i * i + i * 2;
             j < n; j = j + i * 2) {
                odd[num_to_index(j)] = NOT_PRIME;
            }
        }
    }
}
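
Two notes on this version: gcc only honors the pragma when the file is compiled with -fopenmp (without that flag the directive is silently ignored and the loop stays single-threaded), and schedule(static, 1) hands out iterations round-robin, one at a time, which mirrors the i += num_cpus interleaving of the hand-written versions.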

More

Some more thoughts

  • I wrote many versions of trial code in the same file, which certainly increases the code size; removing the unused code and then re-testing may improve speed as well.
  • In the optimize-half trial, I actually used only half of the array to remember which numbers are prime, and used helper functions to convert between the index and the real number. Although those functions are marked inline, the assembly shows the compiler still emits a function call (GCC does not inline when no optimization flag is set), so replacing them with a macro, or not invoking them on every access, may also help.

Full code in gist

Code Sample

  • prime.h: common definitions
  • prime.c: single-threaded version
  • prime_pthread.c: outer loop parallel, inner loop parallel, locked version, trivial division
  • prime_opt.c: optimize-half version, unrolling version
  • timing.c: code to time the different versions

Output results on my machine

The test program runs each size 10 times successively (which may make the average time look better than the real time of a single call, e.g. because of warm caches, but since we only compare the relative performance of the different versions, we decided to ignore it) and produces the following results.
I compile the program with no optimization flag, which may change the results dramatically; you can try it yourself.

Simple vs outer loop parallel

simple : multiple -- 1000
102 :1480 
simple : multiple -- 10000
1393 :2308 
simple : multiple -- 100000
20196 :14046 
simple : multiple -- 1000000
147677 :99026 

Simple vs inner loop parallel

simple : multiple -- 1000
112 :921 
simple : multiple -- 10000
1370 :3029 
simple : multiple -- 100000
20376 :14476 
simple : multiple -- 1000000
182251 :93732 

Simple vs Locked

simple : multiple -- 1000
96 :7626 
simple : multiple -- 10000
1437 :21255 
simple : multiple -- 100000
20246 :63372 
simple : multiple -- 1000000
182191 :170709 

Simple vs Locked tuning version

simple : multiple -- 1000
111 :	3977 
simple : multiple -- 10000
1356 :	8029 
simple : multiple -- 100000 
19249 :	28459 
simple : multiple -- 1000000
178521 :	130487 

Simple vs Trivial Division

simple : multiple -- 1000
109 :1728 
simple : multiple -- 10000
1307 :6658 
simple : multiple -- 100000
20918 :65661 
simple : multiple -- 1000000
131787 :815137 

Note: the following tests were run after changing the helper array from a VLA to malloc, so the results above should not be compared with the results below.

Simple vs inner loop vs opt

simple : multiple : opt -- 1000
160 :  1410 :  1087 
simple : multiple : opt -- 10000
1569 :  3332 :  2004 
simple : multiple : opt -- 100000
18179 :  17480 :  8706 
simple : multiple : opt -- 1000000
150815 :  114917 :  45050 
simple : multiple : opt -- 10000000
1747069 :  1359124 :  643227 

Simple vs inner loop vs loop unrolling

simple : multiple : opt -- 1000
169 : 1277 : 932 
simple : multiple : opt -- 10000
1000 : 3203 : 2393 
simple : multiple : opt -- 100000
20492 : 16931 : 11068 
simple : multiple : opt -- 1000000
188118 : 125661 : 62784 
simple : multiple : opt -- 10000000
1827757 : 1295141 : 547844 

Simple vs inner loop vs opt vs omp

simple : multiple : opt  : omp -- 1000
123 : 1362 : 1030 : 131 
simple : multiple : opt  : omp -- 10000
1347 : 2387 : 1944 : 7716 
simple : multiple : opt  : omp -- 100000
15587 : 11999 : 8371 : 6993 
simple : multiple : opt  : omp -- 1000000
186296 : 96155 : 54569 : 57271 
simple : multiple : opt  : omp -- 10000000
1836222 : 1469581 : 614977 : 621706 

Written with StackEdit.


  1. Now we have N = i*i+k*i; every number N*N + m*N can easily be rewritten in the form i*i + n*i, which will already be set in the round of i. ↩︎
