Generating Mandelbrot images in C++ using multithreading. No speedup? - c++

So I posted a similar question to this earlier, but I didn't post enough code to get the help I needed. Even if I went back and added that code now, I don't think it would be noticed because the question is old and "answered". So here's my issue:
I'm trying to generate a section of the mandelbrot fractal. I can generate it fine, but when I add more cores, no matter how large the problem size is, the extra threads generate no speedup. I am completely new to multithreading and it's probably just something small I'm missing. Anyway, here are the functions that generate the fractal:
void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int numThreads) {
    using namespace std;
    vector<thread> threads (numThreads);
    int rowsPerThread = Y/numThreads;
    mutex m;
    for(int i=0; i<numThreads; i++) {
        threads[i] = thread ([&](){
            vector<int> row;
            for(int j=(i-1)*rowsPerThread; j<i*rowsPerThread; j++) {
                row = mandelbrot_row(j, X, Y);
                {
                    lock_guard<mutex> lock(m);
                    pixels[j] = row;
                }
            }
        });
    }
    for(int i=0; i<numThreads; i++) {
        threads[i].join();
    }
}
std::vector<int> mandelbrot_row(int rowNum, int topX, int topY) {
    std::vector<int> row (topX);
    for(int i=0; i<topX; i++) {
        row[i] = mandelbrotOne(i, rowNum, topX, topY);
    }
    return row;
}
int mandelbrotOne(int currX, int currY, int X, int Y) { //code adapted from http://en.wikipedia.org/wiki/Mandelbrot_set
    double x0 = convert(X, currX, true);
    double y0 = convert(Y, currY, false);
    double x = 0.0;
    double y = 0.0;
    double xtemp;
    int iteration = 0;
    int max_iteration = 255;
    while ( x*x + y*y < 2*2 && iteration < max_iteration) {
        xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        ++iteration;
    }
    return iteration;
}
mandelbrot_all is passed a vector to hold the pixels, the maximum X and Y of the vector, and the number of threads to use, which is taken from the command line when the program is run. It attempts to split the work by row among multiple threads. Unfortunately, it seems that even if that is what it's doing, it's not making it any faster. If you need more details, feel free to ask and I will do my best to provide them.
Thanks in advance for the help.
Edit: reserved vectors in advance
Edit 2: ran this code with problem size 9600x7200 on a quad-core laptop. It took an average of 36590000 cycles for one thread (over 5 runs) and 55142000 cycles for four threads.

Your code might appear to do parallel processing, but in practice it doesn't.
Basically, you are spending your time copying data around and queueing for memory allocator accesses.
Besides, you are using the unprotected i loop index as if there were nothing to it, which will feed your worker threads with random junk instead of beautiful slices of the image.
As usual, C++ is hiding these sad facts from you under a thick crust of syntactic sugar.
But the greatest flaw of your code is the algorithm itself, as you might see if you read further ahead.
Since this example seems a textbook case of parallel processing to me and I never saw an "educational" analysis of it, I will try one.
Functional analysis
You want to use all CPU cores to crunch pixels of the Mandelbrot set. This is a perfect case of parallelizable computation, since each pixel computation can be done independently.
So basically, if you have N cores on your machine, you should have exactly one thread per core, each doing 1/N of the processing.
Unfortunately, dividing the input data so that each processor ends up doing 1/N of the needed processing is not as obvious as it might seem.
A given pixel can take from 0 to 255 iterations to compute. "black" pixels are 255 times more costly than "white" ones.
So if you simply divide your picture into N equal sub-surfaces, chances are all of your processors will breeze through "white" areas except one that will crawl through a "black" area. As a result, the slowest area computation time will dominate and parallelization will gain practically nothing.
In real cases, this will not be as dramatic, but still a huge loss of computing power.
Load balancing
To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one.
That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
Ideally, the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.
Unfortunately, due to the chaotic nature of the Mandelbrot set, there is no practical way of predicting the computation time of a given area.
If we decide the chunks will be horizontal slices of the picture, sorting them in natural y order is clearly suboptimal. If that particular area is a kind of "white to black" gradient, the most costly lines will all be bunched at the end of the chunks list and you will end up computing the costliest bits last, which is the worst case for load balancing.
A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle input patterns at random.
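A minimal sketch of the random approach (not part of the original program, standard library only):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Build a randomly shuffled processing order for numChunks image slices.
std::vector<int> shuffled_chunks(int numChunks)
{
    std::vector<int> order(numChunks);
    std::iota(order.begin(), order.end(), 0);      // 0, 1, ..., numChunks-1
    std::mt19937 rng(std::random_device{}());
    std::shuffle(order.begin(), order.end(), rng); // randomize the job order
    return order;
}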
Here are two outputs of my proof of concept implementation:
Jobs are executed in reverse order (job 39 is the first, job 0 is the last).
Each line decodes as follows:
t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)
1) 40 jobs with butterfly ordering
job 0: t 1-1 b 162 e 174 d 12 // the 4 tasks finish within 5 ms from each other
job 1: t 0-0 b 156 e 176 d 20 //
job 2: t 2-2 b 155 e 173 d 18 //
job 3: t 3-3 b 154 e 174 d 20 //
job 4: t 1-1 b 141 e 162 d 21
job 5: t 2-2 b 137 e 155 d 18
job 6: t 0-0 b 136 e 156 d 20
job 7: t 3-3 b 133 e 154 d 21
job 8: t 1-1 b 117 e 141 d 24
job 9: t 0-0 b 116 e 136 d 20
job 10: t 2-2 b 115 e 137 d 22
job 11: t 3-3 b 113 e 133 d 20
job 12: t 0-0 b 99 e 116 d 17
job 13: t 1-1 b 99 e 117 d 18
job 14: t 2-2 b 96 e 115 d 19
job 15: t 3-3 b 95 e 113 d 18
job 16: t 0-0 b 83 e 99 d 16
job 17: t 3-3 b 80 e 95 d 15
job 18: t 2-2 b 77 e 96 d 19
job 19: t 1-1 b 72 e 99 d 27
job 20: t 3-3 b 69 e 80 d 11
job 21: t 0-0 b 68 e 83 d 15
job 22: t 2-2 b 63 e 77 d 14
job 23: t 1-1 b 56 e 72 d 16
job 24: t 3-3 b 54 e 69 d 15
job 25: t 0-0 b 53 e 68 d 15
job 26: t 2-2 b 48 e 63 d 15
job 27: t 0-0 b 41 e 53 d 12
job 28: t 3-3 b 40 e 54 d 14
job 29: t 1-1 b 36 e 56 d 20
job 30: t 3-3 b 29 e 40 d 11
job 31: t 2-2 b 29 e 48 d 19
job 32: t 0-0 b 23 e 41 d 18
job 33: t 1-1 b 18 e 36 d 18
job 34: t 2-2 b 16 e 29 d 13
job 35: t 3-3 b 15 e 29 d 14
job 36: t 2-2 b 0 e 16 d 16
job 37: t 3-3 b 0 e 15 d 15
job 38: t 1-1 b 0 e 18 d 18
job 39: t 0-0 b 0 e 23 d 23
You can see load balancing at work when a thread having processed a few small jobs will overtake another that took more time to process its own chunks.
2) 40 jobs with linear ordering
job 0: t 2-2 b 157 e 180 d 23 // last thread lags 17 ms behind first
job 1: t 1-1 b 154 e 175 d 21
job 2: t 3-3 b 150 e 171 d 21
job 3: t 0-0 b 143 e 163 d 20 // 1st thread ends
job 4: t 2-2 b 137 e 157 d 20
job 5: t 1-1 b 135 e 154 d 19
job 6: t 3-3 b 130 e 150 d 20
job 7: t 0-0 b 123 e 143 d 20
job 8: t 2-2 b 115 e 137 d 22
job 9: t 1-1 b 112 e 135 d 23
job 10: t 3-3 b 112 e 130 d 18
job 11: t 0-0 b 105 e 123 d 18
job 12: t 3-3 b 95 e 112 d 17
job 13: t 2-2 b 95 e 115 d 20
job 14: t 1-1 b 94 e 112 d 18
job 15: t 0-0 b 90 e 105 d 15
job 16: t 3-3 b 78 e 95 d 17
job 17: t 2-2 b 77 e 95 d 18
job 18: t 1-1 b 74 e 94 d 20
job 19: t 0-0 b 69 e 90 d 21
job 20: t 3-3 b 60 e 78 d 18
job 21: t 2-2 b 59 e 77 d 18
job 22: t 1-1 b 57 e 74 d 17
job 23: t 0-0 b 55 e 69 d 14
job 24: t 3-3 b 45 e 60 d 15
job 25: t 2-2 b 45 e 59 d 14
job 26: t 1-1 b 43 e 57 d 14
job 27: t 0-0 b 43 e 55 d 12
job 28: t 2-2 b 30 e 45 d 15
job 29: t 3-3 b 30 e 45 d 15
job 30: t 0-0 b 27 e 43 d 16
job 31: t 1-1 b 24 e 43 d 19
job 32: t 2-2 b 13 e 30 d 17
job 33: t 3-3 b 12 e 30 d 18
job 34: t 0-0 b 11 e 27 d 16
job 35: t 1-1 b 11 e 24 d 13
job 36: t 2-2 b 0 e 13 d 13
job 37: t 3-3 b 0 e 12 d 12
job 38: t 1-1 b 0 e 11 d 11
job 39: t 0-0 b 0 e 11 d 11
Here the costly chunks tend to bunch together at the end of the queue, hence a noticeable performance loss.
3) a run with only one job per core, with one to 4 cores activated
reported cores: 4
Master: start jobs 4 workers 1
job 0: t 0-0 b 410 e 590 d 180 // purely linear execution
job 1: t 0-0 b 255 e 409 d 154
job 2: t 0-0 b 127 e 255 d 128
job 3: t 0-0 b 0 e 127 d 127
Master: start jobs 4 workers 2 // gain factor : 1.6 out of theoretical 2
job 0: t 1-1 b 151 e 362 d 211
job 1: t 0-0 b 147 e 323 d 176
job 2: t 0-0 b 0 e 147 d 147
job 3: t 1-1 b 0 e 151 d 151
Master: start jobs 4 workers 3 // gain factor : 1.82 out of theoretical 3
job 0: t 0-0 b 142 e 324 d 182 // 4th packet is hurting the performance badly
job 1: t 2-2 b 0 e 158 d 158
job 2: t 1-1 b 0 e 160 d 160
job 3: t 0-0 b 0 e 142 d 142
Master: start jobs 4 workers 4 // gain factor : 3 out of theoretical 4
job 0: t 3-3 b 0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job 1: t 1-1 b 0 e 182 d 182 // 17 ms wasted
job 2: t 0-0 b 0 e 146 d 146 // 44 ms wasted
job 3: t 2-2 b 0 e 150 d 150 // 49 ms wasted
Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255!).
At any rate, if you don't implement some kind of load balancing, all the shiny multiprocessor code you might write will still yield poor to downright miserable performance.
Implementation
For the threads to work unhindered, they must be kept free from interferences from the outside world.
One such interference is the memory allocation. Each time you allocate even a byte of memory, you will queue for exclusive access to the global memory allocator (and waste a bit of CPU doing the allocation).
Also, creating worker tasks for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so it is better to have the workers permanently created and synchronized to compute successive images.
Lastly, there are the data copies. If you synchronize with the main program each time you're done computing a few points, you will again spend a good part of your time queueing for exclusive access to the result area. Besides, the useless copies of a sizeable amount of data will hurt performance even more.
The obvious solution is to dispense with the copies altogether and work on original data.
Design
You must provide your worker threads all they need to work unhindered. For that you need to:
determine the number of available cores on your system (see the sketch after this list)
pre-allocate all the memory needed
give access to a list of image chunks to each of your worker
launch exactly one thread per core and let them run free to do their job
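For the first point, the standard library can report the core count directly; a minimal sketch (note that hardware_concurrency() may legitimately return 0 when the count is unknown):

#include <thread>

unsigned core_count()
{
    unsigned n = std::thread::hardware_concurrency();
    return n ? n : 4; // count unknown: fall back to a guess
}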
Job queue
There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.
Basically, the queue can be computed as a whole at the start of an image generation. Workers will only have to read the jobs from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code around to implement job queues will be suboptimal and too complex for the job at hand.
We need two sync points:
let the workers wait for a new batch of jobs
let the master wait for a picture completion
Workers will wait until the queue length changes to a positive value.
They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).
The same mechanism is used to terminate the workers. Instead of finding a new batch of jobs, the poor workers will wake up to find an order to terminate.
The master waiting for a picture completion will be awoken by the worker that finishes processing the last job. This will be based on an atomic counter of the number of jobs to process.
This is how I implemented it:
#include <mutex>
#include <condition_variable>
#include <atomic>
using namespace std;

class synchro {
    friend class mandelbrot_calculator;

    mutex lock;              // queue lock
    condition_variable work; // blocks workers waiting for jobs/termination
    condition_variable done; // blocks master waiting for completion
    int pending;             // number of jobs in the queue
    atomic_int active;       // number of unprocessed jobs
    bool kill;               // poison pill for workers termination

    synchro ()
    {
        pending = 0;   // no job in queue
        kill = false;  // workers shall live (for now :) )
    }
    int worker_start(void)
    {
        unique_lock<mutex> waiter(lock);
        while (!pending && !kill) work.wait(waiter);
        return kill
            ? -1         // worker should die
            : --pending; // index of the job to process
    }
    void worker_done(void)
    {
        if (!--active)         // atomic decrement (exclusive with other workers)
            done.notify_one(); // last job processed: wake up master
    }
    void master_start(int jobs)
    {
        unique_lock<mutex> waiter(lock);
        pending = active = jobs;
        work.notify_all(); // wake up all workers to start jobs
    }
    void master_done(void)
    {
        unique_lock<mutex> waiter(lock);
        while (active) done.wait(waiter); // wait for workers to finish
    }
    void master_kill(void)
    {
        kill = true;
        work.notify_all(); // wake up all workers (to die)
    }
};
Putting all together:
class mandelbrot_calculator {
    int num_cores;
    int num_jobs;
    vector<thread> workers; // worker threads
    vector<job> jobs;       // job queue (job and viewport are elided types)
    synchro sync;           // synchronization helper

public:
    mandelbrot_calculator (int num_cores, int num_jobs)
        : num_cores(num_cores)
        , num_jobs (num_jobs )
    {
        // worker thread (id and core match the thread creation below;
        // in the full program they serve to set thread affinity)
        auto worker = [&](int id, int core)
        {
            for (;;)
            {
                int job = sync.worker_start(); // fetch next job
                if (job == -1) return;         // poison pill
                process (jobs[job]);           // we have exclusive access to this job
                sync.worker_done();            // signal end of picture to the master
            }
        };
        jobs.resize(num_jobs, job()); // computation windows
        workers.resize(num_cores);
        for (int i = 0; i != num_cores; i++)
            workers[i] = thread(worker, i, i%num_cores);
    }

    ~mandelbrot_calculator()
    {
        // kill the workers
        sync.master_kill();
        for (thread& worker : workers) worker.join();
    }

    void compute(const viewport & vp)
    {
        // prepare worker data
        function<void(int, int)> butterfly_jobs;
        butterfly_jobs = [&](int min, int max)
            // computes job windows in butterfly order
        {
            if (min > max) return;
            jobs[min].setup(vp, max, num_jobs);
            if (min == max) return;
            jobs[max].setup(vp, min, num_jobs);
            int mid = (min + max) / 2;
            butterfly_jobs(min + 1, mid    );
            butterfly_jobs(mid + 1, max - 1);
        };
        butterfly_jobs(0, num_jobs - 1);
        // launch workers
        sync.master_start(num_jobs);
        // wait for completion
        sync.master_done();
    }
};
Testing the concept
This code works pretty well on my 2-core / 4-thread Intel i3 @ 3.1 GHz, compiled with Microsoft Dev Studio 2013.
I use a bit of the set that has an average of 90 iterations / pixel, on a window of 1280x1024 pixels.
The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.
I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).
Tampering with the scheduler
My program forces the threads to run on one core each (using MSDN SetThreadAffinityMask).
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.
This is significant, but still the Win7 scheduler does a pretty good job when left alone.
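For reference, the pinning itself is a one-liner with the Win32 call mentioned above (a sketch; error handling omitted):

#include <windows.h>

// Pin the calling thread to logical processor `core` (0-based).
void pin_to_core(int core)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
}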
Synchronization overhead
Running the algorithm on a "white" window (outside the r < 2 area) gives a good idea of the system call overhead.
It takes about 7ms to compute this "white" area, compared with the 480 ms of a representative area.
Something like 1.5%, including both the synchronization and computation of the job queue. And this is doing a synchronization on a queue of 1024 jobs.
Utterly negligible, I would say. That might give food for thought to all the no-wait queue fanatics around.
Optimizing iterations
The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:
static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
    register double x = x0;
    register double y = y0;
    register double x2 = x * x;
    register double y2 = y * y;
    unsigned iteration = 0;
    const int max_iteration = 255;
    while (x2 + y2 < 4.0)
    {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2 + x0;
        x2 = x * x;
        y2 = y * y;
    }
    return (unsigned char)iteration;
}
net gain: +20% compared with the OP's method
(the register directives don't make a bit of difference, they are just there for decoration)
Killing the tasks after each computation
The benefit of leaving the workers alive is about 5% of the computation time.
Butterfly effect
On my test case, the "butterfly" order is doing really well, yielding more than 30% gain in extreme cases and routinely 10-15% due to "de-bunching" the bulkiest requests.

The problem in your code is that all threads capture and access the same i variable. This creates a race condition, and the results are wildly incorrect.
You need to pass it as an argument to the thread lambda, and also correct the ranges (the (i-1)*rowsPerThread lower bound will make your indexing go out of bounds).
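A minimal sketch of the fix: pass the index as a thread argument (so each thread gets its own copy) and start the range at i*rowsPerThread; with each thread owning its own rows, the mutex becomes unnecessary:

for (int i = 0; i < numThreads; i++) {
    threads[i] = std::thread([&pixels, X, Y, rowsPerThread](int idx) {
        for (int j = idx * rowsPerThread; j < (idx + 1) * rowsPerThread; j++)
            pixels[j] = mandelbrot_row(j, X, Y); // each row written by exactly one thread
    }, i); // i is passed by value here, not captured by reference
}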

Related

perft-function of chess engine is giving self-contradictory output

I am currently developing a chess engine in C++, and I am in the process of debugging my move generator. For this purpose, I wrote a simple perft() function:
int32_t Engine::perft(GameState game_state, int32_t depth)
{
    int32_t last_move_nodes = 0;
    int32_t all_nodes = 0;

    Timer timer;
    timer.start();

    int32_t output_depth = depth;

    if (depth == 0)
    {
        return 1;
    }

    std::vector<Move> legal_moves = generator.generate_legal_moves(game_state);
    for (Move move : legal_moves)
    {
        game_state.make_move(move);
        last_move_nodes = perft_no_print(game_state, depth - 1);
        all_nodes += last_move_nodes;
        std::cout << index_to_square_name(move.get_from_index()) << index_to_square_name(move.get_to_index()) << ": " << last_move_nodes << "\n";
        game_state.unmake_move(move);
    }

    std::cout << "\nDepth: " << output_depth << "\nTotal nodes: " << all_nodes << "\nTotal time: " << timer.get_milliseconds() << "ms/" << timer.get_milliseconds()/1000.0f << "s\n\n";
    return all_nodes;
}

int32_t Engine::perft_no_print(GameState game_state, int32_t depth)
{
    int32_t nodes = 0;

    if (depth == 0)
    {
        return 1;
    }

    std::vector<Move> legal_moves = generator.generate_legal_moves(game_state);
    for (Move move : legal_moves)
    {
        game_state.make_move(move);
        nodes += perft_no_print(game_state, depth - 1);
        game_state.unmake_move(move);
    }
    return nodes;
}
Its results for the initial chess position (FEN: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1) for depths 1 and 2 match the results of Stockfish's perft command, so I assume they are correct:
h2h3: 1
h2h4: 1
g2g3: 1
g2g4: 1
f2f3: 1
f2f4: 1
e2e3: 1
e2e4: 1
d2d3: 1
d2d4: 1
c2c3: 1
c2c4: 1
b2b3: 1
b2b4: 1
a2a3: 1
a2a4: 1
g1h3: 1
g1f3: 1
b1c3: 1
b1a3: 1
Depth: 1
Total nodes: 20
Total time: 1ms/0.001s
h2h3: 20
h2h4: 20
g2g3: 20
g2g4: 20
f2f3: 20
f2f4: 20
e2e3: 20
e2e4: 20
d2d3: 20
d2d4: 20
c2c3: 20
c2c4: 20
b2b3: 20
b2b4: 20
a2a3: 20
a2a4: 20
g1h3: 20
g1f3: 20
b1c3: 20
b1a3: 20
Depth: 2
Total nodes: 400
Total time: 1ms/0.001s
The results stop matching at depth 3, though:
Stockfish:
go perft 3
a2a3: 380
b2b3: 420
c2c3: 420
d2d3: 539
e2e3: 599
f2f3: 380
g2g3: 420
h2h3: 380
a2a4: 420
b2b4: 421
c2c4: 441
d2d4: 560
e2e4: 600
f2f4: 401
g2g4: 421
h2h4: 420
b1a3: 400
b1c3: 440
g1f3: 440
g1h3: 400
Nodes searched: 8902
My engine:
h2h3: 361
h2h4: 380
g2g3: 340
g2g4: 397
f2f3: 360
f2f4: 436
e2e3: 380
e2e4: 437
d2d3: 380
d2d4: 437
c2c3: 399
c2c4: 326
b2b3: 300
b2b4: 320
a2a3: 280
a2a4: 299
g1h3: 281
g1f3: 280
b1c3: 357
b1a3: 320
Depth: 3
Total nodes: 7070
Total time: 10ms/0.01s
I figured that my move generator was just buggy, and tried to track down the bugs by making a move the engine gives incorrect values for on the board and then calling perft() with depth = 2 on it to find out which moves are missing. But for all moves I tried this with, the engine suddenly starts to output the correct results I expected to get earlier!
Here is an example for the move a2a3:
When calling perft() on the initial position in stockfish, it calculates 380 subnodes for a2a3 at depth 3.
When calling perft() on the initial position in my engine, it calculates 280 subnodes for a2a3 at depth 3.
When calling perft() on the position you get after making the move a2a3 in the initial position in my engine, it calculates the correct number of total nodes at depth 2, 380:
h7h5: 19
h7h6: 19
g7g5: 19
g7g6: 19
f7f5: 19
f7f6: 19
e7e5: 19
e7e6: 19
d7d5: 19
d7d6: 19
c7c5: 19
c7c6: 19
b7b5: 19
b7b6: 19
a7a5: 19
a7a6: 19
g8h6: 19
g8f6: 19
b8c6: 19
b8a6: 19
Depth: 2
Total nodes: 380
Total time: 1ms/0.001s
If you have any idea what the problem could be here, please help me out. Thank you!
EDIT:
I discovered some interesting new facts that might help to solve the problem, but I don't know what to do with them:
For some reason, using std::sort() like this in perft():
std::sort(legal_moves.begin(), legal_moves.end(), [](auto first, auto second){ return first.get_from_index() % 8 > second.get_from_index() % 8; });
to sort the vector of legal moves causes the found number of total nodes for the initial position (for depth 3) to change from the wrong 7070 to the (also wrong) 7331.
When printing the game state after calling game_state.make_move() in perft(), it seems to have had no effect on the position bitboards (the other properties change like they are supposed to). This is very strange, because, in isolation, the make_move() method works just fine.
I'm unsure if you were able to pin down the issue, but from the limited information available in the question, the best I can assume (and something I faced myself earlier) is that there is a problem in your unmake_move() function when it comes to captures, since:
1. Your perft fails only at depth 3: this is when the first legal capture is possible; moves 1 and 2 can have no legal captures.
2. Your perft works fine at depth 1 in the position after a2a3, but not when it's searching at depth 3 from the start.
This probably means that your unmake_move() fails at depths greater than 1, where you need to restore some of the board's state that cannot be derived from just the move parameter you are passing in (e.g. the en passant square, castling rights, etc. before you made the move); a sketch of the idea follows.
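The usual remedy (hypothetical, minimal types just to illustrate; adapt the names to your engine) is to push everything the move alone cannot restore onto an undo stack in make_move() and pop it in unmake_move():

#include <cstdint>
#include <vector>

using Piece = int8_t;                  // 0 = empty (hypothetical encoding)
struct Move { int from, to; Piece moved; };

struct GameState {
    Piece board[64] = {};
    uint8_t castlingRights = 0xF;
    int enPassantSquare = -1;

    struct UndoInfo { Piece captured; uint8_t castling; int enPassant; };
    std::vector<UndoInfo> undoStack;

    void make_move(const Move& m) {
        // Save what the move alone cannot restore: the captured piece,
        // the castling rights and the en-passant square.
        undoStack.push_back({ board[m.to], castlingRights, enPassantSquare });
        board[m.to] = m.moved;
        board[m.from] = 0;
        // ... update castlingRights / enPassantSquare here ...
    }

    void unmake_move(const Move& m) {
        UndoInfo u = undoStack.back();
        undoStack.pop_back();
        board[m.from] = m.moved;
        board[m.to] = u.captured;      // restores the captured piece, if any
        castlingRights = u.castling;
        enPassantSquare = u.enPassant;
    }
};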
This is how you would like to debug your move generator using perft.
Given startpos as p1, generate perft(3) for your engine and sf. (you did that)
Now check any move that has a different node count; you picked a2a3. (you did that)
Given startpos + a2a3 as p2, generate perft(2) for your engine and sf. (you partially did this)
Now check any move that has a different node count in step 3. Let's say move x.
Given startpos + a2a3 + x as p3, generate perft(1) for your engine and sf.
Since that is only perft(1), by this time you will be able to figure out the wrong or missing move from your generator. Set up that last position p3 on the board and compare the wrong/missing moves from your engine against sf's perft(1) result.

Density of fractions between 2 given numbers

I'm trying to do some analysis over a simple Fraction class and I want some data to compare that type with doubles.
The problem
Right now I'm looking for some good way to get the density of Fractions between 2 numbers. A Fraction is basically 2 integers (e.g. pair<long, long>), and the density between s and t is the amount of representable numbers in that range. It needs to be exact, or a very good approximation, done in O(1) or very fast.
To make it a bit simpler, let's say I want all the numbers (not fractions) a/b between s and t, where 0 <= s <= a/b < t <= M, and 0 <= a,b <= M (b > 0, a and b are integers)
Example
If my fractions were of a data type which only counts to 6 (M = 6), and I wanted the density between 0 and 1, the answer would be 12. Those numbers are:
0, 1/6, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 5/6.
What I thought already
A very naive approach would be to cycle trough all the possible fractions, and count those which can't be simplified. Something like:
long fractionsIn(double s, double t){
    long density = 0;
    long M = LONG_MAX;
    for(int d = 1; d < floor(M/t); d++){
        for(int n = ceil(d*s); n < M; n++){
            if( gcd(n,d) == 1 )
                density++;
        }
    }
    return density;
}
But gcd() is very slow, so this doesn't work. I also tried doing some math, but I couldn't get to anything good.
Solution
Thanks to m69's answer, I made this code for Fraction = pair<Long,Long>:
//this should give the density of fractions between first and last, or less.
double fractionsIn(unsigned long long first, unsigned long long last){
    double pi = 3.141592653589793238462643383279502884;
    double max = LONG_MAX; //i can't use LONG_MAX directly
    double zeroToOne = max/pi * max/pi * 3; // = approx. amount of numbers in the Farey sequence of order LONG_MAX.
    double res = 0;
    if(first == 0){
        res = zeroToOne;
        first++;
    }
    for(double i = first; i < last; i++){
        res += zeroToOne/(i * (i+1)); // density in the interval i ~ i+1
        if(i == i+1)
            i = nextafter(i+1, last); //if this happens, i might not count some fractions, but i have no other choice
    }
    return floor(res);
}
The main change is nextafter, which is important with big numbers (1e17)
The result
As I explain at the beginning, I was trying to compare Fractions with double. Here is the result for Fraction = pair<Long,Long> (and here is how I got the density of doubles):
Density between:  0,1                 | 1,2              | 1e6,1e6+1   | 1e14,1e14+1 | 1e15-1,1e15 | 1e17-10,1e17 | 1e19-10000,1e19 | 1e19-1000,1e19
Doubles:          4607182418800017408 | 4503599627370496 | 8589934592  | 64          | 8           | 1            | 5               | 0
Fraction:         2.58584e+37         | 1.29292e+37      | 2.58584e+25 | 2.58584e+09 | 2.58584e+07 | 2585         | 1               | 0
Density between 0 and 1
If the integers with which you express the fractions are in the range 0~M, then the density of fractions between the values 0 (inclusive) and 1 (exclusive) is:
M: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0~(1): 1 2 4 6 10 12 18 22 28 32 42 46 58 64 72 80 96 102 120 128 140 150 172 180 200 212 230 242 270 278 308 ...
This is sequence A002088 on OEIS. If you scroll down to the formula section, you'll find information about how to approximate it, e.g.:
Φ(n) = (3 ÷ π²) × n² + O[n × (ln n)^(2/3) × (ln ln n)^(4/3)]
(Unfortunately, no more detail is given about the constants involved in the O[x] part. See discussion about the quality of the approximation below.)
Distribution across range
The interval from 0 to 1 contains half of the total number of unique fractions that can be expressed with numbers up to M; e.g. this is the distribution when M = 15 (i.e. 4-bit integers):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
for a total of 144 unique fractions. If you look at the sequence for different values of M, you'll see that the steps in this sequence converge:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 1 1
2: 2 1 1
3: 4 2 1 1
4: 6 3 1 1 1
5: 10 5 2 1 1 1
6: 12 6 2 1 1 1 1
7: 18 9 3 2 1 1 1 1
8: 22 11 4 2 1 1 1 1 1
9: 28 14 5 2 2 1 1 1 1 1
10: 32 16 5 3 2 1 1 1 1 1 1
11: 42 21 7 4 2 2 1 1 1 1 1 1
12: 46 23 8 4 2 2 1 1 1 1 1 1 1
13: 58 29 10 5 3 2 2 1 1 1 1 1 1 1
14: 64 32 11 5 4 2 2 1 1 1 1 1 1 1 1
15: 72 36 12 6 4 2 2 2 1 1 1 1 1 1 1 1
Not only is the density between 0 and 1 half of the total number of fractions, but the density between 1 and 2 is a quarter, and the density between 2 and 3 is close to a twelfth, and so on.
As the value of M increases, the distribution of fractions across the ranges 0-1, 1-2, 2-3 ... converges to:
1/2, 1/4, 1/12, 1/24, 1/40, 1/60, 1/84, 1/112, 1/144, 1/180, 1/220, 1/264 ...
This sequence can be calculated by starting with 1/2 and then:
0-1: 1/2 x 1/1 = 1/2
1-2: 1/2 x 1/2 = 1/4
2-3: 1/4 x 1/3 = 1/12
3-4: 1/12 x 2/4 = 1/24
4-5: 1/24 x 3/5 = 1/40
5-6: 1/40 x 4/6 = 1/60
6-7: 1/60 x 5/7 = 1/84
7-8: 1/84 x 6/8 = 1/112
8-9: 1/112 x 7/9 = 1/144 ...
You can of course calculate any of these values directly, without needing the steps in between:
0-1: 1/2
6-7: 1/2 x 1/6 x 1/7 = 1/84
(Also note that the second half of the distribution sequence consists of 1's; these are all the integers divided by 1.)
Approximating the density in given interval
Using the formulas provided on the OEIS page, you can calculate or approximate the density in the interval 0-1, and multiplied by 2 this is the total number of unique values that can be expressed as fractions.
Given two values s and t, you can then calculate and sum the densities in the intervals s ~ s+1, s+1 ~ s+2, ... t-1 ~ t, or use an interpolation to get a faster but less precise approximate value.
Example
Let's assume that we're using 10-bit integers, capable of expressing values from 0 to 1023. Using this table linked from the OEIS page, we find that the density between 0~1 is 318452, and the total number of fractions is 636904.
If we wanted to find the density in the interval s~t = 100~105:
100~101: 1/2 x 1/100 x 1/101 = 1/20200 ; 636904/20200 = 31.53
101~102: 1/2 x 1/101 x 1/102 = 1/20604 ; 636904/20604 = 30.91
102~103: 1/2 x 1/102 x 1/103 = 1/21012 ; 636904/21012 = 30.31
103~104: 1/2 x 1/103 x 1/104 = 1/21424 ; 636904/21424 = 29.73
104~105: 1/2 x 1/104 x 1/105 = 1/21840 ; 636904/21840 = 29.16
Rounding these values gives the sum:
32 + 31 + 30 + 30 + 29 = 152
A brute force algorithm gives this result:
32 + 32 + 30 + 28 + 28 = 150
So we're off by 1.33% for this low value of M and small interval with just 5 values. If we had used linear interpolation between the first and last value:
100~101: 31.53
104~105: 29.16
average: 30.345
total: 151.725 -> 152
we'd have arrived at the same value. For larger intervals, the sum of all the densities will probably be closer to the real value, because rounding errors will cancel each other out, but the results of linear interpolation will probably become less accurate. For ever larger values of M, the calculated densities should converge with the actual values.
Quality of approximation of Φ(n)
Using this simplified formula:
Φ(n) = (3 ÷ π²) × n²
the results are almost always smaller than the actual values, but they are within 1% for n ≥ 182, within 0.1% for n ≥ 1880 and within 0.01% for n ≥ 19494. I would suggest hard-coding the lower range (the first 50,000 values can be found here), and then using the simplified formula from the point where the approximation is good enough.
Here's a simple code example with the first 182 values of Φ(n) hard-coded. The approximation of the distribution sequence seems to add an error of a similar magnitude as the approximation of Φ(n), so it should be possible to get a decent approximation. The code simply iterates over every integer in the interval s~t and sums the fractions. To speed up the code and still get a good result, you should probably calculate the fractions at several points in the interval, and then use some sort of non-linear interpolation.
function fractions01(M) {
    var phi = [0,1,2,4,6,10,12,18,22,28,32,42,46,58,64,72,80,96,102,120,128,140,150,172,180,200,212,230,242,270,278,308,
               324,344,360,384,396,432,450,474,490,530,542,584,604,628,650,696,712,754,774,806,830,882,900,940,964,1000,
               1028,1086,1102,1162,1192,1228,1260,1308,1328,1394,1426,1470,1494,1564,1588,1660,1696,1736,1772,1832,1856,
               1934,1966,2020,2060,2142,2166,2230,2272,2328,2368,2456,2480,2552,2596,2656,2702,2774,2806,2902,2944,3004,
               3044,3144,3176,3278,3326,3374,3426,3532,3568,3676,3716,3788,3836,3948,3984,4072,4128,4200,4258,4354,4386,
               4496,4556,4636,4696,4796,4832,4958,5022,5106,5154,5284,5324,5432,5498,5570,5634,5770,5814,5952,6000,6092,
               6162,6282,6330,6442,6514,6598,6670,6818,6858,7008,7080,7176,7236,7356,7404,7560,7638,7742,7806,7938,7992,
               8154,8234,8314,8396,8562,8610,8766,8830,8938,9022,9194,9250,9370,9450,9566,9654,9832,9880,10060];
    if (M < 182) return phi[M];
    return Math.round(M * M * 0.30396355092701331433 + M / 4); // experimental; see below
}

function fractions(M, s, t) {
    var half = fractions01(M);
    var frac = (s == 0) ? half : 0;
    for (var i = (s == 0) ? 1 : s; i < t && i <= M; i++) {
        if (2 * i < M) {
            var f = Math.round(half / (i * (i + 1)));
            frac += (f < 2) ? 2 : f;
        }
        else ++frac;
    }
    return frac;
}

var M = 1023, s = 100, t = 105;
document.write(fractions(M, s, t));
Comparing the approximation of Φ(n) with the list of the 50,000 first values suggests that adding M÷4 is a workable substitute for the second part of the formula; I have not tested this for larger values of n, so use with caution.
(Graph: blue = simplified formula; red = improved simplified formula.)
Quality of approximation of distribution
Comparing the results for M=1023 with those of a brute-force algorithm, the errors are small in real terms, never more than -7 or +6, and above the interval 205~206 they are limited to -1 ~ +1. However, a large part of the range (57~1024) has fewer than 100 fractions per integer, and in the interval 171~1024 there are only 10 fractions or fewer per integer. This means that small errors and rounding errors of -1 or +1 can have a large impact on the result, e.g.:
interval: 241 ~ 250
fractions/integer: 6
approximation: 5
total: 50 (instead of 60)
To improve the results for intervals with few fractions per integer, I would suggest combining the method described above with a separate approach for the last part of the range:
Alternative method for last part of range
As already mentioned, and implemented in the code example, the second half of the range, M÷2 ~ M, has 1 fraction per integer. Also, the interval M÷3 ~ M÷2 has 2; the interval M÷4 ~ M÷3 has 4. This is of course the Φ(n) sequence again:
M/2 ~ M : 1
M/3 ~ M/2: 2
M/4 ~ M/3: 4
M/5 ~ M/4: 6
M/6 ~ M/5: 10
M/7 ~ M/6: 12
M/8 ~ M/7: 18
M/9 ~ M/8: 22
M/10 ~ M/9: 28
M/11 ~ M/10: 32
M/12 ~ M/11: 42
M/13 ~ M/12: 46
M/14 ~ M/13: 58
M/15 ~ M/14: 64
M/16 ~ M/15: 72
M/17 ~ M/16: 80
M/18 ~ M/17: 96
M/19 ~ M/18: 102 ...
Between these intervals, one integer can have a different number of fractions, depending on the exact value of M, e.g.:
interval fractions
202 ~ 203 10
203 ~ 204 10
204 ~ 205 9
205 ~ 206 6
206 ~ 207 6
The interval 204 ~ 205 lies on the edge between intervals, because M ÷ 5 = 204.6; it has 6 + 3 = 9 fractions because M modulo 5 is 3. If M had been 1022 or 1024 instead of 1023, it would have 8 or 10 fractions. (This example is straightforward because 5 is a prime; see below.)
Again, I would suggest using the hard-coded values for Φ(n) to calculate the number of fractions for the last part of the range. If you use the first 17 values as listed above, this covers the part of the range with fewer than 100 fractions per integer, so that would reduce the impact of rounding errors below 1%. The first 56 values would give you 0.1%, the first 182 values 0.01%.
Together with the values of Φ(n), you could hard-code the number of fractions of the edge intervals for each modulo value, e.g.:
modulo: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
M/ 2 1 2
M/ 3 2 3 4
M/ 4 4 5 5 6
M/ 5 6 7 8 9 10
M/ 6 10 11 11 11 11 12
M/ 7 12 13 14 15 16 17 18
M/ 8 18 19 19 20 20 21 21 22
M/ 9 22 23 24 24 25 26 26 27 28
M/10 28 29 29 30 30 30 30 31 31 32
M/11 32 33 34 35 36 37 38 39 40 41 42
M/12 42 43 43 43 43 44 44 45 45 45 45 46
M/13 46 47 48 49 50 51 52 53 54 55 56 57 58
M/14 58 59 59 60 60 61 61 61 61 62 62 63 63 64
M/15 64 65 66 66 67 67 67 68 69 69 69 70 70 71 72
M/16 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80
M/17 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
M/18 96 97 97 97 97 98 98 99 99 99 99 100 100 101 101 101 101 102
This is exactly the same as the sum of phi(k) for m <= k <= M, where phi(k) is the Euler totient function, with phi(0) = 1 (as defined by the problem). There is no known closed form for this sum. However, there are many known optimizations, as mentioned in the wiki link. This is known as the Totient Summatory Function on Wolfram. The same website also links to the series A002088 and provides a few asymptotic approximations.
The reasoning is this: consider the number of values of the form {1/M, 2/M, ...., (M-1)/M, M/M}. All those fractions that will be reducible to a smaller value will not be counted in phi(M) because they are not relatively prime. They will appear in the summation of another totient.
For example, phi(6) = 12 and you have 1 + phi(6), since you also count the 0.
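If an exact value of the sum is wanted rather than an asymptotic approximation, it can be computed with a standard totient sieve; a minimal C++ sketch (fine for M up to a few million, memory permitting):

#include <cstdint>
#include <vector>

// Sum of Euler's totient phi(k) for k = 1..M, via a sieve:
// start with phi[k] = k, then for each prime p apply phi[k] *= (1 - 1/p).
uint64_t totient_summatory(uint32_t M)
{
    std::vector<uint32_t> phi(M + 1);
    for (uint32_t i = 0; i <= M; i++) phi[i] = i;
    for (uint32_t p = 2; p <= M; p++) {
        if (phi[p] == p) {                // p untouched so far: p is prime
            for (uint32_t k = p; k <= M; k += p)
                phi[k] -= phi[k] / p;     // multiply by (1 - 1/p)
        }
    }
    uint64_t sum = 0;
    for (uint32_t k = 1; k <= M; k++) sum += phi[k];
    return sum;                           // e.g. totient_summatory(6) == 12
}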

How to deal with categorical values in a dataset to build models

I have a training dataframe dfTrain and the output of dfTrain.head() is shown below:
C0 C1 C2 C3 C4 C5 C6
0 1 73 Not in universe 0 0 0 Not in universe
1 2 58 Self-employed-not incorporated 4 34 0 Not in universe
2 3 18 Not in universe 0 0 0 High school
3 4 9 Not in universe 0 0 0 Not in universe
4 5 10 Not in universe 0 0 0 Not in universe
There are 38 features in total, both categorical and numerical. Ignoring C1 and scaling the numerical features, I am trying to build a logistic regression model. Since the dataframe has categorical features, I am creating another dataframe which has dummy variables.
X = pd.get_dummies(dfTrain)
X now has 160 features, many more than dfTrain had.
Then I pass X and y (where y is the target variable) to the logistic regression classifier:
modelLogistic = LogisticRegression(C=10**-2, class_weight = 'balanced')
modelLogistic.fit(X, y)
The reason to use class_weight = 'balanced' is that there are 17 classes in y and they are highly imbalanced.
My question is: is my approach correct? Am I missing anything?

How can I synchronize TSC across cores?

Using:
inline uint64_t rdtsc()
{
    uint32_t cycles_high;
    uint32_t cycles_low;
    asm volatile ("CPUID\n\t"
                  "RDTSC\n\t"
                  "mov %%edx, %0\n\t"
                  "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low) ::
                  "%rax", "%rbx", "%rcx", "%rdx");
    return ( ((uint64_t)cycles_high << 32) | cycles_low );
}
thread 1 running
while(globalIndex < COUNT)
{
    while(globalIndex %2 == 0 && globalIndex < COUNT)
        ;
    cycles[globalIndex][0] = rdtsc();
    cycles[globalIndex][1] = cpuToBindTo;
    __sync_add_and_fetch(&globalIndex,1);
}
thread 2 running
while(globalIndex < COUNT)
{
    while(globalIndex %2 == 1 && globalIndex < COUNT)
        ;
    cycles[globalIndex][0] = rdtsc();
    cycles[globalIndex][1] = cpuToBindTo;
    __sync_add_and_fetch(&globalIndex,1);
}
I am seeing:
CPU rdtsc() t1-t0
11 = 5023231563212740 990
03 = 5023231563213730 310
11 = 5023231563214040 990
03 = 5023231563215030 310
11 = 5023231563215340 990
03 = 5023231563216330 310
11 = 5023231563216640 990
03 = 5023231563217630 310
11 = 5023231563217940 990
03 = 5023231563218930 310
11 = 5023231563219240 990
03 = 5023231563220230 310
11 = 5023231563220540 990
03 = 5023231563221530 310
11 = 5023231563221840 990
03 = 5023231563222830 310
11 = 5023231563223140 990
03 = 5023231563224130 310
11 = 5023231563224440 990
03 = 5023231563225430 310
11 = 5023231563225740 990
03 = 5023231561739842 310
11 = 5023231561740152 990
03 = 5023231561741142 310
11 = 5023231561741452 12458
03 = 5023231561753910 458
11 = 5023231561754368 1154
03 = 5023231561755522 318
11 = 5023231561755840 982
03 = 5023231561756822 310
11 = 5023231561757132 990
03 = 5023231561758122 310
11 = 5023231561758432 990
03 = 5023231561759422 310
I'm not sure how I received a pong of 12458, but I was wondering why I was seeing 310-990-310 instead of 650-650-650. I thought the TSC was supposed to be synchronized across cores. My constant_tsc CPU flag is on.
What are you running this code on? TSC synchronization is supposed to be done in the OS/kernel and is hardware dependent. For instance, you might pass a flag like powernow-k8.tscsync=1 to the kernel boot parameters via your bootloader.
You need to search for the correct TSC synchronization method for your combination of OS and hardware. By and large, this entire thing is automated - I wouldn't be surprised if you're running on a custom kernel or non i686 hardware?
If you search on Google with the correct terms, you'll find a lot of resources such as mailing list discussions on this topic. For instance, here's one algorithm being discussed (though apparently it's not a good one). However, it's not something that userland developers should be worried about - this is arcane sorcery that only kernel devs need to worry their heads about.
Basically, it's the OS' job, at boot time, to synchronize the TSC counters between all the different processors and/or cores on an SMP machine, within a certain margin of error. If you're seeing numbers that are that wildly off, there's something wrong with the TSC sync and your time would be better spent finding out why your OS hasn't synced the TSCs correctly rather than trying to implement your own TSC sync algorithm.
Do you have a NUMA memory architecture? The global counter could be located in RAM that is a couple hops away for one of the CPUs and local for the other. You can test this by fixing your threads to cores on the same NUMA node.
EDIT: I was guessing this since the performance was CPU specific.
EDIT: As to synchronizing the TSC, I am not aware of an easy way, which is not to say that there isn't one! What would happen if you took core 1 as the reference clock, and then compared it to core 2? If you did that comparison many times and took the minimum, you might have a good approximation. This should handle the case when you get preempted in the middle of a comparison.
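A sketch of that idea (my assumption of how it could look, reusing the rdtsc() from the question; pinning each thread to its core, e.g. with pthread_setaffinity_np, is omitted for brevity):

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>

// Bounce a flag between a reference thread and the measuring thread.
// Over many rounds, min(t_local - t_remote) approximates the inter-core
// TSC offset plus the one-way communication latency; swapping the roles
// bounds the offset from the other side.
int64_t estimate_tsc_offset(int rounds)
{
    std::atomic<int> turn{0};
    std::atomic<uint64_t> t_ref{0};
    int64_t best = INT64_MAX;

    std::thread ref([&] {
        for (int i = 0; i < rounds; i++) {
            while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
            t_ref.store(rdtsc(), std::memory_order_release);
            turn.store(1, std::memory_order_release);
        }
    });
    for (int i = 0; i < rounds; i++) {
        while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
        int64_t delta = (int64_t)(rdtsc() - t_ref.load(std::memory_order_acquire));
        best = std::min(best, delta); // keep the minimum, as suggested above
        turn.store(0, std::memory_order_release);
    }
    ref.join();
    return best;
}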

Fair comparison of fork() Vs Thread [closed]

I was having a discussion about the relative cost of fork() vs. thread() for parallelization of a task.
We understand the basic differences between processes and threads:
Thread:
Easy to communicate between threads
Fast context switching.
Processes:
Fault tolerance.
Communicating with parent not a real problem (open a pipe)
Communication with other child processes hard
But we disagreed on the start-up cost of processes vs. threads.
So to test the theories I wrote the following code. My question: Is this a valid test of measuring the start-up cost, or am I missing something? Also, I would be interested in how each test performs on different platforms.
fork.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <unistd.h>
#include <sys/wait.h>   // for waitpid()
#include <iostream>
#include <stdlib.h>
#include <time.h>

extern "C" int threadStart(void* threadData)
{
    return 0;
}

int main(int argc,char* argv[])
{
    int threadCount = boost::lexical_cast<int>(argv[1]);
    std::vector<pid_t> data(threadCount);

    clock_t start = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        data[loop] = fork();
        if (data[loop] == -1)
        {
            std::cout << "Abort\n";
            exit(1);
        }
        if (data[loop] == 0)
        {
            exit(threadStart(NULL));
        }
    }
    clock_t middle = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        int result;
        waitpid(data[loop], &result, 0);
    }
    clock_t end = clock();

    std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";
}
Thread.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <iostream>
#include <pthread.h>
#include <stdlib.h>   // for exit()
#include <time.h>

extern "C" void* threadStart(void* threadData)
{
    return NULL;
}

int main(int argc,char* argv[])
{
    int threadCount = boost::lexical_cast<int>(argv[1]);
    std::vector<pthread_t> data(threadCount);

    clock_t start = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        if (pthread_create(&data[loop], NULL, threadStart, NULL) != 0)
        {
            std::cout << "Abort\n";
            exit(1);
        }
    }
    clock_t middle = clock();
    for(int loop=0;loop < threadCount;++loop)
    {
        void* result;
        pthread_join(data[loop], &result);
    }
    clock_t end = clock();

    std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";
}
I expect Windows to do worse in process creation.
But I would expect modern Unix-like systems to have a fairly light fork cost and be at least comparable to threads. On older Unix-style systems (before fork() was implemented using copy-on-write pages) it would be worse.
Anyway My timing results are:
> uname -a
Darwin Alpha.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
> gcc --version | grep GCC
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)
> g++ thread.cpp -o thread -I~/include
> g++ fork.cpp -o fork -I~/include
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./thread ${a} >> A
foreach? end
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./fork ${a} >> A
foreach? end
vi A
Thread: Fork:
C Start Wait Total C Start Wait Total
==============================================================
1 26 145 171 1 160 37 197
2 44 198 242 2 290 37 327
3 62 234 296 3 413 41 454
4 77 275 352 4 499 59 558
5 91 107 10808 5 599 57 656
6 99 332 431 6 665 52 717
7 130 388 518 7 741 69 810
8 204 468 672 8 833 56 889
9 164 469 633 9 1067 76 1143
10 165 450 615 10 1147 64 1211
12 343 585 928 12 1213 71 1284
15 232 647 879 15 1360 203 1563
20 319 921 1240 20 2161 96 2257
30 461 1243 1704 30 3005 129 3134
40 559 1487 2046 40 4466 166 4632
50 686 1912 2598 50 4591 292 4883
60 827 2208 3035 60 5234 317 5551
70 973 2885 3858 70 7003 416 7419
80 3545 2738 6283 80 7735 293 8028
90 1392 3497 4889 90 7869 463 8332
100 3917 4180 8097 100 8974 436 9410
Edit:
Forking 1000 children caused the fork version to fail.
So I have reduced the child count. But doing a single test also seems unfair, so here is a range of values.
mumble ... I do not like your solution, for many reasons:
You are not taking into account the execution time of the child processes/threads.
You should compare CPU usage, not the bare elapsed time. This way your statistics will not depend on, e.g., disk access congestion.
Let your child process do something. Remember that "modern" fork uses copy-on-write mechanisms to avoid allocating memory to the child process until needed. Exiting immediately is too easy a case; this way you avoid almost all the disadvantages of fork.
CPU time is not the only cost you have to account for. Memory consumption and the slowness of IPC are both disadvantages of the fork solution.
You could use "rusage" instead of "clock" to measure real resource usage (see the sketch after this list).
P.S. I do not think you can really measure the process/thread overhead by writing a simple test program. There are too many factors and, usually, the choice between threads and processes is driven by reasons other than mere CPU usage.
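For the rusage point above, a minimal sketch on POSIX systems:

#include <sys/resource.h>
#include <iostream>

// Report CPU time actually consumed by this process and by reaped children
// (RUSAGE_CHILDREN only covers children that have been wait()ed for).
void report_cpu_usage()
{
    struct rusage self, children;
    getrusage(RUSAGE_SELF, &self);
    getrusage(RUSAGE_CHILDREN, &children);
    std::cout << "self:     " << self.ru_utime.tv_sec     << "s user\n"
              << "children: " << children.ru_utime.tv_sec << "s user\n";
}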
Under Linux fork is a special call to sys_clone, either within the library or within the kernel. Clone has lots of switches to flip on and off, and each of them effects how expensive it is to start.
The actual library function clone() is probably more expensive than fork() because it does more, and most of that is on the child side (stack swapping and calling a function by pointer).
What that micro-benchmark shows is that thread creation and joining (there are no fork results when I'm writing this) takes tens or hundreds of microseconds (assuming your system has CLOCKS_PER_SEC=1000000, which it probably has, since it's an XSI requirement).
Since you said that fork() takes 3 times the cost of threads, we are still talking tenths of a millisecond at worst. If that is noticeable on an application, you could use pools of processes/threads, like Apache 1.3 did. In any case, I'd say that startup time is a moot point.
The important difference of threads vs processes (on Linux and most Unix-likes) is that on processes you choose explicitly what to share, using IPC, shared memory (SYSV or mmap-style), pipes, sockets (you can send file descriptors over AF_UNIX sockets, meaning you get to choose which fd's to share), ... While on threads almost everything is shared by default, whether there's a need to share it or not. In fact, that is the reason Plan 9 had rfork() and Linux has clone() (and recently unshare()), so you can choose what to share.