I have a minimal reproducible example, which is as follows:
#include <iostream>
#include <chrono>
#include <cstdlib>      // aligned_alloc
#include <immintrin.h>
#include <vector>
#include <numeric>

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i=0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}

int main(){
    size_t size = 8192;
    //std::cout<<sizeof(double) * 8<<std::endl;
    auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    for(int i = 0; i < size * size; i++){
        *(matA + i) = i;
        *(matB + i) = i;
    }
    auto start = std::chrono::high_resolution_clock::now();
    for(int j=0; j<500; j++){
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout<<"Average Time is = "<<duration/500<<std::endl;
    std::cout<<*(result + 100)<<" "<<*(result + 1343)<<std::endl;
}
My experiment is as follows: I time the code with the #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it again without the directive. I compile the code as follows:
g++ -O3 -fopenmp example.cpp
Inspecting the assembly shows that both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower.
I can't work out why.
Edit - I am running GCC 9.3 and OpenMP 4.5. This is running on an i7 9750h 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (Minor variations from 4.0 to 4.1)
TIA
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
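As a rough sketch of what that looks like on the question's function (assuming GCC or Clang; the BENCH_NOINLINE macro name is made up for this example, and noclone is only applied for GCC since Clang does not implement it):

```cpp
#include <cstddef>

// GNU extension: keep the kernel out of line so the caller's repeat loop
// cannot be merged with it. BENCH_NOINLINE is a hypothetical helper macro.
#if defined(__clang__)
#  define BENCH_NOINLINE __attribute__((noinline))
#elif defined(__GNUC__)
#  define BENCH_NOINLINE __attribute__((noinline, noclone))
#else
#  define BENCH_NOINLINE
#endif

template <typename T>
BENCH_NOINLINE void AddMatrixOpenMP(const T* matA, const T* matB, T* result, std::size_t size) {
    // Same element-wise add as in the question, just never inlined.
    for (std::size_t i = 0; i < size * size; i++) {
        result[i] = matA[i] + matB[i];
    }
}
```

The call site stays the same; the compiler now has to emit a real call per repeat, so each of the 500 passes actually touches memory.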
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)
But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4:                                            # outer loop
        movss   xmm0, DWORD PTR [rbx+rdx*4]
        addss   xmm0, DWORD PTR [r13+0+rdx*4]   # one scalar operation
        mov     eax, 500
.L3:                                            # do {
        sub     eax, 1                          #   empty inner loop after inversion
        jne     .L3                             # }while(--i);
        add     rdx, 1
        movss   DWORD PTR [rcx], xmm0
        add     rcx, 4
        cmp     rdx, 67108864
        jne     .L4
This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
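A back-of-the-envelope check of why the "average" barely moves, as a sketch: it assumes the empty inner loop retires exactly one iteration per clock and that the core holds a steady 3.9 GHz (the function name is made up for illustration).

```cpp
#include <cstdint>

// Estimated microseconds per reported "iteration" when the repeat loop has
// become an empty delay loop running at one iteration per clock cycle.
// Total cycles = elements * repeats, and the benchmark divides the total time
// by repeats again, so the repeat count cancels out of the result.
double delay_loop_us_per_repeat(std::uint64_t elements, double clock_hz) {
    return static_cast<double>(elements) / clock_hz * 1e6;
}
```

For 8192 * 8192 = 67108864 elements at 3.9e9 Hz this comes out at roughly 17,200 us regardless of the repeat count, which is in the same range as the measured 18266 and 17821.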
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running:
Running intel_gpu_top -l while the proper benchmark is running (OpenMP, or with __attribute__((noinline, noclone))) shows the output below. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus. That's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
 Freq MHz   IRQ  RC6  Power      IMC MiB/s          RCS/0          BCS/0          VCS/0         VECS/0
 req  act    /s    %      W      rd     wr      %  se  wa      %  se  wa      %  se  wa      %  se  wa
   0    0     0   97   0.00   20421   7482   0.00   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   3    4    14   99   0.02   19627   6505   0.47   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   7    7    20   98   0.02   19625   6516   0.67   0   0   0.00   0   0   0.00   0   0   0.00   0   0
  11   10    22   98   0.03   19632   6516   0.65   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   3    4    13   99   0.02   19609   6505   0.46   0   0   0.00   0   0   0.00   0   0   0.00   0   0
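Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream: each ordinary store to result first reads the destination cache line (a read-for-ownership), so every line written costs two input reads plus one RFO read.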
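Those numbers can be cross-checked against the timing, as a sketch. It assumes nothing stays in cache between passes (each array is 256 MiB), that ordinary stores trigger one read-for-ownership per output line, and it reuses the 39772 us per pass measured above with -O3 -fopenmp. The struct and function names are invented for the example.

```cpp
#include <cstddef>

struct Traffic {
    double read_mib_s;
    double write_mib_s;
};

// Expected DRAM traffic for result[i] = matA[i] + matB[i] over an n x n matrix
// when the output is written with ordinary (non-NT) stores.
Traffic expected_traffic(std::size_t n, std::size_t elem_bytes, double seconds_per_pass) {
    const double array_mib = static_cast<double>(n) * n * elem_bytes / (1024.0 * 1024.0);
    const double read_mib = 3.0 * array_mib;   // matA + matB + RFO read of result
    const double write_mib = array_mib;        // result written back once
    return { read_mib / seconds_per_pass, write_mib / seconds_per_pass };
}
```

expected_traffic(8192, sizeof(float), 0.039772) gives about 19,310 MiB/s read and 6,440 MiB/s write, close to what the IMC counters report.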
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
 Freq MHz   IRQ  RC6  Power      IMC MiB/s          RCS/0          BCS/0          VCS/0         VECS/0
 req  act    /s    %      W      rd     wr      %  se  wa      %  se  wa      %  se  wa      %  se  wa
...
   8    8    17   99   0.03     365     85   0.62   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   9    9    17   99   0.02     349     90   0.62   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   4    4     5  100   0.01     303     63   0.25   0   0   0.00   0   0   0.00   0   0   0.00   0   0
   7    7    15  100   0.02     345     69   0.43   0   0   0.00   0   0   0.00   0   0   0.00   0   0
  10   10    21   99   0.03     350     74   0.64   0   0   0.00   0   0   0.00   0   0   0.00   0   0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
I have found some basic working examples on stitching via OpenCV for panoramic images. I have also found some useful documentation in the API docs, but I can't find out how to speed up the processing by providing additional information.
In my case, I generate a set of images in a 20x20 grid of individual frames, for a total of 400 images to be stitched into a single large one. This takes an enormous amount of time on a modern PC, so it would likely take hours on a developer board.
Is there any way to tell the OpenCV instance information about the images, such as me knowing in advance the relative positioning of all the images as they would appear on a grid? The only API calls I see so far is to just add all the images indiscriminately to a queue via vImg.push_back().
References
Stitching. Image Stitching - OpenCV API Documentation, Accessed 2014-02-26, <http://docs.opencv.org/modules/stitching/doc/stitching.html>
OpenCV Stitching example (Stitcher class, Panorama), Accessed 2014-02-26, <http://feelmare.blogspot.ca/2013/11/opencv-stitching-example-stitcher-class.html>
Panorama – Image Stitching in OpenCV, Accessed 2014-02-26, <http://ramsrigoutham.com/2012/11/22/panorama-image-stitching-in-opencv/>
I did some work with the stitching pipeline and though I do not consider myself an expert on the field, I did get better performance (and better results as well) adjusting each step of the pipeline separately. As you can see in the picture, the Stitching class is nothing but a wrapper of this pipeline:
Some interesting parts you can adjust are the resizing steps (there comes a point where more resolution just means more computation time and more inaccurate features), the matching process and (though this is just a guess) providing good camera parameters instead of performing an estimation. This involves getting the camera parameters before doing the stitching, but it is not really hard. Here you have some reference: OpenCV Camera Calibration and 3D Reconstruction.
Again: I am not an expert, this is just based on my experience as an intern doing some experiments with the library!
So far as I know, there is no means to provide additional data to the OpenCV engine beyond just giving it a list of images. It does a pretty good job on its own though. I would check out some of the example code, and test how long each stitching operation takes. From my experiments using 4x6, 4x8, ..., 4x20 panoramic reconstructions, the CPU time required seems to increase with the number of overlapping images. I would imagine your case would require at least a minute to compute on a modern machine.
Source:
https://code.ros.org/trac/opencv/browser/trunk/opencv/samples/cpp/stitching.cpp?rev=6682
/*M///////////////////////////////////////////////////////////////////////////////////////
//
// IMPORTANT: READ BEFORE DOWNLOADING, COPYING, INSTALLING OR USING.
//
// By downloading, copying, installing or using the software you agree to this license.
// If you do not agree to this license, do not download, install,
// copy or use the software.
//
//
// License Agreement
// For Open Source Computer Vision Library
//
// Copyright (C) 2000-2008, Intel Corporation, all rights reserved.
// Copyright (C) 2009, Willow Garage Inc., all rights reserved.
// Third party copyrights are property of their respective owners.
//
// Redistribution and use in source and binary forms, with or without modification,
// are permitted provided that the following conditions are met:
//
// * Redistribution's of source code must retain the above copyright notice,
// this list of conditions and the following disclaimer.
//
// * Redistribution's in binary form must reproduce the above copyright notice,
// this list of conditions and the following disclaimer in the documentation
// and/or other materials provided with the distribution.
//
// * The name of the copyright holders may not be used to endorse or promote products
// derived from this software without specific prior written permission.
//
// This software is provided by the copyright holders and contributors "as is" and
// any express or implied warranties, including, but not limited to, the implied
// warranties of merchantability and fitness for a particular purpose are disclaimed.
// In no event shall the Intel Corporation or contributors be liable for any direct,
// indirect, incidental, special, exemplary, or consequential damages
// (including, but not limited to, procurement of substitute goods or services;
// loss of use, data, or profits; or business interruption) however caused
// and on any theory of liability, whether in contract, strict liability,
// or tort (including negligence or otherwise) arising in any way out of
// the use of this software, even if advised of the possibility of such damage.
//
//M*/

// We follow to these papers:
// 1) Construction of panoramic mosaics with global and local alignment.
// Heung-Yeung Shum and Richard Szeliski. 2000.
// 2) Eliminating Ghosting and Exposure Artifacts in Image Mosaics.
// Matthew Uyttendaele, Ashley Eden and Richard Szeliski. 2001.
// 3) Automatic Panoramic Image Stitching using Invariant Features.
// Matthew Brown and David G. Lowe. 2007.

#include <iostream>
#include <fstream>
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/stitching/stitcher.hpp"

using namespace std;
using namespace cv;

void printUsage()
{
    cout <<
        "Rotation model images stitcher.\n\n"
        "stitching img1 img2 [...imgN]\n\n"
        "Flags:\n"
        " --try_use_gpu (yes|no)\n"
        " Try to use GPU. The default value is 'no'. All default values\n"
        " are for CPU mode.\n"
        " --output <result_img>\n"
        " The default is 'result.jpg'.\n";
}

bool try_use_gpu = false;
vector<Mat> imgs;
string result_name = "result.jpg";

int parseCmdArgs(int argc, char** argv)
{
    if (argc == 1)
    {
        printUsage();
        return -1;
    }
    for (int i = 1; i < argc; ++i)
    {
        if (string(argv[i]) == "--help" || string(argv[i]) == "/?")
        {
            printUsage();
            return -1;
        }
        // flag name matches the usage text and the error message below
        else if (string(argv[i]) == "--try_use_gpu")
        {
            if (string(argv[i + 1]) == "no")
                try_use_gpu = false;
            else if (string(argv[i + 1]) == "yes")
                try_use_gpu = true;
            else
            {
                cout << "Bad --try_use_gpu flag value\n";
                return -1;
            }
            i++;
        }
        else if (string(argv[i]) == "--output")
        {
            result_name = argv[i + 1];
            i++;
        }
        else
        {
            Mat img = imread(argv[i]);
            if (img.empty())
            {
                cout << "Can't read image '" << argv[i] << "'\n";
                return -1;
            }
            imgs.push_back(img);
        }
    }
    return 0;
}


int main(int argc, char* argv[])
{
    int retval = parseCmdArgs(argc, argv);
    if (retval) return -1;

    Mat pano;
    Stitcher stitcher = Stitcher::createDefault(try_use_gpu);
    Stitcher::Status status = stitcher.stitch(imgs, pano);

    if (status != Stitcher::OK)
    {
        cout << "Can't stitch images, error code = " << status << endl;
        return -1;
    }

    imwrite(result_name, pano);
    return 0;
}
Maybe this could help?
https://software.intel.com/en-us/articles/fast-panorama-stitching
Specifically the part about pairwise matching
Ronen
Consider enabling the use of the GPU in the OpenCV Stitcher:
bool try_use_gpu = true;
Stitcher myStitcher = Stitcher::createDefault(try_use_gpu);
Stitcher::Status status = myStitcher.stitch(Imgs, pano);
If you know the relative positions of the images, it seems that you could break down the problem into sub-problems and possibly reduce the computational load by approaching it with knowledge of the substructure of the problem. Basically break the set of images into groups of 4 adjacent images, process the frames, then proceed to process the resulting images using the same idea until you have arrived at your panorama. That being said, I've only recently begun toying with this OpenCV toolset. I know it's a pretty simple idea, but it might be useful to someone.
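A small sketch of the grouping step only, in plain C++ with no OpenCV calls. It assumes the images are stored row-major in the order they appear on the grid; group_2x2 is a made-up helper name, not part of the OpenCV API.

```cpp
#include <vector>

// Hypothetical helper: for a rows x cols grid of images stored row-major,
// return the image indices that belong to each 2x2 block of neighbours.
// Blocks on the right or bottom edge are smaller when rows or cols is odd.
std::vector<std::vector<int>> group_2x2(int rows, int cols) {
    std::vector<std::vector<int>> groups;
    for (int r = 0; r < rows; r += 2) {
        for (int c = 0; c < cols; c += 2) {
            std::vector<int> g;
            for (int dr = 0; dr < 2 && r + dr < rows; ++dr)
                for (int dc = 0; dc < 2 && c + dc < cols; ++dc)
                    g.push_back((r + dr) * cols + (c + dc));
            groups.push_back(g);
        }
    }
    return groups;
}
```

For the 20x20 case this yields 100 groups of four. Each group's images could be handed to a separate stitch() call, and the 100 results treated as a 10x10 grid for the next round. Whether the intermediate panoramas still overlap enough to be stitched again is something you would have to test.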
So I posted a similar question to this earlier, but I didn't post enough code to get the help I needed. Even if I went back and added that code now, I don't think it would be noticed because the question is old and "answered". So here's my issue:
I'm trying to generate a section of the Mandelbrot fractal. I can generate it fine, but when I add more cores, no matter how large the problem size is, the extra threads generate no speedup. I am completely new to multithreading and it's probably just something small I'm missing. Anyway, here are the functions that generate the fractal:
void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int numThreads) {
    using namespace std;
    vector<thread> threads (numThreads);
    int rowsPerThread = Y/numThreads;
    mutex m;
    for(int i=0; i<numThreads; i++) {
        threads[i] = thread ([&](){
            vector<int> row;
            for(int j=(i-1)*rowsPerThread; j<i*rowsPerThread; j++) {
                row = mandelbrot_row(j, X, Y);
                {
                    lock_guard<mutex> lock(m);
                    pixels[j] = row;
                }
            }
        });
    }
    for(int i=0; i<numThreads; i++) {
        threads[i].join();
    }
}

std::vector<int> mandelbrot_row(int rowNum, int topX, int topY) {
    std::vector<int> row (topX);
    for(int i=0; i<topX; i++) {
        row[i] = mandelbrotOne(i, rowNum, topX, topY);
    }
    return row;
}

int mandelbrotOne(int currX, int currY, int X, int Y) { //code adapted from http://en.wikipedia.org/wiki/Mandelbrot_set
    double x0 = convert(X, currX, true);
    double y0 = convert(Y, currY, false);
    double x = 0.0;
    double y = 0.0;
    double xtemp;
    int iteration = 0;
    int max_iteration = 255;
    while ( x*x + y*y < 2*2 && iteration < max_iteration) {
        xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        ++iteration;
    }
    return iteration;
}
mandelbrot_all is passed a vector to hold the pixels, the maximum X and Y of the vector, and the number of threads to use, which is taken from the command line when the program is run. It attempts to split the work by row among multiple threads. Unfortunately, it seems that even if that is what it's doing, it's not making it any faster. If you need more details, feel free to ask and I will do my best to provide them.
Thanks in advance for the help.
Edit: reserved vectors in advance
Edit 2: ran this code with problem size 9600x7200 on a quad core laptop. It took an average of 36590000 cycles for one thread (over 5 runs) and 55142000 cycles for four threads.
Your code might appear to do parallel processing, but in practice it doesn't.
Basically, you are spending your time copying data around and queueing for memory allocator accesses.
Besides, every worker lambda captures the loop index i by reference with no protection at all, which will feed your worker threads random junk instead of beautiful slices of the image.
As usual, C++ is hiding these sad facts from you under a thick crust of syntactic sugar.
But the greatest flaw of your code is the algorithm itself, as you might see if you read further ahead.
Since this example seems a textbook case of parallel processing to me and I never saw an "educational" analysis of it, I will try one.
Functional analysis
You want to use all CPU cores to crunch pixels of the Mandelbrot set. This is a perfect case of parallelizable computation, since each pixel computation can be done independently.
So basically if you have N cores on your machine you should have exactly one thread per core doing 1/N of the processing.
Unfortunately, dividing the input data so that each processor ends up doing 1/N of the needed processing is not as obvious as it might seem.
A given pixel can take from 0 to 255 iterations to compute. "black" pixels are 255 times more costly than "white" ones.
So if you simply divide your picture into N equal sub-surfaces, chances are all of your processors will breeze through "white" areas except one that will crawl through a "black" area. As a result, the slowest area computation time will dominate and parallelization will gain practically nothing.
In real cases, this will not be as dramatic, but still a huge loss of computing power.
Load balancing
To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one.
That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
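In its simplest form the "pick the next available bit" idea can be sketched with a shared atomic counter. This is a stripped-down illustration, not the implementation described further down (which keeps workers alive between pictures and uses condition variables); run_chunks and its parameters are names invented for the example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Runs work(chunk) once for every chunk in [0, num_chunks). Each worker pulls
// the next unclaimed chunk index from a shared atomic counter, so a worker
// that got cheap chunks simply comes back for more.
template <typename Work>
void run_chunks(int num_chunks, int num_workers, Work work) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&next, &work, num_chunks] {
            for (;;) {
                int chunk = next.fetch_add(1);   // claim one chunk, exactly once
                if (chunk >= num_chunks) return; // nothing left to do
                work(chunk);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```

The only shared state touched per chunk is one atomic increment, which is negligible next to the cost of computing a slice of the image.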
Ideally the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.
Unfortunately, due to the chaotic nature of the Mandelbrot set, there is no practical way of predicting the computation time of a given area.
If we decide the chunks will be horizontal slices of the picture, sorting them in natural y order is clearly suboptimal. If that particular area is a kind of "white to black" gradient, the most costly lines will all be bunched at the end of the chunks list and you will end up computing the costliest bits last, which is the worst case for load balancing.
A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle input patterns at random.
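To make the butterfly pattern concrete, here is a standalone sketch of the ordering. It mirrors the recursion used in compute() in the implementation section, but only produces the permutation; butterfly_order is a name chosen for this example.

```cpp
#include <vector>

// order[slot] = index of the image slice assigned to that queue slot.
// The outermost slots swap ends, then each half is shuffled the same way.
static void butterfly(std::vector<int>& order, int min, int max) {
    if (min > max) return;
    order[min] = max;
    if (min == max) return;
    order[max] = min;
    int mid = (min + max) / 2;
    butterfly(order, min + 1, mid);
    butterfly(order, mid + 1, max - 1);
}

std::vector<int> butterfly_order(int num_jobs) {
    std::vector<int> order(num_jobs, -1);
    butterfly(order, 0, num_jobs - 1);
    return order;
}
```

For 8 jobs this gives 7 3 2 1 6 5 4 0, so consecutive queue slots no longer map to adjacent slices of the picture.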
Here are two outputs of my proof of concept implementation:
Jobs are executed in reverse order (job 39 is the first, job 0 is the last).
Each line decodes as follows:
t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)
1) 40 jobs with butterfly ordering
job 0: t 1-1 b 162 e 174 d 12 // the 4 tasks finish within 5 ms from each other
job 1: t 0-0 b 156 e 176 d 20 //
job 2: t 2-2 b 155 e 173 d 18 //
job 3: t 3-3 b 154 e 174 d 20 //
job 4: t 1-1 b 141 e 162 d 21
job 5: t 2-2 b 137 e 155 d 18
job 6: t 0-0 b 136 e 156 d 20
job 7: t 3-3 b 133 e 154 d 21
job 8: t 1-1 b 117 e 141 d 24
job 9: t 0-0 b 116 e 136 d 20
job 10: t 2-2 b 115 e 137 d 22
job 11: t 3-3 b 113 e 133 d 20
job 12: t 0-0 b 99 e 116 d 17
job 13: t 1-1 b 99 e 117 d 18
job 14: t 2-2 b 96 e 115 d 19
job 15: t 3-3 b 95 e 113 d 18
job 16: t 0-0 b 83 e 99 d 16
job 17: t 3-3 b 80 e 95 d 15
job 18: t 2-2 b 77 e 96 d 19
job 19: t 1-1 b 72 e 99 d 27
job 20: t 3-3 b 69 e 80 d 11
job 21: t 0-0 b 68 e 83 d 15
job 22: t 2-2 b 63 e 77 d 14
job 23: t 1-1 b 56 e 72 d 16
job 24: t 3-3 b 54 e 69 d 15
job 25: t 0-0 b 53 e 68 d 15
job 26: t 2-2 b 48 e 63 d 15
job 27: t 0-0 b 41 e 53 d 12
job 28: t 3-3 b 40 e 54 d 14
job 29: t 1-1 b 36 e 56 d 20
job 30: t 3-3 b 29 e 40 d 11
job 31: t 2-2 b 29 e 48 d 19
job 32: t 0-0 b 23 e 41 d 18
job 33: t 1-1 b 18 e 36 d 18
job 34: t 2-2 b 16 e 29 d 13
job 35: t 3-3 b 15 e 29 d 14
job 36: t 2-2 b 0 e 16 d 16
job 37: t 3-3 b 0 e 15 d 15
job 38: t 1-1 b 0 e 18 d 18
job 39: t 0-0 b 0 e 23 d 23
You can see load balancing at work when a thread having processed a few small jobs will overtake another that took more time to process its own chunks.
2) 40 jobs with linear ordering
job 0: t 2-2 b 157 e 180 d 23 // last thread lags 17 ms behind first
job 1: t 1-1 b 154 e 175 d 21
job 2: t 3-3 b 150 e 171 d 21
job 3: t 0-0 b 143 e 163 d 20 // 1st thread ends
job 4: t 2-2 b 137 e 157 d 20
job 5: t 1-1 b 135 e 154 d 19
job 6: t 3-3 b 130 e 150 d 20
job 7: t 0-0 b 123 e 143 d 20
job 8: t 2-2 b 115 e 137 d 22
job 9: t 1-1 b 112 e 135 d 23
job 10: t 3-3 b 112 e 130 d 18
job 11: t 0-0 b 105 e 123 d 18
job 12: t 3-3 b 95 e 112 d 17
job 13: t 2-2 b 95 e 115 d 20
job 14: t 1-1 b 94 e 112 d 18
job 15: t 0-0 b 90 e 105 d 15
job 16: t 3-3 b 78 e 95 d 17
job 17: t 2-2 b 77 e 95 d 18
job 18: t 1-1 b 74 e 94 d 20
job 19: t 0-0 b 69 e 90 d 21
job 20: t 3-3 b 60 e 78 d 18
job 21: t 2-2 b 59 e 77 d 18
job 22: t 1-1 b 57 e 74 d 17
job 23: t 0-0 b 55 e 69 d 14
job 24: t 3-3 b 45 e 60 d 15
job 25: t 2-2 b 45 e 59 d 14
job 26: t 1-1 b 43 e 57 d 14
job 27: t 0-0 b 43 e 55 d 12
job 28: t 2-2 b 30 e 45 d 15
job 29: t 3-3 b 30 e 45 d 15
job 30: t 0-0 b 27 e 43 d 16
job 31: t 1-1 b 24 e 43 d 19
job 32: t 2-2 b 13 e 30 d 17
job 33: t 3-3 b 12 e 30 d 18
job 34: t 0-0 b 11 e 27 d 16
job 35: t 1-1 b 11 e 24 d 13
job 36: t 2-2 b 0 e 13 d 13
job 37: t 3-3 b 0 e 12 d 12
job 38: t 1-1 b 0 e 11 d 11
job 39: t 0-0 b 0 e 11 d 11
Here the costly chunks tend to bunch together at the end of the queue, hence a noticeable performance loss.
3) a run with only one job per core, with one to 4 cores activated
reported cores: 4
Master: start jobs 4 workers 1
job 0: t 0-0 b 410 e 590 d 180 // purely linear execution
job 1: t 0-0 b 255 e 409 d 154
job 2: t 0-0 b 127 e 255 d 128
job 3: t 0-0 b 0 e 127 d 127
Master: start jobs 4 workers 2 // gain factor : 1.6 out of theoretical 2
job 0: t 1-1 b 151 e 362 d 211
job 1: t 0-0 b 147 e 323 d 176
job 2: t 0-0 b 0 e 147 d 147
job 3: t 1-1 b 0 e 151 d 151
Master: start jobs 4 workers 3 // gain factor : 1.82 out of theoretical 3
job 0: t 0-0 b 142 e 324 d 182 // 4th packet is hurting the performance badly
job 1: t 2-2 b 0 e 158 d 158
job 2: t 1-1 b 0 e 160 d 160
job 3: t 0-0 b 0 e 142 d 142
Master: start jobs 4 workers 4 // gain factor : 3 out of theoretical 4
job 0: t 3-3 b 0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job 1: t 1-1 b 0 e 182 d 182 // 17 ms wasted
job 2: t 0-0 b 0 e 146 d 146 // 44 ms wasted
job 3: t 2-2 b 0 e 150 d 150 // 49 ms wasted
Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255 !).
At any rate, if you don't implement some kind of load balancing, all the shiny multiprocessor code you might write will still yield poor to downright miserable performance.
Implementation
For the threads to work unhindered, they must be kept free from interference from the outside world.
One such interference is the memory allocation. Each time you allocate even a byte of memory, you will queue for exclusive access to the global memory allocator (and waste a bit of CPU doing the allocation).
Also, creating worker tasks for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so better have the workers permanently created and synchronized to compute successive images.
Lastly, there are the data copies. If you synchronize with the main program each time you're done computing a few points, you will again spend a good part of your time queueing for exclusive access to the result area. Besides, the useless copies of a sizeable amount of data will hurt performance even more.
The obvious solution is to dispense with the copies altogether and work on original data.
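One way to picture "working on original data" is a single buffer allocated before the workers start, with each worker writing into its own rows. This is only a sketch; the Bitmap type is invented here and is not the class used in my program.

```cpp
#include <cstddef>
#include <vector>

// One allocation for the whole picture, made before any worker starts.
struct Bitmap {
    int width;
    int height;
    std::vector<unsigned char> pixels;   // iteration counts, row-major

    Bitmap(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h) {}

    // Workers write straight into their own rows: no per-row vector, no copy,
    // and no lock as long as two workers never share a row.
    unsigned char* row(int y) {
        return pixels.data() + static_cast<std::size_t>(y) * width;
    }
};
```

Compared with the question's code, this removes the per-row vector allocation, the row copy into pixels, and the mutex around that copy.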
design
You must provide your worker threads all they need to work unhindered. For that you need to:
determine the number of available cores on your system
pre-allocate all the memory needed
give access to a list of image chunks to each of your worker
launch exactly one thread per core and let them run free to do their job
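For the first item of that list, standard C++ offers std::thread::hardware_concurrency(). A minimal sketch (worker_count is a name made up here) that guards against the value 0, which the standard allows when the count cannot be determined:

```cpp
#include <thread>

// Number of worker threads to launch. hardware_concurrency() reports logical
// processors and may return 0 if the value is not computable, so fall back
// to a single worker in that case.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}
```

Note that this counts logical processors, so a 2-core hyperthreaded CPU will typically report 4.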
job queue
There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.
Basically, the queue can be computed as a whole at the start of an image generation. Workers will only have to read the jobs from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code around to implement job queues will be suboptimal and too complex for the job at hand.
We need two sync points:
let the workers wait for a new batch of jobs
let the master wait for a picture completion
workers will wait until the queue length changes to a positive value.
They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).
The same mechanism is used to terminate the workers. Instead of finding a new batch of jobs, the poor workers will wake up to find an order to terminate.
the master waiting for a picture completion will be awoken by the worker that will finish processing the last job. This will be based on an atomic counter of the number of jobs to process.
This is how I implemented it:
class synchro {
    friend class mandelbrot_calculator;

    mutex lock;                 // queue lock
    condition_variable work;    // blocks workers waiting for jobs/termination
    condition_variable done;    // blocks master waiting for completion
    int pending;                // number of jobs in the queue
    atomic_int active;          // number of unprocessed jobs
    bool kill;                  // poison pill for workers termination

    synchro()                   // a constructor has no return type
    {
        pending = 0;            // no job in queue
        active = 0;             // nothing being processed yet
        kill = false;           // workers shall live (for now :) )
    }

    int worker_start(void)
    {
        unique_lock<mutex> waiter(lock);
        while (!pending && !kill) work.wait(waiter);
        return kill
            ? -1                // worker should die
            : --pending;        // index of the job to process
    }

    void worker_done(void)
    {
        if (!--active)          // atomic decrement (exclusive with other workers)
        {
            // hold the lock so the notification cannot slip in between the
            // master's test of 'active' and its call to wait()
            unique_lock<mutex> waiter(lock);
            done.notify_one();  // last job processed: wakeup master
        }
    }

    void master_start(int jobs)
    {
        unique_lock<mutex> waiter(lock);
        pending = active = jobs;
        work.notify_all();      // wakeup all workers to start jobs
    }

    void master_done(void)
    {
        unique_lock<mutex> waiter(lock);
        while (active) done.wait(waiter); // wait for workers to finish
    }

    void master_kill(void)
    {
        unique_lock<mutex> waiter(lock);  // workers read 'kill' under this lock
        kill = true;
        work.notify_all();      // wakeup all workers (to die)
    }
};
Putting it all together:
class mandelbrot_calculator {
    int num_cores;
    int num_jobs;
    vector<thread> workers;     // worker threads
    vector<job> jobs;           // job queue
    synchro sync;               // synchronization helper

public:
    mandelbrot_calculator (int num_cores, int num_jobs)
        : num_cores(num_cores)
        , num_jobs (num_jobs )
    {
        // worker thread
        auto worker = [&]()
        {
            for (;;)
            {
                int job = sync.worker_start(); // fetch next job
                if (job == -1) return;         // poison pill
                process (jobs[job]);           // we have exclusive access to this job
                sync.worker_done();            // signal end of picture to the master
            }
        };

        jobs.resize(num_jobs, job());          // computation windows
        workers.resize(num_cores);
        for (int i = 0; i != num_cores; i++)
            workers[i] = thread(worker);       // the worker lambda takes no arguments
    }

    ~mandelbrot_calculator()
    {
        // kill the workers
        sync.master_kill();
        for (thread& worker : workers) worker.join();
    }

    void compute(const viewport & vp)
    {
        // prepare worker data
        function<void(int, int)> butterfly_jobs;
        butterfly_jobs = [&](int min, int max)
        // computes job windows in butterfly order
        {
            if (min > max) return;
            jobs[min].setup(vp, max, num_jobs);
            if (min == max) return;
            jobs[max].setup(vp, min, num_jobs);
            int mid = (min + max) / 2;
            butterfly_jobs(min + 1, mid    );
            butterfly_jobs(mid + 1, max - 1);
        };
        butterfly_jobs(0, num_jobs - 1);

        // launch workers
        sync.master_start(num_jobs);

        // wait for completion
        sync.master_done();
    }
};
Testing the concept
This code works pretty well on my 2 cores / 4 CPUs Intel I3 @ 3.1 GHz, compiled with Microsoft Dev Studio 2013.
I use a bit of the set that has an average of 90 iterations / pixel, on a window of 1280x1024 pixels.
The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.
I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).
Tampering with the scheduler
My program forces the threads to run on one core each (using MSDN SetThreadAffinityMask).
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.
This is significant, but still the Win7 scheduler does a pretty good job when left alone.
synchronization overhead
Running the algorithm on a "white" window (outside the r < 2 area) gives a good idea of the system calls overhead.
It takes about 7ms to compute this "white" area, compared with the 480 ms of a representative area.
Something like 1.5%, including both the synchronization and computation of the job queue. And this is doing a synchronization on a queue of 1024 jobs.
Utterly negligible, I would say. That might give food for thought to all the No-wait queue fanatics around.
optimizing iterations
The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:
static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
    register double x = x0;
    register double y = y0;
    register double x2 = x * x;
    register double y2 = y * y;
    unsigned iteration = 0;
    const int max_iteration = 255;
    while (x2 + y2 < 4.0)
    {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2 + x0;
        x2 = x * x;
        y2 = y * y;
    }
    return (unsigned char)iteration;
}
net gain: +20% compared with the OP's method
(the register directives don't make a bit of a difference, they are just there for decoration)
killing the tasks after each computation
The benefit of leaving the workers alive is about 5% of the computation time.
butterfly effect
On my test case, the "butterfly" order is doing really well, yielding more than 30% gain in extreme cases and routinely 10-15% due to "de-bunching" the bulkiest requests.
The problem in your code is that all threads capture the same i variable by reference and access it while the launching loop keeps changing it. This creates a race condition and the results are wildly incorrect.
You need to pass it as an argument to the thread lambda, and also correct the ranges (i-1 will make your indexing go out of bounds).
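A sketch of that fix, under a few assumptions: the row bounds are computed in the launching loop and captured by value (which has the same effect as passing i as an argument), pixels has already been sized to Y rows, and fake_row is a stand-in invented here for the question's mandelbrot_row so the example is self-contained.

```cpp
#include <thread>
#include <vector>

// Stand-in for mandelbrot_row (hypothetical body: fills the row with its own
// index so the result is easy to check).
std::vector<int> fake_row(int rowNum, int width) {
    return std::vector<int>(width, rowNum);
}

void rows_in_parallel(std::vector<std::vector<int>>& pixels, int X, int Y, int numThreads) {
    std::vector<std::thread> threads;
    const int rowsPerThread = Y / numThreads;
    for (int i = 0; i < numThreads; i++) {
        const int first = i * rowsPerThread;   // starts at 0, not at -rowsPerThread
        // The last thread also takes the leftover rows when Y % numThreads != 0.
        const int last = (i == numThreads - 1) ? Y : first + rowsPerThread;
        // first and last are captured by value: each thread keeps its own copy.
        threads.emplace_back([&pixels, X, first, last] {
            for (int j = first; j < last; j++) {
                pixels[j] = fake_row(j, X);    // rows are disjoint, so no mutex
            }
        });
    }
    for (auto& t : threads) t.join();
}
```

This only fixes correctness; whether it also gives a speedup depends on how evenly the expensive rows are spread across the threads.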
Good Day,
I am running Coldfusion 8 MX on a Windows 2003 32-bit server with 4 gig of RAM (2 gig is always free), but I am unable to assign much more than 550m to the JVM.
I had already submitted this question but it got too long and confusing with all of my edits. The closest I ever got to it starting was when I set the -Xmn, it ran for 10 minutes before crashing. After crashing, it would not start with 1024m again, even with those same args.
These crash logs are for Java 1.6.0_38. I have tried with Java 5 and it gives the same result. I cannot even get CF8 to start with Java 7 (This is a separate issue).
I need to assign more RAM to the JVM so that CF doesn't keep crashing under heavy load, so any insight into this behaviour would be appreciated.
Java Args are:
java.args=-server -Xmx1024m -Xms1024m -Xmn200m -XX:MaxPermSize=256m -XX:PermSize=256m -Xloggc:{application.home}/../logs/CFGC/CFGC.log -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib
Crash Log #1 (The server ran for 10 minutes before producing this):
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x7c82a5c4, pid=4604, tid=4760
#
# JRE version: 6.0_38-b05
# Java VM: Java HotSpot(TM) Server VM (20.13-b02 mixed mode windows-x86 )
# Problematic frame:
# C [ntdll.dll+0x2a5c4]
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
--------------- T H R E A D ---------------
Current thread (0x5c4f0400): JavaThread "jrpp-2" [_thread_in_native, id=4760, stack(0x638d0000,0x639d0000)]
siginfo: ExceptionCode=0xc0000005, reading address 0x00000000
Registers:
EAX=0x680bd710, EBX=0x003e0000, ECX=0x00000000, EDX=0x00000000
ESP=0x639cc950, EBP=0x639cc95c, ESI=0x680bd708, EDI=0x680b56f8
EIP=0x7c82a5c4, EFLAGS=0x00010246
Top of Stack: (sp=0x639cc950)
0x639cc950: 003e0000 680b56f8 00000000 639cca44
0x639cc960: 7c82a69b 003e0000 00000000 639cca24
0x639cc970: 00000000 00000057 680b5700 00008004
0x639cc980: 04c0a470 7c827b79 71b219d6 000018a4
0x639cc990: 00000001 639cc9ac 000018a4 00000103
0x639cc9a0: 00000103 639cc9c0 7c826e39 ffb3b4c0
0x639cc9b0: ffffffff 0014dfa8 00000000 00000000
0x639cc9c0: 639cca38 580d0000 000018a4 00002020
Instructions: (pc=0x7c82a5c4)
0x7c82a5a4: 85 db a6 02 00 8a 46 05 24 10 a8 10 88 47 05 0f
0x7c82a5b4: 85 02 0f 00 00 8b 4e 0c 8d 46 08 8b 10 89 4d 0c
0x7c82a5c4: 8b 09 3b 4a 04 89 55 14 0f 85 f5 72 01 00 3b c8
0x7c82a5d4: 0f 85 ed 72 01 00 56 53 e8 17 fa ff ff 8b 45 14
Register to memory mapping:
EAX=0x680bd710 is an unknown value
EBX=0x003e0000 is an unknown value
ECX=0x00000000 is an unknown value
EDX=0x00000000 is an unknown value
ESP=0x639cc950 is pointing into the stack for thread: 0x5c4f0400
EBP=0x639cc95c is pointing into the stack for thread: 0x5c4f0400
ESI=0x680bd708 is an unknown value
EDI=0x680b56f8 is an unknown value
Stack: [0x638d0000,0x639d0000], sp=0x639cc950, free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [ntdll.dll+0x2a5c4] wcslen+0x1d6
C [ntdll.dll+0x2a69b] wcslen+0x2ad
C [MSVCR71.dll+0x218a] free+0x39
C [net.dll+0x711f] Java_java_net_SocketInputStream_socketRead0+0x1c6
j java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I+0
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I+0
J macromedia.jdbc.sqlserver.SQLServerByteOrderedDataReader.makeMoreDataAvailable()V
j macromedia.jdbc.sqlserver.SQLServerByteOrderedDataReader.receive()V+14
j macromedia.jdbc.sqlserver.tds.TDSRPCRequest.submitRequest(Lmacromedia/jdbc/sqlserver/SQLServerImplStatement;)V+132
j macromedia.jdbc.sqlserver.tds.TDSRPCNonCursorExecuteRequest.submitPrepExec(Lmacromedia/jdbc/sqlserver/SQLServerImplStatement;Lmacromedia/jdbc/base/BaseWarnings;)V+26
j macromedia.jdbc.sqlserver.tds.TDSRPCExecuteRequest.doPrepExec(Lmacromedia/jdbc/sqlserver/SQLServerImplStatement;Lmacromedia/jdbc/base/BaseWarnings;)V+29
j macromedia.jdbc.sqlserver.tds.TDSRPCExecuteRequest.execute(Lmacromedia/jdbc/sqlserver/SQLServerImplStatement;Lmacromedia/jdbc/base/BaseWarnings;)V+339
j macromedia.jdbc.sqlserver.SQLServerImplStatement.execute()V+468
j macromedia.jdbc.base.BaseStatement.commonExecute()V+40
j macromedia.jdbc.base.BaseStatement.executeInternal()Z+5
j macromedia.jdbc.base.BasePreparedStatement.execute()Z+42
j macromedia.jdbc.base.BasePreparedStatementPoolable.execute()Z+4
j coldfusion.sql.Executive.executeQuery(Ljava/sql/Connection;Ljava/lang/String;Lcoldfusion/sql/ParameterList;Ljava/lang/Integer;Ljava/lang/Integer;Ljava/lang/Integer;[IIIZZ)Lcoldfusion/sql/Table;+507
j coldfusion.sql.Executive.executeQuery(Ljava/sql/Connection;Ljava/lang/String;Lcoldfusion/sql/ParameterList;Ljava/lang/Integer;Ljava/lang/Integer;Ljava/lang/Integer;[ILcoldfusion/sql/DataSourceDef;)Lcoldfusion/sql/Table;+181
j coldfusion.sql.Executive.executeQuery(Ljava/sql/Connection;Ljava/lang/String;Lcoldfusion/sql/ParameterList;Ljava/lang/Integer;Ljava/lang/Integer;Ljava/lang/Integer;[ILjava/lang/Object;)Lcoldfusion/sql/Table;+61
j coldfusion.sql.SqlImpl.execute(Z)Lcoldfusion/sql/Table;+138
j coldfusion.tagext.sql.QueryTag.executeQuery(Z)Lcoldfusion/sql/Table;+5
j coldfusion.tagext.sql.QueryTag.doEndTag()I+65
j cfTransferGateway2ecfc585582040$funcLISTBYPROPERTYMAP.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+2719
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfSQLManager2ecfc937842696$funcLISTBYPROPERTYMAP.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+450
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
j coldfusion.runtime.UDFMethod.invoke(Ljava/lang/Object;Ljava/lang/String;Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+26
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;Ljava/util/Map;Ljavax/servlet/jsp/PageContext;)Ljava/lang/Object;
j coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;Ljava/util/Map;)Ljava/lang/Object;+58
j cfTransfer2ecfc1333771829$funcREADBYPROPERTYMAP.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+278
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfUserService2ecfc1575820687$funcGETUSERBYPROPERTYMAP.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+116
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfCardListener2ecfc1071855549$funcGETCARD.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+333
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
j coldfusion.runtime.UDFMethod.invoke(Ljava/lang/Object;Ljava/lang/String;Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+26
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;Ljava/util/Map;Ljavax/servlet/jsp/PageContext;)Ljava/lang/Object;
j coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;Ljava/util/Map;)Ljava/lang/Object;+58
j coldfusion.tagext.lang.InvokeTag.doEndTag()I+176
j cfEventInvoker2ecfc577457096$funcINVOKELISTENER.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+412
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfNotifyCommand2ecfc1810625261$funcEXECUTE.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+295
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfEventHandler2ecfc451386063$funcHANDLEEVENT.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+322
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfRequestHandler2ecfc128659239$funcHANDLEEVENT.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+862
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invokeUDF(Ljava/lang/Object;Ljava/lang/String;Lcoldfusion/runtime/CFPage;[Ljava/lang/Object;)Ljava/lang/Object;
j cfRequestHandler2ecfc128659239$funcHANDLENEXTEVENT.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+181
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invokeUDF(Ljava/lang/Object;Ljava/lang/String;Lcoldfusion/runtime/CFPage;[Ljava/lang/Object;)Ljava/lang/Object;
j cfRequestHandler2ecfc128659239$funcPROCESSEVENTS.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+390
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invokeUDF(Ljava/lang/Object;Ljava/lang/String;Lcoldfusion/runtime/CFPage;[Ljava/lang/Object;)Ljava/lang/Object;
j cfRequestHandler2ecfc128659239$funcHANDLEREQUEST.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+1738
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invoke(Ljava/lang/Object;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object;
j cfmach2dii2ecfc392524582$funcHANDLEREQUEST.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+1653
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod.runFilterChain(Ljava/lang/Object;Ljava/lang/Object;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;
J coldfusion.runtime.CfJspPage._invokeUDF(Ljava/lang/Object;Ljava/lang/String;Lcoldfusion/runtime/CFPage;[Ljava/lang/Object;)Ljava/lang/Object;
j cfApplication2ecfc1704547219$funcONREQUESTSTART.runFunction(Lcoldfusion/runtime/LocalScope;Ljava/lang/Object;Lcoldfusion/runtime/CFPage;Lcoldfusion/runtime/ArgumentCollection;)Ljava/lang/Object;+829
J coldfusion.runtime.UDFMethod.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.UDFMethod$ReturnTypeFilter.invoke(Lcoldfusion/filter/FusionContext;)V
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljava/util/Map;Lcoldfusion/runtime/CfJspPage;Lcoldfusion/filter/FusionContext;)Ljava/lang/Object;
J coldfusion.runtime.TemplateProxy.invoke(Ljava/lang/String;[Ljava/lang/Object;Ljavax/servlet/jsp/PageContext;)Ljava/lang/Object;
j coldfusion.runtime.AppEventInvoker.invoke(Ljava/lang/String;[Ljava/lang/Object;Lcoldfusion/filter/FusionContext;)Z+28
j coldfusion.runtime.AppEventInvoker.onRequestStart([Ljava/lang/Object;Lcoldfusion/filter/FusionContext;)Z+5
j coldfusion.filter.ApplicationFilter.invoke(Lcoldfusion/filter/FusionContext;)V+676
j coldfusion.filter.RequestMonitorFilter.invoke(Lcoldfusion/filter/FusionContext;)V+56
j coldfusion.filter.MonitoringFilter.invoke(Lcoldfusion/filter/FusionContext;)V+12
j coldfusion.filter.PathFilter.invoke(Lcoldfusion/filter/FusionContext;)V+124
j coldfusion.filter.ExceptionFilter.invoke(Lcoldfusion/filter/FusionContext;)V+13
j coldfusion.filter.ClientScopePersistenceFilter.invoke(Lcoldfusion/filter/FusionContext;)V+5
j coldfusion.filter.BrowserFilter.invoke(Lcoldfusion/filter/FusionContext;)V+35
j coldfusion.filter.NoCacheFilter.invoke(Lcoldfusion/filter/FusionContext;)V+120
j coldfusion.filter.GlobalsFilter.invoke(Lcoldfusion/filter/FusionContext;)V+13
j coldfusion.filter.DatasourceFilter.invoke(Lcoldfusion/filter/FusionContext;)V+5
j coldfusion.filter.RequestThrottleFilter.invoke(Lcoldfusion/filter/FusionContext;)V+82
j coldfusion.CfmServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+114
j coldfusion.bootstrap.BootstrapServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+30
j jrun.servlet.FilterChain.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+53
j com.intergral.fusionreactor.filter.FusionReactorCoreFilter.doHttpServletRequest(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;)V+468
j com.intergral.fusionreactor.filter.FusionReactorCoreFilter.doFusionRequest(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljavax/servlet/FilterChain;)V+262
j com.intergral.fusionreactor.filter.FusionReactorCoreFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V+45
j com.intergral.fusionreactor.filter.FusionReactorFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V+39
j jrun.servlet.FilterChain.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+80
j coldfusion.monitor.event.MonitoringServletFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V+9
j coldfusion.bootstrap.BootstrapFilter.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;Ljavax/servlet/FilterChain;)V+25
j jrun.servlet.FilterChain.doFilter(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+80
j jrun.servlet.FilterChain.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+3
j jrun.servlet.ServletInvoker.invoke(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;Ljrun/servlet/InvokerChain;)V+183
j jrun.servlet.JRunInvokerChain.invokeNext(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V+55
j jrun.servlet.JRunRequestDispatcher.invoke(Ljrun/servlet/ServletConnection;)V+249
j jrun.servlet.ServletEngineService.dispatch(Ljrun/servlet/ServletConnection;)V+74
j jrun.servlet.jrpp.JRunProxyService.invokeRunnable(Ljava/lang/Runnable;)V+24
j jrunx.scheduler.ThreadPool$DownstreamMetrics.invokeRunnable(Ljava/lang/Runnable;)V+113
j jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(Ljava/lang/Runnable;)V+16
j jrunx.scheduler.ThreadPool$UpstreamMetrics.invokeRunnable(Ljava/lang/Runnable;)V+47
j jrunx.scheduler.WorkerThread.run()V+24
v ~StubRoutines::call_stub
--------------- P R O C E S S ---------------
[I removed this section due to StackOverflow question size limits. Please let me know if you need to see this]
Other Threads:
0x56879000 VMThread [stack: 0x566d0000,0x567d0000] [id=5156]
0x568af000 WatcherThread [stack: 0x57260000,0x57360000] [id=3788]
VM state:not at safepoint (normal execution)
VM Mutex/Monitor currently owned by a thread: None
Heap
par new generation total 184320K, used 34086K [0x039d0000, 0x101d0000, 0x101d0000)
eden space 163840K, 11% used [0x039d0000, 0x04cc5c60, 0x0d9d0000)
from space 20480K, 71% used [0x0edd0000, 0x0fc23e68, 0x101d0000)
to space 20480K, 0% used [0x0d9d0000, 0x0d9d0000, 0x0edd0000)
concurrent mark-sweep generation total 843776K, used 325395K [0x101d0000, 0x439d0000, 0x439d0000)
concurrent-mark-sweep perm gen total 262144K, used 146524K [0x439d0000, 0x539d0000, 0x539d0000)
Code Cache [0x00710000, 0x010f8000, 0x03710000)
total_blobs=2699 nmethods=2465 adapters=187 free_code_cache=40016832 largest_free_block=17152
Dynamic libraries:
[I removed this section due to StackOverflow question size limits. Please let me know if you need to see this]
VM Arguments:
jvm_args: -Xmx1024m -Xms1024m -Xmn200m -XX:MaxPermSize=256m -XX:PermSize=256m -Xloggc:C:\ColdFusion8\runtime/../logs/CFGC/CFGC.log -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Dcoldfusion.rootDir=C:\ColdFusion8\runtime/../ -Dcoldfusion.libPath=C:\ColdFusion8\runtime/../lib -Dsun.io.useCanonCaches=false -Djmx.invoke.getters=true
java_command: <unknown>
Launcher Type: generic
Environment Variables:
PATH=C:\ColdFusion8\runtime\..\lib;C:\ColdFusion8\runtime\..\jintegra\bin;C:\ColdFusion8\runtime\..\jintegra\bin\international;C:\Program Files\CollabNet Subversion Server;C:\ColdFusion8\verity\k2\_nti40\bin;C:\CFusionMX7\verity\k2\_nti40\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\PROGRA~1\NcFTP;C:\WINDOWS\system32\WindowsPowerShell\v1.0;C:\WINDOWS\system32\WindowsPowerShell\v1.0;D:\StrawberryPerl\c\bin;D:\StrawberryPerl\perl\site\bin;D:\StrawberryPerl\perl\bin;D:\php\ext;
OS=Windows_NT
PROCESSOR_IDENTIFIER=x86 Family 6 Model 23 Stepping 6, GenuineIntel
--------------- S Y S T E M ---------------
OS: Windows Server 2003 family Build 3790 Service Pack 2
CPU:total 2 (1 cores per cpu, 1 threads per core) family 6 model 23 stepping 6, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1
Memory: 4k page, physical 3919344k(2146400k free), swap 7963284k(5585376k free)
vm_info: Java HotSpot(TM) Server VM (20.13-b02) for windows-x86 JRE (1.6.0_38-b05), built on Nov 14 2012 01:50:25 by "java_re" with MS VC++ 7.1 (VS2003)
time: Wed Feb 06 17:21:55 2013
elapsed time: 401 seconds
Crash Log #2 (Right after it crashed I attempted to restart the CF Service with the same configuration, but it would not even start, and produced this error):
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x7c8194cd, pid=6020, tid=5608
#
# JRE version: 6.0_38-b05
# Java VM: Java HotSpot(TM) Server VM (20.13-b02 mixed mode windows-x86 )
# Problematic frame:
# C [ntdll.dll+0x194cd]
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x568a2c00): JavaThread "C2 CompilerThread1" daemon [_thread_in_native, id=5608, stack(0x57060000,0x57160000)]
siginfo: ExceptionCode=0xc0000005, writing address 0x0000e010
Registers:
EAX=0x5b644638, EBX=0x5a9f0000, ECX=0x5b652658, EDX=0x0000e010
ESP=0x5715e534, EBP=0x5715e56c, ESI=0x003e0000, EDI=0x5b652650
EIP=0x7c8194cd, EFLAGS=0x00010283
Top of Stack: (sp=0x5715e534)
0x5715e534: 5b652650 5b652658 5b652650 580d0000
0x5715e544: 00000000 5b673000 00000000 567d0140
0x5715e554: 00000000 00000000 00000000 00000136
0x5715e564: 00000000 00000000 5715e590 7c81727a
0x5715e574: 5b653000 00020000 00000136 00001000
0x5715e584: 5b652650 003e0000 00000071 5715e7b8
0x5715e594: 7c82b460 003e0000 00007ff4 00007ff4
0x5715e5a4: 579f2508 7c86a7b2 00000000 5715e5d4
Instructions: (pc=0x7c8194cd)
0x7c8194ad: ff ff 0f b7 c8 8d 84 ce 78 01 00 00 39 00 0f 84
0x7c8194bd: 1d 55 ff ff 8b 50 04 8d 4f 08 89 01 89 51 04 57
0x7c8194cd: 89 0a 56 89 48 04 e8 5d 0f 01 00 66 83 7d fc 00
0x7c8194dd: 0f 85 0b 73 ff ff 8b 45 ec 85 c0 0f 85 8a 2d ff
Register to memory mapping:
EAX=0x5b644638 is an unknown value
EBX=0x5a9f0000 is pointing into the stack for thread: 0x57608400
ECX=0x5b652658 is an unknown value
EDX=0x0000e010 is an unknown value
ESP=0x5715e534 is pointing into the stack for thread: 0x568a2c00
EBP=0x5715e56c is pointing into the stack for thread: 0x568a2c00
ESI=0x003e0000 is an unknown value
EDI=0x5b652650 is an unknown value
Stack: [0x57060000,0x57160000], sp=0x5715e534, free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [ntdll.dll+0x194cd] RtlFreeThreadActivationContextStack+0x418
C [ntdll.dll+0x1727a] towlower+0xb1
C [ntdll.dll+0x2b460] wcscpy+0x175
C [MSVCR71.dll+0x16b3] _crtLCMapStringA+0x305
C [MSVCR71.dll+0x16db] _crtLCMapStringA+0x32d
V [jvm.dll+0x5edd3]
V [jvm.dll+0x5eff2]
V [jvm.dll+0x5f273]
V [jvm.dll+0x27ed7e]
V [jvm.dll+0x27c263]
V [jvm.dll+0x259bf9]
V [jvm.dll+0x2600a3]
V [jvm.dll+0x261382]
V [jvm.dll+0x24fa4a]
V [jvm.dll+0x4882d]
V [jvm.dll+0x493bf]
V [jvm.dll+0x12e8a4]
V [jvm.dll+0x15719c]
C [MSVCR71.dll+0x9565] endthreadex+0xa0
C [kernel32.dll+0x2482f] GetModuleHandleA+0xdf
Current CompileTask:
C2: 4520 97 java.util.Arrays.mergeSort([Ljava/lang/Object;[Ljava/lang/Object;IIILjava/util/Comparator;)V (235 bytes)
--------------- P R O C E S S ---------------
[I removed this section due to StackOverflow question size limits. Please let me know if you need to see this]
Other Threads:
0x56879000 VMThread [stack: 0x566d0000,0x567d0000] [id=320]
0x568a6c00 WatcherThread [stack: 0x57260000,0x57360000] [id=5852]
VM state:synchronizing (normal execution)
VM Mutex/Monitor currently owned by a thread: ([mutex/lock_event])
[0x003e61c8] Threads_lock - owner thread: 0x56879000
Heap
par new generation total 184320K, used 37692K [0x039d0000, 0x101d0000, 0x101d0000)
eden space 163840K, 23% used [0x039d0000, 0x05e9f248, 0x0d9d0000)
from space 20480K, 0% used [0x0d9d0000, 0x0d9d0000, 0x0edd0000)
to space 20480K, 0% used [0x0edd0000, 0x0edd0000, 0x101d0000)
concurrent mark-sweep generation total 843776K, used 1673K [0x101d0000, 0x439d0000, 0x439d0000)
concurrent-mark-sweep perm gen total 262144K, used 12535K [0x439d0000, 0x539d0000, 0x539d0000)
Code Cache [0x00710000, 0x00950000, 0x03710000)
total_blobs=244 nmethods=121 adapters=77 free_code_cache=49693952 largest_free_block=21888
Dynamic libraries:
[I removed this section due to StackOverflow question size limits. Please let me know if you need to see this]
VM Arguments:
jvm_args: -Xmx1024m -Xms1024m -Xmn200m -XX:MaxPermSize=256m -XX:PermSize=256m -Xloggc:C:\ColdFusion8\runtime/../logs/CFGC/CFGC.log -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Dcoldfusion.rootDir=C:\ColdFusion8\runtime/../ -Dcoldfusion.libPath=C:\ColdFusion8\runtime/../lib -Dsun.io.useCanonCaches=false -Djmx.invoke.getters=true
java_command: <unknown>
Launcher Type: generic
Environment Variables:
PATH=C:\ColdFusion8\runtime\..\lib;C:\ColdFusion8\runtime\..\jintegra\bin;C:\ColdFusion8\runtime\..\jintegra\bin\international;C:\Program Files\CollabNet Subversion Server;C:\ColdFusion8\verity\k2\_nti40\bin;C:\CFusionMX7\verity\k2\_nti40\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\PROGRA~1\NcFTP;C:\WINDOWS\system32\WindowsPowerShell\v1.0;C:\WINDOWS\system32\WindowsPowerShell\v1.0;D:\StrawberryPerl\c\bin;D:\StrawberryPerl\perl\site\bin;D:\StrawberryPerl\perl\bin;D:\php\ext;
OS=Windows_NT
PROCESSOR_IDENTIFIER=x86 Family 6 Model 23 Stepping 6, GenuineIntel
--------------- S Y S T E M ---------------
OS: Windows Server 2003 family Build 3790 Service Pack 2
CPU:total 2 (1 cores per cpu, 1 threads per core) family 6 model 23 stepping 6, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1
Memory: 4k page, physical 3919344k(2858352k free), swap 7963284k(5630528k free)
vm_info: Java HotSpot(TM) Server VM (20.13-b02) for windows-x86 JRE (1.6.0_38-b05), built on Nov 14 2012 01:50:25 by "java_re" with MS VC++ 7.1 (VS2003)
time: Wed Feb 06 17:22:00 2013
elapsed time: 4 seconds
You're running on a 32-bit VM, which imposes some limits on how large the Heap can be. I'm not the world's best authority on this, so you may need to do further googling to confirm/clarify some of these points.
You only have a 2 Gig memory limit per process, so you have a number of things to fit into there:
Your Heap. That's 1 gig in the example above.
Your Perm gen (that's not held in the heap, I believe). That's 256 meg.
Stack memory: Each thread has a stack, which is at least a meg in the example above. Multiply the number of threads by 1 meg and add that to the 1.25 gig above.
Native Java code. The code and native memory of the JVM need to be loaded or mapped into the memory space.
Other DLLs such as Windows functionality (ntdll.dll and MSVCR71.dll) and also I believe antivirus and various other cruft can get mapped in.
I believe that you can get memory allocation problems if the DLLs are mapped into memory locations all over your 2 Gig space: you get a kind of fragmentation where there's no large contiguous block of memory available.
Count the number of active threads (probably the bit you had to snip from the dumps above) and see how much of a dent that puts into your memory. Our application uses -Xss256k to reduce the thread stack size to 256k and we've tested this at scale and at load and it works well for our application, but you'll want to test your own app.
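The arithmetic above can be sketched as a small calculation. This is only a rough estimate under stated assumptions: the struct JvmBudget and both functions are hypothetical helpers, the thread count is a made-up input (the real number is in the snipped part of the dump), and JVM code, DLLs and fragmentation are not modelled at all.

```cpp
#include <cstdint>

// Rough address-space estimate, in megabytes, for a 32-bit JVM process.
// Only heap, perm gen and thread stacks are counted; native code, DLLs and
// fragmentation are NOT modelled, so the real headroom is smaller.
struct JvmBudget {
    std::uint32_t heap_mb;       // -Xmx
    std::uint32_t perm_mb;       // -XX:MaxPermSize
    std::uint32_t thread_count;  // assumed input, read from a thread dump
    std::uint32_t stack_kb;      // per-thread stack size, e.g. from -Xss
};

// Heap + perm gen + all thread stacks.
std::uint32_t committed_mb(const JvmBudget& b) {
    return b.heap_mb + b.perm_mb + (b.thread_count * b.stack_kb) / 1024;
}

// What is left of the per-process limit (2 gig by default here).
std::int64_t headroom_mb(const JvmBudget& b, std::uint32_t limit_mb = 2048) {
    return static_cast<std::int64_t>(limit_mb) - committed_mb(b);
}
```

For example, with the settings from the question and a hypothetical 200 threads at 1 meg of stack each, committed_mb gives 1024 + 256 + 200 = 1480, leaving 568 meg for everything else. Dropping the stack size to 256k brings the stack share down from 200 meg to 50 meg.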
After the crash when CF won't start at all, will it start with a lower -Xmx setting?
Try using the VMMap.exe tool from SysInternals. It'll show you more than you ever wanted to know about what your process is doing. ProcMon is also very useful for higher-level process info.
I thought I'd share my final solution with you all.
I'll spare you the boring details of how I reached this solution. Suffice it to say, the best advice I received was "Find out exactly what the problem is first".
I ended up installing JRockit as a means of identifying what pieces of code are chewing up the most RAM.
In the process of this, it was necessary to change the JRE that Coldfusion uses, to the JRockit internal one.
#java.home=D:/Program Files/Java/jdk1.6.0_38/jre
java.home=D:/Program Files/Java/jrockit-jdk1.6.0_37-R28.2.5-4.1.0/jre
I believe that this JRE is made by BEA as opposed to SUN? Don't quote me on that. But I needed to change my java args as follows:
java.args=-server -Xmx1200m -Xms1200m -Xns300m -Xgc:singlecon -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib -Dcom.sun.management.jmxremote.port=53578 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
I've opened up a random port that JRockit can use to profile the JVM Heap usage. It's pointed me in the direction of several pieces of code, that I would never have thought were problematic.
However, I have also been tinkering with the RAM settings, and found that 1200m seems to be working quite alright.
Lesson learned: Not all JREs were created equal.
I was having a discussion about the relative cost of fork() Vs thread() for parallelization of a task.
We understand the basic differences between processes and threads:
Thread:
Easy to communicate between threads
Fast context switching.
Processes:
Fault tolerance.
Communicating with parent not a real problem (open a pipe)
Communication with other child processes hard
But we disagreed on the start-up cost of processes Vs threads.
So to test the theories I wrote the following code. My question: is this a valid test for measuring the start-up cost, or am I missing something? I would also be interested in how each test performs on different platforms.
fork.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <unistd.h>
#include <sys/wait.h> // waitpid
#include <iostream>
#include <stdlib.h>
#include <time.h>
extern "C" int threadStart(void* threadData)
{
return 0;
}
int main(int argc,char* argv[])
{
int threadCount = boost::lexical_cast<int>(argv[1]);
std::vector<pid_t> data(threadCount);
clock_t start = clock();
for(int loop=0;loop < threadCount;++loop)
{
data[loop] = fork();
if (data[loop] == -1)
{
std::cout << "Abort\n";
exit(1);
}
if (data[loop] == 0)
{
exit(threadStart(NULL));
}
}
clock_t middle = clock();
for(int loop=0;loop < threadCount;++loop)
{
int result;
waitpid(data[loop], &result, 0);
}
clock_t end = clock();
std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";
}
Thread.cpp
#include <boost/lexical_cast.hpp>
#include <vector>
#include <iostream>
#include <pthread.h>
#include <time.h>
#include <stdlib.h>
extern "C" void* threadStart(void* threadData)
{
return NULL;
}
int main(int argc,char* argv[])
{
int threadCount = boost::lexical_cast<int>(argv[1]);
std::vector<pthread_t> data(threadCount);
clock_t start = clock();
for(int loop=0;loop < threadCount;++loop)
{
if (pthread_create(&data[loop], NULL, threadStart, NULL) != 0)
{
std::cout << "Abort\n";
exit(1);
}
}
clock_t middle = clock();
for(int loop=0;loop < threadCount;++loop)
{
void* result;
pthread_join(data[loop], &result);
}
clock_t end = clock();
std::cout << threadCount << "\t" << middle - start << "\t" << end - middle << "\t"<< end - start << "\n";
}
I expect Windows to do worse in processes creation.
But I would expect modern Unix-like systems to have a fairly light fork cost that is at least comparable to threads. On older Unix-style systems (before fork() was implemented using copy-on-write pages) it would be worse.
Anyway, my timing results are:
> uname -a
Darwin Alpha.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
> gcc --version | grep GCC
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5659)
> g++ thread.cpp -o thread -I~/include
> g++ fork.cpp -o fork -I~/include
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./thread ${a} >> A
foreach? end
> foreach a ( 1 2 3 4 5 6 7 8 9 10 12 15 20 30 40 50 60 70 80 90 100 )
foreach? ./fork ${a} >> A
foreach? end
vi A
(C = number of threads/children; times are in clock() ticks)
Thread:                       Fork:
  C  Start   Wait  Total        C  Start   Wait  Total
======================================================
  1     26    145    171        1    160     37    197
  2     44    198    242        2    290     37    327
  3     62    234    296        3    413     41    454
  4     77    275    352        4    499     59    558
  5     91    107  10808        5    599     57    656
  6     99    332    431        6    665     52    717
  7    130    388    518        7    741     69    810
  8    204    468    672        8    833     56    889
  9    164    469    633        9   1067     76   1143
 10    165    450    615       10   1147     64   1211
 12    343    585    928       12   1213     71   1284
 15    232    647    879       15   1360    203   1563
 20    319    921   1240       20   2161     96   2257
 30    461   1243   1704       30   3005    129   3134
 40    559   1487   2046       40   4466    166   4632
 50    686   1912   2598       50   4591    292   4883
 60    827   2208   3035       60   5234    317   5551
 70    973   2885   3858       70   7003    416   7419
 80   3545   2738   6283       80   7735    293   8028
 90   1392   3497   4889       90   7869    463   8332
100   3917   4180   8097      100   8974    436   9410
Edit:
Running 1000 children caused the fork version to fail.
So I have reduced the child count. But a single test also seems unfair, so here is a range of values.
Mumble... I do not like your solution, for several reasons:
You are not taking into account the execution time of the child processes/threads.
You should compare CPU usage, not bare elapsed time. That way your statistics will not depend on, e.g., disk access congestion.
Let your child process do something. Remember that "modern" fork uses copy-on-write to avoid allocating memory for the child process until it is needed. Exiting immediately is too easy: it sidesteps almost all the disadvantages of fork.
CPU time is not the only cost you have to account for. Memory consumption and the slowness of IPC are both disadvantages of the fork solution.
You could use getrusage() instead of clock() to measure real resource usage.
P.S. I do not think you can really measure the process/thread overhead by writing a simple test program. There are too many factors and, usually, the choice between threads and processes is driven by reasons other than mere CPU usage.
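As a rough sketch of the suggestions above (not the original poster's benchmark), the following measures the CPU time charged to a forked child with getrusage() and makes the child write to every page of a buffer so that copy-on-write actually has to copy something. The helper names cpuMicros and forkAndTouch are hypothetical, chosen only for this illustration.

```cpp
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Convert a timeval to microseconds.
static long long toMicros(const timeval& tv) {
    return static_cast<long long>(tv.tv_sec) * 1000000LL + tv.tv_usec;
}

// User + system CPU time charged to `who` (RUSAGE_SELF or RUSAGE_CHILDREN),
// in microseconds, or -1 if getrusage() fails. Hypothetical helper.
long long cpuMicros(int who) {
    rusage usage;
    if (getrusage(who, &usage) != 0) return -1;
    return toMicros(usage.ru_utime) + toMicros(usage.ru_stime);
}

// Fork one child that writes one byte to every page of `buffer`, which forces
// the kernel to make private copies of those pages. Returns the CPU time (in
// microseconds) added to this process's waited-for children, or -1 on error.
// Hypothetical helper, not part of the original test programs.
long long forkAndTouch(std::vector<char>& buffer) {
    const long pageSize = sysconf(_SC_PAGESIZE);
    const long long before = cpuMicros(RUSAGE_CHILDREN);
    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        volatile char* bytes = &buffer[0];
        for (std::size_t i = 0; i < buffer.size(); i += pageSize) bytes[i] = 1;
        _exit(0);  // skip atexit handlers and stdio flushing in the child
    }
    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return cpuMicros(RUSAGE_CHILDREN) - before;
}
```

Calling forkAndTouch on buffers of different sizes gives a feel for how much of the fork cost is deferred until the child touches memory. The parent's copy of the buffer is left unchanged, because the child's writes land in its own private pages.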
Under Linux, fork is a special case of a call to sys_clone, made either within the library or within the kernel. Clone has lots of switches to flip on and off, and each of them affects how expensive it is to start.
The actual library function clone is probably more expensive than fork, though, because it does more; most of that extra work is on the child side (stack swapping and calling a function by pointer).
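To illustrate what one of those switches does, here is a small Linux-only sketch using the glibc clone() wrapper. It assumes _GNU_SOURCE is defined (g++ does this by default) and a downward-growing stack; the function names bumpCounter and counterAfterClone are made up for this example. With CLONE_VM the child runs in the parent's address space, so its write is visible to the parent; without it the child gets a copy-on-write copy, much like fork.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <cstddef>
#include <vector>

// Child entry point: increment the int that `arg` points to.
static int bumpCounter(void* arg) {
    ++*static_cast<int*>(arg);
    return 0;
}

// Start a child with clone() using `extraFlags` (e.g. 0 or CLONE_VM), wait for
// it, and return the parent's view of the counter afterwards (-1 on error).
// Hypothetical helper for illustration only.
int counterAfterClone(int extraFlags) {
    int counter = 0;
    std::vector<char> stack(64 * 1024);          // stack for the child
    char* stackTop = &stack[0] + stack.size();   // assumes the stack grows down
    // SIGCHLD as the exit signal lets a plain waitpid() reap the child.
    pid_t pid = clone(bumpCounter, stackTop, extraFlags | SIGCHLD, &counter);
    if (pid == -1) return -1;
    if (waitpid(pid, NULL, 0) == -1) return -1;
    return counter;
}
```

On a typical Linux system, counterAfterClone(CLONE_VM) reports the child's increment while counterAfterClone(0) does not, since in the second case the child modified its own copy of the page.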
What that micro-benchmark shows is that thread creation and joining (there are no fork results when I'm writing this) takes tens or hundreds of microseconds (assuming your system has CLOCKS_PER_SEC=1000000, which it probably has, since it's an XSI requirement).
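For reference, the unit conversion behind that reading can be sketched as below. The function name ticksToMicroseconds is made up for this example. When CLOCKS_PER_SEC is 1000000, one tick is one microsecond, so the 171-tick total in the one-thread row corresponds to 171 microseconds. Note also that clock() reports processor time used by the calling process rather than wall-clock time.

```cpp
#include <ctime>

// Convert a clock() tick count to microseconds using CLOCKS_PER_SEC.
// Hypothetical helper, not part of the original test programs.
double ticksToMicroseconds(std::clock_t ticks) {
    return static_cast<double>(ticks) * 1e6 / CLOCKS_PER_SEC;
}
```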
Since you said that fork() takes 3 times the cost of threads, we are still talking tenths of a millisecond at worst. If that is noticeable in an application, you could use pools of processes/threads, like Apache 1.3 did. In any case, I'd say that startup time is a moot point.
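A pool pays the start-up cost once and then reuses the workers. The following is a minimal thread-pool sketch under some assumptions: it uses C++11 std::thread rather than the raw pthreads calls in the question, the class name WorkerPool is invented here, and it is not how Apache implements its pool. Compile with something like -std=c++11 -pthread.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Fixed-size pool: threads are created once and then reused for every task.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t workerCount) : stopping_(false) {
        for (std::size_t i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Finishes all queued tasks, then joins the workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) worker.join();
    }

    // Queue a task; one idle worker is woken to run it.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()> > tasks_;
    bool stopping_;
    std::vector<std::thread> workers_;
};
```

With this shape, submitting a task costs a lock and a notification instead of a pthread_create or fork, which is why start-up time stops mattering once a pool is in place.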
The important difference between threads and processes (on Linux and most Unix-likes) is that with processes you choose explicitly what to share, using IPC, shared memory (SYSV or mmap-style), pipes, sockets (you can send file descriptors over AF_UNIX sockets, meaning you get to choose which fds to share), and so on. With threads, almost everything is shared by default, whether there's a need to share it or not. In fact, that is the reason Plan 9 had rfork() and Linux has clone() (and more recently unshare()): so you can choose what to share.
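A small sketch of that "share only what you choose" model, assuming a system that supports anonymous shared mappings via MAP_ANONYMOUS (Linux and recent macOS do). The function name forkWithOneSharedInt is hypothetical. One int lives in an mmap-style shared mapping created before fork(), the other is an ordinary local variable; only the first is visible across the process boundary.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>

// Fork a child that writes 42 to both a shared int and a private int, then
// report what the parent sees in each. Returns false on any system-call error.
// Hypothetical helper for illustration only.
bool forkWithOneSharedInt(int& sharedOut, int& privateOut) {
    int privateValue = 0;  // ordinary memory: copied (copy-on-write) at fork
    void* mem = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;
    int* sharedValue = static_cast<int*>(mem);  // explicitly shared memory
    *sharedValue = 0;

    pid_t pid = fork();
    if (pid == -1) {
        munmap(mem, sizeof(int));
        return false;
    }
    if (pid == 0) {
        *sharedValue = 42;   // visible to the parent
        privateValue = 42;   // changes only the child's copy
        _exit(0);
    }
    bool waited = waitpid(pid, NULL, 0) != -1;
    sharedOut = *sharedValue;
    privateOut = privateValue;
    munmap(mem, sizeof(int));
    return waited;
}
```

In the parent, the shared int reads 42 after the child exits while the private one is still 0. With threads, both writes would have been visible without any setup.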