package main
import (
"fmt"
"runtime"
"sync"
)
var wg sync.WaitGroup
func alphabets() {
for char := 'a'; char < 'a'+26; char++ {
fmt.Printf("%c ", char)
}
wg.Done() //decrement number of goroutines to wait for
}
func numbers() {
for number := 1; number < 27; number++ {
fmt.Printf("%d ", number)
}
wg.Done()
}
func main() {
runtime.GOMAXPROCS(2)
wg.Add(2) //wait for two goroutines
fmt.Println("Starting Go Routines")
go alphabets()
go numbers()
fmt.Println("\nWaiting To Finish")
wg.Wait() //wait for the two goroutines to finish executing
fmt.Println("\nTerminating Program")
}
I expected the output to be mixed up (for lack of a better word), but instead a sample output is:
$ go run parallel_prog.go
Starting Go Routines
Waiting To Finish
a b c d e f g h i j k l m n o p q r s t u v w x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Terminating Program
What am I missing?
Thanks,
You're missing nothing. It's working. The output isn't "interlaced" (mixed up) not because the goroutines aren't being run in parallel, but because each loop finishes really fast, often before the other goroutine has started printing.
You can easily add some calls to time.Sleep to see the interleaving better. With each goroutine sleeping between prints, the letters and the numbers have to come out interlaced.
Your program with Sleep calls to "force" interlacing
package main
import (
"fmt"
"sync"
"time"
)
var wg sync.WaitGroup
func alphabets() {
defer wg.Done()
for char := 'a'; char < 'a'+26; char++ {
fmt.Printf("%c ", char)
time.Sleep(time.Second * 2)
}
}
func numbers() {
defer wg.Done()
for number := 1; number < 27; number++ {
fmt.Printf("%d ", number)
time.Sleep(time.Second * 3)
}
}
func main() {
fmt.Println("Starting Go Routines")
wg.Add(2)
go alphabets()
go numbers()
fmt.Println("\nWaiting To Finish")
wg.Wait()
fmt.Println("\nTerminating Program")
}
Note
You probably already know this, but setting GOMAXPROCS doesn't have any effect on whether or not this example is executed in parallel, just how many resources it consumes.
The GOMAXPROCS setting controls how many operating systems threads attempt to execute code simultaneously. For example, if GOMAXPROCS is 4, then the program will only execute code on 4 operating system threads at once, even if there are 1000 goroutines. The limit does not count threads blocked in system calls such as I/O.
Source: Go 1.5 GOMAXPROCS Default
Are you using the Go playground by any chance? When I run your code locally, I get:
Starting Go Routines
Waiting To Finish
1 2 3 4 5 6 7 8 9 10 11 12 a 13 14 15 16 17 b 18 19 c 20 21 d 22 23 e 24 25 f 26 g h i j k l m n o p q r s t u v w x y z
Terminating Program
The playground is deterministic in nature. Goroutines don't yield as often and don't run in multiple threads.
Some information about the overall project:
I have to find out whether specific nodes remain connected as I remove the lowest-width edges from a graph. I have a struct solve with a member array called connected. In a method of this struct, FindConnections, I go over some of the edges, from the Kth to the last, and see which nodes are connected. To keep track of the connected nodes, the array stores for each node the lowest-id node it is connected to, with the lowest node pointing to itself.
For example,
if nodes 2 5 6 12 are directly connected:
connected[2] = connected[5] = connected[6] = connected[12] = 2
So now if 12 and 23 are connected (and 12 is the lowest connection of 23):
connected[23] = 12 and connected[connected[23]] = 2, so I can reach 2 from 23 by recursion.
My problem is that after FindConnections finishes modifying the connected array, some of the changes are preserved while others are not.
Here is the code:
void FindConnections(int index)
{
for (int temp, i = index; i < NumberOfPortals; i++)
{
temp = min(first[i], second[i]); // the nodes which edge i connects
connected[first[i]] = temp;
connected[second[i]] = temp;
}
}
which is called by
void seeAllConnections() // this function is for visualization it will not be included
{
for (int i = NumberOfPortals - 1; i >= 0; --i)
{
printf("Only using %d Portals:\n", NumberOfPortals - i);
FindConnections(i);
seeconnected(); // prints connected array
for (int i = 0; i < NumberOfUniverses; i++) //resets connected array
{
connected[i] = i;
}
}
}
In the first two iterations of the for loop in seeAllConnections everything is fine. The edges that are examined are:
first second width(irrelevant for now)
6 7 255
26 2 111
11 7 36
In the beginning every node is connected to itself.
In the first one we get the output
(I am placing ** around the values that changed and !! around the one that was supposed to change but didn't, just so you can see it better; the program prints just the numbers)
Only using 1 Portals:
connected are:
0 1 2 3 4 5 6 7 8 9 10 *7* 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
and we can see that connected[11] = 7 just like we wanted to
in the second one
Only using 2 Portals:
connected are:
0 1 2 3 4 5 6 7 8 9 10 *7* 12 13 14 15 16 17 18 19 20 21 22 23 24 25 *2* 27 28 29
connected[11] =7 just like before
connected[26] = 2 just like we wanted
in the third one
Only using 3 Portals:
connected are:
0 1 2 3 4 5 6 !7! 8 9 10 *7* 12 13 14 15 16 17 18 19 20 21 22 23 24 25 *2* 27 28 29
connected[7] = 7, not 6
Moreover, when I use gdb inside the FindConnections loop, connected[7] = 6, just like we wanted:
(gdb) print first[i]
$10 = 6
(gdb) print second[i]
$11 = 7
(gdb) print connected[first[i]]
$12 = 6
(gdb) print connected[second[i]]
$13 = 6
but when it exits the function and returns to seeAllConnections
connected[7] = 7
What am I doing wrong? How can the first two changes be preserved from the same function in the same struct in the same loop, while this one isn't?
Also, every time after I call FindConnections I reset the array to its original values, so the changes couldn't have been carried over from before.
Thank you in advance
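I found out what was wrong. seeAllConnections iterates in reverse, but each call to FindConnections(i) processes the edges in forward order from i to the end. In the third call the edge 6-7 sets connected[7] = 6, and the later edge 11-7 then overwrites it with min(11, 7) = 7.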
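A sturdier way to track connectivity than assigning into connected[] directly is a disjoint-set (union-find) structure, which always links the roots of the two nodes, so the order in which edges are processed stops mattering. The following is only a minimal sketch, assuming nodes are numbered from 0 to n-1; the names DisjointSet, find, unite and connected are illustrative and not part of the original struct.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the original code.
class DisjointSet {
public:
    explicit DisjointSet(int n) : parent_(n) {
        std::iota(parent_.begin(), parent_.end(), 0); // every node starts as its own root
    }

    // Follow parent links up to the representative, shortening the path on the way.
    int find(int node) {
        assert(node >= 0 && node < static_cast<int>(parent_.size()));
        while (parent_[node] != node) {
            parent_[node] = parent_[parent_[node]]; // path halving
            node = parent_[node];
        }
        return node;
    }

    // Link the two roots; the lower id stays the representative, as in the question.
    void unite(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) return;
        if (ra > rb) std::swap(ra, rb);
        parent_[rb] = ra;
    }

    bool connected(int a, int b) { return find(a) == find(b); }

private:
    std::vector<int> parent_;
};
```

With the three edges from the question applied in the same order (6-7, 26-2, 11-7), find(7) and find(11) both return 6, because the last edge links the root of 11 to the root of 7 instead of overwriting the entry for 7.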
package main
import (
"fmt"
"math"
)
func pow(x, n, lim float64) float64 {
if v := math.Pow(x, n); v < lim {
return v
} else {
fmt.Printf("%g >= %g\n", v, lim)
}
// can't use v here, though
return lim
}
func main() {
fmt.Println(
pow(3, 2, 10),
pow(3, 3, 20),
)
}
This code is from "A Tour of Go".
Expectation:
9
10
27 >= 20
20
Output:
27 >= 20
9 20
I don't understand this. Please help!
Println prints the results of both pow calls on a single line, separated by a space, and appends a single \n at the end.
package main
import (
"fmt"
"math"
)
func pow(x, n, lim float64) float64 {
if v := math.Pow(x, n); v < lim {
return v
} else {
fmt.Printf("%g >= %g\n", v, lim)
}
// can't use v here, though
return lim
}
func main() {
fmt.Println(pow(3, 2, 10))
fmt.Println(pow(3, 3, 20))
}
Playground
And the 10 in your expected output, which is the limit in the first call, will never be printed.
9
10
27 >= 20
20
That is because pow returns v from inside the if block, before it ever reaches the return lim statement.
Arguments are evaluated first, and the Printf inside pow is in the else branch, so it runs conditionally.
First, the arguments to the Println in main() are evaluated. The first call to pow results in 9 which is less than lim, so pow itself prints nothing and returns 9. The second call to pow results in 27 which is greater than lim, so pow prints 27 >= 20 and returns 20. Then, with the arguments handled, the call to Println in main is executed, printing 9 20.
Consider the following code snippet of this class template...
template<class T>
class FileTemplate {
private:
std::vector<T> vals_;
std::string filenameAndPath_;
public:
inline FileTemplate( const std::string& filenameAndPath, const T& multiplier ) :
filenameAndPath_( filenameAndPath ) {
std::fstream file;
if ( !filenameAndPath_.empty() ) {
file.open( filenameAndPath_ );
T val = 0;
while ( file >> val ) {
vals_.push_back( val );
}
file.close();
for ( unsigned i = 0; i < vals_.size(); i++ ) {
vals_[i] *= multiplier;
}
file.open( filenameAndPath_ );
for ( unsigned i = 0; i < vals_.size(); i++ ) {
file << vals_[i] << " ";
}
file.close();
}
}
inline std::vector<T> getValues() const {
return vals_;
}
};
When used in main as shown below, with the lower section commented out and with the following pre-populated text file:
values.txt
1 2 3 4 5 6 7 8 9
int main() {
std::string filenameAndPath( "_build/values.txt" );
std::fstream file;
FileTemplate<unsigned> ft( filenameAndPath, 5 );
std::vector<unsigned> results = ft.getValues();
for ( auto r : results ) {
std::cout << r << " ";
}
std::cout << std::endl;
/*
FileTemplate<float> ft2( filenameAndPath, 2.5f );
std::vector<float> results2 = ft2.getValues();
for ( auto r : results2 ) {
std::cout << r << " ";
}
std::cout << std::endl;
*/
std::cout << "\nPress any key and enter to quit." << std::endl;
char q;
std::cin >> q;
return 0;
}
and I run this code through the debugger, sure enough both the output to the screen and the file are changed to
values.txt - overwritten contents -
5 10 15 20 25 30 35 40 45
Then let's say I don't change any code, just stop debugging or running the application, and run it again 2 more times. The outputs respectively are:
values.txt - iterations 2 & 3
25 50 75 100 125 150 175 200 225 250
125 250 375 500 625 750 875 1000 1125 1250
Okay, good so far. Now let's reset the values in the text file back to the defaults, uncomment the 2nd instantiation of this class template (for float, with a multiplier value of 2.5f), and run this 3 times.
values.txt - reset to default
1 2 3 4 5 6 7 8 9
- Iterations 1, 2 & 3 with both unsigned & float; the multipliers are <5, 2.5> respectively: 5 for the unsigned and 2.5 for the float
- Iteration 1
cout:
5 10 15 20 25 30 35 40 45
12.5 25 37.5 50 62.5 75 87.5 100 112.5
values.txt:
12.5 25 37.5 50 62.5 75 87.5 100 112.5
- Iteration 2
cout:
60
150 12.5 62.5 93.75 125 156.25 187.5 218.75 250 281.25
values.txt:
150 12.5 62.5 93.75 125 156.25 187.5 218.75 250 281.25
- Iteration 3
cout:
750 60
1875 150 12.5 156.25 234.375 312.5 390.625 468.75 546.875 625 703.125
values.txt:
1875 150 12.5 156.25 234.375 312.5 390.625 468.75 546.875 625 703.125
A couple of questions come to mind, both regarding the same behavior of this program.
The first and primary question is: are the file read and write calls being done at compile time, considering this is a class template and the constructor is inline?
After running the debugger a couple of times, why does the number of values in the file keep growing? I started off with 9, but after an iteration or so there are 10, then 11.
This part is just for fun if you want to answer:
The third and final question is opinion based, but purely for educational purposes, as I would like to see what the community thinks: what are the pros & cons of this type of programming? What are the potentials and the limits? Are there any practical real-world applications & production benefits to this?
In terms of the other issues: the main one is that you are not truncating the file when you do the second file.open call. You need:
file.open( filenameAndPath_, std::fstream::trunc|std::fstream::out );
What is happening is that when you read an unsigned int from a file containing floating-point numbers, it reads the first number (e.g. 12.5) only up to the decimal point and then stops (i.e. it reads only 12), because the ".5" that follows does not look like an unsigned int. This means it only reads the number 12, multiplies it by 5 to get 60, and writes that to the file.
Unfortunately, because you don't truncate the file when writing the 60, the rest of the original text stays in place after it and is interpreted as additional numbers in the next read loop. Hence, the leading 12.5 appears in the file as 60 5.
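Both effects can be reproduced in memory without touching a file. The sketch below is an illustration under assumptions, not the code from the question: parse_values and overwrite_without_truncating are made-up names, and a std::stringstream stands in for the file, since writing to a stream that already holds text also starts at offset 0 and leaves the old tail in place.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated values of type T until the first extraction fails.
template <class T>
std::vector<T> parse_values(const std::string& text) {
    std::istringstream in(text);
    std::vector<T> values;
    T value{};
    while (in >> value) {
        values.push_back(value);
    }
    return values;
}

// In-memory stand-in for reopening a file without std::ios::trunc:
// the new text overwrites the beginning and the old tail stays where it was.
std::string overwrite_without_truncating(const std::string& old_content,
                                         const std::string& new_text) {
    std::stringstream file(old_content); // the put position starts at offset 0
    file << new_text;
    return file.str();
}
```

Parsing "12.5 25 37.5 " as unsigned yields the single value 12, because the extraction stops at the decimal point and the next attempt fails on ".5". Overwriting that text with "60 " gives "60 5 25 37.5 ", which then parses as four floats: the first value has turned into two.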
stream buffers
Extracts as many characters as possible from the stream and inserts them into the output sequence controlled by the stream buffer object pointed by sb (if any), until either the input sequence is exhausted or the function fails to insert into the object pointed by sb.
(http://www.cplusplus.com/reference/istream/istream/operator%3E%3E/)
So I posted a similar question to this earlier, but I didn't post enough code to get the help I needed. Even if I went back and added that code now, I don't think it would be noticed because the question is old and "answered". So here's my issue:
I'm trying to generate a section of the Mandelbrot fractal. I can generate it fine, but when I add more threads, no matter how large the problem size is, the extra threads generate no speedup. I am completely new to multithreading and it's probably just something small I'm missing. Anyway, here are the functions that generate the fractal:
void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int numThreads) {
using namespace std;
vector<thread> threads (numThreads);
int rowsPerThread = Y/numThreads;
mutex m;
for(int i=0; i<numThreads; i++) {
threads[i] = thread ([&](){
vector<int> row;
for(int j=(i-1)*rowsPerThread; j<i*rowsPerThread; j++) {
row = mandelbrot_row(j, X, Y);
{
lock_guard<mutex> lock(m);
pixels[j] = row;
}
}
});
}
for(int i=0; i<numThreads; i++) {
threads[i].join();
}
}
std::vector<int> mandelbrot_row(int rowNum, int topX, int topY) {
std::vector<int> row (topX);
for(int i=0; i<topX; i++) {
row[i] = mandelbrotOne(i, rowNum, topX, topY);
}
return row;
}
int mandelbrotOne(int currX, int currY, int X, int Y) { //code adapted from http://en.wikipedia.org/wiki/Mandelbrot_set
double x0 = convert(X, currX, true);
double y0 = convert(Y, currY, false);
double x = 0.0;
double y = 0.0;
double xtemp;
int iteration = 0;
int max_iteration = 255;
while ( x*x + y*y < 2*2 && iteration < max_iteration) {
xtemp = x*x - y*y + x0;
y = 2*x*y + y0;
x = xtemp;
++iteration;
}
return iteration;
}
mandelbrot_all is passed a vector to hold the pixels, the maximum X and Y of the vector, and the number of threads to use, which is taken from the command line when the program is run. It attempts to split the work by row among multiple threads. Unfortunately, it seems that even if that is what it's doing, it's not making it any faster. If you need more details, feel free to ask and I will do my best to provide them.
Thanks in advance for the help.
Edit: reserved vectors in advance
Edit 2: ran this code with problem size 9600x7200 on a quad core laptop. It took an average of 36590000 cycles for one thread (over 5 runs) and 55142000 cycles for four threads.
Your code might appear to do parallel processing, but in practice it doesn't.
Basically, you are spending your time copying data around and queueing for memory allocator accesses.
Besides, you are using the unprotected loop index i as if there was nothing to it, which will feed your worker threads with random junk instead of beautiful slices of the image.
As usual, C++ is hiding these sad facts from you under a thick crust of syntactic sugar.
But the greatest flaw of your code is the algorithm itself, as you might see if you read further ahead.
Since this example seems a textbook case of parallel processing to me and I never saw an "educational" analysis of it, I will try one.
Functional analysis
You want to use all CPU cores to crunch pixels of the Mandelbrot set. This is a perfect case of parallelizable computation, since each pixel computation can be done independently.
So basically if you have N cores on your machine, you should have exactly one thread per core doing 1/N of the processing.
Unfortunately, dividing the input data so that each processor ends up doing 1/N of the needed processing is not as obvious as it might seem.
A given pixel can take from 0 to 255 iterations to compute. "black" pixels are 255 times more costly than "white" ones.
So if you simply divide your picture into N equal sub-surfaces, chances are all of your processors will breeze through "white" areas except one that will crawl through a "black" area. As a result, the slowest area computation time will dominate and parallelization will gain practically nothing.
In real cases, this will not be as dramatic, but still a huge loss of computing power.
Load balancing
To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one.
That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
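The difference between the two strategies can be estimated with a small simulation. This is a rough sketch under simplifying assumptions (chunk costs known in advance, zero scheduling overhead); static_makespan and dynamic_makespan are illustrative names, not part of the proof of concept described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Static split: each worker gets one contiguous block of chunks.
// Returns the finish time of the slowest worker.
long static_makespan(const std::vector<long>& costs, int workers) {
    assert(workers > 0);
    const std::size_t n = costs.size();
    long worst = 0;
    for (int w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        long total = 0;
        for (std::size_t i = begin; i < end; ++i) total += costs[i];
        worst = std::max(worst, total);
    }
    return worst;
}

// Dynamic split: whichever worker becomes free first takes the next chunk.
long dynamic_makespan(const std::vector<long>& costs, int workers) {
    assert(workers > 0);
    // Min-heap of the times at which each worker becomes free.
    std::priority_queue<long, std::vector<long>, std::greater<long>> free_at;
    for (int w = 0; w < workers; ++w) free_at.push(0);
    long finish = 0;
    for (long cost : costs) {
        const long start = free_at.top();
        free_at.pop();
        free_at.push(start + cost);
        finish = std::max(finish, start + cost);
    }
    return finish;
}
```

For the costs {1, 1, 1, 1, 10, 10, 10, 10} and 2 workers, the static split finishes at 40 (one worker gets all the "black" chunks), while the dynamic split finishes at 22.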
Ideally the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.
Unfortunately, due to the chaotic nature of the Mandelbrot set, there is no practical way of predicting the computation time of a given area.
If we decide the chunks will be horizontal slices of the picture, sorting them in natural y order is clearly suboptimal. If that particular area is a kind of "white to black" gradient, the most costly lines will all be bunched at the end of the chunks list and you will end up computing the costliest bits last, which is the worst case for load balancing.
A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle input patterns at random.
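As a standalone illustration of the butterfly ordering, the sketch below follows the same recursion as the butterfly_jobs lambda in the compute() method shown later in this answer. It assumes that the second argument of setup is the index of the image slice assigned to a queue slot; butterfly_fill and butterfly_order are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Fills order[slot] with the index of the image slice assigned to that queue slot.
void butterfly_fill(std::vector<int>& order, int min, int max) {
    if (min > max) return;
    order[min] = max;            // lowest free slot gets the highest remaining slice
    if (min == max) return;
    order[max] = min;            // highest free slot gets the lowest remaining slice
    const int mid = (min + max) / 2;
    butterfly_fill(order, min + 1, mid);
    butterfly_fill(order, mid + 1, max - 1);
}

std::vector<int> butterfly_order(int num_jobs) {
    assert(num_jobs >= 0);
    std::vector<int> order(num_jobs, -1);
    butterfly_fill(order, 0, num_jobs - 1);
    return order;
}
```

For 8 jobs this yields the order 7 3 2 1 6 5 4 0, so neighbouring queue slots no longer map to neighbouring slices of the picture.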
Here are two outputs of my proof of concept implementation:
Jobs are executed in reverse order (job 39 is the first, job 0 is the last).
Each line decodes as follows:
t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)
1) 40 jobs with butterfly ordering
job 0: t 1-1 b 162 e 174 d 12 // the 4 tasks finish within 5 ms from each other
job 1: t 0-0 b 156 e 176 d 20 //
job 2: t 2-2 b 155 e 173 d 18 //
job 3: t 3-3 b 154 e 174 d 20 //
job 4: t 1-1 b 141 e 162 d 21
job 5: t 2-2 b 137 e 155 d 18
job 6: t 0-0 b 136 e 156 d 20
job 7: t 3-3 b 133 e 154 d 21
job 8: t 1-1 b 117 e 141 d 24
job 9: t 0-0 b 116 e 136 d 20
job 10: t 2-2 b 115 e 137 d 22
job 11: t 3-3 b 113 e 133 d 20
job 12: t 0-0 b 99 e 116 d 17
job 13: t 1-1 b 99 e 117 d 18
job 14: t 2-2 b 96 e 115 d 19
job 15: t 3-3 b 95 e 113 d 18
job 16: t 0-0 b 83 e 99 d 16
job 17: t 3-3 b 80 e 95 d 15
job 18: t 2-2 b 77 e 96 d 19
job 19: t 1-1 b 72 e 99 d 27
job 20: t 3-3 b 69 e 80 d 11
job 21: t 0-0 b 68 e 83 d 15
job 22: t 2-2 b 63 e 77 d 14
job 23: t 1-1 b 56 e 72 d 16
job 24: t 3-3 b 54 e 69 d 15
job 25: t 0-0 b 53 e 68 d 15
job 26: t 2-2 b 48 e 63 d 15
job 27: t 0-0 b 41 e 53 d 12
job 28: t 3-3 b 40 e 54 d 14
job 29: t 1-1 b 36 e 56 d 20
job 30: t 3-3 b 29 e 40 d 11
job 31: t 2-2 b 29 e 48 d 19
job 32: t 0-0 b 23 e 41 d 18
job 33: t 1-1 b 18 e 36 d 18
job 34: t 2-2 b 16 e 29 d 13
job 35: t 3-3 b 15 e 29 d 14
job 36: t 2-2 b 0 e 16 d 16
job 37: t 3-3 b 0 e 15 d 15
job 38: t 1-1 b 0 e 18 d 18
job 39: t 0-0 b 0 e 23 d 23
You can see load balancing at work when a thread having processed a few small jobs will overtake another that took more time to process its own chunks.
2) 40 jobs with linear ordering
job 0: t 2-2 b 157 e 180 d 23 // last thread lags 17 ms behind first
job 1: t 1-1 b 154 e 175 d 21
job 2: t 3-3 b 150 e 171 d 21
job 3: t 0-0 b 143 e 163 d 20 // 1st thread ends
job 4: t 2-2 b 137 e 157 d 20
job 5: t 1-1 b 135 e 154 d 19
job 6: t 3-3 b 130 e 150 d 20
job 7: t 0-0 b 123 e 143 d 20
job 8: t 2-2 b 115 e 137 d 22
job 9: t 1-1 b 112 e 135 d 23
job 10: t 3-3 b 112 e 130 d 18
job 11: t 0-0 b 105 e 123 d 18
job 12: t 3-3 b 95 e 112 d 17
job 13: t 2-2 b 95 e 115 d 20
job 14: t 1-1 b 94 e 112 d 18
job 15: t 0-0 b 90 e 105 d 15
job 16: t 3-3 b 78 e 95 d 17
job 17: t 2-2 b 77 e 95 d 18
job 18: t 1-1 b 74 e 94 d 20
job 19: t 0-0 b 69 e 90 d 21
job 20: t 3-3 b 60 e 78 d 18
job 21: t 2-2 b 59 e 77 d 18
job 22: t 1-1 b 57 e 74 d 17
job 23: t 0-0 b 55 e 69 d 14
job 24: t 3-3 b 45 e 60 d 15
job 25: t 2-2 b 45 e 59 d 14
job 26: t 1-1 b 43 e 57 d 14
job 27: t 0-0 b 43 e 55 d 12
job 28: t 2-2 b 30 e 45 d 15
job 29: t 3-3 b 30 e 45 d 15
job 30: t 0-0 b 27 e 43 d 16
job 31: t 1-1 b 24 e 43 d 19
job 32: t 2-2 b 13 e 30 d 17
job 33: t 3-3 b 12 e 30 d 18
job 34: t 0-0 b 11 e 27 d 16
job 35: t 1-1 b 11 e 24 d 13
job 36: t 2-2 b 0 e 13 d 13
job 37: t 3-3 b 0 e 12 d 12
job 38: t 1-1 b 0 e 11 d 11
job 39: t 0-0 b 0 e 11 d 11
Here the costly chunks tend to bunch together at the end of the queue, hence a noticeable performance loss.
3) a run with only one job per core, with one to 4 cores activated
reported cores: 4
Master: start jobs 4 workers 1
job 0: t 0-0 b 410 e 590 d 180 // purely linear execution
job 1: t 0-0 b 255 e 409 d 154
job 2: t 0-0 b 127 e 255 d 128
job 3: t 0-0 b 0 e 127 d 127
Master: start jobs 4 workers 2 // gain factor : 1.6 out of theoretical 2
job 0: t 1-1 b 151 e 362 d 211
job 1: t 0-0 b 147 e 323 d 176
job 2: t 0-0 b 0 e 147 d 147
job 3: t 1-1 b 0 e 151 d 151
Master: start jobs 4 workers 3 // gain factor : 1.82 out of theoretical 3
job 0: t 0-0 b 142 e 324 d 182 // 4th packet is hurting the performance badly
job 1: t 2-2 b 0 e 158 d 158
job 2: t 1-1 b 0 e 160 d 160
job 3: t 0-0 b 0 e 142 d 142
Master: start jobs 4 workers 4 // gain factor : 3 out of theoretical 4
job 0: t 3-3 b 0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job 1: t 1-1 b 0 e 182 d 182 // 17 ms wasted
job 2: t 0-0 b 0 e 146 d 146 // 44 ms wasted
job 3: t 2-2 b 0 e 150 d 150 // 49 ms wasted
Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255 !).
At any rate, if you don't implement some kind of load balancing, all the shiny multiprocessor code you might write will still yield poor to downright miserable performance.
Implementation
For the threads to work unhindered, they must be kept free from interference from the outside world.
One such interference is memory allocation. Each time you allocate even a byte of memory, you will queue for exclusive access to the global memory allocator (and waste a bit of CPU doing the allocation).
Also, creating worker tasks for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so better have the workers permanently created and synchronized to compute successive images.
Lastly, there are the data copies. If you synchronize with the main program each time you're done computing a few points, you will again spend a good part of your time queueing for exclusive access to the result area. Besides, the useless copies of a sizeable amount of data will hurt performance even more.
The obvious solution is to dispense with the copies altogether and work on original data.
design
You must provide your worker threads all they need to work unhindered. For that you need to:
determine the number of available cores on your system
pre-allocate all the memory needed
give access to a list of image chunks to each of your worker
launch exactly one thread per core and let them run free to do their job
job queue
There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.
Basically, the queue can be computed as a whole at the start of an image generation. Workers will only have to read the jobs from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code around to implement job queues will be suboptimal and too complex for the job at hand.
We need two sync points:
let the workers wait for a new batch of jobs
let the master wait for a picture completion
Workers will wait until the queue length changes to a positive value.
They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).
The same mechanism is used to terminate the workers. Instead of finding a new batch of jobs, the poor workers will wake up to find an order to terminate.
The master waiting for a picture completion will be awoken by the worker that finishes processing the last job. This is based on an atomic counter of the number of jobs to process.
This is how I implemented it:
class synchro {
friend class mandelbrot_calculator;
mutex lock; // queue lock
condition_variable work; // blocks workers waiting for jobs/termination
condition_variable done; // blocks master waiting for completion
int pending; // number of jobs in the queue
atomic_int active; // number of unprocessed jobs
bool kill; // poison pill for workers termination
synchro() // a constructor has no return type
: active(0) // no job being processed
{
pending = 0; // no job in queue
kill = false; // workers shall live (for now :) )
}
int worker_start(void)
{
unique_lock<mutex> waiter(lock);
while (!pending && !kill) work.wait(waiter);
return kill
? -1 // worker should die
: --pending; // index of the job to process
}
void worker_done(void)
{
if (!--active) // atomic decrement (exclusive with other workers)
{
unique_lock<mutex> waiter(lock); // hold the lock so the master cannot miss the wakeup
done.notify_one(); // last job processed: wakeup master
}
}
void master_start(int jobs)
{
unique_lock<mutex> waiter(lock);
pending = active = jobs;
work.notify_all(); // wakeup all workers to start jobs
}
void master_done(void)
{
unique_lock<mutex> waiter(lock);
while (active) done.wait(waiter); // wait for workers to finish
}
void master_kill(void)
{
unique_lock<mutex> waiter(lock); // kill is read under this lock in worker_start
kill = true;
work.notify_all(); // wakeup all workers (to die)
}
};
Putting all together:
class mandelbrot_calculator {
int num_cores;
int num_jobs;
vector<thread> workers; // worker threads
vector<job> jobs; // job queue
synchro sync; // synchronization helper
public:
mandelbrot_calculator (int num_cores, int num_jobs)
: num_cores(num_cores)
, num_jobs (num_jobs )
{
// worker thread
auto worker = [&](int /*id*/, int /*core*/) // thread id and core index, unused in this excerpt
{
for (;;)
{
int job = sync.worker_start(); // fetch next job
if (job == -1) return; // poison pill
process (jobs[job]); // we have exclusive access to this job
sync.worker_done(); // signal end of picture to the master
}
};
jobs.resize(num_jobs, job()); // computation windows
workers.resize(num_cores);
for (int i = 0; i != num_cores; i++)
workers[i] = thread(worker, i, i%num_cores);
}
~mandelbrot_calculator()
{
// kill the workers
sync.master_kill();
for (thread& worker : workers) worker.join();
}
void compute(const viewport & vp)
{
// prepare worker data
function<void(int, int)> butterfly_jobs;
butterfly_jobs = [&](int min, int max)
// computes job windows in butterfly order
{
if (min > max) return;
jobs[min].setup(vp, max, num_jobs);
if (min == max) return;
jobs[max].setup(vp, min, num_jobs);
int mid = (min + max) / 2;
butterfly_jobs(min + 1, mid );
butterfly_jobs(mid + 1, max - 1);
};
butterfly_jobs(0, num_jobs - 1);
// launch workers
sync.master_start(num_jobs);
// wait for completion
sync.master_done();
}
};
Testing the concept
This code works pretty well on my 2 cores / 4 CPUs Intel i3 @ 3.1 GHz, compiled with Microsoft Dev Studio 2013.
I use a bit of the set that has an average of 90 iterations / pixel, on a window of 1280x1024 pixels.
The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.
I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).
Tampering with the scheduler
My program forces the threads to run on one core each (using MSDN SetThreadAffinityMask).
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.
This is significant, but still the Win7 scheduler does a pretty good job when left alone.
synchronization overhead
Running the algorithm on a "white" window (outside the r < 2 area) gives a good idea of the system calls overhead.
It takes about 7ms to compute this "white" area, compared with the 480 ms of a representative area.
Something like 1.5%, including both the synchronization and computation of the job queue. And this is doing a synchronization on a queue of 1024 jobs.
Utterly negligible, I would say. That might give food for thought to all the no-wait queue fanatics around.
optimizing iterations
The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:
static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
register double x = x0;
register double y = y0;
register double x2 = x * x;
register double y2 = y * y;
unsigned iteration = 0;
const int max_iteration = 255;
while (x2 + y2 < 4.0)
{
if (++iteration == max_iteration) break;
y = 2 * x * y + y0;
x = x2 - y2 + x0;
x2 = x * x;
y2 = y * y;
}
return (unsigned char)iteration;
}
net gain: +20% compared with the OP's method
(the register directives don't make a bit of a difference, they are just there for decoration)
killing the tasks after each computation
The benefit of leaving the workers alive is about 5% of the computation time.
butterfly effect
On my test case, the "butterfly" order is doing really well, yielding more than 30% gain in extreme cases and routinely 10-15% due to "de-bunching" the bulkiest requests.
The problem in your code is that all threads capture the same variable i by reference and read it while the loop keeps changing it. This creates a race condition and the results are wildly incorrect.
You need to pass it as an argument to the thread lambda, and also correct the ranges (with i starting at 0, i-1 makes the first thread's indexing go out of bounds).
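A minimal sketch of both corrections, under assumptions: compute_row is a stand-in for mandelbrot_row, the output is a flat vector with one value per row, and row_range and compute_all are illustrative names. The index is handed to each thread by value, and the ranges start at i * rowsPerThread.

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Half-open row range [first, last) for worker i; the last worker also takes
// the rows left over when rows is not a multiple of numThreads.
std::pair<int, int> row_range(int i, int numThreads, int rows) {
    assert(numThreads > 0 && i >= 0 && i < numThreads);
    const int rowsPerThread = rows / numThreads;
    const int first = i * rowsPerThread;
    const int last = (i == numThreads - 1) ? rows : first + rowsPerThread;
    return {first, last};
}

// Stand-in for mandelbrot_row: any per-row computation that needs no shared state.
int compute_row(int row) { return row * row; }

std::vector<int> compute_all(int rows, int numThreads) {
    std::vector<int> out(rows);            // allocated once, before the threads start
    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (int i = 0; i < numThreads; ++i) {
        // i is passed by value as a thread argument, so each worker sees its own copy.
        threads.emplace_back([&out, rows, numThreads](int index) {
            const std::pair<int, int> range = row_range(index, numThreads, rows);
            for (int j = range.first; j < range.second; ++j) {
                out[j] = compute_row(j);   // rows are disjoint, so no mutex is needed
            }
        }, i);
    }
    for (std::thread& t : threads) t.join();
    return out;
}
```

Since every thread writes only to its own rows, the mutex from the question becomes unnecessary. On some toolchains this has to be built with the platform's thread option (for example -pthread with GCC or Clang).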
I am new to OpenMP. My program needs complex simulations, and to make the results repeatable a seed is set for each simulation. However, after adding OpenMP, I get different results every time I run it. So I wrote a simple example to check the problem, as follows.
It also generates a different result each time:
#include <cstdlib> // srand, rand
#include <iostream>
#include <omp.h>
using namespace std;
int main () {
double A[10];
#pragma omp parallel for
for( int i=0;i<10;i++){
srand(i+1);
int m = rand()%100;
A[i] = m;
}
cout<<"A= \n";
for(int i=0;i<10;i++){
cout<<i<<" "<<A[i]<<" \n";
}
return 0;
}
I ran it twice; the results are:
A=
0 86
1 25
2 78
3 1
4 46
5 95
6 77
7 83
8 15
9 8
and
A=
0 15
1 41
2 65
3 1
4 75
5 85
6 95
7 83
8 74
9 8
Thank you very much!
rand() uses static state and is not threadsafe. You'll need to use a different, thread-safe, PRNG. See Thread-safe random number generation for Monte-Carlo integration or Do PRNG need to be thread safe?
This is a bug
A[i] += m;
You're reading the prior value of A[i], which has never been assigned. That's undefined behavior. Try
A[i] = m;
Then, note that the random number state might not be threadlocal. Get a better RNG, where you have an explicit state variable instead of accessing shared global state.
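One way to get an explicit state variable is to create a separate engine for each simulation and seed it inside the loop body. The sketch below uses std::mt19937 from the standard <random> header as one possible choice; simulate and run_all are made-up names, and the % 100 simply mirrors the question.

```cpp
#include <random>
#include <vector>

// One engine per simulation, seeded explicitly, so no state is shared between threads.
int simulate(unsigned seed) {
    std::mt19937 engine(seed);                 // local, explicit state
    return static_cast<int>(engine() % 100);   // the same seed gives the same value
}

std::vector<int> run_all(int count) {
    std::vector<int> results(count);
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        results[i] = simulate(static_cast<unsigned>(i + 1));
    }
    return results;
}
```

Because each iteration owns its engine, the result for a given index depends only on its seed and not on which thread ran it or in what order. Without OpenMP enabled the pragma is ignored and the loop simply runs sequentially.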