Go Worker Pool doesn't seem to be processing Concurrently - concurrency

Hello I'm brand new to go (and concurrent programming in general :() and trying to distribute a slow computation to a pool of workers.
http://play.golang.org/p/lTv4Tm75A4
func main() {
test := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
answer := getSmallestMultiple(test)
fmt.Println(answer)
}
I am trying to find the smallest number that is evenly divisible by all the numbers in test.
I have created a pool of workers and am sending them values until one of the goroutines finds a number that can be evenly divided by all the numbers in test
for w := 0; w < 100; w++ {
go divisibleByAllNumbers(&numbers, jobs, answer)
}
go func() {
for i := max; ; i += max {
fmt.Printf("Sending # %d\n", i)
jobs <- i
}
}()
The program seems to be running at the same speed despite how many workers I start. I have tried many number of workers and it always takes the same number of seconds to run, which seems like the work is not being done concurrently at all.
Each worker is consuming work from the queue using range:
for j := range jobs {}
And i was hoping the more processes consuming off the jobs channel the faster the program would execute.
I have also tried different values for the jobs := make(chan int) buffer value
I have stared at this all day and was hoping someone could see what the issue is. I would expect the more workers I add the faster the computation takes but am not experiencing that. I'm sure I"m missing some key concepts,
Thank you

http://golang.org/doc/effective_go.html#parallel
The current implementation of the Go runtime will not parallelize this code by default. It dedicates only a single core to user-level processing. An arbitrary number of goroutines can be blocked in system calls, but by default only one can be executing user-level code at any time. It should be smarter and one day it will be smarter, but until it is if you want CPU parallelism you must tell the run-time how many goroutines you want executing code simultaneously. There are two related ways to do this. Either run your job with environment variable GOMAXPROCS set to the number of cores to use or import the runtime package and call runtime.GOMAXPROCS(NCPU). A helpful value might be runtime.NumCPU(), which reports the number of logical CPUs on the local machine. Again, this requirement is expected to be retired as the scheduling and run-time improve.

Related

Can the minimum time to schedule tasks be found without brute force?

If I have a list of integers representing the time it takes for a task to be completed and I have x workers that can only work on one task until the time it takes to complete is up, can I find the minimum time it could possibly take in a best case scenario? I do not need the exact permutation that makes up this minimum completion time, just the time.
For example, to make it simple, if I have a list [2, 4, 6] and I have 2 workers then if I start with 2 and 4 then when 2 finishes 6 will start meaning that it will take 8 seconds to complete all tasks. However if I start with 6 and 2 then when 2 finishes 4 will start and finish at the same time as 6, therefore the tasks only take 6 seconds if done in this order.
Is there a way of knowing that it will only take 6 seconds that is better than n! or brute force complexity that guarantees it is the minimum time possible? Thank you for any help in advance please feel free to ask questions if I left out any details or you're confused!
edit: please help :(
edit 2: is it even possible? Anyone know?
In the case of a single worker, then the actual total time required is the same as the sum of all task times.
jobs = [ 2, 4, 6, etc... ]
time_required = SUM( jobs )
In the case of two workers, then given a specific ordering of jobs the total-time required can be determined by first assigning each task's required time to whichever worker has the current lowest sum associated with it, then getting the highest sum associated with each worker:
define type worker = vector<time_t>
define type workers = min_priority_queue<worker> using worker.sum() # so the current worker.sum() (the sum of `time_t` values in `vector<time_t>`) is the priority-queue key.
define type task = int
jobs = [ 2, 4, 6, etc... ]
# Use two workers:
workers.add( new worker )
workers.add( new worker )
# Iterate once through each job:
foreach( task t in jobs ) {
minWorker = workers.getMinWorker() # priority queue "find-min" operation
minWorker.add( t )
}
# Determine which worker will work the longest time:
time_required = workers.getMaxWorker().sum() # priority queue "find-max" operation
Because this is an actual solution, then the time_required is a point-sample that exists between the upper and lower-bounds - which isn't exactly what you're after, but because it can be computed in O(n) time it's a good starting point.
The above algorithm can then be generalised to any number of workers just by adding them to the priority queue - as heap-based priority queues' find-min operation is O(1) I believe this algorithm runs in O(n) time where n is the number of jobs, independent of the number of workers. (I may be wrong about the precise runtime complexity).
As for computing bounds in less time than O(n!) time... that's tricky (at least for me, as it's been a few years since I last cracked-open my copy of CLRS).
A minimal lower-bound for x workers for any order of jobs is simply the largest single value in the job set.
A maximal upper-bound for x workers for any order of jobs could be the sum of the largest 100 * (1/x) % of jobs (so given 2 workers it's the sum of the largest 50% jobs, for 3 workers it's the sum of the largest 33% jobs, for 4 workers it's 25%, etc). This will require you to sort the set first (taking O(n log n) if using Quicksort).
jobs = [ 2, 4, 6, etc... ]
worker_count = 2
jobs.sortDescending() # O(n log n)
# if there's 50 jobs and 2 workers, then take the first 25 jobs and sum them...
# ...that's an upper_bound for the time required to complete all tasks by 2 workers, as it assumes that 1 unlucky worker will get all of the longest tasks
upper_bound = jobs.take( jobs.count / worker_count ).sum()

Dealing with complex send recv message within a for loop

I am trying to parallelise a biological model in C++ with boost::mpi. It is my first attempt, and I am entirely new to the boost library (I have started from the Boost C++ Libraries book by Schaling). The model consists of grid cells and cohorts of individuals living within each grid cell. The classes are nested, such that a vector of Cohorts* belongs to a GridCell. The model runs for 1000 years, and at each time step, there is dispersal such that the cohorts of individuals move randomly between grid cells. I want to parallelise the content of the for loop, but not the loop itself as each time step depends on the state of the previous time.
I use world.send() and world.recv() to send the necessary information from one rank to another. Because sometimes there is nothing to send between ranks I use with mpi::status and world.iprobe() to make sure the code does not hang waiting for a message that was never sent (I followed this tutorial)
The first part of my code seems to work fine but I am having troubles with making sure all the sent messages have been received before moving on to the next step in the for loop. In fact, I noticed that some ranks move on to the following time step before the other ranks have had the time to send their messaages (or at least that what it looks like from the output)
I am not posting the code because it consists of several classes and it’s quite long. If interested the code is on github. I write here roughly the pseudocode. I hope this will be enough to understand the problem.
int main()
{
// initialise the GridCells and Cohorts living in them
//depending on the number of cores requested split the
//grid cells that are processed by each core evenly, and
//store the relevant grid cells in a vector of GridCell*
// start to loop through each time step
for (int k = 0; k < (burnIn+simTime); k++)
{
// calculate the survival and reproduction probabilities
// for each Cohort and the dispersal probability
// the dispersing Cohorts are sorted based on the rank of
// the destination and stored in multiple vector<Cohort*>
// I send the vector<Cohort*> with
world.send(…)
// the receiving rank gets the vector of Cohorts with:
mpi::status statuses[world.size()];
for(int st = 0; st < world.size(); st++)
{
....
if( world.iprobe(st, tagrec) )
statuses[st] = world.recv(st, tagrec, toreceive[st]);
//world.iprobe ensures that the code doesn't hang when there
// are no dispersers
}
// do some extra calculations here
//wait that all processes are received, and then the time step ends.
//This is the bit where I am stuck.
//I've seen examples with wait_all for the non-blocking isend/irecv,
// but I don't think it is applicable in my case.
//The problem is that I noticed that some ranks proceed to the next
//time step before all the other ranks have sent their messages.
}
}
I compile with
mpic++ -I/$HOME/boost_1_61_0/boost/mpi -std=c++11 -Llibdir \-lboost_mpi -lboost_serialization -lboost_locale -o out
and execute with mpirun -np 5 out, but I would like to be able to execute with a higher number of cores on an HPC cluster later on (the model will be run at the global scale, and the number of cells might depend on the grid cell size chosen by the user).
The compilers installed are g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, Open MPI: 2.1.1
The fact that you have nothing to send is an important piece of information in your scenario. You can not deduce that fact from only the absence of a message. The absence of a message only means nothing was sent yet.
Simply sending a zero-sized vector and skipping the probing is the easiest way out.
Otherwise you would probably have to change your approach radically or implement a very complex speculative execution / rollback mechanism.
Also note that the linked tutorial uses probe in a very different fashion.

Learning about multithreading. Tried to make a prime number finder

I'm studying for a uni project and one of the requirements is to include multithreading. I decided to make a prime number finder and - while it works - it's rather slow. My best guess is that this has to do with the amount of threads I'm creating and destroying.
My approach was to take the range of primes that are below N, and distribute these evenly across M threads (where M = number of cores (in my case 8)), however these threads are being created and destroyed every time N increases.
Pseudocode looks like this:
for each core
# new thread
for i in (range / numberOfCores) * currentCore
if !possiblePrimeIsntActuallyPrime
if possiblePrime % i == 0
possiblePrimeIsntActuallyPrime = true
return
else
return
Which does work, but 8 threads being created for every possible prime seems to be slowing the system down.
Any suggestions on how to optimise this further?
Use thread pooling.
Create 8 threads and store them in an array. Feed it new data each time one ends and start it again. This will prevent them from having to be created and destroyed each time.
Also, when calculating your range of numbers to check, only check up to ceil(sqrt(N)) as anything after that is guaranteed to either not go into it or the other corresponding factor has already been checked. i.e. ceil(sqrt(24)) is 5.
Once you check 5 you don't need to check anything else because 6 goes into 24 4 times and 4 has been checked, 8 goes into it 3 times and 3 has been checked, etc.

PPL - How to configure the number of native threads?

I am trying to manage the count of native threads in PPL by using its Scheduler class, here is my code:
for (int i = 0; i < 2000; i ++)
{
// configure concurrency count 16 to 32.
concurrency::SchedulerPolicy policy = concurrency::SchedulerPolicy(2, concurrency::MinConcurrency, 16,
concurrency::MaxConcurrency, 32);
concurrency::Scheduler *pScheduler = concurrency::Scheduler::Create(policy);
HANDLE hShutdownEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
pScheduler->RegisterShutdownEvent(hShutdownEvent);
pScheduler->Attach();
//////////////////////////////////////////////////////////////////////////
//for (int i = 0; i < 2000; i ++)
{
concurrency::create_task([]{
concurrency::wait(1000);
OutputDebugString(L"Task Completed\n");
});
}
//////////////////////////////////////////////////////////////////////////
concurrency::CurrentScheduler::Detach();
pScheduler->Release();
WaitForSingleObject(hShutdownEvent, INFINITE);
CloseHandle(hShutdownEvent);
}
The usage of SchedulerPolicy is from MSDN, but it didn't work at all. The expected result of my code above is, PPL will launch 16 to 32 threads to execute the 2000 tasks, but the fact is:
By observing the speed of console output, only one task was processed within a second. I also tried to comment the outter for loop and uncomment the inner for loop, however, this will cause 300 threads being created, still incorrect. If I wait a longer time, the threads created will be even more.
Any ideas on what is the correct way to configure concurrency in PPL?
It has been proved that I should not do concurrency::wait within the task body, PPL works in work stealing mode, when the current task was suspended by wait, it will start to schedule the rest of tasks in queue to maximize the use of computing resources.
When I use concurrency::create_task in real project, since there are a couple of real calculations within the task body, PPL won't create hundreds of threads any more.
Also, SchedulePolicy can be used to configure the number of virtual processors that PPL may use to process the tasks, which is not always same as the number of native threads PPL will create.
Saying my CPU has 8 virtual processors, by default PPL will just create 8 threads in pool, but when some of those threads were suspended by wait or lock, and also there are more tasks pending in the queue, PPL will immediately create more threads to execute them (if the virtual processors were not fully loaded).

java parallelisation problem - parallelisation is as slow as serialisation

I have been developing an individual base model. All you need to know is that individuals are born, reproduce and die. I have a GUI in which i can see these processes happening.
I have a mac pro, with 8 cores and 16GB ram.
Considering that the simulation will have to be repeated a few times to get error bars, etc, I thought i could run the main class and then have separate simulations (all run from the same program) ran on separate cores. Simple. Each parallel simulation would have no knowledge of the other simulations, hence no need for synchronization blocks.
When the main method is run, it invokes the constructor of the main class - which creates the other objects and the simulation begins. Hence - to parallelise - I created a fixed thread pool which would all separately invoke the main class constructor and multiple (well, 8, the number of cores) simulations.
BUT - it is running as slow as if I was running the simulations in serial. The animation in the GUIs for each simulation are updated in order, not simultaneously.
In fact, if I run the program 8 times simultaneously from the command line (and place in the background with '&') it is much faster and behaves much more like I would have hoped. Which is irritating!
At the start of the simulation some IO operations are performed to read in data about the individuals, but only at the start.
Interestingly, the first objects to be created by the `parallel' processes were made at the same memory addresses - but I don't think that is a problem.
If anybody has any insight into this lack of performance from the java concurrency tools, why the program appears to be running in serial and why simply running the main method from the command line 8 times is better than attempting to parallelise that would be most helpful.
Because to be frank I am losing faith in java's parallelisation capabilities.
Cheers
James
noOfProcessors = (byte)Runtime.getRuntime().availableProcessors();
ExecutorService eservice = Executors.newFixedThreadPool( noOfProcessors );
List<Future> futuresList = new ArrayList<Future>();
for( int i = 0; i < noOfProcessors; i++ ){
futuresList.add( eservice.submit( new simulation() ) );
}//end for
for( Future future : futuresList ){
try{
future.get();
}catch( InterruptedException ex ){
Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
System.exit( 1 );
}catch( ExecutionException ex ){
Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
System.exit( 1 );
}//end try-catch
}//end for loop
While not too familiar with Java's Executors class, the serial behaviour seems to indicate that your thread pool is running all threads on the same processor. Perhaps it has something to do with how the JVM handles threads? Anyway, see if you can create separate processes in Java and see if that makes a difference.