GNU parallel and resource management - scheduling

I would like to use GNU parallel on the command line as a simple scheduling mechanism.
In my case, I have N GPUs on a system and I would like to queue a list of jobs onto those GPUs.
I have a list of inputs, and I would naively run
parallel --jobs=4 ./my_script.sh ::: $(cat list_of_things.txt) ::: 0 1 2 3
where ./my_script.sh accepts two arguments: the thing I want to process, and the GPU I want to process it on.
What I want is for each thing in the list to run on just one of the GPUs (0 through 3).
However, this ends up running each thing 4 times, once per GPU.

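Try this:
parallel --jobs=4 ./my_script.sh {} {%} :::: list_of_things.txt
Here :::: reads the arguments from the file, {} is the input line, and {%} is the job slot number. Slots are numbered 1 to 4, so if the GPUs are numbered 0 through 3, subtract one, for example with '{= $_ = slot() - 1 =}' in place of {%}.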
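GNU parallel's {%} is the job slot number: each running job holds one slot, and the slot is reused when the job finishes. The same idea can be sketched in Python, purely for illustration. The names run_on_free_gpu and fake_job are made up here, and fake_job is a stand-in for ./my_script.sh rather than a real GPU job.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_on_free_gpu(items, num_gpus, work):
    """Run work(item, gpu) for every item, handing each job a free GPU id."""
    free_gpus = queue.Queue()
    for gpu in range(num_gpus):
        free_gpus.put(gpu)

    def task(item):
        gpu = free_gpus.get()          # wait until some GPU id is free
        try:
            return work(item, gpu)
        finally:
            free_gpus.put(gpu)         # hand the GPU id back for the next job

    # At most num_gpus jobs run at once, so every running job owns one GPU id.
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        return list(pool.map(task, items))

def fake_job(item, gpu):
    # Hypothetical stand-in for: ./my_script.sh <item> <gpu>
    return (item, gpu)

results = run_on_free_gpu(["a", "b", "c", "d", "e", "f"], 4, fake_job)
```

In real use, fake_job would launch the script, for example with subprocess.run(["./my_script.sh", item, str(gpu)]).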

Related

How to find finish times of processes in cplex

I have a machine batch scheduling problem. The finish time of batch b is the variable "Z[b]". There are three machines (f indexes the machines). If machine f starts processing batch b at time t, then X[f][b][t] equals 1.
The parameter "P[b]" is the processing time of batch b. I need to find the finish times of the batches. I tried the constraint below, where t is the length of the time horizon, for example 48 hours.
"forall(p in B) Z[p]-(sum(n in F)sum(a in 1..t-P[p]+1)(a+P[p])*X[n][p][a])==0 ;"
I have 3 machines, but with this constraint only 2 machines are used at time 1. Also, the Z[p] values are not logical. How can I fix this?
Within CPLEX you have CP Optimizer, which is good at scheduling.
To get the end of an interval variable itvs, endOf(itvs) works fine.
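The time-indexed constraint from the question can be checked by hand with a small Python sketch. It only evaluates the right-hand side of the constraint for a given assignment; it is not a solver. The machine names, batch names and processing times below are made up for illustration.

```python
def finish_times(X, P, horizon):
    """Evaluate Z[p] = sum_n sum_a (a + P[p]) * X[n][p][a] for each batch p.

    X[n][p][a] is 1 if machine n starts batch p at time a (times are 1-based).
    """
    Z = {}
    for p in P:
        total = 0
        for n in X:
            for a in range(1, horizon - P[p] + 2):   # a in 1..t-P[p]+1
                total += (a + P[p]) * X[n][p].get(a, 0)
        Z[p] = total
    return Z

# Hypothetical data: 3 machines, 3 batches, horizon of 48 time units.
P = {"b1": 5, "b2": 3, "b3": 4}
X = {
    "m1": {"b1": {1: 1}, "b2": {}, "b3": {}},
    "m2": {"b1": {}, "b2": {1: 1}, "b3": {}},
    "m3": {"b1": {}, "b2": {}, "b3": {7: 1}},
}
Z = finish_times(X, P, horizon=48)
```

The sum adds one term per start variable that equals 1, so Z[p] is only a finish time when every batch starts exactly once. If a batch had two start variables set to 1, both terms would be added together.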

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with the Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour that I would like to understand.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges).
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp.
The example was slightly extended to run the eager, Crauser and delta-stepping algorithms.
Note: building the example required an obscure workaround involving the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue).
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash in the stage of data load.
-n 4:
mem usage: 40+ GB in roughly equal parts across 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
-n 8:
mem usage: 44+ GB in roughly equal parts across 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
So, apart from the inappropriate memory usage and very low overall performance, the only changes I observe when more MPI processes are running are slightly increased total memory consumption and a linear rise in processor load.
It is nevertheless evident that the initial graph is somehow partitioned between the processes (probably by vertex number ranges).
What is wrong with this test (and, probably, with my whole idea of how to use MPI)?
My environment:
- one Win 10 PC with 64 GB of RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- boost 1.71
N.B. See original example code here.

Go Worker Pool doesn't seem to be processing Concurrently

Hello, I'm brand new to Go (and concurrent programming in general :() and am trying to distribute a slow computation to a pool of workers.
http://play.golang.org/p/lTv4Tm75A4
func main() {
	test := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	answer := getSmallestMultiple(test)
	fmt.Println(answer)
}
I am trying to find the smallest number that is evenly divisible by all the numbers in test.
I have created a pool of workers and am sending them values until one of the goroutines finds a number that is evenly divisible by all the numbers in test:
for w := 0; w < 100; w++ {
	go divisibleByAllNumbers(&numbers, jobs, answer)
}
go func() {
	for i := max; ; i += max {
		fmt.Printf("Sending # %d\n", i)
		jobs <- i
	}
}()
The program seems to run at the same speed no matter how many workers I start. I have tried many different worker counts and it always takes the same number of seconds to run, which suggests the work is not being done concurrently at all.
Each worker consumes work from the queue using range:
for j := range jobs {}
I was hoping that the more goroutines consuming from the jobs channel, the faster the program would execute.
I have also tried different buffer sizes for the jobs channel (jobs := make(chan int)).
I have stared at this all day and was hoping someone could see what the issue is. I would expect the computation to get faster as I add workers, but that is not what I am seeing. I'm sure I'm missing some key concepts.
Thank you
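From Effective Go (http://golang.org/doc/effective_go.html#parallel), written for Go releases before 1.5, where GOMAXPROCS defaulted to 1 (since Go 1.5 it defaults to the number of available CPUs):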
The current implementation of the Go runtime will not parallelize this code by default. It dedicates only a single core to user-level processing. An arbitrary number of goroutines can be blocked in system calls, but by default only one can be executing user-level code at any time. It should be smarter and one day it will be smarter, but until it is if you want CPU parallelism you must tell the run-time how many goroutines you want executing code simultaneously. There are two related ways to do this. Either run your job with environment variable GOMAXPROCS set to the number of cores to use or import the runtime package and call runtime.GOMAXPROCS(NCPU). A helpful value might be runtime.NumCPU(), which reports the number of logical CPUs on the local machine. Again, this requirement is expected to be retired as the scheduling and run-time improve.

changing thread number doesn't affect code

I am trying to learn Xeon Phi, and while studying the Intel Xeon Phi Coprocessor HPC book, I tried to run the code here (from the book).
The code uses OpenMP and 2 threads.
But the results I am getting are the same as when running with 1 thread
(no use of OpenMP at all).
I even tried different combinations on the MIC, but the results are still the same:
export OMP_NUM_THREADS=2
export MIC_OMP_NUM_THREADS=124
export MIC_ENV_PREFIX=MIC
It seems that somehow OpenMP is not enabled? Am I missing something here?
The code using only 1 thread is here
I compiled using:
icc -mmic -openmp -qopt-report -O3 hello.c
Thanks!
I am not sure exactly which book you are talking about, but perhaps this will help.
The code you show does not use the offload programming style and must be run natively on the coprocessor, meaning you either copy the executable to the coprocessor and run it there, or you use the micnativeloadex utility to run the code from the host processor. You show that you know the code must be run natively because you compiled it with the -mmic option.
If you use micnativeloadex, then the number of omp threads on the coprocessor is set by executing "export MIC_OMP_NUM_THREADS=124" on the host. If you copy the executable to the coprocessor and then log in to run it there, the number of omp threads on the coprocessor is set by executing "export OMP_NUM_THREADS=124" on the coprocessor. If you use "export OMP_NUM_THREADS=2" on the coprocessor, you get only two threads; the MIC_OMP_NUM_THREADS environment variable is not used if you set it directly on the coprocessor.
I don't see any place in the code where it prints out the number of threads, so I don't know for sure how you determined the number of threads actually being used. I suspect you were using a tool like micsmc. However, micsmc tells you how many cores are in use, not how many threads are in use.
By default, the omp threads are laid out in order, so that the first core would run threads 0,1,2,3, the second core would run threads 4,5,6,7 and so on. If you are using only two threads, both threads would run on the first core.
So, is that what you are seeing - not that you are using only one thread but instead that you are using only one core?
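The in-order layout described above is simple arithmetic, sketched here in Python for illustration. It assumes four hardware threads per core, as in the description; the function names are made up, and real placement depends on the affinity settings in effect.

```python
def default_core_for_thread(thread_id, threads_per_core=4):
    """Core index under the in-order layout: threads 0-3 on core 0, 4-7 on core 1, ..."""
    return thread_id // threads_per_core

def cores_in_use(num_threads, threads_per_core=4):
    """Number of distinct cores touched when threads 0..num_threads-1 are laid out in order."""
    return len({default_core_for_thread(t, threads_per_core)
                for t in range(num_threads)})

two_thread_cores = cores_in_use(2)       # both threads land on core 0
many_thread_cores = cores_in_use(124)    # 124 threads spread over 31 cores
```

With two threads, a tool that reports cores in use would show a single busy core, which can look the same as a single-threaded run.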
I was looking at the serial version of the code you are using. For the following lines:
for(j=0; j<MAXFLOPS_ITERS; j++)
{
    //
    // scale 1st array and add in the 2nd array
    // example usage - y = mx + b;
    //
    for(k=0; k<LOOP_COUNT; k++)
    {
        fa[k] = a * fa[k] + fb[k];
    }
}
I see that here you do not scan the complete array. Instead you keep updating the first 128 (LOOP_COUNT) elements of the array fa. If you wish to compare this serial version to the parallel code you are referring to, you will have to ensure that the program does the same amount of work in both versions.
Thanks
I noticed three things in your first (OpenMP) program:
The total floating point operation count should reflect the number of threads doing the work. Therefore,
gflops = (double)( 1.0e-9*LOOP_COUNT*MAXFLOPS_ITERS*FLOPSPERCALC*numthreads);
You hard-coded the number of threads to 2. If you want to use the OMP environment variable, you should comment out the API call "omp_set_num_threads(2);"
After transferring the binary to the coprocessor, set the OpenMP environment variable on the coprocessor with OMP_NUM_THREADS, not MIC_OMP_NUM_THREADS. For example, if you want 64 threads to run your program on the coprocessor:
% ssh mic0
% export OMP_NUM_THREADS=64

Run part of program inside Fortran code for a limited time

I wanted to run a code (or an external executable) for a specified amount of time. For example, in Fortran I can
call system('./run')
Is there a way I can restrict its run to, let's say, 10 seconds, for example as follows?
call system('./run', 10)
I want to do it from inside the Fortran code. The example above is for the system command, but I also want to do it for some other subroutines of my code, for example,
call performComputation(10)
where performComputation would be allowed to run for only 10 seconds. The system it will run on is Linux.
thanks!
EDITED
Ah, I see - you want to call a part of the current program for a limited time. I see a number of options for that...
Option 1
Modify the subroutines you want to run for a limited time so they take an additional parameter, which is the number of seconds they may run. Have each subroutine record the system time at the start, then check the time again inside its processing loop and return to the caller once the elapsed time exceeds the allowed number of seconds.
On the downside, this requires you to change every subroutine. It will exit the subroutine cleanly though.
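The pattern in Option 1 can be sketched as follows, in Python rather than Fortran, purely to show the control flow. The step callback is a hypothetical unit of work, not part of the original code; in Fortran the same check would sit inside the subroutine's own loop.

```python
import time

def perform_computation(max_seconds, step):
    """Call step() repeatedly until it reports completion or the time budget runs out.

    Returns (finished, iterations). `step` is a hypothetical unit of work that
    returns True when the whole computation is done.
    """
    start = time.monotonic()
    iterations = 0
    while True:
        if time.monotonic() - start > max_seconds:
            return False, iterations     # budget exceeded: leave cleanly
        iterations += 1
        if step():
            return True, iterations

# A step that never finishes, so only the time limit can stop the loop.
finished, iterations = perform_computation(0.05, lambda: False)
```

The time is only checked between steps, so a single long step can overrun the limit; the smaller the unit of work, the tighter the limit is enforced.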
Option 2
Take advantage of a threading library - e.g. pthreads. When you want to call a subroutine with a timeout, create a new thread that runs alongside your main program in parallel and execute the subroutine inside that thread of execution. Then in your main program, sleep for 10 seconds and then kill the thread that is running your subroutine.
This is quite easy and doesn't require changes to all your subroutines. It is not that elegant in that it chops the legs off your subroutine at some random point, maybe when it is least expecting it.
Imagine time running down the page in the following example, and the main program actions are on the left and the subroutine actions are on the right.
MAIN                                  SUBROUTINE YOUR_SUB
... something ...
... something ...
f_pthread_create(,,,YOUR_SUB,)        start processing
sleep(10)                             ... calculate ...
                                      ... calculate ...
                                      ... calculate ...
f_pthread_kill()
... something ...
... something ...
Option 3
Abstract out the subroutines you want to call and place them into their own separate executables, then proceed as per my original answer below.
Whichever option you choose, you are going to have to think about how you get the results from the subroutine you are calling - will it store them in a file? Does the main program need to access them? Are they in global variables? The reason is that if you are going to follow options 2 or 3, there will not be a return value from the subroutine.
Original Answer
If you don't have timeout, you can do
call system('./run & sleep 10; kill $!')
Yes, there is a way. Take a look at the Linux command timeout.
! run the command for up to 10 seconds, then send it SIGTERM
! if it has not finished.
call system('timeout 10 ./run')
Example
# finishes in 10 seconds with a return code of 0 to indicate success.
sleep 10
# finishes in 1 second with a return code of `124` to indicate timed out.
timeout 1 sleep 10
You can also choose the type of kill signal you want to send by specifying the -s parameter. See man timeout for more info.
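For comparison, the same run-with-a-time-limit idea can be sketched in Python, whose subprocess.run accepts a timeout and kills the child when it expires. This is an illustration only, not a Fortran solution; run_with_timeout is a made-up helper, and the child commands are stand-ins for ./run.

```python
import subprocess
import sys

def run_with_timeout(cmd, seconds):
    """Run cmd, killing it if it is still running after `seconds`.

    Returns the exit code, or None if the time limit was hit.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None   # subprocess.run has already killed the child

# Stand-ins for ./run: children started via the current interpreter.
slow = [sys.executable, "-c", "import time; time.sleep(10)"]
fast = [sys.executable, "-c", "pass"]

timed_out = run_with_timeout(slow, 1)     # stopped after about 1 second
completed = run_with_timeout(fast, 10)    # finishes on its own
```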