How to make a higher-level parallel code/scripts? - c++

Newbie here.
I have a C++ program XX (note that XX is an executable binary here). I want XX to do a similar job N times, with N different sets of input parameters. Say I have N processors; then I could let these N jobs run simultaneously on the N processors.
Is it possible at the script level to qsub this kind of "parallel" job? Or can it be done at the C++ level? Or are there any better ideas?
I ask because the XX code I have written is based on a large project, and it is not easy for me to change the MPI part of the code. :(
Or do I HAVE TO modify the XX code and the project to use a new algorithm that fits my need?
Or is there any other advice, like using Python, that would achieve my goal quickly?
Thanks a lot!
I want to add more to my question, to make it clearer.
What if the results of these N jobs are dependent? I mean, how could I do this:
1st cycle: N jobs run simultaneously on N processors; after a certain time they all end and give N results. I then need to do a serial job based on these N results, and its result will be used as the initial condition for the next cycle. Then move on to the next cycle,
2nd cycle,
3rd cycle,
and so on....
Are shell scripts able to do this? Or would I be better off learning Python? Or can I still use C++?
Thanks :)

You can use GNU Parallel to execute jobs in parallel through a shell script.
Good old xargs takes a -P parameter that tells it how many jobs to execute at the same time.

The Bourne shell can do what you ask just fine:
#!/bin/sh
# Run XX 3 times in parallel
XX args&
XX other-args&
XX different-args&
wait # Wait for all 3 to finish
...

GNU make has a "-j" switch that lets you specify how many jobs to run simultaneously. So if you can convert the whole thing into a GNU Makefile with proper dependencies, you will probably be able to run the specified number of jobs simultaneously. Alternatively, you could try some other build system/automation tool, or implement it from scratch in a shell script or something like that. Plain C++ is also possible, as long as you have access to a threads library. Or you could generate the Makefile (or another build script) using Python. There are many ways to approach this.
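For the cyclic case in the question (N parallel runs, then a serial step whose result seeds the next cycle), here is a minimal Python sketch using only the standard library. The names run_job, run_cycles, demo_cmd and demo_combine are made up for illustration. The demo command calls the Python interpreter as a stand-in for XX, because XX's real command line is not known here; in practice make_cmd would return something like ["./XX", ...] with your own arguments.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(cmd):
    """Run one external command, wait for it, and return its stdout."""
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def run_cycles(make_cmd, combine, initial, n_jobs, n_cycles):
    """Each cycle: run n_jobs commands in parallel, then combine serially."""
    state = initial
    for _ in range(n_cycles):
        cmds = [make_cmd(state, i) for i in range(n_jobs)]
        # Threads are enough here: each one only waits on a child process.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            results = list(pool.map(run_job, cmds))  # blocks until all finish
        state = combine(results)  # the serial step between cycles
    return state


def demo_cmd(state, i):
    # Hypothetical stand-in for XX: prints state + i.
    return [sys.executable, "-c", f"print({state} + {i})"]


def demo_combine(results):
    # Hypothetical serial step: sum the N outputs.
    return sum(int(r) for r in results)


if __name__ == "__main__":
    print(run_cycles(demo_cmd, demo_combine, initial=0, n_jobs=3, n_cycles=2))
```

The pool.map call plays the same role as "wait" in the shell version: the serial step only starts once every job of the cycle has finished. On a cluster with a batch system, the jobs would have to be submitted through the scheduler instead, which this sketch does not cover.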

Related

How to Limit Concurrency in fp-ts

Our team is starting to learn fp-ts, and we are starting with some basic async examples (mostly pulled from here). Running a set of Tasks in sequence is great, and it looks like array.sequence(task)(tasks)
The question is, what is the idiomatic way to restrict concurrency when executing parallel Tasks in fp-ts? For example, Promise.map (in bluebird) allows you to set a concurrency limit like {concurrency: 4}.
One solution might be to split the array into chunks, and then iterate the chunks, using sequence and flatMap. However, that would mean every Task in each chunk would have to complete before moving on to the next chunk - one long running task could hold up the whole operation.
There must be some abstraction that we're missing - we are all pretty new to FP, so hopefully someone here with more exp can help out.
I was able to find a resolution with the helpful folks over at the fp-ts git repo. Looks like wrapping p-map is the way to go:
https://github.com/gcanti/fp-ts/issues/574#issuecomment-424658481

SAS Enterprise Guide - process dependence and parallel execution

I am working on a project in SAS EG (7.1) which involves process dependence and parallel execution, as depicted below:
I have the following questions:
Is there a way to retrieve or set relations (i.e. process_C --> program_D) between the processes programmatically? The maintenance is becoming problematic with complex projects. Ideally, I would like to be able to re-create the links between processes from external table.
I start the whole process with the option “Run branch from <>” process. Let’s assume that we have only 2 processors available. Is there a way to set the order of execution between process_A, B, C? The critical path of the whole flow is “begin -> process_C -> process_D -> end” hence we would like it to start with process_C in order to ensure minimum execution time.
Thank you in advance.
For 1, I think the answer is "no", if you mean a well defined SAS programmatic method. At least for the relatively limited information and example you provide above, anyway. More might be possible with metadata server - not my area of expertise.
You can do some of this at least using scripting via Powershell or VBScript. EG's API is fairly wide open and not all that hard to use. I won't suggest how as my understanding of this is limited also, but it seems like it should be possible to do what you suggest, though probably not easy.
For your second point:
First off, EG typically runs "top to bottom" if it has no other information on how to process a particular choice. So put c->d above a/b to get it processed first.
Second, you could use conditional processing perhaps. There should be a macro variable that tells you how many cpus you have (&SYSNCPU on my machine, hopefully same on other versions). You could use that value to conditionally link to A then B as opposed to A+B simultaneously. I'm not sure how easy this would be to do in a flexible fashion, though.

When does an action not run on the driver in Apache Spark?

I have just started with Spark and was struggling with the concept of tasks.
Can anyone please help me understand when an action (say reduce) does not run in the driver program?
From the spark tutorial,
"Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel. "
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions are taking place on the driver node.
Can you please describe a scenario where the reduce function won't execute on the driver? Does a task always include "transformation+action" or only "transformation"?
All the actions are performed on the cluster and results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs; rather, Spark uses it to create a plan which will execute your code in the cluster. The plan creates a task out of all the actions that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task and a shuffle between the first and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) and an action. The transformations are lazy and would not submit anything, thus the need for an action. You can always call .toDebugString on your RDD to see where each job split will be; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer, as it will run first in parallel and then merge the results. So I would expect that the task does indeed run on the workers as far as it can.
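To make "run first in parallel and then merge" concrete, here is a plain-Python model of a two-step reduce. It is a conceptual sketch only and does not use the Spark API: distributed_reduce is a made-up name, and the nested lists stand in for partitions.

```python
from functools import reduce
from operator import add, sub


def distributed_reduce(partitions, func):
    """Model of a two-step reduce over partitioned data."""
    # Step 1 (on the workers in a real cluster): reduce each partition alone.
    partials = [reduce(func, part) for part in partitions if part]
    # Step 2 (on the driver): merge the per-partition partial results.
    return reduce(func, partials)


if __name__ == "__main__":
    # Addition is commutative and associative: partitioning does not matter.
    print(distributed_reduce([[1, 2], [3], [4, 5, 6]], add))
    # Subtraction is neither, so the answer depends on how the data is split.
    print(distributed_reduce([[1, 2], [3]], sub))
    print(distributed_reduce([[1], [2, 3]], sub))
```

The second and third calls give different results for the same elements, which is why the tutorial asks for a commutative and associative function. Only the small list of partial results reaches the merge step, so a reduce that shows up on the driver is usually just this final merge.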

How to measure the amount of data transmitted by my MPI program?

I'm testing my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment that consists in running my algorithm several times varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm so I can see how the amount of data changes when varying the previous mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump in my batch program (which is written in C++ using the command system for making calls) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump in another computer since the network uses a switch. So I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was running, and I couldn't figure out which port MPI was using. It seems to use many ports. I wanted to know that in order to filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file will be huge if I follow this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I were capturing whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computer disc. So I only have the RAM and my 4GB USB Flash drive. So I can't have huge logfiles.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer so far. The problem is that I guess it will be difficult, and maybe even impossible, to install those tools on BCCD. In addition to that, this tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks it is a good choice to use one of those tools, please let me know.
Hope someone has a solution.
Tools like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that does counting of messages and sizes; you should be able to either parse the XML output, or use the postprocessing tools and just manually integrate the graphs (say either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program has finished running, to do the file I/O.
If you were really super worried about the overhead with either of these approaches, you could always write your own MPI wrappers using the profiling interface, wrapping MPI_Send, MPI_Recv, etc., and just count the number of bytes sent and received for each process, outputting only that total at the end.
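The counting idea behind such wrappers can be sketched in a few lines. Note the assumptions: this is Python, not the C profiling interface the answer refers to, and CountingComm and RecordingComm are made-up names. The wrapper expects a communicator-like object with a Send(buf, dest, tag) method (mpi4py communicators have one) and bytes-like buffers; the stand-in communicator below exists only so the example runs without MPI installed.

```python
class CountingComm:
    """Wrap a communicator-like object and tally bytes passed to Send."""

    def __init__(self, comm):
        self._comm = comm
        self.bytes_sent = 0
        self.messages_sent = 0

    def Send(self, buf, dest, tag=0):
        # Count the payload size, then forward the call unchanged.
        self.bytes_sent += memoryview(buf).nbytes
        self.messages_sent += 1
        return self._comm.Send(buf, dest=dest, tag=tag)

    def __getattr__(self, name):
        # Every other attribute is forwarded to the wrapped communicator.
        return getattr(self._comm, name)


class RecordingComm:
    """Hypothetical stand-in for a real communicator, for the demo only."""

    def __init__(self):
        self.calls = []

    def Send(self, buf, dest, tag=0):
        self.calls.append((len(buf), dest, tag))


if __name__ == "__main__":
    comm = CountingComm(RecordingComm())
    comm.Send(bytes(10), dest=1)
    comm.Send(bytearray(5), dest=2, tag=3)
    print(comm.messages_sent, comm.bytes_sent)
```

Only sends are counted here, on the reasoning that every transmitted byte is sent exactly once; each process would print its own total at the end, so the log stays tiny. Collective operations are not covered by this sketch.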

Suggestions for unit testing

Hello, another question concerning debugging: automatically generating test cases when I know the parameter set, and doing it all at once instead of during development (I could kick myself).
I have a set of parameters for my software that I wish to test (only ~12 parameters). However, these parameters are often integers, so for every parameter I can have 4 values that make sense (0, insanely huge, normally big, normally small).
Is there a way I can generate my test cases automatically? It would save me a lot of time. I already have to inspect every test case by hand, do I not? A lot of my program produces output to the console, so normal assertions probably won't work. Also, I work on home-made data structures most of the time, so I could not use a simple assertion.
My dream option would be a kind of reverse regular expression, where I set the rules and get a file generated that I can use as input (my software has a crude scripting language). That way I can assemble all the input files and test them one by one.
Looking forward to listening to your kind suggestions.
cheers
There are lots of ways to generate test cases in your scenario -- though you're a bit vague on what form the inputs for your programs and units need to take. For one of my Fortran programs I use a template input parameter file, a bash script and a make file. The make file, when called on the test phony target:
a) compiles the program;
b) runs the bash script, which uses sed to replace placeholders in the template parameter file, to create 128 (or whatever) test input files;
c) submits all the test jobs to the job management system on our cluster.
Once the jobs have finished, I have some other scripts to compare outputs with benchmarks, collect statistics, that sort of thing.
If you need more specific advice, post more specific questions.
EDIT: Using sed inside a bash script:
Suppose that the parameter input template file contains 3 codes to be replaced: $FREQ$, $NUM$ and $TOL$. Then I write a bash script with a 3-deep loop nest something like this:
for frq in 0.01 0.0 1 10
do
  for np in 1 2 4 8 16
  do
    for tol in 0.001 0.0001 0.00001
    do
      sed ....
    done
  done
done
It's not pretty but it works, and it saves me wrestling with much more sophisticated solutions such as xUnit testing or Python programming.
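For comparison, here is roughly the same template-filling loop as a short Python sketch. The $FREQ$, $NUM$ and $TOL$ placeholders come from the description above. The function name generate_inputs, the template text and the case_NNN.in file naming are assumptions for illustration, not part of the original setup.

```python
import itertools
import tempfile
from pathlib import Path


def generate_inputs(template, params, out_dir):
    """Write one input file per combination of parameter values."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = list(params)
    paths = []
    combos = itertools.product(*(params[name] for name in names))
    for n, combo in enumerate(combos):
        text = template
        for name, value in zip(names, combo):
            # Placeholders look like $NAME$ in the template.
            text = text.replace(f"${name}$", value)
        path = out_dir / f"case_{n:03d}.in"  # hypothetical naming scheme
        path.write_text(text)
        paths.append(path)
    return paths


if __name__ == "__main__":
    template = "freq = $FREQ$\nnum = $NUM$\ntol = $TOL$\n"
    params = {
        "FREQ": ["0.01", "1", "10"],
        "NUM": ["1", "2", "4", "8", "16"],
        "TOL": ["0.001", "0.0001", "0.00001"],
    }
    files = generate_inputs(template, params, tempfile.mkdtemp())
    print(len(files), "input files written")
```

Values are kept as strings so they land in the file exactly as written. Keep in mind that a full cross-product grows fast: the 12 parameters with 4 values each from the question would give 4**12 = 16,777,216 cases, so in practice you would probably vary only a few parameters at a time.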
I suggest you read something about data-driven unit testing.
There are lots of frameworks that can help you with that.
You may start here: http://www.slideshare.net/dnastacio/datadriven-unit-testing-for-java-1933154.
I see that you work with FORTRAN and you probably deal with one of FORTRAN's versions of xUnit. Being user of JUnit I'd suggest parameterized tests - see if the concept applies in your case.