Open MPI: how to run exactly 1 process per host - c++

Actually I have 3 questions. Any input is appreciated. Thank you!
1) How do I run exactly one process on each host? My application uses TBB for multi-threading. Does that mean I should run exactly one process per host for best performance?
2) My cluster has heterogeneous hosts. Some hosts have better CPUs and more memory than others. How do I map process ranks to physical hosts for work-distribution purposes? I am thinking of using the hostname. Is there a better way to do it?
3) How are process ranks assigned? Which process gets rank 0?

1) TBB splits loops across the threads of a thread pool to utilize all processors of one machine, so you should run only one process per machine; more processes would fight each other for processor time. The number of processes per machine is set by the slots options in your hostfile:
# my_hostfile
192.168.0.208 slots=1 max_slots=1
...
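With slots=1 per host, launching against that hostfile yields one process per host; for example (the executable name is a placeholder, and depending on your Open MPI version, mpirun may also offer --npernode 1 to force one process per node):
mpirun --hostfile my_hostfile -np 3 ./my_tbb_app   # 3 hosts listed, 1 slot each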
2) Giving each machine an amount of work appropriate to its performance is not trivial.
The easiest approach is to split the workload into small pieces of work, send them to the slaves, collect their answers, and hand out new pieces of work until you are done. Faster machines then finish their pieces sooner, ask for new work more often, and automatically receive a larger share. There is an example on my website (in German); you can also find some references to manuals and tutorials there.
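A minimal sketch of that master/worker loop in C++ with plain MPI (the tags, piece indexing, and placeholder computation are mine, not from the linked example; it assumes at least as many pieces as slaves):

#include <mpi.h>

enum { TAG_WORK = 1, TAG_STOP = 2 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_pieces = 1000;                 // placeholder workload size

    if (rank == 0) {                             // master
        int next = 0, active = 0;
        // hand one piece to every slave to start
        for (int dest = 1; dest < size && next < num_pieces; ++dest, ++next, ++active)
            MPI_Send(&next, 1, MPI_INT, dest, TAG_WORK, MPI_COMM_WORLD);
        while (active > 0) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < num_pieces) {             // more work: refill this slave
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {                             // done: tell the slave to stop
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {                                     // slave
        for (;;) {
            int piece;
            MPI_Status st;
            MPI_Recv(&piece, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = piece * 2.0;         // placeholder computation
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
}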
3) Each process obtains its number (processID) in your program via
MPI_Comm_rank(MPI_COMM_WORLD, &processID);
The master has processID == 0. The others may be given ranks in the order of the slots in your hostfile; another possibility is that they are assigned in the order in which the connections to the slaves are established. I don't know for sure.
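A minimal, self-contained example that prints each rank together with the host it runs on (useful for the hostname-based mapping idea in question 2); the program itself is a placeholder:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int processID, numProcesses;
    MPI_Comm_rank(MPI_COMM_WORLD, &processID);   // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);          // the host this rank runs on

    std::printf("rank %d of %d on %s\n", processID, numProcesses, host);

    if (processID == 0) {
        // master-only setup/distribution would go here
    }

    MPI_Finalize();
}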

Related

Dynamically Evaluate load and create Threads depending on machine performance

Hi, I have started to work on a project where I use parallel computing to split job loads among multiple machines, for tasks such as hashing and other mathematical calculations. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
i.e.: client 1 --> calculate(0 to 333)
client 2 --> calculate(334 to 666)
client 3 --> calculate(667 to 999)
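A minimal C++ sketch of that range split, generalized to any number of clients (function and names are illustrative, not from the original post):

#include <cstdio>

// Split [0, total) into contiguous, nearly equal ranges, one per client.
// The first (total % clients) clients receive one extra job each.
void split(int total, int clients)
{
    int base = total / clients, extra = total % clients, start = 0;
    for (int c = 0; c < clients; ++c) {
        int count = base + (c < extra ? 1 : 0);
        std::printf("client %d --> calculate(%d to %d)\n",
                    c + 1, start, start + count - 1);
        start += count;
    }
}

int main() { split(1000, 3); }   // prints the 0-333 / 334-666 / 667-999 split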
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if anyone knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
Start threads one by one, evaluating the CPU load each time, and stop when I reach a preset ceiling (50%, 75%, etc.). But this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(and this is the more complex option)
Run some kind of test thread, calculate its impact on the CPU base load, extrapolate the number of threads that can run on the machine, and then start threads and split the jobs accordingly.
Any ideas or pointers are welcome. Thanks in advance!
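As a baseline, the C++11 standard library can report the number of hardware threads directly, which is usually a better starting point than probing CPU load (a minimal sketch; the fallback policy is an assumption):

#include <iostream>
#include <thread>

int main()
{
    // Hardware threads (cores, including SMT); the standard allows 0
    // to be returned when the value cannot be determined.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                   // conservative fallback
    std::cout << "launching " << n << " worker threads\n";
    // For CPU-bound work such as hashing, one thread per hardware thread is
    // a sensible default; load probing mainly pays off for mixed workloads.
}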

Concurrency problems in netty application

I implemented a simple HTTP server (link), but the result of the test (ab -n 10000 -c 100 http://localhost:8080/status) is very bad (see test.png in the previous link).
I don't understand why it doesn't work correctly with multiple threads.
I believe that, by default, Netty's thread pool is configured with as many threads as there are cores on the machine, the idea being to handle requests asynchronously and without blocking (where possible).
Your /status test includes a database transaction, which blocks because of the intrinsic design of database drivers, etc. So your performance, at a high level, is essentially a result of:
a.) you are running a pretty hefty test of 10,000 requests, attempting to run 100 requests in parallel
b.) you are calling into a database for each request, so this will not be quick (relatively speaking, compared to some non-blocking I/O operation)
A couple of questions/considerations for you:
Machine Spec.?
What is the spec. of the machine you are running your application and test on?
How many cores?
If you only have 8 cores available then you will only have 8 threads running in parallel at any time. That means those batches of 100 concurrent requests will be queueing up.
Consider what is running on the machine during the test
It sounds like you are running the application AND Apache Bench on the same machine, so be aware that your application and the testing tool will both be contending for those cores (in addition to any background processes, such as the OS, also contending for them).
What will the load be?
Predicting load is difficult. If you do think you are likely to have 100 requests into the database at any one time, then you may need to think about:
a. your production environment needing a couple of instances to handle the load
b. changing the configuration of Netty's default thread pool to increase the number of threads
c. your application architecture: can you cache any of those results instead of going to the database for each request?
It may be linked to the use of database access (a synchronous task) within one of your handlers (at least in your TrafficShappingHandler)?
You might need to make your database calls asynchronous (using other threads in a producer/consumer arrangement, for instance)...
If it is something else, I do not have enough information...

Best approach for writing a Linux server in C (pthreads, select or fork?)

I have a very specific question about server programming on UNIX (Debian, kernel 2.6.32). My goal is to learn how to write a server which can handle a huge number of clients. My target is more than 30,000 concurrent clients (even though my colleague mentions that 500,000 are possible, which seems like quite a huge amount :-)), but I really don't know what is possible, and that is why I ask here. So my first question: how many simultaneous clients are possible? Clients can connect whenever they want, get in contact with other clients, and form a group (one group contains a maximum of 12 clients). They can chat with each other, so the TCP packet size varies depending on the message sent.
Clients can also send mathematical formulas to the server. The server will solve them and broadcast the answer back to the group. This is quite a heavy operation.
My current approach is to start up the server, then use fork to create a daemon process. The daemon process binds the socket fd_listen and starts listening in a while(1) loop, using accept() to get incoming connections.
Once a client connects, I create a pthread for that client which handles the communication. Clients get added to a group and share some memory (needed to keep the group running), but still every client runs on a different thread. Getting the access to that memory right was quite a hassle, but it works fine now.
At the beginning of the program I read the /proc/sys/kernel/threads-max file and create my threads according to that. The number of possible threads according to that file is around 5,000, far from the number of clients I want to be able to serve.
Another approach I am considering is to use select() and create fd sets. But the time to find a socket within a set is O(N), which can be quite long with more than a couple of thousand clients connected. Please correct me if I am wrong.
Well, I guess I need some ideas :-)
Greetings,
Markus
P.S. I tag it for C++ and C because it applies to both languages.
The best approach as of today is an event loop like libev or libevent.
In most cases you will find that one thread is more than enough, but even if it isn't, you can always have multiple threads with separate loops (at least with libev).
Libev[ent] uses the most efficient polling solution for each OS (and anything is more efficient than select or a thread per socket).
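For illustration, a minimal libev (4.x API) accept loop in C++; the port, buffer size, and per-client handling are placeholders:

#include <ev.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Called whenever a connected client becomes readable.
static void client_cb(struct ev_loop *loop, ev_io *w, int)
{
    char buf[512];
    ssize_t n = read(w->fd, buf, sizeof buf);
    if (n <= 0) {                        // client closed or error
        ev_io_stop(loop, w);
        close(w->fd);
        delete w;
        return;
    }
    // parse the message / broadcast to the group here
}

// Called when the listening socket has a pending connection.
static void accept_cb(struct ev_loop *loop, ev_io *w, int)
{
    int fd = accept(w->fd, nullptr, nullptr);
    if (fd < 0) return;
    ev_io *cw = new ev_io;               // one small watcher per client
    ev_io_init(cw, client_cb, fd, EV_READ);
    ev_io_start(loop, cw);
}

int main()
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5555);         // illustrative port
    bind(lfd, reinterpret_cast<sockaddr *>(&addr), sizeof addr);
    listen(lfd, SOMAXCONN);

    struct ev_loop *loop = EV_DEFAULT;   // libev 4.x API
    ev_io lw;
    ev_io_init(&lw, accept_cb, lfd, EV_READ);
    ev_io_start(loop, &lw);
    ev_run(loop, 0);                     // one thread handles all sockets
}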
You'll run into a couple of limits:
fd_set size: this is changeable at compile time but has quite a low limit by default; this affects select-based solutions.
Thread-per-socket will run out of steam far earlier. I suggest putting the long calculations in separate threads (with pooling if required), but otherwise a single-threaded approach will probably scale.
To reach 500,000 you'll need a set of machines, and round-robin DNS I suspect.
TCP ports shouldn't be a problem, as long as the server doesn't connect back to the clients. I always seem to forget this, and have to be reminded.
File descriptors themselves shouldn't be too much of a problem, I think, but getting them into your polling solution may be more difficult - certainly you don't want to be passing them in each time.
I think you can use the event model (epoll + a worker thread pool) to solve this problem.
First listen and accept in the main thread; when a client connects to the server, the main thread distributes the client_fd to one of the worker threads and adds it to that worker's epoll list. The worker thread then handles the requests from that client.
The number of worker threads can be configured to fit the problem, and it can be far fewer than the 5,000 thread limit mentioned above.
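A compressed C++ sketch of that epoll-plus-workers layout (error handling omitted; the worker count, port, and round-robin hand-off are illustrative):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <thread>
#include <vector>

constexpr int kWorkers = 4;              // e.g. one per core, far below 5000

void worker(int epfd)                    // each worker owns one epoll instance
{
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            char buf[512];
            ssize_t r = read(events[i].data.fd, buf, sizeof buf);
            if (r <= 0) close(events[i].data.fd);  // close() also drops it from epoll
            // else: parse and answer the request here
        }
    }
}

int main()
{
    std::vector<int> epfds;
    for (int i = 0; i < kWorkers; ++i) {
        epfds.push_back(epoll_create1(0));
        std::thread(worker, epfds.back()).detach();
    }

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5555);         // illustrative port
    bind(lfd, reinterpret_cast<sockaddr *>(&addr), sizeof addr);
    listen(lfd, SOMAXCONN);

    for (size_t next = 0;; next = (next + 1) % kWorkers) {
        int cfd = accept(lfd, nullptr, nullptr);   // accept in the main thread
        if (cfd < 0) continue;
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = cfd;
        epoll_ctl(epfds[next], EPOLL_CTL_ADD, cfd, &ev);  // round-robin hand-off
    }
}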

How to implement a master machine controlling several slave machines via Linux C++

Could anyone give some advice on how to implement a master machine controlling several slave machines in C++?
I am trying to implement a simple program that can distribute tasks from a master to slaves. It is easy to implement with one master + one slave machine. However, when there is more than one slave machine, I don't know how to design it.
If the solution can be used for both Linux and Windows, it would be much better.
You should use a framework rather than making your own. What you need to search for is cluster computing. One that might work easily is Boost.MPI.
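For a flavour of Boost.MPI, here is a minimal hello-style sketch (mine, not the answerer's; it links against boost_mpi and boost_serialization):

#include <boost/mpi.hpp>
#include <iostream>
#include <string>

namespace mpi = boost::mpi;

int main(int argc, char *argv[])
{
    mpi::environment env(argc, argv);    // wraps MPI_Init/MPI_Finalize
    mpi::communicator world;

    if (world.rank() == 0) {             // master: collect one message per slave
        for (int s = 1; s < world.size(); ++s) {
            std::string msg;
            world.recv(s, 0, msg);       // tag 0; std::string is serializable
            std::cout << msg << "\n";
        }
    } else {                             // slave: report in
        world.send(0, 0, std::string("slave ") +
                         std::to_string(world.rank()) + " ready");
    }
}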
With n machines, you need to keep track of which ones are free and, if none are free, of the load across your slaves (i.e. how many tasks have been queued up at each), and then queue on the least loaded machine (or whichever machine your algorithm deems best; say, better hardware means that some slaves perform better than others, etc.). I'd start with a simple distribution algorithm and then tweak it once it's working...
More interesting problems will arise in exceptional circumstances (i.e. slaves dying, and various such issues.)
I would use an existing messaging bus to make your life easier (rather than re-inventing one); the real intelligence is in the distribution algorithm and the management of failed nodes.
We need to know more, but basically you just need to make sure the slaves don't block each other. The details of doing that in C++ will get involved, but the first thing to do is ask yourself what the algorithm is. The simplest case is if you don't care about waiting for the slaves, in which case you have:
while still tasks to do
    launch a task on a slave
If you can have only one job running on a slave at a time, then you'll need something like an array of flags, one per slave:
slaves : array 0 to (number of slaves - 1)
initialize slaves to all FALSE

while not done
    find the first FALSE slave -- it's not in use
    set that slave to TRUE
    launch a job on that slave
    check for slaves that are done
        set that slave to FALSE
Now, if you have multiple threads, you can split that into two threads:
thread 1 -- dispatch jobs:
while not done
    find the first FALSE slave -- it's not in use
    set that slave to TRUE
    launch a job on that slave

thread 2 -- reap finished jobs:
while not done
    check for slaves that are done
    set that slave to FALSE
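A minimal C++ rendering of that flag-array bookkeeping, with std::async standing in for "launch a job on that slave" (the Slave type and the job itself are placeholders):

#include <chrono>
#include <future>
#include <vector>

struct Slave { int id; };                        // placeholder for a real slave handle

double run_job(Slave, int job) { return job * 2.0; }   // placeholder remote job

int main()
{
    const int num_slaves = 3, num_jobs = 10;
    std::vector<Slave> slaves(num_slaves);
    std::vector<bool> busy(num_slaves, false);   // the array of flags, all FALSE
    std::vector<std::future<double>> pending(num_slaves);

    int next_job = 0, finished = 0;
    while (finished < num_jobs) {
        // find FALSE slaves, set them to TRUE, launch jobs on them
        for (int s = 0; s < num_slaves && next_job < num_jobs; ++s)
            if (!busy[s]) {
                busy[s] = true;
                pending[s] = std::async(std::launch::async,
                                        run_job, slaves[s], next_job++);
            }
        // check for slaves that are done and set them back to FALSE
        for (int s = 0; s < num_slaves; ++s)
            if (busy[s] && pending[s].wait_for(std::chrono::milliseconds(10))
                               == std::future_status::ready) {
                pending[s].get();                // collect the result
                busy[s] = false;
                ++finished;
            }
    }
}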

Web application background processes, newbie design question

I'm building my first web application after many years of desktop application development (I'm using Django/Python but maybe this is a completely generic question, I'm not sure). So please beware - this may be an ultra-newbie question...
One of my user processes involves heavy processing on the server (i.e. the user inputs something, and the server needs ~10 minutes to process it). In a desktop application, what I would do is throw the user input into a queue protected by a mutex, and have a dedicated background thread running at low priority, blocking on the queue using that mutex.
However, in a web application everything seems to be oriented towards synchronous handling of HTTP requests.
Assuming I will use the database as my queue, what is the best-practice architecture for running a background process?
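For reference, the desktop pattern described in the question looks like this as a minimal C++11 sketch (placeholders throughout); the question is how to get the same decoupling behind a web stack:

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> jobs;            // user inputs waiting to be processed
std::mutex m;
std::condition_variable cv;

void worker()                            // the dedicated background thread
{
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !jobs.empty(); });  // block until work arrives
        std::string job = jobs.front();
        jobs.pop();
        lock.unlock();
        // ... the ~10 minutes of heavy processing would happen here ...
    }
}

void submit(const std::string &input)    // called wherever user input arrives
{
    {
        std::lock_guard<std::mutex> lock(m);
        jobs.push(input);
    }
    cv.notify_one();
}

int main()
{
    std::thread(worker).detach();
    submit("user input");
    std::this_thread::sleep_for(std::chrono::seconds(1));  // let the sketch run
}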
There are two schools of thought on this (at least).
Throw the work on a queue and have something else outside your web-stack handle it.
Throw the work on a queue and have something else in your web-stack handle it.
In either case, you create work units in a queue somewhere (e.g. a database table) and let some process take care of them.
I typically go with option 1, where I have a dedicated Windows service that takes care of these things. You could also do this with SQL jobs or something similar.
The advantage of option 2 is that you can more easily keep all your code in one place (in the web tier). You'd still need something that triggers the execution (e.g. loading a web page that processes work units, with a sufficiently high timeout), but that could easily be accomplished with various mechanisms.
Since 1) this is a common problem and 2) you're new to your platform, I suggest that you look in the contributed libraries for your platform to find a solution that handles the task. In addition to queuing and processing the jobs, you'll also want to consider:
1) Status communication between the worker and the web stack. This will enable web pages that show the percentage-complete figure for the job, assure the user that the job is progressing, etc.
2) How to ensure that the worker process does not die.
3) If a job has an error, will the worker process automatically retry it periodically?
Will you or an operations person be notified if a job fails?
4) As the number of jobs increases, can additional workers be added to gain parallelism?
Or, even better, can workers be added on other servers?
If you can't find a good solution in Django/Python, you can also consider porting a solution from another platform to yours. I use delayed_job for Ruby on Rails. The worker process is managed by runit.
Regards,
Larry
Speaking generally, I'd look at running background processes on a different server, especially if your web server has any kind of load.
Running long processes in Django: http://iraniweb.com/blog/?p=56