Load balancing in Open MPI? [duplicate] - C++

What I want to ask is this: I have several steps of a large source code, and each step has a virtual computation time and a virtual amount of communication data (I say "virtual" because I want to model the latency, and I have managed to measure both throughout the source code). I need to test this by making the code sleep for the computation time and transferring an amount of data equivalent to the communication data. Can you suggest some configuration models for this? My aim is to minimize the overall execution time of the program, so I obviously want to reduce the overhead that the processes incur.
The simplest approaches that come to mind are:
Do the computation on all processes and send the virtual data to the root process with asynchronous calls (see the sketch below).
Do the same with synchronous calls.
Assume the communication time is linear in the communication data. Use some algorithm to divide the tasks up front among the processes (inspired by load balancing).
Start with the first task on the root process, send its data to the next process, sleep on that process, and so on.
Can you give me some ideas, or verify whether the choice of strategy makes a big difference?
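As a minimal sketch of the first (asynchronous) variant, assuming the virtual compute time is simulated with usleep and the virtual data is just a byte buffer: every non-root rank sleeps for its compute time and posts a non-blocking MPI_Isend of its virtual data to the root, which collects the messages. The compute time and data volume below are placeholders for your measured values.

// Sketch: simulate "compute, then send virtual data to root asynchronously".
#include <mpi.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const useconds_t compute_us = 100000;   // placeholder virtual compute time
    const int        nbytes     = 1 << 20;  // placeholder virtual data volume

    if (rank != 0) {
        usleep(compute_us);                       // "do" the computation
        std::vector<char> data(nbytes, 1);        // the virtual data
        MPI_Request req;
        MPI_Isend(data.data(), nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
        // Further "work" could overlap here before the send completes.
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {
        usleep(compute_us);                       // root computes too
        std::vector<char> buf(nbytes);
        for (int src = 1; src < size; ++src)
            MPI_Recv(buf.data(), nbytes, MPI_BYTE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("all virtual data collected at root\n");
    }

    MPI_Finalize();
    return 0;
}

Timing the same run against the blocking variant (MPI_Send) or a static split, using MPI_Wtime, would show directly whether the choice of strategy matters for your measured compute/communication profile.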

Related

Launching all threads at exactly the same time in C++

I have a rosbag file that contains messages on various topics; each topic has its own frequency. The data was captured from a hardware device streaming data, and data from all topics would "arrive" at the same time to be used by different algorithms.
I wish to simulate this using the rosbag file (think of each topic as having an associated array of data), and it is imperative that the data-streaming processes start at the same time so that the data stays in sync.
I do this by launching different publishers on different threads (I am open to other approaches as well; this was the only one I could think of), but the threads do not start at the same time: by the time thread 3 starts, thread 1 is already considerably ahead.
How may I achieve this?
Edit - I understand that launching at exactly the same time is not possible, but maybe I can get away with launches that are extremely close to each other. Is there any way to ensure this?
Edit 2 - Since the main aim is to get the data streams in sync, I was wondering about the warm-up effect of a thread (suppose thread 1 starts at 3.3 GHz and has reached 4.2 GHz by the time thread 2 starts at 3.2 GHz). Would this have a significant effect? (I can always warm the threads up before starting the publishing process, but I am curious whether it would have a pronounced effect.)
TIA
As others have stated in the comments, you cannot guarantee that threads launch at exactly the same time. To address your overall goal: you're going about this the wrong way from a ROS perspective. Instead of manually publishing the data and trying to keep it in sync, you should use the rosbag API. That way you can actually guarantee that messages have the same timestamp. Note that this doesn't guarantee they will be sent out at exactly the same time, because they won't. You can write a message into a bag file directly like this:
import rosbag
from std_msgs.msg import Int32, String

bag = rosbag.Bag('test.bag', 'w')
try:
    s = String()
    s.data = 'foo'

    i = Int32()
    i.data = 42

    bag.write('chatter', s)
    bag.write('numbers', i)
finally:
    bag.close()
For more complex message types that include a Header field, simply set the header.stamp field to keep the timestamps consistent.
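If you do still want the publisher threads to begin as close together as possible (per the question's first edit), the usual trick is to create all the threads first and block each of them on a shared start gate, then release them all at once. Below is a minimal sketch using std::condition_variable; the publish_topic call is a hypothetical stand-in for the actual publishing loop, not part of any ROS API.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
bool go = false;  // the start gate

void publisher(int topic_index) {
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return go; });  // block until the gate opens
    }
    // publish_topic(topic_index);  // hypothetical publishing loop
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 3; ++i)
        threads.emplace_back(publisher, i);

    {
        std::lock_guard<std::mutex> lock(m);
        go = true;  // open the gate only after every thread exists
    }
    cv.notify_all();  // all threads resume within a very narrow window

    for (auto& t : threads) t.join();
}

The threads still will not start at the exact same instant, but the skew is reduced to the wake-up latency of the condition variable rather than the cost of creating the threads one by one.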

Dynamically evaluate load and create threads depending on machine performance

Hi, I have started working on a project where I use parallel computing to split job loads, such as hashing and other mathematical calculations, among multiple machines. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model, where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
i.e.: client 1 --> calculate(0 to 333)
client 2 --> calculate(334 to 666)
client 3 --> calculate(667 to 999)
I wanted to further improve speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know whether any of you knows a way to evaluate the load a thread puts on the CPU and to extrapolate the number of threads that can run concurrently on the machine (a baseline sketch follows below).
There are two ways I can see of doing this:
I start threads one by one, evaluating the CPU load each time and stopping when I reach a predefined ceiling (50%, 75%, etc.). But this has the flaw that I have to stop and re-split the job every time I start a new thread.
(And this is the more complex one:)
Run some kind of test thread, calculate its impact on the CPU's base load, extrapolate the number of threads that can run on the machine, and then start the threads and split the jobs accordingly.
Any ideas or pointers are welcome. Thanks in advance!
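As a baseline, and under the assumption that the work is CPU-bound, the usual starting point is to size the pool from std::thread::hardware_concurrency() and split the job range evenly, then refine that count with the load measurements described above. This is only a sketch; do_chunk and the job count are placeholders for the real per-client work.

#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder for the real work: process jobs in [begin, end).
void do_chunk(int begin, int end) {
    for (int job = begin; job < end; ++job) {
        // calculate(job);
    }
}

int main() {
    // Hint for how many threads the hardware can run concurrently (may be 0).
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;

    const int total_jobs = 1000;                      // placeholder job count
    const int per_thread = (total_jobs + n - 1) / n;  // ceiling division

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        int begin = static_cast<int>(t) * per_thread;
        int end   = std::min(total_jobs, begin + per_thread);
        if (begin >= end) break;
        pool.emplace_back(do_chunk, begin, end);
    }
    for (auto& th : pool) th.join();

    std::printf("ran %zu worker threads\n", pool.size());
}

hardware_concurrency() only reports logical cores; measuring the actual CPU load while ramping threads up, as in the first approach, is still the way to account for whatever else the machine happens to be running.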

Client-server: what is the limit on the server's process run time?

I have a program that receives input and returns an output, with a run time of about 2 seconds.
Now I want to run this program online, on a server that can handle multiple connections (let's say up to 100k).
On each client-server session the program will launch;
the client will hand the server the program's input and will wait for the program to finish in order to receive the server's response (the program's output).
Let's say the server's host is a very powerful machine - e.g. 16 cores.
Can this work, or is that too much run time per client?
What is the maximum run time this kind of program can have?
I'm posting this as an answer because it's too large to place as a comment.
Can this work? It depends. It depends because there are a lot of variables in this problem. Let's look at some of them:
You say it takes 2 seconds to compute a result. Where and how are those seconds spent? Is this pure computation, or are you accessing a database or the file system? Is this CPU-bound or I/O-bound? If you run computations for the full 2 seconds, then you are consuming CPU, which means you can simultaneously serve only 16 clients, one per core (a back-of-the-envelope calculation follows this list). Are you hitting a database? Is it on the powerful server or on some other machine? If the database is the bottleneck, then move it to the powerful server and put SSD drives in it.
Can you improve processing for one client? What's more efficient: doing the processing on one core, or spreading it across all the cores? If you can parallelize, can you limit thread contention?
Is CPU all you need? How about memory? Any backend services you access? Are those over the network? Do you have enough bandwidth?
Related to memory: what language/platform are you using? Does it have a garbage collector? Do you generate a lot of objects to compute a result? Does the GC kick in and pause your application while it cleans up and compacts memory? Do you allocate enough memory for the application to run?
Can you cache responses and serve them to other clients, or is each response custom to its client? Can you precompute the results and then just serve them, or can't you predict the inputs?
Have you tried running some performance tests and profiling the application to see where hotspots show up? Can you do something about them?
Do you have any imposed performance criteria? How many clients do you want to support simultaneously? Is 2 seconds too much? Can clients live with more? How much more? At how many seconds does the response time become unacceptable?
Do you need one big server to run this setup, or would several smaller ones work better (i.e. scale horizontally instead of vertically)?
etc
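To make the first point concrete, a rough back-of-the-envelope calculation, assuming the full 2 seconds are pure CPU time: 16 cores each finishing one request every 2 seconds gives a sustained throughput of about 16 / 2 = 8 requests per second. If anywhere near 100k clients were waiting at once, draining that backlog would take roughly 100,000 / 8 = 12,500 seconds (about 3.5 hours) for the last client. So either the per-request CPU time must come down, results must be cached or precomputed, or the work must be spread across more machines.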
Nobody can answer this for you. You have to do an analysis of your application, run some tests, profile it, optimize it, then repeat until you are satisfied with the results.

Will performance suffer when a database procedure is called from the application many times?

In my project we call an Oracle procedure from our C++ application with the help of the Pro*C/C++ library provided by Oracle.
We have one big procedure, and my idea is to split it into two for modularity. But their advice is to call the procedure only once and perform all the jobs in one shot.
The reason I got from them is that it will hurt performance, since the application program would then interact with the database multiple times.
I agree that this happens when the application connects to the database, calls the procedure, and finally disconnects from the database for each procedure call. But what we actually do is create a pool of connections at startup and reuse those pre-established database connections for interacting with the database.
Information about my application:
It is a multi-threaded application that handles about 1000 requests per second with a thread pool of size 20. Currently, for each request we communicate with the database 4 times.
EDIT:
"the switch between PLSQL and SQL is much faster than the other way around".
Q1. How is this related to my actual question? My question is about splitting a procedure into two equal parts. Say I have 4 queries executed in the procedure; I just split it into procedure A and procedure B, and each procedure then has two queries.
"pro*c calls which call PLSQL is an performance hit".
Q2. Do you mean the communication between the application (Pro*C/C++) and the database (Oracle) here? If so, is that communication a big performance hit?
In the Ask Tom link you attached: "But don't be afraid at all to invoke SQL from PLSQL - that is what PLSQL does best".
Q4. Does a context switch happen when we call SQL from PL/SQL? Because, going by the statement above, it seems to have no performance impact.
The advice you were given is correct: it would be better to perform all the database tasks at once. There are two major performance impacts in your scenario:
The context switching in Pro*C between the SQL engine and the PL/SQL engine when your threads make the call multiple times. This is usually the biggest issue with many PL/SQL calls from a client application.
The network stack overhead (TNS) in the communication between your Pro*C app and the database engine, particularly if your app is on a different physical host.
Having said that, since you are creating a connection pool at the application end, the TNS listener should also have a pool of bequeathed server shadow processes waiting for each network connection (this is set up in listener.ora).
The OCI login/logoff when a shadow process is already waiting for the connection is very quick and not a huge factor in latency - I don't worry about this unless a new shadow process has to start up on the server, because then it can be a very expensive call. As you are using connection pooling on the client side this is usually not an issue, but it is something to consider because of the threading in your calls. Once you exhaust the pool of server shadow processes, you will notice a huge degradation if the TNS listener has to start up more of them.
Edit in answer to the new questions:
It is very much related. As pointed out previously, you should minimise the number of PL/SQL and SQL calls made from your C++ app. Each PL/SQL call from your C++ app invokes the SQL engine, which then invokes the PL/SQL engine for the procedure call. So if you split your procedure into two, you double the SQL-to-PL/SQL context switches, which is the more expensive switch, as outlined in the Tom Kyte article and confirmed by my own experience.
This is answered in 1. But as I said before, the communications overhead is secondary unless your hosts are on different physical networks; it also depends on the types of data you are transferring. For instance, large C++ object parameters and large Oracle result sets with many calls will obviously affect communication latency through extra round trips. Remember that with more PL/SQL calls you are also adding more SQL*Net traffic for the setup of each call and result set.
There is no question 3.
PL/SQL to SQL within the PL/SQL engine is negligible, so don't get hung up on it. Put all your SQL calls within one PL/SQL call for maximum throughput. Don't split the calls just to be more elegant at the expense of performance.
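To illustrate the "one round trip" advice, a single anonymous PL/SQL block issued from Pro*C/C++ can run the whole job in one call. This is only a sketch: the procedure do_all_jobs and its parameters are hypothetical, and it assumes a pooled connection is already established.

/* Minimal Pro*C/C++ sketch. One EXEC SQL EXECUTE block means one network
   round trip and one SQL-to-PL/SQL context switch, instead of one per
   split procedure. do_all_jobs(:p_in, :p_out) is a hypothetical wrapper
   around all the individual queries. */
EXEC SQL BEGIN DECLARE SECTION;
    int p_in;
    int p_out;
EXEC SQL END DECLARE SECTION;

void run_all_in_one_call(int input)
{
    p_in = input;

    EXEC SQL EXECUTE
        BEGIN
            do_all_jobs(:p_in, :p_out);
        END;
    END-EXEC;

    /* p_out now holds the combined result. */
}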

How to set up a ZeroMQ architecture to deal with workers of different speeds

[As a small bit of context: I am new to networking and ZeroMQ, but I did spend quite a bit of time on the guide and the examples.]
I have the following challenge (in C++, but that is irrelevant to the question). I have a single source that generates tasks, and multiple engines that need to process those tasks and send back the results.
First attempt:
I created a client with a ZMQ_PUSH socket; the engines have a ZMQ_PULL socket. To get the answers back to the client, I created the reverse: a ZMQ_PUSH socket on the workers and a ZMQ_PULL socket on the client. It worked out of the box - only for me to find out that after some time the client ran out of memory, since I was pushing far more requests than the workers could process. I need some backpressure.
Second attempt:
I added a counter on the client that only pushed when no more than, say, 1000 tasks were 'in progress'. The out-of-memory issue was solved, since I never had more than 1000 tasks 'in progress'. But... some workers were slower than others. Since PUSH/PULL uses fair queueing, the amount of work for a slow worker kept increasing and increasing... until the slowest worker had all 1000 requests queued and the others were starved. I was not using my workers effectively.
Now, what architecture could I use that solves the problem of workers with different speeds? Is the 'count the number of in-progress tasks' approach a good way of limiting the number of pushed requests? Or is there a way I can PUSH tasks to the workers such that the push blocks at a predefined point? Can I do that with the HWM (high-water mark)?
I am sure this problem is generic enough that I should be able to deal with it easily. Can anyone point me in the right direction?
Thanks!
We used the Paranoid Pirate Protocol (http://rfc.zeromq.org/spec:6), but in the case of many very small jobs, where the overhead of communication might be high, a credit-based flow-control pattern might be more efficient: http://unprotocols.org/blog:15
In both cases it is necessary for the requester to assign jobs directly to individual workers. This is abstracted away, of course, and depending on the use case it could be made available as a synchronous call that returns when all tasks have been processed.
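A minimal sketch of the direct-assignment idea, using the plain libzmq C API from C++: each worker asks for one task at a time over a REQ socket, and the source replies over a ROUTER socket only to workers that have asked, so a slow worker never accumulates a backlog. The endpoint, message bodies, and task count are placeholders, and error handling is omitted.

// worker.cpp - requests one task at a time, so it is never over-committed.
#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx  = zmq_ctx_new();
    void* sock = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(sock, "tcp://localhost:5555");   // placeholder endpoint

    while (true) {
        zmq_send(sock, "READY", 5, 0);           // ask the source for one task
        char task[256];
        int n = zmq_recv(sock, task, sizeof task - 1, 0);
        if (n < 0) break;
        if (n > (int)sizeof task - 1) n = (int)sizeof task - 1;
        task[n] = '\0';
        std::printf("processing: %s\n", task);   // do the real work here
    }
    zmq_close(sock);
    zmq_ctx_term(ctx);
}

// source.cpp - hands out tasks only to workers that asked for one.
#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx    = zmq_ctx_new();
    void* router = zmq_socket(ctx, ZMQ_ROUTER);
    zmq_bind(router, "tcp://*:5555");

    for (int task = 0; task < 1000; ++task) {    // placeholder task count
        // A REQ request arrives as [identity][empty delimiter][body].
        char identity[256], delim[8], body[64];
        int id_len = zmq_recv(router, identity, sizeof identity, 0);
        zmq_recv(router, delim, sizeof delim, 0);
        zmq_recv(router, body,  sizeof body,  0);

        // Reply with the next task, addressed to exactly that worker.
        char out[64];
        int out_len = std::snprintf(out, sizeof out, "task %d", task);
        zmq_send(router, identity, id_len, ZMQ_SNDMORE);
        zmq_send(router, "", 0, ZMQ_SNDMORE);
        zmq_send(router, out, out_len, 0);
    }
    zmq_close(router);
    zmq_ctx_term(ctx);
}

This is essentially the load-balancing pattern from the ZeroMQ guide; the Paranoid Pirate Protocol mentioned above adds heartbeating and retries on top of the same structure.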