I am using the non-blocking MPI_Isend/MPI_Irecv calls in conjunction with repeated MPI_Test calls to achieve communication/computation overlap. I have noticed that whether or not I achieve overlap depends on the frequency at which I call MPI_Test. It seems that if I call it more frequently than a certain threshold frequency, MPI_Test suddenly becomes blocking and uses the thread's resources to progress data. If I call it less frequently than the threshold, then MPI_Test is non-blocking and data progression is masked by the computations. I am not sure what this threshold is, but I've compared the extreme case of calling MPI_Test every iteration against every 500 iterations in my code. The former yields no overlap, and the latter yields complete overlap.
Can someone advise me on whether this is an issue that is commonly encountered and how I can go about computing an optimal MPI_Test frequency for a given function? I imagine that it would be a function of both the size of data transferred and the size of the loop in which my computations occur.
I have a loop in C++ that I would like to run for a few seconds. Although the amount of work on every iteration is different, from a few microseconds to seconds, it is ok to stop between iterations. It is high-performance code so I would like to avoid calculating time difference on each iteration:
while (!status.has_value())
{
// do something
// this adds extra delays that I would like to avoid
if (duration_cast<seconds>(system_clock::now() - started).count() >= limit)
status = CompletedBy::duration;
}
What I'm thinking is maybe there is a way to schedule signal and then stop the loop when it happens instead of checking the time difference on every iteration.
BTW, the loop may exit before the signal.
I have done something similar, but in Java. The general idea is to use a separate thread to manage a sentinel value, making your loop look like...
okayToLoop = true;
// code to launch thread that will wait N milliseconds, and then negate okayToLoop
while((!status.hasValue()) AND (okayToLoop)) {
// loop code
}
The "cautionary note" is that many sleep() functions for threads employ "sleep at least" semantics, so if it is really important to only sleep N milliseconds, you'll need to address that in your thread implementation. But, this avoids constantly checking the duration for each iteration of the loop.
Note that this will also allow the current iteration of the loop to finish, before the sentinel value is checked. I have also implemented this approach where the "control thread" actually interrupts the thread on which the loop is executing, interrupting the iteration. When I've done this, I've actually put the loop into a worker thread.
Any form of inter-thread communication is going to be way slower than a simple query of a high performance clock.
Now, steady_clock::now() might be too slow in the loop.
Using OS specific APIs, bind your thread to have ridiculous priority and affinity for a specific CPU. Or use rdtsc, after taking into account everything that can go wrong. Calculate what value you'd expect to get if (a) something went wrong, or (b) you have passed the time threshold.
When that happens, check steady_clock::now(), see if you are close enough to being done, and if so finish. If not, calculate a new high performance clock target and loop again.
So I have a Kinect program that has three main functions that collect data and saves it. I want one of these functions to execute as much as possible, while the other two run maybe 10 times every second.
while(1)
{
...
//multi-threading to make sure color and depth events are aligned -> get skeletal data
if (WaitForSingleObject(colorEvent, 0) == 0 && WaitForSingleObject(depthEvent, 0) == 0)
{
std::thread first(getColorImage, std::ref(colorEvent), std::ref(colorStreamHandle), std::ref(colorImage));
std::thread second(getDepthImage, std::ref(depthEvent), std::ref(depthStreamHandle), std::ref(depthImage));
if (WaitForSingleObject(skeletonEvent, INFINITE) == 0)
{
first.join();
second.join();
std::thread third(getSkeletonImage, std::ref(skeletonEvent), std::ref(skeletonImage), std::ref(colorImage), std::ref(depthImage), std::ref(myfile));
third.join();
}
//if (check == 1)
//check = 2;
}
}
Currently my threads are making them all run at the same exact time, but this slows down my computer a lot and I only need to run 'getColorImage' and 'getDepthImage' maybe 5-10 times/second, whereas 'getSkeletonImage' I would want to run as much as possible.
I want 'getSkeletonImage' to run at max frequency (~30 times/second through the while loop) and then the 'getColorImage' and 'getDepthImage' to time synchronize (~5-10 times/second through the while loop)
What is a way I can do this? I am already using threads, but I need one to run consistently, and then the other two to join in intermittently essentially. Thank you for your help.
Currently, your main loop is creating the threads every iteration, which suggests each thread function runs once to completion. That introduces the overhead of creating and destroying threads every time.
Personally, I wouldn't bother with threads at all. Instead, in the main thread I'd do
void RunSkeletonEvent(int n)
{
for (i = 0; i < n; ++i)
{
// wait required time (i.e. to next multiple of 1/30 second)
skeletonEvent();
}
}
// and, in your main function ....
while (termination_condition_not_met)
{
runSkeletonEvent(3);
colorEvent();
runSkeletonEvent(3);
depthEvent();
}
This interleaves the events, so skeletonEvent() runs six times for every time depthEvent() and colorEvent() are run. Just adjust the numbers as needed to get required behaviour.
You'll need to design the code for all the events so they don't run over time (if they do, all subsequent events will be delayed - there is no means to stop that).
The problem you'll then need to resolve is how to wait for the time to fire the skeleton event. A process of retrieving clock time, calculating how long to wait, and sleeping for that interval will do it. By sleeping (the thread yielding its time slice) your program will also be a bit better mannered (e.g. it won't be starving other processes of processor time).
One advantage is that, if data is to be shared between the "events" (e.g. all of the events modify some global data) there is no need for synchronisation, because the looping above guarantees that only one "event" accesses shared data at one time.
Note: your usage of WaitForSingleObject() indicates you are using windows. Windows (except, arguably CE in a weak sense) is not really a realtime system, so does not guarantee precise timing. In other words, the actual intervals you achieve will vary.
It is still possible to restructure to use threads. From your description, there is no evidence you really need anything like that, so I'll leave this reply at that.
I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}
This looks to me as a producer consumer pattern, which can be implented with two semaphores, one guarding the queue overflow, the other the empty queue.
You can find some details here.
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
if( waitResult - WAIT_OBJECT_0 != 0 )
return;
//....
}
Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few micro seconds range). Then there is not much you can do except review your algorithm to split larger chunks of precessing; OpenMP won't help and every other synchronization techniques won't help either because they fundamentally all rely on events (spin loops do not qualify).
On the other hand, if your processing time of the 2500 loop iterations is larger than 100 micro seconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iteration will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using vs 2008, I'd suggest looking at OpenMP. If you're using visual studio 2010 beta 1, I'd suggesting looking at the parallel pattern library, particularly the "parallel for" / "parallel for each"
apis or the "task group class because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf the f1 profiler in visual studio 2010 beta 1 (it has 2 new concurrency modes which help see contention) or Intel's vtune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. that way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but should not be that expensive, unless your processing is really fast to achieve. In this case, switching between thredas is expensive too, hence my answer first part on doing things sequencially ...
You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer cores/processors to parralellize some processing essencialy sequential.
Take that your have 4 cores and 10000 loops to compute as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads thr nth, nth+1, nth+2, nth+3 loops, wait for the four threads to complete then going on. You should use a rendezvous or barrier (a synchronization mechanism that wait for n threads to complete). Boost has such a mechanism. You can look the windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time. Not the event part.
It appears that this whole event thing
(along with the synchronization
between the threads using critical
sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest some debug
to count the number of times that your threadfunction is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work. But it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are idle processors always. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.
This must be an easy question but I can't find a properly answer to it.
I'm coding on VS-C++. I've a custom class 'Person' with attribute 'height'. I want to call class method Grow() that starts a timer that will increment 'height' attribute every 0.5 seconds.
I'll have a StopGrow() that stops the timer and Shrink() that decrements instead of increment.
I really need a little push on which timer to use and how to use it within Grow() method. Other methods must be straight forward after knowing that.
That's my first question here so please be kind (and warn me if I'm doing it wrong :) Forgive my English, not my first language.
Do you really need to call the code every half second to recalculate a value? For most scenarios, there is another much simpler, faster, effective way.
Don't expose a height member, but use a method such as GetHeight(), which will calculate the height at the exact moment you need it.
Your Grow() method would set a base height value and start time and nothing else. Then, your GetHeight() method would subtract the starting time from the current time to calculate the height "right now", when you need it.
No timers needed!
Since you're on Windows, the simplest solution is probably to use the GetTickCount() function supplied by Windows.
There isn't a good timer function in the C++ language with a precision guaranteed to be less than a second.
So instead, include the windows.h header, and then call GetTickCount() to get a number of milliseconds. The next time you call it, you simlpy subtract the two values, and if the result is over 500, half a second has elapsed.
Alternatively, if you want to block the thread for half a second, use the Sleep(n) function, where n is the number of milliseconds you want the thread to sleep. (500 in your case)
You might want to take a look at CreateTimerQueue() and CreateTimerQueueTimer(). I've never personally used them, but they would probably fit the bill.
I currently spawn a thread that is responsible for doing timer based operations. It calls WaitForSingleObject() on a manual-reset event with a 10ms timeout. It keeps an internal collection of callbacks in the form of pointer-to-method and objects that the callbacks are invoked for. This is all hidden behind a singleton that provides a scheduler interface that let's the caller schedule method calls on the objects either after a timer expiration or regularly on an interval. It looks like the two functions that I mentioned should give you pretty much the same functionality... hmmm... might be time to revisit that scheduler code... ;-)
Sleep() an the normal timer event run off a 10ms clock.
For high resolution timer events on windows use high resolution timers
Not an easy question at all! You have at least two possibilities:
create a thread that will execute a loop: sleep 0.5s, increase height, sleep 0.5s, increase height, etc.
invert flow of control and pass it to some framework like Boost::Asio that will call your timer handler in every 0.5s.
In order to make the right decision you have to think about your whole application. Does it compute something (then maybe threads)? Does it interact with the user (then maybe event driven)? Each approach has some gotchas:
When you use threads you have to deal with locking, which can be tricky.
When you do event-driven stuff, you have to write asynchronous handlers, which can be tricky.