PPL - How to configure the number of native threads? - c++

I am trying to control the number of native threads in PPL by using its Scheduler class. Here is my code:
for (int i = 0; i < 2000; i ++)
{
// configure concurrency count 16 to 32.
concurrency::SchedulerPolicy policy = concurrency::SchedulerPolicy(2, concurrency::MinConcurrency, 16,
concurrency::MaxConcurrency, 32);
concurrency::Scheduler *pScheduler = concurrency::Scheduler::Create(policy);
HANDLE hShutdownEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
pScheduler->RegisterShutdownEvent(hShutdownEvent);
pScheduler->Attach();
//////////////////////////////////////////////////////////////////////////
//for (int i = 0; i < 2000; i ++)
{
concurrency::create_task([]{
concurrency::wait(1000);
OutputDebugString(L"Task Completed\n");
});
}
//////////////////////////////////////////////////////////////////////////
concurrency::CurrentScheduler::Detach();
pScheduler->Release();
WaitForSingleObject(hShutdownEvent, INFINITE);
CloseHandle(hShutdownEvent);
}
The SchedulerPolicy usage is taken from MSDN, but it does not work as I expected. I expected PPL to launch 16 to 32 threads to execute the 2000 tasks, but in fact:
Judging by the speed of the output, only one task is processed per second. I also tried commenting out the outer for loop and uncommenting the inner one, but that causes about 300 threads to be created, which is still wrong. The longer I wait, the more threads are created.
Any ideas on the correct way to configure concurrency in PPL?

It turns out that I should not call concurrency::wait inside the task body. PPL uses a work-stealing scheduler: when the current task is suspended by wait, the scheduler starts running the remaining tasks in the queue to make the most of the available computing resources.
In my real project the task bodies passed to concurrency::create_task do actual computation, so PPL no longer creates hundreds of threads.
Also, SchedulerPolicy configures the number of virtual processors that PPL may use to process tasks, which is not always the same as the number of native threads PPL creates.
Say my CPU has 8 virtual processors: by default PPL creates just 8 threads in the pool. But when some of those threads are suspended by a wait or a lock while more tasks are pending in the queue, PPL immediately creates more threads to execute them (as long as the virtual processors are not fully loaded).

Related

Zombie/trash thread on MacOS with Rosetta

I have a very weird issue with a zombie/garbage thread crashing the process although this thread should have been joined a while ago.
The application is written in C++, built for x86_64, and runs on macOS under Rosetta. Googletest runs the app multiple times: for every test it initializes the engine, performs the test and uninitializes the engine. No new processes are started; everything happens within the googletest process. In roughly one out of ten googletest runs, a thread that my application did not create during the current run crashes with an invalid resource access, which triggers SIGABRT and brings down the whole process.
Threads in the pool are created like this during thread pool construction, done once during application initialization:
// threadsNum is a constant, e.g. 8
for (size_t i = 0; i < threadsNum; ++i){
// this->workers has type std::vector< std::thread > workers;
workers.emplace_back(
[this, i]
{
workerThreadFunction(i);
}
);
}
workerThreadFunction() sets the thread name and enters an infinite loop to do its job, checking in every iteration whether it should stop:
for (;;) {
// Mutex lock to access this->stop
std::unique_lock<std::mutex> lock(this->queue_mutex);
if (this->stop) {
break;
}
...
}
During thread pool destruction all threads are joined:
for(std::thread &worker: workers) {
worker.join();
}
To sum up:
Application starts 8 threads for a specific task using a thread pool
Threads have their names for debugging purposes - e.g. "MyThread 1", "MyThread 2", etc.
I have verified that all threads are joined upon thread pool destruction (threadX.join() returns)
Application shutdown is clean after test iteration execution before the next test with the crash starts
In problematic runs with the crash there are more threads present than application has created - one of the threads is present two times (thread name duplicated). Crash dump also shows there are two threads with the same name and one of them has crashed.
That duplicated thread has corrupt stack and crashes due to invalid resource access (locking this->queue_mutex to be precise)
Additional main thread sleep for e.g. 100ms after engine uninitialization does not help - does not look like a timing issue
To me it looks as if that thread somehow survived join() and reappeared in the process, but I cannot imagine how that could be possible.
The question is, am I missing something here? Are there any tools to debug this issue besides what I have already done?
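For reference, the pool lifecycle described above can be condensed into the following minimal sketch. Only workers, queue_mutex, stop and workerThreadFunction come from the fragments in the question; the class name ThreadPool, the enqueue() helper, the jobs queue and the wakeup condition variable are assumptions added so that the example is complete. It shows the intended start/stop/join sequence only and does not reproduce the crash.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    // Start a fixed number of workers once, at construction time.
    explicit ThreadPool(std::size_t threadsNum) {
        for (std::size_t i = 0; i < threadsNum; ++i) {
            workers.emplace_back([this, i] { workerThreadFunction(i); });
        }
    }

    // Hypothetical helper (not in the question): hand a job to the workers.
    void enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            jobs.push(std::move(job));
        }
        wakeup.notify_one();
    }

    // Set the stop flag under the mutex, wake every worker, then join them all.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            stop = true;
        }
        wakeup.notify_all();
        for (std::thread &worker : workers) {
            worker.join();
        }
    }

private:
    void workerThreadFunction(std::size_t /*index*/) {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                wakeup.wait(lock, [this] { return stop || !jobs.empty(); });
                if (jobs.empty()) {
                    break;  // only reached when stop is set and no work is left
                }
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();  // run the job outside the lock
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex queue_mutex;
    std::condition_variable wakeup;
    bool stop = false;
};
```

Unlike the loop in the question, this sketch drains the remaining jobs before a worker exits; that is a design choice made here, not something the question states.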

Multithreading Implementation in C++

I am a beginner using multithreading in C++, so I'd appreciate it if you can give me some recommendations.
I have a function which receives the previous frame and current frame from a video stream (let's call this function, readFrames()). The task of that function is to compute Motion Estimation.
The idea when calling readFrames() would be:
Store the previous and current frame in a buffer.
I want to compute the value of Motion between each pair of frames from the buffer but without blocking the function readFrames(), because more frames can be received while computing that value. I suppose I have to write a function computeMotionValue() and every time I want to execute it, create a new thread and launch it. This function should return some float motionValue.
Every time the motionValue returned by any thread is over a threshold, I want to +1 a common int variable, let's call it nValidMotion.
My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Can you please explain to me in some pseudocode how can I do that?
> and every time I want to execute it, create a new thread and launch it
That's usually a bad idea. Threads are usually fairly heavy-weight, and spawning one is usually slower than just passing a message to an existing thread pool.
Anyway, if you fall behind, you'll end up with more threads than processor cores and then you'll fall even further behind due to context-switching overhead and memory pressure. Eventually creating a new thread will fail.
> My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Synchronization of access to a shared resource is usually handled with std::mutex (mutex means "mutual exclusion", because only one thread can hold the lock at once).
If you need to wait for another thread to do something, use std::condition_variable to wait/signal. You're waiting-for/signalling a change in state of some shared resource, so you need a mutex for that as well.
The usual recommendation for this kind of processing is to have at most one thread per available core, all serving a thread pool. A thread pool has a work queue (protected by a mutex, and with the empty->non-empty transition signalled by a condvar).
For combining the results, you could have a global counter protected by a mutex (but this is relatively heavy-weight for a single integer), or you could have each task added to the thread pool return a bool via the promise/future mechanism, or you could simply make your counter atomic.
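A minimal sketch of the atomic-counter option, under some loud assumptions: Frame is reduced to a single int, computeMotionValue() is a placeholder that returns the absolute difference, and std::async stands in for the thread pool purely to keep the example short (it may start one thread per task, which is exactly what a real implementation should avoid by submitting to a pool instead).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <future>
#include <vector>

// Placeholder frame type; a real frame would be an image buffer.
using Frame = int;

// Placeholder motion estimate: absolute difference of the two stand-in frames.
float computeMotionValue(Frame previous, Frame current) {
    return static_cast<float>(std::abs(current - previous));
}

// Counts how many consecutive frame pairs have a motion value above threshold.
int countValidMotion(const std::vector<Frame>& frames, float threshold) {
    std::atomic<int> nValidMotion{0};
    std::vector<std::future<void>> pending;

    for (std::size_t i = 1; i < frames.size(); ++i) {
        pending.push_back(std::async(std::launch::async, [&, i] {
            // motionValue is local to the task, so it needs no synchronization.
            float motionValue = computeMotionValue(frames[i - 1], frames[i]);
            if (motionValue > threshold) {
                ++nValidMotion;  // atomic increment, safe from any thread
            }
        }));
    }

    // Wait for every task before reading the final count.
    for (auto& task : pending) {
        task.get();
    }
    return nValidMotion.load();
}
```

The point of the sketch is that only nValidMotion is shared; each motionValue lives inside its own task, so the atomic counter is the only synchronization needed.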
Here is some sample pseudocode you could use:
// This thread waits for notifications from the worker threads that detect motion
nValidMotion_worker_Thread()
{
    while(true) { message_receive(msg_q); ++nValidMotion; }
}
// Worker thread: computes motion on 2 frames; if motion is detected, it notifies nValidMotion_worker_Thread via the message queue
WorkerThread(frame1, frame2)
{
    x = computeMotionValue(frame1, frame2);
    if x > THRESHOLD
        msg_q.send();
}
// main thread
main_thread()
{
    // create a new message queue for inter-thread communication
    msg_q = new msg_q();
    // start the listening thread
    Thread a = new nValidMotion_worker_Thread();
    a.start();
    while(true)
    {
        // collect 2 frames
        frame1 = readFrames();
        frame2 = readFrames();
        // start a worker thread
        Thread b = new WorkerThread(frame1, frame2);
        b.start();
    }
}

overhead of thread-synchronization via Events

I am experimenting with multithreaded synchronization at the moment. For background: I have a set of about 100000 objects (possibly more) that I want to process in different ways multiple times per second.
The thing that concerns me most is the performance of the synchronization.
This is what I think should work just fine (I omitted all safety checks, as this is just a test program and in case of an error the program will simply crash). I wrote two functions: the first is executed by the main thread of the program, the second is run by all additional threads.
void SharedWorker::Start()
{
while (bRunning)
{
// Send the command to start task1
SetEvent(hTask1Event);
// Do task1 (on a subset of all objects) here
// Wait for all workers to finish task1
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfTask1WorkerEvents>, TRUE, INFINITE);
// Reset the command for task1
ResetEvent(hTask1Event);
// Send the command to start task2
SetEvent(hTask2Event);
// Do task2 (on a subset of all objects) here
// Wait for all workers to finish task2
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfTask2WorkerEvents>, TRUE, INFINITE);
// Reset the command for task2
ResetEvent(hTask2Event);
// Send the command to do cleanup
SetEvent(hCleanupEvent);
// Do some (on a subset of all objects) cleanup
// Wait for all workers to finish cleanup
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfCleanupWorkerEvents>, TRUE, INFINITE);
// Reset the command for cleanup
ResetEvent(hCleanupEvent);
}
}
DWORD WINAPI WorkerThreads(LPVOID lpParameter)
{
while (bRunning)
{
WaitForSingleObject(hTask1Event, INFINITE);
// Unset finished cleanup
ResetEvent(hCleanedUp);
// Do task1 (on a subset of all objects) here
// Signal finished task1
SetEvent(hTask1);
WaitForSingleObject(hTask2Event, INFINITE);
// Reset task1 event
ResetEvent(hTask1);
// Do task2 (on a subset of all objects) here
// Signal finished task2
SetEvent(hTask2);
WaitForSingleObject(hCleanupEvent, INFINITE);
// Reset update event
ResetEvent(hTask2);
// Do cleanup (on a subset of all objects) here
// Signal finished cleanup
SetEvent(hCleanedUp);
}
return 0;
}
To point out my requirements, I'll just give you a little example:
Say we got the 100000 objects from above, split into 8 subsets of 12500 objects each, a modern multicore processor with 8 logical cores. The relevant part is the time. All tasks must be performed in about 8ms.
My questions are: can I get a significant speed-up from splitting the processing, or is synchronization via events too expensive? Or is there another way of synchronizing threads that takes less effort or processing time, given that all the tasks need to be done this way?
If processing a single object is fast, do not split the work between threads object by object. Thread synchronization on Windows is expensive relative to such small work items: a thread that blocks may not run again until the scheduler hands it another time slice, which can take many milliseconds. That time is not consumed by the system itself; it is simply time during which something else is running.
However, if processing each object takes around 8 ms, it does make sense to schedule the work across a pool of threads. Processing time may vary from object to object, though, so with large counts the worker threads will finish at different moments.
A better approach is to set up a synchronized object queue to which you add the objects to be processed and from which the workers take them. Furthermore, since processing a single object takes considerably less time than a thread's scheduling interval, it is a good idea to hand objects to the worker threads in batches (say 10-20 at a time). You can determine the best number of worker threads and the best batch size by testing.
So the pseudocode could look like:
main_thread:
init queue
start workers
set counter to 100000
add 100000 objects to queue
while (counter) wait();
worker_thread:
while (!done)
get up to 10 objects from queue
process objects
counter -= processed count
if (counter == 0) notify done
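The batching idea in the pseudocode above could be sketched in standard C++ roughly as follows. This is a simplification under stated assumptions: instead of a mutex-protected queue it hands out index ranges of a vector through a single atomic counter, and it starts and joins the workers on every call rather than keeping a long-lived pool. The name processInBatches and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Applies `process` to every element of `objects`, handing out work to
// `workerCount` threads in batches of `batchSize` elements.
template <typename T, typename Fn>
void processInBatches(std::vector<T>& objects, Fn process,
                      unsigned workerCount, std::size_t batchSize) {
    workerCount = std::max(workerCount, 1u);                // at least one worker
    batchSize = std::max<std::size_t>(batchSize, 1);        // at least one object per batch

    std::atomic<std::size_t> next{0};  // index of the first unclaimed object

    auto worker = [&] {
        for (;;) {
            // Claim the next batch; fetch_add returns the previous value.
            std::size_t begin = next.fetch_add(batchSize);
            if (begin >= objects.size()) {
                return;  // nothing left to claim
            }
            std::size_t end = std::min(begin + batchSize, objects.size());
            for (std::size_t i = begin; i < end; ++i) {
                process(objects[i]);
            }
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back(worker);
    }
    // Joining all workers plays the role of "wait until counter reaches 0".
    for (std::thread& t : workers) {
        t.join();
    }
}
```

A call such as processInBatches(objects, doTask1, 8, 10) would correspond to one phase; the worker count and batch size are the two knobs to tune by measurement, as suggested above.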

Allowing connections given the number of threads in server

Each connection requires its own thread, and for now we allow only a certain number of connections per period. So every time a user connects, we increment the counter if we are still within a certain period of the last time we set the check time.
1. get current_time = time(0)
2. if current_time is OUTSIDE a certain period from check_time, set counter = 0 and check_time = current_time
3. (otherwise, leave both as they are)
4. if counter < LIMIT, counter++ and return TRUE
5. otherwise return FALSE
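Written out in C++, the steps above might look like the sketch below. The class name ConnectionLimiter, its members and the choice to pass the current time in as a parameter (which makes the logic easy to test) are assumptions made for illustration; the mutex is added on the assumption that the check may be called from more than one thread.

```cpp
#include <ctime>
#include <mutex>

class ConnectionLimiter {
public:
    ConnectionLimiter(int limit, std::time_t periodSeconds)
        : limit_(limit), period_(periodSeconds) {}

    // Returns true if a connection arriving at current_time may be accepted.
    bool allow(std::time_t current_time) {
        std::lock_guard<std::mutex> lock(mutex_);
        // Step 2: outside the period, so start a new window.
        if (current_time - check_time_ >= period_) {
            counter_ = 0;
            check_time_ = current_time;
        }
        // Steps 4 and 5: accept while the window still has room.
        if (counter_ < limit_) {
            ++counter_;
            return true;
        }
        return false;
    }

private:
    std::mutex mutex_;
    int limit_;
    std::time_t period_;
    int counter_ = 0;
    std::time_t check_time_ = 0;
};
```

In the server this would be called as limiter.allow(std::time(nullptr)) each time a user connects.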
But this is independent of how many threads are actually running in the server, so I'm thinking of a way to allow connections depending on that number.
The problem is that we're actually using a third-party api for this, and we don't know exactly how long the connection will last. First I thought of creating a child thread and run ps on it to pass the result to the parent thread, but it seems like it's going to take more time since I'll have to parse the output result to get the total number of threads, etc. I'm actually not sure if I'm making any sense.. I'm using c++ by the way. Do you guys have any suggestions as to how I could implement the new checking method? It'll be very much appreciated.
There is a /proc/[pid]/task directory (since Linux 2.6.0-test6) that contains one subdirectory for each thread belonging to process [pid]. Look at man proc for documentation. Assuming you know the pid of your thread pool, you could just count those subdirectories.
You could use boost::filesystem to do that from C++, as described here:
How do I count the number of files in a directory using boost::filesystem?
I assumed you are using Linux.
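As a rough illustration of counting those subdirectories, here is a sketch that uses std::filesystem (C++17) in place of boost::filesystem; the two interfaces are very similar. The function name countThreads and the default path /proc/self/task (the calling process itself) are assumptions; for another process you would pass /proc/<pid>/task.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Returns the number of entries in the task directory (one per thread),
// or -1 if the directory cannot be read, for example on a non-Linux system.
int countThreads(const std::string& taskDir = "/proc/self/task") {
    namespace fs = std::filesystem;

    std::error_code ec;
    fs::directory_iterator it(taskDir, ec);
    if (ec) {
        return -1;
    }

    int count = 0;
    const fs::directory_iterator end;
    while (it != end) {
        ++count;
        it.increment(ec);  // non-throwing advance
        if (ec) {
            return -1;
        }
    }
    return count;
}
```

This avoids spawning ps and parsing its output, and the result could be compared against a thread limit before accepting a new connection.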
Okay, if you know the TID of the thread in use by the connection then you can wait on that object in a separate thread which can then decrement the counter.
At least I know that you can do it with MSVC...
bool createConnection()
{
if( ConnectionMonitor::connectionsMaxed() )
{
LOG( "Connection Request failed due to over-subscription" );
return false;
}
ConnectionThread& connectionThread = ThreadFactory::createNewConnectionThread();
connectionThread.startConnection();
ThreadMonitorThread& monitor = ThreadFactory::createThreadMonitor(connectionThread);
monitor.monitor();
return true;
}
and in ThreadMonitorThread
ThreadMonitorThread( const Thread& thread )
{
this->thread = thread;
}
void monitor()
{
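// WaitForSingleObject takes a HANDLE and a timeout; getTid() is assumed to return the thread's HANDLE here
WaitForSingleObject( thread.getTid(), INFINITE );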
ConnectionMonitor::decrementThreadCounter();
}
Of course ThreadMonitorThread will require some special privileges to call the decrement and the ThreadFactory will probably need the same to increment it.
You also need to worry about properly coding this up... who owns the objects and what about exceptions and errors etc...

java parallelisation problem - parallelisation is as slow as serialisation

I have been developing an individual-based model. All you need to know is that individuals are born, reproduce and die. I have a GUI in which I can see these processes happening.
I have a mac pro, with 8 cores and 16GB ram.
Considering that the simulation will have to be repeated a few times to get error bars, etc., I thought I could run the main class and then have separate simulations (all run from the same program) run on separate cores. Simple. Each parallel simulation would have no knowledge of the other simulations, hence no need for synchronization blocks.
When the main method is run, it invokes the constructor of the main class - which creates the other objects and the simulation begins. Hence - to parallelise - I created a fixed thread pool which would all separately invoke the main class constructor and multiple (well, 8, the number of cores) simulations.
BUT - it is running as slowly as if I were running the simulations in serial. The animations in the GUIs for each simulation are updated in order, not simultaneously.
In fact, if I run the program 8 times simultaneously from the command line (and place in the background with '&') it is much faster and behaves much more like I would have hoped. Which is irritating!
At the start of the simulation some IO operations are performed to read in data about the individuals, but only at the start.
Interestingly, the first objects to be created by the 'parallel' processes were made at the same memory addresses - but I don't think that is a problem.
If anybody has any insight into this lack of performance from the Java concurrency tools, into why the program appears to be running in serial, and into why simply running the main method from the command line 8 times is better than attempting to parallelise, that would be most helpful.
Because to be frank I am losing faith in java's parallelisation capabilities.
Cheers
James
noOfProcessors = (byte)Runtime.getRuntime().availableProcessors();
ExecutorService eservice = Executors.newFixedThreadPool( noOfProcessors );
List<Future> futuresList = new ArrayList<Future>();
for( int i = 0; i < noOfProcessors; i++ ){
futuresList.add( eservice.submit( new simulation() ) );
}//end for
for( Future future : futuresList ){
try{
future.get();
}catch( InterruptedException ex ){
Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
System.exit( 1 );
}catch( ExecutionException ex ){
Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
System.exit( 1 );
}//end try-catch
}//end for loop
While not too familiar with Java's Executors class, the serial behaviour seems to indicate that your thread pool is running all threads on the same processor. Perhaps it has something to do with how the JVM handles threads? Anyway, see if you can create separate processes in Java and see if that makes a difference.