Java parallelisation problem - parallel execution is as slow as serial execution - concurrency

I have been developing an individual-based model. All you need to know is that individuals are born, reproduce and die. I have a GUI in which I can see these processes happening.
I have a Mac Pro with 8 cores and 16 GB of RAM.
Since the simulation has to be repeated a few times to get error bars, etc., I thought I could run the main class and have separate simulations (all launched from the same program) run on separate cores. Simple. Each parallel simulation would have no knowledge of the other simulations, hence no need for synchronized blocks.
When the main method is run, it invokes the constructor of the main class, which creates the other objects, and the simulation begins. So, to parallelise, I created a fixed thread pool in which each task separately invokes the main class constructor, giving multiple (well, 8, the number of cores) simulations.
BUT it runs as slowly as if I were running the simulations in serial. The animations in the GUIs for the simulations are updated one after another, not simultaneously.
In fact, if I run the program 8 times simultaneously from the command line (placing each run in the background with '&') it is much faster and behaves much more like I would have hoped. Which is irritating!
At the start of the simulation some IO operations are performed to read in data about the individuals, but only at the start.
Interestingly, the first objects created by the 'parallel' processes were made at the same memory addresses, but I don't think that is a problem.
If anybody has any insight into this lack of performance from the Java concurrency tools, it would be most helpful: why does the program appear to run in serial, and why is simply running the main method from the command line 8 times better than attempting to parallelise?
Because, to be frank, I am losing faith in Java's parallelisation capabilities.
Cheers
James
noOfProcessors = (byte)Runtime.getRuntime().availableProcessors();
ExecutorService eservice = Executors.newFixedThreadPool( noOfProcessors );
// Future<?> avoids the raw type and works whether simulation is a Runnable or a Callable
List<Future<?>> futuresList = new ArrayList<Future<?>>();
for( int i = 0; i < noOfProcessors; i++ ){
    futuresList.add( eservice.submit( new simulation() ) );
}//end for
// block until every submitted simulation has finished
for( Future<?> future : futuresList ){
    try{
        future.get();
    }catch( InterruptedException ex ){
        Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
        System.exit( 1 );
    }catch( ExecutionException ex ){
        Logger.getLogger( simPanel.class.getName() ).log( Level.SEVERE, null, ex );
        System.exit( 1 );
    }//end try-catch
}//end for loop

While I am not too familiar with Java's Executors class, the serial behaviour seems to indicate that your thread pool is running all threads on the same processor. Perhaps it has something to do with how the JVM handles threads? Anyway, try creating separate processes from Java and see if that makes a difference.

Related

Hard Realtime C++ for Robot Control

I am trying to control a robot using a template-based controller class written in C++. Essentially, I have a UDP connection set up with the robot to receive its state and to send it new torque commands. I receive new observations at a higher frequency (say 2000 Hz) than my controller can handle: it takes about 1 ms (1000 Hz) to calculate new torque commands. The problem I am facing is that I don't want my main code to wait to send the old torque commands while my controller is still calculating new ones. From what I understand, I can use Ubuntu with an RT-Linux kernel, multi-thread the code so that my getTorques() method runs in a different thread, set priorities for the process, and use mutexes and locks to avoid data races between the 2 threads, but I was hoping to learn what the best strategies are for writing hard-realtime code for such a problem.
// main.cpp
#include "CONTROLLER.h"
#include "llapi.h"
int main() {
    ...
    CONTROLLERclass obj;
    ...
    double new_observation;
    double u;
    ...
    while (communicating) {
        get_newObs(new_observation);        // Get new state of the robot (2000Hz)
        obj.getTorques(new_observation, u); // Takes about 1ms to calculate new torques
        send_newCommands(u);                // Send the new torque commands to the robot
    }
    ...
}
Thanks in advance!
Okay, so first of all, it sounds to me like you need to deal with the fact that you receive input at 2 kHz, but can only compute results at about 1 kHz.
Based on that, you're apparently going to have to discard roughly half the inputs, or else somehow (in a way that makes sense for your application) quickly combine the inputs that have arrived since the last time you processed them.
But as the code is structured right now, you're going to fetch and process older and older inputs, so even though you're producing outputs at ~1 kHz, those outputs are based on increasingly stale data.
For the moment, let's assume you want to receive inputs as fast as you can, and when you're ready to do so, you process the most recent input you've received, produce an output based on that input, and repeat.
In that case, you'd probably end up with something on this general order (using C++ threads and atomics for the moment):
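std::atomic<double> new_observation{0.0};

// Receiver: keep overwriting the shared value with the newest observation.
std::thread receiver([&] {
    while (communicating) {
        double d;
        get_newObs(d);
        new_observation = d;
    }
});

// Sender: compute torques from the most recent observation and send them.
std::thread sender([&] {
    while (communicating) {
        double input = new_observation; // local copy (auto would try to copy the atomic)
        double u;
        obj.getTorques(input, u);
        send_newCommands(u);
    }
});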
I've assumed that you'll always receive input faster than you can consume it, so the processing thread can always process whatever input is waiting, without receiving anything to indicate that the input has been updated since it was last processed. If that's wrong, things get a little more complex, but I'm not going to try to deal with that right now, since it sounds like it's unnecessary.
As far as the code itself goes, the only thing that may not be obvious is that instead of passing a reference to new_observation to either of the existing functions, I've read new_observation into a variable local to the thread, then passed a reference to that.
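To make the "always process the most recent input" idea concrete, here is a minimal, self-contained sketch that does not depend on the robot API. Everything in it is an assumption made for illustration: sample_latest is a hypothetical helper, the producer thread stands in for get_newObs(), and the calling thread stands in for the controller loop.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for the robot loop: a producer publishes the
// observations 1..n_obs as fast as it can, while the caller repeatedly
// samples whatever the latest value is. Returns the values actually seen.
std::vector<double> sample_latest(int n_obs, int n_samples) {
    std::atomic<double> latest{0.0};

    std::thread producer([&] {
        for (int i = 1; i <= n_obs; ++i)
            latest.store(static_cast<double>(i)); // overwrite, never queue
    });

    std::vector<double> seen;
    seen.reserve(n_samples);
    for (int k = 0; k < n_samples; ++k) {
        double input = latest.load(); // local copy of the newest observation
        seen.push_back(input);        // a real controller would compute here
    }

    producer.join();
    return seen;
}
```

Because the shared value is overwritten rather than queued, the consumer can skip observations but never goes back to an older one. Whether dropping intermediate observations is acceptable depends on the controller.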

ZeroMQ: how to reduce multithread-communication latency with inproc?

I'm using inproc and PAIR to achieve inter-thread communication and trying to solve a latency problem due to polling. Correct me if I'm wrong: Polling is inevitable, because a plain recv() call will usually block and cannot take a specific timeout.
In my current case, among N threads, each of the N-1 worker threads has a main while-loop. The N-th thread is a controller thread which can notify all the worker threads to quit at any time. However, the worker threads have to poll with a timeout to get that quit message. This introduces latency; the timeout is usually 1000 ms.
Here is an example
while (true) {
    const std::chrono::milliseconds nTimeoutMs(1000);
    std::vector<zmq::poller_event<std::size_t>> events(n);
    size_t nEvents = m_poller.wait_all(events, nTimeoutMs);
    bool isToQuit = false;
    for (auto& evt : events) {
        zmq::message_t out_recved;
        try {
            evt.socket.recv(out_recved, zmq::recv_flags::dontwait);
        }
        catch (std::exception& e) {
            trace("{}: Caught exception while polling: {}. Skipped.", GetLogTitle(), e.what());
            continue;
        }
        if (!out_recved.empty()) {
            if (IsToQuit(out_recved))
                isToQuit = true;
            break;
        }
    }
    if (isToQuit)
        break;
    //
    // main business
    //
    ...
}
To make things worse, when the main loop has nested loops, the worker threads then need to include more polling code in each layer of the nested loops. Very ugly.
The reason why I chose ZMQ for multithread communication is because of its elegance and the potential of getting rid of thread-locking. But I never realized the polling overhead.
Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation? Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?
One statement posted above ( a hypothesis ):
"...a plain recv() call will usually block and cannot take a specific timeout."
is not correct:
- a plain .recv( ZMQ_NOBLOCK )-call will never "block",
- a plain .recv( ZMQ_NOBLOCK )-call can be decorated so as to mimic "a specific timeout"
Another statement posted above ( a hypothesis ):
"...have to use polling with a timeout ... introduces a latency, the latency parameter is usually 1000ms."
is not correct:
- one need not use polling with a timeout
- still less need one set a 1000 ms code-"injected" latency, which is obviously spent only in the no-new-message state
Q : "Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation?"
Yes.
Q : "Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?"
No. inproc-transport-class is the fastest of all these kinds as it is principally protocol-less / stack-less and has more to do with ultimately fast pointer-mechanics, like in a dual-end ring-buffer pointer-management.
The Best Next Step:
1 ) Re-factor your code so as to always use only the zero-wait { .poll() | .recv() }-methods, properly decorated for both { event- | no-event- }-specific looping.
2 ) If you are then willing to shave the last few [us] from the smart-loop-detection turn-around-time, you may focus on an improved Context()-instance, setting it to work with a larger number of nIOthreads > N "under the hood".
optionally 3 ) For almost hard-Real-Time system designs, one may finally harness a deterministically driven mapping of the Context()-threads and sockets onto specific, non-overlapping CPU-cores ( using a carefully-crafted affinity-map ).
Having set 1000 [ms] in code, no one can fairly complain about spending those very 1000 [ms] waiting in a timeout that she / he coded. No excuse for doing this.
Do not blame ZeroMQ for behaviour that was coded on the application side of the API.
Never.
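For comparison, the std::atomic alternative the question asks about can be sketched without ZeroMQ at all. This is only an illustration under stated assumptions: run_workers_until_quit is a hypothetical helper, the loop body is a placeholder for the "main business", and the quit signal is a plain std::atomic<bool> that each worker re-checks on every pass instead of polling a socket with a timeout.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: starts n_workers threads that loop until the
// controller sets a shared quit flag, then returns how many workers
// observed the flag and left their loop.
std::size_t run_workers_until_quit(std::size_t n_workers,
                                   std::chrono::milliseconds run_for) {
    std::atomic<bool> quit{false};
    std::atomic<std::size_t> exited{0};

    std::vector<std::thread> workers;
    workers.reserve(n_workers);
    for (std::size_t i = 0; i < n_workers; ++i) {
        workers.emplace_back([&] {
            while (!quit.load(std::memory_order_acquire)) {
                // main business goes here; the flag is re-checked every pass
                std::this_thread::yield();
            }
            exited.fetch_add(1, std::memory_order_relaxed);
        });
    }

    std::this_thread::sleep_for(run_for);        // controller does other work
    quit.store(true, std::memory_order_release); // notify all workers to quit

    for (auto& w : workers) w.join();
    return exited.load();
}
```

With this shape the quit latency is bounded by one pass of the worker loop rather than by a poll timeout. The same "check without waiting" structure is what a zero-timeout poll or a non-blocking recv gives on the ZeroMQ side.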

PPL - How to configure the number of native threads?

I am trying to manage the count of native threads in PPL by using its Scheduler class, here is my code:
for (int i = 0; i < 2000; i ++)
{
    // configure concurrency count 16 to 32.
    concurrency::SchedulerPolicy policy = concurrency::SchedulerPolicy(2, concurrency::MinConcurrency, 16,
        concurrency::MaxConcurrency, 32);
    concurrency::Scheduler *pScheduler = concurrency::Scheduler::Create(policy);
    HANDLE hShutdownEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    pScheduler->RegisterShutdownEvent(hShutdownEvent);
    pScheduler->Attach();
    //////////////////////////////////////////////////////////////////////////
    //for (int i = 0; i < 2000; i ++)
    {
        concurrency::create_task([]{
            concurrency::wait(1000);
            OutputDebugString(L"Task Completed\n");
        });
    }
    //////////////////////////////////////////////////////////////////////////
    concurrency::CurrentScheduler::Detach();
    pScheduler->Release();
    WaitForSingleObject(hShutdownEvent, INFINITE);
    CloseHandle(hShutdownEvent);
}
The usage of SchedulerPolicy is taken from MSDN, but it didn't work at all. The expected result of my code above is that PPL will launch 16 to 32 threads to execute the 2000 tasks, but in fact:
Judging by the speed of the console output, only one task was processed per second. I also tried commenting out the outer for loop and uncommenting the inner for loop; however, this caused 300 threads to be created, which is still incorrect. If I wait longer, even more threads are created.
Any ideas on what is the correct way to configure concurrency in PPL?
It turned out that I should not call concurrency::wait within the task body. PPL works in work-stealing mode: when the current task is suspended by wait, it starts scheduling the rest of the tasks in the queue to maximize the use of computing resources.
When I use concurrency::create_task in a real project, where the task body does real calculations, PPL no longer creates hundreds of threads.
Also, SchedulerPolicy can be used to configure the number of virtual processors that PPL may use to process the tasks, which is not always the same as the number of native threads PPL will create.
Say my CPU has 8 virtual processors: by default PPL will create just 8 threads in the pool, but when some of those threads are suspended by a wait or a lock and more tasks are pending in the queue, PPL will immediately create more threads to execute them (if the virtual processors are not fully loaded).

C++ multithreaded application using std::thread works fine on Windows but not Ubuntu

I have a somewhat simple multithreaded application written using the C++ std::thread library for both Ubuntu 14.04 and Windows 8.1. The code is almost identical on both, except that I'm using the respective operating-system headers, windows.h and unistd.h, to call Sleep/sleep to pause execution for a time. Both versions start running, and the Ubuntu version keeps running for a short time but then hangs. I am passing the proper arguments to the sleep/Sleep functions, since I know Windows Sleep takes milliseconds while Unix sleep takes seconds.
I've run the code multiple times and on Ubuntu it never makes it past two minutes, whereas I've run it on Windows twice for 20 minutes and then multiple times for roughly five minutes each to see if I was just lucky. Is this just an incompatibility with the thread library, or does sleep not do what I think it does, or something else? The infinite loops are there because this is a school project and is expected to run without deadlocks or crashing.
The gist is that this is a modified 4-way stop where cars that arrive first don't have to slow down and stop. We only had to let one car through the intersection at a time, which takes 3 seconds to cross, hence Sleep(3000), and we don't have to worry about turns. Three threads run the spawnCars function, and four other threads each monitor one of the four directions N, E, S, and W. I hope it's understandable that I can't post the entire code, in case some other student stumbles upon this. These two functions are the only place where the code differs, aside from the operating-system-dependent header inclusion at the top. Thanks.
edit: Since I've just gone and posted all the code for the project, if the problem does end up being a deadlock, may I request that you only say so, and not post an in-depth solution? I'm new here, so if that's against the spirit of SO then fire away and I'll try to figure it out without reading the details.
/* function clearIntersection
   Makes a car go through the intersection. The sleep comes before the removal from the queue
   because my understanding is that the wait condition simulates the go signal for drivers.
   It wouldn't make sense for the sensors to tell a car to go if the intersection isn't yet
   clear even if the lock here would prevent that.
*/
void clearIntersection(int direction)
{
    lock->lock();
    Sleep(3000);
    dequeue(direction);
    lock->unlock();
}
/* function isAtFront(int direction)
   Checks whether the car waiting at the intersection from a particular direction
   has permission to pass, meaning it is at the front of the list of ALL waiting cars.
   This is the waiting condition.
*/
bool isAtFront(int direction)
{
    lock->lock();
    bool isAtFront = cardinalDirections[direction].front() == list->front();
    lock->unlock();
    return isAtFront;
}
void waitInLine()
{
    unique_lock<mutex> conditionLock(*lock);
    waitForTurn->wait(conditionLock);
    conditionLock.unlock();
}
//function broadcast(): Let all waiting threads know they can check whether or not their car can go.
void broadcast()
{
    waitForTurn->notify_all();
}
};
/* function monitorDirection(intersectionQueue,int,int)
   Threads will run this function. There are four threads that run this function
   in total, one for each of the cardinal directions. The threads check to see
   if the car at the front of the intersectionQueue, which contains the arrival order
   of cars regardless of direction, is the car at the front of the queue for the
   direction the thread is assigned to monitor. If not, it waits on a condition
   variable until it is the case. It then calls the function to clear the intersection.
   Broadcast is then used on the condition variable so all drivers will check if they
   are allowed to pass, which one will unless there are 0 waiting cars, waiting again if not the case.
*/
void monitorDirection(intersectionQueue *intersection, int direction, int id)
{
    while (true) //Do forever to see if crashes can occur.
    {
        //Do nothing if there are no cars coming from this direction.
        //Possibly add more condition_variables for each direction?
        if (!intersection->empty(direction))
        {
            while (!intersection->isAtFront(direction))
                intersection->waitInLine();
            intersection->clearIntersection(direction);
            cout << "A car has gone " << numberToDirection(direction) << endl;
            //All cars at the intersection will check the signal to see if it's time to go so broadcast is used.
            intersection->broadcast();
        }
    }
}
Your culprit is likely your while (!isAtFront(...)) loop. If another thread gets scheduled between the check and the subsequent call to waitInLine(), the state of your queues could change, causing all of your consumer threads to end up waiting. At that point there's no thread to signal your condition_variable, so they will wait forever.
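As a generic illustration of how that window is usually closed (deliberately not the intersection code from the question), here is a small sketch of a predicate-based wait. All names are hypothetical: run_in_turn starts n threads, and each waits until a shared counter equals its own id. The point is only that the condition is checked while holding the same mutex that protects the state, and that wait() releases that mutex atomically while blocking.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical turn-taking example: n threads each wait until the shared
// counter `turn` equals their id, record their id, then pass the turn on.
std::vector<int> run_in_turn(int n) {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;
    std::vector<int> order;

    std::vector<std::thread> threads;
    // Start in reverse order so that the later ids are usually waiting first.
    for (int id = n - 1; id >= 0; --id) {
        threads.emplace_back([&, id] {
            std::unique_lock<std::mutex> lk(m);
            // The predicate is evaluated with the mutex held, and wait()
            // releases the mutex atomically while blocking, so a notify
            // cannot slip in between the check and the wait.
            cv.wait(lk, [&] { return turn == id; });
            order.push_back(id);
            ++turn;          // state change made under the same mutex
            lk.unlock();
            cv.notify_all(); // let the others re-check their predicate
        });
    }
    for (auto& t : threads) t.join();
    return order;
}
```

Calling run_in_turn(4) yields the ids in order 0, 1, 2, 3 regardless of how the threads are scheduled, because a thread only proceeds once its predicate holds.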

multi-threading limit?

I am writing a program using threads in C++ on Linux.
Currently, I am just keeping an array of threads, and every time one second has elapsed, I check to see which have finished, and restart them. Is this bad? I need to keep this program running for a long time. As it is now, I am getting a code 11 after so many loops of restarting threads (the 100th loop in the last trial). I figured that by reusing threads and making sure only a small number of them are running at any one time, I would not hit the limit. The array I am using only has a size of 8 (of course, I am not starting 8 each time, just those that have stopped).
Any ideas?
My code is below:
if ( loop_times == 0 || pthread_kill(threads[t],0) != 0 )
{
    rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
    if (rc){
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
    }
    thread_count++;
}
The loop_times variable is just so that I can get into the loop and start the threads the first time. Otherwise, I get a SEGFAULT because the threads haven't been started before.
Also, I have been wanting to see the value of PTHREAD_THREADS_MAX, but I can't print it (even when including limits.h)
If you want to use multiple threads, it is better to use a thread pool.
Start a set of detached threads, and then use a queue to send work to each thread, so that it can process an item and then wait for the next input from you.
As it turns out, my problem was that I needed to pthread_join my thread before I restarted it each time. After this, I stopped getting a code 11 and stopped having "still reachable" memory when running it through Valgrind.
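A minimal sketch of that fix, under assumptions: worker is a hypothetical stand-in for thread_stall, and restart_with_join is an illustrative helper rather than the original loop. A joinable thread that has finished still holds resources until it is joined, so restarting without joining can eventually make pthread_create fail. If the "code 11" above is the return value printed by the snippet, on Linux that value corresponds to EAGAIN.

```cpp
#include <pthread.h>
#include <cstddef>

// Hypothetical worker standing in for thread_stall from the question.
static void* worker(void*) { return nullptr; }

// Runs `rounds` restart cycles over a small fixed array of thread slots and
// returns the number of threads created, or -1 if a pthread call fails.
long restart_with_join(int rounds) {
    const std::size_t kSlots = 8;
    pthread_t threads[kSlots];
    long created = 0;

    for (int r = 0; r < rounds; ++r) {
        for (std::size_t t = 0; t < kSlots; ++t) {
            if (r > 0) {
                // Reclaim the previous thread's resources before reusing the slot.
                if (pthread_join(threads[t], nullptr) != 0) return -1;
            }
            if (pthread_create(&threads[t], nullptr, worker, nullptr) != 0) return -1;
            ++created;
        }
    }
    // Join the final generation so nothing is left unreclaimed.
    if (rounds > 0) {
        for (std::size_t t = 0; t < kSlots; ++t)
            if (pthread_join(threads[t], nullptr) != 0) return -1;
    }
    return created;
}
```

Note that pthread_join blocks until the old thread has ended, so in this sketch a slot is only restarted once its previous thread is done. Threads that are never going to be joined can be detached instead, which is what the thread-pool suggestion above relies on.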