I've written a simple, though highly multi-threaded, prime number generator.
The algorithm goes like this:
Thread 0: generates consecutive numbers.
Threads 1 .. N: filter out numbers that are not prime.
Upon each 'new' prime discovery, a new filter thread is added.
Take I: no flow control at all.
Thread 0 'send's numbers absolutely freely.
The program terminates with signal 11 (segmentation fault), more rarely with signal 8, and even more rarely it finishes successfully.
Take II: flow control with 'setMaxMailboxSize' set to 1.
Most of the time, everything works well.
Take III:
Now, if it were all the result of some internal, unhandled overflow, it should also work with 'setMaxMailboxSize' set to 2 (or even 10), am I wrong?
Thread 0 becomes stuck after it blocks for the first time.
Could someone please point out what I'm missing?
Note 1:
I use DMD v2.053 under Ubuntu 10.04
Note 2:
This is my code:
#!/usr/bin/dmd -run
import std.stdio;
import std.conv;
import std.concurrency;
void main(string[] args)
{
/* parse command line arguments */
if (args.length < 2) {
writeln("Usage: prime <number of primes to generate>");
return;
}
auto nPrimes = to!int(args[1]);
auto tid = spawn(&generate, thisTid);
/* gather produced primes */
for (;;) {
auto prime = receiveOnly!int();
writeln(prime);
if (--nPrimes <= 0) {
break;
}
}
tid.send("stop");
}
void generate(Tid parentTid)
{
bool terminate = false;
// filter stage 1
auto tid = spawn(&filter_stage, parentTid);
/* WHAT DO I MISS HERE ? */
setMaxMailboxSize(tid, 1, OnCrowding.block);
for (int i = 2; !terminate; i++) {
receiveTimeout(0,
(string cmd) {
writeln(cmd);
terminate = true;
}
);
tid.send(i);
}
}
void filter_stage(Tid parentTid)
{
auto prime = receiveOnly!int();
parentTid.send(prime);
// filter stage 'N'
auto tid = spawn(&filter_stage, parentTid);
filter(prime, tid);
}
void filter(int prime, Tid tid)
{
for (;;) {
receive (
(int number) {
if (number % prime != 0) {
tid.send(number);
}
}
);
}
}
Sounds like a bug in std.concurrency. Try upgrading DMD to 2.055. I'm not sure whether this specific bug is fixed, but there are a lot of bug fixes between 2.053 and 2.055. If it's still broken, please file a bug report at http://d.puremagic.com/issues/.
Related
I am working on a database that runs on top of RocksDB. I have a find function that takes a query as a parameter, iterates over all documents in the database, and returns the documents that match the query. I want to parallelize this function so the work is spread across multiple threads.
To achieve that, I tried to use a ThreadPool: I moved the body of the loop into a lambda and added a task to the thread pool for each document. After the loop, each result is processed by the main thread.
Current version (single thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
bson_t document;
rocksdb::Iterator* it = _db->NewIterator(rocksdb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next())
{
const char* bson_data = (const char*)it->value().data();
int bson_length = it->value().size();
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_init_static(&document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, &document))
{
++count;
if (projection != nullptr)
{
bson_error_t error;
bson_t projected;
bson_init(&projected);
mongoc_matcher_projection_execute_noop(
&document,
projection,
&projected,
&error,
NULL
);
callback(nullptr, &projected);
}
else
{
callback(nullptr, &document);
}
if (limit >= 0 && count >= limit)
{
break;
}
}
}
delete it;
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
New version (multi-thread):
void
EmbeDB::find(const bson_t& query,
DocumentPtrCallback callback,
int32_t limit,
const bson_t* projection)
{
int32_t count = 0;
bool limit_reached = limit == 0;
bson_error_t error;
uint32_t num_query_keys = bson_count_keys(&query);
mongoc_matcher_t* matcher = num_query_keys != 0
? mongoc_matcher_new(&query, &error)
: nullptr;
if (num_query_keys != 0 && matcher == nullptr)
{
callback(&error, nullptr);
return;
}
auto process_document = [this, projection, num_query_keys, matcher](const char* bson_data, int bson_length) -> bson_t*
{
std::vector<char> decrypted_data;
if (encryptionEnabled())
{
decrypted_data.resize(bson_length);
bson_length = decrypt_data(bson_data, bson_length, decrypted_data.data(), _encryption_method, _encryption_key, _encryption_iv);
bson_data = decrypted_data.data();
}
bson_t* document = new bson_t();
bson_init_static(document, (const uint8_t*)bson_data, bson_length);
if (num_query_keys == 0 || mongoc_matcher_match(matcher, document))
{
if (projection != nullptr)
{
bson_error_t error;
bson_t* projected = new bson_t();
bson_init(projected);
mongoc_matcher_projection_execute_noop(
document,
projection,
projected,
&error,
NULL
);
delete document;
return projected;
}
else
{
return document;
}
}
else
{
delete document;
return nullptr;
}
};
const int WORKER_COUNT = std::max(1u, std::thread::hardware_concurrency());
ThreadPool pool(WORKER_COUNT);
std::vector<std::future<bson_t*>> futures;
bson_t document;
rocksdb::Iterator* db_it = _db->NewIterator(rocksdb::ReadOptions());
for (db_it->SeekToFirst(); db_it->Valid(); db_it->Next())
{
const char* bson_data = (const char*)db_it->value().data();
int bson_length = db_it->value().size();
futures.push_back(pool.enqueue(process_document, bson_data, bson_length));
}
delete db_it;
for (auto it = futures.begin(); it != futures.end(); ++it)
{
bson_t* result = it->get();
if (result)
{
count += 1;
if (limit < 0 || count < limit)
{
callback(nullptr, result);
}
delete result;
}
}
if (matcher)
{
mongoc_matcher_destroy(matcher);
}
}
With simple documents and query, the single-thread version processes 1 million documents in 0.5 second on my machine.
With the same documents and query, the multi-thread version processes 1 million documents in 3.3 seconds.
Surprisingly, the multi-thread version is way slower. Moreover, I measured the execution time, and 75% of it is spent in the for loop. So basically the line futures.push_back(pool.enqueue(process_document, bson_data, bson_length)); takes 75% of the time.
I did the following:
I checked the value of WORKER_COUNT; it is 6 on my machine.
I tried to add futures.reserve(1000000), thinking that maybe the vector re-allocation was at fault, but it didn't change anything.
I tried to remove the dynamic memory allocations (bson_t* document = new bson_t();), but it didn't change the result significantly.
So my question is: is there something I did wrong for the multi-thread version to be that much slower than the single-thread version?
My current understanding is that the synchronization operations of the thread pool (when tasks are enqueued and dequeued) are simply consuming the majority of the time, and that the solution would be to change the data structure. Thoughts?
Parallelization has overhead.
It takes around 500 nanoseconds to process each document in the single-threaded version. There's a lot of bookkeeping that has to be done to delegate work to a thread-pool (both to delegate the work, and to synchronize it afterwards), and all that bookkeeping could very well require more than 500 nanoseconds per job.
Assuming your code is correct, the bookkeeping takes around 2800 nanoseconds per job (3.3 s / 10^6 documents ≈ 3300 ns per job, of which only ~500 ns is the actual work). To get a significant speedup from parallelization, you're going to want to break the work into bigger chunks.
I recommend trying to process documents in batches of 1000 at a time. Each future, instead of corresponding to just 1 document, will correspond to 1000 documents.
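For illustration, here is a minimal sketch of that batching idea. It assumes the same pool.enqueue interface and the db_it, matcher, num_query_keys, and count variables from the question, plus C++14 init-capture; BATCH_SIZE, batch, and flush_batch are hypothetical names. To keep it short it only counts matches per batch; encryption, projection, and delivering results to the callback are left out and would stay as in the question.
const std::size_t BATCH_SIZE = 1000;
std::vector<std::future<int>> batch_futures;
// each entry owns a copy of one value's bytes, so a task never reads from a stale iterator position
std::vector<std::string> batch;
batch.reserve(BATCH_SIZE);
auto flush_batch = [&]()
{
    if (batch.empty())
    {
        return;
    }
    // move the whole batch into one task: one future per ~1000 documents instead of one per document
    batch_futures.push_back(pool.enqueue([docs = std::move(batch), matcher, num_query_keys]()
    {
        int matched = 0;
        for (const std::string& raw : docs)
        {
            bson_t doc;
            bson_init_static(&doc, (const uint8_t*)raw.data(), raw.size());
            if (num_query_keys == 0 || mongoc_matcher_match(matcher, &doc))
            {
                ++matched;
            }
        }
        return matched;
    }));
    batch.clear();
};
for (db_it->SeekToFirst(); db_it->Valid(); db_it->Next())
{
    batch.emplace_back(db_it->value().data(), db_it->value().size());
    if (batch.size() >= BATCH_SIZE)
    {
        flush_batch();
    }
}
flush_batch();  // enqueue the last, possibly partial, batch
for (auto& f : batch_futures)
{
    count += f.get();
}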
Other optimizations
If possible, avoid unnecessary copying. If something gets copied a bunch, see if you can capture it by reference instead of by value.
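For instance (a toy sketch with made-up names; whether it helps depends on what your lambdas actually capture):
std::string big_blob = load_blob();                         // imagine this is a large object
auto by_value = [big_blob]()  { return big_blob.size(); };  // copies the whole string into the closure
auto by_ref   = [&big_blob]() { return big_blob.size(); };  // no copy, but big_blob must outlive the task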
It's my first time here. My code is supposed to make two ultrasonic sensors work at the same time using an mbed. However, I can't seem to make both classes, void us_right() and void us_left(), run concurrently. Help please :(
#include "mbed.h"
DigitalOut triggerRight(p9);
DigitalIn echoRight(p10);
DigitalOut triggerLeft(p13);
DigitalIn echoLeft(p14);
//DigitalOut myled(LED1); //monitor trigger
//DigitalOut myled2(LED2); //monitor echo
PwmOut steering(p21);
PwmOut velocity(p22);
int distanceRight = 0, distanceLeft = 0;
int correctionRight = 0, correctionLeft = 0;
Timer sonarRight, sonarLeft;
float vo=0;
// Velocity expects -1 (reverse) to +1 (forward)
void Velocity(float v) {
v=v+1;
if (v>=0 && v<=2) {
if (vo>=1 && v<1) { //
velocity.pulsewidth(0.0014); // this is required to
wait(0.1); //
velocity.pulsewidth(0.0015); // move into reverse
wait(0.1); //
} //
velocity.pulsewidth(v/2000+0.001);
vo=v;
}
}
// Steering expects -1 (left) to +1 (right)
void Steering(float s) {
s=s+1;
if (s>=0 && s<=2) {
steering.pulsewidth(s/2000+0.001);
}
}
void us_right() {
sonarRight.reset();
sonarRight.start();
while (echoRight==2) {};
sonarRight.stop();
correctionRight = sonarLeft.read_us();
triggerRight = 1;
sonarRight.reset();
wait_us(10.0);
triggerRight = 0;
while (echoRight==0) {};
// myled2=echoRight;
sonarRight.start();
while (echoRight==1) {};
sonarRight.stop();
distanceRight = ((sonarRight.read_us()-correctionRight)/58.0);
printf("Distance from Right is: %d cm \n\r",distanceRight);
}
void us_left() {
sonarLeft.reset();
sonarLeft.start();
while (echoLeft==2) {};
sonarLeft.stop();
correctionLeft = sonarLeft.read_us();
triggerLeft = 1;
sonarLeft.reset();
wait_us(10.0);
triggerLeft = 0;
while (echoLeft==0) {};
// myled2=echoLeft;
sonarLeft.start();
while (echoLeft==1) {};
sonarLeft.stop();
distanceLeft = (sonarLeft.read_us()-correctionLeft)/58.0;
printf("Distance from Left is: %d cm \n\r",distanceLeft);
}
int main() {
while(true) {
us_right();
us_left();
}
if (distanceLeft < 10 || distanceRight < 10) {
if (distanceLeft < distanceRight) {
for(int i=0; i>-100; i--) { // Go left
Steering(i/100.0);
wait(0.1);
}
}
if (distanceLeft > distanceRight) {
for(int i=0; i>100; i++) { // Go Right
Steering(i/100.0);
wait(0.1);
}
}
}
wait(0.2);
}
You need to use some mechanism to create new threads or processes. Your implementation is sequential; nothing in it tells the code to run concurrently.
You should take a look at a threading library (pthreads, for example, or, if you have access to C++11, the standard thread facilities), or at how to create new processes together with some kind of message-passing interface between those processes.
Create two threads, one for each ultrasonic sensor:
void read_left_sensor() {
while (1) {
// do the reading
wait(0.5f);
}
}
int main() {
Thread left_thread;
left_thread.start(&read_left_sensor);
Thread right_thread;
right_thread.start(&read_right_sensor);
while (1) {
// put your control code for the vehicle here
wait(0.1f);
}
}
You can write the sensor readings to global variables and read them in your main loop; the memory is shared between the threads.
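For example, a sketch along those lines, reusing the pattern above (measure_left_cm is a hypothetical helper standing in for the echo-timing code from the question; volatile keeps the compiler from caching the value, though it is not a general synchronisation tool):
volatile int g_distance_left_cm = 0;            // written by the sensor thread, read in main()

void read_left_sensor() {
    while (1) {
        g_distance_left_cm = measure_left_cm(); // hypothetical: the echo-timing code goes here
        wait(0.5f);
    }
}

// in the main control loop:
// if (g_distance_left_cm < 10) { /* steer away from the left wall */ }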
Your first problem is that you have placed code after your infinite while(true) loop. That later code will never run, but maybe you know this.
int main() {
while(true) {
us_right();
us_left();
} // <- Loops back to the start of while()
// You Never pass this point!!!
if (distanceLeft < 10 || distanceRight < 10) {
// Do stuff etc.
}
wait(0.2);
}
But I think you are expecting us_right() and us_left() to happen at exactly the same time. You cannot do that in a sequential environment.
Jan Jongboom is correct in suggesting you could use threads. This allows the 'OS' to allocate time for each piece of code to run, but it is still not truly parallel: each function (classes are a different thing) will get a chance to run. One will run, and when it is finished (or during a wait) another function will get its chance to run.
As you are using an mbed, I'd suggest making your project an Mbed OS 5 project (you select this when you start a new project); otherwise you'll need to use an RTOS library. There is a blinky example using threads that should sum it up well. Here is more info.
Threading can be dangerous for someone without experience. So stick to a simple implementation to start with. Make sure you understand what/why/how you are doing it.
Aside: from a hardware perspective, running ultrasonic sensors in parallel is actually not ideal. They both broadcast at the same frequency and can hear each other, so triggering them at the same time makes them interfere with each other.
Imagine two people shouting words in a closed room. If they take turns, it will be obvious what they are saying. If they both shout at the same time, it will be very hard!
So actually, not being able to run in parallel is probably a good thing.
I ran into a rather strange situation when using std::future and ThreadPool, though I do not think ThreadPool is to blame (I'm using https://github.com/bandi13/ThreadPool/blob/master/example.cpp): I've tried multiple forks of it, and after some debugging I do not see how it could be related to the issue.
The issue is that in certain situations my doProcess method just goes to nirvana: it does not return, it simply disappears in the midst of a long-running loop.
Therefore I think I must be doing something wrong, but I can't figure out what.
Here's the code:
ThreadPool pool(numThreads);
std::vector< std::future<bool> > futures;
int count = 0;
string orgOut = outFile;
for (auto fileToProcess : filesToProcess) {
count++;
outFile = orgOut + std::to_string(count);
// enque processing in the thread pool
futures.emplace_back(
pool.enqueue([count, fileToProcess, outFile, filteredKeys, sql] {
return doProcess(fileToProcess, outFile, filteredKeys, sql);
})
);
}
Then I wait for all the processing to be done (I think this could also be done in a more elegant way):
bool done = false;
while (!done) {
done = true;
for (auto && futr : futures) {
auto status = futr.wait_for(std::chrono::milliseconds(1));
if (status != std::future_status::ready) {
done = false;
break;
}
}
}
Edit: At first I also tried the obvious wait(), with the same result, however:
bool done = false;
while (!done) {
done = true;
for (auto && futr : futures) {
futr.wait();
}
}
Edit: Here is the doProcess() method. The behavior is this: the loopcnt variable is just a counter to debug how often the method was entered and the loop started. As you can see, there is no return from this loop, but the thread just vanishes while inside it, with no error whatsoever, and wasHereCnt is reached only occasionally (like 1 out of 100 times the method is run). I'm really puzzled.
bool doProcess([...]) {
// ....
vector<vector<KVO*>*>& features = filter.result();
vector<vector<KVO*>*> filteredFeatures;
static int loopcnt = 0;
std::cout << "loops " << loopcnt << endl;
loopcnt++;
for (vector<KVO*>* feature : features) {
for (KVO *kv : *feature) {
switch (kv->value.type()) {
case Variant::JNULL:
sqlFilter.setNullValue(kv->key);
break;
case Variant::INT:
sqlFilter.setValue(static_cast<int64_t>(kv->value), kv->key);
break;
case Variant::UINT:
sqlFilter.setValue(static_cast<int64_t>(kv->value), kv->key);
break;
case Variant::DOUBLE:
sqlFilter.setValue(static_cast<double>(kv->value), kv->key);
break;
case Variant::STRING:
sqlFilter.setValue(static_cast<string>(kv->value), kv->key);
break;
default:
assert(false);
break;
}
}
int filterResult = sqlFilter.exec();
if (filterResult > 0) {
filteredFeatures.push_back(feature);
}
sqlFilter.reset();
}
static int wasHereCnt = 0;
std::cout << "was here: " << wasHereCnt << endl;
wasHereCnt++;
JsonWriter<Writer<FileWriteStream>> geojsonWriter(writer, filteredFeatures);
bool res = geojsonWriter.write();
os.Flush();
fclose(fp);
return res;
}
The doProcess method does work when it takes less time; it breaks and disappears when it takes somewhat more time. The only difference is the complexity of an SQL query I run in the method, so I won't post the full code of doProcess().
What causes the thread of the thread pool to be interrupted, and how to fix it?
UPDATE
Well, I found it out. After several hours I decided to remove the future tasks and just run the task on the main thread. The issue was that an exception was being thrown via:
throw std::runtime_error("bad cast");
... some time down the code flow after this:
case Variant::UINT:
sqlFilter.setValue(static_cast<int64_t>(kv->value), kv->key);
break;
This error was thrown as expected when running on the main thread, but it is never raised when run as a future task. This is really odd and seems like a compiler or debugger issue.
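For reference, an exception thrown inside a task is normally captured by the associated std::future and is only rethrown when get() is called on it; a minimal standalone sketch of that behaviour (using plain std::async rather than the ThreadPool above):
#include <future>
#include <iostream>
#include <stdexcept>

int main() {
    std::future<bool> f = std::async(std::launch::async, []() -> bool {
        throw std::runtime_error("bad cast");   // thrown on the worker thread
    });
    f.wait();                                   // returns normally; nothing is rethrown here
    try {
        f.get();                                // the stored exception is rethrown here
    } catch (const std::exception& e) {
        std::cout << "caught: " << e.what() << std::endl;
    }
    return 0;
}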
Description of the problem:
We need to call a function in an external process as fast as possible. Boost interprocess shared memory is used for communication. The external process is either an MPI master or a single executable. The calculation time of the function lies between 1 ms and 1 s, and the function should be called up to 10^8-10^9 times.
I've tried a lot of possibilities, but I still have some problems with each of them. Here I present two of the best-working implementations.
Version 1 (using interprocess conditions)
Main process:
bool calculate(double& result, std::vector<double> c){
// data_ptr_ is a structure in shared memory
data_ptr_->validCalculation = false;
bool timeout = false;
// write data (cVec_ is a vector in shared memory )
cVec_->clear();
for (int i = 0; i < c.size(); ++i)
{
cVec_->push_back(c[i]);
}
// cond_input_data is boost interprocess condition
data_ptr_->cond_input_data.notify_one();
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
// lock slave process
scoped_lock<interprocess_mutex> lock_output(data_ptr_->mutex_output);
// wait till data calculated
timeout = !(data_ptr_->cond_output_data.timed_wait(lock_output, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
The external process runs a while loop (until the abort condition is fulfilled):
do {
scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_function_here(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is available, or that we did not get the input data
data_ptr_->cond_output_data.notify_one();
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
This is the best-working version, but sometimes it hangs if the calculation time is short (~1 ms). I assume this happens if the main process reaches
data_ptr_->cond_input_data.notify_one();
earlier than the external process reaches the
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime));
waiting condition, so we probably have some kind of synchronisation problem.
A second condition does not help (i.e. waiting only if the input data is not set, similar to the anonymous-condition example with the message_in flag), since it is still possible that one process notifies the other before the second one is waiting for the notification.
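For what it's worth, the usual way around this lost-notification race is to set a flag in shared memory under the same mutex the waiter uses, and to wait in a loop on that flag, so a notify that arrives before the wait starts is not lost. A minimal sketch with the Boost.Interprocess primitives already used above; data_ready is a hypothetical extra field in the shared struct:
// main process: publish the input under the lock, then notify
{
    scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
    // ... fill cVec_ here, under the lock ...
    data_ptr_->data_ready = true;
}
data_ptr_->cond_input_data.notify_one();

// external process: wait on the predicate, not on the bare condition
{
    scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
    boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
    timeout = false;
    while (!data_ptr_->data_ready && !timeout)
    {
        // if notify_one() fired before we got here, data_ready is already true and we never block
        timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime));
    }
    if (!timeout)
    {
        data_ptr_->data_ready = false;   // consume the request
        // ... read cVec_ and run the calculation ...
    }
}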
Version 2 (using a boolean flag and a while loop with some delay)
Main process:
bool calculate(double& result, std::vector<double> c){
data_ptr_->validCalculation = false;
bool timeout = false;
// write data
cVec_->clear();
for (int i = 0; i < c.size(); ++i) //Insert data in the vector
{
cVec_->push_back(c[i]);
}
// this is the flag in shared memory used for communication
*calc_flag_ = true;
clock_t test_begin = clock();
clock_t calc_time_begin = clock();
do
{
calc_time_begin = clock();
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
// wait till data calculated
timeout = (double(calc_time_begin - test_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
} while (*(calc_flag_) && !timeout);
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
and the external process:
do {
// we wait till input data is set
wait_begin = clock();
do
{
wait_end = clock();
timeout = (double(wait_end - wait_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
} while (!(*calc_flag_) && !(*abort_flag_) && !timeout);
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_local_function(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is available, or that we did not get the input data
*calc_flag_ = false;
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
The problem with this version is the delay time. Since we have calculation times close to 1 ms, we have to set the delay to at least this value. For smaller delays the CPU load is high; for larger delays we lose a lot of performance due to unnecessary waiting time.
Do you have an idea how to improve one of these versions, or maybe there is a better solution?
Thanks.
I know "printf" is standard-c and should be deterministic. But when run in Qt I see a more non-deterministic response(clock cycles). Could this be due to Qt adding some "pork" to its response?
I have multiple threads that make call to function that uses a mutex. When one thread enters it set a switch so the others can't until it is done. Things appeared to work ok for acouple seconds and then threads appeared to be killed off from 10 to 1 thread. So I tried adding a delay: (k=k+1: no help), then (looping k=k+1: no help), (usleep works), and so does (printf) work at creating a random delay and allowing all threads to continue running.
void CCB::Write(int iThread)
{
static bool bUse = false;
bool bDone = false;
char cStr[20];
int posWrite;// = *m_posWrite; // issue of posWrite being altered on the next entrance
long k = 0;
long m = 0;
m_threadCount++;
while(bDone == false){
if(bUse == false){
bUse = true;
posWrite = *m_posWrite;
memcpy(m_cmMessageCB + posWrite, &m_cmMessageWrite, sizeof(typeCanMessage));
memset(cStr, '\0', 20);
memcpy(cStr, (m_cmMessageCB + posWrite)->cMessage, 11); //fails: every 20
*m_posWrite = *m_posWrite + 1;
if(*m_posWrite == m_iNBufferLength)
*m_posWrite = 0;
bDone = true;
bUse = false;
}else if(bUse == true){
//why are threads being killed ?
// printf("T%d_%d ", iThread, m_threadCount);//non-deterministic value ?
usleep(1);//non-deterministic value
//k++;//delay of a couple clock cycles was not enough
/*
for(k = 0; k < iThread * 100; k++){//deterministic and fails to resolve thread problem
m++;
}
*/
}
}
}
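For comparison, a minimal sketch of the same critical section guarded by an actual mutex instead of the bUse flag. This uses std::mutex from C++11; the member names are the ones from the question, and the rest of the buffer logic is left as in the original, so it is only a sketch of the locking pattern, not a drop-in replacement.
#include <mutex>

static std::mutex g_writeMutex;                        // one lock shared by all writer threads

void CCB::Write(int iThread)
{
    std::lock_guard<std::mutex> lock(g_writeMutex);    // blocks here until no other thread is inside
    m_threadCount++;                                   // iThread is only needed for logging in the original
    int posWrite = *m_posWrite;
    memcpy(m_cmMessageCB + posWrite, &m_cmMessageWrite, sizeof(typeCanMessage));
    *m_posWrite = *m_posWrite + 1;
    if (*m_posWrite == m_iNBufferLength)
        *m_posWrite = 0;
}                                                      // the lock is released automatically here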